Assign lineages to the samples in the cryptic ~15000 sample VCF. This is important to

Answers to the above discussions Yes, happy. We don't expect

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Assign lineages about head_to_head_pipeline HOT 15 CLOSED

mbhall88 commented on August 17, 2024

Assign lineages

from head_to_head_pipeline.

Comments (15)

iqbal-lab commented on August 17, 2024 1

for your purposes you just need to identify root level lineages (4 not 4.1 or...), so i think this is not relevant to this project. I say fix 4.3 for you as you suggest and ignore 4.10 - delete it from your world. Will not affect whether or not you say a sample is lineage 4

from head_to_head_pipeline.

iqbal-lab commented on August 17, 2024 1

Answers to the above discussions

Yes, happy. We don't expect any 5/7 and 6 lineage 6 matches what Phil found
yes, use those with the ref-lin-pos
include bovis/caprae. Could stick them in the non-1-4 bin so they get used
yes discard unknowns/mixed

from head_to_head_pipeline.

iqbal-lab commented on August 17, 2024

Just been talking through this with Kerri, who's been using the lineage snps to type the samples. Some apparent discrepancies were in fact not. In progress.

from head_to_head_pipeline.

mbhall88 commented on August 17, 2024

Is the code she is using to do the lineage assignment on GitHub somewhere?

from head_to_head_pipeline.

iqbal-lab commented on August 17, 2024

Not written yet. There is 1 defining snps for each lineage and sublineage (attached in a Mykrobe issue FYI (can add here but am on phone now)). All bar one are in our vcf. Algorithm is to check which snps a sample has (checking position, alleles (our vcf has multiallelics at some of these positions) and genotype - genotype has to match the alt allele in the Lineage scheme), and count evidence for each Lineage.

from head_to_head_pipeline.

mbhall88 commented on August 17, 2024

I have the SNP table, so no need to upload. I'll add it into this repo at some point.

I suspect I will be finished splitting H37Rv into chunks tomorrow. Do you have a problem with me making a start on lineage calling based on the Comas panel? As this will be a big blocking issue from tomorrow for me. Obviously happy to work with Kerri on it so we don't double effort.

from head_to_head_pipeline.

iqbal-lab commented on August 17, 2024

Re SNP table, we found one wrong SNP in it (ref base does not match ref at that position - kerri following up)
I of course do not mind you going ahead and lineage calling for yourself - i think i said this in my excessively long message on slack (the one with the preamble about prevalence and AMR). Feel free to contact Kerri, but i don't think this v simple thing (assigning "root"/main lineage - 1,2,3, etc, not really worrying about sublineages) is a big deal if duplicated.

Incorporating these lineages into mykrobe requires more work and testing because
a) it needs to handle nanopore and illumina, in particular sparse data
b) it needs to cope sensibly with mixtures
Martin will be tracking what kerri learns from her analysis and that will feed into the mykrobe implementation. None of that is your problem.

For your use case, picking illumina samples for a PRG, i don't think either of a) and b) are a real concern - you can always discard a sample if the lineage results look weird.

from head_to_head_pipeline.

mbhall88 commented on August 17, 2024

Re SNP table, we found one wrong SNP in it (ref base does not match ref at that position - kerri following up)

Using the SNPs that Alvaro and Inaki gave us I find two where the reference allele does not match the base at that position in H37Rv.

Lineage 4.3

lineage              lineage4.3
position                1480024
gene_coord                  NaN
allele_change               C/T
codon_number                541
codon_change                NaN
amino_acid_change           NaN
locus_id                 Rv1318
gene_name                     -

H37Rv base at that position: G

When I dig into the source that they got their L4-defining SNPs from I find that the position is correct - 1480024 - and the allele change is actually G/T. So it seems they entered the ref allele incorrectly in the version of the spreadsheet they sent us. If we change the ref allele from C->G in our sheet, then everything checks out for this sublineage.

Lineage 4.10

lineage              lineage4.10
position                 1692141
gene_coord                   NaN
allele_change                C/A
codon_number                 NaN
codon_change                 NaN
amino_acid_change            NaN
locus_id                     NaN
gene_name                    NaN

H37Rv base at that position: A

The position is correct for this one and according to Supp. Table 2 the ref and mutant alleles are the same as in the Comas spreadsheet C->A. H37Rv has an A at this position. I guess this is the discrepancy that Kerri found, but I just wanted to document it.

What is the best way to rectify this?

from head_to_head_pipeline.

mbhall88 commented on August 17, 2024

@iqbal-lab we discussed these questions on our call but I am likely to forget the answers quickly so would prefer them written somewhere (remember this is in the context of assigning lineages):

Do we count a panel variant if a sample is HET for it, or do we only include samples that are HOM for the relevant ALT?
What proportion of "minor" lineages do we want to allow before we abandon assigning a lineage for a sample. i.e. if we have a variant for lineage 2 and 3 for lineage 1, do we abandon, or say lineage 1?
If a sample has no panel variants, what do we classify that samples as, lineage-wise? Or do we abandon it?

from head_to_head_pipeline.

iqbal-lab commented on August 17, 2024

Replies in order below. I'll reply in the context that I want to discard samples if they have uncertain lineage or mixed lineages

Count a het, but don't allow more than 1 het
i propose allow max 1 SNP on a secondary lineage - if more than that, mark it as unknown, and count how many we discard, in case that chucks out too many?
abandon/mark as unknown - useless to us.

from head_to_head_pipeline.

mbhall88 commented on August 17, 2024

The lineage assignment script is complete in 4041941

I have validated it on the Comas samples using the truth assignments and assignments from the script and this script.

$ python scripts/assign_lineages.py -i vcfs/cryptic_release.2020-04.comas.vcf \
    -p resources/snps_for_typing.csv -o resources/comas.lineages.csv \
    --max-het 1 --max-alt-lineages 1
2020-06-05 15:29:50,634 [INFO]: Loading the panel...
2020-06-05 15:29:50,636 [INFO]: Loaded panel with 66 variants successfully.
2020-06-05 15:29:50,637 [INFO]: Searching VCF for lineage-defining variants...
2020-06-05 15:30:03,134 [INFO]: Calling lineages for samples based on found lineage variants...
2020-06-05 15:30:03,135 [INFO]: All done!
$ python scripts/validate_lineage_assignment.py resources/comas.truth.lineages.tsv resources/comas.lineages.csv
N0004: [PASS]
N0031: [PASS]
N0052: [PASS]
N0054: [PASS]
N0072: [PASS]
N0091: [PASS]
N0136: [PASS]
N0145: [PASS]
N0153: [PASS]
N0155: [PASS]
N0157: [PASS]
N1176: [PASS]
N1177: [PASS]
N1202: [PASS]
N1216: [PASS]
N1272: [PASS]
N1283: [PASS]

from head_to_head_pipeline.

mbhall88 commented on August 17, 2024

Lineage assignment of the cryptic VCF took 2 hours 40 mins on a single core and max. 592Mb RAM.

The version of the script used was from the bug fix in c246ee1

The overall results are

   1101 1
   5518 2
   1826 3
   5303 4
      6 6
     15 Bovis
      1 Caprae
      2 mixed
   1439 unknown

The classifications for the two mixed samples are

sample,major_lineage,full_lineage,found_lineages
site.04.iso.1.subject.04432.lab_id.832714.seq_reps.1,mixed,mixed,3;2.2.6_RD150
site.04.iso.1.subject.00816.lab_id.717577.seq_reps.1,mixed,mixed,3;2.2.6_RD150

I also ran another version of the script [16ee257] where I implemented the ability to specify a position that describes the lineage of the VCF reference sequence (i.e. H37Rv). This off the back of the email chain with Alvaro and Inaki discussing the discrepancy with Lineage 4.10 (as mentioned above #11 (comment)). The reason for that discrepancy is they were using an ancestral Mtb sequence instead of H37Rv for this panel. Long story short, the only position that is affected by this is the Lineage 4.10 position 1692141. In 16ee257 I add the parameter

--ref-lineage-position INTEGER  Variant position that defines the lineage of
                                  the VCF reference. When classifying lineages
                                  a REF call at this position will be
                                  considered a call for the reference lineage
                                  as opposed to other variants which require
                                  an ALT call. Set to 0 to disable this
                                  option.  [default: 0]

By setting this parameter to 1692141, when I get to that position in the VCF I check which samples have a REF call, rather than an ALT call, and then add a call for lineage 4.10 for those samples.

Running the lineage classification with this script variation, I get the following assignments.

   1101 1
   5518 2
   1826 3
   6188 4
      6 6
     15 Bovis
      1 Caprae
      2 mixed
    554 unknown

Which moves 885 samples from unknown to lineage 4.

Discussion

Are we happy with these assignments? I am just a little worried there are only six L6 samples and no L5/7 samples (or is that what we expected?).
Shall we use the calls from the version of the script with the --ref-lineage-position parameter or not?
Do we include or ignore the Bovis and Caprae samples?
I assume we discard the unknowns and mixed? Or do we want to dig into the unknowns more?

from head_to_head_pipeline.

iqbal-lab commented on August 17, 2024

@mbhall88 - any chance you could put your lineage assignments somewhere? I might later on use them for SNPIT, and also might be useful for Kerri

from head_to_head_pipeline.

mbhall88 commented on August 17, 2024

Lineage assignments added in 279a640

from head_to_head_pipeline.

iqbal-lab commented on August 17, 2024

cheers

from head_to_head_pipeline.

Assign lineages about head_to_head_pipeline HOT 15 CLOSED

Comments (15)

Answers to the above discussions

Lineage 4.3

Lineage 4.10

Discussion

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent