Giter Site home page Giter Site logo

Comments (24)

ArtPoon avatar ArtPoon commented on August 19, 2024

Processed SRR11140746 from NCBI SRA. This is annotated as SARS-CoV-2/2019-nCoV/USA-WI-1/2020, passaged once on Vero E6 cells. I compared the MiCall consensus sequence to hCoV-19/USA/WI1/2020|EPI_ISL_408670 from GISAID. There are a total of 8 differences:

position MiCall GISAID
20302 A -
20303 A -
20304 G -
29872 t A
29878 c A
29880 - A
29881 - A

Here are screen caps of the two regions:
problem1

problem2

I am not too concerned about the last four because they appear in a low-coverage region (By default, MiCall-Lite reports bases in lower case when coverage is below 100 reads.)

However, the insertion of AAG is worrisome. Note that it is in a tandem repeat. Is this in the raw data?

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

It is not in the raw data:

Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGAAGGCTA
Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGGCTA
GGCGGCTATGCTCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTTTCGCTAGCATTTCAGTAGTGCCACCAGCCTTTTTAGTAGGTATAACCACAGCAGTTAAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTTCGAACATATCGT
CATTAATTTGCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCCCCTCTACAGTGTAACCAAATTTAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTT

Meaning that MiCall is making it up. I'm going to have to trace back through the intermediate outputs to figure out where this is coming from.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

It does appear once in the second read file:

Elzar:data artpoon$ gunzip -c SRR11140746_2.fastq.gz | grep AAGAAGAAGGCTA
TTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGCATTGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGCCATCAAATTGGATGACAAAGATCCAAATTTCAAAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAAGGACAAAAAGAAGAAGGCTAATGAAACTCAA

Is MiCall stitching this one insertion into the data?

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

The plot thickens - I ran the FASTQ data through bowtie2 once and then generated a consensus sequence with samtools, and it also introduced the extra AAG insertion as well as some weird substitutions upstream:
problem3

Also samtools doesn't botch the 3' end:
problem4

We are not ready for prime time. 😞

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Options:

  • bring MiCall-Lite fork up to date with current MiCall - is there a problem that I introduced or something that got fixed by the CfE dev team?
  • start debugging MiCall-Lite based on this first test case
  • write a bespoke pipeline for SARS-COV-2

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

I got 100% map rate with a single run of bowtie2, local alignment, so I don't think MiCall's iterative approach to mapping is necessary. Will adapt functions from sam2aln.py and aln2counts from cfe-lab/MiCall to write a more efficient consensus sequence-generating script.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Looked at the SAM file with Tablet, where we can see the deletion relative to the NC_045512 reference genome:
problem5

Here's a look at the end of the alignment:
problem6

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

@donkirkby can you please help me reproduce this issue with cfe-lab/MiCall?
The NCBI SRA accession number is in this issue.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

New pipeline is picking the deletion up:

20295,0,0,0,1247,156,1,{}
20296,1244,1,0,0,155,1,{}
20297,1242,0,0,0,155,4,{}
20298,263,0,0,0,155,981,{}
20299,0,0,0,2,156,1018,{}
20300,0,0,1,0,156,1019,{'G': 1}
20301,1183,0,1,0,157,1,{}
20302,0,0,1186,0,156,1,{}
20303,1185,0,0,0,156,2,{}

Closing issue and tracking any further issues in new repo.

from micall-lite.

donkirkby avatar donkirkby commented on August 19, 2024

I'm getting our references set up today, @ArtPoon, and I'll let you know what I see when I run this sample.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Thanks @donkirkby !

from micall-lite.

donkirkby avatar donkirkby commented on August 19, 2024

OK, I got some results out of MiCall, and I'm starting to look through them. First item: your mystery insertion. I see it, but as a minority variant. It's also in a slightly different position from yours. The MAX consensus shows a deletion at that position. Here's a section of the consensus sequences at different cutoffs from 20290 to 20310:

        v20290...   ...20310v
MAX     CGGTAT---AAAGAAGGCTAT
0.01    CGGTATaaarAAGAAGGCTAT
0.02    CGGTATaaaRAAGAAGGCTAT
0.05    CGGTATaaaRAAGAAGGCTAT
0.10    CGGTATaaaRAAGAAGGCTAT
0.20    CGGTATaaaRAAGAAGGCTAT
0.25    CGGTAT---AAAGAAGGCTAT

The lower case codes represent a mixture of nucleotides and deletions. My guess is that MiCall-lite still reports any nucleotide reads over deletions, even if the deletions are over 75% of the reads as in this case.
What's probably happening is that this sample has a deletion relative to the reference sequence. Then the reads with ends near the deletion get dragged across the gap and fill it in with bad matches. However, I think you said that MiCall was actually inserting that section during the remapping, so I'll check more closely tomorrow. Hope this is a useful start.
Where's the other repo?

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Thanks for checking this out @donkirkby - the new repo is set private for now. I'll publish it soon, I just need time to fix the merge_pairs function that I migrated from sam2aln. I know that my version is a lot slower and the refactored version in cfe MiCall is probably a lot faster, but I also really don't like that code. (It introduces NumPy as another dependency, and the author didn't bother to document any of their code. barf!)

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

I apologize @donkirkby, I see that we're the only contributors to that code - but why no inline comments? Throw me a bone! 😄

from micall-lite.

donkirkby avatar donkirkby commented on August 19, 2024

The programmer that has annoyed me the most throughout my career has always been myself from six months earlier. Would you like to go through the code together tomorrow morning over Google hangouts? I'm free at 10am PDT.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Fair enough - I've written some real pigs as a perennial amateur.
I'm teaching a bioinformatics lab (ironically) tomorrow so I can't meet, unfortunately. I have some basic ideas of how to make my original code a lot faster though. May pester you for some sage wisdom.

from micall-lite.

donkirkby avatar donkirkby commented on August 19, 2024

You might not need such drastic changes, @ArtPoon. I think the reason you're not getting a clean deletion is that you're not counting deletions when building the consensus:

    intermed = self.counts.most_common()

    # Remove gaps and low quality reads if there is anything else.
    for i in reversed(range(len(intermed))):
        nuc, _count = intermed[i]
        if nuc in ('N', '-') and len(intermed) > 1:
            intermed.pop(i)

    total_count = sum(self.counts.values())
    mixture = []
    min_count = (intermed[0][1]
                 if mixture_cutoff == MAX_CUTOFF
                 else total_count * mixture_cutoff)
    # filter for nucleotides that pass frequency cutoff
    for nuc, count in intermed:
        if count >= min_count:
            mixture.append(nuc)

Here's the equivalent code in our version:

    mixture = []
    for nuc, count in self.counts.most_common():
        if count < min_count:
            break
        if nuc in self.COUNTED_NUCS:
            mixture.append(nuc)
            if mixture_cutoff == MAX_CUTOFF:
                # Catch any ties before breaking out.
                min_count = count

from micall-lite.

donkirkby avatar donkirkby commented on August 19, 2024

Where did you get the GISAID consensus sequence? I'd like to compare my results.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Thanks @donkirkby - I figured I had some legacy code in my branch that had been changed since the fork, but I simply didn't have the bandwidth to search for it this week. I'm pretty much finished the other pipeline anyhow, and I really don't think iterative remapping is necessary, so I'm probably going to ahead with validating my pared down version of MiCall.

I registered for GISAID and then searched for a sequence with a matching descriptor.

from micall-lite.

donkirkby avatar donkirkby commented on August 19, 2024

I registered for GISAID and found Accession EPI_ISL_408670, but it took me a while to figure out that the descriptions in the SRA abstract for SRR11140746 (SARS-CoV-2/2019-nCoV/USA-WI-1/2020) loosely match the virus name in GISAID for EPI_ISL_408670 (hCoV-19/USA/WI1/2020). Thanks for the clues!

After all that, I can say that MiCall generates a matching consensus to EPI_ISL_408670, for positions from 114 to 29835. The 113 bases before that and the 44 bases after had fewer than 100 reads, and weren't reported.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Sorry @donkirkby - I didn't mean to obscure the source. I should have provided the EPI identifier.

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Since the primary objective is to obtain a consensus sequence rather than to sample at depth (i.e., to detect minor variants), shouldn't MiCall use a lower reporting threshold?

from micall-lite.

donkirkby avatar donkirkby commented on August 19, 2024

You did provide the EPI identifier, @ArtPoon. That's the clue I was thanking you for! I was just struggling to figure out how you knew they were the same sample.

I'll discuss the reporting threshold with the team. Thanks for the suggestion!

from micall-lite.

ArtPoon avatar ArtPoon commented on August 19, 2024

Published new repo at http://github.com/PoonLab/sam2conseq

from micall-lite.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.