There are presently 34 SARS-COV-2 samples on NCBI generated on an Illumina platf

It is not in the raw data: <div class="highlight highlight-text-shell-ses

It does appear once in the second read file: <div class="snippet-clipboard-content

Options: bring MiCall-Lite fork up to date with current MiCall

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

New pipeline is picking the deletion up: <div class="snippet-clipboard-content not

I'm getting our references set up today, <a class="user-mention notranslate" data-hove

Validate MiCall on SARS-COV-2 Illumina data about micall-lite HOT 24 CLOSED

ArtPoon commented on August 19, 2024

Validate MiCall on SARS-COV-2 Illumina data

from micall-lite.

Comments (24)

ArtPoon commented on August 19, 2024

Processed SRR11140746 from NCBI SRA. This is annotated as SARS-CoV-2/2019-nCoV/USA-WI-1/2020, passaged once on Vero E6 cells. I compared the MiCall consensus sequence to hCoV-19/USA/WI1/2020|EPI_ISL_408670 from GISAID. There are a total of 8 differences:

position	MiCall	GISAID
20302	A	-
20303	A	-
20304	G	-
29872	t	A
29878	c	A
29880	-	A
29881	-	A

Here are screen caps of the two regions:

I am not too concerned about the last four because they appear in a low-coverage region (By default, MiCall-Lite reports bases in lower case when coverage is below 100 reads.)

However, the insertion of AAG is worrisome. Note that it is in a tandem repeat. Is this in the raw data?

from micall-lite.

ArtPoon commented on August 19, 2024

It is not in the raw data:

Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGAAGGCTA
Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGGCTA
GGCGGCTATGCTCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTTTCGCTAGCATTTCAGTAGTGCCACCAGCCTTTTTAGTAGGTATAACCACAGCAGTTAAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTTCGAACATATCGT
CATTAATTTGCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCCCCTCTACAGTGTAACCAAATTTAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTT

Meaning that MiCall is making it up. I'm going to have to trace back through the intermediate outputs to figure out where this is coming from.

from micall-lite.

ArtPoon commented on August 19, 2024

It does appear once in the second read file:

Elzar:data artpoon$ gunzip -c SRR11140746_2.fastq.gz | grep AAGAAGAAGGCTA
TTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGCATTGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGCCATCAAATTGGATGACAAAGATCCAAATTTCAAAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAAGGACAAAAAGAAGAAGGCTAATGAAACTCAA

Is MiCall stitching this one insertion into the data?

from micall-lite.

ArtPoon commented on August 19, 2024

The plot thickens - I ran the FASTQ data through bowtie2 once and then generated a consensus sequence with samtools, and it also introduced the extra AAG insertion as well as some weird substitutions upstream:

Also samtools doesn't botch the 3' end:

We are not ready for prime time. 😞

from micall-lite.

ArtPoon commented on August 19, 2024

Options:

bring MiCall-Lite fork up to date with current MiCall - is there a problem that I introduced or something that got fixed by the CfE dev team?
start debugging MiCall-Lite based on this first test case
write a bespoke pipeline for SARS-COV-2

from micall-lite.

ArtPoon commented on August 19, 2024

I got 100% map rate with a single run of bowtie2, local alignment, so I don't think MiCall's iterative approach to mapping is necessary. Will adapt functions from sam2aln.py and aln2counts from cfe-lab/MiCall to write a more efficient consensus sequence-generating script.

from micall-lite.

ArtPoon commented on August 19, 2024

Looked at the SAM file with Tablet, where we can see the deletion relative to the NC_045512 reference genome:

Here's a look at the end of the alignment:

from micall-lite.

ArtPoon commented on August 19, 2024

@donkirkby can you please help me reproduce this issue with cfe-lab/MiCall?
The NCBI SRA accession number is in this issue.

from micall-lite.

ArtPoon commented on August 19, 2024

New pipeline is picking the deletion up:

20295,0,0,0,1247,156,1,{}
20296,1244,1,0,0,155,1,{}
20297,1242,0,0,0,155,4,{}
20298,263,0,0,0,155,981,{}
20299,0,0,0,2,156,1018,{}
20300,0,0,1,0,156,1019,{'G': 1}
20301,1183,0,1,0,157,1,{}
20302,0,0,1186,0,156,1,{}
20303,1185,0,0,0,156,2,{}

Closing issue and tracking any further issues in new repo.

from micall-lite.

donkirkby commented on August 19, 2024

I'm getting our references set up today, @ArtPoon, and I'll let you know what I see when I run this sample.

from micall-lite.

ArtPoon commented on August 19, 2024

Thanks @donkirkby !

from micall-lite.

donkirkby commented on August 19, 2024

OK, I got some results out of MiCall, and I'm starting to look through them. First item: your mystery insertion. I see it, but as a minority variant. It's also in a slightly different position from yours. The MAX consensus shows a deletion at that position. Here's a section of the consensus sequences at different cutoffs from 20290 to 20310:

        v20290...   ...20310v
MAX     CGGTAT---AAAGAAGGCTAT
0.01    CGGTATaaarAAGAAGGCTAT
0.02    CGGTATaaaRAAGAAGGCTAT
0.05    CGGTATaaaRAAGAAGGCTAT
0.10    CGGTATaaaRAAGAAGGCTAT
0.20    CGGTATaaaRAAGAAGGCTAT
0.25    CGGTAT---AAAGAAGGCTAT

The lower case codes represent a mixture of nucleotides and deletions. My guess is that MiCall-lite still reports any nucleotide reads over deletions, even if the deletions are over 75% of the reads as in this case.
What's probably happening is that this sample has a deletion relative to the reference sequence. Then the reads with ends near the deletion get dragged across the gap and fill it in with bad matches. However, I think you said that MiCall was actually inserting that section during the remapping, so I'll check more closely tomorrow. Hope this is a useful start.
Where's the other repo?

from micall-lite.

ArtPoon commented on August 19, 2024

Thanks for checking this out @donkirkby - the new repo is set private for now. I'll publish it soon, I just need time to fix the merge_pairs function that I migrated from sam2aln. I know that my version is a lot slower and the refactored version in cfe MiCall is probably a lot faster, but I also really don't like that code. (It introduces NumPy as another dependency, and the author didn't bother to document any of their code. barf!)

from micall-lite.

ArtPoon commented on August 19, 2024

I apologize @donkirkby, I see that we're the only contributors to that code - but why no inline comments? Throw me a bone! 😄

from micall-lite.

donkirkby commented on August 19, 2024

The programmer that has annoyed me the most throughout my career has always been myself from six months earlier. Would you like to go through the code together tomorrow morning over Google hangouts? I'm free at 10am PDT.

from micall-lite.

ArtPoon commented on August 19, 2024

Fair enough - I've written some real pigs as a perennial amateur.
I'm teaching a bioinformatics lab (ironically) tomorrow so I can't meet, unfortunately. I have some basic ideas of how to make my original code a lot faster though. May pester you for some sage wisdom.

from micall-lite.

donkirkby commented on August 19, 2024

You might not need such drastic changes, @ArtPoon. I think the reason you're not getting a clean deletion is that you're not counting deletions when building the consensus:

    intermed = self.counts.most_common()

    # Remove gaps and low quality reads if there is anything else.
    for i in reversed(range(len(intermed))):
        nuc, _count = intermed[i]
        if nuc in ('N', '-') and len(intermed) > 1:
            intermed.pop(i)

    total_count = sum(self.counts.values())
    mixture = []
    min_count = (intermed[0][1]
                 if mixture_cutoff == MAX_CUTOFF
                 else total_count * mixture_cutoff)
    # filter for nucleotides that pass frequency cutoff
    for nuc, count in intermed:
        if count >= min_count:
            mixture.append(nuc)

Here's the equivalent code in our version:

    mixture = []
    for nuc, count in self.counts.most_common():
        if count < min_count:
            break
        if nuc in self.COUNTED_NUCS:
            mixture.append(nuc)
            if mixture_cutoff == MAX_CUTOFF:
                # Catch any ties before breaking out.
                min_count = count

from micall-lite.

donkirkby commented on August 19, 2024

Where did you get the GISAID consensus sequence? I'd like to compare my results.

from micall-lite.

ArtPoon commented on August 19, 2024

Thanks @donkirkby - I figured I had some legacy code in my branch that had been changed since the fork, but I simply didn't have the bandwidth to search for it this week. I'm pretty much finished the other pipeline anyhow, and I really don't think iterative remapping is necessary, so I'm probably going to ahead with validating my pared down version of MiCall.

I registered for GISAID and then searched for a sequence with a matching descriptor.

from micall-lite.

donkirkby commented on August 19, 2024

I registered for GISAID and found Accession EPI_ISL_408670, but it took me a while to figure out that the descriptions in the SRA abstract for SRR11140746 (SARS-CoV-2/2019-nCoV/USA-WI-1/2020) loosely match the virus name in GISAID for EPI_ISL_408670 (hCoV-19/USA/WI1/2020). Thanks for the clues!

After all that, I can say that MiCall generates a matching consensus to EPI_ISL_408670, for positions from 114 to 29835. The 113 bases before that and the 44 bases after had fewer than 100 reads, and weren't reported.

from micall-lite.

ArtPoon commented on August 19, 2024

Sorry @donkirkby - I didn't mean to obscure the source. I should have provided the EPI identifier.

from micall-lite.

ArtPoon commented on August 19, 2024

Since the primary objective is to obtain a consensus sequence rather than to sample at depth (i.e., to detect minor variants), shouldn't MiCall use a lower reporting threshold?

from micall-lite.

donkirkby commented on August 19, 2024

You did provide the EPI identifier, @ArtPoon. That's the clue I was thanking you for! I was just struggling to figure out how you knew they were the same sample.

I'll discuss the reporting threshold with the team. Thanks for the suggestion!

from micall-lite.

ArtPoon commented on August 19, 2024

Published new repo at http://github.com/PoonLab/sam2conseq

from micall-lite.

Validate MiCall on SARS-COV-2 Illumina data about micall-lite HOT 24 CLOSED

Comments (24)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent