Comments (8)
is this still an issue?
from deblur.
Right now all the count levels are brought down. I was wondering how the results would be affected if we corrected the frequencies back to the "true" sequences.
Closing as this wasn't something we explored with the manuscript that I'm aware of. I think this is possibly a duplicate of #81. Reopen if necessary.
Hi deblur devs,
I am interested in the possibility of this issue being reopened, as I believe it could improve deblur by generating more accurate relative abundance data. We've previously avoided deblur in favour of DADA2 for our denoising work because of what we believe is the consequence of this issue. To explain a little further why we think it could be a problem: we recently generated amplicon data from the BioGEOTRACES transects (discussed in this paper) for which we had paired metagenomic / amplicon samples. We tried both q2-dada2 and q2-deblur for denoising and compared the relative abundances of taxa between metagenomic SSU rRNA fragments recovered by phyloFlash and amplicon SSU rRNA. In essence we were just making scatterplots with one axis being MG SSU rRNA and the other amplicon SSU rRNA (some merging had to be done to account for the lower taxonomic resolution afforded by the short metagenomic fragments). For 16S, DADA2 consistently gave more accurate correlations between metagenomic and amplicon relative abundances, which was a bit puzzling. Looking into it further, we believe the following is happening:
1. For some taxa in our samples (e.g. Prochlorococcus) there is high microdiversity, such that some true semi-abundant variants are considered to be sequencing noise by deblur.
2. The sequencing counts for these putative denoising artifacts are not added back to the "parent" sequence, thus reducing the overall abundance of this broader taxonomic group. For taxa that have abundant true variants, this can comprise several percent of the overall number of amplicon reads.
3. This skews the relative abundances for all taxa, resulting in worse correlations to MG SSU rRNA vs. DADA2.
While 1) above could be a problem since it may underestimate true sequence diversity, we would be willing to accept this as a potential tradeoff when dealing with noisy data, for which deblur seems to consistently outperform DADA2 in terms of removing sequencing noise. However, to the best of my knowledge, 2 & 3 result in data that no longer accurately reflect the true relative abundances of taxa in the sample. While this might be moot if you're doing some sort of log-transform of your data before further processing, it would be a serious issue if you wanted to back out quantitative copy numbers from amplicon data using e.g. an internal standard.
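To make points 2 and 3 concrete, here is a minimal sketch of the arithmetic. It is not deblur's actual code: the sequence names, the read counts, and the `parent_of` mapping are all made up for illustration, and the `denoise` helper is hypothetical. It only shows how discarding reads flagged as noise, rather than adding them back to a parent, shifts the relative abundance of every taxon.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw read counts to fractions that sum to 1."""
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def denoise(counts: Dict[str, int], parent_of: Dict[str, str],
            add_back: bool) -> Dict[str, int]:
    """Toy denoiser: sequences listed in parent_of are treated as noise.

    If add_back is True their reads are credited to the parent sequence,
    otherwise the reads are simply discarded.
    """
    out: Dict[str, int] = {}
    for seq, n in counts.items():
        if seq in parent_of:
            if add_back:
                parent = parent_of[seq]
                out[parent] = out.get(parent, 0) + n
            # else: the reads are dropped from the table entirely
        else:
            out[seq] = out.get(seq, 0) + n
    return out


# Hypothetical read counts (made-up numbers, not from any real sample).
counts = {"Pro_A": 400, "Pro_B": 100, "SAR11": 300, "Other": 200}
# Hypothetical: Pro_B is a true variant that gets flagged as noise of Pro_A.
parent_of = {"Pro_B": "Pro_A"}

rel_dropped = relative_abundance(denoise(counts, parent_of, add_back=False))
rel_added = relative_abundance(denoise(counts, parent_of, add_back=True))

for seq in rel_dropped:
    print(f"{seq}: dropped={rel_dropped[seq]:.3f} added_back={rel_added[seq]:.3f}")
```

In this toy case the Prochlorococcus fraction falls from 0.5 to about 0.44 when the flagged reads are dropped, and the fractions of the untouched taxa rise correspondingly, even though their counts never changed.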
Do you have any thoughts on this? Would this be a trivial thing to implement and at least provide an option to the user to allow choice on how these sequences are dealt with?
Happy to provide more information on this if you think it would help. I have plots and ASV tables that I could share as well as raw sequences.
Thanks for your help,
Jesse
Hi Jesse,
Thank you for the inquiry. To be honest, I'm unsure if this would be easy or hard to implement or how this would impact benchmarking.
A few follow up questions, if that's alright:
- How do the correlations with WGS look when performing a naive 100% or 99% closed reference OTU clustering?
- Given the inherent differences in the molecular protocols, is a higher correlation necessarily more correct?
- Are there any ground truth observations here (e.g., mocks, simulated data, etc)?
cc @antgonza
Best,
Daniel
Hi Daniel,
Thanks for the quick reply and happy to answer the follow-up questions:
- How do the correlations with WGS look when performing a naive 100% or 99% closed reference OTU clustering?
  - We haven't tried this, but my instinct is that it wouldn't change the results. The pipeline I made basically merges the ASVs into something like 95% OTUs anyway, since the short metagenomic reads will often match more than one ASV if they are closely related.
- Given the inherent differences in the molecular protocols, is a higher correlation necessarily more correct?
  - If both looked equally bad then maybe you could argue that it's six of one and half a dozen of the other, but since the correlations look so much nicer with DADA2, I don't think we can say that it's just due to inherent differences. DADA2 amplicons are pretty much spot on vs the MG.
- Are there any ground truth observations here (e.g., mocks, simulated data, etc)?
  - Sort of - we were working under the assumption that the MG would be a ground truth for amplicons, since one might naively assume amplicons are more biased due to PCR. But perhaps someone out there has done a similar test on, say, the Zymo mock communities where they did both MG and amplicon samples. But even then, what would be the ground truth?
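For readers who want to reproduce this kind of comparison, a rough sketch follows. It is not the pipeline used here: the ASV names, the counts, the metagenomic fractions, and the `group_of` lookup table (standing in for the merging of closely related ASVs) are all hypothetical. It collapses an ASV table to coarser groups, converts it to relative abundances, and computes a Pearson correlation against the metagenomic values.

```python
import math
from typing import Dict, List


def collapse(asv_counts: Dict[str, int], group_of: Dict[str, str]) -> Dict[str, int]:
    """Sum ASV counts into coarser groups using a lookup table."""
    grouped: Dict[str, int] = {}
    for asv, n in asv_counts.items():
        group = group_of[asv]
        grouped[group] = grouped.get(group, 0) + n
    return grouped


def to_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert counts to relative abundances that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def pearson(x: List[float], y: List[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical amplicon ASV counts and their coarser groups (made-up values).
asv_counts = {"asv1": 350, "asv2": 150, "asv3": 300, "asv4": 200}
group_of = {"asv1": "Prochlorococcus", "asv2": "Prochlorococcus",
            "asv3": "SAR11", "asv4": "Other"}
# Hypothetical metagenomic relative abundances for the same groups.
mg = {"Prochlorococcus": 0.5, "SAR11": 0.3, "Other": 0.2}

collapsed = collapse(asv_counts, group_of)
amplicon = to_fractions(collapsed)
groups = sorted(mg)
r = pearson([mg[g] for g in groups], [amplicon[g] for g in groups])
print(f"Pearson r between MG and amplicon relative abundances: {r:.3f}")
```

Running the same calculation on tables produced with and without the flagged reads credited to their parents would be one way to check whether the add-back step accounts for the difference in correlation.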
So to summarize - I think the issue really boils down to relative abundances being skewed due to the subtraction of putative sequencing noise. Since those subtracted abundances are not added back to the parent, it creates a situation where the quantitative nature of the data is potentially lost. The severity of this issue would probably vary sample-to-sample and may be more acute in some cases versus others and would depend on the properties of the sample.
Thanks again for looking into this, and looking forward to hearing your thoughts.
Best,
Jesse
Thank you both; this is interesting.
I'm not sure exactly what would need to be modified to test this hypothesis; @jcmcnch, do you know? Basically, I'm wondering if it's something already exposed in the CLI, for example one of the options in deblur workflow --help or deblur deblur-seqs --help, or something more "code involved".
Anyway, a few more questions:
- Are you using the same reference database to assign taxonomy (GG vs Silva) to the fragments produced by DADA2 and deblur? What about the algorithm? I think you used vsearch for the publication vs. qiime feature-classifier classify-sklearn, right?
- Do you know the number of different fragments for each protocol? Basically, are you seeing a larger number with DADA2 than deblur?
- Thinking a bit more about 2, what about the length of the sequences? Looking at the paper, it is not clear whether the fwd/rev reads were joined or if you just used the fwd.