Comments (8)

wasade commented on July 17, 2024

is this still an issue?

from deblur.

josenavas commented on July 17, 2024

Right now all the count levels are brought down. I was wondering how it would affect the results if we corrected the frequencies back to the "true" sequences.

wasade commented on July 17, 2024

Closing as this wasn't something we explored with the manuscript that I'm aware of. I think this is possibly a duplicate of #81. Reopen if necessary.

jcmcnch commented on July 17, 2024

Hi deblur devs,

I am interested in the possibility of this issue being reopened, as I believe it could improve deblur by generating more accurate relative abundance data. We've previously avoided deblur in favour of DADA2 for our denoising work because of what we believe is the consequence of this issue. To explain a little further why we think it could be a problem: we recently generated amplicon data from the BioGEOTRACES transects (discussed in this paper) for which we had paired metagenomic / amplicon samples. We tried both q2-dada2 and q2-deblur for denoising and compared the relative abundances of taxa between metagenomic SSU rRNA fragments recovered by phyloFlash and amplicon SSU rRNA. In essence we were just making scatterplots with one axis being MG SSU rRNA and the other amplicon SSU rRNA (some merging had to be done to account for the lower taxonomic resolution afforded by the short metagenomic fragments). For 16S, DADA2 consistently gave more accurate correlations between metagenomic and amplicon relative abundances which was a bit puzzling. Looking into it further, we believe the following is happening:

  1. For some taxa in our samples (e.g. Prochlorococcus) there is high microdiversity such that some true semi-abundant variants are considered to be sequencing noise by deblur.
  2. The sequencing counts for these putative denoising artifacts are not added back to the "parent" sequence, thus reducing the overall abundance of this broader taxonomic group. For taxa that have abundant true variants, this can comprise several percent of the overall # of amplicon reads.
  3. This skews the relative abundances for all taxa, resulting in worse correlations to MG SSU rRNA vs. DADA2.
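As a toy illustration of points 2 and 3, the sketch below compares relative abundances when reads flagged as noise are discarded versus added back to their parent taxon. It is not deblur's actual algorithm, and every taxon count in it is a made-up number chosen only to make the arithmetic easy to follow.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions of the total."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical reads that survive denoising, per taxon.
kept_reads = {"Prochlorococcus": 4000, "SAR11": 4000, "Other": 1000}

# Hypothetical reads from true variants that were flagged as noise.
# Only the microdiverse taxon loses reads in this toy case.
flagged_reads = {"Prochlorococcus": 1000, "SAR11": 0, "Other": 0}

# Option A: flagged reads are simply discarded.
rel_dropped = relative_abundance(kept_reads)

# Option B: flagged reads are added back to their parent taxon.
added_back = {t: kept_reads[t] + flagged_reads[t] for t in kept_reads}
rel_added_back = relative_abundance(added_back)

for taxon in kept_reads:
    print(f"{taxon:16s} dropped={rel_dropped[taxon]:.3f} "
          f"added_back={rel_added_back[taxon]:.3f}")
```

In this toy case the microdiverse taxon falls from 50% to about 44% of reads when flagged reads are discarded, and every other taxon is inflated as a side effect, even though none of their own reads were touched.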

While 1) above could be a problem since it may underestimate true sequence diversity, we would be willing to accept this as a potential tradeoff when dealing with noisy data for which deblur seems to consistently outperform DADA2 in terms of removing sequencing noise. However, to the best of my knowledge, 2 & 3 result in data that no longer accurately reflect the true relative abundances of taxa in the sample. While this might be moot if you're doing some sort of log-transform of your data before further processing, it would be a serious issue if you wanted to back out quantitative copy numbers from amplicon data using e.g. an internal standard.
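To make the internal-standard concern concrete, here is a minimal sketch of the usual scaling (taxon reads multiplied by known standard copies per standard read). The function name `estimate_copies` and all numbers are hypothetical, and the sketch assumes the standard's own reads are unaffected by denoising.

```python
def estimate_copies(taxon_reads, standard_reads, standard_copies_added):
    """Scale taxon reads by the copies-per-read ratio of an internal standard."""
    if standard_reads <= 0:
        raise ValueError("internal standard must have a positive read count")
    return taxon_reads * standard_copies_added / standard_reads


# Hypothetical values: reads recovered for the spiked-in standard and the
# number of standard copies added to the sample.
STANDARD_READS = 500
STANDARD_COPIES_ADDED = 1_000_000

# Taxon reads with flagged variants added back (5000) vs. discarded (4000).
copies_added_back = estimate_copies(5000, STANDARD_READS, STANDARD_COPIES_ADDED)
copies_dropped = estimate_copies(4000, STANDARD_READS, STANDARD_COPIES_ADDED)

# Fraction of the estimate lost by discarding the flagged reads.
shortfall = 1 - copies_dropped / copies_added_back
print(copies_added_back, copies_dropped, shortfall)
```

Under these assumptions the copy-number estimate for the affected taxon drops in direct proportion to the reads removed, and a log-transform downstream would not recover the missing counts.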

Do you have any thoughts on this? Would this be a trivial thing to implement and at least provide an option to the user to allow choice on how these sequences are dealt with?

Happy to provide more information on this if you think it would help. I have plots and ASV tables that I could share as well as raw sequences.

Thanks for your help,
Jesse

wasade commented on July 17, 2024

Hi Jesse,

Thank you for the inquiry. To be honest, I'm unsure if this would be easy or hard to implement or how this would impact benchmarking.

A few follow up questions, if that's alright:

  • How do the correlations with WGS look when performing a naive 100% or 99% closed reference OTU clustering?
  • Given the inherent differences in the molecular protocols, is a higher correlation necessarily more correct?
  • Are there any ground truth observations here (e.g., mocks, simulated data, etc)?

cc @antgonza

Best,
Daniel

jcmcnch commented on July 17, 2024

Hi Daniel,

Thanks for the quick reply and happy to answer the follow-up questions:

  • How do the correlations with WGS look when performing a naive 100% or 99% closed reference OTU clustering?
    We haven't tried this, but my instinct is that it wouldn't change the results. The pipeline I made basically merges the ASVs into something like 95% OTUs anyway, since the short metagenomic reads will often match more than one ASV if they are closely related.
  • Given the inherent differences in the molecular protocols, is a higher correlation necessarily more correct?
    If both looked equally bad then maybe you could argue that it's six of one and half a dozen of the other, but since the correlations look so much nicer with DADA2, I don't think we can say that it's just due to inherent differences. DADA2 amplicons are pretty much spot on vs. the MG.
  • Are there any ground truth observations here (e.g., mocks, simulated data, etc)?
    Sort of. We were working under the assumption that the MG would be a ground truth for the amplicons, since one might naively assume the amplicons are more biased due to PCR. But perhaps someone out there has done a similar test on, say, the Zymo mock communities where they did both MG and amplicon samples. But even then, what would be the ground truth?

So to summarize - I think the issue really boils down to relative abundances being skewed due to the subtraction of putative sequencing noise. Since those subtracted abundances are not added back to the parent, it creates a situation where the quantitative nature of the data is potentially lost. The severity of this issue would probably vary sample-to-sample and may be more acute in some cases versus others and would depend on the properties of the sample.

Thanks again for looking into this, and looking forward to hearing your thoughts.

Best,
Jesse

antgonza commented on July 17, 2024

Thank you both; this is interesting.

I'm not sure exactly what would need to be modified to test this hypothesis; @jcmcnch, do you know? Basically, I'm wondering whether it's something already exposed in the CLI (for example, one of the options in deblur workflow --help or deblur deblur-seqs --help) or something more "code involved".

Anyway, a few more questions:

  1. Are you using the same reference database to assign taxonomy (GG vs Silva) to the fragments produced by DADA2 and deblur? What about the algorithm? I think you used vsearch for the publication vs. qiime feature-classifier classify-sklearn, right?
  2. Do you know the number of different fragments for each protocol? Basically, are you seeing a larger number with DADA2 than with deblur?
  3. Thinking a bit more about 2, what about the length of the sequences? Looking at the paper, it is not clear whether the fwd/rev reads were joined or if you just used the fwd.

amnona commented on July 17, 2024
