q2-dada2's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

q2-dada2's People

Contributors

andrewsanchez, angrybee, benjjneb, chriskeefe, david-rod, dwthomas, ebolyen, epruesse, gregcaporaso, hagenjp, jairideout, jordenrabasco, keegan-evans, leasi, lizgehret, maxvonhippel, mortonjt, oddant1, q2d2, sixvable, thermokarst

q2-dada2's Issues

denoise-paired: add flag for just concatenating paired reads

Improvement Description
Hi there!

The base dada2 R function for this is mergePairs(), which takes a logical argument “justConcatenate” that defaults to FALSE, so to do this in R you just set it to TRUE. But I don’t think that can currently be specified in qiime2’s implementation.

Questions
Would it be difficult to add that capability?

conda installed dada2 is slow

When I run the q2-dada2 workflow (via Rscript, identical input data) on my machine, the processing time is 5x longer when run inside the q2 conda environment than when run in the standard shell (the native installation) using the same 1.4.0 version of the package.

More to come, but my initial suspicion is that no compiler optimizations are performed when bioconda constructs the universal binary version of the package. This could cause such a slowdown, as the two performance-critical parts of the algorithm are written to be auto-vectorized by gcc and llvm rather than being explicitly vectorized.

add description of the plots to the plot-qualities visualization

From @benjjneb: The distribution of quality scores at each position is shown as a grey-scale heat map, with dark colors corresponding to higher frequency. The plotted lines show positional summary statistics: green is the mean, orange is the median, and the dashed orange lines are the 25th and 75th quantiles.

Add truncLen requirements to the denoise-paired help text

For the F/R reads to be successfully merged, trunc-len-f + trunc-len-r must be greater than the length of the amplicon + 20 nucleotides (the 20 nts is the length of the overlap).

This requirement should be reflected in the documentation for denoise-paired.
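
As a rough illustration of the requirement above, the check could be sketched as follows. The function name and the example amplicon length are hypothetical and not part of q2-dada2; the 20 nt overlap is the value quoted in this issue.

```python
def truncation_allows_merge(trunc_len_f, trunc_len_r, amplicon_len, overlap=20):
    """Return True if the truncated reads are long enough to be merged.

    Implements the rule stated in this issue:
    trunc_len_f + trunc_len_r must be greater than amplicon_len + overlap.
    """
    return trunc_len_f + trunc_len_r > amplicon_len + overlap


# Hypothetical 250 nt amplicon: 150 + 150 = 300 > 270, so merging can work.
print(truncation_allows_merge(150, 150, 250))
# 130 + 130 = 260 is not greater than 270, so the reads cannot be merged.
print(truncation_allows_merge(130, 130, 250))
```

A help-text note could state the same inequality in words, so users can check their parameters before starting a long run.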

failure reporting when R script fails

@johnchase is testing the plugin and had the R script fail. Currently stdout and stderr aren't provided to the user, making it impossible to debug. We need better handling of this in general for the framework (qiime2/qiime2#153), but I'm going to stub this in this repo for testing.
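
One possible stub, sketched here under assumptions rather than taken from the plugin: capture the child's stdout and stderr and include them in the raised error. The helper name run_and_report is hypothetical; subprocess.run and its stdout/stderr arguments are standard library.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return its stdout; on failure, raise an error that
    includes the captured stdout and stderr so the user can debug."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode != 0:
        raise RuntimeError(
            "Command %r failed with return code %d\n"
            "--- stdout ---\n%s\n--- stderr ---\n%s"
            % (cmd, result.returncode, result.stdout, result.stderr))
    return result.stdout


if __name__ == "__main__":
    # Demonstration with a child process that fails on purpose.
    try:
        run_and_report([sys.executable, "-c", "import sys; sys.exit('boom')"])
    except RuntimeError as err:
        print(err)
```

The trade-off of capturing output this way is that nothing is shown while the command runs; streaming the output live would need a different approach.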

Plugin installation

Currently the plugin is installing the dada2 R package via bioconda. Is there any reason not to continue with bioconda install going forward?

Right now this relies on a bioconda recipe submitted by @zachcp, and installs the Bioconductor release version (1.0.3). If bioconda is the long-term installation plan, it is probably best to create a plugin-specific recipe.

I think the time to do this would be with the next release of Bioconductor and the dada2 package, which will be in mid October.

@zachcp Any input? Are you interested in helping with a QIIME2-specific bioconda build?

test latest dada2 bioconda package when available

Currently we are using Bioconductor to install the latest version of dada2 (1.2.1). There is work being done to package this release via bioconda; see #13 for previous discussion. When the package is available we'll want to test it with q2-dada2 and update the qiime2 install docs.

cc: @zachcp

dada2 denoise-paired-end run failed with default `n-reads-learn` (i.e., 1,000,000), but not with `n-reads-learn=10000`

It ran for about an hour, and then failed with the following traceback:

$ qiime dada2 denoise-paired --i-demultiplexed-seqs ../demux-paired.qza --p-trunc-len-f 100 --p-trunc-len-r 100 --p-trim-left-f 0 --p-trim-left-r 0 --o-table table-default --o-representative-sequences rep-set-default --p-n-threads 0 --verbose
object 'errR' not found
Execution halted
Traceback (most recent call last):
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/bin/qiime", line 11, in <module>
    load_entry_point('q2cli==2017.2.0.dev0', 'console_scripts', 'qiime')()
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2cli/commands.py", line 215, in __call__
    results = action(**arguments)
  File "<decorator-gen-191>", line 2, in denoise_paired
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime2/sdk/action.py", line 171, in callable_wrapper
    output_types, provenance)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/qiime2/sdk/action.py", line 248, in _callable_executor_
    output_views = callable(**view_args)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 98, in denoise_paired
    return _denoise_helper(cmd, biom_fp, hashed_feature_ids)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 34, in _denoise_helper
    run_commands([cmd])
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/site-packages/q2_dada2/_plot.py", line 26, in run_commands
    subprocess.run(cmd, check=True)
  File "/home/gregcaporaso/.conda/envs/qiime2-dev/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmp3u_qh3d4/forward', '/tmp/tmp3u_qh3d4/reverse', '/tmp/tmp3u_qh3d4/output.tsv.biom', '/tmp/tmp3u_qh3d4/filt_f', '/tmp/tmp3u_qh3d4/filt_r', '100', '100', '0', '0', '2.0', '2', '0', '1000000']' returned non-zero exit status 1

This looks like it's coming from the R code based on this error:

object 'errR' not found

Weirdly, this command completed successfully:

$ qiime dada2 denoise-paired --i-demultiplexed-seqs ../demux-paired.qza --p-trunc-len-f 100 --p-trunc-len-r 100 --p-trim-left-f 0 --p-trim-left-r 0 --o-table table --o-representative-sequences rep-set --p-n-threads 0 --p-n-reads-learn 10000
Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-set.qza

Split filtering into its own command?

Improvement Description
As I understand it, QIIME intends to be more of a push-button pipeline than our R workflows, but I think it would be worth considering separating the filtering in the DADA2 pipeline from the sample processing. It is often useful to filter more than once to see what works well, and I would imagine that there will be other filtering tools that will come online in the QIIME2 ecosystem that people may want to use. It also has a nice effect of reducing the number of parameters/options at each step, and naturally grouping them together.

Downside: Two commands to do what once took one. May need a new semantic type (e.g., FilteredVersionOfPreviousType).

Update plugin to 1.4.0 R package

The new version is available via bioconda (version 1.4.0).

This is a precursor to fixing #56

The new package version allows variable length amplicons, and once available #43 and #52 should be addressed.

Turning chimera checking off breaks the plugin

Need to add assignment seqtab.nochim <- seqtab to the R scripts when the chimera method is "none".

Should also probably add a test checking that the plugin runs w/ the different chimera modes.

qiime dada2 denoise-paired open up options for percent mismatch and overlap length

qiime dada2 denoise-paired has options open for trimming and truncating (as seen below):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 200 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 200 \
  --o-representative-sequences demux-paired-end.qza \
  --p-n-threads 0 \
  --o-table table-1.qza

However, without the ability to adjust the percent mismatch (set to 0 by default) and the overlap length, the command has little use. When trying to join paired ends at 0% mismatch, sequences have to be identical to be retained. This value should be something users can change, and the default should be something more like 40% (perhaps some discussion would be valuable here). The overlap length is set to 20 by default. This could be a decent value (again, some discussion would be valuable), but it should definitely be something users can change. So we need to expose these options on the qiime dada2 denoise-paired command and change the default for percent mismatch.

Add an explicit error for return code -9

Current Behavior
This seems to always be a SIGKILL event, which is almost always caused by an OOM error.

Proposed Behavior
It is probably worth just catching and raising an error about SIGKILL/Memory when it happens.
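
A minimal sketch of what such a check might look like, assuming the return code comes from Python's subprocess module, where a negative value -N means the child was terminated by signal N (POSIX only). The function name explain_exit and the message wording are hypothetical.

```python
import signal


def explain_exit(returncode):
    """Return None on success, otherwise a human-readable message.

    Assumes returncode follows the subprocess convention: a negative
    value -N means the child process was terminated by signal N.
    """
    if returncode == 0:
        return None
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "signal %d" % signum
        message = "DADA2 was terminated by %s." % name
        if signum == 9:
            # Signal 9 is SIGKILL; per this issue, usually an OOM kill.
            message += (" This is almost always because the system ran"
                        " out of memory.")
        return message
    return "DADA2 exited with return code %d." % returncode


print(explain_exit(-9))
```

The plugin could raise an error carrying this message instead of the bare return code.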

Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) : Input must be a valid sequence table.

Hello,
I got the error below at the step of chimera detection.
I wonder if this is that there were no sequences left for chimera detection at the end of that stage.
Running qiime 2 Feb 2018 version.
Let me know if you need any further info

See below for error report:

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmps4hlcnk7/forward /tmp/tmps4hlcnk7/reverse /tmp/tmps4hlcnk7/output.tsv.biom /tmp/tmps4hlcnk7/filt_f /tmp/tmps4hlcnk7/filt_r 260 260 0 0 2.0 2 consensus 1.0 1 1000000

R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0

  1. Filtering ...

  2. Learning Error Rates
    2a) Forward Reads
    Initializing error rates to maximum possible estimate.
    Sample 1 - 2656 reads in 1987 unique sequences.
    Sample 2 - 6060 reads in 2150 unique sequences.
    Sample 3 - 11141 reads in 3185 unique sequences.
    selfConsist step 2
    selfConsist step 3
    selfConsist step 4
    selfConsist step 5
    Convergence after 5 rounds.
    2b) Reverse Reads
    Initializing error rates to maximum possible estimate.
    Sample 1 - 2656 reads in 1987 unique sequences.
    Sample 2 - 6060 reads in 2150 unique sequences.
    Sample 3 - 11141 reads in 3185 unique sequences.
    selfConsist step 2
    selfConsist step 3
    selfConsist step 4
    selfConsist step 5
    Convergence after 5 rounds.

  3. Denoise remaining samples

  4. Remove chimeras (method = consensus)
    Error in isBimeraDenovoTable(unqs[[i]], ..., verbose = verbose) :
    Input must be a valid sequence table.
    Calls: removeBimeraDenovo -> isBimeraDenovoTable
    In addition: Warning message:
    In is.na(colnames(unqs[[i]])) :
    is.na() applied to non-(list or vector) of type 'NULL'
    Execution halted

Add pooling options to Q2 workflows

Improvement Description
Add a new option that allows users to pick independent sample processing (as done currently), pooled sample processing, or "pseudo-pooling" that was added in 1.7.5. It probably makes sense to wait until the R package 1.8 release is available (~June) to add this.

The pooling options provide better detection of rare per-sample variants at the cost of increased computation time.

Also consider making pseudo-pooling the default processing mode.

References
"pseudo-pooling" that was added in 1.7.5

resolve potential bug related to missing filtered samples

edit by @ebolyen for future searching in case we see this again:

Error in open.connection(con, "rb") : cannot open the connection
Calls: derepFastq ... FastqStreamer -> FastqStreamer -> open -> open.connection
In addition: Warning message:
In open.connection(con, "rb") :
  cannot open file '/var/folders/qr/vd5b6lqx18z8zzhc612cwk9r0000gn/T/tmphkndm05o/filt_f/16_S11_L001_R1_001.fastq.gz': No such file or directory
Execution halted

/edit

This came up a couple of times on the forum and is actively being debugged by @ebolyen and @benjjneb. See this topic for details. Creating a placeholder issue to track progress.

Segfault errors in 2017.12 plugin version

Adding this as a tracking issue for a seemingly related group of errors that popped up after the plugin was updated to use the current version of the R package (1.6) in Q2 2017.12

https://forum.qiime2.org/t/dada2-error-caught-segfault/2322
https://forum.qiime2.org/t/dada2-exit-code-11/2333
https://forum.qiime2.org/t/dada2-errors-return-code-1-and-11/2380
https://forum.qiime2.org/t/dada2-11-exit-code-paired-end-reads/2360

In every case, these users are seeing the following:

Plugin error from dada2:
  An error was encountered while running DADA2 in R (return code -11), please inspect stdout and stderr to learn more.

And in the log file...

2) Learning Error Rates
Initializing error rates to maximum possible estimate.
Sample 1 - 19785 reads in 10633 unique sequences.
*** caught segfault ***
address 0x8, cause ‘memory not mapped’

This indicates the segfault error is happening the first time that the dada function in the R package is called, and that some other functions (such as filtering) are working OK.

Also in all cases reported thus far, the affected users are running OSX, although several different versions have been reported so far:

I am using macOS High Sierra Version 10.13.1
I am using OX X Yosemite Version 10.10.5
I have macOS High Sierra Version 10.13.2.
I am running High Sierra 10.13.1.

plot-qualities produces empty data directory

I am following the tutorial instructions provided here.

The plot-qualities command runs without error, but the data/ directory in the output is empty. Consequently, running qiime tools view quality-plots.qzv produces an empty page. I have tested this with both the test files provided and with real data files, so I suspect this is an issue with the code rather than the files.

However, the subsequent commands in the tutorial are functional, e.g., the following produces the expected output:
qiime dada2 denoise --i-demultiplexed-seqs fmt-tutorial-per-sample-fastq-1p.qza --o-table fmt-tutorial-table.qza --p-trim-left 10 --p-trunc-len 130 --o-representative-sequences fmt-tutorial-rep-seqs.qza

Hence, I believe that this issue is isolated to the plot-qualities command.

better error message when truncating past max seq length

When passing --p-trunc-len with a length greater than the length of the input sequences, dada2 denoise (which runs run_dada.R) fails with an uninformative error message. For example, running this command (--p-trunc-len 98) on sequences that have 90 positions:

$ qiime dada2 denoise --i-demultiplexed-seqs 88soils-tutorial-demux-1p.qza --o-representative-sequences rep-seqs --o-table table --p-trim-left 0 --p-trunc-len 98 --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R /var/folders/h3/2l54lt8d69db92g2mlsc_wnw0000gp/T/qiime2-archive-tg33bd_3/52802aaa-e515-4eb6-a193-4466da154458/data /var/folders/h3/2l54lt8d69db92g2mlsc_wnw0000gp/T/tmp83ajfcmt/output.tsv.biom 98 0 /var/folders/h3/2l54lt8d69db92g2mlsc_wnw0000gp/T/tmp83ajfcmt

Loading required package: Rcpp
There were 50 or more warnings (use warnings() to see the first 50)
Initial error matrix unspecified. Error rates will be initialized to the maximum possible estimate from this data.
Error in colSums(trans[paste0(nti, "2", c("A", "C", "G", "T")), ]) :
  'x' must be an array of at least two dimensions
Calls: dada -> errorEstimationFunction -> colSums
Execution halted
Traceback (most recent call last):
  File "/Users/jairideout/miniconda3/envs/qiime2-test/bin/qiime", line 6, in <module>
    sys.exit(q2cli.__main__.qiime())
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/q2cli-0.0.5-py3.5.egg/q2cli/commands.py", line 210, in __call__
  File "<decorator-gen-131>", line 2, in denoise
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/qiime-2.0.5-py3.5.egg/qiime/core/callable.py", line 221, in callable_wrapper
    output_types)
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/qiime-2.0.5-py3.5.egg/qiime/core/callable.py", line 321, in _callable_executor_
    output_views = callable(**view_args)
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/q2_dada2-0.0.5-py3.5.egg/q2_dada2/_denoise.py", line 29, in denoise
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/site-packages/q2_dada2-0.0.5-py3.5.egg/q2_dada2/_plot.py", line 29, in run_commands
  File "/Users/jairideout/miniconda3/envs/qiime2-test/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada.R', '/var/folders/h3/2l54lt8d69db92g2mlsc_wnw0000gp/T/qiime2-archive-tg33bd_3/52802aaa-e515-4eb6-a193-4466da154458/data', '/var/folders/h3/2l54lt8d69db92g2mlsc_wnw0000gp/T/tmp83ajfcmt/output.tsv.biom', '98', '0', '/var/folders/h3/2l54lt8d69db92g2mlsc_wnw0000gp/T/tmp83ajfcmt']' returned non-zero exit status 1

It would be helpful to have a more informative error message. Not sure if this should happen in the R dada2 library itself or in q2-dada2.
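
If the check were done on the q2-dada2 side, one option is to compare the requested truncation length against the input read lengths before calling R. The sketch below assumes the read lengths have already been collected into a list; the function name check_trunc_len is hypothetical. It relies on the DADA2 behavior that reads shorter than the truncation length are discarded.

```python
def check_trunc_len(trunc_len, read_lengths):
    """Raise a clear error if trunc_len would discard every read.

    read_lengths is assumed to be a list of input read lengths. A
    trunc_len of 0 is not treated specially here.
    """
    if not read_lengths:
        raise ValueError("No input reads were provided.")
    longest = max(read_lengths)
    if trunc_len > longest:
        raise ValueError(
            "trunc_len (%d) is longer than the longest input read (%d), "
            "so no reads would pass filtering." % (trunc_len, longest))


# The case from this issue: 90-position reads truncated at 98.
try:
    check_trunc_len(98, [90, 90, 90])
except ValueError as err:
    print(err)
```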

A workaround for q2-dada2 and a local R installation with dada2

This has come up before and we've finally figured out a "workaround". This probably needs to be solved in a deeper way, but to record what works:

R loads its libraries from .libPaths(). On Mac OS X installations with RStudio installed, the RStudio path ends up in .libPaths() as the highest priority. If a user has dada2 installed locally, R will use that copy, which is linked against a different R binary, so you'll get a segfault (if you're lucky).

What you can do is create an .Rprofile file and place:

.libPaths(.libPaths()[2])

This is gross, but it's a start for a better solution.

Upgrading to 2017.12 breaks `q2-dada2`/`dada2`

Under qiime2=2017.12, q2-dada2 does not work (tested on multiple conda environments, both local (macOS) and remote (Ubuntu 16.04 on EC2)). It appears to be missing a GenomeInfoDbData requirement and will fail to load.

Here's an R terminal attempt to load dada2 independent of qiime2 to show the error (same error spit into terminal through q2-dada2):
[Screenshot of the R terminal error; image not preserved]

Unsure if it's something in the q2-dada2 installation that is failing, or if a dependency of dada2 itself has become unreachable? Only relevant issue I could find was this on the forum.


edit: this remains as an issue even after conda env update --file 2017.12 from here

Add chimera filtering options

In my opinion the options related to chimera removal are important to expose to users of the plugin. This is especially true because dada2 implements two kinds of chimera removal (pooled and consensus), and as of our next 1.4 release we will be officially recommending consensus removal for large datasets (default is pooled).

Upgrade to DADA2 >= 1.7.3

As noted here, when that new release lands, we should update the pinned version here, as well as revert the SSE changes made in January 2018 in the wake of SSE-gate.

Add support for variable amplicon lengths

Now that the plugin has been updated to 1.4, variable length amplicons can be supported, but the R script needs to be updated. Should be doable for the next release.

My first thought is to allow trunc-len-f = 0 to be a special value that turns off the truncation length.

Error bubbling or better error documentation

Currently many (possibly most or even all) errors encountered in R do not bubble up to Python. As a result, users get error messages with vaguely mysterious error codes, e.g., "returned non-zero exit code -11" (I've seen this specifically on denoise).

I think we should do one of the following:

  1. Provide clear documentation somewhere of the significance of the error codes

  2. If such docs exist externally (i.e., in the actual dada2 docs), point to them somewhere from our own docs
    ... or,

  3. Bubble those errors up and use a dictionary or something to convert them into useful error messages, e.g.,

    DADA2 returned non-zero exit code -11. This probably means there is a problem with X, Y, or Z. For more, see www.our_awesome_dada2_docs.biz
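
Option 3 could be sketched roughly as below. The dictionary contents are illustrative only: the hint for -9 reflects the SIGKILL/out-of-memory observation and the hint for -11 the segfault reports from other issues in this tracker. The names EXIT_CODE_HINTS and describe_exit_code are hypothetical.

```python
# Illustrative hints only; a real table would need to be curated.
EXIT_CODE_HINTS = {
    -9: "The process was killed (SIGKILL), which is almost always "
        "because the system ran out of memory.",
    -11: "R crashed with a segmentation fault (SIGSEGV). One reported "
         "cause is a locally installed dada2 that is linked against "
         "a different R binary.",
}


def describe_exit_code(code):
    """Turn a DADA2 exit code into a message with a hint, if one is known."""
    base = "DADA2 returned non-zero exit code %d." % code
    hint = EXIT_CODE_HINTS.get(code)
    if hint is None:
        return base + " Please inspect stdout and stderr to learn more."
    return base + " " + hint


print(describe_exit_code(-11))
```

Keeping the hints in one dictionary makes it easy to add entries as new failure modes are reported.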

Learn error rates from a subset of the data

Parameter learning is the most computationally expensive part of the "tutorial" workflow currently implemented. For larger datasets, it is much faster to learn the error rates from a subset of the data, and this achieves essentially the same result. It can substantially reduce computation times for large datasets (2x-7x faster).

Workflow should add an argument, perhaps --nreads-learn with a default of perhaps 1-million, specifying the minimum number of reads used to learn the error rates. Those error rates will then be used to process the entire dataset.
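
One way to read "minimum number of reads" is to keep adding samples until the running total reaches the threshold. The sketch below illustrates that selection in Python under this assumption; it is not the R implementation, and the function name and the (sample_id, read_count) input format are hypothetical.

```python
def samples_for_learning(read_counts, n_reads_learn=1000000):
    """Pick samples, in order, until at least n_reads_learn reads are covered.

    read_counts is assumed to be a list of (sample_id, n_reads) pairs.
    Returns the selected sample ids and the total reads they contain.
    """
    selected = []
    total = 0
    for sample_id, n_reads in read_counts:
        if total >= n_reads_learn:
            break
        selected.append(sample_id)
        total += n_reads
    return selected, total


counts = [("sample1", 2656), ("sample2", 6060), ("sample3", 11141)]
# With a threshold of 8000 reads, the first two samples are enough.
print(samples_for_learning(counts, n_reads_learn=8000))
```

If the dataset holds fewer reads than the threshold, every sample is selected, which matches learning from the full dataset.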

This approach follows that in the DADA2 "big data" workflow: http://benjjneb.github.io/dada2/bigdata.html

Use DADA2 1.4 to fix quality-score precedence bug in parameters

It looks like this is up on bioconda, although it lists both 1.4 and 1.4.0 and I haven't had a chance to figure out the difference.

From a forum post:

There is a small bug in the 1.2 version of the DADA2 package the plugin is using (minQ is enforced before trimming, rather than after) that is preventing p-trim-left-f from working as expected. So you would need to trim off that bad starting base position with another bit of software to get it to work with the current QIIME2 plugin. When the plugin upgrades to the 1.4 version of DADA2 the p-trim-left-f approach should work.

The conda-recipe pins the version to 1.2 so this will need to wait until 2017.5 unless we want to patch.

@benjjneb would it be more appropriate to list our dependency as >= 1.2 (or rather >= 1.4 now) so that just upgrading dada2 fixes these kinds of issues?

Installation documentation issues on mac osx

I am working from the q2-dada2 plugin install instructions provided here, and installing on mac osx 10.11.6. A couple issues came up during installation, and I am posting these as a work-around.

  1. The most recent version of R appears to be incompatible with the dada2 dependency biocgenerics, so version r=3.2.2 needs to be specified in the install. Additionally, a number of other R packages that are dependencies for biocgenerics also need to be installed using the R channel (they are missing in the bioconda channel apparently, so don't install when installing bioconductor-dada2). To solve this, I modified the environment creation command to:
    conda create -n q2-dada2 -c r r=3.2.2 python=3.5 r-bitops r-latticeextra r-data.table r-foreach r-ggplot2 r-gridextra r-rcpp

  2. matplotlib error. When I run qiime --help:
    RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

According to working with matplotlib in virtual environments, this should not be an issue with conda. As matplotlib is installed with pip above, this appears to be the issue, so I uninstalled, then reinstalled with conda to make this work:
pip uninstall matplotlib
conda install matplotlib

Following these fixes, I can confirm that qiime2-dev and q2-dada2 are properly installed:
qiime dada2 --help

Expectations for primer removal in Q2?

DADA2's chimera removal step requires primers to have been removed, as otherwise the ambiguous nucleotides in most primer sets cause large numbers of false-positive chimeras to be identified.

We expect the user to remove primers (if they are in their reads) prior to starting the R DADA2 workflows, and try to make that as clear as possible in our documentation. To what extent is primer removal to be expected in Q2 prior to the "denoise" step? Should a check-and-warn step for potential untrimmed primers be added to the q2-dada2 workflows?

trunc_len=0 can't be used

$ qiime dada2 denoise-single   --i-demultiplexed-seqs demux.qza   --p-trim-left 10   --p-trunc-len 0   --o-representative-sequences rep-seqs.qza   --o-table table.qza

Plugin error from dada2:

  trim_left (10) must be smaller than trunc_len (0)

Debug info has been saved to /var/folders/b6/g3p2lswj2153q21x12mjlfwh0000gn/T/qiime2-q2cli-err-oeeswhd9.log

$ qiime dada2 denoise-single   --i-demultiplexed-seqs demux.qza  --p-trunc-len 0   --o-representative-sequences rep-seqs.qza   --o-table table.qza

Plugin error from dada2:

  trim_left (0) must be smaller than trunc_len (0)

Debug info has been saved to /var/folders/b6/g3p2lswj2153q21x12mjlfwh0000gn/T/qiime2-q2cli-err-8zcjto_1.log.
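
A possible fix, sketched under the assumption that trunc_len = 0 is meant to disable truncation (as proposed in the variable-length amplicon issue), is to skip the comparison in that case. The function name validate_trim_trunc is hypothetical; the error text mirrors the message shown above.

```python
def validate_trim_trunc(trim_left, trunc_len):
    """Validate trim_left against trunc_len.

    Assumes trunc_len == 0 means "do not truncate", in which case
    trim_left is not compared against it.
    """
    if trunc_len != 0 and trim_left >= trunc_len:
        raise ValueError(
            "trim_left (%r) must be smaller than trunc_len (%r)"
            % (trim_left, trunc_len))


# Both commands from this issue would now pass validation.
validate_trim_trunc(10, 0)
validate_trim_trunc(0, 0)
```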

phiX filtering?

@benjjneb, does the DADA2 script in this repo perform phiX filtering? I'm working on some test data, and seeing a lot of phiX in my data set after DADA2.

adding 454/ion torrent support

I think this is just a matter of exposing the parameters described here. I can take this one (either for 2017.10 or 2017.11), and will do some research when I work on it to figure out if it makes sense as part of an existing method or a new one or two.
