bokulich-lab / q2-assembly
QIIME 2 plugin for (meta)genome assembly.
License: BSD 3-Clause "New" or "Revised" License
If no contigs are formed for any samples during assembly, and a SampleData[Contigs] with some .fa files of size zero is therefore passed as input to index-contigs, index-contigs fails with a fairly uninformative error message:
An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.
The --verbose output was more useful, but still only "warned" about an empty fasta file:
Input files DNA, FASTA:
/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa
Warning: Empty fasta file: '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/bin/bowtie2-build-s --wrapper basic-0 --bmaxdivn 4 --dcv 1024 --offrate 5 --ftabchars 10 --threads 40 /scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa /scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index
Traceback (most recent call last):
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 50, in _index_seqs
run_command(cmd, verbose=True)
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/_utils.py", line 28, in run_command
subprocess.run(cmd, check=True)
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['bowtie2-build', '--bmaxdivn', '4', '--dcv', '1024', '--offrate', '5', '--ftabchars', '10', '--threads', '40', '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa', '/scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/q2cli/commands.py", line 468, in __call__
results = action(**arguments)
File "<decorator-gen-736>", line 2, in index_contigs
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 274, in bound_callable
outputs = self._callable_executor_(
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 509, in _callable_executor_
output_views = self._callable(**view_args)
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 85, in index_contigs
_index_seqs(contig_fps, str(result), common_args, "contigs")
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 52, in _index_seqs
raise Exception(
Exception: An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.
Plugin error from assembly:
An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.
See above for debug info.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
I came across this because I had a couple of control samples which had very few (<10) demultiplexed sequences in my input to assemble-megahit, and these unsurprisingly didn't form any contigs. When I ran index-contigs I got the error.
I'm not sure what the best pathway forward is for this. At the very least we probably want a more informative error message, but we also might want a way to filter the SampleData[Contigs] so the user doesn't have to generate contigs again (which can take a while). I got around it this time by using qiime demux filter to drop the two problematic samples from my input to assemble-megahit.
EDIT: I just hit this again, on a different data set. (Aug 21 2023)
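One way the more informative error could look is sketched below: check for zero-size contig files before any external tool is called and name the affected samples. This is only a sketch, not the plugin's code; the function names and the sample-ID-to-path mapping are hypothetical.

```python
import os
from typing import Dict, List


def find_empty_contig_files(contig_fps: Dict[str, str]) -> List[str]:
    """Return IDs of samples whose contig FASTA file is missing or empty.

    `contig_fps` maps a sample ID to the path of its contigs file
    (a hypothetical structure chosen for this sketch).
    """
    return sorted(
        sample_id
        for sample_id, fp in contig_fps.items()
        if not os.path.isfile(fp) or os.path.getsize(fp) == 0
    )


def check_contigs_not_empty(contig_fps: Dict[str, str]) -> None:
    """Raise a descriptive error instead of letting the indexer fail."""
    empty = find_empty_contig_files(contig_fps)
    if empty:
        raise ValueError(
            "No contigs were found for the following sample(s): "
            f"{', '.join(empty)}. Please remove these samples before "
            "indexing."
        )
```

Calling such a check at the top of the indexing action would surface the sample names directly, without requiring --verbose.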
It seems that QUAST generates many more visualizations than the ones that are currently displayed in the visualization produced by evaluate-contigs (most, if not all, of them are generated based on alignments to reference sequences, either provided by the user - not yet supported, see #35 - or fetched by QUAST automatically). Some of the interesting ones include:
The new version relaxes its dependency requirements by allowing new versions of pysam and biopython - it'll make it easier to solve our future environments.
Describe the bug
When I try to run the generate-reads action, I get an error stating that the pysam module is missing.
To Reproduce
Steps to reproduce the behavior:
mamba env create -n q2-shotgun --file https://data.qiime2.org/distro/shotgun/qiime2-shotgun-2023.9-py38-linux-conda.yml
conda activate q2-shotgun
qiime dev refresh-cache
qiime assembly generate-reads \
--i-genomes genomes.qza \
--p-sample-names sample1 sample2 sample3 sample4 \
--p-n-reads 2000000 \
--p-abundance uniform \
--p-n-genomes 5 \
--p-cpus 10 \
--output-dir reads \
--verbose
Template genome sequences were provided - "n-genomes-ncbi" and "ncbi" parameters will be ignored.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: iss generate --compress --genomes /scratch/mziemski/tmp/qiime2/mziemski/data/0619c364-72ba-492a-9d8f-7043e825bf15/data/dna-sequences.fasta --n_genomes 5 --abundance uniform --n_reads 2000000 --mode kde --model HiSeq --cpus 10 --output /scratch/mziemski/tmp/tmpowd79ql2/sample1_00_L001
Traceback (most recent call last):
File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/bin/iss", line 6, in <module>
from iss.app import main
File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/app.py", line 4, in <module>
from iss import bam
File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/bam.py", line 14, in <module>
import pysam
ModuleNotFoundError: No module named 'pysam'
Expected behavior
The action runs without errors.
Please complete the following information:
Additional context
This happened on both Linux and macOS. I used the 2023.9 distro - for some reason the pysam package is not included there. Before, when I installed q2-assembly "directly" from our conda channel, it all seemed to work fine. @lizgehret, @ebolyen would you have an idea why that could be happening? 🤔
I am testing with samples that have underscores in their identifiers, and am running into a failure due to an assumption that only the text before the _ is the sample identifier. This looks to be traceable to this line.
Note that QIIME 2 does allow for underscores in identifiers (see the documentation here).
Is there a different way to get the sample ids from the filenames in this case? Based on a quick look at the files in the data artifact, it looks like you could change line 113 to:
return os.path.basename(fp.replace('_contigs.fa', ''))
I recommend adding a test of this function that includes a sample id with an underscore in it.
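A minimal sketch of that idea, together with the kind of test recommended above, might look as follows. The function name is hypothetical and the `_contigs.fa` suffix is taken from the file names shown in this thread. Unlike a plain `replace`, this version strips the suffix only from the end of the file name.

```python
import os

CONTIG_SUFFIX = "_contigs.fa"  # suffix as seen in the file names above


def sample_id_from_contigs_fp(fp: str) -> str:
    """Derive the sample ID from a contigs file path.

    Only the trailing suffix is removed, so underscores inside the
    sample ID itself are preserved.
    """
    name = os.path.basename(fp)
    if not name.endswith(CONTIG_SUFFIX):
        raise ValueError(f"Unexpected contigs file name: {name}")
    return name[: -len(CONTIG_SUFFIX)]


def test_sample_id_with_underscores():
    fp = "/some/dir/sample_1_rep_2_contigs.fa"
    assert sample_id_from_contigs_fp(fp) == "sample_1_rep_2"
```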
Inside of index_mags here, mags (of type MultiMAGSequencesDirFmt) does not have the expected manifest attached to it. E.g. adding mags.validate() in this function causes tests to fail.
It also doesn't seem like there's a way to generate a manifest in the way that we do e.g. for CasavaOneEightSingleLanePerSampleDirFmt in q2-types, from which some of the parent classes here are borrowed.
Hello,
I tried to run qiime assembly generate-reads --output-dir test_data and was expecting the sampled genome data to be placed in a new dir test_data. Instead, I got
Plugin error from assembly:
'NoneType' object is not iterable
It seems that ncbi, n-genomes-ncbi, and sample-names are needed for this to run.
Thanks!
When I open the attached visualization, I am unable to download any attached images.
Steps to reproduce:
Run qiime assembly evaluate-contigs with the default parameters (without providing reads as input) and the contigs.qza artifact provided in the attached files.
Expected behaviour:
Plots get downloaded.
Actual behaviour:
An error message is displayed:
Attached files: https://polybox.ethz.ch/index.php/s/0MwYFnTJS5M1k8p/download
Only .fa contig files are collected but .fasta files should also be collected.
See bokulich-lab/q2-moshpit#76 and the closing PR for a possible solution.
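For illustration only, collecting both extensions could be as simple as the sketch below. The function name and the default extension tuple are assumptions, and this is not necessarily how the linked PR solves it.

```python
from pathlib import Path
from typing import List, Tuple


def collect_contig_files(
    directory: str, extensions: Tuple[str, ...] = (".fa", ".fasta")
) -> List[Path]:
    """Return all files in `directory` whose suffix is an accepted one."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in extensions
    )
```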
I kept running into memory issues with a test data set I am using. After reading the SPAdes manual for release 3.15.2, which qiime2-shotgun-2023.9 uses:
SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. If you set memory limit manually, SPAdes will use smaller buffers and thus less RAM.
I think my issue was not realizing the increased memory usage incurred when using multiple threads. I am in the process of validating this now.
If a user specifies 32 cores, they'll be using up ~16 GB of RAM for buffers. This is analogous to feature-classifier, in which more memory is used with increasing thread count. Conversely, the user may specify too little memory to get anything to run. For example, setting the maximum memory usage to 100 GB and using 16 threads means much smaller buffers / RAM per thread.
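The arithmetic behind the 16 GB figure is simply threads times 512 MB. A tiny sketch (the function name is made up, and 1 GB is taken as 1024 MB):

```python
BUFFER_MB_PER_THREAD = 512  # figure quoted from the SPAdes manual above


def default_buffer_gb(threads: int) -> float:
    """Approximate RAM (in GB) used for buffers when no limit is set."""
    return threads * BUFFER_MB_PER_THREAD / 1024


print(default_buffer_gb(32))  # 32 threads * 512 MB = 16 GB
```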
Perhaps update the help text like so:
--p-threads: Number of threads. By default SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. This can be further affected by the --p-memory option.
--p-memory: RAM limit for SPAdes in Gb (terminates if exceeded). If a smaller memory limit is set, SPAdes will use smaller buffers and thus less memory per --p-threads.
Is it easier for everyone to post these types of suggestions as an issue like this, or should I simply wait, compile a set of these suggestions, and then submit them as a PR? I've not dived into the code yet, so I figured I'd recommend these simple fixes as I work through testing the tools. I'd imagine that these are easy enough to wrap into any other existing PRs.
Describe the bug
It looks like one has to manually install megahit with make to disable mpopcnt, e.g. make clean && make disablempopcnt=1 (but since megahit is installed through Conda I don't see how this could work). Otherwise one can only run on one CPU. See this issue in the megahit repo for more info.
To Reproduce
cd <download_here>
wget https://www.polybox.ethz.ch/index.php/s/Zhk2igDBGRJTi1n/download
qiime assembly assemble-megahit \
--i-seqs reads.qza \
--p-presets meta-sensitive \
--p-num-cpu-threads 2 \
--o-contigs contigs.qza \
--verbose
Expected error:
Plugin error from assembly:
An error was encountered while running MEGAHIT, (return code 245), please inspect stdout and stderr to learn more.
Expected behavior
Action works without errors.
Please complete the following information:
When running the command within the conda environment qiime2-shotgun-2023.9:
qiime assembly assemble-spades \
--i-seqs fondue-output/single_reads.qza \
--p-threads 8 \
--p-phred-offset 33 \
--p-memory 60 \
--p-meta \
--o-contigs contigs.qza \
--verbose
The following error was returned:
Plugin error from assembly:
SPAdes v3.15.2 in "meta" mode supports only paired-end reads.
See above for debug info.
Update the help text to reflect this.
When running:
qiime assembly assemble-spades \
--i-seqs fondue-output/single_reads.qza \
--p-threads 8 \
--o-contigs contigs.qza
within the conda env qiime2-shotgun-2023.9, the following error appears within the --verbose output:
Usage: spades.py [options] -o <output_dir>
spades.py: error: argument --phred-offset: invalid qvoffset value: 'auto-detect'
...
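A possible fix, sketched under the assumption that SPAdes detects the offset itself when --phred-offset is omitted (the SPAdes manual describes it that way, and it accepts only 33 or 64 as values). The helper name is hypothetical and this is not the plugin's actual code.

```python
from typing import List, Union


def phred_offset_args(phred_offset: Union[int, str]) -> List[str]:
    """Translate the plugin-level value into SPAdes arguments.

    'auto-detect' is not a valid value for SPAdes, so the flag is
    left out entirely in that case.
    """
    if phred_offset == "auto-detect":
        return []
    if str(phred_offset) not in ("33", "64"):
        raise ValueError(f"Invalid PHRED offset: {phred_offset!r}")
    return ["--phred-offset", str(phred_offset)]
```

The returned list can be appended to the rest of the spades.py command before it is executed.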
Is your feature request related to a problem? Please describe.
When using QUAST to evaluate assembly quality on multiple large samples, the process takes very long and a significant portion of that time is spent on drawing the Icarus plots.
Describe the solution you'd like
Since those plots are of less importance compared to the actual assembly statistics, it would be great to have a way to disable drawing those so that the user can decide whether they want them or not. QUAST supports it through the --no-icarus flag.
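A rough sketch of how such a switch could be threaded through to the QUAST call. The function and parameter names are hypothetical, and the executable name is an assumption (the plugin may call a different QUAST entry point).

```python
from typing import List, Sequence


def build_quast_cmd(
    contig_fps: Sequence[str],
    output_dir: str,
    threads: int = 1,
    no_icarus: bool = False,
) -> List[str]:
    """Assemble a QUAST command line, optionally skipping Icarus plots."""
    cmd = ["quast.py", "--threads", str(threads), "-o", output_dir]
    if no_icarus:
        # only added when the user explicitly opts out of the plots
        cmd.append("--no-icarus")
    cmd.extend(contig_fps)
    return cmd
```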
As a plugin user,
I want to be able to co-assemble genomes across all the samples rather than per-sample
so that I can use all the available sequence information.
Tasks:
Depends on bokulich-lab/q2-types-genomics#77.
The CI can be simplified by moving coverage testing to the package-building step, as was done for q2-moshpit: bokulich-lab/q2-moshpit#12.
I didn't notice this during my review of #32, but I believe this text should now be updated:
$ qiime assembly evaluate-contigs --help
Usage: qiime assembly evaluate-contigs [OPTIONS]
...
--p-threads INTEGER Maximum number of parallel jobs. Default: 1.
Range(1, None) Currently disabled - only 1 CPU is supported.
[default: 1]
...
I think that should indicate that it's disabled on platforms other than Linux, right?
We need a QIIME 2 visualisation displaying assembly quality control results.
Acceptance criteria:
SampleData[Contigs] as input
At present, it looks like you tend to define default values implicitly, and it would be better to do this explicitly (i.e., when you define the function the action is mapped to).
I suspect that you are doing this so that the defaults are actually set by the underlying code if not overridden by the user, which makes sense intuitively but is not ideal for a few reasons. First, it could result in your help text becoming outdated and misleading (e.g., if you specify the default is 1, and the underlying code is changed so it's 16, the help text your user is referring to will be wrong). Next, and probably most importantly, if not specified explicitly the parameter values won't be stored in data provenance. And finally, if you define explicitly on function definition, the default value will autopopulate in the help text for your action.
As an example of how I recommend doing this, take a look at this snippet of the help text from dada2 denoise-single:
--p-n-threads INTEGER The number of threads to use for multithreaded
processing. If 0 is provided, all available cores
will be used. [default: 1]
That default value is specified here.
I recommend ultimately making this change across the whole code base.
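To illustrate the general point in plain Python (this does not show how QIIME 2 itself reads defaults): defaults written in the function signature can be introspected, so a framework is able to display and record them. The function and its parameters are made up for this example.

```python
import inspect
from typing import Any, Dict


def assemble_example(seqs: str, threads: int = 1, min_contig_len: int = 200):
    """Hypothetical action function with explicitly declared defaults."""
    return {"seqs": seqs, "threads": threads, "min_contig_len": min_contig_len}


def declared_defaults(func) -> Dict[str, Any]:
    """Collect every default value declared in a function signature."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


print(declared_defaults(assemble_example))
```

A default that is left to the underlying tool, by contrast, never appears in the signature and so cannot be picked up this way.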
As a user,
I want more QUAST params to be exposed in the evaluate_contigs actions,
so that I can have more control over how the tool is being run.
Notes
There are options like --memory-efficient but also a couple of other ones which have to do with aligning contigs to references.
Is your feature request related to a problem? Please describe.
No. It would be great, though, if the evaluate-contigs action could output a table (as an additional artifact, similarly to how the revamped evaluate-busco from q2-moshpit does) containing the most important stats (e.g., N50, L50, #contigs, total length, etc.) per sample. We could then use this table e.g. to filter samples for re-assembly based on some thresholds.
As a user,
I want the evaluate_contigs action to accept reference sequences and/or at least custom BLAST dbs
so that the 16S Silva reference does not need to be downloaded by QUAST on every execution.
Notes:
Maybe in the beginning we could just allow passing the pre-created Silva QZA that we can easily convert to a blast database to avoid those constant re-downloads.
Describe the bug
The download buttons in the visualization generated by evaluate-contigs do not work.
To Reproduce
Steps to reproduce the behavior:
Run qiime assembly evaluate-contigs with any set of contigs.
Expected behavior
The report/figure is downloaded.
Additional context
It seems the links attached to the buttons are pointing to a wrong directory within the artifact.
This is very minor, but might be worth addressing before an alpha release.
These two actions in assembly use different names for the same input type. It might be nice to pick one and use that for all actions in the plugin for consistency. This happens all over the place in the core distro, but it would be a breaking change to address it, so it hasn't been worth the trouble.
qiime assembly assemble-megahit --i-seqs demux.qza ...
qiime assembly map-reads-to-contigs --i-reads demux.qza ...
We need an action supporting mapping reads to indexed contigs/MAGs using bowtie2.
Acceptance criteria:
SampleData[*SequencesWithQuality] and SampleData[MultiBowtie2Index | Bowtie2Index]
SampleData[AlignmentMaps | AlignmentMap]
Notes:
On two separate machines (both HPC clusters), I've had evaluate-contigs fail due to insufficient disk space. I can see from the log that it downloads a lot of files. @misialq, have you run into this? Any ideas on how to address this? I haven't actually got this command to complete yet, after testing on two different studies (and two different systems, as I mentioned). Let me know if you'd like the error log - I can send that by email.
We need an action supporting contig/MAG indexing using bowtie2.
Acceptance criteria:
SampleData[Contigs] or SampleData[MAG] as input
SampleData[Bowtie2Index] or SampleData[MultiBowtie2Index]
When running the evaluate-contigs action in an environment with the most recent version of QUAST installed (5.2.0), it is impossible to generate the QC visualization due to the following error:
ValueError: invalid literal for int() with base 10: 'START_A'
This seems to be caused by ablab/quast#230 (and fixed by ablab/quast#244). Unfortunately, the previous conda-installable version of QUAST (5.0.2) is not compatible with our environment (it needs Python<3.7) so until a new, fixed version is released, the only solution would be to pip install QUAST directly.
Multiprocessing was temporarily disabled here:
q2-assembly/q2_assembly/quast/quast.py
Lines 57 to 63 in a562844
Update the following information:
In doing some experiments with q2-assembly, I noticed that the default number of CPUs is set a few different ways. For example:
qiime assembly assemble-megahit: --p-num-cpu-threads Number of CPU threads. Default: # of logical processors.
qiime assembly assemble-spades: --p-threads Number of threads. Default: 16.
In the core distribution QIIME 2 plugins, we tend to set these types of values with a default of 1, forcing the user to intentionally request more resources. This is because users will often just go with the default setting. If it's set to 1, they'll notice that it's too slow and increase the value. If it's set high by default though, it becomes easy for users to not notice and overload a system. For example, if they request a single CPU on their cluster, and then 16 subprocesses spin up, that can overload the cluster node and get them in trouble with the sys admin (and potentially give QIIME 2 a bad reputation with the sys admin if it happens regularly).
I recommend always setting the defaults for these parameters to 1, and letting the user override them.
Hi, it would be useful to have an action to filter contigs by length. This action should output one fasta file with the contigs that are shorter than or equal to the cutoff value and another file with the contigs that are longer.
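To make the request concrete, here is a minimal pure-Python sketch of the split. All names are hypothetical; a real action would read and write the artifact's files rather than in-memory lines.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_contigs_by_length(
    records: Iterable[Tuple[str, str]], cutoff: int
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split contigs into (length <= cutoff, length > cutoff)."""
    short, long_ = {}, {}
    for header, seq in records:
        target = short if len(seq) <= cutoff else long_
        target[header] = seq
    return short, long_
```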
The following command successfully runs and generates a QZV.
qiime assembly evaluate-contigs \
--i-contigs megahit-contigs.qza \
--p-min-contig 1000 \
--p-threads 56 \
--o-visualization megahit-contigs.qzv \
--verbose
However, clicking on any of the "Downloads" buttons does not download the respective PDFs and TSVs. I can confirm, though, that these files do exist within the extracted QZV under the quast_data/ and quast_data/basic_stats/ folders, using the command:
qiime tools extract \
--input-path megahit-contigs.qzv \
--output-path megahit-contigs-extract
Tested in the Chrome and Safari browsers. Code was run within the qiime2-shotgun-2023.9 environment.
Is your feature request related to a problem? Please describe.
It could be related to a potential problem arising after implementing #82. For backward-compatibility reasons, users who already assembled contigs would need a way to rename those to follow the convention which will be introduced by the other issue.
Describe the solution you'd like
We need an action (rename-contigs) which would simply take a SampleData[Contigs] artifact, rename contigs and return a new SampleData[Contigs] artifact - that way one could "easily" rename everything. This action should also take a parameter to specify the function used for renaming, as was done in #82.
Describe alternatives you've considered
The only alternative would be to export the pre-assembled data, rename "manually" and re-import - this would break the provenance though.
We need an action supporting metagenome assembly using the MEGAHIT assembler.
Acceptance criteria:
SampleData[*SequencesWithQuality*] as input
SampleData[Contigs] artifact
Is your feature request related to a problem? Please describe.
This is related to #49 and provides a solution for removing samples which do not have any assembled contigs.
Describe the solution you'd like
We should add a filter-contigs action which would allow us to do filtering similar to the demux-filter action. We should also include an option to filter out all empty contig files.
Is your feature request related to a problem? Please describe.
When binning contigs from more than one sample (see q2-moshpit), we need a way to distinguish contigs belonging to different samples (contig IDs are only unique per-sample).
Describe the solution you'd like
I want the contigs to be renamed post-assembly, regardless of the assembler used. We could use UUIDs to represent those, instead of arbitrary strings, as it is now. shortuuid could be a good candidate as it can generate short IDs which are still unique enough - that way we would avoid using the same kind of ID as we already are using to represent MAGs with the added benefit of those being slightly more human-readable.
Notes:
Let's provide the user with a selection of ways to rename: shortuuid, uuid4, uuid5 etc.
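As a sketch of what the selectable renaming could look like with Python's standard uuid module only (shortuuid is a third-party package and is left out here). The function names, the choice of namespace and the "sample/contig" naming scheme for uuid3/uuid5 are assumptions made for this example.

```python
import uuid
from typing import Callable, Dict, Iterable


def make_id_generator(kind: str, sample_id: str) -> Callable[[str], str]:
    """Return a function mapping an old contig ID to a new one."""
    if kind == "uuid4":
        # random IDs, different on every run
        return lambda old_id: str(uuid.uuid4())
    if kind in ("uuid3", "uuid5"):
        func = uuid.uuid3 if kind == "uuid3" else uuid.uuid5
        # name-based IDs: reproducible for a given sample/contig pair
        return lambda old_id: str(
            func(uuid.NAMESPACE_OID, f"{sample_id}/{old_id}")
        )
    raise ValueError(f"Unsupported ID type: {kind}")


def rename_contigs(
    sample_id: str, contig_ids: Iterable[str], kind: str = "uuid4"
) -> Dict[str, str]:
    """Map each old contig ID of one sample to a newly generated ID."""
    new_id = make_id_generator(kind, sample_id)
    return {old: new_id(old) for old in contig_ids}
```

Including the sample ID in the name keeps identically named contigs from different samples distinct when a name-based variant is used.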
Acceptance criteria:
rename-contigs) specifying the type of ID we want to apply
shortuuid (see the library linked above), uuid3, uuid4, uuid5 (using respective methods from Python's uuid module)
Is your feature request related to a problem? Please describe.
Not a problem but a potential improvement. Currently, if we want QUAST to take reads into account during contig evaluation, they are passed directly using the --i-reads flag. That results in QUAST mapping all the reads to the contigs, which increases the runtime significantly. QUAST allows passing a pre-generated alignment map using the --sam/--bam flags, which allows it to use the maps directly, without the need to align the reads first. We should expose this flag in our action so that a pre-generated map can be passed - we are usually generating it anyway as it is required for binning with MetaBAT.
Describe the solution you'd like
Expose the --bam flag from QUAST (in the form of a --i-reads-to-contigs-map Q2 flag) to enable passing SampleData[AlignmentMap] to the action directly.
We need an action supporting metagenome assembly using the metaSPAdes assembler.
Acceptance criteria:
SampleData[*SequencesWithQuality*] as input
SampleData[Contigs] artifact
Is your feature request related to a problem? Please describe.
Whenever QUAST is run and reference genomes are not provided it will try to identify some references and fetch them from NCBI. Since all of those are saved to a temporary directory, they will be removed after the action completes.
Describe the solution you'd like
It would be great if we could collect those genomes and output them as a GenomeData[DNASequence] artifact - that way, next time a user wants to re-run QUAST for whatever reason, they can just pass those references instead of re-downloading all of them.
Additional context
The references fetched by QUAST are combined into a single file which can be found in results/quast_downloaded_references (?). Alternatively, individual references could also be passed - not sure what kind of artifact they should be stored in, though.