bokulich-lab / q2-assembly

QIIME 2 plugin for (meta)genome assembly.

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.09% Python 95.76% TeX 2.08% HTML 2.07%

q2-assembly's People

Contributors

misialq ebolyen gregcaporaso keegan-evans colinvwood nbokulich oddant1 lizgehret christosmatzoros sann5 mikerobeson dorielagrabocka

q2-assembly's Issues

BUG: `index-contigs` fails if any input files are empty

If no contigs are formed for any samples during assembly, and a SampleData[Contigs] with some .fa files of size zero is therefore passed as input to index-contigs, index-contigs fails with a fairly uninformative error message:

  An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.

The --verbose output was more useful, but still only "warned" about an empty fasta file:

Input files DNA, FASTA:
  /scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa
Warning: Empty fasta file: '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/bin/bowtie2-build-s --wrapper basic-0 --bmaxdivn 4 --dcv 1024 --offrate 5 --ftabchars 10 --threads 40 /scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa /scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index
Traceback (most recent call last):
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 50, in _index_seqs
    run_command(cmd, verbose=True)
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/_utils.py", line 28, in run_command
    subprocess.run(cmd, check=True)
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['bowtie2-build', '--bmaxdivn', '4', '--dcv', '1024', '--offrate', '5', '--ftabchars', '10', '--threads', '40', '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa', '/scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/q2cli/commands.py", line 468, in __call__
    results = action(**arguments)
  File "<decorator-gen-736>", line 2, in index_contigs
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 274, in bound_callable
    outputs = self._callable_executor_(
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 509, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 85, in index_contigs
    _index_seqs(contig_fps, str(result), common_args, "contigs")
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 52, in _index_seqs
    raise Exception(
Exception: An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.

Plugin error from assembly:

  An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.

See above for debug info.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

I came across this because I had a couple of control samples which had very few (<10) demultiplexed sequences in my input to assemble-megahit, and these unsurprisingly didn't form any contigs. When I ran index-contigs I got the error.

I'm not sure what the best pathway forward is for this - at the very least we probably want a more informative error message, but we also might want a way to filter the SampleData[Contigs] so the user doesn't have to generate contigs again (which can take a while). I got around it this time by filtering my input to assemble-megahit to drop the two samples that were causing problems with qiime demux filter.

EDIT: I just hit this again, on a different data set. (Aug 21 2023)
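
One way to get a more informative error would be to check for empty files before calling bowtie2-build at all. The following is only a minimal sketch of that idea, not the plugin's actual code: the helper names `find_empty_fasta_files` and `check_contigs_not_empty` are hypothetical, and the check assumes that "empty" means a file size of zero bytes, as in the report above.

```python
import os


def find_empty_fasta_files(contig_fps):
    """Return the (sorted) paths of all files that are zero bytes in size."""
    return sorted(fp for fp in contig_fps if os.path.getsize(fp) == 0)


def check_contigs_not_empty(contig_fps):
    """Raise a descriptive error before any indexing is attempted.

    Hypothetical helper: it names the offending files so the user knows
    which samples to drop, instead of surfacing a bare return code.
    """
    empty = find_empty_fasta_files(contig_fps)
    if empty:
        names = ", ".join(os.path.basename(fp) for fp in empty)
        raise ValueError(
            f"{len(empty)} contig file(s) contain no sequences and cannot "
            f"be indexed: {names}. Remove the corresponding samples and "
            "try again."
        )
```

Calling `check_contigs_not_empty(contig_fps)` right after the contig files are collected would fail early, with the sample file names in the message, rather than after Bowtie 2 has already been launched.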

expose additional visualizations generated by QUAST in the `evaluate-contigs` visualization

It seems that QUAST generates many more visualizations than the ones that are currently displayed in the visualization produced by evaluate-contigs (most, if not all, of them are generated based on alignments to reference sequences, either provided by the user - not yet supported, see #35 - or fetched by QUAST automatically). Some of the interesting ones include:

Krona plots per sample
more Icarus reports (per reference sequence - show coverage per reference genome in the contig browser)
reports for not aligned sequences

ENH: update `insilicoseq` to 1.6.0

The new version relaxes its dependency requirements by allowing new versions of pysam and biopython - it'll make it easier to solve our future environments.

BUG: action `generate-reads` is missing the `pysam` dependency

Describe the bug
When I try to run the generate-reads action, I get an error stating that the pysam module is missing.

To Reproduce
Steps to reproduce the behavior:

Create and activate the environment:

mamba env create -n q2-shotgun --file https://data.qiime2.org/distro/shotgun/qiime2-shotgun-2023.9-py38-linux-conda.yml
conda activate q2-shotgun
qiime dev refresh-cache

Execute the command

qiime assembly generate-reads \
    --i-genomes genomes.qza \
    --p-sample-names sample1 sample2 sample3 sample4 \
    --p-n-reads 2000000 \
    --p-abundance uniform \
    --p-n-genomes 5 \
    --p-cpus 10 \
    --output-dir reads \
    --verbose

where genomes.qza is any FeatureData[Sequence] artifact.

See error:

Template genome sequences were provided - "n-genomes-ncbi" and "ncbi" parameters will be ignored.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: iss generate --compress --genomes /scratch/mziemski/tmp/qiime2/mziemski/data/0619c364-72ba-492a-9d8f-7043e825bf15/data/dna-sequences.fasta --n_genomes 5 --abundance uniform --n_reads 2000000 --mode kde --model HiSeq --cpus 10 --output /scratch/mziemski/tmp/tmpowd79ql2/sample1_00_L001

Traceback (most recent call last):
  File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/bin/iss", line 6, in <module>
    from iss.app import main
  File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/app.py", line 4, in <module>
    from iss import bam
  File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/bam.py", line 14, in <module>
    import pysam
ModuleNotFoundError: No module named 'pysam'

Expected behavior
The action runs without errors.

Please complete the following information:

OS: CentOS
QIIME 2 version: 2023.9

Additional context
This happened on both Linux and macOS. I used the 2023.9 distro - for some reason the pysam package is not included there. Before, when I used to install q2-assembly "directly" from our conda channel, it all seemed to work fine. @lizgehret, @ebolyen would you have an idea why that could be happening? 🤔

invalid sample name assumption causes failure in `evaluate-contigs`

I am testing with samples that have underscores in their identifiers, and am running into a failure due to an assumption that only the text before the _ is the sample identifier. This looks to be traceable to this line.

Note that QIIME 2 does allow for underscores in identifiers (see the documentation here).

Is there a different way to get the sample ids from the filenames in this case? Based on a quick look at the files in the data artifact, it looks like you could change line 113 to:

return os.path.basename(fp.replace('_contigs.fa', ''))

I recommend adding a test of this function that includes a sample id with an underscore in it.
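
A slightly more defensive variant of the suggestion above, together with the kind of test case I have in mind, could look like the sketch below. The function name `get_sample_id` and the `suffix` parameter are made up for illustration. Unlike `str.replace`, this strips the suffix only from the end of the file name, so an ID that happens to contain `_contigs.fa` somewhere in the middle is left intact.

```python
import os


def get_sample_id(fp, suffix="_contigs.fa"):
    """Derive the sample ID by stripping a known suffix from the file name.

    Hypothetical helper; assumes every contig file is named
    '<sample id><suffix>'.
    """
    name = os.path.basename(fp)
    if not name.endswith(suffix):
        raise ValueError(f"Unexpected contig file name: {name}")
    # Slice instead of replace() so only the trailing suffix is removed.
    return name[: -len(suffix)]


# Sample IDs with underscores survive unchanged.
assert get_sample_id("/tmp/data/sample_1_contigs.fa") == "sample_1"
assert get_sample_id("/tmp/data/NEC-EF_contigs.fa") == "NEC-EF"
```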

BUG: `MultiMAGSequencesDirFmt` does not have a manifest in `index_mags`

Inside of index_mags here mags (of type MultiMAGSequencesDirFmt) does not have the expected manifest attached to it. E.g. adding mags.validate() in this function causes tests to fail.

It also doesn't seem like there's a way to generate a manifest the way we do for, e.g., CasavaOneEightSingleLanePerSampleDirFmt in q2-types, from which some of the parent classes used here are borrowed.

Generate Reads unexpected inputs needed

Hello,

I tried to run qiime assembly generate-reads --output-dir test_data and was expecting data from sampled genomes to be placed in a new directory, test_data. Instead, I got

Plugin error from assembly:

  'NoneType' object is not iterable

It seems that ncbi, n-genomes-ncbi, and sample-names are needed for this to run.

Thanks!

Cannot download some files from the `evaluate-contigs` qzv

When I open the attached visualization, I am unable to download any attached images.

Steps to reproduce:

run qiime assembly evaluate-contigs with the default parameters (without providing reads as input) and the contigs.qza artifact provided in the attached files
open the visualization
on the "QC report" tab:
- try to download the "GC content plot" for one sample
- try to download the full report (green button)
- try to download any of the other summary plots (next to the green button)

Expected behaviour:
Plots get downloaded.

Actual behaviour:
An error message is displayed:

Attached files: https://polybox.ethz.ch/index.php/s/0MwYFnTJS5M1k8p/download

BUG: index-contigs collects only .fa files

Only .fa contig files are collected but .fasta files should also be collected:

q2-assembly/q2_assembly/bowtie2/indexing.py

Line 84 in 5afdeb6

contig_fps = sorted(glob.glob(os.path.join(str(contigs), "*_contigs.fa")))

See bokulich-lab/q2-moshpit#76 and the closing PR for a possible solution.
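
Independently of how the q2-moshpit PR solved it, the collection step could accept both extensions along these lines. This is a sketch only: `collect_contig_files` and its `extensions` parameter are hypothetical names, and it assumes the files still follow the `<sample>_contigs.<ext>` naming pattern.

```python
import glob
import os


def collect_contig_files(contigs_dir, extensions=(".fa", ".fasta")):
    """Collect contig files with any of the accepted extensions, sorted."""
    fps = []
    for ext in extensions:
        # "*_contigs.fa" does not match "x_contigs.fasta", because glob
        # patterns match the whole file name, so there are no duplicates.
        pattern = os.path.join(str(contigs_dir), f"*_contigs{ext}")
        fps.extend(glob.glob(pattern))
    return sorted(fps)
```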

Update help text for `assemble-spades --p-threads --p-memory`

I kept running into memory issues with a test data set I am using. The SPAdes manual for release 3.15.2, which qiime2-shotgun-2023.9 uses, states:

SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. If you set memory limit manually, SPAdes will use smaller buffers and thus less RAM.

I think my issue was not realizing the increased memory usage incurred when using multiple threads. I am in the process of validating this now.

If a user specifies 32 cores, they'll be using up ~16 GB of RAM for buffers. This is analogous to feature-classifier, in which more memory is used with increasing thread count. Conversely, the user may specify too little memory to get anything to run. For example, setting the maximum memory usage to 100 GB and using 16 threads means much smaller buffers / RAM per thread.
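
To make the arithmetic explicit, here is a tiny sketch. It assumes the 512 Mb-per-thread figure quoted from the manual applies when no memory limit is set, and that 1 GB = 1024 MB; the function name is made up.

```python
def spades_buffer_memory_gb(threads, mb_per_thread=512):
    """Estimate the RAM used for per-thread buffers, in GB.

    Assumption: the default buffer size quoted from the SPAdes manual
    (512 Mb per thread) and 1 GB = 1024 MB.
    """
    return threads * mb_per_thread / 1024


print(spades_buffer_memory_gb(32))  # 16.0
print(spades_buffer_memory_gb(8))   # 4.0
```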

Perhaps update the help text like so:

--p-threads: Number of threads. By default SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. This can be further affected by the --p-memory option.
--p-memory: RAM limit for SPAdes in Gb (terminates if exceeded). If a smaller memory limit is set, SPAdes will use smaller buffers and thus less memory per --p-threads.

Is it easier for everyone if I post these types of suggestions as issues like this one, or should I wait, compile a set of them, and submit them as a PR? I haven't dived into the code yet, so I figured I'd recommend these simple fixes as I work through testing the tools. I'd imagine that these are easy enough to wrap into any other existing PRs.

BUG: action `assemble-megahit` does not allow for `--p-num-cpu-threads > 1`

Describe the bug
It looks like one has to manually install megahit with make to disable mpopcnt, e.g. make clean && make disablempopcnt=1 (but since megahit is installed through Conda I don't see how this could work). Otherwise one can only run on one CPU. See this issue in the megahit repo for more info.

To Reproduce

Download data

cd <download_here>
wget https://www.polybox.ethz.ch/index.php/s/Zhk2igDBGRJTi1n/download

qiime assembly assemble-megahit \
  --i-seqs reads.qza \
  --p-presets meta-sensitive \
  --p-num-cpu-threads 2 \
  --o-contigs contigs.qza \
  --verbose

Expected error:

Plugin error from assembly:

  An error was encountered while running MEGAHIT, (return code 245), please inspect stdout and stderr to learn more.

Expected behavior
Action works without errors.

Please complete the following information:

OS: macOS
QIIME 2 version: 2024.5

MAINT: update dependencies

update help text for `assemble-spades --p-meta`

When running the command within the conda environment qiime2-shotgun-2023.9:

qiime assembly assemble-spades \
    --i-seqs fondue-output/single_reads.qza \
    --p-threads 8   \
    --p-phred-offset 33 \
    --p-memory 60 \
    --p-meta \
    --o-contigs contigs.qza  \
    --verbose

The following error was returned:

Plugin error from assembly:
SPAdes v3.15.2 in "meta" mode supports only paired-end reads.
See above for debug info.

Update the help text to reflect this.

BUG: `assembly assemble-spades` auto-detection of phred-offset error

When running:

qiime assembly assemble-spades \
    --i-seqs fondue-output/single_reads.qza \
    --p-threads 8 \
    --o-contigs contigs.qza

within the conda env qiime2-shotgun-2023.9, the following error appears within the --verbose output:

Usage: spades.py [options] -o <output_dir>
spades.py: error: argument --phred-offset: invalid qvoffset value: 'auto-detect'
...

ENH: allow contig QC evaluation without drawing the Icarus plots

Is your feature request related to a problem? Please describe.
When using QUAST to evaluate assembly quality on multiple large samples, the process takes very long, and a significant portion of that time is spent on drawing the Icarus plots.

Describe the solution you'd like
Since those plots are of less importance compared to the actual assembly statistics, it would be great to have a way to disable drawing those so that the user can decide whether they want them or not. QUAST supports it through the --no-icarus flag.

ENH: implement co-assembly option

As a plugin user,
I want to be able to co-assemble genomes across all the samples rather than per-sample
so that I can use all the available sequence information.

Tasks:

Implement co-assembly for MEGAHIT
Implement co-assembly for metaSPAdes

Depends on bokulich-lab/q2-types-genomics#77.

Clean up the CI

The CI can be simplified by moving coverage testing to the package-building step, as was done for q2-moshpit: bokulich-lab/q2-moshpit#12.

outdated help text on evaluate-contigs

I didn't notice this during my review of #32, but I believe this text should now be updated:

$ qiime assembly evaluate-contigs --help
Usage: qiime assembly evaluate-contigs [OPTIONS]
...
  --p-threads INTEGER     Maximum number of parallel jobs. Default: 1.
    Range(1, None)        Currently disabled - only 1 CPU is supported.
                                                                  [default: 1]
...

I think that should indicate that it's disabled on platforms other than Linux, right?

Implement assembly QC visualisation

We need a QIIME 2 visualisation displaying assembly quality control results.

Acceptance criteria:

uses metaQUAST (link)
uses SampleData[Contigs] as input
wraps the output report and the Icarus browser as a qzv visualization

specify default values in function definition

At present, it looks like you tend to define default values implicitly, and it would be better to do this explicitly (i.e., when you define the function the action is mapped to).

I suspect that you are doing this so that the defaults are actually set by the underlying code if not overridden by the user, which makes sense intuitively but is not ideal for a few reasons. First, it could result in your help text becoming outdated and misleading (e.g., if you specify the default is 1, and the underlying code is changed so it's 16, the help text your user is referring to will be wrong). Next, and probably most importantly, if not specified explicitly the parameter values won't be stored in data provenance. And finally, if you define explicitly on function definition, the default value will autopopulate in the help text for your action.

As an example of how I recommend doing this, take a look at this snippet of the help text from dada2 denoise-single:

  --p-n-threads INTEGER  The number of threads to use for multithreaded
                         processing. If 0 is provided, all available cores
                         will be used.                            [default: 1]

That default value is specified here.

I recommend ultimately making this change across the whole code base.
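
To illustrate the general idea outside of QIIME 2, the sketch below shows why explicit defaults help: anything declared in the function signature can be read back programmatically (here with Python's `inspect` module), so it can be shown in help text and recorded. The `assemble` function and its parameters are purely hypothetical, and this is not how the framework itself is implemented.

```python
import inspect


def assemble(seqs, threads: int = 1, min_contig_len: int = 200):
    """Hypothetical action function with defaults stated explicitly."""
    return {"threads": threads, "min_contig_len": min_contig_len}


def explicit_defaults(func):
    """Return the default values declared in a function's signature."""
    sig = inspect.signature(func)
    return {
        name: param.default
        for name, param in sig.parameters.items()
        if param.default is not inspect.Parameter.empty
    }


print(explicit_defaults(assemble))  # {'threads': 1, 'min_contig_len': 200}
```

If the defaults were instead left as None and filled in by the wrapped tool, nothing in the signature would reveal which value was actually used.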

Expose more QUAST parameters

As a user,
I want more QUAST params to be exposed in the evaluate_contigs actions,
so that I can have more control over how the tool is being run.

Notes
There are options like --memory-efficient but also a couple of other ones which have to do with aligning contigs to references.

ENH: evaluate-contigs action should output a results table to enable filtering of samples

Is your feature request related to a problem? Please describe.
No. It would be great, though, if the evaluate-contigs action could output a table (as an additional artifact, similar to what the revamped evaluate-busco from q2-moshpit does) containing the most important stats (e.g., N50, L50, #contigs, total length, etc.) per sample. We could then use this table, e.g., to filter samples for re-assembly based on some thresholds.
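
For reference, the per-sample stats mentioned above can be computed from contig lengths alone. The sketch below uses the usual definitions (N50 is the length of the contig at which the cumulative length, going from longest to shortest, first reaches half of the total; L50 is the number of contigs needed to get there). The function names are made up, and the values QUAST reports may differ, e.g. because of its minimum contig length threshold.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths; (0, 0) if empty."""
    if not lengths:
        return 0, 0
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


def summarize_contigs(lengths):
    """Build one row of a hypothetical per-sample results table."""
    n50, l50 = n50_l50(lengths)
    return {
        "n_contigs": len(lengths),
        "total_length": sum(lengths),
        "N50": n50,
        "L50": l50,
    }


# Total length is 54, so half is 27: 10 + 9 + 8 reaches it at the 3rd contig.
print(n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # (8, 3)
```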

`evaluate-contigs` should accept reference seqs

As a user,
I want the evaluate_contigs action to accept reference sequences and/or at least custom BLAST dbs
so that the 16S Silva reference does not need to be downloaded by QUAST on every execution.

Notes:
Maybe in the beginning we could just allow passing the pre-created Silva QZA that we can easily convert to a blast database to avoid those constant re-downloads.

BUG: download buttons in the QUAST viz are broken

Describe the bug
The download buttons in the visualization generated by evaluate-contigs do not work.

To Reproduce
Steps to reproduce the behavior:

Execute the command qiime assembly evaluate-contigs with any set of contigs.
Open the file output visualization.
Click one of the download buttons.

Expected behavior
The report/figure is downloaded.

Additional context
It seems the links attached to the buttons are pointing to a wrong directory within the artifact.

MAINT: ensure that input names are consistent across actions

This is very minor, but might be worth addressing before an alpha release.

These two actions in assembly use different names for the same input type. It might be nice to pick one and use that for all actions in the plugin for consistency. This happens all over the place in the core distro, but it would be a breaking change to address it, so it hasn't been worth the trouble.

qiime assembly assemble-megahit --i-seqs demux.qza ...

qiime assembly map-reads-to-contigs --i-reads demux.qza ...

Implement read mapping action

We need an action supporting mapping reads to indexed contigs/MAGs using bowtie2.

Acceptance criteria:

uses SampleData[*SequencesWithQuality] and SampleData[MultiBowtie2Index | Bowtie2Index]
uses bowtie2 to align reads
outputs SampleData[AlignmentMaps | AlignmentMap]

Notes:

perhaps this needs to be split into two actions: one for single- and one for paired-end reads...? (perhaps not)

`evaluate-contigs` fails with `OSError: [Errno 28] No space left on device`

On two separate machines (both HPC clusters), I've had evaluate-contigs fail due to available disk space. I can see from the log that it downloads a lot of files. @misialq, have you run into this? Any ideas on how to address this? I haven't actually got this command to complete yet, after testing on two different studies (and two different systems, as I mentioned). Let me know if you'd like the error log - I can send that by email.

Implement contig read indexing action

We need an action supporting contig/MAG indexing using bowtie2.

Acceptance criteria:

uses SampleData[Contigs] or SampleData[MAG] as input
uses bowtie2-build to generate the indices for contig/MAG files
outputs SampleData[Bowtie2Index] or SampleData[MultiBowtie2Index]

`evaluate-contigs` fails on the most recent QUAST version

When running the evaluate-contigs action in an environment with the most recent version of QUAST installed (5.2.0), it is impossible to generate the QC visualization due to the following error:
ValueError: invalid literal for int() with base 10: 'START_A'.

This seems to be caused by ablab/quast#230 (and fixed by ablab/quast#244). Unfortunately, the previous conda-installable version of QUAST (5.0.2) is not compatible with our environment (it needs Python<3.7) so until a new, fixed version is released, the only solution would be to pip install QUAST directly.

Enable multiprocessing in `evaluate-contigs`

Multiprocessing was temporarily disabled here:

q2-assembly/q2_assembly/quast/quast.py

Lines 57 to 63 in a562844

    
    elif arg_key == "threads" and (not arg_val or arg_val > 1):
        # TODO: this needs to be fixed (to allow multiprocessing)
        print(
            "Multiprocessing is currently not supported. Resetting "
            "number of threads to 1."
        )
        return [_construct_param(arg_key), "1"]

as it was not working before but should be possible now - that branch can be removed.

Update README

Update the following information:

add new conda installation instructions
add dev section (hooks etc.)
add a new section describing functionality

default values for number of CPU threads

In doing some experiments with q2-assembly, I noticed that the default number of CPUs is set a few different ways. For example:

qiime assembly assemble-megahit: --p-num-cpu-threads Number of CPU threads. Default: # of logical processors.
qiime assembly assemble-spades: --p-threads Number of threads. Default: 16.

In the core distribution QIIME 2 plugins, we tend to set these types of values with a default of 1, forcing the user to intentionally request more resources. This is because users will often just go with the default setting. If it's set to 1, they'll notice that it's too slow and increase the value. If it's set high by default though, it becomes easy for users to not notice and overload a system. For example, if they request a single CPU on their cluster, and then 16 subprocesses spin up, that can overload the cluster node and get them in trouble with the sys admin (and potentially give QIIME 2 a bad reputation with the sys admin if it happens regularly).

I recommend always setting the defaults for these parameters to 1, and letting the user override them.

ENH: filter contigs by length

Hi, it would be useful to have an action to filter contigs by length. This action should output one FASTA file with the contigs whose length is less than or equal to the cutoff value and another file with the contigs that are longer.
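
The core of such an action could be as simple as the sketch below. It is plain Python with a minimal FASTA reader rather than whatever library the plugin would actually use, and the names `read_fasta` and `split_by_length` are hypothetical. Contigs with a length equal to the cutoff go into the "short" group, as requested.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, cutoff):
    """Split records into (length <= cutoff, length > cutoff)."""
    short, long_ = [], []
    for header, seq in records:
        (short if len(seq) <= cutoff else long_).append((header, seq))
    return short, long_
```

For a file on disk this would be used as `split_by_length(read_fasta(open(path)), cutoff)`, with each group then written to its own output file.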

BUG: Download buttons for the `assembly evaluate-contigs` visualization not producing PDFs, etc..

The following command successfully runs and generates a QZV.

qiime assembly evaluate-contigs \
  --i-contigs megahit-contigs.qza \
  --p-min-contig 1000 \
  --p-threads 56 \
  --o-visualization megahit-contigs.qzv \
  --verbose

However, clicking on any of the "Downloads" buttons does not invoke a "download" of the respective PDFs & TSVs. I can confirm, though, that these files do exist within the extracted QZV under the quast_data/ and quast_data/basic_stats/ folders, using the command:

qiime tools extract \
--input-path megahit-contigs.qzv \
--output-path megahit-contigs-extract

Tested within the Chrome and Safari browsers. Code ran within the qiime2-shotgun-2023.9 environment.

ENH: add action to rename contigs

Is your feature request related to a problem? Please describe.
It could be related to a potential problem arising after implementing #82. For backward-compatibility reasons, users who already assembled contigs would need a way to rename those to follow the convention which will be introduced by the other issue.

Describe the solution you'd like
We need an action (rename-contigs) which would simply take a SampleData[Contigs] artifact, rename the contigs and return a new SampleData[Contigs] artifact - that way one could "easily" rename everything. This action should also take a parameter to specify the function used for renaming, as was done in #82.

Describe alternatives you've considered
The only alternative would be to export the pre-assembled data, rename "manually" and re-import - this would break the provenance though.

Implement assemble-megahit action

We need an action supporting metagenome assembly using the MEGAHIT assembler.

Acceptance criteria:

uses MEGAHIT assembler (link)
uses SampleData[*SequencesWithQuality*] as input
can handle single- and paired-end reads
outputs SampleData[Contigs] artifact

ENH: add an action to filter `SampleData[Contigs]`

Is your feature request related to a problem? Please describe.
This is related to #49 and it provides a solution to remove samples which do not have any assembled contigs.

Describe the solution you'd like
We should add a filter-contigs action which would allow us to do filtering similar to the demux-filter action. We should also include an option to filter out all empty contig files.

ENH: make contig IDs unique across all samples

Is your feature request related to a problem? Please describe.
When binning contigs from more than one sample (see q2-moshpit), we need a way to distinguish contigs belonging to different samples (contig IDs are only unique per-sample).

Describe the solution you'd like
I want the contigs to be renamed post-assembly, regardless of the assembler used. We could use UUIDs to represent those, instead of arbitrary strings, as it is now. shortuuid could be a good candidate as it can generate short IDs which are still unique enough - that way we would avoid using the same kind of ID that we are already using to represent MAGs, with the added benefit of those being slightly more human-readable.

Notes:
Let's provide the user with a selection of ways to rename: shortuuid, uuid4, uuid5 etc.

Acceptance criteria:

both contig assembly actions have a new, optional parameter (rename-contigs) specifying the type of ID we want to apply
the options for the above param should, for now, be: shortuuid (see the library linked above), uuid3, uuid4, uuid5 (using respective methods from Python's uuid module)
for uuid3 and uuid5 the sample name should become the namespace and the original contig ID could become the name
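
A rough sketch of the uuid-based options, using only Python's standard uuid module (shortuuid is a third-party package and is left out here). One detail the criteria above leave open is how a sample name becomes a namespace, since `uuid.uuid3` and `uuid.uuid5` expect the namespace to be a UUID object. The sketch assumes the sample name is first hashed into a UUID under `uuid.NAMESPACE_OID`; that choice, the function name `new_contig_id` and the example IDs are all illustrative assumptions.

```python
import uuid


def new_contig_id(sample_id, contig_id, method="uuid4"):
    """Generate a new contig ID using the selected uuid method."""
    if method == "uuid4":
        # Random ID: unique, but not reproducible between runs.
        return str(uuid.uuid4())
    if method in ("uuid3", "uuid5"):
        # Assumption: derive a namespace UUID from the sample name, then
        # use the original contig ID as the name within that namespace.
        namespace = uuid.uuid5(uuid.NAMESPACE_OID, sample_id)
        make = uuid.uuid3 if method == "uuid3" else uuid.uuid5
        return str(make(namespace, contig_id))
    raise ValueError(f"Unsupported renaming method: {method}")


# Same sample and contig always map to the same uuid5 ID,
# while the same contig ID in another sample gets a different one.
a = new_contig_id("sample_1", "k141_0", "uuid5")
b = new_contig_id("sample_2", "k141_0", "uuid5")
assert a == new_contig_id("sample_1", "k141_0", "uuid5")
assert a != b
```

The deterministic variants (uuid3, uuid5) have the side benefit that re-running the renaming on the same input yields the same IDs.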

ENH: allow passing already aligned reads to `evaluate-quast`

Is your feature request related to a problem? Please describe.
Not a problem but a potential improvement. Currently, if we want QUAST to take reads into account during contig evaluation, they are being passed directly using the --i-reads flag. That results in QUAST mapping all the reads to the contigs, which increases the runtime significantly. QUAST allows passing a pre-generated alignment map using the --sam/--bam flags, which allows it to use the maps directly, without the need to align the reads first. We should expose this flag in our action so that a pre-generated map can be passed - we are usually generating it anyway as it is required for binning with MetaBAT.

Describe the solution you'd like
Expose the --bam flag from QUAST (in the form of a --i-reads-to-contigs-map Q2 flag) to enable passing SampleData[AlignmentMap] to the action directly.

Implement assemble-metaspades action

We need an action supporting metagenome assembly using the metaSPAdes assembler.

Acceptance criteria:

uses metaSPAdes assembler (link)
uses SampleData[*SequencesWithQuality*] as input
can handle single- and paired-end reads
outputs SampleData[Contigs] artifact

ENH: output fetched reference genomes from 'evaluate-quast'

Is your feature request related to a problem? Please describe.
Whenever QUAST is run and reference genomes are not provided it will try to identify some references and fetch them from NCBI. Since all of those are saved to a temporary directory, they will be removed after the action completes.

Describe the solution you'd like
It would be great if we could collect those genomes and output them as a GenomeData[DNASequence] artifact - that way, the next time a user wants to re-run QUAST for whatever reason, they can just pass those references instead of re-downloading all of them.

Additional context
The references fetched by QUAST are combined into a single file which can be found in results/quast_downloaded_references (?). Alternatively, individual references could also be passed - not sure what kind of artifact they should be stored in, though.
