bokulich-lab / q2-moshpit
MOdular SHotgun metagenome Pipelines with Integrated provenance Tracking
License: BSD 3-Clause "New" or "Revised" License
We need to come back to these tests after the alpha release to fill in the missing ones (commented as TODO in kraken2/tests/test_classification
) using the same database that's used by the contig tests. We should probably also test edge cases at that time, like empty inputs for some samples and a sample with no hits to the database. We should also test merging sequences from multiple test samples to confirm that multiple expected taxon ids show up in the classification results.
see discussion in #69
The build-eggnog-diamond-db
action (to be implemented) creates a Diamond database for specific taxa. To do so, it needs a ReferenceDB[EggnogSequenceTaxa]
input artifact, therefore there is a need to create an action that produces this artifact by fetching the appropriate data from the internet.
Something like...
qiime moshpit fetch-eggnog-fasta --o-eggnog-fasta one_taxa_data.qza
Which can then be used in...
qiime moshpit build-eggnog-diamond-db
It might be important to include the date of download in the artifact since the version might influence downstream results.
Citations are missing from the citations.bib
(plus respective references) for the binning action.
I'm experimenting with different confidence values for this classifier, and my understanding is that 1.0 is a valid value here, but it is disallowed by the plugin. I think the type needs to be updated to Float % Range(0, 1, inclusive_end=True)
.
Parameter 'confidence' received 1 as an argument, which is incompatible with parameter type: Float % Range(0, 1)
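The error above is consistent with a half-open interval check, where the upper bound is excluded unless stated otherwise. Below is a minimal pure-Python sketch of that difference. The `in_range` helper is hypothetical, written only for illustration, and is not part of the QIIME 2 API.

```python
def in_range(value, start, end, inclusive_start=True, inclusive_end=False):
    """Return True if value lies in the interval, honoring the inclusivity flags."""
    lower_ok = value >= start if inclusive_start else value > start
    upper_ok = value <= end if inclusive_end else value < end
    return lower_ok and upper_ok


# Half-open [0, 1): a confidence of 1.0 is rejected, as in the reported error.
print(in_range(1.0, 0, 1))
# Closed [0, 1]: with the proposed change, 1.0 would be accepted.
print(in_range(1.0, 0, 1, inclusive_end=True))
```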
As discussed on this PR, due to CheckM's dependency on pplacer
(and a resulting community distro incompatibility), let's move those visualizations out of q2-moshpit and to a new plugin.
Acceptance criteria:
q2-checkm?)
Is your feature request related to a problem? Please describe.
Yes and no. We currently only support our custom dereplication method which is very simplistic. There are at least two other tools which could, potentially, be used for that purpose: dRep and DAS_Tool.
Describe the solution you'd like
It would be great if we could support at least one of those. We should first evaluate which of those would be compatible with our Q2 Python environment and what functionalities they both provide. Then, we can decide which of those to build in and how.
We need an action supporting binning assembled contigs into MAGs using the metaBat binner.
Acceptance criteria:
SampleData[Contigs] + SampleData[AlignmentMaps] as inputs
jgi_summarize_bam_contig_depths)
SampleData[MAGs]
Documentation on this can be found here. Specifically I could see a searchable report being useful for assessing whether taxa of interest (including a host's taxon) are present in a database, and in finding information on things like the number of distinct minimizers in the database that are associated with a taxon (which could be useful as feature metadata). If the output were of type ImmutableMetadata
keyed on taxon id, that could let us use the information as feature metadata, or generate a searchable .qzv with metadata tabulate
.
Is your feature request related to a problem?
No. Eggnog provides functionality to analyze sequences using HMMER. If one wishes to use this functionality through Qiime2, it would be nice to also have an action that fetches the HMMER database using the download_eggnog_data.py
script from Eggnog.
Describe the solution you'd like
This could be done by simply calling the script and piping it with the appropriate yes/no answers to indicate which database should be downloaded. For example:
printf "n\nn\nn\nn\nn\nn" | download_eggnog_data.py -s -F -H -P
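If the action were implemented in Python, the same piping could be done with `subprocess`. This is only a sketch under assumptions: `build_answers` and `run_download` are hypothetical helper names, and the script is assumed to be on the PATH and to read its prompts from stdin.

```python
import subprocess


def build_answers(answers):
    """Join yes/no answers into the newline-separated text sent to the prompts."""
    return "\n".join(answers)


def run_download(flags, answers, script="download_eggnog_data.py"):
    """Run the download script and feed the answers to its stdin.

    Assumes the script is on PATH; raises CalledProcessError on failure.
    """
    return subprocess.run(
        [script, *flags],
        input=build_answers(answers),
        text=True,
        check=True,
    )


# Same text as the printf call above: six "n" answers separated by newlines.
print(repr(build_answers(["n"] * 6)))
```

Keeping the answers as a list makes it easier to see which prompt each answer belongs to than a hand-written escape string.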
Additional context
GenomeData[BLAST6]
-> SampleData[BLAST6]
This is just to track work already in progress.
We want to use Bracken to re-estimate abundances of taxa when using reads for Kraken2 classification. Next to the Kraken2 reports (already implemented), the action should output a FeatureTable[Frequency]
artifact for all the samples.
See #36 for more details.
When I created a conda environment from the 2023.9/shotgun/released/ubuntu environment file and ran make test
on the moshpit main branch, I got an error about not having altair installed. I conda installed altair and it resolved the issue.
We need an action supporting dereplication of MAGs into unique genomes.
Acceptance criteria:
SampleData[MAGs] + SampleData[MultiAlignmentMap] as inputs
FeatureData[MAG] + FeatureTable[Frequency]
After pulling main and running make test I'm getting an error from this test case: q2_moshpit/busco/tests/test_utils.py::TestBUSCO::test_draw_busco_plots_for_render
.
The relevant part of the diff is:
E - "$schema": "https://vega.github.io/schema/vega-lite/v5.15.1.json",
E + "$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json",
I'm guessing different versions of Altair are going to possibly output different specs?
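One way to make such a test independent of the installed Altair version is to drop the `$schema` entry from both specs before comparing them. A minimal sketch follows. The `strip_schema` helper and the `"mark": "bar"` content are illustrative assumptions, not the plugin's test code.

```python
import copy


def strip_schema(spec):
    """Return a copy of a Vega-Lite spec dict without its top-level "$schema" key."""
    cleaned = copy.deepcopy(spec)
    cleaned.pop("$schema", None)
    return cleaned


spec_a = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.15.1.json",
    "mark": "bar",
}
spec_b = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json",
    "mark": "bar",
}

# The specs differ only in the schema URL, so they match once it is removed.
print(strip_schema(spec_a) == strip_schema(spec_b))
```

This only hides the version string. If a different Altair version also changes the body of the spec, the comparison would still fail.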
Is your feature request related to a problem? Please describe.
The eggnog
-specific actions require reference databases as inputs which at the moment need to be manually created/fetched by the user. It is not immediately clear how those should be constructed and/or what actually should be included in either of those.
Describe the solution you'd like
Let's add an action fetch-eggnog-db
which would grab the latest version of the entire eggnog database (using the download_eggnog_data.py
tool provided by eggnog itself) and create the two ReferenceDB artifacts required by those actions.
Additional context
We could later add a build-eggnog-db
action which could be used to construct custom eggnog databases. This has lower priority, though, as there is a comprehensive DB already available through download.
As discussed on #38, let's wrap up the Kraken 2 integration by:
If no MAGs are formed for a sample, bin-contigs-metabat
fails with a validation error:
[bam_sort_core] merging from 1 files and 1 in-memory blocks...
Output depth matrix to /tmp/tmpgg2uftbn/KS_depth.txt
jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-01-04T21:10:40
Output matrix to /tmp/tmpgg2uftbn/KS_depth.txt
0: Opening bam: /tmp/tmpgg2uftbn/KS_alignment_sorted.bam
Processing bam files
Thread 0 finished: KS_alignment_sorted.bam with 3838648 reads and 219961 readsWellMapped
Creating depth matrix file: /tmp/tmpgg2uftbn/KS_depth.txt
Closing most bam files
Closing last bam file
Finished
MetaBAT 2 (2.15 (Bioconda)) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, maxEdges 200 and minClsSize 200000. with random seed=1680628081
0 bins (0 bases in total) formed.
Traceback (most recent call last):
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
results = action(**arguments)
File "<decorator-gen-40>", line 2, in bin_contigs_metabat
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
outputs = self._callable_executor_(scope, callable_args,
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 408, in _callable_executor_
artifact = qiime2.sdk.Artifact._from_view(
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/sdk/result.py", line 349, in _from_view
result = transformation(view, validate_level)
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/core/transform.py", line 68, in transformation
self.validate(view, level=validate_level)
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/core/transform.py", line 143, in validate
view.validate(level)
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/plugin/model/directory_format.py", line 177, in validate
getattr(self, field)._validate_members(collected_paths, level)
File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/plugin/model/directory_format.py", line 109, in _validate_members
raise ValidationError(
qiime2.core.exceptions.ValidationError: Missing one or more files for MultiFASTADirectoryFormat: '.+\\.(fa|fasta)$'
Plugin error from moshpit:
Missing one or more files for MultiFASTADirectoryFormat: '.+\\.(fa|fasta)$'
See above for debug info.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: samtools sort /tmp/qiime2/gcaporaso/data/8a63aaeb-3bfc-4a2c-b3f3-6cfedfcf6a7a/data/KS_KS_All13-C0500000_alignment.bam -o /tmp/tmpgg2uftbn/KS_alignment_sorted.bam
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: jgi_summarize_bam_contig_depths --outputDepth /tmp/tmpgg2uftbn/KS_depth.txt /tmp/tmpgg2uftbn/KS_alignment_sorted.bam
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: metabat2 -i /tmp/qiime2/gcaporaso/data/c227e281-0838-4bfd-ac40-2d78f2e471e9/data/KS_KS_All13-C0500000_contigs.fa -a /tmp/tmpgg2uftbn/KS_depth.txt -o /tmp/tmpgg2uftbn/KS/bin --numThreads 40
I'm not sure what the right way to handle this internally is - e.g., the whole command fails, or no MAGs are provided for that sample - but we should catch and handle this case with a more informative error message.
UPDATE: This looks to actually be an issue with how the sample ids are obtained from the filenames, similar to what we discovered on bokulich-lab/q2-assembly#37. I am testing this now and will have a PR if it works.
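The log above shows inputs named `KS_KS_All13-C0500000_...` while the intermediate files are named `KS_...`, which is what one would see if the sample id were taken as the text before the first underscore. Below is a sketch of that failure mode, next to an alternative that strips a known suffix instead. Both helpers are hypothetical and are not the plugin's actual code.

```python
def sample_id_by_split(filename):
    """Take the text before the first underscore (breaks on ids containing '_')."""
    return filename.split("_")[0]


def sample_id_by_suffix(filename, suffix):
    """Strip a known trailing suffix, keeping any underscores inside the id."""
    if not filename.endswith(suffix):
        raise ValueError(f"{filename!r} does not end with {suffix!r}")
    return filename[: -len(suffix)]


name = "KS_KS_All13-C0500000_contigs.fa"
print(sample_id_by_split(name))
print(sample_id_by_suffix(name, "_contigs.fa"))
```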
As a q2-moshpit user,
I want a new classify-kaiju
action,
so that I can use Kaiju as an alternative to Kraken2 when doing read-based taxonomic classification.
Needs #41.
The build-kraken-db
action does not yet allow fetching 16S databases, even though they are available on
https://benlangmead.github.io/aws-indexes/k2.
We should add those to the list of options under the --p-collection
flag, so that users interested in classification of 16S can use that action too.
Update README to contain the following information:
As a q2-moshpit user,
I want a new action to fetch Kaiju indices,
so that I can use it as reference in classification with Kaiju.
Notes:
We need an action supporting functional annotation of proteins.
Acceptance criteria:
FeatureData[ProteinSequence] or GenomeData[Proteins] as input
FeatureData[NOG] + FeatureData[OG] + FeatureData[KEGG]
Is your feature request related to a problem? Please describe.
Currently, gene prediction is only possible on dereplicated MAGs. It would be great if we could also do it on un-dereplicated MAGs.
Describe the solution you'd like
Add SampleData[MAGs]
as accepted input to the predict-genes-prodigal
action.
Describe the solution you'd like
I'd like another action (bin-contigs-maxbin2
) which can take contigs as input (plus whatever else that is required) and use MaxBin2 to produce bins, similarly to how bin-contigs-metabat
does.
Describe alternatives you've considered
Additional context
There is very little documentation available... Basically, just those links may be of help:
MaxBin 2.2.7
No Contig file. Please specify contig file by -contig
MaxBin - a metagenomics binning software.
Usage:
run_MaxBin.pl
-contig (contig file)
-out (output file)
(Input reads and abundance information)
[-reads (reads file) -reads2 (readsfile) -reads3 (readsfile) -reads4 ... ]
[-abund (abundance file) -abund2 (abundfile) -abund3 (abundfile) -abund4 ... ]
(You can also input lists consisting of reads and abundance files)
[-reads_list (list of reads files)]
[-abund_list (list of abundance files)]
(Other parameters)
[-min_contig_length (minimum contig length. Default 1000)]
[-max_iteration (maximum Expectation-Maximization algorithm iteration number. Default 50)]
[-thread (thread num; default 1)]
[-prob_threshold (probability threshold for EM final classification. Default 0.9)]
[-plotmarker]
[-markerset (marker gene sets, 107 (default) or 40. See README for more information.)]
(for debug purpose)
[-version] [-v] (print version number)
[-verbose]
[-preserve_intermediate]
Please specify either -reads or -abund information.
You can input multiple reads and/or abundance files at the same time.
Please read README file for more details.
There are examples of this both in the core software (e.g. in kraken2/classification.py::_classify_kraken2
) and in the tests (e.g. in kraken2/tests/test_classification.py::test_classify_kraken2_contigs
).
The build_custom_diamond_db
action can optionally take a ReferenceDB[NCBITaxonomy]
artifact, with which the resulting Diamond database contains taxonomy features. There is a need to create an action that produces this ReferenceDB[NCBITaxonomy]
artifact by fetching the appropriate data from the internet.
Something like...
qiime moshpit fetch-ncbi-taxonomy-db ...
The data comes from two links. One can find out the time and date of the last modification to the file on the server by calling wget -S --spider <somewhere_in_the_internet>
. It might be important to include this data in the artifact since the version might influence the results.
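The same information could be read from Python with a HEAD request, which asks only for the response headers. This is a sketch assuming the server returns a standard `Last-Modified` header over HTTP(S). The function names are hypothetical and the example date is illustrative, not real data.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Convert an HTTP Last-Modified header value into a datetime."""
    return parsedate_to_datetime(header_value)


def fetch_last_modified(url):
    """Send a HEAD request and return the Last-Modified time, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        value = response.headers.get("Last-Modified")
    return parse_last_modified(value) if value else None


# Parsing an example header value in the standard HTTP date format.
print(parse_last_modified("Wed, 21 Oct 2015 07:28:00 GMT").isoformat())
```

The returned timestamp could then be stored alongside the downloaded data so the database version is traceable.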
We need an action generating a visualisation of the binning quality control using CheckM.
Acceptance criteria:
Notes:
Currently, the dereplicate-mags
action needs a distance matrix to use when dereplicating, but there is no action available in the environment to generate one from SampleData[MAGs]
. Thus everything downstream of this action (classification and feature table creation) is unreachable.
@gregcaporaso suggested using actions in the sourmash plugin to generate the distance matrix.
Thus either sourmash needs to be included in the shotgun environment, or we need to borrow the necessary actions and put them here.
During binning with metaBat 2
it is possible to output the contigs that were not binned into a separate file. We should expose this file as SampleData[Contigs % unbinned]
(the unbinned
property would be used to distinguish those contigs from the ones originating directly from genome assembly), as they may contain potentially useful data.
When performing taxonomic profiling of MAGs, abundance information currently is not preserved. This is needed to then estimate the (relative) abundances of MAGs/taxa.
This can be done in a few steps:
This action should probably be independent of taxonomic classification, i.e., it should output a feature table of MAG abundances per sample.
Some useful reading
We need an action (predict-genes-prodigal
) supporting gene prediction on dereplicated MAGs.
Acceptance criteria:
FeatureData[MAG] as input
GenomeData[Loci] + GenomeData[Genes] + GenomeData[Proteins] + FeatureTable[Frequency]
The bin_contigs_metabat
action should:
Currently, it is possible to annotate either contigs
or dereplicated MAGs
- let's also allow MAGs which have not undergone dereplication (i.e., straight after binning), so that annotation can be performed at any step of the pipeline.
Output should be KrakenReport % Property('contigs')
and KrakenOutput % Property('contigs')
Problem:
In the main visualization, all samples are allocated the same amount of plotting space regardless of the number of MAGs they contain. The height of the bars representing each MAG is then adjusted to fit this predefined space. Because different samples have different numbers of MAGs, the bars end up with different heights. This looks unpleasant and may even hinder interpretation of the plot.
Solution:
Make it so all bars have the same height thereby adjusting the height of the per-sample bar plots to account for different numbers of MAGs/bars.
Possible Implementations
Plan A: If possible, leave the implementation as is (using the facet feature of the plotting library).
Plan B: Change the way the plot is constructed to allow bars to have the same size.
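Whichever plan is chosen, the underlying arithmetic is to fix the bar height and let each sample's plot height follow from its MAG count. A minimal sketch of that calculation; the function name, pixel values, and sample names are illustrative assumptions.

```python
def plot_heights(mags_per_sample, bar_height=20, padding=10):
    """Map each sample to a plot height that keeps every bar the same height.

    bar_height and padding are in pixels and are illustrative defaults.
    """
    return {
        sample: n_mags * bar_height + padding
        for sample, n_mags in mags_per_sample.items()
    }


# Two hypothetical samples with different numbers of MAGs.
print(plot_heights({"sample_a": 3, "sample_b": 10}))
```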
Similar to #76, at several places in q2_moshpit/metabat2/metabat2.py
only .fa
files are collected when .fasta
files should also be collected.
For consistent code style, Black formatting should be applied (similarly to how it's done in q2-assembly) and an appropriate check will be added in the CI.
We are only finding .fasta
files here when we should also be getting .fa
files.
This probably hasn't been noticed because metabat2 (probably) outputs its MAGs in .fasta format and contigs were only recently added as inputs.
Update: megahit outputs .fa files so none of the contigs generated with that tool are being discovered.
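A small helper that accepts both extensions in one place would cover both this case and the reverse one. The sketch below uses only the standard library; `collect_fasta_files` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

FASTA_SUFFIXES = (".fa", ".fasta")


def collect_fasta_files(directory):
    """Return all files in directory ending in .fa or .fasta, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in FASTA_SUFFIXES
    )


# Demonstrate on a temporary directory holding both extensions plus a decoy.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s1_contigs.fa", "s2_contigs.fasta", "notes.txt"):
        (Path(tmp) / name).touch()
    print([p.name for p in collect_fasta_files(tmp)])
```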
No. It would be nice to have an action that downloads the Diamond database for downstream Eggnog analyses.
This could be done by simply calling the script and piping it with the appropriate yes/no answers to indicate which database should be downloaded. For example:
printf "n\nn\ny" | download_eggnog_data.py -s --data_dir .
Describe the bug
The two sides of the plot (left: BUSCOs and right: assembly stats) are misaligned - I see an offset between those:
To Reproduce
qiime moshpit evaluate-busco --i-bins <attached file> --p-mode genome --p-lineage-dataset bacteria_odb10 --p-cpu 6 --output-dir busco-test
using the attached MAGs as input (see below, the file is zipped)
Expected behavior
The plots are aligned.
Please complete the following information:
Additional context
w9-mags.qza.zip
Ultimately we'll want FeatureTable
and FeatureData[Taxonomy]
results to use these data in downstream applications. One option would be to use Bracken to go from SampleData[Kraken2Output]
and/or SampleData[Kraken2Report]
to a FeatureTable
and FeatureData[Taxonomy]
. Should we start thinking about that, or is there another approach that is planned?
As a MOSHPIT user,
I want an action which can run BUSCO
so that I can evaluate completeness of the generated MAGs.
Following up from the discussion #45, we need to change the input type for MAGs from SampleData[MAG]
to FeatureData[MAG]
to allow for classification of dereplicated MAGs. The output type will then need to be adjusted to FeatureData[Kraken2Reports]
+ FeatureData[Kraken2Outputs]
.
Feature description
Currently, there is only one way to visualize the results obtained from Kraken 2: the taxa barplot. Also, this only works for reads (for now). It would be nice to bring in some more Kraken 2-specific visualizers which could leverage Kraken 2 reports directly and enable visualization of results from both reads and MAGs. One such tool is Pavian.
Notes:
It would be great if we could run the evaluation on dereplicated MAGs too, on top of the SampleData[MAGs]
.
Let's assume you want to implement some new feature in moshpit. You would fork this repo and then clone the fork to your local machine where you would work on the new features. But before you start coding you need to set up the virtual environment. So you follow the instructions in the wiki and run the following (notice how moshpit is left out of the mamba create
command since you will install your local copy and not the version that is available through mamba).
mamba create -yn test_env \
-c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2023.5/tested -c defaults \
q2cli q2-assembly q2-checkm
conda run -n test_env \
pip install --no-deps --force-reinstall git+https://github.com/misialq/quast.git@issue-230
conda activate test_env
cd <path_to_local_moshpit_repo>/q2-moshpit
pip install -e .
So far so good, but then running anything (e.g. qiime dev refresh-cache
) will return this error:
ImportError: cannot import name 'BrackenDBDirectoryFormat' from 'q2_types_genomics.kraken2'
This error comes from q2_types_genomics/kraken2/__init__.py
(and perhaps the other files in q2_types_genomics/kraken2
). The version that gets installed is missing some classes that are used by moshpit.
Installed version:
from ._format import (
Kraken2ReportFormat, Kraken2ReportDirectoryFormat,
Kraken2OutputFormat, Kraken2OutputDirectoryFormat,
Kraken2DBFormat, Kraken2DBDirectoryFormat
)
from ._type import Kraken2Reports, Kraken2Outputs, Kraken2DB
__all__ = [
'Kraken2ReportFormat', 'Kraken2ReportDirectoryFormat', 'Kraken2Reports',
'Kraken2OutputFormat', 'Kraken2OutputDirectoryFormat', 'Kraken2Outputs',
'Kraken2DBFormat', 'Kraken2DBDirectoryFormat', 'Kraken2DB'
]
However, these classes are available in the current version of q2-types-genomics
.
Current GitHub version:
from ._format import (
Kraken2ReportFormat, Kraken2ReportDirectoryFormat,
Kraken2OutputFormat, Kraken2OutputDirectoryFormat,
Kraken2DBFormat, Kraken2DBDirectoryFormat,
BrackenDBFormat, BrackenDBDirectoryFormat
)
from ._type import Kraken2Reports, Kraken2Outputs, Kraken2DB
__all__ = [
'Kraken2ReportFormat', 'Kraken2ReportDirectoryFormat', 'Kraken2Reports',
'Kraken2OutputFormat', 'Kraken2OutputDirectoryFormat', 'Kraken2Outputs',
'Kraken2DBFormat', 'Kraken2DBDirectoryFormat', 'Kraken2DB',
'BrackenDBFormat', 'BrackenDBDirectoryFormat'
]
link to lines of code in q2-types-genomics
So we can fix it by running the following.
pip uninstall q2_types_genomics
git clone git@github.com:bokulich-lab/q2-types-genomics.git
cd <path_to_local_q2-types-genomics_repo>/q2-types-genomics
pip install .
OK, problem solved? Sadly not: apparently q2-types-genomics
is not the only incompatible package. If we run qiime dev refresh-cache
we will get the following error.
TypeError: BLAST6 is not a variant of SampleData.field['type']
The error comes from the q2-types
package. It boils down to the q2_types/feature_data/_type.py
lines 51-53.
Old version (installed with conda/mamba):
BLAST6 = SemanticType('BLAST6',
variant_of=[FeatureData.field['type']])
New version (available in GitHub):
BLAST6 = SemanticType('BLAST6',
variant_of=[FeatureData.field['type'],
SampleData.field['type']])
link to lines of code in q2-types
Again we uninstall the guilty package and clone the new version and install it locally.
pip uninstall q2-types
git clone https://github.com/qiime2/q2-types.git
cd <path_to_local_q2-types>/q2-types
pip install .
Woohoo! We solved it! Almost. Running qiime dev refresh-cache
will return import errors, essentially complaining that some libraries are missing from the environment. Just
pip install xmltodict tqdm
(the libraries that are missing) and you are good to go.
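A quick way to spot this kind of mismatch before running anything is to check whether the installed module exposes the names you expect. The sketch below is demonstrated on a standard-library module so it runs anywhere; `missing_names` is a hypothetical helper.

```python
import importlib


def missing_names(module_name, names):
    """Return the names from `names` that the imported module does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Shown on the standard-library json module. In the scenario above one would
# pass 'q2_types_genomics.kraken2' and
# ['BrackenDBFormat', 'BrackenDBDirectoryFormat'] instead.
print(missing_names("json", ["dumps", "not_a_real_name"]))
```

An empty list means every expected name is present in the installed version.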
This can be achieved by adding the --report-minimizer-data
to our kraken2 call, and is discussed here. This will add two additional columns to the report "representing the number of minimizers found to be associated with a taxon in the read sequences, and the estimate of the number of distinct minimizers associated with a taxon in the read sequence data".
Any reason to not make this the default behavior?
Update: because their report format doesn't include headers, and this flag changes the number of columns, it may be worth creating our own report format that adds headers.
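A custom format with headers would need to tell the two layouts apart. Below is a sketch that labels fields by column count, assuming tab-separated lines and assuming the two minimizer columns come right after the third column. The column names and the example values are made up for illustration and are not an existing format.

```python
BASE_COLUMNS = [
    "percent", "clade_reads", "taxon_reads",
    "rank", "taxon_id", "name",
]
MINIMIZER_COLUMNS = [
    "percent", "clade_reads", "taxon_reads",
    "minimizers", "distinct_minimizers",
    "rank", "taxon_id", "name",
]


def parse_report_line(line):
    """Split one tab-separated report line and label fields by column count."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == len(BASE_COLUMNS):
        columns = BASE_COLUMNS
    elif len(fields) == len(MINIMIZER_COLUMNS):
        columns = MINIMIZER_COLUMNS
    else:
        raise ValueError(f"Unexpected number of columns: {len(fields)}")
    record = dict(zip(columns, fields))
    record["name"] = record["name"].strip()  # drop the indentation before the name
    return record


# Illustrative line in the eight-column layout (values are not real results).
example = "50.00\t10\t4\t120\t30\tS\t562\t    Escherichia coli\n"
print(parse_report_line(example))
```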
This isn't FeatureData
in the way that we typically define it, in that the feature ids from the corresponding table are not the ids in this artifact. This seems more like SampleData
to me. I'm currently investigating this, but I wanted to get an issue up now to make folks aware that this semantic type is likely going to change. We'll ultimately want to process this artifact to generate FeatureData
.