bokulich-lab / q2-types-genomics
QIIME 2 types for genomics plugins.
License: BSD 3-Clause "New" or "Revised" License
diamond makedb documentation where the data is described
build_custom_diamond_db action (under development), thereby allowing the user to obtain a database with taxonomy features.
There is a need to update the semantic type of the diamond DB, according to this issue: bokulich-lab/q2-moshpit#97
As of now the semantic type looks like this (q2-types-genomics/blob/main/q2_types_genomics/reference_db/_format.py):
DiamondDatabaseDirFmt = model.SingleFileDirectoryFormat(
    'DiamondDatabaseDirFmt', 'ref_db.dmnd', DiamondDatabaseFileFmt)
It is essentially a directory with one file named ref_db.dmnd. The update needs to accommodate the fact that there can now be multiple files, e.g.:
e5.proteomes.faa
e5.taxid_info.tsv
Archaea.dmnd
Bacteria.dmnd
See the issue linked above for explanations of what these files are.
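The multi-file layout listed above could be checked along the following lines. This is only a standard-library sketch, not the plugin's directory format class: the function name check_diamond_db_dir and the allowed file patterns are assumptions derived from the example file names in this issue, and the actual rules are those discussed in the linked issue.

```python
import re
from pathlib import Path

# Hypothetical patterns, inferred only from the example file names above.
ALLOWED_PATTERNS = [r".+\.dmnd", r".+\.faa", r".+\.tsv"]


def check_diamond_db_dir(path):
    """Return the sorted names of the .dmnd files found in `path`.

    Raises ValueError if the directory holds a file matching none of the
    allowed patterns, or if it holds no .dmnd file at all.
    """
    names = sorted(p.name for p in Path(path).iterdir() if p.is_file())
    unexpected = [
        name for name in names
        if not any(re.fullmatch(pat, name) for pat in ALLOWED_PATTERNS)
    ]
    if unexpected:
        raise ValueError(f"Unexpected files: {unexpected}")
    databases = [name for name in names if name.endswith(".dmnd")]
    if not databases:
        raise ValueError("No .dmnd database file found")
    return databases
```

With the four example files in place, the function would return the two database names and ignore the accompanying .faa and .tsv files.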
fetch-ncbi-taxonomy-data action is planned (bokulich-lab/q2-moshpit#107)
version.tsv
NCBITaxonomyDirFmt to accommodate this extra file.
Following assembly, the contigs will be binned into MAGs. Multiple MAGs will be generated per sample from a single contig file. We need a new type to handle those multiple MAGs.
Acceptance criteria:
SampleDataMAGDirFmt
Notes:
Update to include coverage testing, just as was done for https://github.com/bokulich-lab/q2-moshpit.
Similarly to #6, the gene prediction step will provide protein translations for all the genes identified in a given MAG. We need a new type to store protein FASTA files corresponding to those protein lists.
Acceptance criteria:
genome1: proteins.fa
genome2: proteins.fa
...
When I install this package, I get an error. The details are as follows.
This is my command:
qiime dev refresh-cache
This is the error message:
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2523, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/__init__.py", line 16, in <module>
importlib.import_module('q2_types_genomics.kraken2')
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/kraken2/__init__.py", line 11, in <module>
from ._format import (
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/kraken2/_format.py", line 14, in <module>
from ..per_sample_data._format import MultiDirValidationMixin
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/per_sample_data/__init__.py", line 16, in <module>
from ._type import (
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/per_sample_data/_type.py", line 64, in <module>
SampleData[BLAST6],
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/qiime2/core/type/grammar.py", line 172, in __getitem__
self.template.validate_fields_expr(self, fields)
File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/qiime2/core/type/semantic.py", line 223, in validate_fields_expr
raise TypeError("%r is not a variant of %r" % (expr, varf))
TypeError: BLAST6 is not a variant of SampleData.field['type']
As Clusters of Orthologous Groups (COGs) are widely used for genome functional annotation, we need to introduce a new type that will specifically handle those. It would be based on a TSV file containing a few OG-specific fields (inspired by the eggNOG annotator) that would most likely be derivable by other tools in the future.
Acceptance criteria:
OrthologousGroupsFormat
Notes:
Bracken requires generation of additional database files, based on the original Kraken2 database.
Acceptance criteria:
BrackenDBFormat and BrackenDB semantic type implemented
*.kmer_distrib files
Following genome binning, resulting MAGs from all samples can be de-replicated to produce a list of unique genomes. The result of this process would be FeatureData[MAG], where one MAG comprises multiple contigs.
Acceptance criteria:
MAGSequencesDirFmt
Most likely, the only required type will be the one used to store the pre-fetched/built indices.
Similarly to #9, we need a type able to handle KEGG annotations. Inspired by eggNOG annotations output (#8), we need to handle a TSV file containing a few KEGG-specific fields.
Acceptance criteria:
KEGGMapFormat
Notes:
Annotations file/Transferred annotations fields section
Is your feature request related to a problem? Please describe.
Currently, no database validation is performed when importing those two into a QIIME 2 artifact.
Describe the solution you'd like
For diamond databases, perhaps we could use the diamond dbinfo command (presumably it would fail if the DB is corrupted?). For eggnog we need to investigate a bit.
Based on the gene positional information from GFF files (see #5), the gene prediction step will extract gene sequences from the corresponding MAGs. We need a new type to store FASTA files corresponding to those genes.
Acceptance criteria:
genome1: genes.fa
genome2: genes.fa
...
As a Kraken2DB user,
I want the database files to be validated on DB creation/import,
so that I can be sure that the database I'm using follows the Kraken2 requirements.
This can be achieved by adding a step using the kraken2-inspect tool, which fails if the DB files are corrupted.
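A validation step of this kind might look roughly like the sketch below. It assumes that a non-zero exit code from kraken2-inspect signals a corrupted database, as stated above. The function name validate_db and the injectable run parameter are hypothetical and exist only so the wrapper can be exercised without the tool installed. The command is a parameter, so the same wrapper could in principle be pointed at another inspection tool.

```python
import subprocess


def validate_db(db_dir, command=("kraken2-inspect", "--db"),
                run=subprocess.run):
    """Run an inspection command on `db_dir`; raise ValueError on failure.

    `command` is the tool invocation to which the database path is appended.
    `run` defaults to subprocess.run and can be replaced in tests.
    If the tool is not installed, subprocess.run raises FileNotFoundError,
    which is deliberately left to propagate.
    """
    result = run([*command, str(db_dir)], capture_output=True, text=True)
    if result.returncode != 0:
        raise ValueError(
            f"Database validation failed: {result.stderr.strip()}"
        )
```

Raising ValueError here is a placeholder choice; a format's validation hook would more likely raise whatever exception type the framework expects.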
Similarly to how it was done in #57, we want to have a mapping between samples, MAGs and the corresponding MAG paths in the MultiFASTADirectoryFormat so that it's easier to iterate through files, e.g. when performing operations MAG-wise. The method should return something like:
{
    "sample1": {
        "mag1": "/path/to/mag1.fa",
        "mag2": "/path/to/mag2.fa"
    },
    "sample2": {
        "mag1": "/path/to/mag1.fa",
        "mag2": "/path/to/mag2.fa"
    },
    ...
}
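One way such a mapping could be built is sketched below. It is not the actual method on the format class: it assumes a directory layout of <root>/<sample>/<mag>.fa, considers only the .fa extension, and the function name sample_dict is chosen here for illustration.

```python
from pathlib import Path


def sample_dict(root):
    """Map sample id -> {MAG id -> absolute path to the MAG FASTA file}.

    Assumes each sample has its own sub-directory under `root` and each MAG
    is a single `<mag_id>.fa` file inside it.
    """
    mapping = {}
    for sample_dir in sorted(Path(root).iterdir()):
        if not sample_dir.is_dir():
            continue  # skip stray files at the top level
        mapping[sample_dir.name] = {
            fasta.stem: str(fasta.resolve())
            for fasta in sorted(sample_dir.glob("*.fa"))
        }
    return mapping
```

Iterating MAG-wise then becomes a nested loop over mapping.items() without any file name parsing in the calling action.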
We will need the map to link contigs to unique MAGs right after binning.
Is your feature request related to a problem?
No, but it's related to another issue (see link below). In a nutshell, there is a need for a semantic type to represent a HMMER database in QIIME 2.
Describe the solution you'd like
The Semantic type should be able to manage an arbitrary number of organisms from the HMMER database. E.g:
download_eggnog_data.py -s -H -d 2
This would download HMMER database for these two groups of organisms of interest.
Context:
The introduction of MAGs will require handling multiple files per sample. We will need to be able to support multiple MAGs, Bowtie2Indices, AlignmentMaps and other formats per sample (see other related issues on the introduction of GenomeData).
Acceptance criteria:
Notes:
@colinvwood and @Oddant1 noticed while working on bokulich-lab/q2-moshpit#63 and bokulich-lab/q2-assembly#46 that many of the formats in q2-types-genomics require some knowledge of the underlying format to identify the sample ids. This is requiring code like:
result = ContigSequencesDirFmt()
for sample_fp in samples:
    # These paths are defined in the ContigSequencesDirFmt class as
    # {sample_id}_contigs.(fa | fasta). This should get the id from a
    # name like that
    sample_id = sample_fp.name.rsplit("_contigs", 1)[0]
to crop up fairly frequently (that example is from bokulich-lab/q2-assembly#46). If a developer misses this, or doesn't handle the stripping of _contigs correctly (e.g., replacing all occurrences, rather than just stripping the last occurrence), downstream results could have misnamed sample ids (e.g., sample-1 could easily become sample-1_contigs, or sample2_contigs [a terrible sample name] could become sample2).
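The pitfall can be reproduced in a few lines. The helper below is a hypothetical illustration (the name sample_id_from_filename is not part of q2-types-genomics); it assumes file names of the form {sample_id}_contigs.fa or {sample_id}_contigs.fasta and strips only the final suffix.

```python
def sample_id_from_filename(filename, suffix="_contigs"):
    """Recover the sample id from a `{sample_id}_contigs.(fa|fasta)` name.

    Only the trailing `suffix` is removed, so a sample id that itself
    contains the suffix text is preserved.
    """
    stem = filename.rsplit(".", 1)[0]  # drop the .fa / .fasta extension
    if not stem.endswith(suffix):
        raise ValueError(f"{filename!r} does not end with {suffix!r}")
    return stem[: -len(suffix)]


# A sample literally named "sample2_contigs" is stored as this file:
filename = "sample2_contigs_contigs.fa"

# Replacing every occurrence silently renames the sample.
naive = filename.rsplit(".", 1)[0].replace("_contigs", "")
print(naive)                              # sample2

# Stripping only the trailing suffix keeps the original id.
print(sample_id_from_filename(filename))  # sample2_contigs
```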
We're thinking that, after the alpha release, it will make sense to define new formats that contain manifests which explicitly map sample ids to filenames, and then add transformers from these existing formats to the new manifest formats. That way we encapsulate all knowledge about how the filenames map to sample ids in those transformers, and actions that need these data can use the manifest formats and have explicit access to sample ids. This would be similar to how we handle the "demux formats" in q2-types (see an example of how this is used here).
We would want to keep the existing formats as-is so that existing artifacts (generated pre- or post-alpha) would continue to work.
Hello,
I have recently started analysing metabarcoding data with QIIME2 and the pipelines I follow use this extension.
Unfortunately after installing the latest versions of both 'types-genomics' and 'types', I get the following error message:
File "/.../.../miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 162, in _integrate_plugin
raise ValueError("Duplicate semantic type (%r) defined in"
ValueError: Duplicate semantic type ('FeatureMap') defined in plugins: 'types-genomics' and 'types'
--o-silva-sequences: command not found
What can be done (keep in mind I am very new to it, python and linux in general)?
The metagenome assembler will output one file containing all the assembled contigs per sample. We need a new type to handle that contig data.
Acceptance criteria:
SampleDataContigFmt
We need a new FeatureData[Contig] type to represent contigs which were generated through co-assembly of reads from multiple samples (see bokulich-lab/q2-assembly#22). Since this process leads to the loss of the "sample" concept, we wouldn't want to store those contigs as SampleData[Contigs] (the contigs simply become features). Introducing a new type will also allow us to better control the flow of data through the MOSHPIT pipeline in a case where the user starts with co-assembly, as certain actions will need to work slightly differently.
The functional annotation step will use FeatureData[Sequence | ProteinSequence] as input to generate a list of functional categories. At least initially we will be using eggNOG, which produces a large TSV-like table of all kinds of annotations. We will need a new type to contain these annotations. Additionally, two other, more universal types will be introduced for (potential) compatibility with other annotators in the future (see #9 and #10).
Acceptance criteria:
EggnogAnnotationsFormat
Notes:
Annotations file section
Is your feature request related to a problem? Please describe.
Different variations of the GenomeData type (Proteins, Genes, Loci) store the data in fasta/gff files where names end with different suffixes (e.g.: _proteins.fasta for proteins or _loci.gff for loci). It would be handy to have a way to easily retrieve feature/genome IDs without the need to parse the names (in a similar way as is described in #56).
Describe the solution you'd like
Let's add a genome_dict method similar to how it was done in #57 so that one can easily retrieve feature IDs from any GenomeData artifact.
Describe alternatives you've considered
An alternative solution could be to remove the suffix completely, but this would require adjusting the actions which already use that type (one of them being get-ncbi-genomes in RESCRIPt) and could potentially cause issues with artifacts which were created before.
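A possible shape for the proposed genome_dict is sketched below as a free function rather than a method on the format class. It assumes a flat directory of files and knows only the two suffixes mentioned in this issue; the suffix tuple and the return structure are assumptions, not the implemented behaviour.

```python
from pathlib import Path

# Only the suffixes named in this issue; others would have to be added.
KNOWN_SUFFIXES = ("_proteins.fasta", "_loci.gff")


def genome_dict(root, suffixes=KNOWN_SUFFIXES):
    """Map genome/feature id -> file path for files ending in a known suffix.

    Files whose names match none of the suffixes are ignored.
    """
    ids = {}
    for fp in sorted(Path(root).iterdir()):
        for suffix in suffixes:
            if fp.name.endswith(suffix):
                # Strip only the trailing suffix to recover the id.
                ids[fp.name[: -len(suffix)]] = str(fp)
                break
    return ids
```

Keeping the suffix knowledge in one place like this would let actions read IDs from the dictionary keys while existing artifacts keep their current file names.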
Genome binning will require original reads to be first mapped to the assembled contigs to evaluate coverage. Read mapping will also be needed for MAG de-replication. We will need a new type to handle multiple index files per genome per sample. q2-types already implements a Bowtie2Index semantic type and Bowtie2IndexDirFmt format; we need something similar but with multiple indices per sample.
Acceptance criteria:
SampleDataBowtie2IndexDirFmt
sample1: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
sample2: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
...
The gene prediction step will produce one GFF file per genome in a sample (resulting in multiple GFF files per sample). We need a new type to handle those GFF files.
Acceptance criteria:
GFFDirFmt
Notes:
genome1: loci.gff
genome2: loci.gff
...