The q2-types-genomics from bokulich-lab

ENH: Semantic Type for NCBI protein-taxonomy data

Context

An action to build Diamond reference databases is under development.
- PR: bokulich-lab/q2-moshpit#103
- Issue: bokulich-lab/q2-moshpit#102
This action offers the user the possibility to include taxonomy features in the output database.
To do so the user must input into the action some additional files containing the taxonomy data.
- diamond makedb documentation where the data is described

Crux of the issue

Thus, there is a need for a Semantic Type to represent taxonomy data.
Then an artifact with this Semantic Type can be used as optional input to the build_custom_diamond_db action (under development), thereby allowing the user to obtain a database with taxonomy features.

ENH: Update Semantic Type for Diamond DB

Problem and solution

There is a need to update the Semantic type of diamond DB, according to this issue: bokulich-lab/q2-moshpit#97

As of now, the format behind the semantic type is defined like this (q2-types-genomics/blob/main/q2_types_genomics/reference_db/_format.py):

DiamondDatabaseDirFmt = model.SingleFileDirectoryFormat(
    'DiamondDatabaseDirFmt', 'ref_db.dmnd', DiamondDatabaseFileFmt)

It is essentially a directory with one file named ref_db.dmnd. The update needs to accommodate the fact that there can now be multiple files, e.g.:

e5.proteomes.faa
e5.taxid_info.tsv
Archaea.dmnd
Bacteria.dmnd

See the issue linked above for explanations of what these files are.

Additional Considerations

Care should be taken that the output artifact can be accessed in the downstream actions.

ENH: Update `NCBITaxonomyDirFmt` to accommodate data-version file

A fetch-ncbi-taxonomy-data action is planned (bokulich-lab/q2-moshpit#107).
This will download data from two different sources. Consequently, there is a need to add version information to the resulting artifact.
This information will be written to an additional file within the artifact, provisionally named version.tsv.
The code in NCBITaxonomyDirFmt therefore needs to be modified to accommodate this extra file.
Validation for the extra file should also be implemented.
The test suite should be adjusted/expanded accordingly.

ENH: Semantic Type for `fetch-busco-db`'s output

A new action is planned: bokulich-lab/q2-moshpit#122
We therefore need a Semantic Type for its output.

Implement SampleData[MAGs] type

Following assembly, the contigs will be binned into MAGs. From one contig file there will be multiple MAGs generated per sample. We need a new type to handle those multiple MAGs.

Acceptance criteria:

uses a new artifact format: SampleDataMAGDirFmt
validates on reading in multiple MAG fasta files per sample

Notes:

some MAG examples:
sample-mag.1.fa.txt
sample-mag.2.fa.txt

Update CI to follow the other repos

Update to include coverage testing, just as was done for https://github.com/bokulich-lab/q2-moshpit.

Implement GenomeData[Proteins] type

Similarly to #6, the gene prediction step will provide protein translations for all the genes identified in a given MAG. We need a new type to store protein FASTA files corresponding to those protein lists.

Acceptance criteria:

validates on reading in FASTA files containing multiple sequence entries (protein)
corresponds to something like:

genome1: proteins.fa
genome2: proteins.fa
...

TypeError: BLAST6 is not a variant of SampleData.field['type']

When I installed this package, the following error was reported.
This is my command:

qiime dev refresh-cache

This is the error message:

  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/pkg_resources/__init__.py", line 2523, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/__init__.py", line 16, in <module>
    importlib.import_module('q2_types_genomics.kraken2')
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/kraken2/__init__.py", line 11, in <module>
    from ._format import (
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/kraken2/_format.py", line 14, in <module>
    from ..per_sample_data._format import MultiDirValidationMixin
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/per_sample_data/__init__.py", line 16, in <module>
    from ._type import (
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/q2_types_genomics-0+unknown-py3.8.egg/q2_types_genomics/per_sample_data/_type.py", line 64, in <module>
    SampleData[BLAST6],
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/qiime2/core/type/grammar.py", line 172, in __getitem__
    self.template.validate_fields_expr(self, fields)
  File "/opt/conda/envs/qiime2-2023.5/lib/python3.8/site-packages/qiime2/core/type/semantic.py", line 223, in validate_fields_expr
    raise TypeError("%r is not a variant of %r" % (expr, varf))
TypeError: BLAST6 is not a variant of SampleData.field['type']

Implement FeatureData[OG] type

As Clusters of Orthologous Groups (COGs) are widely used for genome functional annotation, we need to introduce a new type that will specifically handle those. It would be based on a TSV file containing a few OG-specific fields (inspired by the eggNOG annotator) that would most likely be derivable by other tools in the future.

Acceptance criteria:

uses a new artifact format: OrthologousGroupsFormat
validates on reading in a TSV file containing OG fields extracted from EggNOG annotations: eggNOG OGs, narr_og_name, narr_og_cat, narr_og_desc
implements transformers: into/from pd.DataFrame, qiime.Metadata

Notes:

more details on annotation fields here -> see Annotations file section
compare with #8 for a file example

Implement `BrackenDB` format/type

Bracken requires generation of additional database files, based on the original Kraken2 database.

Acceptance criteria:

BrackenDBFormat and BrackenDB semantic type implemented
the directory format needs to recognize a bunch of *.kmer_distrib files
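The file-recognition criterion above could be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the check is not wired into any QIIME 2 format class, and the file names used in the comments (such as database100mers.kmer_distrib) are assumed examples rather than names taken from this issue.

```python
import re
from pathlib import Path

# Assumption: any file whose name ends in ".kmer_distrib" belongs to the
# Bracken database; other files (e.g. Kraken2 *.k2d files) are ignored.
KMER_DISTRIB_PATTERN = re.compile(r".+\.kmer_distrib")


def find_kmer_distrib_files(db_dir):
    """Return the sorted names of all *.kmer_distrib files in db_dir."""
    return sorted(
        fp.name
        for fp in Path(db_dir).iterdir()
        if fp.is_file() and KMER_DISTRIB_PATTERN.fullmatch(fp.name)
    )


def validate_bracken_db_dir(db_dir):
    """Raise ValueError if db_dir has no *.kmer_distrib file (hypothetical check)."""
    files = find_kmer_distrib_files(db_dir)
    if not files:
        raise ValueError(f"No *.kmer_distrib files found in {db_dir}")
    return files
```

In an actual directory format, the same pattern would presumably be handed to a file collection rather than checked by hand; the sketch only shows which files would be matched.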

Implement FeatureData[MAG] type

Following genome binning, resulting MAGs from all samples can be de-replicated to produce a list of unique genomes. The result of this process would be FeatureData[MAG] where one MAG comprises multiple contigs.

Acceptance criteria:

uses a new artifact format: MAGSequencesDirFmt
implements transformers: into/from pd.DataFrame, qiime.Metadata (convert contigs' list into columns?)

ENH: Semantic Type for taxa specific diamond DB

Context

An action to build Diamond reference databases is under development.
- PR: bokulich-lab/q2-moshpit#TODO
- Issue: bokulich-lab/q2-moshpit#TODO
This action offers the user the possibility to build diamond databases for specific taxa.
To do this the user must provide two additional files:
- one containing the taxonomy data
- and a reference fasta file.

Crux of the issue

There is a need to create a Semantic Type that can contain these two files, such that an artifact of this Semantic Type can be used as input to the action mentioned above (TODO: mention the action name).

add formats/types required by Kaiju

Most likely, the only required type will be the one used to store the pre-fetched/built indices.

Implement FeatureData[KEGG] type

Similarly to #9, we need a type able to handle KEGG annotations. Inspired by eggNOG annotations output (#8), we need to handle a TSV file containing a few KEGG-specific fields.

Acceptance criteria:

uses a new artifact format: KEGGMapFormat
validates on reading in a TSV file containing KEGG fields extracted from EggNOG annotations: KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC
implements transformers: into/from pd.DataFrame, qiime.Metadata
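The header validation described in the acceptance criteria could look roughly like the sketch below. It assumes the TSV has a single header row (possibly starting with "#", as in eggNOG annotation tables) and only checks that the listed KEGG columns are present; the function name is hypothetical and no QIIME 2 format class is involved.

```python
import csv

# Column names as listed in the acceptance criteria above.
KEGG_COLUMNS = (
    "KEGG_ko", "KEGG_Pathway", "KEGG_Module", "KEGG_Reaction",
    "KEGG_rclass", "BRITE", "KEGG_TC",
)


def missing_kegg_columns(tsv_handle):
    """Return the required KEGG columns that are absent from the header row."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader, [])  # empty file -> every column is missing
    # Assumption: a leading '#' on a header cell is not part of the name.
    present = {name.strip().lstrip("#") for name in header}
    return [col for col in KEGG_COLUMNS if col not in present]
```

A format's validation step could raise an error whenever this list is non-empty, which keeps the message specific about which fields are missing.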

Notes:

more details on annotation fields here -> see Annotations file/Transferred annotations fields section

ENH: add eggnog/diamond DB validation

Is your feature request related to a problem? Please describe.
Currently, no database validation is performed when importing those two into a QIIME 2 artifact.

Describe the solution you'd like
For diamond databases, perhaps we could use the diamond dbinfo command (presumably it would fail if the DB is corrupted?). For eggnog we need to investigate a bit.
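A minimal sketch of the diamond idea, assuming that a non-zero exit status of diamond dbinfo is a usable signal for an unreadable database (this is exactly the open question raised above and has not been verified here). The function name and the injectable run parameter are hypothetical; the latter only exists so the check can be exercised without diamond installed.

```python
import subprocess


def diamond_db_is_readable(db_path, run=subprocess.run):
    """Return True if `diamond dbinfo --db db_path` exits with status 0.

    `run` defaults to subprocess.run and can be replaced in tests.
    """
    try:
        result = run(
            ["diamond", "dbinfo", "--db", str(db_path)],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError as err:
        # Raised by subprocess.run when the diamond executable is not on PATH.
        raise RuntimeError("diamond executable not found on PATH") from err
    return result.returncode == 0
```

Whether every kind of corruption actually makes dbinfo fail would still need to be checked against real broken database files.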

Implement GenomeData[Genes] type

Based on the gene positional information from GFF files (see #5), the gene prediction step will extract gene sequences from the corresponding MAGs. We need a new type to store FASTA files corresponding to those genes.

Acceptance criteria:

validates on reading in FASTA files containing multiple sequence entries (genes)
corresponds to something like:

genome1: genes.fa
genome2: genes.fa
...

add database validation to the Kraken2DB format

As a Kraken2DB user,
I want the database files to be validated on DB creation/import,
so that I can be sure that the database I'm using follows the Kraken2 requirements.

This can be achieved by adding a step using the kraken2-inspect tool, which fails if the DB files are corrupted.

ENH: add `sample_dict` method to the `MultiFASTADirectoryFormat` format

Similarly to how it was done in #57, we want to have a mapping between samples, MAGs and the corresponding MAG paths in the MultiFASTADirectoryFormat, so that it's easier to iterate through files, e.g. when performing operations MAG-wise. The method should return something like:

{
    "sample1": {
          "mag1": "/path/to/mag1.fa",
          "mag2": "/path/to/mag2.fa"
    },
    "sample2": {
          "mag1": "/path/to/mag1.fa",
          "mag2": "/path/to/mag2.fa"
    },
    ...
}
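One way such a mapping could be built is sketched below as a standalone function. It assumes a directory layout of one sub-directory per sample containing one FASTA file per MAG (sample_id/mag_id.fa or .fasta); that layout, and the function being free-standing rather than a method on the format, are assumptions made for illustration only.

```python
from pathlib import Path

# Assumption: MAG files use one of these extensions.
FASTA_SUFFIXES = (".fa", ".fasta")


def sample_dict(root):
    """Map sample id -> {MAG id -> path to the MAG FASTA file}."""
    mapping = {}
    for sample_dir in sorted(Path(root).iterdir()):
        if not sample_dir.is_dir():
            continue  # stray files at the top level are ignored
        mapping[sample_dir.name] = {
            fp.stem: str(fp)  # the file name without its extension is the MAG id
            for fp in sorted(sample_dir.iterdir())
            if fp.is_file() and fp.suffix in FASTA_SUFFIXES
        }
    return mapping
```

Iterating MAG-wise then becomes a nested loop over mapping.items() without any file-name parsing in the calling action.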

add `FeatureMap` and `FeatureMap[MAGtoContigs]` types

We will need the map to link contigs to unique MAGs right after binning.

Tasks


create FeatureMap and FeatureMap[MAGtoContigs] types
create file format for MAGtoContigs type (this could be a json file mapping a unique MAG ID to a list of contig IDs)
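If the JSON option is chosen, validation could be as simple as the following sketch. It assumes the file holds a single JSON object whose keys are MAG IDs and whose values are lists of contig ID strings; the function name is hypothetical and the exact schema is still an open design decision.

```python
import json


def validate_mag_to_contigs(text):
    """Parse text and check it maps MAG ids to lists of contig id strings."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("Top-level JSON value must be an object")
    for mag_id, contigs in data.items():
        if not isinstance(contigs, list):
            raise ValueError(f"Contigs of MAG {mag_id!r} must be a list")
        if not all(isinstance(contig, str) for contig in contigs):
            raise ValueError(f"Contig ids of MAG {mag_id!r} must be strings")
    return data
```

For example, the text {"mag-a": ["contig1", "contig2"], "mag-b": ["contig3"]} would pass, while a list at the top level or a non-list value would be rejected.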

ENH: Semantic type for HMMER database

Is your feature request related to a problem?
No, but it's related to another issue (see link below). In a nutshell, there is a need for a semantic type to represent a HMMER database in QIIME 2.

Describe the solution you'd like
The Semantic type should be able to manage an arbitrary number of organisms from the HMMER database. E.g:

download_eggnog_data.py -s -H -d 2

This would download HMMER database for these two groups of organisms of interest.

Context:

Sister issue: bokulich-lab/q2-moshpit#96
Care should be taken that the output artifact can be accessed in the downstream actions that run the HMMER analysis (these do not exist yet).

Add support for multi-file FASTA/GFF

The introduction of MAGs will require handling multiple files per sample. We will need to be able to support multiple MAGs, Bowtie2Indices, AlignmentMaps and other formats per sample (see other related issues on introduction of GenomeData).

Acceptance criteria:

validates on reading in an artifact containing multiple files per sample

Notes:

could use a mixin approach to add the functionality to existing types or just introduce the new types with support for multiple files directly
this probably can be directly combined with e.g. issue #2 for a real use case

add formats that make sample (or other) ids explicit

@colinvwood and @Oddant1 noticed while working on bokulich-lab/q2-moshpit#63 and bokulich-lab/q2-assembly#46 that many of the formats in q2-types-genomics require some knowledge of the underlying format to identify the sample ids. This requires code like:

result = ContigSequencesDirFmt()

for sample_fp in samples:
    # These paths are defined in the ContigSequencesDirFmt class as
    # {sample_id}_contigs.(fa | fasta). This should get the id from a
    # name like that
    sample_id = sample_fp.name.rsplit("_contigs", 1)[0]

to crop up fairly frequently (that example is from bokulich-lab/q2-assembly#46). If a developer misses this, or doesn't handle the stripping of _contigs correctly (e.g., replacing all occurrences, rather than just stripping the last occurrence) downstream results could have misnamed sample ids (e.g., sample-1 could easily become sample-1_contigs, or sample2_contigs [a terrible sample name] could become sample2).
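The pitfall described above can be made concrete with a small sketch. The helper name is hypothetical; it assumes file names of the form {sample_id}_contigs.fa or {sample_id}_contigs.fasta, as stated in the code comment quoted earlier, and it refuses names that do not follow that pattern instead of guessing.

```python
from pathlib import Path


def sample_id_from_contigs_path(fp):
    """Strip only the trailing '_contigs.(fa|fasta)' from a contigs file name."""
    name = Path(fp).name
    for ext in (".fasta", ".fa"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    else:
        raise ValueError(f"Unexpected extension in {fp!r}")
    if not name.endswith("_contigs") or name == "_contigs":
        raise ValueError(f"{fp!r} does not look like a contigs file")
    return name[: -len("_contigs")]


# A naive replace() removes every occurrence, not just the last one:
naive = "sample2_contigs_contigs.fa".replace("_contigs", "")
# naive is now "sample2.fa": the sample id "sample2_contigs" has been lost.
```

With the helper, a sample named sample2_contigs stored as sample2_contigs_contigs.fa keeps its full id, because only the final suffix is removed.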

We're thinking that, after the alpha release, it will make sense to define new formats that contain manifests which explicitly map sample ids to filenames, and then add transformers from these existing formats to the new manifest formats. That way we encapsulate all knowledge about how the filenames map to sample ids in those transformers, and actions that need these data can use the manifest formats and have explicit access to sample ids. This would be similar to how we handle the "demux formats" in q2-types (see an example of how this is used here).

We would want to keep the existing formats as-is so that existing artifacts (generated pre- or post-alpha) would continue to work.

Duplicate semantic type

Hello,
I have recently started analysing metabarcoding data with QIIME2 and the pipelines I follow use this extension.

Unfortunately after installing the latest versions of both 'types-genomics' and 'types', I get the following error message:
File "/.../.../miniconda3/envs/qiime2-2023.2/lib/python3.8/site-packages/qiime2/sdk/plugin_manager.py", line 162, in _integrate_plugin
raise ValueError("Duplicate semantic type (%r) defined in"
ValueError: Duplicate semantic type ('FeatureMap') defined in plugins: 'types-genomics' and 'types'
--o-silva-sequences: command not found

What can be done (keep in mind I am very new to it, python and linux in general)?

Implement SampleData[Contigs] type

The metagenome assembler will output one file containing all the assembled contigs per sample. We need a new type to handle that contig data.

Acceptance criteria:

uses new artifact format: SampleDataContigFmt
validates on reading in one FASTA file containing multiple sequences (contigs) per sample (see example)

sample.contigs.fa.txt

ENH: introduce the `FeatureData[Contig]` semantic type

We need a new FeatureData[Contig] type to represent contigs which were generated through co-assembly of reads from multiple samples (see bokulich-lab/q2-assembly#22). Since this process leads to the loss of the "sample" concept, we wouldn't want to store those contigs as SampleData[Contigs] (the contigs simply become features). Introducing a new type will also allow us to better control the flow of data through the MOSHPIT pipeline in a case where the user starts with co-assembly, as certain actions will need to work slightly differently.

Implement FeatureData[NOG] type

The functional annotation step will use FeatureData[Sequence | ProteinSequence] as input to generate a list of functional categories. At least initially we will be using eggNOG, which produces a large TSV-like table of all kinds of annotations. We will need a new type to contain these annotations. Additionally, two other, more universal types will be introduced for (potential) compatibility with other annotators in the future (see #9 and #10).

Acceptance criteria:

uses a new artifact format: EggnogAnnotationsFormat
validates on reading in a TSV file containing all EggNOG annotation categories
implements transformers: into/from pd.DataFrame, qiime.Metadata

Notes:

more details on annotation fields here -> see Annotations file section
file example: sample.annotations.txt

ENH: add a `genome_dict` method on all `GenomeData`-linked formats

Is your feature request related to a problem? Please describe.
Different variations of the GenomeData type (Proteins, Genes, Loci) store the data in fasta/gff files where names end with different suffixes (e.g.: _proteins.fasta for proteins or _loci.gff for loci). It would be handy to have a way to easily retrieve feature/genome IDs without the need to parse the names (in a similar way as is described in #56).

Describe the solution you'd like
Let's add a genome_dict method similar to how it was done in #57 so that one can easily retrieve feature IDs from any GenomeData artifact.
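A possible shape for such a method is sketched below as a standalone function. It assumes a flat directory in which every file is named {genome_id}{suffix}, with the suffix (for example _proteins.fasta or _loci.gff, as mentioned above) supplied by the concrete format; the signature is a hypothetical illustration, not the one used in #57.

```python
from pathlib import Path


def genome_dict(root, suffix):
    """Map genome id -> file path for every file in root ending with suffix."""
    mapping = {}
    for fp in sorted(Path(root).iterdir()):
        if fp.is_file() and fp.name.endswith(suffix):
            genome_id = fp.name[: -len(suffix)]  # strip only the trailing suffix
            if genome_id:  # skip a file named exactly like the suffix
                mapping[genome_id] = str(fp)
    return mapping
```

Each GenomeData-linked format could then expose this with its own suffix fixed, so callers never need to know how file names are constructed.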

Describe alternatives you've considered
An alternative solution could be to remove the suffix completely but this would need adjusting the actions which already use that type (one of them being get-ncbi-genomes in RESCRIPt) and could potentially cause issues with artifacts which were created before.

Implement SampleData[Bowtie2Index] type

Genome binning will require original reads to be first mapped to the assembled contigs to evaluate coverage. Read mapping will also be needed for MAG de-replication. We will need a new type to handle multiple index files per genome per sample. q2-types already implements a Bowtie2Index semantic type and Bowtie2IndexDirFmt format - we need something similar but with multiple indices per sample.

Acceptance criteria:

uses a new artifact format: SampleDataBowtie2IndexDirFmt
validates on reading in directories of multiple index files per sample (compare with Bowtie2IndexDirFmt)
corresponds to something like:

sample1: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
sample2: mag1/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l, mag2/{idx1, idx2, ref3, ref4, rev1, rev2}.bt2l ...
...

Implement GenomeData[Loci] type

The gene prediction step will produce one GFF file per genome in a sample (totalling multiple GFF files per sample). We need a new type to handle those GFF files.

Acceptance criteria:

uses a new artifact format: GFFDirFmt
validates on reading in GFF files (one GFF file per MAG)

Notes:

more on GFF format here
corresponds to something like:

genome1: loci.gff
genome2: loci.gff
...

file example: sample-loci.gff.txt
