The q2-moshpit from bokulich-lab

add tests that prove functionality for classify-kraken2 with reads as input and with MAGs as input

We need to come back to these tests after the alpha release to fill in the missing ones (commented as TODO in kraken2/tests/test_classification) using the same database that's used by the contig tests. We should probably also test edge cases at that time, like empty inputs for some samples and a sample with no hits to the database. We should also test merging sequences from multiple test samples to confirm that multiple expected taxon ids show up in the classification results.

see discussion in #69

ENH: Fetch `EggnogSequenceTaxa` for `build-eggnog-diamond-db` action

Context

The build-eggnog-diamond-db action (to be implemented) creates a Diamond database for specific taxa. To do so it needs a ReferenceDB[EggnogSequenceTaxa] input artifact, so we need an action that produces this artifact by fetching the appropriate data from the internet.

Something like...

qiime moshpit fetch-eggnog-fasta --o-eggnog-fasta one_taxa_data.qza

Which can then be used in...

qiime moshpit build-eggnog-diamond-db

Important Considerations

It might be important to include the date of download in the artifact since the version might influence downstream results.

Add missing citations

Citations are missing from the citations.bib (plus respective references) for the binning action.

`classify-kraken` parameter `confidence` should allow confidence of 1.0

I'm experimenting with different confidence values for this classifier, and my understanding is that 1.0 is a valid value here, but it is disallowed by the plugin. I think the type needs to be updated to Float % Range(0, 1, inclusive_end=True).

 Parameter 'confidence' received 1 as an argument, which is incompatible with parameter type: Float % Range(0, 1)
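
To make the difference between the two ranges concrete, here is a minimal plain-Python sketch of the interval semantics. It is a stand-in written for illustration, not QIIME 2's implementation; the function name in_range is hypothetical, and the default of an inclusive start with an exclusive end is an assumption based on the error message above.

```python
def in_range(value, start, end, inclusive_start=True, inclusive_end=False):
    """Return True if value lies inside the interval described by the flags.

    Hypothetical helper: it only mimics the semantics discussed in this issue.
    """
    lower_ok = value >= start if inclusive_start else value > start
    upper_ok = value <= end if inclusive_end else value < end
    return lower_ok and upper_ok


# Half-open interval [0, 1): a confidence of 1.0 is rejected.
print(in_range(1.0, 0, 1))
# Closed interval [0, 1]: a confidence of 1.0 is accepted.
print(in_range(1.0, 0, 1, inclusive_end=True))
```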

Move CheckM action(s) to a separate plugin

As discussed on this PR, due to CheckM's dependency on pplacer (and a resulting community distro incompatibility), let's move those visualizations out of q2-moshpit and to a new plugin.

Acceptance criteria:

new plugin created (q2-checkm?)
CheckM functionality removed from q2-moshpit and moved to the new plugin

ENH: add support for other MAG dereplication tools

Is your feature request related to a problem? Please describe.
Yes and no. We currently only support our custom dereplication method which is very simplistic. There are at least two other tools which could, potentially, be used for that purpose: dRep and DAS_Tool.

Describe the solution you'd like
It would be great if we could support at least one of those. We should first evaluate which of those would be compatible with our Q2 Python environment and what functionalities they both provide. Then, we can decide which of those to build in and how.

Implement contig binning action

We need an action supporting binning assembled contigs into MAGs using the metaBat binner.

Acceptance criteria:

uses metaBat2 binner (link)
uses SampleData[Contigs] + SampleData[AlignmentMaps] as inputs
handles contig depth estimation internally (using jgi_summarize_bam_contig_depths)
outputs SampleData[MAGs]

add support for kraken2-inspect to generate reports about kraken databases

Documentation on this can be found here. Specifically I could see a searchable report being useful for assessing whether taxa of interest (including a host's taxon) are present in a database, and in finding information on things like the number of distinct minimizers in the database that are associated with a taxon (which could be useful as feature metadata). If the output were of type ImmutableMetadata keyed on taxon id, that could let us use the information as feature metadata, or generate a searchable .qzv with metadata tabulate.

ENH: Fetching action for HMMER database via Eggnog

Is your feature request related to a problem?
No. Eggnog provides functionality to analyze sequences using HMMER. If one wishes to use this functionality through Qiime2, it would be nice to also have an action that fetches the HMMER database using the download_eggnog_data.py script from Eggnog.

Describe the solution you'd like
This could be done by simply calling the script and piping it with the appropriate yes/no answers to indicate which database should be downloaded. For example:

printf "n\nn\nn\nn\nn\nn" | download_eggnog_data.py -s -F -H -P
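
The same idea could be sketched from Python by passing the answers on standard input instead of using a shell pipe. The flags and the six "n" answers are copied from the example above; the helper names are hypothetical, and I have not checked how many prompts the script actually asks for a given set of flags.

```python
import subprocess


def build_download_command(flags):
    """Return the argument list for download_eggnog_data.py (flags as in the issue)."""
    return ["download_eggnog_data.py", *flags]


def build_answers(answers):
    """Join the yes/no answers the same way the printf example does."""
    return "\n".join(answers)


def run_download(flags, answers):
    """Run the download script, feeding the answers on stdin.

    Assumes download_eggnog_data.py is on PATH; raises CalledProcessError on failure.
    """
    return subprocess.run(
        build_download_command(flags),
        input=build_answers(answers),
        text=True,
        check=True,
    )


# Usage (not executed here because it starts a real download):
# run_download(["-s", "-F", "-H", "-P"], ["n"] * 6)
```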

Additional context

Care should be taken that the output artifact can be accessed in the downstream actions that run the HMMER analysis (these do not exist yet).
Care should be taken that appropriate semantic types are created for the generated database artifact. Sister issue: bokulich-lab/q2-types-genomics#63

Tasks

ENH: Semantic type for HMMER database q2-types-genomics#63

update output type from eggnog_diamond_search

GenomeData[BLAST6] -> SampleData[BLAST6]

Add abundance estimation with Bracken to `classify-kraken`

This is just to track work already in progress.

We want to use Bracken to re-estimate abundances of taxa when using reads for Kraken2 classification. Next to the Kraken2 reports (already implemented), the action should output a FeatureTable[Frequency] artifact for all the samples.

See #36 for more details.

BUG: Environment Missing Altair

When I created a conda environment from the 2023.9/shotgun/released/ubuntu environment file and ran make test on the moshpit main branch, I got an error about not having altair installed. I conda installed altair and it resolved the issue.

Implement MAG dereplication

We need an action supporting dereplication of MAGs into unique genomes.

Acceptance criteria:

uses dRep (link)
uses SampleData[MAGs] + SampleData[MultiAlignmentMap] as inputs
outputs FeatureData[MAG] + FeatureTable[Frequency]

Getting local BUSCO test failure

After pulling main and running make test I'm getting an error from this test case: q2_moshpit/busco/tests/test_utils.py::TestBUSCO::test_draw_busco_plots_for_render.

The relevant part of the diff is:

E       -   "$schema": "https://vega.github.io/schema/vega-lite/v5.15.1.json",
E       +   "$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json",

I'm guessing different versions of Altair are going to possibly output different specs?
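
If the schema URL is the only part of the spec that depends on the installed Altair version, one possible way to make the test robust is to compare the specs with the "$schema" key removed. This is a sketch under that assumption; the helper name and the "mark" entry are made up for illustration and are not taken from the actual test.

```python
def without_schema(spec):
    """Return a copy of a Vega-Lite spec dict without its top-level "$schema" key."""
    return {key: value for key, value in spec.items() if key != "$schema"}


# Two specs that differ only in the schema version (URLs taken from the diff above).
expected = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.15.1.json",
    "mark": "bar",  # illustrative content, not from the real spec
}
observed = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.8.0.json",
    "mark": "bar",
}

print(expected == observed)
print(without_schema(expected) == without_schema(observed))
```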

ENH: add action to fetch eggnog/diamond databases

Is your feature request related to a problem? Please describe.
The eggnog-specific actions require reference databases as inputs which at the moment need to be manually created/fetched by the user. It is not immediately clear how those should be constructed and/or what actually should be included in either of those.

Describe the solution you'd like
Let's add an action fetch-eggnog-db which would grab the latest version of the entire eggnog database (using the download_eggnog_data.py tool provided by eggnog itself) and create the two ReferenceDB artifacts required by those actions.

Additional context
We could later add a build-eggnog-db action which could be used to construct custom eggnog databases. This has lower priority, though, as there is a comprehensive DB already available through download.

add missing tests for the kraken2 classification

As discussed on #38, let's wrap up the Kraken 2 integration by:

adding the missing tests of report parsing (@ebolyen)
dropping the unnecessary taxonomy annotation from the feature data artifact (#38 (comment)) (@ebolyen)
regression test of report generation (@gregcaporaso)

`bin-contigs-metabat` fails if no MAGs are formed for a sample

If no MAGs are formed for a sample, bin-contigs-metabat fails with a validation error:

[bam_sort_core] merging from 1 files and 1 in-memory blocks...
Output depth matrix to /tmp/tmpgg2uftbn/KS_depth.txt
jgi_summarize_bam_contig_depths 2.15 (Bioconda) 2020-01-04T21:10:40
Output matrix to /tmp/tmpgg2uftbn/KS_depth.txt
0: Opening bam: /tmp/tmpgg2uftbn/KS_alignment_sorted.bam
Processing bam files
Thread 0 finished: KS_alignment_sorted.bam with 3838648 reads and 219961 readsWellMapped
Creating depth matrix file: /tmp/tmpgg2uftbn/KS_depth.txt
Closing most bam files
Closing last bam file
Finished
MetaBAT 2 (2.15 (Bioconda)) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, maxEdges 200 and minClsSize 200000. with random seed=1680628081
0 bins (0 bases in total) formed.
Traceback (most recent call last):
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-40>", line 2, in bin_contigs_metabat
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 408, in _callable_executor_
    artifact = qiime2.sdk.Artifact._from_view(
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/sdk/result.py", line 349, in _from_view
    result = transformation(view, validate_level)
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/core/transform.py", line 68, in transformation
    self.validate(view, level=validate_level)
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/core/transform.py", line 143, in validate
    view.validate(level)
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/plugin/model/directory_format.py", line 177, in validate
    getattr(self, field)._validate_members(collected_paths, level)
  File "/home/gcaporaso/mambaforge/envs/q2-shotgun/lib/python3.8/site-packages/qiime2/plugin/model/directory_format.py", line 109, in _validate_members
    raise ValidationError(
qiime2.core.exceptions.ValidationError: Missing one or more files for MultiFASTADirectoryFormat: '.+\\.(fa|fasta)$'

Plugin error from moshpit:

  Missing one or more files for MultiFASTADirectoryFormat: '.+\\.(fa|fasta)$'

See above for debug info.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: samtools sort /tmp/qiime2/gcaporaso/data/8a63aaeb-3bfc-4a2c-b3f3-6cfedfcf6a7a/data/KS_KS_All13-C0500000_alignment.bam -o /tmp/tmpgg2uftbn/KS_alignment_sorted.bam

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: jgi_summarize_bam_contig_depths --outputDepth /tmp/tmpgg2uftbn/KS_depth.txt /tmp/tmpgg2uftbn/KS_alignment_sorted.bam

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: metabat2 -i /tmp/qiime2/gcaporaso/data/c227e281-0838-4bfd-ac40-2d78f2e471e9/data/KS_KS_All13-C0500000_contigs.fa -a /tmp/tmpgg2uftbn/KS_depth.txt -o /tmp/tmpgg2uftbn/KS/bin --numThreads 40

I'm not sure what the right way to handle this internally is - e.g., the whole command fails, or no MAGs are provided for that sample - but we should catch and handle this case with a more informative error message.
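
As a rough sketch of the two options mentioned above (fail the whole command, or return no MAGs for that sample), a check like the following could run before the output directory is validated. The function name, its parameters and the error wording are hypothetical and not part of the plugin.

```python
def check_bins(sample_id, bin_paths, allow_empty=False):
    """Return the list of bin files for a sample, or fail with a clear message.

    With allow_empty=True an empty list is returned so the sample can be skipped;
    otherwise a ValueError explains that no bins were formed.
    """
    bin_paths = list(bin_paths)
    if bin_paths or allow_empty:
        return bin_paths
    raise ValueError(
        f"No MAGs were formed for sample '{sample_id}'. "
        "Consider adjusting the binning parameters or removing this sample."
    )
```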

UPDATE: This looks to actually be an issue with how the sample ids are obtained from the filenames, similar to what we discovered on bokulich-lab/q2-assembly#37. I am testing this now and will have a PR if it works.

add action to classify reads with Kaiju

As a q2-moshpit user,
I want a new classify-kaiju action,
so that I can use Kaiju as an alternative to Kraken2 when doing read-based taxonomic classification.

Needs #41.

ENH: enable fetching of kraken2's 16S databases

The build-kraken-db action does not yet allow fetching 16S databases, even though they are available on
https://benlangmead.github.io/aws-indexes/k2.

We should add those to the list of options under the --p-collection flag, so that users interested in classification of 16S can use that action too.

Add README

Update README to contain the following information:

add conda installation instructions
add dev section (hooks etc.)
add a section describing functionality

add action to fetch Kaiju indices

As a q2-moshpit user,
I want a new action to fetch Kaiju indices,
so that I can use it as reference in classification with Kaiju.

Notes:

there'll be a separate action to do the actual classification (#42)

Implement functional annotation

We need an action supporting functional annotation of proteins.

Acceptance criteria:

uses eggnog-mapper (link)
uses FeatureData[ProteinSequence] or GenomeData[Proteins] as input
outputs FeatureData[NOG] + FeatureData[OG] + FeatureData[KEGG]

ENH: allow gene prediction from `SampleData[MAGs]`

Is your feature request related to a problem? Please describe.
Currently, gene prediction is only possible on dereplicated MAGs. It would be great if we could also do it on un-dereplicated MAGs.

Describe the solution you'd like
Add SampleData[MAGs] as accepted input to the predict-genes-prodigal action.

ENH: add MaxBin2 as another binning option

Describe the solution you'd like
I'd like another action (bin-contigs-maxbin2) which can take contigs as input (plus whatever else that is required) and use MaxBin2 to produce bins, similarly to how bin-contigs-metabat does.

Describe alternatives you've considered

https://github.com/BinPro/CONCOCT: few years older, can't install due to Python pin (<3.7)
https://github.com/ziyewang/MetaBinner: looks interesting but Python is pinned to 3.7

Additional context
There is very little documentation available. Only the following links may be of help:

installation
Nextflow module showing usage
paper
protocol description
copy of the help entry from the tool itself:

MaxBin 2.2.7
No Contig file. Please specify contig file by -contig
MaxBin - a metagenomics binning software.
Usage:
  run_MaxBin.pl
    -contig (contig file)
    -out (output file)

   (Input reads and abundance information)
    [-reads (reads file) -reads2 (readsfile) -reads3 (readsfile) -reads4 ... ]
    [-abund (abundance file) -abund2 (abundfile) -abund3 (abundfile) -abund4 ... ]

   (You can also input lists consisting of reads and abundance files)
    [-reads_list (list of reads files)]
    [-abund_list (list of abundance files)]

   (Other parameters)
    [-min_contig_length (minimum contig length. Default 1000)]
    [-max_iteration (maximum Expectation-Maximization algorithm iteration number. Default 50)]
    [-thread (thread num; default 1)]
    [-prob_threshold (probability threshold for EM final classification. Default 0.9)]
    [-plotmarker]
    [-markerset (marker gene sets, 107 (default) or 40.  See README for more information.)]

  (for debug purpose)
    [-version] [-v] (print version number)
    [-verbose]
    [-preserve_intermediate]

  Please specify either -reads or -abund information.
  You can input multiple reads and/or abundance files at the same time.
  Please read README file for more details.

Implement read/MAG taxonomic classification using Kraken2

replace all manual parsing of sample ids from contig file paths using the new sample dict

There are examples of this both in the core software (e.g. in kraken2/classifaction.py::_classify_kraken2) and in the tests (e.g. in kraken2/tests/test_classification.py::test_classify_kraken2_contigs).

See bokulich-lab/q2-types-genomics#57.

ENH: Fetch NCBI Taxonomy Data (input to `build_custom_diamond_db` action)

Context

The build_custom_diamond_db action can optionally take a ReferenceDB[NCBITaxonomy] artifact, with which the resulting Diamond database contains taxonomy features. There is a need to create an action that produces this ReferenceDB[NCBITaxonomy] artifact by fetching the appropriate data from the internet.

Something like...

qiime moshpit fetch-ncbi-taxonomy-db ...

Important Considerations

The data comes from two links. One can find out the time and date of the last modification to a file on the server by calling wget -S --spider <somewhere_in_the_internet>. It might be important to include this information in the artifact since the version might influence the results.
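
A minimal Python sketch of the same check, assuming the server reports the modification time in the standard HTTP Last-Modified header. The function names are hypothetical and the header value in the usage line is an illustrative example, not real output from either link.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url):
    """Send a HEAD request and return the raw Last-Modified header (or None)."""
    with urlopen(Request(url, method="HEAD")) as response:
        return response.headers.get("Last-Modified")


def parse_last_modified(header_value):
    """Convert an HTTP date string into an ISO date suitable for a version file."""
    return parsedate_to_datetime(header_value).date().isoformat()


# Illustrative header value only; a real one would come from fetch_last_modified(url).
print(parse_last_modified("Wed, 21 Oct 2015 07:28:00 GMT"))
```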

Tasks

ENH: Fetch EggnogSequenceTaxa for build-eggnog-diamond-db action #108
ENH: add a build-eggnog-diamond-db action #115
ENH: Update NCBITaxonomyDirFmt to accommodate data-version file q2-types-genomics#72
ENH: action to build custom Diamond database #102

Implement binning QC visualizer

We need an action generating a visualisation of the binning quality control using CheckM.

Acceptance criteria:

uses CheckM (link)
wraps the stat plots generated per bin into a qzv visualization
wraps the bin summary table within the same visualization?

Notes:

how to handle the DB - have another action to fetch and store in another artifact (requires a new type)? or just point to where the DB is located?

MAG workflow requires one or more missing actions to generate a DistanceMatrix

Currently, the dereplicate-mags action needs a distance matrix to use when dereplicating, but there is no action available in the environment to generate one from SampleData[MAGs]. Thus everything downstream of this action (classification and feature table creation) is unreachable.

@gregcaporaso suggested using actions in the sourmash plugin to generate the distance matrix.

Thus either sourmash needs to be included in the shotgun environment, or we need to borrow the necessary actions and put them here.

`bin_contigs_metabat` action should output unbinned contigs

During binning with metaBat 2 it is possible to output the contigs that were not binned into a separate file. We should expose this file as SampleData[Contigs % unbinned] (the unbinned property would be used to distinguish those contigs from the ones originating directly from genome assembly), as they may contain potentially useful data.

add action for estimating (relative) abundances MAGs/MAG-based taxonomic profiles

When performing taxonomic profiling of MAGs, abundance information currently is not preserved. This is needed to then estimate the (relative) abundances of MAGs/taxa.

This can be done in a few steps:

map reads to MAGs.
normalize by genome length, e.g., calculate RPKM or similar (cf. Bracken, which uses genome length to calculate relative abundances of taxa when doing read-based profiling).

This action should probably be independent of taxonomic classification, i.e., it should output a feature table of MAG abundances per sample.
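
The normalization step could look roughly like the sketch below, which uses the usual RPKM definition (reads per kilobase of genome per million mapped reads). The function names and the example numbers are illustrative assumptions, not values from a real dataset.

```python
def rpkm(mapped_reads, genome_length_bp, total_mapped_reads):
    """Reads per kilobase of genome per million mapped reads for one MAG."""
    if genome_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("genome length and total mapped reads must be positive")
    return mapped_reads * 1e9 / (genome_length_bp * total_mapped_reads)


def relative_abundances(values_by_mag):
    """Scale length-normalized values so that they sum to 1 within a sample."""
    total = sum(values_by_mag.values())
    if total <= 0:
        raise ValueError("at least one MAG must have a positive value")
    return {mag: value / total for mag, value in values_by_mag.items()}


# Made-up example: 500 reads mapped to a 2 Mbp MAG out of 10 million mapped reads.
print(rpkm(500, 2_000_000, 10_000_000))
```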

Some useful reading

Challenges in benchmarking metagenomic profilers describes the differences between sequence abundance and taxon abundance for estimating relative abundances in metagenomes.
mOTUs2 uses a single-copy marker-gene approach for accurately estimating taxonomic abundance from reads, but gives a good overview of the different approaches and a comparison vs. MAGs.

Implement gene prediction from dereplicated MAGs

We need an action (predict-genes-prodigal) supporting gene prediction on dereplicated MAGs.

Acceptance criteria:

uses metaProdigal (link)
uses FeatureData[MAG] as input
outputs GenomeData[Loci] + GenomeData[Genes] + GenomeData[Proteins] + FeatureTable[Frequency]

output a contig map from `bin_contigs_metabat` action

The bin_contigs_metabat action should:

generate unique IDs for every identified MAG (UUID V4)
output a ContigMap (as proposed in bokulich-lab/q2-types-genomics#46)
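
A minimal sketch of both points, assuming the contig map is simply a mapping from a MAG's UUID to the contig IDs found in its FASTA headers. The actual ContigMap format is defined in the linked q2-types-genomics proposal and may differ; all function names here are hypothetical.

```python
import uuid


def contig_ids_from_fasta(lines):
    """Extract contig IDs (first token after '>') from FASTA lines."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def build_contig_map(bins):
    """Assign a UUID4 to each bin and map it to the bin's contig IDs.

    bins is an iterable in which each item is an iterable of FASTA lines.
    """
    contig_map = {}
    for fasta_lines in bins:
        mag_id = str(uuid.uuid4())  # random (version 4) UUID
        contig_map[mag_id] = contig_ids_from_fasta(fasta_lines)
    return contig_map


# Two toy bins with made-up contig names and sequences.
example = build_contig_map([
    [">contig_1 len=10", "ACGTACGTAC", ">contig_2", "GGCC"],
    [">contig_3", "TTAA"],
])
print(len(example))
```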

ENH: allow `SampleData[MAGs]` as input to the `eggnog-diamond-search` action

Currently, it is possible to annotate either contigs or dereplicated MAGs - let's also allow the MAGs which have not undergone dereplication (so straight after binning), so that annotation can be performed at any step of the pipeline.

add `SampleData[Contigs]` as input type to `classify-kraken2`

Output should be KrakenReport % Property('contigs') and KrakenOutput % Property('contigs')

ENH: On busco's main plot, make the facet plots vary in height depending on the number of mags in each sample.

Problem:
In the main visualization, all samples are allocated the same amount of space for plotting independently of the number of mags they contain. The height of the bars representing each of the MAGs in the plot is then adjusted to fit this predefined space. This leads to bars having different heights since different samples have different numbers of MAGs. This has been found to be aesthetically unpleasing and perhaps even cumbersome to the interpretation of the plot.

Solution:
Make it so all bars have the same height thereby adjusting the height of the per-sample bar plots to account for different numbers of MAGs/bars.

Possible Implementations
Plan A: If it's possible leave the implementation as is (using the facet feature of the plotting library).
Plan B: We change the way the plot is constructed to allow bars to have the same size.
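
Whichever plan is chosen, the per-sample plot height would follow from the number of MAGs. A small sketch of that arithmetic, with hypothetical pixel values and function name (not tied to any plotting library):

```python
def facet_heights(mags_per_sample, bar_height_px=20, padding_px=10):
    """Compute a plot height per sample so that every bar has the same height."""
    return {
        sample: count * bar_height_px + padding_px
        for sample, count in mags_per_sample.items()
    }


# Made-up sample names and MAG counts.
print(facet_heights({"sample_a": 3, "sample_b": 10}))
```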

BUG: bin-contigs-metabat only collects .fa files

Similar to #76, at several places in q2_moshpit/metabat2/metabat2.py only .fa files are collected when .fasta files should also be collected.
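
One way to collect both extensions is to run the glob once per extension and merge the results. The sketch below uses only the standard library; the function name is hypothetical and is not taken from metabat2.py.

```python
import glob
import os


def collect_fasta(directory, extensions=("fa", "fasta")):
    """Return sorted paths of all files in directory ending in one of the extensions."""
    paths = []
    for ext in extensions:
        # "*.fa" does not match "x.fasta", so each file is found exactly once.
        paths.extend(glob.glob(os.path.join(directory, f"*.{ext}")))
    return sorted(paths)
```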

add pre-commit hooks and reformat with Black

For consistent code style, Black formatting should be applied (similarly to how it's done in q2-assembly) and an appropriate check will be added in the CI.

BUG: os.glob should also search for .fa files

We are only finding .fasta files here when we should also be getting .fa files.

This probably hasn't been noticed because metabat2 (probably) outputs its mags in .fasta formats and contigs were only recently added as inputs.

Update: megahit outputs .fa files so none of the contigs generated with that tool are being discovered.

ENH: Fetching action for full Diamond database

Is your feature request related to a problem?

No. It would be nice to have an action that downloads the Diamond database for downstream Eggnog analyses.

Describe the solution you'd like

This could be done by simply calling the script and piping it with the appropriate yes/no answers to indicate which database should be downloaded. For example:

printf "n\nn\ny" | download_eggnog_data.py -s --data_dir .

Additional considerations

Care should be taken that the output artifact can be accessed in the downstream actions.
Care should be taken that an appropriate semantic type is used for the generated database artifact. See this issue: bokulich-lab/q2-types-genomics#64

BUG: misaligned plots in the BUSCO visualization

Describe the bug
The two sides of the plot (left: BUSCOs and right: assembly stats) are misaligned - I see an offset between those:

To Reproduce

Execute the command qiime moshpit evaluate-busco --i-bins <attached file> --p-mode genome --p-lineage-dataset bacteria_odb10 --p-cpu 6 --output-dir busco-test using the attached mags as input (see below, the file is zipped)
Open the resulting visualization file
Check out the plot

Expected behavior
The plots are aligned.

Please complete the following information:

OS: macOS Ventura 13.6
QIIME 2 version: 2023.9

Additional context
w9-mags.qza.zip

add support for `FeatureTable` and `FeatureData` generation from `classify-kraken` results

Ultimately we'll want FeatureTable and FeatureData[Taxonomy] results to use these data in downstream applications. One option would be to use Bracken to go from SampleData[Kraken2Output] and/or SampleData[Kraken2Report] to a FeatureTable and FeatureData[Taxonomy]. Should we start thinking about that, or is there another approach that is planned?

implement MAG evaluation using BUSCO

As a MOSHPIT user,
I want an action which can run BUSCO
so that I can evaluate completeness of the generated MAGs.

adjust input types to `classify-kraken2`

Following up from the discussion #45, we need to change the input type for MAGs from SampleData[MAG] to FeatureData[MAG] to allow for classification of dereplicated MAGs. The output type will then need to be adjusted to FeatureData[Kraken2Reports] + FeatureData[Kraken2Outputs].

ENH: action to build custom Diamond database

Feature description

We want to be able to create a DIAMOND formatted reference database from a FASTA input file, just like the diamond documentation shows that one can.
A new action will be created for this. Users can specify their own input FASTA file which according to the documentation must be a protein reference database file in FASTA format.

ENH: add `pavian` as a visualization option for Kraken 2 results

Currently, there is only one way to visualize the results obtained from Kraken 2 - taxa barplot. Also, this only works for reads (for now). It would be nice to bring in some more Kraken 2-specific visualizers which could leverage Kraken 2 reports directly and enable visualization of results from both reads and MAGs. One such tool is pavian.

Notes:

there seems to be conda package available on bioconda but only for Linux platform
this is an interactive visualizer - would it be possible to pre-load our reports in the QIIME visualization directly? otherwise this cannot really work

ENH: allow `FeatureData[MAG]` as input to the `evaluate-busco` action

It would be great if we could run the evaluation on dereplicated MAGs too, on top of the SampleData[MAGs].

Current version of moshpit incompatible with conda version of q2-types and q2-types-genomics.

Let's assume you want to implement some new feature in moshpit. You would fork this repo and then clone the fork to your local machine where you would work on the new features. But before you start coding you need to set up the virtual environment. So you follow the instructions in the wiki and run the following (notice how moshpit is left out of the mamba create command since you will install your local copy and not the version that is available through mamba).

mamba create -yn test_env \
    -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2023.5/tested -c defaults \
    q2cli q2-assembly q2-checkm

conda run -n test_env \
    pip install --no-deps --force-reinstall git+https://github.com/misialq/quast.git@issue-230

conda activate test_env

cd <path_to_local_moshpit_repo>/q2-moshpit

pip install -e .

So far so good, but then running anything (e.g. qiime dev refresh-cache) will return this error:

ImportError: cannot import name 'BrackenDBDirectoryFormat' from 'q2_types_genomics.kraken2'

This error comes from q2_types_genomics/kraken2/__init__.py (and perhaps the other files in q2_types_genomics/kraken2). The version that gets installed is missing some classes that are used by moshpit.

Installed version:

 from ._format import (
     Kraken2ReportFormat, Kraken2ReportDirectoryFormat,
     Kraken2OutputFormat, Kraken2OutputDirectoryFormat,
     Kraken2DBFormat, Kraken2DBDirectoryFormat
 )
 from ._type import Kraken2Reports, Kraken2Outputs, Kraken2DB

 __all__ = [
     'Kraken2ReportFormat', 'Kraken2ReportDirectoryFormat', 'Kraken2Reports',
     'Kraken2OutputFormat', 'Kraken2OutputDirectoryFormat', 'Kraken2Outputs',
     'Kraken2DBFormat', 'Kraken2DBDirectoryFormat', 'Kraken2DB'
 ]

However, these classes are available in the current version of q2-types-genomics.

Current GitHub version:

from ._format import (
    Kraken2ReportFormat, Kraken2ReportDirectoryFormat,
    Kraken2OutputFormat, Kraken2OutputDirectoryFormat,
    Kraken2DBFormat, Kraken2DBDirectoryFormat,
    BrackenDBFormat, BrackenDBDirectoryFormat
)
from ._type import Kraken2Reports, Kraken2Outputs, Kraken2DB

__all__ = [
    'Kraken2ReportFormat', 'Kraken2ReportDirectoryFormat', 'Kraken2Reports',
    'Kraken2OutputFormat', 'Kraken2OutputDirectoryFormat', 'Kraken2Outputs',
    'Kraken2DBFormat', 'Kraken2DBDirectoryFormat', 'Kraken2DB',
    'BrackenDBFormat', 'BrackenDBDirectoryFormat'
]

link to lines of code in q2-types-genomics
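
A quick way to confirm which names an installed module lacks, before reinstalling anything, is sketched below. The helper name is hypothetical; the commented usage line uses the module and class names quoted in the error above.

```python
import importlib


def missing_names(module_name, names):
    """Return the names that the installed module does not expose."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Usage in the affected environment (requires q2-types-genomics to be installed):
# missing_names("q2_types_genomics.kraken2",
#               ["BrackenDBFormat", "BrackenDBDirectoryFormat"])
```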

So we can fix it by running the following.

pip uninstall q2_types_genomics
git clone git@github.com:bokulich-lab/q2-types-genomics.git
cd <path_to_local_q2-types-genomics_repo>/q2-types-genomics
pip install .

Ok problem solved! Sadly not, apparently q2-types-genomics is not the only incompatible package. If we run qiime dev refresh-cache we will get the following error.

TypeError: BLAST6 is not a variant of SampleData.field['type']

The error comes from the q2-types package. It boils down to the q2_types/feature_data/_type.py lines 51-53.

Old version (installed with conda/mamba):

BLAST6 = SemanticType('BLAST6',
                      variant_of=[FeatureData.field['type']])

New version (available in GitHub):

BLAST6 = SemanticType('BLAST6',
                      variant_of=[FeatureData.field['type'],
                                  SampleData.field['type']])

link to lines of code in q2-types

Again we uninstall the guilty package and clone the new version and install it locally.

pip uninstall q2-types
git clone https://github.com/qiime2/q2-types.git
cd <path_to_local_q2-types>/q2-types
pip install .

Woohoo! We solved it! Almost. Running qiime dev refresh-cache will return import errors, essentially complaining that some libraries are missing from the environment. Just

pip install xmltodict tqdm

(the libraries that are missing) and you are good to go.

add support for `KrakenUniq`

This can be achieved by adding the --report-minimizer-data to our kraken2 call, and is discussed here. This will add two additional columns to the report "representing the number of minimizers found to be associated with a taxon in the read sequences, and the estimate of the number of distinct minimizers associated with a taxon in the read sequence data".

Any reason to not make this the default behavior?

Update: because their report format doesn't include headers, and this flag changes the number of columns, it may be worth creating our own report format that adds headers.
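
A sketch of how headers could be attached when reading a report, choosing the column set from the number of tab-separated fields. The column names are my own, and the placement of the two minimizer columns (fourth and fifth) is an assumption based on my reading of the Kraken 2 manual, so it should be checked against real output.

```python
# Hypothetical column names; Kraken 2 itself writes no header line.
STANDARD_COLUMNS = [
    "percent", "clade_reads", "taxon_reads", "rank", "taxon_id", "name",
]
# Assumption: the two minimizer columns sit in the fourth and fifth positions.
MINIMIZER_COLUMNS = (
    STANDARD_COLUMNS[:3]
    + ["minimizers", "distinct_minimizers"]
    + STANDARD_COLUMNS[3:]
)


def parse_report_line(line):
    """Turn one tab-separated report line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == len(STANDARD_COLUMNS):
        columns = STANDARD_COLUMNS
    elif len(fields) == len(MINIMIZER_COLUMNS):
        columns = MINIMIZER_COLUMNS
    else:
        raise ValueError(f"unexpected number of columns: {len(fields)}")
    record = dict(zip(columns, fields))
    record["name"] = record["name"].strip()  # names are indented by rank
    return record


# Synthetic line, not real Kraken 2 output.
print(parse_report_line("50.00\t10\t5\t40\t7\tS\t123\t    Example taxon\n"))
```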

redefine `FeatureData[NOG]` type that is generated as output from `eggnog-annotate`, and add support for downstream analysis of this data

This isn't FeatureData in the way that we typically define it, in that the feature ids from the corresponding table are not the ids in this artifact. This seems more like SampleData to me. I'm currently investigating this, but I wanted to get an issue up now to make folks aware that this semantic type is likely going to change. We'll ultimately want to process this artifact to generate FeatureData.
