The q2-amr from bokulich-lab

BUG: No module named tqdm

The PR that added the progress bar introduced tqdm as a dependency, but it was not added to the installation instructions in the README.

ENH: Add sorted.length_100.bam file to `CARDAlleleAnnotation` when running `annotate_reads_card`

For #21, the RGI function kmer-query is going to be added to q2-amr. This function needs a file called sorted.length_100.bam as input.
This file is created while annotate_reads_card is running but is deleted afterwards, so it just has to be copied into the CARDAlleleAnnotation directory.
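
The copy step described above could look roughly like the following sketch. The function name keep_sorted_bam and the per-sample directory layout are assumptions made for illustration and are not taken from the q2-amr code base.

```python
import shutil
import tempfile
from pathlib import Path


def keep_sorted_bam(tmp_dir, sample_dir, bam_name="sorted.length_100.bam"):
    """Copy the BAM file out of the temporary working directory.

    Hypothetical helper: the file is copied into the sample directory of
    the annotation output before the temporary directory is removed.
    Returns the path of the copy.
    """
    src = Path(tmp_dir) / bam_name
    dst_dir = Path(sample_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, dst_dir / bam_name))


# Demonstration with a placeholder file instead of a real BAM file.
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as out:
    (Path(work) / "sorted.length_100.bam").write_bytes(b"placeholder")
    copied = keep_sorted_bam(work, Path(out) / "sample1")
    copied_exists = copied.exists()
    copied_name = copied.name
```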

Make sure installation instructions work

Update the qiime2 conda channel to https://packages.qiime2.org/qiime2/2023.9/shotgun/released/ (for the latest released version) and check that everything works as expected.

Also, can you please try without the --no-channel-priority flag to check whether it works now?

BUG: `sample_dict` function of `CARDAnnotationDirectoryFormat` does not include all MAGs per sample

When sample_dict is called on an artifact with the format CARDAnnotationDirectoryFormat, the resulting dict includes all samples in the artifact but only one MAG per sample, even if a sample contains more than one MAG.

The function iterates through all MAGs in a sample, but in each iteration it overwrites the entry of the previous MAG.
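
A minimal sketch of a nested dict that keeps every MAG is shown below. It assumes a sample/MAG/file directory layout and a file called amr_annotation.txt; both are assumptions for illustration and this is not the actual q2-amr implementation.

```python
import tempfile
from pathlib import Path


def sample_dict(root, file_name="amr_annotation.txt"):
    """Map sample ID -> {MAG ID -> path of the annotation file}.

    Assumes a <root>/<sample>/<mag>/<file_name> layout (hypothetical).
    """
    result = {}
    for sample in sorted(Path(root).iterdir()):
        if not sample.is_dir():
            continue
        # The inner dict is created once per sample and then filled,
        # so a later MAG cannot replace an earlier one.
        mags = {}
        for mag in sorted(sample.iterdir()):
            annotation = mag / file_name
            if annotation.is_file():
                mags[mag.name] = str(annotation)
        result[sample.name] = mags
    return result


# Demonstration: sample1 has two MAGs, sample2 has one.
with tempfile.TemporaryDirectory() as root:
    for sample, mag in [("sample1", "bin1"), ("sample1", "bin2"), ("sample2", "bin1")]:
        mag_dir = Path(root) / sample / mag
        mag_dir.mkdir(parents=True)
        (mag_dir / "amr_annotation.txt").write_text("")
    mags_per_sample = {s: sorted(m) for s, m in sample_dict(root).items()}
```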

ENH: Change how the CARD database is loaded with function `load_card`

To run any RGI command, all required database files have to be loaded first, either globally, so that the loaded files can be used from anywhere, or locally. When loading locally, the files are copied into a directory called LocalDB inside the working directory.
In all actions the function load_card_db is used, which runs RGI load to load all the needed files locally.
If you use the annotation (annotate_mags_card, annotate_reads_card) and the kmer_query_card functions in parallel, every partition loads the needed database files into a temp directory. For kmer_query_card these files amount to 6.25 GB of disk space per partition, which limits the number of partitions you can run by the available disk space.
Also, copying the files into a temp directory does not bring any advantage.

The load_card_db should be changed to load the files globally. The paths to the database files also have to be changed in all actions that use the databases.

There is a command called RGI clean that removes all previously loaded database files. I am not sure whether this functionality has to be introduced for q2-amr.

ENH: Add new q2-amr action that builds a kmer database of custom kmer lengths.

Add new q2-amr action called kmer-build-card that builds a kmer database of custom kmer lengths. kmer-build-card will use the RGI function kmer_build.

The kmer database downloaded with the fetch-card-db action is the standard database with kmers of length 61. With the action kmer-build-card it will be possible to generate a custom kmer database with kmers of any length.

RGI kmer_build: https://github.com/arpcard/rgi?tab=readme-ov-file#building-custom-k-mer-classifiers

MAINT: Remove altair from installation instructions

In the README, altair is still listed in the installation instructions. But it is not needed anymore since #24.

ENH: Add possibility to parallelise the `annotate_reads_card` action.

Add the possibility to parallelise the annotate_reads_card action with Parsl.
To achieve this, a new pipeline has to be created that uses the partition functions partition_samples_single and partition_samples_paired of q2-demux for single-end and paired-end reads, the annotate_reads_card action, and the collate functions collate_reads_allele_annotations and collate_reads_gene_annotations of q2-amr that will be added with #56. The function merge of q2-feature-table will also be needed to merge the FeatureTables.
This means that q2-feature-table and q2-demux will become dependencies of q2-amr.

ENH: Add new semantic types `CARDReadsKmerAnalysis` and `CARDMAGsKmerAnalysis`

For #21 two new actions will be added that run kmer-query of RGI for the reads CARDAlleleAnnotation and mags CARDAnnotation output.
The outputs of these two new functions need two new types called CARDReadsKmerAnalysis and CARDMAGsKmerAnalysis.

BUG: ValueError when running `collate-reads-allele-annotations`

When running:
qiime amr collate-reads-allele-annotations --i-annotations partition_reads_allele/sample1.qza partition_reads_allele/sample2.qza --o-collated-annotations partition_reads_allele/collate.qza

This error occurs:
ValueError: Cannot place Tuple[SampleData[CARDAlleleAnnotation % Properties('kma', 'bowtie2')]] and Tuple[SampleData[CARDAlleleAnnotation % Properties('kma', 'bwa')]] in the same type variable.

I do not know where this error could come from.
The error is fixed when the unreleased qiime2 version 2024.5 is installed.

Until this version is released, the installation instructions have to be changed to include:
pip install git+https://github.com/qiime2/qiime2.git

MAINT: Change imports from types-genomics to types.

q2-types-genomics will be deprecated with the next qiime distro.
All imports from q2-types-genomics have to be changed to q2-types.

MAINT: Change order of qiime2 channels in installation instructions.

The conda channel qiime2 is listed before the channel https://packages.qiime2.org/qiime2/2024.2/shotgun/released/, which leads to the wrong version of the qiime2 packages being installed.
To fix this, the order of the channels has to be changed.

ENH: Add possibility to parallelise the `annotate_mags_card` action.

Add the possibility to parallelise the annotate_mags_card action with Parsl.
To achieve this, a new pipeline has to be created that uses the partition function partition_sample_data_mags of q2-moshpit for MAGs, the annotate_mags_card action, and the collate function collate_mags_annotations of q2-amr that will be added with #56. The function merge of q2-feature-table will also be needed to merge the FeatureTables.
This means that q2-feature-table and q2-moshpit will become dependencies of q2-amr.

TEST: Coverage is low in `_formats.py` and `utils.py`

Coverage is low in the files _formats.py and utils.py, so new unit tests have to be added to increase coverage.

BUG: Invalid value for '--i-amr-reads-annotation' in `visualize-annotation-stats`

This error message occurs when using visualize-annotation-stats:

(1/1) Invalid value for '--i-amr-reads-annotation': Expected an artifact of
at least type CARDGeneAnnotation | CARDAlleleAnnotation. An artifact of type
SampleData[CARDAlleleAnnotation] was provided.

To fix this issue, the input type of visualize-annotation-stats in plugin_setup has to be changed from CARDGeneAnnotation | CARDAlleleAnnotation to SampleData[CARDGeneAnnotation | CARDAlleleAnnotation].

ENH: Add three actions that can partition the annotations of mags and reads for parallelization of the kmer-query functions.

Three actions will be added called partition_mags_annotations, partition_reads_allele_annotations, partition_reads_gene_annotations.
These actions will be able to partition artifacts of types SampleData[CARDAnnotation], SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] respectively into collections of a specified number of artifacts with the same type.
These collections will be needed to make parallelisation of the actions kmer_query_mags_card and kmer_query_reads_card possible. These actions will be added with #21.
For now these have to be three different actions, but they can be unified into one once qiime2 allows multiple output types.
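
The partitioning logic could be sketched as follows. This is a simplified illustration that works on plain sample IDs; the function name partition_samples and its behaviour (near-equal group sizes, at most one partition per sample) are assumptions, not the q2-amr implementation.

```python
def partition_samples(sample_ids, num_partitions=None):
    """Split sample IDs into groups of near-equal size.

    If num_partitions is None or larger than the number of samples,
    every sample ends up in its own partition.
    Returns a dict mapping partition number (starting at 1) to sample IDs.
    """
    samples = sorted(sample_ids)
    if not samples:
        return {}
    n = len(samples) if num_partitions is None else min(num_partitions, len(samples))
    if n < 1:
        raise ValueError("num_partitions must be at least 1")
    size, extra = divmod(len(samples), n)
    partitions = {}
    start = 0
    for i in range(n):
        # The first `extra` partitions receive one additional sample.
        end = start + size + (1 if i < extra else 0)
        partitions[i + 1] = samples[start:end]
        start = end
    return partitions


two_parts = partition_samples(["s1", "s2", "s3", "s4", "s5"], 2)
capped = partition_samples(["a", "b"], 5)
```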

MAINT: Include second installation instructions for Apple silicon chips

The installation instructions don't work on Apple silicon.
A second set of instructions has to be added, like in the instructions on the QIIME 2 website.

ENH: Add transformer for kmer analysis formats to Metadata

The actions kmer-query-reads-card and kmer-query-mags-card create outputs with three different directory formats called CARDMAGsKmerAnalysisDirectoryFormat, CARDReadsAlleleKmerAnalysisDirectoryFormat and CARDReadsGeneKmerAnalysisDirectoryFormat. These formats are all associated with SampleData types that contain text files in a per-sample directory structure. To explore this data with the action metadata tabulate, all text files have to be merged into one pandas DataFrame.

Three new transformers have to be added.
Helper function tabulate_data has to be modified.

`CARDGeneAnnotationDirectoryFormat` contains stats file that is not needed

The action annotate-reads-card produces artifacts of type CARDAlleleAnnotation and CARDGeneAnnotation that contain the AMR annotations but also the mapping statistics per sample. The statistics file, called "overall-mapping-stats.txt", is only needed in one of the two artifacts.
To address this issue, the CARDGeneAnnotationDirectoryFormat and the annotate-reads-card action have to be altered.
The visualizer visualize-annotation-stats also has to be altered to only accept inputs of type CARDAlleleAnnotation.

Re-format the code using `black`

The dev section of the README says we follow black style - is this really the case? If not, let's reformat.

It would also be great if you could add a GitHub action to test the code at every commit - see here: https://black.readthedocs.io/en/stable/integrations/github_actions.html.

BUG: Heatmap produces error because it does not recognise JSON files

When running qiime amr heatmap --i-amr-annotation amr_annotations.qza --o-visualization amr_annotations.qzv this error message is produced:

ERROR 2024-01-18 14:18:43,081 : Error: No data recovered from JSONs, cannot build heatmap. Please check you are using RGI results from ver 4.0.0 or greater.

The problem most likely is that the JSON files are moved into one directory and renamed to their sample and bin name. They lose the .json extension during renaming and are therefore not recognised by RGI.
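
A sketch of a renaming step that keeps the extension is given below. The sample/bin directory layout, the file name amr_annotation.json and the function name collect_json_files are assumptions for illustration only.

```python
import shutil
import tempfile
from pathlib import Path


def collect_json_files(src_root, dest_dir, file_name="amr_annotation.json"):
    """Copy <sample>/<bin>/<file_name> to <dest_dir>/<sample>_<bin>.json.

    The original suffix is appended explicitly so that the copies
    are still recognisable as JSON files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src_root).glob(f"*/*/{file_name}")):
        sample, bin_name = path.parts[-3], path.parts[-2]
        target = dest / f"{sample}_{bin_name}{path.suffix}"
        shutil.copy(path, target)
        copied.append(target.name)
    return copied


# Demonstration with empty JSON documents.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for bin_name in ("bin1", "bin2"):
        bin_dir = Path(src) / "sample1" / bin_name
        bin_dir.mkdir(parents=True)
        (bin_dir / "amr_annotation.json").write_text("{}")
    names = collect_json_files(src, dst)
```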

Adding `Genelengths` type

As described in #23, several normalization methods will be added to the plugin.
The FPKM and TPM methods both need information about the gene lengths.
This information can be extracted from the artifacts with type CARDAlleleAnnotation and CARDGeneAnnotation.
To ensure that feature tables from other sources can be normalized with these methods, a new type called Genelength has to be introduced.
The type will include one TSV file with two columns: One with the gene names and one with the corresponding gene lengths.
Additionally a transformer has to be introduced that converts CARDAlleleAnnotation and CARDGeneAnnotation to Genelength.

Test whether the RGI fix is still needed

A while ago we introduced the additional installation step for the RGI fix from git+https://github.com/misialq/rgi.git@py38-fix. Could you please test whether this is still needed? RGI was updated in the meantime (I think...) so maybe we got lucky :D

ENH: Add sample_dict function to classes `CARDAlleleAnnotationDirectoryFormat`, `CARDGeneAnnotationDirectoryFormat` and `CARDAnnotationDirectoryFormat`

ENH: Change Presence/Absence table to Frequency table in `annotate-mags-card` output

The action annotate-mags-card has a FeatureTable[Presence/Absence] as one of its outputs. But because one ARG can occur multiple times in one MAG, the frequency information is lost.
To account for this, the output has to be changed to a FeatureTable[Frequency].
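
The difference between the two table types can be illustrated with a small sketch. The MAG IDs and gene names below are made up, and count_args is a hypothetical helper, not part of q2-amr.

```python
from collections import Counter


def count_args(hits):
    """Count how often each ARG was detected per MAG.

    hits: iterable of (mag_id, arg_name) tuples, one tuple per detection.
    """
    counts = {}
    for mag_id, arg in hits:
        counts.setdefault(mag_id, Counter())[arg] += 1
    return counts


hits = [("mag1", "geneA"), ("mag1", "geneA"), ("mag1", "geneB"), ("mag2", "geneB")]
frequency = count_args(hits)

# Presence/absence only records whether a gene was seen at all,
# so the two detections of geneA in mag1 collapse into one value.
presence = {mag: {arg: True for arg in c} for mag, c in frequency.items()}
```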

When running `annotate-reads-card` without the flag `--p-include-wildcard` the gene and allele mapping outputs contain the same information

RGI bwt (the underlying function used by annotate-reads-card) outputs annotations where alleles are mapped to ARGs in CARD ("allele_mapping_data.txt"). This output is then further summarized at the gene level and written to an extra file ("gene_mapping_data.txt").
But when annotate-reads-card is used without the flag --p-include-wildcard, there are no multiple alleles per gene in the database. In that case the gene mapping data contains the same information as the allele mapping data.

To solve this issue:

The annotate-reads-card action has to be altered to move an empty TXT file to the CARDGeneAnnotation artifact and an empty feature table to the FeatureTable[Frequency] artifact.
The validation of the CARDGeneAnnotationDirectoryFormat has to be altered to accept empty files.

Update README

The current README is a little bit out-of-date - let's update it. Below are some things I noticed but please add anything that may still be missing:

Tasks

the indentation in the first code block is a bit off (check https://github.com/bokulich-lab/q2-fondue to see what it should look like)
the table is missing links - maybe we could actually add another column stating which method from RGI is being used by those actions?
Make sure installation instructions work #14
Re-format the code using black #15
Test whether the RGI fix is still needed #16

ENH: Adding new action to download NCBI database for AMRFinderPlus

After the implementation of RGI from CARD, the next AMR tool to be included in q2-amr will be AMRFinderPlus from NCBI.

The first action to be included will be one that downloads the NCBI database with the command amrfinder -u. This action will download the newest version of the NCBI database and can also be used to update the database.
The database is stored at */miniconda3/envs/amrfinder/share/amrfinderplus/data/2024-01-31.1/ automatically. The database will not need a new qiime2 type.

BUG: "biom.exception.TableException: Duplicate observation IDs" when running `annotate-reads-card`

The error message biom.exception.TableException: Duplicate observation IDs occurs when running the action annotate-reads-card with the flag --p-include-wildcard.

The error occurs during the creation of the feature table, when the pd.DataFrame is converted into a biom file.
The biom format does not allow duplicated ID values.
The helper function create_count_table uses the column ARO_Term in the file allele_mapping_data.txt. When using annotate-reads-card without the flag --p-include-wildcard, this does not lead to any issues. But with the flag, a secondary database with multiple alleles per gene is introduced, which can lead to duplications in the ARO_Term column.

To solve this issue a secondary unique identifier has to be added to the ARO_Term column to ensure there are no duplicates in the index of the biom file.
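
One way to build such a combined identifier is sketched below. The second column name "Reference Sequence" and the example values are assumptions made for illustration; any column that distinguishes the alleles of one gene would serve the same purpose.

```python
def make_unique_ids(rows, term_column="ARO_Term", second_column="Reference Sequence"):
    """Combine two columns into one identifier per row.

    rows: list of dicts, one per line of the allele mapping file.
    Raises ValueError if the combined identifiers are still not unique.
    """
    ids = [f"{row[term_column]}|{row[second_column]}" for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("Combined identifiers are not unique")
    return ids


# Made-up rows: the same ARO term occurs with two different alleles.
rows = [
    {"ARO_Term": "geneA", "Reference Sequence": "allele_1"},
    {"ARO_Term": "geneA", "Reference Sequence": "allele_2"},
    {"ARO_Term": "geneB", "Reference Sequence": "allele_1"},
]
unique_ids = make_unique_ids(rows)
```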

Tests don't run in CI, because of pyarrow dependency of Altair

In the CI check ci / q2-amr (ubuntu-latest) there is an error in one of the tests.
The error message is:
FAILED test-env/lib/python3.8/site-packages/q2_amr/card/tests/test_reads.py::TestAnnotateReadsCARD::test_plot_sample_stats - RuntimeError: The pyarrow package must be version 11.0.0 or greater. Found version 8.0.0

I tried to add pyarrow >=11.0.0 to the meta.yaml file but then the environment can't be solved.
Because the problem appears during the test_plot_sample_stats test I assume that the issue lies with the package Altair. Altair is only used by the plot_sample_stats function.

A temporary solution would be to remove all functions that use Altair. This means the action visualize-annotation-stats and all functions associated with it have to be removed.

CI: The name of the qiime2 distro has changed and has to be updated

The name of the qiime2 distro has changed and has to be updated to metagenome in the ci-dev.yaml file.
Furthermore a new line that points to the codecov token has to be added in the same file.

ENH: Add aligner property to annotate-reads-card outputs

The annotate-reads-card action creates outputs of type SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation].
These outputs can be used with the function kmer-analysis that is currently under development.
But the kmer-analysis can only be performed if the outputs were created with one of the three possible aligners of the annotate-reads-card function.

Because of this, one of three possible properties ("kma", "bowtie2", "bwa") has to be added to the SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] outputs.

ENH: Add progress bars to downloads with `fetch-card-db` action

Add progress bars to downloads with fetch-card-db action
Add some informative comments about what is being done

ENH: Add new action that can normalize `FeatureTable[Frequency]` with different methods.

The action annotate-reads-card generates a FeatureTable[Frequency] table. These tables contain the raw information of how many reads were mapped to a certain ARG in CARD. To make meaningful comparisons in and across samples, this information has to be normalized to gene length, library size and composition.

To achieve this, a new action has to be created that can normalize feature tables with current standard methods.
The integration will be done with RNAnorm, a Python package that is used for RNA-seq normalizations and provides multiple methods like CPM, TPM, FPKM, TMM, CTF, UQ and CUF.
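
To illustrate why gene lengths are needed, the following sketch computes TPM for one sample in plain Python. It does not use RNAnorm and is not the planned implementation; the gene names, counts and lengths are made up.

```python
import math


def tpm(counts, lengths):
    """Transcripts per million for one sample.

    counts:  dict gene -> number of mapped reads
    lengths: dict gene -> gene length (same unit for all genes)
    Each count is divided by the gene length, then the resulting rates
    are scaled so that they sum to one million.
    """
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# geneB has twice the reads of geneA but is four times as long,
# so its length-normalized abundance is only half that of geneA.
result = tpm({"geneA": 10, "geneB": 20}, {"geneA": 1000, "geneB": 4000})
```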

ENH: Add three actions that can collate the annotations of mags and reads for parallelization of the kmer-query functions.

Three actions will be added called collate_mags_annotations, collate_reads_allele_annotations, collate_reads_gene_annotations.
These actions will be able to collate artifacts of types SampleData[CARDAnnotation], SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] respectively.
These collections will be needed to make parallelisation of the actions kmer_query_mags_card and kmer_query_reads_card possible. These actions will be added with #21.
For now these have to be three different actions, but they can be unified into one once qiime2 allows multiple output types.
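
The collate step is the counterpart of partitioning and could be sketched like this. The sketch works on plain dicts instead of artifacts; the function name collate_annotations and the choice to reject duplicate sample IDs are assumptions, not the q2-amr implementation.

```python
def collate_annotations(partitions):
    """Merge several sample -> files mappings into one mapping.

    partitions: iterable of dicts, one per partitioned artifact.
    Raises ValueError if a sample ID occurs in more than one partition,
    because one entry would otherwise silently replace the other.
    """
    collated = {}
    for part in partitions:
        for sample_id, files in part.items():
            if sample_id in collated:
                raise ValueError(
                    f"Sample {sample_id!r} occurs in more than one partition"
                )
            collated[sample_id] = files
    return collated


merged = collate_annotations(
    [
        {"sample1": ["allele_mapping_data.txt"]},
        {"sample2": ["allele_mapping_data.txt"]},
    ]
)
```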

ENH: Add new action that predicts the origin of detected ARGs.

Add a new action called kmer-query-card that can predict the origin of ARGs identified with the action annotate-reads-card. kmer-query-card will use the RGI function kmer-query.

CARD provides a data set of AMR alleles and their distribution among pathogens and plasmids. CARD's k-mer classifiers sub-sample these sequences to identify k-mers uniquely found within AMR alleles of individual pathogen species, pathogen genera, pathogen-restricted plasmids, or promiscuous plasmids.

RGI kmer_query: https://github.com/arpcard/rgi?tab=readme-ov-file#using-rgi-kmer-query-k-mer-taxonomic-classification
