q2-amr's People

Contributors

misialq, vinzentrisch

Forkers

vinzentrisch

q2-amr's Issues

Make sure installation instructions work

Update the qiime2 conda channel to https://packages.qiime2.org/qiime2/2023.9/shotgun/released/ (for the latest released version) and check that everything works as expected.

Also, can you please try without the --no-channel-priority flag to check whether it works now?

ENH: Change how the CARD database is loaded with function `load_card`

To run any RGI command, all required database files have to be loaded first, either globally, so that the loaded files can be used from anywhere, or locally. When loaded locally, the files are copied into a directory called LocalDB inside the working directory.
All actions use the function load_card_db, which runs RGI load to load all required files locally.
If the annotation actions (annotate_mags_card, annotate_reads_card) and the kmer_query_card functions are run in parallel, every partition loads the required database files into a temp directory. For kmer_query_card these files amount to 6.25 GB of disk space per partition, so the available disk space limits the number of partitions that can be run.
Copying the files into a temp directory also offers no advantage.

The load_card_db should be changed to load the files globally. The paths to the database files also have to be changed in all actions that use the databases.
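The difference between the two loading modes can be sketched as follows. This is a minimal sketch, assuming RGI's documented `rgi load --card_json ... [--local]` command line; the helper name `build_rgi_load_cmd` and the file name `card.json` are hypothetical and not part of q2-amr. The sketch only builds the argument lists and does not run RGI.

```python
from typing import List


def build_rgi_load_cmd(card_json: str, local: bool = False) -> List[str]:
    """Build the argument list for `rgi load` (hypothetical helper).

    With local=False the database is loaded globally. With local=True
    the --local flag is added, so RGI copies the files into a database
    directory inside the current working directory.
    """
    cmd = ["rgi", "load", "--card_json", card_json]
    if local:
        cmd.append("--local")
    return cmd


# "card.json" is a placeholder path used only for illustration.
global_cmd = build_rgi_load_cmd("card.json")
local_cmd = build_rgi_load_cmd("card.json", local=True)
print(" ".join(global_cmd))
print(" ".join(local_cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` in an environment where RGI is installed.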

There is also a command called RGI clean that removes all previously loaded database files. I am not sure whether this functionality has to be introduced for q2-amr.

ENH: Add new q2-amr action that builds a kmer database of custom kmer lengths.

Add new q2-amr action called kmer-build-card that builds a kmer database of custom kmer lengths. kmer-build-card will use the RGI function kmer_build.

The kmer database downloaded with the fetch-card-db action is the standard database with kmers of length 61. With the action kmer-build-card it will be possible to generate a custom kmer database with kmers of any length.

RGI kmer_build: https://github.com/arpcard/rgi?tab=readme-ov-file#building-custom-k-mer-classifiers

ENH: Add possibility to parallelise the `annotate_reads_card` action.

  • Add possibility to parallelise the annotate_reads_card action with Parsl
  • To achieve this, a new pipeline has to be created that uses the partition functions partition_samples_single and partition_samples_paired of q2-demux for single-end and paired-end reads, the annotate_reads_card action, and the collate functions collate_reads_allele_annotations and collate_reads_gene_annotations of q2-amr that will be added with #56. The function merge of q2-feature-table will also be needed to merge the FeatureTables.
  • This means that q2-feature-table and q2-demux will become dependencies of q2-amr.

BUG: ValueError when running `collate-reads-allele-annotations`

When running:
qiime amr collate-reads-allele-annotations --i-annotations partition_reads_allele/sample1.qza partition_reads_allele/sample2.qza --o-collated-annotations partition_reads_allele/collate.qza

This error occurs:
ValueError: Cannot place Tuple[SampleData[CARDAlleleAnnotation % Properties('kma', 'bowtie2')]] and Tuple[SampleData[CARDAlleleAnnotation % Properties('kma', 'bwa')]] in the same type variable.

I do not know where this error comes from.
The error is fixed when the not-yet-released qiime2 version 2024.5 is installed.

Until this version is released, the installation instructions have to be changed to include:
pip install git+https://github.com/qiime2/qiime2.git

ENH: Add possibility to parallelise the `annotate_mags_card` action.

  • Add possibility to parallelise the annotate_mags_card action with Parsl
  • To achieve this, a new pipeline has to be created that uses the partition function partition_sample_data_mags of q2-moshpit for MAGs, the annotate_mags_card action, and the collate function collate_mags_annotations of q2-amr that will be added with #56. The function merge of q2-feature-table will also be needed to merge the FeatureTables.
  • This means that q2-feature-table and q2-moshpit will become dependencies of q2-amr.

BUG: Invalid value for '--i-amr-reads-annotation' in `visualize-annotation-stats`

This error message occurs when using visualize-annotation-stats:

(1/1) Invalid value for '--i-amr-reads-annotation': Expected an artifact of
at least type CARDGeneAnnotation | CARDAlleleAnnotation. An artifact of type
SampleData[CARDAlleleAnnotation] was provided.

To fix this issue, the input in the plugin_setup of visualize-annotation-stats has to be changed from CARDGeneAnnotation | CARDAlleleAnnotation to SampleData[CARDGeneAnnotation | CARDAlleleAnnotation].

ENH: Add three actions that can partition the annotations of mags and reads for parallelization of the kmer-query functions.

  • Three actions will be added called partition_mags_annotations, partition_reads_allele_annotations, partition_reads_gene_annotations.
  • These actions will be able to partition artifacts of types SampleData[CARDAnnotation], SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] respectively into collections of a specified number of artifacts with the same type.
  • These collections will be needed to make parallelisation of the actions kmer_query_mags_card and kmer_query_reads_card possible. These actions will be added with #21.
  • For now these have to be three different actions but can be unified into one, when qiime2 allows multiple output types.
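The partitioning step described above can be sketched in plain Python. This is only an illustration of splitting sample IDs into a requested number of groups; the function name `partition_samples` and the round-robin assignment are assumptions, not the q2-amr implementation, which operates on QIIME 2 artifacts.

```python
from typing import Dict, List


def partition_samples(sample_ids: List[str], num_partitions: int) -> Dict[int, List[str]]:
    """Distribute sample IDs round-robin over numbered partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Never create more partitions than there are samples.
    num_partitions = min(num_partitions, len(sample_ids))
    partitions: Dict[int, List[str]] = {i: [] for i in range(1, num_partitions + 1)}
    for idx, sample_id in enumerate(sorted(sample_ids)):
        partitions[idx % num_partitions + 1].append(sample_id)
    return partitions


# Hypothetical sample IDs, for illustration only.
result = partition_samples(["s1", "s2", "s3", "s4", "s5"], 2)
print(result)
```

Each resulting group would then be processed independently, for example by one parallel worker per partition.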

ENH: Add transformer for kmer analysis formats to Metadata

The actions kmer-query-reads-card and kmer-query-mags-card create outputs with three different directory formats called CARDMAGsKmerAnalysisDirectoryFormat, CARDReadsAlleleKmerAnalysisDirectoryFormat and CARDReadsGeneKmerAnalysisDirectoryFormat. These formats are all associated with SampleData types that contain text files in a per-sample directory structure. To explore this data with the action metadata tabulate, all text files have to be merged into one pandas DataFrame.

  • Three new transformers have to be added.
  • Helper function tabulate_data has to be modified.
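The merging step could look roughly like the sketch below. It assumes tab-separated text files with a header row, stored in one subdirectory per sample; both the layout details and the function name `tabulate_sample_files` are assumptions. The sketch uses the standard-library csv module as a stand-in and returns a list of row dictionaries, whereas the actual transformer would build a pandas DataFrame.

```python
import csv
from pathlib import Path
from typing import Dict, List


def tabulate_sample_files(root: Path, pattern: str = "*.txt") -> List[Dict[str, str]]:
    """Merge all per-sample text files under root into one list of rows.

    Each row gets an extra "Sample ID" entry taken from the name of the
    sample directory, so the origin of every row stays traceable.
    """
    rows: List[Dict[str, str]] = []
    for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for txt_file in sorted(sample_dir.glob(pattern)):
            with txt_file.open(newline="") as handle:
                for row in csv.DictReader(handle, delimiter="\t"):
                    rows.append({"Sample ID": sample_dir.name, **row})
    return rows
```

A list of dictionaries in this shape can be passed directly to `pandas.DataFrame(rows)` where pandas is available.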

`CARDGeneAnnotationDirectoryFormat` contains stats file that is not needed

  • The action annotate-reads-card produces artifacts of type CARDAlleleAnnotation and CARDGeneAnnotation containing the amr annotations but also mapping statistics per sample. This file, called "overall-mapping-stats.txt", is only needed in one artifact.

  • To address this issue the CARDGeneAnnotationDirectoryFormat and the annotate-reads-card action have to be altered.

  • Also the visualizer visualize-annotation-stats has to be altered to only accept inputs of CARDAlleleAnnotation

BUG: Heatmap produces error because it does not recognise JSON files

When running qiime amr heatmap --i-amr-annotation amr_annotations.qza --o-visualization amr_annotations.qzv this error message is produced:

ERROR 2024-01-18 14:18:43,081 : Error: No data recovered from JSONs, cannot build heatmap. Please check you are using RGI results from ver 4.0.0 or greater.

The problem is most likely that the JSON files get moved into one directory and renamed to their sample and bin name. They lose the .json extension during renaming and are no longer recognised by RGI.
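One possible way to avoid losing the extension is to carry the original suffix over when building the new file name. The following is a minimal sketch under the assumption that the renaming happens per file; the function name `move_json_keep_extension` and the `sample_bin` naming scheme are hypothetical and may differ from the plugin's code.

```python
import shutil
import tempfile
from pathlib import Path


def move_json_keep_extension(src: Path, dest_dir: Path, sample_id: str, bin_id: str) -> Path:
    """Move src into dest_dir, renamed to '<sample>_<bin>' plus its original suffix."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    # src.suffix is ".json" for a JSON file, so the extension survives the rename.
    dest = dest_dir / f"{sample_id}_{bin_id}{src.suffix}"
    shutil.move(str(src), str(dest))
    return dest


# Demonstration with a throwaway file; all names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "amr_annotation.json"
    source.write_text("{}")
    moved = move_json_keep_extension(source, Path(tmp) / "collected", "sample1", "bin1")
    print(moved.name)
```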

Adding `Genelengths` type

As described in #23, several normalization methods will be added to the plugin.
The FPKM and TPM methods both need information about the gene lengths.
This information can be extracted from the artifacts with type CARDAlleleAnnotation and CARDGeneAnnotation.
To ensure that feature tables from other sources can be normalized with these methods, a new type called Genelength has to be introduced.
The type will include one TSV file with two columns: One with the gene names and one with the corresponding gene lengths.
Additionally a transformer has to be introduced that converts CARDAlleleAnnotation and CARDGeneAnnotation to Genelength.
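To illustrate why gene lengths are required, here is a minimal sketch of the standard TPM calculation (reads per kilobase, rescaled so that each sample sums to one million). It is written from the textbook definition and is not the API of any normalization package; the gene names and numbers are made up.

```python
from typing import Dict


def tpm(counts: Dict[str, float], gene_lengths: Dict[str, float]) -> Dict[str, float]:
    """Compute TPM for one sample from raw counts and gene lengths in bp."""
    # Step 1: reads per kilobase, which corrects for gene length.
    rpk = {gene: counts[gene] / (gene_lengths[gene] / 1000) for gene in counts}
    # Step 2: scale so that all values of the sample sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    if scale == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: value / scale for gene, value in rpk.items()}


# Two hypothetical genes with equal counts but different lengths.
example = tpm({"geneA": 100, "geneB": 100}, {"geneA": 1000, "geneB": 2000})
print(example)
```

With equal read counts, the shorter gene receives the higher TPM value, which is exactly the information a feature table without gene lengths cannot provide.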

When running `annotate-reads-card` without the flag `--p-include-wildcard` the gene and allele mapping outputs contain the same information

RGI bwt (the underlying function used by annotate-reads-card) outputs annotations where alleles are mapped to ARGs in CARD ("allele_mapping_data.txt"). This output is then further summarized at the gene level and written to an extra file ("gene_mapping_data.txt").
But when using annotate-reads-card without the flag --p-include-wildcard, there are not multiple alleles per gene in the database. This means that the gene mapping data contains in that case the same information as the allele mapping data.

To solve this issue:

  • The annotate-reads-card action has to be altered to move an empty TXT file to the CARDGeneAnnotation artifact and an empty feature table to the FeatureTable[Frequency] artifact.
  • The validation of the CARDGeneAnnotationDirectoryFormat has to be altered to accept empty files.

Update README

The current README is a little out of date, so let's update it. Below are some things I noticed, but please add anything that may still be missing:

ENH: Adding new action to download NCBI database for AMRFinderPlus

After the implementation of CARD's RGI, the next AMR tool to be included in q2-amr will be NCBI's AMRFinderPlus.

The first action to be included will be one that downloads the NCBI database with the command amrfinder -u. This action will download the newest version of the NCBI database and can also be used to update the database.
The database is stored at */miniconda3/envs/amrfinder/share/amrfinderplus/data/2024-01-31.1/ automatically. The database will not need a new qiime2 type.

BUG: "biom.exception.TableException: Duplicate observation IDs" when running `annotate-reads-card`

The error message biom.exception.TableException: Duplicate observation IDs occurs when running the action annotate-reads-card with the flag --p-include-wildcard.

The error occurs during the creation of the feature table, when the pd.DataFrame gets converted into a biom file.
The biom format does not allow duplicated ID values.
The helper function create_count_table uses the column ARO_Term in the file allele_mapping_data.txt. When using annotate-reads-card without the flag --p-include-wildcard, this does not lead to any issues. But with the flag, a secondary database with multiple alleles per gene is introduced, which can lead to duplications in the ARO_Term column.

To solve this issue a secondary unique identifier has to be added to the ARO_Term column to ensure there are no duplicates in the index of the biom file.
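The proposed fix can be sketched as follows. The column ARO_Term comes from the issue text; the secondary column name `Reference_Sequence`, the `|` separator and the function name `make_unique_ids` are assumptions chosen for illustration, and the example rows are made up.

```python
from collections import Counter
from typing import Dict, List


def make_unique_ids(
    rows: List[Dict[str, str]],
    primary: str = "ARO_Term",
    secondary: str = "Reference_Sequence",  # assumed column name
) -> List[str]:
    """Combine two columns into one identifier and verify uniqueness."""
    ids = [f"{row[primary]}|{row[secondary]}" for row in rows]
    duplicates = [value for value, n in Counter(ids).items() if n > 1]
    if duplicates:
        raise ValueError(f"Identifiers are still duplicated: {duplicates}")
    return ids


# Two hypothetical alleles of the same gene share one ARO_Term.
rows = [
    {"ARO_Term": "geneX", "Reference_Sequence": "ref_1"},
    {"ARO_Term": "geneX", "Reference_Sequence": "ref_2"},
]
unique_ids = make_unique_ids(rows)
print(unique_ids)
```

Using the combined identifiers as the table index keeps one row per allele while avoiding duplicate observation IDs.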

Tests don't run in CI, because of pyarrow dependency of Altair

In the CI check ci / q2-amr (ubuntu-latest) there is an error in one of the tests.
The error message is:
FAILED test-env/lib/python3.8/site-packages/q2_amr/card/tests/test_reads.py::TestAnnotateReadsCARD::test_plot_sample_stats - RuntimeError: The pyarrow package must be version 11.0.0 or greater. Found version 8.0.0

I tried to add pyarrow >=11.0.0 to the meta.yaml file but then the environment can't be solved.
Because the problem appears during the test_plot_sample_stats test I assume that the issue lies with the package Altair. Altair is only used by the plot_sample_stats function.

A temporary solution would be to remove all functions that use Altair. This means the action visualize-annotation-stats and all functions associated with it have to be removed.

ENH: Add aligner property to annotate-reads-card outputs

The annotate-reads-card action creates outputs of type SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation].
These outputs can be used with the function kmer-analysis that is currently under development.
But the kmer-analysis can only be performed if it is known which of the three possible aligners of the annotate-reads-card function was used to create the outputs.

Because of this, one of three possible properties ("kma", "bowtie2", "bwa") has to be added to the SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] outputs.

ENH: Add new action that can normalize `FeatureTable[Frequency]` with different methods.

The action annotate-reads-card generates a FeatureTable[Frequency] table. These tables contain the raw information of how many reads were mapped to a certain ARG in CARD. To make meaningful comparisons in and across samples, this information has to be normalized to gene length, library size and composition.

To achieve this, a new action has to be created that can normalize feature tables with current standard methods.
The integration will be done with RNAnorm, a Python package that is used for RNA-seq normalization and provides multiple methods such as CPM, TPM, FPKM, TMM, CTF, UQ and CUF.

ENH: Add three actions that can collate the annotations of mags and reads for parallelization of the kmer-query functions.

  • Three actions will be added called collate_mags_annotations, collate_reads_allele_annotations, collate_reads_gene_annotations.
  • These actions will be able to collate artifacts of types SampleData[CARDAnnotation], SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] respectively.
  • These collections will be needed to make parallelisation of the actions kmer_query_mags_card and kmer_query_reads_card possible. These actions will be added with #21.
  • For now these have to be three different actions but can be unified into one, when qiime2 allows multiple output types.
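The collate step is the inverse of partitioning and can be sketched in plain Python. This is only an illustration on dictionaries that map sample IDs to annotation locations; the function name `collate_partitions` and the duplicate check are assumptions, and the real actions work on QIIME 2 artifacts.

```python
from typing import Dict, Iterable


def collate_partitions(partitions: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-partition {sample_id: annotation_path} mappings into one."""
    collated: Dict[str, str] = {}
    for partition in partitions:
        for sample_id, annotation in partition.items():
            # A sample must belong to exactly one partition.
            if sample_id in collated:
                raise ValueError(f"Sample {sample_id!r} appears in more than one partition")
            collated[sample_id] = annotation
    return collated


# Hypothetical partitions with placeholder paths.
merged = collate_partitions([
    {"sample1": "part1/sample1"},
    {"sample2": "part2/sample2"},
])
print(merged)
```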

ENH: Add new action that predicts the origin of detected ARGs.

Add a new action called kmer-query-card that can predict the origin of ARGs identified with the action annotate-reads-card. kmer-query-card will use the RGI function kmer-query.

CARD provides a data set of AMR alleles and their distribution among pathogens and plasmids. CARD's k-mer classifiers sub-sample these sequences to identify k-mers uniquely found within AMR alleles of individual pathogen species, pathogen genera, pathogen-restricted plasmids, or promiscuous plasmids.

RGI kmer_query: https://github.com/arpcard/rgi?tab=readme-ov-file#using-rgi-kmer-query-k-mer-taxonomic-classification
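The idea of k-mers that are uniquely found in one group can be illustrated with a toy example. This is a conceptual sketch only, not RGI's implementation: it uses one short made-up sequence per group and a very small k, and the function names are hypothetical.

```python
from typing import Dict, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmers(sequences_by_group: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """For each group, keep only the k-mers that occur in no other group."""
    all_kmers = {group: kmers(seq, k) for group, seq in sequences_by_group.items()}
    unique: Dict[str, Set[str]] = {}
    for group, group_kmers in all_kmers.items():
        others = set().union(*(km for g, km in all_kmers.items() if g != group))
        unique[group] = group_kmers - others
    return unique


# Made-up sequences; the group names are placeholders.
toy = unique_kmers({"species_A": "ACGTT", "species_B": "CGTTA"}, k=3)
print(toy)
```

A query sequence containing a k-mer from exactly one group's unique set would hint at that group as its likely origin.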
