bokulich-lab / q2-amr
License: BSD 3-Clause "New" or "Revised" License
The PR that added the progress bar introduced the dependency tqdm, but it was not added to the README.
kmer-query is going to be added to q2-amr. This function needs a file called sorted.length_100.bam as input.
CARDAlleleAnnotation directory.
Update the qiime2 conda channel to https://packages.qiime2.org/qiime2/2023.9/shotgun/released/ (for the latest released version) and check that everything works as expected.
Also, can you please try without the --no-channel-priority flag to check whether it works now?
When calling sample_dict on an artifact with format CARDAnnotationDirectoryFormat, the resulting dict includes all samples in the artifact but only one MAG per sample, even if a sample contains more than one MAG.
The function iterates through all MAGs in a sample, but each iteration overwrites the entry written for the previous MAG, so only the last MAG is kept.
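The overwrite described above can be avoided by keeping one inner dict per sample. The following is only a minimal sketch, not the plugin's actual sample_dict: the function name build_sample_dict, the sample/MAG directory layout and the file name amr_annotation.txt are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def build_sample_dict(root):
    """Map sample ID -> {MAG ID -> annotation file path} (hypothetical layout)."""
    samples = {}
    for sample_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # One inner dict per sample. Re-assigning samples[sample] inside the
        # MAG loop would drop every MAG except the last one.
        mags = samples.setdefault(sample_dir.name, {})
        for mag_dir in sorted(p for p in sample_dir.iterdir() if p.is_dir()):
            # "amr_annotation.txt" is an assumed file name for illustration
            mags[mag_dir.name] = mag_dir / "amr_annotation.txt"
    return samples


def demo():
    """Build a throwaway directory tree with two MAGs in one sample."""
    with tempfile.TemporaryDirectory() as tmp:
        for mag in ("mag1", "mag2"):
            (Path(tmp) / "sample1" / mag).mkdir(parents=True)
        return build_sample_dict(tmp)
```

Calling demo() returns a dict in which sample1 holds entries for both mag1 and mag2.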
To run any command with RGI, all needed database files have to be loaded, either globally, so the loaded files can be used from anywhere, or locally. When loading locally, the files are copied to a directory called LocalDB inside the working directory.
All actions use the function load_card_db, which runs the function RGI load to load all the needed files locally.
If you run the annotation functions (annotate_mags_card, annotate_reads_card) and the kmer_query_card function in parallel, every partition loads the needed database files into a temp directory. For kmer_query_card these files amount to 6.25 GB of disk space per partition, so the number of partitions you can run is limited by the available disk space.
Copying the files into a temp directory also has no advantage.
load_card_db should be changed to load the files globally. The paths to the database files also have to be changed in all actions that use the databases.
There is a command called RGI clean that removes all previously loaded database files. I am not sure whether this functionality has to be introduced in q2-amr.
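To make the disk space limit concrete, here is a rough back-of-the-envelope sketch. It only uses the 6.25 GB figure quoted above; the helper names max_partitions and free_disk_gb are made up for this example and are not part of q2-amr.

```python
import math
import shutil

DB_SIZE_GB = 6.25  # per-partition database size quoted above for kmer_query_card


def max_partitions(free_gb, db_size_gb=DB_SIZE_GB):
    """Number of partitions whose local database copies fit into free_gb."""
    return math.floor(free_gb / db_size_gb)


def free_disk_gb(path="."):
    """Free space at path in GB (10**9 bytes), read via shutil.disk_usage."""
    return shutil.disk_usage(path).free / 10**9
```

For example, max_partitions(50) gives 8, so 50 GB of free space would be used up by the database copies of eight partitions, before counting any other temporary files.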
Add a new q2-amr action called kmer-build-card that builds a kmer database of custom kmer lengths. kmer-build-card will use the RGI function kmer_build.
The kmer database downloaded with the fetch-card-db action is the standard database with kmers of length 61. With the action kmer-build-card it will be possible to generate a custom kmer database with kmers of any length.
RGI kmer_build: https://github.com/arpcard/rgi?tab=readme-ov-file#building-custom-k-mer-classifiers
annotate_reads_card action with Parsl
annotate_reads_card action and the collate functions collate_reads_allele_annotations and collate_reads_gene_annotations of q2-amr that will be added with #56. Also, the function merge of q2-feature-table will be needed to merge the FeatureTables.
For #21, two new actions will be added that run kmer-query of RGI for the reads CARDAlleleAnnotation and MAGs CARDAnnotation output.
The outputs of these two new actions need two new types called CARDReadsKmerAnalysis and CARDMAGsKmerAnalysis.
When running:
qiime amr collate-reads-allele-annotations --i-annotations partition_reads_allele/sample1.qza partition_reads_allele/sample2.qza --o-collated-annotations partition_reads_allele/collate.qza
This error occurs:
ValueError: Cannot place Tuple[SampleData[CARDAlleleAnnotation % Properties('kma', 'bowtie2')]] and Tuple[SampleData[CARDAlleleAnnotation % Properties('kma', 'bwa')]] in the same type variable.
I do not know where this error could come from.
The error is fixed if the not yet released qiime2 version 2024.5 is installed.
Until this version is released, the installation instructions have to be changed to include:
pip install git+https://github.com/qiime2/qiime2.git
annotate_mags_card action with Parsl
partition_sample_data_mags q2-moshpit for MAGs, the annotate_mags_card action and the collate function collate_mags_annotations of q2-amr that will be added with #56. Also, the function merge of q2-feature-table will be needed to merge the FeatureTables.
Coverage is low in the files _formats.py and utils.py, so new unit tests have to be added to increase coverage.
This error message occurs when using visualize-annotation-stats:
(1/1) Invalid value for '--i-amr-reads-annotation': Expected an artifact of
at least type CARDGeneAnnotation | CARDAlleleAnnotation. An artifact of type
SampleData[CARDAlleleAnnotation] was provided.
To fix this issue, the input in the plugin_setup of visualize-annotation-stats has to be changed from CARDGeneAnnotation | CARDAlleleAnnotation to SampleData[CARDGeneAnnotation | CARDAlleleAnnotation].
partition_mags_annotations, partition_reads_allele_annotations, partition_reads_gene_annotations.
SampleData[CARDAnnotation], SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] respectively into collections of a specified number of artifacts with the same type.
kmer_query_mags_card and kmer_query_reads_card possible. These actions will be added with #21.
The actions kmer-query-reads-card and kmer-query-mags-card create outputs with three different directory formats called CARDMAGsKmerAnalysisDirectoryFormat, CARDReadsAlleleKmerAnalysisDirectoryFormat and CARDReadsGeneKmerAnalysisDirectoryFormat. These formats are all associated with SampleData types that contain text files in a per-sample directory structure. To explore this data with the action metadata tabulate, all text files have to be merged into one pandas DataFrame.
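A minimal sketch of the merge step, assuming one tab-separated text file per sample directory. The function name merge_sample_tables, the file name kmer_analysis.txt and the column names in the demo are placeholders, not the actual RGI output layout. The sketch uses only the standard library and returns a list of row dicts.

```python
import csv
import tempfile
from pathlib import Path


def merge_sample_tables(root, filename):
    """Collect the rows of <root>/<sample>/<filename> for every sample."""
    rows = []
    for sample_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        with open(sample_dir / filename, newline="") as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                # Tag each row with its sample so the merged table stays traceable
                rows.append({"sample_id": sample_dir.name, **row})
    return rows


def demo():
    """Write two tiny made-up per-sample files and merge them."""
    with tempfile.TemporaryDirectory() as tmp:
        for sample, hits in (("sample1", "5"), ("sample2", "7")):
            sample_dir = Path(tmp) / sample
            sample_dir.mkdir()
            (sample_dir / "kmer_analysis.txt").write_text(
                f"ARO term\thits\ngeneA\t{hits}\n"
            )
        return merge_sample_tables(tmp, "kmer_analysis.txt")
```

The returned list can be passed directly to pandas.DataFrame(rows) to obtain the single table needed for metadata tabulate.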
The action annotate-reads-card produces artifacts of type CARDAlleleAnnotation and CARDGeneAnnotation that contain the AMR annotations as well as the mapping statistics per sample. The mapping statistics file, called "overall-mapping-stats.txt", is only needed in one artifact.
To address this issue, the CARDGeneAnnotationDirectoryFormat and the annotate-reads-card action have to be altered.
The visualizer visualize-annotation-stats also has to be altered to only accept inputs of type CARDAlleleAnnotation.
The dev section of the README says we follow black style. Is this really the case? If not, let's reformat.
It would also be great if you could add a GitHub action to test the code at every commit - see here: https://black.readthedocs.io/en/stable/integrations/github_actions.html.
When running qiime amr heatmap --i-amr-annotation amr_annotations.qza --o-visualization amr_annotations.qzv this error message is produced:
ERROR 2024-01-18 14:18:43,081 : Error: No data recovered from JSONs, cannot build heatmap. Please check you are using RGI results from ver 4.0.0 or greater.
The problem most likely is that the JSON files get moved into one directory and renamed to their sample and bin name. They lose the .json extension during renaming and are not recognised by RGI.
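One possible way to keep the extension while renaming is sketched below. This is an illustration under assumptions, not the plugin's code: the function name collect_jsons, the sample/bin directory layout and the sample_bin naming scheme are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def collect_jsons(src_root, dest_dir):
    """Copy <src_root>/<sample>/<bin>/*.json into dest_dir as <sample>_<bin>.json."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for json_file in sorted(Path(src_root).glob("*/*/*.json")):
        sample, bin_name = json_file.parts[-3], json_file.parts[-2]
        # Re-attach the original suffix so the renamed file still ends in .json
        target = dest_dir / f"{sample}_{bin_name}{json_file.suffix}"
        shutil.copy(json_file, target)
        copied.append(target.name)
    return copied


def demo():
    """Create one made-up annotation file and collect it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "annotations" / "sample1" / "bin1"
        src.mkdir(parents=True)
        (src / "amr_annotation.json").write_text("{}")
        return collect_jsons(Path(tmp) / "annotations", Path(tmp) / "heatmap_input")
```

In the demo, the single input file ends up as sample1_bin1.json in the target directory.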
As described in #23, several normalization methods will be added to the plugin.
The FPKM and TPM methods both need information about the gene lengths.
This information can be extracted from the artifacts with type CARDAlleleAnnotation and CARDGeneAnnotation.
To ensure that feature tables from other sources can be normalized with these methods, a new type called Genelength has to be introduced.
The type will include one TSV file with two columns: one with the gene names and one with the corresponding gene lengths.
Additionally, a transformer has to be introduced that converts CARDAlleleAnnotation and CARDGeneAnnotation to Genelength.
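To illustrate why both methods need the gene lengths, here is a minimal sketch of the textbook FPKM and TPM definitions for a single sample. It is not the implementation that will be used in the plugin, and the gene names, counts and lengths are made up.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of gene per million mapped reads."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths[g] * total) for g, c in counts.items()}


def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {g: c / lengths[g] for g, c in counts.items()}
    rate_sum = sum(rates.values())
    return {g: r / rate_sum * 1e6 for g, r in rates.items()}


counts = {"geneA": 10, "geneB": 20}       # made-up read counts for one sample
lengths = {"geneA": 1000, "geneB": 2000}  # made-up gene lengths in bp
```

With these values, geneB has twice the reads of geneA but is also twice as long, so both genes get the same TPM and the same FPKM. Without the lengths, that correction would not be possible.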
A while ago we introduced the additional installation step for the RGI fix from git+https://github.com/misialq/rgi.git@py38-fix. Could you please test whether this is still needed? RGI was updated in the meantime (I think...) so maybe we got lucky :D
The action annotate-mags-card has a FeatureTable[Presence/Absence] as one output. But because one ARG can occur multiple times in one MAG, the frequency information is lost.
To account for this, the output has to be changed to a FeatureTable[Frequency].
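The difference between the two table types can be shown with a toy example. The ARG names are made up and the snippet only illustrates the information loss; it is not how the plugin builds its tables.

```python
from collections import Counter

# Made-up ARG hits for a single MAG; geneA is found twice
hits = ["geneA", "geneB", "geneA"]

frequency = Counter(hits)                    # keeps how often each ARG occurs
presence_absence = {arg: 1 for arg in hits}  # collapses repeated hits to 1
```

Here frequency records two hits for geneA, while presence_absence only records that geneA was found.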
RGI bwt (the underlying function used by annotate-reads-card) outputs annotations where alleles are mapped to ARGs in CARD ("allele_mapping_data.txt"). This output is then further summarized at the gene level and written to an extra file ("gene_mapping_data.txt").
But when using annotate-reads-card without the flag --p-include-wildcard, there are not multiple alleles per gene in the database. In that case the gene mapping data contains the same information as the allele mapping data.
To solve this issue:
annotate-reads-card action has to be altered to move an empty TXT file to the CARDGeneAnnotation artifact and an empty feature table to the FeatureTable[Frequency] artifact.
CARDGeneAnnotationDirectoryFormat has to be altered to accept empty files.
The current README is a little bit out of date, so let's update it. Below are some things I noticed, but please add anything that may still be missing:
After the implementation of RGI from CARD, the next AMR tool to be included in q2-amr will be AMRFinderPlus from NCBI.
The first action to be included will be one that downloads the NCBI database with the command amrfinder -u. This action will download the newest version of the NCBI database and can also be used to update the database.
The database is automatically stored at */miniconda3/envs/amrfinder/share/amrfinderplus/data/2024-01-31.1/. The database will not need a new qiime2 type.
The error message biom.exception.TableException: Duplicate observation IDs occurs when running the action annotate-reads-card with the flag --p-include-wildcard.
The error occurs during the creation of the feature table, when the pd.DataFrame gets converted into a biom file.
The biom format does not allow duplicated ID values.
The helper function create_count_table uses the column ARO_Term in the file allele_mapping_data.txt. When using annotate-reads-card without the flag --p-include-wildcard, this does not lead to any issues. But with the flag, a secondary database with multiple alleles per gene is introduced, which can lead to duplicates in the ARO_Term column.
To solve this issue, a secondary unique identifier has to be added to the ARO_Term column to ensure there are no duplicates in the index of the biom file.
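A minimal sketch of the proposed fix, assuming some secondary identifier is available for every row. Which column serves as that identifier is not decided here. The function name make_unique, the "|" separator and the example values are made up for illustration.

```python
from collections import Counter


def make_unique(terms, secondary_ids):
    """Append the secondary ID to every term that occurs more than once."""
    occurrences = Counter(terms)
    return [
        f"{term}|{secondary}" if occurrences[term] > 1 else term
        for term, secondary in zip(terms, secondary_ids)
    ]


# Made-up values: two alleles of the same gene share one ARO term
aro_terms = ["geneA", "geneA", "geneB"]
allele_ids = ["allele1", "allele2", "allele3"]
unique_ids = make_unique(aro_terms, allele_ids)
```

The result is only free of duplicates if the secondary identifiers are themselves unique within each ARO term, so that property would have to hold for whichever column is chosen.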
In the CI check ci / q2-amr (ubuntu-latest) there is an error in one of the tests.
The error message is:
FAILED test-env/lib/python3.8/site-packages/q2_amr/card/tests/test_reads.py::TestAnnotateReadsCARD::test_plot_sample_stats - RuntimeError: The pyarrow package must be version 11.0.0 or greater. Found version 8.0.0
I tried to add pyarrow >=11.0.0 to the meta.yaml file but then the environment can't be solved.
Because the problem appears during the test_plot_sample_stats test, I assume that the issue lies with the package Altair. Altair is only used by the plot_sample_stats function.
A temporary solution would be to remove all functions that use Altair. This means the action visualize-annotation-stats and all functions associated with it have to be removed.
The annotate-reads-card action creates outputs of type SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation].
These outputs can be used with the function kmer-analysis that is currently under development.
But the kmer-analysis can only be performed if the outputs were created with one of the three possible aligners of the annotate-reads-card function.
Because of this, one of three possible properties ("kma", "bowtie2", "bwa") has to be added to the SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] outputs.
fetch-card-db action
The action annotate-reads-card generates a FeatureTable[Frequency] table. These tables contain the raw information of how many reads were mapped to a certain ARG in CARD. To make meaningful comparisons within and across samples, this information has to be normalized to gene length, library size and composition.
To achieve this, a new action has to be created that can normalize feature tables with current standard methods.
The integration will be done with RNAnorm, a Python package that is used for RNA-seq normalization and provides multiple methods like CPM, TPM, FPKM, TMM, CTF, UQ and CUF.
collate_mags_annotations, collate_reads_allele_annotations, collate_reads_gene_annotations.
SampleData[CARDAnnotation], SampleData[CARDAlleleAnnotation] and SampleData[CARDGeneAnnotation] respectively.
kmer_query_mags_card and kmer_query_reads_card possible. These actions will be added with #21.
Add a new action called kmer-query-card that can predict the origin of ARGs identified with the action annotate-reads-card. kmer-query-card will use the RGI function kmer-query.
CARD provides a data set of AMR alleles and their distribution among pathogens and plasmids. CARD's k-mer classifiers sub-sample these sequences to identify k-mers uniquely found within AMR alleles of individual pathogen species, pathogen genera, pathogen-restricted plasmids, or promiscuous plasmids.
RGI kmer_query: https://github.com/arpcard/rgi?tab=readme-ov-file#using-rgi-kmer-query-k-mer-taxonomic-classification