Giter Site home page Giter Site logo

alpar's Introduction

ALPAR - Automated Learning Pipeline for Antimicrobial Resistance

Automated Learning Pipeline for Antimicrobial Resistance

Pipeline

(Orange parts are not available in lite version)

Installation

Single-Reference AMR is installable from conda using mamba:

To install it into the new environment:

mamba create -n alpar -c conda-forge -c kalininalab -c bioconda -c etetoolkit alpar
conda activate alpar
pip install panacota

Or to install it into the already existing environment:

mamba install -c conda-forge -c kalininalab -c bioconda -c etetoolkit alpar
pip install panacota

Lite Version

ALPAR Lite is installable from conda using mamba:

Lite version of the tool, missing; DataSAIL (Split Against Information Leakage), Pyseer (GWAS), PanACoTA (Alignment-Based Phylogenetic Tree Creation), MashTree (Alignment-Free Phylogenetic Tree Creation), PRPS (Phylogeny Related Parallelism Score), AutoSklearn (Machine Learning Parameter Optimization)

For more information : ALPAR-Lite

Example Files

Example files can be downloaded from:

Example files

Automatic Pipeline

From genomic files, creates binary mutation and phenotype tables, applies thresholds, creates phylogenetic tree, conducts GWAS analysis, calculates PRPS score and trains machine learning models with conducting feature importance analysis and splitting data aginst information leakage with DataSAIL against all the given antibiotics.

  • Input, -i: Path of folder that have structure: input_folder -> antibiotic -> [Resistant, Susceptible]

    📦input_folder
    ┣ 📂antibiotic1
    ┃ ┣ 📂Resistant
    ┃ ┃ ┣ 📜fasta1.fna
    ┃ ┃ ┗ 📜fasta2.fna
    ┃ ┃ ┗ ...
    ┃ ┗ 📂Susceptible
    ┃ ┃ ┣ 📜fasta3.fna
    ┃ ┃ ┗ 📜fasta4.fna
    ┃ ┃ ┗ ...
    ┗ 📂antibiotic2
    ┃ ┣ 📂Resistant
    ┃ ┃ ┣ 📜fasta2.fna
    ┃ ┃ ┗ 📜fasta5.fna
    ┃ ┃ ┗ ...
    ┃ ┗ 📂Susceptible
    ┃ ┃ ┣ 📜fasta2.fna
    ┃ ┃ ┗ 📜fasta3.fna
    ┃ ┃ ┗ ...
    ┗ 📂...
  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

  • Reference, --reference: Reference file path, accepted file formats are: .gbk .gbff

  • Custom database (Optional), --custom_database: Fasta file path for protein database creation, can be downloaded from UniProt accepted file formats are: .fasta

Basic usage:

alpar automatix -i example/example_files/ -o example/example_output/ --reference example/reference.gbff

Create Binary Tables

From genomic files, creates binary mutation and phenotype tables

  • Input, -i: Path of file that contains path of genomic fasta files per line or path of folder that have structure: input_folder -> antibiotic -> [Resistant, Susceptible]

    📦input_folder
    ┣ 📂antibiotic1
    ┃ ┣ 📂Resistant
    ┃ ┃ ┣ 📜fasta1.fna
    ┃ ┃ ┗ 📜fasta2.fna
    ┃ ┃ ┗ ...
    ┃ ┗ 📂Susceptible
    ┃ ┃ ┣ 📜fasta3.fna
    ┃ ┃ ┗ 📜fasta4.fna
    ┃ ┃ ┗ ...
    ┗ 📂antibiotic2
    ┃ ┣ 📂Resistant
    ┃ ┃ ┣ 📜fasta2.fna
    ┃ ┃ ┗ 📜fasta5.fna
    ┃ ┃ ┗ ...
    ┃ ┗ 📂Susceptible
    ┃ ┃ ┣ 📜fasta2.fna
    ┃ ┃ ┗ 📜fasta3.fna
    ┃ ┃ ┗ ...
    ┗ 📂...
  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

  • Reference, --reference: Reference file path, accepted file formats are: .gbk .gbff

  • Custom database (Optional), --custom_database: Fasta file path for protein database creation, can be downloaded from UniProt accepted file formats are: .fasta

  • Creation of phenotype table (Optional):

    • --create_phenotype_from_folder should be used
    • Genomes_folder_path should have a structure: input_folder -> antibiotic -> [Resistant, Susceptible] -> genomic fasta files
    📦input_folder
    ┣ 📂antibiotic1
    ┃ ┣ 📂Resistant
    ┃ ┃ ┣ 📜fasta1.fna
    ┃ ┃ ┗ 📜fasta2.fna
    ┃ ┃ ┗ ...
    ┃ ┗ 📂Susceptible
    ┃ ┃ ┣ 📜fasta3.fna
    ┃ ┃ ┗ 📜fasta4.fna
    ┃ ┃ ┗ ...
    ┗ 📂antibiotic2
    ┃ ┣ 📂Resistant
    ┃ ┃ ┣ 📜fasta2.fna
    ┃ ┃ ┗ 📜fasta5.fna
    ┃ ┃ ┗ ...
    ┃ ┗ 📂Susceptible
    ┃ ┃ ┣ 📜fasta2.fna
    ┃ ┃ ┗ 📜fasta3.fna
    ┃ ┃ ┗ ...
    ┗ 📂...

Basic usage:

alpar create_binary_tables -i example/example_files/ -o example/example_output/ --reference example/reference.gbff

Binary Table Threshold

Applies threshold to binary mutation table, and drops columns that has less than threshold percentage, useful to reduce sequencing errors in the data.

  • Input, -i: Binary mutation table path

  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

  • Threshold percentage, --threshold_percentage: Threshold percentage value to be used to drop columns. If column sum is less than this value, columns will be deleted from table

Basic usage:

alpar binary_tables_threshold -i example/example_output/binary_mutation_table.tsv -o example/example_output/ 

Phylogenetic Tree

Runs Phylogeny pipeline to create phylogenetic tree. (Alignment free)

  • Input, -i: Text file that contains path of each strain per line. It can be found in create_binary_tables output path as strains.txt

  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

  • Random names dictionary path, --random_names_dict: Random names text file path. If not provided, strain's original names will be used for phylogenetic tree

Basic usage:

alpar phylogenetic_tree -i example/example_output/strains.txt -o example/example_output/ --random_names_dict example/example_output/random_names.txt 

Panacota

Runs PanACoTA pipeline to create phylogenetic tree. (Alignment based)

  • Input, -i: Text file that contains path of each strain per line. It can be found in create_binary_tables output path as strains.txt

  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

  • Random names dictionary path, --random_names_dict: Random names text file path. If not provided, strain's original names will be used for phylogenetic tree

Basic usage:

alpar panacota -i example/example_output/strains.txt -o example/example_output/

GWAS

Runs GWAS analysis to detect important mutations in the data.

  • Input, -i: Binary mutation table path that is created via create_binary_tables command, can be found in create_binary_tables output path as binary_mutation_table_with_gene_presence_absence.tsv or binary_mutation_table.tsv or if threshold applied, can be found in binary_table_threshold output path as binary_mutation_table_threshold_*_percent.tsv

  • Phenotype, -p: Binary phenotype table path, can be found in create_binary_tables output path as phenotype_table.tsv if --create_phenotype_from_folder is used. Can also created manually and used.

  • Tree, -t : Phylogenetic tree path, can be found in panacota output path as phylogenetic_tree.newick or phylogeny output path as phylogenetic_tree.tree

  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

Basic usage:

alpar gwas -i example/example_output/binary_mutation_table_with_gene_presence_absence.tsv -p example/example_output/phenotype_table.tsv -t example/example_output/phylogeny/phylogenetic_tree.tree -o example_output/

PRPS

Runs PRPS (Phylogeny-Related Parallelism Score) to detect the mutations are more likely associated with phylogeny rather than antimicrobial resistance.

  • Input, -i: Binary mutation table path that is created via create_binary_tables command, can be found in create_binary_tables output path as binary_mutation_table.tsv or if threshold applied, can be found in binary_table_threshold output path as binary_mutation_table_threshold_*_percent.tsv

  • Tree, -t : Phylogenetic tree path, can be found in panacota output path as phylogenetic_tree.newick or phylogeny output path as phylogenetic_tree.tree

  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

Basic usage:

alpar prps -i example/example_output/binary_mutation_table.tsv -t example/example_output/phylogeny/phylogenetic_tree.tree -o example_output/

ML

Trains machine learning models with classification algorithms on the data and optimizes.
Available Classification algorithms: Random Forest, Support Vector Machine and Gradient Boosting

  • Input, -i: Binary mutation table path that is created via create_binary_tables command, can be found in create_binary_tables output path as binary_mutation_table_with_gene_presence_absence.tsv or binary_mutation_table.tsv

  • Phenotype, -p: Binary phenotype table path, can be found in create_binary_tables output path as phenotype_table.tsv if --create_phenotype_from_folder is used. Can also created manually and used.

  • Output, -o: Output folder path, where the output will be stored. If path exist, --overwrite option can be used to overwrite existing output.

  • Antibiotic, -a: Antibiotic name that model will be trained. Should match the name with column that represents phenotype in binary phenotype table

  • Optional arguments:

    • Machine learning algorithm, --ml_algorithm: Classification algorithm to be used, available selections: [rf, svm, gb]
    • Resampling strategy, --resampling_strategy: Resampling strategy to be used, available selections: [holdout, cv]
    • Parameter optimization, --parameter_optimization: Parameter optimization for model with autosklearn (https://automl.github.io/auto-sklearn/master/index.html)
    • Save model, --save_model: Save model
    • Feature importance analysis, --feature_importance_analysis: Analyze important features in the model with permutation importance
    • Datasail, --sail: Splits data into training and test sets against information leakage to train better models. Requires text file that contains path of each strain per line. It can be found in create_binary_tables output path as strains.txt

    More optional arguments can be found in help page:

    python alpar/sr_amr.py ml -h

Basic usage:

alpar ml -i example/example_output/binary_mutation_table.tsv -p example/example_output/phenotype_table.tsv -o example_output/ -a amikacin

alpar's People

Contributors

alperyurtseven avatar

Stargazers

Janani Ravi avatar

alpar's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.