MakeGNU

Snakemake pipeline for implementing WhatsGNU. This Snakemake workflow allows for downloading of microbial genome sequences, annotation with prokka, pangenome analysis with Roary and investigation of proteomic novelty with WhatsGNU.

Set Up

Installing the pipeline

git clone https://github.com/ArwaAbbas/MakeGNU
cd MakeGNU

Creating the environment

For most of the tools used in the pipeline, a separate conda environment is created when the rule runs. These dependencies are listed in Envs/. However, because of a known issue with the outdated tbl2asn script that ships with prokka, a little manual setup is necessary at the moment. First, we'll create the base snakemake environment:

conda create -c bioconda -c conda-forge -n MakeGNU snakemake
conda activate MakeGNU

Then we'll add prokka to the base environment and manually replace the outdated script.

conda install -c conda-forge -c bioconda -c defaults prokka=1.14.5
wget ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/linux64.tbl2asn.gz -O linux64.tbl2asn.gz 
gunzip linux64.tbl2asn.gz
mv linux64.tbl2asn ~/anaconda3/envs/MakeGNU/bin/tbl2asn
chmod +x ~/anaconda3/envs/MakeGNU/bin/tbl2asn

The path to the location of the script to replace may be slightly different depending on whether you're using anaconda, miniconda, conda, etc.
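
One way to avoid hard-coding the install location is to read it from the active environment. The sketch below assumes the MakeGNU environment has been activated, in which case conda sets the CONDA_PREFIX variable to that environment's directory. The helper name tbl2asn_target is hypothetical and not part of MakeGNU.

```shell
# Sketch: print where tbl2asn should be placed for the active conda environment.
# tbl2asn_target is a hypothetical helper name, not part of MakeGNU.
tbl2asn_target() {
    # CONDA_PREFIX is set by `conda activate`; if it is unset, fall back to
    # the anaconda3 path used in the commands above.
    local prefix="${CONDA_PREFIX:-$HOME/anaconda3/envs/MakeGNU}"
    printf '%s/bin/tbl2asn\n' "$prefix"
}
```

With the environment active, the last two commands above could then be written as mv linux64.tbl2asn "$(tbl2asn_target)" followed by chmod +x "$(tbl2asn_target)".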

Preparing inputs and configuration files

The pipeline currently needs these inputs from the user:

  1. A config.yaml that contains some fields the user will modify to run.
  2. Query proteome ".faa" files. The names of the queries are specified in the config file. If the user is beginning with the nucleotide sequence of a whole genome assembly (see below), they can optionally use prokka to annotate the genome and create the ".faa" files.
  3. Two CSV files that map the names of the .faa and .gff files (usually something like "GCA_#########.#.faa" or "GCA_#########.#.gff") to biologist-friendly strain names. See the WhatsGNU documentation for more details.
  4. A reference proteome for the organism of interest. Currently this is REQUIRED for MakeGNU to run.

A Working Example Using the Test Data

This is what the directory looks like prior to running any rules:

  • Data
    • Query_fna (contains microbial genomes)
    • ReferenceProteome (contains the reference proteome from a bacterial strain)
    • Dummy_query (contains a small faa file used to help create the WhatsGNU database)
    • strain_name_list_faa.csv
    • strain_name_list_gff.csv
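
Before running any rules, it can help to confirm that this layout is in place. The following is a minimal sketch of such a check, assuming the paths match the test-data listing above. The function name check_inputs is hypothetical and not part of MakeGNU.

```shell
# Sketch: report which of the expected input paths are missing.
# check_inputs is a hypothetical helper; the paths mirror the listing above.
check_inputs() {
    local root="${1:-.}"   # MakeGNU root directory (defaults to the current one)
    local missing=0
    local path
    for path in Data/Query_fna Data/ReferenceProteome Data/Dummy_query \
                Data/strain_name_list_faa.csv Data/strain_name_list_gff.csv; do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    # Exit status is 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

Running check_inputs from the MakeGNU root directory prints one line per missing path and returns a non-zero status if anything is absent.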

Annotating bacterial genomes to be queried

If you are starting with nucleotide sequences, this step uses prokka to annotate the genomes and pull out the ".faa" files to be used by WhatsGNU.

Execute the following in the MakeGNU root directory. This README can't cover every Snakemake parameter or error you may encounter, but a few flags are helpful: -p prints the shell commands that will be executed, -np performs a dry run (shows the commands without running them), and -r shows the reason each rule is run.

snakemake all_query --cores 2 --use-conda --configfile test_config.yaml
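
To preview a target before running it, the dry-run flags mentioned above can be added to the same command. The wrapper below is a sketch only: preview_target is a hypothetical name, and the remaining flags are copied from the command above.

```shell
# Sketch: dry-run a target and print the shell commands it would execute.
# preview_target is a hypothetical wrapper, not part of MakeGNU.
preview_target() {
    # $1: target name (e.g. all_query); $2: config file (defaults to the test config)
    # -n requests a dry run and -p prints the shell commands.
    snakemake "$1" -np --cores 2 --use-conda --configfile "${2:-test_config.yaml}"
}
```

For example, preview_target all_query lists what the annotation step would run without executing anything.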

The directory structure should now look like this. The new outputs are Query_faa and Annotations.

  • Data
    • Query_faa (contains your proteomes to be queried)
    • Annotations
      • prokka_QUERY (contains all the outputs from prokka)
    • Query_fna (contains microbial genomes)
    • ReferenceProteome
    • Dummy_query (contains a small faa file used to help create the WhatsGNU database)
    • strain_name_list_faa.csv
    • strain_name_list_gff.csv

Downloading and annotating reference genomes

snakemake download_genomes --cores 2 --use-conda --configfile test_config.yaml 
snakemake unzip_genome_files --cores 2 --configfile test_config.yaml
snakemake rename_genome_files --cores 2 --configfile test_config.yaml
snakemake all_database_processing --cores 2 --use-conda --configfile test_config.yaml
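
Assuming these four steps are meant to run in the order listed, they can be chained so that a failure stops the sequence instead of continuing with incomplete files. This is a sketch, and build_reference_db is a hypothetical wrapper name; the targets and flags are taken from the commands above.

```shell
# Sketch: run the four database steps in order, stopping at the first failure.
# build_reference_db is a hypothetical wrapper, not part of MakeGNU.
build_reference_db() {
    local config="${1:-test_config.yaml}"
    # && ensures each step runs only if the previous one succeeded.
    snakemake download_genomes --cores 2 --use-conda --configfile "$config" &&
    snakemake unzip_genome_files --cores 2 --configfile "$config" &&
    snakemake rename_genome_files --cores 2 --configfile "$config" &&
    snakemake all_database_processing --cores 2 --use-conda --configfile "$config"
}
```

Calling build_reference_db with no arguments uses test_config.yaml. Pass a different config file as the first argument to use your own.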

The directory structure should now look similar to this.

  • Data
    • Genomes
    • Query_faa
    • Query_fna
    • Annotations
    • ReferenceProteome
    • Dummy_query
    • strain_name_list_faa.csv
    • strain_name_list_gff.csv
    • genome_list.txt
  • Results
    • Annotations
      • prokka_GENOMEID (contains all prokka output files)
      • all_modified_faa
      • all_modified_gff

Once the reference database has been built, and you have additional genomes to analyze, these database processing steps do not need to be rerun.

Creating a basic report

snakemake all_basic --cores 2 --use-conda --configfile test_config.yaml

Creating an ortholog report

snakemake analyze_pangenome --cores 2 --use-conda --configfile test_config.yaml 
snakemake roary_cleanup --cores 2 --configfile test_config.yaml

Once the pangenome analysis has been done on the reference genomes, and you have additional query genomes to analyze, the above steps do not need to be rerun.

snakemake all_ortholog --cores 2 --use-conda --configfile test_config.yaml

Final directory structure should look like this:

  • Data
    • Genomes
    • Query_faa
    • Query_fna
    • Annotations
    • ReferenceProteome
    • Dummy_query
  • Results
    • Annotations
      • prokka_GENOMEID
      • all_modified_faa
      • all_modified_gff
    • Roary
    • WhatsGNU_db
    • WhatsGNU_basic_results
    • WhatsGNU_ortholog_results

Visualization of WhatsGNU results

A full description of the types of plots created is available on the WhatsGNU GitHub.

snakemake all_histogram --cores 2 --use-conda --configfile test_config.yaml

New directory structure:

  • Data
    • Genomes
    • Query_faa
    • Query_fna
    • Annotations
    • ReferenceProteome
    • Dummy_query
  • Results
    • Annotations
      • prokka_GENOMEID
      • all_modified_faa
      • all_modified_gff
    • Roary
    • WhatsGNU_db
    • WhatsGNU_basic_results
      • Plots
    • WhatsGNU_ortholog_results
      • Plots
