hivlab / quantify-virome

quantify-virome: identify and quantify viruses from metagenomic shotgun sequences

License: MIT License

Topics: snakemake-workflows, virome, metagenomics, metagenomic-pipeline, snakemake, virome-workflow, taxonomy

quantify-virome's Introduction


A Snakemake implementation of the VirusSeeker Virome workflow.

Set up the environment and install prerequisites

Install miniconda

Download and install miniconda: https://conda.io/docs/user-guide/install/index.html. On Linux, the following should work:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
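If you prefer a non-interactive installation, the installer also supports batch mode (-b accepts the license automatically, -p sets the install prefix):

bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3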

Install environment

Create a conda environment with snakemake. There are two options:

  1. If you want to upload your results to Zenodo, you need the snakemake Zenodo remote provider, which is currently implemented in the zenodo-simple branch of my forked snakemake repo.

First, clone the snakemake repo and check out the zenodo-simple branch:

git clone https://github.com/tpall/snakemake.git
cd snakemake
git checkout zenodo-simple

Then create the conda environment, install prerequisites, and install snakemake:

conda env create -f environment.yml -n snakemake
source activate snakemake
pip install -e .

  2. Alternatively, if you don't want to upload your results to Zenodo, you can create the conda environment and install snakemake normally:
conda create -n snakemake -c bioconda -c conda-forge snakemake
source activate snakemake

Set up databases

Together, the databases will occupy 250 GB or more of disk space.

BLAST databases

  1. Download BLAST version 5 databases

Download the version 5 BLAST databases using these instructions: https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf

Briefly, you can use the update_blastdb.pl script from the BLAST+ software bundle to download or update BLAST databases.

To get BLAST, you can start by creating a conda environment with blast+ like so:

conda create -n blastenv
source activate blastenv
conda install -c bioconda blast

Change the working directory to the location where you want the BLAST databases to be installed, e.g. $HOME/databases/blast:

mkdir -p $HOME/databases/blast
cd $HOME/databases/blast

To list the available version 5 databases, run update_blastdb.pl (included with the BLAST+ package) with the --blastdb_version flag:

update_blastdb.pl --blastdb_version 5 --showall

Download the nt_v5 and nr_v5 databases (this takes time and might need restarting if the connection drops):

update_blastdb.pl --blastdb_version 5 nt_v5 --decompress
update_blastdb.pl --blastdb_version 5 nr_v5 --decompress

  2. Set up the BLASTDB environment variable. Edit the $HOME/.bashrc file to permanently add the BLASTDB variable to your shell environment:
echo 'export BLASTDB=$HOME/databases/blast' >> $HOME/.bashrc
source $HOME/.bashrc
echo $BLASTDB
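To verify that BLAST can locate the databases through BLASTDB, you can print database metadata with blastdbcmd from the BLAST+ package (assuming the nt_v5 download has completed):

blastdbcmd -db nt_v5 -info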

Download reference genome databases

  1. Human reference genome.

First, create a directory for the reference genome sequence file, e.g. mkdir -p $HOME/databases/ref_genomes && cd $HOME/databases/ref_genomes.

Then, the human reference genome sequence file human_g1k_v37.fasta.gz can be obtained like so:

wget --continue ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
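The REF_GENOME_HUMAN variable set below points to the uncompressed human_g1k_v37.fasta, so decompress the download first:

gunzip human_g1k_v37.fasta.gz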

  2. Bacterial reference genome sequences.

Create a directory for the bacterial reference sequence files, then download all *genomic.fna.gz files into it with the following command:

wget --recursive --continue ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*genomic.fna.gz

Unzip the files and concatenate them into a single file, then use the bwa index command to create an index for the BWA algorithm.
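For example, assuming the downloaded *genomic.fna.gz files have been collected into $HOME/databases/bacteria_ref_sequence (wget --recursive mirrors the FTP directory tree, so you may need to move the files there first), and using the Bacteria_ref_genome.fna file name that the REF_BACTERIA variable below expects:

cd $HOME/databases/bacteria_ref_sequence
gunzip *genomic.fna.gz
cat *genomic.fna > Bacteria_ref_genome.fna
bwa index Bacteria_ref_genome.fna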

  3. Add paths to human_g1k_v37.fasta and Bacteria_ref_genome.fna to environment variables:
echo 'export REF_GENOME_HUMAN=$HOME/databases/ref_genomes/human_g1k_v37.fasta' >> $HOME/.bashrc
echo 'export REF_BACTERIA=$HOME/databases/bacteria_ref_sequence/Bacteria_ref_genome.fna' >> $HOME/.bashrc
source $HOME/.bashrc

Install workflow

Clone this repo and cd into it (change the URL accordingly if using HTTPS):

git clone git@github.com:avilab/quantify-virome.git
cd quantify-virome

Example

Dry run

snakemake -n

Create workflow graph

snakemake -d .test --dag | dot -Tsvg > graph/dag.svg
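For a more compact graph with one node per rule, the rule graph can be generated the same way (writing to a hypothetical graph/rulegraph.svg):

snakemake -d .test --rulegraph | dot -Tsvg > graph/rulegraph.svg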

Run workflow

This workflow is designed to be run on a cluster. The cluster.json configuration file may need some customisation, for example the partition name. Memory and maximum runtime limits are optimised for 100 splits. The number of splits can be specified in the config.yaml file with the n_files option (currently n_files is 2). Installation of software dependencies is handled by conda, so there is some software installation overhead the first time you run this workflow in a new environment.

Example workflow submission script for a slurm cluster, where the values for job name, cluster partition name, time and memory constraints, and slurm log path (output) are taken from cluster.json:

snakemake -j --use-conda --cluster-config cluster.json  \
             --cluster "sbatch -J {cluster.name} \
             -p {cluster.partition} \
             -t {cluster.time} \
             --mem {cluster.mem} \
             --output {cluster.output}"

You may also want to use the following flags when running this workflow on a cluster:

--max-jobs-per-second 1 --max-status-checks-per-second 10 --rerun-incomplete --keep-going
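Combined with the submission script above, a full cluster invocation might then look like this:

snakemake -j --use-conda --cluster-config cluster.json \
             --cluster "sbatch -J {cluster.name} \
             -p {cluster.partition} \
             -t {cluster.time} \
             --mem {cluster.mem} \
             --output {cluster.output}" \
             --max-jobs-per-second 1 --max-status-checks-per-second 10 \
             --rerun-incomplete --keep-going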

All other snakemake execution options can be printed by calling snakemake -h.

Exit/deactivate environment

The conda environment can be deactivated with the following command when work is finished:

source deactivate

Workflow graph

For technical reasons, the workflow is split into two parts, virome and taxonomy, which can be run separately, but taxonomy depends on the output of virome. The virome subworkflow (virome.snakefile) munges, masks, and BLASTs the input sequences. The taxonomy subworkflow (Snakefile) merges the BLAST results with taxonomy data and generates a report.

Virome workflow

Figure 1. Virome workflow graph, with the test sample split into two (default = 20) subfiles for parallel processing. Outputs parsed BLAST results.


quantify-virome's Issues

Fix filter viruses

Filter viruses uses div_id = 3 (phages) by default. Update the filter_viruses rule after the nt BLAST to use div_id = 9; currently there are hits from all sources that are not div_id 3.

BLAST Database error: No alias or index file found for nucleotide database

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       megablast_nt
        1

[Sat Jan  5 17:02:06 2019]
rule megablast_nt:
    input: blast/SRR5580233_bac_unmapped_10_masked.fa, /gpfs/hpc/repository/VirusSeeker/databases/nt
    output: blast/SRR5580233_megablast_nt_10.tsv
    jobid: 0
    wildcards: sample=SRR5580233, n=10

Activating conda environment: /gpfs/hpc/home/taavi74/fastq/prjna361402/.snakemake/conda/76feaf1a
Traceback (most recent call last):
  File "/gpfs/hpc/home/taavi74/fastq/prjna361402/.snakemake/scripts/tmp2qetcyug.wrapper.py", line 30, in <module>
    run_blast(snakemake.input, snakemake.output, snakemake.params)
  File "/gpfs/hpc/home/taavi74/fastq/prjna361402/.snakemake/scripts/tmp2qetcyug.wrapper.py", line 27, in run_blast
    stdout, stderr = cline()
  File "/gpfs/hpc/home/taavi74/fastq/prjna361402/.snakemake/conda/76feaf1a/lib/python3.6/site-packages/Bio/Application/__init__.py", line 523, in __call__
    stdout_str, stderr_str)
Bio.Application.ApplicationError: Non-zero return code 2 from "blastn -out blast/SRR5580233_megablast_nt_10.tsv -outfmt '6 qseqid sgi pident length mismatch gapopen qstart qend sstart send evalue bitscore' -show_gis -query blast/SRR5580233_bac_unmapped_10_masked.fa -db /gpfs/hpc/repository/VirusSeeker/databases/nt -evalue 1e-8 -word_size 16 -max_hsps 50 -num_threads 8 -task megablast", message 'BLAST Database error: No alias or index file found for nucleotide database [/gpfs/hpc/repository/VirusSeeker/databases/nt] in search path [/gpfs/hpc/home/taavi74/fastq/prjna361402::]'
[Sat Jan  5 17:02:07 2019]
Error in rule megablast_nt:
    jobid: 0
    output: blast/SRR5580233_megablast_nt_10.tsv
    conda-env: /gpfs/hpc/home/taavi74/fastq/prjna361402/.snakemake/conda/76feaf1a

RuleException:
CalledProcessError in line 164 of /gpfs/hpc/home/taavi74/Projects/vs/rules/blast.smk:

update readme

Update the README with Zenodo and NCBI API key setup instructions.

fix filter viruses

Something has gone sour in the tidyverse environment:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       filter_viruses
        1

[Sat Dec  1 16:51:00 2018]
rule filter_viruses:
    input: blast/SRR5580142_blastn_virus_19_known-viral.tsv, /gpfs/software/VirusSeeker/databases/taxdump_300817/vhunter.db, taxonomy/nodes.csv
    output: results/SRR5580142_phages_19.csv, blast/SRR5580142_candidate_viruses_19.csv
    jobid: 0
    wildcards: sample=SRR5580142, n=19

Activating conda environment: /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/130c7264
During startup - Warning message:
Setting LC_CTYPE failed, using "C" 
Parse blast results
Importing BLAST+ tabular output
Parsed with column specification:
cols(
  qseqid = col_character(),
  sgi = col_integer(),
  pident = col_double(),
  length = col_integer(),
  mismatch = col_integer(),
  gapopen = col_integer(),
  qstart = col_integer(),
  qend = col_integer(),
  sstart = col_integer(),
  send = col_integer(),
  evalue = col_double(),
  bitscore = col_double()
)
Munging metadata
Map tax_ids to gis

Connect to database
Collect tax_ids from tables
Bind rows
Join taxonomy to blast results by gi
Joining, by = "gi"
Fill in few missing tax_ids by quering remote ncbi database
Error in mutate_impl(.data, dots) : 
  Evaluation error: no applicable method for 'xml_find_first' applied to an object of class "list".
Calls: do.call ... eval -> <Anonymous> -> mutate.tbl_df -> mutate_impl
Execution halted
[Sat Dec  1 16:51:17 2018]
Error in rule filter_viruses:
    jobid: 0
    output: results/SRR5580142_phages_19.csv, blast/SRR5580142_candidate_viruses_19.csv
    conda-env: /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/130c7264

RuleException:
CalledProcessError in line 99 of /gpfs/hpchome/taavi74/Projects/vs/rules/blast.smk:
Command 'source activate /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/130c7264; set -euo pipefail;  Rscript /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/scripts/tmp86qu6xuw.filter_viruses.R ' returned non-zero exit status 1.
  File "/gpfs/hpchome/taavi74/Projects/vs/rules/blast.smk", line 99, in __rule_filter_viruses
  File "/gpfs/hpchome/taavi74/miniconda3/envs/snakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

pickling error

Out of the blue:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       parse_megablast_nt
        1

[Tue Dec 25 17:27:00 2018]
rule parse_megablast_nt:
    input: blast/SRR5580055_megablast_nt_4.tsv
    output: blast/SRR5580055_megablast_nt_4_mapped.tsv, blast/SRR5580055_megablast_nt_4_unmapped.fa
    jobid: 0
    wildcards: sample=SRR5580055, n=4

[Tue Dec 25 17:27:00 2018]
Error in rule parse_megablast_nt:
    jobid: 0
    output: blast/SRR5580055_megablast_nt_4_mapped.tsv, blast/SRR5580055_megablast_nt_4_unmapped.fa
    conda-env: /gpfs/hpc/home/taavi74/fastq/prjna361402/.snakemake/conda/1e40c9f1

RuleException:
PicklingError in line 204 of /gpfs/hpc/home/taavi74/Projects/vs/rules/blast.smk:
Can't pickle <function <lambda> at 0x2b995cef57b8>: attribute lookup <lambda> on snakemake.workflow failed
  File "/gpfs/hpc/home/taavi74/Projects/vs/rules/blast.smk", line 204, in __rule_parse_megablast_nt
  File "/gpfs/hpc/home/taavi74/miniconda3/envs/viruscount/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

RepeatMasker parseTagData error

Problem with loading the RepeatMasker database:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       repeatmasker
        1

[Sun Dec  2 09:29:30 2018]
rule repeatmasker:
    input: mask/SRR5579964_repeatmasker_3.fa
    output: mask/SRR5579964_repeatmasker_3.fa.masked, mask/SRR5579964_repeatmasker_3.fa.out, mask/SRR5579964_repeatmasker_3.fa.tbl
    jobid: 0
    wildcards: sample=SRR5579964, n=3
    threads: 2

Activating conda environment: /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/5302d956
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LC_CTYPE = "UTF-8",
        LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
RepeatMasker version open-4.0.7
Search Engine: NCBI/RMBLAST [ 2.6.0+ ]
parseTagData: ID field not to EMBL spec "Sola3-3_NV repeatmasker; DNA; ???;  BP.
" from DE   RepbaseID: Sola3-3_NVXX

 at /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/5302d956/share/RepeatMasker/RepeatMasker line 7611.
Master RepeatMasker Database: /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/5302d956/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127-rb20170127 )


[Sun Dec  2 09:29:45 2018]
Error in rule repeatmasker:
    jobid: 0
    output: mask/SRR5579964_repeatmasker_3.fa.masked, mask/SRR5579964_repeatmasker_3.fa.out, mask/SRR5579964_repeatmasker_3.fa.tbl
    conda-env: /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/5302d956

RuleException:
CalledProcessError in line 61 of /gpfs/hpchome/taavi74/Projects/vs/rules/mask.smk:
Command 'source activate /gpfs/hpchome/taavi74/fastq/prjna361402/.snakemake/conda/5302d956; set -euo pipefail;  
    RepeatMasker -qq -pa 2 mask/SRR5579964_repeatmasker_3.fa -dir mask
    if head -n 1 mask/SRR5579964_repeatmasker_3.fa.out | grep -q "There were no repetitive sequences detected"
      then ln -sr mask/SRR5579964_repeatmasker_3.fa mask/SRR5579964_repeatmasker_3.fa.masked
    fi ' returned non-zero exit status 25.
  File "/gpfs/hpchome/taavi74/Projects/vs/rules/mask.smk", line 61, in __rule_repeatmasker
  File "/gpfs/hpchome/taavi74/miniconda3/envs/snakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

use conda

Use the --use-conda option for dependencies. Set up autoconfiguration for RepeatMasker.

MissingInputException

(viruscount) [taavi74@rocket vs]$ snakemake --cluster "sbatch -J {cluster.name} -p {cluster.partition} -t {cluster.time} --mem {cluster.mem} --output {cluster.output} -c {cluster.cpus-per-task}" --use-conda -u ~/fastq/prjna361402/cluster.json -n -d ~/fastq/prjna361402 --keep-going
Building DAG of jobs...
MissingInputException in line 259 of /gpfs/hpc/home/taavi74/Projects/vs/rules/blast.smk:
Missing input files for rule upload:
results/SRR5558000_phages-viruses_20.csv
results/SRR5558000_phages-viruses_12.csv
results/SRR5558000_phages-viruses_18.csv
results/SRR5558000_phages-viruses_11.csv
results/SRR5558000_phages-viruses_15.csv
results/SRR5558000_phages-viruses_5.csv
results/SRR5558000_phages-viruses_16.csv
results/SRR5558000_phages-viruses_8.csv
results/SRR5558000_phages-viruses_9.csv
results/SRR5558000_phages-viruses_3.csv
results/SRR5558000_phages-viruses_19.csv
results/SRR5558000_phages-viruses_1.csv
results/SRR5558000_phages-viruses_14.csv
results/SRR5558000_phages-viruses_13.csv

ambiguous wildcard when only upload jobs

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       upload
        1

[Sun Feb 17 04:43:23 2019]
rule upload:
    input: results/SRR5580304_phages_viruses_1.csv, results/SRR5580304_phages_viruses_2.csv, results/SRR5580304_phages_viruses_3.csv, results/$
    output: results/SRR5580304_phages_viruses.csv.tar.gz
    jobid: 0
    wildcards: sample=SRR5580304_phages, result=viruses, ext=csv

Activating conda environment: /gpfs/hpchome/radko/Projects/prjna361402/.snakemake/conda/4f3133f2
Traceback (most recent call last):
  File "/gpfs/hpchome/radko/Projects/prjna361402/.snakemake/scripts/tmpip1zi1x7.wrapper.py", line 40, in
    filename = [deposit["filename"] for deposit in r.json()]
  File "/gpfs/hpchome/radko/Projects/prjna361402/.snakemake/conda/4f3133f2/lib/python3.6/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/gpfs/hpchome/radko/Projects/prjna361402/.snakemake/conda/4f3133f2/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/gpfs/hpchome/radko/Projects/prjna361402/.snakemake/conda/4f3133f2/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/gpfs/hpchome/radko/Projects/prjna361402/.snakemake/conda/4f3133f2/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
[Sun Feb 17 04:44:25 2019]

Error in rule upload:
    jobid: 0
    output: results/SRR5580304_phages_viruses.csv.tar.gz
    conda-env: /gpfs/hpchome/radko/Projects/prjna361402/.snakemake/conda/4f3133f2

RuleException:
CalledProcessError in line 267 of /gpfs/hpchome/radko/Projects/vs/rules/blast.smk:
Command 'source activate /gpfs/hpchome/radko/Projects/prjna361402/.snakemake/conda/4f3133f2; set -euo pipefail; python /gpfs/hpchome/radko/Pr$
File "/gpfs/hpchome/radko/Projects/vs/rules/blast.smk", line 267, in __rule_upload
File "/gpfs/hpchome/radko/miniconda3/envs/snakemake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job upload since they might be corrupted:
results/SRR5580304_phages_viruses.csv.tar.gz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

missing dependency

Missing dependency in environment.yml

Traceback (most recent call last):
  File "/gpfs/hpchome/radko/vs/src/.snakemake.vnp37zlc.sequence_cleaner.py", line 8, in
    from Bio import SeqIO
ModuleNotFoundError: No module named 'Bio'
[Wed Mar 28 16:02:38 2018] Error in rule tantan_goodreads:
[Wed Mar 28 16:02:38 2018] jobid: 0
[Wed Mar 28 16:02:38 2018] output: output/tantan_goodreads/I1164_12629_Harvard_SIV_196_06_2_24_12_mini.tantan.goodseq.fa

split_fasta

split_fasta currently creates n processes to split the file into n pieces. Update the code so that a single process writes out all n files.

Use expand() instead of the wildcard {n} in the output, and pass the wildcard n on to the next rule, as sketched below.
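A minimal sketch of the suggested change, with hypothetical rule, script, and file names (not the actual rule from this repo):

rule split_fasta:
    input:
        "munge/{sample}_good.fa"
    output:
        # one job emits all n pieces; {n} is no longer an output wildcard
        expand("split/{{sample}}_{n}.fa", n=range(1, config["n_files"] + 1))
    shell:
        "python scripts/split_fasta.py {input} {output}"  # hypothetical helper script

Downstream rules can then keep {n} as an ordinary wildcard in their own input and output patterns.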
