
ViraPipe

ViraPipe is an Apache Spark-based, scalable parallel pipeline for analysing metagenomes from NGS read data on a computing cluster or on a standalone multi-core machine. The pipeline is designed especially for viral metagenomes, but the software is applicable to other genome analysis purposes as well. The pipeline integrates the parallel BWA-MEM read aligner, the MegaHit de novo assembler, and the BLAST and HMMER (hmmsearch) tools. It also includes tools for sequence data normalization and filtering, as well as SQL query interfaces for querying the data in parallel. FASTQ, FASTA, SAM/BAM and Apache Parquet formats are supported, as well as the BLAST tabular format.


Requirements

Apache Spark 2 installed with Hadoop, YARN and the HDFS filesystem. Tested on Ubuntu 16 and CentOS 7 servers. Java 1.8.

Dependencies (a quick environment check is sketched below):
Hadoop-BAM 7.4.0+
Samtools HtsJDK 2.5.0+
Java bindings (JNI) for BWA: jbwa 1.0.0+
Scala 2.10+
Scallop 2.1.2+ (command-line argument parsing library for Scala)
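
A quick way to confirm that the required stack is visible to the user who will run the pipeline is to ask each tool for its version (this assumes Spark, Hadoop and Java are already on the PATH):

# Verify the Java, Hadoop and Spark installations on a node
java -version           # should report 1.8
hadoop version          # Hadoop with YARN and HDFS
spark-submit --version  # should report Spark 2.x
hdfs dfs -ls /          # confirms HDFS is reachable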


Installation

Compiling ViraPipe

git clone https://github.com/NGSeq/ViraPipe
cd ViraPipe
mvn install package
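
The build should leave the ViraPipe jar in the standard Maven target directory; the exact artifact name depends on the version in pom.xml, so the wildcard below is only a sketch:

# Locate the jar produced by the Maven build (artifact name may differ)
ls target/*.jar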

jbwa

git clone https://github.com/lindenb/jbwa
cd jbwa
make

Copy the libbwajni native library to the ${HADOOP_HOME}/lib/native directory on every worker node, e.g. with scp:
scp src/main/native/libbwajni.so username@hostname:${HADOOP_HOME}/lib/native
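
If the worker hostnames are listed one per line in a file (a hypothetical workers.txt), the library can be pushed to all of them in one loop; this assumes passwordless SSH and that $HADOOP_HOME points to the same path on every node:

# Copy the JNI library to every worker listed in workers.txt (hypothetical host list)
while read host; do
  scp src/main/native/libbwajni.so "username@${host}:${HADOOP_HOME}/lib/native/"
done < workers.txt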

Download and install the MegaHit assembler on every worker node (requires a gcc compiler):

git clone https://github.com/voutcn/megahit
cd megahit
make
Create a symbolic link
    ln -s {path to megahit}/megahit /usr/local/bin/megahit
or add megahit to your PATH (a per-node check is sketched below).
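
To confirm the assembler is callable on every node, a simple SSH loop over the same hypothetical workers.txt prints the installed version:

# Check that megahit is on the PATH of every worker
while read host; do
  ssh "username@${host}" 'megahit --version'
done < workers.txt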

Download and install the BLAST and HMMER software on every worker node (requires a gcc compiler).

HMMER:
wget http://eddylab.org/software/hmmer3/3.1b2/hmmer-3.1b2-linux-intel-x86_64.tar.gz
Extract and make.

BLAST:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.6.0+-x64-linux.tar.gz
Extract and add the bin directory to your PATH (the Linux package ships precompiled binaries); see the sketch below.
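
As a sketch of that step: the HMMER tarball is built with the usual configure/make, while the BLAST+ Linux package only needs to be extracted and its bin directory exposed (paths and versions below match the downloads above):

# HMMER: build from the extracted sources
tar xzf hmmer-3.1b2-linux-intel-x86_64.tar.gz
cd hmmer-3.1b2-linux-intel-x86_64
./configure && make && sudo make install
cd ..

# BLAST+: precompiled binaries, just extract and add them to the PATH
tar xzf ncbi-blast-2.6.0+-x64-linux.tar.gz
export PATH=$PATH:$(pwd)/ncbi-blast-2.6.0+/bin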

Download and extract the BLAST and HMMER databases, and place them under the same path on every worker node.

vFam database for hmmsearch
wget http://derisilab.ucsf.edu/software/vFam/vFam-B_2014.hmm
BLAST database files
for i in {0..9}; do wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.0$i.tar.gz ; done
for i in {10..50}; do wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.$i.tar.gz ; done
for i in {0..9}; do wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/human_genomic.0$i.tar.gz ; done
for i in {10..22}; do wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/human_genomic.$i.tar.gz ; done
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
cat *.gz | tar -xzvf - -i
Copy the databases to every node with scp:
scp vFam-B_2014.hmm username@hostname:/database/hmmer
scp nt.* username@hostname:/database/blast/nt
scp human_genomic.* username@hostname:/database/blast/hg
scp taxdb.* username@hostname:/database/taxdb
Set the BLASTDB environment variable on each node:
export BLASTDB=$BLASTDB:/database/blast/nt
export BLASTDB=$BLASTDB:/database/blast/hg
export BLASTDB=$BLASTDB:/database/taxdb
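
Once the databases are in place and BLASTDB is set, a quick sanity check on each node is to ask BLAST and HMMER to read them (paths as used above):

# Confirm BLAST resolves the nt and human_genomic databases via BLASTDB
blastdbcmd -db nt -info
blastdbcmd -db human_genomic -info
# Confirm the vFam profile database is readable by HMMER
hmmstat /database/hmmer/vFam-B_2014.hmm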

Running the example pipeline

Download the human reference genome index on every node under the same path, e.g. /index:

wget -r ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/*

Download the NGS sequence files and load them into HDFS:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00313/sequence_read/ERR016234_1.filt.fastq.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00313/sequence_read/ERR016234_2.filt.fastq.gz
hdfs dfs -mkdir -p /data/input/example
hdfs dfs -mkdir -p /data/output
hdfs dfs -put ERR016234_1.filt.fastq.gz /data/input/example
hdfs dfs -put ERR016234_2.filt.fastq.gz /data/input/example
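
Before launching the pipeline it is worth confirming that both read files landed in HDFS:

# List the uploaded FASTQ files
hdfs dfs -ls /data/input/example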

Run the pipeline

Check that the configuration of the Spark master, num-executors, executor-memory, etc. fits your system, and that the classpath, directories and databases exist and the user has the proper permissions.
scripts/virapipe.sh /data/input /data/output example
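
The script launches Spark jobs, so the values to review are the usual spark-submit resource options. The invocation below is only a hypothetical illustration of the kind of settings to adapt (YARN master, executor count, memory); the actual options, main class and jar used by scripts/virapipe.sh may differ:

# Hypothetical illustration of Spark resource settings to tune for your cluster
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  path/to/virapipe.jar ...   # hypothetical jar path and pipeline arguments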

Citation

AI Maarala, Z Bzhalava, J Dillner, K Heljanko, D Bzhalava; ViraPipe: Scalable Parallel Pipeline for Viral Metagenome Analysis from Next Generation Sequencing Reads, Bioinformatics, Nov. 2017, btx702, https://doi.org/10.1093/bioinformatics/btx702
