pauloluniyi / vgea

VGEA (Viral Genomes Easily Analyzed) is a pipeline for analysis of RNA virus next-generation sequencing data.

License: GNU General Public License v3.0

Python 36.24% Shell 40.95% Dockerfile 9.59% Singularity 13.22%

vgea's Introduction

VGEA

VGEA (Viral Genomes Easily Analyzed) is an RNA viral assembly toolkit.

VGEA was developed to aid in the analysis of next-generation sequencing data. Users can do the following with this pipeline:

  • Remove adapters and low-quality bases/positions, and perform read-level QC.
  • Align paired-end sequencing reads to the human reference genome.
  • Extract unmapped/unaligned reads.
  • Split bam files into forward and reverse reads.
  • Carry out de novo assembly of forward and reverse reads to generate contigs.
  • Pre-process reads for quality and contamination.
  • Map reads to a reference tailored to the sample using corrected contigs supplemented by the user’s choice of reference sequences.
  • Evaluate/assess the quality of genome assemblies.
  • Collate results in a MultiQC summary.

Dependencies:

The VGEA pipeline requires the following dependencies:

  • Snakemake (Köster et al., 2012)
  • Fastp (Chen et al., 2018)
  • BWA (Li and Durbin, 2009)
  • Samtools (Li et al., 2009)
  • IVA (Hunt et al., 2015)
  • Shiver (Wymant et al., 2018)
  • SeqKit (Shen et al., 2016)
  • Quast (Gurevich et al., 2013)
  • Multiqc (Ewels et al., 2016)

VGEA is built on the Snakemake workflow management system and uses an existing tool for each step: fastp (Chen et al., 2018) for read trimming and quality control; bwa (Li and Durbin, 2009) for mapping sequencing reads to the human reference genome; samtools (Li et al., 2009) for extracting unmapped reads and splitting bam files into fastq files; iva (Hunt et al., 2015) for de novo assembly to generate contigs; shiver (Wymant et al., 2018) for pre-processing reads for quality and contamination and then mapping them to a reference tailored to the sample, built from corrected contigs supplemented with the user’s choice of existing reference sequences; seqkit (Shen et al., 2016) for cleaning the shiver assembly for QUAST; and quast (Gurevich et al., 2013) for assessing the quality of genome assemblies. Finally, MultiQC (Ewels et al., 2016) generates an interactive report of the results.

Installation

1. Conda

This workflow requires conda to be installed and available on the system.

To do this, install conda using one of the Miniconda installers and follow the Miniconda installation instructions.

Briefly:

Linux

To obtain the installer for Linux, use the following:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Then install Miniconda:

sh Miniconda3-latest-Linux-x86_64.sh

MacOS

To obtain the installer for MacOS, you can download it manually or use wget:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

Then install Miniconda:

sh Miniconda3-latest-MacOSX-x86_64.sh

2. Snakemake

Then install snakemake as follows:

conda create -c bioconda -c conda-forge --name snakemake snakemake 

3. VGEA

To complete the installation of VGEA, clone the repository and enter it:

git clone git@github.com:pauloluniyi/VGEA.git
cd VGEA

4. Data Dependency

Finally, you need to download a human reference genome. A convenience script is provided for this and can be run as follows:

cd resources/
bash get_human_reference.sh 
cd ..

Usage

There are 2 key files when running VGEA:

  • A yaml config file containing key parameters e.g., config/config.yaml

  • A 3-column tab-separated sheet with a name and paths to r1 and r2 for each sample e.g., .tests/integration/sample_table.tsv

Config

  • sample_table is the path to the tsv created above with names and read locations for each sample (under headings id, r1, r2)

  • shiver_config_file is the path to the config file for shiver; by default this is config/shiver_config.sh

  • human_reference_genome is the path to the fasta file containing the human genome reference; if you used get_human_reference.sh, this will be resources/GRCh38_latest_genomic.fna

  • viral_species is one of the 4 pre-supported viral references (HIV-1, SARS-CoV-2, LASV_L, LASV_S) and is used to autofill the default paths to the reference files for that species. If you supply your own viral reference/adapter/primer files, this does not need to be set.

  • viral_reference_alignment is the path to a reference alignment of the appropriate viral genomes for your samples (by default resources/{viral_species}/MyRefAlignment.fasta). This reference alignment should be created carefully. As Wymant et al. (2018) put it: "An alignment of existing reference sequences is required as input for shiver. Construction of a custom reference for mapping involves identifying the existing references that are closest to the sample under consideration. The greater the number and diversity of existing references given as input, the denser and broader the coverage of sequence space is, and the closer the closest reference is expected to be, with corresponding benefits for the accuracy of the results. However these existing references should be aligned to each other accurately, in order for the addition of each sample’s contigs to the alignment to be meaningful; this means that producing such an input by automatically aligning a large number of diverse sequences without checking the results would be a bad idea. This alignment will be used as input for every sample in a dataset processed by shiver, and so the user is advised to put a little thought into sequence selection and manually curating the alignment if needed."

  • viral_reference_genome is the path to a single reference genome for your samples, used for QUAST-based assembly assessment (by default resources/{viral_species}/MyRefGenome.fasta)

  • viral_reference_gene_features is the path to a GFF3 file containing gene features for the supplied viral reference genome (by default resources/{viral_species}/MyRefFeatures.gff3)

  • viral_sequencing_adapters is the path to a fasta file containing the sequencing adapters used for your samples (by default resources/{viral_species}/MyAdapters.fasta)

  • viral_sequencing_primers is the path to a fasta file containing the sequencing primers used for your samples (by default resources/{viral_species}/MyRefAlignment.fasta)
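
The defaults listed above share one pattern, resources/{viral_species}/<file name>. The following is a minimal Python sketch of that lookup, assuming only the species names and file names given in this README. The function name default_reference_paths and the resources_dir parameter are hypothetical and are not part of VGEA; the sketch covers the alignment, genome, gene-feature, and adapter defaults only.

```python
from pathlib import Path

# The four pre-supported species named in the README.
SUPPORTED_SPECIES = ("HIV-1", "SARS-CoV-2", "LASV_L", "LASV_S")

# Config key -> default file name, as listed in the README.
DEFAULT_FILES = {
    "viral_reference_alignment": "MyRefAlignment.fasta",
    "viral_reference_genome": "MyRefGenome.fasta",
    "viral_reference_gene_features": "MyRefFeatures.gff3",
    "viral_sequencing_adapters": "MyAdapters.fasta",
}


def default_reference_paths(viral_species, resources_dir="resources"):
    """Return the documented default path for each reference file (hypothetical helper)."""
    if viral_species not in SUPPORTED_SPECIES:
        raise ValueError(
            f"{viral_species!r} is not pre-supported; supply your own reference files"
        )
    base = Path(resources_dir) / viral_species
    # as_posix() keeps forward slashes so the result matches the README on any OS.
    return {key: (base / name).as_posix() for key, name in DEFAULT_FILES.items()}


for key, path in default_reference_paths("SARS-CoV-2").items():
    print(f"{key}: {path}")
```

Any of these keys can still be set explicitly in the config file, in which case the species-based default is not needed.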

Sample Table

A 3 column, tab-separated file with an id column containing sample names, r1 with the path to the forward reads for that sample, and r2 with the path to reverse reads for that sample.

id	r1	r2
test1	.tests/integration/test1_r1.fq.gz	.tests/integration/test1_r2.fq.gz
test2	.tests/integration/test2_r1.fq.gz	.tests/integration/test2_r2.fq.gz
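
Before launching a run, it can help to check that the sample table has the three expected columns and no empty or duplicated entries. The sketch below uses only the Python standard library; read_sample_table is a hypothetical helper and not something VGEA ships.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "r1", "r2")


def read_sample_table(handle):
    """Parse a tab-separated sample table and return a list of row dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"sample table is missing column(s): {', '.join(missing)}")

    samples, seen = [], set()
    for line_no, row in enumerate(reader, start=2):
        # Short rows leave missing fields as None, so treat None and "" alike.
        if any(not row.get(col) for col in REQUIRED_COLUMNS):
            raise ValueError(f"line {line_no}: id, r1 and r2 must all be filled in")
        if row["id"] in seen:
            raise ValueError(f"line {line_no}: duplicate sample id {row['id']!r}")
        seen.add(row["id"])
        samples.append({col: row[col] for col in REQUIRED_COLUMNS})
    return samples


# The example table from this README, held in memory instead of a file.
EXAMPLE = (
    "id\tr1\tr2\n"
    "test1\t.tests/integration/test1_r1.fq.gz\t.tests/integration/test1_r2.fq.gz\n"
    "test2\t.tests/integration/test2_r1.fq.gz\t.tests/integration/test2_r2.fq.gz\n"
)

for sample in read_sample_table(io.StringIO(EXAMPLE)):
    print(sample["id"], sample["r1"], sample["r2"])
```

To check a real table, pass an open file handle instead of the in-memory example, e.g. open(path, newline="").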

Executing the workflow

The workflow will automatically install dependencies using conda/mamba if executed with --use-conda; otherwise, all dependencies listed at the start of the README must be installed manually via conda.

To run the workflow, complete the config file and sample table described above and execute:

snakemake --cores $n --use-conda --configfile $your_config

Here, $n is the number of cores with which to execute the workflow and $your_config is the path to the config.yaml you've created.
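
When the workflow is launched from a script rather than typed by hand, the same command can be assembled programmatically. This is a minimal sketch that only builds and prints the command line shown above; build_snakemake_command is a hypothetical helper, and the flags are exactly those used in this README.

```python
import shlex


def build_snakemake_command(cores, configfile):
    """Assemble the documented snakemake invocation as an argument list."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return [
        "snakemake",
        "--cores", str(cores),
        "--use-conda",
        "--configfile", str(configfile),
    ]


command = build_snakemake_command(4, "config/config.yaml")
# shlex.join quotes arguments safely for display or for pasting into a shell.
print(shlex.join(command))
```

The returned list can be passed to subprocess.run(command, check=True) from inside the VGEA directory to start the run.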

Results

VGEA writes all results to the results/ directory, with one subfolder per sample and top-level folders that collect all log files, tool benchmarks, and an interactive MultiQC html summary of the results.

results/
├── benchmarks                  # all benchmark files with hardware usage for each sample
├── logs                        # all log files for each rule and sample
├── multiqc                     # multiqc summary of quast, fastp, and de-hosting results
│   ├── multiqc_data
│   └── multiqc_report.html     # the interactive multiqc report
├── test1                       # all results for the sample "test1" 
│   ├── shiver                  # folder containing all shiver output files 
│   ├── test1_1.fastq           # fastp trimmed and de-hosted reads for the sample 
│   ├── test1_2.fastq
│   ├── test1.bam               # alignment against human reference 
│   ├── test1.fasta             # final cleaned assembly from IVA and shiver
│   ├── test1.fastp.html        # fastp report in html format
│   ├── test1.fastp.json        # fastp report in json format
│   ├── test1.flagstat          # dehosting mapping statistics
│   ├── test1_iva               # folder containing all IVA assembly files
│   ├── test1.quast_results     # folder containing all QUAST assembly assessment of test1.fasta
│   ├── test1_r1_trimmed.fq     # fastp trimmed reads
│   ├── test1_r2_trimmed.fq     
│   └── test1.sam               # sam file containing all the reads that didn't map to the human reference
└── test2
    ├── shiver
    ├── test2_1.fastq
    ├── test2_2.fastq
    ├── test2.bam
    ├── test2.fasta
    ├── test2.fastp.html
    ├── test2.fastp.json
    ├── test2.flagstat
    ├── test2_iva
    ├── test2.quast_results
    ├── test2_r1_trimmed.fq
    ├── test2_r2_trimmed.fq
    └── test2.sam
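
After a run, the layout above can be used to check whether a sample finished completely. The sketch below lists the per-sample entries from the tree and reports which ones are absent; missing_outputs is a hypothetical helper, and the list of names is an assumption taken directly from the example tree rather than from the workflow definition.

```python
from pathlib import Path

# Per-sample entries copied from the results tree; "{s}" stands for the sample id.
EXPECTED_PER_SAMPLE = [
    "shiver",
    "{s}_1.fastq",
    "{s}_2.fastq",
    "{s}.bam",
    "{s}.fasta",
    "{s}.fastp.html",
    "{s}.fastp.json",
    "{s}.flagstat",
    "{s}_iva",
    "{s}.quast_results",
    "{s}_r1_trimmed.fq",
    "{s}_r2_trimmed.fq",
    "{s}.sam",
]


def missing_outputs(results_dir, sample_id):
    """Return the expected files/folders that do not exist for one sample."""
    sample_dir = Path(results_dir) / sample_id
    names = [template.format(s=sample_id) for template in EXPECTED_PER_SAMPLE]
    return [name for name in names if not (sample_dir / name).exists()]


for sample_id in ("test1", "test2"):
    missing = missing_outputs("results", sample_id)
    status = "complete" if not missing else "missing: " + ", ".join(missing)
    print(f"{sample_id}: {status}")
```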

Containerized Singularity (Beta)

Alternatively, users can run the VGEA pipeline with all dependencies installed in a Docker/Singularity container.

This requires Singularity and Snakemake to be installed on the system. In theory it gives a more reproducible setup than the conda environments alone, but it has not been tested as thoroughly as the conda-only route.

See the Singularity documentation for installation instructions.

The workflow can then be run as normal with --use-singularity added, e.g.:

snakemake --use-conda --use-singularity --configfile .tests/integration/test_config.yaml -j 1

Testing

To run a minimal integration test once snakemake and conda are installed:

snakemake --use-conda --configfile .tests/integration/test_config.yaml -j 1

Contributors

pauloluniyi, fmaguire, gtonkinhill

vgea's Issues

Error in rule iva_assembly

Hello, I keep receiving this error during the assembly process

Activating conda environment: .snakemake/conda/f31eec44745c66475754753e88beadd1
[Wed May 11 14:46:50 2022]
Error in rule iva_assembly:
    jobid: 6
    output: results/test1/test1_iva/contigs.fasta, results/test1/test1_iva
    log: results/logs/test1/test1_iva.log (check log file(s) for error message)
    conda-env: /home/thermite/VGEA/.snakemake/conda/f31eec44745c66475754753e88beadd1
    shell:
        
        rm -rf results/test1/test1_iva #to prevent snakemake pre-making the folder
        (iva --reads_fwd results/test1/test1_1.fastq --reads_rev results/test1/test1_2.fastq --threads 2 results/test1/test1_iva) > results/logs/test1/test1_iva.log 2>&1
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-05-11T144643.736946.snakemake.log

Error log:

Traceback (most recent call last):
  File "/home/thermite/VGEA/.snakemake/conda/f31eec44745c66475754753e88beadd1/bin/iva", line 22, in <module>
    import iva
  File "/home/thermite/VGEA/.snakemake/conda/f31eec44745c66475754753e88beadd1/lib/python3.9/site-packages/iva/__init__.py", line 36, in <module>
    from iva import *
  File "/home/thermite/VGEA/.snakemake/conda/f31eec44745c66475754753e88beadd1/lib/python3.9/site-packages/iva/assembly.py", line 19, in <module>
    from iva import contig, mapping, seed, mummer, graph, edge, common
  File "/home/thermite/VGEA/.snakemake/conda/f31eec44745c66475754753e88beadd1/lib/python3.9/site-packages/iva/graph.py", line 15, in <module>
    import networkx
  File "/home/thermite/.local/lib/python3.9/site-packages/networkx/__init__.py", line 98, in <module>
    import networkx.utils
  File "/home/thermite/.local/lib/python3.9/site-packages/networkx/utils/__init__.py", line 2, in <module>
    from networkx.utils.decorators import *
  File "/home/thermite/.local/lib/python3.9/site-packages/networkx/utils/decorators.py", line 14, in <module>
    from decorator import decorator
ModuleNotFoundError: No module named 'decorator'

I reinstalled the module 'decorator' but the issue was not resolved.

I would appreciate any help in resolving this issue.
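
One detail in the traceback above may be relevant: iva is imported from the conda environment under .snakemake/conda/, but networkx is imported from /home/thermite/.local/lib/python3.9/site-packages, which is the per-user site-packages directory. A package installed there can shadow the copy in the conda environment. The sketch below is a diagnostic only, not a confirmed fix; where_is and loaded_from_user_site are hypothetical helper names, and it should be run with the Python interpreter of the affected conda environment.

```python
import importlib.util
import site
from pathlib import Path


def where_is(module_name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


def loaded_from_user_site(module_file, user_site=None):
    """Return True if module_file lives under the per-user site-packages directory."""
    user_site = Path(user_site or site.getusersitepackages())
    try:
        Path(module_file).relative_to(user_site)
        return True
    except ValueError:
        return False


for name in ("networkx", "decorator"):
    origin = where_is(name)
    if origin is None:
        print(f"{name}: not importable from this interpreter")
    else:
        tag = "user site" if loaded_from_user_site(origin) else "environment"
        print(f"{name}: {origin} ({tag})")
```

If networkx is reported as coming from the user site, one thing to try is setting the environment variable PYTHONNOUSERSITE=1 before running snakemake, which tells Python to ignore ~/.local so that the packages in the conda environment are used instead.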

De novo assembly

Hi Paulo,
I have a question about the workflow.
I am curious whether the unmapped reads (after alignment to GRCh38) are used to generate the de novo assembly with iva?

Thanks!

SARS-CoV-2 workflow comparison - kindly check if your work is represented correctly

Hello VGEA Team,

I am from the University Hospital Essen, Germany, and we work extensively with SARS-CoV-2 in our research. We have also developed a SARS-CoV-2 workflow. In preparation for the publication of our workflow, we have looked at several other SARS-CoV-2 related workflows, including your work. We will present this review in the publication and want to ensure that your work is represented as accurately as possible.

Moreover, there is currently little to no current overview of SARS-CoV-2 related workflows. Therefore, we have decided to make the above comparison publicly available via this GitHub repository. It contains a table with an overview of the functions of different SARS-CoV-2 workflows and the tools used to implement these functions.

We would like to give you the opportunity to correct any misunderstandings on our side. Please take a moment to make sure we are not misrepresenting your work or leaving out important parts of it by taking a look at this overview table. If you feel that something is missing or misrepresented, please feel free to give us feedback by contributing directly to the repository.

Thank you very much!

cc @alethomas
