The metabp from zifornd

Introduction

Shotgun metagenomic sequencing randomly fragments the DNA in a sample into many small pieces. These fragments are sequenced, and the resulting reads are stitched back together with bioinformatics tools to identify the species and genes present in the sample. Processing and analyzing the raw sequencing data takes several steps, and numerous open-source tools are available for each one. This choice is helpful, but it also makes it difficult to decide which tools to use and how they compare to one another.

Open-source tools offer flexibility to customize, cost efficiency, collaboration with the community, transparency, and reproducibility. They also have disadvantages, such as lack of maintenance, compatibility issues with in-house applications or environments, open bugs, and security vulnerabilities. There is therefore a need for a benchmarking pipeline that compares the performance of different metagenomics data processing tools.

zifornd/metaBP is a bioinformatics best-practice benchmarking pipeline for QC, assembly, binning and annotation of metagenomes.

The pipeline uses nf-core/mag version 2.1.1 as its base framework, to which various tools have been added for benchmarking.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been installed from nf-core/modules.

Pipeline summary

By default, the pipeline supports short reads. It trims adapters and low-quality bases with Cutadapt and Trimmomatic and performs basic QC with FastQC. The pipeline then:

- assigns taxonomy to reads using Kraken2
- performs assembly using MEGAHIT and SPAdes, and checks their quality using QUAST
- predicts protein-coding genes for the assemblies using Prodigal
- performs metagenome binning using MetaBAT2, and checks the quality of the genome bins using BUSCO
- assigns taxonomy to bins using GTDB-Tk and/or Sourmash

Furthermore, the pipeline creates intermediate outputs in the Intermediate_output directory and various reports in the Results directory specified, including a MultiQC report summarizing some of the findings and software versions.

Quick Start

Install Nextflow (>=21.10.3)
Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. Conda can be used both to install Nextflow itself and to manage software within pipelines, but please use it within pipelines only as a last resort (see docs).
Download the pipeline and test it on a minimal dataset with a single command:
```
nextflow run zifornd/metaBP -profile test,YOURPROFILE --outdir <OUTDIR>
```
Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.
- The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
- Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
- If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
- If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
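The profile chaining and image caching described above can be sketched as a small shell script. This is a minimal illustration, not part of the pipeline: the cache path and the output directory name are hypothetical, and the script only prints the command so that it can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: reuse Singularity images across runs via a shared cache directory.
set -eu

# Hypothetical cache location; any directory that persists between runs works.
export NXF_SINGULARITY_CACHEDIR="${NXF_SINGULARITY_CACHEDIR:-$PWD/singularity_cache}"
mkdir -p "$NXF_SINGULARITY_CACHEDIR"

# Chain config profiles as a comma-separated string:
# the bundled test dataset plus Singularity for software management.
profiles="test,singularity"
outdir="results_test"   # hypothetical output directory

cmd="nextflow run zifornd/metaBP -profile $profiles --outdir $outdir"

# Print the command for review; paste it into a terminal once Nextflow is installed.
echo "$cmd"
```

For Conda, the same pattern would apply with NXF_CONDA_CACHEDIR and `-profile test,conda`.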

Start running your own analysis!

nextflow run zifornd/metaBP -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input '*_R{1,2}.fastq.gz' --outdir <OUTDIR>

nextflow run zifornd/metaBP -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input samplesheet.csv --outdir <OUTDIR>

See usage docs and parameter docs for all of the available options when running the pipeline.
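When passing reads with a glob such as `'*_R{1,2}.fastq.gz'`, it can help to confirm beforehand that every forward file has a mate. The helper below is a hypothetical pre-flight check, not something the pipeline provides; it assumes the `_R1`/`_R2` naming used in the example command.

```shell
#!/bin/sh
# check_pairs DIR: report *_R1.fastq.gz files in DIR that have no *_R2.fastq.gz mate.
# Prints the number of unpaired R1 files on stdout and details on stderr.
# (It does not look for R2 files that lack an R1.)
check_pairs() {
    dir=$1
    missing=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # derive the expected mate name
        if [ ! -e "$r2" ]; then
            echo "missing mate for: $r1" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$missing"
}

# READS_DIR is a hypothetical variable; it defaults to the current directory.
check_pairs "${READS_DIR:-.}"
```

A result of 0 means every `_R1` file found has a matching `_R2` file.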

Documentation

The zifornd/metaBP pipeline comes with documentation about the pipeline usage, parameters and output. Detailed information about how to specify the input can be found under input specifications.

Group-wise co-assembly and co-abundance computation

Each sample has an associated group ID (see input specifications). This group information can be used for group-wise co-assembly with MEGAHIT or SPAdes and/or to compute co-abundances for the binning step with MetaBAT2. By default, group-wise co-assembly is disabled, while the computation of group-wise co-abundances is enabled. For more information about how this group information can be used, see the documentation for the parameters --coassemble_group and --binning_map_mode.
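As a rough sketch, the two parameters might be combined as follows. The values are assumptions carried over from nf-core/mag, the base pipeline, where --coassemble_group is a boolean flag and --binning_map_mode accepts `all`, `group` or `own`; check the metaBP parameter docs before relying on them. The profile, samplesheet and output names are placeholders, and the script only prints the command.

```shell
#!/bin/sh
set -eu

# Assumed values, taken from nf-core/mag: 'all', 'group' or 'own'.
binning_map_mode="group"

# Reject anything outside the assumed set before building the command.
case "$binning_map_mode" in
    all|group|own) ;;
    *) echo "unsupported binning_map_mode: $binning_map_mode" >&2; exit 1 ;;
esac

# --coassemble_group is assumed to be a boolean flag that enables group-wise co-assembly.
cmd="nextflow run zifornd/metaBP -profile docker --input samplesheet.csv --outdir results --coassemble_group --binning_map_mode $binning_map_mode"

# Print the command for review instead of executing it.
echo "$cmd"
```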

When group-wise co-assembly is enabled, SPAdes is run on read files pooled per group, since metaSPAdes does not yet allow the input of multiple samples or libraries. In contrast, MEGAHIT is run once per group and is supplied with lists of the individual read files.
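To illustrate what pooling read files means in practice, the sketch below concatenates two toy gzipped FASTQ files into one. It is only an illustration of the idea: the source does not show how the pipeline pools reads internally, and all file names here are hypothetical.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Two toy single-record FASTQ files standing in for samples of the same group.
printf '@s1\nACGT\n+\nIIII\n' | gzip -c > "$work/sampleA_R1.fastq.gz"
printf '@s2\nTTGA\n+\nIIII\n' | gzip -c > "$work/sampleB_R1.fastq.gz"

# Concatenated gzip members form a valid gzip stream, so cat is enough to pool them.
cat "$work/sampleA_R1.fastq.gz" "$work/sampleB_R1.fastq.gz" > "$work/group1_pooled_R1.fastq.gz"

# A FASTQ record spans four lines, so records = lines / 4.
pooled_records=$(( $(gzip -dc "$work/group1_pooled_R1.fastq.gz" | wc -l) / 4 ))
echo "pooled records: $pooled_records"

rm -rf "$work"
```

For paired-end data, the reverse reads would need to be pooled in the same sample order so that mates stay aligned between the two pooled files.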

Credits

zifornd/metaBP was originally written by Ragavi Shanmugam (@Ragavi-Shanmugam), Riya Saju (@Riya-Saju), Asma Ali (@Asma-Ali), Lathika Madhan Mohan (@Lathika-Madhan-Mohan), Diya Basu (@Diya-Basu) and Mohamed Kassam (@Mohamed-Kassam) from Zifo RnD Solutions.

The pipeline builds on nf-core/mag version 2.1.1, extended with additional tools for benchmarking.

Citations

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, an extensive list of references for the tools and data used by the pipeline can be found in the CITATIONS.md file.
