Shotgun metagenomic sequencing randomly fragments the DNA in a sample into many small pieces. These fragments are sequenced, and the resulting reads are stitched back together with bioinformatics tools to identify the species and genes present in the sample. Processing and analyzing the raw sequencing data requires several tools, and numerous open-source options already exist for each step. While this is helpful, it also makes it difficult to determine which tools to use and how they compare to one another. Open-source resources offer several advantages, such as flexibility to customize, cost efficiency, community collaboration, transparency, and reproducibility. They also have disadvantages, such as lack of maintenance, compatibility issues with in-house applications or environments, open bugs, and security vulnerabilities. Thus, there is a need for a benchmarking pipeline that helps compare the performance of different metagenomics data processing tools.
zifornd/metaBP is a bioinformatics best-practice benchmarking pipeline for QC, assembly, binning and annotation of metagenomes.
The pipeline uses nf-core/mag version 2.1.1 as its base framework, to which various tools have been added for benchmarking.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been installed from nf-core/modules.
By default, the pipeline takes short reads, trims low-quality bases and adapters with Cutadapt and Trimmomatic, and performs basic QC with FastQC. The pipeline then:
- assigns taxonomy to reads using Kraken2
- performs assembly using MEGAHIT and SPAdes, and checks their quality using QUAST
- predicts protein-coding genes for the assemblies using Prodigal
- performs metagenome binning using MetaBAT2, and checks the quality of the genome bins using BUSCO
- assigns taxonomy to bins using GTDB-Tk and/or Sourmash
Furthermore, the pipeline creates intermediate outputs in the Intermediate_output directory and various reports in the Results directory specified, including a MultiQC report summarizing some of the findings and software versions.
- Install Nextflow (>=21.10.3).
- Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility (you can use Conda both to install Nextflow itself and also to manage software within pipelines. Please only use it within pipelines as a last resort; see docs).
- Download the pipeline and test it on a minimal dataset with a single command:
nextflow run zifornd/metaBP -profile test,YOURPROFILE --outdir <OUTDIR>
Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.
  - The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
  - Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
  - If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
  - If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
- Start running your own analysis!
nextflow run zifornd/metaBP -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input '*_R{1,2}.fastq.gz' --outdir <OUTDIR>
or
nextflow run zifornd/metaBP -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input samplesheet.csv --outdir <OUTDIR>
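As a rough sketch of the Singularity image cache mentioned in the list above, the snippet below sets NXF_SINGULARITY_CACHEDIR and then launches the test profile. The cache path and the output directory name are hypothetical placeholders, and the RUN variable is only an illustration device (not part of the pipeline) so that the command is printed rather than executed by default.

```shell
#!/bin/sh
# Hypothetical cache location; in practice a shared, persistent path is more useful.
NXF_SINGULARITY_CACHEDIR="${NXF_SINGULARITY_CACHEDIR:-$PWD/singularity_cache}"
export NXF_SINGULARITY_CACHEDIR
mkdir -p "$NXF_SINGULARITY_CACHEDIR"

# Dry-run helper (illustration only): prints the command when RUN is unset.
# Invoke the script with RUN= (empty) to actually execute Nextflow.
RUN="${RUN-echo}"

# Profiles are chained with commas: "test" supplies the minimal dataset,
# "singularity" selects the container engine.
$RUN nextflow run zifornd/metaBP -profile test,singularity --outdir results_test
```

The same pattern should apply to the conda profile, with NXF_CONDA_CACHEDIR in place of NXF_SINGULARITY_CACHEDIR.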
See usage docs and parameter docs for all of the available options when running the pipeline.
The zifornd/metaBP pipeline comes with documentation about the pipeline usage, parameters and output. Detailed information about how to specify the input can be found under input specifications.
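For the samplesheet.csv input, a small helper like the one below can build the file from paired-end reads named with the _R1/_R2 pattern used above. This is only a sketch: the column layout (sample, group, short_reads_1, short_reads_2, long_reads) is an assumption borrowed from the nf-core/mag samplesheet convention, and the function name make_samplesheet is hypothetical. Check the input specifications before relying on it.

```shell
#!/bin/sh
# make_samplesheet READS_DIR OUT_CSV  (hypothetical helper, not part of the pipeline)
# Writes one row per sample that has both <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
make_samplesheet() {
    reads_dir=$1
    out=$2

    # ASSUMED header, following the nf-core/mag convention.
    echo "sample,group,short_reads_1,short_reads_2,long_reads" > "$out"

    for r1 in "$reads_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # derive the mate file name
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no matching R2 file" >&2
            continue
        fi
        sample=$(basename "$r1" _R1.fastq.gz)
        # Every sample is placed in group 0; the trailing comma leaves long_reads empty.
        echo "$sample,0,$r1,$r2," >> "$out"
    done
}

# Example call (directory name is a placeholder):
# make_samplesheet reads samplesheet.csv
```

Assigning different group IDs in the second column is what later allows group-wise co-assembly or co-abundance computation.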
Each sample has an associated group ID (see input specifications). This group information can be used for group-wise co-assembly with MEGAHIT or SPAdes and/or to compute co-abundances for the binning step with MetaBAT2. By default, group-wise co-assembly is disabled, while the computation of group-wise co-abundances is enabled. For more information about how this group information can be used, see the documentation for the parameters --coassemble_group and --binning_map_mode.
When group-wise co-assembly is enabled, SPAdes is run on accordingly pooled read files, since metaSPAdes does not yet allow the input of multiple samples or libraries. In contrast, MEGAHIT is run for each group while supplying lists of the individual read files.
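A run with group-wise co-assembly enabled might look like the sketch below. The parameter names come from the text above; treating --coassemble_group as a plain switch and passing group to --binning_map_mode are assumptions based on nf-core/mag, so confirm them in the parameter docs. The RUN variable is an illustration device that prints the command instead of executing it, and the profile, samplesheet and output names are placeholders.

```shell
#!/bin/sh
# Dry-run helper (illustration only): prints the command when RUN is unset.
# Invoke the script with RUN= (empty) to actually execute Nextflow.
RUN="${RUN-echo}"

# --coassemble_group    : assemble all samples of a group together (assumed boolean switch)
# --binning_map_mode    : "group" is assumed to map reads group-wise for co-abundances
$RUN nextflow run zifornd/metaBP \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --coassemble_group \
    --binning_map_mode group
```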
zifornd/metaBP was originally written by Ragavi Shanmugam (@Ragavi-Shanmugam), Riya Saju (@Riya-Saju), Asma Ali (@Asma-Ali), Lathika Madhan Mohan (@Lathika-Madhan-Mohan), Diya Basu (@Diya-Basu) and Mohamed Kassam (@Mohamed-Kassam) from Zifo RnD Solutions.
The current pipeline is built on nf-core/mag version 2.1.1, with additional tools added for benchmarking.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
An extensive list of references for the tools and data used by the pipeline can be found in the CITATIONS.md file.