Highly parallelised multi-taxonomic profiling of shotgun short- and long-read metagenomic data

Home Page: https://nf-co.re/taxprofiler

License: MIT License

HTML 1.14% Nextflow 98.86%
metagenomics profiling taxonomic-profiling shotgun nf-core nextflow workflow pipeline classification microbiome

taxprofiler's Introduction

nf-core/taxprofiler

Introduction

nf-core/taxprofiler is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun short- and long-read metagenomic data. It allows for in-parallel taxonomic identification of reads or taxonomic abundance estimation with multiple classification and profiling tools against multiple databases, and produces standardised output tables to facilitate comparison of results between different tools and databases.

Pipeline summary

  1. Performs read QC (FastQC, or falco as an alternative option)
  2. Performs optional read pre-processing
  3. Supports statistics for host-read removal (Samtools)
  4. Performs taxonomic classification and/or profiling using one or more of:
  5. Performs optional post-processing with:
  6. Standardises output tables (Taxpasta)
  7. Presents QC for raw reads (MultiQC)
  8. Plots Kraken2, Centrifuge, Kaiju and MALT results (Krona)

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
2612,run1,ILLUMINA,2612_run1_R1.fq.gz,,
2612,run2,ILLUMINA,2612_run2_R1.fq.gz,,
2612,run3,ILLUMINA,2612_run3_R1.fq.gz,2612_run3_R2.fq.gz,

Each row represents a fastq file (single-end), a pair of fastq files (paired-end), or a fasta (with long reads).

Additionally, you will need a database sheet that looks as follows:

databases.csv:

tool,db_name,db_params,db_path
kraken2,db2,--quick,/<path>/<to>/kraken2/testdb-kraken2.tar.gz
metaphlan,db1,,/<path>/<to>/metaphlan/metaphlan_database/

The db_path column points to directories or .tar.gz archives containing the databases for the tools you wish to run the pipeline against.

Now, you can run the pipeline using:

nextflow run nf-core/taxprofiler \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --databases databases.csv \
   --outdir <OUTDIR>  \
   --run_kraken2 --run_metaphlan

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Credits

nf-core/taxprofiler was originally written by James A. Fellows Yates, Sofia Stamouli, Moritz E. Beber, and the nf-core/taxprofiler team.

Team

We thank the following people for their contributions to the development of this pipeline:

Acknowledgments

We also are grateful for the feedback and comments from:

And specifically to

❤️ also goes to Zandra Fagernäs for the logo.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #taxprofiler channel (you can join with this invite).

Citations

If you use nf-core/taxprofiler for your analysis, please cite it using the following doi: 10.1101/2023.10.20.563221.

Stamouli, S., Beber, M. E., Normark, T., Christensen II, T. A., Andersson-Li, L., Borry, M., Jamy, M., nf-core community, & Fellows Yates, J. A. (2023). nf-core/taxprofiler: Highly parallelised and flexible pipeline for metagenomic taxonomic classification and profiling. In bioRxiv (p. 2023.10.20.563221). https://doi.org/10.1101/2023.10.20.563221

For the latest version of the code, cite the Zenodo doi: 10.5281/zenodo.7728364

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

taxprofiler's People

Contributors

alexhbnr, hkaspersen, husensofteng, ilight1542, jfy133, jianhong, joon-klaps, lilyanderssonlee, ljmesi, mashehu, maxibor, maxulysse, midnighter, millironx, mjamy, nf-core-bot, rafalstepien, robsyme, sofstam, zandrafagernas

taxprofiler's Issues

Profiler: centrifuge

Description of feature

Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples, with better sensitivity than and comparable accuracy to other leading systems. The system uses a novel indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (e.g., 4.3 GB for ~4,100 bacterial genomes) yet provides very fast classification speed, allowing it to process a typical DNA sequencing run within an hour. Together these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers.

https://ccb.jhu.edu/software/centrifuge/

Preprocessing: long read host removal

Description of feature

To remove e.g. human reads

What needs to be done

Use minimap2 to remove human reads from nanopore reads. Minimap2 is already implemented as an nf-core module.

This can be closed when

  • Minimap2 has been merged to dev branch.

Allow multiple databases

Description of feature

The purpose of this pipeline is to allow in-parallel taxonomic profiling of metagenomic data against multiple taxonomic profilers. However, in some cases it might be of interest to additionally align against multiple databases per profiler (e.g., against a bacterial, then a virus, then a fungi database, separately).

We should allow this within the pipeline. However this theoretically would mean different database contexts may need different parameters. I therefore suggest we pass a second input sheet that represents the location of all databases, and the parameters required for each database.

tool    db_name      tool_params  db_path
kraken  kraken-db1   -n 10        /path/to/db1
kraken  kraken2-db2  -n 4         /path/to/db2
malt    malt-db1     -F -s        path/to/malt-db1

Then all samples/databases can be combined with .combine so each sample will be aligned against each database.
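The pairing that .combine would produce can be sketched in plain Python as a Cartesian product. This is only an illustration of the idea, not the pipeline's Nextflow code: the sample names and the combine helper are hypothetical, and the database rows are copied from the table above.

```python
from itertools import product

# Hypothetical sample identifiers; the database rows mirror the proposal above.
samples = ["sample1", "sample2"]
databases = [
    {"tool": "kraken", "db_name": "kraken-db1", "tool_params": "-n 10", "db_path": "/path/to/db1"},
    {"tool": "kraken", "db_name": "kraken2-db2", "tool_params": "-n 4", "db_path": "/path/to/db2"},
    {"tool": "malt", "db_name": "malt-db1", "tool_params": "-F -s", "db_path": "path/to/malt-db1"},
]


def combine(samples, databases):
    """Pair every sample with every database row (Cartesian product)."""
    return [(sample, db) for sample, db in product(samples, databases)]


jobs = combine(samples, databases)
for sample, db in jobs:
    # Each job carries its own per-database parameters.
    print(sample, db["tool"], db["db_name"], db["tool_params"])
```

With two samples and three database rows this yields six jobs, each carrying the parameters of its own database.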

Profiler: kraken2

Description of feature

Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. The k-mer assignments inform the classification algorithm. [see: Kraken 1's Webpage for more details].

https://ccb.jhu.edu/software/kraken2/

samplesheet_check.py is not compatible with fetchngs generated output

Description of the bug

samplesheet_check.py currently forces a specific set of headers and the order of the header.

This is not compatible with the fetchngs pipeline-specific output, which includes a number of other metadata columns in its own column ordering and asks the user to double-check the input and, where necessary, rename columns or fill them with the contents of other columns.

It would be good to accept the output from fetchngs directly (with the different order and extra columns) for cases where all the downloaded data is assumed correct (which is hopefully most cases).

Furthermore, loosening the order and allowing extra columns would support additional use cases, e.g. @sofstam & co, who would like to record case/control-like information.

More specifically:

HEADER = [
    "sample",
    "run_accession",
    "instrument_platform",
    "fastq_1",
    "fastq_2",
    "fasta",
]
header = [x.strip('"') for x in fin.readline().strip().split(",")]
if header[: len(HEADER)] != HEADER:
    print(
        "ERROR: Please check samplesheet header -> {} != {}".format(
            ",".join(header), ",".join(HEADER)
        )
    )
    sys.exit(1)

This is the check that reports the error when using the fetchngs output from nf-core/fetchngs#97.

We should check only that the required columns are present.
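A minimal sketch of such a relaxed check, assuming the six columns from the excerpt above are the required ones. The function name missing_columns and the example header are hypothetical, and this is not the script's actual implementation.

```python
import csv
import io

# Required columns, taken from the HEADER list in the excerpt above.
REQUIRED_COLUMNS = [
    "sample",
    "run_accession",
    "instrument_platform",
    "fastq_1",
    "fastq_2",
    "fasta",
]


def missing_columns(handle):
    """Return the required columns absent from the header row.

    Column order and any extra columns are ignored.
    """
    reader = csv.reader(handle)
    header = [col.strip() for col in next(reader)]
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Hypothetical header: different order plus one extra metadata column.
example = io.StringIO(
    "run_accession,sample,fastq_1,fastq_2,fasta,instrument_platform,study_title\n"
)
missing = missing_columns(example)
if missing:
    print("ERROR: samplesheet is missing columns: " + ",".join(missing))
```

Because rows would then have to be read by column name rather than position, something like csv.DictReader would be a natural fit for the rest of the script.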

Command used and terminal output

No response

Relevant files

No response

System information

No response

Postprocessing: taxon table generation

Description of feature

We should consider standardising what output tables look like.

This could be done with simple parsing and production of (standardised, if possible) TSV tables, and/or by producing formats such as BIOM.

Find full-test data

Description of feature

These should be 3-5 'real life' samples that you would profile against.

Ideally these would be short-read/long-read pairs.

Profiler request: MetaPhlAn3

Description of feature

MetaPhlAn (Metagenomic Phylogenetic Analysis) is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:

https://huttenhower.sph.harvard.edu/metaphlan/

Add separate workflow for database building

Description of feature

I guess it could maybe be nice to have an entirely separate workflow

e.g.

nextflow run nf-core/taxprofiler --taxprofiling

vs

nextflow run nf-core/taxprofiler --dbbuilding

or something, where the latter could somehow allow you to build databases for all our supported tools from the same set of input FASTA files?

I don't know if this would be of interest, given that all tools would likely need a lot of tuned parameters. But maybe that's OK if we supply them with another TSV sheet or something...

Preprocessing: short read host removal

Description of feature

We should offer a step to remove potential host reads and/or phiX reads, which may increase run time and result in false-positive hits due to contamination in reference databases.

Profiler: mOTUs

Description of feature

Phylogenetic markers are genes that can be used to reconstruct the evolutionary history of organisms and to profile the taxonomic composition of environmental samples. Efforts to find a good set of protein-coding phylogenetic marker genes led to the identification of 40 universal marker genes (MGs) [1,2]. These 40 MGs occur in single copy in the vast majority of known organisms and they have been used to delineate prokaryotic organisms at the species level [3].

We developed the mOTU profiler as a successor of the original version described in [4]. It uses 10 of the 40 MGs to taxonomically profile shotgun metagenomes, to quantify metabolically active members in metatranscriptomics and to quantify differences between strain populations using single nucleotide variation (SNV) profiles. We extracted the MGs from ~86,000 prokaryotic reference genomes and more than 3,100 publicly available metagenomes (from major human body sites, gut microbiome samples from disease association studies, and ocean water samples). Clustering of MGs led to the generation of a database of MG-based operational taxonomic units (mOTUs) containing 2,297 metagenomic mOTUs (meta-mOTUs) and 11,915 reference mOTUs (ref-mOTUs). For the most recent version (2.6) we extended the database by 19,358 (ext-mOTUs) using MGs from ~600,000 metagenome assembled genomes from 23 environments (mouse, cat, dog, pig, freshwater, wastewater, air, ...). Alignments against this database are then used to taxonomically classify reads, to identify metabolically active members and to profile sub-species level SNVs.

https://motu-tool.org/

Extract average read length from fastp output

Description of feature

Bracken needs the read length as an input. This is not consistently or always correctly reported in SRA metadata, so it is better to estimate the average read length. fastp already does this, so we can extract the information from its output.
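A rough sketch of the extraction, assuming the fastp JSON report exposes summary.after_filtering.read1_mean_length (plus read2_mean_length for paired-end data). The function name and the report fragment below are made up for illustration, and the key names should be checked against a real report.

```python
import json


def mean_read_length(fastp_report, stage="after_filtering"):
    """Average read length from a parsed fastp JSON report.

    Assumes summary.<stage>.read1_mean_length exists and that
    read2_mean_length is present only for paired-end data.
    """
    summary = fastp_report["summary"][stage]
    lengths = [summary["read1_mean_length"]]
    if "read2_mean_length" in summary:
        lengths.append(summary["read2_mean_length"])
    return round(sum(lengths) / len(lengths))


# Made-up report fragment, used only to exercise the function.
report = json.loads(
    '{"summary": {"after_filtering": '
    '{"read1_mean_length": 148, "read2_mean_length": 146}}}'
)
print(mean_read_length(report))
```

Averaging the two mates is a simplification chosen for this sketch; taking only read 1 would be an equally defensible choice.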

Profiler request: kaiju

Description of feature

Kaiju is a program for sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments

https://kaiju.binf.ku.dk/

All reads from host-removal are published by default

Description of the bug

We are going with a more opt-in approach in the pipeline's preprocessing steps, so we should make sure that aligned and unaligned reads can each be published optionally and independently of one another.

The same applies to reference index files.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Feature: Allow Recentrifuge as replacement for Krona

Once Krona support is added (#44), it would be nice to allow using https://github.com/khyox/recentrifuge as a replacement when possible. It's a souped-up version of Krona that supports

  • Centrifuge
  • LMAT
  • CLARK
  • Kraken

out of the box without adding an extra conversion step, and can take any generic CSV/TSV as input, too. It includes confidence scores along with the taxonomy classifications. The downside is that it requires a NCBI taxonomy database to run. It could be downloaded on-the-fly using the included retaxdump, or could be a required input from the user.

Originally posted by @MillironX in #44 (comment)

Profiler: MetaMaps

Description of feature

MetaMaps is a tool specifically developed for the analysis of long-read (PacBio/Oxford Nanopore) metagenomic datasets. It simultaneously carries out read assignment and sample composition estimation. It is faster than classical exact alignment-based approaches, and its output is more information-rich than that of kmer-spectra-based methods. For example, each MetaMaps alignment comes with an approximate alignment location, an estimated alignment identity and a mapping quality.
The approximate mapping algorithm employed by MetaMaps is based on MashMap. MetaMaps adds a mapping quality model and EM-based estimation of sample composition.

https://www.nature.com/articles/s41467-019-10934-2

Work in progress in nf-core/modules:
MetaMaps

FASTA input files are not yet incorporated into the profiling workflow

Description of the bug

I just realised when looking at the code that, while we make a branch for FASTA (vs short/long read FASTQ), we do not actually integrate these files into the workflow at any point.

We should make sure we incorporate them into the workflow at the profiling step (i.e., skipping any other form of processing), and only into tools that accept FASTA input (i.e., we will likely need to apply a filter {} operator to remove FASTA files from input channels to profilers that don't accept this input).
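The filtering idea can be sketched in Python as follows. The tool names tool_a and tool_b, the ACCEPTS_FASTA table and the inputs_for_tool helper are all hypothetical: which profilers actually accept FASTA has to be checked per tool, and the real pipeline would do this on Nextflow channels.

```python
# Hypothetical capability table; real values must be verified per profiler.
ACCEPTS_FASTA = {"tool_a": True, "tool_b": False}


def inputs_for_tool(tool, inputs):
    """Drop FASTA inputs for tools not marked as accepting FASTA."""
    if ACCEPTS_FASTA.get(tool, False):
        return list(inputs)
    return [item for item in inputs if item["format"] != "fasta"]


# Example inputs; paths are placeholders.
inputs = [
    {"sample": "ERS0000001", "format": "fasta", "path": "/path/to/data1.fa"},
    {"sample": "ERS0000001", "format": "fastq", "path": "/path/to/data2_1.fq.gz"},
]

for tool in ACCEPTS_FASTA:
    print(tool, [item["path"] for item in inputs_for_tool(tool, inputs)])
```

Unknown tools default to not receiving FASTA, which is the conservative choice in this sketch.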

Command used and terminal output

No response

Relevant files

No response

System information

No response

Postprocessing: bracken

Description of feature

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample. Bracken uses the taxonomy labels assigned by Kraken, a highly accurate metagenomics classification algorithm, to estimate the number of reads originating from each species present in a sample. Kraken classifies reads to the best matching location in the taxonomic tree, but does not estimate abundances of species. We use the Kraken database itself to derive probabilities that describe how much sequence from each genome is identical to other genomes in the database, and combine this information with the assignments for a particular sample to estimate abundance at the species level, the genus level, or above. Combined with the Kraken classifier, Bracken produces accurate species- and genus-level abundance estimates even when a sample contains two or more near-identical species.

https://ccb.jhu.edu/software/bracken/

Preprocessing: run merging

Description of feature

In some cases multiple libraries and/or sequencing runs of a single sample can be generated. In these cases you would want to profile all reads together as this can influence some secondary profiling steps (e.g. LCA). We should allow automated merging of these.
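One way the grouping step behind such merging could look, sketched in Python. The group_runs helper is hypothetical, the rows reuse the example samplesheet from the README, and keeping single-end and paired-end runs in separate groups is an assumption of this sketch rather than a decision recorded in the issue.

```python
from collections import defaultdict


def group_runs(rows):
    """Group run files by sample and layout (single- vs paired-end).

    Layouts are kept apart because single-end and paired-end runs
    cannot simply be concatenated into one set of files.
    """
    grouped = defaultdict(list)
    for row in rows:
        layout = "paired" if row["fastq_2"] else "single"
        grouped[(row["sample"], layout)].append((row["fastq_1"], row["fastq_2"]))
    return dict(grouped)


rows = [
    {"sample": "2612", "run_accession": "run1", "fastq_1": "2612_run1_R1.fq.gz", "fastq_2": ""},
    {"sample": "2612", "run_accession": "run2", "fastq_1": "2612_run2_R1.fq.gz", "fastq_2": ""},
    {"sample": "2612", "run_accession": "run3", "fastq_1": "2612_run3_R1.fq.gz", "fastq_2": "2612_run3_R2.fq.gz"},
]

merged = group_runs(rows)
for (sample, layout), files in merged.items():
    print(sample, layout, len(files))
```

Each resulting group lists the files that would be concatenated before profiling.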

Input CSV structure

Description of feature

We will need to support both input (paired/single) FASTQ and also FASTA files, as the latter seems common.

I propose something like this:

sample_id,run_id,format,r1,r2
ERS0000001,ERR000008,fasta,/path/to/data1.fa,NA
ERS0000001,ERR000009,fastq,/path/to/data2_1.fq.gz,/path/to/data2_2.fq.gz
ERS0000001,ERR000009,fastq,/path/to/data3_1.fq.gz,NA

i.e.

UPDATE:

sample      run_accession  instrument_platform  fastq_1                 fastq_2                 fasta
ERS0000001  ERR000008      OXFORD_NANOPORE      NA                      NA                      /path/to/data1.fa
ERS0000001  ERR000009      ILLUMINA             /path/to/data2_1.fq.gz  /path/to/data2_2.fq.gz
ERS0000001  ERR000009      ILLUMINA             /path/to/data3_1.fq.gz  NA

Check instrument_platform against: https://www.ebi.ac.uk/ena/portal/api/controlledVocab?field=instrument_platform
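A small sketch of such a check, using a hard-coded subset of platform names instead of querying the ENA endpoint. ILLUMINA and OXFORD_NANOPORE come from the table above; the other two entries and the invalid_platforms helper are assumptions for illustration, and the authoritative list is the controlled vocabulary itself.

```python
# Illustrative subset only; the ENA controlled vocabulary is authoritative.
KNOWN_PLATFORMS = {"ILLUMINA", "OXFORD_NANOPORE", "PACBIO_SMRT", "ION_TORRENT"}


def invalid_platforms(rows, allowed=KNOWN_PLATFORMS):
    """Return the sorted set of platform values not in the allowed list."""
    found = {row["instrument_platform"] for row in rows}
    return sorted(found - set(allowed))


rows = [
    {"sample": "ERS0000001", "instrument_platform": "OXFORD_NANOPORE"},
    {"sample": "ERS0000001", "instrument_platform": "ILLUMINA"},
    {"sample": "ERS0000001", "instrument_platform": "illumina"},  # wrong case
]

bad = invalid_platforms(rows)
if bad:
    print("ERROR: unknown instrument_platform value(s): " + ", ".join(bad))
```

The comparison is case-sensitive on purpose here, so that a lower-case value is reported rather than silently accepted.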

Remove support for uncompressed fastq files from check_samplesheet.py and docs

Description of the bug

check_samplesheet.py allows the input of uncompressed fastq files into the workflow, but these files are not supported by the kraken2 module (even when the --gzip-compressed flag is removed) and an error is produced at the kraken2 profiling step. Thanks for all your help on the Slack!
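A minimal sketch of the stricter check, assuming compressed files are recognised by a .gz suffix. The uncompressed_fastqs helper and the second file name are hypothetical; this is not the script's current code.

```python
def uncompressed_fastqs(paths):
    """Return the non-empty FASTQ paths that do not end in .gz."""
    return [path for path in paths if path and not path.endswith(".gz")]


# First path is from the README example; the second is a made-up offender.
paths = ["2612_run1_R1.fq.gz", "2612_run2_R1.fastq", ""]

rejected = uncompressed_fastqs(paths)
if rejected:
    print("ERROR: FASTQ files must be gzip-compressed: " + ", ".join(rejected))
```

Empty entries are skipped so that an unused fastq_2 column does not trigger the error.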

Command used and terminal output

No response

Relevant files

No response

System information

  • Nextflow version: 22.04.3.5703
  • Hardware: AWS EC2 instance r5a.8xlarge with gp3 volume
  • Executor: local
  • Container engine: Singularity
  • OS: Ubuntu 22.04 LTS
  • Version of nf-core/taxprofiler: dev

Current MALT version does not correctly perform LCA

Description of the bug

It was recently identified in my department that MALT v0.5 has a broken LCA (as in, it doesn't execute at all).

We should update the module to fall back to 0.4.2 (the last known working version).

Command used and terminal output

No response

Relevant files

No response

System information

No response

Move profiling into a subworkflow

Description of feature

Maybe not really useful for us, but could be future proofing if we port it to a nf-core/subworkflow (other pipelines may want to use it as a contamination check, for example)

Write documentation

Description of feature

Start writing documentation.

This can include parameters, making a pretty workflow diagram, describing output files etc.

Profiler: diamond

Description of feature

DIAMOND - high throughput protein alignment
DIAMOND is a high-throughput program for aligning DNA reads or protein sequences against a protein reference database such as NR, at up to 20,000 times the speed of BLAST, with high sensitivity.

On Illumina reads of length 100-150bp, in fast mode, DIAMOND is about 20,000 times faster than BLASTX, while reporting about 80-90% of all matches that BLASTX finds, with an e-value of at most 1e-5. In sensitive mode, DIAMOND is about 2,500 times faster than BLASTX, finding more than 94% of all matches.

https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/diamond/

Create workflow diagram

Description of feature

maybe in Tube map style (I have a new style I wanna try :P ). Should be done once we finalise contents of first release

All tools should have some form of MultiQC modules

Description of feature

If possible...

Currently supported by MultiQC

This can be closed when:
