nf-core / mag

Assembly and binning of metagenomes

Home Page: https://nf-co.re/mag

License: MIT License

Topics: workflow, nextflow, metagenomics, assembly, binning, annotation, nf-core, pipeline, bioinformatics, nanopore

mag's Introduction

nf-core/mag

[![GitHub Actions CI Status](https://github.com/nf-core/mag/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/mag/actions?query=workflow%3A%22nf-core+CI%22) [![GitHub Actions Linting Status](https://github.com/nf-core/mag/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/mag/actions?query=workflow%3A%22nf-core+linting%22) [![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/mag/results) [![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.3589527-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.3589527) [![Cite Publication](https://img.shields.io/badge/Cite%20Us!-Cite%20Publication-orange)](https://doi.org/10.1093/nargab/lqac007)


Introduction

nf-core/mag is a bioinformatics best-practice analysis pipeline for assembly, binning and annotation of metagenomes.

nf-core/mag workflow overview

Pipeline summary

By default, the pipeline currently performs the following: it supports both short and long reads, quality-trims the reads and adapters with fastp and Porechop, performs basic QC with FastQC, and merges multiple sequencing runs.

The pipeline then assembles the reads with MEGAHIT and/or SPAdes, bins the resulting contigs with MetaBAT2, assesses bin quality with BUSCO and QUAST, and taxonomically classifies the bins with CAT.

Furthermore, the pipeline creates various reports in the specified results directory, including a MultiQC report summarizing key findings and software versions.

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
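For example, a quick test run could look like this (a sketch, assuming Docker is available; replace <OUTDIR> with a path of your choice):

nextflow run nf-core/mag -profile test,docker --outdir <OUTDIR>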

nextflow run nf-core/mag -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input '*_R{1,2}.fastq.gz' --outdir <OUTDIR>

or

nextflow run nf-core/mag -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input samplesheet.csv --outdir <OUTDIR>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Group-wise co-assembly and co-abundance computation

Each sample has an associated group ID (see input specifications). This group information can be used for group-wise co-assembly with MEGAHIT or SPAdes and/or to compute co-abundances for the binning step with MetaBAT2. By default, group-wise co-assembly is disabled, while the computation of group-wise co-abundances is enabled. For more information about how this group information can be used see the documentation for the parameters --coassemble_group and --binning_map_mode.
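For illustration, a hypothetical invocation enabling both group-wise co-assembly and group-wise co-abundances might look like this (a sketch; parameter names as documented above, the samplesheet path is a placeholder):

nextflow run nf-core/mag -profile docker --input samplesheet.csv --outdir <OUTDIR> --coassemble_group --binning_map_mode group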

When group-wise co-assembly is enabled, SPAdes is run on reads pooled per group, since metaSPAdes does not yet allow the input of multiple samples or libraries. In contrast, MEGAHIT is run once per group while being supplied the list of individual read files.
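In practice, pooled read files simply mean the per-sample FASTQ files of a group concatenated before assembly, roughly like this (a sketch with placeholder file names; concatenated gzip streams form a valid gzip file):

cat sample1_R1.fastq.gz sample2_R1.fastq.gz > group1_R1.fastq.gz
cat sample1_R2.fastq.gz sample2_R2.fastq.gz > group1_R2.fastq.gz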

Credits

nf-core/mag was written by Hadrien Gourlé at SLU, Daniel Straub and Sabrina Krakau at the Quantitative Biology Center (QBiC). James A. Fellows Yates and Maxime Borry at the Max Planck Institute for Evolutionary Anthropology joined in version 2.2.0. More recent contributors include Jim Downie and Carson Miller.

Long read processing was inspired by caspargross/HybridAssembly, written by Caspar Gross (@caspargross).

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #mag channel (you can join with this invite).

Citations

If you use nf-core/mag for your analysis, please cite the publication as follows:

nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning

Sabrina Krakau, Daniel Straub, Hadrien Gourlé, Gisela Gabernet, Sven Nahnsen.

NAR Genom Bioinform. 2022 Feb 2;4(1):lqac007. doi: 10.1093/nargab/lqac007.

Additionally, you can cite the pipeline directly with the following DOI: 10.5281/zenodo.3589527

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

mag's People

Contributors

adamrtalbot, alexhbnr, alneberg, alxndrdiaz, antoniaschuster, apeltzer, carsonjm, d4straub, drpatelh, erikrikarddaniel, ewels, friederikehanssen, ggabernet, gregorysprenger, hadrieng, heuermh, jfy133, kevinmenden, maxibor, maxulysse, mglubber, nf-core-bot, pcantalupo, philpalmer, prototaxites, skrakau, tillenglert, willros


mag's Issues

Handle several metagenome samples not individually but synergistically.

Problem

The pipeline should allow assembly of multiple samples instead of, or in addition to, treating them individually. For example, several metagenome samples from the same study might share genomes, and both assembly and binning can benefit from pooling these samples instead of treating them individually, as is done right now.

Possible solutions

MetaSPAdes 3.13.0 using Illumina and Nanopore data (hybrid assembly) can't handle several samples just yet, but MEGAHIT and possibly Illumina-only SPAdes should be able to handle this. MetaBAT also allows using multiple samples for improved binning.

  • a parameter (such as --pool_samples) could be added to allow optional pooling
  • several samples could simply be added to the assembly parameters in process megahit and process spades
  • binning could be improved by using depth information from several samples when available (e.g. with jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam as described in https://bitbucket.org/berkeleylab/metabat). Done; see the sketch below.
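A minimal sketch of that multi-sample co-abundance binning (placeholder file names; each BAM is assumed to contain one sample's reads mapped against the same assembly):

# compute per-sample contig depths across all samples, then bin with MetaBAT2
jgi_summarize_bam_contig_depths --outputDepth depth.txt sample1.bam sample2.bam sample3.bam
metabat2 -i assembly.contigs.fa -a depth.txt -o MetaBAT2/bin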

Feedback welcome

Is this of interest? For me it's not high priority, but it would definitely be interesting to have.

busco error - ignored

When running the pipeline on the test data with scratch = false, I frequently see this error:

"""
NOTE: Process busco (MEGAHIT-test_minigut_sample2.unbinned.2.fa) terminated with an error exit status (1) -- Error is ignored
"""

The pipeline still completes successfully.

The .command.log file contains:
"cp: cannot stat 'run_MEGAHIT-test_minigut_sample2.unbinned.2.fa/short_summary_MEGAHIT-test_minigut_sample2.unbinned.2.fa.txt': No such file or directory"

SPAdes errors with >1000GB RAM

SPAdes expects an integer number of GB for '--memory', but Nextflow renders memory values above 1000 GB as TB (e.g. 1.5 TB), so '--memory' receives a non-integer value with a TB unit, and this leads to a failure.

Error message:
Caused by:
  Process `spades (QMFCE004AVQMFCE027A1QMFCE028A9)` terminated with an error exit status (1)

Command executed:

  metaspades.py         --threads "64"         --memory "1.5T"         --pe1-1 QMFCE004AVQMFCE027A1QMFCE028A9_unmapped_1.fastq.gz         --pe1-2 QMFCE004AVQMFCE027A1QMFCE028A9_unmapped_2.fastq.gz         -o spades
  mv spades/assembly_graph_with_scaffolds.gfa QMFCE004AVQMFCE027A1QMFCE028A9_graph.gfa
  mv spades/scaffolds.fasta QMFCE004AVQMFCE027A1QMFCE028A9_scaffolds.fasta
  mv spades/contigs.fasta QMFCE004AVQMFCE027A1QMFCE028A9_contigs.fasta
  mv spades/spades.log QMFCE004AVQMFCE027A1QMFCE028A9_log.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/opt/conda/envs/nf-core-mag-1.0.0/bin/metaspades.py", line 1102, in <module>
      main(sys.argv)
    File "/opt/conda/envs/nf-core-mag-1.0.0/bin/metaspades.py", line 651, in main
      cfg, dataset_data = fill_cfg(options, log)
    File "/opt/conda/envs/nf-core-mag-1.0.0/bin/metaspades.py", line 297, in fill_cfg
      options_storage.memory = int(arg)
  ValueError: invalid literal for int() with base 10: '1.5T'

Untested possible solution:
Instead of
def maxmem = "${task.memory.toString().replaceAll(/[\sGB]/,'')}"
use
def maxmem = "${task.memory.toGiga()}"
(task.memory.toGiga() already returns an integer number of GB, so no string stripping is needed.)

Create full size test

Create a full-size test to run on AWS Batch and specify it in .github/workflows/awsfulltest.yml.

Busco Archive error

Hi,
I wanted to test mag, and I get this error when I run it with the test profile or with our lab's test samples:
ERROR ~ No such file: https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz

I have tested wget https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz manually and it works fine.

Command lines:
nextflow run nf-core/mag -profile genotoul,test
nextflow run nf-core/mag --reads '*_R{1,2}.fastq.gz' -profile singularity,genotoul

Do you know about this issue and how to solve it?
Thanks a lot in advance!

Running the pipeline with singularity

I had success running the test, but when I try to run the pipeline with my own data it fails with the message
-[nf-core/mag] Pipeline completed with errors-

If someone could take a look at my command below and let me know where I'm going wrong, that would be helpful.

nextflow run nf-core/mag -r 1.1.0 --input '/Users/rsompallae/*_R{1,2}.fastq' -profile singularity -with-singularity /Users/rsompallae/biotools/nfcore-mag-1.1.0.sif -c nfcore_ui_hpc_2.config --busco_reference /Users/rsompallae/biotools/bacteria_odb10.2020-03-06.tar.gz

Thanks
Rama

MetaBAT2 binning discards unbinned contigs

I noticed that the current MetaBAT2 binning discards contigs that do not bin with any other contigs. So when looking at the pipeline output, any single contigs are missing (e.g. from BUSCO or CAT taxonomic assignments).
That doesn't sound too alarming, but I now have a dataset with several really large contigs, e.g. a 4.3 Mb sequence, that do not appear in the output (MultiQC, BUSCO, CAT). I don't think this is how the pipeline should handle the data. Unbinned contigs need to be reported more visibly, e.g. with MetaBAT's parameter --unbinned (see the sketch below).
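A sketch of the proposed fix (placeholder file names; per MetaBAT2's help, --unbinned writes contigs that were not assigned to any bin into a separate <outfile>.unbinned.fa):

metabat2 -t 8 -i assembly.contigs.fa -a depth.txt -o MetaBAT2/bin -m 1500 --unbinned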

Apply CAT also on unbinned contigs

Currently CAT is only applied to MetaBAT bins for taxonomic classification. It might be useful to also apply it to whole assemblies and/or to unbinned contigs.

checkm data setRoot not working using singularity

When running the pipeline with Singularity, I get this error in the checkm process:

It seems that the CheckM data folder has not been set yet or has been removed. Running: 'checkm data setRoot'.
You do not seem to have permission to edit the checkm config file
located at /opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG
Please try again with updated privileges. Error was:

[Errno 30] Read-only file system: '/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG'
Sorry, CheckM cannot run without a valid data folder.

Unexpected error: <type 'exceptions.IOError'>
Traceback (most recent call last):
  File "/opt/conda/envs/nf-core-mag-1.0.0/bin/checkm", line 708, in <module>
    checkmParser.parseOptions(args)
  File "/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/main.py", line 1237, in parseOptions
    self.taxonSet(options)
  File "/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/main.py", line 262, in taxonSet
    bValidSet = taxonParser.markerSet(options.rank, options.taxon, options.marker_file)
  File "/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/taxonParser.py", line 82, in markerSet
    taxonMarkerSets = self.readMarkerSets()
  File "/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/taxonParser.py", line 40, in readMarkerSets
    for line in open(DefaultValues.TAXON_MARKER_SETS):
IOError: [Errno 2] No such file or directory: 'taxon_marker_sets.tsv'

For checkm_download_db, it doesn't crash (I get exitcode 0) but I still have the same issue when checking the log:

*******************************************************************************
 [CheckM - data] Check for database updates. [setRoot]
*******************************************************************************

Data location not changed
It seems that the CheckM data folder has not been set yet or has been removed. Running: 'checkm data setRoot'.
You do not seem to have permission to edit the checkm config file
located at /opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG
Please try again with updated privileges. Error was:

[Errno 30] Read-only file system: '/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG'
Sorry, CheckM cannot run without a valid data folder.
It seems that the CheckM data folder has not been set yet or has been removed. Running: 'checkm data setRoot'.
You do not seem to have permission to edit the checkm config file
located at /opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG
Please try again with updated privileges. Error was:

[Errno 30] Read-only file system: '/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG'
Sorry, CheckM cannot run without a valid data folder.
You do not seem to have permission to edit the checkm config file
located at /opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG
Please try again with updated privileges. Error was:

[Errno 30] Read-only file system: '/opt/conda/envs/nf-core-mag-1.0.0/lib/python2.7/site-packages/checkm/DATA_CONFIG'

Do you think this has something to do with user-permission differences between Docker and Singularity?

CAT: Database was built with a different version of Diamond and is incompatible.

I get the following error using CAT with the CAT_prepare_20200618.tar.gz database:

Error: Database was built with a different version of Diamond and is incompatible.
[2020-08-14 20:06:10.946813] ERROR: DIAMOND finished abnormally.

Not really an error of the mag pipeline itself, but it might be worth checking out, mainly for the purpose of updating the documentation.
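A quick way to check for such a mismatch (a sketch; assuming the DIAMOND binary used by the pipeline is on PATH, and that the CAT_prepare tarball records the DIAMOND version it was built with in its log/README):

# compare this version against the one recorded alongside the CAT_prepare database
diamond version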

error with projectDir

Hi,
I pulled version 1.1.2 and tried to run it, but I'm getting the error 'No such variable: projectDir'. Is this something I need to provide with the command options?

Thanks

On Wed, Nov 25, 2020, 2:42 AM Sabrina Krakau wrote:

Hi @ramakrishnas, yes that would be great! Let us know if it solves your problems.

Originally posted by @ramakrishnas in #131 (comment)

Add taxonomic info

Hi there,

I was wondering about your thoughts on how useful it would be to include here some taxonomic information independent or dependent on the metagenome assembly. I would enjoy having a pipeline that also adds info on the taxa in the genome bins but also of the raw reads.

A tool that I found for taxonomic classification of contigs and metagenome-assembled genomes is CAT/BAT, which might fit quite well into the container.

For giving a taxonomic overview of the relative abundances in the metagenome, centrifuge or kaiju might be interesting; both seem to fit from their requirements.

Let me know if you think it would benefit the pipeline to add these analysis steps as an option.

However, from my side, the dev branch needs to be fixed so that the test runs pass before I start working on adding these options.

Run co-assembly on pooled samples

Hi,

First, I want to thank you for implementing this pipeline with Nextflow, it's extremely convenient to use, and I was able to run it really easily on a HPC.

When working on whole-genome metagenome sequencing data, I usually do a co-assembly instead of assembling each metagenomic sample separately. In other words, I pool the reads from all samples and then use metaspades/megahit or other assemblers. It can provide more coverage for contigs that co-occur in multiple samples, therefore producing more complete genomes. It also makes samples easier to compare directly, since they stem from the same assembly. Pooling can sometimes hurt the assembly process, depending on the data. However, having this option would be very valuable, or even performing both in parallel.

Unless I'm mistaken, this pipeline seems to assemble each sample separately. Would it be possible to include an option to do the assembly with the merged reads as well?

Thank you,
Cédric

Help running in offline cluster

I need to run mag in offline mode and I tried with:

nextflow run $NF/nextflow/nf-core-mag-1.0.0/workflow/main.nf \
  --reads '$INPUT/reads/*_R{1,2}.fq.gz' \
  --busco_reference $DB/bacteria_odb9.tar.gz \
  --outdir out \
  --cat_db $DB/CAT_prepare_20190108.tar.gz \
  --kraken2_db $DB/minikraken2_v2_8GB_201904_UPDATE.tgz \
  --centrifuge_db $DB/p_compressed+h+v.tar.gz \
  -profile singularity

But I got an error from the KronaDB updater:

Caused by:
  Process `krona_db` terminated with an error exit status (1)

Command executed:
  ktUpdateTaxonomy.sh taxonomy

Command exit status:
  1

Command output:
  Creating taxonomy...
  
  Fetching taxdump.tar.gz...
  
  Update failed.
     Is your internet connection okay?

Command error:
  /opt/conda/envs/nf-core-mag-1.0.0/bin/ktUpdateTaxonomy.sh: line 163: [: ==: unary operator expected

Work dir:
  /qib/platforms/Informatics/GMH/nextflow/work/86/ca13b3e7b7dcc86ba54cb03e2f0d1b

I'd like to skip ktUpdateTaxonomy.sh without skipping the Krona plots. How can I run this step in advance (if required) and then run the workflow on the cluster? Thanks
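A possible workaround sketch, assuming your Krona version supports the --only-fetch/--only-build options of ktUpdateTaxonomy.sh:

# on a machine with internet access: download taxdump.tar.gz into ./taxonomy
ktUpdateTaxonomy.sh --only-fetch taxonomy
# on the offline cluster, after copying the taxonomy directory over: build without network access
ktUpdateTaxonomy.sh --only-build taxonomy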

metabat2 "[Error!] the order of contigs in abundance file is not the same as the assembly file"

Hi,
I tested mag on two samples 3 months ago and the whole pipeline ran well.
I wanted to re-test mag on two other samples yesterday and hit this issue, but only for MetaBAT2 combined with the MEGAHIT assembly (everything is fine for the SPAdes assembly).
It seems the files "first" and "second" are swapped in the metabat command line, plus there is another issue with the order of contigs.
I have just re-run mag with the latest nf-core Nextflow version (nfcore-Nextflow-v20.01.0) to see whether the issue persists.
Do you have a solution for this issue?
Thanks a lot in advance!

ERROR ~ Error executing process > 'metabat (MEGAHIT-first)'

Caused by:
  Process `metabat (MEGAHIT-first)` terminated with an error exit status (1)

Command executed:

  jgi_summarize_bam_contig_depths --outputDepth depth.txt MEGAHIT-first-first.bam MEGAHIT-first-second.bam
  metabat2 -t "8" -i "second.contigs.fa" -a depth.txt -o "MetaBAT2/MEGAHIT-first" -m 1500
  
  #if bin folder is empty
  if [ -z "$(ls -A MetaBAT2)" ]; then 
      cp second.contigs.fa MetaBAT2/MEGAHIT-second.contigs.fa
  fi

Command exit status:
  1

Command output:
  MetaBAT 2 (v2.13 (Bioconda)) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200. 

Command error:
  Output depth matrix to depth.txt
  jgi_summarize_bam_contig_depths 2.13 (Bioconda) 2019-06-11T06:53:12
  Output matrix to depth.txt
  0: Opening bam: MEGAHIT-first-first.bam
  1: Opening bam: MEGAHIT-first-second.bam
  Processing bam files
  Thread 0 finished: MEGAHIT-first-first.bam with 52840392 reads and 51019403 readsWellMapped
  Thread 1 finished: MEGAHIT-first-second.bam with 55185564 reads and 53195312 readsWellMapped
  Creating depth matrix file: depth.txt
  Closing most bam files
  Closing last bam file
  Finished
  [Error!] the order of contigs in abundance file is not the same as the assembly file: k141_0

Use labels in config

We could make the config quite a bit shorter by using labels instead of referring to each process by name.

bowtie2 log file contains stdout (sam file)

In process remove_phix, the log files contain stdout (the SAM file) as well as stderr and can therefore blow up to huge sizes. stdout in the log file seems absolutely unnecessary; it should contain only stderr.

This could be solved as in process remove_host:

bowtie2 [...] \
1> /dev/null \
2> ${name}.bowtie2.log

Experience from recent assemblies - proposals for improvements

To save a lot of storage space in the results folder, routinely

  • compress assembler output: Done in #67
    -- SPAdes: *.fasta, *.gfa
    -- MEGAHIT: *.fa
  • compress genome bin sequences (results/GenomeBinning/MetaBAT2/*)
  • do not publish bin sequences in QUAST output (redundant)
  • compress results/Taxonomy/<Assembler>/*.ORF2LCA.names.txt
  • compress folder results/Taxonomy/<Assembler>/raw, possibly reduce output or even skip

For better reporting,

  • [easy] merge bin taxonomy results/Taxonomy/<Assembler>/*.bin2classification.names.txt with bin QC results/GenomeBinning/QC/quast_and_busco_summary.tsv (*.bin2classification.names.txt may contain more than one classification for a bin, aggregate before merging)
  • [easy] collect all assemblies (MEGAHIT/SPAdes/SPAdesHybrid) for one sample in one channel and process with QUAST in one go for side-by-side comparison instead of processing these assemblies independently.
  • [new program & process] provide gene info: BUSCO already extracts marker genes but possibly add Prokka for gene annotation of bins (maybe even for whole metagenomes), or use CAT called genes results/Taxonomy/<assembler>/raw/*.concatenated.predicted_proteins.faa or the corresponding .gff
  • [medium] provide a genomic bin abundance estimate, either by re-mapping Illumina reads (bowtie) or by using the depth.txt file already produced by process metabat, and merge it into results/GenomeBinning/QC/quast_and_busco_summary.tsv. depth.txt is now exported (#66); integrating it into a final table is pending.

Replace atropos with fastp

Nothing against atropos, but fastp:

  • is faster, which matters for large metagenomes
  • may eliminate the python3 dependency and allow integrating checkm into the nf-core Dockerfile

(A basic fastp invocation is sketched below.)
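A basic fastp invocation for paired-end reads could look like this (a sketch with placeholder file names, not the pipeline's actual call):

fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz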

Compress MetaBAT2 unbinned files

From @d4straub :
"MetaBAT2/${meta.assembler}-${meta.id}.lowDepth.fa"
"MetaBAT2/${meta.assembler}-${meta.id}.tooShort.fa"
"MetaBAT2/${meta.assembler}-${meta.id}.unbinned.pooled.fa"
"MetaBAT2/${meta.assembler}-${meta.id}.unbinned.remaining.fa"
"These 4 files recently made ~50% of the size of the output folder, so it would be nice to have that zipped."

Improvements - Possible merge with YAMP

This issue summarises the discussion that emerged at the London hackathon concerning a possible merge with YAMP

While MAG focuses heavily on assembly, YAMP is assembly-free and focuses on taxonomy classification and functional annotation of reads (and is quite human-centric at the moment)

What came out of this discussion is that we'd like to combine both pipelines into one, since the two approaches are very complementary.

A few things/kind of roadmap that came up in the discussion

  • DSL2. MAG already has quite a lot of processes, and the code readability/maintainability would be better with modules/imports.
  • We need to harmonise QC. YAMP currently uses bbduk while MAG uses fastp.
  • There is a need for host removal. This step already exists in YAMP.
  • Ideally the taxonomic classification will use kraken2/metaphlan. The choice would be left to the user.
  • There should be a way to run only the YAMP (or the MAG) part. We already have flags to disable some parts of MAG so it should not be a problem.

cc @alesssia

Update CAT

Update CAT and DIAMOND and adjust the process to the new output format.

Migrate all parameter docs to JSON schema

Hi!

this is not necessarily an issue with the pipeline, but in order to streamline the documentation group next week for the hackathon, I'm opening issues in all repositories / pipeline repos that might need this update to switch from parameter docs to auto-generated documentation based on the JSON schema.

This will then supersede any further parameter documentation, thus making things a bit easier :-)

If this doesn't apply (anymore), please close the issue. Otherwise, I'm hoping to have some helping hands on this next week in the documentation team on Slack https://nfcore.slack.com/archives/C01QPMKBYNR

checkm is not on conda

It would be nicer to have everything in conda. In the meantime, a manual install in the Dockerfile will have to do.

BUSCO update

BUSCO v3.0.2 is currently failing on some datasets, among others on the test profile when scratch = false. Somehow the results and the thrown errors differ between scratch = true and scratch = false (on CFC), which we currently cannot explain. In the past such errors were ignored, which has now been changed (see #68 and #72).
Moreover, it seems that if no tblastn hits are found, this causes an error that is not handled properly.

To achieve more control, update BUSCO to v4.0.6 and handle the case of no tblastn hits. Use a parameter to pass either the path to an already downloaded db (test whether this works with --offline), the name of a db for automatic download, or the auto-lineage setting.

However, currently there is an issue with the download of the BUSCO databases: https://busco.ezlab.org/frames/bact.htm.
See also https://gitlab.com/ezlab/busco/-/issues/293. So I need to wait until this works again, to test this.

Additionally, for offline use, one can also download the whole dataset https://busco-data.ezlab.org/v4/data/ and add the path to the custom config file. I think this should work both with --lineage_dataset bacteria_odb10 and --auto-lineage
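A hedged sketch of offline BUSCO v4 usage with a pre-downloaded dataset (placeholder paths; flag names as documented for BUSCO v4):

busco -i bin.fa -o bin_busco -m genome --offline --lineage_dataset bacteria_odb10 --download_path /path/to/busco_downloads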

Zcat fails on Mac OS X when using Conda

zcat foo.txt.gz doesn't work on Mac OS X.

This causes zcat ${name}_phix_unmapped_1.fastq.gz in process remove_phix to fail on my Mac when using Conda:

zcat: can't stat: test_minigut_trimmed_R1.fastq.gz (test_minigut_trimmed_R1.fastq.gz.Z): No such file or directory

Think about how to handle this differently.
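One portable alternative (a sketch; gunzip -c behaves like zcat but does not assume a .Z suffix on macOS):

# instead of: zcat ${name}_phix_unmapped_1.fastq.gz
gunzip -c ${name}_phix_unmapped_1.fastq.gz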

Protein assemblies are reported to be better than genome assemblies

Hey there,

just another thought that might be interesting: the de-novo protein-level assembler Plass (described here) assembles six-frame-translated sequencing reads into protein sequences and is reported to be 10 times more sensitive than MEGAHIT or metaSPAdes (which we use in this pipeline).

Maybe it would be advantageous to have that as well. However, it is not exactly "assembly, binning and annotation of metagenomes". What do you think?

File not found error when using downloaded kraken2_db

Describe the bug

When I use a downloaded Kraken2 database, place the archive somewhere, and run the pipeline with the --kraken2_db parameter, the kraken2_db_preparation process reports a bug caused by a file not being found.

Steps to reproduce

$ nextflow run nf-core/mag --reads '../data/*_{1,2}.fastq' --busco_reference bacteria_odb9.tar.gz --task.memory 120 --task.cpus 32 -profile docker --kraken2_db minikraken2_v2_8GB_201904.tgz
....
Error executing process > 'kraken2_db_preparation (1)'

Caused by:
  Missing output file(s) `minikraken2_v2_8GB_201904/*.k2d` expected by process `kraken2_db_preparation (1)`

Command executed:

  tar -xf "minikraken2_v2_8GB_201904.tgz"

Command exit status:
  0

Command output:
  (empty)

Work dir:
  /home/stella/Proj/20200908_NF/run/work/2a/aa7e195cef0882b13f789c378442af

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

I am sure that this bug is caused by the folder name inside the archive differing from the base name of the archive:

➜   tree ./minikraken2_v2_8GB_201904_UPDATE 
./minikraken2_v2_8GB_201904_UPDATE
├── database100mers.kmer_distrib
├── database150mers.kmer_distrib
├── database200mers.kmer_distrib
├── hash.k2d
├── opts.k2d
└── taxo.k2d

When the pipeline tried to find minikraken2_v2_8GB_201904/*.k2d, there was only minikraken2_v2_8GB_201904_UPDATE.
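A possible workaround sketch (names taken from the tree above): repackage the database so that the top-level folder matches the archive's base name:

tar -xf minikraken2_v2_8GB_201904.tgz
mv minikraken2_v2_8GB_201904_UPDATE minikraken2_v2_8GB_201904
tar -czf minikraken2_v2_8GB_201904.tgz minikraken2_v2_8GB_201904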

System:

  • Hardware: Dell Server R720
  • OS:
➜  uname -a
Linux hal9003 5.4.0-47-generic #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Executor: local execute within tmux

Nextflow Installation:

  • Version: N E X T F L O W ~ version 20.07.1; nf-core/mag version 1.0.0

Error code 141

The pipeline fails with error code 141 whenever a pipe (|) is in a bash process. Exit code 141 corresponds to SIGPIPE (128 + 13): with pipefail enabled, as in the nf-core process shell settings, head exits after reading 1000 lines and the upstream samtools/awk processes are killed by SIGPIPE.

This fails:

samtools view "${bam}" | awk '{print length(\$10)}' | head -1000 > checkm/read_length.txt

But this does not:

samtools view "${bam}" > rl_tmp_1
awk '{print length(\$10)}' < rl_tmp_1 > rl_tmp_2
head -1000 rl_tmp_2 > checkm/read_length.txt

kraken2_db_preparation failed

Steps to reproduce the behaviour:

  1. Command line: nextflow run nf-core/mag -profile docker --input '*_{1,2}.fastq' --outdir /archive/karakulahg/nfcore-mag/analysis --host_genome GRCh38 --kraken2_db "/archive/db/kraken2db/maindb/kraken2db.tar.gz" --busco_reference "/archive/db/busco/bacteroidia_odb10.2021-02-23.tar.gz" --skip_krona
  2. See error:
Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'kraken2_db_preparation (1)'

Caused by:
  Process `kraken2_db_preparation (1)` terminated with an error exit status (2)

Command executed:

  tar -xf "kraken2db.tar.gz"

Command exit status:
  2

Command output:
  (empty)

Command error:
  tar: library/bacteria/library.fna: Cannot write: Input/output error
  tar: Exiting with failure status due to previous errors

Work dir:
  /archive/karakulahg/nfcore-mag/fastq/2sample_local_kraken/work/95/58ac74a21c56f1a5dbf611428aec6c

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

I expected to be able to use my local Kraken2 db on my local HPC. I compressed the Kraken2 db into a tar.gz file. Could you please help me?

Log files

Have you provided the following extra information/files:

  • The command used to run the pipeline: nextflow run nf-core/mag -profile docker --input '*_{1,2}.fastq' --outdir /archive/karakulahg/nfcore-mag/analysis --host_genome GRCh38 --kraken2_db "/archive/db/kraken2db/maindb/kraken2db.tar.gz" --busco_reference "/archive/db/busco/bacteroidia_odb10.2021-02-23.tar.gz" --skip_krona
  • The nextflow.log file

System

  • Hardware: HPC with 36 CPUs, 384 GB memory, 20 TB disk space, 100 Gbit network, no run limit
  • Executor: local in conda environment created just for nextflow
  • OS: CentOS Linux
  • Version 7.3

Nextflow Installation

  • Version:
    20.10.0 build 5430
    Created: 01-11-2020 15:14 UTC (18:14 EEST)
    System: Linux 3.10.0-1062.4.1.el7.x86_64
    Runtime: Groovy 3.0.5 on OpenJDK 64-Bit Server VM 10.0.2+13
    Encoding: UTF-8 (UTF-8)

Container engine

  • Engine: Docker
  • Version:
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 29
 Server Version: 20.10.3
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1062.4.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 36
 Total Memory: 376.4GiB
 Name: node03
 ID: KKSX:7ZZ4:5A2Q:5FKB:A4NG:PR24:XD36:MAMY:RK3W:Q3WX:PRBE:TKCT
 Docker Root Dir: /docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

  • Image tag: nfcore/mag:1.2.0

Additional context

My local kraken2db folder structure:

pipeline halts when quast finds no min length contigs

Hi, I'm running mag and have run into a problem where it throws an error and stops running when QUAST reaches a contig file that has no contigs meeting the minimum length requirement. Is it possible to create a workaround where QUAST skips these files and the pipeline moves on?

My command:
nextflow run nf-core/mag \
    --reads 'input/*.R{1,2}.fastq.gz' \
    -profile shh \
    --kraken2_db 'ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz' \
    --outdir 'output' \
    -name 'cmc_assembly' \
    -w 'output/work'

And the error:
ERROR ~ Error executing process > 'quast (MEGAHIT-LIB050.A0105.SG1.1)'

Caused by:
Process quast (MEGAHIT-LIB050.A0105.SG1.1) terminated with an error exit status (4)

Command executed:

metaquast.py --threads "1" --rna-finding --max-ref-number 0 -l "MEGAHIT-LIB050.A0105.SG1.1" "LIB050.A0105.SG1.1.contigs.fa" -o "LIB050.A0105.SG1.1_QC"

Command exit status:
4

Command output:
/opt/conda/envs/nf-core-mag-1.0.0/lib/python3.6/site-packages/quast-5.0.2-py3.6.egg-info/scripts/metaquast.py --threads 1 --rna-finding --max-ref-number 0 -l MEGAHIT-LIB050.A0105.SG1.1 LIB050.A0105.
SG1.1.contigs.fa -o LIB050.A0105.SG1.1_QC

Version: 5.0.2

System information:
OS: Linux-4.4.0-38-generic-x86_64-with-debian-9.9 (linux_64) [23/12529]
Python version: 3.6.7
CPUs number: 64

Started: 2020-02-19 13:07:10

Logging to LIB050.A0105.SG1.1_QC/metaquast.log

Contigs:
Pre-processing...
WARNING: Skipping MEGAHIT-LIB050.A0105.SG1.1 because it doesn't contain contigs >= 0 bp.

ERROR! None of the assembly files contains correct contigs. Please, provide different files or decrease --min-contig threshold.

Command wrapper:
/opt/conda/envs/nf-core-mag-1.0.0/lib/python3.6/site-packages/quast-5.0.2-py3.6.egg-info/scripts/metaquast.py --threads 1 --rna-finding --max-ref-number 0 -l MEGAHIT-LIB050.A0105.SG1.1 LIB050.A0105.
SG1.1.contigs.fa -o LIB050.A0105.SG1.1_QC

Version: 5.0.2
System information: [2/12529]
OS: Linux-4.4.0-38-generic-x86_64-with-debian-9.9 (linux_64)
Python version: 3.6.7
CPUs number: 64

Started: 2020-02-19 13:07:10

Logging to LIB050.A0105.SG1.1_QC/metaquast.log

Contigs:
Pre-processing...
WARNING: Skipping MEGAHIT-LIB050.A0105.SG1.1 because it doesn't contain contigs >= 0 bp.

ERROR! None of the assembly files contains correct contigs. Please, provide different files or decrease --min-contig threshold.

Work dir:
/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work/3f/ff67506920c98fbed4f39ec6eda583

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run
-- Check '.nextflow.log' file for details

Metabat fails when running with multiple input files.

When I run the pipeline on multiple samples, the metabat2 step fails for me with the following error message:

 [Error!] the order of contigs in abundance file is not the same as the assembly file: k119_0

The pipeline runs fine when I include only a single sample in the input directory. This might be related to #27.

Command executed

./main.nf --reads "/home/sturm/projects/2020/metagenomics_test/test_data/*_R[1,2].fastq.gz" -profile singularity --skip_spades

Input data

I ran the pipeline on publicly available metagenomics samples from the ibdmdb project. The fastq files can be downloaded here: https://ibdmdb.org/tunnel/public/HMP2/WGS/1818/rawfiles. For testing, I ran the pipeline on the first 10 samples listed in the web portal.

Full log

N E X T F L O W  ~  version 19.10.0                                                                                                                                                                                                                                               
Launching `./main.nf` [focused_kalam] - revision: 91f2d5ee09                                                                                                                                                                                                                      
WARN: Access to undefined parameter `readPaths` -- Initialise it to a default value eg. `params.readPaths = some_value`                                                                                                                                                           
----------------------------------------------------
                                        ,--./,-.                                                                                                                                                                                                                                  
        ___     __   __   __   ___     /,-._.--~'                                                                                                                                                                                                                                 
  |\ | |__  __ /  ` /  \ |__) |__         }  {                                                                                                                                                                                                                                    
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,                                                                                                                                                                                                                                 
                                        `._,._,'                                                                                                                                                                                                                                  
  nf-core/mag v1.0.0                                                                                                                                                                                                                                                              
----------------------------------------------------                                                                                                                                                                                                                              
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`                                                                                                                                                                   
Run Name          : focused_kalam                                                                                                                                                                                                                                                 
Reads             : /home/sturm/projects/2020/metagenomics_test/test_data/*_R[1,2].fastq.gz                                                                                                                                                                                       
Fasta Ref         : null                                                                                                                                                                                                                                                          
Data Type         : Paired-End                                                                                                                                                                                                                                                    
Busco Reference   : https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz                                                                                                                                                                                              
Max Resources     : 128 GB memory, 16 cpus, 10d time per job                                                                                                                                                                                                                      
Container         : singularity - nfcore/mag:1.0.0                                                                                                                                                                                                                                
Output dir        : ./results                                                                                                                                                                                                                                                     
Launch dir        : /home/sturm/projects/2020/metagenomics_test/mag                                                                                                                                                                                                               
Working dir       : /data/scratch/sturm/scratch/test_metagenomics                                                                                                                                                                                                                 
Script dir        : /home/sturm/projects/2020/metagenomics_test/mag                                                                                                                                                                                                               
User              : sturm                                                                                                                                                                                                                                                         
Config Profile    : singularity                                                                                                                                                                                                                                          
----------------------------------------------------
executor >  sge (17)                                                                                                                                                                                                                                                              
[2a/f38182] process > get_software_versions                                         [100%] 1 of 1, cached: 1 ✔                                                                                                                                                                    
[-        ] process > porechop                                                      -                                                                                                                                                                                             
[-        ] process > nanolyse                                                      -                                                                                                                                                                                             
[-        ] process > filtlong                                                      -                                                                                                                                                                                             
[-        ] process > nanoplot                                                      -                                                                                                                                                                                             
[22/d10cd6] process > fastqc_raw (CSM5MCW6)                                         [100%] 10 of 10, cached: 10 ✔                                                                                                                                                                 
[a8/e30a96] process > fastp (CSM5MCXH)                                              [100%] 10 of 10, cached: 10 ✔                                                                                                                                                                 
[a8/7e6467] process > phix_download_db (GCA_002596845.1_ASM259684v1_genomic.fna.gz) [100%] 1 of 1, cached: 1 ✔                                                                                                                                                                    
[b5/bfc0ac] process > remove_phix (CSM5MCXD)                                        [100%] 10 of 10, cached: 10 ✔                                                                                                                                                                 
[97/5edf95] process > fastqc_trimmed (CSM5MCXD)                                     [100%] 10 of 10, cached: 10 ✔                                                                                                                                                                 
[-        ] process > centrifuge_db_preparation                                     -                                                                                                                                                                                             
[-        ] process > centrifuge                                                    -                                                                                                                                                                                             
[-        ] process > kraken2_db_preparation                                        -                                                                                                                                                                                             
[-        ] process > kraken2                                                       -                                                                                                                                                                                             
[-        ] process > krona_db                                                      -                                                                                                                                                                                             
[-        ] process > krona                                                         -                                                                                                                                                                                             
[c3/6e32ef] process > megahit (CSM5MCX3)                                            [100%] 10 of 10, cached: 10 ✔                                                                                                                                                                 
[-        ] process > spadeshybrid                                                  -                                                                                                                                                                                             
[-        ] process > spades                                                        -                                                                                                                                                                                             
[a7/eb6fb5] process > quast (MEGAHIT-CSM5MCXH)                                      [100%] 10 of 10, cached: 10 ✔                                                                                                                                                                 
[ac/92db93] process > bowtie2 (MEGAHIT-CSM5MCXJ)                                    [100%] 100 of 100, cached: 100 ✔                                                                                                                                                              
[70/15a36b] process > metabat (MEGAHIT-CSM5MCX3)                                    [100%] 10 of 10, failed: 3 ✘                                                                                                                                                                  
[76/f4dd2b] process > busco_download_db (bacteria_odb9.tar)                         [100%] 1 of 1, cached: 1 ✔                                                                                                                                                                    
[25/932b5f] process > busco (MEGAHIT-CSM5MCW6.2.fa)                                 [100%] 6 of 6                                                                                                                                                                                 
[-        ] process > busco_plot                                                    -                                                                                                                                                                                             
[b8/cc0d5f] process > quast_bins (MEGAHIT-CSM5MCW6)                                 [100%] 1 of 1                                                                                                                                                                                 
[-        ] process > merge_quast_and_busco                                         -                                                                                                                                                                                             
[-        ] process > cat_db                                                        -                                                                                                                                                                                             
[-        ] process > cat                                                           -                                                                                                                                                                                             
[-        ] process > multiqc                                                       -                                                                                                                                                                                             
[13/073856] process > output_documentation (1)                                      [100%] 1 of 1, cached: 1 ✔                                                                                                                                                                    
[nf-core/mag] Pipeline completed with errors
Error executing process > 'metabat (MEGAHIT-CSM5MCWQ)'

Caused by:
  Process `metabat (MEGAHIT-CSM5MCWQ)` terminated with an error exit status (1)

Command executed:

  jgi_summarize_bam_contig_depths --outputDepth depth.txt MEGAHIT-CSM5MCWQ-CSM5MCXJ.bam MEGAHIT-CSM5MCWQ-CSM5MCX3.bam MEGAHIT-CSM5MCWQ-CSM5MCW6.bam MEGAHIT-CSM5MCWQ-CSM5MCWQ.bam MEGAHIT-CSM5MCWQ-CSM5MCXN.bam MEGAHIT-CSM5MCWQ-CSM5MCXL.bam MEGAHIT-CSM5MCWQ-CSM5FZ4M.bam MEGAHI
T-CSM5MCWQ-CSM5MCUO.bam MEGAHIT-CSM5MCWQ-CSM5MCXH.bam MEGAHIT-CSM5MCWQ-CSM5MCXD.bam
  metabat2 -t "8" -i "CSM5MCXL.contigs.fa" -a depth.txt -o "MetaBAT2/MEGAHIT-CSM5MCWQ" -m 1500
  
  #if bin folder is empty
  if [ -z "$(ls -A MetaBAT2)" ]; then
      cp CSM5MCXL.contigs.fa MetaBAT2/MEGAHIT-CSM5MCXL.contigs.fa
  fi

Command exit status:
  1

Command output:
  MetaBAT 2 (v2.13 (Bioconda)) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200. 

Command error:
  Output depth matrix to depth.txt
  jgi_summarize_bam_contig_depths 2.13 (Bioconda) 2019-06-11T06:53:12
  Output matrix to depth.txt
  0: Opening bam: MEGAHIT-CSM5MCWQ-CSM5MCXJ.bam
  93: Opening bam: MEGAHIT-CSM5MCWQ-CSM5MCWQ.bam4: Opening bam: 
  MEGAHIT-CSM5MCWQ-CSM5MCXN.bam7: Opening bam: 
  : Opening bam: MEGAHIT-CSM5MCWQ-CSM5MCXD.bam
  MEGAHIT-CSM5MCWQ-CSM5MCUO.bam
  2: Opening bam: 5MEGAHIT-CSM5MCWQ-CSM5MCW6.bam: Opening bam: MEGAHIT-CSM5MCWQ-CSM5MCXL.bam
  
  1: Opening bam: MEGAHIT-CSM5MCWQ-CSM5MCX3.bam
  86: Opening bam: MEGAHIT-CSM5MCWQ-CSM5FZ4M.bam
  : Opening bam: MEGAHIT-CSM5MCWQ-CSM5MCXH.bam
  Processing bam files
  Thread 2 finished: MEGAHIT-CSM5MCWQ-CSM5MCW6.bam with 18135298 reads and 9888770 readsWellMapped
  Thread 9 finished: MEGAHIT-CSM5MCWQ-CSM5MCXD.bam with 22283812 reads and 6970361 readsWellMapped
  Thread 0 finished: MEGAHIT-CSM5MCWQ-CSM5MCXJ.bam with 22264616 reads and 8236430 readsWellMapped
  Thread 8 finished: MEGAHIT-CSM5MCWQ-CSM5MCXH.bam with 24528496 reads and 10391605 readsWellMapped
  Thread 4 finished: MEGAHIT-CSM5MCWQ-CSM5MCXN.bam with 23731976 reads and 7460350 readsWellMapped
  Thread 5 finished: MEGAHIT-CSM5MCWQ-CSM5MCXL.bam with 28607918 reads and 9303132 readsWellMapped
  Thread 1 finished: MEGAHIT-CSM5MCWQ-CSM5MCX3.bam with 26832004 reads and 6496630 readsWellMapped
  Thread 7 finished: MEGAHIT-CSM5MCWQ-CSM5MCUO.bam with 31554136 reads and 7481572 readsWellMapped
  Thread 3 finished: MEGAHIT-CSM5MCWQ-CSM5MCWQ.bam with 17952520 reads and 15715720 readsWellMapped
  Thread 6 finished: MEGAHIT-CSM5MCWQ-CSM5FZ4M.bam with 28383496 reads and 25071349 readsWellMapped
  Creating depth matrix file: depth.txt
  Closing most bam files
  Closing last bam file
  Finished
  [Error!] the order of contigs in abundance file is not the same as the assembly file: k119_0

Work dir:
  /data/scratch/sturm/scratch/test_metagenomics/be/42009fbbe39e43da5e595006558f28

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
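One detail worth noting from the log above: the depth file is computed from the MEGAHIT-CSM5MCWQ-*.bam files, while metabat2 receives CSM5MCXL.contigs.fa, so the abundance and assembly inputs may simply have been paired from different samples. If the contig sets do in fact match and only the row order differs, a minimal reordering of depth.txt to follow the assembly's contig order might look like the following (a hypothetical workaround sketch, not the pipeline's own fix):

```
{
  head -n 1 depth.txt                                    # keep the header row
  awk 'NR==FNR { if (FNR > 1) d[$1] = $0; next }         # store depth rows by contig name
       /^>/    { n = substr($1, 2); if (n in d) print d[n] }' \
      depth.txt CSM5MCXL.contigs.fa
} > depth.sorted.txt
```

metabat2 could then be re-run with `-a depth.sorted.txt`. If the contig sets differ entirely, the channel pairing is the real problem and no reordering will help.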

Run design and implications for bowtie2 and MetaBAT2 steps

This is more a question than an issue. In previous conversations on Slack, @skrakau helped me understand the rationale of the bowtie2 step (thank you @skrakau!). In particular, she referred me to #21 and explained why mag runs bowtie2 for all samples against all assemblies (i.e. "it is required by MetaBAT for binning the reads from each sample to be aligned in separate BAM files, which is important to include the information from library specific abundances").
I am analyzing 43 samples and I wonder whether the design of the analysis run has implications for the bowtie2 step too. In this case we have 43 * 43 = 1849 alignments, times 2 because I am comparing metaSPAdes vs. MEGAHIT as well, i.e. 3698 jobs. But not all of the 43 samples come from the same environment and geographical location: they come from 4 very different environments, and most come from different geographical locations. I am therefore not sure an all-vs-all alignment is appropriate in this case. Maybe I should have separated my samples by environment type; see the sketch below. What do you think? Thank you for your time and dedication to this project.
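One hedged way to express "separate by environment" without splitting the run: give each environment its own group ID and restrict co-abundance mapping to within-group samples via `--binning_map_mode group` (the parameter described in the docs). The samplesheet column layout below is illustrative only:

```
# Each environment gets its own group; mapping then happens within groups
# rather than all-vs-all (illustrative samplesheet, column names assumed):
cat > samplesheet.csv <<'EOF'
sample,group,short_reads_1,short_reads_2,long_reads
soil_1,soil,soil_1_R1.fastq.gz,soil_1_R2.fastq.gz,
soil_2,soil,soil_2_R1.fastq.gz,soil_2_R2.fastq.gz,
marine_1,marine,marine_1_R1.fastq.gz,marine_1_R2.fastq.gz,
EOF

nextflow run nf-core/mag \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --binning_map_mode group
```

With group-wise mapping, the number of bowtie2 jobs scales with the sizes of the groups rather than with 43 * 43.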

I'm having an error with nf-core/mag

Hi,
I'm trying to run the nf-core/mag pipeline on an HPC cluster. When I ran the test profile

nextflow run nf-core/mag -profile test

it gave the following error. I would greatly appreciate your help.

[68/ce5ef3] NOTE: Process phix_download_db (GCA_002596845.1_ASM259684v1_genomic.fna.gz) terminated with an error exit status (139) -- Execution is retried (1)
Error executing process > 'get_busco_version'

Caused by:
Process `get_busco_version` terminated with an error exit status (127)

Command executed:

busco --version > v_busco.txt

Command exit status:
127

Command output:
(empty)

Command error:
.command.sh: line 2: busco: command not found
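For what it's worth, exit status 127 ("command not found") usually means no software-provisioning profile was active, so busco was never available in the job environment. A minimal sketch, assuming Docker (or another container engine) is available on the cluster:

```
# The test profile supplies only test data and parameters; the tools still
# need to come from a container or conda profile, e.g.:
nextflow run nf-core/mag -profile test,docker
```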

README.md -> URI typo

I found a minor problem in the README while setting up nf-core/mag for offline use:

In the cat_db section, this URI does not work for me (and if the * was intended, it causes a Markdown formatting change):

Database for taxonomic classification of metagenome assembled genomes (default: none). E.g. "<tbb.bio.uu.nl/bastiaan/CAT*prepare/CAT_prepare_20190108.tar.gz>"

For me, this URI worked:

Database for taxonomic classification of metagenome assembled genomes (default: none). E.g. "<http://tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20190108.tar.gz>"
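For reference, a hedged usage sketch with the working URL, assuming the `--cat_db` parameter as named in the parameter docs:

```
# Hypothetical invocation using the corrected URL with --cat_db
# (input/outdir and other required options omitted for brevity):
nextflow run nf-core/mag \
    -profile docker \
    --cat_db "http://tbb.bio.uu.nl/bastiaan/CAT_prepare/CAT_prepare_20190108.tar.gz"
```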

`output.md` - incorrect file descriptions

The following output descriptions need to be checked and corrected:

  • Assembly/[assembler]/[sample]_QC/: contains undocumented files from QUAST
  • Assembly/[assembler]/[sample]_QC/predicted_genes/: contains a wrong file description
  • GenomeBinning/[assembler]-[sample/group]-depth.txt.gz is actually [assembler]-[bin]-depth.txt.gz (adjust the name to match the docs?)
  • GenomeBinning/QC/QUAST: undocumented files
  • GenomeBinning/QC/BUSCO/: undocumented files
  • Assembly/[assembler]/[sample/group]_QC/MEGAHIT-[sample/group]-[sampleToMap].bowtie2.log: check the description

run won't resume since kraken2 database file path changed

Hi, I ran into the same error as issue #32 (MetaBAT fails when running with multiple input files) and let it sit for a few days before trying to resume the run. During that time the Kraken2 developers released a new database and moved the old one into a new folder.

I tried to resume my run to see if it would move past the MetaBAT error, but it gave an error because it couldn't find the database file. So I tried to resume the run using the new database path, but that also gave an error. I've pasted both errors below. Is there a way around this, or will I need to start a new run?

Thanks,
Irina

This is the first error about the database (with the original database file path):

$ nextflow run nf-core/mag --reads '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/input/*.R{1,2}.fastq.gz' -profile shh --kraken2_db 'ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz' --outdir '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output' -w '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work' -resume cmc_assembly
N E X T F L O W  ~  version 19.04.0
Launching `nf-core/mag` [tiny_ekeblad] - revision: 4c2f61cbbb [master]
WARN: It appears you have never run this project before -- Option `-resume` is ignored
WARN: Access to undefined parameter `readPaths` -- Initialise it to a default value eg. `params.readPaths = some_value`
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`
Pipeline Release  : master
Run Name          : tiny_ekeblad
Reads             : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/input/*.R{1,2}.fastq.gz
Fasta Ref         : null
Data Type         : Paired-End
Kraken2 Db        : ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz
Busco Reference   : https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz
Max Resources     : 256 GB memory, 32 cpus, 24d 20h 31m 24s time per job
Container         : singularity - nfcore/mag:1.0.0
Output dir        : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output
Launch dir        : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/aadder/output/Nov2018acc
Working dir       : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work
Script dir        : /projects1/clusterhomes/velsko/.nextflow/assets/nf-core/mag
User              : velsko
Config Profile    : shh
Config Description: Generic MPI-SHH cluster(s) profile provided by nf-core/configs.
Config Contact    : James Fellows Yates (@jfy133), Maxime Borry (@Maxibor)
Config URL        : https://shh.mpg.de
executor >  slurm (77)
[64/f2cbd8] process > phix_download_db       [100%] 1 of 1 ✔
[7a/3ac69f] process > fastp                  [100%] 39 of 39
[de/73cec0] process > fastqc_raw             [100%] 36 of 36
[f0/82d032] process > get_software_versions  [100%] 1 of 1 ✔
[8a/3b3835] process > kraken2_db_preparation [  0%] 1 of 0, failed: 1
ERROR ~ Error executing process > 'kraken2_db_preparation (1)'

Caused by:
  Can't stage file ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz -- reason: pub/data/kraken2_dbs/minikraken2_v2_8GB_201904_UPDATE.tgz

Source block:
  """
  tar -xf "${db}"
  """

Work dir:
  /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work/8a/3b383522700f7823f76790e4b8bccd

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details


and this is the second error about the database (with the current path to the database file):

$ nextflow run nf-core/mag --reads '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/input/*.R{1,2}.fastq.gz' -profile shh --kraken2_db 'ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz' --outdir '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output' -w '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work' -resume cmc_assembly
N E X T F L O W  ~  version 19.04.0
Launching `nf-core/mag` [reverent_dubinsky] - revision: 4c2f61cbbb [master]
WARN: Access to undefined parameter `readPaths` -- Initialise it to a default value eg. `params.readPaths = some_value`
WARN: Access to undefined parameter `fasta` -- Initialise it to a default value eg. `params.fasta = some_value`
Pipeline Release  : master
Run Name          : reverent_dubinsky
Reads             : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/input/*.R{1,2}.fastq.gz
Fasta Ref         : null
Data Type         : Paired-End
Kraken2 Db        : ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz
Busco Reference   : https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz
Max Resources     : 256 GB memory, 32 cpus, 24d 20h 31m 24s time per job
Container         : singularity - nfcore/mag:1.0.0
Output dir        : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output
Launch dir        : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/aadder/output/Nov2018acc
Working dir       : /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work
Script dir        : /projects1/clusterhomes/velsko/.nextflow/assets/nf-core/mag
User              : velsko
Config Profile    : shh
Config Description: Generic MPI-SHH cluster(s) profile provided by nf-core/configs.
Config Contact    : James Fellows Yates (@jfy133), Maxime Borry (@Maxibor)
Config URL        : https://shh.mpg.de
executor >  slurm (2)
[c0/ad731a] process > fastqc_raw             [100%] 38 of 38, cached: 36, failed: 2
[64/465396] process > fastp                  [100%] 39 of 39, cached: 39
[f0/82d032] process > get_software_versions  [100%] 1 of 1, cached: 1 ✔
[64/f2cbd8] process > phix_download_db       [100%] 1 of 1, cached: 1 ✔
[a5/7d8194] process > kraken2_db_preparation [  0%] 1 of 0, failed: 1
Staging foreign file: ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz
Execution cancelled -- Finishing pending tasks before exit
WARN: Unable to stage foreign file: ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz (try 1) -- Cause: pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz
WARN: Unable to stage foreign file: ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz (try 2) -- Cause: pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz
WARN: Unable to stage foreign file: ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz (try 3) -- Cause: pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz
WARN: Killing pending tasks (2)
ERROR ~ Error executing process > 'kraken2_db_preparation (1)'

Caused by:
  Can't stage file ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz -- reason: pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz

Source block:
  """
  tar -xf "${db}"
  """

Work dir:
  /projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work/a5/7d8194ef6a450faa240eac1779a5ca

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`

 -- Check '.nextflow.log' file for details
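Until this is resolved, one hedged workaround might be to stage the archive manually, since it is Nextflow's FTP staging step that fails here, and then point `--kraken2_db` at the local file (remaining options as in the original command):

```
# Fetch the database outside of Nextflow, then pass the local path.
# Note: changing the --kraken2_db value will invalidate the cache only
# for the database-dependent processes.
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/old/minikraken2_v2_8GB_201904_UPDATE.tgz
nextflow run nf-core/mag \
    --reads '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/input/*.R{1,2}.fastq.gz' \
    -profile shh \
    --kraken2_db "$PWD/minikraken2_v2_8GB_201904_UPDATE.tgz" \
    --outdir '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output' \
    -w '/projects1/microbiome_calculus/Cameroon_plaque/04-analysis/assembly/output/work' \
    -resume cmc_assembly
```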

running mag with locally installed Kraken2 database

I was trying to run mag with a locally installed Kraken2 database, as opposed to the compressed minikraken archive. But from a Slack discussion I learned that mag is currently configured to process only the compressed archive. I believe this feature would be very helpful for other users as well.

Thanks
Rama
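A minimal sketch of how the database-preparation step could branch (hypothetical, not the pipeline's current behaviour) on whether it receives a `.tgz` archive or an already-extracted directory:

```
# Hypothetical db-preparation shell: accept either an archive or a directory.
db="$1"
if [ -d "$db" ]; then
    ln -s "$db" database           # local directory: use it directly
else
    mkdir database
    tar -xf "$db" -C database     # archive: extract, as the pipeline does now
fi
```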

Renaming of FASTQ (un-gzipped) input files fails

Hi team!

I experienced a problem with the pipeline when using plain FASTQ files (not .fastq.gz) as input, see:

https://github.com/nf-core/mag/blob/master/main.nf#L480

The problem is that, regardless of the input file type, the file will be symlinked via

ln -s "${reads[0]}" "${name}_R1.fastq.gz"

So basically, the pipeline currently does not work with uncompressed FASTQ input; it should check whether a file is gzipped and handle it appropriately, along the lines of the sketch below.
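A hedged sketch of what that check could look like inside the script block, using the same `${reads[0]}` / `${name}` variables as the line linked above:

```
# Only symlink when the input is already gzipped; otherwise compress it
# into the expected file name (sketch, not the pipeline's current code).
if [[ "${reads[0]}" == *.gz ]]; then
    ln -s "${reads[0]}" "${name}_R1.fastq.gz"
else
    gzip -c "${reads[0]}" > "${name}_R1.fastq.gz"
fi
```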

By the way: why perform this rename_short_read_fastqs step at all? For the final documentation?

Adding Unicycler as assembly choice for low diversity communities

Hi there,

I'd like to propose adding Unicycler, which is also used in nf-core/bacass. It is primarily specialized for single genomes but has also been used successfully with low-diversity communities (personal communication, Daniel Huson). Unicycler can use Illumina and Nanopore data and perform hybrid assemblies (see the example below), and it utilizes SPAdes, which is already part of this pipeline.
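For illustration, a typical Unicycler hybrid-assembly invocation (standard Unicycler options; file names are placeholders):

```
# Short reads (-1/-2) plus Nanopore long reads (-l) in one hybrid assembly:
unicycler -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
          -l sample_nanopore.fastq.gz \
          -o unicycler_out -t 8
```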

My reasoning for using mag with Unicycler rather than bacass is that bacass is not meant for genome mixtures, has no binning for example, and is unlikely to develop in that direction.

Would you be OK with having Unicycler added?
I would first open a PR adding Unicycler to the environment (to check that the container still works when merged), and then add Unicycler to main.nf in a second PR.
