nf-core / eager
A fully reproducible and state-of-the-art ancient DNA analysis pipeline
Home Page: https://nf-co.re/eager
License: MIT License
Describe the bug
When running the pipeline using -profile shh,singularity, the pipeline crashes at picard CreateSequenceDictionary with a 127 error. Singularity reports 'no such file or directory'.
From the 'Command executed' message, I think this is because this step does not take the path to the reference FASTA file into account and assumes the reference is present in the EAGER2 output directory (which I also assume is the execution directory) - but I'm not 100% sure.
To Reproduce
Pipeline info is attached.
Operating System:
Additional context
Please provide me with the following files:
pipeline_report.txt
nextflow.log
To at least catch conda-related errors early on in the pipeline.
Is your feature request related to a problem? Please describe.
At the moment the MultiQC results are in reverse order, which doesn't make sense in the natural way of reading and doing quality control.
Describe the solution you'd like
Provide and/or modify the default multiqc.config file that comes with the singularity/docker/conda 'images' to reflect the
order in which each module is run.
More context is provided here: https://multiqc.info/docs/#order-of-modules
Describe alternatives you've considered
Add to documentation to read in reverse
Example
I've not tested this, but something along the lines of?:
report_comment: >
    This report has been generated by the <a href="https://github.com/nf-core/eager" target="_blank">nf-core/eager</a>
    analysis pipeline. For information about how to interpret these results, please see the
    <a href="https://github.com/nf-core/eager" target="_blank">documentation</a>.
report_section_order:
    nf-core/eager-software-versions:
        order: -1000
    fastqc:
        after: 'nf-core/eager-software-versions'
    adapterRemoval:
        after: 'fastqc'
    Samtools:
        after: 'adapterRemoval'
    dedup:
        after: 'Samtools'
    qualimap:
        after: 'dedup'
    preseq:
        after: 'qualimap'
If you agree I will try it out and then do a pull request.
Describe the bug
The help message currently says that --genome is a mandatory argument, yet it requires an iGenomes reference name. This conflicts with the --fasta option, which is more likely to be used (at least in the context of aDNA studies, where we don't often study model organisms) and is equally a 'mandatory' argument.
To Reproduce
nextflow run nf-core/eager --help
Expected behavior
The mandatory arguments section should list --genome or --fasta as the two options for specifying the reference.
Screenshots
Current help message
N E X T F L O W ~ version 0.32.0
Launching `nf-core/eager` [thirsty_hawking] - revision: 0894028508 [master]
=========================================
eager v2.0dev
=========================================
Usage:
The typical command for running the pipeline is as follows:
nextflow run nf-core/eager --reads '*_R{1,2}.fastq.gz' -profile docker
Mandatory arguments:
--reads Path to input data (must be surrounded with quotes)
--genome Name of iGenomes reference
-profile Hardware config to use. docker / aws
Options:
--singleEnd Specifies that the input is single end reads
References If not specified in the configuration file or you wish to overwrite any of the references.
--fasta Path to Fasta reference
--bwa_index Path to BWA index
Other options:
--outdir The output directory where the results will be saved
--email Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits
-name Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
PMDTools should be available in EAGER2
The repository is currently lacking any keywords ("topics" in GitHub language).
There should be a clean folder structure, probably we will just remove the numbers there.
Is your feature request related to a problem? Please describe.
I think it is 'dangerous' to assume paired-end data and require a --singleend switch when mapping with single-end data.
As sad as it is, I know quite a lot of clever people doing bioinformatics/downstream analysis who rarely deal with raw sequencing data. When they do occasionally have to deal with it, they are often not familiar with sequencing configurations and may not understand why the pipeline is not working when submitting single-end data.
It also doesn't make sense to assume what the most common type of data is, which is not the role of the pipeline (IMO).
Describe the solution you'd like
I think it would be beneficial to require a --*end flag regardless of what you are submitting: i.e. you always require --singleend or --pairedend. Being explicit is a better default state for user-friendly pipelines that anyone can run.
Describe alternatives you've considered
The pipeline itself could do an automatic detection based on the naming scheme (as you already require a specific naming scheme with R1/R2). This would also help if you have multiple lanes.
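The auto-detection alternative could be sketched roughly like this (illustrative Python only, not pipeline code; the `_R1`/`_R2` filename pattern is an assumption based on the naming scheme mentioned above):

```python
import re
from collections import defaultdict

def detect_layout(fastq_names):
    """Guess single- vs paired-end from an R1/R2 naming scheme.
    Returns 'paired' only if every R1 file has a matching R2."""
    pairs = defaultdict(set)
    for name in fastq_names:
        m = re.search(r'_R([12])(?=[_.])', name)
        if not m:
            return 'single'  # no R1/R2 tag at all
        # the pair key is the file name with the R1/R2 tag removed
        key = name[:m.start()] + name[m.end():]
        pairs[key].add(m.group(1))
    return 'paired' if all(p == {'1', '2'} for p in pairs.values()) else 'single'

print(detect_layout(['sample_R1.fastq.gz', 'sample_R2.fastq.gz']))  # paired
print(detect_layout(['sample_R1.fastq.gz']))                        # single
```

Because the key strips only the R1/R2 tag, the same grouping would also collapse multiple lanes of the same library, as suggested above.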
Additional context
None.
Describe the bug
Currently it is possible to run the pipeline with just the basic mandatory options (reads input, reference, profile), but what happens when you do so doesn't currently seem to be described anywhere.
While a user could look through the output results, it would be good to document that this is actually possible and what is actually going on. This will make it clearer to novice users that one can customise the pipeline parameters and should not necessarily rely on the defaults.
Request
Could @apeltzer please describe somewhere these steps, and I can include this in the documentation somewhere.
If dedup is running, preseq could use the *.hist file created by it.
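The wiring could look something like this (a sketch only: it assumes preseq lc_extrap's -H histogram-input and -B BAM-input options; the output file name is illustrative):

```python
def preseq_cmd(dedup_hist=None, bam=None):
    """Build the preseq invocation, preferring DeDup's *.hist output
    over re-reading the sorted BAM when it is available."""
    if dedup_hist is not None:
        return ['preseq', 'lc_extrap', '-H', dedup_hist, '-o', 'complexity.txt']
    return ['preseq', 'lc_extrap', '-B', bam, '-o', 'complexity.txt']

print(' '.join(preseq_cmd(dedup_hist='sample.hist')))
```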
.... initialize with default value
Maybe we could even ask for 'joint' output as a single file, with the addition of 'M_' prefixes to read names, to get rid of adapterremovalfixprefix?
> nextflow run nf-core/eager -with-singularity /apps/containers/nf-core-eager.simg --pairedEnd --reads "/path/sample_{1,2}.fastq.gz" --trim_bam 3 --max_time 12.h --max_cpus 4 --max_memory 32G --snpcapture false --udg true --udg_type Half --bwamem --genome ${ref} --saveReference true -name textEager3
N E X T F L O W ~ version 0.32.0
Launching `nf-core/eager` [textEager3] - revision: 897fca777a [master]
WARN: Access to undefined parameter `readPaths` -- Initialise it to a default value eg. `params.readPaths = some_value`
=========================================
nf-core/eager v2.0.0dev
=========================================
Pipeline Name : nf-core/eager
Pipeline Version: 2.0.0
Run Name : textEager3
Reads : /path/sample_{1,2}.fastq.gz
Fasta Ref : false
Data Type : Paired-End
Max Memory : 32G
Max CPUs : 4
Max Time : 12.h
Output dir : ./results
Working dir : /fast/users/user/eager/work
Container Engine: singularity
Container : /apps/containers/nf-core-eager.simg
Current home : /home/user
Current user : user
Current path : /home/user/fastdir/eager
Script dir : /home/user/.nextflow/assets/nf-core/eager
Config Profile : standard
=========================================
[warm up] executor > SLURM
[c5/898ed5] Submitted process > get_software_versions
[e1/cadfbc] Submitted process > fastqc (sample)
[06/d3a688] Submitted process > adapter_removal (sample)
[7f/7a6a9a] Submitted process > output_documentation
ERROR ~ Error executing process > 'get_software_versions'
Caused by:
Process `get_software_versions` terminated with an error exit status (127)
Command executed:
echo 2.0.0 &> v_pipeline.txt
echo 0.32.0 &> v_nextflow.txt
fastqc --version &> v_fastqc.txt 2>&1 || true
multiqc --version &> v_multiqc.txt 2>&1 || true
bwa &> v_bwa.txt 2>&1 || true
samtools --version &> v_samtools.txt 2>&1 || true
AdapterRemoval -version &> v_adapterremoval.txt 2>&1 || true
picard MarkDuplicates --version &> v_markduplicates.txt 2>&1 || true
dedup -v &> v_dedup.txt 2>&1 || true
preseq &> v_preseq.txt 2>&1 || true
gatk BaseRecalibrator --version 2>&1 | grep Version: > v_gatk.txt 2>&1 || true
vcf2genome &> v_vcf2genome.txt 2>&1 || true
fastp --version &> v_fastp.txt 2>&1 || true
bam --version &> v_bamutil.txt 2>&1 || true
qualimap --version &> v_qualimap.txt 2>&1 || true
scrape_software_versions.py &> software_versions_mqc.yaml
Command exit status:
127
Command output:
(empty)
Command error:
env: singularity: No such file or directory
Work dir:
/fast/users/user/eager/work/c5/898ed52fe4a02f48caaf2f0130963b
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
-- Check '.nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit
[e1/cadfbc] NOTE: Process `fastqc (sample)` terminated with an error exit status (127) -- Error is ignored
[nf-core/eager] Pipeline Complete
Originally posted by @yassineS in #67 (comment)
Hiya,
While I think the GUI is great for people with little to no CLI experience, it can be a bit cumbersome for power-users. One useful feature for experienced users would be the ability to create .xml files in the CLI. This would be especially useful for people who don't have visualization technology on their servers to load the GUI.
Thanks!
Raphael
Describe the bug
When using the following flag -profile standard,conda
the pipeline crashes at FastQC
with the exit status 127, and the following command error:
Command error:
java: symbol lookup error: /projects1/users/fellows/nextflow/eager2/my_test/work/conda/nf-core-eager-2.0.0-a95ee9548e4a04d99b07955eabaa0afe/jre/lib/amd64/../../../lib/libfontconfig.so.1: undefined symbol: FT_Done_MM_Var
Furthermore, one 'Tip' message provided is to check the .command.out
file in the work directory. However, this file doesn't exist. Note, the tip message seems to change when re-running, so it would be good to know if this is a random 'tip of the day' type message from Nextflow, and if so make that clearer.
To Reproduce
The command I used was:
nextflow run nf-core/eager --reads 'EXB015.A1701/*_R{1,2}*fastq.gz' --pairedEnd --fasta '~References/hg19_MT.fasta' -profile standard,conda --max_cpus 4 --max_memory 32G
(Sending input files privately)
Expected behavior
The pipeline should not crash, and the downloaded conda environment should contain the correct library contents.
Screenshots
NA
Operating System:
Ubuntu 14.04
Additional context
Please provide me with the following files:
.nextflow.log
results/pipeline_info/...
Currently setting --bwa_index doesn't do anything. We should read in files from the selected path and push these to the same input channel that the indexing process uses as well ...
Add an option to clip reads after damage profile generation. Ideally after mapping, to allow usage of damage patterns for schmutzi etc.
For metagenomic analysis one may want to strip out a certain genome (e.g. human DNA).
EAGER v1.x allows export of unmapped reads as BAM, which is often not accepted as input for other tools (e.g. taxonomic profilers or assemblers).
Consider making unmapped read export as fastq/fasta (via samtools).
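The proposed export step might be assembled like this (a sketch, not pipeline code: it assumes samtools fastq's -f flag-filtering behaviour, where 0x4 selects unmapped reads; the output name is illustrative):

```python
def unmapped_export_cmd(bam, out_prefix):
    """Build a shell command exporting only unmapped reads as fastq.
    samtools fastq writes to stdout by default, so we redirect it."""
    return f'samtools fastq -f 4 {bam} > {out_prefix}.unmapped.fastq'

print(unmapped_export_cmd('sample.sorted.bam', 'sample'))
```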
Is your feature request related to a problem? Please describe.
The flag --bam_keep_mapped_only may be misleading. While the help message says "Only consider mapped reads for downstream analysis. Unmapped reads are extracted to separate output.", the flag name may suggest to the user that by 'keeping' only mapped reads, unmapped reads would be discarded completely.
The flag --bam_filter_reads is also confusing. While the help message says "Keep all reads in BAM file for downstream analysis", the flag suggests something is being filtered (either specific reads retained, or reads filtered out).
Describe the solution you'd like
Perhaps rename the --bam_keep_mapped_only flag to something more specific, such as --bam_analyse_mapped_only.
Perhaps rename --bam_filter_reads to --bam_retain_all_reads. Optionally, change the function and name of the flag to e.g. --bam_discard_unmapped.
We should be able to run datasets without merging reads and just clipping etc. pp.
We should have bioconda recipes for every tool used in the pipeline, (some of them are already there). This allows us to create a single Dockerfile/Singularity container with just a couple of lines of code and directly link this to the version of the pipeline with a simple GitHub /Git tag in the future.
List of recipes already there (some pull requests of mine are pending for some of the tools required here):
https://github.com/bioconda/bioconda-recipes
ToDo List:
Describe the bug
The pipeline when ran with -r dev
flag and the conda
profile crashes at output_documentation after adapter removal with the following error:
Command error: /projects1/users/fellows/nextflow/eager2/my_test/work/conda/nf-core-eager-2.0.1-21a7e9f4e1525113a0a8843adce834c8/lib/R/bin/exec/R: error while loading shared libraries: libiconv.so.2: cannot open shared object file: No such file or directory
To Reproduce
run the same command and data in #67 .
Expected behaviour
The conda environment to download the a working R environment.
Additional context
Please provide me with the following files:
nextflow.log.2.txt
eager2_Rsharedlibrary_erorr.zip
Hm, ok. If you want to keep it the way it is, then we need to consider changing the description slightly:
--bam_discard_unmapped Discard an unmapped read file, depending on choice in --
Removing references to bam or fastq in the description makes it clearer that you are not trying to define the file type with this flag.
That said, I still don't think this makes complete sense/it is unnecessarily over-complicated.
In principle I think it would be simpler to just have a single --bam_discard_unmapped_bam.
Use cases would be, assuming someone wants the unmapped reads:
1. Does the user want unmapped reads in only BAM format? Yes: use --bam_separate_unmapped
2. Does the user want unmapped reads in both BAM and fastq? Do the above but with --bam_unmapped_to_fastq
3. Does the user want unmapped reads in only fastq format? Do both 1) and 2) with --bam_unmapped_discard_bam
I think this would also work programmatically. The current system in this commit has mixed messages, with one flag saying you want to discard something and then an entirely separate flag saying you also want to discard something, plus which one. The messages behind the flags sort of overlap.
Does this make sense? Or do you disagree?
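The three use cases could be wired up roughly like this (sketch only; the flag names are the hypothetical ones proposed in this discussion, not implemented options):

```python
def unmapped_flags(want_bam, want_fastq):
    """Map desired unmapped-read outputs onto flag combinations."""
    if not (want_bam or want_fastq):
        return []  # unmapped reads not wanted at all
    flags = ['--bam_separate_unmapped']
    if want_fastq:
        flags.append('--bam_unmapped_to_fastq')
        if not want_bam:
            flags.append('--bam_unmapped_discard_bam')
    return flags

print(unmapped_flags(want_bam=False, want_fastq=True))
```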
Currently there are two flags (--udg and --udg_type) related to UDG treatment for downstream PMD/BamUtil trimming.
However, they functionally do pretty much the same thing, with the latter being more detailed, so they can be condensed into one.
It might also be worth modifying the flag name to --pmd_udg_type
so a novice user doesn't add their UDG treatment when they don't in fact want PMD processing.
In our group we've noticed that we regularly get lots of poly-G reads from NextSeq data which don't get discarded by the sequencer or demultiplexer. These can mess up some downstream statistics if not thrown out.
Maybe we could consider having some form of complexity filter as a module to remove low-complexity reads?
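One common complexity metric (the fraction of bases that differ from the next base, the style used by fastp-like filters) would catch poly-G runs; this is an illustrative sketch with an assumed threshold, not a tool recommendation:

```python
def complexity(seq):
    """Fraction of positions whose base differs from the next base."""
    if len(seq) < 2:
        return 0.0
    return sum(a != b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

def keep_read(seq, threshold=0.3):
    """Keep reads above an assumed minimum complexity threshold."""
    return complexity(seq) >= threshold

print(keep_read('G' * 50))        # False: a poly-G artefact read
print(keep_read('ACGTACGTACGT'))  # True: normal-complexity read
```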
Could be worth trying
splitBams -> genotype/pseudo-call/genotype likelihoods -> stitchVCFs together
I'd like to have a better and improved report at the end, summarizing certain important metrics in a smart way, e.g. (but not limited to) the following. Naming is something I'd not like to standardize too much, since different labs tend to have different names for certain metrics.
...
Could you please add things that are of specific interest to you?
These here are on my ToDo list:
Schmutzi / estimates for contamination
....
Any ideas what you'd like to see here @EisenRa @jfy133 @JudithNeukamm ?
If the schmutzi opt files are not specified on the command line, one could also offer to download these specifically in a single step...
For each configurable parameter, we should have documentation.
Maybe just add it in each step depending whether downstream things are turned on/off ?
One idea would be to have a simple GUI that fetches a pipeline revision from this GitHub repository (e.g. with -r) and then automatically offers the available params entries as configurable options in a dynamic GUI.
e.g. the GUI application just queries this GitHub repository and finds in main.nf some parameters to configure adapter clipping in more detail; thus we could provide a means to offer all these parameters in a GUI for end users in a dynamic way. Explanation of parameters could be done using a separate mapping file with <param.name, "GUI Name of param.name", "Short description">, which we can gradually update too.
This way, we will always have a functional simple GUI application for end users that are not willing to use the CLI only + have the possibility to create a working JSON object for a specific pipeline version in the future too.
Let me know what you think @jfy133!
Describe the bug
When running with the singularity profile on the branches/versions -r shh-profile and -r 2.0.2, the bwa module crashes at the samtools index step on the sorted BAM file.
It appears there may be a misplaced -@ option, which is not applicable to samtools index.
Samtools standard out/error is below:
[main] CMD: bwa samse -r @RG\tID:ILLUMINA-ABM006.A0101_S0_L002_R1_001\tSM:ABM006.A0101_S0_L002_R1_001\tPL:illumina hg19_MT.fasta ABM006.A0101_S0_L002_R1_001.combined.prefixed.fq.sai ABM006.A0101_S0_L002_R1_001.combined.prefixed.fq.gz
[main] Real time: 7.227 sec; CPU: 6.184 sec
index: invalid option -- '@'
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14]
To Reproduce
Run the following command (replace paired end data, index file as required):
nextflow run nf-core/eager \
--reads '/projects1/users/fellows/nextflow/eager2/my_test/data/ABM006.A0101/*_R{1,2}*fastq.gz' \
--pairedEnd \
--fasta '/projects1/users/fellows/nextflow/eager2/my_test/references/hg19_MT.fasta' \
--outdir '/projects1/users/fellows/nextflow/eager2/my_test/output_eager2' \
-profile singularity \
--max_cpus 4 \
--max_memory 16G \
-r 2.0.2
Additional context
Please provide me with the following files:
.nextflow.log
results/pipeline_info/...
pipeline_report.txt
Missing parts:
Handling SE data
FixPrefix for AR - when to combine which files?
merging vs only clipping for both
The EAGER 2.0 pipeline will have fully automatic consistency tests
I'll create test reference genomes a couple of KB in size to test things quickly and then
Might have a think about using sambamba to do sam/bam/cram operations like what I did in LncPipe
for sam->bam conversion
https://github.com/likelet/LncPipe/blob/master/LncRNAanalysisPipe.nf#L592
and bam sort
https://github.com/likelet/LncPipe/blob/master/LncRNAanalysisPipe.nf#L593
hope it helps
To continue with aDNA specific tools, we could consider adding 'GenConS', which is a part of the TOPAS package - https://github.com/subwaystation/TOPAS
This allows you to generate a consensus sequence but with a punishment score on possible C to T deamination lesions.
Is your feature request related to a problem? Please describe.
Eukaryotic genomes can sometimes be large and can take up a lot of space when in uncompressed formats.
We could consider accepting .fasta.gz files alongside plain FASTA files.
For indexing this would require a decompression, indexing, then re-compression step for the FASTA itself - but indices are often smaller, so I guess they wouldn't require compressing.
The only potential issue I see here is how the pipeline deals with multiple runs trying to access the same file at once.
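The decompress-before-indexing step might look something like this (a hypothetical helper, not pipeline code; note the naive existence check deliberately does NOT solve the concurrent-access issue raised above):

```python
import gzip
import os
import shutil

def ensure_plain_fasta(path):
    """Accept either .fasta or .fasta.gz reference input by decompressing
    gzipped references to a working copy before indexing."""
    if not path.endswith('.gz'):
        return path
    plain = path[:-3]
    if not os.path.exists(plain):  # naive guard, racy under concurrent runs
        with gzip.open(path, 'rb') as src, open(plain, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return plain
```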
lcMLkin, pmr, READ
The former way to do things in EAGER 1.x was to use GATK to call variants on the preprocessed/filtered BAM files and then use that to recreate e.g. a consensus FASTA for small genomes and/or create a VCF for downstream tools.
Nowadays, however, there are tools out there that can be used for downstream genotyping and are aware of ancient DNA damage etc., for example snpAD and IIRC angsd and sequenceTools, which I'd rather rely on, as they are specifically designed for aDNA usage.
The learning curve for these is okay-ish, as I think only basic functionality, for example solely output for downstream analysis tools, is required.
My plan for now is to incorporate some of the functionality of:
Additionally, I'd love to incorporate:
These changes are planned features for V2.1 of the pipeline, 2.0 will "just" provide functionality for preprocessing, QC and mapping using BWA for now.
We should have support for snpAD, additionally incorporating mapability tracks if possible.
as it's going to be deprecated soon:
nf-core/methylseq#27
Title is pretty self explanatory. (@jfy133 asked for this)
Results directory could be along the lines of
eager_out
--| reference_1
----| sample_1
----| sample_2
----| sample_3
--| reference_2
----| sample_1
----| sample_2
----| sample_3
--| reference_3
----| sample_1
----| sample_2
----| sample_3
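The proposed layout could be created with a small helper along these lines (illustrative sketch only; directory names follow the tree above):

```python
import os

def make_results_tree(base, references, samples):
    """Create the per-reference / per-sample results layout sketched above
    and return the top-level (reference) directory names."""
    for ref in references:
        for sample in samples:
            os.makedirs(os.path.join(base, ref, sample), exist_ok=True)
    return sorted(os.listdir(base))
```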
We could think about splitting BAMs, as DeDup/MarkDuplicates normally takes quite some time, and use a file.size > 2GB check (or similar operator) to speed things up significantly. A subsequent merge would be a matter of minutes, automatically creating the same output for downstream analysis as before.
Is your feature request related to a problem? Please describe.
Sometimes we receive/download only BAM files (e.g. see the Slon et al. 2017 data on ENA) that we wish to remap/reprocess in a different context. For example, if the data is metagenomic, we may want to map to a particular bacterial genome. Currently we would have to manually re-convert the BAM to FASTQ, which leads to unnecessary data redundancy.
Describe the solution you'd like
An option to provide a BAM as input, rather than FASTQ. One solution would be a -bam flag, which would (turn off a clip/merge module? and then) pipe stdout from samtools fastq into the mapper itself.
Describe alternatives you've considered
One could just manually re-convert the BAM.
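The proposed streaming mode might be assembled like this (sketch only: bwa mem and the output name are assumptions for illustration, and "-" tells the mapper to read from stdin):

```python
def bam_input_cmd(bam, reference):
    """Build a shell pipeline streaming reads out of a BAM with
    samtools fastq and straight into the mapper, avoiding an
    intermediate FASTQ file on disk."""
    return f'samtools fastq {bam} | bwa mem {reference} - > remapped.sam'

print(bam_input_cmd('slon2017.bam', 'target_genome.fasta'))
```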
In some metagenomic contexts, there can be closely related species in a sample that can make read mapping to a single reference genome difficult (e.g. cross-mapping of reads between species). In this situation, it can be useful to employ competitive mapping, whereby reference genomes from closely related species are concatenated (in a multifasta), and the reads mapped to this reference. This can allow for the mapping quality filter to filter out reads that would cross-map between species.
EAGER could allow users to select a folder, or a list of reference fastas and concatenate them into a multifasta prior to read mapping. Alternatively, the user could provide their pre-concatenated multifasta file.
Regarding the output of mapping stats, the concatenated BAM file would have to be split using bamtools prior to generating stats.
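The concatenation step itself is straightforward; a minimal sketch (assuming well-formed FASTA inputs, with a guard for a missing trailing newline so records are not fused):

```python
def concat_references(fasta_paths, out_path):
    """Concatenate reference FASTAs into a competitive-mapping multifasta."""
    with open(out_path, 'w') as out:
        for path in fasta_paths:
            with open(path) as fasta:
                text = fasta.read()
                # ensure a record boundary between concatenated files
                out.write(text if text.endswith('\n') else text + '\n')
    return out_path
```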
In order to get synced with the nf-core template, this repo needs to be prepared accordingly as described in the nf-core documentation.