nf-core / eager
A fully reproducible and state-of-the-art ancient DNA analysis pipeline
Home Page: https://nf-co.re/eager
License: MIT License
Describe the bug
When running the pipeline using -profile shh,singularity, the pipeline crashes at picard CreateSequenceDictionary with a 127 error. Singularity reports 'no such file or directory'.
From the 'Command executed' message, I think this is because this step does not take the path to the reference FASTA file into account and assumes the reference is present in the EAGER2 output directory (which I also assume is the execution directory) - but I'm not 100% sure.
To Reproduce
Pipeline info is attached.
Operating System:
Additional context
Please provide me with the following files:
pipeline_report.txt
nextflow.log
To at least catch conda-related errors early on in the pipeline.
Is your feature request related to a problem? Please describe.
At the moment the MultiQC results are in reverse order, which doesn't make sense in the natural way of reading and doing quality control.
Describe the solution you'd like
Provide and/or modify the default multiqc.config file that comes with the singularity/docker/conda 'images' to reflect the
order in which each module is run.
More context is provided here: https://multiqc.info/docs/#order-of-modules
Describe alternatives you've considered
Add to documentation to read in reverse
Example
I've not tested this, but something along the lines of?:
report_comment: >
    This report has been generated by the <a href="https://github.com/nf-core/eager" target="_blank">nf-core/eager</a>
    analysis pipeline. For information about how to interpret these results, please see the
    <a href="https://github.com/nf-core/eager" target="_blank">documentation</a>.
report_section_order:
    nf-core/eager-software-versions:
        order: -1000
    fastqc:
        after: 'nf-core/eager-software-versions'
    adapterRemoval:
        after: 'fastqc'
    Samtools:
        after: 'adapterRemoval'
    dedup:
        after: 'Samtools'
    qualimap:
        after: 'dedup'
    preseq:
        after: 'qualimap'
If you agree I will try it out and then do a pull request.
Describe the bug
The help message currently says that --genome is a mandatory argument, yet it requires an iGenomes reference name. This conflicts with the --fasta option, which is more likely to be used (at least in the context of aDNA studies, where we don't often study model organisms) and is equally a 'mandatory' argument.
To Reproduce
nextflow run nf-core/eager --help
Expected behavior
The mandatory arguments section should list --genome or --fasta as the two options for specifying the reference.
Screenshots
Current help message
N E X T F L O W ~ version 0.32.0
Launching `nf-core/eager` [thirsty_hawking] - revision: 0894028508 [master]
=========================================
eager v2.0dev
=========================================
Usage:
The typical command for running the pipeline is as follows:
nextflow run nf-core/eager --reads '*_R{1,2}.fastq.gz' -profile docker
Mandatory arguments:
--reads Path to input data (must be surrounded with quotes)
--genome Name of iGenomes reference
-profile Hardware config to use. docker / aws
Options:
--singleEnd Specifies that the input is single end reads
References If not specified in the configuration file or you wish to overwrite any of the references.
--fasta Path to Fasta reference
--bwa_index Path to BWA index
Other options:
--outdir The output directory where the results will be saved
--email Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits
-name Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
PMDTools should be available in EAGER2
The repository is currently lacking any keywords ("topics" in GitHub language).
There should be a clean folder structure, probably we will just remove the numbers there.
Is your feature request related to a problem? Please describe.
I think it is 'dangerous' to assume paired-end data and require a --singleend switch when mapping with single-end data.
As sad as it is, I know quite a lot of clever people doing bioinformatics/downstream analysis who rarely deal with raw sequencing data. When they do occasionally have to deal with it, they are often not familiar with sequencing configurations and may not understand why the pipeline is not working when submitting single-end data.
It also doesn't make sense to assume what the most common type of data is, which is not the role of the pipeline (IMO).
Describe the solution you'd like
I think it would be beneficial to require a --*end flag regardless of what you are submitting: i.e. you always require --singleend or --pairedend. Being explicit is a better default state for user-friendly pipelines that anyone can run.
Describe alternatives you've considered
The pipeline itself could do an automatic detection based on the naming scheme (as you already require a specific naming scheme with R1/R2). This would also help if you have multiple lanes.
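The auto-detection alternative could be sketched roughly like this (illustrative Python only, not pipeline code; the `_R1`/`_R2` filename pattern is an assumption based on the naming scheme mentioned above):

```python
import re
from collections import defaultdict

def detect_layout(fastq_names):
    """Guess single- vs paired-end from an R1/R2 naming scheme.
    Returns 'paired' only if every R1 file has a matching R2."""
    pairs = defaultdict(set)
    for name in fastq_names:
        m = re.search(r'_R([12])(?=[_.])', name)
        if not m:
            return 'single'  # no R1/R2 tag at all
        # the pair key is the file name with the R1/R2 tag removed
        key = name[:m.start()] + name[m.end():]
        pairs[key].add(m.group(1))
    return 'paired' if all(p == {'1', '2'} for p in pairs.values()) else 'single'

print(detect_layout(['sample_R1.fastq.gz', 'sample_R2.fastq.gz']))  # paired
print(detect_layout(['sample_R1.fastq.gz']))                        # single
```

Because the key strips only the R1/R2 tag, the same grouping would also collapse multiple lanes of the same library, as suggested above.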
Additional context
None.
Describe the bug
Currently it is possible to run the pipeline with just the basic mandatory options (reads input, reference, profile), but what happens when you do so doesn't currently seem to be described anywhere.
While a user could look through the output results, it would be good to document that this is actually possible and what is actually going on. This will make it clearer to novice users that one can customise the pipeline parameters and should not necessarily rely on the defaults.
Request
Could @apeltzer please describe somewhere these steps, and I can include this in the documentation somewhere.
If dedup is running, preseq could use the *.hist file created by it.
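The wiring could look something like this (a sketch only: it assumes preseq lc_extrap's -H histogram-input and -B BAM-input options; the output file name is illustrative):

```python
def preseq_cmd(dedup_hist=None, bam=None):
    """Build the preseq invocation, preferring DeDup's *.hist output
    over re-reading the sorted BAM when it is available."""
    if dedup_hist is not None:
        return ['preseq', 'lc_extrap', '-H', dedup_hist, '-o', 'complexity.txt']
    return ['preseq', 'lc_extrap', '-B', bam, '-o', 'complexity.txt']

print(' '.join(preseq_cmd(dedup_hist='sample.hist')))
```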
.... initialize with default value
Maybe we could even ask for 'joint' output as a single file, with the addition of 'M_' prefixes to read names, to get rid of adapterremovalfixprefix?
> nextflow run nf-core/eager -with-singularity /apps/containers/nf-core-eager.simg --pairedEnd --reads "/path/sample_{1,2}.fastq.gz" --trim_bam 3 --max_time 12.h --max_cpus 4 --max_memory 32G --snpcapture false --udg true --udg_type Half --bwamem --genome ${ref} --saveReference true -name textEager3
N E X T F L O W ~ version 0.32.0
Launching `nf-core/eager` [textEager3] - revision: 897fca777a [master]
WARN: Access to undefined parameter `readPaths` -- Initialise it to a default value eg. `params.readPaths = some_value`
=========================================
nf-core/eager v2.0.0dev
=========================================
Pipeline Name : nf-core/eager
Pipeline Version: 2.0.0
Run Name : textEager3
Reads : /path/sample_{1,2}.fastq.gz
Fasta Ref : false
Data Type : Paired-End
Max Memory : 32G
Max CPUs : 4
Max Time : 12.h
Output dir : ./results
Working dir : /fast/users/user/eager/work
Container Engine: singularity
Container : /apps/containers/nf-core-eager.simg
Current home : /home/user
Current user : user
Current path : /home/user/fastdir/eager
Script dir : /home/user/.nextflow/assets/nf-core/eager
Config Profile : standard
=========================================
[warm up] executor > SLURM
[c5/898ed5] Submitted process > get_software_versions
[e1/cadfbc] Submitted process > fastqc (sample)
[06/d3a688] Submitted process > adapter_removal (sample)
[7f/7a6a9a] Submitted process > output_documentation
ERROR ~ Error executing process > 'get_software_versions'
Caused by:
Process `get_software_versions` terminated with an error exit status (127)
Command executed:
echo 2.0.0 &> v_pipeline.txt
echo 0.32.0 &> v_nextflow.txt
fastqc --version &> v_fastqc.txt 2>&1 || true
multiqc --version &> v_multiqc.txt 2>&1 || true
bwa &> v_bwa.txt 2>&1 || true
samtools --version &> v_samtools.txt 2>&1 || true
AdapterRemoval -version &> v_adapterremoval.txt 2>&1 || true
picard MarkDuplicates --version &> v_markduplicates.txt 2>&1 || true
dedup -v &> v_dedup.txt 2>&1 || true
preseq &> v_preseq.txt 2>&1 || true
gatk BaseRecalibrator --version 2>&1 | grep Version: > v_gatk.txt 2>&1 || true
vcf2genome &> v_vcf2genome.txt 2>&1 || true
fastp --version &> v_fastp.txt 2>&1 || true
bam --version &> v_bamutil.txt 2>&1 || true
qualimap --version &> v_qualimap.txt 2>&1 || true
scrape_software_versions.py &> software_versions_mqc.yaml
Command exit status:
127
Command output:
(empty)
Command error:
env: singularity: No such file or directory
Work dir:
/fast/users/user/eager/work/c5/898ed52fe4a02f48caaf2f0130963b
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
-- Check '.nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit
[e1/cadfbc] NOTE: Process `fastqc (sample)` terminated with an error exit status (127) -- Error is ignored
[nf-core/eager] Pipeline Complete
Originally posted by @yassineS in #67 (comment)
Hiya,
While I think the GUI is great for people with little to no CLI experience, it can be a bit cumbersome for power-users. One useful feature for experienced users would be the ability to create .xml files in the CLI. This would be especially useful for people who don't have visualization technology on their servers to load the GUI.
Thanks!
Raphael
Describe the bug
When using the following flag -profile standard,conda
the pipeline crashes at FastQC
with the exit status 127, and the following command error:
Command error:
java: symbol lookup error: /projects1/users/fellows/nextflow/eager2/my_test/work/conda/nf-core-eager-2.0.0-a95ee9548e4a04d99b07955eabaa0afe/jre/lib/amd64/../../../lib/libfontconfig.so.1: undefined symbol: FT_Done_MM_Var
Furthermore, one 'Tip' message provided is to check the .command.out
file in the work directory. However, this file doesn't exist. Note, the tip message seems to change when re-running, so it would be good to know if this is a random 'tip of the day' type message from Nextflow, and if so make that clearer.
To Reproduce
The command I used was:
nextflow run nf-core/eager --reads 'EXB015.A1701/*_R{1,2}*fastq.gz' --pairedEnd --fasta '~References/hg19_MT.fasta' -profile standard,conda --max_cpus 4 --max_memory 32G
(Sending input files privately)
Expected behavior
The pipeline should not crash, and the downloaded conda environment should contain the correct library contents.
Screenshots
NA
Operating System:
Ubuntu 14.04
Additional context
Please provide me with the following files:
.nextflow.log
results/pipeline_info/...
Currently setting --bwa_index doesn't do anything. We should read in files from the selected path and push these to the same input channel that the indexing process uses as well ...
Add an option to clip reads after damage profile generation. Ideally after mapping, to allow usage of damage patterns for schmutzi etc.
For metagenomic analysis one may want to strip out a certain genome (e.g. human DNA).
EAGER v1.x allows export of unmapped reads as BAM, which is often not accepted as input for other tools (e.g. taxonomic profilers or assemblers).
Consider making unmapped read export as fastq/fasta (via samtools).
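The proposed export step might be assembled like this (a sketch, not pipeline code: it assumes samtools fastq's -f flag-filtering behaviour, where 0x4 selects unmapped reads; the output name is illustrative):

```python
def unmapped_export_cmd(bam, out_prefix):
    """Build a shell command exporting only unmapped reads as fastq.
    samtools fastq writes to stdout by default, so we redirect it."""
    return f'samtools fastq -f 4 {bam} > {out_prefix}.unmapped.fastq'

print(unmapped_export_cmd('sample.sorted.bam', 'sample'))
```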
Is your feature request related to a problem? Please describe.
The flag --bam_keep_mapped_only may be misleading. While the help message says "Only consider mapped reads for downstream analysis. Unmapped reads are extracted to separate output.", the flag name may suggest to the user that by 'keeping' only mapped reads, unmapped reads would be discarded completely.
The flag --bam_filter_reads is also confusing. While the help message says "Keep all reads in BAM file for downstream analysis", the flag suggests something is being filtered (either specific reads retained, or reads filtered out).
Describe the solution you'd like
Perhaps rename the --bam_keep_mapped_only flag to something more specific, such as --bam_analyse_mapped_only.
Perhaps rename --bam_filter_reads to --bam_retain_all_reads. Optionally, change the function and name of the flag to e.g. --bam_discard_unmapped.
We should be able to run datasets without merging reads and just clipping etc. pp.
We should have bioconda recipes for every tool used in the pipeline, (some of them are already there). This allows us to create a single Dockerfile/Singularity container with just a couple of lines of code and directly link this to the version of the pipeline with a simple GitHub /Git tag in the future.
List of recipes already there (some pull requests of mine are pending for some of the tools required here):
https://github.com/bioconda/bioconda-recipes
ToDo List:
Describe the bug
The pipeline when ran with -r dev
flag and the conda
profile crashes at output_documentation after adapter removal with the following error:
Command error: /projects1/users/fellows/nextflow/eager2/my_test/work/conda/nf-core-eager-2.0.1-21a7e9f4e1525113a0a8843adce834c8/lib/R/bin/exec/R: error while loading shared libraries: libiconv.so.2: cannot open shared object file: No such file or directory
To Reproduce
run the same command and data in #67 .
Expected behaviour
The conda environment to download the a working R environment.
Additional context
Please provide me with the following files:
nextflow.log.2.txt
eager2_Rsharedlibrary_erorr.zip
Hm, ok. If you want to keep it the way it is, then we need to consider changing the description slightly:
--bam_discard_unmapped Discard an unmapped read file, depending on choice in --
Removing references to bam or fastq in the description makes it clearer that you are not trying to define the file type with this flag.
That said, I still don't think this makes complete sense/it is unnecessarily over-complicated.
In principle I think it would be simpler to just have a single --bam_discard_unmapped_bam.
Use cases would be, assuming someone wants the unmapped reads:
1. Does the user want unmapped reads in only BAM format? Yes: use --bam_separate_unmapped
2. Does the user want unmapped reads in both BAM and fastq? Do the above but with --bam_unmapped_to_fastq
3. Does the user want unmapped reads in only fastq format? Do both 1) and 2) with --bam_unmapped_discard_bam
I think this would also work programmatically. The current system in this commit has mixed messages, with one flag saying you want to discard something and then an entirely separate flag saying you also want to discard something, plus which one. The messages behind the flags sort of overlap.
Does this make sense? Or do you disagree?
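The three use cases could be wired up roughly like this (sketch only; the flag names are the hypothetical ones proposed in this discussion, not implemented options):

```python
def unmapped_flags(want_bam, want_fastq):
    """Map desired unmapped-read outputs onto flag combinations."""
    if not (want_bam or want_fastq):
        return []  # unmapped reads not wanted at all
    flags = ['--bam_separate_unmapped']
    if want_fastq:
        flags.append('--bam_unmapped_to_fastq')
        if not want_bam:
            flags.append('--bam_unmapped_discard_bam')
    return flags

print(unmapped_flags(want_bam=False, want_fastq=True))
```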
Currently there are two flags (--udg and --udg_type) related to UDG treatment for downstream PMD/BamUtil trimming.
However, they functionally do pretty much the same thing, with the latter being more detailed, so they can be condensed into one.
It might also be worth modifying the flag name to --pmd_udg_type
so a novice user doesn't add their UDG treatment when they don't in fact want PMD processing.
In our group we've noticed that we regularly get lots of poly-G reads from NextSeq data which don't get discarded by the sequencer or demultiplexer. These can mess up some downstream statistics if not thrown out.
Maybe we could consider having some form of complexity filter as a module to remove low-complexity reads?
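One common complexity metric (the fraction of bases that differ from the next base, the style used by fastp-like filters) would catch poly-G runs; this is an illustrative sketch with an assumed threshold, not a tool recommendation:

```python
def complexity(seq):
    """Fraction of positions whose base differs from the next base."""
    if len(seq) < 2:
        return 0.0
    return sum(a != b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

def keep_read(seq, threshold=0.3):
    """Keep reads above an assumed minimum complexity threshold."""
    return complexity(seq) >= threshold

print(keep_read('G' * 50))        # False: a poly-G artefact read
print(keep_read('ACGTACGTACGT'))  # True: normal-complexity read
```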
Could be worth trying
splitBams -> genotype/pseudo-call/genotype likelihoods -> stitchVCFs together
I'd like to have a better and improved report at the end, summarizing certain important metrics in a smart way, e.g. (but not limited to) the following. Naming is something I'd not like to standardize too much, since different labs tend to have different names for certain metrics.
...
Could you please add things that are of specific interest to you?
These here are on my ToDo list:
Schmutzi / estimates for contamination
....
Any ideas what you'd like to see here @EisenRa @jfy133 @JudithNeukamm ?
If the schmutzi opt files are not specified on the command line, one could also offer to download these specifically in a single step...
For each configurable parameter, we should have documentation.
Maybe just add it in each step depending whether downstream things are turned on/off ?
One idea would be to have a simple GUI that fetches a pipeline revision from this GitHub repository (e.g. with -r) and then automatically offers the available params entries as configurable options in a dynamic GUI.
e.g. the GUI application just queries this GitHub repository and finds in main.nf some parameters to configure adapter clipping in more detail; thus we could provide a means to offer all these parameters in a GUI for end users in a dynamic way. Explanation of parameters could be done using a separate mapping file with <param.name, "GUI Name of param.name", "Short description">, which we can gradually update too.
This way, we will always have a functional simple GUI application for end users that are not willing to use the CLI only + have the possibility to create a working JSON object for a specific pipeline version in the future too.
Let me know what you think @jfy133!
Describe the bug
When running with the singularity profile on the branches/versions -r shh-profile and -r 2.0.2, the bwa module crashes at the samtools index step on the sorted BAM file.
It appears there may be a misplaced -@ option, which is not applicable to samtools index.
Samtools standard out/error is below:
[main] CMD: bwa samse -r @RG\tID:ILLUMINA-ABM006.A0101_S0_L002_R1_001\tSM:ABM006.A0101_S0_L002_R1_001\tPL:illumina hg19_MT.fasta ABM006.A0101_S0_L002_R1_001.combined.prefixed.fq.sai ABM006.A0101_S0_L002_R1_001.combined.prefixed.fq.gz
[main] Real time: 7.227 sec; CPU: 6.184 sec
index: invalid option -- '@'
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14]
To Reproduce
Run the following command (replace paired end data, index file as required):
nextflow run nf-core/eager \
--reads '/projects1/users/fellows/nextflow/eager2/my_test/data/ABM006.A0101/*_R{1,2}*fastq.gz' \
--pairedEnd \
--fasta '/projects1/users/fellows/nextflow/eager2/my_test/references/hg19_MT.fasta' \
--outdir '/projects1/users/fellows/nextflow/eager2/my_test/output_eager2' \
-profile singularity \
--max_cpus 4 \
--max_memory 16G \
-r 2.0.2
Additional context
Please provide me with the following files:
.nextflow.log
results/pipeline_info/...
pipeline_report.txt
Missing parts:
Handling SE data
FixPrefix for AR - when to combine which files?
merging vs only clipping for both
The EAGER 2.0 pipeline will have fully automatic consistency tests
I'll create test reference genomes a couple of KB in size to test things quickly and then
Might have a think about using sambamba to do sam/bam/cram operations like what I did in LncPipe
for sam->bam conversion
https://github.com/likelet/LncPipe/blob/master/LncRNAanalysisPipe.nf#L592
and bam sort
https://github.com/likelet/LncPipe/blob/master/LncRNAanalysisPipe.nf#L593
hope it helps
To continue with aDNA specific tools, we could consider adding 'GenConS', which is a part of the TOPAS package - https://github.com/subwaystation/TOPAS
This allows you to generate a consensus sequence but with a punishment score on possible C to T deamination lesions.
Is your feature request related to a problem? Please describe.
Eukaryotic genomes can sometimes be large and can take up a lot of space when in uncompressed formats.
We could consider accepting .fasta.gz files alongside plain FASTA files.
For indexing this would require a decompression, indexing, then re-compression step for the FASTA itself - but indices are often smaller, so I guess they wouldn't require compressing.
The only potential issue I see here is how the pipeline deals with multiple runs trying to access the same file at once.
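The decompress-before-indexing step might look something like this (a hypothetical helper, not pipeline code; note the naive existence check deliberately does NOT solve the concurrent-access issue raised above):

```python
import gzip
import os
import shutil

def ensure_plain_fasta(path):
    """Accept either .fasta or .fasta.gz reference input by decompressing
    gzipped references to a working copy before indexing."""
    if not path.endswith('.gz'):
        return path
    plain = path[:-3]
    if not os.path.exists(plain):  # naive guard, racy under concurrent runs
        with gzip.open(path, 'rb') as src, open(plain, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return plain
```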
lcMLkin, pmr, READ
The former way to do things in EAGER 1.x was to use GATK to call variants on the preprocessed/filtered BAM files and then use that to recreate e.g. a consensus FASTA for small genomes and/or create a VCF for downstream tools.
Nowadays, however, there are tools out there that can be used for downstream genotyping and are aware of ancient DNA damage etc., for example snpAD and IIRC angsd and sequenceTools, which I'd rather rely on, as they are specifically designed for aDNA usage.
The learning curve for these is okay-ish, as I think only basic functionality, for example solely output for downstream analysis tools, is required.
My plan for now is to incorporate some of the functionality of:
Additionally, I'd love to incorporate:
These changes are planned features for V2.1 of the pipeline, 2.0 will "just" provide functionality for preprocessing, QC and mapping using BWA for now.
We should have support for snpAD, additionally incorporating mapability tracks if possible.
as it's going to be deprecated soon:
nf-core/methylseq#27
Title is pretty self explanatory. (@jfy133 asked for this)
Results directory could be along the lines of
eager_out
--| reference_1
----| sample_1
----| sample_2
----| sample_3
--| reference_2
----| sample_1
----| sample_2
----| sample_3
--| reference_3
----| sample_1
----| sample_2
----| sample_3
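The proposed layout could be created with a small helper along these lines (illustrative sketch only; directory names follow the tree above):

```python
import os

def make_results_tree(base, references, samples):
    """Create the per-reference / per-sample results layout sketched above
    and return the top-level (reference) directory names."""
    for ref in references:
        for sample in samples:
            os.makedirs(os.path.join(base, ref, sample), exist_ok=True)
    return sorted(os.listdir(base))
```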
We could think about splitting BAMs, as DeDup/MarkDuplicates normally takes quite some time, and use a file.size > 2GB check (or similar operator) to speed things up significantly. A subsequent merge would be a matter of minutes, automatically creating the same output for downstream analysis as before.
Is your feature request related to a problem? Please describe.
Sometimes we receive/download only BAM files (e.g. see the Slon et al. 2017 data on ENA) that we wish to remap/reprocess in a different context. For example, if the data is metagenomic, we may want to map to a particular bacterial genome. Currently we would have to manually re-convert the BAM to FASTQ, which leads to unnecessary data redundancy.
Describe the solution you'd like
An option to provide a BAM as input, rather than FASTQ. One solution would be a -bam flag, which would (turn off a clip/merge module? and then) pipe stdout from samtools fastq into the mapper itself.
Describe alternatives you've considered
One could just manually re-convert the BAM.
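The proposed streaming mode might be assembled like this (sketch only: bwa mem and the output name are assumptions for illustration, and "-" tells the mapper to read from stdin):

```python
def bam_input_cmd(bam, reference):
    """Build a shell pipeline streaming reads out of a BAM with
    samtools fastq and straight into the mapper, avoiding an
    intermediate FASTQ file on disk."""
    return f'samtools fastq {bam} | bwa mem {reference} - > remapped.sam'

print(bam_input_cmd('slon2017.bam', 'target_genome.fasta'))
```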
In some metagenomic contexts, there can be closely related species in a sample that can make read mapping to a single reference genome difficult (e.g. cross-mapping of reads between species). In this situation, it can be useful to employ competitive mapping, whereby reference genomes from closely related species are concatenated (in a multifasta), and the reads mapped to this reference. This can allow for the mapping quality filter to filter out reads that would cross-map between species.
EAGER could allow users to select a folder, or a list of reference fastas and concatenate them into a multifasta prior to read mapping. Alternatively, the user could provide their pre-concatenated multifasta file.
Regarding the output of mapping stats, the concatenated BAM file would have to be split using bamtools prior to generating stats.
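The concatenation step itself is straightforward; a minimal sketch (assuming well-formed FASTA inputs, with a guard for a missing trailing newline so records are not fused):

```python
def concat_references(fasta_paths, out_path):
    """Concatenate reference FASTAs into a competitive-mapping multifasta."""
    with open(out_path, 'w') as out:
        for path in fasta_paths:
            with open(path) as fasta:
                text = fasta.read()
                # ensure a record boundary between concatenated files
                out.write(text if text.endswith('\n') else text + '\n')
    return out_path
```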
In order to get synced with the nf-core template, this repo needs to be prepared accordingly as described in the nf-core documentation.