nf-core / modules
Repository to host tool-specific module files for the Nextflow DSL2 community!
Home Page: https://nf-co.re/modules
License: MIT License
Lots of people use nf-core pipelines offline. We want to make the process of using modules from a different repository as simple as possible.
One solution would be to use git submodule to add nf-core/modules as a git submodule to every pipeline. By default, git clone will not pull the submodules; git clone --recursive or git submodule update --init --recursive will pull the module repository.
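As a sketch, assuming the submodule would live at modules/ (the path and the rnaseq pipeline URL are just illustrative):

```shell
# add nf-core/modules as a submodule of the pipeline repo
git submodule add https://github.com/nf-core/modules.git modules

# a plain clone leaves the modules/ directory empty...
git clone https://github.com/nf-core/rnaseq.git
# ...so either clone recursively,
git clone --recursive https://github.com/nf-core/rnaseq.git
# or pull the submodule after the fact
git submodule update --init --recursive
```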
Loading logic could then be:
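(The snippet from the original post isn't reproduced here; a hypothetical sketch, using the DSL2 preview include syntax and an assumed fastqc module path, might look like the following. Whether a plain URL works in an include is exactly what would need testing.)

```groovy
// if the pipeline was cloned with --recursive, the vendored submodule
// exists locally; otherwise fall back to the remote copy
params.modules_base = file("${baseDir}/modules").exists()
    ? "${baseDir}/modules"
    : "https://raw.githubusercontent.com/nf-core/modules/master"

include fastqc from "${params.modules_base}/software/fastqc/main.nf" params(params)
```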
Then by default most people running online will pull the online files dynamically. But pulling a pipeline to use offline is super easy and does not require any changes to files or config.
Currently nf-core download manually pulls institutional config files and edits nextflow.config so that the pipeline loads these files. This could also be done with submodules as above, without any need to edit any files.
Limitations would be that we have to manage the git hash of the modules repository in two places: the git submodule file and the nextflow.config file. We can lint to check that these two are the same. Also, this forces pipelines to use a single hash for all modules in the pipeline. I think this is probably OK for reasons of maintaining sanity though.
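Such a lint check could be sketched as below; the params.modules_hash name in nextflow.config and the modules submodule path are assumptions, not existing conventions:

```shell
# commit recorded for the submodule in the current pipeline commit
sub_hash=$(git rev-parse HEAD:modules)
# hash recorded in nextflow.config (hypothetical modules_hash param)
cfg_hash=$(grep -oE "modules_hash *= *'[0-9a-f]+'" nextflow.config \
    | grep -oE "[0-9a-f]{7,40}")
test "$sub_hash" = "$cfg_hash" || echo "ERROR: modules hash mismatch"
```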
Thoughts?
Hi there,
Maybe it was already suggested somewhere?
Have you ever thought about sharing the Singularity images using the CernVM File System (CVMFS)? It allows files to be provided over the network via a web of mirrors.
The Galaxy project is using this technology to share databanks, Singularity images and configs across Galaxy instances.
https://galaxyproject.org/blog/2019-02-cvmfs/
Easier to suggest than to implement. So far I'm just a client (and soon a stratum 1) and have never tried to build something from scratch.
My 2 cents
I'll be working on adding the modules and single-tool workflows that were already used and tested at the Babraham.
To avoid duplication of efforts, the tools I'll be working on initially will include the following:
QC
Trimming
Alignment
Read Simulator:
Allele-specific sorting:
Before we released v2.0 of the rnaseq pipeline Nextflow didn't have direct download support for Singularity images. Paolo has now added this functionality here and it will be available in any releases after 20.10.0.
I had already added some logic to download the Singularity images in the DSL2 module files but it had to be removed in #76 for the reasons outlined above. Be great to add it back in after the next stable Nextflow release!
I think it would be good to have a module for freebayes
Should we support multiple Nextflow versions in the CI? If so, which ones? See 8cd635f
For running locally and for more CI jobs.
I think it would be good to have a module for gatk4
Copied from the slack channel:
Hi guys,
Can I get your feedback on a custom parameter inheritance model we have built-in for our modules?
Our user story is such that we wanted a set of default params defined inside the module to run the process in the case that the user imports the module and does nothing else.
We then wanted to be able to override the params with those from the parent nf file, but without making large boilerplate calls using addParams or by passing arguments as channels as we feel these should be retained for data.
Finally, we wanted to be able to set group parameters on multiple includes of the same module, while retaining the ability to override the module params individually if we wanted to.
We found during our testing that any module params defined actually override the global parameters which is the opposite of what we wanted. This forces either the route via addParams or the route via channels, neither of which we wanted to use.
I constructed a custom Groovy class which automatically overrides the params by matching names. First, the module params are prefixed with internal_* - then any parameter in the parent nf file can override an internal param by prefixing it with the module name (e.g. for cutadapt, params.cutadapt_adapter_seq would override params.internal_adapter_seq inside the module).
This provides a model where defaults are used unless explicitly overridden in the parent. The same param is overridden in all module instances unless specifically overridden using addParams. This gives us the flexibility, for example, to define a global adapter sequence for cutadapt but separate output directories for each module instance.
The functionality requires 3 lines of code per module to implement.
I have posted the code below - please ignore the rest of the module's parameters, as we are still building out and generalising (we also know there is a cutadapt module; it's just an easy example).
#!/usr/bin/env nextflow
// Include NfUtils
Class groovyClass = new GroovyClassLoader(getClass().getClassLoader()).parseClass(new File("groovy/NfUtils.groovy"));
GroovyObject nfUtils = (GroovyObject) groovyClass.newInstance();
// Define internal params
module_name = 'cutadapt'
// Specify DSL2
nextflow.preview.dsl = 2
// TODO check version of cutadapt in host process
// Define default nextflow internals
params.internal_outdir = './results'
params.internal_process_name = 'cutadapt'
params.internal_output_prefix = ''
params.internal_min_quality = 10
params.internal_min_length = 16
params.internal_adapter_sequence = 'AGATCGGAAGAGC'
// Check whether any internal params have module-prefixed overrides
nfUtils.check_internal_overrides(module_name, params)
// Trimming reusable component
process cutadapt {
    // Tag (the sample_id input is commented out below, so tag on the read file)
    tag "${reads.simpleName}"

    publishDir "${params.internal_outdir}/${params.internal_process_name}",
        mode: "copy", overwrite: true

    input:
    //tuple val(sample_id), path(reads)
    path(reads)

    output:
    //tuple val(sample_id), path("${reads.simpleName}.trimmed.fq.gz")
    path("${params.internal_output_prefix}${reads.simpleName}.trimmed.fq.gz")

    script:
    """
    cutadapt \
        -j ${task.cpus} \
        -q ${params.internal_min_quality} \
        --minimum-length ${params.internal_min_length} \
        -a ${params.internal_adapter_sequence} \
        -o ${params.internal_output_prefix}${reads.simpleName}.trimmed.fq.gz $reads
    """
}
class NfUtils {
    def check_internal_overrides(String moduleName, Map params)
    {
        // Get the set of parameter keys
        Set paramsKeySet = params.keySet()
        // Iterate through the keys and repoint each internal param
        // to its module-prefixed override at runtime
        paramsKeySet.each {
            if (it.startsWith("internal_")) {
                def searchString = moduleName + '_' + it.replace('internal_', '')
                if (paramsKeySet.contains(searchString)) {
                    params.replace(it, params.get(searchString))
                }
            }
        }
    }
}
#!/usr/bin/env nextflow
// Define DSL2
nextflow.preview.dsl=2
// Log
log.info ("Starting Cutadapt trimming test pipeline")
/* Define global params
--------------------------------------------------------------------------------------*/
params.cutadapt_output_prefix = 'trimmed_'
/* Module inclusions
--------------------------------------------------------------------------------------*/
include cutadapt from './trim-reads.nf' addParams(cutadapt_process_name: 'cutadapt1')
include cutadapt as cutadapt2 from './trim-reads.nf' addParams(cutadapt_process_name: 'cutadapt2')
/*------------------------------------------------------------------------------------*/
/* Define input channels
--------------------------------------------------------------------------------------*/
testPaths = [
    ['Sample 1', "$baseDir/input/readfile1.fq.gz"],
    ['Sample 2', "$baseDir/input/readfile2.fq.gz"],
    ['Sample 3', "$baseDir/input/readfile3.fq.gz"],
    ['Sample 4', "$baseDir/input/readfile4.fq.gz"],
    ['Sample 5', "$baseDir/input/readfile5.fq.gz"],
    ['Sample 6', "$baseDir/input/readfile6.fq.gz"]
]
// Create channels of test data (excluding the sample ID)
Channel
    .from(testPaths)
    .map { row -> file(row[1]) }
    .set { ch_test_inputs }
Channel
    .from(testPaths)
    .map { row -> file(row[1]) }
    .set { ch_test_inputs2 }
/*------------------------------------------------------------------------------------*/
// Run workflow
workflow {
    // Run cutadapt
    cutadapt( ch_test_inputs )
    // Run the second cutadapt instance
    cutadapt2( ch_test_inputs2 )
    // Collect file names and view output
    //cutadapt.out | view
}
I think it would be good to have a module for allelecounter
The more I think about it, the more I think that JSON is more appropriate for the meta information. We have nested lists and other semi-complicated structures, and JSON is more verbose and clear with this stuff.
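For example, a meta map with nested structure (the field names here are purely illustrative, not a proposed schema) reads quite naturally as JSON:

```json
{
  "id": "sample_1",
  "single_end": false,
  "read_groups": [
    { "lane": 1, "files": ["s1_L001_R1.fq.gz", "s1_L001_R2.fq.gz"] },
    { "lane": 2, "files": ["s1_L002_R1.fq.gz", "s1_L002_R2.fq.gz"] }
  ]
}
```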
Need to test, and possibly work out, a way to use a remote git repo with the include statement, e.g.
modules_base = "https://raw.githubusercontent.com/nf-core/modules/${params.module_version}"
include "${modules_base}" params(params)
Edit: during CI jobs, to speed up the workflows. This will be applicable across nf-core CI jobs.
Write module file for shovill https://github.com/tseemann/shovill
There will be various tests we can perform on individual module files...how far we go and how we implement this is up for discussion. For example: checking that the input:, output: and script: blocks are present, or that each module can be include-d with a vanilla template script.
I was wondering if, using the bits @ewels posted on Slack, we could solve both of these issues and introduce "test autodiscovery" from a single GitHub Action that spawns pytest-workflow for each changed folder using the test-matrix strategy. That way, the pytest-workflow test-dir could be set to each module directory, and the tests folder contained in each module.
IMO, that would clean up quite a few redundancies.
Originally posted by @grst in #80 (comment)
Without separate workflow files for each module.
We need to decide how best to document each individual module itself, e.g. what the module does, keywords for findability, links to the homepage of each tool used in the process, etc. @sven and I came up with a rudimentary version of this, but I think we will need more discussion to get this right.
/*
* Description:
* Run FastQC on sequenced reads
* Keywords:
* read qc
* adapter
* Tools:
* FastQC:
* homepage: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
* documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
* description: FastQC gives general quality metrics about your reads.
* It provides information about the quality score distribution
* across your reads, the per base sequence content (%A/C/G/T).
* You get information about adapter contamination and other
* overrepresented sequences.
*/
It would also be good to be able to generate automated docs for the types of objects that are required as input: and output: for each module, the script: section, and any other information that may be useful. @sven suggested we may be able to get this directly by plugging into Nextflow itself.
This is all still open for discussion so please chime in if you have some ideas.
I think it would be good to have a module for fgbio
I commented out the EditorConfig linting in 082c582, but it would be good to fix it and add it back in - possibly in one go, to get all of the tests passing again.
I think it would be good to have a module for controlfreec
We need to come up with a way to version each module, or at least be able to use a particular version of a module within the main pipeline script. Through previous discussions we have somewhat agreed that we need to be able to do this via git commit, as we are able to do with nf-core/configs. Whether we are able to do this at the level of individual module files, or with a single commit id for the entire nf-core/modules repo, is still up for discussion.
I think it would be good to have a module for ascat
This is just a placeholder for a future discussion. I'm working on adding some homer modules. The problem is the way configuration occurs in the currently used Dockerfile, and the fact that the Docker image's filesystem is read-only.
https://hub.docker.com/r/dennishazelett/homer
Here's a documented example of how they create various genomes from a base Dockerfile.
So far I have
perl /usr/local/share/homer-4.11-2/configureHomer.pl \\
-install $genome \\
-keepScript
Which runs but I'm not able to take the /usr/local/share/homer-4.11-2/ directory and use it as an output.
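One possible workaround, sketched here as an assumption rather than a tested recipe: copy HOMER's share directory into the task work dir first, so the configured genome lands somewhere writable that can be declared as a process output:

```shell
# copy the (read-only) HOMER install into the writable work dir
cp -r /usr/local/share/homer-4.11-2 ./homer
# configure the genome against the copy; ./homer can then be an output
perl ./homer/configureHomer.pl -install $genome -keepScript
```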
Write a tcoffee module.
modules branch of nf-core/test-datasets
To get pinged and have actual owners who keep up with the software that the modules are taking advantage of.
I think it would be good to have a module for fgbio
Given that we are now testing for the same outputs generated by a given module in order to detect changes when the module itself is updated, it would be good if we could somehow factor in instances where, for example, alignments are generated randomly if the same tool is run more than once. This will rightly break the CI tests, but one way around it is to use --seed parameters where available, e.g. in Bowtie2.
The implementation should be as simple as passing the appropriate optional argument to the tool in the main.nf script for the tests.
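For Bowtie2 that could look like the line below (--seed is a real Bowtie2 option; the index and read file names are placeholders):

```shell
# pin the aligner's PRNG so repeated CI runs give identical alignments
bowtie2 --seed 42 -x "$index" \
    -1 test_R1.fastq.gz -2 test_R2.fastq.gz -S test.sam
```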
Also see #143 (comment)
Suggested during discussion at the Stockholm hackathon about potential repository organisation:
.
├── .github
│ └── workflows
│ └── test-processes.yml
├── README.md
├── nf-core
└── tools
├── bwa
│ └── mem
│ ├── main.nf
│ ├── meta.yml
│ └── test-action.yml
├── fastqc
│ ├── main.nf
│ ├── meta.yml
│ └── test-action.yml
└── samtools
├── index
│ ├── main.nf
│ ├── meta.yml
│ └── test-action.yml
└── sort
├── main.nf
├── meta.yml
└── test-action.yml
.github/workflows/test-processes.yml
will have a step for each process tool.
Need to look in to how Nextflow DSL2 handles variable numbers of inputs or outputs.
For example - TrimGalore! can optionally save untrimmed reads. If that is enabled, we will have an additional output channel. How do pipelines handle this?
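One option, sketched here against current Nextflow rather than confirmed DSL2-preview behaviour (file patterns are illustrative), is to declare the extra channel with the optional output attribute:

```groovy
output:
path "*_trimmed.fq.gz", emit: reads
// only produced when untrimmed reads are kept; `optional: true`
// stops Nextflow failing when the files are absent
path "*_unpaired.fq.gz", optional: true, emit: unpaired
```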
https://pytest-workflow.readthedocs.io/en/stable/#writing-custom-tests
Bowtie and Bowtie2 include the run command in the BAM header, which will never be the same across environments, so the md5 hash will never be equal across different containers.
$ samtools view -H test.bam
## Singularity
@HD VN:1.0 SO:unsorted
@SQ SN:gi|170079663|ref|NC_010473.1| LN:4686137
@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:"/usr/local/bin/bowtie2-align-s --wrapper basic-0 -x ./bowtie2/NC_010473 --threads 1 -1 test_R1.fastq.gz -2 test_R2.fastq.gz"
@PG ID:samtools PN:samtools PP:bowtie2 VN:1.11 CL:samtools view -@ 1 -bhS -o test.bam -
## Conda
@HD VN:1.0 SO:unsorted
@SQ SN:gi|170079663|ref|NC_010473.1| LN:4686137
@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:"/tmp/pytest_workflow_4fbqrxe4/Run_bowtie2_index_and_align_paired-end/work/conda/env-10b78180015f409ae983f51f20f43c6a/bin/bowtie2-align-s --wrapper basic-0 -x ./bowtie2/NC_010473 --threads 1 -1 test_R1.fastq.gz -2 test_R2.fastq.gz"
@PG ID:samtools PN:samtools PP:bowtie2 VN:1.11 CL:samtools view -@ 1 -bhS -o test.bam -
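One workaround (a sketch, not an agreed convention) is to checksum the records without the header, or to drop only the @PG lines:

```shell
# hash the alignments only: `samtools view` without -h omits the header
samtools view test.bam | md5sum
# or keep the header but strip the container-specific @PG records
samtools view -h test.bam | grep -v '^@PG' | md5sum
```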