nf-core / modules

Repository to host tool-specific module files for the Nextflow DSL2 community!

Home Page: https://nf-co.re/modules

License: MIT License

Languages: Nextflow 95.42%, R 3.47%, Python 0.76%, Dockerfile 0.26%, Shell 0.09%
Topics: nf-core, nextflow, modules, workflows, pipelines, dsl2, nf-test

nf-core/modules Issues

Handle module / process imports

Lots of people use nf-core pipelines offline. We want to make the process of using modules from a different repository as simple as possible.

One solution would be to use git submodule to add nf-core/modules as a git submodule to every pipeline. By default, doing git clone will not pull the submodules. Doing git clone --recursive or git submodule update --init --recursive will pull the module repository.

Loading logic could then be:

  • Try to load the files locally - works if submodule is initialised. Fails otherwise.
  • If fails, try to load from the web
  • If fails, exit with an error

Then by default most people running online will pull the online files dynamically. But pulling a pipeline to use offline is super easy and does not require any changes to files or config.

Currently nf-core download manually pulls institutional config files and edits nextflow.config so that the pipeline loads these files. This could also be done with submodules as above, without any need to edit any files.
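The submodule flow described above can be sketched end-to-end. This is a minimal demonstration using two throwaway local repositories in place of a real pipeline and nf-core/modules (all paths and file names here are illustrative; `protocol.file.allow=always` is only needed because newer git blocks file-protocol submodules by default):

```shell
# Sketch of the proposed submodule flow, using two throwaway local repos in
# place of a real pipeline and nf-core/modules (all names are illustrative).
set -eu
work=$(mktemp -d) && cd "$work"
g() { git -c user.name=demo -c user.email=demo@example.com -c protocol.file.allow=always "$@"; }

# Stand-in "modules" repo with one module file
g init -q modules
echo "process FASTQC {}" > modules/fastqc.nf
g -C modules add fastqc.nf
g -C modules commit -qm "add module"

# Pipeline repo vendoring it as a submodule
g init -q pipeline
g -C pipeline commit -q --allow-empty -m init
g -C pipeline submodule add -q "$work/modules" modules
g -C pipeline commit -qm "vendor modules"

# A plain clone leaves the submodule dir empty; --recursive populates it,
# which is exactly what an offline user would run once, up front.
g clone -q --recursive pipeline offline-copy
test -f offline-copy/modules/fastqc.nf && echo "submodule files present offline"
```

After the recursive clone, all module files are on disk and the "load locally first" logic would succeed without any network access.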

Limitations would be that we have to manage the git hash of the modules repository in two places - the git submodule file and the nextflow.config file. We can lint to check that these two are the same. Also, this forces pipelines to use a single hash for all modules in the pipeline. I think this is probably ok for reasons of maintaining sanity though.

Thoughts?

Share singularity images using CVMFS

Hi here,

Maybe it was already suggested somewhere?
Have you ever thought about sharing the Singularity images using the CernVM File System (CVMFS)? It allows files to be provided over the network via a web of mirrors.

The Galaxy project uses this technology to share databanks, Singularity images and configs across Galaxy instances.
https://galaxyproject.org/blog/2019-02-cvmfs/

Easier to suggest than to implement, though. So far I'm just a client (and soon a stratum 1) and have never tried to build something from scratch.

My 2 cents

Port modules and single-tool workflows from Babraham

I'll be working on adding the modules and single-tool workflows that are already used and tested at the Babraham Institute.

To avoid duplication of efforts, the tools I'll be working on initially will include the following:

QC

  • FastQC
  • FastQ Screen
  • MultiQC

Trimming

  • Trim Galore

Alignment

  • Bowtie2
  • HISAT2
  • Bismark
  • deduplicate_bismark
  • bismark_methylation_extractor
  • bismark2bedGraph
  • coverage2cytosine
  • bismark2summary
  • bismark2report

Read Simulator:

  • Sherman

Allele-specific sorting:

  • SNPsplit

Direct download of Singularity images via HTTPS

Before we released v2.0 of the rnaseq pipeline, Nextflow didn't have direct download support for Singularity images. Paolo has now added this functionality here, and it will be available in any release after 20.10.0.

I had already added some logic to download the Singularity images in the DSL2 module files but it had to be removed in #76 for the reasons outlined above. Be great to add it back in after the next stable Nextflow release!

Module parameter inheritance and parameter wrapping

Copied from the slack channel:


Hi guys,

Can I get your feedback on a custom parameter inheritance model we have built-in for our modules?

Our user story is that we wanted a set of default params defined inside the module, so that the process runs if the user imports the module and does nothing else.
We then wanted to be able to override those params from the parent nf file, but without large boilerplate calls to addParams and without passing arguments as channels, as we feel channels should be reserved for data.
Finally, we wanted to be able to set group parameters across multiple includes of the same module, while retaining the ability to override each module's params individually if we wanted to.
We found during our testing that any params defined in the module actually override the global parameters, which is the opposite of what we wanted. This forces either the addParams route or the channels route, neither of which we wanted to use.

I constructed a custom Groovy class which automatically overrides the params by matching names. First, the module params are prefixed with internal_*; then any parameter in the parent nf file can override an internal param by prefixing it with the module name (e.g. for cutadapt, params.cutadapt_adapter_seq would override params.internal_adapter_seq inside the module).
This provides a model where defaults are used unless explicitly overridden in the parent. The same param is overridden in all module instances unless specifically overridden using addParams. This gives us the flexibility for example to define a global adapter sequence for cutadapt, but define separate output directories for each module instance.

The functionality requires 3 lines of code per module to implement.

I have posted the code below. Please ignore the rest of the module, parameter-wise, as we are still building it out and generalising (we also know there is already a cutadapt module; it's just an easy example).

#!/usr/bin/env nextflow
// Include NfUtils
Class groovyClass = new GroovyClassLoader(getClass().getClassLoader()).parseClass(new File("groovy/NfUtils.groovy"));
GroovyObject nfUtils = (GroovyObject) groovyClass.newInstance();
// Define internal params
module_name = 'cutadapt'
// Specify DSL2
nextflow.preview.dsl = 2
// TODO check version of cutadapt in host process
// Define default nextflow internals
params.internal_outdir = './results'
params.internal_process_name = 'cutadapt'
params.internal_output_prefix = ''
params.internal_min_quality = 10
params.internal_min_length = 16
params.internal_adapter_sequence = 'AGATCGGAAGAGC'
// Check if internal params need to be overridden by global params
nfUtils.check_internal_overrides(module_name, params)
// Trimming reusable component
process cutadapt {
    // Tag
    tag "${sample_id}"
    publishDir "${params.internal_outdir}/${params.internal_process_name}",
        mode: "copy", overwrite: true
    input:
        //tuple val(sample_id), path(reads)
        path(reads)
    output:
        //tuple val(sample_id), path("${reads.simpleName}.trimmed.fq.gz")
        path("${params.internal_output_prefix}${reads.simpleName}.trimmed.fq.gz")
    shell:
    """
    cutadapt \
        -j ${task.cpus} \
        -q ${params.internal_min_quality} \
        --minimum-length ${params.internal_min_length} \
        -a ${params.internal_adapter_sequence} \
        -o ${params.internal_output_prefix}${reads.simpleName}.trimmed.fq.gz $reads
    """
}
class NfUtils{
    def check_internal_overrides(String moduleName, Map params)
    {
        // get params set of keys
        Set paramsKeySet = params.keySet()
        // Iterate through and set internals to the correct parameter at runtime
        paramsKeySet.each {
            if(it.startsWith("internal_")) {
                def searchString = moduleName + '_' + it.replace('internal_', '');
                if(paramsKeySet.contains(searchString)) {
                    params.replace(it, params.get(searchString))
                }
            }
        }
    }
}
#!/usr/bin/env nextflow
// Define DSL2
nextflow.preview.dsl=2
// Log
log.info ("Starting Cutadapt trimming test pipeline")
/* Define global params
--------------------------------------------------------------------------------------*/
params.cutadapt_output_prefix = 'trimmed_'
/* Module inclusions 
--------------------------------------------------------------------------------------*/
include cutadapt from './trim-reads.nf' addParams(cutadapt_process_name: 'cutadapt1')
include cutadapt as cutadapt2 from './trim-reads.nf' addParams(cutadapt_process_name: 'cutadapt2')
/*------------------------------------------------------------------------------------*/
/* Define input channels
--------------------------------------------------------------------------------------*/
testPaths = [
  ['Sample 1', "$baseDir/input/readfile1.fq.gz"],
  ['Sample 2', "$baseDir/input/readfile2.fq.gz"],
  ['Sample 3', "$baseDir/input/readfile3.fq.gz"],
  ['Sample 4', "$baseDir/input/readfile4.fq.gz"],
  ['Sample 5', "$baseDir/input/readfile5.fq.gz"],
  ['Sample 6', "$baseDir/input/readfile6.fq.gz"]
]
// Create channel of test data (excluding the sample ID)
Channel
  .from(testPaths)
  .map { row -> file(row[1]) }
  .set { ch_test_inputs }

Channel
  .from(testPaths)
  .map { row -> file(row[1]) }
  .set { ch_test_inputs2 }
/*------------------------------------------------------------------------------------*/
// Run workflow
workflow {
    // Run cutadapt
    cutadapt( ch_test_inputs )
    // Run cutadapt
    cutadapt2( ch_test_inputs2 )
    // Collect file names and view output
    //cutadapt.out | view 
}

Use JSON for meta data

The more I think about it, the more I think that JSON is more appropriate for the meta information. We have nested lists and other semi-complicated structures, and JSON is more explicit and clear with this kind of thing.
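As a sketch of what that could look like, here is a hypothetical meta file for a FastQC module expressed as JSON. All field names are illustrative, not an agreed schema; the nesting under tools, input and output is the kind of structure in question:

```json
{
  "name": "fastqc",
  "description": "Run FastQC on sequenced reads",
  "keywords": ["read qc", "adapter"],
  "tools": {
    "fastqc": {
      "homepage": "https://www.bioinformatics.babraham.ac.uk/projects/fastqc/",
      "description": "FastQC gives general quality metrics about your reads."
    }
  },
  "input": [
    { "name": "reads", "type": "file", "pattern": "*.fastq.gz" }
  ],
  "output": [
    { "name": "html", "type": "file", "pattern": "*.html" }
  ]
}
```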

Use remote repos with include statement

Need to test and possibly work out a way to use a remote git repo with the include statement
e.g.

modules_base = "https://raw.githubusercontent.com/nf-core/modules/${params.module_version}"
include "${modules_base}" params(params)

Cache nextflow binary

Edit: cache the Nextflow binary during CI jobs to speed up the workflows. This will be applicable across all nf-core CI jobs.

Module tests

There will be various tests we can perform on individual module files...how far we go and how we implement this is up for discussion.

  1. Test and parse the module file to create documentation about the tools used in the process, e.g. homepage links etc.
  2. Test and parse the content of the process via NF e.g. input:, output: and script:
  3. Test the module works on include with a vanilla template script
  4. Test the actual process command works by bundling containers from biocontainers as default and testing the execution - this will also require the appropriate test data to be hosted somewhere for CI tests. This could be a can of worms as we should be able to expect contributors to test this anyway (@sven?)
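The structural side of tests (1) and (3) could start as a simple lint over the repository layout. This is a minimal sketch assuming the per-tool directory layout proposed elsewhere in this repo (main.nf plus meta.yml per module); the directory names here are fabricated for the demonstration:

```shell
# Minimal structural lint: every module directory shipping a main.nf must
# also ship a meta.yml (layout and names are illustrative).
set -eu
cd "$(mktemp -d)"
mkdir -p tools/fastqc tools/bwa/mem
touch tools/fastqc/main.nf tools/fastqc/meta.yml
touch tools/bwa/mem/main.nf            # meta.yml deliberately missing

fails=0
for nf in $(find tools -name main.nf); do
    dir=$(dirname "$nf")
    [ -f "$dir/meta.yml" ] || { echo "MISSING meta.yml in $dir"; fails=$((fails+1)); }
done
echo "modules with problems: $fails"
```

The deeper tests (2) and (4) would then build on this skeleton, running each module with bundled biocontainers and hosted test data.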

Run a test on all modules which have been modified in that push / PR, all in parallel in separate jobs

I was wondering if, using the bits @ewels posted on Slack, we could solve both of these issues and introduce "test autodiscovery" from a single GitHub Action that spawns a pytest-workflow run for each changed folder using the "test matrix" strategy. That way, the pytest-workflow test-dir could be set to each module directory, with the tests folder contained in each module.

IMO, that would clean up quite a few redundancies.

Originally posted by @grst in #80 (comment)

Without separate workflow files for each module.
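The autodiscovery step reduces to "which module folders changed in this push/PR", which git can answer directly; a single workflow could feed the result into a matrix. A sketch against a throwaway local repo (folder names are illustrative):

```shell
# Sketch of test autodiscovery: derive the list of changed module folders
# from git, one matrix entry per folder.
set -eu
cd "$(mktemp -d)"
git init -q repo && cd repo
git -c user.name=d -c user.email=d@e commit -q --allow-empty -m base
mkdir -p tools/fastqc tools/samtools/sort
touch tools/fastqc/main.nf tools/samtools/sort/main.nf
git add .
git -c user.name=d -c user.email=d@e commit -qm "edit two modules"

# Folders touched since the previous commit (in CI this would be the diff
# against the PR base branch rather than HEAD~1)
changed=$(git diff --name-only HEAD~1 | xargs -n1 dirname | sort -u)
echo "$changed"
```

Each line of `$changed` would become one pytest-workflow job in the matrix, with test-dir pointed at that module directory.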

Module documentation format

We need to decide how best to be able to document each individual module itself e.g. what is this module doing, keywords for findability, links to homepage per tool used in the process etc. @sven and I came up with a rudimentary version of this but I think we will need more discussion to get this right.

/*
* Description:
*     Run FastQC on sequenced reads
* Keywords:
*     read qc
*     adapter
* Tools:
*     FastQC:
*         homepage: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
*         documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/
*         description: FastQC gives general quality metrics about your reads.
*                      It provides information about the quality score distribution
*                      across your reads, the per base sequence content (%A/C/G/T).
*                      You get information about adapter contamination and other
*                      overrepresented sequences.
*/

It would also be good to be able to generate automated docs for the types of objects that are required as input: and output: for each modules, the script: section and any other information that may be useful. @sven suggested we may be able to get this by directly by plugging into NF.

This is all still open for discussion so please chime in if you have some ideas.

Add Editor Config lint back in

I commented out the EditorConfig linting in 082c582, but it would be good to fix it and add it back in, possibly in one go with getting all of the tests passing again.

Module file versioning

We need to come up with a way to version each module or at least be able to use a particular version of a module within the main pipeline script. Through previous discussions we have somewhat agreed that we need to be able to do this via git commit as we are able to do with nf-core/configs. Whether we are able to do this at the level of individual module files or a commit id for the entire nf-core/modules repo is still up for discussion.
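Pinning by git commit can be sketched with a throwaway local repo standing in for nf-core/modules. The `params.modules_commit` key is made up for illustration; the lint step checks that the hash recorded in the pipeline config actually exists in the modules repo:

```shell
# Sketch: pin a modules repo to a commit and lint that the recorded hash
# is a real commit (repo and config key name are illustrative).
set -eu
cd "$(mktemp -d)"
git init -q modules
git -C modules -c user.name=d -c user.email=d@e commit -q --allow-empty -m v1
pinned=$(git -C modules rev-parse HEAD)

# A pipeline would record this, e.g. in its nextflow.config
echo "params.modules_commit = '$pinned'" > nextflow.config

# Lint: extract the recorded hash and check it resolves to a commit
recorded=$(sed -n "s/.*'\(.*\)'.*/\1/p" nextflow.config)
git -C modules cat-file -e "$recorded^{commit}" && echo "pinned commit is valid"
```

The same check generalises whether the hash pins the whole repo or (if we ever support it) individual module files.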

Configure Homer reproducibly and efficiently

This is just a placeholder for a future discussion. I'm working on adding some Homer modules. The problem is the way configuration occurs in the currently used Dockerfile, combined with the container filesystem being read-only.

https://hub.docker.com/r/dennishazelett/homer

Here's a documented example of how they create various genomes off a base docker file.

So far I have

    perl /usr/local/share/homer-4.11-2/configureHomer.pl \\
        -install $genome \\
        -keepScript

This runs, but I'm not able to take the /usr/local/share/homer-4.11-2/ directory and use it as an output.
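One possible workaround, sketched here with dummy files standing in for the Homer install (nothing below is the real Homer layout): copy the read-only install tree into the writable task work directory, configure the copy, and emit the copy as the output.

```shell
# Workaround sketch: the container's install tree is read-only, so copy it
# into the writable work dir and configure the copy (dummy stand-in files).
set -eu
cd "$(mktemp -d)"
mkdir -p readonly/homer-4.11-2
touch readonly/homer-4.11-2/configureHomer.pl
chmod -R a-w readonly                  # simulate the read-only container path

cp -r readonly/homer-4.11-2 ./homer    # writable copy in the task work dir
chmod -R u+w ./homer
touch ./homer/genome.installed          # stands in for configureHomer.pl -install
echo "configured copy can be emitted as a process output"
```

The process would then declare `path "homer"` (or similar) as its output, making the configured genome reusable downstream.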

Add t-coffee module

Write a tcoffee module:

  • Create the module for tcoffee itself.
  • Since there isn't any dataset available for testing multiple sequence alignment, include a dataset on the modules branch of nf-core/test-datasets

Add codeowners

To get pinged and have actual owners who keep up with the software that the modules are taking advantage of.

Use --seed parameters for aligners / other tools wherever possible

Given that we now test for identical outputs from a given module in order to detect changes when the module is updated, it would be good to account for tools whose output varies between runs, for example aligners that break ties at random. This will rightly break the CI tests, but one way around it is to use --seed parameters where available, e.g. Bowtie2.

The implementation should be as simple as passing the appropriate optional argument to the tool in the main.nf script for the tests.

Also see #143 (comment)
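The aligner itself can't run here, so awk's seeded RNG stands in for a tool with a `--seed` flag (Bowtie2 does have `--seed <int>`); the point is only that fixing the seed makes the "random" output, and therefore its checksum, identical across CI runs:

```shell
# Same seed, same pseudo-random sequence -> stable output across runs.
# awk's srand() is a stand-in for an aligner's --seed parameter.
run_one=$(awk 'BEGIN { srand(1); for (i = 0; i < 5; i++) printf "%.6f\n", rand() }')
run_two=$(awk 'BEGIN { srand(1); for (i = 0; i < 5; i++) printf "%.6f\n", rand() }')
[ "$run_one" = "$run_two" ] && echo "seeded runs identical"
```

In a module test, this amounts to adding the optional seed argument to the command in main.nf and keeping the expected md5 in the test config.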

File structure

Suggested during discussion at the Stockholm hackathon about potential repository organisation:

.
├── .github
│   └── workflows
│       └── test-processes.yml
├── README.md
├── nf-core
└── tools
    ├── bwa
    │   └── mem
    │       ├── main.nf
    │       ├── meta.yml
    │       └── test-action.yml
    ├── fastqc
    │   ├── main.nf
    │   ├── meta.yml
    │   └── test-action.yml
    └── samtools
        ├── index
        │   ├── main.nf
        │   ├── meta.yml
        │   └── test-action.yml
        └── sort
            ├── main.nf
            ├── meta.yml
            └── test-action.yml
  • Have a directory for every tool
  • Have subdirectories for every subcommand
  • Have a yaml meta file with descriptions of the process
  • .github/workflows/test-processes.yml will have a step for each process tool.
    • Each step can use path to only run when those files are changed (docs)
    • Each step can reference the test-action.yml file held in the process subdirectory with uses (docs)
    • Need to lint that .github/workflows/test-processes.yml has a step for every process

  • QUESTION: commands that can be run in very different ways?
    • Should we have a different subdirectory for commands that can be run in a very different manner?
  • QUESTION: What happens with variable numbers of inputs and outputs? cf. #6

Test how variable numbers of inputs and outputs work

Need to look in to how Nextflow DSL2 handles variable numbers of inputs or outputs.

For example - TrimGalore! can optionally save untrimmed reads. If that is enabled, we will have an additional output channel. How do pipelines handle this?

Write custom test for checking contents of BAM file.

https://pytest-workflow.readthedocs.io/en/stable/#writing-custom-tests

Bowtie and Bowtie2 include the run commands in the BAM header, and these will never be the same across environments, so the md5 hash will never be equal across different containers.

$ samtools view -H test.bam
## Singularity
@HD     VN:1.0  SO:unsorted
@SQ     SN:gi|170079663|ref|NC_010473.1|        LN:4686137
@PG     ID:bowtie2      PN:bowtie2      VN:2.4.2        CL:"/usr/local/bin/bowtie2-align-s --wrapper basic-0 -x ./bowtie2/NC_010473 --threads 1 -1 test_R1.fastq.gz -2 test_R2.fastq.gz"
@PG     ID:samtools     PN:samtools     PP:bowtie2      VN:1.11 CL:samtools view -@ 1 -bhS -o test.bam -
## Conda
@HD     VN:1.0  SO:unsorted
@SQ     SN:gi|170079663|ref|NC_010473.1|        LN:4686137
@PG     ID:bowtie2      PN:bowtie2      VN:2.4.2        CL:"/tmp/pytest_workflow_4fbqrxe4/Run_bowtie2_index_and_align_paired-end/work/conda/env-10b78180015f409ae983f51f20f43c6a/bin/bowtie2-align-s --wrapper basic-0 -x ./bowtie2/NC_010473 --threads 1 -1 test_R1.fastq.gz -2 test_R2.fastq.gz"
@PG     ID:samtools     PN:samtools     PP:bowtie2      VN:1.11 CL:samtools view -@ 1 -bhS -o test.bam -
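A custom check along these lines could drop the @PG lines (whose CL: fields embed environment-specific paths) before comparing headers. A sketch using two fabricated headers mimicking the Singularity/Conda pair above:

```shell
# Compare BAM headers across environments after stripping @PG lines, whose
# CL: fields embed environment-specific paths (fabricated example headers).
set -eu
cd "$(mktemp -d)"
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:NC_010473.1\tLN:4686137\n@PG\tID:bowtie2\tCL:"/usr/local/bin/bowtie2-align-s ..."\n' > singularity_header.txt
printf '@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:NC_010473.1\tLN:4686137\n@PG\tID:bowtie2\tCL:"/tmp/pytest_workflow/.../bowtie2-align-s ..."\n' > conda_header.txt

grep -v '^@PG' singularity_header.txt > a.txt
grep -v '^@PG' conda_header.txt       > b.txt
diff -q a.txt b.txt && echo "headers match once @PG is stripped"
```

On a real BAM the input would come from `samtools view -H test.bam | grep -v '^@PG'`, and the filtered header (or its md5) is what the custom pytest-workflow test would assert on.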
