nextflow-io / patterns

A curated collection of Nextflow implementation patterns

Home Page: http://nextflow-io.github.io/patterns/

License: MIT License

Shell 0.88% Nextflow 99.12%

nextflow

patterns's Introduction

Nextflow Patterns


Basic patterns

Scatter executions

Gather results

Organize outputs

Other

Advanced patterns


patterns's Issues

Example of aggregating file with associated metadata

Currently, there is no example of aggregating files AND associated metadata. For instance, in many/most nf-core pipelines the process outputs are something like:

output:
tuple val(meta), path("file.txt")

...but what if one wants to then aggregate all of the file.txt outputs into one table AND include the meta metadata in that output table?

As far as I can tell from scouring the nextflow slack channel, one must "embed" the metadata in the file paths and then parse the file paths in the aggregation step. For example:

Per-file process:

output:
tuple val(meta), path("${meta}.txt")

Aggregation process:

input:
path("*")

script:
"""
[somehow parse {meta} from input file path] 
"""

Is there a better way, especially given the substantial limitations of trying to embed metadata into a file path (e.g., dealing with multiple values and special characters in the metadata values)?

I'm sure a lot of pipeline developers would like a best-practices example of how to deal with this situation (without having to decipher how meta is dealt with in aggregation steps of nf-core pipelines).
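One possible way to avoid encoding metadata in file names is to keep meta in the channel and write it out as columns of a manifest with the collectFile operator. The following is a minimal DSL2 sketch, not an official pattern: the PER_SAMPLE process, the id and group keys of meta, and the results directory are all hypothetical.

```nextflow
// Hypothetical per-sample process emitting the usual (meta, file) tuple
process PER_SAMPLE {
    input:
    val meta

    output:
    tuple val(meta), path("${meta.id}.txt")

    script:
    """
    echo "result for ${meta.id}" > ${meta.id}.txt
    """
}

workflow {
    // Assumption: meta is a map with 'id' and 'group' keys
    def samples_ch = Channel.of([id: 'A', group: 'ctl'], [id: 'B', group: 'trt'])

    PER_SAMPLE(samples_ch)

    // One manifest row per output file: metadata columns plus the file name
    PER_SAMPLE.out
        .map { meta, f -> "${meta.id}\t${meta.group}\t${f.name}\n" }
        .collectFile(name: 'manifest.tsv', seed: "id\tgroup\tfile\n", storeDir: 'results')
        .view()
}
```

The manifest can then be given to an aggregation process as one path input, next to the collected files themselves, so that metadata is looked up by column instead of being parsed out of file names. This assumes the output file names are unique across samples.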

Blast example stuck

I'm trying to run blast.nf on my workstation. blast_result and top_hits have been generated, but then the run gets stuck. What could be the reason?

nextflow run examples/blast.nf -with-docker -with-report -with-timeline
N E X T F L O W ~ version 0.27.0
Launching examples/blast.nf [small_nightingale] - revision: 7b4b740be4
[warm up] executor > local
[94/2ab84a] Submitted process > blast (1)

nextflow info
Version: 0.27.0 build 4751
Modified: 09-01-2018 10:18 UTC (05:18 EDT)
System: Linux 3.10.0-514.10.2.el7.x86_64
Runtime: Groovy 2.4.13 on OpenJDK 64-Bit Server VM 1.8.0_121-b13
Encoding: UTF-8 (UTF-8)

blast-parallel.nf adding makeblastdb process

Hi,
Would you be able to add makeblastdb process to blast-parallel.nf?

genomes = Channel.fromPath(params.genomes)

process formatBlastDatabases {

  storeDir '/db/genomes'

  input:
  file species from genomes

  output:
  file "${dbName}.*" into blastDb

  script:
  dbName = species.baseName
  """
  makeblastdb -dbtype nucl -in ${species} -out ${dbName}
  """
}

Thank you in advance.

Michal

Port the patterns to DSL2

I noticed that the patterns (i) are still written in DSL1, (ii) use deprecated features like set value, and (iii) do not make use of newer helpful features such as stub for quick iterations. It might benefit the community to move the patterns to DSL2 to facilitate the transition.

What do you think?
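To illustrate what such a port could look like, here is a sketch of a DSL1-style makeblastdb process rewritten for DSL2 with a stub block added. The params.genomes glob and the stub file names are assumptions made for illustration.

```nextflow
params.genomes = "$projectDir/data/*.fa"   // hypothetical glob of genome FASTA files

process formatBlastDatabases {
    storeDir '/db/genomes'

    input:
    path species                     // DSL2: no 'from' clause

    output:
    path "${species.baseName}.*"     // DSL2: no 'into' clause

    script:
    """
    makeblastdb -dbtype nucl -in ${species} -out ${species.baseName}
    """

    stub:
    // Used instead of the script when running with -stub-run:
    // creates empty placeholder files so the wiring can be tested quickly
    """
    touch ${species.baseName}.nhr ${species.baseName}.nin ${species.baseName}.nsq
    """
}

workflow {
    formatBlastDatabases(Channel.fromPath(params.genomes))
    formatBlastDatabases.out.view()
}
```

Running with the -stub-run option executes the stub block in place of the real command, which is the quick-iteration use case mentioned above.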

Multiple optional inputs

The optional input pattern does not seem to work if a process has more than one optional input.

For example, the following test:

params.inputs = "$projectDir/data/sites.txt"
params.filter = "$projectDir/assets/NO_FILE"

process foo {
  debug true
  input:
  path seq
  path(opt)
  path(opt2)

  script:
  def filter = opt.name != 'NO_FILE' ? "--filter $opt" : ''
  def filter2 = opt2.name != 'NO_FILE' ? "--filter $opt2" : ''
  """
  echo your_command --input $seq $filter $filter2
  """
}

workflow {
  prots_ch = Channel.fromPath(params.inputs, checkIfExists:true)
  opt_file = file(params.filter, checkIfExists:true)
  opt2_file = file(params.filter, checkIfExists:true)

  foo(prots_ch, opt_file, opt2_file)
}

Returns a collision error:

ERROR ~ Error executing process > 'foo (1)'

Caused by:
  Process `foo` input file name collision -- There are multiple input files for each of the following file names: NO_FILE


Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

How should one generalize this pattern to handle multiple optional inputs?
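One workaround that is commonly seen in DSL2 pipelines is to pass an empty list instead of a placeholder file, so that no two inputs share a file name. The sketch below assumes two hypothetical parameters, params.filter and params.filter2, that are unset by default; the --filter2 flag belongs to the placeholder command and is made up.

```nextflow
params.inputs  = "$projectDir/data/sites.txt"
params.filter  = null   // hypothetical optional file, unset by default
params.filter2 = null   // hypothetical second optional file

process foo {
    debug true

    input:
    path seq
    path opt
    path opt2

    script:
    // An empty list is falsy in Groovy, so a missing input yields an empty argument
    def filter  = opt  ? "--filter $opt"   : ''
    def filter2 = opt2 ? "--filter2 $opt2" : ''
    """
    echo your_command --input $seq $filter $filter2
    """
}

workflow {
    def prots_ch  = Channel.fromPath(params.inputs, checkIfExists: true)
    def opt_file  = params.filter  ? file(params.filter,  checkIfExists: true) : []
    def opt2_file = params.filter2 ? file(params.filter2, checkIfExists: true) : []

    foo(prots_ch, opt_file, opt2_file)
}
```

If the placeholder-file approach is preferred, giving each optional input its own placeholder name (for example NO_FILE and NO_FILE2, both hypothetical) should avoid the collision as well, since the error is about two staged files having the same name.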

Add example for handling commands that return non-zero exit code normally

See nextflow-io/nextflow#2992
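As an illustration of the kind of example being requested (it may differ from what the linked issue proposes), grep exits with status 1 when nothing matches, which Nextflow would treat as a task failure. A minimal sketch, assuming hypothetical params.input and params.pattern values, that accepts exit status 0 or 1 but still fails on real errors:

```nextflow
params.input   = "$projectDir/data/sites.txt"   // hypothetical input file
params.pattern = 'chr1'                          // hypothetical search pattern

process count_matches {
    debug true

    input:
    path infile

    script:
    """
    # grep exits 1 when nothing matches: accept 0 and 1, fail on anything else
    grep -c '${params.pattern}' $infile || [ \$? -eq 1 ]
    """
}

workflow {
    count_matches(Channel.fromPath(params.input, checkIfExists: true))
}
```

The check lives in the script itself, so only the known "normal" non-zero status is tolerated; an exit status of 2 (for example an unreadable file) still fails the task.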

Multiple input and multiple output

Hi, I am a new user of Nextflow. I have test1.bed, test1.bim and test1.fam files, and I want to run some QC on them using plink in Nextflow. How can I import those three files and get outputs test2.bed, test2.bim, test2.fam, which are then used in the next step to generate test3.bed, test3.bim, test3.fam, and so on? I also need to save all the intermediate files in a directory. Could anyone help with an example?
Kind regards, Zillur
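A possible starting point, sketched under several assumptions: the three files live under a hypothetical data directory, the QC thresholds are arbitrary, and the outputs are named by appending a suffix to the prefix rather than test2/test3. Channel.fromFilePairs with size: 3 groups the .bed/.bim/.fam files under their common prefix, and publishDir copies each step's outputs to a results directory.

```nextflow
params.bfiles = "$projectDir/data/test1.{bed,bim,fam}"   // hypothetical location
params.outdir = 'results'

process qc_geno {
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(prefix), path(bfiles)

    output:
    tuple val("${prefix}.geno"), path("${prefix}.geno.{bed,bim,fam}")

    script:
    """
    plink --bfile $prefix --geno 0.05 --make-bed --out ${prefix}.geno
    """
}

process qc_maf {
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(prefix), path(bfiles)

    output:
    tuple val("${prefix}.maf"), path("${prefix}.maf.{bed,bim,fam}")

    script:
    """
    plink --bfile $prefix --maf 0.01 --make-bed --out ${prefix}.maf
    """
}

workflow {
    // Emits one tuple: (common prefix, [bed, bim, fam])
    def bfiles_ch = Channel.fromFilePairs(params.bfiles, size: 3, checkIfExists: true)

    qc_geno(bfiles_ch)
    qc_maf(qc_geno.out)
}
```

Each process emits the new prefix together with its three files, so further steps can be chained the same way.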

Change example code in `conditional-process.nf`

I think the correct code example should be

$ nextflow run patterns/conditional-process.nf --flag

see here

Example of conditional execution based on channel output

The conditional process example is a great example, but it only covers a conditional based on a pre-set param value (params.flag in the example), and does not cover dynamic conditionals based on process/workflow output. For example, one may want to run Sub-workflow1 if Process1 generates non-empty files, while Sub-workflow2 is run if the files are all empty.

Code like the following does not work:

  if( MY_PROCESS.out.map{ it.size() }.sum() == 0 ){
    ch_out = WORKFLOW1()
  } else {
    ch_out = WORKFLOW2()
  }

...since MY_PROCESS.out.map{ it.size() }.sum() is not considered an integer that can be compared to 0. So how can one handle dynamic flow control in Nextflow, based on process/workflow output?
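Because channel contents are only known while the pipeline is running, a Groovy if statement in the workflow body cannot inspect them. One approach that may fit is to collect the outputs and route them with the branch operator, so that only one of the two sub-workflows receives any input. In the sketch below MY_PROCESS, WORKFLOW1 and WORKFLOW2 are stand-ins written for illustration.

```nextflow
process MY_PROCESS {
    input:
    val x

    output:
    path "out_${x}.txt"

    script:
    """
    touch out_${x}.txt
    """
}

workflow WORKFLOW1 {
    take:
    files

    main:
    files.view { "WORKFLOW1 (all outputs empty): $it" }
}

workflow WORKFLOW2 {
    take:
    files

    main:
    files.view { "WORKFLOW2 (some output non-empty): $it" }
}

workflow {
    MY_PROCESS(Channel.of(1, 2, 3))

    // Gather every output into a single list, then send that list down one branch
    MY_PROCESS.out
        .collect()
        .branch { files ->
            all_empty: files.every { it.size() == 0 }
            non_empty: true
        }
        .set { result }

    // The branch that receives nothing stays idle: its sub-workflow never fires
    WORKFLOW1(result.all_empty)
    WORKFLOW2(result.non_empty)
}
```

Both sub-workflows are declared in the workflow, but the one connected to the empty branch has nothing to process, which gives the effect of a dynamic conditional.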

Add example how parse json file

More here

https://groups.google.com/d/msg/nextflow/qzsORfO5CFU/pYh-tEWXAgAJ
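As a rough idea of what such an example could contain (not necessarily what the linked thread describes), Groovy's JsonSlurper can be called from a Nextflow script. The sketch assumes a hypothetical config.json with a top-level "samples" array.

```nextflow
params.config = "$projectDir/config.json"   // hypothetical JSON file

workflow {
    Channel.fromPath(params.config, checkIfExists: true)
        // Parse the whole file into Groovy maps and lists
        .map { json_file -> new groovy.json.JsonSlurper().parseText(json_file.text) }
        // Assumption: the JSON has a top-level "samples" array; emit one item per entry
        .flatMap { cfg -> cfg.samples }
        .view { sample -> "sample: ${sample}" }
}
```

Each emitted item is an ordinary Groovy map or value, so it can be fed to a process as a val input.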

Add combinations pattern

Channel.from([['A', 10], ['B', 8], ['C', 5], ['D', 4]])
  .toList().map{ [it, it].combinations().findAll{ a, b -> a[1] < b[1]} }
  .flatMap()
  .view()

Add pattern for optional execution using `until` operator

See example from @micans

ch_fastqs_cram
.mix(ch_fastqs_dir)
.into{ ch_rnaseq; ch_fastqc; ch_mixcr }

ch_rnaseq
.until{ skip_align_step }
.into { ch_star; ch_hisat2; ch_salmon }

Parsing from initial run argument into process scripts

How do I pass an argument from the launch command into one of my process scripts that is written in another language, like Python?

Initiation cmd:
nextflow run /Path/to/myscript.nf --in '/Path/to/MyData'

process dataDirectories {

"""
#!/usr/bin/env python2.7

import os
import getpass

currentUser=getpass.getuser()

dataPath="/home/" + currentUser + "/WGS_Data/" + MyData
resultsPath="/home/" + currentUser + "/WGS_Results/" + MyData

try:
	os.makedirs(dataPath, 0o777)    
	os.makedirs(resultsPath, 0o777)

except:
	pass
"""

}

I would like to get the string 'MyData' from the path (which is in my '--in' argument) into my python script.
Can this be done?
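This should be possible through the params scope: an option given as --in on the command line is available as params.in, and its value can be interpolated into the script block before the interpreter runs. A sketch, with python3 swapped in for python2.7 and the directory layout taken from the question:

```nextflow
params.in = '/Path/to/MyData'   // overridden by --in on the command line

process dataDirectories {
    debug true

    input:
    val data_name

    script:
    """
    #!/usr/bin/env python3
    import getpass
    import os

    current_user = getpass.getuser()

    # Nextflow substitutes the directory name into the strings below
    data_path = os.path.join("/home", current_user, "WGS_Data", "${data_name}")
    results_path = os.path.join("/home", current_user, "WGS_Results", "${data_name}")

    for path in (data_path, results_path):
        os.makedirs(path, mode=0o777, exist_ok=True)
    """
}

workflow {
    // file(...).name keeps only the last path component, e.g. 'MyData'
    dataDirectories(file(params.in).name)
}
```

Since the script block is a Groovy string, any literal dollar sign needed by the Python code would have to be escaped with a backslash.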

Groovy syntax on GitHub

You can get GitHub to play nicely with the NextFlow scripts file extensions with these two tricks:

Add the following to the top of every .nf script: (does syntax highlighting)

vim: syntax=groovy
-*- mode: groovy;-*-

Create a file called .gitattributes with the following: (changes the coloured bar at the top of the repo to say 100% Groovy instead of 100% Shell)

*.nf linguist-language=Groovy

Hope this helps! It was annoying me on our NextFlow repos 😉

Merge patterns into main Nextflow docs

The Nextflow patterns provide a lot of value to users at every level. I think they would have more visibility if they were part of the main Nextflow docs, since that seems to be the starting point for most people.

What do you think @pditommaso

Example for optional input in tuple?

I am trying to run a configuration where the input is a tuple of paths, some of which are optional. The pattern in this repository works for separate path inputs (or, so says the author), but extending it to my use case results in the error Not a valid path value: NO_FILE.

In this simple example demonstrating the issue, the input is a CSV file defining RNA-seq sample names, forward and reverse read fastqs, and a STAR genome index to align them to. The reverse read is optional.

My question to the community is how I can work around this issue.

Repository setup

nextflow.config

params {
  manifest = null  // csv file name,R1,R2?,index
  outdir = "outs"  // save output bams
}
profiles {
  conda {
    conda.enabled = true
    process.conda = "star samtools"
  }
}

main.nf

process MyProcess {
  publishDir outdir, mode: "copy"
  input:
    tuple val(name), path(R1), path(R2), path(index)
    path outdir
  output:
    path "${name}_Aligned.out.sortedByCoord.bam"
    path "${name}_Aligned.out.sortedByCoord.bam.bai"
  script:
    R2_arg = R2.name == "NO_FILE" ? "" : R2
"""
STAR --readFilesIn $R1 $R2_arg --readFilesCommand gunzip -c \
     --genomeDir $index --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix ${name}_
samtools index ${name}_Aligned.out.sortedByCoord.bam
"""
}

workflow {
  MyProcess(
    file(manifest).read().splitCsv(header: ["name", "R1", "R2", "index"]).map{it.R2 = it.R2 ?: "NO_FILE"},
    params.outdir
  )
}

Run

nextflow run main.nf [-profile conda] --manifest path/to/manifest.csv
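A possible fix, offered as a sketch rather than a confirmed answer: the error probably arises because the manifest values reach the process as plain strings, and because the map closure returns the result of the assignment instead of the row. Converting each column with file() and using an empty list for a missing R2 may work around both problems. The process and parameter names follow the example above.

```nextflow
process MyProcess {
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(name), path(R1), path(R2), path(index)

    output:
    path "${name}_Aligned.sortedByCoord.out.bam"
    path "${name}_Aligned.sortedByCoord.out.bam.bai"

    script:
    // R2 is an empty list when the manifest has no reverse read
    def R2_arg = R2 ? "$R2" : ''
    """
    STAR --readFilesIn $R1 $R2_arg --readFilesCommand gunzip -c \
         --genomeDir $index --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix ${name}_
    samtools index ${name}_Aligned.sortedByCoord.out.bam
    """
}

workflow {
    Channel.fromPath(params.manifest, checkIfExists: true)
        .splitCsv(header: ['name', 'R1', 'R2', 'index'])
        .map { row ->
            // Turn strings into file objects; a missing R2 becomes an empty list
            def r2 = row.R2 ? file(row.R2, checkIfExists: true) : []
            tuple(row.name, file(row.R1, checkIfExists: true), r2, file(row.index, checkIfExists: true))
        }
        .set { samples_ch }

    MyProcess(samples_ch)
}
```

Note that this sketch uses the file name STAR normally writes for sorted BAM output (Aligned.sortedByCoord.out.bam); adjust it if your STAR version names the file differently.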

manipulating variable outside of scripts

My title may be slightly misleading; however, bear with me.

I have a process iterate_list. Process iterate_list takes a list and does something on each item in the list. When running the script, it takes two inputs. The list and the item it needs to process (which it gets as a consumer from a rabbitmq queue)

Currently, I give a python script the entire list, and it iterates over each one does the processing (as one big chunk) and returns after completion. This is fine, however, if the system restarts, it starts all over again.

I was wondering, how can I make it so that every time my python script processes a single item, it returns the item, I remove it from the list, and then pass in the new list to the process. So in case of a system restart/crash, nextflow knows where it left off and can continue from there.

import groovy.json.JsonSlurper

 def jsonSlurper = new JsonSlurper()
 def cfg_file = new File('/config.json')
 def analysis_config = jsonSlurper.parse(cfg_file)
 def cfg_json = cfg_file.getText()
 def list_of_items_to_process = [] 

 items = Channel.from(analysis_config.items.keySet())

 for (String item : items) {
     list_of_items_to_process << item
     } 

 process iterate_list{
     echo true

     input:
     list_of_items_to_process

     output:
     val 1 into typing_cur

     script:
     """
     python3.7 process_list_items.py ${my_queue} \'${list_of_items_to_process}\'
     """ 
 }

 process signal_completion{

     echo true

     input:
     val typing_cur

     script:
     """
     echo "all done!"
     """
 }

Basically, the process "iterate_list" takes one "item" from a queue in the message broker. Process iterate_list should look something like:

    process iterate_list{
        echo true

        input:
        list_of_items_to_process

        output:
        val 1 into typing_cur

        script:
        """
        python3.7 process_list_items.py ${my_queue} \'${list_of_items_to_process}\'
        list_of_items_to_process.remove(<output from python script>)
        """
    }

And so for each one, it should run, remove the item it just processed, and restart with a new list.

    initial_list = [1,2,3,4]
    after_first_process_completes = [2,3,4]
    and_eventually = [] <- This is when it should move on to the next process.

Excuse the indents, SO wasn't letting me post the code without indents.
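Rather than mutating a list between runs, one option is to emit each item as its own task, so that Nextflow's task cache records which items have finished and a relaunch with -resume skips them. The sketch below assumes the Python script can handle a single item per call; params.queue and params.config are hypothetical, and process_list_items.py is assumed to be reachable from the task directory.

```nextflow
params.config = '/config.json'
params.queue  = 'my_queue'   // hypothetical queue name

process process_item {
    maxForks 1               // optional: handle one item at a time

    input:
    val item

    output:
    val item

    script:
    """
    python3 process_list_items.py ${params.queue} '${item}'
    """
}

process signal_completion {
    debug true

    input:
    val done_items

    script:
    """
    echo "all done!"
    """
}

workflow {
    def cfg = new groovy.json.JsonSlurper().parseText(file(params.config).text)

    // One channel item (and therefore one cached task) per key in "items"
    def items_ch = Channel.fromList(cfg.items.keySet().toList())

    process_item(items_ch)

    // collect() waits for every item before the final step runs once
    signal_completion(process_item.out.collect())
}
```

After a crash, relaunching the same command with -resume should re-run only the items whose tasks did not complete.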

Dockerfile refers to non-existing file

bin/AMPA.pl does not exist in this examples repository, and $ docker build . fails.

$ docker build .
Sending build context to Docker daemon 6.546 MB
Step 0 : FROM pditommaso/dkrbase:1.1
 ---> ae4cb2b803ba
Step 1 : MAINTAINER Paolo Di Tommaso <[email protected]>
 ---> Using cache
 ---> 7f956c07387e
Step 2 : RUN apt-get install -q -y gnuplot python && apt-get clean
 ---> Using cache
 ---> 730aeb7ec1b6
Step 3 : RUN cpanm Math::CDF Math::Round &&   rm -rf /root/.cpanm/work/
 ---> Using cache
 ---> 1d85dd9a180e
Step 4 : RUN wget -q ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.29/ncbi-blast-2.2.29+-x64-linux.tar.gz &&     tar xf ncbi-blast-2.2.29+-x64-linux.tar.gz &&     mv ncbi-blast-2.2.29+ /opt/ &&     rm -rf ncbi-blast-2.2.29+-x64-linux.tar.gz &&     ln -s /opt/ncbi-blast-2.2.29+/ /opt/blast
 ---> Using cache
 ---> c8bfe75956a3
Step 5 : RUN wget -q http://tcoffee.org/Packages/Stable/Version_11.00.8cbe486/linux/T-COFFEE_installer_Version_11.00.8cbe486_linux_x64.tar.gz &&   tar xf T-COFFEE_installer_Version_11.00.8cbe486_linux_x64.tar.gz -C /opt &&   mv /opt/T-COFFEE_installer_Version_11.00.8cbe486_linux_x64 /opt/tcoffee &&   rm -rf /opt/tcoffee/plugins/linux/*  &&   rm T-COFFEE_installer_Version_11.00.8cbe486_linux_x64.tar.gz
 ---> Using cache
 ---> 5381d47ca2b9
Step 6 : ADD bin/AMPA.pl /usr/local/bin/
bin/AMPA.pl: no such file or directory

Temporary resolution: comment out in Dockerfile (?) or add the script found in another repository nextflow-io/tests/bin/AMPA.pl.

Logo is not aligned

The logo on the main page is not aligned with the header separator.

Typo for "Process when empty"?

Current example for process-when-empty:

params.inputs = ''

process foo {
  debug true  
  input:
  val x
  when:
  x ## 'EMPTY'

  script:
  '''
  echo hello
  ''' 
}

workflow {
  reads_ch = params.inputs
    ? Channel.fromPath(params.inputs, checkIfExists:true)
    : Channel.empty()

  reads_ch \
    | ifEmpty { 'EMPTY' } \
    | foo
}

I'm guessing that x ## 'EMPTY' should be x == 'EMPTY'