nextflow-io / patterns
A curated collection of Nextflow implementation patterns
Home Page: http://nextflow-io.github.io/patterns/
License: MIT License
I noticed that the patterns (i) are still in DSL1, (ii) use deprecated features like set value, and (iii) do not make use of newer helpful features such as stub for quick iterations. It might benefit the overall community to move the patterns to DSL2 to facilitate the transition.
What do you think?
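As a rough illustration of what a migrated pattern could look like, here is a minimal DSL2 sketch with a stub block. The process name ALIGN, the command real_aligner, and the sample names are hypothetical and only serve to show the shape.

```groovy
// Sketch only: ALIGN, real_aligner and the sample names are hypothetical.
process ALIGN {
    input:
    val sample

    output:
    path "${sample}.bam"

    // Real command, used in a normal run
    script:
    """
    real_aligner --sample ${sample} > ${sample}.bam
    """

    // Cheap placeholder, used instead of the script block in a stub run
    stub:
    """
    touch ${sample}.bam
    """
}

workflow {
    ALIGN(Channel.of('s1', 's2'))
    ALIGN.out.view()
}
```

Running it with nextflow run main.nf -stub-run executes the stub block instead of the script block, which is what makes quick iterations on the workflow wiring possible.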
The conditional process example is a great example, but it only covers a conditional based on a pre-set param value (params.flag in the example), and does not cover dynamic conditionals based on process/workflow output. For example, one may want to run Sub-workflow1 if Process1 generates non-empty files, while Sub-workflow2 is run if the files are all empty.
Code like the following does not work:
if( MY_PROCESS.out.map{ it.size() }.sum() == 0 ){
    ch_out = WORKFLOW1()
} else {
    ch_out = WORKFLOW2()
}
...since MY_PROCESS.out.map{ it.size() }.sum() is not considered an integer that can be compared to 0. So how can one handle dynamic flow control in Nextflow, based on process/workflow output?
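One possible approach, sketched under assumptions, is to keep the decision inside the dataflow: collect the outputs, route the collected list with the branch operator, and call both sub-workflows, each on its own branch. The branch that receives nothing simply does not run. The process and workflow bodies below are placeholders for the names used in the question, and the branch labels all_empty and has_data are made up for this sketch.

```groovy
// Sketch only: process and workflow bodies are placeholders.
process MY_PROCESS {
    output:
    path 'out_*.txt'

    script:
    """
    touch out_1.txt out_2.txt
    """
}

workflow WORKFLOW1 {
    take:
    files

    main:
    result = files.map { list -> "WORKFLOW1: all ${list.size()} files are empty" }

    emit:
    result
}

workflow WORKFLOW2 {
    take:
    files

    main:
    result = files.map { list -> "WORKFLOW2: some of the ${list.size()} files have content" }

    emit:
    result
}

workflow {
    MY_PROCESS()

    // Gather every output file into one list, then route that list.
    MY_PROCESS.out
        .collect()
        .branch { files ->
            all_empty: files.every { f -> f.size() == 0 }  // f.size() is the file size in bytes
            has_data: true                                 // fallback branch
        }
        .set { routed }

    // Both calls are declared; only the one whose input channel receives the list does any work.
    WORKFLOW1(routed.all_empty)
    WORKFLOW2(routed.has_data)

    ch_out = WORKFLOW1.out.result.mix(WORKFLOW2.out.result)
    ch_out.view()
}
```

The if statement in the question fails because a channel is evaluated asynchronously, so its content is not available as a plain value when the workflow script is parsed. Operators such as branch or filter make the decision when the data actually arrives.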
You can get GitHub to play nicely with the Nextflow script file extension with these two tricks:
Add the following to the top of every .nf script (enables syntax highlighting):
vim: syntax=groovy
-*- mode: groovy;-*-
Create a file called .gitattributes with the following (changes the coloured bar at the top of the repo to say 100% Groovy instead of 100% Shell):
*.nf linguist-language=Groovy
Hope this helps! It was annoying me on our Nextflow repos.
bin/AMPA.pl does not exist in this examples repository, and $ docker build . fails.
$ docker build .
Sending build context to Docker daemon 6.546 MB
Step 0 : FROM pditommaso/dkrbase:1.1
---> ae4cb2b803ba
Step 1 : MAINTAINER Paolo Di Tommaso <[email protected]>
---> Using cache
---> 7f956c07387e
Step 2 : RUN apt-get install -q -y gnuplot python && apt-get clean
---> Using cache
---> 730aeb7ec1b6
Step 3 : RUN cpanm Math::CDF Math::Round && rm -rf /root/.cpanm/work/
---> Using cache
---> 1d85dd9a180e
Step 4 : RUN wget -q ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.29/ncbi-blast-2.2.29+-x64-linux.tar.gz && tar xf ncbi-blast-2.2.29+-x64-linux.tar.gz && mv ncbi-blast-2.2.29+ /opt/ && rm -rf ncbi-blast-2.2.29+-x64-linux.tar.gz && ln -s /opt/ncbi-blast-2.2.29+/ /opt/blast
---> Using cache
---> c8bfe75956a3
Step 5 : RUN wget -q http://tcoffee.org/Packages/Stable/Version_11.00.8cbe486/linux/T-COFFEE_installer_Version_11.00.8cbe486_linux_x64.tar.gz && tar xf T-COFFEE_installer_Version_11.00.8cbe486_linux_x64.tar.gz -C /opt && mv /opt/T-COFFEE_installer_Version_11.00.8cbe486_linux_x64 /opt/tcoffee && rm -rf /opt/tcoffee/plugins/linux/* && rm T-COFFEE_installer_Version_11.00.8cbe486_linux_x64.tar.gz
---> Using cache
---> 5381d47ca2b9
Step 6 : ADD bin/AMPA.pl /usr/local/bin/
bin/AMPA.pl: no such file or directory
Temporary resolution: comment out in Dockerfile (?) or add the script found in another repository nextflow-io/tests/bin/AMPA.pl.
My title may be slightly misleading; however, bear with me.
I have a process iterate_list. Process iterate_list takes a list and does something with each item in the list. When the script runs, it takes two inputs: the list and the item it needs to process (which it gets as a consumer from a RabbitMQ queue).
Currently, I give a Python script the entire list; it iterates over the items, processes them all as one big chunk, and returns after completion. This is fine, but if the system restarts, it starts all over again.
I was wondering how I can make it so that every time my Python script processes a single item, it returns the item, I remove it from the list, and then pass the new list to the process. That way, in case of a system restart or crash, Nextflow knows where it left off and can continue from there.
import groovy.json.JsonSlurper
def jsonSlurper = new JsonSlurper()
def cfg_file = new File('/config.json')
def analysis_config = jsonSlurper.parse(cfg_file)
def cfg_json = cfg_file.getText()
def list_of_items_to_process = []
items = Channel.from(analysis_config.items.keySet())
for (String item : items) {
list_of_items_to_process << item
}
process iterate_list{
echo true
input:
list_of_items_to_process
output:
val 1 into typing_cur
script:
"""
python3.7 process_list_items.py ${my_queue} \'${list_of_items_to_process}\'
"""
}
process signal_completion{
echo true
input:
val typing_cur
script:
"""
echo "all done!"
"""
}
Basically, the process "iterate_list" takes one "item" from a queue in the message broker. Process iterate_list should look something like:
process iterate_list{
echo true
input:
list_of_items_to_process
output:
val 1 into typing_cur
script:
"""
python3.7 process_list_items.py ${my_queue} \'${list_of_items_to_process}\'
list_of_items_to_process.remove(<output from python script>)
"""
}
So for each item, it should run, remove the item it just processed, and restart with the new list.
initial_list = [1,2,3,4]
after_first_process_completes = [2,3,4]
and_eventually = [] <- This is when it should move on to the next process.
Excuse the indents, SO wasn't letting me post the code without indents.
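A hedged alternative to mutating the list: emit each item on a channel so that every item becomes its own task. Nextflow caches completed tasks, so a run restarted with -resume should skip the items that already finished. The sketch below uses DSL2 and assumes that process_list_items.py can be called with a single item and that the queue name comes from a hypothetical params.queue; neither is confirmed by the post.

```groovy
import groovy.json.JsonSlurper

// Sketch only: params.queue and the single-item call of process_list_items.py are assumptions.
params.config = '/config.json'
params.queue  = 'my_queue'

process process_item {
    input:
    val item

    output:
    val item

    script:
    """
    python3.7 process_list_items.py ${params.queue} '${item}'
    """
}

process signal_completion {
    debug true

    input:
    val done_items

    script:
    """
    echo "all done!"
    """
}

workflow {
    def cfg = new JsonSlurper().parse(new File(params.config))

    // One channel item per key, so one cached task per item
    items_ch = Channel.fromList(cfg.items.keySet().toList())

    process_item(items_ch)

    // collect() waits for every item before the final process starts
    signal_completion(process_item.out.collect())
}
```

After a crash, nextflow run main.nf -resume would rerun only the items whose tasks did not complete, which plays the role of the shrinking list.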
The Nextflow patterns provide a lot of value to users at every level. I think they would have more visibility if they were part of the main Nextflow docs, since that seems to be the starting point for most people.
What do you think, @pditommaso?
I am trying to run a configuration where the input is a tuple of paths, some of which are optional. The pattern in this repository works for separate path inputs (or so says the author), but extending it to my use case results in the error Not a valid path value: NO_FILE.
In this simple example demonstrating the issue, the input is a CSV file defining RNA-seq sample names, forward and reverse read fastqs, and a STAR genome index to align them to. The reverse read is optional.
My question to the community is how I can work around this issue.
nextflow.config
params {
manifest = null // csv file name,R1,R2?,index
outdir = "outs" // save output bams
}
profiles {
conda {
conda.enabled = true
process.conda = "star samtools"
}
}
main.nf
process MyProcess {
publishDir outdir, mode: "copy"
input:
tuple val(name), path(R1), path(R2), path(index)
path outdir
output:
path "${name}_Aligned.out.sortedByCoord.bam"
path "${name}_Aligned.out.sortedByCoord.bam.bai"
script:
R2_arg = R2.name == "NO_FILE" ? "" : R2
"""
STAR --readFilesIn $R1 $R2_arg --readFilesCommand gunzip -c \
--genomeDir $index --outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix ${name}_
samtools index ${name}_Aligned.out.sortedByCoord.bam
"""
}
workflow {
MyProcess(
file(manifest).read().splitCsv(header: ["name", "R1", "R2", "index"]).map{it.R2 = it.R2 ?: "NO_FILE"},
params.outdir
)
}
nextflow run main.nf [-profile conda] --manifest path/to/manifest.csv
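A possible workaround, sketched under assumptions: the error is likely raised because the bare string NO_FILE is handed to a path input, so each CSV field is converted with file() and a missing R2 is replaced by a real placeholder file. The location assets/NO_FILE is assumed (an empty file kept next to the pipeline), and the process only echoes the command so that the sketch stays independent of STAR.

```groovy
// Sketch only: assets/NO_FILE is an assumed empty placeholder file in the pipeline repository.
params.manifest = null

process MyProcess {
    debug true

    input:
    tuple val(name), path(R1), path(R2), path(index)

    script:
    def R2_arg = R2.name == 'NO_FILE' ? '' : R2
    """
    echo STAR --readFilesIn $R1 $R2_arg --genomeDir $index --outFileNamePrefix ${name}_
    """
}

workflow {
    no_file = file("$projectDir/assets/NO_FILE")

    rows_ch = Channel.fromPath(params.manifest, checkIfExists: true)
        .splitCsv(header: ['name', 'R1', 'R2', 'index'])
        .map { row ->
            // Turn strings into file objects; fall back to the placeholder when R2 is blank
            def r2 = row.R2 ? file(row.R2) : no_file
            [row.name, file(row.R1), r2, file(row.index)]
        }

    MyProcess(rows_ch)
}
```

Note that the map closure returns the new tuple explicitly. In the original workflow block the closure returns the result of the assignment, so only the R2 value would be passed on.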
I'm trying to run blast.nf on my workstation. blast_result and top_hits have been generated, but then the run gets stuck. Any possible reason?
nextflow run examples/blast.nf -with-docker -with-report -with-timeline
N E X T F L O W ~ version 0.27.0
Launching examples/blast.nf
[small_nightingale] - revision: 7b4b740be4
[warm up] executor > local
[94/2ab84a] Submitted process > blast (1)
nextflow info
Version: 0.27.0 build 4751
Modified: 09-01-2018 10:18 UTC (05:18 EDT)
System: Linux 3.10.0-514.10.2.el7.x86_64
Runtime: Groovy 2.4.13 on OpenJDK 64-Bit Server VM 1.8.0_121-b13
Encoding: UTF-8 (UTF-8)
Hi, I am a new user of Nextflow. I have test1.bed, test1.bim, and test1.fam files. I want to do some QC using plink in Nextflow. How can I import those three files and get the output test2.bed, test2.bim, test2.fam, which will be used in the next step to generate test3.bed, test3.bim, test3.fam, and so on? I also need to save all the intermediate files in a directory. Could anyone help with an example?
Kind regards, Zillur
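One way this could be laid out, as a sketch: pass the three files together with their shared prefix as a tuple, let each step emit the next fileset in the same shape, and use publishDir to keep a copy of every intermediate result. The process names and the plink filter thresholds are hypothetical; only the file names come from the question.

```groovy
// Sketch only: QC_STEP1, QC_STEP2 and the filter thresholds are hypothetical.
params.outdir = 'results'

process QC_STEP1 {
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(prefix), path(plink_files)

    output:
    tuple val('test2'), path('test2.{bed,bim,fam}')

    script:
    """
    plink --bfile ${prefix} --geno 0.05 --make-bed --out test2
    """
}

process QC_STEP2 {
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(prefix), path(plink_files)

    output:
    tuple val('test3'), path('test3.{bed,bim,fam}')

    script:
    """
    plink --bfile ${prefix} --maf 0.01 --make-bed --out test3
    """
}

workflow {
    // One channel item: the prefix plus the three files that belong together
    input_ch = Channel.of(
        ['test1', [file('test1.bed'), file('test1.bim'), file('test1.fam')]]
    )

    QC_STEP1(input_ch)
    QC_STEP2(QC_STEP1.out)
}
```

Because every step has the same input and output shape, further steps can be chained the same way, and each published fileset ends up in the directory given by params.outdir.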
See example from @micans
ch_fastqs_cram
.mix(ch_fastqs_dir)
.into{ ch_rnaseq; ch_fastqc; ch_mixcr }
ch_rnaseq
.until{ skip_align_step }
.into { ch_star; ch_hisat2; ch_salmon }
I think the correct code example should be
$ nextflow run patterns/conditional-process.nf --flag
see here
Currently, there is no example of aggregating files AND associated metadata. For instance, in many/most nf-core pipelines the process outputs are something like:
output:
tuple val(meta), path("file.txt")
...but what if one wants to then aggregate all of the file.txt outputs into one table AND include the meta metadata in that output table?
As far as I can tell from scouring the nextflow slack channel, one must "embed" the metadata in the file paths and then parse the file paths in the aggregation step. For example:
Per-file process:
output:
tuple val(meta), path("${meta}.txt")
Aggregation process:
input:
path("*")
script:
"""
[somehow parse {meta} from input file path]
"""
Is there a better way, especially given the substantial limitations of trying to embed metadata into a file path (eg., dealing with multiple values and special characters in the metadata values)?
I'm sure a lot of pipeline developers would like a best-practices example of how to deal with this situation (without having to decipher how meta is dealt with in aggregation steps of nf-core pipelines).
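For small per-sample outputs, one option that avoids encoding metadata in file names is to build the table rows in a map operator and let collectFile write them to a single file. This is only a sketch: the meta keys id and group, the process PER_SAMPLE, and the output name summary.tsv are all made up for illustration.

```groovy
// Sketch only: the meta keys (id, group), PER_SAMPLE and summary.tsv are hypothetical.
process PER_SAMPLE {
    input:
    val meta

    output:
    tuple val(meta), path('file.txt')

    script:
    """
    echo "${meta.id}_value" > file.txt
    """
}

workflow {
    samples_ch = Channel.of(
        [id: 's1', group: 'ctrl'],
        [id: 's2', group: 'case']
    )

    PER_SAMPLE(samples_ch)

    PER_SAMPLE.out
        // One table row per sample: metadata columns first, then the file content
        .map { meta, f -> "${meta.id}\t${meta.group}\t${f.text.trim()}\n" }
        .collectFile(name: 'summary.tsv', storeDir: 'results', sort: true)
        .view()
}
```

The file content is read by the Nextflow head process here, so this suits short text outputs. For large files, an aggregation process that receives the files plus a generated sample sheet is probably the better fit.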
The optional input pattern does not seem to work if a process has more than one optional input.
For example, the following test:
params.inputs = "$projectDir/data/sites.txt"
params.filter = "$projectDir/assets/NO_FILE"
process foo {
    debug true
    input:
    path seq
    path(opt)
    path(opt2)
    script:
    def filter = opt.name != 'NO_FILE' ? "--filter $opt" : ''
    def filter2 = opt2.name != 'NO_FILE' ? "--filter $opt2" : ''
    """
    echo your_command --input $seq $filter $filter2
    """
}
workflow {
    prots_ch = Channel.fromPath(params.inputs, checkIfExists:true)
    opt_file = file(params.filter, checkIfExists:true)
    opt2_file = file(params.filter, checkIfExists:true)
    foo(prots_ch, opt_file, opt2_file)
}
Returns a collision error:
ERROR ~ Error executing process > 'foo (1)'
Caused by: Process `foo` input file name collision -- There are multiple input files for each of the following file names: NO_FILE
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
How should one generalize this pattern to handle multiple optional inputs?
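One idiom seen in nf-core pipelines that sidesteps the collision is to pass an empty list instead of a placeholder file, so nothing is staged for a missing optional input. The sketch below adapts the test above. The second parameter params.filter2 and the --filter2 flag are hypothetical additions.

```groovy
// Sketch only: params.filter2 and the --filter2 flag are hypothetical.
params.inputs  = "$projectDir/data/sites.txt"
params.filter  = null
params.filter2 = null

process foo {
    debug true

    input:
    path seq
    path opt
    path opt2

    script:
    // An empty list is falsy, so a missing optional input adds no argument
    def filter  = opt  ? "--filter $opt"   : ''
    def filter2 = opt2 ? "--filter2 $opt2" : ''
    """
    echo your_command --input $seq $filter $filter2
    """
}

workflow {
    prots_ch  = Channel.fromPath(params.inputs, checkIfExists: true)
    opt_file  = params.filter  ? file(params.filter,  checkIfExists: true) : []
    opt2_file = params.filter2 ? file(params.filter2, checkIfExists: true) : []

    foo(prots_ch, opt_file, opt2_file)
}
```

If placeholder files are preferred, giving each one a distinct name should also avoid the collision, since the error is triggered by two staged inputs sharing the name NO_FILE. Two real optional files with identical names would still collide in either variant.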
Hi,
Would you be able to add makeblastdb process to blast-parallel.nf?
genomes = Channel.fromPath(params.genomes)
process formatBlastDatabases {
storeDir '/db/genomes'
input:
file species from genomes
output:
file "${dbName}.*" into blastDb
script:
dbName = species.baseName
"""
makeblastdb -dbtype nucl -in ${species} -out ${dbName}
"""
}
Thank you in advance.
Michal
Current example for process-when-empty:
params.inputs = ''
process foo {
    debug true
    input:
    val x
    when:
    x ## 'EMPTY'
    script:
    '''
    echo hello
    '''
}
workflow {
    reads_ch = params.inputs
        ? Channel.fromPath(params.inputs, checkIfExists:true)
        : Channel.empty()
    reads_ch \
        | ifEmpty { 'EMPTY' } \
        | foo
}
I'm guessing that x ## 'EMPTY' should be x == 'EMPTY'.
How do I pass an argument from the launch command into one of my process scripts that is written in another language, such as Python?
Initiation cmd:
nextflow run /Path/to/myscript.nf --in '/Path/to/MyData'
process dataDirectories {
    """
    #!/usr/bin/env python2.7
    import os
    import getpass
    currentUser=getpass.getuser()
    dataPath="/home/" + currentUser + "/WGS_Data/" + MyData
    resultsPath="/home/" + currentUser + "/WGS_Results/" + MyData
    try:
        os.makedirs(dataPath, 0o777)
        os.makedirs(resultsPath, 0o777)
    except:
        pass
    """
}
I would like to get the string 'MyData' from the path (which is in my '--in' argument) into my python script.
Can this be done?
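It can, at least along the following lines: Nextflow interpolates ${...} expressions into the script text before the interpreter runs it, so a value derived from --in can be written directly into the Python source. The sketch below is DSL2. The input name data_name is made up, and params['in'] is used instead of params.in only to steer clear of the Groovy keyword in.

```groovy
// Sketch only: data_name is a hypothetical input name.
process dataDirectories {
    debug true

    input:
    val data_name

    script:
    """
    #!/usr/bin/env python2.7
    import os
    import getpass

    current_user = getpass.getuser()

    # The quoted value below is substituted by Nextflow before Python starts
    data_path = os.path.join("/home", current_user, "WGS_Data", "${data_name}")
    results_path = os.path.join("/home", current_user, "WGS_Results", "${data_name}")

    for path in (data_path, results_path):
        if not os.path.isdir(path):
            os.makedirs(path, 0o777)
    """
}

workflow {
    // file(...).name keeps only the last path component, e.g. 'MyData'
    data_name = file(params['in']).name
    dataDirectories(data_name)
}
```

Launched as in the question (nextflow run /Path/to/myscript.nf --in '/Path/to/MyData'), the Python code would receive the literal string MyData. Any $ that should reach the interpreter unchanged has to be escaped as \$ inside a double-quoted script block.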
Channel.from([['A', 10], ['B', 8], ['C', 5], ['D', 4]])
.toList().map{ [it, it].combinations().findAll{ a, b -> a[1] < b[1]} }
.flatMap()
.view()