
Nextflow workshop

https://github.com/cellgeni/nf-workshop

Vladimir Kiselev

Head of the Cellular Genetics Informatics

Sanger Institute

Basic concepts

  • Nextflow is designed around the idea that the Linux platform is the lingua franca of data science.
  • Linux provides many simple but powerful command-line and scripting tools that, when chained together, facilitate complex data manipulations
  • Nextflow extends this approach, adding the ability to define complex program interactions and a high-level parallel computational environment based on the dataflow programming model (see the sketch below)
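
As a taste of the dataflow style, values flow through operators as soon as they are produced; a minimal sketch (not part of the workshop scripts):

Channel.from(1, 2, 3)
    .map { it * it }          // square each value as it arrives
    .subscribe { println it } // print results as they are emitted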

Processes and channels

  • A Nextflow pipeline script is made by joining together many different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.)
  • Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state
  • The only way they can communicate is via asynchronous FIFO queues, called channels
  • Any process can define one or more channels as input and output

Execution and abstraction

  • While a process defines what command or script has to be executed, the executor determines how that script is actually run on the target environment
  • If not otherwise specified, processes are executed on the local computer. The local executor is very useful for pipeline development and test purposes, but for real-world computational pipelines an HPC or cloud platform is required
  • Nextflow provides an abstraction between the pipeline's functional logic and the underlying execution environment
  • Thus it is possible to write a pipeline once and to seamlessly run it on your computer, a grid platform, or the cloud, without modifying it (see the configuration sketch below)
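
For instance, the execution target is selected in the configuration file, not in the pipeline script; a sketch (executor names as in the Nextflow documentation):

// nextflow.config
process.executor = 'local'    // develop and test on your computer
// process.executor = 'lsf'   // submit each process as an LSF job
// process.executor = 'slurm' // or as a SLURM job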

Clone the workshop repo and install Nextflow

git clone https://github.com/cellgeni/nf-workshop.git
cd nf-workshop
curl -s https://get.nextflow.io | bash
# use wget if curl is not available
# wget -qO- https://get.nextflow.io | bash
ls

Hello world! pipeline

  • The Hello world! pipeline defines two processes
  • splitLetters splits a string into file chunks containing 6 characters each
  • convertToUpper receives these files and transforms their contents to uppercase letters
  • The resulting strings are emitted on the result channel and the final output is printed by the subscribe operator (a sketch of the full script follows below)
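
For reference, here is a sketch of the classic hello-world script in the DSL of this Nextflow version (the hello-world.nf in the repo may differ in detail):

#!/usr/bin/env nextflow

params.str = 'Hello world!'

// split the input string into files of 6 characters each
process splitLetters {
    output:
    file 'chunk_*' into letters mode flatten

    """
    printf '${params.str}' | split -b 6 - chunk_
    """
}

// read each chunk and convert its contents to uppercase
process convertToUpper {
    input:
    file x from letters

    output:
    stdout result

    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}

result.subscribe {
    println it.trim()
}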

Run the pipeline

> ./nextflow run hello-world.nf
N E X T F L O W  ~  version 0.28.0
Launching `hello-world.nf` [exotic_bartik] - revision: 361b274147
[warm up] executor > local
[e7/7d678f] Submitted process > splitLetters
[5e/fe9bf6] Submitted process > convertToUpper (2)
[bb/75ef46] Submitted process > convertToUpper (1)
WORLD!
HELLO
  • The first process is executed once, and the second twice
  • The result string is printed
  • convertToUpper is executed in parallel, so it is possible that you will get the final result printed out in a different order:
HELLO
WORLD!

work directory

> tree -a work
work
├── 66
│   └── 5422cf0adc07c4662eaaa04b5c1700
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       ├── chunk_aa
│       └── chunk_ab
├── 7b
│   └── 86f99a5e06fe183daee815b2cf09c7
│       ├── .command.begin
│       ├── .command.err
│       ├── .command.log
│       ├── .command.out
│       ├── .command.run
│       ├── .command.sh
│       ├── .exitcode
│       └── chunk_aa -> /Users/vk6/nf-workshop/work/66/5422cf0adc07c4662eaaa04b5c1700/chunk_aa
└── f5
    └── 56dcd57618fe720c4ed4602055f68a
        ├── .command.begin
        ├── .command.err
        ├── .command.log
        ├── .command.out
        ├── .command.run
        ├── .command.sh
        ├── .exitcode
        └── chunk_ab -> /Users/vk6/nf-workshop/work/66/5422cf0adc07c4662eaaa04b5c1700/chunk_ab

6 directories, 25 files

work directory

  • The work directory contains sub-directories where Nextflow executes its processes
  • The names of the directories are randomly generated
  • splitLetters was executed in the 66 sub-directory
  • convertToUpper was executed in 7b and f5 sub-directories
  • For each process Nextflow generates some system scripts and the outputs
  • The original process command is in .command.sh
  • .command.run is the script submitted to the cluster environment
  • .command.err, .command.log and .command.out capture the standard error, log and standard output streams
  • Since convertToUpper depends on the output of splitLetters, a symbolic link to that output was created instead of a copy

Modify and resume

Let's modify the script block of the convertToUpper process so that it now runs rev $x:

process convertToUpper {
    input:
    file x from letters

    output:
    stdout result

    script:
    """
    rev $x
    """
}

Modify and resume

Now rerun Nextflow with the -resume flag:

> ./nextflow run hello-world.nf -resume
N E X T F L O W  ~  version 0.28.0
Launching `hello-world.nf` [mighty_goldstine] - revision: 0fa0fd8326
[warm up] executor > local
[66/5422cf] Cached process > splitLetters
[00/fc01e0] Submitted process > convertToUpper (1)
[b2/816523] Submitted process > convertToUpper (2)
olleH
!dlrow

Note that the first process splitLetters was cached and was not run at all!

Pipeline parameters

Nextflow allows you to define parameters inside the pipeline, e.g. in the Hello world! pipeline there is a str parameter defined:

params.str = 'Hello world!'

We can redefine its default value on the command line:

> ./nextflow run hello-world.nf --str 'Hola mundo'
N E X T F L O W  ~  version 0.28.0
Launching `hello-world.nf` [elated_hamilton] - revision: b0857ec305
[warm up] executor > local
[b3/924952] Submitted process > splitLetters
[d8/727942] Submitted process > convertToUpper (2)
[5a/3b7252] Submitted process > convertToUpper (1)
odnu
m aloH

Local software

By default, Nextflow can use all the software available in your bash environment.

conda environments

However, if you need a specific software setup that you don't want to install and manage by hand, we recommend using conda environments. To install conda, please follow the steps in the conda installation documentation.

There are multiple ways of creating and managing conda environments. We recommend using an environment.yml file. For more details please see the conda documentation on managing environments.

conda environments

Here we will show an example of setting up a conda environment for our RNAseq pipeline using the environment.yml file:

name: rnaseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc=0.11.7
  - bedops=2.4.30
  - cutadapt=1.15
  - trim-galore=0.4.5
  - star=2.5.4a
  - hisat2=2.1.0
  - rseqc=2.6.4
  - picard=2.17.6
  - samtools=1.7
  - preseq=2.0.2
  - subread=1.6.0
  - stringtie=1.3.3
  - multiqc=1.4

conda environments

To create an environment with all the listed software one needs to run:

conda env create -f environment.yml

Once the environment is ready it can be activated inside a Nextflow process using these lines:

beforeScript "set +u; source activate rnaseq"
afterScript "set +u; source deactivate"

We do not recommend adding any R-related packages to your conda environment; in our experience this has never worked well.

Docker images

Your conda environments can be further dockerized if needed using a Dockerfile:

FROM continuumio/miniconda
ADD environment.yml /
RUN conda env create -f /environment.yml

You can then add any other software you need for your pipeline to the Docker image.
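
A typical build-and-run sketch (the image tag rnaseq-pipeline is illustrative):

docker build -t rnaseq-pipeline .
./nextflow run main.nf -with-docker rnaseq-pipeline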

SC3 clustering Nextflow pipeline

Now we will create a more realistic example pipeline. We will run SC3 clustering of a small single-cell RNAseq dataset (included in the SC3 package). SC3 is a stochastic clustering algorithm, and with this pipeline we would like to check its stability. To do that we will run SC3 multiple times in parallel, changing the initial random seed on each run. Afterwards we will merge all the results into one matrix, which is useful for further downstream analysis.

main.nf

In order to publish your Nextflow pipeline to GitHub (or any other supported platform) and allow other people to use it, you only need to create a GitHub repository containing all your project's script and data files.

Nextflow only requires that the main script in your pipeline project be called main.nf. We will use this name for our SC3 pipeline.

In main.nf we have two processes (run_sc3 and merge_results) and one parameter (params.n), which is the number of times we would like to run SC3 with different random seeds. A sketch of such a structure follows below.
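
A sketch of how such a pipeline can be structured (the command-line interfaces of sc3.R and merge.R below are assumptions for illustration; see the repo's main.nf for the real thing):

#!/usr/bin/env nextflow

// number of SC3 runs, can be overridden with --n
params.n = 3

// one random seed per requested run
seeds = Channel.from(1..params.n)

process run_sc3 {
    input:
    val seed from seeds

    output:
    file 'labels_*.csv' into sc3_results

    // sc3.R is resolved via the bin folder (see the next slide);
    // its arguments here are assumed
    """
    sc3.R $seed labels_${seed}.csv
    """
}

process merge_results {
    input:
    file labels from sc3_results.collect()

    output:
    file 'merged.csv' into merged

    // merge.R's arguments are likewise assumed
    """
    merge.R merged.csv $labels
    """
}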

Exercise Have a look at main.nf and notice how it is different from hello-world.nf.

Third-party scripts

Since SC3 is an R package, we would prefer to keep all the SC3 commands in a separate R script. Nextflow allows you to store all third-party scripts in the bin folder in the root directory of your project repository. Nextflow will automatically add this folder to the PATH environment variable, and the scripts will be accessible in your pipeline without specifying an absolute path to invoke them.

Note that you have to grant these scripts execute permission (chmod +x bin/*) and add a shebang to all of your third-party scripts. For R scripts we add the #!/usr/bin/env Rscript shebang. The scripts should then be called just by their names in the process scripts.

In our pipeline we have two R scripts in the bin folder (sc3.R and merge.R), corresponding to the two pipeline processes.

Run SC3

Now, let's run our SC3 pipeline:

> ./nextflow run main.nf
N E X T F L O W  ~  version 0.28.0
Launching `main.nf` [elegant_carson] - revision: d5c6c10c98
[warm up] executor > local
[0a/9ccd50] Submitted process > run_sc3 (3)
[b2/a0bbd7] Submitted process > run_sc3 (1)
[9d/9a91db] Submitted process > run_sc3 (2)
[7a/1bb0d2] Submitted process > merge_results

The first process was run 3 times (the default value), but this number can be controlled with the --n flag when executing main.nf.
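
For example, to run SC3 ten times instead (a usage sketch):

./nextflow run main.nf --n 10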

Exercise What is the result of the pipeline?

Exercise Explore the newly created folders in the work directory.

Run SC3

Exercise Check Nextflow integration with GitHub by changing your directory to some temporary one and running:

curl -s https://get.nextflow.io | bash
./nextflow run cellgeni/nf-workshop

Note that the pipeline has automatically been pulled from GitHub.

Configuration file

When a pipeline script is launched Nextflow looks for a file named nextflow.config in the current directory and in the script base directory (if it is not the same as the current directory). Finally it checks for the file $HOME/.nextflow/config.

nextflow.config is a configuration file that is used to define parameters required by your computational environment. If you need to run your pipeline on different environments, you can make use of configuration profiles. A profile is a set of configuration attributes that can be activated/chosen when launching a pipeline execution by using the -profile command line option.

Exercise Have a look at our nextflow.config file. Did you notice the cloud profile?!
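
For orientation, a minimal sketch of how profiles are declared (the attribute values are illustrative, not copied from the repo's config):

profiles {
    standard {
        process.executor = 'local'
    }
    farm {
        process.executor = 'lsf'
        process.queue = 'normal'
    }
}

Running with -profile farm then selects the LSF settings.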

Sanger farm settings

We have tested Nextflow on the Sanger farm and identified the settings that must be present in the configuration file for pipelines to run successfully:

// this is required by bsub on farm3: selected[mem] should = rusage[mem]
// http://mediawiki.internal.sanger.ac.uk/wiki/index.php/Submitting_large_memory_jobs
process.clusterOptions = { "-R \"select[mem>${task.memory.toMega()}]\"" }
// https://www.nextflow.io/docs/latest/executor.html#lsf
executor.perJobMemLimit = true

The first setting ensures the memory selection matches the memory reservation; the second makes LSF assign the memory limit to the whole job rather than to each core within it (relevant for multicore jobs).

Run Nextflow on the Sanger farm

Here is what we recommend doing to be able to successfully run Nextflow on the Sanger farm:

# login to the farm
ssh -Y farm3-head2

# create a dedicated bash session for your pipeline
tmux new -s sc3_pipeline

# this will add Nextflow executable to your path (together with other
# CellGen IT software and R packages, including SC3)
source /nfs/cellgeni/.cellgenirc

# Java is not permitted on the head nodes, but it is allowed on the 
# slave nodes, therefore we need to start an interactive job to be 
# able to run Nextflow
bsub -Is -q long -R"select[mem>5000] rusage[mem=5000]" -M5000 bash

# (optional)
# your session in the long queue will be active for 2 days, if your
# pipeline requires more time to finish, you can use rhweek queue which
# will be active for 1 week:
# bsub -Is -q rhweek -R"select[mem>5000] rusage[mem=5000]" -M5000 bash

Run Nextflow on the Sanger farm

Exercise Now try to run our SC3 pipeline on the farm using:

nextflow run cellgeni/nf-workshop -profile farm

If you open another terminal and look at your jobs using bjobs, you will see that Nextflow has started 3 jobs corresponding to the first process in our pipeline running with three different random seeds:

vk6@farm3-head2:~$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9525703 vk6     RUN   long       farm3-head2 bc-27-3-04  bash       Apr 17 11:38
9525942 vk6     RUN   normal     bc-27-3-04  bc-32-1-09  *n_sc3_(2) Apr 17 11:39
9525949 vk6     RUN   normal     bc-27-3-04  bc-32-1-09  *n_sc3_(3) Apr 17 11:39
9525953 vk6     RUN   normal     bc-27-3-04  bc-32-1-09  *n_sc3_(1) Apr 17 11:39

More resources

Acknowledgements
