
SRAlign

A flexible pipeline for short read alignment to a reference with extensive QC reporting.

Introduction

SRAlign is a Nextflow pipeline for aligning short reads to a reference.

SRAlign is designed to be highly flexible by allowing for the easy addition of tools to the pipeline as well as serving as a starting point for genomic analyses that rely on alignment of short reads to a reference.

Pipeline overview

  1. Trim reads
  2. QC of reads
    1. Raw reads FastQC
    2. Trimmed reads FastQC
    3. Summary MultiQC
  3. Align reads
    1. Align to reference genome/transcriptome
    2. Check contamination
  4. Preprocess alignments
    1. Mark duplicates
    2. Compress sam to bam
    3. Index bam
  5. QC of alignments
    1. Samtools stats
    2. Samtools index stats
    3. Percent duplicates
    4. Percent aligned to contamination reference
    5. Summary MultiQC
  6. Library complexity and reproducibility
    1. Preseq library complexity
    2. DeepTools correlation
    3. DeepTools PCA
  7. Full pipeline MultiQC

Quick start

Prerequisites

  1. Any POSIX-compatible system (e.g. Linux, OS X) with internet access

  2. Nextflow version >= 21.04

  3. Docker

    • I recommend Docker Desktop for OS X or Windows users

Get or update SRAlign

  1. Download or update SRAlign:

    • This downloads the project into $HOME/.nextflow/assets.
    • It is a quick way to download a project and run it easily.
      • Nextflow commands can then refer to SRAlign simply as trev-f/SRAlign, without needing its location on the system.
      • To customize or expand SRAlign, see the documentation on customizing or expanding SRAlign.
    nextflow pull trev-f/SRAlign
  2. Show project info:

    nextflow info trev-f/SRAlign

Test SRAlign

  1. Check that SRAlign works on your system:

    • -profile test uses preconfigured test parameters to run SRAlign in full on a small test dataset stored in a remote GitHub repository.
      • Because these test files are stored in a remote repository, internet access is required to run the test.
      • For more information, see the profiles section of the nextflow config file and trev-f/SRAlign-test.
    nextflow run trev-f/SRAlign -profile test 

Run SRAlign

  1. Prepare the input design csv file.

    • Input design file must be in csv format with no whitespace.
    • Either reads (fastq or fastq.gz) or alignments (bam) are accepted.
      • If reads are supplied, they can be paired or unpaired.
    • Required columns:
      • reads: lib_ID, sample_name, replicate, reads1, reads2 (optional)
      • alignments: lib_ID, sample_name, replicate, bam, tool_IDs
    • See sample inputs in the SRAlign-test repository.
    • A template project repository can be downloaded from the SRAlign-template repository.
  2. Show all configurable options for SRAlign with the help message:

    • The most important information here is probably the list of available reference genomes.
    nextflow run trev-f/SRAlign --help
  3. Analyze your data with SRAlign:

    nextflow run trev-f/SRAlign -profile docker --input <input.csv> --genome <valid genome key>
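
The design file rules from step 1 can be checked before launching a run. The following is a minimal Python sketch, not SRAlign's own parse_design.py: the helper name `check_design` is hypothetical, and guessing the design type from the presence of a `bam` column is an assumption made for illustration. The column names are the ones listed above.

```python
import csv
import io

# Required columns as listed in the README; reads2 is optional for unpaired reads.
REQUIRED_COLUMNS = {
    "reads": ["lib_ID", "sample_name", "replicate", "reads1"],
    "alignments": ["lib_ID", "sample_name", "replicate", "bam", "tool_IDs"],
}


def check_design(text):
    """Return (design_type, problems) for the text of a design csv file.

    Hypothetical helper for illustration; not part of SRAlign.
    """
    problems = []
    # The README requires a design file with no whitespace.
    for number, line in enumerate(text.splitlines(), start=1):
        if any(ch.isspace() for ch in line):
            problems.append(f"line {number}: contains whitespace")

    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    # Assumption: a 'bam' column in the header means an alignments design.
    design_type = "alignments" if "bam" in header else "reads"
    for column in REQUIRED_COLUMNS[design_type]:
        if column not in header:
            problems.append(f"missing required column: {column}")
    return design_type, problems


# Made-up example rows; the file names are placeholders.
example = (
    "lib_ID,sample_name,replicate,reads1,reads2\n"
    "lib1,hmg-3,1,hmg-3_rep1_R1.fastq.gz,hmg-3_rep1_R2.fastq.gz\n"
)
design_type, problems = check_design(example)
print(design_type, problems)
```

An empty problem list means the header and whitespace rules are satisfied; it does not confirm that the listed files exist.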

Tips for running Nextflow and SRAlign

SRAlign is designed to be highly configurable: its default behavior can be changed by supplying configurable parameters. Parameters can be supplied in several ways, which follow a specific hierarchy of precedence.

  • Show the configurable parameters with the command line help: nextflow run trev-f/SRAlign --help
  • Nextflow arguments always begin with a single dash, e.g. -profile.
  • Pipeline parameters specified at the command line always begin with a double dash, e.g. --input.
    • Parameters specified at the command line always have the highest precedence. They will overwrite parameters specified in any config or params files.
    • I recommend specifying required parameters (i.e. --input and --genome) and up to a few others at the command line in this manner. Specifying more than this at the command line gets unwieldy.
  • A custom config or parameters file is a good option for cases where you want to supply more parameters than can comfortably be done at the command line or you want to use the same custom parameters in multiple runs.
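
As a sketch of the parameters-file approach, the Python below writes a JSON file that can be passed to Nextflow's `-params-file` option and builds a command line that adds one override on top. The helper names and the output path are hypothetical. The parameter names (`input`, `genome`, `trimTool`, `alignmentTool`) appear elsewhere on this page, while the values are placeholders that may not match your setup.

```python
import json
import shlex
import tempfile
from pathlib import Path


def write_params_file(params, path):
    """Write pipeline parameters to a JSON file for Nextflow's -params-file option."""
    path = Path(path)
    path.write_text(json.dumps(params, indent=2) + "\n")
    return path


def build_command(params_file, overrides=None):
    """Build a nextflow run command; command line overrides take precedence."""
    command = ["nextflow", "run", "trev-f/SRAlign", "-profile", "docker",
               "-params-file", str(params_file)]
    for name, value in (overrides or {}).items():
        command += [f"--{name}", str(value)]
    return command


# Parameter names are taken from this page; the values are placeholders.
params = {"genome": "WBCel235", "trimTool": "fastp", "alignmentTool": "bowtie2"}
params_file = write_params_file(
    params, Path(tempfile.gettempdir()) / "sralign_params.json"
)
command = build_command(params_file, overrides={"input": "design.csv"})
print(shlex.join(command))
```

Parameters shared across runs live in the file, and anything given with a double dash on the command line still wins, as described above.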

Additional documentation

Additional documentation can be found in docs.


sralign's Issues

Set a resources maximum

Testing locally often exceeds the maximum allotment of resources because resource requests are currently hard-coded, with no check of whether they exceed what the system has available.

Make this dynamic. Write a function that checks whether a resource request is over the max and, if it is, uses the max instead. Check out nf-core's implementation.
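
The idea can be sketched in a few lines. This is a Python illustration of the clamping logic only, not nf-core's implementation and not Nextflow configuration; the function names, resource names, and limits are hypothetical.

```python
import os


def cap_resource(requested, maximum):
    """Return the requested amount, or the maximum if the request exceeds it."""
    return min(requested, maximum)


def cap_process_resources(requested, limits):
    """Cap each requested resource at its limit; resources without a limit pass through."""
    return {name: cap_resource(value, limits.get(name, value))
            for name, value in requested.items()}


# Hypothetical limits: CPUs from the local machine, memory chosen by the user.
limits = {"cpus": os.cpu_count() or 1, "memory_gb": 8}
requested = {"cpus": 64, "memory_gb": 32, "time_h": 4}
print(cap_process_resources(requested, limits))
```

In the pipeline itself the same check would wrap each hard-coded request, so a laptop run falls back to what the machine can provide.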

Only use unique prefix name for MultiQC reports

Using a unique name for all processes where tasks are present (e.g. DeepToolsMultiBam) makes no sense because it means these must be rerun if -resume is used. Change this to a base output prefix name without the unique IDs.

Perform parameter checks upon initialization of the SRAlignWorkflow class

    def tools = [
        trim      : ['fastp'],
        alignment : ['bowtie2', 'hisat2']
    ]

    // check valid read-trimming tool
    assert params.trimTool in tools.trim ,
        "'${params.trimTool}' is not a valid read trimming tool.\n\tValid options: ${tools.trim.join(', ')}\n\t"

    // check valid alignment tool
    assert params.alignmentTool in tools.alignment ,
        "'${params.alignmentTool}' is not a valid alignment tool.\n\tValid options: ${tools.alignment.join(', ')}\n\t"

    ch_multiqcConfig = file(params.multiqcConfig, checkIfExists: true)

    /*
    ---------------------------------------------------------------------
    Design and Inputs
    ---------------------------------------------------------------------
    */

    // check design file
    if (params.input) {
        ch_input = file(params.input)
    } else {
        exit 1, 'Input design file not specified!'
    }

    // set input design name
    inName = params.input.take(params.input.lastIndexOf('.')).split('/')[-1]

    // set a timestamp
    timeStamp = new java.util.Date().format('yyyy-MM-dd_HH-mm')

    // set workflow prefix name to be used for output files that combine all files (i.e. only one output file such as the full MultiQC)
    wfPrefix = "${inName}_-_${workflow.runName}_-_${timeStamp}"
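
To make the prefix logic concrete, here is a Python sketch that mirrors the last three assignments for ordinary paths (a directory, a file name, one extension). The function names are hypothetical, and the input path, run name, and time are example values chosen for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath


def input_name(input_path):
    """File name without its directory and without its last extension."""
    return PurePosixPath(input_path).stem


def workflow_prefix(input_path, run_name, now):
    """Join the design name, run name, and a minute-resolution timestamp."""
    time_stamp = now.strftime("%Y-%m-%d_%H-%M")
    return f"{input_name(input_path)}_-_{run_name}_-_{time_stamp}"


# Hypothetical design file path and run name, with a fixed time for repeatability.
prefix = workflow_prefix(
    "inputs/ATAC-seq_Celegans_design_reads.csv",
    "angry_golick",
    datetime(2023, 7, 21, 10, 49),
)
print(prefix)
# ATAC-seq_Celegans_design_reads_-_angry_golick_-_2023-07-21_10-49
```

Because the run name and timestamp change on every launch, any process whose output name includes this prefix gets a new name each time.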

Add command stubs

Add command stubs to make it easier to mimic pipeline process commands without actually running them. This will make my life easier when testing the pipeline.

Fix quick start in documentation

SRAlign/README.md

Lines 22 to 33 in 52ba5ce

3. Download **sralign**:
```
git clone https://github.com/trev-f/sralign.git
```
4. Run **sralign** in test mode:
```
nextflow run sralign -profile test
```
5. Run your analysis:
```
nextflow run sralign -profile <> --input YYYYMMDD_input.csv --genome WBCel235
```

Combine raw reads and trimmed reads FastQC and MultiQC

It feels like there's no need to have these split up into different modules and subworkflows. The only difference is that raw reads QC is sent to a raw subdirectory whereas trimmed reads QC is sent to a trim subdirectory. But the raw and trimmed reads QC should have different suffixes because the trimmed reads will have the read-trimming tool's ID. Combining them would make more sense and keep things simpler and more readable.

Accept bam files as input

Take bam files as input so that downstream QC and analysis does not have to start from scratch, i.e. with read alignment, etc. This will be helpful for many things, but my main idea now is not having to keep all intermediate files (e.g. massive sam files) cached for resuming analysis. In other words, I can just keep bams, delete fastqs and intermediate files, and operate on those bams without having to keep a ton in storage.

Changes will have to include the following:

  1. Modify parse_design.py to not throw errors when bam files are supplied as input.
  2. Modify ParseDesign.nf to handle the changes to parse_design.py.
  3. Add a ParseDesignBamSWF.nf script to set up the proper channels.
  4. Add a workflow that can be accessed from the command line using the -entry option and that properly sets up the new input.

Fix pipeline overview documentation

SRAlign/README.md

Lines 12 to 16 in 52ba5ce

1. QC of raw reads - [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) & [MultiQC](https://multiqc.info/)
2. Trim raw reads - [cutadapt](https://github.com/marcelm/cutadapt)
3. Align reads - [BWA](http://bio-bwa.sourceforge.net/) -OR- [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
4. Mark duplicates - [samblaster](https://github.com/GregoryFaust/samblaster)
5. QC of alignments - [Samtools](http://www.htslib.org/) & [MultiQC](https://multiqc.info/)

NXF_CONTAINER_ENTRYPOINT_OVERRIDE defaults to false as of Nextflow v22.08.0

Hi! I've been trying to get the test run working but kept getting the following error message:

ERROR ~ Error executing process > 'SRAlign:ReadsQC:ReadsMultiQC (ATAC-seq_Celegans_design_reads_-_angry_golick_-_2023-07-21_10-49)'

Caused by:
  Process `SRAlign:ReadsQC:ReadsMultiQC (ATAC-seq_Celegans_design_reads_-_angry_golick_-_2023-07-21_10-49)` terminated with an error exit status (2)

Command executed:

  multiqc             -n ATAC-seq_Celegans_design_reads_-_angry_golick_-_2023-07-21_10-49__mqc-reads             --module fastqc             hmg-3_rep2__fqc_R1_fastqc.zip hmg-3_rep2__fqc_R2_fastqc.zip hmg-3_rep1__fqc_R1_fastqc.zip hmg-3_rep1__fqc_R2_fastqc.zip hmg-3_rep2__fsp_fqc_R1_fastqc.zip hmg-3_rep2__fsp_fqc_R2_fastqc.zip hmg-3_rep1__fsp_fqc_R1_fastqc.zip hmg-3_rep1__fsp_fqc_R2_fastqc.zip

Command exit status:
  2

Command output:
  (empty)

Command error:
  Usage: multiqc [OPTIONS] <analysis directory>

  Error: Invalid value for '-c' / '--config': Path 'eval export PATH="$PATH:/home/raquel/.nextflow/assets/trev-f/SRAlign/bin"; /bin/bash .command.run nxf_trace' does not exist.

  This is MultiQC v1.11

  For more help, run 'multiqc --help' or visit http://multiqc.info

Work dir:
  /home/raquel/test-SRAlign/work/62/c7352db1945be910bbc73c183f237b

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

The following docker run command in the .command.run file within the ReadsMultiQC process was being misinterpreted:

    docker run -i --cpu-shares 1024 --memory 2048m -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /home/raquel:/home/raquel -w "$PWD" --name $NXF_BOXID ewels/multiqc:v1.11 /bin/bash -c "eval $(nxf_container_env); /bin/bash /home/raquel/test-SRAlign/work/66/5f440fed6865d7f8b62e201a9be293/.command.run nxf_trace"

Seems like it's because Nextflow >= 22.08.0 does not automatically use /bin/bash as the entry point (nextflow-io/nextflow#3357).

The test run works now if I set export NXF_CONTAINER_ENTRYPOINT_OVERRIDE=true

It might be useful to add that to the install instructions in the documentation here and in the SRAtac repo.
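
For anyone scripting runs, the workaround could be applied conditionally. The Python below is a sketch under the assumption stated in this issue, namely that the behavior changed at Nextflow 22.08.0. The function names are hypothetical and the version string is an example value.

```python
import os
import re


def parse_version(text):
    """Extract the leading numeric parts of a version such as '22.08.0-edge'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_entrypoint_override(nextflow_version):
    """True for versions at or above 22.08.0, where this issue reports the change."""
    return parse_version(nextflow_version) >= (22, 8, 0)


def run_environment(nextflow_version, base_env=None):
    """Copy the environment and set the override variable when it is needed."""
    env = dict(os.environ if base_env is None else base_env)
    if needs_entrypoint_override(nextflow_version):
        env["NXF_CONTAINER_ENTRYPOINT_OVERRIDE"] = "true"
    return env


# Example version string; an empty base environment keeps the output short.
env = run_environment("22.10.6", base_env={})
print(env)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching `nextflow run trev-f/SRAlign -profile test`.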
