
Napu

Pipelines version: R9

Napu (Nanopore Analysis Pipeline) is a collection of WDL workflows for variant calling and de novo assembly of ONT data, optimized for a single-flowcell ONT sequencing protocol. The wet-lab/informatics protocol is now applied to sequence and characterize thousands of human brain genomes at the Center for Alzheimer's and Related Dementias at NIH. This pipeline version adds optional inputs that make it easier to run modular pieces of the pipeline.

Versions for R9/R10 data

Please use either the r9 or r10 branch, depending on your data type.

Installation and Usage

Using Terra/DNAnexus/AnVIL

If you are using Terra, the workflows are already available in the Dockstore collection. All you need to do is import the relevant workflow into your Terra workspace.

On a single local compute node

There are multiple existing WDL engine implementations. We performed our tests using Cromwell, and the following instructions assume it.

First, follow these instructions to download and install the latest version of Cromwell. Also make sure that Docker is installed and running.
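
A quick way to confirm both prerequisites before going further:

java -version   # Cromwell requires a Java runtime
docker info     # verifies that the Docker daemon is running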

Then, you will need to prepare an inputs file. For example, for the end-to-end pipeline, you can generate a blank inputs file as follows (the -o false option hides optional workflow inputs):

java -jar womtool-XX.jar inputs -o false wdl/workflows/cardEndToEndVcf.wdl > inputs.json
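
womtool prints a JSON skeleton with one entry per required workflow input. Filled in, it might look like the sketch below (the key names are placeholders, not the workflow's actual input names; use the ones womtool emits):

# NOTE: input names below are illustrative placeholders.
cat > inputs.json << 'EOF'
{
  "cardEndToEndVcf.sampleName": "HG002",
  "cardEndToEndVcf.inputReads": "/data/HG002.methyl.bam",
  "cardEndToEndVcf.referenceFasta": "/data/GRCh38.fa"
}
EOF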

Then, edit the inputs.json file with your input parameters. Afterwards, you will be able to run the corresponding workflow (for example, cardEndToEndVcf.wdl) as follows:

java -jar cromwell-XX.jar run wdl/workflows/cardEndToEndVcf.wdl --inputs inputs.json

On a custom HPC server or cloud environment

Cromwell can be configured to run on an HPC cluster or in the cloud. This configuration is more involved and requires optimization for the particular environment. Please refer to the corresponding Cromwell manual for details.
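
For example, Cromwell picks up a custom backend configuration through the config.file JVM property. The invocation below is a sketch; writing backend.conf itself (e.g., for SLURM or a cloud backend) is site-specific and covered by the Cromwell documentation:

# Point Cromwell at a site-specific backend configuration.
java -Dconfig.file=/path/to/backend.conf -jar cromwell-XX.jar run wdl/workflows/cardEndToEndVcf.wdl --inputs inputs.json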

Running the assembly with Shasta separately

The Shasta assembler can be run in in-memory mode (faster and cheaper) or disk-backed mode (slower, but requiring less memory). The end-to-end workflow uses the disk-backed mode by default because Terra currently cannot launch instances with more than 768 GB of memory, which might not be enough for some samples. However, because the in-memory mode is much cheaper (~$20 vs ~$70), it might be worth attempting it first, then running the rest of the workflow on the samples that succeeded, and the full workflow with the disk-backed mode on the few that failed.

Shasta can be run separately with the workflow defined at wdl/tasks/shasta.wdl and deposited on Dockstore. To use the in-memory mode on Terra, the suggested inputs are:

  • inMemory=true
  • diskSizeGB=1500 (needed to get high-memory instances on Terra)
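
When running the standalone workflow locally with Cromwell (rather than on Terra), these settings go into its inputs file, roughly as follows (the shasta. key prefix is an assumption; check the actual workflow name with womtool):

# The key prefix below is assumed, not taken from the repository.
cat > shasta_inputs.json << 'EOF'
{
  "shasta.inMemory": true,
  "shasta.diskSizeGB": 1500
}
EOF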

Then, the FASTA file produced by Shasta can be provided to the end-to-end workflow described above (defined at wdl/workflows/cardEndToEndVcf.wdl and deposited on Dockstore) using the optional input shastaFasta.

Running DeepVariant per chromosome and minimap2 alignment on chunks of reads

Both of these tasks in the end-to-end pipeline take a considerable amount of time when run on the whole genome. Running them in parallel reduces both time and cost.

DeepVariant can be run on each chromosome independently by providing a list of chromosomes as input to the end-to-end workflow. To run minimap2 alignments in parallel, provide the end-to-end workflow with the number of reads per chunk used to split the input reads.

To run these tasks in parallel on Terra, the suggested inputs are:

  • nbReadsPerChunk = 1000000 (this can vary with the input reads)
  • chrs=["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY", "chrM"]
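
When running locally with Cromwell instead of Terra, the same settings go into the inputs file, sketched below (the key prefix is an assumption, and the chrs list is abbreviated; include every chromosome listed above):

# Key prefix assumed; list every chromosome you want called.
cat > parallel_inputs.json << 'EOF'
{
  "cardEndToEndVcf.nbReadsPerChunk": 1000000,
  "cardEndToEndVcf.chrs": ["chr1", "chr2", "chrX", "chrY", "chrM"]
}
EOF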

Quick demo

wget https://zenodo.org/record/7532080/files/card_pipeline_small_test.tar.gz
tar -xvf card_pipeline_small_test.tar.gz
cd card_pipeline_small_test
java -jar PATH_TO/cromwell-XX.jar run PATH_TO/cardEndToEndVcf.wdl --inputs inputs_end2end.js

Make sure Cromwell is installed, and substitute the paths to Cromwell and the WDL workflow according to your setup.

Input data requirements

Napu was tested on 30-40x ONT sequencing data generated with the R9.4 pore, with read N50 ~30 kb. Basecalling and methylation calls were done using Guppy 6.1.2. Napu should work for similar or newer nanopore data. We are currently planning to release a special version of this pipeline for R10 ONT data.

The input to the end-to-end workflow is one or more mapped or unmapped BAM files with methylation tags produced by Guppy, or a FASTQ file.

Other inputs include a reference genome and the corresponding VNTR annotations (provided in this repository). A Shasta assembly FASTA file can be provided as input if Shasta was run separately in in-memory mode.

Napu manuscript data availability

The cell line data (HG002, HG00733 and HG02723) are openly available through this Terra workspace. If you don't have a Terra account, you can also download the cell line data from this mirror.

Human brain sequencing datasets are under controlled access and require a dbGaP application (phs001300.v4). Afterwards, the data will be available through the restricted Terra workspace.

Pipeline description

Napu begins by generating a diploid de novo assembly using a combination of Shasta, which produces a haploid assembly, and Hapdup, which generates locally phased diploid contigs. We then use the generated assemblies to call structural variants (at least 50 bp in size) against a given reference genome using a new assembly-to-reference pipeline called hapdiff.

Ideally, small variants could also be recovered from diploid contigs, as has been successfully done for HiFi-based assemblies. Our Shasta-Hapdup assemblies had mean substitution error rates of ~8 per 100 kb, which is higher than current contig assemblies produced with PacBio HiFi (<1 per 100 kb). Reference-based small variant calling methods are less error-prone because they can better approximate the biases of ONT base calling errors via supervised learning. We therefore use the PEPPER-Margin-DeepVariant software to call small variants against a reference.

Given the set of structural variants produced from de novo assemblies and the reference-based small variant calls, our pipeline phases them into a harmonized variant call set using Margin. In addition, given the phased reference alignment with methylation tags (produced by Guppy), we produce haplotype-specific calls of hypo- and hypermethylated regions of the genome.

The workflows are built around the following tools: Shasta, Hapdup, hapdiff, minimap2, PEPPER-Margin-DeepVariant, and Margin.

Credits

Napu was developed in a collaboration between the UC Santa Cruz Genomics Institute and the Cancer Data Science Laboratory, National Cancer Institute.

Main code contributors:

  • Mikhail Kolmogorov (NCI)
  • Mira Mastoras (UCSC)
  • Melissa Meredith (UCSC)
  • Jean Monlong (UCSC)

Citation

Kolmogorov, Billingsley et al, "Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation". Nature Methods (2023). doi:10.1038/s41592-023-01993-x

License

Workflows are distributed under a BSD license. See the LICENSE file for details.
