viral assembly techniques from ONT reads

Assembly methods often aim to produce a single canonical representation of every genomic locus in their input. This objective can sometimes frustrate analyses that depend on information about variation represented in a given biosample. Assemblers are also often monolithic systems, where each step in the assembly process is implemented in a single program. This reduces configurability of the method. It may be difficult to understand all asumptions of the method, or impossible to modify certain hyperparameters without modifying the software itself.

approach

In this repository, I explore a modular assembly method for viral genomes sequenced using long read sequencing from Oxford Nanopore (ONT). The workflow consists of a number of independent steps:

select longer reads in the data
all-vs-all mapping
induction of graph with seqwish
graph simplification
read mapping to graph, coverage calculation
inference of viral quasispecies

Intermediate data is either in the form of sequence graphs, alignments between sequences and/or graphs, and coverage information over graphs.

example workflow

To clarify these steps, let's work through a step by step example of such a workflow. Here I'm using data from a nasal swab of a SARS-CoV-2 patient.

I'll use the swab Oxford Nanopore data. We can get the data with the sra toolkit (installed in guix via guix package -i sra-tools).

fastq-dump https://sra-download.ncbi.nlm.nih.gov/traces/sra52/SRR/010879/SRR11140751
# symlink for convenience
ln -s SRR11140751.fastq swab_ONT.fastq

(1) we filter for read length, keeping only long reads (most are very short, due the reverse transcription / amplification genesis of the data):

cat swab_ONT.fastq \
    | paste - - - - \
    | tr ' ' '_' \
    | awk 'length($2) > 2000 { print $1; print $2; print $3; print $4; }' \
    | pigz >swab_ONT.2kbp.fastq.gz

(2) all versus all mapping, using all kmers to drive the mapping, and 48 mapping threads

minimap2 -c -w 1 -X -t 48 swab_ONT.2kbp.fastq.gz swab_ONT.2kbp.fastq.gz >swab_ONT.2kbp.paf

(2.1) optionally filter the alignments for read length using fpa (installable via `cargo install fpa_lr`):

fpa drop -l 1000 <swab_ONT.2kbp.paf >swab_ONT.2kbp.drop1kbp.paf

This will result in a more open and fragmentary graph with short tips where reads don't align full length against each other. Different settings of length might work better given different data and applications.

(3) induction of graph with seqwish

seqwish -t 48 -k 16 -s swab_ONT.2kbp.fastq.gz -p swab_ONT.2kbp.paf -g swab_ONT.2kbp.gfa

The resulting graph can be visualized with Bandage, yielding something like this:

(4) graph simplification

odgi build -g swab_ONT.2kbp.gfa -o - \  # read in the GFA file
    | odgi prune -i - -b 3 -o - \       # run a pruning operation
    | odgi view -i - -g >swab_ONT.2kbp.odgi-prune.b3.gfa  # write GFA

This uses odgi to remove edges from the graph that are not among the best three edges for any node. It is crude, but brings out a linear structure that is expected of the SARS-CoV-2 genome. The resulting graph still has a lot of sequence in it, presumably derived from local errors, low-quality reads, and chimeric PCR products (RT-PCR was used to generate cDNA from the input sample).

(5) read mapping to graph, coverage calculation

I haven't yet implemented this in my testing, but I would use GraphAligner or gyeet. vg has three mappers which might work (map, mpmap, and giraffe). While vg map can return high quality alignments, it is sometimes slow on graphs with complicated looping structures.

Graphs with complex local loops are difficult for any mapper that treats them as an actual graph (all those mentioned except for gyeet) because there is a risk of alignments getting stuck in small graph motifs which are "universal" models of any DNA sequence. These can arise during the seqwish graph induction step. Removing them may be possible with odgi break, but I have not been able to get this to work reliably enough to prevent GraphAligner from getting stuck sometimes.

(6) inference of viral quasispecies

With the coverage of reads mapped back to the variation graph induced from long reads, we should be able to make a guess as to the viral quasispecies that are present. We could apply something like virus-vg or vg-flow (probably the latter, as it scales better) to the variation graph. These are designed for illumina data, but by constructing a cleaned variation graph from the ONT reads, we can effectively reduce their error rate and allow the application of these methods. Other, cruder methods may also be possible to apply.

ekg / viral-assembly Goto Github PK

viral-assembly's Introduction

viral assembly techniques from ONT reads

approach

example workflow

(1) we filter for read length, keeping only long reads (most are very short, due the reverse transcription / amplification genesis of the data):

(2) all versus all mapping, using all kmers to drive the mapping, and 48 mapping threads

(2.1) optionally filter the alignments for read length using fpa (installable via `cargo install fpa_lr`):

(3) induction of graph with seqwish

(4) graph simplification

(5) read mapping to graph, coverage calculation

(6) inference of viral quasispecies

viral-assembly's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

ekg / viral-assembly Goto Github PK

viral-assembly's Introduction

viral assembly techniques from ONT reads

approach

example workflow

(1) we filter for read length, keeping only long reads (most are very short, due the reverse transcription / amplification genesis of the data):

(2) all versus all mapping, using all kmers to drive the mapping, and 48 mapping threads

(2.1) optionally filter the alignments for read length using fpa (installable via cargo install fpa_lr):

(3) induction of graph with seqwish

(4) graph simplification

(5) read mapping to graph, coverage calculation

(6) inference of viral quasispecies

viral-assembly's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

Recommend Topics

Recommend Org

(2.1) optionally filter the alignments for read length using fpa (installable via `cargo install fpa_lr`):