The tsa_pairhmm from conorwalker

TSA pairHMM: Template switch alignment pair hidden Markov model

This repository contains code for the pairHMMs presented in the paper:

Short-range template switching in great ape genomes explored using pair hidden Markov models
by Conor R. Walker, Aylwyn Scally, Nicola De Maio, and Nick Goldman
DOI: https://doi.org/10.1371/journal.pgen.1009221

Code in this reposistory is modified from the four-point aligner (FPA) available here: https://github.com/ariloytynoja/fpa. Compilation and usage is the same as for the FPA. Developed and tested using Red Hat Enterprise Linux version 7.3.

Compilation

git clone https://github.com/conorwalker/tsa_pairhmm.git
cd tsa_pairhmm
make

Typical usage

Input

A FASTA file (e.g. aln.fa included here) which contains a pairwise alignment of two nucleotide sequences, in which the assumed descendant sequence is the first record and the assumed ancestral sequence is the second record, for example:

>descendant
ATCTGGAA...
>ancestor
ATCTGGAA...

Scan for template switch events

Mutation clusters within aln.fa are defined as >=2 nucleotide (nt) differences within a 10nt sliding window. Whenever a mutation cluster is found, a local region around this mutation cluster is re-aligned under the unidirectional and template switch pairHMMs, following the procedure described in the Methods and Supplementary algorithms sections of https://doi.org/10.1371/journal.pgen.1009221.

To identify and re-align all mutation clusters within aln.fa under both pairHMMs, we use:

./tsa_pairhmm --scan --pair aln.fa > aln_scanned.csv

Output

The above command creates a CSV file aln_scanned.csv which contains a header line followed by one line per re-aligned mutation cluster as follows:

head aln_scanned.csv

chrom,clus_start_chrom,clus_start_align,clust_start1,clust_end1,sp1_qry,sp1_ref,sp2_ref,sp3_ref,sp4_ref,iden_up,ident_rep,ident_down,ident_inv,ident_fwd,ident_epo,masked,sum_ins,sum_del,sum_mis,sum_nuc,CpG,clus_ins,clus_del,clus_mis,fwd_score,ts_score_local,ts_score_global,ts_ref_seq_len,ts_qry_seq_len,frag_L1_size,frag_23_size,frag_4R_size,logprob_per_base_minus_ts
10,123199418,601,601,603,602,602,740,740,605,1,1,1,1,0.988,0.976,0,0,1,0,1,0,0,1,1,-11,-15.4,-27.7,282,81,40,1,39,-1.69e+159
10,123201804,2988,2987,2994,2960,2961,3126,3126,2963,1,1,0.95,0.96,0.94,0.94,0,0,0,1,1,0,0,0,2,-18.7,-27.4,-39.4,282,82,9,1,40,-0.155
10,123201987,3171,3170,3173,3171,3172,3302,3301,3175,1,1,1,1,0.975,0.976,3,0,0,2,2,0,0,0,2,-12.7,-15.5,-27.8,282,82,40,2,39,-0.0101
10,123202034,3218,3217,3221,3218,3219,3336,3334,3223,1,1,1,1,0.976,0.976,3,0,0,2,2,0,0,0,2,-12.7,-15.6,-28,282,82,40,3,39,-0.0114
10,123202123,3307,3306,3314,3307,3308,3452,3452,3310,1,1,0.975,0.987,0.974,0.974,3,0,0,1,1,0,0,0,2,-12.7,-21.4,-33.4,282,82,36,1,40,-0.0819
10,123202787,3971,3970,3972,3966,3967,4025,4025,3970,1,1,1,1,0.987,0.962,3,0,1,0,1,0,1,2,0,-11,-15.4,-27.7,282,81,36,1,40,-0.00886
10,123202852,4038,4035,4041,4041,4043,4137,4123,4146,1,0.933,1,0.987,0.473,0.511,3,0,87,-1,2,0,0,86,1,-25.6,-22.9,-35.2,371,84,40,15,23,-0.0975
10,123202871,4143,4054,4056,4041,4003,4137,4123,4146,1,0.933,1,0.982,0.671,0.947,3,0,27,-1,2,0,0,1,1,-22.3,-22.2,-34.5,282,55,1,15,39,-0.144
10,123203007,4280,4190,4193,4186,4275,4413,4413,4275,1,1,1,1,0.974,0.974,3,2,0,0,1,0,2,0,0,-13.8,-15.4,-27.7,280,82,36,1,40,-0.00886

For each mutation cluster, the associated line in aln_scanned.csv contains the following information:

chrom                   Chromosome
clus_start_chrom        Chromosome position for the start of the mutation cluster
                        (based on information in the input FASTA headers)
clus_start_align        Pairwise alignment position for the start of the mutation cluster
clust_start1            Starting position of the cluster in the descendant sequence
clust_end1              End position of the cluster in the descendant sequence
sp1_qry                 Descendant switch point ① position
sp1_ref                 Ancestral switch point ① position
sp2_ref                 Ancestral switch point ② position
sp3_ref                 Ancestral switch point ③ position
sp4_ref                 Ancestral switch point ④ position
iden_up                 Identity of the region upstream of the re-aligned mutation cluster
ident_rep               ②→③ fragment identity
ident_down              Identity of the region downstream of the re-aligned mutation cluster
ident_inv               TSA pairHMM alignment identity
ident_fwd               Unidirectional pairHMM alignment identity
ident_epo               Input alignment identity
masked                  Boolean: is the ②→③ fragment from a masked region?
                        (0 = not masked, 1 = masked)
sum_ins                 Count of insertions in the TSA pairHMM alignment
sum_del                 Count of deletions in the TSA pairHMM alignment
sum_mis                 Count of mismatches in the TSA pairHMM alignment
sum_nuc                 Number of unique nucleotides in the ②→③ fragment
CpG                     CpG content of re-aligned region
clus_ins                Count of insertions in the mutation cluster
clus_del                Count of deletions in the mutation cluster
clus_mis                Count of mismatches in the mutation cluster
fwd_score               Unidirectional pairHMM log-probability, i.e. max(M(n,m), I(n,m), D(n,m))
ts_score_local          TSA pairHMM log-probability, i.e. max(M3(•,m), I3(•,m), D3(•,m)) 
ts_score_global         Global TSA pairHMM log-probability i.e. max(M3(n,m), I3(n,m), D3(n,m))
                        (not used for any calculations)
ts_ref_seq_len          Ancestral sequence (x) length in the TSA pairHMM alignment
ts_qry_seq_len          Descendant sequence (y) length in the TSA pairHMM alignment
frag_L1_size            Length of Ⓛ→①
frag_23_size            Length of ②→③
frag_4R_size            Length of ④→Ⓡ
logprob_per_base_minus_ts          (ts_score_local-(*_1->M_2->*_3)) / alignment length

Assessing candidate events for significance and alignment quality

Many of the mutation clusters re-aligned and output to aln_scanned.csv will not have arisen due to a template switch event. Using our approach outlined in https://doi.org/10.1371/journal.pgen.1009221, we filter each candidate event line based on:

Statistical significance
A threshold on the per-base template switch alignment quality
No masked sequence in the ②→③ fragment
All four unique nucleotides within the ②→③ fragment
A threshold on the maximum number of deletions in the mutation cluster

An example filtering script filter_csv.py is included here to filter based on this criteria:

python filter_csv.py aln_scanned.csv

This will generate one file aln_events.csv, which contains the events from aln_scanned.csv that pass our thresholds (one in this example):

cat aln_events.csv

10,123237102,40311,38285,38298,38286,30379,30396,30385,30389,1,1,1,1,0.79,0.902,0,12,9,0,4,2,3,0,6,-38.8,-16.8,-29.1,289,92,40,12,39,-0.0213

Visualising template switch alignments

The event output contained within the above CSV files can be visualised using:

./tsa_pairhmm --pair aln.fa --print-file aln_events.csv

chr10:123237104-123237115

Template switch process:
F1: L CCAAACTCCATTTTTACTGACCATGCTAACACACACCCAAA 1
F3:                                                   4 GAGCATAATTCTTGTACGTTATTGCCACATACAGGATGTC R
RF:   CCAAACTCCATTTTTACTGACCATGCTAACACACACCCAAAAAATCCAAAGAGCATAATTCTTGTACGTTATTGCCACATACAGGATGTC
RR:   GGTTTGAGGTAAAAATGACTGGTACGATTGTGTGTGGGTTTTTTAGGTTTCTCGTATTAAGAACATGCAATAACGGTGTATGTCCTACAG
F2:                                               3 GTTTCTCGTATT 2


Unidirectional alignment (log-probability: -38.9)
 CAAACTCCATTTTTACTGACCATGCTAACACACAC---------CCAAAttatgctctttgGAGCATAATTCTTGTACGTTATTGCCACATACAGGATGTC
 CAAACTCCATTTTTACTGACCATGCTAACACACACCCAAAAAATCCAAA------------GAGCATAATTCTTGTACGTTATTGCCACATACAGGATGTC

Template switch alignment (log-probability: -16.8)
 CAAACTCCATTTTTACTGACCATGCTAACACACACCCAAA|TTATGCTCTTTG|GAGCATAATTCTTGTACGTTATTGCCACATACAGGATGTC
 CAAACTCCATTTTTACTGACCATGCTAACACACACCCAAA|TTATGCTCTTTG|GAGCATAATTCTTGTACGTTATTGCCACATACAGGATGTC

Update (10/01/2022)

The final value in the CSV output is now the log-probability of the template switch alignment with the *_1->M_2->*_3 transition subtracted (i.e. from any subscript 1 state to M_2, and from M_2 to any subscript 3 state), divided by the alignment length. Filter script updated accordingly. This may be helpful for creating a "per-base" log-probability threshold for significant template switch alignments, as it allows the TSA pairHMM per-base log-probabilities to be related to a sample of unidirectional alignment per-base log-probabilities (see S8 Fig in Walker et al. 2021).

conorwalker / tsa_pairhmm Goto Github PK

tsa_pairhmm's Introduction

TSA pairHMM: Template switch alignment pair hidden Markov model

Compilation

Typical usage

tsa_pairhmm's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent