Papet

This repository contains the PAcbio Prediction of Epigenetics Technology (Papet) software to model the PacBio CCS kinetic signal and predict epigenetics modifications from this kinetic signal.

Table of content

Content
Dependencies
Compilation
PacBio kinetics and epigenetics
Running Papet
5.1. model-kinetic
5.2. model-kinetic-txt
5.3. kinetics
5.4. kinetics-wig
5.5. kinetics-kmer
5.6. model-sequence
5.7. model-sequence-txt
5.8. loglikelihoods
5.9. predict
[Aknow]

Dependencies

Papet relies on the following third party libraries:

pthread, normally a system library on linux system.
boost v1.78 or higher
pbcopper v2.1.0 or higher
pbbam v2.0.0 (pbbam relies on pbcopper)
ngsaipp v1.6.1

We strongly recommend to install these libraries in /usr/local/lib and their corresponding header files in /usr/local/include. If done this way, cmake compilation configuration should work out of the box.

Compilation

The building process uses cmake. To compile and install papet, simply type:

./build.sh

The CMakeLists.txt file contains 3 variable defined at its top:

# user defined paths
## list of directories containing libraries headers
set(INCLUDE_DIRECTORIES "/usr/local/include/"
                        "/usr/local/include/ngsaipp")
## list of directories in which the required libraries are installed
set(LINK_DIRECTORY      "/usr/local/lib")
## path in which the final executable will be installed
set(INSTALL_DIRECTORY   "/usr/local/bin")

INCLUDE_DIRECTORIES a space separated list of directories in which the required third party library header files are located.
LINK_DIRECTORY a space separated list of directories in which all the required libraries are installed.
INSTALL_DIRECTORY the directory in which papet will be installed.

About PacBio kinetics and epigenetics

PacBio BAM file epignetics signal data handling, modelling as well as the epigenetics prediction computations are performed by ngsaipp. For an in-depth dive on this topic, please read the dedicated documention

Running Papet

Papet is a wrapper application that can run different commands to fullfil specific tasks. The general synthax is:

papet [command] [options]

Papet accepts the following options and commands:

short	long	description
-h	--help	Produces the help message
	--vaudois	Tastes quite good.
	model-kinetic	Creates kinetic signal models from CCSs.
	model-kinetic-txt	Dumps a kinetic signal model in txt format.
	kinetics	Extracts CCS kinetic information in txt format.
	kinetics-wig	Creates WIG tracks from CCSs.
	kinetics-kmer	Computes the per-kmer distribution of kinetic signal from CCSs.
	model-sequence	Creates DNA sequence kinetic signal models from CCSs.
	model-sequence-txt	Dumps a DNA sequence kinetic signal model in txt format.
	predict	Predicts the presence of epignetic modifications from CCSs.

model-kinetic

model-kinetic trains a kinetic signal model by computing the kinetic (IPD and PWD) signal distribution at each position inside windows of <N> bases. The signal distributions are learned by extracting the kinetic signal, from the reads, using <N> bases long windows centered on the given CpGs. The model is then serialized in the given file. The final results is a serialized instance derived from the KineticModel class (see ngsaipp KineticModel.hpp).

The synthax is:

papet model-kinetic [type] [options]

The exact type of kinetic signal model is defined using the first argument. The accepted values are:

raw: simply computes the distribution of IPD and PWD at each position in the window. The model is a serialized instance of the RawKineticModel class (see ngsaipp RawKineticModel.hpp).
raw-norm: computes the normalized IPD and PWD' distributions at each position in the window. The model is a serialized instance of the NormalizedKineticModel class (see ngsaipp NormalizedKineticModel.hpp).
diposition: computes the distributions of IPD and PWD signal at each position in the window as a function of its direct neighbour. The model is a serialized instance of the DiPositionKineticModel class (see ngsaipp DiPositionKineticModel.hpp).
diposition-norm: computes the distributions of normalized IPD and PWD signal at each position in the window as a function of its direct neighbor. The model is a serialized instance of the DiPositionNormalizedKineticModel class (see ngsaipp DiPositionNormalizedKineticModel.hpp).
pairwise: computes all distributions of IPD and PWD as a function of the signal at another position in the window. The model is a serialized instance of the PairWiseKineticModel class (see ngsaipp PairWiseKineticModel.hpp)
pairwise-norm: computes all distributions of normalized IPD and PWD as a function of the signal at another position in the window. The model is a serialized instance of the PairWiseNormalizedKineticModel class (see ngsaipp PairWiseNormalizedKineticModel.hpp)

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--bam	A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest.
	--bed	The path to the bed file containing the genomic coordinates of the CpGs of interest.
	--out	The path to file in which the kinetic model will be saved.
	--background	For normalized models only, the path to background model to use. It must contain a serialized KmerMap.
	--size	The size of the model, the length of the signal window to model, in bp
	--nbin	The number of bins in each histogram.
	--xmin	The lower limit of the lower bin in each histogram.
	--xmax	The upper limit of the upper bin in each histogram.
	--pseudocount	A number of counts that will be added to each bin in each histogram, by default 0.
	--thread	The number of threads, by default 1.

model-kinetic-txt

model-kinetic-txt converts a kinetic signal model file in tsv format and prints the results on stdout. If the model is a normalized one, the KmerMap is not tincluded in the tsv conversion.

The synthax is:

papet model-kinetic-txt [options] [> FILE]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--model	The path to the file containing the kinetic model to convert in tsv format.

kinetics

kinetics is an application to extract interpulse duration (IPDs) and pulse widths (PWs) from mapped PacBio CCS reads that overlap a given set of genomic regions specified in a BED 6 file. The results are printed on stdout in tsv format. The first row is a header. Then, each line contains per read sequence, IPDs and PWDs. The memory footprint of this progam is O(n), where n is the window size. The results are immediately printed and nothing is kept in memory.

The synthax is:

papet kinetics [options] [> FILE]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--bam arg	A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest.
	--bed arg	The path to the bed file containing the genomic regions of interest.
	--winSize	The size in bp of the windows from which the CCS features will be extracted. It must be odd. The window will be centered on the center of the genomic regions of interst.

kinetics-wig

kinetics-wig is an application that creates wig kinetic tracks from a set of PacBio CCS from from bam files, over a set of defined genomic windows. Exactly 4 tracks are created each time, corresponding to the IPD on forward strand, the IPD on the reverse strand, the PWD on the forward strand and the PWD on the reverse strand.

The synthax is:

papet kinetics-wig [options]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--bam arg	A coma separated list of paths to the bam files containing the mapped PacBio CCS from which the IPD/PWD track must be computed.
	--bed	The path to the bed file containing the genomic oordinates of the CpGs of interest.
	--out	A path prefix to use to write the results files. In total, 4 resulting files will be created with this prefix : _IPDfw.wig, _IPDrv.wig, _PWDfw.wig and _PWDrv.wig containing the IPD and PWD forward and reverse track respectively.
	--winSize	The size of the window (in bp) around the CpGs in which the average kinetic signal will be computed.

kinetics-kmer

kinetics-kmer is an application that computes the interpulse duration (IPDs) and pulse widths (PWs) value distribution for all kmers found in a PacBio CCS dataset. The kmer IPD / PWD value is defined has the IPD / PWD value corresponding to the central position of the kmer. The results are printed on stdout in tsv format. The first row is a header. Then, each line contains the results for one kmer. Each line contains the kmer sequence (column 1), the IPD distribution for this kmer (columns 2 to 954) amd the PWD distribution (columns 955 1907). IPD / PWD values can take any score in the [0,952] interval. Thus there are 953 columns for each IPD / PWD distribution. The 1st IPD / PWD related value (column 2 / 955) corresponds to the number of times an IPD / PWD value of 0 was observed for this kmer. The 2nd IPD / PWD value (column 3 / 956) corresponds to the number of time an IPD / PWD value of 1 was observed for this kmer, and so on until a score of 952 (column 953 / 1907). In the final table, not all possible kmers will be present in the table, only the ones found in the data. The memory footprint of this progam is O(4^k) where k is the size of the kmers.

The synthax is:

papet kinetics-kmer [options]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--bam arg	A coma separated list of paths to the bam files containing the PacBio CCS of interest.
	--kmer arg	The size of the kmer. It must be odd.

model-sequence

model-sequence is an application that computes the mean IPD and PWD per position signal for all possible kmer from a set of CCS reads (see Expected kinetic background). The final results is a serialized instance of KmerMap (see ngsaipp KmerMap.hpp). The serialized instance can then either be loaded in another program or converted in tsv format using model-sequence-txt. The memory footprint of this progam is O(4^k) where k is the size of the kmers. Note that overflow errors may occure if a given kmer is encountered too many times in the dataset. This can be the case if there are too many reads (unlikely) or if the kmer size is too small (experience shows that kmer of 3 leads to this issue). In any case, if the mean signal seems weird, this is a likely explanation.

The synthax is:

papet model-sequence [options]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--bam arg	A coma separated list of paths to the bam files containing the PacBio CCS of interest.
	--out arg	The path to the file in which the KmerMap will be dumped.
	--kmer arg	The size of the kmer. It must be odd.

model-sequence-txt

model-sequence-txt is an application that converts into tsv format a serialized KmerMap created by model-sequence. The results are printed on stdout. The first row contains headers, the following rows contains the kmer data. Each row contains the kmer sequence, the kmer number of occurences, k per position IPD means and k per position PWD means. The memory footprint of this progam is O(4^k) where k is the size of the kmers. Note that overflow errors may occure if a given kmer is encountered too many times in the dataset. This can be the case if there are too many reads (unlikely) or if the kmer size is too small (experience shows that kmer of 3 leads to this issue). In any case, if the mean signal seems weird, this is a likely explanation.

The synthax is:

papet model-sequence-txt [options] [>FILE]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--model arg	The path to the file containing the KmerMap to convert in tsv format.

loglikelihoods

reports the detailled contributions of each strand of each CCSs of a model loglikelihood computations. The output is streamed on stdout in tab-separated format. Each row contains the contributions of a single CCS for a given CpG strand. The first row is a header.

The program is given a set of CpG coordinates of interest, a set of CCS a model the kinetic signal likelihood from. One row in the results contains the individual contributions of a given CCS for a given CpG on a given strand (fw or rv). The contributions are composed of the IPD contributions followed by the PWD contributions. The contributions may corresponds to individual positions, for simple models, or to pairs of positions for pairwise or diposition models. The contributions are always given in a 5'->3' fashion from left to right columns.

Contributions from a given CCS,for both strands of a given CpG are reported as 2 rows. One for the fw strand and the second for the rv strand.

The synthax is:

papet loglikelihoods [options] [>FILE]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--bam	A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest.
	--bed	The path to a bed file containing the coordinates of the CpGs interest.
	--model	The path to the file containing the kinetic model of intrest.

predict

predict predicts the presence of epigenetic modifications at the CpG of interest given two models and returns the results on stdout in BED format. The fields are :

CpG chromosome
CpG starting position (C)
CpG ending position (past G)
name (empty)
CpG strand
CpG methylation probability
(optional, enabled with --ids) a coma separated list of CCSs names used to compute the methylation probability

The score field contains the probability of the presence of an epigenetic modification, the

The synthax is:

papet predict [options] [>FILE]

This program has the following options :

short	long	description
-h	--help	Produces the help message
	--bam	A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest.
	--bed	The path to a bed file containing the coordinates of the CpGs interest.
	--modelMeth	The path to the file containing the methylated kinetic model to use.
	--modelUnmeth	The path to the file containing the unmethylated kinetic model to use.
	--prob	The prior probability of methylation for any CpG. It must belong to [0,1]. 0.5 by default.
	--thread	The number of threads, by default 1.

This program requires that the input bam and bed files have been ordered by chromosome and position using samtools sort and sort -k1,1V -k2,2n -k3,3n respectively!

Acknowledgments

Richard Hall and Pacific Biosciences for the discussions and the shared data.

Authors

Romain Groux

ngs-ai-org / papet Goto Github PK

papet's Introduction

Papet

Table of content

Dependencies

Compilation

About PacBio kinetics and epigenetics

Running Papet

model-kinetic

model-kinetic-txt

kinetics

kinetics-wig

kinetics-kmer

model-sequence

model-sequence-txt

loglikelihoods

predict

Acknowledgments

Authors

papet's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org