ulelab / peka Goto Github PK

Find motifs enriched around prominent crosslinks

License: GNU General Public License v3.0

Python 98.56% Shell 1.44%

bioinformatics bioinformatics-analysis bioinformatics-scripts

peka's Introduction

PEKA

Positionally-enriched k-mer analysis (PEKA) is a software package for identifying enriched protein-RNA binding motifs from CLIP datasets. PEKA compares k-mer enrichment in proximity of high-confidence crosslink sites (tXn - thresholded crosslinks), located within crosslinking peaks and having a high cDNA count, relative to low-count crosslink sites located outside of peaks (oXn - outside crosslinks). This approach reduces the effects of technical biases, such as uridine-preference of UV crosslinking. Each k-mer is assigned a PEKA score, which is used to rank the k-mers from the most to the least enriched. Additionally, PEKA provides comprehensive visualisations of motif enrichment profiles around the high-confidence crosslink sites and clusters the motifs that display similar profiles. PEKA also enables motif discovery within specific transcriptomic regions, including or excluding repetitive elements.

To interactively explore PEKA applied to all ENCODE eCLIP data, visit flow.bio. For a more detailed description of PEKA method please refer to our paper:

Kuret, K., Amalietti, A.G., Jones, D.M. et al. Positional motif analysis reveals the extent of specificity of protein-RNA interactions observed by CLIP. Genome Biol 23, 191 (2022). https://doi.org/10.1186/s13059-022-02755-2.

Authors of PEKA source code: [email protected], [email protected]

Dependencies:

python
matplotlib
numpy
pybedtools
scipy
seaborn
plumbum
scikit-learn
pandas
textdistance

Set up

We recommend running PEKA in a conda environment so all the dependencies are managed for you, to set this up run download the environment file and follow the steps below:

# Create the environmnet
conda env create -f environment.yml

# Activate the environment
conda activate peka

# Install the latest version of PEKA from the main branch
pip install git+https://github.com/ulelab/peka@main

Usage

usage: peka.py [-h] -i INPUTPEAKS -x INPUTXLSITES -g GENOMEFASTA -gi GENOMEINDEX -r REGIONS \
               [-k [{3,4,5,6,7}]] \
               [-o [OUTPUTPATH]] \
               [-w [WINDOW]] \
               [-dw [DISTALWINDOW]] \
               [-t [TOPN]] \
               [-p [PERCENTILE]] \
               [-c [CLUSTERS]] \
               [-s [SMOOTHING]] \
               [-re [{remove_repeats,masked,unmasked,repeats_only}]] \
               [-pos [RELEVANT_POS_THRESHOLD] | -relax {True,False}] \
               [-a {True,False}] \
               [-sr {genome,whole_gene,intron,UTR3,other_exon,ncRNA,intergenic}] \
               [-sub {True,False}] \
               [-seed {True,False}]

You can check your installation by going to the TestData directory and running the run_peka.sh script within the activated peka environment. This script uses a small test dataset found in TestData/inputs.

🔴 IMPORTANT NOTE! When you run PEKA on your data, make sure all the required inputs, i.e. bed files, genome in fasta format, genome index and regions file, follow the same naming convention for chromosome names. Either all files must use the UCSC (GENCODE) naming convention, which prefixes chromosome names with "chr" ("chr1", ..., "chrM") or all files should use Ensembl naming convention ("1", ..., "MT").

required arguments:
  -i INPUTPEAKS, --inputpeaks INPUTPEAKS
                        CLIP peaks (intervals of crosslinks) in BED file
                        format
  -x INPUTXLSITES, --inputxlsites INPUTXLSITES
                        CLIP crosslinks in BED file format
  -g GENOMEFASTA, --genomefasta GENOMEFASTA
                        genome fasta file, ideally the same as was used for
                        read alignment
  -gi GENOMEINDEX, --genomeindex GENOMEINDEX
                        genome fasta index file (.fai)
  -r REGIONS, --regions REGIONS
                        genome segmentation file produced as output of "iCount
                        segment" function

optional arguments:
  -h, --help            show this help message and exit
  -k [{3,4,5,6,7}], --kmerlength [{3,4,5,6,7}]
                        kmer length [DEFAULT 5]
  -o [OUTPUTPATH], --outputpath [OUTPUTPATH]
                        output folder [DEFAULT current directory]
  -w [WINDOW], --window [WINDOW]
                        window around thresholded crosslinks for finding
                        enriched kmers [DEFAULT 20]
  -dw [DISTALWINDOW], --distalwindow [DISTALWINDOW]
                        window around enriched kmers to calculate distribution
                        [DEFAULT 150]
  -t [TOPN], --topn [TOPN]
                        number of kmers ranked by z-score in descending order
                        for clustering and plotting [DEFAULT 20]
  -p [PERCENTILE], --percentile [PERCENTILE]
                        Percentile for considering thresholded crosslinks.
                        Accepts a value between 0 and 1 [0, 1). Percentile 0.7
                        means that a cDNA count threshold is determined at
                        which >=70 percent of the crosslink sites within the
                        region have a cDNA count equal or below the threshold.
                        Thresholded crosslinks have cDNA count above the
                        threshold. [DEFAULT 0.7]
  -c [CLUSTERS], --clusters [CLUSTERS]
                        how many enriched kmers to cluster and plot [DEFAULT
                        5]
  -s [SMOOTHING], --smoothing [SMOOTHING]
                        window used for smoothing kmer positional distribution
                        curves [DEFAULT 6]
  -re [{remove_repeats,masked,unmasked,repeats_only}], --repeats [{remove_repeats,masked,unmasked,repeats_only}]
                        how to treat repeating regions within genome (options:
                        "masked", "unmasked", "repeats_only",
                        "remove_repeats"). When applying any of the options
                        with the exception of repeats == "unmasked", a genome
                        with soft-masked repeat sequences should be used for
                        input, ie. repeats in lowercase letters. [DEFAULT
                        "unmasked"]
  -pos [RELEVANT_POS_THRESHOLD], --relevant_pos_threshold [RELEVANT_POS_THRESHOLD]
                        Percentile to set as threshold for relevant positions.
                        Accepted values are floats between 0 and 99 [0, 99].
                        If threshold is set to 0 then all positions within the
                        set window (-w, default 20 nt) will be considered for
                        enrichment calculation. If threshold is not zero, it
                        will be used to determine relevant positions for
                        enrichment calculation for each k-mer. If the -pos
                        option is not set, then the threshold will be
                        automatically assigned based on k-mer lengthand number
                        of crosslinks in region.
  -relax {True,False}, --relaxed_prtxn {True,False}
                        Whether to relax automatically calculated prtxn
                        threshold or not. Can be 'True' or 'False', default is
                        'True'. If 'True', more positions will be included for
                        PEKA-score calculation across k-mers. Using relaxed
                        threshold is recommended, unless you have data of very
                        high complexity and are using k-mer length <=5. This
                        argument can't be used together with -pos argument,
                        which sets a user-defined threshold for relevant
                        positions. [DEFAULT True]
  -a {True,False}, --alloutputs {True,False}
                        controls the number of outputs, can be True or False
                        [DEFAULT False]
  -sr {genome,whole_gene,intron,UTR3,other_exon,ncRNA,intergenic}, --specificregion {genome,whole_gene,intron,UTR3,other_exon,ncRNA,intergenic}
                        choose to run PEKA on a specific region only, to
                        specify multiple regions enter them space separated
                        [DEFAULT None]
  -sub {True,False}, --subsample {True,False}
                        if the crosslinks file is very large, they can be
                        subsampled to reduce runtime, can be True/False
                        [DEFAULT True]
  -seed {True,False}, --set_seeds {True,False}
                        If you want to ensure reproducibility of results the
                        option set_seeds must be set to True. Can be True or
                        False [DEFAULT True]. Note that setting seeds reduces
                        the randomness of background sampling.

Common issues

The script needs writing permission in the staging directory to save results and make an environment variable TMPDIR for temporary files. If you get KeyError: 'TMPDIR' a solution would be to type export TMPDIR=<path_to_folder> in terminal where you want to run the script.

Outputs

The default outputs produced by PEKA for each specified genomic region are:

A pdf file with graphs showing k-mer occurrence distributions around thresholded crosslink sites for top n most enriched k-mers in the region spannig -50...50 nt around thersholded crosslink sites. K-mers are clustered based on their sequence and distributions.
- NOTE: Even though top k-mers are selected with respect to the user-defined window, the graphs of top k-mers are not scaled to the selected window and are always plotted in the -50...50 nt range.
- y-axis on the plots denotes a % of thresholded crosslinks for which a particular k-mer occurs at a specified position (y-axis).
- NOTE: If clustering could not be optimized to yield less or equal to user-defined number of clusters, the number of clusters will be determined automatically by the algorithm.
A tsv file with summed occurrence distributions of k-mers within defined clusters. Distributions in this file correspond to the curves on the last plot in the .pdf file.
A tsv file with calculated PEKA score and occurrence distribution for all possible k-mers in the window -48 to +50 around thresholded crosslinks.
- This file also contains several other values that are calculated for a particular k-mer during the analysis, such as:
  - mtxn : position of occurrence distribution max peak
  - prtxn : positions used for calculating motif enrichment relative to sampled background sequences
  - DtXn : average k-mer occurence in a distal window around thresholded crosslink sites
  - artxn : average k-mer relative occurrence at prtxn positions around thresholded crosslink sites
  - aroxn : average k-mer relative occurrence at prtxn positions around background crosslink sites
  - etxn : log2 relative k-mer enrichment, calculated as log2(artxn/aroxn)
  - p-value : a p-value for each k-mer is obtained under assumption that a distribution of obtained PEKA-scores is normal. K-mers with no PEKA-score are omitted from distribution. A distribution of PEKA-scores is converted to distribution of z-scores and p-value is obtained as area under the resulting standard Gaussian probability density function.
    🔴 CAUTION! The distribution of PEKA-score is not necessarily normal. If using p-value to gauge significance, make sure that enough k-mers indeed get the PEKA-score and that they are normally distributed. This can be achieved by lowering k-mer length or by decreasing the threshold for relevant positions, using the -pos argument.

If the option ALLOUTPUTS is set to True, the following files are also outputted for each specified genomic region:

a bed file of thresholded crosslink sites
a bed file with background crosslink sites (oxn)
a tsv file with relative occurrence distribution (rtxn) for all possible k-mers. Relative occurrences are obtained by dividing raw occurrences with the average k-mer occurrence in the distal region (DtXn).
A csv file with clusters of top n k-mers.

Other outputs:

A tsv file with saved run parameters

Plotting k-mer relative occurrences in heatmap format (preprint)

To produce k-mer relative occurrence heatmaps as shown in Figure 1g of our paper https://doi.org/10.1186/s13059-022-02755-2, use the script plotRelativeOccurrenceHeatmap.py in peka environment as follows:

# Example of run command
python3 ./src/plotRelativeOccurrenceHeatmap.py \
PathToKmerOccurrenceTable \
PathToKmerRtxnTable \
PathToOutputDirectory \
WindowAroundCrosslinkSite\
NumberOfTopKmers

Arguments

PathToKmerOccurrenceTable - file with k-mer PEKA-scores 5mer_distribution_whole_gene.tsv
PathToKmerRtxnTable - file with k-mer relative occurrences 5mer_rtxn_whole_gene.tsv
PathToOutputDirectory - location to which files will be saved
WindowAroundCrosslinkSite - n nucleotides up and downstream from crosslink to plot
NumberOfTopKmers - n top k-mers to plot

To run on test data head to TestData/RelativeOccurrenceHeatmap

peka's People

Contributors

Stargazers

Watchers

Forkers

almiheenko

peka's Issues

Question on how to make use of CLIP-Seq biological replicates?

Hi @kkuret
Thank for your great work and related bioRxiv paper. I am learning to analysis our peaks and XLS by your peka software. I have used iCounts to perform the eCLIP-Seq analysis. But I am wondering how to make use of the replicates. We have designed three replicates. Should I use peka to identify from rep1 associated cluster and peaks to identify motifs. Then intersect motifs from Rep1, Rep2, and Rep3. Or should I merged the peaks and sum or mean the raw XLS bed files to perform a single motif enrichment process. Thank you so much.
Linhua

Puzzle about the Figure 1g about Heatmaps showing relative occurrences (RtXn) and PEKA-scores for top 40 k-mers

@kkuret Hi:
I am trying the peka to performing motif identification from our in-house generated eCLIP-Seq in plants. I found peka is more suitable to peaks identified from CLIP-Seq (unlike those peaks from ChIP-Seq, usually people use MEME-ChIP or HOMER to identify motifs) based on your bioRxiv preprint. Thanks for your great tool!
I have tested peka on my data and generated a series of output. A file with suffix '*5mer_distribution_whole_gene.tsv' seems to contain the information like your Figure 1g. I want convert the table into heatmap to better understand your paper. But I am confused about the kmer seqence in the left part of the heatmap with first column in tsv file. How to convert the V1 column into the left part of heatmap. I also choose the top 20 ranked rows based on peka-score.

Another question is how to present the motif enrichment results of CLIP-Seq like those results from ChIP-Seq in a typical experiments centered paper? like A in https://iiif.elifesciences.org/lax/53278%2Felife-53278-fig6-v2.tif/full/1500,/0/default.jpg Do you have any suggestions? Thanks a lot. I am not use what value to show the significance (peka score?).

Working with cross-link sites identified using iCount with overlapping indicies

I have been running iCount xlsites to identify cross-link sites using read quantification in my eCLIP sample replicates for PEKA. I selected read quantification as the input BAM files have been UMI-pruned, and the bam files PCR-deduplicated. I have noticed several overlapping indices between identified cross-link sites in each replicate. I have considered using other cross-link site detection such as htseq-clip. However, there isn't a column for cDNA numbers. How would you recommend I deal with the overlapping sites?

ValueError: Overlapping IntervalIndex is not accepted.

I downlowd xl and peak file from https://imaps.goodwright.com/collections/868/ and run: peka -i tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single.bed1 -x tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single_peaks.bed1 -g $genome -gi $gfai -r $segment -k 6
But got error:
Namespace(alloutputs=False, clusters=5, distalwindow=150, genomefasta='/mnt/1/genome/hg38/hg38.fa', genomeindex='/mnt/1/genome/hg38/hg38.fa.fai', inputpeaks='tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single.bed1', inputxlsites='tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single_peaks.bed1', kmerlength=6, outputpath='/mnt/9/yuan_jianwen/scTRIBE_project/09.fastaMotif/00.rmPseudo/test', percentile=0.7, regions='/mnt/12/yuan_jianwen/hg38/Homo_sapiens.GRCh38.103.segment.gtf.gz', repeats='unmasked', smoothing=6, specificregion=None, subsample=True, topn=20, window=25)
Getting thresholded crosslinks
Thresholding intron
lenght of df_reg for intron is: 70328
Traceback (most recent call last):
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka", line 8, in
sys.exit(cli())
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 1317, in cli
subsample,
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 1002, in run
df_txn = get_threshold_sites(sites_file, percentile=percentile)
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 445, in get_threshold_sites
df_cut = cut_sites_with_region(df_reg, df_region)
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 376, in cut_sites_with_region
df_temp = cut_per_chrom(chrom, df_p, df_m, df_region_p, df_region_m)
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 363, in cut_per_chrom
df_xl_p["cut"] = pd.cut(df_xl_p["start"], interval_index_p)
File "/home/yuan_jianwen/anaconda3/envs/peka/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 226, in cut
raise ValueError('Overlapping IntervalIndex is not accepted.')
ValueError: Overlapping IntervalIndex is not accepted.

test data works, but own data gives: ValueError: Overlapping IntervalIndex is not accepted.

Hi there

Thanks for the program. I installed it and running the test data works fine. Moving to my own data (the first is what I understood from the documentation but all fail):

peka -i iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf
peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf
peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf

gives me an error:

Getting thresholded crosslinks
Thresholding intron
lenght of df_reg for intron is: 1038414
Traceback (most recent call last):
File "/home/name/miniconda3/envs/peka/bin/peka", line 8, in
sys.exit(main())
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1462, in main
set_seeds
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1051, in run
df_txn = get_threshold_sites(sites_file, percentile=percentile)
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 497, in get_threshold_sites
df_cut = cut_sites_with_region(df_reg, df_region)
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 422, in cut_sites_with_region
df_temp = cut_per_chrom(chrom, df_p, df_m, df_region_p, df_region_m)
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 409, in cut_per_chrom
df_xl_p["cut"] = pd.cut(df_xl_p["start"], interval_index_p)
File "/home/name/miniconda3/envs/peka/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 226, in cut
raise ValueError('Overlapping IntervalIndex is not accepted.')
ValueError: Overlapping IntervalIndex is not accepted.

Chromosome names and sorting seem to match and the files are iCount outputs (I removed some chromosomes and resorted while testing, but the original files also did not work). I uploaded my files:

https://e.pcloud.link/publink/show?code=XZCJGeZ4hXmsgv6hmLEf71rt8yzRhSgzqWk

What's wrong? :)

PEKA on intronless genomes

When I try to run PEKA I get this message:

"Getting thresholded crosslinks
Thresholding intron
Not able to find any thresholded sites in your sample (NoneType). Exiting."

I have tried this with iCount xlsites and both iCount peak and Clippy peaks as inputs. Is this because my GTF has no annotated introns? I have a custom GTF file with only gene on 3rd collumn.

Hi, can you provide sample data to see whether it works or not? Very appreciate for that.

Example input files?

Great work!

Trying to run PEKA on a dataset we have here, would it be possible to get an example dataset complete with small example files? (see below for requirements). Would greatly help to see exactly what kind of input (format) is accepted / required, thanks!, Gregor

required arguments:
  -i INPUTPEAKS, --inputpeaks INPUTPEAKS
                        CLIP peaks (intervals of crosslinks) in BED file
                        format
  -x INPUTXLSITES, --inputxlsites INPUTXLSITES
                        CLIP crosslinks in BED file format
  -g GENOMEFASTA, --genomefasta GENOMEFASTA
                        genome fasta file, ideally the same as was used for
                        read alignment
  -gi GENOMEINDEX, --genomeindex GENOMEINDEX
                        genome fasta index file (.fai)
  -r REGIONS, --regions REGIONS
                        genome segmentation file produced as output of "iCount
                        segment" function

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.