ulelab / peka
Find motifs enriched around prominent crosslinks
License: GNU General Public License v3.0
I have been running iCount xlsites to identify cross-link sites using read quantification in my eCLIP sample replicates for PEKA. I selected read quantification because the input BAM files have already been UMI-pruned and PCR-deduplicated. I have noticed several overlapping indices between the cross-link sites identified in each replicate. I have considered using another cross-link site detection tool such as htseq-clip; however, its output has no column for cDNA counts. How would you recommend I deal with the overlapping sites?
@kkuret Hi:
I am trying peka to perform motif identification on our in-house eCLIP-Seq data from plants. Based on your bioRxiv preprint, peka seems more suitable for peaks identified from CLIP-Seq (for ChIP-Seq peaks, people usually use MEME-ChIP or HOMER to identify motifs). Thanks for your great tool!
I have tested peka on my data and generated a series of output files. A file with the suffix '*5mer_distribution_whole_gene.tsv' seems to contain information like that shown in your Figure 1g. I want to convert the table into a heatmap to better understand your paper, but I am confused about how the k-mer sequences on the left side of the heatmap relate to the first column of the TSV file. How do I convert the V1 column into the left part of the heatmap? I also chose the top 20 ranked rows based on PEKA-score.
Another question: how should motif enrichment results from CLIP-Seq be presented in a typical experiment-centered paper, the way ChIP-Seq results usually are, as in panel A of https://iiif.elifesciences.org/lax/53278%2Felife-53278-fig6-v2.tif/full/1500,/0/default.jpg? Do you have any suggestions? I am not sure which value to use to show significance (the PEKA score?). Thanks a lot.
Hi there
Thanks for the program. I installed it, and running the test data works fine. Moving on to my own data (the first command is what I understood from the documentation, but all of them fail):
peka -i iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf
peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf
peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf
gives me an error:
Getting thresholded crosslinks
Thresholding intron
lenght of df_reg for intron is: 1038414
Traceback (most recent call last):
  File "/home/name/miniconda3/envs/peka/bin/peka", line 8, in <module>
    sys.exit(main())
  File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1462, in main
    set_seeds
  File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1051, in run
    df_txn = get_threshold_sites(sites_file, percentile=percentile)
  File "/home/name/miniconda3/envs/peka/bin/peka.py", line 497, in get_threshold_sites
    df_cut = cut_sites_with_region(df_reg, df_region)
  File "/home/name/miniconda3/envs/peka/bin/peka.py", line 422, in cut_sites_with_region
    df_temp = cut_per_chrom(chrom, df_p, df_m, df_region_p, df_region_m)
  File "/home/name/miniconda3/envs/peka/bin/peka.py", line 409, in cut_per_chrom
    df_xl_p["cut"] = pd.cut(df_xl_p["start"], interval_index_p)
  File "/home/name/miniconda3/envs/peka/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 226, in cut
    raise ValueError('Overlapping IntervalIndex is not accepted.')
ValueError: Overlapping IntervalIndex is not accepted.
Chromosome names and sorting seem to match and the files are iCount outputs (I removed some chromosomes and resorted while testing, but the original files also did not work). I uploaded my files:
https://e.pcloud.link/publink/show?code=XZCJGeZ4hXmsgv6hmLEf71rt8yzRhSgzqWk
What's wrong? :)
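The traceback above ends in pandas' `pd.cut`, which raises this `ValueError` whenever the `IntervalIndex` passed as bins contains overlapping intervals. Below is a minimal sketch of how one might look for such overlaps in a table of regions. It assumes the overlapping intervals come from the regions file given with `-r`, which the traceback alone does not prove. The helper `find_overlaps`, the column names, and the toy table are hypothetical and not part of peka.

```python
import pandas as pd


def find_overlaps(regions: pd.DataFrame) -> pd.DataFrame:
    """Return rows that overlap an earlier interval on the same chrom and strand.

    Intervals are treated as half-open [start, end), as in BED files.
    """
    ordered = regions.sort_values(["chrom", "strand", "start", "end"]).reset_index(drop=True)
    # Largest end coordinate seen so far in each chrom/strand group,
    # shifted so that the current row is not compared with itself.
    prev_max_end = ordered.groupby(["chrom", "strand"])["end"].transform(
        lambda ends: ends.cummax().shift()
    )
    return ordered[ordered["start"] < prev_max_end]


# Toy regions (hypothetical): the third interval overlaps the second.
regions = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr1"],
        "start": [0, 100, 150],
        "end": [100, 200, 300],
        "strand": ["+", "+", "+"],
    }
)
overlaps = find_overlaps(regions)
print(overlaps)

# The same toy intervals reproduce the error message from the traceback.
bins = pd.IntervalIndex.from_arrays(regions["start"], regions["end"], closed="left")
try:
    pd.cut(pd.Series([120, 160]), bins)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(bins.is_overlapping, error_message)
```

If a check like this reports overlapping rows for one region type, those rows are a reasonable place to start looking. An empty result would suggest the overlap comes from somewhere else.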
When I try to run PEKA I get this message:
"Getting thresholded crosslinks
Thresholding intron
Not able to find any thresholded sites in your sample (NoneType). Exiting."
I have tried this with iCount xlsites and with both iCount peaks and Clippy peaks as inputs. Is this because my GTF has no annotated introns? I have a custom GTF file with only gene in the 3rd column.
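The log stops right after "Thresholding intron", so one thing worth checking is which feature types the regions file actually contains. Below is a minimal sketch, assuming a standard tab-separated GTF with the feature type in the third column. Whether PEKA needs `intron` entries there is an inference from the log message, not something confirmed here. The helper `count_gtf_features` and the toy file are hypothetical.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_gtf_features(path):
    """Count the values in the third (feature) column of a GTF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts


# Toy file (hypothetical) standing in for a custom GTF that only has "gene" rows.
toy_gtf = (
    'chr1\tcustom\tgene\t1\t1000\t.\t+\t.\tgene_id "g1";\n'
    'chr1\tcustom\tgene\t2000\t3000\t.\t-\t.\tgene_id "g2";\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".gtf", delete=False) as tmp:
    tmp.write(toy_gtf)

counts = count_gtf_features(tmp.name)
os.remove(tmp.name)

print(dict(counts))
has_intron = "intron" in counts
print("intron rows present:", has_intron)
```

Running the same function on the segmentation shipped with the test data and on the custom GTF would show whether the two files list the same feature types.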
I downloaded the crosslink (xl) and peak files from https://imaps.goodwright.com/collections/868/ and ran:
peka -i tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single.bed1 -x tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single_peaks.bed1 -g $genome -gi $gfai -r $segment -k 6
But I got an error:
Namespace(alloutputs=False, clusters=5, distalwindow=150, genomefasta='/mnt/1/genome/hg38/hg38.fa', genomeindex='/mnt/1/genome/hg38/hg38.fa.fai', inputpeaks='tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single.bed1', inputxlsites='tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single_peaks.bed1', kmerlength=6, outputpath='/mnt/9/yuan_jianwen/scTRIBE_project/09.fastaMotif/00.rmPseudo/test', percentile=0.7, regions='/mnt/12/yuan_jianwen/hg38/Homo_sapiens.GRCh38.103.segment.gtf.gz', repeats='unmasked', smoothing=6, specificregion=None, subsample=True, topn=20, window=25)
Getting thresholded crosslinks
Thresholding intron
lenght of df_reg for intron is: 70328
Traceback (most recent call last):
  File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka", line 8, in <module>
    sys.exit(cli())
  File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 1317, in cli
    subsample,
  File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 1002, in run
    df_txn = get_threshold_sites(sites_file, percentile=percentile)
  File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 445, in get_threshold_sites
    df_cut = cut_sites_with_region(df_reg, df_region)
  File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 376, in cut_sites_with_region
    df_temp = cut_per_chrom(chrom, df_p, df_m, df_region_p, df_region_m)
  File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 363, in cut_per_chrom
    df_xl_p["cut"] = pd.cut(df_xl_p["start"], interval_index_p)
  File "/home/yuan_jianwen/anaconda3/envs/peka/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 226, in cut
    raise ValueError('Overlapping IntervalIndex is not accepted.')
ValueError: Overlapping IntervalIndex is not accepted.
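In the Namespace line above, `inputpeaks` points at the file without `_peaks` in its name and `inputxlsites` points at the `_peaks` file, so it may be worth checking which file holds which kind of interval. Below is a minimal sketch, assuming both files are headerless BED6 and that crosslink sites are single-nucleotide intervals while peaks are wider. Both assumptions should be checked against the actual files. The helper `summarize_bed_widths` and the toy data are hypothetical, and this check may be unrelated to the error itself.

```python
import io

import pandas as pd

BED_COLUMNS = ["chrom", "start", "end", "name", "score", "strand"]


def summarize_bed_widths(path_or_buffer):
    """Summarize interval widths in a headerless BED6 file (path or file-like)."""
    bed = pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        comment="#",
        usecols=range(6),
        names=BED_COLUMNS,
    )
    widths = bed["end"] - bed["start"]
    return {
        "n_intervals": len(bed),
        "median_width": float(widths.median()),
        "fraction_single_nt": float((widths == 1).mean()),
    }


# Toy inputs (hypothetical): two single-nucleotide sites and one wider peak.
toy_xl = io.StringIO("chr1\t10\t11\t.\t3\t+\nchr1\t20\t21\t.\t1\t-\n")
toy_peaks = io.StringIO("chr1\t5\t30\t.\t4\t+\n")

xl_summary = summarize_bed_widths(toy_xl)
peak_summary = summarize_bed_widths(toy_peaks)
print(xl_summary)
print(peak_summary)
```

A file in which nearly every interval is one nucleotide wide is more likely the crosslink file, and the help text lists `-x` as the option for crosslinks and `-i` as the option for peaks.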
Great work!
I am trying to run PEKA on a dataset we have here. Would it be possible to get an example dataset with small example files (see the requirements below)? It would greatly help to see exactly what input format is accepted or required. Thanks! Gregor
required arguments:
  -i INPUTPEAKS, --inputpeaks INPUTPEAKS
                        CLIP peaks (intervals of crosslinks) in BED file format
  -x INPUTXLSITES, --inputxlsites INPUTXLSITES
                        CLIP crosslinks in BED file format
  -g GENOMEFASTA, --genomefasta GENOMEFASTA
                        genome fasta file, ideally the same as was used for read alignment
  -gi GENOMEINDEX, --genomeindex GENOMEINDEX
                        genome fasta index file (.fai)
  -r REGIONS, --regions REGIONS
                        genome segmentation file produced as output of "iCount segment" function
Hi @kkuret
Thank you for your great work and the related bioRxiv paper. I am learning to analyze our peaks and crosslink sites (XLS) with your peka software. I used iCount for the eCLIP-Seq analysis, but I am wondering how to make use of the replicates. We have three replicates. Should I run peka on the clusters and peaks from each replicate separately and then intersect the motifs found in Rep1, Rep2, and Rep3? Or should I merge the peaks and sum or average the raw XLS BED files, then run a single motif enrichment analysis? Thank you so much.
Linhua
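One of the two options raised above, summing the raw crosslink BED files across replicates, can be sketched as follows. This assumes each replicate is a BED6 table whose `score` column holds the cDNA count at a site. The helper `sum_crosslinks` and the toy replicates are hypothetical, and the sketch does not settle whether pooling is preferable to running each replicate separately.

```python
import pandas as pd

BED_COLUMNS = ["chrom", "start", "end", "name", "score", "strand"]


def sum_crosslinks(replicates):
    """Sum scores of crosslink sites that share chrom, start, end and strand."""
    combined = pd.concat(replicates, ignore_index=True)
    summed = combined.groupby(
        ["chrom", "start", "end", "strand"], as_index=False, sort=True
    )["score"].sum()
    summed["name"] = "."  # placeholder name, as the original names are not kept
    return summed[BED_COLUMNS]


# Toy replicates (hypothetical): the site chr1:10-11(+) occurs in both.
rep1 = pd.DataFrame(
    [["chr1", 10, 11, ".", 3, "+"], ["chr1", 50, 51, ".", 1, "-"]],
    columns=BED_COLUMNS,
)
rep2 = pd.DataFrame(
    [["chr1", 10, 11, ".", 2, "+"], ["chr2", 7, 8, ".", 4, "+"]],
    columns=BED_COLUMNS,
)

pooled = sum_crosslinks([rep1, rep2])
print(pooled)
```

Writing the result with `pooled.to_csv(path, sep="\t", header=False, index=False)` gives a single headerless BED file that could be passed on as the crosslink input.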