ulelab / peka

Find motifs enriched around prominent crosslinks

License: GNU General Public License v3.0

Python 98.56% Shell 1.44%

bioinformatics bioinformatics-analysis bioinformatics-scripts

peka's Issues

Working with cross-link sites identified using iCount with overlapping indices

I have been running iCount xlsites to identify cross-link sites in my eCLIP sample replicates for PEKA, using read quantification. I selected read quantification because the input BAM files have been UMI-pruned and PCR-deduplicated. I have noticed several overlapping indices between the cross-link sites identified in each replicate. I have considered using another cross-link site detection tool, such as htseq-clip, but its output has no column for cDNA counts. How would you recommend I deal with the overlapping sites?
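One way to see how many positions are affected before deciding what to do with them is to count how often each position occurs in a cross-link file. The sketch below is not part of PEKA or iCount. It assumes a tab-separated BED6 layout (chrom, start, end, name, score, strand), and the function name `find_repeated_sites` and the example rows are hypothetical.

```python
from collections import Counter


def find_repeated_sites(bed_lines):
    """Return {(chrom, start, end, strand): count} for positions listed more than once.

    Assumes tab-separated BED6 rows; header-like lines are skipped.
    """
    counts = Counter()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, _name, _score, strand = line.rstrip("\n").split("\t")[:6]
        counts[(chrom, int(start), int(end), strand)] += 1
    return {site: n for site, n in counts.items() if n > 1}


# Hypothetical rows, not real data.
example = [
    "chr1\t100\t101\t.\t3\t+\n",
    "chr1\t100\t101\t.\t2\t+\n",
    "chr1\t100\t101\t.\t1\t-\n",
    "chr2\t50\t51\t.\t4\t+\n",
]
repeated = find_repeated_sites(example)
print(repeated)
```

For a real file, the same function can be called on an open file handle, for example `find_repeated_sites(open(path))`.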

Hi, could you provide sample data so I can check whether it works? I would really appreciate it.

Question about Figure 1g: heatmaps showing relative occurrences (RtXn) and PEKA-scores for the top 40 k-mers

Hi @kkuret,
I am trying PEKA to identify motifs in our in-house eCLIP-Seq data from plants. Based on your bioRxiv preprint, PEKA seems better suited to peaks identified from CLIP-Seq (unlike peaks from ChIP-Seq, where people usually use MEME-ChIP or HOMER to identify motifs). Thanks for your great tool!
I have tested PEKA on my data and generated a series of output files. A file with the suffix '*5mer_distribution_whole_gene.tsv' seems to contain information like that in your Figure 1g. I want to convert the table into a heatmap to better understand your paper, but I am confused about how the k-mer sequences on the left side of the heatmap relate to the first column of the TSV file. How do I convert the V1 column into the left part of the heatmap? I also selected the top 20 rows ranked by PEKA-score.

Another question is how to present CLIP-Seq motif enrichment results the way ChIP-Seq results are shown in a typical experiment-centered paper, like panel A in https://iiif.elifesciences.org/lax/53278%2Felife-53278-fig6-v2.tif/full/1500,/0/default.jpg Do you have any suggestions? I am not sure which value to use to show significance (the PEKA-score?). Thanks a lot.

test data works, but own data gives: ValueError: Overlapping IntervalIndex is not accepted.

Hi there

Thanks for the program. I installed it, and running the test data works fine. With my own data, all of the following commands fail (the first is what I understood from the documentation):

peka -i iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf
peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.cDNA_unique.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf
peka -i iCount.deDupBCQFdemux_barcode_RT6.clusters.forPeka.bed -x iCount.deDupBCQFdemux_barcode_RT6.peaks.forPeka.bed -g refGenome.fasta -gi refGenome.fasta.fai -r refGenome.segs.forPeka.gtf

gives me an error:

Getting thresholded crosslinks
Thresholding intron
lenght of df_reg for intron is: 1038414
Traceback (most recent call last):
File "/home/name/miniconda3/envs/peka/bin/peka", line 8, in <module>
sys.exit(main())
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1462, in main
set_seeds
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 1051, in run
df_txn = get_threshold_sites(sites_file, percentile=percentile)
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 497, in get_threshold_sites
df_cut = cut_sites_with_region(df_reg, df_region)
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 422, in cut_sites_with_region
df_temp = cut_per_chrom(chrom, df_p, df_m, df_region_p, df_region_m)
File "/home/name/miniconda3/envs/peka/bin/peka.py", line 409, in cut_per_chrom
df_xl_p["cut"] = pd.cut(df_xl_p["start"], interval_index_p)
File "/home/name/miniconda3/envs/peka/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 226, in cut
raise ValueError('Overlapping IntervalIndex is not accepted.')
ValueError: Overlapping IntervalIndex is not accepted.

Chromosome names and sorting seem to match and the files are iCount outputs (I removed some chromosomes and resorted while testing, but the original files also did not work). I uploaded my files:

https://e.pcloud.link/publink/show?code=XZCJGeZ4hXmsgv6hmLEf71rt8yzRhSgzqWk

What's wrong? :)
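The traceback ends in `pd.cut`, which refuses an `IntervalIndex` whose intervals overlap. The following minimal sketch reproduces that behaviour outside PEKA. The coordinates are made up, and `closed="left"` is an assumption, since the traceback does not show how PEKA builds `interval_index_p`.

```python
import pandas as pd

# Hypothetical regions on one chromosome and strand; the second overlaps the first.
regions = pd.IntervalIndex.from_arrays([0, 50, 200], [100, 150, 300], closed="left")
sites = pd.Series([10, 75, 250], name="start")

print(regions.is_overlapping)

try:
    pd.cut(sites, regions)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# With the overlap removed, the same call assigns every site to a region.
clean = pd.IntervalIndex.from_arrays([0, 100, 200], [100, 150, 300], closed="left")
binned = pd.cut(sites, clean)
print(binned)
```

If this is what happens with the uploaded files, the regions file passed with `-r` would contain overlapping intervals on at least one chromosome and strand, but that is an inference from the traceback, not something the log states.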

PEKA on intronless genomes

When I try to run PEKA I get this message:

"Getting thresholded crosslinks
Thresholding intron
Not able to find any thresholded sites in your sample (NoneType). Exiting."

I have tried this with iCount xlsites and with both iCount peaks and Clippy peaks as inputs. Is this because my GTF has no annotated introns? I have a custom GTF file with only "gene" in the 3rd column.
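A quick way to confirm what the annotation contains is to tally the values in the third GTF column. This sketch uses only the standard library. The function name `count_gtf_features` and the example lines are hypothetical, and whether PEKA needs an `intron` feature is an assumption based on the "Thresholding intron" log line.

```python
from collections import Counter


def count_gtf_features(gtf_lines):
    """Count the feature types found in column 3 of tab-separated GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Hypothetical annotation with gene records only.
example_gtf = [
    "# hypothetical annotation\n",
    'chrI\tcustom\tgene\t100\t900\t.\t+\t.\tgene_id "g1";\n',
    'chrI\tcustom\tgene\t1200\t2000\t.\t-\t.\tgene_id "g2";\n',
]
features = count_gtf_features(example_gtf)
print(dict(features))
print("intron" in features)
```

For a file on disk, pass an open handle instead of a list; for a gzipped file, `gzip.open(path, "rt")` gives a text handle that works the same way.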

ValueError: Overlapping IntervalIndex is not accepted.

I downloaded the xl and peak files from https://imaps.goodwright.com/collections/868/ and ran: peka -i tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single.bed1 -x tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single_peaks.bed1 -g $genome -gi $gfai -r $segment -k 6
But I got this error:
Namespace(alloutputs=False, clusters=5, distalwindow=150, genomefasta='/mnt/1/genome/hg38/hg38.fa', genomeindex='/mnt/1/genome/hg38/hg38.fa.fai', inputpeaks='tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single.bed1', inputxlsites='tardbp-egfp-hd-hek293-1-20201021-ju_mapped_to_genome_single_peaks.bed1', kmerlength=6, outputpath='/mnt/9/yuan_jianwen/scTRIBE_project/09.fastaMotif/00.rmPseudo/test', percentile=0.7, regions='/mnt/12/yuan_jianwen/hg38/Homo_sapiens.GRCh38.103.segment.gtf.gz', repeats='unmasked', smoothing=6, specificregion=None, subsample=True, topn=20, window=25)
Getting thresholded crosslinks
Thresholding intron
lenght of df_reg for intron is: 70328
Traceback (most recent call last):
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka", line 8, in <module>
sys.exit(cli())
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 1317, in cli
subsample,
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 1002, in run
df_txn = get_threshold_sites(sites_file, percentile=percentile)
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 445, in get_threshold_sites
df_cut = cut_sites_with_region(df_reg, df_region)
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 376, in cut_sites_with_region
df_temp = cut_per_chrom(chrom, df_p, df_m, df_region_p, df_region_m)
File "/home/yuan_jianwen/anaconda3/envs/peka/bin/peka.py", line 363, in cut_per_chrom
df_xl_p["cut"] = pd.cut(df_xl_p["start"], interval_index_p)
File "/home/yuan_jianwen/anaconda3/envs/peka/lib/python3.7/site-packages/pandas/core/reshape/tile.py", line 226, in cut
raise ValueError('Overlapping IntervalIndex is not accepted.')
ValueError: Overlapping IntervalIndex is not accepted.
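To locate which intervals trigger this kind of error, one option is to scan the region intervals per chromosome and strand and report each interval that starts before an earlier one has ended. This is a sketch, not PEKA code. The function name `find_overlaps` is hypothetical, and it assumes half-open, BED-style coordinates; for 1-based inclusive GTF coordinates the comparison would be `<=` instead of `<`.

```python
from collections import defaultdict


def find_overlaps(regions):
    """Return (chrom, strand, earlier, later) for each interval overlapping an earlier one.

    regions: iterable of (chrom, start, end, strand) with half-open coordinates.
    """
    by_group = defaultdict(list)
    for chrom, start, end, strand in regions:
        by_group[(chrom, strand)].append((start, end))

    overlaps = []
    for (chrom, strand), intervals in by_group.items():
        intervals.sort()
        furthest = None  # interval with the largest end seen so far
        for start, end in intervals:
            if furthest is not None and start < furthest[1]:
                overlaps.append((chrom, strand, furthest, (start, end)))
            if furthest is None or end > furthest[1]:
                furthest = (start, end)
    return overlaps


# Hypothetical regions: the first two on the plus strand overlap.
example_regions = [
    ("chr1", 0, 100, "+"),
    ("chr1", 50, 150, "+"),
    ("chr1", 150, 300, "+"),
    ("chr1", 50, 150, "-"),
]
found = find_overlaps(example_regions)
print(found)
```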

Example input files?

Great work!

I am trying to run PEKA on a dataset we have here. Would it be possible to get a complete example dataset with small example files (see the requirements below)? It would greatly help to see exactly what kind of input (format) is accepted or required. Thanks! Gregor

required arguments:
  -i INPUTPEAKS, --inputpeaks INPUTPEAKS
                        CLIP peaks (intervals of crosslinks) in BED file
                        format
  -x INPUTXLSITES, --inputxlsites INPUTXLSITES
                        CLIP crosslinks in BED file format
  -g GENOMEFASTA, --genomefasta GENOMEFASTA
                        genome fasta file, ideally the same as was used for
                        read alignment
  -gi GENOMEINDEX, --genomeindex GENOMEINDEX
                        genome fasta index file (.fai)
  -r REGIONS, --regions REGIONS
                        genome segmentation file produced as output of "iCount
                        segment" function
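Until example files are available, a basic sanity check of the BED inputs can catch formatting problems early. The sketch below assumes the common six-column BED layout (chrom, start, end, name, score, strand) with a `+` or `-` strand. The help text above only says "BED file format", so the exact columns PEKA expects are an assumption, and `check_bed6` is a hypothetical helper.

```python
def check_bed6(lines):
    """Return a list of human-readable problems found in BED6-style lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            problems.append(
                f"line {number}: expected 6 tab-separated columns, found {len(fields)}"
            )
            continue
        _chrom, start, end, _name, _score, strand = fields[:6]
        if not (start.isdigit() and end.isdigit()) or int(start) >= int(end):
            problems.append(f"line {number}: start/end must be integers with start < end")
        if strand not in {"+", "-"}:
            problems.append(f"line {number}: strand must be '+' or '-'")
    return problems


# Hypothetical rows: the second has start == end, the third uses spaces instead of tabs.
example_bed = [
    "chr1\t100\t101\t.\t3\t+\n",
    "chr1\t200\t200\t.\t1\t+\n",
    "chr1 300 301 . 2 -\n",
]
problems = check_bed6(example_bed)
for problem in problems:
    print(problem)
```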

Question on how to make use of CLIP-Seq biological replicates?

Hi @kkuret,
Thank you for your great work and the related bioRxiv paper. I am learning to analyse our peaks and crosslink sites (XLS) with your PEKA software. I used iCount to perform the eCLIP-Seq analysis, but I am wondering how to make use of the replicates. We have three replicates. Should I run PEKA on the clusters and peaks from each replicate separately and then intersect the motifs from Rep1, Rep2 and Rep3? Or should I merge the peaks and sum or average the raw XLS BED files to perform a single motif enrichment analysis? Thank you so much.
Linhua
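The second option in the question, summing the raw crosslink files, could be sketched as follows. This is only an illustration of that option, not a recommendation from the PEKA authors. It assumes BED6 rows whose fifth column holds the per-site count, and `sum_replicates` is a hypothetical helper.

```python
from collections import defaultdict


def sum_replicates(*replicates):
    """Sum the score column per (chrom, start, end, strand) across replicate BED6 lines."""
    totals = defaultdict(float)
    for bed_lines in replicates:
        for line in bed_lines:
            chrom, start, end, _name, score, strand = line.rstrip("\n").split("\t")[:6]
            totals[(chrom, int(start), int(end), strand)] += float(score)
    # Emit one BED6 row per position, sorted by chromosome and coordinate.
    return [
        f"{chrom}\t{start}\t{end}\t.\t{total:g}\t{strand}"
        for (chrom, start, end, strand), total in sorted(totals.items())
    ]


# Hypothetical replicate rows, not real data.
rep1 = ["chr1\t100\t101\t.\t3\t+\n", "chr1\t200\t201\t.\t1\t-\n"]
rep2 = ["chr1\t100\t101\t.\t2\t+\n"]
merged = sum_replicates(rep1, rep2)
print("\n".join(merged))
```

Averaging instead of summing would only require dividing each total by the number of replicates before writing the rows.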
