zarnackgroup / bindingsitefinder Goto Github PK
View Code? Open in Web Editor NEWPackage for the definition of biniding sites for iCLIP data
Home Page: https://www.bioconductor.org/packages/release/bioc/html/BindingSiteFinder.html
Package for the definition of biniding sites for iCLIP data
Home Page: https://www.bioconductor.org/packages/release/bioc/html/BindingSiteFinder.html
Hello, I was trying to use your BindingSiteFinder and got an error:
Error in if (all(!df.global$increaseOverMin)) { :
missing value where TRUE/FALSE needed
I was using BSFind(object = KObds, anno.genes = gns, anno.transcriptRegionList = regions,
est.subsetChromosome = "chr3", veryQuiet = TRUE, est.maxBsWidth = 29)
and it also gave following message:
2.16% (17/788) peaks overlap with multiple anno.genes in the given gene annotation.
A single instance of each peak is kept. This is recommended.
Could you please tell me waht could be the problem? thanks
Hi,
when I use makeBindingSites, I get the following error
Error in .subset_by_GenomicRanges(x, i) :
‘x’ must have unique names when subsetting by a GenomicRanges subscript.
I think the error comes from the .collapseSamples function, that uses
for (i in seq_along(p)) {
pSum = pSum + p[[i]]
}
This will not add up the chromosomes right if there are different chromosomes or the chromosomes do not have the same order in to samples.
when I merge these two samples
names(signal$signalPlus$
1_oe_FLAG
)
[1] “KI270802.1” “chr1" “chr10”
[4] “chr11" “chr12” “chr13"
[7] “chr14” “chr15" “chr16”
[10] “chr17" “chr18” “chr19"
[13] “chr2” “chr20" “chr21”
[16] “chr22" “chr3” “chr4"
[19] “chr5” “chr6" “chr7”
[22] “chr8" “chr9” “chrM”
[25] “chrX” “chrY”
names(signal$signalPlus$2_oe_FLAG
)
[1] “GL000225.1” “chr1" “chr10”
[4] “chr11" “chr12” “chr13"
[7] “chr14” “chr15" “chr16”
[10] “chr17" “chr18” “chr19"
[13] “chr2” “chr20" “chr21”
[16] “chr22" “chr3” “chr4"
[19] “chr5” “chr6" “chr7”
[22] “chr8" “chr9” “chrM”
[25] “chrX” “chrY”
the merge does not contain “GL000225.1”
names(p)
[1] “KI270802.1" “chr1” “chr10" “chr11” “chr12"
[6] “chr13” “chr14" “chr15” “chr16" “chr17”
[11] “chr18" “chr19” “chr2" “chr20” “chr21"
[16] “chr22” “chr3" “chr4” “chr5" “chr6”
[21] “chr7" “chr8” “chr9" “chrM” “chrX”
[26] “chrY”
instead “KI270802.1” and “GL000225.1" are added and called “KI270802.1”
If I then merge all 4 samples the merge looks like this:
names(sgnMergePlus)
[1] “KI270802.1” “chr1" “chr10”
[4] “chr11" “chr12” “chr13"
[7] “chr14” “chr15" “chr16”
[10] “chr17" “chr18” “chr19"
[13] “chr2” “chr20" “chr21”
[16] “chr22" “chr3” “chr4"
[19] “chr5” “chr6" “chr7”
[22] “chr8" “chr9” “chrM”
[25] “chrX” “chrY” NA
[28] NA
and it will through an error because two names are NA. However, it would probably not cause an error if just one name is NA, which is kind of dangerous. Because it might add up the wrong stuff without causing an error.
Hi,
maybe I am missing something, but I was wondering about the third code chunk of the 3.3 Transcript region assignment chapter in the vignette.
The code is:
# Count the overlaps of each binding site fore each region of the transcript.
cdseq = regions$CDS %>% countOverlaps(bindingSites,.)
intrns = regions$Intron %>% countOverlaps(bindingSites,.)
utrs3 = regions$UTR3 %>% countOverlaps(bindingSites,.)
utrs5 = regions$UTR5 %>% countOverlaps(bindingSites,.)
countDf = data.frame(CDS = cdseq, Intron = intrns, UTR3 = utrs3, UTR5 = utrs5)
# Count how many times an annotation is not present.
df = data.frame(olClass = apply(countDf,1,function(x) length(x[x == 0])))
So you want to know how many times a binding site overlaps with more than one region. You then say you "Count how many times an annotation is not present." But shouldn't you count how many times an annotation is present?
So length(x[x != 0]) in the last row?
Because the plot afterward says "Bar plot shows how many times a binding site overlaps with an annotation of a different transcript region."
Hi, I am trying to use calculateBsBackground()
function but getting an error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'i' in selecting a method for function '[': 'match' requires vector arguments
I've also tried with the example code provided in help page, but is still giving the same error.
This is the code that I've used.
# load clip data
files <- system.file("extdata", package="BindingSiteFinder")
load(list.files(files, pattern = ".rda$", full.names = TRUE))
# make binding sites
bds = makeBindingSites(bds, bsSize = 7)
bds = assignToGenes(bds, anno.genes = gns)
# change meta data
m = getMeta(bds)
m$condition = factor(c("WT", "WT", "KO", "KO"), levels = c("WT", "KO"))
bds = setMeta(bds, m)
# change signal
s = getSignal(bds)
names(s$signalPlus) = paste0(m$id, "_", m$condition)
names(s$signalMinus) = paste0(m$id, "_", m$condition)
bds = setSignal(bds, s)
# make background
bds.b1 = calculateBsBackground(bds, anno.genes = gns)
I'm not sure what the problem is.
The codes work perfectly fine for finding binding sites.
Can you help me debugging this error?
Maybe the error could be related to the version of packages I'm using.
Here is my sessionInfo, and let me know if you need any other information for fixing this error.
R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.9 (Ootpa)
Matrix products: default
BLAS/LAPACK: /home/jy656/.conda/envs/BindingSiteFinder/lib/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] viridis_0.6.5 viridisLite_0.4.2 GGally_2.2.1 ggplot2_3.4.4 dplyr_1.1.4 GenomicFeatures_1.54.1 AnnotationDbi_1.64.1 Biobase_2.62.0
[9] rtracklayer_1.62.0 BindingSiteFinder_2.0.0 GenomicRanges_1.54.1 GenomeInfoDb_1.38.1 IRanges_2.36.0 S4Vectors_0.40.2 BiocGenerics_0.48.1 BiocManager_1.30.22
loaded via a namespace (and not attached):
[1] DBI_1.2.1 bitops_1.0-7 gridExtra_2.3 biomaRt_2.58.0 rlang_1.1.3 magrittr_2.0.3
[7] clue_0.3-65 GetoptLong_1.0.5 matrixStats_1.2.0 compiler_4.3.2 RSQLite_2.3.4 systemfonts_1.0.5
[13] png_0.1-8 vctrs_0.6.5 stringr_1.5.1 pkgconfig_2.0.3 shape_1.4.6 crayon_1.5.2
[19] fastmap_1.1.1 dbplyr_2.4.0 XVector_0.42.0 utf8_1.2.4 Rsamtools_2.18.0 rmarkdown_2.25
[25] purrr_1.0.2 bit_4.0.5 xfun_0.42 zlibbioc_1.48.0 cachem_1.0.8 progress_1.2.3
[31] blob_1.2.4 DelayedArray_0.28.0 tweenr_2.0.3 BiocParallel_1.36.0 parallel_4.3.2 prettyunits_1.2.0
[37] cluster_2.1.6 R6_2.5.1 stringi_1.8.3 RColorBrewer_1.1-3 Rcpp_1.0.12 SummarizedExperiment_1.32.0
[43] iterators_1.0.14 knitr_1.45 Matrix_1.6-5 tidyselect_1.2.0 rstudioapi_0.15.0 abind_1.4-5
[49] yaml_2.3.8 doParallel_1.0.17 codetools_0.2-19 curl_5.1.0 plyr_1.8.9 lattice_0.22-5
[55] tibble_3.2.1 withr_3.0.0 KEGGREST_1.42.0 evaluate_0.23 ggstats_0.5.1 polyclip_1.10-6
[61] BiocFileCache_2.10.1 ggdist_3.3.1 xml2_1.3.6 circlize_0.4.16 Biostrings_2.70.1 pillar_1.9.0
[67] filelock_1.0.3 MatrixGenerics_1.14.0 foreach_1.5.2 distributional_0.4.0 generics_0.1.3 RCurl_1.98-1.14
[73] hms_1.1.3 munsell_0.5.0 scales_1.3.0 glue_1.7.0 tools_4.3.2 BiocIO_1.12.0
[79] data.table_1.14.10 GenomicAlignments_1.38.0 XML_3.99-0.16.1 grid_4.3.2 tidyr_1.3.1 colorspace_2.1-0
[85] GenomeInfoDbData_1.2.11 ggforce_0.4.2 restfulr_0.0.15 cli_3.6.2 rappdirs_0.3.3 kableExtra_1.4.0
[91] fansi_1.0.6 S4Arrays_1.2.0 svglite_2.1.3 ComplexHeatmap_2.18.0 gtable_0.3.4 digest_0.6.34
[97] SparseArray_1.2.2 farver_2.1.1 rjson_0.2.21 htmlwidgets_1.6.4 memoise_2.0.1 htmltools_0.5.7
[103] lifecycle_1.0.4 httr_1.4.7 GlobalOptions_0.1.2 MASS_7.3-60 bit64_4.0.5
The current version number in the Github repo is 1.99.16, but it should be 2.1.16.
On the side of Bioconductor it's 2.1.0 in devel and it will be changed to 2.3.0 in the new release.
Wait for the new release to happen, then pull changes from Bioconductor to resync the Github repos.
This will create conflicts with the current version. Resolve them by removing local changes that are not present on the side of Bioconductor.
Hi, please help.
I'm having some issues running BindingSiteFinder
.
After following the complete iCLIP analysis pipeline on the instructions, I check my data and get these:
### checking the bed file
read.table(BED.file1)
V1 V2 V3 V4 V5 V6
1 Bomo_Chr1 366460 366461 3 3.7454300 +
2 Bomo_Chr1 366461 366462 3 17.8156000 +
3 Bomo_Chr1 553887 553888 3 6.1753600 +
...
164 Bomo_Chr1 13135067 13135068 3 3.8580500 +
165 Bomo_Chr1 13135205 13135206 3 6.2024100 +
166 Bomo_Chr1 13135206 13135207 3 1.8451600 +
[ reached 'max' / getOption("max.print") -- omitted 18706 rows ]
### importing the GRanges object
cs = rtracklayer::import(con = "BED.file1", format = "BED")
cs
GRanges object with 18872 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] Bomo_Chr1 366461 + | 3 3.74543
[2] Bomo_Chr1 366462 + | 3 17.81560
[3] Bomo_Chr1 553888 + | 3 6.17536
[4] Bomo_Chr1 553935 + | 3 0.46261
[5] Bomo_Chr1 629420 + | 3 3.59940
... ... ... ... . ... ...
[18868] Bomo_Scaf657 7769 - | 3 4.72738
[18869] Bomo_Scaf657 7768 - | 3 4.76283
[18870] Bomo_Scaf657 7767 - | 3 2.66149
[18871] Bomo_Scaf657 7766 - | 3 4.81242
[18872] Bomo_Scaf659 7245 - | 3 20.80880
-------
seqinfo: 59 sequences from an unspecified genome; no seqlengths
### Importing the bigwig files
files <- "BIGWIG.file1"
clipFilesP <- list.files(files, pattern = "_s.bw$", full.names = TRUE)
clipFilesM <- list.files(files, pattern = "_as.bw$", full.names = TRUE)
### checking the meta data
meta = data.frame(
id = c(1,2,3,4),
condition = factor(c("WT","WT","KD","KD"),
levels = c("KD","WT")),
clPlus = clipFilesP, clMinus = clipFilesM)
meta
id condition clPlus
1 1 WT BIGWIG.file1/WT-1_s.bw
2 2 WT BIGWIG.file1/WT-2_s.bw
3 3 KD BIGWIG.file1/KD-1_s.bw
4 4 KD BIGWIG.file1/KD-2_s.bw
clMinus
1 BIGWIG.file1/WT-1_as.bw
2 BIGWIG.file1/WT-2_as.bw
3 BIGWIG.file1/KD-1_as.bw
4 BIGWIG.file1/KD-2_as.bw
### Construction of the the BindingSiteFinder dataset
bds = BSFDataSetFromBigWig(ranges = cs, meta = meta, silent = TRUE)
bds
Object of class BSFDataSet
Contained ranges: 17.918
----> Number of chromosomes: 59
----> Ranges width: 1
Contained conditions: KD WT
But when I try to proceed with the workflow like this:
supportRatioPlot(bds, bsWidths = seq(from = 3, to = 19, by = 2))
I get the following error:
Error in .subset_by_GenomicRanges(x, i) :
'x' must have unique names when subsetting by a GenomicRanges subscript
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Do you think you could help me figure out what's causing this?
Thank you!
Hello,
I am using your vignette (https://www.bioconductor.org/packages/release/bioc/vignettes/BindingSiteFinder/inst/doc/vignette.html) and found and got an Error in the " Target gene assignment" part (with gencode.v31.annotation.gff3):
> annoDb = makeTxDbFromGFF(file = mygff3, format = "gff3")
> annoInfo = import(mygff3, format = "gff3")
>
> # Get genes as GRanges
> gns = genes(annoDb)
> idx = match(gns$gene_id, annoInfo$gene_id)
> elementMetadata(gns) = cbind(elementMetadata(gns),
+ elementMetadata(annoInfo)[idx,])
Error: subscript contains NAs
The problems seem to be the _PAR_Y gene names.
gns$gene_id[!(gns$gene_id %in% annoInfo$gene_id)]
[1] "ENSG00000002586.20_PAR_Y" "ENSG00000124333.16_PAR_Y" "ENSG00000124334.17_PAR_Y" "ENSG00000167393.17_PAR_Y" "ENSG00000168939.11_PAR_Y" "ENSG00000169084.14_PAR_Y"
[7] "ENSG00000169093.16_PAR_Y" "ENSG00000169100.14_PAR_Y" "ENSG00000178605.13_PAR_Y" "ENSG00000182162.11_PAR_Y" "ENSG00000182378.14_PAR_Y" "ENSG00000182484.15_PAR_Y"
[13] "ENSG00000185203.12_PAR_Y" "ENSG00000185291.11_PAR_Y" "ENSG00000185960.14_PAR_Y" "ENSG00000196433.13_PAR_Y" "ENSG00000197976.12_PAR_Y" "ENSG00000198223.16_PAR_Y"
[19] "ENSG00000205755.11_PAR_Y" "ENSG00000214717.12_PAR_Y" "ENSG00000223274.6_PAR_Y" "ENSG00000223484.7_PAR_Y" "ENSG00000223511.7_PAR_Y" "ENSG00000223571.6_PAR_Y"
[25] "ENSG00000223773.7_PAR_Y" "ENSG00000225661.7_PAR_Y" "ENSG00000226179.6_PAR_Y" "ENSG00000227159.8_PAR_Y" "ENSG00000228410.6_PAR_Y" "ENSG00000228572.7_PAR_Y"
[31] "ENSG00000229232.6_PAR_Y" "ENSG00000230542.6_PAR_Y" "ENSG00000234622.6_PAR_Y" "ENSG00000234958.6_PAR_Y" "ENSG00000236017.8_PAR_Y" "ENSG00000236871.7_PAR_Y"
[37] "ENSG00000237040.6_PAR_Y" "ENSG00000237531.6_PAR_Y" "ENSG00000237801.6_PAR_Y" "ENSG00000265658.6_PAR_Y" "ENSG00000270726.6_PAR_Y" "ENSG00000275287.5_PAR_Y"
[43] "ENSG00000277120.5_PAR_Y" "ENSG00000280767.3_PAR_Y" "ENSG00000281849.3_PAR_Y"
A solution is to remove the _PAR. Maybe include that in the vignette?
annoDb = makeTxDbFromGFF(file = mygff3, format = "gff3")
annoInfo = import(mygff3, format = "gff3")
# Get genes as GRanges
gns = genes(annoDb)
gns = gns[!grepl(pattern = "_PAR_Y", gns$gene_id)]
idx = match(gns$gene_id, annoInfo$gene_id)
elementMetadata(gns) = cbind(elementMetadata(gns),
elementMetadata(annoInfo)[idx,])
... I do not understand the plot and neither does Kathi. Can we explain it better in the vignette?
Hi profess when i repeat the workflow from this repo https://github.com/malewins/Plant-iCLIPseq , come across some problem
Estimation at: 0%
Error in dplyr_col_modify()
:
! Can't recycle quantiles
(size 101) to size 0.
Backtrace:
▆
└─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
Warning messages:
1: In .merge_two_Seqinfo_objects(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
2: In .merge_two_Seqinfo_objects(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
Execution halted
R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /home/zhengdy/miniconda3/envs/diffscan/lib/libopenblasp-r0.3.21.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BindingSiteFinder_2.0.0 GenomicRanges_1.50.2 GenomeInfoDb_1.34.9
[4] IRanges_2.32.0 S4Vectors_0.36.2 BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 matrixStats_1.3.0
[3] bit64_4.0.5 filelock_1.0.3
[5] doParallel_1.0.17 RColorBrewer_1.1-3
[7] progress_1.2.3 httr_1.4.7
[9] tools_4.2.2 utf8_1.2.4
[11] R6_2.5.1 DBI_1.2.2
[13] colorspace_2.1-0 ggdist_3.3.2
[15] GetoptLong_1.0.5 withr_3.0.0
[17] tidyselect_1.2.1 prettyunits_1.2.0
[19] bit_4.0.5 curl_5.2.1
[21] compiler_4.2.2 cli_3.6.2
[23] Biobase_2.58.0 xml2_1.3.6
[25] DelayedArray_0.24.0 rtracklayer_1.58.0
[27] scales_1.3.0 rappdirs_0.3.3
[29] systemfonts_1.0.6 stringr_1.5.1
[31] digest_0.6.35 Rsamtools_2.14.0
[33] svglite_2.1.3 rmarkdown_2.26
[35] XVector_0.38.0 htmltools_0.5.8.1
[37] pkgconfig_2.0.3 MatrixGenerics_1.10.0
[39] dbplyr_2.5.0 fastmap_1.1.1
[41] rlang_1.1.3 GlobalOptions_0.1.2
[43] rstudioapi_0.16.0 RSQLite_2.3.6
[45] farver_2.1.1 shape_1.4.6.1
[47] BiocIO_1.8.0 generics_0.1.3
[49] BiocParallel_1.32.6 dplyr_1.1.4
[51] distributional_0.4.0 RCurl_1.98-1.14
[53] magrittr_2.0.3 kableExtra_1.4.0
[55] GenomeInfoDbData_1.2.9 Matrix_1.5-3
[57] Rcpp_1.0.12 munsell_0.5.1
[59] fansi_1.0.6 lifecycle_1.0.4
[61] stringi_1.8.3 yaml_2.3.8
[63] MASS_7.3-58.2 SummarizedExperiment_1.28.0
[65] zlibbioc_1.44.0 plyr_1.8.9
[67] BiocFileCache_2.6.1 grid_4.2.2
[69] blob_1.2.4 parallel_4.2.2
[71] crayon_1.5.2 lattice_0.22-6
[73] Biostrings_2.66.0 GenomicFeatures_1.50.4
[75] circlize_0.4.16 hms_1.1.3
[77] KEGGREST_1.38.0 knitr_1.46
[79] ComplexHeatmap_2.15.4 pillar_1.9.0
[81] rjson_0.2.21 codetools_0.2-20
[83] biomaRt_2.54.1 XML_3.99-0.13
[85] glue_1.7.0 evaluate_0.23
[87] tweenr_2.0.3 png_0.1-8
[89] vctrs_0.6.5 foreach_1.5.2
[91] polyclip_1.10-6 gtable_0.3.5
[93] purrr_1.0.2 tidyr_1.3.1
[95] clue_0.3-65 cachem_1.0.8
[97] ggplot2_3.5.1 xfun_0.43
[99] ggforce_0.4.2 restfulr_0.0.15
[101] viridisLite_0.4.2 tibble_3.2.1
[103] iterators_1.0.14 GenomicAlignments_1.34.1
[105] AnnotationDbi_1.60.2 memoise_2.0.1
[107] cluster_2.1.6
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.