ryanlayer / giggle

Interval data structure

License: MIT License

Makefile 1.11% C 91.44% Ruby 0.58% Shell 0.75% HTML 0.45% JavaScript 1.29% CSS 0.23% Python 1.91% M4 0.42% Roff 0.66% Perl 1.07% Scilab 0.09%

giggle's Issues

api clarifications

1. How do I get the list of files from a giggle index? I see giggle_index->file_index[i]->file_name, but how do we know how many files are in file_index?
2. giggle_query_result->num_files: is this always equal to the number of files in the giggle index? Or is it the length of **offsets? If the latter, how do we know which files are associated with which offsets?
3. (Related to 2.) I assume that giggle_get_query_len returns the count of results for the given file_id? What is the file_id? Can I always use a file_id up to the number of files in the index?

add strand to nodes to enable stranded search

Install fails tests (Ubuntu subsystem in Windows)

I installed Giggle following the instructions in the README, running Ubuntu 16.04.4 Xenial under the Windows 10 subsystem. I ran the test cases and 15 of 27 fail. Most of the passes look like they shouldn't be passes at all, since giggle is throwing an error reading a file.

Anyone know what happened?

chapmano@Jarvis:~/giggle/test/func$ ./giggle_tests.sh
/usr/bin/bedtools

check_intersections_per_file ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL EXIT CODE (LINE 41)
--> expected EX_OK, observed EX_IOERR
FAIL "0" != "1412" (LINE 42)

check_chr_v_nochr_search_1 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_2 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_3 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_4 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_5 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_6 ran in 1 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_7 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_8 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_9 ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_10 ran in 1 sec with 1/0 lines to STDERR/OUT
FAIL "11" != "0" (LINE 55)
giggle: Error reading file "../data/chr_mix_i/cache.0.idx": End of file
PASS "0" == "0" (LINE 56)

check_bulk_insert ran in 0 sec with 0/23 lines to STDERR/OUT
PASS "23" == "23" (LINE 79)
giggle: Error reading file "../data/many_i/cache.0.idx": End of file
FAIL "0" != "24" (LINE 80)

check_offset_additional_data ran in 1 sec with 1/0 lines to STDERR/OUT
FAIL "72" != "0" (LINE 87)
PASS "0" == "0" (LINE 88)

check_dense_index ran in 0 sec with 1/0 lines to STDERR/OUT
FAIL EXIT CODE (LINE 102)
--> expected EX_OK, observed Unknown code: 1

sshtest v0.1.5

27 Tests
15 Failures
12 Successes

Better error message when too many files are open

Right now it just says "could not open file"
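
One way to make the message more specific is to look at the error code behind the failed open. The following is a minimal sketch in Python, not giggle's C code; the function name describe_open_error is hypothetical. It only illustrates separating "too many open files" (EMFILE/ENFILE) from other failures such as a missing file.

```python
import errno
import os


def describe_open_error(path):
    """Try to open path; return None on success or a specific message on failure."""
    try:
        with open(path, "rb"):
            return None
    except OSError as exc:
        # EMFILE: per-process descriptor limit; ENFILE: system-wide limit.
        if exc.errno in (errno.EMFILE, errno.ENFILE):
            return (f"Could not open '{path}': too many open files "
                    "(consider raising the limit with 'ulimit -n')")
        # Fall back to the operating system's own description of the error.
        return f"Could not open '{path}': {os.strerror(exc.errno)}"
```

The same distinction is available in C through errno and strerror after a failed open call.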

inkids_rare.bed.gz test file available?

Hi Ryan,

Is it possible that you can make the file you tested Giggle and the other tools with, inkids_rare.bed.gz, available for replication purposes? I understand that may not be possible if it's confidential, otherwise it'd be great to have it alongside the example RME data tests that you've already very helpfully provided and documented.

Thanks,
Chris

Segfault when trying to index

Hi Ryan,

I am trying to index ~3,500 bed files using giggle. They all have been sorted using giggle/scripts/sort_bed "tmp/*.bed" bgzip_sort 4. But typing giggle index -i "bgzip_sort/*.gz" -o giggle_index raises a segfault.
I have recompiled giggle to enable debugging with gdb and get the following information:

$ gdb --args ../../bin/giggle/bin/giggle index -i "bgzip_sort/*.gz" -o giggle_index -f

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /storage/mathelierarea/processed/anthoma/Projects/UniBind2.0/bin/giggle/bin/giggle...done.
(gdb) run
Starting program: /storage/mathelierarea/processed/anthoma/Projects/UniBind2.0/results/20201006_giggle/../../bin/giggle/bin/giggle index -i bgzip_sort/*.gz -o giggle_index -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x000000000041007b in hash_list_add (hashl=0x0, index=0, data=0x7fffffffaa90, data_size=24) at lists.c:280
280 khash_t(hashl) hash = (khash_t(hashl))(hashl->hash);
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-46.el7.x86_64 libcom_err-1.42.9-17.el7.x86_64 libcurl-7.29.0-57.el7_8.1.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 libssh2-1.8.0-3.el7.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64

Thanks for your help.

examples for plotting utilities in scripts folder

I successfully used the giggle_heat_map.py script for an example result against the RME dataset:

~/giggle/scripts/giggle_heat_map.py -i $query.result \
    -o $query.png \
    -s ~/giggle/examples/rme/states.txt \
    -c ~/giggle/examples/rme/EDACC_NAME.txt

It would be great to have examples on how the other scripts can be used in the scripts folder. Thx

Incorrect result for edge case where there are more overlaps than db regions

Suppose I have a query with two large intervals, and I run a GIGGLE query against a database with 6 small intervals, all 6 of which overlap the query. I should get p≈0 for a hypergeometric test; instead GIGGLE returns p=1 for all three Fisher tails, an odds ratio of about 1.6e-09, and a combo score of -0.
(Files attached)
Fisher exact test result (bedtools fisher):

[ochapman@comet-ln2 tftest]$ bedtools fisher -a query.bed -b db.bed -g hg19.genome
# Number of query intervals: 2
# Number of db intervals: 6
# Number of overlaps: 6
# Number of possible intervals (estimated): 3301
# phyper(6 - 1, 2, 3301 - 2, 6, lower.tail=F)
# Contingency Table Of Counts
#_________________________________________
#           |  in -b       | not in -b    |
#     in -a | 6            | 0            |
# not in -a | 0            | 3295         |
#_________________________________________
# p-values for fisher's exact test
left    right   two-tail        ratio
1       5.5903e-19      5.5903e-19      inf

GIGGLE result:

[ochapman@comet-ln2 tftest]$ cat out.txt
#file   file_size       overlaps        odds_ratio      fishers_two_tail        fishers_left_tail       fishers_right_tail      combo_score
db.bed.gz       6       6       1.6298145070949704e-09  1       1       1       -0

db.bed.gz
query.bed.gz
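
As a cross-check on the bedtools numbers above, the right-tail p-value can be recomputed directly from the contingency table. The sketch below is a plain hypergeometric sum in Python, not the implementation used by GIGGLE or bedtools; the function name fisher_right_tail is made up for illustration.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """Right-tail p-value of Fisher's exact test for the table [[a, b], [c, d]]."""
    row1 = a + b              # intervals "in -a"
    col1 = a + c              # intervals "in -b"
    total = a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    # Sum the probabilities of all tables with at least `a` in the top-left cell.
    numer = sum(comb(row1, k) * comb(total - row1, col1 - k)
                for k in range(a, upper + 1))
    return numer / denom


# Table from the bedtools fisher output: in-a/in-b = 6, all other overlaps = 0.
p = fisher_right_tail(6, 0, 0, 3295)
print(f"{p:.4e}")  # 5.5903e-19
```

For this table the sum has a single term, 1 / C(3301, 6), which agrees with the right and two-tail values that bedtools reports.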

Unable to sort input file

dhwani@dhwani-HP-Z620-Workstation:/Downloads/giggle$ dir
gargs giggle giggle-master.zip repeat repeat_sort
dhwani@dhwani-HP-Z620-Workstation:/Downloads/giggle$ giggle/scripts/sort_bed "repeat/*.bed" repeat_sort 4
giggle/scripts/sort_bed: 2: set: Illegal option -o pipefail

Deleting a set.

Just mask.

Installation Error

I'm doing

python setup.py test

and I'm getting the following errors

lib/htslib/hfile_s3.c:70:2: error: #error No HMAC() routine found by configure
#error No HMAC() routine found by configure
^~~~~
lib/htslib/hfile_s3.c: In function ‘s3_rewrite’:
lib/htslib/hfile_s3.c:335:30: error: ‘DIGEST_BUFSIZ’ undeclared (first use in this function); did you mean ‘_G_BUFSIZ’?
unsigned char digest[DIGEST_BUFSIZ];
^~~~~~~~~~~~~
_G_BUFSIZ
lib/htslib/hfile_s3.c:335:30: note: each undeclared identifier is reported only once for each function it appears in
lib/htslib/hfile_s3.c:336:29: warning: implicit declaration of function ‘s3_sign’; did you mean ‘strsignal’? [-Wimplicit-function-declaration]
size_t digest_len = s3_sign(digest, &secret, &message);
^~~~~~~
strsignal
lib/htslib/hfile_s3.c:335:23: warning: unused variable ‘digest’ [-Wunused-variable]
unsigned char digest[DIGEST_BUFSIZ];
^~~~~~
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

I'm not sure what to do. Can you help me out? Thanks in advance

overall count

can a summation over all counts be added?

reserved identifier violation

I would like to point out that identifiers like “__GIGGLE_INDEX_H__” and “__LISTS_H__” do not follow the naming rules of the C language standard, which reserves identifiers that begin with a double underscore for the implementation.
Would you like to choose different unique names?

When indexing, check to see if start/end are negative

Availability of test data

I cannot access the test dataset using this command line:
wget https://s3.amazonaws.com/layerlab/giggle/roadmap/roadmap_sort.tar.gz

I get the following error:

--2017-07-05 11:41:05-- https://s3.amazonaws.com/layerlab/giggle/roadmap/roadmap_sort.tar.gz
Resolving s3.amazonaws.com... 54.231.114.100
Connecting to s3.amazonaws.com|54.231.114.100|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-07-05 11:41:06 ERROR 403: Forbidden.

Maybe it's an issue of access rights for the folder in which the dataset is stored?
Thanks
Gael

example doesn't work

hello,

I was trying to run the interactive heatmap example, but it says the server (or the hosted page on GitHub) has CORS issues.

http://ryanlayer.github.io/giggle/index.html?primary_index=ec2-54-227-176-15.compute-1.amazonaws.com/rme&ucsc_index=ec2-54-227-176-15.compute-1.amazonaws.com/ucsc

Edit 1: I tried downloading the repo and running the example locally, but it still has issues.

allow sending list of files when creating index

A user may not want all files in a directory, or may want files from different directories.

Resolve any 0/1-based half-closed issues

api for indexing

would be nice to have this:

giggle_index *giggle_index_create(char *out_dir, char **input_paths);

the append function is not defined

In bpt.c, line 337, append is used both as a variable and as a function. The declaration in bpt.h is a function prototype, but the function itself is never defined.

    // If the append function is NULL assume overwrite
    if (append != NULL)
        append(domain,
               value_id,
               BPT_POINTERS(target_bpt_node)[*target_key_pos],
               handler);
    else
        BPT_POINTERS(target_bpt_node)[*target_key_pos] = value_id;

Feature request - custom background

Added here at Ryan Layer's request.

The current comparisons for the association of bed file A with B appear to be vs the whole genome. What about allowing the user to supply a superset bed file of genomic regions to use (or to exclude) in the comparison? For instance:

a lot of the genome is dark (centromeres, stalks, acen, etc) - I may wish to focus just on higher confidence regions;
for many comparisons, might want to only look at autosomes as sex chr may be less confident
I may want to look at the overlap of brain open chromatin and the transcript start sites of brain expressed genes (±2 kb) but only in the superset of open compartments (a nuanced way to look at region specific ATAC-seq, RNA-seq, and Hi-C results) (for example)

thx! pfs

single-argument version of giggle_load()

It should be possible to specify just the directory, without "uint32_t_ll_giggle_set_data_handler".

incrementing files in index

Just found out about giggle, trying it out today.

This may already be covered elsewhere, but what's the best way to add files to an already existing indexed set of files?

install errors

Hi,
I'm trying to install Giggle on Mac OS X. I went through the test procedure and ran into several blocks/errors:

upon running giggle_tests.sh:

 /usr/local/bin/bedtools

check_intersections_per_file ran in 0 sec with 0/1411 lines to STDERR/OUT
PASS EXIT CODE (LINE 41)
 PASS "0" == "0" (LINE 42)

check_chr_v_nochr_search_1 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_2 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_3 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_4 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_5 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_6 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_7 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_8 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_9 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_chr_v_nochr_search_10 ran in 0 sec with 0/11 lines to STDERR/OUT
PASS "11" == "11" (LINE 55)
PASS "0" == "0" (LINE 56)

check_bulk_insert ran in 0 sec with 0/23 lines to STDERR/OUT
PASS "23" == "23" (LINE 79)
PASS "0" == "0" (LINE 80)

check_offset_additional_data ran in 0 sec with 0/72 lines to STDERR/OUT
PASS "72" == "72" (LINE 87)
PASS "0" == "0" (LINE 88)

check_dense_index ran in 0 sec with 2/0 lines to STDERR/OUT
FAIL EXIT CODE (LINE 102)
-->	expected EX_OK, observed Unknown code: 1

sshtest v0.1.5

27        Tests
1         Failures
26        Successes

upon running cd ../unit; make

test_giggle.c:1851:test_giggle_bulk_insert:FAIL: Expected 22 Was 140239760787248
Any suggestion to fix this?
Thanks in advance
Gael

extra (false positive) overlaps issue

Extra (false positive) overlaps are reported by giggle search:

Steps to reproduce:

db.bed:
chr1 207799860 207799861
chr1 207799861 207799862
chr1 207799869 207799871
chr1 207799877 207799878
chr1 207799878 207799879

query.bed:
chr1 207799861 207799878

bgzip db.bed
bgzip query.bed
giggle index -i db.bed.gz -o db_index

giggle search -i db_index -q query.bed.gz
#db.bed.gz size:5 overlaps:5

giggle search -i db_index -q query.bed.gz -v -o

Giggle search output: 5 overlaps
##chr1 207799861 207799878
chr1 207799860 207799861 db.bed.gz
chr1 207799861 207799862 db.bed.gz
chr1 207799869 207799871 db.bed.gz
chr1 207799877 207799878 db.bed.gz
chr1 207799878 207799879 db.bed.gz

Extra (false positive) overlaps: 2 overlaps (1 upstream, 1 downstream of the query interval)
chr1 207799860 207799861 db.bed.gz
chr1 207799878 207799879 db.bed.gz

Expected (correct) output for overlap between db.bed and query.bed: 3 overlaps
chr1 207799861 207799862
chr1 207799869 207799871
chr1 207799877 207799878

The extra overlaps are reported for both left and right interval boundaries.
Attached are db and query files used for running.

db.bed.gz
query.bed.gz

Thanks!
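
The expected result in this report follows from treating BED intervals as half-open, [start, end). The sketch below is a Python illustration, not giggle's code. It applies both a half-open and an inclusive overlap test to the intervals listed above; the inclusive test reproduces the 5 hits, which is consistent with the report but is only an assumption about the cause.

```python
def overlaps_half_open(s1, e1, s2, e2):
    """Overlap test for half-open intervals [s, e), as in BED."""
    return s1 < e2 and s2 < e1


def overlaps_closed(s1, e1, s2, e2):
    """Overlap test that treats both ends as inclusive."""
    return s1 <= e2 and s2 <= e1


DB = [
    (207799860, 207799861),
    (207799861, 207799862),
    (207799869, 207799871),
    (207799877, 207799878),
    (207799878, 207799879),
]
QUERY = (207799861, 207799878)

half_open_hits = [iv for iv in DB if overlaps_half_open(*QUERY, *iv)]
closed_hits = [iv for iv in DB if overlaps_closed(*QUERY, *iv)]

print(len(half_open_hits))  # 3, the expected overlaps
print(len(closed_hits))     # 5, the count giggle search reported
```

The two extra intervals under the inclusive test are exactly the ones that touch the query at its start and end coordinates.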

GIGGLE on Singularity

Hi all,

Thanks for the excellent resource! I just made a wrapper for running GIGGLE on a Singularity container https://github.com/HugoGuillen/giggle-singularity. Might it help other users if it's included in the list of APIs?

Best,
Hugo.

Completion of error handling

Would you like to add more error handling for return values from functions like the following?

fclose ⇒ disk_store_destroy
strdup ⇒ main

toolshed rename.py error

Hi,

I'm working through your README and compiling giggle, and I have reached this part of the procedure:
mkdir split

pip install toolshed --user

python $GIGGLE_ROOT/examples/rme/rename.py
$GIGGLE_ROOT/examples/rme/states.txt
$GIGGLE_ROOT/examples/rme/EDACC_NAME.txt
"orig/gz"
"split/"*

but I keep getting this syntax error on line 35:

File "/home/sejjhwi/GIGGLE/giggle/examples/rme/rename.py", line 35
print fname
^
SyntaxError: Missing parentheses in call to 'print'

Do you know how I can overcome this problem and continue with the installation?

Many thanks

Hywel
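
This SyntaxError is what Python 3 raises when it meets the Python 2 print statement, so one option is to run the script with a Python 2 interpreter. Another is to change the statement to the call form, which works on both versions for a single argument. A minimal sketch follows; only the `print fname` line is known from the traceback, and the wrapper function show_name is hypothetical.

```python
def show_name(fname):
    # Python 2 accepted `print fname`; Python 3 requires the call form below.
    print(fname)


show_name("example.bed")
```

Other Python 2 constructs may remain elsewhere in rename.py, so this change alone may not be enough.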

Feature request - permutation testing

Added here at Ryan Layer's request.

In addition to OR and FET for the strength of an overlap, what about adding permutation testing? Given GIGGLE’s speed, this might be pretty slick. Could also get at both increase and decrease in overlap.

thx!
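
To make the request concrete, here is a rough Python sketch of what a permutation test on overlap counts could look like. It is an illustration under simplifying assumptions, not a proposal for GIGGLE's internals: a single chromosome of length genome_len, query intervals relocated uniformly at random while keeping their lengths, and a naive quadratic overlap count. All names are hypothetical.

```python
import random


def count_overlapping(queries, db):
    """Number of query intervals that overlap at least one db interval (half-open)."""
    return sum(any(qs < de and ds < qe for ds, de in db) for qs, qe in queries)


def permutation_p_value(queries, db, genome_len, n_perm=200, seed=0):
    """Empirical p-value for enrichment of query/db overlaps."""
    rng = random.Random(seed)
    observed = count_overlapping(queries, db)
    at_least = 0
    for _ in range(n_perm):
        shuffled = []
        for qs, qe in queries:
            length = qe - qs
            # Place the interval uniformly so that it still fits on the chromosome.
            start = rng.randrange(0, genome_len - length + 1)
            shuffled.append((start, start + length))
        if count_overlapping(shuffled, db) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_perm + 1)
```

Counting permutations with overlap less than or equal to the observed value instead would give the corresponding test for depletion.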

Make a C library

Segmentation fault

Everything worked as expected until I had more than 1 million lines in my query file for "giggle search". Then it crashed with a segmentation fault. After some quick debugging with gdb, I get the following backtrace:

(gdb) backtrace
#0  0x0000000000408f30 in giggle_leaf_data_get_intersection_size ()
#1  0x0000000000408893 in giggle_collect_intersection_data_in_block ()
#2  0x000000000040882e in giggle_search ()
#3  0x000000000040a204 in giggle_query ()
#4  0x0000000000417eaa in search_main ()
#5  0x000000000040403a in main ()

Any help is appreciated. My original giggle command line was something like this:

giggle search -i mm10_tf_giggle_index -q tmp.1.5m.bed.gz

Can't run outside of directory

Can't run outside of the directory (that it was indexed in?)

[u0691312@puhi:rufus_1000g]$ ~/bin/giggle/bin/giggle search -i /uufs/chpc.utah.edu/common/home/u0691312/resources/reference/GTEx_sort_b/ -q 11404X2.bam.generator.V2.overlap.hashcount.fastq.bam.vcf.bed.gz -o -v
##2	223423891	223423891	2:223423891:0:24
giggle: Could not open file 'GTEx_sort/Brain_Hypothalamus_Analysis.v6p.egenes.txt.gz.bed.gz'
: No such file or directory

Interactive heatmap not working

Hi,
I'm trying to run the interactive heatmap. The page loads, and I can choose the default genomic region or upload a bed file, but when I hit the "run" button, nothing happens and nothing is displayed.
Thanks for your help
-Gael

License and tagged release?

I'd like to look into adding a package for GIGGLE to bioconda. Would that be ok? In order to do so it would be helpful to know which license GIGGLE is distributed under (MIT, GPL, something else). Would it be possible to explicitly include a LICENSE file in this repository?

Related to redistribution, would it be possible to tag releases on GitHub by version number for easier sourcing of the package?

Thanks!

giggle_index.h redefinition of "leaf_data_cache_handler"

I noticed that leaf_data_cache_handler is redeclared on line 288 (it was previously declared on line 281). I can't imagine that was done on purpose.

Take stdin as input for -q

It should be possible to pipe stdin as the input for -q.

I.e., something like:
zcat your.vcf.gz | giggle search -i $G -q /dev/stdin

columns to index on

It'd be nice if you could index on custom columns.

Feature request - Jaccard index

Added here at Ryan Layer's request.

To help evaluate association between bed files, add Jaccard index (available in bedtools) in addition to FET and OR

thx! pfs
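
For reference, the base-pair Jaccard statistic is the size of the intersection divided by the size of the union of the two interval sets. The following Python sketch computes it for intervals on a single chromosome. It is an illustration of the definition, not the code of bedtools or GIGGLE, and the function names are hypothetical.

```python
def merge(intervals):
    """Sort half-open intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def jaccard(a, b):
    """Base-pair Jaccard index of two interval lists on one chromosome."""
    a, b = merge(a), merge(b)
    i = j = inter = 0
    # Two-pointer sweep over the merged, sorted lists.
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            inter += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    union = sum(e - s for s, e in a) + sum(e - s for s, e in b) - inter
    return inter / union if union else 0.0
```

For example, jaccard([(0, 10)], [(5, 15)]) has 5 bp of intersection over 15 bp of union.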

Segfault for Negative Interval Bounds

When searching with a bed file containing a negative start position I get a segfault.
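
A check of this kind could reject such lines before they reach the search code. The sketch below is in Python for illustration only, since giggle itself is written in C, and the function name parse_bed_interval is hypothetical. It validates the first three tab-separated BED columns.

```python
def parse_bed_interval(line):
    """Parse chrom, start, end from a BED line, rejecting invalid coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns in BED line: {line!r}")
    chrom = fields[0]
    start, end = int(fields[1]), int(fields[2])
    if start < 0 or end < 0:
        raise ValueError(f"negative coordinate in BED line: {line!r}")
    if end < start:
        raise ValueError(f"end is smaller than start in BED line: {line!r}")
    return chrom, start, end
```

Zero-length intervals (start equal to end) are deliberately accepted here, since such lines do appear in query files.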

Feature request - bed file header read/write/bulk retrieve

Added here at Ryan Layer's request.

I put a post on bedtools-discuss (https://groups.google.com/forum/#!topic/bedtools-discuss/t6E74mCQb-E), suggesting that you support adding and retrieving headers in bed files. I am not suggesting that one add something as extensive as in VCF, but a handful of clearly defined “##” fields would be exceptionally useful (genome reference, organism, description, and source would seem to be key).

Improve const-correctness

I suggest adding the keyword “const” to the type specifiers for parameters like the following.

chrm (function “chrm_index_add”)
file_name (function “bit_map_load”)

Would you like to apply the advice from an article to more places in your source files?

Feature request: bioconda recipe

Hi @ryanlayer, I'm interested in writing a bioconda recipe for giggle. Would you be up for helping me with this?

Best,
Mike

don't return incorrect result silently on unsorted data

see, e.g. brentp/python-giggle#4

Unit test failing

This is a fresh install with all prerequisites installed; however, this one unit test seems to be failing:

test_giggle.c:1851:test_giggle_bulk_insert:FAIL: Expected 22 Was 30452688

No apparent impact so far. Just FYI, in case this is not expected!

gargs command?

Hi again,

I have made progress with the installation but have run aground while trying to run this command:

ls *.bed | ../gargs -p 30 "bgzip {}"

I've googled it and found a command called xargs, so I wondered if this is a typo? However, to me it looks like you are trying to zip all the .bed files in the /split folder? If so, I am unable to make this work. I have run the following:

ls *.bed | xargs -p 30 "bgzip {}"

but this just resulted in every file name being printed in the terminal window. Can you help clarify what this command is doing and suggest a way to make it work?

Cheers

Hywel
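
The command above appears intended to run bgzip on every .bed file, up to 30 at a time. Note that in GNU xargs the lowercase -p flag means "prompt before running", while uppercase -P sets the number of parallel processes, which may explain the output seen. If gargs is not available, the same effect can be sketched in Python as below. This assumes bgzip is on the PATH; the function name run_on_each and its parameters are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_each(paths, command=("bgzip",), workers=30):
    """Run `command <path>` for every path, at most `workers` at a time."""
    def run_one(path):
        # Each call blocks one worker thread until the external process exits.
        return subprocess.run([*command, path]).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Exit codes are returned in the same order as the input paths.
        return list(pool.map(run_one, paths))
```

A call such as run_on_each(sorted(glob.glob("*.bed"))), after importing glob, would then compress every .bed file in the current directory; a non-zero entry in the returned list marks a file that failed.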

giggle: Could not open human_hm_sort/3651_sort_peaks.narrowPeak.bed.gz.

To whom it may concern,

When I tried to build the index from the Cistrome histone modification peak files, giggle reported this error, which says "could not open XXX."

giggle index -i "human_hm_sort/*.gz"  -o human_hm_index -f -s

the error information:

Could not open file 'human_hm_sort/3651_sort_peaks.narrowPeak.bed.gz'
giggle: Could not open human_hm_sort/3651_sort_peaks.narrowPeak.bed.gz.

I have sorted these bed files using the sort_bed script.
I think this error message is too vague. How can I fix this problem?
Thanks for your kind advice.

Xiu

Write something that merges two indexes

Problem indexing multiple bed files

Hi there,

I can't manage to index more than 1 file.
I am trying this with a single-line bed file like this:

chr2L	450388	451555

$ giggle index -o test_idx9  -i a.bed.gz b.bed.gz
Indexed 1 intervals.

(a.bed.gz is a copy of b.bed.gz, but only 1 interval seems to be indexed)
Consequently the same thing happens when I try to index a directory full of sorted and bgzip'ed bed files.

Is there anything I'm doing wrong? I'm working off the master branch, but the docker container also seems to have the same issue. Any help would be much appreciated.

GIGGLE combo score 0 for highly overlapping files containing broad intervals

See example files below.
3CGvs1CGregion_chr1.bed (query, 57860700 bp in 9773 intervals)
Int90617792_early_RT_chr1.bed (test, 62087472 bp in 2528 intervals)

Int90617792_early_RT_chr1.bed.txt
3CGvs1CGregion_chr1.bed.txt

These were sorted and gzipped using “giggle/scripts/sort_bed”

#Then test file was indexed:
$ giggle index -i "bed_sorted/Int90617792_early_RT_chr1.bed.gz" -o bed_sorted_b -f -s
Indexed 2528 intervals.

#Then giggle search was done:
$ giggle search -i bed_sorted_b -q 3CGvs1CGregion_chr1.bed.gz -s
#file file_size overlaps odds_ratio fishers_two_tail fishers_left_tail fishers_right_tail combo_score
bed_sorted/Int90617792_early_RT_chr1.bed.gz 2528 5744 3.319962891691297e-10 2.324012630792748e-201 2.324012630792748e-201 1 0

This 0 value must be an artifact, possibly because the number of overlaps exceeds the number of intervals; a similar problem was already reported here previously. Interestingly, if the whole procedure is run in the opposite direction (the previous test file is used as the query), the number of overlaps no longer exceeds the number of intervals in the indexed file (9773), yet the GIGGLE score is still 0:

$ giggle index -i "3CGvs1CGregion_chr1.bed.gz" -o bed_sorted_c -f -s
Indexed 9773 intervals.

$ giggle search -i bed_sorted_c -q Int90617792_early_RT_chr1.bed.gz -s
#file file_size overlaps odds_ratio fishers_two_tail fishers_left_tail fishers_right_tail combo_score
bed_sorted/3CGvs1CGregion_chr1.bed.gz 9773 5744 3.3191390803321616e-10 2.3240126288372418e-201 2.3240126288372418e-201 1 0

The GIGGLE score for these two example files is expected to be a high positive value, as the overlaps are obvious in IGV as well as in the bedtools jaccard output:

$ bedtools jaccard -a 3CGvs1CGregion_chr1.bed -b Int90617792_early_RT_chr1.bed -g chr1.genome
intersection union jaccard n_intersections
32124156 87824016 0.365779 5724

That means more than 50% of the bases in each interval file overlap the other file.

I think this limitation of GIGGLE can strongly influence results, as the most significant hits are simply missed.
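
The arithmetic behind the "more than 50%" statement can be checked from the numbers quoted above: the per-file base-pair totals and the intersection reported by bedtools jaccard. A short Python check, using only those figures:

```python
query_bp = 57860700         # 3CGvs1CGregion_chr1.bed
test_bp = 62087472          # Int90617792_early_RT_chr1.bed
intersection_bp = 32124156  # from bedtools jaccard

union_bp = query_bp + test_bp - intersection_bp
print(union_bp)                              # 87824016, matches bedtools
print(round(intersection_bp / union_bp, 6))  # 0.365779, the Jaccard index
print(round(intersection_bp / query_bp, 3))  # 0.555 of the query bases
print(round(intersection_bp / test_bp, 3))   # 0.517 of the test bases
```

Both fractions are above one half even though the Jaccard index itself is about 0.37, because the union counts the bases of both files.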

Make web service optional

Ryan,

Missing lib curl and lib crypto (insert jokes at my expense). Are these required? Is there an older commit of giggle that can be installed without them?

Waiting for the sys admins to install.

--Zev
