koszullab / chromosight

Computer vision based program for pattern recognition in chromosome (Hi-C) contact maps

Home Page: https://chromosight.readthedocs.io

License: Other

Python 99.68% Makefile 0.14% Dockerfile 0.17%
hi-c genomics chromatin-loops pattern-detection

chromosight's People

Contributors

abignaud, axelcournac, baudrly, cmdoret, lgtm-migrator, rmontagn


chromosight's Issues

Where do these kernels come from?

Hello,

Thanks for developing such a great tool! While using chromosight, a question came to mind. Chromosight uses a loop kernel to detect loops in a Hi-C contact map, but where does this kernel come from? Was it derived from a Hi-C dataset or from somewhere else?

Scaling of kernel

Dear chromosighter,

I have a few questions regarding the win-size option for chromosight detect and quantify.

  • I saw that it is set to "auto" by default. Does this mean the kernel size is always scaled to the bin size, or is the default kernel size used regardless of bin size?
  • I ran some analyses on 5 kb-binned data and now want to do the same with higher-resolution (1 kb-binned) data. If I need to set the scaling factor myself, should I set the win-size option to the default kernel size * 5 to correct for the difference?
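In case manual rescaling is needed, a generic way to stretch a kernel by the bin-size ratio is interpolation; a minimal sketch with a toy kernel (not chromosight's actual one), assuming scipy is available:

```python
import numpy as np
from scipy.ndimage import zoom

def rescale_kernel(kernel, factor):
    """Stretch a correlation kernel by a given factor using bilinear
    interpolation, e.g. when moving from 5 kb to 1 kb bins (factor 5)."""
    return zoom(kernel, factor, order=1)

# Toy 3x3 kernel standing in for a real loop kernel
kernel = np.ones((3, 3))
big = rescale_kernel(kernel, 5)
print(big.shape)  # (15, 15)
```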

Best regards,

Reported coordinates seem shifted

Hello, thank you for this tool - I just tried it to call loops with default settings and the results look very good! The only issue is that the reported coordinates seem shifted by ~1-2 pixels relative to the actual highest point of the peak. Here are three example calls, all just off the pixel with the highest value. Could chromosight be selecting the wrong pixel from the neighborhood?
[Screenshot from 2020-04-15 12-34-41]

Do you observe anything like this?
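For what it's worth, one quick way to check a call against its local maximum is to snap each reported coordinate to the highest pixel in its neighborhood; a small numpy sketch (helper name hypothetical, synthetic matrix):

```python
import numpy as np

def snap_to_max(mat, i, j, radius=2):
    """Return the coordinates of the highest-valued pixel within
    `radius` pixels of a reported call (i, j). Useful to check
    whether calls are offset from the local maximum."""
    i0, j0 = max(i - radius, 0), max(j - radius, 0)
    win = mat[i0:i + radius + 1, j0:j + radius + 1]
    di, dj = np.unravel_index(np.argmax(win), win.shape)
    return i0 + di, j0 + dj

m = np.zeros((10, 10))
m[4, 6] = 3.0                # true peak
print(snap_to_max(m, 5, 5))  # (4, 6): the call (5, 5) is 1-2 px off
```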

To reproduce: I used the file at https://data.4dnucleome.org/files-processed/4DNFIFLDVASC/, called loops at 5 kb resolution, and inspected the results in HiGlass. I installed chromosight using conda.

Thanks again for developing and sharing this tool!

only one contact matrix gets processed

Hello Cyril,

I have run chromosight 1.4.0. It did finish, but it seems to have processed only one matrix: the EBV contact matrix corresponding to Human_herpesvirus_4 in the GRCh38 FASTA reference. Is there a setting I am missing?

This is the log:
pearson set to 0.3 based on config file.
min_separation set to 5000 based on config file.
max_perc_undetected set to 50.0 based on config file.
max_perc_zero set to 10.0 based on config file.
Matrix already balanced, reusing weights
Preprocessing sub-matrices...
[====================] 100.0% EBV-EBV
Detecting patterns...
[--------------------] 0.0% Kernel: 0, Iteration: 0
[--------------------] 0.0% Kernel: 0, Iteration: 0
No pattern detected ! Exiting.

This is how I run the program:

/mnt/lustre/scratch/SOFTWARE/miniconda3/bin/chromosight \
detect \
--threads=8 \
--min-dist 20000 --max-dist 200000 \
/mnt/lustre/scratch/results.test/matrix/test.matrix.cool \
loops_H2087_intra

Thanks so much
Jorge

How to evaluate the detected loops?

Hello, thank you for the nice work~

I'm interested in how you evaluate the detected loops. In the paper, you mention that significance is computed through a p-value.

However, have you compared your results against a ground-truth annotation?

For example, when you detect a loop, say "loop No. 1", how would you classify it as a true positive? From my perspective, the detected "loop No. 1" should be somehow "close" to a ground-truth loop before it can be counted as a correct detection.

I hope you can help me. Thank you!
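The closeness criterion described above can be made concrete by matching each detection to the nearest unmatched ground-truth loop within a bin tolerance; a minimal sketch (tolerance and coordinates made up for illustration):

```python
def match_calls(detected, truth, tol=2):
    """Greedy matching of detected loops to ground-truth loops.
    A detection is a true positive if both of its anchors fall within
    `tol` bins (Chebyshev distance) of an unmatched ground-truth loop."""
    truth = list(truth)
    tp = 0
    for d in detected:
        for k, t in enumerate(truth):
            if max(abs(d[0] - t[0]), abs(d[1] - t[1])) <= tol:
                tp += 1
                truth.pop(k)  # each ground-truth loop is matched at most once
                break
    return tp

detected = [(10, 20), (50, 60), (100, 140)]
truth = [(11, 21), (300, 400)]
tp = match_calls(detected, truth)
precision = tp / len(detected)
print(tp, round(precision, 2))  # 1 0.33
```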

How does Chromosight compute the Pearson correlation?

Hello! First, thank you for fixing the pandas error!

I have been trying to understand how exactly you calculate the Pearson correlation. The article describes a correlation, but the code implements a convolution. I don't see how the two fit together.

Why do I need this? I want to know whether the level of detail of the kernel matters. For instance, in one issue you proposed generating a TAD corner as a picture of ones and zeroes. But obviously the "real" corner is much more complicated. So, should I try to generate as detailed a kernel as possible, or can I use a rough one? Will it affect the results?
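On the correlation-vs-convolution question: for a fixed kernel, the Pearson correlation of every window against the kernel can be obtained by convolving the z-scored matrix with the z-scored kernel, which is presumably why the code is written as a convolution. A sketch verifying the equivalence on a single window:

```python
import numpy as np

def pearson_window(win, kernel):
    """Plain Pearson correlation between one window and the kernel."""
    return np.corrcoef(win.ravel(), kernel.ravel())[0, 1]

def pearson_via_dot(win, kernel):
    """Same value computed as a normalized dot product, i.e. the
    operation a convolution performs at each position when both
    signals are z-scored (population std, ddof=0)."""
    w = (win - win.mean()) / win.std()
    k = (kernel - kernel.mean()) / kernel.std()
    return float((w * k).sum() / w.size)

rng = np.random.default_rng(0)
win = rng.random((7, 7))
kernel = rng.random((7, 7))
print(np.isclose(pearson_window(win, kernel),
                 pearson_via_dot(win, kernel)))  # True
```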

additional features

  • stretch the kernel to detect patterns of different size
  • adapt kernel size to the resolution
  • fix convolution for interchromosomal matrices
  • implement 'turbo' convolution

simplify objects

It would be more convenient to repurpose the contact_map object to hold only single chromosomes (with their start bin, pixels, undetectable bins and temporary file path) and to have a separate object (e.g. genome_object) that stores those contact_map objects (and remembers their original order).

This would make it possible to completely separate the preprocessing step from file loading, instead of doing the chromosome splitting in the contact_map constructor.
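The proposed split could be sketched like this (names and fields illustrative, not the current API):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChromosomeMap:
    """Single-chromosome contact map with its own metadata."""
    name: str
    start_bin: int
    pixels: object            # sparse matrix for this chromosome
    undetectable_bins: list = field(default_factory=list)
    tmp_path: str = ""

@dataclass
class Genome:
    """Ordered container of ChromosomeMap objects, so preprocessing is
    decoupled from file loading and the original order is preserved."""
    chroms: List[ChromosomeMap] = field(default_factory=list)

    def add(self, cmap: ChromosomeMap):
        self.chroms.append(cmap)

    def order(self):
        return [c.name for c in self.chroms]

g = Genome()
g.add(ChromosomeMap("chr2", 1000, None))
g.add(ChromosomeMap("chr1", 0, None))
print(g.order())  # ['chr2', 'chr1']
```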

Optimise memory usage

Currently, the whole matrix is loaded into RAM as a sparse array for balancing. This can be an issue for extremely large datasets. An alternative would be to:

  1. In case of bedgraph2 input, convert it to cool format
  2. Rely on the cooler API to balance the cool file in place.
  3. Load sub-matrices one at a time from the cool file for processing

This would keep only as many sub-matrices in memory as there are parallel processes.
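Step 3 above can be sketched generically, using an in-memory scipy sparse matrix as a stand-in for the cool file (with the cooler API one would instead fetch each block via clr.matrix(sparse=True).fetch(chrom)):

```python
import scipy.sparse as sp

def iter_chrom_blocks(mat, offsets):
    """Yield one intra-chromosomal sub-matrix at a time, so only as
    many blocks as there are workers need to live in memory.
    `offsets` lists chromosome start bins plus the total bin count."""
    for start, end in zip(offsets[:-1], offsets[1:]):
        yield mat[start:end, start:end]

mat = sp.random(100, 100, density=0.05, format="csr", random_state=0)
offsets = [0, 40, 100]  # two toy "chromosomes"
shapes = [block.shape for block in iter_chrom_blocks(mat, offsets)]
print(shapes)  # [(40, 40), (60, 60)]
```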

Empty score fields in quantify results

Hello!

I used chromosight quantify to compare two conditions (as described in the documentation) and it worked overall, but I saw that some rows have empty score, pvalue and qvalue fields. Do you know why this happens for some rows? Does it correspond to masked bins in the Hi-C matrix?
Should I remove these rows for the rest of the analysis?

Best,

Perrine

Chromosight for single-cell Hi-C

(python37) [wuhg@mgt module]$ chromosight detect --pattern hairpins --min-dist 50000 --max-dist 2000000 SC_035.mcool::/resolutions/10000 SC_035_hairpins

pearson set to 0.1 based on config file.
min_separation set to 5000 based on config file.
max_perc_undetected set to 75.0 based on config file.
max_perc_zero set to 10.0 based on config file.
Whole genome matrix balanced
Found 12393 / 272566 detectable bins
Preprocessing sub-matrices...
[====================] 100.0% Y-Y
Sub matrices extracted
Detecting patterns...
[--------------------] 0.0% Kernel: 0, Iteration: 0
[====================] 100.0% Kernel: 0, Iteration: 0
Minimum pattern separation is : 1
Traceback (most recent call last):
File "/share/home/wuhg/.local/bin/chromosight", line 8, in <module>
sys.exit(main())
File "/share/home/wuhg/.local/lib/python3.7/site-packages/chromosight/cli/chromosight.py", line 959, in main
cmd_detect(args)
File "/share/home/wuhg/.local/lib/python3.7/site-packages/chromosight/cli/chromosight.py", line 802, in cmd_detect
pval_mask = np.isnan(all_coords.pvalue)
File "/share/home/wuhg/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 1936, in __array_ufunc__
return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File "/share/home/wuhg/.local/lib/python3.7/site-packages/pandas/core/arraylike.py", line 358, in array_ufunc
result = getattr(ufunc, method)(*inputs, **kwargs)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
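The TypeError above is what np.isnan raises on a pandas column with object dtype (e.g. when it contains None). A sketch reproducing the failure and one possible fix, coercing to float first:

```python
import numpy as np
import pandas as pd

# Object dtype, as presumably happens for the pvalue column in the crash
pvalue = pd.Series([0.01, None, 0.2], dtype=object)

try:
    np.isnan(pvalue)          # np.isnan only accepts numeric input
except TypeError as e:
    print("fails:", type(e).__name__)

# Coerce to float first; non-numeric entries become NaN
mask = np.isnan(pd.to_numeric(pvalue, errors="coerce"))
print(mask.tolist())  # [False, True, False]
```

pd.isna(pvalue) would also work directly, regardless of dtype.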

Can chromosight detect loops from a restriction-fragment-level (1f, 2f, etc.) Hi-C matrix?

Hi,
I am wondering whether chromosight can be used to detect chromatin loops from a restriction-fragment-level (1f, 2f, etc.) Hi-C matrix, e.g. a cooler file binned at the restriction-fragment level. I found that using something like a 1 kb Hi-C matrix in cooler format to identify loops gives inaccurate anchors (I am interested in small-scale loops). The Hi-C library was generated with MboI.
Thanks!
Best wishes!
Linhua

Run chromosight with the --inter option on each interchromosomal matrix separately

Hello,

I had a go with chromosight 1.4.0 with the --inter option enabled. Unfortunately, I exceeded my RAM limit, which is 400 GB.

I was wondering if there is a way to analyse each interchromosomal matrix separately so the job fits in memory.

My dataset (the .cool file) comes from a Hi-C experiment on human (GRCh38), and this is how I run chromosight:

/mnt/lustre/scratch/SOFTWARE/miniconda3/bin/chromosight \
detect \
--threads=4 \
--inter \
/mnt/lustre/scratch/results.test/matrix/test.matrix.cool \
loops_all

Any advice will be much appreciated.

Regards
Jorge

Different score values between chromosight detect and chromosight quantify on borders pattern

Hello,
I detected the borders pattern using chromosight detect and then wanted to quantify those border scores across different conditions. However, when running chromosight quantify on the initial matrix, the border scores were much lower than those found with chromosight detect (except for the first border detected). Is this result expected?

Here are the command lines I used (with Chromosight version 1.4.1):
chromosight detect --pattern=borders --threads 6 --min-separation 20000 -p 0.5 -W 9 matrix.cool test/borders_detect
chromosight quantify -W 9 --pattern=borders test/borders_detect.tsv matrix.cool test/borders_quantify

Below are parts of the output files:
Chromosight detect output:
chrom1 start1 end1 chrom2 start2 end2 bin1 bin2 kernel_id iteration score pvalue qvalue
chromosome_ref 1315000 1320000 chromosome_ref 1315000 1320000 263 263 0 0 0.8382847806 0.0000000000 0.0000000000
chromosome_ref 110000 115000 chromosome_ref 110000 115000 22 22 1 0 0.7745989939 0.0000000000 0.0000000001
chromosome_ref 205000 210000 chromosome_ref 205000 210000 41 41 1 0 0.7087168920 0.0000000331 0.0000000777
chromosome_ref 260000 265000 chromosome_ref 260000 265000 52 52 1 0 0.5554020576 0.0000748932 0.0000818600
chromosome_ref 505000 510000 chromosome_ref 505000 510000 101 101 1 0 0.5250500651 0.0001567472 0.0001601547
chromosome_ref 710000 715000 chromosome_ref 710000 715000 142 142 1 0 0.7006494326 0.0000000181 0.0000000448
chromosome_ref 780000 785000 chromosome_ref 780000 785000 156 156 1 0 0.5201083470 0.0001869179 0.0001869179
chromosome_ref 1030000 1035000 chromosome_ref 1030000 1035000 206 206 1 0 0.6466182488 0.0000006142 0.0000011547

Chromosight quantify output:
chrom1 start1 end1 chrom2 start2 end2 bin1 bin2 score pvalue qvalue
chromosome_ref 1315000 1320000 chromosome_ref 1315000 1320000 263 263 0.8382847806 0.0000000000 0.0000000000
chromosome_ref 110000 115000 chromosome_ref 110000 115000 22 22 0.2937926314 0.0497858994 0.1559958182
chromosome_ref 205000 210000 chromosome_ref 205000 210000 41 41 0.3053649535 0.0488575488 0.1559958182
chromosome_ref 260000 265000 chromosome_ref 260000 265000 52 52 0.2115458335 0.1743224566 0.3724161573
chromosome_ref 505000 510000 chromosome_ref 505000 510000 101 101 -0.1663934145 0.2763862300 0.4811167707
chromosome_ref 710000 715000 chromosome_ref 710000 715000 142 142 0.2302008405 0.1287384415 0.3184582499
chromosome_ref 780000 785000 chromosome_ref 780000 785000 156 156 0.2003301451 0.1881451788 0.3770751398
chromosome_ref 1030000 1035000 chromosome_ref 1030000 1035000 206 206 -0.0722279965 0.6391357794 0.7952778302

Questions about resolution in loop json file

Hi: I found that chromosight detects loops quite well in my Hi-C datasets. I noticed that the resolution in loops.json and loops_small.json is defined as 2000 in both files. Should I adjust this value if I use 5 kb or 10 kb resolution cooler files?

Another question: artificial_template_loops_type1.txt looks like a pileup heatmap, where each pixel has a specific value. Will these matrix values affect loop calling with the default options? Do we need to refine them?
I also tried the --click option of generate-config to build the kernel manually by double-clicking on relevant regions of a Hi-C matrix, but it is quite time-consuming, so I gave up.
Thanks!

Correlation score

Thank you for making this great tool available! I've been trying to apply the quantify module to score interaction strength at pairs of transcription-factor binding sites using the supplied loop kernel, but the scores are all strictly positive, unlike what is shown for Rad21 in the tutorial and in your manuscript.

Could you please help me troubleshoot what I'm doing wrong? I used the following command on the files here

chromosight quantify --pattern loops sites.bed2d 5000.cool out

Different number of patterns for the same Hi-C matrix

Hi!

I ran into the following issue. When I execute Chromosight v1.4.1, I get a slightly different number of loops (<100 difference) each time I run it on the exact same Hi-C matrix. I wonder why this happens and whether there is a way to make detection deterministic (is there a seed somewhere?).

Best,
Mikhail

chromosight detects hairpins, but the number is too large

Hi
I used chromosight to detect hairpins in our dataset, but I got an implausibly large number of hairpins. Here is my report. Do you have any suggestions? Did I set any parameter wrong?

detect --threads 10 --pattern hairpins --min-dist 50000 --max-dist 2000000 gm12878_microc.mcool::resolutions/10000 /gshare/xielab/wuhg/published/Micro-C_hairpins
pearson set to 0.1 based on config file.
min_separation set to 5000 based on config file.
max_perc_undetected set to 75.0 based on config file.
max_perc_zero set to 10.0 based on config file.
Matrix already balanced, reusing weights
Found 261150 / 308839 detectable bins
Preprocessing sub-matrices...
[====================] 100.0% chrY-chrY
Sub matrices extracted
Detecting patterns...
[--------------------] 0.0% Kernel: 0, Iteration: 0
[====================] 100.0% Kernel: 0, Iteration: 0
Minimum pattern separation is : 1
186221 patterns detected
Saving patterns in /gshare/xielab/wuhg/published/Micro-C_hairpins.tsv
Saving patterns in /gshare/xielab/wuhg/published/Micro-C_hairpins.json
Saving pileup plots in /gshare/xielab/wuhg/published/Micro-C_hairpins.pdf

Input to simulated data

Hi, which inputs are we supposed to give to the simulated-data script?

Particularly in this part of the file:

# Path to positions of borders in the experimental matrix (in bins,
# relative to chromosome start)
borders_pos = np.loadtxt(sys.argv[3])

Detection doesn't report loops that should pass the Pearson threshold

Hello,

I was wondering how the detection step decides which coordinates to report. We have a gold-standard set of manually annotated loops, and I am trying to maximise the overlap with them. Some loops were not found by detect even though, when running quantify on the same coordinates, they had a score that should have passed the Pearson threshold. They are not close to other loops that could have interfered.

I also noticed unexpected behaviour when changing only the Pearson parameter in detect: when I lower the threshold, I get more loops called overall, but I also lose some of them.

Do you know why this is happening?

Best,

Adding quantification of patterns

On the to-do list (an easy one!)

  • add quantification of a set of positions against a specific generic pattern: this would allow, for example, quantifying whether a set of genomic positions forms more loops in one biological condition than in another; this could be interesting for many of our projects.
    Could be implemented by taking the mean of the correlation coefficients from the detector function.

How to compare Hi-C loops from different conditions, like DEG analysis in RNA-Seq?

Hi:
Thanks for your great chromosight! We have used it extensively on plant Hi-C data. Now I am wondering how to compare loops between two Hi-C samples obtained under different conditions.
Option 1: call loops for samples A and B separately, then merge LoopA and LoopB (sometimes the loop results differ a lot, which is inconsistent with the Hi-C maps in Juicebox). Then use chromosight quantify to calculate the similarity to the kernel in each sample. As in DEG analysis for RNA-Seq, can we identify loops that are significantly gained/lost between conditions?
Option 2: merge the two Hi-C files and call loops on the artificially merged sample.
In summary, how should we call loops for two or more Hi-C samples, and how should we compare them?
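For option 1, once the same merged coordinate set has been scored in both samples with chromosight quantify, the comparison itself is a table join; a minimal pandas sketch with made-up scores and an arbitrary threshold:

```python
import pandas as pd

# Stand-ins for two quantify outputs on the same merged loop set
a = pd.DataFrame({"bin1": [10, 20, 30], "bin2": [15, 40, 90],
                  "score": [0.6, 0.1, 0.4]})
b = pd.DataFrame({"bin1": [10, 20, 30], "bin2": [15, 40, 90],
                  "score": [0.2, 0.5, 0.4]})

# Join on anchor bins, then compute the per-loop score difference
merged = a.merge(b, on=["bin1", "bin2"], suffixes=("_A", "_B"))
merged["delta"] = merged.score_B - merged.score_A
gained = merged[merged.delta > 0.2]  # arbitrary threshold for illustration
print(gained[["bin1", "bin2", "delta"]].to_string(index=False))
```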
Thank you for your time.
Best wishes!
Linhua

HiCMatrix-generated cool file not supported - possible solution

Hi,
I am trying to compare the results I got from hicExplorer TAD prediction to chromosight. However, I am having trouble using my matrix: I exported it to .cool as described in the instructions, but I get the error:

"""
Traceback (most recent call last):
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/cli/chromosight.py", line 610, in _detect_sub_mat
    chrom_patterns, chrom_windows = cid.pattern_detector(
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/utils/detection.py", line 273, in pattern_detector
    mat_conv = preproc.diag_trim(mat_conv.tocsr(), contact_map.max_dist)
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/utils/preprocessing.py", line 119, in diag_trim
    trimmed = sp.tril(mat, n, format="csr")
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/scipy/sparse/_extract.py", line 100, in tril
    mask = A.row + k >= A.col
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
"""

I downloaded the example.cool file and ran it in parallel for debugging.

I did some debugging on my end and got down to the keep_distance function in contact_map.py.
There, self.max_dist is None, so mat_max_dist = self.matrix.shape[0] + self.largest_kernel (10607 + 3).
This value gets passed around (it goes in as max_dist in preprocess.detrend), but self.max_dist is never set to a different value. The error at the end occurs because matrix.max_dist is never changed from None and is eventually fed to sp.tril, which throws the TypeError.

I checked my cool file with cooler info and the main difference compared to example.cool seems to be:

"bin-size": null,
"bin-type": "variable",

versus:

"bin-size": 1000,
"bin-type": "fixed",

For the example.cool file. I am not 100% sure that is the origin but that is my best guess atm.

To make the tool work, I changed self.max_dist = self.keep_distance in contact_map.detrend and it seems to work now. However, I am not sure what effect this change has on the overall results, as I am not familiar with the role of max_dist in the tool. Do you think this workaround is likely to affect the final results in a major way?

Cheers,
Vittorio

output file name

Hello,

I am trying to use chromosight detect in a Snakemake pipeline, and the problem is that there is no way to change the default output file names. I could create a new folder for each analysis and then automatically move/rename the files, but that is not the most convenient approach... Is there a reason why the file names cannot be customized? (Or the file prefix, since chromosight detect always saves at least two files.)

Thanks!
Ilya

Point and click mode

Could you share the basic usage of the 'point and click' mode?
I used the generate-config function, but got the error below:

sora@server:my/path/to/chromosight$ chromosight generate-config --threads 10 --click myfile.mcool::resolutions/5000 my_prefix
Matrix already balanced, reusing weights
Found 500267 / 545114 detectable bins
Preprocessing sub-matrices...
[====================] 100.0% chrY-chrY
Sub matrices extracted

Traceback (most recent call last):
File "/home/sora/.local/bin/chromosight", line 8, in <module>
sys.exit(main())
File "/home/sora/.local/lib/python3.8/site-packages/chromosight/cli/chromosight.py", line 961, in main
cmd_generate_config(args)
File "/home/sora/.local/lib/python3.8/site-packages/chromosight/cli/chromosight.py", line 536, in cmd_generate_config
windows = click_finder(processed_mat, half_w=int((win_size - 1) / 2))
File "/home/sora/.local/lib/python3.8/site-packages/chromosight/utils/plotting.py", line 139, in click_finder
plt.imshow(mat.toarray(), cmap="afmhot_r", vmax=np.percentile(mat.data, 95))
File "/mnt/data0/apps/anaconda/Anaconda2-5.2/envs/py38/lib/python3.8/site-packages/scipy/sparse/compressed.py", line 1029, in toarray
out = self._process_toarray_args(order, out)
File "/mnt/data0/apps/anaconda/Anaconda2-5.2/envs/py38/lib/python3.8/site-packages/scipy/sparse/base.py", line 1185, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError: Unable to allocate 2.16 TiB for an array with shape (545114, 545114) and data type float64

I guess there is a memory issue. Could you suggest a solution?
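Until the tool can subsample internally, one workaround is to densify only a region rather than the whole genome-wide matrix: the 2.16 TiB allocation is 545114² float64 values, while a 2000-bin window is ~30 MB. A scipy sketch (sizes illustrative):

```python
import scipy.sparse as sp

n = 50_000                     # stand-in for a huge genome-wide matrix
mat = sp.eye(n, format="csr")  # sparse: cheap to hold in memory

# Densifying everything would need n * n * 8 bytes;
# a windowed slice stays small.
i0, i1 = 10_000, 12_000
window = mat[i0:i1, i0:i1].toarray()
print(window.shape, window.nbytes)  # (2000, 2000) 32000000
```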

call peaks between different chromosomes

Hello Chromosight developers,

This is not an issue report; I would just like to know whether Chromosight can call peaks between different chromosomes.

Thanks so much
Jorge

TypeError: ufunc 'isnan' not supported for the input types

Hello Cyril,

I have run chromosight 1.4.1 and encountered the following error (full log below). I used the same input .cool file on 1.4.0, where it worked. Can you think of anything I might be doing wrong?

pearson set to 0.3 based on config file.
min_separation set to 5000 based on config file.
max_perc_undetected set to 50.0 based on config file.
max_perc_zero set to 10.0 based on config file.
WARNING: Detection on interchromosomal matrices is expensive in RAM
Matrix already balanced, reusing weights
Preprocessing sub-matrices...
[====================] 100.0% EBV-EBV
Detecting patterns...
[--------------------] 0.0% Kernel: 0, Iteration: 0
[====================] 100.0% Kernel: 0, Iteration: 0
Traceback (most recent call last):
File "/mnt/lustre/scratch/SOFTWARE/miniconda3/bin/chromosight", line 8, in <module>
sys.exit(main())
File "/mnt/lustre/scratch/SOFTWARE/miniconda3/lib/python3.8/site-packages/chromosight/cli/chromosight.py", line 950, in main
cmd_detect(args)
File "/mnt/lustre/scratch/SOFTWARE/miniconda3/lib/python3.8/site-packages/chromosight/cli/chromosight.py", line 793, in cmd_detect
pval_mask = np.isnan(all_coords.pvalue)
File "/mnt/lustre/scratch/SOFTWARE/miniconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 1935, in __array_ufunc__
return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File "/mnt/lustre/scratch/SOFTWARE/miniconda3/lib/python3.8/site-packages/pandas/core/arraylike.py", line 358, in array_ufunc
result = getattr(ufunc, method)(*inputs, **kwargs)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

This is how I have run chromosight

export PATH=/mnt/lustre/scratch/SOFTWARE/miniconda3/bin:$PATH

/mnt/lustre/scratch/SOFTWARE/miniconda3/bin/chromosight \
detect \
--threads=1 \
--inter \
--min-dist 50000 --max-dist 200000 \
/mnt/lustre/scratch/test.matrix.cool \
loops_output

Thanks so much
Jorge

Features to add and things to improve

The sparse version of the pipeline is functional as of 03fa26a but a few things could be improved:

  • Refactor code: remove detection operations from main script and delegate to sub modules for easier debugging/maintenance
  • Return loops in genomic coordinates instead of bin numbers (or in addition to them), ideally in bedpe format
  • Add parallelisation (maybe at chromosome level?)
  • Replace hard-coded kernel configs with config files packaged in chromovision and allow users to provide a custom file.
  • Add a command (e.g. chromovision create-kernel) to generate a template kernel config to help users make new configs.

Plotting

Hi,
thanks for the handy tool!
I wanted to ask if you could provide some code for labeling the loops, e.g. based on their order in the output file. This would make it easier to extract the coordinates of a single loop of interest from the file.
Also, how do I plot chromosome coordinates on the axes instead of bin numbers?
Thank you in advance!
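For reference, converting bin indices to genomic coordinates is just an offset-and-scale for fixed-size bins; a minimal helper (not part of the chromosight API, assumes a fixed bin size):

```python
def bin_to_coord(bin_idx, chrom_start_bin, bin_size):
    """Genomic start position of a bin, given the chromosome's first
    bin index in the genome-wide matrix and the fixed bin size."""
    return (bin_idx - chrom_start_bin) * bin_size

# Label loops by their order in the output file and convert anchors
loops = [(98, 99), (123, 125)]   # (bin1, bin2) pairs from the tsv
bin_size = 100_000
labels = {f"loop_{k + 1}": (bin_to_coord(b1, 0, bin_size),
                            bin_to_coord(b2, 0, bin_size))
          for k, (b1, b2) in enumerate(loops)}
print(labels["loop_1"])  # (9800000, 9900000)
```

The same conversion gives tick labels for plot axes in place of raw bin numbers.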

Pattern = TAD?

Thank you for creating Chromosight; it is such a great tool. Are there any plans to add a fourth pattern option for detection, a TAD (the classic triangle-shaped domain)? Or how easy would it be for a user to create one as a custom pattern?

I think this would be a cool addition, since TADs have been shown to be discontinuous along the chromosomes of some species. In these cases, simply calling boundaries and labelling the space between two boundaries a "TAD" doesn't always have biological meaning.

qvalues are always smaller than the pvalue

Hello Cyril,

I have noticed that in my tsv output from chromosight v1.4.0 the qvalues are always smaller than the pvalues. I would expect the opposite. Should I still filter on the qvalues?

These are the first lines of my output:
chrom1 start1 end1 chrom2 start2 end2 bin1 bin2 kernel_id iteration score pvalue qvalue
1 9800000 9900000 1 9900000 10000000 98 99 0 0 0.3970866534 0.0000000617
1 12300000 12400000 1 12500000 12600000 123 125 0 0 0.4051226690 0.0000002507
1 20300000 20400000 1 20400000 20500000 203 204 0 0 0.4270293101 0.0000000041
1 33800000 33900000 1 34000000 34100000 338 340 0 0 0.3583274096 0.0000004544
1 39400000 39500000 1 39500000 39600000 394 395 0 0 0.4120016497 0.0000000167
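For context: Benjamini-Hochberg q-values satisfy q_i = min over j >= i of (p_(j) * m / j) on the sorted p-values, and since m / j >= 1, a q-value can never be smaller than its p-value. So the observation above would indicate a bug or a truncated column. A self-check sketch (reusing the p-values from the table):

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0, 1)
    return q

p = np.array([6.17e-8, 2.507e-7, 4.1e-9, 4.544e-7, 1.67e-8])
q = bh_qvalues(p)
print(np.all(q >= p))  # True
```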

This is how I run chromosight:

/mnt/lustre/scratch/SOFTWARE/miniconda3/bin/chromosight \
detect \
-z 100 -u 100 \
--threads=1 \
--inter \
/mnt/lustre/scratch/results.test/matrix/test.25000.cool \
loops_test_25000

Thanks so much
Jorge

Recommended parameters for border detection

Hi
I'm new to Hi-C.
What are the recommended parameters for chromosight detect --pattern borders?
Especially min-dist and max-dist.
This is what I have used:

chromosight detect \
--threads 8 \
--min-dist 10000 \
--min-dist 1000000 \
--pattern $pattern \
$PAIRS_FILE \
$output_prefix

But the contact map plotted from the .tsv file looks strange.
Thanks.

Tuning the parameters (perc-zero, perc-undetected, pearson) for a relatively small dataset

Dear Chromosight developers,
First, I'd like to thank you for developing such an excellent loop-calling tool; it is undoubtedly one of the best and most popular loop callers!
Calling valid and precise loops is essential for downstream analysis, so I am trying to apply chromosight to my Micro-C dataset, which has about 150M contacts for the mouse genome. I know this is a bit of an awkward size, slightly below the lowest recommended depth, but I still want to try. I have read the closed issues and understand that I may need to adjust the parameters (perc-zero, perc-undetected, pearson). But how can I assess the quality of loops called under different parameter settings? (All I can come up with is visualizing the map and checking by eye.) Could you give me some guidance on tuning and assessing the parameters and results, or share some practical experience?
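One quantitative sanity check beyond eyeballing is a pileup: average the windows around all calls and compare the centre pixel to the background corners; a good call set shows clear central enrichment. A generic numpy sketch (not chromosight's own metric; synthetic data):

```python
import numpy as np

def pileup_enrichment(mat, calls, half_w=3):
    """Average the window around each call and return the ratio of the
    centre pixel to the mean of the four corner pixels, as a rough
    quality score for a set of loop calls."""
    wins = []
    for i, j in calls:
        win = mat[i - half_w:i + half_w + 1, j - half_w:j + half_w + 1]
        if win.shape == (2 * half_w + 1, 2 * half_w + 1):
            wins.append(win)
    avg = np.mean(wins, axis=0)
    corners = [avg[0, 0], avg[0, -1], avg[-1, 0], avg[-1, -1]]
    return avg[half_w, half_w] / np.mean(corners)

m = np.ones((50, 50))
for i, j in [(10, 20), (30, 40)]:
    m[i, j] = 5.0  # synthetic loop signal on a flat background
score = pileup_enrichment(m, [(10, 20), (30, 40)])
print(score)  # 5.0
```

Comparing this score across parameter settings gives one objective axis alongside visual inspection.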
Best wishes!
Woody

Raw or Norm counts cool

hi,

I am about to try chromosight on my data and want to double-check which Hi-C contact map I should provide. From the publication I understand that raw counts should be given, since ICE normalization is performed internally. Is this correct? Or should I provide my corrected version?

Also, do you have a good rationale for choosing the resolution I should use (5 kb, 10 kb)?

NB: My count matrices were generated using the hicexplorer package.

Different number of loops on GM12878 Hi-C map

Hi,

I read the Chromosight paper, and I have a question about the loop calling on GM12878 Hi-C data. Below were the two statements about these results from the paper.

Presentation and benchmark of Chromosight.
For instance, Chromosight found 85% of the loops detected by Cooltools, the software with the highest precision in our benchmark, while overall identifying a much larger number of loops (37,955 vs. 6264, respectively) (Supplementary Fig. 3c).

Exploration of various genomes and patterns.
With default parameters, Chromosight identified 18,839 loops (compared to ~10,000 detected in ref. 6) whose anchors fall mostly (~58%, P < 10^-16) into loci enriched in cohesin subunit Rad21 (Fig. 3b).

If I understand correctly, you applied Chromosight to call loops on the same dataset from Rao et al., Cell 2014 in both cases. So I was wondering why Chromosight yielded different loop counts (37,955 vs. 18,839). I would be grateful if you could clarify this point.

Thank you so much.

Jinakun

Chromosight quantification time complexity

Dear chromosighter,

I am running quantify on 1 kb-binned matrices (Drosophila genome) for a set of 140 loops. I tried scaling the pattern by a factor of 2 or 3 to determine the best factor (the normal size works well on 5 kb-binned data).
However, while quantification finishes after 35 min for the smallest matrices, after 72 h it was still not done for the others (they were at 16% or 33% after 35 min according to the log and did not progress after that; the process was killed at 72 h).
I was wondering whether you have ever seen this with bigger data, and whether you know the time complexity of the algorithm (to estimate the time limit I should set depending on the data). Do you know where the algorithm could be stuck for so long?

Best,

Handle variable bin size

When the input cool file does not have a constant bin size, Chromosight crashes with an obscure error message.

We should either provide an informative error message, and/or allow overriding this behaviour to support variable bin sizes, perhaps with a warning that the detrending and pattern calls may be less reliable.
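The informative-error option could be as simple as a guard on the bin size (cooler reports binsize as None for variable-bin files; function name hypothetical):

```python
def check_fixed_bins(binsize):
    """Fail early with a clear message when the cool file uses variable
    bins (cooler reports binsize as None in that case)."""
    if binsize is None:
        raise ValueError(
            "Input cool file has variable bin size ('bin-type': 'variable'); "
            "chromosight expects fixed bins. Please regenerate the matrix "
            "at a fixed resolution."
        )
    return binsize

print(check_fixed_bins(1000))  # 1000
```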

Use chromosight to quantify chromatin interactions over domains (TADs or local domains)

Hi:
Thank you very much for developing such excellent software. I am testing chromosight on Hi-C datasets from plants. I am interested in whether chromosight can be used to quantify chromatin interaction strength over domains (TADs, TAD-like domains, those local triangle-shaped domains): for instance, scoring the contact strength of targeted regions in sample A and sample B. That way we could integrate the Hi-C datasets with RNA-Seq, etc.
Thanks.
Linhua

Proper packaging

Now that the software works properly, we should do the following:

  • Document all functions properly
  • Write a documentation on readthedocs with examples, and/or a python notebook
  • Add unit tests
  • Distribute it as a python package on Pypi
  • Automate tests and distribution using a CI provider (CircleCI?)

Perhaps also make a conda package
