Comments (16)
Also when I filtered my reads to be more than the 20X limit, 99.9% of my bedpe regions fail with the following error:
"Could not compute similarity for [number] regions. this cna be due to faulty coordinates, too smallregion sizes or too many unmappable bins"
from chess.
Maybe you could try using the normalized hic files generated by juicer. I also found the cool files sometimes can fail. Please refer to #16, however, I also confused about the normalization procedures in the chess publication.
Best wishes,
Zheng Zhuqing
from chess.
Hi @rikrdo89 , to answer your first question, chess should only quit with this message if all regions you submitted in the bedpe span less than 20 bins. Is that the case?
About the second: do I understand right that you get numeric output for some of the regions (0.01%) ? That is, not all rows in the ouput file are nan?
from chess.
Hi Nick. Yes out of the 5000 regions i used (that span more than 20x bins) only one didn't have nan for the first two fields. The last field, however, also have a nan.
from chess.
could you try to update fanc to 0.9.9 (pip install fanc --upgrade
)? @kaukrise just pushed a great fix for an issue with cooler file handling, which fixes the all NaN issue (your first question, possibly) for me.
from chess.
I updated fanc to 0.9.9 and I still have the same issue.
$ chess sim dmso.allValidPairs.hic@10000 tx4h.allValidPairs.hic@10000 dmso_loops_greater200k.bedpe dmso_vs_tx4.tsv
2020-12-01 19:23:19,079 INFO Running 'user/miniconda3/envs/py3/bin/chess sim dmso.allValidPairs.hic@10000 tx4h.allValidPairs.hic@10000 dmso_loops_greater200k.bedpe dmso_vs_tx4.tsv'
2020-12-01 19:23:21,001 INFO CHESS version: 0.3.5
2020-12-01 19:23:21,001 INFO FAN-C version: 0.9.9
2020-12-01 19:23:21,003 INFO Loading reference contact data
2020-12-01 19:33:30,297 INFO Loading query contact data
2020-12-01 19:42:26,874 INFO Loading region pairs
2020-12-01 19:42:26,913 INFO Launching workers
2020-12-01 19:42:27,236 INFO Submitting pairs for comparison
2020-12-01 19:42:28,780 INFO Could not compute similarity for 4046 region pairs.This can be due to faulty coordinates, too smallregion sizes or too many unmappable bins
2020-12-01 19:42:50,681 INFO Finished '/user/miniconda3/envs/py3/bin/chess sim dmso.allValidPairs.hic@10000 tx4h.allValidPairs.hic@10000 dmso_loops_greater200k.bedpe dmso_vs_tx4.tsv'
from chess.
Do the regions in the bedpe need to be formatted in any particular way? I wonder if I need to do some sorting or maybe binning of the regions ? Or is this a different issue?
from chess.
The bedpe should look like the output of chess pairs
. I take from this that you have generated the bedpe yourself, right? Could you generate a pairs file with that command and run it on your data, just to see if you still get so many nans?
from chess.
Correct. I generated the bedpe from running a loop caller, and then formatting it in a way chess would not immediately quit ( removing chr, adding unique identifier, etc)
I will run the chess pair command and let you know if it works for me.
Just a clarification question, since my goal is to compare changes between two hic samples around already called loops and domains, was using 'chess sim' the way I was initially doing correct ?
from chess.
I think this is a reasonable application of chess, and the command you posted looks good to me.
from chess.
I generated a bedpe using chess pairs --chromosome chr19 mm10 500000 100000 mm10_chr19_500kb_win_100kb_step.bedpe
. The resulting bedpe has the string chr
in the chromosomes (chr1, chr2...) and thus breaks chess sim
(After waiting for a while to load the reference and query hic). I removed all instances of the chr
string from the bedpe, fed it into:
chess sim ../../hic_links/dmso.allValidPairs.hic@5000 ../../hic_links/tx4h.allValidPairs.hic@5000 ../mm10_19_500kb_win_100kb_step.bedpe dmso_vs_tx4.tsv
and it now works. I get a result table as expected. However there were 37 out of 610 regions that failed, and this regions are mainly at the beginning and at the end of chromosome 19. So this makes me think that there may be a total read cutoff and thus why it may be failing in my loop calls? or maybe it is not just the distance of the loop anchor (20X bins) but also the actual size of the patch (submatrix) enclosed by the anchors that needs to be greater than a certain number of bins?
I will troubleshoot a little more. In the mean time, could you please recommend some optimal parameter for chess pair
for doing a genome wide scan on a 5kb matrix? This time I used 500kb window-size and 100kb step-size, but if you could offer any guidance for setting this parameters, I would appreciate it a lot.
from chess.
Also could you please clarify if the comparisons are being done with the raw values of the hic or normalized values? or could I specify which set of values to use by inputting something like control.hic@5000@KR ?
from chess.
The bedpe generated by chess sim
has the "chr" prefixes because this is the format at USCS, from where the chromosome size data is retrieved by default.
chess requires all matrices to have at least 20 x 20 "pixels". If the distance between your loop anchors (i.e. the difference between columns 2 and 3 in the bedpe file) is at least 20 * your bin size, the region is large enough. However, chess is more suitable for larger regions, we recommend spans >= 100 bins.
Chess also imposes a maximum fraction of unmappable / masked bins in a matrix, by default this is 0.1 (see --mappability-cutoff
in chess sim --help
). The masking value is 0. So, if you have small regions of 20 bins span, 2 unmappable bins in that region are enough for the region to be filtered out, which you see as nans in the results file. This is common in repetitive regions like the ends of chromosomes. If you really need a comparison in a part of the genome which is difficult to map reads to you have to increase the region size, and/or lower the mappability-cutoff, but in general I would recommend to avoid such comparisons.
You have to provide normalized matrices, e.g. Knight-Ruiz balanced. The comparisons are done on observed/expected transformed matrices. If your data is not in OE format, it is automatically transformed. This all goes through FAN-C. I believe that the normalized values are used automatically if present in the provided data, but I am not 100 percent sure here: @kaukrise might now better.
I cannot really tell you what parameters to use for chess pairs
, it depends what you are looking for, what the size of typical features in the species is etc. If this is a human or mouse genome, and you expect a median TAD size of 1mb, a good first run could be with 2-3mb window size. 250 kb step may be small enough. If you have 5kb resolution and want to focus on differences within smaller linear genomic spans, 500kb may be a good choice too. For now, chess really requires the user to play around a little, I hope that we might find some good rules for these parameters with more experience.
from chess.
This is quite helpful... I see, so the problem must be the 20x20 pixels, i.e. the 20X min is actually the size of the side sub-matrix, not the min distance of the loop. I was filtering my loop coordinates based on the distance between column 5 and column 2. So maybe running chess sim
for small areas of local enrichment (loops or corner peaks) is not ideal, since a 20x20 region will make a really large square... unless you think it may be worth expanding the loop regions in the bedpe file to encompass more bins, and then re-running chess sim
For the balanced matrices, I can use a cool file with all the normalization and balance I previously applied to it, but chess
takes a long time. So I switched to the .hic files, which have different resolutions, balancing, and O/E values. So I guess if I use a hic file, I can specify the resolution ( say 10Kb using myFile.hic@10000 ), and chess will use the O/E stored in this file... But it is my understanding that the .hic file stores different O/E values for each of the corrected values (KR, Vanilla coverage, etc)... At least that is what I see when I explore the .hic file in juicer... I think in FAN-C you can specify the correction by adding a @
such as myFile.hic@10000@KR ... I guess I can try it in chess... and wait.... to see if it changes anything.
This is extremely helpful. Thanks Nick!
from chess.
One more thing, it just occurred to me that I might be misunderstanding what you are trying to do: if you say that you want to compare loops, does that mean that you are trying to somehow feed off-diagonal regions to chess, for instance a small submatrix around a TAD corner peak?
If yes, then I am sorry: this is not what chess was designed for, and it will not work. The software was made to compare regions along the diagonal of HiC matrices between different datasets, which are specified just with the start bin and end coordinates (columns 2, 3 for the reference and 5 and 6 for the query).
from chess.
Yeah that's what I initially thought, I was hoping it could make some comparisons of these loops as well... and I think it does to some extend as long as they are large regions, but I understand that chess was not made to make this comparisons. Thanks again Nick.
Now I am doing a whole genome scan as you did in your paper. I am inputting myFile.hic@5000@KR
and so far chess has not complained about the extra @KR
, but it is still reading the file ... ... ...
from chess.
Related Issues (20)
- Get observed/expected from Juicer Hi-C HOT 2
- chess --version doesn't work?
- CNV bias in normalization HOT 2
- Conditions for conservation analysis of syntenic blocks HOT 5
- Nan Continued HOT 2
- No valid region pairs found? HOT 1
- Different resolution produce different result HOT 1
- Should the users be concerned about the problem raised in the new Contradictory Results bioRxiv preprint? HOT 2
- conservation analysis when only a few syntenic blocks are available HOT 3
- speed up the chess run HOT 1
- error of the chess extract HOT 3
- issue of normalized/chess extract HOT 1
- error on running chess sim HOT 2
- error when running extract on .hic files HOT 1
- something different from plotting HOT 9
- _pickle.PicklingError HOT 2
- chess extract error: operands could not be broadcast together with shapes HOT 1
- data_range parameter not specified - error HOT 7
- Chess sim output .tsv file explained HOT 1
- Normalization of .hic files HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from chess.