
Comments (6)

axelcournac commented on August 16, 2024

Hi Vittorio,
Thanks for your feedback. It seems that your cool file contains multiple resolutions. To use a single resolution with chromosight, you can do something like this:

chromosight detect --pattern=loops_small \
    --threads=10 \
    --min-dist=15000 \
    --max-dist=2000000 \
    4DNFI81RQ431.mcool::/resolutions/10000 \
    out_4DNFI81RQ431_10kb_loops_small

where ::/resolutions/10000 is the resolution you want to use
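
As a small illustration of the URI form used above, here is a minimal Python sketch that builds and splits such a string. The helper names `multires_uri` and `split_uri` are hypothetical and are not part of chromosight or cooler; the sketch only does string handling and does not open the file.

```python
from typing import Tuple


def multires_uri(path: str, resolution: int) -> str:
    """Build a URI pointing at one resolution group of a multi-resolution file."""
    return f"{path}::/resolutions/{resolution}"


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (file path, group path inside the file)."""
    path, sep, group = uri.partition("::")
    # Assumption: a path without "::" refers to data stored at the file root.
    return path, (group if sep else "/")


uri = multires_uri("4DNFI81RQ431.mcool", 10000)
print(uri)             # 4DNFI81RQ431.mcool::/resolutions/10000
print(split_uri(uri))  # ('4DNFI81RQ431.mcool', '/resolutions/10000')
```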

from chromosight.

cmdoret commented on August 16, 2024

Hi @Mestizia,

Thanks for reporting this issue, and for your thorough investigation!

To answer your question, max_dist represents the longest interaction distance that will be retained throughout the analysis (this is done to reduce compute time and memory usage).

We indeed rely on the bin-size information to handle conversion between base-pairs and #diagonals, therefore chromosight is not expected to work with cool files that have a variable bin-size.

We should either:

  • Fail instantly with an informative error message when that is the case.
  • Allow specifying the resolution if not in the cool file
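
A minimal sketch of the first option, assuming the bin size is read as either an integer or `None`. The function `bp_to_bins` is a hypothetical helper for illustration, not chromosight's actual code; it performs the base-pair to diagonal conversion and fails early with a readable message when no fixed bin size is available.

```python
from typing import Optional


def bp_to_bins(distance_bp: int, binsize: Optional[int]) -> int:
    """Convert a genomic distance in base pairs to a number of bins (diagonals)."""
    if binsize is None:
        # Without a fixed bin size the conversion is undefined, so stop here
        # instead of hitting a TypeError deep inside the analysis.
        raise ValueError(
            "This cool file reports no fixed bin size (variable-length bins); "
            "a fixed-resolution matrix is required."
        )
    return int(distance_bp // binsize)


# A max-dist of 2 Mb at 10 kb resolution spans 200 diagonals.
print(bp_to_bins(2_000_000, 10_000))  # 200
```

The second option could reuse the same helper by passing a user-supplied bin size in place of the missing one.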

If you feel like it, you are very welcome to propose a pull request for this and we will happily review it! Otherwise, we can handle this when time allows.


Mestizia commented on August 16, 2024

Thanks for the prompt reply to both of you.
@axelcournac Unfortunately I spoke too soon. The tool crashed anyway on a later step when the binsize comes back into play (see further down).
While I can generate fixed bin sizes with cooler zoomify and specify the resolution as described in your reply, the file still shows bin-type: "variable" and bin-size: null, even though I am running cooler info on the specific resolution.
I am assuming that zoomify generates multiple fixed bin-size resolutions as described in the manual, but this is not reflected in the cooler info output:
https://cooler.readthedocs.io/en/latest/schema.html#multi-resolution
A multi-resolution cooler file that contains multiple “coarsened” resolutions or “zoom-levels” derived from the same dataset. Multires cooler files should store each data collection underneath a group called /resolutions within a sub-group whose name is the bin size (e.g, XYZ.1000.mcool::resolutions/10000). **If the base cooler has variable-length bins, then use 1 to designate the base resolution, and the use coarsening multiplier (e.g. 2, 4, 8, etc.) to name the lower resolutions**. Conventional file extension: .mcool.
It may be wishful thinking but the wording seems to indicate that "non base resolution" files have fixed binsize.

Traceback (most recent call last):
  File "/home/vtracann/miniconda3/envs/chromosight/bin/chromosight", line 10, in <module>
    sys.exit(main())
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/cli/chromosight.py", line 1000, in main
    cmd_detect(args)
  File "/home/vtracann/miniconda3/envs/chromosight/lib/python3.10/site-packages/chromosight/cli/chromosight.py", line 807, in cmd_detect
    separation_bins = int(cfg["min_separation"] // hic_genome.clr.binsize)
TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'

Since I can determine the binsize from my previous work, my temporary solution would be to set the binsize to that specific value. From what I can tell, the bin size is used in two ways: as a denominator, where the resulting value ends up small enough that the tool automatically resorts to defaults of 1, or as a numerator when it comes to "scanning distance", where the number ends up being large. Computationally, this is not a problem on my end and I don't need heuristics to speed it up (smaller scanning distance, if I am interpreting it correctly).

Do you think this temporary solution would introduce artifacts in the data?

Once this is clarified, I am up for making a pull request (with slightly cleaner solutions). Generally I would commend you and the team for the clean code. It was quite easy to navigate and debug on my end! I could also see this working better if cooler itself was able to tell when a variable binsize cool file is set to fixed binsize. I briefly looked in their repository but I couldn't find an immediate solution.

Cheers,
Vittorio


cmdoret commented on August 16, 2024

If you use cooler zoomify on a file with variable bin size, the resulting bin sizes are still variable; it only pools them by a factor of 2. What we mean by "fixed" is that each bin represents a segment of the same length on the chromosome.

If your cool file has variable bin size, e.g. 1 bin = 1 restriction fragment, even if you zoomify it (1 bin = 2, 4, 8 ... restriction fragments), there would still not be a constant mapping between #bins and #basepairs. Thus many of our assumptions would fail and the calls would be unreliable. This is why we do not support variable bin sizes.
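
To make this concrete, here is a toy Python sketch with invented fragment coordinates. It is a simplified model of pooling neighbouring bins, not cooler's coarsening implementation: merging bins in pairs changes their widths but does not make them equal.

```python
# Hypothetical restriction-fragment bins on one chromosome, as (start, end) in bp.
fragments = [(0, 1200), (1200, 1500), (1500, 6100),
             (6100, 6400), (6400, 9000), (9000, 9800)]


def coarsen(bins, factor):
    """Merge each run of `factor` consecutive bins into a single bin."""
    return [
        (bins[i][0], bins[min(i + factor, len(bins)) - 1][1])
        for i in range(0, len(bins), factor)
    ]


def widths(bins):
    """Length in base pairs of each bin."""
    return [end - start for start, end in bins]


pooled = coarsen(fragments, 2)
print(widths(fragments))  # [1200, 300, 4600, 300, 2600, 800]
print(widths(pooled))     # [1500, 4900, 3400]
```

In both cases a fixed number of bins covers a different number of base pairs depending on where it falls, so a distance given in base pairs has no single equivalent in bins.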

You can of course use this hack, if you know your matrix has a fixed resolution, but the clean solution is to build your matrix on a fixed bin size from the start, and it should be reflected in the metadata of the cool file.


Mestizia commented on August 16, 2024

@cmdoret Thank you for the clarification. I can look at the distribution of my bin lengths to verify that the spread is not significant. I suggest formally including a step in chromosight that checks for variable bin size and explains the reasoning.
Are there plans in the future to support variable bin size or is it a "limitation" due to the algorithms you use?
As far as I can tell, fixed bin size is not a characteristic of the data (due to the semi-random nature of restriction sites occurrence)
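
The bin-length check mentioned above could be sketched as follows, assuming the bins are available as (start, end) pairs. The function names and the 1% tolerance are arbitrary choices for illustration, not something chromosight or cooler defines.

```python
from statistics import mean, pstdev


def binsize_spread(bins):
    """Return (mean width, coefficient of variation) for (start, end) bins."""
    w = [end - start for start, end in bins]
    mu = mean(w)
    return mu, pstdev(w) / mu


def looks_fixed(bins, tolerance=0.01):
    """True if bin widths vary by no more than `tolerance`, relative to the mean."""
    _, cv = binsize_spread(bins)
    return cv <= tolerance


fixed = [(0, 10000), (10000, 20000), (20000, 30000)]
variable = [(0, 1200), (1200, 1500), (1500, 6100)]
print(looks_fixed(fixed))     # True
print(looks_fixed(variable))  # False
```

One caveat under the same assumptions: even in a fixed-resolution matrix the last bin of a chromosome is usually shorter, so a real check would probably need to ignore it.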

With that being said, you can close this issue. Many thanks to both of you. On the side, I will also try running chromosight with a fixed binsize matrix and will write a report on how similar or different the results are in the context of the genome I am working with.


cmdoret commented on August 16, 2024

Thanks for the feedback and suggestions @Mestizia! :) I've opened #67 to keep track of this.

It would in theory be possible to support variable bin size, but I might be forgetting something.
