dendrograms / astrodendro Goto Github PK
View Code? Open in Web Editor NEWGenerate a dendrogram from a dataset
Home Page: https://dendrograms.readthedocs.io/
License: Other
Generate a dendrogram from a dataset
Home Page: https://dendrograms.readthedocs.io/
License: Other
There should be some nice way to lookup what statistic properties require what metadata items -- maybe this should be embedded into the docstring
I'm working on dendrogramming Tom Dame's CO datacube, which is much longer along one axis than the other two, and so the default Viewer performs poorly for imagecube slices with that aspect ratio.
As a temporary fix, I've forked astrodendro to implement my own "workaround" in the style of this gist. In the long term, it would be interesting if I could write my own Viewer classes that extend BasicDendrogramViewer and can be used in place of it.
As an even better "bonus", it would be wonderful if the extensibility allowed for me to put in multiple "image" panels, so that (e.g.) I could see both an (l, b) and an (l, v) slice at the same time, or even add a GUI button that let me toggle between the two.
This is something that I'd be happy to help develop over the coming weeks, and could probably submit a pull request with my own idea of how to "extensibilize" BasicDendrogramViewer, in it if that would be helpful.
Just to check, are we happy with the name astrodendro
?
Alternatives are for example dendrogram
or dendrograms
or dendro
, with dendrogram.astro
containing astronomy-specific routines and allowing people from other fields to contribute? Alternatively, we could even separate the astronomy-specific stuff to another repository, but then users have to install two packages, so not ideal.
Currently, the DendrogramPlotter code is susceptible to recursion limits, so we need to make the code smarter to avoid this.
Ignore; I hadn't pulled from the right master. Sorry.
This is related to the PR I submitted on the old version of astrodendro, but basically: I typically want to run dendrograms on 2D maps to see, as a function of contour level, how much mass is in that contour. It's a very convenient way to make density-threshold based cutoffs (see, e.g., Jens Kauffman's recent papers).
Anyway, to do this, you need some tree-walking code, which I don't think is presently in astrodendro. Here's a sample that I used:
def get_mostest_children(item, props=('npix','f_sum'), mostest='f_sum'):
"""
Return certain properties of all children selecting the mostest
of something (e.g., brightest)
"""
item.get_dens = lambda: item.values().sum()/item.get_npix()**(1.5)
item.get_f_sum = lambda: item.values().sum()
if item.is_leaf:
d = dict([(p,[get(item,p)]) for p in props])
d['branch'] = [item]
return d
else:
brightest = item.children[0]
brightest.get_dens = lambda: brightest.values().sum()/item.get_npix()**(1.5)
brightest.get_f_sum = lambda: brightest.values().sum()
for child in item.children[1:]:
child.get_dens = lambda: child.values().sum()/item.get_npix()**(1.5)
child.get_f_sum = lambda: child.values().sum()
if get(child,mostest) > get(brightest,mostest):
brightest = child
brightest_props = get_mostest_children(brightest)
d = dict([(p,[get(item,p)] + brightest_props[p])
for p in props])
d['branch'] = [item] + brightest_props['branch']
return d
(where I think get
is a matplotlib-only convenience function...).
I'm not submitting this as a PR yet, in part because I think there are some "convenience functions" that ought to be implemented, e.g.:
Structure.get_sum(self): return self.values().sum()
and maybe some mechanism for computing other values, e.g. the density as I've done above, for a given branch/tree.
I've used this approach effectively already:
When I compute a dendrogram of the CO survey, using the syntax
>>> from astrodendro import Dendrogram
>>> d = Dendrogram.compute(array)
after using astropy.io.fits.getdata to extract the CO datacube, my dendrograms always are displayed with pixel values instead of (l, b, v) coordinates. I'm guessing this is at least partly because I'm passing the data in as a pure ndarray instead of giving the whole FITS object, hence the dendrogram doesn't know anything about the scaling factors.
example image in (l, v) space, where (l) should run from 180->0->360->180 from left-to-right:
Is there a good way to get the coordinate scaling back into the dendrogram so that I can show plots in (l, b, v)?
A feature I would find useful is the ability to pass a list of seed pixels to compute
. The dendrogram would then be constructed in such a way that each leaf contains exactly one seed pixel. This would be helpful for, e.g., resolution studies.
It would be nice to extend an example using e.g.. pp_catalog
which would show an overlay of the structures found with ellipses showing the principal axis, and the major/minor axis.
In the one-line description of this repo there's a link to the dendrograms.org website:
Generate a dendrogram from a dataset
https://dendrograms.org
but the link seems to be broken unless I change "https" -> "http" and try
http://dendrograms.org
as the other links on the homepage use. I tried a couple proxies and it seems to be not just me, dendrograms.org doesn't respond to https.
This is small potatoes but I figured I'd let someone know.
This is a simple question that may end up being deceptively difficult to handle.
I am working with a datacube that spans galactic longitude 0->360 and technically should "wrap around" on the edges. The dendrogram software doesn't know this, and so it naturally treats the left and right edges of the data unconnected. Hence, the red and blue structures in the following image correspond to different trees in the dendrogram -- but if I were to "rotate" the same data by 180 degrees and put the boundary at the galactic center, not anticenter, then they would match together in a merged tree.
Is it possible to compute a dendrogram where the data wraps at the edges? Could this be added as an optional input when making a dendrogram, so that I could specify e.g., wrapx=True
and then have these structures be properly merged?
Hi,
When I save a dendrogram object to a file, and then load it back, I have the following bug. I call up the viewer, then click on the intensity map, and get these errors:
In [8]: d
Out[8]: <astrodendro.dendrogram.Dendrogram at 0x106435f90>
In [9]: d.save_to('saved_dendrogram_test.hdf5')
In [12]: d2 = astrodendro.Dendrogram.load_from('saved_dendrogram_test.hdf5')
In [13]: d2.viewer()
# Then I interactively click on the map in the viewer and the following occurs:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/Users/tsrice/anaconda/lib/python2.7/site-packages/matplotlib/cbook.pyc in process(self, s, *args, **kwargs)
525 del self.callbacks[s][cid]
526 else:
--> 527 proxy(*args, **kwargs)
528
529
/Users/tsrice/anaconda/lib/python2.7/site-packages/matplotlib/cbook.pyc in __call__(self, *args, **kwargs)
403 mtd = self.func
404 # invoke the callable and return the result
--> 405 return mtd(*args, **kwargs)
406
407 def __eq__(self, other):
/Users/tsrice/Documents/Code/astrodendro/astrodendro/viewer.pyc in select_from_map(self, event)
219 # Select the structure
220 structure = self.dendrogram.structure_at(indices)
--> 221 self.hub.select(input_key, structure)
222
223 # Re-draw
/Users/tsrice/Documents/Code/astrodendro/astrodendro/viewer.pyc in select(self, id, structure)
51 self.selections[id] = structure
52 for cb in self._callbacks:
---> 53 cb(id)
54
55
/Users/tsrice/Documents/Code/astrodendro/astrodendro/viewer.pyc in _on_selection_change(self, selection_id)
178 def _on_selection_change(self, selection_id):
179 self._update_lines(selection_id)
--> 180 self.update_contours()
181
182 def update_vmin(self, vmin):
/Users/tsrice/Documents/Code/astrodendro/astrodendro/viewer.pyc in update_contours(self)
312 if struct is None:
313 continue
--> 314 mask = struct.get_mask(subtree=True)
315 if self.array.ndim == 3:
316 mask = mask[self.slice, :, :]
/Users/tsrice/Documents/Code/astrodendro/astrodendro/structure.pyc in get_mask(self, shape, subtree)
476 """
477 if shape is None:
--> 478 shape = self._dendrogram.data.shape
479 indices = self.indices(subtree=True) if subtree else self.indices
480 mask = np.zeros(shape, dtype=bool)
AttributeError: 'NoneType' object has no attribute 'data'
I assume this is a bug in how the dendrogram object is reconstructed when loading from a file, or perhaps the .viewer()
method should have a way of handling this case?
Currently, the set of Structure.idx
values is a sparse subset of range(1, data.size). For the index code I am working on, it would be much simpler if they were just range(1, n_structure)
.
@astrofrog, is the particular value of Structure.idx
used for anything right now? Would anything break if we switched schemes?
It would be nice to allow a WCS object to be passed for the data, and be able to return the world coordinates of elements inside structures - e.g. the world coordinates of all voxels inside a leaf from a 3-d cube. This could use a different attribute to what would normally be used to get the pixel coordinates (e.g. leaf.world_coords
instead of leaf.coords
, or something like this).
Dendrogram.index_map
gives the structure ID at each pixel. This map is created with a 1 pixel border within compute
, for edge-checking reasons. However, this border should probably be removed at the end of compute
. Otherwise, the node_at
behavior is broken.
GitHub will close the milestone if all issues are completed, so this ensures that the milestone stays open until the package is actually released. Leave any reminders here about anything release-related.
If there are specific pruning functions that would be generically useful to use for Dendrogram.compute(is_independent)
, we should provide them. Some ideas:
It would be cool if these were composable, so that I could write something like
Dendrogram.compute(x, is_independent = min_peak(3) & min_sum(10))
And that would build an object that runs both tests. I'm not sure if &
, and
, or +
makes the most sense to use.
The repository name doesn't currently reflect the name of the package, which may be confusing - shall I rename the repository name to astrodendro
? (since in the end, we included the analysis and visualization code in the same package?)
For statistics which are e.g. angles, it makes sense to return Astropy Quantity objects. However, for statistics in pixel units, I wonder whether maybe we should return simple Python floats rather than Quantity objects with u.pix
objects. Otherwise we end up with this kind of problem: #71 - we can still have the heading in the catalog table be set to pix
automatically, which is what this was intended for.
Now that we have the is_independent
option available, it makes more sense to implement the "built-in" pruning criteria using this. We could keep the keyword arguments if that is more user friendly, but the test inside Dendrogram.compute
should just be a single call to an is_indendent
function.
The PR in #111 has me thinking about how to ensure that dendrograms remain in a valid state as they are constructed, and when they are modified after construction. It would be useful to have a check_invariants(dendrogram)
method that asserts a bunch of properties that should never be violated (that the index map is consistent with the structures, that structure.parent and structure.children are always consistent, that vmax/vmin always increase from trunk to leaf, etc).
This method would probably be expensive to compute, so it should be tied to a DEBUG global variable that is False by default, but can be enabled by the unit tests or when working on new features. Then we should put calls to check_invariants
inside the loops of compute
, build_index
, and post_pruning
When I'm working on the dendrogram of Tom Dame's CO survey, I often want to compare the structures identified by the dendrogram with the structures that were highlighted in these integrated maps: http://www.cfa.harvard.edu/mmw/MilkyWayinMolClouds.html
but the Viewer constrains me to look only at single slices (in velocity or latitude). Assuming that integrating along a single axis is a well-defined operation, would it be difficult to add an option that shows an "integrated" image rather than a single slice? The dendrogram computation should still use the full datacube, which is why I wouldn't want to feed Dendrogram an already-integrated array.
This may tie in neatly with Issue #94, wherein the best way to do this would be to write custom Viewers that take care of this.
I don't see Structure.fill_footprint
as having much utility outside the creation of the index map in Dendrogram.compute
. Perhaps this should be a private Dendrogram method?
This is related to #4 - once we have WCS support, it would be trivial to then accept astropy NDData
objects, which can carry around data with WCS and other meta-data.
As well as having 'standard' criteria for merging two structures (minimum number of pixels, etc.) it would be nice to allow users to be able to specify their own function based on their own criteria.
@bradenmacdonald @ChrisBeaumont - what do you think about this?
Can we use the same kind of approach as in height
to find the 'background' of a structure and compute a background-subtracted flux? (in addition to a raw total flux)
Given the complexities involved with the K
units, we could consider providing a utility function that pre-converts the data from K
to less ambiguous units that can be then used by the analysis tools.
The coveralls link still only works for the old name dendro-core
https://coveralls.io/r/dendrograms/dendro-core
As a result, the icon on the README is broken
As discussed in #20, part of the infrastructure in the Structure
class (e.g. the tree index) relies on the dendrogram being immutable after the computation. One of the suggestions we mentioned there was to split Structure
into two classes - one representing mutable structures that are used for the initial computation, and one that is used for the final structure, which is efficient but immutable.
One question that I'm wondering now is that if we do this, the tree can no longer be pruned by the user after the fact, so is that what we want?
If we decide we want an alternative that leaves the possibility of having a flexible tree after the computation, we can use the hash of the index map to check whether the tree has changed - Numpy arrays can be hashed, so this is trivial. If the hash of the index map changes, all cached variables (including the tree index) would get reset.
The luminosity calculation should take into account that flux units might be per pixel or per unit area, and I don't think we should default to a given pixel scale since there is no sensible default - we should just raise an exception.
I've started a wiki page discussion how to port over the "levelprops" functionality from IDL to python. @low-sky's input would be valuable here, since he best understands what quantities need to be extracted, and how they should be computed.
https://github.com/dendrograms/dendro-core/wiki/Structure-analysis-API-proposal
cc @astrofrog
@bradenmacdonald wrote a lot of visualization code, and we initially decided that this should stay out of the core package, especially the GTK viewer, which makes sense. However, I think that it would make sense to have a static Matplotlib-based viewer just to plot the basic dendrogram, and this won't add much code to the core-package. It will only require Matplotlib as an optional dependency when that method is called. Does that sound sensible? If so, I can start working on this.
At the moment, the warnings for the catalog generation are very verbose and might scare users a little:
https://dendrograms-astrofrog-v2.readthedocs.org/en/latest/catalog.html
We should try making them more concise, e.g.
WARNING: bmaj (Beam major axis, sigma) missing, defaulting to 0 [astrodendro.analysis]
As indicated in #33, it would be nice to have a tutorial on using this with APLpy to show world coordinates.
@ChrisBeaumont - I think it would be good to add a simple example dataset to the docs that we can use for all the examples (we can then use the Matplotlib sphinx directive to generate plots for the docs rather than having to include static plots). Do you have any interesting and small datasets that can be public? The extinction map of Perseus? Or anything else? Otherwise, I can check on my side.
I am in favor of the following interface tweaks. None are super important, so feel free to veto any, @astrofrog
Dendrogram.all_nodes
min_data_value
, min_delta
, min_npix
into a Dendrogram.params
dict (this is the style that scikit-learn uses)Structure.merge_level
Structure.height
to suggestion #25 (comment)Dendrogram.__getitem__
to fetch from node dictIn the analysis tools, I think we should rename dx
to scale
as it is a more explicit name.
(see, I'm trying to not put everything in a single PR! ;) Easier to keep track of separate discussions this way...)
The I/O utilities should take advantage of the new TreeIndex class
If anyone has a chance to help me understand the following performance issues I've been running into, I'd be grateful.
On one particular machine, I've been running into issues when calculating ppv_catalog
on dendrogrammed data of modest to large size. (In particular, the file I'm running into trouble on is hosted here: http://www.cfa.harvard.edu/rtdc/CO/NumberedRegions/DHT21/DHT21_Taurus_mom.fits and is around 10 MB in size)
In [6]: d, catalog, x, y = perseus_analysis() # perseus_analysis loads the data, computes a dendrogram, and then calls ppv_catalog on the dendrogram using appropriate metadata.
loading data: ...
transposing, downsampling, and unit-converting data: ...
computing dendrogram: ...
Generating dendrogram using 307,278 of 5,296,671 pixels (5% of data)
[========================================>] 100%
# the above dendrogram computation takes 30 - 60 seconds using the following parameters:
# min_value=0.01
# min_delta=0.005
# min_npix=20
# then the code hangs while running ppv_catalog
Based on the tracebacks I'm seeing, the resulting dendrogram seems to have around 100,000 indexed structures. The dendrogram has 1029 structures, and for some of those structures, the struct.indices has over 100,000 items.
When I hit ctrl-c after about five or so minutes, the interrupt typically hits one of the following lines of code (below examples chosen as illustrative) :
# line number one
> /Users/tsrice/Documents/Code/astrodendro/astrodendro/analysis.py(65)mom1()
64 m0 = self.mom0()
---> 65 return [np.nansum(i * self.values) / m0 for i in self.indices]
66
# line number two
/Users/tsrice/Documents/Code/astrodendro/astrodendro/analysis.pyc in mom2(self)
77
78 for i in range(nd):
---> 79 result[i, i] = np.nansum(v * zyx[i] ** 2)
80 for j in range(i + 1, nd):
81 result[i, j] = result[j, i] = np.nansum(v * zyx[i] * zyx[j])
# line number three
/Users/tsrice/Documents/Code/astrodendro/astrodendro/analysis.pyc in mom2(self)
79 result[i, i] = np.nansum(v * zyx[i] ** 2)
80 for j in range(i + 1, nd):
---> 81 result[i, j] = result[j, i] = np.nansum(v * zyx[i] * zyx[j])
82
83 return result
The versions of my python packages:
Python 2.7.6 |Anaconda 2.0.0 (x86_64)| (default, May 27 2014, 14:58:54)
IPython 2.1.0 -- An enhanced Interactive Python.
In [4]: np.__version__
Out[4]: '1.8.1'
In [5]: astropy.__version__
Out[5]: '0.3.2'
Some questions:
We had mentioned maybe renaming sky_major_sigma
and sky_minor_sigma
- I just realized just now that maybe the sky
is not needed - if one is just looking at a collapsed simulation, the prefix sky
is not really relevant.
To explain things like delta
with pictures
At the moment, the viewer implemented in #29 is quite slow when sliders are changed - this is because the slider automatically calls fig.canvas.draw
, which re-draws the whole figure. It's possible to improve performance by making use of draw_artist
to selectively re-draw only elements that are changing, though this requires re-implementing a slider class to allow this. I've done this all locally in a hacky way, but will try and tidy it up and submit a patch here.
Unfortunately, draw_artist
doesn't work properly with the MacOS X backend, so it will have to be a solution for other backends only.
GitHub will close the milestone if all issues are completed, so this ensures that the milestone stays open until the package is actually released. Leave any reminders here about anything release-related.
Anticipating that the astrodendro
routines will be ultimately used as the core functionality for cataloging of structures based in python, it will become important to be able to specify the indices of the leaves that will define the dendrograms. Specifically, this is necessary functionality to recompute the dendrogram after filtering and pruning.
This is replicating the functionality lost by dropping friends
and specfriends
from the base API. I envision a use case where the dendrogram is computed, the leaves are then filtered based on delta
or minimum/maximum separation in pixel space.
I think it would be okay if this functionality can insist that these leaves are a subset of an original dendrogram leaves.
We should adopt scikit-learn's idea of putting the Perseus image and other files (like the scuba image referenced in the docs) into a data
submodule, with easy loader functions. This would make it easier to run the examples in the docs
Hi,
I was working on getting integrated fluxes of PPV structures in a datacube using ppv_catalog
. The unit handling for summing up the flux when the input units are (u.Jy/u.beam)
seems to correctly handle integrating over 2d spatial units to produce u.Jy
output, but the integration over the velocity axis should also contribute a velocity units term (usually I use u.km/u.s
) to the computed flux (producing u.Jy * u.km / u.s
) in order to be a physically meaningful sum over all 3 axes.
Currently I am doing a clumsy workaround that looks like this:
catalog = astrodendro.ppv_catalog(d, metadata)
flux_kms = catalog['flux'] * metadata['velocity_scale']
but it seems to me that the PPVStatistic
and compute_flux
code should be updated to correctly deal with this use case. Do you agree?
Hi,
I run dendrograms on an integrated intensity map from JVLA successfully, which has two degenerate coordinates in the header (Stokes=1, and Frequency=1). The size of the array is:
In [14]: array.shape
Out[14]: (1, 1, 1024, 972)
However, the dendrogram viewer does not know how to handle this image:
In [9]: v = d2.viewer()
ERROR: ValueError: Only 2- and 3-dimensional arrays are supported at this time [astrodendro.viewer]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-3d0d5ce782bb> in <module>()
----> 1 v = d2.viewer()
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/astrodendro-0.1.0-py2.7.egg/astrodendro/dendrogram.pyc in viewer(self)
454 """
455 from .viewer import BasicDendrogramViewer
--> 456 return BasicDendrogramViewer(self)
457
458
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/astrodendro-0.1.0-py2.7.egg/astrodendro/viewer.pyc in __init__(self, dendrogram)
10
11 if dendrogram.data.ndim not in [2, 3]:
---> 12 raise ValueError("Only 2- and 3-dimensional arrays are supported at this time")
13
14 self.array = dendrogram.data
ValueError: Only 2- and 3-dimensional arrays are supported at this time
I had to go back into CASA and export the data without the degenerate axes, however, since the 2 extra dimensions in the array have length=1 I think that this should be accepted .
Cheers,
Jaime
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.