aimalz / qp
Quantile Parametrization for probability distribution functions module
License: MIT License
@aimalz I'll do this - it'll involve moving all our docs to a 'docs' folder and then getting readthedocs working. Feel free to ask questions about each line in your code review! :-)
qp should be able to ingest PDFs in the format of a function evaluated at points on a grid, just as it can ingest samples, bin values, quantiles, and parametric "truth" functions.
I’d suggest prioritizing somewhat different metrics than those being looked at now: in particular, the moments of the PDF are probably the most important things. In general, those moments will determine how errors in the PDFs propagate into cosmology measurements. There are plenty of analyses looking at the first couple of moments (corresponding to tests of bias in mean z and spread), but I would probably look at the 3rd and 4th moments too, at least.
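For concreteness, the first few moments of a gridded p(z) can be computed directly by quadrature. This is a sketch only; the function name and the gridded representation are my own, not part of qp:

```python
import numpy as np

def pdf_moments(z_grid, p_grid, n_moments=4):
    """Mean, then central moments 2..n_moments, of a PDF evaluated on a grid."""
    norm = np.trapz(p_grid, z_grid)  # guard against unnormalized input
    mean = np.trapz(z_grid * p_grid, z_grid) / norm
    moments = [mean]
    for k in range(2, n_moments + 1):
        moments.append(np.trapz((z_grid - mean) ** k * p_grid, z_grid) / norm)
    return moments
```

The second moment corresponds to the spread test, and the third and fourth capture skewness and tailedness of the error distribution.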
In response to #23, a collection of samples should be another way to initialize a PDF object, and there should be a function to generate samples from any of the other representations.
Hey @aimalz - Looks like there could be some inconsistencies in the KLD notebook, between the precision and tension parts, and then the math and the plots.
Some suggestions:
Go back and clean up the precision derivation to make it much more concise, but double check the steps to make sure the formula is correct. There could be an error of a factor of 2 in there - and maybe some r vs 1/r confusion.
Derive the full formula in terms of r and t first, and then simplify it to the two cases (zero-tension precision, r=1 tension) to recover the simpler approximate models.
Overlay the full analytic formula on both plots as dotted black lines, as well as the approximate models in wide gray lines.
Then, I think we could add some insight in the conclusions, by pointing out where the two limits (zero-tension precision, r=1 tension) might occur in practice. I guess the former is more like the photo-z PDF characterization problem? And the latter more like the case where an unknown systematic afflicts your cosmological parameter central value but not its uncertainty?
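For reference, the exact KLD between two Gaussians is compact enough to write down; in the notation I'll assume here (r the ratio of widths, t the offset in units of the reference width - check these conventions against the notebook's), the full formula and its two limits are:

```latex
% KLD of P = N(\mu_P, \sigma_P^2) from Q = N(\mu_Q, \sigma_Q^2):
D(P\,\|\,Q) = \ln\frac{\sigma_Q}{\sigma_P}
            + \frac{\sigma_P^2 + (\mu_P - \mu_Q)^2}{2\sigma_Q^2}
            - \frac{1}{2}
% With r \equiv \sigma_P/\sigma_Q and t \equiv (\mu_P - \mu_Q)/\sigma_Q:
D(P\,\|\,Q) = -\ln r + \frac{r^2 + t^2 - 1}{2}
% Limits: t = 0 gives the zero-tension precision case, -\ln r + (r^2 - 1)/2;
% r = 1 gives the pure-tension case, t^2/2.
```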
@aimalz Once you have closed #34 by merging #37 you should make our first "release", here:
https://github.com/aimalz/qp/releases
I'd suggest "v0.1 beta" as a reasonable name - and you can edit the release text once you have made it. Probably this should be a "pre-release" as well - we can save the "release" name for v1.0, after we've done an investigation of the quantile idea. What do you think? This could be the last issue of our "Launch" milestone :-)
With unit tests! :-)
I'm collecting things I really need to fix before a proper code release.
- Restructure the PDF class around scipy.stats distribution objects (scipy.stats.rv_continuous, etc.), with consistent syntax for conversions to replace the using keyword; this opens the door for implementing SparsePZ, FlexZBoost formats
- Deprecate qp.PDF.truth in favor of using mix_mod and qp.PDF.first
- Rename the histogram parametrization to something like piecewise constant or stepfunc to reduce confusion and make the piecewise constant distribution consistent in interpolation, integration, etc.
- Rework the use of infty for converting quantiles to other parametrizations
- Move qp.Ensemble.stack() and other "derived" functions to utils.py instead of within the class and publish demos for the "derived" use cases
- Replace qp.PDF.[insert parametrization format here] with a cleaner dict object

This is a continuation of @drphilmarshall's suggestion in #30.
This needs its own branch, so it merits its own issue.
#43 will benefit from parallelizing the handling of individual qp.PDF objects, and it shouldn't be hard to implement.
The integrator function of the qp.PDF class has been blank for a long time and is necessary for the QQ plots for the LSST-DESC PZ DC1 paper. Now is the time to implement it!
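A minimal sketch of what the integrator could do for a gridded PDF: build an interpolator over the stored evaluation points and hand it to scipy's quadrature. The grid, distribution, and function name below are stand-ins for illustration, not the qp API:

```python
import numpy as np
from scipy import integrate, interpolate

# stand-ins for the object's stored evaluation points
z_grid = np.linspace(0.0, 2.0, 201)
p_grid = np.exp(-0.5 * ((z_grid - 0.8) / 0.15) ** 2)
p_grid /= np.trapz(p_grid, z_grid)  # normalize on the grid

# zero outside the grid, so no probability is invented beyond the support
interpolator = interpolate.interp1d(z_grid, p_grid,
                                    bounds_error=False, fill_value=0.0)

def integrate_pdf(limits):
    """Integrate the interpolated PDF between `limits` (as needed for QQ plots)."""
    result, _ = integrate.quad(interpolator, limits[0], limits[1], limit=500)
    return result
```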
Code fails on import when using python3.
Suggest adding

from __future__ import absolute_import, division, print_function

to the top of all python files to flush out the current code's incompatibilities, then fixing the issues that arise. This should result in code that is cross-compatible with python2 and python3.
Please and thank you!
The update to fill_value and extrapolation in the initialization of using='gridded' PDFs (added, I believe, because of the default of NaN outside the range of support that I raised in an earlier issue) may cause problems with the normalization of the PDF if the range over which you normalize extends beyond the grid points. The line in question, in qp.pdf.approximate:
self.interpolator = spi.interp1d(x, y, kind=self.scheme, fill_value="extrapolate")
An example of the issue: I created a PDF ensemble object for a set of galaxies with p(z) evaluated on np.arange(0.005,2.11,0.01), so the lowest redshift point is 0.005, not 0.0. The normalization in the initialization:
self.gridded = qp.utils.normalize_integral(qp.utils.normalize_gridded(gridded, vb=vb))
normalizes between zmin and zmax. However, if the p(z) value at z=zmin is non-zero, qp.pdf.approximate will extrapolate a non-zero value at z=0.0, so if you integrate qp.pdf.integrate(limits=(0.0,2.0)) you get a value >1.0, because approximate has extrapolated non-zero probability at z=0.0.
Is it smarter to set fill_value=0.0, rather than "extrapolate" in interp1d? That seems to be the more sensible choice here, since presumably the PDF is zero outside the tested range over which it was initially normalized. bounds_error=False will also need to be set. This fix works on my local copy of qp.
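A minimal reproduction of the difference between the two fill_value choices (the grid and the distribution here are made up for illustration, mirroring the setup described above):

```python
import numpy as np
from scipy import interpolate as spi

z = np.arange(0.005, 2.11, 0.01)             # grid starts at 0.005, not 0.0
p = np.exp(-0.5 * ((z - 0.1) / 0.2) ** 2)    # p(z) clearly non-zero at z = zmin
p /= np.trapz(p, z)                          # normalized over the grid only

extrap = spi.interp1d(z, p, kind='linear', fill_value="extrapolate")
zeroed = spi.interp1d(z, p, kind='linear', bounds_error=False, fill_value=0.0)

zz = np.linspace(0.0, 2.0, 8001)             # integration range extends below zmin
int_extrap = np.trapz(extrap(zz), zz)        # > 1: invented mass below z = 0.005
int_zeroed = np.trapz(zeroed(zz), zz)        # ~ 1: no probability outside the grid
```

The zeroed version keeps the integral consistent with the grid normalization, which supports setting fill_value=0.0 with bounds_error=False.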
The LSST DM "Data Products Definition" document, LSE-163, has the following to say about photo-z storage:
Colors of the object in “standard seeing” (for example, the third quartile expected survey seeing in the i band, ∼ 0.9”) will be measured. These colors are guaranteed to be seeing-insensitive, suitable for estimation of photometric redshifts. (Footnote: The problem of optimal determination of photometric redshift is the subject of intense research. The approach we’re taking here is conservative, following contemporary practices. As new insights develop, we will revisit the issue.)
We currently plan to provide [full photo-z posterior PDF] information ... by providing parametric estimates of the likelihood function. As will be shown in Table 4, the current allocation is ... ~100 parameters for describing the photo-Z likelihood distributions, per object. The methods of storing likelihood functions (or samples thereof) will continue to be developed and optimized throughout Construction and Commissioning. The key limitation, on the amount of data needed to be stored, can be overcome by compression techniques. For example, simply noticing that not more than ∼ 0.5% accuracy is needed for sample values allows one to increase the number of samples by a factor of 4. ... Advanced techniques, such as PCA analysis of the likelihoods across the entire catalog, may allow us to store even more, providing a better estimate of the shape of the likelihood function. In that sense, what is presented in Table 4 should be thought of as a conservative estimate, which we plan to improve upon as development continues in Construction.
So, a good baseline assumption is that we have 100 parameters per object to play with. Using fewer parameters would reduce the storage costs somewhat, and presumably speed up the computations too (although that would need investigating). This stuff should appear in the introduction of our Note.
Here's what the LSST Science Requirements Document has to say (in section 2.1) about photometric redshifts:
LSST will measure the comoving distance as a function of redshift in the redshift range 0.3–3.0 with an accuracy of 1-2%, and separately the growth of cosmic mass structure. A sample of about four billion galaxies with sufficiently accurate photometric redshifts is required. In order to achieve this comoving distance accuracy, the photometric redshifts requirements for this i < 25 flux-limited galaxy sample are i) the rms (σ) for error in (1 + z) must be smaller than 0.02, ii) the fraction of 3σ (“catastrophic”) outliers must be below 10%, and iii) the bias must be below 0.003.
So, we should somehow see that rms precision and low outlier rate in the simulated dataset the PZ WG provide. I guess the main thing for us is to quantify the additional rms error and bias introduced by using an approximation for each p(z).
@aimalz I think for a good LSST DESC Note we'll need to do a fairly comprehensive suite of tests. We want to find efficient PDF characterizations that can cope robustly with randomly drawn multi-modal distributions, right? Maybe you can come up with a scheme that generates plausible photo-z-like PDFs, so that we can do more interesting comparisons between characterizations. I think the KLD will be a good metric for comparison, but we do want to explore where each characterization fails... Maybe we can discuss this a bit here before starting a big notebook (that will become our DESC Note)?
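One possible scheme, sketched under the assumption that a random Gaussian mixture with redshift-dependent widths is "plausible enough" for testing (this is not a validated model of LSST photo-z posteriors, just a generator of multi-modal test cases):

```python
import numpy as np
from scipy import stats

def random_photoz_pdf(rng, n_peaks=2, z_grid=None):
    """Generate a multi-peaked, photo-z-like p(z) on a grid (illustrative only)."""
    if z_grid is None:
        z_grid = np.linspace(0.0, 3.0, 301)
    weights = rng.dirichlet(np.ones(n_peaks))          # random peak weights summing to 1
    means = rng.uniform(0.2, 2.5, n_peaks)             # random peak locations
    sigmas = rng.uniform(0.02, 0.2, n_peaks) * (1.0 + means)  # widths scale with 1+z
    p = sum(w * stats.norm.pdf(z_grid, m, s)
            for w, m, s in zip(weights, means, sigmas))
    return z_grid, p / np.trapz(p, z_grid)             # normalize on the grid
```

Drawing many such PDFs and measuring the KLD of each characterization against the truth would show where each representation fails (e.g. widely separated peaks vs. narrow spikes).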
I'm interested in the interpretation of the KL divergence, so will make a little notebook. Watch this space!
Currently our .travis.yml file does not contain instructions to build and deploy our DESC Note - we need to copy and edit the relevant lines from the .travis.yml file that start_paper provided into the main travis file. Also, I notice that two of the deployment conditions have been commented out: we should uncomment those, since we only want the html and pdf files to be updated when the master branch is updated. Want to have a go at this, @aimalz? If I were you, I'd carry out this "hot fix" by editing files on the master branch directly (even using the web editor), using your admin privileges; then we can check that things get deployed correctly when we next merge some other issue's PR.
In order to implement the suggestion from #36, qp will need the ability to perform a parametric fit to samples (particularly a Gaussian mixture model), rather than just obtaining a KDE and interpolating to convert between parametrizations.
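A bare-bones EM fit of a 1-D Gaussian mixture to samples can be sketched in a few lines; a real implementation (e.g. sklearn's GaussianMixture) would add convergence checks and regularization, so treat this as illustrative only:

```python
import numpy as np

def fit_gmm_1d(x, n_components=2, n_iter=200):
    """Minimal EM fit of a 1-D Gaussian mixture model to samples `x`."""
    w = np.full(n_components, 1.0 / n_components)
    # deterministic, spread-out initialization of the means
    mu = np.percentile(x, np.linspace(10, 90, n_components))
    sig = np.full(n_components, np.std(x))
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = (np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2)
                / (sig * np.sqrt(2 * np.pi)))
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: reweighted updates of weights, means, widths
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        sig = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    return w, mu, sig
```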
Currently the quantile and histogram approximations can only be made from a parametric truth. It would be easy to support creation of quantile and histogram approximations from samples.
Also, there is currently no support for sampling the gridded approximation (an oversight on my part), which will be fixed as part of resolving this issue.
These features will be necessary for #64 to be useful, unless #51 happens first.
I was thinking it might be useful to implement a quantile-quantile plot option in addition to calculating the RMS and KLD to compare PDF objects.
Edit: This will probably also merit another explanatory notebook like kld.ipynb.
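The core computation is small: pair up quantiles of the two representations and scatter them against each other (this helper is a sketch, not the proposed qp interface):

```python
import numpy as np

def qq_points(samples_a, samples_b, n_quantiles=50):
    """Paired quantiles of two sample sets, ready to plot against each other;
    points on the y = x diagonal mean the two distributions agree."""
    qs = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(samples_a, qs), np.quantile(samples_b, qs)
```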
I guess we want to choose the quantiles such that we sample the peaks and wings well: uniform in CDF does not seem to be optimal (or even superior to a histogram). In fact, don't we want to preferentially place interpolation nodes where the second derivative of the function is high? How about the first derivative? There must be a whole literature on this.
For the sake of comparing the quantile representation to more conventional representations, I should implement a PDF.histogram function. Similarly, I want to be able to initialize a PDF object with any one of truth, quantiles, or histogram, and then convert truth into both alternatives, quantize a histogram, and histogram quantiles. I'm not sure if this should be part of the Epic and/or ready before launch of the first version or if it can wait.
Now we have some class methods, we should write some unit tests for them.
Hey @aimalz - the qp algorithm and implementation dev is going great. Let's keep up the documentation as we go. Here are two things I think we should build:
API docs with Sphinx - this is easier than it sounds, we can crib from SLTimer #27
A short notebook of tex math and qp code illustrating the KLD between 2 Gaussians and providing some intuition about the magnitude of its value. #29
What do you think?
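For that notebook, the closed-form KLD between two Gaussians can be checked against a brute-force numerical integral of the definition (the function name here is mine, not qp's):

```python
import numpy as np
from scipy import stats

def kld_gaussians(mu_p, sig_p, mu_q, sig_q):
    """Closed-form D(P||Q) for P = N(mu_p, sig_p^2), Q = N(mu_q, sig_q^2)."""
    return (np.log(sig_q / sig_p)
            + (sig_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sig_q ** 2) - 0.5)

# brute-force check against the definition: integral of p * log(p / q)
x = np.linspace(-10.0, 10.0, 20001)
p = stats.norm.pdf(x, 0.0, 1.0)
q = stats.norm.pdf(x, 1.0, 2.0)
numeric = np.trapz(p * np.log(p / q), x)
```

For intuition: identical Gaussians give exactly 0, and a pure mean offset of t standard deviations contributes t^2/2 nats.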
There's a bug in the reconstruction of the PDF from quantiles that's causing them to blow up at the endpoints. I'm going to change the interpolation function for quantiles to require endpoints at which the CDF is 0 and 1 so it doesn't cause nonsensical linear interpolation. This will require changing the syntax for qp.PDF.quantize(). I'm going to use this opportunity to make the syntax more consistent as was suggested in #79.
Let's add this to kld.ipynb.
@aimalz You pointed me to this webpage - looking at this, it turns out my algebra was not very far off at all! With a bit more re-arranging I think we can make a link to the precision ratio.
I have a nice template we can use.
As in, http://mybinder.org/status/aimalz/qp ... Let's keep trying that link, hopefully it'll build at some point!
Let's use this issue thread to oversee this project! We can close it when we submit the paper (and we can re-title the thread if the Note becomes a journal article). To see all the issues associated with this epic, click here. There will be at least two milestones: circulate for feedback, and then submit to arxiv.
As noted here, the default setting of qp.PDF.interpolator returns NaN for samples if the min or max value has multiple entries. @sschmidt23 says:
qp.PDF() objects using approximate and evaluate return NaN outside the region of support if Samples are used and if the min or max value has multiple entries that are the same value. This is due to scipy.interp.interp1d and fill_value="extrapolate", where it fails to estimate a slope at the end point, breaking extrapolation. This should be fixable by setting fill_value = zero or a tiny number when using samples.
In order to maximize the usefulness of #43, I'll add read/write functions to the PDF (and/or survey) objects.
@aimalz Let's set up travis and have it run the demo notebook as a test. I have code to do this, but you'll need to turn on travis to get started.
Re: conversation with @drphilmarshall today, it's time to give #36 some context. In order to make #39 submittable to a journal, we'll need to perform an end-to-end test on a science case, and it would be appropriate for it to answer the question that inspired qp to be written: "What is the best way to store photo-z PDFs?" @janewman-pitt-edu, we are thinking of the simplest scientifically relevant end product to which inaccuracy in PDF representation could be propagated. n(z) seems like an obvious choice, but I think this application would be best approached after CHIPPR is submitted. Thoughts?
For the sake of completeness, I should probably implement the Kolmogorov-Smirnov test for comparing PDFs. This will probably require another explanatory notebook like kld.ipynb.
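As a starting point, scipy already provides the one-sample KS test, which could compare samples drawn from an approximation against the "truth" CDF; the setup below is illustrative, with the sample set standing in for draws from a qp approximation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
truth = stats.norm(0.0, 1.0)                         # the parametric "truth"
approx_samples = truth.rvs(2000, random_state=rng)   # stand-in for approximation samples

# KS statistic: maximum distance between the empirical and true CDFs
stat, pval = stats.kstest(approx_samples, truth.cdf)
```

A large statistic (small p-value) would flag an approximation whose shape has drifted from the truth.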
Let's start an LSST DESC Note writing up the qp code and some interesting tests, from the comprehensive suite in #36. Since LSST DESC Notes can be IPython notebooks it should be easy to make code development and testing EQUAL to LSST DESC Note production :-) This issue can be closed once we have run start_paper and scrubbed the ipynb template of all guideline text.
This will be a key part of our docs, and also teach me how to make PDFs out of notebooks (which I'd like to be able to do for the LSST DESC Notes...) I'll need to remember how the GITHUB_API_KEYS work, but I have some notes in my start_paper project.
PDF objects are currently initialized with a single scipy.stats.rv_continuous object. Linear combinations of these must be implemented next!
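A weighted linear combination of frozen scipy.stats distributions is straightforward to sketch; this MixturePDF class is hypothetical, not the eventual qp interface:

```python
import numpy as np
from scipy import stats

class MixturePDF:
    """Weighted linear combination of frozen scipy.stats distributions."""
    def __init__(self, components, weights):
        self.components = components
        self.weights = np.asarray(weights, dtype=float)
        self.weights /= self.weights.sum()  # enforce unit total weight

    def pdf(self, x):
        return sum(w * c.pdf(x) for w, c in zip(self.weights, self.components))

    def cdf(self, x):
        return sum(w * c.cdf(x) for w, c in zip(self.weights, self.components))
```

Because pdf and cdf are linear in the components, the mixture stays normalized whenever the component weights sum to one.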
I'm making an issue for the paper revisions in response to DESC feedback to correspond to a new branch. I'll use this space to note major changes.
In response to #64, it's going to be necessary to enforce normalization conditions for all approximations, to keep outliers from dominating the summary statistics.
With unit tests! :-)
... containing modules:

- pdf.py (with just the PDF class in it)
- utils.py (with various functions in it)

and also a __init__.py. Let's put all this in a folder called qp so that we can

import qp

At this point we'll need a setup.py file, like this one. Good luck!
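A minimal setup.py for that layout might look like the following; the version and description here are placeholders, not values from the repo:

```python
# minimal packaging sketch; metadata values are placeholders
from setuptools import setup

setup(
    name="qp",
    version="0.1",
    description="Quantile parametrization for probability distribution functions",
    packages=["qp"],
)
```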