qp's Issues

API docs

@aimalz I'll do this - it'll involve moving all our docs to a 'docs' folder and then getting readthedocs working. Feel free to ask questions about each line in your code review! :-)

Gridded approximation

qp should be able to ingest PDFs in the format of a function evaluated at points on a grid, as it already can with samples, bin values, quantiles, and parametric "truth" functions.
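For intuition, here is a minimal sketch (not the actual qp internals) of what a gridded representation amounts to, built on the same scipy.interpolate machinery referenced elsewhere in these issues; `gridded_pdf` is a hypothetical helper:

```python
import numpy as np
import scipy.interpolate as spi

def gridded_pdf(x_grid, y_grid, kind='linear'):
    """Build a callable PDF from values y_grid evaluated at points x_grid."""
    # Normalize so the tabulated function integrates to 1 over the grid.
    norm = np.trapz(y_grid, x_grid)
    return spi.interp1d(x_grid, y_grid / norm, kind=kind,
                        bounds_error=False, fill_value=0.)

# Example: a unit Gaussian evaluated on a coarse grid
x = np.linspace(-5., 5., 100)
y = np.exp(-0.5 * x ** 2) / np.sqrt(2. * np.pi)
pdf = gridded_pdf(x, y)
print(pdf(0.))  # ~0.3989
```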

Use moments as a metric

I’d suggest prioritizing somewhat different metrics than those being looked at now: in particular, the moments of the PDF are probably the most important things. In general, those moments will determine how errors in the PDFs propagate into cosmology measurements. There are plenty of analyses looking at the first couple of moments (corresponding to tests of bias in mean z and spread), but I would probably look at the 3rd and 4th moments too, at least.
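As a sketch of what computing these metrics could look like for a gridded p(z) (`moment` and `central_moments` are illustrative helpers, not qp API):

```python
import numpy as np

def moment(z_grid, pz_grid, n):
    """n-th raw moment of a PDF tabulated on a grid, by trapezoidal integration."""
    norm = np.trapz(pz_grid, z_grid)
    return np.trapz(z_grid ** n * pz_grid, z_grid) / norm

def central_moments(z_grid, pz_grid, n_max=4):
    """Central moments 2..n_max (spread, plus the skewness- and
    kurtosis-related moments suggested above)."""
    mu = moment(z_grid, pz_grid, 1)
    norm = np.trapz(pz_grid, z_grid)
    return [np.trapz((z_grid - mu) ** n * pz_grid, z_grid) / norm
            for n in range(2, n_max + 1)]
```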

Implement samples representation

In response to #23, a collection of samples should be another way to initialize a PDF object, and there should be a function to generate samples from any of the other representations.
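For the gridded case, generating samples can go through the CDF via inverse-transform sampling; a sketch, with `sample_from_gridded` as a hypothetical helper:

```python
import numpy as np

def sample_from_gridded(z_grid, pz_grid, n_samples, seed=None):
    """Inverse-CDF (inverse transform) sampling from a tabulated PDF."""
    rng = np.random.RandomState(seed)
    cdf = np.cumsum(pz_grid)
    cdf = cdf / cdf[-1]  # normalize so the CDF ends at 1
    u = rng.uniform(size=n_samples)
    # invert the CDF by interpolating the (cdf, z) relation
    return np.interp(u, cdf, z_grid)
```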

KLD notebook completion

Hey @aimalz - Looks like there could be some inconsistencies in the KLD notebook, between the precision and tension parts, and between the math and the plots.

Some suggestions:

  • Go back and clean up the precision derivation to make it much more concise, but double check the steps to make sure the formula is correct. There could be an error of a factor of 2 in there - and maybe some r vs 1/r confusion.

  • Derive the full formula in terms of r and t first, and then simplify it to the two cases (zero-tension precision, r=1 tension) to recover the simpler approximate models.

  • Overlay the full analytic formula on both plots as dotted black lines, as well as the approximate models in wide gray lines.

Then, I think we could add some insight in the conclusions, by pointing out where the two limits (zero-tension precision, r=1 tension) might occur in practice. I guess the former is more like the photo-z PDF characterization problem? And the latter more like the case where an unknown systematic afflicts your cosmological parameter central value but not its uncertainty?

Make first "Release"

@aimalz Once you have closed #34 by merging #37 you should make our first "release", here:

https://github.com/aimalz/qp/releases

I'd suggest "v0.1 beta" as a reasonable name - and you can edit the release text once you have made it. Probably this should be a "pre-release" as well - we can save the "release" name for v1.0, after we've done an investigation of the quantile idea. What do you think? This could be the last issue of our "Launch" milestone :-)

Epic: Polishing for response to referee report

I'm collecting things I really need to fix before a proper code release.

  • make it easy for users to add formats by introducing a parametrization class (like scipy.stats.rv_continuous), with consistent syntax for conversions to replace the using keyword; this opens the door to implementing the SparsePZ and FlexZBoost formats (punting this to a later code release)
  • eliminate qp.PDF.truth in favor of using mix_mod and qp.PDF.first
  • rename the histogram parametrization to something like piecewise constant or stepfunc to reduce confusion, and make the piecewise constant distribution consistent in interpolation, integration, etc.
  • remove checks for whether a parametrization exists (and calculation with default parameters) in favor of raising exceptions
  • find a way around infty when converting quantiles to other parametrizations
  • formalize interpolation options and include an option for a user-defined callable (punting this to a later code release)
  • check for consistent style in documentation
  • clean up excessive print statements
  • #88 (punting this to a later code release)
  • #65
  • move qp.Ensemble.stack() and other "derived" functions out of the class and into utils.py, and publish demos for the "derived" use cases
  • replace qp.PDF.[parametrization format] with a cleaner dict object (punting this to a later code release)

Parallelization

#43 will benefit from parallelizing the handling of individual qp.PDF objects, and it shouldn't be hard to implement.
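Since each qp.PDF is handled independently, an embarrassingly parallel map should suffice. A sketch with the standard library's multiprocessing, assuming the per-object work is wrapped in a picklable function (`approximate_one` is hypothetical):

```python
from multiprocessing import Pool

def approximate_one(pdf):
    # placeholder for the per-object work, e.g. converting one qp.PDF
    # to a target parametrization; the real call would go here
    return pdf

if __name__ == '__main__':
    pdfs = []  # would be a list of qp.PDF objects built elsewhere
    pool = Pool(processes=4)
    results = pool.map(approximate_one, pdfs)
    pool.close()
    pool.join()
```

One caveat: Pool.map pickles its inputs and outputs, so qp.PDF objects (including any interpolators they hold) would need to be picklable.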

Code is not currently python3 compatible.

Code fails on import when using python3.

Suggest adding

from __future__ import absolute_import, division, print_function

to the top of all Python files; this will break the current Python 2-only code and surface the issues that need fixing. The result should be code that is cross-compatible with Python 2 and Python 3.

Extrapolation of probability may break normalization

The update to fill_value and extrapolation in the initialization of using='gridded' PDFs (which I believe was added because of the default 'NaN' outside the range of support that I raised in an earlier issue) may cause problems in the normalization of the PDF if the range over which you are normalizing extends beyond the grid points. The line in question, in qp.pdf.approximate:
self.interpolator = spi.interp1d(x, y, kind=self.scheme, fill_value="extrapolate")

An example of the issue: I created a PDF ensemble object for a set of galaxies with p(z) evaluated on np.arange(0.005,2.11,0.01), so the lowest redshift point is 0.005, not 0.0. The normalization in the initialization:
self.gridded = qp.utils.normalize_integral(qp.utils.normalize_gridded(gridded, vb=vb))
normalizes between zmin and zmax. However, if the p(z) value at z=zmin is non-zero, qp.pdf.approximate will extrapolate a non-zero value at z=0.0, so integrating qp.pdf.integrate(limits=(0.0,2.0)) gives a value >1.0, because non-zero probability has been extrapolated below zmin.

Is it smarter to set fill_value=0.0, rather than "extrapolate" in interp1d? That seems to be the more sensible choice here, since presumably the PDF is zero outside the tested range over which it was initially normalized. bounds_error=False will also need to be set. This fix works on my local copy of qp.
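For concreteness, a self-contained sketch of the failure mode and the proposed fix (the grid below mimics the np.arange(0.005,2.11,0.01) example above, with an invented p(z) peaked near zmin to make the effect visible):

```python
import numpy as np
import scipy.interpolate as spi

z = np.arange(0.005, 2.11, 0.01)
pz = np.exp(-0.5 * ((z - 0.05) / 0.1) ** 2)
pz /= np.trapz(pz, z)  # normalized between zmin and zmax only

# current behavior: linear extrapolation puts non-zero probability at z < zmin
extrap = spi.interp1d(z, pz, kind='linear', fill_value="extrapolate")

# proposed fix: treat the PDF as zero outside the grid it was normalized on
safe = spi.interp1d(z, pz, kind='linear', bounds_error=False, fill_value=0.)

print(extrap(0.0))  # non-zero: inflates integrals over (0.0, 2.0) past 1
print(safe(0.0))    # 0.0
```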

Background text on baseline LSST DM plan

The LSST DM "Data Products Definition" document, LSE-163, has the following to say about photo-z storage:

Colors of the object in “standard seeing” (for example, the third quartile expected survey seeing in the i band, ∼ 0.9”) will be measured. These colors are guaranteed to be seeing-insensitive, suitable for estimation of photometric redshifts. (Footnote: The problem of optimal determination of photometric redshift is the subject of intense research. The approach we’re taking here is conservative, following contemporary practices. As new insights develop, we will revisit the issue.)

We currently plan to provide [full photo-z posterior PDF] information ... by providing parametric estimates of the likelihood function. As will be shown in Table 4, the current allocation is ... ~100 parameters for describing the photo-Z likelihood distributions, per object. The methods of storing likelihood functions (or samples thereof) will continue to be developed and optimized throughout Construction and Commissioning. The key limitation, on the amount of data needed to be stored, can be overcome by compression techniques. For example, simply noticing that not more than ∼ 0.5% accuracy is needed for sample values allows one to increase the number of samples by a factor of 4. ... Advanced techniques, such as PCA analysis of the likelihoods across the entire catalog, may allow us to store even more, providing a better estimate of the shape of the likelihood function. In that sense, what is presented in Table 4 should be thought of as a conservative estimate, which we plan to improve upon as development continues in Construction.

So, a good baseline assumption is that we have 100 parameters per object to play with. Using fewer parameters would reduce the storage costs somewhat, and presumably speed up the computations too (although that would need investigating). This stuff should appear in the introduction of our Note.

LSST SRD photo-z metrics

Here's what the LSST Science Requirements Document has to say (in section 2.1) about photometric redshifts:

LSST will measure the comoving distance as a function of redshift in the redshift range 0.3–3.0 with an accuracy of 1-2%, and separately the growth of cosmic mass structure. A sample of about four billion galaxies with sufficiently accurate photometric redshifts is required. In order to achieve this comoving distance accuracy, the photometric redshifts requirements for this i < 25 flux-limited galaxy sample are i) the rms (σ) for error in (1 + z) must be smaller than 0.02, ii) the fraction of 3σ (“catastrophic”) outliers must be below 10%, and iii) the bias must be below 0.003.

So, we should expect to see that rms precision and low outlier rate in the simulated dataset the PZ WG provides. I guess the main thing for us is to quantify the additional rms error and bias introduced by using an approximation for each p(z).
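A sketch of those three statistics, computed from matched photometric and spectroscopic redshifts (`srd_photoz_metrics` is an illustrative helper, not existing code):

```python
import numpy as np

def srd_photoz_metrics(z_phot, z_spec):
    """Scaled residuals e = (z_phot - z_spec)/(1 + z_spec), then the SRD trio:
    rms scatter, 3-sigma ("catastrophic") outlier fraction, and bias."""
    e = (z_phot - z_spec) / (1. + z_spec)
    sigma = np.std(e)
    outlier_frac = np.mean(np.abs(e) > 3. * sigma)
    bias = np.mean(e)
    return sigma, outlier_frac, bias
```

Against the quoted requirements, these would need to satisfy sigma < 0.02, outlier_frac < 0.10, and |bias| < 0.003.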

Comprehensive test suite

@aimalz I think for a good LSST DESC Note we'll need to do a fairly comprehensive suite of tests. We want to find efficient PDF characterizations that can cope robustly with randomly drawn multi-modal distributions, right? Maybe you can come up with a scheme that generates plausible photo-z-like PDFs, so that we can do more interesting comparisons between characterizations. I think the KLD will be a good metric for comparison, but we do want to explore where each characterization fails... Maybe we can discuss this a bit here before starting a big notebook (that will become our DESC Note)?
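One possible scheme for the generator (purely a sketch; the component counts and parameter ranges are invented, and `random_photoz_pdf` is hypothetical):

```python
import numpy as np
import scipy.stats as sps

def random_photoz_pdf(rng, max_components=3):
    """Draw a random Gaussian mixture loosely resembling a photo-z posterior."""
    n = rng.randint(1, max_components + 1)
    weights = rng.dirichlet(np.ones(n))
    means = rng.uniform(0.0, 2.0, size=n)    # redshift peaks
    sigmas = rng.uniform(0.02, 0.2, size=n)  # peak widths
    def pdf(z):
        return sum(w * sps.norm(m, s).pdf(z)
                   for w, m, s in zip(weights, means, sigmas))
    return pdf

rng = np.random.RandomState(42)
p = random_photoz_pdf(rng)  # callable multi-modal test PDF
```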

Interpreting the KLD

I'm interested in the interpretation of the KL divergence, so will make a little notebook. Watch this space!

Upgrade .travis.yml to make and deploy Note PDF

Currently our .travis.yml file does not contain instructions to build and deploy our DESC Note - we need to copy the relevant lines from the .travis.yml file that start_paper provided into the main travis file and edit them. Also, I notice that two of the deployment conditions have been commented out: we should uncomment those, since we only want the html and pdf files to be updated when the master branch is updated. Want to have a go at this, @aimalz ? If I were you, I'd carry out this "hot fix" by editing files on the master branch directly (even using the web editor), using your admin privileges; then we can check that things get deployed correctly when we next merge some other issue's PR.

GMM fits

In order to implement the suggestion from #36, qp will need the ability to perform a parametric fit to samples (particularly a Gaussian mixture model), rather than just obtaining a KDE and interpolating to convert between parametrizations.
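A sketch of such a fit using scikit-learn's GaussianMixture (scikit-learn is an assumption here, not a current qp dependency):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(samples, n_components=3):
    """Fit a 1D Gaussian mixture model to PDF samples; returns the
    component weights, means, and standard deviations."""
    gmm = GaussianMixture(n_components=n_components)
    gmm.fit(np.asarray(samples).reshape(-1, 1))  # sklearn expects 2D input
    return (gmm.weights_,
            gmm.means_.flatten(),
            np.sqrt(gmm.covariances_.flatten()))
```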

Expand format conversion options

Currently the quantile and histogram approximations can only be made from a parametric truth. It would be easy to support creation of quantile and histogram approximations from samples.

Also, there is currently no support for sampling the gridded approximation (an oversight on my part), which will be fixed as part of resolving this issue.

These features will be necessary for #64 to be useful, unless #51 happens first.

Implement quantile approximation

@drphilmarshall I have code that does some of this but it needs to be integrated into the qp class. This is an Epic, being addressed in PR #2.

  • PDF.quantize() #14
  • PDF.interpolate() #15
  • KL divergence between PDF and qp approximation #9
  • PDF.histogram() #23
  • Better quantile spacing #35

Q-Q Plot

I was thinking it might be useful to implement a quantile-quantile plot option in addition to calculating the RMS and KLD to compare PDF objects.

Edit: This will probably also merit another explanatory notebook like kld.ipynb.
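A minimal sketch of the idea, assuming both PDFs expose an inverse CDF (e.g. scipy frozen distributions' .ppf); `qq_plot` is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

def qq_plot(ppf_approx, ppf_true, n_quantiles=100):
    """Quantile-quantile plot of an approximation against the truth.
    ppf_* are inverse-CDF callables evaluated on shared quantile levels."""
    levels = np.linspace(0.01, 0.99, n_quantiles)
    q_true = ppf_true(levels)
    q_approx = ppf_approx(levels)
    plt.plot(q_true, q_approx, 'k.')
    plt.plot(q_true, q_true, 'r--')  # y = x reference: perfect agreement
    plt.xlabel('true quantiles')
    plt.ylabel('approximate quantiles')
```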

Better quantile spacing

I guess we want to choose the quantiles such that we sample the peaks and wings well: uniform in CDF does not seem to be optimal (or even superior to a histogram). In fact, don't we want to preferentially place interpolation nodes where the second derivative of the function is high? How about the first derivative? There must be a whole literature on this.
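One heuristic worth trying: for piecewise-linear interpolation, the error on an interval of width $h$ scales like $h^2 |f''|/8$, so equalizing the error across intervals suggests a node density proportional to $\sqrt{|f''|}$. A sketch (`curvature_spaced_nodes` is hypothetical):

```python
import numpy as np

def curvature_spaced_nodes(x, f, n_nodes):
    """Place interpolation nodes with density proportional to sqrt(|f''|),
    roughly equalizing piecewise-linear interpolation error per interval."""
    d2 = np.gradient(np.gradient(f, x), x)
    density = np.sqrt(np.abs(d2)) + 1e-12     # avoid zero density
    cum = np.cumsum(density)
    cum = (cum - cum[0]) / (cum[-1] - cum[0])  # map onto [0, 1]
    return np.interp(np.linspace(0., 1., n_nodes), cum, x)
```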

Implement histogram representation

For the sake of comparing the quantile representation to more conventional representations, I should implement a PDF.histogram function. Similarly, I want to be able to initialize a PDF object with any one of truth, quantiles, or histogram, and then convert truth into both alternatives, quantize a histogram, and histogram quantiles. I'm not sure if this should be part of the Epic and/or ready before launch of the first version, or if it can wait.
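For the truth-to-histogram direction, a sketch: bin heights are just CDF differences divided by bin widths (`histogramize` is an illustrative helper):

```python
import numpy as np
import scipy.stats as sps

def histogramize(dist, bin_edges):
    """Histogram (piecewise-constant) approximation of a scipy frozen
    distribution: average density in each bin from CDF differences."""
    edges = np.asarray(bin_edges)
    probs = np.diff(dist.cdf(edges))  # probability mass per bin
    return probs / np.diff(edges)     # convert mass to density (bin heights)

heights = histogramize(sps.norm(0., 1.), np.linspace(-3., 3., 21))
```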

Start unit tests

Now that we have some class methods, we should write some unit tests for them.
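A starting point for the structure (the test below exercises scipy rather than qp, since the qp method signatures are still settling; it is only a template):

```python
import unittest
import scipy.stats as sps

class TestPDF(unittest.TestCase):
    def test_gaussian_median(self):
        # the median of a unit Gaussian should be its mean
        dist = sps.norm(0., 1.)
        self.assertAlmostEqual(dist.ppf(0.5), 0.)

if __name__ == '__main__':
    unittest.main()
```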

Documentation

Hey @aimalz - the qp algorithm and implementation dev is going great. Let's keep up the documentation as we go. Here are two things I think we should build:

  • API docs with Sphinx - this is easier than it sounds, we can crib from SLTimer #27

  • A short notebook of tex math and qp code illustrating the KLD between 2 Gaussians and providing some intuition about the magnitude of its value. #29

What do you think?

Bug in quantile interpolation

There's a bug in the reconstruction of the PDF from quantiles that's causing the reconstructed PDFs to blow up at the endpoints. I'm going to change the interpolation function for quantiles to require endpoints at which the CDF is 0 and 1, so it doesn't produce nonsensical linear interpolation. This will require changing the syntax for qp.PDF.quantize(). I'm going to use this opportunity to make the syntax more consistent, as was suggested in #79.

Analytic derivation of 2-Gaussian KLD

Let's add this to kld.ipynb.

@aimalz You pointed me to this webpage - looking at this, it turns out my algebra was not very far off at all! With a bit more re-arranging I think we can make a link to the precision ratio $\sigma/\sigma_0$ and tension parameter. If I get a chance I'll have a go this afternoon.
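For the record, the standard closed form we should land on, for $p = \mathcal{N}(\mu_p, \sigma_p^2)$ and $q = \mathcal{N}(\mu_q, \sigma_q^2)$, is

$$\mathrm{KLD}(p\,\|\,q) = \ln\frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2},$$

which can then be recast in terms of the precision ratio and a tension parameter (the mean offset in units of the width) - being careful about which direction the ratio runs, given the r vs 1/r confusion flagged in the KLD notebook issue.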

Mock catalog construction

  • 10^5 galaxies with z_p, sigma_p, z_s
  • follow the XDGMM demo to make 2D GMM, using BIC to choose Ng (might need to provide some nominal sigma_s as well)
  • draw samples from 2D GMM to check that it looks like the input data
  • condition on each z_s to make the "true" PDFs (each one is also an Ng-component GMM; see the sketch after this list)
  • export via #48 in the #51 format
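The conditioning step uses the standard Gaussian conditional per component, with the component weights rescaled by how well each component explains the observed z_s; a sketch (`condition_gmm_on_zs` is hypothetical, with means ordered as (z_p, z_s)):

```python
import numpy as np

def condition_gmm_on_zs(weights, means, covs, z_s):
    """Condition a 2D GMM p(z_p, z_s) on z_s to get the 1D GMM p(z_p | z_s).
    means: (Ng, 2) with columns (z_p, z_s); covs: (Ng, 2, 2)."""
    new_means, new_sigmas, new_weights = [], [], []
    for w, m, C in zip(weights, means, covs):
        # standard Gaussian conditional for each component
        new_means.append(m[0] + C[0, 1] / C[1, 1] * (z_s - m[1]))
        new_sigmas.append(np.sqrt(C[0, 0] - C[0, 1] ** 2 / C[1, 1]))
        # reweight by how likely z_s is under this component's marginal
        marg = (np.exp(-0.5 * (z_s - m[1]) ** 2 / C[1, 1])
                / np.sqrt(2. * np.pi * C[1, 1]))
        new_weights.append(w * marg)
    new_weights = np.array(new_weights)
    return (new_weights / new_weights.sum(),
            np.array(new_means), np.array(new_sigmas))
```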

Epic: LSST DESC Note on Photo-z P(z) Approximations

Let's use this issue thread to oversee this project! We can close it when we submit the paper (and we can re-title the thread if the Note becomes a journal article). To see all the issues associated with this epic, click here. There will be at least two milestones: circulate for feedback, and then submit to arxiv.

Make the approximate PDFs

  • After #55 is closed...
  • read in PDFs via #48
  • via #43, produce a catalog of these PDFs in each parametrization (bins, quantiles, samples, mixture model via #51), possibly using multiprocessing (which would be a new issue)
  • export to file via #48

Extrapolation options for qp.PDF interpolator

As noted here, the qp.PDF.interpolator's default setting returns NaN for samples if the min or max value has multiple entries. @sschmidt23 says:

qp.PDF() objects using approximate and evaluate return NaN outside the region of support if samples are used and if the min or max value has multiple entries that are the same value. This is due to scipy.interpolate.interp1d with fill_value="extrapolate", which fails to estimate a slope at the end point, breaking extrapolation. This should be fixable by setting fill_value to zero or a tiny number when using samples.

File i/o

In order to maximize the usefulness of #43, I'll add read/write functions to the PDF (and/or survey) objects.

Science metrics for photo-z PDF approximation performance assessment

Re: conversation with @drphilmarshall today, it's time to give #36 some context. In order to make #39 submittable to a journal, we'll need to perform an end-to-end test on a science case, and it would be appropriate for it to answer the question that inspired qp to be written: "What is the best way to store photo-z PDFs?" @janewman-pitt-edu, we are thinking of the simplest scientifically relevant end product to which inaccuracy in PDF representation could be propagated. n(z) seems like an obvious choice, but I think this application would be best approached after CHIPPR is submitted. Thoughts?

KS test

For the sake of completeness, I should probably implement the Kolmogorov-Smirnov test for comparing PDFs. This will probably require another explanatory notebook like kld.ipynb.
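A sketch using scipy's one-sample KS test, comparing samples drawn from a true PDF against an approximation's CDF (both distributions below are stand-ins):

```python
import scipy.stats as sps

# samples from the "truth", and a slightly-off approximation's CDF
true_dist = sps.norm(0., 1.)
samples = true_dist.rvs(size=1000, random_state=0)
approx_cdf = sps.norm(0.05, 1.1).cdf  # stand-in for a qp approximation

# KS statistic: max distance between empirical and approximate CDFs
statistic, p_value = sps.kstest(samples, approx_cdf)
```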

Metric analysis

  • read in parametrized catalogs via #56
  • loop over N_metrics, N_floats, N_parametrizations
  • plot the results

Survey mode

qp should be tested on surveys of PDFs in addition to its current functionality with individual PDFs. New tests for this capability will have to be written.

Start LSST DESC Note

Let's start an LSST DESC Note writing up the qp code and some interesting tests, from the comprehensive suite in #36. Since LSST DESC Notes can be IPython notebooks it should be easy to make code development and testing EQUAL to LSST DESC Note production :-) This issue can be closed once we have run start_paper and scrubbed the ipynb template of all guideline text.

Build HTML from demo notebook(s) and deploy

This will be a key part of our docs, and also teach me how to make PDFs out of notebooks (which I'd like to be able to do for the LSST DESC Notes...) I'll need to remember how the GITHUB_API_KEYS work, but I have some notes in my start_paper project.

More flexible functional forms

PDF objects are currently initialized with a single scipy.stats.rv_continuous object. Linear combinations of these must be implemented next!
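A sketch of the kind of linear combination we need, wrapping frozen scipy.stats objects (this MixtureModel class is illustrative, not a proposed qp API):

```python
import numpy as np
import scipy.stats as sps

class MixtureModel(object):
    """Weighted linear combination of frozen scipy.stats distributions."""
    def __init__(self, components, weights):
        self.components = components
        self.weights = np.asarray(weights) / np.sum(weights)

    def pdf(self, x):
        return sum(w * c.pdf(x) for w, c in zip(self.weights, self.components))

    def cdf(self, x):
        return sum(w * c.cdf(x) for w, c in zip(self.weights, self.components))

# a bimodal example built from two Gaussians
bimodal = MixtureModel([sps.norm(0.5, 0.1), sps.norm(1.5, 0.3)], [0.6, 0.4])
```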

Paper revisions

I'm making an issue for the paper revisions in response to DESC feedback to correspond to a new branch. I'll use this space to note major changes.

Normalize all approximations

In response to #64, it's going to be necessary to enforce normalization conditions for all approximations, to keep outliers from dominating the summary statistics.

Make a `qp` package

... containing modules:

  • pdf.py (with just the PDF class in it)
  • utils.py (with various functions in it)

and also an __init__.py. Let's put all this in a folder called qp so that we can

import qp

At this point we'll need a setup.py file, like this one. Good luck!
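For reference, the layout described above would look like:

```
qp/
    __init__.py
    pdf.py     # just the PDF class
    utils.py   # various functions
setup.py
```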
