abhranildas / integrate-and-classify-normal-distributions

MATLAB toolbox to integrate normal (Gaussian) distributions of any dimension, with any parameters, over any domain; to compute the pdf/cdf/inverse cdf of any function of a normal vector; and to compute measures of classification performance among two or more multinormals, such as the error matrix and d'.



integrate-and-classify-normal-distributions's Issues

Change grid integration algorithm to handle a normal on multiple boundaries

Currently, in opt_bd_multi, if the origin (the normal mean) lies on multiple boundaries, it throws an error.
The reason is the following: when the mean is on a single boundary during regular integration, the starting mass is 0.5, and in each direction we add and subtract masses depending on the sign. This works out because exactly a half-plane is being left out.
However, when the point is on multiple boundaries, more than a half-plane is left out, and it's complicated to calculate how much (in different dimensions too) and compensate the starting mass accordingly.
Instead, the grid integration algorithm can be changed so that in each direction, only the correct integral over the correct intervals is taken, removing the need for a starting mass; see the sketch below.
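A minimal sketch of that per-ray idea, with a hypothetical helper name (not the toolbox API): given the boundary crossings along one ray, integrate the standard-normal mass only over the intervals that lie inside the domain, so no starting mass is needed.

```matlab
function mass = ray_mass(q, roots_r)
% q: handle giving the boundary function's value at radius r along this ray
% roots_r: real roots of q along the ray (boundary crossings)
pts = sort(roots_r(:)).';
edges = [-10, pts(abs(pts) < 10), 10];   % normal mass beyond |r|=10 is negligible
mass = 0;
for k = 1:numel(edges)-1
    if q(mean(edges(k:k+1))) > 0         % midpoint test: is this interval inside?
        mass = mass + normcdf(edges(k+1)) - normcdf(edges(k));
    end
end
end
```

For example, ray_mass(@(r) 1 - r.^2, [-1 1]) returns normcdf(1) - normcdf(-1), about 0.68.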

`bd_pts_gauss_opt` isn't optimal

The `bd_pts_gauss_opt` returned in the results seems not to be the optimal boundary, but rather the one corresponding to the provided priors, outcome values, and custom boundary.

Ask for citation

Ask users to cite the VSS conference meeting abstract (once it is published in September).

In every script, write comments and a preamble with author, link, and credit, as in Steve's code.

Degenerate cases

If we silence one of the cues, i.e. set its mean and SD to 0, matrix inversions etc. may fail because of the zeros in that dimension. A solution still exists: see the 'Degenerate case' section of the Wikipedia page on the multivariate normal distribution.
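A minimal sketch of that degenerate-case treatment, assuming the Wikipedia formulation: evaluate the density only on the support of the covariance matrix, using the pseudo-determinant instead of det.

```matlab
Sigma = [1 0; 0 0];                  % second cue silenced: zero variance
mu = [0; 0]; x = [0.5; 0];
[U, D] = eig(Sigma);
d = diag(D); keep = d > 1e-12;       % support (non-degenerate directions)
k = nnz(keep);                       % effective rank
pdet = prod(d(keep));                % pseudo-determinant
z = U(:, keep)' * (x - mu);          % coordinates on the support
pdf_val = exp(-(z' * diag(1./d(keep)) * z) / 2) / sqrt((2*pi)^k * pdet);
```

Here pdf_val matches the 1D normal pdf at 0.5, since the silenced dimension carries no mass.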

Avoid using rotm2eul

This function, which extracts Euler angles from a rotation matrix, requires the Robotics System Toolbox. Do it manually instead; search for 'rotation matrix to yaw pitch roll'.
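A minimal sketch of the standard manual extraction for the ZYX (yaw-pitch-roll) convention, assuming a proper rotation matrix with pitch away from ±90 degrees (gimbal lock needs a special case):

```matlab
R = [0 -1 0; 1 0 0; 0 0 1];      % example: 90-degree yaw
yaw   = atan2(R(2,1), R(1,1));   % rotation about z
pitch = asin(-R(3,1));           % rotation about y
roll  = atan2(R(3,2), R(3,3));   % rotation about x
```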

Optional inputs and outputs.

Make the codebase as fast as possible by making expensive outputs optional, so they are not computed unless asked for.
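A minimal sketch of the usual MATLAB pattern for this, using nargout; the function and its outputs are hypothetical.

```matlab
function [cheap, expensive] = compute_stuff(x)
cheap = sum(x);                      % always computed
if nargout > 1                       % skipped unless the caller asks for it
    expensive = svd(x(:) * x(:)');
end
end
```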

Pick one of \Phi or \erf

The paper uses \Phi for the 1D Gaussian integral and \erf for the 3D Gaussian integral. Choose one; they are interchangeable via the identity below.
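```latex
\Phi(x) = \tfrac{1}{2}\left[1 + \operatorname{erf}\left(\tfrac{x}{\sqrt{2}}\right)\right]
```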

Double boundary with custom boundary

When a custom boundary is supplied, the drawn boundary will often be double: the boundary computed with dist a and the one computed with dist b (just the negated coefficients) won't exactly line up, unlike when no custom boundary is supplied and the optimal boundary is computed with dist a and with dist b (which ought also to be just a negation).

Tighter error bound

I tried to implement Ruben's choice of a different beta and a tighter error bound when b=0. This is in gen_chi_sq_cdf2. However,

  1. In some cases, e.g. gen_chi_sq_cdf(50,[1 5 2],[1 2 3],[0 0 0],10), the tighter error is actually much bigger, although it shrinks more quickly by iter=100. So it's not always better to use this, even when b=0.

  2. For large K's, the modified error bound is a ratio of large numbers (gamma function on top, factorial at the bottom) and returns NaN. It may be possible to re-express this, using gamma-function properties, in a computable form (see the sketch below).

In any case, this is low priority, because the number of terms can be set high enough to make the regular error bound tiny, and the user will not really care about a tighter bound.
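A minimal sketch of the re-expression in point 2: evaluate the ratio in log space with gammaln, so large K's don't overflow to NaN. The argument values here are illustrative.

```matlab
a = 300.5; k = 250;                             % illustrative large arguments
ratio_naive = gamma(a) / factorial(k);          % Inf/Inf -> NaN
ratio_log   = exp(gammaln(a) - gammaln(k + 1)); % factorial(k) = gamma(k+1); finite
```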

input data

Measure distribution parameters from data. To tell whether the input is Gaussian parameters or data, the input parsing may have to be done first.

Input and output polynomial coeffs of boundary

It would be nice to be able to enter and read out boundary coefficients as the coefficients of the quadratic boundary polynomial, i.e. the coefficients of 1, x, x^2, y, y^2, xy, z, z^2, zx, zy.

Besides the convenience, the matrix coefficients have redundant elements, which polynomial coefficients avoid.
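A minimal sketch of the 2D conversion, assuming the boundary is stored in matrix form q(x) = x'Ax + b'x + c: the symmetric A holds the xy term twice, while the polynomial form keeps it once.

```matlab
A = [2 1; 1 3]; b = [4; 5]; c = 6;             % illustrative matrix form
poly = [c, b(1), b(2), A(1,1), A(1,2) + A(2,1), A(2,2)];
% poly = [6 4 5 2 2 3]: coefficients of 1, x, y, x^2, xy, y^2
```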

Change opt_bd_multi to speed up.

Currently written with a cellfun over opt_bd_multi_pt, which calls opt_bd. This sends individual points to opt_bd one at a time, not leveraging its ability to handle many points simultaneously, so it runs slower than the previous code.
Go back to the previous version (commented below), but incorporate the changes that have happened since.

Measures of the accuracy of our method

  • Test both the Imhof method and the grid method against simple-geometry cases with known d'. Test our method against Imhof's for quadratic cases, and perhaps against a Monte Carlo method for other cases.

  • Compare ray vs. gx2 results across the quadratic toy examples.

Load angle grid only once.

Don't compute it every time d' is called. See if there's a way for a function to load something persistently; MATLAB's persistent variables should do this (sketch below).
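A minimal sketch using a persistent variable; the grid builder here is a stand-in for the real one.

```matlab
function grid = get_angle_grid(n)
persistent cached_grid
if isempty(cached_grid)
    cached_grid = linspace(0, pi, n);   % built once, on the first call
end
grid = cached_grid;                      % reused on every later call
end
```

(A real version should also rebuild the cache if n changes.)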

Better approximate d'?

The current approximate d' is the Mahalanobis distance between the distributions, assuming that both have the average covariance matrix.
A better approximation might be found by taking the boundary point on the line joining the two means, and adding its Mahalanobis distances to the two distributions.
However, in some cases (e.g. same mean, different variances), there may be no boundary point between the means. The current approximate d' still works there, but this new one won't.
So instead, consider the midpoint between the means, and add its Mahalanobis distances d'_a and d'_b to the two distributions. You can check that the current approximate d' satisfies d'^2 = 2(d'_a^2 + d'_b^2), so in the case where d'_a = d'_b (which happens when the covariance matrices are equal, maybe more generally), d' = d'_a + d'_b.
So d'_a + d'_b is not in general the same as d', and may be a better approximation; this will have to be checked against ground truth (a sketch follows).
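A minimal sketch of the proposed quantity, with illustrative parameters: Mahalanobis distances from the midpoint of the means to each distribution, each under its own covariance.

```matlab
mu_a = [0; 0]; Sig_a = eye(2);
mu_b = [2; 1]; Sig_b = [2 0.5; 0.5 1];
m = (mu_a + mu_b) / 2;                          % midpoint between the means
d_a = sqrt((m - mu_a)' / Sig_a * (m - mu_a));   % Mahalanobis distance to a
d_b = sqrt((m - mu_b)' / Sig_b * (m - mu_b));   % Mahalanobis distance to b
d_approx = d_a + d_b;                           % proposed approximate d'
```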

function to optimize boundary based on data

Define a function with a vector input.
This function will have a variable number of arguments depending on the dimension (it may still be possible to optimize it); infer the dimension accordingly.
Parse the vector into a struct of boundary coefficients, according to the inferred dimension (see the sketch below).
Call d' with the data input and d'_data output, using these boundary coefficients.
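A minimal sketch of the parsing step. A quadratic boundary in d dimensions has d(d+1)/2 + d + 1 free coefficients (symmetric matrix A, linear b, constant c), so the vector length determines d; the struct field names here are hypothetical, not the toolbox's.

```matlab
function bd = vec_to_bd_coeffs(v)
% infer dimension d from numel(v) = d(d+1)/2 + d + 1
d = round((-3 + sqrt(9 + 8*(numel(v) - 1))) / 2);
nA = d*(d+1)/2;
A = zeros(d);
A(triu(true(d))) = v(1:nA);       % fill upper triangle (incl. diagonal)
bd.a2 = A + triu(A, 1).';         % symmetrize
bd.a1 = v(nA+1 : nA+d);           % linear coefficients
bd.a0 = v(end);                   % constant
end
```

For example, vec_to_bd_coeffs([2 1 3 4 5 6]) gives a2 = [2 1; 1 3], a1 = [4 5], a0 = 6; a boundary packed this way can be handed straight to fminsearch.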

norm_val_mat is conditional

Just like norm_err_mat.
For the readme: val_mat_norm(i,j) is the expected value of classifying as j, given that the true class was i.

input: data, output: d'_gaussian

  1. Fit Gaussians to the data.
  2. Return d' between the Gaussians.
  3. If a custom boundary is supplied, return d' between the fitted Gaussians with the custom boundary. If boundary coefficients are requested, return the optimal boundary between the fitted Gaussians.

Remove optimal-case results

Don't return separate optimal-case results (e.g. d') if the supplied situation is non-optimal. If the user wants the optimal case, they can simply omit the optional inputs: custom priors, outcome values, and custom boundary.
This will simplify the code.

Colour-divided boundary

In a classification scenario, while plotting the boundary for any pair of classes (say red and blue), use red points for the boundary points evaluated from the red distribution and blue points for those evaluated from the blue. The boundary will then be a mixture of red and blue.
The boundary between blue and green will similarly be a mixture of blue and green points.
This may help visualize the multi-class case.
Drawback: if the points are dense, only the ones plotted later will show.

input custom boundary

This is the current case, with input: Gaussians and output: d'_gaussian.
Return d' between the Gaussians with the custom boundary. If boundary coefficients are also requested, return the optimal coefficients.

input: data, output: d'_data

  1. Fit Gaussians to the data.
  2. Find the optimal boundary between the Gaussians.
  3. Optimize this boundary based on how much of the data lies in the opposite regions, by optimizing #17 . Return this d' and this boundary.
  4. Try to make the same algorithm work in 1D, instead of giving 1D its own criterion-search algorithm.
  5. If a custom boundary is specified, skip the above and just return d' based on how much of the data lies on opposite sides of the custom boundary.

Usage comments in code

Once the code is finalized, add basic usage (inputs/outputs/example), or a link to the GitHub readme, in the code comments.

New changes

For the gen_chi_sq_cdf MATLAB code,

  • Understand the best beta to use, and the best truncation error to return.

  • Comment with expected inputs, outputs, and example command.

  • Add to the MATLAB File Exchange (single file upload) once it's polished and the dust settles.

For the classify library,

  • Break the library into two main commands. New command: integrate the multinormal with a given mu and vcov over a given region (called by the 'classify' command).
  • This could produce its own plot, with the error ellipsoid and the boundary. So it might be good to write a separate plotting script that returns this plot for the 'integrate' command and the current plots for the 'classify' command.

The LLR chi-square method of computing error volumes is faster, more accurate, and works in any number of dimensions.

  • Push this result to the top of the output struct. Maybe even don't return our method's result unless a custom boundary is supplied.

So why use our method? One advantage is that it works for custom boundaries. But the LLR method can also handle some custom boundaries: one trivial way is to change the LLR level, which defines a 1-parameter family of boundaries that are isoclines of the LLR. Another way may be to change the 'ellipsoid' region of integration; Ruben's original paper mentions this.

  • Check this, and understand how wide a family of regions the LLR method can handle. Read Ruben parts I and II as well. (Also take a look at Can's book.)

Several regions are surveyed in Ruben part I. The non-central ellipsoid is of interest to us; the simplex and 'flats' are not.

  • Cite Ruben part I in this context.

The author of Can's book provides a MATLAB function mvnlps to integrate the MVN over ellipsoidal (positive semi-definite A) regions.

  • Cite this book with proper context.

  • Page 15 of Can's book says that hyperboloids, corresponding to some negative diagonal terms (an indefinite symmetric matrix), cannot simply be computed with Ruben's method. Look into this. Do such cases arise in the classification problem?

Yes, such cases arise, e.g. vcov = 0.3*[4 1 2; 1 4 0; 2 0 4]. Can's book mentions simulation methods for such cases, but doesn't explicitly say which implementation handles them.

Actually, it seems that none of the code handles hyperboloidal regions.

  • Make up such a case and check if this method returns the same answer as our method (probably won't).

  • Implement these custom boundary chi-square integrals in the library.

I'm pretty sure this will still leave out fully arbitrary custom boundaries (not an ellipsoid or quadratic), which our code can still handle.

  • For our code to be more useful than the LLR method, there should be a way to supply a custom function that returns a radius for each point of the polar grid (see the sketch below).

  • Pull the grid resolution (number of points in the Fibonacci sphere) into the function arguments.
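A minimal sketch of both points: generate a Fibonacci sphere of n directions (n exposed as an argument), and accept a custom handle that returns a radius for each direction. The handle shown is an illustrative unit sphere, not a toolbox API.

```matlab
n = 500;                                  % grid resolution, now an argument
i = (0:n-1)' + 0.5;
phi = acos(1 - 2*i/n);                    % polar angles, uniform in cos(phi)
theta = pi * (3 - sqrt(5)) * i;           % golden-angle steps in azimuth
dirs = [sin(phi).*cos(theta), sin(phi).*sin(theta), cos(phi)];
rad_fun = @(d) ones(size(d,1), 1);        % custom boundary: radius per direction
r = rad_fun(dirs);                        % boundary radii on the grid
```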

Change in r_sign

Previously, r_sign = sign(r * dq/dr) indicated whether we are entering or exiting the region in the radially outward direction.
When r = 0 is a root, this is 0 and loses the information of which direction the region lies in.
Instead, the code now uses r_sign = sign(dq/dr). The grid integration algorithm then computes int_sign = sign(r)*r_sign as before, except that when r = 0, int_sign = r_sign.
Make the corresponding changes in the manuscript text.
The algorithm will then use the Heaviside theta with the half-maximum convention, i.e. H(0) = 0.5.
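A minimal sketch of the updated sign logic, with illustrative values for a root r of the boundary function q and the derivative dq/dr there.

```matlab
r = 0; dqdr = 2;                 % illustrative root at the origin
r_sign = sign(dqdr);             % was sign(r*dqdr), which vanishes at r = 0
if r == 0
    int_sign = r_sign;           % keeps the direction information at r = 0
else
    int_sign = sign(r) * r_sign;
end
```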

Exponent notation for Chernoff error bound

For low overlap, the Chernoff error bound (e.g. exp(-991)) may evaluate to 0, even though the exponent itself is computable. Find a way to represent it as 1E-991, instead of computing 10^(-991), which underflows to 0.
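A minimal sketch: carry the bound in log space and split it into a decimal mantissa and exponent for display. The log value here is illustrative.

```matlab
log_bound = -991;                    % natural log of the Chernoff bound
l10 = log_bound / log(10);           % base-10 log of the bound
expo = floor(l10);
mant = 10^(l10 - expo);              % mantissa in [1, 10)
fprintf('%.3fe%+d\n', mant, expo);   % prints 4.113e-431, with no underflow
```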

Include outcome losses

Include losses for the 4 possible outcomes. What are the boundary and the results when the overall Bayes risk is minimized? See Duda section 2.2.1.

This has applications to psychophysics with different outcome losses, either explicit (games with different monetary rewards attached to different outcomes) or implicit (a monkey given either fruit juice or an electric shock), and to economic theory, etc.

The correct way to go about this seems to be to re-derive a fuller decision-surface equation that incorporates the 4 loss arguments. The current calculations then become the default case with ±1 losses for correct and wrong outcomes.
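For reference, the standard two-class minimum-risk rule from Duda section 2.2.1, with \lambda_{ij} the loss of deciding class i when the truth is j; the ±1 losses recover the current boundary:

```latex
\text{decide } a \iff (\lambda_{ba} - \lambda_{aa})\, p_a f_a(x) > (\lambda_{ab} - \lambda_{bb})\, p_b f_b(x)
```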

Monte Carlo for last resort

For many dimensions and non-quadratic regions, add a Monte Carlo integration method that produces some answer. Users can keep raising the number of samples and check whether their desired measures converge; a sketch follows.
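A minimal sketch of the fallback, with an illustrative region (the unit ball): sample the multinormal, count the fraction of samples inside, and report a standard error so the user can judge convergence.

```matlab
mu = [0 0 0]; Sigma = eye(3);
n = 1e6;
X = mvnrnd(mu, Sigma, n);            % samples from the multinormal
inside = sum(X.^2, 2) < 1;           % region test, here the unit ball
p = mean(inside);                    % Monte Carlo estimate of the mass
se = sqrt(p*(1 - p)/n);              % standard error, shrinks as 1/sqrt(n)
```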

Nearest ellipsoid boundary for many-D

Bill's idea.
Ruben's method only works for ellipsoidal boundaries. If, in many dimensions, the boundary is hyperboloidal, we can instead construct an ellipsoid that matches it as well as possible, especially where the Gaussians meet.
Ideas: make the ellipsoid cross the line joining the means at the same point as the true boundary; along each axis, the ellipsoid can cross at the best 1D crossing point.
This is an approximate method anyway, so first find out whether a hyperboloid can be integrated at all.

Approximate d'

Set each covariance matrix to the average of the two, then calculate the Mahalanobis distance between the distributions (formula below). In 1D, this should be the same as the current approximation.
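```latex
d' = \sqrt{(\mu_b - \mu_a)^\top \left[\tfrac{1}{2}(\Sigma_a + \Sigma_b)\right]^{-1} (\mu_b - \mu_a)}
```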

1D classification

Try to use the existing method/code maximally for the 1D case. Also produce plots.

Make val_obs also return err_obs

That is, the complement of the accuracy. Then change the returned errmat_obs and err_obs to use these more accurate errors, rather than 1 minus the accuracy.

Rename 'whiten' to 'standardize'

Since we're converting the normal to a standard normal, and 'whiten' is a more niche term. But we can explain in the manuscript that this corresponds to whitening.
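A minimal sketch of the transform in question, via a Cholesky factor: it maps N(mu, Sigma) samples to (approximately) standard-normal samples.

```matlab
mu = [1; 2]; Sigma = [2 0.5; 0.5 1];
x = mvnrnd(mu', Sigma, 1000)';       % 2 x 1000 samples from N(mu, Sigma)
L = chol(Sigma, 'lower');            % Sigma = L*L'
z = L \ (x - mu);                    % standardized: cov(z') is ~ identity
```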

Classifying samples

Currently, the Gaussian fitted to the data input has the same mean and vcov as the data. But Bill mentioned that one often uses the maximum-likelihood fit. Calculate this and see if it has the same mean and vcov as the data.

Resource: https://people.eecs.berkeley.edu/~jrs/189/lec/07.pdf

According to this resource and Wikipedia, the ML Gaussian parameters are the sample prior, the sample mean, and the sample covariance.
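A minimal sketch of the comparison, with illustrative data. One detail: the ML covariance normalizes by N, whereas MATLAB's default cov normalizes by N-1, so the ML fit differs slightly from the default sample vcov.

```matlab
X = mvnrnd([0 0], [2 0.5; 0.5 1], 500);  % illustrative data
mu_ml  = mean(X);                        % ML estimate of the mean
Sig_ml = cov(X, 1);                      % ML estimate: divides by N
Sig_ub = cov(X);                         % default sample vcov: divides by N-1
```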

Also, Gupta and Logan (https://www.tandfonline.com/doi/pdf/10.1080/00949659008811211?needAccess=true) mention the discriminant statistic to use when you only have samples.
