abhranildas / integrate-and-classify-normal-distributions

MATLAB toolbox to integrate normal (Gaussian) distributions of any dimension, with any parameters, over any domain; to compute the pdf/cdf/inverse cdf of any function of a normal vector; and to compute measures of classification performance among two or more multinormals, such as the error matrix and d'.



integrate-and-classify-normal-distributions's Issues

Change grid integration algorithm to handle a normal on multiple boundaries

Currently, in opt_bd_multi, if the origin (the normal mean) lies on multiple boundaries, it throws an error.
The reason is the following: when the mean is on a single boundary during regular integration, the starting mass is 0.5, and in each direction we add and subtract masses depending on the sign. This works out because exactly a half-plane is being left out.
However, when the point is on multiple boundaries, more than a half-plane is left out, and it's complicated to calculate how much (in different dimensions too) and compensate the starting mass accordingly.
Instead, the grid integration algorithm can be changed so that in each direction, only the correct integral over the correct intervals is taken, removing the need for a starting mass; see the sketch below.
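A minimal sketch of that per-ray idea, with a hypothetical helper name (not the toolbox API): given the boundary crossings along one ray, integrate the standard-normal mass only over the intervals that lie inside the domain, so no starting mass is needed.

```matlab
function mass = ray_mass(q, roots_r)
% q: handle giving the boundary function's value at radius r along this ray
% roots_r: real roots of q along the ray (boundary crossings)
pts = sort(roots_r(:)).';
edges = [-10, pts(abs(pts) < 10), 10];   % normal mass beyond |r|=10 is negligible
mass = 0;
for k = 1:numel(edges)-1
    if q(mean(edges(k:k+1))) > 0         % midpoint test: is this interval inside?
        mass = mass + normcdf(edges(k+1)) - normcdf(edges(k));
    end
end
end
```

For example, ray_mass(@(r) 1 - r.^2, [-1 1]) returns normcdf(1) - normcdf(-1), about 0.68.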

`bd_pts_gauss_opt` isn't optimal

The `bd_pts_gauss_opt` returned in the results seems not to be the optimal boundary, but rather the one corresponding to the provided priors, outcome values, and custom boundary.

Ask for citation

Ask users to cite the VSS conference meeting abstract (once it is published in September).

In every script, write comments and a preamble with author, link, and credit, as in Steve's code.

Degenerate cases

If we silence one of the cues, i.e. set its mean and SD to 0, matrix inversions etc. may fail because of the zeros in that dimension. A solution still exists: see the 'Degenerate case' section of the Wikipedia page on the multivariate normal distribution.
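A minimal sketch of that degenerate-case treatment, assuming the Wikipedia formulation: evaluate the density only on the support of the covariance matrix, using the pseudo-determinant instead of det.

```matlab
Sigma = [1 0; 0 0];                  % second cue silenced: zero variance
mu = [0; 0]; x = [0.5; 0];
[U, D] = eig(Sigma);
d = diag(D); keep = d > 1e-12;       % support (non-degenerate directions)
k = nnz(keep);                       % effective rank
pdet = prod(d(keep));                % pseudo-determinant
z = U(:, keep)' * (x - mu);          % coordinates on the support
pdf_val = exp(-(z' * diag(1./d(keep)) * z) / 2) / sqrt((2*pi)^k * pdet);
```

Here pdf_val matches the 1D normal pdf at 0.5, since the silenced dimension carries no mass.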

Avoid using rotm2eul

This function, which extracts Euler angles from a rotation matrix, requires the Robotics System Toolbox. Do it manually instead; search for 'rotation matrix to yaw pitch roll'.
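A minimal sketch of the standard manual extraction for the ZYX (yaw-pitch-roll) convention, assuming a proper rotation matrix with pitch away from ±90 degrees (gimbal lock needs a special case):

```matlab
R = [0 -1 0; 1 0 0; 0 0 1];      % example: 90-degree yaw
yaw   = atan2(R(2,1), R(1,1));   % rotation about z
pitch = asin(-R(3,1));           % rotation about y
roll  = atan2(R(3,2), R(3,3));   % rotation about x
```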

Optional inputs and outputs.

Make the codebase as fast as possible by making expensive outputs optional, so they are not computed unless asked for.
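A minimal sketch of the usual MATLAB pattern for this, using nargout; the function and its outputs are hypothetical.

```matlab
function [cheap, expensive] = compute_stuff(x)
cheap = sum(x);                      % always computed
if nargout > 1                       % skipped unless the caller asks for it
    expensive = svd(x(:) * x(:)');
end
end
```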

Pick one of \Phi or \erf

The paper uses \Phi for the 1D Gaussian integral and \erf for the 3D Gaussian integral. Choose one; they are interchangeable via the identity below.
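```latex
\Phi(x) = \tfrac{1}{2}\left[1 + \operatorname{erf}\left(\tfrac{x}{\sqrt{2}}\right)\right]
```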

Double boundary with custom boundary

When a custom boundary is supplied, the drawn boundary will often be double: the boundary computed with dist a and the one computed with dist b (just the negated coefficients) won't exactly line up, unlike when no custom boundary is supplied and the optimal boundary is computed with dist a and with dist b (which ought also to be just a negation).

Tighter error bound

I tried to implement Ruben's choice of a different beta and a tighter error bound when b=0. This is in gen_chi_sq_cdf2. However,

  1. In some cases, e.g. gen_chi_sq_cdf(50,[1 5 2],[1 2 3],[0 0 0],10), the tighter error is actually much bigger, although it shrinks more quickly by iter=100. So it's not always better to use this, even when b=0.

  2. For large K's, the modified error bound is a ratio of large numbers (gamma function on top, factorial at the bottom) and returns NaN. It may be possible to re-express this, using gamma-function properties, in a computable form (see the sketch below).

In any case, this is low priority, because the number of terms can be set high enough to make the regular error bound tiny, and the user will not really care about a tighter bound.
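A minimal sketch of the re-expression in point 2: evaluate the ratio in log space with gammaln, so large K's don't overflow to NaN. The argument values here are illustrative.

```matlab
a = 300.5; k = 250;                             % illustrative large arguments
ratio_naive = gamma(a) / factorial(k);          % Inf/Inf -> NaN
ratio_log   = exp(gammaln(a) - gammaln(k + 1)); % factorial(k) = gamma(k+1); finite
```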

input data

Measure distribution parameters from data. To tell whether the input is Gaussian parameters or data, the input parsing may have to be done first.

Input and output polynomial coeffs of boundary

It would be nice to be able to enter and read out boundary coefficients as the coefficients of the quadratic boundary polynomial, i.e. the coefficients of 1, x, x^2, y, y^2, xy, z, z^2, zx, zy.

Besides the convenience, the matrix coefficients have redundant elements, which polynomial coefficients avoid.
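A minimal sketch of the 2D conversion, assuming the boundary is stored in matrix form q(x) = x'Ax + b'x + c: the symmetric A holds the xy term twice, while the polynomial form keeps it once.

```matlab
A = [2 1; 1 3]; b = [4; 5]; c = 6;             % illustrative matrix form
poly = [c, b(1), b(2), A(1,1), A(1,2) + A(2,1), A(2,2)];
% poly = [6 4 5 2 2 3]: coefficients of 1, x, y, x^2, xy, y^2
```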

Change opt_bd_multi to speed up.

Currently written with a cellfun over opt_bd_multi_pt, which calls opt_bd. This sends individual points to opt_bd one at a time, not leveraging its ability to handle many points simultaneously, so it runs slower than the previous code.
Go back to the previous version (commented below), but incorporate the changes that have happened since.

Measures of the accuracy of our method

  • Test both the Imhof method and the grid method against simple-geometry cases with known d'. Test our method against Imhof's for quadratic cases, and perhaps against a Monte Carlo method for other cases.

  • Compare ray vs. gx2 results across the quadratic toy examples.

Load angle grid only once.

Don't compute it every time d' is called. See if there's a way for a function to load something persistently; MATLAB's persistent variables should do this (sketch below).
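A minimal sketch using a persistent variable; the grid builder here is a stand-in for the real one.

```matlab
function grid = get_angle_grid(n)
persistent cached_grid
if isempty(cached_grid)
    cached_grid = linspace(0, pi, n);   % built once, on the first call
end
grid = cached_grid;                      % reused on every later call
end
```

(A real version should also rebuild the cache if n changes.)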

Better approximate d'?

The current approximate d' is the Mahalanobis distance between the distributions, assuming that both have the average covariance matrix.
A better approximation might be found by taking the boundary point on the line joining the two means, and adding its Mahalanobis distances to the two distributions.
However, in some cases (e.g. same mean, different variances), there may be no boundary point between the means. The current approximate d' still works there, but this new one won't.
So instead, consider the midpoint between the means, and add its Mahalanobis distances d'_a and d'_b to the two distributions. You can check that the current approximate d' satisfies d'^2 = 2(d'_a^2 + d'_b^2), so in the case where d'_a = d'_b (which happens when the covariance matrices are equal, maybe more generally), d' = d'_a + d'_b.
So d'_a + d'_b is not in general the same as d', and may be a better approximation; this will have to be checked against ground truth (a sketch follows).
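A minimal sketch of the proposed quantity, with illustrative parameters: Mahalanobis distances from the midpoint of the means to each distribution, each under its own covariance.

```matlab
mu_a = [0; 0]; Sig_a = eye(2);
mu_b = [2; 1]; Sig_b = [2 0.5; 0.5 1];
m = (mu_a + mu_b) / 2;                          % midpoint between the means
d_a = sqrt((m - mu_a)' / Sig_a * (m - mu_a));   % Mahalanobis distance to a
d_b = sqrt((m - mu_b)' / Sig_b * (m - mu_b));   % Mahalanobis distance to b
d_approx = d_a + d_b;                           % proposed approximate d'
```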

function to optimize boundary based on data

Define a function with a vector input.
This function will have a variable number of arguments depending on the dimension (it may still be possible to optimize it); infer the dimension accordingly.
Parse the vector into a struct of boundary coefficients, according to the inferred dimension (see the sketch below).
Call d' with the data input and d'_data output, using these boundary coefficients.
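A minimal sketch of the parsing step. A quadratic boundary in d dimensions has d(d+1)/2 + d + 1 free coefficients (symmetric matrix A, linear b, constant c), so the vector length determines d; the struct field names here are hypothetical, not the toolbox's.

```matlab
function bd = vec_to_bd_coeffs(v)
% infer dimension d from numel(v) = d(d+1)/2 + d + 1
d = round((-3 + sqrt(9 + 8*(numel(v) - 1))) / 2);
nA = d*(d+1)/2;
A = zeros(d);
A(triu(true(d))) = v(1:nA);       % fill upper triangle (incl. diagonal)
bd.a2 = A + triu(A, 1).';         % symmetrize
bd.a1 = v(nA+1 : nA+d);           % linear coefficients
bd.a0 = v(end);                   % constant
end
```

For example, vec_to_bd_coeffs([2 1 3 4 5 6]) gives a2 = [2 1; 1 3], a1 = [4 5], a0 = 6; a boundary packed this way can be handed straight to fminsearch.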

norm_val_mat is conditional

Just like norm_err_mat.
For the readme: val_mat_norm(i,j) is the expected value of classifying as j, given that the true class was i.

input: data, output: d'_gaussian

  1. Fit Gaussians to the data.
  2. Return d' between the Gaussians.
  3. If a custom boundary is supplied, return d' between the fitted Gaussians with the custom boundary. If boundary coefficients are requested, return the optimal boundary between the fitted Gaussians.

Remove optimal-case results

Don't return separate optimal-case results (e.g. d') if the supplied situation is non-optimal. If the user wants the optimal case, they can simply omit the optional inputs: custom priors, outcome values, and custom boundary.
This will simplify the code.

Colour-divided boundary

In a classification scenario, while plotting the boundary for any pair of classes (say red and blue), use red points for the boundary points evaluated from the red distribution and blue points for those evaluated from the blue. The boundary will then be a mixture of red and blue.
The boundary between blue and green will similarly be a mixture of blue and green points.
This may help visualize the multi-class case.
Drawback: if the points are dense, only the ones plotted later will show.

input custom boundary

This is the current case, with input: Gaussians and output: d'_gaussian.
Return d' between the Gaussians with the custom boundary. If boundary coefficients are also requested, return the optimal coefficients.

input: data, output: d'_data

  1. Fit Gaussians to the data.
  2. Find the optimal boundary between the Gaussians.
  3. Optimize this boundary based on how much of the data lies in the opposite regions, by optimizing #17 . Return this d' and this boundary.
  4. Try to make the same algorithm work in 1D, instead of giving 1D its own criterion-search algorithm.
  5. If a custom boundary is specified, skip the above and just return d' based on how much of the data lies on opposite sides of the custom boundary.

Usage comments in code

Once the code is finalized, add basic usage (inputs/outputs/example), or a link to the GitHub readme, in the code comments.

New changes

For the gen_chi_sq_cdf MATLAB code,

  • Understand the best beta to use, and the best truncation error to return.

  • Comment with expected inputs, outputs, and example command.

  • Add to the MATLAB File Exchange (single file upload) once it's polished and the dust settles.

For the classify library,

  • Break the library into two main commands. New command: integrate the multinormal with a given mu and vcov over a given region (called by the 'classify' command).
  • This could produce its own plot, with the error ellipsoid and the boundary. So it might be good to write a separate plotting script that returns this plot for the 'integrate' command and the current plots for the 'classify' command.

The LLR chi-square method of computing error volumes is faster, more accurate, and works in any number of dimensions.

  • Push this result to the top of the output struct. Maybe even don't return our method's result unless a custom boundary is supplied.

So why use our method? One advantage is that it works for custom boundaries. But the LLR method can also handle some custom boundaries: one trivial way is to change the LLR level, which defines a 1-parameter family of boundaries that are isoclines of the LLR. Another way may be to change the 'ellipsoid' region of integration; Ruben's original paper mentions this.

  • Check this, and understand how wide a family of regions the LLR method can handle. Read Ruben parts I and II as well. (Also take a look at Can's book.)

Several regions are surveyed in Ruben part I. The non-central ellipsoid is of interest to us; the simplex and 'flats' are not.

  • Cite Ruben part I in this context.

The author of Can's book provides a MATLAB function mvnlps to integrate the MVN over ellipsoidal (positive semi-definite A) regions.

  • Cite this book with proper context.

  • Page 15 of Can's book says that hyperboloids, corresponding to some negative diagonal terms (an indefinite symmetric matrix), cannot simply be computed with Ruben's method. Look into this. Do such cases arise in the classification problem?

Yes, such cases arise, e.g. vcov = 0.3*[4 1 2; 1 4 0; 2 0 4]. Can's book mentions simulation methods for such cases, but doesn't explicitly say which implementation handles them.

Actually, it seems that none of the code handles hyperboloidal regions.

  • Make up such a case and check if this method returns the same answer as our method (probably won't).

  • Implement these custom boundary chi-square integrals in the library.

I'm pretty sure this will still leave out fully arbitrary custom boundaries (not an ellipsoid or quadratic), which our code can still handle.

  • For our code to be more useful than the LLR method, there should be a way to supply a custom function that returns a radius for each point of the polar grid (see the sketch below).

  • Pull the grid resolution (number of points in the Fibonacci sphere) into the function arguments.
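A minimal sketch of both points: generate a Fibonacci sphere of n directions (n exposed as an argument), and accept a custom handle that returns a radius for each direction. The handle shown is an illustrative unit sphere, not a toolbox API.

```matlab
n = 500;                                  % grid resolution, now an argument
i = (0:n-1)' + 0.5;
phi = acos(1 - 2*i/n);                    % polar angles, uniform in cos(phi)
theta = pi * (3 - sqrt(5)) * i;           % golden-angle steps in azimuth
dirs = [sin(phi).*cos(theta), sin(phi).*sin(theta), cos(phi)];
rad_fun = @(d) ones(size(d,1), 1);        % custom boundary: radius per direction
r = rad_fun(dirs);                        % boundary radii on the grid
```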

Change in r_sign

Previously, r_sign = sign(r * dq/dr) indicated whether we are entering or exiting the region in the radially outward direction.
When r = 0 is a root, this is 0 and loses the information of which direction the region lies in.
Instead, the code now uses r_sign = sign(dq/dr). The grid integration algorithm then computes int_sign = sign(r)*r_sign as before, except that when r = 0, int_sign = r_sign.
Make the corresponding changes in the manuscript text.
The algorithm will then use the Heaviside theta with the half-maximum convention, i.e. H(0) = 0.5.
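A minimal sketch of the updated sign logic, with illustrative values for a root r of the boundary function q and the derivative dq/dr there.

```matlab
r = 0; dqdr = 2;                 % illustrative root at the origin
r_sign = sign(dqdr);             % was sign(r*dqdr), which vanishes at r = 0
if r == 0
    int_sign = r_sign;           % keeps the direction information at r = 0
else
    int_sign = sign(r) * r_sign;
end
```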

Exponent notation for Chernoff error bound

For low overlap, the Chernoff error bound (e.g. exp(-991)) may evaluate to 0, even though the exponent itself is computable. Find a way to represent it as 1E-991, instead of computing 10^(-991), which underflows to 0.
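A minimal sketch: carry the bound in log space and split it into a decimal mantissa and exponent for display. The log value here is illustrative.

```matlab
log_bound = -991;                    % natural log of the Chernoff bound
l10 = log_bound / log(10);           % base-10 log of the bound
expo = floor(l10);
mant = 10^(l10 - expo);              % mantissa in [1, 10)
fprintf('%.3fe%+d\n', mant, expo);   % prints 4.113e-431, with no underflow
```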

Include outcome losses

Include losses for the 4 possible outcomes. What are the boundary and the results when the overall Bayes risk is minimized? See Duda section 2.2.1.

This has applications to psychophysics with different outcome losses, either explicit (games with different monetary rewards attached to different outcomes) or implicit (a monkey given either fruit juice or an electric shock), and to economic theory, etc.

The correct way to go about this seems to be to re-derive a fuller decision-surface equation that incorporates the 4 loss arguments. The current calculations then become the default case with ±1 losses for correct and wrong outcomes.
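For reference, the standard two-class minimum-risk rule from Duda section 2.2.1, with \lambda_{ij} the loss of deciding class i when the truth is j; the ±1 losses recover the current boundary:

```latex
\text{decide } a \iff (\lambda_{ba} - \lambda_{aa})\, p_a f_a(x) > (\lambda_{ab} - \lambda_{bb})\, p_b f_b(x)
```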

Monte Carlo for last resort

For many dimensions and non-quadratic regions, add a Monte Carlo integration method that produces some answer. Users can keep raising the number of samples and check whether their desired measures converge; a sketch follows.
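A minimal sketch of the fallback, with an illustrative region (the unit ball): sample the multinormal, count the fraction of samples inside, and report a standard error so the user can judge convergence.

```matlab
mu = [0 0 0]; Sigma = eye(3);
n = 1e6;
X = mvnrnd(mu, Sigma, n);            % samples from the multinormal
inside = sum(X.^2, 2) < 1;           % region test, here the unit ball
p = mean(inside);                    % Monte Carlo estimate of the mass
se = sqrt(p*(1 - p)/n);              % standard error, shrinks as 1/sqrt(n)
```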

Nearest ellipsoid boundary for many-D

Bill's idea.
Ruben's method only works for ellipsoidal boundaries. If, in many dimensions, the boundary is hyperboloidal, we can instead construct an ellipsoid that matches it as well as possible, especially where the Gaussians meet.
Ideas: make the ellipsoid cross the line joining the means at the same point as the true boundary; along each axis, the ellipsoid can cross at the best 1D crossing point.
This is an approximate method anyway, so first find out whether a hyperboloid can be integrated at all.

Approximate d'

Set each covariance matrix to the average of the two, then calculate the Mahalanobis distance between the distributions (formula below). In 1D, this should be the same as the current approximation.
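```latex
d' = \sqrt{(\mu_b - \mu_a)^\top \left[\tfrac{1}{2}(\Sigma_a + \Sigma_b)\right]^{-1} (\mu_b - \mu_a)}
```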

1D classification

Try to use the existing method/code maximally for the 1D case. Also produce plots.

Make val_obs also return err_obs

That is, the complement of the accuracy. Then change the returned errmat_obs and err_obs to use these more accurate errors, rather than 1 minus the accuracy.

Rename 'whiten' to 'standardize'

Since we're converting the normal to a standard normal, and 'whiten' is a more niche term. But we can explain in the manuscript that this corresponds to whitening.
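A minimal sketch of the transform in question, via a Cholesky factor: it maps N(mu, Sigma) samples to (approximately) standard-normal samples.

```matlab
mu = [1; 2]; Sigma = [2 0.5; 0.5 1];
x = mvnrnd(mu', Sigma, 1000)';       % 2 x 1000 samples from N(mu, Sigma)
L = chol(Sigma, 'lower');            % Sigma = L*L'
z = L \ (x - mu);                    % standardized: cov(z') is ~ identity
```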

Classifying samples

Currently, the Gaussian fitted to the data input has the same mean and vcov as the data. But Bill mentioned that one often uses the maximum-likelihood fit. Calculate this and see if it has the same mean and vcov as the data.

Resource: https://people.eecs.berkeley.edu/~jrs/189/lec/07.pdf

According to this resource and Wikipedia, the ML Gaussian parameters are the sample prior, the sample mean, and the sample covariance.
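A minimal sketch of the comparison, with illustrative data. One detail: the ML covariance normalizes by N, whereas MATLAB's default cov normalizes by N-1, so the ML fit differs slightly from the default sample vcov.

```matlab
X = mvnrnd([0 0], [2 0.5; 0.5 1], 500);  % illustrative data
mu_ml  = mean(X);                        % ML estimate of the mean
Sig_ml = cov(X, 1);                      % ML estimate: divides by N
Sig_ub = cov(X);                         % default sample vcov: divides by N-1
```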

Also, Gupta and Logan (https://www.tandfonline.com/doi/pdf/10.1080/00949659008811211?needAccess=true) mention the discriminant statistic to use when you only have samples.
