goxmeans's People

Contributors

afoglia, bobhancock, danielhfrank

goxmeans's Issues

Likelihood computation incorrect?

Here you use per-cluster variances:

goxmeans/km.go

Lines 788 to 791 in a78e909

if c[i].Variance == 0 {
    c[i].Variance = math.Nextafter(0, 1)
}
t3 := ((fRn * float64(c[i].Dim())) / 2) * math.Log((2*math.Pi)*c[i].Variance)

However, both the x-means paper and your BIC_notes.pdf assume that all clusters share the same variance. Using a shared variance should also resolve issue #15 (clusters with a variance of zero), as long as the data set is not entirely constant.
The code no longer appears to reflect BIC_notes (cf. also #26, as there appears to be a typo in there).

TestRandCentroids fails

./km_test.go:85: cannot use DataCentroids literal (type DataCentroids) as type CentroidChooser in array element:
DataCentroids does not implement CentroidChooser (wrong type for ChooseCentroids method)
have ChooseCentroids(*matrix.DenseMatrix, int) (*matrix.DenseMatrix, error)
want ChooseCentroids(*matrix.DenseMatrix, int) *matrix.DenseMatrix

Help on calculating mean for variance()

Assuming R^2 for the example coordinates (x, y).

Points   [ 3, 4
           5, 6 ]

Centroid [ 4, 5 ]

Is the mean of x = [(4 - 3) + (4 - 5)] / 2?

Or do we need to square it, as in [(4 - 3)^2 + (4 - 5)^2] / 2, and then take the square root?
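
For reference, a worked version of this example, assuming the estimate with the R - K denominator discussed in the next issue (R points, K centroids). The signed differences cancel to zero, so the squared Euclidean distances are summed instead, and no square root is taken because the quantity is a variance rather than a standard deviation:

```latex
\hat{\sigma}^2
  = \frac{1}{R-K}\sum_{i=1}^{R}\lVert x_i - \mu_{(i)}\rVert^2
  = \frac{\bigl[(3-4)^2+(4-5)^2\bigr]+\bigl[(5-4)^2+(6-5)^2\bigr]}{2-1}
  = \frac{2+2}{1}
  = 4
```

Here R = 2 and K = 1; dividing by R instead of R - K would give 2.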

Variance when centroids and points are the same

When the centroids and the points are exactly the same, the denominator R - K is 0 and the result is NaN.

Is it valid to just return 0 as the variance if the two sets of points are the same?

How to handle clusters with a variance of zero.

I'm expanding the tests for bisection and ran into a case where, when the cluster is bisected, one of the resulting clusters has only one point. Hence, the point functions as both the centroid and the data point, and the variance is zero.

The problem occurs when attempting to calculate the log likelihood. log(2pi * variance) evaluates to -Inf, which gives a log likelihood of +Inf. This will drive the BIC to +Inf regardless of what the other clusters in the model may contain.

Since the +Inf term would drive the BIC to the exclusion of anything else, you do not obtain a valid score for the model. Should models with a variance of zero be excluded from the log likelihood calculation?

Additional centroid init functions

‘gaussrandom’: generate k centroids from a Gaussian with mean and variance estimated from the data.
‘uniform’: generate k observations from a uniform distribution defined by the data set.

Centroid improvement function on bisection

Page 3 of Pelleg-Moore: ...splitting each centroid into two children. They are moved a distance proportional to the size of the region in opposite directions along a randomly chosen vector.
