BGMM refinement (poppunk issue, CLOSED)

bacpop commented on June 15, 2024
BGMM refinement

from poppunk.

Comments (14)

nickjcroucher commented on June 15, 2024

Seems to identify two separate within and between strain sets of links:

In here within label is 1 with mean [0.52112061 0.48266401] and between label is 2 with mean [0.64633422 0.49405153]

nickjcroucher commented on June 15, 2024

So the non-origin end of the scale is:

euclidean 0.1257303629256175

Hence the returned boundary likelihoods are:

origin end -0.028069695293248376 other end -0.13515692159251055

Looks like the centres of the two distributions are too close together?

johnlees commented on June 15, 2024

Yeah, I don't think it'll work with that as the initial fit. Refinement assumes two good cluster means for these components, whereas the fit above doesn't look like it has worked. So the first error will be because there's no root between the two means (you should be able to see this from the likelihood contours + decision boundary plot too).

You could try with K = 2 or 4 and see if you get something reasonable to start from, or use --manual-start with the means from the previous GPS fit/SPARC fit.

nickjcroucher commented on June 15, 2024

Trying K = 4 now; K = 2 gave a similar fit. I think the previous GPS fit was t-dist based, so would the means still be useful for starting?

johnlees commented on June 15, 2024

nickjcroucher commented on June 15, 2024

To "solve" this issue, I've been trying to look at whether there's a simple function that will identify a scaling factor that will ensure the two means have different signs:

def getBoundSize(model, start, end, within, between, origin, other):
    boundSize = 1
    sameSign = True
    while sameSign:
        origin_value = likelihoodBoundary(origin, model, start, end, within, between, boundSize)
        other_value = likelihoodBoundary(other, model, start, end, within, between, boundSize)
        if origin_value < 0 and other_value > 0:
            sameSign = False
        else:
            boundSize = boundSize + 1
    return boundSize

Followed by:

def likelihoodBoundary(s, model, start, end, within, between, boundSize):
    """Wrapper function around :func:`~PopPUNK.bgmm.fit2dMultiGaussian` so that it can
    go into a root-finding function for probabilities between components

    Args:
        s (float)
            Distance along line from mean0
        model (BGMMFit)
            Fitted mixture model
        start (numpy.array)
            The co-ordinates of the centre of the within-strain distribution
        end (numpy.array)
            The co-ordinates of the centre of the between-strain distribution
        within (int)
            Label of the within-strain distribution
        between (int)
            Label of the between-strain distribution
    Returns:
        responsibility (float)
            The difference between responsibilities of assignment to the within component
            and the between assignment
    """
    X = transformLine(s, start, end).reshape(1*boundSize, -1*boundSize)
    responsibilities = model.assign(X, values = True)
    return(responsibilities[0, within] - responsibilities[0, between])

But it just throws:

X = transformLine(s, start, end).reshape(1*boundSize, -1*boundSize)

ValueError: cannot reshape array of size 2 into shape (3,newaxis)

It's probably not a good idea anyway, but I suppose we should either (a) throw a more comprehensible error or (b) use this as a criterion that has to be met for the fitting to be considered valid?
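
For readers following along, the ValueError comes from NumPy's reshape rules rather than from the model: multiplying the reshape arguments by boundSize changes the requested array shape, not the position along the line. A minimal sketch of the failure, assuming only that the point being reshaped is a 2-element array (the values are the within-strain mean quoted earlier in this thread, used purely as a stand-in):

```python
import numpy as np

# Stand-in for the 2-element coordinate produced for one position on the line.
point = np.array([0.52112061, 0.48266401])

# reshape(1, -1) asks for one row and lets NumPy infer the number of columns.
row = point.reshape(1, -1)
print(row.shape)  # (1, 2)

# Asking for 3 rows from 2 elements cannot succeed, whatever the inferred
# dimension is, because 2 is not divisible by 3.
try:
    point.reshape(3, -1)
    reshape_error = None
except ValueError as err:
    reshape_error = str(err)
print(reshape_error)
```

If the intent is to evaluate the likelihood difference further along the line, the quantity to scale would presumably be the distance `s` rather than the reshape arguments, though that is an assumption about the intent rather than something stated in the thread.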

johnlees commented on June 15, 2024

I think in some cases when the initial fit is just crap (as here) there's not much that can be used to salvage it. Eschewing the mixture model and using the KDE estimate (or DBSCAN, if that gives something better) I think is the way to go.

Doing a check before doing the minimum search is definitely a good idea - I'll add this in now
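
A check of this kind could look roughly like the sketch below. It is not PopPUNK's implementation: `has_sign_change`, `find_boundary`, `bad_fit` and `good_fit` are hypothetical names, and the toy functions simply interpolate linearly between endpoint values. `bad_fit` uses the two boundary values and the line length reported earlier in this thread, both of which are negative, so no root is bracketed.

```python
def has_sign_change(f, lo, hi):
    """True if f(lo) and f(hi) have strictly opposite signs (a root is bracketed)."""
    return f(lo) * f(hi) < 0


def find_boundary(f, lo, hi, tol=1e-9, max_iter=200):
    """Bisection on [lo, hi]; raises a readable error if no root is bracketed."""
    if not has_sign_change(f, lo, hi):
        raise ValueError(
            "No decision boundary between the component means: "
            f"f({lo}) = {f(lo):.4f} and f({hi}) = {f(hi):.4f} have the same sign. "
            "The initial fit is probably poor."
        )
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0 or (hi - lo) < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid              # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return 0.5 * (lo + hi)


# Endpoint values and line length reported in this thread.
ORIGIN_VALUE = -0.028069695293248376
OTHER_VALUE = -0.13515692159251055
LINE_LENGTH = 0.1257303629256175


def bad_fit(s):
    # Toy linear stand-in for the likelihood difference along the line.
    return ORIGIN_VALUE + (OTHER_VALUE - ORIGIN_VALUE) * s / LINE_LENGTH


def good_fit(s):
    # Hypothetical well-separated fit: opposite signs at the two ends.
    return 0.5 - s / LINE_LENGTH


print(has_sign_change(bad_fit, 0.0, LINE_LENGTH))   # False
print(has_sign_change(good_fit, 0.0, LINE_LENGTH))  # True
```

Failing early with a message like this would replace the opaque root-finder error with one that points at the poor initial fit.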

johnlees commented on June 15, 2024

Did we have a decent BGMM fit from the previous iteration of GPS data, which we ran refinement on to produce the last set of clusters we sent to Becca and Steph?

nickjcroucher commented on June 15, 2024

The last good fit I remember was the t-distribution for which we provided priors - not sure about the BGMM fits.

johnlees commented on June 15, 2024

The one I was thinking of was the one started from in issue #12, but I can't quite remember how/with what it was obtained – probably it was the t-distribution as you say!

nickjcroucher commented on June 15, 2024

Alas, it was t-distribution based, from my neglected bsub outputs:

python3 ./PopPUNK/poppunk-runner.py --fit-model --distances GPS/GPS.dists --output GPS --full-db --ref-db GPS --t-dist --priors priors.txt --K 3

nickjcroucher commented on June 15, 2024

Baffling fit for K=4:
[Image: gps_bgmm_v2_dpgmm_fit]

johnlees commented on June 15, 2024

Wow, that's really quite something. I particularly don't understand the yellow component.

One final thing to try could be changing lines 44-49 of bgmm.py to:

    dpgmm = mixture.BayesianGaussianMixture(n_components = dpgmm_max_K,
                                            n_init = 5,
                                            covariance_type = 'full',
                                            weight_concentration_prior = 1,
                                            mean_precision_prior = 1).fit(X)

Though I don't really expect that to change anything. So either a manual start from KDE/previous or DBSCAN?
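
To try those settings outside PopPUNK, a self-contained sketch with scikit-learn might look like the following. The data are synthetic and purely illustrative (two made-up clusters standing in for within-strain and between-strain distances), and `random_state` is added here only for reproducibility; neither comes from the thread.

```python
import numpy as np
from sklearn import mixture

rng = np.random.default_rng(1)

# Synthetic 2-D stand-in data: a tight cluster near the origin and a broader
# cluster further out. These are NOT the GPS distances.
within = rng.normal(loc=[0.02, 0.05], scale=0.01, size=(300, 2))
between = rng.normal(loc=[0.30, 0.40], scale=0.03, size=(700, 2))
X = np.vstack([within, between])

dpgmm_max_K = 2
dpgmm = mixture.BayesianGaussianMixture(
    n_components=dpgmm_max_K,
    n_init=5,
    covariance_type="full",
    weight_concentration_prior=1,
    mean_precision_prior=1,
    random_state=1,  # assumption: fixed seed for repeatable output
).fit(X)

# Inspect the fitted component means and mixture weights before refinement.
print(dpgmm.means_)
print(dpgmm.weights_)
```

Checking that the fitted means are well separated before running refinement gives an early signal of whether the fit is usable as a starting point.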

johnlees commented on June 15, 2024

Looks like --manual-start fixed this, so closing for now
