Refinement with BGMM is complaining: PopPUNK (POPulatio

BGMM refinement about poppunk HOT 14 CLOSED

bacpop commented on June 15, 2024

BGMM refinement

from poppunk.

Comments (14)

nickjcroucher commented on June 15, 2024

Seems to identify two separate within and between strain sets of links:

In here within label is 1 with mean [0.52112061 0.48266401] and between label is 2 with mean [0.64633422 0.49405153]

nickjcroucher commented on June 15, 2024

So the non-origin end of the scale is:

euclidean 0.1257303629256175

Hence the returned boundary likelihoods are:

origin end -0.028069695293248376 other end -0.13515692159251055

Looks like the centres of the two distributions are too close together?
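As a quick check of that figure, the Euclidean distance between the two reported component means can be recomputed directly. This is only a sketch using the values quoted above; the variable names are illustrative and are not taken from PopPUNK.

```python
import math

# Component means as printed earlier in this thread
within_mean = (0.52112061, 0.48266401)
between_mean = (0.64633422, 0.49405153)

# Straight-line distance between the two centres
separation = math.dist(within_mean, between_mean)
print(f"{separation:.4f}")
```

The result agrees with the 0.1257 reported above, so the two centres are separated almost entirely along the first axis and by very little overall.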

johnlees commented on June 15, 2024

Yeah I don't think it'll work with that as the initial fit. It assumes two good cluster means for these components, whereas the fit above doesn't look like it's worked. So the first error will be because there's no root between the two means (should be able to see this from the likelihood contours + decision boundary plot too)
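The "no root between the two means" condition can be illustrated with a one-dimensional toy mixture. This is a minimal sketch and not PopPUNK's code: the function names and the (weight, mean, sd) tuples are made up for illustration, and plain bisection stands in for whatever root finder the real code uses.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a 1D normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def responsibility_diff(x, within, between):
    """Within minus between responsibility at x; components are (weight, mean, sd)."""
    w = within[0] * normal_pdf(x, within[1], within[2])
    b = between[0] * normal_pdf(x, between[1], between[2])
    return (w - b) / (w + b)

def find_boundary(within, between, tol=1e-10):
    """Bisect between the two means, but only if the difference changes sign."""
    lo, hi = within[1], between[1]
    f_lo = responsibility_diff(lo, within, between)
    f_hi = responsibility_diff(hi, within, between)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change between the component means; "
                         "the initial fit is probably unusable")
    while abs(hi - lo) > tol:
        mid = 0.5 * (lo + hi)
        f_mid = responsibility_diff(mid, within, between)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Two well-separated components: the boundary sits midway between them
good_boundary = find_boundary((0.5, 0.0, 0.05), (0.5, 1.0, 0.05))

# A light, broad component swamped by a heavy neighbour: no root exists
try:
    find_boundary((0.2, 0.0, 1.0), (0.8, 0.1, 0.2))
    bad_fit_detected = False
except ValueError:
    bad_fit_detected = True
```

In the second case the difference is negative at both means, which mirrors the two negative boundary likelihoods printed earlier in the thread.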

You could try with K = 2 or 4 and see if you get something reasonable to start from, or use --manual-start with the means from the previous GPS fit/SPARC fit.

nickjcroucher commented on June 15, 2024

Trying K = 4 now; K = 2 gave a similar fit. I think the previous GPS fit was t-dist based, so would the means still be useful for starting?

johnlees commented on June 15, 2024

Yeah, I think they should work. Another sensible alternative is 0,0 and the centre with the highest density from the KDE fit (in the distance contour plot)
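One rough way to pick such a pair of starting means is sketched below. It is an illustration only: the distances are synthetic, a 2D histogram is used as a crude stand-in for a proper KDE, and none of the names come from PopPUNK. How the resulting means are passed to --manual-start is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (core, accessory) distances, assumed purely for illustration:
# a small within-strain cloud near the origin and a large between-strain cloud
within = rng.normal(loc=[0.02, 0.02], scale=0.02, size=(50, 2))
between = rng.normal(loc=[0.60, 0.50], scale=0.03, size=(5000, 2))
dists = np.vstack([within, between])

# Crude density estimate: count points in a 50 x 50 grid over [0, 1]^2
counts, x_edges, y_edges = np.histogram2d(
    dists[:, 0], dists[:, 1], bins=50, range=[[0.0, 1.0], [0.0, 1.0]]
)

# Centre of the densest grid cell
i, j = np.unravel_index(np.argmax(counts), counts.shape)
peak = np.array([0.5 * (x_edges[i] + x_edges[i + 1]),
                 0.5 * (y_edges[j] + y_edges[j + 1])])

# Candidate starting means: the origin and the density peak
start_means = np.array([[0.0, 0.0], peak])
print(start_means)
```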

nickjcroucher commented on June 15, 2024

To "solve" this issue, I've been trying to look at whether there's a simple function that will identify a scaling factor that will ensure the boundary likelihoods at the two means have different signs:

def getBoundSize(model, start, end, within, between, origin, other):
    boundSize = 1
    sameSign = True
    while sameSign:
        origin_value = likelihoodBoundary(origin, model, start, end, within, between, boundSize)
        other_value = likelihoodBoundary(other, model, start, end, within, between, boundSize)
        if origin_value < 0 and other_value > 0:
            sameSign = False
        else:
            boundSize = boundSize + 1
    return boundSize

Followed by:

def likelihoodBoundary(s, model, start, end, within, between, boundSize):
    """Wrapper function around :func:`~PopPUNK.bgmm.fit2dMultiGaussian` so that it can
    go into a root-finding function for probabilities between components

    Args:
        s (float)
            Distance along line from mean0
        model (BGMMFit)
            Fitted mixture model
        start (numpy.array)
            The co-ordinates of the centre of the within-strain distribution
        end (numpy.array)
            The co-ordinates of the centre of the between-strain distribution
        within (int)
            Label of the within-strain distribution
        between (int)
            Label of the between-strain distribution
    Returns:
        responsibility (float)
            The difference between responsibilities of assignment to the within component
            and the between assignment
    """
    X = transformLine(s, start, end).reshape(1*boundSize, -1*boundSize)
    responsibilities = model.assign(X, values = True)
    return(responsibilities[0, within] - responsibilities[0, between])

But it just throws:

X = transformLine(s, start, end).reshape(1*boundSize, -1*boundSize)
ValueError: cannot reshape array of size 2 into shape (3,newaxis)

It's probably not a good idea anyway, but I suppose we should either (a) throw a more comprehensible error or (b) use this as a criterion that has to be met for the fitting to be considered valid?
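The ValueError itself comes from applying the scale factor to the array shape rather than to the position along the line. A hedged sketch of the difference follows; point_on_line is a hypothetical stand-in for transformLine (whose exact behaviour is assumed here), and bound_size mirrors boundSize from the snippet above.

```python
import numpy as np

def point_on_line(s, start, end):
    """Hypothetical stand-in for transformLine: the point a distance s
    from start, moving towards end."""
    direction = (end - start) / np.linalg.norm(end - start)
    return start + s * direction

start = np.array([0.52112061, 0.48266401])
end = np.array([0.64633422, 0.49405153])

point = point_on_line(0.05, start, end)   # two coordinates, shape (2,)
row = point.reshape(1, -1)                # one sample with two features, shape (1, 2)

# Asking for three rows cannot work: two values do not divide into three rows
try:
    point.reshape(3, -1)
    reshape_failed = False
except ValueError:
    reshape_failed = True

# Scaling the distance instead keeps the (1, 2) shape the model expects
bound_size = 3
span = np.linalg.norm(end - start)
far_point = point_on_line(bound_size * span, start, end).reshape(1, -1)
```

Whether widening the search interval like this is sensible is a separate question; it only shows where a scale factor could be applied without breaking the array shape.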

johnlees commented on June 15, 2024

I think in some cases when the initial fit is just crap (as here) there's not much that can be used to salvage it. Eschewing the mixture model and using the KDE estimate (or DBSCAN, if that gives something better) I think is the way to go.

Doing a check before doing the minimum search is definitely a good idea - I'll add this in now

johnlees commented on June 15, 2024

Did we have a decent BGMM fit from the previous iteration of GPS data, which we ran refinement on to produce the last set of clusters we sent to Becca and Steph?

nickjcroucher commented on June 15, 2024

The last good fit I remember was the t-distribution for which we provided priors - not sure about the BGMM fits.

johnlees commented on June 15, 2024

The one I was thinking of was the one started from in issue #12, but I can't quite remember how/with what it was obtained – probably it was the t-distribution as you say!

nickjcroucher commented on June 15, 2024

Alas it was t distribution based, from my neglected bsub outputs:

python3 ./PopPUNK/poppunk-runner.py --fit-model --distances GPS/GPS.dists --output GPS --full-db --ref-db GPS --t-dist --priors priors.txt --K 3

nickjcroucher commented on June 15, 2024

Baffling fit for K=4:

johnlees commented on June 15, 2024

Wow, that's really quite something. I particularly don't understand the yellow component.

One final thing to try could be changing lines 44-49 of bgmm.py to:

    dpgmm = mixture.BayesianGaussianMixture(n_components = dpgmm_max_K,
                                            n_init = 5,
                                            covariance_type = 'full',
                                            weight_concentration_prior = 1,
                                            mean_precision_prior = 1).fit(X)

Though I don't really expect that to change anything. So either a manual start from KDE/previous or DBSCAN?

johnlees commented on June 15, 2024

Looks like --manual-start fixed this, so closing for now
