Comments (20)

mcreixell commented on August 17, 2024

Binomial fixed! I'll explain later; it was a silly but hard-to-spot bug.

mcreixell commented on August 17, 2024

Although it is more obvious with CPTAC, the EM also fails to improve on some iterations with the AXL data set. I didn't notice this before because, in contrast to CPTAC, where we see a very marked decay of the scores, the difference between the scores of consecutive iterations is always very minor with our MS data. At first I thought it was due to pomegranate's GMM, but this also happens when the data's weight is set to 0... Any suggestions? @aarmey

aarmey commented on August 17, 2024

Does it happen when you're only using the data?

mcreixell commented on August 17, 2024

Yes...

aarmey commented on August 17, 2024

One thought: within one iteration, try running the E step twice in a row. The second time, nothing should change, since you already identified the maximum-likelihood assignments. Then do the same with the M step. This should uncover any unexpected randomness in the process or any indexing errors.

Second, you could print the likelihood during both the M and E steps; each substep should improve the likelihood. It might be that one is working while the other isn't.
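
Roughly like this, where e_step and m_step are stand-ins for whatever wrappers you have (hypothetical names, just to sketch the check):

import numpy as np

# Repeating a step from the same model state should be a no-op.
resp1 = e_step(gmm, d)  # hypothetical E-step wrapper returning the assignments
resp2 = e_step(gmm, d)  # same state, so this must match resp1 exactly
assert np.array_equal(resp1, resp2), "E step is not deterministic"

m_step(gmm, d, resp1)                  # first maximization
ll1 = gmm.log_probability(d).sum()
m_step(gmm, d, resp1)                  # re-maximizing at the optimum should not move it
ll2 = gmm.log_probability(d).sum()
assert np.isclose(ll1, ll2), "M step is not deterministic"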

mcreixell commented on August 17, 2024

I tried the first point, and I think the bug must be in the M step; the scores across repeated E steps were identical. I think the problem is here:

https://github.com/meyer-lab/resistance-MS/blob/7e1375580c7062605c70eb5ea34c333a9abbd3ec/msresist/sequence_analysis.py#L358

I was looking at the pomegranate API documentation, and I think using fit() here is the right thing to do, but I'm unsure how the GMM parameters are being updated. I think the inertia argument shouldn't be set to 0 (the default), since that means ignoring the old parameters entirely, whereas an inertia of 1 means ignoring the new parameters. It sounds like something in between should be the right choice? Thanks.
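
If I'm reading the docs right, the update is a linear blend of the two (my paraphrase, not pomegranate's actual code):

def blended_update(old_param, new_param, inertia):
    # inertia = 0 keeps only the newly fit parameters (the default);
    # inertia = 1 keeps only the old parameters.
    return inertia * old_param + (1 - inertia) * new_param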

mcreixell commented on August 17, 2024

(https://pomegranate.readthedocs.io/en/latest/GeneralMixtureModel.html#pomegranate.gmm.GeneralMixtureModel.fit)

aarmey commented on August 17, 2024

I think you want an inertia of 0. The old parameters don’t matter; you just want the best fit for the current assignments.

mcreixell commented on August 17, 2024

But this function doesn't have an argument for the current assignments or likelihoods; it just takes the data. By using an inertia of 0, aren't we telling the model to forget whatever it learned in the previous iteration and re-initialize? That would explain why the EM sometimes gets worse, and I've found that this happens more often with data alone than with sequence alone.

aarmey commented on August 17, 2024

Yes: there's no memory during EM; you just maximize given the current state.

mcreixell commented on August 17, 2024

Sorry to keep asking about this... I can't get my head around the fact that pomegranate's gmm.fit(data) takes only the data (it never sees the new cluster assignments found in the E step), whereas sklearn's gmm.fit() method (which I was using before switching to pomegranate) takes both the data and the new assignments in every M step. My question is: how can pomegranate's fit method update its parameters correctly without using the updated cluster assignments?

# M step: update motifs, cluster centers, and GMM probabilities
cl_seqs = seq_reassign
# note: the original max_iterations=1**10000 evaluates to 1, i.e. a single pass
gmm.fit(d, inertia=0, stop_threshold=100, max_iterations=1)
gmmp = gmm.predict_proba(d)
gmmp = GmmpCompatibleWithSeqScores(gmmp, distance_method)

aarmey commented on August 17, 2024

Ah ha! Yes, you do need to set the cluster assignments within pom's GMM before running the M step. If it's not clear where to do this in the code, I can do some digging.

mcreixell commented on August 17, 2024

This updates the M step:
https://pomegranate.readthedocs.io/en/latest/GeneralMixtureModel.html#pomegranate.gmm.GeneralMixtureModel.from_summaries.

But I still can't find a way to input the new assignments...
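
The closest thing I can see (an untested guess from the docs, not something I've verified against pom's internals) would be to pass the E-step responsibilities as per-sample weights when summarizing each component, and then call from_summaries:

import numpy as np

# Untested sketch: push externally computed responsibilities (n_samples x k)
# into the M step as per-sample weights.
def m_step_with_responsibilities(gmm, d, resp):
    for j, dist in enumerate(gmm.distributions):
        dist.summarize(d, weights=resp[:, j])  # responsibility-weighted statistics
        dist.from_summaries()                  # maximize given those weights
    gmm.weights = np.log(resp.sum(axis=0) / resp.sum())  # update log mixing weights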

mcreixell commented on August 17, 2024

I'm pretty sure I found a way to correctly update the GMM model, but unfortunately the bug still persists... especially with the CPTAC data set. Let's discuss tomorrow!

aarmey commented on August 17, 2024

So you're initializing the GMM with the class weights, like this?

model = GeneralMixtureModel(distributions, class_weights)

Either way, I agree it will be helpful to walk through what we do and don't know. If I'm following correctly, we know the issue is within M, and that M is not replicable within an iteration. I'm not sure how that can be the case without us missing something obvious, since functions with no side effects must be replicable for the same inputs. However, maybe pom's GMM uses random initialization of some parameter, etc...
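
One quick way to probe for that (rough sketch; assumes pom's to_json/from_json round-trips the model faithfully):

import random
import numpy as np
from pomegranate import GeneralMixtureModel

# Run one M step from identical copies of the model with fixed seeds; if the
# resulting likelihoods differ, something inside is drawing random numbers.
def m_step_ll(gmm, d, seed=0):
    m = GeneralMixtureModel.from_json(gmm.to_json())  # independent copy
    random.seed(seed)
    np.random.seed(seed)
    m.fit(d, inertia=0, max_iterations=1)
    return m.log_probability(d).sum()

# m_step_ll(gmm, d) should return exactly the same value on every call.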

mcreixell commented on August 17, 2024

I just checked, and with the M-step fix I made yesterday, the M step is now replicable within an iteration, as you can see below. The first float is the score of iteration 1, the next array contains the GMM likelihoods after initialization, and the following three arrays are the GMM likelihoods within iteration 1 after running the M step three times. However, this doesn't seem to prevent randomly getting lower scores during the EM optimization. I think the way I weight the data vs the sequence may have something to do with it. We can discuss later.

[Screenshot: iteration-1 score and GMM likelihoods across three repeated M steps]

mcreixell commented on August 17, 2024

Below I'm illustrating three scenarios using 4 clusters: (i) prioritizing data but still using some sequence information (weight of 1, though anything below 1 should fit), (ii) the threshold at which the model won't fit (weight of 1.1), and (iii) prioritizing sequence (weight of 100).

What I'm printing in this screenshot is, for every iteration:

  • The iteration number.
  • How many of the final peptide assignments (based on the final scaled score) followed the sequence score versus the data score (e.g. scaled assignment = cluster 1, data assignment = cluster 2, sequence assignment = cluster 1; in this case SeqWins += 1). I also count the cases where sequence and data assign a peptide to the same cluster, and where the scaled assignment differs from both individual assignments.
  • Cluster sizes
  • Scores

I think it's interesting that the second scenario looks like a tie between scores of ~-52.9 and ~-51.3, since these two exact numbers go back and forth for many iterations, together with the rest of the printed information.

I tried to spot bugs in my "Binomial" programming but unfortunately couldn't find anything. For every iteration, I printed each cluster's consensus motif (Biopython's FrequencyPositionMatrix.consensus, which is used to calculate the binomial matrix), and these seem to evolve as the algorithm progresses:

https://github.com/meyer-lab/resistance-MS/blob/4a93004f48f29b66f396ad6cde2d7a43726b1d52/msresist/sequence_analysis.py#L313-L323

I think an easy start would be to try speeding up PAM250 and go from there. #134

Also, it rarely occurs, but sometimes pom returns NaN values when calculating the GMM cluster probabilities... Just run it again. #136

Please let me know if anything is not clear! Thanks.

[Screenshots: per-iteration output for the three weighting scenarios (weights 1, 1.1, and 100)]

aarmey commented on August 17, 2024

How do you determine which score "won"? Just on relative value? I'm not sure that indicates much, given that the cluster assignment is based on the relative score between clusters.

Have you tried raising the size of the lru_cache? You can set maxsize to None to let it grow without bound.
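
Something like this, with pam250_score as a stand-in name for the real pairwise scoring function:

from functools import lru_cache
from Bio.Align import substitution_matrices

PAM250 = substitution_matrices.load("PAM250")

@lru_cache(maxsize=None)  # unbounded: every unique pair is scored only once
def pam250_score(seq1: str, seq2: str) -> float:
    # hypothetical stand-in for the real scoring function
    return sum(PAM250[a, b] for a, b in zip(seq1, seq2))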

mcreixell commented on August 17, 2024

I think it does. I was trying to figure out what weight corresponds to what fraction of sequence or data wins. For every peptide, I look individually at the unscaled, raw sequence and data scores and determine, from each alone, which cluster that peptide would be assigned to. Then I compare these two assignments to the assignment coming from the scaled scores to see who "won" after scaling, sequence or data. If the SeqWeight is 0, we get SeqWins = 0 and DataWins ≈ 32,000, and with SeqWeight = 100 we get SeqWins ≈ 32,000 and DataWins = 0, as we'd expect. I also record the cases where both the sequence and data assignments agree with the scaled assignment, and where the scaled assignment differs from both unscaled assignments. Now we know that for Binomial we can try different weights in roughly the 0-5 range to find one that yields the desired proportion.
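
For reference, the counting logic is essentially this (sketch; the variable names are mine):

import numpy as np

# seq_scores, data_scores, and scaled_scores are (n_peptides x n_clusters)
# matrices; the assignment under each scoring is the argmax across clusters.
def count_wins(seq_scores, data_scores, scaled_scores):
    seq_idx = np.argmax(seq_scores, axis=1)
    data_idx = np.argmax(data_scores, axis=1)
    final_idx = np.argmax(scaled_scores, axis=1)
    seq_wins = np.sum((final_idx == seq_idx) & (final_idx != data_idx))
    data_wins = np.sum((final_idx == data_idx) & (final_idx != seq_idx))
    both_win = np.sum((final_idx == seq_idx) & (final_idx == data_idx))
    mix_wins = np.sum((final_idx != seq_idx) & (final_idx != data_idx))
    return seq_wins, data_wins, both_win, mix_wins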

aarmey commented on August 17, 2024

Aha. Neat idea.

Try removing the limit on the lru_cache and see whether you can get decent performance out of PAM250.
