
Speed up PAM250 for CPTAC (ddmc) · 12 comments · closed

meyer-lab commented on August 17, 2024
Speed up PAM250 for CPTAC


Comments (12)

mcreixell commented on August 17, 2024

Hmm, interesting! The order of the query sequences is the same as in the mass spec data set, but the order within each cluster changes after every iteration... However, I think I should be able to get around this. Otherwise, instead of an n_seq by n_seq matrix, I could collect this information in dictionary form, Dict[(seq1, seq2)] = score. This would essentially serve as a PAM250 matrix, but the pairwise comparisons would be at the motif level. Will explore! Thanks for the idea.
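As a sketch of the dictionary idea (the helper names and the use of Biopython's PAM250 matrix are assumptions here, not the project's code):

```python
from Bio.Align import substitution_matrices

PAM250 = substitution_matrices.load("PAM250")


def motif_score(seq1: str, seq2: str) -> float:
    """Sum PAM250 substitution scores position by position over two aligned motifs."""
    return sum(PAM250[a, b] for a, b in zip(seq1, seq2))


def pairwise_scores(seqs):
    """Dict[(seq1, seq2)] -> score, computed once per unordered pair of motifs."""
    scores = {}
    for i, s1 in enumerate(seqs):
        for s2 in seqs[i:]:
            scores[(s1, s2)] = scores[(s2, s1)] = motif_score(s1, s2)
    return scores
```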


aarmey commented on August 17, 2024

Take a look at my most recent commit on #129. There were a couple (really hard) things to change:

  • Threads don't help because of the GIL in Python (code can only be interpreted on one thread at a time), which means you have to use multiprocessing.
  • Processes don't (normally) share memory, which means everything you pass to the function has to be copied. There is overhead to this, so each process now works on blocks of 500 columns. (Check the indexing for this, because I'm pretty sure it doesn't work if your number of sequences isn't a multiple of 500.)
  • Rather than copying the out matrix, I made it shared memory. This is what required upgrading Python to 3.8. I've never done this before, but it seems to be working (a rough sketch of this pattern follows below).

The good news is that 30,000 sequences now take ~2 min. 🤩 FYI, that out matrix is ~20 gigabytes, so don't make a bunch of copies.
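For anyone reading along, a rough sketch of the pattern described above, assuming Python 3.8's multiprocessing.shared_memory and a Biopython PAM250 scorer; the function names are illustrative, not the code in #129:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

from Bio.Align import substitution_matrices

PAM250 = substitution_matrices.load("PAM250")


def motif_score(s1, s2):
    """Position-wise PAM250 score of two equal-length motifs."""
    return sum(PAM250[a, b] for a, b in zip(s1, s2))


def fill_block(shm_name, n, start, stop, seqs):
    """Attach to the shared out matrix and fill columns [start, stop)."""
    shm = shared_memory.SharedMemory(name=shm_name)
    out = np.ndarray((n, n), dtype=np.float64, buffer=shm.buf)
    for j in range(start, stop):
        for i in range(n):
            out[i, j] = motif_score(seqs[i], seqs[j])
    shm.close()


def pairwise_matrix(seqs, block=500):
    """Fill an n x n shared-memory score matrix, one block of columns per task."""
    n = len(seqs)
    shm = shared_memory.SharedMemory(create=True, size=n * n * 8)
    try:
        with ProcessPoolExecutor() as e:
            for start in range(0, n, block):
                # min() handles an n that is not a multiple of the block size;
                # seqs is pickled per task, which is the copy overhead mentioned above
                e.submit(fill_block, shm.name, n, start, min(start + block, n), seqs)
        out = np.ndarray((n, n), dtype=np.float64, buffer=shm.buf).copy()
    finally:
        shm.close()
        shm.unlink()
    return out
```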


mcreixell commented on August 17, 2024

Setting @lru_cache(maxsize=None) didn't seem to improve much, unfortunately... Since the order in which the query sequence is compared to the rest of the sequences within the cluster doesn't matter, maybe it's a good scenario for running these in parallel? @aarmey
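For context, the memoization mentioned above amounts to something like the following; the function name and the Biopython matrix are illustrative assumptions:

```python
from functools import lru_cache

from Bio.Align import substitution_matrices

PAM250 = substitution_matrices.load("PAM250")


@lru_cache(maxsize=None)
def pairwise_pam250(seq1: str, seq2: str) -> float:
    # Results are cached on the (seq1, seq2) argument pair, so repeated
    # comparisons of the same two motifs skip the summation entirely.
    return sum(PAM250[a, b] for a, b in zip(seq1, seq2))
```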


aarmey commented on August 17, 2024

Could you profile it with that setting? lru_cache is super fast, so I'm surprised it would be the bottleneck... even things like getting the string from a list could be taking longer. If it is indeed the slow point, there are some adjustments we can make, but they'll be a little more involved.

And aren't you already running parallel calculations at another level?
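For reference, one standard-library way to produce the kind of profile being asked for here (the fitting call is a placeholder, not a call from the repo):

```python
import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
model.fit(X)  # placeholder for whatever call is being timed
pr.disable()

# Show the 20 functions with the largest cumulative time
pstats.Stats(pr).sort_stats("cumulative").print_stats(20)
```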


mcreixell commented on August 17, 2024

Currently I'm only running in parallel while grid searching, not during fitting.

The good news is that the model seems to work with PAM250. I used a reduced version of the CPTAC data set (~3,000 peptides, all patients) and the model strictly improves and ultimately fits. Now it's only a matter of speeding it up.

This is the profile:

[screenshot: profiler output]


mcreixell commented on August 17, 2024

This is what I changed, but it didn't seem to speed things up much more. I also tried calculating the cluster assignments in parallel with ThreadPoolExecutor, and tried transforming the pam250 dictionary's keys from two strings to one (e.g. pam250[("A", "E")] = score to pam250["AE"] = score) as well as using sys.intern, but no noticeable improvements...

https://github.com/meyer-lab/resistance-MS/blob/099333c2d41b3fc46af8dbadfbffc837c50a2f8e/msresist/sequence_analysis.py#L263-L272

[screenshot]
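Roughly what the key transformation above looks like, assuming the dictionary is built from Biopython's PAM250 matrix (the variable names are illustrative):

```python
import sys

from Bio.Align import substitution_matrices

_mat = substitution_matrices.load("PAM250")

# Tuple keys: pam250[("A", "E")] -> score
pam250_tuple = {(a, b): _mat[a, b] for a in _mat.alphabet for b in _mat.alphabet}

# Single-string keys: pam250["AE"] -> score, interned so key hashing and
# comparison reuse one shared string object per residue pair
pam250_str = {sys.intern(a + b): _mat[a, b] for a in _mat.alphabet for b in _mat.alphabet}
```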


aarmey commented on August 17, 2024

Are your sequences in a defined and unchanging order at some point? Because you could make an n_seq by n_seq matrix once, then never recalculate the PAM distance again...
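The suggestion, in sketch form (function name illustrative): as long as the sequence order is fixed, clustering only ever needs to look up out[i, j], never run a fresh PAM250 calculation.

```python
import numpy as np


def precompute_distances(seqs, score):
    """Build the n_seq x n_seq PAM250 score matrix once; look it up afterwards."""
    n = len(seqs)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):  # the score is symmetric, so fill both triangles at once
            out[i, j] = out[j, i] = score(seqs[i], seqs[j])
    return out
```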


mcreixell commented on August 17, 2024

I implemented what we discussed above and it sped up the code a lot when using a fraction of the data set, but when I try to calculate the scores matrix with all 42k peptides, it breaks Aretha... As expected, once the PAM250 scores matrix is calculated, the model runs much faster. Something strange I've noticed is that after running GenerateSeq1Seq2ToPAM250Distance once in a notebook cell, subsequent runs of that same cell (without restarting the kernel) are drastically shorter. For instance, with 5k peptides it may take ~45 s to generate the matrix the first time versus ~5 s the second time in that cell. I don't know if there's anything we can get out of this, but I found it surprising. This is the code that generates the PAM250 scores array:

https://github.com/meyer-lab/resistance-MS/blob/b8a6a164cb476b8feb529951db1a8659889aa0b9/msresist/sequence_analysis.py#L351-L380


aarmey commented on August 17, 2024

A few things:

  • Remove the LRU cache. This may be what breaks Aretha, and it would explain the difference when running a second time. You no longer need it, because you don't need to store the PAM results in two places.
  • Rather than returning the calculated values, pass out into the thread (numpy arrays are passed by reference) and have the thread fill it in. This means the thread function won't return anything, so all these values don't have to be stored in the future (see the sketch after this comment).
  • Since you don't have to retrieve anything from the futures anymore, you can just call e.shutdown() to wait for all the tasks to finish at once.
  • Use a for loop to submit the futures, but don't store them in a list, since you won't need to look at them again.

After these changes, let me know how I can run this. I can check one or two other items.
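A sketch of the fill-in-place pattern described in the list above (worker and scorer names are illustrative): with threads, the numpy array is shared within the process, so the workers write results directly into the preallocated out array, nothing is returned or stored from the futures, and e.shutdown() waits for all submitted work.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def fill_rows(out, seqs, start, stop, score):
    """Write rows [start, stop) of the preallocated matrix in place; returns nothing."""
    for i in range(start, stop):
        for j in range(out.shape[1]):
            out[i, j] = score(seqs[i], seqs[j])


def build_matrix(seqs, score, block=500):
    n = len(seqs)
    out = np.zeros((n, n))
    e = ThreadPoolExecutor()
    for start in range(0, n, block):
        # no list of futures is kept; the results live in `out`, not in return values
        e.submit(fill_rows, out, seqs, start, min(start + block, n), score)
    e.shutdown(wait=True)  # blocks until every submitted task has finished
    return out
```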


mcreixell commented on August 17, 2024

I think it's ready; I'll commit these edits in #129. To run this, just open the CPTAC.ipynb notebook. The first cell imports the data, and in the next section the first cell checks the speed of this function and the next cell runs the model.


aarmey commented on August 17, 2024

P.S. Because this requires Python 3.8, PAM250 will not build on Jenkins for now. That's fine; we can override the build checks.


mcreixell commented on August 17, 2024

Wow, this is so cool! Thanks! I'll take a look now.

