meyer-lab / ddmc

Clusters phosphoproteomics data based on a combination of sequence information and abundance changes across conditions.

Home Page: https://asmlab.org

Languages: Python 99.34%, Makefile 0.66%
Topics: drug-resistance, cancer, phosphoproteomics, academic-project

ddmc's Introduction

Co-clustering for mass spectrometry peptide analysis


Clusters peptides based on both sequence similarity and phosphorylation signal across samples.

Usage

>>> from ddmc.clustering import DDMC

>>> # load dataset as p_signal...

>>> p_signal
             Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
Sequence                                                     
AAAAAsQQGSA -3.583614       NaN -0.662659 -1.320029 -0.730832
AAAAGsASPRS -0.174779 -1.796899  0.891798 -3.092941  2.394315
AAAAGsGPSPP -1.951552 -2.937095  2.692876 -2.344894  0.556615
AAAAGsGPsPP  3.666782       NaN -2.081231  0.989394       NaN
AAAAPsPGSAR  1.753855 -2.135835  0.896778  3.369230  2.020967
...               ...       ...       ...       ...       ...
YYSPYsVSGSG -3.502871  2.831169  3.383486  2.589559  3.624968
YYSSRsQSGGY -0.870365  0.887317  2.600291 -0.374107  3.285459
YYTAGyNSPVK  0.249539  2.047050 -0.286033  0.042650  2.863317
YYTSAsGDEMV  0.662787  0.135326 -1.004350  0.879398 -1.609894
YYYSSsEDEDS       NaN -1.101679 -3.273987 -0.872370 -1.735891

>>> p_signal.index  # p_signal.index contains the peptide sequences
Index(['AAAAAsQQGSA', 'AAAAGsASPRS', 'AAAAGsGPSPP', 'AAAAGsGPsPP',
       'AAAAPsPGSAR', 'AAAAPsPGsAR', 'AAAARsLLNHT', 'AAAARsPDRNL',
       'AAAARtQAPPT', 'AAADFsDEDED',
       ...
       'YYDRMySYPAR', 'YYEDDsEGEDI', 'YYGGGsEGGRA', 'YYRNNsFTAPS',
       'YYSPDyGLPSP', 'YYSPYsVSGSG', 'YYSSRsQSGGY', 'YYTAGyNSPVK',
       'YYTSAsGDEMV', 'YYYSSsEDEDS'],
      dtype='object', name='Sequence', length=30561)
      
>>> model = DDMC(n_components=2, seq_weight=100).fit(p_signal)  # fit model

>>> model.transform(as_df=True)  # get cluster centers
                 0         1
Sample 1  0.017644  0.370375
Sample 2 -0.003625 -0.914869
Sample 3 -0.087624 -0.682140
Sample 4  0.014644 -0.658907
Sample 5  0.023885  0.196063
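
Since the model follows a scikit-learn-style fit/transform interface, per-peptide cluster assignments should also be recoverable after fitting. The call below is a sketch under that assumption, not documented behavior:

>>> labels = model.predict()  # assumption: per-peptide cluster labels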

ddmc's People

Contributors: aarmey, armaan-abraham, farnazmdi, github-actions[bot], lkargi, mcreixell, scottdtaylor95

Forkers: complexdata

ddmc's Issues

Script to profile

I'd like to do a little profiling to see if I can speed up fitting. Do you possibly have a python script that runs fitting? (Using a jupyter notebook is a bit more difficult.)
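If it helps, a minimal profiling script could look like the sketch below, which assumes the fitting interface shown in the README above; the CSV path and load step are hypothetical stand-ins for the real loader.

```python
# profile_fit.py -- a sketch of a profiling script for model fitting.
import cProfile
import pstats

import pandas as pd
from ddmc.clustering import DDMC

# Hypothetical load step; replace with however p_signal is actually built.
p_signal = pd.read_csv("p_signal.csv", index_col="Sequence")

profiler = cProfile.Profile()
profiler.enable()
DDMC(n_components=2, seq_weight=100).fit(p_signal)
profiler.disable()

# Print the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```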

EOFError when trying to use plotly, jupyterlab v2.0 issue?

When trying to run plotpca_ScoresLoadings_plotly (figure1.py), I'm getting the error below. I didn't change anything in this function (that I'm aware of), and I'm not familiar with this error at all. I was thinking this may have something to do with the recent upgrade of jupyterlab to v2.0? This is not urgent, so whenever you get a chance works great. Thanks.

```
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input> in <module>
----> 1 plotpca_ScoresLoadings_plotly(d1.T, "AXLmuts + AF154 BR1", "AXL")

~/resistance-MS/msresist/figures/figure1.py in plotpca_ScoresLoadings_plotly(data, title, loc)
    309         print(loadings.loc[loc])
    310
--> 311     fig = make_subplots(rows=1, cols=2, subplot_titles=("PCA Scores", "PCA Loadings"))
    312     fig.add_trace(
    313         go.Scatter(

/usr/local/lib/python3.7/dist-packages/plotly/subplots.py in make_subplots(rows, cols, shared_xaxes, shared_yaxes, start_cell, print_grid, horizontal_spacing, vertical_spacing, subplot_titles, column_widths, row_heights, specs, insets, column_titles, row_titles, x_title, y_title, **kwargs)
    810
    811     # Build resulting figure
--> 812     fig = go.Figure(layout=layout)
    813
    814     # Attach subplot grid info to the figure

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/_figure.py in __init__(self, data, layout, frames, skip_invalid, **kwargs)
    609             is invalid AND skip_invalid is False
    610         """
--> 611         super(Figure, self).__init__(data, layout, frames, skip_invalid, **kwargs)
    612
    613     def add_area(

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __init__(self, data, layout_plotly, frames, skip_invalid, **kwargs)
    268         # --------
    269         # ### Check for default template ###
--> 270         self._initialize_layout_template()
    271
    272         # Process kwargs

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _initialize_layout_template(self)
   1948         if self._layout_obj.template is None:
   1949             if pio.templates.default is not None:
-> 1950                 self._layout_obj.template = pio.templates.default
   1951             else:
   1952                 self._layout_obj.template = None

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setattr__(self, prop, value)
   4449         if match is None:
   4450             # Set as ordinary property
-> 4451             super(BaseLayoutHierarchyType, self).__setattr__(prop, value)
   4452         else:
   4453             # Set as subplotid property

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setattr__(self, prop, value)
   3514         if prop.startswith("_") or hasattr(self, prop) or prop in self._validators:
   3515             # Let known properties and private properties through
-> 3516             super(BasePlotlyType, self).__setattr__(prop, value)
   3517         else:
   3518             # Raise error on unknown public properties

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/__init__.py in template(self, val)
  96981     @template.setter
  96982     def template(self, val):
> 96983         self["template"] = val
  96984
  96985     # ternary

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   4435         if match is None:
   4436             # Set as ordinary property
-> 4437             super(BaseLayoutHierarchyType, self).__setitem__(prop, value)
   4438         else:
   4439             # Set as subplotid property

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   3480         # ### Handle compound property ###
   3481         if isinstance(validator, CompoundValidator):
-> 3482             self._set_compound_prop(prop, value)
   3483
   3484         # ### Handle compound array property ###

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _set_compound_prop(self, prop, val)
   3834         # ------------
   3835         validator = self._validators.get(prop)
-> 3836         val = validator.validate_coerce(val, skip_invalid=self._skip_invalid)
   3837
   3838         # Save deep copies of current and new states

/usr/local/lib/python3.7/dist-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v, skip_invalid)
   2714         # (could be any hashable object)
   2715         if v in pio.templates:
-> 2716             return copy.deepcopy(pio.templates[v])
   2717         # Otherwise, if v is a string, check to see if it consists of
   2718         # multiple template names joined on '+' characters

/usr/local/lib/python3.7/dist-packages/plotly/io/_templates.py in __getitem__(self, item)
     89             template_str = pkgutil.get_data("plotly", path).decode("utf-8")
     90             template_dict = json.loads(template_str)
---> 91             template = Template(template_dict)
     92
     93             self._templates[template_name] = template

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/layout/__init__.py in __init__(self, arg, data, layout, **kwargs)
   9448         # ----------------------------------
   9449         _v = arg.pop("data", None)
-> 9450         self["data"] = data if data is not None else _v
   9451         _v = arg.pop("layout", None)
   9452         self["layout"] = layout if layout is not None else _v

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   3480         # ### Handle compound property ###
   3481         if isinstance(validator, CompoundValidator):
-> 3482             self._set_compound_prop(prop, value)
   3483
   3484         # ### Handle compound array property ###

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _set_compound_prop(self, prop, val)
   3834         # ------------
   3835         validator = self._validators.get(prop)
-> 3836         val = validator.validate_coerce(val, skip_invalid=self._skip_invalid)
   3837
   3838         # Save deep copies of current and new states

/usr/local/lib/python3.7/dist-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v, skip_invalid)
   2442
   2443         elif isinstance(v, dict):
-> 2444             v = self.data_class(v, skip_invalid=skip_invalid)
   2445
   2446         elif isinstance(v, self.data_class):

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/layout/template/__init__.py in __init__(self, arg, area, barpolar, bar, box, candlestick, carpet, choroplethmapbox, choropleth, cone, contourcarpet, contour, densitymapbox, funnelarea, funnel, heatmapgl, heatmap, histogram2dcontour, histogram2d, histogram, image, indicator, isosurface, mesh3d, ohlc, parcats, parcoords, pie, pointcloud, sankey, scatter3d, scattercarpet, scattergeo, scattergl, scattermapbox, scatterpolargl, scatterpolar, scatter, scatterternary, splom, streamtube, sunburst, surface, table, treemap, violin, volume, waterfall, **kwargs)
   1595         self["scattergeo"] = scattergeo if scattergeo is not None else _v
   1596         _v = arg.pop("scattergl", None)
-> 1597         self["scattergl"] = scattergl if scattergl is not None else _v
   1598         _v = arg.pop("scattermapbox", None)
   1599         self["scattermapbox"] = scattermapbox if scattermapbox is not None else _v

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   3484         # ### Handle compound array property ###
   3485         elif isinstance(validator, (CompoundArrayValidator, BaseDataValidator)):
-> 3486             self._set_array_prop(prop, value)
   3487
   3488         # ### Handle simple property ###

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _set_array_prop(self, prop, val)
   3908         # ------------
   3909         validator = self._validators.get(prop)
-> 3910         val = validator.validate_coerce(val, skip_invalid=self._skip_invalid)
   3911
   3912         # Save deep copies of current and new states

/usr/local/lib/python3.7/dist-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v, skip_invalid)
   2530                     res.append(self.data_class(v_el))
   2531                 elif isinstance(v_el, dict):
-> 2532                     res.append(self.data_class(v_el, skip_invalid=skip_invalid))
   2533                 else:
   2534                     if skip_invalid:

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/__init__.py in __init__(self, arg, connectgaps, customdata, customdatasrc, dx, dy, error_x, error_y, fill, fillcolor, hoverinfo, hoverinfosrc, hoverlabel, hovertemplate, hovertemplatesrc, hovertext, hovertextsrc, ids, idssrc, legendgroup, line, marker, meta, metasrc, mode, name, opacity, selected, selectedpoints, showlegend, stream, text, textfont, textposition, textpositionsrc, textsrc, texttemplate, texttemplatesrc, uid, uirevision, unselected, visible, x, x0, xaxis, xcalendar, xsrc, y, y0, yaxis, ycalendar, ysrc, **kwargs)
  28533         # Import validators
  28534         # -----------------
> 28535         from plotly.validators import scattergl as v_scattergl
  28536
  28537         # Initialize validators

/usr/lib/python3.7/importlib/_bootstrap.py in _find_and_load(name, import_)

/usr/lib/python3.7/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

/usr/lib/python3.7/importlib/_bootstrap.py in _load_unlocked(spec)

/usr/lib/python3.7/importlib/_bootstrap_external.py in exec_module(self, module)

/usr/lib/python3.7/importlib/_bootstrap_external.py in get_code(self, fullname)

/usr/lib/python3.7/importlib/_bootstrap_external.py in _compile_bytecode(data, name, bytecode_path, source_path)

EOFError: marshal data too short
```
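
For what it's worth, "EOFError: marshal data too short" is raised when Python reads a truncated .pyc bytecode cache, so this looks like a corrupted install rather than a plotly or jupyterlab bug. A hedged fix is to clear the stale caches (reinstalling plotly works as well):

```python
# Sketch: delete plotly's bytecode caches; .pyc files are regenerated
# automatically on the next import, so this is safe to run.
import pathlib
import plotly

pkg_dir = pathlib.Path(plotly.__file__).parent
for pyc in pkg_dir.rglob("*.pyc"):
    pyc.unlink()
```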

Put together notebooks with tentative paper figure parts

We should start to create roughly one notebook for each potential figure. These notebooks should run everything exactly how we would want it in a paper. A starting point could be something along the lines of your poster:

Figure / notebook 1:

  • Raw phenotype data
  • Raw data / clustergram
  • We'll want to show how well replicates match up somehow...

Figure / notebook 2:

  • GridSearch with CV strategies 1 and 2.
  • Model predicted vs actual for best hyperparameter set
  • PLSR scores and loadings

Let me know what you think.

Dealing with pY, pT, and pS peptides at once?

It would be useful to make our method integrate the different phosphosite types, especially in the CPTAC database, where we have:

pS: 32,710 peptides; pY: 1,940 peptides; pT: 7,824 peptides; primed: 17,137 peptides

I was thinking we could cluster each of the pY, pS, and pT peptides in parallel, since they don't rely on each other, and this way we make sure they end up in different clusters. Then we would create an X matrix combining all clusters, which would then be used for regression; see the sketch below. What do you think?
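
A sketch of that plan, assuming the DDMC interface from the README and that the lowercase central residue of each 11-mer marks the phosphosite (an assumption about the data layout):

```python
import pandas as pd
from ddmc.clustering import DDMC

# Assumption: the central residue of each 11-mer motif is the phosphosite.
site_type = p_signal.index.map(lambda seq: "p" + seq[len(seq) // 2].upper())

# Cluster each phosphosite type in parallel, since they don't rely on each other...
centers = []
for label, subset in p_signal.groupby(site_type):
    model = DDMC(n_components=2, seq_weight=100).fit(subset)
    c = model.transform(as_df=True)  # samples x clusters for this site type
    c.columns = [f"{label}_{i}" for i in c.columns]
    centers.append(c)

# ...then combine all cluster centers into one X matrix for regression.
X = pd.concat(centers, axis=1)
```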

Cell Viability: Noisy initial seeding volume

As shown in the screenshots below, the initial number of AXL KO cells in experiment 4 is noticeably higher than the initial number of PC9 WT cells. However, in experiment 3, even though the initial number of AXL KO cells is moderately lower than the rest, the resulting correlation is good. Would you therefore consider the experiment 3 measurements reliable, or would we expect an even better correlation? @aarmey

[two screenshots attached]

Figures 1 & 3: seaborn bar plot misplaced and superimposed on the last subplot

The same issue occurs in figures 1 & 3: the seaborn bar plot I'm assigning to ax[0] is, for some reason, located in the last subplot, together with the plot that is correctly assigned to that position.

resistance-MS/msresist/figures/figure1.py, line 148 (BarPlot_UtErlAF154)

resistance-MS/msresist/figures/figure3.py, line 93 (BarPlot_FCendpoint)

There are many more things I have to fix or improve in these figures, but this is the only one I have no idea how to fix.
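
A guess at the cause (an assumption; I haven't checked figure1.py): seaborn draws on the *current* axes when no `ax=` argument is given, and after creating the grid, the current axes is the last subplot.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 3)
sns.barplot(x=["a", "b"], y=[1, 2])            # lands on ax[2], the active axes
sns.barplot(x=["a", "b"], y=[1, 2], ax=ax[0])  # explicitly targets ax[0]
```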

Plot cell death measurements with error bars

It'd be very useful to include a vector of cell death measurements across mutants in the AXL model. The data I currently have imported is very noisy across replicates. Hopefully this improves with the new values.

Move notebooks to figures

Where possible it will be helpful to move figures we will keep to figure files, as opposed to notebooks. These are easier to work with when comparing source code changes.

Feel free to close this if you think this wouldn't be helpful at the moment.

Pandas / Plotting Tutorial

I'm creating this issue to cover any questions you may have as you work on the tutorial. Feel free to use slack / zoom instead.

FYI: accidentally merged my branch into master

I accidentally merged my local branch into master directly, without opening a PR, while trying to open one. I'm not sure what happened, but I wanted to let you know. In these new edits I just organized the model functions into different files, as @aarmey suggested in #141, and updated the notebook with the growth factor / RTK data (BypassRegression.ipynb) that I was playing with while working on my proposal. I took screenshots of my terminal so I can briefly check this with you next week. Happy weekend. @scottdtaylor95

Another plot idea

This would help illustrate the hyperparameter, and also show that the sequence and data can agree, one can win out, or their combination can be unique.


Speed up binomial calculations

To speed up the binomial calculations, we need to represent the probability matrix as a true matrix, rather than a Pandas table. Rows should be amino acids (in whatever order so long as it's defined and kept constant), and columns should be motif position. In the preprocessing you can also convert all the sequence motifs from amino acid to row index (e.g., A -> 0, I -> 1, K -> 2...).

I think you can probably take your existing BPM matrix function, order the rows, and then apply X.to_numpy().

This alone will vastly speed things up; then, making everything base numpy will make it easier to convert the slow parts to C code.

https://github.com/meyer-lab/resistance-MS/blob/6d304675aafc03b5a209cf173d00aa99d1d4a056/msresist/sequence_analysis.py#L418
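
A sketch of that preprocessing (AA_ORDER, prob_df, and motifs are hypothetical stand-ins, not the repo's actual names):

```python
import numpy as np

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"                # fixed amino acid ordering
AA_INDEX = {aa: i for i, aa in enumerate(AA_ORDER)}

# Probability matrix as a true numpy array: rows = amino acids, columns =
# motif position. prob_df is the existing pandas BPM table, reordered first.
prob = prob_df.reindex(list(AA_ORDER)).to_numpy()

# Encode each motif once, mapping amino acids to row indices (A -> 0, C -> 1, ...).
motifs_idx = np.array([[AA_INDEX[aa] for aa in m.upper()] for m in motifs])

# The per-position probabilities for every motif become one fancy-index:
pos = np.arange(motifs_idx.shape[1])
per_position = prob[motifs_idx, pos]             # shape: (n_motifs, motif_length)
```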

Figure out what Cell Viability BRs (not) to use

The cell viability data seems pretty reproducible across replicates, especially if we take the fold-change after 24h, but I still think one of them generates most of the noise. I think it's BR1, but I'm not sure.

Incorrect z-scoring during cross-validation

I didn't look at exactly where this is being run, but the cross-validation performed here is wrong. The normalization has to occur during the training step and then be applied to the test data with the same stdev and mean calculated during training. sklearn.preprocessing.StandardScaler helps to perform this by separating fit_transform() (training) from transform() (test).

(sklearn.cross_decomposition.PLSRegression is supposed to handle this for you anyway when scale=True.)

https://github.com/meyer-lab/resistance-MS/blob/90d15e719a56e0fbf63ae81cb8c237470f01a971/msresist/figures/figure3.py#L208
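
A sketch of the correct pattern on stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 10)), rng.normal(size=40)  # stand-in data

# Manual version: mean/stdev are learned from the training fold only.
X_train, X_test = X[:30], X[30:]
scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)  # fit on the training fold
X_test_z = scaler.transform(X_test)        # reuse training mean/stdev

# Or let a Pipeline apply the same discipline inside cross-validation:
pipe = make_pipeline(StandardScaler(), PLSRegression(n_components=2))
scores = cross_val_score(pipe, X, y, cv=5)
```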

Update and Import new data

  • Update cell viability phase values. I'm not sure if the cell migration phase should be updated too?
  • Update green and red values for both.
  • Finish importing new cell migration replicates. Currently only imported RWD values for collagen BRs 1-4.

TerminatedWorkerError (due to memory?) in GridSearch(n_jobs=-1)

Only one of the checks failed, with the error below. It's coming from plotMixedClusteringPLSR_GridSearch in figure2.py:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGSEGV(-11)}

From what I've found online, it seems this is a fairly recent error from GridSearch: scikit-learn-contrib/skope-rules#18

What confuses me, though, is that all the figures are displayed correctly on Jenkins when I click on the checks that did pass. Also, I don't get any errors when I run this locally in my notebook.

Any suggestions? @aarmey

Different scores in different runs with same hyperparameters

I'm planning on running GridSearch; for that, I'd like to find a way to run the model a few times and have GridSearch only count the best score. As of now, since I'm chaining our co-clustering method with sklearn's PLSR, GridSearch uses PLSR's scoring method automatically. Do you have any suggestions on how to force GridSearch to run each hyperparameter combination a few times and use only the best score?
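
One possible route (a sketch, not the repo's code): wrap the chained estimator so each fit keeps the best of several random restarts, and hand that wrapper to GridSearchCV.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone

class BestOfNRuns(BaseEstimator, RegressorMixin):
    """Refit `estimator` n_restarts times and keep the best-scoring fit."""

    def __init__(self, estimator=None, n_restarts=3):
        self.estimator = estimator
        self.n_restarts = n_restarts

    def fit(self, X, y):
        best_score = -np.inf
        for seed in range(self.n_restarts):
            est = clone(self.estimator)
            if "random_state" in est.get_params():
                est.set_params(random_state=seed)
            est.fit(X, y)
            score = est.score(X, y)  # training R^2; CV scoring happens outside
            if score > best_score:
                best_score, self.best_estimator_ = score, est
        return self

    def predict(self, X):
        return self.best_estimator_.predict(X)
```

GridSearchCV would then search over estimator__-prefixed hyperparameters of the wrapped model, and each candidate's cross-validated score reflects its best restart.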

Speed up sequence calculations

Would you prefer to do this with the PAM250 matrix or binomial motif method first? They are both similar in complexity I think... I can probably write out a C++ implementation in a day or so, but will do your pick first so we can see whether that improves things.

Pomegranate only clusters X: array-like, shape (n_samples, n_dimensions)

When I try to cluster cell lines vs peptides it breaks, but when I cluster peptides vs cell lines it does work, which I don't understand. I've been stuck on this for a while now. I tried making the data set contain only float64, but I don't think this is caused by incompatible data types... Any suggestions? Thanks.
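
If I had to guess (unverified): pomegranate treats rows as samples and columns as dimensions, so clustering cell lines requires passing the transpose rather than swapping roles inside the DataFrame.

```python
import numpy as np
from pomegranate import GeneralMixtureModel, MultivariateGaussianDistribution

# Hypothetical data: rows are peptides, columns are cell lines.
X = np.asarray(data, dtype=np.float64)

# Rows are treated as samples, so this clusters peptides:
gmm = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X)

# To cluster cell lines instead, transpose so cell lines become the rows:
gmm_T = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X.T)
```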


AXL Ab ms data

We should look through this data this week, even if it's just a simple PCA plot. I'm free to do so pretty much any day.

Artificial missingness to evaluate our approach

With the CPTAC data, the core issue is that it is missing not at random (MNAR). My sense is that, particularly at higher levels of missingness, the sequence motif helps to correctly assign peptides to representative clusters.

A way we could evaluate this is to introduce artificial missingness by leaving a value out and then seeing how well that value is represented by the cluster.

Ultimately, this could look something like the sketch below. I think the biggest limitation will be runtime, since this would require fitting the clustering many times.

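A sketch of the evaluation loop, assuming the DDMC interface from the README; the per-peptide label accessor (model.predict()) is an assumption:

```python
import numpy as np
from ddmc.clustering import DDMC

rng = np.random.default_rng(0)

# Hide one observed value each for a random subset of peptides.
masked = p_signal.copy()
held_out = []
for seq in rng.choice(masked.index, size=100, replace=False):
    col = rng.choice(masked.columns[masked.loc[seq].notna()])
    held_out.append((seq, col, masked.loc[seq, col]))
    masked.loc[seq, col] = np.nan

# Refit on the masked data, then reconstruct each held-out value from the
# center of the cluster its peptide was assigned to.
model = DDMC(n_components=2, seq_weight=100).fit(masked)
centers = model.transform(as_df=True)              # samples x clusters
labels = dict(zip(masked.index, model.predict()))  # assumption: cluster labels
errors = [true - centers.loc[col, labels[seq]] for seq, col, true in held_out]
print("RMSE:", np.sqrt(np.mean(np.square(errors))))
```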

Update UniProt's proteome (a few p-site positions are off)

Some of the p-site positions (e.g., SHP2 Y584-p) don't match those of UniProt / PhosphoSitePlus (Y580-p) due to a recent update of UniProt's proteome. I tried to download the new proteome, but it's much bigger and has multiple entries for the same proteins, which made mapping difficult. Revisit in the future.

Random NaN values by pom

Sometimes I randomly get an AssertionError caused by NaN values generated by pomegranate's GMM method when computing the cluster probabilities... It's very infrequent, but it should be fixed.

Distance based on PAM matrix might be better

This would be more sensitive to finding the centroid of a cluster, particularly in the few-peptide case.

I think these papers sort of get at the idea:

http://graphics.idav.ucdavis.edu/~hamann/GuPochHamannKoehl_BIBM2007PaperFinal08202007.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.138.3308&rep=rep1&type=pdf

One simplistic way might be to calculate the average PAM250 distance between the peptide of interest and all members of the cluster. This doesn't compute a cluster "average", but still uses some representation of the group; see the sketch below.
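
A sketch of that simplistic approach with Biopython's PAM250 matrix (cluster_members is a hypothetical list of member sequences; note PAM250 is a similarity score, so higher means closer):

```python
import numpy as np
from Bio.Align import substitution_matrices

PAM250 = substitution_matrices.load("PAM250")

def pam250_similarity(a, b):
    """Summed PAM250 score between two equal-length motifs (higher = closer)."""
    return sum(PAM250[x, y] for x, y in zip(a.upper(), b.upper()))

def mean_similarity_to_cluster(peptide, cluster_members):
    # Average similarity between the peptide of interest and all members;
    # no explicit cluster "average" motif is ever computed.
    return np.mean([pam250_similarity(peptide, m) for m in cluster_members])
```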

Figure out Cell Migration noise

I tried plotting the RWD values of all 4 collagen replicates (and various combinations of them), and it was hard to find reproducible replicates, even though the trends did seem to recapitulate across experiments. I only tried RWD; maybe other metrics, such as % confluency in wound or wound width, work better.

Code warnings

There are a few warnings in the builds that we should look at. Namely, I noticed many of the following:

  • /usr/local/lib/python3.7/dist-packages/sklearn/cross_decomposition/_pls.py:321: UserWarning: Y residual constant at iteration 8
    warnings.warn('Y residual constant at iteration %s' % k)
  • /usr/local/lib/python3.7/dist-packages/Bio/Seq.py:182: BiopythonWarning: Biopython Seq objects now use string comparison. Older versions of Biopython used object comparison. During this transition, please use hash(id(my_seq)) or my_dict[id(my_seq)] if you want the old behaviour, or use hash(str(my_seq)) or my_dict[str(my_seq)] for the new string hashing behaviour.
    "the new string hashing behaviour.", BiopythonWarning)
  • /usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.mixture.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.mixture. Anything that cannot be imported from sklearn.mixture is now part of the private API.
    warnings.warn(message, FutureWarning)

Ping me if you'd like me to take a look. If you're sure we're not affected by either of the last two warnings, I can show you how to silence them.
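
If we decide we're not affected by the last two, silencing them could look like this sketch (the filters mirror the messages above):

```python
import warnings
from Bio import BiopythonWarning

# Suppress the Biopython Seq comparison transition warning.
warnings.filterwarnings("ignore", category=BiopythonWarning)

# Suppress the sklearn.mixture.base deprecation FutureWarning.
warnings.filterwarnings("ignore", category=FutureWarning,
                        module="sklearn.utils.deprecation")
```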

'Island' analysis

  • Create a new branch to work on this (only you will work on this branch before opening a new PR with your edits).
  • Add your files and analysis in the repository.
  • Publish your work.

Update pom's gmm in every m_step

I fitted pomegranate's GMM to the CPTAC data set, and it seems to do a great job handling data missingness. The only remaining issue is to find a way to update the GMM model in every M-step, as I currently do with sklearn's method (line 324). I haven't found a way to easily update the model.
@aarmey any suggestions? Thank you.

Here is pom's GMM documentation:
https://pomegranate.readthedocs.io/en/latest/GeneralMixtureModel.html

https://github.com/meyer-lab/resistance-MS/blob/e68bda30fedfd7cfc16992e9b82f409197d9962f/msresist/sequence_analysis.py#L321-L325
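
A sketch of one way to step the EM manually, using pomegranate's documented summarize / from_summaries pair; gmm, X, update_seq_distributions, and max_iter are hypothetical stand-ins for the repo's objects:

```python
for _ in range(max_iter):
    # E-step over the full data: per-peptide cluster responsibilities.
    resp = gmm.predict_proba(X)

    # One EM update of the GMM itself: summarize() accumulates the expected
    # sufficient statistics, from_summaries() applies the M-step.
    gmm.summarize(X)
    gmm.from_summaries()

    # Apply the same responsibilities to the sequence-model M-step.
    update_seq_distributions(resp)
```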
