meyer-lab / ddmc

Clusters phosphoproteomics data based on a combination of sequence information and abundance changes across conditions.

Home Page: https://asmlab.org

Languages: Python 99.34%, Makefile 0.66%
Topics: drug-resistance, cancer, phosphoproteomics, academic-project

ddmc's Introduction

Co-clustering for mass spectrometry peptide analysis


Clusters peptides based on both sequence similarity and phosphorylation signal across samples.

Usage

>>> from ddmc.clustering import DDMC

>>> # load dataset as p_signal...

>>> p_signal
             Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
Sequence                                                     
AAAAAsQQGSA -3.583614       NaN -0.662659 -1.320029 -0.730832
AAAAGsASPRS -0.174779 -1.796899  0.891798 -3.092941  2.394315
AAAAGsGPSPP -1.951552 -2.937095  2.692876 -2.344894  0.556615
AAAAGsGPsPP  3.666782       NaN -2.081231  0.989394       NaN
AAAAPsPGSAR  1.753855 -2.135835  0.896778  3.369230  2.020967
...               ...       ...       ...       ...       ...
YYSPYsVSGSG -3.502871  2.831169  3.383486  2.589559  3.624968
YYSSRsQSGGY -0.870365  0.887317  2.600291 -0.374107  3.285459
YYTAGyNSPVK  0.249539  2.047050 -0.286033  0.042650  2.863317
YYTSAsGDEMV  0.662787  0.135326 -1.004350  0.879398 -1.609894
YYYSSsEDEDS       NaN -1.101679 -3.273987 -0.872370 -1.735891

>>> p_signal.index  # p_signal.index contains the peptide sequences
Index(['AAAAAsQQGSA', 'AAAAGsASPRS', 'AAAAGsGPSPP', 'AAAAGsGPsPP',
       'AAAAPsPGSAR', 'AAAAPsPGsAR', 'AAAARsLLNHT', 'AAAARsPDRNL',
       'AAAARtQAPPT', 'AAADFsDEDED',
       ...
       'YYDRMySYPAR', 'YYEDDsEGEDI', 'YYGGGsEGGRA', 'YYRNNsFTAPS',
       'YYSPDyGLPSP', 'YYSPYsVSGSG', 'YYSSRsQSGGY', 'YYTAGyNSPVK',
       'YYTSAsGDEMV', 'YYYSSsEDEDS'],
      dtype='object', name='Sequence', length=30561)
      
>>> model = DDMC(n_components=2, seq_weight=100).fit(p_signal)  # fit model

>>> model.transform(as_df=True)  # get cluster centers
                 0         1
Sample 1  0.017644  0.370375
Sample 2 -0.003625 -0.914869
Sample 3 -0.087624 -0.682140
Sample 4  0.014644 -0.658907
Sample 5  0.023885  0.196063
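
Since the model follows a scikit-learn-style fit/transform interface, per-peptide cluster assignments should also be recoverable after fitting. The call below is a sketch under that assumption, not documented behavior:

>>> labels = model.predict()  # assumption: per-peptide cluster labels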

ddmc's People

Contributors: aarmey, armaan-abraham, farnazmdi, github-actions[bot], lkargi, mcreixell, scottdtaylor95

Forkers: complexdata

ddmc's Issues

Script to profile

I'd like to do a little profiling to see if I can speed up fitting. Do you possibly have a python script that runs fitting? (Using a jupyter notebook is a bit more difficult.)
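If it helps, a minimal profiling script could look like the sketch below, which assumes the fitting interface shown in the README above; the CSV path and load step are hypothetical stand-ins for the real loader.

```python
# profile_fit.py -- a sketch of a profiling script for model fitting.
import cProfile
import pstats

import pandas as pd
from ddmc.clustering import DDMC

# Hypothetical load step; replace with however p_signal is actually built.
p_signal = pd.read_csv("p_signal.csv", index_col="Sequence")

profiler = cProfile.Profile()
profiler.enable()
DDMC(n_components=2, seq_weight=100).fit(p_signal)
profiler.disable()

# Print the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```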

EOFError when trying to use plotly, jupyterlab v2.0 issue?

When trying to run plotpca_ScoresLoadings_plotly (figure1.py), I'm getting the error below. I didn't change anything in this function (that I'm aware of), and I'm not familiar with this error at all. I was thinking this may have something to do with the recent upgrade of jupyterlab to v2.0? This is not urgent, so whenever you get a chance works great. Thanks.

```
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input> in <module>
----> 1 plotpca_ScoresLoadings_plotly(d1.T, "AXLmuts + AF154 BR1", "AXL")

~/resistance-MS/msresist/figures/figure1.py in plotpca_ScoresLoadings_plotly(data, title, loc)
    309         print(loadings.loc[loc])
    310
--> 311     fig = make_subplots(rows=1, cols=2, subplot_titles=("PCA Scores", "PCA Loadings"))
    312     fig.add_trace(
    313         go.Scatter(

/usr/local/lib/python3.7/dist-packages/plotly/subplots.py in make_subplots(rows, cols, shared_xaxes, shared_yaxes, start_cell, print_grid, horizontal_spacing, vertical_spacing, subplot_titles, column_widths, row_heights, specs, insets, column_titles, row_titles, x_title, y_title, **kwargs)
    810
    811     # Build resulting figure
--> 812     fig = go.Figure(layout=layout)
    813
    814     # Attach subplot grid info to the figure

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/_figure.py in __init__(self, data, layout, frames, skip_invalid, **kwargs)
    609             is invalid AND skip_invalid is False
    610         """
--> 611         super(Figure, self).__init__(data, layout, frames, skip_invalid, **kwargs)
    612
    613     def add_area(

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __init__(self, data, layout_plotly, frames, skip_invalid, **kwargs)
    268         # --------
    269         # ### Check for default template ###
--> 270         self._initialize_layout_template()
    271
    272         # Process kwargs

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _initialize_layout_template(self)
   1948         if self._layout_obj.template is None:
   1949             if pio.templates.default is not None:
-> 1950                 self._layout_obj.template = pio.templates.default
   1951             else:
   1952                 self._layout_obj.template = None

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setattr__(self, prop, value)
   4449         if match is None:
   4450             # Set as ordinary property
-> 4451             super(BaseLayoutHierarchyType, self).__setattr__(prop, value)
   4452         else:
   4453             # Set as subplotid property

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setattr__(self, prop, value)
   3514         if prop.startswith("_") or hasattr(self, prop) or prop in self._validators:
   3515             # Let known properties and private properties through
-> 3516             super(BasePlotlyType, self).__setattr__(prop, value)
   3517         else:
   3518             # Raise error on unknown public properties

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/__init__.py in template(self, val)
  96981     @template.setter
  96982     def template(self, val):
> 96983         self["template"] = val
  96984
  96985     # ternary

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   4435         if match is None:
   4436             # Set as ordinary property
-> 4437             super(BaseLayoutHierarchyType, self).__setitem__(prop, value)
   4438         else:
   4439             # Set as subplotid property

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   3480         # ### Handle compound property ###
   3481         if isinstance(validator, CompoundValidator):
-> 3482             self._set_compound_prop(prop, value)
   3483
   3484         # ### Handle compound array property ###

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _set_compound_prop(self, prop, val)
   3834         # ------------
   3835         validator = self._validators.get(prop)
-> 3836         val = validator.validate_coerce(val, skip_invalid=self._skip_invalid)
   3837
   3838         # Save deep copies of current and new states

/usr/local/lib/python3.7/dist-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v, skip_invalid)
   2714         # (could be any hashable object)
   2715         if v in pio.templates:
-> 2716             return copy.deepcopy(pio.templates[v])
   2717         # Otherwise, if v is a string, check to see if it consists of
   2718         # multiple template names joined on '+' characters

/usr/local/lib/python3.7/dist-packages/plotly/io/_templates.py in __getitem__(self, item)
     89             template_str = pkgutil.get_data("plotly", path).decode("utf-8")
     90             template_dict = json.loads(template_str)
---> 91             template = Template(template_dict)
     92
     93             self._templates[template_name] = template

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/layout/__init__.py in __init__(self, arg, data, layout, **kwargs)
   9448         # ----------------------------------
   9449         _v = arg.pop("data", None)
-> 9450         self["data"] = data if data is not None else _v
   9451         _v = arg.pop("layout", None)
   9452         self["layout"] = layout if layout is not None else _v

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   3480         # ### Handle compound property ###
   3481         if isinstance(validator, CompoundValidator):
-> 3482             self._set_compound_prop(prop, value)
   3483
   3484         # ### Handle compound array property ###

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _set_compound_prop(self, prop, val)
   3834         # ------------
   3835         validator = self._validators.get(prop)
-> 3836         val = validator.validate_coerce(val, skip_invalid=self._skip_invalid)
   3837
   3838         # Save deep copies of current and new states

/usr/local/lib/python3.7/dist-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v, skip_invalid)
   2442
   2443         elif isinstance(v, dict):
-> 2444             v = self.data_class(v, skip_invalid=skip_invalid)
   2445
   2446         elif isinstance(v, self.data_class):

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/layout/template/__init__.py in __init__(self, arg, area, barpolar, bar, box, candlestick, carpet, choroplethmapbox, choropleth, cone, contourcarpet, contour, densitymapbox, funnelarea, funnel, heatmapgl, heatmap, histogram2dcontour, histogram2d, histogram, image, indicator, isosurface, mesh3d, ohlc, parcats, parcoords, pie, pointcloud, sankey, scatter3d, scattercarpet, scattergeo, scattergl, scattermapbox, scatterpolargl, scatterpolar, scatter, scatterternary, splom, streamtube, sunburst, surface, table, treemap, violin, volume, waterfall, **kwargs)
   1595         self["scattergeo"] = scattergeo if scattergeo is not None else _v
   1596         _v = arg.pop("scattergl", None)
-> 1597         self["scattergl"] = scattergl if scattergl is not None else _v
   1598         _v = arg.pop("scattermapbox", None)
   1599         self["scattermapbox"] = scattermapbox if scattermapbox is not None else _v

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in __setitem__(self, prop, value)
   3484         # ### Handle compound array property ###
   3485         elif isinstance(validator, (CompoundArrayValidator, BaseDataValidator)):
-> 3486             self._set_array_prop(prop, value)
   3487
   3488         # ### Handle simple property ###

/usr/local/lib/python3.7/dist-packages/plotly/basedatatypes.py in _set_array_prop(self, prop, val)
   3908         # ------------
   3909         validator = self._validators.get(prop)
-> 3910         val = validator.validate_coerce(val, skip_invalid=self._skip_invalid)
   3911
   3912         # Save deep copies of current and new states

/usr/local/lib/python3.7/dist-packages/_plotly_utils/basevalidators.py in validate_coerce(self, v, skip_invalid)
   2530                     res.append(self.data_class(v_el))
   2531                 elif isinstance(v_el, dict):
-> 2532                     res.append(self.data_class(v_el, skip_invalid=skip_invalid))
   2533                 else:
   2534                     if skip_invalid:

/usr/local/lib/python3.7/dist-packages/plotly/graph_objs/__init__.py in __init__(self, arg, connectgaps, customdata, customdatasrc, dx, dy, error_x, error_y, fill, fillcolor, hoverinfo, hoverinfosrc, hoverlabel, hovertemplate, hovertemplatesrc, hovertext, hovertextsrc, ids, idssrc, legendgroup, line, marker, meta, metasrc, mode, name, opacity, selected, selectedpoints, showlegend, stream, text, textfont, textposition, textpositionsrc, textsrc, texttemplate, texttemplatesrc, uid, uirevision, unselected, visible, x, x0, xaxis, xcalendar, xsrc, y, y0, yaxis, ycalendar, ysrc, **kwargs)
  28533         # Import validators
  28534         # -----------------
> 28535         from plotly.validators import scattergl as v_scattergl
  28536
  28537         # Initialize validators

/usr/lib/python3.7/importlib/_bootstrap.py in _find_and_load(name, import_)

/usr/lib/python3.7/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

/usr/lib/python3.7/importlib/_bootstrap.py in _load_unlocked(spec)

/usr/lib/python3.7/importlib/_bootstrap_external.py in exec_module(self, module)

/usr/lib/python3.7/importlib/_bootstrap_external.py in get_code(self, fullname)

/usr/lib/python3.7/importlib/_bootstrap_external.py in _compile_bytecode(data, name, bytecode_path, source_path)

EOFError: marshal data too short
```
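
For what it's worth, "EOFError: marshal data too short" is raised when Python reads a truncated .pyc bytecode cache, so this looks like a corrupted install rather than a plotly or jupyterlab bug. A hedged fix is to clear the stale caches (reinstalling plotly works as well):

```python
# Sketch: delete plotly's bytecode caches; .pyc files are regenerated
# automatically on the next import, so this is safe to run.
import pathlib
import plotly

pkg_dir = pathlib.Path(plotly.__file__).parent
for pyc in pkg_dir.rglob("*.pyc"):
    pyc.unlink()
```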

Put together notebooks with tentative paper figure parts

We should start to create roughly one notebook for each potential figure. These notebooks should run everything exactly how we would want it in a paper. A starting point could be something along the lines of your poster:

Figure / notebook 1:

  • Raw phenotype data
  • Raw data / clustergram
  • We'll want to show how well replicates match up somehow...

Figure / notebook 2:

  • GridSearch with CV strategies 1 and 2.
  • Model predicted vs actual for best hyperparameter set
  • PLSR scores and loadings

Let me know what you think.

Dealing with pY, pT, and pS peptides at once?

It would be useful to make our method integrate the different phosphosite types, especially in the CPTAC database, where we have:

pS: 32,710 peptides; pY: 1,940 peptides; pT: 7,824 peptides; primed: 17,137 peptides

I was thinking we could cluster each of the pY, pS, and pT peptides in parallel, since they don't rely on each other, and this way we make sure they end up in different clusters. Then we would create an X matrix combining all clusters, which would then be used for regression; see the sketch below. What do you think?
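
A sketch of that plan, assuming the DDMC interface from the README and that the lowercase central residue of each 11-mer marks the phosphosite (an assumption about the data layout):

```python
import pandas as pd
from ddmc.clustering import DDMC

# Assumption: the central residue of each 11-mer motif is the phosphosite.
site_type = p_signal.index.map(lambda seq: "p" + seq[len(seq) // 2].upper())

# Cluster each phosphosite type in parallel, since they don't rely on each other...
centers = []
for label, subset in p_signal.groupby(site_type):
    model = DDMC(n_components=2, seq_weight=100).fit(subset)
    c = model.transform(as_df=True)  # samples x clusters for this site type
    c.columns = [f"{label}_{i}" for i in c.columns]
    centers.append(c)

# ...then combine all cluster centers into one X matrix for regression.
X = pd.concat(centers, axis=1)
```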

Cell Viability: Noisy initial seeding volume

As shown in the screenshots below, the initial number of AXL KO cells in experiment 4 is noticeably higher than the initial number of PC9 WT cells. However, in experiment 3, even though the initial number of AXL KO cells is moderately lower than the rest, the resulting correlation is good. Would you therefore consider the experiment 3 measurements reliable, or would we expect an even better correlation? @aarmey

[two screenshots attached]

Figures 1 & 3: seaborn bar plot misplaced and superimposed on the last subplot

The same issue occurs in figures 1 & 3: the seaborn bar plot I'm assigning to ax[0] is, for some reason, located in the last subplot, together with the plot that is correctly assigned to that position.

resistance-MS/msresist/figures/figure1.py, line 148 (BarPlot_UtErlAF154)

resistance-MS/msresist/figures/figure3.py, line 93 (BarPlot_FCendpoint)

There are many more things I have to fix or improve in these figures, but this is the only one I have no idea how to fix.
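
A guess at the cause (an assumption; I haven't checked figure1.py): seaborn draws on the *current* axes when no `ax=` argument is given, and after creating the grid, the current axes is the last subplot.

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1, 3)
sns.barplot(x=["a", "b"], y=[1, 2])            # lands on ax[2], the active axes
sns.barplot(x=["a", "b"], y=[1, 2], ax=ax[0])  # explicitly targets ax[0]
```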

Plot cell death measurements with error bars

It'd be very useful to include a vector of cell death measurements across mutants in the AXL model. The data I currently have imported is very noisy across replicates. Hopefully this improves with the new values.

Move notebooks to figures

Where possible it will be helpful to move figures we will keep to figure files, as opposed to notebooks. These are easier to work with when comparing source code changes.

Feel free to close this if you think this wouldn't be helpful at the moment.

Pandas / Plotting Tutorial

I'm creating this issue to cover any questions you may have as you work on the tutorial. Feel free to use slack / zoom instead.

FYI: accidentally merged my branch into master

I accidentally merged my local branch into master directly, without opening a PR, while trying to open one. I'm not sure what happened, but I wanted to let you know. In these new edits I just organized the model functions into different files, as @aarmey suggested in #141, and updated the notebook with the growth factor / RTK data (BypassRegression.ipynb) that I was playing with while working on my proposal. I took screenshots of my terminal so I can briefly check this with you next week. Happy weekend. @scottdtaylor95

Another plot idea

This would help illustrate the hyperparameter, and also show that the sequence and data can agree, one can win out, or their combination can be unique.


Speed up binomial calculations

To speed up the binomial calculations, we need to represent the probability matrix as a true matrix, rather than a Pandas table. Rows should be amino acids (in whatever order so long as it's defined and kept constant), and columns should be motif position. In the preprocessing you can also convert all the sequence motifs from amino acid to row index (e.g., A -> 0, I -> 1, K -> 2...).

I think you can probably take your existing BPM matrix function, order the rows, and then apply X.to_numpy().

This alone will vastly speed things up; then, making everything base numpy will make it easier to convert the slow parts to C code.

https://github.com/meyer-lab/resistance-MS/blob/6d304675aafc03b5a209cf173d00aa99d1d4a056/msresist/sequence_analysis.py#L418
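
A sketch of that preprocessing (AA_ORDER, prob_df, and motifs are hypothetical stand-ins, not the repo's actual names):

```python
import numpy as np

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"                # fixed amino acid ordering
AA_INDEX = {aa: i for i, aa in enumerate(AA_ORDER)}

# Probability matrix as a true numpy array: rows = amino acids, columns =
# motif position. prob_df is the existing pandas BPM table, reordered first.
prob = prob_df.reindex(list(AA_ORDER)).to_numpy()

# Encode each motif once, mapping amino acids to row indices (A -> 0, C -> 1, ...).
motifs_idx = np.array([[AA_INDEX[aa] for aa in m.upper()] for m in motifs])

# The per-position probabilities for every motif become one fancy-index:
pos = np.arange(motifs_idx.shape[1])
per_position = prob[motifs_idx, pos]             # shape: (n_motifs, motif_length)
```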

Figure out what Cell Viability BRs (not) to use

The cell viability data seems pretty reproducible across replicates, especially if we take the fold-change after 24h, but I still think one of them generates most of the noise. I think it's BR1, but I'm not sure.

Incorrect z-scoring during cross-validation

I didn't look at exactly where this is being run, but the cross-validation performed here is wrong. The normalization has to occur during the training step and then be applied to the test data with the same stdev and mean calculated during training. sklearn.preprocessing.StandardScaler helps to perform this by separating fit_transform() (training) from transform() (test).

(sklearn.cross_decomposition.PLSRegression is supposed to handle this for you anyway when scale=True.)

https://github.com/meyer-lab/resistance-MS/blob/90d15e719a56e0fbf63ae81cb8c237470f01a971/msresist/figures/figure3.py#L208
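
A sketch of the correct pattern on stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(40, 10)), rng.normal(size=40)  # stand-in data

# Manual version: mean/stdev are learned from the training fold only.
X_train, X_test = X[:30], X[30:]
scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)  # fit on the training fold
X_test_z = scaler.transform(X_test)        # reuse training mean/stdev

# Or let a Pipeline apply the same discipline inside cross-validation:
pipe = make_pipeline(StandardScaler(), PLSRegression(n_components=2))
scores = cross_val_score(pipe, X, y, cv=5)
```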

Update and Import new data

  • Update cell viability phase values. I'm not sure if the cell migration phase should be updated too?
  • Update green and red values for both.
  • Finish importing new cell migration replicates. Currently only imported RWD values for collagen BRs 1-4.

TerminatedWorkerError (due to memory?) in GridSearch(n_jobs=-1)

Only one of the checks failed, with the error below. It's coming from plotMixedClusteringPLSR_GridSearch in figure2.py:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGSEGV(-11)}

From what I've found online, it seems this is a fairly recent error from GridSearch: scikit-learn-contrib/skope-rules#18

What confuses me, though, is that all the figures are displayed correctly on Jenkins when I click on the checks that did pass. Also, I don't get any errors when I run this locally in my notebook.

Any suggestions? @aarmey

Different scores in different runs with same hyperparameters

I'm planning on running GridSearch; for that, I'd like to find a way to run the model a few times and have GridSearch only count the best score. As of now, since I'm chaining our co-clustering method with sklearn's PLSR, GridSearch uses PLSR's scoring method automatically. Do you have any suggestions on how to force GridSearch to run each hyperparameter combination a few times and use only the best score?
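
One possible route (a sketch, not the repo's code): wrap the chained estimator so each fit keeps the best of several random restarts, and hand that wrapper to GridSearchCV.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone

class BestOfNRuns(BaseEstimator, RegressorMixin):
    """Refit `estimator` n_restarts times and keep the best-scoring fit."""

    def __init__(self, estimator=None, n_restarts=3):
        self.estimator = estimator
        self.n_restarts = n_restarts

    def fit(self, X, y):
        best_score = -np.inf
        for seed in range(self.n_restarts):
            est = clone(self.estimator)
            if "random_state" in est.get_params():
                est.set_params(random_state=seed)
            est.fit(X, y)
            score = est.score(X, y)  # training R^2; CV scoring happens outside
            if score > best_score:
                best_score, self.best_estimator_ = score, est
        return self

    def predict(self, X):
        return self.best_estimator_.predict(X)
```

GridSearchCV would then search over estimator__-prefixed hyperparameters of the wrapped model, and each candidate's cross-validated score reflects its best restart.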

Speed up sequence calculations

Would you prefer to do this with the PAM250 matrix or binomial motif method first? They are both similar in complexity I think... I can probably write out a C++ implementation in a day or so, but will do your pick first so we can see whether that improves things.

Pomegranate only clusters X: array-like, shape (n_samples, n_dimensions)

When I try to cluster cell lines vs peptides it breaks, but when I cluster peptides vs cell lines it does work, which I don't understand. I've been stuck on this for a while now. I tried making the data set contain only float64, but I don't think this is caused by incompatible data types... Any suggestions? Thanks.
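
If I had to guess (unverified): pomegranate treats rows as samples and columns as dimensions, so clustering cell lines requires passing the transpose rather than swapping roles inside the DataFrame.

```python
import numpy as np
from pomegranate import GeneralMixtureModel, MultivariateGaussianDistribution

# Hypothetical data: rows are peptides, columns are cell lines.
X = np.asarray(data, dtype=np.float64)

# Rows are treated as samples, so this clusters peptides:
gmm = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X)

# To cluster cell lines instead, transpose so cell lines become the rows:
gmm_T = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X.T)
```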


AXL Ab ms data

We should look through this data this week, even if it's just a simple PCA plot. I'm free to do so pretty much any day.

Artificial missingness to evaluate our approach

With the CPTAC data, the core issue is that it is missing not at random (MNAR). My sense is that, particularly at higher levels of missingness, the sequence motif helps to correctly assign peptides to representative clusters.

A way we could evaluate this is to introduce artificial missingness by leaving a value out and then seeing how well that value is represented by the cluster.

Ultimately, this could look something like the sketch below. I think the biggest limitation will be runtime, since this would require fitting the clustering many times.

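A sketch of the evaluation loop, assuming the DDMC interface from the README; the per-peptide label accessor (model.predict()) is an assumption:

```python
import numpy as np
from ddmc.clustering import DDMC

rng = np.random.default_rng(0)

# Hide one observed value each for a random subset of peptides.
masked = p_signal.copy()
held_out = []
for seq in rng.choice(masked.index, size=100, replace=False):
    col = rng.choice(masked.columns[masked.loc[seq].notna()])
    held_out.append((seq, col, masked.loc[seq, col]))
    masked.loc[seq, col] = np.nan

# Refit on the masked data, then reconstruct each held-out value from the
# center of the cluster its peptide was assigned to.
model = DDMC(n_components=2, seq_weight=100).fit(masked)
centers = model.transform(as_df=True)              # samples x clusters
labels = dict(zip(masked.index, model.predict()))  # assumption: cluster labels
errors = [true - centers.loc[col, labels[seq]] for seq, col, true in held_out]
print("RMSE:", np.sqrt(np.mean(np.square(errors))))
```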

Update UniProt's proteome (a few p-site positions are off)

Some of the p-site positions (e.g., SHP2 Y584-p) don't match those of UniProt / PhosphoSitePlus (Y580-p) due to a recent update of UniProt's proteome. I tried to download the new proteome, but it's much bigger and has multiple entries for the same proteins, which made mapping difficult. Revisit in the future.

Random NaN values by pom

Sometimes I randomly get an AssertionError caused by NaN values generated by pomegranate's GMM method when computing the cluster probabilities... It's very infrequent, but it should be fixed.

Distance based on PAM matrix might be better

This would be more sensitive to finding the centroid of a cluster, particularly in the few-peptide case.

I think these papers sort of get at the idea:

http://graphics.idav.ucdavis.edu/~hamann/GuPochHamannKoehl_BIBM2007PaperFinal08202007.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.138.3308&rep=rep1&type=pdf

One simplistic way might be to calculate the average PAM250 distance between the peptide of interest and all members of the cluster. This doesn't compute a cluster "average", but still uses some representation of the group; see the sketch below.
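
A sketch of that simplistic approach with Biopython's PAM250 matrix (cluster_members is a hypothetical list of member sequences; note PAM250 is a similarity score, so higher means closer):

```python
import numpy as np
from Bio.Align import substitution_matrices

PAM250 = substitution_matrices.load("PAM250")

def pam250_similarity(a, b):
    """Summed PAM250 score between two equal-length motifs (higher = closer)."""
    return sum(PAM250[x, y] for x, y in zip(a.upper(), b.upper()))

def mean_similarity_to_cluster(peptide, cluster_members):
    # Average similarity between the peptide of interest and all members;
    # no explicit cluster "average" motif is ever computed.
    return np.mean([pam250_similarity(peptide, m) for m in cluster_members])
```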

Figure out Cell Migration noise

I tried plotting the RWD values of all 4 collagen replicates (and various combinations of them), and it was hard to find reproducible replicates, even though the trends did seem to recapitulate across experiments. I only tried RWD; maybe other metrics, such as % confluency in wound or wound width, work better.

Code warnings

There are a few warnings in the builds that we should look at. Namely, I noticed many of the following:

  • /usr/local/lib/python3.7/dist-packages/sklearn/cross_decomposition/_pls.py:321: UserWarning: Y residual constant at iteration 8
    warnings.warn('Y residual constant at iteration %s' % k)
  • /usr/local/lib/python3.7/dist-packages/Bio/Seq.py:182: BiopythonWarning: Biopython Seq objects now use string comparison. Older versions of Biopython used object comparison. During this transition, please use hash(id(my_seq)) or my_dict[id(my_seq)] if you want the old behaviour, or use hash(str(my_seq)) or my_dict[str(my_seq)] for the new string hashing behaviour.
    "the new string hashing behaviour.", BiopythonWarning)
  • /usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.mixture.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.mixture. Anything that cannot be imported from sklearn.mixture is now part of the private API.
    warnings.warn(message, FutureWarning)

Ping me if you'd like me to take a look. If you're sure we're not affected by either of the last two warnings, I can show you how to silence them.
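
If we decide we're not affected by the last two, silencing them could look like this sketch (the filters mirror the messages above):

```python
import warnings
from Bio import BiopythonWarning

# Suppress the Biopython Seq comparison transition warning.
warnings.filterwarnings("ignore", category=BiopythonWarning)

# Suppress the sklearn.mixture.base deprecation FutureWarning.
warnings.filterwarnings("ignore", category=FutureWarning,
                        module="sklearn.utils.deprecation")
```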

'Island' analysis

  • Create a new branch to work on this (only you will work on this branch before opening a new PR with your edits).
  • Add your files and analysis in the repository.
  • Publish your work.

Update pom's gmm in every m_step

I fitted pomegranate's GMM to the CPTAC data set, and it seems to do a great job handling data missingness. The only remaining issue is to find a way to update the GMM model in every M-step, as I currently do with sklearn's method (line 324). I haven't found a way to easily update the model.
@aarmey any suggestions? Thank you.

Here is pom's GMM documentation:
https://pomegranate.readthedocs.io/en/latest/GeneralMixtureModel.html

https://github.com/meyer-lab/resistance-MS/blob/e68bda30fedfd7cfc16992e9b82f409197d9962f/msresist/sequence_analysis.py#L321-L325
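
A sketch of one way to step the EM manually, using pomegranate's documented summarize / from_summaries pair; gmm, X, update_seq_distributions, and max_iter are hypothetical stand-ins for the repo's objects:

```python
for _ in range(max_iter):
    # E-step over the full data: per-peptide cluster responsibilities.
    resp = gmm.predict_proba(X)

    # One EM update of the GMM itself: summarize() accumulates the expected
    # sufficient statistics, from_summaries() applies the M-step.
    gmm.summarize(X)
    gmm.from_summaries()

    # Apply the same responsibilities to the sequence-model M-step.
    update_seq_distributions(resp)
```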
