Python implementation of the supervised graph prediction method proposed in http://arxiv.org/abs/2202.03813 for solving the metabolite identification problem.

License: MIT License

Metabolite identification with fused Gromov-Wasserstein

This repository contains a Python implementation of the supervised graph prediction method proposed in [1] for solving the metabolite identification problem.

Metabolite identification problem

An important problem in metabolomics is to identify the small molecules, called metabolites, that are present in a biological sample. Mass spectrometry is a widespread method to extract distinctive features from a biological sample in the form of a tandem mass (MS/MS) spectrum.

Various machine learning methods have been proposed to solve this problem by learning to predict metabolites from mass spectra, using a training set of (mass spectrum, molecular structure) pairs. This problem belongs to the challenging family of structured prediction problems.

Dataset. The data used in [1] are available for download from https://zenodo.org/record/804241#.Yi9bzS_pNhE. They consist of 4138 labeled examples, also used in [2] to evaluate the performance of metabolite identification from tandem mass spectra. These data were extracted and processed in [3]. The MS/MS spectra were extracted from the GNPS public spectral library (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp). The candidate sets were built with molecular structures from PubChem.

Quick start

Load data.

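from Utils.load_data import load_dataset_kernel_graph

n_tr = 3000  # number of training examples
n_te = 1148  # number of test examples
D_tr, D_te = load_dataset_kernel_graph(n_tr)
K_tr, Y_tr = D_tr  # training kernel (Gram) matrix and training outputs
K_tr_te, K_te_te, Y_te = D_te  # train/test kernel, test/test kernel values, test outputs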

Input pre-processing (optional). Centering and normalizing the input kernel improves the statistical performance.

from Utils.metabolites_utils import center_gram_matrix, normalize_gram_matrix

center, normalize = True, True
if center:
    # K_tr_te is transformed first, while K_tr still holds its original values.
    K_tr_te = center_gram_matrix(K_tr_te, K_tr, K_tr_te, K_tr)
    K_tr = center_gram_matrix(K_tr)
if normalize:
    # Same ordering: K_tr_te is normalized before K_tr is overwritten.
    K_tr_te = normalize_gram_matrix(K_tr_te, K_tr, K_te_te)
    K_tr = normalize_gram_matrix(K_tr)
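
For intuition, centering a Gram matrix amounts to subtracting the mean feature vector in feature space, and normalizing rescales it so that every point has unit norm. The following is a minimal pure-Python sketch of these two operations on a square training Gram matrix only. The helper names `center_gram` and `normalize_gram` are hypothetical: they are not the repository's `center_gram_matrix` and `normalize_gram_matrix`, whose exact behavior (in particular for the train/test kernel) may differ.

```python
import math


def center_gram(K):
    """Center a square Gram matrix: K - row means - column means + grand mean."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def normalize_gram(K):
    """Rescale so that entry (i, j) becomes K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(K)
    diag = [math.sqrt(K[i][i]) for i in range(n)]
    return [[K[i][j] / (diag[i] * diag[j]) for j in range(n)] for i in range(n)]


# Toy linear kernel built from three 1-D points x = 1, 2, 3 (K[i][j] = x_i * x_j).
K = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 6.0, 9.0]]

K_centered = center_gram(K)      # equals the kernel of the centered points -1, 0, 1
K_normalized = normalize_gram(K)  # every diagonal entry becomes 1
```

In this toy example the two operations are applied separately, because the centered point at the origin has zero norm and could not be normalized.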

Train.

from Methods.method_gromov_wasserstein import FgwEstimator
from Utils.diffusion import diffuse

clf = FgwEstimator()
clf.ground_metric = 'diffuse'
L = 1e-4  # kernel ridge regularization parameter
clf.tau = 0.6  # the larger tau is, the more similar the features of neighboring atoms become; this affects the FGW ground metric
Y_tr = diffuse(Y_tr, clf.tau)  # diffuse the atom features of the training outputs
clf.train(K_tr, Y_tr, L)
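
To illustrate the role of tau, here is a minimal sketch of feature diffusion on a molecular graph. It assumes that diffusion means applying the heat kernel exp(-tau * Lap) of the graph Laplacian Lap = D - A to the atom features, which is an assumption of this sketch and not a description of the repository's `diffuse` function. The names `heat_kernel` and `diffuse_features` are hypothetical, and the matrix exponential is approximated by a truncated Taylor series to stay dependency-free.

```python
def mat_mul(A, B):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def heat_kernel(adjacency, tau, n_terms=30):
    """Approximate exp(-tau * (D - A)) with the first n_terms Taylor terms."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    # M = -tau * Laplacian
    M = [[-tau * ((degree[i] if i == j else 0.0) - adjacency[i][j]) for j in range(n)]
         for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, n_terms):
        # term_k = term_{k-1} @ M / k, so that term_k = M^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def diffuse_features(adjacency, features, tau):
    """Smooth the atom features along the bonds of the graph."""
    return mat_mul(heat_kernel(adjacency, tau), features)


# Toy molecule: a 3-atom chain C - C - O with one-hot features over (C, O).
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
features = [[1.0, 0.0],
            [1.0, 0.0],
            [0.0, 1.0]]

diffused = diffuse_features(adjacency, features, tau=0.6)
undiffused = diffuse_features(adjacency, features, tau=0.0)
```

With tau = 0 the features are unchanged; as tau grows, the oxygen atom and its carbon neighbor get closer feature vectors, which is the effect described in the comment on `clf.tau`.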

Predict and compute the test scores.

n_bary = 5  # number of weights alpha_i(x) kept when predicting
n_c_max = 500  # skip test inputs with more than n_c_max candidates
fgw, topk, n_pred = clf.predict(K_tr_te, n_bary=n_bary, Y_te=Y_te, n_c_max=n_c_max)

You should obtain the following results:

  • Mean FGW = 0.14209778 ± 0.0460629.
  • Top-1 = 32.6%, Top-10 = 62.7%, Top-20 = 72.5%.
  • Number of predictions = 448.
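
As a reminder of what the Top-k figures measure, the sketch below computes top-k accuracy from per-candidate scores, assuming that a lower score means a better candidate (as with a distance) and that ties are resolved in favor of the true candidate. The function names and the toy scores are hypothetical and are not taken from the repository.

```python
def rank_of_true(scores, true_index):
    """Rank (1 = best) of the true candidate when lower scores are better."""
    return 1 + sum(1 for s in scores if s < scores[true_index])


def top_k_accuracy(true_ranks, k):
    """Fraction of test inputs whose true candidate is ranked k or better."""
    return sum(1 for r in true_ranks if r <= k) / len(true_ranks)


# Toy example: three test inputs, each with its own candidate set.
# In each set, the true molecule is stored at index 0.
scores_per_input = [
    [0.1, 0.3, 0.2],        # true candidate ranked 1st
    [0.5, 0.4, 0.9],        # true candidate ranked 2nd
    [0.7, 0.2, 0.1, 0.3],   # true candidate ranked 4th
]
true_ranks = [rank_of_true(scores, 0) for scores in scores_per_input]

top1 = top_k_accuracy(true_ranks, 1)
top2 = top_k_accuracy(true_ranks, 2)
```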

Reproducing the experiments in [1]

Brogat-Motte et al. [1] (Section 6.2) experimentally assess the performance of fused Gromov-Wasserstein (FGW) barycenters for predicting metabolites from mass spectra. In particular, they compare the prediction performance obtained with different ground metrics between atoms in the FGW distance between molecules (Vayer et al. [4]).
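
For reference, the FGW distance of [4] between two attributed graphs can be written as follows. The notation is chosen for this sketch and may differ from that of [1]: the first graph has n atoms with features a_i, a structure matrix C_1 (for example its adjacency matrix) and node weights h; the second has m atoms with features b_j, structure matrix C_2 and node weights g; beta in [0, 1] trades off features against structure; and d is the ground metric between atom features, which is the ingredient that the compared variants change.

```latex
\mathrm{FGW}_{q,\beta}
= \min_{\pi \in \Pi(h, g)} \sum_{i,j,k,l}
  \Big[ (1-\beta)\, d(a_i, b_j)^q
        + \beta\, \big| C_1(i,k) - C_2(j,l) \big|^q \Big]\,
  \pi_{i,j}\, \pi_{k,l},
\qquad
\Pi(h, g) = \big\{ \pi \in \mathbb{R}_+^{n \times m} :
  \pi \mathbf{1}_m = h,\ \pi^\top \mathbf{1}_n = g \big\}.
```

With beta = 0 only the atom features matter and the problem reduces to a Wasserstein distance; with beta = 1 only the structure matters and it reduces to a Gromov-Wasserstein distance.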

These experiments can be reproduced in two steps: 1) hyperparameter selection, 2) testing the methods with the selected hyperparameters. Step 2 can also be run directly.

1) Run hyperparameter selection.

python hyper_param_selection.py method

where method is one of "finger", "gk", "gw_onehot", "gw_fine", "gw_diffuse".

The results are saved in the folder "Results", which already contains saved results.

2) Test the methods with the selected hyperparameters.

python test.py method

The results are saved in the folder "Results", which already contains saved results.
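
Both steps can be chained over all methods with a small driver script. The sketch below is a convenience wrapper, not part of the repository: the functions `build_commands` and `run_all` are hypothetical, and it assumes that it is run from the repository root and that `test.py` accepts the same method names as `hyper_param_selection.py`.

```python
import subprocess
import sys

# Method names as listed for hyper_param_selection.py.
METHODS = ["finger", "gk", "gw_onehot", "gw_fine", "gw_diffuse"]


def build_commands(methods=METHODS, run_selection=True):
    """Return the list of command lines to run, one argv list per command."""
    commands = []
    for method in methods:
        if run_selection:
            commands.append([sys.executable, "hyper_param_selection.py", method])
        commands.append([sys.executable, "test.py", method])
    return commands


def run_all(run_selection=True):
    """Run every command in order and stop at the first failure."""
    for cmd in build_commands(run_selection=run_selection):
        subprocess.run(cmd, check=True)
```

Calling `run_all(run_selection=False)` skips step 1 and only runs the tests with the hyperparameters that are already saved.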

The Top-k accuracies obtained on the test data are given in the following table.

Method                  Top-1    Top-10   Top-20
WL kernel                9.8%    29.1%    37.4%
Linear fingerprint      28.6%    54.5%    59.9%
Gaussian fingerprint    41.0%    62.0%    67.8%
FGW one-hot             12.7%    37.3%    44.2%
FGW fine                18.1%    46.3%    53.7%
FGW diffuse             27.8%    52.8%    59.6%

References

[1] Brogat-Motte, L., Flamary, R., Brouard, C., Rousu, J., d'Alché-Buc, F. Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. arXiv preprint arXiv:2202.03813, 2022. (http://arxiv.org/abs/2202.03813)

[2] Brouard, C., Shen, H., Dührkop, K., d'Alché-Buc, F., Böcker, S., Rousu, J. Fast metabolite identification with Input Output Kernel Regression. In Proceedings of ISMB 2016, Bioinformatics 32(12): i28-i36, 2016. DOI: https://doi.org/10.1093/bioinformatics/btw246

[3] Dührkop, K., Shen, H., Meusel, M., Rousu, J., Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. PNAS 112(41): 12580-12585, 2015. DOI: https://doi.org/10.1073/pnas.1509788112

[4] Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Optimal transport for structured data with application on graphs. In International Conference on Machine Learning (ICML), 2019.
