pyOpenMS: Jupyter Notebook implementation of UmetaFlow

This is a workflow for untargeted metabolomics data preprocessing and analysis, implemented in Jupyter notebooks by Eftychia Eva Kontou and Axel Walter using pyOpenMS, the Python bindings to the C++ OpenMS algorithms. The workflow is compatible with Linux, Windows and macOS.

Publication DOI: https://doi.org/10.1186/s13321-023-00724-w

Workflow overview

The pipeline consists of seven interconnected steps:

  1. File conversion (optional): Simply add your Thermo raw files to data/raw/ and they will be converted to centroid mzML files. If you have Agilent or Bruker files, skip this step: convert them independently using ProteoWizard (see https://proteowizard.sourceforge.io/) and add them to the data/mzML/ directory.

  2. Pre-processing: Convert your raw data into a table of metabolic features using a series of algorithms.

  3. Re-quantification (optional): Re-quantify all raw files to avoid missing values resulting from the pre-processing workflow, which helps downstream statistical analysis and data exploration.

  4. GNPSexport: Generate all the files necessary to create a feature-based molecular networking (FBMN) or ion identity molecular networking (IIMN) job at GNPS.

  5. Structural and formula predictions with SIRIUS and CSI:FingerID.

  6. Annotations: Annotate the feature tables with the #1 ranked SIRIUS and CSI:FingerID predictions (MSI level 3) and with spectral matches from a local MGF file (MSI level 2).

  7. Data integration: Integrate the #1 ranked SIRIUS and CSI:FingerID predictions into the GraphML file from GNPS FBMN for visualization. Optionally, annotate the feature tables with GNPS MS/MS library matches (MSI level 2).

Overview

[Figure: workflow overview (dag)]

Usage

Step 1: Clone the workflow

Clone this repository to your local system, into the place where you want to perform the data analysis.

(Make sure you have the right access / SSH key. If not, first generate a new SSH key and add it to the ssh-agent: https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent

Then add the new SSH key to your GitHub account: https://docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account)

git clone https://github.com/biosustain/pyOpenMS_UmetaFlow.git

Step 2: Install all dependencies

Mono, Homebrew and wget dependencies:

For Linux only(!)

Install mono with sudo:

 sudo apt install mono-devel

For both Linux and macOS

Install Homebrew and wget:

 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Press enter (RETURN) to continue

For Linux only(!)

Follow the "Next steps" instructions printed by the installer to add Linuxbrew to your PATH and to your bash shell profile script, either ~/.profile on Debian/Ubuntu or ~/.bash_profile on CentOS/Fedora/RedHat (https://github.com/Linuxbrew/brew).

 test -d ~/.linuxbrew && eval $(~/.linuxbrew/bin/brew shellenv)
 test -d /home/linuxbrew/.linuxbrew && eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv)
 test -r ~/.bash_profile && echo "eval \$($(brew --prefix)/bin/brew shellenv)" >>~/.bash_profile
 echo "eval \$($(brew --prefix)/bin/brew shellenv)" >>~/.profile

For both Linux and macOS

 brew install wget

pyOpenMS and other libraries:

Installing pyOpenMS in a conda environment is advised. First create and activate the environment, then install the latest pyOpenMS wheels and the remaining dependencies:

   conda create --name pyopenms python=3.10
   conda activate pyopenms
   pip install --index-url https://pypi.cs.uni-tuebingen.de/simple/ pyopenms-nightly
   conda install -n pyopenms ipykernel --update-deps --force-reinstall
   pip install pyteomics
   pip install --upgrade nbformat
   pip install matplotlib

For installation details and further documentation, see the pyOpenMS documentation.
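After the installation it can be useful to confirm that the packages are visible from the active environment. The following is a minimal sketch, not part of the workflow itself: the helper name `missing_packages` is hypothetical, and the list of import names is assumed from the install commands above.

```python
import importlib.util

# Import names assumed from the install commands above.
# ipykernel is listed because the notebooks need a Jupyter kernel.
REQUIRED = ["pyopenms", "pyteomics", "nbformat", "matplotlib", "ipykernel"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it with the Python interpreter of the activated pyopenms environment, since a different interpreter may report different results.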

Step 3: Install executables (ThermoRawFileParser & SIRIUS):

ThermoRawFileParser

   (cd resources/ThermoRawFileParser && wget https://github.com/compomics/ThermoRawFileParser/releases/download/v1.3.4/ThermoRawFileParser.zip && unzip ThermoRawFileParser.zip)

SIRIUS

Download the latest SIRIUS executable. Choose the headless zipped file compatible with your operating system (Linux, macOS or Windows) and unzip it under the directory resources/. Make sure to register using your university email and password.

  1. Specify the operating system

    MY_OS="linux64" # or "osx64" for macOS or "win64" for windows 
    
  2. Get the SIRIUS executable

    (cd resources && curl -s https://api.github.com/repos/boecker-lab/sirius/releases/latest | tr -d '"' | grep "browser_download_url.*${MY_OS}.zip$"| cut -d : -f 2,3 |  wget -i- && unzip *.zip)
    

Tip: If you get the executable manually, make sure to download a version >5.6. Avoid SNAPSHOT versions and get the headless zipped file.
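For readers who prefer Python to the shell pipeline in step 2, the same selection logic can be sketched as follows. This assumes the GitHub releases API response contains an `assets` list whose entries carry a `browser_download_url` field. The function names are hypothetical, and whether the current SIRIUS assets still end in `linux64.zip`, `osx64.zip` or `win64.zip` is not verified here.

```python
import json
import urllib.request

# Same endpoint as the curl command above.
LATEST_RELEASE_URL = "https://api.github.com/repos/boecker-lab/sirius/releases/latest"


def matching_asset_urls(release, my_os):
    """Return download URLs of release assets that end in '<my_os>.zip'."""
    suffix = f"{my_os}.zip"
    return [
        asset["browser_download_url"]
        for asset in release.get("assets", [])
        if asset["browser_download_url"].endswith(suffix)
    ]


def fetch_latest_release(url=LATEST_RELEASE_URL):
    """Download and parse the release metadata (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

Calling `matching_asset_urls(fetch_latest_release(), "linux64")` would list the candidate downloads. Keeping the network call separate from the filtering makes the filter easy to check offline.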

Step 4 (optional): Get example data from Zenodo

   (cd data && wget -O Commercial_std_raw.zip "https://zenodo.org/record/6948449/files/Commercial_std_raw.zip?download=1" && unzip Commercial_std_raw.zip -d raw)

The example data can be used to test the workflow. To analyse your own data instead, place your files in data/raw/ (Thermo raw files) or data/mzML/ (mzML files).
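Before running the notebooks, a quick look at what is actually in the two input directories can catch misplaced files. The sketch below is an illustration under assumptions: the helper `find_inputs` is hypothetical, and it assumes inputs are identified by the `.raw` and `.mzML` file extensions.

```python
from pathlib import Path


def find_inputs(data_dir="data"):
    """Collect .raw and .mzML file names from the two input directories."""
    data_dir = Path(data_dir)
    found = {}
    for subdir, suffix in (("raw", ".raw"), ("mzML", ".mzml")):
        folder = data_dir / subdir
        files = []
        if folder.is_dir():
            # Compare suffixes case-insensitively (.raw/.RAW, .mzML/.mzml).
            files = sorted(
                p.name for p in folder.iterdir() if p.suffix.lower() == suffix
            )
        found[subdir] = files
    return found


if __name__ == "__main__":
    for subdir, files in find_inputs().items():
        print(f"data/{subdir}/: {len(files)} file(s)")
```

A missing directory is reported as containing zero files rather than raising an error, so the check can be run at any point during setup.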

Step 5: Run all kernels and investigate the results

All results are written as .tsv files, which can be opened with Excel or loaded into pandas DataFrames.
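As a dependency-free illustration of reading such a tab-separated result file, the sketch below uses only the Python standard library. The helper name `read_tsv` is hypothetical, and no particular output file name or column layout of the workflow is assumed.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

All values are returned as strings, so numeric columns need an explicit conversion. With pandas installed, `pandas.read_csv(path, sep="\t")` loads the same file into a DataFrame and infers numeric types.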

Citations

  • Kontou EE, Walter A, Alka O, et al. UmetaFlow: an untargeted metabolomics workflow for high-throughput data processing and analysis. J Cheminform. 2023;15:52. doi:10.1186/s13321-023-00724-w

  • Pfeuffer J, Sachsenberg T, Alka O, et al. OpenMS – A platform for reproducible analysis of mass spectrometry data. J Biotechnol. 2017;261:142-148. doi:10.1016/j.jbiotec.2017.05.016

  • Dührkop K, Fleischauer M, Ludwig M, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods. 2019;16(4):299-302. doi:10.1038/s41592-019-0344-8

  • Dührkop K, Shen H, Meusel M, Rousu J, Böcker S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci. 2015;112(41):12580-12585. doi:10.1073/pnas.1509788112

  • Nothias LF, Petras D, Schmid R, et al. Feature-based molecular networking in the GNPS analysis environment. Nat Methods. 2020;17(9):905-908. doi:10.1038/s41592-020-0933-6

Test Data (only for testing the workflow with the example dataset)

  • The current test data are built from known metabolite producer strains or standard samples that have been analysed with a Thermo IDX mass spectrometer. The presence of the metabolites and their fragmentation patterns has been manually confirmed using TOPPView.

pyopenms_umetaflow's Issues

Notebook 4 GNPS export memory error

Hi,

Thank you for these detailed notebooks and workflow.

I managed to get to notebook 4 with my own dataset but in step 3 for MSMS clustering, when I try to run the line:

spectra_clustering.store(Consensus_file, [s.encode() for s in mzML_files], String(out_file))

I run into what seems like an error with memory:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[9], line 1
----> 1 spectra_clustering.store(Consensus_file, [s.encode() for s in mzML_files], String(out_file))

File pyopenms/_pyopenms_1.pyx:3028, in pyopenms._pyopenms_1.GNPSMGFFile.store()

MemoryError: std::bad_alloc

However, I am quite sure my machine is not running out of memory. I am happy to provide more information, but I am not sure what would be useful. I am very puzzled as to why this is happening.
