
eubic2023's Introduction

EuBIC 2023 developers meeting

Sunday 15 – Friday 20 January 2023, Congressi Stefano Franscini, Monte Verità, Ticino, Switzerland


The EuBIC 2023 developers meeting will bring together scientists active in the field of computational proteomics.

This repository collects proposals for hackathon projects at the EuBIC 2023 developers meeting. For details about the submission format, see below.

For more information about the meeting, please check the official website.

Submission of project proposals

Topics for the hackathon sessions during the EuBIC 2023 developers meeting can now be proposed!

Please carefully read the full guidelines before submitting a project proposal and make sure to add all relevant information to your proposal. Examples from previous developers meetings can be found here.

How to submit a project proposal?

Create an issue in this repository describing your project. When you create an issue, a template is provided that lists the relevant information to include:

Project description:

  • A general abstract of up to 200 words describing the goal of the project and why it is well suited as a community project.
  • A (high-level) project plan detailing the work to be conducted. This primarily includes tasks that will be tackled during the developers meeting, but we encourage you to also think about a follow-up strategy.

Technical details:

  • The programming language(s) that will be used.
  • (If applicable) any existing software that will be featured.
  • (If applicable) any datasets that will be used and their availability.

Contact information:

  • Your name, affiliation, and contact information.

How to contribute?

  • Vote for your favorite project in the Discussions section!
  • Leave comments on interesting proposals. Engage in a discussion to fine-tune the project proposals!

Important deadlines

  • Hackathons proposal deadline: September 30th, 2022
  • Notification of accepted hackathons: November 1st, 2022

Powered by

FGCZ ETHZ UZH

Congressi Stefano Franscini

EuPA

Sponsors

Biognosys

Matrix Science

MSAID

eubic2023's People

Contributors

cpanse, tobiasko, swillems


eubic2023's Issues

Example hack

Title

A catchy working title for your hack! 😄

Abstract

Up to 200 words describing the general goal of the project and why it is well suited as a community project.

Project Plan

A (high-level) project plan detailing the work to be conducted. This primarily includes tasks that will be tackled during the developers meeting, but we encourage you to also think about a follow-up strategy.

Technical Details

  • The programming language(s) that will be used.
  • (If applicable) Any existing software that will be featured.
  • (If applicable) Any datasets that will be used and their availability.

Contact Information

Your name, affiliation, and contact information.

Making sense of internal fragment ions

Title

Making sense of internal fragment ions

Abstract

Peptide identification from fragment mass spectra uses only part of the contained information. Here, internal fragment ions, i.e. peptides with both termini cleaved, have a high potential to provide further evidence about peptide identity. Despite the option to include internal ions in several database search engines, their actual use has so far been explored only poorly. A big challenge lies in the large number of possible ions, and thus the difficulty in distinguishing them from background signals or other fragment ions. This hackathon project aims to shed more light on the applicability of internal ions by creating a framework to determine their characteristic patterns in MS data. We will provide statistics and extensive visualizations for internal ions in a given data set. For that, we will employ both raw data files and identifications from a database search. This framework will establish the grounds for the detection and utilization of characteristic internal ions in a dataset, explore potential “fragment motifs”, and facilitate the distinction of actual internal ions from background noise. A clearer understanding and exploration of internal fragmentation will channel future efforts towards their more extensive use in MS data processing, leading to higher peptide identification rates.

Project Plan

We suggest the following tasks for creating and testing the framework:

  • Nomenclature and definitions: Given the complexity and the large combinatorics of internal fragment ions, we will discuss and stringently define the nomenclature. Existing knowledge and nomenclatures will be assessed for their usability.

  • Data sets: Selection of about ten data sets from different MS technologies that will be used for testing and exploration. This will include bottom-up and top-down approaches, as well as different fragmentation types and acquisition methods.

  • Implementation: We plan to take advantage of the Pyteomics tools for reading files and spectra, as well as of libraries such as spectrum-utils to extract fragment ions. Visualizations and further analysis will be in Python and/or R, depending on the participants’ backgrounds.

  • Assessment and testing: Different statistical measures and motif algorithms will be tested and discussed. Interactive visualizations will be used to conveniently explore different subsets of one or multiple MS runs.

  • Software: We expect to develop a prototype that can process any data set provided in the standard file formats mzML and mzIdentML, and optionally in widely used formats such as MGF and pepXML.

These tasks will be discussed on the first day prior to their implementation. Depending on the skills and interest of the participants, we may define working groups for addressing them in the following days.
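As a concrete starting point for the implementation task, the core enumeration of internal ions can be sketched in plain Python. This is a minimal sketch, assuming one common definition of an internal ion ("by"-type: a contiguous subsequence excluding both termini, with b-ion-like composition); the residue-mass table is truncated to the residues used in the example.

```python
# Monoisotopic residue masses in Da (truncated table for this example only)
MONO = {
    "P": 97.05276, "E": 129.04259, "T": 101.04768,
    "I": 113.08406, "D": 115.02694,
}
PROTON = 1.007276

def internal_fragments(peptide, min_len=2):
    """Yield (subsequence, singly charged m/z) for "by"-type internal ions.

    Assumes the common definition: a contiguous subsequence that excludes
    both terminal residues, with the composition of a b ion of that
    subsequence (sum of residue masses plus one proton for charge 1+).
    """
    core = peptide[1:-1]  # drop both terminal residues
    for start in range(len(core)):
        for end in range(start + min_len, len(core) + 1):
            sub = core[start:end]
            mz = sum(MONO[aa] for aa in sub) + PROTON
            yield sub, mz

# For a 7-residue peptide, the internal core has 5 residues,
# giving 4 + 3 + 2 + 1 = 10 candidate internal fragments.
frags = list(internal_fragments("PEPTIDE"))
```

The combinatorial growth visible here (quadratic in peptide length per spectrum) is exactly why distinguishing real internal ions from chance matches is the hard part of the project.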

Technical details

  • The programming language(s): Python and/or R. For faster implementations, we might collaborate with the hackathon focusing on Rust implementations.

  • Existing software that will be featured: python libraries: pyteomics, spectrum-utils

  • (Public) datasets that will be used and their availability: given the available ground truth and the availability of different fragmentation types, we might use ProteomeTools (http://www.proteometools.org/)

Contact information

Arthur Grimaud
Protein Research Group
Department for Biochemistry and Molecular Biology
University of Southern Denmark
Campusvej 55
5230 Odense M / Denmark
[email protected]

Veit Schwämmle
Protein Research Group
Department for Biochemistry and Molecular Biology
University of Southern Denmark
Campusvej 55
5230 Odense M / Denmark
[email protected]

test entry

Abstract

Up to 200 words describing the goal of the project and why it is well suited as a community project.

Project Plan

A (high-level) project plan detailing the work to be conducted. This primarily includes tasks that will be tackled during the developers meeting, but we encourage you to also think about a follow-up strategy.

Technical details

  • The programming language(s) that will be used.
  • (If applicable) Any existing software that will be featured.
  • (If applicable) Any datasets that will be used and their availability.

Easy-to-use interactive HTML plots collection for data exploration and web tools

Title

Easy-to-use interactive HTML plots collection for data exploration and web tools

Abstract

Interactive data exploration and publicly accessible web applications for bioinformatics tools are an essential part of bioinformatics. Ideally, one uses the same framework for both to reduce development time. Shiny apps are popular for this purpose, but are limited to the R language, while Python’s interactive plotting solutions (Plotly, Bokeh) tend to have a steep learning curve and are tricky to deploy.

For ProteomicsDB (https://www.proteomicsdb.org/), we developed over a dozen interactive, general-purpose plots based on D3.js and Vue.js (https://github.com/wilhelm-lab/proteomicsdb-components). These range from simple scatter and bar plots to interaction graphs, heatmaps and pathway viewers. We managed to easily reuse and adapt these components in ongoing projects, allowing quick deployment upon publication.

For the hackathon, we propose releasing these plots as Web Components (https://www.webcomponents.org/introduction), a browser-native HTML standard. Web Components can be included in web pages as simple HTML tags, without JavaScript knowledge. As this is purely a frontend framework, data processing and provisioning can be done in any programming language. We aim to provide examples and documentation to make integration straightforward for bioinformaticians in any project.

Project Plan

The goal for the hackathon is to create a repository with Web Components for all our plotting components, and potentially add a few more. Each of the components can be installed individually and has example code and documentation. Additionally, separate repositories will be set up, demonstrating how one can utilize these components in combination with different programming languages (e.g. Python, R).

Tasks:

  • Define consistent input data structures for Web Components
  • Update outdated components to d3v6
  • Create examples and documentation on how to use each component
  • Create example repositories that demonstrate how to use components with different programming languages (Python, R)
  • Participants can create new components for their own tools or write their own tools with the new components

Taking part in this hackathon will allow you to become familiar with Web Components, an emerging web standard, and D3.js, a flexible plotting library for JavaScript. Furthermore, you will learn how to quickly set up a web tool with interactive plots in your language of choice.
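The claim that data provisioning can be done in any language can be illustrated from the Python side: the backend only needs to emit HTML with the component tag and a JSON payload. This is a sketch; the `<pdb-scatter-plot>` tag name, its `data` attribute, and the bundle URL are hypothetical placeholders for one of the released components.

```python
import html
import json

def render_page(points, bundle_url="components.min.js"):
    """Build an HTML page embedding a (hypothetical) scatter-plot Web Component.

    `points` is a list of (x, y) pairs; the component receives them as a
    JSON string in an HTML attribute, so no JavaScript is written here.
    """
    payload = json.dumps([{"x": x, "y": y} for x, y in points])
    return f"""<!DOCTYPE html>
<html>
  <head><script type="module" src="{bundle_url}"></script></head>
  <body>
    <!-- a Web Component is used like any other HTML tag -->
    <pdb-scatter-plot data='{html.escape(payload)}'></pdb-scatter-plot>
  </body>
</html>"""

page = render_page([(1, 2.5), (2, 3.1)])
```

The same string could just as easily come from an R script or a static-site generator, which is the point of choosing a purely frontend framework.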

Technical Details

Contact Information

Matthew The
Chair of Proteomics and Bioanalytics
Technical University of Munich
[email protected]

Julian Müller
Chair of Proteomics and Bioanalytics
Technical University of Munich
[email protected]

Metabolomics hackathon: MS2 spectra matching for metabolite identification

MS2 spectra matching for metabolite identification

Abstract

One major open topic in untargeted metabolomics is identifying unknown compounds from mass spectra. As MS1 comparisons can be ambiguous (especially for small molecules), we need to look at MS2 spectra and compare them to public MS2 databases to differentiate compounds in the same mass range.
Currently, the best-performing methods for compound identification are GNPS and SIRIUS. They provide the user with a list of potential compounds, but in some cases the uncertainty is very high or multiple candidates are suggested, making the downstream analysis labor-intensive. GNPS improves its predictions by using molecular networks and taking biological information into account. SIRIUS improves its predictions by comparing the structural similarity of the compounds.
We would like to set up a novel system with modular parts that can be tested separately. Each aspect of the pipeline can be improved or modified individually, and multiple methods can be combined as an ensemble. In doing so, this can also serve as a benchmark of existing scoring and matching functions and a testing playground for novel ideas.

Project Plan

The general purpose is to build an (automated) workflow for MS2 spectra matching that does not rely on cosine similarity scoring alone. Specifically, we would like to:

  • Automatically clean MS2 spectra
    • Pick the MS2 spectrum with the highest precursor ion intensity or the highest total ion current for each LC-MS feature;
    • remove peaks with relative intensities below 1% of the highest-intensity peak;
    • remove peaks outside a set m/z window;
    • remove peaks outside a set intensity window;
    • (optional) calculate neutral losses within each MS2 spectrum and compare them across multiple spectra to identify similar functional groups (that might have been lost)
  • Optimize the pipeline with additional steps (e.g., QC points, interactive visualization)
  • Optimize scoring. As a starting point, commonly used scoring functions can be compared, e.g., the normalized dot product (cosine score), Spec2Vec (inspired by Word2Vec), structural similarity, …
  • Rank identification results. We would like not to rely on the score alone, as it can be misleading. One approach would be to include information that helps us exclude or down-weight unlikely candidates.
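The cleaning steps and the baseline score above can be sketched with plain (m/z, intensity) lists. This is a minimal illustration, not the matchms implementation; the thresholds and the greedy peak-matching strategy are illustrative choices.

```python
import math

def clean_spectrum(peaks, rel_threshold=0.01, mz_window=(100.0, 2000.0)):
    """Keep peaks above a relative-intensity cutoff and inside an m/z window.

    `peaks` is a list of (mz, intensity) tuples; thresholds are placeholders.
    """
    if not peaks:
        return []
    base = max(i for _, i in peaks)  # base-peak intensity
    return [(mz, i) for mz, i in peaks
            if i / base >= rel_threshold and mz_window[0] <= mz <= mz_window[1]]

def cosine_score(a, b, tol=0.01):
    """Normalized dot product with greedy peak matching within an m/z tolerance."""
    used, dot = set(), 0.0
    for mz_a, int_a in a:
        for j, (mz_b, int_b) in enumerate(b):
            if j not in used and abs(mz_a - mz_b) <= tol:
                dot += int_a * int_b
                used.add(j)
                break
    norm = math.sqrt(sum(i * i for _, i in a)) * math.sqrt(sum(i * i for _, i in b))
    return dot / norm if norm else 0.0

raw = [(50.0, 5.0), (150.0, 100.0), (151.0, 0.5), (300.0, 20.0)]
cleaned = clean_spectrum(raw)  # drops the out-of-window and low-intensity peaks
```

In the modular pipeline proposed here, `clean_spectrum` and `cosine_score` would be two interchangeable stages, so alternative scorers (Spec2Vec, modified cosine) can be swapped in for benchmarking.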

Technical Details

Main language: Python

  • Packages:
    • matchms (cosine score, modified cosine score)
    • spec2vec
  • Workflow includes: GNPS, SIRIUS

Contact Information

Members of the metabolomics research group led by Thomas Moritz at the NNF Center for Basic Metabolic Research, Faculty of Health Research, University of Copenhagen

Matthias Mattanovich ([email protected])
Muyao Xi ([email protected])
Lawrence Egyir ([email protected])

Shaping a European prediction service for biological data

Shaping a European prediction service for biological data

Abstract

We developed DLOmix-serving, an open-source and modular machine learning (ML) inference server for biological data based on NVIDIA Triton. This platform centralizes the hosting of models originating from various training/prediction pipelines such as PyTorch, scikit-learn, XGBoost, and TensorFlow. Most importantly, models can be accessed through a remote, language-independent interface. A similar system has been used for the last five years within Skyline (C#), the Prosit website (Python), and ProteomicsDB (JavaScript) to access Prosit models.
We envision that such a platform will be centrally hosted by, for example, the EBI, and that model submission will become standard practice when publishing new ML models. Although accompanying codebases on platforms like GitHub describe how to access models, they require technical skill to set up. Standardizing access to the models will dramatically improve reproducibility and ensure easy access for as many people as possible.
As part of this hackathon, we want to port as many existing models as possible to this platform, using DLOmix, an open-source package developed to integrate deep learning models in proteomics, as an exemplary tool to access these models. Ideally, participants would bring their own ML models that they wish to make publicly accessible.

Project Plan

  1. Defining a standard interface for models within the same family (e.g., fragmentation predictor)
  2. Setup of DLOmix-serving on participants' machines (pair programming)
    a. Porting model file to Triton
    b. Developing interface to Triton
  3. Exploring the capabilities of the Python backend in Triton to provide additional meta information about models or enrich the predictions
  4. Exploring the auto-generation of a client library for DLOmix-serving using the Protocol Buffer interface definition
  5. Writing a technical note.
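The language-independent interface mentioned above can be illustrated with the KServe-v2-style JSON payload that Triton's HTTP endpoint accepts. This is a sketch only: the model name `prosit_fragmentation` and the input names, datatypes, and shapes are hypothetical placeholders; a real client must match the deployed model's configuration.

```python
import json

def build_infer_request(peptides, charges):
    """Build a KServe-v2-style inference request body for Triton's REST API.

    Input names/shapes below are hypothetical; they must match the
    model configuration on the server.
    """
    return {
        "inputs": [
            {"name": "peptide_sequences", "datatype": "BYTES",
             "shape": [len(peptides), 1], "data": peptides},
            {"name": "precursor_charges", "datatype": "INT32",
             "shape": [len(charges), 1], "data": charges},
        ]
    }

# The serialized body would be POSTed to
#   http://<host>:8000/v2/models/prosit_fragmentation/infer
body = json.dumps(build_infer_request(["PEPTIDEK"], [2]))
```

Because the request is plain JSON over HTTP, the same call can be issued from R, JavaScript, or C# without any Triton client library, which is the property the proposal relies on.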

Technical Details

Python
DLOmix (https://github.com/wilhelm-lab/dlomix)
Triton Inference Server (https://developer.nvidia.com/nvidia-triton-inference-server)
Docker (https://docs.docker.com/engine/install/)

Contact Information

Ludwig Lautenbacher, Computational Mass Spectrometry, TUM, [email protected]
Tobias Schmidt, MSAID [email protected]
Wassim Gabriel, Computational Mass Spectrometry, TUM, [email protected]

Rusteomics - Community driven toolbox for omic-research

Title

Rusteomics

Abstract

The proteomics community has created some exceptional toolboxes over the past years, like OpenMS, Pyteomics, and mzR. Most of these toolboxes implement general computational tasks, like reading and preprocessing data. However, most of them do not rely on a shared code base or the same internal data representation, which makes interoperability possible only through (PSI) standard formats, and reading/writing those without a common implementation may introduce another layer of errors.
The aim of Rusteomics is to build a collaborative, community-driven toolbox that provides read and write access to the most common file formats, as well as low-level and well-established algorithms (deisotoping, deconvolution, MS/MS spectrum annotation, etc.).
While similar solutions exist in various programming languages, this project is an opportunity to tailor these new components to be highly compatible with scientific (scripting) languages like Python and R. Moreover, the reimplementation in Rust should bring some major benefits: the modern compiler and build system make Rust-based projects easier to maintain than C++-based projects, while providing the same performance.
During this hackathon, we will refine the goals and organization of the project, and start the development of a tool that can be used to generate spectral libraries (an MSP format writer).

Project Plan

  1. Define goals of the project and establish short- & long-term goals - 0.25 day
  2. Define the project organization (coding rules, licensing, ...) - 0.25 day
  3. Implementation of an MGF writer (1 day)
  4. Implement an MSP writer by extending the MGF writer to add spectrum annotations; the annotation functionality will be provided by David Bouyssié. The MSP writer is a current community demand and will help the development of new search engines. (1 day)
  5. Investigate mzIdent (.mzid) to MSP

Technical Details

The base implementation of Rusteomics will be written in Rust, while the language bindings may rely on the targeted language.

mzio-crate

reader- and writer-modules

Contains reader and writer classes for proteomics-specific files. The reader module should contain a sub-module called vendor, which provides read support for vendor formats like the Mascot .dat format. It may be necessary to add write capabilities for some vendor formats to exchange data with their related tools.

entities- or models-module

Contains the internal representation of different data types, e.g. a spectrum or an amino acid sequence. Defining this representation is a crucial task in order to handle mandatory and optional data for each supported format.
These data structures are created by the use of io.reader.* and should be used to create files with io.writer.*.

mzcore-crate

chemistry-module

Here, some constants and functionality are implemented to deal with molecules, e.g. amino acids representation (name, mass, one letter code, chemical representation etc.), losses, maybe atom representation, etc.

function- or algorithms-module

This module will contain all processing and analytic functions used in proteomics:

  • sequence digestion
  • decoy generation
  • deisotoping
  • false discovery rate calculation
  • ...
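Two of the listed algorithms can be pinned down as a reference sketch that the Rust mzcore crate could mirror (shown here in Python for brevity). The cleavage rule is the standard trypsin rule (cleave after K/R unless followed by P); the reversed-sequence decoy is one of several common decoy strategies, not the project's decided choice.

```python
import re

def tryptic_digest(sequence, missed_cleavages=0):
    """In-silico tryptic digestion: split after K or R unless followed by P.

    Returns all peptides with up to `missed_cleavages` missed cleavage sites.
    """
    # Zero-width split at each cleavage site; drop the empty trailing piece
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for i in range(len(fragments)):
        for j in range(i, min(i + missed_cleavages, len(fragments) - 1) + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides

def reverse_decoy(sequence):
    """Classic decoy generation: reverse the sequence."""
    return sequence[::-1]

peps = tryptic_digest("MKAPEPTIDERGK", missed_cleavages=1)
```

Defining such functions language-neutrally first makes it easier to keep the Rust implementation and its Python/R bindings behaviorally identical.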

Additional crates

In addition to the proteomics related crates, researchers of other omic-fields (Transcriptomics, Metabolomics, ...) are welcome to contribute crates of their own to make Rusteomics truly usable for 'multi-omics' studies.

Language bindings

Each crate repository will contain several sub folders, each containing a specific language binding, e.g. for mzcore

mzcore
|- mzcore-rs        (the rust implementation)
|- mzcore-r         (R-binding)
|- mzcore-python    (Python-binding)
|- ...

R-bindings

R is one of the most popular languages for statistical analysis. Based on rextendr.

Python-bindings

Python offers support for widely used toolkits for deep learning (Keras, PyTorch), machine learning (scikit-learn), data handling (pandas, NumPy), and web development (Django, Flask). Based on pyo3.

CPP-bindings

C++ is still one of the fastest languages and is used in various proteomics software, e.g. OpenMS. Based on rust-bindgen or rust-diplomat (https://github.com/rust-diplomat/diplomat/).

Java-bindings

Java interoperability will give the opportunity to be compatible with several programming languages running on the JVM (Groovy, Clojure, Jython, Kotlin, Scala). Based on JNI, JNR, or the new Foreign-Memory Access API.

C#-bindings

C# is one of the main languages for Microsoft-based systems and is used in many projects, including vendor software, so Rusteomics may benefit from C# bindings as well. Bindings can be created with DNNE or netcorehost.

Contact Information

  • Dirk Winkelhardt, Ruhr-University Bochum, Medical Faculty, Medical Proteome Center & Center for Protein Diagnostics (PRODI), [email protected]
  • Dominik Lux, Ruhr-University Bochum, Medical Faculty, Medical Proteome Center & Center for Protein Diagnostics (PRODI), [email protected]
  • David Bouyssié, IPBS, CNRS, University of Toulouse, UPS, Toulouse, France. [email protected]

Re-Assembly and Re-Quantification of imputed peptides

Title

Re-assembly and re-quantification of imputed peptides

Abstract

Protein assembly was described as an initial, implicit imputation step by Lazar and others (2016). Under this consideration, how would imputation on the PSM or peptide level affect protein group aggregation, with regard to both assembly and quantification? Approaches to and opinions on imputation vary, and a community effort promises a good mix of angles for investigating this question.

Lazar, Cosmin, Laurent Gatto, Myriam Ferro, Christophe Bruley, and Thomas Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” Journal of Proteome Research 15 (4): 1116–25.

Project Plan

To answer the question of what effect imputation at a lower level has on the higher, aggregated data, I see the following points to be discussed:

  • Which conditions and models are suitable (in combination) for a PSM level imputation strategy?
  • Is it possible to find an evaluation strategy for any dataset by using simulation of missing values?
  • Is imputation suited for spectral counting or only for MS1 quantifications?
  • Which tools and formats are compatible?

Technical Details

  • Python, R, command line
  • PIA, Pout2Prot, IDPicker2.0, ProteinProphet, MSqRob2 (and probably many more)

Contact Information

Henry Webel, Novo Nordisk Foundation Center for Protein Research, contact

Exploring and solving functional analysis gaps in metaproteomics

Title

Exploring and solving functional analysis gaps in metaproteomics.

Abstract

Metaproteomics, the study of the full protein complement of microbial communities, is becoming an increasingly popular approach to study microbiomes and is mostly combined with metagenomics or metatranscriptomics. Although the latter approaches are more commonly used, they only reveal the potential functions of the microbiome. Metaproteomics, in contrast, reveals the actual functions of the microbiome because it studies the proteins, the worker molecules of the cell [doi: 10.1080/14789450.2020.1738931]. Specifically, these functions are described by functional annotations such as InterPro and GO terms, and these terms can often be mapped to functional pathways. Although this shows the added value of metaproteomics, there are still some major hurdles to overcome in optimizing the functional analysis of microbiomes. The input from the community will prove particularly useful here, as every lab or research group might have their own tools and therefore their own problems in performing a functional analysis. This project will explore the challenges in the functional analysis of metaproteomics data, prioritize them, and undertake the first steps in solving the most urgent and technically feasible one.

Project Plan

The first step will be to create an overview with the biggest challenges in the functional analysis in metaproteomics. We will create this overview based on the input from the participants from the hackathon, the literature, and from information received from multiple wet-lab research groups in the Metaproteomics Initiative [doi: 10.1186/s40168-021-01176-w] who will be consulted before the hackathon starts. These challenges could be summarized in a review. In the second step, we will prioritize these challenges based on urgency. In the third step, we will start solving one of these challenges based on urgency and technical feasibility. After the hackathon, we will continue the work on this project, make the open-source software freely available, and report the results in a technical note.

Technical Details

This must be discussed with the hackathon group, but we aim to work in Python because it is a very accessible language that is often used in the community.

Contact Information

Bart Mesuere, Ghent University, [email protected]
Tim Van Den Bossche, VIB - Ghent University, [email protected]
Tibo Vande Moortele, Ghent University, [email protected]
Pieter Verschaffelt, Ghent University, [email protected]

DeepScore: Community curated scoring, supercharged with AI

DeepScore: Community curated scoring, supercharged with AI

(Image taken from https://twitter.com/afterglow2046/status/1197271037009973251)
The trained deep learning classifier predicts that this image of a dog is an image of Harrison Ford with a 99% probability. The probability that the image of the dog is NOT Harrison Ford is predicted to be 1%. If Harrison Ford is a target and NOT Harrison Ford a decoy, how would this example translate to proteomics?

Abstract

State of the art for identification in mass-spectrometry-based proteomics is to generate “non-sense” decoy data and to determine a score cutoff based on a false discovery rate. Nowadays, machine learning (ML) and deep learning (DL) algorithms are used to learn how to optimally distinguish targets from decoys. While this drastically increases sensitivity, it comes at the cost of explainability, and human-chosen acceptance criteria are replaced with black-box models. This can hinder acceptance in clinical practice, e.g., in peptidomics, where upregulated proteins are validated by inspecting raw data peaks. While other DL-driven domains allow straightforward human validation of the models, e.g., imaging, speech, text, or inspecting a predicted structure from AlphaFold, for proteomics this is much more challenging. Here, we aim to explore the limitations of the current scoring approach and provide potential solutions. We revisit the idea of confidence by trying to artificially increase identifications with non-sense features, hard decoys, or leaking data. Next, we will build an interactive tool to validate identifications manually and assign human confidence scores. With this, we create a training dataset and build an ML or DL model to rescore identifications and assign predicted human-level confidence scores.

Project Plan

The hackathon will be organized similarly to a SCRUM process. First, we create a user story map to agree on a prioritized list of things we would like to do. Next, we will assess the workload for each task with story points and, depending on the team size and skill set, make a sprint plan for what we can achieve in the given timeframe.
Some of the potential tasks could be:

Assessing Confidence

  • Hacks: Supplement random numbers to an ML-scoring system and investigate the performance
  • Hard decoys: Provide harder decoys and investigate the performance
  • Data Leakage: Gradually leak training data to a scoring system and investigate the performance

Interactive Tool

  • Frontend that shows raw data and accepts user input to assign confidence scores
  • Backend with database or functionality to merge multiple user sessions

Deep-learning Score

  • Extract raw data identifications
  • Train a model based on the human-supplied confidence scores
  • Perform rescoring on existing studies
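The target-decoy score cutoff that the abstract refers to can be made concrete with a short sketch: the FDR at a threshold t is estimated as (#decoys ≥ t) / (#targets ≥ t), and the cutoff is the lowest threshold that keeps this estimate under the chosen level. This is the generic textbook procedure, not a specific tool's implementation.

```python
def score_cutoff(target_scores, decoy_scores, alpha=0.01):
    """Return the smallest observed target score with estimated FDR <= alpha.

    FDR at threshold t is estimated as (#decoys >= t) / (#targets >= t).
    """
    best = None
    for t in sorted(set(target_scores), reverse=True):
        n_t = sum(s >= t for s in target_scores)
        n_d = sum(s >= t for s in decoy_scores)
        if n_t and n_d / n_t <= alpha:
            best = t  # keep lowering the threshold while FDR stays in bounds
        else:
            break
    return best
```

Replacing these fixed scores with black-box ML/DL scores is exactly where the explainability concern arises: the cutoff logic stays simple, but the score itself is no longer human-interpretable.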

Technical Details

The default language should be Python, but we are open to anything that gets the job done better.
The tools mentioned below are suggestions – there are a lot of great tools out there that I am not aware of, and we should collect ideas and decide what to use at the beginning of the hackathon.

Hardware

  • Laptop; a decent GPU is a plus (with drivers etc. set up for compute)
  • There is always Google Colab as a fallback
  • For heavier workloads I have access to a high-performance cluster
  • Alternatively, we could rent compute from Amazon or a similar provider

Datasets

There are a lot of datasets out there we could use, but we will probably narrow this down once we have discussed the scope.

Feasibility

I have some preliminary human-curated data and some preliminary tools, so we could start from existing code or from scratch. Most of the modules can be worked on in parallel.

Contact Information

Maximilian Strauss [email protected] or [email protected]
Mann Group
Novo Nordisk Foundation Center for Protein Research
University of Copenhagen

Also feel free to reply to this issue with questions or comments. Thanks!
