general-protein-embeddings's Introduction

Structure, Surface and Interface Informed Protein Language Model

This repository contains the implementation of a protein language model pre-trained with structural, surface and interaction data.

For more details, see: NeurIPS-MLSB2023.

About

Language models applied to protein sequence data have gained a lot of interest in recent years, mainly due to their ability to capture complex patterns at the protein sequence level. However, their understanding of why certain evolution-related conservation patterns appear is limited. This work explores the potential of protein language models to further incorporate intrinsic protein properties stemming from protein structures, surfaces, and interfaces. The results indicate that this multi-task pretraining allows the PLM to learn more meaningful representations by leveraging information obtained from different protein views. We evaluate and show improvements in performance on various downstream tasks, such as enzyme classification, remote homology detection, and protein engineering datasets.

Datasets

The model is trained and evaluated using publicly available datasets:

We provide all datasets for download as a single folder (apart from UniRef90): all-data.

Pretraining PLM

To pretrain the protein language model you can run train_prose_multitask.py. The implementation uses multiple GPUs and can be run on a single machine or on a cluster. The scripts for running the file on a cluster can be found at iridis-scripts. The progress of the training can be monitored using tensorboard.sh. All trained models can be downloaded in the release section.

Finetuning on downstream tasks

After pretraining the protein language model, you can finetune it on downstream tasks. You can do this by running the following Python files:

If you want to run these experiments on a cluster, see the iridis-scripts folder.

Reproducing plots from the paper

To reproduce the t-SNE projection plots of the protein embeddings, use the notebook scop-tsne.ipynb.

Embedding protein sequences

If you want to embed a set of protein sequences using any of the models, you can use the embedd.py script. You only need to provide a FASTA file.
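Before embedding, it can help to check that the input file parses as expected. The following is a minimal sketch of a FASTA reader, not the parsing code used by embedd.py. The function name `read_fasta` and the example records are hypothetical. It assumes that the first whitespace-delimited token of each header line is the record identifier.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping.

    Hypothetical helper for sanity-checking input; not part of this repository.
    """
    records: Dict[str, List[str]] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: assume the identifier is the first token after ">".
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current_id = fields[0]
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = []
        else:
            if current_id is None:
                raise ValueError("sequence data found before any header")
            # Sequences may span several lines; collect and join them later.
            records[current_id].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


# Made-up example records, for illustration only.
example = """>seq1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>seq2
GSHMLEDP
"""

records = read_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

An open file handle can be passed in place of `example.splitlines()`, since the function only iterates over lines. The arguments accepted by embedd.py are not described here, so check the script itself for how to pass the file.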

This repository includes code adapted from other sources. If you find it useful, please cite the following work as well:

  • Surface generation code: MASIF
  • Model architecture and UniProt tokenization: Prose

Authors

Ioan Ieremie, Rob M. Ewing, Mahesan Niranjan

Citation

@article{ieremiestructure,
  title={Structure, Surface and Interface Informed Protein Language Model},
  author={Ieremie, Ioan and Niranjan, Mahesan and Ewing, Rob M}
}

Contact

ii1g17 [at] soton [dot] ac [dot] uk
