
PLMs meet reduced amino acid alphabets

This repository contains the implementation of various protein language models trained on reduced amino acid alphabets, along with the notebooks to recreate the figures found in the paper.

For more details, see: Link after publishing.

About

Motivation: Protein Language Models (PLMs), which borrow modelling and inference ideas from Natural Language Processing, have demonstrated the ability to extract meaningful representations in an unsupervised way, leading to significant performance improvements on several downstream tasks. Clustering amino acids by their physicochemical properties to obtain reduced alphabets has attracted interest in past research, but the application of such alphabets to PLMs or folding models is unexplored.

Results: Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost under alphabet reduction. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 of the 50 targets, reduced alphabets improve structural predictions, with LDDT-CĪ± differences of up to 19%.
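
To make "translating" a sequence into a reduced alphabet concrete, the sketch below maps the 20 standard amino acids onto a smaller set of group symbols and re-encodes a sequence. The grouping used is a plausible chemistry-based example chosen purely for illustration; it is not necessarily one of the alphabets evaluated in the paper.

```python
# Minimal sketch of translating a protein sequence into a reduced alphabet.
# The grouping below is illustrative (hydrophobic, aromatic, polar, charged, ...),
# not necessarily an alphabet used in this work.
REDUCED_GROUPS = {
    "AVLIMC": "A",  # small / hydrophobic
    "FWY":    "F",  # aromatic
    "STNQ":   "S",  # polar
    "KRH":    "K",  # positively charged
    "DE":     "D",  # negatively charged
    "G":      "G",
    "P":      "P",
}

# Flatten the grouping into a per-residue lookup table.
AA_TO_GROUP = {aa: symbol for group, symbol in REDUCED_GROUPS.items() for aa in group}

def translate(sequence: str) -> str:
    """Re-encode a protein sequence with the reduced alphabet (unknowns -> 'X')."""
    return "".join(AA_TO_GROUP.get(aa, "X") for aa in sequence.upper())

print(translate("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))  # -> reduced-alphabet string
```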

Datasets

The models are trained and evaluated using publicly available datasets.

All of these datasets can be downloaded via the release feature on GitHub, apart from UniRef90, which is very large; it can be downloaded separately and then modified using our dataset script.
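
The dataset script defines the exact preprocessing; the fragment below is only a rough sketch of the general pattern: streaming a large FASTA file such as UniRef90 record by record and writing out a re-encoded copy. File names are placeholders, and translate() refers to the illustrative mapping sketched in the About section.

```python
# Sketch of streaming a large FASTA file and writing a reduced-alphabet copy.
# File names are placeholders; this is not the repository's dataset script.
def stream_fasta(path):
    """Yield (header, sequence) pairs without loading the whole file into memory."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

with open("uniref90_reduced.fasta", "w") as out:
    for header, seq in stream_fasta("uniref90.fasta"):
        out.write(f"{header}\n{translate(seq)}\n")  # translate() from the sketch above
```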

Pretraining PLMs on reduced alphabets

To pretrain a protein language model, run train_prose_multitask.py. The implementation uses multiple GPUs and can be run on a single machine or on a cluster; the scripts for submitting the job to a cluster can be found at iridis-scripts. Training progress can be monitored using tensorboard.sh. All trained models can be downloaded from the release section.
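
train_prose_multitask.py is the actual entry point; the skeleton below is only a generic illustration of the multi-GPU pattern described above (one process per GPU via torch.distributed, progress written to TensorBoard), with a stand-in module in place of the real model and no claim about the script's real arguments or internals.

```python
# Generic multi-GPU training skeleton (launched with e.g. `torchrun`), shown only
# to illustrate the setup; train_prose_multitask.py is the real entry point.
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.tensorboard import SummaryWriter

def main():
    dist.init_process_group(backend="nccl")       # torchrun sets RANK / WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.LSTM(input_size=21, hidden_size=512, num_layers=3).cuda(local_rank)  # stand-in encoder
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Only rank 0 writes TensorBoard logs, which tensorboard.sh would then serve.
    writer = SummaryWriter("runs/pretrain") if dist.get_rank() == 0 else None

    # ... multitask / masked-language-modelling training loop goes here ...

    if writer is not None:
        writer.close()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```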

Finetuning on downstream tasks

After pretraining the protein language model, you can finetune it on downstream tasks by running the corresponding Python files in this repository.

If you want to run these experiments on a cluster, take a look at the iridis-scripts folder.

Reproducing plots from the paper

To reproduce the plots for the amino acid embedding projection using PCA, use the notebook aa_embeddings.ipynb. For experiments involving protein structure prediction using reduced amino acid alphabets, use the notebook esm-structure-prediction.ipynb. This notebook contains code for generating the structures with ESMFold and everything else needed to recreate the results.
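
aa_embeddings.ipynb is the authoritative source for these figures; the fragment below only sketches the underlying idea, projecting an amino-acid embedding matrix to two dimensions with PCA and labelling each point with its residue letter. The embedding matrix here is random stand-in data rather than weights from a trained model.

```python
# Minimal PCA projection of an amino-acid embedding matrix; the matrix below is
# random stand-in data, not the learned embeddings from a trained PLM.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(AMINO_ACIDS), 128))  # stand-in for learned embeddings

coords = PCA(n_components=2).fit_transform(embeddings)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(coords[:, 0], coords[:, 1])
for (x, y), aa in zip(coords, AMINO_ACIDS):
    ax.annotate(aa, (x, y))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()
```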

For more information on the steps taken to create the WASS14 alphabet, see surface_plots.ipynb.

Embedding protein sequences

If you want to embed a set of protein sequences using any of the models, you can use the embedd.py script. You only need to provide a FASTA file.
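
embedd.py handles this end to end; the fragment below is only a rough sketch of the kind of workflow involved, with a stand-in nn.Embedding in place of a pretrained model and mean-pooling over residues, so the tokenization, shapes, and file handling are illustrative rather than the script's actual behaviour.

```python
# Rough sketch of embedding sequences from a FASTA file; the nn.Embedding below
# is a stand-in for a pretrained PLM from the release section.
import numpy as np
import torch
from torch import nn

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
TOKEN = {aa: i for i, aa in enumerate(ALPHABET)}

model = nn.Embedding(len(ALPHABET), 128)  # placeholder: load a trained model instead
model.eval()

def embed(sequence: str) -> np.ndarray:
    """Return one fixed-size vector per sequence by mean-pooling residue embeddings."""
    tokens = torch.tensor([TOKEN.get(aa, TOKEN["X"]) for aa in sequence.upper()])
    with torch.no_grad():
        per_residue = model(tokens)           # (length, embedding_dim)
    return per_residue.mean(dim=0).numpy()

# stream_fasta() is the helper sketched in the Datasets section above.
headers, vectors = zip(*((h, embed(s)) for h, s in stream_fasta("sequences.fasta")))
np.savez("embeddings.npz", headers=np.array(headers), vectors=np.stack(vectors))
```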

This repository also incorporates code adapted from other sources. If you find the repo useful, please cite the following works too:

  • Surface generation code: MASIF
  • LDDT calculation: AlphaFold
  • Model architecture and UniProt tokenization: Prose
  • MSA plot generation: ColabFold

Authors

Ioan Ieremie, Rob M. Ewing, Mahesan Niranjan

Citation

to be added

Contact

ii1g17 [at] soton [dot] ac [dot] uk
