elttaes / revisiting-plms

Exploring Evolution-aware & free protein language models as protein function predictors

License: MIT License

Languages: Python 94.37%, Jupyter Notebook 5.63%
Topics: alphafold, function-prediction, fitness-prediction, alphafold2, benchmark, contact-prediction, multiple-sequence-alignment, protein-annotation, protein-classification, protein-language-model

revisiting-plms's Introduction

Env:

JAX (AlphaFold2):

https://github.com/kalininalab/alphafold_non_docker

PyTorch (ESM-1b, MSA-Transformer):

  1. As a prerequisite, you must have PyTorch installed (https://pytorch.org/get-started/locally/).

  2. pip install fair-esm # latest release, OR:

    pip install git+https://github.com/facebookresearch/esm.git
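Before running anything, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the import names (`torch`, `esm`, `jax`) are assumptions about what PyTorch, fair-esm, and JAX install.

```python
import importlib.util

# Display name -> import name. The import names are assumptions:
# PyTorch is imported as `torch`, fair-esm as `esm`, JAX as `jax`.
REQUIRED = {"PyTorch": "torch", "fair-esm": "esm", "JAX": "jax"}


def missing_packages(required):
    """Return the display names of packages whose module cannot be found."""
    return [
        label
        for label, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the modules can be located; it does not check versions or GPU support.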

Data:

For SSP & Contact map:

ESMStructuralSplitDataset:

| Name | Description | URL |
|------|-------------|-----|
| splits | train/valid splits | https://dl.fbaipublicfiles.com/fair-esm/structural-data/splits.tar.gz |
| pkl | pkl objects containing sequence, SSP labels, distance map, and 3d coordinates | https://dl.fbaipublicfiles.com/fair-esm/structural-data/pkl.tar.gz |
| msas | a3m files containing MSA for each domain | https://dl.fbaipublicfiles.com/fair-esm/structural-data/msas.tar.gz |

from https://github.com/facebookresearch/esm
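The a3m files listed above store insertions relative to the query as lowercase letters, so the rows only line up once those characters are dropped. The sketch below shows one common way to read such a file into equal-length aligned sequences. It is an illustration rather than this repository's loader; the function name `parse_a3m` and the `max_seqs` parameter are hypothetical.

```python
import string
from typing import List, Optional, Tuple

# Characters removed from every sequence: lowercase letters mark insertions
# in a3m, and "." / "*" are treated here as non-alignment characters.
_DELETE = str.maketrans("", "", string.ascii_lowercase + ".*")


def parse_a3m(text: str, max_seqs: Optional[int] = None) -> List[Tuple[str, str]]:
    """Parse a3m text into (header, aligned_sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks).translate(_DELETE)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).translate(_DELETE)))
    return records[:max_seqs]


if __name__ == "__main__":
    example = ">query\nMKVLA\n>hit1\nMKaV-A\n"
    for name, seq in parse_a3m(example):
        print(name, seq)
```

To read a file from the extracted archive, pass its contents, for example `parse_a3m(open(path).read(), max_seqs=64)`, where `path` and the depth of 64 are placeholders.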

For Contact map Test:

CAMEO (https://www.cameo3d.org/)
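Contact-map evaluation usually starts from a distance map like the one stored in the pkl objects listed earlier. The sketch below binarizes a distance map and scores predictions with top-L precision. The 8 Å threshold and the minimum sequence separation of 6 are conventional choices assumed here, not values confirmed by this repository, and both function names are hypothetical.

```python
from typing import List, Sequence


def contacts_from_distances(
    dist: Sequence[Sequence[float]], threshold: float = 8.0, min_sep: int = 6
) -> List[List[bool]]:
    """Mark residue pairs closer than `threshold` and at least `min_sep` apart."""
    n = len(dist)
    return [
        [abs(i - j) >= min_sep and dist[i][j] < threshold for j in range(n)]
        for i in range(n)
    ]


def precision_at_l(
    scores: Sequence[Sequence[float]],
    contacts: Sequence[Sequence[bool]],
    min_sep: int = 6,
) -> float:
    """Fraction of true contacts among the L highest-scoring pairs (L = length)."""
    n = len(scores)
    # Only the upper triangle is ranked, so each pair is counted once.
    pairs = [(scores[i][j], i, j) for i in range(n) for j in range(i + min_sep, n)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    top = pairs[:n]
    if not top:
        return 0.0
    return sum(contacts[i][j] for _, i, j in top) / len(top)
```

Variants such as top-L/5 precision or separate short, medium, and long-range bins follow by changing the slice size and `min_sep`.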

For Fitness dataset:

TAPE (https://github.com/songlab-cal/tape)

For pretraining ESM-1b & MSA-Transformer:

AlphaFold2 training data:

https://registry.opendata.aws/openfold/

from OpenFold (https://github.com/aqlaboratory/openfold)

@article{hu2022exploring,
  title={Exploring evolution-aware \&-free protein language models as protein function predictors},
  author={Hu, Mingyang and Yuan, Fajie and Yang, Kevin K and Ju, Fusong and Su, Jin and Wang, Hui and Yang, Fei and Ding, Qiuyang},
  journal={arXiv preprint arXiv:2206.06583},
  year={2022}
}

revisiting-plms's Issues

About datasets with AlphaFold on contact map prediction

Hello, great work. I recently wanted to try your code for contact map prediction. I saw the link to ESMStructuralSplitDataset, but couldn't find the AlphaFold data. Could you release this data? I hope to hear from you. Thanks so much.

Does the padding affect the training of SSP?

Hi. Thanks for your interesting work!

I noticed that you used the pad token to align sequences of different lengths and MSAs. I'm curious if the resulting representations with the padding token are consistent with the representations without the padding token. Additionally, I would like to know if applying a convolution network on the representations with the padding token would affect training and prediction. Looking forward to your response!

Remote homology?

Hi, thanks for sharing the code! I saw a remote homology folder in your repo. Since I didn't notice any reported results for it, what is it for?

Metal ion binding dataset

Hi there,

Nice work!
I have a question about the metal ion binding dataset used in your paper.
Could you let me know where you got the original dataset?

Thanks!

AF2 structure/embedding for Stability (Fitness) task is not available.

Hello,
Interesting work. Could you provide the AF2 embeddings for the 60,000+ sequences in the stability task? (The dataset from TAPE's GitHub only contains the sequences.)
It would be even better if the corresponding AF2 structures for those sequences were also available.
Generating them again would be a waste of resources and error-prone if we use a different database or settings.

Thanks.
