
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Home Page: https://arxiv.org/abs/2401.06199

License: Apache License 2.0


xTrimoPGLM

Introduction to xTrimoPGLM Family Models

xTrimoPGLM is the open-source version of the latest protein language models for protein understanding (masked protein language models) and protein design (causal protein language models). The xTrimoPGLM family models are developed by BioMap and Tsinghua University. Along with this, we have released the int4-quantized xTrimoPGLM-100B weights and other smaller xTrimo-series models: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

Out-of-Distribution Perplexity Evaluation

We evaluated the xTrimoPGLM (xTMLM or xTCLM) and xTrimoPGLM (100B) models on two OOD test sets: one with sequence identity lower than 0.9 to the training set (<0.9 ID) and one with sequence identity lower than 0.5 to the training set (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The MLM perplexity results, compared against ESM2-3B and ESM2-15B, and the CLM perplexity, compared against ProGen2-xlarge (6.4B), are as follows (lower is better):

| Model | ESM2 (3B) | ESM2 (15B) | xTMLM (1B) | xTMLM (3B) | xTMLM (10B) | xT (100B) | xT (100B)-INT4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| < 0.9 ID | 7.7 | 7.3 | 9.3 | 7.8 | 7.6 | 6.7 | 6.8 |
| < 0.5 ID | 11.5 | 11.0 | 13.5 | 11.9 | 11.6 | 10.7 | 10.8 |

| Model | ProGen2-xlarge (6.4B) | xTCLM (1B) | xTCLM (3B) | xTCLM (7B) | xT (100B) | xT (100B)-INT4 |
| --- | --- | --- | --- | --- | --- | --- |
| < 0.9 ID | 9.7 | 9.8 | 9.3 | 8.9 | 8.7 | 8.9 |
| < 0.5 ID | 14.3 | 14.0 | 13.7 | 13.5 | 13.3 | 13.5 |
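
For a rough sense of how such a causal (CLM) perplexity can be computed with the released weights, the sketch below scores one sequence with shift-by-one cross-entropy. It only assumes the remote-code model returns standard token logits; it is not the exact evaluation script used for the tables above, and the example sequence is arbitrary.

# Hedged sketch: causal (CLM) perplexity of a single protein sequence.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn.functional as F

repo = "biomap-research/xtrimopglm-100b-int4"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = True
model = AutoModelForCausalLM.from_pretrained(repo, config=config, torch_dtype=torch.half, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

seq = 'MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG'
input_ids = tokenizer(seq, add_special_tokens=True, return_tensors='pt')['input_ids'].to(model.device)

with torch.inference_mode():
    logits = model(input_ids=input_ids).logits  # [1, seq_len, vocab_size]
    # Predict token t+1 from the prefix up to t, average the negative log-likelihood, exponentiate.
    nll = F.cross_entropy(logits[:, :-1].float().transpose(1, 2), input_ids[:, 1:])
    print(f"perplexity: {torch.exp(nll).item():.2f}")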

Downstream Protein Understanding Tasks Evaluation

(TODO)

Get Started

Model List

You can manually download the necessary weights from the links below, or pre-fetch them programmatically as sketched after the table.

| Model | Download |
| --- | --- |
| xTrimoPGLM-1B-MLM | 🤗 Huggingface / 🔨 SwissArmyTransformer |
| xTrimoPGLM-3B-MLM | 🤗 Huggingface / 🔨 SwissArmyTransformer |
| xTrimoPGLM-10B-MLM | 🤗 Huggingface / 🔨 SwissArmyTransformer |
| xTrimoPGLM-1B-CLM | 🤗 Huggingface / 🔨 SwissArmyTransformer |
| xTrimoPGLM-3B-CLM | 🤗 Huggingface / 🔨 SwissArmyTransformer |
| xTrimoPGLM-7B-CLM | 🤗 Huggingface / 🔨 SwissArmyTransformer |
| xTrimoPGLM-100B-Int4 (MLM or CLM) | 🤗 Huggingface / 🔨 SwissArmyTransformer |
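
If you would rather pre-fetch a checkpoint than rely on the transformers download cache, the snippet below is a minimal sketch using huggingface_hub; the local directory name is just an illustration, and the repo id is the int4 checkpoint used in the examples below.

# Hedged sketch: pre-download a checkpoint with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="biomap-research/xtrimopglm-100b-int4",
    local_dir="./xtrimopglm-100b-int4",  # arbitrary local path
)
print("weights cached at:", local_dir)

The resulting directory can then be passed to from_pretrained, or used as the "<your model cached dir>" argument of load_checkpoint_and_dispatch in the commented-out dispatch examples below.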

How to use

xTrimoPGLM-MLM: Masked Language Models for Protein Understanding Tasks

(Note that the int4-quantized xTrimoPGLM-100B can run inference on a single A100/A800 GPU with 80 GB of memory.)

# Obtain residue embeddings
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = False
model = AutoModelForMaskedLM.from_pretrained("biomap-research/xtrimopglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()

# If you don't have a single GPU with 80 GB of memory, try the dispatched load instead:
# from accelerate import load_checkpoint_and_dispatch, init_empty_weights
#
# with init_empty_weights():
#     model = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)
#
# model = load_checkpoint_and_dispatch(
#     model, "<your model cached dir>", device_map="auto", no_split_module_classes=["xTrimoPGLMBlock"], strict=True, dtype=torch.half
# )

model.eval()

seq = 'MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG'
output = tokenizer(seq, add_special_tokens=True, return_tensors='pt')
with torch.inference_mode():
    inputs = {"input_ids": output["input_ids"].to(model.device), "attention_mask": output["attention_mask"].to(model.device)}
    output_embeddings = model(**inputs, output_hidden_states=True, return_last_hidden_state=True).hidden_states[:-1, 0]  # get rid of the <eos> token


# model for the sequence-level tasks
model = AutoModelForSequenceClassification.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)

# model for the token-level tasks
model = AutoModelForTokenClassification.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)
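
If you only need a single fixed-size embedding per sequence (for example as the input to a lightweight downstream classifier), a simple option is to mean-pool the residue embeddings computed above. This is a generic sketch, not a prescribed xTrimoPGLM recipe, and it assumes output_embeddings from the snippet above has shape [num_residues, hidden_dim].

# Hedged sketch: mean-pool per-residue embeddings into one per-sequence vector.
def mean_pool(residue_embeddings: torch.Tensor) -> torch.Tensor:
    # Unweighted mean over the residue dimension; cast to float32 for stability.
    return residue_embeddings.float().mean(dim=0)

seq_embedding = mean_pool(output_embeddings)
print(seq_embedding.shape)  # expected: [hidden_dim]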

Refer to the finetune folder for more finetuning examples, such as LoRA and linear probing.

xTrimoPGLM-CLM: Causal Language Models for Protein Design

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = True
model = AutoModelForCausalLM.from_pretrained("biomap-research/xtrimopglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()

# If you don't have a single GPU with 80 GB of memory, try the dispatched load instead:
# from accelerate import load_checkpoint_and_dispatch, init_empty_weights
#
# with init_empty_weights():
#     model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
#
# model = load_checkpoint_and_dispatch(
#     model, "<your model cached dir>", device_map="auto", no_split_module_classes=["xTrimoPGLMBlock"], strict=True, dtype=torch.half
# )
model.eval()

gen_kwargs = {'max_length': 256, 'top_p': 0.8, 'temperature': 0.9, 'num_beams': 1}
prompts = ['', 'MLFVVL', 'LDL', 'VTQA']

for idx, each in enumerate(prompts):
    print(f"Begin generating idx: {idx} with prompt {each}")
    output = model.chat(tokenizer, each, **gen_kwargs)
    print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")

For more inference scripts for the other models, please visit the corresponding model cards on the Hugging Face Hub.

Implementation of Pretraining

xTrimoPGLM pretraining is based on the Megatron-DeepSpeed framework. For details, please refer to the pretrain folder.

LICENSE

The code in this repository is open source under the Apache-2.0 license.

Citations

If you find our work useful, please consider citing the following papers:

@article{chen2024xtrimopglm,
  title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
  author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
  journal={arXiv preprint arXiv:2401.06199},
  year={2024}
}

@article{cheng2024training,
  title={Training Compute-Optimal Protein Language Models},
  author={Cheng, Xingyi and Chen, Bo and Li, Pan and Gong, Jing and Tang, Jie and Song, Le},
  journal={bioRxiv},
  pages={2024--06},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
