qizhipei / biot5

BioT5 (EMNLP 2023) and BioT5+ (ACL 2024 Findings)

Home Page: https://arxiv.org/abs/2310.07276

License: MIT License

Python 97.87% Shell 2.13%
bioinformatics computational-biology cross-modal machine-learning nlp nlp-applications

biot5's Introduction

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations 🔥

News

🎉July 18 2024: Happy to share that our enhanced version of BioT5+ ranked 1st place in the Text-based Molecule Generation track and 2nd place in the Molecular Captioning track at the Language + Molecule @ ACL 2024 Competition!

🔥July 11 2024: Data, code, and pre-trained models for BioT5+ are released.

🔥May 16 2024: BioT5+ is accepted by ACL 2024 (Findings).

🔥Mar 03 2024: We have published a survey paper, Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey, and the related GitHub repository Awesome-Biomolecule-Language-Cross-Modeling. Please check them out if you are interested in this field.

🔥Feb 29 2024: Updated BioT5 to BioT5+ with IUPAC integration and multi-task learning!

🔥Nov 06 2023: Updated example usage for molecule captioning, text-based molecule generation, and drug-target interaction prediction!

🔥Oct 20 2023: The data for fine-tuning is released!

🔥Oct 19 2023: The pre-trained and fine-tuned models are released!

🔥Oct 11 2023: Initial commits. More code, pre-trained models, and data are coming soon.

Overview

This repository contains the source code for BioT5 (EMNLP 2023) and BioT5+ (ACL 2024 Findings).

Overview of BioT5 (figure)

Overview of BioT5+ (figure)

Please refer to the biot5 or biot5_plus folder for detailed instructions.

Citations

BioT5

@inproceedings{pei2023biot5,
  title={BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations},
  author={Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  month={December},
  year={2023},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2023.emnlp-main.70},
  pages={1102--1123}
}

BioT5+

@article{pei2024biot5+,
  title={BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning},
  author={Pei, Qizhi and Wu, Lijun and Gao, Kaiyuan and Liang, Xiaozhuan and Fang, Yin and Zhu, Jinhua and Xie, Shufang and Qin, Tao and Yan, Rui},
  journal={arXiv preprint arXiv:2402.17810},
  year={2024}
}

Acknowledgements

The code is based on nanoT5.

biot5's People

Contributors

apeterswu, qizhipei


biot5's Issues

Use case of biot5-base-text2mol

Hi Qizhi,

Thanks for the nice work you have done on BioT5.
I have a few questions regarding the use case of the biot5-base-text2mol model. When I run the Example Usage provided in the model card, the output is a natural language description rather than a SELFIES molecule. The complete input and output are as follows:

Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.

Now complete the following example -
Input: The molecule is a monocarboxylic acid anion obtained by deprotonation of the carboxy and sulfino groups of 3-sulfinopropionic acid. Major microspecies at pH 7.3 It is an organosulfinate oxoanion and a monocarboxylic acid anion. It is a conjugate base of a 3-sulfinopropionic acid.
Output: 
The molecule is a divalent inorganic anion obtained by removal of both protons from methylene chloride. It is a metabolite of methanol. It has a role as a human xenobiotic metabolite. It is a divalent inorganic anion and a methylene sulfide.

As a result, the model output cannot be decoded to SMILES, since it is not SELFIES. This confuses me a lot. Do you have any idea why this happens?

Add Model Config Files to HuggingFace

Thank you very much for adding this repo.

When you uploaded to HuggingFace, only the model weights were included and not the config files for the tokenizer or model architecture.
(Only one file here).
https://huggingface.co/QizhiPei/BioT5/tree/main/pretrained

Can you please add the config files to HuggingFace as well?
An example of this being done would be the base model you used (T5) which contains key files such as: tokenizer_config.json, config.json, etc.
https://huggingface.co/google/t5-v1_1-base/tree/main

Doing this isn't hard:
All you have to do is run Hugging Face's save_pretrained method with push_to_hub set to True, i.e.
model.save_pretrained("QizhiPei/BioT5/pretrained", push_to_hub=True)
Docs:
https://huggingface.co/docs/transformers/add_new_model#transformers.PreTrainedModel.save_pretrained
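For context, a rough sketch of what pushing the full checkpoint (weights plus config files) could look like; the local path and Hub repo name below are placeholders, not the actual ones used for BioT5:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Hypothetical local checkpoint directory containing the weights and tokenizer files.
model = T5ForConditionalGeneration.from_pretrained("./biot5_pretrained_checkpoint")
tokenizer = T5Tokenizer.from_pretrained("./biot5_pretrained_checkpoint")

# save_pretrained writes config.json and the tokenizer files alongside the weights;
# push_to_hub=True then uploads the folder contents to the named Hub repository.
model.save_pretrained("biot5-pretrained", push_to_hub=True)
tokenizer.save_pretrained("biot5-pretrained", push_to_hub=True)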

Thanks again!

Code and model of BioT5+

Thanks for your great work! I'd like to know when the code and model of BioT5+ will be released.

Text-to-small molecule deterministic vs. nondeterministic behavior

I was just wondering if there is a way to adjust the model to be less deterministic. I've tried adjusting the number of beams in the beam search, adding functionality to adjust the temperature of the model, and various other things in an attempt to make the model non-deterministic, but it does not seem to be working. As an example, I have tried the following:

from transformers import T5Tokenizer, T5ForConditionalGeneration
import selfies as sf

tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base-text2mol", model_max_length=540)
model = T5ForConditionalGeneration.from_pretrained('QizhiPei/biot5-base-text2mol')

task_definition = 'Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.\n\n'
text_input = "The molecule is a monocarboxylic acid anion obtained by deprotonation of the carboxy and sulfino groups of 3-sulfinopropionic acid. Major microspecies at pH 7.3 It is an organosulfinate oxoanion and a monocarboxylic acid anion. It is a conjugate base of a 3-sulfinopropionic acid."
task_input = f'Now complete the following example -\nInput: {text_input}\nOutput: '

model_input = task_definition + task_input
input_ids = tokenizer(model_input, return_tensors="pt").input_ids

generation_config = model.generation_config
generation_config.max_length = 512
generation_config.num_beams = 5
generation_config.do_sample = True
generation_config.top_k = 50
generation_config.top_p = 0.75
generation_config.temperature = 0.95

# Set the number of sequences to generate, must be <= num_beams
num_sequences = 3
generation_config.num_return_sequences = num_sequences

outputs = model.generate(input_ids, generation_config=generation_config)

print(f"Top {num_sequences} generated sequences:")
for i, output in enumerate(outputs, start=1):
    output_selfies = tokenizer.decode(output, skip_special_tokens=True).replace(' ', '')
    output_smiles = sf.decoder(output_selfies)
    print(f"Sequence {i}:")
    print("Generated SELFIES:", output_selfies)
    print("Generated SMILES:", output_smiles)
    print()

It seems that no matter how I adjust these settings (turning the number of beams down to one or up to higher values, for example), the output does not change. Even changing the prompt slightly does not seem to change the output. Perhaps the model has learned too rigid a representation of the text-molecule mapping? Or perhaps I am overlooking something?
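For reference, here is a minimal sampling-only variant of the generation call above (an assumption on my side, not a confirmed fix): with num_beams=1 and do_sample=True the model falls back to plain multinomial sampling, so repeated calls can in principle return different sequences.

# Sampling-only sketch, reusing model, tokenizer, and input_ids from the code above.
sampled = model.generate(
    input_ids,
    max_length=512,
    num_beams=1,            # disable beam search entirely
    do_sample=True,         # sample from the softmax instead of taking the argmax
    temperature=1.2,        # >1 flattens the distribution for more diversity
    top_p=0.95,
    num_return_sequences=3, # in sampling mode this no longer depends on num_beams
)
for seq in sampled:
    print(tokenizer.decode(seq, skip_special_tokens=True).replace(' ', ''))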

ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.

Very exciting work, but I encountered an error when loading the biot5-base-text2mol tokenizer from Hugging Face. I tried transformers==4.28.1 and 4.33.3.

code:
tokenizer = T5Tokenizer.from_pretrained("/workspace/LLM_ckpt/QizhiPei/biot5-base-text2mol", model_max_length=512)

ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.
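One way to inspect the mismatch (just a diagnostic sketch using standard tokenizer APIs, not a confirmed cause) is to list the added tokens and their indices and compare them with the base vocabulary size; the error above suggests '<pad>' was recorded at index 0 instead of after the 32,100-entry base vocabulary.

from transformers import AutoTokenizer

# Load a checkpoint that does not raise, then inspect the added-token indices.
tok = AutoTokenizer.from_pretrained("QizhiPei/biot5-base")
print("base vocab size:", tok.vocab_size)
print("added tokens:", tok.get_added_vocab())  # mapping from added token to index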

I have another question.
When I load the biot5-base model for text2mol and follow the text2mol code, the results are as follows:

output_selfies:

<p>M<p>T<p>T<p>P<p>T<p>P<p>S<p>P<p>S<p>P<p>A<p>P<p>S<p>P<p>A<p>P<p>S<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P

output_smiles:

My code:

from transformers import AutoTokenizer, T5Tokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("/workspace/LLM_ckpt/QizhiPei/biot5-base")
model = T5ForConditionalGeneration.from_pretrained('/workspace/LLM_ckpt/QizhiPei/biot5-base')

# tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base-text2mol", model_max_length=512)
# model = T5ForConditionalGeneration.from_pretrained('QizhiPei/biot5-base-text2mol')

task_definition = 'Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.\n\n'
text_input = 'The molecule is a monocarboxylic acid anion obtained by deprotonation of the carboxy and sulfino groups of 3-sulfinopropionic acid. Major microspecies at pH 7.3 It is an organosulfinate oxoanion and a monocarboxylic acid anion. It is a conjugate base of a 3-sulfinopropionic acid.'
task_input = f'Now complete the following example -\nInput: {text_input}\nOutput: '

model_input = task_definition + task_input
input_ids = tokenizer(model_input, return_tensors="pt").input_ids

generation_config = model.generation_config
generation_config.max_length = 512
generation_config.num_beams = 1

outputs = model.generate(input_ids, generation_config=generation_config)
output_selfies = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(' ', '')
print("output_selfies:",output_selfies)

import selfies as sf
output_smiles = sf.decoder(output_selfies)
print("output_smiles:",output_smiles)

Thank you for your help!!!

Comparison to MoleculeSTM and ProteinDT

Hi! Thanks for the interesting work! I was just wondering if you had any comparisons of BioT5+ to MoleculeSTM or ProteinDT. They are mentioned in the paper, but no comparison is given for text-guided molecule design or text-guided protein design. My intuition is that the contrastive, CLIP-like approach is a good fit for the problem, but it is not clear how it compares to this method, and training on multiple modalities sometimes improves performance as well. Do you have a good sense of how the two approaches compare to one another?

About the metrics of FCD and Text2Mol

Thanks for your great work!

I want to know the details of how the FCD and Text2Mol metrics are calculated. It seems that the related code is not provided in the repo.

Actually, I have already found the repositories for FCD and Text2Mol, but I don't know the details of how to use them to reproduce the results in the BioT5 paper.

For example, when calculating FCD, do only valid molecules participate in the calculation?

I would be grateful if the code or some details could be provided!
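For what it's worth, here is a rough sketch of one way the FCD number could be computed with the fcd package from the original FCD repository (assuming it exposes get_fcd; whether invalid molecules should be filtered out first is exactly the open question above):

from fcd import get_fcd          # assumed API of the bioinf-jku/FCD package
from rdkit import Chem

def keep_valid(smiles_list):
    # Keep only SMILES that RDKit can parse, in canonical form.
    valid = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))
    return valid

# Toy placeholders; in practice these would be the full lists of generated and
# ground-truth SMILES from the test set.
generated = keep_valid(["CCO", "C1=CC=CC=C1", "not_a_molecule"])
reference = keep_valid(["CCO", "c1ccccc1O"])
print("FCD:", get_fcd(generated, reference))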

Plan for Pre-training data release

Hi, I appreciate your efforts in releasing this repository.

Do you have plans to release the pre-training data used for this work?

Thanks!
