qizhipei / biot5

BioT5 (EMNLP 2023) and BioT5+ (ACL 2024 Findings)

Home Page: https://arxiv.org/abs/2310.07276

License: MIT License

Python 97.87% Shell 2.13%
bioinformatics computational-biology cross-modal machine-learning nlp nlp-applications

biot5's Introduction

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations 🔥

News

🎉July 18 2024: Happy to share that our enhanced version of BioT5+ ranked 1st place in the Text-based Molecule Generation track and 2nd place in the Molecular Captioning track at the Language + Molecule @ ACL 2024 Competition!

🔥July 11 2024: Data, code, and pre-trained models for BioT5+ are released.

🔥May 16 2024: BioT5+ is accepted by ACL 2024 (Findings).

🔥Mar 03 2024: We have published a survey paper, Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey, and the related GitHub repository Awesome-Biomolecule-Language-Cross-Modeling. Please check them out if you are interested in this field.

🔥Feb 29 2024: Updated BioT5 to BioT5+ with IUPAC integration and multi-task learning!

🔥Nov 06 2023: Updated example usage for molecule captioning, text-based molecule generation, and drug-target interaction prediction!

🔥Oct 20 2023: The data for fine-tuning is released!

🔥Oct 19 2023: The pre-trained and fine-tuned models are released!

🔥Oct 11 2023: Initial commits. More code, pre-trained models, and data are coming soon.

Overview

This repository contains the source code for BioT5 (EMNLP 2023) and BioT5+ (ACL 2024 Findings).

Overview of BioT5 (figure)

Overview of BioT5+ (figure)

Please refer to the biot5 or biot5_plus folder for detailed instructions.

Citations

BioT5

@inproceedings{pei2023biot5,
  title={BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations},
  author={Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  month={December},
  year={2023},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2023.emnlp-main.70},
  pages={1102--1123}
}

BioT5+

@article{pei2024biot5+,
  title={BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning},
  author={Pei, Qizhi and Wu, Lijun and Gao, Kaiyuan and Liang, Xiaozhuan and Fang, Yin and Zhu, Jinhua and Xie, Shufang and Qin, Tao and Yan, Rui},
  journal={arXiv preprint arXiv:2402.17810},
  year={2024}
}

Acknowledgements

The code is based on nanoT5.

biot5's People

Contributors

apeterswu, qizhipei


biot5's Issues

Use case of biot5-base-text2mol

Hi Qizhi,

Thanks for the nice work you have done on BioT5.
I have a few questions regarding the use case of the biot5-base-text2mol model. When I run the Example Usage provided in the model card, the output is a natural language description rather than a SELFIES molecule. The complete input and output are as follows:

Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.

Now complete the following example -
Input: The molecule is a monocarboxylic acid anion obtained by deprotonation of the carboxy and sulfino groups of 3-sulfinopropionic acid. Major microspecies at pH 7.3 It is an organosulfinate oxoanion and a monocarboxylic acid anion. It is a conjugate base of a 3-sulfinopropionic acid.
Output: 
The molecule is a divalent inorganic anion obtained by removal of both protons from methylene chloride. It is a metabolite of methanol. It has a role as a human xenobiotic metabolite. It is a divalent inorganic anion and a methylene sulfide.

As a result, the model output cannot be decoded to SMILES, since it is not SELFIES. This confuses me a lot. Do you have any idea why this happens?

Add Model Config Files to HuggingFace

Thank you very much for adding this repo.

When you uploaded to HuggingFace, only the model weights were included and not the config files for the tokenizer or model architecture.
(Only one file here).
https://huggingface.co/QizhiPei/BioT5/tree/main/pretrained

Can you please add the config files to HuggingFace as well?
An example of this being done would be the base model you used (T5) which contains key files such as: tokenizer_config.json, config.json, etc.
https://huggingface.co/google/t5-v1_1-base/tree/main

Doing this isn't hard:
All you have to do is run Hugging Face's save_pretrained method with push_to_hub set to True, i.e.
model.save_pretrained("QizhiPei/BioT5/pretrained", push_to_hub=True)
Docs:
https://huggingface.co/docs/transformers/add_new_model#transformers.PreTrainedModel.save_pretrained
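For context, a rough sketch of what pushing the full checkpoint (weights plus config files) could look like; the local path and Hub repo name below are placeholders, not the actual ones used for BioT5:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Hypothetical local checkpoint directory containing the weights and tokenizer files.
model = T5ForConditionalGeneration.from_pretrained("./biot5_pretrained_checkpoint")
tokenizer = T5Tokenizer.from_pretrained("./biot5_pretrained_checkpoint")

# save_pretrained writes config.json and the tokenizer files alongside the weights;
# push_to_hub=True then uploads the folder contents to the named Hub repository.
model.save_pretrained("biot5-pretrained", push_to_hub=True)
tokenizer.save_pretrained("biot5-pretrained", push_to_hub=True)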

Thanks again!

Code and model of BioT5+

Thanks for your great work! I'd like to know when the code and model of BioT5+ will be released.

Text-to-small molecule deterministic vs. nondeterministic behavior

I was just wondering if there is a way to adjust the model to be less deterministic. I've tried adjusting the number of beams in the beam search, adding functionality to adjust the temperature of the model, and various other things in an attempt to make the model non-deterministic, but it does not seem to be working. As an example, I have tried the following:

from transformers import T5Tokenizer, T5ForConditionalGeneration
import selfies as sf

tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base-text2mol", model_max_length=540)
model = T5ForConditionalGeneration.from_pretrained('QizhiPei/biot5-base-text2mol')

task_definition = 'Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.\n\n'
text_input = "The molecule is a monocarboxylic acid anion obtained by deprotonation of the carboxy and sulfino groups of 3-sulfinopropionic acid. Major microspecies at pH 7.3 It is an organosulfinate oxoanion and a monocarboxylic acid anion. It is a conjugate base of a 3-sulfinopropionic acid."
task_input = f'Now complete the following example -\nInput: {text_input}\nOutput: '

model_input = task_definition + task_input
input_ids = tokenizer(model_input, return_tensors="pt").input_ids

generation_config = model.generation_config
generation_config.max_length = 512
generation_config.num_beams = 5
generation_config.do_sample = True
generation_config.top_k = 50
generation_config.top_p = 0.75
generation_config.temperature = 0.95

# Set the number of sequences to generate, must be <= num_beams
num_sequences = 3
generation_config.num_return_sequences = num_sequences

outputs = model.generate(input_ids, generation_config=generation_config)

print(f"Top {num_sequences} generated sequences:")
for i, output in enumerate(outputs, start=1):
    output_selfies = tokenizer.decode(output, skip_special_tokens=True).replace(' ', '')
    output_smiles = sf.decoder(output_selfies)
    print(f"Sequence {i}:")
    print("Generated SELFIES:", output_selfies)
    print("Generated SMILES:", output_smiles)
    print()

It seems that no matter how I adjust these settings (turning the number of beams down to one or up to higher values, for example), the output does not change. Even changing the prompt slightly does not seem to change the output. Perhaps the model has learned too rigid a representation of the text-molecule mapping? Or perhaps I am overlooking something?
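For reference, here is a minimal sampling-only variant of the generation call above (an assumption on my side, not a confirmed fix): with num_beams=1 and do_sample=True the model falls back to plain multinomial sampling, so repeated calls can in principle return different sequences.

# Sampling-only sketch, reusing model, tokenizer, and input_ids from the code above.
sampled = model.generate(
    input_ids,
    max_length=512,
    num_beams=1,            # disable beam search entirely
    do_sample=True,         # sample from the softmax instead of taking the argmax
    temperature=1.2,        # >1 flattens the distribution for more diversity
    top_p=0.95,
    num_return_sequences=3, # in sampling mode this no longer depends on num_beams
)
for seq in sampled:
    print(tokenizer.decode(seq, skip_special_tokens=True).replace(' ', ''))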

ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.

Very exciting work, but I encountered an error when loading the biot5-base-text2mol tokenizer from Hugging Face. I tried transformers==4.28.1 and 4.33.3.

code:
tokenizer = T5Tokenizer.from_pretrained("/workspace/LLM_ckpt/QizhiPei/biot5-base-text2mol", model_max_length=512)

ValueError: Non-consecutive added token '<pad>' found. Should have index 32100 but has index 0 in saved vocabulary.
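One way to inspect the mismatch (just a diagnostic sketch using standard tokenizer APIs, not a confirmed cause) is to list the added tokens and their indices and compare them with the base vocabulary size; the error above suggests '<pad>' was recorded at index 0 instead of after the 32,100-entry base vocabulary.

from transformers import AutoTokenizer

# Load a checkpoint that does not raise, then inspect the added-token indices.
tok = AutoTokenizer.from_pretrained("QizhiPei/biot5-base")
print("base vocab size:", tok.vocab_size)
print("added tokens:", tok.get_added_vocab())  # mapping from added token to index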

I have another question.
When I load the biot5-base model for text2mol and follow the text2mol code, the results are as follows:

output_selfies:

<p>M<p>T<p>T<p>P<p>T<p>P<p>S<p>P<p>S<p>P<p>A<p>P<p>S<p>P<p>A<p>P<p>S<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P<p>A<p>P

output_smiles:

My code:

from transformers import AutoTokenizer, T5Tokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("/workspace/LLM_ckpt/QizhiPei/biot5-base")
model = T5ForConditionalGeneration.from_pretrained('/workspace/LLM_ckpt/QizhiPei/biot5-base')

# tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base-text2mol", model_max_length=512)
# model = T5ForConditionalGeneration.from_pretrained('QizhiPei/biot5-base-text2mol')

task_definition = 'Definition: You are given a molecule description in English. Your job is to generate the molecule SELFIES that fits the description.\n\n'
text_input = 'The molecule is a monocarboxylic acid anion obtained by deprotonation of the carboxy and sulfino groups of 3-sulfinopropionic acid. Major microspecies at pH 7.3 It is an organosulfinate oxoanion and a monocarboxylic acid anion. It is a conjugate base of a 3-sulfinopropionic acid.'
task_input = f'Now complete the following example -\nInput: {text_input}\nOutput: '

model_input = task_definition + task_input
input_ids = tokenizer(model_input, return_tensors="pt").input_ids

generation_config = model.generation_config
generation_config.max_length = 512
generation_config.num_beams = 1

outputs = model.generate(input_ids, generation_config=generation_config)
output_selfies = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(' ', '')
print("output_selfies:",output_selfies)

import selfies as sf
output_smiles = sf.decoder(output_selfies)
print("output_smiles:",output_smiles)

Thank you for your help!!!

Comparison to MoleculeSTM and ProteinDT

Hi! Thanks for the interesting work! I was just wondering if you had any comparisons of BioT5+ to MoleculeSTM or ProteinDT. They are mentioned in the paper, but no comparison is given for text-guided molecule design or text-guided protein design. My intuition is that the contrastive, CLIP-like approach is a good fit for the problem, but it is not clear how it compares to this method, and training on multiple modalities sometimes improves performance as well. Do you have a good sense of how the two approaches compare to one another?

About the metrics of FCD and Text2Mol

Thanks for your great work!

I want to know the details of how the FCD and Text2Mol metrics are calculated. It seems that the related code is not provided in the repo.

Actually, I have already found the repositories for FCD and Text2Mol, but I don't know the details of how to use them to reproduce the results in the BioT5 paper.

For example, when calculating FCD, do only valid molecules participate in the calculation?

I would be grateful if the code or some details could be provided!
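For what it's worth, here is a rough sketch of one way the FCD number could be computed with the fcd package from the original FCD repository (assuming it exposes get_fcd; whether invalid molecules should be filtered out first is exactly the open question above):

from fcd import get_fcd          # assumed API of the bioinf-jku/FCD package
from rdkit import Chem

def keep_valid(smiles_list):
    # Keep only SMILES that RDKit can parse, in canonical form.
    valid = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))
    return valid

# Toy placeholders; in practice these would be the full lists of generated and
# ground-truth SMILES from the test set.
generated = keep_valid(["CCO", "C1=CC=CC=C1", "not_a_molecule"])
reference = keep_valid(["CCO", "c1ccccc1O"])
print("FCD:", get_fcd(generated, reference))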

Plan for Pre-training data release

Hi, I appreciate your efforts in releasing this repository.

Do you have plans to release the pre-training data used for this work?

Thanks!
