molgpt's Introduction

MolGPT

In this work, we train a small custom GPT on the MOSES and GuacaMol datasets with a next-token-prediction task. The model is then used for unconditional and conditional molecular generation, and we compare it with previous approaches on both datasets. Saliency maps are obtained for interpretability using the Ecco library.

  • The processed GuacaMol and MOSES datasets in CSV format can be downloaded from this link:

https://drive.google.com/drive/folders/1LrtGru7Srj_62WMR4Zcfs7xJ3GZr9N4E?usp=sharing

  • Original Guacamol dataset can be found here:

https://github.com/BenevolentAI/guacamol

  • Original Moses dataset can be found here:

https://github.com/molecularsets/moses

  • All trained weights can be found here:

https://www.kaggle.com/virajbagal/ligflow-final-weights

To train the model, make sure the datasets' CSV files are in the same directory as the code files.

Training

./train_moses.sh
./train_guacamol.sh

Generation

./generate_guacamol_prop.sh
./generate_moses_prop_scaf.sh

If you find this work useful, please cite:

Bagal, Viraj; Aggarwal, Rishal; Vinod, P. K.; Priyakumar, U. Deva (2021): MolGPT: Molecular Generation using a Transformer-Decoder Model. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14561901.v1

molgpt's People

Contributors

rishalaggarwal, virajbagal


molgpt's Issues

Chiral

Chiral carbons, [C@H] and [C@@H], are not included in the vocabulary.
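Bracket atoms such as [C@H] and [C@@H] are single SMILES tokens, so a vocabulary built character by character will miss them. A minimal regex-based tokenizer that keeps bracket atoms intact could look like the sketch below (this is an illustrative sketch, not MolGPT's actual tokenizer):

```python
import re

# One alternative per multi-character SMILES token; bracket atoms such as
# [C@H] and [C@@H] are matched whole, before single characters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"      # bracket atoms, e.g. [C@H], [C@@H], [nH]
    r"|Br|Cl"           # two-letter organic-subset atoms
    r"|%\d{2}"          # two-digit ring-closure labels, e.g. %10
    r"|.)"              # any remaining single character
)

def tokenize_smiles(smiles):
    """Split a SMILES string into vocabulary tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)
```

For example, tokenize_smiles("N[C@@H](C)C(=O)O") keeps [C@@H] as one token, so chiral carbons can get their own vocabulary entries.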

pretrained model

Dear author,

It seems that the model is trained on the GuacaMol/MOSES datasets. Would it be possible to pretrain on a larger dataset and then fine-tune the pretrained model? Would that improve performance?

Thanks.

Hi, the shared dataset is empty

Hi, I tried to download the processed data, but the dataset appears empty because I can't access it.

Could you open up data download access?

Thanks.

Variable num could be improved

In the CausalSelfAttention and GPT classes, the variable num, which currently depends on branching over the different conditions, could be computed with the single expression below:
num = int(bool(config.num_props)) + ((1 - int(config.lstm)) * config.scaffold_maxlen + int(config.lstm)) * int(config.scaffold)

I believe this handles all the cases.
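The proposed one-liner can be checked against explicit branching. A quick sketch (the argument names follow the config attributes above; the branched version is my reading of the intended behaviour: one extra token if any properties are conditioned on, plus either one LSTM summary token or scaffold_maxlen tokens when a scaffold is used):

```python
def num_condition_tokens(num_props, scaffold, lstm, scaffold_maxlen):
    """One-liner from the issue: count extra condition tokens."""
    return int(bool(num_props)) + ((1 - int(lstm)) * scaffold_maxlen + int(lstm)) * int(scaffold)

def num_condition_tokens_branched(num_props, scaffold, lstm, scaffold_maxlen):
    """Explicit branching over the same conditions, for comparison."""
    num = 1 if num_props else 0
    if scaffold:
        num += 1 if lstm else scaffold_maxlen
    return num
```

Under that reading, both versions agree on every combination of conditions.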

Lack of environment

Please provide hardware environment:

  • GPU?
  • CPU?
  • Disk space?

As well as the Python environment:

  • which libraries to set up?
  • which libraries to use?
  • Conda? Virtualenv?

loading model error: RuntimeError: Error(s) in loading state_dict for GPT

I have encountered a problem in this part:

model = GPT(mconf)
model.load_state_dict(torch.load(args.model_weight, map_location='cpu'), False)
# model.to('cpu')
print('Model loaded')

When I try to run this section, the following error occurs:

RuntimeError: Error(s) in loading state_dict for GPT:
size mismatch for pos_emb: checkpoint torch.Size([1, 40, 256]) vs current model torch.Size([1, 54, 256]).
size mismatch for tok_emb.weight: checkpoint torch.Size([94, 256]) vs current model torch.Size([26, 256]).
size mismatch for blocks.0.attn.mask through blocks.7.attn.mask: checkpoint torch.Size([1, 1, 74, 74]) vs current model torch.Size([1, 1, 54, 54]).
size mismatch for head.weight: checkpoint torch.Size([94, 256]) vs current model torch.Size([26, 256]).

How should this problem be solved? I would greatly appreciate it if someone could help me
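Every mismatch in the error pairs a checkpoint shape with the shape the freshly built model expects, so the usual remedy is to construct the model config with the checkpoint's dimensions (here, a vocabulary of 94 tokens rather than 26, and a block size matching the 74-wide attention mask) before calling load_state_dict. A small stdlib-only sketch of that comparison, with illustrative shape dicts rather than real checkpoints:

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {param_name: (checkpoint_shape, model_shape)} for every
    parameter whose shapes disagree -- the same pairs PyTorch reports."""
    return {
        name: (ckpt, model_shapes[name])
        for name, ckpt in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != ckpt
    }

# Illustrative shapes taken from the error message above.
checkpoint = {"tok_emb.weight": (94, 256), "pos_emb": (1, 40, 256)}
model = {"tok_emb.weight": (26, 256), "pos_emb": (1, 54, 256)}
```

Here shape_mismatches(checkpoint, model) flags both embeddings; rebuilding the model with the checkpoint's vocabulary and block size would clear them.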

Low validity on moses3 dataset

Dear authors,

I tried to train the model with default parameters on the moses3.csv dataset and then generate with the trained model. However, the validity I achieved is 0.868, much lower than the 0.994 reported in the paper.

python train.py --run_name unconditional_moses --data_name moses --num_props 0  
python generate.py --model_weight moses_nocond_12layer.pt --csv_name sampled.smiles --data_name moses     

Do you have any suggestions for solving this problem, or any hints on how to improve the validity? Thanks!

Is it possible to use a pretrained model from another type of tasks?

Dear author,

I saw multiple types of tasks in your study, e.g. without properties or scaffolds, with only one property, and with scaffolds.

Is it possible to use a model pretrained on one type of task for another with your code? Or is that theoretically implausible?

about pre-train data.

Thanks for your excellent work.
I have a question.
The downloaded processed datasets each contain only on the order of 1–2 million entries.
So how large was the dataset used for unconditional pre-training?

customize input dataset and property

Thanks so much for providing this amazing library! If possible, could you kindly consider implementing additional features that allow users to define a custom input dataset and property for conditional molecule generation?

KeyError: smiles

When I used the trained model for the Moses dataset (downloaded from the official website) to generate new molecules, the character % appeared in the SMILES, which ultimately caused a KeyError when the generated SMILES were transformed.
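For context, SMILES uses two-digit ring-closure labels written with % (e.g. %10) once more than nine ring bonds are open at a time, so a vocabulary built from data that never needed them will be missing those tokens at generation time. A minimal sketch of the failure and the fix (stoi is a hypothetical token-to-index dict, not MolGPT's actual vocabulary object):

```python
def encode(tokens, stoi):
    """Map SMILES tokens to vocabulary indices; a token absent from
    stoi raises the same KeyError described in this issue."""
    return [stoi[t] for t in tokens]

# Hypothetical vocabulary built from training data with at most nine
# open rings; generated SMILES containing %10 would then fail.
stoi = {"C": 0, "1": 1, "(": 2, ")": 3}
stoi["%10"] = len(stoi)  # extending the vocabulary avoids the KeyError
```

With the extension, encode(["C", "1", "%10"], stoi) succeeds; without it, the %10 lookup raises KeyError.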

train epochs

I just want to know how many epochs you trained the other models (VAE, AAE, char-RNN) for. I see you said you trained the GPT for 10 epochs; were all the models trained for the same number?

RuntimeError: Error(s) in loading state_dict for GPT

Hi, when I run the command from generate_guacamol_prop.sh:

python generate/generate.py --model_weight gua_tpsa_logp_sas.pt --props tpsa logp sas --data_name guacamol2 --csv_name gua_tpsa_logp_sas_temp1 --gen_size 10000 --batch_size 512 --vocab_size 94 --block_size 100

I get the following error:

RuntimeError: Error(s) in loading state_dict for GPT:
size mismatch for blocks.0.attn.mask through blocks.7.attn.mask: checkpoint torch.Size([1, 1, 101, 101]) vs current model torch.Size([1, 1, 201, 201]).
