
molgpt's Issues

KeyError: smiles

When I downloaded the trained model for the MOSES dataset from the official website and used it to generate new molecules, the character % appeared in the SMILES, which ultimately caused a KeyError when the SMILES were transformed.
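
For context, '%' in SMILES introduces a two-digit ring-closure label (e.g. %10), so it has to be kept as a single token. A minimal sketch of the failure mode (an assumption: the dataset code looks each character up in a stoi mapping that has no '%' entry):

import re

smiles = 'C%10CCCCC%10'                   # cyclohexane with its ring closure written as %10
print(list(smiles))                       # naive per-character split: '%' alone, hence the KeyError
print(re.findall(r'%\d{2}|.', smiles))    # ['C', '%10', 'C', ...] keeps the label whole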

Chiral

Chiral carbons, [C@H] and [C@@H], are not included in the vocabulary.
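
A minimal sketch of extending the vocabulary with chiral tokens (the whole_string/stoi names follow train.py, as discussed in the hardcoded-tokenizer issue below; the base list here is a hypothetical excerpt, and note that changing the vocabulary size makes existing checkpoints incompatible):

whole_string = ['#', '(', ')', '1', '2', '=', 'C', 'N', 'O', 'c', 'n', '<']   # hypothetical excerpt, not the real list
chiral_tokens = ['[C@H]', '[C@@H]', '[C@]', '[C@@]']
whole_string = sorted(set(whole_string) | set(chiral_tokens))
stoi = {token: i for i, token in enumerate(whole_string)}
itos = {i: token for token, i in stoi.items()}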

Hi, the shared dataset is empty

Hi, I tried to download the processed data, but I can't access it, so the dataset appears empty to me.

Could you open up data download access?

Thanks.

Is it possible to use a pretrained model from another type of task?

Dear author,

I saw multiple types of tasks in your study, e.g. without properties or scaffolds, with only one property, and with scaffolds.

Is it possible to use a model pretrained on one type of task for another using your code? Or is that theoretically not plausible?
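
For what it's worth, weights can often be partially reused across such task variants by copying only the tensors whose names and shapes match, letting condition-specific parts (e.g. pos_emb or property embeddings) initialize fresh. A minimal sketch, assuming GPT and mconf from molGPT's model.py and a hypothetical checkpoint name:

import torch

ckpt = torch.load('pretrained.pt', map_location='cpu')   # hypothetical checkpoint name
model = GPT(mconf)                                       # configured for the new task
state = model.state_dict()
compatible = {k: v for k, v in ckpt.items() if k in state and v.shape == state[k].shape}
state.update(compatible)                                 # copy only shape-compatible tensors
model.load_state_dict(state)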

Lack of environment

Please provide the hardware environment:

  • GPU?
  • CPU?
  • Disk space?

As well as the Python environment:

  • which libraries need to be installed?
  • which libraries are used?
  • Conda? Virtualenv?
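
For what it's worth, a setup along these lines has worked for similar PyTorch + RDKit projects (an assumption on my part; the repository itself does not pin an environment):

conda create -n molgpt python=3.8
conda activate molgpt
conda install -c conda-forge rdkit
pip install torch pandas numpy tqdm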

Low validity on moses3 dataset

Dear authors,

I trained the model with the default parameters on the moses3.csv dataset, then generated with the trained model. However, the validity I achieved is 0.868, which I think is much lower than the 0.994 reported in the paper.

python train.py --run_name unconditional_moses --data_name moses --num_props 0  
python generate.py --model_weight moses_nocond_12layer.pt --csv_name sampled.smiles --data_name moses     

Do you have any suggestions for solving this problem, or any hints on how to improve the validity? Thanks!
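
For reference, validity is usually computed as the fraction of generated SMILES that RDKit can parse. A minimal sketch (reading sampled.smiles as one SMILES per line is an assumption about generate.py's output format):

from rdkit import Chem, RDLogger

RDLogger.DisableLog('rdApp.*')   # silence per-molecule parse warnings

with open('sampled.smiles') as f:
    smiles = [line.strip() for line in f if line.strip()]
valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f'validity: {len(valid) / len(smiles):.3f}')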

customize input dataset and property

Thanks so much for providing this amazing library! If possible, could you kindly consider implementing some additional features that allow users to define a custom input dataset and properties for conditional molecule generation?
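
In the meantime, a custom conditional dataset can be approximated by writing a CSV in the layout train.py expects. A hypothetical sketch (the 'smiles' column name and the per-property columns are assumptions based on the MOSES/GuacaMol files):

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

df = pd.read_csv('my_molecules.csv')          # hypothetical input with a 'smiles' column
mols = df['smiles'].map(Chem.MolFromSmiles)
df['logp'] = mols.map(Descriptors.MolLogP)    # compute the conditioning properties
df['tpsa'] = mols.map(Descriptors.TPSA)
df.to_csv('my_dataset.csv', index=False)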

loading model error: RuntimeError: Error(s) in loading state_dict for GPT

In this part, I have encountered some problems
model = GPT(mconf)
model.load_state_dict(torch.load(args.model_weight, map_location='cpu'), False)
# model.to('cpu')
print('Model loaded')

When I try to run this section, this problem occurs

RuntimeError: Error(s) in loading state_dict for GPT:
size mismatch for pos_emb: copying a param with shape torch.Size([1, 40, 256]) from checkpoint, the shape in current model is torch.Size([1, 54, 256]).
size mismatch for tok_emb.weight: copying a param with shape torch.Size([94, 256]) from checkpoint, the shape in current model is torch.Size([26, 256]).
size mismatch for blocks.0.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for blocks.1.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for blocks.2.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for blocks.3.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for blocks.4.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for blocks.5.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for blocks.6.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for blocks.7.attn.mask: copying a param with shape torch.Size([1, 1, 74, 74]) from checkpoint, the shape in current model is torch.Size([1, 1, 54, 54]).
size mismatch for head.weight: copying a param with shape torch.Size([94, 256]) from checkpoint, the shape in current model is torch.Size([26, 256]).

How should this problem be solved? I would greatly appreciate it if someone could help me
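
In case it helps: these mismatches indicate the checkpoint was trained with a different vocabulary size (94 tokens vs. 26) and sequence length than the model being constructed, so mconf has to be built with the checkpoint's dimensions. A minimal diagnostic sketch that reads the shapes from the checkpoint itself (key names taken from the error above):

import torch

ckpt = torch.load(args.model_weight, map_location='cpu')
print('vocab size :', ckpt['tok_emb.weight'].shape[0])        # 94 in this checkpoint
print('pos_emb len:', ckpt['pos_emb'].shape[1])               # 40 in this checkpoint
print('attn mask  :', tuple(ckpt['blocks.0.attn.mask'].shape))
# Build mconf (vocab_size, block_size, conditioning options) to match these
# shapes before calling model.load_state_dict.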

Hardcoded Tokenizer Vocabulary Limits Model's Flexibility

Hi,

I've been working with molGPT and have encountered some issues that I believe stem from the hardcoded tokenizer vocabulary. I have some concerns and suggestions:

  1. Hardcoded Vocabulary: The whole_string variable in train.py appears to be a fixed vocabulary used to create the stoi (string-to-integer) mappings. This approach limits the model's ability to adapt to different datasets.

  2. Incompatibility with New Datasets: When training the model on my custom dataset, I encountered numerous errors due to tokens not present in the predefined vocabulary. I had to manually add these tokens, which is not scalable for larger or diverse datasets.

  3. Generation of Invalid SMILES: After training, the model generated SMILES strings that were consistently invalid. Here's a sample of the errors encountered:

[13:35:00] SMILES Parse Error: extra close parentheses while parsing: CF#F[Ag]S[Si-]F2P[Y+3][Ag]4[Ag][CH2](BrFI)S[OH+]FFFFI)P2FBr[Se][Se](BrS)21[Se]SBr[Se](BrS)[cH-]S)F8[cH-])FBrP2[BH-]4(BrF(BrF)(Br[Se]2FBrS)SBBBBB
[13:35:00] SMILES Parse Error: Failed parsing SMILES 'CF#F[Ag]S[Si-]F2P[Y+3][Ag]4[Ag][CH2](BrFI)S[OH+]FFFFI)P2FBr[Se][Se](BrS)21[Se]SBr[Se](BrS)[cH-]S)F8[cH-])FBrP2[BH-]4(BrF(BrF)(Br[Se]2FBrS)SBBBBB'
[13:35:01] SMILES Parse Error: extra open parentheses for input: 'CF[Se+]F[SbH]2SPOF34SPBrF3(F(SS[SbH]2S[BH-](Br[cH-]P2P[Y+3]S)F(Br[N-]3SBr[Se]#2PBrP[Si-]P2PBr[Se](BrS)[cH-][Se]BrS)[cH-][NH+](Br[Se][Se]#[SeH+]BBBBBBBBBBB'
[13:35:01] SMILES Parse Error: extra open parentheses for input: 'CF[Ag]F[Na]SP4[Ag](F[H-][c+][Ag]4[Ag]3[NH2+][CH2+]CBrF3[SH+]6S7(F(FBrF(F)FBrFPSC([Se](F)(F)PBrS)FSF3FF(F)(BrS)S)F4FBrFO)[Se][Se][Se]SBB'
[13:35:02] SMILES Parse Error: unclosed ring for input: 'CFBrP2FBrFSF2[H-][N+](P[Ag]4SP[PH+]F(FSF2SPPC(F)(F3S[Ag]3[Ag]%112PBr[cH-])P2PBrFBrS)F(F)F4F4SF(F)(BrSF3)P5)[Se]P[11CH3][CH2][Se]3'
[13:35:02] SMILES Parse Error: extra close parentheses while parsing: CP#[cH-]SPP2PI)2#F(FSCFFP35SF[Cl-][Si-]P2[Se][cH-]SBrF(F4BrF(F)FBrFI)F3F4BrBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
[13:35:02] SMILES Parse Error: Failed parsing SMILES 'CP#[cH-]SPP2PI)2#F(FSCFFP35SF[Cl-][Si-]P2[Se][cH-]SBrF(F4BrF(F)FBrFI)F3F4BrBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'

Out of 1000 generated SMILES, none were valid.

Suggestions for Improvement:

  1. Dynamic Vocabulary Generation: Consider implementing a method to dynamically generate the vocabulary based on the input dataset (see the sketch after this list). This would allow the model to adapt to different chemical spaces.

  2. Implement a More Robust Tokenizer: A character-level tokenizer or a more sophisticated SMILES-specific tokenizer might be more flexible and less prone to out-of-vocabulary issues.

  3. Validity Checking: Incorporate a validity check for generated SMILES, possibly using RDKit, to ensure the model is producing chemically valid structures.

  4. Fine-tuning Option: Provide an option to fine-tune the model on custom datasets, which could help it adapt to specific chemical domains.
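
As an illustration of point 1, here is a minimal sketch that derives the vocabulary from the training data itself (the regex is a common SMILES tokenization pattern, not molGPT's own, and using '<' as the padding token is an assumption):

import re
from collections import Counter

# Bracket atoms, two-digit ring closures, and two-letter elements are kept
# as single tokens; everything else falls through to one character.
PATTERN = re.compile(r'(\[[^\]]+\]|%\d{2}|Br|Cl|.)')

def build_vocab(smiles_list):
    counts = Counter(t for s in smiles_list for t in PATTERN.findall(s))
    tokens = ['<'] + sorted(counts)   # '<' as the pad token is an assumption
    stoi = {t: i for i, t in enumerate(tokens)}
    itos = {i: t for t, i in stoi.items()}
    return stoi, itos

stoi, itos = build_vocab(['CC(=O)O', 'c1ccccc1Br', 'C[C@@H](N)C%12CCC%12'])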

I believe addressing these points would significantly improve MolGPT's usability and performance across different chemical datasets. Let me know if you need any clarification or additional information.

about pre-train data.

Thanks for your excellent work.
I have a question.
The downloaded processed datasets each contain only on the order of 1 to 2 million molecules.
So, how large was the dataset used for unconditional pre-training?

pretrained model

Dear author,

It seems that the model is trained on the GuacaMol/MOSES datasets. Is it possible to pretrain on a bigger training dataset and then fine-tune the pretrained model? Would that improve performance?

Thanks

RuntimeError: Error(s) in loading state_dict for GPT

Hi, when I run the command below from generate_guacamol_prop.sh, I get a RuntimeError: Error(s) in loading state_dict for GPT:

python generate/generate.py --model_weight gua_tpsa_logp_sas.pt --props tpsa logp sas --data_name guacamol2 --csv_name gua_tpsa_logp_sas_temp1 --gen_size 10000 --batch_size 512 --vocab_size 94 --block_size 100

size mismatch for blocks.0.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).
size mismatch for blocks.1.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).
size mismatch for blocks.2.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).
size mismatch for blocks.3.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).
size mismatch for blocks.4.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).
size mismatch for blocks.5.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).
size mismatch for blocks.6.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).
size mismatch for blocks.7.attn.mask: copying a param with shape torch.Size([1, 1, 101, 101]) from checkpoint, the shape in current model is torch.Size([1, 1, 201, 201]).

train epochs

I just want to know how many epochs you trained the other models for, e.g. VAE, AAE, and char-RNN. I see that you said you trained the GPT for 10 epochs; were all these models trained for the same number of epochs?

Variable num could be improved

In the CausalSelfAttention and GPT classes, the variable num, which is determined by the form of the different conditions, could be simplified using the code below:
num = int(bool(config.num_props)) + ((1 - int(config.lstm)) * config.scaffold_maxlen + int(config.lstm)) * int(config.scaffold)
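# Hedged reading (assuming the meanings of config.num_props, config.scaffold,
# config.lstm, and config.scaffold_maxlen in molGPT's model.py): this adds one
# position when conditioning on properties, plus scaffold_maxlen positions for
# a token-level scaffold, or a single position when the scaffold is first
# encoded by the LSTM into one vector.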

I believe this handles all the situations.
