minkaixu / geoldm

Geometric Latent Diffusion Models for 3D Molecule Generation

License: MIT License

Python 100.00%

deep-generative-model diffusion-models drug-discovery geometric-deep-learning icml-2023 molecule molecule-generation

geoldm's Issues

Pre-trained checkpoints for QM9 conditional generation?

Hello.

Thank you for making this great work open-source. Would it be possible to also share the pre-trained QM9 conditional generation checkpoints used in the GeoLDM manuscript?

Best,
Alex

Question about the datasets_config.py script

Hello Minkai! My team and I have a question about the datasets_config.py script: what do the parameters 'distances' and 'radius_dic' mean? Thank you so much, we would really appreciate your response!

Question about autoencoder training stage

Hello, I have a question about the autoencoder phase. Does the latent invariant feature dimension k mentioned in the paper refer to the dimension of the encoder's output μh, while the dimension of μx remains 3?

How to use for non-QM9 and non-Drug data?

Hi MinkaiXu,

The paper looks very exciting. I have a small input dataset with SMILES and an associated experimental property. Is there simplified documentation on how I can use your algorithm to read such input and run training and testing, along with my choice of property prediction performance, prediction evaluation, and structure generation?

As a non-expert, I am not sure how to go about trying your algorithm on my own input dataset. For my input, I cannot use your QM9 and drug training datasets.

Any guidance, help, or code is appreciated. Thanks,
JL

AutoEncoder

Why not use the VAE framework?

Joint Training

Hi Minkai,

Thanks for the amazing work! I am wondering if in the code the AE and the Diffusion are trained jointly by default (with --train_diffusion and --trainable_ae), instead of training separately?

Are the evaluation results of conditional generative models accurate?

My trained GeoLDM conditional generative models for homo, lumo, and mu give 384 meV, 644 meV, and 1.36 D, respectively. Why?

question while loading pretrained models

Thanks a lot for your code, @MinkaiXu :) It's cool.
I ran into a problem when loading the pretrained models. The error message is as follows:

Traceback (most recent call last):
  File "eval_analyze.py", line 198, in <module>
    main()
  File "eval_analyze.py", line 127, in main
    with open(join(eval_args.model_path, 'args.pickle'), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'outputs/qm9_p/args.pickle'

I'm not sure how to solve it. I hope you can help me with it!
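
The traceback shows that eval_analyze.py opens args.pickle inside the directory passed as the model path, so the error means that file is not present at outputs/qm9_p/. A minimal sketch for checking a checkpoint directory before running the evaluation is given here. The helper name missing_checkpoint_files is hypothetical, and args.pickle is the only file the traceback confirms as required.

```python
import os
import tempfile


def missing_checkpoint_files(model_path, required=("args.pickle",)):
    """Return the required file names that are absent from model_path.

    Only 'args.pickle' is confirmed by the traceback; extend `required`
    (an assumption of this sketch) with any other files you expect.
    """
    return [name for name in required
            if not os.path.isfile(os.path.join(model_path, name))]


if __name__ == "__main__":
    # A temporary directory stands in for a path such as outputs/qm9_p.
    with tempfile.TemporaryDirectory() as model_path:
        print(missing_checkpoint_files(model_path))
        # Create an empty placeholder file and check again.
        open(os.path.join(model_path, "args.pickle"), "wb").close()
        print(missing_checkpoint_files(model_path))
```

If the list is not empty, the path given to the script probably does not point at the directory where the downloaded or trained checkpoint was actually saved.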

When will the code be made public

Hi, Minkai. Great work, when will the code for the work be made public?

Other output format

Hey Minkai,

Thanks for sharing your promising work. Is there a way to convert the output to some other format like pdbqt/pdb, Chem.Mol, or a SMILES/SELFIES object? (I can only find the .txt files, which only contain the positions of atoms.)

Thanks,
Lennart Jaretzki
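
One possible route, sketched under assumptions, is to first write the generated atoms in the standard XYZ format (atom count, comment line, then one "symbol x y z" line per atom), which many chemistry tools can read. The function name to_xyz is hypothetical, and the sketch assumes the element symbols and coordinates have already been parsed from the output files; the layout of those .txt files is not assumed here.

```python
def to_xyz(symbols, positions, comment=""):
    """Format atoms as an XYZ-file string.

    symbols:   list of element symbols, e.g. ["O", "H", "H"]
    positions: list of (x, y, z) tuples, one per atom
    """
    if len(symbols) != len(positions):
        raise ValueError("need exactly one position per atom")
    # Line 1: number of atoms; line 2: free-form comment.
    lines = [str(len(symbols)), comment]
    for sym, (x, y, z) in zip(symbols, positions):
        lines.append(f"{sym} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative coordinates only, not model output.
    text = to_xyz(["O", "H"], [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0)], "example")
    print(text)
```

An XYZ file carries no bond information, so tools such as Open Babel or RDKit would still need to perceive bonds from the geometry before exporting PDB or SMILES, and that step can fail for distorted structures.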

z_h data formatting issue in EnLatentDiffusion model

Hi Minkai!

Thank you for sharing the code! Just one quick question regarding the format of h data throughout the training and sampling process.

At first, h is defined as {'categorical': one_hot, 'integer': charges} and the data is concatenated with categorical at the front of integer. However, at line 1310 of the EnLatentDiffusion model, z_h is formatted as z_h = {'categorical': torch.zeros(0).to(z_h), 'integer': z_h}, meaning that the charges part is placed before the categorical part.

Then, take sampling for instance: here z0[:, :, -1:] is used as charges, meaning z0 has a format different from that of z_h in the diffusion model.

Should z_h = {'categorical': torch.zeros(0).to(z_h), 'integer': z_h} be changed to z_h = {'categorical': z_h, 'integer': torch.zeros(0).to(z_h)} instead?

Thanks!
Tianyi

Autoencoder is identity function on atom coordinates? Equivalence to EDM

Hi Minkai,

Thank you for sharing this work! When I analyzed the sampling results of GeoLDM, I found that the latent variable z_x is almost equal to the decoded atom positions. Below are molecules I reconstructed with decoded atom pos and atom type (left) and with z_x and decoded atom type (right). They are almost the same.

[Figure: two panels labeled "`z_x` + recon atom type" and "recon atom pos + recon atom type"]

A further analysis of the reconstruction results of the autoencoder in GeoLDM indicates that both the encoder and the decoder are almost identity functions on atom coordinates. If so, can I consider GeoLDM to be equivalent to 3D-space diffusion (i.e. EDM), since the number of latent variables equals the number of atoms and both encoder and decoder are identity functions on atom coordinates, except that there is an autoencoder part on atom types?

If this is correct, I'm also wondering how you trained the autoencoder in your published version. I can understand that training with the reconstruction loss only will lead to identity functions, but you mentioned in the repo that the encoder remains untrained. If so, why is the encoder an identity function rather than a random mapping?

Thanks!

z_xh <- mu + epsilon * sigma_0. How is the sigma_0 (0.0032) selected?

Hello, Minkai:
I found that in the VAE encoder you implemented the reparameterization step z_xh <- mu + epsilon * sigma_0 with sigma_0 = 0.0032. How should a proper sigma_0 be selected? Are there any special tricks?
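
For readers following along, the sampling step quoted above can be sketched in plain Python as follows. This is an illustration of the formula only, not the repository's implementation, which operates on tensors; the function name reparameterize is hypothetical, and sigma_0 = 0.0032 is the value quoted in the question.

```python
import random


def reparameterize(mu, sigma_0=0.0032, rng=random):
    """Sample z = mu + epsilon * sigma_0 with epsilon ~ N(0, 1).

    mu is a flat list of floats standing in for the encoder mean.
    """
    return [m + rng.gauss(0.0, 1.0) * sigma_0 for m in mu]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    mu = [0.5, -1.2, 3.0]
    z = reparameterize(mu, rng=rng)
    # With such a small sigma_0 the sample stays very close to mu.
    print([round(zi - mi, 4) for zi, mi in zip(z, mu)])
```

With a standard deviation this small, the added noise is a tiny perturbation of the mean, which is why the choice of sigma_0 mainly controls how close to deterministic the encoder is.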

Drug data split

Hi,
I tried to run main_geom_drugs.py, but it fails with an error.

I tried to solve it, and the problem seems to be in build_geom_dataset.py. At line 101, data_list = [data_list[i] for i in perm] produces a data_list that contains subarrays of varying shapes, and at line 107 np.split needs them to have the same shape.
How can I solve this problem?

Best regards,
Zhongyu
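
A possible workaround, sketched under assumptions, is to keep the data as a Python list and split it by slicing, so that no array has to be built from elements of different shapes. In recent NumPy versions, constructing an array from sequences of different shapes raises an error unless dtype=object is requested, which is a likely cause of the failure. The function name shuffle_and_split and the split fractions are hypothetical and are not taken from the repository.

```python
import random


def shuffle_and_split(data_list, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a list of ragged items and split it into train/val/test.

    Works on plain Python lists, so items may have different shapes.
    The fractions are placeholders, not the repository's values.
    """
    n = len(data_list)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    shuffled = [data_list[i] for i in perm]

    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Ten "molecules" with 1 to 10 atoms each (ragged lengths).
    data = [[0.0] * k for k in range(1, 11)]
    train, val, test = shuffle_and_split(data)
    print(len(train), len(val), len(test))
```

Slicing keeps every item intact regardless of its shape, at the cost of returning lists rather than NumPy arrays.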

How are the molecules sampled by G-SchNet evaluated?

Hello @MinkaiXu. Where is the code used to infer bonds for the molecules sampled by G-SchNet? I always get high results when evaluating validity and uniqueness. Would it be possible to share this part of the code?

Training time for QM9 dataset

How long does it take to train the model on QM9 for 3000 epochs with a batch size of 64? On my machine, it seems that even one epoch would take 5 hours with a batch size of 64. Are the hyperparameters I am using correct?
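
To put the reported numbers in perspective, the total time implied by 5 hours per epoch over 3000 epochs can be worked out directly. The sketch uses only the figures quoted in the question.

```python
hours_per_epoch = 5   # as reported in the question
epochs = 3000         # as reported in the question

total_hours = hours_per_epoch * epochs
total_days = total_hours / 24

print(total_hours, "hours")  # 15000 hours
print(total_days, "days")    # 625.0 days
```

A total of roughly 625 days suggests that the per-epoch time on that machine is the thing to investigate (for example, whether the GPU is actually being used) rather than the epoch count.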
