GuacaMolEval

This small repository evaluates how two factors affect the value of the Fréchet ChemNet Distance (FCD) metric. The two factors are:

  • Sample size of the reference molecules (GuacaMol uses a sample size of 10,000 molecules for both generated and reference molecules).
  • Padding length of the molecules (the fcd package uses a hard-coded padding length of 350).
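To make the second factor concrete, here is a minimal sketch of how a padding length changes the fixed-size input built from a SMILES string. The helper name `pad_smiles`, the pad character, and the choice to reject over-long strings are assumptions made for illustration; they are not taken from the fcd package.

```python
def pad_smiles(smiles: str, pad_len: int = 350, pad_char: str = " ") -> str:
    """Right-pad a SMILES string to exactly pad_len characters.

    Hypothetical helper: the pad character and the error raised for
    over-long input are illustrative choices, not the fcd package's behaviour.
    """
    if len(pad_char) != 1:
        raise ValueError("pad_char must be a single character")
    if len(smiles) > pad_len:
        raise ValueError(
            f"SMILES has {len(smiles)} characters, more than pad_len={pad_len}"
        )
    return smiles.ljust(pad_len, pad_char)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
for pad_len in (100, 350):
    padded = pad_smiles(aspirin, pad_len)
    # Share of positions in the fixed-size input that carry no molecule information.
    pad_share = 1 - len(aspirin) / len(padded)
    print(f"pad_len={pad_len}: {pad_share:.0%} of positions are padding")
```

The longer the padding length, the larger the share of each input that consists of padding only, which is one plausible way for this setting to influence a metric computed from the encoded molecules.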

For this to work, we use local forks of the fcd and guacamol packages. The main reasons for the forks and the resulting changes are:

  • The fcd package used too much memory (> 50GB) in our experiments when calculating the FCD for the whole GuacaMol training / reference set. A minor change reduced the memory footprint. This might be a local issue, but it was necessary for the evaluation. A pull request was opened to discuss this issue.
  • The padding length of the molecules is hard-coded in the fcd package. We changed this directly in the code for each experiment to allow for a custom padding length.
  • The guacamol package did not allow for a custom sample size of the reference molecules. We allowed for a parametrization of the sample size. This was necessary to evaluate the effect of the sample size on the FCD metric.
  • We updated the dependencies of the guacamol package so that it works with the latest versions of pytorch, among others (no tensorflow, no keras).
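The parametrized sample size described above could look roughly like the following. This is a sketch under assumptions, not the code in the guacamol fork: the function name `sample_reference`, the `seed` parameter, and the convention that `None` means "use the whole reference set" are all hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_reference(
    molecules: Sequence[str],
    sample_size: Optional[int] = 10_000,
    seed: int = 42,
) -> List[str]:
    """Return a reproducible random subset of the reference molecules.

    sample_size=None, or a value of at least len(molecules), returns the
    whole set. The default of 10,000 mirrors the sample size GuacaMol uses.
    """
    if sample_size is None:
        return list(molecules)
    if sample_size <= 0:
        raise ValueError("sample_size must be positive or None")
    if sample_size >= len(molecules):
        return list(molecules)
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(list(molecules), sample_size)


# Toy stand-in for the reference set (not real GuacaMol data).
toy_reference = ["C" * n + "O" for n in range(1, 21)]
print(len(sample_reference(toy_reference, sample_size=5)))
print(len(sample_reference(toy_reference, sample_size=None)))
```

Fixing the seed keeps runs comparable, so that differences between experiments come from the sample size rather than from which molecules happened to be drawn.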

We calculated the FCD for a sample set before and after the changes to ensure that the changes did not affect the FCD metric value.

The main files

| File | Description |
|------|-------------|
| data/generated/generated_smiles.csv | The generated sample molecules in SMILES format |
| data/reference/guacamol_v1_train.csv | The reference molecules from GuacaMol in SMILES format |
| data/fcd.csv | The FCD values of the experiments |
| figures/fcd_values.jpg | The FCD values of the experiments as a plot |
| src/guacamoleval/eval.py | The main evaluation script |
| src/guacamoleval/experiment(s).sh | Sample calls of eval.py |
| src/guacamoleval/create_figures.ipynb | The plotting script |

Resulting plot

[Figure: FCD values of the experiments (figures/fcd_values.jpg)]

Discussion

  • The FCD value is affected by both the sample size of the reference molecules and the padding length of the molecules.
  • GuacaMol uses a sample size of 10,000. We see that choosing a larger sample size, or even the whole reference set, can lead to a lower FCD value.
  • Reducing the padding length of the molecules can lead to a lower FCD value. However, since the value of 350 is hard-coded in the fcd package, changing it could be considered "cheating" in the evaluation.
  • If we ever need to increase the padding length to allow for longer SMILES strings, we might need to re-evaluate existing FCD metrics. However, the effect of increasing the padding length seems to be minor.
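One plausible contributor to the sample-size effect is the finite-sample behaviour of Fréchet-type distances in general: the mean and covariance are estimated from samples, and the estimation error adds to the distance. The sketch below illustrates this with a one-dimensional analogue, where the squared Fréchet distance between two Gaussians reduces to (μ₁ − μ₂)² + (σ₁ − σ₂)². It is a toy model under stated assumptions, not the repository's analysis: the real FCD uses multivariate ChemNet activations, and the function names, distribution, and sample sizes here are hypothetical.

```python
import random
import statistics


def frechet_distance_1d(xs, ys):
    """Squared Fréchet distance between 1-D Gaussians fitted to xs and ys."""
    mu_x, mu_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2


def mean_distance(reference_size, generated_size=200, trials=50, seed=0):
    """Average distance between two samples drawn from the SAME N(0, 1).

    The "generated" sample size stays fixed while the "reference" sample
    size varies, loosely mirroring the experiments in this repository.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        generated = [rng.gauss(0.0, 1.0) for _ in range(generated_size)]
        reference = [rng.gauss(0.0, 1.0) for _ in range(reference_size)]
        total += frechet_distance_1d(generated, reference)
    return total / trials


for n in (20, 200, 2000):
    print(f"reference size {n:>4}: mean distance {mean_distance(n):.5f}")
```

Both samples come from the same distribution, so the true distance is zero; whatever the script reports is estimation error. In this toy setting the error shrinks as the reference sample grows, which is consistent with, but does not prove, the trend observed for the FCD above.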

Known issues

  • We only changed the guacamol code for the distribution learning benchmark; the other benchmarks might not work without changes.
  • The KL Divergence metric is not evaluated in this repository (commented out).

Meta

Contributors: hogru
