GuacaMolEval

This small repository evaluates the effects of two factors on the Fréchet ChemNet Distance metric value. The two factors are:

Sample size of the reference molecules (GuacaMol uses a sample size of 10,000 molecules for both generated and reference molecules).
Padding length of the molecules (the fcd package uses a hard-coded padding length of 350).

For this to work we use local forks of the fcd and guacamol packages. The main reasons for this and resulting changes are:

The fcd package used too much memory (> 50GB) in our experiments when calculating the FCD for the whole GuacaMol training / reference set. A minor change reduced the memory footprint. This might be a local issue, but it was necessary for the evaluation. A pull request was opened to discuss this issue.
The padding length of the molecules is hard-coded in the fcd package. We changed this directly in the code for each experiment to allow for a custom padding length.
The guacamol package did not allow for a custom sample size of the reference molecules. We allowed for a parametrization of the sample size. This was necessary to evaluate the effect of the sample size on the FCD metric.
We updated the dependencies of the guacamol package to work with the latest versions of pytorch (no tensorflow, no keras) among others.

We calculated the FCD for a sample set before and after the changes to ensure that the changes did not affect the FCD metric value.

The main files

File	Description
`data/generated/generated_smiles.csv`	The generated sample molecules in SMILES format
`data/reference/guacamol_v1_train.csv`	The reference molecules from GuacaMol in SMILES format
`data/fcd.csv`	The FCD values of the experiments
`figures/fcd_values.jpg`	The FCD values of the experiments as a plot
`src/guacamoleval/eval.py`	The main evaluation script
`src/guacamoleval/experiment(s).sh`	Sample calls of eval.py
`src/guacamoleval/create_figures.ipynb`	The plotting script

The FCD value is affected by both the sample size of the reference molecules and the padding length of the molecules.
Guacamol uses a sample size of 10,000. We see that choosing a higher sample size, or even the whole reference set, can lead to a lower FCD value.
Reducing the padding length of the molecules can lead to a lower FCD value; but since the value of 350 is hard-coded in the fcdpackage, we could consider changing this value as "cheating" in the evaluation.
If we ever need to increase the paddding length to allow for longer SMILES strings, we might need to re-evaluate existing FCD metrics. However, the effect of increasing the padding length seems to be minor.

We only changed the guacamol code for the distribution learning benchmark; the other benchmarks might not work without changes.
The KL Divergence metric is not evaluated in this repository (commented out).