Leveraging molecular structure and bioactivity with chemical language models for drug design

Description
Requirements
How to run an experiment
How to cite this work
License
Address

Description

Supporting code for the paper «Leveraging molecular structure and bioactivity with chemical language models for drug design»

Preprint version

Abstract of the paper: Generative chemical language models (CLMs) can be used for de novo molecular structure generation. These CLMs learn from the structural information of known molecules to generate new ones. In this paper, we show that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), we created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction. This second hybrid CLM was pretrained with patented molecular structures and fine-tuned with known PI3Kγ binders and non-binders by transfer learning. Several of the computer-generated molecular designs were commercially available, which allowed for fast prescreening and preliminary experimental validation. A new PI3Kγ ligand with sub-micromolar activity was identified. The results positively advocate hybrid CLMs for virtual compound screening and activity-focused molecular design in low-data situations.

Requirements

First, you need to clone the repository:

git clone git@github.com:michael1788/hybridCLMs.git

Then run the following commands, which create a conda virtual environment and install all the required packages (if you don't have conda, install it first by following the official conda installation instructions).

cd hybridCLMs/
conda env create -f environment.yml

Once the installation is done, you can activate the virtual conda environment for this project:

conda activate hybrid

Please note that you will need to activate this virtual conda environment every time you want to use this project.
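A quick way to confirm that the environment is active before launching a script is to look at the `CONDA_DEFAULT_ENV` variable, which `conda activate` sets in the shell. The sketch below is not part of the repository; the helper name `is_env_active` is made up for illustration.

```python
import os


def is_env_active(env_name="hybrid", environ=None):
    """Return True if the named conda environment is the active one.

    `conda activate` exports CONDA_DEFAULT_ENV with the environment name.
    `environ` defaults to the real process environment; passing a dict
    makes the function easy to test.
    """
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV") == env_name


if __name__ == "__main__":
    if is_env_active():
        print("conda environment 'hybrid' is active")
    else:
        print("run 'conda activate hybrid' first")
```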

How to run an experiment

You can run an example experiment based on the data used in the paper by following the procedure described in A and B.

Note: we provide the pretrained weights for both the CLM and the E-CLM. You can therefore skip steps A1, A2, and B1 if you do not want to repeat the whole experiment, or if you do not have access to at least one GPU.

A. Generate the focused chemical library

We start by generating the focused virtual chemical library in part A.

A1. Process the data to train the chemical language model (CLM):

cd experiments/
sh run_processing.sh configfiles/clm/A01_clm.ini

Note: you can skip this step, as we provide the processed pretraining data.

A2. Pretrain the CLM:

sh run_training.sh configfiles/clm/A01_clm.ini

Note: we encourage you to skip this step, and to use the available pretrained model, especially if you do not have a GPU.

A3. Process the data to fine-tune the CLM:

sh run_processing.sh configfiles/ft_clm_generation/A01_clm_ft.ini

A4. Fine-tune the pretrained CLM:

sh run_training.sh configfiles/ft_clm_generation/A01_clm_ft.ini

A5. Generate SMILES strings with the fine-tuned CLM:

sh run_generation.sh configfiles/ft_clm_generation/A01_clm_ft.ini

Note: without a GPU, this step will be slow if you sample 5,000 SMILES strings per epoch, as specified in A01_clm_ft.ini. We advise you to first try with 500 SMILES strings (to do so, update the value in A01_clm_ft.ini).
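Lowering the sample size means changing one value in the `.ini` file. The sketch below shows one way to do that with Python's standard `configparser` module. The section name `SAMPLING` and the key `n_sample` are placeholders, not the names used in A01_clm_ft.ini; open the file to find the real ones. The example works on an in-memory string, so it does not touch any file.

```python
import configparser
import io


def set_ini_value(ini_text, section, key, value):
    """Return ini_text with `key` in `[section]` set to `value`."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key capitalization as written
    parser.read_string(ini_text)
    parser.set(section, key, str(value))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical fragment: the real section and key names may differ.
example = "[SAMPLING]\nn_sample = 5000\n"
updated = set_ini_value(example, "SAMPLING", "n_sample", 500)
print(updated)
```

Note that `configparser` discards comments when it rewrites a file, so editing the value by hand is just as reasonable.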

A6. Process the generated SMILES strings to extract the new molecules that make up the focused virtual chemical library:

sh run_novo.sh configfiles/ft_clm_generation/A01_clm_ft.ini
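The steps of part A can be chained from a small driver script. The sketch below only assembles the command list from the scripts and configuration files named above. The function names `part_a_commands` and `run_all` and the `skip_pretraining` flag are illustrative, and the commands are assumed to be run from inside `experiments/`.

```python
import subprocess

PRETRAIN_CFG = "configfiles/clm/A01_clm.ini"
FINETUNE_CFG = "configfiles/ft_clm_generation/A01_clm_ft.ini"


def part_a_commands(skip_pretraining=True):
    """Return the shell commands for steps A1 to A6, in order.

    With skip_pretraining=True, steps A1 and A2 are left out, which
    relies on the processed data and pretrained weights provided
    with the repository.
    """
    commands = []
    if not skip_pretraining:
        commands.append(["sh", "run_processing.sh", PRETRAIN_CFG])  # A1
        commands.append(["sh", "run_training.sh", PRETRAIN_CFG])    # A2
    commands.append(["sh", "run_processing.sh", FINETUNE_CFG])      # A3
    commands.append(["sh", "run_training.sh", FINETUNE_CFG])        # A4
    commands.append(["sh", "run_generation.sh", FINETUNE_CFG])      # A5
    commands.append(["sh", "run_novo.sh", FINETUNE_CFG])            # A6
    return commands


def run_all(commands, cwd="."):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print the plan instead of executing it.
    for cmd in part_a_commands():
        print(" ".join(cmd))
```

Calling `run_all(part_a_commands(), cwd="experiments")` from the repository root would execute the four fine-tuning steps one after another.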

B. Refine the focused chemical library

In part B, we refine the focused virtual chemical library by leveraging the bioactivity data of the fine-tuning set.

B1. Pretrain the E-CLM:

sh run_training.sh configfiles/eclm/A01_eclm.ini

Note: we encourage you to skip this step, and to use the available pretrained model, especially if you do not have a GPU.

Next, we fine-tune the pretrained E-CLM to do the ordinal classification. We start with a cross-validation to find a suitable set of hyperparameters (e.g. the number of fine-tuning epochs).

B2. Run a cross-validation experiment:

sh run_experiment.sh configfiles/ft_eclm/A01_cv.ini

B3. Run the analysis of the cross-validation experiment:

sh run_analysis.sh configfiles/ft_eclm/A01_cv.ini

You will find a plot of your results, together with a .csv version of them, in hybridCLMs/experiments/outputs/ft_eclm/A01_cv/analysis/.

Note: you can explore hyperparameters by running commands B2 and B3 on other configuration files, where you can change the hyperparameters you want to explore (e.g. the learning rate or the number of fine-tuning epochs).
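For readers unfamiliar with cross-validation, the sketch below shows a generic k-fold split in plain Python. It is not the repository's implementation: the number of folds, the shuffling, and the function name `k_fold_indices` are assumptions chosen for illustration.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is used once for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride spreads the samples evenly over the folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


for train_idx, val_idx in k_fold_indices(n_samples=10, n_folds=5):
    print(len(train_idx), len(val_idx))
```

Each hyperparameter setting is trained on the training indices and scored on the validation indices of every fold. The setting with the best average validation score is the natural candidate to carry forward to the test set.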

B4. Once you are satisfied with your cross-validation experiment(s), you can run the E-CLM once on the test set to assess the final performance of your hyperparameters. For example, for the hyperparameters we used in the cross-validation in B2 and B3:

sh run_experiment_test_set.sh configfiles/ft_eclm/A02_test.ini

B5. Run the analysis again:

sh run_analysis_test_set.sh configfiles/ft_eclm/A02_test.ini

B6. You can now train the best E-CLM, as defined by the results on the test set, on all the data. This will create an ensemble of models (with the number of models specified in the configuration file), which will be used for deployment, i.e. to make predictions on the focused virtual chemical library:

sh run_experiment_alldata.sh configfiles/ft_eclm/A03_alldata.ini

B7. Finally, we can use the deep ensemble of E-CLMs to refine the focused virtual chemical library generated in part A:

sh run_ensemble.sh configfiles/ensemble/A01_ensemble.ini

The results of the ensemble prediction are written to a .csv file in hybridCLMs/experiments/outputs/ensemble/A01_ensemble/.
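How the ensemble combines its members is determined by the repository code and is not described here. A common choice for deep ensembles is to average the class probabilities predicted by each model and take the most probable class. The sketch below illustrates that idea only: the numbers are made up, the number of classes is arbitrary, and `ensemble_predict` is a hypothetical helper.

```python
def ensemble_predict(member_probs):
    """Average class probabilities over models and pick the top class.

    member_probs[m][c] is the probability that model m assigns to class c
    for one molecule. Returns (mean_probs, predicted_class).
    """
    n_models = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [
        sum(probs[c] for probs in member_probs) / n_models
        for c in range(n_classes)
    ]
    predicted_class = max(range(n_classes), key=lambda c: mean_probs[c])
    return mean_probs, predicted_class


# Made-up outputs of three models for one molecule and three classes.
member_probs = [
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
mean_probs, predicted_class = ensemble_predict(member_probs)
print(mean_probs, predicted_class)
```

Averaging also gives a rough sense of agreement: when the members disagree, the averaged probabilities are flatter than those of any single confident model.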

How to cite this work

@article{Moret2021,
  title={Leveraging molecular structure and bioactivity with chemical language models for drug design},
  author={Moret, Michael and Grisoni, Francesca and Brunner, Cyrill and Schneider, Gisbert},
  journal={Preprint at https://chemrxiv.org/engage/chemrxiv/article-details/615580ced1fc334326f9356e},
  year={2021},
}

License

MIT License

Address

MODLAB
ETH Zurich
Inst. of Pharm. Sciences
HCI H 413
Vladimir-Prelog-Weg 4
CH-8093 Zurich

how to use GPU

Hey @michael1788,

Thanks for such a nice model for generating molecules! I tried to run the experiment but found it very slow. The training step prints an error (see below) which seems to suggest that I don't have a GPU, but I do have one. How can I set things up so that the GPU is used to accelerate the generation process? Otherwise a single experiment takes a few days. Can you suggest something? Thanks!

I cloned and installed hybridCLMs according to your instructions. One thing I noticed is that the installed tensorflow package does not seem to have GPU support. Is that the case, or does tensorflow support GPUs by default?

tensorboard               2.2.2                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tensorflow                2.2.0                    pypi_0    pypi
tensorflow-estimator      2.2.0                    pypi_0    pypi

Output message when I run the training step

$ sh run_training.sh configfiles/ft_clm_generation/A01_clm_ft.ini
Using TensorFlow backend.

START TRAINING

Batch_size used: 4
Data path : ../data/fine_tuning_generation/1_90_x0/
2023-10-24 10:30:19.381754: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-10-24 10:30:19.447307: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-10-24 10:30:19.447328: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (login1): /proc/driver/nvidia/version does not exist
2023-10-24 10:30:19.447478: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2023-10-24 10:30:19.465903: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300000000 Hz
2023-10-24 10:30:19.468411: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1552b0000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-24 10:30:19.468428: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Weights loaded: ../pretrained_models/CLM.h5
Model: "sequential_1"
.....
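One way to narrow this down is to ask TensorFlow which devices it can see, and to check for the NVIDIA driver file named in the log. The sketch below assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available, and still runs when TensorFlow is not installed. The helper name `gpu_report` is hypothetical.

```python
import os


def gpu_report():
    """Collect basic facts about GPU visibility on the current host."""
    report = {
        # The log above reports that this file does not exist.
        "nvidia_driver_loaded": os.path.exists("/proc/driver/nvidia/version"),
        # This variable can restrict which GPUs a process is allowed to see.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "tensorflow_gpus": None,
    }
    try:
        import tensorflow as tf

        report["tensorflow_gpus"] = [
            device.name for device in tf.config.list_physical_devices("GPU")
        ]
    except ImportError:
        pass  # TensorFlow is not installed in this interpreter.
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

The host name in the log, `login1`, suggests the script may have been started on a cluster login node. If so, running the same check inside a job on a GPU node would show whether the GPU is visible there.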
