Leveraging molecular structure and bioactivity with chemical language models for drug design

Table of Contents

  1. Description
  2. Requirements
  3. How to run an experiment
  4. How to cite this work
  5. License
  6. Address

Description

Supporting code for the paper «Leveraging molecular structure and bioactivity with chemical language models for drug design»

Preprint version

Abstract of the paper: Generative chemical language models (CLMs) can be used for de novo molecular structure generation. These CLMs learn from the structural information of known molecules to generate new ones. In this paper, we show that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), we created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction. This second hybrid CLM was pretrained with patented molecular structures and fine-tuned with known PI3Kγ binders and non-binders by transfer learning. Several of the computer-generated molecular designs were commercially available, which allowed for fast prescreening and preliminary experimental validation. A new PI3Kγ ligand with sub-micromolar activity was identified. The results positively advocate hybrid CLMs for virtual compound screening and activity-focused molecular design in low-data situations.

Requirements

First, you need to clone the repository:

git clone git@github.com:michael1788/hybridCLMs.git

Then, run the following commands, which create a conda virtual environment and install all the needed packages (if you don't have conda, install it first by following the official conda installation instructions).

cd hybridCLMs/
conda env create -f environment.yml

Once the installation is done, you can activate the virtual conda environment for this project:

conda activate hybrid

Please note that you will need to activate this virtual conda environment every time you want to use this project.
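As a quick sanity check before launching any script, the active environment can be inspected from Python. This is a minimal sketch, not part of the repository: it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets, and the helper names `active_conda_env` and `is_project_env_active` are hypothetical.

```python
import os

# Environment name taken from the `conda activate hybrid` command above.
EXPECTED_ENV = "hybrid"


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if there is none."""
    env = os.environ if environ is None else environ
    # `conda activate` exports CONDA_DEFAULT_ENV with the environment name.
    return env.get("CONDA_DEFAULT_ENV")


def is_project_env_active(environ=None):
    """Return True if the environment expected by this project is active."""
    return active_conda_env(environ) == EXPECTED_ENV


if __name__ == "__main__":
    if is_project_env_active():
        print("conda environment 'hybrid' is active")
    else:
        print("run `conda activate hybrid` first")
```

Passing a dictionary instead of reading `os.environ` makes the check easy to try without changing the shell.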

How to run an experiment

You can run an example experiment based on the data used in the paper by following the procedure described in A and B.

Note: we provide the pretrained weights for both the CLM and the E-CLM. You can therefore skip steps A1, A2, and B1 if you do not want to repeat the whole experiment, or if you do not have access to at least one GPU.
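To decide whether to skip the pretraining steps, it can help to check whether TensorFlow actually sees a GPU on the machine where the scripts will run. The sketch below is an illustration rather than repository code: it assumes a TensorFlow 2.x installation (where `tf.config.list_physical_devices` is available), and the helper names `visible_gpus` and `describe` are hypothetical.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see.

    Returns None if TensorFlow cannot be imported at all.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


def describe(gpus):
    """Turn the result of visible_gpus() into a one-line message."""
    if gpus is None:
        return "TensorFlow is not importable; is the conda environment active?"
    if not gpus:
        return "No GPU visible to TensorFlow; consider using the pretrained weights."
    return "GPUs visible to TensorFlow: " + ", ".join(gpus)


if __name__ == "__main__":
    print(describe(visible_gpus()))
```

An empty list means that training will fall back to the CPU, in which case using the provided pretrained weights is the faster route.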

A. Generate the focused chemical library

We start by generating the focused virtual chemical library in part A.

A1. Process the data to train the chemical language model (CLM):

cd experiments/
sh run_processing.sh configfiles/clm/A01_clm.ini

Note: you can skip this step, as we provide the processed pretraining data.

A2. Pretrain the CLM:

sh run_training.sh configfiles/clm/A01_clm.ini

Note: we encourage you to skip this step, and to use the available pretrained model, especially if you do not have a GPU.

A3. Process the data to fine-tune the CLM:

sh run_processing.sh configfiles/ft_clm_generation/A01_clm_ft.ini

A4. Fine-tune the pretrained CLM:

sh run_training.sh configfiles/ft_clm_generation/A01_clm_ft.ini

A5. Generate SMILES strings with the fine-tuned CLM:

sh run_generation.sh configfiles/ft_clm_generation/A01_clm_ft.ini

Note: without a GPU, this step will be slow if you sample 5,000 SMILES strings per epoch, as specified in A01_clm_ft.ini. We advise you to first try with 500 SMILES strings (to do so, update the value in A01_clm_ft.ini).
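The value can be changed in a text editor, or programmatically with Python's standard `configparser` module. The sketch below only illustrates the idea: the section name `SAMPLING` and the option name `n_sample` are hypothetical placeholders, so check A01_clm_ft.ini for the actual names before adapting it.

```python
import configparser
import io


def set_ini_option(ini_text, section, option, value):
    """Return a copy of ini_text in which [section] option is set to value."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(ini_text)
    if not parser.has_section(section):
        raise KeyError(f"no section [{section}] in the configuration")
    parser.set(section, option, str(value))
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical section and option names, for illustration only.
    example = "[SAMPLING]\nn_sample = 5000\n"
    print(set_ini_option(example, "SAMPLING", "n_sample", 500))
```

Note that `configparser` drops comments when it writes a file back, so editing a copy of the configuration file is the safer choice.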

A6. Process the generated SMILES strings to get the new molecules to constitute the focused virtual chemical library:

sh run_novo.sh configfiles/ft_clm_generation/A01_clm_ft.ini
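This README does not describe what run_novo.sh does internally. As a rough illustration of what "new molecules" can mean, the sketch below keeps generated SMILES strings that are unique and absent from a set of known molecules. It assumes that the strings were already validated and canonicalized with a cheminformatics toolkit, since plain string comparison cannot tell that two different SMILES strings encode the same molecule. The function name `novel_molecules` is hypothetical.

```python
def novel_molecules(generated, known):
    """Keep generated SMILES that are unique and not in `known`.

    Assumes all strings are canonical SMILES. The first-seen order is preserved.
    """
    seen = set(known)
    kept = []
    for smiles in generated:
        smiles = smiles.strip()
        if not smiles or smiles in seen:
            continue  # empty line, duplicate, or already known molecule
        seen.add(smiles)
        kept.append(smiles)
    return kept


if __name__ == "__main__":
    sampled = ["CCO", "CCO", "c1ccccc1", "CCN"]
    training_set = ["c1ccccc1"]
    print(novel_molecules(sampled, training_set))
```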

B. Refine the focused chemical library

In part B, we refine the focused virtual chemical library by leveraging the bioactivity data of the fine-tuning set.

B1. Pretrain the E-CLM:

sh run_training.sh configfiles/eclm/A01_eclm.ini

Note: we encourage you to skip this step, and to use the available pretrained model, especially if you do not have a GPU.

Next, we fine-tune the pretrained E-CLM to do the ordinal classification. We start with a cross-validation to find a suitable set of hyperparameters (e.g. the number of fine-tuning epochs).

B2. Run a cross-validation experiment:

sh run_experiment.sh configfiles/ft_eclm/A01_cv.ini

B3. Run the analysis of the cross-validation experiment:

sh run_analysis.sh configfiles/ft_eclm/A01_cv.ini

You will find in hybridCLMs/experiments/outputs/ft_eclm/A01_cv/analysis/ a plot with your results, as well as a .csv file version of it.

Note: you can explore hyperparameters by running commands B2 and B3 on other configuration files, where you can change the hyperparameters you want to explore (e.g. the learning rate or the number of fine-tuning epochs).
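For readers unfamiliar with cross-validation, the idea behind B2 can be sketched in a few lines: the data is split into k folds, and each fold serves once as the validation set while the model is trained on the others. This is a generic illustration with a hypothetical function name `k_fold_indices`; the splitting strategy actually used by run_experiment.sh may differ.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, validation_indices) for each of the n_folds folds."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Deal the shuffled indices into n_folds groups of near-equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(validation)


if __name__ == "__main__":
    for train, validation in k_fold_indices(n_samples=10, n_folds=5):
        print("train:", train, "validation:", validation)
```

Each hyperparameter setting is then scored by its average validation performance over the folds, which is what makes the comparison between configuration files meaningful.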

B4. Once you are satisfied with your cross-validation experiment(s), you can run the E-CLM once on the test set to assess the final performance of your hyperparameters. For example, for the hyperparameters we used in the cross-validation in B2 and B3:

sh run_experiment_test_set.sh configfiles/ft_eclm/A02_test.ini

B5. Run the analysis again:

sh run_analysis_test_set.sh configfiles/ft_eclm/A02_test.ini

B6. You can now train the best E-CLM, as defined by the results on the test set, on all the data. This creates an ensemble of models (the number of models is specified in the configuration file), which is used for deployment, i.e. to make predictions on the focused virtual chemical library:

sh run_experiment_alldata.sh configfiles/ft_eclm/A03_alldata.ini

B7. Finally, we can use the deep ensemble of E-CLMs to refine the focused virtual chemical library generated in part A:

sh run_ensemble.sh configfiles/ensemble/A01_ensemble.ini

The results of the ensemble prediction are written to a .csv file in hybridCLMs/experiments/outputs/ensemble/A01_ensemble/.
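How the ensemble combines its members is not stated in this README. A common way to aggregate a deep ensemble is to average the class probabilities predicted by each model and pick the most probable class. The sketch below shows that approach under this assumption; the function names `ensemble_average` and `predicted_class` are hypothetical.

```python
def ensemble_average(per_model_probs):
    """Average class probabilities over models for one molecule.

    per_model_probs[m][c] is the probability of class c predicted by model m.
    """
    if not per_model_probs:
        raise ValueError("need at least one model prediction")
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    if any(len(probs) != n_classes for probs in per_model_probs):
        raise ValueError("all models must predict the same number of classes")
    return [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]


def predicted_class(mean_probs):
    """Return the index of the class with the highest averaged probability."""
    return max(range(len(mean_probs)), key=mean_probs.__getitem__)


if __name__ == "__main__":
    # Two hypothetical models scoring one molecule over two classes.
    mean = ensemble_average([[0.25, 0.75], [0.5, 0.5]])
    print(mean, predicted_class(mean))
```

Averaging over several independently trained models tends to smooth out the errors of any single model, which is the usual motivation for deploying an ensemble rather than one network.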

How to cite this work

@article{Moret2021,
  title={Leveraging molecular structure and bioactivity with chemical language models for drug design},
  author={Moret, Michael and Grisoni, Francesca and Brunner, Cyrill and Schneider, Gisbert},
  journal={Preprint at https://chemrxiv.org/engage/chemrxiv/article-details/615580ced1fc334326f9356e},
  year={2021},
}

License

MIT License

Address

MODLAB
ETH Zurich
Inst. of Pharm. Sciences
HCI H 413
Vladimir-Prelog-Weg 4
CH-8093 Zurich

hybridclms's Issues

how to use GPU

Hey @michael1788,

Thanks for such a nice model for generating molecules! I tried to run the experiment but found it very slow. The training step gives me an error (see below) that seems to suggest I don't have a GPU, but I do have one. How can I set things up so that I can use my GPU to accelerate the generation process? Otherwise, I have to spend a few days running one experiment. Can you suggest something? Thanks!

I cloned and installed hybridCLMs according to your instructions. One thing I noticed is that the installed tensorflow package does not seem to have GPU support. Is that the case, or does tensorflow support GPUs by default?

tensorboard               2.2.2                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tensorflow                2.2.0                    pypi_0    pypi
tensorflow-estimator      2.2.0                    pypi_0    pypi

Output message when I run the training step

$ sh run_training.sh configfiles/ft_clm_generation/A01_clm_ft.ini
Using TensorFlow backend.

START TRAINING

Batch_size used: 4
Data path : ../data/fine_tuning_generation/1_90_x0/
2023-10-24 10:30:19.381754: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-10-24 10:30:19.447307: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-10-24 10:30:19.447328: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (login1): /proc/driver/nvidia/version does not exist
2023-10-24 10:30:19.447478: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2023-10-24 10:30:19.465903: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300000000 Hz
2023-10-24 10:30:19.468411: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1552b0000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-24 10:30:19.468428: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Weights loaded: ../pretrained_models/CLM.h5
Model: "sequential_1"
.....
