Combining generative artificial intelligence and on-chip synthesis for de novo drug design

This repository contains the data and code associated with the study "Combining Generative Artificial Intelligence and On-Chip Synthesis for De Novo Drug Design", in which a Long Short-Term Memory (LSTM) network was combined with a microfluidics platform to design novel bioactive compounds (see Grisoni, Huisman et al. 2020). The material provided here can be used to reproduce the results of the study.

Getting started
Data
Virtual reaction filter
Generative deep learning code
How to cite

Getting started

To access the content of this repository on your local machine, you can clone it as follows:

git clone git@github.com:abuttonCH/ai-on-a-chip.git

Data

This repository contains several datasets linked to our publication. They are located in the data folder, which contains the following files:

LSTM_pretraining_data.zip: Contains the SMILES of the compounds used for pretraining the LSTM model. These SMILES come from a library of commercial compounds; only the compounds retained by our virtual reaction filter (see below) are included.
decomposition_reactions.txt: Reaction SMARTS used to convert the molecules into their corresponding reactants.
LSTM_FLOW-MOL_DB_DATA.npy: Molecular database of commercially available molecules. Each entry stores the molecule number, the molecular SMILES, and the molecular weight. The numpy array object is too large to upload to git. See the virtual reaction filter section to understand how to use this file.
mol_db_data.csv: Molecular database stored as a csv file. This file needs to be converted to the corresponding numpy array object in order to work with decompose.py and retrieve_bb.py (see below).

Virtual reaction filter

Here you will find instructions to apply the virtual reaction filter, as explained in the paper. The retro-synthesis is performed in two steps. In the first step, a series of reactions is applied to each product molecule in order to decompose it into its corresponding reactants (decompose.py). The reaction used and the reactant molecules are stored in a text file. In the second step, once all of the products have been decomposed, the reactant molecules are compared against a database of known, commercially available molecules (retrieve_bb.py). If all of the reactant molecules for a given reaction can be found within the database, the product molecule, the reaction, and the retrieved reactant molecules are stored in the output file.

Installation

This code requires rdkit version 2018.09.1 to be installed. The best way to do this is to create a conda environment.

conda create -n my_retrieve_env -c rdkit rdkit=2018.09.1

Once you have created the conda environment, you need to activate it.

conda activate my_retrieve_env

Usage

The code for the decomposition and building block retrieval is located in the code folder.

The folder contains the following files:

create_npy_db.py (generates the mol db array object)
decompose.py (converts molecules into reactants)
retrieve_bb.py (returns reactants found in mol db)
reaction_library.py (defines the reaction object)
reaction_class_auto.py (performs the reaction)

Below, you will find a step-by-step explanation of how to use the code.

Create the Molecular DB

Before running the method, one first needs to create the mol db numpy array object:

python create_npy_db.py --input ../data/mol_db_data.csv --output ../data/LSTM_FLOW-MOL_DB_DATA.npy

--input: Molecules saved in a csv file with the columns Number of Molecules, SMILES, Molecular Weight. Type: csv file.

--output: Molecules converted to a numpy array. Type: numpy array.
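
For orientation, the conversion could look roughly like the sketch below. This is not the code of create_npy_db.py. It assumes that the csv file starts with a header row followed by the three columns listed above, and the function name csv_to_npy and the has_header parameter are hypothetical.

```python
import csv

import numpy as np


def csv_to_npy(csv_path, npy_path, has_header=True):
    """Read (number, SMILES, molecular weight) rows and save them as a .npy file.

    Hypothetical helper: the column order and the header row are assumptions.
    """
    rows = []
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the assumed header row
        for record in reader:
            if len(record) < 3:
                continue  # ignore blank or incomplete lines
            number, smiles, weight = record[:3]
            # The molecule number is kept as text; the weight is parsed as a float.
            rows.append((number, smiles, float(weight)))
    # An object array keeps the mixed str/float columns unchanged.
    database = np.empty((len(rows), 3), dtype=object)
    for i, row in enumerate(rows):
        database[i] = row
    np.save(npy_path, database)
    return database
```

An object array saved this way has to be read back with np.load(path, allow_pickle=True). Whether the scripts in this repository expect exactly this layout is not stated here, so the script provided in the code folder should be used for real runs.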

Conversion to Reactants

decompose.py applies each of the reactions specified in decomposition_reactions.txt to the input molecules. Each set of reactants generated from a given molecule and reaction is written to the output file.

python decompose.py --mol molfile.txt --reaction decomposition_reactions.txt --out decomposition_output.txt --limit molecule_limit

--mol: The molecules to be decomposed. Each line of the input file should contain a single SMILES string. Type: text file.

--reaction: The decomposition reactions written in the SMARTS format (decomposition reaction | SMARTS | number of conserved rings). Type: text file.

--out: The file path to output the results to. Type: text file.

--limit: The number of molecules to process. If not specified, decompose.py will process the entire file. Type: int.
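
To make the two input formats concrete, the sketch below reads a molecule file with an optional limit and splits one line of a reaction file into its three fields. It is an illustration only, not the parsing code of decompose.py. It assumes that the three reaction fields are separated by a literal "|" character, and the function names read_smiles and parse_reaction_line are hypothetical.

```python
from itertools import islice


def read_smiles(path, limit=None):
    """Return one SMILES string per non-empty line, up to `limit` molecules."""
    with open(path) as handle:
        stripped = (line.strip() for line in handle)
        non_empty = (line for line in stripped if line)
        # islice with limit=None reads the whole file.
        return list(islice(non_empty, limit))


def parse_reaction_line(line):
    """Split 'name | SMARTS | number of conserved rings' into typed fields.

    Assumption: the fields are separated by a literal '|' character.
    """
    parts = [field.strip() for field in line.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(parts)}")
    name, smarts, rings = parts
    return name, smarts, int(rings)
```

Calling read_smiles("molfile.txt", limit=100) would mirror the behaviour described for --limit, while leaving limit unset reads every molecule in the file.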

Building Block Retrieval

retrieve_bb.py searches the molecular database for matches between the decomposed reactant molecules and the molecules in the database. If all of the reactants for a given input molecule are found, then the result is written to the output file.

python retrieve_bb.py --decomp decomposition_output.txt --mol_db molecular_database.npy --out retrieve_output.txt

--decomp: The decomposed products generated by decompose.py. Type: text file.

--mol_db: Database of commercially available molecules stored as a numpy array object. Type: numpy array.

--out: The file path specifying where to write the outputs to. Type: string.
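
The retrieval criterion described above (keep a decomposition only if every reactant is in the database) can be sketched as follows. This is a simplified illustration, not the implementation in retrieve_bb.py. It assumes that reactants and database entries can be compared as identical SMILES strings, for example because both were canonicalised in the same way. The function name and the tuple layout are hypothetical.

```python
def retrieve_building_blocks(decompositions, database_smiles):
    """Keep only decompositions whose reactants are all in the database.

    decompositions: iterable of (product, reaction_name, reactants) tuples.
    database_smiles: iterable of SMILES strings of purchasable molecules.
    Assumption: matching is done by exact string comparison of SMILES.
    """
    available = set(database_smiles)  # a set gives fast membership tests
    retrieved = []
    for product, reaction_name, reactants in decompositions:
        # A decomposition without reactants is never counted as a match.
        if reactants and all(r in available for r in reactants):
            retrieved.append((product, reaction_name, tuple(reactants)))
    return retrieved
```

Building the set once keeps each lookup cheap, which matters when many decompositions are checked against a large database.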

Example

The commands below show how to run create_npy_db.py, decompose.py and retrieve_bb.py from the root of the repository. All files have been provided except for the molecular database file (LSTM_FLOW-MOL_DB_DATA.npy), which was too large. LSTM_FLOW-MOL_DB_DATA.npy has been provided in the supplementary information of "Combining generative artificial intelligence and on-chip synthesis for de novo drug design".

python code/create_npy_db.py --input data/mol_db_data.csv --output data/LSTM_FLOW-MOL_DB_DATA.npy

python code/decompose.py --mol data/data_val.txt --reaction data/decomposition_reactions.txt --out output/test_decomp.txt --limit 100

python code/retrieve_bb.py --decomp output/test_decomp.txt --mol_db data/LSTM_FLOW-MOL_DB_DATA.npy --out output/test_retrieve.txt
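
If you prefer to drive the three steps from Python, the commands above can be chained as in the sketch below. It only wraps the command lines shown in this example and uses the same flags and paths. It assumes that it is run from the root of the repository with the conda environment active and that the output folder already exists. The function names are hypothetical.

```python
import subprocess
import sys


def pipeline_commands(mol_file, limit=None):
    """Build the three command lines of the example, in execution order."""
    db_csv = "data/mol_db_data.csv"
    db_npy = "data/LSTM_FLOW-MOL_DB_DATA.npy"
    decomp_out = "output/test_decomp.txt"
    decompose = [sys.executable, "code/decompose.py", "--mol", mol_file,
                 "--reaction", "data/decomposition_reactions.txt",
                 "--out", decomp_out]
    if limit is not None:
        decompose += ["--limit", str(limit)]
    return [
        [sys.executable, "code/create_npy_db.py",
         "--input", db_csv, "--output", db_npy],
        decompose,
        [sys.executable, "code/retrieve_bb.py", "--decomp", decomp_out,
         "--mol_db", db_npy, "--out", "output/test_retrieve.txt"],
    ]


def run_pipeline(mol_file, limit=None):
    """Run each step in order; check=True raises on the first failure."""
    for command in pipeline_commands(mol_file, limit):
        subprocess.run(command, check=True)
```

For instance, run_pipeline("data/data_val.txt", limit=100) would issue the same three commands as listed above.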

Generative deep learning code

The code used for molecule generation can be found in the dedicated repository: ETHmodlab/virtual_libraries. To repeat our fine-tuning experiment and generate molecules, you can follow the instructions there and:

Replace the parameters file with the one provided here
Modify the path in the new parameters file to point toward the right data (provided here) and to the right pretrained CLM (provided here)

How to cite

If you use any data or scripts associated with this repository, please cite:

@article{grisoni2020,
  title         = {Combining generative artificial intelligence and on-chip synthesis for de novo drug design},
  author        = {Grisoni, Francesca and Huisman, Berend and Button, Alex and Moret, Michael and Atz, Kenneth and Merk, Daniel and Schneider, Gisbert},
  journal       = {Science Advances},
  volume        = {7},
  pages         = {eabg3338}, 
  year          = {2021},
  doi           = {10.1126/sciadv.abg3338},
  publisher     = {American Association for the Advancement of Science}
}
