Multi-target de novo molecular generator conditioned on AlphaFold's latent protein embeddings.

PCMol

License: MIT

A multi-target model for de novo molecule generation. By using the internal protein representations of the AlphaFold[1] model, a single SMILES-based transformer can generate relevant molecules for thousands of protein targets (embeddings are available for 4,331 proteins).

The model was trained using bioactivity data from the Papyrus[2] dataset (661,613 unique protein-ligand pairs in total, 6,249,253 after augmentation).


Paper & Authors

The preprint is available on ChemRxiv:

https://chemrxiv.org/engage/chemrxiv/article-details/65d47632e9ebbb4db9c63988


Installation

1. Setup script (recommended)

The setup script will install the required dependencies and download the pretrained model.

git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
chmod +x setup.sh
bash setup.sh

2. Conda (alternative)

The conda route requires the user to download the pretrained model manually (link below).

# Setting up a fresh conda environment
git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
conda env create -f environment.yml && conda activate pcmol
python -m pip install -e .

Pretrained model

When not using the setup script, download the pretrained model manually from the project's download link (a mirror is also provided), then place it in the .../pcmol/data/models folder.


Generating molecules for a particular target

1. Using a script (conda route)

# Run the model on a single target using Accession ID (generates 10 SMILES strings)
conda activate pcmol
python pcmol/generate.py --target P29275

# If GPU is not available
python pcmol/generate.py --target P29275 --device cpu

If AlphaFold2 embeddings are available for the target, they are downloaded automatically and used as input to the model. The generated molecules are saved in the data/results folder.
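To run the script for several targets in one go, the command line above can be assembled programmatically. The following is a minimal sketch, not part of PCMol: the helper names build_command and run_targets are hypothetical, only the --target and --device flags shown above are used, and it assumes the working directory is the repository root with the pcmol environment active.

```python
import subprocess
import sys


def build_command(target, device=None):
    """Build the argument list for one pcmol/generate.py call (hypothetical helper)."""
    cmd = [sys.executable, "pcmol/generate.py", "--target", target]
    if device is not None:
        # Only pass --device when explicitly requested, e.g. "cpu"
        cmd += ["--device", device]
    return cmd


def run_targets(targets, device=None):
    """Run the script once per Accession ID; raises on the first failing call."""
    for target in targets:
        subprocess.run(build_command(target, device), check=True)


# Example (assumes the repository root as working directory):
# run_targets(["P29275"], device="cpu")
```

Calling the script once per target keeps each run independent, at the cost of reloading the model every time; for larger batches, calling the generator directly from Python is likely the better fit.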

2. Calling the generator directly

To generate molecules for a particular target, the Runner class can be used directly. The targetted_generation method returns a list of SMILES strings for a target protein specified by its Accession ID.

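from pcmol import Runner

# Initialize the runner with the "XL" model
model = Runner(model="XL")
# Generate 100 SMILES strings for the target with Accession ID P29275
smiles = model.targetted_generation(target="P29275", num_mols=100)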
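Because a single model covers many targets, the same call can be repeated over a list of Accession IDs. The sketch below is an illustration under assumptions: generate_for_targets is a hypothetical helper, not part of PCMol, and it accepts any callable with the target and num_mols keyword arguments used above. Duplicates are removed by exact string comparison only, so different SMILES spellings of the same molecule are not merged.

```python
def generate_for_targets(generate, targets, num_mols=10):
    """Return {Accession ID: de-duplicated list of SMILES} (hypothetical helper).

    `generate` is any callable accepting `target` and `num_mols` keyword
    arguments and returning a list of SMILES strings.
    """
    results = {}
    for target in targets:
        smiles = generate(target=target, num_mols=num_mols)
        # dict.fromkeys drops exact duplicates while keeping first-seen order
        results[target] = list(dict.fromkeys(smiles))
    return results


# Usage with PCMol (requires the package and the pretrained model):
# from pcmol import Runner
# model = Runner(model="XL")
# results = generate_for_targets(model.targetted_generation, ["P29275"], num_mols=100)
```

Passing the generator as a callable keeps the helper independent of the model, so it can be exercised with a stub before running the real generator.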

List of supported protein targets

The model currently depends on the availability of AlphaFold2 embeddings for the target protein. The list of supported targets can be found in the data/targets.txt file.
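Before requesting molecules, it can be useful to check whether a protein is in the supported list. The following is a minimal sketch that assumes data/targets.txt contains one Accession ID per line; the file layout is an assumption, and load_supported_targets and is_supported are hypothetical helper names, not part of PCMol.

```python
from pathlib import Path


def load_supported_targets(path="data/targets.txt"):
    """Read Accession IDs from a text file, assuming one ID per line."""
    text = Path(path).read_text(encoding="utf-8")
    # Ignore blank lines and surrounding whitespace
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_supported(accession, supported):
    """Return True if the Accession ID is in the set of supported targets."""
    return accession.strip() in supported


# Example (assumes the repository root as working directory):
# supported = load_supported_targets()
# print(is_supported("P29275", supported))
```

Loading the IDs into a set makes repeated membership checks cheap when screening a long list of candidate targets.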


References

[1]: Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.

[2]: Béquignon, O. J., Bongers, B. J., Jespers, W., IJzerman, A. P., van der Water, B., & van Westen, G. J. (2023). Papyrus: a large-scale curated dataset aimed at bioactivity predictions. Journal of Cheminformatics, 15(1), 3.
