
retrotrae's Introduction

License: CC BY-NC 4.0

Retrosynthetic reaction pathway prediction through NMT of atomic environments

Ucak, U. V., Ashyrmamatov, I., Ko, J. & Lee, J. Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments. Nat Commun 13, 1186 (2022). https://doi.org/10.1038/s41467-022-28857-w

Designing efficient synthetic routes for a target molecule remains a major challenge in organic synthesis. Atom environments are ideal, stand-alone, chemically meaningful building blocks that provide a high-resolution molecular representation. Our approach mimics chemical reasoning and predicts reactant candidates by learning the changes in atom environments associated with a chemical reaction. Through careful inspection of reactant candidates, we demonstrate that atom environments are promising descriptors for studying reaction route prediction and discovery. Here we present RetroTRAE, a new single-step retrosynthesis prediction method that is free from SMILES-based translation issues. It achieves a top-1 accuracy of 58.3% on the USPTO test dataset, rising to 61.6% when highly similar analogs are included, outperforming other state-of-the-art neural machine translation-based methods. Our methodology introduces a novel scheme for using fragmental and topological descriptors as natural inputs for retrosynthetic prediction tasks.
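To illustrate what an atom environment is, here is a minimal RDKit sketch that extracts the circular environment of radius 1 around each atom. This is an illustration only, not RetroTRAE's exact tokenization (the actual AE tokens also encode ring membership and degree, e.g. `[c;R;D3]`):

```python
from rdkit import Chem

def atom_environments(smiles, radius=1):
    """Return a SMARTS string for the circular environment around each atom."""
    mol = Chem.MolFromSmiles(smiles)
    envs = []
    for atom in mol.GetAtoms():
        # Bonds within `radius` of this atom, written out as a SMARTS fragment
        bond_ids = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom.GetIdx())
        submol = Chem.PathToSubmol(mol, list(bond_ids))
        if submol.GetNumAtoms():
            envs.append(Chem.MolToSmarts(submol))
    return envs

print(atom_environments("CCO"))  # one radius-1 environment per heavy atom
```

Each heavy atom contributes one fragment, so a molecule becomes a set of local descriptors rather than a single linear string.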


Datasets

Training

We used a curated subset of Lowe’s grants dataset (USPTO-Full). Jin et al. further refined the USPTO-Full set by removing duplicates and erroneous reactions. This curated dataset contains 480K reactions. Preprocessing steps to remove reagents from reactants are described by Liu et al. and Schwaller et al.

We used Zheng's version of USPTO (Jin's dataset with agents removed and SMILES canonicalized) and carefully curated the product-reactant pairs. We generated two distinct curated datasets consisting of unimolecular and bimolecular reactions, with 100K and 314K reactions, respectively. Since retrosynthesis implies an abstract backward direction, we named the datasets after the number of reactant molecules: unimolecular and bimolecular. No reaction class information was available in this dataset, and we did not use any atom-to-atom mapping algorithm.
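The curation just described (canonicalization plus removal of duplicate and erroneous entries) can be sketched as follows; the toy `pairs` data and the exact filtering rules are illustrative assumptions, not the published pipeline:

```python
from rdkit import Chem

def canonicalize(smiles):
    """Return the canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def curate(pairs):
    """Canonicalize product/reactant SMILES, dropping invalid and duplicate pairs."""
    seen, curated = set(), []
    for product, reactants in pairs:
        p, r = canonicalize(product), canonicalize(reactants)
        if p is None or r is None:   # erroneous SMILES
            continue
        if (p, r) in seen:           # duplicate reaction after canonicalization
            continue
        seen.add((p, r))
        curated.append((p, r))
    return curated

# "C(C)O" and "CCO" canonicalize to the same string, so one pair is dropped;
# "bad" fails to parse and is dropped as erroneous.
pairs = [("C(C)O", "CC=O"), ("CCO", "CC=O"), ("bad", "CC=O")]
print(curate(pairs))
```

Canonicalization before deduplication matters: two syntactically different SMILES for the same molecule would otherwise survive as distinct reactions.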

Post-processing

*UPDATED: After downloading and extracting CID-SMILES, you can now generate the PubChem database for retrieval with the following command. The PubChem compound database (CID-SMILES.gz, 111 million compounds) is used to recover molecules from a list of AEs.

    python pubchem_gen.py --cid-smiles-path /path/to/CID-SMILES

For more details, run python pubchem_gen.py --help
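CID-SMILES is a gzipped, tab-separated file of `CID<TAB>SMILES` lines. A minimal reading sketch (an illustration of the file format, not the actual pubchem_gen.py logic) looks like:

```python
import gzip

from rdkit import Chem

def iter_cid_smiles(path):
    """Yield (cid, canonical_smiles) pairs, skipping unparsable entries."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            cid, smiles = line.rstrip("\n").split("\t", 1)
            mol = Chem.MolFromSmiles(smiles)
            if mol is not None:       # skip entries RDKit cannot parse
                yield int(cid), Chem.MolToSmiles(mol)
```

Splitting on the tab first is important: passing the CID column to the SMILES parser produces errors like `SMILES Parse Error: syntax error while parsing: 16598`.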


Code usage

Requirements

The source code has been tested on Linux. After cloning the repository, we recommend creating a new conda environment and installing the required packages described in environment.yml before use.

conda create --name RetroTRAE_env python=3.8 -y
conda activate RetroTRAE_env
conda install pytorch cudatoolkit=11.3 -c pytorch -y
conda install -c conda-forge rdkit -y
pip install sentencepiece

or

conda env create --name RetroTRAE_env --file=environment.yml

Prediction & Demo:

First, the checkpoint files should be downloaded and extracted.

Run the commands below to perform inference with the trained model.

*UPDATED: This new code will be the default in the future.

 python predict.py --smiles='COc1cc2c(c(Cl)c1OC)CCN(C)CC2c1ccccc1'

--smiles SMILES       An input sequence (default: None)
--decode {greedy,beam}
                      Decoding method for RetroTRAE (default: greedy)
--beam_size BEAM_SIZE
                      Beam size (the number of candidates for RetroTRAE) (default: 3)
--conversion {ml,db}  How to convert AEs to SMILES: 'ml' uses a machine learning model, 'db' retrieves from the PubChem database (default: ml)
--database_dir DATABASE_DIR
                      Database for retrieval of the predicted molecules (default: ./data/PubChem_AEs)
--topk TOPK           The number of candidates for the AEs-to-SMILES conversion (default: 1)
--uni_checkpoint_name UNI_CHECKPOINT_NAME
                      Checkpoint file name (default: uni_checkpoint.pth)
--bi_checkpoint_name BI_CHECKPOINT_NAME
                      Checkpoint file name (default: bi_checkpoint.pth)
--log_file LOG_FILE   A file name for saving outputs (default: None)

*Note: The old code is still available.

python src/predict.py --smiles
  • --smiles: The molecule we wish to synthesize.
  • --decode: Decoding algorithm, either 'greedy' or 'beam' (default: greedy)
  • --uni_checkpoint_name: Checkpoint file name for the unimolecular rxn model (default: uni_checkpoint.pth)
  • --bi_checkpoint_name: Checkpoint file name for the bimolecular rxn model (default: bi_checkpoint.pth)
  • --database_dir: Path containing the DB files.

Example prediction and sample output:

Results are saved to a file named after the product's InChIKey.

>> python src/predict.py --smiles='COc1cc2c(c(Cl)c1OC)CCN(C)CC2c1ccccc1' --database_dir DB_Path

unimolecular model is building...
Loading checkpoint...

bimolecular model is building...
Loading checkpoint...

greedy decoding searching method is selected.
Preprocessing input SMILES: COc1cc2c(c(Cl)c1OC)CCN(C)CC2c1ccccc1
Corresponding AEs: [c;R;D3](-[CH;R;D3])(:[c;R;D3]):[cH;R;D2] ... [c;R;D3](-[Cl;!R;D1])(:[c;R;D3]):[c;R;D3] [CH;R;D3]


Predictions are made in AEs form.
Saving the results here: results_VDCYGTBDVYWJFQ-UHFFFAOYSA-N.json

Done!
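The InChIKey in the output filename can be reproduced from the input SMILES with RDKit. This is a sketch of the naming convention; the exact logic in predict.py may differ:

```python
from rdkit import Chem

smiles = "COc1cc2c(c(Cl)c1OC)CCN(C)CC2c1ccccc1"
mol = Chem.MolFromSmiles(smiles)
inchikey = Chem.MolToInchiKey(mol)  # 27-character hashed structure identifier
filename = f"results_{inchikey}.json"
print(filename)
```

Because the InChIKey is a fixed-length hash of the structure, it gives a filesystem-safe name that is identical for any SMILES spelling of the same molecule.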

Training:

Configurations:

  1. Users can set various hyperparameters in src/parameters.py file.

  2. The command src/tokenizer_with_split.py applies the tokenization scheme and splits the data.

    python src/tokenizer_with_split.py --model_type='bi'
    • --model_type: By default, it runs on the bimolecular reaction dataset.

The files in the data directory should be prefixed with the model_type.

  • data
    • sp
      • src_sp.model, src_sp.vocab, tar_sp.model, tar_sp.vocab
    • src
      • train.txt, valid.txt, test.txt
    • trg
      • train.txt, valid.txt, test.txt
    • raw_data.src
    • raw_data.trg
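A small helper can verify that the expected files exist before training. The paths follow the layout above with the split files prefixed by model_type (matching the data/src/bi_train.txt path reported by train.py); whether the sp/ files are also prefixed is not stated, and here they are assumed not to be:

```python
import os

def check_data_layout(root="data", model_type="bi"):
    """Return the expected data files that are missing under `root`."""
    expected = [os.path.join(root, "sp", name) for name in
                ("src_sp.model", "src_sp.vocab", "tar_sp.model", "tar_sp.vocab")]
    for sub in ("src", "trg"):
        for split in ("train.txt", "valid.txt", "test.txt"):
            # e.g. data/src/bi_train.txt
            expected.append(os.path.join(root, sub, f"{model_type}_{split}"))
    return [path for path in expected if not os.path.exists(path)]

missing = check_data_layout()
print(f"{len(missing)} of 10 expected files are missing")
```

Running this before train.py turns a mid-run FileNotFoundError into an upfront checklist.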

The command below trains the model for retrosynthetic prediction.

python src/train.py --model_type='bi'
  • --model_type: 'uni' or 'bi'. (default: bi)
  • --custom_validation: Evaluates the model accuracy based on the custom metrics. (default: True)
  • --resume: Resume training for a given checkpoint. (default: False)
  • --start_epoch: Epoch number for resumed training (default: 0)
  • --checkpoint_name: Checkpoint file name. (default: None)
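Resuming from a checkpoint typically follows the generic PyTorch pattern below. This is a sketch with a toy model; the actual checkpoint dictionary keys used by train.py may differ:

```python
import os
import tempfile

import torch
import torch.nn as nn

def save_checkpoint(model, optimizer, epoch, path):
    """Bundle model/optimizer state and the last finished epoch into one file."""
    torch.save({"model_state_dict": model.state_dict(),
                "optim_state_dict": optimizer.state_dict(),
                "epoch": epoch}, path)

def load_checkpoint(model, optimizer, path):
    """Restore model/optimizer state and return the epoch to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optim_state_dict"])
    return ckpt["epoch"] + 1

# Round-trip demo with a toy model
model = nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters())
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
save_checkpoint(model, optimizer, epoch=9, path=path)
start_epoch = load_checkpoint(model, optimizer, path)
print(start_epoch)  # resumes at epoch 10
```

Saving the optimizer state alongside the weights is what makes --resume continue training smoothly rather than restarting the optimizer from scratch.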

Results

Model performance comparison without additional reaction class information, based on either the filtered MIT-full dataset or Jin's USPTO.

Model                                                 top-1  top-3  top-5  top-10
Non-Transformer
Coley et al., similarity-based, 2017                  32.8
Segler et al., rep. by Lin, Neuralsym, 2020           47.8   67.6   74.1   80.2
Dai et al., Graph Logic Network, 2019                 39.3
Liu et al., rep. by Lin, LSTM-based, 2020             46.9   61.6   66.3   70.8
Genheden et al., AiZynthFinder, ANN + MCTS, 2020      43-72
Transformer-based
Zheng et al., SCROP, 2020                             41.5
Wang et al., RetroPrime, 2021                         44.1
Tetko et al., Augmented Transformer, 2020             46.2
Lin et al., AutoSynRoute, Transformer + MCTS, 2020    54.1   71.8   76.9   81.8
RetroTRAE                                             58.3   66.1   69.4   73.1
RetroTRAE (with SM and DM)                            61.6

Cite

@article{10.1038/s41467-022-28857-w,
  year = {2022},
  title = {{Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments}},
  author = {Ucak, Umit V. and Ashyrmamatov, Islambek and Ko, Junsu and Lee, Juyong},
  journal = {Nature Communications},
  doi = {10.1038/s41467-022-28857-w},
  pmid = {35246540},
  pmcid = {PMC8897428},
  pages = {1186},
  number = {1},
  volume = {13}
}

License

CC BY-SA 4.0

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.



retrotrae's Issues

Dataset comparison - USPTO full vs USPTO MIT

Hi,

this is a very interesting approach. Well done!

I have one concern regarding the comparison with other existing methods. For evaluation, you made use of the USPTO-MIT dataset, which is commonly used for (forward) reaction prediction. I saw that the self-correcting transformer and AutoSynRoute used the same dataset. However, the other retrosynthesis algorithms GLN, AT, Retrosim and RetroPrime were trained and evaluated on a dataset double the size (USPTO-full, curated by Dai et al.).
Would you not agree that the comparison here is a bit unfair, since evaluation on a dataset twice the size is surely more difficult?

Thank you for clarifying this.

NameError: name 'fp_names' is not defined

In RetroTRAE\RetroTRAE\train.py, I got an error: NameError: name 'fp_names' is not defined. Could you please check it? Many thanks.

args.src_vocab_size = fp_vocab_sizes[args.fp]
args.trg_vocab_size = trg_vocab_sizes[args.model_type]
args.src_seq_len = fp_seq_lens[args.fp]
args.trg_seq_len = trg_seq_len

fp_vocab_sizes, trg_vocab_sizes, fp_seq_lens and trg_seq_len are not defined either.

Hello, I have a question.

Hello, I have a question.
Thank you for sharing this great paper.

I am wondering whether there is a way to visualize the JSON file produced by predict.py as shown in the paper.

Top-k calculation

Hello,
Thanks for your research. I have a question: top-k is an important metric, but how is top-k calculated when using AE-SMILES pairs? I would like to know how you calculate it, thank you very much.

file request

Amazing work, thanks for your contribution! I tried to reproduce the work by running the code; however, when I run retrotrae\src\train.py I get FileNotFoundError: [Errno 2] No such file or directory: 'data/src/bi_train.txt'. Following the clue, I found that train.txt is defined in parameters.py, but I can't find the files in this repository. Maybe I missed something; could you please help fix it? Many thanks!

Question about accuracy

I apologize for any confusion caused. In Table 3 of your paper, your model shows the highest accuracy for retrosynthesis prediction. However, in the paper by Philippe Schwaller (https://doi.org/10.1021/acscentsci.9b00576), they achieved a top-1 accuracy above 90% on a common benchmark dataset for forward reaction prediction. I understand that the difference lies in the tasks being retrosynthesis prediction and forward reaction prediction, respectively. So I wonder why there is such a large gap between retrosynthesis and forward reaction prediction? For a single-step reaction, it seems intuitive to me that swapping the source and target molecules would convert a retrosynthesis problem into a forward reaction prediction problem. Looking forward to your reply, many thanks!

atomic environment to SMILES

Hello, author. The programs can be used to obtain atomic environments from SMILES, so I wonder if there is a program that can recover SMILES from the atomic environments.

DB_generation

Hello, author. I am very interested in your paper and am trying to run the code for this paper. But now I have some problems.

  1. I downloaded the CID-SMILES file from https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/. When running the DB_generation.py file, I always get SMILES Parse Error: syntax error while parsing: 16598 and RDKit ERROR: [21:19:43] SMILES Parse Error: syntax error while parsing: 16598, but I don't know how to fix it. If the author could give me a little help, I would be very grateful.

  2. The paper compares top-1 accuracy with other methods; however, I did not find the relevant code in the GitHub codebase. Could the author provide this part of the code?

My email is [email protected]. Looking forward to your reply, thank you very much.
Best wishes!
