
Molecular Transformer for Reagents Prediction

This is the code for the preprint Reagent Prediction with a Molecular Transformer Improves Reaction Data Quality.
The repository is effectively a fork of the Molecular Transformer.

Idea:

  • Train a transformer to predict reagents for organic reactions, framing the task as SMILES-to-SMILES translation (see the sketch after this list).
  • Infer missing reagents for some reactions in the training set.
  • Train a transformer for reaction product prediction on the dataset with improved reagents.
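
A minimal sketch of this task framing (illustrative only; the actual preprocessing lives in prepare_data.py), assuming reactions in the usual reactants>reagents>product SMILES form:

```python
def make_reagent_example(rxn_smiles: str) -> tuple[str, str]:
    """Split a reaction SMILES of the form reactants>reagents>product
    into a (source, target) pair for reagent prediction: the model sees
    the reaction without reagents and learns to generate the reagents."""
    reactants, reagents, product = rxn_smiles.split(">")
    src = f"{reactants}>>{product}"  # reaction with reagents removed
    tgt = reagents                   # reagents to be predicted
    return src, tgt

# Hypothetical esterification example:
src, tgt = make_reagent_example("CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC")
print(src)  # CC(=O)O.OCC>>CC(=O)OCC
print(tgt)  # OS(=O)(=O)O
```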

Code

Old files:

onmt is a directory with the OpenNMT code.
preprocess.py, train.py, translate.py, and score_predictions.py are the scripts used in the original Molecular Transformer code.

New files:

The src folder contains the preprocessing code for reagent prediction.
prepare_data.py is the main script; it preprocesses the USPTO data for reagent prediction with the Molecular Transformer.
environment.yml is the conda environment specification.

Data

The entire USPTO data used to assemble the training set for reagent prediction can be downloaded here.

The training set for reagent prediction was obtained from it using the prepare_data.py script; it does not overlap with the USPTO MIT test set (see the check sketched at the end of this section). The tokenized data can be downloaded here.

The tokenized data for product prediction is stored here. For the description of these data, please refer to the README of the original Molecular Transformer.

The data for product prediction with altered reagents can be downloaded here.
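
The non-overlap claim above can be spot-checked with a short script. A minimal sketch with hypothetical file paths (adapt them to the downloaded data, and note that both files must store whole reactions in a comparable format); it compares detokenized strings verbatim, whereas a stricter check would canonicalize each molecule with RDKit first:

```python
def load_reactions(path: str) -> set[str]:
    """Read tokenized reactions (tokens separated by spaces)
    and detokenize them by stripping the spaces."""
    with open(path) as f:
        return {line.strip().replace(" ", "") for line in f}

# Hypothetical paths: reagent-prediction training set vs. MIT test set.
train = load_reactions("data/tokenized/reagents/src-train.txt")
test = load_reactions("data/tokenized/MIT_separated/src-test.txt")
print(f"Overlap: {len(train & test)} reactions")  # expected to print 0
```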

Workflow

  1. Create a conda environment from the specification file:

    conda env create -f environment.yml
    conda activate reagents_pred
  2. Download the datasets and put them in the data/tokenized directory.
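
     Based on the paths used in step 3 below, each dataset is expected to provide at least the following files:

        data/tokenized/
        └── ${DATASET_NAME}/
            ├── src-train.txt
            ├── tgt-train.txt
            ├── src-val.txt
            └── tgt-val.txt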

  3. Train a reagent prediction model. First, preprocess the data for an OpenNMT model:

        python3 preprocess.py -train_src data/tokenized/${DATASET_NAME}/src-train.txt \
                              -train_tgt data/tokenized/${DATASET_NAME}/tgt-train.txt \
                              -valid_src data/tokenized/${DATASET_NAME}/src-val.txt \
                              -valid_tgt data/tokenized/${DATASET_NAME}/tgt-val.txt \
                              -save_data data/tokenized/${DATASET_NAME}/${DATASET_NAME} \
                              -src_seq_length 1000 -tgt_seq_length 1000 \
                              -src_vocab_size 1000 -tgt_vocab_size 1000 -share_vocab

    Then, train the model:

        python3 train.py -data data/tokenized/${DATASET_NAME}/${DATASET_NAME} \
                         -save_model experiments/checkpoints/${DATASET_NAME}/${DATASET_NAME}_model \
                         -seed 42 -gpu_ranks 0 -save_checkpoint_steps 10000 -keep_checkpoint 20 \
                         -train_steps 500000 -param_init 0  -param_init_glorot -max_generator_batches 32 \
                         -batch_size 4096 -batch_type tokens -normalization tokens -max_grad_norm 0  -accum_count 4 \
                         -optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam -warmup_steps 8000  \
                         -learning_rate 2 -label_smoothing 0.0 -report_every 10 \
                         -layers 4 -rnn_size 256 -word_vec_size 256 -encoder_type transformer -decoder_type transformer \
                         -dropout 0.1 -position_encoding -share_embeddings \
                         -global_attention general -global_attention_function softmax -self_attn_type scaled-dot \
                         -heads 8 -transformer_ff 2048 -tensorboard
  4. Train a baseline product prediction model:
    Train a Molecular Transformer on, say, the MIT_separated data. For this, run the preprocess.py and train.py scripts
    as shown above, but with DATASET_NAME=MIT_separated.

  5. Use a trained reagent model to improve reagents in a dataset for product prediction.
    The reagent_substitution.py script uses a reagent prediction model to change the reagents in the data fed
    to a product prediction model. To change the reagents in, say, MIT_separated, run the script as follows:

        python3 reagent_substitution.py --data_dir data/tokenized/MIT_separated \
                                        --reagent_model <MODEL_NAME> \
                                        --reagent_model_vocab <MODEL_SRC_VOCAB> \
                                        --beam_size 5 --gpu 0

    MODEL_NAME may be stored in experiments/checkpoints/, and MODEL_SRC_VOCAB (a .json file) may be stored in data/vocabs/.
    Alternatively, download the final data here.

  6. Train product prediction models on the datasets cleaned by the reagent prediction model, as in step 5.

The trained reagent and product prediction models are stored as .pt files here.

Inference

To make predictions for reactions supplied as SMILES in a .txt file, use the following script:

```bash
python3 translate.py -model <PATH TO THE MODEL WEIGHTS> \
                     -src <PATH TO THE TEST REACTIONS WITHOUT REAGENTS> \
                     -output <PATH TO THE .TXT FILE WHERE THE PREDICTIONS WILL BE STORED> \
                     -batch_size 64 -replace_unk -max_length 200 -fast -beam_size 5 -n_best 5 -gpu <GPU ID>
```
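
With -beam_size 5 -n_best 5, the output file holds five candidate predictions per input reaction. Assuming, as in OpenNMT-py, that the candidates for each input are written on consecutive lines, a short sketch for grouping them (the file name is hypothetical):

```python
def read_predictions(path: str, n_best: int = 5) -> list[list[str]]:
    """Group translate.py output into n_best candidates per input,
    assuming consecutive lines belong to the same input reaction."""
    with open(path) as f:
        lines = [line.strip() for line in f]
    return [lines[i:i + n_best] for i in range(0, len(lines), n_best)]

preds = read_predictions("predictions.txt", n_best=5)
print(preds[0])  # the five candidates for the first input reaction
```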

The supplied reactions should be tokenized, with tokens separated by spaces, like the files produced by prepare_data.py.
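
For reference, here is a minimal tokenization sketch built on the regular expression commonly used in the Molecular Transformer line of work; the authoritative tokenization for this repository is whatever prepare_data.py produces:

```python
import re

# SMILES token regex from the Molecular Transformer line of work,
# with ">>?" so that full reaction SMILES also tokenize cleanly.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> str:
    """Split a (reaction) SMILES string into space-separated tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, f"tokenization lost characters: {smiles}"
    return " ".join(tokens)

print(tokenize_smiles("CC(=O)O.OCC>>CC(=O)OCC"))
# C C ( = O ) O . O C C >> C C ( = O ) O C C
```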

Citation

The preprint:

@article{andronov_voinarovska_andronova_wand_clevert_schmidhuber_2022,
  place     = {Cambridge},
  title     = {Reagent Prediction with a Molecular Transformer Improves Reaction Data Quality},
  DOI       = {10.26434/chemrxiv-2022-sn2kr},
  journal   = {ChemRxiv},
  publisher = {Cambridge Open Engage},
  author    = {Andronov, Mikhail and Voinarovska, Varvara and Andronova, Natalia and Wand, Michael and Clevert, Djork-Arné and Schmidhuber, Jürgen},
  year      = {2022}
}

The underlying framework:

@inproceedings{opennmt,
  author    = {Guillaume Klein and
               Yoon Kim and
               Yuntian Deng and
               Jean Senellart and
               Alexander M. Rush},
  title     = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year      = {2017},
  url       = {https://doi.org/10.18653/v1/P17-4012},
  doi       = {10.18653/v1/P17-4012}
}
