MTNNGC_ADMET

This project contains the code to train, validate, and run predictions with a multi-task graph convolutional model using DeepChem. The architecture is the one described in the paper "Modeling Physico-Chemical ADMET Endpoints With Multitask Graph Convolutional Networks".

Setting up an environment

The conda environment containing all necessary dependencies can be created from the environment.yml file as follows:

conda env create -f environment.yml

The environment should then be activated:

source activate mtnngc

And finally the compchemdl library should be installed:

pip install -e .

Installation only needs to be done once; afterwards, simply activate the environment as explained above before running any of the library's functionality.

Training and validating models

MTNNGC models can be trained using the python script in compchemdl/models/mtnn_gc.py. It can run cross-validation, train-test evaluation, or a simple training of the model. The options are the following:

  • --tasks (-t): path to a directory where your training sets are stored (one training set per task in the folder). You can use the directory we prepared with public data in data/training_data, for example.

  • --test_tasks (-tt): Optional. Path to the directory where the test sets are stored (one test set per task in the folder; the files should have the same names as the corresponding training set files in the directory given with the --tasks option).

  • --name (-n): name to give the model (e.g. ADMET_1)

  • --output (-o): Optional. Name of the temporary directory where intermediate results are stored.

  • --cv (-x): Whether to run cross-validation on the training sets. Default: True. NB: setting this to True requires that the input files contain a column with cross-validation fold assignments (see the sketch after this list).

  • --refit (-r): Whether to run a final training on the whole training sets. Default: False

  • --batch (-b): size of the minibatches. Default: 128

  • --epochs (-e): number of training epochs. Default: 40

  • --learningrate (-l): learning rate. Default: 0.001

  • --gpu (-g): Optional GPU to use. If nothing is passed, the CPU will be used instead.

  • --smiles_field (-s): header of the column containing the clean SMILES of the input molecules. Default: 'canonical_smiles'

  • --y_field (-y): header of the column containing the target value for every task. Default: 'label'

  • --id_field (-i): header of the column containing the identifiers of the input molecules. Default: 'mol_index'

  • --split_field (-f): header of the column containing the fold assignments for cross-validation. Only needed if --cv is set to True. Default: 'fold'
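
Because --cv expects a fold-assignment column in each training file, you may need to generate one yourself. The following is a minimal sketch of one way to do this with pandas and scikit-learn (neither is part of this repository's documented workflow); the file name, the number of folds, and the random seed are assumptions, and the column name matches the --split_field default:

import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical training file; adjust the path to your own data.
df = pd.read_csv('Task1_train.csv')

# Assign every row to one of 5 cross-validation folds, stored in a
# column named 'fold' (the --split_field default).
df['fold'] = -1
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(kf.split(df)):
    df.loc[df.index[val_idx], 'fold'] = fold

df.to_csv('Task1_train.csv', index=False)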

We provide in this repository a folder with three example datasets that can be used for testing. These datasets do not correspond to the ones discussed in the paper; they are open data from ChEMBL. To run a cross-validation, the following python command can be used (provided your GPU 0 is available):

python -u compchemdl/models/mtnn_gc.py -t /your_path_to_the_repo/data/training_data -n test_model -o mytemp -x True -r False -g 0

To train the same model on the whole data and save it for future use, the following python command can be used:

python -u compchemdl/models/mtnn_gc.py -t /your_path_to_the_repo/data/training_data -n test_model -o mytemp -x False -r True -g 0

Dataset preparation

Training and validation

The datasets for training and testing should be comma-separated files with the following fields: an identifier for the molecule (id_field, 'mol_index' by default), the structure in SMILES format (smiles_field, 'canonical_smiles' by default), and the target value for the particular task (y_field, 'label' by default). If cross-validation is required, an additional field assigning every example to a CV fold is needed (split_field, 'fold' by default). Example files can be found under the data/training_data folder. One file per task is required; the tasks are aggregated based on the SMILES column, so the compounds must be preprocessed the same way for all tasks.
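
As an illustration, a minimal training file with the default column names could be produced as follows. This is a sketch only: the identifiers and label values are hypothetical placeholders, not data from the repository.

import pandas as pd

# Two example rows using the default column names expected by mtnn_gc.py.
rows = [
    {'mol_index': 'mol_001', 'canonical_smiles': 'CC(=O)Oc1ccccc1C(=O)O', 'label': 1.2, 'fold': 0},
    {'mol_index': 'mol_002', 'canonical_smiles': 'CC(=O)Nc1ccc(O)cc1', 'label': 0.4, 'fold': 1},
]
pd.DataFrame(rows).to_csv('Task1_train.csv', index=False)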

When train-test is required, a second directory containing the test files should be given. The format is exactly the same as for training and cross-validation. It is important to give the same names to the files in both directories so that they are matched properly (e.g. Task1_train.csv, Task2_train.csv in training_data/ and Task1_test.csv, Task2_test.csv in the test directory).

Inference

For inference, the input must be in .smi format: a tab-separated file with one row per compound, the SMILES in the first column and a molecule identifier in the second column, with no header. An example file can be found under data/test.smi.
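
A minimal sketch for writing such a file from Python is shown below; the compounds, identifiers, and output file name are hypothetical:

# .smi format: SMILES first, identifier second, tab-separated, no header.
compounds = [
    ('CC(=O)Oc1ccccc1C(=O)O', 'mol_001'),
    ('c1ccccc1', 'mol_002'),
]
with open('my_compounds.smi', 'w') as f:
    for smiles, mol_id in compounds:
        f.write(smiles + '\t' + mol_id + '\n')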

Inference

Once a model has been trained and validated, it can be used to predict on new compounds. For this, we provide the script compchemdl/inference/inference.py. The options are the following:

  • --input (-i): path to the input file in .smi format (see Dataset preparation for more details)
  • --checkpoint (-c): path to the directory where the final trained model is stored
  • --output (-o): Optional path to the csv file where the predictions will be stored. If no path is given, the predictions will be printed instead.
  • --jobdir (-j): temporary directory where intermediate files are to be stored. Will be deleted at the end of the job
  • --gpu (-g): Optional GPU to use. If nothing is passed, the CPU will be used instead.

Assuming a model named "model_1" has previously been trained and saved on date XX-XX-XX, it can be used for prediction on our test file as follows:

python -u compchemdl/inference/inference.py -i /your_path_to_the_repo/data/test.smi -c /your_path_to_the_repo/data/models/model_1/final_model_XX-XX-XX -j /tmp
