wengong-jin / chemprop

This project was forked from chemprop/chemprop.


Chemical Property Prediction with Graph Convolutional Networks

License: MIT License


chemprop's Introduction

Deprecation Warning

This repository is deprecated and has been replaced by https://github.com/swansonk14/chemprop, which contains a stable, clean version of the code. This repo is no longer actively maintained and should be used for reference only.

Property Prediction

This repository contains a message passing neural network (a type of graph convolutional network) for molecular property prediction.

Table of Contents

  • Installation
  • Data
  • Training
  • Train/Validation/Test Split
  • Predicting
  • TensorBoard
  • Deepchem test
  • Results

Installation

Requirements:

  • CUDA >= 8.0 + cuDNN
  • Python 3/conda: Please follow the installation guide on https://conda.io/miniconda.html
    • Create a conda environment with conda create -n <name> python=3.6
    • Activate the environment with conda activate <name>
  • PyTorch: Please follow the installation guide on https://pytorch.org/
    • Typically it's conda install pytorch torchvision -c pytorch
  • TensorFlow: Needed for TensorBoard training visualization
    • CPU-only: pip install tensorflow
    • GPU: pip install tensorflow-gpu
  • RDKit: conda install -c rdkit rdkit
  • Other packages: pip install -r requirements.txt
  • Note that if you get warning messages about kyotocabinet, it's safe to ignore them.

Data

The data file must be a CSV file with a header row. For example:

smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
...

Datasets from DeepChem are available in the data directory.
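To make the expected layout concrete, here is a minimal sketch (not part of the repository) that reads such a CSV with pandas; the use of pandas and the masking of blanks are my own illustrative assumptions.

import pandas as pd

# Read the CSV; the first column holds SMILES strings and the remaining
# columns hold one target value per task. Blank cells become NaN.
df = pd.read_csv("data/tox21.csv")

smiles = df.iloc[:, 0].tolist()   # e.g. "CCOc1ccc2nc(S(N)(=O)=O)sc2c1"
targets = df.iloc[:, 1:]          # one column per property (NR-AR, NR-AR-LBD, ...)

# Empty values are ignored during training (see Notes under Training),
# which corresponds to masking out the NaN entries per task.
mask = targets.notna()
print(len(smiles), "molecules,", targets.shape[1], "tasks,", int(mask.values.sum()), "labelled entries")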

Training

To train a model, run:

python train.py --data_path <path> --dataset_type <type> --save_dir <dir>

where <path> is the path to a CSV file containing a dataset, <type> is either "classification" or "regression" depending on the type of the dataset, and <dir> is the directory where model checkpoints will be saved.

For example:

python train.py --data_path data/tox21.csv --dataset_type classification --save_dir tox21_checkpoints

Notes:

  • Classification is assumed to be binary.
  • Empty values in the CSV are ignored.
  • --save_dir may be left out if you don't want to save model checkpoints.
  • The default metric for classification is AUC and the default metric for regression is RMSE. The qm8 and qm9 datasets use MAE instead of RMSE, so you need to specify --metric mae.

Train/Validation/Test Split

Random split

By default, the data in --data_path will be split randomly into train, validation, and test sets using the seed specified by --seed (default = 0). By default, the train set contains 80% of the data while the validation and test sets contain 10% of the data each. These sizes can be controlled with --split_sizes (for example, the default would be --split_sizes 0.8 0.1 0.1).
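As an illustration of what the default behaviour corresponds to (the real implementation lives in the repository's data utilities; this sketch only mirrors the 0.8/0.1/0.1 sizes and the --seed argument described above):

import random

def random_split(data, sizes=(0.8, 0.1, 0.1), seed=0):
    # Illustrative random split controlled by a seed; not the repository's code.
    data = list(data)
    random.Random(seed).shuffle(data)
    n_train = int(sizes[0] * len(data))
    n_val = int(sizes[1] * len(data))
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test

train, val, test = random_split(range(100), sizes=(0.8, 0.1, 0.1), seed=0)
print(len(train), len(val), len(test))  # 80 10 10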

Separate test set

To use a different data set for testing, specify --separate_test_path. In this case, the data in --data_path will be split into only train and validation sets (80% and 20% of the data), and the test set will contain all the data in --separate_test_path.

Cross validation

k-fold cross-validation can be run by specifying the --num_folds argument (which is 1 by default). For example:

python train.py --data_path data/tox21.csv --dataset_type classification --num_folds 5

Ensembling

To train an ensemble, specify the number of models in the ensemble with the --ensemble_size argument (which is 1 by default). For example:

python train.py --data_path data/tox21.csv --dataset_type classification --ensemble_size 5
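At prediction time the ensemble members are used together; the sketch below simply averages per-model predictions, which is an assumption on my part rather than something this README specifies.

import torch

def ensemble_predict(models, batch):
    # Illustrative only: average the predictions of the ensemble members.
    with torch.no_grad():
        preds = torch.stack([model(batch) for model in models])  # (ensemble, batch, tasks)
    return preds.mean(dim=0)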

Model hyperparameters and augmentations

The base message passing architecture can be modified in a number of ways controlled through command line arguments. The full set of options can be seen in parsing.py. Suggested modifications are:

  • --hidden_size <int> Control the hidden size of the neural network layers.
  • --depth <int> Control the number of message passing steps (see the sketch after this list).
  • --virtual_edges Adds "virtual" edges connecting non-bonded atoms to improve information flow. This works very well on some datasets (e.g. QM9) but very poorly on others (e.g. delaney).
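The sketch below is a toy atom-based message passing layer meant only to make --hidden_size (layer width) and --depth (number of message passing steps) concrete; the model in this repository passes messages along directed bonds and is not reproduced here.

import torch
import torch.nn as nn

class ToyMessagePassing(nn.Module):
    # --hidden_size sets the width of the hidden layers;
    # --depth sets how many message passing steps are unrolled.
    def __init__(self, atom_fdim, hidden_size, depth):
        super().__init__()
        self.input = nn.Linear(atom_fdim, hidden_size)
        self.update = nn.Linear(hidden_size, hidden_size)
        self.depth = depth

    def forward(self, atom_feats, adjacency):
        # atom_feats: (n_atoms, atom_fdim); adjacency: (n_atoms, n_atoms)
        h = torch.relu(self.input(atom_feats))
        for _ in range(self.depth):
            messages = adjacency @ h          # sum messages from bonded neighbours
            h = torch.relu(self.update(messages))
        return h.sum(dim=0)                   # crude readout to a molecule-level vector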

Predicting

To load a trained model and make predictions, run predict.py and specify:

  • --test_path <path> Path to the data to predict on.
  • A checkpoint by using either:
    • --checkpoint_dir <dir> Directory where the model checkpoint(s) are saved (i.e. --save_dir during training). This will walk the directory, load all .pt files it finds, and treat the models as an ensemble.
    • --checkpoint_path <path> Path to a model checkpoint file (.pt file).
  • --preds_path Path where a CSV file containing the predictions will be saved.
  • (Optional) --write_smiles Add this flag if you would like to save the SMILES string for each molecule alongside the property prediction.

For example:

python predict.py --test_path data/tox21.csv --checkpoint_dir tox21_checkpoints --preds_path tox21_preds.csv

or

python predict.py --test_path data/tox21.csv --checkpoint_path tox21_checkpoints/fold_0/model_0/model.pt --preds_path tox21_preds.csv
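The --checkpoint_dir behaviour (walking the directory and treating every .pt file as an ensemble member) can be pictured with the following sketch; the loading call is illustrative and not the repository's exact code.

import os
import torch

def find_checkpoints(checkpoint_dir):
    # Collect every .pt file under checkpoint_dir, mirroring the directory walk described above.
    paths = []
    for root, _, files in os.walk(checkpoint_dir):
        paths += [os.path.join(root, f) for f in files if f.endswith(".pt")]
    return paths

for path in find_checkpoints("tox21_checkpoints"):
    state = torch.load(path, map_location="cpu")  # each file is one ensemble member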

TensorBoard

During training, TensorBoard logs are automatically saved to the same directory as the model checkpoints. To view TensorBoard logs, run tensorboard --logdir=<dir> where <dir> is the path to the checkpoint directory. Then navigate to http://localhost:6006.

Deepchem test

We tested our model on 14 deepchem benchmark datasets (http://moleculenet.ai/), ranging from physical chemistry to biophysics. To train our model on those datasets, run:

bash run.sh 1

where 1 is the random seed used to randomly split each dataset into training, validation, and test sets (not applied to datasets that use scaffold splitting).

Results

We compared our model against the graph convolution in deepchem. Our results are averaged over 3 runs with different random seeds, i.e., different data splits. Unless otherwise indicated, all models were trained with hidden size 1800, depth 6, and a master node. We did a few hyperparameter experiments on qm9 but no searching on the other datasets, so there may still be further room for improvement.

Results on classification datasets (AUC score, the higher the better)

Dataset Size Ours GraphConv (deepchem)
Bace 1,513 0.884 ± 0.034 0.783 ± 0.014
BBBP 2,039 0.922 ± 0.012 0.690 ± 0.009
Tox21 7,831 0.851 ± 0.015 0.829 ± 0.006
Toxcast 8,576 0.748 ± 0.014 0.716 ± 0.014
Sider 1,427 0.643 ± 0.027 0.638 ± 0.012
clintox 1,478 0.882 ± 0.022 0.807 ± 0.047
MUV 93,087 0.067 ± 0.03* 0.046 ± 0.031
HIV 41,127 0.821 ± 0.034† 0.763 ± 0.016
PCBA 437,928 0.218 ± 0.001* 0.136 ± 0.003

Results on regression datasets (RMSE, except MAE for qm8 and qm9; the lower the better)

Dataset Size Ours GraphConv/MPNN (deepchem)
delaney 1,128 0.687 ± 0.037 0.58 ± 0.03
Freesolv 642 0.915 ± 0.154 1.15 ± 0.12
Lipo 4,200 0.565 ± 0.052 0.655 ± 0.036
qm8 21,786 0.008 ± 0.000 0.0143 ± 0.0011
qm9 133,884 2.47 ± 0.036 3.2 ± 1.5

†HIV was trained with hidden size 1800 and depth 6 but without the master node. *MUV and PCBA are using a much older version of the model.

chemprop's People

Contributors

swansonk14, wengong-jin, yangkevin2


chemprop's Issues

Input pipeline

Hi everybody,

First of all, thanks for this great repo.

A minor issue for me is that the training process seems to be rather slow.
Are there any plans to parallelize the input pipeline?

Thanks in advance!
Florian

Issue with TensorboardX version 1.7

Hello,

There seems to be an issue with the new TensorboardX release (1.7) when installed via conda:
"TypeError: __init__() got an unexpected keyword argument 'log_dir'"
Reverting back to 1.6 seems to fix the issue.
pytorch/ignite#534
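For anyone hitting this, pinning tensorboardX to 1.6 works. Alternatively, a small compatibility shim such as the sketch below sidesteps the renamed keyword argument; the exact rename in 1.7 is an assumption here and should be checked against the tensorboardX changelog.

import inspect
from tensorboardX import SummaryWriter

def make_writer(path):
    # Create a SummaryWriter whether the constructor expects 'log_dir' (<= 1.6) or 'logdir' (1.7).
    params = inspect.signature(SummaryWriter.__init__).parameters
    kwarg = "log_dir" if "log_dir" in params else "logdir"
    return SummaryWriter(**{kwarg: path})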

adding partial charges

modify featurization.py

7a8,10
> from rdkit.Chem.rdPartialCharges import *
> from rdkit.Chem import ChemicalFeatures
> import os
34,35c37,39
< # len(choices) + 1 to include room for uncommon values; + 2 at end for IsAromatic and mass
< ATOM_FDIM = sum(len(choices) + 1 for choices in ATOM_FEATURES.values()) + 2
---
> # len(choices) + 1 to include room for uncommon values; + 3 at end for IsAromatic and mass and partial charge
> ATOM_FDIM = sum(len(choices) + 1 for choices in ATOM_FEATURES.values()) + 3
94c98,99
<            [atom.GetMass() * 0.01]  # scaled to about the same range as other features
---
>            [atom.GetMass() * 0.01] + \
>            [float(atom.GetProp("_GasteigerCharge"))]
192a198,201
> 
>         #assign partial charges
>         ComputeGasteigerCharges(mol)
>         #iterate over atoms
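Read as plain Python, the diff above computes Gasteiger partial charges and appends each atom's charge as one extra feature after the scaled mass (hence ATOM_FDIM growing by one). A self-contained sketch of that idea, with illustrative variable names rather than the repository's, is:

from rdkit import Chem
from rdkit.Chem.rdPartialCharges import ComputeGasteigerCharges

mol = Chem.MolFromSmiles("CCOc1ccc2nc(S(N)(=O)=O)sc2c1")
ComputeGasteigerCharges(mol)  # stores the result in the '_GasteigerCharge' atom property

for atom in mol.GetAtoms():
    charge = float(atom.GetProp("_GasteigerCharge"))
    extra_features = [atom.GetMass() * 0.01, charge]  # scaled mass plus the new partial charge feature
    print(atom.GetSymbol(), extra_features)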

predict.py AssertionError

python predict.py --test_path data/tox21.csv --checkpoint_dir tox21_checkpoints --preds_path tox21_preds.csv

Traceback (most recent call last):
  File "predict.py", line 6, in <module>
    make_predictions(args)
  File "/home/chupvl/git/chemprop/chemprop/train/make_predictions.py", line 30, in make_predictions
    assert smiles is not None  # Note: Currently only works with smiles provided, not with data file.
AssertionError

3D distance feature

Hi, I found that in your code some 3D distance information was attached to the bond features, but it's not mentioned in your paper, and in the refined code this feature was deleted. Why was it deleted? Isn't it good supplementary information for predicting some molecules' properties?

features_only failing?

python train.py --gpu 0 --features_only --virtual_edges --ensemble_size 5 --num_folds 5 --data_path data/tox21.csv --dataset_type classification --save_dir checkpoints/tox21

causes
AttributeError: 'Namespace' object has no attribute 'features_dim'

BCELoss is unstable

BCELoss is highly unstable and crashes chemprop with assert statements at the CUDA level. Could you change this to BCEWithLogitsLoss?
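For context, BCEWithLogitsLoss fuses the sigmoid and the binary cross-entropy into one numerically stable operation computed directly from the logits, so saturated predictions no longer push BCELoss into log(0). A minimal comparison (illustrative values only):

import torch
import torch.nn as nn

logits = torch.tensor([30.0, -30.0, 0.5])   # raw model outputs
targets = torch.tensor([0.0, 1.0, 1.0])

# Unstable path: sigmoid saturates to exactly 0/1 in float32, so BCELoss evaluates log(0)
# (PyTorch clamps it, but the result is badly degraded).
probs = torch.sigmoid(logits)
unstable = nn.BCELoss()(probs, targets)

# Stable path: the loss is computed directly from the logits.
stable = nn.BCEWithLogitsLoss()(logits, targets)

print(unstable.item(), stable.item())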

code lacks a proper project structure

The current code layout is pretty confusing with everything living in one directory.

Would it be useful to refactor it so that this repository becomes a more standard Python project?
Something like:

Chemprop/
|-- demo/
|   |-- data/
|   |   |-- ___.csv
|   |-- demo_runner.py
|
|-- chemprop/
|   |-- test/
|   |   |-- __init__.py
|   |   |-- test_todo.py
|   |
|   |-- __init__.py
|   |-- mpn.py
|   |-- runners.py
|   |-- utils.py
|   |-- io.py
|
|-- setup.py
|-- README
|-- LICENCE

I'm happy to help out.
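If the repository were restructured along those lines, a minimal setup.py (purely illustrative; the package name, version, and dependencies are placeholders, not a definitive packaging recipe) might look like:

from setuptools import setup, find_packages

setup(
    name="chemprop",
    version="0.0.1",                                        # placeholder version
    packages=find_packages(exclude=["demo", "chemprop.test"]),
    install_requires=["torch"],                             # RDKit is typically installed via conda instead
    python_requires=">=3.6",
)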

About this model training file

As mentioned in the paper, you used RDKit to calculate 12 molecular features as part of the training set, but I can't understand the specific format of this training file. Could you give me an example file?

Replace hard-wired features with feature factory

Hi, to make it easy to feed in other types of vectorial data without hard-coding them in a deeper class/function, I replaced the existing logic with the already existing feature factory.
This is also used in argparse and can therefore easily handle more extensions by just adding new modules to the feature factory.

Please check and let me know what you think
https://github.com/joergkurtwegner/chemprop/commit/0e14e539592e17f60dce8c73d13bbac58badb027

generalize FunctionalGroupFeaturizer

Dear all, the SMARTS definition is hard to read and does not allow comments or any details.
Might I suggest switching to the RDKit feature factory to make this more generic?

Find the class modification attached, and please change the following command line parameters in parsing.py:

parser.add_argument('--additional_atom_features', type=str, nargs='*', choices=['smarts', 'family_and_type'], default=[],
                    help='Use additional features in atom featurization')
parser.add_argument('--atom_features_family_and_type', type=str, default='{RDDataDir}/BaseFeatures.fdef',
                    help='Path to an RDKit feature definition (.fdef) file, if family_and_type features are on.')
parser.add_argument('--atom_features_smarts', type=str, default='chemprop/features/smarts.txt',
                    help='Path to a txt file of SMARTS for functional groups, if functional_group features are on.')

functional_groups.py.txt
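As background for the proposal, RDKit's feature factory reads definitions from an .fdef file and reports, for each matched feature, a family and type together with the atom indices it covers. A short sketch of that API (independent of the attached patch):

import os
from rdkit import Chem, RDConfig
from rdkit.Chem import ChemicalFeatures

# BaseFeatures.fdef ships with RDKit under RDConfig.RDDataDir.
fdef_path = os.path.join(RDConfig.RDDataDir, "BaseFeatures.fdef")
factory = ChemicalFeatures.BuildFeatureFactory(fdef_path)

mol = Chem.MolFromSmiles("CCN1C(=O)NC(c2ccccc2)C1=O")
for feat in factory.GetFeaturesForMol(mol):
    # Family (e.g. 'Donor', 'Acceptor'), type, and the atoms that carry the feature.
    print(feat.GetFamily(), feat.GetType(), feat.GetAtomIds())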

Error while running the code

Hi, sorry if the question is a little noobish, but I get the following error while running the code:

(chemprop) C:\Users\1901566.admin\Downloads\chemprop-master>python web/run.py
Traceback (most recent call last):
  File "web/run.py", line 10, in <module>
    from app import app, db
  File "C:\Users\1901566.admin\Downloads\chemprop-master\web\app\__init__.py", line 6, in <module>
    app.config.from_object('config')
  File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\site-packages\flask\config.py", line 174, in from_object
    obj = import_string(obj)
  File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\site-packages\werkzeug\utils.py", line 568, in import_string
    __import__(import_name)
  File "C:\Users\1901566.admin\Downloads\chemprop-master\web\config.py", line 9, in <module>
    import torch
  File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\site-packages\torch\__init__.py", line 81, in <module>
    ctypes.CDLL(dll)
  File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

Could anyone please help me solve this? I don't know the reason for this error.

prediction by default picking up all models in the directory

Use case
DIR1/model1/
DIR1/model2/
DIR1/model3/
etc...

predict.py automatically picks up all the other model directories (model1/2/3) even though I specified only one for predictions. I suppose it should look for *.pt files in one and only one directory?
python /home/user/git/chemprop/predict.py --test_path test.csv --preds_path test_preds.csv --checkpoint_path ./DIR1/model3/
