FS-Mol is A Few-Shot Learning Dataset of Molecules, containing molecular compounds with measurements of activity against a variety of protein targets. The dataset is presented with a model evaluation benchmark which aims to drive few-shot learning research in the domain of molecules and graph-structured data.

License: Other

Python 54.45% Jupyter Notebook 45.55%

fs-mol's Introduction

FS-Mol: A Few-Shot Learning Dataset of Molecules

This repository contains data and code for FS-Mol: A Few-Shot Learning Dataset of Molecules.

Installation

  1. Clone or download this repository

  2. Install dependencies

    cd FS-Mol
    
    conda env create -f environment.yml
    conda activate fsmol
    

The code for the Molecule Attention Transformer baseline is added as a submodule of this repository. Hence, in order to be able to run MAT, one has to clone our repository via git clone --recurse-submodules. Alternatively, one can first clone our repository normally, and then set up submodules via git submodule update --init. If the MAT submodule is not set up, all the other parts of our repository should continue to work.

Data

The dataset is available as a download, FS-Mol Data, split into train, valid and test folders. The file datasets/fsmol-0.1.json additionally specifies which tasks are to be used, giving a default list of tasks for each data fold. The complete dataset contains many more tasks; to use all available training tasks, pass the training script argument --task_list_file datasets/entire_train_set.json. The task lists will be used to version FS-Mol in future iterations as more data becomes available via ChEMBL.

Tasks are stored as individual compressed JSONLines files, with each line holding the information for a single datapoint of the task. Each datapoint is stored as a JSON dictionary, following a fixed structure:

{
    "SMILES": "SMILES_STRING",
    "Property": "ACTIVITY BOOL LABEL",
    "Assay_ID": "CHEMBL ID",
    "RegressionProperty": "ACTIVITY VALUE",
    "LogRegressionProperty": "LOG ACTIVITY VALUE",
    "Relation": "ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE",
    "AssayType": "TYPE OF ASSAY",
    "fingerprints": [...],
    "descriptors": [...],
    "graph": {
        "adjacency_lists": [
           [... SINGLE BONDS AS PAIRS ...],
           [... DOUBLE BONDS AS PAIRS ...],
           [... TRIPLE BONDS AS PAIRS ...]
        ],
        "node_types": [...ATOM TYPES...],
        "node_features": [...NODE FEATURES...],
    }
}
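
As a minimal sketch of how a task file with the structure above could be read without the fs_mol package, the snippet below parses a compressed JSONLines file line by line and counts the bonds of each type. It assumes gzip compression and a .jsonl.gz file name, neither of which is stated above; the helper names (read_task_file, bond_counts, build_demo_file) and the two demo datapoints are hypothetical and exist only for illustration.

```python
import gzip
import json
import os
import tempfile


def read_task_file(path):
    """Return the list of datapoint dictionaries stored in one task file."""
    datapoints = []
    # Assumption: the file is gzip-compressed, with one JSON dictionary per line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                datapoints.append(json.loads(line))
    return datapoints


def bond_counts(datapoint):
    """Count the pairs listed for single, double and triple bonds."""
    names = ("single", "double", "triple")
    lists = datapoint["graph"]["adjacency_lists"]
    return {name: len(pairs) for name, pairs in zip(names, lists)}


def build_demo_file(directory):
    """Write two made-up datapoints (not real FS-Mol data) to a demo file."""
    demo = [
        {"SMILES": "CCO", "Property": "1.0",
         "graph": {"adjacency_lists": [[[0, 1], [1, 2]], [], []],
                   "node_types": ["C", "C", "O"]}},
        {"SMILES": "C=O", "Property": "0.0",
         "graph": {"adjacency_lists": [[], [[0, 1]], []],
                   "node_types": ["C", "O"]}},
    ]
    path = os.path.join(directory, "CHEMBL_DEMO.jsonl.gz")  # hypothetical name
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in demo:
            handle.write(json.dumps(record) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo_points = read_task_file(build_demo_file(tmp))

demo_smiles = [point["SMILES"] for point in demo_points]
demo_bonds = [bond_counts(point) for point in demo_points]
print(demo_smiles, demo_bonds)
```

For real use, the FSMolDataset class described in the next section is the supported way to load tasks; this sketch is only meant to make the on-disk layout concrete.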

FSMolDataset

The fs_mol.data.FSMolDataset class provides programmatic access in Python to the train/valid/test tasks of the few-shot dataset. An instance is created from the data directory by FSMolDataset.from_directory(/path/to/dataset). More details and examples of how to use FSMolDataset are available in fs_mol/notebooks/dataset.ipynb.

Evaluating a new Model

We have provided an implementation of the FS-Mol evaluation methodology in fs_mol.utils.eval_utils.eval_model(). This is a framework-agnostic Python function, and we demonstrate in detail how to use it for evaluating a new model in notebooks/evaluation.ipynb.

Note that our baseline test scripts (fs_mol/baseline_test.py, fs_mol/maml_test.py, fs_mol/mat_test.py, fs_mol/multitask_test.py and fs_mol/protonet_test.py) use this function as well and can serve as examples of how to integrate per-task fine-tuning in TensorFlow (maml_test.py), fine-tuning in PyTorch (mat_test.py) and single-task training for scikit-learn models (baseline_test.py). These scripts also support the --task_list_file parameter to choose different sets of test tasks, as required.

Baseline Model Implementations

We provide implementations of three key few-shot learning methods: Multitask learning, Model-Agnostic Meta-Learning, and Prototypical Networks, as well as evaluation of the single-task baselines and the Molecule Attention Transformer (MAT; see the MAT paper and code).

All results and associated plots are found in the baselines/ directory.

These baseline methods can be run on the FS-Mol dataset as follows:

kNNs and Random Forests -- Single Task Baselines

Our kNN and RF baselines are obtained by grid search over an industry-standard parameter set, detailed in the script baseline_test.py.

The baseline single-task evaluation can be run as follows, with a choice of kNN or randomForest model:

python fs_mol/baseline_test.py /path/to/data --model {kNN, randomForest}

Molecule Attention Transformer

For details of the method, see the Molecule Attention Transformer (MAT) paper and code.

MAT can be evaluated as:

python fs_mol/mat_test.py /path/to/pretrained-mat /path/to/data

GNN-MAML pre-training and evaluation

The GNN-MAML model consists of a GNN operating on the molecular graph representations of the dataset. It is an $8$-layer GNN with node-embedding dimension $128$ that uses "Edge-MLP" message passing. The model was trained with a support set size of $16$, following the MAML procedure (Finn 2017). The hyperparameters used in the model checkpoint are the default settings of maml_train.py.

The current defaults were used to train the final versions of GNN-MAML available here.

python fs_mol/maml_train.py /path/to/data 

Evaluation is run as:

python fs_mol/maml_test.py /path/to/data --trained_model /path/to/gnn-maml-checkpoint
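
To make the MAML procedure referred to above more concrete, here is a minimal sketch on a toy problem. It is not the code in maml_train.py and does not involve a GNN: the model is a single scalar weight w fitting y = slope * x, each task is one slope, and all names and learning rates are hypothetical. The inner step adapts w on a support set of 16 points, and the outer step updates the shared initialisation using the query-set gradient propagated through the inner step (exact here, because the loss is quadratic).

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def loss(w, xs, ys):
    """Mean squared error of the scalar model y_hat = w * x."""
    return mean((w * x - y) ** 2 for x, y in zip(xs, ys))


def grad(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return mean(2.0 * x * (w * x - y) for x, y in zip(xs, ys))


def hessian(xs):
    """Second derivative of the loss; constant in w for this quadratic loss."""
    return mean(2.0 * x * x for x in xs)


def adapt(w, xs, ys, inner_lr):
    """Inner loop: one gradient step on the support set of a task."""
    return w - inner_lr * grad(w, xs, ys)


def meta_train(slopes, support_xs, query_xs, inner_lr=0.1, outer_lr=0.1, steps=200):
    w = 0.0  # shared initialisation that MAML learns
    for _ in range(steps):
        meta_grad = 0.0
        for slope in slopes:
            support_ys = [slope * x for x in support_xs]
            query_ys = [slope * x for x in query_xs]
            w_task = adapt(w, support_xs, support_ys, inner_lr)
            # Chain rule through the inner step: d w_task / d w = 1 - inner_lr * hessian.
            through_inner = 1.0 - inner_lr * hessian(support_xs)
            meta_grad += grad(w_task, query_xs, query_ys) * through_inner
        w -= outer_lr * meta_grad / len(slopes)  # outer loop update
    return w


SUPPORT_XS = [-1.0 + 2.0 * i / 15 for i in range(16)]  # support set size 16
QUERY_XS = [-0.9 + 0.2 * i for i in range(10)]
meta_w = meta_train([1.0, 2.0, 3.0], SUPPORT_XS, QUERY_XS)

# Adapt the learned initialisation to an unseen task (slope 2.5).
new_support_ys = [2.5 * x for x in SUPPORT_XS]
new_query_ys = [2.5 * x for x in QUERY_XS]
adapted_w = adapt(meta_w, SUPPORT_XS, new_support_ys, inner_lr=0.1)
loss_before = loss(meta_w, QUERY_XS, new_query_ys)
loss_after = loss(adapted_w, QUERY_XS, new_query_ys)
print(meta_w, loss_before, loss_after)
```

In this toy setting the learned initialisation settles near the mean of the training slopes, and one adaptation step on the new task's support set lowers its query loss.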

GNN-MT pre-training and evaluation

The GNN-MT model consists of a GNN operating on the molecular graph representations of the dataset. It is a $10$-layer GNN with node-embedding dimension $128$ that uses principal neighbourhood aggregation (PNA) message passing. The hyperparameters used in the model checkpoint are the default settings of multitask_train.py. This method is similar to the approach taken for the task-only training in Hu 2019.

python fs_mol/multitask_train.py /path/to/data 

Evaluation is run as:

python fs_mol/multitask_test.py /path/to/gnn-mt-checkpoint /path/to/data

Prototypical Networks (PN) pre-training and evaluation

The prototypical networks method (Snell 2017) extracts representations of support set datapoints and uses these to classify positive and negative examples. Here we use the Mahalanobis distance as the metric for the distance of a query point to the class prototypes.

python fs_mol/protonet_train.py /path/to/data 

Evaluation is run as:

python fs_mol/protonet_test.py /path/to/pn-checkpoint /path/to/data
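
The classification step of a prototypical network with a Mahalanobis metric can be sketched as follows. This is an illustrative NumPy example under stated assumptions, not the implementation in protonet_train.py: it works directly on given feature vectors instead of learned representations, estimates one covariance per class, and adds a ridge term (a hypothetical choice here) so the covariance stays invertible for small support sets.

```python
import numpy as np


def mahalanobis_prototype_logits(support_x, support_y, query_x, ridge=1.0):
    """Negative squared Mahalanobis distance of each query to each class prototype."""
    classes = np.unique(support_y)
    dim = support_x.shape[1]
    logits = np.empty((query_x.shape[0], classes.size))
    for col, cls in enumerate(classes):
        members = support_x[support_y == cls]
        prototype = members.mean(axis=0)  # class prototype = mean representation
        centred = members - prototype
        # Regularised class covariance (assumption: ridge * identity is added).
        cov = centred.T @ centred / max(len(members) - 1, 1) + ridge * np.eye(dim)
        diff = query_x - prototype
        solved = np.linalg.solve(cov, diff.T).T  # rows are cov^{-1} @ diff_i
        logits[:, col] = -np.einsum("ij,ij->i", diff, solved)
    return classes, logits


def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up 2-D "representations": three inactive (0) and three active (1) points.
support_x = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
                      [3.0, 3.0], [3.5, 3.0], [3.0, 3.5]])
support_y = np.array([0, 0, 0, 1, 1, 1])
query_x = np.array([[0.1, 0.2], [2.9, 3.1]])

classes, logits = mahalanobis_prototype_logits(support_x, support_y, query_x)
probabilities = softmax(logits)
predicted = classes[logits.argmax(axis=1)]
print(predicted, probabilities)
```

Each query point is assigned to the class whose prototype is closest under the class-specific metric; the softmax over negative distances gives class probabilities that could feed a cross-entropy loss during training.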

Available Model Checkpoints

We provide pre-trained models for GNN-MAML, GNN-MT and PN; these can be downloaded from figshare.

Model Name | Description | Checkpoint File
GNN-MAML | Support set size 16. 8-layer GNN. Edge-MLP message passing. | MAML-Support16_best_validation.pkl
GNN-MT | 10-layer GNN. PNA message passing. | multitask_best_model.pt
PN | 10-layer GNN. PNA message passing. ECFP+GNN. Mahalanobis distance metric. | PN-Support64_best_validation.pt

Specifying, Training and Evaluating New Model Implementations

Few-shot and single-task models can be defined flexibly, as demonstrated by the range of train and test scripts in fs_mol.

We give a detailed example of how to use the abstract class AbstractTorchFSMolModel in notebooks/integrating_torch_models.ipynb to integrate a new general PyTorch model, and note that the evaluation procedure described above is demonstrated on sklearn models in fs_mol/baseline_test.py and on a TensorFlow-based GNN model in fs_mol/maml_test.py.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

fs-mol's Issues

Evaluation methodology for regression task

Hi,

Thanks for making this dataset available! I'm wondering if there are any scripts for evaluating models on the task of predicting continuous values, i.e. RegressionProperty, or maybe a reference that uses FS-Mol for this task? In the utils/metrics directory I only see binary evaluation tools.

Thanks!

Dataset download link seems broken.

Hello.

Thanks for introducing an interesting dataset for few-shot molecule learning. However, the download link for the dataset seems to be broken. Could you check it?

Thank you.

Dataset filtering details

In your FS-Mol paper, it is said that only assays with 32 to 5000 compounds are kept, and the remaining training dataset then contains 4938 assays. However, if I try to filter out those from your provided dataset loading code, I'm left with ~24k assays.

import numpy as np
from fs_mol.data import FSMolDataset, DataFold

dataset = FSMolDataset.from_directory(FS_MOL_DATASET_PATH)
train_task_iterable = dataset.get_task_reading_iterable(DataFold.TRAIN)
# Collect the number of samples in each training task (assay).
assay_sizes = np.array([len(t.samples) for t in train_task_iterable])
# Count the assays that have between 32 and 5000 compounds.
print(np.count_nonzero((assay_sizes >= 32) & (assay_sizes <= 5000)))
# prints 23832

Is there something obvious that I'm missing?

Loss computation for Prototypical Networks

Dear authors,

Thank you for making the code public. I was looking through the loss computation of prototypical networks in this line and was wondering what the rationale is behind multiplying the batch loss by the number of query samples. It seems to me that the loss was already computed using all query samples in the batch in this line, so this scaling may not be necessary? I may well have misunderstood it, so please let me know what you think. Thank you!

new environment recipe?

I've really enjoyed using MoLeR and thought I should give this a try. I ran into a few issues installing this with CUDA 12.2. On the off chance, do you have an updated yaml file?

Clarification on how to use FS-MOL in a regression context

Thanks a lot for providing FS-mol. Very valuable to the community!

I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says those are IC50/EC50 values:

ChEMBL contains the results of many experiments, termed “assays”, each having a unique experiment ID. We retained only those measurements referring to small molecule activity (IC50 or EC50).

However, the code here suggests that percentage as a unit might also have been used during the creation of the dataset:

if df.iloc[0]["standard_units"] == "%":
which is totally fine I guess when only using it to extract activity / non-activity :)
When checking some assays in the train task list (anecdotally), there are indeed assays that use % as the unit, e.g.:

I am not sure to what extent it makes sense to apply a log-transformation to percentage values in the range [0, 100]. However, this is done in the FS-MOL dataset, and the community is slowly starting to do the same (I guess because only IC50 / EC50 values are assumed?) --> https://github.com/Wenlin-Chen/ADKF-IFT/blob/c96919d553313b267240dc1409ae65160c629aab/fs_mol/data/dkt.py#L111 (the corresponding paper: https://arxiv.org/pdf/2205.02708.pdf)

but we include the regression task (for the actual numeric activity target IC50 or EC50) in our evaluation as well

The community is slowly using FS-MOL also in a regression context. It would be great if we get clarification around this IC50 / EC50 versus percentage issue, or have those assays explicitly labeled maybe?
Thanks a lot for looking into that. Greatly appreciated!

Possible Error in Dataset Notebook

Hello,

In your example notebook for datasets (https://github.com/microsoft/FS-Mol/blob/main/notebooks/dataset.ipynb) the dataset path seems to be wrong. You have put FS_MOL_DATASET_PATH = os.path.join(os.environ['HOME'], "Datasets", "FS-Mol"), but for it to work I had to change this to FS_MOL_DATASET_PATH = os.path.join(os.environ['HOME'], "Datasets", "FS-Mol", "datasets") (i.e. use the "/datasets" dir instead of the base dir in the repo). Perhaps you could update this if it is a mistake, or clarify it if it is not a mistake?
