thinng / graphdta
GraphDTA: Predicting drug-target binding affinity with graph neural networks
Home Page: https://doi.org/10.1093/bioinformatics/btaa921
Thank you for your excellent work. I have read your code carefully and have a question about the one-dimensional convolution. You embed the protein sequence into 128 dimensions, so a batch of protein embeddings has shape [512, 1000, 128]. You do not swap the last two dimensions, so the one-dimensional convolution slides along the last (embedding) dimension. However, one-dimensional convolution is usually performed along the sequence dimension, which requires swapping the last two dimensions (permute). In short, your input to nn.Conv1d() is [batch_size, sequence_length, embedded_dim], while PyTorch expects [batch_size, embedded_dim, sequence_length]. I think there is a problem here.
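To make the shape question concrete, here is a minimal sketch of the two layouts. It assumes the convolution was declared with in_channels equal to the sequence length (1000) when no permute is applied; the filter count and kernel size are illustrative values, not confirmed settings from the repository.

```python
import torch
import torch.nn as nn

batch_size, seq_len, embed_dim = 4, 1000, 128  # small batch to keep the demo light
n_filters, kernel_size = 32, 8                 # illustrative values (assumed)

# Stand-in for the output of nn.Embedding: [batch, seq_len, embed_dim]
embedded = torch.randn(batch_size, seq_len, embed_dim)

# Layout A: no permute. Conv1d treats dim 1 as channels, so the 1000 sequence
# positions act as channels and the kernel slides along the 128 embedding dims.
conv_a = nn.Conv1d(in_channels=seq_len, out_channels=n_filters, kernel_size=kernel_size)
out_a = conv_a(embedded)

# Layout B: permute to [batch, embed_dim, seq_len]. The embedding dims act as
# channels and the kernel slides along the sequence, the usual text/sequence setup.
conv_b = nn.Conv1d(in_channels=embed_dim, out_channels=n_filters, kernel_size=kernel_size)
out_b = conv_b(embedded.permute(0, 2, 1))

print(tuple(out_a.shape))  # (4, 32, 121) since 128 - 8 + 1 = 121
print(tuple(out_b.shape))  # (4, 32, 993) since 1000 - 8 + 1 = 993
```

Both calls run without error, so PyTorch will not flag the difference. Which layout is intended is a modelling decision, and only layout B convolves over neighbouring residues.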
Is there an easy way to get the embeddings of a list of SMILES strings using the pre-trained models? For example, given a list such as [smiles_1, smiles_2, ..., smiles_n], how can I get the corresponding embedding vector for each SMILES string?
Hi. In the paper I see mention of pretrained models being available in this repo, however I don't see any .model files? Do you perhaps have those located elsewhere?
Hello thinng,
I ran into an issue when running GraphDTA. The Davis dataset ran successfully on the GPU. However, when I switch to the KIBA dataset, GPU utilization is zero and the model runs for a very long time. Can you give me some advice about this problem? Thanks!
I have some protein sequences and drugs but no affinity data for them. How can I predict the affinity between them with your method? How should I prepare my data, and how do I call your code? Please suggest an approach. Thank you!
I saw that you set epochs=1000 without early stopping. How do you prevent overfitting with so many epochs? Thanks.
In addition, I don't think you can use your test dataset to guide training (training.py). Normally the test dataset must not be touched during training; otherwise the model overfits to it and the reported results do not show that it generalizes. Maybe training_validation.py is the right way to obtain a model that generalizes.
In your paper, your results are better than DeepDTA and WideDTA, but I am a little unsure whether that would hold on a completely new test dataset.
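The idea raised in this issue, selecting the model on a validation split and stopping once it stops improving, can be sketched as below. This is a generic illustration, not the repository's training loop: split_indices, EarlyStopping, the fractions, and the simulated losses are all hypothetical.

```python
import random


def split_indices(n_samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle indices once and cut them into disjoint train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch  # this is where a checkpoint would be saved
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


train_idx, val_idx, test_idx = split_indices(100)

# Simulated validation losses stand in for a real training loop.
simulated_val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7, 0.5]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(simulated_val_losses):
    if stopper.step(epoch, loss):
        break

print(f"stopped at epoch {epoch}, best epoch {stopper.best_epoch}")
```

The test indices are used only once, after training, to evaluate the checkpoint saved at the best validation epoch.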
Hello, thank you for sharing your code and results. I would like to ask you a question about how to use your optimal model to predict my own dataset. I have converted my dataset from csv file to pt file, but when I try to use it for prediction, I encounter an error that says the dataset cannot be input to the model. Could you please give me some guidance on how to solve this problem? Thank you very much!
Dear Team,
I'm trying to understand how graphs of variable sizes such as small molecules can be passed as a batch to a deep learning model with this code. While looking at the output from the DataLoader in training_validation.py, I get the following output with the default parameters set in the code.
`cuda_name: cuda:0
Learning rate: 0.0005
Epochs: 1000
running on GCNNet_davis
Pre-processed data found: data/processed/davis_train.pt, loading ...
Pre-processed data found: data/processed/davis_test.pt, loading ...
Batch(batch=[16399], c_size=[512], edge_index=[2, 36172], target=[512, 1000], x=[16399, 78], y=[512])
Batch(batch=[16332], c_size=[512], edge_index=[2, 36122], target=[512, 1000], x=[16332, 78], y=[512])
Batch(batch=[16269], c_size=[512], edge_index=[2, 36000], target=[512, 1000], x=[16269, 78], y=[512])
Batch(batch=[16193], c_size=[512], edge_index=[2, 35794], target=[512, 1000], x=[16193, 78], y=[512])
Batch(batch=[16418], c_size=[512], edge_index=[2, 36284], target=[512, 1000], x=[16418, 78], y=[512])`
I understand that one batch contains 512 molecular graphs with their corresponding target proteins and affinity values. But I'm confused about how to extract the data for the 1st or 2nd graph in each batch from the DataLoader. I'm a beginner in PyTorch Geometric, so please explain in detail even if this is a very naive question. Another question: does c_size set an upper limit on the number of nodes in the batch? What happens if we do not provide the c_size attribute?
Anticipating your reply and thanks in advance!
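For readers with the same question, here is a sketch of how the batch vector maps nodes back to graphs. It uses plain torch tensors with three toy graphs rather than the real DataLoader; extract_graph is a hypothetical helper, and only the 78 features per node are taken from the log above. In PyTorch Geometric itself, Batch.to_data_list() splits a batch back into individual graphs.

```python
import torch

# Three toy graphs with 3, 2 and 4 nodes, 78 features per node.
node_counts = [3, 2, 4]
x = torch.randn(sum(node_counts), 78)

# batch[i] is the index of the graph that node i belongs to.
batch = torch.repeat_interleave(torch.arange(len(node_counts)),
                                torch.tensor(node_counts))

# Edges use global node numbering: the graphs are stacked into one big
# disconnected graph, which is how variable-size graphs share a batch.
edge_index = torch.tensor([[0, 1, 3, 5, 6, 7],
                           [1, 2, 4, 6, 7, 8]])


def extract_graph(x, edge_index, batch, graph_id):
    """Return node features and locally renumbered edges of one graph."""
    node_mask = batch == graph_id
    offset = int(node_mask.nonzero()[0])   # first global node index of the graph
    edge_mask = node_mask[edge_index[0]]   # edges whose source lies in the graph
    return x[node_mask], edge_index[:, edge_mask] - offset


x_1, edges_1 = extract_graph(x, edge_index, batch, graph_id=1)
print(batch.tolist())     # [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(tuple(x_1.shape))   # (2, 78)
print(edges_1.tolist())   # [[0], [1]]
```

In the printed batches, c_size has shape [512], one entry per graph. That suggests it stores a per-graph value rather than a cap on the batch, but this is only an inference from the shapes.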
Hello Professor,
What are the two outputs of the prediction function in predict_with_pretrained_model.py?
Do they represent the predicted and actual affinity values?
Thank you very much.
Thank you for your repo.
If my dataset looks like this:
SMILES Protein sequence affinity
CC1=C2C.....CC=C4)N MKKFFDSRR.....LLLVDQLIDL 7.365
CC1=C2C.....CC=C4) NSADAQSFLN.....MYTPHTVLQ 4.999
Do you have a function to convert this to Y?
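One possible way to prepare such a table is sketched below. The column names, the 25-letter vocabulary, and the padding length are assumptions (the length 1000 matches target=[512, 1000] in the batch output quoted earlier); check them against create_data.py before relying on them. The SMILES strings and sequences are toy placeholders, not the elided ones from the question.

```python
import csv
import io

# Assumed column names; verify against what create_data.py actually reads.
COLUMNS = ["compound_iso_smiles", "target_sequence", "affinity"]

# Assumed amino-acid vocabulary; index 0 is reserved for padding.
VOCAB = "ABCDEFGHIKLMNOPQRSTUVWXYZ"
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(VOCAB)}


def rows_to_csv(rows, handle):
    """Write (smiles, sequence, affinity) triples as a CSV with a header."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for smiles, sequence, affinity in rows:
        writer.writerow({
            "compound_iso_smiles": smiles,
            "target_sequence": sequence,
            "affinity": float(affinity),
        })


def encode_sequence(sequence, max_len=1000):
    """Map a protein sequence to a fixed-length list of integer codes.

    Sequences longer than max_len are truncated, shorter ones are padded
    with 0, and characters outside the vocabulary are also mapped to 0.
    """
    encoded = [0] * max_len
    for i, ch in enumerate(sequence[:max_len]):
        encoded[i] = CHAR_TO_INT.get(ch, 0)
    return encoded


# Toy placeholder rows.
rows = [("CCO", "MKK", "7.365"), ("CCN", "SAD", "4.999")]
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
print(encode_sequence("MKK", max_len=5))  # [12, 10, 10, 0, 0]
```

The affinity column plays the role of the label y for each drug-target pair, so no separate affinity matrix is needed in this layout.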
Dear Mr. Thin,
I ran into issues when running the pretrained models.
With model_GCNet_davis.model and model_GCNet_kiba.model, it returns MSE and CI scores successfully, but the values might be wrong (GCNet_davis: ci = 0.651, mse = 1.274; GCNet_kiba: ci = 0.639, mse = 0.592).
With model_GINConvNet_davis.model and model_GINConvNet_kiba.model, it also returns MSE and CI scores successfully, but again the values might be wrong (GINConvNet_davis: ci = 0.662, mse = 1.189; GINConvNet_kiba: ci = 0.648, mse = 0.608).
With the other pretrained models, it throws an error, as in the image below:
Steps to reproduce:
python create_data.py
python predict_with_pretrained_model.py
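To sanity-check scores like the ones reported above, the two metrics can be recomputed independently. The sketch below is a plain pairwise implementation of MSE and the concordance index (CI), not the repository's own metric code, and the numbers are made up. For orientation, a CI of 0.5 corresponds to random ordering and 1.0 to perfect ordering.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; tied predictions count as 0.5.
    This O(n^2) loop is meant for checking, not for large datasets.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")


y_true = [5.0, 6.0, 7.0, 8.0]
print(concordance_index(y_true, [5.1, 6.2, 6.9, 8.3]))  # 1.0 (same ordering)
print(concordance_index(y_true, [8.0, 7.0, 6.0, 5.0]))  # 0.0 (reversed ordering)
print(concordance_index(y_true, [1.0, 1.0, 1.0, 1.0]))  # 0.5 (all predictions tied)
print(mse([1.0, 2.0], [1.0, 4.0]))                      # 2.0
```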
(geometric) D:\Anaconda3\envs\geometric\GraphDTA>python create_data.py
Traceback (most recent call last):
  File "create_data.py", line 9, in <module>
    from utils import *
  File "D:\Anaconda3\envs\geometric\GraphDTA\utils.py", line 5, in <module>
    from torch_geometric.data import InMemoryDataset, DataLoader
  File "D:\Anaconda3\envs\geometric\lib\site-packages\torch_geometric\__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "D:\Anaconda3\envs\geometric\lib\site-packages\torch_geometric\nn\__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "D:\Anaconda3\envs\geometric\lib\site-packages\torch_geometric\nn\data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "D:\Anaconda3\envs\geometric\lib\site-packages\torch_geometric\data\__init__.py", line 1, in <module>
    from .data import Data
  File "D:\Anaconda3\envs\geometric\lib\site-packages\torch_geometric\data\data.py", line 7, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "D:\Anaconda3\envs\geometric\lib\site-packages\torch_sparse\__init__.py", line 12, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
  File "D:\Anaconda3\envs\geometric\lib\site-packages\torch\_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "D:\Anaconda3\envs\geometric\lib\ctypes\__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'D:\Anaconda3\envs\geometric\Lib\site-packages\torch_sparse\_convert.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
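One common cause of this kind of load failure is a compiled extension such as torch_sparse that was built for a different PyTorch or CUDA version than the one installed; whether that applies here is a guess. A first diagnostic step is to list the installed versions without importing the packages, since the import itself is what fails. installed_versions is a hypothetical helper and the package list is an assumption about what the environment needs.

```python
from importlib import metadata

# Distribution names assumed to matter for this environment.
PACKAGES = ["torch", "torch-geometric", "torch-sparse", "torch-scatter", "torch-cluster"]


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}.

    Reads package metadata only, so it works even when importing the
    package raises an error.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:16s} {version or 'not installed'}")
```

If the reported torch version carries a CUDA tag (for example a +cu suffix on pip wheels), the extension packages generally need to be installed from wheels built for that same torch and CUDA combination.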
The way protein sequences are processed in the GitHub code differs from the procedure described in the paper. Is this the code that was used in the paper?