
kexinhuang12345 / deeppurpose

912 stars · 30 watchers · 263 forks · 14.8 MB

A Deep Learning Toolkit for DTI, Drug Property, PPI, DDI, Protein Function Prediction (Bioinformatics)

Home Page: https://doi.org/10.1093/bioinformatics/btaa1005

License: BSD 3-Clause "New" or "Revised" License

Python 2.79% Jupyter Notebook 97.21%
drug-repurposing deep-learning drug-target-interactions toolkit covid19 virtual-screening drug-discovery ppi ddi dti-prediction

deeppurpose's People

Contributors

0ling, alex-golts, chao1224, cyrusmaher, futianfan, gumgo91, haokaixina, hima111997, jeanpaulrsoucy, kexinhuang12345, la1av1a, lucasmglass, markcheung, navanchauhan, printomi, pykao, skviswa


deeppurpose's Issues

error in GetSequenceOrderCouplingNumber

When using the Quasi-seq encoding on the BindingDB dataset, I ran into the following error:

Drug Target Interaction Prediction Mode...
in total: 1073803 drug-target pairs
encoding drug...
unique drugs: 549205
encoding protein...
unique target sequence: 5078

KeyError Traceback (most recent call last)
in
1 train, val, test = utils.data_process(X_drugs, X_targets, y,
2 drug_encoding, target_encoding,
----> 3 split_method='cold_drug',frac=[0.7,0.1,0.2])

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in data_process(X_drug, X_target, y, drug_encoding, target_encoding, split_method, frac, random_seed, sample_frac, mode, X_drug_, X_target_)
419 if DTI_flag:
420 df_data = encode_drug(df_data, drug_encoding)
--> 421 df_data = encode_protein(df_data, target_encoding)
422 elif DDI_flag:
423 df_data = encode_drug(df_data, drug_encoding, 'SMILES 1', 'drug_encoding_1')

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in encode_protein(df_data, target_encoding, column_name, save_column_name)
317 df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]
318 elif target_encoding == 'Quasi-seq':
--> 319 AA = pd.Series(df_data[column_name].unique()).apply(GetQuasiSequenceOrder)
320 AA_dict = dict(zip(df_data[column_name].unique(), AA))
321 df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]

~\anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4198 else:
4199 values = self.astype(object)._values
-> 4200 mapped = lib.map_infer(values, f, convert=convert_dtype)
4201
4202 if len(mapped) and isinstance(mapped[0], Series):

pandas_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder(ProteinSequence, maxlag, weight)
1908 """
1909 result = dict()
-> 1910 result.update(GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, _Distance1))
1911 result.update(GetQuasiSequenceOrder2SW(ProteinSequence, maxlag, weight, _Distance1))
1912 result.update(

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, distancematrix)
1794 for i in range(maxlag):
1795 rightpart = rightpart + GetSequenceOrderCouplingNumber(
-> 1796 ProteinSequence, i + 1, distancematrix
1797 )
1798 AAC = GetAAComposition(ProteinSequence)

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetSequenceOrderCouplingNumber(ProteinSequence, d, distancematrix)
1601 temp1 = ProteinSequence[i]
1602 temp2 = ProteinSequence[i + d]
-> 1603 tau = tau + math.pow(distancematrix[temp1 + temp2], 2)
1604 return round(tau, 3)
1605

KeyError: 'mg'
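
The KeyError: 'mg' means the sequence contains characters outside the 20 standard one-letter amino acid codes, so the residue-pair lookup in the distance matrix fails. A minimal pre-filtering sketch, assuming you are willing to drop the offending pairs (X_drugs, X_targets, y are the arrays passed to data_process above):

import re

# Keep only sequences made of the 20 standard one-letter amino acid codes;
# anything else (lowercase fragments like 'mg', 'X', etc.) has no entry in
# the distance matrix used by GetSequenceOrderCouplingNumber.
STANDARD_AA = re.compile(r'^[ACDEFGHIKLMNPQRSTVWY]+$')

keep = [i for i, seq in enumerate(X_targets) if STANDARD_AA.match(seq)]
X_drugs   = [X_drugs[i] for i in keep]
X_targets = [X_targets[i] for i in keep]
y         = [y[i] for i in keep]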

error loading BindingDB data in load_data_tutorial

I was getting a different error before (unfortunately I'm not sure how to reproduce it); here's the error I'm getting now:

data_path = dataset.download_BindingDB('./data/')

Beginning to download dataset...
100% [......................................................................] 327218168 / 327218168
Beginning to extract zip file...
Done!

X_drugs, X_targets, y = dataset.process_BindingDB(path = data_path, df = None, y = 'Kd', binary = False, convert_to_log = True, threshold = 30

File "", line 1
X_drugs, X_targets, y = dataset.process_BindingDB(path = data_path, df = None, y = 'Kd', binary = False, convert_to_log = True, threshold = 30
^
SyntaxError: unexpected EOF while parsing
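
The SyntaxError is simply the missing closing parenthesis on the process_BindingDB call; with it balanced, the line parses:

X_drugs, X_targets, y = dataset.process_BindingDB(path = data_path, df = None,
                                                  y = 'Kd', binary = False,
                                                  convert_to_log = True, threshold = 30)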

The training epochs of KIBA

Hi, Kexin. I'm writing to ask about reproducing the DeepPurpose results. I want to get the result for MPNN+AAC on the KIBA dataset. However, it seems that 150 epochs aren't enough for KIBA even though they work for DAVIS: I can only get a C-index of 0.73, much lower than the value in your paper. So I wonder how many epochs should be set when training on KIBA.

AttributeError in virtual_screening

Code

X_drug = []
X_drug_names = []
with open("./data/drugs.csv") as file:          # context manager closes the file
    for aline in file:
        values = aline.strip().split(",")       # strip() drops the trailing newline
        X_drug.append(values[-1])
        print("Loading Drug", values[0])
        X_drug_names.append(values[0])

target, target_name = dataset.load_SARS_CoV2_Protease_3CL()

net = models.model_pretrained(model = 'Transformer_CNN_BindingDB')
net.config

models.virtual_screening(X_drug, target, net, X_drug_names, target_name)

Error

virtual screening...
in total: 133 drug-target pairs
encoding drug...
unique drugs: 124
drug encoding finished...
encoding protein...
unique target sequence: 1
protein encoding finished...
Done.
predicting...
---------------
Virtual Screening Result
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-76-a764e04fe067> in <module>()
----> 1 models.virtual_screening(X_drug, target, net, X_drug_names, target_name)

/content/DeepPurpose/DeepPurpose/models.py in virtual_screening(X_repurpose, target, model, drug_names, target_names, result_folder, convert_y, output_num_max, verbose)
    459                         f_d = max([len(o) for o in drug_names]) + 1
    460                         f_p = max([len(o) for o in target_names]) + 1
--> 461                         for i in range(target.shape[0]):
    462                                 if model.binary:
    463                                         if y_pred[i] > 0.5:

AttributeError: 'str' object has no attribute 'shape'

I haven't gone through the entire codebase yet, but should this use the length of the target rather than its shape?
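
As a workaround until that line is fixed, it seems to help to give virtual_screening one target entry per drug, so that target.shape[0] is well-defined; a sketch under the assumption that the function accepts an array of sequences the same length as the drug list:

import numpy as np

targets = np.array([target] * len(X_drug))     # repeat the single sequence per drug
target_names = [target_name] * len(X_drug)

models.virtual_screening(X_drug, targets, net, X_drug_names, target_names)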

CNN_Transformer_DAVIS pre-trained model link not present in utils.py

Command:

net = models.model_pretrained(model = 'CNN_Transformer_DAVIS')
net.config
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-55-0074dc9e707f> in <module>()
----> 1 net = models.model_pretrained(model = 'CNN_Transformer_DAVIS')
      2 net.config

1 frames
/content/DeepPurpose/DeepPurpose/utils.py in download_pretrained_model(model_name, save_dir)
    786 
    787         pretrained_dir = os.path.join(save_dir, 'pretrained_model')
--> 788         pretrained_dir_ = wget.download(url, pretrained_dir)
    789 
    790         print('Downloading finished... Beginning to extract zip file...')

UnboundLocalError: local variable 'url' referenced before assignment

This is because there is no elif branch assigning the download link for CNN_Transformer_DAVIS, even though it is listed in the README's pretrained-model section.
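
A sketch of the kind of lookup download_pretrained_model would need so that unregistered names fail loudly instead of hitting an unassigned url variable; the URL below is a placeholder, not the real link:

# Hypothetical mapping for DeepPurpose/utils.py; the real checkpoint URL is
# not in this report, so '<hosting>' is a placeholder.
PRETRAINED_URLS = {
    'CNN_Transformer_DAVIS': 'https://<hosting>/CNN_Transformer_DAVIS.zip',
}

def resolve_pretrained_url(model_name):
    if model_name not in PRETRAINED_URLS:
        raise ValueError('No download link registered for ' + model_name)
    return PRETRAINED_URLS[model_name]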

try except of max_atoms/bond error

Greetings,
I was doing virtual screening using the virtual_screening function when it gave me this error. The same drugs worked fine with a different protein, without raising it.
Traceback (most recent call last):
File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 264, in smiles2mpnnfeature
assert atoms_completion_num >= 0 and bonds_completion_num >= 0
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "play_VS.py", line 20, in
play(dest_repur_db, dest_vs_db, dest_save +'/')
File "play_VS.py", line 9, in play
save_dir= dest_save)
File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/oneliner.py", line 261, in virtual_screening
y_pred = models.virtual_screening(X_repurpose, target, model, drug_names, target_name, convert_y = convert_y, result_folder = result_folder_path, verbose = False)
File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/DTI.py", line 163, in virtual_screening
model.drug_encoding, model.target_encoding, 'virtual screening')
File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 578, in data_process_repurpose_virtual_screening
split_method='repurposing_VS')
File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 499, in data_process
df_data = encode_drug(df_data, drug_encoding)
File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 364, in encode_drug
unique = pd.Series(df_data[column_name].unique()).apply(smiles2mpnnfeature)
File "/share/apps/conda_envs/DeepPurpose/lib/python3.7/site-packages/pandas/core/series.py", line 3848, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
File "/lfs01/workdirs/cairo029u1/deeppurpose/DeepPurpose/DeepPurpose/utils.py", line 266, in smiles2mpnnfeature
raise Exception("increase MAX_ATOM and MAX_BOND in utils")
Exception: increase MAX_ATOM and MAX_BOND in utils
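
One way to avoid the exception without editing the library constants is to drop oversized molecules before screening. A rough RDKit sketch; the limits below are placeholders, so mirror whatever MAX_ATOM and MAX_BOND are actually set to in your DeepPurpose/utils.py:

from rdkit import Chem

MAX_ATOM, MAX_BOND = 400, 400   # assumed values; copy the constants from utils.py

def fits_mpnn(smiles):
    # True if the molecule stays under the MPNN padding limits.
    mol = Chem.MolFromSmiles(smiles)
    return (mol is not None
            and mol.GetNumAtoms() <= MAX_ATOM
            and mol.GetNumBonds() <= MAX_BOND)

X_drug = [s for s in X_drug if fits_mpnn(s)]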

How can I do transfer learning based on your pre-trained DTI models

Thank you so much for your great work. If I have only a small amount of data for a specific problem (for example, 100 drug-protein pairs), I can't train from scratch. So I just want to know how I can use your pre-trained models for transfer learning. Many thanks.
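
A minimal fine-tuning sketch under the usual DeepPurpose flow: load a pretrained checkpoint (the name below is one quoted in another issue on this page), encode the small dataset with the same encodings as the checkpoint, shrink the epoch count and learning rate, then train. Whether train() re-reads train_epoch and LR from the stored config is an assumption to verify:

from DeepPurpose import utils, DTI as models

net = models.model_pretrained(model = 'Transformer_CNN_BindingDB')

# Encode your ~100 pairs with the SAME encodings the checkpoint was trained with.
train, val, test = utils.data_process(X_drugs, X_targets, y,
                                      drug_encoding = 'Transformer',
                                      target_encoding = 'CNN',
                                      split_method = 'random', frac = [0.7, 0.1, 0.2])

# Gentle fine-tuning: few epochs, small learning rate.
net.config['train_epoch'] = 10
net.config['LR'] = 1e-4
net.train(train, val, test)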

Training Configuration of pre-trained MPNN_CNN

Hi Kexin Huang,

I am using the provided pre-trained MPNN_CNN model. When I looked into its model configuration file, it seemed odd to me.

{'input_dim_drug': 1024,
'input_dim_protein': 8420,
'hidden_dim_drug': 128,
'hidden_dim_protein': 256,
'cls_hidden_dims': [1024, 1024, 512],
'batch_size': 16,
'train_epoch': 1,
'LR': 0.001,
'drug_encoding': 'MPNN',
'target_encoding': 'CNN',
'result_folder': './result/',
'binary': False,
'mpnn_hidden_size': 128,
'mpnn_depth': 3,
'cnn_target_filters': [32, 64, 96],
'cnn_target_kernels': [4, 8, 12],
'num_workers': 0,
'decay': 0}

Did you really train this model for only 1 epoch with batch size 16?

Best regards,
Po-Yu Kao

errors when I ran "MPNN_AAC_Kiba.ipynb"

I got this error when I ran "MPNN_AAC_Kiba.ipynb"

RuntimeError: CUDA error: device-side assert triggered

It happened again when I ran "case-study-II-Virtual-Screening-for-BindingDB-IC50.ipynb"

models importing issue

First of all, I would like to say I appreciate your work. I am facing an error when importing models from DeepPurpose; the other DeepPurpose modules work fine.

from DeepPurpose import models


ImportError Traceback (most recent call last)
in
----> 1 from DeepPurpose import models

ImportError: cannot import name 'models'
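
The tracebacks elsewhere on this page locate model_initialize, DBTA, and virtual_screening in DeepPurpose/DTI.py, so on versions without models.py the equivalent import appears to be:

from DeepPurpose import DTI as models   # DTI.py now carries the former models.py API
from DeepPurpose import utils, dataset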

error when loading pretrained model

The error is AttributeError: 'DBTA' object has no attribute 'lower', and my code is:

config = utils.generate_config(
    drug_encoding='CNN',
    target_encoding='CNN',
    result_folder='DeepPurpose_model/Human/DeepDTA/d/0',
    **model_settings['DeepDTA']['config']
)

model = models.model_initialize(**config)
model = models.model_pretrained('DeepPurpose_model/Human/DeepDTA/d/0', model)
print(model)
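
The .lower() failure suggests that the second argument of model_pretrained is interpreted as a pretrained-model name string, not a model object. A sketch that loads from the checkpoint directory alone, assuming a path_dir keyword and that the directory holds the saved config, so no generate_config or model_initialize call is needed first:

model = models.model_pretrained(path_dir = 'DeepPurpose_model/Human/DeepDTA/d/0')
print(model)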

config params error

Hi, it seems that the config from the demo no longer works :(

I used the following cfg

'MPNNAACDTA': {
      'drug_encoding': 'MPNN',
      'target_encoding': 'AAC',
      'cls_hidden_dims': [1024, 1024, 512],
      'train_epoch': 100,
      'LR': 0.001,
      'batch_size': 128,
      'hidden_dim_drug': 128,
      'hidden_dim_protein': 128,
      'input_dim_protein': 128,
      'mlp_hidden_dims_target': [128],
      'mpnn_hidden_size': 128,
      'mpnn_depth': 3,
      'cnn_target_filters': [32, 64, 96],
      'cnn_target_kernels': [4, 8, 12]
  }

and it got the error RuntimeError: mat1 dim 1 must match mat2 dim 0

My DeepPurpose version is 0.0.5.
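
The mat1/mat2 mismatch is consistent with input_dim_protein being forced to 128 while the AAC featurizer emits a much longer vector; note the pretrained config quoted in an earlier issue shows input_dim_protein: 8420. A sketch that lets generate_config pick the protein input dimension itself, assuming its default matches the AAC feature length:

from DeepPurpose import utils

config = utils.generate_config(drug_encoding = 'MPNN',
                               target_encoding = 'AAC',
                               cls_hidden_dims = [1024, 1024, 512],
                               train_epoch = 100,
                               LR = 0.001,
                               batch_size = 128,
                               hidden_dim_drug = 128,
                               hidden_dim_protein = 128,
                               mlp_hidden_dims_target = [128],
                               mpnn_hidden_size = 128,
                               mpnn_depth = 3)
# No input_dim_protein override: the default must match the AAC feature length.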

Why are there different results when I use the same inputs in the repurpose and virtual_screening functions?

I used the repurpose and virtual_screening functions from oneliner.py. The drugs and the protein were the same in both cases and I used the pretrained models; however, the results were different.

Why did this happen? The inputs (drug SMILES and protein sequence) and the models are the same, so shouldn't the results be the same too?
In the case of virtual_screening I used one sequence but repeated it many times.

The input file for repurpose was as follows:
smile_files:
drug_name1 drug_smiles1
drug_name2 drug_smiles2
drug_name3 drug_smiles3
...

Protein: I used the function load_SARS_CoV2_Helicase().

The input file for virtual_screening was as follows:

input_file:
drug_smile1 protein_sequence
drug_smile2 protein_sequence
drug_smile3 protein_sequence
...

Filename Issue in Tutorial_2_Drug_Property_Pred_Assay_Data.ipynb

Hi Kexin,

There is an issue on loading the HIV data. After I ran the following commands:

X_drugs, y, drugs_index = dataset.load_HIV(path = './data')
print('Drug 1: ' + X_drugs[0])
print('Score 1: ' + str(y[0]))

It gave me a FileNotFoundError.

The code in /DeepPurpose/dataset.py tries to find the file hiv.csv under the data folder. However, unzipping hiv.zip produces HIV.csv, so there is a name mismatch (hiv.csv vs. HIV.csv).

After I renamed HIV.csv to hiv.csv, the filename error went away (a small rename shim is sketched below).

Best,
Ken
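
Until the loader is fixed, a small shim that renames the extracted file before loading (assuming, per the report above, that load_HIV wants the lowercase name):

import os

src, dst = './data/HIV.csv', './data/hiv.csv'
if os.path.exists(src) and not os.path.exists(dst):
    os.rename(src, dst)          # dataset.load_HIV looks for hiv.csv

X_drugs, y, drugs_index = dataset.load_HIV(path = './data')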

Error when I chose 'ErG' as my drug_encoding

After I changed drug_encoding from 'CNN' to 'ErG' in "DeepDTA_Reproduce_KIBA.ipynb",
when I ran this cell
model.train(train, val, test)
I got:

AttributeError Traceback (most recent call last)
in
----> 1 model = models.model_initialize(**config)
2 model.train(train, val, test)

~/projects/DeepPurpose/DeepPurpose/DTI.py in model_initialize(**config)
57
58 def model_initialize(**config):
---> 59 model = DBTA(**config)
60 return model
61

~/projects/DeepPurpose/DeepPurpose/DTI.py in init(self, **config)
267 self.model_drug = MPNN(config['hidden_dim_drug'], config['mpnn_depth'])
268 else:
--> 269 raise AttributeError('Please use one of the available encoding method.')
270
271 if target_encoding == 'AAC' or target_encoding == 'PseudoAAC' or target_encoding == 'Conjoint_triad' or target_encoding == 'Quasi-seq' or target_encoding == 'ESPF':

AttributeError: Please use one of the available encoding method.

Did you fine tune every model?

Thank you so much for your great repo. From your demos, you always set epochs=100 for training. If we want to use some of the models, do we need to fine tune the hyperparameters and retrain them?

error in mpnn_feature_collate_func

Please note that I may have been using this function completely wrong (I called it outside of where it's supposed to be called) but I figured I should submit the bug report.

TypeError Traceback (most recent call last)
in
31 t_start = time()
32 for epo in range(train_epoch):
---> 33 for i, (v_d, v_p, label) in enumerate(training_generator):
34 if self.target_encoding == 'Transformer':
35 v_p = v_p

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in next(self)
343
344 def next(self):
--> 345 data = self._next_data()
346 self._num_yielded += 1
347 if self._dataset_kind == _DatasetKind.Iterable and \

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
383 def _next_data(self):
384 index = self._next_index() # may raise StopIteration
--> 385 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
386 if self._pin_memory:
387 data = _utils.pin_memory.pin_memory(data)

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data_utils\fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\DTI.py in mpnn_collate_func(x)
219 mpnn_feature = [i[0] for i in x]
220 #print("len(mpnn_feature)", len(mpnn_feature), "len(mpnn_feature[0])", len(mpnn_feature[0]))
--> 221 mpnn_feature = mpnn_feature_collate_func(mpnn_feature)
222 from torch.utils.data.dataloader import default_collate
223 x_remain = [[i[1], i[2]] for i in x]

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\DTI.py in mpnn_feature_collate_func(x)
212 def mpnn_feature_collate_func(x):
213 ## first version
--> 214 return [torch.cat([x[j][i] for j in range(len(x))], 0) for i in range(len(x[0]))]
215
216 def mpnn_collate_func(x):

TypeError: object of type 'numpy.float64' has no len()

utils.py exists in two places

There's a version of utils.py in the root directory and a newer version of utils.py in the DeepPurpose/DeepPurpose directory. Should the one in the root directory be deleted?

How to use DeepPurpose for Virtual screening?

Greetings,

I want to use DeepPurpose for virtual screening, using drugs downloaded from databases against a certain protein.

Can you give me information on how to do this, such as how to prepare the drugs and the protein?

Thanks
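
A minimal end-to-end sketch assembled from pieces that appear in other issues on this page; the drug-file format, and whether virtual_screening accepts the target repeated as a list, are assumptions:

from DeepPurpose import DTI as models, dataset

# Drug library: one "name,SMILES" pair per line (file format assumed).
X_drug, X_drug_names = [], []
with open('./data/drugs.csv') as f:
    for line in f:
        name, smiles = line.strip().split(',')
        X_drug.append(smiles)
        X_drug_names.append(name)

# Target: any amino acid sequence; here a loader quoted in an earlier issue.
target, target_name = dataset.load_SARS_CoV2_Protease_3CL()

# Pretrained DTI model (name quoted in an earlier issue), then rank the library.
net = models.model_pretrained(model = 'Transformer_CNN_BindingDB')
models.virtual_screening(X_drug, [target] * len(X_drug), net,
                         X_drug_names, [target_name] * len(X_drug))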

What is the limit of a good binding score?

Hi!
I have used the DeepPurpose library to screen a database, and now I want to select the best-binding drugs based on the binding score.
Is there a cutoff for the binding score below which I can select the drugs?

Thanks

How can I know if the model is overfitting?

From your demos and tutorials, you always set epoch=100, the learning rate is constant, and you don't show a comparison between training and validation losses. I saw code for early stopping somewhere in the repo, but I don't know how to configure it. Do you have a learning rate scheduling function? Thank you!

Do you include MolTrans in this repo

You have another DTI repo called MolTrans. I think you didn't include that model in this toolkit, am I right? If it is included, what is the difference between the DTI model of MolTrans and the other models in this repo, and which one is better? Thanks a lot.

Model configuration error in Tutorial_2_Drug_Property_Pred_Assay_Data

Hello again,
I am having trouble initializing a model using the code in "Tutorial 2: Training a Drug Property Prediction Model from Scratch for Assay Data". Here are the errors I'm getting:

config = utils.generate_config(drug_encoding = drug_encoding, 
                         cls_hidden_dims = [1024,1024,512], 
                         train_epoch = 5, 
                         LR = 0.001, 
                         batch_size = 128,
                         hidden_dim_drug = 128,
                         mpnn_hidden_size = 128,
                         mpnn_depth = 3
                        )

model = models.model_initialize(**config)
model

AttributeError Traceback (most recent call last)
in
----> 1 model = models.model_initialize(**config)
2 model

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\DTI.py in model_initialize(**config)
57
58 def model_initialize(**config):
---> 59 model = DBTA(**config)
60 return model
61

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\DTI.py in init(self, **config)
259 self.model_protein = transformer('protein', **config)
260 else:
--> 261 raise AttributeError('Please use one of the available encoding method.')
262
263 self.model = Classifier(self.model_drug, self.model_protein, **config)

AttributeError: Please use one of the available encoding method.

model.train(train, val, test)

Let's use CPU/s!
--- Data Preparation ---
--- Go for Training ---


KeyError Traceback (most recent call last)
C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2890 try:
-> 2891 return self._engine.get_loc(casted_key)
2892 except KeyError as err:

pandas_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'target_encoding'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
in
----> 1 model.train(train, val, test)

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\DTI.py in train(self, train, val, test, verbose)
392 t_start = time()
393 for epo in range(train_epoch):
--> 394 for i, (v_d, v_p, label) in enumerate(training_generator):
395 if self.target_encoding == 'Transformer':
396 v_p = v_p

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in next(self)
343
344 def next(self):
--> 345 data = self._next_data()
346 self._num_yielded += 1
347 if self._dataset_kind == _DatasetKind.Iterable and \

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data\dataloader.py in _next_data(self)
383 def _next_data(self):
384 index = self._next_index() # may raise StopIteration
--> 385 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
386 if self._pin_memory:
387 data = _utils.pin_memory.pin_memory(data)

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data_utils\fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\torch\utils\data_utils\fetch.py in (.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]

~\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\utils.py in getitem(self, index)
519 if self.config['drug_encoding'] == 'CNN' or self.config['drug_encoding'] == 'CNN_RNN':
520 v_d = drug_2_embed(v_d)
--> 521 v_p = self.df.iloc[index]['target_encoding']
522 if self.config['target_encoding'] == 'CNN' or self.config['target_encoding'] == 'CNN_RNN':
523 v_p = protein_2_embed(v_p)

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in getitem(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if (

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
989
990 # Similar to Index.get_value, but we do not fall back to positional
--> 991 loc = self.index.get_loc(label)
992 return self.index._get_values_for_loc(self, loc, label)
993

C:\ProgramData\Anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2891 return self._engine.get_loc(casted_key)
2892 except KeyError as err:
-> 2893 raise KeyError(key) from err
2894
2895 if tolerance is not None:

KeyError: 'target_encoding'
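
Tutorial 2 is a drug-property task with no target, and the KeyError shows the DTI data pipeline looking for a target_encoding column. Given that Tutorial 2's first cell imports property_pred (see the import-error issue later on this page), one guess, sketched here as an assumption rather than a confirmed API, is to initialize the model from that module instead of the DTI one:

from DeepPurpose import utils, property_pred

config = utils.generate_config(drug_encoding = drug_encoding,
                               cls_hidden_dims = [1024, 1024, 512],
                               train_epoch = 5,
                               LR = 0.001,
                               batch_size = 128,
                               hidden_dim_drug = 128,
                               mpnn_hidden_size = 128,
                               mpnn_depth = 3)

# Assumed API: property_pred is expected to expose a drug-only model_initialize,
# avoiding the DTI pipeline that indexes the missing 'target_encoding' column.
model = property_pred.model_initialize(**config)
model.train(train, val, test)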

Pre-train Transformer of Drug

Dear Kexin,

According to the MT-DTI paper, they pre-trained the transformer on 97,092,853 molecules with canonical SMILES from PubChem. I am just curious: if I set drug_encoding='Transformer', does your code use those pre-trained weights?

Thank you for your answering.

Best,
Po-Yu Kao

error in calcPubChemFingerPart1

When running data_process on the BindingDB dataset, I'm getting the following error:


AttributeError Traceback (most recent call last)
in
1 train, val, test = utils.data_process(X_drugs, X_targets, y,
2 drug_encoding, target_encoding,
----> 3 split_method='cold_drug',frac=[0.7,0.1,0.2])

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in data_process(X_drug, X_target, y, drug_encoding, target_encoding, split_method, frac, random_seed, sample_frac, mode, X_drug_, X_target_)
418
419 if DTI_flag:
--> 420 df_data = encode_drug(df_data, drug_encoding)
421 df_data = encode_protein(df_data, target_encoding)
422 elif DDI_flag:

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\utils.py in encode_drug(df_data, drug_encoding, column_name, save_column_name)
265 df_data[save_column_name] = [unique_dict[i] for i in df_data[column_name]]
266 elif drug_encoding == 'Pubchem':
--> 267 unique = pd.Series(df_data[column_name].unique()).apply(calcPubChemFingerAll)
268 unique_dict = dict(zip(df_data[column_name].unique(), unique))
269 df_data[save_column_name] = [unique_dict[i] for i in df_data[column_name]]

~\anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4198 else:
4199 values = self.astype(object)._values
-> 4200 mapped = lib.map_infer(values, f, convert=convert_dtype)
4201
4202 if len(mapped) and isinstance(mapped[0], Series):

pandas_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in calcPubChemFingerAll(s)
3377
3378 def calcPubChemFingerAll(s):
-> 3379 mol = Chem.MolFromSmiles(s)
3380 AllBits=[0]*881
3381 res1=list(calcPubChemFingerPart1(mol).ToBitString())

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in calcPubChemFingerPart1(mol, **kwargs)
2690 if count == 0:
2691 res[i + 1] = mol.HasSubstructMatch(patt)
-> 2692 else:
2693 print('ne')
2694 matches = mol.GetSubstructMatches(patt)

AttributeError: 'NoneType' object has no attribute 'GetSubstructMatches'

Convert from nM to p

Hi,

I am confused about your convert_y_unit function. I think it mainly converts Kd to pKd and pKd back to Kd. However, why don't you just use y = -np.log10(y*1e-9) here?

def convert_y_unit(y, from_, to_):
	# basis as nM

	if from_ == 'nM':
		y = y
	elif from_ == 'p':
		y = 10**(-y) / 1e-9

	if to_ == 'p':
		y = -np.log10(y*1e-9 + 1e-10)
	elif to_ == 'nM':
		y = y

	return y

print(convert_y_unit(convert_y_unit(100, 'p', 'nM'), 'nM', 'p'))
print(convert_y_unit(convert_y_unit(100, 'nM', 'p'), 'p', 'nM'))
print(convert_y_unit(100, 'p', 'p'))
print(convert_y_unit(100, 'nM', 'nM'))

It gave me:

10.0
100.09999999999994
10.0
100

I think the answer should be 100 for all four combinations of convert_y_unit calls.

What is the purpose of adding 1e-10 inside the log function?

Best,
Ken

How to obtain target proteins' amino acid sequences (t) and drugs' SMILES strings (d)

I am a novice in DTI research. I want to know how to obtain an array of drug SMILES strings (d) and an array of target protein amino acid sequences (t), in order to work through "Tutorial_1_DTI_Prediction".

Suppose I have found the following using DrugBank data:
Drug ID Target ID Score

DB08604 P0AEK4 0.931528
DB07181 P0AEK4 0.931504
DB08642 P16184 0.931335
DB03233 P0A884 0.931334
DB07411 P0AEK4 0.931313
DB07209 P27338 0.931300
DB03072 P0AEK4 0.931230
DB02727 Q9Y296 0.931186
DB06840 Q9Y296 0.931151
DB07972 P0AEK4 0.931095
DB08700 P0AEK4 0.931029
DB07647 P0AEK4 0.931003
DB01861 P96945 0.930968
...........................................
............................................

Questions:
1. How can I get the target protein's amino acid sequence (t) for a large number of Target IDs? (A retrieval sketch follows below.)
2. How can I get the drugs' SMILES strings for a large number of Drug IDs?
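
For the protein side, UniProt serves FASTA records by accession over a public REST API; the endpoint format below is my understanding of that service, not part of DeepPurpose. For the drug side, DrugBank distributes structure downloads that map Drug IDs to SMILES, but those require a DrugBank account, so no sketch is given:

import requests

def uniprot_sequence(accession):
    # Fetch the amino acid sequence for one UniProt accession, e.g. 'P0AEK4'.
    r = requests.get('https://rest.uniprot.org/uniprotkb/' + accession + '.fasta')
    r.raise_for_status()
    return ''.join(r.text.splitlines()[1:])    # drop the FASTA header line

t = [uniprot_sequence(acc) for acc in ['P0AEK4', 'P16184', 'P0A884']]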

How did you train MPNN_CNN_BindingDB_IC50?

Hi,

I am trying to train an MPNN/CNN model using around 1.2M IC50 interactions from the BindingDB dataset (2021m0). However, the first problem I encountered was the memory footprint of the MPNN drug encoder: to train on all interactions I need to set MAX_ATOM = 700, which runs out of memory even though my server has 252 GB of RAM.

How did you manage to train MPNN_CNN_BindingDB_IC50 successfully? Did you train the model with the previous (non-parallel) version of the MPNN drug encoder, or did you ignore the interactions with long SMILES sequences?

Best,
Ken Kao

Question regarding the DAVIS dataset

Hi Kexin,

For the DAVIS dataset, it has 68 drugs, 379 proteins, and 30,056 interactions. That looks odd to me: if there is only one interaction between each drug and each protein, the maximum number of interactions would be 68 x 379 = 25,772. How can we have more than 25,772 interactions?

Best,
Po-Yu

Encounter RuntimeError When Running Tutorial_1_DTI_Prediction

Dear Kexin Huang,

This is amazing work. Thank you for making DTI prediction much easier for both scientists and engineers.

I tried to run Tutorial_1_DTI_Prediction, but it gives me an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-4686be42c026> in <module>
----> 1 model.train(train, val, test)

~/DeepPurpose/DeepPurpose/DTI.py in train(self, train, val, test, verbose)
    438                     #score = self.model(v_d, v_p.float().to(self.device))
    439 
--> 440                 score = self.model(v_d, v_p)
    441                 label = Variable(torch.from_numpy(
    442                     np.array(label)).float()).to(self.device)

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    150             return self.module(*inputs[0], **kwargs[0])
    151         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 152         outputs = self.parallel_apply(replicas, inputs, kwargs)
    153         return self.gather(outputs, self.output_device)
    154 

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    160 
    161     def parallel_apply(self, replicas, inputs, kwargs):
--> 162         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    163 
    164     def gather(self, outputs, output_device):

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     83         output = results[i]
     84         if isinstance(output, ExceptionWrapper):
---> 85             output.reraise()
     86         outputs.append(output)
     87     return outputs

~/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/_utils.py in reraise(self)
    392             # (https://bugs.python.org/issue2651), so we work around it.
    393             msg = KeyErrorMessage(msg)
--> 394         raise self.exc_type(msg)

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/ken/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ken/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ken/DeepPurpose/DeepPurpose/DTI.py", line 48, in forward
    v_D = self.model_drug(v_D)
  File "/home/ken/anaconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ken/DeepPurpose/DeepPurpose/encoders.py", line 267, in forward
    n_a = atoms_bonds[i,0].item()
RuntimeError: CUDA error: device-side assert triggered

I think it might be a parallel error of CUDA. Could you please guide me to solve this problem?

The CUDA version is 10.2.89, and the driver version is 450.66.

I ran the data parallelism tutorial from PyTorch and it works for me.

Best,
Ken
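
Device-side asserts are raised asynchronously, so the line PyTorch reports can be far from the failing op. Two standard PyTorch debugging steps (general practice, not DeepPurpose-specific), both of which must happen before CUDA is initialized:

import os

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # synchronous kernels: real op in traceback
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # single GPU: rules out DataParallel replica issues

import torch   # import AFTER the environment variables are set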

pretrained model not found

[screenshot]

I tried to download them manually using wget on Colab, using the link in the code:

[screenshot]

I tried to open the link directly and it gave me this:

[screenshot]

Missing models.py

Cool package! The models.py file seems to be missing from master, so from DeepPurpose import models doesn't work.

import error in Tutorial_2_Drug_Property_Pred_Assay_Data

Hi,
When trying to run the first cell of "Tutorial 2: Training a Drug Property Prediction Model from Scratch for Assay Data", I am running into the following error:

---------------------------------------------------------------------------ImportError Traceback (most recent call last)
<ipython-input-17-5d2978e4b9f3> in <module>
----> 3 from DeepPurpose import utils, dataset, property_pred
ImportError: cannot import name 'property_pred' from 'DeepPurpose' (C:\Users\Julia\Dropbox\Work\insight\omic\DeepPurpose\DeepPurpose\__init__.py)

I am using Windows but have tried using WSL and Amazon Linux and the error persists.

The latest version of BindingDB

Hi,

I think the BindingDB version you use in this repo is BindingDB_All_2020m2. Would you mind updating to version BindingDB_All_2020m10? I can make a PR if you think this is a good idea.

Best,
Ken

Where did you save "pretrained models on BindingDB IC50"

When I tried your demo "oneliner-3CLpro-finetuning-AID1706.ipynb", I got:

FileNotFoundError: [Errno 2] No such file or directory: './save_folder/pretrained_models/DeepPurpose_BindingDB/model_MPNN_CNN/config.pkl'

I couldn't find ./save_folder/. In your README, you said "[11/20] Added 5 more pretrained models on BindingDB IC50 Units (around 1 million data points)".

Thank you

how to generate drug or protein embeddings

Hi, I am working on a related project and trying to use DeepPurpose to generate drug and protein embeddings for other downstream tasks.
I would like to ask: is there a function/method in DeepPurpose that generates representation vectors for a list of drugs or proteins, instead of directly predicting an affinity score?
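
There may not be a dedicated embedding API, but the fixed (non-learned) descriptor functions that the encoders call internally are importable directly; the function names below appear verbatim in tracebacks elsewhere on this page, so only the example inputs are assumptions:

from DeepPurpose.pybiomed_helper import calcPubChemFingerAll, GetQuasiSequenceOrder

# 881-bit PubChem fingerprint for a drug (aspirin SMILES as an example).
drug_vec = calcPubChemFingerAll('CC(=O)Oc1ccccc1C(=O)O')

# Quasi-sequence-order descriptors for a protein, returned as a dict
# (toy sequence of standard residues; must be longer than the default maxlag).
toy_seq = 'ACDEFGHIKLMNPQRSTVWY' * 3
prot_vec = list(GetQuasiSequenceOrder(toy_seq).values())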

BUG

auc, auprc, f1, logits = self.test_(testing_generator, model_max, test = True)

The function test_ returns 5 items, not 4, when binary is True, so this line should be:

auc, auprc, f1, log_loss, logits = self.test_(testing_generator, model_max, test = True)

questions about usage

How can I use a DBTA model I trained myself to predict a new pair, and what is the input format?
Thanks! :)
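
One route that stays within functions visible elsewhere on this page is to treat the single pair as a one-item screen; the inputs are plain lists of raw strings (the drug shown is aspirin; the target sequence is an elided placeholder to fill in). If this trips the 'str' object has no attribute 'shape' error reported in an earlier issue, wrap the lists with numpy.array first:

drug_smiles = ['CC(=O)Oc1ccccc1C(=O)O']   # SMILES of the new drug (aspirin as an example)
target_seqs = ['MKT...']                  # full amino acid sequence of the new target
models.virtual_screening(drug_smiles, target_seqs, model,
                         ['my_drug'], ['my_target'])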

error in GetSequenceOrderCouplingNumber

I'm calling GetQuasiSequenceOrder for every protein in the BindingDB list and running into this error. I understand this may be happening because I'm calling the function directly rather than going through data_process, but I'd still like to be able to call the encoding functions on their own.


KeyError Traceback (most recent call last)
in
8 for func in prot_func_list:
9 save_column_name = func.name
---> 10 AA = pd.Series(df_data[column_name].unique()).apply(func)
11 AA_dict = dict(zip(df_data[column_name].unique(), AA))
12 df_data[save_column_name] = [AA_dict[i] for i in df_data[column_name]]

~\anaconda3\envs\multiPurpose\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4198 else:
4199 values = self.astype(object)._values
-> 4200 mapped = lib.map_infer(values, f, convert=convert_dtype)
4201
4202 if len(mapped) and isinstance(mapped[0], Series):

pandas_libs\lib.pyx in pandas._libs.lib.map_infer()

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder(ProteinSequence, maxlag, weight)
1908 """
1909 result = dict()
-> 1910 result.update(GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, _Distance1))
1911 result.update(GetQuasiSequenceOrder2SW(ProteinSequence, maxlag, weight, _Distance1))
1912 result.update(

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetQuasiSequenceOrder1SW(ProteinSequence, maxlag, weight, distancematrix)
1794 for i in range(maxlag):
1795 rightpart = rightpart + GetSequenceOrderCouplingNumber(
-> 1796 ProteinSequence, i + 1, distancematrix
1797 )
1798 AAC = GetAAComposition(ProteinSequence)

~\Dropbox\Work\insight\omic\DeepPurpose-omic\DeepPurpose\pybiomed_helper.py in GetSequenceOrderCouplingNumber(ProteinSequence, d, distancematrix)
1601 temp1 = ProteinSequence[i]
1602 temp2 = ProteinSequence[i + d]
-> 1603 tau = tau + math.pow(distancematrix[temp1 + temp2], 2)
1604 return round(tau, 3)
1605

KeyError: 'IX'
