lsj2408 / Transformer-M
[ICLR 2023] One Transformer Can Understand Both 2D & 3D Molecular Data (official implementation)
Home Page: https://arxiv.org/abs/2210.01765
License: MIT License
Very enlightening work. Congratulations on your great achievements in the OGB Challenge! I also noticed that you fine-tuned on the PDBbind dataset. How do you encode the protein information? Since proteins usually contain many more heavy atoms, do you use Transformer-M directly to encode the proteins?
Hi!
Thank you for introducing such an interesting model to us and sharing the code!
I'm trying to run the model on 2D structures only. Would you mind providing a script for training the model with 2D structures alone (e.g., on PCQM4M-LSC-V2)?
I tried changing dataset_name and setting add_3D to false in the sample 3D training script from the README, but that doesn't work. Looking into the code, in tasks/graph_prediction.py (class GraphPredictionTask, function load_dataset), when BatchedDataDataset is constructed with dataset_version set to "2D" for PCQM4M-LSC-V2, I get an error at criterions/graph_predictions.py line 45: ori_pos = sample['net_input']['batched_data']['pos'], KeyError: 'pos'.
Thank you so much!
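Until an official 2D-only script is available, one workaround for the KeyError is a defensive lookup in the criterion so the 3D denoising loss is simply skipped when no coordinates exist. This is a minimal sketch, assuming the batched_data dict layout implied by the error message; the helper name is mine, not from the repo:

```python
# Sketch: guard against a missing 'pos' key for 2D-only batches.
def get_positions(sample):
    batched_data = sample['net_input']['batched_data']
    return batched_data.get('pos')  # None when the data is 2D-only

# A 2D-only batch has no 'pos' entry:
sample = {'net_input': {'batched_data': {'x': [[6, 0, 1]]}}}
ori_pos = get_positions(sample)
if ori_pos is None:
    print("2D-only batch: skipping the 3D denoising loss")
```

Whether skipping the denoising term is the intended 2D-only behavior is an assumption that the authors would need to confirm.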
Hi,
I ran into a problem when trying to load 'L12-old.pt' in finetuneqm9.sh: the program reports that the checkpoint's structure mismatches the model. How can I solve this?
Hi,
Thank you for the code and the surrounding instructions!
I was trying to reproduce the results but am having some difficulty getting the environment to work.
I installed CUDA and the other package versions as mentioned, but torch_scatter errored out with "'NoneType' object has no attribute 'origin'". After looking it up online, I uninstalled your recommended version and installed another one with pip install --no-index torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html
(even though I have PyTorch 1.7.1). But now torch_sparse errors out with:
div(float a, Tensor b) -> (Tensor):
Expected a value of type 'Tensor' for argument 'b' but instead found type 'int'.
div(int a, Tensor b) -> (Tensor):
Expected a value of type 'Tensor' for argument 'b' but instead found type 'int'.
The original call is:
File "/nethome/yjakhotiya3/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/storage.py", line 316
idx = self.sparse_size(1) * self.row() + self.col()
row = torch.div(idx, num_cols, rounding_mode='floor')
~~~~~~~~~ <--- HERE
col = idx % num_cols
assert row.dtype == torch.long and col.dtype == torch.long
I also tried other machines but got unknown CUDA errors from torch.distributed (which could be due to an unrelated driver version mismatch).
Did you encounter any of these issues, or do you have any advice on how to navigate them?
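For what it's worth, the torch-scatter/torch-sparse binary wheels must match the installed torch and CUDA versions exactly, and most of the errors above are symptoms of a mismatch. A small helper (a sketch using the pytorch-geometric wheel URL pattern; the function name is hypothetical) that builds the wheel index URL to pass to pip's -f flag:

```python
def wheel_index_url(torch_version, cuda_version):
    """Build the pytorch-geometric wheel index URL for a given
    torch/CUDA pair, e.g. ('1.7.0', '11.0') -> .../torch-1.7.0+cu110.html."""
    cu = 'cu' + cuda_version.replace('.', '')
    return f"https://pytorch-geometric.com/whl/torch-{torch_version}+{cu}.html"

# Then: pip install torch-scatter torch-sparse -f <printed URL>
print(wheel_index_url('1.7.0', '11.0'))
```

Checking `torch.__version__` and `torch.version.cuda` first, and feeding exactly those values into the URL, avoids installing wheels built against a different ABI.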
Great piece of work.
Could you please explain how to load your checkpoint?
Hi,
Would it be possible to provide the commands for training a model on QM9 from scratch? This is mentioned in Appendix B.5, where the effectiveness of pre-training is investigated.
Kind regards,
Rob
Thanks for your amazing work!
I wonder why the checkpoint download link does not point to a '*.pt' file?
Issue Description:
Hello.
I have discovered a performance regression in the read_csv function of pandas version 1.3.4 when handling CSV files with a large number of columns. It increases the loading time from just a few seconds in the previous version 1.2.5 to several minutes, almost a 60x difference. I found some related discussions on GitHub, including #44106 and #44192.
I found that Transformer-M/data/wrapper.py, examples/MMPT/scripts/video_feature_extractor/videoreader.py, and examples/MMPT/mmpt/processors/dsprocessor.py use the affected API.
Steps to Reproduce:
I have created a small reproducible example to better illustrate this issue.
# v1.3.4
import os
import pandas
import numpy
import timeit

def generate_sample():
    if not os.path.exists("test_small.csv.gz"):
        nb_col = 100000
        nb_row = 5
        feature_list = {'sample': ['s_' + str(i+1) for i in range(nb_row)]}
        for i in range(nb_col):
            feature_list.update({'feature_' + str(i+1): list(numpy.random.uniform(low=0, high=10, size=nb_row))})
        df = pandas.DataFrame(feature_list)
        df.to_csv("test_small.csv.gz", index=False, float_format="%.6f")

def load_csv_file():
    col_names = pandas.read_csv("test_small.csv.gz", low_memory=False, nrows=1).columns
    types_dict = {col: numpy.float32 for col in col_names}
    types_dict.update({'sample': str})
    feature_df = pandas.read_csv("test_small.csv.gz", index_col="sample", na_filter=False, dtype=types_dict, low_memory=False)
    print("loaded dataframe shape:", feature_df.shape)

generate_sample()
timeit.timeit(load_csv_file, number=1)

# results
loaded dataframe shape: (5, 100000)
120.37690759263933

# v1.3.5 (identical script, re-run after upgrading pandas)
generate_sample()
timeit.timeit(load_csv_file, number=1)

# results
loaded dataframe shape: (5, 100000)
2.8567268839105964
Suggestion
I would recommend upgrading to pandas >= 1.3.5 or exploring other ways to optimize CSV loading performance.
Any other workarounds or solutions would be greatly appreciated.
Thank you!
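If pinning is the chosen fix, the affected range can also be checked at import time. This is a dependency-free sketch (both helper names are mine, not from the repo) that compares version tuples and flags the slow 1.3.x releases before the fix landed in 1.3.5:

```python
def version_tuple(v):
    """'1.3.4' -> (1, 3, 4); ignores non-numeric suffix characters."""
    parts = []
    for p in v.split('.'):
        digits = ''.join(ch for ch in p if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def has_slow_read_csv(pandas_version):
    """True for the 1.3.x releases affected by the read_csv regression."""
    v = version_tuple(pandas_version)
    return (1, 3, 0) <= v < (1, 3, 5)

# Example: warn on the regressed version.
if has_slow_read_csv('1.3.4'):
    print("warning: pandas read_csv is slow for wide CSVs; upgrade to >= 1.3.5")
```

In real code one would pass `pandas.__version__` instead of a literal; for anything beyond a quick check, `packaging.version.parse` is the more robust comparison.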
I tried to evaluate the model using the L12 checkpoint, and an error occurred:
AttributeError: 'Namespace' object has no attribute 'load_qm9'
I added the parameter '--load-qm9' in evaluate.sh and ran 'bash evaluate.sh'. A new error is raised:
RuntimeError: Error(s) in loading state_dict for TransformerMModel:
Unexpected key(s) in state_dict: "encoder.molecule_encoder.atom_proc.q_proj.weight", "encoder.molecule_encoder.atom_proc.q_proj.bias", "encoder.molecule_encoder.atom_proc.k_proj.weight", "encoder.molecule_encoder.atom_proc.k_proj.bias", "encoder.molecule_encoder.atom_proc.v_proj.weight", "encoder.molecule_encoder.atom_proc.v_proj.bias", "encoder.molecule_encoder.atom_proc.force_proj1.weight", "encoder.molecule_encoder.atom_proc.force_proj1.bias", "encoder.molecule_encoder.atom_proc.force_proj2.weight", "encoder.molecule_encoder.atom_proc.force_proj2.bias", "encoder.molecule_encoder.atom_proc.force_proj3.weight", "encoder.molecule_encoder.atom_proc.force_proj3.bias".
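All of the unexpected keys above sit under encoder.molecule_encoder.atom_proc, which appears to be the 3D force/denoising head used only during pre-training. A sketch of filtering such keys out of a checkpoint before loading (pure-dict logic; whether dropping these weights is safe for a given evaluation is an assumption the authors would need to confirm):

```python
def drop_submodule_keys(state_dict, prefix):
    """Return a copy of state_dict without any keys under `prefix`."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

ckpt = {
    'encoder.molecule_encoder.atom_proc.q_proj.weight': '...',
    'encoder.layers.0.self_attn.q_proj.weight': '...',
}
filtered = drop_submodule_keys(ckpt, 'encoder.molecule_encoder.atom_proc.')
print(sorted(filtered))  # only the non-atom_proc keys remain
# With torch, one could then call: model.load_state_dict(filtered, strict=False)
```

The `strict=False` route also tolerates unexpected keys directly, but explicit filtering makes it obvious which weights are being discarded.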
Hello, authors! First, thank you for open-sourcing the code; it helps a lot with reproducing the paper, and the model works great! May I also ask roughly when the fine-tuning code will be uploaded?
Could you please provide the fine-tuning scripts for downstream tasks such as QM9?
Thank you for your release of Transformer-M. Regarding the settings of steps, warmup_steps, and total_steps: do I need to divide them by the number of GPUs used? I observed during training that the number of steps per epoch decreases under multi-GPU parallelism, but the program still counts steps as if running on a single card.
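A back-of-the-envelope helper (a sketch; the variable names are mine, not from the repo) shows why the per-epoch step count shrinks with more GPUs: each optimizer update consumes batch_size × n_gpu × update_freq samples, while total_steps and warmup_steps count optimizer updates, so under the usual fairseq convention they would not be divided again by the GPU count:

```python
import math

def updates_per_epoch(n_samples, batch_size, n_gpu=1, update_freq=1):
    """Optimizer updates per epoch: each update consumes
    batch_size * n_gpu * update_freq samples."""
    samples_per_update = batch_size * n_gpu * update_freq
    return math.ceil(n_samples / samples_per_update)

# Quadrupling the GPU count cuts updates per epoch to a quarter,
# but the LR schedule still counts the same optimizer updates.
print(updates_per_epoch(1_000_000, batch_size=256, n_gpu=1))  # 3907
print(updates_per_epoch(1_000_000, batch_size=256, n_gpu=4))  # 977
```

This is only the arithmetic; whether the repo's scripts follow exactly this convention is something the authors should confirm.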
Hi,
It's a very interesting model. Would you mind providing the code for preprocessing the data of PDBbind and fine-tuning?
Thank you for your excellent work!
May I know why you use cosine similarity to calculate the loss instead of the squared error, as in the paper "Pre-training via Denoising for Molecular Property Prediction"?
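For reference, the two candidate losses behave differently on the same prediction/target pair: cosine similarity penalizes only the direction of the predicted noise, while squared error also penalizes its magnitude. A plain-Python sketch (not the repo's implementation):

```python
import math

def cosine_loss(a, b):
    """1 - cos(a, b): ignores the magnitudes of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def squared_error(a, b):
    """Sum of squared differences: sensitive to scale."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

pred, target = [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]
print(cosine_loss(pred, target))    # 0.0: same direction, so no penalty
print(squared_error(pred, target))  # 1.0: the magnitude mismatch still counts
```

This scale-invariance is one plausible motivation for the cosine choice, but the authors' actual reasoning is the question being asked here.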
Hi,
Thanks for sharing the code of your cool work.
The code runs fine when I launch train.py from the terminal on Ubuntu 22.04. However, when I debug the code in VS Code, an error occurs:
OSError: /home/xx/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/_convert_cuda.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
File "/home/xx/Transformer-M-main/train.py", line 14, in <module>
    cli_main()
File "/home/xx/Transformer-M-main/fairseq_cli/train.py", line 493, in cli_main
    parser = options.get_training_parser()
File "/home/xx/Transformer-M-main/fairseq/options.py", line 38, in get_training_parser
    parser = get_parser("Trainer", default_task)
File "/home/xx/Transformer-M-main/fairseq/options.py", line 237, in get_parser
    utils.import_user_module(usr_args)
File "/home/xx/Transformer-M-main/fairseq/utils.py", line 511, in import_user_module
    import_tasks(tasks_path, f"{module_name}.tasks")
File "/home/xx/Transformer-M-main/fairseq/tasks/__init__.py", line 119, in import_tasks
    importlib.import_module(namespace + "." + task_name)
File "/home/xx/Transformer-M-main/Transformer-M/tasks/graph_prediction.py", line 37, in <module>
    from ..data.dataset import (
File "/home/xx/Transformer-M-main/Transformer-M/data/dataset.py", line 3, in <module>
    from ogb.lsc import PCQM4Mv2Evaluator
OSError: /home/xx/miniconda3/envs/Transformer-M/lib/python3.7/site-packages/torch_sparse/_convert_cuda.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
Hello, I found that the model checkpoint and datasets cannot be downloaded.
For example, I tried to open the link: https://1drv.ms/u/s!AgZyC7AzHtDBdrY59-_mP38jsCg?e=URoyUK
But the model can't be downloaded; the page says it can't establish a secure connection with the server.
Thank you for your code! It's well written~
I have a few questions on the fine-tuning task on PDBbind. I sincerely look forward to your reply!
1. Inputs. Which features of the protein are used as input? And is pocket data (a sub-sequence) or the full sequence used?
2. Model architecture. Are protein data and ligand data sent into separate encoders, or into the same encoder? If different encoders are used, what is the type of the protein encoder, and how are the extracted features of protein and ligand combined for the final prediction?
Thank you again for your clarifications!
By the way, may I ask when the fine-tuning code for PDBbind will be released? Thanks!
Great Work!
Input: atom_type + xyz
How can we get the embeddings of the atom types and the xyz coordinates?
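In broad strokes, the common recipe (a sketch of the general technique, not necessarily the exact Transformer-M code) is: atom types index a learned embedding table, while xyz coordinates enter through pairwise distances expanded in a Gaussian basis. A plain-Python sketch of expanding one distance, with all parameter values chosen arbitrarily for illustration:

```python
import math

def gaussian_basis(distance, n_basis=8, d_max=4.0, width=0.5):
    """Expand a scalar distance into n_basis Gaussian features whose
    centers are evenly spaced on [0, d_max]."""
    centers = [i * d_max / (n_basis - 1) for i in range(n_basis)]
    return [math.exp(-((distance - c) ** 2) / (2 * width ** 2))
            for c in centers]

feats = gaussian_basis(1.5)
print(len(feats))  # 8 features, largest near the closest center
```

In the real model these features would then be projected by learned linear layers into the attention bias; the atom-type embedding is just an ordinary lookup table of shape (num_atom_types, hidden_dim).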
Thanks for sharing the code of your enlightening work. I have a set of compounds that only have SMILES available. Would you recommend RDKit, Openbabel, or other tools for generating 3D structures as input into your model? Thank you!
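For generating 3D conformers from SMILES, RDKit's ETKDG embedding is a common starting point. A minimal sketch (assuming RDKit is installed; whether its conformers are the right input distribution for this model is exactly the question above):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_3d(smiles):
    """Parse SMILES, add hydrogens, embed one 3D conformer with
    ETKDGv3, and relax it with the MMFF force field."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
    AllChem.MMFFOptimizeMolecule(mol)
    return mol

mol = smiles_to_3d('CCO')  # ethanol
print(mol.GetNumAtoms(), mol.GetNumConformers())
```

The per-atom coordinates are then available via `mol.GetConformer().GetPositions()`; hydrogens can be stripped again with `Chem.RemoveHs` if the model expects heavy atoms only.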