thunlp-mt / mean

This repo contains the code for our paper Conditional Antibody Design as 3D Equivariant Graph Translation.

Home Page: https://arxiv.org/abs/2208.06073

License: MIT License

Python 66.83% C++ 31.97% Shell 1.21%

mean's Issues

Randomness in MCATTModel.init_mask()?

Hello,

First - thanks for making this nice repository available for use.
I was looking over your network architecture, and noticed that the initialisation procedure for the coordinates and sequence in the masked positions in your MCATTModel.init_mask() function appears to be deterministic, see here.
Does this mean that you produce the same final output coordinates and sequence logits every time you generate from a reference antibody-antigen complex, or is randomness entering somewhere else that I can't see?

Thanks!

Inference example

Hi. Thank you so much for sharing the code for reproducing the results in your paper. Usually, people want to run inference on a given PDB file, so it would be much more convenient to provide a minimal script that, given an input PDB (including heavy, light, and antigen chains), outputs the predicted structure. The current code is a bit over-complicated for me...
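
(For what it's worth, the kind of minimal interface being requested might look like the skeleton below. Every flag and step here is hypothetical, not something that exists in the repo; it only sketches the shape.)

# Hypothetical sketch of a one-shot inference script; none of these flags
# or helpers exist in the repo as-is.
from argparse import ArgumentParser

def parse_args():
    p = ArgumentParser(description='Redesign CDRs for a single antibody-antigen PDB')
    p.add_argument('--pdb', required=True, help='input complex (heavy + light + antigen)')
    p.add_argument('--heavy', required=True, help='heavy chain ID')
    p.add_argument('--light', required=True, help='light chain ID')
    p.add_argument('--antigen', nargs='+', required=True, help='antigen chain ID(s)')
    p.add_argument('--ckpt', required=True, help='path to a trained MEAN checkpoint')
    p.add_argument('--out', required=True, help='output path for the predicted PDB')
    return p.parse_args()

if __name__ == '__main__':
    args = parse_args()
    # here one would build a single-complex dataset from args.pdb, load the
    # checkpoint, run generation, and write the predicted structure to args.out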

Evaluation error - TMscore not found

Hi

I'm running the MEAN tutorial and I'm having an issue with the final command:

GPU=0 bash scripts/k_fold_eval.sh summaries 111 mean 0

I assume I've failed to generate data at the previous stage but could you suggest where I'm going wrong?

100%|██████████| 10/10 [00:04<00:00, 2.42it/s]
  0%|          | 0/299 [00:00<?, ?it/s]
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
[... the same "TMscore: not found" line repeats for each worker ...]
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "generate.py", line 74, in eval_one
    summary['TMscore'] = tm_score(cplx.get_heavy_chain(), new_cplx.get_heavy_chain())
  File "/home/matthewdavies/MEAN/evaluation/tm_score.py", line 31, in tm_score
    score = float(res.group(1))
AttributeError: 'NoneType' object has no attribute 'group'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "generate.py", line 257, in <module>
    main(args)
  File "generate.py", line 214, in main
    average_test(args, model, test_set, test_loader, device)
  File "generate.py", line 174, in average_test
    summaries = process_map(partial(eval_one, out_dir=out_dir, cdr=cdr_type), inputs, max_workers=args.num_workers, chunksize=10)
  File "/opt/conda/lib/python3.7/site-packages/tqdm/contrib/concurrent.py", line 130, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 483, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
AttributeError: 'NoneType' object has no attribute 'group'
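
(For context: the traceback shows evaluation/tm_score.py shelling out to a TMscore binary and regex-parsing its stdout; since /bin/sh reports the binary as "not found", the regex never matches, re.search returns None, and hence the AttributeError. Most likely the bundled TMscore source, which accounts for the repo's C++ share, still needs to be compiled to evaluation/TMscore. A defensive sketch of the parsing step; the repo's exact regex may differ:)

import re
import subprocess

def tm_score(bin_path, pdb_a, pdb_b):
    # Run the TMscore binary on two PDB files and parse the reported score.
    # A sketch, not the repo's exact code; assumes TMscore's usual output format.
    res = subprocess.run([bin_path, pdb_a, pdb_b], capture_output=True, text=True)
    match = re.search(r'TM-score\s*=\s*([0-9.]+)', res.stdout)
    if match is None:
        raise RuntimeError(f'TMscore failed; is {bin_path} compiled and executable?')
    return float(match.group(1))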

problem about MEAN/data/download.py

Hi!
When I run the command "bash scripts/prepare_data_kfold.sh summaries/sabdab_summary_all.tsv all_structures/imgt", I get the following error:
####################################################################
Locate project at /root/autodl-tmp/MEAN
Summary file at summaries/sabdab_summary_all.tsv. PDB folder at all_structures/imgt. Data working directory at summaries
download sabdab from summary file summaries/sabdab_summary_all.tsv
using local PDB files: all_structures/imgt
PDB file already renumbered with scheme imgt
downloading raw files
16%|███████████████████████████▊ | 1266/7689 [00:05<00:34, 185.90it/s]scripts/prepare_data_kfold.sh: line 31: 839 Killed python -m data.download --summary ${SUMMARY} --pdb_dir ${PDB_DIR} --fout ${ALL} --type sabdab --numbering imgt --pre_numbered --n_cpu 4
Processing cdrh1
Traceback (most recent call last):
  File "data/split.py", line 232, in <module>
    main(parse())
  File "data/split.py", line 72, in main
    items = load_file(args.data)
  File "data/split.py", line 37, in load_file
    with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/sabdab_all.json'
2023-09-14 15:13:08::WARN::Faild to load file summaries/cdrh1/fold_0/test_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/test_processed/_metainfo'
Traceback (most recent call last):
  File "data/dataset.py", line 303, in <module>
    dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "data/dataset.py", line 63, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "data/dataset.py", line 135, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/test.json'
2023-09-14 15:13:10::WARN::Faild to load file summaries/cdrh1/fold_0/valid_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/valid_processed/_metainfo'
Traceback (most recent call last):
  File "data/dataset.py", line 303, in <module>
    dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "data/dataset.py", line 63, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "data/dataset.py", line 135, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/valid.json'
2023-09-14 15:13:12::WARN::Faild to load file summaries/cdrh1/fold_0/train_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/train_processed/_metainfo'
Traceback (most recent call last):
  File "data/dataset.py", line 303, in <module>
    dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "data/dataset.py", line 63, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "data/dataset.py", line 135, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/train.json'
^CTraceback (most recent call last):
  File "data/dataset.py", line 11, in <module>
    import numpy as np
  File "/root/miniconda3/envs/mean/lib/python3.8/site-packages/numpy/__init__.py", line 141, in <module>
    from . import core
  File "/root/miniconda3/envs/mean/lib/python3.8/site-packages/numpy/core/__init__.py", line 105, in <module>
    from . import _internal
  File "/root/miniconda3/envs/mean/lib/python3.8/site-packages/numpy/core/_internal.py", line 7, in <module>
    import ast
  File "/root/miniconda3/envs/mean/lib/python3.8/ast.py", line 27, in <module>
    from _ast import *
KeyboardInterrupt
######################################################################################
Then I checked the code and found that the download method in download.py may be wrong: it opens "out_path" directly without creating it first. Details below:
def download(items, out_path, ncpu=8, pdb_dir=None, numbering='imgt', pre_numbered=False):
    if pdb_dir is None:
        map_func = download_one_item
    else:
        map_func = partial(download_one_item_local, pdb_dir)
    print('downloading raw files')
    valid_entries = thread_map(map_func, items, max_workers=ncpu)
    valid_entries = [item for item in valid_entries if item is not None]
    print(f'number of downloaded entries: {len(valid_entries)}')
    pdb_out_dir = os.path.join(os.path.split(out_path)[0], 'pdb')
    if os.path.exists(pdb_out_dir):
        print(f'WARNING: pdb file out directory {pdb_out_dir} exists!')
    else:
        os.makedirs(pdb_out_dir)
    print(f'writing PDB files to {pdb_out_dir}')
    for item in tqdm(valid_entries):
        pdb_fout = os.path.join(pdb_out_dir, item['pdb'] + '.pdb')
        with open(pdb_fout, 'w') as pfout:
            pfout.write(item['pdb_data'])
        item.pop('pdb_data')
        item['pdb_data_path'] = os.path.abspath(pdb_fout)
        item['numbering'] = numbering
        if 'pre_numbered' not in item:
            item['pre_numbered'] = pre_numbered
    print('post processing')
    valid_entries = process_map(post_process, valid_entries, max_workers=ncpu, chunksize=1)
    valid_entries = [item for item in valid_entries if item is not None]
    print(f'number of valid entries: {len(valid_entries)}')
    # This seems wrong: out_path is never created, but the code below opens it directly
    fout = open(out_path, 'w')
    for item in valid_entries:
        item_str = json.dumps(item)
        fout.write(f'{item_str}\n')
    fout.close()
    return valid_entries
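
(A side note on the comment above: open(out_path, 'w') does create the file itself, so the likelier culprit in this run is the "Killed" line at 16%, i.e. the download step was terminated, probably out of memory, before out_path was ever written. What open() cannot create is a missing parent directory; a defensive sketch reusing the names above:)

import json
import os

def write_entries(valid_entries, out_path):
    # open(..., 'w') creates the file but not a missing parent directory
    os.makedirs(os.path.dirname(out_path) or '.', exist_ok=True)
    with open(out_path, 'w') as fout:
        for item in valid_entries:
            fout.write(json.dumps(item) + '\n')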

How is the AAR metric computed?

I would like to ask how the Amino Acid Recovery (AAR) metric is computed; I could not find the corresponding code. I would really appreciate an answer!
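
(For reference, AAR is conventionally the fraction of positions in the generated CDR whose amino acid type matches the reference sequence; a minimal sketch of that convention, not necessarily the authors' exact code:)

def aar(pred_seq, ref_seq):
    # Amino Acid Recovery: fraction of positions whose residue type matches
    # the reference (assumes the generated CDR has the reference length).
    assert len(pred_seq) == len(ref_seq)
    return sum(p == r for p, r in zip(pred_seq, ref_seq)) / len(ref_seq)

print(aar('TRRNT', 'TRRAT'))  # 0.8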

Running Data Preparation Code

Once all the .pdb files are downloaded and the post-processing is run, it says "{file path} could not be parsed: Muscle could not be found in the path". Could you please help resolve this issue?
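
(The message presumably refers to the MUSCLE sequence-alignment binary being absent from PATH; a quick check, as a sketch:)

import shutil

# Prints the binary's full path if MUSCLE is on PATH, else None.
# If None, installing it (e.g. `conda install -c bioconda muscle`, one common
# route) and re-running the preprocessing should clear the parse error.
print(shutil.which('muscle'))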

A small question about Figure 5 (B)

Hello, I am a bit confused about the y axis labeled "density" in Figure 5(B). What is the meaning of density?

Thanks for your good work.

Question Regarding Initialization

Hi,

I have a question regarding coordinate initialization for the antibody CDR region. In the manuscript, it is written as,

we initialized its input feature with a mask vector and the coordinates according to the even distribution between the residue right before CDRs and the one right after CDRs

Is it similar to the linspace function, or is it sampled from a uniform distribution? I would be grateful if you could attach a link to the relevant code in the repo; I wasn't able to find it.
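
(My reading of "even distribution" is deterministic linear interpolation between the two flanking residues, i.e. a linspace rather than draws from a uniform distribution, which would also match the determinism discussed in the first issue above; a sketch of that interpretation, assumed rather than confirmed:)

import torch

def init_cdr_coords(left, right, n):
    # Evenly place n masked residues between the flanking coordinates.
    # left/right: [3] coordinates of the residues just before/after the CDR.
    # A sketch of the linspace interpretation, not the repo's exact code.
    steps = torch.linspace(0, 1, n + 2)[1:-1]  # interior points only
    return left[None, :] + steps[:, None] * (right - left)[None, :]  # [n, 3]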

Regards,
Yogesh

questions about data split

Hi. Your paper reports that "The total numbers of clusters for CDR-H1, CDR-H2, and CDR-H3 are 765, 1093, and 1659, respectively. Then we split all clusters into training, validation, and test sets with a ratio of 8:1:1." Are you splitting into train/validation/test sets for each CDR type separately, or do you split all the clusters together, i.e., 765 + 1093 + 1659?

problem accessing coordinates of antigen for RabD

Hi,

I am trying to download the RAbD dataset via the summary file, and it created a JSON file. However, an entry of the JSON file looks like:
{'pdb': '5nuz', 'heavy_chain': 'A', 'light_chain': 'B', 'antigen_chains': ['C'], 'pdb_data_path': '5nuz.pdb', 'numbering': 'imgt', 'pre_numbered': False, 'heavy_chain_seq': 'EVQLQQSGTVLARPGASVKMSCKASGYTFTSYWMHWIKQRPGQGLEWIGAIYPGDSDTKYNQKFKGKAKLTAVTSTSTAYMELSSLTNEDSAVYYCTRRNTLTGDYFDYWGQGTTLTVSS', 'light_chain_seq': 'DIVLTQSPASLAVSLGQRATISCRASESVDDYGISFMNWFQQKPGQPPKLLIYTASSQGSGVPARFSGSGSGTDFSLNIHPMEEDDTAMYFCQQSKEVPYTFGGGTKLEIK', 'antigen_seqs': ['LPLLCTLNKSHLYIKGGNASFQISFDDIAVLLPQYDVIIQHPADMSWCSKSDDQIWLSQWFMNAVGHDWHLDPPFLCRNRTKTEGFIFQVNTSKTGVNENYAKKFKTGMHHLYREYPDSCLNGKLCLMKAQPTSWPLQCPLD'], 'cdrh1_pos': (25, 32), 'cdrh1_seq': 'GYTFTSYW', 'cdrh2_pos': (50, 57), 'cdrh2_seq': 'IYPGDSDT', 'cdrh3_pos': (96, 108), 'cdrh3_seq': 'TRRNTLTGDYFDY', 'cdrl1_pos': (26, 35), 'cdrl1_seq': 'ESVDDYGISF', 'cdrl2_pos': (53, 55), 'cdrl2_seq': 'TAS', 'cdrl3_pos': (92, 100), 'cdrl3_seq': 'QQSKEVPYT'},
However, there are no antigen spatial coordinates here. I think you used the antigen coordinates for the experiments in your paper. Can you confirm whether this is correct, and how can one get the antigen coordinates in this case?
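
(For what it's worth, the entry stores sequences plus pdb_data_path; the spatial coordinates live in the referenced PDB file and are parsed from it during preprocessing. A minimal sketch of reading the antigen coordinates yourself with Biopython, using the chain IDs from the entry above:)

from Bio.PDB import PDBParser

entry = {'pdb': '5nuz', 'pdb_data_path': '5nuz.pdb', 'antigen_chains': ['C']}

structure = PDBParser(QUIET=True).get_structure(entry['pdb'], entry['pdb_data_path'])
for chain_id in entry['antigen_chains']:
    chain = structure[0][chain_id]  # first model, antigen chain
    coords = [atom.coord for residue in chain for atom in residue]
    print(chain_id, len(coords), 'atoms')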

Thanks!

Can you release the details of data after processing?

Thanks for your nice work and code. I get 3816 valid entries after the post-processing, but I have no idea whether I processed the data correctly. Could you release more information about the post-processed data, like the following:

[attached image]

questions regarding coord2radial

Hi. I find the way you calculate the radial is different from other similar works, e.g., EGNN.
Your strategy: the radial is the dot product of the coordinate differences.

def coord2radial(edge_index, coord):
    row, col = edge_index
    coord_diff = coord[row] - coord[col]  # [n_edge, n_channel, d]
    radial = torch.bmm(coord_diff, coord_diff.transpose(-1, -2))  # [n_edge, n_channel, n_channel]
    # normalize radial
    radial = F.normalize(radial, dim=0)  # [n_edge, n_channel, n_channel]
    return radial, coord_diff

EGNN's strategy: the radial is the squared distance between two nodes.

    def coord2radial(self, edge_index, coord):
        row, col = edge_index
        coord_diff = coord[row] - coord[col]
        radial = torch.sum(coord_diff**2, 1).unsqueeze(1)

        if self.normalize:
            norm = torch.sqrt(radial).detach() + self.epsilon
            coord_diff = coord_diff / norm

        return radial, coord_diff

I think your radial can represent the relative orientation of the two multi-channel residues, while EGNN's radial represents only the distance. Is this reasonable? What do you think it represents? What was your motivation for defining it this way instead of following EGNN?
Also, the way you normalize the radial is quite interesting: you normalize it along the n_edge dimension (similar to a "batch" dimension).
Why? Have you tried removing the normalization?
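
(A quick numerical check of the orientation point: the Gram matrix of the per-channel differences is invariant to a global rotation, while its off-diagonal entries keep the relative geometry between channels that EGNN's single squared distance discards. My own sketch, independent of the repo:)

import torch

n_channel, d = 4, 3
coord_diff = torch.randn(n_channel, d)  # per-channel differences for one edge

q, _ = torch.linalg.qr(torch.randn(d, d))  # random orthogonal matrix

gram = coord_diff @ coord_diff.T                      # MEAN-style radial
gram_rot = (coord_diff @ q.T) @ (coord_diff @ q.T).T  # after rotating all coords

print(torch.allclose(gram, gram_rot, atol=1e-5))  # True: rotation-invariant
# off-diagonal entries are dot products between channel difference vectors,
# so inter-channel orientation survives; EGNN keeps only the distance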

Best,
Zhangzhi

Error: Muscle could not be found in the path

When I run inference on 1fl5.pdb from the PDB database, the IMGT data-processing step first reports "1fl5.pdb could not be parsed: Muscle could not be found in the path", so the PDB cannot be parsed, even though it really is an antibody. Is there a good solution? Thanks!

errors while trying to process rabd split data

Hello. I'm trying to generate the data splits for training the version of MEAN used in the RAbD test-set evaluation.

I have installed the conda environment using the setup file. I have run up to line 31 in prepare_data_kfold.sh to generate the summaries/sabdab_all.json file. The PDBs have been moved to summaries/pdb and are IMGT-formatted (downloaded from SAbDab).

I'm running the following line:
bash scripts/prepare_data_rabd.sh summaries/rabd_summary.jsonl summaries/pdb/ summaries/sabdab_all.json

And I'm getting the following nested error message. Are there any dependencies I'm missing that could be causing an earlier error?
It looks like there should be an intermediate summaries/rabd_all.json file. Is this supposed to be generated by the script or separately?

Traceback (most recent call last):
  File "data/download.py", line 13, in <module>
    from .pdb_utils import Protein, AAComplex
ImportError: attempted relative import with no known parent package
Processing cdrh3
Valid entries after filtering with 111: 3127
Traceback (most recent call last):
  File "data/split.py", line 232, in <module>
    main(parse())
  File "data/split.py", line 78, in main
    rabd = load_file(args.rabd)
  File "data/split.py", line 37, in load_file
    with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/rabd_all.json'
2022-12-22 15:21:00::WARN::Faild to load file summaries/cdrh3/test_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh3/test_processed/_metainfo'
Traceback (most recent call last):
  File "data/dataset.py", line 303, in <module>
    dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "data/dataset.py", line 63, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "data/dataset.py", line 135, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh3/test.json'
2022-12-22 15:21:02::WARN::Faild to load file summaries/cdrh3/valid_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh3/valid_processed/_metainfo'
Traceback (most recent call last):
  File "data/dataset.py", line 303, in <module>
    dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "data/dataset.py", line 63, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "data/dataset.py", line 135, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh3/valid.json'
2022-12-22 15:21:04::WARN::Faild to load file summaries/cdrh3/train_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh3/train_processed/_metainfo'
Traceback (most recent call last):
  File "data/dataset.py", line 303, in <module>
    dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
  File "data/dataset.py", line 63, in __init__
    self.preprocess(file_path, save_dir, num_entry_per_file)
  File "data/dataset.py", line 135, in preprocess
    with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh3/train.json'

Questions about RMSD metric reported in paper

Hello. For the baselines following Wengong Jin's setting, I found that the RMSD results in your paper are not consistent with those reported in the RefineGNN paper. Is the discrepancy because the reproduced results deviate even under the same data split and training settings?
