thunlp-mt / mean Goto Github PK

This repo contains the codes for our paper Conditional Antibody Design as 3D Equivariant Graph Translation.

Home Page: https://arxiv.org/abs/2208.06073

License: MIT License

Python 66.83% C++ 31.97% Shell 1.21%

mean's Introduction

MEAN: Conditional Antibody Design as 3D Equivariant Graph Translation

This repo contains the codes for our paper Conditional Antibody Design as 3D Equivariant Graph Translation. MEAN is the abbreviation for the Multi-channel Equivariant Attention Network proposed in our paper.

Setup

Dependencies

we have prepared the script for environment setup in scripts/setup.sh, please install the dependencies in it with bash scripts/setup.sh before running our code.

Get Data

We have provided the summary data used in our paper from SAbDab, RAbD, SKEMPI_V2 in the summaries folder, please download all structure data from the download page of SAbDab. Since the SAbDab is updating on a weekly basis, you may also download the newest summary file from its official website. The following instructions will suppose the structure data renumbered by imgt is located at the folder all_structures/imgt.

Experiments

We have provided the shell scripts for different procedures of the experiments, which are located either in the folder scripts or in the root folder of our repo. For scripts in scripts, you can run them without arguments to see their hints of usage, which will also be illustrated in README later. One thing that might need explanation is the mode parameter which takes the value of either 100 or 111. 100 means only heavy chain is used in the context, and 111 means heavy chain, light chain as well as the epitope are considered in the context. The later one is the complete setting of antigen-binding CDR design, whereas the first one is only for comparison with RefineGNN. <model type> in the arguments only specifies the prefix of the directory to save/load the checkpoints, we use mean in the following sections. Please try using absolute directories when passing arguments to the scripts if you encounter problems concerning paths.

K-fold evaluation on SAbDab

We have provided the scripts for data preparation, k-fold training and evaluation

data preparation: bash scripts/prepare_data_kfold.sh <summary file> <pdb folder>
training: GPU=<gpu id> bash scripts/k_fold_train.sh <summary folder> <mode> <model type> <port for multi-GPU training>
evaluation: GPU=<gpu id> bash scripts/k_fold_eval.sh <mode> <model type> <version id>

here is an example for evaluating our MEAN:

bash scripts/prepare_data_kfold.sh summaries/sabdab_summary.tsv all_structures/imgt
GPU=0 bash scripts/k_fold_train.sh summaries 111 mean 9901
GPU=0 bash scripts/k_fold_eval.sh summaries 111 mean 0

By running bash scripts/prepare_data_kfold.sh summaries/sabdab_summary.tsv all_structures/imgt, the script will copy the pdbs in the summary to summaries/pdb, transform the summary to json format, and generate 10-fold data splits for each cdr, which requires ~5G space. If you want to do data preparation in another directory, just copy the summary file there and replace the summaries/sabdab_summary.tsv with the new path. Also, for each parallel run of trainining, the checkpoints will be saved in version 0, 1, ... So you need to specify the version id as the last argument of k_fold_eval.sh.

Antigen-binding CDR-H3 Redesign

before running this task, please at least run the commands of downloading json summary of SAbDab in scripts/prepare_data_kfold.sh (line 23-31). We will suppose the json file is located at summaries/sabdab_all.json.

data preparation: bash scripts/prepare_data_rabd.sh <rabd summary file> <pdb folder> <sabdab summary file in json format>
training: GPU=<gpu id> MODE=<mode> DATA_DIR=<data directory with train, valid and test json summary> bash train.sh <model type> <cdr type>
evaluation: GPU=<gpu id> MODE=<mode> DATA_DIR=<data directory with train, valid and test json summary> bash rabd_test.sh <version id> [checkpoint path]

Example:

bash scripts/prepare_data_rabd.sh summaries/rabd_summary.jsonl all_structures/imgt summaries/sabdab_all.json
GPU=0 MODE=111 DATA_DIR=summaries/cdrh3 bash train.sh mean 3
GPU=0 MODE=111 DATA_DIR=summaries/cdrh3 bash rabd_test.sh 0

We have also provided the trained checkpoint used in our paper at checkpoints/ckpt/rabd_cdrh3_mean.ckpt. You can use it for test by running GPU=0 MODE=111 DATA_DIR=summaries/cdrh3 bash rabd_test.sh 0 checkpoints/ckpt/rabd_cdrh3_mean.ckpt. The results will be saved to the folder named results under the same directory as the checkpoint.

Affinity Optimization

data preparation: bash scripts/prepare_data_skempi.sh <skempi summary file> <pdb folder> <sabdab summary file in json format>
training: train.sh, ita_train.sh
pretraining: GPU=<gpu id> MODE=<mode> DATA_DIR=<data directory with train, valid and test json summary> bash train.sh <model type> <cdr type>
ITA training: GPU=<gpu id> CKPT_DIR=<pretrained checkpoint folder> bash ita_train.sh
evaluation: GPU=<gpu id> DATA_DIR=<dataset folder > bash ita_generate.sh <checkpoint>

Example:

bash scripts/prepare_data_skempi.sh summaries/skempi_v2_summary.jsonl all_structures/imgt summaries/sabdab_all.json
GPU=0 MODE=111 DATA_DIR=summaries bash train.sh mean 3
GPU=0 CKPT_DIR=summaries/ckpt/mean_CDR3_111/version_0 bash ita_train.sh
GPU=0 DATA_DIR=summaries bash ita_generate.sh summaries/ckpt/mean_CDR3_111/version_0/ita/iter_i.ckpt  # specify the checkpoint from iteration i for testing

We have also provided the checkpoint after ITA finetuning at checkpoints/ckpt/opt_cdrh3_mean.ckpt. You can directly use it for inference by running GPU=0 DATA_DIR=summaries bash ita_generate.sh checkpoints/ckpt/opt_cdrh3_mean.ckpt. This script will generate 100 optimized candidates for each antibody in summaries/skempi_all.json and report the top1 in terms of predicted ddg. The pdbs of optimized candidates will be located at the same directory of the checkpoint.

Inference API

We also provide the script for design / optimization of single CDR at scritps/design.py. The script requires an input pdb containing the heavy chain, the light chain and the antigen. The pdb should be renumbered using the IMGT system in advance, which can be achieved by the script at data/ImmunoPDB.py fron ANARCI. Here is an example of CDR-H3 design for the 1ic7 pdb:

python ./data/ImmunoPDB.py -i data/1ic7.pdb -o 1ic7.pdb -s imgt  # renumber the pdb
python ./scripts/design.py --pdb 1ic7.pdb --heavy_chain H --light_chain L

The generated pdb as well as a summary of the CDR-H3 sequence will be saved to ./results. The default checkpoint used in the script is checkpoints/ckpt/rabd_cdrh3_mean.ckpt. You can pass your own checkpoint by the argument --ckpt path/to/your/checkpoint (e.g. use the opt_cdrh3_mean.ckpt for CDR optimization)

Further, the script is able to accommodate multiple pdbs as inputs, for example:

python ./scripts/design.py
    --pdb pdb1 pdb2 ...
    --heavy_chain Hchain1 Hchain2 ...
    --light_chain Lchain1 Lchain2 ...

Contact

Thank you for your interest in our work!

Please feel free to ask about any questions about the algorithms, codes, as well as problems encountered in running them so that we can make it clearer and better. You can either create an issue in the github repo or contact us at [email protected].

Others

Some codes are borrowed from existing repos:

evaluation/ddg: https://github.com/HeliXonProtein/binding-ddg-predictor
evaluation/TMscore.cpp: https://zhanggroup.org/TM-score/
data/anarci: https://github.com/oxpig/ANARCI
models/MCAttGNN/mc_egnn: https://github.com/vgsatorras/egnn

mean's People

Contributors

Stargazers

Watchers

Forkers

tanxiaoqin888 maolinwang dongcf kehan777 serbulent-av jjjjjulia35 zyh0608 danialgharaee lfelipesv astraightrain

mean's Issues

Randomness in MCATTModel.init_mask()?

Hello,

First - thanks for making this nice repository available for use.
I was looking over your network architecture, and noticed that the initialisation procedure for the coordinates and sequence in the masked positions in your MCATTModel.init_mask() function appears to be deterministic, see here.
Does this mean that you produce the same final output coordinates and sequence logits every time you generate from a reference antibody-antigen complex, or is randomness entering somewhere else that I can't see?

Thanks!

errors while trying to process rabd split data

Hello. I'm trying to generate the data splits for training the version MEAN used in the rabd test set evaluation.

I have installed the conda envrionment using the setup file. I have run up to line 31 in prepare_data_kfold.sh to generate the summaries/sabdab_all.json file. The pdbs have been moved to summaries/pdb and are IMGT formatted (downloaded from sabdab)

I'm running the following line:
bash scripts/prepare_data_rabd.sh summaries/rabd_summary.jsonl summaries/pdb/ summaries/sabdab_all.json

And getting the following nested error message. Are there any dependencies that I'm missing that could be causing an earlier error?
It looks like there is an intermediate summaries/rabd_all.json file. Is this supposed to be generated by the script or separately?

Traceback (most recent call last):
File "data/download.py", line 13, in
from .pdb_utils import Protein, AAComplex
ImportError: attempted relative import with no known parent package
Processing cdrh3
Valid entries after filtering with 111: 3127
Traceback (most recent call last):
File "data/split.py", line 232, in
main(parse())
File "data/split.py", line 78, in main
rabd = load_file(args.rabd)
File "data/split.py", line 37, in load_file
with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/rabd_all.json'
2022-12-22 15:21:00::WARN::Faild to load file summaries/cdrh3/test_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh3/test_processed/_metainfo'
Traceback (most recent call last):
File "data/dataset.py", line 303, in
dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "data/dataset.py", line 63, in init
self.preprocess(file_path, save_dir, num_entry_per_file)
File "data/dataset.py", line 135, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh3/test.json'
2022-12-22 15:21:02::WARN::Faild to load file summaries/cdrh3/valid_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh3/valid_processed/_metainfo'
Traceback (most recent call last):
File "data/dataset.py", line 303, in
dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "data/dataset.py", line 63, in init
self.preprocess(file_path, save_dir, num_entry_per_file)
File "data/dataset.py", line 135, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh3/valid.json'
2022-12-22 15:21:04::WARN::Faild to load file summaries/cdrh3/train_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh3/train_processed/_metainfo'
Traceback (most recent call last):
File "data/dataset.py", line 303, in
dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "data/dataset.py", line 63, in init
self.preprocess(file_path, save_dir, num_entry_per_file)
File "data/dataset.py", line 135, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh3/train.json'

questions regarding coord2radial

Hi. I find the way you calculate radial is different from other similar works, e.g., EGNN.
Your strategy. the radial is the dot product of the coord differences.

def coord2radial(edge_index, coord):
    row, col = edge_index
    coord_diff = coord[row] - coord[col]  # [n_edge, n_channel, d]
    radial = torch.bmm(coord_diff, coord_diff.transpose(-1, -2))  # [n_edge, n_channel, n_channel]
    # normalize radial
    radial = F.normalize(radial, dim=0)  # [n_edge, n_channel, n_channel]
    return radial, coord_diff

EGNN's strategy. the radial is the squared distance between two nodes.

    def coord2radial(self, edge_index, coord):
        row, col = edge_index
        coord_diff = coord[row] - coord[col]
        radial = torch.sum(coord_diff**2, 1).unsqueeze(1)

        if self.normalize:
            norm = torch.sqrt(radial).detach() + self.epsilon
            coord_diff = coord_diff / norm

        return radial, coord_diff

I think your radial can represent the orientation of two multi-channel residues, and egnn's radial represents the distance. Is this reasonable? What do you think it represents? What's your motivation for defining it this way instead of following egnn?
The way you normalize the radial is quite interesting, you normalize it along the n_edge dimension (similar to "batch dimension").
Why? Have you tried removing normalization?

Best,
Zhangzhi

Can you release the details of data after processing?

Thanks your nice work and code. I get 3816 valid data after the post-processing. But I have no idea whether I process it correctly. Can you release more information about the data after the post-processing like following:

AAR指标如何计算得到的？

我想请问一下，Amino Acid Recovery (AAR)，这个指标是如何计算得到的啊，我没找到相应的代码。非常希望能得到您的回答！

questions about data split

Hi. Your paper reports that The total numbers of clusters for CDR-H1, CDR-H2, and CDR-H3 are 765, 1093, and 1659, respectively. Then we split all clusters into training, validation, and test sets with a ratio of 8:1:1. Are you splitting the train/validation set on each cdr type? Or do you split all the clusters, i.e., 765 + 1093 + 1659?

Evaluation error - TMscore not found

I'm running the MINE tutorial and I'm having an issue with the final command

GPU=0 bash scripts/k_fold_eval.sh summaries 111 mean 0

I assume I've failed to generate data at the previous stage but could you suggest where I'm going wrong?

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.42it/s]
  0%|                                                                                                                                                                                                                          | 0/299 [00:00<?, ?it/s]/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
  0%|                                                                                                                                                                                                                          | 0/299 [00:01<?, ?it/s]
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
/bin/sh: 1: /home/matthewdavies/MEAN/evaluation/TMscore: not found
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "generate.py", line 74, in eval_one
    summary['TMscore'] = tm_score(cplx.get_heavy_chain(), new_cplx.get_heavy_chain())
  File "/home/matthewdavies/MEAN/evaluation/tm_score.py", line 31, in tm_score
    score = float(res.group(1))
AttributeError: 'NoneType' object has no attribute 'group'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "generate.py", line 257, in <module>
    main(args)
  File "generate.py", line 214, in main
    average_test(args, model, test_set, test_loader, device)
  File "generate.py", line 174, in average_test
    summaries = process_map(partial(eval_one, out_dir=out_dir, cdr=cdr_type), inputs, max_workers=args.num_workers, chunksize=10)
  File "/opt/conda/lib/python3.7/site-packages/tqdm/contrib/concurrent.py", line 130, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.7/concurrent/futures/process.py", line 483, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
AttributeError: 'NoneType' object has no attribute 'group'

problem accessing coordinates of antigen for RabD

Hi,

I am trying to download the RabD dataset via the summary file and it created an JSON file. However, the entries of the JSON file are:
{'pdb': '5nuz', 'heavy_chain': 'A', 'light_chain': 'B', 'antigen_chains': ['C'], 'pdb_data_path': '5nuz.pdb', 'numbering': 'imgt', 'pre_numbered': False, 'heavy_chain_seq': 'EVQLQQSGTVLARPGASVKMSCKASGYTFTSYWMHWIKQRPGQGLEWIGAIYPGDSDTKYNQKFKGKAKLTAVTSTSTAYMELSSLTNEDSAVYYCTRRNTLTGDYFDYWGQGTTLTVSS', 'light_chain_seq': 'DIVLTQSPASLAVSLGQRATISCRASESVDDYGISFMNWFQQKPGQPPKLLIYTASSQGSGVPARFSGSGSGTDFSLNIHPMEEDDTAMYFCQQSKEVPYTFGGGTKLEIK', 'antigen_seqs': ['LPLLCTLNKSHLYIKGGNASFQISFDDIAVLLPQYDVIIQHPADMSWCSKSDDQIWLSQWFMNAVGHDWHLDPPFLCRNRTKTEGFIFQVNTSKTGVNENYAKKFKTGMHHLYREYPDSCLNGKLCLMKAQPTSWPLQCPLD'], 'cdrh1_pos': (25, 32), 'cdrh1_seq': 'GYTFTSYW', 'cdrh2_pos': (50, 57), 'cdrh2_seq': 'IYPGDSDT', 'cdrh3_pos': (96, 108), 'cdrh3_seq': 'TRRNTLTGDYFDY', 'cdrl1_pos': (26, 35), 'cdrl1_seq': 'ESVDDYGISF', 'cdrl2_pos': (53, 55), 'cdrl2_seq': 'TAS', 'cdrl3_pos': (92, 100), 'cdrl3_seq': 'QQSKEVPYT'},
However, there are not any antigen spatial coordinates here. I think you have used the antigen coordinates for the experiment in your paper. Can you let me know if is this correct and how one can get antigen coordinates in this case?

Thanks!

Question Regarding Initialization

Hi,

I have a question regarding coordinate initialization for the antibody CDR region. In the manuscript, it is written as,

we initialized its input feature with a mask vector and the coordinates according to the even distribution between the residue right before CDRs and the one right after CDRs

Is it similar to the linspace function, or is its kind sampled from a uniform distribution? I would be grateful if you can attach a link to the code in repo, I wasn't able to find it.

Regards,
Yogesh

A bit question about Figure 5 (B)

Hello, I feel a bit confused about y axis labeled density in Figure 5(B), What' s the meaning of density?

Thanks for ur good work.

problem about MEAN/data/download.py

Hi!
When i run this command, "bash scripts/prepare_data_kfold.sh summaries/sabdab_summary_all.tsv all_structures/imgt", here is the error:
####################################################################
Locate project at /root/autodl-tmp/MEAN
Summary file at summaries/sabdab_summary_all.tsv. PDB folder at all_structures/imgt. Data working directory at summaries
download sabdab from summary file summaries/sabdab_summary_all.tsv
using local PDB files: all_structures/imgt
PDB file already renumbered with scheme imgt
downloading raw files
16%|███████████████████████████▊ | 1266/7689 [00:05<00:34, 185.90it/s]scripts/prepare_data_kfold.sh: line 31: 839 Killed python -m data.download --summary ${SUMMARY} --pdb_dir ${PDB_DIR} --fout ${ALL} --type sabdab --numbering imgt --pre_numbered --n_cpu 4
Processing cdrh1
Traceback (most recent call last):
File "data/split.py", line 232, in
main(parse())
File "data/split.py", line 72, in main
items = load_file(args.data)
File "data/split.py", line 37, in load_file
with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/sabdab_all.json'
2023-09-14 15:13:08::WARN::Faild to load file summaries/cdrh1/fold_0/test_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/test_processed/_metainfo'
Traceback (most recent call last):
File "data/dataset.py", line 303, in
dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "data/dataset.py", line 63, in init
self.preprocess(file_path, save_dir, num_entry_per_file)
File "data/dataset.py", line 135, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/test.json'
2023-09-14 15:13:10::WARN::Faild to load file summaries/cdrh1/fold_0/valid_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/valid_processed/_metainfo'
Traceback (most recent call last):
File "data/dataset.py", line 303, in
dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "data/dataset.py", line 63, in init
self.preprocess(file_path, save_dir, num_entry_per_file)
File "data/dataset.py", line 135, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/valid.json'
2023-09-14 15:13:12::WARN::Faild to load file summaries/cdrh1/fold_0/train_processed/_metainfo, error: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/train_processed/_metainfo'
Traceback (most recent call last):
File "data/dataset.py", line 303, in
dataset = EquiAACDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "data/dataset.py", line 63, in init
self.preprocess(file_path, save_dir, num_entry_per_file)
File "data/dataset.py", line 135, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'summaries/cdrh1/fold_0/train.json'
^CTraceback (most recent call last):
File "data/dataset.py", line 11, in
import numpy as np
File "/root/miniconda3/envs/mean/lib/python3.8/site-packages/numpy/init.py", line 141, in
from . import core
File "/root/miniconda3/envs/mean/lib/python3.8/site-packages/numpy/core/init.py", line 105, in
from . import _internal
File "/root/miniconda3/envs/mean/lib/python3.8/site-packages/numpy/core/_internal.py", line 7, in
import ast
File "/root/miniconda3/envs/mean/lib/python3.8/ast.py", line 27, in
from _ast import *
KeyboardInterrupt
######################################################################################
Then i check the code, and find that the download method in download.py is maybe wrong, in where the program show that it opens "out_path" with creating it before open, details below:
def download(items, out_path, ncpu=8, pdb_dir=None, numbering='imgt', pre_numbered=False):
if pdb_dir is None:
map_func = download_one_item
else:
map_func = partial(download_one_item_local, pdb_dir)
print('downloading raw files')
valid_entries = thread_map(map_func, items, max_workers=ncpu)
valid_entries = [item for item in valid_entries if item is not None]
print(f'number of downloaded entries: {len(valid_entries)}')
pdb_out_dir = os.path.join(os.path.split(out_path)[0], 'pdb')
if os.path.exists(pdb_out_dir):
print(f'WARNING: pdb file out directory {pdb_out_dir} exists!')
else:
os.makedirs(pdb_out_dir)
print(f'writing PDB files to {pdb_out_dir}')
for item in tqdm(valid_entries):
pdb_fout = os.path.join(pdb_out_dir, item['pdb'] + '.pdb')
with open(pdb_fout, 'w') as pfout:
pfout.write(item['pdb_data'])
item.pop('pdb_data')
item['pdb_data_path'] = os.path.abspath(pdb_fout)
item['numbering'] = numbering
if 'pre_numbered' not in item:
item['pre_numbered'] = pre_numbered
print('post processing')
valid_entries = process_map(post_process, valid_entries, max_workers=ncpu, chunksize=1)
valid_entries = [item for item in valid_entries if item is not None]
print(f'number of valid entries: {len(valid_entries)}')
# 这里似乎错了，out_path没有创建，但是下面代码直接open了
fout = open(out_path, 'w')
for item in valid_entries:
item_str = json.dumps(item)
fout.write(f'{item_str}\n')
fout.close()
return valid_entries

Problem with the usage of '_metainfo' file, 'train_processed' file and 'part_0.pkl' file.

It seems that these file mentioned in title are not used in training.
Could you please teach me that where are the usages?

How do you get other coordinates of amino ?

You only predict four coordinates in a residue, so how do you get other coordinates in a residue?

Inference example

Hi. Thank you soo much for sharing the code for reproducing the results in your paper. Usually, people want to run the inference on a given pdb file. So it would be much more convenient to provide a minimal code where given input pdb (including Heavy, light, antigen chain), the script outputs the predicted structure). Current code is a little bit of over-complicated to me...