This repo contains code for "Predicting the changes in binding affinity of multiple point mutations using protein three-dimensional structure" by Guanglei Yu, Qichang Zhao, Xuehua Bi and Jianxin Wang.
We proposed a ProteinMPNN-inspired
- Clipped patches: given $\mathcal{WT}$ and $\mathcal{MT}$, we clip each into a residue patch of 256 residues, namely the 256 nearest neighbors of the mutant residues based on inter-residue $C_{\beta}$ distances, including the mutant residues themselves.
- Two-step additive Gaussian noising strategy: to improve the performance and generalization of DDAffinity, we apply a two-step additive Gaussian noising strategy to the atomic coordinates of residues. First, additive Gaussian noise ($std=0.2\mathring{\mathrm A}$) is added to all input atomic coordinates, which yields perturbed backbone dihedrals $(\phi,\psi,\omega)$ and sidechain dihedrals $(\chi^{(1)},\chi^{(2)},\chi^{(3)},\chi^{(4)})$. Second, inspired by ProteinMPNN, where such noise improves predictive performance and makes the prediction algorithm more robust, we also add Gaussian noise ($std=0.2\mathring{\mathrm A}$) to the coordinates of the protein backbone atom set $\boldsymbol{A}=\lbrace N,C_\alpha,C,O,C_\beta \rbrace$. Importantly, this second perturbation does not update the backbone or sidechain dihedrals. The two-step noising strategy is applied only during training.
- How to construct the $k$-nearest neighbor graph: we use three types of neighbor residues. (1) Spatial distance $k_1$: a residue is connected to its $k_1$ nearest neighbors according to their spatial Euclidean distances, which ensures that the spatial densities of different proteins are comparable. (2) Sequential distance $k_2$: residue $r_i$ is connected to its sequence neighbors whose sequential distance is no more than $(k_2-1)/2$. (3) Long-range distance $k_3$: to efficiently capture dependencies that are long-range in sequence but local in 3D Euclidean space, the neighbors of residue $r_i$ are ranked in ascending order of Euclidean distance, and those whose sequence distance is not greater than $(k_2-1)/2$ are discarded. We then select the $k_3$ nearest neighbors from the ordered neighbor list. In summary, $k=k_1+k_2+k_3$.
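
The patch clipping described above can be illustrated with a minimal sketch in plain Python. The function name `clip_patch` and its arguments are hypothetical and not taken from this repo. The sketch also assumes that, when several residues are mutated, a residue's distance is measured to its closest mutant residue; the text above does not specify this.

```python
import math


def clip_patch(cb_coords, mutant_indices, patch_size=256):
    """Return the indices of the `patch_size` residues nearest to the mutant residues.

    cb_coords: list of (x, y, z) C-beta coordinates, one per residue.
    mutant_indices: indices of the mutated residues.
    """
    def dist_to_nearest_mutant(i):
        # Assumption: with several mutations, rank by distance to the closest one.
        return min(math.dist(cb_coords[i], cb_coords[m]) for m in mutant_indices)

    ranked = sorted(range(len(cb_coords)), key=dist_to_nearest_mutant)
    # Mutant residues have distance 0 to themselves, so they sort to the front.
    return sorted(ranked[:patch_size])


# Toy complex: five residues on a line, residue 1 is mutated.
cb = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
      (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
patch = clip_patch(cb, mutant_indices=[1], patch_size=3)
print(patch)  # [0, 1, 2]
```

The same index list would be applied to both the wild-type and the mutant complex so that the two patches stay aligned residue by residue.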
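
The two-step noising can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: residues are represented as plain dictionaries of atom name to coordinates, `dihedral_fn` is a hypothetical callable standing in for the dihedral computation, and the second noise step is assumed to be applied on top of the step-one coordinates.

```python
import random

BACKBONE_ATOMS = ("N", "CA", "C", "O", "CB")  # atom set A from the text


def add_noise(residues, std, rng, atoms=None):
    """Return a copy with N(0, std^2) noise added to the selected atoms (all if None)."""
    noised = []
    for res in residues:
        new_res = {}
        for name, xyz in res.items():
            if atoms is None or name in atoms:
                xyz = tuple(c + rng.gauss(0.0, std) for c in xyz)
            new_res[name] = xyz
        noised.append(new_res)
    return noised


def two_step_noising(residues, dihedral_fn, std=0.2, training=True, rng=None):
    """Return (coordinates, dihedrals) following the two-step strategy."""
    if not training:
        # No noise at inference time.
        return residues, dihedral_fn(residues)
    rng = rng or random.Random()
    # Step 1: perturb every atom, then derive dihedrals from the perturbed atoms.
    step1 = add_noise(residues, std, rng)
    dihedrals = dihedral_fn(step1)
    # Step 2: perturb backbone atoms again; dihedrals are NOT recomputed.
    # Assumption: step 2 starts from the step-1 coordinates.
    step2 = add_noise(step1, std, rng, atoms=BACKBONE_ATOMS)
    return step2, dihedrals
```

Passing a seeded `random.Random` as `rng` makes the perturbation reproducible, which is convenient when debugging a training run.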
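
The three neighbor types can be illustrated for a single residue with the sketch below. The function name `build_neighbors` is hypothetical. The sketch assumes a single chain where sequence distance is the index difference, excludes the residue itself from the spatial and long-range lists, and includes it in the sequential window so that the window holds up to $k_2$ residues; none of these details are confirmed by the text above.

```python
import math


def build_neighbors(coords, i, k1, k2, k3):
    """Return (spatial, sequential, long_range) neighbor indices of residue i."""
    n = len(coords)
    half = (k2 - 1) // 2
    # All other residues, nearest first by Euclidean distance.
    by_distance = sorted(
        (j for j in range(n) if j != i),
        key=lambda j: math.dist(coords[i], coords[j]),
    )
    # (1) Spatial: the k1 nearest residues in 3D.
    spatial = by_distance[:k1]
    # (2) Sequential: residues within (k2 - 1) / 2 positions along the sequence.
    sequential = list(range(max(0, i - half), min(n, i + half + 1)))
    # (3) Long-range: nearest in 3D after discarding sequence-local residues.
    long_range = [j for j in by_distance if abs(i - j) > half][:k3]
    return spatial, sequential, long_range


# Toy hairpin: residues 0..3 run along x, residues 4..7 fold back above them.
hairpin = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0),
           (3.0, 1.0, 0.0), (2.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
print(build_neighbors(hairpin, i=0, k1=2, k2=3, k3=2))
# ([1, 7], [0, 1], [7, 6])
```

In the toy hairpin, residue 7 is far from residue 0 in sequence but adjacent in space, so it appears in the long-range list even though the sequential window cannot reach it.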
An overview of the DDAffinity architecture is shown below.
```bash
conda env create -f env.yml -n DDAffinity
conda activate DDAffinity
```
The default PyTorch version is 1.12.1 and the cudatoolkit version is 11.3. They can be changed in `env.yml`.
We generated the PDB files of all wild-type and mutant complexes from the PDB files in `data/SKEMPI2/PDBs` and the mutation list in `data/SKEMPI2/SKEMPI2.csv`, using `rde/datasets/PDB_generate.py` and the FoldX tool. We then use `rde/datasets/skempi_parallel.py` to transform the PDB files of the wild-type and mutant complexes into the processed dataset `SKEMPI2_cache`.
```bash
python PDB_generate.py
python skempi_parallel.py --reset
```
Dataset | Download Script | Processed Dataset |
---|---|---|
SKEMPI v2 | data/get_skempi_v2.sh | data/SKEMPI2/SKEMPI2_cache |
SKEMPI2.csv | — | SKEMPI2_cache |
M1707.csv | — | M1707_cache |
S1131.csv | — | S1131_cache |
M1340.csv | — | M1340_cache |
M595.csv | — | M595_cache |
S494.csv | — | S494_cache |
S285.csv | — | S285_cache |
Ssys.csv | — | Ssys_cache |
The weights trained on the overall SKEMPI2 dataset are located in: DDAffinity
The weights trained on M1340 are located in: M1340
```bash
# Test a trained model
python test_DDAffinity.py ./configs/train/mpnn_ddg.yml --device cuda:0

# Run blind testing and the two case studies
python case_study.py ./configs/inference/blind_testing.yml --device cuda:0
python case_study.py ./configs/inference/case_study_1.yml --device cuda:0
python case_study.py ./configs/inference/case_study_2.yml --device cuda:0

# Train with 10 cross-validation folds
python train_DDAffinity.py ./configs/train/mpnn_ddg.yml --num_cvfolds 10 --device cuda:0
```
We acknowledge that parts of our code are adapted from Rotamer Density Estimator (RDE). Thanks to the authors for sharing their code.