
Comprehensive assessment of BERT-based methods for predicting antimicrobial peptides

│  README.md
│
├─dependencies # Required environments
│      AMP-BERT_env.yml
│      Bert-Protein_env.yml
│      cAMPs_pred_env.yml
│      LM_pred_env.yml
│
├─dataset # Datasets for the evaluation experiments
│      independent dataset_AMPs.fasta
│      independent dataset_nonAMPs.fasta
│      ├─ADAPTABLE database
│           adaptable_amps.fa
│           adaptable_nonamps.fa
│      ├─APD database
│           apd_amps.fa
│           apd_nonamps.fa
│      ├─CAMP database
│           camp_amps.fa
│           camp_nonamps.fa
│      ├─dbAMP database
│           dbamp_amps.fa
│           dbamp_nonamps.fa
│      ├─DRAMP database
│           dramp_amps.fa
│           dramp_nonamps.fa
│      ├─YADAMP database
│           yadamp_amps.fa
│           yadamp_nonamps.fa
│
└─Utils # Utility scripts
        test_in_AMP-BERT.ipynb # Testing on AMP-BERT
        test_in_Bert-Protein.ipynb # Testing on Bert-Protein
        test_in_cAMPs_pred.py # Testing on cAMPs_pred
        test_in_LM_pred.ipynb # Testing on LM_pred
        ROC and PR curve.ipynb # Plotting ROC and PR curves
        metrics.py # Calculating evaluation indicators
        ensemble.py # Integrating the BERT models (ensemble)

Methods for assessment

The environments used in this study are provided in /dependencies; each can be recreated with conda env create -f <env>.yml.

| Method | Pretraining | Parameters | Classification | Repository |
| --- | --- | --- | --- | --- |
| Bert-Protein | UniProt | 12 layers, 12 heads | FFN | https://github.com/BioSequenceAnalysis/Bert-Protein |
| AMP-BERT | BFD | 30 layers, 16 heads | FCN | https://github.com/GIST-CSBL/AMP-BERT |
| LM_pred | BFD100, UniRef100 | 30 layers, 16 heads | CNN | https://github.com/williamdee1/LMPred_AMP_Prediction |
| cAMPs_pred | BookCorpus, Wikipedia | 12 layers, 12 heads | FFN | https://github.com/mayuefine/c_AMPs-prediction |

Dataset

Peptide sequences were downloaded from UniProt (http://www.uniprot.org).

Experimental instructions

The architecture of evaluation experiments

In this study, we employed independent testing, multi-database validation, and 5-fold cross-validation to evaluate the predictive performance of these methods. In addition, we propose a novel AMP prediction method based on an ensemble learning strategy.

Performance evaluation on the independent dataset and validation datasets

We collected an independent test dataset based on multiple AMP databases, then compared and analyzed the prediction performance of the different tools on it. To assess the robustness and generalization ability of the models, we further tested their prediction performance on multiple validation datasets.

Collection of the independent test dataset

Positive samples for the independent test set were collected from several comprehensive AMP databases, including APD, CAMP, dbAMP, DRAMP, YADAMP, ADAPTABLE, and AMPfun. Negative samples were collected from UniProt.

Testing on Bert-Protein

# ljy_predict_AMP.py
# Run the trained model on the test dataset

if __name__ == '__main__':
    main(data_name=r"test.csv",
         out_file="result.txt",
         model_path="model_train/1kmer_model/model.ckpt",
         step=1,
         config_file="./bert_config_1.json",
         vocab_file="./vocab/vocab_1kmer.txt")

Testing on AMP-BERT

# test_with_amps.ipynb
# Load the appropriate tokenizer and fine-tuned model
tokenizer = AutoTokenizer.from_pretrained('Rostlab/prot_bert_bfd', do_lower_case=False)
model = BertForSequenceClassification.from_pretrained("Train/")

# Input:  test dataset ('test.csv')
# Output: predicted labels ('pred.txt') and predicted probabilities ('prob.txt')
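For reference, a minimal sketch of the prediction loop, assuming space-separated residues (the ProtBERT convention) and a 0.5 decision threshold; this is a hedged reconstruction, not the notebook's exact code:

import torch

with open('test.csv') as l, open('pred.txt', 'w') as r, open('prob.txt', 'w') as f:
    for line in l:
        seq = ' '.join(line.strip())  # ProtBERT tokenizers expect space-separated residues
        inputs = tokenizer(seq, return_tensors='pt')
        with torch.no_grad():
            logits = model(**inputs).logits
        prob = torch.softmax(logits, dim=-1)[0, 1].item()  # probability of the AMP class
        r.write(f"{int(prob >= 0.5)}\n")
        f.write(f"{prob}\n")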

Testing on LM_pred

# Testing_Models.ipynb
# Input: test dataset
# Output: prediction result with label
#         prediction result with probability

BERT_model_INDEP = keras.models.load_model('Train model/BERT-BFD_best_model.epoch03-loss0.19.hdf5')
X_test = load_INDEP_X_data('BERT_BFD')
BERT_mod_pred = BERT_model_INDEP.predict(X_test, batch_size=8)

def write_values(values, path):
    # Strip list/array formatting and write one value per line
    with open(path, 'a') as f:
        for v in values:
            f.write(str(v).replace('[', '').replace(']', '').replace("'", '').replace(',', '') + '\n')

# Predicted probabilities
write_values(BERT_mod_pred, 'prob.txt')

# Predicted labels
BERT_mod_pred_labels = convert_preds(BERT_mod_pred)
write_values(BERT_mod_pred_labels, 'pred.txt')

BERT_metrics = display_conf_matrix(y_test_INDEP, BERT_mod_pred_labels, BERT_mod_pred, 'BERT Model', 'BERT-BFD_Model_CM.png')

Testing on cAMPs_pred

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from bert_sklearn import load_model
import pandas as pd

# Read one peptide sequence per line from the test set
x_test = []
with open(r'test.csv') as l:
    for line in l:
        x_test.append(line.strip())

model = load_model("bert.bin")
pred = model.predict(x_test)
prob = model.predict_proba(x_test)[:, 1]  # probability of the AMP class

pd.DataFrame(prob).to_csv('prob.txt')
pd.DataFrame(pred).to_csv('pred.txt')
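
Each tool's pred.txt/prob.txt outputs can then be scored consistently. metrics.py in Utils/ computes the study's evaluation indicators; a generic scikit-learn sketch looks like this:

from sklearn.metrics import accuracy_score, matthews_corrcoef, roc_auc_score

def report(y_true, y_pred, y_prob):
    # y_true: ground-truth labels, y_pred: predicted labels, y_prob: AMP probabilities
    print('ACC:', accuracy_score(y_true, y_pred))
    print('MCC:', matthews_corrcoef(y_true, y_pred))
    print('AUC:', roc_auc_score(y_true, y_prob))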

Performance evaluation on the retraining dataset

We retrained representative BERT-based models for AMP prediction on the comprehensive dataset and assessed their performance using the five-fold cross-validation test.
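The five-fold protocol can be expressed with scikit-learn's StratifiedKFold (a generic sketch; X and y here are placeholders standing in for the encoded peptides and their AMP/non-AMP labels):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(10).reshape(-1, 1)   # placeholder features
y = np.array([0, 1] * 5)           # placeholder labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Retrain on the train split, evaluate on the test split;
    # metrics are averaged over the five folds
    print(f"fold {fold}: train={train_idx}, test={test_idx}")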

Retraining and testing on Bert-Protein

  • Data example:
    Train  1  M I S D S G ...
  • run_fine_tune.sh
# Set the GPUs that can be used
export CUDA_VISIBLE_DEVICES=0
python ljy_run_classifier.py \
--do_eval True \
--do_save_model True \
--data_name AMPScan \
--batch_size 16 \
--num_train_epochs 1 \
--warmup_proportion 0.1 \
--learning_rate 2e-5 \
--using_tpu False \
--seq_length 128 \
--data_root ./dataset/1kmer_tfrecord/AMPScan/ \
--vocab_file ./vocab/vocab_1kmer.txt \
--init_checkpoint ./model/1kmer_model/model.ckpt \
--bert_config ./bert_config_1.json \
--save_path ./model_train/1kmer_model/model.ckpt
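For reference, a hedged sketch of writing a FASTA file into the line format shown in the data example above (the tab delimiter and the function name are assumptions for illustration; the repository's own scripts handle the actual tfrecord conversion):

def fasta_to_bertprotein(fasta_path, out_path, label, split='Train'):
    # One output line per sequence: split name, label, space-separated residues (1-mers)
    with open(fasta_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            line = line.strip()
            if line and not line.startswith('>'):
                fout.write(f"{split}\t{label}\t{' '.join(line)}\n")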

Retraining and testing on AMP-BERT

  • Data example
AMP--1,FQPYDHPAEVSY,12,TRUE
  • fine-tune_with_amps.ipynb
# Fine-tuning
# Training set
df = pd.read_csv('train.csv', index_col = 0)
df = df.sample(frac=1, random_state = 0)
print(df.head(7))
train_dataset = amp_data(df)

# Validation set
df_val = pd.read_csv('val.csv', index_col = 0)
df_val= df_val.sample(frac=1, random_state = 0)
val_dataset = amp_data(df_val)
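
# `trainer` must be constructed before trainer.train() below; a minimal sketch
# (the hyperparameters are illustrative, not the study's values, and `model` is
# assumed to be a BertForSequenceClassification loaded from 'Rostlab/prot_bert_bfd')
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir='Train/',
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)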

# Save model
trainer.train()
trainer.save_model('Train/')

Retraining and testing on LM_pred

  • Data example
25,TFFRLFNRGGGWGSFFKKAAHVGKL,AMP--955
  • Model Training
BERT_filepath = 'Keras_Models/BERT_best_model.epoch{epoch:02d}-loss{val_loss:.2f}.hdf5'
BERT_Plots_Path = 'Training_Plots/INDEP/BERT_Best_Model_Plot.png'
train_model(X_train, y_train_res, X_val, y_val_res, BERT_filepath, BERT_Plots_Path, 30, 8, False, 320, 11, 'RandomNormal', 8, 0.0001, 'SGD')

Retraining and testing on cAMPs_pred

  • Installing bert_sklearn

    Download bert-sklearn and install it into your Python 3 environment:

   cd bert-sklearn
   pip install .
  • Data example
  >AMP-467
  KNLRRIIRKIAHIIKKYG
  • train_in_cAMPs_pred.py
from bert_sklearn import BertClassifier
model = BertClassifier()

model.train_batch_size=16
model.eval_batch_size=16
model.learning_rate=2e-5
model.epochs=10

model.fit(x_train, y_train)
print(model.score(x_test,y_test))
model.save('bert.bin')
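
A hedged sketch of turning the FASTA-style examples above into the (x_train, y_train) pair that model.fit expects; the file names and the 1/0 labeling convention are assumptions for illustration:

def load_fasta(path, label):
    # Collect sequence lines; header lines start with '>'
    seqs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('>'):
                seqs.append(line)
    return seqs, [label] * len(seqs)

amp_x, amp_y = load_fasta('train_amps.fa', 1)
non_x, non_y = load_fasta('train_nonamps.fa', 0)
x_train, y_train = amp_x + non_x, amp_y + non_y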

Ensemble model

We propose a novel AMP prediction method based on an ensemble learning strategy: an SVM is trained on the probabilities predicted by the base BERT models.

import pandas as pd
import sklearn.svm as svm
from sklearn.model_selection import train_test_split

# Features: the base models' predicted probabilities; targets: the true labels
prob = pd.read_csv('prob.txt', header=None)
label = pd.read_csv('y_test.csv', header=None)

# The split here is illustrative; any held-out scheme works
x_train, x_test, y_train, y_test = train_test_split(
    prob, label.values.ravel(), test_size=0.2, random_state=0)

model = svm.SVC(C=1, kernel='rbf', gamma='auto')
model.fit(x_train, y_train)
pred = model.predict(x_test)
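
A hedged sketch of assembling the feature matrix from the four base models' probability outputs (the per-tool file names are assumptions; each file holds one probability per test sequence):

import pandas as pd

prob_files = {
    'Bert-Protein': 'bert_protein_prob.txt',
    'AMP-BERT': 'amp_bert_prob.txt',
    'LM_pred': 'lm_pred_prob.txt',
    'cAMPs_pred': 'camps_pred_prob.txt',
}
# One column per base model: its predicted AMP probability on the same test set
prob = pd.DataFrame({name: pd.read_csv(path, header=None).iloc[:, 0]
                     for name, path in prob_files.items()})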

Citation

Wanling Gao, Jun Zhao, Zehan Wang and Zhenyu Yue*, Comprehensive assessment of BERT-based methods for predicting antimicrobial peptides, 2023, Submitted.

Contact

Please feel free to contact us if you need any help: [email protected]
