
atm-tcr's Introduction

ATM-TCR

ATM-TCR demonstrates how a multi-head self-attention-based model can learn structural information from protein sequences to make binding affinity predictions.

Publication

ATM-TCR: TCR-Epitope Binding Affinity Prediction Using a Multi-Head Self-Attention Model
Michael Cai1,2, Seojin Bang2, Pengfei Zhang1,2, Heewook Lee1,2
1 School of Computing and Augmented Intelligence, Arizona State University, 2 Biodesign Institute, Arizona State University
Published in: Frontiers in Immunology, 2022.

Model Structure

The model takes a pair of epitope and TCR sequences as input and returns the binding affinity between the two. Each sequence is processed through an embedding layer before reaching the multi-head self-attention layer. The outputs of these layers are then concatenated and fed through a linear decoder layer to produce the final binding affinity score.

(Figure: model architecture diagram)
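As a rough illustration, the sketch below shows how such an architecture could be written in PyTorch. The layer sizes, number of attention heads, and maximum sequence lengths are illustrative assumptions, not the exact values used in the released code.

# Illustrative sketch of an ATM-TCR-style model (hyperparameters are assumptions).
import torch
import torch.nn as nn

class AttentionBindingModel(nn.Module):
    def __init__(self, vocab_size=25, embed_dim=25, num_heads=5,
                 max_len_epitope=22, max_len_tcr=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One multi-head self-attention block per input sequence
        self.attn_epitope = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.attn_tcr = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Linear decoder over the concatenated, flattened attention outputs
        self.decoder = nn.Sequential(
            nn.Linear((max_len_epitope + max_len_tcr) * embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),
        )

    def forward(self, epitope_ids, tcr_ids):
        e = self.embed(epitope_ids)                # (batch, len_epitope, embed_dim)
        t = self.embed(tcr_ids)                    # (batch, len_tcr, embed_dim)
        e_attn, _ = self.attn_epitope(e, e, e)     # self-attention over the epitope
        t_attn, _ = self.attn_tcr(t, t, t)         # self-attention over the TCR
        combined = torch.cat([e_attn.flatten(1), t_attn.flatten(1)], dim=1)
        return self.decoder(combined).squeeze(-1)  # binding affinity score in [0, 1]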

Requirements

Written using Python 3.8.10

The pip package dependencies are detailed in requirements.txt

To install the dependencies directly from the requirements list:

pip install -r requirements.txt

It is recommended that you use a virtual environment.

Input File Format

The input file should be a CSV with the following format:

Epitope,TCR,Binding Affinity

Here, the epitope and TCR are the linear protein sequences and the binding affinity is either 0 or 1.

# Example
GLCTLVAML,CASSEGQVSPGELF,1
GLCTLVAML,CSATGTSGRVETQYF,0

If your data is unlabeled and you are only interested in the predictions, simply set the label to all 0s or all 1s. The performance statistics can be ignored in this case, and the predicted binding affinity scores can be collected from the output file.
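As a sketch, such a file can be assembled with pandas. The pair values are taken from the example above; the output path is a placeholder, and no header row is written, matching the example:

# Sketch: write an input CSV in the Epitope,TCR,Binding Affinity format (path is a placeholder).
import pandas as pd

pairs = [
    ("GLCTLVAML", "CASSEGQVSPGELF", 1),   # labeled positive pair
    ("GLCTLVAML", "CSATGTSGRVETQYF", 0),  # labeled negative pair
]
df = pd.DataFrame(pairs, columns=["Epitope", "TCR", "Binding Affinity"])
df.to_csv("data/my_pairs.csv", index=False, header=False)  # no header, as in the example above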

Training

To train the model on our dataset using the default settings on the first GPU:

CUDA_VISIBLE_DEVICES=0 python main.py --infile data/combined_dataset.csv

To change the device used for training, set CUDA_VISIBLE_DEVICES to the device number reported by nvidia-smi.

The default model name used by the program is original.ckpt. To change the name of the model that is saved or loaded, use the following optional argument:

--model_name my_custom_model_name

After training has finished, the model will be saved in the models folder as model_name.ckpt, and two CSV files, perf_model_name.csv and pred_model_name.csv, will be written to the result folder.

perf_model_name.csv contains the performance metrics recorded throughout training. Each line of the CSV gives the model's performance on the validation set for a particular epoch, and the last line contains the final performance statistics.

# Example
Loss        Accuracy  Precision1  Precision0  Recall1  Recall0  F1Macro  F1Micro  AUC
37814.6235  0.6101    0.6241      0.5988      0.5542   0.666    0.6089   0.6101   0.6749
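As a sketch, the per-epoch metrics can be inspected with pandas. The file path assumes the default model name, and the delimiter is sniffed automatically since the exact separator is not documented here:

# Sketch: inspect per-epoch validation metrics from result/perf_<model_name>.csv.
import pandas as pd

perf = pd.read_csv("result/perf_original.csv", sep=None, engine="python")
print(perf.tail(1))                    # last line holds the final performance statistics
best_epoch = perf["AUC"].idxmax()      # epoch with the highest validation AUC
print(f"Best AUC {perf['AUC'].max():.4f} at epoch {best_epoch}")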

pred_model_name.csv contains the model's predictions on the validation set. Each line is an epitope-TCR pair from the validation set along with its true label, the predicted label, and the binding affinity score calculated by the model.

# Example
Epitope    TCR          Actual  Prediction  Binding Affinity
GLCTLVAML  CASCWNYEQYF  1       1           0.9996516704559326
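If you want to recompute metrics directly from this file, the following is a sketch using scikit-learn. The column names follow the example above, the path assumes the default model name, and the delimiter is sniffed rather than assumed:

# Sketch: recompute ROC AUC from result/pred_<model_name>.csv.
import pandas as pd
from sklearn.metrics import roc_auc_score

pred = pd.read_csv("result/pred_original.csv", sep=None, engine="python")
# Column names follow the example above; adjust them if your file differs.
print("ROC AUC:", roc_auc_score(pred["Actual"], pred["Binding Affinity"]))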

Testing

To make predictions using a pre-trained model:

python main.py --infile data/combined_dataset.csv --indepfile data/covid19_data.txt --model_name my_custom_model_name --mode test

The predictions will be saved in the result folder under the name pred_model_name_indep_test_data.csv, formatted in the same way as the validation-set predictions produced during training.

Optional Arguments

For more information on the optional hyperparameter and training arguments:

python main.py --help

Data

See the README inside of the data folder for additional information.

License


This work is licensed under a Creative Commons Attribution 4.0 International License.



atm-tcr's Issues

McPas-TCR dataset

Hello,

Could you maybe, if you still have it stored somewhere, send me the unparsed McPas-TCR dataset? The page to download it from is down, and I can't seem to find another source for it.

P.S. Actually if you can share all 3 unparsed datasets it would be amazing.

Thank you!

Unclear CLI argument: `--indepfile`

In the example command given, the required argument --indepfile is provided.

However, it is not clear how this should be used separately from --infile when using a pre-trained model to create predictions on an input file.

Clearer documentation would help clarify this.

python main.py --infile data/combined_dataset.csv --indepfile data/covid19_data.txt --model_name my_custom_model_name --mode test

Deprecated code

torchtext legacy seems to have been deprecated, because I'm getting an error:

Traceback (most recent call last):
  File "/home/trevizani/prog/ATM-TCR/main.py", line 11, in <module>
    from data_loader import define_dataloader, load_embedding, load_data_split
  File "/home/trevizani/prog/ATM-TCR/data_loader.py", line 5, in <module>
    from torchtext.legacy.data import Pipeline, Dataset, Field, Iterator, Example, RawField, get_tokenizer
ModuleNotFoundError: No module named 'torchtext.legacy'

`pip install -r requirements` is broken

Upon setting up the required environment, I run into the following error:

ERROR: Could not find a version that satisfies the requirement scikit-learn==0.24.1 (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.17, 0.17.1, 0.18, 0.18.1, 0.18.2, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21.1, 0.21.2, 0.21.3, 0.22, 0.22.1, 0.22.2.post1, 0.23.0, 0.23.1, 0.23.2, 0.24.0, 0.24.1, 0.24.2, 1.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.2.0rc1, 1.2.0, 1.2.1, 1.2.2, 1.3.0rc1, 1.3.0, 1.3.1, 1.3.2, 1.4.0rc1, 1.4.0, 1.4.1.post1)
ERROR: No matching distribution found for scikit-learn==0.24.1

Could the requirements.txt please be updated?

about data from IEDB

I am wondering how to confirm that the labels of all TCR-pMHC pairs are 1 under your search settings (IEDB dataset).

License?

Hey @cai-michael, very exciting work. What license have you released the code under? Would love to try it out at work. Cheers, J.
