oatml / eve

Official repository for the paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning", a collaboration between the Marks lab and the OATML group.

Home Page: http://evemodel.org/

License: MIT License

Languages: Python 100.00%
Topics: protein, evolutionary-data, eve-models, generative-model, pytorch

eve's Introduction

Evolutionary model of Variant Effects (EVE)

Please note that we have migrated the official repo to the following address: https://github.com/OATML-Markslab/EVE.

Overview

EVE is a set of protein-specific models that provide, for any single amino acid mutation of interest, a score reflecting the propensity of the resulting protein to be pathogenic. For each protein family, a Bayesian VAE learns a distribution over amino acid sequences from evolutionary data. This enables the computation of an evolutionary index for each mutant, which approximates the log-likelihood ratio of the mutant versus the wild type. A global-local mixture of Gaussian Mixture Models then separates variants into benign and pathogenic clusters based on that index. EVE scores reflect the probabilistic assignment of each variant to the pathogenic cluster.
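
As a rough illustration of this pipeline (this is not the EVE implementation: the VAE log-likelihood is stubbed out, and a plain two-component Gaussian mixture stands in for the paper's global-local mixture of GMMs), the scoring logic might be sketched as:

  # Illustrative sketch only -- not EVE code.
  import numpy as np
  from sklearn.mixture import GaussianMixture

  def approx_log_likelihood(sequence: str) -> float:
      """Placeholder for the Bayesian VAE's (sampled) ELBO estimate of log p(sequence)."""
      raise NotImplementedError

  def evolutionary_index(mutant_seq: str, wildtype_seq: str) -> float:
      # Negative log-likelihood ratio of mutant vs. wild type (higher => more deleterious).
      return -(approx_log_likelihood(mutant_seq) - approx_log_likelihood(wildtype_seq))

  def eve_like_scores(evol_indices: np.ndarray) -> np.ndarray:
      """Probability of belonging to the pathogenic (higher-index) cluster."""
      X = evol_indices.reshape(-1, 1)
      gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
      pathogenic = int(np.argmax(gmm.means_))     # cluster with the larger mean index
      return gmm.predict_proba(X)[:, pathogenic]  # score in [0, 1]

Scores near 0.5 correspond to uncertain class assignments, which is what the uncertainty outputs described below capture.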

Usage

The end-to-end process to compute EVE scores consists of three consecutive steps:

  1. Train the Bayesian VAE on a re-weighted multiple sequence alignment (MSA) for the protein of interest => train_VAE.py
  2. Compute the evolutionary indices for all single amino acid mutations => compute_evol_indices.py
  3. Train a GMM to cluster variants on the basis of their evolutionary indices, then output scores and uncertainties on the class assignments => train_GMM_and_compute_EVE_scores.py

We also provide pre-computed EVE scores for all single amino acid mutations across thousands of proteins at http://evemodel.org/.
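
For orientation, a minimal Python driver for these three steps could look like the sketch below. The command-line flags shown are placeholders, not the scripts' actual arguments; see the bash scripts under examples/ for the exact options.

  import subprocess

  protein = "PTEN"  # any protein with an MSA under data/MSA

  # Step 1: train the Bayesian VAE on the protein's re-weighted MSA.
  # NOTE: "--protein" is an illustrative placeholder flag; use the real arguments from examples/.
  subprocess.run(["python", "train_VAE.py", "--protein", protein], check=True)

  # Step 2: compute evolutionary indices for all single amino acid mutations.
  subprocess.run(["python", "compute_evol_indices.py", "--protein", protein], check=True)

  # Step 3: fit the GMM on the indices and write out EVE scores and uncertainties.
  subprocess.run(["python", "train_GMM_and_compute_EVE_scores.py", "--protein", protein], check=True)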

Example scripts

The "examples" folder contains sample bash scripts to obtain EVE scores for a protein of interest (using PTEN as an example). MSAs and ClinVar labels are provided for 4 proteins (P53, PTEN, RASH and SCN5A) in the data folder.

Data requirements

The only data required to train EVE models and obtain EVE scores from scratch are the multiple sequence alignments (MSAs) for the corresponding proteins.

MSA creation

We built multiple sequence alignments for each protein family by performing five search iterations of the profile HMM homology search tool jackhmmer against the UniRef100 database of non-redundant protein sequences (downloaded on April 20, 2020). Please refer to the supplementary notes of the EVE paper (section 3.1.1) for a detailed description of the MSA creation process. This GitHub repo provides the MSAs for 4 proteins: P53, PTEN, RASH & SCN5A (see data/MSA). MSAs for all proteins may be accessed on our website (https://evemodel.org/).
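
For reference, the kind of jackhmmer invocation involved could look roughly like the sketch below, with bit-score thresholds set to 0.3 bits per residue of the query; the exact options used for EVE are described in the supplementary notes, so treat the flags and values here as assumptions.

  import subprocess

  query_fasta = "PTEN.fasta"       # query sequence of the protein of interest (hypothetical filename)
  uniref100 = "uniref100.fasta"    # UniRef100 database in FASTA format
  query_length = 403               # length of the PTEN query sequence
  bitscore = round(0.3 * query_length, 1)  # 0.3 bits/residue threshold

  # Five jackhmmer iterations against UniRef100; -T and --incT set the reporting
  # and inclusion bit-score thresholds, -A saves the resulting alignment.
  subprocess.run([
      "jackhmmer", "-N", "5",
      "-T", str(bitscore), "--incT", str(bitscore),
      "-A", "PTEN_jackhmmer.sto",
      query_fasta, uniref100,
  ], check=True)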

MSA pre-processing

The EVE codebase provides basic functionality to pre-process MSAs for modelling (see the MSA_processing class in utils/data_utils.py). By default, sequences with 50% or more gaps in the alignment and/or positions with less than 70% residue occupancy are removed. These thresholds may be adjusted as needed by the end user.
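
A minimal usage sketch is below; the constructor arguments (in particular the threshold parameter names) and the MSA filename are assumptions made for illustration, so check the class definition in utils/data_utils.py for the exact signature.

  # Illustrative only -- argument names are assumptions; see utils/data_utils.py.
  from utils.data_utils import MSA_processing

  msa = MSA_processing(
      MSA_location="data/MSA/PTEN.a2m",    # hypothetical filename
      threshold_sequence_frac_gaps=0.5,    # drop sequences with >= 50% gaps
      threshold_focus_cols_frac_gaps=0.3,  # drop columns with < 70% residue occupancy
      use_weights=True,                    # compute sequence re-weighting for training
  )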

ClinVar labels

The script "train_GMM_and_compute_EVE_scores.py" provides functionalities to compare EVE scores with reference labels (e.g., ClinVar). Our github repo provides labels for 4 proteins: P53, PTEN, RASH & SCN5A (see data/labels). ClinVar labels for all proteins may be accessed on our website (https://evemodel.org/).

Software requirements

The entire codebase is written in Python. Package requirements are as follows:

  • python=3.7
  • pytorch=1.7
  • cudatoolkit=11.0
  • scikit-learn=0.24.1
  • numpy=1.20.1
  • pandas=1.2.4
  • scipy=1.6.2
  • tqdm
  • matplotlib
  • seaborn

The corresponding environment may be created via conda and the provided protein_env.yml file as follows:

  conda env create -f protein_env.yml
  conda activate protein_env

License

This project is available under the MIT license.

Reference

If you use this code, please cite the following paper:

@article{Frazer2021DiseaseVP,
  title={Disease variant prediction with deep generative models of evolutionary data},
  author={Jonathan Frazer and Pascal Notin and Mafalda Dias and Aidan Gomez and Joseph K Min and Kelly P. Brock and Yarin Gal and Debora S. Marks},
  journal={Nature},
  year={2021}
}

eve's People

Contributors

pascalnotin

eve's Issues

Why are EVE scores missing for a large fragment of some proteins, like the HCN4 protein?

Hello! EVE is a very good piece of work; thank you so much for your contributions to the community.

I recently ran into some problems while using EVE to score genetic variants, and I am very much looking forward to your reply!

I would like to ask the following three questions:

1) What are "_ASM" and "_BPU"? Is there a help document describing what each column contains? When the two results differ, which one should be chosen? (For example, in the CSV files for PTEN.)

2) Which transcripts do the 3,000+ proteins on EVE's website refer to? I found that for different transcripts, the corresponding amino acid change of a variant can differ. I am referring to the MANE project (see http://tark.ensembl.org/web/mane_project/):

The Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration that aims to converge on human gene annotation and to produce a genome wide transcript set that includes pairs of RefSeq (NM) and Ensembl/GENCODE (ENST) transcripts that are 100% identical.

3) Why are EVE scores for a protein missing a large fragment, as for the HCN4 protein (https://evemodel.org/proteins/HCN4_HUMAN)?

I am looking forward to your reply very much!

Kind regards,
Licko

Why do some files contain two evolutionary index columns for some proteins?

Hello!
I would like to use EVE scores in my work. When I downloaded the protein data provided on your website, a couple of questions came up:
(1) Why are there two columns called "evolutionary_index_ASM"? Is there any difference between them?
(2) Why are there two columns called "EVE_scores_ASM", and why do their values differ?

These issues only appear in "CL065_HUMAN.csv" and "G6PC_HUMAN.csv".

I'd appreciate it if you could answer my questions.
Kind regards,
Liu

Huge Autoencoder Scenario

Hello,
I was trying to use EVE on the BLAT_ECOLX dataset from your paper, using a huge architecture, a large latent dimension and a small learning rate, while setting the KL coefficient to zero. In this setting the autoencoder should ideally memorize every datapoint and drive the BCE to 0. However, the BCE doesn't change much compared to a small autoencoder and doesn't get close to zero; it stays at a high value (around 800).
As you have reported results on this dataset as well, I was wondering if there is any explanation for this phenomenon.

Question about example data

Hi all,

I have recently read the EVE paper and got two questions.

The first one is about the PTEN alignment file: is this file the alignment before pre-processing, or the result after removing inadequate fragments and columns? I notice that all of the sequences in the alignment have the same length, which may not be the usual condition for an initial a2m result from jackhmmer (please correct me if I am wrong).

Second, I am a bit confused about the description of the 0.3 bits/residue reference in the paper. Did you mean using 0.3 multiplied by the length of the target sequence as starting values for the jackhmmer parameters -T, --domT, --incT and --incdomT? If so, there seem to be sequences (e.g. UniRef100_A0A4U5VQ93) that do not satisfy the condition Lcov >= 0.7L (for PTEN this should be 0.7*403 = 282, while the number of valid residues in UniRef100_A0A4U5VQ93 is only 204). The total number of aligned sequences is also less than 10L; I guess this is because, when the 0.7L threshold is used, the 10L requirement is automatically ignored?

Could you help me with the above questions? Thanks a lot.

Best,
Nan

reproducing evolutionary indices with example scripts

Hi,

I am trying to reproduce the results from your publication and evemodel.org. For the PTEN example this works well, but for the other proteins with data in the repo (e.g. P53_HUMAN) the distribution of evolutionary indices is always shifted (or somewhat squashed) towards lower values compared to the distributions shown in the paper, as in the example below. Do you use the same default parameters and scripts for all example proteins? If not, which should be tuned/changed relative to the PTEN example?

[Attached: histogram of evolutionary indices from random samples for P53_HUMAN]

Lots of memory usage when running evol_indices with many sequences

Hi EVE team,

I'm running compute_evol_indices.py on a dataset with many variants in a single CSV file (>400k variants, specifically UniProt ID SPG1_STRSG_Olson_2014).

When I try to compute evolutionary indices for these variants, it requires over 100 GB of memory and my job stalls out. I think PyTorch may be keeping previously computed batches in memory, because a single batch only requires roughly 1 GB.

It's easy to work around this by breaking up the dataset, but that is rather inconvenient, so it would be great if this could be fixed.

Let me know if this issue makes sense, and if it is reproducible.

Take care,
Bryce
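
A generic workaround sketch for this kind of memory growth, assuming the per-batch results are being accumulated as GPU tensors still attached to the autograd graph (this is not code from the EVE repo):

  # Generic PyTorch pattern, not EVE code: keep memory flat when scoring many variants.
  import torch

  def score_in_chunks(model, variant_batches):
      """Return scores for all batches as a single detached CPU tensor."""
      results = []
      with torch.no_grad():                       # no autograd graph is built
          for batch in variant_batches:           # e.g. a DataLoader over the 400k variants
              out = model(batch)                  # stand-in for the per-batch evol-index computation
              results.append(out.detach().cpu())  # move off the GPU before the next batch
      return torch.cat(results)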

About the trained models and ROC curve

Dear authors,

We have recently been working with your paper and the released code, and we have two questions we hope you can help with:

  1. Training the models for all protein sequences from scratch is quite time-consuming; are trained models for all proteins available alongside the released code?
  2. You evaluated AUC and ROC curves in your paper, but there seems to be no corresponding code in the release. Could you please provide it?

Many thanks.
