
oatml / eve


Official repository for the paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning". Joint collaboration between the Marks lab and the OATML group.

Home Page: http://evemodel.org/

License: MIT License

Python 100.00%
protein evolutionary-data eve-models generative-model pytorch

eve's Issues

High memory usage when running evol_indices with many sequences

Hi EVE team,

I'm running compute_evol_indices.py on a dataset with many variants in a single CSV file (>400k variants, specifically UniProt ID SPG1_STRSG_Olson_2014).

When I try to compute the evolutionary indices of these variants, it requires over 100 GB of memory and my job stalls out. I think PyTorch may be keeping previously computed batches in memory, because a single batch only requires roughly 1 GB of memory.

It's easy to work around this by breaking up the dataset, but that's rather inconvenient, so it would be great if this issue could be fixed.

Let me know if this issue makes sense, and if it is reproducible.

Take care,
Bryce
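
For reference, a minimal sketch of the kind of chunked, no-grad scoring loop that keeps per-batch memory flat; this is an illustration only, not the repository's compute_evol_indices.py, and encode_variants is a hypothetical helper standing in for the model's preprocessing:

```python
import torch

# Illustrative only: score a large variant list in batches so per-batch tensors
# are released between iterations instead of accumulating in memory.
# `encode_variants` is a hypothetical helper, not part of the EVE codebase.
def score_in_chunks(model, all_variants, batch_size=256):
    scores = []
    model.eval()
    with torch.no_grad():  # prevents autograd graphs from being retained across batches
        for i in range(0, len(all_variants), batch_size):
            batch = encode_variants(all_variants[i:i + batch_size])
            out = model(batch)
            scores.append(out.detach().cpu())  # keep results on CPU, free GPU memory
            del batch, out
    return torch.cat(scores)
```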

Why do some files contain two evolutionary index columns for some proteins?

Hello!
I recently wanted to use EVE scores in my work. When I downloaded the protein data provided on your website, a few questions came up:
(1) Why are there two columns called "evolutionary_index_ASM"? Is there any difference between them?
(2) Why are there two columns called "EVE_scores_ASM", and why are their values different?

These two issues only appear in "CL065_HUMAN.csv" and "G6PC_HUMAN.csv".

I'd appreciate it if you could answer my questions.
Kind regards,
Liu
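
One way to see how the duplicated headers come through in practice is sketched below; this is not anything from the EVE codebase, just a quick check using the file name mentioned above. Note that pandas renames repeated headers by appending ".1" on read.

```python
import pandas as pd

# Sketch only: inspect duplicated column headers in one of the downloaded files.
# pandas de-duplicates repeated headers when reading, so a second
# "EVE_scores_ASM" column appears as "EVE_scores_ASM.1".
df = pd.read_csv("CL065_HUMAN.csv")

dupes = [c for c in df.columns if c.endswith(".1")]
print("duplicated headers:", dupes)

for c in dupes:
    base = c[:-2]
    if base in df.columns:
        print(f"{base}: copies identical -> {df[base].equals(df[c])}")
```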

Why are EVE scores for a protein missing a large fragment? Like the HCN4 protein

Hello! EVE is excellent work; thank you so much for your contributions to the community.

I recently ran into some problems while using EVE to score genetic variants.

I would like to ask the following three questions:

1) What are "_ASM" and "_BPU"? Is there a help document that describes each column? When the two results differ, which one should be chosen? For example, in the CSV files for PTEN:


2) Which transcripts do the 3,000+ proteins on EVE's website refer to? I found that different transcripts give different corresponding amino acid variants. I am referring to the MANE project (see: http://tark.ensembl.org/web/mane_project/):

The Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration that aims to converge on human gene annotation and to produce a genome-wide transcript set that includes pairs of RefSeq (NM) and Ensembl/GENCODE (ENST) transcripts that are 100% identical.

3) Why are EVE scores for a protein missing a large fragment, as with the HCN4 protein (https://evemodel.org/proteins/HCN4_HUMAN)?

I am looking forward to your reply very much!

Kind regards,
Licko

Question about example data

Hi all,

I have recently read the EVE paper and got two questions.

The first is related to the PTEN alignment file: is this file the raw alignment before preprocessing, or the result after removing inadequate fragments and columns? I notice that all sequences in the alignment have the same length, which may not be usual for an initial a2m result from jackhmmer (please correct me if I'm wrong).

Second, I am a bit confused about the description of the 0.3 bits/residue reference in the paper. Did you mean using 0.3 multiplied by the length of the target sequence as the value for the jackhmmer parameters -T, --domT, --incT, --incdomT? If so, it seems there are sequences (e.g. UniRef100_A0A4U5VQ93) that do not satisfy the condition Lcov >= 0.7L (for PTEN this should be 0.7 × 403 ≈ 282, but the number of valid residues in UniRef100_A0A4U5VQ93 is only 204). The total number of alignment sequences is also less than 10L; I guess this is because when the 0.7L threshold is used, the 10L requirement is automatically ignored?
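
For concreteness, here is a small sketch of these thresholds as I read them; this is my interpretation and not the authors' pipeline, and the coverage count assumes uppercase letters mark aligned match-state residues in the a2m file:

```python
# Sketch of my reading of the thresholds (not the authors' pipeline).
L = 403                                      # PTEN target length
bits_per_residue = 0.3
bitscore_threshold = bits_per_residue * L    # value passed to jackhmmer -T/--domT/--incT/--incdomT
min_coverage = 0.7 * L                       # ~282 aligned residues required per sequence

def passes_coverage(a2m_row: str) -> bool:
    """Assumes uppercase characters are aligned match-state residues in a2m format."""
    covered = sum(1 for c in a2m_row if c.isupper())
    return covered >= min_coverage
```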

Could you help me with the above questions? Thanks a lot.

Best,
Nan

Reproducing evolutionary indices with example scripts

Hi,

I am trying to reproduce the results from your publication and evemodel.org. For the PTEN example this works well, but for the other proteins for which there is data in the repo (e.g. P53_HUMAN), the distribution of evolutionary indices is always shifted (or somewhat squashed) towards lower values compared to the distributions shown in the publication, like in the example below. Do you use the same default parameters and scripts for all example proteins? And if not, which should be tuned/changed from the PTEN example?

[Figure: histogram of evolutionary indices from random samples for P53_HUMAN]
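
For reference, a quick way to regenerate such a histogram from a scores file is sketched below; the file and column names are assumptions about the script output, not guaranteed to match the repository's exact naming:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed output of the evolutionary indices step; file and column names are guesses.
df = pd.read_csv("P53_HUMAN_evol_indices.csv")
plt.hist(df["evol_indices"], bins=100)
plt.xlabel("evolutionary index")
plt.ylabel("count")
plt.title("P53_HUMAN evolutionary indices (random samples)")
plt.savefig("histogram_random_samples_P53_HUMAN.png")
```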

About the trained models and ROC curve

Dear authors,

We have recently been working with your paper and the released code, and we have two questions we hope you could help us with:

  1. Training the models for all protein sequences from scratch is quite time-consuming. Are pre-trained models for all protein sequences available along with the released code?
  2. You evaluated AUC and ROC in your paper, but there seems to be no corresponding code in the release. Could you please provide it?

Many thanks.
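
Since no evaluation script ships with the release, here is a hedged sketch of how an ROC/AUC of this kind could be computed with scikit-learn; the file and column names are assumptions for illustration, not the released schema:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

# Illustration only: join EVE scores with clinical labels and compute ROC/AUC.
# File and column names below are assumptions, not the released schema.
df = pd.read_csv("PTEN_scores_with_clinvar_labels.csv")
y_true = (df["clinvar_label"] == "Pathogenic").astype(int)
y_score = df["EVE_scores"]

print("ROC AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```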

Huge Autoencoder Scenario

Hello,
I was trying to use EVE on the BLAT_ECOLX dataset from your paper, using a huge architecture, a large latent dimension, and a small learning rate, while setting the KL coefficient to zero. In an ideal scenario, the corresponding autoencoder should memorize every datapoint and reach a BCE of 0. However, the BCE doesn't change much compared to a small autoencoder and doesn't get close to zero; it stays at a high value (around 800).
As you have reported results on this dataset as well, I was wondering if there is any explanation for this phenomenon.
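
For reference, a minimal sketch of the kind of objective being described, with the KL term scaled by a coefficient; this is a generic VAE loss for one-hot sequence inputs, not the repository's training code:

```python
import torch
import torch.nn.functional as F

# Generic VAE objective for one-hot encoded sequences (not EVE's training loop).
# With kl_weight=0 the loss reduces to pure reconstruction (BCE), which is the
# setting described above.
def vae_loss(x_onehot, recon_logits, mu, log_var, kl_weight=0.0):
    bce = F.binary_cross_entropy_with_logits(
        recon_logits, x_onehot, reduction="none"
    ).sum(dim=(-2, -1)).mean()  # sum over positions and alphabet, mean over batch
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
    return bce + kl_weight * kl
```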
