
oatml / eve


Official repository for the paper "Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning". Joint collaboration between the Marks lab and the OATML group.

Home Page: http://evemodel.org/

License: MIT License

Python 100.00%
protein evolutionary-data eve-models generative-model pytorch

eve's Issues

High memory usage when running evol_indices with many sequences

Hi EVE team,

I'm running compute_evol_indices.py on a dataset with many variants in a single CSV file (>400k variants, specifically UniProt ID SPG1_STRSG_Olson_2014).

When I try to compute the evolutionary indices of these variants, it requires over 100 GB of memory and my job stalls out. I think PyTorch may be keeping previously computed batches in memory, because a single batch only requires roughly 1 GB of memory.

It's easy to work around this by breaking up the dataset, but that's rather inconvenient, so it would be great if this issue could be fixed.

Let me know if this issue makes sense, and if it is reproducible.

Take care,
Bryce
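
For reference, a minimal sketch of the kind of chunked, no-grad scoring loop that keeps per-batch memory flat; this is an illustration only, not the repository's compute_evol_indices.py, and encode_variants is a hypothetical helper standing in for the model's preprocessing:

```python
import torch

# Illustrative only: score a large variant list in batches so per-batch tensors
# are released between iterations instead of accumulating in memory.
# `encode_variants` is a hypothetical helper, not part of the EVE codebase.
def score_in_chunks(model, all_variants, batch_size=256):
    scores = []
    model.eval()
    with torch.no_grad():  # prevents autograd graphs from being retained across batches
        for i in range(0, len(all_variants), batch_size):
            batch = encode_variants(all_variants[i:i + batch_size])
            out = model(batch)
            scores.append(out.detach().cpu())  # keep results on CPU, free GPU memory
            del batch, out
    return torch.cat(scores)
```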

Why do some files contain two evolutionary index columns for some proteins?

Hello!
I recently wanted to use EVE scores in my work. When I downloaded the protein data provided on your website, a few questions came up:
(1) Why are there two columns called "evolutionary_index_ASM"? Is there any difference between them?
(2) Why are there two columns called "EVE_scores_ASM", and why are their values different?

These two issues only appear in "CL065_HUMAN.csv" and "G6PC_HUMAN.csv".

I'd appreciate it if you could answer my questions.
Kind regards,
Liu
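
One way to see how the duplicated headers come through in practice is sketched below; this is not anything from the EVE codebase, just a quick check using the file name mentioned above. Note that pandas renames repeated headers by appending ".1" on read.

```python
import pandas as pd

# Sketch only: inspect duplicated column headers in one of the downloaded files.
# pandas de-duplicates repeated headers when reading, so a second
# "EVE_scores_ASM" column appears as "EVE_scores_ASM.1".
df = pd.read_csv("CL065_HUMAN.csv")

dupes = [c for c in df.columns if c.endswith(".1")]
print("duplicated headers:", dupes)

for c in dupes:
    base = c[:-2]
    if base in df.columns:
        print(f"{base}: copies identical -> {df[base].equals(df[c])}")
```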

Why are EVE scores for a protein missing a large fragment? Like the HCN4 protein

Hello! EVE is excellent work; thank you so much for your contributions to the community.

I recently ran into some problems while using EVE to score genetic variants.

I would like to ask the following three questions:

1) What are "_ASM" and "_BPU"? Is there a help document that describes each column? When the two results differ, which one should be chosen? For example, in the CSV files for PTEN:


2) Which transcripts do the 3,000+ proteins on EVE's website refer to? I found that different transcripts give different corresponding amino acid variants. I am referring to the MANE project (see: http://tark.ensembl.org/web/mane_project/):

The Matched Annotation from NCBI and EMBL-EBI (MANE) is a collaboration that aims to converge on human gene annotation and to produce a genome-wide transcript set that includes pairs of RefSeq (NM) and Ensembl/GENCODE (ENST) transcripts that are 100% identical.

3) Why are EVE scores for a protein missing a large fragment, as with the HCN4 protein (https://evemodel.org/proteins/HCN4_HUMAN)?

I am looking forward to your reply very much!

Kind regards,
Licko

Question about example data

Hi all,

I have recently read the EVE paper and got two questions.

The first is related to the PTEN alignment file: is this file the raw alignment before preprocessing, or the result after removing inadequate fragments and columns? I notice that all sequences in the alignment have the same length, which may not be usual for an initial a2m result from jackhmmer (please correct me if I'm wrong).

Second, I am a bit confused about the description of the 0.3 bits/residue reference in the paper. Did you mean using 0.3 multiplied by the length of the target sequence as the value for the jackhmmer parameters -T, --domT, --incT, --incdomT? If so, it seems there are sequences (e.g. UniRef100_A0A4U5VQ93) that do not satisfy the condition Lcov >= 0.7L (for PTEN this should be 0.7 × 403 ≈ 282, but the number of valid residues in UniRef100_A0A4U5VQ93 is only 204). The total number of alignment sequences is also less than 10L; I guess this is because when the 0.7L threshold is used, the 10L requirement is automatically ignored?
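
For concreteness, here is a small sketch of these thresholds as I read them; this is my interpretation and not the authors' pipeline, and the coverage count assumes uppercase letters mark aligned match-state residues in the a2m file:

```python
# Sketch of my reading of the thresholds (not the authors' pipeline).
L = 403                                      # PTEN target length
bits_per_residue = 0.3
bitscore_threshold = bits_per_residue * L    # value passed to jackhmmer -T/--domT/--incT/--incdomT
min_coverage = 0.7 * L                       # ~282 aligned residues required per sequence

def passes_coverage(a2m_row: str) -> bool:
    """Assumes uppercase characters are aligned match-state residues in a2m format."""
    covered = sum(1 for c in a2m_row if c.isupper())
    return covered >= min_coverage
```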

Could you help me with the above questions? Thanks a lot.

Best,
Nan

Reproducing evolutionary indices with example scripts

Hi,

I am trying to reproduce the results from your publication and evemodel.org. For the PTEN example this works well, but for the other proteins for which there is data in the repo (e.g. P53_HUMAN), the distribution of evolutionary indices is always shifted (or somewhat squashed) towards lower values compared to the distributions shown in the publication, like in the example below. Do you use the same default parameters and scripts for all example proteins? And if not, which should be tuned/changed from the PTEN example?

[Figure: histogram of evolutionary indices from random samples for P53_HUMAN]
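
For reference, a quick way to regenerate such a histogram from a scores file is sketched below; the file and column names are assumptions about the script output, not guaranteed to match the repository's exact naming:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed output of the evolutionary indices step; file and column names are guesses.
df = pd.read_csv("P53_HUMAN_evol_indices.csv")
plt.hist(df["evol_indices"], bins=100)
plt.xlabel("evolutionary index")
plt.ylabel("count")
plt.title("P53_HUMAN evolutionary indices (random samples)")
plt.savefig("histogram_random_samples_P53_HUMAN.png")
```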

About the trained models and ROC curve

Dear authors,

We have recently been working with your paper and the released code, and we have two questions we hope you could help us with:

  1. Training the models for all protein sequences from scratch is quite time-consuming. Are pre-trained models for all protein sequences available along with the released code?
  2. You evaluated AUC and ROC in your paper, but there seems to be no corresponding code in the release. Could you please provide it?

Many thanks.
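
Since no evaluation script ships with the release, here is a hedged sketch of how an ROC/AUC of this kind could be computed with scikit-learn; the file and column names are assumptions for illustration, not the released schema:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

# Illustration only: join EVE scores with clinical labels and compute ROC/AUC.
# File and column names below are assumptions, not the released schema.
df = pd.read_csv("PTEN_scores_with_clinvar_labels.csv")
y_true = (df["clinvar_label"] == "Pathogenic").astype(int)
y_score = df["EVE_scores"]

print("ROC AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```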

Huge Autoencoder Scenario

Hello,
I was trying to use EVE on the BLAT_ECOLX dataset from your paper, using a huge architecture, a large latent dimension, and a small learning rate, while setting the KL coefficient to zero. In an ideal scenario, the corresponding autoencoder should memorize every datapoint and reach a BCE of 0. However, the BCE doesn't change much compared to a small autoencoder and doesn't get close to zero; it stays at a high value (around 800).
As you have reported results on this dataset as well, I was wondering if there is any explanation for this phenomenon.
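
For reference, a minimal sketch of the kind of objective being described, with the KL term scaled by a coefficient; this is a generic VAE loss for one-hot sequence inputs, not the repository's training code:

```python
import torch
import torch.nn.functional as F

# Generic VAE objective for one-hot encoded sequences (not EVE's training loop).
# With kl_weight=0 the loss reduces to pure reconstruction (BCE), which is the
# setting described above.
def vae_loss(x_onehot, recon_logits, mu, log_var, kl_weight=0.0):
    bce = F.binary_cross_entropy_with_logits(
        recon_logits, x_onehot, reduction="none"
    ).sum(dim=(-2, -1)).mean()  # sum over positions and alphabet, mean over batch
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
    return bce + kl_weight * kl
```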
