
j-snackkb / flip

A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design

License: Academic Free License v3.0

Jupyter Notebook 99.43% Python 0.57%
protein protein-sequences protein-design representation-learning protein-fitness-prediction protein-function-prediction

flip's People

Contributors

jmou2, joaquimgomez, kadinaj, kevingreenman, sacdallago, sebief, yangkky, zsewa


flip's Issues

Confusing results between the paper and supplementary information.

From what I understand, the values in the supplementary information are averages over 10 runs with random seeds. If that is the case, why are no error bars reported? Furthermore, the results reported in the main paper do not match those in the supplementary PDF. Am I right in assuming the main-paper values come from a single run?

Meltome MMseqs2 split parameters

Hi,
I'm interested in how the meltome train-test splits were made. Was the mmseqs easy-cluster command used with --min-seq-id 0.2, or were more parameters involved?
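For reference, a minimal MMseqs2 invocation with the one parameter mentioned above would look like the following. This is only a guess at the command, which is exactly what the question asks the authors to confirm; the input filename and output prefix are placeholders.

```shell
# Hypothetical reconstruction -- the actual FLIP command may include more flags.
# Arguments: input FASTA, output prefix, temporary directory.
mmseqs easy-cluster meltome_sequences.fasta clusterRes tmp --min-seq-id 0.2
```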

finding the data

Hello,

The procedure to run the benchmark is not entirely clear to me. I assume I should run the notebooks in collect_tasks, but doing so requires a data folder that is not available here. Where can I find this data?

Best

B

Wrong deletion masking for AAV task?

@sacdallago hi, thank you very much for your great data curation. I am planning to use the AAV dataset for my research.

I found that some deletion masks may not have been properly applied to the wild-type sequence: as the screenshot below shows, there are 29 sequences with different mutation_mask values but the same full_aa_sequence as the wild type. Is this the intended result?

[Screenshot (2022-08-23): DataFrame listing the 29 matching rows]

Below is the code for replication:

import pandas as pd
from Bio import SeqIO

# Wild-type sequence (UniProt P03135)
wt_seq = str(next(SeqIO.parse("P03135.fasta", "fasta")).seq)

# Select variants whose full sequence is identical to the wild type
variant_effects = pd.read_csv("full_data.csv")
wild_types = variant_effects.loc[variant_effects["full_aa_sequence"] == wt_seq]
wild_types

Stability dataset clustering data loss

Hi FLIP authors,
I have been working with the data-split routine you applied to the meltome atlas data and found some irregularities. You create the train and test splits based on clusters from MMseqs2, but the notebook routine seems off (in collect_flip/2_meltome_atlas.ipynb).
When creating the mixed dataset from the clusters, you remove a cluster center's key from the set the first time it is encountered in the full protein list, which I think makes the output datasets incorrect:

Cell 30, last 20 LOC

            if key in train:  # current datapoint is a cluster center
                clustered_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'train'
                })
                train.remove(key)  # <-- HERE
            elif key in test:  # current datapoint is a cluster center
                clustered_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'test'
                })

                mixed_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'test'
                })
                test.remove(key)  # <-- HERE

While removing the key is fine for the test set (only the cluster centers are used there anyway), for the training set it drops every sequence of that cluster processed after the cluster center.
After fixing this I get a training set of 67361 datapoints plus 3134 test datapoints (versus the 24817 training datapoints reported in the paper).

Am I misunderstanding something? 67361 is also 80% of the full clustered dataset (84030 entries), which matches the intended setting better: the mixed set should contain 80% of all data in train, plus only cluster centers for test, which are necessarily far fewer than 20% of all data.

I haven't checked whether the same error occurs in the other datasets, but I would recommend doing so.
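The effect described above can be shown with a minimal toy sketch. The function below is a hypothetical simplification of the notebook's loop (the data structure and names are invented for illustration, not the actual FLIP code); the `remove_on_hit` flag reproduces the `train.remove(key)` / `test.remove(key)` behaviour.

```python
def split_by_cluster(proteins, train, test, remove_on_hit):
    """Assign each (cluster_key, sequence) pair to train/test by its
    cluster-center key. remove_on_hit mimics the notebook's key removal."""
    train, test = set(train), set(test)
    out = []
    for key, seq in proteins:
        if key in train:
            out.append((seq, "train"))
            if remove_on_hit:
                train.remove(key)  # drops all later members of this cluster
        elif key in test:
            out.append((seq, "test"))
            if remove_on_hit:
                test.remove(key)
    return out

# Two clusters: "A" (center in train, three members) and "B" (center in test).
proteins = [("A", "seq1"), ("A", "seq2"), ("A", "seq3"), ("B", "seq4")]
buggy = split_by_cluster(proteins, train={"A"}, test={"B"}, remove_on_hit=True)
fixed = split_by_cluster(proteins, train={"A"}, test={"B"}, remove_on_hit=False)
print(len(buggy))  # 2 -- seq2 and seq3 are silently dropped from train
print(len(fixed))  # 4 -- all cluster members kept
```

With the removal in place, only the first sequence of each training cluster survives, which is consistent with the much smaller training-set size reported in the paper.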
