
j-snackkb / flip

A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design

License: Academic Free License v3.0

Jupyter Notebook 99.43% Python 0.57%
protein protein-sequences protein-design representation-learning protein-fitness-prediction protein-function-prediction

flip's People

Contributors

jmou2, joaquimgomez, kadinaj, kevingreenman, sacdallago, sebief, yangkky, zsewa


flip's Issues

Confusing results between the paper and supplementary information.

From what I understand, the values in the supplementary information are averages over 10 runs with random seeds. If that is the case, why are no error bars reported? Furthermore, the results reported in the main paper do not match those in the supplementary PDF. Am I right in assuming the main-paper values come from a single run?

Meltome MMseqs2 split parameters

Hi,
I'm interested in how the meltome train-test splits were made. Was the mmseqs easy-cluster command used with --min-seq-id 0.2, or were more parameters involved?
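For reference, a minimal MMseqs2 invocation with the one parameter mentioned above would look like the following. This is only a guess at the command, which is exactly what the question asks the authors to confirm; the input filename and output prefix are placeholders.

```shell
# Hypothetical reconstruction -- the actual FLIP command may include more flags.
# Arguments: input FASTA, output prefix, temporary directory.
mmseqs easy-cluster meltome_sequences.fasta clusterRes tmp --min-seq-id 0.2
```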

finding the data

Hello,

The procedure to run the benchmark is not entirely clear to me. I assume I should run the notebooks in collect_tasks, but doing so requires a data folder that is not available here. Where can I find this data?

Best

B

Wrong deletion masking for AAV task?

@sacdallago hi, thank you very much for your great data curation. I am planning to use the AAV dataset for my research.

I found that some deletion masks may not have been properly applied to the wild-type sequence: as the screenshot below shows, there are 29 sequences with different mutation_mask values but the same full_aa_sequence as the wild type. Is this the intended result?

[Screenshot (2022-08-23): DataFrame listing the 29 matching rows]

Below is the code for replication:

import pandas as pd
from Bio import SeqIO

# Wild-type sequence (UniProt P03135)
wt_seq = str(next(SeqIO.parse("P03135.fasta", "fasta")).seq)

# Select variants whose full sequence is identical to the wild type
variant_effects = pd.read_csv("full_data.csv")
wild_types = variant_effects.loc[variant_effects["full_aa_sequence"] == wt_seq]
wild_types

Stability dataset clustering data loss

Hi FLIP authors,
I have been working with the data-split routine you applied to the meltome atlas data and found some irregularities. You create the train and test splits based on clusters from MMseqs2, but the notebook routine seems off (in collect_flip/2_meltome_atlas.ipynb).
When creating the mixed dataset from the clusters, you remove a cluster center's key from the set the first time it is encountered in the full protein list, which I think makes the output datasets incorrect:

Cell 30, last 20 LOC

            if key in train:  # current datapoint is a cluster center
                clustered_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'train'
                })
                train.remove(key)  # <-- HERE
            elif key in test:  # current datapoint is a cluster center
                clustered_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'test'
                })

                mixed_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'test'
                })
                test.remove(key)  # <-- HERE

While removing the key is fine for the test set (only the cluster centers are used there anyway), for the training set it drops every sequence of that cluster processed after the cluster center.
After fixing this I get a training set of 67361 datapoints plus 3134 test datapoints (versus the 24817 training datapoints reported in the paper).

Am I misunderstanding something? 67361 is also 80% of the full clustered dataset (84030 entries), which matches the intended setting better: the mixed set should contain 80% of all data in train, plus only cluster centers for test, which are necessarily far fewer than 20% of all data.

I haven't checked whether the same error occurs in the other datasets, but I would recommend doing so.
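The effect described above can be shown with a minimal toy sketch. The function below is a hypothetical simplification of the notebook's loop (the data structure and names are invented for illustration, not the actual FLIP code); the `remove_on_hit` flag reproduces the `train.remove(key)` / `test.remove(key)` behaviour.

```python
def split_by_cluster(proteins, train, test, remove_on_hit):
    """Assign each (cluster_key, sequence) pair to train/test by its
    cluster-center key. remove_on_hit mimics the notebook's key removal."""
    train, test = set(train), set(test)
    out = []
    for key, seq in proteins:
        if key in train:
            out.append((seq, "train"))
            if remove_on_hit:
                train.remove(key)  # drops all later members of this cluster
        elif key in test:
            out.append((seq, "test"))
            if remove_on_hit:
                test.remove(key)
    return out

# Two clusters: "A" (center in train, three members) and "B" (center in test).
proteins = [("A", "seq1"), ("A", "seq2"), ("A", "seq3"), ("B", "seq4")]
buggy = split_by_cluster(proteins, train={"A"}, test={"B"}, remove_on_hit=True)
fixed = split_by_cluster(proteins, train={"A"}, test={"B"}, remove_on_hit=False)
print(len(buggy))  # 2 -- seq2 and seq3 are silently dropped from train
print(len(fixed))  # 4 -- all cluster members kept
```

With the removal in place, only the first sequence of each training cluster survives, which is consistent with the much smaller training-set size reported in the paper.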
