j-snackkb / flip
A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design
License: Academic Free License v3.0
From what I understand, the values in the supplementary information are averages over 10 runs with different random seeds. If that is the case, why are no error estimates reported? Furthermore, the results reported in the main paper do not match those in the supplementary PDF. Am I right to assume the values in the main paper come from a single run?
Hi,
I'm interested in how the meltome train-test splits were made. Was it just the mmseqs easy-cluster command with --min-seq-id 0.2, or are there more parameters involved?
Hello,
The procedure for running the benchmark is not really clear to me. I suppose I should run the notebooks in collect_tasks, but they need a data folder that is not available here. Where can I find this data?
Best
B
Hi all, I'm trying to access the raw data here: http://data.bioembeddings.com/public/FLIP/
but I run into a 403 error. Has the raw data moved?
@sacdallago hi, thank you very much for your great data curation. I am planning to use the AAV dataset for my research.
I found that some deletion masks may not have been properly applied to the wild type sequence: as the image below shows, there are 29 sequences with different mutation_mask values but with the same full_aa_sequence as the wild type. Is this the intended result?
Below is the code to reproduce this:
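import pandas as pd
from Bio import SeqIO

# Wild-type sequence, read from the first record of the FASTA file
wt_seq = str(next(SeqIO.parse("P03135.fasta", "fasta")).seq)
variant_effects = pd.read_csv("full_data.csv")
# Rows whose full sequence is identical to the wild type
wild_types = variant_effects.loc[variant_effects["full_aa_sequence"] == wt_seq]
print(wild_types)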
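The check described above can also be summarised by counting how many distinct masks occur among the rows that are identical to the wild type. The following is only a minimal sketch on a toy table: the sequences and the mask strings (mask_a, mask_b, mask_c) are made-up placeholders, not values from full_data.csv, and only the column names full_aa_sequence and mutation_mask are taken from the report.

```python
import pandas as pd

# Toy stand-in for full_data.csv. All values are hypothetical placeholders.
wt_seq = "MAADGY"
variant_effects = pd.DataFrame({
    "full_aa_sequence": ["MAADGY", "MAADGY", "MAADGY", "MAAGY"],
    "mutation_mask": ["mask_a", "mask_b", "mask_c", "mask_b"],
})

# Rows whose full sequence equals the wild type
same_as_wt = variant_effects[variant_effects["full_aa_sequence"] == wt_seq]

# If every mask were applied correctly, one would expect a single mask here.
n_rows = len(same_as_wt)
n_masks = same_as_wt["mutation_mask"].nunique()
print(f"{n_rows} rows equal the wild type, with {n_masks} distinct masks")
```

On the real data, replacing the toy table with the pd.read_csv call from the snippet above should give the corresponding counts for the reported rows.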
Hi FLIP authors,
I have been working with the data split routine you applied to the Meltome Atlas data and found some irregularities. You create the train and test splits based on clusters from MMseqs2, but the notebook routine seems off (in collect_flip/2_meltome_atlas.ipynb).
When creating the mixed dataset from the clusters, you remove a cluster center from its set as soon as you encounter it in the full protein list, which I think makes the output datasets incorrect:
Cell 30, last 20 lines of code:
    if key in train:  # current datapoint is a cluster center
        clustered_set.append({
            'sequence': protein.get('sequence'),
            'target': protein.get('meltingPoint'),
            'set': 'train'
        })
        train.remove(key)  # <-- HERE
    elif key in test:  # current datapoint is a cluster center
        clustered_set.append({
            'sequence': protein.get('sequence'),
            'target': protein.get('meltingPoint'),
            'set': 'test'
        })
        mixed_set.append({
            'sequence': protein.get('sequence'),
            'target': protein.get('meltingPoint'),
            'set': 'test'
        })
        test.remove(key)  # <-- HERE
Removing the key is fine for the test set, since only the cluster centers are used there anyway. For the training set, however, it holds out every sequence of that cluster that the loop processes after the cluster center.
After fixing this I get a training set of 67361 datapoints plus 3134 test datapoints (compared to the 24817 training datapoints reported in the paper).
Am I misunderstanding something here? 67361 is also about 80% of the full clustered dataset (84030 entries), so it would be more consistent with the described setting. In the end, the mixed set should contain 80% of all data in train plus only the cluster centers in test, which are obviously far fewer than 20% of all data.
I haven't checked whether the same error occurs in the other datasets, but I would recommend doing so.
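The intended behaviour described above (all members of training clusters in train, only cluster centers in test) can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the function build_mixed_split, the 'id' field, and the cluster_of mapping from each sequence to its cluster center are hypothetical names introduced here.

```python
def build_mixed_split(proteins, cluster_of, train_centers, test_centers):
    """Assign proteins to the mixed split without mutating the center sets.

    proteins: iterable of dicts with 'id', 'sequence', 'meltingPoint'
    cluster_of: dict mapping a protein id to its cluster center id (assumed)
    """
    mixed_set = []
    for protein in proteins:
        key = protein["id"]
        center = cluster_of[key]
        if center in train_centers:
            # Every member of a training cluster is kept, regardless of
            # whether the center was seen earlier or later in the loop.
            split = "train"
        elif center in test_centers and key == center:
            # For test clusters only the cluster center itself is kept.
            split = "test"
        else:
            continue
        mixed_set.append({
            "sequence": protein["sequence"],
            "target": protein["meltingPoint"],
            "set": split,
        })
    return mixed_set


# Toy data: cluster "a1" is a training cluster, cluster "b1" a test cluster.
proteins = [
    {"id": "a2", "sequence": "MKV", "meltingPoint": 50.0},
    {"id": "a1", "sequence": "MKL", "meltingPoint": 51.0},
    {"id": "a3", "sequence": "MKI", "meltingPoint": 52.0},
    {"id": "b1", "sequence": "GSS", "meltingPoint": 60.0},
    {"id": "b2", "sequence": "GST", "meltingPoint": 61.0},
]
cluster_of = {"a1": "a1", "a2": "a1", "a3": "a1", "b1": "b1", "b2": "b1"}
mixed = build_mixed_split(proteins, cluster_of, {"a1"}, {"b1"})
splits = [row["set"] for row in mixed]
print(splits.count("train"), "train /", splits.count("test"), "test")
```

Because membership is looked up through the cluster assignment rather than by removing keys from the center lists, the result does not depend on the order in which the proteins are processed.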
Hi, very interesting work. The code fails to load Attention1d in baselines/models.py. Is it missing from this repo? Thanks.