ProteinNet

ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data-poor and data-rich regimes.

Note that this is a preliminary and incomplete release. The raw data used for construction of the data sets, as well as the MSAs, are not yet available.

Motivation

Protein structure prediction is one of the central problems of biochemistry. While the problem is well-studied within the biological and chemical sciences, it is less well represented within the machine learning community. We suspect this is due to two reasons: 1) a high barrier to entry for non-domain experts, and 2) lack of standardization in terms of training / validation / test splits that make fair and consistent comparisons across methods possible. If these two issues are addressed, protein structure prediction can become a major source of innovation in ML research, alongside the canonical tasks of computer vision, NLP, and speech recognition. Much like ImageNet helped spur the development of new computer vision techniques, ProteinNet aims to facilitate ML research on protein structure by providing a standardized data set, and standardized training / validation / test splits, that any group can use with minimal effort to get started.

Approach

The CASP assessment is held once every two years. During this competition, structure predictors from across the globe are presented with protein sequences whose structures have been recently solved but not yet made publicly available. The predictors make blind predictions of these structures, which are then assessed for accuracy. The CASP structures thus provide a standardized benchmark for how well prediction methods perform at a given moment in time. The basic idea behind ProteinNet is to piggyback on CASP by using CASP structures as test sets. ProteinNet augments these test sets with training / validation sets that reset the historical record to the conditions preceding each CASP experiment. In particular, ProteinNet restricts the set of sequences (used for building PSSMs and MSAs) and structures to those available prior to the commencement of each CASP. This is critical, as standard databases such as BLAST do not maintain historical versions. We use time-reset versions of the UniParc dataset as well as metagenomic sequences from the JGI to build sequence databases for deriving MSAs. ProteinNet further provides carefully split validation sets that range in difficulty from easy (>90% seq. id.), useful for assessing a model's ability to predict minor changes in protein structure such as mutations, to extremely difficult (<10% seq. id.), useful for assessing a model's ability to predict entirely new protein folds, as in the CASP Free Modeling (FM) category. In a sense, our validation sets provide a series of transferability challenges to test how well a model can withstand distributional shifts in the data set. We have found that our most difficult validation subsets exceed the difficulty of CASP FM targets.
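The time-reset idea described above can be illustrated with a minimal sketch. The `Entry` record, the `time_reset` helper, and all identifiers and dates below are hypothetical and are not part of ProteinNet; the sketch only shows the general principle of discarding anything that became public on or after a cutoff date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    """A sequence or structure record with the date it became public (hypothetical)."""
    identifier: str
    released: date


def time_reset(entries, cutoff):
    """Keep only entries that were public strictly before the cutoff date."""
    return [e for e in entries if e.released < cutoff]


# Hypothetical records and cutoff, for illustration only.
entries = [
    Entry("A", date(2005, 3, 1)),
    Entry("B", date(2006, 7, 15)),
    Entry("C", date(2006, 1, 10)),
]
cutoff = date(2006, 5, 1)  # assumed start date of an assessment round

available = time_reset(entries, cutoff)
print([e.identifier for e in available])  # prints ['A', 'C']
```

Entry "B" is dropped because it became public after the assumed cutoff, so a model trained on `available` would see only what a predictor could have used at that time.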

Download

CASP7       CASP8       CASP9       CASP10      CASP11      CASP12*
Text-based  Text-based  Text-based  Text-based  Text-based  Text-based
TF Records  TF Records  TF Records  TF Records  TF Records  TF Records

* CASP12 test set is incomplete due to embargoed structures. Once the embargo is lifted we will release all structures.

Documentation

Citation

Please cite the forthcoming preprint on ProteinNet when it becomes available.

Acknowledgements

Construction of this data set consumed millions of compute hours and was possible thanks to the generous support of the HMS Laboratory of Systems Pharmacology, the Harvard Program in Therapeutic Science, and the Research Computing group at Harvard Medical School. We also thank Martin Steinegger and Milot Mirdita for their extensive help with the MMseqs2 and HHblits software packages, Sergey Ovchinnikov for providing metagenomic sequences, Andriy Kryshtafovych for his assistance with CASP data, and Sean Eddy for his help with the HMMer software package. This data set is hosted by the HMS Research Information Technology Solutions group at Harvard University.
