aqlaboratory / proteinnet

Standardized data set for machine learning of protein structure

License: MIT License

machine-learning deep-learning protein-structure dataset protein-sequence proteins


ProteinNet

ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data-poor and data-rich regimes.

Note that this is a preliminary release. The raw data used for construction of the data sets, as well as the MSAs, are not yet generally available. However, the raw MSA data (4TB) for ProteinNet 12 is available upon request. Transfer requires downloading a Globus client. See the raw data section for more information.

Motivation

Protein structure prediction is one of the central problems of biochemistry. While the problem is well-studied within the biological and chemical sciences, it is less well represented within the machine learning community. We suspect this is due to two reasons: 1) a high barrier to entry for non-domain experts, and 2) lack of standardization in terms of training / validation / test splits that make fair and consistent comparisons across methods possible. If these two issues are addressed, protein structure prediction can become a major source of innovation in ML research, alongside the canonical tasks of computer vision, NLP, and speech recognition. Much like ImageNet helped spur the development of new computer vision techniques, ProteinNet aims to facilitate ML research on protein structure by providing a standardized data set, and standardized training / validation / test splits, that any group can use with minimal effort to get started.

Approach

Once every two years the CASP assessment is held. During this competition structure predictors from across the globe are presented with protein sequences whose structures have been recently solved but which have not yet been made publicly available. The predictors make blind predictions of these structures, which are then assessed for their accuracy. The CASP structures thus provide a standardized benchmark for how well prediction methods perform at a given moment in time. The basic idea behind ProteinNet is to piggyback on CASP by using CASP structures as test sets. ProteinNet augments these test sets with training / validation sets that reset the historical record to the conditions preceding each CASP experiment. In particular, ProteinNet restricts the set of sequences (used for building PSSMs and MSAs) and structures to those available prior to the commencement of each CASP. This is critical as standard databases such as BLAST do not maintain historical versions. We use time-reset versions of the UniParc dataset as well as metagenomic sequences from the JGI to build sequence databases for deriving MSAs. ProteinNet further provides carefully split validation sets that range in difficulty from easy (>90% seq. id.), useful for assessing a model's ability to predict minor changes in protein structure such as mutations, to extremely difficult (<10% seq. id.), useful for assessing a model's ability to predict entirely new protein folds, as in the CASP Free Modeling (FM) category. In a sense, our validation sets provide a series of transferability challenges to test how well a model can withstand distributional shifts in the data set. We have found that our most difficult validation subsets exceed the difficulty of CASP FM targets.

Download

ProteinNet records are provided in two forms: human- and machine-readable text files that can be used programmatically by any tool, and TensorFlow-specific TFRecord files. More information on the file format can be found in the documentation here.

Each CASP data set (CASP7, CASP8, CASP9, CASP10, CASP11, CASP12*) is available as Text-based and TF Records downloads.

Secondary Structure Data: ASTRAL entries, PDB entries

* CASP12 test set is incomplete due to embargoed structures. Once the embargo is lifted we will release all structures.

Documentation

PyTorch Parser

ProteinNet includes an official TensorFlow-based parser. Jeppe Hallgren has kindly created a PyTorch-based parser that is available here.

Extensions

SideChainNet extends ProteinNet by adding angle and atomic coordinate information for side chain atoms.

Citation

Please cite the ProteinNet paper in BMC Bioinformatics.

Acknowledgements

Construction of this data set consumed millions of compute hours and was possible thanks to the generous support of the HMS Laboratory of Systems Pharmacology, the Harvard Program in Therapeutic Science, and the Research Computing group at Harvard Medical School. We also thank Martin Steinegger and Milot Mirdita for their extensive help with the MMseqs2 and HHblits software packages, Sergey Ovchinnikov for providing metagenomic sequences, Andriy Kryshtafovych for his assistance with CASP data, and Sean Eddy for his help with the HMMer software package. This data set is hosted by the HMS Research Information Technology Solutions group at Harvard University.

Funding

This work was supported by NIGMS grant P50GM107618 and NCI grant U54-CA225088.

proteinnet's People

Contributors: alquraishi, bue-von-hon, jacobjinkelly, philipxyc

proteinnet's Issues

Duplicate training samples with different coordinates, PSSM and entropy

During a sanity check of this data I noticed that quite a lot of the training examples have identical sequences but different PSSMs and entropies. The coordinates for these duplicates are also not identical, even under translation/rotation, though the one example I actually plotted after matching the coordinates under translation and rotation had coordinates that were close to identical but deviated in a few places.

See the attached example (it was too long to paste in here)
identical_sequences.zip

Other training examples were repeated 6 times in the data.

Is there any good reason for this or is this an error?
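For readers hitting the same question, the duplication described above is easy to quantify by scanning a text-based file and counting repeated [PRIMARY] sequences. A minimal sketch (the primary_sequences helper and the in-memory sample are illustrative, not part of the official tooling):

```python
from collections import Counter
import io

def primary_sequences(fh):
    """Yield the [PRIMARY] sequence of every record in a text-based ProteinNet file."""
    for line in fh:
        if line.strip() == '[PRIMARY]':
            yield next(fh).strip()

# Tiny in-memory example with one duplicated sequence; on real data pass
# an open file handle, e.g. primary_sequences(open('casp12/training_30')).
sample = io.StringIO(
    "[ID]\n1ABC_1_A\n[PRIMARY]\nMKV\n\n"
    "[ID]\n1ABC_2_A\n[PRIMARY]\nMKV\n\n"
    "[ID]\n2XYZ_1_A\n[PRIMARY]\nGGA\n"
)
counts = Counter(primary_sequences(sample))
dupes = {seq: n for seq, n in counts.items() if n > 1}
print(dupes)  # {'MKV': 2}
```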

Unmasked zeroed tertiary data in text-based CASP7

When implementing an RGN for a university project, we stumbled upon a few apparent irregularities in the text-based CASP7 dataset provided here.
That is, quite a few atoms in the tertiary data were positioned at (0,0,0) even though the mask was +, i.e. the atom was considered to be 'valid'.

Example taken from CASP7/validation.

[ID]
70#1MLI_1_A
[PRIMARY]
...
[EVOLUTIONARY]
...
[TERTIARY]
0	1562.5	0	0	1571.2	0	0	1458.2	0	0	1371.3	0	0	1078.5	0	0	953.8 ...
0	1363.	0	0	1492.5	0	0	1226.9	0	0	1303.3	0	0	1229.4	0	0	1255.1 ...
0	4743.1	0	0	4394.3	0	0	4152.2	0	0	3792.3	0	0	3597.2	0	0	3246.3 ...
[MASK]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++                           

In this example two thirds of the atoms are positioned at (0, 0, 0).
Is this a bug, or am I simply misinterpreting the given data somehow?

Thanks in advance!
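The inconsistency described above can be checked programmatically. A minimal sketch (the suspicious_residues helper is illustrative; it assumes the [TERTIARY] layout of three axis rows with N, C-alpha, C' coordinates per residue, as read by the parser in this repo):

```python
def suspicious_residues(tertiary, mask):
    """Indices of residues flagged '+' in the mask whose three backbone
    atoms (N, C-alpha, C') all sit at the origin."""
    bad = []
    for i, flag in enumerate(mask):
        # gather the 9 coordinates (3 atoms x 3 axes) belonging to residue i
        coords = [tertiary[axis][3 * i + j] for axis in range(3) for j in range(3)]
        if flag == '+' and all(c == 0.0 for c in coords):
            bad.append(i)
    return bad

# Toy record: residue 1 is masked '+' but entirely zeroed.
tertiary = [[1.0, 2.0, 3.0, 0.0, 0.0, 0.0],   # x for N, CA, C' of 2 residues
            [1.1, 2.1, 3.1, 0.0, 0.0, 0.0],   # y
            [1.2, 2.2, 3.2, 0.0, 0.0, 0.0]]   # z
print(suspicious_residues(tertiary, '++'))  # [1]
```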

ProteinNet (predictive) and DiversityNet-Proteins (generative) benchmarks

Hello Mohammed,

This ProteinNet benchmark for predictive models is great, but are you interested in doing a similar benchmark for generative models of proteins?

I introduced this benchmark, called 'DiversityNet-Proteins' here (it's a variant of the DiversityNet-Chemistry benchmark):

https://medium.com/the-ai-lab/diversitynet-a-collaborative-benchmark-for-generative-ai-models-in-chemistry-f1b9cc669cba#c837

Tell me if you are interested!
Mostapha

How to process from the raw data on CASP?

Hi, thanks for your interesting work. I am currently using your dataset and trying to substitute your test set based on CASP 12 with the newest CASP 14. So I would really appreciate it if you can provide some details on how to process the raw data on CASP to obtain the test set. Thanks!

Obtaining MSA data

Hi,
Is there any way to get the MSA data from the text-based files?

Thanks in advance

Interpreting the parser

TL;DR: How can I best use the parser provided to go from TFRecords to inputting this into a Keras model?

Preamble: Thank you very much for compiling this awesome dataset that I am excited to prototype some protein design ideas on!

Disclaimer: I am quite new to TensorFlow. Below I outline every bump I ran into along the way of trying to work with this fantastic dataset. If this is going to be like ImageNet for proteins, it needs more user-friendly documentation on how to get up and running with it. I hope the following will flag road bumps for future users like myself. If it is helpful, I would love to help make this dataset more user-friendly by providing example code, etc.


I have spent over 10 hours trying to figure out how to work with the TensorFlow records provided. I had assumed this was a very easy data format to work with, before learning that it is notorious for poor documentation and for being difficult to understand.

A "parser" is provided but I spent quite some time assuming that it was, as per the tf documentation, a parser that could be applied in some way through dataset=tf.data.TFRecordDataset(files) and then dataset.map(read_protein). Instead it is a standalone function that you must first create a filequeue for.

It was only after spending considerable time going through the RGN docs, and finding the _dataflow and read_protein functions in that separate repo, that I understood what was going on and how to structure my import. Then, because the imports are tensors, I was unable to look inside them at the data actually being imported. Eager execution also does not work because we are using tf.data.

Going through the RGN files I ended up reusing a lot of the _dataflow function in model.py. However, even this is hard to understand. I still do not know what the distinction between batch size and "num_steps" is after using tf.contrib.training.bucket_by_sequence_length and what the exact generator is that can then be given to fit_generator in order to train a keras model using this data.

After all of this time I have decided to go back to square one by instead downloading the .txt files and working with those instead. That way I can at least get a basic model up and running before I try to work with more advanced DataRecords and build a more efficient pipeline with them.

Next versions of CASP in ProteinNet

Hello,
ProteinNet is a great database, thank you very much for providing it!
Will there be any new versions of CASP and updates of ProteinNet?
Thank you

P.S. ProteinNet CASP12 text based was NFTized on blockchain where each structure is a separate NFT and tools to construct dataset are available web2 and web3 way:
https://datasetnft.org

IDs not matching PDB characteristics (chain number and ID, for instance)

Hi, first of all, thank you for the huge effort of putting ProteinNet together; a great contribution to the community indeed.

I am now trying to fetch the beta-carbon coordinates from the PDB for each protein in ProteinNet for my algorithm. The problem is that I cannot do so straightforwardly via Biopython, since some IDs don't indicate the chain used in ProteinNet.

'1IQ8_d1iq8b4',
'1S2M_d1s2ma2',
'1IZ6_d1iz6b2',

Those 3 proteins as an example, have "strange" descriptors in place of chain id and number. What do they mean after all? How could I fetch the same sequence and structure as given in ProteinNet?
I am well aware of the problems regarding mmCIF files, and the mismatch of sequences. Is there any way to solve it? I really need to implement this change.

Thanks

How are mask records generated?

Hi Mohammed,

Thanks for this great resource and congratulations on its recent publication. I'm curious - can you share your method for generating mask records?

Best,
Jonathan

PSSM length

PSSM has length 20 and not 21 as mentioned in the readme file

Unequal sequence length in MSA

Hi,

I was wondering why some MSAs, e.g. 102L_1_A consist of sequences of unequal length.
If I run the following code:

from Bio import SeqIO

for index, record in enumerate(SeqIO.parse("ProteinNet/raw/102L_1_A.a2m", "fasta")):
    if index > 4: break
    print("index %i, ID = %s, length %i"
          % (index, record.id, len(record.seq)))

I get the following output:

index 0, ID = XXXX_UPI00066315A6/35-154, length 165
index 1, ID = XXXX_UPI00066315A6/175-292, length 165
index 2, ID = TB_PC08_66DRAFT_100509362|3300000228|metagenome/9-106, length 165
index 3, ID = TB_PC08_66DRAFT_100509362|3300000228|metagenome/113-206, length 165
index 4, ID = JGI25909J50240_10726711|3300003393|metagenome/24-165, length 166

Index 4 has length 166, the others 165.
Can I pad the shorter sequences with '-' in the end?

Best
Christoph
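Whether trailing gaps are the right interpretation is exactly the open question here, but if padding turns out to be appropriate, the mechanics are simple (a sketch; pad_to_max is an illustrative helper):

```python
def pad_to_max(seqs, gap='-'):
    """Right-pad each sequence with gap characters to the longest length."""
    width = max(map(len, seqs))
    return [s.ljust(width, gap) for s in seqs]

# Toy alignment with one shorter sequence.
seqs = ['MKV-AL', 'MKVQAL', 'MKV']
print(pad_to_max(seqs))  # ['MKV-AL', 'MKVQAL', 'MKV---']
```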

Hosting other data formats (hdf5)

Thanks again for all the work that has gone into curating this dataset. I am wondering if you would be open to making the data here more accessible by offering it in the hdf5 format, specifically for PyTorch users. Sadly, TensorFlow is starting to lose popularity, and more and more researchers are migrating to either PyTorch or JAX. It would be nice if there were another compressed data format other than TFRecords. I would be happy to send over the hdf5-converted files if you would be open to hosting them.

PSSM generation

Hello! I've been trying to wrap my mind around PSSM generation from an arbitrary protein taken out of pdb.

After reading the ProteinNet paper and checking out this repo, I'm still confused on the process you use. Right now I assume the pipeline:

FASTA sequence fed to jackhmmer -> esl-weight.

Down from this point it's a bit unclear to me. Are there any opensource utilities I can use to generate a binary-style PSSM (with the additional context columns) utilized by ProteinNet, or are there any arguments to esl-weight that I'm missing?

Thanks! Apologizing in advance if my question has already been discussed somewhere, couldn't find any directions on this.

Validation set

The validation set is not split into different difficulty levels as mentioned in the paper (10%, 20%, ...). There is only one file for each CASP dataset.

Nomenclature Details

I have some questions on the format of chain IDs in the [ID] field.

Question 1: 'chain_number'
I understand from the documentation that ids of format e.g 1VBK_1_A have the following parts:

<pdb_id>_<chain_number>_<chain_id>

However, I do not understand the meaning of the 'chain_number' field. Is this a simple index of the chain in the pdb file, or a reference to the model in multi-model PDB files?

Question 2. Last digit, alternative format
Some IDs do not have this format and instead look like this:

2EUL_d2euld1
1H4V_d1h4vb1
1C4K_d1c4ka2
1V5F_d1v5fa1

I understand that:

  1. The first characters are the pdb_id
  2. The 'd' means domain
  3. The next few characters represent the domain

However I do not understand the meaning of the final digit ('1' or '2') in the above examples. What does this signify?

Many thanks!
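While the meaning of the trailing digit is the open question above, mechanically separating the two ID formats is straightforward. A sketch (the classify_id helper is illustrative):

```python
def classify_id(record_id):
    """Return ('pdb', pdb_id, chain_number, chain_id) for IDs like 1VBK_1_A,
    or ('domain', pdb_id, domain_str) for ASTRAL-style IDs like 2EUL_d2euld1."""
    parts = record_id.split('_')
    if len(parts) == 3:
        return ('pdb', parts[0], int(parts[1]), parts[2])
    pdb_id, domain = parts
    assert domain.startswith('d'), f'unexpected ID format: {record_id}'
    return ('domain', pdb_id, domain)

print(classify_id('1VBK_1_A'))      # ('pdb', '1VBK', 1, 'A')
print(classify_id('2EUL_d2euld1'))  # ('domain', '2EUL', 'd2euld1')
```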

General discrepancies between ProteinNet and mmCIF/PDB files (biopython)

I have some code that fetches mmCIF files for each entry in CASP11 using Biopython. A substantial proportion of examples fail the various checks, though some pass. Many of them are missing the corresponding model, and most that do have the given model disagree on primary sequence length. Perhaps there is an obvious explanation for this and I simply overlooked it. Subtracting one from model_id allows most of the models to be resolved, but many of the primary sequences still have a significant length mismatch. Most of the files only have one model, so it's unclear what exactly is going on here.

#!/usr/bin/python

# imports
import sys
import re

# Constants
NUM_DIMENSIONS = 3

# Functions for conversion from Mathematica protein files to TFRecords
_aa_dict = {
    'A': '0',
    'C': '1',
    'D': '2',
    'E': '3',
    'F': '4',
    'G': '5',
    'H': '6',
    'I': '7',
    'K': '8',
    'L': '9',
    'M': '10',
    'N': '11',
    'P': '12',
    'Q': '13',
    'R': '14',
    'S': '15',
    'T': '16',
    'V': '17',
    'W': '18',
    'Y': '19'
}
_dssp_dict = {
    'L': '0',
    'H': '1',
    'B': '2',
    'E': '3',
    'G': '4',
    'I': '5',
    'T': '6',
    'S': '7'
}
_mask_dict = {'-': '0', '+': '1'}


class switch(object):
    """Switch statement for Python, based on recipe from Python Cookbook."""
    def __init__(self, value):
        self.value = value
        self.fall = False

    def __iter__(self):
        """Return the match method once, then stop"""
        yield self.match

    def match(self, *args):
        """Indicate whether or not to enter a case suite"""
        if self.fall or not args:
            return True
        elif self.value in args:  # changed for v1.5
            self.fall = True
            return True
        else:
            return False


def letter_to_num(string, dict_):
    """ Convert string of letters to list of ints """
    patt = re.compile('[' + ''.join(dict_.keys()) + ']')
    num_string = patt.sub(lambda m: dict_[m.group(0)] + ' ', string)
    num = [int(i) for i in num_string.split()]
    return num


def read_record(file_, num_evo_entries):
    """ Read a Mathematica protein record from file and convert into dict. """

    dict_ = {}

    while True:
        next_line = file_.readline()
        for case in switch(next_line):
            if case('[ID]' + '\n'):
                id_ = file_.readline()[:-1]
                dict_.update({'id': id_})
            elif case('[PRIMARY]' + '\n'):
                primary = letter_to_num(file_.readline()[:-1], _aa_dict)
                dict_.update({'primary': primary})
            elif case('[EVOLUTIONARY]' + '\n'):
                evolutionary = []
                for residue in range(num_evo_entries):
                    evolutionary.append(
                        [float(step) for step in file_.readline().split()])
                dict_.update({'evolutionary': evolutionary})
            elif case('[SECONDARY]' + '\n'):
                secondary = letter_to_num(file_.readline()[:-1], _dssp_dict)
                dict_.update({'secondary': secondary})
            elif case('[TERTIARY]' + '\n'):
                tertiary = []
                for axis in range(NUM_DIMENSIONS):
                    tertiary.append(
                        [float(coord) for coord in file_.readline().split()])
                dict_.update({'tertiary': tertiary})
            elif case('[MASK]' + '\n'):
                mask = letter_to_num(file_.readline()[:-1], _mask_dict)
                dict_.update({'mask': mask})
            elif case('\n'):
                return dict_
            elif case(''):
                return None


if __name__ == '__main__':
    from Bio.PDB.MMCIFParser import MMCIFParser
    from Bio.PDB.PDBParser import PDBParser
    from Bio.PDB import PDBIO, PDBList
    from Bio.PDB.Structure import Structure, Entity
    from Bio.PDB.Model import Model
    from Bio.PDB.Residue import Residue
    from Bio.PDB.Chain import Chain
    from Bio.PDB.Atom import Atom
    import numpy as np

    input_path = 'D:/casp11/training_30'
    print(f'Reading data from {input_path}')
    num_evo_entries = int(sys.argv[2]) if len(
        sys.argv) == 3 else 20  # default number of evo entries

    input_file = open(input_path, 'r')
    pdbl = PDBList()

    while True:
        record = read_record(input_file, num_evo_entries)
        if record is not None:
            id = record["id"]
            primary = record['primary']
            primary_len = len(primary)
            parts = id.split('_')
            if len(parts) != 3:
                # https://github.com/aqlaboratory/proteinnet/issues/1#issuecomment-375270286
                continue
            pdb_id = parts[0]
            path = pdbl.retrieve_pdb_file(pdb_id, pdir='pdb/', file_format='mmCif')
            parser = MMCIFParser()
            structure = parser.get_structure(pdb_id, path)
            model_id = int(parts[1])
            assert model_id >= 0
            chain_id = parts[2]
            if model_id not in structure.child_dict:
                print(f'{pdb_id} lacks model {model_id}, models: {structure.child_dict}')
                continue
            model = structure.child_dict[model_id]
            assert isinstance(model, Model)
            if chain_id not in model.child_dict:
                print(f'{pdb_id} model {model_id} lacks chain {chain_id}, chains: {model.child_dict}')
                continue
            # only index into child_dict after the membership check above
            chain = model.child_dict[chain_id]
            assert isinstance(chain, Chain)
            chain_len = len(chain.child_list)
            if chain_len != primary_len:
                print(f'{pdb_id} chain ({chain_len}) and primary ({primary_len}) lengths mismatch')
            else:
                print(f'{pdb_id} is all good!')
        else:
            input_file.close()
            break

Sampled output:

Reading data from D:/casp11/training_30
Structure exists: 'pdb/2yo0.cif'
2YO0 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4jrn.cif'
4JRN lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4hxf.cif'
4HXF lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2l0e.cif'
2L0E is all good!
Structure exists: 'pdb/2kxi.cif'
2KXI is all good!
Structure exists: 'pdb/2o57.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 8744.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 9031.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 9318.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 9580.
  warnings.warn(
2O57 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4lhr.cif'
4LHR lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3zrg.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1020.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1024.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1027.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1097.
  warnings.warn(
3ZRG lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2k0m.cif'
2K0M is all good!
Structure exists: 'pdb/4hhx.cif'
4HHX lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4ld3.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1362.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1382.
  warnings.warn(
4LD3 lacks model 2, models: {0: <Model id=0>}
Structure exists: 'pdb/3wo6.cif'
3WO6 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2kkp.cif'
2KKP is all good!
Structure exists: 'pdb/4jja.cif'
4JJA lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4h5s.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1565.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1733.
  warnings.warn(
4H5S lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4aez.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 16578.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 16692.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 16725.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 16758.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 16819.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 16820.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain G is discontinuous at line 16868.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain H is discontinuous at line 16889.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 16895.
  warnings.warn(
4AEZ lacks model 3, models: {0: <Model id=0>}
Structure exists: 'pdb/4gdo.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3247.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3295.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3336.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 3362.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 3401.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 3403.
  warnings.warn(
4GDO lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3up6.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 4854.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 4927.
  warnings.warn(
3UP6 lacks model 1, models: {0: <Model id=0>}
Downloading PDB structure '3UKW'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3473.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3717.
  warnings.warn(
3UKW lacks model 2, models: {0: <Model id=0>}
Downloading PDB structure '3SEO'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3510.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3511.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3519.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3539.
  warnings.warn(
3SEO lacks model 1, models: {0: <Model id=0>}

Related issue #13

Thanks for all the good work.

Next releases

Hi, I am a master's student in Biotechnology and I am using the data collections you created for my thesis project. I wanted to thank you for your work; it is extremely useful to me. When do you plan to release the CASP13 and CASP14 datasets?

How should I parse the [TERTIARY] section of a ProteinNet record file?

Hi Mohammed,

this looks like an amazing resource, thank you. I want to write my own parser to play around with the data. However, I haven't been able to determine what exactly is represented on the three lines of each [TERTIARY] section.

Given that there are 3 lines, I have assumed that each line represents one co-ordinate axis (i.e. x, y, or z) for all residues. I then assumed that the order of the numbers on each line is given in order of the N, C_alpha, and C' atoms for each amino acid residue. Is my interpretation correct? Could you please help or point me to the exact file specification? I couldn't find it in your paper either.

Thank you!
-Ali
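For what it's worth, a parser sketch under exactly the interpretation described above (three whitespace-separated lines giving the x, y, and z coordinates, with atoms ordered N, C_alpha, C' residue by residue) would look like this; `parse_tertiary` is a hypothetical helper name, not part of any official ProteinNet tooling:

```python
# Sketch, assuming: line 1 = all x values, line 2 = all y, line 3 = all z,
# each flattened residue-by-residue in the order N, C-alpha, C'.
def parse_tertiary(lines):
    """lines: the three coordinate lines of one [TERTIARY] block."""
    axes = [[float(v) for v in line.split()] for line in lines]  # [xs, ys, zs]
    n_residues = len(axes[0]) // 3  # 3 backbone atoms per residue
    residues = []
    for i in range(n_residues):
        atoms = {}
        for j, name in enumerate(("N", "CA", "C")):
            k = 3 * i + j  # index of this atom in the flattened per-axis lists
            atoms[name] = (axes[0][k], axes[1][k], axes[2][k])
        residues.append(atoms)
    return residues
```

If the parse is correct, consecutive N/CA and CA/C' distances should come out near the usual backbone bond lengths (in whatever unit the file uses), which is a quick sanity check on the assumed layout.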

Which metrics to use for model evaluation?

Hello, thank you for maintaining this library; I find it takes away one of the most daunting tasks with regard to tertiary structure prediction for those who aren't that familiar with the field.

However, there's one thing that I can't figure out, and reading through CASP-related material (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394854/) doesn't really help: how do you actually validate the models?

That is, once you have a model that seems to predict scaled values for the 3x3 matrices well under a loss function that's practical for optimizing the model, how do you determine whether the predicted structures would be correct for domain-specific purposes?

Assuming said evaluation is extremely hard, is there some simpler metric that gets you 9x% of the way there, so that you can at least tell whether your model is heading in the right direction?
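One common "good enough during training" metric is dRMSD: CASP's headline scores (GDT_TS, lDDT) need alignment machinery, but dRMSD only compares the two pairwise distance matrices, so it is invariant to rotation and translation and trivial to compute. This is an editor's sketch, not an official ProteinNet utility:

```python
import numpy as np

def drmsd(pred, ref):
    """dRMSD between two structures given as (L, 3) C-alpha coordinate arrays.

    Compares pairwise intra-structure distances, so no superposition
    (rotation/translation alignment) is needed.
    """
    dp = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)
    dr = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    iu = np.triu_indices(len(pred), k=1)  # each residue pair counted once
    return np.sqrt(np.mean((dp[iu] - dr[iu]) ** 2))
```

A dRMSD of 0 means the predicted distance matrix matches the reference exactly; trending downward is a reasonable sign the model is heading in the right direction, even though it ignores chirality and is not what CASP assessors report.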

Generate lists of PDB ids / chains for the training, validation and test sets

Hi,

I think that you have gone to great lengths to make sure that there is no information leakage between the training and the validation datasets. I would like to change my own datasets to match your training / validation split. Could you suggest a way for me to do this?

In particular, I was wondering what would be the easiest way for me to generate a list of PDB and chain ids that correspond to the training, validation, and test datasets for each of the CASP experiments?

I can see that each entry in the tensorflow files has an id, which can be something like 1WIO_1_A, 1YX2_1_A, 1KA1_1_A, 1D2O_1_A, etc., or something like 1DLC_d1dlca2, 2BU7_d2bu7a2, 1OMW_d1omwg-, 1NUB_d1nubb3, etc. Could you explain how these IDs were generated, what each part corresponds to, and what the difference is between the first type of id (which looks like {pdb_id}_{model_id}_{chain_id}) and the second type?

Thank you!
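For the text-format releases, one low-effort way to pull such a list is to scan for `[ID]` headers and take the following line; the two-way split below is a guess based on the shapes quoted above ({pdb_id}_{model_id}_{chain_id} vs. a PDB id plus what looks like an ASTRAL/SCOPe-style domain id), so the helper names and the "astral" label are the editor's, not the project's:

```python
def read_ids(lines):
    """Collect the identifier that follows each '[ID]' header line
    in a ProteinNet text-format record stream."""
    ids, grab = [], False
    for line in lines:
        if grab:
            ids.append(line.strip())
            grab = False
        elif line.strip() == "[ID]":
            grab = True
    return ids

def split_id(entry):
    """Heuristically split an id into its parts.

    Three underscore-separated fields are assumed to be
    {pdb_id}_{model_id}_{chain_id}; two fields are assumed to be
    a PDB id plus an ASTRAL-style domain id.
    """
    parts = entry.split("_")
    if len(parts) == 3:
        return ("pdb",) + tuple(parts)
    return ("astral",) + tuple(parts)
```

Applied over an entire training file (e.g. via `read_ids(open(path))`), this yields the id list for that split, which can then be intersected with your own dataset's PDB/chain identifiers.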

Units of measurement for 3D structure

Hi.
I am trying to use these datasets for contact prediction. A contact is defined as a pair of residues within 8 Å of each other. However, when I compute the pairwise distances of residues for proteins in this dataset, they are typically in the thousands, and no pair is within 8 Å of another. What are the units used?
Thanks
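Distances "in the thousands" are consistent with the ProteinNet record description, which gives tertiary coordinates in picometers (100 pm = 1 Å), so the usual 8 Å contact cutoff becomes 800 in file units. A sketch under that assumption (`contact_map` is an illustrative helper, not part of the dataset tooling):

```python
import numpy as np

ANGSTROM_IN_PM = 100.0  # assumption: file coordinates are in picometers

def contact_map(ca, cutoff_angstrom=8.0):
    """Boolean contact map from an (L, 3) array of C-alpha coordinates
    given in picometers. The diagonal (and, in practice, trivially close
    sequence neighbors) would normally be masked out downstream."""
    d = np.linalg.norm(ca[:, None] - ca[None, :], axis=-1)
    return d < cutoff_angstrom * ANGSTROM_IN_PM
```

Equivalently, dividing the raw coordinates by 100 first puts everything in Å, after which the familiar 8 Å threshold applies directly.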

3D protein structure prediction from sequence

Hello,
I am Rashid, and I am doing my master's thesis on protein sequence-to-structure prediction.
I followed the GitHub instructions from @alquraishi and also read through the previous issues here.

I am also trying to predict sequences in ProteinNet TFRecords format using a trained model.
I used the script as:
python Machine_Learning/rgn-master/model/protling.py Machine_Learning/rgn-master/configurations/CASP7.config -d Machine_Learning/rgn-master/RGN7 -p -e weighted_testing
The script ran well and I did not get any errors, but I do not understand where my output 3D structure is, or how I can visualize it in Chimera.
Would you please help me continue this work properly?
Thanks
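Where the RGN writes its predictions is best checked in that repository's own documentation, but once backbone coordinates are in hand, a minimal PDB writer is enough to open them in Chimera. This is a hypothetical, C-alpha-only sketch (every residue written as an alanine on chain A), not part of the RGN or ProteinNet code:

```python
def write_ca_pdb(ca_coords, path):
    """Write an iterable of (x, y, z) C-alpha coordinates, assumed to be
    in angstroms, as a minimal PDB file that Chimera/PyMOL can open."""
    with open(path, "w") as f:
        for i, (x, y, z) in enumerate(ca_coords, start=1):
            # Fixed-width fields per the PDB ATOM record layout.
            f.write("ATOM  {:5d}  CA  ALA A{:4d}    "
                    "{:8.3f}{:8.3f}{:8.3f}  1.00  0.00           C\n"
                    .format(i, i, x, y, z))
        f.write("END\n")
```

Opening the resulting file in Chimera (File > Open) shows the C-alpha trace; a "backbone only" or "ribbon" display style makes the fold easiest to inspect.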
