rcsb / mmtf

The specification of the MMTF format for biological structures

Home Page: http://mmtf.rcsb.org/

Shell 100.00%
compression file-format biological-structures protein-data-bank protein-structure bioinformatics

mmtf's Introduction

The MacroMolecular Transmission Format (MMTF) is a binary encoding of biological structures. For a general introduction to the format visit the website.

The following encoding/decoding libraries implementing the specification are available:

⚠️ Please note that while the specification and tools for the MMTF format are still available, up-to-date MMTF files for the PDB archive are no longer produced as of July 2024. Users are strongly encouraged to migrate to the BinaryCIF format. Details on how to access BinaryCIF (BCIF) data files for the entire PDB archive are available here.

Version 1.0

Version 0.2

Version 0.1

mmtf's People

Contributors

abradle, andreasprlic, arose, danpf, josemduarte, ppillot, pwrose, speleo3, valasatava

mmtf's Issues

Chemical component identifier lost for unobserved non-standard residues

Since MMTF stores the SEQRES groups as one-letter code strings, the chemical component ID of any non-standard residue that happens to be unobserved is lost. E.g. 2X3T chain E (a glycopeptide) contains several unobserved non-standard amino acids, which are represented as "KXXXXXXEX". For observed groups, the chemical component identifier is recoverable from the ATOM information, but not for unobserved ones.
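
A minimal sketch of why the information cannot be recovered, assuming a simplified mapping in which every component missing from the table collapses to "X". The abbreviated `ONE_LETTER` table, the `to_one_letter` helper, and the placeholder IDs `NS1` to `NS4` are hypothetical; they are not taken from 2X3T or from any MMTF library.

```python
# Abbreviated three-letter -> one-letter table (illustration only).
ONE_LETTER = {"LYS": "K", "GLU": "E", "GLY": "G", "ALA": "A"}


def to_one_letter(comp_ids):
    """Collapse chemical component IDs into a one-letter sequence string."""
    return "".join(ONE_LETTER.get(comp_id, "X") for comp_id in comp_ids)


# Two hypothetical peptides that differ only in their non-standard residues.
peptide_a = ["LYS", "NS1", "NS2", "GLU"]
peptide_b = ["LYS", "NS3", "NS4", "GLU"]

seq_a = to_one_letter(peptide_a)
seq_b = to_one_letter(peptide_b)

# Both peptides yield the same string, so the mapping cannot be inverted.
print(seq_a, seq_b, seq_a == seq_b)
```

For observed residues a reader can fall back on the group names; for unobserved ones the one-letter string is the only record.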

why was mmtf dropped?

I'd appreciate it if anyone would spare 5 mins. I'm training models on the PDB and I'd love a binary format alternative for these kinds of tasks

Bonds missing in NMR models after first

In 1cdr.mmtf we see for inter-group bonds from the main bondAtomList:

inter-group (1.1179/1.1166) bond ASN77.N - GLU76.C 1
inter-group (1.1194/1.250) bond NAG78.C1 - ASN18.ND2 1
inter-group (1.1220/1.1204) bond NAG79.C1 - NAG78.O4 1
inter-group (1.1248/1.1206) bond FUC80.C1 - NAG78.O6 1

where 1.1248 indicates this is for the first model. This is correct. Going to the second model, we see only:

inter-group (2.2448/2.2435) bond ASN77.N - GLU76.C 1
inter-group (3.2559/3.2540) bond GLN2.N - LEU1.C 1

The bonds are missing in the bondAtomList array:

[21, 2, 38, 23, 48, 40, ..., 2433, 2416, 2448, 2435, ??? ????, 2559, 2540, 2576, 2561, ...]

where we should see 2473, 2489, I believe.

Bob Hanson
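
One way to spot this kind of gap is to count the inter-group bonds that fall into each model and compare the counts. The sketch below assumes every model has the same number of atoms (typical for NMR ensembles, but not guaranteed) and uses toy indices, not the real 1cdr data; `bonds_per_model` is a hypothetical helper.

```python
from collections import Counter


def bonds_per_model(bond_atom_list, atoms_per_model):
    """Count inter-group bonds per model (0-based model index).

    bond_atom_list is the flat list of global atom-index pairs;
    atoms_per_model is assumed to be identical for all models.
    """
    counts = Counter()
    for k in range(0, len(bond_atom_list), 2):
        a, b = bond_atom_list[k], bond_atom_list[k + 1]
        model_a, model_b = a // atoms_per_model, b // atoms_per_model
        if model_a != model_b:
            raise ValueError(f"bond {a}-{b} spans two models")
        counts[model_a] += 1
    return counts


# Toy data: 10 atoms per model; model 0 has three bonds, model 1 only one.
toy = [1, 0, 5, 2, 9, 7, 11, 10]
counts = bonds_per_model(toy, atoms_per_model=10)

# Models whose bond count differs from the first model deserve a closer look.
suspicious = [m for m in counts if counts[m] != counts[0]]
print(dict(counts), suspicious)
```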

Generating a super cell from the mmtf.

Hello,

I am used to working with PDB structures, but I wanted to know if it was possible to generate a supercell (tessellated unit cells) from the mmtf file once it has been fetched?

List vs Array

Should the word list be replaced by array, since the decoding generates arrays, not lists?

Outdated atom names for some groups

I found a case where the atom names in the atomNameList of the groupList are not consistent with the atom naming in the respective mmCIF file.

PDB ID: 1igy
URL: https://mmtf.rcsb.org/v1.0/full/1igy
group: NDG
atom name: O (mmtf) or O5 (cif)

Excerpt from 1igy.cif:

HETATM 12351 C C1   . NDG E 3 .   ? 4.403   26.019  46.445  1.00 100.00 ? 2   NDG E C1   1 
HETATM 12352 C C2   . NDG E 3 .   ? 5.903   25.741  46.546  1.00 100.00 ? 2   NDG E C2   1 
HETATM 12353 C C3   . NDG E 3 .   ? 6.391   24.795  45.433  1.00 100.00 ? 2   NDG E C3   1 
HETATM 12354 C C4   . NDG E 3 .   ? 5.491   23.550  45.387  1.00 100.00 ? 2   NDG E C4   1 
HETATM 12355 C C5   . NDG E 3 .   ? 4.033   23.972  45.150  1.00 100.00 ? 2   NDG E C5   1 
HETATM 12356 C C6   . NDG E 3 .   ? 3.079   22.786  45.157  1.00 100.00 ? 2   NDG E C6   1 
HETATM 12357 C C7   . NDG E 3 .   ? 7.832   27.086  47.123  1.00 100.00 ? 2   NDG E C7   1 
HETATM 12358 C C8   . NDG E 3 .   ? 8.500   28.453  47.046  1.00 100.00 ? 2   NDG E C8   1 
HETATM 12359 O O5   . NDG E 3 .   ? 3.561   24.873  46.189  1.00 100.00 ? 2   NDG E O5   1 
HETATM 12360 O O3   . NDG E 3 .   ? 7.728   24.409  45.716  1.00 100.00 ? 2   NDG E O3   1 
HETATM 12361 O O4   . NDG E 3 .   ? 5.894   22.643  44.323  1.00 100.00 ? 2   NDG E O4   1 
HETATM 12362 O O6   . NDG E 3 .   ? 3.425   21.830  44.166  1.00 100.00 ? 2   NDG E O6   1 
HETATM 12363 O O7   . NDG E 3 .   ? 8.380   26.132  47.689  1.00 100.00 ? 2   NDG E O7   1 
HETATM 12364 N N2   . NDG E 3 .   ? 6.644   26.992  46.530  1.00 100.00 ? 2   NDG E N2   1 
HETATM 12365 H H1   . NDG E 3 .   ? 4.022   26.121  47.470  1.00 15.00  ? 2   NDG E H1   1 
HETATM 12366 H H2   . NDG E 3 .   ? 6.104   25.207  47.485  1.00 15.00  ? 2   NDG E H2   1 
HETATM 12367 H H3   . NDG E 3 .   ? 6.346   25.319  44.468  1.00 15.00  ? 2   NDG E H3   1 
HETATM 12368 H H4   . NDG E 3 .   ? 5.562   23.033  46.354  1.00 15.00  ? 2   NDG E H4   1 
HETATM 12369 H H5   . NDG E 3 .   ? 3.962   24.497  44.187  1.00 15.00  ? 2   NDG E H5   1 
HETATM 12370 H H61  . NDG E 3 .   ? 2.062   23.157  44.966  1.00 15.00  ? 2   NDG E H61  1 
HETATM 12371 H H62  . NDG E 3 .   ? 3.119   22.313  46.148  1.00 15.00  ? 2   NDG E H62  1 
HETATM 12372 H H81  . NDG E 3 .   ? 7.730   29.238  47.041  1.00 15.00  ? 2   NDG E H81  1 
HETATM 12373 H H82  . NDG E 3 .   ? 9.087   28.531  46.120  1.00 15.00  ? 2   NDG E H82  1 
HETATM 12374 H H83  . NDG E 3 .   ? 9.148   28.602  47.921  1.00 15.00  ? 2   NDG E H83  1 
HETATM 12375 H HO3  . NDG E 3 .   ? 7.755   23.952  46.560  1.00 15.00  ? 2   NDG E HO3  1 
HETATM 12376 H HO6  . NDG E 3 .   ? 4.377   21.829  44.043  1.00 15.00  ? 2   NDG E HO6  1 
HETATM 12377 H HN2  . NDG E 3 .   ? 6.265   27.774  46.078  1.00 15.00  ? 2   NDG E HN2  1

Excerpt from the NDG entry in the 1igy.mmtf groupList:

'groupName': 'NDG'
'atomNameList': ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'O', 'O3', 'O4', 'O6', 'O7', 'N2', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6C1', 'H6C2', 'H8C1', 'H8C2', 'H8C3', 'HB', 'H6', 'HA']
'elementList': ['C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'O', 'O', 'O', 'O', 'N', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H']
'bondOrderList': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
'bondAtomList': [1, 0, 2, 1, 3, 2, 4, 3, 5, 4, 7, 6, 8, 0, 8, 4, 9, 2, 10, 3, 11, 5, 12, 6, 13, 1, 13, 6, 14, 0, 15, 1, 16, 2, 17, 3, 18, 4, 19, 5, 20, 5, 21, 7, 22, 7, 23, 7, 24, 9, 25, 11, 26, 13]
'formalChargeList': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
'singleLetterCode': '?'
'chemCompType': 'D-SACCHARIDE'
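
To see which atom the name "O" refers to, the group's bondAtomList can be paired with its atomNameList. The sketch below copies the three lists from the excerpt above; `named_bonds` is a hypothetical helper, not part of any MMTF library.

```python
atom_names = ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'O', 'O3',
              'O4', 'O6', 'O7', 'N2', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6C1',
              'H6C2', 'H8C1', 'H8C2', 'H8C3', 'HB', 'H6', 'HA']
bond_atom_list = [1, 0, 2, 1, 3, 2, 4, 3, 5, 4, 7, 6, 8, 0, 8, 4, 9, 2,
                  10, 3, 11, 5, 12, 6, 13, 1, 13, 6, 14, 0, 15, 1, 16, 2,
                  17, 3, 18, 4, 19, 5, 20, 5, 21, 7, 22, 7, 23, 7, 24, 9,
                  25, 11, 26, 13]
bond_order_list = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
                   1, 1, 1, 1, 1, 1, 1, 1, 1]


def named_bonds(names, bond_atoms, bond_orders):
    """Turn flat index pairs into (atom name, atom name, order) tuples."""
    pairs = zip(bond_atoms[0::2], bond_atoms[1::2])
    return [(names[i], names[j], order)
            for (i, j), order in zip(pairs, bond_orders)]


bonds = named_bonds(atom_names, bond_atom_list, bond_order_list)

# Bonds in which the atom named "O" is listed first.
ring_oxygen_bonds = [bond for bond in bonds if bond[0] == "O"]
print(ring_oxygen_bonds)
```

The atom named "O" is bonded to C1 and C5, which matches the ring oxygen that the mmCIF file calls O5.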

retinal missing double-bond connection to lysine in GPCR 1F88

While it is excellent to be able to show double bonds now within groups, there are a few cases, such as Schiff bases between an aldehyde and lysine, where a double bond is needed linking two groups. If all the other double bonds are present, but this one is missing, users will rightly consider it a bug. See 1F88.mmtf

5MOO has bonds between alt confs

The MMTF file for 5MOO has bonds between alternate conformations of residue MET104 (for hydrogens which are only in one conformation). I don't see those bonds in the mmCIF file, so it looks like this is not a "primary-data-issue".

Magic number?

The spec.md file does not say anything about a magic number. This is a common practice in many domains, because you can't always trust that users will preserve file names and their extensions. Unix command line utilities like file make use of magic numbers to identify file types. Does MMTF have a magic number? If not, might one be considered in a future iteration of the format?
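
As an illustration of the practice (not of anything MMTF defines), gzip streams always start with the two bytes 0x1f 0x8b, so a reader can recognize them regardless of the file name. `looks_like_gzip` is a hypothetical helper.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data: bytes) -> bool:
    """Identify gzip data by its magic number instead of a file extension."""
    return data[:2] == GZIP_MAGIC


payload = gzip.compress(b"example")
print(looks_like_gzip(payload), looks_like_gzip(b"example"))
```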

Update text that describes traversal pseudo code

The following text seems to be obsolete:

The following traversal pseudo code assumes that all fields have been decoded and specifically that the split-list delta encoded fields are decoded into fields named like in the following example: xCoordBig/xCoordSmall decode into xCoordList.

inconsistent reduced representation of 5ijo

The reduced model for 5ijo has numBonds=0, but len(bondOrderList)=2 and len(bondAtomList)=4.

>>> import simplemmtf
>>> d = simplemmtf.from_url('http://mmtf.rcsb.org/v1.0/reduced/5ijo')
>>> d.get('numBonds')
0
>>> d.get('bondOrderList')
array('b', [1, 1])
>>> d.get('bondAtomList')
array('i', [2959, 1342, 7253, 1812])

Expected: len(bondOrderList)=0 and len(bondAtomList)=0
or numBonds=2 (though there should really be no bonds in a CA-only model)
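
A minimal consistency check for these fields could look as follows. It assumes, as I read the spec, that numBonds counts all bonds (inter-group plus intra-group), so the number of inter-group bonds can never exceed it. The dict mirrors the values reported above as plain lists; `check_bond_counts` is a hypothetical helper.

```python
def check_bond_counts(d):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    atoms = d.get("bondAtomList", [])
    orders = d.get("bondOrderList", [])
    # Each bond needs two atom indices and one order value.
    if len(atoms) != 2 * len(orders):
        problems.append("bondAtomList is not twice as long as bondOrderList")
    inter_group = len(atoms) // 2
    if inter_group > d.get("numBonds", 0):
        problems.append(
            f"{inter_group} inter-group bonds but numBonds={d.get('numBonds', 0)}"
        )
    return problems


# Values as reported for the reduced 5ijo model.
reported = {
    "numBonds": 0,
    "bondOrderList": [1, 1],
    "bondAtomList": [2959, 1342, 7253, 1812],
}
problems = check_bond_counts(reported)
print(problems)
```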

Make a new release

Please make a new release (the 1.0 release from 2016 is too old), with a new version number and a new tarball.

assembly subunit ordering

The ordering of assembly subunits seems to be different in MMTF and mmCIF files. This was reported on the pymol-users mailing list (https://sourceforge.net/p/pymol/mailman/message/36005513/).

I'm not aware that assembly ordering would be part of any spec, so technically there is nothing wrong here. But for consistency, it would be nice to provide identical assembly descriptions. In PyMOL, assemblies from mmCIF files are identical to pdb1 files, so I assume we're reading the mmCIF files correctly.

Example: Assembly ordering for 3bw1:
mmCIF: 1 2 3 4
MMTF: 4 2 1 3

_struct_conn group missing

I just noticed that although we have all the bonding between amino acids and all the bonding within groups, we are missing the very important bonds in _struct_conn, particularly in 1BLU.

I believe this may be because the "metalc" type is being skipped. This is an important type to include.

metalc1 metalc ? B SF4 . FE1 ? ? ? 1_555 A CYS 53 SG ? ? A SF4 101 A CYS 53 1_555 ? ? ? ? ? ? ? 2.189 ?

same for 1hho:

metalc1 metalc ? A HIS 87 NE2 ? ? ? 1_555 D HEM . FE ? ? A HIS 87 A HEM 143 1_555 ? ? ? ? ? ? ? 1.937 ?

It is critical that we have these.

Remark/comments field

When working on modeling/prediction/design problems, I know a lot of people add comments/remarks about various things to their PDB files.
In the case of structures from the PDB, I think it would be best if this field were always empty.

Possible use cases:

  • protein design scores/parameters
  • application runtime flags/commands
  • model quality numbers
  • rmsd to native for benchmarking

It would be very useful to add a field dedicated to this, probably named extras or comments; it would just be a string field.

The alternative is just to use title or structureId for this kind of stuff, since in most modeling they don't exist. I'm not against that either, but the spec documentation should note which one applications should use so it's standardized.
~Dan

multiple resolution/rFree/rWork values

Hi, I'm working on a project in the same domain - reading and writing macromolecular structures, and I was just checking how you handle models with multiple refinement statistics.
For example 5moo - joint x-ray and neutron refinement.
I suppose MMTF just stores one of the values?

'RCSB restricted area' for some PDB IDs

For some PDB IDs, the access to the MMTF files is restricted: The server shows a 'RCSB restricted area' message and asks for username and password.

Examples:

https://mmtf.rcsb.org/v1.0/full/1o1z
https://mmtf.rcsb.org/v1.0/full/4gxy

For https://mmtf.rcsb.org/v1.0/full/1aki and most other PDB IDs, the download works as expected.

Update:

This behavior seems inconsistent: on some attempts the download works as expected, but a few attempts later the message appears again.

Anisotropic b-factors, do we want them in mmtf?

They are missing in v0.2. They would be important for refinement software like phenix. A possibility is that they are treated as user data, but in my opinion they are important enough to deserve their own field.

The field would add a significant amount of information to the file (a 3x3 matrix of floats for each atom), so we'd need to decide on a good compression strategy for it.

[feature request] Please support the secondary structure information and other essential information supported by the pdb format

Continuing from rcsb/mmtf-cpp#28

The recently added extra fields might be a solution for any additional data you would like to store.

I was downloading data from the PDB database in the MMTF format, but it lacks the secondary structure information.

This also makes the "PDB archive size comparison" graph on https://mmtf.rcsb.org/ invalid since the PDB format has more information in it.

Add ncsOperatorList field

Add field ncsOperatorList with transformations to construct the full crystallographic asymmetric unit.

Example from 1a37:

[
    [
        1.0, 0.0, 0.0, 0.0,
        0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.0,
        0.0, 0.0, 0.0, 1.0
    ],
    [
        -0.997443,  0.000760, -0.071468, 59.52120,
        -0.000162, -0.999965, -0.008376, 80.32820,
        -0.071472, -0.008343,  0.997408, 2.38680,
         0.0,       0.0,       0.0,      1.0
    ]
]
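
A sketch of how a reader might apply such an operator to a coordinate, assuming each operator is a flat, row-major 4x4 matrix with the translation in the fourth column (the layout the example above suggests). `apply_operator` is a hypothetical helper; the second operator's values are copied from the 1a37 example.

```python
def apply_operator(op, xyz):
    """Apply a flat, row-major 4x4 transformation to an (x, y, z) point."""
    x, y, z = xyz
    return tuple(
        op[4 * r] * x + op[4 * r + 1] * y + op[4 * r + 2] * z + op[4 * r + 3]
        for r in range(3)
    )


identity = [
    1.0, 0.0, 0.0, 0.0,
    0.0, 1.0, 0.0, 0.0,
    0.0, 0.0, 1.0, 0.0,
    0.0, 0.0, 0.0, 1.0,
]
second = [
    -0.997443,  0.000760, -0.071468, 59.52120,
    -0.000162, -0.999965, -0.008376, 80.32820,
    -0.071472, -0.008343,  0.997408, 2.38680,
     0.0,       0.0,       0.0,      1.0,
]

unchanged = apply_operator(identity, (1.0, 2.0, 3.0))
moved_origin = apply_operator(second, (0.0, 0.0, 0.0))
print(unchanged, moved_origin)
```

Applying every operator in the list to every atom of the deposited chains would then yield the full asymmetric unit; the origin maps onto the translation part of the second operator.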

Bond order and aromatics/resonance

Was there ever a discussion about how we (applications) should be setting the bond order for aromatic or resonance bonds? Did the idea of a 5th bond type (aromatic/resonance) ever come up?

I ask because Rosetta stores bond information as single, double, triple, or aromatic/resonance, which makes sense (at least to me). I would assume all the aromatic bonds in phenylalanine would be considered equal, but currently I have to decide between bond orders 1 and 2.

Clarify if bondAtomList and bondOrderList are optional for groups too

It would be great to clarify in the specs if it is acceptable for an MMTF file to have groupType objects (see https://github.com/rcsb/mmtf/blob/master/spec.md#grouplist) with missing bondAtomList and bondOrderList fields.

The reason why this needs clarification is that the bondAtomList and bondOrderList fields exist also on a global level (see https://github.com/rcsb/mmtf/blob/master/spec.md#bondatomlist) where they are optional (you can either have neither of them, only bondAtomList or both).

The discussion on this was started in rcsb/mmtf-cpp#9 and rcsb/mmtf-cpp#10.
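
Until the spec settles this, a decoder can simply tolerate both cases. The sketch below shows one possible reader policy, not something the spec mandates: a missing bondAtomList is treated as "no bonds", and a missing bondOrderList as "orders unknown". `group_bonds` is a hypothetical helper working on a decoded groupType dict.

```python
def group_bonds(group_type):
    """Return (atom index, atom index, order) tuples for one groupType.

    Missing bondAtomList -> no bonds; missing bondOrderList -> order None.
    """
    atoms = group_type.get("bondAtomList", [])
    orders = group_type.get("bondOrderList")
    pairs = list(zip(atoms[0::2], atoms[1::2]))
    if orders is None:
        orders = [None] * len(pairs)
    if len(orders) != len(pairs):
        raise ValueError("bondOrderList length does not match bondAtomList")
    return [(i, j, order) for (i, j), order in zip(pairs, orders)]


no_fields = group_bonds({"groupName": "HOH"})
atoms_only = group_bonds({"bondAtomList": [1, 0, 2, 1]})
both = group_bonds({"bondAtomList": [1, 0, 2, 1], "bondOrderList": [1, 2]})
print(no_fields, atoms_only, both)
```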
