ihmwg / python-ihm Goto Github PK

Python package for handling IHM mmCIF and BinaryCIF files

License: MIT License

Python 95.74% C 3.06% Shell 0.11% SWIG 1.08% Makefile 0.01%

python-ihm's Issues

Allow adding extra user-defined categories

I and @iecheverria want to be able to add extra categories to mmCIF or BinaryCIF files with our own data. This will allow us to

more quickly prototype new methods (before support for them is incorporated by @brindakv into the IHM dictionary), and
incorporate program-specific (IMP-specific in our case) data into files that we use internally as a "working format"

To allow this, the ihm.dumper and ihm.reader modules should expose the API of the various Dumper and Handler classes and their support classes, so that users can write new classes for their own categories, and the read and write functions should allow for such classes to be passed in.

Of course, at deposition time it is straightforward to remove program-specific categories from the file.

Add faster C API for reading loop constructs

The C-accelerated mmCIF reader calls a user-provided callback function to process all data in the mmCIF file. For simple categories, this is called once at the end of the file once all keywords for that category have been encountered. For loop constructs, it is called once per line in the loop. This can be quite inefficient since that could be a lot of function calls per loop (which can't be optimized as the function is only known at runtime).

A faster solution would be to have a single callback per loop, and provide an API function to get the next line.

IDmapper for assemblies does not work for some PDB-dev entries

Using ihm.reader to read orphan assemblies (_ihm_struct_assembly.*) from mmcif files returns empty list for some files even when the table is present (eg: *04.cif, *05.cif); although this works for some other mmcif files (eg: *02.cif).

TypeError attempting to read PDB 6ep0

Christian Hanke reports by email that python-ihm fails to read 6ep0.cif with the following traceback:

  File "D:\Python27_Anaconda\envs\env_for_IHM_Python27\lib\site-packages\ihm\reader.py", line 1617, in read
    h.finalize()

  File "D:\Python27_Anaconda\envs\env_for_IHM_Python27\lib\site-packages\ihm\reader.py", line 1379, in finalize
    offset = self._get_auth_seq_id_offset(asym)

  File "D:\Python27_Anaconda\envs\env_for_IHM_Python27\lib\site-packages\ihm\reader.py", line 1391, in _get_auth_seq_id_offset
    for seq_id in range(rng[0], rng[1]+1):

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

Allow manual specification of asym_id

Currently all IDs are hidden away and assigned automatically. This includes asym_ids (chain IDs). However, it may be helpful to allow these to be manually specified, for example if people want to keep the chain IDs from a starting model. We just need to check for duplicate manual IDs, and that any automatically-assigned IDs do not clash with manually-assigned ones.

Write a design principles doc

For example, it should clarify

why some objects can be duplicated (e.g. Datasets)
why objects are mutable rather than immutable
plans for MMTF IO
use of 'orphan' lists

Add warnings if reader encounters unknown categories or items

python-ihm was originally designed to read and write only the subset of PDBx/IHM mmCIF that viewers and modelers are interested in, and so the reader silently ignores categories or items it doesn't know about. However, if we want to use the library for any purpose where we need to extract all the data from the file (e.g. for remediation) it would be helpful for the reader to warn about missed data. Add an (optional) flag to ihm.reader.read to turn on these warnings.

Support auth_seq_id

Currently we assume that the author-provided seq_id is the same as the internal 1-based consecutive numbering and provide no mechanism to change it. This is true for all currently-deposited IMP models but is not true for Nup133 or some HADDOCK models. Provide a mechanism to allow the user to number residues. The simplest way to do this would be for AsymUnit to take an auth_seq_id parameter which could be either an int (add it to each seq_id to get auth_seq_id) or a mapping (list or dict; auth_seq_id = mapping[seq_id]).

Use special object for values marked as ?

Currently the mmCIF unknown value is represented as a literal ? Python string. This makes it impossible to distinguish from the actual string "?". On the other hand, the omitted value is represented specially (as the Python value None, not as the literal . string) which will never compare equal to any other valid mmCIF value. We should use something similar for mmCIF-unknown.

Note that this would also make handling BinaryCIF more straightforward, since BinaryCIF does not use the values . and ? but instead uses a bitmask.

Avoid hitting the network in MRCParser

When we use ihm.metadata.MRCParser to parse an MRC file, if it determines that the file is deposited in EMDB, it uses the EMDB web API to extract additional information. This suffers from two problems:

If the website is down or the network is unavailable (e.g. a modeling simulation being run on an internal compute cluster) this will raise an exception, killing the job.
If the parser's output ends up not being used then this network access was unnecessary.

Proposal:

Populate the additional information lazily, only when we actually come to write out the mmCIF file.
Catch network errors, report default information in this case, and give a warning instead.

Read starting model coordinates

We don't currently read the ihm_starting_model_coord table, since ChimeraX doesn't use these coordinates (it fetches the referenced file from PDB instead, if available). But we probably should (perhaps optionally) read it so that we can edit (or remediate) existing files without losing information.

Handle non-ASCII mmCIF files with C parser

Currently ihm.reader.read fails if given an mmCIF file that contains non-ASCII characters (e.g. accented characters in author names) with the error Python read method returned too many bytes. While mmCIF files are supposed to be ASCII, we should do something reasonable when given a non-ASCII file.

Change _ihm_chemical_descriptor to _ihm_chemical_component_descriptor in dumper.py

According to the IHM-dictionary, the "_ihm_chemical_descriptor" category should be called "_ihm_chemical_component_descriptor" in dumper.py

Should verify seq_ids in spheres/atoms against representation & assembly

Currently the lists of Atom and Sphere objects for a model are output as-is. No checking is done to ensure that these objects only reference seq_ids that are in the model's Representation (which in turn should match those in the Assembly). This should be checked to avoid generating invalid mmCIF output.

Add read support

Currently the library can write the Python hierarchy out as an mmCIF file, but cannot do the reverse, constructing a hierarchy from an existing mmCIF file. The latter would be useful to assist in displaying IHM mmCIF files in molecular viewers such as ChimeraX, to allow data to be easily extracted from the files, or to allow integrative models to be used as inputs for new modeling runs.

Segment for StartingModel is incorrect

@brindakv reports that if a starting model's templates do not cover the entire sequence (e.g. C- or N-termini are modeled without a template) then python-ihm outputs the wrong seq_id range for _ihm_starting_model_details.entity_poly_segment_id.

StartingModel._get_seq_id_range_all_templates() appears to be working as designed but the design is to report only the range covered by templates. I think this is a holdover from old versions of the dictionary where template information was reported in this table. For modern usage this function seems unnecessary; we should just use StartingModel.asym_unit as the seq_id range.

Change PolyResidueFeature type to 'residue' when a unique residue is provided

When instantiating a PolyResidueFeature object the type is automatically set to 'residue range' even when a unique residue is provided.
This is a minor syntax issue but the IHM dictionary allows for 'residue' type that would make sense here.

Since the PolyResidueFeature takes a list of AsymUnit or AsymUnitRange as input but exposes a unique type for each instance, it might be necessary to assign the type only at the dumper level. Or slightly re-design the class to have a type for each list element.

A temporary but dirty workaround I have is to check the residue IDs upon initialization, this will only work when the list has only one element though...

class PolyResidueFeature(Feature):
    """Selection of one or more residues from the system.

       :param sequence ranges: A list of :class:`AsymUnitRange` and/or
              :class:`AsymUnit` objects.
    """
    type = 'residue range'

    def __init__(self, ranges):
        self.ranges = ranges
        # Convert type to residue when the only range provided is constituted by
        # a unique residue
        if len(self.ranges) == 1 and \
                self.ranges[0].seq_id_range[0] == self.ranges[0].seq_id_range[1]:
            self.type = 'residue'

    # todo: handle case where ranges span multiple entities?
    entity = property(lambda self: self.ranges[0].entity
                                   if self.ranges else None)

Report an error if no fits are provided for a restraint

If the user did not provide any model-to-restraint fits for a given restraint, this is a probably a user error and the dumper should report about it (since this means the restraint wasn't used anywhere). With some restraints that use a single table (such as ihm_sas_restraint) having no fits means that the restraint won't be recorded in the mmCIF file at all. (Other restraints are split into a restraint info table like ihm_2dem_class_average_restraint and a fit table like ihm_2dem_class_average_fitting, so at least the info table will still be written even if no fits are available.)

Provide an example for using nonstandard/modified residues

We should provide an example that passes nonstandard chemical components to the Entity constructor, and another that uses a custom Alphabet subclass.

bool(ihm.unknown) should return False

For example, ChimeraX uses constructs such as

printable_name = m.name if m.name else "unknown name"

This works fine when m.name is None but doesn't handle m.name being ihm.unknown.

Add support for Entities that are not amino acid sequences

Currently only amino acid sequences are supported for Entity. We should allow for other alphabets (e.g. RNA, DNA) or even for specifying a custom list of components.

Make Assembly class more flexible

Add set-like operations to Assembly (e.g. allowing assemblies to be combined or subtracted).
some_asym_range in assembly or assembly_a in assembly_b should work.

Provide support for generic distance restraints

Modeling software such as HADDOCK extensively use generic distance restraints. In the IHM dictionary, these have been defined in the ihm_derived_distance_restraint category. This category also relies on the ihm_feature_list, ihm_poly_atom_feature, ihm_poly_residue_feature and ihm_non_poly_atom_feature categories. The python-ihm package should provide support for these data categories as well.

Verify Atom/Sphere against Representation type

We currently don't check to make sure that any Atom or Sphere objects output are consistent with the type of the Representation (only that they match the seq_id ranges). For instance, Atom objects should only be valid when matched against an ihm.representation.AtomicSegment.

Add MMTF support

For smaller output files, it may be useful to encode at least a subset of the IHM dictionary such that it can be written out/read in in MMTF format.

Add support for predicted contact restraints

Add a suitable Restraint class to represent predicted contacts, and add support to read from or write to mmCIF/BinaryCIF.

Add support for SWISS-MODEL to PDB parser

When given a PDB file generated by the SWISS-MODEL server, ihm.metadata.PDBParser currently fails to extract suitable metadata from it. The file does, however, include the alignment and templates used so in principle this metadata could be extracted.

Add a new mmCIF-compatible binary format, e.g. for trajectories

Add support for read/write of a file format that (ideally)

supports arbitrary mmCIF key:value pairs (so that we can convert it loss-free to or from mmCIF, including with custom data as per #30)
is fast to parse
is compact
is seekable (so that we can easily add extra frames in a trajectory, or quickly access frames from the middle of an existing trajectory)

Such a format would allow IMP folks to ditch the old RMF format as the "working format", and easily convert the resulting models to mmCIF. This may also be a practical solution for those that need to deposit huge files, bigger than is really practical for current mmCIF.

Conventional mmCIF fails points 2-4. We can't easily add an extra model to an mmCIF file without rewriting the entire file (since various IDs would have to be updated, and the data is gathered per-table rather than per-model). We also can't read a single trajectory frame without scanning the whole file.

MMTF (#11) has its own data model, so fails point 1. Both MMTF and BinaryCIF support a number of encoding mechanisms to generate compact files (point 3) but these render the file non-seekable (e.g. runlength encoding of the ATOM/HETATM field in the atom_site table necessitates reading and unpacking the entire thing to determine whether a given atom is ATOM or HETATM).

Fast parsing probably necessitates a binary format.

Proposal: use HDF5 as a binary container format. Each mmCIF category would map to an HDF5 table, which should be trivially seekable and extendable. This won't be as compact as BinaryCIF or MMTF (although HDF5 does support compression). To address this we can

replace a lot of string data with more compact forms - for example replace any floating-point number with a 32-bit float, or enumerated data (e.g. ATOM or HETATM) with a suitably small integer (just a 0 or 1 bool type in this case).
split out changing and non-changing data per ID - for example several models may have the same composition, so we only need to store coordinates for each model, and a single table with the composition.

Error when gap(s) present in the input structure

I have spotted an issue when I try to output an mmCIF file from a PDB file following the steps desribed in examples/simple-docking.py.

If the PDB presents a gap in the residue IDs sequence, the dumper module triggers the following error:

Traceback (most recent call last):
  File "16srna_ksga.py", line 170, in <module>
    ihm.dumper.write(fh, [system])
  File "/Users/mtrellet/Dev/python-ihm/ihm/dumper.py", line 1436, in write
    d.dump(system, writer)
  File "/Users/mtrellet/Dev/python-ihm/ihm/dumper.py", line 771, in dump
    seen_types = self.dump_atoms(system, writer)
  File "/Users/mtrellet/Dev/python-ihm/ihm/dumper.py", line 825, in dump_atoms
    comp = atom.asym_unit.entity.sequence[atom.seq_id-1]
IndexError: tuple index out of range

As reported in the error, the faulty line is comp = atom.asym_unit.entity.sequence[atom.seq_id-1] that looks for an object stored in the list atom.asym_unit.entity.sequence by taking the atom.seq_id value. However, in the case where the model has gaps, this seq_id can have non-sequential values leading to an out of range error. I fixed this issue by adding a checking step in the loop iterating over the model (in dumper.py line 821):

for group, model in system._all_models():
    c = 0
    prev = None
    for atom in model.get_atoms():
        # First iteration
        if c == 0 and not prev:
            prev = (atom.seq_id, atom.asym_unit.id)
        # Switching to another asymetric unit (assuming all AsymUnit have different chain ids)
        elif atom.asym_unit.id != prev[1]:
            prev = (atom.seq_id, atom.asym_unit.id)
            c = 0
        # Switching to another residue
        elif atom.seq_id != prev[0]:
            c += 1
            prev = (atom.seq_id, atom.asym_unit.id)
        comp = atom.asym_unit.entity.sequence[c]

The output is really similar to what I would expect and what is currently deposited in the PDB-dev DB for this particular structure.
However, there could be a better/smarter way to handle this...

Extend _entity.src_method to accept values allowed in the PDBx/mmCIF dictionary

According to the PDBx/mmCIF dictionary, _entity.src_method can have the following values: nat (natural source), man (genetically manipulated source) and syn (synthetic source). The python-ihm library allows for only "man".

Allow passing Residue objects to ResidueFeature

ResidueFeature expects as input a list of residue ranges and will give a cryptic error if given one or more Residue objects instead. Accept Residue and treat them like 1-residue ranges.

Should verify asym_ids in spheres/atoms against representation & assembly

Currently the lists of Atom and Sphere objects for a model are output as-is. No checking is done to ensure that these objects only reference asym_ids that are in the model's Representation (which in turn should match those in the Assembly). This should be checked to avoid generating invalid mmCIF output.

Add group_id to derived_distance_restraints

The IHM dictionary allows the usage of group IDs to create sets of distance restraints. This is particularly useful when a set of restraint has a particular group conditionality that does not apply to other restraints. In HADDOCK we deal a lot with ambiguous restraints and group of restraints that might not be all related.

Currently, as far as I understand, the python-ihm library does not allow to define them nor to write them down.

Allows for non-polymer atoms in Entity declaration

HADDOCK allows having ligands (non-polymer) atoms in the docking system. Those ligands constitute sometimes a complete subunit to be docked (protein-ligand docking). When trying to automate the creation of an IHM mmCIF file from a PDB file, we encountered the following error:

Traceback (most recent call last):
  File "/Users/mtrellet/Dev/haddock-webserver-flask/app/main/views.py", line 2105, in download_cif
    create_ihm_mmcif(pdb_path, params, interface_ranges)
  File "/Users/mtrellet/Dev/haddock-webserver-flask/utils/PDBHandler.py", line 405, in create_ihm_mmcif
    entities.append(ihm.Entity(seqs[ch_id]['seq'], alphabet=alphabet, description=f'Partner {i+1}'))
  File "/Users/mtrellet/Dev/miniconda3/envs/haddock3/lib/python3.6/site-packages/ihm/__init__.py", line 754, in __init__
    self.sequence = tuple(get_chem_comp(s) for s in sequence)
  File "/Users/mtrellet/Dev/miniconda3/envs/haddock3/lib/python3.6/site-packages/ihm/__init__.py", line 754, in <genexpr>
    self.sequence = tuple(get_chem_comp(s) for s in sequence)
  File "/Users/mtrellet/Dev/miniconda3/envs/haddock3/lib/python3.6/site-packages/ihm/__init__.py", line 753, in get_chem_comp
    return alphabet._comps[s]
KeyError: None

The code that returns the error (I voluntarily skipped some parts, tell me if you need more information):

# We store sequences and match between author resids and mmCIF consecutive resids (1->n)
    for r in s.get_residues():
        if r.get_parent().id in partners and r.resname != "HOH":
            segid = r.get_parent().id
            if segid not in seqs_id:
                seqs_id[segid] = 1
            else:
                seqs_id[segid] += 1
            if segid not in seqs:
                seqs[segid] = {'seq': [], 'id_from_pdb': {}, 'id_from_seq': {}}

            seqs[segid]['id_from_pdb'][seqs_id[segid]] = r.id[1]
            seqs[segid]['id_from_seq'][r.id[1]] = seqs_id[segid]
            # We only store one-letter code for standard aa
            try:
                seqs[segid]['seq'].append(three_to_one.get(r.resname.strip()))
            except KeyError as e:
                seqs[segid]['seq'].append(r.resname.strip())
            except Exception as e:
                logging.exception(f"An issue occured when trying to add residue {r}")
                raise Exception
            # seqs[segid].append(r.resname.strip())

    # Create different entities with specific dictionary depending on the unit type (Protein, RNA, DNA)
    entities = []
    for i, ch_id in enumerate(partners):
        if partners[ch_id]['moleculetype'] == 'DNA':
            alphabet = ihm.DNAAlphabet
        elif partners[ch_id]['moleculetype'] == 'RNA':
            alphabet = ihm.RNAAlphabet
        else:
            alphabet = ihm.LPeptideAlphabet
        entities.append(ihm.Entity(seqs[ch_id]['seq'], alphabet=alphabet, description=f'Partner {i+1}'))
    system.entities.extend(tuple(entities))

One workaround could be to create a non-polymer alphabet but I have the feeling that it would just create errors further down the pipeline.
So a solution would be, I guess, to support non-polymeric entities. However I'm well aware that this requires some significant efforts..!

Support features defined on entities, not asyms

The IHM dictionary now allows for asym_id to be omitted for all of the feature types (ihm_poly_atom_feature, ihm_poly_residue_feature, ihm_non_poly_feature) but python-ihm expects it to be there when reading mmCIF files. For example, @brindakv notes that this breaks reading PDBDEV_00000032. Allow for all features to be defined using Atom or Residue objects derived from Entity, not just AsymUnit.

Read dictionary version into System

When reading a file, we should parse the audit_conform table to determine the version of the dictionary the file conforms to and provide an API to get this information. This would allow ChimeraX, for example, to refuse to read files that conform to a pre-1.0 version of the dictionary, since they won't display correctly anyway.

Add support for pseudo atoms

The IHM dictionary now allows for pseudo atoms to be used as features, as per ihmwg/IHMCIF#73. Add support for these to the library.

Populate Citations from just pubmed ID

Add a Citation.from_pubmed_id class method that just takes a PubMed ID and extracts all of the other information needed for Citation by querying the PubMed JSON web API.

Add support for pdbx_nonpoly_scheme table

If we have any non-polymeric entities in the system (e.g. PDBDEV_00000015) we should read and write the pdbx_nonpoly_scheme table so we get author/PDB-provided seqids correct.

Don't allow Entities in Assemblies

We have previously allowed Entity or EntityRange objects to be added to an Assembly, for example to show in the exosome structure that some subunits (RPL3 in that case) appear in the experimental data (crosslinks) with unspecified stoichiometry but are not in the models. This should no longer be allowed as ihm_struct_assembly_details.asym_id is mandatory. Only AsymUnit or AsymUnitRange objects can be added to an Assembly. In the exosome case, we would still create an Entity for RPL3 and refer to that entity in the ihm_cross_link_list table, but not add it to the complete assembly.

Add support for crosslinks

Probably modeled on the existing EM and SAS restraints.

Unable to get model object for 39th and 40th entries

Error while processing:
self.model = ihm.model.Model
self.system, = ihm.reader.read(open(self.mmcif_file),
model_class=self.model)

Error statement:
File "/Users/saijananiganesan/anaconda3/envs/imp/lib/python3.7/site-packages/ihm/reader.py", line 2979, in read
more_data = r.read_file()
File "/Users/saijananiganesan/anaconda3/envs/imp/lib/python3.7/site-packages/ihm/format.py", line 553, in read_file
return self._read_file_c()
File "/Users/saijananiganesan/anaconda3/envs/imp/lib/python3.7/site-packages/ihm/format.py", line 603, in _read_file_c
eof, more_data = _format.ihm_read_file(self._c_format)
File "/Users/saijananiganesan/anaconda3/envs/imp/lib/python3.7/site-packages/ihm/reader.py", line 1415, in call
aln = self.sysr.external_files.get_by_id_or_none(alignment_file_id)
File "/Users/saijananiganesan/anaconda3/envs/imp/lib/python3.7/site-packages/ihm/reader.py", line 113, in get_by_id_or_none
return self.get_by_id(objid, newcls) if objid is not None else None
File "/Users/saijananiganesan/anaconda3/envs/imp/lib/python3.7/site-packages/ihm/reader.py", line 97, in get_by_id
if objid in self._obj_by_id:
TypeError: unhashable type: '__UnknownValue'

I am guessing a new value in the cif file has not been included in the reader?

Get rid of AssemblyComponent class

It's a little inconvenient to populate Assembly objects with AssemblyComponent objects. Assemblies should simply accept AsymUnit, Entity, and AsymUnitRange objects directly instead.

Representation does not have to be a strict subset of assembly

As per ihmwg/IHMCIF#77, a model's representation does not have to be a strict subset of its assembly. Thus, the checks enforced by #18 and #19, whereby each sphere/atom has to be in the representation, and the representation has to be fully in the assembly, are overly strict and should be relaxed. Instead, the only requirement is that a sphere/atom must be in both the representation and the assembly.

Prohibit or warn about non-ascii characters in strings

mmCIF files are supposed to be plain 7-bit ASCII, so no accented characters (e.g. the Å symbol). We should check all strings on output and warn/error about any non-ASCII data.

Add DCD support

For larger ensembles it is not efficient to store them all in the text-based mmCIF format, but we can link out to an external file. For IMP models we have been storing the coordinates in DCD format, which ChimeraX also supports. We've used a collection of scripts that read PDB or RMF files and utilize the ancient MDTools package inside Chimera to dump out the DCD format. To make this easier to use for other systems (and non-IMP users) it would be helpful to have basic DCD write support in the python-ihm package. This would use a similar interface to Model, accepting generators of ihm.model.Atom and ihm.model.Sphere objects, and writing out coordinates in DCD format.

Add support for audit_conform category

The audit_conform category describes the dictionary version which the current data file conforms to. It will be useful to add support in the python-ihm library to populate this category.

Remove hash from some/most classes

Many IHM classes are hashable, as we stick them in dict to detect copies. This is potentially problematic as the classes are also mutable, so users of the library may be tempted to place objects in their own set or dict, modify them, and end up with bizarre behavior. We should probably fix this by removing the __hash__ method from these classes and handling copies in some other fashion (e.g. a mutable wrapper class similar to frozenset vs. set).

Broken URL in example script

url='https://zenodo.org/record/820724/files/archive.zip' in https://github.com/ihmwg/python-ihm/blob/master/examples/locations.py seems to be broken.

Add basic symmetry operations

At a minimum, it would be useful to emit basic non-crystallographic symmetry information, for example for the NPC 1- and 3-spoke models.

ihmwg / python-ihm Goto Github PK

python-ihm's Issues

Recommend Projects

Recommend Topics

Recommend Org