I have spotted an issue when I try to output an mmCIF file from a PDB file following t

In both cases, you need to use seq_id (not <code clas

Error when gap(s) present in the input structure about python-ihm HOT 5 CLOSED

ihmwg commented on August 15, 2024

Error when gap(s) present in the input structure

from python-ihm.

Comments (5)

benmwebb commented on August 15, 2024

Maybe I don't understand you completely - could you come up with a self-contained example (perhaps as a gist) that reproduces the problem? You can't have gaps in seq_id by construction - in mmCIF seq_id always starts at 1 and is contiguous, and indexes into the entity sequence. If there are gaps in the PDB numbering you either need to add the missing residues to the entity sequence (and then just not provide coordinates for them) or map the non-contiguous PDB numbering to contiguous mmCIF numbering, using auth_seq_id_map - see https://python-ihm.readthedocs.io/en/latest/usage.html#residue-numbering

from python-ihm.

mtrellet commented on August 15, 2024

Thanks for your reply Ben!

You were right, the issue comes from the mapping indeed, I've just put the starting shift in the auth_seq_id_map and not the full mapping. See: https://gist.github.com/mtrellet/5926bb29e8f3952acf3eace7676af44f/919c4fd98f98f45011a20522d494501343c7c21a (1st revision)

I've just tested with a list containing the numbering gap instead:
See: https://gist.github.com/mtrellet/5926bb29e8f3952acf3eace7676af44f/b5b0c8ef4ada3265eed7cbf6936b76b21f3f823e (2nd revision)

The list used as auth_seq_id_map (gap in bold):

[5, 6, 7, 8, 9, 10, 11 ... 1383, 1384, 1385, 1506, 1507 ... 1532, 1533, 1534]

However, two issues at first glance:

I get a shift in the _pdbx_poly_seq_scheme.pdb_seq_num field of my mmCIF file that should start at 5 (see list above).

_pdbx_poly_seq_scheme.asym_id
_pdbx_poly_seq_scheme.entity_id
_pdbx_poly_seq_scheme.seq_id
_pdbx_poly_seq_scheme.mon_id
_pdbx_poly_seq_scheme.pdb_seq_num
_pdbx_poly_seq_scheme.auth_seq_num
_pdbx_poly_seq_scheme.pdb_mon_id
_pdbx_poly_seq_scheme.auth_mon_id
_pdbx_poly_seq_scheme.pdb_strand_id
A 1 1 U 6 6 U U A
A 1 2 G 7 7 G G A
A 1 3 A 8 8 A A A
A 1 4 A 9 9 A A A

If I traceback the writer function (_PolySeqSchemeDumper.dump) I can see that the auth_seq_num (where the error occurs in the mmCIF file) is set with the following line:
auth_seq_num = asym._get_auth_seq_id(num+1)
Removing the +1 outputs the proper _pdbx_poly_seq_scheme.pdb_seq_num. However, this also outputs 0 as first pdb_seq_num when no auth_seq_id_map is provided... So there must be a better solution..!

I still get the index out of range error at the same step than before.

I've printed the seq_id_range of my AsymUnits after initializing them with the auth_seq_id_map and it seems fine. My guess is that the error I get when dumping the system is that the get_atoms function of my MyModel class is overwriting the seq_id and then introduces the gap in seq_id. And it later on leads to the error I reported in my first post. So I guess I need to modify the get_atoms() function to not reassign the seq_id field. However, since it is supposed to be a contiguous sequence starting at 1, why not setting it up as an iterable counter? This should not be possible to overwrite it, or am I getting that wrong again?

Thanks again for your time!

from python-ihm.

benmwebb commented on August 15, 2024

In both cases, you need to use seq_id (not auth_seq_id or "PDB" numbering, e.g. r.id[1]) when referencing any IHM object - e.g. when you create your AsymUnitRange or Atom objects. One way to achieve this would be to keep a reverse mapping.

e.g. in your second case replace

seqA_ids = []

for r in s.get_residues():
    if r.get_parent().id == "A" and r.resname != "HOH":
        seqA.append(r.resname.strip())
        seqA_ids.append(r.id[1])

with

seq_id_from_pdb_A = {}
pdb_from_seq_id_A = {}

seq_id = 0
for r in s.get_residues():
    if r.get_parent().id == "A" and r.resname != "HOH":
        seq_id += 1
        seqA.append(r.resname.strip())
        seq_id_from_from_pdb_A[r.id(1)] = seq_id
        pdb_from_seq_id_A[seq_id] = r.id(1)

Then use seq_id_from_pdb_A[r.id(1)] when making your Atom or AsymUnitRange objects, and pass pdb_from_seq_id_A as auth_seq_id_map to the AsymUnit constructor.

The list used as auth_seq_id_map (gap in bold):

[5, 6, 7, 8, 9, 10, 11 ... 1383, 1384, 1385, 1506, 1507 ... 1532, 1533, 1534]

auth_seq_id_map is used as a map, i.e. auth_seq_id = auth_seq_id_map[seq_id]. Since Python numbering starts at 0 while mmCIF numbering starts at 1, if you want to use a list or tuple here you'll want to add a null entry at the start of the list (since there will never be a seq_id = 0 used to index that value). That's why your numbering ends up off by 1. I'll add some clarification to the API docs here.

My guess is that the error I get when dumping the system is that the get_atoms function of my MyModel class is overwriting the seq_id and then introduces the gap in seq_id

Right, it's because you're using the PDB numbering, not seq_id, here:

atomsA.append((c.id, r.id[1], a.name[0], a.name, a.coord[0], a.coord[1], a.coord[2]))

This should be

atomsA.append((c.id, seq_id_from_pdb_A[[r.id[1]], a.name[0], a.name, a.coord[0], a.coord[1], a.coord[2]))

However, since it is supposed to be a contiguous sequence starting at 1, why not setting it up as an iterable counter?

That wouldn't work because seq_id is per-residue, and get_atoms() could return one, multiple, or no atoms for each residue (your sequence has to be contiguous, but you don't have to have coordinates for every residue).

from python-ihm.

mtrellet commented on August 15, 2024

auth_seq_id_map is used as a map, i.e. auth_seq_id = auth_seq_id_map[seq_id]. Since Python numbering starts at 0 while mmCIF numbering starts at 1, if you want to use a list or tuple here you'll want to add a null entry at the start of the list (since there will never be a seq_id = 0 used to index that value). That's why your numbering ends up off by 1. I'll add some clarification to the API docs here.

Yes indeed, that what I was tempted to do (start with a null in my list) I just was not sure it was a "good practice" and that a neater way could exist. And actually a dictionary is way better (also for genericity purposes).

That wouldn't work because seq_id is per-residue, and get_atoms() could return one, multiple, or no atoms for each residue (your sequence has to be contiguous, but you don't have to have coordinates for every residue).

Very true, forgot this "detail". Thanks for the explanation!

Just to summarize, everything worked fine with the new auth_seq_id_map, thanks a lot for your complete answer! I updated the gist in case someone is interested...

And another (almost) related question: Do we plan to output the _atom_site.auth_seq_id records or this would be too much redundant with the _pdbx_poly_seq_scheme.auth_seq_num? I know we've done it in the model we deposited on PDB-dev for instance.

from python-ihm.

benmwebb commented on August 15, 2024

Do we plan to output the _atom_site.auth_seq_id records

I don't have the library output this because (as you say) it's redundant, and I prefer not to make the files any larger than they have to be. But it would be easy enough to add if needed.

from python-ihm.

Error when gap(s) present in the input structure about python-ihm HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent