incatools / kgcl

Datamodel for KGCL (Knowledge Graph Change Language)

Home Page: https://w3id.org/kgcl/

License: MIT License

Makefile 1.30% Python 98.16% Shell 0.09% Jinja 0.44%

linkml ontology-changes ontology-diffs rdf semantic-web kgcl

kgcl's Introduction

KGCL: Knowledge Graph Change Language

KGCL is a standard datamodel for representing changes in ontologies and knowledge graphs.

This repository houses:

The KGCL schema/standard
The Python implementation of the standard (LinkML model, LARK grammar)

Documentation

kgcl's Issues

Reconfigure OWL export

This looks a bit odd:
https://bioportal.bioontology.org/ontologies/KGCL?p=classes

Use the latest linkml to generate this. Also use the same config as chemrof

Add ability to add comments to obsoletions

also - auto-add github issue

Fix rendering

https://github.com/INCATools/kgcl/blob/main/tests/test_grammar/test_render.py

This test doesn't have any asserts - if we add this we will see there are some issues

How to identify the target subset in subset membership change operations?

For the AddNodeToSubset and RemoveNodeToSubset changes, the subset the node should be added to or removed from is supposed to be represented by the in_subset slot of an AddToSubset or RemoveFromSubset mixin. That slot expects an OntologySubset, defined in ontology_model.yaml.

But OntologySubset has seemingly no slots at all (either directly or inherited from OntologyElement), so how is one supposed to know which subset the change is about?

Shouldn’t OntologySubset have an id slot to identify the subset?

Remove x_representation fields

Changes currently look like this:

{
  "id": "CHANGE:001",
  "type": "NodeRename",
  "old_value": "'nuclear envelope'",
  "new_value": "'foo bar'",
  "about_node": "GO:0005635",
  "about_node_representation": "curie",
  "@type": "NodeRename"
}

there is no need for about_node_representation if we have a prefix map for each change set.
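
A minimal sketch of the idea, assuming each change set carries its own prefix map. The `PREFIX_MAP` dictionary and the `expand` helper are hypothetical illustrations and not part of the KGCL API. With a prefix map available, the CURIE can be resolved directly, so no separate representation field is needed.

```python
# Hypothetical prefix map attached to a change set (not a KGCL API).
PREFIX_MAP = {"GO": "http://purl.obolibrary.org/obo/GO_"}


def expand(identifier: str, prefix_map: dict) -> str:
    """Expand a CURIE to an IRI; return the input unchanged if no prefix matches."""
    prefix, sep, local_id = identifier.partition(":")
    if sep and prefix in prefix_map:
        return prefix_map[prefix] + local_id
    return identifier


# Same change as above, without the about_node_representation field.
change = {"id": "CHANGE:001", "type": "NodeRename", "about_node": "GO:0005635"}
about_node_iri = expand(change["about_node"], PREFIX_MAP)
```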

The grammar should return clean objects

Currently the grammar yields linkml instances like:

NodeRename(id='CHANGE:001', type='NodeRename', ..., old_value=Token('SINGLE_QUOTE_LITERAL', "'nuclear envelope'"), new_value=Token('SINGLE_QUOTE_LITERAL', "'foo bar'"))

No need for this

Also the rendered json is:

{
  "id": "CHANGE:001",
  "type": "NodeRename",
  "old_value": "'nuclear envelope'",
  "new_value": "'foo bar'",
  "about_node": "GO:0005635",
  "about_node_representation": "curie",
  "@type": "NodeRename"
}

The single quotes are not necessary. It's already a string field.

Currently downstream code such as OAK has a workaround for this:

https://github.com/INCATools/ontology-access-kit/blob/8da76e19698058b43c4ae4ba2b5bcda35a1c851a/src/oaklib/utilities/kgcl_utilities.py#L99-L123

In theory this should still work with clean code but we should still wait until 1.0.0
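
As a rough illustration of what "clean objects" could mean, here is a sketch that converts token-like values to plain strings and drops one pair of outer quotes. The `strip_outer_quotes` helper and the `FakeToken` stand-in are hypothetical; the sketch only assumes that the parser's tokens behave like `str` subclasses.

```python
def strip_outer_quotes(value) -> str:
    """Return a plain str without one pair of matching outer quotes."""
    text = str(value)  # converts str subclasses (such as parser tokens) to plain str
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        return text[1:-1]
    return text


class FakeToken(str):
    """Hypothetical stand-in for a parser token that subclasses str."""


raw = {
    "old_value": FakeToken("'nuclear envelope'"),
    "new_value": FakeToken("'foo bar'"),
}
clean = {key: strip_outer_quotes(value) for key, value in raw.items()}
```

Values without surrounding quotes pass through unchanged, which is why downstream code that already strips quotes should keep working.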

update bioregistry dependency to ^0.6.0

This issue is causing problems when deploying reasoner-validator to ITRB environments (reasoner-validator depends on BMT, which depends on oaklib, which depends on bioregistry; oaklib also depends on kgcl). I think the fix is to update the oaklib pyproject.toml to bioregistry ^0.6.0, and to do that, we first need to update kgcl to use bioregistry ^0.6.0.

have actions post error as comment?

Our KGCL bot for GO can make pull requests, but if it is triggered but an error occurs, it silently does nothing. Here is an example:

issue: geneontology/go-ontology#28585
triggered action: https://github.com/geneontology/go-ontology/actions/runs/10078725620/job/27864317755

It failed because the requestor forgot to replace GOID with a valid term ID on one line.

The error is reported in the action log, but I had to dig for this. Would it be possible to catch these errors and comment back on the request issue?

Map to COnto-diff change language

https://www.sciencedirect.com/science/article/pii/S1532046412000627#b0160

https://github.com/dbs-leipzig/conto_diff/

The formal spec in the paper is very clear. There also seems to be a KGCL-CNL type language underpinning it, running the tool gives operations such as

move X_1 is_a X_2 X_3

Add example for adding xrefs

I would like to see two additional examples/functionalities in https://incatools.github.io/kgcl/examples/:

Adding an xref. Is it also possible to add extra metadata like the contributor ORCID with an xref?
Updating an xref. E.g., I know there's already an xref from A to B with a given relationship like oboInOwl:hasDbXref and want to upgrade it to a more specific relation like skos:exactMatch

It's also not clear from the docs what happens if I try upgrading an xref that doesn't exist. E.g., if I try and upgrade A oboInOwl:hasDbXref B to A skos:exactMatch B but A oboInOwl:hasDbXref B doesn't exist, will it still add A skos:exactMatch B?

Motivation: I want to automate adding Biomappings to various ontologies

Add a variant of the obsoletion workflow to add OBSOLETE to definition

When GO obsoletes a term the string OBSOLETE is added to the definition:

https://wiki.geneontology.org/Ontology_meeting_2024-04-08#Obsoletion

Some ontologies adopt this, others don't. Some do it half-heartedly. E.g. half of CL obsoletions have this and half don't. This kind of random patchwork confuses users.

In contrast OBO ontologies will always include obsolete in the label

This issue is to gather feedback from different ontologies about what their preferred policy is for this. Ideally we can achieve consensus and have a single KGCL workflow for all ontologies. I think this is best for users, as well as for maintenance. If not, then we can make this configurable.

My own preference is to keep things simple. "obsolete" in the label is sufficient. The definition is not necessarily obsolete. But the GO editors may have reasons to keep this.

Collect more information for Synonym proposals

It would be useful to collect more information for synonym proposals. For Mondo curation, it would be useful to:

include the ~~qualifier~~ synonym type ABBREVIATION for any kind (exact, related, narrow, broad) of synonym
include the submitter's ORCID

Using the current BioPortal UI implementation (Feb 2024), this is not possible.

Non-obvious side-effects of KGCL changes should be explicitly specified

(This is a follow-up to this issue in KGCL-Java, itself a follow-up to another issue in ENVO.)

There is apparently an expectation that some KGCL change operations should have some kind of “side effects” that go beyond the strict application of the change itself.

In the case at hand, it is seemingly expected that obsoleting a class should result in the automatic removal of any axiom that refers to that class. If so, that expected behaviour should be explicitly specified somewhere.

I can imagine at least 3 different ways of dealing with axioms referring to a to-be-obsoleted class:

Silently and automatically removing them. That’s apparently what KGCL-Python does.
Leave them in place for an editor to decide what to do with them. That’s what KGCL-Java does.
Refusing to perform the operation and warn the user (“This class is referred to by several other classes, obsoleting it would have cascading effects. I’m sorry Dave, but I’m afraid I can’t do that.”).

I have no strong opinion on which behaviour is best, and I have no objection to amending KGCL-Java to implement the first one if it is indeed the behaviour intended by KGCL’s authors. But my point is that all those possible behaviours are arguably equally reasonable, and that implementers cannot be expected to guess which was the “intended one”. Right now, the spec in effect leaves this kind of decision at the discretion of implementations, so unsurprisingly, different implementations make different decisions. If a consistent behaviour is desired, the spec must say so.

For node obsoletions, do we agree that the expected behaviours are as follows:

For node obsoletion without replacement or alternative (i.e. NodeObsoletion proper): remove any axiom referring to the to-be-obsoleted entity.
For node obsoletion with direct replacement: rewrite any axiom referring to the to-be-obsoleted entity to make it refer to the replacement entity instead (in this case, the spec does say that it can be done, but does not say it has to be done; my interpretation was that the replacement was again left for the editors to do, and that the only expectation from KGCL was to set the replaced_by annotation).
For node obsoletion with non-direct replacement(s): presumably remove any referring axiom (as in the case of obsoletion without replacement at all)? Though I would think it is more useful to leave them in place so that editors are aware of the fact they need to manually rewrite them with one of the suggested alternative terms.
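
To make the three candidate behaviours for referring axioms concrete, here is a small sketch over a toy edge list. The `ReferencePolicy` enum and the `obsolete` function are hypothetical names for illustration only; neither KGCL-Python nor KGCL-Java is claimed to be structured this way.

```python
from enum import Enum


class ReferencePolicy(Enum):
    REMOVE = "remove"  # drop every edge that mentions the obsoleted node
    KEEP = "keep"      # leave edges in place for an editor to review
    REFUSE = "refuse"  # abort if any edge mentions the node


def obsolete(edges, node, policy):
    """Return (edges, obsolete_flag) after applying `policy`.

    `edges` is a list of (subject, predicate, object) tuples.
    """
    referring = [e for e in edges if node in (e[0], e[2])]
    if policy is ReferencePolicy.REFUSE and referring:
        raise ValueError(f"{node} is referenced by {len(referring)} edge(s)")
    if policy is ReferencePolicy.REMOVE:
        edges = [e for e in edges if e not in referring]
    return edges, True


EDGES = [
    ("EX:2", "rdfs:subClassOf", "EX:1"),
    ("EX:3", "rdfs:subClassOf", "EX:4"),
]
```

All three policies are a few lines each, which supports the point that the choice is a matter of specification rather than implementation difficulty.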

Align KGCL and taxonomy with prior work

We do not do a great job of showing how the KGCL data model aligns with prior work

There is a really excellent summary here:

Romana Pernisch, Daniele Dell’Aglio, Abraham Bernstein,
Beware of the hierarchy — An analysis of ontology evolution and the materialisation impact for biomedical ontologies,
Journal of Web Semantics,
Volume 70,
2021,
100658,
ISSN 1570-8268,
https://doi.org/10.1016/j.websem.2021.100658.

This summarizes earlier work by Noy et al e.g https://www.researchgate.net/profile/Michel-Klein/publication/2930642_A_Component-Based_Framework_for_Ontology_Evolution/links/0912f50ba15cd1a7e5000000/A-Component-Based-Framework-for-Ontology-Evolution.pdf as well as COntoDiff

One of the major contributions of Pernisch et al is looking at impact of entailed axioms - currently our focus is on structural diffs but we should include this

There is also a lot of great stuff in this paper with metrics for ontology contributions and usages, cc @matentzn

remove bioregistry dependency

ensure grammar is aligned with string_serialization in yaml

The yaml has string serializations

  node rename:
    is_a: node change
    description: >-
      A node change where the name (aka rdfs:label) of the node changes
    slots:
      - old value
      - new value
      - has textual diff      
      - new language
      - old language

    slot_usage:
      old value:
        multivalued: false
      new value:
        multivalued: false
      change description:
        string_serialization: "rename {about node} from {old value} to {new value}"
    examples:
      - value: "rename UBERON:0002398 from 'manus' to 'hand'"
        description: "replacing the rdfs:label of 'manus' on an uberon class with the rdfs:label 'hand'"

but these are not always aligned with the grammar
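
One way to detect such misalignment, sketched under the assumption that every `{slot name}` placeholder can be treated as a lazily matched group: convert the string_serialization template into a regular expression and check that the schema's own examples match it. The `template_to_regex` helper is hypothetical and is not part of LinkML or KGCL.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Turn a string_serialization template into a regex with named groups."""
    parts = re.split(r"\{([^{}]+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text between placeholders
        else:
            group = part.strip().replace(" ", "_")
            pattern += f"(?P<{group}>.+?)"  # slot value, matched lazily
    return re.compile(pattern)


TEMPLATE = "rename {about node} from {old value} to {new value}"
EXAMPLE = "rename UBERON:0002398 from 'manus' to 'hand'"
match = template_to_regex(TEMPLATE).fullmatch(EXAMPLE)
```

The same examples could then be fed to the Lark parser, so that a test fails whenever the YAML and the grammar disagree.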

What is the intended mechanism to deal with quote characters in values?

The textual representation of KGCL needs a way to cope with the possibility that a textual value associated with a change (e.g., the old or new value of a label, definition, synonym, etc.) may contain quote characters.

There are several possible options:

C-style escape sequences, e.g., create related synonym 'poor man\'s synonym' for EX:0001;
alternating between single- and double-quotes as needed (use single-quotes if the value contains a double-quote and vice-versa) – this does not allow a value to contain both single- and double-quotes simultaneously;
Python-style triple-quoting, e.g. create related synonym '''poor man's synonym''' for EX:0001.

Currently, the Python implementation of KGCL seems to opt for triple-quoting when rendering, but the Lark grammar does not allow that for most literal values: synonyms, labels, and definitions are all ultimately expected to be SINGLE_QUOTE_LITERAL, i.e. can only be enclosed in '...'.
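
For comparison, the first option (C-style escape sequences) could be rendered and parsed roughly as follows. This is only a sketch of that option, not what the Python implementation currently does; `quote_literal` and `unquote_literal` are hypothetical helpers.

```python
def quote_literal(value: str) -> str:
    """Render a value as a single-quoted literal with C-style escapes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def unquote_literal(literal: str) -> str:
    """Invert quote_literal for a well-formed single-quoted literal."""
    if len(literal) < 2 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("not a single-quoted literal")
    body = literal[1:-1]
    result = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            result.append(body[i + 1])  # keep the escaped character as-is
            i += 2
        else:
            result.append(body[i])
            i += 1
    return "".join(result)


literal = quote_literal("poor man's synonym")
command = f"create related synonym {literal} for EX:0001"
```

Unlike alternating quote styles, this round-trips values that contain both single and double quotes, at the cost of requiring the grammar's literal terminal to accept backslash escapes.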

parsing rename commands takes a long time

runoak --profile --stacktrace -vv -i robottemplate:templates apply "rename OBI:0002516 from 'brain specimen' to 'brain sample'" -o new_templates

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 58.273 58.273 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/click/core.py:1423(invoke)
1 0.000 0.000 58.273 58.273 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/click/core.py:732(invoke)
1 0.000 0.000 58.273 58.273 /Users/cjm/repos/ontology-access-kit/src/oaklib/cli.py:5875(apply)
1 0.000 0.000 58.227 58.227 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/kgcl_schema/grammar/parser.py:80(parse_statement)
1 0.000 0.000 58.223 58.223 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/kgcl_schema/grammar/parser.py:571(parse_rename)
1 0.000 0.000 58.222 58.222 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/kgcl_schema/grammar/parser.py:622(get_entity_representation)
1 0.000 0.000 58.222 58.222 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/kgcl_schema/grammar/parser.py:645(contract_uri)
1 0.000 0.000 58.222 58.222 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/kgcl_schema/grammar/parser.py:44(get_curie_converter)
1 0.001 0.001 58.222 58.222 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/io/parser.py:28(load_converter)
1 0.001 0.001 54.505 54.505 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/io/parser.py:35(load_multi_context)
3 1.388 0.463 54.381 18.127 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/datamodel/context.py:147(combine)
13989 1.153 0.000 52.993 0.004 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/datamodel/context.py:160(add_prefix)
13989 1.888 0.000 30.358 0.002 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/datamodel/context.py:245(namespaces)
13989 16.339 0.001 28.470 0.002 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/datamodel/context.py:253()
13989 0.302 0.000 21.465 0.002 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/datamodel/context.py:233(prefixes)
13989 14.956 0.001 21.163 0.002 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/datamodel/context.py:241()
192341811 18.342 0.000 18.342 0.000 {method 'lower' of 'str' objects}
1 0.000 0.000 3.716 3.716 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/prefixmaps/datamodel/context.py:335(as_converter)
1 0.000 0.000 3.684 3.684 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/curies/api.py:639(from_extended_prefix_map)
1 0.000 0.000 3.684 3.684 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/curies/api.py:474(init)
1 0.000 0.000 2.692 2.692 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/curies/api.py:373(_get_duplicate_uri_prefixes)
1 2.373 2.373 2.692 2.692 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/curies/api.py:374()
1 0.000 0.000 0.646 0.646 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/curies/api.py:382(_get_duplicate_prefixes)
1 0.386 0.386 0.646 0.646 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/curies/api.py:383()
1 0.000 0.000 0.344 0.344 /Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/pytrie.py:115(init)

modify grammar to remove need for <>s on URIs

kgcl.lark has:

_entity: ID
         | LABEL 
         | CURIE

...

ID :  "<" INNER_ID ">" 

INNER_ID: /((?!>).)*/

This is a bit confusing, as CURIEs are IDs

I think what is meant by ID here is actually IRI, so I would change ID to IRI

We also want to make the <>s optional (and deprecated), so I would change to:

_entity: IRI
         | LABEL 
         | CURIE

...

IRI: quoted_iri | direct_iri
quoted_iri: "<" direct_iri ">"
direct_iri: /((?!>).)*/

You will also need to update:

kgcl/src/kgcl_schema/grammar/parser.py

Lines 501 to 518 in fdcadde

    
    def get_entity_representation(entity):
        """Get entity representation."""
        first_character = entity[0]
        last_character = entity[-1:]
        if first_character == "<" and last_character == ">":
            return entity, "uri"  # not removing brackets (TODO why?)
        if first_character == "'" and last_character == "'" and entity[1] != "'":
            return entity[1:-1], "label"
        if first_character == '"' and last_character == '"':
            return entity[1:-1], "literal"
        if entity[0:3] == '"""' and entity[-3:] == '"""':
            return entity[3:-3], "literal"
        if entity[0:3] == "'''" and entity[-3:] == "'''":
            return entity[3:-3], "literal"
        # TODO: use predefined set of prefixes to identify CURIEs
        return entity, "curie"
        # return entity, "error"
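
A possible shape for the updated classification once brackets become optional, as a sketch only. It assumes that a bare IRI can be recognised by a scheme followed by `://`, which is a simplification, and unlike the current function it strips the brackets from the deprecated form. `classify_entity` and `BARE_IRI` are hypothetical names, and the literal cases are omitted for brevity.

```python
import re

# Assumption: a bare IRI starts with a scheme followed by "://".
# CURIEs such as GO:0005635 do not match this pattern.
BARE_IRI = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://\S+$")


def classify_entity(entity: str):
    """Return (value, kind) where kind is 'uri', 'label', or 'curie'."""
    if entity.startswith("<") and entity.endswith(">"):
        return entity[1:-1], "uri"  # deprecated bracketed form, brackets removed
    if BARE_IRI.match(entity):
        return entity, "uri"
    if len(entity) >= 2 and entity[0] == "'" and entity[-1] == "'":
        return entity[1:-1], "label"
    return entity, "curie"
```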

How to deal with proposed obsoletions (or proposed changes) overall

Currently KGCL has a simple data model where each type represents a change or set of changes. A change object can be thought of as a proposition, and can have metadata added to it.

In some cases, ontologies may want to represent the proposition directly in the ontology without fully enacting the change. This is most prominent in the case of obsoletion, where we may want metadata about a proposed obsoletion to live in the ontology for a period such as a month or two months, where it is queryable. However, we can imagine this for any change type.

An additional challenge here is that the mechanism for representing propositions in an ontology is not as standardized as for example obsoletion (which is itself not as standardized as it could be). E.g in mondo things may go into a mondo-specific "obsoletion_subset".

Some options here:

One is to discourage the notion of storing propositions in the ontology. If you want to query for propositions (such as proposed obsoletions), then query the GitHub issue and PR repository. There is a clear separation of concerns here: the ontology represents the current state of things, and we use infrastructure intended for propositions to store propositions. However, this is not an ideal solution, e.g. do we expect all ontology browsers to implement some complex ingest mechanism?

Another is to create a collection of shadow classes, e.g. ProposedObsoletion, ProposedNodeMove, ... This is fairly awkward though.

Another option is to add a flag to all classes such as "partial: bool". The actual changes applied by an apply agent would vary depending on the setting of this flag. We can even imagine having maturity levels etc.

The simplest option might be that if something is a proposition we simply insert the change object as KGCL triples into the ontology. The ontology simply stores its own change. This may encounter resistance as people might like continuing to use familiar mechanisms such as oio:subset, IAO IDs etc.

What will probably sit best with existing ontologies is if there is a way to customize how an apply command works on an ontology specific basis, perhaps making "partial" a parameter on the application function

Handling of language tags in KGCL

Currently handling of language tags is under-specified in KGCL, both in terms of

matching (e.g. change label from X[@en] to Y)
applying (e.g. change label from X to Y[@en])

Recall also that most OBO ontologies use a mixture of uncommitted literals, xsd:string, and @en to denote English language labels.

As a general principle, the KGCL DSL is intended to be user-friendly. The user shouldn't have to know detailed implementation knowledge about each ontology. In fact it is very hard for them to know these details. As a case in point, for the following two terms in ENVO it's impossible to know from OLS that the first uses an explicit @en and the second does not:

At the most recent OMO meeting there was heated discussion about whether we should expect cardinality=1 of rdfs:label given that some ontologies may want to be international. It's not up to KGCL to adjudicate here. However, we can make things easy for users:

matching should be liberal; if a language tag is not specified this should not be interpreted as "must match untyped literal", it should instead be interpreted as "match this at the string level"
application should be configurable at the ontology level
- if the user does not specify a language tag, and the ontology is configured to always use language tags then the configured default language should be applied
- if the user does specify a language tag then this should be used (it is up to the ontology to configure GH actions to reject any or all language tags if their policy is always untyped literals)

This does place more of a burden on implementors as there needs to be some configuration mechanism, but having this default to untyped literals will work for pretty much all OBO ontologies for now
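
The matching and application rules above can be sketched as two small functions. The `Literal` tuple, `matches`, and `new_literal` are hypothetical names; the behaviour when the user does specify a tag while matching (exact tag comparison) is an assumption, since only the untagged case is described above.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    text: str
    lang: Optional[str] = None


def matches(existing: Literal, text: str, lang: Optional[str] = None) -> bool:
    """Liberal matching: compare languages only if the user gave one."""
    if existing.text != text:
        return False
    return lang is None or existing.lang == lang


def new_literal(text: str, lang: Optional[str], default_lang: Optional[str]) -> Literal:
    """Use the user's tag if given, otherwise the ontology-level default."""
    return Literal(text, lang if lang is not None else default_lang)
```

Passing `default_lang=None` gives the untyped-literal default suggested for OBO ontologies.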

Support arbitrary axiom annotations

For many use cases we need to be able to add provenance on axioms. I have zero ideas about the syntax, here is some pseudo code:

add axiom-annotation sssom:object_label "sarcoma" to { MONDO:123 oboInOwl:hasDbXref "DOID:123"}

Prepare 1.0.0 release

We should make a 1.0.0 release.

There may need to be breaking code changes, so now is the time to pin <1.0.0 in downstream libraries

register KGCL with LOV and BioPortal

As soon as https://w3id.org/kgcl/kgcl.owl.ttl resolves (perma-id/w3id.org#3020) then register KGCL owl artefact with BP, LOV, ...

use more intuitive syntax for class creation

create : "create node" _WS id _WS label ["@" language]
create_class: "create" _WS id

"node" is not user friendly language, and it's not clear why "node" is coupled with a label

I suggest:

create : "create" _WS [node_type _WS] _WS label ["@" language]
node_type : /class|relation|instance|annotation property/

Improve examples

Use the examples from the paper

@hrshdhgd is this the SoT?
https://github.com/INCATools/kgcl/blob/main/src/data/examples/examples.yaml

Manipulating synonym types

Unless I am missing something, KGCL allows manipulating synonym scopes (exact/narrow/broad/related), but it is ignorant of the concept of synonym types (represented in OWL, at least in some ontologies from the OBO world, by oboInOwl#hasSynonymType).

Being able to add types to a synonym would be useful. For example, several OBO ontologies have a policy that abbreviations that may be used to refer to a term should be represented as synonyms with scope related and a type that clearly marks the synonym as an abbreviation.

This could be done with a general syntax to allow arbitrary axiom annotations (as requested in #12), but I think this would be a bit too “low-level” for KGCL. Ideally users should be able to add synonym types without even having to know that such types are represented with an oboInOwl#hasSynonymType under the hood.

Possible syntax:

Specifying the type when adding the synonym in the first place:

create (exact|narrow|broad|related)? synonym {new_value} (with type {new_synonym_type})? for {about_node}

Adding or changing the type of an existing synonym:

change synonym type (from {old_synonym_type})? to {new_synonym_type} for {old_value} on {about_node}
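
To check that the proposed creation syntax is unambiguous, here is a regular-expression sketch of it, covering the four scopes exact, narrow, broad, and related. It is not the Lark grammar; it assumes single-quoted values without embedded quotes, and the type `EX:abbreviation` used in the example is made up.

```python
import re

CREATE_SYNONYM = re.compile(
    r"^create (?:(?P<scope>exact|narrow|broad|related) )?synonym "
    r"'(?P<new_value>[^']*)'"
    r"(?: with type (?P<synonym_type>\S+))? for (?P<about_node>\S+)$"
)


def parse_create_synonym(command: str) -> dict:
    """Parse a create-synonym command into its slots (missing parts are None)."""
    match = CREATE_SYNONYM.match(command)
    if match is None:
        raise ValueError(f"not a create-synonym command: {command!r}")
    return match.groupdict()


parsed = parse_create_synonym(
    "create related synonym 'NE' with type EX:abbreviation for GO:0005635"
)
```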

The `change_date` slot should be date-typed

The change node has a change_date slot, supposedly intended to represent the date when a change was created.

That slot has no explicit range:

slots:
  […]
  change_date:
    slot_uri: dcterms:date
  […]

so it defaults to be string-typed, which doesn’t seem to make much sense. It should have a range of xsd:date (or possibly xsd:dateTime to allow finer granularity than a day) instead.

Related: I’m unsure what to make of the following comment in the slot_usage section:

change_date:
  comments:
  - This should be the composition of 'was generated by' and 'ended at time'

What does that mean? Is that why the slot is not date-typed? Is the slot supposed to contain a string representation of a more complex structure?

What’s the difference between `NodeAnnotationChange` and `NodeMetadataAssertionChange`?

Excerpt of the node change hierarchy:

NodeChange
- NodeAnnotationChange
  - NodeAnnotationReplacement
- NodeMetadataAssertionChange
  - MetadataAssertionPredicateChange
  - MetadataAssertionReplacement
  - NewMetadataAssertion
  - RemoveMetadataAssertion

where NodeAnnotationChange is defined as “A node change where the change alters node properties/annotations. TODO”, while NodeMetadataAssertionChange is defined as “A node change where the metadata assertion (OWL annotations) for that node are altered”.

It’s unclear to me why those two types exist and how they actually differ. Intuitively, I feel that the second type should not exist in KGCL. Changes to annotations of a node should all be represented by NodeAnnotationChange objects – the fact that annotations are represented by OWL annotation assertion axioms if the underlying graph we’re modifying happens to be an OWL ontology should not matter.

Furthermore, in the spec for NodeMetadataAssertionChange, it is said that NodeAnnotationChange is an “alias” for NodeMetadataAssertionChange. This makes sense if we consider that a “metadata assertion” is merely how OWL represents the concept of a node annotation, but then if the two change types are the same thing, why do they have different slots (NodeAnnotationChange has an annotation_property slot that NodeMetadataAssertionChange does not have) and different children?

Consider a replace keyword command

This may be out of scope but wanted to add for discussion.

Frequently we want to replace instances of a token, word, or phrase in lexical ontology elements such as labels, definitions.

E.g

https://wiki.geneontology.org/Ontology_meeting_2024-04-08#Bulk_update_specific_keywords

This may be out of scope as it drifts into NLP territory. How are the boundaries of words or tokens defined? Do we have different tokenizers for chemicals vs biological language? What about bespoke rules that exclude certain tokens from replacement in some contexts?

This may be better handled by a separate tool that generates KGCL rename and redefine commands given an ontology plus some replacement rules.

In fact the logic is very similar to that used for the synonymizer

Formally describe bundled and triggered changes (workflows)

KGCL has the concept of

Simple changes are atomic at the level of KGCL. E.g. kgcl:NodeMove could be broken down into a removed edge and add edge. However, NodeMove is still considered atomic/simple (it just so happens that there are potential rewrite rules for some change types). And of course at the RDF level this may involve many triples but that doesn't concern us here

Currently some KGCL implementations will trigger multiple changes from a single change. For example, in oak running apply on obsoletion will trigger removal of logical axioms and renaming as per (not particularly formally represented) OBO best practice, and similar to existing 'workflows' in Protege. It seems reasonable that we should try and formally represent this common workflow in KGCL.

In fact for obsoletions, the kgcl:Configuration class allows for the specification of an obsoletion policy, implemented as ObsoletionPolicyEnum - but there is no formal connection between the PVs and the logic.

Other ontologies may have more bespoke rules. See for example https://wiki.geneontology.org/Ontology_meeting_2024-04-08#Triggering_multiple_actions???

I don't think it makes sense to represent specific ontology rules in KGCL, but we may want some kind of mechanism for representing rules in general.

Currently the semantics of things like diff are not well defined here. I think when doing a diff we want a way to optionally "roll up" triggered changes. E.g. the renaming of "foo" to "obsolete foo" is not interesting in the way other renames are, same with "edge deletions". Currently oak hardwires a rule that these are ignored in the diff but this is not very satisfactory.

I propose we include a slot "triggers" or "triggered by" that could be used to better represent at the instance level triggered changes. This allows for a separation of concerns. diff calculation could infer these given two ontologies and an obsoletion policy (for example). Diff reporting could simply report these according to user preference.
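
A sketch of how a "triggered by" slot could support rolling up triggered changes in a diff report. The `Change` dataclass here is a simplified stand-in, not the KGCL datamodel, and `triggered_by` is the proposed slot rather than an existing one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Change:
    id: str
    type: str
    about_node: str
    triggered_by: Optional[str] = None  # id of the change that caused this one


def roll_up(changes: List[Change]) -> List[Change]:
    """Keep only changes that were not triggered by another change."""
    return [c for c in changes if c.triggered_by is None]


CHANGES = [
    Change("CHANGE:001", "NodeObsoletion", "EX:1"),
    Change("CHANGE:002", "NodeRename", "EX:1", triggered_by="CHANGE:001"),
    Change("CHANGE:003", "EdgeDeletion", "EX:1", triggered_by="CHANGE:001"),
    Change("CHANGE:004", "NodeRename", "EX:2"),
]
top_level_ids = [c.id for c in roll_up(CHANGES)]
```

With the link recorded at the instance level, the reporting side needs no hardwired knowledge of which renames or edge deletions are uninteresting.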

We may also want to consider a rule language. e.g. to say "if change of type X happens, and the instance x has value v, then trigger change of type Y...". There are many interesting directions to go here but interesting is not necessarily good, we want this to be easy to implement in both oak and java-kgcl. Something like SPARQL construct would be easy to implement but would we run into expressivity issues?

AddNodeToSubset

runoak --stacktrace -i simpleobo:tests/input/go-nucleus.obo apply "add GO:0005635 to subset goslim_agr" -o z
Traceback (most recent call last):
  File "/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/bin/runoak", line 6, in <module>
    sys.exit(main())
  File "/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/cjm/repos/ontology-access-kit/src/oaklib/cli.py", line 5699, in apply
    change = kgcl_parser.parse_statement(command)
  File "/Users/cjm/Library/Caches/pypoetry/virtualenvs/oaklib-OeQZizwE-py3.9/lib/python3.9/site-packages/kgcl_schema/grammar/parser.py", line 137, in parse_statement
    raise NotImplementedError("No implementation for KGCL command: " + command)
NotImplementedError: No implementation for KGCL command: add_to_subset

Why do `EdgeChange` operations have both an `about_edge` and {`subject`, `predicate`, `object`} slots?

The EdgeChange class has an about_edge slot of range Edge, which itself has three slots subject, predicate, and object to represent a typical RDF triple / OWL axiom / graph edge, etc.

But several of the subclasses of EdgeChange have also their own subject, predicate, object fields, in addition to the about_edge that they inherit from EdgeChange. For example EdgeCreation, PlaceUnder, EdgeDeletion…

Why is that so, and what is the intended way of using those classes? For example, is the subject of the edge to be created in EdgeCreation intended to be stored in EdgeCreation.subject or in EdgeCreation.about_edge.subject?

Other subclasses of EdgeChange don’t have their own subject, predicate, object slots and have instead only the about_edge inherited from EdgeChange (for example NodeMove), suggesting that about_edge is the correct way to represent edges and that the “flattened” subject, predicate, and object slots in the other classes might be a mistake.

How strict should a KGCL applicator be?

Has there been any discussion about what behaviour one should expect when trying to apply a KGCL change that does not fully match the existing ontology?

For example, considering the following changeset:

rename EX:1234 from 'alice' to 'bob'
create exact synonym 'robert' for EX:1234
remove definition for EX:5678

What is the expected behaviour if the rdfs:label of EX:1234 is not “alice”?

I can imagine 4 different options:

Relaxed: Apply changes where possible, even where there is a mismatch. In this example, EX:1234 should get the new label “bob” regardless of what its existing label is.
Change-level strict: Reject any change where there is a mismatch, apply all other changes normally. In this example, reject the first change, apply the other two.
Node-level strict: If there is a mismatch in one change, reject all changes that are related to the same node. In this example, reject the first two changes (since both are about EX:1234), apply the last one.
Changeset-level strict: Only apply the changes in a changeset if they all can be applied cleanly, reject the entire changeset for any mismatch in any change. In this example, all three changes would be rejected.
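
The four options can be compared side by side with a small sketch. Everything here is an assumption made for illustration: the ontology is a plain dict, `Change`, `apply_changeset`, `demo`, and the mode names are invented, and none of it reflects an existing KGCL applicator. As a simplification, every change is checked against the initial state of the ontology rather than the state left by the previous change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Ontology = Dict[str, dict]  # toy model: node ID -> properties


@dataclass
class Change:
    node: str
    matches: Callable[[Ontology], bool]  # does the change fit the ontology?
    apply: Callable[[Ontology], None]


def rename(node: str, old: str, new: str) -> Change:
    return Change(node,
                  lambda o: o[node].get("label") == old,
                  lambda o: o[node].update(label=new))


def add_synonym(node: str, synonym: str) -> Change:
    return Change(node,
                  lambda o: node in o,
                  lambda o: o[node].setdefault("synonyms", []).append(synonym))


def remove_definition(node: str) -> Change:
    return Change(node,
                  lambda o: "definition" in o[node],
                  lambda o: o[node].pop("definition", None))


def apply_changeset(onto: Ontology, changes: List[Change], mode: str) -> int:
    """Apply changes in place under the given policy; return how many were rejected."""
    mismatched = [c for c in changes if not c.matches(onto)]
    if mode == "relaxed":
        rejected = []
    elif mode == "change":
        rejected = mismatched
    elif mode == "node":
        bad_nodes = {c.node for c in mismatched}
        rejected = [c for c in changes if c.node in bad_nodes]
    elif mode == "changeset":
        rejected = list(changes) if mismatched else []
    else:
        raise ValueError(f"unknown mode: {mode}")
    rejected_ids = {id(c) for c in rejected}
    for change in changes:
        if id(change) not in rejected_ids:
            change.apply(onto)
    return len(rejected)


def demo(mode: str) -> Tuple[int, Ontology]:
    # The label of EX:1234 is deliberately not 'alice', as in the question above.
    onto = {"EX:1234": {"label": "carol"},
            "EX:5678": {"label": "dave", "definition": "some text"}}
    changes = [rename("EX:1234", "alice", "bob"),
               add_synonym("EX:1234", "robert"),
               remove_definition("EX:5678")]
    return apply_changeset(onto, changes, mode), onto
```

Calling `demo` with "relaxed", "change", "node", and "changeset" rejects 0, 1, 2, and 3 changes respectively, which matches the four outcomes described above.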

Creating a new class with an auto-allocated ID

Currently, users wanting to create a new class using KGCL are expected to know in advance the ID of the class to be created, so that they can issue a create class ID:1234 "label" command.

This is hardly compatible with the intended use of KGCL in bug tracker tickets.

There would be several ways to address the problem.

A. Non-technical solution. Leave KGCL as it is, but expect that ontologies should have an ID range specifically intended for KGCL changes and document that range to users.

Not ideal, as it puts the whole burden of allocating the ID on the users (who must first figure out which range is allocated to KGCL-mediated changes, and then find the lowest unused ID in that range).

This is, in effect, the current situation.

B. Deal with auto IDs at the level of the Ontobot. Leave KGCL as it is. Agree on a special keyword (for example ID:auto) to use in the KGCL DSL syntax, and have Ontobot automatically replace that keyword with a suitable auto-generated ID before actually passing the KGCL data to the KGCL library. It’s up to the Ontobot to figure out how to allocate the ID (probably by parsing the -idranges.owl file, if such a file exists).

C. Similar to B, but at the level of KGCL itself. That is, the KGCL DSL explicitly defines the ID:auto keyword, and KGCL libraries are expected to know that they should automatically allocate an ID when this keyword is used.

I currently think this would be the best solution.

Both B and C would allow a user to do something like this:

create class ID:auto "new label"
add definition "new definition" to ID:auto
create edge ID:auto rdfs:subClassOf EX:1234
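
A rough sketch of the substitution step in option B might look like the following. The function names, the seven-digit padding, and the way the ID range is passed in are all assumptions made for this example; parsing the -idranges.owl file is left out. It also assumes that every ID:auto in one changeset refers to the same new class, as in the example above, which means a changeset could create only one class this way.

```python
import re
from typing import Iterable, List, Set

# Match the keyword only as a standalone token, not as part of a longer CURIE.
AUTO_ID = re.compile(r"(?<![\w:])ID:auto(?![\w:])")


def next_free_id(prefix: str, lower: int, upper: int, used: Set[str],
                 width: int = 7) -> str:
    """Return the lowest unused ID in [lower, upper] (zero-padded, assumed format)."""
    for number in range(lower, upper + 1):
        candidate = f"{prefix}:{number:0{width}d}"
        if candidate not in used:
            return candidate
    raise RuntimeError("ID range exhausted")


def expand_auto_ids(statements: Iterable[str], new_id: str) -> List[str]:
    """Replace every ID:auto token with the same freshly allocated ID."""
    # Simplification: a literal 'ID:auto' inside a quoted label would also be replaced.
    return [AUTO_ID.sub(lambda _match: new_id, s) for s in statements]
```

The expanded statements can then be handed to the KGCL parser unchanged, which is what makes option B possible without touching the DSL.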

D. Add variables to the KGCL DSL. Make it possible to do something like this:

let my_new_class = create class "new label"
add definition "my definition" to my_new_class
create edge my_new_class rdfs:subClassOf EX:1234

Technically speaking the most elegant solution, but I don’t think we want to add such constructs to the KGCL DSL syntax – which is expected to be a simple syntax for mostly non-technical users.

    def get_entity_representation(entity):
        """Get entity representation."""
        first_character = entity[0]
        last_character = entity[-1:]
        if first_character == "<" and last_character == ">":
            return entity, "uri"  # not removing brackets (TODO why?)
        if first_character == "'" and last_character == "'" and entity[1] != "'":
            return entity[1:-1], "label"
        if first_character == '"' and last_character == '"':
            return entity[1:-1], "literal"
        if entity[0:3] == '"""' and entity[-3:] == '"""':
            return entity[3:-3], "literal"
        if entity[0:3] == "'''" and entity[-3:] == "'''":
            return entity[3:-3], "literal"

        # TODO: use predefined set of prefixes to identify CURIEs
        return entity, "curie"
        # return entity, "error"