xigt / xigt

eXtensible Interlinear Glossed Text

Home Page: http://depts.washington.edu/uwcl/xigt

License: MIT License

Python 99.94% Shell 0.06%

xigt's Introduction

Xigt

A framework for eXtensible Interlinear Glossed Text (IGT).

Introduction
Documentation
Installation and Requirements
Features
Acknowledgments

Introduction

The philosophy of Xigt is that IGT data should be simple for the common cases while easily scaling up to accommodate different kinds of annotations. New annotations do not need to alter the original data, but instead can be applied on top of them. Furthermore, Xigt data is meant to be easily processed by computers so that it's easy to inspect, analyze, and modify IGT data.

The Xigt framework includes a data model and XML format as well as a Python API for working with Xigt data.

Here is a small example of an IGT encoded in Xigt's XML format:

<igt id="i1" lg="spa">
  <tier type="words" id="w">
    <item id="w1">cocinas</item>
  </tier>
  <tier type="morphemes" id="m" segmentation="w">
    <item id="m1" segmentation="w1[0:5]"/>  <!-- selects "cocin" -->
    <item id="m2" segmentation="w1[5:7]"/>  <!-- selects "as" -->
  </tier>
  <tier type="glosses" id="g" alignment="m">
    <item id="g1" alignment="m1">cook</item>
    <item id="g2" alignment="m2">2</item>
    <item id="g3" alignment="m2">SG</item>
    <item id="g4" alignment="m2">PRS</item>
    <item id="g5" alignment="m2">IND</item>
  </tier>
</igt>

Installation and Requirements

The Xigt API is coded in Python (targeting Python 3.3+, but it is tested to work with Python 2.7).

Xigt can be installed via pip (see PyPI):

pip install xigt

(You may need to use pip3 to install for Python 3.)

Alternatively, you can get the latest Xigt from the GitHub repository:

git clone https://github.com/xigt/xigt.git

After the cloning has finished, set up your PYTHONPATH environment variable to point to this directory.

The following extra features have their own requirements:

The Toolbox importer: get the toolbox module
The ODIN importer: get odin-utils
The [incr tsdb()] profile exporter: get pyDelphin

For validating Xigt's XML format, I recommend Jing.

Depending on the importer, you may need to configure a config.json for your particular use case, and point Xigt's import function to it using the built-in commands. Templates for the respective configurations are found within the files in xigt/importers.

Note: Xigt is primarily developed and tested on Linux. If you are having trouble installing on Windows, Mac, or some other operating system, please contact me or file an issue report.

Features

Xigt has several features that help enable complex alignments, and these features can be ignored for simpler IGT.

Alignment Expressions

Alignment expressions are an expanded referencing system that allows some data to align to more than one target and, furthermore, to select substrings from the target(s).

Given:

<item id="a1">one</item>
<item id="a2">two</item>

The following alignment expressions will align to the following selections:

a1                  -> "one"
a1,a2               -> "one two"
a1+a2               -> "onetwo"
a1[0:1]             -> "o"
a1[0:1,2:3]         -> "o e"
a1[1:3]+a2[1:2+0:1] -> "newt"

Alignment expressions are specified on reference attributes at the item level.
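
The selections listed above can be reproduced with a small resolver. This is only a sketch under stated assumptions, not Xigt's own implementation: `ITEMS`, `resolve_expression`, and the ID pattern in `_TOKEN` are hypothetical names and choices made for illustration, and the data is a plain dict rather than a tier.

```python
import re

# Hypothetical stand-in for a tier: item id -> item value.
ITEMS = {"a1": "one", "a2": "two"}

# An expression is a sequence of references (an id with optional [spans])
# separated by "," (join with a space) or "+" (join with nothing).
# The ID pattern here is an assumption made for this sketch.
_TOKEN = re.compile(r"([A-Za-z_][\w.-]*)(?:\[([^\]]*)\])?|([,+])")


def _select(value, spans):
    """Apply a span list such as '0:1,2:3' or '1:2+0:1' to a string."""
    out = []
    for part in re.split(r"([,+])", spans):
        if part == ",":
            out.append(" ")
        elif part != "+":
            start, end = part.split(":")
            out.append(value[int(start):int(end)])
    return "".join(out)


def resolve_expression(expression, items):
    """Return the string selected by an alignment expression."""
    out = []
    for match in _TOKEN.finditer(expression):
        item_id, spans, delimiter = match.groups()
        if delimiter == ",":
            out.append(" ")
        elif delimiter is None:
            value = items[item_id]
            out.append(value if spans is None else _select(value, spans))
    return "".join(out)


for expr in ["a1", "a1,a2", "a1+a2", "a1[0:1]", "a1[0:1,2:3]",
             "a1[1:3]+a2[1:2+0:1]"]:
    print(expr, "->", repr(resolve_expression(expr, ITEMS)))
```

The sketch does no error handling: an unknown ID raises KeyError and a malformed span raises ValueError.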

Floating Alignments

When more than one item aligns to the same selection, they are said to be in a "floating alignment". That is, they are ordered (as in the XML), but have no definite subpartitioning among them. For instance, given the following phrase item:

<tier type="phrases" id="p">
  <item id="p1">A dog barks.</item>
</tier>

...and the following word items all aligned to the same phrase item above:

<tier type="words" id="w" alignment="p">
  <item id="w1" alignment="p1">A</item>
  <item id="w2" alignment="p1">dog</item>
  <item id="w3" alignment="p1">barks</item>
</tier>

Xigt will maintain the order ["A", "dog", "barks"] (i.e. not ["dog", "A", "barks"] and so on), but does not specify which substrings each item aligns to. In other words, it is understood that w1, w2, and w3 are contained by p1, in that order, but there are no explicit character alignments. This is useful when one does not want to delimit items exactly (e.g. when dealing with noisy data), or when one cannot delimit the sub-items (e.g. glosses for portmanteau morphemes).
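
As a rough illustration of the ordering guarantee, the sketch below reads the phrase and word tiers with the standard library's ElementTree (not the Xigt API) and groups the word values by their alignment target in document order. The `floating_groups` helper and the enclosing `<igt>` wrapper are assumptions made for this example, and only whole-item alignment targets are handled.

```python
import xml.etree.ElementTree as ET

IGT_XML = """
<igt id="i1">
  <tier type="phrases" id="p">
    <item id="p1">A dog barks.</item>
  </tier>
  <tier type="words" id="w" alignment="p">
    <item id="w1" alignment="p1">A</item>
    <item id="w2" alignment="p1">dog</item>
    <item id="w3" alignment="p1">barks</item>
  </tier>
</igt>
"""


def floating_groups(igt):
    """Map each alignment target to the values aligned to it, in XML order."""
    groups = {}
    for item in igt.iter("item"):
        target = item.get("alignment")
        if target is not None:
            groups.setdefault(target, []).append(item.text)
    return groups


groups = floating_groups(ET.fromstring(IGT_XML))
print(groups)
```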

Referred Values

In Xigt, the only difference between primary data (e.g. phrases or words) and annotations (e.g. glosses or translations) is that annotations are aligned to some other items. The data/annotation-label is called the "value", and this value can either be explicitly given, or refer to some other source. In the latter case, an alignment expression (given by the "segmentation" or "content" reference attribute) is used to select the value.

The benefit of using alignment expressions to select item values is that the data becomes more linked. For instance, it becomes possible to say that not only does a morpheme align to some word, but that its value is a particular substring of that word. A second use for referred values is stand-off annotation, where the data comes from some external source and one wants to encode the relationship between the IGT structure and the original data.

For example, in the above example, rather than aligning w1--w3 to p1 and then explicitly giving the value, one can "segment" the words from the phrase:

<tier type="phrases" id="p">
  <item id="p1">A dog barks.</item>
</tier>
<tier type="words" id="w" segmentation="p">
  <item id="w1" segmentation="p1[0:1]" />  <!-- selects "A" -->
  <item id="w2" segmentation="p1[2:5]" />  <!-- selects "dog" -->
  <item id="w3" segmentation="p1[6:11]" /> <!-- selects "barks" -->
</tier>

Here, items w1, w2, and w3 do not provide their own value, but instead select it via the alignment expression on their "segmentation" attribute.

Also note that an item can specify both a "segmentation" or "content" attribute and explicitly provide a value, in which case the provided value overrides the selected one, but the link remains. This is useful for cleaning up OCR results or showing the underlying form before phonological processes have occurred.
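
The lookup order described here (an explicit value wins, otherwise the referenced span is used) can be sketched with ElementTree. The `item_value` helper is hypothetical, it only understands a single `id[start:end]` reference, and the explicit "Barks" value on w3 is invented to show the override; Xigt's own resolution handles full alignment expressions.

```python
import re
import xml.etree.ElementTree as ET

TIERS_XML = """
<igt id="i1">
  <tier type="phrases" id="p">
    <item id="p1">A dog barks.</item>
  </tier>
  <tier type="words" id="w" segmentation="p">
    <item id="w1" segmentation="p1[0:1]"/>
    <item id="w2" segmentation="p1[2:5]"/>
    <item id="w3" segmentation="p1[6:11]">Barks</item>
  </tier>
</igt>
"""

# Simplification for this sketch: exactly one id with one [start:end] span.
_SPAN = re.compile(r"^(?P<id>[^\[\],+]+)\[(?P<start>\d+):(?P<end>\d+)\]$")


def item_value(item, items_by_id):
    """Prefer an explicit value; otherwise follow a single-span reference."""
    if item.text:
        return item.text
    for attr in ("segmentation", "content"):
        ref = item.get(attr)
        if ref:
            match = _SPAN.match(ref)
            if match is None:
                raise ValueError("unsupported expression: %s" % ref)
            source = item_value(items_by_id[match.group("id")], items_by_id)
            return source[int(match.group("start")):int(match.group("end"))]
    return None


igt = ET.fromstring(TIERS_XML)
items_by_id = {item.get("id"): item for item in igt.iter("item")}
values = [item_value(items_by_id[i], items_by_id) for i in ("w1", "w2", "w3")]
print(values)
```

Item w3 keeps its `segmentation` link to p1[6:11] even though the explicit value is what gets returned.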

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1160274. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This project is also partially supported by the Affectedness project, under the Singapore Ministry of Education Tier 2 grant (grant number MOE2013-T2-1-016).

Xigt was initially developed under the AGGREGATION project (http://depts.washington.edu/uwcl/aggregation/).

xigt's Issues

Reading/Writing to gzipped XML

The XML files can get large, so let's allow reading and writing to gzipped XML as an option. See here: http://stackoverflow.com/questions/13202516/writing-elementtree-compressed-in-gzip-format
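
One way this could work, sketched with the standard library only: the `dump_gz` and `load_gz` helpers below are hypothetical and operate on plain ElementTree elements rather than on Xigt objects, so they show the gzip handling and nothing about how xigtxml would expose the option.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def dump_gz(root, path):
    """Write an ElementTree element to a gzip-compressed XML file."""
    with gzip.open(path, "wb") as f:
        ET.ElementTree(root).write(f, encoding="utf-8", xml_declaration=True)


def load_gz(path):
    """Parse a gzip-compressed XML file and return its root element."""
    with gzip.open(path, "rb") as f:
        return ET.parse(f).getroot()


corpus = ET.fromstring('<xigt-corpus><igt id="i1"/></xigt-corpus>')
path = os.path.join(tempfile.mkdtemp(), "corpus.xml.gz")
dump_gz(corpus, path)
reloaded = load_gz(path)
print(reloaded.tag, [igt.get("id") for igt in reloaded])
```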

XigtPaths with more than one context are faulty

On v1.1.0, a XigtPath that has more than one context condition returns incorrect results. E.g. the following returns a list of tiers, not items:

>>> xigtpath.findall(xc, '//tier[@type="pos" and @alignment="gw"]/item[value()="ADJ"]')

Nested contexts are also a problem:

>>> xigtpath.findall(xc, '/igt[tier[@type="glosses" and item/value()="eat"]]')

Igt method for quick creation of tiers and items with alignments

When we have clean data, it would be convenient to have a built-in method of creating tiers and items for them, instead of always manually constructing them. For instance:

i = Igt(id='i1', ...)
i.make_tiers([
    ('phrases', 'p', ['el perro duerme']),
    ('words', 'w', ['el', 'perro', 'duerme']),
    ('glosses', 'g', ['the.MASC', 'dog.MASC', 'sleep.3.SG.IND.PRES']),
    ('translation', 't', ['the dog sleeps.'])
], options=blah)

I need to consider a useful but robust way of doing this, e.g. when we have multiple levels of annotation. Maybe something like the LaTeX IGT, where multi-tokens are grouped (e.g. in a tuple in Python).

Allow corpus-/IGT-level attributes and metadata from Toolbox headers

Let's say a toolbox file has this:

\field1 blah blah
\field2 yadda yadda

\id 1
\field3 foo bar baz

\ref igt1
[...]

\id 2
\field3 shmreh

\ref igtN
[...]

The data for \field1, \field2, and \field3 should go somewhere. The first two could go as corpus attributes or metadata, but since Xigt doesn't have sections, the third is more troublesome. It could be repeated on all IGTs in that \id section, or we could put the data at the corpus level with an attribute specifying which section it belongs to. Or something else?

ODIN importer uses incorrect metadata and namespaces

xigt.importers.odin uses the older metadata style where children of Meta objects were just strings, and namespaces were just manually-specified attributes.

testing corpora for equality

Corpora and all sub-components need to implement __eq__() methods for equality testing. This can probably be done on the mixin classes.
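
A minimal sketch of the mixin approach, assuming each class lists the fields that matter for equality. `EqualityMixin`, `_eq_fields`, and the toy `Item` and `Tier` classes are hypothetical and much simpler than the real Xigt classes; the sketch targets Python 3, where `__ne__` is derived from `__eq__`.

```python
class EqualityMixin:
    """Compare objects of the same type field by field."""

    _eq_fields = ()  # subclasses name the attributes to compare

    def __eq__(self, other):
        if type(self) is not type(other):
            return NotImplemented
        return all(
            getattr(self, name) == getattr(other, name)
            for name in self._eq_fields
        )


class Item(EqualityMixin):
    _eq_fields = ("id", "text")

    def __init__(self, id, text=None):
        self.id = id
        self.text = text


class Tier(EqualityMixin):
    _eq_fields = ("id", "type", "items")

    def __init__(self, id, type, items=()):
        self.id = id
        self.type = type
        self.items = list(items)  # list equality recurses into Item.__eq__


t1 = Tier("w", "words", [Item("w1", "A"), Item("w2", "dog")])
t2 = Tier("w", "words", [Item("w1", "A"), Item("w2", "dog")])
t3 = Tier("w", "words", [Item("w1", "A"), Item("w2", "cat")])
print(t1 == t2, t1 == t3)
```

Defining `__eq__` without `__hash__` makes these objects unhashable in Python 3, which may or may not be acceptable for the real classes.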

Add methods for object deletion, replacement, etc.

There are currently only methods for adding new objects to collections (append(), extend(), insert()) or for deleting all of them (clear()). What is lacking are methods for deleting individual objects, or for setting them to some new value. Python has a remove() method for lists that removes the first matching object, and a pop() method that removes the last object, or the object at some index if given. JavaScript has pop() and shift(), which remove objects from the end and start of an Array, respectively. Removing an object in the middle of the list is more complicated: ary.splice(startPos, 1). Removing an object by value is even more complicated: ary.splice(ary.indexOf(obj), 1). However, the splice() method is nice in that you can remove any number of contiguous objects. For removing objects, the more Pythonic remove() seems like a good fit, although its semantics could be more like the get() method:

remove(x: int) = remove object at index x
remove(x: str) = remove object with id x

Neither of these is like list.remove(), because neither takes the object to remove as its argument, but doing so might not be a bad idea, either.

Methods for setting (as opposed to adding) objects could be useful. Lists can set objects by index (some_list[3] = new_val), and dicts can set values by key (some_dict[key] = new_val), so maybe just defining a __setitem__() function would be good.
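
The proposed semantics could look roughly like the following. `Container` is a hypothetical stand-in for Xigt's collection mixins, not the actual class, and objects are assumed to expose an `id` attribute.

```python
from types import SimpleNamespace


class Container:
    """Ordered collection of objects that each carry an `id` attribute."""

    def __init__(self, objects=()):
        self._list = list(objects)

    def _index(self, key):
        # An int is a position; anything else is looked up as an id.
        if isinstance(key, int):
            return key
        for i, obj in enumerate(self._list):
            if obj.id == key:
                return i
        raise KeyError(key)

    def remove(self, key):
        """Remove and return the object at index `key` or with id `key`."""
        return self._list.pop(self._index(key))

    def __getitem__(self, key):
        return self._list[self._index(key)]

    def __setitem__(self, key, obj):
        self._list[self._index(key)] = obj

    def __len__(self):
        return len(self._list)


c = Container(SimpleNamespace(id=i) for i in ("i1", "i2", "i3"))
removed_by_id = c.remove("i2")
removed_by_index = c.remove(0)
c["i3"] = SimpleNamespace(id="i4")
print(removed_by_id.id, removed_by_index.id, len(c), c[0].id)
```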

Allow language word ('w') lookup by gloss item

Currently, if I want to use word-level alignment in Gloss and Language lines, it seems necessary to iterate through morpheme tier in a case like kor-ex.xml. I.e., I first need to find all the morphemes that correspond to a word, then all the gloss items that correspond to these morphemes. It would be convenient to be able to look up a word-level gloss item by word item id, and vice versa.

Toolbox importer cannot specify the encoding

The toolbox module can specify the encoding through the call to open(), but there is currently no way to pass through an encoding when using the Xigt importer.

Thanks to @lingdoc for the bug report.

Metadata is not serializing properly

@rgeorgi noticed that Metadata was not serializing correctly for the new 1.0 API. He writes that using this when constructing a corpus...

wt.metadata = Metadata(type='xigt-meta')
wt.metadata.add(Meta(text='test'))

results in this for the metadata...

<metadata>test</metadata>

The metadata was not enumerating the meta elements under metadata. This is a small problem, but points to a bigger problem that may require a small API change for Meta objects (because of what they should be able to contain).

xigtxml namespaces get pushed to elements

If we have a namespace declared, as in the following metadata:

<metadata xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <meta id="md1">
    <dc:subject xsi:type="olac:language" olac:code="ko"/>
    <dc:language xsi:type="olac:language" olac:code="en"/>
  </meta>
</metadata>

The namespaces get pushed from their declarations to the elements on reading. This changes the names of the attributes, and is more noticeable when serializing with xigtxml:

<metadata>
  <meta id="md1">
      <dc:subject xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ns2="http://www.language-archives.org/OLAC/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ns2:code="ko" xsi:type="olac:language" />
      <dc:language xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ns2="http://www.language-archives.org/OLAC/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ns2:code="en" xsi:type="olac:language" />
  </meta>
</metadata>

This is actually a documented feature of ElementTree: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

It would be better if these could be handled appropriately without expanding the prefixes, but it means the namespaces will need to be stored in the Xigt-internal structures somewhere, making it somewhat more XML-specific.

codec print method should include xml declaration

When printing a xigt corpus, there is no xml declaration printed. Should this be included (optionally) to allow easier output to a well-formed xml file?

xigtxml.loads() is broken

The namespace support was overlooked for the loads() function, so it crashes looking for an nsmap attribute.

Decouple get-by-ID and get-by-index

The methods for retrieving objects can either take an ID or an index:

>>> xc = XigtCorpus(igts=[Igt(id='i1'), Igt(id='i2')])
>>> xc[0] == xc['i1']
True

The index can even be a string, since integer-only IDs are not allowed:

>>> xc[0] == xc['0']
True

This can be convenient, but is also possibly confusing and the required logic hurts performance. Review existing Xigt apps or implementations and consider removing this feature, instead having the bracketed notation for list-access only (by index), and the get() method for dict-access only (by ID).

Finalize schema for alignment/content-selection/segmentation use cases

The current configuration for -ref attributes is inadequate, or confusing. There are two modes of reference:

Annotation alignment
Content selection/inheritance

And thus three methods of reference:

Simultaneously align and select content (the default for ref; useful for simple segmentation (e.g. phrases to words, words to morphemes))
Only align annotations (when content-ref is set to something other than ref)
Only select content (when annotation-ref is set to something other than ref)

The attributes content-ref and annotation-ref may be scoped at the xigt-corpus, igt, and tier levels, affecting all items under their definitions. This means they can be freely changed, and users might need to hunt to find out what setting currently applies for the current item.

This situation is less than ideal, so here is one alternative:

ref is no longer used for alignment expressions (which has the positive side effect of reducing confusion with the standard IDREF attributes of XML)
the content-ref and annotation-ref attributes are removed
annref is only used for aligning annotations
cntref is only used for selecting content
segref is used to align to AND select content (for segmentation)
segref cannot co-occur with annref or cntref, but annref and cntref may co-occur with each other
content may still be overridden for cntref and segref, such that the content-selection alignment shows where the content came from, but the overridden value provides the (cleaned, recovered, etc.) data

Under this alternative, part of the Korean ODIN example would look like this:

<tier type="odin-txt" id="o">
  ...
  <item id="o2" line="959" tag="L">   1 Nay-ka ai-eykey pap-ul mek-i-ess-ta</item>
  <item id="o3" line="960" tag="G">     I-Nom child-Dat rice-Acc eat-Caus-Pst-Dec</item>
  ...
</tier>
<tier type="phrases" id="p" cntref="o">
  <item id="p0" cntref="o2[5:40]"/>
</tier>
<tier type="words" id="w" segref="p">
  <item id="w0" segref="p0[0:6]"/>
  ...
</tier>
<tier type="morphemes" id="m" segref="w">
  <item id="w0.m0" segref="w0[0:3]"/>
  <item id="w0.m1" segref="w0[4:6]"/>
  ...
</tier>
<tier type="glosses" id="g" annref="m" cntref="o">
      <item id="w0.g0" annref="w0.m0" cntref="o3[5:6]"/>
      <item id="w0.g1" annref="w0.m1" cntref="o3[7:10]"/>
     ...
</tier>

Well-formedness checks

The RelaxNG schema can check the structural well-formedness of an XML encoded XigtCorpus, but it does not check that the content is valid. We need some checks to ensure:

The Item-reference constraint (items only refer to other items in the tier their own tier refers to)
That every item referred to has a valid, unique ID
That an alignment expression selects a valid subset of content (including inherited content)
That every referred-to ID is instantiated on a tier/item/etc.
others?
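
Two of these checks (unique item IDs within a tier, and the item-reference constraint) might be sketched as below. This works on ElementTree elements rather than Xigt objects; `check_igt`, `REF_ATTRS`, and the ID pattern are assumptions for illustration, and the g2 item is deliberately wrong so that the check has something to report.

```python
import re
import xml.etree.ElementTree as ET

REF_ATTRS = ("alignment", "segmentation", "content")  # assumed reference attributes
_ID_IN_EXPR = re.compile(r"[A-Za-z_][\w.-]*")  # assumed ID pattern

IGT_XML = """
<igt id="i1">
  <tier type="words" id="w">
    <item id="w1">cocinas</item>
  </tier>
  <tier type="morphemes" id="m" segmentation="w">
    <item id="m1" segmentation="w1[0:5]"/>
    <item id="m2" segmentation="w1[5:7]"/>
  </tier>
  <tier type="glosses" id="g" alignment="m">
    <item id="g1" alignment="m1">cook</item>
    <item id="g2" alignment="w1">2</item>
  </tier>
</igt>
"""


def check_igt(igt):
    """Return a list of human-readable problems found in one <igt> element."""
    problems = []
    tiers = {tier.get("id"): tier for tier in igt.iter("tier")}
    for tier in igt.iter("tier"):
        seen = set()
        for item in tier.iter("item"):
            item_id = item.get("id")
            if item_id is None:
                problems.append("tier %s: item without id" % tier.get("id"))
            elif item_id in seen:
                problems.append("tier %s: duplicate id %s" % (tier.get("id"), item_id))
            seen.add(item_id)
            for attr in REF_ATTRS:
                expression = item.get(attr)
                if expression is None:
                    continue
                # Items may only refer to items on the tier their tier refers to.
                target = tiers.get(tier.get(attr))
                allowed = set() if target is None else {
                    i.get("id") for i in target.iter("item")}
                for ref_id in _ID_IN_EXPR.findall(expression):
                    if ref_id not in allowed:
                        problems.append("item %s: %s refers to %s, which is not "
                                        "on the referred tier" % (item_id, attr, ref_id))
    return problems


problems = check_igt(ET.fromstring(IGT_XML))
for problem in problems:
    print(problem)
```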

XigtCorpus creation collapses items with the same id upon corpus creation

When creating a XigtCorpus from a list of Igts, when the Igts have the same id they are collapsed into a single item:

Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import xigt.core as xigt
>>> s = xigt.Tier(type="sentence", id="s", items=[xigt.Item(id="s1", content="This is a sentence.")])
>>> i1 = xigt.Igt(id="i1", tiers=[s])
>>> s = xigt.Tier(type="sentence", id="s", items=[xigt.Item(id="s1", content="This is another sentence.")])
>>> i2 = xigt.Igt(id="i2", tiers=[s])
>>> item_list = [i1, i2]
>>> xc = xigt.XigtCorpus(igts=item_list)
>>> xc.igts
[<xigt.core.Igt object at 0x7fc757851050>, <xigt.core.Igt object at 0x7fc757851110>]
>>> item_list
[<xigt.core.Igt object at 0x7fc757851050>, <xigt.core.Igt object at 0x7fc757851110>]
>>> i1 = xigt.Igt(id="i1", tiers=[s])
>>> i2 = xigt.Igt(id="i1", tiers=[s])
>>> item_list = [i1, i2]
>>> xc = xigt.XigtCorpus(igts=item_list)
>>> xc.igts
[<xigt.core.Igt object at 0x7fc7578511d0>]

Percolate common attributes or metadata

When a XigtCorpus is created in a stream, some attributes or metadata may get repeated on multiple IGTs when they could be pushed up to the XigtCorpus level. For example, a Min Nan corpus may have language metadata specified as "Taiwanese", "Chaozhou", "Amoy", etc., but the iso-639-3 metadata would always be "nan". The iso-639-3 metadata could be pushed up to the corpus, rather than be redundantly specified on each item. I see two options here:

Push up attributes or metadata only when they are common to all sub-elements
Push up the most common attribute/metadata, and override where it is different

In general this would require the whole corpus to be stored in memory. A later addition could allow this to be done as a two-pass process (first tally attributes/metadata on items, then move them around on the second pass).
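
The first option (push up only what every IGT shares) can be sketched over plain dicts. `percolate_common` is a hypothetical helper, and attributes are modeled as dicts rather than as Xigt objects; like the in-memory approach mentioned above, it needs all IGTs at once.

```python
def percolate_common(corpus_attrs, igt_attrs_list):
    """Move attributes shared (same key and value) by every IGT to the corpus."""
    if not igt_attrs_list:
        return dict(corpus_attrs), []
    common = dict(igt_attrs_list[0])
    for attrs in igt_attrs_list[1:]:
        common = {k: v for k, v in common.items()
                  if k in attrs and attrs[k] == v}
    new_corpus = dict(corpus_attrs)
    new_corpus.update(common)
    new_igts = [{k: v for k, v in attrs.items() if k not in common}
                for attrs in igt_attrs_list]
    return new_corpus, new_igts


corpus, igts = percolate_common(
    {},
    [{"language": "Taiwanese", "iso-639-3": "nan"},
     {"language": "Chaozhou", "iso-639-3": "nan"},
     {"language": "Amoy", "iso-639-3": "nan"}],
)
print(corpus)
print(igts)
```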

Metadata uses "text" for iterables?

Whereas "text" is used elsewhere for the text nodes, the Metadata class appears to use "text" for its children; this seems inconsistent.

Cycle detection

A file like this results in a recursive cycle when resolving the item's value:

<xigt-corpus>
    <igt>
        <tier type="words" id="a" segmentation="a">
            <item id="a1" segmentation="a1"/>
        </tier>
    </igt>
</xigt-corpus>

Currently only query.descendants() has cycle detection. Make sure the other alignment expression resolution functions do, too.
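
A common way to guard recursive resolution is to track the IDs currently being resolved. The sketch below is not the logic in query.descendants(); `resolve_value`, `CycleError`, and the `(text, ref_id)` item model are assumptions, and references are simplified to whole items without spans.

```python
class CycleError(Exception):
    """Raised when an item's value depends on itself."""


def resolve_value(item_id, items, _active=None):
    """Resolve a value from `items`, a dict of id -> (text, ref_id).

    Explicit text wins; otherwise the referenced item's value is used.
    """
    _active = set() if _active is None else _active
    if item_id in _active:
        raise CycleError("cycle detected at item %s" % item_id)
    text, ref_id = items[item_id]
    if text is not None:
        return text
    if ref_id is None:
        return None
    _active.add(item_id)
    try:
        return resolve_value(ref_id, items, _active)
    finally:
        _active.discard(item_id)


good = {"p1": ("A dog barks.", None), "w1": (None, "p1")}
bad = {"a1": (None, "a1")}  # mirrors the self-referencing item above

print(resolve_value("w1", good))
try:
    resolve_value("a1", bad)
except CycleError as exc:
    print("error:", exc)
```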

get_aligment_expression_ids

In the move to 1.0, there's a misspelling in this function:

get_aligment_expression_ids
get_aligNment_expression_ids

(in ref.py)

Deepcopy Implementation for XigtCorpus?

There are occasionally times when it would be convenient to initialize a base XigtCorpus model, and make several copies to manipulate in different ways, e.g. for unit testing. However, copy.deepcopy() seems to still leave some pointers with regards to the ID mapping? It would be nice (and likely more efficient) to have some kind of xc.copy() method.

Missing Imports in metadata.py

Looks like in moving metadata to its own module, a few imports got missed.

Add:
import warnings
from xigt.errors import XigtError

And looks like _has_parent is also missing from mixins?

True standoff annotation

Xigt was made with pseudo-standoff and standalone annotation in mind, but true standoff should be possible. Provide support for resolving standoff content links and other relevant concerns.

Custom delimiters per tier

Users can extend Xigt by redefining the delimiters used in Alignment Expressions, but currently there is no way to redefine a delimiter for specific tiers. This would be useful (if not necessary) when an IGT has different kinds of data each requiring their own delimiter definitions (e.g. audio and text).

Check that XC is a Corpus in xigtxml.dump()

I confused myself recently by trying to dump an Igt instance with xigtxml.dump() and receiving an exception about the Igt instance not having a metadata attribute.

Eventually, I realized this was because I was trying to dump an instance rather than a corpus, but this might be something that would benefit from an explicit type-check.

Autogenerate `id` on objects when not given

When adding new elements to some container, it may be best to just auto-generate IDs, rather than just requiring IDs for some kinds of objects. If an ID is given, it will be checked for uniqueness. Otherwise, it will be auto-generated following to-be-coded conventions (mostly described here: https://github.com/xigt/xigt/wiki/Conventions#id-naming)

Old issue text:

Igt objects can be "anonymous" by not specifying an id attribute, but this is probably not a good idea, since some applications may need Igt IDs to work.

I still think it is not the case that XigtCorpora need IDs specified, and perhaps nor do Tiers or Items as long as they aren't involved in reference relationships, although really it's best if all things specify IDs.

Deprecate and rename unintentionally public functions/methods

Some functions are currently public, but there is no reason for them to be, and keeping them public creates a maintenance burden. They should be renamed to have a leading underscore (which indicates they are private, or not supported), but the old name should remain accessible until the next major version (v2.0.0).

This issue is used to list objects that need to be renamed. Currently that list includes:

Also consider deprecating the redefinable nature of the functions in xigt.codecs.xigtxml; it's not used

Additionally there are already-deprecated objects that can be removed at v2.0.0:

Feature Request: resolve_objects

Instead of the resolve() function in ref.py, it would be handy to have a resolve_objects() or some such function that would return a list of tokens with start/stop indices that an expression references rather than the string representation thereof. (I am working on a solution to Issue #19, and this would help)

Toolbox importer fails for some undeclared tiers

It throws a cryptic message about NoneType not being iterable.

Use SAX-like XML parsing for streamability

Even though the data model of Xigt is meant for streamability, the XML format is currently not parsed that way. Make it so.

ElementTree's iterparse() method is probably the way to do it. A bonus is that we might not have to use lxml for pretty printing, and can then remove the dependency.
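
A rough sketch of the iterparse() approach, using only the standard library: `iter_igts` is a hypothetical helper that yields raw `<igt>` elements, and turning those elements into Xigt objects is left out.

```python
import io
import xml.etree.ElementTree as ET


def iter_igts(source):
    """Yield each <igt> element as soon as it has been fully parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "igt":
            yield elem
            # Runs when the caller asks for the next IGT: drop this element's
            # children, text, and attributes to keep memory use low.
            elem.clear()


xml_bytes = b'<xigt-corpus><igt id="i1"/><igt id="i2"/></xigt-corpus>'
ids = [igt.get("id") for igt in iter_igts(io.BytesIO(xml_bytes))]
print(ids)
```

Because each element is cleared once the generator resumes, callers need to take what they want from an IGT before requesting the next one.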

Unittests for core API

Provide unittests for the following:

Toolbox importer fails when fields appear with no content

The toolbox module returns None for fields with no content (search for "If no space": https://github.com/goodmami/toolbox/blob/master/tests.md), but line 235 of xigt/importers/toolbox.py does not expect this, leading to a "NoneType is not iterable" error.

Consider publishing your code in JOSS!

Hi XIGT people! Consider whether you want to release this library in the Journal of Open Source Software! Getting code published in a cite-able form might help you elsewhere as you go along.

Support for extensions

The Xigt schema allows extensions. E.g.:

include "xigt.rnc" {
Tier |= grammar { include "xigt-tier.rnc" {
    Tier.type = "syntax"
    Item = grammar { include "xigt-item.rnc" }
         | grammar { include "xigt-item.rnc" {
        Item.ref = notAllowed
        Item.content = attribute synref { AlgnExpr }, text
    } }
} }
}

It should be possible for the code to enable such extensions. Minimally, this would:

allow the serialization/deserialization of XML with extensions (by default)

But it should also do the following, and possibly more:

allow custom serialization/deserialization hooks for other formats
work with the access methods in code. For instance, alignment expression resolution should work with extensions with new alignment fields. This will require some intelligent pairing of custom refs on Items and on Tiers.

Reference attribute reassignment when a segmenting tier is removed

Some tiers are "intermediary" in the sense that their referrers can instead align to their referents in the absence of the intermediary tiers. This is possibly only the case with segmentation tiers. For example, if glosses annotate morphemes, but then the morphemes tier is removed, the glosses can be reassigned to the words that the morphemes had aligned to. There are probably some complications with this, but the API could provide a remove_tier_with_reassignment() kind of function to help here.

Update documentation

Several parts of the wiki, from the Schemata (e.g. extensions) to the API (e.g. metadata) are out of sync with the 1.0 release of the code. Bring these in line with the rest.

importers/toolbox.py and exporters/itsdb.py don't work with Python2

There is a syntax error and an import error.

coreferents() function to find tiers/items aligning to the same thing

In some cases, multiple kinds of annotations can align to the same thing, but not transitively. That is, instead of A->B->C, we may have A->C and B->C. In these cases, it would be useful to be able to find what items on A and B refer to the same things in C.

Maybe something like:

>>> query.coreferents(tier_a, tier_b, 'alignment')
[(<Item (id: a1) at ...>, <Item (id: b1) at ...>, [<Item (id: c1) at ...>]), ...]
>>> query.coreferrers(tier_c, 'alignment')
[(<Item (id: c1) at ...>, [<Item (id: a1) at ...>, ...], [<Item (id: b1) at ...>, ...]), ...]

Remove support through Python 3.5, add support for 3.6 through 3.9

Python 3.3 is no longer supported upstream, so it should no longer be tested. The 3.6 and 3.7 versions are now available, so they should be tested.

Deprecate xigt_process

The xigt process command is broken and only half implemented. Its partitioning subcommand ("divide") should be reworked into a separate xigt partition command, and xigt process should be deprecated (as "process" is too general). The tier-splitting and merging sub-commands can be dropped.

Default and overridable attributes

With a schema extension, attributes may be placed on XigtCorpus, Igt, Tier, or Item objects, or may be placed on more than one level at the same time. When an attribute is placed on a higher-level object (e.g. an Igt), it is to be interpreted as also occurring (by default) on its lower, contained objects. The lower-level objects can override this default by giving an explicit value for this attribute.
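
The lookup rule described here amounts to walking up the containment chain until some level defines the attribute. A minimal sketch, assuming a hypothetical `Node` class with a `parent` link and a `get_attribute` method (neither is taken from the Xigt API):

```python
class Node:
    """Stand-in for XigtCorpus, Igt, Tier, or Item in this sketch."""

    def __init__(self, attributes=None, parent=None):
        self.attributes = dict(attributes or {})
        self.parent = parent

    def get_attribute(self, key, default=None, inherit=True):
        """Return the nearest value for `key`, looking upward if allowed."""
        node = self
        while node is not None:
            if key in node.attributes:
                return node.attributes[key]
            node = node.parent if inherit else None
        return default


corpus = Node({"unit": "millisecond"})
igt = Node(parent=corpus)
tier = Node({"unit": "second"}, parent=igt)  # overrides the corpus default
item = Node(parent=tier)
other_item = Node(parent=Node(parent=igt))   # no override on the way up

print(item.get_attribute("unit"))
print(other_item.get_attribute("unit"))
print(item.get_attribute("unit", inherit=False))
```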

xigt module missing version string

A __version__ string should be added to xigt/__init__.py. It should become 1.1.0 for the next release. The 1.0 release did not include a version string, and commits have been made since then, so labeling the current code 1.0.0 would no longer be accurate; let's use 1.0.1 for the first one.

After some research, it seems a good method is to put it in a separate module (e.g. version.py) then read (NOT import) it from setup.py, but import it from __init__.py.

See here and here and here
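
The "read, don't import" idea for setup.py is usually done with a regular expression. The sketch below assumes a `version.py` containing a single `__version__ = '...'` line; `read_version` is a hypothetical helper, and a temporary file stands in for the real module.

```python
import os
import re
import tempfile


def read_version(path):
    """Extract __version__ from a Python file without importing it."""
    with open(path, encoding="utf-8") as f:
        match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
                          f.read(), re.M)
    if match is None:
        raise RuntimeError("no __version__ string found in %s" % path)
    return match.group(1)


# Demonstration with a temporary stand-in for xigt/version.py.
path = os.path.join(tempfile.mkdtemp(), "version.py")
with open(path, "w", encoding="utf-8") as f:
    f.write("__version__ = '1.0.1'\n")
print(read_version(path))
```

Reading the file avoids importing the package (and its dependencies) at install time, while `__init__.py` can still do a normal import of the same module.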

Extend XigtPath with sibling axes

XigtPath should have support for the following-sibling and preceding-sibling axes (which don't have short-forms, so we'd have to support the /axis:: format). We should then also include position functions like position() and last() and maybe simple arithmetic.

//item/following-sibling::item
//item/following-sibling::item[last()]
//item/preceding-sibling::*[0]
//item/preceding-sibling::*[position()=last()-1]

Examples are out-of-date

Revise the Korean examples (and maybe the Abkhaz ones) to be up-to-date with the current conventions.

XPath-like query language for Xigt data

XPath itself works on the XigtXML, but it doesn't resolve references, nor does it work for potential non-XML serializations of Xigt data. We need functionality like this:

So far this is the same as XPath (except for the value() function). But resolving expressions will require a new axis specifier or two (or more). These are only tentative:

//items[../@type="glosses"]/> "find objects referenced by items in a glosses tier (by default using either the alignment or segmentation attribute)"
//items[../@type="glosses"]/>@alignment "find objects referenced by items in a glosses tier (via the alignment attribute)"
//items[../@type="words"]/< "find objects referring to items in a words tier (by default using either the alignment or segmentation attribute)"
//items[../@type="words"]/<@segmentation "find objects referring to items in a words tier (via the segmentation attribute)"

Toolbox markers without content don't need to be followed by a space

There is a mandatory space following toolbox markers, but if there's no content for the marker it can just be the end of the line. E.g., in the following, \b should not be considered a wrapped part of the \a line.

\a blah
\b
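
The intended behavior can be expressed as a line pattern in which the whitespace and content after a marker are optional as a unit. This is a sketch, not the toolbox module's parser: `parse_toolbox_lines` and `_MARKER` are hypothetical, and wrapped lines are simply joined with a space.

```python
import re

# A marker is a backslash plus non-space characters, followed by either
# whitespace and content or the end of the line.
_MARKER = re.compile(r"^\\(\S+)(?:\s+(.*))?$")


def parse_toolbox_lines(lines):
    """Return (marker, content) pairs; unmarked lines wrap onto the last field."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        match = _MARKER.match(line)
        if match:
            fields.append((match.group(1), match.group(2)))
        elif fields and line.strip():
            marker, content = fields[-1]
            extra = line.strip()
            fields[-1] = (marker, extra if content is None else content + " " + extra)
    return fields


print(parse_toolbox_lines(["\\a blah", "\\b"]))
print(parse_toolbox_lines(["\\a blah", "  wrapped", "\\b"]))
```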

Allow specification of units for alignment expressions

Alignment expressions currently implicitly select character spans, but we may want to select from audio data as well. Think of a way to allow the specification of units. As we can have multiple alignment expressions on the same item it might make more sense to put the units on the target item. For example:

<tier type="audio" id="a">
  <item type="wavfile" id="a1" bitrate="22.05KHz" unit="millisecond" filename="abc.wav"/>
</tier>
<tier type="phrase" id="p" ref="a">
  <item id="p1" ref="a1[1173:6808]">This is a uh... transcription.</item>
</tier>

Selecting audio data means we may want to allow floating point numbers in the alignment expressions (e.g. to select 5.635 (a1[1.173:6.808]) seconds).

Also note that the attributes like bitrate and unit might be inherited from the tier, igt, or xigt-corpus, once attribute inheritance is implemented. Then it wouldn't need to be specified for every item, if there were multiple wav files to use.

Log When adding invalid ID

It seems that the "algnexpr_re" requires ids to begin with a letter, but this is not checked when elements are added, resulting in cases where the xml may look well-formatted, but the alignment resolution expressions do not function.

Repro:

<tier id="_e18c11e463694b21b8ae182a7da20f8c" type="translation-words" segmentation="r">

<item id="_858f369b26e7408687e27132b5f1845d" segmentation="_0b86bfeeb5264dca9fdfe78eb9a74458[24:31]"/>

</tier>

and the targeted item is:

<tier id="r" type="odin" state="raw">
<item id="_0b86bfeeb5264dca9fdfe78eb9a74458" tag="T+DB"> 'I read her article.' 'I read one of her articles.'</item>
</tier>
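
A check at insertion time could be as small as the sketch below. The pattern in `_VALID_ID` is an assumption (a letter first, then letters, digits, and a few symbols), not a copy of Xigt's algnexpr_re, and `check_id` and the logger name are hypothetical.

```python
import logging
import re

# Assumed ID pattern for this sketch; the pattern Xigt actually uses may differ.
_VALID_ID = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*$")

log = logging.getLogger("xigt-sketch")


def check_id(obj_id):
    """Return True if obj_id looks referable; log a warning otherwise."""
    if obj_id is not None and _VALID_ID.match(obj_id):
        return True
    log.warning("ID %r cannot be used in alignment expressions", obj_id)
    return False


print(check_id("w1"))
print(check_id("_858f369b26e7408687e27132b5f1845d"))
```

Logging rather than raising keeps existing data loadable, at the cost of letting unreferable IDs into the corpus.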

Minor Issue with Meta XML Rendering

When serializing metadata in XML, using the text= attribute on the Meta item, it seems that the meta items get rendered like:

    <metadata type="xigt-meta">
      <meta type="data-source">intent
      </meta>
      <meta type="data-method">mgiza
      </meta>
    </metadata>

With newlines before the close tag. It would be nice instead to have something like:

    <metadata type="xigt-meta">
      <meta type="data-source">intent</meta>
      <meta type="data-method">mgiza</meta>
    </metadata>