cltl / naf-4-development

License: Apache License 2.0

Python 100.00%

naf-4-development's Issues

Character offsets for representing discourse units (paragraphs, headers, etc.)

We are also interested in representing paragraphs and headers for Clariah+, but I see that the current idea is to define discourse units as spans of tokens. I would argue for using character offsets instead, because these discourse units can already be present before a tokenizer comes into play.

Our input documents are in TEI format, where these kinds of discourse units are already annotated, and we want to preserve their identifiers. Conversion from TEI would then generate a NAF file with a raw-text layer and a discourse-units layer. In our current pipeline, the tokenizer is called only afterwards, on each paragraph independently.
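To make the proposal concrete, here is a minimal sketch of how discourse units could be recorded as character offsets into the raw text before any tokenizer runs. The `DiscourseUnit` class, its field names, the separator and the sample blocks are all hypothetical; nothing here is taken from the NAF specification or from our actual converter.

```python
from dataclasses import dataclass


@dataclass
class DiscourseUnit:
    unit_id: str    # identifier carried over from the source document (e.g. a TEI id)
    unit_type: str  # e.g. "head" or "p"
    offset: int     # character offset of the unit's first character in the raw text
    length: int     # number of characters in the unit


def build_raw_and_units(blocks, separator="\n\n"):
    """Concatenate text blocks into one raw text and record each block's span."""
    parts = []
    units = []
    position = 0
    for index, (unit_id, unit_type, text) in enumerate(blocks):
        if index > 0:
            # The separator belongs to the raw text but to no discourse unit.
            parts.append(separator)
            position += len(separator)
        units.append(DiscourseUnit(unit_id, unit_type, position, len(text)))
        parts.append(text)
        position += len(text)
    return "".join(parts), units


# Hypothetical input: (identifier, type, text) triples as they might come out of TEI.
blocks = [
    ("h1", "head", "A title"),
    ("p1", "p", "First paragraph."),
    ("p2", "p", "Second paragraph."),
]
raw, units = build_raw_and_units(blocks)

# Each unit can be recovered from the raw text alone, so the tokenizer can be
# run later on raw[offset:offset + length] for each paragraph independently.
for unit in units:
    print(unit.unit_id, repr(raw[unit.offset:unit.offset + unit.length]))
```

Because the spans refer only to the raw text, the original TEI identifiers survive regardless of how (or whether) the text is tokenized later.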

document metadata

What is the recommended way to store document metadata in NAF, for instance the file name and/or URI of the document being processed?

Author, publication date, creation date, etc. would also be useful to record.
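As an illustration of the kind of thing I have in mind, the sketch below writes such metadata into a header element with Python's standard `xml.etree.ElementTree`. The element and attribute names (`nafHeader`, `fileDesc`, `public`, `filename`, `author`, `creationtime`, `uri`) are my assumptions about what a header could look like and should be checked against the NAF specification; the file name and URI are made up.

```python
import xml.etree.ElementTree as ET


def add_document_metadata(naf_root, filename, uri, author=None, creation_time=None):
    """Attach basic document metadata to a (possibly new) header element."""
    header = naf_root.find("nafHeader")
    if header is None:
        header = ET.SubElement(naf_root, "nafHeader")

    # Assumed layout: file-level facts on one element, the public URI on another.
    file_desc = ET.SubElement(header, "fileDesc", {"filename": filename})
    if author is not None:
        file_desc.set("author", author)
    if creation_time is not None:
        file_desc.set("creationtime", creation_time)

    ET.SubElement(header, "public", {"uri": uri})
    return header


root = ET.Element("NAF")
add_document_metadata(
    root,
    filename="letter_001.xml",             # made-up example value
    uri="http://example.org/letters/001",  # made-up example value
    author="Unknown",
)
print(ET.tostring(root, encoding="unicode"))
```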

configuration of linguistic processors

What is the recommended way to add more metadata about a linguistic processor?

The following information would be useful to store in the element:

The specific model used: we have retrained Stanza for dependency parsing, so we would like to record which model produced the output.
The command-line options needed to reproduce the process exactly.
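A rough sketch of what we would like to be able to write, using `xml.etree.ElementTree` and `shlex` from the standard library. The element names `linguisticProcessors` and `lp` and the `name`/`version` attributes are my assumption of the existing layout; the `model` and `cmdline` attributes are hypothetical additions that do not exist in NAF as far as I know, and the processor name, model file and command line are invented for the example.

```python
import shlex
import xml.etree.ElementTree as ET


def describe_processor(layer, name, version, model, argv):
    """Build a processor description that records the model and the exact command line."""
    group = ET.Element("linguisticProcessors", {"layer": layer})
    lp = ET.SubElement(group, "lp", {"name": name, "version": version})
    # Hypothetical attributes: not part of any NAF schema that I am aware of.
    lp.set("model", model)
    # shlex.join quotes each argument so the string can be split back losslessly.
    lp.set("cmdline", shlex.join(argv))
    return group


argv = ["parse.py", "--lang", "nl", "--model", "models/my retrained model.pt"]
group = describe_processor(
    layer="deps",
    name="stanza-depparse",              # invented name
    version="x.y.z",                     # placeholder, not a real version
    model="models/my retrained model.pt",
    argv=argv,
)
lp = group.find("lp")
print(ET.tostring(group, encoding="unicode"))

# The stored command line can be turned back into the original argument list.
print(shlex.split(lp.get("cmdline")) == argv)
```

Storing the command line as one quoted string keeps it readable while still allowing the exact argument list to be reconstructed.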

element IDs contain information

The NAF document requires identifiers to use a prefix that depends on the type of element: 'w' for words, 't' for terms, etc. Additionally, some of the Newsreader pipeline tools expect the rest of the ID to be a number.

This conflicts with the 'NAF should be simple' design principle: identifiers should not have to carry information.
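To show how restrictive this is in practice, here is a small sketch of the check that a strict tool effectively performs. The regular expression is my reading of the "prefix plus number" convention described above, not code from any Newsreader tool, and the example identifiers are invented.

```python
import re

# Assumed convention: a lowercase letter prefix followed only by digits, e.g. "w12".
PREFIX_NUMBER_ID = re.compile(r"^(?P<prefix>[a-z]+)(?P<number>\d+)$")


def follows_convention(element_id, expected_prefix):
    """Return True if the id is exactly the expected prefix followed by a number."""
    match = PREFIX_NUMBER_ID.match(element_id)
    return match is not None and match.group("prefix") == expected_prefix


def rejected_ids(ids, expected_prefix):
    """List the ids that a tool relying on the convention could not handle."""
    return [i for i in ids if not follows_convention(i, expected_prefix)]


# Invented word ids: the last two preserve information from a source document.
word_ids = ["w1", "w2", "w_p1_3", "tei-p12.w4"]
print(rejected_ids(word_ids, "w"))
```

A tool that treats identifiers as opaque strings needs none of this, and identifiers preserved from a source format such as TEI would simply work.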

annotation schema used for pos, semRole, morphofeat, rfunc

It is unclear how to indicate the schema used for the various annotations, or what the default schema is.

For part-of-speech, the document lists the valid options, but these do not correspond to the ones used in the Newsreader pipeline.

For semRole, the NAF document only suggests using values such as 'A0' and 'A1' when the role corresponds to a PropBank predicate, but that is not specific enough. It is also unclear what to do when the predicate is not a PropBank predicate.

For a term's morphofeat, there is no further mention of allowed values or content. The Newsreader pipeline assumes it follows the format 'POS(A,B)', where POS is a part-of-speech tag as produced by Alpino, and (A,B) are features that likewise come from the Alpino output.
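For reference, the assumption the pipeline makes about morphofeat can be sketched as the parser below. This is my own illustration of the 'POS(A,B)' shape, not the pipeline's code; the function name and the fallback for values without parentheses are hypothetical choices.

```python
import re

# A tag, then a parenthesised, comma-separated feature list: POS(A,B).
MORPHOFEAT = re.compile(r"^(?P<pos>[^()]+)\((?P<features>[^()]*)\)$")


def parse_morphofeat(value):
    """Split a 'POS(A,B)' value into its tag and its list of features.

    Values that do not have this shape are returned unchanged with no
    features, which is exactly where a schema indication would be needed.
    """
    match = MORPHOFEAT.match(value)
    if match is None:
        return value, []
    features = [f.strip() for f in match.group("features").split(",") if f.strip()]
    return match.group("pos"), features


print(parse_morphofeat("POS(A,B)"))  # the shape described above
print(parse_morphofeat("VERB"))      # a bare tag from some other schema
```

Without a way to declare the schema, a consumer cannot tell whether a bare value is an incomplete Alpino tag or a tag from a different tagset.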

For dependencies, there is the rfunc attribute. There is a (non-exhaustive) list of values, but no way to indicate whether these come from Universal Dependencies or from Alpino.
