naf-4-development's Issues

Character offsets for representing discourse units (paragraphs, headers, etc.)

We are also interested in representing paragraphs and headers for Clariah+, but I see that the current idea is to define discourse units as spans of tokens. I would argue for using character offsets instead, since these discourse units can exist before a tokenizer comes into play.

Our input documents are in TEI format, where these kinds of discourse units are already annotated, and we want to preserve their identifiers. Conversion from TEI would then generate a NAF file with a raw-text layer and a discourse-units layer. In our current pipeline, the tokenizer is only called afterwards, on each paragraph independently.
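To make the proposal concrete, here is a minimal Python sketch of how a TEI conversion step could produce a raw text plus character-offset discourse units before any tokenizer runs. The `DiscourseUnit` class, its field names, the `build_raw_layer` helper, and the blank-line separator are all illustrative assumptions, not part of NAF; offsets are counted in Python string positions (code points).

```python
from dataclasses import dataclass


@dataclass
class DiscourseUnit:
    unit_id: str    # identifier carried over from the source document (hypothetical field)
    unit_type: str  # e.g. "header" or "paragraph"
    offset: int     # character offset of the unit in the raw text
    length: int     # number of characters in the unit


def build_raw_layer(units, separator="\n\n"):
    """Join (unit_id, unit_type, text) triples into one raw text.

    Returns the raw text and a list of DiscourseUnit records whose
    offsets point into that raw text. No tokenizer is involved.
    """
    parts = []
    spans = []
    cursor = 0
    for index, (unit_id, unit_type, text) in enumerate(units):
        if index > 0:
            parts.append(separator)
            cursor += len(separator)
        spans.append(DiscourseUnit(unit_id, unit_type, cursor, len(text)))
        parts.append(text)
        cursor += len(text)
    return "".join(parts), spans


def unit_text(raw, unit):
    """Recover the text of a discourse unit from the raw layer."""
    return raw[unit.offset:unit.offset + unit.length]
```

With this representation, a tokenizer can later be run on `unit_text(raw, unit)` for each paragraph, and token offsets can be shifted by `unit.offset` to refer back to the raw layer.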

document metadata

What is the recommended way to store document metadata in NAF?
For instance, the file name and/or URI of the document being processed.

Author, publication date, creation date, etc. would also be of interest.

configuration of linguistic processors

What is the recommended way to add more metadata about a linguistic processor?

The following information would be useful to store in the element:

  • We have retrained Stanza to do dependency parsing, so we would like to record the specific model used.
  • The command-line options needed to reproduce the process exactly.
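As a rough illustration of what is being asked for, the sketch below builds a processor description carrying the two extra pieces of information. It assumes the processor is described by an `lp` element with `name` and `version` attributes; the `model` and `command` attribute names, and all values in the usage note, are hypothetical and not defined by NAF.

```python
import shlex
import xml.etree.ElementTree as ET


def describe_processor(name, version, model, argv):
    """Build an element describing one linguistic processor run.

    'model' and 'command' are hypothetical attributes used only to
    illustrate the request; they are not part of the NAF schema.
    """
    lp = ET.Element("lp", {"name": name, "version": version})
    lp.set("model", model)
    # shlex.join quotes arguments so the command can be split back exactly.
    lp.set("command", shlex.join(argv))
    return lp
```

For example, `ET.tostring(describe_processor("my-parser", "1.0", "my-retrained-model", ["python", "parse.py", "--lang", "nl"]), encoding="unicode")` would serialize the element, and `shlex.split` on the `command` attribute recovers the original argument list.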

element IDs contain information

The NAF document requires identifiers to use a prefix depending on the type of element: 'w' for words, 't' for terms, etc. Additionally, some of the Newsreader pipeline tools expect the rest of the ID to be a number.

This is contrary to the 'NAF should be simple' design of NAF.
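The two constraints described above can be expressed as a small check, sketched here in Python. The `check_id` function is hypothetical and only illustrates which identifiers would pass; it is not taken from any Newsreader tool.

```python
import re


def check_id(element_id, prefix):
    """Return (has_prefix, numeric_rest) for an element identifier.

    has_prefix:   the ID starts with the type prefix required by the NAF document.
    numeric_rest: the remainder is a plain number, as some tools reportedly expect.
    """
    has_prefix = element_id.startswith(prefix)
    rest = element_id[len(prefix):] if has_prefix else ""
    numeric_rest = has_prefix and re.fullmatch(r"[0-9]+", rest) is not None
    return has_prefix, numeric_rest
```

An identifier preserved from a source document, such as a TEI `xml:id`, would typically fail one or both checks, which is why treating IDs as opaque strings would be simpler.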

annotation schema used for pos, semRole, morphofeat, rfunc

It is unclear how to indicate the schema used for the various annotations, or what the default schema is.

For part-of-speech, the document lists the valid options, but these don't correspond to the ones used in the Newsreader pipeline.

For semRole, the NAF document only suggests using 'A0', 'A1', etc. when the role corresponds to a PropBank predicate, but that is not specific enough. It is also unclear what to do if it is not a PropBank predicate.

For a term's morphofeat, there is no further mention of allowed values or content. The Newsreader pipeline assumes it follows the 'POS(A,B)' format, where POS is a part-of-speech tag as produced by Alpino, and (A,B) are features that likewise come from Alpino output.
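To show what a consumer of this implicit format has to do, here is a minimal Python sketch that splits a 'POS(A,B)' value into its tag and feature list. The `parse_morphofeat` function and its fallback behaviour for values without parentheses are assumptions for illustration, not the Newsreader implementation.

```python
import re

# A tag, followed by a parenthesised, comma-separated feature list.
_MORPHOFEAT = re.compile(r"^(?P<pos>[^()]+)\((?P<features>[^()]*)\)$")


def parse_morphofeat(value):
    """Split a 'POS(A,B)' style value into (pos, [features]).

    Values that do not match the pattern are returned unchanged
    with an empty feature list (an assumed fallback).
    """
    match = _MORPHOFEAT.match(value)
    if match is None:
        return value, []
    features = [f.strip() for f in match.group("features").split(",") if f.strip()]
    return match.group("pos"), features
```

Any tool reading morphofeat would need a parser like this, which only works if the producer happened to follow the same convention; an explicit schema indication would remove that guesswork.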

For dependencies, there is the rfunc attribute. There is a (non-exhaustive) list of values, but no way to indicate whether these are from Universal Dependencies or from Alpino.
