cltl / naf-4-development
License: Apache License 2.0
We are also interested in representing paragraphs and headers for Clariah+, but I see that the current idea is to anchor discourse units to token spans. I would argue for using character offsets instead, as these discourse units can be present before a tokenizer comes into play.
Our input documents are in TEI format, where discourse units of this kind are already annotated, and we want to preserve their identifiers. Conversion from TEI would then generate a NAF file with a raw-text layer and a discourse-units layer. In our current pipeline, the tokenizer is only called afterwards, for each paragraph independently.
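To make the proposal concrete, here is a minimal sketch of how a TEI conversion step could produce a raw text plus character-offset discourse units before any tokenizer runs. The function name `build_raw_and_units`, the dictionary keys (`id`, `offset`, `length`), the blank-line separator and the example identifiers are all assumptions for illustration; none of them are taken from the NAF specification or from an existing converter.

```python
def build_raw_and_units(paragraphs, separator="\n\n"):
    """Join (tei_id, text) pairs into one raw text and record character offsets.

    Hypothetical helper: the unit layout (id/offset/length) is an assumption,
    not the NAF schema.
    """
    parts = []
    units = []
    cursor = 0  # character position in the raw text built so far
    for index, (tei_id, text) in enumerate(paragraphs):
        if index > 0:
            parts.append(separator)
            cursor += len(separator)
        # The TEI identifier is preserved as-is; no tokens are needed.
        units.append({"id": tei_id, "offset": cursor, "length": len(text)})
        parts.append(text)
        cursor += len(text)
    return "".join(parts), units


raw, units = build_raw_and_units(
    [("p.1", "First paragraph."), ("p.2", "Second one.")]
)
for unit in units:
    # Each unit can be recovered from the raw text by slicing alone.
    segment = raw[unit["offset"]:unit["offset"] + unit["length"]]
    print(unit["id"], repr(segment))
```

Because each unit is just an offset and a length into the raw text, a tokenizer can later be run on each paragraph slice independently, as in our current pipeline.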
What is the recommended way to store document metadata in NAF?
For instance the file name and/or URI of the document being processed.
Also interesting could be author, publication date, creation date, etc.
What is the recommended way to add more metadata on a linguistic processor?
The following information would be useful to store in the element:
The NAF document requires identifiers to use a prefix depending on the type of element: 'w' for words, 't' for terms, etc. Additionally, some of the Newsreader pipeline tools expect the rest of the id to be a number.
This is contrary to the 'NAF should be simple' design of NAF.
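For illustration, the two constraints described above (a type-dependent prefix, plus a numeric remainder expected by some tools) can be checked separately, which shows which identifiers would only fail the stricter tool-level assumption. The prefix table covers only the two prefixes named above, and the names `ID_PREFIXES` and `check_id` are hypothetical.

```python
import re

# Only the two prefixes mentioned in this issue; the keys are illustrative labels.
ID_PREFIXES = {"word": "w", "term": "t"}


def check_id(element_type, identifier):
    """Return (has_prefix, numeric_rest) for an identifier.

    has_prefix: the id starts with the prefix for its element type.
    numeric_rest: the remainder consists only of ASCII digits, as some
    pipeline tools reportedly expect.
    """
    prefix = ID_PREFIXES[element_type]
    has_prefix = identifier.startswith(prefix)
    rest = identifier[len(prefix):] if has_prefix else ""
    numeric_rest = re.fullmatch(r"[0-9]+", rest) is not None
    return has_prefix, numeric_rest


print(check_id("word", "w12"))    # prefix and numeric remainder both satisfied
print(check_id("word", "w12a"))   # prefix satisfied, remainder not numeric
print(check_id("term", "w1"))     # wrong prefix for the element type
```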
It is unclear how to indicate the schema used for various annotations, or what the default schema is.
For part-of-speech, the document lists the valid options, but these don't correspond to the ones used in the Newsreader pipeline.
For semRole, the NAF document only suggests using 'A0', 'A1' when the role corresponds to a PropBank predicate, but that is not specific enough. It is also unclear what to do if it is not a PropBank predicate.
For a term's morphofeat, there is no further mention of allowed values or content. The Newsreader pipeline assumes it follows the 'POS(A,B)' format, where POS is a part-of-speech tag as produced by Alpino, and (A,B) are similarly taken from Alpino output.
For dependencies, there is the rfunc attribute. There is a (non-exhaustive) list of values, but no way to indicate whether these are from Universal Dependencies or from Alpino.
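As a sketch of the implicit assumption on morphofeat mentioned above, the following parser accepts only values of the 'POS(A,B)' shape and returns nothing for anything else. The pattern and the name `parse_morphofeat` are assumptions made for this example, not code from the Newsreader pipeline, and the sample inputs are illustrative only.

```python
import re

# Assumed shape: a tag without parentheses, then a parenthesised,
# comma-separated feature list with no nested parentheses.
MORPHOFEAT_PATTERN = re.compile(r"^(?P<pos>[^()]+)\((?P<features>[^()]*)\)$")


def parse_morphofeat(value):
    """Split a 'POS(A,B)' style value into (pos, [features]).

    Returns None when the value does not follow the assumed shape.
    """
    match = MORPHOFEAT_PATTERN.match(value)
    if match is None:
        return None
    features = [
        feature.strip()
        for feature in match.group("features").split(",")
        if feature.strip()
    ]
    return match.group("pos"), features


print(parse_morphofeat("POS(A,B)"))
print(parse_morphofeat("noun"))
```

A value produced by a tool with a different tagset would come back as None here, which is the situation where an explicit schema indication in the document would help.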