divvun / libdivvun

lib for running gramcheck and other pipelines + cli; modules for CG→spelling, CG→feedback, tagging blanks

Home Page: https://giellalt.github.io/proof/gramcheck/GrammarCheckerDocumentation.html

License: GNU General Public License v3.0

Shell 7.75% C++ 77.76% Makefile 4.14% M4 3.69% Emacs Lisp 0.03% Python 1.87% Logos 0.13% Roff 2.41% SWIG 1.81% C 0.40%
constraint-grammar cg grammar-checker divvun saami

libdivvun's Introduction

Table of Contents

  1. Description
  2. Tools
  3. Install from packages
  4. Simple build from git on Mac
  5. Prerequisites
    1. For just divvun-suggest and divvun-blanktag
    2. If you also want divvun-checker
    3. If you also want divvun-cgspell
    4. If you also want the Python library
  6. Building
  7. Command-line usage
    1. divvun-suggest
    2. divvun-blanktag
    3. divvun-cgspell
    4. divvun-checker
  8. JSON format
  9. Pipespec XML
    1. Mapping from XML preferences to UI
  10. Writing grammar checkers
    1. XML pipeline specification
    2. Simple blanktag rules
      1. Troubleshooting
    3. Simple grammarchecker.cg3 rules
    4. More complex grammarchecker.cg3 rules (spanning over several words)
    5. Deleting words
    6. Alternative suggestions for complex errors altering different parts of the error
    7. Adding words
    8. Adding literal word forms, altering existing wordforms
    9. Including spelling errors
    10. How underlines and replacements are built
    11. Summary of special tags and relations
      1. Tags
      2. Relations
  11. Troubleshooting
  12. References / more documentation


Description

Libdivvun is a library for handling Finite-State Morphology and Constraint Grammar based NLP tools in GiellaLT. The tools are used for tokenisation, normalisation, grammar-checking and correction, and other NLP tasks.

Tools

This repository contains the library libdivvun and the following executables:

divvun-checker: This program opens a grammar checker pipeline XML specification and lets you run grammar checking on strings. It can also open zip files containing the XML and all required language data. The C++ library libdivvun (headers installed to $PREFIX/include/divvun/*.hpp) allows for the same features.

divvun-suggest: This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error-tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline.

The main output format of divvun-suggest is JSON, although it can also simply annotate readings in CG stream format.

divvun-cgspell: This program spells unknown word forms from Constraint Grammar format readings, adding them as new readings.

divvun-blanktag: This program takes an FST as argument, reads CG input and uses the FST to add readings to cohorts that match on the wordform surrounded by the preceding and following blanks. Use cases include adding error tags that are dependent on spaces before/after, or tagging the first word after a linebreak or certain formatting.

divvun-phon: This program takes FSAs, reads CG input and uses the FSAs to add phonetic readings to each analysis line.

divvun-normalise: This program takes FSAs, reads CG input and uses the FSAs to normalise the text towards TTS-friendlier, read-out forms.

There are also some helper programs for validating XML (divvun-validate-suggest, divvun-validate-pipespec, divvun-gen-xmlschemas) and for generating shell scripts from pipeline specifications (divvun-gen-sh).
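
For example, to check a hand-written pipespec against the schema and then turn it into a runnable script, something like the following should work (a sketch: it assumes divvun-validate-pipespec takes the spec file as its only argument; the divvun-gen-sh invocation is the same one used further below):

divvun-validate-pipespec pipespec.xml
divvun-gen-sh -s pipespec.xml -n smegram > test.sh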

Install from packages

Tino Didriksen has kindly packaged this as both .deb and .rpm.

For .deb (Debian, Ubuntu and derivatives), add the repo and install the package with:

wget https://apertium.projectjj.com/apt/install-nightly.sh
sudo bash install-nightly.sh
sudo apt install divvun-gramcheck

For .rpm (openSUSE, Fedora, CentOS and derivatives), add the repo and install the package with:

wget https://apertium.projectjj.com/rpm/install-nightly.sh
sudo bash install-nightly.sh
sudo dnf install divvun-gramcheck

(See also the deb build logs and rpm build status.)

Simple build from git on Mac

There is a script that will download prerequisites and compile and install for Mac.

You don't even need to check out this repository; just run:

curl https://raw.githubusercontent.com/divvun/divvun-gramcheck/master/scripts/mac-build | bash

and enter your sudo password when it asks you to.

It does not (yet) enable divvun-checker, since that has yet more dependencies. It assumes you've got xmllint installed.

Prerequisites

Note: Mac users probably just want to follow the steps in Simple build from git on Mac.

This section lists the prerequisites for building the various tools in this package.

For just divvun-suggest and divvun-blanktag

  • gcc >=5.0.0 with libstdc++-5-dev (or similarly recent version of clang, with full C++11 support)
  • libxml2-utils (just for xmllint)
  • libhfst >=3.12.2
  • libpugixml >=1.7.2 (optional)

Tested with gcc-5.2.0, gcc-5.3.1 and clang-703.0.29. On Mac OS X, the newest XCode includes a modern C++ compiler.

If you can't easily install libpugixml, you can run scripts/get-pugixml-and-build which will download libpugixml into this directory, build that (with cmake) and configure this program to use that library. Alternatively, you can run ./configure with --disable-xml if you don't care about human-readable error messages.

If you also want divvun-checker

  • gcc >=5.0.0 with libstdc++-5-dev (or similarly recent version of clang, with full C++11 support)
  • libxml2-utils (just for xmllint)
  • libhfst >=3.12.2
  • libpugixml >=1.7.2
  • libcg3-dev >=1.1.2.12327
  • libarchive >=3.2.2-2

Tested with gcc-5.2.0, gcc-5.3.1 and clang-703.0.29. On Mac OS X, the newest XCode includes a modern C++ compiler.

If you can't easily install libpugixml, you can run scripts/get-pugixml-and-build which will download libpugixml into this directory, build that (with cmake) and configure this program to use that library.

Now when building, pass --enable-checker to configure.

If you also want divvun-cgspell

  • hfst-ospell-dev >=0.4.5 (compiled with either libxml or tinyxml)

You can pass --enable-cgspell to ./configure if you would like to get an error if any of the divvun-cgspell dependencies are missing.

If you also want the Python library

The Python 3 library is used by the LibreOffice plugin. It will build if it finds both of:

  • SWIG >=3.0 (install python-swig if you're using MacPorts)
  • Python >=3.0

You can pass --enable-python-bindings to ./configure if you would like to get an error if any of the divvun-python-bindings dependencies are missing.
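
If you want configure to fail early when any of the optional dependencies are missing, you can combine these flags, e.g.:

./configure --enable-checker --enable-cgspell --enable-python-bindings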

Building

./autogen.sh
./configure --enable-checker  # or just "./configure" if you don't need divvun-checker
make
make install # with sudo if you didn't specify a --prefix to ./configure

On OS X, you may have to do this:

sudo port install pugixml
export CC=clang CXX=clang++ "CXXFLAGS=-std=gnu++11 -stdlib=libc++"
./autogen.sh
./configure  LDFLAGS=-L/opt/local/lib --enable-checker
make
make install # with sudo if you didn't specify a --prefix to ./configure

Command-line usage

divvun-suggest

divvun-suggest takes two arguments: a generator FST (in HFST optimised lookup format), and an error message XML file (see the one for North Saami for an example), with input/output as stdin and stdout:

src/divvun-suggest --json generator-gt-norm.hfstol errors.xml < input > output

More typically, it'll be in a pipeline after various runs of vislcg3:

echo words go here | hfst-tokenise --giella-cg tokeniser.pmhfst | … | vislcg3 … \
  | divvun-suggest --json generator-gt-norm.hfstol errors.xml

divvun-blanktag

divvun-blanktag takes one argument: an FST (in HFST optimised lookup format), with input/output as stdin and stdout:

src/divvun-blanktag analyser.hfstol < input > output

More typically, it'll be in a pipeline after cg-mwesplit:

echo words go here | hfst-tokenise … | … | cg-mwesplit \
  | src/divvun-blanktag analyser.hfstol < input > output

See the file test/blanktag/blanktagger.xfst for an example blank tagging FST (the other files in test/blanktag show test input and expected output, as well as how to compile the FST).

divvun-cgspell

divvun-cgspell takes options similar to hfst-ospell. You can give it a single zhfst speller archive with the -a option, or specify unzipped error model and lexicon with -m and -l options.

There are some options for limiting suggestions too, see --help. You'll probably want to use --limit at least.

src/divvun-cgspell --limit 5 se.zhfst < input > output

More typically, it'll be in a pipeline before/after various runs of vislcg3:

echo words go here | hfst-tokenise --giella-cg tokeniser.pmhfst | … | vislcg3 … \
  | src/divvun-cgspell --limit 5 se.zhfst | vislcg3 …

You can also use it with unzipped, plain analyser and error model, e.g.

src/divvun-cgspell --limit 5 -l analyser.hfstol -m errmodel.hfst < input > output

divvun-checker

divvun-checker is an example command-line interface to libdivvun. You can use it to test a pipespec.xml or a zip archive containing both the pipespec and language data, e.g.

$ divvun-checker -a sme.zhfst
Please specify a pipeline variant with the -n/--variant option. Available variants in archive:
smegram
smepunct

$ echo ballat ođđa dieđuiguin | src/divvun-checker -a sme.zhfst -n smegram
{"errs":[["dieđuiguin",12,22,"msyn-valency-loc-com","Wrong valency or something",["diehtukorrekt"]]],"text":"ballat ođđa dieđuiguin"}

$ divvun-checker -s pipespec.xml
Please specify a pipeline variant with the -n/--variant option. Available variants in pipespec:
smegram
smepunct

$ echo ballat ođđa dieđuiguin | src/divvun-checker -s pipespec.xml -n smegram
{"errs":[["dieđuiguin",12,22,"msyn-valency-loc-com","Wrong valency or something",["diehtukorrekt"]]],"text":"ballat ođđa dieđuiguin"}

When using the -s/--spec pipespec.xml option, relative paths in the pipespec are relative to the current directory.

See the test/ folder for an example of zipped archives.

See the examples folder for how to link into libdivvun and use it as a library, getting out either the JSON-formatted list of errors, or a simple data structure that contains the same information as the JSON. The next section describes the JSON format.

JSON format

The JSON output of divvun-suggest is meant to be sent to a client such as https://github.com/divvun/divvun-webdemo. The current format is:

{errs:[[str:string, beg:number, end:number, typ:string, exp:string, [rep:string]]], text:string}

The string text is the input, for sanity-checking.

The array-of-arrays errs has one array per error. Within each error-array, beg/end are offsets in text, typ is the (internal) error type, exp is the human-readable explanation, and each rep is a possible suggestion for replacement of the text between beg/end in text.

The index beg is inclusive, end exclusive, and both indices are based on a UTF-16 encoding (which is what JavaScript uses, so e.g. the emoji "🇳🇴" will increase the index of the following errors by 4).

Example output:

{
  "errs": [
    [
      "badjel",
      37,
      43,
      "lex-bokte-not-badjel",
      "\"bokte\" iige \"badjel\"",
      [
        "bokte"
      ]
    ]
  ],
  "text": "🇳🇴sáddejuvvot báhpirat interneahta badjel.\n"
}
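
On the command line you can pull the individual fields out of these error arrays with a JSON tool such as jq (not part of this package); a minimal sketch, using the divvun-checker example from above and the field order as documented here:

echo ballat ođđa dieđuiguin | divvun-checker -a sme.zhfst -n smegram \
  | jq '.errs[] | {form: .[0], beg: .[1], end: .[2], typ: .[3], exp: .[4], rep: .[5]}'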

Pipespec XML

The divvun-checker program and the libdivvun (divvun/checker.hpp) API use an XML format for specifying which programs go into the checker pipelines, along with metadata about the pipelines.

A pipespec.xml defines a set of grammar checker (or really any text processing) pipelines.

There is a main language for each pipespec, but individual pipelines may override with variants.

Each pipeline may define a set of mutually exclusive (radio button) preferences, and if there's a <suggest> element referring to an errors.xml file in the pipeline, error tags from that file may be used to populate UIs for hiding certain errors.

Mapping from XML preferences to UI

The mapping from preferences in the XML to a user interface should be possible to do automatically, so the UI writer doesn't have to know anything about what preferences the pipespec defines, but can just ask the API for a list of preferences.

Preferences in the UI are either checkboxes [X] or radio buttons (*).

We might for example get the following preferences UI:

(*) Nordsamisk, Sverige
( ) Nordsamisk, Noreg
…
[X] Punctuation
    (*) punktum som tusenskilje
    ( ) mellomrom som tusenskilje
[-] Grammar errors
    [X] ekteordsfeil
    [ ] syntaksfeil

Here, the available languages are scraped from the pipespec.xml using //pipeline/@language.

A language is selected, so we create a Main Category of error types from

pipespec.xml //[@language=Sverige|@language=""]/prefs/@type
pipespec.xml //pipeline[@language=Sverige|@language=""]/@type
errors.xml   //default/@type
errors.xml   //error/@type

in this case giving the set { Punctuation, Grammar errors }.

One Main Category type is Punctuation; the radio buttons under this main category are those defined in

pipespec.xml //prefs[@type="Punctuation"]

The other Main Category type is Grammar errors; maybe we didn't have anything in

pipespec.xml //prefs[@type="Grammar errors"]

but there are checkboxes for errors that we can hide in

errors.xml //defaults/default/title
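
Since xmllint is already a dependency of this package, a quick way to experiment with these lookups is to run the XPath expressions above directly from the shell; a sketch (element and attribute names as in the pipespec and errors files above):

xmllint --xpath '//pipeline/@language' pipespec.xml
xmllint --xpath '//prefs[@type="Punctuation"]' pipespec.xml
xmllint --xpath '//defaults/default/title' errors.xml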

It should be possible for the UI to hide which underlying <pipeline>s are chosen, and only show the preferences (picking a pipeline based on preferences). But there is an edge case: say the pipe named smegramSE with language smeSE and main type "Grammar errors" has a

pref[@type="Punctuation"]

and there's another pipe named smepunct with main type "Punctuation". Now, assuming we select the language smeSE, we'll never use smepunct, since smegram defines error types that smepunct doesn't, but not the other way around. Hopefully this is not a problem in practice.

Writing grammar checkers

Grammar checkers written for use in libdivvun consist of a pipeline, at a high level typically looking like:

tokenisation/morphology | multiword handling | disambiguation | error rules | generation

There are often other modules in here too, e.g. for adding spelling suggestions, annotating valency, disambiguation and splitting multiwords, or annotating surrounding whitespace.

Below we go through some of the different parts of the checker, using the Giellatekno/Divvun North Sámi package (from https://victorio.uit.no/langtech/trunk/langs/sme/) as an example.

XML pipeline specification

Each grammar checker needs a pipeline specification with all the different modules and their data files in order. This is written in a file pipespec.xml, which should follow the pipespec DTD (src/pipespec.dtd in this repository). Each such file may have several <pipeline> elements (in case there are alternative pipeline variants in your grammar checker package), each with a name and some metadata.

Here is the pipespec.xml for North Sámi:

<pipespec language="se"
          developer="Divvun"
          copyright="…"
          version="0.42"
          contact="Divvun [email protected]">

  <pipeline name="smegram"
            language="se"
            type="Grammar error">
    <tokenize><arg n="tokeniser-gramcheck-gt-desc.pmhfst"/></tokenize>
    <cg><arg n="valency.bin"/></cg>
    <cg><arg n="mwe-dis.bin"/></cg>
    <mwesplit/>
    <blanktag>
      <arg n="analyser-gt-whitespace.hfst"/>
    </blanktag>
    <cgspell>
      <arg n="errmodel.default.hfst"/>
      <arg n="acceptor.default.hfst"/>
    </cgspell>
    <cg><arg n="disambiguator.bin"/></cg>
    <cg><arg n="grammarchecker.bin"/></cg>
    <suggest>
      <arg n="generator-gt-norm.hfstol"/>
      <arg n="errors.xml"/>
    </suggest>
  </pipeline>

  <!-- other variants omitted -->

</pipespec>

This is what happens when text is sent through the smegram pipeline:

  • First, <tokenize> turns plain text into morphologically analysed tokens, using an FST compiled with hfst-pmatch2fst. These tokens may be ambiguous wrt. both morphology and tokenisation.
  • Then, a <cg> module adds valency tags to readings, enriching the morphological analysis with context-sensitive information on argument structure.
  • Another <cg> module disambiguates cohorts that are ambiguous wrt. tokenisation, like multiwords and punctuation.
  • The <mwesplit> module splits now-disambiguated multiwords into separate tokens.
  • Then <blanktag> adds some tags to readings based on the surrounding whitespace (or other types of non-token blanks/formatting), using an FST which matches sequences of blank–wordform–blank.
  • The <cgspell> module adds readings with spelling suggestions to unknown words. The suggestions appear as wordform-tags.
  • Then a <cg> disambiguator runs, with its rules modified a bit to let more errors through.
  • The main <cg> grammar checker module can now add error tags to readings, as well as new readings for generating suggestions, or special tags for deleting words or expanding underlines (and, as in the other <cg> modules, we can use the full range of CG features to add information that may be helpful in these tasks, such as dependency annotation and semantic role analysis).
  • Finally, <suggest> uses a generator FST to turn suggestion readings into forms, and an XML file of error descriptions to look up error messages from the tags added by the <cg> grammar checker module. These are used to output errors with suggestions, as well as readable error messages and the correct indices for underlines.

The program divvun-gen-sh in this package creates shell scripts from the specification that you can use to test your grammar checker. In the North Sámi checker, these should appear in tools/grammarcheckers/modes when you type make, but you can also create a single script for the above pipeline manually. If we do divvun-gen-sh -s pipespec.xml -n smegram > test.sh with the above XML, test.sh will contain something like

#!/bin/sh

hfst-tokenise -g '/home/me/gtsvn/langs/sme/tools/grammarcheckers/tokeniser-gramcheck-gt-desc.pmhfst' \
 | vislcg3 -g '/home/me/gtsvn/langs/sme/tools/grammarcheckers/valency.bin' \
 | vislcg3 -g '/home/me/gtsvn/langs/sme/tools/grammarcheckers/mwe-dis.bin' \
 | cg-mwesplit \
 | divvun-blanktag '/home/me/gtsvn/langs/sme/tools/grammarcheckers/analyser-gt-whitespace.hfst' \
 | divvun-cgspell '/home/me/gtsvn/langs/sme/tools/grammarcheckers/errmodel.default.hfst' '/home/me/gtsvn/langs/sme/tools/grammarcheckers/acceptor.default.hfst' \
 | vislcg3 -g '/home/me/gtsvn/langs/sme/tools/grammarcheckers/disambiguator.bin' \
 | vislcg3 -g '/home/me/gtsvn/langs/sme/tools/grammarcheckers/grammarchecker.bin' \
 | divvun-suggest '/home/me/gtsvn/langs/sme/tools/grammarcheckers/generator-gt-norm.hfstol' '/home/me/gtsvn/langs/sme/tools/grammarcheckers/errors.xml'

We can send words through this pipeline with echo "words here" | sh test.sh.

Using divvun-gen-sh manually like this is good for checking if you've written your XML correctly, but if you're working within the Giellatekno projects, you'll typically just type make and use the scripts that end up in modes.

Do

$ ls modes

in tools/grammarcheckers to list all the scripts. These contain not just the full pipeline (for every <pipeline> in the XML), but also "debug" versions that are chopped off at various points (with numbers to show how far they go), as well as versions with CG rule tracing turned on. So if you'd like to check up until disambiguation, before the grammarchecker CG, you'd do something like

echo "words go here" | sh modes/trace-smegram6-disam.mode

Simple blanktag rules

The divvun-blanktag program will tag a cohort with a user-specified tag if it finds a match on the input wordform and its surrounding blanks.

The wordform includes the CG wordform delimiters, i.e. the leading "< and the trailing >". The surrounding blanks do not include the start-of-line colon. The rule file is an FST with blank-wordform-blank on the input side and the tag on the output side, typically written in the XFST regex format.

As an example (with spaces changed to underscores for readability), if the input.txt contains

:_
"<)>"
        ")" RPAREN @EOP
        ")" RPAREN @EMO
"<.>"
        "." PUNCT
:\n
:\n

then divvun-blanktag will try to match twice, first on the string

_"<)>"

then on the string

"<.>"\n\n

If the rule file ws.regex (here in XFST regex format) contains

[ {_} {"<)>"} ?* ]:[%<spaceBeforeParenEnd%>]

then we will get

hfst-regexp2fst --disjunct ws.regex | hfst-fst2fst -O -o ws.hfst
divvun-blanktag ws.hfst < input.txt

:_
"<)>"
	")" RPAREN @EOP <spaceBeforeParenEnd>
	")" RPAREN @EMO <spaceBeforeParenEnd>
"<.>"
	"." PUNCT
:\n
:\n

The matching goes from the start of the preceding blank, across the wordform and to the end of the following blank. In this input, there was no blank following the right-parens, so the rule could just as well have been

[ {_} {"<)>"} ]:[%<spaceBeforeParenEnd%>]

– this would require that there is no following blank. However, if you want it to also match the input

:_
"<)>"
        ")" RPAREN @EOP
        ")" RPAREN @EMO
:\n

then you need the final match-all ?*.

Troubleshooting

If you get

terminate called after throwing an instance of 'FunctionNotImplementedException'
Aborted (core dumped)

check how you compiled the HFST file – it should be in unweighted HFST optimized lookup format.
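
For example, if your rule FST ended up weighted or in plain HFST format, converting it with hfst-fst2fst -O (as in the compilation example above) should fix this; file names here are just placeholders:

hfst-fst2fst -O -o ws.hfst < ws.weighted.hfst
# or, recompiling from the regex source:
hfst-regexp2fst --disjunct ws.regex | hfst-fst2fst -O -o ws.hfst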

Simple grammarchecker.cg3 rules

In our North Sámi checker, the

<cg><arg n="grammarchecker.bin"/></cg>

file is created from the source file $GTHOME/langs/sme/tools/grammarcheckers/grammarchecker.cg3, which adds error tags and suggestion readings.

A simple rule looks like:

ADD:msyn-hallan (&real-hallan) TARGET (Imprt Pl1 Dial/-KJ) IF (0 HALLA-PASS-V) (NEGATE *1 ("!")) ;

This simply adds an error tag real-hallan to words that are tagged Imprt Pl1 Dial/-KJ and match the context conditions after the IF. This will put an underline under the word in the user interface. If errors.xml in the same folder has a nice description for that tag, the user will see that description in the user interface.

We can add a suggestion as well with a COPY rule:

COPY:msyn-hallan (Inf SUGGEST) EXCEPT (Imprt Pl1 Dial/-KJ) TARGET (Imprt Pl1 Dial/-KJ &real-hallan) ;

This creates a new reading where the tags Imprt Pl1 Dial/-KJ have been changed into Inf SUGGEST (and other tags are unchanged). The SUGGEST tag is necessary to get divvun-suggest (the <suggest> module) to try to generate a form from that reading. It is smart enough to skip things like weights, tracing and syntax tags when trying to suggest, but all morphological tags need to be correct and in the right order for generation to work.
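
If a SUGGEST reading gives no suggestion, the generator probably can't generate that exact tag sequence. You can test a reading against the generator directly with hfst-lookup, writing the reading as lemma and tags joined with "+"; a sketch (the exact tag string depends on what your language's generator expects):

echo 'boahtit+V+IV+Ind+Prs+Sg1' | hfst-lookup generator-gt-norm.hfstol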

More complex grammarchecker.cg3 rules (spanning over several words)

The error is considered to have a central part and one or more less central parts (parts here being CG cohorts).

The central part needs the error tag, e.g. &real-hallan in the above simple example.

If there are several different ways of correcting an error, you may also need to add "co-error tags" to the non-central parts to disambiguate the replacements – see the below section on Alternative suggestions for complex errors altering different parts of the error for details on this.

The non-central parts need to be referred to by a relation named LEFT, RIGHT or DELETE from the central part, if all parts are to be underlined as one error. The words do not have to be adjacent: if there are words in between that are not part of the error, they are still underlined.

In the first line of the following example only "soaitá" and "boađán" are part of the error and are underlined. However, if "mun" ("I") is inserted in between then it is also underlined.

Mun soaitá boađán. `Maybe I come.'
Soaitá mun boađán. `Maybe I come.'

You can refer to the word form of the "central" cohort of the error using $1 in errors.source.xml, e.g.

<description xml:lang="en">The word "$1" seems to be in the wrong case.</description>

You can refer to the word form of the first correction / suggestion using €1 in errors.source.xml, e.g.

<description xml:lang="en">Please don't write "$1", it sounds much nicer if you use "€1" instead.</description>

  • $1 – reference to error
  • €1 – reference to suggestion

To refer to other words, you add relations named $2 and so on:

ADDRELATION ($2) Ess TO (*-1 ("dego" &syn-not-dego) BARRIER Ess);

which you can refer to just like with $1:

<title xml:lang="en">there should not be "$2" if "$1" is essive</title>

Deleting words

If you want to delete a word from a CG rule, it's typically enough to add an error tag to the word you want to keep, and add a relation DELETE1 to the word you want to delete. This will make an underline that covers both those words, where the suggestion is the same string without the target of the DELETE1 relation.

ADD (&one-word-too-many) KeepThisWord;
ADDRELATION (DELETE1 $2) KeepThisWord TO (-1 DeleteThisWord);

The cohort matching KeepThisWord is now the central one of the error, so if e.g. errors.xml uses templates like

Don't use "$2" before using "$1"

the word form of KeepThisWord will be substituted for $1.

A real example from North Sámi is the error "dego lávvomuorran" where we want to delete the word "dego" from the suggestion and keep "lávvomuorran" (the central word, $1 in errors.xml):

ADD (&syn-not-dego)      TARGET Ess IF (-1 ("dego")) ;
ADDRELATION (DELETE1 $2) TARGET (&syn-not-dego) TO (-1 ("dego"));

You may delete more words from the same cohort using DELETE2 etc.

In South Sámi sometimes phrasal verbs are used (due to a literal translation from Scandinavian languages) where the verb alone already expresses the concept. This is the case for "tjuedtjelh bæjjese" (verb + adverb) meaning "stand.up up". With the following rule we first annotate the error and then delete the adverb "bæjjese".

ADD (&syn-delete-adv-phrasal-verb) TARGET (V) IF (0 ("tjuedtjielidh") OR ("fulkedh")) (*0 ("bæjjese") BARRIER (*) - Pcle) ;
ADD (&syn-delete-adv-phrasal-verb) TARGET (Adv) IF (0 ("bæjjese")) (*0 ("tjuedtjielidh") OR ("fulkedh") BARRIER (*) - Pcle) ;

ADDRELATION (DELETE1) (V &syn-delete-adv-phrasal-verb) TO (*0 (Adv &syn-delete-adv-phrasal-verb) BARRIER (*) - Pcle) ;

Alternative suggestions for complex errors altering different parts of the error

Sometimes you have several possible suggestions on the same word, which might partially overlap. For example, the simple deletion example from above might also have an alternative interpretation where instead of deleting the word "dego" to the left, we should change the case of the word "lávvomuorran" from essive to nominative case:

ADD (&syn-dego-nom) TARGET Ess IF (-1 ("dego"));
COPY (Sg Nom SUGGEST) EXCEPT (Ess) TARGET (&syn-dego-nom) ;

Here we want to keep the suggestions for &syn-dego-nom separate from the suggestions for &syn-not-dego – in particular, we don't want to include a suggestion where we both delete and change cases at the same time. But if we use the above rules, CG gives us this output:

"<dego>"
        "dego" CS @CNP ID:11
:
"<lávvomuorran>"
         "lávvomuorra" N Ess @COMP-CS< &syn-not-dego ID:12 R:DELETE1:11
         "lávvomuorra" N Sg Nom @COMP-CS< &syn-dego-nom ID:12 R:DELETE1:11 SUGGEST

Notice how the DELETE relation is on both readings, and also how the relation target id (11) refers to a cohort, not a reading of a cohort. There is no way from this output to know that "dego" should not also be deleted from the SUGGEST reading.

So when there are such multiple alternative interpretations for errors spanning multiple words, the less central parts ("dego" above) need a "co-error tag" (using co& as a prefix instead of &) to say which error tag goes with which non-central reading.

ADD (co&syn-not-dego) ("dego") IF (1 (&syn-not-dego));

Without the co, this would be treated as a separate error, while without &syn-not-dego, we would suggest deleting this word in the suggestions for &syn-dego-nom too.

By using co&error-tags, we can have multiple alternative interpretations of an error, while avoiding generating bad combinations. In the following case:

Soaitá boađán.

there are two possible corrections:

Soaittán boahtit.
Kánske boađán.

The alternative corrections have different central parts of the error. In the first case both parts are changed. The first part ("soaitá" 3.Sg.) is changed to "soaittán" (1.Sg.) based on the person and number of the second word ("boađán" 1.Sg.). Subsequently "boađán" (1.Sg.) is changed to "boahtit" (infinitive). Alternatively, only the first part is changed and the second part remains unchanged. In this case we can change the "soaitá" (3.Sg.) to the adverb "kánske".

As usual, this requires SUGGEST readings for the parts that are to be changed, and one unique error tag for each interpretation, i.e. &msyn-kánske for the "Kánske boađán" correction and &msyn-fin_fin-fin_inf for the "Soaittán boahtit" correction.

We also need relations LEFT/RIGHT from the central cohort carrying the error tag to ensure both words are underlined. Again, if we say that the correction "boahtit" has a relation to the correction "Soaittán", CG only knows that there's a relation between the words, not between the individual readings. In order to match readings-to-readings, we use the (co-)error tags to match up. If we chose the first word (input form "Soaitá") to be the central cohort, and had the error tag &msyn-kánske on the suggestion for "Kánske", then we would add a relation RIGHT to the second word (input form "boađán") and add the co-error tag co&msyn-kánske to the correct reading of that word (in this case the reading that does not suggest a change). So the CG output after grammar checker should contain:

"<Soaitá>"
	"soaitit" V IV Ind Prs Sg3 &syn-soahtit-vfin+inf   ID:2 R:RIGHT:3
	"soaitit" V IV Ind Prs Sg1 &syn-soahtit-vfin+inf   ID:2 R:RIGHT:3 SUGGEST
	"kánske"  Adv              &syn-kánske             ID:2 R:RIGHT:3 SUGGEST
: 
"<boađán>"
	"boahtit" V IV Ind Prs Sg1                         ID:3
	"boahtit" V IV Ind Prs Sg1 co&syn-kánske           ID:3 SUGGEST
	"boahtit" V IV Inf         co&syn-soahtit-vfin+inf ID:3 SUGGEST

By adding co&msyn-kánske etc., we avoid generating silly suggestion combinations like *"Kánske boahtit" or *"Soaittán boađán".

Another example:

Dåaktere veanhta dïhte aktem aajla-hirremem åtneme, dan åvteste tjarke svæjmadi jïh
{ij mujhti} satne lij vaedtsieminie skuvleste gåatan.

In this sentence in South Sámi there are two alternative suggestions:

  • one regarding the second cohort only – mujhti > mujhtieh
  • the other one regarding both cohorts – ij mujhti > idtji mujhtieh

Here too, we need to ensure that there are co&errortags to match relations to readings.

Avoiding mismatched words in multiple suggestions on ambiguous readings

In the sentence "Sámegiellaåhpadus vatteduvvá skåvlåjn 7 fylkajn ja aj gålmån priváhta skåvlåjn." from Lule Saami, the "7 fylkajn" should either be "7:n fylkan" (both words Sg Ine, error tag &msyn-numphrase-sgine) or "7:jn fylkajn" (both words Sg Com, error tag &msyn-numphrase-sgcom). The erroneous input looks like

"<7>"
        "7" Num Arab Sg Nom
        "7" Num Arab Sg Ine Attr
        "7" Num Arab Sg Ill Attr
        "7" Num Arab Sg Gen
        "7" Num Arab Sg Ela Attr
        "7" A Arab Ord Attr CLBfinal
:
"<fylkajn>"
        "fylkka" v1 N Sem/Org Sg Com
        "fylkka" v1 N Sem/Org Pl Ine
:

on the way into the grammar checker. For each error tag, we add, copy and substitute, e.g.

ADD (&msyn-numphrase-sgcom) TARGET (Num Sg Gen) OR (Num Pl Nom) OR (Num Sg Nom) OR (Num Sg Com) OR ("moadda" Indef Acc) OR (Num Arab) IF …
# and other add rules
COPY (Sg Com SUGGEST) EXCEPT (Sg Com) OR (Pl Ine) TARGET (&msyn-numphrase-sgcom) ;
SUBSTITUTE (&msyn-numphrase-sgcom) (co&msyn-numphrase-sgcom) TARGET (SUGGEST);

The error is ambiguous, with two possible suggestions. We want to keep these apart, but when CG runs the rule section for the second time, the ADD rule for the sgine suggestion may land on the reading that was copied in from the sgcom suggestion (now substituted to co&msyn-numphrase-sgcom), mixing the error tags:

        "7" Num Arab co&msyn-numphrase-sgcom Sg Ine SUGGEST co&msyn-numphrase-sgine

This will lead to mismatched suggestions like "*7:n fylkajn".

To prevent this, we can take advantage of the fact that Constraint Grammar will not ADD anything to a reading that has had a MAP rule applied. We can do this right after the SUBSTITUTE rule:

 ADD (&msyn-numphrase-sgcom) TARGET (Num Sg Gen) OR (Num Pl Nom) OR (Num Sg Nom) OR (Num Sg Com) OR ("moadda" Indef Acc) OR (Num Arab) IF …
 # and other add rules
 COPY (Sg Com SUGGEST) EXCEPT (Sg Com) OR (Pl Ine) TARGET (&msyn-numphrase-sgcom) ;
 SUBSTITUTE (&msyn-numphrase-sgcom) (co&msyn-numphrase-sgcom) TARGET (SUGGEST);
+MAP:LOCK_READING (SUGGEST) (SUGGEST);

Adding words

To add a word as a suggestion, use ADDCOHORT, adding both reading tags (lemma, part-of-speech etc.), a wordform tag (including a space) and &ADDED to mark it as something that didn't appear in the input; and then a LEFT or RIGHT relation from the central cohort of the error to the added word:

ADD (&msyn-valency-go-not-fs) IF (…);
ADDCOHORT ("<go >" "go" CS &ADDED &msyn-valency-go-not-fs) BEFORE &msyn-valency-go-not-fs;
ADDRELATION (LEFT) (&msyn-valency-go-not-fs) TO (-1 (&ADDED)) ;

Because of &ADDED, divvun-suggest will treat this as a non-central word of the error (just like with co& tags).

Note that we include the space in the wordform, and we put it at the end of the wordform. This is because vislcg3 always adds new cohorts after the blank of the preceding cohort. In some cases, e.g. with punctuation, we want the new cohort to come before the blank of the preceding cohort; then we use the tag &ADDED-BEFORE-BLANK, and divvun-suggest will ensure it ends up in the right place, e.g.:

ADD:punct-rihkku (&punct-rihkku) TARGET (Inf) IF (-1 Inf LINK -1 COMMA LINK -1 Inf …);
ADDCOHORT:punct-rihkku ("<,>" "," CLB &ADDED-BEFORE-BLANK &punct-rihkku) BEFORE (V &punct-rihkku) IF …;
ADDRELATION (LEFT) (&punct-rihkku) TO (-1 (&ADDED-BEFORE-BLANK)) ;

will give a suggestion that covers the space before the infinitive.

Adding literal word forms, altering existing wordforms

Say you want to tag missing spaces after punctuation. You've added a rule like

[ ?* {"<,>"} ]:[%<NoSpaceAfterPunctMark>]

to your whitespace-analyser.regex (used by divvun-blanktag) and the input to the grammarchecker CG is now

"<3>"
        "3" Num Arab Sg Loc Attr @HNOUN
        "3" Num Arab Sg Nom @HNOUN
        "3" Num Arab Sg Ill Attr @HNOUN
"<,>"
        "," CLB <NoSpaceAfterPunctMark>
"<ja>"
        "ja" CC @CNP

Then you can first of all turn that blanktag tag into an error tag with

ADD (&no-space-after-punct-mark) (<NoSpaceAfterPunctMark>);

Now, we could just suggest a wordform on the comma and call it a day:

COPY ("<, >" SUGGESTWF) TARGET ("," &no-space-after-punct-mark) ;

but that will

  1. only work on commas, and
  2. be a tiny underline, hard to click for users

Instead, let's extend the underline to the following word:

ADD (co&no-space-after-punct-mark)
    TARGET (*)
    IF (-1 (<NoSpaceAfterPunctMark>))
    ;
ADDRELATION (RIGHT) (&no-space-after-punct-mark)
    TO (1 (co&no-space-after-punct-mark))
    ;

Every error needs a "central" cohort, even if it involves several words; this is important in order to get error messages to show correctly. It doesn't matter which one you pick, as long as you pick one. Here we've picked the comma to be central, while the following word is a "link" word. In the above rules,

  • The co& tag says that the following word is just a part of the error, not the central cohort.
  • The RIGHT relation says that this is one big error, not two separate ones.

Then we can add a suggestion that puts a space between the forms:

COPY:no-space-after-punct ("<$1 $2>"v SUGGESTWF)
    TARGET ("<(.*)>"r &no-space-after-punct-mark)
    IF (1 ("<(.*)>"r))
       (NOT 0 (co&no-space-after-punct-mark))
    ;

This uses vislcg3's variable strings / varstrings to create the wordform suggestion from two regular expression strings matching the wordforms of the two cohorts. Note that the $1 and $2 refer to the first and second regex groups as they appear in the rule, not as they appear in the sentence. If the rule referred to the preceding word with (-1 ("<(.*)>"r)), you'd probably want the suggestion to be <$2 $1>.

We don't put a suggestion-tag on the co& cohort (here the word <ja>), which would lead to some strange suggestions since it is already part of the suggestion-tag on the comma <,> cohort. See How underlines and replacements are built for more on the relationship between SUGGESTWF and replacements.

Now the output is

"<3>"
        "3" Num Arab Sg Loc Attr @HNOUN
        "3" Num Arab Sg Nom @HNOUN
        "3" Num Arab Sg Ill Attr @HNOUN
"<,>"
        "," CLB <NoSpaceAfterPunctMark> &no-space-after-punct-mark ID:3 R:RIGHT:4
        "," CLB <NoSpaceAfterPunctMark> "<, ja>" &no-space-after-punct-mark SUGGESTWF ID:3 R:RIGHT:4
"<ja>"
        "ja" CC @CNP co&no-space-after-punct-mark ID:4

or, in JSON format:

{
  "errs": [
    [
      ",ja",
      4,
      7,
      "no-space-after-punct-mark",
      "no-space-after-punct-mark",
      [
        ", ja"
      ]
    ]
  ],
  "text": "ja 3,ja"
}

This looks pretty good, except the error tag is listed twice. The second entry is actually supposed to contain a human-readable error message, but errors.xml contains no entry for this tag. Let's add it:

<error id="no-space-after-punct-mark">
  <header>
    <title xml:lang="en">Missing space</title>
  </header>
  <body>
    <description xml:lang="en">There is no space after the punctuation mark "$1"</description>
  </body>
</error>

(In Giellatekno's setup, this goes in errors.source.xml, which is compiled to errors.xml.)

Now we get:

{
  "errs": [
    [
      ",ja",
      4,
      7,
      "no-space-after-punct-mark",
      "Missing space",
      [
        ", ja"
      ]
    ]
  ],
  "text": "ja 3,ja"
}

which should end up as a nice error message, suggestion and underline in the UI.
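
If a message doesn't show up even though you've added it, it's worth validating the edited XML; the helper mentioned in the Tools section can do that (a sketch, assuming it takes the file to check as its only argument):

divvun-validate-suggest errors.xml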

Including spelling errors

To use the divvun-cgspell module, you need a spelling acceptor (dictionary) FST and an error model FST. These are in the same format as the files used by hfst-ospell. The speller isn't yet used to handle real-word errors; it just adds suggestions to unknown words.

The divvun-cgspell module should go before disambiguation in the pipeline, so the disambiguator can pick the best suggestion in context.

The module adds the tag <spelled> to any suggestions. The speller module itself doesn't take any context into account; that's for later steps to handle. As an example, you might have this unknown word as input to the speller module:

"<coffe>"
        "coffe" ?

To which the output from the speller might be

"<coffes>"
        "coffes" ?
        "coffee" N Sg <W:37.3018> <WA:17.3018> <spelled> "<coffee>"
        "coffee" N Pl <W:37.3018> <WA:17.3018> <spelled> "<coffees>"
        "coffer" N Pl <W:39.1010> <WA:17.3018> <spelled> "<coffers>"
        "Coffey" N Prop <W:40.0000> <WA:18.1800> <spelled> "<Coffey>"

The form to be suggested is included as a "wordform-tag" at the very end of each reading from the speller.

Now the later CG stages can use the context of this cohort to pick more relevant suggestions (e.g. if the word to the left was "a", we might want to REMOVE the plurals or even SELECT the singulars). We could also ADD/MAP some relevant tags or relations.

Note that the readings added by the speller don't include any error tags (tags with & in front). To turn these readings into error underlines and actually show the suggestions, add a rule like

ADD (&typo SUGGESTWF) (<spelled>) ;

to the grammar checker CG. The reason we add SUGGESTWF and not SUGGEST is that we're using the wordform-tag directly as the suggestion, and not sending each analysis through the generator (as SUGGEST would do). See also the next section on how replacements are built. So if, after disambiguation and grammarchecker CG's, we had

"<coffes>"
        "coffee" N Pl <W:37.3018> <WA:17.3018> <spelled> "<coffees>" &typo SUGGESTWF
        "coffer" N Pl <W:39.1010> <WA:17.3018> <spelled> "<coffers>" &typo SUGGESTWF

then the final divvun-suggest step would simply use the contents of the tags

"<coffers>"
"<coffees>"

to create the suggestion-list, without bothering with generating from

"coffee" N Pl
"coffer" N Pl

This makes the system more robust in case the speller lexicon differs from the regular suggestion generator, and saves some duplicate work.

How underlines and replacements are built

The LEFT/RIGHT relations (also DELETE) are used to expand the underline of the error, to include several cohorts in one replacement suggestion. We expand the underline until it matches the relation targets that are furthest away, so if you have several such relation targets to the left of the central cohort, the underline expands to the leftmost one.

A matching co&errtag isn't strictly needed on the non-central word, but is recommended in case we can have several error types and need to keep replacements separate (to avoid silly combinations of suggestions).

When we have a DELETE relation from a reading with an &errtag and there may be multiple source-cohort error tags, the deletion target needs to have a co&errtag, so that we only delete in the replacement for &errtag (not from the replacements for &other-errtag). See the section on Alternative suggestions for complex errors altering different parts of the error for more info on this.

By default, a cohort's word form is used to construct the replacement. So if we have the sentence "we was" where "was" is central and tagged &typo, and there's a LEFT relation to "we", then the default replacement if there were no SUGGEST tags would simply be the input "we was" (which would be filtered out since it's equal, giving no suggestions).

If we now add a SUGGEST reading on "we" that generates "he" then we get a "he was" suggestion. SUGGEST readings with matching (co-)error tags are prioritised over input word form.

If we also have a SUGGEST for was→are for the possible replacement "we are" (tagged &agr), we don't want both of these to apply at the same time, giving *"we is". In this case, we need to ensure we have disambiguating co&errtype tags on the SUGGEST readings. The following CG parse:

"<we>"
    "we" Prn &agr                 ID:1 R:RIGHT:2
    "he" Prn SUGGEST co&agr-typo ID:1 R:RIGHT:2
: 
"<was>"
    "be" V 3Sg &agr-typo       ID:2 R:LEFT:1
    "be" V 3Pl co&agr SUGGEST ID:2 R:LEFT:1

will give us all and only the suggestions we want ("he was" and "we were", but not *"he were").

There is one exception to the above principles; for backwards-compatibility, SUGGESTWF is still used to mean that the whole underline should be replaced by what's in SUGGESTWF. This means that if you combine SUGGESTWF with RIGHT/LEFT, you will not automatically get the word form for the relation target(s) in your replacement, you have to construct the whole replacement yourself. This also means you cannot combine SUGGESTWF with SUGGEST on other words. (If we ever change how this works, we will have to first update many existing CG3 rules.)

Summary of special tags and relations

CG lets you define all kinds of new tags and relation names, and within CG you are free to make your own conventions as to what they mean. However, in the Divvun grammar checker system, certain CG tags and relation names have special meanings to the rest of the system. Below is a summary of the special tags/relations and their uses. In addition, note that all divvun error tags need to start with the & character, but apart from that you are free to name errors however you like, as long as they don't conflict with the special tags below.

Tags

  • SUGGEST on a reading means that divvun-suggest should try to generate this reading into a form for suggestions, using the generator FST. See Simple grammarchecker.cg3 rules.

  • SUGGESTWF on a reading means that divvun-suggest should use the reading's wordform-tag (e.g. a tag like

    "<Cupertino>"
    

    on a reading, not as the first line of a cohort) as a suggestion. See Including spelling errors.

  • <spelled> is added by divvun-cgspell to any suggestions it makes. See Including spelling errors.

  • co& is a tag prefix – co& marks a reading as a non-central part of the underline of &errtag, see Deleting words and related sections. (You may also see COERROR &errtag or &LINK &errtag in older rules; this was the old way of writing co&errtag.)

  • &ADDED means this cohort was added (typically with ADDCOHORT) and should be a part of the suggestion for the error. It will appear after the blank of the preceding cohort, and will not be the central cohort of the error. See Adding words.

  • &ADDED-BEFORE-BLANK is like &ADDED, except that it appears before the blank of the preceding cohort.

  • Any other tag starting with & is an error type tag, e.g. &real-hallan or &punct-rihkku, defined by the CG rule author. It should also appear in errors.xml (without the initial &) with a human-readable error message.

Relations

  • LEFT and RIGHT are used to extend the underline to added cohorts; see Adding words and Adding literal word forms, altering existing wordforms. LEFT if the added word is to the left of the error tag, RIGHT if the added word is to the right of the error tag.
  • DELETE1, DELETE2 etc. (but not just DELETE without a number) are used to say that a word in the context of this error should be deleted in the suggestion. See Deleting words.
  • $2 (and $3 etc.) are used to make wordforms in the context available to human-readable error messages in errors.xml. Note that $1 is always the wordform of the central cohort of the error (so don't add $1 as a relation). See Simple grammarchecker.cg3 rules.

Troubleshooting

If you get

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error

then your C++ compiler is too old. See Prerequisites.

If you get

configure: error: 'g++  -std=c++11 -Wall -I/usr/include/hfst/ @GLIB_CFLAGS@  -I/usr/include/ ' does not accept ISO C++11

then you may be at the receiving end of hfst/hfst#366. A workaround is to edit /usr/lib64/pkgconfig/hfst.pc and simply delete the string @GLIB_CFLAGS@.
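
On systems with GNU sed, that workaround is a one-liner (double-check the path of hfst.pc on your system first):

sudo sed -i 's/@GLIB_CFLAGS@//g' /usr/lib64/pkgconfig/hfst.pc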

References / more documentation

The architecture of systems using libdivvun is described in the GrammarCheckerDocumentation page linked under Home Page at the top of this README.

libdivvun's People

Contributors

flammie, lynnda-hill, snomos, tinodidriksen, unhammer


libdivvun's Issues

malloc(): invalid size (unsorted) in gcc-10

export CC=gcc-10
./configure 
make clean
make 
make check
divvun-checker: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
./run: line 23: 2013050 Aborted                 (core dumped) ../../src/divvun-checker "${args[@]}" < input."$n".txt > output."$n".json
FAIL run.xml (exit status: 134)

intermittent travis failures on input "seammasballat ođđa dieđuiguin. Ja vel."

The CI logs show failures in the same test, but at different stages. And on many systems that test works just fine: https://travis-ci.org/divvun/libdivvun/builds/627875780 Sometimes a "Restart job" makes it go away :-S

What's weird is that the difference is in the input to the test (text parameter):

< {"errs":[["dieđuiguin",19,29,"msyn-valency-loc-com","boasttut sátni",["diehtukorrekt"],"msyn thingy"],["vel",35,38,"double-space-before","double-space-before",[],"double-space-before"]],"text":"seammasballat ođđa dieđuiguin. Ja  vel."}

---

> {"errs":[["dieđuiguin",20,30,"msyn-valency-loc-com","boasttut sátni",["diehtukorrekt"],"msyn thingy"],["vel",36,39,"double-space-before","double-space-before",[],"double-space-before"]],"text":"seammas ballat ođđa dieđuiguin. Ja  vel."}

Error messages do not work for Finnish

To repeat, do:

echo "Hän  ei tuleee vielä..." | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | divvun-blanktag analyser-gt-whitespace.hfst | divvun-cgspell fi.zhfst | vislcg3 -g grammarchecker.bin | divvun-suggest -g generator-gt-norm.hfstol -m errors.xml -j

Result:

WARNING: No message for "double-space-before"
WARNING: No message for "typo"

despite both having messages defined in errors.xml.

CG keyword PROTECT not filtered by divvun-suggest

$ echo 'Dat ferte oainnahallot ovdal go sáhttit báhčit dan, lohká son.' | modes/trace-smegramrelease.mode
...
"<oainnahallot>"
	"oainnáhallat" V IV Imprt Pl1 Dial/-GG <W:2.30176> <WA:15.3018> <spelled> "<oainnáhallut>" PROTECT:3268 SELECT:3538 &SUGGESTWF &typo ADD:4066:spelled
typo
	"oainnihit" Ex/V TV Der/alla V Imprt Pl1 Dial/-GG <W:2.30176> <WA:17.3018> <spelled> "<oainnáhallut>" PROTECT:3268 SELECT:3538 &SUGGESTWF &typo ADD:4066:spelled
typo
	"oainnahallat" V IV Imprt Pl1 Dial/-GG <W:6.30176> <WA:15.3018> <spelled> "<oainnahallut>" PROTECT:3268 SELECT:3538 &SUGGESTWF &typo ADD:4066:spelled
typo
	"oainnihit" Ex/V TV Der/alla V Imprt Pl1 Dial/-KJ <W:13.3018> <WA:17.3018> <spelled> "<oainnáhallot>" PROTECT:3268 SELECT:3538 &real-hallan &typo &SUGGESTWF ADD:3769:msyn-hallan ADD:4066:spelled
real-hallan
typo
	"oainnihit" Ex/V TV Der/alla V <W:13.3018> <WA:17.3018> <spelled> "<oainnáhallot>" PROTECT:3268 SELECT:3538 Inf &SUGGEST ADD:3769:msyn-hallan COPY:3770:msyn-hallan
oainnihit+Ex/V+TV+Der/alla+V+PROTECT:3268+Inf	?
	"oainnahallat" V IV Imprt Du1 <W:16.3018> <WA:15.3018> <spelled> "<oainnahallu>" PROTECT:3268 SELECT:3538 &SUGGESTWF &typo ADD:4066:spelled
typo
;	"oainnahallot" ? SELECT:3538

But:

$ echo 'Dat ferte oainnahallot ovdal go sáhttit báhčit dan, lohká son.' | modes/smegramrelease.mode
"<oainnahallot>"
	"oainnáhallat" V IV Imprt Pl1 Dial/-GG <W:2.30176> <WA:15.3018> <spelled> "<oainnáhallut>" &SUGGESTWF &typo
typo
	"oainnihit" Ex/V TV Der/alla V Imprt Pl1 Dial/-GG <W:2.30176> <WA:17.3018> <spelled> "<oainnáhallut>" &SUGGESTWF &typo
typo
	"oainnahallat" V IV Imprt Pl1 Dial/-GG <W:6.30176> <WA:15.3018> <spelled> "<oainnahallut>" &SUGGESTWF &typo
typo
	"oainnihit" Ex/V TV Der/alla V Imprt Pl1 Dial/-KJ <W:13.3018> <WA:17.3018> <spelled> "<oainnáhallot>" &real-hallan &typo &SUGGESTWF
real-hallan
typo
	"oainnihit" Ex/V TV Der/alla V <W:13.3018> <WA:17.3018> <spelled> "<oainnáhallot>" Inf &SUGGEST
oainnihit+Ex/V+TV+Der/alla+V+Inf	oainnáhallat
	"oainnahallat" V IV Imprt Du1 <W:16.3018> <WA:15.3018> <spelled> "<oainnahallu>" &SUGGESTWF &typo
typo

Probably UNPROTECT also.

locale("") gives 'collate_byname<char>::collate_byname failed to construct for ' on LO on mac

The https://github.com/divvun/libreoffice-divvun extension built with libdivvun 0.3.3 crashes/hangs in LibreOffice on mac.

After compiling a debug build of LO we see the following output:

…
warn:linguistic:67283:363302:linguistic/source/gciterator.cxx:679: GrammarCheckingIterator::DequeueAndCheck ignoring N3com3sun4star3uno9ExceptionE msg: C++ code threw St13runtime_error: collate_byname<char>::collate_byname failed to construct for 
…
warn:legacy.osl:67283:361460:sw/source/uibase/dialog/SwSpellDialogChildWindow.cxx:430: ApplyChangedSentence in initial call or after resume

Running that through lldb (make debugrun in LO) and fiddling with breakpoints (first b linguistic/source/gciterator.cxx:607, c, then break set -E cxx and c; bt; c; bt …), we see that the error thrown is

 thread #18, name = 'GrammarCheckingIterator', stop reason = breakpoint 2.2
    frame #0: 0x00007fff7767d80b libc++abi.dylib`__cxa_rethrow
libc++abi.dylib`__cxa_rethrow:
->  0x7fff7767d80b <+0>: pushq  %rbp
    0x7fff7767d80c <+1>: movq   %rsp, %rbp
    0x7fff7767d80f <+4>: pushq  %r15
    0x7fff7767d811 <+6>: pushq  %r14
Target 0: (soffice) stopped.
(lldb) bt
* thread #18, name = 'GrammarCheckingIterator', stop reason = breakpoint 2.2
  * frame #0: 0x00007fff7767d80b libc++abi.dylib`__cxa_rethrow
    frame #1: 0x00007fff77653ffe libc++.1.dylib`std::__1::locale::__imp::__imp(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long) + 2162
    frame #2: 0x00007fff776553c8 libc++.1.dylib`std::__1::locale::locale(char const*) + 186
    frame #3: 0x00000001763d8389 libdivvun.0.dylib`divvun::getCasing(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 73

i.e. the reason seems to be that in getCasing we call std::locale("") which then fails with collate_byname<char>::collate_byname failed to construct for .

But why does it work when running LO-bundled python-interpreter on the LO-extension from the command line? (And shouldn't the "" be a valid fallback locale?)

Improved case handling

Input text in ALLUPPER is now always flagged, because the analyser can't handle it and because the speller also does not recognise it. At a minimum we need to use the same case handling for the speller as is done by other spellers, so that if the analyser won't recognise the word, the speller will (and at the same time provide an analysis for the word).

The ideal solution would be to have hfst-pmatch handle this using one of the case functions, but only if the size of the resulting fst does not blow up.

See also issue #3.

flushing test doesn't work on mac

FAIL: ./run-flushing
====================

Timeout: aborting command ``awk'' with signal 9
./run-flushing: line 32: 78779 Killed: 9               timeout 1 awk 'BEGIN{RS="\0"}{printf "%s", $0;exit}' 0<&4 > output-flushing"${base}"json
divvun-suggest flushing read failed with 137, output is:
./run-flushing: line 52: 78776 Terminated: 15          ../../src/divvun-suggest --json generator.hfstol errors.xml < "${to}" > "${from}" 2> "${tmpd}/err"
FAIL run-flushing (exit status: 1)


(where timeout is gnu gtimeout)

refer to form of *suggestion* in errors.xml

There is currently no way to refer to a (dynamically generated) suggestion in errors.xml messages. We have $1 for the current form, $2 (and $3 etc.) for context words marked using ADDRELATION ($2), but we can't yet refer to the &SUGGEST-ed form.

There may be several suggested forms, but for now we can have a feature that just picks an arbitrary one.

No speller applied to initial upper-case words

It looks like the grammar checker skips the speller if the input word starts with a capital:

‘mas Gouvdageainnus eai beasa’ vs ‘mas gouvdageainnus eai beasa’

This seems a bit too restricted. Could that be changed?

Add pipeline attr 'status'

Add a new attribute to the pipeline element of the pipespec xml dtd, named 'status'. The content of the attribute is a closed list of [default|released|beta|closed]. The idea is to indicate whether a pipeline should be exposed in user applications like LibreOffice, MS Office etc.

divvun-normaliser

Draft specification here.

Tasks:

  • Add support for analyser
  • Add support for generator
  • Add support for normaliser
  • Add support for tag filtering
  • Proper output formatting
  • store the original lemma in a tag string in the same reading, replacing it with the normalized lemma
  • #58

Autocorrect mode for divvun-checker

For automated tests where we only care about whether we got the expected output, it would be nice to have an autocorrect mode. That is, the tool should take the input text, find possible errors, apply the best/first suggestion in each case, and print out the corrected text. The output should be identical to the input text except for the corrections made.

One could consider a sub-option for whether to do single-pass or multi-pass corrections, i.e. whether to try to find more errors after the first pass until no more errors are found. The default should be single-pass.

API-fn to query available languages per pipeline "type"

Users might want to ask not just "what are all the checker pipelines installed", but "what languages have the punctuation pipelines" or "what languages have plain analysis pipelines".

Currently, the dtd gives a type on the pipeline element: https://github.com/divvun/libdivvun/blob/master/src/pipespec.dtd#L42 but it's free-form and we don't actually use it yet.

But with a free-form attribute, pipespec writers might put in type="Punctutaion" or type="Grammar Checker", which ≠ type="Grammar checker" ≠ type="Grammar checking" etc. Not very queryable. OTOH, it might be nice to have some freedom sometimes – we don't want to constrain pipespec writers from inventing completely new things (type="phplint" using only CG? type="Cheek-movement-to-text"?).

Maybe we could let type be a closed set, and then the type Other requires an attribute for the freeform field like other-type="Blockchain parser"

cgspell <spellskip> unpredictable and untoggleable

Pipe: echo ... | kal-tokenise | divvun-cgspell -n 5 kl.zhfst

Input illu oqaaseq Vester wjgxyzæøå xyzæøå. yields output with <spellskip>:

"<illu>"
        "ih" Interj LU
        "illu" N Abs Sg
"<oqaaseq>"
        "oqaaseq" N Abs Sg
        "oqar" Gram/IV USIQ Der/vn N Abs Sg
        "oqar" Gram/TV Gram/Refl USIQ Der/vn N Abs Sg
"<Vester>"
        "Vester" ?
        "Vester" ? <spellskip>
"<wjgxyzæøå>"
        "wjgxyzæøå" ?
        "wjgxyzæøå" ? <spellskip>
"<xyzæøå>"
        "xyzæøå" ?
        "xyzæøå" ? <spellskip>
"<.>"
        "." CLB

But input Vester wjgxyzæøå xyzæøå. by itself does not give <spellskip>:

"<Vester>"
        "Vester" ?
        "mester" <W:10> <WA:0> <spelled> "<mester>"
        "bister" <W:20> <WA:0> <spelled> "<bister>"
        "center" <W:20> <WA:0> <spelled> "<center>"
        "meter" <W:20> <WA:0> <spelled> "<meter>"
        "festeq" <W:20> <WA:0> <spelled> "<festeq>"
"<wjgxyzæøå>"
        "wjgxyzæøå" ?
"<xyzæøå>"
        "xyzæøå" ?
        "xxxyzæøå" <W:20> <WA:0> <spelled> "<xxxyzæøå>"
        "xxxzæøå" <W:20> <WA:0> <spelled> "<xxxzæøå>"
        "xxxæøå" <W:20> <WA:0> <spelled> "<xxxæøå>"

Neither does it <spellskip> if the input is illu Vester wjgxyzæøå xyzæøå. or Vester wjgxyzæøå xyzæøå illu.

I would like a way to disable this <spellskip> logic, because it is not cgspell's job to potentially detect foreign quotes or whatever it is trying to do.

divvun-checker --json not documented by --help

divvun-checker --help gives, among other things, the following:

  -z, --null-flush     (Ignored, we always flush on <STREAMCMD:FLUSH>,
                       outputting \0 if --json).

but --json is not mentioned or documented anywhere in the --help output. Could there be other options that are not documented?

Hidden dependency: libarchive

./configure --enable-checker --enable-cgspell --enable-python-bindings ran fine, but the build said:

In file included from checker.cpp:19:
In file included from ./pipeline.hpp:29:
./pipespec.hpp:50:10: fatal error: 'archive.h' file not found

configure needs to check for a usable libarchive.

Plain C interface / Rust bindings

I see neither Rust nor plain C on http://swig.org/'s list of "Exits" (there was a 2012 GSoC project on a C target, but I don't see any mention that it was ever merged, and it was listed as experimental 14 days ago).

mbrubeck says bindgen might be able to create bindings from Rust to C++: https://rust-lang.github.io/rust-bindgen/cpp.html – there are some caveats there, so I guess it's a matter of Try It And See for someone who's good with Rust.

So the alternatives seem to be:

  1. manually write a C interface with extern C etc. It'd work, but be annoying to maintain two sets of interfaces (maybe less typesafe than SWIG too)
  2. use SWIG, if/when the C target works and bind to C from Rust
  3. use Rust bindgen, if that works (no C interface though)

(based on conversation here https://giella.zulipchat.com/#narrow/stream/129301-gramcheck/topic/libdivvun.20C.20interface )

divvun-checker does not give same speller output as the corresponding bash pipe

The following command:

$ echo 'Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.  romssa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat.' | tools/grammarcheckers/modes/smegram8-gc.mode

gives the following output for the misspelled word + context:

"<ja>"
	"ja" CC <W:0.0000000000> <sme> @CNP #16->16
: 
"<8>"
	"8" Num Arab Sg Nom <W:0> <sme> #17->17
"<.>"
	"." CLB <W:0> #18->18
:  

"<romssa>"
	"Romsa" N Prop Sem/Plc Sg Acc <W:19.7939> <WA:9.79395> <spelled> "<Romssa>" <doubleSpaceBefore> <sme> &double-space-before &typo &SUGGESTWF #1->1
	"Tromsa" N Prop Sem/Plc Sg Acc <W:21.8594> <WA:11.8594> <spelled> "<Tromssa>" <doubleSpaceBefore> <sme> &double-space-before &typo &SUGGESTWF #1->1
: 
"<fylkkas>"
	"fylka" N Sem/Org Sg Loc <W:0.0000000000> <sme> #2->2

Running the same text through divvun-checker with the same pipeline gives a different result:

$ echo 'Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.  romssa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat.' | divvun-checker -a tools/grammarcheckers/se.zcheck -n smegram
{"errs":[["romssa",96,102,"typo","Čállinmeattáhus",["ruossa","russa","ruvssa","Romssa","rossá","Tromssa","momssa"]],["2012s",168,173,"typo","Čállinmeattáhus",["2012:s"]],[".",196,197,"no-space-after-punct-mark","[SE] Missing space",[]]],"text":"Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.  romssa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat."}

There are two differences:

  • the speller suggestions (the bash pipe contains all and only what we want, whereas the one from divvun-checker contains a lot of noise)
  • the &double-space-before error has disappeared in the divvun-checker output

I have used the latest code of hfst and sme, but not libdivvun. Could that be it?

macOS pristine run-python-bindings fails

The probably important part is Library not loaded: /usr/local/lib/libdivvun.0.dylib; the test seems to expect a globally installed library, but it hasn't been installed yet.

FAIL: ./run-python-bindings
===========================

running install
running build
running build_py
running build_ext
running install_lib
creating /tmp/libdivvun/test/checker/python-build
copying build/lib.macosx-10.13-x86_64-3.6/_libdivvun.cpython-3.6m.so -> /tmp/libdivvun/test/checker/python-build
copying build/lib.macosx-10.13-x86_64-3.6/_libdivvun.cpython-36m-darwin.so -> /tmp/libdivvun/test/checker/python-build
copying build/lib.macosx-10.13-x86_64-3.6/libdivvun.py -> /tmp/libdivvun/test/checker/python-build
byte-compiling /tmp/libdivvun/test/checker/python-build/libdivvun.py to libdivvun.cpython-36.pyc
running install_egg_info
Writing /tmp/libdivvun/test/checker/python-build/python_divvun_gramcheck-0.2.0-py3.6.egg-info
Traceback (most recent call last):
  File "/tmp/libdivvun/test/checker/python-build/libdivvun.py", line 14, in swig_import_helper
    return importlib.import_module(mname)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 658, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: dlopen(/tmp/libdivvun/test/checker/python-build/_libdivvun.cpython-36m-darwin.so, 2): Library not loaded: /usr/local/lib/libdivvun.0.dylib
  Referenced from: /tmp/libdivvun/test/checker/python-build/_libdivvun.cpython-36m-darwin.so
  Reason: image not found
$ otool -L /tmp/libdivvun/test/checker/python-build/_libdivvun.cpython-36m-darwin.so
/tmp/libdivvun/test/checker/python-build/_libdivvun.cpython-36m-darwin.so:
        /usr/local/lib/libdivvun.0.dylib (compatibility version 1.0.0, current version 1.0.0)

Might need fiddling with install_name_tool or DYLD_LIBRARY_PATH.

divvun-suggest -j mangles speller suggestions

Given the following command:

$ echo "Muhto go beaivi jávkkai vári duohkai ja giđđaija ilbmi luoitáldii gilli badjel, vázzái son viidáseappot davasguvlui vuovddis." | hfst-tokenise --giella-cg tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g mwe-dis.bin | cg-mwesplit | divvun-cgspell -a se.zhfst | vislcg3 -g disambiguator.bin | vislcg3 -g grammarchecker.bin | divvun-suggest -g generator-gt-norm.hfstol -m errors.xml -j

the output is the following (with added linewraps for readability):

{"errs":
  [
   ["ja",37,39,"default","default",[", ja"]],
   ["luoitáldii",55,65,"typo","typo",["luoitádit"]],
   ["gilli",66,71,"msyn-gen-before-postp","Iskka genitiivva alege nominatiivva",["gili"]],
   ["davasguvlui",104,115,"typo","typo",["","davveguvlui","davviguvlui","divaguvlui","divatguvlui","lagasguvlui"]]
  ],
  "text":
  "Muhto go beaivi jávkkai vári duohkai ja giđđaija ilbmi luoitáldii gilli badjel, vázzái son viidáseappot davasguvlui vuovddis.\n"
}

Compare the suggestion list for the spelling error davasguvlui with the output from the following command:

$ echo "Muhto go beaivi jávkkai vári duohkai ja giđđaija ilbmi luoitáldii gilli badjel, vázzái son viidáseappot davasguvlui vuovddis." | hfst-tokenise --giella-cg tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g mwe-dis.bin | cg-mwesplit | divvun-cgspell -a se.zhfst | vislcg3 -g disambiguator.bin | vislcg3 -g grammarchecker.bin | divvun-suggest -g generator-gt-norm.hfstol -m errors.xml

Output:

...
"<viidáseappot>"
	"viiddis" A Comp Attr Err/Orth <W:0.0000000000> @>N #17->17
	"viidáseappot" v1 Adv Comp <W:0.0000000000> @<ADVL #17->17
: 
"<davasguvlui>"
	"davasguvlui" ? #18->18
	"davás guvlui" Adv <W:9.30176> <WA:15.3018> <spelled> "<davás guvlui>" @<ADVL &SUGGESTWF &typo #18->18
typo
	"divatguovlu" N Sg Ill <W:35.3018> <WA:15.3018> <spelled> "<divatguvlui>" @<ADVL &SUGGESTWF &typo #18->18
typo
	"davviguovlu" N Sg Ill <W:35.3018> <WA:15.3018> <spelled> "<davviguvlui>" @<ADVL &SUGGESTWF &typo #18->18
typo
	"davveguovlu" N Sg Ill <W:35.3018> <WA:15.3018> <spelled> "<davveguvlui>" @<ADVL &SUGGESTWF &typo #18->18
typo
	"divaguovlu" N Sg Ill <W:35.3018> <WA:15.3018> <spelled> "<divaguvlui>" @<ADVL &SUGGESTWF &typo #18->18
typo
	"lagasguovlu" N Sg Ill <W:35.3018> <WA:15.3018> <spelled> "<lagasguvlui>" @<ADVL &SUGGESTWF &typo #18->18
typo
: 
"<vuovddis>"
	"vuovdi" N Sem/Plc Sg Loc <W:0.0000000000> @<ADVL #19->19
...

It seems that the JSON formatting first deletes the string content of suggestions containing spaces, and then sorts the suggestion list alphabetically. Both are of course unwanted :-)

This behavior is also seen in the divvun-checker program.

Weights from the CG speller are integers, should be float

With the following command:

$ echo "Nu go mii dieehttit de lea guovddáš eiseváldit, SND čađa, vuoruhan doarjaga fatnasiidda mat leat stuorát go 15 mehtar." | hfst-tokenise --giella-cg tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g valency.bin | vislcg3 -g mwe-dis.bin | cg-mwesplit | divvun-cgspell -a se.zhfst

one gets output like this:

"<Nu>"
	"nu" Adv <W:0.0000000000>
	"nu" Pcle <W:0.0000000000>
: 
"<go>"
	"go" CS <W:0.0000000000>
	"go" Pcle Qst <W:0.0000000000>
: 
"<mii>"
	"mii" Pron Indef Sg Nom <W:0.0000000000>
	"mii" Pron Interr Sg Nom <W:0.0000000000>
	"mii" Pron Rel Sg Nom <W:0.0000000000>
	"mun" Pron Pers Pl1 Nom <W:0.0000000000>
: 
"<dieehttit>"
	"dieehttit" ?
	"diehtit" V Ind Prs Pl1 <W:15848> <WA:8848> <spelled> "<diehtit>"
	"diehtit" V Inf <W:15848> <WA:8848> <spelled> "<diehtit>"
	"diehtit" V Imprt Pl2 <W:23203> <WA:13203> <spelled> "<diehttit>"
	"diehtti" N NomAg Pl Nom <W:23203> <WA:13203> <spelled> "<diehttit>"
	"diehtit" V Der/NomAg N Pl Nom <W:23203> <WA:15203> <spelled> "<diehttit>"
	"diehtit" V Imprt Du2 <W:35301> <WA:15301> <spelled> "<diehtti>"
	"diehtti" N NomAg Sg Acc <W:35301> <WA:15301> <spelled> "<diehtti>"
	"diehtti" N NomAg Sg Nom <W:35301> <WA:15301> <spelled> "<diehtti>"
	"diehtti" N NomAg Sg Gen <W:35301> <WA:15301> <spelled> "<diehtti>"

Compare this with the weights from the following command:

$ echo dieehttit | hfst-ospell -S ../spellcheckers/fstbased/desktop/hfst/se.zhfst
"dieehttit" is NOT in the lexicon:
Corrections for "dieehttit":
diehtit    15.848633
diehttit    23.203125
diehtti    35.301758
diehttis    35.301758
diehttut    35.301758
hiehttit    35.301758

That is, it seems that the weights from the CG speller have been multiplied by 1000 and then truncated to make them integers. This used to be the case also for the --giella-cg mode in hfst-tokenise, but now that CG supports floats in weights, there's no reason to do it. And to get consistent processing, it is important to keep the weights from the different processing steps on the same scale.
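As a quick check of the scaling claim, using the numbers above (plain Python):

>>> int(15.848633 * 1000)
15848
>>> int(23.203125 * 1000)
23203
>>> int(35.301758 * 1000)
35301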

Overlapping errors cause bad suggestions

$ echo "Ii oktage dieđe gean lea ovddasvástadus ." | divvun-checker -a tools/grammarcheckers/se.zcheck | jq .
{
  "errs": [
    [
      "ovddasvástadus",
      25,
      39,
      "typo",
      "Ii leat sátnelisttus",
      [
        "ovddasfástádus",
        "ovddasvástádus"
      ],
      "Čállinmeattáhusat"
    ],
    [
      ".",
      40,
      41,
      "space-before-punct-mark",
      "Lea gaska \".\" ovddas",
      [
        "ovddasvástadus."
      ],
      "Sátnegaskameattáhusat"
    ]
  ],
  "text": "Ii oktage dieđe gean lea ovddasvástadus ."
}

The punctuation error contains the preceding word (uncorrected) as part of the correction suggestion, while the spelling error corrects the same word independently. The end result – at least when running automatically / unsupervised – is that the misspelled word gets duplicated. This makes automated testing much harder.

macOS MacPorts run-python-bindings fail

Happens on macOS with Python 3.10 and other build helpers from MacPorts.

FAIL: ./run-python-bindings
===========================

running install
/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/opt/local/bin/python3 -E -c pass
TEST FAILED: /private/tmp/build/nightly/libdivvun/libdivvun-0.3.10+g518~fd5c4c6a/test/checker/python-build/ does NOT support .pth files
bad install directory or PYTHONPATH

You are attempting to install a package to a directory that is not
on PYTHONPATH and which Python does not read ".pth" files from.  The
installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /private/tmp/build/nightly/libdivvun/libdivvun-0.3.10+g518~fd5c4c6a/test/checker/python-build/

and your PYTHONPATH environment variable currently contains:

    '/usr/local/lib/python3.10/site-packages'

Here are some of your options for correcting the problem:

* You can choose a different installation directory, i.e., one that is
  on PYTHONPATH or supports .pth files

* You can add the installation directory to the PYTHONPATH environment
  variable.  (It must then also be on PYTHONPATH whenever you run
  Python and want to use the package(s) you are installing.)

* You can set up the installation directory to support ".pth" files by
  using one of the approaches described here:

  https://setuptools.pypa.io/en/latest/deprecated/easy_install.html#custom-installation-locations


Please make the appropriate changes for your system and try again.
running bdist_egg
running egg_info
creating python_divvun_gramcheck.egg-info
writing manifest file 'python_divvun_gramcheck.egg-info/SOURCES.txt'
writing manifest file 'python_divvun_gramcheck.egg-info/SOURCES.txt'
running install_lib
running build_py
running build_ext
creating build/bdist.macosx-10.15-x86_64
creating build/bdist.macosx-10.15-x86_64/egg
byte-compiling build/bdist.macosx-10.15-x86_64/egg/libdivvun.py to libdivvun.cpython-310.pyc
byte-compiling build/bdist.macosx-10.15-x86_64/egg/_libdivvun.py to _libdivvun.cpython-310.pyc
creating build/bdist.macosx-10.15-x86_64/egg/EGG-INFO
copying python_divvun_gramcheck.egg-info/PKG-INFO -> build/bdist.macosx-10.15-x86_64/egg/EGG-INFO
copying python_divvun_gramcheck.egg-info/SOURCES.txt -> build/bdist.macosx-10.15-x86_64/egg/EGG-INFO
copying python_divvun_gramcheck.egg-info/dependency_links.txt -> build/bdist.macosx-10.15-x86_64/egg/EGG-INFO
copying python_divvun_gramcheck.egg-info/top_level.txt -> build/bdist.macosx-10.15-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
__pycache__._libdivvun.cpython-310: module references __file__
creating dist
removing 'build/bdist.macosx-10.15-x86_64/egg' (and everything under it)
creating /private/tmp/build/nightly/libdivvun/libdivvun-0.3.10+g518~fd5c4c6a/test/checker/python-build/python_divvun_gramcheck-0.3.10-py3.10-macosx-10.15-x86_64.egg
Extracting python_divvun_gramcheck-0.3.10-py3.10-macosx-10.15-x86_64.egg to /private/tmp/build/nightly/libdivvun/libdivvun-0.3.10+g518~fd5c4c6a/test/checker/python-build
byte-compiling /private/tmp/build/nightly/libdivvun/libdivvun-0.3.10+g518~fd5c4c6a/test/checker/python-build/python_divvun_gramcheck-0.3.10-py3.10-macosx-10.15-x86_64.egg/_libdivvun.py to _libdivvun.cpython-310.pyc
byte-compiling /private/tmp/build/nightly/libdivvun/libdivvun-0.3.10+g518~fd5c4c6a/test/checker/python-build/python_divvun_gramcheck-0.3.10-py3.10-macosx-10.15-x86_64.egg/libdivvun.py to libdivvun.cpython-310.pyc

Installed /private/tmp/build/nightly/libdivvun/libdivvun-0.3.10+g518~fd5c4c6a/test/checker/python-build/python_divvun_gramcheck-0.3.10-py3.10-macosx-10.15-x86_64.egg
Processing dependencies for python-divvun-gramcheck==0.3.10
Finished processing dependencies for python-divvun-gramcheck==0.3.10
error: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/otool-classic: can't open file: _libdivvun.cpython-310-darwin.so (No such file or directory)
FAIL run-python-bindings (exit status: 1)

divvun-blanktag needs symbols for end-of-stream and start-of-stream

The following command:

echo 'Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.    Romsa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat.' | divvun-checker -a tools/grammarcheckers/se.zcheck -n smegram

gives the following JSON output:

{"errs":[["Romsa",98,103,"double-space-before","[SE] Double space",[]],["2012s",169,174,"typo","Čállinmeattáhus",["2012:s"]],[".",197,198,"no-space-after-punct-mark","[SE] Missing space",[]]],"text":"Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.    Romsa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat."}

Note the error message for the final full stop. The regex that triggers this error is:

[ ?*        {"<,>"}          ]:[ "<NoSpaceAfterPunctMark>"]

in the file sme/tools/grammarcheckers/analyser-gt-whitespace.regex. The regex works fine in all other cases. How can we avoid it matching end-of-paragraph full stops?

Memory map file loading

Grammar checking is using quite a lot of RAM on our Divvun API server:

(screenshot of RAM usage on the Divvun API server)

We've mitigated this for the spellchecking in DivvunSpell by using mmap instead of loading data into RAM, with minimal performance penalty in our use cases. Is this something that can be implemented for these grammar checking pipelines?

Status for pmatch-based analysis/tokenisation

Issues:

  • Ambiguous input
    • Seems to work fine
  • Ambiguous multiword expressions with ambiguous tokenisation
    • Seems to work – represented within lexc now; hfst-tokenise also
      supports forms on the analyses now
  • Ambiguous multiword expressions need reorganising after CG
    • The module cg-mwesplit takes wordforms from readings and turns them into
      new cohorts
  • Unknown words
    • The set-difference method only works for words without
      flag diacritics (even though we should be working only on the form-side?)
      and leads to binary blow-up: With only lower unknowns, we get 45M;
      lower+upper gives 67M, while no unknowns gives 27M
    • Fixed instead by treating empty analyses as unknown-tokens in
      hfst-tokenise, and outputting unmatched strings with a prefix
  • Treat input that's within superblanks as unmatched
    • probably requires a change in hfst-tokenise itself
  • Try >1 space for ambiguous MWEs? – represented within lexc now
  • Try set-difference-unknowns method with regular hfst commands?

Moved here from top of gramcheck tokeniser header.

@unhammer, @lynnda-hill - FYI

divvun-checker --spec <spec> --variant <variant> crashes

On Linux this is the result of the command

❯ divvun-checker --spec tools/grammarcheckers/pipespec.xml --variant smegramrelease
terminate called after throwing an instance of 'TransducerHeaderException'
Aborted

divvun-checker also crashes on macOS when using these options.

Speller suggestions removed by divvun-checker

Compare this:

$ echo De lea jearaldat goabbá orrru eanemus jierpmálaš | ./grammarcheckers/modes/trace-smegram8-gc.mode | divvun-suggest -j -g grammarcheckers/generator-gramcheck-gt-norm.hfstol -m grammarcheckers/errors.xml | jq .
{
  "errs": [
    [
      "orrru",
      24,
      29,
      "typo",
      "Ii leat sátnelisttus",
      [
        "orro",
        "orru"
      ],
      "Čállinmeattáhusat"
    ]
  ],
  "text": "De lea jearaldat goabbá orrru eanemus jierpmálaš\n"
}

with this:

$ echo De lea jearaldat goabbá orrru eanemus jierpmálaš | divvun-checker -a grammarcheckers/se.zcheck | jq .
{
  "errs": [
    [
      "orrru",
      24,
      29,
      "typo",
      "Ii leat sátnelisttus",
      [],
      "Čállinmeattáhusat"
    ]
  ],
  "text": "De lea jearaldat goabbá orrru eanemus jierpmálaš"
}

The speller suggestions have disappeared from the last one.

$ divvun-checker -V
divvun-checker - Divvun gramcheck version 0.3.2

Error message without error tag

The following works just fine:

$ echo 'gávdno (bearaš, ustibat jna.)' | modes/smegramrelease.mode
"<gávdno>"
	"gávdnot" V <TH-Nom-Any> <LO-luhtte-Any> <LO-Loc-Plc> IV Imprt Sg2 <W:0.0> @+FMAINV
: 
"<(>"
	"(" PUNCT LEFT <W:0.0>
"<bearaš>"
	"bearaš" N Sem/Group_Hum Sg Nom <W:0.0> @<SUBJ
"<,>"
	"," CLB <W:0.0>
: 
"<ustibat>"
	"ustit" N Sem/Hum Pl Nom <W:0.0> @<SUBJ
: 
"<jna.>"
	"jna" Adv ABBR Gram/IAbbr <W:0.0> <NoSpaceAfterPunctMark> @<ADVL
"<)>"
	")" PUNCT RIGHT <W:0.0> <LastCohortOfParagraph>
:\n

But when processed by divvun-checker, the same sentence (and analysis) results in the following error flagging:

$ echo 'gávdno (bearaš, ustibat jna.)' | divvun-checker -a se.zcheck | jq .
{
  "errs": [
    [
      ")",
      28,
      29,
      "no-space-after-parent-end",
      "[SE] Parenthesis missing space",
      [],
      "[SE] Parenthesis missing space"
    ]
  ],
  "text": "gávdno (bearaš, ustibat jna.)"
}

Why?

blanktag tag order differs on different systems

<TinoDidriksen> tag order differs in test output - this is against HFST built with external OpenFST and Foma, and -std=c++17. Does something maybe rely on the order of unordered_set?

make[5]: Entering directory '/build/libdivvun-0.3.10+g493~9db3c2a6-1~impish1/test/blanktag'
FAIL: run
===========================================================
   Divvun gramcheck 0.3.10: test/blanktag/test-suite.log
===========================================================

# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: ./run
===========

3,4c3,4
< 	")" RPAREN @EOP <firstWord> <spaceBeforeParenEnd>
< 	")" RPAREN @EMO <firstWord> <spaceBeforeParenEnd>
---
> 	")" RPAREN @EOP <spaceBeforeParenEnd> <firstWord>
> 	")" RPAREN @EMO <spaceBeforeParenEnd> <firstWord>
10,11c10,11
< 	"be" V Pret "<was>" <firstWord> <firstWordOfParagraph>
< 		"it" Prn 3Sg Neut "<T>" <firstWord> <firstWordOfParagraph>
---
> 	"be" V Pret "<was>" <firstWordOfParagraph> <firstWord>
> 		"it" Prn 3Sg Neut "<T>" <firstWordOfParagraph> <firstWord>
FAIL run (exit status: 1)

============================================================================
Testsuite summary for Divvun gramcheck 0.3.10
============================================================================
# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See test/blanktag/test-suite.log
Please report to [email protected]
============================================================================
make[5]: Leaving directory '/build/libdivvun-0.3.10+g493~9db3c2a6-1~impish1/test/blanktag'

Bad underlining, suggestions for double-space err after misspelling

[This bug could also be caused by the CG rules, but I am posting it here to begin with]

Given the command (there should be two spaces after the first word of the input sentence):

echo "Riggeroahkká lea gitta suomuoras, ja vuohkui mii heaŋga vuolimuš sáhttá bidjat gáfegievnni dahje málesgievnni." | tools/grammarcheckers/modes/trace-smegramrelease.mode

one gets the following output:

"<Riggeroahkká>"
	"reahkká" N Sem/Ani Sg Nom <W:37.3018> <WA:17.3018> <spelled> "<diggereahkká>" PROTECT:3268 SELECT:3578:r868 &double-space-before ID:1 ADD:4001:double-space-before-link ADD:4033:spelled
		"diggi" N Sem/Org Cmp/SgNom Cmp ID:1
double-space-before
	"reahkká" N Sem/Ani Sg Nom <W:37.3018> <WA:17.3018> <spelled> "<diggereahkká>" PROTECT:3268 SELECT:3578:r868 &LINK &typo &SUGGESTWF ID:1 ADD:4001:double-space-before-link ADDRELATION($2):4002:double-space-before-rel ADDRELATION(LEFT):4003:double-space-before-rel ADD:4033:spelled
		"diggi" N Sem/Org Cmp/SgNom Cmp ID:1
typo
	"roahkki" N Sem/Dummytag Sg Nom <W:37.3018> <WA:17.3018> <spelled> "<diggeroahkki>" PROTECT:3268 SELECT:3578:r868 &LINK &double-space-before &typo &SUGGESTWF ID:1 ADD:4001:double-space-before-link ADD:4033:spelled
		"diggi" N Sem/Org Cmp/SgNom Cmp ID:1
double-space-before
typo
...
;	"Riggeroahkká" ? SELECT:3578:r868
:  
"<lea>"
	"leat" V <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> IV Ind Prs Sg3 <W:0.0> <doubleSpaceBefore> SUBSTITUTE:2764 SUBSTITUTE:2772 SUBSTITUTE:2979 SUBSTITUTE:3083 SUBSTITUTE:3121 SUBSTITUTE:3494 SUBSTITUTE:3593 SUBSTITUTE:3598 SUBSTITUTE:3605 SUBSTITUTE:3616 SUBSTITUTE:3704 SUBSTITUTE:3834 SUBSTITUTE:3845 SUBSTITUTE:3849 SUBSTITUTE:3999 SUBSTITUTE:4158 SUBSTITUTE:4169 SUBSTITUTE:4281 SUBSTITUTE:4286 SUBSTITUTE:4299 SUBSTITUTE:4306 SUBSTITUTE:4312 SUBSTITUTE:4539 SUBSTITUTE:4637 SUBSTITUTE:4698 SUBSTITUTE:4721 SUBSTITUTE:4727 SUBSTITUTE:4759 SUBSTITUTE:4875 @+FMAINV MAP:15822 &double-space-before ID:2 R:$2:1 R:LEFT:1 ADD:4000:double-space-before
double-space-before
	"leat" V <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> IV Ind Prs Sg3 <W:0.0> <doubleSpaceBefore> SUBSTITUTE:2764 SUBSTITUTE:2772 SUBSTITUTE:2979 SUBSTITUTE:3083 SUBSTITUTE:3121 SUBSTITUTE:3494 SUBSTITUTE:3593 SUBSTITUTE:3598 SUBSTITUTE:3605 SUBSTITUTE:3616 SUBSTITUTE:3704 SUBSTITUTE:3834 SUBSTITUTE:3845 SUBSTITUTE:3849 SUBSTITUTE:3999 SUBSTITUTE:4158 SUBSTITUTE:4169 SUBSTITUTE:4281 SUBSTITUTE:4286 SUBSTITUTE:4299 SUBSTITUTE:4306 SUBSTITUTE:4312 SUBSTITUTE:4539 SUBSTITUTE:4637 SUBSTITUTE:4698 SUBSTITUTE:4721 SUBSTITUTE:4727 SUBSTITUTE:4759 SUBSTITUTE:4875 @+FMAINV MAP:15822 "<diggereahkká lea>" &double-space-before &SUGGESTWF ID:2 R:$2:1 R:LEFT:1 ADD:4000:double-space-before COPY:4004:double-space-before
double-space-before
: 
"<gitta>"
	"gitta" Adv <W:0.0> @<ADVL MAP:22784

In LibreOffice it looks like the following:

(screenshot of the underlining in LibreOffice, 2019-08-14 09:29)

When applying the suggested correction, the result is the following:

(screenshot of the result after applying the suggestion, 2019-08-14 09:39)

The relevant CG rule is this (in the file tools/grammarcheckers/grammarchecker-release.cg3):

COPY:double-space-before ("<$2 $1>"v &SUGGESTWF) TARGET ("<(.*)>"r &double-space-before) IF (T:prevWordCrossSent LINK 0 ("<(.*)>"r)) (NOT 0 (&LINK)) ;

The main problem is that there is no way for the regex to differentiate between the word form of the input, and the word forms given as suggestions by the speller; it seems that it randomly selects the second speller suggestion to be included in the grammar error suggestion. There should probably be a way of saying that one wants the first matching word form (which would be the input word form).

Also, for whatever reason, the blue underlining does not cover the whole error, which leads to a circular correction pattern, as the error persists (the two spaces are not replaced, only the following word), each time with the second speller suggestion as the "added" word.

withCasing uses getCasing(cohort wf), should use getCasing(whole_underlined_wf)

If the suggestion/underline covers more than one cohort, we need to use the casing from that whole string, not just the cohort of the reading with the error tag.

If input is

"<A.>"
	"A" N Sg <NoSpaceAfterPunctMark> &no-space-after-punct-mark ID:1 R:RIGHT:2
	"A" N Sg <NoSpaceAfterPunctMark> "<A. Eira>" &no-space-after-punct-mark &SUGGESTWF ID:1 R:RIGHT:2
"<Eira>"
	"Eira" N Prop &LINK &no-space-after-punct-mark ID:2

we want &SUGGESTWF to produce A. Eira (as if it had the <fixedcase> tag).

But

	if(r.suggestwf) {
		r.sforms.emplace_back(withCasing(r.fixedcase, inputCasing, r.wf));
	}

gets called with inputCasing set by inputCasing = getCasing("A."), which is considered uppercase; it should be called with getCasing("A. Eira"), which is mixed (and leads to no case change).
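A rough Python sketch of the difference (get_casing and with_casing are simplified stand-ins for the C++ functions, not the actual implementation):

def get_casing(s):
    letters = [c for c in s if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "upper"
    if letters and all(c.islower() for c in letters):
        return "lower"
    if letters and letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "title"
    return "mixed"

def with_casing(casing, suggestion):
    # only the case relevant here; the real function handles more
    if casing == "upper":
        return suggestion.upper()
    return suggestion

# casing taken from the cohort wordform only (current behaviour):
assert get_casing("A.") == "upper"
assert with_casing("upper", "A. Eira") == "A. EIRA"   # wrong

# casing taken from the whole underlined string (wanted behaviour):
assert get_casing("A. Eira") == "mixed"
assert with_casing("mixed", "A. Eira") == "A. Eira"   # no case change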

run.xml test heisenbug

I commented out a test in test/checker/Makefile.am that breaks automated package builds but doesn't reproduce systematically on Gentoo or in a fresh hirsute Docker image built with:

FROM ubuntu:hirsute

RUN apt-get update && \
export LC_ALL=C.UTF-8 && \
apt-get install -y git build-essential autoconf automake pkg-config libtool libpugixml-dev libarchive-dev swig libxml2-utils python3 python3-dev zip language-pack-nb curl gawk && \
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | bash && \
apt-get install -y libhfst-dev hfst-ospell-dev cg3-dev && \
export LC_ALL=nb_NO.UTF-8 && \
git clone https://github.com/divvun/libdivvun /tmp/libdivvun && \
cd /tmp/libdivvun && \
./autogen.sh && \
./configure && \
make && \
make check

Some failures may show up on hirsute Docker if you run an interactive session and do for _ in $(seq 20) ; do make clean ; make check ; done -_-
