
apertium / apertium-regtest


Regression testing system for Apertium language data and translators

Home Page: https://wiki.apertium.org/wiki/Apertium-regtest

License: GNU General Public License v3.0

Languages: Python 53.59%, HTML 7.27%, CSS 1.95%, JavaScript 35.30%, Makefile 0.65%, M4 0.23%, Shell 1.03%

Topics: testing, golden-master, characterization-tests, apertium, mt, approval-testing

apertium-regtest's Introduction

apertium-regtest

Regression testing system for Apertium.

Full documentation: https://wiki.apertium.org/wiki/Apertium-regtest

Installation

It can be run as-is by invoking the apertium-regtest.py file in this directory, or installed by running

$ autoreconf -fvi
$ ./configure
# make install

which will install it as the command apertium-regtest.

Usage

Static testing

apertium-regtest test runs all tests and reports the results, exiting with error code 0 if all pass and 1 otherwise.
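
Since pass/fail is signalled through the exit code, the command slots straight into CI. A minimal sketch of gating a job on it (assuming nothing beyond apertium-regtest being on PATH):

import subprocess
import sys

# apertium-regtest exits 0 if all tests pass and 1 otherwise.
result = subprocess.run(['apertium-regtest', 'test'])
if result.returncode != 0:
    sys.exit('Regression tests failed; update them interactively before merging.')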

Interactively updating tests

Test data can be updated either from a browser or from a terminal. For browser mode, run apertium-regtest web; for terminal mode, run apertium-regtest cli.

apertium-regtest's People

Contributors: mr-martian, tinodidriksen, unhammer

apertium-regtest's Issues

show changed, final results first?

Here's a typical session of how it looks when I open the web UI after make test complained:

[video: regtest-changes.webm]

I try to hide/show unchanged: nothing happens. I try to show only generated: that doesn't seem to help. Then I remember that one of the modes uses postgen; that changes something, but I guess I'm still including unchanged things. Another hide/show unchanged. Finally I notice there's a page 2 – and there are the entries I'm after (i.e. things that changed in the generator).

I feel like something could be done to improve the flow here, but I'm not sure what. I do think that showing things that actually changed in the last step of the pipeline first in the list would help (most of the time I don't care about changes in the middle of the pipeline).

Alternate success threshold

We often want nightly builds to succeed even if they aren't perfect. It would be nice to have a way to set a different criterion for success, e.g. an envvar AP_REGTEST_MIN=80 saying 80% is good enough to pass, with the default value being 100.

It should also be settable as a command-line parameter, but envvars are easier to use in many builds.
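
A minimal sketch of how the proposed check might look (AP_REGTEST_MIN and the function name come from this issue's proposal and are illustrative, not an existing interface):

import os

def enough_passed(passing: int, total: int) -> bool:
    # Default of 100 keeps today's behaviour: every test must pass.
    threshold = float(os.environ.get('AP_REGTEST_MIN', '100'))
    return total == 0 or 100 * passing / total >= threshold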

diff acts weird with non-ASCII characters

In a bunch of different words containing æ, we get all the characters before æ diffed, but not æ or any of the characters after it. E.g. (with red parts shown in bold and green parts in italics):

  • xyXyæhargle
  • xyzXyzæbargle
  • xyzaXyzaæfoo

(Note that this issue was encountered while fiddling with modes before we had added the -g switch back, which is why it's the capitalisation that differs. That underlying issue is resolved and we are no longer getting these diffs.)
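
For illustration, diffing at the codepoint level (Python str rather than UTF-8 bytes) produces the expected single-character diff for these strings; if the UI were diffing bytes, a multi-byte character like æ could explain the misaligned highlights (a guess at the cause, not a confirmed diagnosis):

import difflib

old, new = 'xyæhargle', 'Xyæhargle'
for op, a0, a1, b0, b1 in difflib.SequenceMatcher(None, old, new).get_opcodes():
    print(op, repr(old[a0:a1]), '->', repr(new[b0:b1]))
# Prints a single replace of 'x' -> 'X'; 'æ' and everything after it match.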

Extra resources pulled from the page

It seems these resources are not really needed (at least, they do not exist in the repo).

<link rel="stylesheet" type="text/css" href="local.css">
<script src="local.js"></script>

loses accepted-choices after hitting `run`

 apertium-regtest cli

Running regression tests for apertium-sme-smj
Type `help` for a list of available commands.

Loading corpora...
Corpus sme-smj has 0 lines to be examined.
Corpus sme-smj-pending has 0 lines to be examined.
Corpus sme-smj-regression has 1 lines to be examined.
sme-smj-regression 63 of 181
INPUT:
  Dávda sáhttá dagahit šattalmasa ja oktiišaddama.
EXPECTED OUTPUT:
  Dávdda máhttá sjattalvisáv ja aktijsjaddamav dahkat.
ACTUAL OUTPUT:
  Dávdda máhttá sjattalvisáv ja aktij sjaddamav dahkat.
IDEAL OUTPUTS:
  Dávdda máhttá sjattalvissaj ja aktijsjaddamij vájkkudit.
> a
> run
Running sme-smj
Corpus sme-smj has 0 lines to be examined.
Running sme-smj-pending
Corpus sme-smj-pending has 0 lines to be examined.
Running sme-smj-regression
Corpus sme-smj-regression has 1 lines to be examined.
sme-smj-regression 63 of 181
INPUT:
  Dávda sáhttá dagahit šattalmasa ja oktiišaddama.
EXPECTED OUTPUT:
  Dávdda máhttá sjattalvisáv ja aktijsjaddamav dahkat.
ACTUAL OUTPUT:
  Dávdda máhttá sjattalvisáv ja aktij sjaddamav dahkat.
IDEAL OUTPUTS:
  Dávdda máhttá sjattalvissaj ja aktijsjaddamij vájkkudit.

Very annoying if you've carefully accepted/golded/skipped 30 things.

can't add as gold

 apertium-regtest -p 3333 web
Starting server
Open http://localhost:3333 in your browser
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 57376)
Traceback (most recent call last):
  File "/usr/lib/python3.8/socketserver.py", line 683, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.8/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/bin/apertium-regtest", line 649, in __init__
    super().__init__(request, client_address, server, directory=directory)
  File "/usr/lib/python3.8/http/server.py", line 647, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.8/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/lib/python3.8/http/server.py", line 427, in handle
    self.handle_one_request()
  File "/usr/lib/python3.8/http/server.py", line 415, in handle_one_request
    method()
  File "/usr/bin/apertium-regtest", line 662, in do_POST
    self.do_callback(urllib.parse.parse_qs(data.decode('utf-8')))
  File "/usr/bin/apertium-regtest", line 743, in do_callback
    Corpus.all_corpora[corp].set_gold(hsh, golds, stp)
  File "/usr/bin/apertium-regtest", line 556, in set_gold
    blob = self.step(step)
  File "/usr/bin/apertium-regtest", line 472, in step
    return self.data['cmds'][self.commands.get(s, -1)]
KeyError: 'cmds'
----------------------------------------

optionally show forms that correspond to the numbers in test mode

Currently apertium-regtest test has output like this:

$ apertium-regtest -c .-morph test
Corpus 1 of 3: -s plurals-morph
  13/13 (100.0%) tests pass (3/13 (23.08%) match gold)

Corpus 2 of 3: -es plurals-morph
  20/20 (100.0%) tests pass (3/20 (15.0%) match gold)

Corpus 3 of 3: irregular plurals-morph
  22/22 (100.0%) tests pass (2/22 (9.09%) match gold)

All tests pass.

It would be nice if there were a way to have output like morph-test/aq-morphtest to show forms that do and/or don't match gold (or, alternatively, tests that are and/or are not expected).

wrong order of debug-modes, irrelevant steps shown

generator should be last, autoseq between pretransfer and biltrans:
[screenshot]
And nob-dan doesn't even have interchunk/postchunk/chunker! Those steps shouldn't show when nob-dan is selected.
(The other directions still do, though.)

Default to last step

I always end up clicking the last three steps (because they're named slightly differently in the different modes) whenever I open up regtest. I always want to see the difference in output first (typically it's expected improvements, or I can tell from the final output what went wrong); only when something changed that I don't understand do I want to delve into the pipeline.

Can we default to showing the last possible step, or make it a preference cookie or something?

multiline diff

I got confused by the inline red/green colors of the diffs: is green expected, current, or gold here?

[screenshot]

It'd be nice to have (perhaps optional) multiline diffs like:

EXPECTED: ^Det<SN><@SUBJ→><nt><pl><nom>{^den<det><nt><pl><acc>$}$ …
CURRENT: ^Det<SN><@SUBJ→><nt><pl><nom>{^den<det><dem><nt><pl><acc>$}$  …
(GOLDs: …)

And notably, include the label on the left so that new users (or old users with bad memories) immediately see what's what.
(Golds are only applicable for generation, of course.)

Also, even better would be if the changed words were colored like in magit and Wikipedia diffs:
[screenshot]
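
A sketch of one way to produce the labelled layout with inline change markers using difflib (the marker syntax and function name are illustrative, not anything regtest currently has):

import difflib

def labeled_diff(expected: str, current: str) -> str:
    exp_parts, cur_parts = [], []
    sm = difflib.SequenceMatcher(None, expected, current)
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op == 'equal':
            exp_parts.append(expected[a0:a1])
            cur_parts.append(current[b0:b1])
        else:
            if a1 > a0:
                exp_parts.append('[-%s-]' % expected[a0:a1])  # shown red in a UI
            if b1 > b0:
                cur_parts.append('{+%s+}' % current[b0:b1])   # shown green in a UI
    return 'EXPECTED: %s\nCURRENT:  %s' % (''.join(exp_parts), ''.join(cur_parts))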

Option to remove golds

If the wrong gold was added by accident, or the user changes their mind about what they want to be gold, it would be cool if there were an option to select golds and remove them as the user sees fit.

sort analyses

^a/b<n>/c<n>$ and ^a/c<n>/b<n>$ should be treated as the same string.
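
A minimal sketch of such a canonicalisation (it ignores escaped characters in the Apertium stream format, which a real implementation would need to handle):

import re

def sort_readings(line: str) -> str:
    # Sort the readings inside each ^surface/reading1/reading2$ unit.
    def canon(match):
        surface, *readings = match.group(1).split('/')
        return '^%s$' % '/'.join([surface] + sorted(readings))
    return re.sub(r'\^([^$^]*)\$', canon, line)

assert sort_readings('^a/b<n>/c<n>$') == sort_readings('^a/c<n>/b<n>$')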

single test focus / IDE-mode

I use a script to recompile and re-run on save, which looks like this:

[video: Peek.12-12-2022.14-28.webm]

but for Most People I think it'd be really nice to have something like that in regtest. Regtest already kind of does this: you can change files, click re-run corpus, and see the new output. But when working on transfer in particular, I often like to zoom in on one sentence and see the full analysis tree as well as the generated output and final translation all in one go, every time I click save.

Regtest could have a 🔍 button on each input sentence, and on click you are shown a page for just that sentence, looking a bit like apertium-viewer, but preferably with a fairly big <div> for the -tree output :)

There could be a button to re-run tests for that sentence, but even more magical would be a "recompile-and-rerun-on-change" button (like dev/r does). If this is easiest to do with entr rather than an external python lib for inotify, IMO it'd be fine if that one feature depended on having entr installed.
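
A rough sketch of such a watch loop using plain mtime polling, which needs neither entr nor an inotify library (the paths and recompile command are illustrative):

import subprocess
import time
from pathlib import Path

def watch_and_rerun(sources, corpus):
    last = 0.0
    while True:
        newest = max(Path(p).stat().st_mtime for p in sources)
        if newest > last:
            last = newest
            subprocess.run(['make'])  # recompile
            subprocess.run(['apertium-regtest', '-c', corpus, 'test'])
        time.sleep(1)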

make clearer a 'make test' error

I get:

$ make test
apertium-regtest test
Corpus 1 of 1: analysis
  7/9 (77.78%) tests pass (0/7 (0.0%) match gold)

There were changes! Rerun in interactive mode to update tests.
Changed corpora: analysis
make: *** [Makefile:928: test] Error 1

It would be easier if the message explained what has to be done to "rerun in interactive mode". Moreover, searching the wiki for "rerun in interactive mode" or "run in interactive mode" does not give meaningful results.

Very long distance testing

This is for tests where the result involves very long distances, e.g. identifying characters in a novel, done with at least ±6 windows and relations that span whole chapters via stepping-stones.

I think this falls out of scope, but maybe I or someone else can come up with a workable method.

Interactive streaming results

It should be possible to stream results to the interface and interact with it while the test is running, instead of having to wait for the full run to finish. This may require storing results in SQLite temporarily.
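
A sketch of what that temporary store might look like (the schema and function are hypothetical):

import sqlite3

conn = sqlite3.connect('regtest-partial.db')
conn.execute('CREATE TABLE IF NOT EXISTS results '
             '(corpus TEXT, hash TEXT, step TEXT, output TEXT)')

def record(corpus, hsh, step, output):
    # Commit per result so the web UI can poll partial results mid-run.
    with conn:
        conn.execute('INSERT INTO results VALUES (?, ?, ?, ?)',
                     (corpus, hsh, step, output))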

bigger manual gold field

[screenshot]

The field could at least extend as far right as the buttons above it.

It could also be prefilled with the current output when there's just one minor thing to fix.

Support preferences

[17:12:31] <Unhammer> yeah, been wondering what the best way to do that would be. Simplest is probably to just have a whole test set limited to a value of AP_SETVAR, e.g.

    "dan-nno-moderate": {
        "input": "dan-nno-input.txt",
        "mode": "dan-nno",
        "setvar": "infa_infe,me_vi,ggj_gg"
    },

Parallelize runs

A pipe often has a single bottleneck (usually a complex CG), so even though the pipe is multi-process, the benefit is limited. Splitting the input into ~4 chunks and running that many pipes can thus take full advantage of the available CPUs.

This should work both per-corpus and across corpora, so runs should internally be changed to a single task list.

(quick'n'dirty per-corpus TinoDidriksen/regtest@c18ed0c)
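
A sketch of the chunk-and-merge idea (cmd stands for whatever shell pipeline the mode defines; a real change would also have to keep per-line hashes aligned across chunks):

import concurrent.futures
import subprocess

def run_chunk(cmd, lines):
    proc = subprocess.run(cmd, shell=True, input='\n'.join(lines) + '\n',
                          capture_output=True, text=True)
    return proc.stdout

def run_parallel(cmd, lines, jobs=4):
    size = max(1, (len(lines) + jobs - 1) // jobs)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Threads suffice here: the real work happens in the subprocess pipes.
    with concurrent.futures.ThreadPoolExecutor(len(chunks)) as pool:
        return ''.join(pool.map(lambda c: run_chunk(cmd, c), chunks))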

error on printing error: ident is not defined

This happens in sme-nob:


Traceback (most recent call last):
  File "/usr/local/bin/apertium-regtest", line 1175, in <module>
    if not static_test(args.ignore_add, threshold=args.threshold,
  File "/usr/local/bin/apertium-regtest", line 1058, in static_test
    corp.load()
  File "/usr/local/bin/apertium-regtest", line 499, in load
    golddata = load_gold(goldfile)
  File "/usr/local/bin/apertium-regtest", line 119, in load_gold
    print('ERROR: Empty entry %s in %s' % (ident, fname))
NameError: name 'ident' is not defined

make: *** [test] Error 1
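
The fix is presumably just to print the identifier that is actually in scope at that point in load_gold. A hypothetical sketch of the corrected shape (the file format and variable names are guesses, not the real apertium-regtest code):

def load_gold(fname):
    golds = {}
    with open(fname) as fin:
        for line in fin:
            hsh, _, content = line.rstrip('\n').partition('\t')
            if not content:
                # Report the in-scope id instead of the undefined `ident`.
                print('ERROR: Empty entry %s in %s' % (hsh, fname))
                continue
            golds.setdefault(hsh, []).append(content)
    return golds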

Can we have generated and editable files in different folders?

Currently there are 165 files in nno-nob/test for 6 testsets. If I want to add inputs/golds, that's quite a lot of file names for my human eyes to parse. Would it be possible to have something like the following?

/test/humaneditable.input.txt
/test/humaneditable.gold.txt
/test/generated/pendingtaggeroutputexpected.txt

(In my quickly-hacked-together regtest approximation in nno-nob/tests I do this, with 20 different test sets)

total count of gold matches in test mode

Currently apertium-regtest test has output like this:

$ apertium-regtest -c .-morph test
Corpus 1 of 3: -s plurals-morph
  13/13 (100.0%) tests pass (3/13 (23.08%) match gold)

Corpus 2 of 3: -es plurals-morph
  20/20 (100.0%) tests pass (3/20 (15.0%) match gold)

Corpus 3 of 3: irregular plurals-morph
  22/22 (100.0%) tests pass (2/22 (9.09%) match gold)

All tests pass.

It would be nice if the bottom line (or somewhere around there) said something like "8/55 tests match gold" or similar.

Tooltip for number of inputs matching gold

It would be cool if, hovering over a corpus filter, you could see the number of passing tests out of the total tests for that filter, including for the all-corpora button. Right now, interpreting the percentages below is slightly complicated.

Interface idea: button partially filled by percent passing tests

Buttons in web need some kind of explanation on hover or similar

[screenshot]
Nothing happens when I click Diff/Inserted/Deleted – when are they supposed to do anything? There is a diff between output and gold.

Also, especially when looking at Generator output, it's a bit confusing that the second line there actually is the generator output (gold and input are labelled, but it doesn't say Output before the output).

And why are there three buttons, "Replace as gold", "Add as gold" and "Add new gold" – why aren't "Replace as gold" and "Add new gold" enough? (If there isn't a gold already, perhaps s/Replace/Add.)

I can see that generator is accepted because I can't press the button, but what then does Accept result do?

Could perhaps have a little tutorial thing at the top of the screen.

Add pkg-config warning to langs/pairs

Modules that use regtest should warn if it's not present, à la:

PKG_CHECK_MODULES(REGTEST, apertium-regtest >= 0.0.1, [], [AC_MSG_WARN([Running tests requires apertium-regtest])])

Editability for non-apertium developers

With the old, wiki-based tests, it was possible for people who had linguistic knowledge but no apertium/git/developer know-how to check and edit tests in a basic web UI. I didn't even know about this workflow before, but apparently it happened a lot with the sme pairs. What are the possibilities for making the current regtests editable by non-developer linguists?

  • Editing through Github – possible, but gold/input are in different files and it seems very easy to mess up the hashes
  • Run apertium-regtest web on a server – possible, but then you have to deal with logins and access
  • Some kind of export/import/sync to e.g. wiki – sounds complicated and fragile
  • Other ideas?
