katef / libfsm Goto Github PK

DFA regular expression library & friends

License: BSD 2-Clause "Simplified" License

Makefile 4.89% Perl 0.72% C 89.74% Awk 1.80% CSS 0.47% JavaScript 0.25% XSLT 0.98% Shell 0.40% Python 0.16% Less 0.38% Scilab 0.22%

fsm regex lexer-generator regexp regexes regex-validator dfa nfa finite-state-machine finite-state-automata

libfsm's People

Contributors

Stargazers

Watchers

Forkers

fastly dhobsd csjperon marchelzo methylatorx athomason sfstewman gale320 sonnyit xpeper alyptik jameysharp ubuntu-repo robguima pldubouilh meteorix iximeow mmuthukrishna vagabond dgryski netrounds-aarish hvdijk data-man yesudeep nilium steveruizok ctjlewis sdbruder federicomenaquintero tomchappell chuxiao dkegel-fastly miguelemos codeluo wuganc wikinaut shoukunyan ekmixon rnshah9 myjeffxie careles vvulpes0 huddy1985 jiaoyk jbro885 qqwssotoo theo543 setrofim mitslyj luckysunfd wwall

libfsm's Issues

vmc generates incorrect code for regexp with empty alt

; ./build/bin/re -pl c -r pcre '|'             

int
fsm_main(int (*fsm_getc)(void *opaque), void *opaque)
{
	int c;


	enum {
		S0
	} state;

	state = S0;

	while (c = fsm_getc(opaque), c != EOF) {
		switch (state) {
		case S0: /* start */
			return TOK_UNKNOWN;

		default:
			; /* unreached */
		}
	}

	/* end states */
	switch (state) {
	case S0: return 0x1; /* "|" */
	default: return -1; /* unexpected EOT */
	}
}

; ./build/bin/re -pl vmc -r pcre '|'

int
fsm_main(int (*fsm_getc)(void *opaque), void *opaque)
{
	int c;

	if (c = fsm_getc(opaque), c == EOF) return 0x1; /* "|" */
	return -1;
}

Concatenation for character classes unions with the wrong endpoint

Regexps like ...[abc]...[xyz]... currently skip the second class with an epsilon. I would imagine this is from unionxy() being given the same endpoint for both, possibly introduced around the time of the AST work.

Switch from void *opaque to integer end IDs

Retire the (struct fsm_state).opaque field; all use-cases for this are covered by the numeric end IDs introduced in #292. These are stored per FSM, rather than per state. With this, the carryopaque callbacks can go.

lx: the exit character for a zone figuring in a variable definition inside the zone gets removed from the variable incorrectly

The lexer generated from the following lx input:

$ cat lexer.lx 
digit = /[0-9]/;
alpha = /[A-Z]/i;
tchar = "!" | "#" | "$" | "%" | "&" | "'" | "*" | "+" | "-" | "." | "^" | "_" | "`" | "|" | "~" | digit | alpha;

tchar+ -> $token;

'"' -> $qsstart .. '"' -> $qsend {
        vchar = /[\x21-\x7E]/;
        qdtext = "\t" | " " | "\x21" | /[\x23-\x5B]/ | /[\x5D-\x7E]/;
        quoted_pair = '\' ( "\t" | " " | vchar );

        qdtext -> $qschar;
        quoted_pair -> $qschar;
}

'=' -> $eq;

does not recognize the double quote (") as a valid character to follow the backslash (\) in the double-quote-bracketed zone when the definition of the variable vchar appears inside the double-quote-bracketed zone. The lexer yields the UNKNOWN token in this case instead of the QSCHAR token.

This problem does not happen when vchar is defined outside of the double-quote-bracketed zone.

Here is the difference between the generated C in the two cases: when vchar is defined outside of the zone vs when it is defined inside the zone is in whether " is listed in the transition:

 		case S3: /* e.g. "\\" */
 			switch ((unsigned char) c) {
 			case '\t':
 			case ' ':
 			case '!':
-			case '"':
 			case '#':
[...]
 			case '}':
 			case '~': state = S1; break;
 			default:  lx->lgetc = NULL; return TOK_UNKNOWN;
 			}
 			break;

A full test program to replicate the problem:

$ cat Makefile 
.PHONY: all

all: main

.PHONY: clean
clean:
        rm -f lexer.c lexer.h main

lexer.c lexer.h: lexer.lx
        lx -l c < lexer.lx > lexer.c.NEW
        lx -l h < lexer.lx > lexer.h.NEW
        mv lexer.c.NEW lexer.c
        mv lexer.h.NEW lexer.h

main: main.c lexer.c lexer.h
        gcc -o main -ggdb3 -Wall main.c lexer.c -DLX_HEADER='"lexer.h"'

$ cat main.c 
#include "lexer.h"

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct state {
        const char *base;
        const char *ptr;
};

static int
lgetc(struct lx *lx)
{
        struct state *state = lx->getc_opaque;

        unsigned char c = *state->ptr;
        if (c == 0) {
                return -1;
        }

        state->ptr += 1;

        return c;
}

static void
get_show_check_token(struct lx *lx, enum lx_token expected_tok)
{
        const struct state *state = lx->getc_opaque;

        enum lx_token tok = lx_next(lx);

        printf("token %s \"%.*s\"\n",
            lx_name(tok), 
            (int)(lx->end.byte - lx->start.byte),
            state->base + lx->start.byte);

        if (tok != expected_tok) {
                fprintf(stderr, "ERROR: Expected %s got %s\n",
                    lx_name(expected_tok),
                    lx_name(tok));
                exit(1);
        }
}

int
main(void)
{
        char *input = "a=\"a\\\"b\"";

        struct state state = {
                .base = input,
                .ptr = input,
        };
        struct lx lx;
        lx_init(&lx);
        lx.lgetc = lgetc;
        lx.getc_opaque = &state;

        get_show_check_token(&lx, TOK_TOKEN);
        get_show_check_token(&lx, TOK_EQ);
        get_show_check_token(&lx, TOK_QSSTART);
        get_show_check_token(&lx, TOK_QSCHAR);
        get_show_check_token(&lx, TOK_QSCHAR);
        get_show_check_token(&lx, TOK_QSCHAR);
        get_show_check_token(&lx, TOK_QSEND);
        get_show_check_token(&lx, TOK_EOF);

        return 0;
}

$ ./main 
token TOKEN "a"
token EQ "="
token QSSTART """
token QSCHAR "a"
token UNKNOWN "\""
ERROR: Expected QSCHAR got UNKNOWN

Convert libfsm's internal representation to an adjacency list

Currently libfsm stores its FSMs as a graph of structs pointing to each other. This ticket proposes an adjacency list instead, where the whole FSM would be stores as an array of state-to-state pairs, labelled with edges. Possibly we'd also want a skip list down the side, to quickly find the start of each state's list of edges.

This representation has a few advantages; it's simpler to deal with in a few ways, but also allows for data to be attached to edge nodes, which currently only exist as an index into an array of states. An adjacency list also allows some operations to be parallelised (perhaps by threads, SIMD, etc).

One concern is that some fsm_*() operations are currently O(1), and I'd want to take especial care not to damage performance for those on larger FSM.

Trailing comma in IR json

./build/bin/re -pl irjson -y tests/ir/in10.re
{
	"start": 0,
	"states": [
		{
			"end": true,
			"strategy": "none",
		}
	]
}

Probably a special case for IR_NONE

Required build directory is not made by fmake

[even later edit: it all works in pmake, which makes this more annoying than relevant]

Summary made after I understood the problem some more:

The machinery introduced in mkdir.mk to avoid making directories that already existed (introduced in 630694b827d489425ede498d4a21f0306c39eee8) appears to break an initial build if using FreeBSD make (fmake).

I'm on Ubuntu 16.04 using fmake (FreeBSD make) from aptpackagefreebsd-duildutils`:

$ CC=clang fmake -r
"share/mk/mkdir.mk", line 36: warning: duplicate script for target "build/lib" ignored
"share/mk/mkdir.mk", line 36: warning: duplicate script for target "build/lib" ignored
clang -o build/src/adt/alloc.o -I include -ansi -pedantic -O3 -DNDEBUG  -c src/adt/alloc.c
error: unable to open output file 'build/src/adt/alloc.o': 'No such file or
      directory'
1 error generated.
*** [build/src/adt/alloc.o] Error code 1

Stop in /home/ac1xdrj/prj/libfsm.

I'm going to fiddle around making directories until it works.

Incidentally, the same toolchain built kgt just fine

lx IR failing on iscomplete predicate

Something definitely screwy here.

; echo '/[^b]/ -> $t;' | ~/gh/libfsm/build/bin/lx -l c

...

static enum lx_token
z0(struct lx *lx)
{
        int c;

        enum {
                S0, S1, NONE
        } state;

        assert(lx != NULL);

        if (lx->clear != NULL) {
                lx->clear(lx->buf_opaque);
        }

        state = NONE;

        lx->start = lx->end;

        while (c = lx_getc(lx), c != EOF) {
                if (state == NONE) {
                        state = S0;
                }

lx: src/libfsm/print/ir.c:422: make_state: Assertion `fsm_iscomplete(fsm, state)' failed.

Minimisation by Brzozowski's algorithm produces non-minimal output

The expected handwritten output tests/minimise/out12.fsm has fewer states than the result produced by Brzozowski's algorithm as implemented by:

cat in12.fsm | fsm -rp | fsm -dp | fsm -rp | fsm -dp

tests/minimise/in19.fsm shows the handwritten expected minimal output used as input, which remains unchanged.

re generates wrong JSON

In this case, there is not enough comma.

./build/bin/re -pl json "^[abc]$"
{
        "states": [
                {
                        "end": false,
                        "edges": [
                                { "char":  "a", "to": 1},
                                { "char":  "b", "to": 1}
                                { "char":  "c", "to": 1}
                        ]
                },
                {
                        "end": true,
                        "edges": [
                        ]
                }
        ],
        "start": 0
}

But here everything is fine.

./build/bin/re -pl json "^[ace]$"
{
        "states": [
                {
                        "end": false,
                        "edges": [
                                { "char":  "a", "to": 1},
                                { "char":  "c", "to": 1},
                                { "char":  "e", "to": 1}
                        ]
                },
                {
                        "end": true,
                        "edges": [
                        ]
                }
        ],
        "start": 0
}

Update Unicode data

Version 13.0 of the Unicode Standard was released on March 10, 2020.
Would be nice update it.

Remove $unsupported tokens in the regex lexers

Currently we produce $unsupported for various things (e.g. lookahead, lookbehind in pcre) and then error about it.

My suggestion instead is that we do lex these correctly, and then error about them in the parser instead. This moves the concept of unsupportedness along a layer.

Eventually I'd like to also construct AST nodes for these, and only error about the unsupportedness when we come to do the AST->NFA conversion. This way we'd also have support for these features for e.g. AST -> regexp rendering (where FSM are not involved), but perhaps also opportunities to deal with them by AST rewriting.

Use re2/perl regexp dump for equivalence tests

perl dumping its implementation of a regex:

perl -E 'use re "debug";qr/a*(b|c)+d?|(fg*|x)[0-9a-z]?/;'

Likewise re2 can dump its AST.

We could consider using these for equivalence tests for libre's subset of the pcre dialect.

MacOS install fails on undefined symbol _fsm_capture_dump

On MacOS Mojave with bmake version 20200902 from Homebrew.
bmake -r install ends in

ld -r -o build/lib/libfsm.o build/src/libfsm/cost/legible.o build/src/libfsm/pred/isany.o build/src/libfsm/pred/isdfa.o build/src/libfsm/pred/isend.o build/src/libfsm/pred/iscomplete.o build/src/libfsm/pred/epsilonsonly.o build/src/libfsm/pred/hasincoming.o build/src/libfsm/pred/hasoutgoing.o build/src/libfsm/pred/hasepsilons.o build/src/libfsm/pred/hasnondeterminism.o build/src/libfsm/print/api.o build/src/libfsm/print/awk.o build/src/libfsm/print/c.o build/src/libfsm/print/dot.o build/src/libfsm/print/fsm.o build/src/libfsm/print/ir.o build/src/libfsm/print/irdot.o build/src/libfsm/print/irjson.o build/src/libfsm/print/json.o build/src/libfsm/print/rust.o build/src/libfsm/print/sh.o build/src/libfsm/print/go.o build/src/libfsm/print/vmc.o build/src/libfsm/print/vmdot.o build/src/libfsm/print/vmasm.o build/src/libfsm/walk/count.o build/src/libfsm/walk/has.o build/src/libfsm/walk/all.o build/src/libfsm/walk/reachable.o build/src/libfsm/walk/iter.o build/src/libfsm/vm/ir.o build/src/libfsm/vm/vm.o build/src/libfsm/vm/v1.o build/src/libfsm/vm/v2.o build/src/libfsm/capture.o build/src/libfsm/collate.o build/src/libfsm/complete.o build/src/libfsm/consolidate.o build/src/libfsm/clone.o build/src/libfsm/closure.o build/src/libfsm/edge.o build/src/libfsm/empty.o build/src/libfsm/end.o build/src/libfsm/endids.o build/src/libfsm/equal.o build/src/libfsm/exec.o build/src/libfsm/fsm.o build/src/libfsm/mode.o build/src/libfsm/start.o build/src/libfsm/state.o build/src/libfsm/trim.o build/src/libfsm/example.o build/src/libfsm/getc.o build/src/libfsm/vm.o build/src/libfsm/mergestates.o build/src/libfsm/subgraph.o build/src/libfsm/shortest.o build/src/libfsm/complement.o build/src/libfsm/determinise.o build/src/libfsm/glushkovise.o build/src/libfsm/minimise.o build/src/libfsm/reverse.o build/src/libfsm/merge.o build/src/libfsm/concat.o build/src/libfsm/union.o build/src/libfsm/intersect.o build/src/libfsm/subtract.o build/src/libfsm/walk2.o build/src/libfsm/lexer.o build/src/libfsm/parser.o  -exported_symbols_list build/src/libfsm/libfsm.syms
Undefined symbols for architecture x86_64:
  "_fsm_capture_dump", referenced from:
     -exported_symbol[s_list] command line option
ld: symbol(s) not found for architecture x86_64
*** Error code 1

Stop.

When I remove the last line from build/src/libfsm/libfsm.syms, the build succeeds.

Read lx man pages without installing

I see man/lx.{1,5}/lx.{1,5}.xml but I don't see anything in the build/ folder for the generated man pages, nor can I quickly find the build rule to generate them. I'd like to read them -- how should I go about this?

Feature proposal: Functions

Parameterised blocks

Blocks enclose a set of mappings:

{
   /abc/ -> $t1;
   /xyz/ -> $t2;
}

Defining a parameterised block of code:

f(x, y) = {
    /abc/ -> $t1;
    x /xyz/ | y -> $t2;
}

Then you could invoke f(),
which would behave just as if you typed it out by hand:

f(/xxx/, /yyy/);

or use it as a zone body:

'b' .. 'e' f(/xxx/, /yyy/);

which is exactly equivalent to writing:

'b' .. 'e' {
    f(/xxx/, /yyy/);
}

f() would be usable in expressions, giving its unioned DFA:

'abc' - f('xxx', 'yyy') -> $t;

f() cannot be used as the body for one-way zones,
because that would be syntactically ambiguous with the use in expressions.
However, you could write:

'x' {
    f('xxx', 'yyy');
}

to the same effect.

Parameterised expressions

g(x, y) = /abc/ x | ~y;

invoked in an expression just the same:

'abc' - g('x', 'y') -> $t;

but (as with writing /abc/; with no token mapping),
this would cause input to be consumed and no token emitted:

g(x, y);

Removing parameterisation

Parameters may be omitted, and the syntax degrades to:

f = {
    /abc/ -> $t1;
    /xyz/ -> $t2;
}

g = /abc/ | 'x'; # note this is the existing variable binding syntax

which may be used in the same ways:

f;
'b' .. 'e' f;
'abc' - f -> $t;
'abc' - g -> $t;
g;

Hence f and g are distinguished by type.

Neither f nor g may be used on the right hand side of a mapping.

Token names as parameters

So far the examples have shown FSM as parameters.

Currently using a token's value gives its FSM:

'if' | 'else' | 'for' -> $kw;
/[a-z]+/ - $kw -> $ident;

Parameter lists also permit token names as values;
this is indistinguishable from passing an FSM as a parameter,
where the FSM happens to come from an existing token as above.

f(x, y, t1) = {
    /abc/ - x - y - t1 -> $t;
}
g(t) = /abc/ - t;

and passing a token by name for mappings:

f(x, y, t2) = {
    /abc/ - x - y -> t2;
}

where f('xxx', 'yyy', $t); is known to mean the name of $t,
rather than its FSM value, due to using t2 on the rhs of a mapping in f().

Hence parameters are distinguished by type.

Introduce AST for libre

Currently libre constructs an NFA directly during parse. This ticket is to introduce an AST between parsing and NFA construction:

Parser actions construct AST nodes
The AST expresses whichever syntactic constructs fit most naturally; I would expect nodes for | alts and other operators. For some things, the parser can still normalise to a single internal form - "count x..y" repetition for example, where the regexp a* syntax is just sugar for x..y in general. I'm not currently sure if character classes would be present in the AST directly.

This provides a few possibilities:

We can transform the AST in various ways, for optimisations which are cheaper than operating on an FSM datastructure
Output of the AST (including perhaps to regexp syntax)
Some parts are awkward during parsing, due to the LL(1) recursive descent. Constructing an AST should make those more natural.

libre leaks memory if certain errors are encountered during parsing

When the sid parser encounters an error, it does not appear to free any partially-constructed pieces of the AST tree.

I found this working with cvtpcre with ASAN enabled, but it's easily triggered with re if ASAN is enabled:

% ./build/bin/re -r pcre '^\ca\cA\c[;\c:'
/^\ca\cA\c[;\c:/:2: Syntax error: unsupported operator

=================================================================
==21525==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 80 byte(s) in 1 object(s) allocated from:
    #0 0x7f44a8892dc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
    #1 0x559d6fbd13b4 in ast_make_expr_alt src/libre/ast.c:419
    #2 0x559d6fbb475f in p_expr src/libre/parser.act:671
    #3 0x559d6fbb89e7 in p_re__pcre src/libre/dialect/pcre/parser.c:3326
    #4 0x559d6fbbb27f in parse_re_pcre src/libre/parser.act:909
    #5 0x559d6fbcdeba in re_parse src/libre/re.c:111
    #6 0x559d6fbce156 in re_comp src/libre/re.c:154
    #7 0x559d6fb20de5 in main src/re/main.c:688
    #8 0x7f44a85ba0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

Direct leak of 80 byte(s) in 1 object(s) allocated from:
    #0 0x7f44a8892dc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
    #1 0x559d6fbd0fd6 in ast_make_expr_concat src/libre/ast.c:371
    #2 0x559d6fbba396 in p_expr_C_Calt src/libre/parser.act:664
    #3 0x559d6fbb7716 in p_expr_C_Clist_Hof_Halts src/libre/dialect/pcre/parser.c:3116
    #4 0x559d6fbb478d in p_expr src/libre/dialect/pcre/parser.c:2558
    #5 0x559d6fbb89e7 in p_re__pcre src/libre/dialect/pcre/parser.c:3326
    #6 0x559d6fbbb27f in parse_re_pcre src/libre/parser.act:909
    #7 0x559d6fbcdeba in re_parse src/libre/re.c:111
    #8 0x559d6fbce156 in re_comp src/libre/re.c:154
    #9 0x559d6fb20de5 in main src/re/main.c:688
    #10 0x7f44a85ba0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7f44a8892bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
    #1 0x559d6fb66dd0 in f_malloc src/adt/alloc.c:40
    #2 0x559d6fb4edd5 in fsm_new src/libfsm/fsm.c:51
    #3 0x559d6fb20bb1 in main src/re/main.c:653
    #4 0x7f44a85ba0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

Indirect leak of 4096 byte(s) in 1 object(s) allocated from:
    #0 0x7f44a8892bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
    #1 0x559d6fb66dd0 in f_malloc src/adt/alloc.c:40
    #2 0x559d6fb4eedc in fsm_new src/libfsm/fsm.c:60
    #3 0x559d6fb20bb1 in main src/re/main.c:653
    #4 0x7f44a85ba0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

Indirect leak of 80 byte(s) in 1 object(s) allocated from:
    #0 0x7f44a8892dc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
    #1 0x559d6fbd1ea4 in ast_make_expr_anchor src/libre/ast.c:545
    #2 0x559d6fbb9d14 in p_expr_C_Cpiece_C_Catom src/libre/parser.act:727
    #3 0x559d6fbb3eed in p_expr_C_Cpiece src/libre/dialect/pcre/parser.c:2475
    #4 0x559d6fba9658 in p_expr_C_Clist_Hof_Hpieces src/libre/dialect/pcre/parser.c:953
    #5 0x559d6fbba3c4 in p_expr_C_Calt src/libre/dialect/pcre/parser.c:3685
    #6 0x559d6fbb7716 in p_expr_C_Clist_Hof_Halts src/libre/dialect/pcre/parser.c:3116
    #7 0x559d6fbb478d in p_expr src/libre/dialect/pcre/parser.c:2558
    #8 0x559d6fbb89e7 in p_re__pcre src/libre/dialect/pcre/parser.c:3326
    #9 0x559d6fbbb27f in parse_re_pcre src/libre/parser.act:909
    #10 0x559d6fbcdeba in re_parse src/libre/re.c:111
    #11 0x559d6fbce156 in re_comp src/libre/re.c:154
    #12 0x559d6fb20de5 in main src/re/main.c:688
    #13 0x7f44a85ba0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

Indirect leak of 64 byte(s) in 1 object(s) allocated from:
    #0 0x7f44a8892bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
    #1 0x559d6fbd10d7 in ast_make_expr_concat src/libre/ast.c:381
    #2 0x559d6fbba396 in p_expr_C_Calt src/libre/parser.act:664
    #3 0x559d6fbb7716 in p_expr_C_Clist_Hof_Halts src/libre/dialect/pcre/parser.c:3116
    #4 0x559d6fbb478d in p_expr src/libre/dialect/pcre/parser.c:2558
    #5 0x559d6fbb89e7 in p_re__pcre src/libre/dialect/pcre/parser.c:3326
    #6 0x559d6fbbb27f in parse_re_pcre src/libre/parser.act:909
    #7 0x559d6fbcdeba in re_parse src/libre/re.c:111
    #8 0x559d6fbce156 in re_comp src/libre/re.c:154
    #9 0x559d6fb20de5 in main src/re/main.c:688
    #10 0x7f44a85ba0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

Indirect leak of 64 byte(s) in 1 object(s) allocated from:
    #0 0x7f44a8892bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
    #1 0x559d6fbd14b5 in ast_make_expr_alt src/libre/ast.c:429
    #2 0x559d6fbb475f in p_expr src/libre/parser.act:671
    #3 0x559d6fbb89e7 in p_re__pcre src/libre/dialect/pcre/parser.c:3326
    #4 0x559d6fbbb27f in parse_re_pcre src/libre/parser.act:909
    #5 0x559d6fbcdeba in re_parse src/libre/re.c:111
    #6 0x559d6fbce156 in re_comp src/libre/re.c:154
    #7 0x559d6fb20de5 in main src/re/main.c:688
    #8 0x7f44a85ba0b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

SUMMARY: AddressSanitizer: 4512 byte(s) leaked in 7 allocation(s).

Except for the 4096-byte struct fsm allocation, all of the allocations are done within the sid parser, and all are partial allocations of the

Missing handling for non-escaped { } literals in pcre dialect

https://twitter.com/JakeDChampion/status/1282973512593018880

This case shows { and } near the beginning, and these are literal characters and not escaped.

/\s*(?:{(.*)})?\s*(?:(\$?\S+))?\s*(?:\[([^\]]*)])?\s*-?\s*([\S\s]*)\s*$/

I supposed the first would be distinguished from the x{m,n} repetition syntax because it doesn't follow an atom. And then I guess the second is seen as non-special because by that point we're not in the middle of a {...} lexical region.

libre currently gives a syntax error here, but pcregrep accepts this.

Simplify C output in states where most symbols are accepted

Consider the output of re -pl c '^([^\\\n]|\\t)': In one of the states, it lists cases for all 254 symbols that are not \n. I would rather see output structured more like:

case '\\': /* transition to expecting a "t" */
default: /* transition to accept-anything state */
case '\n': /* reject */

I think of this as treating the symbols which don't transition as an additional group, and looking at the sizes of all the symbol groups. The largest group should get the default: label, even if that isn't the "reject" group.

Perhaps this should be another IR strategy? I guess it's a combination of the "complete" and "error" strategies, and perhaps could subsume both? Or maybe the bug here is just that this case should have used the "error" strategy instead of "complete"? 🤷‍♂️

lx lookahead isn't always necessary

lx -l c doesn't generate exactly the code I'd prefer for accepting states which have no further out-transitions.

For example, my sample language specification in #111 generates four states. In state S2, we've already seen .. so if we see a third . then we've matched an $ellipsis token. Currently, in that case lx generates a transition to a state S3 and continues with a new iteration of the loop, which calls lx_getc and then unconditionally passes the result to lx_ungetc before returning TOK_ELLIPSIS.

But instead of transitioning to a new state and reading then unreading a new character, when S2 matches a third ., it could immediately return TOK_ELLIPSIS.

This applies to any accepting state that has no out-transitions. From such a state, reading any additional character will always trigger an error transition. At that point the state machine must roll back to the most recent accepting state and return that, and we know statically which state that was.

This is a minor optimization, but I mostly care because the generated code would be slightly easier to understand if these unnecessary extra states were removed.

Warning about duplicate script in makefile

bmake: "share/mk/share/mk/mkdir.mk" line 36: warning: duplicate script for target "build/pc" ignored
bmake: "share/mk/share/mk/pc.mk" line 19: warning: using previous script for "build/pc" defined here

build fails due to checkout timestamps

I tried to
bmake -r install, PREFIX=/tmp/x bmake -r install and CC=clang PREFIX=$HOME bmake -r install

they all die with

sid -l ansi-c -s no-numeric-terminals -s no-terminals   src/libre/dialect/glob/parser.sid src/libre/parser.act src/libre/dialect/glob/parser.c src/libre/dialect/glob/parser.h  || { rm -f src/libre/dialect/glob/parser.c src/libre/dialect/glob/parser.h; false; }
/bin/sh: sid: command not found

But I could not find a hint were to get sid on OSX. I hope, it is totally obvious how to fix this.

`fsm_trim()` (neé `fsm_collect_unreachable_states()`) accidentally consumes too much

Some discussion here: #140 (comment)

I added a test case under #151:

1 -> 2 'a';
2 -> 3 'b';
3 -> 4 'c';

start: 2;
end: 4;

This test illustrates an unreachable state transitioning to the start state, which is then removed despite being reachable (by virtue of being the entry point).

portability: `clock_gettime` needs `-lrt` on Linux

src/fsm/main.c:378: undefined reference to clock_gettime'`

The makefile for fsm (the CLI util, not the library) should check uname and add -lrt to LFLAGS.fsm on Linux.

Deduplication of alts does not preserve capture groups

$ libfsm/build/bin/re -p -rpcre -lpcre '(foo)|(foo)'
(foo)
$ libfsm/build/bin/re -p -rpcre -lpcre '(foo)|(bar)'
(bar)|(foo)

In general, deduplicating alts is a good simple optimisation to speed up the generation of an FSM, but for regexes containing capture groups, the optimisation is not valid.

Feature proposal: find lexicographic bounds for an FSM

Some data storage systems store records in name order, making it efficient to retrieve records with names in a given range. When filtering such records by regular expression, it might therefore be useful to have a way to generate a range in which all the results are guaranteed lie, to use as a pre-filter.

For example, any match of the regular expression (moomin|troll)+ must lie in the range ("moomin", "trolltrolm"). Note that the last character of the upper bound has been 'rounded up' to ensure that every match, no matter how long, lies before it.

This should be fairly straightforward to implement for a DFA, and only slightly more complex for a NFA (assuming it's been trimmed of any states that do not lead to a match). The interface will need to specify the maximum length of the bounds. The ordering can be lexicographic order on the byte encodings, which in the case of UTF8 also conveniently matches the ordering on Unicode code points. Still, there are a couple of potential UTF8 niggles:

the maximum length might lie inside a multi-byte character
rounding a truncated upper bound to the next byte might result in an invalid byte sequence

The easiest approach is probably to remain encoding-agnostic and simply ignore this, returning byte sequence bounds and letting the client library deal with converting those to valid strings.

Incomplete minimization via Brzozowski's

For the following input:

1 -> 2 'a';
2 -> 2 'a';

start: 1;
end: 1, 2;

fsm -pm is now failing to detect that they should be combined into a single state. This is a fairly recent change -- bisecting narrowed the commit where this changed to 2306131 (merged as part of #193), which reworks determinization.

The deeper issue here is that while the integration tests via fsm -t equal will confirm that the FSMs match the same language, in this case it also needed to check that they both have the same (minimal) number of states.

Mishandling for PCRE_DOTALL

We better provide /s. the default is for it to be not set, and so the current implementation is incorrect. This bug found by running the suite in #306.

From php's documentation (which I find easier to read than the pcre docs...):

s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

Generated code does not distinguish matching an empty string, and no match.

The generated code for DFA does not distinguish beteween "" and a match failure, which would lead to the default case or nothing. Possibly specific to one kind of -k API.

Caching for unary and binary libfsm operations

This ticket proposes a cache kept behind the scenes in libfsm, storing FSM as constructed by each fsm_*() unary or binary operation.

I'm picturing this as a limited-size associative array indexed by ID; when an operation is called, it'd combine the IDs for its operands to construct a new ID, and return fsm_clone() of that new ID, if one exists. The idea is that cloning is cheap, especially when we move to an adjacency list.

IDs would be stored per FSM. For constructing IDs, I imagine they'd be constructed of some bits representing the operation, and one or two other IDs as operands, and then that all gets hashed to produce a new ID.

I'm picturing this specifically for operations on one or two FSM only - and not for things like fsm_addedge_*() which modify an existing FSM. Maybe this is exactly only for <fsm/bool.h>.

I'm told in lisp, this is called "hash-consing".

This idea doesn't necessarily mean that all struct fsm objects would be immutable, although that's possible COW style, too. I think in particular things like fsm_addedge_*() would be unwieldy if they duplicate an FSM for every edge added.

Ideally the PCRE dialect should detect unsupported features and give an appropriate error

Things that would be nice to detect:

backrefs
look-behind assertions
look-ahead assertions
Any other PCRE features that aren't regular and require backtracking.

Performance experiment: partitioning heuristic during DFA minimization

Inside minimise.c's try_partition, we split an equivalence class in place, but it's arbitrarily committing to keeping whichever states lead to the same EC as the first state in the EC list. While the non-determinism doesn't affect correctness (since it will run to a fixpoint either way), using the counts to decide which to modify in place and which to separate out may improve performance.

The change would appear in update_after_parittion -- if both the src and dst ECs have more than one state and the src EC has less (or more) than the dst EC, swap them, so the larger EC always appears earlier (or later). The special case for one state is so that ECs with only 1 state end up in the "done" region at the end.

I suspect always breaking out the smaller EC should perform better, but it needs benchmarking.

negative character classes produce an ABNF output that kgt does not parse

Not sure if this should be here or on katef/kgt.

When converting a regex that includes a negative character class to ABNF form, it generates the output fine, but when passing this to kgt, it fails with 1:11: Syntax error: expected production rule separator

Minimal Example:

$ re -bpl abnf '[^0-9]' | tee /dev/stderr | kgt -l abnf -e svg | isvg
e = OCTET - %x30-39

1:11: Syntax error: expected production rule separator
...  # errors continue from other parts of the pipeline

lx: terrible segfaults on malformed input

; echo '$' | ./build/bin/lx 
zsh: done                echo '$' | 
zsh: segmentation fault  ./build/bin/lx
; echo '$ ' | ./build/bin/lx
zsh: done                echo '$ ' | 
zsh: segmentation fault  ./build/bin/lx
; echo '$ x' | ./build/bin/lx
lx: src/lx/parser.act:928: struct ast *lx_parse(FILE *, const struct fsm_options *): Assertion `ast != NULL' failed.
zsh: done       echo '$ x' | 
zsh: abort      ./build/bin/lx

Feature request: Interface to find longest matches in an FSM

Currently we have fsm_shortest() to find the shortest paths (actually least cost by weighted edges). I'd like a more convenient interface to:

Return a set of finite strings, not just the shortest
Indicate where results may be unbounded or not
Return a set of results, not just one

About unbounded strings, I'm not sure of the exact requirement here, but often a just "can this regexp match any unbounded input?" will suffice.

Be performant! A* or Dikstra's algorithm visits all states. Currently we re-run the same thing for each end state in turn. That's wasteful. I think we can also match against an NFA at no extra cost, and avoid the cost of constructing a DFA here. Currently support for epsilon edges was removed from fsm_shortest(). So what I'm proposing here is a rewrite, perhaps to A*.

`make -r clean` failures

The following directories under build/tests/ need to be cleaned manually for make -r clean (or bmake/pmake/etc.) to work:

dump-lx
isdfa
intersect-ccuc
intersect
subtract-cuc
subtract
determinise
glob
like
literal
minimise-rdrd
minimise
pcre-pcregrep
pcre
reverse
union
set

They may need to have their artifacts registered as targets to clean up. (Just deleting build/tests/ recursively works, of course, but doesn't seem consistent with the rest of the build config.

`lx -l dump` does not define `lgetc` & `getc_opaque` if `-g` is missing

If lx -l dump is used without any particular -g flag, it generates a main that declares lgetc and getc_opaque, but doesn't define them, leading to a program that doesn't ever get input.

Off by one in gcc ranged case generation

With opt.case_ranges, re -pl c '(x|y+)+z' generates:

                case S1: /* e.g. "x" */
                        switch ((unsigned char) c) {
                        case 'x' ... 'z': break;
                        default:  return TOK_UNKNOWN;
                        }
                        break;

and without:

                case S1: /* e.g. "x" */
                        switch ((unsigned char) c) {
                        case 'x':
                        case 'y': break;
                        case 'z': state = S2; break;
                        default:  return TOK_UNKNOWN;
                        }
                        break;

Very large image with re

First off, this is an amazing program! Thank you for making it.

When I try to graph the following pattern[1]:

re -cb -pl dot 'N[^P][ST][^P]' | dot -Tpng -ore.png

I get an extremely long image. Is this just the nature of this re or am I doing something wrong?

[1] N-glycosylation protein motif, cf. http://rosalind.info/problems/mprt/

`make install` fails

When I run bmake PREFIX=$out install, it fails, with the last few lines saying:

install -m 644 build/pc /nix/store/jj9j4kylaprcpy46gpdjb2mj119vbdpr-libfsm/share/pkgconfig/pc
install: omitting directory 'build/pc'
*** Error code 1

Digraph not rendering

Rather than a graph rendering I'm seeing the following (I'm using kitty):

; re -cb -pl dot '[Ll]ibf+(sm)*' '[Ll]ibre' | dot                                                                             ⬡ 12.6.0 
digraph G {
	graph [bb="0,0,497,111.51",
		rankdir=LR,
		root=start
	];
	node [label="",
		shape=circle,
		width=0.3
	];
	edge [weight=2];
	start	[height=0.5,
		pos="11,57.506",
		shape=none,
		width=0.30556];
	S2	[height=0.30556,
		pos="70,57.506",
		width=0.30556];
	start -> S2	[pos="e,58.898,57.506 22.19,57.506 29.567,57.506 39.78,57.506 48.785,57.506"];
	S3	[height=0.30556,
		pos="137,57.506",
		width=0.30556];
	S2 -> S3	[label=<L>,
		lp="103.5,65.006",
		pos="e,125.7,57.506 81.156,57.506 90.356,57.506 104.19,57.506 115.66,57.506"];
	S2 -> S3	[label=<l>,
		lp="103.5,46.006",
		pos="e,128.68,49.811 78.655,50.442 83.938,46.068 91.363,40.852 99,38.506 106.5,36.202 114.29,39.274 120.79,43.575"];
	S0	[height=0.41667,
		pos="482,19.506",
		shape=doublecircle,
		width=0.41667];
	S1	[height=0.30556,
		pos="404,19.506",
		width=0.30556];
	S0 -> S1	[label=<s>,
		lp="443,9.0058",
		pos="e,413.18,13.341 469.71,10.455 460.98,4.5659 448.5,-1.5058 437,1.5058 431.8,2.8677 426.56,5.2849 421.86,7.9146"];
	S1 -> S0	[label=<m>,
		lp="443,27.006",
		pos="e,466.78,19.506 415.25,19.506 425.84,19.506 442.75,19.506 456.77,19.506"];
	S4	[height=0.30556,
		pos="199,57.506",
		width=0.30556];
	S3 -> S4	[label=<i>,
		lp="168,65.006",
		pos="e,187.97,57.506 148.18,57.506 156.24,57.506 167.77,57.506 177.7,57.506"];
	S6	[height=0.30556,
		pos="264,57.506",
		width=0.30556];
	S4 -> S6	[label=<b>,
		lp="231.5,65.006",
		pos="e,252.69,57.506 210.12,57.506 218.8,57.506 231.56,57.506 242.38,57.506"];
	S5	[height=0.30556,
		pos="331,96.506",
		width=0.30556];
	S6 -> S5	[label=<r>,
		lp="295.5,84.006",
		pos="e,320.96,91.079 273.77,62.766 283.6,68.664 299.8,78.384 312.29,85.88"];
	S7	[height=0.41667,
		pos="331,19.506",
		shape=doublecircle,
		width=0.41667];
	S6 -> S7	[label=<f>,
		lp="295.5,48.006",
		pos="e,317.36,26.893 273.77,52.381 282.63,47.202 296.66,38.998 308.49,32.082"];
	S8	[height=0.41667,
		pos="404,96.506",
		shape=doublecircle,
		width=0.41667];
	S5 -> S8	[label=<e>,
		lp="367.5,104.01",
		pos="e,388.93,96.506 342.17,96.506 351.66,96.506 366.19,96.506 378.67,96.506"];
	S7 -> S1	[label=<s>,
		lp="367.5,27.006",
		pos="e,392.68,19.506 346.03,19.506 356.44,19.506 370.8,19.506 382.5,19.506"];
	S7 -> S7	[label=<f>,
		lp="331,60.006",
		pos="e,336.75,33.409 325.25,33.409 323.5,43.031 325.42,52.506 331,52.506 334.49,52.506 336.54,48.805 337.17,43.68"];
}

I assume this an issue with my dot setup rather than anything to do with libfsm but would appreciate any advice!

wishlist: use a template language for code generation

I find files like src/lx/print/c.c difficult to read and maintain due to complicated sequences of fprintf calls which obscure the structure of the code that's being generated. I speculate that using a simple templating language would make the code generation process easier to understand. Perhaps the templates would even be a useful public interface to lx, so users could write their own templates without having to recompile lx.

Although I haven't used it, there is a C implementation of the Mustache templating language that looks to me like it might work perfectly here. For Mustache syntax see https://mustache.github.io/ and for the C implementation check out https://gitlab.com/jobol/mustach.

PCRE dialect doesn't handle option changes correctly

Running retest to develop dfavm, I found a few inconsistencies between libfsm and the PCRE spec:

[   2/  12] (pcre-tests.tst:606) regexp /^([ab](?i)[cd]|[ef])/ should match "Europe " but doesn't
[   3/  12] (pcre-tests.tst:608) regexp /^([ab](?i)[cd]|[ef])/ should match "France" but doesn't
[   6/  12] (pcre-tests.tst:1043) regexp /(?:(?i)a)b/ should not match "aB" but does
[   7/  12] (pcre-tests.tst:1139) regexp /((?i)AB(?-i)C|D)E/ should not match "abCe  " but does
[   8/  12] (pcre-tests.tst:1140) regexp /((?i)AB(?-i)C|D)E/ should not match "dE" but does
[   9/  12] (pcre-tests.tst:1141) regexp /((?i)AB(?-i)C|D)E/ should not match "De    " but does

(There are a few other unrelated errors.)

These errors fall into two groups:

Errors 1, 2, 3, 9: An option changed within a group are not applied to the remaining patterns and ALTs within that group
Errors 6, 7, 8, An option changed within a group is still applied outside of that group

The relevant spec is here: http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC13

When one of these option changes occurs at top level (that is, not inside group parentheses), the change applies to the remainder of the pattern that follows. An option change within a group (see below for a description of groups) affects only that part of the group that follows it, so
(a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same group.
For example,
(a(?i)b|c)
matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting.

Notes:

I've attached the pcre-tests.tst file, zipped so github will take it: pcre-tests.zip. This was converted from the PCRE test suite and then pruned to remove tests that use back references and a few other things that require back tracking regexp. I've written a program to convert the pcre test suite directly that I'll submit as part of a larger PR on dfavm: https://github.com/sfstewman/libfsm/blob/1af925d992c1e572ba69259845b5d8c80d06cb16/src/retest/cvtpcre.c

Fails to build on Fedora with bmake

Hi! I'm trying to build libfsm on Fedora, I've installed the bmake package and running bmake -r install produces the following errors:

bmake: "/usr/share/mk/subdir.mk" line 93: Inconsistent operator for all
bmake: "/usr/share/mk/subdir.mk" line 93: Inconsistent operator for clean
bmake: "/usr/share/mk/init.mk" line 85: Inconsistent operator for all
bmake: "/usr/share/mk/subdir.mk" line 102: Inconsistent operator for all
bmake: "/home/caleb/git/libfsm/Makefile" line 88: Could not find pc.mk
bmake: "/home/caleb/git/libfsm/Makefile" line 89: Could not find sid.mk
bmake: "/home/caleb/git/libfsm/Makefile" line 90: Could not find pkgconf.mk
bmake: "/home/caleb/git/libfsm/Makefile" line 91: Could not find lx.mk
bmake: "/usr/share/mk/obj.mk" line 46: Malformed conditional (${MK_AUTO_OBJ} == "yes")
bmake: "/home/caleb/git/libfsm/Makefile" line 94: Could not find ar.mk
bmake: "/home/caleb/git/libfsm/Makefile" line 95: Could not find so.mk
bmake: "/home/caleb/git/libfsm/Makefile" line 96: Could not find part.mk
bmake: "/usr/share/mk/prog.mk" line 140: Inconsistent operator for all
bmake: "/usr/share/mk/subdir.mk" line 102: Inconsistent operator for all
bmake: "/usr/share/mk/final.mk" line 11: Malformed conditional (${MK_STAGING} == "yes")
bmake: "/usr/share/mk/final.mk" line 18: Inconsistent operator for install
bmake: "/home/caleb/git/libfsm/Makefile" line 98: Could not find mkdir.mk
bmake: "/home/caleb/git/libfsm/Makefile" line 105: Could not find install.mk
bmake: "/home/caleb/git/libfsm/Makefile" line 106: Could not find clean.mk
bmake: Fatal errors encountered -- cannot continue
bmake: stopped in /home/caleb/git/libfsm

Character class elided

This accidentally elides the content of [:digit:].

The AST does contain both the 0-3 range and the digit named class, attached to the same class-concat node.

lx doesn't always return the longest match

Given this input to lx,

"." -> $dot;
"..." -> $ellipsis;

running the generated -l dump program against the string .. (no trailing newline or other characters) outputs:

lx_next: Invalid argument
0-2:1,1-3:

Running it against ..\n outputs:

0-3:1-2,1: lexically uncategorised: '..'

In both cases I expected to see two <DOT '.'> tokens, followed by either <EOF> or an error that the newline character is lexically uncategorised, respectively.

I'm pretty sure language specifications like this require more than one character of lookahead/backtracking. An error transition needs to walk back to and return the most recent accepting state, which may be more than one transition ago.

I think a straightforward graph analysis can tell you the upper bound on the number of characters you might need to buffer. For each accepting state in the DFA, find all states that are reachable from it without hitting another accepting state. If the resulting subgraph does not contain cycles, then its maximum path length is one less than the necessary amount of lookahead. If it does contain cycles then it can require unbounded lookahead.

If I have this right, then lx_ungetc isn't sufficient in general, but at least you can detect at compile time when it isn't enough. You could even report example strings which take an error transition more than one step after the last accepting state.

Caching for predicate bits per state

The various <fsm/pred.h> predicates could be cached per state, as two bitfields: a set of their values, and a set of whether those values are valid (known), or invalidated (currently unknown).

Then whenever a predicate function is called, if a bit is valid, there's no need to re-do the work of finding out.

Various operations would invalidate these bits; for example something like adding an epsilon edge would invalidate the "isdfa" predicate bit for the state to which it's added, and any of the states which can indirectly reach it. That would be hard to find in general, so conservatively the epsilon-adding function could invalidate all "isdfa" predicates.

In a better case, for the "iscomplete" predicate, when adding an edge to a state, the fsm_addedge_literal() function could observe current completeness of the state being added to, and leave that predicate set if the edge is not already present. This can be done without touching other states.

The .isend field could possibly also be moved to one of the predicate bits; arguably it's a predicate, just predicated on externally-defined information.

It might also make sense to provide the same set of bits for an entire struct fsm.

`fsm -t minimise` changes the meaning of some FSMs

re(1) is affected too. Manually performing the same steps seems not to be.

The smallest testcase I know is (pcre or native) /[aBx]*a/ (or its equivalent NFA).

ifdef MACOS_HAS_NO_CLOCK_GETTME

this looks like a typo to me GETTME, and disabling it leads to a darwin build.
I don't know any darwin with clock_gettime, so
#if defined(__APPLE__) && defined(__MACH__) seems to be enough.