sicroatgit / regex-engine Goto Github PK

View Code? Open in Web Editor NEW

4.0 1.0 3.0 119 KB

A RegEx engine that builds NFA/DFA and always returns the longest match.

License: MIT License

PureBasic 100.00%

regex-to-nfa regex-matcher longest-match regex-to-dfa purebasic-module purebasic

regex-engine's Introduction

RegEx-Engine

https://github.com/SicroAtGit/RegEx-Engine

About

This RegEx engine compiles a regular expression string into an NFA, and can optionally convert the NFA into a DFA, which executes much faster; the generated NFA/DFA can then be executed against a string.

The RegEx engine will always return the longest possible match among several possible matches. During this process no backtracking is required, because all alternations are checked simultaneously.

RegExes can be assigned unique RegEx ID numbers, which allow to determine which RegEx matched when executing multiple RegExs simultaneously. This feature is useful for creating lexers, which is the main focus of the project. At the same time, the RegEx engine is kept flexible, so that it might be employed in a variety of other contexts, beside lexers creation.

Examples

Simple Match

*regEx = RegEx::Init()
If *regEx
  RegEx::AddNfa(*regEx, "test|example")
  If RegEx::Match(*regEx, @"example")
    Debug "Match!"
  Else
    Debug "No match!"
  EndIf
  RegEx::Free(*regEx)
Else
  Debug "Error!"
EndIf

Multiple RegExes Simultaneously

Enumeration
  #RegExId_Word
  #RegExId_Number
EndEnumeration

*regEx = RegEx::Init()
If *regEx
  RegEx::AddNfa(*regEx, "\w+", #RegExId_Word)
  RegEx::AddNfa(*regEx, "\d+", #RegExId_Number)
  If RegEx::Match(*regEx, @"example", @regExId)
    Select regExId
      Case #RegExId_Word:   Debug "Match is a word!"
      Case #RegExId_Number: Debug "Match is a number!"
    EndSelect
  Else
    Debug "No match!"
  EndIf
  RegEx::Free(*regEx)
Else
  Debug "Error!"
EndIf

More code examples can be found in the Source/Examples/ directory.

Supported Syntax

Syntax	Meaning
`xy`	`x` followed by `y` (Concatenation)
`x\|y`	`x` or `y` (Alternation)
`x*`	Zero or more consecutive `x`
`x+`	One or more consecutive `x`
`x?`	Zero or one `x`
`(` `)`	Groups a regular expression. Groups inherit the active modes of their parent context. Mode changes within a group have no effect on the surrounding contexts.
`\*`	Escapes the metacharacter `*` and treats it as a literal character. Works also with the other metacharacters: `\|` `+` `?` `(` `)` `\`
`\r`	Matches the carriage return character (`\x0D`)
`\n`	Matches the line feed character (`\x0A`)
`\t`	Matches the horizontal tab character (`\x09`)
`\f`	Matches the form feed character (`\x0C`)
`[x]`	Where `x` can be a combination of: single literal character, escape sequence, or range (`a-c`)
`.`	Matches any character except `\r` and `\n`
`\d`	Matches Unicode characters class Nd
`\D`	Matches any character except those in Unicode characters class Nd
`\s`	Matches Unicode characters class White_Space
`\S`	Matches any character except those in Unicode characters class White_Space
`\w`	Matches Unicode characters classes Alphabetic, M, Nd, Pc and Join_Control
`\W`	Matches any character except those in Unicode characters classes Alphabetic, M, Nd, Pc and Join_Control
`\x`	Matches the character represented by the two digit hex code `x` (`\x01`–`\xFF`)
`\u`	Matches the character represented by the four digit hex code `u` (`\u0001`–`\uFFFF`)
`(?m)`	Toggles the RegEx mode states. `m` can be one or more flags. To deactivate a RegEx mode prefix a flag with a minus sign

Unicode Support

Like the native string functions in PureBasic: UCS-2 character encoding, UTF-16 surrogate pairs are interpreted as two single UCS-2 characters.

Case-Insensitive Mode

Flag: i

The implementation uses Unicode's Simple Case Folding variant, but in reverse: instead of mapping all character variations to a single character (folding), a single character is mapped to all character variations (unfolding). This is necessary because the DFA must know all valid characters.

ASCII Mode

Flag: a

When active, the predefined character classes will only match the corresponding ASCII characters. For example, (?a)\w will match only [a-zA-Z0-9_]. The character encoding remains UCS-2 in this mode, i.e. (?a)\W matches all UCS-2 characters except [a-zA-Z0-9_].

This RegEx mode is also useful in combination with #RegExMode_NoCase when you want to lex for keywords within source code, case-insensitively, but no case-folding should be applied:

(?i)set corresponds to [Ss\u017F][Ee][Tt]
(?ia)set corresponds to [Ss][Ee][Tt]

Public Constants

EnumerationBinary RegExModes
  #RegExMode_NoCase ; Activates case-insensitive mode
  #RegExMode_Ascii  ; Activates ASCII mode
EndEnumeration

Enumeration NfaStateTypes
  #StateType_EpsilonMove ; Used for NFA epsilon moves
  #StateType_SymbolMove  ; Used for NFA symbol moves
  #StateType_SplitMove   ; Used for NFA unions
  #StateType_Final       ; Used for NFA final state
EndEnumeration

#State_DfaDeadState = 0 ; Index number of the DFA dead state

Public Structures

Structure ByteRangeStruc
  min.a ; Minimum byte value (0-255)
  max.a ; Maximum byte value (0-255)
EndStructure

Structure NfaStateStruc
  stateType.u               ; Type of the NFA state (regExId = stateType - #StateType_NfaFinal)
  byteRange.ByteRangeStruc  ; A byte range is used as a transition symbol
  *nextState1.NfaStateStruc ; Pointer to the first next NFA state
  *nextState2.NfaStateStruc ; Pointer to the second next NFA state
EndStructure

Structure DfaStateStruc
  nextState.u[256] ; Index is the symbol (0-255) and the value is the next DFA state
  isFinalState.u   ; Positive number if the DFA state is a final state, otherwise null
EndStructure

Structure DfaStatesArrayStruc
  states.DfaStateStruc[0] ; Array pointer to the DFA states
EndStructure

Structure NfaPoolStruc
  List nfaStates.NfaStateStruc() ; Holds all NFA states of the NFA pool
  *initialNfaState.NfaStateStruc ; Pointer to the NFA initial state
EndStructure

Structure RegExEngineStruc
  List nfaPools.NfaPoolStruc()       ; Holds all NFA pools
  *dfaStatesPool.DfaStatesArrayStruc ; Holds all DFA states
  isUseDfaFromMemory.b               ; `#True` if `UseDfaFromMemory()` was used, otherwise `#False`
EndStructure

Public Macros

GetString(_memoryAddress_, _lengthInBytes_)

Simplifies extracting the matched string via its memory address and length info obtained from a Match() call.

Public Functions

Init()

Creates a new RegEx engine and returns the pointer to the RegExEngineStruc structure. If an error occurred null is returned.
AddNfa(*regExEngine.RegExEngineStruc, regExString$, regExId = 0, regExModes = 0)

Compiles the RegEx string into an NFA which is added to the NFAs pool in the RegEx engine. On success #True is returned, otherwise #False. A unique number can be passed to regExId to determine later which RegEx has matched. The optional regExModes parameter allows defining which RegEx modes should be activated at the beginning; its currently supported values are:
- #RegExMode_NoCase — Activates case-insensitive mode
- #RegExMode_Ascii — Activates ASCII mode
To set multiple parameters, combine them with the | operator (bitwise OR).
CreateDfa(*regExEngine.RegExEngineStruc, clearNfa = #True)

Creates a single DFA from the existing NFAs in the RegEx engine. Match() will henceforth always use the DFA, which is much faster. Because the NFAs are no longer used after this, they are cleared by default; to preserve them set parameter clearNfa to #False. On success #True is returned, otherwise #False. If a DFA already exists, the DFA will be freed before creating a new DFA.
Free(*regExEngine.RegExEngineStruc)

Frees the RegEx engine.
UseDfaFromMemory(*dfaMemory)

Creates a new RegEx engine and assigns an existing DFA stored in external memory to the RegEx engine. After calling this procedure, the RegEx engine is immediately ready for use, without requiring to call Init(), AddNfa() or CreateDfa(). On success the pointer to RegExEngineStruc is returned, otherwise null.
Match(*regExEngine.RegExEngineStruc, *string.Unicode, *regExId.Integer = 0)

Runs the RegEx engine against the target string, passed via a pointer. The match search will start from the beginning of the string. If a match is found, the byte length of the match is returned, otherwise null. If the address of an integer variable was passed as the optional *regExId parameter, the RegEx ID number of the matching RegEx is written into it. If multiple RegExes match the same string, each having been assigned a different RegEx ID number, the RegEx ID number of the last matching RegEx will be picked, i.e. the matching RegEx that was last added with the AddNfa() function.
GetLastErrorMessages()

Returns the error messages of the last AddNfa() call, as a human-readable string.
ExportDfa(*regExEngine.RegExEngineStruc, filePath$)

Exports the created DFA as a binary file. On success #True is returned, otherwise #False.

Reduced DFA Matcher Module

The reduced module DfaMatcher provides only a DFA matcher which uses the precompiled DFAs created with the main module.

If only the precompiled DFAs are needed in the software, for matching, and no new NFAs/DFAs are to be created at runtime, then the reduced module can be used. This way the software is not unnecessarily bloated with the large Unicode tables and the rest of the code found in the main module.

Public Constants

#State_DfaDeadState = 0 ; Index number of the DFA dead state

Public Structures

Structure DfaStateStruc
  nextState.u[256] ; Index is the symbol (0-255) and the value is the next DFA state
  isFinalState.u   ; Positive number if the DFA state is a final state, otherwise null
EndStructure

Structure DfaStatesArrayStruc
  states.DfaStateStruc[0] ; Array pointer to the DFA states
EndStructure

Public Macros

GetString(_memoryAddress_, _lengthInBytes_)

Simplifies extracting the matched string via its memory address and length info obtained from a Match() call.

Public Functions

Match(*dfaMemory, *string.Unicode, *regExId.Integer = 0)

Runs the DFA against the target string, passed via a pointer. The match search will start from the beginning of the string. If a match is found, the byte length of the match is returned, otherwise null. If the address of an integer variable was passed as the optional *regExId parameter, the RegEx ID number of the matching RegEx is written into it. If multiple RegExes match the same string, each having been assigned a different RegEx ID number, the RegEx ID number of the last matching RegEx will be picked, i.e. the matching RegEx that was last added with the AddNfa() function.

Would you like to contribute to the project?

Then please check out CONTRIBUTING for details.

License

The project is licensed under the MIT license.

regex-engine's People

Contributors

Stargazers

Watchers

Forkers

tajmone maxbyz allhere

regex-engine's Issues

Add full support for UTF-16 / Unicode

This feature implements full UTF-16 / Unicode support by correctly interpreting UTF-16 surrogate pairs as single Unicode code points and extending the predefined character classes to the full Unicode code points range.

New syntax

Add escape sequence \Uhhhhhhhh (character with hex code hhhhhhhh)

New RegEx mode

(?u) — activates UTF-16 mode. The predefined character classes \w, \d etc. are then extended by the full Unicode code points range (0x01 up to 0x10FFFF).
(?-u) (default mode) — deactivates UTF-16 mode. The predefined character classes \w, \d etc. then correspond as before only to the Unicode code points possible with UCS-2 (0x01 up to 0xFFFF).

New Parameter

AddNfa(..., "\w", #RegExMode_Unicode) is the same as AddNfa(..., "(?u)\w")

PureBasic's string functions use UCS-2 encoding in Unicode mode according to the official documentation. But PureBasic uses the API functions of the operating systems for displaying the strings and these all (Windows, Linux and macOS) interpret the PureBasic string as UTF-16, so programs written in PureBasic can display all Unicode characters.

Currently, the RegEx engine also works with this UCS-2 encoding, so UTF-16 surrogate pairs are interpreted as two separate UCS-2 characters.

In order to write Unicode code points outside the range supported by UCS-2 (Unicode's Basic Multilingual Plane only) in the regex, the UTF-16 surrogate pairs currently have to be written separately. Besides the disadvantage that this is inconvenient to write, the case-insensitivity mode then also does not work correctly, because to work correctly it would have to be able to interpret a UTF-16 surrogate pair as a single Unicode code point.

Add escape sequence `\uhhhh` (character with hex code `hhhh`)

It allows writing ISO_8859-1 or Unicode characters with hexadecimal numbers. This makes it easy to write characters for which there are no keyboard keys.

\u0001 to \u00FF (same as \x01 to \xFF)
\u0100 to \uFFFF (hexadecimal numbers correspond here to the Unicode code points)

Renaming 'Example_codes' folder

You could rename the Example_codes/ folder to just examples/ and keep it short — after all, are there any examples rather than code here? And if there where, it would still be better to subfolder them (examples/code/, examples/whatever/, etc. rather than having Example_codes/, Example_whatever/, etc.).

Also, the noun code (as in source code) has no plural form — it's just "code", never "codes" (unlike, e.g. "moral codes").

Add predefined character class `\D` (no digit)

Any character except the Unicode's character class:

Nd and those exceeding \uFFFF

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ANd%3A%5D&abb=on&esc=on&g=&i=

Add feature to match multiple regexes simultaneously

A RegEx ID number can then be specified for each RegEx, and in the event of a match, the RegEx ID number can be used to determine which RegEx has matched.

If there are multiple RegExes that match the same string and have been assigned different RegEx ID numbers, the RegEx ID number of the last matched RegEx is taken, i.e. the last matched RegEx added with the AddNfa() function.

Useful for building a lexer where different RegEx ID numbers can then be used for the different token types.

Add feature to minimize the DFA

Add a new function that creates a new DFA from the existing DFA that has been minimized to the states that are really needed.

For example, the RegEx ab|cb creates this DFA:

The minimized DFA would look like this:

Add predefined character class `\s` (whitespace)

Unicode's character class:

White_Space but not those exceeding \uFFFF

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AWhite_Space%3A%5D&abb=on&esc=on&g=&i=

Add Case-Insensitive Mode

Discussed in #27

New Syntax:

(?i) means that everything after that is case-insensitive.
(?-i) means that everything after that is case-sensitive.

New Parameter

AddNfa(..., regExModes = 0)
- AddNfa(..., "a", #RegExMode_NoCase) is the same as AddNfa(..., "(?i)a")

Groups inherit the active modes of the context outside. Mode changes within a group has no effect on the context outside this group.

The implementation will use Unicode's Simple Case Folding variant (single character code point to single character code point), but in reverse: Instead of mapping all character variations to a single character (folding), a single character is mapped to all character variations (unfolding). This is necessary because the DFA must know all valid characters.

The source file from which the translation table will be created:
https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

Create reduced module that contains only a DFA matcher

If someone only wants to use the precompiled DFAs in their software for matching and not wants to create new NFA/DFA at runtime, a reduced module would not unnecessarily bloat their software with the large Unicode tables and the other code.

Add feature to export NFA/DFA table as state diagram

For this feature the tool Graphviz will be used and should replace these codes:

For example, the regex engine creates the following dot code for the regex (a|b):

digraph nfa_state_diagram {

	rankdir = LR;
	size = "8,5";
	node [shape = circle];
	"" [shape = none, fixedsize = true, height = 0, width = 0];

	"" -> 1;

	1 -> 2 [label = "&epsilon;", style = dashed];

	2 -> 3 [label = "61"];

	3 -> 4 [label = "00"];

	4 -> 5 [label = "&epsilon;", style = dashed];

	5 [shape = doublecircle];

	1 -> 6 [label = "&epsilon;", style = dashed];

	6 -> 7 [label = "62"];

	7 -> 8 [label = "00"];
	
	8 -> 5 [label = "&epsilon;", style = dashed];
}

Graphviz then generates this NFA diagram visualization from the dot code:

Shorthand character classes

In addition to character classes, there will also be shorthand character classes. However, I'm not quite sure yet which ones there should be and which characters they should cover.

According to this website, the different RegEx engines cover different characters in the shorthand character classes:
https://www.regular-expressions.info/shorthand.html

The current listing:

\d for [0-9]
\D for [^\d]
\t for the tab character
\r for carriage return (CR)
\n for linefeed (LF)
\f for form feed
\s for [ \t\r\n\f]
\S for [^\s]
\w for [A-Za-z0-9_]
\W for [^\w]
\h for [ \t]
\v for [\r\n\f]

Enable Discussions

You could enable Discussions to avoid cluttering Issues with anything that is not a pending task on the repository (e.g. Issue #1 could be converted to a Discussion, so once you've decided which shorthands to implement you can create an Issue for each pending shorthand, and keep it brief and to the point).

Add predefined character class `\d` (digit)

Unicode's character class:

Nd but not those exceeding \uFFFF

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3ANd%3A%5D&abb=on&esc=on&g=&i=

Add escape sequence `\Q`...`\E`

Any character following the escape sequence \Q is interpreted as a normal character.
The escape sequence \E can then be used to return to normal behavior.

Add escape sequence `\xhh` (character with hex code `hh`)

It allows writing ISO_8859-1 characters with hexadecimal numbers. This makes it easy to write characters for which there are no keyboard keys.

\x01 to \xFF

Add predefined character class `\S` (no whitespace)

Any character except the Unicode's character class:

White_Space and those exceeding \uFFFF

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AWhite_Space%3A%5D&abb=on&esc=on&g=&i=

Add predefined character class `\W` (no word character)

Any character except the Unicode's character classes:

Alphabetic, M, Nd, Pc, Join_Control and those exceeding \uFFFF

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AAlphabetic%3A%5D%5B%3AM%3A%5D%5B%3ANd%3A%5D%5B%3APc%3A%5D%5B%3AJoin_Control%3A%5D&abb=on&esc=on&g=&i=

Add ASCII mode

This feature implements ASCII mode. When activated, the predefined character classes will only match the corresponding ASCII characters. For example, (?a)\w will then match only [a-zA-Z0-9_]. The character encoding remains UCS-2 in this mode, i.e. (?a)\W matches all UCS-2 characters, but not [a-zA-Z0-9_].

New Syntax

(?a) — activates ASCII mode.
(?-a) — deactivates ASCII mode (default).

New Parameter

AddNfa(..., "\w", #RegExMode_Ascii) is the same as AddNfa(..., "(?a)\w")

This mode is also useful in combination with #RegExMode_NoCase when you want to lex keywords in a code, case-insensitive, but no case-folding should be applied. Example:

(?i)set corresponds to [Ss\u017F][Ee][Tt]
(?ia)set corresponds to [Ss][Ee][Tt]

Add predefined character class `\w` (word character)

Unicode's character classes:

Alphabetic, M, Nd, Pc and Join_Control, but not those exceeding \uFFFF

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AAlphabetic%3A%5D%5B%3AM%3A%5D%5B%3ANd%3A%5D%5B%3APc%3A%5D%5B%3AJoin_Control%3A%5D&abb=on&esc=on&g=&i=

Add escape sequence `\n` (line feed)

Matches the line feed character.

Add predefined character class `.` (dot, almost any character)

Matches any character up to \uFFFF except \r and \n.

Add support of limited repetitions

Syntax	Meaning
`x{n,m}`	at least `n` of `x` and at most `m` of `x`
`x{n,}`	at least `n` of `x`
`x{,m}`	at most `m` of `x`
`x{n}`	exactly `n` of `x`

If n is greater than m an error is triggered.

Add escape sequence `\r` (carriage return)

Matches the carriage return character.

Add support for custom character classes `[`...`]`

[x], while x can be several mixes of the following:

Symbol:
- Single character
- Escape sequence
Range:
- a-c for the characters a, b and c

To negate the user-defined character class, the ^ character must follow immediately after the opening square bracket. If the ^ occurs in other positions, it is a normal character, without the special meaning.

The metacharacter . (dot) is a dot character inside the square brackets, i.e. without the special meaning.

Nesting of character classes is not allowed.

For ranges, the range end symbol must have a larger character code than the range start symbol, i.e. [a-z] instead of [z-a].

This makes the changelog readable offline and the project is not so dependent on GitHub services.

sicroatgit / regex-engine Goto Github PK

regex-engine's Introduction

RegEx-Engine

About

Examples

Simple Match

Multiple RegExes Simultaneously

Supported Syntax

Unicode Support

Case-Insensitive Mode

ASCII Mode

Public Constants

Public Structures

Public Macros

Public Functions

Reduced DFA Matcher Module

Public Constants

Public Structures

Public Macros

Public Functions

Would you like to contribute to the project?

License

regex-engine's People

Contributors

Stargazers

Watchers

Forkers

regex-engine's Issues

Recommend Projects

Recommend Topics

Recommend Org