Giter Site home page Giter Site logo

skx / evalfilter Goto Github PK

View Code? Open in Web Editor NEW
117.0 6.0 12.0 1.93 MB

A bytecode-based virtual machine to implement scripting/filtering support in your golang project.

License: GNU General Public License v2.0

Go 98.34% Makefile 0.04% HTML 0.88% JavaScript 0.24% Emacs Lisp 0.50%
golang eval scripting extensions virtual-machine bytecode scripting-language fuzz-testing evaluation-engine eval-filter

evalfilter's Introduction

GoDoc Go Report Card license

eval-filter

The evalfilter package provides an embeddable evaluation-engine, which allows simple logic which might otherwise be hardwired into your golang application to be delegated to (user-written) script(s).

There is no shortage of embeddable languages which are available to the golang world, this library is intended to be something that is:

  • Simple to embed.
  • Simple to use, as there are only three methods you need to call:
  • Simple to understand.
  • As fast as it can be, without being too magical.

The scripting language is C-like, and is generally intended to allow you to filter objects, which means you might call the same script upon multiple objects, and the script will return either true or false as appropriate to denote whether some action might be taken by your application against that particular object.

It certainly is possible for you to handle arbitrary return-values from the script(s) you execute, and indeed the script itself could call back into your application to carry out tasks, via the addition of new primitives implemented and exported by your host application, which would make the return value almost irrelevant.

If you go down that route then this repository contains a general-purpose scripting-language, which can be used to execute user-supplied scripts.

My Google GMail message labeller uses the evalfilter in such a standalone manner, executing a script for each new/unread email by default. The script can then add labels to messages based upon their sender/recipients/subjects. etc. The notion of filtering there doesn't make sense, it just wants to execute flexible operations on messages.

However the ideal use-case, for which this was designed, is that your application receives objects of some kind, perhaps as a result of incoming webhook submissions, network events, or similar, and you wish to decide how to handle those objects in a flexible fashion.

Implementation

In terms of implementation the script to be executed is split into tokens by the lexer, then those tokens are parsed into an abstract-syntax-tree. Once the AST exists it is walked by the compiler and a series of bytecode instructions are generated.

Once the bytecode has been generated it can be executed multiple times, there is no state which needs to be maintained, which makes actually executing the script (i.e. running the bytecode) a fast process.

At execution-time the bytecode which was generated is interpreted by a naive stack-based virtual machine, with some runtime support to provide the built-in functions, as well as supporting the addition of host-specific functions.

The bytecode itself is documented briefly in BYTECODE.md, but it is not something you should need to understand to use the library, only if you're interested in debugging a misbehaving script.

Scripting Facilities

Types

The scripting-language this package presents supports the basic types you'd expect:

  • Arrays.
  • Floating-point numbers.
  • Hashes.
  • Integers.
  • Regular expressions.
  • Strings.
  • Time / Date values.
    • i.e. We can use reflection to handle time.Time values in any structure/map we're operating upon.

The types are supported both in the language itself, and in the reflection-layer which is used to allow the script access to fields in the Golang object/map you supply to it.

Built-In Functions

These are the built-in functions which are always available, though your users can write their own functions within the language (see functions).

You can also easily add new primitives to the engine, by defining a function in your golang application and exporting it to the scripting-environment. For example the print function to generate output from your script is just a simple function implemented in Golang and exported to the environment. (This is true of all the built-in functions, which are registered by default.)

  • between(value, min, max);
    • Return true if the specified value is between the specified range (inclusive, so between(1, 1, 10); will return true.)
  • float(value)
    • Tries to convert the value to a floating-point number, returns Null on failure.
    • e.g. float("3.13").
  • getenv(value)
    • Return the value of the named environmental variable, or "" if not found.
  • int(value)
    • Tries to convert the value to an integer, returns Null on failure.
    • e.g. int("3").
  • join(array,deliminator)
    • Return a string consisting of the array elements joined by the given string.
  • keys
    • Returns the available keys in the specified hash, in sorted order.
  • len(field | value)
    • Returns the length of the given value, or the contents of the given field.
    • For arrays it returns the number of elements, as you'd expect.
  • lower(field | value)
    • Return the lower-case version of the given input.
  • max(a, b)
    • Return the larger number of the two parameters.
  • min(a, b)
    • Return the smaller number of the two parameters.
  • panic() / panic("Your message here");
    • These will deliberately stop execution, and return a message to the caller.
  • print(field|value [, fieldN|valueN] )
    • Print the given values.
  • printf("Format string ..", arg1, arg2 .. argN);
    • Print the given values, with the specified golang format string
      • For example printf("%s %d %t\n", "Steve", 9 / 3 , ! false );
  • replace(input, /regexp/, value)
    • Perform a replacement with value of the matches of the given regexp in the input-value.
  • reverse(["Surname", "Forename"]);
    • Sorts the given array in reverse.
    • Add true as the second argument to ignore case.
  • sort(["Surname", "Forename"]);
    • Sorts the given array.
    • Add true as the second argument to ignore case.
  • split("string", "value");
    • Splits a string into an array, by the given substring.
  • sprintf("Format string ..", arg1, arg2 .. argN);
    • Format the given values, using the specified golang format string.
  • string( )
    • Converts a value to a string. e.g. "string(3/3.4)".
  • trim(field | string)
    • Returns the given string, or the contents of the given field, with leading/trailing whitespace removed.
  • type(field | value)
    • Returns the type of the given field, as a string.
      • For example string, integer, float, array, boolean, or null.
  • upper(field | value)
    • Return the upper-case version of the given input.
  • hour(field|value), minute(field|value), seconds(field|value)
    • Allow converting a time to HH:MM:SS.
  • day(field|value), month(field|value), year(field|value)
    • Allow converting a time to DD/MM/YYYY.
  • weekday(field|value)
    • Allow converting a time to "Saturday", "Sunday", etc.
  • now() & time() both return the current time.

Conditionals

As you'd expect the facilities are pretty normal/expected:

  • Perform comparisons of strings and numbers:
    • equality:
      • "if ( Message == "test" ) { return true; }"
    • inequality:
      • "if ( Count != 3 ) { return true; }"
    • size (<, <=, >, >=):
      • "if ( Count >= 10 ) { return false; }"
      • "if ( Hour >= 8 && Hour <= 17 ) { return false; }"
    • String matching against a regular expression:
      • "if ( Content ~= /needle/ )"
      • "if ( Content ~= /needle/i )"
        • With case insensitivity
    • Does not match a regular expression:
      • "if ( Content !~ /some text we don't want/ )"
    • Test if an array contains a value:
      • "return ( Name in [ "Alice", "Bob", "Chris" ] );"
  • Ternary expressions are also supported - but nesting them is a syntax error!
    • "a = Title ? Title : Subject;"
    • "return( result == 3 ? "Three" : "Four!" );"

Loops

Our script implements a golang-style loop, using either for or while as the keyword:

count = 0;
while ( count < 10 ) {
     print( "Count: ", count, "\n" );
     count++;
}

You could use either statement to iterate over an array contents, but that would be a little inefficient:

items = [ "Some", "Content", "Here" ];
i = 0;
for ( i < len(items) ) {
   print( items[i], "\n" );
   i++
}

A more efficient and readable approach is to iterate over arrays, and the characters inside a string, via foreach. You can receive both the index and the item at each step of the iteration like so:

foreach index, item in [ "My", "name", "is", "Steve" ] {
    printf( "%d: %s\n", index, item );
}

If you don't supply an index you'll receive just the item being iterated over instead, as you would expect (i.e. we don't default to returning the index, but the value in this case):

len = 0;
foreach char in "狐犬" {
    len++;
}
return( len == 2 );

The same kind of iteration works over hashes too (the single-argument version of the foreach loop iterates over values, rather than keys. Hash keys are available via keys so that seems like a more useful thing to return):

foreach key,value in { "Name": "Steve", "Location": "Finland" } {
  printf("Key %s has value %s\n", key, value );
}

The final helper is the ability to create arrays of integers via the .. primitive:

sum = 0;
foreach item in 1..10 {
    sum += item;
}
print( "Sum is ", sum, "\n" );

Here you note that len++ and sum += item; work as you'd expect. There is support for +=, -=, *=, and /=. The ++ and -- postfix operators are both available (for integers and floating-point numbers).

Functions

You can declare functions, for example:

function sum( input ) {
   local result;
   result = 0;
   foreach item in input {
     result = result + item;
   }
   return result;
}

printf("Sum is %d\n", sum(1..10));
return false;

See _examples/scripts/scope.in for another brief example, and discussion of scopes.

Case / Switch

We support the use of switch and case to simplify the handling of some control-flow. An example would look like this:

switch( Subject ) {
  case /^Re:/i {
     printf("Reply\n");
  }
  case /^fwd:/i {
     printf("Forwarded message\n");
  }
  case "DEAR" + "  " + WINNER" {
     printf("SPAM\n");
  }
  case "YOU HAVE WON" {
     printf("SPAM\n");
  }
  default {
     printf("New message!\n");
  }
}

Note that the case expression supports the following, as demonstrated in our switch example:

  • Expression matches.
  • Literal matches.
  • Regular expression matches.

To avoid fall-through-related bugs we've explicitly designed the case-statements to take blocks as arguments, rather than statements.

NOTE: Only the first matching case-statement will execute. In the following example only one message will be output:

count = 1;

switch( count ) {
  case 1 {
     printf("Match!\n");
  }
  case 1 {
     printf("This is not a match - the previous case statement won!\n");
  }
}

Use Cases

The motivation for this project came from a problem encountered while working:

  • I wanted to implement a simple "on-call notifier".
    • When messages were posted to Slack channels I wanted to sometimes trigger a phone-call to the on-call engineer.
    • Of course not all Slack-messages were worth waking up an engineer for.

The expectation was that non-developers might want to change the matching of messages to update the messages which were deemed worthy of waking up the on-call engineer. They shouldn't need to worry about rebuilding the on-call application, nor should they need to understand Go. So the logic was moved into a script and this evaluation engine was born.

Each time a Slack message was received it would be placed into a simple structure:

type Message struct {
    Author  string
    Channel string
    Message string
    Sent    time.Time
}

Then a simple script could then be executed against that object to decide whether to initiate a phone-call:

//
// You can see that comments are prefixed with "//".
//
// In my application a phone-call would be triggered if this
// script hit `return true;`.  If the return value was `false`
// then nothing would happen.
//

//
// If this is within office hours we'll assume somebody is around to
// handle the issue, so there is no need to raise a call.
//
if ( hour(Sent) >= 9 || hour(Sent) <= 17 ) {

    // 09AM - 5PM == Working day.  No need to notify anybody.

    // Unless it is a weekend, of course!
    if ( weekday(Sent) != "Saturday" && weekday(Sent) != "Sunday" ) {
       return false;
    }
}

//
// A service crashed with a panic?
//
// If so raise the engineer.
//
if ( Message ~=  /panic/i ) { return true; }


//
// At this point we decide the message is not important, so
// we ignore it.
//
// In a real-life implementation we'd probably work the other
// way round.  Default to triggering the call unless we knew
// it was a low-priority/irrelevant message.
//
return false;

You'll notice that we test fields such as Sent and Message here which come from the object we were given. That works due to the magic of reflection. Similarly we called a number of built-in functions related to time/date. These functions understand the golang time.Time type, from which the Sent value was read via reflection.

(All time.Time values are converted to seconds-past the Unix Epoch, but you can retrieve all the appropriate fields via hour(), minute(), day(), year(), weekday(), etc, as you would expect. Using them literally will return the Epoch value.)

Security

The user-supplied script is parsed and turned into a set of bytecode-instructions which are then executed. The bytecode instruction set is pretty minimal, and specifically has zero access to:

  • Your filesystem.
    • i.e. Reading files is not possible, neither is writing them.
  • The network.
    • i.e. Making outgoing network requests is not possible.

Of course you can export functions from your host-application to the scripting environment, to allow such things. If you do add primitives that have the possibility to cause security problems then the onus is definitely on you to make sure such accesses are either heavily audited or restricted appropriately.

Denial of Service

When it comes to security problems the most obvious issue we might suffer from is denial-of-service attacks; it is certainly possible for this library to be given faulty programs, for example invalid syntax, or references to undefined functions. Failures such as those would be detected at parse/run time, as appropriate.

In short running user-supplied scripts should be safe, but there is one obvious exception, the following program is valid:

print( "Hello, I'm wasting your time\n") ;

while( 1 ) {
  // Do nothing ..
}

print( "I'm never reached!\n" );

This program will never terminate! If you're handling untrusted user-scripts, you'll want to ensure that you explicitly setup a timeout period.

The following will do what you expect:

// Create the evaluator on the (malicious) script
eval := evalfilter.New(`while( 1 ) { } `)

// Setup a timeout period of five seconds
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
eval.SetContext(ctx)

// Now prepare as usual
err = eval.Prepare()
if ( err != nil ) { // handle error }

// Now execute as usual
ret, err = eval.Execute( object )
if ( err != nil ) { // handle error }

The program will be terminated with an error after five seconds, which means that your host application will continue to run rather than being blocked forever!

Misc.

You can find syntax-highlighters for evalfilter code beneath misc/.

  • A javascript syntax-highlighter for embedding scripts in web-pages.
  • An Emacs mode for working with evalfilter scripts.

Sample Usage

To give you a quick feel for how things look you could consult the following simple examples:

  • example_test.go.
    • This filters a list of people by their age.
  • example_function_test.go.
    • This exports a function from the golang-host application to the scripting environment.
      • This is a demonstration of how you'd provide extra functionality when embedding the engine in your own application.
    • The new function is then used to filter a list of people.
  • example_user_defined_function_test.go
    • Writing a function within the scripting-environment, and then calling it.
  • _examples/embedded/variable/
    • Shows how to pass a variable back and forth between your host application and the scripting environment

Additional Examples

Additional examples of using the library to embed scripting support into simple host applications are available beneath the _examples/embedded directory.

There are also sample scripts contained beneath _examples/scripts which you can examine.

The standalone driver located beneath cmd/evalfilter allows you to examine bytecode, tokens, and run the example scripts, as documented later in this README file.

Finally if you want to compile this library to webassembly, and use it in a web-page that is also possible! See wasm/ for details.

Standalone Use

If you wish to experiment with script-syntax, after looking at the example scripts you can install the standalone driver:

go install github.com/skx/evalfilter/v2/cmd/evalfilter@latest

This driver, contained within the repository at cmd/evalfilter has a number of sub-commands to allow you to experiment with the scripting environment:

  • Output a disassembly of the bytecode instructions the compiler generated when preparing your script.
  • Run a script.
    • Optionally with a JSON object as input.
  • View the lexer and parser outputs.

Help is available by running evalfilter help, and the sub-commands are documented thoroughly, along with sample output.

TAB-completion is supported if you're running bash, execute the following to enable it:

$ source <(evalfilter bash-completion)

Benchmarking

The scripting language should be fast enough for most purposes; it will certainly cope well with running simple scripts for every incoming HTTP-request, for example. If you wish to test the speed there are some local benchmarks available.

You can run the benchmarks as follows:

go test -test.bench=evalfilter_ -benchtime=10s -run=^t
goos: linux
goarch: amd64
pkg: github.com/skx/evalfilter/v2
Benchmark_evalfilter_complex_map-4   	 4426123	      2721 ns/op
Benchmark_evalfilter_complex_obj-4   	 7657472	      1561 ns/op
Benchmark_evalfilter_simple-4        	15309301	       818 ns/op
Benchmark_evalfilter_trivial-4       	100000000	       105 ns/op
PASS
ok  	github.com/skx/evalfilter/v2	52.258s

The examples there are not particularly representative, but they will give you an idea of the general speed. In the real world the speed of the evaluation engine is unlikely to be a significant bottleneck.

One interesting thing that shows up clearly is that working with a struct is significantly faster than working with a map. I can only assume that the reflection overhead is shorter there, but I don't know why.

Fuzz Testing

Fuzz-testing is basically magic - you run your program with random input, which stress-tests it and frequently exposes corner-cases you've not considered.

This project has been fuzz-tested repeatedly, and FUZZING.md contains notes on how you can carry out testing of your own with the integrated fuzz-testing available with the 1.18+ version of the golang release.

API Stability

The API will remain as-is for given major release number, so far we've had we've had two major releases:

  • 1.x.x
    • The initial implementation which parsed script into an AST then walked it.
  • 2.x.x
    • The updated design which parses the given script into an AST, then generates bytecode to execute when the script is actually run.

The second release was implemented to perform a significant speedup for the case where the same script might be reused multiple times.

See Also

This repository was put together after experimenting with a scripting language, writing a BASIC interpreter, a FORTH interpreter, and a TCL-like scripting language.

I've also played around with a couple of compilers which might be interesting to refer to:

Github Setup

This repository is configured to run tests upon every commit, and when pull-requests are created/updated. The testing is carried out via .github/run-tests.sh which is used by the github-action-tester action.

Steve

evalfilter's People

Contributors

dsikes avatar skx avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

evalfilter's Issues

Remove generated `OpNop` instructions

We optimise our bytecode generation now, as a result of #65. This means that many operations are replaced with do-nothing instructions.

For example consider this script:

   if ( 3 - 2 == 1 ) { print("OK\n"); }
   return true;

This now gets compiled like so:

frodo ~/Repos/github.com/skx/evalfilter/cmd/evalfilter $ go build . ; ./evalfilter bytecode ./f.txt 
Bytecode:
  000000	         OpNop
  000001	         OpNop
  000002	         OpNop
  000003	         OpNop
  000004	         OpNop
  000005	         OpNop
  000006	         OpNop
  000007	         OpNop
  000008	         OpNop
  000009	         OpNop
  000010	        OpTrue
  000011	 OpJumpIfFalse	23
  000014	    OpConstant	0	// load constant: &{OK
}
  000017	    OpConstant	1	// load constant: &{print}
  000020	        OpCall	1	// call function with 1 arg(s)
  000023	        OpTrue
  000024	      OpReturn


Constants:
  000000 Type:STRING Value:OK
 Dump:&{OK
}
  000001 Type:STRING Value:print Dump:&{print}

Of course that stream of OpNop instructions is pointless. We just generated them to avoid messing up our offsets.

We should scan the program, remove them, and update any jump-offsets as appropriate.

Examine similar projects

Here's a small list of similar/related projects I could take inspiration from:

Expression evaluation ..

The implementation of expressions, as handled by the if statement, is not as clean or reliable as it could be. Specifically it started with the expectation it would only handle statements of the form:

  • if ( Field operator Field|Value ) { ..

Later it was updated to allow single-argument tests:

  • If ( Field ) { ...
  • if ( FunctionCall( arg1, ... argN ) ) { ..

Now I've grafted on support for &&, along with ||, but it is not a general purpose expression parser, and will not allow you to write full expressions such as these:

  • if ( Day == "Monday" && ( Hour <= 9 || Hour >= 17 ) ) { ..
  • if ( 1 + Count >= 10 ) { ...
  • if ( ! Field ) { ...

It should. Actually parsing expressions, via the golang lexer/ast would be a perfect way out of the problem:

Package split ..

Move tokens, etc, into their own packages.

Move the operations/values into their own files.

Benchmarking ..

There is an expression engine comparison repository which allows benchmarking different projects.

I added this to benchmark evalfilter:


package main

import (
	"testing"

	"github.com/skx/evalfilter"
)

func Benchmark_evalfilter(b *testing.B) {
	params := createParams()

	eval := evalfilter.New(example)

	var ret bool
	var err error
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		ret, err = eval.Run(params)
	}
	b.StopTimer()

	if err != nil {
		b.Fatal(err)
	}
	if !ret {
		b.Fail()
	}
}

The results were terrible! This project wasn't the slowest, but damn near!

$ go test -bench=. -benchtime=5s
Benchmark_bexpr-4              	 8790668	       686 ns/op
Benchmark_celgo-4              	18292764	       333 ns/op
Benchmark_celgo_startswith-4   	13063237	       467 ns/op
Benchmark_evalfilter-4         	 1872409	      3209 ns/op
Benchmark_expr-4               	32405442	       197 ns/op
Benchmark_expr_startswith-4    	16776326	       348 ns/op
Benchmark_goja-4               	14103356	       359 ns/op
Benchmark_govaluate-4          	16351401	       375 ns/op
Benchmark_gval-4               	  653475	      9432 ns/op
Benchmark_otto-4               	 6175132	       985 ns/op
Benchmark_starlark-4           	 1000000	      5960 ns/op

I don't expect significant improvements without a major overhaul, but noting here for minor tweaks & documentation-purposes.

Implement `in` for arrays.

We should simplify the process of array-iteration from this:

i = 0;
while( i < len(staff) ) {
    if ( Name == staff[i] ) {
        print("Message from staff-member " , Name, " ignoring\n" );
        return true;
    }
    i = i + 1
}

To this:

staff = [ "Steve", "Kemp", "Bob", "Maggie", "Homer Simpson" ];

if ( Name in staff ) {
    print( "Name IN staff\n" );
}

Allow nested `if`.

Right now we parse:

 if ( Count > 0 ) { return true; }

However we only allow parsing a single return statement, the block is otherwise not allowed to contain content.

If we rework the parser to allow it to be reused we can allow:

if ( 1 >= 1 ) {
   print "I'm nested\n";
   print "So AM I!\n";

   if ( 1 >= 0 ) {
      print "\tNested IF\n";
      return true;
   }
}

Add `sort` primitive.

If we have a range-operator, as per #103, it seems obviously we'll want to sort arrays.

sort( array, [bool]) should do the job. Where the flag controls case sensitivity, or not. (Optional.)

Implementation: Compile to bytecode

This was a quick hack which explains why the Run method walks the tokens directly from the lexer. However doing this is bad for two reasons:

  • We don't spot syntax-errors until we run the script.
  • We've got a performance hit if the same script is run multiple times.

Solution? Convert from the lexer-tokens to bytecode. This is trivial since we only have three main operations:

  • Call a function, with no arguments.
  • Run a comparison; return a value if it matches (true/false).
  • Return a value (true/false).

Update cmd/evalfilter some more

Currently we allow running a script, with a JSON object, and dumping the lexer-output.

Allow dumping the parse-tree too, that is more useful.

Support arrays

I've been toying with reflection and have written some horrid code which should allow array values to be read from structures/objects/maps.

However it is horrid.

Anyway before I can improve it I need to add array support, and array-indexing, to the interpreter..

BUG: foreach fails on the second iteration.

This works:

   items = [ "Steve", "Kemp"];
   foreach item in items {
           print( item, "\n" );
   }

This fails:

   items = [ "Steve", "Kemp"];
   foreach item in items {
           print( item, "\n" );
   }
   foreach item in items {
           print( item, "\n" );
   }

The second loop never resets the iteration state...

Expression support ..

Currently we support if tests of the form:

  • if ( LEFT $OP $RIGHT ) ..
  • if ( LEFT() ) ..

It would be nice if we replaced the parser with something that could handle expressions, which would allow:

  • if ( Hour >= 5 AND Hour <= 19 ) { ..

To do this we'd need to update the IfOperation structure to include Expression instances, rather than left/right Arguments.

The change would be mostly trivial, but it would be best to create a parser-package first for clarity and to improve testing.

Regexp support should be improved

The following program is valid:

if ( Name ~= /steve/i ) { print( "Steve\n" ); }
if ( Name ~= /^steve/m ) { print( "Steve\n" ); }
return false;

However when dumped it produces the wrong value, because the printing of the RegexpLiteral object doesn't care about flags properly:

$ ./evalfilter parse t.mew 
if(Name ~= /steve/i) 
{
print("Steve\n");

}
if(Name ~= /(?m)^steve/) 
{
print("Steve\n");

}
return false;

Document bytecode

We have only a few instructions, and a stack, so we should be able to document them.

Optimize our bytecode ..

OK this is going to be a long issue, but to recap the project:

We have a benchmark which runs the following code:

 if ( 1 + 2 * 3 == 7 ) { return true; } return false;

Compiled to bytecode, and slightly edited, this program looks like this:

Bytecode:
  000000	    OpConstant	0
  000003	    OpConstant	1
  000006	    OpConstant	2
  000009	         OpMul
  000010	         OpAdd
..

Constants:
  000000 Type:INTEGER Value:1 
  000001 Type:INTEGER Value:2 
  000002 Type:INTEGER Value:3 

As per our bytecode documentation the meaning of this should be clear, I hope.

  • We take the value of the constant with the ID 0, and push that onto the stack.
  • We take the value of the constant with the ID 1, and push that onto the stack.
  • We take the value of the constant with the ID 2, and push that onto the stack.
  • At this point the stack contains: [1 2 3]
  • We find the OpMul instruction
    • So we pop two values off the stack "3", and "2", then multiply them, and store the result back.
    • That makes the stack look like this: [1 6]
  • We find the OpAdd instruction.
    • So we pop two values off the stack "6", and "1", add them, and store the result back.
    • That makes the stack look like this: [7]

This is fine, it is slow, but it is provably correct and simple to understand.

However we should be able to make it faster. What we need to do is introduce a new instruction specifically to cheat at benchmarks!

We already allow the opcodes our virtual machine executes to have (integer) arguments. Since dealing with numbers is common we should use that to handle storing integers that are in the range 0-65535. We can imagine our code would then look like this:

  000000	        OpPush	1  // Push the integer value 1 onto the stack.  No constant used :)
  000003	        OpPush	2  // Push the integer value 2 onto the stack.  No constant used :)
  000006	        OpPush	3  // Push the integer value 3 onto the stack.  No constant used :)
  000009	         OpMul
  000010	         OpAdd
  000011	        OpPush	7
  000014	       OpEqual
  000015	 OpJumpIfFalse	20
  000018	        OpTrue
  000019	      OpReturn
  000020	       OpFalse
  000021	      OpReturn

Here we see we're directly pushing the values 1, 2, and 3 to the stack, without the use of the constant-pool. This doesn't really gain us any speed in itself. But this is where we get clever. We can now optimize our bytecode, before we run it. We can scan our bytecode from start to finish and look for the case where we have:

  • OpStore XX
  • OpStore YY
  • OpAdd | OpMul | OpSub | OpDiv ..

Since we're dealing with constants we can replace:

OpStore 2
OpStore 3
OpMul

With:

OpStore 6
NOP   // previous opstore
NOP   // previous opstore-arg
NOP   // previous opstore-arg
NOP   // previous opmul

Here we've added for "no operations" - one for the multiply instruction, and three for the loading of the second constant. We've obviously updated the first in-place with the expected result.

Of course in the real world this won't help us much. But we should be able to repeat this process to collapse our multiple and addition-benchmark into something faster.

Feature: Allow calling functions with arguments.

Ideally the arguments should be unbound, but a good first approach would be to allow calling with a single struct-member.

For example:

 if ( len( Message ) > 50 ) { return false; }   // We don't care about short messages.

Not impossible, but it will require a proper parser, which ties into #2.

Reflection is slow ..

I did some profiling and it seems that reflection is one of our slowest points. Initially I thought I'd just cache the value of field-lookups but that meant that results would be wrong if the same instance was shared between Run calls.

(Calculating a hash of the object to cache on a per-object basis was slower than doing nothing at all!)

I suspect the thing to do is to perform the reflection once at the top of the Run method, then rely upon the field being available. Logically this doesn't change anything, but it ensures that each time Run is hit with a new object we have up to date field-values.

This will provide a big boost if any field is referred to more than once:

  if ( Name == "Steve" || Name == "Bob" )

Here Name would have been looked up with reflection twice, but now it'll only be done once.

We should support time.Time.

The virtual machine attempts to use reflection to process all the fields of the map/object it executing against. However right now we just support "basic" types (string, int, etc).

Given a definition and object like this it should be possible to work with the time:

	type Input struct {
		Message string
		At      time.Time
	}

	in := &Input{Message: "This is a test",
		At: time.Now()}

There are two ways we could go here:

  • Define a time object. Similar to the int/float/array values we have.
  • Just parse the date/time to the seconds-past-the-unix-epoch value.

Regardless we'll then need some builtin-functions to work with the value. For example we should expect something like this to work:

if ( hour(At) < 07 || hour(At) >= 19) {
   print( "This is out of hours ..\n" );
}

I guess we'll need to define:

  • hour()
  • minute()
  • seconds()
  • day()
  • month()
  • year()
  • weekday() -> "Saturday", etc.

The only alternative right now is for the user embedding the script to write a built-in "IsWeekend()", or "WorkingHours()" function. That is something I've done for my on-call script, but it is a pain.

Another bytecode-optimizer failure

The following script does not work as expected:

print( "Before\n" );
if ( 1 == 0 ) {
   print( "Weird\n" );
}
if ( 0 == 0 ) {
   print( "Better\n");
}
print( "After" );
return false;

Running produces the output:

$ ./evalfilter run ./test.case |head -n 20
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before
Before

The output of the bytecode with the optimizer disabled is fine:

$ ./evalfilter bytecode -no-optimizer ./test.case
Bytecode:
  000000	    OpConstant	0	// load constant: "Before\n"
  000003	    OpConstant	1	// load constant: "print"
  000006	        OpCall	1	// call function with 1 arg(s)
  000009	        OpPush	1
  000012	        OpPush	0
  000015	       OpEqual
  000016	 OpJumpIfFalse	28
  000019	    OpConstant	2	// load constant: "Weird\n"
  000022	    OpConstant	1	// load constant: "print"
  000025	        OpCall	1	// call function with 1 arg(s)
  000028	        OpPush	0
  000031	        OpPush	0
  000034	       OpEqual
  000035	 OpJumpIfFalse	47
  000038	    OpConstant	3	// load constant: "Better\n"
  000041	    OpConstant	1	// load constant: "print"
  000044	        OpCall	1	// call function with 1 arg(s)
  000047	    OpConstant	4	// load constant: "After"
  000050	    OpConstant	1	// load constant: "print"
  000053	        OpCall	1	// call function with 1 arg(s)
  000056	       OpFalse
  000057	      OpReturn


Constants:
  000000 Type:STRING Value:"Before\n"
  000001 Type:STRING Value:"print"
  000002 Type:STRING Value:"Weird\n"
  000003 Type:STRING Value:"Better\n"
  000004 Type:STRING Value:"After"

But with it enabled we clearly see OpJump 0x0000 - which causes execution to return to the start of the program:

$ ./evalfilter bytecode ./test.case
Bytecode:
  000000	    OpConstant	0	// load constant: "Before\n"
  000003	    OpConstant	1	// load constant: "print"
  000006	        OpCall	1	// call function with 1 arg(s)
  000009	        OpJump	0
  000012	    OpConstant	2	// load constant: "Weird\n"
  000015	    OpConstant	1	// load constant: "print"
  000018	        OpCall	1	// call function with 1 arg(s)
  000021	    OpConstant	3	// load constant: "Better\n"
  000024	    OpConstant	1	// load constant: "print"
  000027	        OpCall	1	// call function with 1 arg(s)
  000030	    OpConstant	4	// load constant: "After"
  000033	    OpConstant	1	// load constant: "print"
  000036	        OpCall	1	// call function with 1 arg(s)
  000039	       OpFalse
  000040	      OpReturn


Constants:
  000000 Type:STRING Value:"Before\n"
  000001 Type:STRING Value:"print"
  000002 Type:STRING Value:"Weird\n"
  000003 Type:STRING Value:"Better\n"
  000004 Type:STRING Value:"After"

Two things here:

  • Fix the problem, obviously.
  • But also allow running a program with the optimizer disabled, to compare behaviour.

Optimize jumps

This sequence is a jump which will never occur:

OpTrue
OpJumpIfFalse 0x1234

This is an unconditional jump:

OpFalse
OpJumpIfFalse 0x1234

We can rewrite inline.

We should support a `range` operation.

Given a structure:

 type Message struct {
       Labels []string
 }

We can run a script which iterates over the items:

   print("\tThe message has ", len(Labels), " labels.\n");
   i = 0;
   while( i < len(Labels) ) {
     print("\tLabel ", i+1 , " is " , Labels[i], "\n");
     i = i + 1;
   }

But that's gross.

We should support a range operator:

 foreach index, entry  Labels  {    
        print("Label ", i, " is ", entry );
 }

Syntax is somewhat open, but the important thing is that we get access to the index, the entry, and users don't notice and complain about the lack of i++ support ;)

Single argument form of `if`...

If the if-statement is a single-argument form, i.e uses a function-result, we match on the literal string true.

Implement a truthy-result:

  • Boolean true is true
  • non-empty string is true
  • non-zero integer is true
  • Everything else is false.

Stack grows larger than it should ..

I'm suspicious of our stack-handling. It should be essentially empty at the end of most programs, but observation shows it is not.

TODO:

  • Add a debug flag to the run-command, and investigate.

Seems harmless,but it shouldn't happen.

Logical AND + OR parse less well than they might.

The following expression is not handled:

  (Origin == "MOW" || Country == "RU") && (Value >= 100 || Adults == 1)

Instead it needs to be rewritten as:

  ( ( Origin == "MOW" ) || ( Country == "RU" ) ) && ( (Value >= 100 ) || ( Adults == 1) )

Almost seems like we're not returning bool values for the comparisons...

Simplify optimisation code.

We have two parts of the optimizer that fold OpNop instructions.

We should also remove some of the if code around optimisation and dumping.

Optimizer makes a mistake

To resolve #65 we added an optimizer, which we later improved. However it generates code which is not 100% correct in all cases.

Specifically when we see an integer in the range 0x0000-0xffff we can store it "inline". Then we can collapse operations using multiple integers into just holding the result. The problem is that we don't take into account the result of the calculation.

This is an example:

  if ( 3 - 7 == -4 ) { return true; } return false;

Here we start off nicely:

Bytecode:
  000000	        OpPush	3
  000003	        OpPush	7
  000006	         OpSub
  000007	        OpPush	4
  000010	       OpMinus
  000011	       OpEqual
  000012	 OpJumpIfFalse	17
  000015	        OpTrue
  000016	      OpReturn
  000017	       OpFalse
  000018	      OpReturn

Then we try to merge the first two operations and end up:

  000000	        OpPush	65532
  000003	        OpPush	4
  000006	       OpMinus
  000007	       OpEqual
  000008	 OpJumpIfFalse	13

Here the problem is that "3 - 7" gives a negative result, which means we end up generating OpPush 65532. There is a solution here:

  • Allow numbers in the range -32768 - 32768
    • Which would allow negative numbers to be stored

However we must bounds-test the result of the optimization too, not just the operands.

Implement constant folding.

The following script is a good example:

if ( Name == "Steve" ) { return true; }
if ( Name == "Bob" ) { return true; }

return false;

Using cmd/evalfilter we can compile this to bytecode, and dump it. I see this:

./evalfilter bytecode ./test.script
Bytecode:
000000 - OpLookup  0
000003 - OpConstant 1
000006 - OpEqual 
000007 - OpJumpIfFalse 15
000010 - OpTrue 
000011 - OpReturn 
000012 - OpJump   15
000015 - OpLookup  2
000018 - OpConstant 3
000021 - OpEqual 
000022 - OpJumpIfFalse 30
000025 - OpTrue 
000026 - OpReturn 
000027 - OpJump 30
000030 - OpFalse 
000031 - OpReturn


Constants:
0 - &{Name}
1 - &{Steve}
2 - &{Name}
3 - &{Bob}

You're not expected to necessarily understand this, but in brief:

  • OpLookup 0
    • Get the contents of the constant 0 ("Name"), and lookup the field value with that name. Push the result onto the stack
  • OpConstant 1
    • Get the contents of constant 1 ("Steve"), and push it onto the stack
  • OpEqual
    • Pop two values from the stack and compare them.
    • Push true/false onto the stack as appropriate.
  • OpJumpIfFalse 15
    • Pop a value from the stack.
    • Jump to offset 15 if that is false.

Anyway the important thing here is that there are two constants called Name at offset 0 and 2. These are used via OpLookup 0 and OpLookup 2. They have the same value so we should just use the first one rather than emitting the second.

Avoid needless jumps

We generate extra bytecode instructions we don't need when handling the else branch of an if.

Looking at the code in #35 we see:

000000 - OpLookup [OpCode:4] 0
000003 - OpConstant [OpCode:0] 1
000006 - OpEqual [OpCode:23] 
000007 - OpJumpIfFalse [OpCode:2] 15
000010 - OpTrue [OpCode:7] 
000011 - OpReturn [OpCode:15] 
000012 - OpJump [OpCode:1] 15    ****
000015 - OpLookup [OpCode:4] 0
000018 - OpConstant [OpCode:0] 2
000021 - OpEqual [OpCode:23] 
000022 - OpJumpIfFalse [OpCode:2] 30
000025 - OpTrue [OpCode:7] 
000026 - OpReturn [OpCode:15] 
000027 - OpJump [OpCode:1] 30   ****
000030 - OpFalse [OpCode:8] 
000031 - OpReturn [OpCode:15] 

The two lines marked **** contain jumps which are pointless. They jump to the next instruction!

Remove those.

Fix the CamelCase issues.

We'll have to rename STRING_OBJ to STRING, NOT_EQ to NOTEQ, etc.

All a pain. Might be worth doing a mass comment/package update at the time we make such sweeping changes.

Advantage is that we can be 100% clean for our lint/vet/static-checks.

Improve regular-expression parsing.

This fails, because \ doesn't escape as it does for strings:

 if ( Msg ~= /steve\//m ) { ..

This fails, because we don't cap flags:

  if ( Name ~= /steve/ic ) { ..

Generate more efficient code for `if`.

This issue replaces #36 which I'd closed as being invalid, even though it wasn't. Oops.

Let me restate the problem, with examples, to make it obvious why it is valid. There are two kinds of if statements we support:

  • If statement with a body.
  • If statement with a body, and an alternative body.

These are:

  • if ( true ) { print( "true\n"); }
  • if ( true ) { print( "true\n"); } else { print( "false\n"); }

The second case we're handling just fine. Here is current bytecode output for that second example:

skx@frodo ~/Repos/github.com/skx/evalfilter/cmd/evalfilter $ ./evalfilter bytecode 2.txt 
Bytecode:
  000000	        OpTrue
  000001	 OpJumpIfFalse	16
  000004	    OpConstant	0	// load constant: &{true}
  000007	    OpConstant	1	// load constant: &{print}
  000010	        OpCall	1	// call function with 1 arg(s)
  000013	        OpJump	25
  000016	    OpConstant	2	// load constant: &{false}
  000019	    OpConstant	1	// load constant: &{print}
  000022	        OpCall	1	// call function with 1 arg(s)
  000025	       OpFalse
  000026	      OpReturn


Constants:
  000000 Type:STRING Value:true  Dump:&{true}
  000001 Type:STRING Value:print Dump:&{print}
  000002 Type:STRING Value:false Dump:&{false}

Here we see the code has been divided into blocks as we expect. Remember this is out input:

if ( true ) {
   print( "true\n" );
} else {
   print( "false\n" );
}
return false;

Our code looks good:

  • There is a push true
  • There is a jump if false
  • That jumps to offset 16, which is where our else body lives.
  • Otherwise we run the print, and then jump to offset 25, which is the instruction after our if-statement.

All good:

  • We either fall-through the test and run the first block.
    • Then jump to the end of the if.
  • Or we jump to the else block and run the code
    • We then naturally fall off the end of the if-statement.

However the case when there is no alternative is weird. Here's our test program with no else clause:

if ( true ) { print( "true\n"); }
return false;

And the bytecode:

  000000	        OpTrue
  000001	 OpJumpIfFalse	16
  000004	    OpConstant	0	// load constant: &{true}
  000007	    OpConstant	1	// load constant: &{print}
  000010	        OpCall	1	// call function with 1 arg(s)
  000013	        OpJump	16
  000016	       OpFalse
  000017	      OpReturn

The instruction at 13 is designed to jump over the else-clause, to the statement after the if. However there is no else-clause! So we pointlessly jump to the next instruction.

That OpJump instruction should not be emitted if there is no else block.

Better bytecode debugging support

#35 used the output of the cmd/evalfilter utility as part of the process to see we were outputing too much, and confirm we'd resolved it.

However our bytecode is less readable than it could be. We show things like OpConstant 3 to mean "Load the contents of the third constant, and push onto the stack". If we wanted to make it more clear we should actually show the constant being loaded.

That would make reading the bytecode output simpler.

Implement infix range operator.

#103 is relating to the annoyance of having to write your own loop-handling for iterating over an array.

The expectation is that we can have:

 foreach item in [ "This", "Is", "An", "Array ] {
   print( item, "\n" );
 }

It occurs to me we can have a for-loop:

  foreach item in 1..10

Where X..Y returns an array of the integers in the range. Should be trivial - so long as we don't parse the number badly as "1." then . and 10.

Test cases are mandatory!

Profile code and further optimise.

No doubt there is still more options to speedup, though the reflection change in #43 is going to be the best improvement for trivial scripts:

 if ( 1 + 2 * 3 == 7 ) { return true; } else { return false ; }

(That's a terrible benchmark, because it uses no fields. But then again it is a great one because it does do some "stuff" and exercises the full chain.)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.