Giter Site home page Giter Site logo

soundandform / m3 Goto Github PK

View Code? Open in Web Editor NEW
18.0 1.0 1.0 102 KB

A high performance WebAssembly interpreter in C. Moved here:

Home Page: https://github.com/wasm3/wasm3

License: MIT License

C 97.90% Shell 0.14% C++ 1.96%
webassembly interpreter

m3's Introduction

This repository is deprecated. Development has moved here:

https://github.com/wasm3/wasm3

M3/Wasm

This is a work-in-progress WebAssembly interpreter written in C using a novel, high performance interpreter topology. The interpreter strategy (named M3) was developed prior to this particular Wasm project and is described some below.

Purpose

I don't know. I just woke up one day and started hacking this out after realizing my interpreter was well suited for the Wasm bytecode structure. Some ideas:

  • It could be useful for embedded systems.
  • It might be a good start-up, pre-JIT interpreter in a more complex Wasm compiler system.
  • It could serve as a Wasm validation library.
  • The interpreter topology might be inspiring to others.

Current Status

Its foundation is solid but the edges are still quite rough. Many of the WebAssembly opcodes are lacking an implementation. The compilation logic is a tad unfinished. Most execution trap cases are unhandled.

Benchmarks

It's at least fleshed out enough to run some simple benchmarks.

Tested on a 4 GHz Intel Core i7 iMac (Retina 5K, 27-inch, Late 2014). M3 was compiled with Clang -Os. C benchmarks were compiled with gcc with -O3

Mandelbrot

C code lifted from: https://github.com/ColinEberhardt/wasm-mandelbrot

Interpreter Execution Time Relative to GCC
Life 547 s 133 x https://github.com/perlin-network/life
Lua 122 s 30 x This isn't Lua running some weird Wasm transcoding; a manual Lua conversion of the C benchmark as an additional reference point.
M3 17.9 s 4.4 x (3.7 x on my 2016 MacBook Pro)
GCC 4.1 s

CRC32

Interpreter Execution Time Relative to GCC
Life 153 s 254 x
M3 5.1 s 8.5 x
GCC 600 ms

In general, the M3 strategy seems capable of executing code around 4-15X slower than compiled code on a modern x86 processor. (Older CPUs don't fare as well. I suspect branch predictor differences.) I have yet to test on anything ARM.

Building

There's only an Xcode project file currently.

M3: Massey Meta Machine

Over the years, I've mucked around with creating my own personal programming language. It's called Gestalt. The yet unreleased repository will be here: https://github.com/soundandform/gestalt

Early on I decided I needed an efficient interpreter to achieve the instant-feedback, live-coding environment I desire. Deep traditional compilation is too slow and totally unnecessary during development. And, most importantly, compilation latency destroys creative flow.

I briefly considered retooling something extant. The Lua virtual machine, one of the faster interpreters, is too Lua centric. And everything else is too slow.

I've also always felt that the "spin in a loop around a giant switch statement" thing most interpreters do was clumsy and boring. My intuition said there was something more elegant to be found.

The structure that emerged I named a "meta machine" since it mirrors the execution of compiled code much more closely than the switch-based virtual machine.

I set out with a goal of approximating Lua's performance. But this interpreter appears to beat Lua by 3X and more on basic benchmarks -- whatever they're worth!

How it works

This rough information might not be immediately intelligible without referencing the source code.

Reduce bytecode decoding overhead

  • Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code
    • M3 trades some space for time. Opcodes map to up to 3 different operations depending on the number of source operands and commutative-ness.
  • Commonly occurring sequences of operations can can also be optimized into a "fused" operation. This sometimes results in improved performance.
    • the modern CPU pipeline is a mysterious beast
  • In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach.

Tightly Chained Operations

  • M3 operations are C functions with a single, fixed function signature.

    void * Operation_Whatever (pc_t pc, u64 * sp, u8 * mem, reg_t r0, f64 fp0);
  • The arguments of the operation are the M3's virtual machine registers

    • program counter, stack pointer, etc.
  • The return argument is a trap/exception and program flow control signal

  • The M3 program code is traversed by each operation calling the next. The operations themselves drive execution forward. There is no outer control structure.

    • Because operations end with a call to the next function, the C compiler will tail-call optimize most operations.
  • Finally, note that x86/ARM calling conventions pass initial arguments through registers, and the indirect jumps between operations are branch predicted.

The End Result

Since operations all have a standardized signature and arguments are tail-call passed through to the next, the M3 "virtual" machine registers end up mapping directly to real CPU registers. It's not virtual! Instead, it's a meta machine with very low execution impedance.

M3 Register x86 Register
program counter (pc) rdi
stack pointer (sp) rsi
linear memory (mem) rdx
integer register (r0) rcx
floating-point register (fp0) xmm0

For example, here's a bitwise-or operation in the M3 compiled on x86.

m3`op_u64_Or_sr:
    0x1000062c0 <+0>:  movslq (%rdi), %rax             ; load operand stack offset  
    0x1000062c3 <+3>:  orq    (%rsi,%rax,8), %rcx      ; or r0 with stack operand
    0x1000062c7 <+7>:  movq   0x8(%rdi), %rax          ; fetch next operation
    0x1000062cb <+11>: addq   $0x10, %rdi              ; increment program counter
    0x1000062cf <+15>: jmpq   *%rax                    ; jump to next operation

Registers and Operational Complexity

  • The conventional Windows calling convention isn't compatible with M3, as-is, since it only passes 4 arguments through registers. Applying the vectorcall calling convention (https://docs.microsoft.com/en-us/cpp/cpp/vectorcall) should resolve this problem. (I haven't tried compiling this on Windows yet.)

  • It's possible to use more CPU registers. For example, adding an additional floating-point register to the meta-machine did marginally increase performance in prior tests. However, the operation space increases exponentially. With one register, there are up to 3 operations per opcode (e.g. a non-commutative math operation). Adding another register increases the operation count to 10. However, as you can see above, operations tend to be tiny.

Stack Usage

  • Looking at the above assembly code, you'll notice that once an M3 operation is optimized, it doesn't need the regular stack (no mucking with the ebp/esp registers). This is the case for 90% of the opcodes. Branching and call operations do require stack variables. Therefore, execution can't march forward indefinitely; the stack would eventually overflow.

    Therefore, loops unwind the stack. When a loop is continued, the Continue operation returns, unwinding the stack. Its return value is a pointer to the loop opcode it wants to unwind to. The Loop operations checks for its pointer and responds appropriately, either calling back into the loop code or returning the loop pointer back down the call stack.

  • Traps/Exceptions work similarly. A trap pointer is returned from the trap operation which has the effect of unwinding the entire stack.

  • Returning from a Wasm function also unwinds the stack, back to the point of the Call operation.

  • But, because M3 execution leans heavily on the native stack, this does create one runtime usage issue.

    A conventional interpreter can save its state, break out of its processing loop and return program control to the client code. This is not the case in M3 since the C stack might be wound up in a loop for long periods of time.

    With Gestalt, I resolved this problem with fibers (built with Boost Context). M3 execution occurs in a fiber so that control can effortlessly switch to the "main" fiber. No explicit saving of state is necessary since that's the whole purpose of a fiber.

    More simplistically, the interpreter runtime can also periodically call back to the client (in the either the Loop or LoopContinue operation). This is necessary, regardless, to detect hangs and break out of infinite loops.

Thoughts & Questions about WebAssembly

There are some really fundamental things about WebAssembly that I don't understand yet. There are a bunch of mysteries about this thing that don't seem sufficiently explained in plain English anywhere. Can someone orient me?

Some examples:

  • Linear memory: Modules seem hardcoded to request 16MB of memory. What? Why? As the host, where can I safely allocate things in linear memory? Who's in charge of this 16MB?
  • Traps: The spec says "Signed and unsigned operators trap whenever the result cannot be represented in the result type." Really? That's cool, but how can the behavior of source languages be modeled (efficiently)? C integer operations wrap and sometimes that's what you want.

The M3 strategy for other languages (is rad)

The Gestalt M3 interpreter works slightly differently than this Wasm version. With Gestalt, blocks of all kind (if/else/try), not just loops, unwind the native stack. (This somewhat degrades raw x86 performance.)

But, this adds a really beautiful property to the interpreter. The lexical scoping of a block in the language source code maps directly into the interpreter. All opcodes/operations end up having an optional prologue/epilogue structure. This made things like reference-counting objects in Gestalt effortless. Without this property, the compiler itself would have to track scope and insert dererence opcodes intentionally. Instead, the "CreateObject" operation is also the "DestroyObject" operation on the exit/return pathway.

Here's some pseudocode to make this more concrete:

return_t Operation_NewObject (registers...)
{
   Object o = runtime->CreateObject (registers...);
  
   * stack [index] = o;
  
   return_t r = CallNextOperation (registers...);  // executes to the end of the scope/block/curly-brace & returns
   
   if (o->ReferenceCount () == 0)
       runtime->DestroyObject (registers..., o);   // calls o's destructor and frees memory
   
   return r;
}

Likewise, a "defer" function (like in Go) becomes absolutely effortless to implement. Exceptions (try/catch) as well.

m3's People

Contributors

soundandform avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

menduz

m3's Issues

Bug runtime out of memory

I tried to run a minimal example to open a glfw window using this library.

I started the program using ./m3 glfw.wasm

The file glfw.wasm converted to glfw.wat looks like this:

(module
  (type (;0;) (func (result i32)))
  (type (;1;) (func (param i32 i32 i32 i32 i32) (result i32)))
  (type (;2;) (func))
  (type (;3;) (func (param i32) (result i32)))
  (type (;4;) (func (param i32)))
  (import "env" "glfwInit" (func $glfwInit (type 0)))
  (import "env" "glfwCreateWindow" (func $glfwCreateWindow (type 1)))
  (import "env" "glfwTerminate" (func $glfwTerminate (type 2)))
  (import "env" "glfwWindowShouldClose" (func $glfwWindowShouldClose (type 3)))
  (import "env" "glfwSwapBuffers" (func $glfwSwapBuffers (type 4)))
  (import "env" "glfwPollEvents" (func $glfwPollEvents (type 2)))
  (func $__wasm_call_ctors (type 2))
  (func $startGlfw (type 2)
    (local i32 i32)
    global.get 0
    i32.const 16
    i32.sub
    local.tee 0
    global.set 0
    block  ;; label = @1
      call $glfwInit
      i32.eqz
      br_if 0 (;@1;)
      block  ;; label = @2
        i32.const 640
        i32.const 480
        i32.const 1024
        i32.const 0
        i32.const 0
        call $glfwCreateWindow
        local.tee 1
        br_if 0 (;@2;)
        call $glfwTerminate
        local.get 0
        i32.const 16
        i32.add
        global.set 0
        return
      end
      block  ;; label = @2
        loop  ;; label = @3
          local.get 1
          call $glfwWindowShouldClose
          br_if 1 (;@2;)
          local.get 1
          call $glfwSwapBuffers
          call $glfwPollEvents
          br 0 (;@3;)
        end
      end
      call $glfwTerminate
      local.get 0
      i32.const 16
      i32.add
      global.set 0
      return
    end
    local.get 0
    i32.const 16
    i32.add
    global.set 0)
  (table (;0;) 1 1 funcref)
  (memory (;0;) 2)
  (global (;0;) (mut i32) (i32.const 66576))
  (global (;1;) i32 (i32.const 66576))
  (global (;2;) i32 (i32.const 1036))
  (global (;3;) i32 (i32.const 1024))
  (export "memory" (memory 0))
  (export "__heap_base" (global 1))
  (export "__data_end" (global 2))
  (export "startGlfw" (func $startGlfw))
  (export "__unnamed_1" (global 3))
  (data (;0;) (i32.const 1024) "Hello World\00"))

The error message I get looks like this:

-- m3 configuration --------------------------------------------
 sizeof M3MemPage     : 65536 bytes              
 sizeof M3Compilation : 8592 bytes              
----------------------------------------------------------------

    parse  |  load module: 1052 bytes
    parse  |  found magic + version
    parse  |  ** Type [5]
    parse  |  () -> i32
    parse  |  (i32, i32, i32, i32, i32) -> i32
    parse  |  () -> nil
    parse  |  (i32) -> i32
    parse  |  (i32) -> nil
    parse  |  ** Import [6]
    parse  |    - kind: 0; 'env.glfwInit' 
    parse  |    - kind: 0; 'env.glfwCreateWindow' 
    parse  |    - kind: 0; 'env.glfwTerminate' 
    parse  |    - kind: 0; 'env.glfwWindowShouldClose' 
    parse  |    - kind: 0; 'env.glfwSwapBuffers' 
    parse  |    - kind: 0; 'env.glfwPollEvents' 
    parse  |  ** Function [2]
    parse  |  <skipped (id: 4)>
    parse  |  <skipped (id: 5)>
    parse  |  ** Global [4]
    parse  |    - add global: [0] i32
  compile  |     0 | 0x01  .. nop
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 66576)
  compile  |     2 | 0x0b   end
    parse  |    - add global: [1] i32
  compile  |     0 | 0x00  .. unreachable
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 66576)
  compile  |     2 | 0x0b   end
    parse  |    - add global: [2] i32
  compile  |     0 | 0x00  .. unreachable
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 1036)
  compile  |     2 | 0x0b   end
    parse  |    - add global: [3] i32
  compile  |     0 | 0x00  .. unreachable
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 1024)
  compile  |     2 | 0x0b   end
    parse  |  ** Export [5]
    parse  |    - index:    0; kind: 2; export: 'memory'; 
    parse  |    - index:    1; kind: 3; export: '__heap_base'; 
    parse  |    - index:    2; kind: 3; export: '__data_end'; 
    parse  |    - index:    7; kind: 0; export: 'startGlfw'; 
    parse  |    - index:    3; kind: 3; export: '__unnamed_1'; 
    parse  |  ** Code [2]
    parse  |    - func size: 2; locals: 0
    parse  |    - func size: 141; locals: 1
    parse  |      - 2 locals; type: 'i32'
    parse  |  ** Data [1]
  compile  |     0 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 1024)
  compile  |     1 | 0x0b   end
    parse  |      segment [0]  memory: 0;  expr-size: 4;  size: 12
    parse  |  ** Custom: '.debug_info'
    parse  |  ** Custom: '.debug_macinfo'
    parse  |  ** Custom: '.debug_pubtypes'
    parse  |  ** Custom: '.debug_abbrev'
    parse  |  ** Custom: '.debug_line'
    parse  |  ** Custom: '.debug_str'
    parse  |  ** Custom: '.debug_pubnames'
    parse  |  ** Custom: 'name'
    parse  |  naming function [6]: __wasm_call_ctors

-- m3 configuration --------------------------------------------
 sizeof M3MemPage     : 65536 bytes              
 sizeof M3Compilation : 8592 bytes              
----------------------------------------------------------------

  runtime  |  initializing global: 0
  compile  |     0 | 0x01  .. nop
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 66576)
  compile  |     2 | 0x0b   end
  runtime  |  initializing global: 1
  compile  |     0 | 0x00  .. unreachable
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 66576)
  compile  |     2 | 0x0b   end
     exec  |  *** trapping ***
  runtime  |  initializing global: 2
  compile  |     0 | 0x00  .. unreachable
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 1036)
  compile  |     2 | 0x0b   end
     exec  |  *** trapping ***
  runtime  |  initializing global: 3
  compile  |     0 | 0x00  .. unreachable
  compile  |     1 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 1024)
  compile  |     2 | 0x0b   end
     exec  |  *** trapping ***
  compile  |     0 | 0x41  .. i32.const
  compile  |       | .......... (const i32 = 1024)
  compile  |     1 | 0x0b   end
  runtime  |  loading data segment: 0  offset: 1024
result: runtime ran out of memory ()

The bug happens when calling m3_LoadModule.

Any idea what's going on here? Is it a bug or may there be some problem with my wasm?

Trap

On README I read:

Traps: The spec says "Signed and unsigned operators trap whenever the result cannot be represented in the result type." Really? That's cool, but how can the behavior of source languages be modeled (efficiently)? C integer operations wrap and sometimes that's what you want.

Where does the spec say this? I can't find it. https://webassembly.github.io/spec/core/exec/numerics.html is pretty clear that addition is done modulo 2^N, for example.

webassembly.studio

I tried running compiled wasm files from https://webassembly.studio/
empty c project, helloworld in c and empty rust project

they all end with "Aborted"
(on debian arm64)

the following result is from the empty c project:

-- m3 configuration --------------------------------------------
sizeof M3MemPage : 65536 bytes
sizeof M3Compilation : 8592 bytes

parse  |  load module: 526 bytes
parse  |  found magic + version
parse  |  ** Type [2]
parse  |  () -> nil
parse  |  () -> i32
parse  |  ** Function [2]
parse  |  <skipped (id: 4)>
parse  |  <skipped (id: 5)>
parse  |  ** Global [3]
parse  |    - add global: [0] i32

compile | 0 | 0x01 .. nop
compile | 1 | 0x41 .. i32.const
compile | | .......... (const i32 = 66560)
compile | 2 | 0x0b end
parse | - add global: [1] i32
compile | 0 | 0x00 .. unreachable
compile | 1 | 0x41 .. i32.const
compile | | .......... (const i32 = 66560)
compile | 2 | 0x0b end
parse | - add global: [2] i32
compile | 0 | 0x00 .. unreachable
compile | 1 | 0x41 .. i32.const
compile | | .......... (const i32 = 1024)
compile | 2 | 0x0b end
parse | ** Export [4]
parse | - index: 0; kind: 2; export: 'memory';
parse | - index: 1; kind: 3; export: '__heap_base';
parse | - index: 2; kind: 3; export: '__data_end';
parse | - index: 1; kind: 0; export: 'main';
parse | ** Code [2]
parse | - func size: 2; locals: 0
parse | - func size: 4; locals: 0
parse | ** Custom: '.debug_info'
parse | ** Custom: '.debug_macinfo'
parse | ** Custom: '.debug_abbrev'
parse | ** Custom: '.debug_line'
parse | ** Custom: '.debug_str'
parse | ** Custom: 'name'
parse | naming function [0]: __wasm_call_ctors

-- m3 configuration --------------------------------------------
sizeof M3MemPage : 65536 bytes
sizeof M3Compilation : 8592 bytes

runtime | initializing global: 0
compile | 0 | 0x01 .. nop
compile | 1 | 0x41 .. i32.const
compile | | .......... (const i32 = 66560)
compile | 2 | 0x0b end
runtime | initializing global: 1
compile | 0 | 0x00 .. unreachable
compile | 1 | 0x41 .. i32.const
compile | | .......... (const i32 = 66560)
compile | 2 | 0x0b end
exec | *** trapping ***
runtime | initializing global: 2
compile | 0 | 0x00 .. unreachable
compile | 1 | 0x41 .. i32.const
compile | | .......... (const i32 = 1024)
compile | 2 | 0x0b end
exec | *** trapping ***

-- m3 runtime -------------------------------------------------
stack-size: 32768

module [0] name: '.unnamed'; funcs: 2

Aborted

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.