franko / luajit-lang-toolkit

A Lua bytecode compiler written in Lua itself for didactic purposes or for new language implementations

License: Other

Makefile 1.36% Lua 85.76% Python 4.12% C 7.89% Meson 0.88%

luajit-lang-toolkit's Introduction

LuaJIT Language Toolkit

The LuaJIT Language Toolkit is an implementation of the Lua programming language written in Lua itself. It works by generating LuaJIT bytecode, including debug information, and uses LuaJIT's virtual machine to run the generated bytecode.

On its own, the language toolkit does not do anything useful, since LuaJIT itself does the same things natively. The purpose of the language toolkit is to provide a starting point to implement a programming language that targets the LuaJIT virtual machine.

With the LuaJIT Language Toolkit, it is easy to create a new language or modify the Lua language because the parser is cleanly separated from the bytecode generator and the virtual machine.

The toolkit implements a complete pipeline to parse a Lua program, generate an AST, and generate the corresponding bytecode.

Lexer

The lexer's role is to recognize lexical elements in the program text. It takes the text of the program as input and produces a stream of "tokens" as its output.

Using the language toolkit you can run the lexer alone to examine the stream of tokens:

luajit run-lexer.lua tests/test-1.lua

The command above will lex the following code fragment:

local x = {}
for k = 1, 10 do
    x[k] = k*k + 1
end

...to generate the list of tokens:

TK_local
TK_name	x
=
{
}
TK_for
TK_name	k
=
TK_number	1
,
TK_number	10
TK_do
TK_name	x
[
TK_name	k
]
=
TK_name	k
*
TK_name	k
+
TK_number	1
TK_end

Each line represents a token where the first element is the kind of token and the second element is its value, if any.
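To make the token format concrete, here is a minimal sketch of a tokenizer for the fragment above. It is not the toolkit's lexer: the `tokenize` function and the `{ kind, value }` table layout are assumptions made for illustration, and it only handles names, integers, four keywords and single-character symbols.

```lua
-- Toy tokenizer: NOT the toolkit's lexer, only an illustration of the
-- "kind, value" token stream shown above.
local keywords = { ["local"] = true, ["for"] = true, ["do"] = true, ["end"] = true }

local function tokenize(src)
    local tokens, pos = {}, 1
    while true do
        -- Skip whitespace; stop when nothing but whitespace is left.
        pos = src:find("%S", pos)
        if not pos then break end
        local word = src:match("^[%a_][%w_]*", pos)
        local num = src:match("^%d+", pos)
        if word then
            if keywords[word] then
                tokens[#tokens + 1] = { kind = "TK_" .. word }
            else
                tokens[#tokens + 1] = { kind = "TK_name", value = word }
            end
            pos = pos + #word
        elseif num then
            tokens[#tokens + 1] = { kind = "TK_number", value = tonumber(num) }
            pos = pos + #num
        else
            -- Single-character token such as '=', '{', '*' or ','.
            tokens[#tokens + 1] = { kind = src:sub(pos, pos) }
            pos = pos + 1
        end
    end
    return tokens
end

local tokens = tokenize("local x = {}\nfor k = 1, 10 do\n    x[k] = k*k + 1\nend")
for _, tok in ipairs(tokens) do
    print(tok.value and (tok.kind .. "\t" .. tostring(tok.value)) or tok.kind)
end
```

Tokens without a value (keywords and symbols) print only their kind, which mirrors the one-column lines in the listing above.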

The lexer's code is an almost literal translation of LuaJIT's lexer.

Parser

The parser takes the token stream from the lexer and builds statements and expressions according to the language's grammar. The parser is based on a list of parsing rules that are invoked each time the input matches a given rule. When the input matches a rule, a corresponding function in the AST (abstract syntax tree) module is called to build an AST node. The generated nodes are in turn passed as arguments to the other parsing rules until the whole program is parsed and a complete AST is built for the program text.

The AST is useful as an abstraction of the structure of the program, and it is easier to manipulate than the program text or the token stream.

What distinguishes the language toolkit from LuaJIT is that the parser phase generates an AST, and the bytecode generation is done in a separate phase only when the AST is complete.

LuaJIT itself operates differently. During the parsing phase it does not generate any AST but instead the bytecode is directly generated and loaded into the memory to be executed by the VM. This means that LuaJIT's C implementation performs the three operations:

  • parse the program text
  • generate the bytecode
  • load the bytecode into memory

in one single pass. This approach is remarkable and very efficient, but makes it difficult to modify or extend the programming language.

Parsing Rule example

To illustrate how parsing works in the language toolkit, consider an example. The grammar rule for the "return" statement is:

explist ::= {exp ','} exp

return_stmt ::= return [explist]

In this case the toolkit's parsing rule parses the optional expression list by calling the function expr_list. Once the expressions are parsed, the AST rule ast:return_stmt(exps, line) is invoked with the expression list obtained before.

local function parse_return(ast, ls, line)
    ls:next() -- Skip 'return'.
    ls.fs.has_return = true
    local exps
    if EndOfBlock[ls.token] or ls.token == ';' then -- Base return.
        exps = { }
    else -- Return with one or more values.
        exps = expr_list(ast, ls)
    end
    return ast:return_stmt(exps, line)
end

As you can see, the AST functions are invoked using the ast object.
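The rule above can be exercised in isolation with stand-ins for its collaborators. The sketch below is only an illustration: the `AST` table, the node layout returned by `return_stmt`, the `make_ls` stub and the simplified `expr_list` are all assumptions, not the toolkit's actual modules. Only `parse_return` is copied from the text above.

```lua
-- Hypothetical AST builder; the node layout is an assumption.
local AST = {}
AST.__index = AST

function AST:return_stmt(exps, line)
    return { kind = "ReturnStatement", arguments = exps, line = line }
end

-- Stub lexer state: walks a fixed token list and exposes the current
-- token as ls.token, like the `ls` argument used above.
local function make_ls(tokens)
    local ls = { fs = {}, index = 1, token = tokens[1] }
    function ls:next()
        self.index = self.index + 1
        self.token = tokens[self.index]
    end
    return ls
end

local EndOfBlock = { TK_end = true, TK_eof = true }

-- Stand-in for the real expression parser: collects bare tokens
-- separated by commas until the block ends. The token list must be
-- terminated by an end-of-block token.
local function expr_list(ast, ls)
    local exps = {}
    while not EndOfBlock[ls.token] and ls.token ~= ";" do
        if ls.token ~= "," then exps[#exps + 1] = ls.token end
        ls:next()
    end
    return exps
end

local function parse_return(ast, ls, line)
    ls:next() -- Skip 'return'.
    ls.fs.has_return = true
    local exps
    if EndOfBlock[ls.token] or ls.token == ';' then -- Base return.
        exps = { }
    else -- Return with one or more values.
        exps = expr_list(ast, ls)
    end
    return ast:return_stmt(exps, line)
end

local ast = setmetatable({}, AST)
local ls = make_ls({ "TK_return", "a", ",", "b", "TK_eof" })
local node = parse_return(ast, ls, 1)
print(node.kind, #node.arguments, ls.fs.has_return)
```

Because the rule only talks to the `ast` object, swapping in a different builder changes the tree that is produced without touching the parser.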

In addition, the parser provides information about:

  • the function prototype
  • the syntactic scope

The first is used to keep track of some information about the current function being parsed.

The syntactic scope rules tell the user's rule when a new syntactic block begins or ends. Currently this is not really used by the AST builder, but it can be useful for other implementations.

The Abstract Syntax Tree (AST)

The abstract syntax tree represents the whole Lua program, with all the information the parser has gathered about it.

One possible approach to implement a new programming language is to generate an AST that more closely corresponds to the target programming language, and then transform the tree into a Lua AST in a separate phase.

Another possible approach is to directly generate the appropriate Lua AST nodes from the parser itself.

Currently the language toolkit does not perform any additional transformations, and just passes the AST to the bytecode generator module.
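As an illustration of the first approach (a separate transformation phase), the sketch below rewrites a node from an imagined source language into Lua-style nodes. Every node kind and field name here ("UnlessStatement", "IfStatement", "UnaryExpression", `test`, `body`) is a hypothetical choice for this example and may not match the toolkit's real AST.

```lua
-- Hypothetical node kinds: "UnlessStatement" belongs to an imagined
-- source language; the other kinds stand for Lua AST nodes.
local function lower(node)
    if type(node) ~= "table" then return node end
    local out = {}
    for key, value in pairs(node) do
        out[key] = lower(value) -- rewrite children first
    end
    if out.kind == "UnlessStatement" then
        -- "unless c ... end" becomes "if not c then ... end".
        return {
            kind = "IfStatement",
            test = { kind = "UnaryExpression", operator = "not", argument = out.test },
            body = out.body,
            line = out.line,
        }
    end
    return out
end

local tree = {
    kind = "Chunk",
    body = {
        { kind = "UnlessStatement", line = 1,
          test = { kind = "Identifier", name = "ready" },
          body = { { kind = "ReturnStatement", arguments = {}, line = 1 } } },
    },
}
local lowered = lower(tree)
print(lowered.body[1].kind, lowered.body[1].test.operator)
```

The function returns a new tree and leaves the input untouched, so the source-language AST stays available for error reporting.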

Bytecode Generator

Once the AST is generated, it can be fed to the bytecode generator module, which will generate the corresponding LuaJIT bytecode.

The bytecode generator is based on the original work of Richard Hundt for the Nyanga programming language. It was largely modified by myself to produce optimized code similar to what LuaJIT itself would generate. A lot of work was also done to ensure the correctness of the bytecode and of the debug information.

Alternative Lua Code generator

Instead of passing the AST to the bytecode generator, an alternative module can be used to generate Lua code. The module is called "luacode-generator" and can be used exactly like the bytecode generator.

The Lua code generator has the advantage of being simpler and safer, because the generated code is parsed directly by LuaJIT, which ensures complete compatibility of the bytecode from the beginning.

Currently the Lua Code Generator backend does not preserve the line numbers of the original source code. This is meant to be fixed in the future.

Use this backend instead of the bytecode generator if you prefer a safer backend to convert the Lua AST to code. The module can also be used for pretty-printing a Lua AST, since the code itself is probably the most human-readable representation of the AST.

C API

The language toolkit provides a very simple set of C APIs to implement a custom language. The functions provided by the C API are:

/* The functions below are the equivalents of the corresponding
   luaL_* functions. */
extern int language_init(lua_State *L);
extern int language_report(lua_State *L, int status);
extern int language_loadbuffer(lua_State *L, const char *buff, size_t sz, const char *name);
extern int language_loadfile(lua_State *L, const char *filename);


/* This function pushes on the stack a Lua table with the functions:
   loadstring, loadfile, dofile and loader.
   The first three functions can replace the Lua functions while the
   last one, loader, can be used as a customized "loader" function for
   the "require" function. */
extern int luaopen_langloaders(lua_State *L);

/* OPTIONAL:
   Load into package.preload lang.* modules using embedded bytecode. */
extern void language_bc_preload(lua_State *L);

The functions above can be used to create a custom LuaJIT executable that uses the language toolkit implementation.

When one of the language_* functions is used, an independent lua_State is created behind the scenes and used to compile the bytecode. Once the bytecode is generated, it is loaded into the user's lua_State, ready to be executed. Using a separate Lua state ensures that the compilation process does not interfere with the user's application.

The function language_bc_preload is useful to create a standalone executable that does not depend on the presence of the Lua files at runtime. The lang.* modules are compiled into bytecode and stored as static C data in the executable. By calling the function language_bc_preload, all the modules are preloaded using the embedded bytecode. This feature can be disabled by changing the BC_PRELOAD variable in src/Makefile.

How to build

The LuaJIT Language Toolkit can be compiled and optionally installed using Meson. Make sure that Meson is installed; the easiest way is to use pip, the Python package installer. Also make sure that LuaJIT is correctly installed, since the language toolkit requires it.

Once Meson and LuaJIT are installed, configure the build with the command:

meson setup build

so that the 'build' directory will be used to build. You may also pass the preload option:

meson setup -Dpreload=true build

Then build with 'ninja', Meson's default backend:

# build
ninja -C build

# install
ninja -C build install

The Meson-based build will take care of installing all the required Lua files, the library itself, the luajit-x executable and a pkg-config file.

Please note that when using the 'preload' option the Lua files will not be installed since they are embedded in the library itself.

Running the Application

The application can be run with the following command:

luajit run.lua [lua-options] <filename>

The "run.lua" script will just invoke the complete pipeline of the lexer, parser and bytecode generator and it will pass the bytecode to luajit with "loadstring".
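The last step, handing a bytecode string to "loadstring", can be tried without the toolkit. In the sketch below the native compiler plus `string.dump` stands in for the toolkit's lexer, parser and bytecode generator; the helper name `compile_to_bytecode` is an assumption made for this example and is not part of run.lua.

```lua
-- `loadstring` exists in LuaJIT and Lua 5.1; newer Lua versions use `load`.
local load_chunk = loadstring or load

-- Stand-in for the toolkit pipeline: compile with the native compiler
-- and serialize the resulting function to a bytecode string.
local function compile_to_bytecode(source, chunkname)
    local fn = assert(load_chunk(source, chunkname))
    return string.dump(fn)
end

local bytecode = compile_to_bytecode("return 6 * 7", "=demo")

-- A bytecode string is loaded the same way as source text.
local fn = assert(load_chunk(bytecode, "=demo"))
print(fn())
```

The loader recognizes bytecode by its first byte (0x1b), so the caller does not need to say whether it is passing source or bytecode.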

The language toolkit also provides a customized executable named luajit-x that uses the language toolkit's pipeline instead of the native one. Otherwise, the program luajit-x works exactly the same as luajit itself, and accepts the same options.

In the standard build luajit-x will contain the lang.* modules as embedded bytecode data so that it does not rely on the Lua files at runtime.

You can experiment with the language by modifying the Lua implementation of the toolkit and testing the changes. If the option BC_PRELOAD in src/Makefile is activated, you need to recompile luajit-x for the changes to take effect.

If you work with the Lua files of the language toolkit, you may choose to disable the BC_PRELOAD variable to avoid recompiling the executable for each change in the Lua code.

Generated Bytecode

You can inspect the bytecode generated by the language toolkit by using the "-b" options. They can be used either with standard luajit through "run.lua" or directly with the customized program luajit-x.

For example you can inspect the bytecode using the following command:

luajit run.lua -bl tests/test-1.lua

or alternatively:

./src/luajit-x -bl tests/test-1.lua

assuming that you are running luajit-x from the language toolkit's root directory.

Either way, when you use one of the two commands above to generate the bytecode you will see the following on the screen:

-- BYTECODE -- "test-1.lua":0-7
0001    TNEW     0   0
0002    KSHORT   1   1
0003    KSHORT   2  10
0004    KSHORT   3   1
0005    FORI     1 => 0010
0006 => MULVV    5   4   4
0007    ADDVN    5   5   0  ; 1
0008    TSETV    5   0   4
0009    FORL     1 => 0006
0010 => KSHORT   1   1
0011    KSHORT   2  10
0012    KSHORT   3   1
0013    FORI     1 => 0018
0014 => GGET     5   0      ; "print"
0015    TGETV    6   0   4
0016    CALL     5   1   2
0017    FORL     1 => 0014
0018 => RET0     0   1

You can compare it with the bytecode generated natively by LuaJIT using the command:

luajit -bl tests/test-1.lua

In the example above, the generated bytecode is identical to that generated by LuaJIT. This is not an accident, since the language toolkit's bytecode generator is designed to produce the same bytecode that LuaJIT itself would generate. In some cases the generated code will differ, but this is not considered a big problem as long as the generated code is semantically correct.
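When the listings are long, the comparison can be automated. The following is a minimal sketch that assumes the two listings have already been captured as strings (for example by redirecting the output of the two commands to files and reading them); `split_lines` and `diff_listings` are hypothetical helper names, and the second sample listing is deliberately altered to show a mismatch.

```lua
local function split_lines(text)
    local lines = {}
    for line in text:gmatch("[^\n]+") do
        lines[#lines + 1] = line
    end
    return lines
end

-- Returns the 1-based indices of the lines that differ between the
-- two listings; a missing line counts as a difference.
local function diff_listings(a, b)
    local la, lb = split_lines(a), split_lines(b)
    local differing = {}
    for i = 1, math.max(#la, #lb) do
        if la[i] ~= lb[i] then
            differing[#differing + 1] = i
        end
    end
    return differing
end

local toolkit_listing = table.concat({
    "0001    TNEW     0   0",
    "0002    KSHORT   1   1",
    "0003    KSHORT   2  10",
}, "\n")

-- Altered on purpose (10 -> 11) so that the comparison reports line 3.
local altered_listing = table.concat({
    "0001    TNEW     0   0",
    "0002    KSHORT   1   1",
    "0003    KSHORT   2  11",
}, "\n")

local differing = diff_listings(toolkit_listing, altered_listing)
print(#differing, differing[1])
```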

Bytecode Annotated Dump

In addition to the standard LuaJIT bytecode functions, the language toolkit also supports a special debug mode where the bytecode is printed byte-by-byte in hex format with some annotations on the right side of the screen. The annotations will explain the meaning of each chunk of bytes and decode them as appropriate.

For example:

luajit run.lua -bx tests/test-1.lua

will display something like:

1b 4c 4a 01             | Header LuaJIT 2.0 BC
00                      | Flags: None
11 40 74 65 73 74 73 2f | Chunkname: @tests/test-1.lua
74 65 73 74 2d 31 2e 6c |
75 61                   |
                        | .. prototype ..
8a 01                   | prototype length 138
02                      | prototype flags PROTO_VARARG
00                      | parameters number 0
07                      | framesize 7
00 01 01 12             | size uv: 0 kgc: 1 kn: 1 bc: 19
31                      | debug size 49
00 07                   | firstline: 0 numline: 7
                        | .. bytecode ..
32 00 00 00             | 0001    TNEW     0   0
27 01 01 00             | 0002    KSHORT   1   1
27 02 0a 00             | 0003    KSHORT   2  10
27 03 01 00             | 0004    KSHORT   3   1
49 01 04 80             | 0005    FORI     1 => 0010
20 05 04 04             | 0006 => MULVV    5   4   4
14 05 00 05             | 0007    ADDVN    5   5   0  ; 1
39 05 04 00             | 0008    TSETV    5   0   4
4b 01 fc 7f             | 0009    FORL     1 => 0006
27 01 01 00             | 0010 => KSHORT   1   1
27 02 0a 00             | 0011    KSHORT   2  10
27 03 01 00             | 0012    KSHORT   3   1
49 01 04 80             | 0013    FORI     1 => 0018
34 05 00 00             | 0014 => GGET     5   0      ; "print"
36 06 04 00             | 0015    TGETV    6   0   4
3e 05 02 01             | 0016    CALL     5   1   2
4b 01 fc 7f             | 0017    FORL     1 => 0014
47 00 01 00             | 0018 => RET0     0   1
                        | .. uv ..
                        | .. kgc ..
0a 70 72 69 6e 74       | kgc: "print"
                        | .. knum ..
02                      | knum int: 1
                        | .. debug ..
01                      | pc001: line 1
02                      | pc002: line 2
02                      | pc003: line 2
02                      | pc004: line 2
02                      | pc005: line 2
...

This kind of output is especially useful for debugging the language toolkit itself because it accounts for every byte of the bytecode and includes all the sections of the bytecode. For example, you will be able to inspect the kgc or knum sections where the prototype's constants are stored. The output also includes the debug section in decoded form so that it can be easily inspected.
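The first rows of the dump above can be decoded by hand, which helps when reading the annotations. The sketch below is an illustration based only on the bytes shown above, not the toolkit's reader: lengths are stored as ULEB128 (7 bits per byte, high bit set on every byte except the last), which is why "8a 01" reads as 138. The function names are assumptions, and the layout handled here is limited to the header, the optional chunk name and the first prototype length.

```lua
local function hex_to_bytes(hex)
    local bytes = {}
    for pair in hex:gmatch("%x%x") do
        bytes[#bytes + 1] = tonumber(pair, 16)
    end
    return bytes
end

-- Decodes one ULEB128 value starting at `pos`; returns the value and
-- the position of the next unread byte.
local function read_uleb128(bytes, pos)
    local value, scale = 0, 1
    while true do
        local b = bytes[pos]
        pos = pos + 1
        value = value + (b % 128) * scale
        if b < 128 then return value, pos end
        scale = scale * 128
    end
end

local function read_header(bytes)
    assert(bytes[1] == 0x1b and bytes[2] == 0x4c and bytes[3] == 0x4a,
           "not a LuaJIT bytecode dump")
    local header = { version = bytes[4] }
    local pos = 5
    header.flags, pos = read_uleb128(bytes, pos)
    -- Flag bit 0x02 marks a stripped dump, which carries no chunk name.
    local stripped = math.floor(header.flags / 2) % 2 == 1
    if not stripped then
        local len
        len, pos = read_uleb128(bytes, pos)
        local chars = {}
        for i = pos, pos + len - 1 do
            chars[#chars + 1] = string.char(bytes[i])
        end
        header.chunkname = table.concat(chars)
        pos = pos + len
    end
    header.proto_length = read_uleb128(bytes, pos)
    return header
end

local dump = hex_to_bytes([[
1b 4c 4a 01 00
11 40 74 65 73 74 73 2f 74 65 73 74 2d 31 2e 6c 75 61
8a 01
]])
local header = read_header(dump)
print(header.version, header.chunkname, header.proto_length)
```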

There is a small trick for comparing with the bytecode generated by LuaJIT, because LuaJIT itself does not support the -bx option. First generate the bytecode using luajit:

luajit -bg tests/test-1.lua test-1.bc

and then use the language toolkit with the -bx option to dump the contents of the bytecode generated by luajit:

luajit run.lua -bx test-1.bc

so that you can compare the two outputs.

Current Status

Currently the LuaJIT Language Toolkit should be considered beta software.

The implementation is now complete in terms of features and well tested, even for the most complex cases, and a complete test suite is used to verify the correctness of the generated bytecode.

The language toolkit is currently capable of executing itself. This means that the language toolkit is able to correctly compile and load all of its modules and execute them correctly.

Still, some bugs are probably present, and you should be cautious when you use the LuaJIT Language Toolkit.

luajit-lang-toolkit's People

Contributors

areski, constfold, franko, gnois, ionoclastbrigham, q66, secondwtq, stepelu

luajit-lang-toolkit's Issues

Bad behavior of tables

I've been trying to integrate the custom language I'm working on into my game engine. I almost succeeded, but I ran into a weird bug that happens when you have enough fields in tables. I isolated a testcase out of one of the failing files (the other file was too complicated to isolate a reliable testcase out of, and I'm not yet sure if it's the same bug, but I think it might as well be).

Here's the testcase: http://codepad.org/zvs9Qcsj

I'd fix it and provide patches, but I haven't yet figured out what might be causing the problem... perhaps you'll have a better idea.

In the error message, the line "3" is wrong because of the other issue I submitted. The actual line where it fails is 480, and when I remove table fields from line 481 onwards, it stops failing (this is however somehow related to previous tables, because if I insert a few fields into an earlier table, it starts failing again).

Here's the error: http://codepad.org/cWj2pwVm

Another clue might be this error when dumping bytecode: http://codepad.org/lY05aVj5

Bytecode dump with standard LuaJIT obviously works...

Replace native LuaJIT bytecode compiler?

I'm really happy to discover this cool project!

Has anybody looked into "dropping in" this bytecode compiler to replace the C implementation in LuaJIT? This is something we have been discussing over at raptorjit/raptorjit#248 but we weren't aware of this project until now.

AST validation: syntax description for Tables is wrong

According to ast-validate.lua, Tables are expected to be described in terms of the array_entries, hash_keys and hash_values properties.

However, lua-ast.lua generates Tables with a single keyvals property, which is essentially a list of pairs.

As such the validation of AST trees containing Tables is currently failing.

Escape character not preserved in generated Lua

I reasoned it's better to file the issues here rather than in my fork, because the issue links got confused in the commit history, pointing to issue of the same number in this repo.

print("\n\r")

generates this Lua code output.

print("\
\13")

The escapes should be preserved.
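One possible way to preserve the escapes when emitting a string literal is sketched below. This is not the toolkit's code generator: `quote` and the `escapes` table are hypothetical names, and the choice to fall back to three-digit decimal escapes for other control characters is an assumption made for this example.

```lua
local load_chunk = loadstring or load

-- Escapes that have a readable short form in Lua source.
local escapes = {
    ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t",
    ["\\"] = "\\\\", ['"'] = '\\"',
}

-- Returns a double-quoted Lua literal for `s` that stays on one line.
local function quote(s)
    local body = s:gsub('[%c\\"]', function(c)
        -- Three digits keep the escape unambiguous when a digit follows.
        return escapes[c] or string.format("\\%03d", c:byte())
    end)
    return '"' .. body .. '"'
end

print("print(" .. quote("\n\r") .. ")")

-- Round trip: loading the generated literal yields the original string.
local original = 'tab\there "quoted" \\ and \1 control'
local reloaded = load_chunk("return " .. quote(original))()
print(reloaded == original)
```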

Weird bug

Here's a snippet:

http://codepad.org/IWzsktIG

When you run it, the error should be

luajit: [string "core.lua"]:31: loop in gettable

When you remove 4 or so lines from the end (so that last one is for M.Foo96) it errors differently:

luajit: [string "core.lua"]:33: table index is nil

Or maybe it won't error, seems to be pretty random - in that case, try removing a different number of lines. The first error should be reproducible every time.

I'm not quite sure what's happening here, so I haven't patched it yet...

Incorrect handling of call expression in tests

A simple test:

local k = "foo"
k = tonumber(k) and "foo" or k
print("K", k)

This will print nil for "k". Not sure what the best solution would be here. I see that an extra "mov" instruction is being generated, incorrectly overwriting the value.

Broken numeric for loop with a call as start/end/step

This testcase should be sufficient:

local foo = function()
    return 4
end

for i = 1, foo() do
    print("A", i)
end

for i = 1, 4 do
    print("B", i)
end

By running it, you can see that neither of the loops will run. When you comment out the first loop, the second loop will run. By looking at bytecode diff between LuaJIT and lj-lang-toolkit, I can see that the only difference is that LuaJIT emits a CALLT instruction while lj-lang-toolkit emits a CALL instruction for foo().

I think when this one is fixed, my engine will almost be working with the new bytecode generator. It already starts a map (after your latest and-or fix) but never runs any events on them, because of loops being broken like that.

I just found out about the issue, so no patch attached atm...

edit: sorry, the other way, CALLT for lj-lang-toolkit and CALL for luajit.

Few questions

Hi,

first of all, thanks for sharing. This looks like a great project that could fit with a similar one I'm working on (mruby).

I have a few questions:

  1. Is this stable? I mean, are there known issues, like bytecode of floats etc.?
  2. Why a handwritten parser and not the classic lpeg?

Thanks a lot!

Parenthesis dropped in generated Lua

print(("hi"):rep(3))

gives

print("hi":rep(3))

Output:
lua: run.lua:70: [string "print("\..."]:3: ')' expected near ':'

This happens because the parentheses were dropped.
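In Lua's grammar the receiver of a method call must be a prefix expression: a variable, a call, or a parenthesized expression. A code generator can therefore re-add parentheses whenever the receiver is anything else. The sketch below illustrates that rule only; the node kind names and the `wrap_receiver` helper are hypothetical and may not match the names used by the toolkit's Lua code generator.

```lua
local load_chunk = loadstring or load

-- Node kinds that can be used as a call receiver without parentheses.
-- These names are assumptions for the example.
local prefix_kinds = {
    Identifier = true,
    MemberExpression = true,
    CallExpression = true,
    SendExpression = true,
}

-- `code` is the already generated source text for `node`.
local function wrap_receiver(node, code)
    if prefix_kinds[node.kind] then
        return code
    end
    return "(" .. code .. ")"
end

local receiver = wrap_receiver({ kind = "Literal", value = "hi" }, '"hi"')
local generated = "return " .. receiver .. ":rep(3)"
print(generated)
print(load_chunk(generated)())
```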

Wrong runtime debug information

I believe the place "if node.line then self.ctx:line(node.line) end" is done is wrong. It should be done before the rule is called so that the line information is up to date by the time things are emitted. Right now it displays bad line info.

Also, such statements should be placed before every rule call, not only in "emit". That's because if you do stuff like

print(



    "foo" + 5)

The error about arithmetic on string value should be on the line the string literal is placed, not on the line "print" is placed. (i.e. without any fix at all, the error will be on line 1; with partial fix, it'll be on line 3; but it should be on line 7)

The solution here is:

  1. move the statement in "emit" before the rule call.
  2. add the same line before every rule call elsewhere, i.e. into test_emit, expr_toreg, expr_tomultireg and lhs_expr_emit.

The mask for getting a local uv's slot in uv_decode is wrong

return band(uv, 0x3fff), true, imm

Because the slot in LuaJIT is only one byte [1], its range is 0x00 to 0xFF and only the low byte is used [2]. So the mask should be 0xFF instead of 0x3FFF.

Footnotes

  1. https://github.com/LuaJIT/LuaJIT/blob/e1e3034cf613f6913285fea41bbdac1cbeb2a9a8/src/lj_lex.h#L47

  2. https://github.com/LuaJIT/LuaJIT/blob/e1e3034cf613f6913285fea41bbdac1cbeb2a9a8/src/lj_func.c#L173
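The proposed mask can be sketched as follows. This is an illustration of the suggestion in this issue, not the toolkit's actual uv_decode: the flag values are the ones LuaJIT uses for PROTO_UV_LOCAL and PROTO_UV_IMMUTABLE, while the return values for the non-local case and the `has_flag` helper are assumptions. Plain arithmetic is used instead of the bit library so the sketch runs on any Lua version.

```lua
local PROTO_UV_LOCAL     = 0x8000 -- upvalue refers to a local slot
local PROTO_UV_IMMUTABLE = 0x4000 -- the local is never reassigned

-- Tests a single-bit flag without the bit library.
local function has_flag(value, flag)
    return math.floor(value / flag) % 2 == 1
end

-- Returns: slot or parent upvalue index, is_local, is_immutable.
local function uv_decode(uv)
    if has_flag(uv, PROTO_UV_LOCAL) then
        -- Keep only the low byte, i.e. the 0xFF mask proposed above.
        return uv % 0x100, true, has_flag(uv, PROTO_UV_IMMUTABLE)
    end
    -- Assumed behaviour: the value indexes the parent's upvalues.
    return uv, false, false
end

print(uv_decode(0x8000 + 0x4000 + 5))
print(uv_decode(3))
```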

broken TestRule:UnaryExpression

It's broken particularly with the "not" operator.

First, it does "self:op_not" which has to be changed to "self.ctx:op_not" just to make it run at all.

After that, "print((not true) and 10)" gets nil instead of false (but this depends, if I put "print(5)" before that, it gets 5 - so it's leaking a register) and it breaks in a few other places (I implemented an "if" expression/ternary operator support in the bytecode generator and it yields completely nonsense results with the not operator - I thought it was a bug in my code but it wasn't :)). Removing the specialization for "not" from TestRule:UnaryExpression fixes the problem, but is not an ideal solution as it'll generate "worse" bytecode.

Add opcode for bytecode version 2

To be frank, I'm not familiar with Lua... I'll try to be as useful as I can 😅

When trying to reverse a LuaJIT dump with version 2 in the header, I noticed that luajit-x -bx gave me a different result from luajit. Inspecting the hex dump, I found that the opcodes are interpreted as version 1, which is shifted by 2 slots in the opcode array. I referenced lj_bc.h and #11 and added the missing opcodes. The result was positive. My patched version is at branch v2.1.
However, a build with this patch may not be compatible with LuaJIT 2.0 dumps. Do you have any solution?

commit d3d9f70bf4c88fbbea39e4219ec150e6e6d40833
Author: ttimasdf <[email protected]>
Date:   Tue Nov 14 18:35:02 2017 +0800

    Add opcode for LuaJIT 2.1 (bytecode version 2)
    
    DO NOT COMPATIBLE WITH 2.0 dumps(bytecode version 1)
    Trying to figure out a way.

diff --git a/lang/bcread.lua b/lang/bcread.lua
index 821fa47..8487421 100644
--- a/lang/bcread.lua
+++ b/lang/bcread.lua
@@ -22,7 +22,7 @@ local BCDUMP = {
 
     -- If you perform *any* kind of private modifications to the bytecode itself
     -- or to the dump format, you *must* set BCDUMP_VERSION to 0x80 or higher.
-    VERSION = 1,
+    VERSION = 2,
 
     -- Compatibility flags.
     F_BE    = 0x01,
@@ -60,6 +60,8 @@ local BCDEF_TAB = {
     {'ISFC', 'dst', 'none', 'var', 'none'},
     {'IST', 'none', 'none', 'var', 'none'},
     {'ISF', 'none', 'none', 'var', 'none'},
+    {'ISTYPE', 'var', 'none', 'lit', 'none'},
+    {'ISNUM', 'var', 'none', 'lit', 'none'},
 
     -- Unary ops.
     {'MOV', 'dst', 'none', 'var', 'none'},
@@ -114,10 +116,12 @@ local BCDEF_TAB = {
     {'TGETV', 'dst', 'var', 'var', 'index'},
     {'TGETS', 'dst', 'var', 'str', 'index'},
     {'TGETB', 'dst', 'var', 'lit', 'index'},
+    {'TGETR', 'dst', 'var', 'var', 'index'},
     {'TSETV', 'var', 'var', 'var', 'newindex'},
     {'TSETS', 'var', 'var', 'str', 'newindex'},
     {'TSETB', 'var', 'var', 'lit', 'newindex'},
     {'TSETM', 'base', 'none', 'num', 'newindex'},
+    {'TSETR', 'var', 'var', 'var', 'newindex'},
 
     -- Calls and vararg handling. T = tail call.
     {'CALLM', 'base', 'lit', 'lit', 'call'},
diff --git a/lang/bcsave.lua b/lang/bcsave.lua
index a70795d..deb9d80 100644
--- a/lang/bcsave.lua
+++ b/lang/bcsave.lua
@@ -584,7 +584,7 @@ local function bc_magic_header(input)
     local f, err = io.open(input, "rb")
     check(f, "cannot open ", err)
     local header = f:read(4)
-    local match = (header == string.char(0x1b, 0x4c, 0x4a, 0x01))
+    local match = (header == string.char(0x1b, 0x4c, 0x4a, 0x02))
     f:close()
     return match
 end

I'm using LuaJIT 2.1.0-beta3, sample byte code dump before patch:

$ luajit-x -bl main
-- BYTECODE -- main:0-0
0001    TGETV    1   0   0
0002    KSHORT   2   1
...
0020    KSHORT   2   1
0021    ITERN    1   1   2
0022    FORL     0 => -32744

-- BYTECODE -- main:0-0
0001    TDUP     0   0
0002    TGETS    0   0   1  ; "__G__TRACKBACK__"
0003    KPRI     0 500
...
0072    TGETS    0   0  19  ; "LAUNCHERPKG"
0073    TGETV    0   0  26
0074    KSHORT   1  30
0075    ITERN    0   2   2
0076    TSETV    0   0  31
0077    ITERN    0   2   1
0078    UNM      1   0
0079    TSETV    0   0  32
0080    ITERN    0   1   2
0081    FORL     0 => -32685

after:

$ luajit-x -bxg main
1b 4c 4a 02             | Header LuaJIT 2.0 BC
02                      | Flags: BCDUMP_F_STRIP
                        | .. prototype ..
b6 01                   | prototype length 182
00                      | prototype flags None
01                      | parameters number 1
05                      | framesize 5
00 08 00 16             | size uv: 0 kgc: 8 kn: 0 bc: 23
                        | .. bytecode ..
36 01 00 00             | 0001    GGET     1   0      ; "print"
27 02 01 00             | 0002    KSTR     2   1      ; "---------------
                        | -------------------------"
42 01 02 01             | 0003    CALL     1   1   2
...
36 01 00 00             | 0019    GGET     1   0      ; "print"
27 02 01 00             | 0020    KSTR     2   1      ; "---------------
                        | -------------------------"
42 01 02 01             | 0021    CALL     1   1   2
4b 00 01 00             | 0022    RET0     0   1
                        | .. uv ..
                        | .. kgc ..
05                      | kgc: ""
0e 74 72 61 63 65 62 61 | kgc: "traceback"
63 6b                   | 
0a 64 65 62 75 67       | kgc: "debug"
06 0a                   | kgc: "\
...
15 5f 5f 47 5f 5f 54 52 | kgc: "__G__TRACKBACK__"
41 43 4b 42 41 43 4b 5f | 
5f                      | 
00                      | kgc: <function: main:0>
                        | .. knum ..
00                      | eof

traceback inside a function literal used as a function argument reports wrong line numbers

d:\>cat pcall-wrong-lineno.lua
pcall(function()
  print(debug.traceback())
end)

d:\>lua pcall-wrong-lineno.lua
stack traceback:
        pcall-wrong-lineno.lua:2: in function <pcall-wrong-lineno.lua:1>
        [C]: in function 'pcall'
        pcall-wrong-lineno.lua:1: in main chunk
        [C]: at 0x00402180

d:\>lua luajit-lang-toolkit\run.lua pcall-wrong-lineno.lua
stack traceback:
        pcall-wrong-lineno.lua:2: in function <pcall-wrong-lineno.lua:1>
        [C]: in function 'pcall'
        pcall-wrong-lineno.lua:3: in function 'fn'
        luajit-lang-toolkit\run.lua:73: in main chunk
        [C]: at 0x00402180

d:\>lua -v
LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/

Doesn't have to be pcall specifically, and it seems it uses the last line of the function declaration, unlike Lua and LuaJIT, which use the line on which the outer function call occurs:

function call(_, fn)
  fn()
end

call(

"ignored",

function()
  print(debug.traceback())
end,

"also ignored"

)

Lua/LuaJIT report line 5 (where call is), luajit-lang-toolkit reports line 11 (where the function declaration ends).

Broken lexing of 64-bit hex numbers

In your build_64int, there's a few things you need to do.

  1. check for leading 0x and skip it.
  2. Instead of checking "str[i] <= ASCII_9", you need to check if str[i] is a hex digit and possibly a period and if it's a period, return nil (so that floating point numbers with 64 bit suffixes are disallowed).
  3. if leading 0x is present, multiply by 16 instead of 10.
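The three steps can be sketched in plain Lua as follows. This is not the toolkit's build_64int: `parse_int64` is a hypothetical stand-in that assumes the integer suffix has already been removed, and it returns the value as two unsigned 32-bit halves so that it runs without the FFI.

```lua
local TWO32 = 4294967296

-- Returns the high and low 32-bit halves of the value, or nil when the
-- text is not a plain integer (for example when it contains a period).
-- Values wider than 64 bits wrap around silently.
local function parse_int64(text)
    local base, digits = 10, text
    if text:match("^0[xX]") then
        -- Step 1: skip the leading 0x and switch to base 16.
        base, digits = 16, text:sub(3)
    end
    if digits == "" then return nil end
    local hi, lo = 0, 0
    for i = 1, #digits do
        local c = digits:sub(i, i)
        -- Step 2: accept only digits valid in this base; a period or
        -- a hex letter in a decimal number yields nil.
        local d = c:match("^%x$") and tonumber(c, base)
        if not d then return nil end
        -- Step 3: multiply by the base (16 for hex, 10 otherwise),
        -- carrying the overflow of the low half into the high half.
        lo = lo * base + d
        local carry = math.floor(lo / TWO32)
        lo = lo % TWO32
        hi = (hi * base + carry) % TWO32
    end
    return hi, lo
end

print(parse_int64("0xffffffffffffffff"))
print(parse_int64("4294967296"))
print(parse_int64("0x1.5"))
```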

I'm not providing a patch as my codebase is heavily changed (working on a language) and uses a different lexer/parser, but this is what I came up with while messing around with lj-lang-toolkit, and it should be trivial for you to do.

Please add luajit2.1 bytecode

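With LuaJIT 2.1 and GC64 enabled, lj_arch.h defines:

#define LJ_FR2 1

The bytecode changes as follows. The changed instructions were marked in bold in the original report; comparing the two dumps below, they are 0002, 0007 and 0008.

Test code: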

local ffi = require("ffi") local a = "12345" local b = "test" print(a,b)

luajit-lang-toolkit:

luajit run.lua -bl 1.lua
-- BYTECODE -- 1.lua:0-9
0001 GGET 0 0 ; "require"
0002 KSTR 1 1 ; "ffi"
0003 CALL 0 2 2
0004 KSTR 1 2 ; "12345"
0005 KSTR 2 3 ; "test"
0006 GGET 3 4 ; "print"
0007 MOV 4 1
0008 MOV 5 2
0009 CALL 3 1 3
0010 RET0 0 1

luajit:

luajit -bl 1.lua
-- BYTECODE -- 1.lua:0-9
0001 GGET 0 0 ; "require"
0002 KSTR 2 1 ; "ffi"
0003 CALL 0 2 2
0004 KSTR 1 2 ; "12345"
0005 KSTR 2 3 ; "test"
0006 GGET 3 4 ; "print"
0007 MOV 5 1
0008 MOV 6 2
0009 CALL 3 1 3
0010 RET0 0 1

patch: minor bytecode.lua refactoring (for clarity) and LuaJIT 2.1 support

Hey,

here is a patch that removes the decoupling of BC and BC_MODE (so it's clear which fields of BC_MODE correspond to which fields of BC) plus adds support for LuaJIT 2.1 (it added certain instructions which are not normally emitted in Lua but were added to be used in builtin bytecode, and that breaks lj-lang-toolkit)

diff:

commit bc01044bf39ad14fb9785781591e1c2dbb6936cf
Author: q66 <[email protected]>
Date:   Mon Sep 8 23:12:24 2014 +0100

    bytecode: refactoring (remove decoupling of BC and BC_MODE) and add LuaJIT 2.1 support

diff --git a/bytecode.lua b/bytecode.lua
index 5fc99e9..1090c9e 100644
--- a/bytecode.lua
+++ b/bytecode.lua
@@ -27,6 +27,13 @@ local bit  = require 'bit'
 local ffi  = require 'ffi'
 local util = require 'util'

+local has_jit, jit = pcall(function() return require 'jit' end)
+if not has_jit then
+   jit = { version_num = 20000 } -- fallback
+end
+
+local jit_v21 = jit.version_num >= 20100
+
 local typeof = getmetatable

 local function enum(t)
@@ -40,30 +47,124 @@ local Buf, Ins, Proto, Dump, KNum, KObj
 local MAX_REG = 200
 local MAX_UVS = 60

-local BC = enum {
-   [0] = 'ISLT', 'ISGE', 'ISLE', 'ISGT', 'ISEQV', 'ISNEV', 'ISEQS','ISNES',
-   'ISEQN', 'ISNEN', 'ISEQP', 'ISNEP', 'ISTC', 'ISFC', 'IST', 'ISF', 'MOV',
-   'NOT', 'UNM', 'LEN', 'ADDVN', 'SUBVN', 'MULVN', 'DIVVN', 'MODVN', 'ADDNV',
-   'SUBNV', 'MULNV', 'DIVNV', 'MODNV', 'ADDVV', 'SUBVV', 'MULVV', 'DIVVV',
-   'MODVV', 'POW', 'CAT', 'KSTR', 'KCDATA', 'KSHORT', 'KNUM', 'KPRI', 'KNIL',
-   'UGET', 'USETV', 'USETS', 'USETN', 'USETP', 'UCLO', 'FNEW', 'TNEW', 'TDUP',
-   'GGET', 'GSET', 'TGETV', 'TGETS', 'TGETB', 'TSETV', 'TSETS', 'TSETB',
-   'TSETM', 'CALLM', 'CALL', 'CALLMT', 'CALLT', 'ITERC', 'ITERN', 'VARG',
-   'ISNEXT', 'RETM', 'RET', 'RET0', 'RET1', 'FORI', 'JFORI', 'FORL', 'IFORL',
-   'JFORL', 'ITERL', 'IITERL', 'JITERL', 'LOOP', 'ILOOP', 'JLOOP', 'JMP',
-   'FUNCF', 'IFUNCF', 'JFUNCF', 'FUNCV', 'IFUNCV', 'JFUNCV', 'FUNCC', 'FUNCCW',
-}
+local function enum_mode(t)
+   local re, rm = {}, {}
+   local nskip = 0
+   for i = 1, #t do
+      local v = t[i]
+      if v[3] == false then
+         nskip = nskip + 1
+      else
+         local idx = i - nskip - 1
+         rm[idx ] = v[2]
+         re[v[1]] = idx
+      end
+   end
+   return re, rm
+end

 local BC_ABC = 0
 local BC_AD  = 1
 local BC_AJ  = 2

-local BC_MODE = {
-   [0] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
-   1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
-   1, 0, 0, 0, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1,
-   1, 1, 1, 1, 1, 1, 1,
+local BC, BC_MODE = enum_mode {
+   { 'ISLT'  , BC_AD  },
+   { 'ISGE'  , BC_AD  },
+   { 'ISLE'  , BC_AD  },
+   { 'ISGT'  , BC_AD  },
+   { 'ISEQV' , BC_AD  },
+   { 'ISNEV' , BC_AD  },
+   { 'ISEQS' , BC_AD  },
+   { 'ISNES' , BC_AD  },
+   { 'ISEQN' , BC_AD  },
+   { 'ISNEN' , BC_AD  },
+   { 'ISEQP' , BC_AD  },
+   { 'ISNEP' , BC_AD  },
+   { 'ISTC'  , BC_AD  },
+   { 'ISFC'  , BC_AD  },
+   { 'IST'   , BC_AD  },
+   { 'ISF'   , BC_AD  },
+   { 'ISTYPE', BC_AD, jit_v21 },
+   { 'ISNUM' , BC_AD, jit_v21 },
+   { 'MOV'   , BC_AD  },
+   { 'NOT'   , BC_AD  },
+   { 'UNM'   , BC_AD  },
+   { 'LEN'   , BC_AD  },
+   { 'ADDVN' , BC_ABC },
+   { 'SUBVN' , BC_ABC },
+   { 'MULVN' , BC_ABC },
+   { 'DIVVN' , BC_ABC },
+   { 'MODVN' , BC_ABC },
+   { 'ADDNV' , BC_ABC },
+   { 'SUBNV' , BC_ABC },
+   { 'MULNV' , BC_ABC },
+   { 'DIVNV' , BC_ABC },
+   { 'MODNV' , BC_ABC },
+   { 'ADDVV' , BC_ABC },
+   { 'SUBVV' , BC_ABC },
+   { 'MULVV' , BC_ABC },
+   { 'DIVVV' , BC_ABC },
+   { 'MODVV' , BC_ABC },
+   { 'POW'   , BC_ABC },
+   { 'CAT'   , BC_ABC },
+   { 'KSTR'  , BC_AD  },
+   { 'KCDATA', BC_AD  },
+   { 'KSHORT', BC_AD  },
+   { 'KNUM'  , BC_AD  },
+   { 'KPRI'  , BC_AD  },
+   { 'KNIL'  , BC_AD  },
+   { 'UGET'  , BC_AD  },
+   { 'USETV' , BC_AD  },
+   { 'USETS' , BC_AD  },
+   { 'USETN' , BC_AD  },
+   { 'USETP' , BC_AD  },
+   { 'UCLO'  , BC_AJ  },
+   { 'FNEW'  , BC_AD  },
+   { 'TNEW'  , BC_AD  },
+   { 'TDUP'  , BC_AD  },
+   { 'GGET'  , BC_AD  },
+   { 'GSET'  , BC_AD  },
+   { 'TGETV' , BC_ABC },
+   { 'TGETS' , BC_ABC },
+   { 'TGETB' , BC_ABC },
+   { 'TGETR' , BC_ABC, jit_v21 },
+   { 'TSETV' , BC_ABC },
+   { 'TSETS' , BC_ABC },
+   { 'TSETB' , BC_ABC },
+   { 'TSETM' , BC_AD  },
+   { 'TSETR' , BC_ABC, jit_v21 },
+   { 'CALLM' , BC_ABC },
+   { 'CALL'  , BC_ABC },
+   { 'CALLMT', BC_AD  },
+   { 'CALLT' , BC_AD  },
+   { 'ITERC' , BC_ABC },
+   { 'ITERN' , BC_ABC },
+   { 'VARG'  , BC_ABC },
+   { 'ISNEXT', BC_AJ  },
+   { 'RETM'  , BC_AD  },
+   { 'RET'   , BC_AD  },
+   { 'RET0'  , BC_AD  },
+   { 'RET1'  , BC_AD  },
+   { 'FORI'  , BC_AJ  },
+   { 'JFORI' , BC_AJ  },
+   { 'FORL'  , BC_AJ  },
+   { 'IFORL' , BC_AJ  },
+   { 'JFORL' , BC_AD  },
+   { 'ITERL' , BC_AJ  },
+   { 'IITERL', BC_AJ  },
+   { 'JITERL', BC_AD  },
+   { 'LOOP'  , BC_AJ  },
+   { 'ILOOP' , BC_AJ  },
+   { 'JLOOP' , BC_AD  },
+   { 'JMP'   , BC_AJ  },
+   { 'FUNCF' , BC_AD  },
+   { 'IFUNCF', BC_AD  },
+   { 'JFUNCF', BC_AD  },
+   { 'FUNCV' , BC_AD  },
+   { 'IFUNCV', BC_AD  },
+   { 'JFUNCV', BC_AD  },
+   { 'FUNCC' , BC_AD  },
+   { 'FUNCCW', BC_AD  },
 }

 local VKNIL   = 0
@@ -1110,7 +1211,7 @@ Dump = {
    HEAD_1 = 0x1b;
    HEAD_2 = 0x4c;
    HEAD_3 = 0x4a;
-   VERS   = 0x01;
+   VERS   = jit_v21 and 0x02 or 0x01;
    BE     = 0x01;
    STRIP  = 0x02;
    FFI    = 0x04;
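
To see what the enum_mode helper in the patch above does, here is a small standalone sketch. The function body is copied from the diff; the six-entry table and the build wrapper are made up for illustration, so the resulting numbers are not real LuaJIT opcode values. It only shows how entries flagged with false are skipped and how the later indices shift.

```lua
-- enum_mode as it appears in the patch: builds name -> index and
-- index -> mode tables, skipping entries whose third field is false.
local function enum_mode(t)
   local re, rm = {}, {}
   local nskip = 0
   for i = 1, #t do
      local v = t[i]
      if v[3] == false then
         nskip = nskip + 1
      else
         local idx = i - nskip - 1
         rm[idx ] = v[2]
         re[v[1]] = idx
      end
   end
   return re, rm
end

local BC_ABC, BC_AD = 0, 1

-- Hypothetical, truncated instruction list (not the full table).
local function build(v21)
   return enum_mode {
      { 'IST'   , BC_AD  },
      { 'ISF'   , BC_AD  },
      { 'ISTYPE', BC_AD, v21 },
      { 'ISNUM' , BC_AD, v21 },
      { 'MOV'   , BC_AD  },
      { 'ADDVN' , BC_ABC },
   }
end

local bc20, mode20 = build(false) -- as if running on LuaJIT 2.0
local bc21, mode21 = build(true)  -- as if running on LuaJIT 2.1

print(bc20.MOV, bc21.MOV)       -- 2   4
print(bc20.ISTYPE, bc21.ISTYPE) -- nil 2
```

With the flag set to false the two 2.1-only entries vanish and MOV keeps its lower index; with the flag set to true every later entry moves up by two.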

ffi.cdef signature for 'free()' should change

In 'bytecode.lua', there is an ffi.cdef for the 'free()' function that looks like the following:
int free(void*);

The definition that is in my stdlib.h, and I think the standard, is:
void free ( void * ptr );

Additionally, on Windows, it's not found using ffi.C.free (symbol not found). I think you might have to pull from a specific library, or from the app itself.
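
For reference, a minimal sketch of a declaration that matches the standard prototype. The malloc declaration and the has_ffi/released names are only there to make the example self-contained; this is not the code from bytecode.lua, and it does not address the Windows lookup problem.

```lua
-- The ffi module only exists under LuaJIT, so guard the require.
local has_ffi, ffi = pcall(require, 'ffi')

local released = false
if has_ffi then
   ffi.cdef[[
   void *malloc(size_t size);
   void free(void *ptr);   /* returns void, not int */
   ]]
   local p = ffi.C.malloc(16)
   if p ~= nil then
      ffi.C.free(p)
      released = true
   end
end

print(has_ffi, released)
```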

cannot create state: not enough memory

MacOSX 10.14.6

Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

help pls

Incorrect escape sequence parsing

Hello! It seems that there are some off-by-one errors in the lexer which result in characters being dropped or added near some escape sequences.

\z:

$ cat testcase.lua
"foo\z
bar"
$ luajit run-lexer.lua testcase.lua
TK_string   foozar
TK_eof

(expected TK_string foobar)

\ followed by a newline:

$ cat testcase.lua
"foo\
bar"
$ luajit run-lexer.lua testcase.lua
TK_string   foo

ar
TK_eof

(expected TK_string foo
bar)

\d, \dd (less than 3 digits decimal escape sequences):

$ cat testcase.lua
"\97"
$ luajit run-lexer.lua testcase.lua
luajit: LLT-ERRORtestcase.lua:1: unfinished string near '"a'
stack traceback:
    [C]: in function 'error'
    ./lexer.lua:33: in function 'error_lex'
    ./lexer.lua:43: in function 'lex_error'
    ./lexer.lua:306: in function 'read_string'
    ./lexer.lua:426: in function 'llex'
    ./lexer.lua:459: in function 'next'
    run-lexer.lua:8: in main chunk
    [C]: at 0x0804be80

(expected TK_string a)
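
To make the expected behaviour of the three cases concrete, here is a rough standalone sketch of how they could be decoded. This is not the toolkit's lexer code: unescape is a hypothetical helper that works on the text between the quotes and only handles \z, backslash-newline and decimal escapes.

```lua
-- Hypothetical helper: decodes only \z, \<newline> and \d, \dd, \ddd.
local function unescape(s)
   local out, i, n = {}, 1, #s
   while i <= n do
      local c = s:sub(i, i)
      if c ~= '\\' then
         out[#out + 1] = c
         i = i + 1
      else
         local e = s:sub(i + 1, i + 1)
         if e == 'z' then
            -- \z: drop the escape and all whitespace that follows it
            i = i + 2
            while i <= n and s:sub(i, i):match('%s') do i = i + 1 end
         elseif e == '\n' or e == '\r' then
            -- backslash-newline: emit one newline; CRLF or LFCR counts once
            out[#out + 1] = '\n'
            local e2 = s:sub(i + 2, i + 2)
            if (e2 == '\n' or e2 == '\r') and e2 ~= e then
               i = i + 3
            else
               i = i + 2
            end
         elseif e:match('%d') then
            -- decimal escape: take one to three digits, no more
            local digits = s:match('^%d%d?%d?', i + 1)
            local code = tonumber(digits)
            assert(code <= 255, 'decimal escape too large')
            out[#out + 1] = string.char(code)
            i = i + 1 + #digits
         else
            error('escape not handled in this sketch: \\' .. e)
         end
      end
   end
   return table.concat(out)
end

print(unescape('foo\\z\n   bar')) -- foobar
print(unescape('\\97'))           -- a
```

In each branch the index advances past exactly the characters that belong to the escape, which is where an off-by-one would drop or duplicate a neighbouring character.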

division constant evaluation bug

Hey,

your constant eval module checks for a zero divisor, which is unnecessary, since division by zero is well defined for Lua numbers (it yields inf or NaN). Keeping the check results in compile errors when both operands are constant, as in local x = 1 / 0. I suggest you remove it.
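
A quick sketch of what folding without the check would produce. fold_div is a made-up name, not the function in the toolkit's constant evaluation module.

```lua
-- Hypothetical constant folder for division: no zero check needed,
-- since floating-point division by zero does not raise an error.
local function fold_div(a, b)
   return a / b
end

local pos_inf = fold_div(1, 0)
local neg_inf = fold_div(-1, 0)
local nan     = fold_div(0, 0)

print(pos_inf == math.huge)  -- true
print(neg_inf == -math.huge) -- true
print(nan ~= nan)            -- true (NaN is not equal to itself)
```

Whether a NaN result should be stored as a constant is a separate question; a folder could simply decline to fold in that case and leave the division to run time.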

The mask about immutable in uv_decode is wrong

problem with compiling function with empty body

Hi, I think luajit-lang-toolkit has great potential. Thank you for your great work :D
I am trying to use luajit-lang-toolkit to implement some programming languages, but when I generate a Lua AST that represents a function with an empty body, I get an error.

On further investigation, I reproduced the same problem with the following simple Lua source.

Is this a known problem, or am I missing some limitation of luajit-lang-toolkit?

Regards.

$ git log -1
commit 3800c1ad46039b149c50eb2d0c552c563abcd179
Author: Francesco Abbate <[email protected]>
Date:   Sun Mar 30 17:19:26 2014 +0200

    Fix problems with test scripts when checking luajit
$ luajit -v
LuaJIT 2.0.3 -- Copyright (C) 2005-2014 Mike Pall. http://luajit.org/
$ cat tests/func-simple-body.lua 
local func = function () print(1) end
$ luajit tests/func-simple-body.lua 
$ luajit run.lua tests/func-simple-body.lua 
$ cat tests/func-empty.lua 
local func = function () end
$ luajit tests/func-empty.lua 
$ luajit run.lua tests/func-empty.lua 
luajit: ./compile.lua:11: ./generator.lua:900: attempt to index local 'node' (a nil value)
stack traceback:
    [C]: in function 'error'
    ./compile.lua:11: in function 'file'
    run.lua:55: in main chunk
    [C]: at 0x01000010d0

lang_loadstring error

I was curious to see how things would perform and behave if I replaced loadstring in my project and stumbled upon this error:

print(require("lang.compile").string('something("foo bar"--[[]], foo, bar)'))
false, luajit-lang-toolkit: stdin:1: ')' expected near ''

goto jumping into the scope of an unused local

Code like this

goto skip
local x = 1
::skip::
print(x)

causes both LuaJIT and luajit-lang-toolkit to throw an error: <goto skip> jumps into the scope of local 'x'. However, LuaJIT stops complaining if we remove the use of x like so:

goto skip
local x = 1
::skip::

However, luajit-lang-toolkit is not so lenient and still throws the same error.

I've seen people implement continue by placing a label at the very end of the loop, so maybe it's worthwhile to imitate this quirk of LuaJIT's?
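
For illustration, the continue idiom looks roughly like this (it needs LuaJIT or Lua 5.2+, since plain Lua 5.1 has no goto). As far as I can tell, this behaviour follows the Lua 5.2 reference manual rather than being specific to LuaJIT: a local's scope ends at the last non-void statement of its block, and labels count as void statements, so a label at the very end of a block is outside the scope of the locals declared before it.

```lua
local squares = {}
for i = 1, 6 do
   -- skip even numbers by jumping to the end of the loop body
   if i % 2 == 0 then goto continue end
   local sq = i * i              -- local declared between goto and label
   squares[#squares + 1] = sq
   ::continue::                  -- last statement of the block, so the jump is legal
end

print(table.concat(squares, ' ')) -- 1 9 25
```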

Document for AST format?

Hi,

I wonder how this project's AST format differs from metalua's (described here). I'm considering porting a project's backend from metalua to it. Thanks!

Wrong semantics on calls within parens

This example:

function foo() return 5, 10, 15 end
print((foo()))

should only print "5". Currently, lj-lang-toolkit just skips the parens (in prefix expr parsing), which is not what should be happening. A solution could perhaps be to have an AST node kind, "ParenthesizedExpression", that would just discard everything but the first value, just like Lua does. Ideas?
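
A small check of the reference behaviour, counting the values that reach the callee. The count helper is just for illustration.

```lua
local function foo() return 5, 10, 15 end

-- Hypothetical helper: returns how many arguments it received.
local function count(...) return select('#', ...) end

local n_plain = count(foo())    -- all three results are passed through
local n_paren = count((foo()))  -- parentheses truncate to the first result
local first   = (foo())

print(n_plain, n_paren, first)  -- 3   1   5
```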
