Giter Site home page Giter Site logo

Comments (12)

agentzh avatar agentzh commented on July 4, 2024 9

@zx827882285 BTW, this is a use-after-free kind of bug so it may lead to any weird things from infinite loop to core dumps. And that's why we hit the internal LuaJIT assertion failure before entering the infinite loop. The infinite loop happens because the ctype ID becomes zero, which is not possible for normal ctype objects at all in the first place.

Thanks a lot for that minimal and standalone example, otherwise debugging this would be much much more difficult (if possible at all).

This bug also affects GC64 builds and any other builds as far as I can see. The issue was not reproducible using your test case just because the GC progresses quite differently in the GC64 mode (for one thing, Lua tables in GC64 builds use table-level free list for recycling nodes while the X64 mode uses bucket-level free lists, thus it is less demanding on GC in GC64 mode). But it's still possible to hit at least in theory.

Small changes in your nginx.conf or Lua code may have big impact on reproducing the issue. This is because that we need perfect timing for just the right GC activities to trigger this bug. And incremental GCs are notoriously known for their inherent nondeterminism.

In a nutshell, what happens here in your test case is like this:

  1. resty/core/shdict.lua:142: this Lua line creates a nil entry for the key "free" in the Lua table in the ffi.C (on the C level as cl->cache) userdata: if not pcall(function () return C.free end) then
  2. the tri-color mark-and-sweek GC progresses and its mark phase marks ffi.C's cache table as black (or "live").
  3. resty/core/shdict.lua:333: this Lua source line invokes C.free() which looks up the free key in C's cache table via the metamethod. It loads the free function cdata into the cache table under C. The cache table is already black while the new entry reuses the one created in step 1 but the value object is a white object. This breaks the tri-color GC invariant: "no black objects can hold references to white ones". So shortly after, once a GC cycle is completed, the "free" value in the cache table is collected prematurely (because it is "white" and thus considered "dead") even it is still to be used.
  4. The "free" object looked up from the table on step 3 gets used and leads to memory corruptions.

The key to reproduce this is to have a completed GC cycle in the middle of step 3 and step 4 above. The time window is very small and we have to be lucky enough to hit this. That's why you need wrk and special layout of the test case.

To fix this problem, we just need to preserve the tri-color GC's invariant in step 3 by adding a write barrier there and turning the cache table under ffi.C to gray. That's essentially what my patch above does via the lj_gc_anybarrier() function call.

Just for the record, below is my debugging session using Mozilla rr and gdb python tools generated by our ylang compiler:

https://gist.github.com/agentzh/534aabb3a5bc75ff62b8fd25e3d371e0

For more background info on LuaJIT's GC algorithm, we can look at Mike Pall's explanation here:

http://wiki.luajit.org/New-Garbage-Collector#gc-algorithms_tri-color-incremental-mark-sweep

from luajit2.

spacewander avatar spacewander commented on July 4, 2024

@zx827882285
Could you use lbt to get the Lua stack?

from luajit2.

zengjinji avatar zengjinji commented on July 4, 2024

yes
(gdb) lbt
builtin#176
@/usr/local/nginx/lua/resty/core/shdict.lua:333
@/usr/local/nginx/lua/test.lua:15
=content_by_lua(nginx.conf:37):5

from luajit2.

agentzh avatar agentzh commented on July 4, 2024

@zx827882285 Will you show us your /usr/local/nginx/lua/test.lua file?

from luajit2.

agentzh avatar agentzh commented on July 4, 2024

Okay, never mind, I saw it is in the zx827882285/luajit-bug-report github repo.

from luajit2.

agentzh avatar agentzh commented on July 4, 2024

@zx827882285 I just tried your luajit-bug-report repo and the steps on my side with OpenResty 1.15.8.1 RC0 on Fedora 28 x86_64 and I failed to reproduce any infinite looping no matter how hard I try:

https://gist.github.com/agentzh/85bd2c83d1c7a8badf5851268e695743

The CPU usage of the nginx or openresty worker processes dropped to 0 immediately after the wrk runs finished.

I tried both the GC64 and non-GC64 modes of LuaJIT.

I only applied the following patch to your nginx.conf so that I don't need to run the nginx server using root nor need I pollute my system's /usr/local/ directory:

diff --git a/nginx.conf b/nginx.conf
index 60214ef..cda2f53 100644
--- a/nginx.conf
+++ b/nginx.conf
@@ -8,8 +8,8 @@ events {

 http {

-    lua_package_path  "/usr/local/nginx/clua/?.lua;/usr/local/nginx/lua/?.lua;/usr/local/nginx/api/?.lua;;";
-    lua_package_cpath  "/usr/local/nginx/lua/?.so;;";
+    lua_package_path  "$prefix/clua/?.lua;$prefix/lua/?.lua;$prefix/api/?.lua;;";
+    lua_package_cpath  "$prefix/lua/?.so;;";

     lua_shared_dict a_shm 1M;
     lua_shared_dict b_shm 1M;
@@ -21,7 +21,7 @@ http {

     server {
         server_name localhost;
-        listen 80;
+        listen 8083;

         location /set {
             content_by_lua_block {

You should do the same when preparing a minimal example for others to run. BTW, you know about the -p PREFIX option of the nginx command line so that you can specify your own prefix. See this tutorial for a full example: https://openresty.org/en/getting-started.html

BTW, will you try the following OpenResty 1.15.8.1 RC0 version? To see if you can still reproduce it. It's much easier for us to reproduce things when using OpenResty. And you can also save a lot of setup commands to explain as in this issue.

https://openresty.org/download/openresty-1.15.8.1rc0.tar.gz

Thanks!

from luajit2.

agentzh avatar agentzh commented on July 4, 2024

@zx827882285 BTW, you should try fetching both the C and Lua backtraces at different times for the same nginx worker process that spins at 100% CPU usage forever.

So does your nginx worker processes spin at 100% CPU usage forever? Even after all wrk processes and any other traffic generators quit?

The Lua and C backtraces do not look like inside any C or Lua loops at all, which is weird.

from luajit2.

zengjinji avatar zengjinji commented on July 4, 2024

@agentzh
yes, i start two nginx worker, after wrk processes quit, only one nginx worker process spins at 100% CPU usage forever.
i try the following OpenResty 1.15.8.1 RC0 version, but can't reproduce it

In the morning, I install fedroa28 on VMware Workstation and reproduce it;
fedroa28 iso: http://mirrors.163.com/fedora/releases/28/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-28-1.1.iso

For convenience to reproduce it, i commit install.sh
could you try this in you fedroa system, see can reproduce it.
This needn't to run the nginx server using root and needn't pollute your system's /usr/local/ directory

$ cd /tmp
$ git clone https://github.com/zx827882285/luajit-bug-report.git
$ cd luajit-bug-report
$ sh install.sh
$ export LD_LIBRARY_PATH=/tmp/luajit-bug-report/luajit/lib
$ ./nginx/sbin/nginx -p . -c nginx.conf

$ cd /path/wrk;
$ curl  http://localhost:8083/set;./wrk -t4 -c120 -d10  http://localhost:8083/
$ top

thanks!

from luajit2.

agentzh avatar agentzh commented on July 4, 2024

OK, I can reproduce it using your install.sh script. BTW, it's considered bad practice to provide binary executable files like cjson.so in your minimal example.

You cannot reproduce it with OpenResty 1.15.8.1 RC0 just because it enables the GC64 mode in LuaJIT on x86_64 by default. If you enable that in your install.sh, the problem will also disappear.

If you enable internal assertions in your LuaJIT build, you will see that it hits an internal assertion inside LuaJIT first, before entering the infinite loop:

nginx: lj_obj.h:926: copyTV: Assertion `!((((o1)->it) - ((~4u)+1)) > ((~13u) - ((~4u)+1))) || ((~((o1)->it) == (((GCobj *)(uintptr_t)((o1)->gcr).gcptr32))->gch.gct) && !(((((GCobj *)(uintptr_t)((o1)->gcr).gcptr32)))->gch.marked & ((((global_State *)(void *)(uintptr_t)(L->glref).ptr32))->gc.currentwhite ^ (0x01 | 0x02)) & (0x01 | 0x02)))' failed.

Using openresty-valgrind (with no-pool patch for nginx and the system allocator for luajit) and valgrind fails to catch any memory problems.

Disabling the JIT compiler in LuaJIT does not make the problem go away, so it should not be related to the JIT compiler in at all.

Enforcing full GC cycle upon every Lua code line also makes the issue disappear.

Replacing C.free() calls with ffi.C.free() calls in the resty.core.shdict also make the issue go away.

These are my initial experiments and findings. Need to dig deeper to find out why the GC state of the C.free TValue is incorrect.

from luajit2.

zengjinji avatar zengjinji commented on July 4, 2024

sorry for provide binary executable files in my minimal example.
it confused me that it can't reproduce when change nginx.conf, like it.

diff --git a/nginx.conf b/nginx.conf
index cda2f53..f99e936 100644
--- a/nginx.conf
+++ b/nginx.conf
@@ -30,7 +30,7 @@ http {
         }
         location / {
             content_by_lua_block {
-                require("a");
+                --require("a");
                 require("b");
                 require("resty.lock");
                 require("test"):shm_get();

Although I think there is no infulence.

thie problem may related to another problem, segfault in luajit.
i can't reproduce it, so i post it to openresty geogle group.
https://groups.google.com/forum/#!topic/openresty/SKnI8kd_rdY

thank you for your experiments and findings, it was helpful!
may be i need try enables the GC64 mode in LuaJIT, to see can it reproduce in production environment.

from luajit2.

agentzh avatar agentzh commented on July 4, 2024

@zx827882285 I think I've already fixed it. It's a bug in LuaJIT's FFI library when handling GC. Please try the following LuaJIT patch on your side:

diff --git a/src/lj_clib.c b/src/lj_clib.c
index f016b06b96..a867205247 100644
--- a/src/lj_clib.c
+++ b/src/lj_clib.c
@@ -384,6 +384,7 @@ TValue *lj_clib_index(lua_State *L, CLibrary *cl, GCstr *name)
       cd = lj_cdata_new(cts, id, CTSIZE_PTR);
       *(void **)cdataptr(cd) = p;
       setcdataV(L, tv, cd);
+      lj_gc_anybarriert(L, cl->cache);
     }
   }
   return tv;

from luajit2.

zengjinji avatar zengjinji commented on July 4, 2024

@agentzh
thanks!
i will take this patch and try it in production environment.
it may need a few day to watch.

from luajit2.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.