Giter Site home page Giter Site logo

Comments (25)

scouten avatar scouten commented on September 23, 2024 2

Still seeing a crash, unfortunately. I think this may be a different issue that was unmasked by the previous fix.

I believe there's a problem with the call to enif_thread_join at line 220 of esqlite3_nif.c. I'm sort of baffled by this finding; I would have expected that call to be battle-tested and robust, but I'll show you how I got to that conclusion.

I built a stripped-down version of my test harness to demonstrate. For purposes of this comment, the changes I made at scouten-archive/crash_esqlite_case@d800062?w=1 (sorry for whitespace noise; my editor strips trailing space by default and I didn't catch it) will be important. You may find those useful in debugging. Basically I've added console output to all of the calls relevant to the conn and queue data structures so I can observe how the two threads interact.

(If you don't mind installing Elixir 1.3.x or later, the try-to-crash.sh script at the root of that repo is pretty effective. It re-runs the test script 100 times, but almost always crashes within the first 10 iterations on my machine.)

What follows is a transcript of a test run on my machine that crashed. There were many other threads involved but I've stripped this down to the calls relevant to the thread pair that crashed:

esqlite_start conn = 000000001e4077b0 queue = 000000001e8800b0
esqlite_connection_run conn = 000000001e4077b0 queue = 000000001e8800b0 command thread START
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
esqlite_start conn = 000000001e4077b0 queue = 000000001e8800b0 command thread 000000001e8801e0 created
queue_push q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
queue_push q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
esqlite_prepare calling keep_resource on conn 000000001e4077b0
queue_push q = 000000001e8800b0 calling enif_mutex_lock
do_prepare conn = 000000001e4077b0 sql = PRAGMA foreign_keys
do_prepare conn = 000000001e4077b0 DONE
destruct_esqlite_statement release_resource conn = 000000001e4077b0
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0
queue_push q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 waiting for cmd thread 000000001e8801e0 to finish
   # --- point A
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 draining remaining commands
queue_has_item q = 000000001e8800b0 calling enif_mutex_lock
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
queue_has_item q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 queue is empty
queue_destroy q = 000000001e8800b0 calling enif_mutex_lock
destruct_esqlite_connection conn = 000000001e4077b0 queue = 000000001e8800b0 queue destroyed
destruct_esqlite_connection conn = 000000001e4077b0 DONE
queue_pop q = 000000001e8800b0 calling enif_mutex_lock
   # --- point B
../include/internal/ethr_mutex.h:656: Fatal error in ethr_mutex_lock(): Invalid argument (22)
./try-to-crash.sh: line 8: 16231 Abort trap: 6           mix test --no-deps-check

At point A, I would have expected to see esqlite_connection_run conn = 000000001e4077b0 queue = 000000001e8800b0 command thread EXIT but it didn't happen. Note that I do see that message, as expected, in test runs that do not crash. I also see it in the crashing test runs, but not for the thread pair that crashes.

At point B, it appears that the command processing thread apparently continued to run and called queue_pop with a now-dangling pointer and crashed.

From this report I believe that enif_thread_join incorrectly returns before the command-processing thread has actually terminated. In the documentation for erl_drv_thread_join (for which enif_thread_join is an alias), I do note the following warning ...

A thread can only be joined once. The behavior of joining more than once is undefined, an emulator crash is likely.

... but I don't have any evidence that there is an extra join call happening.

FWIW, my system information:

Erlang/OTP 19 [erts-8.2] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Elixir 1.3.4

Thanks for your attention so far and hope all of this information helps!

from esqlite.

scouten avatar scouten commented on September 23, 2024 1

Credit to http://erlang.org/pipermail/erlang-questions/2011-February/056282.html for suggestions on how this scenario might happen.

from esqlite.

scouten avatar scouten commented on September 23, 2024 1

/cc @fishcakez @tchoutri: If you know of anyone who could help propose a fix, this would be a big help to the sqlite_ecto2 effort (on which I've made quite a bit of progress lately).

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024 1

I installed elixer and can now see crashes. I've upgraded sqlite, and inspecting core dumps. It almost looks like it is an issue inside sqlite itself.

Thread 29:: esqlite_connect
0   libsystem_kernel.dylib        	0x00007fffd203ec86 __psynch_cvwait + 10
1   beam.smp                      	0x00000000154698f6 ethr_cond_wait + 6
2   esqlite3_nif.so               	0x00000001b0cce2c9 esqlite_connection_run + 89 (esqlite3_nif.c:694)

Thread 30 Crashed:: esqlite_connect
0   libsystem_pthread.dylib       	0x00007fffd21254fc pthread_mutex_lock + 0
1   esqlite3_nif.so               	0x00000001b0ce9875 pthreadMutexEnter + 21 (sqlite3.c:23358)
2   esqlite3_nif.so               	0x00000001b0cd0d14 sqlite3_mutex_enter + 36 (sqlite3.c:22809)
3   esqlite3_nif.so               	0x00000001b0cdbfcd columnName + 109 (sqlite3.c:76920)
4   esqlite3_nif.so               	0x00000001b0cdbf54 sqlite3_column_name + 36 (sqlite3.c:76938)
5   esqlite3_nif.so               	0x00000001b0ccef3b do_column_names + 187 (esqlite3_nif.c:581)
6   esqlite3_nif.so               	0x00000001b0cce635 evaluate_command + 341 (esqlite3_nif.c:659)
7   esqlite3_nif.so               	0x00000001b0cce33b esqlite_connection_run + 203 (esqlite3_nif.c:701)

Thread 31:: esqlite_connect
0   esqlite3_nif.so               	0x00000001b0ce4930 setupLookaside + 352 (sqlite3.c:140082)
1   esqlite3_nif.so               	0x00000001b0ce6990 openDatabase + 1552 (sqlite3.c:142436)
2   esqlite3_nif.so               	0x00000001b0ce6376 sqlite3_open + 38 (sqlite3.c:142489)
3   esqlite3_nif.so               	0x00000001b0cce832 do_open + 162 (esqlite3_nif.c:272)
4   esqlite3_nif.so               	0x00000001b0cce537 evaluate_command + 87 (esqlite3_nif.c:645)
5   esqlite3_nif.so               	0x00000001b0cce33b esqlite_connection_run + 203 (esqlite3_nif.c:701)

from esqlite.

scouten avatar scouten commented on September 23, 2024

I think that this could be made more reproducible by introducing an artificial delay in the do_prepare function and killing the OTP thread during that delay. Out of time to experiment with this for today, but I'll try that approach when I can look into this again.

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

Good one. I'll take a look at it.

Did you use the latest version? About a month ago I fixed another crash which happened in a similar way but without prepared statements.

from esqlite.

scouten avatar scouten commented on September 23, 2024

Well, it turns out I wasn't using the latest version. I was using the 0.2.1 release from hex.pm, which corresponds to June 2015. I updated to your latest. On the third test run, it failed with a new error:

/include/internal/ethr_mutex.h:656: Fatal error in ethr_mutex_lock(): Invalid argument (22)
Abort trap: 6

from esqlite.

scouten avatar scouten commented on September 23, 2024

elixir-sqlite/sqlite_ecto2#41 bumps the version reference to your current master branch.

Ignore the errors about Ecto.Integration.TransactionTest.UniqueError being raised. That's a known and unrelated issue.

from esqlite.

scouten avatar scouten commented on September 23, 2024

FWIW I've run it about 20 more times and have not seen this error recur. My suspicion is that the December 2016 fix made the race condition more difficult to reach, but not impossible.

If I read that code correctly, at 3d546ff#diff-78ab23ac1ccd798d73af2a75e4b49fbeR227, you are waiting until the queue is empty before destroying the queue. However, there is still the possibility that the command processing thread would pop the prepare command (https://github.com/mmzeeman/esqlite/blob/master/c_src/esqlite3_nif.c#L676) and then the Erlang scheduler thread would clear the remaining commands (including cmd_stop) while the do_prepare command is still running in the command processing thread.

from esqlite.

scouten avatar scouten commented on September 23, 2024

I've run this ~100 times since (great entertainment during a meeting!) and have only seen it crash once (the ethr_mutex_lock assertion failure I mentioned earlier). I'll merge this change since it seems to improve things, but am still concerned that it's not quite 100% stable.

Thanks for your attention!

from esqlite.

scouten avatar scouten commented on September 23, 2024

100x success on my machine. And then two builds failed on Travis with segfault: https://travis-ci.org/scouten/sqlite_ecto2/builds/203920437

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

Thanks for reporting back. I hope to find some time in the evening to check out the mutex problem.

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

I think a reference to the connection should be saved before the prepare event is sent. This should prevent the destruction of the connection while there are prepare events on the queue. It could be that I need to add some extra locking to prevent this from happening.

BTW, where is the code which triggers this race-condition?

from esqlite.

scouten avatar scouten commented on September 23, 2024

@mmzeeman The code that triggers this is in the branch for elixir-sqlite/sqlite_ecto2#41. As mentioned earlier, I've tried but not yet succeeded in developing a narrower test harness.

A reference to the connection is being saved. What I think is happening is that the OTP process that owns that reference is itself getting killed at an inopportune moment, thus allowing the destruction to happen in an undesirable sequence.

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

The extra reference to the connection is now made in the command processor thread. It could be that the connection is garbage collected and the stop event is put on the queue... If that happens first the connection is taken down and then later the prepared statement is constructed, and destructed again. This is not how it should work.

If the reference to the connection is made before it is put on the queue, the connection can't be garbage collected while there is a prepare event on the queue. I think this should fix the problem. The prepared statements will then always be destructed before the connection is destructed.

from esqlite.

scouten avatar scouten commented on September 23, 2024

That sounds right. I'll take a look when I can, probably this evening (US Pacific time).

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

Thanks for the detailed information. This is a really strange result indeed. I don't have any experience with elixer though. I will try to replicate the problem in plain erlang.

Could you printf to stderr instead of stdout? Sometimes stdout can be buffered, that sometimes causes printf statements to come out in the wrong order.

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

Aha... there is a reference counting problem to the connection. Finally I have a minimal crash example.

maas$ erl -pa ebin -pa deps/*/ebin
Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3  (abort with ^G)
1> {ok, Db1} = esqlite3:open(":memory:").
{ok,{connection,#Ref<0.0.4.31>,<<>>}}
2> esqlite3:q("create table test(one, two);", Db1).
[]
3> {ok, Stmt} = esqlite3:prepare("select * from test", Db1).
{ok,<<>>}
4> f(Db1).
ok
5> erlang:garbage_collect().
true
6> esqlite3:column_names(Stmt).
Segmentation fault: 11 (core dumped)

I expected that internally a reference to the connection was kep by enif_keep_resource in the prepare call. What happens in practice is that the connection is destructed while there are still prepared statements floating around.

from esqlite.

scouten avatar scouten commented on September 23, 2024

That example sounds a lot like what I've been seeing. I'll take a look at this in the morning. Thank you again!

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

I'm working on a fix... I want to store the connection variable together with the prepared statement in esqlite3.erl. Unfortunately I didn't have the time to finish it yesterday.

from esqlite.

scouten avatar scouten commented on September 23, 2024

OK, no problem. Thanks for the update.

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

I've pushed a fix (hopefully).

from esqlite.

scouten avatar scouten commented on September 23, 2024

Fix looks great. Thank you very much!

Fixed by 3f1ef40.

from esqlite.

mmzeeman avatar mmzeeman commented on September 23, 2024

Thanks for your patience... and detailed responses. That really helped.

from esqlite.

scouten avatar scouten commented on September 23, 2024

I think I spoke too soon. Still seeing segfaults, albeit more rarely. Symptoms are new and different. I'm working on narrowing down a case and I'll file a new issue when I have something reproducible.

from esqlite.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.