Comments (9)
Fixed in 966c903.
Crystal's runtime now reaps child processes before delegating to any custom SIGCHLD handler. Trying to reap again led to blocking the process. Even using WNOHANG would be bad, since Crystal's runtime expects to reap child processes.
Thanks for helping to debug, guys!
I should have known: I rewrote how SIGCHLD is handled in Crystal's runtime, which ended up breaking Prax. Sigh.
from prax.cr.
Thanks for pointing this out @jacksonrayhamilton. It's good to know I'm not the only one experiencing these lockups.
I spent quite some time trying to figure out the cause of this, together with my employer, so I guess I will share all our findings here. They could be useful.
We get similar behaviour even without the app-killer
First, as you mentioned, the Prax.applications.dup.each seems to be a red herring. We saw it too, but it doesn't really make a difference. You can easily reproduce the issue even without the app-killer. We actually compiled Prax without it for testing purposes.
I have this small script that will make Prax get into deadlock 100% of the time:
#!/usr/bin/env ruby
RESPONSIVE_HOST = 'index.test'
FAILING_HOST = 'fail.test' # the app on this domain will crash while starting up (missing gems)
def get_webpage(host)
  puts "GET webpage from #{host}..."
  `curl -m 10 http://#{host}`
  puts 'done'
end
# ---------
get_webpage RESPONSIVE_HOST
# app gets started by Prax and responds with status 200
get_webpage RESPONSIVE_HOST
# app keeps responding to requests
get_webpage FAILING_HOST
# the failing app will crash and that will make Prax unresponsive, ...
# as we will see with the next statement
sleep 5
get_webpage RESPONSIVE_HOST
# even the responsive app will not be able to respond anymore
Note that you need to have at least one (non-failing) app running for the problem to occur. This will not result in a lock:
# [...]
# ---------
get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...
get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...
# repeat this as often as you like
get_webpage RESPONSIVE_HOST
# everything should still be fine
This, to me, seems to be the same situation that you described where the app-killer kills all applications at once.
It seems that Prax becomes unresponsive not only when an application process is killed, but also when an application process simply stops. (Maybe only with an unsuccessful exit status? 🤔 )
Deadlock?
We littered the whole Prax source with puts statements and found that the last line it executes is line 112 in the application spawner that was spawning the crashing application. It simply says: sleep 0.1. In other words: some other fiber now takes over and prevents all other fibers from ever running again, until we press Ctrl+C.
Now which fiber could that be? Could it be a fiber that's managing the connection to the RESPONSIVE_HOST? It sounds plausible because without it we cannot seem to achieve deadlock. On the other hand, how can it be that it isn't locking without a crashing app? It seems logical that it must give other fibers a chance to run otherwise you could never open a second connection or run the app-killer fiber.
Could it be that Process.new spawns some fiber that causes deadlock whenever the process hangs? But then again, why would it only do this when another process is running successfully?
I think it should never happen that a fiber freezes up forever after a call to sleep. My guess is that we have to search for the cause in the Crystal standard library. Maybe the Fiber code changed? Maybe Process changed? We took a look at the changelog, of course, but we couldn't find anything suspicious. This was our first dive into the Crystal source, though; someone more experienced would probably have an easier time understanding the changes.
(A silly idea for debugging: can I maybe somehow label the fibers to find the culprit? Do I have to recompile the whole compiler for that?)
Prax did not have this problem a year ago
I have this PC at home on which I am using an older version of Prax that I cannot reproduce this issue on. I compiled it on 2017-06-29 using my Arch Linux PKGBUILD. I would have to do some research to find out exactly which versions of Prax and Crystal were used but they were the most recent ones at the time.
The problem seems to be independent from the operating system
I suspected that Arch Linux changed some system libraries between summer 2017 and September 2018 that could have caused the error. I compiled Prax on macOS (a BSD with completely different libs, right?) only to find that I can easily reproduce the problem there too. It seems we can rule out the OS.
from prax.cr.
Wow, this is some thorough investigation! Thanks a lot, especially for the reproducible scenarios. I'll try to understand the issue when I get some free time.
Indeed, applications.each + applications.delete will skip checking one app, but only right after it stopped the previous app. The skipped app should still be checked on the next run, so this shouldn't create problems. We could use applications.reverse_each instead of duplicating the array (please open a PR).
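To illustrate the skip described above, here is a minimal sketch in Ruby rather than Crystal (Ruby's Array shows the same effect when you delete during each). The helper name visit_and_delete and the sample app names are made up for the illustration; this is not Prax's code.

```ruby
# Hypothetical helper: walk `items` with the given iterator (:each or
# :reverse_each), deleting `doomed` when we reach it, and record which
# items the block actually got to see.
def visit_and_delete(items, doomed, iterator)
  visited = []
  items.public_send(iterator) do |item|
    visited << item
    items.delete(item) if item == doomed
  end
  visited
end

# Forward iteration: deleting "b" shifts "c" into its slot, so "c" is skipped.
visit_and_delete(%w[a b c d], "b", :each)         # => ["a", "b", "d"]

# Reverse iteration: only already-visited slots shift, so nothing is skipped.
visit_and_delete(%w[a b c d], "b", :reverse_each) # => ["d", "c", "b", "a"]
```

Iterating in reverse avoids the copy that dup makes, at the cost of visiting the apps in the opposite order, which should not matter for a periodic check.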
@tijn you can name Crystal fibers using spawn(name: "app-killer"), for example, if I recall correctly.
Note that Crystal is single-threaded: the event loop runs on a single thread (the app has other threads, but they're all created by the garbage collector). The event loop can be locked, blocking all fibers from running, if one fiber uses a blocking C syscall or ends up in a busy loop (such as loop { i += 1 }).
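The point about cooperative scheduling can be sketched with plain Ruby fibers. This is a toy round-robin scheduler written for this illustration, not Crystal's event loop; run_round_robin and scheduling_demo are hypothetical names. It only shows that a fiber which never yields keeps the single thread to itself until it is done.

```ruby
# Toy scheduler: resume each fiber in turn. A fiber hands control back only
# when it calls Fiber.yield (resume returns nil) or finishes (returns :done).
def run_round_robin(fibers)
  fibers.reject! { |fiber| fiber.resume == :done } until fibers.empty?
end

def scheduling_demo
  log = []
  polite = Fiber.new do
    3.times { |i| log << "polite #{i}"; Fiber.yield } # yields after every step
    :done
  end
  greedy = Fiber.new do
    3.times { |i| log << "greedy #{i}" }              # never yields
    :done
  end
  run_round_robin([polite, greedy])
  log
end

# scheduling_demo
# # => ["polite 0", "greedy 0", "greedy 1", "greedy 2", "polite 1", "polite 2"]
```

Once the greedy fiber is resumed, the polite one cannot run again until the greedy one has finished. Replace its three steps with a blocking syscall or an endless loop and nothing else ever runs.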
from prax.cr.
Maybe this is related: https://github.com/ysbaddaden/prax.cr/blob/master/src/prax.cr#L50-L51
We reworked how SIGCHLD is handled in Crystal some months ago.
from prax.cr.
Maybe this is related: https://github.com/ysbaddaden/prax.cr/blob/master/src/prax.cr#L50-L51
We reworked how SIGCHLD is handled in Crystal some months ago.
Well, that was spot-on!
Removing the call to waitpid resolved the issue. But it's there for a reason, of course; I guess we could end up with zombies if we go without it, so we have to find a better way to call it.
I rewrote the signal handler to use WNOHANG so it becomes non-blocking... However, I am totally unfamiliar with waitpid so I wonder if this could be correct:
Signal::CHLD.trap do
  loop do
    code = LibC.waitpid(-1, out exit_code, LibC::WNOHANG)
    STDERR.puts "SIGCHLD #{code} #{exit_code} #{Errno.value}"
    if code == -1 && Errno.value == Errno::EINTR
      # FIXME: is this right?
      sleep 0.1 # sleep and continue the loop until there is a proper return value
    else
      break
    end
  end
end
I first had a version that didn't check errno, but sometimes that would end up in a never-ending loop. This one seems to work better in that regard. I have not seen it fail yet... 🤞 Still, expert advice is very welcome!
from prax.cr.
I didn't notice the missing WNOHANG in the waitpid call, which made it a blocking call.
The loop is correct, but I should check the manpage, I'm fairly sure we can just discard the return value, or just loop
from prax.cr.
Pushed comment button inadvertently...
I think we can loop until it returns -1 and errno is EAGAIN, which means it would have blocked (no child process to reap). But I haven't verified the manpage, yet.
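For reference, a non-blocking reap loop can be sketched in Ruby (not Crystal) as below. POSIX waitpid with WNOHANG returns 0 when children exist but none has exited yet, and fails with ECHILD once no children remain. Ruby's Process.wait2 surfaces those two cases as a nil return and an Errno::ECHILD exception. The method name reap_exited_children is made up here, and this only illustrates the waitpid semantics: per the fix note at the top of the thread, Crystal's runtime reaps children itself, so Prax should not do this in its own handler.

```ruby
# Hypothetical helper: collect every child that has already exited,
# without ever blocking. Returns an array of [pid, exit status] pairs.
def reap_exited_children
  reaped = []
  loop do
    pid, status = Process.wait2(-1, Process::WNOHANG)
    break if pid.nil? # children exist, but none has exited yet
    reaped << [pid, status.exitstatus]
  end
  reaped
rescue Errno::ECHILD
  reaped # no child processes left at all
end
```

Calling this from a SIGCHLD handler (or right after one fires) drains all pending exits in one go, which matters because several child exits can be coalesced into a single signal.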
from prax.cr.
Fixed the "reaper skips an app" in 6973ce4.
from prax.cr.
Thanks, the issue was fixed for me.
from prax.cr.