Giter Site home page Giter Site logo

Comments (9)

ysbaddaden avatar ysbaddaden commented on May 21, 2024 1

Fixed in 966c903.

Crystal's runtime now reaps child processes before delegating to any custom SIGCHLD handler. Trying to reap again led to blocking the process. Even using WNOHANG would be bad, since Crystal's runtime expects to reap child processes.

Thanks for helping the debug guys!

I should have known, I rewrote how SIGCHLD is handled in Crystal's runtime, which led to break Prax. Sigh.

from prax.cr.

tijn avatar tijn commented on May 21, 2024

Thanks for pointing this out @jacksonrayhamilton. It's good to know I'm not the only one experiencing these lockups.

I spent quite some time trying to figure out the cause of this, together with my employer, so I guess I will share our all our findings here. It could be useful.

we get similar behaviour even without the app-killer

First, as you mentioned, the Prax.applications.dup.each seems a straw man. We saw it too but it doesn't really make a difference. You can easily reproduce the issue even without the app-killer. We actually compiled Prax without it for testing purposes.

I have this small script that will make Prax get into deadlock 100% of the time:

#!/usr/bin/env ruby

RESPONSIVE_HOST = 'index.test'
FAILING_HOST = 'fail.test' # the app on this domain will crash while starting up (missing gems)

def get_webpage(host)
  puts "GET webpage from #{host}..."
  `curl -m 10 http://#{host}`
  puts 'done'
end

# ---------

get_webpage RESPONSIVE_HOST
# app gets started by Prax and responds with status 200

get_webpage RESPONSIVE_HOST
# app keeps responding to requests

get_webpage FAILING_HOST
# the failing app will crash and that will make Prax unresponsive, ...
# as we will see with the next statement

sleep 5
get_webpage RESPONSIVE_HOST
# even the responsive app will not be able to respond anymore

Note that you need to have at least one (non-failing) app running for the problem to occurr. This will not result in a lock:

# [...]
# ---------

get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...

get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...
# repeat this as often as you like

get_webpage RESPONSIVE_HOST
# everything should still be fine

This, to me, seems to be the same situation that you described where the app-killer kills all applications at once.

It seems that not only killing an application process will cause Prax to become unresponsive, it also happens when an application process process simply stops. (Maybe only with an unucessful exit status? 🤔 )

Deadlock?

We overloaded the whole Prax source with puts statements and we found out it that the last line it will execute is line 112 in the application spawner that was spawning the crashing application. It simply says: sleep 0.1. In other words: some other fiber will now take over and prevent all other fibers from ever running again; until we press ctrl+c.

Now which fiber could that be? Could it be a fiber that's managing the connection to the RESPONSIVE_HOST? It sounds plausible because without it we cannot seem to achieve deadlock. On the other hand, how can it be that it isn't locking without a crashing app? It seems logical that it must give other fibers a chance to run otherwise you could never open a second connection or run the app-killer fiber.

Could it be that Process.new spawns some fiber that causes deadlock whenever the process hangs? But then again, why would it only do this when another process is running successfully?

I think it should never happen that a fiber will freeze up forever after a call to sleep. My guess is that we have to search for the cause in the Crystal standard library. Maybe the Fiber code changed? Maybe Process changed? We took a look at the changelog of course but we couldn't find anything suspicious. This was our first dive into the Crystal source though, someone more experienced would probably have a less difficult time understanding the changes.

(A silly idea for debugging: can I maybe somehow label the fibers to find the culprit? Do I have to recompile the whole compiler for that?)

Prax did not have this problem a year ago

I have this PC at home on which I am using an older version of Prax that I cannot reproduce this issue on. I compiled it on 2017-06-29 using my Arch Linux PKGBUILD. I would have to do some research to find out exactly which versions of Prax and Crystal were used but they were the most recent ones at the time.

The problem seems to be independent from the operating system

I suspected that Arch Linux changed some system libraries between Summer 2017 and September 2018 that could have caused the error. I compiled Prax on a macOS (a BSD with completely different libs, right?) only to find that indeed I can easily reproduce the problem there too. It seems we can rule out the OS.

from prax.cr.

ysbaddaden avatar ysbaddaden commented on May 21, 2024

Wow, this is some thorough investigation! Thanks a lot, especially for the reproducible scenarios. I'll try to understand the issue when I get some free time.

Indeed, applications.each + applications.delete will skip an app to check, but only when it stopped the previous app, yet, it should be iterated on the next run, so it shouldn't create problems. We could use applications.reverse_each, instead of duplicating the array (please open a PR).

@tijn you can name crystal fibers, using spawn(name: "app-killer") for example, if I recall correctly.

Note that Crystal is single-threaded: the event-loop runs on a single thread —the app has threads, but they're all created by the garbage collector— the event-loop can be locked, blocking all fibers from running, if one fiber uses a blocking C syscall or ends up in a busy loop (such as loop { i += 1 }).

from prax.cr.

ysbaddaden avatar ysbaddaden commented on May 21, 2024

Maybe this is related: https://github.com/ysbaddaden/prax.cr/blob/master/src/prax.cr#L50-L51

We reworked how SIGCHLD is handled in Crystal some months ago.

from prax.cr.

tijn avatar tijn commented on May 21, 2024

Maybe this is related: https://github.com/ysbaddaden/prax.cr/blob/master/src/prax.cr#L50-L51

We reworked how SIGCHLD is handled in Crystal some months ago.

Well, that was spot-on!

Removing the call to waitpid resolved the issue. But it's there for a reason of course; I guess we could end up with zombies if we go without it so we have to find a better way to call it.

I rewrote the signal handler to use WNOHANG so it becomes non-blocking... However, I am totally unfamiliar with waitpid so I wonder if this could be correct:

Signal::CHLD.trap do
  loop do
    code = LibC.waitpid(-1, out exit_code, LibC::WNOHANG)
    STDERR.puts "SIGCHLD #{code} #{exit_code} #{Errno.value}"
    if code == -1 && Errno.value == Errno::EINTR
      # FIXME: is this right?
      sleep 0.1 # sleep and continue the loop until there is a proper return value
    else
      break
    end
  end
end

I first had a version that didn't check errno but sometimes that would end up in a never ending loop. This one seems to work better in that regard. I did not see it fail yet... 🤞 Still, expert advice is very welcome!

from prax.cr.

ysbaddaden avatar ysbaddaden commented on May 21, 2024

I didn't notice the missing WNOHANG in the waitpid call, making it a blocking call.

The loop is correct, but I should check the manpage, I'm fairly sure we can just discard the return value, or just loop

from prax.cr.

ysbaddaden avatar ysbaddaden commented on May 21, 2024

Pushed comment button inadvertently...

I think we can loop until it returns -1 and errno is EAGAIN, which means it would have blocked (no child process to reap). But I haven't verified the manpage, yet.

from prax.cr.

ysbaddaden avatar ysbaddaden commented on May 21, 2024

Fixed the "reaper skips an app" in 6973ce4.

from prax.cr.

jacksonrayhamilton avatar jacksonrayhamilton commented on May 21, 2024

Thanks, the issue was fixed for me.

from prax.cr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.