Thanks for the report. I believe I've reproduced this problem in the
past and have a patch that should limit the churn. I'll update later
with a cleaned up patch and more details.
- seth
On May 14, 2012, at 14:55, Jeremy Raymond [email protected] wrote:
In pooler:take_member_from_pool/3, when there are no free pids we try to add some:

    take_member_from_pool(...) ->
        case Free of
            [] when NumInUse =:= Max ->
                % snip
            [] when NumInUse < Max ->
                case add_pids(PoolName, 1, State) of % JR: try to add pids
                    {ok, State1} ->
                        % JR: try to take again; take_member/3 calls take_member_from_pool/3
                        take_member(PoolName, From, State1);
                    {max_count_reached, _} ->
                        % snip

In add_pids/3, if we're unable to add any pids, say because the backend is down, we end up back in the same spot with no free pids. We then try to add pids again, endlessly recursing until we succeed. Each time through, new worker processes are spawned to create the connection via start_n_pids/4's call to supervisor:start_child/2. This is causing me problems: each failed connection to the backend generates errors in the logs, filling up the disk at a fast rate. We need some way to detect this condition and error out rather than spinning. This also blocks the pooler:take_member() call until the caller receives a gen_server timeout error.
Reply to this email directly or view it on GitHub:
#6
from pooler.
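The failure mode reported above can be sketched as follows. This is a minimal, self-contained illustration with made-up names (take_sketch, StartFun), not pooler's actual internals: the key idea is that when the member-start function fails, the code surfaces an error instead of recursing back into itself.

```erlang
%% Sketch of a take_member that errors out instead of looping when the
%% member-start function cannot add any pid. Illustrative names only.
-module(take_sketch).
-export([take_member/2]).

%% Free     :: [term()] -- currently free members
%% StartFun :: fun(() -> {ok, Member} | {error, Reason})
%%             stand-in for add_pids/start_n_pids
take_member([], StartFun) ->
    case StartFun() of
        {ok, Member} ->
            {Member, []};
        {error, _Reason} ->
            %% backend unreachable: surface the error rather than
            %% retrying add_pids in a tight loop
            error_no_pids
    end;
take_member([Member | Rest], _StartFun) ->
    {Member, Rest}.
```

With a free member available the start function is never called; with none, a single failed start attempt returns error_no_pids immediately.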
So I've updated (rebased and added a test) the limit_failed_adds branch. From the commit log:
commit 25dc19d7f0aea96bee4bbe81a14b4ee5222df55d
Author: Seth Falcon [email protected]
Date: Thu Feb 9 16:23:36 2012 -0800
Crash the pooler gen_server if too many failed adds occur
In a situation where pooler is unable to add new pool members, pooler
can end up looping on attempting to add a member, member crashes,
attempting to add again.
This patch adds a failed_adds counter to the pool record. When this
counter reaches the init_count for the pool, pooler itself crashes. The
failed_adds is reset anytime a call to add_pids is made in which there
were no failures (all requested members added).
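The counter behavior described in that commit message could be sketched roughly like this (illustrative module and function names, not the actual patch, which keeps the counter in pooler's pool record):

```erlang
%% Sketch of the failed_adds bookkeeping: the counter accumulates the
%% shortfall of each add attempt and resets to zero whenever a call to
%% add members succeeds completely.
-module(failed_adds_sketch).
-export([record_adds/3]).

%% Requested :: non_neg_integer() -- members we tried to add
%% Added     :: non_neg_integer() -- members actually started
%% Failed    :: non_neg_integer() -- current failed_adds counter
record_adds(Requested, Added, _Failed) when Added =:= Requested ->
    0;  % no failures this round: reset the counter
record_adds(Requested, Added, Failed) ->
    Failed + (Requested - Added).
```

Once the counter reaches init_count, the patch crashes the pooler gen_server rather than letting the add loop continue.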
Will this resolve your issue?
Thanks, this helps solve the runaway process creation issue. What about returning an error instead of crashing? If an error were returned I could do something more reasonable than crashing when calling take_member/0:
    case pooler:take_member() of
        error_no_members ->
            % handle error
        error_no_pids -> % JR: this is new
            % handle error rather than crashing
        Pid ->
            % do some work
    end
Maybe even just returning error_no_members would be appropriate. A backend being temporarily unavailable (network outage or backend upgrade) seems like a normal situation rather than something exceptional and worth crashing over.
That's not a bad idea. I initially opted for a crash because it applies at least a little back-pressure on attempts to add new pids. For example, if the client code you are pooling is noisy, logging-wise, when it crashes on a bad start, then pooler sitting in a tight loop trying to start clients could cause your VM issues such as filling up the logs.
So I wonder if what is really needed is a more explicit mechanism to back-off attempting to start clients when failures are frequent.
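One common shape for such a back-off is an exponential delay keyed on the number of consecutive failed adds. The sketch below is purely illustrative (the base and cap values are made up, and this is not pooler configuration):

```erlang
%% Sketch of start throttling: delay before the next attempt to start
%% a member grows exponentially with consecutive failures, up to a cap.
-module(backoff_sketch).
-export([delay_ms/1]).

-define(BASE_MS, 100).    %% delay after the first failure (assumed)
-define(MAX_MS, 30000).   %% upper bound on the delay (assumed)

%% Failures :: integer() -- consecutive failed add attempts
delay_ms(Failures) when Failures =< 0 ->
    0;
delay_ms(Failures) ->
    %% 100, 200, 400, 800, ... capped at 30000 ms
    min(?MAX_MS, ?BASE_MS bsl (Failures - 1)).
```

The pool would sleep (or schedule a timer) for delay_ms(Failures) before the next start attempt, keeping failing client starts from flooding the logs.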
What would be preferable to me is for pooler to indicate to the client that the backend is down, so the client can do something smart about it, like display a message to the user or wait and retry later. If pooler just crashes, then the linked client crashes, as in my case. If pooler itself tries to wait or back off, it may be slow to return control to the caller, causing potential timeouts on the client end.
What you've described makes sense and I will amend the patch to return
an error and not crash pooler.
I think I may not have explained myself very well, however. In
suggesting a back off mechanism I wasn't imagining that this would
make clients wait -- return an error and let them decide what to do
next. Rather, I'm wondering if some back off would be sensible to
avoid the log storm issue since failing client starts are likely to
trigger a log storm -- so the mechanism that adds clients should
perhaps contain some start throttling.
Also, somewhat unrelated, I'd like to modify pooler to use monitors so
that a pooler crash would not have to crash clients.
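The monitor idea can be sketched as below. This is not pooler's implementation, just an illustration of the primitive: unlike a link, a monitor is one-directional, so the watching process learns when the consumer dies, but the watcher's own crash never propagates an exit signal to the consumer.

```erlang
%% Sketch: track consumers with monitors instead of links, so a crash
%% of the pool process does not take its consumers down with it.
-module(monitor_sketch).
-export([watch/1, unwatch/1]).

%% Monitor a consumer pid. If it dies, the caller receives a
%% {'DOWN', Ref, process, Pid, Reason} message and can reclaim the
%% member that pid had checked out.
watch(ConsumerPid) ->
    erlang:monitor(process, ConsumerPid).

%% Stop watching; [flush] removes any 'DOWN' message already queued.
unwatch(Ref) ->
    erlang:demonitor(Ref, [flush]).
```

A pool using this pattern handles the 'DOWN' messages in handle_info/2 to return members to the free list.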
On May 16, 2012, at 6:06 PM, Jeremy Raymond [email protected] wrote:
What would be preferable to me is for pooler to indicate to the client that the backend is down, so the client can do something smart about it, like display a message to the user or wait and retry later. If pooler just crashes, then the linked client crashes, as in my case. If pooler itself tries to wait or back off, it may be slow to return control to the caller, causing potential timeouts on the client end.
Reply to this email directly or view it on GitHub:
#6 (comment)
I've pushed a different fix to master that I think addresses this issue. It causes take_member to return error_no_members after making one attempt to add a new member when there are no free members and the pool count is less than max_count. The number of retries is configurable.
Please let me know if this resolves the issue you are seeing with a temporary backend outage causing problems.
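With this fix, a caller that wants to ride out a brief outage can wrap take_member in its own bounded retry loop. A minimal sketch (TakeFun is a stand-in for pooler:take_member/0; names are illustrative):

```erlang
%% Sketch of caller-side retry now that take_member returns
%% error_no_members instead of crashing the pool.
-module(retry_sketch).
-export([take_with_retry/3]).

%% TakeFun :: fun(() -> pid() | error_no_members)
%% Retries :: non_neg_integer(), SleepMs :: non_neg_integer()
take_with_retry(_TakeFun, 0, _SleepMs) ->
    error_no_members;  % give up after the retry budget is spent
take_with_retry(TakeFun, Retries, SleepMs) ->
    case TakeFun() of
        error_no_members ->
            timer:sleep(SleepMs),
            take_with_retry(TakeFun, Retries - 1, SleepMs);
        Pid ->
            {ok, Pid}
    end.
```

The client decides the policy (retry count, pause, or surface an error to the user) instead of pooler blocking or crashing on its behalf.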
I think the fix on master addresses the out-of-control issue. With the new code, new member creation is only attempted once per take_member call (though this is configurable if more retries are desired).