
SidekiqAlive



SidekiqAlive offers a liveness probe for Sidekiq instances deployed in Kubernetes. The library can also be used to check Sidekiq's health outside Kubernetes.

How?

An HTTP server is started and, on each request, validates that a liveness key is stored in Redis. If the key is present, Sidekiq is working.

A Sidekiq worker is responsible for storing this key. If Sidekiq stops processing jobs, the key expires in Redis and, consequently, the HTTP server returns a 500 error.

The worker is also responsible for requeuing itself for the next liveness probe.
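The check itself can be sketched as follows. This is illustrative only, not the gem's actual code; it assumes a Redis-like store responding to `get`, and uses the documented default key name:

```ruby
# Sketch of the liveness check. If Sidekiq stopped processing,
# Redis expired the key and #get returns nil.
LIVENESS_KEY = "SIDEKIQ::LIVENESS_PROBE_TIMESTAMP"

def alive?(store, key = LIVENESS_KEY)
  !store.get(key).nil?
end
```

The HTTP server would then answer 200 when `alive?` is true and 500 otherwise.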

Each instance in Kubernetes is identified by the ENV variable HOSTNAME (Kubernetes sets this for each replica/pod).

On initialization, SidekiqAlive assigns its worker a queue named after the current host and adds this queue to the queues the current instance processes.

Example:

hostname: foo
  Worker queue: sidekiq_alive-foo
  Instance queues:
    - sidekiq_alive-foo
    - your queues

hostname: bar
  Worker queue: sidekiq_alive-bar
  Instance queues:
    - sidekiq_alive-bar
    - your queues
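In code terms, the queue name is derived roughly like this (a simplified sketch, not the gem's actual implementation; the prefix shown matches the queue names in the example above):

```ruby
# Illustrative: builds the per-instance queue name from the queue
# prefix and the HOSTNAME env var Kubernetes sets for each pod.
def sidekiq_alive_queue(hostname, prefix: "sidekiq_alive")
  "#{prefix}-#{hostname}"
end
```

The prefix is configurable (see the Queue Prefix option below), so each replica still gets its own dedicated queue.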

Installation

Add this line to your application's Gemfile:

gem 'sidekiq_alive'

And then execute:

$ bundle

Or install it yourself as:

$ gem install sidekiq_alive

Usage

SidekiqAlive starts automatically when you run the sidekiq command.

Run Sidekiq

bundle exec sidekiq
curl localhost:7433
#=> Alive!

How to disable? You can disable SidekiqAlive by setting the ENV variable DISABLE_SIDEKIQ_ALIVE, for example:

DISABLE_SIDEKIQ_ALIVE=true bundle exec sidekiq

Kubernetes setup

Set livenessProbe in your Kubernetes deployment

example with recommended setup:

Sidekiq < 6

spec:
  containers:
    - name: my_app
      image: my_app:latest
      env:
        - name: RAILS_ENV
          value: production
      command:
        - bundle
        - exec
        - sidekiq
      ports:
        - containerPort: 7433
      livenessProbe:
        httpGet:
          path: /
          port: 7433
        initialDelaySeconds: 80 # app specific. Time your sidekiq takes to start processing.
        timeoutSeconds: 5 # can be much less
      readinessProbe:
        httpGet:
          path: /
          port: 7433
        initialDelaySeconds: 80 # app specific
        timeoutSeconds: 5 # can be much less
      lifecycle:
        preStop:
          exec:
            # SIGTERM triggers a quick exit; gracefully terminate instead
            command: ['bundle', 'exec', 'sidekiqctl', 'quiet']
  terminationGracePeriodSeconds: 60 # set to your longest job's runtime plus a safety margin.

Sidekiq >= 6

Create file:

kube/sidekiq_quiet

#!/bin/bash

# Find Pid
SIDEKIQ_PID=$(ps aux | grep sidekiq | grep busy | awk '{ print $2 }')
# Send TSTP signal
kill -SIGTSTP $SIDEKIQ_PID

Make it executable:

$ chmod +x kube/sidekiq_quiet

Execute it in your deployment preStop:

spec:
  containers:
    - name: my_app
      image: my_app:latest
      env:
        - name: RAILS_ENV
          value: production
      command:
        - bundle
        - exec
        - sidekiq
      ports:
        - containerPort: 7433
      livenessProbe:
        httpGet:
          path: /
          port: 7433
        initialDelaySeconds: 80 # app specific. Time your sidekiq takes to start processing.
        timeoutSeconds: 5 # can be much less
      readinessProbe:
        httpGet:
          path: /
          port: 7433
        initialDelaySeconds: 80 # app specific
        timeoutSeconds: 5 # can be much less
      lifecycle:
        preStop:
          exec:
            # SIGTERM triggers a quick exit; gracefully terminate instead
            command: ['kube/sidekiq_quiet']
  terminationGracePeriodSeconds: 60 # set to your longest job's runtime plus a safety margin.

Outside kubernetes

Outside Kubernetes, how you use it is up to you.

A local example:

bundle exec sidekiq
# let it initialize ...
curl localhost:7433
#=> Alive!
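For example, when running Sidekiq in Docker outside Kubernetes, the same endpoint can back a container healthcheck. A sketch (the service name, image, and timings are illustrative assumptions, and it assumes `curl` is available in the image):

```yaml
# docker-compose.yml fragment (illustrative)
services:
  sidekiq:
    image: my_app:latest
    command: bundle exec sidekiq
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:7433/"]
      interval: 30s
      timeout: 5s
      start_period: 80s   # time your Sidekiq takes to start processing
      retries: 3
```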

Options

SidekiqAlive.setup do |config|
  # ==> Server host
  # Host to bind the server.
  # Can also be set with the environment variable SIDEKIQ_ALIVE_HOST.
  # default: 0.0.0.0
  #
  #   config.host = '0.0.0.0'

  # ==> Server port
  # Port to bind the server.
  # Can also be set with the environment variable SIDEKIQ_ALIVE_PORT.
  # default: 7433
  #
  #   config.port = 7433

  # ==> Server path
  # HTTP path to respond to.
  # Can also be set with the environment variable SIDEKIQ_ALIVE_PATH.
  # default: '/'
  #
  #   config.path = '/'

  # ==> Custom Liveness Probe
  # Extra check to decide whether to restart the pod, for example a DB connection check.
  # Returning `false` or `nil`, or raising, will prevent the liveness key from being written.
  # default: proc { true }
  #
  #     config.custom_liveness_probe = proc { db_running? }

  # ==> Liveness key
  # Key to be stored in Redis as probe of liveness
  # default: "SIDEKIQ::LIVENESS_PROBE_TIMESTAMP"
  #
  #   config.liveness_key = "SIDEKIQ::LIVENESS_PROBE_TIMESTAMP"

  # ==> Time to live
  # Time for the key to be kept by Redis.
  # This sets how often Sidekiq must prove it is working.
  # Time unit: seconds
  # default: 10 * 60 # 10 minutes
  #
  #   config.time_to_live = 10 * 60

  # ==> Callback
  # After the key is stored in Redis you can perform any action,
  # for example a webhook or an email to notify the team.
  # default: proc {}
  #
  #    require 'net/http'
  #    config.callback = proc { Net::HTTP.get(URI("https://status.com/ping")) }

  # ==> Shutdown callback
  # When sidekiq process is shutting down, you can perform some arbitrary action.
  # default: proc {}
  #
  #    config.shutdown_callback = proc { puts "Sidekiq is shutting down" }

  # ==> Queue Prefix
  # SidekiqAlive runs in an independent queue for each instance/replica.
  # The queue name is generated as: "#{queue_prefix}-#{hostname}".
  # You can customize the prefix here.
  # default: :sidekiq_alive
  #
  #    config.queue_prefix = :other

  # ==> Concurrency
  # The maximum number of Redis connections requested for the SidekiqAlive pool.
  # Can also be set with the environment variable SIDEKIQ_ALIVE_CONCURRENCY.
  # NOTE: only affects Sidekiq 7 or greater.
  # default: 2
  #
  #    config.concurrency = 3

  # ==> Rack server
  # Web server used to serve the HTTP response. By default a simple GServer-based HTTP server is used.
  # To use a specific server, rack gem version > 2 is required. For rack version >= 3, the rackup gem is required.
  # Can also be set with the environment variable SIDEKIQ_ALIVE_SERVER.
  # default: nil
  #
  #    config.server = 'puma'

  # ==> Quiet mode timeout in seconds
  # When Sidekiq is shutting down, the Sidekiq process stops pulling jobs from the queue,
  # including the alive key update job. With long-running jobs, the alive key can expire
  # before the job finishes. To avoid this, the web server is put into quiet mode and
  # returns 200 OK for healthcheck requests. To avoid an infinite quiet mode when the
  # Sidekiq process is stuck shutting down, a timeout can be set. After the timeout is
  # reached, the web server resumes normal operation and returns an unhealthy status if
  # the alive key has expired or been purged from Redis.
  # default: 180
  #
  #    config.quiet_timeout = 300

end

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install.

Here is an example Rails app.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/arturictus/sidekiq_alive. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

sidekiq_alive's People

Contributors

andrcuns, arrowcircle, arturictus, astrocket, bcantin, bugthing, caioalonso, dependabot[bot], fwolfst, jkogara, koconnor-ampion, limcross, lucas-aragno, maltesa, petergoldstein, ramontayag, ryanmrodriguez, sunny, vinnyfd


sidekiq_alive's Issues

Feature: optionally allow developers to control enqueuing

We use sidekiq-cron and noticed that at one point, jobs stop being enqueued. We already use this gem to see if jobs stop running, and thought that we could use the same mechanism to detect times when jobs stop getting enqueued.

Are you open to a feature where we can control the enqueueing of SidekiqAlive::Worker? By default, it would requeue itself, but if you pass requeue: false, it will not requeue itself.

Usage would be like:

SidekiqAlive.start(requeue: false)

requeue defaults to true, and the value is just passed into sa::Server.perform_async(hostname, requeue: requeue) in this line.

Then we enqueue, for every Sidekiq instance, every config.time_to_live / 2 seconds, by calling:

SidekiqAlive::Worker.perform_async(requeue: false)

Does this sound like it'll work? Still need to think through how we will enqueue for each available hostname without hard coding the number of pods we have for Sidekiq.

uninitialized constant Tilt::CompileSite

Hi,
could you please fix this problem for sidekiq 5.2.x?
The problem is that sidekiq 5.x uses sinatra 1.0, which does not support Tilt.
A temporary fix is to use sidekiq 4.x.
Cheers!

Change queue name?

I'm curious about why each host gets its own queue. As I understand it, Sidekiq recommends as few queues as possible. If I want to put the alives in my critical, what problem does that cause?

I suppose I can monkey-patch the name, but it seems like this is an 'obvious' feature, so maybe I'm missing some bad implications of doing so.

Sinatra not using the port from a Sidekiq.setup block

#26 addresses this issue.

I found that when I was using the sidekiq_alive_example rails application and I changed the port in the config/sidekiq_alive.rb file, it was being set for the SidekiqAlive singleton, but the Sinatra application was not loading it properly.

The given PR adds tests and an updated lib/sidekiq_alive/server.rb to load the port.

Thank you

child process terminates before sidekiq jobs are finished

Sidekiq can have jobs that are very long-running. In the standard sidekiq manager stop(deadline) method, the sidekiq shutdown handlers run (including sidekiq_alive shutdown handler to kill the child process), and then it waits for up to (deadline) seconds for all the workers to finish their jobs, which could be an arbitrarily large wait. In the meantime, anything that's calling sidekiq_alive's health check endpoint will fail as that process is now dead, indicating to the caller that the sidekiq process itself is now dead (but it's not - there's still work being done - jobs in flight) - which may then hard-terminate the container sidekiq is in.

I think a proper solution is to move the quiet and shutdown handling code into an at_exit handler instead, so the child process is terminated only immediately as sidekiq is on its way to exiting.

One problem with this approach, though, is that during this time (after the quiet and shutdown handlers have run, but before sidekiq has exited) no more worker jobs are running to update the redis key. Therefore we can't return the HTTP 404 response during this time, lest that be taken as an indication sidekiq has exited.

Thoughts?

DISABLE_SIDEKIQ_ALIVE="false" disables sidekiq alive

thanks for this gem, really useful for us :)

The dotenv gem is pretty popular. The behaviour of this gem is to provide an ENV var if it is not present in the executing environment already.

That means it's hard to remove environment variables but easy to overwrite them. I see the code only checks the truthiness of the env variable DISABLE_SIDEKIQ_ALIVE. Would you consider something like the following?

SidekiqAlive.start if ENV.fetch('DISABLE_SIDEKIQ_ALIVE', '').casecmp("true") >= 0
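A string-aware variant of that check could be sketched like this (illustrative only; the helper name is hypothetical and this is not the gem's actual code):

```ruby
# Treats only explicit truthy strings as "disabled", so
# DISABLE_SIDEKIQ_ALIVE="false" no longer disables the gem.
def sidekiq_alive_disabled?(env = ENV)
  %w[true 1 yes].include?(env.fetch("DISABLE_SIDEKIQ_ALIVE", "").strip.downcase)
end
```

Startup would then be guarded with `SidekiqAlive.start unless sidekiq_alive_disabled?`.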

queue_prefix config not used

I wanted to use the queue_prefix for the queues, but the name is still "sidekiq_alive-:hostname"

init:

SidekiqAlive.setup do |config|
  config.queue_prefix = :x_alive
  config.server = 'puma'
end

I want to move these queues to the end in all displays, because they are normally sorted by name.

image

Broken with Sidekiq 7

Since version 7, sidekiq uses redis-client as a replacement for redis (https://github.com/mperham/sidekiq/blob/main/docs/7.0-Upgrade.md#redis-client). The Redis client API is no longer the same, and so it breaks sidekiq_alive.

[2022-12-08T18:26:05.011835 #316492 #]  WARN [] - concerto.sidekiq_fifo: {"context":"Exception during Sidekiq lifecycle event.","event":"startup","_config":"#<Sidekiq::Config:0x00007fc3841d16f0>"}
[2022-12-08T18:26:05.011947 #316492 #]  WARN [] - concerto.sidekiq_fifo: TypeError: no implicit conversion of nil into Array
[2022-12-08T18:26:05.012054 #316492 #]  WARN [] - concerto.sidekiq_fifo: /home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:61:in `+'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:61:in `block in deep_scan'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:59:in `loop'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:59:in `deep_scan'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:55:in `registered_instances'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:152:in `successful_startup_text'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:23:in `block (3 levels) in start'
<internal:kernel>:90:in `tap'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq_alive-2.1.6/lib/sidekiq_alive.rb:15:in `block (2 levels) in start'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq-7.0.2/lib/sidekiq/component.rb:58:in `block in fire_event'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq-7.0.2/lib/sidekiq/component.rb:57:in `each'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq-7.0.2/lib/sidekiq/component.rb:57:in `fire_event'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq-7.0.2/lib/sidekiq/cli.rb:105:in `run'
/home/nicolas/.asdf/installs/ruby/3.1.3/lib/ruby/gems/3.1.0/gems/sidekiq-7.0.2/bin/sidekiq:31:in `<top (required)>'
bin/sidekiq:27:in `load'
bin/sidekiq:27:in `<main>'

Pods keep failing health check

Hi,

I'm using :
sidekiq (5.2.7)
sidekiq_alive (2.0.1)

Deployment:

spec:
  containers:
    - name: sidekiq
      image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
      imagePullPolicy: {{ .Values.image.pullPolicy }}
      command: ["bin/bundle", "exec", "sidekiq"]
      args: ["-q", "critical", "-q", "notifications", "-q", "default", "-q", "low"]
      ports:
        - containerPort: 7433
        - containerPort: 9394
      livenessProbe:
        httpGet:
          path: /
          port: 7433
        initialDelaySeconds: 80 # app specific. Time your sidekiq takes to start processing.
        timeoutSeconds: 5 # can be much less
      readinessProbe:
        httpGet:
          path: /
          port: 7433
        initialDelaySeconds: 80 # app specific
        timeoutSeconds: 5 # can be much less

I have another deployment mirroring the config above, except I'm specifying a different queue there.

It keeps failing the liveness or readiness probe every few minutes and reboots:
image

Some pods log Detected parent died, dying; others show nothing in the logs that would indicate anything going wrong.

Does not work on multiple Sidekiq pods

Let's say i have several sidekiq pods, each is working on a specific queue:

backend-staging-sidekiq-alice-stable-96186k9mq6   1/1       Running   2          36m
backend-staging-sidekiq-bob-stable-39797474-8j15c         1/1       Running   2          36m
backend-staging-sidekiq-chris-3103080732-hrsq9              1/1       Running   3          36m

Alice pod -> queue_a
Bob pod -> queue_b
Chris pod -> queue_c and default

because SidekiqAlive::Worker are all enqueued into default, it will only be executed by Chris.

But here in

def perform(hostname = SidekiqAlive.hostname)
  return unless hostname_registered?(hostname)

  if current_hostname == hostname
    write_living_probe
    # schedule next living probe
    self.class.perform_in(config.time_to_live / 2, current_hostname)
  else
    # requeue for hostname to validate its own liveness probe
    self.class.perform_async(hostname)
  end
end

as for the condition if current_hostname == hostname: if a SidekiqAlive::Worker was created by Alice, this condition will always be false, because current_hostname is backend-staging-sidekiq-chris-3103080732-hrsq9 but hostname is backend-staging-sidekiq-alice-stable-96186k9mq6, so it will instantly enqueue another job with the alice hostname. This cycle goes on indefinitely.

Additional sidekiq runner child process was spawned on fork

Description

We run 2 processes with different queues for the same app on a node. The capistrano config is as follows.

# config/deploy.rb
set :sidekiq_config, -> { File.join(shared_path, 'config', 'sidekiq.yml') }
set :sidekiq_default_processes, 2
set :sidekiq_processes, fetch(:sidekiq_default_processes)

set :sidekiq_options_per_process, [
  "--queue cable",
  "--queue bots --queue otps --queue default"
]
## config/sidekiq.yml

---
:verbose: true
:logfile: ./log/sidekiq.log
:concurrency: 8
:pidfile: tmp/pids/sidekiq.pid
:queues:
  - ['cable', 4]
  - ['bots', 3]
  - ['otps', 3]
  - default

When I checked the process count for sidekiq, I found 2 processes (say sidekiq-0 and sidekiq-1) shortly after the deployment. However, after a while (less than 1 minute), I saw another child process being spawned as well. I understand this library forks another process to run the health check server (we configure it as puma), but I expect only the puma process to be a direct descendant of sidekiq-x. What actually happens is that an additional sidekiq runner child process (pid 7959) is spawned before the puma process (see the attached image). I also ended up with 3 pid files: sidekiq-0.pid, sidekiq-1.pid and sidekiq.pid (I don't expect this last pid file to appear).

WhatsApp Image 2021-11-03 at 15 48 13

01:28 sidekiq:start
      01 RBENV_ROOT=/opt/rbenv RBENV_VERSION=2.5.9 /opt/rbenv/bin/rbenv exec bundle exec sidekiq --index 0 --pidfile /var/www/ada_staging/shared/tmp/pids/sidekiq-0.pid --environment staging --logfile /var/www/ada_staging/shared/log/sid…
    ✔ 01 [email protected] 0.903s
      02 RBENV_ROOT=/opt/rbenv RBENV_VERSION=2.5.9 /opt/rbenv/bin/rbenv exec bundle exec sidekiq --index 1 --pidfile /var/www/ada_staging/shared/tmp/pids/sidekiq-1.pid --environment staging --logfile /var/www/ada_staging/shared/log/sid…
    ✔ 02 [email protected] 0.877s

In another environment of ours, we run only 1 sidekiq process per app per node with sidekiq-alive enabled, and the process list looks fine.

my_init,1 -u /sbin/my_init
  |-runsvdir,854 -P /etc/service
  |   |-runsv,858 sidekiq
  |   |   `-ruby,861
  |   |       `-ruby,1266
  |   |-runsv,859 sshd
  |   `-runsv,860 nginx
  `-syslog-ng,13 --pidfile /var/run/syslog-ng.pid -F --no-caps
app        861  0.9  4.3 1313672 358632 ?      Sl   02:32   3:42 sidekiq 4.2.10 webapp [0 of 7 busy] leader
app       1266  0.0  3.2 979984 269032 ?       Sl   02:33   0:02 puma 3.7.1 (tcp://0.0.0.0:7433) [webapp]

The mentioned behaviour does not happen if I disable sidekiq-alive altogether. Does that mean our sidekiq process setup is not suitable for sidekiq-alive?

Environment

  1. ubuntu-16.04
  2. ruby 2.5.9
  3. rails 5.1.7
  4. sidekiq (4.2.10)
  5. sidekiq-ent (1.6.1)
  6. sidekiq-pro (3.7.1)
  7. sidekiq_alive (2.1.4)
  8. capistrano-sidekiq (1.0.3)

Clarify usage: does this work with a StatefulSet?

Apologies if this is a stupid question, but I'm trying to figure out whether this project is specifically for Sidekiq running as a Kubernetes Deployment, or whether it would also work if you're running Sidekiq as a StatefulSet.

Partly, I'm trying to figure out whether it makes more sense to run as one or the other.

Queue Priority / Ordering

We commonly have jobs (mailers) that get pushed into the queue, thousands at a time. These sometimes take a while to process. This can cause the scheduled alive job to not get processed: it gets queued, but sidekiq_alive-XXXX is below mailers. In turn the probe fails due to the job not running.

Is/can a higher priority be set for the alive queue to ensure the worker does not get restarted because it's actually getting work done :)

Running with Sidekiqswarm

Hi! This is a great gem, thank you for your work on it.

We are trying out Sidekiqswarm and it appears that when run with sidekiq_alive, there's one exception per worker process after the first one as each process attempts to bind to the port. Is there a way to only have one web server boot or to have different ports get assigned to each worker process?

Redis connection timeout issue in sidekiq_alive (occasionally) during shutdown of sidekiq

Hi All,

We are running sidekiq_alive 2.3 with sidekiq 7.0.8 (and Rails 6.1.7.4). This is running outside Kubernetes in Google Cloud Platform's AppEngine.

We have our deployment set up so when a new version is deployed, the current (soon to be old) version running sidekiq is put into quiet mode and left to finish any jobs it is currently running. This action causes sidekiq_alive to call unregister_current_instance.

Occasionally, the unregister_current_instance will fail with a SidekiqAlive: Timeout::Error. This is not every time, but it does get picked up by our Honeybadger integration and creates questions about the system health (and leaves the old host queues in place since our shutdown_callback does not get a chance to run).

This really isn't the end of the world as you can probably guess, but I think that there is a relatively easy solution.

The problem appears to be related to the concurrency setting for sidekiq and the number of connections (defined by the concurrency in sidekiq_alive) available in the capsule created for sidekiq_alive.

If the sidekiq_alive concurrency (fixed to 2 at the moment) is less than the concurrency for sidekiq (very often the case), the worker threads of sidekiq will allocate all of the sidekiq_alive connections to the various sidekiq threads.

Normally this doesn't cause any problems, the connections are used and returned to the pool quite quickly and the 1 second timeout to retrieve one from the pool is fine. However, it is possible for the unregister_current_instance to end up starved of connections and timeout every once in a while.

If you agree that this is a reasonable hypothesis, I believe that there is a relatively simple solution. That would be to allow the concurrency for sidekiq_alive to be configurable (yet still default to 2) so a system could set the sidekiq_alive concurrency to be one more than the number of the system's sidekiq concurrency. If this is done, there will always be a Redis connection available for shutdown.

Environments that feel their Redis connections are a constraint can leave the default and run as they currently do.

The pool used by sidekiq is mperham's connection_pool gem and it will only create new connections when they are first called for, so the 'extra' connection for shutdown is not created until needed.

I can provide a simple PR for this if desired.

Kevin

diff --git a/lib/sidekiq_alive.rb b/lib/sidekiq_alive.rb
index 5f104f6..7879c02 100644
--- a/lib/sidekiq_alive.rb
+++ b/lib/sidekiq_alive.rb
@@ -20,7 +20,7 @@ module SidekiqAlive
 
           if Helpers.sidekiq_7
             sq_config.capsule(CAPSULE_NAME) do |cap|
-              cap.concurrency = 2
+              cap.concurrency = config.concurrency
               cap.queues = [current_queue]
             end
           else
diff --git a/lib/sidekiq_alive/config.rb b/lib/sidekiq_alive/config.rb
index bb48896..a05e618 100644
--- a/lib/sidekiq_alive/config.rb
+++ b/lib/sidekiq_alive/config.rb
@@ -15,7 +15,8 @@ module SidekiqAlive
                   :server,
                   :custom_liveness_probe,
                   :logger,
-                  :shutdown_callback
+                  :shutdown_callback,
+                  :concurrency
 
     def initialize
       set_defaults
@@ -33,6 +34,7 @@ module SidekiqAlive
       @server = ENV.fetch("SIDEKIQ_ALIVE_SERVER", "webrick")
       @custom_liveness_probe = proc { true }
       @shutdown_callback = proc {}
+      @concurrency = Integer(ENV.fetch("SIDEKIQ_ALIVE_CONCURRENCY", 2), exception: false) || 2
     end
 
     def registration_ttl

`RedisClient::CannotConnectError: stream closed in another thread` after updates

Thanks for the work on sidekiq-alive.
After a recent update (to 2.2 and sidekiq 7.1), I see the following errors on a multi-instance installation regularly (not reproducible, 1-x times per day under no load).
There is no further information on the job itself (it's in the proper instance-specific queue, and of class SidekiqAlive::Worker).

image

How can we further debug this?

why use 'redis keys command' ?

image

I am trying to use sidekiq_alive for the Kubernetes probe, but I have a problem.

When starting, sidekiq_alive executes the above code.

The redis KEYS command is dangerous, so we block it from being executed. Is there a way to replace it?

Please help me. Thanks!

Update redis exists

Redis#exists(key) will return an Integer in redis-rb 4.3. exists? returns a boolean, you should use it instead. To opt-in to the new behavior now you can set Redis.exists_returns_integer = true. To disable this message and keep the current (boolean) behaviour of 'exists' you can set Redis.exists_returns_integer = false, but this option will be removed in 5.0. (/home/artur/.asdf/installs/ruby/2.6.5/lib/ruby/gems/2.6.0/gems/sidekiq-6.0.7/lib/sidekiq/launcher.rb:160:in `block (2 levels) in ❤')

Sinatra Thread blocks sidekiq shutdown signal

After trying to debug why on(:shutdown) was never triggered, I found that removing the Sinatra thread allows on(:shutdown) to trigger properly. (However, this breaks the intention of the gem.)

The SIGTERM signals hit the Sinatra server and never the sidekiq process, which results in on(:shutdown) never getting hit.

One option is to run the webserver and sidekiq separate when using sidekiq alive. This can be done by removing the severthread and creating a config.ru to start the server in a separate process using foreman. This is more complicated, but better if other rack servers are required in production.

Another option is to force the alive process to use webrick as a webserver. (Easy to implement)

To force the process to use webrick for the alive signal, change the top of server.rb to:

  class Server < Sinatra::Base
    set :bind, '0.0.0.0'
    set :server, :webrick
.....

Runaway checks hit redis at 2600 requests per second within 30 minutes of starting

I've just integrated this but found a single instance escalated to 2600 requests per second within 30 minutes. This is a no-load/new architecture, and this is the only job running.

I did notice an environment variable/config error on my part, but I'm not sure it would cause this - I had set the TTL to config.time_to_live = 8000.

redis_labs

I did happen to lose a kubernetes cluster node in the pool, so the sidekiq pod may have been restarted on another host. Could this be the same as @fruwe is discussing in #7?

Could this TTL 8000 cause it to spin out of control (wouldn't think so)?

sidekiq_alive causes Redis CPU to max out on big deployments

Hi!
We recently performed an upgrade from v2.1.4 to v2.1.5 that caused a system outage: on boot, Sidekiq maxed out Redis CPU.
The machine where Redis is hosted usually doesn't see more than 10% CPU usage.

The investigation into the causes is still in progress. The code that is potentially causing the issue was contributed by our company to sidekiq_alive, so we caused our own problem. But it's important to consider that other big deployments (millions of jobs) might face the same issue; the version increase should probably have been at least MINOR, not PATCH, given the kind of outage it caused.

I will be investigating how this happened, so I will be able to provide more details.

Increasing SidekiqAlive workers

screen shot 2018-08-30 at 5 14 20 pm
screen shot 2018-08-30 at 5 14 28 pm

Hi, I started using SidekiqAlive for k8s liveness probing last week and noticed the workers are piling up. Is this expected? Shouldn't the older ones be removed automatically?

Thanks in advance.

Please reopen #10

...and close this one. From my comment there:

This should be reopened (verified on 1.0.1):

Throughput of the Redis database indicated below is 1553 RPS (requests per second).

This is a brand new cluster/business with essentially no traffic other than alpha users. The spike occurs on a new deployment (old one deleted, new one added)

[Redis Labs throughput dashboard screenshot]

What information could I provide that can help?

Errno::EADDRINUSE: Address already in use - bind(2) for 0.0.0.0:7433

Hi,

We are facing an error when trying to spawn Sidekiq. It doesn't always happen, but we have many occurrences already. We have sidekiq_alive configured in our Kubernetes deployment as described in the README.
The error is:
Errno::EADDRINUSE: Address already in use - bind(2) for 0.0.0.0:7433

Has someone faced it?
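For reference, the error itself just means something is already bound to that port in the pod, e.g. a previous Sidekiq process that never fully exited before the new one started. A minimal reproduction in plain Ruby (using an ephemeral port instead of 7433 so it is safe to run anywhere):

```ruby
require "socket"

# Bind once, the way SidekiqAlive's web server binds its port...
server = TCPServer.new("0.0.0.0", 0)  # port 0 = let the OS pick a free port
port   = server.addr[1]

# ...then a second bind to the same port fails, just like a leftover
# process still holding 7433 when Sidekiq restarts inside the pod.
begin
  TCPServer.new("0.0.0.0", port)
rescue Errno::EADDRINUSE => e
  puts "second bind failed: #{e.class}"  # => second bind failed: Errno::EADDRINUSE
end

server.close
```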

Disable access logs

It is pretty noisy to see

[2019-04-24 18:42:07] INFO  WEBrick 1.4.2
[2019-04-24 18:42:07] INFO  ruby 2.5.1 (2018-03-29) [x86_64-linux]
== Sinatra (v1.4.8) has taken the stage on 7433 for staging with backup from WEBrick
[2019-04-24 18:42:07] INFO  WEBrick::HTTPServer#start: pid=1 port=7433
10.183.2.1 - - [24/Apr/2019:18:43:12 UTC] "GET / HTTP/1.1" 200 6
- -> /
10.183.2.1 - - [24/Apr/2019:18:43:14 UTC] "GET / HTTP/1.1" 200 6
- -> /
10.183.2.1 - - [24/Apr/2019:18:43:22 UTC] "GET / HTTP/1.1" 200 6
- -> /
10.183.2.1 - - [24/Apr/2019:18:43:24 UTC] "GET / HTTP/1.1" 200 6
- -> /
10.183.2.1 - - [24/Apr/2019:18:43:32 UTC] "GET / HTTP/1.1" 200 6
- -> /

every time the health-check probe hits. Is there a way to disable the Sinatra access logs (without forking this repo)?
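Those request lines come from WEBrick itself, and WEBrick can be silenced through its own options. A hedged configuration sketch of the general knobs involved (this is plain WEBrick, not SidekiqAlive's actual configuration API; the gem would have to expose these options for this to work without forking):

```ruby
require "webrick"

# AccessLog: [] drops the per-request `GET / HTTP/1.1` lines;
# the null logger drops WEBrick's INFO startup lines.
server = WEBrick::HTTPServer.new(
  Port:      7433,
  Logger:    WEBrick::Log.new(File::NULL),
  AccessLog: []
)
server.mount_proc("/") { |_req, res| res.body = "Alive!" }
# trap("INT") { server.shutdown }
# server.start
```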

Lack of cleanup of old queues (orphaned queues)

It seems that, the way the queues are maintained, they are not garbage-collected after a deployment. That leaves me with:

[Screenshot, 2023-10-23 4:46 PM, showing a long list of leftover sidekiq_alive queues]

It seems like everyone would hit this. What did I mess up? Can I set them to just use the same persistent queue?
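The queue-per-host design means every new pod hostname creates a fresh sidekiq_alive-<hostname> queue, and nothing deletes the queues of hostnames that no longer exist. A hedged cleanup sketch: the helper below is hypothetical, and the commented wiring at the bottom assumes Sidekiq's public Sidekiq::Queue / Sidekiq::ProcessSet API:

```ruby
require "set"

# Given all queue names and the hostnames of currently running Sidekiq
# processes, return the sidekiq_alive queues whose host is gone.
def orphaned_alive_queues(queue_names, live_hostnames)
  live = live_hostnames.to_set
  queue_names.select do |name|
    name.start_with?("sidekiq_alive-") &&
      !live.include?(name.delete_prefix("sidekiq_alive-"))
  end
end

# With Sidekiq's API this could be driven by something like (untested sketch):
#   require "sidekiq/api"
#   names = Sidekiq::Queue.all.map(&:name)
#   hosts = Sidekiq::ProcessSet.new.map { |p| p["hostname"] }
#   orphaned_alive_queues(names, hosts).each { |n| Sidekiq::Queue.new(n).clear }

p orphaned_alive_queues(
  ["default", "sidekiq_alive-foo", "sidekiq_alive-bar"],
  ["foo"]
)
# => ["sidekiq_alive-bar"]
```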

Question: What difference does "sidekiq quiet" as a preStop hook make?

The README recommends sending the TSTP signal to Sidekiq as a Kubernetes preStop hook. Looking at the k8s documentation, this hook is executed before the TERM signal is sent. So what difference does sending TSTP + TERM actually make compared to TERM alone? It feels like TERM should be good enough on its own, unless I'm missing something here.

PS. Thanks for the great gem! It is good to have it.
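For readers with the same question: the two signals mean different things to Sidekiq. TSTP ("quiet") makes Sidekiq stop fetching new jobs while letting in-flight jobs keep running, whereas TERM starts the hard shutdown with its -t deadline. Quieting in preStop lets the worker spend the whole grace period draining instead of still picking up work until TERM lands. A toy illustration of the two-signal pattern in plain Ruby (not Sidekiq's internals):

```ruby
# Toy process that handles "quiet" (TSTP) separately from "shut down"
# (TERM), the way Sidekiq does.
events = []
Signal.trap("TSTP") { events << "quiet" }     # Sidekiq: stop fetching new jobs
Signal.trap("TERM") { events << "shutdown" }  # Sidekiq: hard-shutdown timer starts

Process.kill("TSTP", Process.pid)  # what the preStop hook sends
sleep 0.2                          # grace period: in-flight jobs drain
Process.kill("TERM", Process.pid)  # what Kubernetes sends after preStop
sleep 0.2

puts events.inspect  # => ["quiet", "shutdown"]
```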

Question: why grep busy in graceful shutdown

Hi! Thanks for maintaining this project.

When integrating, one of my coworkers came with this question:

#!/bin/bash

# Find Pid
SIDEKIQ_PID=$(ps aux | grep sidekiq | grep busy | awk '{ print $2 }')

Why grep busy? Could this possibly fail to match sometimes? If so, that would need to be handled below.

Any chance you can clarify for us the need for busy?

thanks!

Worker goes unhealthy after 10 minutes on sidekiq 7

We are trying to upgrade our infrastructure to use sidekiq 7 and facing the following issue:
sidekiq_alive works fine when registering the worker in Redis, and the pod initially reports healthy.
After exactly 10 minutes it becomes unhealthy with the message "Can't find the alive key", and the pod gets restarted.
We have verified that during this 10-minute period Sidekiq really is alive, so it's not that our health check only starts working after 10 minutes.
Any idea where to look to solve this problem?
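One lead worth checking: 10 minutes is exactly SidekiqAlive's default time_to_live, so this pattern suggests the liveness key is written once at boot and then never refreshed, i.e. the re-enqueueing alive job is not running under Sidekiq 7 (is the instance's sidekiq_alive-<hostname> queue actually being processed?). A sketch of the relevant knob, assuming the gem's documented setup API:

```ruby
SidekiqAlive.setup do |config|
  # Default is 10 * 60 seconds. A pod that dies at exactly this interval
  # is seeing the key expire without ever being refreshed, which points
  # at the re-enqueueing worker rather than the HTTP server.
  config.time_to_live = 10 * 60
end
```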
