Comments (10)
Hm, that's unfortunately not a lot to go on.
Do you have a minimal reproduction you can share, so I can help you debug the issue?
Unfortunately no, and it happens randomly: it can work for days, but then we get this error and that's all. Even putting a new job into the queue doesn't trigger anything. The job itself is very heavy and can take from 30 seconds to several minutes to process. However, it worked flawlessly for several months and only recently started to freeze.
Do you have any idea why the worker supervisor doesn't start a new worker?
I'm not able to reproduce it. I've got a worker that spawns a long task and then times out, and I'm seeing the worker start back up again every time.
There's even a test in the suite that's intended to simulate pandemonium, and it's functioning correctly, so there must be something quite strange going on.
Your queue is named `:new`, but the Job struct you gave has it set to `:new_block`, was that intentional?
Next time you see it happen, can you send me your `Honeydew.status/1`?
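(For reference, a minimal sketch of the two calls in question, using the `:new_block` queue name from the Job struct; the block hash is taken from a log later in the thread:)

```elixir
block_hash = "00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1"

# Jobs are enqueued onto a named queue; the queue atom in the async call
# and the queue the workers consume must match.
Honeydew.async({:new_block, [block_hash]}, :new_block)

# Snapshot of the queue's state and its workers, as requested above:
Honeydew.status(:new_block)
```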
Yeah, it's `:new_block`; I changed it when I replaced the app name. It failed again, here is the status:
```elixir
%{
queue: %{
count: 63,
in_progress: 1,
mnesia: %{
"honeydew_:new_block": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580667096093232350, -576460752303423483, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block", :ordered_set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580667096093232350, -576460752303423483, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :ordered_set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 3421,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :ordered_set},
size: 62,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :ordered_set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
],
"honeydew_:new_block_in_progress": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580667096097079709, -576460752303423387, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block_in_progress", :set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580667096097079709, -576460752303423387, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 353,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :set},
size: 1,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
]
},
suspended: false
},
workers: %{
#PID<0.26129.22> => {%Honeydew.Job{
by: :nonode@nohost,
completed_at: nil,
delay_secs: 0,
enqueued_at: 1580683611589,
failure_private: 1,
from: nil,
job_monitor: #PID<0.26130.22>,
private: -576460752302077374,
queue: :new_block,
result: nil,
started_at: 1580683661872,
task: {:new_block,
["00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1"]}
}, :running}
}
}
```
Error (sorry for the mess, it's formatted for Elastic):
{"@timestamp":"2020-02-02T23:47:41.871+01:00","file":"lib/honeydew/logger.ex","function":"job_failed/2","level":"warn","line":64,"message":"Job failed due unexpected exit. %Honeydew.Job{by: :nonode@nohost, completed_at: nil, delay_secs: 0, enqueued_at: 1580683611589, failure_private: nil, from: nil, job_monitor: #PID<0.20763.22>, private: -576460752302077438, queue: :new_block, result: {:error, %Honeydew.Crash{reason: {:timeout, {Task, :await, [%Task{owner: #PID<0.20764.22>, pid: #PID<0.22225.22>, ref: #Reference<0.439751935.234356738.105265>}, 50000]}}, stacktrace: [], type: :exit}}, started_at: nil, task: {:new_block, [\"00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1\"]}}\",\n** (exit) exited in: Task.await(%Task{owner: #PID<0.20764.22>, pid: #PID<0.22225.22>, ref: #Reference<0.439751935.234356738.105265>}, 50000)\n ** (EXIT) time out\n","module":"Elixir.Honeydew.Logger"}
Looks like the worker is alive, but it couldn't handle the crash and became a zombie.
It looks to me like the worker did indeed crash and restart, as intended, and a new worker is now retrying the job. The task's owner pid and the currently running worker's pid are different, so it did restart.
Just judging from this output, it does look like your job is legitimately running.
From here, I'd suggest adding some more logging to your job, especially around where you do your `Task.await/1`. Is there any chance you're looping and awaiting multiple times, or waiting for `:infinity`?
Can you share your actual job code with me?
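(A minimal sketch of that instrumentation, with a hypothetical `do_work/1` standing in for the job body; `Task.yield/2` plus `Task.shutdown/2` logs a timeout instead of exiting the caller the way `Task.await/2` does:)

```elixir
require Logger

# do_work/1 and arg are stand-ins for the real job.
task = Task.async(fn -> do_work(arg) end)
Logger.info("awaiting #{inspect(task.pid)}")

case Task.yield(task, 50_000) || Task.shutdown(task, :brutal_kill) do
  {:ok, result} ->
    Logger.info("#{inspect(task.pid)} finished")
    result

  {:exit, reason} ->
    Logger.error("#{inspect(task.pid)} exited: #{inspect(reason)}")
    nil

  nil ->
    # unlike Task.await/2, this logs the timeout instead of exiting the caller
    Logger.warn("#{inspect(task.pid)} timed out after 50s")
    nil
end
```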
Happened again. You are right, it restarts on the next block in the queue after the one that failed, and then stops on it.
```elixir
%{
queue: %{
count: 30,
in_progress: 1,
mnesia: %{
"honeydew_:new_block": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580726431401602349, -576460752303423390, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block", :ordered_set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580726431401602349, -576460752303423390, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :ordered_set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 1672,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :ordered_set},
size: 29,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :ordered_set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
],
"honeydew_:new_block_in_progress": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580726431405308957, -576460752303423294, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block_in_progress", :set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580726431405308957, -576460752303423294, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 353,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :set},
size: 1,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
]
},
suspended: false
},
workers: %{
#PID<0.23441.22> => {%Honeydew.Job{
by: :nonode@nohost,
completed_at: nil,
delay_secs: 0,
enqueued_at: 1580753025163,
failure_private: nil,
from: nil,
job_monitor: #PID<0.23442.22>,
private: -576460752297858591,
queue: :new_block,
result: nil,
started_at: 1580753053471,
task: {:new_block,
["00000000000000000008d820bfbb7b50c02034bedd6756bf7079cabb08e5658c"]}
}, :running}
}
}
```
As you can see, it has 30 elements in the queue and is stuck on that one. Here is the `:sys.get_state(pid)` of `#PID<0.23441.22>`:
```elixir
%Honeydew.Worker.State{
has_init_fcn: false,
init_args: [],
init_retry_secs: 5,
job: %Honeydew.Job{
by: :nonode@nohost,
completed_at: nil,
delay_secs: 0,
enqueued_at: 1580753025163,
failure_private: nil,
from: nil,
job_monitor: #PID<0.23442.22>,
private: -576460752297858591,
queue: :new_block,
result: nil,
started_at: nil,
task: {:new_block,
["00000000000000000008d820bfbb7b50c02034bedd6756bf7079cabb08e5658c"]}
},
job_runner: #PID<0.23443.22>,
module: App.BlockWorker,
private: :no_state,
queue: :new_block,
queue_pid: #PID<0.280.0>,
ready: true,
start_opts: [
#PID<0.307.0>,
:new_block,
%{
init_retry_secs: 5,
ma: {App.BlockWorker, []},
nodes: [],
num: 1,
shutdown: 10000
},
#PID<0.280.0>
]
}
```
I've tried to get the state of the job_runner `#PID<0.23443.22>`, but it failed:
```
** (exit) exited in: :sys.get_state(#PID<0.23443.22>)
    ** (EXIT) time out
    (stdlib) sys.erl:335: :sys.send_system_msg/2
    (stdlib) sys.erl:136: :sys.get_state/1
```
`Process.info` shows that it's actually a Redis process:
```elixir
[
current_function: {Redix.Connection, :pipeline, 3},
initial_call: {:proc_lib, :init_p, 5},
status: :waiting,
message_queue_len: 4,
links: [#PID<0.23441.22>],
dictionary: [
"$ancestors": [#PID<0.23441.22>, #PID<0.307.0>, #PID<0.306.0>,
:"Elixir.Honeydew.WorkerGroupSupervisor.new_block", #PID<0.303.0>,
Honeydew.Workers, Honeydew.Supervisor, #PID<0.238.0>],
job_monitor: #PID<0.23442.22>,
"$initial_call": {Honeydew.JobRunner, :init, 1}
],
trap_exit: false,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.237.0>,
total_heap_size: 2585,
heap_size: 987,
stack_size: 42,
reductions: 2445,
garbage_collection: [
max_heap_size: %{error_logger: true, kill: true, size: 0},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 5
],
suspending: []
]
```
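(The same reading can be re-taken at any time, even on a stuck process, since `Process.info/2` doesn't need the target to cooperate the way `:sys.get_state/1` does:)

```elixir
# Ask only for the items of interest; works on a busy/blocked process.
Process.info(pid, [:current_function, :status, :message_queue_len])
#=> [
#=>   current_function: {Redix.Connection, :pipeline, 3},
#=>   status: :waiting,
#=>   message_queue_len: 4
#=> ]
```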
Unfortunately I can't provide the code. The task is really huge; we even call Rust code through a NIF to process some data in the middle.
The only place where we use Tasks is to fetch transactions through JSON-RPC:
```elixir
def fetch_transactions(raw_transactions) do
  raw_transactions
  |> Enum.map(fn trx_id ->
    Task.async(fn ->
      {:ok, data} = Api.gettransaction(trx_id)
      {trx_id, data}
    end)
  end)
  |> Enum.map(&Task.await(&1, 50_000))
  |> List.flatten()
end
```
Which actually calls Redis:
```elixir
def gettransaction(txid) do
  case Redix.command!(:redix, ["GET", txid]) do
    nil ->
      case gettransaction_raw(txid) do
        {:ok, transaction} ->
          {:ok, trx_to_map(txid, transaction)}

        error ->
          error
      end

    transaction ->
      {:ok, trx_to_map(txid, Jason.decode!(transaction))}
  end
end
```
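(As an aside, a sketch of the same fetch via `Task.async_stream/3`: it bounds concurrency and, with `on_timeout: :kill_task`, a slow fetch becomes an `{:exit, :timeout}` element instead of exiting the worker; the concurrency cap of 10 is an arbitrary assumption:)

```elixir
def fetch_transactions(raw_transactions) do
  raw_transactions
  |> Task.async_stream(
    fn trx_id ->
      {:ok, data} = Api.gettransaction(trx_id)
      {trx_id, data}
    end,
    max_concurrency: 10,
    timeout: 50_000,
    # kill the slow task and emit {:exit, :timeout} rather than exiting the caller
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, result} -> result
    {:exit, reason} -> {:error, reason}
  end)
end
```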
Yeah, unfortunately, even though the job runner process is a GenServer, it's pretty unlikely that you'll get it to respond to `:sys.get_state/1`, since it's performing your job and the job doesn't yield control back to the GenServer loop.
So process `#PID<0.23443.22>`, as you mentioned, is the job runner that's stuck. It's not actually a process that belongs to Redix's supervision tree, but it is running Redix's code (as you mentioned, too). It sounds like Redix's `command!/2` function calls `pipeline/3`, which is probably blocking and eventually causing the `Task.await/1` timeouts.
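(A toy illustration of why `:sys.get_state/1` can't get an answer here: a GenServer whose callback never returns can't process system messages, so the call times out.)

```elixir
defmodule Stuck do
  use GenServer

  def init(_), do: {:ok, nil}

  def handle_cast(:block, state) do
    Process.sleep(:infinity)  # callback never returns to the GenServer loop
    {:noreply, state}
  end
end

{:ok, pid} = GenServer.start_link(Stuck, nil)
GenServer.cast(pid, :block)
:sys.get_state(pid, 1_000)
#=> ** (exit) exited in: :sys.get_state(#PID<...>)
#=>     ** (EXIT) time out
```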
I think Redis is going to be the best place for you to investigate. Just off the cuff, I'd check out the following:
- Check the Redis logs for anything obvious (slow commands, warnings, etc).
- Does the local `:redix` connection have a command pipeline cap, and is it blocking because it's waiting in line? (Related: if you can store a handle to a unique Redis connection in Honeydew's job state instead of relying on a connection pool, there'll never be a mismatch between available connections and the number of workers; a sketch of this idea follows the list.)
- Does the remote Redis server have a cap on the number of things it'll process simultaneously? It could be that your local client is blocking while awaiting the scheduling of its request.
- Do you have enough RAM on your Redis server, and is Redis allowed to use that RAM? It could be that Redis is paging the data set out to disk because it's low on RAM, and has to page it back in to satisfy your command (and hence blocks).
- Next time this happens, maybe try opening a remote iex shell (as you've been doing) and running `Redix.command!(:redix, ["PING"])`? I don't know how much that'll tell you, but if that doesn't work at all, perhaps the Redis connection has broken.
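(A hedged sketch of that per-worker connection idea, assuming Honeydew's stateful-worker convention where the value returned from `init/1` is passed to each job as its last argument; the module name matches the `App.BlockWorker` above, but the connection options and job body are stand-ins:)

```elixir
defmodule App.BlockWorker do
  # Each worker process opens its own Redix connection, so a worker never
  # waits in line behind other workers for a shared/pooled connection.
  def init([]) do
    Redix.start_link(host: "localhost", port: 6379)  # returns {:ok, conn}
  end

  def new_block(block_hash, conn) do
    # use this worker's private connection instead of the global :redix name
    case Redix.command!(conn, ["GET", block_hash]) do
      nil -> :not_cached
      cached -> {:ok, Jason.decode!(cached)}
    end
    # ... rest of the job ...
  end
end
```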
hey @nateless, i'd love to hear how your debugging is going if you're able. :)
Hey @koudelka, after we removed Redis it works flawlessly. Unfortunately, I didn't have time to investigate why it froze with Redis.
all good dude, thanks for the update, i'm glad it's working.
drop me a line if you have any other issues. :)