Giter Site home page Giter Site logo

Comments (10)

koudelka avatar koudelka commented on July 17, 2024

Hm, that's unfortunately not a lot to go on.

Do you have a minimal reproduction you can share, so I can help you debug the issue?

from honeydew.

nateless avatar nateless commented on July 17, 2024

Unfortunately no, also it happens randomly - can work for days, but then we get this error and that's all. Even putting a new job into the queue doesn't trigger it. The job itself is very heavy and intense and can take from 30s to minutes to process. However it worked for several months flawlessly and just recently started to freeze.

Do you have any idea why the worker supervisor doesn't start a new worker?

from honeydew.

koudelka avatar koudelka commented on July 17, 2024

I'm not able to reproduce it, I've got a worker that spawns a long task and then times out, I'm seeing the worker start back up again every time.

There's even a test in the suite that's intended to simulate pandemonium, it's functioning correctly, so there must be something quite strange going on.

Your queue is named :new, but the Job struct you gave has it set to :new_block, was that intentional?

Next time you see it happen, can you send me your Honeydew.status/1?

from honeydew.

nateless avatar nateless commented on July 17, 2024

Yeah, its :new_block, changed it when replaced App name. It failed again, here is the status:

%{
  queue: %{
    count: 63,
    in_progress: 1,
    mnesia: %{
      "honeydew_:new_block": [
        access_mode: :read_write,
        active_replicas: [:nonode@nohost],
        all_nodes: [:nonode@nohost],
        arity: 3,
        attributes: [:key, :job],
        checkpoints: [],
        commit_work: [],
        cookie: {{1580667096093232350, -576460752303423483, 1}, :nonode@nohost},
        cstruct: {:cstruct, :"honeydew_:new_block", :ordered_set,
         [:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
         :wrapped_job, [:key, :job], [], [], [],
         {{1580667096093232350, -576460752303423483, 1}, :nonode@nohost},
         {{2, 0}, []}},
        disc_copies: [],
        disc_only_copies: [],
        external_copies: [],
        frag_properties: [],
        index: [],
        index_info: {:index, :ordered_set, []},
        load_by_force: false,
        load_node: :nonode@nohost,
        load_order: 0,
        load_reason: {:dumper, :create_table},
        local_content: false,
        majority: false,
        master_nodes: [],
        memory: 3421,
        ram_copies: [:nonode@nohost],
        record_name: :wrapped_job,
        record_validation: {:wrapped_job, 3, :ordered_set},
        size: 62,
        snmp: [],
        storage_properties: [],
        storage_type: :ram_copies,
        subscribers: [],
        type: :ordered_set,
        user_properties: [],
        version: {{2, 0}, []},
        where_to_commit: [nonode@nohost: :ram_copies],
        where_to_read: :nonode@nohost,
        where_to_wlock: {[:nonode@nohost], false},
        where_to_write: [:nonode@nohost],
        wild_pattern: {:wrapped_job, :_, :_}
      ],
      "honeydew_:new_block_in_progress": [
        access_mode: :read_write,
        active_replicas: [:nonode@nohost],
        all_nodes: [:nonode@nohost],
        arity: 3,
        attributes: [:key, :job],
        checkpoints: [],
        commit_work: [],
        cookie: {{1580667096097079709, -576460752303423387, 1}, :nonode@nohost},
        cstruct: {:cstruct, :"honeydew_:new_block_in_progress", :set,
         [:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
         :wrapped_job, [:key, :job], [], [], [],
         {{1580667096097079709, -576460752303423387, 1}, :nonode@nohost},
         {{2, 0}, []}},
        disc_copies: [],
        disc_only_copies: [],
        external_copies: [],
        frag_properties: [],
        index: [],
        index_info: {:index, :set, []},
        load_by_force: false,
        load_node: :nonode@nohost,
        load_order: 0,
        load_reason: {:dumper, :create_table},
        local_content: false,
        majority: false,
        master_nodes: [],
        memory: 353,
        ram_copies: [:nonode@nohost],
        record_name: :wrapped_job,
        record_validation: {:wrapped_job, 3, :set},
        size: 1,
        snmp: [],
        storage_properties: [],
        storage_type: :ram_copies,
        subscribers: [],
        type: :set,
        user_properties: [],
        version: {{2, 0}, []},
        where_to_commit: [nonode@nohost: :ram_copies],
        where_to_read: :nonode@nohost,
        where_to_wlock: {[:nonode@nohost], false},
        where_to_write: [:nonode@nohost],
        wild_pattern: {:wrapped_job, :_, :_}
      ]
    },
    suspended: false
  },
  workers: %{
    #PID<0.26129.22> => {%Honeydew.Job{
       by: :nonode@nohost,
       completed_at: nil,
       delay_secs: 0,
       enqueued_at: 1580683611589,
       failure_private: 1,
       from: nil,
       job_monitor: #PID<0.26130.22>,
       private: -576460752302077374,
       queue: :new_block,
       result: nil,
       started_at: 1580683661872,
       task: {:new_block,
        ["00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1"]}
     }, :running}
  }
}

Error (sorry for the mess, its for elastic):

{"@timestamp":"2020-02-02T23:47:41.871+01:00","file":"lib/honeydew/logger.ex","function":"job_failed/2","level":"warn","line":64,"message":"Job failed due unexpected exit. %Honeydew.Job{by: :nonode@nohost, completed_at: nil, delay_secs: 0, enqueued_at: 1580683611589, failure_private: nil, from: nil, job_monitor: #PID<0.20763.22>, private: -576460752302077438, queue: :new_block, result: {:error, %Honeydew.Crash{reason: {:timeout, {Task, :await, [%Task{owner: #PID<0.20764.22>, pid: #PID<0.22225.22>, ref: #Reference<0.439751935.234356738.105265>}, 50000]}}, stacktrace: [], type: :exit}}, started_at: nil, task: {:new_block, [\"00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1\"]}}\",\n** (exit) exited in: Task.await(%Task{owner: #PID<0.20764.22>, pid: #PID<0.22225.22>, ref: #Reference<0.439751935.234356738.105265>}, 50000)\n    ** (EXIT) time out\n","module":"Elixir.Honeydew.Logger"}

Looks like worker is alive, but it couldn't handle the crash and became a zombie.

from honeydew.

koudelka avatar koudelka commented on July 17, 2024

It looks to me that the worker did indeed crash and restart, as intended, and a new worker is now retrying the job. The task's owner pid and the currently running worker's pid are different, so it did restart.

Just judging from this output, it does look like your job is legitimately running.

From here, I'd suggest adding some more logging to your job, especially around where you do your Task.await/1. Is there any chance you're looping and awaiting multiple times, or waiting for :infinity?

Can you share your actual job code with me?

from honeydew.

nateless avatar nateless commented on July 17, 2024

Happened again, you are right it restarts on the next block in queue after the one that failed and stops on it.

%{
  queue: %{
    count: 30,
    in_progress: 1,
    mnesia: %{
      "honeydew_:new_block": [
        access_mode: :read_write,
        active_replicas: [:nonode@nohost],
        all_nodes: [:nonode@nohost],
        arity: 3,
        attributes: [:key, :job],
        checkpoints: [],
        commit_work: [],
        cookie: {{1580726431401602349, -576460752303423390, 1}, :nonode@nohost},
        cstruct: {:cstruct, :"honeydew_:new_block", :ordered_set,
         [:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
         :wrapped_job, [:key, :job], [], [], [],
         {{1580726431401602349, -576460752303423390, 1}, :nonode@nohost},
         {{2, 0}, []}},
        disc_copies: [],
        disc_only_copies: [],
        external_copies: [],
        frag_properties: [],
        index: [],
        index_info: {:index, :ordered_set, []},
        load_by_force: false,
        load_node: :nonode@nohost,
        load_order: 0,
        load_reason: {:dumper, :create_table},
        local_content: false,
        majority: false,
        master_nodes: [],
        memory: 1672,
        ram_copies: [:nonode@nohost],
        record_name: :wrapped_job,
        record_validation: {:wrapped_job, 3, :ordered_set},
        size: 29,
        snmp: [],
        storage_properties: [],
        storage_type: :ram_copies,
        subscribers: [],
        type: :ordered_set,
        user_properties: [],
        version: {{2, 0}, []},
        where_to_commit: [nonode@nohost: :ram_copies],
        where_to_read: :nonode@nohost,
        where_to_wlock: {[:nonode@nohost], false},
        where_to_write: [:nonode@nohost],
        wild_pattern: {:wrapped_job, :_, :_}
      ],
      "honeydew_:new_block_in_progress": [
        access_mode: :read_write,
        active_replicas: [:nonode@nohost],
        all_nodes: [:nonode@nohost],
        arity: 3,
        attributes: [:key, :job],
        checkpoints: [],
        commit_work: [],
        cookie: {{1580726431405308957, -576460752303423294, 1}, :nonode@nohost},
        cstruct: {:cstruct, :"honeydew_:new_block_in_progress", :set,
         [:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
         :wrapped_job, [:key, :job], [], [], [],
         {{1580726431405308957, -576460752303423294, 1}, :nonode@nohost},
         {{2, 0}, []}},
        disc_copies: [],
        disc_only_copies: [],
        external_copies: [],
        frag_properties: [],
        index: [],
        index_info: {:index, :set, []},
        load_by_force: false,
        load_node: :nonode@nohost,
        load_order: 0,
        load_reason: {:dumper, :create_table},
        local_content: false,
        majority: false,
        master_nodes: [],
        memory: 353,
        ram_copies: [:nonode@nohost],
        record_name: :wrapped_job,
        record_validation: {:wrapped_job, 3, :set},
        size: 1,
        snmp: [],
        storage_properties: [],
        storage_type: :ram_copies,
        subscribers: [],
        type: :set,
        user_properties: [],
        version: {{2, 0}, []},
        where_to_commit: [nonode@nohost: :ram_copies],
        where_to_read: :nonode@nohost,
        where_to_wlock: {[:nonode@nohost], false},
        where_to_write: [:nonode@nohost],
        wild_pattern: {:wrapped_job, :_, :_}
      ]
    },
    suspended: false
  },
  workers: %{
    #PID<0.23441.22> => {%Honeydew.Job{
       by: :nonode@nohost,
       completed_at: nil,
       delay_secs: 0,
       enqueued_at: 1580753025163,
       failure_private: nil,
       from: nil,
       job_monitor: #PID<0.23442.22>,
       private: -576460752297858591,
       queue: :new_block,
       result: nil,
       started_at: 1580753053471,
       task: {:new_block,
        ["00000000000000000008d820bfbb7b50c02034bedd6756bf7079cabb08e5658c"]}
     }, :running}
  }
}

As you can see it has 30 elements in queue, being stuck on that one. Here is the :sys.get_state(pid) of #PID<0.23441.22>

%Honeydew.Worker.State{
  has_init_fcn: false,
  init_args: [],
  init_retry_secs: 5,
  job: %Honeydew.Job{
    by: :nonode@nohost,
    completed_at: nil,
    delay_secs: 0,
    enqueued_at: 1580753025163,
    failure_private: nil,
    from: nil,
    job_monitor: #PID<0.23442.22>,
    private: -576460752297858591,
    queue: :new_block,
    result: nil,
    started_at: nil,
    task: {:new_block,
     ["00000000000000000008d820bfbb7b50c02034bedd6756bf7079cabb08e5658c"]}
  },
  job_runner: #PID<0.23443.22>,
  module: App.BlockWorker,
  private: :no_state,
  queue: :new_block,
  queue_pid: #PID<0.280.0>,
  ready: true,
  start_opts: [
    #PID<0.307.0>,
    :new_block,
    %{
      init_retry_secs: 5,
      ma: {App.BlockWorker, []},
      nodes: [],
      num: 1,
      shutdown: 10000
    },
    #PID<0.280.0>
  ]
}

I've tried to get state of the job_runner: #PID<0.23443.22> but failed

** (exit) exited in: :sys.get_state(#PID<0.23443.22>)
    ** (EXIT) time out
    (stdlib) sys.erl:335: :sys.send_system_msg/2
    (stdlib) sys.erl:136: :sys.get_state/1

Process.info gives info that its actually a Redis process:

[
  current_function: {Redix.Connection, :pipeline, 3},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 4,
  links: [#PID<0.23441.22>],
  dictionary: [
    "$ancestors": [#PID<0.23441.22>, #PID<0.307.0>, #PID<0.306.0>,
     :"Elixir.Honeydew.WorkerGroupSupervisor.new_block", #PID<0.303.0>,
     Honeydew.Workers, Honeydew.Supervisor, #PID<0.238.0>],
    job_monitor: #PID<0.23442.22>,
    "$initial_call": {Honeydew.JobRunner, :init, 1}
  ],
  trap_exit: false,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.237.0>,
  total_heap_size: 2585,
  heap_size: 987,
  stack_size: 42,
  reductions: 2445,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 5
  ],
  suspending: []
]

Unfortunately I can't provide the code. The task is really huge, we even call rust code with NIF to process some data in the middle.

The only place where we use Tasks is to fetch transactions through json-rpc:

  def fetch_transactions(raw_transactions) do
    raw_transactions
    |> Enum.map(fn trx_id ->
      Task.async(fn ->
        {:ok, data} = Api.gettransaction(trx_id)
        {trx_id, data}
      end)
    end)
    |> Enum.map(&Task.await(&1, 50_000))
    |> List.flatten()
  end

Which actually calls Redis:

  def gettransaction(txid) do
    case Redix.command!(:redix, ["GET", txid]) do
      nil ->
        case gettransaction_raw(txid) do
          {:ok, transaction} ->
            {:ok, trx_to_map(txid, transaction)}

          error ->
            error
        end

      transaction ->
        {:ok, trx_to_map(txid, Jason.decode!(transaction))}
    end
  end

from honeydew.

koudelka avatar koudelka commented on July 17, 2024

Yeah, unfortunately, even though the job runner process is a GenServer, it's pretty unlikely that you'll get it to respond to :sys.get_state/1 since it's performing your job and the job doesn't yield control back to the GenServer loop.

So process #PID<0.23443.22>, as you mentioned, is the job runner that's stuck, it's not actually a process that belongs to redis' supervision tree, but it is running redix's code (as you mentioned, too). It sounds like redis' command!/2 function calls pipeline/3, which is probably blocking and eventually causing the Task.await/1 timeouts.

I think Redis is going to be the best place for you to investigate. Just off the cuff, I'd check out the following:

  • Check the redis logs for anything obvious (slow commands, warnings, etc)
  • Does the local :redix connection have a command pipeline cap, and is it blocking because it's waiting in line? (related: if you can store a handle to a unique redis connection in honeydew's job state instead of relying on a connection pool, there'll never be a mismatch between available connections and number of workers)
  • Does the remote redis server have a cap on the number of things it'll process simultaneously? It could be that your local client is blocking awaiting the scheduling of its request.
  • Do you have enough ram on your redis server, and is redis allowed to use that ram? It could be that redis is paging the data set out to disk because it's low on ram, and has to page it back in to satisfy your command (and hence blocks).
  • Next time this happens, maybe try opening a remote iex shell (as you've been doing) and running Redix.command!(:redix, ["PING"])? I don't know how much that'll tell you, but if that doesn't work at all, perhaps the redis connection has broken.

from honeydew.

koudelka avatar koudelka commented on July 17, 2024

hey @nateless, i'd love to hear how your debugging is going if you're able. :)

from honeydew.

nateless avatar nateless commented on July 17, 2024

Hey @koudelka, after we removed Redis it works flawlessly, unfortunately I didn't have time to investigate why it freezed with Redis.

from honeydew.

koudelka avatar koudelka commented on July 17, 2024

all good dude, thanks for the update, i'm glad it's working.

drop me a line if you have any other issues. :)

from honeydew.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.