Comments (10)
Hm, that's unfortunately not a lot to go on.
Do you have a minimal reproduction you can share, so I can help you debug the issue?
Unfortunately no, and it happens randomly: it can work for days, but then we get this error and that's all. Even putting a new job into the queue doesn't trigger anything. The job itself is very heavy and can take from 30 seconds to several minutes to process. However, it worked flawlessly for several months and only recently started to freeze.
Do you have any idea why the worker supervisor doesn't start a new worker?
I'm not able to reproduce it. I've got a worker that spawns a long task and then times out, and I'm seeing the worker start back up again every time.
There's even a test in the suite that's intended to simulate pandemonium, and it's functioning correctly, so there must be something quite strange going on.
Your queue is named `:new`, but the Job struct you gave has it set to `:new_block`, was that intentional?
Next time you see it happen, can you send me your `Honeydew.status/1`?
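(For reference, a minimal sketch of the two calls in question, using the `:new_block` queue name from the Job struct; the block hash is taken from a log later in the thread:)

```elixir
block_hash = "00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1"

# Jobs are enqueued onto a named queue; the queue atom in the async call
# and the queue the workers consume must match.
Honeydew.async({:new_block, [block_hash]}, :new_block)

# Snapshot of the queue's state and its workers, as requested above:
Honeydew.status(:new_block)
```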
Yeah, it's `:new_block`; I changed it when I replaced the app name. It failed again, here is the status:
```elixir
%{
queue: %{
count: 63,
in_progress: 1,
mnesia: %{
"honeydew_:new_block": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580667096093232350, -576460752303423483, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block", :ordered_set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580667096093232350, -576460752303423483, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :ordered_set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 3421,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :ordered_set},
size: 62,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :ordered_set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
],
"honeydew_:new_block_in_progress": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580667096097079709, -576460752303423387, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block_in_progress", :set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580667096097079709, -576460752303423387, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 353,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :set},
size: 1,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
]
},
suspended: false
},
workers: %{
#PID<0.26129.22> => {%Honeydew.Job{
by: :nonode@nohost,
completed_at: nil,
delay_secs: 0,
enqueued_at: 1580683611589,
failure_private: 1,
from: nil,
job_monitor: #PID<0.26130.22>,
private: -576460752302077374,
queue: :new_block,
result: nil,
started_at: 1580683661872,
task: {:new_block,
["00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1"]}
}, :running}
}
}
```
Error (sorry for the mess, it's formatted for Elastic):
{"@timestamp":"2020-02-02T23:47:41.871+01:00","file":"lib/honeydew/logger.ex","function":"job_failed/2","level":"warn","line":64,"message":"Job failed due unexpected exit. %Honeydew.Job{by: :nonode@nohost, completed_at: nil, delay_secs: 0, enqueued_at: 1580683611589, failure_private: nil, from: nil, job_monitor: #PID<0.20763.22>, private: -576460752302077438, queue: :new_block, result: {:error, %Honeydew.Crash{reason: {:timeout, {Task, :await, [%Task{owner: #PID<0.20764.22>, pid: #PID<0.22225.22>, ref: #Reference<0.439751935.234356738.105265>}, 50000]}}, stacktrace: [], type: :exit}}, started_at: nil, task: {:new_block, [\"00000000000000000010147f3f358a7b0c3a187d1f0270a741b49e3124892aa1\"]}}\",\n** (exit) exited in: Task.await(%Task{owner: #PID<0.20764.22>, pid: #PID<0.22225.22>, ref: #Reference<0.439751935.234356738.105265>}, 50000)\n ** (EXIT) time out\n","module":"Elixir.Honeydew.Logger"}
Looks like the worker is alive, but it couldn't handle the crash and became a zombie.
It looks to me like the worker did indeed crash and restart, as intended, and a new worker is now retrying the job. The task's owner pid and the currently running worker's pid are different, so it did restart.
Just judging from this output, it does look like your job is legitimately running.
From here, I'd suggest adding some more logging to your job, especially around where you do your `Task.await/1`. Is there any chance you're looping and awaiting multiple times, or waiting for `:infinity`?
Can you share your actual job code with me?
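(A minimal sketch of that instrumentation, with a hypothetical `do_work/1` standing in for the job body; `Task.yield/2` plus `Task.shutdown/2` logs a timeout instead of exiting the caller the way `Task.await/2` does:)

```elixir
require Logger

# do_work/1 and arg are stand-ins for the real job.
task = Task.async(fn -> do_work(arg) end)
Logger.info("awaiting #{inspect(task.pid)}")

case Task.yield(task, 50_000) || Task.shutdown(task, :brutal_kill) do
  {:ok, result} ->
    Logger.info("#{inspect(task.pid)} finished")
    result

  {:exit, reason} ->
    Logger.error("#{inspect(task.pid)} exited: #{inspect(reason)}")
    nil

  nil ->
    # unlike Task.await/2, this logs the timeout instead of exiting the caller
    Logger.warn("#{inspect(task.pid)} timed out after 50s")
    nil
end
```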
Happened again. You are right, it restarts on the next block in the queue after the one that failed, and then stops on it.
```elixir
%{
queue: %{
count: 30,
in_progress: 1,
mnesia: %{
"honeydew_:new_block": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580726431401602349, -576460752303423390, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block", :ordered_set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580726431401602349, -576460752303423390, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :ordered_set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 1672,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :ordered_set},
size: 29,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :ordered_set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
],
"honeydew_:new_block_in_progress": [
access_mode: :read_write,
active_replicas: [:nonode@nohost],
all_nodes: [:nonode@nohost],
arity: 3,
attributes: [:key, :job],
checkpoints: [],
commit_work: [],
cookie: {{1580726431405308957, -576460752303423294, 1}, :nonode@nohost},
cstruct: {:cstruct, :"honeydew_:new_block_in_progress", :set,
[:nonode@nohost], [], [], [], 0, :read_write, false, [], [], false,
:wrapped_job, [:key, :job], [], [], [],
{{1580726431405308957, -576460752303423294, 1}, :nonode@nohost},
{{2, 0}, []}},
disc_copies: [],
disc_only_copies: [],
external_copies: [],
frag_properties: [],
index: [],
index_info: {:index, :set, []},
load_by_force: false,
load_node: :nonode@nohost,
load_order: 0,
load_reason: {:dumper, :create_table},
local_content: false,
majority: false,
master_nodes: [],
memory: 353,
ram_copies: [:nonode@nohost],
record_name: :wrapped_job,
record_validation: {:wrapped_job, 3, :set},
size: 1,
snmp: [],
storage_properties: [],
storage_type: :ram_copies,
subscribers: [],
type: :set,
user_properties: [],
version: {{2, 0}, []},
where_to_commit: [nonode@nohost: :ram_copies],
where_to_read: :nonode@nohost,
where_to_wlock: {[:nonode@nohost], false},
where_to_write: [:nonode@nohost],
wild_pattern: {:wrapped_job, :_, :_}
]
},
suspended: false
},
workers: %{
#PID<0.23441.22> => {%Honeydew.Job{
by: :nonode@nohost,
completed_at: nil,
delay_secs: 0,
enqueued_at: 1580753025163,
failure_private: nil,
from: nil,
job_monitor: #PID<0.23442.22>,
private: -576460752297858591,
queue: :new_block,
result: nil,
started_at: 1580753053471,
task: {:new_block,
["00000000000000000008d820bfbb7b50c02034bedd6756bf7079cabb08e5658c"]}
}, :running}
}
}
```
As you can see, it has 30 elements in the queue and is stuck on that one. Here is the `:sys.get_state(pid)` of `#PID<0.23441.22>`:
```elixir
%Honeydew.Worker.State{
has_init_fcn: false,
init_args: [],
init_retry_secs: 5,
job: %Honeydew.Job{
by: :nonode@nohost,
completed_at: nil,
delay_secs: 0,
enqueued_at: 1580753025163,
failure_private: nil,
from: nil,
job_monitor: #PID<0.23442.22>,
private: -576460752297858591,
queue: :new_block,
result: nil,
started_at: nil,
task: {:new_block,
["00000000000000000008d820bfbb7b50c02034bedd6756bf7079cabb08e5658c"]}
},
job_runner: #PID<0.23443.22>,
module: App.BlockWorker,
private: :no_state,
queue: :new_block,
queue_pid: #PID<0.280.0>,
ready: true,
start_opts: [
#PID<0.307.0>,
:new_block,
%{
init_retry_secs: 5,
ma: {App.BlockWorker, []},
nodes: [],
num: 1,
shutdown: 10000
},
#PID<0.280.0>
]
}
```
I've tried to get the state of the job_runner `#PID<0.23443.22>`, but it failed:
```
** (exit) exited in: :sys.get_state(#PID<0.23443.22>)
    ** (EXIT) time out
    (stdlib) sys.erl:335: :sys.send_system_msg/2
    (stdlib) sys.erl:136: :sys.get_state/1
```
`Process.info` shows that it's actually a Redis process:
```elixir
[
current_function: {Redix.Connection, :pipeline, 3},
initial_call: {:proc_lib, :init_p, 5},
status: :waiting,
message_queue_len: 4,
links: [#PID<0.23441.22>],
dictionary: [
"$ancestors": [#PID<0.23441.22>, #PID<0.307.0>, #PID<0.306.0>,
:"Elixir.Honeydew.WorkerGroupSupervisor.new_block", #PID<0.303.0>,
Honeydew.Workers, Honeydew.Supervisor, #PID<0.238.0>],
job_monitor: #PID<0.23442.22>,
"$initial_call": {Honeydew.JobRunner, :init, 1}
],
trap_exit: false,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.237.0>,
total_heap_size: 2585,
heap_size: 987,
stack_size: 42,
reductions: 2445,
garbage_collection: [
max_heap_size: %{error_logger: true, kill: true, size: 0},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 5
],
suspending: []
]
```
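(The same reading can be re-taken at any time, even on a stuck process, since `Process.info/2` doesn't need the target to cooperate the way `:sys.get_state/1` does:)

```elixir
# Ask only for the items of interest; works on a busy/blocked process.
Process.info(pid, [:current_function, :status, :message_queue_len])
#=> [
#=>   current_function: {Redix.Connection, :pipeline, 3},
#=>   status: :waiting,
#=>   message_queue_len: 4
#=> ]
```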
Unfortunately I can't provide the code. The task is really huge; we even call Rust code through a NIF to process some data in the middle.
The only place where we use Tasks is to fetch transactions through JSON-RPC:
```elixir
def fetch_transactions(raw_transactions) do
  raw_transactions
  |> Enum.map(fn trx_id ->
    Task.async(fn ->
      {:ok, data} = Api.gettransaction(trx_id)
      {trx_id, data}
    end)
  end)
  |> Enum.map(&Task.await(&1, 50_000))
  |> List.flatten()
end
```
Which actually calls Redis:
```elixir
def gettransaction(txid) do
  case Redix.command!(:redix, ["GET", txid]) do
    nil ->
      case gettransaction_raw(txid) do
        {:ok, transaction} ->
          {:ok, trx_to_map(txid, transaction)}

        error ->
          error
      end

    transaction ->
      {:ok, trx_to_map(txid, Jason.decode!(transaction))}
  end
end
```
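(As an aside, a sketch of the same fetch via `Task.async_stream/3`: it bounds concurrency and, with `on_timeout: :kill_task`, a slow fetch becomes an `{:exit, :timeout}` element instead of exiting the worker; the concurrency cap of 10 is an arbitrary assumption:)

```elixir
def fetch_transactions(raw_transactions) do
  raw_transactions
  |> Task.async_stream(
    fn trx_id ->
      {:ok, data} = Api.gettransaction(trx_id)
      {trx_id, data}
    end,
    max_concurrency: 10,
    timeout: 50_000,
    # kill the slow task and emit {:exit, :timeout} rather than exiting the caller
    on_timeout: :kill_task
  )
  |> Enum.map(fn
    {:ok, result} -> result
    {:exit, reason} -> {:error, reason}
  end)
end
```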
Yeah, unfortunately, even though the job runner process is a GenServer, it's pretty unlikely that you'll get it to respond to `:sys.get_state/1`, since it's performing your job and the job doesn't yield control back to the GenServer loop.
So process `#PID<0.23443.22>`, as you mentioned, is the job runner that's stuck. It's not actually a process that belongs to Redix's supervision tree, but it is running Redix's code (as you mentioned, too). It sounds like Redix's `command!/2` function calls `pipeline/3`, which is probably blocking and eventually causing the `Task.await/1` timeouts.
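(A toy illustration of why `:sys.get_state/1` can't get an answer here: a GenServer whose callback never returns can't process system messages, so the call times out.)

```elixir
defmodule Stuck do
  use GenServer

  def init(_), do: {:ok, nil}

  def handle_cast(:block, state) do
    Process.sleep(:infinity)  # callback never returns to the GenServer loop
    {:noreply, state}
  end
end

{:ok, pid} = GenServer.start_link(Stuck, nil)
GenServer.cast(pid, :block)
:sys.get_state(pid, 1_000)
#=> ** (exit) exited in: :sys.get_state(#PID<...>)
#=>     ** (EXIT) time out
```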
I think Redis is going to be the best place for you to investigate. Just off the cuff, I'd check out the following:
- Check the Redis logs for anything obvious (slow commands, warnings, etc).
- Does the local `:redix` connection have a command pipeline cap, and is it blocking because it's waiting in line? (Related: if you can store a handle to a unique Redis connection in Honeydew's job state instead of relying on a connection pool, there'll never be a mismatch between available connections and the number of workers; a sketch of this idea follows the list.)
- Does the remote Redis server have a cap on the number of things it'll process simultaneously? It could be that your local client is blocking while awaiting the scheduling of its request.
- Do you have enough RAM on your Redis server, and is Redis allowed to use that RAM? It could be that Redis is paging the data set out to disk because it's low on RAM, and has to page it back in to satisfy your command (and hence blocks).
- Next time this happens, maybe try opening a remote iex shell (as you've been doing) and running `Redix.command!(:redix, ["PING"])`? I don't know how much that'll tell you, but if that doesn't work at all, perhaps the Redis connection has broken.
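(A hedged sketch of that per-worker connection idea, assuming Honeydew's stateful-worker convention where the value returned from `init/1` is passed to each job as its last argument; the module name matches the `App.BlockWorker` above, but the connection options and job body are stand-ins:)

```elixir
defmodule App.BlockWorker do
  # Each worker process opens its own Redix connection, so a worker never
  # waits in line behind other workers for a shared/pooled connection.
  def init([]) do
    Redix.start_link(host: "localhost", port: 6379)  # returns {:ok, conn}
  end

  def new_block(block_hash, conn) do
    # use this worker's private connection instead of the global :redix name
    case Redix.command!(conn, ["GET", block_hash]) do
      nil -> :not_cached
      cached -> {:ok, Jason.decode!(cached)}
    end
    # ... rest of the job ...
  end
end
```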
hey @nateless, i'd love to hear how your debugging is going if you're able. :)
Hey @koudelka, after we removed Redis it works flawlessly. Unfortunately, I didn't have time to investigate why it froze with Redis.
all good dude, thanks for the update, i'm glad it's working.
drop me a line if you have any other issues. :)