edit: see followup at #63 (comment). the new queries may not be a net gain, even without deadlocks.
We were bit by this as well during an upgrade.
I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds.
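A job like the one described can be sketched as follows (hypothetical class name and enqueue values for illustration; the commenter's actual test code isn't shown):

```ruby
# Hypothetical sketch of the self-replicating repro job described above:
# each run enqueues two more copies of itself with semi-random priority
# and run_at, so a couple of workers quickly collide on reserve/delete.
class ReplicatingJob
  def perform
    2.times do
      Delayed::Job.enqueue(
        ReplicatingJob.new,
        priority: rand(3),         # semi-random priority
        run_at: Time.now + rand(5) # semi-random run_at
      )
    end
  end
end
```

Seeding the queue with one such job and starting two workers reproduces the contention within seconds, per the report above.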
The output of `show engine innodb status` says the contention is between the UPDATE query now used to reserve a job, and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (`id`) and then on the secondary index (`priority, run_at`). But the UPDATE query is using the (`priority, run_at`) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the UPDATE and DELETE queries each grab one of two locks for a given row, try to acquire the other, and 💥. MySQL resolves by killing the UPDATE, which crashes the worker.
The fix I've worked out locally is to replace the index on (`priority, run_at`) with an index on (`priority, run_at, locked_by`). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query's scan to skip over rows held by workers, which takes it out of contention with the DELETE query.
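A minimal sketch of that index swap as a Rails migration (assuming the `delayed_jobs_priority` index name that appears later in this thread):

```ruby
# Sketch only: replaces (priority, run_at) with (priority, run_at, locked_by).
class WidenDelayedJobsPriorityIndex < ActiveRecord::Migration
  def up
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at, :locked_by],
              name: "delayed_jobs_priority"
  end

  def down
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at],
              name: "delayed_jobs_priority"
  end
end
```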
Hope this helps.
👍👍
from delayed_job_active_record.
I get this also with: 4.0.0.beta2
from delayed_job_active_record.
+1 ..
Seems to happen to us when creating a delayed job from within another running job; not all the time, of course, just very intermittently.
from delayed_job_active_record.
+1
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)
from delayed_job_active_record.
Same here.
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)
from delayed_job_active_record.
+1
delayed_job (3.0.5)
activesupport (> 3.0)
delayed_job_active_record (0.4.4)
activerecord (>= 2.1.0, < 4)
delayed_job (
mysql (5.5.27)
from delayed_job_active_record.
Can we help with anything to solve this? Does anyone have a patch or hints? We are restarting the workers every 10 minutes because of this.
Thanks and regards, Ph.
from delayed_job_active_record.
@philister We ended up forking and changing the logic to use a row level lock. I suspect the original implementation went to lengths to avoid using a lock, but in the end I'll accept a performance penalty for more reliability. You can find our fork here: https://github.com/doxo/delayed_job_active_record
Note that we do still get occasional deadlocks which result in restarts, but rare in comparison to what we were getting before. It would be even better if the gem could detect the deadlock and retry, but getting the root cause out of the exception raised by the MySQL adapter is not straightforward, and possibly version dependent. So I opted to live with the occasional deadlocks for now.
from delayed_job_active_record.
@gorism Thanks a lot!
from delayed_job_active_record.
@gorism great fix! Thanks! -- The original dj's workers were dying after deadlock. I'm happy to see my workers alive after some time.
from delayed_job_active_record.
+20
Thanks a bunch cainlevy. This did the trick. Great work.
from delayed_job_active_record.
Finally this did it for us, too. Thanks @cainlevy (other solutions didn't work in our case).
from delayed_job_active_record.
Upgraded from 0.3.3 to 0.4.4 and we're experiencing deadlock issues as well. We're running a total of 10 workers and usually they manage to keep the job queue down to 1000 or so. On mysql 5.1. @cainlevy's solution didn't work for us :-(.
------------------------
LATEST DETECTED DEADLOCK
------------------------
130811 3:19:18
*** (1) TRANSACTION:
TRANSACTION 2607F0BE7C, ACTIVE 0 sec, process no 7937, OS thread id 1364158784 fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1248, 21 row lock(s)
MySQL thread id 2166505, query id 64381984645 init
UPDATE `delayed_jobs` SET `locked_at` = '2013-08-11 03:19:18', `locked_by` = 'delayed_job host:app3 pid:26732' WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0BE7C lock_mode X locks rec but not gap waiting
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
0: len 4; hex a1dd92ed; asc ;;
1: len 6; hex 002607f0c031; asc & 1;;
2: len 7; hex 000000549014ba; asc T ;;
3: len 4; hex 80000000; asc ;;
4: len 4; hex 80000000; asc ;;
5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
6: SQL NULL;
7: len 8; hex 8000124f11d7316a; asc O 1j;;
8: len 8; hex 8000124f11d7316e; asc O 1n;;
9: SQL NULL;
10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
11: SQL NULL;
12: len 8; hex 8000124f11d7316a; asc O 1j;;
13: len 8; hex 8000124f11d7316a; asc O 1j;;
*** (2) TRANSACTION:
TRANSACTION 2607F0C031, ACTIVE 0 sec, process no 7937, OS thread id 1363892544 updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
MySQL thread id 2166501, query id 64381985155 updating
DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 568169197
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
0: len 4; hex a1dd92ed; asc ;;
1: len 6; hex 002607f0c031; asc & 1;;
2: len 7; hex 000000549014ba; asc T ;;
3: len 4; hex 80000000; asc ;;
4: len 4; hex 80000000; asc ;;
5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
6: SQL NULL;
7: len 8; hex 8000124f11d7316a; asc O 1j;;
8: len 8; hex 8000124f11d7316e; asc O 1n;;
9: SQL NULL;
10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
11: SQL NULL;
12: len 8; hex 8000124f11d7316a; asc O 1j;;
13: len 8; hex 8000124f11d7316a; asc O 1j;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 33 n bits 648 index `delayed_jobs_priority` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap waiting
Record lock, heap no 406 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000000; asc ;;
1: len 8; hex 8000124f11d7316a; asc O 1j;;
2: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
3: len 4; hex a1dd92ed; asc ;;
*** WE ROLL BACK TRANSACTION (2)
mysql> show indexes from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs | 0 | PRIMARY | 1 | id | A | 2873 | NULL | NULL | | BTREE | |
| delayed_jobs | 1 | delayed_jobs_locking | 1 | locked_at | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_locking | 2 | locked_by | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 1 | priority | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 2 | run_at | A | 1436 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 3 | locked_by | A | 1436 | NULL | NULL | YES | BTREE | |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
from delayed_job_active_record.
What do you get from an EXPLAIN on a query with the same `WHERE` and `ORDER BY` clauses?
EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
from delayed_job_active_record.
mysql> EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1;
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| 1 | SIMPLE | delayed_jobs | range | delayed_jobs_locking,delayed_jobs_priority | delayed_jobs_priority | 5 | NULL | 589 | Using where |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
from delayed_job_active_record.
I don't think we have the `delayed_jobs_locking` index on our table, but I guess it's not being used? Only other thing I can think of is maybe MySQL 5.1's query planner isn't taking advantage of the updated `delayed_jobs_priority` index.
Hopefully you can reproduce the deadlock safely with a handful of workers running a self-replicating job. Let us know if you find anything more!
from delayed_job_active_record.
I added the `delayed_jobs_locking` index for pulling other metrics. It would be nice if DJ-AR allowed us to specify the strategy for claiming jobs, because I would use the old strategy and still get the new upgraded features.
from delayed_job_active_record.
Adding locked_at to the index (as described by @cainlevy) has the side effect for me that, out of 10 workers, only a few (2-5) are working at the same time. Maybe this affects the performance on the jobs table?
from delayed_job_active_record.
Thanks @cainlevy. Adding locked_at to the index seems to have solved the problem for me as well. I did it yesterday, amidst the chaos of these deadlock errors being thrown all over the place in a queue of 100k jobs. As soon as this change was made, those deadlock issues were gone.
from delayed_job_active_record.
+1 Thanks @cainlevy, works great.
from delayed_job_active_record.
Is everyone for whom this is working on MySQL 5.5+?
from delayed_job_active_record.
@ngan @cainlevy's fix doesn't help me on 5.5
from delayed_job_active_record.
@ngan - I am on MySQL 5.5.31 and @cainlevy's fix DOES help me.
from delayed_job_active_record.
MySQL 5.5.x
If my fix isn't working for someone, I'd recommend following the steps I described earlier to set up a repro someplace safe to experiment. It's the only way we're going to learn more.
from delayed_job_active_record.
I'll publish my test setup in a moment. In the meantime: could not repro on 5.6.13.
from delayed_job_active_record.
My experience with this patch: it does stop the deadlocks, but performance on my DJ doesn't improve once the deadlocks are gone.
However, removing the MySQL optimizations makes performance jump from about 1 job/second/worker to about 10 jobs/second/worker. Seeing if upgrading to 5.6 will help one way or the other.
from delayed_job_active_record.
I had originally thought some of the fixes above fixed my deadlock issues but apparently I still got them. I think using multiple processes will cause race conditions with the db so I built a gem that is thread based: a manager thread pulls jobs from the db and distributes them to a threadpool of workers. I would totally welcome contributors to http://github.com/zxiest/delayed_job_active_record_threaded
from delayed_job_active_record.
@zxiest That sounds like a good way to make DB polling more efficient, but if you're still having trouble with deadlocks, it'll probably come back when you run multiple machines.
from delayed_job_active_record.
@michaelglass I presume you're measuring performance against a previous version of the gem? Are you measuring real-world throughput, or using an empty job?
from delayed_job_active_record.
@cainlevy yup, this might be an issue, however, if needed in the future, I can achieve mutual exclusivity by processing different queues on different servers at first and if the queues are too large, I could use job_id % servers_number == server_id . I would still use protective locks for safety but avoid locking as much as possible =)
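The modulo idea is simple enough to state in code; this is a plain-Ruby sketch of the routing predicate (`claims?`, `server_id`, and `server_count` are illustrative names, not part of any gem):

```ruby
# Each server only claims jobs whose id maps to it, so two servers
# never compete for (or deadlock on) the same rows.
def claims?(job_id, server_id, server_count)
  job_id % server_count == server_id
end

# With 3 servers, job 10 belongs only to server 1:
claims?(10, 1, 3) # => true
claims?(10, 0, 3) # => false
```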
from delayed_job_active_record.
@cainlevy thanks so much for being such a boss!
Here's my story:
- Saw a lot of deadlocks and slow queries in my log. Pushed your fix, and queries were no longer reported to the slow query log / no new deadlocks in `show innodb status`. I thought that everything was better!
- Tested DJ throughput against my live web app first by counting rows processed. Ensured that this cap script was reporting equivalent numbers:
- With and without the fix, performance was about 1 job/s.
- Forked and removed mysql-specific logic. Performance jumped to about 80 jobs/s (with 25 workers). Throwing more workers at the queue didn't improve performance.
In isolation, I created this test harness and ran it in a couple of parallel irbs against default homebrew mysql (5.6.13) and the current version in the percona apt repo (5.5.33-31.1). Couldn't repro slowness there.
Haven't run against 5.5.31-30.3-log, which is what I'm running in production. Testing that soon.
from delayed_job_active_record.
@michaelglass So we eventually ran into performance problems as well. Seems like once something backed up the table enough, performance tanked. Limiting workers to specific priorities helped a little bit, but we couldn't fully catch up without switching back to the select/update query pattern.
from delayed_job_active_record.
@cainlevy yup. Was my experience. I think we should revert those "performance" fixes or revert them for 5.5+?
from delayed_job_active_record.
@michaelglass I think so, yeah. Did you ever find out what happens under 5.6?
I forgot until afterwards, but we had some performance data that nicely showed the effects of the new `reserve`. We deployed it to this app on Sept 23, then ran into problems and reverted on Oct 1.
Green is mean of the 90th percentile, red is the upper bound of the 90th percentile, and purple is the total upper bound.
from delayed_job_active_record.
looks like I don't see the same performance issues with Mysql 5.6
from delayed_job_active_record.
lies. same problem with percona mysql 5.6. awful.
from delayed_job_active_record.
+1 Thanks @cainlevy, works great.
from delayed_job_active_record.
@cainlevy You're the man. Thanks for saving me a bunch of time. Index fixed it all up.
from delayed_job_active_record.
keep in mind that even if the deadlocks stop, overall table performance might be down!
from delayed_job_active_record.
@cainlevy might be worth editing the comment that everybody is reading and adding that caveat.
from delayed_job_active_record.
@michaelglass good idea
from delayed_job_active_record.
we're being bit by this too. What's the latest idea on this? is #75 the best fix?
from delayed_job_active_record.
@aaronjensen basically. We ended up monkey-patching the `Delayed::Backend::ActiveRecord::Job.reserve` method so we could still get other library updates.
from delayed_job_active_record.
@cainlevy the reason there are performance issues with your suggestion is that it causes the UPDATE to not use an index at all. MySQL will apparently not use an index for an UPDATE unless it matches the sort criteria exactly. I don't know why, but it seems to be the case. So the index replacement is more or less the same as deleting the original index outright.
from delayed_job_active_record.
@aaronjensen my experience is that performance is actually slower than with no index. Is my memory not serving me?
from delayed_job_active_record.
@michaelglass well, on 5.6.15 (which is capable of `explain update`, unlike 5.5.x it seems) it showed that no index was used for that update statement with the new index, but the old index was used if it was there. Using indexes on UPDATEs for `order by` is something fairly new to MySQL, believe it or not; it was added in 5.5.6 I think, so it doesn't surprise me that there are bugs/inconsistencies, and it wouldn't surprise me if you happened to have a version where it did work :)
from delayed_job_active_record.
@aaronjensen to respond to your comment from way back: We're using #75 in prod with great success.
from delayed_job_active_record.
👍
@cainlevy's Index Fix works on MySQL 5.5.31.
from delayed_job_active_record.
@cainlevy's Index Fix works on MySQL 5.5.31.
I'd be careful w/ that fix, it's not really a fix, it seems to just cause mysql to stop using an index at all (the new index doesn't seem to be used) which means your performance will get worse over time.
from delayed_job_active_record.
@aaronjensen You're not kidding, I woke up this morning to find only one worker running a task at any given time... So, I'm a little confused, have we found a fix for this, yet? What's the best advice?
from delayed_job_active_record.
I ended up monkey-patching my `Job.reserve` method back to the original select+update strategy.
from delayed_job_active_record.
@cainlevy Any chance you have a fork I can pull from?
from delayed_job_active_record.
I opted to patch the method so I wouldn't need to maintain a fork. If you're concerned about the API changing in a version update, I'd recommend a version check + exception.
https://gist.github.com/cainlevy/c6cfa67d44fe7427dea6
Really, this whole topic probably warrants a pull request with some kind of swappable strategy class.
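The "version check + exception" idea can be sketched like this (`PATCHED_VERSION` and `assert_patchable!` are made-up names for illustration, not part of the gem):

```ruby
# Fail loudly at boot if the installed gem version isn't the one the
# monkey patch was written against, instead of silently patching a
# possibly-changed API.
PATCHED_VERSION = "4.0.0" # hypothetical: whatever version the patch targets

def assert_patchable!(installed_version)
  return if installed_version == PATCHED_VERSION
  raise "delayed_job_active_record #{installed_version} != #{PATCHED_VERSION}; " \
        "re-verify the Job.reserve patch before upgrading"
end

assert_patchable!("4.0.0") # passes silently; "4.0.1" would raise
```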
from delayed_job_active_record.
That's awesome, @cainlevy. Thanks for sharing.
I agree with you. For now, I'll put this patch into production and see how she performs.
from delayed_job_active_record.
@joshuapinter I just opened a pull request that fixes it in a different way. It's untested, but it should do the trick. Maybe you could give it a shot? #90
from delayed_job_active_record.
@aaronjensen Looks straightforward enough. Seemed to work fine in development. Giving it a shot right now in staging. I'll let you know how it goes.
from delayed_job_active_record.
@joshuapinter it looks like #90 did some unnecessary stuff. #91 just retries deletes since delayed_job handles reservation retries.
from delayed_job_active_record.
@aaronjensen Have been using #91 in production for the last few hours. Seems to be working great. Thanks!
from delayed_job_active_record.
@aaronjensen Spoke too soon, getting the following error in `delayed/backend/active_record.rb:113`:
uninitialized constant Delayed::Backend::ActiveRecord::StatementInvalid
from delayed_job_active_record.
ah, thanks @joshuapinter we hadn't had a deadlock yet. I just updated, that should fix it.
from delayed_job_active_record.
hmm, rescuing and retrying deadlocks seems like a bandaid. also, the delete query is only one side of the deadlock, and i'd expect both sides to fail sooner or later.
i'm not sure that the new reserve strategy is worth shoring up. maybe it depends on server version, but i posted a couple graphs above that show it actually performed worse in our system.
from delayed_job_active_record.
hmm, rescuing and retrying deadlocks seems like a bandaid.
I disagree, my understanding is that this is a common way of handling deadlocks, especially when you cannot control lock ordering easily as is the case in this.
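Stripped of ActiveRecord specifics, the retry-on-deadlock pattern under discussion looks roughly like this (bounded retries, matching on the MySQL deadlock message; the helper name is illustrative):

```ruby
def with_deadlock_retry(max_retries: 3)
  attempts = 0
  begin
    yield
  rescue StandardError => e
    # Only retry deadlock-looking failures, and only a bounded number of times;
    # anything else propagates unchanged.
    raise unless e.message =~ /Deadlock found/ && attempts < max_retries
    attempts += 1
    retry
  end
end

calls = 0
with_deadlock_retry do
  calls += 1
  raise "Deadlock found when trying to get lock" if calls < 3
end
calls # => 3
```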
also, the delete query is only one side of the deadlock, and i'd expect both sides to fail sooner or later.
The other side's failure is already handled here: https://github.com/collectiveidea/delayed_job/blob/master/lib/delayed/worker.rb#L281
i'm not sure that the new reserve strategy is worth shoring up. maybe it depends on server version, but i posted a couple graphs above that show it actually performed worse in our system.
Weren't those graphs when you were running with modified indexes? I could not get the modified indexes to be used at all, mysql refused to use them regardless of version, so it was the same as running without any indexes. I don't have data that says that this method is better (yet... our next big run will be on monday night so I'll have more data then with #91).
from delayed_job_active_record.
just to reiterate: swapping to resque, if that's an option, works pretty well
from delayed_job_active_record.
I see a lot of people are still having issues. I have been using delayed_job_active_record_threaded (which I developed) on production, handling queues of 1M+ size, for the past year without any issues. It doesn't support multiple servers yet but otherwise, if you need to stick to a db solution, give it a shot =)
from delayed_job_active_record.
@aaronjensen True, it's possible that our version of MySQL wasn't using the new index on the `UPDATE...LIMIT` query. Unfortunately it's only 5.5.x, so I had to use `EXPLAIN SELECT` to try and get anything back, which would've missed any differences from the actual `UPDATE` version. And as I understand, your evidence was from 5.6.x, which would've missed any version differences.
I'd be reluctant to depend on versions though, and if the new strategy combined with deadlock retries is more performant than the old select and update, then cool!
from delayed_job_active_record.
This seems to be a pretty serious issue. I too am seeing the problem of not all workers running together. Currently, I have added
gem 'delayed_job_active_record', :github => 'aaronjensen/delayed_job_active_record', :branch => 'retry-deadlock-on-delete'
to my Gemfile, and I reset the index changes suggested earlier in this issue. But it seems like the "all workers not running" problem still exists.
What are people doing at the moment to overcome this issue?
from delayed_job_active_record.
@t-anjan i'm not sure you're looking at the right issue, this issue is about a deadlock, I'm not sure what "all workers not running" is, so I doubt that the deadlock retry branch would help you.
from delayed_job_active_record.
I am getting this problem all the time. All my workers will die within an hour when processing 1k jobs in the queue.
gem 'delayed_job_active_record', '4.0.0'
gem 'mysql2', '0.3.14'
any ideas?
p.s. currently my solution is using God to control the delayed_job workers and keep them alive. It's definitely ugly, so I am waiting for someone who can help. Thanks!
from delayed_job_active_record.
@sg552 we have been using #91 and it works well.
from delayed_job_active_record.
@aaronjensen the patch (#91) doesn't work for me. Is there something I did incorrectly?
BTW, this is what I did:
I created a file at lib/delayed/backend/active_record.rb, and it looks like:
require 'active_record/version'

module Delayed
  module Backend
    module ActiveRecord
      # A job object that is persisted to the database.
      # Contains the work object as a YAML field.
      class Job < ::ActiveRecord::Base
        include Delayed::Backend::Base

        # ......

        def destroy
          retries = 0
          begin
            super
          rescue ::ActiveRecord::StatementInvalid => e
            if e.message =~ /Deadlock found when trying to get lock/ && retries < 100
              retries += 1
              retry
            else
              raise
            end
          end
        end
      end
    end
  end
end
In my config/application.rb:
module MCacheCleaner
  class Application < Rails::Application
    config.autoload_paths += %W(#{config.root}/lib)
    config.time_zone = 'Asia/Harbin'
  end
end
Rails version: 4.0.0
from delayed_job_active_record.
@sg552 I'm not monkey patching, I'm actually pointing my Gemfile to my fork. If you're going to monkey patch, you should name the file differently and explicitly require it in your application.rb. Replacing the original file name is a bad idea because it probably won't load the original.
from delayed_job_active_record.
hi @aaronjensen , thanks for reply, could you please tell me which url I should point to in my Gemfile?
could you give me an example?
thanks!
from delayed_job_active_record.
see: http://bundler.io/v1.6/git.html
and #90
gem 'delayed_job_active_record', git: 'https://github.com/aaronjensen/delayed_job_active_record.git', branch: 'retry-deadlock-on-delete'
from delayed_job_active_record.
@michaelglass thanks! Can I rename the gem line to be:
gem 'delayed_job_activerecord_retry-deadlock-on-delete', git: 'https://github.com/aaronjensen/delayed_job_active_record.git', branch: 'retry-deadlock-on-delete'
?
from delayed_job_active_record.
No, though I'm not sure why you would want to? You should remove the original line from your Gemfile, this takes its place.
from delayed_job_active_record.
@michaelglass @aaronjensen got it. thanks! I will let you know if it works on my server.
from delayed_job_active_record.
whoops. also I had a typo
from delayed_job_active_record.
@aaronjensen - When I said "all workers are not running together", I was referring to the same issue which others faced. Specifically, @joshuapinter commented in this thread, saying the following:
@aaronjensen You're not kidding, I woke up this morning to find only one worker running a task at any given time... So, I'm a little confused, have we found a fix for this, yet? What's the best advice?
Even after using the `retry-deadlock-on-delete` branch, I still have the same issue.
from delayed_job_active_record.
@t-anjan you're probably facing a different issue than what my branch fixes. My branch fixes a very specific issue: deadlocks can happen between the update and the delete of a job on MySQL. Deadlocks are detected by the db and one of the transactions is killed, so that would not cause only one worker to work at once. Your problem could be some sort of locking issue or something else entirely, I have no idea, but it's most likely not what my branch (and this issue) are about.
from delayed_job_active_record.
@aaronjensen: thanks a lot! Your branch works perfectly on my server! Good job! The workers have been running for 2 days without any problems.
from delayed_job_active_record.
That's great, glad to hear it. Aaron
from delayed_job_active_record.
I created a pull request months ago which would enable a slightly cleaner monkey-patch of the `reserve` method:
#89
I think this is useful for reverting to the conservative SQL statement (even better: a configuration option). I haven't heard anything yet.
from delayed_job_active_record.
the non-conservative sql statement is fundamentally flawed without deadlock retry, so either #91 should be merged or the non-conservative method should just be removed. Ultimately I don't think monkey-patching it out should be encouraged.
from delayed_job_active_record.
I have even more issues with the non-conservative SQL statement. MySQL (master-master replication) writes this warning to the log file on every execution of the SQL:
[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
The `mysqld.log` is flooded with those messages (and presumably MySQL doesn't spit out the warnings without a reason). I'd really like to get rid of the optimized MySQL version (preferably with a config option).
from delayed_job_active_record.
@csmuc yeah we get that too :/
from delayed_job_active_record.
We fixed this issue by reverting to the less efficient job fetching query, where it fetches a batch of 5 jobs, and then tries to lock each one successively until it obtains a lock.
I ran some tests on my local machine with 4 delayed job workers, and 10,000 no-op jobs:
Before: 41 Deadlocks in log/delayed_job.log
After: 0 Deadlocks
Here's the commit that reverts the change: ndbroadbent@d311b55
You can test it out by adding this line to your Gemfile:
gem 'delayed_job_active_record', github: 'ndbroadbent/delayed_job_active_record', branch: :fix_locking_issues
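For reference, the reverted strategy works roughly like this; the sketch below is plain Ruby (no database) that mimics the batch-then-lock loop, with the guarded UPDATE reduced to an in-memory check:

```ruby
# Rough sketch of the select-then-lock strategy: read a small batch of
# unlocked candidate jobs, then try to claim each one in turn, taking
# the first claim that succeeds.
Job = Struct.new(:id, :locked_by)

def reserve(jobs, worker, batch_size: 5)
  jobs.reject(&:locked_by).first(batch_size).each do |job|
    # In SQL this is an "UPDATE ... WHERE id = ? AND locked_at IS NULL",
    # which fails harmlessly if another worker got there first.
    next unless job.locked_by.nil?
    job.locked_by = worker
    return job
  end
  nil # no job could be claimed this round
end

queue = [Job.new(1, "w0"), Job.new(2, nil), Job.new(3, nil)]
reserve(queue, "w1").id # => 2 (job 1 is already locked by "w0")
```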
from delayed_job_active_record.
I don't know if mine is a special case.
Using
gem 'delayed_job_active_record', :github => 'aaronjensen/delayed_job_active_record', :branch => 'retry-deadlock-on-delete'
was still giving me a bucket-load of deadlocks and very slow queue processing. As I had mentioned before, in addition to deadlocks, it appeared that not all my workers were processing the queue simultaneously.
Yesterday, I tried
gem 'delayed_job_active_record', github: 'ndbroadbent/delayed_job_active_record', branch: :fix_locking_issues
and I immediately noticed an improvement. On a queue of about 30,000 real jobs, when I ran
select * from delayed_jobs where locked_at is not null;
I always got back a result set matching the exact number of workers I was running. And the whole queue was finished in under an hour with zero deadlocks.
Earlier, it was taking 4 to 5 hours with tens of deadlocks, and the above query was always returning just one (or at most two) records.
@aaronjensen - It looks like the issue I described earlier in the thread is indeed related to this fix.
@ndbroadbent - Thanks! I was so close to switching from DJ completely.
from delayed_job_active_record.
@t-anjan that's great, I'm glad you found something that works for you.
from delayed_job_active_record.
@ndbroadbent, do you know if your fix can help with the following error also?
Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction:
UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2014-09-16 22:01:39',
`delayed_jobs`.`locked_by` = 'delayed_job.1 host:job02 pid:17742'
WHERE ((run_at <= '2014-09-16 22:01:39' AND (locked_at IS NULL OR
locked_at < '2014-09-16 18:01:39') OR locked_by = 'delayed_job.1 host:job02 pid:17742')
AND failed_at IS NULL) AND `delayed_jobs`.`queue` IN ('retrieve_deliveries')
ORDER BY priority ASC, run_at ASC LIMIT 1
from delayed_job_active_record.
Hi Marcos, yes, that's the error that we were experiencing, which is now fixed by this change.
from delayed_job_active_record.
@marcoshack - That's the error we were experiencing too. And it was fixed by @ndbroadbent's change.
from delayed_job_active_record.
That's awesome @ndbroadbent and @t-anjan, thank you for the feedback.
from delayed_job_active_record.
First of all thanks @ndbroadbent !
We were having almost the same problem as @t-anjan.
We're running an average of 25 workers, and usually they manage to keep the job queue down to 1000, but once it passes the 1000 threshold, it grows exponentially and takes almost 5 hours to process ~500k jobs (we see an average of 35k jobs in the queue during those 5 hours).
Also, we noticed that even with 50+ workers, only ~10 workers were getting jobs. So we made the changes @ndbroadbent mentioned above and also increased read_ahead to 25. After these changes, 100k jobs were processed within 30 minutes.
How does it work?
In a system like this, with 20+ workers running, each worker tries to reserve a job. When they all try at roughly the same time, one of them reserves the job while the others hit a deadlock and wait 5 seconds before trying again. This introduces significant delay on busy systems.
Changing the logic from:
-> "give me 1 job and I will try to reserve it, otherwise I will wait 5 seconds"
to:
-> "give me 25 jobs and I will try to reserve one of them asap"
fixed our issue.
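The change described above can be expressed with Delayed::Worker's standard settings; a minimal sketch, assuming the usual initializer location (config/initializers/delayed_job.rb):

```ruby
# Sketch of the worker configuration described above.
# read_ahead: how many candidate jobs each worker fetches per reservation
# attempt; the worker then tries to lock the first one it can.
Delayed::Worker.read_ahead = 25
# sleep_delay: how long a worker sleeps when it fails to reserve a job.
Delayed::Worker.sleep_delay = 5
```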
from delayed_job_active_record.
With @ndbroadbent's reverted commit, it does fix deadlocks, but the select times are pretty bad. With a jobs table containing 2500 'live' jobs and 170k failed jobs, the SELECTs would typically take around 2200ms to complete.
Delayed::Backend::ActiveRecord::Job Load (2443.3ms) SELECT `delayed_jobs`.* FROM `delayed_jobs` WHERE ((run_at <= '2014-12-17 10:24:09' AND (locked_at IS NULL OR locked_at < '2014-12-17 06:24:09') OR locked_by = 'delayed_job.3 host:dj2.audioboo.cloud.ec pid:13318') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 5
If I wipe out all the old failed jobs, the SELECTs drop to a few milliseconds. Possibly I should just be more regularly clearing out failed jobs. Alternatively, I wonder if it might help the query if, when a job is marked as failed, we also bumped its priority to an arbitrarily high value.
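For reference, the periodic cleanup mentioned above could be a one-liner run from a cron task or rake task; a hedged sketch, assuming the standard delayed_jobs schema (failed_at column) and a running Rails environment (the one-week cutoff is arbitrary):

```ruby
# Sketch: purge failed jobs older than a week so the reservation query's
# scan doesn't have to walk past them. Adjust the retention window to
# however long you need failed jobs around for debugging.
Delayed::Job.where("failed_at IS NOT NULL AND failed_at < ?", 1.week.ago)
            .delete_all
```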
from delayed_job_active_record.
@jdelStrother, which MySQL version do you use? Also, can you try restarting MySQL and observing the response time again?
from delayed_job_active_record.
We're still on 5.5.39. Restarting MySQL on our live site isn't really an option for us, but I'll see what I can do about reproducing it locally.
from delayed_job_active_record.
I'm experiencing this problem and the deadlocks are crippling. The gem is effectively BROKEN as released. The commit from ndbroadbent/delayed_job_active_record@d311b55 does resolve the problem, but I've had to fork and apply his patch in order to get on with business.
It's a real bummer having to live on a fork of this gem. I'd greatly prefer to be on the mainline release, but it's not possible unless this is resolved.
What do you need to see for this to move forward?
from delayed_job_active_record.
The goal of my commit 0eecb8f was to at least enable a rather "clean" monkey-patch. We use this in production:
```ruby
module Delayed
  module Backend
    module ActiveRecord
      class Job
        def self.reserve_with_scope(ready_scope, worker, now)
          reserve_with_scope_using_default_sql(ready_scope, worker, now)
        end
      end
    end
  end
end
```
Of course it would be much better to have an official config option or something.
from delayed_job_active_record.
@csmuc Thanks! That's a nice way back to the mainline release.
from delayed_job_active_record.
I'll repeat myself here: #63 (comment)
Requiring a monkey patch for a fundamentally flawed method of reserving jobs (it is only flawed if you do not have deadlock retry) makes no sense whatsoever. We should either revert to the method that works or add deadlock retry to the current method.
I honestly do not understand what is happening here.
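The deadlock retry mentioned above could be as simple as retrying the reservation a few times when MySQL reports a deadlock. A minimal, self-contained sketch; the helper name, error matching, and backoff policy are hypothetical, not part of the gem:

```ruby
# Hypothetical retry wrapper: re-runs a block a few times when MySQL
# reports a deadlock, instead of letting the worker crash.
def with_deadlock_retry(attempts = 3)
  tries = 0
  begin
    yield
  rescue => e
    # Only retry genuine deadlock errors; re-raise everything else.
    raise unless e.message =~ /deadlock found when trying to get lock/i
    tries += 1
    raise if tries >= attempts
    sleep(0.1 * tries) # brief backoff before retrying the transaction
    retry
  end
end
```

In the gem, something like this would wrap the reservation query so a losing worker simply retries instead of crashing and sleeping out its poll interval.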
from delayed_job_active_record.