edit: see followup at #63 (comment). the new queries may not be a net gain, even without deadlocks.
We were bit by this as well during an upgrade.
I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds.
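A job like the one described can be sketched as follows (hypothetical class name and enqueue values for illustration; the commenter's actual test code isn't shown):

```ruby
# Hypothetical sketch of the self-replicating repro job described above:
# each run enqueues two more copies of itself with semi-random priority
# and run_at, so a couple of workers quickly collide on reserve/delete.
class ReplicatingJob
  def perform
    2.times do
      Delayed::Job.enqueue(
        ReplicatingJob.new,
        priority: rand(3),         # semi-random priority
        run_at: Time.now + rand(5) # semi-random run_at
      )
    end
  end
end
```

Seeding the queue with one such job and starting two workers reproduces the contention within seconds, per the report above.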
The output of `show engine innodb status` says the contention is between the UPDATE query now used to reserve a job, and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (`id`) and then on the secondary index (`priority, run_at`). But the UPDATE query is using the (`priority, run_at`) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the UPDATE and DELETE queries each grab one of two locks for a given row, try to acquire the other, and 💥. MySQL resolves by killing the UPDATE, which crashes the worker.
The fix I've worked out locally is to replace the index on (`priority, run_at`) with an index on (`priority, run_at, locked_by`). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query's scan to skip over rows held by workers, which takes it out of contention with the DELETE query.
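A minimal sketch of that index swap as a Rails migration (assuming the `delayed_jobs_priority` index name that appears later in this thread):

```ruby
# Sketch only: replaces (priority, run_at) with (priority, run_at, locked_by).
class WidenDelayedJobsPriorityIndex < ActiveRecord::Migration
  def up
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at, :locked_by],
              name: "delayed_jobs_priority"
  end

  def down
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at],
              name: "delayed_jobs_priority"
  end
end
```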
Hope this helps.
👍👍
from delayed_job_active_record.
I get this also with: 4.0.0.beta2
from delayed_job_active_record.
+1 ..
Seems to happen to us when creating a delayed job from within another running job; not all the time, of course, just very intermittently.
from delayed_job_active_record.
+1
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)
from delayed_job_active_record.
Same here.
delayed_job (3.0.5)
delayed_job_active_record (0.4.4)
from delayed_job_active_record.
+1
delayed_job (3.0.5)
activesupport (> 3.0)
delayed_job_active_record (0.4.4)
activerecord (>= 2.1.0, < 4)
delayed_job (
mysql (5.5.27)
from delayed_job_active_record.
Can we help with anything to solve this? Does anyone have a patch or hints? We are restarting the workers every 10 minutes because of this.
Thanks and regards, Ph.
from delayed_job_active_record.
@philister We ended up forking and changing the logic to use a row level lock. I suspect the original implementation went to lengths to avoid using a lock, but in the end I'll accept a performance penalty for more reliability. You can find our fork here: https://github.com/doxo/delayed_job_active_record
Note that we do still get occasional deadlocks which result in restarts, but rare in comparison to what we were getting before. It would be even better if the gem could detect the deadlock and retry, but getting the root cause out of the exception raised by the MySQL adapter is not straightforward, and possibly version dependent. So I opted to live with the occasional deadlocks for now.
from delayed_job_active_record.
@gorism Thanks a lot!
from delayed_job_active_record.
@gorism great fix! Thanks! -- The original dj's workers were dying after deadlock. I'm happy to see my workers alive after some time.
from delayed_job_active_record.
+20
Thanks a bunch cainlevy. This did the trick. Great work.
from delayed_job_active_record.
Finally this did it for us, too. Thanks @cainlevy (other solutions didn't work in our case).
from delayed_job_active_record.
Upgraded from 0.3.3 to 0.4.4 and we're experiencing deadlock issues as well. We're running a total of 10 workers and usually they manage to keep the job queue down to 1000 or so. On mysql 5.1. @cainlevy's solution didn't work for us :-(.
------------------------
LATEST DETECTED DEADLOCK
------------------------
130811 3:19:18
*** (1) TRANSACTION:
TRANSACTION 2607F0BE7C, ACTIVE 0 sec, process no 7937, OS thread id 1364158784 fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1248, 21 row lock(s)
MySQL thread id 2166505, query id 64381984645 init
UPDATE `delayed_jobs` SET `locked_at` = '2013-08-11 03:19:18', `locked_by` = 'delayed_job host:app3 pid:26732' WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0BE7C lock_mode X locks rec but not gap waiting
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
0: len 4; hex a1dd92ed; asc ;;
1: len 6; hex 002607f0c031; asc & 1;;
2: len 7; hex 000000549014ba; asc T ;;
3: len 4; hex 80000000; asc ;;
4: len 4; hex 80000000; asc ;;
5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
6: SQL NULL;
7: len 8; hex 8000124f11d7316a; asc O 1j;;
8: len 8; hex 8000124f11d7316e; asc O 1n;;
9: SQL NULL;
10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
11: SQL NULL;
12: len 8; hex 8000124f11d7316a; asc O 1j;;
13: len 8; hex 8000124f11d7316a; asc O 1j;;
*** (2) TRANSACTION:
TRANSACTION 2607F0C031, ACTIVE 0 sec, process no 7937, OS thread id 1363892544 updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
MySQL thread id 2166501, query id 64381985155 updating
DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 568169197
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
0: len 4; hex a1dd92ed; asc ;;
1: len 6; hex 002607f0c031; asc & 1;;
2: len 7; hex 000000549014ba; asc T ;;
3: len 4; hex 80000000; asc ;;
4: len 4; hex 80000000; asc ;;
5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
6: SQL NULL;
7: len 8; hex 8000124f11d7316a; asc O 1j;;
8: len 8; hex 8000124f11d7316e; asc O 1n;;
9: SQL NULL;
10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
11: SQL NULL;
12: len 8; hex 8000124f11d7316a; asc O 1j;;
13: len 8; hex 8000124f11d7316a; asc O 1j;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 33 n bits 648 index `delayed_jobs_priority` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap waiting
Record lock, heap no 406 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000000; asc ;;
1: len 8; hex 8000124f11d7316a; asc O 1j;;
2: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
3: len 4; hex a1dd92ed; asc ;;
*** WE ROLL BACK TRANSACTION (2)
mysql> show indexes from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs | 0 | PRIMARY | 1 | id | A | 2873 | NULL | NULL | | BTREE | |
| delayed_jobs | 1 | delayed_jobs_locking | 1 | locked_at | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_locking | 2 | locked_by | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 1 | priority | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 2 | run_at | A | 1436 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 3 | locked_by | A | 1436 | NULL | NULL | YES | BTREE | |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
from delayed_job_active_record.
What do you get from an EXPLAIN on a query with the same `WHERE` and `ORDER BY` clauses?
EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
from delayed_job_active_record.
mysql> EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1;
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| 1 | SIMPLE | delayed_jobs | range | delayed_jobs_locking,delayed_jobs_priority | delayed_jobs_priority | 5 | NULL | 589 | Using where |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
from delayed_job_active_record.
I don't think we have the `delayed_jobs_locking` index on our table, but I guess it's not being used? Only other thing I can think of is maybe MySQL 5.1's query planner isn't taking advantage of the updated `delayed_jobs_priority` index.
Hopefully you can reproduce the deadlock safely with a handful of workers running a self-replicating job. Let us know if you find anything more!
from delayed_job_active_record.
I added the `delayed_jobs_locking` index for pulling other metrics. It would be nice if DJ-AR allowed us to specify the strategy for claiming jobs, because I would use the old strategy and still get the new upgraded features.
from delayed_job_active_record.
Adding locked_at to the index (as described by @cainlevy) has the side effect for me that, out of 10 workers, only a few (2-5) are working at the same time. Maybe this affects the performance on the jobs table?
from delayed_job_active_record.
Thanks @cainlevy. Adding locked_at to the index seems to have solved the problem for me as well. I did it yesterday, amidst the chaos of these deadlock errors being thrown all over the place in a queue of 100k jobs. As soon as this change was made, those deadlock issues were gone.
from delayed_job_active_record.
+1 Thanks @cainlevy, works great.
from delayed_job_active_record.
Is everyone for whom this is working on MySQL 5.5+?
from delayed_job_active_record.
@ngan @cainlevy's fix doesn't help me on 5.5
from delayed_job_active_record.
@ngan - I am on MySQL 5.5.31 and @cainlevy's fix DOES help me.
from delayed_job_active_record.
MySQL 5.5.x
If my fix isn't working for someone, I'd recommend following the steps I described earlier to set up a repro someplace safe to experiment. It's the only way we're going to learn more.
from delayed_job_active_record.
I'll publish my test setup in a moment. In the meantime: could not repro on 5.6.13.
from delayed_job_active_record.
My experience with this patch: it does stop the deadlocks, but performance on my DJ doesn't improve once the deadlocks are gone.
However, removing the MySQL optimizations makes performance jump from about 1 job/second/worker to about 10 jobs/second/worker. Seeing if upgrading to 5.6 will help one way or the other.
from delayed_job_active_record.
I had originally thought some of the fixes above fixed my deadlock issues but apparently I still got them. I think using multiple processes will cause race conditions with the db so I built a gem that is thread based: a manager thread pulls jobs from the db and distributes them to a threadpool of workers. I would totally welcome contributors to http://github.com/zxiest/delayed_job_active_record_threaded
from delayed_job_active_record.
@zxiest That sounds like a good way to make DB polling more efficient, but if you're still having trouble with deadlocks, it'll probably come back when you run multiple machines.
from delayed_job_active_record.
@michaelglass I presume you're measuring performance against a previous version of the gem? Are you measuring real-world throughput, or using an empty job?
from delayed_job_active_record.
@cainlevy yup, this might be an issue, however, if needed in the future, I can achieve mutual exclusivity by processing different queues on different servers at first and if the queues are too large, I could use job_id % servers_number == server_id . I would still use protective locks for safety but avoid locking as much as possible =)
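The modulo idea is simple enough to state in code; this is a plain-Ruby sketch of the routing predicate (`claims?`, `server_id`, and `server_count` are illustrative names, not part of any gem):

```ruby
# Each server only claims jobs whose id maps to it, so two servers
# never compete for (or deadlock on) the same rows.
def claims?(job_id, server_id, server_count)
  job_id % server_count == server_id
end

# With 3 servers, job 10 belongs only to server 1:
claims?(10, 1, 3) # => true
claims?(10, 0, 3) # => false
```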
from delayed_job_active_record.
@cainlevy thanks so much for being such a boss!
Here's my story:
- Saw a lot of deadlocks and slow queries in my log. Pushed your fix, and queries were no longer reported to the slow query log / no new deadlocks in `show innodb status`. I thought that everything was better!
- Tested DJ throughput against my live web app first by counting rows processed. Ensured that this cap script was reporting equivalent numbers:
- With and without the fix, performance was about 1 job/s.
- Forked and removed mysql-specific logic. Performance jumped to about 80 jobs/s (with 25 workers). Throwing more workers at the queue didn't improve performance.
In isolation, I created this test harness and ran it in a couple of parallel irbs against default homebrew mysql (5.6.13) and the current version in the percona apt repo (5.5.33-31.1). Couldn't repro slowness there.
Haven't run against 5.5.31-30.3-log, which is what I'm running in production. Testing that soon.
from delayed_job_active_record.
@michaelglass So we eventually ran into performance problems as well. Seems like once something backed up the table enough, performance tanked. Limiting workers to specific priorities helped a little bit, but we couldn't fully catch up without switching back to the select/update query pattern.
from delayed_job_active_record.
@cainlevy yup. Was my experience. I think we should revert those "performance" fixes or revert them for 5.5+?
from delayed_job_active_record.
@michaelglass I think so, yeah. Did you ever find out what happens under 5.6?
I forgot until afterwards, but we had some performance data that nicely showed the effects of the new `reserve`. We deployed it to this app on Sept 23, then ran into problems and reverted on Oct 1.
Green is mean of the 90th percentile, red is the upper bound of the 90th percentile, and purple is the total upper bound.
from delayed_job_active_record.
looks like I don't see the same performance issues with Mysql 5.6
from delayed_job_active_record.
lies. same problem with percona mysql 5.6. awful.
from delayed_job_active_record.
+1 Thanks @cainlevy, works great.
from delayed_job_active_record.
@cainlevy You're the man. Thanks for saving me a bunch of time. Index fixed it all up.
from delayed_job_active_record.
keep in mind that even if the deadlocks stop, overall table performance might be down!
from delayed_job_active_record.
@cainlevy might be worth editing the comment that everybody is reading and adding that caveat.
from delayed_job_active_record.
@michaelglass good idea
from delayed_job_active_record.
we're being bit by this too. What's the latest idea on this? is #75 the best fix?
from delayed_job_active_record.
@aaronjensen basically. We ended up monkey-patching the `Delayed::Backend::ActiveRecord::Job.reserve` method so we could still get other library updates.
from delayed_job_active_record.
@cainlevy the reason there are performance issues with your suggestion is that it causes the UPDATE to not use an index at all. MySQL will apparently not use an index for an UPDATE unless it matches the sort criteria exactly. I don't know why, but it seems to be the case. So the index replacement is more or less the same as deleting the original index outright.
from delayed_job_active_record.
@aaronjensen my experience is that performance is actually slower than with no index. Is my memory not serving me?
from delayed_job_active_record.
@michaelglass well, on 5.6.15 (which is capable of `explain update`, unlike 5.5.x it seems) it showed that no index was used for that update statement with the new index, but the old index was used if it was there. Using indexes on UPDATEs for `order by` is something fairly new to MySQL, believe it or not; it was added in 5.5.6 I think, so it doesn't surprise me that there are bugs/inconsistencies, and it wouldn't surprise me if you happened to have a version where it did work :)
from delayed_job_active_record.
@aaronjensen to respond to your comment from way back: We're using #75 in prod with great success.
from delayed_job_active_record.
👍
@cainlevy's Index Fix works on MySQL 5.5.31.
from delayed_job_active_record.
@cainlevy's Index Fix works on MySQL 5.5.31.
I'd be careful w/ that fix, it's not really a fix, it seems to just cause mysql to stop using an index at all (the new index doesn't seem to be used) which means your performance will get worse over time.
from delayed_job_active_record.
@aaronjensen You're not kidding, I woke up this morning to find only one worker running a task at any given time... So, I'm a little confused, have we found a fix for this, yet? What's the best advice?
from delayed_job_active_record.
I ended up monkey-patching my `Job.reserve` method back to the original select+update strategy.
from delayed_job_active_record.
@cainlevy Any chance you have a fork I can pull from?
from delayed_job_active_record.
I opted to patch the method so I wouldn't need to maintain a fork. If you're concerned about the API changing in a version update, I'd recommend a version check + exception.
https://gist.github.com/cainlevy/c6cfa67d44fe7427dea6
Really, this whole topic probably warrants a pull request with some kind of swappable strategy class.
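The "version check + exception" idea can be sketched like this (`PATCHED_VERSION` and `assert_patchable!` are made-up names for illustration, not part of the gem):

```ruby
# Fail loudly at boot if the installed gem version isn't the one the
# monkey patch was written against, instead of silently patching a
# possibly-changed API.
PATCHED_VERSION = "4.0.0" # hypothetical: whatever version the patch targets

def assert_patchable!(installed_version)
  return if installed_version == PATCHED_VERSION
  raise "delayed_job_active_record #{installed_version} != #{PATCHED_VERSION}; " \
        "re-verify the Job.reserve patch before upgrading"
end

assert_patchable!("4.0.0") # passes silently; "4.0.1" would raise
```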
from delayed_job_active_record.
That's awesome, @cainlevy. Thanks for sharing.
I agree with you. For now, I'll put this patch into production and see how she performs.
from delayed_job_active_record.
@joshuapinter I just opened a pull request that fixes it in a different way. It's untested, but it should do the trick. Maybe you could give it a shot? #90
from delayed_job_active_record.
@aaronjensen Looks straightforward enough. Seemed to work fine in development. Giving it a shot right now in staging. I'll let you know how it goes.
from delayed_job_active_record.
@joshuapinter it looks like #90 did some unnecessary stuff. #91 just retries deletes since delayed_job handles reservation retries.
from delayed_job_active_record.
@aaronjensen Have been using #91 in production for the last few hours. Seems to be working great. Thanks!
from delayed_job_active_record.
@aaronjensen Spoke too soon, getting the following error in `delayed/backend/active_record.rb:113`:
uninitialized constant Delayed::Backend::ActiveRecord::StatementInvalid
from delayed_job_active_record.
ah, thanks @joshuapinter we hadn't had a deadlock yet. I just updated, that should fix it.
from delayed_job_active_record.
hmm, rescuing and retrying deadlocks seems like a bandaid. also, the delete query is only one side of the deadlock, and i'd expect both sides to fail sooner or later.
i'm not sure that the new reserve strategy is worth shoring up. maybe it depends on server version, but i posted a couple graphs above that show it actually performed worse in our system.
from delayed_job_active_record.
hmm, rescuing and retrying deadlocks seems like a bandaid.
I disagree, my understanding is that this is a common way of handling deadlocks, especially when you cannot control lock ordering easily as is the case in this.
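Stripped of ActiveRecord specifics, the retry-on-deadlock pattern under discussion looks roughly like this (bounded retries, matching on the MySQL deadlock message; the helper name is illustrative):

```ruby
def with_deadlock_retry(max_retries: 3)
  attempts = 0
  begin
    yield
  rescue StandardError => e
    # Only retry deadlock-looking failures, and only a bounded number of times;
    # anything else propagates unchanged.
    raise unless e.message =~ /Deadlock found/ && attempts < max_retries
    attempts += 1
    retry
  end
end

calls = 0
with_deadlock_retry do
  calls += 1
  raise "Deadlock found when trying to get lock" if calls < 3
end
calls # => 3
```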
also, the delete query is only one side of the deadlock, and i'd expect both sides to fail sooner or later.
The other side's failure is already handled here: https://github.com/collectiveidea/delayed_job/blob/master/lib/delayed/worker.rb#L281
i'm not sure that the new reserve strategy is worth shoring up. maybe it depends on server version, but i posted a couple graphs above that show it actually performed worse in our system.
Weren't those graphs when you were running with modified indexes? I could not get the modified indexes to be used at all, mysql refused to use them regardless of version, so it was the same as running without any indexes. I don't have data that says that this method is better (yet... our next big run will be on monday night so I'll have more data then with #91).
from delayed_job_active_record.
just to reiterate: swapping to resque, if that's an option, works pretty well
from delayed_job_active_record.
I see a lot of people are still having issues. I have been using delayed_job_active_record_threaded (which I developed) on production, handling queues of 1M+ size, for the past year without any issues. It doesn't support multiple servers yet but otherwise, if you need to stick to a db solution, give it a shot =)
from delayed_job_active_record.
@aaronjensen True, it's possible that our version of MySQL wasn't using the new index on the `UPDATE...LIMIT` query. Unfortunately it's only 5.5.x, so I had to use `EXPLAIN SELECT` to try and get anything back, which would've missed any differences from the actual `UPDATE` version. And as I understand, your evidence was from 5.6.x, which would've missed any version differences.
I'd be reluctant to depend on versions though, and if the new strategy combined with deadlock retries is more performant than the old select and update, then cool!
from delayed_job_active_record.
This seems to be a pretty serious issue. I too am seeing the problem of not all workers running together. Currently, I have added
gem 'delayed_job_active_record', :github => 'aaronjensen/delayed_job_active_record', :branch => 'retry-deadlock-on-delete'
to my Gemfile, and I reset the index changes suggested earlier in this issue. But it seems like the "all workers not running" problem still exists.
What are people doing at the moment to overcome this issue?
from delayed_job_active_record.
@t-anjan i'm not sure you're looking at the right issue, this issue is about a deadlock, I'm not sure what "all workers not running" is, so I doubt that the deadlock retry branch would help you.
from delayed_job_active_record.
I am getting this problem all the time. All my workers will die within an hour when processing 1k jobs in the queue.
gem 'delayed_job_active_record', '4.0.0'
gem 'mysql2', '0.3.14'
any ideas?
p.s. currently my solution is using God to control the delayed_job workers and keep them alive. It's definitely ugly, so I am waiting for someone who can help. Thanks!
from delayed_job_active_record.
@sg552 we have been using #91 and it works well.
from delayed_job_active_record.
@aaronjensen the patch (#91) doesn't work for me. Is there something I did incorrectly?
BTW, this is what I did:
I created a file at lib/delayed/backend/active_record.rb, and it looks like:
require 'active_record/version'

module Delayed
  module Backend
    module ActiveRecord
      # A job object that is persisted to the database.
      # Contains the work object as a YAML field.
      class Job < ::ActiveRecord::Base
        include Delayed::Backend::Base

        # ......

        def destroy
          retries = 0
          begin
            super
          rescue ::ActiveRecord::StatementInvalid => e
            if e.message =~ /Deadlock found when trying to get lock/ && retries < 100
              retries += 1
              retry
            else
              raise
            end
          end
        end
      end
    end
  end
end
In my config/application.rb:
module MCacheCleaner
  class Application < Rails::Application
    config.autoload_paths += %W(#{config.root}/lib)
    config.time_zone = 'Asia/Harbin'
  end
end
Rails version: 4.0.0
from delayed_job_active_record.
@sg552 I'm not monkey patching, I'm actually pointing my Gemfile to my fork. If you're going to monkey patch, you should name the file differently and explicitly require it in your application.rb. Replacing the original file name is a bad idea because it probably won't load the original.
from delayed_job_active_record.
hi @aaronjensen , thanks for reply, could you please tell me which url I should point to in my Gemfile?
could you give me an example?
thanks!
from delayed_job_active_record.
see: http://bundler.io/v1.6/git.html
and #90
gem 'delayed_job_active_record', git: 'https://github.com/aaronjensen/delayed_job_active_record.git', branch: 'retry-deadlock-on-delete'
from delayed_job_active_record.
@michaelglass thanks! Can I rename the gem line to be:
gem 'delayed_job_activerecord_retry-deadlock-on-delete', git: 'https://github.com/aaronjensen/delayed_job_active_record.git', branch: 'retry-deadlock-on-delete'
?
from delayed_job_active_record.
No, though I'm not sure why you would want to? You should remove the original line from your Gemfile, this takes its place.
from delayed_job_active_record.
@michaelglass @aaronjensen got it. thanks! I will let you know if it works on my server.
from delayed_job_active_record.
whoops. also I had a typo
from delayed_job_active_record.
@aaronjensen - When I said "all workers are not running together", I was referring to the same issue which others faced. Specifically, @joshuapinter commented in this thread, saying the following:
@aaronjensen You're not kidding, I woke up this morning to find only one worker running a task at any given time... So, I'm a little confused, have we found a fix for this, yet? What's the best advice?
Even after using the `retry-deadlock-on-delete` branch, I still have the same issue.
from delayed_job_active_record.
@t-anjan you're probably facing a different issue than what my branch fixes. My branch fixes a very specific issue: deadlocks can happen between the update and the delete of a job on MySQL. Deadlocks are detected by the db and one of the transactions is killed, so that would not cause only one worker to work at once. Your problem could be some sort of locking issue or something else entirely, I have no idea, but it's most likely not what my branch (and this issue) are about.
from delayed_job_active_record.
@aaronjensen: thanks a lot! Your branch works perfectly on my server! Good job! The workers have been running for 2 days without any problems.
from delayed_job_active_record.
That's great, glad to hear it. Aaron
from delayed_job_active_record.
I created a pull request months ago which would enable a slightly cleaner monkey-patch of the `reserve` method:
#89
I think this is useful for reverting to the conservative SQL statement (even better: a configuration option). I haven't heard anything yet.
from delayed_job_active_record.
the non-conservative sql statement is fundamentally flawed without deadlock retry, so either #91 should be merged or the non-conservative method should just be removed. Ultimately I don't think monkey-patching it out should be encouraged.
from delayed_job_active_record.
I have even more issues with the non-conservative SQL statement. MySQL (master-master replication) writes this warning to the log file on every execution of the SQL:
[Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. The statement is unsafe because it uses a LIMIT clause. This is unsafe because the set of rows included cannot be predicted.
The `mysqld.log` is flooded with those messages (and presumably MySQL doesn't spit out the warnings without a reason). I'd really like to get rid of the optimized MySQL version (preferably with a config option).
from delayed_job_active_record.
@csmuc yeah we get that too :/
from delayed_job_active_record.
We fixed this issue by reverting to the less efficient job fetching query, where it fetches a batch of 5 jobs, and then tries to lock each one successively until it obtains a lock.
I ran some tests on my local machine with 4 delayed job workers, and 10,000 no-op jobs:
Before: 41 Deadlocks in log/delayed_job.log
After: 0 Deadlocks
Here's the commit that reverts the change: ndbroadbent@d311b55
You can test it out by adding this line to your Gemfile:
gem 'delayed_job_active_record', github: 'ndbroadbent/delayed_job_active_record', branch: :fix_locking_issues
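For reference, the reverted strategy works roughly like this; the sketch below is plain Ruby (no database) that mimics the batch-then-lock loop, with the guarded UPDATE reduced to an in-memory check:

```ruby
# Rough sketch of the select-then-lock strategy: read a small batch of
# unlocked candidate jobs, then try to claim each one in turn, taking
# the first claim that succeeds.
Job = Struct.new(:id, :locked_by)

def reserve(jobs, worker, batch_size: 5)
  jobs.reject(&:locked_by).first(batch_size).each do |job|
    # In SQL this is an "UPDATE ... WHERE id = ? AND locked_at IS NULL",
    # which fails harmlessly if another worker got there first.
    next unless job.locked_by.nil?
    job.locked_by = worker
    return job
  end
  nil # no job could be claimed this round
end

queue = [Job.new(1, "w0"), Job.new(2, nil), Job.new(3, nil)]
reserve(queue, "w1").id # => 2 (job 1 is already locked by "w0")
```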
from delayed_job_active_record.
I don't know if mine is a special case.
Using
gem 'delayed_job_active_record', :github => 'aaronjensen/delayed_job_active_record', :branch => 'retry-deadlock-on-delete'
was still giving me a bucket-load of deadlocks and very slow queue processing. As I had mentioned before, in addition to deadlocks, it appeared that not all my workers were processing the queue simultaneously.
Yesterday, I tried
gem 'delayed_job_active_record', github: 'ndbroadbent/delayed_job_active_record', branch: :fix_locking_issues
and I immediately noticed an improvement. On a queue of about 30,000 real jobs, when I ran
select * from delayed_jobs where locked_at is not null;
I always got back a result set matching the exact number of workers I was running. And the whole queue was finished in under an hour with zero deadlocks.
Earlier, it was taking 4 to 5 hours with tens of deadlocks, and the above query was always returning just one (or at most two) records.
@aaronjensen - It looks like the issue I described earlier in the thread is indeed related to this fix.
@ndbroadbent - Thanks! I was so close to switching from DJ completely.
from delayed_job_active_record.
@t-anjan that's great, I'm glad you found something that works for you.
from delayed_job_active_record.
@ndbroadbent, do you know if your fix can help with the following error also?
Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction:
UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2014-09-16 22:01:39',
`delayed_jobs`.`locked_by` = 'delayed_job.1 host:job02 pid:17742'
WHERE ((run_at <= '2014-09-16 22:01:39' AND (locked_at IS NULL OR
locked_at < '2014-09-16 18:01:39') OR locked_by = 'delayed_job.1 host:job02 pid:17742')
AND failed_at IS NULL) AND `delayed_jobs`.`queue` IN ('retrieve_deliveries')
ORDER BY priority ASC, run_at ASC LIMIT 1
from delayed_job_active_record.
Hi Marcos, yes, that's the error that we were experiencing, which is now fixed by this change.
from delayed_job_active_record.
@marcoshack - That's the error we were experiencing too. And it was fixed by @ndbroadbent's change.
from delayed_job_active_record.
That's awesome @ndbroadbent and @t-anjan, thank you for the feedback.
from delayed_job_active_record.
First of all thanks @ndbroadbent !
We were having almost the same problem as @t-anjan.
We're running an average of 25 workers, and usually they manage to keep the job queue down to 1000, but once it passes the 1000 threshold, it grows exponentially and takes almost 5 hours to process ~500k jobs (we see an average of 35k jobs in the queue during those 5 hours).
Also, we noticed that even with 50+ workers, only ~10 workers were getting jobs. So we made the changes @ndbroadbent mentioned above and also increased read_ahead to 25. After these changes, 100k jobs were processed within 30 minutes.
How does it work?
In a system like this, with 20+ workers running, each worker tries to reserve a job. When they all try at roughly the same time, one of them reserves the job while the others hit a deadlock and wait 5 seconds before trying again. This introduces significant delay on busy systems.
Changing the logic from:
-> "give me 1 job and I will try to reserve it, otherwise I will wait 5 seconds"
to:
-> "give me 25 jobs and I will try to reserve one of them asap"
fixed our issue.
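The change described above can be expressed with Delayed::Worker's standard settings; a minimal sketch, assuming the usual initializer location (config/initializers/delayed_job.rb):

```ruby
# Sketch of the worker configuration described above.
# read_ahead: how many candidate jobs each worker fetches per reservation
# attempt; the worker then tries to lock the first one it can.
Delayed::Worker.read_ahead = 25
# sleep_delay: how long a worker sleeps when it fails to reserve a job.
Delayed::Worker.sleep_delay = 5
```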
from delayed_job_active_record.
With @ndbroadbent's reverted commit, it does fix deadlocks, but the select times are pretty bad. With a jobs table containing 2500 'live' jobs and 170k failed jobs, the SELECTs would typically take around 2200ms to complete.
Delayed::Backend::ActiveRecord::Job Load (2443.3ms) SELECT `delayed_jobs`.* FROM `delayed_jobs` WHERE ((run_at <= '2014-12-17 10:24:09' AND (locked_at IS NULL OR locked_at < '2014-12-17 06:24:09') OR locked_by = 'delayed_job.3 host:dj2.audioboo.cloud.ec pid:13318') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 5
If I wipe out all the old failed jobs, the SELECTs drop to a few milliseconds. Possibly I should just be more regularly clearing out failed jobs. Alternatively, I wonder if it might help the query if, when a job is marked as failed, we also bumped its priority to an arbitrarily high value.
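For reference, the periodic cleanup mentioned above could be a one-liner run from a cron task or rake task; a hedged sketch, assuming the standard delayed_jobs schema (failed_at column) and a running Rails environment (the one-week cutoff is arbitrary):

```ruby
# Sketch: purge failed jobs older than a week so the reservation query's
# scan doesn't have to walk past them. Adjust the retention window to
# however long you need failed jobs around for debugging.
Delayed::Job.where("failed_at IS NOT NULL AND failed_at < ?", 1.week.ago)
            .delete_all
```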
from delayed_job_active_record.
@jdelStrother, which MySQL version do you use? Also, can you try restarting MySQL and observing the response time again?
from delayed_job_active_record.
We're still on 5.5.39. Restarting MySQL on our live site isn't really an option for us, but I'll see what I can do about reproducing it locally.
from delayed_job_active_record.
I'm experiencing this problem and the deadlocks are crippling. The gem is effectively BROKEN as released. The commit from ndbroadbent/delayed_job_active_record@d311b55 does resolve the problem, but I've had to fork and apply his patch in order to get on with business.
It's a real bummer having to live on a fork of this gem. I'd greatly prefer to be on the mainline release, but it's not possible unless this is resolved.
What do you need to see for this to move forward?
from delayed_job_active_record.
The goal of my commit 0eecb8f was to at least enable a rather "clean" monkey-patch. We use this in production:
```ruby
module Delayed
  module Backend
    module ActiveRecord
      class Job
        def self.reserve_with_scope(ready_scope, worker, now)
          reserve_with_scope_using_default_sql(ready_scope, worker, now)
        end
      end
    end
  end
end
```
Of course it would be much better to have an official config option or something.
from delayed_job_active_record.
@csmuc Thanks! That's a nice way back to the mainline release.
from delayed_job_active_record.
I'll repeat myself here: #63 (comment)
Requiring a monkey patch for a fundamentally flawed method of reserving jobs (it is only flawed if you do not have deadlock retry) makes no sense whatsoever. We should either revert to the method that works or add deadlock retry to the current method.
I honestly do not understand what is happening here.
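The deadlock retry mentioned above could be as simple as retrying the reservation a few times when MySQL reports a deadlock. A minimal, self-contained sketch; the helper name, error matching, and backoff policy are hypothetical, not part of the gem:

```ruby
# Hypothetical retry wrapper: re-runs a block a few times when MySQL
# reports a deadlock, instead of letting the worker crash.
def with_deadlock_retry(attempts = 3)
  tries = 0
  begin
    yield
  rescue => e
    # Only retry genuine deadlock errors; re-raise everything else.
    raise unless e.message =~ /deadlock found when trying to get lock/i
    tries += 1
    raise if tries >= attempts
    sleep(0.1 * tries) # brief backoff before retrying the transaction
    retry
  end
end
```

In the gem, something like this would wrap the reservation query so a losing worker simply retries instead of crashing and sleeping out its poll interval.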
from delayed_job_active_record.