Comments (25)
Agh, @andbar, I was completely stumped because I thought the code you were running included this change, a30c2cb, which I made about 3 weeks ago (GitHub shows last week because I had rebased) and which we're running in production, because we're running the branch with support for recurring tasks. I noticed that one while I was working on recurring jobs because I hit a super similar deadlock, and fixed it there. Then, when you reported this one and I looked at your SHOW ENGINE INNODB STATUS, I was like "how can this be possible?!" 😆 🤦♀️ In my head, I had already shipped that fix ages ago. I think the deadlock is due to that missing job_id in the ordered scope.
I'm going to try to ship a new version with support for recurring jobs and that fix so you can test it. I'm on call this week and had a busy Monday, but hopefully will get to it tomorrow!
from solid_queue.
Thanks, @rosa. I haven't had a chance to look at it much myself before this, due to some other needed work in our project that's using this gem, but I'm hoping to be able to dig into it further. I don't have much experience with deadlocks, though, so I'm trying to brush up on that first 😬.
From a first glance, it appears that maybe the issue is locking due to both transactions (the insert and the delete) needing to lock this index: index_solid_queue_dispatch_all? Or more accurately, the insert appears to lock the PRIMARY index first and then tries to acquire a lock on the index_solid_queue_dispatch_all index, while the delete is going in the opposite direction and locks the index_solid_queue_dispatch_all index first and then tries to acquire a lock on the PRIMARY index. Does that sound right?
Maybe that's why the delete transaction (transaction 2 in the logs) shows "5336 row lock(s)" even though it's only deleting something like 19 jobs: because of the index_solid_queue_dispatch_all index?
from solid_queue.
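For readers following along, the opposite lock order described above can be modelled as a tiny wait-for graph. This is only an illustrative sketch in plain Ruby, not Solid Queue or InnoDB code: the `deadlock_cycle` helper and the one-lock-per-transaction `holds`/`waits` hashes are hypothetical simplifications, and only the index names come from the comment above.

```ruby
# Hypothetical model: each transaction holds one lock and waits for one lock.
# The lock names are the two indexes mentioned in the thread.
holds = { insert: "PRIMARY", delete: "index_solid_queue_dispatch_all" }
waits = { insert: "index_solid_queue_dispatch_all", delete: "PRIMARY" }

# Returns the transactions that form a wait-for cycle, or nil if there is none.
def deadlock_cycle(holds, waits)
  owner = holds.invert # lock name => transaction currently holding it
  holds.each_key do |start|
    path = [start]
    current = owner[waits[start]]
    while current
      return path if current == start   # walked back to where we began: a cycle
      break if path.include?(current)   # a cycle that does not include start
      path << current
      current = owner[waits[current]]
    end
  end
  nil
end

p deadlock_cycle(holds, waits)
# => [:insert, :delete]

# If both transactions took PRIMARY first, the second one would simply wait.
p deadlock_cycle({ insert: "PRIMARY" }, { delete: "PRIMARY" })
# => nil
```

The second call shows why a consistent lock order avoids the problem: one transaction blocks until the other finishes, but no cycle forms.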
Ah! Haha, so easy to do. We'll be glad to test that fix with our situation. I'll watch for a new version. Thank you!
from solid_queue.
Thank you so much! I just shipped version 0.2.2 with this fix, as I didn't have time to wrap up the recurring jobs PR, so I decided to just extract that fix. Could you try this one and see if you still encounter the deadlock? 🙏 Thank you!
from solid_queue.
Hey @andbar, sorry for the delay! I haven't forgotten about this one, but I've been working on other stuff in the last few weeks. I'm back looking at this, and I wonder if #199 might help in your case. Would you mind trying that one out if you're still using Solid Queue and experiencing this problem?
from solid_queue.
Hey @paulhtrott, ohhhh, thanks for letting me know! It's a bummer it didn't solve it completely, but I'll take the reduction. I'll merge that and start using it in HEY, as we're going to increase the number of scheduled jobs per day by ~4M tomorrow, so hopefully we'll hit this ourselves and that will give me more info to think of other solutions.
from solid_queue.
Hi @rosa, we have had zero luck diagnosing our deadlock issue. This is how our architecture is structured:
- Docker Container deployed to 4 instances
- Single MariaDB instance (trilogy gem)
Our errors do not show much detail beyond the following, plus a stack trace (attached):
- ActiveRecord::Deadlocked: Trilogy::ProtocolError: 1213: Deadlock found when trying to get lock; try restarting transaction
- solid_queue:start : ActiveRecord::Deadlocked: Trilogy::ProtocolError: 1213: Deadlock found when trying to get lock; try restarting transaction
We have resorted to setting wait_until for most of our jobs. The delay seems to help on most occasions, but it is inconvenient in some cases.
Are there any other details that might be helpful for you?
stack_trace.txt
stack_trace_2.txt
db_logs.txt
from solid_queue.
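Since error 1213 itself says "try restarting transaction", one option for application code that hits it is a bounded retry with backoff. The sketch below is plain Ruby under stated assumptions: `DeadlockError` is a stand-in for ActiveRecord::Deadlocked so it runs without Rails, and `with_deadlock_retry` is a hypothetical helper, not part of Solid Queue or Rails. It would not help with deadlocks raised inside Solid Queue's own processes, as in the traces above.

```ruby
# Stand-in for ActiveRecord::Deadlocked so this sketch runs without Rails.
class DeadlockError < StandardError; end

# Hypothetical helper: re-run the block when it raises a deadlock error,
# sleeping with exponential backoff plus jitter between attempts.
def with_deadlock_retry(attempts: 3, base_delay: 0.05, error: DeadlockError)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue error
    raise if tries >= attempts # give up and re-raise the last error
    sleep(base_delay * (2**(tries - 1)) + rand * base_delay)
    retry
  end
end

# Usage: the block fails twice with a simulated deadlock, then succeeds.
result = with_deadlock_retry(base_delay: 0.001) do |attempt|
  raise DeadlockError, "Deadlock found when trying to get lock" if attempt < 3
  "ok after #{attempt} attempts"
end
puts result
```

The jitter keeps two retrying transactions from colliding again on the same schedule. The whole transaction has to be inside the block, because the database rolls back the deadlock victim.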
Ohhh, @paulhtrott, that's super helpful! This looks like a different kind of deadlock than the one I tried to fix! Let me try to put together another fix for this one.
from solid_queue.
That deadlock is the same one as #229, going to tweak that one a bit and ship.
from solid_queue.
@paulhtrott, I just released version 0.3.3 with a possible fix for this one deadlock: #240.
from solid_queue.
Thank you @rosa , we will give that a try today, I'll report back after a couple of days 🎉
from solid_queue.
Oh, @paulhtrott, I realised something... The query done now to delete records from solid_queue_ready_executions should be using the primary key instead of job_id, so I wonder if the locks that both transactions are waiting for are still in the primary key index or in another index... If you have the output from the LATEST DETECTED DEADLOCK section when you run SHOW ENGINE INNODB STATUS, would you mind sharing it?
Thank you so much again 🙏
from solid_queue.
Hey @rosa! Sure, here is the output
from solid_queue.
@andbar, could you share the LATEST DETECTED DEADLOCK section if you run SHOW ENGINE INNODB STATUS in the DB server where you have Solid Queue tables?
from solid_queue.
Here you go. Thanks for looking into it. Let me know if you need any more info.
from solid_queue.
@andbar, I've looked into this and I see why the deadlock is happening, but it's not clear to me why the transaction (2) is locking 361 records in the index_solid_queue_dispatch_all index of solid_queue_scheduled_executions while only trying to delete 18 rows. Could you let me know:
- What version of Solid Queue you're running
- What version of MySQL you're running
- Your dispatcher configuration
Thank you! 🙏
from solid_queue.
Yep, here they are:
- Solid Queue: solid_queue (0.2.1)
- MySQL: 8.0.mysql_aurora.3.03.1
- Dispatcher configuration:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
from solid_queue.
Hi, @rosa, we deployed that new version and thought it might have fixed it but unfortunately we got some more deadlocks today. Here's the latest deadlock log from the db, hopefully it helps pinpoint what might be causing it.
from solid_queue.
Ouch 😞 Sorry about that, and thanks for the new debug info. I'll continue looking into it.
from solid_queue.
Hi @rosa, I just realized I hadn't responded to you yet, I apologize for that. We ended up moving away from Solid Queue to using something else that was just a better fit for our particular needs due to the nature of the jobs we're running, so unfortunately I won't be able to test that fix. I'm sorry!
from solid_queue.
Hi @andbar! Oh, no worries at all! Thanks so much for letting me know, really appreciate you taking the time to test and report this and your patience through the troubleshooting 🤗
from solid_queue.
Hi @rosa 👋🏽
We tried your fix #199 at Pressable, and it fixed our deadlock issue.
Rails 7.1.2
ruby 3.2.2
MariaDB 10.6.17 (trilogy)
Thank you! 🚀
from solid_queue.
Hi again @rosa. I spoke too soon. #199 did not completely get rid of our deadlock issue, but it did reduce it.
from solid_queue.
Hi @rosa I'm back sooner than I wanted to be 😆 . We are still having the issue after 0.3.3. Same two stack traces basically.
from solid_queue.
Ohh, bummer 😞 I'll continue working on it. Thanks a lot for letting me know!
from solid_queue.