Comments (17)
Hi @jacobcabantomski-ct,
You've raised a good point. I can't think of a good reason not to automatically attempt to resubscribe. Sorry if neglecting to do this caused you any issue. I think the best fix is for me to update the code to emit an event from Rascal's session object, and to handle this event in the same way I handle the amqplib channel and connection error events, i.e. something like...
// consume is called with a null message when the RabbitMQ cancels the subscription
if (!message) return session.emit('resubscribe');
and
function attachErrorHandlers(channel, session, config) {
  var connection = channel.connection;
  var removeErrorHandlers = _.once(function() {
    channel.removeListener('error', errorHandler);
    connection.removeListener('error', errorHandler);
    connection.removeListener('close', errorHandler);
    session.removeListener('resubscribe', errorHandler);
  });
  var errorHandler = _.once(handleChannelError.bind(null, channel, session, config, removeErrorHandlers));
  channel.once('error', errorHandler);
  connection.once('error', errorHandler);
  connection.once('close', errorHandler);
  session.once('resubscribe', errorHandler);
  return removeErrorHandlers;
}
I'll try to rework things a bit though, as the error handler assumes that a channel or connection error occurred and that there will be an error object.
from rascal.
Thanks @cressie176 . I'm not very familiar with Rascal's internal code but let me know if I can help in any way.
from rascal.
I took a look at this yesterday evening, but think it's slightly more complicated than I initially appreciated. Rascal's existing error handler assumes the previous channel was closed since the only way to invoke it is after a connection error or channel error. If I use this without modification I'll leak channels.
So my options are to close the channel, then get a new one and resubscribe, or to re-use the existing channel. I'd also like to test what happens with inflight messages when a queue is deleted. I've no idea what happens if you ack or nack under these circumstances.
from rascal.
Hi @jacobcabantomski-ct, still thinking about this, and curious as to how you're actually encountering the problem. I understand that the consumer cancellation occurs in two scenarios...
- If the queue being consumed is deleted
- If the queue is mirrored, and the node hosting the master queue fails, causing a mirrored queue to become the master
In the first scenario, the most obvious solution is to attempt resubscription with an exponential backoff. If the queue was deleted in error, hopefully it will be recreated, if not, the exponential backoff shouldn't tax the system.
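A minimal sketch of what resubscription with exponential backoff could look like. The names `backoffDelay`, `resubscribe`, `consume` and the default timings are hypothetical and chosen for illustration; this is not Rascal's actual retry implementation.

```javascript
// Hypothetical helper: delay in milliseconds before the nth attempt (0-based),
// doubling each time and capped at `max`.
function backoffDelay(attempt, { min = 1000, max = 60000, factor = 2 } = {}) {
  return Math.min(max, min * Math.pow(factor, attempt));
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical retry loop: `consume` is any function returning a promise that
// rejects while the queue is missing and resolves once consumption succeeds.
async function resubscribe(consume, { maxAttempts = Infinity, sleep = defaultSleep, ...backoff } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await consume();
    } catch (err) {
      // The queue is probably still missing, so wait longer before the next try
      await sleep(backoffDelay(attempt, backoff));
    }
  }
  throw new Error(`Gave up resubscribing after ${maxAttempts} attempts`);
}
```

With these defaults the waits grow as 1s, 2s, 4s, 8s and so on up to one minute, so a queue that never comes back costs roughly one attempt per minute.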
In the second scenario, I am surprised that you wouldn't receive a connection error and automatically resubscribe. I suppose it's possible for the node hosting the master queue to be accessible to the consumer, but not to other nodes in the cluster, and therefore be deemed unavailable, triggering the failover. In this case I'm not sure whether simply re-consuming or even re-connecting will have the desired effect. Unless you have a clearer understanding of what happens in this scenario, I'm probably going to have to post something to the RabbitMQ mailing list and see what they say.
from rascal.
Hi @cressie176
We encountered the two scenarios you described above. We have a production kubernetes RabbitMQ cluster (https://github.com/helm/charts/tree/master/stable/rabbitmq) in which several pods (services/instances) publish messages and another set of pods consume those messages.
When we upgraded our kubernetes cluster, it shut down and re-created each RabbitMQ node (pod) which caused the queues to be deleted. When our pods came back up, some had been running during that transition. The rascal config on one re-created the deleted queues and our publishers were able to publish messages. However, our consumers had been running when those RabbitMQ nodes were re-created and had their consumers canceled but did not attempt to re-subscribe as the cancel signal was ignored. We had to manually restart those consumers to get them running again.
We were able to resolve the majority of this issue by setting up a queue mirroring policy (hence why I asked in #53). In that case, when switchover to the mirrored queue takes place, rascal successfully re-connects and re-subscribes :)
However, there is still a case where, if multiple nodes go down and the queue and all its mirrors are lost but then re-created by something else, rascal will do nothing and the consumer will sit idle and need to be manually restarted.
In summary:
- Scenario 1 (queue deleted): still fails to re-subscribe, and I think your solution makes sense.
- Scenario 2 (mirrored queue failover): rascal notices the failover and re-subscribes, which worked perfectly for us.
I will also note that I think scenario 1 is not as likely as 2, and this issue with queue mirroring enabled is more a nice to have as a final redundancy in case of more widespread node failure or an entire RabbitMQ cluster failing and coming back up with consumers still running.
from rascal.
Thanks for the detailed explanation @jacobcabantomski-ct.
I set up a local clustered environment and found that, provided Rascal wasn't connected to the failing node, RabbitMQ handled the queue failover seamlessly. If I added x-cancel-on-ha-failover: true to the subscription arguments (which are ultimately passed to channel.consume), then as per spec, amqplib did emit a null message.
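A rough sketch of the behaviour described above, written against amqplib's `channel.consume(queue, handler, options)` directly rather than through Rascal. The wrapper name `consumeWithCancelNotification` and the `onMessage` / `onCancel` callbacks are hypothetical and exist only for this example.

```javascript
// Assumes `channel` is an amqplib channel. The wrapper and its callbacks are
// hypothetical names used only for this sketch.
function consumeWithCancelNotification(channel, queue, onMessage, onCancel) {
  // Ask RabbitMQ to cancel the consumer when a mirrored queue fails over
  const options = { arguments: { 'x-cancel-on-ha-failover': true } };

  return channel.consume(queue, (message) => {
    // amqplib invokes the handler with null when the broker cancels the consumer
    if (message === null) return onCancel(queue);
    onMessage(message);
  }, options);
}
```

Without the `x-cancel-on-ha-failover` argument the failover was handled transparently in the test described above, so the null message only matters when the consumer has opted in or the queue has been deleted.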
I don't think it will hurt to reconsume in this case, but going to do some testing to be sure. Sorry this is all taking so long.
from rascal.
@jacobcabantomski-ct FYI I haven't forgotten about this, but found it's slightly more complicated than I thought, and haven't had time to pick it up.
from rascal.
@cressie176 No problem, appreciate what time you have spent on it. Take as long as you need.
from rascal.
Still thinking this one through. I'm a bit concerned that resubscribing will fail if the queue was actually deleted. It would be possible to re-initialise the vhost to re-create the queue, but that would completely disconnect from the broker and potentially disrupt operations on other channels. It could also be annoying if someone deleted the queue deliberately.
An alternative approach would be to emit a "cancel" notification. At least this way you could catch this and log / manually resubscribe / bounce the broker. Thoughts?
from rascal.
Adding some more clarity to my proposal...
On receiving a basic cancel from the broker, rascal will
- Cancel the subscription (entails closing the channel after allowing some time for ack/nack to be sent)
- Emit a cancel event with an error: Received consumer cancel while subscribed to queue: $queue. No more messages will be received.
- If there is no cancel handler, emit an error event with the same message.
from rascal.
@cressie176 Looks good to me, would be easy to act upon.
from rascal.
I have a working implementation. I changed the behaviour to...
When the broker sends a consumer cancellation (which amqplib delivers as a null message)
- Close the channel, but keep the subscription as is, so existing listeners will continue to work
- Emit a cancelled event on the subscription with appropriate error
- If no cancelled event listener is registered, emit an error event with appropriate error
- If retries are configured, attempt to re-consume, but do not automatically re-create the queue
I believe the above will work in the failover scenario you described and indefinitely retry while emitting error events if a regular queue was deleted. If the deleted queue is manually recreated Rascal will automatically resubscribe.
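The "emit cancelled, otherwise fall back to error" step can be sketched with a plain Node.js `EventEmitter` standing in for the subscription. The function name `notifyCancelled` is hypothetical and the error text is borrowed from the earlier proposal in this thread; this is not Rascal's actual source.

```javascript
const { EventEmitter } = require('events');

// Hypothetical sketch: `subscription` is any EventEmitter standing in for
// the object returned to the application when it subscribes.
function notifyCancelled(subscription, queue) {
  const err = new Error(
    `Received consumer cancel while subscribed to queue: ${queue}. No more messages will be received.`
  );
  // Prefer the dedicated event, but fall back to 'error' so that the
  // cancellation cannot go unnoticed when nobody listens for 'cancelled'
  const event = subscription.listenerCount('cancelled') > 0 ? 'cancelled' : 'error';
  subscription.emit(event, err);
  return event;
}

// Usage: an application that only cares about logging the cancellation
const subscription = new EventEmitter();
subscription.on('cancelled', (err) => console.warn(err.message));
notifyCancelled(subscription, 'my_queue');
```

Note that an `EventEmitter` throws when `error` is emitted with no `error` listener attached, so under this sketch an application that handles neither event would crash rather than sit idle.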
from rascal.
Published as [email protected]
from rascal.
@cressie176 Thank you, appreciate your hard work on the improvement! I'm about to head out on a sabbatical, but when I'm back (~late January) I will be bumping our Rascal version, resolving breaking changes, and testing.
For what it's worth, we've been running on [email protected] in production for almost half a year now with queue mirroring through multiple k8s upgrades (i.e. RabbitMQ restarts) and have had no issues. Excited to bring this in and close that last known failure scenario for our use case.
from rascal.
Thanks for your patience @jacobcabantomski-ct and sorry it took so long. Glad Rascal has been working out for you. If you're OK I'll close the issue optimistically, but if you have problems post a comment and I'll re-open.
from rascal.
Assuming yes, but if no just let me know.
from rascal.
@cressie176 Upgraded to v8.0.0 and your changes work perfectly for the node failure case. Thanks again for the updates!
from rascal.