
Comments (17)

cressie176 avatar cressie176 commented on June 14, 2024

Hi @jacobcabantomski-ct,
You've raised a good point. I can't think of a good reason not to automatically attempt to resubscribe. Sorry if neglecting to do this caused you any issues. I think the best fix is to update the code to emit an event from Rascal's session object, and to handle this event the same way I handle the amqplib channel and connection error events, i.e. something like...

// the consume callback is invoked with a null message when RabbitMQ cancels the subscription
if (!message) return session.emit('resubscribe');

and

  function attachErrorHandlers(channel, session, config) {
    var connection = channel.connection;
    var removeErrorHandlers = _.once(function() {
      channel.removeListener('error', errorHandler);
      connection.removeListener('error', errorHandler);
      connection.removeListener('close', errorHandler);
      session.removeListener('resubscribe', errorHandler);
    });
    var errorHandler = _.once(handleChannelError.bind(null, channel, session, config, removeErrorHandlers));
    channel.once('error', errorHandler);
    connection.once('error', errorHandler);
    connection.once('close', errorHandler);
    session.once('resubscribe', errorHandler);
    return removeErrorHandlers;
  }

I'll try to rework things a bit though, as the error handler assumes that a channel or connection error occurred and that there will be an error object.

from rascal.

 avatar commented on June 14, 2024

Thanks @cressie176 . I'm not very familiar with Rascal's internal code but let me know if I can help in any way.


cressie176 avatar cressie176 commented on June 14, 2024

I took a look at this yesterday evening, but think it's slightly more complicated than I initially appreciated. Rascal's existing error handler assumes the previous channel was closed since the only way to invoke it is after a connection error or channel error. If I use this without modification I'll leak channels.

So my options are to close the channel, then get a new one and resubscribe, or to re-use the existing channel. I'd also like to test what happens with inflight messages when a queue is deleted. I've no idea what happens if you ack or nack under these circumstances.


cressie176 avatar cressie176 commented on June 14, 2024

Hi @jacobcabantomski-ct, I'm still thinking about this, and curious as to how you're actually encountering the problem. I understand that consumer cancellation occurs in two scenarios...

  1. If the queue being consumed is deleted
  2. If the queue is mirrored, and the node hosting the master queue fails, causing a mirrored queue to become the master

In the first scenario, the most obvious solution is to attempt resubscription with an exponential backoff. If the queue was deleted in error, hopefully it will be recreated; if not, the exponential backoff shouldn't tax the system.
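
As a rough illustration of the backoff idea above, here is a minimal sketch. The helper names (`backoffDelay`, `resubscribeWithBackoff`), the `resubscribe` callback, and the default timings (1s base, doubling, 60s cap) are all assumptions made for this example; none of them are Rascal APIs or settings.

```javascript
// Hypothetical helper: delay in ms before resubscription attempt `attempt` (0-based).
// The defaults are illustrative assumptions, not Rascal configuration values.
function backoffDelay(attempt, { min = 1000, factor = 2, max = 60000 } = {}) {
  return Math.min(max, min * Math.pow(factor, attempt));
}

// Hypothetical retry loop: `resubscribe` is any function that calls back with
// an error when the attempt fails (e.g. because the queue no longer exists).
// `schedule` defaults to setTimeout and is injectable to make testing easier.
function resubscribeWithBackoff(resubscribe, options = {}, attempt = 0, schedule = setTimeout) {
  resubscribe((err) => {
    if (!err) return; // resubscribed successfully, stop retrying
    const delay = backoffDelay(attempt, options);
    schedule(() => resubscribeWithBackoff(resubscribe, options, attempt + 1, schedule), delay);
  });
}
```

Capping the delay keeps the retry rate low if the queue never comes back, while still picking the queue up reasonably soon after it is recreated.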

In the second scenario, I am surprised that you wouldn't receive a connection error and automatically resubscribe. I suppose it's possible for the node hosting the master queue to be accessible to the consumer, but not to other nodes in the cluster, and therefore be deemed unavailable, triggering the failover. In this case I'm not sure whether simply re-consuming or even re-connecting will have the desired effect. Unless you have a clearer understanding of what happens in this scenario, I'm probably going to have to post something to the RabbitMQ mailing list and see what they say.


 avatar commented on June 14, 2024

Hi @cressie176

We encountered the two scenarios you described above. We have a production kubernetes RabbitMQ cluster (https://github.com/helm/charts/tree/master/stable/rabbitmq) in which several pods (services/instances) publish messages and another set of pods consume those messages.

When we upgraded our kubernetes cluster, it shut down and re-created each RabbitMQ node (pod) which caused the queues to be deleted. When our pods came back up, some had been running during that transition. The rascal config on one re-created the deleted queues and our publishers were able to publish messages. However, our consumers had been running when those RabbitMQ nodes were re-created and had their consumers canceled but did not attempt to re-subscribe as the cancel signal was ignored. We had to manually restart those consumers to get them running again.

We were able to resolve the majority of this issue by setting up a queue mirroring policy (which is why I asked in #53). In that case, when switchover to the mirrored queue takes place, rascal successfully re-connects and re-subscribes :)

However, there is still a case where, if multiple nodes go down and the queue and all its mirrors are lost but are then re-created by something else, rascal will do nothing and the consumer will sit idle until it is manually restarted.

In summary:

  1. Still fails to re-subscribe, and I think your solution makes sense.
  2. rascal notices mirrored queue failover and re-subscribes, which worked perfectly for us.

I will also note that I think scenario 1 is not as likely as 2, and this issue with queue mirroring enabled is more a nice to have as a final redundancy in case of more widespread node failure or an entire RabbitMQ cluster failing and coming back up with consumers still running.


cressie176 avatar cressie176 commented on June 14, 2024

Thanks for the detailed explanation @jacobcabantomski-ct.

I set up a local clustered environment and found that, provided Rascal wasn't connected to the failing node, RabbitMQ handled the queue failover seamlessly. If I added x-cancel-on-ha-failover: true to the subscription arguments (which are ultimately passed to channel.consume), then as per spec, amqplib did emit a null message.
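
For readers unfamiliar with this behaviour, the sketch below shows the general shape at the amqplib level: consumer arguments are passed through the `arguments` option of `channel.consume`, and the callback receives `null` when the broker cancels the consumer. The function name `consumeWithCancelNotification` and the `onCancel` callback are hypothetical names for this example, and `channel` is assumed to be an amqplib channel (or any object exposing the same `consume` method).

```javascript
// Sketch only: wraps channel.consume so that a broker-initiated cancel
// (delivered by amqplib as a null message) is reported via `onCancel`.
function consumeWithCancelNotification(channel, queue, onMessage, onCancel) {
  const options = {
    // Ask RabbitMQ to cancel this consumer when a mirrored queue fails over
    arguments: { 'x-cancel-on-ha-failover': true },
  };
  return channel.consume(queue, (message) => {
    // amqplib invokes the callback with null when the broker cancels the consumer
    if (message === null) return onCancel(queue);
    onMessage(message);
  }, options);
}
```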

I don't think it will hurt to reconsume in this case, but going to do some testing to be sure. Sorry this is all taking so long.


cressie176 avatar cressie176 commented on June 14, 2024

@jacobcabantomski-ct FYI I haven't forgotten about this, but I found it's slightly more complicated than I thought, and I haven't had the time to pick it up.


 avatar commented on June 14, 2024

@cressie176 No problem, appreciate what time you have spent on it. Take as long as you need.


cressie176 avatar cressie176 commented on June 14, 2024

Still thinking this one through. I'm a bit concerned that resubscribing will fail if the queue was actually deleted. It would be possible to re-initialise the vhost to re-create the queue, but that would completely disconnect from the broker and could disrupt operations on other channels. It could also be annoying if someone deleted the queue deliberately.

An alternative approach would be to emit a "cancel" notification. At least this way you could catch this and log / manually resubscribe / bounce the broker. Thoughts?


cressie176 avatar cressie176 commented on June 14, 2024

Adding some more clarity to my proposal...

On receiving a basic cancel from the broker, rascal will

  • Cancel the subscription (entails closing the channel after allowing some time for ack/nack to be sent)
  • Emit a cancel event with the error "Received consumer cancel while subscribed to queue: $queue. No more messages will be received."
  • If there is no cancel handler, emit an error event with the same message.


 avatar commented on June 14, 2024

@cressie176 Looks good to me, would be easy to act upon.


cressie176 avatar cressie176 commented on June 14, 2024

I have a working implementation. I changed the behaviour to...

When the broker sends a consumer cancellation (which amqplib delivers as a null message)

  1. Close the channel, but keep the subscription as is, so existing listeners will continue to work
  2. Emit a cancelled event on the subscription with appropriate error
  3. If no cancelled event listener is registered emit an error event with appropriate error
  4. If retries are configured, attempt to re-consume, but do not automatically re-create the queue

I believe the above will work in the failover scenario you described and indefinitely retry while emitting error events if a regular queue was deleted. If the deleted queue is manually recreated Rascal will automatically resubscribe.
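
The four steps above can be sketched with a plain Node.js EventEmitter standing in for the subscription. This is an illustration of the described behaviour, not Rascal's actual source: `handleConsumerCancel`, `closeChannel` and `reconsume` are hypothetical names, and the error text is borrowed from the earlier proposal in this thread.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the broker-cancel handling described above (not Rascal's code).
// `subscription` is an EventEmitter; `closeChannel` and `reconsume` are
// hypothetical callbacks supplied by the caller. Pass a falsy `reconsume`
// to model "no retries configured".
function handleConsumerCancel(subscription, queue, closeChannel, reconsume) {
  const err = new Error(`Received consumer cancel while subscribed to queue: ${queue}`);
  closeChannel(); // 1. close the channel but keep the subscription object and its listeners
  if (subscription.listenerCount('cancelled') > 0) {
    subscription.emit('cancelled', err); // 2. notify 'cancelled' listeners
  } else {
    // 3. fall back to 'error'; note that Node throws if no 'error' listener is registered
    subscription.emit('error', err);
  }
  if (reconsume) reconsume(); // 4. attempt to re-consume; the queue is not re-created here
}
```

On the application side this suggests registering a `cancelled` listener on the subscription if you want to log or react to the cancellation without it surfacing as an error.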


cressie176 avatar cressie176 commented on June 14, 2024

Published as [email protected]


 avatar commented on June 14, 2024

@cressie176 Thank you, appreciate your hard work on the improvement! I'm about to head out on a sabbatical, but when I'm back (~late January) I will be bumping our Rascal version, resolving breaking changes, and testing.

For what it's worth, we've been running on [email protected] in production for almost half a year now with queue mirroring through multiple k8s upgrades (i.e. RabbitMQ restarts) and have had no issues. Excited to bring this in and close that last known failure scenario for our use case.


cressie176 avatar cressie176 commented on June 14, 2024

Thanks for your patience @jacobcabantomski-ct, and sorry it took so long. Glad Rascal has been working out for you. If you're OK with it, I'll close the issue optimistically, but if you have problems, post a comment and I'll re-open.


cressie176 avatar cressie176 commented on June 14, 2024

Assuming yes, but if no just let me know.


 avatar commented on June 14, 2024

@cressie176 Upgraded to v8.0.0 and your changes work perfectly for the node failure case. Thanks again for the updates!

