[BUG] : sudden cessation of message consumption using ServiceBusProcessorClient in multiple Java applications (azure-sdk-for-java, CLOSED)

Poseithon commented on July 24, 2024
[BUG] : sudden cessation of message consumption using ServiceBusProcessorClient in multiple Java applications

from azure-sdk-for-java.

Comments (13)

github-actions commented on July 24, 2024

@anuchandy @conniey @lmolkova


anuchandy commented on July 24, 2024

Hello @Poseithon, is it possible to capture the TCP-level traffic when the application transitions to zombie mode? One tool I’m aware of is Wireshark, where we can filter traffic by IP address (of the gateway in this case). What we’re looking for is whether any TCP-level disconnect signal arrives at Docker’s network stack from the gateway peer. The result of the processor-level connection-active check depends on whether the underlying network stack reported any connection drop.


anuchandy commented on July 24, 2024

Hi @Poseithon, closing a client instance will not close the connection if there are other client instances actively using that connection.
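
One way to picture this behavior is a connection that counts its active users and closes only when the last one leaves. The sketch below is a toy model written for illustration, not the SDK's implementation; `SharedConnectionModel` and `ClientModel` are hypothetical names that do not exist in azure-sdk-for-java.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for a connection shared by several clients (hypothetical, not an SDK class). */
final class SharedConnectionModel {
    private final AtomicInteger users = new AtomicInteger();
    private volatile boolean open = true;

    /** Registers a new client as a user of this connection. */
    ClientModel newClient() {
        users.incrementAndGet();
        return new ClientModel(this);
    }

    /** Called when a client closes; the connection closes only with its last user. */
    void release() {
        if (users.decrementAndGet() == 0) {
            open = false;
        }
    }

    boolean isOpen() {
        return open;
    }
}

/** Toy stand-in for a processor client (hypothetical, not an SDK class). */
final class ClientModel implements AutoCloseable {
    private final SharedConnectionModel connection;
    private boolean closed;

    ClientModel(SharedConnectionModel connection) {
        this.connection = connection;
    }

    @Override
    public synchronized void close() {
        if (!closed) {          // closing twice must not release twice
            closed = true;
            connection.release();
        }
    }
}

final class SharedConnectionDemo {
    public static void main(String[] args) {
        SharedConnectionModel shared = new SharedConnectionModel();
        ClientModel a = shared.newClient();
        ClientModel b = shared.newClient();
        a.close();
        System.out.println(shared.isOpen()); // true: b still uses the connection
        b.close();
        System.out.println(shared.isOpen()); // false: the last user is gone
    }
}
```

In this model, restarting only `a` would never replace a stale connection while `b` keeps it in use.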

If we expect a fair amount of traffic going forward (after the piloting stage), it would be better to use a dedicated builder per client rather than a shared builder. This ensures each client instance gets a dedicated underlying async engine, so any sudden peak in operations or time-consuming IO activity in one client will not stall other clients. It also narrows down the investigation scope if any application issues arise. Reference.

In a microservice setup, I’ve seen 1-container:1-processor as a more common pattern than running multiple processors in one container. Another learning from customer cases is that low core allocation in a microservice environment can prevent the SDK from performing certain internal time-sensitive activities on time, leading to loss or dead-lettering of messages. The "Open JDK Team" at Microsoft recommends (based on their user study of thousands of containerized Java apps in Azure) two or more cores and strongly discourages selecting anything less than 1 core.
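
The core recommendation above can be checked at startup. This is a minimal sketch, assuming a container-aware JVM where `Runtime.availableProcessors()` reflects the container's CPU limit; `CoreCheck` and its threshold constant are hypothetical names chosen for illustration.

```java
/** Startup check for the core count visible to the JVM (hypothetical helper). */
final class CoreCheck {
    /** Minimum core count suggested in the comment above. */
    static final int RECOMMENDED_MIN_CORES = 2;

    /** Returns true when the given core count meets the recommendation. */
    static boolean meetsRecommendation(int cores) {
        return cores >= RECOMMENDED_MIN_CORES;
    }

    public static void main(String[] args) {
        // On container-aware JVMs this value typically reflects the container CPU limit.
        int cores = Runtime.getRuntime().availableProcessors();
        if (!meetsRecommendation(cores)) {
            System.err.println("Only " + cores + " core(s) visible to the JVM; "
                + "time-sensitive SDK work may be delayed.");
        }
    }
}
```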


anuchandy commented on July 24, 2024

I see, thanks for the clarification 👍. I’ll go ahead and close the other ticket.


Poseithon commented on July 24, 2024

Hello @anuchandy

Thanks a lot. I'm sure it will help the community. Thank you.


anuchandy commented on July 24, 2024

I’m closing this, given that the non-recovery problem is resolved with TLSv1.2+, the troubleshooting guideline has been updated with details on TLSv1.0, and a PR to ProtonJ was opened (for TLSv1.0), which is external to the azure-sdk repo.


Poseithon commented on July 24, 2024

Hello,

We have understood where the zombie mode comes from. It occurs when we update properties of the gateway application. The new configuration is first loaded on the passive node and then on the next node, where our connections are active. However, it seems that this is not detected by the mechanism in the ServiceBusProcessorClient responsible for checking whether the connection is active. It seems that this check runs on a boundedElastic scheduler. The connection_ID remains unchanged.

We still need your help to understand why the disconnection is not detected.


Poseithon commented on July 24, 2024

Hello @anuchandy ,

Yes, we'll try to capture TCP packets during a gateway application property update from a non-productive environment. We are still in the process of validating this hypothesis, and reproducing zombie mode in a non-productive env., but we'll be sure to make a capture.

Having said that, I was left with the following question: if we create two ServiceBusProcessorClient instances with the same ServiceBusClientBuilder, then according to the documentation they will share the same connection. Does closing one of them by calling close() and then restarting it close and re-create the connection for both?

Thank you again for taking the time to address our concerns.


Poseithon commented on July 24, 2024

Thank you very much.

The reason I ask is that we all decided to implement the following workaround: if no message is received for five minutes, we close the processor and start it again. It worked for everyone except one team whose builder was shared by two processors.
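
A minimal sketch of such an idle watchdog is shown here, under stated assumptions: `RestartableProcessor` and `IdleWatchdog` are hypothetical names, and how `stop()` and `start()` map onto the real processor client is left to the caller. The clock is injected so the logic can be exercised without waiting five minutes.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/** Hypothetical abstraction over a processor that can be stopped and started. */
interface RestartableProcessor {
    void stop();
    void start();
}

/** Restarts the processor when no message has arrived within the idle limit. */
final class IdleWatchdog {
    private final RestartableProcessor processor;
    private final long idleLimitNanos;
    private final LongSupplier nanoClock;   // injectable for testing
    private volatile long lastMessageNanos;

    IdleWatchdog(RestartableProcessor processor, Duration idleLimit, LongSupplier nanoClock) {
        this.processor = processor;
        this.idleLimitNanos = idleLimit.toNanos();
        this.nanoClock = nanoClock;
        this.lastMessageNanos = nanoClock.getAsLong();
    }

    /** Call from the message handler each time a message is received. */
    void onMessage() {
        lastMessageNanos = nanoClock.getAsLong();
    }

    /** Call periodically; returns true if a restart was triggered. */
    boolean check() {
        long now = nanoClock.getAsLong();
        if (now - lastMessageNanos < idleLimitNanos) {
            return false;
        }
        processor.stop();
        processor.start();
        lastMessageNanos = now;   // give the restarted processor a full idle window
        return true;
    }
}
```

In an application, the clock would be `System::nanoTime` and `check()` would run on a schedule, for example from a `ScheduledExecutorService`. A queue that is legitimately quiet for longer than the limit also triggers a restart, so the limit has to fit the expected traffic.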

So you've just confirmed what we thought and given us a clear explanation. Thank you very much.

We're also going to follow your advice to have one dedicated builder per processor. And for the docker configuration, we'll make sure there are at least 2 CPUs.

The next step is to understand why we have these zombie modes. So I'm going to do what you told me, and try to capture TCP packets when an app gateway is updated. So we're going to try to set up an environment that's just like production, with a dedicated service bus and a dedicated gateway application.


Poseithon commented on July 24, 2024

Hello @anuchandy

We have finally been able to reproduce the zombie mode in a non-production environment and have identified the root cause. It is related to the SSL policy. The version in production was lower than in other environments.

When the App Gateway is set to TLSv1.0 and a property is modified, we do receive a connection termination, but it seems this is not propagated to the application level.

Here are the logs. Please note that the update was made at 11:42:09 AM and that nothing is received after 11:42:27. We kept the application running until 12:30 PM and nothing happened.

Wireshark: [image]

SDK: [image]

Apologies for not providing raw logs directly; they appear disordered when I copy and paste them here, so I'm attaching an image instead.

We will correct the App Gateway to upgrade the TLS version.

However, we can see that even though the connection termination was acknowledged by the application, nothing is logged. Shouldn't we have an onConnectionShutDown event?


anuchandy commented on July 24, 2024

Hi @Poseithon, great job on the TLSv1.0 root cause analysis❤️.

In my App Gateway setup, I can see that with TLSv1.0 (AppGwSslPolicy20150501), the lower-level ProtonJ library (the open-source Apache AMQP library that the Azure Service Bus library uses) never signals termination when the FIN + ACK (+ RST) traffic from the peer arrives in the networking layer. Because of this, the Azure Service Bus library's ProtonJ hooks for detecting connection termination are never notified. The traffic I captured looks like what you shared. Like you, I modified one of the properties (Backend request timeout for port 443) to trigger this FIN + ACK (+ RST) traffic.

[image: TLSv1_0]

In fact, last month I noticed this (but never correlated it with TLSv1.0) and debugged it. I opened a changeset against the Apache ProtonJ AMQP library. Now that we have correlated this with TLSv1.0, I’m not sure whether the problem is in the Java TLS layer dealing with TLSv1.0 and the associated cipher suites. Interestingly, ProtonJ detects that the outbound is closed but never detects the inbound closure, leaving the transport half-closed. Only after a full closure will ProtonJ invoke the hooks (that the Service Bus SDK registered).

I’ve tried using App Gateway with TLSv1.2 (AppGwSslPolicy20220101) and triggered FIN + ACK (+ RST) traffic; this time the ProtonJ library went through a full closure and the Service Bus library recovered successfully. The traffic looks like below. It took about 40 seconds for the networking layer to signal this to ProtonJ and successfully complete the close.

[image: TLSv1_2]

I agree with your decision to upgrade App Gateway to TLSv1.2+. Azure services are phasing out TLSv1.0, and support will end on October 31st 2024 (announcement). Moving to 1.2 will help identify any potential issues, not only for the Service Bus SDK but for all Azure services the application relies on, and prevent any disruption in 6 months.
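
When validating such an upgrade, it can help to see which TLS version and cipher suite a Java client actually negotiates with an endpoint. The following is a rough sketch using the standard JSSE socket API; `TlsVersionProbe` is a hypothetical helper, and the host and port arguments are placeholders to be supplied by the reader. Note that JSSE reports TLS 1.0 as `TLSv1`. The probe only shows what this JVM negotiates, not the gateway's configured minimum policy, and recent JDKs that disable TLS 1.0 and 1.1 by default may fail the handshake instead of reporting a legacy version.

```java
import java.io.IOException;
import java.util.Set;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Reports the negotiated TLS version and cipher suite (hypothetical helper). */
final class TlsVersionProbe {
    // JSSE protocol names; TLS 1.0 is reported as "TLSv1".
    private static final Set<String> LEGACY = Set.of("SSLv3", "TLSv1", "TLSv1.1");

    /** Returns true for protocol versions older than TLS 1.2. */
    static boolean isLegacy(String protocol) {
        return LEGACY.contains(protocol);
    }

    /** Performs a handshake with host:port and returns "protocol / cipher suite". */
    static String negotiated(String host, int port) throws IOException {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            socket.startHandshake();
            SSLSession session = socket.getSession();
            return session.getProtocol() + " / " + session.getCipherSuite();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = endpoint host, args[1] = port (both supplied by the reader).
        String result = negotiated(args[0], Integer.parseInt(args[1]));
        String protocol = result.substring(0, result.indexOf(' '));
        System.out.println(result + (isLegacy(protocol) ? " (legacy, upgrade advised)" : ""));
    }
}
```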

As far as TLSv1.0 + Apache ProtonJ is concerned, I’ll update the Apache Jira ticket with these details. Please note that the ProtonJ library is maintained not by the Azure SDK team but by the Apache community. I’m unsure when the community experts will get to that Jira ticket, or whether they will lower its priority considering the general shift towards higher TLS versions. But I believe we've now identified the solution (TLS upgrade) to unblock the work.


Poseithon commented on July 24, 2024

Hello @anuchandy ,

Thank you very much for this exchange. I forgot to mention it in my ticket, but this problem was also reported in this ticket: #40020. It was opened by my colleague, who was working to solve the same issue.

You had already mentioned a potential bug in the proton-j library. However, we realized that this ticket was too vague. Indeed, the ticket mentioned a timeout issue, but in reality, it was just a non-significant side effect. So we decided to open this one.

Thank you again for your support and help.


anuchandy commented on July 24, 2024

We’ve published an official Trouble-Shooting-Guideline section about this - https://learn.microsoft.com/en-us/azure/developer/java/sdk/troubleshooting-messaging-service-bus-overview#clients-halt-when-using-application-gateway-custom-endpoint so that it's easier to find for anyone who faces a similar problem.

