
Comments (10)

judgej avatar judgej commented on July 4, 2024

This is the type of message the queue worker gets (sometimes) when the queue connection drops:

Fail:
Code: 500
Value: Operation could not be completed within the specified time.
details (if any): <?xml version="1.0" encoding="utf-8"?><Error><Code>OperationTimedOut</Code> <Message>Operation could not be completed within the specified time.
RequestId:789e0000-0000-0036-59dd-d70517000000
Time:2018-04-19T12:51:27.6988826Z</Message></Error>.

I think that is handed over to Laravel as a queue message, which Laravel then attempts to run as a job (which it shouldn't, but Laravel's lack of validation of queue messages is another bugbear of mine). This locks up the queue worker, which can only be killed from the Linux command line. Not sure what it's doing - blocking while waiting for its non-existent child worker to finish, I suspect.

Anyway, I need to do some more tests of long-running connections, but if what I think is happening is correct, then inspecting the messages that the Azure queue library gives to this connector, and looking for some text that indicates the connection has dropped, would allow it to reconnect and not pass the message back to Laravel as a job.

"Fail:\nCode: %s\nValue: %s\ndetails (if any): %s." == AZURE_ERROR_MSG == ServiceException

I guess "details" is the raw XML data returned by the API, and "Value" is the human-readable response phrase.
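To make that concrete, here is a minimal standalone PHP sketch of that message template and a detector for the timeout case. `formatAzureError()` and `looksLikeAzureTimeout()` are illustrative names, not library functions; only the `sprintf` template itself comes from the library's `AZURE_ERROR_MSG` constant.

```php
<?php
// Sketch of the AZURE_ERROR_MSG template quoted above, plus a hypothetical
// detector for the timeout case. Requires PHP 8+ for str_contains().

function formatAzureError(int $code, string $value, string $details): string
{
    // Same sprintf template as the library's AZURE_ERROR_MSG constant.
    return sprintf("Fail:\nCode: %s\nValue: %s\ndetails (if any): %s.", $code, $value, $details);
}

function looksLikeAzureTimeout(string $message): bool
{
    // "details" carries the raw XML error body; OperationTimedOut is the
    // machine-readable error code inside it.
    return str_contains($message, 'OperationTimedOut');
}

$msg = formatAzureError(
    500,
    'Operation could not be completed within the specified time.',
    '<Error><Code>OperationTimedOut</Code></Error>'
);

var_dump(looksLikeAzureTimeout($msg)); // bool(true)
```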

from azure-queue-laravel.


judgej avatar judgej commented on July 4, 2024

Let me know what I can do to help. There are a lot of pins set up in the chain here, and some things just have to be left to run for a few hours before the problem is observed. So if there are specific points where it would be helpful for me to capture data, I can do that.


judgej avatar judgej commented on July 4, 2024

I also think this is something that Laravel should take on board to a certain extent, perhaps allowing the connector to return a "connection lost" exception so that Laravel can ask for a reconnection before trying again. Or something like that.


judgej avatar judgej commented on July 4, 2024

Noticed the Azure PHP storage libraries have all had a point release in the last day. Not sure if the new version tackles any of the timeout/auto-reconnect issues.


squigg avatar squigg commented on July 4, 2024

I've had a look into this, and with what you are seeing it appears there may be two different scenarios happening:

  1. cURL is simply failing to connect to the Azure service:
  • cURL raises an exception for error 7 "Couldn't connect"
  • GuzzleHttp raises a ConnectException for this, setting the message to the format cURL error %s: %s (%s)
  • Within the Azure PHP Storage library the promise is rejected; the HTTP response is null as it never connected, so the library re-throws the exception, which makes its way up to your top error handler.
  2. The Azure service is returning an HTTP 500 error response for OperationTimedOut:
  • cURL completes the request and returns the HTTP 500 error
  • GuzzleHttp completes the promise with the 500 error and HTTP response body
  • The Azure PHP Storage library is not expecting a 500 response from the server, and raises a ServiceException (see throwIfError in ServiceRestProxy)
  • The Azure PHP Storage library formats the exception message to the format Fail:\nCode: %s\nValue: %s\ndetails (if any): %s.
  • Laravel receives this exception in the Worker's getNextJob method, where it is caught and reported. As the error message does not match any of the strings in causedByLostConnection, the worker does not quit and keeps running.

The Laravel workers operate on a polling frequency rather than maintaining open connections, however cURL could potentially be holding connections open under the hood, and the default polling of every 3 seconds could be sufficient to keep a connection open. Laravel already tries to re-attempt what it understands to be lost connections for queues, but this doesn't cover this specific scenario.

Options I see are:

  1. I can handle the specific error in the pop method of AzureQueue and re-throw an exception that Laravel understands, which will cause it to kill the worker and terminate the connection
  2. I can submit a PR to the Laravel framework to get these messages added to the causedByLostConnection method, which will have the same effect as option 1

I'm not sure either will fix your first error, as that seems like a standard connection error. This could be Azure refusing the connection due to too many concurrent requests, or just an occasional network issue.
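For what it's worth, option 1 could be sketched roughly like this. This is illustrative only and untested against the real package: `rethrowIfLostConnection()` is a hypothetical helper, and the assumption is that prefixing the message with "Lost connection" (one of the substrings Laravel's DetectsLostConnections trait matches on) is enough for the worker to treat it as a lost connection.

```php
<?php
// Illustrative sketch of option 1: catch the Azure timeout around
// AzureQueue::pop() and rethrow with a message that Laravel's
// causedByLostConnection() recognises.

function rethrowIfLostConnection(\Throwable $e): void
{
    if (str_contains($e->getMessage(), 'OperationTimedOut')) {
        // Prefix with a string Laravel matches, keeping the original as $previous.
        throw new \RuntimeException('Lost connection: ' . $e->getMessage(), 0, $e);
    }

    throw $e;
}

// Simulated Azure failure, as it would surface from the storage library:
try {
    rethrowIfLostConnection(new \RuntimeException(
        "Fail:\nCode: 500\nValue: Operation could not be completed within the specified time.\n"
        . "details (if any): <Error><Code>OperationTimedOut</Code></Error>."
    ));
} catch (\RuntimeException $e) {
    echo $e->getMessage(), "\n"; // now prefixed with "Lost connection: "
}
```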


squigg avatar squigg commented on July 4, 2024

Also, I am not 100% sure whether the latter HTTP 500 error is actually a formal timeout / lost connection, so we may get some pushback from the Laravel maintainers if it is just a transient error where the Azure service itself timed out and failed.


judgej avatar judgej commented on July 4, 2024

Thanks, that's some great analysis, very much appreciated.

I think you may be right about connections being held open. If Laravel is polling the queue every few seconds, and it is opening a new connection each time, then if connections are not closed properly we would certainly have a problem building up. Without restarting the queue worker, it would take anywhere between four and six hours before it effectively freezes. I could kill the workers (I have three of them) with a kill signal, but artisan queue:restart would not work, so I guess they were blocking on something - waiting for a spare connection slot to be freed up, perhaps? I'm only guessing, with limited knowledge on the Azure side.

The long delays I was seeing in pushing to the queues could very well be a symptom of the same problem. If the queue workers have taken all available connection slots, then opening a connection to push a message could be blocking (somewhere in the route to the queue). Then we end up with a kind of deadlock.

If this is what is happening, then restarting the queue workers every ten minutes, which I am doing now, is probably the best workaround for now.

So, the options. I personally have an issue with Laravel's single list of "connection lost" messages for the database connections, which is hard-coded and needs to cater for a wide range of databases. IMO those lists should live in the individual database connectors, and should be extendable to provide easy fixes for specific cases, plus the ability to add messages in other languages (I cannot fathom how the non-English locales are coping with lost connections). So the queue workers also support lost connection handling? I wasn't aware of that. If so, then keeping it in this queue connector makes more sense to me (option 1). This connector knows about Azure, and should be telling the Laravel framework what to do :-)

I'm not an Azure expert, but I can get someone else to monitor our connection pool. If connections are growing every few seconds, then that would give us some clues for what needs to be addressed in the longer term.

Looking at the Azure code last week, it seemed that just about any failure to talk to the remote queue resulted in a 500 exception. I suppose it is reasonable to always rethrow this as "connection lost", as a restart of the queue worker is likely to clear out any crap that has been building up by restarting the process.


judgej avatar judgej commented on July 4, 2024

Just looking at the lost connection detection in illuminate/queue, and I realise it just uses the database lost connection detection:

https://github.com/illuminate/queue/blob/9c063c804f6ab0596c68c41830496029fbca31f7/Worker.php#L9

That's a bit WTF, TBH. It may be relevant for the database queue connector, and even then only for error messages that it knows about (it lacks many), but it is completely inappropriate for non-database connectors. This is why there will be push-back on adding to that list - it's a list of database errors.
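For reference, the check boils down to a substring match. A self-contained sketch of what causedByLostConnection does (the needle list below is a short excerpt from Laravel's DetectsLostConnections trait, not the full set):

```php
<?php
// Sketch of Laravel's causedByLostConnection() (via the DetectsLostConnections
// trait used by the queue Worker): a plain substring match of the exception
// message against a hard-coded list of *database* error strings.

function causedByLostConnection(\Throwable $e): bool
{
    $needles = [
        'server has gone away',
        'no connection to the server',
        'Lost connection',
        'is dead or not enabled',
        'Error while sending',
    ];

    foreach ($needles as $needle) {
        if (str_contains($e->getMessage(), $needle)) {
            return true;
        }
    }

    return false;
}

// The Azure timeout message matches none of these, so the worker carries on:
$azure = new \RuntimeException(
    'Fail: Code: 500 Value: Operation could not be completed within the specified time.'
);
var_dump(causedByLostConnection($azure)); // bool(false)
```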


michliu avatar michliu commented on July 4, 2024

I had the same problem and wanted to know how to solve it.

Same 500 timeout error:

Fail: Code: 500 Value: Operation could not be completed within the specified time. details (if any): OperationTimedOutOperation could not be completed within the specified time. RequestId:ffa1d05b-c003-00c1-205f-105061000000 Time:2020-04-12T00:14:23.0928233Z.

Error file:
/home/site/wwwroot/vendor/microsoft/azure-storage-common/src/Common/Internal/ServiceRestProxy.php

