Giter Site home page Giter Site logo

Comments (9)

pquentin avatar pquentin commented on June 15, 2024

@kquick Would you accept a PR that tries to disable Nagle's algorithm but suppresses any OSError?

from thespian.

kquick avatar kquick commented on June 15, 2024

Hi, @pquentin. Thanks for the bug report, and sorry for not responding earlier. I'm happy to receive a PR on this, although I have been looking into it (although intermittently due to being busy with a couple of other things).

This is a bit of an unfortunate issue, since it's not clear to me when it would be safe to perform the TCP_NODELAY setting. I've got a branch in progress (https://github.com/thespianpy/Thespian/tree/issue70) that dismisses the OSError, but I'm concerned about the performance effects of not having this set. I have not found an OS trigger that would be reasonable to use, so I'd been looking into whether I could add some retries for this to the internal socket management state machine.

I also don't have access to a Mac for a couple of weeks, so I'm only able to verify that these changes don't disrupt normal behavior under Linux. Since this is a bug that causes a Thespian failure, and you are encountering this in production, I'm willing to do a Thespian release with just the error dismissal, and follow up later with any TCP_NODELAY re-attempts for performance, but I'd like to ask if you would be willing to test both of these for me on a Mac?

from thespian.

pquentin avatar pquentin commented on June 15, 2024

Quick answer from my phone: no worries! We're not using this in production, only on developer machines, and we have an easy workaround: comment out the offending line.

We'd be happy to test any fix you come up within the next few weeks/months. Thanks.

from thespian.

inqueue avatar inqueue commented on June 15, 2024

Hi @kquick, I work with @pquentin and ran into the issue again today after forgetting to use our documented workaround in a new environment. It seems to reproduce reliably with Rally on macOS and would be happy to run any tests you like.

The differences I see between successful and unsuccessful executions (between macOS and Linux) are the protocol number and the missing raddr from lsock at

lsock, rmtTxAddr = self.new_socket(self.socket.accept)
during a failure. The value of lsock at various stages of execution is in elastic/rally#1575 (comment).

Success (Linux)

76.9s _acceptNewIncoming: "lsock: "='lsock: ',
      lsock=<socket.socket fd=31, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.2.15', 40081), raddr=('10.0.2.15', 52944)>
76.9s _acceptNewIncoming: "rmtTxAddr: "='rmtTxAddr: ',
      rmtTxAddr=('10.0.2.15', 52944)

Failure (macOS)

21.2s _acceptNewIncoming: "lsock: "='lsock: ',
      lsock=<socket.socket fd=60, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 59173)>
21.2s _acceptNewIncoming: "rmtTxAddr: "='rmtTxAddr: ', 
      rmtTxAddr=('192.168.5.116', 59224)

from thespian.

kquick avatar kquick commented on June 15, 2024

Hi @inqueue , thanks for the additional information. You refer to a "documented workaround"... is that the use of Thespian on the https://github.com/thespianpy/Thespian/tree/issue70 branch or something else you are doing?

Your additional information leads me to suspect connection attempts by some other source other than a Thespian client. I've pushed an additional change to the branch above to address this situation. If you could try this and confirm that it (a) doesn't hang as in the failure cases, and (b) still seems to reliably process all connections and data in your configuration, then I would be comfortable merging this and making a new Thespian release.

from thespian.

inqueue avatar inqueue commented on June 15, 2024

Hi Kevin,

You refer to a "documented workaround"... is that the use of Thespian on the https://github.com/thespianpy/Thespian/tree/issue70 branch or something else you are doing?

Commenting the following line has been our workaround:

lsock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

On macOS, the failure case can be reliably switched off/on by adding and removing the changes in https://github.com/thespianpy/Thespian/tree/issue70. I also tested it on Linux for completeness and it looks good there as well. +1 to merge it. Thank you!

from thespian.

kquick avatar kquick commented on June 15, 2024

Thespian 3.10.7 is released now with these changes. Please let me know if you have any issues with that release.

from thespian.

inqueue avatar inqueue commented on June 15, 2024

Hi @kquick, we did encounter one error on macOS when using 3.10.7:

DEBUG:esrally.actor:Checking capabilities [{'Thespian ActorSystem Name': 'multiprocQueueBase', 'Thespian ActorSystem Version': 1, 'Python Version': (3, 8, 13, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1673978163702'}] against requirements [{'coordinator': True}] failed.
2023-01-17 12:56:03.718155 p54867 I    ++++ Admin started @ ActorAddr-Q.ThespianQ / gen (3, 10)
2023-01-17 12:56:03.725607 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.ThespianQ.a, <class 'thespian.system.messages.multiproc.LoggerConnected'> msg: <thespian.system.messages.multiproc.LoggerConnected object at 0x1037035e0>)
2023-01-17 12:56:03.726543 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999773-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected obje
ct at 0x103703730>-quit_0:04:59.999752)
2023-01-17 12:56:03.728853 p54772 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999663-<class 'thespian.system.messages.admin.QueryExists'>-<thespian.system.messages.admin.QueryExists object at 0x1036
f6970>-quit_0:04:59.999650)
2023-01-17 12:56:03.729617 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.QueryExists'> msg: <thespian.system.messages.admin.QueryExists object at 0x103703820>)
2023-01-17 12:56:03.730150 p54867 dbg  Attempting intent TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999803-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:
04:59.999792)
2023-01-17 12:56:03.730473 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999474-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:
04:59.999466)
2023-01-17 12:56:03.733756 p54772 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999929-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#1_of_None-quit_0:04:59.999915)
2023-01-17 12:56:03.734321 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~1, <class 'thespian.system.messages.admin.PendingActor'> msg: PendingActor#1_of_None)
2023-01-17 12:56:03.734686 p54867 I    Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-Q.~1
2023-01-17 12:56:03.735352 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999929-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 10, /Users/jbryan/dev/pr-review/rally/esrally/actor.py,
122, "Checking capabilities [{'Thespian ActorSystem Name': 'multipr...-quit_0:04:59.999918)
- 2023-01-17 12:56:03.735849 p54867 Warn no system has compatible capabilities for Actor esrally.racecontrol.BenchmarkActor
2023-01-17 12:56:03.736326 p54867 dbg  Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999822-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999811)
2023-01-17 12:56:03.736646 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999507-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999498)
2023-01-17 12:56:06.743766 p54772 I    ActorSystem shutdown requested.
2023-01-17 12:56:06.745121 p54772 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999463-<class 'thespian.system.messages.admin.SystemShutdown'>-<thespian.system.messages.admin.SystemShutdown object at 0x1036f67c0>-quit_0:04:59.999383)
2023-01-17 12:56:06.746577 p54867 dbg  Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.SystemShutdown'> msg: <thespian.system.messages.admin.SystemShutdown object at 0x1037152e0>)
2023-01-17 12:56:06.747781 p54867 dbg  ---- shutdown initiated by ActorAddr-Q.~
2023-01-17 12:56:06.749025 p54867 dbg  Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999569-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.999532)
2023-01-17 12:56:06.749700 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.998863-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.998844)
2023-01-17 12:56:06.750640 p54867 dbg  actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999896-<class 'thespian.system.logdirector.LoggerExitRequest'>-<thespian.system.logdirector.LoggerExitRequest object a
t 0x103715340>-quit_0:04:59.999879)

I also hit this error during testing when using a patched 3.10.6 with the changes from https://github.com/thespianpy/Thespian/tree/issue70. Interestingly, this error did not surface with 3.10.1 (the version we are currently on) with the two issue70 commits, which is the same version the changes were validated against. I should have stuck with 3.10.6 and reported the error.

from thespian.

kquick avatar kquick commented on June 15, 2024

In the above example, you are using the multiprocQueueBase, whereas issue70 fixed the multiprocTCPBase, so these are unrelated. The error above indicates that it tried to find an actor system with certain capabilities (probably { 'coordinator': True }) that didn't match the running actor system (whose capabilities are unfortunately truncated in the above logging (see "Checking capabilities ...) so I can't confirm whether this is an error or not, but it's likely the multiprocQueueBase actor system you started doesn't have these capabilities.

Perhaps you intended to start a multprocTCPBase rather than a multiprocQueueBase since the latter does not support multiple host systems or actor system conventions (see https://thespianpy.com/doc/using.html#outline-container-hH-2a5fa63d-e6eb-43b9-bea8-47223b27544e for more details), but if this was intentional, please open a new issue and we can work to resolve your problem.

from thespian.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.