Comments (9)
@kquick Would you accept a PR that tries to disable Nagle's algorithm but suppresses any OSError?
from thespian.
Hi, @pquentin. Thanks for the bug report, and sorry for not responding earlier. I'm happy to receive a PR on this, though I have been looking into it myself (intermittently, since I've been busy with a couple of other things).
This is a bit of an unfortunate issue, since it's not clear to me when it would be safe to perform the TCP_NODELAY setting. I've got a branch in progress (https://github.com/thespianpy/Thespian/tree/issue70) that dismisses the OSError, but I'm concerned about the performance effects of not having this set. I have not found an OS trigger that would be reasonable to use, so I'd been looking into whether I could add some retries for this to the internal socket management state machine.
I also don't have access to a Mac for a couple of weeks, so I'm only able to verify that these changes don't disrupt normal behavior under Linux. Since this is a bug that causes a Thespian failure, and you are encountering this in production, I'm willing to do a Thespian release with just the error dismissal, and follow up later with any TCP_NODELAY re-attempts for performance, but I'd like to ask if you would be willing to test both of these for me on a Mac?
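For readers following along, the "error dismissal" idea can be sketched as below. This is a minimal illustration, not the code on the issue70 branch: the helper name `try_set_nodelay` is hypothetical, and returning a boolean so that a caller could schedule a later re-attempt is an assumption about how retries might be wired in.

```python
import socket


def try_set_nodelay(sock: socket.socket) -> bool:
    """Try to disable Nagle's algorithm on sock.

    Returns True if TCP_NODELAY was applied, False if the OS refused.
    (Hypothetical helper; not part of Thespian.)
    """
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    except OSError:
        # The option could not be applied (for example, the socket is
        # already closed). Swallow the error and report it to the caller,
        # which may choose to retry later or carry on without the option.
        return False
    return True
```

A caller could record the `False` result and re-attempt the setting later, which keeps the failure from being fatal while still leaving room to recover the latency benefit.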
from thespian.
Quick answer from my phone: no worries! We're not using this in production, only on developer machines, and we have an easy workaround: comment out the offending line.
We'd be happy to test any fix you come up with in the next few weeks/months. Thanks.
from thespian.
Hi @kquick, I work with @pquentin and ran into the issue again today after forgetting to use our documented workaround in a new environment. It seems to reproduce reliably with Rally on macOS, and I would be happy to run any tests you like.
The differences I see between successful and unsuccessful executions (between macOS and Linux) are the protocol number and the missing raddr from lsock. The state of lsock at various stages of execution is in elastic/rally#1575 (comment).
Success (Linux)
76.9s _acceptNewIncoming: "lsock: "='lsock: ',
lsock=<socket.socket fd=31, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.0.2.15', 40081), raddr=('10.0.2.15', 52944)>
76.9s _acceptNewIncoming: "rmtTxAddr: "='rmtTxAddr: ',
rmtTxAddr=('10.0.2.15', 52944)
Failure (macOS)
21.2s _acceptNewIncoming: "lsock: "='lsock: ',
lsock=<socket.socket fd=60, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.5.116', 59173)>
21.2s _acceptNewIncoming: "rmtTxAddr: "='rmtTxAddr: ',
rmtTxAddr=('192.168.5.116', 59224)
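The fields compared above (proto, laddr, raddr) can be collected from any socket with a small helper like the one below. It is only a diagnostic sketch: the name `describe_socket` is hypothetical, and treating a failing `getpeername()` as "no remote address" is an assumption about why raddr is absent in the macOS output.

```python
import socket


def _addr_or_none(getter):
    """Call an address getter, returning None if the OS reports an error."""
    try:
        return getter()
    except OSError:
        return None


def describe_socket(sock: socket.socket) -> dict:
    """Collect the fields that differ between the two log excerpts.

    (Hypothetical diagnostic helper; not part of Thespian.)
    """
    return {
        "proto": sock.proto,
        "laddr": _addr_or_none(sock.getsockname),
        # getpeername() raises OSError when the socket has no connected
        # peer, which shows up here as raddr=None.
        "raddr": _addr_or_none(sock.getpeername),
    }
```

Logging this dictionary right after `accept()` would show whether the peer is still attached at the point where socket options are applied.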
from thespian.
Hi @inqueue, thanks for the additional information. You refer to a "documented workaround"... is that the use of Thespian on the https://github.com/thespianpy/Thespian/tree/issue70 branch or something else you are doing?
Your additional information leads me to suspect connection attempts by some source other than a Thespian client. I've pushed an additional change to the branch above to address this situation. If you could try this and confirm that it (a) doesn't hang as in the failure cases, and (b) still seems to reliably process all connections and data in your configuration, then I would be comfortable merging this and making a new Thespian release.
from thespian.
Hi Kevin,
You refer to a "documented workaround"... is that the use of Thespian on the https://github.com/thespianpy/Thespian/tree/issue70 branch or something else you are doing?
Commenting out the following line has been our workaround:
On macOS, the failure case can be reliably switched off/on by adding and removing the changes in https://github.com/thespianpy/Thespian/tree/issue70. I also tested it on Linux for completeness and it looks good there as well. +1 to merge it. Thank you!
from thespian.
Thespian 3.10.7 is released now with these changes. Please let me know if you have any issues with that release.
from thespian.
Hi @kquick, we did encounter one error on macOS when using 3.10.7:
DEBUG:esrally.actor:Checking capabilities [{'Thespian ActorSystem Name': 'multiprocQueueBase', 'Thespian ActorSystem Version': 1, 'Python Version': (3, 8, 13, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1673978163702'}] against requirements [{'coordinator': True}] failed.
2023-01-17 12:56:03.718155 p54867 I ++++ Admin started @ ActorAddr-Q.ThespianQ / gen (3, 10)
2023-01-17 12:56:03.725607 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.ThespianQ.a, <class 'thespian.system.messages.multiproc.LoggerConnected'> msg: <thespian.system.messages.multiproc.LoggerConnected object at 0x1037035e0>)
2023-01-17 12:56:03.726543 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999773-<class 'thespian.system.messages.multiproc.EndpointConnected'>-<thespian.system.messages.multiproc.EndpointConnected object at 0x103703730>-quit_0:04:59.999752)
2023-01-17 12:56:03.728853 p54772 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999663-<class 'thespian.system.messages.admin.QueryExists'>-<thespian.system.messages.admin.QueryExists object at 0x1036f6970>-quit_0:04:59.999650)
2023-01-17 12:56:03.729617 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.QueryExists'> msg: <thespian.system.messages.admin.QueryExists object at 0x103703820>)
2023-01-17 12:56:03.730150 p54867 dbg Attempting intent TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999803-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:04:59.999792)
2023-01-17 12:56:03.730473 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~-pending-ExpiresIn_0:04:59.999474-<class 'thespian.system.messages.admin.QueryAck'>-<thespian.system.messages.admin.QueryAck object at 0x103703be0>-quit_0:04:59.999466)
2023-01-17 12:56:03.733756 p54772 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999929-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#1_of_None-quit_0:04:59.999915)
2023-01-17 12:56:03.734321 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~1, <class 'thespian.system.messages.admin.PendingActor'> msg: PendingActor#1_of_None)
2023-01-17 12:56:03.734686 p54867 I Pending Actor request received for esrally.racecontrol.BenchmarkActor reqs {'coordinator': True} from ActorAddr-Q.~1
2023-01-17 12:56:03.735352 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999929-<class 'logging.LogRecord'>-<LogRecord: esrally.actor, 10, /Users/jbryan/dev/pr-review/rally/esrally/actor.py, 122, "Checking capabilities [{'Thespian ActorSystem Name': 'multipr...-quit_0:04:59.999918)
- 2023-01-17 12:56:03.735849 p54867 Warn no system has compatible capabilities for Actor esrally.racecontrol.BenchmarkActor
2023-01-17 12:56:03.736326 p54867 dbg Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999822-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999811)
2023-01-17 12:56:03.736646 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999507-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3586 actual None-quit_0:04:59.999498)
2023-01-17 12:56:06.743766 p54772 I ActorSystem shutdown requested.
2023-01-17 12:56:06.745121 p54772 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ-pending-ExpiresIn_0:04:59.999463-<class 'thespian.system.messages.admin.SystemShutdown'>-<thespian.system.messages.admin.SystemShutdown object at 0x1036f67c0>-quit_0:04:59.999383)
2023-01-17 12:56:06.746577 p54867 dbg Admin of ReceiveEnvelope(from: ActorAddr-Q.~, <class 'thespian.system.messages.admin.SystemShutdown'> msg: <thespian.system.messages.admin.SystemShutdown object at 0x1037152e0>)
2023-01-17 12:56:06.747781 p54867 dbg ---- shutdown initiated by ActorAddr-Q.~
2023-01-17 12:56:06.749025 p54867 dbg Attempting intent TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.999569-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.999532)
2023-01-17 12:56:06.749700 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.~1-pending-ExpiresIn_0:04:59.998863-<class 'thespian.system.messages.admin.PendingActorResponse'>-PendingActorResponse(for ActorAddr-Q.~1 inst# 1) errCode 3585 actual None-quit_0:04:59.998844)
2023-01-17 12:56:06.750640 p54867 dbg actualTransmit of TransportIntent(ActorAddr-Q.ThespianQ.a-pending-ExpiresIn_0:04:59.999896-<class 'thespian.system.logdirector.LoggerExitRequest'>-<thespian.system.logdirector.LoggerExitRequest object at 0x103715340>-quit_0:04:59.999879)
I also hit this error during testing when using a patched 3.10.6 with the changes from https://github.com/thespianpy/Thespian/tree/issue70. Interestingly, this error did not surface with 3.10.1 (the version we are currently on) with the two issue70 commits, which is the same version the changes were validated against. I should have stuck with 3.10.6 and reported the error.
from thespian.
In the above example, you are using the multiprocQueueBase, whereas issue70 fixed the multiprocTCPBase, so these are unrelated. The error above indicates that it tried to find an actor system with certain capabilities (probably { 'coordinator': True }) that didn't match the running actor system. The running system's capabilities are unfortunately truncated in the above logging (see "Checking capabilities ..."), so I can't confirm whether this is an error or not, but it's likely the multiprocQueueBase actor system you started doesn't have these capabilities.
Perhaps you intended to start a multiprocTCPBase rather than a multiprocQueueBase, since the latter does not support multiple host systems or actor system conventions (see https://thespianpy.com/doc/using.html#outline-container-hH-2a5fa63d-e6eb-43b9-bea8-47223b27544e for more details), but if this was intentional, please open a new issue and we can work to resolve your problem.
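The kind of mismatch described here can be illustrated with a standalone sketch of a capability check. This is not Thespian's or Rally's actual implementation; the function name `capabilities_satisfy` is hypothetical, and the exact-equality matching rule is an assumption. The capability values are copied from the debug line quoted earlier in the thread.

```python
def capabilities_satisfy(capabilities: dict, requirements: dict) -> bool:
    """Return True only if every required key is present with an equal value.

    (Hypothetical stand-in for an actor's capability check.)
    """
    return all(
        capabilities.get(name) == value
        for name, value in requirements.items()
    )


# Capabilities reported in the log above; note there is no 'coordinator' key.
queue_base_caps = {
    "Thespian ActorSystem Name": "multiprocQueueBase",
    "Thespian ActorSystem Version": 1,
    "Python Version": (3, 8, 13, "final", 0),
    "Thespian Generation": (3, 10),
}
requirements = {"coordinator": True}

print(capabilities_satisfy(queue_base_caps, requirements))  # False
print(capabilities_satisfy({**queue_base_caps, "coordinator": True}, requirements))  # True
```

Under this reading, the "no system has compatible capabilities" warning follows directly from the running actor system not advertising `coordinator`, independent of which transport is in use.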
from thespian.