Comments (17)
It is a thread of the runtime. I think it should not be too difficult to add an updateHeartbeat method to the runtime. One could then call it manually.
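A minimal sketch of what such a manually driven heartbeat could look like. Everything here is hypothetical: the class name `ManualHeartbeat`, its methods, and the use of `std::chrono::steady_clock` are assumptions for illustration and not the iceoryx runtime API.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical sketch, not the iceoryx runtime API.
// The application stores a timestamp whenever it wants to signal that it is alive,
// instead of relying on a background thread being scheduled.
class ManualHeartbeat
{
  public:
    using Clock = std::chrono::steady_clock;

    // Record the current time as the last sign of life.
    void updateHeartbeat() noexcept
    {
        updateHeartbeat(Clock::now());
    }

    // Overload with an explicit timestamp, mainly useful for testing.
    void updateHeartbeat(Clock::time_point now) noexcept
    {
        m_lastBeatNs.store(toNs(now), std::memory_order_relaxed);
    }

    // Time that has passed between the last recorded beat and 'now'.
    std::chrono::nanoseconds elapsedSinceLastBeat(Clock::time_point now) const noexcept
    {
        return std::chrono::nanoseconds(toNs(now) - m_lastBeatNs.load(std::memory_order_relaxed));
    }

  private:
    static std::int64_t toNs(Clock::time_point t) noexcept
    {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t.time_since_epoch()).count();
    }

    std::atomic<std::int64_t> m_lastBeatNs{0};
};
```

The application would call `updateHeartbeat()` from its own main loop, so a missed beat then indicates that the loop itself is stuck rather than that a helper thread was starved.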
from iceoryx.
@niclar this is related to #325. There is a mutex involved for the history feature. RouDi locks the mutex and can then insert the samples into the queue without interference from the publisher. This mutex is only contended when subscribers are added or removed, but the publisher nevertheless holds the lock for a short amount of time when publishing. If the application is terminated during this time and RouDi tries to clean up the resources of the process, it finds the corrupt mutex and terminates. I once tried to fix the problem, but unfortunately the data guarded by the mutex might also be corrupted, and more refactoring would have been necessary to fix it fully. Other things came up, and the issue never really happened in our setup.
This should actually only happen if either an application died or if RouDi assumed that an application died. But then you should see something like "Application foo not responding (last response 1509 milliseconds ago) --> removing it". If the monitoring is turned off, then RouDi does not detect the dead application immediately but only when the application tries to re-register. But then there is also a log message. If you don't see any log messages for removing or re-registering an application, then I'm a bit at a loss and it could be anything, including cosmic rays.
from iceoryx.
Looking at the logs again, it looks like RouDi incorrectly assumed that 4 processes had died beforehand. I don't know why it would assume this, though. Is there a starvation issue in the detection?
from iceoryx.
@niclar this is related to #1361. I already improved the situation on master
from iceoryx.
It looks like that fix (2023-11-01) was already in the code we are running (2023-11-23).
from iceoryx.
Oh, I just saw v2.0.x and thought it came from the release branch.
Could the system time have been changed abruptly, so that time jumps occur?
Another scenario that might lead to this error is that the threads which update the heartbeat value are not scheduled.
A workaround might be to massively increase the PROCESS_KEEP_ALIVE_TIMEOUT. Unfortunately, it currently cannot be changed via a CMake parameter, but that would not be difficult to add.
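To make the two suspected causes concrete, here is a small sketch of a timeout check on the monitoring side. The function `hasTimedOut` and the constant `HYPOTHETICAL_KEEP_ALIVE_TIMEOUT` are made up for illustration. The 1500 ms value is an assumption chosen only to be in the same range as the log line quoted earlier, and the code does not reflect how RouDi actually implements its check.

```cpp
#include <chrono>

// Hypothetical value for illustration only; not taken from the iceoryx sources.
constexpr std::chrono::milliseconds HYPOTHETICAL_KEEP_ALIVE_TIMEOUT{1500};

// Returns true if more than 'timeout' has passed between the last heartbeat and 'now'.
// Both time points must come from the same clock.
template <typename TimePoint, typename Duration>
bool hasTimedOut(TimePoint lastBeat, TimePoint now, Duration timeout)
{
    return (now - lastBeat) > timeout;
}
```

If both time points came from a wall clock such as `std::chrono::system_clock`, a forward jump of the system time by a few seconds would make an application that sent its heartbeat 100 ms ago look dead. With a monotonic clock such as `std::chrono::steady_clock`, that particular failure cannot occur, but a heartbeat thread that is not scheduled for longer than the timeout still triggers the removal. Increasing the timeout only makes that second case less likely.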
from iceoryx.
These thread(s) are part of the publisher(s) and not RouDi, right? I reckon they might have been starved in this instance. Some deadline-monotonic scheduling or similar would be nice to be able to plug in here.
from iceoryx.
This morning we encountered this fatal error again during (post) the morning restart routine of some iceoryx client processes.
from iceoryx.
Do you have more information, e.g. was the load during that time quite high?
from iceoryx.
CPU core utilization is low overall, with RouDi running on a dedicated, interrupt-isolated core. We received this again the other day, seemingly "spontaneously", a few hours after a midday restart of ~20 RouDi client processes.
from iceoryx.
Do you have some logs of spikes in the CPU utilization when the issue occurred?
from iceoryx.
...we've added core utilization logging now for next time it happens...
from iceoryx.
@elBoberido I looked at the issue and was wondering if we could remove the mutex completely if we sacrifice the history. If no history has to be delivered when a subscriber connects to a publisher, there should also be no need for a mutex.
@niclar Do you require the history when a subscriber connects?
Another solution to get rid of the mutex would be to transfer the samples without a lock (if this is possible), but then a user might receive the same sample twice. This could only happen when connecting to a new publisher, though (and it can maybe be filtered out with sequence numbers?!)
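The sequence-number idea could be sketched roughly as follows. This is a hypothetical receiver-side filter, not existing iceoryx code. It assumes that each publisher stamps its samples with a strictly increasing sequence number and that one filter instance is kept per publisher connection.

```cpp
#include <cstdint>

// Hypothetical sketch: drops samples whose sequence number was already seen.
// Assumes strictly increasing sequence numbers per publisher and
// one filter instance per publisher connection.
class SequenceFilter
{
  public:
    // Returns true if the sample is new and should be delivered to the user.
    bool accept(std::uint64_t sequenceNumber) noexcept
    {
        if (m_hasSeen && sequenceNumber <= m_lastSeen)
        {
            return false; // duplicate or stale sample, e.g. delivered twice during connect
        }
        m_lastSeen = sequenceNumber;
        m_hasSeen = true;
        return true;
    }

  private:
    std::uint64_t m_lastSeen{0};
    bool m_hasSeen{false};
};
```

A sample that arrives once via the history transfer and once via the regular publish path would pass `accept` only the first time.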
from iceoryx.
@elfenpiff, good news, we do not. All pub/sub are instantiated with historyCapacity=0 and historyRequest=0 respectively.
from iceoryx.
It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.
Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.
from iceoryx.
> Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.
We can mitigate this by turning monitoring off. But a better alternative would be a liveliness QoS that enforces some contracts, for example that a subscriber has to collect a sample after X seconds at the latest and that the publisher must publish a sample every Y seconds. But this would require some time-consuming refactoring.
> It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.
Yes, you are right. But this could be solved by bringing loffli into play together with a lock-free optional, where an atomic signals whether the value is set or not. This would also be a major refactoring, but I think it is solvable.
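To illustrate the "atomic signals whether the value is set" part, here is a deliberately reduced sketch. The class `SetOnceSlot` is hypothetical and far simpler than what iceoryx would need. It assumes a single writer that sets the value at most once, requires `T` to be default-constructible, and leaves out resetting the slot, because safely clearing a value while readers may still access it is exactly the hard part that would require the refactoring.

```cpp
#include <atomic>
#include <utility>

// Hypothetical sketch of a lock-free "optional" with set-once semantics.
// Assumptions: exactly one writer, set() is called at most once, no reset.
template <typename T>
class SetOnceSlot
{
  public:
    // Writer side: store the value first, then publish it via the flag.
    void set(T value)
    {
        m_value = std::move(value);
        m_isSet.store(true, std::memory_order_release);
    }

    // Reader side: returns a pointer to the value, or nullptr if it was not set yet.
    // The acquire load pairs with the release store in set(), so a reader that
    // sees the flag also sees the fully written value.
    const T* get() const noexcept
    {
        return m_isSet.load(std::memory_order_acquire) ? &m_value : nullptr;
    }

  private:
    T m_value{};
    std::atomic<bool> m_isSet{false};
};
```

Readers never block here: they either see `nullptr` or a fully written value, so a writer that dies in the middle of `set` leaves the slot empty rather than leaving a lock held.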
from iceoryx.
Well, the monitoring was explicitly turned on as far as I know 😅
from iceoryx.