Comments (17)

elBoberido avatar elBoberido commented on June 22, 2024 1

It is a thread of the runtime. I think it should not be too difficult to add an updateHeartbeat method to the runtime. One could then call it manually.
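A minimal sketch of what a manually updated heartbeat could look like. The class `ManualHeartbeat` and its methods are hypothetical and not part of the iceoryx API; the sketch only assumes that the application stores a monotonic timestamp which a monitor can read later.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical sketch, not iceoryx API: a heartbeat that the application
// updates from its own loop instead of relying on a background thread.
class ManualHeartbeat
{
  public:
    using Clock = std::chrono::steady_clock;

    // Called by the application whenever it wants to signal that it is alive.
    void update(Clock::time_point now = Clock::now()) noexcept
    {
        m_lastBeatNs.store(toNs(now), std::memory_order_relaxed);
    }

    // Called by the monitor: time that has passed since the last update.
    std::chrono::nanoseconds elapsed(Clock::time_point now = Clock::now()) const noexcept
    {
        return std::chrono::nanoseconds(toNs(now) - m_lastBeatNs.load(std::memory_order_relaxed));
    }

  private:
    static std::int64_t toNs(Clock::time_point t) noexcept
    {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t.time_since_epoch()).count();
    }

    // Starts with the construction time so a fresh heartbeat is not stale.
    std::atomic<std::int64_t> m_lastBeatNs{toNs(Clock::now())};
};
```

The trade-off is that the application becomes responsible for calling `update()` often enough; a blocked main loop would then look the same to the monitor as a dead process.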

from iceoryx.

elBoberido avatar elBoberido commented on June 22, 2024

@niclar this is related to #325. For the history feature there is a mutex involved. RouDi locks the mutex and can insert the samples into the queue without the interference of the publisher. This mutex is only contested when subscribers are added or removed but nevertheless, the publisher holds the lock for a short amount of time when publishing. If the application is terminated during this time and RouDi tries to clean up the resources of the process, it finds the corrupt mutex and terminates. I once tried to fix the problem but unfortunately, the data guarded by the mutex might also be corrupted and more refactoring would have been necessary to fully fix the problem. Other things came up and the issue never really happened in our setup.
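For illustration, POSIX offers robust mutexes for the case where an owner dies while holding a lock. The sketch below is not how iceoryx implements its mutex; `RobustMutex`, `lockRobust` and `lockAndExit` are hypothetical names, and a thread that exits without unlocking stands in for a terminated application. As noted above, recovering the lock does not repair the guarded data.

```cpp
#include <cerrno>
#include <pthread.h>

// Hypothetical wrapper around a POSIX robust mutex (not iceoryx code).
struct RobustMutex
{
    pthread_mutex_t handle;

    RobustMutex()
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // Robust: the next locker is told when the previous owner died.
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        // A mutex in shared memory would additionally need PTHREAD_PROCESS_SHARED.
        pthread_mutex_init(&handle, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    ~RobustMutex() { pthread_mutex_destroy(&handle); }
    RobustMutex(const RobustMutex&) = delete;
    RobustMutex& operator=(const RobustMutex&) = delete;
};

enum class LockResult { Acquired, RecoveredFromDeadOwner, Failed };

// Locks the mutex and reports whether the previous owner died while holding it.
LockResult lockRobust(pthread_mutex_t& mutex)
{
    const int rc = pthread_mutex_lock(&mutex);
    if (rc == 0)
    {
        return LockResult::Acquired;
    }
    if (rc == EOWNERDEAD)
    {
        // The guarded data may be half-written here and must be validated
        // or rebuilt by the caller; the mutex alone cannot do that.
        if (pthread_mutex_consistent(&mutex) == 0)
        {
            return LockResult::RecoveredFromDeadOwner;
        }
    }
    return LockResult::Failed;
}

// Simulates an owner that terminates mid-publish: locks and never unlocks.
void* lockAndExit(void* arg)
{
    pthread_mutex_lock(static_cast<pthread_mutex_t*>(arg));
    return nullptr;
}
```

After `RecoveredFromDeadOwner` the caller holds the lock and still has to unlock it once the data has been checked.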

This should actually only happen if either an application died or if RouDi assumed that an application died. But then you should see something like "Application foo not responding (last response 1509 milliseconds ago) --> removing it". If the monitoring is turned off, then RouDi does not detect the dead application immediately but only when the application tries to re-register. But then there is also a log message. If you don't see any log messages for removing or re-registering an application, then I'm a bit at a loss and it could be anything, including cosmic rays.

from iceoryx.

niclar avatar niclar commented on June 22, 2024

Looking at the logs again, it looks like RouDi incorrectly assumed that 4 processes had died beforehand. I don't know why it would assume this, though. Is there a starvation issue in the detection?

from iceoryx.

elBoberido avatar elBoberido commented on June 22, 2024

@niclar this is related to #1361. I already improved the situation on master.

from iceoryx.

niclar avatar niclar commented on June 22, 2024

It looks like that fix (2023-11-01) was already in the code we are running (2023-11-23).

from iceoryx.

elBoberido avatar elBoberido commented on June 22, 2024

Oh, I just saw v2.0.x and thought it came from the release branch.

Is the system time changed abruptly so that jumps can occur?

Other scenarios that lead to this error might be when the threads which update the heartbeat value are not scheduled.

A workaround might be to massively increase the PROCESS_KEEP_ALIVE_TIMEOUT. Unfortunately, it currently cannot be changed via a CMake parameter, but that would not be difficult to add.
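To illustrate why a larger timeout helps against starved heartbeat threads, here is a minimal sketch of a keep-alive check. It is not RouDi's actual monitoring code; `ProcessInfo` and `findUnresponsive` are hypothetical, the timeout values are made up for the example, and a monotonic clock is assumed so that wall-clock jumps do not affect the result.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical bookkeeping entry for one registered application.
struct ProcessInfo
{
    std::string name;
    std::chrono::steady_clock::time_point lastHeartbeat;
};

// Returns the names of all processes whose last heartbeat is older than the timeout.
std::vector<std::string> findUnresponsive(const std::vector<ProcessInfo>& processes,
                                          std::chrono::steady_clock::time_point now,
                                          std::chrono::milliseconds keepAliveTimeout)
{
    std::vector<std::string> unresponsive;
    for (const auto& process : processes)
    {
        if (now - process.lastHeartbeat > keepAliveTimeout)
        {
            unresponsive.push_back(process.name);
        }
    }
    return unresponsive;
}
```

With a heartbeat that is 1509 ms old, a 1500 ms timeout would flag the process while a 5000 ms timeout would not, at the cost of detecting a truly dead process later.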

from iceoryx.

niclar avatar niclar commented on June 22, 2024

These thread(s) are part of the publisher(s) and not RouDi, right? I reckon they might have been starved in this instance. It would be nice to be able to plug in some deadline-monotonic scheduling or similar here.

from iceoryx.

niclar avatar niclar commented on June 22, 2024

This morning we encountered this fatal error again during (post) the morning restart routine of some iceoryx client processes.

from iceoryx.

elBoberido avatar elBoberido commented on June 22, 2024

Do you have more information, e.g. was the load during that time quite high?

from iceoryx.

niclar avatar niclar commented on June 22, 2024

CPU core utilization is low overall, with RouDi running on a dedicated, interrupt-isolated core. We received this again the other day, seemingly "spontaneously", a few hours after a midday restart of ~20 RouDi client processes.

from iceoryx.

elBoberido avatar elBoberido commented on June 22, 2024

Do you have some logs of spikes in the CPU utilization when the issue occurred?

from iceoryx.

niclar avatar niclar commented on June 22, 2024

...we've added core utilization logging now for next time it happens...

from iceoryx.

elfenpiff avatar elfenpiff commented on June 22, 2024

@elBoberido I looked at the issue and was wondering if we could remove the mutex completely if we sacrifice the history? If no history has to be delivered when a subscriber connects to a publisher, there should also be no need for a mutex.

@niclar Do you require the history when a subscriber connects?

Another solution to get rid of the mutex would be to transfer the samples without a lock (if this is possible), but then it is possible that a user receives the same sample twice. This would only happen when connecting to a new publisher (and it can maybe be filtered out with sequence numbers?!).
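A minimal sketch of such a sequence-number filter on the subscriber side. `Sample` and `DuplicateFilter` are hypothetical types, and the sketch assumes one filter per publisher, strictly increasing sequence numbers, and in-order delivery.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sample layout; only the sequence number matters here.
struct Sample
{
    std::uint64_t sequenceNumber;
    int payload;
};

// Drops samples that were already seen from the same publisher.
class DuplicateFilter
{
  public:
    // Returns true if the sample is new and should be handed to the user.
    bool accept(const Sample& sample) noexcept
    {
        if (m_lastSeen.has_value() && sample.sequenceNumber <= *m_lastSeen)
        {
            return false; // already delivered, e.g. once via history and once live
        }
        m_lastSeen = sample.sequenceNumber;
        return true;
    }

  private:
    std::optional<std::uint64_t> m_lastSeen; // empty until the first sample arrives
};
```

A sample that arrives twice, for example once during the lock-free handover and once through the regular path, would be accepted only the first time.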

from iceoryx.

niclar avatar niclar commented on June 22, 2024

@elfenpiff, good news, we do not. All pub/sub are instantiated with historyCapacity=0 and historyRequest=0 respectively.

from iceoryx.

elBoberido avatar elBoberido commented on June 22, 2024

It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

from iceoryx.

elfenpiff avatar elfenpiff commented on June 22, 2024

@elBoberido

> Furthermore, the POPO__CHUNK_LOCKING_ERROR is only a symptom. The issue is that RouDi kills running applications because of the heartbeat thread not being scheduled. At least that's my assumption.

We can mitigate this by turning monitoring off. But a better alternative would be to have a liveliness QoS that enforces some contracts, for example that a subscriber has to collect a sample at the latest after X seconds and that the publisher must publish a sample every Y seconds. But this would require some time-consuming refactoring.

> It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher.

Yes, you are right. But this could be solved by bringing loffli into play together with a lock-free optional where an atomic signals whether the value is set or not. This is also some major refactoring, but I think it is solvable.
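A rough sketch of the "atomic signals whether the value is set" part of that idea. `SetOnceSlot` is a hypothetical name, and the sketch is deliberately restricted: it assumes a single writer, a value that is set once and never removed, and a default-constructible, copy-assignable `T`. It does not cover loffli or the removal of subscriber queues.

```cpp
#include <atomic>

// Hypothetical single-writer, set-once slot (not iceoryx code).
template <typename T>
class SetOnceSlot
{
  public:
    // Must only be called from one thread. Returns false if already set.
    bool set(const T& value)
    {
        if (m_isSet.load(std::memory_order_relaxed))
        {
            return false;
        }
        m_value = value;
        // Release: the write to m_value becomes visible before the flag does.
        m_isSet.store(true, std::memory_order_release);
        return true;
    }

    // May be called from any thread. Returns nullptr while the slot is empty.
    const T* get() const
    {
        // Acquire pairs with the release store in set().
        return m_isSet.load(std::memory_order_acquire) ? &m_value : nullptr;
    }

  private:
    T m_value{};
    std::atomic<bool> m_isSet{false};
};
```

Supporting removal or multiple writers is the hard part, since a reader may still hold a pointer to the value while it is being cleared; that is presumably where the major refactoring would come in.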

from iceoryx.

elBoberido avatar elBoberido commented on June 22, 2024

Well, the monitoring was explicitly turned on as far as I know 😅

from iceoryx.
