
Comments (7)

elBoberido commented on May 25, 2024

@avikde oh, can you use SIGTERM instead of SIGINT? The latter might cause issues.

from iceoryx.

elBoberido commented on May 25, 2024

@avikde if an application is killed there is indeed the chance that it affects RouDi. Do you see this behavior also when the applications are terminated gracefully?

The "Falling back to built-in config" log message is weird. This should not happen during runtime. The same goes for the "Re-registering application" log message. Can you run RouDi with -l verbose and give us the log messages?

Can you check the content of /dev/shm once the problem occurs?
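For capturing that state automatically when a test run fails, a minimal sketch in Python could look like the following. The helper name list_shm is made up for illustration, and the sketch only lists regular files and their sizes; it does not know which entries belong to iceoryx.

```python
import os


def list_shm(directory="/dev/shm"):
    """Return a sorted list of (name, size_in_bytes) for regular files in `directory`.

    `list_shm` is a hypothetical helper for illustration, not part of iceoryx.
    """
    entries = []
    for entry in os.scandir(directory):
        # Skip subdirectories and symlinks; only plain files are reported.
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            entries.append((entry.name, size))
    return sorted(entries)
```

Calling list_shm() right after a failed restart and writing the result next to the RouDi log would show which shared memory files were left behind.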

Also, we are about to release iceoryx1 v3.0. Can you check if the problem still persists on the current master?


avikde commented on May 25, 2024

@elBoberido thanks for your response. Just FYI, I am trying to collect info with -l verbose now; the issue is still the unpredictability... We have a loop that constantly sends SIGINT to stop the software (which is how it is normally used) and then restarts it, and we may have to keep it running for a few days to see if the problem shows up again. We will have the log if it does.


avikde commented on May 25, 2024

Hi, @elBoberido thanks for your responses, I really appreciate it. I tried SIGTERM, but unfortunately it doesn't let our programs clean up resources nicely, and I don't think we can use it in our software release.

We set up a "restarting test" that does something like the following:

  • start roudi
  • start programs
  • SIGINT programs
  • SIGINT roudi
  • start roudi
  • start programs
  • pkill -9 programs
  • SIGINT roudi
  • loop
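A rough sketch of one iteration of such a loop in Python is shown below. It is not the actual test script: the function names are made up for illustration, and it signals the processes it started directly through their handles instead of calling pkill by name.

```python
import signal
import subprocess
import time


def start(cmd):
    """Launch a command (list of arguments) without waiting for it to finish."""
    return subprocess.Popen(cmd)


def stop(proc, sig, timeout=10.0):
    """Send `sig` to `proc`, wait for it to exit, and return its exit code.

    On POSIX a process terminated by signal N has the exit code -N.
    """
    proc.send_signal(sig)
    return proc.wait(timeout=timeout)


def run_cycle(roudi_cmd, program_cmds, program_signal,
              roudi_signal=signal.SIGINT, startup_delay=1.0):
    """Start RouDi, start the programs, stop the programs, then stop RouDi."""
    roudi = start(roudi_cmd)
    time.sleep(startup_delay)  # crude wait for RouDi to come up (assumption)
    programs = [start(cmd) for cmd in program_cmds]
    time.sleep(startup_delay)  # crude wait for the programs to register
    program_codes = [stop(p, program_signal) for p in programs]
    roudi_code = stop(roudi, roudi_signal)
    return program_codes, roudi_code
```

The full loop would then alternate run_cycle(roudi_cmd, program_cmds, signal.SIGINT) and run_cycle(roudi_cmd, program_cmds, signal.SIGKILL), where roudi_cmd could be ["iox-roudi", "-l", "verbose"] and program_cmds stands for whatever applications are under test.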

With this setup, we observed no failures after sending SIGINT to the programs, but 0.3% of the time after pkill -9 on the programs, we got this error:

2024-02-07 08:58:56.646 [ Info ]: No config file provided and also not found at /etc/iceoryx/roudi_config.toml. Falling back to built-in config.
Log level set to: [Verbose]
2024-02-07 08:58:56.648 [Verbose]: Command line parameters are:
Log level: Verbose
Monitoring mode: MonitoringMode::OFF
Compatibility check level: CompatibilityCheckLevel::PATCH
Unique RouDi ID: 123
Process kill delay: 45 s
Config file used is: < none >
Reserving 66798600 bytes in the shared memory [iceoryx_mgmt]
[ Reserving shared memory successful ]
2024-02-07 08:58:56.711 [ Debug ]: Registered memory segment 0x7fb3b1e000 with size 66798600 to id 1
Reserving 149264720 bytes in the shared memory [ghost]
[ Reserving shared memory successful ]
2024-02-07 08:58:56.843 [ Debug ]: Roudi registered payload data segment 0x7faacc4000 with size 149264720 to id 2
RouDi is ready for clients
...
2024-02-07 08:59:01.789 [ Debug ]: Joining Mon+Discover thread...
2024-02-07 08:59:01.855 [ Debug ]: ...Mon+Discover thread joined.
/home/ghost/builds/JMdqGtQ-/0/ghostrobotics/controls_deps/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:143 { bool iox::roudi::ProcessManager::requestShutdownOfProcess(iox::roudi::Process&, iox::roudi::ProcessManager::ShutdownPolicy) -> kill } ::: [ 3 ] No such process
2024-02-07 08:59:01.855 [Warning]: Process ID 26792 named grupst_if could not be killed with SIGTERM, because the command failed with the following error: No such process See manpage for kill(2) or type man 2 kill in console for more information
2024-02-07 08:59:01.855 [Warning]: ICEORYX error! POSH__ROUDI_PROCESS_SHUTDOWN_FAILED
/home/ghost/builds/JMdqGtQ-/0/ghostrobotics/controls_deps/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:143 { bool iox::roudi::ProcessManager::requestShutdownOfProcess(iox::roudi::Process&, iox::roudi::ProcessManager::ShutdownPolicy) -> kill } ::: [ 3 ] No such process
2024-02-07 08:59:01.855 [Warning]: Process ID 26791 named grcontrols_proc could not be killed with SIGTERM, because the command failed with the following error: No such process See manpage for kill(2) or type man 2 kill in console for more information
2024-02-07 08:59:01.855 [Warning]: ICEORYX error! POSH__ROUDI_PROCESS_SHUTDOWN_FAILED

Now this error is different from the original one we were seeing (in the first comment), so I am not sure if they are related. However,

  • it appears that roudi is trying to shut down our programs while the various programs are starting up... is that expected? Why does it need to do that? I included the roudi initialization output above, and process monitoring is set to off.
  • do you think the test procedure above is somehow artificially creating this error? It was hastily put together...


elBoberido commented on May 25, 2024

@avikde well, I'm not sure whether RouDi is shutting the programs down while they are starting up. From the log I can only see that RouDi got the signal to shut down and then tried to shut down the registered programs. Since the programs also got the signal to shut down, there was a race: before RouDi was able to send the signal, the programs were already gone.

I guess if you add a short sleep after pkill -9 these warnings should not be printed anymore.

Also, please never use pkill -9 or kill -9 and always shut the programs down in a graceful way, e.g. with SIGTERM. Killing the programs might leave some internal structures in a corrupt state which might influence RouDi when it cleans up the remainders of the programs.
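One common way to make SIGTERM as graceful as SIGINT is to route both signals to the same shutdown path. The sketch below illustrates that pattern in Python only; real iceoryx applications are typically C++ and would install their own handlers, and all names here are hypothetical.

```python
import signal
import threading

# Set by the signal handler, polled by the main loop.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only record the request, do the cleanup elsewhere."""
    shutdown_requested.set()


def install_handlers():
    """Route SIGINT and SIGTERM to the same handler (main thread only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, request_shutdown)


def run_until_shutdown(step, cleanup, poll_interval=0.1):
    """Call `step` repeatedly until a shutdown signal arrives, then `cleanup`."""
    install_handlers()
    try:
        while not shutdown_requested.wait(poll_interval):
            step()
    finally:
        cleanup()  # runs for SIGINT and SIGTERM alike
```

With this structure, pkill -15 and Ctrl+C both end in the same cleanup call, whereas SIGKILL (pkill -9) cannot be caught at all.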


avikde commented on May 25, 2024

Just to summarize what I found (and close this issue):

  • using pkill to kill programs using iceoryx can cause this failure. (This is never the intended use case, and hopefully doesn't happen in normal operation, but it can be at the whim of users sometimes, especially during development)
  • using pkill to kill iox-roudi too quickly after the programs are pkilled seems to prevent iox-roudi from starting properly the next time
  • using pkill to kill iox-roudi at least 2 seconds after the programs seems to alleviate the failure from the previous bullet
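As an alternative to a fixed delay, a test script could poll until the killed programs have really disappeared before signaling iox-roudi. The Python sketch below uses hypothetical helper names and relies on kill with signal 0, which only probes for existence and fails with "No such process" (ESRCH) once the PID is gone. Whether process exit alone is enough to avoid the startup failure was not established here; the 2 second delay is what was actually observed to help.

```python
import os
import time


def is_alive(pid):
    """Return True if a process with this PID still exists.

    Signal 0 delivers nothing; it only checks whether the PID can be signaled.
    A killed child that has not been reaped yet (a zombie) still counts as alive.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:  # errno ESRCH, "No such process"
        return False
    except PermissionError:  # process exists but belongs to another user
        return True
    return True


def wait_until_gone(pids, timeout=5.0, poll_interval=0.05):
    """Poll until none of `pids` exist; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    remaining = {pid for pid in pids if is_alive(pid)}
    while remaining and time.monotonic() < deadline:
        time.sleep(poll_interval)
        remaining = {pid for pid in remaining if is_alive(pid)}
    return not remaining
```

A script would call wait_until_gone(program_pids) after killing the programs and only then send the shutdown signal to iox-roudi, optionally keeping an extra sleep on top.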

Ultimately, I don't understand the root cause of the failure, but it does seem pretty much tied to the pkilling, so I'll close the issue for now.


elBoberido commented on May 25, 2024

@avikde there is a timeout during which RouDi waits for the heartbeat. Only after that timeout does RouDi assume an application is dead (if RouDi runs with -m on). In general, RouDi should be able to recover from a pkill -9, but maybe there are some edge cases we overlooked. That said, please use pkill -15 iox-roudi to shut down RouDi.

