Comments (7)
@avikde oh, can you use SIGTERM instead of SIGINT? The latter might cause issues.
from iceoryx.
@avikde if an application is killed there is indeed the chance that it affects RouDi. Do you see this behavior also when the applications are terminated gracefully?
The "Falling back to built-in config" log message is weird; this should not happen during runtime. The same goes for the "Re-registering application" log message. Can you run RouDi with -l verbose and give us the log messages?
Can you check the content of /dev/shm once the problem occurs?
Also, we are about to release iceoryx1 v3.0. Can you check if the problem still persists on the current master?
@elBoberido thanks for your response. Just FYI, I am now trying to collect info with -l verbose; the issue is still the unpredictability... We have it running in a loop that constantly stops the software with SIGINT (which is how it is normally used) and restarts it, and we may have to keep it running for a few days to see whether the problem shows up again. We will have the log if it does.
Hi @elBoberido, thanks for your responses, I really appreciate it. I tried SIGTERM, but unfortunately it doesn't let our programs clean up resources nicely, and I don't think we can use it in our software release.
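SIGTERM, unlike SIGKILL (kill -9), can be caught, so a program can run exactly the same cleanup for SIGTERM as it does for SIGINT. A minimal shell sketch of that idea, assuming a hypothetical worker function run_worker and a marker file that stands in for real resource cleanup; in a C or C++ program the equivalent would be registering the same handler for both SIGINT and SIGTERM (e.g. with sigaction).

```shell
# Hypothetical worker: the same cleanup runs for SIGINT and SIGTERM.
# Only SIGKILL and SIGSTOP cannot be trapped.
run_worker() {
    local marker="$1"   # illustrative stand-in for "resources to release"
    # The double-quoted string is expanded now, so the marker path is fixed
    # at the time the trap is installed.
    trap "echo 'cleaned up' > '$marker'; exit 0" INT TERM
    while true; do
        sleep 0.1   # placeholder for the program's real work
    done
}
```

Sending kill -TERM to a process running run_worker writes the marker and exits with status 0, just as SIGINT would.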
We set up a "restarting test" that does something like the following:
- start roudi
- start programs
- SIGINT programs
- SIGINT roudi
- start roudi
- start programs
- pkill -9 programs
- SIGINT roudi
- loop
With this setup, we observed no failures after SIGINT-ing the programs, but in 0.3% of the iterations after pkill-ing the programs, we got this error:
2024-02-07 08:58:56.646 [ Info ]: No config file provided and also not found at /etc/iceoryx/roudi_config.toml. Falling back to built-in config.
Log level set to: [Verbose]
2024-02-07 08:58:56.648 [Verbose]: Command line parameters are:
Log level: Verbose
Monitoring mode: MonitoringMode::OFF
Compatibility check level: CompatibilityCheckLevel::PATCH
Unique RouDi ID: 123
Process kill delay: 45 s
Config file used is: < none >
Reserving 66798600 bytes in the shared memory [iceoryx_mgmt]
[ Reserving shared memory successful ]
2024-02-07 08:58:56.711 [ Debug ]: Registered memory segment 0x7fb3b1e000 with size 66798600 to id 1
Reserving 149264720 bytes in the shared memory [ghost]
[ Reserving shared memory successful ]
2024-02-07 08:58:56.843 [ Debug ]: Roudi registered payload data segment 0x7faacc4000 with size 149264720 to id 2
RouDi is ready for clients
...
2024-02-07 08:59:01.789 [ Debug ]: Joining Mon+Discover thread...
2024-02-07 08:59:01.855 [ Debug ]: ...Mon+Discover thread joined.
/home/ghost/builds/JMdqGtQ-/0/ghostrobotics/controls_deps/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:143 { bool iox::roudi::ProcessManager::requestShutdownOfProcess(iox::roudi::Process&, iox::roudi::ProcessManager::ShutdownPolicy) -> kill } ::: [ 3 ] No such process
2024-02-07 08:59:01.855 [Warning]: Process ID 26792 named grupst_if could not be killed with SIGTERM, because the command failed with the following error: No such process See manpage for kill(2) or type man 2 kill in console for more information
2024-02-07 08:59:01.855 [Warning]: ICEORYX error! POSH__ROUDI_PROCESS_SHUTDOWN_FAILED
/home/ghost/builds/JMdqGtQ-/0/ghostrobotics/controls_deps/iceoryx/iceoryx_posh/source/roudi/process_manager.cpp:143 { bool iox::roudi::ProcessManager::requestShutdownOfProcess(iox::roudi::Process&, iox::roudi::ProcessManager::ShutdownPolicy) -> kill } ::: [ 3 ] No such process
2024-02-07 08:59:01.855 [Warning]: Process ID 26791 named grcontrols_proc could not be killed with SIGTERM, because the command failed with the following error: No such process See manpage for kill(2) or type man 2 kill in console for more information
2024-02-07 08:59:01.855 [Warning]: ICEORYX error! POSH__ROUDI_PROCESS_SHUTDOWN_FAILED
This error is different from the original one we were seeing (in the first comment), so I am not sure whether they are related. However,
- it appears that roudi is trying to shut down our programs while the various programs are starting up... is that expected? Why does it need to do that? I included the roudi initialization output above, and process monitoring is set to off.
- do you think the test procedure above is somehow artificially creating this error? It was hastily put together...
@avikde well, I'm not sure whether RouDi is shutting the programs down while they are starting up. From the log I can only see that RouDi got the signal to shut down and then tried to shut down the registered programs. Since the programs also got the signal to shut down, there was a race: before RouDi was able to send its signal, the programs were already gone.
I guess if you add a short sleep after pkill -9, these warnings should not be printed anymore.
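The race described above can be reproduced without iceoryx: once a process has exited and been reaped, kill(2) on its PID fails with ESRCH, which is the "No such process" error in the RouDi warnings. A small sketch; try_terminate is a hypothetical helper, not part of iceoryx.

```shell
# Hypothetical helper: report whether SIGTERM could be delivered to PID.
# kill(2) fails with ESRCH ("No such process") when the process is already gone.
try_terminate() {
    local pid="$1"
    if kill -TERM "$pid" 2>/dev/null; then
        echo "delivered"
    else
        echo "no such process"
    fi
}
```

Calling it on a process that has already exited prints "no such process", mirroring what RouDi reports when the programs win the race.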
Also, please never use pkill -9 or kill -9, and always shut the programs down in a graceful way, e.g. with SIGTERM. Killing the programs might leave some internal structures in a corrupt state, which might influence RouDi when it cleans up the remainders of the programs.
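One way to follow this advice in scripts is to send SIGTERM and then wait, with a timeout, until the process has really disappeared before stopping RouDi. A minimal sketch; stop_gracefully and its default 5 s timeout are assumptions, and it deliberately reports a failure instead of escalating to kill -9. It polls with kill -0, so it also works for processes the script did not start itself.

```shell
# Hypothetical helper: ask PID to exit with SIGTERM and wait up to
# TIMEOUT_S whole seconds (default 5) for it to disappear.
# Returns 0 once the process is gone, 1 if it is still alive after the timeout.
# It never falls back to SIGKILL.
stop_gracefully() {
    local pid="$1" timeout_s="${2:-5}" waited=0
    kill -TERM "$pid" 2>/dev/null || return 0   # already gone
    # Note: kill -0 also succeeds for a zombie its parent has not reaped yet.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge $((timeout_s * 10)) ]; then
            return 1
        fi
        sleep 0.1
        waited=$((waited + 1))
    done
    return 0
}
```

A non-zero return is a signal to investigate why the program ignores SIGTERM, rather than a reason to reach for kill -9.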
Just to summarize what I found (and close this issue):
- using pkill to kill programs using iceoryx can cause this failure. (This is never the intended use case, and hopefully doesn't happen in normal operation, but it can be at the whim of users sometimes, especially during development)
- using pkill to kill iox-roudi too quickly after the programs are pkilled seems to prevent iox-roudi from starting properly the next time
- using pkill to kill iox-roudi at least 2 seconds after the programs alleviates the failure from the previous bullet
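The restart test, changed to follow the advice in this thread, could look roughly like the sketch below: the program and RouDi are both stopped with SIGTERM, the script waits for each to exit, and there is a pause before RouDi is stopped (2 s by default, per the observation above). run_iteration, the command placeholders and the 1 s startup sleeps are assumptions, not part of iceoryx.

```shell
# Hypothetical: one iteration of the restart test, stopping everything with
# SIGTERM. The two commands are placeholders (e.g. iox-roudi and one of your
# programs); they are word-split on purpose so arguments can be passed.
run_iteration() {
    local roudi_cmd="$1" app_cmd="$2" settle_s="${3:-2}"
    local roudi_pid app_pid app_status roudi_status

    $roudi_cmd &
    roudi_pid=$!
    sleep 1            # crude wait for RouDi to come up; not a readiness check

    $app_cmd &
    app_pid=$!
    sleep 1            # crude wait for the program to register

    kill -TERM "$app_pid"
    wait "$app_pid"    # block until the program has really exited
    app_status=$?

    sleep "$settle_s"  # pause before stopping RouDi

    kill -TERM "$roudi_pid"
    wait "$roudi_pid"
    roudi_status=$?

    echo "app=$app_status roudi=$roudi_status"
}
```

For example, it could be called in a loop with iox-roudi as the first argument and one of your programs as the second. The fixed sleeps are a simplification; a real test would rather wait until each process reports that it is up.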
Ultimately, I don't understand the root cause of the failure, but it does seem pretty much tied to the pkilling, so I'll close the issue for now.
@avikde there is a timeout during which RouDi waits for the heartbeat. Only after that timeout does RouDi assume that an application is dead (if RouDi runs with -m on). In general, RouDi should be able to recover from a pkill -9, but maybe there are some edge cases we overlooked. That said, please use pkill -15 iox-roudi to shut down RouDi.
Related Issues (20)
- Add an 'iox1' prefix to all resources created by 'iceoryx_posh' and 'RouDi'
- Test Fixtures for RouDi
- Gateway: Support Client/Server in GatewayGeneric
- Race condition in 'PoshRuntime' during shutdown
- mutex owner died -> POPO__CHUNK_LOCKING_ERROR
- RouDi-GTest Multithread Integration Test
- Wrong memory order in MpmcLoFFLi fence synchronization
- Iceoryx support fast-dds
- 'NamedPipe' should be more robust
- Listener addEvent deadlock
- ChunkHeader should expose the size of the entire user payload section, including padding
- Explore cmake object libs for modules iceoryx hoofs
- Problems with multiple "persistent" publishers on the same topic at subscriber startup
- ssize_t: redefinition; different basic types
- Generated files cause recompilation even without any changes
- IPC channel still there, doing an unlink of instanceName
- Declared but undefined copy assignment operator for iox::expected
- Add aliases that conform with other STL container types
- Linear search when releasing a sample scales very poorly
- Can't directly assign `const` underlying value to `iox::optional`