Comments (9)
I've worked around this problem for now with the following patch, so that rasdaemon only listens to events from online CPUs:
diff --git a/ras-events.c b/ras-events.c
index 39cab20..319f049 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -328,7 +328,7 @@ static void parse_ras_data(struct pthread_data *pdata, struct kbuffer *kbuf,
 static int get_num_cpus(struct ras_events *ras)
 {
-	return sysconf(_SC_NPROCESSORS_CONF);
+	return sysconf(_SC_NPROCESSORS_ONLN);
 #if 0
 	char fname[MAX_PATH + 1];
 	int num_cpus = 0;
I'm not sure whether this is an acceptable fix to merge into the repo; apparently the proper fix is to use libtracefs?
For convenience, the above patch can be applied directly to the binary in a hex editor by:
- Searching for
bf 53 00 00 00
- Replacing it with
bf 54 00 00 00
This is at file offset 0xdb3f in my version of rasdaemon (Debian 0.6.7-1+b1).
Or just run this perl one-liner, which does the same thing:
sudo perl -i -pe 's/\xbf\x53\x00\x00\x00/\xbf\x54\x00\x00\x00/' /sbin/rasdaemon
This effectively makes the following binary patch:
--- rasdaemon.S.before 2022-12-10 19:25:27.114060904 +1100
+++ rasdaemon.S.after 2022-12-10 19:25:33.382269283 +1100
@@ -2670,11 +2670,11 @@
db2f: 41 5a pop %r10
db31: 41 5b pop %r11
db33: 85 c0 test %eax,%eax
db35: 0f 85 e5 05 00 00 jne e120 <__cxa_finalize@plt+0x2a60>
db3b: 83 45 b8 01 addl $0x1,-0x48(%rbp)
- db3f: bf 53 00 00 00 mov $0x53,%edi
+ db3f: bf 54 00 00 00 mov $0x54,%edi
db44: e8 a7 d6 ff ff call b1f0 <sysconf@plt>
db49: 48 89 df mov %rbx,%rdi
db4c: 89 c6 mov %eax,%esi
db4e: 89 45 a8 mov %eax,-0x58(%rbp)
db51: 49 89 c5 mov %rax,%r13
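For anyone who would rather script the byte swap than use a hex editor or perl, here is a minimal C sketch of the same search-and-replace on an in-memory buffer. `patch_bytes` is a hypothetical helper written for this comment, not part of rasdaemon. Unlike the one-liner above (which has no `/g` flag), it replaces every occurrence and returns the count, so a caller can check that exactly one match was found before writing anything back to disk. The byte values come straight from the disassembly above; the offset and encoding may differ on other builds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Replace every occurrence of `from` with `to` (both `n` bytes long)
 * inside `buf`. Returns the number of replacements made.
 * Hypothetical helper for illustration only.
 */
size_t patch_bytes(unsigned char *buf, size_t len,
                   const unsigned char *from,
                   const unsigned char *to, size_t n)
{
    size_t hits = 0;
    size_t i = 0;

    assert(buf != NULL && from != NULL && to != NULL && n > 0);
    while (i + n <= len) {
        if (memcmp(buf + i, from, n) == 0) {
            memcpy(buf + i, to, n);   /* overwrite in place, same length */
            hits++;
            i += n;                   /* skip past the patched bytes */
        } else {
            i++;
        }
    }
    return hits;
}
```

Reading the binary into a buffer, calling `patch_bytes` with `bf 53 00 00 00` and `bf 54 00 00 00`, and refusing to write the file back unless the return value is 1 would guard against patching an unrelated instruction that happens to share the same bytes.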
from rasdaemon.
After digging through old rasdaemon logs, I found that the SIGBUS correlates with the following change in the log output:
Before:
rasdaemon: Listening to events for cpus 0 to 47
After:
rasdaemon: Listening to events for cpus 0 to 127
rasdaemon: Error on CPU 48
rasdaemon: Error on CPU 49
...
rasdaemon: Error on CPU 126
rasdaemon: Error on CPU 127
rasdaemon: Old kernel detected. Stop listening and fall back to pthread way.
My CPU is an AMD Ryzen Threadripper 3960X, which has 24 cores and 48 threads.
The dates on the logs line up with when I upgraded from Linux v5.19 to v6.0.
So at some point between those two versions the kernel started reporting more CPUs than actually exist, and rasdaemon is unable to handle that properly.
from rasdaemon.
I downgraded my kernel to v5.17 (which I was using months before the problem started) and still experienced the same error.
So I think it's not directly caused by the kernel, but might perhaps be firmware-related?
from rasdaemon.
My CPU is an "AMD Ryzen 7 2700 Eight-Core Processor".
I confirm this issue and the fix.
Thanks :)
from rasdaemon.
I'm not sure this is the right fix, as CPUs can be dynamically disabled/enabled at runtime, which decreases _SC_NPROCESSORS_ONLN. For example, take this small test.c program:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	printf("Number of cpus: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
	printf("Number of online cpus: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
	return 0;
}
Build it with gcc -o test test.c, then run:
# grep . /sys/devices/system/cpu/online /sys/devices/system/cpu/offline
/sys/devices/system/cpu/online:0-7
# echo 0 > /sys/devices/system/cpu/cpu4/online
# grep . /sys/devices/system/cpu/online /sys/devices/system/cpu/offline
/sys/devices/system/cpu/online:0-3,5-7
/sys/devices/system/cpu/offline:4
$ ./test
Number of cpus: 8
Number of online cpus: 7
It reports 7 online CPUs out of 8 in total. Rasdaemon should monitor all 8, as cpu4 can be brought back online at any time. With your change, it will not monitor the last CPU: the count drops to 7, so only cpus 0 to 6 are covered. So not only is the disabled CPU left unmonitored, but so is one that is online (cpu7).
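To make the count-versus-index problem concrete, here is a minimal sketch that parses a CPU list in the format shown by /sys/devices/system/cpu/online. `struct cpu_list_info` and `parse_cpu_list` are hypothetical names made up for this comment, not rasdaemon code, and the sketch assumes rasdaemon walks CPU indices from 0 up to the count minus one, as the "Listening to events for cpus 0 to N" log line suggests.

```c
#include <assert.h>
#include <stdlib.h>

/* Summary of a kernel CPU list string such as "0-3,5-7". */
struct cpu_list_info {
    int count;   /* how many CPUs the list names */
    int max_id;  /* highest CPU index in the list, -1 if the list is empty */
};

static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/*
 * Parse a comma-separated list of CPU indices and ranges.
 * Hypothetical helper for illustration; parsing stops at the first
 * character that does not fit the format (including a trailing newline).
 */
struct cpu_list_info parse_cpu_list(const char *list)
{
    struct cpu_list_info info = { 0, -1 };
    const char *p = list;

    assert(list != NULL);
    while (is_digit(*p)) {
        char *end;
        long first = strtol(p, &end, 10);
        long last = first;

        if (*end == '-') {
            if (!is_digit(end[1]))
                break;                       /* "3-" with no upper bound */
            last = strtol(end + 1, &end, 10);
        }
        if (last < first)
            break;                           /* reversed range, give up */

        info.count += (int)(last - first + 1);
        if (last > info.max_id)
            info.max_id = (int)last;

        if (*end != ',')
            break;                           /* end of list */
        p = end + 1;
    }
    return info;
}
```

For the string "0-3,5-7" this yields a count of 7 but a highest index of 7, so a loop bounded by the online count stops at cpu6, while a loop bounded by `max_id + 1` still reaches cpu7.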
The real question here is: why is AMD announcing more CPUs than it actually has? A BIOS issue?
from rasdaemon.
I'm hitting the same issue on rasdaemon 0.8.0, and it seems to be a use-after-free bug. The output from running the daemon under Valgrind is attached here: rasdaemon-0.8.0-crash-valgrind.txt
First invalid access is this:
==25802== Invalid read of size 8
==25802== at 0x11C906: ras_mc_event_closedb (ras-record.c:918)
==25802== by 0x117DB7: handle_ras_events_cpu (ras-events.c:640)
==25802== by 0x4A8D389: start_thread (pthread_create.c:442)
==25802== by 0x4B0D5BF: clone (clone.S:100)
==25802== Address 0x17653f00 is 0 bytes inside a block of size 72 free'd
==25802== at 0x484440F: free (vg_replace_malloc.c:884)
==25802== by 0x11C9FC: ras_mc_event_closedb (ras-record.c:1020)
==25802== by 0x117DB7: handle_ras_events_cpu (ras-events.c:640)
==25802== by 0x4A8D389: start_thread (pthread_create.c:442)
==25802== by 0x4B0D5BF: clone (clone.S:100)
==25802== Block was alloc'd at
==25802== at 0x4846C0F: calloc (vg_replace_malloc.c:1340)
==25802== by 0x11C50B: ras_mc_event_opendb (ras-record.c:768)
==25802== by 0x117D37: handle_ras_events_cpu (ras-events.c:628)
==25802== by 0x4A8D389: start_thread (pthread_create.c:442)
==25802== by 0x4B0D5BF: clone (clone.S:100)
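The trace shows the block being freed inside ras_mc_event_closedb and then read again inside ras_mc_event_closedb, both reached from handle_ras_events_cpu, which is consistent with the close path running more than once on the same record. The sketch below shows one common guard against that pattern: clear the pointer when it is freed so that a repeated close becomes a no-op. All type and function names here are hypothetical stand-ins, not rasdaemon's real code, and this is not necessarily how the bug was or should be fixed upstream. It is also single-threaded; if several per-CPU threads share the record, the check and the free would need to happen under a lock.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not rasdaemon's real types. */
struct db_priv {
    int dummy_handle;
};

struct events {
    struct db_priv *db;   /* NULL when no database record is allocated */
};

/* Allocate the private record once; returns 0 on success, -1 on failure. */
int events_opendb(struct events *ev)
{
    assert(ev != NULL);
    if (ev->db)
        return 0;                      /* already open, nothing to do */
    ev->db = calloc(1, sizeof(*ev->db));
    return ev->db ? 0 : -1;
}

/*
 * Safe to call repeatedly: the pointer is cleared right after the free,
 * so a second call never touches the released block.
 * Returns 1 if this call released the record, 0 if there was nothing to do.
 */
int events_closedb(struct events *ev)
{
    assert(ev != NULL);
    if (!ev->db)
        return 0;
    free(ev->db);
    ev->db = NULL;
    return 1;
}
```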
from rasdaemon.
Commit f1ea763 applied my suggested change, so this should be fixed now.
from rasdaemon.
I use Debian Stable, and I also have SIGBUS signals very frequently.
$ sudo coredumpctl list --no-pager | grep rasdaemon | tail -n 25
[sudo] password for pioruns:
Sun 2023-11-12 12:38:48 GMT 4016656 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Mon 2023-11-13 08:51:57 GMT 4160630 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Tue 2023-11-14 03:53:52 GMT 402897 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Wed 2023-11-15 10:23:00 GMT 1403843 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Thu 2023-11-16 09:41:19 GMT 1403987 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Thu 2023-11-16 16:43:21 GMT 975 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Fri 2023-11-17 23:53:52 GMT 1536919 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Fri 2023-11-17 23:53:53 GMT 2517616 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Sun 2023-11-19 10:06:24 GMT 2517665 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Sun 2023-11-19 10:06:25 GMT 3668922 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Sun 2023-11-19 10:06:25 GMT 3668988 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Tue 2023-11-21 03:33:34 GMT 936 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Wed 2023-11-22 10:20:02 GMT 220536 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Wed 2023-11-22 10:20:03 GMT 3081318 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Wed 2023-11-22 10:20:04 GMT 3083389 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Thu 2023-11-23 10:50:47 GMT 3085716 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Fri 2023-11-24 07:13:39 GMT 1140728 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Sat 2023-11-25 10:04:15 GMT 389719 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Sun 2023-11-26 07:17:43 GMT 506670 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Mon 2023-11-27 10:24:38 GMT 1766908 0 0 SIGBUS missing /usr/sbin/rasdaemon -
Sat 2023-12-09 10:38:51 GMT 2058458 0 0 SIGBUS present /usr/sbin/rasdaemon 272.5K
Sun 2023-12-10 12:05:05 GMT 2058605 0 0 SIGBUS present /usr/sbin/rasdaemon 272.4K
Mon 2023-12-11 07:55:29 GMT 3163873 0 0 SIGBUS present /usr/sbin/rasdaemon 273.0K
Mon 2023-12-11 08:04:42 GMT 3195740 0 0 SIGBUS present /usr/sbin/rasdaemon 270.4K
Mon 2023-12-11 10:06:22 GMT 950 0 0 SIGBUS present /usr/sbin/rasdaemon 268.5K
The processor is an AMD Ryzen 7 5800X (16). I understand this is now fixed? Do I just need to wait until it lands in my distribution?
from rasdaemon.
Hi @github12101. Your question is specific to Debian. I believe you may want to report what happened to you at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054152 rather than on this upstream issue tracker.
from rasdaemon.