mchehab / rasdaemon

Rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors using the EDAC tracing events. EDAC is a Linux kernel subsystem that handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures, such as ARM, also exist.

License: GNU General Public License v2.0

Makefile 0.71% C 85.82% M4 2.39% Shell 0.67% Perl 10.41%

rasdaemon's People

Contributors

aegl, aristeu, avanaik, congwang, dannf, dgcampea, dmnosachev, g-edwards, hrio, hunterjaguar, jasontian518, lostwayzxc, mchehab, muralimk-amd, nchatrad, pizza-speziale, shijujose4, sikarash, stevenj, stintel, system-thoughts, thesamesam, thomastaioracle, tictooc, tk-wfischer, walterav1984, weidongkl, whitslack, yang-shi, yghannam

rasdaemon's Issues

Extlog didn't show the error log

I injected a memory error and can see it in CPER format in the Windows Event Viewer, under WHEA-Logger -> Details -> RawData.
However, I cannot see the same error in Extlog via rasdaemon.
How can I get rasdaemon to show Extlog errors?
Does rasdaemon on Ubuntu support Extlog in CPER format?

ras-mc-ctl tries to query nonexistent table

When rasdaemon is compiled without HAVE_DEVLINK, the devlink_event table is not created, but ras-mc-ctl still tries to query this table.

$ ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1181.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1182.
$ sqlite3 /var/lib/rasdaemon/ras-mc_event.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE mc_event (id INTEGER PRIMARY KEY, timestamp TEXT, err_count INTEGER, err_type TEXT, err_msg TEXT, label TEXT, mc INTEGER, top_layer INTEGER, middle_layer INTEGER, lower_layer INTEGER, address INTEGER, grain INTEGER, syndrome INTEGER, driver_detail TEXT);
CREATE TABLE aer_event (id INTEGER PRIMARY KEY, timestamp TEXT, dev_name TEXT, err_type TEXT, err_msg TEXT);
CREATE TABLE extlog_event (id INTEGER PRIMARY KEY, timestamp TEXT, etype INTEGER, error_count INTEGER, severity INTEGER, address INTEGER, fru_id BLOB, fru_text TEXT, cper_data BLOB);
CREATE TABLE mce_record (id INTEGER PRIMARY KEY, timestamp TEXT, mcgcap INTEGER, mcgstatus INTEGER, status INTEGER, addr INTEGER, misc INTEGER, ip INTEGER, tsc INTEGER, walltime INTEGER, cpu INTEGER, cpuid INTEGER, apicid INTEGER, socketid INTEGER, cs INTEGER, bank INTEGER, cpuvendor INTEGER, bank_name TEXT, error_msg TEXT, mcgstatus_msg TEXT, mcistatus_msg TEXT, mcastatus_msg TEXT, user_action TEXT, mc_location TEXT);
CREATE TABLE arm_event (id INTEGER PRIMARY KEY, timestamp TEXT, error_count INTEGER, affinity INTEGER, mpidr INTEGER, running_state INTEGER, psci_state INTEGER);
COMMIT;
$ rasdaemon -V
rasdaemon 0.6.5
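
A guarded query would avoid the failure shown above. The following is only a minimal sketch of that idea in Python (ras-mc-ctl itself is a Perl script, so this is not its actual code): it asks SQLite's `sqlite_master` catalog whether a table exists before counting its rows. The helper names `table_exists` and `count_rows` are hypothetical, and the demo uses an in-memory database rather than the real /var/lib/rasdaemon/ras-mc_event.db.

```python
import sqlite3


def table_exists(conn, name):
    """Return True if a table called `name` exists in this database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def count_rows(conn, table):
    """Return the row count of `table`, or None when the table is absent."""
    if not table_exists(conn, table):
        return None
    # The name was just confirmed against sqlite_master, so quoting it
    # into the statement is acceptable for this sketch.
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Demo on an in-memory database that, like the dump above, has an
# aer_event table but no devlink_event table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aer_event (id INTEGER PRIMARY KEY, timestamp TEXT, "
    "dev_name TEXT, err_type TEXT, err_msg TEXT)"
)
for table in ("aer_event", "devlink_event"):
    n = count_rows(conn, table)
    print(table, "skipped (table not present)" if n is None else n)
```

With a check like this, a build without HAVE_DEVLINK would skip the devlink summary instead of aborting on the failed prepare.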

rasdaemon causes kernel crash (NULL dereference)

Since booting into Linux kernels 6.0.5 and 6.0.6, I get NULL pointer dereference errors in the kernel. I am fairly sure they are always triggered by the rasdaemon process. I am reporting this here because somebody might know which special kernel interfaces rasdaemon uses, and perhaps which Linux kernel maintainers to contact.

Here is the kernel log: https://bugs.archlinux.org/task/76354?getfile=21974
Archlinux bug report: https://bugs.archlinux.org/task/76354
Rasdaemon log: rasdaemon.log

Unintelligible errors: Unknown block error, I/O error, critical target error

Disk errors
1 2022-10-31 22:36:26 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
2 2022-11-01 01:08:50 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
3 2022-11-01 02:39:32 +0000 error: dev=0:0, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='',
4 2022-11-01 02:39:32 +0000 error: dev=0:0, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd='',
5 2022-11-01 02:39:32 +0000 error: dev=0:0, sector=-1, nr_sector=0, error='critical target error', rwbs='N', cmd='',
6 2022-11-01 22:44:22 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
7 2022-11-02 01:41:27 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
8 2022-11-02 02:02:44 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
9 2022-11-02 22:55:22 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
10 2022-11-03 01:15:27 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
11 2022-11-03 22:43:47 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',
12 2022-11-04 01:19:09 +0000 error: dev=0:2048, sector=371038120, nr_sector=144, error='unknown block error', rwbs='R', cmd='',

Do these errors mean anything? I don't know what to do about those errors.

There is a new unknown block error whenever I boot my computer.

question: machine readable output from ras-mc-ctl?

In order to periodically check for errors on a running system, it would be useful to run ras-mc-ctl, collect its output, and optionally take action based on the results.

However, parsing text output is not ideal, and it's not clear that the format of the text output is intended to be stable enough to be relied upon.

One option could be to add a --json output format for ras-mc-ctl and document the output format.

Another approach is to create a separate tool altogether to read the sqlite3 db, but a similar compatibility question arises: is the db schema expected to be compatible across rasdaemon releases?

What is the recommended approach for this use case? Thanks!
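
To make the second approach concrete, here is a rough sketch of a separate reader that turns the sqlite3 db into JSON. It is not an official tool, and it assumes the table names from the `.dump` output quoted in the devlink_event issue above; whether that schema stays stable across releases is exactly the open question. `TABLES`, `dump_table` and `export_json` are hypothetical names.

```python
import json
import sqlite3

# Table names as they appear in the .dump output quoted in an earlier issue;
# other rasdaemon builds may create more or fewer tables.
TABLES = ("mc_event", "aer_event", "extlog_event", "mce_record", "arm_event")


def dump_table(conn, table):
    """Return all rows of one table as a list of dicts ([] if it is absent)."""
    if table not in TABLES:
        raise ValueError(f"unexpected table name: {table}")
    try:
        cur = conn.execute(f"SELECT * FROM {table} ORDER BY id")
    except sqlite3.OperationalError:
        # Typically "no such table" on a build without that feature.
        return []
    columns = [d[0] for d in cur.description]
    rows = []
    for values in cur:
        # BLOB columns (for example cper_data) come back as bytes,
        # which JSON cannot carry directly, so hex-encode them.
        rows.append({
            c: (v.hex() if isinstance(v, bytes) else v)
            for c, v in zip(columns, values)
        })
    return rows


def export_json(conn):
    """Serialize every known table to one JSON document."""
    return json.dumps({t: dump_table(conn, t) for t in TABLES}, indent=2)
```

A caller would open the database itself, for example `export_json(sqlite3.connect("/var/lib/rasdaemon/ras-mc_event.db"))`, and read it with sufficient privileges. Missing tables come back as empty lists, so the output keeps the same shape across builds.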

Missing distribution tarball for v0.6.8

There is no distribution tarball generated for the rasdaemon 0.6.8 release, either in the directory on infradead or attached to the github release.

Could you please generate one (e.g. using make distcheck) for linux distro packaging?

If needed, I can set up my packaging tools to run autotools to generate the configure script, Makefile, etc. from the snapshot github generates from the tag, but this tends to be a bit more fragile, and slower to build, than having a proper distribution tarball.

The timestamps in the database do not seem to correlate to the system clock

Any time an error occurs and a database entry is created, we need timestamps that reflect the time the error occurred. The timestamps we currently see are very odd. Are they associated with uptime?

For example one timestamp looked like this:

ras-mc-ctl --errors

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2449-05-18 20:57:15 +0000 error: MEMORY CONTROLLER RD_CHANNELunspecified_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=0, mcgcap=0x00000c09, status=0x940000000000009f, addr=0xffa84200, tsc=0x2ea83b5fdb8, walltime=0x5e7e5045, cpuid=0x000506f1, bank=0x00000001

2 2468-02-20 22:57:15 +0000 error: MEMORY CONTROLLER RD_CHANNELunspecified_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error Error_enabled, n_errors=0, mcgcap=0x00000c09, status=0x940000000000009f, addr=0xffa84200, tsc=0x308d6c79418, walltime=0x5e7e5080, cpuid=0x000506f1, bank=0x00000001

Is there a way to correlate the timestamp with the clock?
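
One possible way to correlate the records with the clock is the walltime field in the same output. The sketch below assumes (this is an assumption, not something confirmed in this report) that walltime holds Unix epoch seconds, which its magnitude suggests. `walltime_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def walltime_to_utc(walltime_hex):
    """Interpret a hex walltime value as Unix epoch seconds (assumed)."""
    return datetime.fromtimestamp(int(walltime_hex, 16), tz=timezone.utc)


# The two walltime values from the MCE records quoted above.
for value in ("0x5e7e5045", "0x5e7e5080"):
    print(value, walltime_to_utc(value).isoformat())
```

Under that assumption both records fall on 27 March 2020 and are 59 seconds apart, which is far more plausible than the years 2449 and 2468 printed in the timestamp column.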

rasdaemon dimm label format / sysfs content missing

Labels don't seem to be correctly applied when using rasdaemon.
If I edit /etc/edac/labels.db and add:

Vendor: ASRock
    Model: X99M Killer
        DDR4_A1: 0.0.0;
        DDR4_B1: 0.0.1;
        DDR4_C1: 0.0.2;
        DDR4_D1: 0.0.3;

edac-ctl seems to detect the correct sysfs dimm directories:

# edac-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0/csrow0/ch0_dimm_label           DDR4_A1              CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
mc0/csrow0/ch1_dimm_label           DDR4_B1              CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
mc0/csrow0/ch2_dimm_label           DDR4_C1              CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
mc0/csrow0/ch3_dimm_label           DDR4_D1              CPU_SrcID#0_Ha#0_Chan#3_DIMM#0

But when I do the same for rasdaemon, adding the same content to /etc/ras/dimm_labels.d/asrock, I get:

# ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 channel 0 slot 0                DDR4_A1              CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
                                    DDR4_B1              0:0:1 missing       
                                    DDR4_C1              0:0:2 missing       
                                    DDR4_D1              0:0:3 missing       

System Info

Motherboard: Fatal1ty X99M Killer
CPU: Intel(R) Xeon(R) CPU E5-2620 v4
rasdaemon version: 0.6.4
Kernel: 5.13.19-200.fc34.x86_64
Distribution: Fedora 34

sysfs contents by searching for dimm:

# find /sys/ -iname '*dimm*'
/sys/devices/system/edac/mc/mc0/dimm3
/sys/devices/system/edac/mc/mc0/dimm3/dimm_ue_count
/sys/devices/system/edac/mc/mc0/dimm3/dimm_mem_type
/sys/devices/system/edac/mc/mc0/dimm3/dimm_dev_type
/sys/devices/system/edac/mc/mc0/dimm3/dimm_ce_count
/sys/devices/system/edac/mc/mc0/dimm3/dimm_label
/sys/devices/system/edac/mc/mc0/dimm3/dimm_location
/sys/devices/system/edac/mc/mc0/dimm3/dimm_edac_mode
/sys/devices/system/edac/mc/mc0/csrow0/ch2_dimm_label
/sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label
/sys/devices/system/edac/mc/mc0/csrow0/ch3_dimm_label
/sys/devices/system/edac/mc/mc0/csrow0/ch1_dimm_label
/sys/devices/system/edac/mc/mc0/dimm6
/sys/devices/system/edac/mc/mc0/dimm6/dimm_ue_count
/sys/devices/system/edac/mc/mc0/dimm6/dimm_mem_type
/sys/devices/system/edac/mc/mc0/dimm6/dimm_dev_type
/sys/devices/system/edac/mc/mc0/dimm6/dimm_ce_count
/sys/devices/system/edac/mc/mc0/dimm6/dimm_label
/sys/devices/system/edac/mc/mc0/dimm6/dimm_location
/sys/devices/system/edac/mc/mc0/dimm6/dimm_edac_mode
/sys/devices/system/edac/mc/mc0/dimm0
/sys/devices/system/edac/mc/mc0/dimm0/dimm_ue_count
/sys/devices/system/edac/mc/mc0/dimm0/dimm_mem_type
/sys/devices/system/edac/mc/mc0/dimm0/dimm_dev_type
/sys/devices/system/edac/mc/mc0/dimm0/dimm_ce_count
/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
/sys/devices/system/edac/mc/mc0/dimm0/dimm_location
/sys/devices/system/edac/mc/mc0/dimm0/dimm_edac_mode
/sys/devices/system/edac/mc/mc0/dimm9
/sys/devices/system/edac/mc/mc0/dimm9/dimm_ue_count
/sys/devices/system/edac/mc/mc0/dimm9/dimm_mem_type
/sys/devices/system/edac/mc/mc0/dimm9/dimm_dev_type
/sys/devices/system/edac/mc/mc0/dimm9/dimm_ce_count
/sys/devices/system/edac/mc/mc0/dimm9/dimm_label
/sys/devices/system/edac/mc/mc0/dimm9/dimm_location
/sys/devices/system/edac/mc/mc0/dimm9/dimm_edac_mode

ras-mc-ctl --layout output

# ras-mc-ctl --layout
Use of uninitialized value $max_pos[3] in modulus (%) at /usr/sbin/ras-mc-ctl line 868.
Use of uninitialized value $d in numeric ge (>=) at /usr/sbin/ras-mc-ctl line 869.
Use of uninitialized value $d in sprintf at /usr/sbin/ras-mc-ctl line 872.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
    +-----------------------------------------------------------------------------------------------------------------------------------------------+
    |                                                                      mc0                                                                      |
    |             channel0              |             channel1              |             channel2              |             channel3              |
    |   slot0   |   slot1   |   slot2   |   slot0   |   slot1   |   slot2   |   slot0   |   slot1   |   slot2   |   slot0   |   slot1   |   slot2   |
----+-----------------------------------------------------------------------------------------------------------------------------------------------+

0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----+-----------------------------------------------------------------------------------------------------------------------------------------------+

ras-mc-ctl --error-count output:

# ras-mc-ctl --error-count
Label                         	CE	UE
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#3_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#1_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#2_DIMM#0	0	0

dmidecode -t memory output:

# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

Handle 0x000E, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 256 GB
	Error Information Handle: Not Provided
	Number Of Devices: 4

Handle 0x0010, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x000E
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 72 bits
	Size: 8 GB
	Form Factor: RIMM
	Set: None
	Locator: DIMM_A1
	Bank Locator: NODE 1
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2133 MT/s
	Manufacturer: Micron
	Serial Number: 1323637C
	Asset Tag: DIMM_A1_AssetTag
	Part Number: 18ASF1G72PZ-2G1B1  
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x0012, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x000E
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 72 bits
	Size: 8 GB
	Form Factor: RIMM
	Set: None
	Locator: DIMM_B1
	Bank Locator: NODE 1
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2133 MT/s
	Manufacturer: Micron
	Serial Number: 13236327
	Asset Tag: DIMM_B1_AssetTag
	Part Number: 18ASF1G72PZ-2G1B1  
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x0014, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x000E
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 72 bits
	Size: 8 GB
	Form Factor: RIMM
	Set: None
	Locator: DIMM_C1
	Bank Locator: NODE 1
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2133 MT/s
	Manufacturer: Micron
	Serial Number: 13236324
	Asset Tag: DIMM_C1_AssetTag
	Part Number: 18ASF1G72PZ-2G1B1  
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x0016, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x000E
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 72 bits
	Size: 8 GB
	Form Factor: RIMM
	Set: None
	Locator: DIMM_D1
	Bank Locator: NODE 1
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2133 MT/s
	Manufacturer: Micron
	Serial Number: 13236332
	Asset Tag: DIMM_D1_AssetTag
	Part Number: 18ASF1G72PZ-2G1B1  
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

ras-mc-ctl: drivers not loaded.

Hi, I have a Lenovo M57p 9088 desktop here. Are the errors below displayed because it has no EDAC capability? They appear after running the following commands:

Thinkcentre-M57p:/etc/sysconfig # ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.
Thinkcentre-M57p:/etc/sysconfig # 
Thinkcentre-M57p:/etc/sysconfig # rasdaemon --status
Segmentation fault (core dumped)
paul-Thinkcentre-M57p:/etc/sysconfig # 
Thinkcentre-M57p:/etc/sysconfig # rasdaemon --enable
rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: mce:mce_record event enabled
rasdaemon: ras:extlog_mem_event event enabled
rasdaemon: ras:non_standard_event event enabled
rasdaemon: ras:arm_event event enabled
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: block:block_rq_error event enabled
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Can't write to set_event
rasdaemon: Can't write to set_event
rasdaemon: Can't write to set_event
rasdaemon: Can't write to set_event
rasdaemon: Can't write to set_event
rasdaemon: Can't write to set_event
rasdaemon: Can't write to set_event
rasdaemon: Can't write to set_event
Thinkcentre-M57p:/etc/sysconfig # 
Thinkcentre-M57p:/etc/sysconfig # systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; enabled; **preset: disabled**)
     Active: active (running) since Sat 2024-01-13 13:41:43 CST; 5h 36min ago
    Process: 1195 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
   Main PID: 1194 (rasdaemon)
      Tasks: 1 (limit: 4915)
        CPU: 142ms
     CGroup: /system.slice/rasdaemon.service
             └─1194 /usr/sbin/rasdaemon -f -r

Jan 13 18:04:41 Thinkcentre-M57p rasdaemon[1194]:            <...>-955   [002] .....     0.000534 block_rq_error 2024-01-13 15:10:26 -0600
Jan 13 18:04:41 Thinkcentre-M57p rasdaemon[1194]: rasdaemon: diskerror_event store: 0x55cedf2742e0
Jan 13 18:04:41 Thinkcentre-M57p rasdaemon[1194]: rasdaemon: register inserted at db
Jan 13 18:04:41 Thinkcentre-M57p rasdaemon[1194]:            <...>-955   [000] .....     0.000534 block_rq_error 2024-01-13 15:10:26 -0600
Jan 13 18:04:42 Thinkcentre-M57p rasdaemon[1194]: rasdaemon: diskerror_event store: 0x55cedf2742e0
Jan 13 18:04:42 Thinkcentre-M57p rasdaemon[1194]: rasdaemon: register inserted at db
Jan 13 18:04:42 Thinkcentre-M57p rasdaemon[1194]:            <...>-955   [000] .....     0.000534 block_rq_error 2024-01-13 15:10:27 -0600
Jan 13 18:04:42 Thinkcentre-M57p rasdaemon[1194]: rasdaemon: diskerror_event store: 0x55cedf2742e0
Jan 13 18:04:42 Thinkcentre-M57p rasdaemon[1194]: rasdaemon: register inserted at db
Jan 13 18:04:42 Thinkcentre-M57p rasdaemon[1194]:            <...>-955   [000] .....     0.000534 block_rq_error 2024-01-13 15:10:27 -0600
Thinkcentre-M57p:/etc/sysconfig # 

What are your thoughts on this?

-Thanks

rasdaemon not logging

Distro: Fedora 37 KDE
Kernel: 6.1.8
rasdaemon version: 0.6.8
CPU: Ryzen 9 5900x

Because rasdaemon's erroneous disk error reports were bloating my log, I deleted the files ras-mc_event.db and ras-mc_event.db-journal in /var/lib/rasdaemon. After restarting the rasdaemon service, clean ones were created.
After that, I noticed it had stopped logging those false disk errors. Eventually I got another MCE error and noticed that it wasn't logged either. (I'm not sure whether deleting the files was directly related, but the timing lined up.) I reinstalled rasdaemon and waited for another error to be sure.
This latest one wasn't logged either, and the supposed disk errors still aren't being logged, even though the service still seems to be reporting them.
Screenshot.
Journal log of a systemctl restart rasdaemon. Those core dumps happen on a fresh boot as well.

I have uninstalled mcelog, and I don't have the ras-mc-ctl service enabled since it fails and exits due to my system not having ECC memory.

rasdaemon.service fails to start in Fedora 35

× rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; enabled; vendor preset: disabled)
     Active: failed (Result: resources)
        CPU: 0

systemd[1]: rasdaemon.service: Failed to load environment files: No such file or directory
systemd[1]: rasdaemon.service: Failed to run 'start' task: No such file or directory
systemd[1]: rasdaemon.service: Failed with result 'resources'.
systemd[1]: Failed to start RAS daemon to log the RAS events.

Discussion in the Bugzilla bug report https://bugzilla.redhat.com/show_bug.cgi?id=2004409 suggests the problem is an undefined variable SYSCONFDEFDIR in the unit file template https://github.com/mchehab/rasdaemon/blob/master/misc/rasdaemon.service.in

configure.ac forces sysconfdir to be /etc, breaking Nix/Guix installs

Overview

Line 144 in configure.ac is a little strange:

test "$sysconfdir" = '${prefix}/etc' && sysconfdir=/etc

It strips the prefix from sysconfdir. I'm not sure what the intent is here, but this breaks installations anywhere other than the root directory. In particular, Guix and Nix will issue a build failure, since the build daemon doesn't have permission to create things under /etc.

Proposal

Perhaps the && was intended to be || or something? Better yet, maybe we want to set sysconfdir via an autoconf macro instead?

Rasdaemon does not work when kernel lockdown is enabled

Modern distributions enable kernel lockdown by default when UEFI and Secure Boot are enabled.
This breaks rasdaemon because it has no direct access to MSR or debugfs:

kernel: Lockdown: rasdaemon: Direct MSR access is restricted; see man kernel_lockdown.7
kernel: Lockdown: rasdaemon: debugfs is restricted; see man kernel_lockdown.7

I do not know how rasdaemon works but it sounds like perhaps the architecture must change to keep rasdaemon working with kernel lockdown.

The obvious workarounds would be to disable either Secure Boot or kernel lockdown, both of which decrease overall system security and may not be allowed by company or compliance policies.

As more servers move to modern distributions and Secure Boot this problem will just get more common until it renders rasdaemon obsolete unless it can evolve.
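
To tell whether a given machine is affected, the current lockdown mode can be read from securityfs. This is a small sketch, assuming the usual /sys/kernel/security/lockdown file is present and readable and that it marks the active mode with square brackets (for example `none [integrity] confidentiality`). The helper names are hypothetical.

```python
from pathlib import Path

# Usual securityfs location; it may be absent if securityfs is not mounted
# or the kernel was built without the lockdown LSM.
LOCKDOWN_FILE = Path("/sys/kernel/security/lockdown")


def active_lockdown_mode(text):
    """Return the bracketed (active) mode from the file contents, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_lockdown(path=LOCKDOWN_FILE):
    """Read the lockdown file; return None if it cannot be read."""
    try:
        return active_lockdown_mode(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    mode = read_lockdown()
    print("lockdown mode:", mode if mode is not None else "unknown")
```

Any mode other than `none` would be consistent with the "restricted" kernel messages quoted above.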

`sudo ras-mc-ctl --error-count` not listing `Corrected error` event?

I decided to open this issue after casually mentioning it here.

After tightening my RAM timings to provoke an ECC error correction report (mce: [Hardware Error]: Machine check events logged), I get this result with my computer still running:

sudo ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.
No disk errors.
MCE records summary:
	1 Corrected error, no action required. errors
sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

MCE events:
1 .. error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=21), mcg mcgstatus=0, mci Error_overflow CECC, memory_channel=0,csrow=2, mcgcap=0x0000011d, status=0xdc2040000400011b, addr=0x7d41b7940, misc=0xd01a000601000000, walltime=0x6506e4a4, cpuid=0x00a60f12, bank=0x00000015

But

sudo ras-mc-ctl --error-count
Label               	CE	UE
mc#0csrow#3channel#0	0	0
mc#0csrow#2channel#1	0	0
mc#0csrow#2channel#0	0	0
mc#0csrow#3channel#1	0	0

isn't displaying anything?

hw: ASUS TUF GAMING B650-PLUS, 7800X3D and 2xKSM48E40BD8KM-32HM.
sw: arch, ras 0.8.0.tar.bz2 through aur.archlinux.org/packages/rasdaemon.
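
For a cross-check that does not depend on ras-mc-ctl, the EDAC counters can be read straight from sysfs. The sketch below is only an illustration: it assumes the per-DIMM layout seen in the `find /sys/` output of an earlier issue (mc*/dimm*/dimm_ce_count, dimm_ue_count, dimm_label), which may differ on this board, and it makes no claim about where `--error-count` takes its numbers from. `read_dimm_counters` is a hypothetical name.

```python
from pathlib import Path

# Directory seen in the sysfs listing of an earlier issue; assumed here.
EDAC_MC = Path("/sys/devices/system/edac/mc")


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_dimm_counters(root=EDAC_MC):
    """Collect label, CE count and UE count for every mc*/dimm* directory."""
    results = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        results.append({
            "dimm": f"{dimm.parent.name}/{dimm.name}",
            "label": _read(dimm / "dimm_label"),
            "ce": _read(dimm / "dimm_ce_count"),
            "ue": _read(dimm / "dimm_ue_count"),
        })
    return results


if __name__ == "__main__":
    for entry in read_dimm_counters():
        print(entry)
```

If these counters also read 0 while the MCE record exists in the database, the corrected error was recorded through the MCE path but not counted by the EDAC driver.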

hardcoded /etc/sysconfig is not distro agnostic

9ae6b70 added the line EnvironmentFile=/etc/sysconfig/rasdaemon to rasdaemon's systemd unit file. However, /etc/sysconfig is only used by some distributions like Red Hat. Debian and Ubuntu would normally keep these files under /etc (though this file may make more sense under /etc/default). Seems like this should be a configure-time option.

When SELinux is enabled, ras-mc-ctl.service fails to be started

Reproduction procedure:

dnf install -y rasdaemon
setenforce 1
systemctl start ras-mc-ctl.service
systemctl status ras-mc-ctl.service

This problem occurs on my machine, but I don't get any useful error output. Why does the service fail to start?

Question: 'ug! no event found for type 838'

Hi.

I have just a quick question. I am working with AMD Epyc CPUs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7551 32-Core Processor
Stepping:              2
CPU MHz:               2000.000
CPU max MHz:           2000.0000
CPU min MHz:           1200.0000
BogoMIPS:              3999.38
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7,64-71
NUMA node1 CPU(s):     8-15,72-79
NUMA node2 CPU(s):     16-23,80-87
NUMA node3 CPU(s):     24-31,88-95
NUMA node4 CPU(s):     32-39,96-103
NUMA node5 CPU(s):     40-47,104-111
NUMA node6 CPU(s):     48-55,112-119
NUMA node7 CPU(s):     56-63,120-127
Flags:                 fpu vme de pse tsc [...]

Using the packaged version of rasdaemon from RHEL/CentOS (0.4.1-35) I have got the following:

Jul 23 15:07:48 host1 rasdaemon: rasdaemon: Can't parse MCE for this AMD CPU yet

So I have switched to the latest version from upstream (0.6.6-1) and now it seems to work but in the logs I have
got quite a lot of the following messages:

Jul 23 14:04:27 host1 rasdaemon: ug! no event found for type 838

Is there a way to track down what it is exactly referring to? I guess these msgs are harmless but I would like
to understand where they come from.

Thanks in advance!

Unclear which kernel modules should be loaded

I started using rasdaemon recently. I use a self-compiled, self-configured kernel. I have a hard time finding out which kernel modules I should include in my configuration. I get the following warning:

Can't write to set_event

and errors:

Can't get traces from ras:extlog_mem_event
…
Can't get traces from devlink:devlink_health_report

For the first error, I have found the ACPI_EXTLOG config, but it is unclear what is needed to fix the other two issues.
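
A quick way to see which tracepoints the running kernel actually exposes is to look for their directories under tracefs. The sketch below assumes that an event written as `system:name` maps to `events/<system>/<name>` under the tracefs mount (commonly /sys/kernel/tracing or /sys/kernel/debug/tracing), and that the script runs with enough privileges to look inside. The event list is copied from the `rasdaemon --enable` output quoted in an earlier issue; the function names are hypothetical.

```python
from pathlib import Path

# Common tracefs mount points; which one exists depends on the system.
TRACEFS_ROOTS = (Path("/sys/kernel/tracing"), Path("/sys/kernel/debug/tracing"))

# Events listed by `rasdaemon --enable` in an earlier issue.
EVENTS = (
    "ras:mc_event",
    "ras:aer_event",
    "mce:mce_record",
    "ras:extlog_mem_event",
    "ras:non_standard_event",
    "ras:arm_event",
    "devlink:devlink_health_report",
    "block:block_rq_error",
    "ras:memory_failure_event",
)


def event_dir(root, event):
    """Map 'system:name' to <root>/events/<system>/<name>."""
    system, name = event.split(":", 1)
    return Path(root) / "events" / system / name


def missing_events(root, events=EVENTS):
    """Return the events that have no directory under this tracefs root."""
    return [e for e in events if not event_dir(root, e).is_dir()]


if __name__ == "__main__":
    for root in TRACEFS_ROOTS:
        if (root / "events").is_dir():
            print(root, "missing:", missing_events(root))
            break
    else:
        print("no readable tracefs mount found")
```

An event reported as missing here points at a kernel option or module that is not enabled, which narrows down what to search for in the kernel configuration.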

Linker errors when compiling with gcc 10+

Hi,

GCC 10 and above defaults to -fno-common, which prevents the merging of globals declared in multiple compilation units. This is causing linker problems with the declaration of:

struct ras_events *ras;

on line 30 of ras-record.h. It needs to be declared extern and defined in a single compilation unit, or -fcommon needs to be added to the CFLAGS.

I'm using gcc 10.1.0 on arch linux. I configured and built ras with:

./configure \
    --prefix=/usr           \
    --sbindir=/usr/bin      \
    --sysconfdir=/etc       \
    --localstatedir=/var    \
    --enable-aer            \
    --enable-arm            \
    --enable-extlog         \
    --enable-hisi-ns-decode \
    --enable-mce            \
    --enable-non-standard   \
    --enable-devlink        \
    --enable-diskerror      \
    --enable-abrt-report    \
    --enable-sqlite3        \
    ;
make

and the output of make is:

make  all-recursive
make[1]: Entering directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5'
Making all in libtrace
make[2]: Entering directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/libtrace'
  CC       event-parse.o
  CC       parse-filter.o
  CC       kbuffer-parse.o
  CC       parse-utils.o
  CC       trace-seq.o
  AR       libtrace.a
ar: `u' modifier ignored since `D' is the default (see `U')
make[2]: Leaving directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/libtrace'
Making all in util
make[2]: Entering directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/util'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/util'
Making all in man
make[2]: Entering directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/man'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/man'
make[2]: Entering directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5'
  CC       rasdaemon.o
  CC       ras-events.o
  CC       ras-mc-handler.o
  CC       bitfield.o
  CC       ras-record.o
  CC       ras-aer-handler.o
  CC       ras-non-standard-handler.o
  CC       ras-arm-handler.o
  CC       ras-mce-handler.o
  CC       mce-intel.o
  CC       mce-amd.o
  CC       mce-intel-p4-p6.o
  CC       mce-intel-nehalem.o
  CC       mce-intel-dunnington.o
  CC       mce-intel-tulsa.o
  CC       mce-intel-sb.o
  CC       mce-intel-ivb.o
  CC       mce-intel-haswell.o
  CC       mce-intel-knl.o
  CC       mce-intel-broadwell-de.o
  CC       mce-intel-broadwell-epex.o
  CC       mce-intel-skylake-xeon.o
  CC       mce-amd-k8.o
  CC       mce-amd-smca.o
  CC       ras-extlog-handler.o
  CC       ras-devlink-handler.o
  CC       ras-diskerror-handler.o
  CC       ras-report.o
  CC       non-standard-hisi_hip07.o
  CC       non-standard-hisi_hip08.o
  CCLD     rasdaemon
/usr/bin/ld: ras-events.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-mc-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: bitfield.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-record.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-aer-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-non-standard-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-arm-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-mce-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-amd.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-p4-p6.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-nehalem.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-dunnington.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-tulsa.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-sb.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-ivb.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-haswell.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-knl.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-broadwell-de.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-broadwell-epex.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-intel-skylake-xeon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-amd-k8.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: mce-amd-smca.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-extlog-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-devlink-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-diskerror-handler.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: ras-report.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: non-standard-hisi_hip07.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
/usr/bin/ld: non-standard-hisi_hip08.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: multiple definition of `ras'; rasdaemon.o:/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5/ras-record.h:30: first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:621: rasdaemon] Error 1
make[2]: Leaving directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5'
make[1]: *** [Makefile:724: all-recursive] Error 1
make[1]: Leaving directory '/home/ashley/Development/ArchLinux/rasdaemon/rasdaemon-0.6.5'
make: *** [Makefile:515: all] Error 2

Logs indicate "only decoding architectural errors"

I'm running Rasdaemon on an Intel i3 9100 (Coffee Lake) CPU, and I've noticed this message in the logs:

rasdaemon: Family 6 Model 9e CPU: only decoding architectural errors

Looking at the MCE handler, it seems like this is a pretty standard catch-all for anything without explicit support. That said, I'm not exactly clear on what "architectural errors" cover. Explicit support seems to be limited to fairly old or niche (e.g. Knights Landing) architectures, and their implementations seem to cover very specific errors.

Is it safe to assume that "architectural errors" will cover corrected and uncorrected ECC errors? I'm primarily concerned with being able to see errors and associate them to a DIMM (I've added labels to /etc/ras/dimm_labels.d/ to help with the latter). Any insight you can share is much appreciated! 😄
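
As background on the term: Intel's machine-check architecture defines a set of status bits and error codes in IA32_MCi_STATUS that have the same meaning on every model, which is presumably what "architectural" refers to in the message. The sketch below decodes only that common part, including the compound code for memory controller errors. The function and dictionary names are made up for illustration, and whether rasdaemon reports these fields for this particular CPU model is not confirmed here.

```python
# Architectural bit positions of IA32_MCi_STATUS (identical across Intel
# CPUs that implement the machine-check architecture).
VAL = 1 << 63    # register holds a valid error
OVER = 1 << 62   # an earlier error was overwritten
UC = 1 << 61     # error was NOT corrected by hardware
ADDRV = 1 << 58  # IA32_MCi_ADDR holds a valid address

# Request types encoded in the MMM field of a memory controller error code.
MEM_REQUEST = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_status(status):
    """Decode only the architectural part of an IA32_MCi_STATUS value."""
    mca_code = status & 0xFFFF  # bits 15:0 hold the MCA error code
    info = {
        "valid": bool(status & VAL),
        "uncorrected": bool(status & UC),
        "overflow": bool(status & OVER),
        "address_valid": bool(status & ADDRV),
        "memory_controller_error": False,
    }
    # Memory controller errors use the compound code 000F 0000 1MMM CCCC.
    # Bit 12 (F) is masked out before comparing.
    if (mca_code & 0xEF80) == 0x0080:
        channel = mca_code & 0xF
        info["memory_controller_error"] = True
        info["request"] = MEM_REQUEST.get((mca_code >> 4) & 0x7, "reserved")
        # 0xF means the channel was not specified by the hardware.
        info["channel"] = None if channel == 0xF else channel
    return info


# Example: a corrected memory read error on channel 1 with a valid address.
example = decode_status(VAL | ADDRV | 0x0091)
```

In this scheme a corrected ECC error has UC clear and an uncorrected one has UC set. The channel field alone does not identify a DIMM.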

Print non_standard_event on one line

Currently the non_standard_event log spans multiple lines, like this:

rasdaemon: register inserted at db
<...>-64065 [077] dNH. 0.054782 non_standard_event 2023-09-07 18:37:13 +0800
Recoverable
section type: a6980811-16ea-4e4d-b936-fb00a23ff29c fru text: fru id: 00000000-0000-0000-0000-000000000000
length: 124
error:
00000000: 00000050 00800005 00000000 00000000
00000010: 00010000 00000000 00000000 00000000
00000020: 00000000 00000000 00000d8c 02030480
00000030: 000001ff 00000000 0000003b 00000000
00000040: 00000000 00000000 08000002 08000002
00000050: 00000000 00000000 000001ff 00000000
00000060: 003b0000 f0402710 00000000 00000000
00000070: 00000000 00000000 00000000

All other events use a single line, so it would be more reasonable to log non_standard_event on one line as well, excluding the error dump. That way, if you implement a decoder, the decoded non_standard_event log is easy to get on one line. Like this:

rasdaemon: register inserted at db
<...>-69298 [082] dNH. 0.054850 non_standard_event 2023-09-07 18:48:40 +0800 Recoverable section type: a6980811-16ea-4e4d-b936-fb00a23ff29c fru text: fru id: 00000000-0000-0000-0000-000000000000 length: 124 ......
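
The requested transformation can be sketched as follows. This is only an illustration of the idea on already-printed text, not the rasdaemon implementation (which is written in C); the function name is hypothetical.

```python
def flatten_event(lines):
    """Collapse a multi-line non_standard_event record into one line.

    Everything up to the 'error:' marker is joined with single spaces.
    The hex dump that follows is dropped and replaced by '......'.
    """
    parts = []
    for line in lines:
        text = line.strip()
        if text == "error:":
            parts.append("......")
            break
        if text:
            parts.append(text)
    return " ".join(parts)


record = [
    "<...>-64065 [077] dNH. 0.054782 non_standard_event 2023-09-07 18:37:13 +0800",
    "Recoverable",
    "section type: a6980811-16ea-4e4d-b936-fb00a23ff29c fru text: fru id: 00000000-0000-0000-0000-000000000000",
    "length: 124",
    "error:",
    "00000000: 00000050 00800005 00000000 00000000",
]
print(flatten_event(record))
```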

How to contribute a new label

Hi Everyone,

It seems hard to find label definitions on the net, so I would like to contribute mine, based on my understanding and the motherboard documentation.

Does anyone know how to do this?

Thanks :)

DBD::SQLite::db prepare failed: no such table: mc_event at /usr/sbin/ras-mc-ctl line 1183.

root@scratch:/var/lib/rasdaemon# rasdaemon --record
root@scratch:/var/lib/rasdaemon# ras-mc-ctl --summary
DBD::SQLite::db prepare failed: no such table: mc_event at /usr/sbin/ras-mc-ctl line 1183.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.
root@scratch:/var/lib/rasdaemon# ls
ras-mc_event.db
root@scratch:/var/lib/rasdaemon# rasdaemon -V
rasdaemon 0.8.0
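
One way to narrow this down is to check which tables the database file actually contains. The following is a minimal sketch using Python's standard sqlite3 module; the helper names are hypothetical, and the table names to look for are taken from the error message, not from a confirmed schema.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def missing_tables(conn, expected):
    """Return the expected table names that are absent from the database."""
    present = set(list_tables(conn))
    return [name for name in expected if name not in present]


# Usage against the file from the listing above (read-only, so nothing is
# created or modified if the path is wrong):
#   conn = sqlite3.connect(
#       "file:/var/lib/rasdaemon/ras-mc_event.db?mode=ro", uri=True)
#   print(missing_tables(conn, ["mc_event"]))
```

If mc_event shows up as missing, the file was created but the table never was, which would match the error from ras-mc-ctl.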

SIGBUS when starting with --record

Since linux v6.0.8 (which contains the fix for issue #73), rasdaemon will crash with SIGBUS when launched with the --record option.

Debugging under gdb (with an extra --foreground option) shows it seems to be from sqlite code:

Thread 72 "rasdaemon" received signal SIGBUS, Bus error.
[Switching to Thread 0x7ffebffff6c0 (LWP 7972)]
0x00007ffff7f2d684 in sqlite3_finalize () from /lib/x86_64-linux-gnu/libsqlite3.so.0
(gdb) bt
#0  0x00007ffff7f2d684 in sqlite3_finalize () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#1  0x00005555555644d0 in ?? ()
#2  0x0000555555561738 in ?? ()
#3  0x00007ffff7ce2fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007ffff7d6366c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

At present I haven't investigated any further, so I don't know yet whether it's actually an sqlite problem or a rasdaemon problem.

sudo rasdaemon -f -r fails on Ubuntu, because /usr/local/var/lib/rasdaemon does not exist

Rasdaemon tries to create a directory inside a folder that does not exist on Ubuntu 20.04:

overriding event (1400) ras:mc_event with new print handler
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
overriding event (1397) ras:aer_event with new print handler
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
overriding event (111) mce:mce_record with new print handler
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
rasdaemon: Listening to events for cpus 0 to 23
Calling ras_mc_event_opendb()
rasdaemon: Failed to create state directory /usr/local/var/lib/rasdaemon

This is what /usr/local looks like:

$ ll /usr/local/
total 40
drwxr-xr-x 10 root root 4096 febr   4  2021 ./
drwxr-xr-x 14 root root 4096 febr   4  2021 ../
drwxr-xr-x  2 root root 4096 febr   4  2021 bin/
drwxr-xr-x  2 root root 4096 febr   4  2021 etc/
drwxr-xr-x  2 root root 4096 febr   4  2021 games/
drwxr-xr-x  2 root root 4096 febr  12 16:52 include/
drwxr-xr-x  4 root root 4096 júl   30  2021 lib/
lrwxrwxrwx  1 root root    9 febr   4  2021 man -> share/man/
drwxr-xr-x  2 root root 4096 febr  12 16:52 sbin/
drwxr-xr-x  6 root root 4096 jan   17 04:11 share/
drwxr-xr-x  2 root root 4096 febr   4  2021 src/

Rasdaemon should detect this and either mkdir the required path during make install, or use something that is available.
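
The listing shows that /usr/local has no var/ directory at all, so three path components are missing, not just the last one. Creating the whole chain is the behavior of `mkdir -p`. A minimal sketch of that idea, with a hypothetical helper name and an assumed permission mode:

```python
import os


def ensure_state_dir(path, mode=0o755):
    """Create path and any missing parent directories, like `mkdir -p`.

    Succeeds silently if the directory already exists. The 0o755 mode is
    an assumption for illustration, not a value taken from rasdaemon.
    """
    os.makedirs(path, mode=mode, exist_ok=True)
    return path


# Usage (needs root for a path under /usr/local):
#   ensure_state_dir("/usr/local/var/lib/rasdaemon")
```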

ug! no event found for type 843

Building the RPM from c225517 on CentOS 7 results in the logs being spammed with ug! no event found for type 843. The ug! message is repeated 16468 times in the journal, but there are also journal rate limit messages, so the real total is probably much higher.

-- Logs begin at Sun 2022-05-15 21:28:23 UTC, end at Mon 2022-05-16 18:20:01 UTC. --
May 16 18:17:20 foo06.example.org systemd[1]: Starting RAS daemon to log the RAS events...
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Page offline choice on Corrected Errors is soft
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Threshold of memory Corrected Errors is 50 / 24h
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: ras:mc_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: ras:mc_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event ras:mc_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: ras:aer_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: mce:mce_record event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: ras:extlog_mem_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: block:block_rq_complete event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12672]: rasdaemon: Can't write to set_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: ras:aer_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event ras:aer_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get ras:non_standard_event traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from ras:non_standard_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get ras:arm_event traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from ras:arm_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: mce:mce_record event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event mce:mce_record
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: ras:extlog_mem_event event enabled
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Enabled event ras:extlog_mem_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get net:net_dev_xmit_timeout traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get devlink:devlink_health_report traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from devlink:devlink_health_report
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't write to filter file
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get ras:memory_failure_event traces. Perhaps this feature is not supported on your system.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Can't get traces from ras:memory_failure_event
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Listening to events for cpus 0 to 63
May 16 18:17:22 foo06.example.org systemd[1]: Started RAS daemon to log the RAS events.
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording mc_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording aer_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording extlog_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording mce_record events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording non_standard_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording arm_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording devlink_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording disk_errors events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: rasdaemon: Recording memory_failure_event events
May 16 18:17:22 foo06.example.org rasdaemon[12671]: trace-cmd: No such file or directory
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (968) ras:mc_event with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (967) ras:aer_event with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (82) mce:mce_record with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: overriding event (969) ras:extlog_mem_event with new print handler
May 16 18:17:22 foo06.example.org rasdaemon[12671]: Calling ras_mc_event_opendb()
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843
May 16 18:17:22 foo06.example.org rasdaemon[12671]: ug! no event found for type 843

How and why does ras-mc-ctl --summary calculate MAJ:MIN the way it does?

The MAJ:MIN numbers are very different from the MAJ:MIN values in lsblk and /sys/dev/, and I can't find anything in the documentation explaining how they are calculated. The conversion from lsblk to ras-mc-ctl described in this answer (https://unix.stackexchange.com/questions/602411/interpret-disk-errors-output-from-ras-mc-ctl-summary) sort of works, but when I apply the modulus operation to convert ras-mc-ctl values to lsblk values, some MAJ:MIN pairs that lsblk lists are missing from the ras-mc-ctl output, and vice versa. This makes it hard to determine which drive or partition each entry belongs to. I would have expected the numbers from block or char in /sys/dev to be used, but I am not experienced in this area and could be wrong.

I experienced this with rasdaemon-0.6.7-2.fc35.x86_64.
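
One possible explanation, stated here as an assumption and not confirmed for this rasdaemon version, is that a device number packed in the kernel-internal layout (major in the upper bits, 20-bit minor) gets unpacked with the userspace layout used by glibc's major()/minor(). The sketch below implements both layouts with plain bit operations so the effect can be compared against lsblk. All function names are hypothetical.

```python
MINORBITS = 20  # kernel-internal layout: dev = (major << 20) | minor


def kernel_dev(major, minor):
    """Pack major/minor in the kernel-internal layout."""
    return (major << MINORBITS) | minor


def kernel_split(dev):
    """Unpack a kernel-internal device number."""
    return dev >> MINORBITS, dev & ((1 << MINORBITS) - 1)


def glibc_dev(major, minor):
    """Pack major/minor in the userspace layout used by glibc's makedev()."""
    return (((major & 0x00000FFF) << 8) | ((major & 0xFFFFF000) << 32)
            | (minor & 0x000000FF) | ((minor & 0xFFFFFF00) << 12))


def glibc_split(dev):
    """Unpack with the userspace layout used by glibc's major()/minor()."""
    major = ((dev & 0x00000000000FFF00) >> 8) | ((dev & 0xFFFFF00000000000) >> 32)
    minor = (dev & 0x00000000000000FF) | ((dev & 0x00000FFFFFF00000) >> 12)
    return major, minor


def recover(shown_major, shown_minor):
    """Undo the mix-up: repack as userspace, then unpack as kernel-internal."""
    return kernel_split(glibc_dev(shown_major, shown_minor))


# A disk that lsblk lists as 8:0 would be displayed as 0:2048 if the
# layouts were mixed up in this way.
print(glibc_split(kernel_dev(8, 0)))   # (0, 2048)
print(recover(0, 2048))                # (8, 0)
```

If the pairs printed by ras-mc-ctl do not map back to lsblk values through `recover`, this explanation does not apply.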

--layout reports incorrect memory layout for EPYC 7xx2 CPU

[jhoblitt@pillan06 rasdaemon]$ sudo ras-mc-ctl --layout
          +-----------------------------------------------------------------------------------------------+
          |                                              mc0                                              |
          |  csrow0   |  csrow1   |  csrow2   |  csrow3   |  csrow4   |  csrow5   |  csrow6   |  csrow7   |
----------+-----------------------------------------------------------------------------------------------+
channel7: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
channel6: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----------+-----------------------------------------------------------------------------------------------+
channel5: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
channel4: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----------+-----------------------------------------------------------------------------------------------+
channel3: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
channel2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----------+-------------------------------------------------------------------------------------------------+
channel1: |     0 MB  |     0 MB  |  32767 MB  |  32767 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
channel0: |     0 MB  |     0 MB  |  32767 MB  |  32767 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----------+-------------------------------------------------------------------------------------------------+
[jhoblitt@pillan06 rasdaemon]$ free -g
              total        used        free      shared  buff/cache   available
Mem:            251          93          42           0         115         156
Swap:             0           0           0
[jhoblitt@pillan06 rasdaemon]$ sudo dmidecode --type 4
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x0029, DMI type 4, 48 bytes
Processor Information
	Socket Designation: CPU
	Type: Central Processor
	Family: Zen
	Manufacturer: Advanced Micro Devices, Inc.
	ID: 10 0F 83 00 FF FB 8B 17
	Signature: Family 23, Model 49, Stepping 0
	Flags:
		FPU (Floating-point unit on-chip)
		VME (Virtual mode extension)
		DE (Debugging extension)
		PSE (Page size extension)
		TSC (Time stamp counter)
		MSR (Model specific registers)
		PAE (Physical address extension)
		MCE (Machine check exception)
		CX8 (CMPXCHG8 instruction supported)
		APIC (On-chip APIC hardware supported)
		SEP (Fast system call)
		MTRR (Memory type range registers)
		PGE (Page global enable)
		MCA (Machine check architecture)
		CMOV (Conditional move instruction supported)
		PAT (Page attribute table)
		PSE-36 (36-bit page size extension)
		CLFSH (CLFLUSH instruction supported)
		MMX (MMX technology supported)
		FXSR (FXSAVE and FXSTOR instructions supported)
		SSE (Streaming SIMD extensions)
		SSE2 (Streaming SIMD extensions 2)
		HTT (Multi-threading)
	Version: AMD EPYC 7502P 32-Core Processor               
	Voltage: 1.1 V
	External Clock: 100 MHz
	Max Speed: 3350 MHz
	Current Speed: 2500 MHz
	Status: Populated, Enabled
	Upgrade: Socket SP3
	L1 Cache Handle: 0x0026
	L2 Cache Handle: 0x0027
	L3 Cache Handle: 0x0028
	Serial Number: Unknown
	Asset Tag: Unknown
	Part Number: Unknown
	Core Count: 32
	Core Enabled: 32
	Thread Count: 64
	Characteristics:
		64-bit capable
		Multi-Core
		Hardware Thread
		Execute Protection
		Enhanced Virtualization
		Power/Performance Control

[jhoblitt@pillan06 rasdaemon]$ sudo dmidecode --type 17
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.

Handle 0x002B, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x002A
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMMA1
	Bank Locator: P0_Node0_Channel0_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x002D, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x002C
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMMA2
	Bank Locator: P0_Node0_Channel0_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948EFE3B4
	Asset Tag: DIMMA2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x0030, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x002F
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMMB1
	Bank Locator: P0_Node0_Channel1_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x0032, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0031
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMMB2
	Bank Locator: P0_Node0_Channel1_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948EFE54C
	Asset Tag: DIMMB2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x0035, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0034
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMMC1
	Bank Locator: P0_Node0_Channel2_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x0037, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0036
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMMC2
	Bank Locator: P0_Node0_Channel2_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948EFE495
	Asset Tag: DIMMC2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x003A, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0039
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMMD1
	Bank Locator: P0_Node0_Channel3_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x003C, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x003B
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMMD2
	Bank Locator: P0_Node0_Channel3_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948EFE716
	Asset Tag: DIMMD2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x003F, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x003E
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMME1
	Bank Locator: P0_Node0_Channel4_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x0041, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0040
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMME2
	Bank Locator: P0_Node0_Channel4_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948EFE698
	Asset Tag: DIMME2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x0044, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0043
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMMF1
	Bank Locator: P0_Node0_Channel5_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x0046, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0045
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMMF2
	Bank Locator: P0_Node0_Channel5_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948EFE3B8
	Asset Tag: DIMMF2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x0049, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x0048
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMMG1
	Bank Locator: P0_Node0_Channel6_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x004B, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x004A
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMMG2
	Bank Locator: P0_Node0_Channel6_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948F02273
	Asset Tag: DIMMG2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x004E, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x004D
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: Unknown
	Set: None
	Locator: DIMMH1
	Bank Locator: P0_Node0_Channel7_Dimm0
	Type: Unknown
	Type Detail: Unknown

Handle 0x0050, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x0023
	Error Information Handle: 0x004F
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMMH2
	Bank Locator: P0_Node0_Channel7_Dimm1
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 3200 MT/s
	Manufacturer: Samsung
	Serial Number: T0FN00014948F02271
	Asset Tag: DIMMH2_AssetTag (date:21/49)
	Part Number: M393A4K40EB3-CWE    
	Rank: 2
	Configured Memory Speed: 3200 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: M393A4K40EB3-CWE    
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None
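
The discrepancy the title presumably refers to can be made explicit by totalling both outputs: dmidecode lists eight populated 32 GB DIMMs (DIMMA2 to DIMMH2), while the --layout table contains only four non-zero cells of 32767 MB. A small sketch of that arithmetic, parsing the text shown above; the function names are hypothetical.

```python
import re


def installed_gb(dmidecode_text):
    """Sum the 'Size: N GB' lines of `dmidecode --type 17` output.

    Lines such as 'Volatile Size: 32 GB' and 'Size: No Module Installed'
    are intentionally not counted.
    """
    pattern = re.compile(r"^[ \t]*Size: (\d+) GB[ \t]*$", re.MULTILINE)
    return sum(int(m.group(1)) for m in pattern.finditer(dmidecode_text))


def layout_total_mb(layout_text):
    """Sum every 'N MB' cell in the channel rows of `ras-mc-ctl --layout`."""
    total = 0
    for line in layout_text.splitlines():
        if line.startswith("channel"):
            total += sum(int(n) for n in re.findall(r"(\d+) MB", line))
    return total
```

For the output in this report the first function gives 8 x 32 = 256 GB, which agrees with the 251 reported by free -g, while the second gives 4 x 32767 = 131068 MB, roughly half of the installed memory.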

DBD::SQLite::db prepare failed

How do I fix this issue?

Package version rasdaemon 0.6.5-1ubuntu1.1

# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1181.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1182.
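
The error above is consistent with the report querying a devlink_event table that was never created in this database. Why the table is missing is not established in this report. A minimal Python sketch of the defensive pattern, checking sqlite_master before running the query, is shown below. The helper names are hypothetical, and ras-mc-ctl itself is a Perl script, so this only illustrates the idea and is not the project's code.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Return True if a table called `name` exists in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def devlink_summary(conn: sqlite3.Connection) -> str:
    """Summarize devlink events, tolerating a database without the table."""
    # Hypothetical helper: skip the query entirely when the table is absent
    # instead of failing at prepare time.
    if not table_exists(conn, "devlink_event"):
        return "No devlink errors."
    (count,) = conn.execute("SELECT COUNT(*) FROM devlink_event").fetchone()
    return f"{count} devlink errors." if count else "No devlink errors."
```

With this guard, a database that lacks the table produces the same "No devlink errors." line as an empty table would.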

Clearing errors / excluding old events from reports

Once there is an error event in the database it will apparently be reported every single time. It would be very useful to clear events so that they either are removed from the database or stay there but are excluded from reports like ras-mc-ctl --errors and ras-mc-ctl --summary.

A possible solution would be to include a column, "is_new", in the event database, that can be set to false to indicate that the event in question is old and should not be reported anymore (i.e., the queries used by the reports would add a ... WHERE is_new clause). Another possibility would be to have a global variable, "last_seen_id" which indicates that only events with id > last_seen_id should be reported.

rasdaemon: ras-mc-ctl --error-count random sorts output

I noticed this random ordering of DIMM numbers and channel/riser locations a while ago, and verified that git master f9cb13b (2024-02-05) shows the same behavior. I'm not sure whether it's a bug or a feature, but it happens on all tested machines (DQ57TM and MacPro 1,1, 2,1 and 3,1) running the repository version 0.6.8 from Debian 12 / Ubuntu 23.10 as well as f9cb13b.

It happens regardless of whether labels are used and/or registered.

$ sudo ras-mc-ctl --error-count #1
Label   	CE	UE
DIMM2_RA	0	0
DIMM2_RB	0	0
DIMM1_RB	0	0
DIMM1_RA	0	0
DIMM4_RA	0	0
DIMM4_RB	0	0
DIMM3_RB	0	0
DIMM3_RA	0	0
$ sudo ras-mc-ctl --error-count #2
Label   	CE	UE
DIMM1_RB	0	0
DIMM4_RA	0	0
DIMM1_RA	0	0
DIMM4_RB	0	0
DIMM3_RA	0	0
DIMM2_RB	0	0
DIMM3_RB	0	0
DIMM2_RA	0	0
$ sudo ras-mc-ctl --error-count #3
Label   	CE	UE
DIMM3_RA	0	0
DIMM1_RB	0	0
DIMM2_RB	0	0
DIMM4_RA	0	0
DIMM1_RA	0	0
DIMM3_RB	0	0
DIMM4_RB	0	0
DIMM2_RA	0	0
$ sudo ras-mc-ctl --error-count #4
Label   	CE	UE
DIMM2_RA	0	0
DIMM4_RB	0	0
DIMM3_RA	0	0
DIMM1_RB	0	0
DIMM3_RB	0	0
DIMM1_RA	0	0
DIMM2_RB	0	0
DIMM4_RA	0	0

$ sudo ras-mc-ctl --error-count | sort
DIMM1_RA	0	0
DIMM1_RB	0	0
DIMM2_RA	0	0
DIMM2_RB	0	0
DIMM3_RA	0	0
DIMM3_RB	0	0
DIMM4_RA	0	0
DIMM4_RB	0	0
Label   	CE	UE

$ sudo ras-mc-ctl --error-count #1
Label                      	CE	UE
mc#0branch#0channel#1slot#0	0	0
mc#0branch#1channel#0slot#0	0	0
mc#0branch#0channel#0slot#1	0	0
mc#0branch#1channel#1slot#1	0	0
mc#0branch#1channel#1slot#0	0	0
mc#0branch#0channel#0slot#0	0	0
mc#0branch#1channel#0slot#1	0	0
mc#0branch#0channel#1slot#1	0	0

$ sudo ras-mc-ctl --error-count #2
Label                      	CE	UE
mc#0branch#1channel#0slot#1	0	0
mc#0branch#0channel#0slot#0	0	0
mc#0branch#0channel#0slot#1	0	0
mc#0branch#1channel#0slot#0	0	0
mc#0branch#1channel#1slot#1	0	0
mc#0branch#0channel#1slot#0	0	0
mc#0branch#0channel#1slot#1	0	0
mc#0branch#1channel#1slot#0	0	0

$ sudo ras-mc-ctl --error-count #3
Label                      	CE	UE
mc#0branch#0channel#1slot#1	0	0
mc#0branch#1channel#0slot#0	0	0
mc#0branch#0channel#0slot#1	0	0
mc#0branch#1channel#1slot#0	0	0
mc#0branch#1channel#1slot#1	0	0
mc#0branch#0channel#0slot#0	0	0
mc#0branch#1channel#0slot#1	0	0
mc#0branch#0channel#1slot#0	0	0

$ sudo ras-mc-ctl --error-count #4
Label                      	CE	UE
mc#0branch#0channel#0slot#0	0	0
mc#0branch#1channel#0slot#0	0	0
mc#0branch#1channel#1slot#0	0	0
mc#0branch#0channel#1slot#0	0	0
mc#0branch#1channel#0slot#1	0	0
mc#0branch#0channel#0slot#1	0	0
mc#0branch#0channel#1slot#1	0	0
mc#0branch#1channel#1slot#1	0	0

By comparison, --guess-labels and --print-labels each use their own distinct but fixed order.

$ sudo ras-mc-ctl --guess-labels
memory stick 'DIMM 1' is located at 'DIMM Riser A'
memory stick 'DIMM 2' is located at 'DIMM Riser A'
memory stick 'DIMM 1' is located at 'DIMM Riser B'
memory stick 'DIMM 2' is located at 'DIMM Riser B'
memory stick 'DIMM 3' is located at 'DIMM Riser A'
memory stick 'DIMM 4' is located at 'DIMM Riser A'
memory stick 'DIMM 3' is located at 'DIMM Riser B'
memory stick 'DIMM 4' is located at 'DIMM Riser B'

$ sudo ras-mc-ctl --print-labels #edited labels but not registered yet
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 branch 0 channel 0 slot 0       DIMM1_RA             mc#0branch#0channel#0slot#0
mc0 branch 0 channel 0 slot 1       DIMM3_RA             mc#0branch#0channel#0slot#1
mc0 branch 0 channel 1 slot 0       DIMM2_RA             mc#0branch#0channel#1slot#0
mc0 branch 0 channel 1 slot 1       DIMM4_RA             mc#0branch#0channel#1slot#1
mc0 branch 1 channel 0 slot 0       DIMM1_RB             mc#0branch#1channel#0slot#0
mc0 branch 1 channel 0 slot 1       DIMM3_RB             mc#0branch#1channel#0slot#1
mc0 branch 1 channel 1 slot 0       DIMM2_RB             mc#0branch#1channel#1slot#0
mc0 branch 1 channel 1 slot 1       DIMM4_RB             mc#0branch#1channel#1slot#1

$ sudo ras-mc-ctl --register-labels
$ sudo ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 branch 0 channel 0 slot 0       DIMM1_RA             DIMM1_RA            
mc0 branch 0 channel 0 slot 1       DIMM3_RA             DIMM3_RA            
mc0 branch 0 channel 1 slot 0       DIMM2_RA             DIMM2_RA            
mc0 branch 0 channel 1 slot 1       DIMM4_RA             DIMM4_RA            
mc0 branch 1 channel 0 slot 0       DIMM1_RB             DIMM1_RB            
mc0 branch 1 channel 0 slot 1       DIMM3_RB             DIMM3_RB            
mc0 branch 1 channel 1 slot 0       DIMM2_RB             DIMM2_RB            
mc0 branch 1 channel 1 slot 1       DIMM4_RB             DIMM4_RB
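
Until the ordering is stable in ras-mc-ctl itself, the output can be post-processed. A plausible cause, assumed here and not verified against the script, is iteration over a Perl hash, whose key order is randomized per process in modern Perl. Plain sort (as above) moves the header line to the end and would place DIMM10 before DIMM2. The Python sketch below keeps the header first and sorts labels in natural order. sort_error_count is a hypothetical helper, not part of rasdaemon.

```python
import re


def natural_key(label: str):
    """Split a label into text and integer chunks so DIMM2 sorts before DIMM10."""
    # re.split with a capturing group always yields text at even indices and
    # digits at odd indices, so keys stay comparable element by element.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", label)]


def sort_error_count(output: str) -> str:
    """Sort the rows of `ras-mc-ctl --error-count` output, keeping the header first."""
    header, *rows = output.strip("\n").split("\n")
    rows.sort(key=lambda row: natural_key(row.split("\t")[0].strip()))
    return "\n".join([header, *rows])
```

The function takes the captured command output as a string, so it can be fed from a pipe or a subprocess call.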

AER not reported immediately

Hello,

On recent Linux versions, I have trouble retrieving AER events (not tested with other error types).
I'm using rasdaemon 0.7.0, but I also had the issue with rasdaemon 0.6.8.

After bisecting the kernel and testing, the commit torvalds/linux@42fb0a1 seems to be the breaking change.

Since the poll/read path has been fixed to work as designed, the ring buffer now has to be filled to a certain level before poll is notified that it should return. As a result, a large number of errors must accumulate before any events are delivered.

https://lore.kernel.org/all/[email protected]/T/#md2090ad803d1e4b2fe53bb51c9c78791445ed2ed

We tried setting buffer_percent and buffer_size_kb in tracefs to the smallest accepted values (1% and 1 KB; AFAIK buffer_percent is not documented yet), but we are still unable to retrieve single events.

This behavior has been reproduced in the kernel v5.15.82 and v6.2-rc1.

Reproduction of the issue using aer-inject:
I ran ./aer-inject -s xx:xx.x examples/nonfatal 73 times. Only once the 73rd AER had been sent were the 72 previous ones read by rasdaemon.
In my case, all of the AER events were on CPU 31.

# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent 
1
# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_size_kb 
1
# cat /sys/kernel/debug/tracing/instances/rasdaemon/buffer_total_size_kb
48
-- injecting 72 entries, still not poll'd by rasdaemon
# cat /sys/kernel/debug/tracing/instances/rasdaemon/per_cpu/cpu31/stats 
entries: 72
overrun: 0
commit overrun: 0
bytes: 4032
oldest event ts: 8828234
now ts: 8832310
dropped events: 0
read events: 0
-- here I sent the 73rd AER injection
# cat /sys/kernel/debug/tracing/instances/rasdaemon/per_cpu/cpu31/stats
entries: 1
overrun: 0
commit overrun: 0
bytes: 56
oldest event ts: 8832465
now ts: 8832556
dropped events: 0
read events: 72
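
The numbers above would fit a reader that is only woken once a full ring-buffer page is available. A back-of-the-envelope check in Python, assuming a 4096-byte page with a 16-byte page header (both are assumptions about this machine, not values taken from the report) and the event size implied by the first stats dump (4032 bytes / 72 entries):

```python
PAGE_SIZE = 4096    # assumed page size in bytes
PAGE_HEADER = 16    # assumed ring-buffer page header size in bytes


def events_per_page(event_size: int,
                    page_size: int = PAGE_SIZE,
                    header: int = PAGE_HEADER) -> int:
    """How many fixed-size events fit in the data area of one page."""
    return (page_size - header) // event_size


# Values taken from the first stats dump: bytes / entries.
event_size = 4032 // 72
fits = events_per_page(event_size)
print(event_size, fits)
```

Under these assumptions exactly 72 events of 56 bytes fit in one page, so the 73rd event would start a new page and make the first one readable. That matches "read events: 72" followed by "entries: 1, bytes: 56", but it is an interpretation of the stats rather than a confirmed diagnosis.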

Added the strace here: strace.txt

Is there something to do on the rasdaemon side, or do we need to report this on the kernel trace mailing list? If so, how should we proceed?

Release tarball not created for 0.8.0 because of Action failure (missing libtraceevent in CI)

The release tarball for 0.8.0 didn't get created because the Create Release Action failed.

See https://github.com/mchehab/rasdaemon/actions/runs/4210487183/jobs/7308214621:

checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for libtraceevent... no
configure: error: Package requirements (libtraceevent) were not met:

No package 'libtraceevent' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

(Sorry if you're already aware/working on this, just wanted to file while I remembered.)

ras-mc-ctl --layout makes use of uninitialized values and shows 0 MB

occurs in my nix package which uses v0.6.6-21-gb4764d4 (current master HEAD) running on nixos-unstable (plus that PR)

> ras-mc-ctl --layout                                                                                   
Use of uninitialized value $max_pos[3] in modulus (%) at /run/current-system/sw/bin/ras-mc-ctl line 882.
Use of uninitialized value $d in numeric ge (>=) at /run/current-system/sw/bin/ras-mc-ctl line 883.
Use of uninitialized value $d in sprintf at /run/current-system/sw/bin/ras-mc-ctl line 886.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
Use of uninitialized value $pos[3] in join or string at /run/current-system/sw/bin/ras-mc-ctl line 805.
    +-----------------------------------------------------------------------------------------------+
    |                                              mc0                                              |
    |        csrow0         |        csrow1         |        csrow2         |        csrow3         |
    | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  |
----+-----------------------------------------------------------------------------------------------+

0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----+-----------------------------------------------------------------------------------------------+

i'm not familiar with perl, but it doesn't seem impossible the perl interpreter in nixpkgs is set to be a bit more strict/verbose than the one used to develop this script

i also suspect the result of the script is incorrect, as i do in fact have memory in my system (ryzen 7 2700 with two 16gb sticks of unregistered ECC RAM on separate channels (asrock b450m pro4))

i do seem to recall getting some non-zero result from this script on NixOS on this hardware, but can't seem to replicate it, maybe this is due to now running on a newer kernel?
(i think i also set up an attempt at /etc/ras/{mainboard, dimm_labels.d} since then)

RAS errors

I started using rasdaemon a few months ago, and I have been seeing disk errors and a system log entry of rasdaemon: Can't get traces from ras:aer_event at every boot, as reported here: https://unix.stackexchange.com/questions/553527/how-to-diagnose-rasdaemon-disk-errors

$ ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No devlink errors.
Disk errors summary:
	0:0 has 88 errors
	0:2048 has 896 errors
	0:2064 has 14 errors
	0:2080 has 1 errors
	0:2816 has 8 errors
No MCE errors.

Are these errors indications of real issues, is it a mis-configuration of rasdaemon, or is it a bug with rasdaemon or something else?

After a recent system update (including upgrading to kernel 5.5.7) the errors now indicate some crashes at boot. The system is still stable and I would not have known of any issue without the reports in the system log:

ras errors in journalctl
rasdaemon[1337]: ras:mc_event event enabled
rasdaemon[1337]: rasdaemon: ras:mc_event event enabled
rasdaemon[1337]: rasdaemon: ras:aer_event event enabled
rasdaemon[1336]: rasdaemon: ras:mc_event event enabled
rasdaemon[1336]: rasdaemon: Enabled event ras:mc_event
rasdaemon[1336]: ras:mc_event event enabled
rasdaemon[1337]: rasdaemon: mce:mce_record event enabled
rasdaemon[1337]: rasdaemon: Can't write to set_event
rasdaemon[1337]: rasdaemon: devlink:devlink_health_report event enabled
rasdaemon[1337]: rasdaemon: block:block_rq_complete event enabled
rasdaemon[1336]: rasdaemon: ras:aer_event event enabled
rasdaemon[1336]: rasdaemon: Enabled event ras:aer_event
rasdaemon[1336]: Enabled event ras:mc_event
rasdaemon[1337]: ras:aer_event event enabled
rasdaemon[1337]: mce:mce_record event enabled
rasdaemon[1337]: Can't write to set_event
rasdaemon[1336]: ras:aer_event event enabled
rasdaemon[1336]: Enabled event ras:aer_event
rasdaemon[1337]: devlink:devlink_health_report event enabled
rasdaemon[1337]: block:block_rq_complete event enabled
rasdaemon[1336]: mce:mce_record event enabled
rasdaemon[1336]: Enabled event mce:mce_record
rasdaemon[1336]: rasdaemon: mce:mce_record event enabled
rasdaemon[1336]: rasdaemon: Enabled event mce:mce_record
rasdaemon[1336]: rasdaemon: Can't get ras:extlog_mem_event traces. Perhaps this feature is not supported on your system.
rasdaemon[1336]: rasdaemon: Can't get traces from ras:aer_event
systemd[1]: Started RAS daemon to log the RAS events.
rasdaemon[1336]: Can't get traces from ras:aer_event
rasdaemon[1336]: rasdaemon: net:net_dev_xmit_timeout event enabled
rasdaemon[1336]: rasdaemon: Enabled event net:net_dev_xmit_timeout
rasdaemon[1336]: rasdaemon: devlink:devlink_health_report event enabled
rasdaemon[1336]: rasdaemon: Enabled event devlink:devlink_health_report
rasdaemon[1336]: rasdaemon: block:block_rq_complete event enabled
rasdaemon[1336]: net:net_dev_xmit_timeout event enabled
audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rasdaemon comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=?
rasdaemon[1336]: rasdaemon: Enabled event block:block_rq_complete
rasdaemon[1336]: rasdaemon: Listening to events for cpus 0 to 15
rasdaemon[1336]: Enabled event net:net_dev_xmit_timeout
rasdaemon[1336]: devlink:devlink_health_report event enabled
rasdaemon[1336]: Enabled event devlink:devlink_health_report
rasdaemon[1336]: block:block_rq_complete event enabled
rasdaemon[1336]: Enabled event block:block_rq_complete
rasdaemon[1336]: rasdaemon: Recording mc_event events
rasdaemon[1336]: rasdaemon: Recording aer_event events
rasdaemon[1336]: rasdaemon: Recording extlog_event events
rasdaemon[1336]: rasdaemon: Recording mce_record events
rasdaemon[1336]: rasdaemon: Recording devlink_event events
rasdaemon[1336]: rasdaemon: Recording disk_errors events
rasdaemon[1336]: overriding event (1357) ras:mc_event with new print handler
rasdaemon[1336]: overriding event (1354) ras:aer_event with new print handler
rasdaemon[1336]: overriding event (114) mce:mce_record with new print handler
rasdaemon[1336]: overriding event (1438) net:net_dev_xmit_timeout with new print handler
rasdaemon[1336]: overriding event (1446) devlink:devlink_health_report with new print handler
rasdaemon[1336]: overriding event (1154) block:block_rq_complete with new print handler
rasdaemon[1336]: Calling ras_mc_event_opendb()
rasdaemon[1336]: cpu 08:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 09:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 10:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [001]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
abrtd[1453]: Too many clients, refusing connections to '/var/run/abrt/abrt.socket'
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [001]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
abrt-server[1472]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1473]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1470]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1474]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1477]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1475]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1479]: Lock file '.lock' is locked by process 1478
rasdaemon[1336]: devlink:devlink_health_report event enabled
rasdaemon[1336]: Enabled event devlink:devlink_health_report
rasdaemon[1336]: block:block_rq_complete event enabled
rasdaemon[1336]: Enabled event block:block_rq_complete
rasdaemon[1336]: rasdaemon: Recording mc_event events
rasdaemon[1336]: rasdaemon: Recording aer_event events
rasdaemon[1336]: rasdaemon: Recording extlog_event events
rasdaemon[1336]: rasdaemon: Recording mce_record events
rasdaemon[1336]: rasdaemon: Recording devlink_event events
rasdaemon[1336]: rasdaemon: Recording disk_errors events
rasdaemon[1336]: overriding event (1357) ras:mc_event with new print handler
rasdaemon[1336]: overriding event (1354) ras:aer_event with new print handler
rasdaemon[1336]: overriding event (114) mce:mce_record with new print handler
rasdaemon[1336]: overriding event (1438) net:net_dev_xmit_timeout with new print handler
rasdaemon[1336]: overriding event (1446) devlink:devlink_health_report with new print handler
rasdaemon[1336]: overriding event (1154) block:block_rq_complete with new print handler
rasdaemon[1336]: Calling ras_mc_event_opendb()
rasdaemon[1336]: cpu 08:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 09:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 10:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [001]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
abrtd[1453]: Too many clients, refusing connections to '/var/run/abrt/abrt.socket'
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:            <...>-104   [000]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
rasdaemon[1336]:           <idle>-0     [001]     0.000004: block_rq_complete:    2020-03-05 23:21:37 -0500
rasdaemon[1336]: cpu 14:rasdaemon: diskerror_eventstore: 0x561bcc45a418
rasdaemon[1336]: rasdaemon: register inserted at db
abrt-server[1472]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1473]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1470]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1474]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1477]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1475]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1479]: Lock file '.lock' is locked by process 1478
abrt-server[1471]: Lock file '.lock' is locked by process 1478
abrt-server[1476]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrtd[1453]: Too many clients, refusing connections to '/var/run/abrt/abrt.socket'
abrt-server[1471]: Lock file '.lock' is locked by process 1478
abrt-server[1479]: Lock file '.lock' is locked by process 1478
abrt-server[1471]: Lock file '.lock' is locked by process 1478
abrt-server[1479]: Lock file '.lock' is locked by process 1478
abrt-server[1480]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1481]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrtd[1453]: Too many clients, refusing connections to '/var/run/abrt/abrt.socket'
abrt-server[1471]: Lock file '.lock' is locked by process 1478
abrt-server[1479]: Lock file '.lock' is locked by process 1478
abrtd[1453]: Too many clients, refusing connections to '/var/run/abrt/abrt.socket'
abrt-server[1482]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1483]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1479]: Lock file '.lock' is locked by process 1471
abrt-server[1471]: Can't create directory '.libreport': File exists
abrt-server[1471]: Can't create meta-data directory
abrt-server[1471]: Error creating problem directory '/var/spool/abrt/ras-2020-03-05-23:21:38-1336.new'
abrt-server[1484]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1486]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1485]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1487]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'
abrt-server[1479]: Can't create directory '.libreport': File exists
abrt-server[1479]: Can't create meta-data directory
abrt-server[1479]: Error creating problem directory '/var/spool/abrt/ras-2020-03-05-23:21:38-1336.new'
abrt-server[1488]: Not saving repeating crash in '/boot/vmlinuz-5.5.7-200.fc31.x86_64'

ug! no event found for type 1058

When building rasdaemon with --disable-diskerror, logs are flooded with ug! no event found for type 1058. I tried to disable diskerror because otherwise my logs are flooded with <...>-218 [041] 0.000111: block_rq_complete: 2020-05-18 23:59:59 +0300. Unfortunately, this just seems to change the line that is logged.

Running rasdaemon 0.6.5 on Gentoo, kernel 5.6.13.

rasdaemon.log

rasdaemon leaking memory on nodes with high volumes of errors

Over the space of several weeks, we observe the memory usage of rasdaemon growing almost linearly, seemingly without bound.
Initially we thought this was related to this fd leak (4bf0b71), which we were also seeing. We backported 0.6.6 from Debian Bullseye but are still seeing issues.

jgroocock@47m54:~$ sudo rasdaemon -V
rasdaemon 0.6.6
jgroocock@47m54:~$ sudo ras-mc-ctl --summary
Memory controller events summary:
	Corrected on DIMM Label(s): 'CPU1_E0' location: 3:1:0:-1 errors: 1987314

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.

There doesn't appear to be anything in the way of errors in the logs, just event logs and nothing else:

Jan 07 17:12:50 47m54 rasdaemon[3546146]: rasdaemon: mc_event store: 0x559881b894f8
Jan 07 17:12:50 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:50 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mc_event:             2021-01-07 17:12:47 +0000 6 Corrected errors: memory read error on CPU1_E0 (mc: 3 location: 1:0 address: 0x20fa015f00 grain: 5  OVERFLOW err_code:0x0101:0x0091 socket:1 imc:1 rank:0 bg:0 ba:0 row:0xf40 col:0x370)
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: mce_record store: 0x559881ba6968
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:51 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mce_record:           2021-01-07 17:12:48 +0000 bank=8, status= dc00008001010091, MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error MscodDataRdErr, mci=Error_overflow Corrected_error Error_enabled, mca=M2M: , n_errors=2, cpu_type= Skylake server, cpu= 24, socketid= 1, misc= 200400c028201086, addr= 21730d5f00, mcgstatus=0, mcgcap= f000c14, apicid= 40
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: mc_event store: 0x559881b894f8
Jan 07 17:12:51 47m54 rasdaemon[3546146]: rasdaemon: register inserted at db
Jan 07 17:12:51 47m54 rasdaemon[3546146]:            <...>-2159524 [024]     0.562862: mc_event:             2021-01-07 17:12:48 +0000 2 Corrected errors: memory read error on CPU1_E0 (mc: 3 location: 1:0 address: 0x21730d5f00 grain: 5  OVERFLOW err_code:0x0101:0x0091 socket:1 imc:1 rank:0 bg:0 ba:0 row:0xe76 col:0x370)

Across thousands of nodes, we currently see ~65 where rasdaemon is using >1GiB of memory after 3 weeks. Some are using over 16GiB after that time and still climbing. Here is a graph over the last 24 days:
[image: graph over the last 24 days]
Generally though, rasdaemon doesn't seem to use much memory at all. The 50th percentile is a mere ~1MiB, 80th being ~1.4MiB and even the 99th is only a few hundred MiB.

One common factor is apparent: the nodes with high memory usage all have a very high volume of errors, as the output above suggests. We suspect this is due to faulty hardware.

Is there any information I can collect/provide that would help to narrow down what is holding onto the memory?
Thanks

rasdaemon: Can't open trace_clock

What's happening

Sometimes systemd fails to start rasdaemon. The log looks like this:

rasdaemon: Can't open trace_clock
rasdaemon: Can't select a timestamp for tracing

Root Cause

systemd starts two rasdaemon processes with different arguments. Both of them share a common code path:

  1. firstly, create /sys/kernel/debug/tracing/instances/rasdaemon
    rc = mkdir(ras->tracing, S_IRWXU);
  2. then, open /sys/kernel/debug/tracing/instances/rasdaemon/trace_clock
    fd = open_trace(ras, "trace_clock", O_RDONLY);

If the first process is still creating /sys/kernel/debug/tracing/instances/rasdaemon, the second process's mkdir fails in the kernel code path vfs_mkdir. The second process then tries to open trace_clock, but the first process's mkdir has not created that file yet.

mkdir and populating the directory's contents are not atomic.

Possible solution

  1. use a file lock to guard the mkdir operation
  2. change the kernel code
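
Possible solution 1 could look roughly like the sketch below. Both processes take an exclusive flock(2) on a shared lock file before calling mkdir, so the loser of the race blocks until the winner's mkdir has returned. The helper name mkdir_locked and the lock file path are hypothetical and not part of rasdaemon. The sketch also assumes the kernel has populated the instance directory by the time mkdir returns.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `dir` while holding an exclusive lock on
 * `lock_path`. Returns 0 if the directory exists afterwards, -1 on error
 * (with errno set).
 */
int mkdir_locked(const char *lock_path, const char *dir)
{
	int lock_fd, rc, saved_errno;

	lock_fd = open(lock_path, O_CREAT | O_RDWR, 0600);
	if (lock_fd < 0)
		return -1;

	/* Block until no other process holds the lock. */
	if (flock(lock_fd, LOCK_EX) < 0) {
		saved_errno = errno;
		close(lock_fd);
		errno = saved_errno;
		return -1;
	}

	rc = mkdir(dir, S_IRWXU);
	if (rc < 0 && errno == EEXIST)
		rc = 0;	/* the path already exists, e.g. created by the other process */
	saved_errno = errno;

	close(lock_fd);	/* closing the descriptor drops the lock */
	errno = saved_errno;
	return rc;
}
```

A caller would use something like mkdir_locked("/run/rasdaemon.lock", ras->tracing) in place of the bare mkdir (the lock path here is only an example), then open trace_clock as before.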

negative record size -8

occasionally rasdaemon will start to emit the following: ug! negative record size -8
it does so continuously until restarted

the journalctl -fu rasdaemon.service output will be quiet for 30 seconds, then emit something like this:

Nov 02 19:38:43 valix systemd-journald[686]: [🡕] Suppressed 5590251 messages from rasdaemon.service
Nov 02 19:38:43 valix rasdaemon[1009]:   ug! negative record size -8
Nov 02 19:38:43 valix rasdaemon[1009]:   ug! negative record size -8
Nov 02 19:38:43 valix rasdaemon[1009]:   ug! negative record size -8

strace seems to indicate it is continuously writing that message

20:08:34.341449 write(2, "ug! negative record size -8", 27) = 27 <0.000008>
20:08:34.341476 write(2, "\n", 1)       = 1 <0.000008>
20:08:34.341505 write(1, "\n", 1)       = 1 <0.000009>
20:08:34.341534 write(2, "  ", 2)       = 2 <0.000009>
20:08:34.341563 write(2, "ug! negative record size -8", 27) = 27 <0.000009>
20:08:34.341593 write(2, "\n", 1)       = 1 <0.000008>
20:08:34.341625 write(1, "\n", 1)       = 1 <0.000023>
20:08:34.341674 write(2, "  ", 2)       = 2 <0.000009>
20:08:34.341704 write(2, "ug! negative record size -8", 27) = 27 <0.000009>
20:08:34.341733 write(2, "\n", 1)       = 1 <0.000008>
20:08:34.341763 write(1, "\n", 1)       = 1 <0.000008>
20:08:34.341792 write(2, "  ", 2)       = 2 <0.000009>

this is on rasdaemon 0.6.8, default configuration besides setting mainboard and DIMM labels, on nixos-unstable

i still get the seemingly spurious errors reported in #10 but the reported number did not increase in a 1 minute interval

ordering systemd to restart rasdaemon results in the following after about 1:30 minutes

Nov 02 20:11:45 valix systemd[1]: rasdaemon.service: State 'stop-sigterm' timed out. Killing.
Nov 02 20:11:45 valix systemd[1]: rasdaemon.service: Killing process 1009 (rasdaemon) with signal SIGKILL.
Nov 02 20:11:45 valix systemd[1]: rasdaemon.service: Main process exited, code=killed, status=9/KILL
Nov 02 20:11:45 valix systemd[1]: rasdaemon.service: Failed with result 'timeout'.
Nov 02 20:11:45 valix systemd[1]: Stopped the RAS logging daemon.
Nov 02 20:11:45 valix systemd[1]: rasdaemon.service: Consumed 23h 53min 47.734s CPU time, read 2.0M from disk, written 27.6G to disk, no IP traffic.
Nov 02 20:11:45 valix systemd[1]: Started the RAS logging daemon.
Nov 02 20:11:45 valix rasdaemon[3357308]: Can't get traces from ras:memory_failure_event

my workstation's power usage rose by about 30W when rasdaemon got into this faulty mode
[image: power usage graph]
though it dropped by only ~15W after restarting rasdaemon

the horizontal lines in this graph are sections where the computer is in sleep mode (no data); this shows the issue did not start at such a boundary, and that it persists across them

Rasdaemon wrong mapping label

Hi all,

I have an issue with the DIMM label mapping:

First, here are my DIMMs without labels:

(rubis)-[root@rubis247 ~] $ ras-mc-ctl --error-count
Label                         	CE	UE
CPU_SrcID#0_Ha#0_Chan#0_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#3_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#3_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#1_DIMM#0	0	0
CPU_SrcID#0_Ha#0_Chan#2_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#0_DIMM#0	5539	0
CPU_SrcID#1_Ha#0_Chan#1_DIMM#0	0	0
CPU_SrcID#1_Ha#0_Chan#2_DIMM#0	0	0

According to the report without labels, CPU1 channel 0 slot 0 has 5539 correctable errors.

Then I labeled my DIMMs according to the Intel documentation for the S2600KPR mainboard:

https://www.intel.com/content/dam/support/us/en/documents/server-products/server-boards/S2600KP_HNS2600KP.pdf
Page 54

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: Intel Corporation model S2600KPR
(rubis)-[root@rubis247 ~]$ cat /etc/ras/dimm_labels.d/intel
vendor: Intel Corporation
  model: S2600KPR
#  <label>: <mc>.channel>.<slot>
    #CPU1
    DIMM_A1: 0.0.0
    DIMM_B1: 0.1.0
    DIMM_C1: 0.2.0
    DIMM_D1: 0.3.0

    #CPU2
    DIMM_E1: 1.0.0
    DIMM_F1: 1.1.0
    DIMM_G1: 1.2.0
    DIMM_H1: 1.3.0

Then I register my labels and print them:

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
mc0 channel 0 slot 0                DIMM_A1              DIMM_A1             
mc0 channel 1 slot 0                DIMM_B1              DIMM_B1             
mc0 channel 2 slot 0                DIMM_C1              DIMM_C1             
mc0 channel 3 slot 0                DIMM_D1              DIMM_D1             
mc1 channel 0 slot 0                DIMM_E1              DIMM_E1             
mc1 channel 1 slot 0                DIMM_F1              DIMM_F1             
mc1 channel 2 slot 0                DIMM_G1              DIMM_G1             
mc1 channel 3 slot 0                DIMM_H1              DIMM_H1

mc1 channel 0 slot 0 corresponds to DIMM_E1, which seems to be the correct mapping according to the documentation. So I would expect the 5539 errors to be reported on DIMM_E1, but instead I get:

(rubis)-[root@rubis247 ~]$ ras-mc-ctl --print-label
Label  	CE	UE
DIMM_E1	0	0
DIMM_D1	0	0
DIMM_H1	0	0
DIMM_F1	0	0
DIMM_G1	0	0
DIMM_A1	5539	0
DIMM_B1	0	0
DIMM_C1	0	0

I also checked the IPMI SEL, and it confirms that the correctable errors are on DIMM_E1 and not DIMM_A1.

Maybe I am doing something wrong (or maybe it is a bug). Can someone confirm? :)

rasdaemon does not log MCE

Hi,

I'm using rasdaemon v0.6.8 (From Debian, https://packages.debian.org/de/bookworm/rasdaemon) on Kernel 5.15 (Proxmox 7.4, 5.15.102-1-pve) and ASRock X570D4U-2L2T + AMD Ryzen 5950X.

I do get some MCEs in the kernel log:

root@pve:~# dmesg | grep -i mce
[    0.644337] mce: [Hardware Error]: Machine check events logged
[    0.644338] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: dc2040000000011b
[    0.644342] mce: [Hardware Error]: TSC 0 ADDR a8eb3fc80 MISC d01202dd01000000 SYND 88e00040a800200 IPID 9600050f00 
[    0.644345] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1682293811 SOCKET 0 APIC 0 microcode a201009
[    4.768515] MCE: In-kernel MCE decoding enabled.
[  310.396113] mce: [Hardware Error]: Machine check events logged
[  316.656894] mce: [Hardware Error]: Machine check events logged
[  627.947258] mce: [Hardware Error]: Machine check events logged
[  939.240972] mce: [Hardware Error]: Machine check events logged
[ 1250.534814] mce: [Hardware Error]: Machine check events logged
[ 1561.828702] mce: [Hardware Error]: Machine check events logged
[ 1873.122720] mce: [Hardware Error]: Machine check events logged

but ras-mc-ctl doesn't report anything:

root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

Everything seems to be running fine:

root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# systemctl status rasdaemon.service 
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-04-24 01:50:17 CEST; 32min ago
    Process: 1013 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
   Main PID: 1012 (rasdaemon)
      Tasks: 1 (limit: 154399)
     Memory: 15.3M
        CPU: 24ms
     CGroup: /system.slice/rasdaemon.service
             └─1012 /usr/sbin/rasdaemon -f -r

Apr 24 01:50:17 pve rasdaemon[1012]: Enabled event mce:mce_record
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: ras:extlog_mem_event event enabled
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Enabled event ras:extlog_mem_event
Apr 24 01:50:17 pve rasdaemon[1012]: ras:extlog_mem_event event enabled
Apr 24 01:50:17 pve rasdaemon[1012]: Enabled event ras:extlog_mem_event
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Listening to events for cpus 0 to 31
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording mc_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording aer_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording extlog_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording mce_record events

root@pve:~# systemctl status ras
rasdaemon.service   ras-mc-ctl.service  
root@pve:~# systemctl status ras-mc-ctl.service 
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
     Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2023-04-24 01:50:17 CEST; 33min ago
    Process: 1011 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
   Main PID: 1011 (code=exited, status=0/SUCCESS)
        CPU: 21ms

Apr 24 01:50:17 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Apr 24 01:50:17 pve ras-mc-ctl[1011]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
Apr 24 01:50:17 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware

Any ideas on how this could be debugged?
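
One low-effort check while debugging is to decode by hand the status word from the dmesg output above. The sketch below assumes only the architectural x86 MCA layout of the top bits of MCi_STATUS (bit 63 valid, 62 overflow, 61 uncorrected, 60 enabled, 59 MISC valid, 58 ADDR valid, 57 processor context corrupt). The struct and function names are made up for illustration and are not rasdaemon code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural flag bits at the top of an x86 MCi_STATUS register. */
#define MCI_STATUS_VAL    (1ULL << 63)	/* register contains a valid error */
#define MCI_STATUS_OVER   (1ULL << 62)	/* an earlier error was overwritten */
#define MCI_STATUS_UC     (1ULL << 61)	/* error was not corrected */
#define MCI_STATUS_EN     (1ULL << 60)	/* error reporting was enabled */
#define MCI_STATUS_MISCV  (1ULL << 59)	/* MISC register is valid */
#define MCI_STATUS_ADDRV  (1ULL << 58)	/* ADDR register is valid */
#define MCI_STATUS_PCC    (1ULL << 57)	/* processor context corrupt */

/* Hypothetical container for the decoded flags. */
struct mci_flags {
	bool valid;
	bool overflow;
	bool uncorrected;
	bool enabled;
	bool miscv;
	bool addrv;
	bool pcc;
};

/* Split the flag bits out of a raw status value. */
struct mci_flags decode_mci_status(uint64_t status)
{
	struct mci_flags f = {
		.valid       = (status & MCI_STATUS_VAL) != 0,
		.overflow    = (status & MCI_STATUS_OVER) != 0,
		.uncorrected = (status & MCI_STATUS_UC) != 0,
		.enabled     = (status & MCI_STATUS_EN) != 0,
		.miscv       = (status & MCI_STATUS_MISCV) != 0,
		.addrv       = (status & MCI_STATUS_ADDRV) != 0,
		.pcc         = (status & MCI_STATUS_PCC) != 0,
	};
	return f;
}
```

For the value logged above, 0xdc2040000000011b, this reports a valid, corrected error (UC clear) with the overflow bit set and valid ADDR and MISC registers.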
