When reshaping or recovering disks, the mdstat plugin returns OK for both states.

mdstat plugin returns incorrect state for recovery/reshape about nagios-plugin-check_raid HOT 10 CLOSED

glensc commented on August 25, 2024

mdstat plugin returns incorrect state for recovery/reshape

from nagios-plugin-check_raid.

Comments (10)

toddles commented on August 25, 2024

More info: Version is 3.0.3

# ./check_raid.pl -p mdstat -V
check_raid Version 3.0.3

Running Debian v7.2

# cat /etc/debian_version
7.2

glensc commented on August 25, 2024

You probably want the --resync=WARNING option.

toddles commented on August 25, 2024

I'll do that, but I'm really in a worse state than that :). I'm actually in degraded mode, and if there is some way to reflect that status, that would be great. If I run mdadm --detail, I get a more accurate state:

# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Mon Sep 28 23:50:50 2009
     Raid Level : raid6
     Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Nov 19 10:44:38 2013
          State : clean, degraded, recovering
 Active Devices : 4
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 2

         Layout : left-symmetric
     Chunk Size : 4K

 Rebuild Status : 23% complete

           UUID : c0a45de6:f2b14d16:46611cb4:29fb6250 (local to host <hostname>)
         Events : 0.457232

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       32        1      active sync   /dev/sdc
       2       8       49        2      active sync   /dev/sdd1
       3       8       97        3      active sync   /dev/sdg1
       6       8       65        4      spare rebuilding   /dev/sde1
       7       8       81        5      spare rebuilding   /dev/sdf1

I'm running RAID 6 so I can lose two drives. I'm not sure what the status of the array will be if I lose 1 drive, and therefore how the plugin will behave. I'll do some more tests this week after the recovery is complete, as I need to take one of my drives and partition it appropriately.

BTW, thanks for providing this plugin, it seems VERY comprehensive.

glensc commented on August 25, 2024

Does mdadm use some other data than /proc/mdstat?

Theoretically it would be possible to work out whether the array is degraded: if you know it's raid6 and two disks are out (UUUU__), then it's degraded. But I don't want to take responsibility for calculating that wrong :(
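The calculation described here could look roughly like the sketch below. It is written in Python for a quick standalone test and is not the plugin's Perl code. The names `TOLERANCE`, `parse_status` and `redundancy_left` are hypothetical, and the per-level tolerances are general RAID assumptions rather than anything the plugin defines. raid1 and raid10 are deliberately left out because their tolerance depends on the number of mirrors and the layout.

```python
import re

# Assumed number of member failures each level survives (general RAID
# knowledge, not taken from check_raid). raid1/raid10 are omitted on purpose.
TOLERANCE = {"raid0": 0, "raid4": 1, "raid5": 1, "raid6": 2}


def parse_status(line):
    """Return (expected, active, missing) from a '[n/m] [UU_]' fragment, or None."""
    m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if m is None:
        return None
    expected, active, flags = int(m.group(1)), int(m.group(2)), m.group(3)
    # Each '_' in the flag string marks a member slot that is not in sync.
    return expected, active, flags.count("_")


def redundancy_left(level, missing):
    """Members that may still fail before data loss; None for unknown levels."""
    if level not in TOLERANCE:
        return None
    return TOLERANCE[level] - missing


# Illustrative fragment built from the (UUUU__) example in this comment.
expected, active, missing = parse_status("[6/4] [UUUU__]")
print(missing, redundancy_left("raid6", missing))
```

Returning `None` for levels that are not in the table keeps the sketch from guessing, which matches the concern about calculating the state wrong.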

Currently there's no mdadm driver; perhaps I should create one.

PS: thanks. I recently updated the mdstat plugin to parse /proc/mdstat exactly as the kernel writes it. Well, not in depth for each RAID variation (each raidX can write its own output), but at least up to the external device part.

glensc commented on August 25, 2024

current code does:

if (my($action, $perc, $eta, $speed) = 
  m{(resync|recovery|check|reshape)\s+=\s+([\d.]+%) \(\d+/\d+\) finish=([\d.]+min) speed=(\d+K/sec)}) {

Can I safely say that resync and check set the state via the --resync= option, but recovery and reshape always trigger a critical state?
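To see what the pattern in the snippet above captures, here is a small standalone check. It transcribes the same regular expression into Python (not the plugin's Perl) and runs it against the recovery progress line from the /proc/mdstat output posted in this thread. The name `PROGRESS` is hypothetical.

```python
import re

# Same pattern as the Perl snippet above, transcribed to Python's re syntax.
PROGRESS = re.compile(
    r"(resync|recovery|check|reshape)\s+=\s+([\d.]+%) \(\d+/\d+\) "
    r"finish=([\d.]+min) speed=(\d+K/sec)"
)

# Progress line copied from the /proc/mdstat output in this thread.
line = (
    "      [>....................]  recovery =  0.2% "
    "(3563496/1465135936) finish=1638.2min speed=14868K/sec"
)

m = PROGRESS.search(line)
if m:
    action, perc, eta, speed = m.groups()
    print(action, perc, eta, speed)
```

The first captured group is the action name, so any per-action state policy can key off that value.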

toddles commented on August 25, 2024

That's a good question, as this is the first major re-work I have done on this array in about 2 years. I want to see how the array behaves under certain conditions before I'd make a recommendation. Here's my story:

I had a 5x1.5 TB RAID 5 array. A little over a week ago, I lost a drive. The Nagios plugin I was using didn't report the failure; I was lucky to discover the state. I ordered two new drives, replaced the failed drive with one of them, and started the rebuild, which completed successfully. Once the rebuild was complete, I decided to convert (reshape) to RAID 6, using the other new drive as the extra parity drive. I also replaced the Nagios plugin I had with yours. During the reshape, I lost two drives about 24 hours apart (not sure what happened, a power issue or a controller issue), and that caused the array to fail. I spent a few hours getting the array back up, and the reshape continued, but Nagios never reported anything other than OK while the array was up.

Suffice it to say, this weekend was a crash course in Linux Software RAID management. mdadm is the management utility for Linux Software RAID, so I'm not sure how it gathers its data, but it's entirely possible for the RAID information on each of the drives to be out of sync in a failure state. /proc/mdstat does seem to be an accurate summary, however.

When I first started the reshape, I was redundant, as I had 4+1 and another parity drive being added. I wouldn't consider the reshape to be critical in this state. Maybe warning, as the array is being modified administratively. When I lost my first drive and was in a degraded/non-redundant state, I'd consider that to be critical, regardless of the activity.

The other reason why I'm hesitant to state that recovery would always be critical is because I am running RAID 6 (4+2). If I lose a drive, I'll be 4+1. I'll still have parity/redundancy, so it wouldn't be critical (to me at least). I'd consider the loss of 1 drive in RAID 6 to be a warning condition.
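The policy suggested in the last two paragraphs (the state follows the remaining redundancy rather than the activity alone) could be sketched as below. This is an illustration in Python, not plugin code. The function `nagios_state` and its parameters are hypothetical, and `resync_state` only mimics the idea behind the --resync option mentioned earlier in the thread.

```python
def nagios_state(tolerance, missing, action=None, resync_state="OK"):
    """Map redundancy and activity to a Nagios state name.

    tolerance    -- member failures the RAID level survives (2 for RAID 6)
    missing      -- members currently absent or still rebuilding
    action       -- None, or one of resync/recovery/check/reshape
    resync_state -- state to report for resync/check on a healthy array
    """
    if missing > 0:
        # Degraded: critical once no redundancy is left, warning otherwise.
        return "CRITICAL" if missing >= tolerance else "WARNING"
    if action == "reshape":
        # Fully redundant, but the array is being modified administratively.
        return "WARNING"
    if action in ("resync", "check"):
        return resync_state
    return "OK"


# Scenarios described in this thread:
print(nagios_state(1, 0, "reshape"))   # RAID 5 (4+1) at the start of the reshape
print(nagios_state(2, 1, "recovery"))  # RAID 6 with one member rebuilding
print(nagios_state(2, 2, "recovery"))  # RAID 6 with two members rebuilding
```

Whether a reshape on a fully redundant array deserves WARNING or OK is a judgment call; the sketch follows the "maybe warning" suggestion above.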

I'm wondering what state the array will report in that slightly degraded condition. I can send you output of both /proc/mdstat and mdadm --detail when I perform that test later this week.

Sorry for the rambling comment, thanks for your attention and support!

toddles commented on August 25, 2024

The rebuild completed successfully, and I am doubly redundant. I removed one of the disks to see how the RAID software would respond:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdb1[0] sdf1[5] sde1[4] sdg1[3] sdd1[2]
      5860543744 blocks level 6, 4k chunk, algorithm 2 [6/5] [U_UUUU]

/proc/mdstat didn't seem to show the degraded state. Then I checked with mdadm:

# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Mon Sep 28 23:50:50 2009
     Raid Level : raid6
     Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Nov 20 20:52:56 2013
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 4K

           UUID : c0a45de6:f2b14d16:46611cb4:29fb6250 (local to host <hostname>)
         Events : 0.461176

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       0        0        1      removed
       2       8       49        2      active sync   /dev/sdd1
       3       8       97        3      active sync   /dev/sdg1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1

This seemed to be more accurate. I then partitioned the drive I just removed, added it back to the array, and checked the RAID status:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdc1[6] sdb1[0] sdf1[5] sde1[4] sdg1[3] sdd1[2]
      5860543744 blocks level 6, 4k chunk, algorithm 2 [6/5] [U_UUUU]
      [>....................]  recovery =  0.2% (3563496/1465135936) finish=1638.2min speed=14868K/sec

unused devices: <none>

I then checked with mdadm, which again seemed to return accurate results:

# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Mon Sep 28 23:50:50 2009
     Raid Level : raid6
     Array Size : 5860543744 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Nov 20 20:57:03 2013
          State : clean, degraded, recovering
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 4K

 Rebuild Status : 0% complete

           UUID : c0a45de6:f2b14d16:46611cb4:29fb6250 (local to host <hostname>)
         Events : 0.461194

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       6       8       33        1      spare rebuilding   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       97        3      active sync   /dev/sdg1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1

From what I can tell, with a RAID 6 configuration, /proc/mdstat won't show a degraded state until a spare is rebuilding, but mdadm --detail will provide an accurate status of the array.
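Reading the state from mdadm --detail could be as simple as the sketch below. It is written in Python for illustration and is not part of the plugin. The function name `mdadm_states` is hypothetical, and the field layout is assumed from the outputs quoted in this thread, so other mdadm versions may format the line differently. The sample text is an abridged copy of the last output above.

```python
import re


def mdadm_states(detail_text):
    """Return the comma-separated values of the 'State :' line as a list."""
    # Anchor at line start so the 'RaidDevice State' table header is skipped.
    m = re.search(r"^[ \t]*State[ \t]*:[ \t]*(.+)$", detail_text, re.MULTILINE)
    if m is None:
        return []
    return [s.strip() for s in m.group(1).split(",")]


# Abridged from the `mdadm --detail /dev/md0` output quoted above.
sample = """\
/dev/md0:
     Raid Level : raid6
   Raid Devices : 6
          State : clean, degraded, recovering
 Active Devices : 5
    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
"""

states = mdadm_states(sample)
print(states)
print("degraded" in states)
```

Note that the mdstat outputs quoted above also carry the `[6/5] [U_UUUU]` counters, so a check could combine both sources instead of relying on the mdadm binary alone.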

Thanks again for looking into this!

glensc commented on August 25, 2024

I'm not sure what to do here, so I just added the mdstat output from the last posts as test data.

glensc commented on August 25, 2024

A new "driver" should probably be added here that processes the output of the mdadm binary.

glensc commented on August 25, 2024

closing for now
