spearfoot / disk-burnin-and-testing

Shell script for burn-in and testing of new or re-purposed drives

License: Other

Shell 100.00%

disk posix-compliant freenas freenas-forum smartmontools freebsd-scripts linux-scripts

disk-burnin-and-testing's Introduction

Shell script for burn-in and testing of drives

Purpose

disk-burnin.sh is a POSIX-compliant shell script I wrote to simplify the process of burning-in disks. It is intended for use only on disks which do not contain data, such as new disks or disks which are being tested or re-purposed. I was inspired by the "How To: Hard Drive Burn-In Testing" thread on the FreeNAS forum and I want to give full props to the good folks who contributed to that thread.

Warnings

Be warned that:

This script runs badblocks in destructive mode, which erases any data on the disk. Therefore, please be careful! Do not run this script on disks containing valuable data!
Run times for large disks can be several days. Use tmux or screen to test multiple disks in parallel.
Must be run as 'root'.

Tests

Performs these steps:

Run SMART short test
Run badblocks
Run SMART extended test

The script calls sleep after starting each SMART test, using a duration based on the polling interval reported by the disk, after which it polls for test completion.
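As a rough illustration of where such a sleep duration could come from, the sketch below pulls the recommended polling time out of `smartctl -c` style text. The function name `polling_minutes` and the sample text are hypothetical, and the two-line "self-test routine / recommended polling time" layout is an assumption that holds for typical ATA drives; this is not necessarily how disk-burnin.sh does it.

```shell
#!/bin/sh
# Hypothetical helper: extract the recommended polling time (in minutes)
# for a "Short" or "Extended" self-test from `smartctl -c` output on stdin.
# Assumes the ATA-style two-line layout shown in the sample below.
polling_minutes() {
  awk -v kind="$1" '
    $0 ~ ("^" kind " self-test routine") { found = 1; next }
    found && /recommended polling time/ {
      gsub(/[^0-9]/, "")   # keep only the digits of the minutes value
      print
      exit
    }'
}

# Sample text standing in for real `smartctl -c /dev/sdX` output.
sample='Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.'

minutes=$(printf '%s\n' "$sample" | polling_minutes Extended)
echo "sleep for $((minutes * 60)) seconds, then poll for completion"
```

Keeping the parsing in a function that reads standard input makes it easy to check against saved output from different drives before trusting it on real hardware.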

Full SMART information is pulled after each SMART test. All output except for the sleep command is echoed to both the screen and log file.

You should periodically monitor the burn-in progress and check for errors, particularly any errors reported by badblocks, or non-zero raw values for these SMART attributes:

ID	Attribute Name
5	Reallocated_Sector_Ct
196	Reallocated_Event_Count
197	Current_Pending_Sector
198	Offline_Uncorrectable

These indicate possible problems with the drive. You therefore may wish to abort the remaining tests and proceed with an RMA exchange for new drives or discard old ones. Also please note that this list is not exhaustive.
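One way to watch these four attributes is to filter the attribute table for non-zero raw values. The sketch below is a minimal example under stated assumptions: `flag_smart_attrs` is a hypothetical name, the input is assumed to follow the ATA attribute table layout of `smartctl -A` with the raw value in the tenth column, and the sample rows are invented for illustration. SAS drives report this information differently.

```shell
#!/bin/sh
# Hypothetical helper: print ID, name and raw value for attributes
# 5, 196, 197 and 198 whose raw value (assumed to be column 10) is non-zero.
flag_smart_attrs() {
  awk '$1 ~ /^(5|196|197|198)$/ && ($10 + 0) > 0 { print $1, $2, $10 }'
}

# Illustrative rows only, not output from a real drive.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'

printf '%s\n' "$sample" | flag_smart_attrs
# prints: 197 Current_Pending_Sector 8
```

On a real system the input would come from something like `smartctl -A /dev/sdX | flag_smart_attrs`; empty output means none of the four attributes has a non-zero raw value.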

The script extracts the drive model and serial number and creates a log filename of the form burnin-[model]_[serial number].log.
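The naming scheme can be sketched as follows. The function name `log_basename` is hypothetical, and replacing spaces with underscores is an assumption inferred from log names such as burnin-WDC_WD140EDFZ-11A0VA0_Y5HVU25C.log; the script's actual sanitising rules may differ.

```shell
#!/bin/sh
# Hypothetical helper: build the common part of the log and bad-blocks
# file names from a model string and a serial number.
log_basename() {
  model=$(printf '%s' "$1" | tr ' ' '_')    # assumption: spaces become underscores
  serial=$(printf '%s' "$2" | tr ' ' '_')
  printf 'burnin-%s_%s\n' "$model" "$serial"
}

base=$(log_basename 'WDC WD140EDFZ-11A0VA0' 'Y5HVU25C')
echo "log file:        ${base}.log"
echo "bad blocks file: ${base}.bb"
```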

`badblocks` Options

badblocks is invoked with the following options:

-b 8192 : Use a block size of 8192 (override this setting with the -b option below)
-e 1 : Abort the badblocks test immediately if an error is found (override this setting with the -x option below)
-c 64 : Number of concurrent blocks to check. (override this setting with the -c option below, but beware of memory use with high values)
-v : Verbose mode
-o : Write list of bad blocks found (if any) to a file named burnin-[model]_[serial number].bb
-s : Show progress
-w : Write-mode test, writes four patterns (0xaa, 0x55, 0xff, 0x00) on every disk block
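Put together, the options listed above would produce a command line along these lines. The sketch only prints the command and never runs it; `badblocks_cmd`, the placeholder file name and `/dev/sdX` are illustrative, and the exact order in which the script passes the options is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) a badblocks command line.
# $1 = block size, $2 = blocks per batch, $3 = max errors,
# $4 = bad-blocks output file, $5 = device
badblocks_cmd() {
  printf 'badblocks -b %s -c %s -e %s -v -o %s -s -w %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Defaults described above, with placeholder file and device names.
badblocks_cmd 8192 64 1 burnin-MODEL_SERIAL.bb /dev/sdX
```

Printing the command first is a cheap way to review it before anything destructive touches a disk.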

Usage

./disk-burnin.sh [-h] [-e] [-b <block_size>] [-c <num_blocks>] [-f] [-o <directory>] [-x] <disk>

Options

-h: show help text
-e: show extended help text
-b: block size (default: 8192)
-c: number of concurrent blocks to check (default: 64). Higher values will use more memory.
-f: run a full, destructive test. Disables the default 'dry-run mode'. ALL DATA ON THE DISK WILL BE LOST!
-o <directory>: write log files to <directory> (default: working directory $(pwd))
-x: perform a full pass of badblocks, using the -e 0 option.
<disk>: disk to burn-in (/dev/ may be omitted)
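For readers who want to see how an option set like this is commonly handled in POSIX shell, here is a minimal `getopts` sketch. The function name `parse_args` and all variable names are hypothetical; the defaults mirror the ones documented above, but the script's real parsing code may be organised differently.

```shell
#!/bin/sh
# Hypothetical sketch of parsing the documented options with getopts.
parse_args() {
  BLOCK_SIZE=8192; NUM_BLOCKS=64; MAX_ERRORS=1; DRY_RUN=1; LOG_DIR=$(pwd); DISK=''
  OPTIND=1
  while getopts ':heb:c:fo:x' opt; do
    case "$opt" in
      b) BLOCK_SIZE=$OPTARG ;;
      c) NUM_BLOCKS=$OPTARG ;;
      f) DRY_RUN=0 ;;                 # disable dry-run mode
      o) LOG_DIR=$OPTARG ;;
      x) MAX_ERRORS=0 ;;              # full pass: badblocks -e 0
      h|e) echo 'usage: disk-burnin.sh [options] <disk>'; return 0 ;;
      *) return 2 ;;                  # unknown option or missing argument
    esac
  done
  shift $((OPTIND - 1))
  [ $# -eq 1 ] || return 2
  case "$1" in                        # "/dev/" may be omitted
    /dev/*) DISK=$1 ;;
    *)      DISK=/dev/$1 ;;
  esac
}

parse_args -f -o /tmp/burn-in-logs sdc
echo "disk=$DISK dry_run=$DRY_RUN log_dir=$LOG_DIR"
```

Defaulting `DRY_RUN` to 1 and only clearing it on `-f` keeps the destructive path strictly opt-in.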

Examples

./disk-burnin.sh sda: run in dry-run mode on disk /dev/sda
./disk-burnin.sh -f /dev/sdb: run full, destructive test on disk /dev/sdb
./disk-burnin.sh -f -o ~/burn-in-logs sdc: run full, destructive test on disk /dev/sdc and write the log files to ~/burn-in-logs directory

Dry-Run Mode

The script runs in dry-run mode by default, so you can check the sleep durations and ensure that the sequence of commands suits your needs. In dry-run mode the script does not actually perform any SMART tests or invoke the sleep or badblocks programs.

In order to perform tests on drives, you will need to provide the -f option.

`smartctl` Device Type

Some users with atypical hardware environments may need to modify the script and specify the smartctl command device type explicitly with the -d option. User bcmryan reports success using -d sat with a Western Digital MyBook 8TB external drive enclosure.
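Rather than editing every smartctl call by hand, one possible approach is to route them through a small wrapper that adds `-d` only when a device type is set. This is a sketch, not part of the script: `smartctl_cmd` and the `SMARTCTL_DEVICE_TYPE` variable are hypothetical names, and the function only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical wrapper: print the smartctl command line, adding
# "-d <type>" only when SMARTCTL_DEVICE_TYPE is set and non-empty.
smartctl_cmd() {
  if [ -n "${SMARTCTL_DEVICE_TYPE:-}" ]; then
    printf 'smartctl -d %s %s\n' "$SMARTCTL_DEVICE_TYPE" "$*"
  else
    printf 'smartctl %s\n' "$*"
  fi
}

SMARTCTL_DEVICE_TYPE=sat
smartctl_cmd -t short /dev/sdb
# prints: smartctl -d sat -t short /dev/sdb
```

Leaving the variable unset keeps smartctl's automatic device detection, so users with ordinary SATA or SAS setups would see no change.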

FreeBSD / FreeNAS Notes

Before using the script on FreeBSD systems (including FreeNAS) you must first execute this sysctl command to alter the kernel's geometry debug flags. This allows badblocks to write to the entire disk:

sysctl kern.geom.debugflags=0x10

Also note that badblocks may issue the following warning under FreeBSD / FreeNAS, which can safely be ignored as it has no effect on testing:

set_o_direct: Inappropriate ioctl for device

Operating System Compatibility

Tested under:

FreeNAS 9.10.2-U1 (FreeBSD 10.3-STABLE)
FreeNAS 11.1-U7 (FreeBSD 11.1-STABLE)
FreeNAS 11.2-U8 (FreeBSD 11.2-STABLE)
Ubuntu Server 16.04.2 LTS
CentOS 7.0
Tiny Core Linux 11.1
Fedora 33 Workstation

Drive Models Tested

The script should run successfully on any SAS or SATA disk with SMART capabilities, which includes just about all modern drives. It has been tested on these particular devices:

Intel
- DC S3700 SSD
- Model 320 Series SSD
HGST
- Deskstar NAS (HDN724040ALE640)
- Ultrastar 7K4000 (HUS724020ALE640)
- Ultrastar He10
- Ultrastar He12
Western Digital
- Black (WD6001FZWX)
- Gold
- Re (WD4000FYYZ)
- Green
- Red
- WD140EDFZ
Seagate
- IronWolf NAS HDD 12TB (ST12000VN0008)
- IronWolf NAS HDD 8TB (ST8000NE001-2M7101)

Prerequisites

smartmontools, available at www.smartmontools.org

Uses: grep, awk, sed, sleep, badblocks, smartctl

Tested with the static analysis tool at www.shellcheck.net to ensure that the code is POSIX-compliant and free of issues.

Author

Original author: Keith Nash, March 2017. Modified on 19 February 2021.

disk-burnin-and-testing's Issues

Need for testing disk performance?

I was reading https://www.reddit.com/r/freenas/comments/adgef1/slow_sequential_write_speed_new_8_disk_raidz2/ where they did not get the performance expected due to one drive. This was not indicated in the SMART test. Is there a need to do a performance test as part of your burn-in and testing of new drives?

If there is a need, I think such a test is within the scope of this script, to identify faulty or degraded drives early.
I do not know which tool could be used.

Running the script returns "Please specify device type with the -d option." for 8 TB WD WD80EZAZ-11TDBA0 (in WD My Book)

Hi, thank you so much for writing this little script! There were a few issues I had with running it on a new 8 TB WD WD80EZAZ-11TDBA0 in a My Book external hard drive enclosure.

First and foremost, running the script without any modification returned "Please specify device type with the -d option." After a bit of Googling, I found a post from 2014 on https://bugs.freedesktop.org/show_bug.cgi?id=79379 that led me to the solution: adding -d sat after every instance of smartctl in the code. It was quick and dirty, but it worked. I don't think that this can be directly implemented into the script because it may cause breakage for others, but I did want to post it somewhere where others can find it if they run into the same issue as I did. This looks to be a problem with smartctl not automatically recognizing the connector.

I also have some other suggestions for the readme file. Since root privileges were required when I ran the script, it might be useful to let people know that they can run the script in a single line on the terminal as sudo bash ./disk-burnin.sh sdX. Secondly, since the user does need to set the Dry_Run variable to 0, it might also be helpful to bold the line "The script is distributed with 'dry runs' enabled, so you will need to edit the Dry_Run variable, setting it to 0, in order to actually perform tests on drives." or potentially even have that echoed whenever a user tries to run the script. (I'm not an IT guy or programmer by trade, so I know that modifying scripts is something that might trip newcomers up.)

Thanks for your help with writing this script and making it available to others!

Use of badblocks "-c" flag

I saw the note about the long testing times, and looked up expected times for badblocks on the disks I'm using (4TB). I found this useful answer on superuser, which mentioned that adjusting the value used for the "-c" flag made a big difference to the speed:

badblocks -svn /dev/sdb
To get to 1%: 1 Hour
To get to 10%: 8 hours 40 minutes

badblocks -svn -b 512 -c 32768 /dev/sda
To get to 1%: 35 Minutes
To get to 10%: 4 hours 10 minutes

badblocks -svn -b 512 -c 65536 /dev/sda
To get to 1%: 16 Minutes
To get to 10%: 2 hours 35 minutes

I naturally wondered if there's a downside to setting a higher "-c" value. Another helpful answer mentioned this:

The -c option corresponds to how many blocks should be checked at once. Batch reading/writing, basically. This option does not affect the integrity of your results, but it does affect the speed at which badblocks runs. badblocks will (optionally) write, then read, buffer, check, repeat for every N blocks as specified by -c. If -c is set too low, this will make your badblocks runs take much longer than ordinary, as queueing and processing a separate IO request incurs overhead, and the disk might also impose additional overhead per-request. If -c is set too high, badblocks might run out of memory. If this happens, badblocks will fail fairly quickly after it starts. Additional considerations here include parallel badblocks runs: if you're running badblocks against multiple partitions on the same disk (bad idea), or against multiple disks over the same IO channel, you'll probably want to tune -c to something sensibly high given the memory available to badblocks so that the parallel runs don't fight for IO bandwidth and can parallelize in a sane way.

I'm currently testing 6x 4TB disks and my memory use is under 300M, so that doesn't seem to be much of an issue. Is there another reason this option isn't used by the script?
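To put the memory question in rough numbers: the amount of data badblocks handles per batch is the block size multiplied by the `-c` value. The sketch below computes only that product; `batch_bytes` is a hypothetical name, and it is an assumption that badblocks allocates a small multiple of this figure for its buffers, so treat the result as a lower bound per running instance.

```shell
#!/bin/sh
# Hypothetical helper: bytes covered by one badblocks batch.
# $1 = block size in bytes (-b), $2 = blocks per batch (-c)
batch_bytes() {
  echo $(( $1 * $2 ))
}

echo "-b 8192 -c 64:    $(batch_bytes 8192 64) bytes"
echo "-b 512  -c 65536: $(batch_bytes 512 65536) bytes"
```

With these inputs the batches are 524288 bytes (512 KiB) and 33554432 bytes (32 MiB), which is consistent with six parallel runs staying in the low hundreds of megabytes.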

Script has weak SAS parsing logic [HELP WANTED]

The issue appears to be related to weak logic surrounding SAS models. However, more test cases are needed to confirm whether this is indeed a protocol difference or a difference in how manufacturers report SMART data.

It appears the script is incorrectly parsing smartctl results, as the script reports the following:

but sudo smartctl --all /dev/sda clearly shows the expected data

Expected behavior:
Correctly parse the results of smartctl so the script can function accordingly.

Why run badblocks 4 times with 4 different patterns?

I'm curious: why is the badblocks test performed 4 times with 4 different patterns? Is there an option of running a single pattern and, if so, what would be the best pattern to use even if it's not as rigorous?

confirmed working environment

Thanks for the tool! If you cared to include in the readme:

fedora 33 workstation
WDC_WD140EDFZ-11A0VA0 (RED?)

burnin-WDC_WD140EDFZ-11A0VA0_Y5HVU25C.log

Incomplete dependencies availability check

The script does not check whether smartmontools is available. When smartmontools isn't installed, running the script gives:

scripts/disk-burnin-and-testing-master/disk-burnin.sh: 263: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 264: smartctl: not found
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
[2020-10-05 09:49:59 UTC] + Started burn-in
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
[2020-10-05 09:49:59 UTC] Host:ubuntu-server
[2020-10-05 09:49:59 UTC] OS Flavor: Linux
[2020-10-05 09:49:59 UTC] Drive: /dev/sdc
[2020-10-05 09:49:59 UTC] Disk Type: non-mechanical
[2020-10-05 09:49:59 UTC] Drive Model:
[2020-10-05 09:49:59 UTC] Serial Number:
[2020-10-05 09:49:59 UTC] Short test duration: minutes
[2020-10-05 09:49:59 UTC] 0 seconds
[2020-10-05 09:49:59 UTC] Extended test duration: minutes
[2020-10-05 09:49:59 UTC] 0 seconds
[2020-10-05 09:49:59 UTC] Log file:/home/rakoczy/diskc/burnin-.log
[2020-10-05 09:49:59 UTC] Bad blocks file:/home/rakoczy/diskc/burnin-.bb
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
[2020-10-05 09:49:59 UTC] + Running SMART short test
[2020-10-05 09:49:59 UTC] +-----------------------------------------------------------------------------
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 1: eval: smartctl: not found
[2020-10-05 09:49:59 UTC] SMART short test started, awaiting completion for 0 seconds ...
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 483: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 490: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 483: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 490: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 483: eval: smartctl: not found
scripts/disk-burnin-and-testing-master/disk-burnin.sh: 490: eval: smartctl: not found
^C
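A check of this kind could be done up front with `command -v`, which POSIX specifies for locating commands. The sketch below is one possible shape for it, not the script's actual fix: `missing_commands` is a hypothetical name, and the list of tools is taken from the "Uses:" line in the README.

```shell
#!/bin/sh
# Hypothetical helper: print the names of any commands that cannot be found.
missing_commands() {
  missing=''
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  printf '%s\n' "${missing# }"   # strip the leading space
}

missing=$(missing_commands grep awk sed sleep badblocks smartctl)
if [ -n "$missing" ]; then
  echo "missing required commands: $missing" >&2
fi
```

In a real script the `if` branch would exit with a non-zero status before any log header is written, which would avoid the empty model and serial fields seen in the output above.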

New version mistakenly identifies WDC Green disk as 'non mechanical'

And as a result, skips execution of badblocks program.

WDC Green model: WDC_WD20EARX-00PASB0

Combine badblocks with cryptsetup?

Pro:

The drive might "compress" the pattern that is written to it. So combining with cryptsetup ensures that the drive actually stored every single bit correctly and did not just compress it.
As preparation for encrypting the drive, it is recommended to first overwrite it with secure random data, which happens as a side effect of testing the drive this way. Win-win.

Contra:

A bit higher CPU usage. But with AES-NI, this should be acceptable.

I have been doing this for years without issues btw, ref: https://github.com/ypid/scripts/blob/master/badblocks_and_secure_erase

Consider adding an f3 testing step

I found a good manual for burn-in testing on reddit:

https://www.reddit.com/r/DataHoarder/comments/alh22g/burning_in_hard_drives/efemr7k?utm_source=share&utm_medium=web2x&context=3

In steps 3-5, he also makes an additional check with ZFS and f3write.

Would it make sense to add these steps in your script, too?

Forgot to set kern.geom.debugflags on FreeBSD?

Hello, thanks for this script and the write up on your blog.

I'm running your script on a new disk under FreeBSD after having 1 of 3 new disks fail on me. (I'm a total noob, and the last thing I expected (trying to rescue my raid from near death) was a problem with the new disk!)

Anyway, I'm now running a burn in on the RMA'd replacement, but I forgot to execute this first:
sysctl kern.geom.debugflags=0x10

Should I now:

Kill the bb process and do that now?
Do it now with the bb process running anyway?
Just not do it and don't worry so much?

I read somewhere that after you've set this kernel flag you should un-set it again later (e.g. reboot) to avoid 'problems'... (Note that my pool is currently online (DEGRADED but backed up), as I'm using the freenas box itself to burn in the new disk).

Sorry for the noobs and thanks for any advice,
Dan.

Inappropriate ioctl

Hi. I'm new to FreeNAS and setting up my first NAS. Been slowly working my way through figuring everything out.

I'm running your script as we speak on 1 drive of mine. It just returned this:

+-----------------------------------------------------------------------------
+ Run badblocks test on drive /dev/da0: Sat Jun 20 15:46:58 EDT 2020
+-----------------------------------------------------------------------------
Checking for bad blocks in read-write mode
From block 0 to 1465130645
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
  6.86% done, 31:09 elapsed. (0/0/0 errors)

I'm not sure what inappropriate ioctl is, and this thing still has 9+ hours to go before it's done. Should I be concerned?

If it helps, the disk I'm running it on is a Seagate Enterprise NAS drive. 6 TB. SATA 3.0 (6 Gbps). Model number is ST6000NM0115-1YZ.

I don't expect you to help me with my particular device. Just wondering if you might share some insight?

Thanks!

Add check if device is in use.

I ran your script on an 8TB disk but it finished after just a couple of days. I discovered that badblocks had exited immediately for no apparent reason. It just said that it had finished and then the script continued with the next step.

When I ran badblocks myself I got the message

/dev/sdb is apparently in use by the system; it's not safe to run badblocks!

Apparently I had forgotten to unmount the drive. It would be nice if the script checked whether the device is mounted as soon as you run it and warned you.

Thanks for an awesome project!
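A check along the requested lines could look like the sketch below. It is Linux-only and rests on assumptions: `is_mounted` is a hypothetical name, the mount table is read from `/proc/mounts` unless another file is given, and partitions are matched by a numeric suffix (optionally preceded by `p`). FreeBSD would need a different source, such as the output of `mount`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the device or one of its partitions
# appears in the first column of a mount table.
# $1 = device (e.g. /dev/sdb), $2 = mount table file (default /proc/mounts)
is_mounted() {
  awk -v dev="$1" '
    $1 ~ ("^" dev "(p?[0-9]+)?$") { found = 1 }
    END { exit !found }' "${2:-/proc/mounts}"
}

# Demonstration against a sample table rather than the live system.
table=$(mktemp)
printf '%s\n' '/dev/sda2 / ext4 rw 0 0' '/dev/sdb1 /mnt/data ext4 rw 0 0' > "$table"

if is_mounted /dev/sdb "$table"; then
  echo "/dev/sdb is mounted; refusing to run badblocks" >&2
fi
rm -f "$table"
```

Failing early here would turn the silent badblocks exit described above into an explicit error before any SMART test is started.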

Cannot Handle Very Large Drives

When trying to run this script on some 18TB drives, badblocks threw the following error:

badblocks: Value too large for defined data type invalid end block (4394582016): must be 32-bit value

It seems like this is most likely due to the block size not being big enough for drives of this size. Can we get a dynamic block size based on drive size or another command line parameter to set this manually if we choose?
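The arithmetic behind this error can be sketched as follows. The end block in the message, 4394582016, is what an 18,000,207,937,536-byte disk yields at a 4096-byte block size, and it exceeds the largest unsigned 32-bit value, 4294967295. The function name `min_block_size` is hypothetical, and treating 4294967295 as the exact limit is an assumption based on the wording of the error.

```shell
#!/bin/sh
# Hypothetical helper: smallest power-of-two block size (starting at 512)
# for which the number of blocks fits in an unsigned 32-bit value.
# $1 = disk size in bytes
min_block_size() {
  size=$1
  bs=512
  # Round the block count up, then compare inside shell arithmetic.
  while [ $(( (size + bs - 1) / bs > 4294967295 )) -eq 1 ]; do
    bs=$((bs * 2))
  done
  echo "$bs"
}

min_block_size 18000207937536
# prints: 8192
```

Under these assumptions an 18TB drive needs at least `-b 8192`, which gives 2197291008 blocks.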

How to run the script?

First of all - thank you for publishing this little gem.

I'm new to this NAS game thing, and bought used disks (16x 2TB), and wanted to ensure that I know what I've got on my hands.

So I've made myself a bootable USB stick with Ubuntu 18.04, ensured that all tools are available and fetched this script. It ran very fast at first, and I wondered, "large disks may take a long time", hmm, what constitutes large disks?

Then I read the entire readme, carefully, and lo and behold, hidden there in the middle: "disable dry run". Shame on me for not RTFM. But bubbling this to the top would be very helpful for newcomers.

Lastly, I derived a "clever" method of running the tool for many disks (since I have some drives, and didn't want to sit and wait for it to finish).

ls /dev/sd[a-z] | cut -d'/' -f3 | sudo parallel -I{} ./wrapper.sh {}

# Wrapper contains this:
#!/bin/bash -xe
./disk-burnin.sh "${1}" > "logs/${1}.log"

What I'm in doubt about is whether this is a good method. Does running in parallel degrade performance or in any way prevent a valid test? I know this also tries to test my CD drive on /dev/sdr, but hey, worst case it fails :-)
From this, I also feel that it would be nice if the script accepted a full device path rather than a device name, e.g. to me it would be more logical to look in /dev/disk/by-path/ to figure out which disks to test.

I would be more than happy to submit a PR with these changes, I just didn't want to do too much without understanding what I'm actually doing.

EDIT, More questions:
It seems the polling logic is not working with smartmontools release 6.6 dated 2016-05-07 at 11:17:46 UTC, due to a changed output format (this might be Ubuntu 18.04 related). Also, in the mentioned version there is an option to run the tests in the foreground. Is there a particular reason for not doing this? (Maybe because it didn't exist.)

So in summary, the questions are:

What is a large disk?
Is there a good reason to place the info about turning off dry-run in the middle of the readme?
Is there a problem with running the script (or any of its tools) in parallel?
The smartmontools version in Ubuntu 18.04 can run the tests in the foreground. Is there a reason not to do this?

I hope this is at least somewhat helpful feedback. :-)
/Nwillems

Question: Utilisation of -c?

Hi, I came across this script and it's incredibly helpful, so thanks! One question I have is why you opted to change the blocksize flag to -b 8192 rather than, say, double the blocks written using -c?

I found running badblocks with the option -b 4096 was writing at around 25M/s which would have resulted in my 8T drive completing after 16 days. By modifying the call to badblocks to use -b 4096 -c 128 (double the default), I did see an almost double increase in write speeds. I didn't fancy going higher just to avoid any potential issues with badblocks misreporting anything, but figured there must be a sweet spot somewhere for larger drives?

Thanks.
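The run-time figure quoted above can be reproduced approximately with integer arithmetic. This is a back-of-the-envelope sketch under stated assumptions: `estimate_hours` is a hypothetical name, "8T" is taken as 8 × 10^12 bytes, "25M/s" as 25 × 10^6 bytes per second, and only the four write passes are counted. The read-back verification after each pattern would add to the total.

```shell
#!/bin/sh
# Hypothetical helper: estimated run time in whole hours.
# $1 = disk size in bytes, $2 = throughput in bytes/second, $3 = number of passes
estimate_hours() {
  echo $(( $1 / $2 * $3 / 3600 ))
}

estimate_hours 8000000000000 25000000 4
# prints: 355
```

355 hours is a little under 15 days, the same order of magnitude as the 16 days mentioned in the question; doubling the throughput halves the estimate.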