How to identify failing drives
When monitoring for drive failures in Nexenta and other ZFS-based storage appliances, keep in mind that not every failure takes the drive offline completely. If a user or technician suspects that a failing drive is slowing down I/O, check for hardware errors that may point to an impending failure.
For a basic scan, run show lun slotmap at the NMC prompt to list all of the slot numbers and their corresponding GUIDs. This should yield output like this:
nmc@NS2000LEFT:/$ show lun slotmap
jbod:4:
LUN JBOD Slot# DeviceId
c0t5000C50025FE16A7d0 jbod:4 1 id1,sd@n5000c50025fe16a7
c0t5000C50025FFB0FFd0 jbod:4 2 id1,sd@n5000c50025ffb0ff
c0t5000C50040CF1B2Fd0 jbod:4 3 id1,sd@n5000c50040cf1b2f
c0t5000C50034E9800Bd0 jbod:4 4 id1,sd@n5000c50034e9800b
c0t5000C50034FBE687d0 jbod:4 6 id1,sd@n5000c50034fbe687
c0t5000C5003412E31Fd0 jbod:4 7 id1,sd@n5000c5003412e31f
c0t5000C50034130597d0 jbod:4 8 id1,sd@n5000c50034130597
The GUIDs appear in the LUN column. Next, scan for hardware and software errors on the disks by running iostat -en from a root shell. This yields a list of GUIDs with soft (s/w), hard (h/w), transport (trn), and total (tot) error counts, so you can see which disks are generating errors:
root@NS2000LEFT:/opt/HAC/RSF-1/log# iostat -en
---- errors ---
s/w h/w trn tot device
0 0 0 0 c3d0
0 0 0 0 c0t5000C50031DB89A3d0
0 0 0 0 c0t5000C50031D8FA13d0
0 0 0 0 c0t5000C50031BB2793d0
0 0 0 0 c0t5000C50031BB2A23d0
0 0 0 0 c0t5000C50031D916A3d0
0 0 0 0 c0t5000C50042B6699Fd0
0 110 6752 6862 c0t5000C50025FE16A7d0
As we can see, the disk c0t5000C50025FE16A7d0 generated 6862 errors, 110 of which were hardware errors and the remaining 6752 transport errors. You can then reference the GUID back to the slotmap output to find that the disk is installed in slot 1 of the enclosure.
nmc@NS2000LEFT:/$ show lun slotmap
jbod:4:
LUN JBOD Slot# DeviceId
c0t5000C50025FE16A7d0 jbod:4 1 id1,sd@n5000c50025fe16a7
c0t5000C50025FFB0FFd0 jbod:4 2 id1,sd@n5000c50025ffb0ff
c0t5000C50040CF1B2Fd0 jbod:4 3 id1,sd@n5000c50040cf1b2f
c0t5000C50034E9800Bd0 jbod:4 4 id1,sd@n5000c50034e9800b
c0t5000C50034FBE687d0 jbod:4 6 id1,sd@n5000c50034fbe687
c0t5000C5003412E31Fd0 jbod:4 7 id1,sd@n5000c5003412e31f
c0t5000C50034130597d0 jbod:4 8 id1,sd@n5000c50034130597
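The cross-reference between the two outputs can also be scripted. The following is a minimal sketch, not a Nexenta-supplied tool: it assumes both outputs have been saved to text files (the file names and the function name map_errors_to_slots are hypothetical) and that the columns are laid out as in the listings above.

```shell
#!/bin/sh
# Sketch: print every device with a non-zero error total, plus its JBOD and slot.
# Assumptions: $1 is a saved copy of `iostat -en` output and $2 is a saved copy
# of `show lun slotmap` output, both in the column layout shown in this article.
# On some Solaris-derived systems the default awk is an older implementation,
# so nawk may be needed in place of awk.
map_errors_to_slots() {
    iostat_file=$1
    slotmap_file=$2
    awk '
        # First file (slotmap): remember the JBOD and slot for each LUN row.
        NR == FNR {
            if ($1 ~ /^c[0-9]+t/ && NF >= 4) { jbod[$1] = $2; slot[$1] = $3 }
            next
        }
        # Second file (iostat -en): data rows are "s/w h/w trn tot device".
        NF == 5 && $4 ~ /^[0-9]+$/ && $4 > 0 {
            if ($5 in slot) where = jbod[$5] " slot " slot[$5]
            else where = "slot unknown"
            printf "%s: %s (s/w=%s h/w=%s trn=%s tot=%s)\n", $5, where, $1, $2, $3, $4
        }
    ' "$slotmap_file" "$iostat_file"
}

# Example call (file names are placeholders):
#   map_errors_to_slots iostat.txt slotmap.txt
```

Devices that do not appear in the slotmap, such as an internal boot disk, are reported as "slot unknown" rather than dropped.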
You can then install a replacement drive in the slot to see if the errors subside. You will also want to check iostat again after running the replacement drive in the enclosure for a few days, to ensure that the error count doesn’t continue to increase.
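One way to check whether the error count keeps climbing is to save the iostat -en output right after the swap and again a few days later, then compare the two. The sketch below assumes two such saved files; the function name error_growth and the file names are hypothetical, and the comparison is only meaningful if the counters were not reset (for example by a reboot) between the two snapshots.

```shell
#!/bin/sh
# Sketch: report devices whose total error count grew between two snapshots.
# Assumptions: $1 and $2 are saved copies of `iostat -en` output (earlier and
# later), with data rows in the form "s/w h/w trn tot device".
error_growth() {
    awk '
        # First file: record the earlier total for each device.
        NR == FNR {
            if (NF == 5 && $4 ~ /^[0-9]+$/) before[$5] = $4
            next
        }
        # Second file: compare the later total against the earlier one.
        NF == 5 && $4 ~ /^[0-9]+$/ {
            prev = 0
            if ($5 in before) prev = before[$5]
            if ($4 - prev > 0)
                printf "%s: +%d errors (%d -> %d)\n", $5, $4 - prev, prev, $4
        }
    ' "$1" "$2"
}

# Example call (file names are placeholders):
#   error_growth iostat_day0.txt iostat_day3.txt
```

No output means no device showed a higher total in the later snapshot.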
In cases where significant errors are generated across all disks in an enclosure, it’s possible that a bad cable or bad backplane is generating the errors. Swap suspect hardware as necessary.
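To judge whether errors are confined to one disk or spread across an enclosure, it can help to count the affected disks per JBOD. This is a rough sketch under the same assumptions as before (saved copies of both outputs, hypothetical function and file names); it only summarizes the counts and leaves the interpretation to the technician.

```shell
#!/bin/sh
# Sketch: for each JBOD, count how many of its disks show a non-zero error total.
# Assumptions: $1 is a saved copy of `iostat -en` output and $2 is a saved copy
# of `show lun slotmap` output, in the column layouts shown in this article.
errors_per_enclosure() {
    awk '
        # First file (slotmap): map each LUN to its JBOD.
        NR == FNR {
            if ($1 ~ /^c[0-9]+t/ && NF >= 4) jbod[$1] = $2
            next
        }
        # Second file (iostat -en): tally disks that belong to a known JBOD.
        NF == 5 && $4 ~ /^[0-9]+$/ && ($5 in jbod) {
            total[jbod[$5]]++
            if ($4 > 0) bad[jbod[$5]]++
        }
        END {
            for (j in total)
                printf "%s: %d of %d disks report errors\n", j, bad[j], total[j]
        }
    ' "$2" "$1"
}

# Example call (file names are placeholders):
#   errors_per_enclosure iostat.txt slotmap.txt
```

A result where most or all disks in one JBOD report errors is consistent with the shared-hardware case described above, while a single affected disk points back to the drive itself.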