Disk mapping for failed drives

How to identify failing drives

When monitoring for drive failures on Nexenta and other ZFS-based storage appliances, not every failure takes the drive offline completely. If a user or technician suspects that a failing drive is slowing down I/O, it is worth checking for hardware errors that may point to an impending failure.
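
If slow I/O is the main symptom, a quick look at per-device latency can flag a struggling disk before it logs outright errors. The sketch below uses the standard illumos iostat extended statistics; the 5-second interval and 6-sample count are arbitrary examples:

  # Sample extended per-device statistics every 5 seconds, 6 times.
  # A disk whose asvc_t (average service time, in ms) is consistently far
  # higher than its peers in the same pool is worth a closer look.
  iostat -xn 5 6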

For a basic scan, get a list of all of the slot numbers and their corresponding GUIDs:

  • show lun slotmap

This should yield an output like this:

nmc@NS2000LEFT:/$ show lun slotmap
jbod:4:
LUN                     JBOD     Slot#   DeviceId
c0t5000C50025FE16A7d0   jbod:4   1       id1,sd@n5000c50025fe16a7
c0t5000C50025FFB0FFd0   jbod:4   2       id1,sd@n5000c50025ffb0ff
c0t5000C50040CF1B2Fd0   jbod:4   3       id1,sd@n5000c50040cf1b2f
c0t5000C50034E9800Bd0   jbod:4   4       id1,sd@n5000c50034e9800b
c0t5000C50034FBE687d0   jbod:4   6       id1,sd@n5000c50034fbe687
c0t5000C5003412E31Fd0   jbod:4   7       id1,sd@n5000c5003412e31f
c0t5000C50034130597d0   jbod:4   8       id1,sd@n5000c50034130597


The LUN column lists each disk by its GUID-based device name (for example, c0t5000C50025FE16A7d0). Next, you’ll want to scan the disks for hardware and software errors:

  • iostat -en

This will yield per-device error counts, keyed by the same device names, so you can see which disks are generating errors:

root@NS2000LEFT:/opt/HAC/RSF-1/log# iostat -en
     ---- errors ---
 s/w  h/w   trn   tot device
   0    0     0     0 c3d0
   0    0     0     0 c0t5000C50031DB89A3d0
   0    0     0     0 c0t5000C50031D8FA13d0
   0    0     0     0 c0t5000C50031BB2793d0
   0    0     0     0 c0t5000C50031BB2A23d0
   0    0     0     0 c0t5000C50031D916A3d0
   0    0     0     0 c0t5000C50042B6699Fd0
   0  110  6752  6862 c0t5000C50025FE16A7d0

As we can see, the disk c0t5000C50025FE16A7d0 has generated over 6,000 errors, 110 of which were hardware errors. You can then reference the device name back to the slotmap output to find that this disk is installed in slot 1 of the enclosure:

nmc@NS2000LEFT:/$ show lun slotmap
jbod:4:
LUN                     JBOD     Slot#   DeviceId
c0t5000C50025FE16A7d0   jbod:4   1       id1,sd@n5000c50025fe16a7
c0t5000C50025FFB0FFd0   jbod:4   2       id1,sd@n5000c50025ffb0ff
c0t5000C50040CF1B2Fd0   jbod:4   3       id1,sd@n5000c50040cf1b2f
c0t5000C50034E9800Bd0   jbod:4   4       id1,sd@n5000c50034e9800b
c0t5000C50034FBE687d0   jbod:4   6       id1,sd@n5000c50034fbe687
c0t5000C5003412E31Fd0   jbod:4   7       id1,sd@n5000c5003412e31f
c0t5000C50034130597d0   jbod:4   8       id1,sd@n5000c50034130597
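
With a larger enclosure it can be quicker to grep the slotmap for the suspect device than to scan the table by eye. A minimal sketch from a root shell, assuming the appliance's nmc binary accepts a one-shot command with -c (the device name is the one reported by iostat above):

  # Save the slotmap once, then look up the suspect disk by device name.
  nmc -c "show lun slotmap" > /tmp/slotmap.txt
  grep c0t5000C50025FE16A7d0 /tmp/slotmap.txt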

 

You can then install a replacement drive in that slot and watch whether the errors subside. Check iostat again after the replacement drive has been running in the enclosure for a few days, to ensure that the error count is not continuing to increase.
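
The iostat error counters are cumulative since the last reboot, so what matters on the follow-up check is whether the totals keep growing between readings. A minimal sketch for filtering the output down to devices that have logged any errors at all:

  # Print only devices whose total error count (4th column) is non-zero,
  # skipping the two header lines of the iostat -en output.
  iostat -en | awk 'NR > 2 && $4 > 0 {print}'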

In cases where significant errors are generated across all disks in an enclosure, it’s possible that a bad cable or a bad backplane is generating the errors. Swap out the suspect hardware as necessary.
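
To get a rough sense of whether the problem is one disk or something systemic, counting how many devices report errors can help; if most disks in the same JBOD show non-zero totals, cabling or the backplane becomes the more likely culprit. A minimal sketch along the same lines as above:

  # Count how many devices have logged at least one error of any kind.
  iostat -en | awk 'NR > 2 && $4 > 0' | wc -l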

 

Related Articles

    • "Disk Layout Failed" during Boot
    • Locating failed drives
    • Failed HDD Replacement Instructions
    • Failed Boot Mirror HDD Replacement Instructions
    • Standard Upgrade Procedure (Nexenta)