Identify Bad DIMM from EDAC

This example shows how to identify a defective DIMM on an AMD x64 architecture machine whose syslog reports kernel errors from EDAC (Error Detection and Correction, a kernel module).

Here is a typical error message from EDAC:

kernel: [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
kernel: EDAC amd64 MC1: CE ERROR_ADDRESS= 0xf075b2410
kernel: EDAC MC1: CE page 0xf075b2, offset 0x410, grain 0, syndrome 0xa082, row 6, channel 0, label "": amd64_edac
kernel: [Hardware Error]: Error Status: Corrected error, no action required.
kernel: [Hardware Error]: CPU:6 (10:8:0) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c414000a0080813
kernel: [Hardware Error]: MC4_ADDR: 0x0000000f075b2410
kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

The message can be confusing at first. The EDAC line that starts with "EDAC MC1: CE page" carries the following fields:

Memory controller        MC1
Error type               CE (corrected error)
Memory page              0xf075b2
Offset in the page       0x410
Byte granularity         grain 0
Error syndrome           0xa082
Memory row               row 6
Memory channel           channel 0
DIMM label               not given (label "")
Module name              amd64_edac
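
The field breakdown above can be sketched as a small parser. This is a minimal illustration, not part of EDAC itself: the regular expression only targets the exact "CE page ..." line format shown in this article (other kernel versions may format the line differently), the function name parse_edac_line is hypothetical, and the address calculation assumes 4 KiB pages.

```python
import re

# Pattern for the "EDAC MCn: CE page ..." line as it appears in this article.
# Assumption: other kernel versions may print these fields differently.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE "
    r"page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), "
    r"grain (?P<grain>\d+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+), "
    r'label "(?P<label>[^"]*)": (?P<module>\S+)'
)

PAGE_SHIFT = 12  # assumption: 4 KiB memory pages


def parse_edac_line(line):
    """Return the fields of an EDAC CE line as a dict, or None if no match."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("mc", "grain", "row", "channel"):
        fields[key] = int(fields[key])
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)
    # Physical address = page number shifted by the page size, plus the offset.
    fields["address"] = (fields["page"] << PAGE_SHIFT) | fields["offset"]
    return fields


sample = ('kernel: EDAC MC1: CE page 0xf075b2, offset 0x410, grain 0, '
          'syndrome 0xa082, row 6, channel 0, label "": amd64_edac')
info = parse_edac_line(sample)
print(info["mc"], info["row"], info["channel"], hex(info["address"]))
```

For the sample line, the computed address is 0xf075b2410, which matches the ERROR_ADDRESS and MC4_ADDR values in the log.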

Additional Information

EDAC is composed of a "core" module (edac_core.ko) and several Memory Controller (MC) driver modules. On a given system, the CORE is loaded and one MC driver will be loaded. Both the CORE and the MC driver (or edac_device driver) have individual versions that reflect current release level of their respective modules. Thus, to "report" on what version a system is running, one must report both the CORE's and the MC driver's versions.

The example server used in this article has these EDAC modules loaded:

# lsmod | grep -i edac
amd64_edac_mod         21913  0
edac_core              46645  4 amd64_edac_mod
edac_mce_amd           15615  1 amd64_edac_mod

Memory Controller (mc) model: EDAC abstracts each memory controller as an 'mc' device. Each 'mc' device controls a set of DIMM memory modules. These modules are laid out in a table of Chip-Select Rows (csrowX) and Channels (chX).

There can be multiple csrows and multiple channels. Memory controllers allow for several csrows, with 8 csrows being a typical value.

Channel: each channel represents a DIMM module. Dual channels allow 128-bit data transfers from memory to the CPU. Some systems support more channels.

Csrow (Chip-Select Row): shows how the memory modules are assembled (single rank, dual rank, or more). The actual number of csrows depends on the electrical "loading" of a given motherboard, memory controller, and DIMM characteristics.

For single-rank DIMMs, a pair of DIMMs merges into one csrow. Typically you will see only csrow0 populated, while csrow1 is empty.

For more detail about EDAC, see "EDAC error detection and report".

Use the edac-util tool to identify the DIMM

See more examples of edac-util usage.

Check MC info and status

# edac-util -vs
edac-util: EDAC drivers are loaded. 2 MCs detected:
  mc0:F10h
  mc1:F10h

Check the EDAC error report with the edac-util tool

# edac-util -v 
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: ch0: 0 Corrected Errors
mc0: csrow1: ch1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: ch0: 0 Corrected Errors
mc0: csrow2: ch1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: ch0: 0 Corrected Errors
mc0: csrow3: ch1: 0 Corrected Errors
mc0: csrow4: 0 Uncorrected Errors
mc0: csrow4: ch0: 0 Corrected Errors
mc0: csrow4: ch1: 0 Corrected Errors
mc0: csrow5: 0 Uncorrected Errors
mc0: csrow5: ch0: 0 Corrected Errors
mc0: csrow5: ch1: 0 Corrected Errors
mc0: csrow6: 0 Uncorrected Errors
mc0: csrow6: ch0: 0 Corrected Errors
mc0: csrow6: ch1: 0 Corrected Errors
mc0: csrow7: 0 Uncorrected Errors
mc0: csrow7: ch0: 0 Corrected Errors
mc0: csrow7: ch1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: ch0: 0 Corrected Errors
mc1: csrow0: ch1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: ch0: 0 Corrected Errors
mc1: csrow1: ch1: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: ch0: 0 Corrected Errors
mc1: csrow2: ch1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: ch0: 0 Corrected Errors
mc1: csrow3: ch1: 0 Corrected Errors
mc1: csrow4: 0 Uncorrected Errors
mc1: csrow4: ch0: 0 Corrected Errors
mc1: csrow4: ch1: 0 Corrected Errors
mc1: csrow5: 0 Uncorrected Errors
mc1: csrow5: ch0: 0 Corrected Errors
mc1: csrow5: ch1: 0 Corrected Errors
mc1: csrow6: 0 Uncorrected Errors
mc1: csrow6: ch0: 7 Corrected Errors
mc1: csrow6: ch1: 0 Corrected Errors
mc1: csrow7: 0 Uncorrected Errors
mc1: csrow7: ch0: 0 Corrected Errors
mc1: csrow7: ch1: 0 Corrected Errors
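
In a report this long, the one nonzero counter is easy to miss. The following is a minimal sketch of a filter that keeps only channels with corrected errors. It assumes the per-channel line format shown above ("mcN: csrowN: chN: <count> Corrected Errors"); the function name nonzero_ce is hypothetical, and the sample text is a short excerpt of the output above rather than a live call to edac-util.

```python
import re

# Matches per-channel lines such as "mc1: csrow6: ch0: 7 Corrected Errors".
# Assumption: the report uses exactly the format shown in this article.
CHANNEL_LINE = re.compile(
    r"^(mc\d+): (csrow\d+): (ch\d+): (\d+) Corrected Errors$"
)


def nonzero_ce(report):
    """Return (mc, csrow, channel, count) tuples for channels with errors."""
    hits = []
    for line in report.splitlines():
        m = CHANNEL_LINE.match(line.strip())
        if m and int(m.group(4)) > 0:
            hits.append((m.group(1), m.group(2), m.group(3), int(m.group(4))))
    return hits


# Excerpt of the "edac-util -v" output from this article.
sample = """\
mc1: csrow5: ch1: 0 Corrected Errors
mc1: csrow6: 0 Uncorrected Errors
mc1: csrow6: ch0: 7 Corrected Errors
mc1: csrow6: ch1: 0 Corrected Errors
"""
print(nonzero_ce(sample))
```

On a live system the report text could be captured from the command and passed to the same function.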

The 7 corrected errors on mc1, csrow6, ch0 are consistent with the error we saw in the syslog (MC1, row 6, channel 0). The same information can also be found in /sys/devices/system/edac/mc/mc[0|1]:

# ls /sys/devices/system/edac/mc/mc1
ce_count     csrow0  csrow2  csrow4  csrow6  device   reset_counters    seconds_since_reset  ue_count
ce_noinfo_count  csrow1  csrow3  csrow5  csrow7  mc_name  sdram_scrub_rate  size_mb         ue_noinfo_count
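
The same counters can be read directly from sysfs. The sketch below is an illustration under assumptions: it expects per-channel files named chN_ce_count inside each csrowN directory, which is the legacy csrow layout matching the listing above, and a given kernel may lay things out differently. The function name read_ce_counts is hypothetical.

```python
import glob
import os

EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(root=EDAC_ROOT):
    """Map (mc, csrow, channel) to the corrected-error count read from sysfs.

    Assumption: each csrowN directory holds per-channel files named
    chN_ce_count (legacy csrow layout).
    """
    counts = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        csrow_dir, fname = os.path.split(path)
        mc_dir, csrow = os.path.split(csrow_dir)
        mc = os.path.basename(mc_dir)
        channel = fname.split("_", 1)[0]  # "ch0_ce_count" -> "ch0"
        with open(path) as f:
            counts[(mc, csrow, channel)] = int(f.read().strip())
    return counts


# Print only the channels that have logged corrected errors.
for (mc, csrow, ch), n in read_ce_counts().items():
    if n:
        print(f"{mc}: {csrow}: {ch}: {n} Corrected Errors")
```

On a machine without EDAC loaded, the glob matches nothing and the loop prints nothing.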

Info Analysis

From the information above, we can tell that the server has two memory controllers and 8 pairs of dual-rank memory modules (four pairs per controller, each pair occupying two csrows).

The memory errors are located at:

mc1: csrow6: ch0: 7 Corrected Errors

This identifies the physical DIMM:

It is on the second memory controller (mc1).
It is in the fourth pair of DIMMs (csrow6 and csrow7).
ch0 means it is the first DIMM in that pair.
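
The mapping above can be written down as a short calculation. This sketch assumes the layout described in this article, where each pair of dual-rank DIMMs occupies two consecutive csrows. The function name locate_dimm and the csrows_per_pair parameter are hypothetical, and the physical slot still has to be looked up in the motherboard documentation.

```python
def locate_dimm(mc, csrow, channel, csrows_per_pair=2):
    """Translate (mc, csrow, channel) into a 1-based DIMM pair position.

    Assumption: each DIMM pair spans `csrows_per_pair` consecutive csrows
    (two for the dual-rank modules in this article).
    """
    pair_index = csrow // csrows_per_pair          # 0-based pair index
    first_csrow = pair_index * csrows_per_pair
    return {
        "controller": mc,                           # mc1 is the second controller
        "pair": pair_index + 1,                     # 1-based pair number
        "csrows": list(range(first_csrow, first_csrow + csrows_per_pair)),
        "dimm_in_pair": channel + 1,                # ch0 is the first DIMM
    }


print(locate_dimm(mc=1, csrow=6, channel=0))
```

For mc1, csrow6, ch0 this yields pair 4 (csrow6 and csrow7) and the first DIMM in the pair, matching the analysis above.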

Using the information above, you can locate the DIMM in the server's hardware documentation (the motherboard architecture map usually shows the slot layout).
