How To Diagnose Memory Errors on AMD x64 using EDAC

Find the first DIMM slot using dmidecode output



*******************************************************************************

1. Which EDAC modules are in use? This HowTo is for the amd64_edac module.

 

# lsmod | grep -i amd

amd64_edac_mod         55921  0

edac_mc                61217  1 amd64_edac_mod

 

*******************************************************************************

2. Get the memory error information from the kernel log.

 

# dmesg | grep -E -i edac\|northbridge

Northbridge Error (node 3): DRAM ECC error detected on the NB.

EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710

EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3,

channel 1, label "": amd64_edac

 

The salient parts are: MC3, row 3, and channel 1.
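
The triplet can also be pulled out of the log mechanically. The following is a minimal sketch, not part of the original procedure: the function name edac_ce_triplet is made up for illustration, and the pattern assumes the message layout shown above, with the whole message on a single line (the sample above is wrapped for display).

```shell
# Hypothetical helper: read kernel log lines on stdin and print the
# mc/row/channel triplet for every EDAC CE or UE report found.
# Assumes the "EDAC MCn: CE page ..., row R, channel C, ..." layout shown above.
edac_ce_triplet() {
    sed -n -E 's/.*EDAC MC([0-9]+): (CE|UE) page .* row ([0-9]+), channel ([0-9]+),.*/mc=\1 row=\3 channel=\4/p'
}
```

It could be used as: dmesg | edac_ce_triplet | sort | uniq -c, which would also show how often each triplet was reported.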

 

*******************************************************************************

3. Get the memory controller (MCx) device information.

 

If you have cleared the kernel log then you will have to reboot. With a new

log, you will have the EDAC driver messages which help identify the DIMMS.

(blank lines have been added to the output for clarity)

# dmesg | grep -E -i edac\|northbridge                                          

EDAC MC: Ver: 2.0.1 Oct 20 2011

EDAC amd64_edac: Ver: 3.4.0

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 0).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC0: Giving out device to amd64_edac F10h: DEV 0000:00:18.2

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 1).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC1: Giving out device to amd64_edac F10h: DEV 0000:00:19.2

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 2).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC2: Giving out device to amd64_edac F10h: DEV 0000:00:1a.2

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 3).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 4).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC4: Giving out device to amd64_edac F10h: DEV 0000:00:1c.2

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 5).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC5: Giving out device to amd64_edac F10h: DEV 0000:00:1d.2

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 6).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC6: Giving out device to amd64_edac F10h: DEV 0000:00:1e.2

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 7).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC7: Giving out device to amd64_edac F10h: DEV 0000:00:1f.2
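
With eight nearly identical blocks it is easy to lose track of which MC belongs to which PCI device. As a rough sketch (the function name edac_mc_devices is invented here, and the pattern assumes the "Giving out device" wording shown in the output above), the mapping can be condensed to one line per controller:

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "MCn <pci-device>" for every "Giving out device" message.
edac_mc_devices() {
    sed -n -E 's/.*EDAC (MC[0-9]+): Giving out device to .*: DEV ([0-9a-fA-F:.]+).*/\1 \2/p'
}
```

For example: dmesg | edac_mc_devices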

 

*******************************************************************************

4. Analysis of the information given.

 

This board, a Supermicro H8QG6, has 4 processors each having 8 DIMM slots. As

seen above, the EDAC driver has enumerated them such that there are 8 memory

controller instances (MC0-MC7). There are two MCs for each processor. Each MC

serves 4 DIMM slots. This board is physically labeled like this: P1-DIMM1A,

P1-DIMM1B, P1-DIMM2A, P1-DIMM2B ... P1-DIMM4B, and on up to P4-DIMM4B in the

same manner.

 

In order to make sure that you are interpreting the EDAC information

correctly, you have to know the current actual DIMM setup. I have four 4GB

DIMMS in the 'A' slots of each processor. That is a total of 16GB per

processor and 64GB on the board. There are 16 DIMMS installed total.

 

The actual memory controller/EDAC device control files can be examined by

looking into the directory: /sys/devices/system/edac/mc. There you will find

the log files for both correctable and uncorrectable errors, and a directory

for each memory controller instance.

 

# ls -F1 /sys/devices/system/edac/mc

log_ce

log_ue

mc0/

mc1/

mc2/

mc3/

mc4/

mc5/

mc6/

mc7/

panic_on_ue

poll_msec
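
To get a quick overview of all controller instances at once, a small loop over these directories is enough. This is only a sketch: the function name edac_mc_summary is hypothetical, and it assumes each mcN directory contains the size_mb, ce_count and ue_count files that appear in the mc3 listing further down.

```shell
# Hypothetical helper: print size and error counters for every memory
# controller instance. The base directory can be overridden for testing.
edac_mc_summary() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc[0-9]*; do
        [ -d "$mc" ] || continue          # skip if the glob matched nothing
        printf '%s size_mb=%s ce=%s ue=%s\n' \
            "$(basename "$mc")" \
            "$(cat "$mc/size_mb")" \
            "$(cat "$mc/ce_count")" \
            "$(cat "$mc/ue_count")"
    done
}
```

Called without arguments it reads the live sysfs tree. A controller with a non-zero ce count is the one to look at more closely.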

 

Here is the error again:

 

Northbridge Error (node 3): DRAM ECC error detected on the NB.

EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710

EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3, channel 1, label "": amd64_edac

 

You can see the last line states EDAC MC3 so we can look into the mc3 directory:

 

# cd /sys/devices/system/edac/mc

# ls -F1 mc3

ce_count

ce_noinfo_count

csrow2/

csrow3/

device@

mc_name

reset_counters

seconds_since_reset

size_mb

ue_count

ue_noinfo_count

 

All of these files except for the device link are text files so they can be

easily examined. Look at the file size_mb for the entire controller instance:

 

# cd mc3

# cat size_mb

8192

 

This is half of the 16GB that are present for processor number 2.

Again, I am using 4GB DDR3 DIMMS. Remember that each memory controller instance is

managing half of the slots adjacent to each processor. This board has 8 slots

per processor and currently has 4 DIMMS installed into the A slots

for each processor. There is a total of 64GB of RAM on the board, 16GB per proc, 8GB per

MC, and 4GB per DIMM. Processor 2 is served by MC2 and MC3.

 

Each of the DIMMS is 'dual ranked' which means that there are 2GB per 'chip

select row' (csrow). A 'rank' corresponds to a populated csrow. Thus, these

4GB DIMMS show up in two csrows.

 

The csrow2/ and csrow3/ directories contain the following files:

 

# ls -1 csrow2                                                               

ce_count

ch0_ce_count

ch0_dimm_label

dev_type

edac_mode

mem_type

size_mb

ue_count

 

The size_mb file contains the amount of RAM that this chip select row is managing:

 

# cat csrow2/size_mb

4096

 

# cat csrow3/size_mb

4096

 

Why 4096 and not 2048 (one half of the DIMM) in both rows? Because the csrows are

interleaved across two channels! This means that memory of one 4GB DIMM in slot 1A and

one 4GB DIMM in slot 2A show up in two rows and two channels. For MC3, the

csrow2 and csrow3 files contain the total size of the memory managed by this

memory controller instance. (The other 8GB is managed by MC2)
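
Put as arithmetic: 2048MB per rank x 2 channels = 4096MB per csrow, and 2 csrows x 4096MB = 8192MB for the controller. The per-csrow numbers can be listed in one go with something like the sketch below. The function name edac_csrow_summary is made up, and the per-channel counter files are matched with a glob because the listing above only shows ch0_ce_count; whether a ch1_ce_count file is present on a given kernel is an assumption.

```shell
# Hypothetical helper: for one mcN directory, print each csrow's size and
# whatever per-channel CE counters (chN_ce_count) exist in it.
edac_csrow_summary() {
    mcdir=${1:-.}
    for row in "$mcdir"/csrow[0-9]*; do
        [ -d "$row" ] || continue
        printf '%s size_mb=%s' "$(basename "$row")" "$(cat "$row/size_mb")"
        for f in "$row"/ch[0-9]*_ce_count; do
            [ -f "$f" ] || continue       # no per-channel counters found
            printf ' %s_ce=%s' "$(basename "$f" _ce_count)" "$(cat "$f")"
        done
        printf '\n'
    done
}
```

For example: edac_csrow_summary /sys/devices/system/edac/mc/mc3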

 

This can be confusing. Here is the correspondence between memory controllers

and processors:

 

MC0, MC1 -> processor 1

MC2, MC3 -> processor 2

MC4, MC5 -> processor 3

MC6, MC7 -> processor 4

 

Memory controller MC2 is managing slots 1-4 for processor 2. MC3 is

managing slots 5-8 for processor 2.  The first 4 slots are P2-DIMM1A,

P2-DIMM1B, P2-DIMM2A, P2-DIMM2B, and the second 4 slots are P2-DIMM3A,

P2-DIMM3B, P2-DIMM4A, P2-DIMM4B.

 

Take a look at the EDAC messages for MC3 again:

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 3).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 2

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2

 

This memory controller uses 8 chip select rows (MC 0-7) and with the current

DIMM installation is showing 2 channels (DCT0 and DCT1). That is a confusing

print out because the two characters, MC, are used in multiple places and seem

to mean different things.

 

If we remove the DIMM in P2-DIMM4A, the EDAC driver messages would look like this:

 

EDAC amd64: ECC is enabled by BIOS.

EDAC amd64: F10h detected (node 3).

EDAC MC: DCT0 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:  2048MB 3:  2048MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC MC: DCT1 chip selects:

EDAC amd64: MC: 0:     0MB 1:     0MB

EDAC amd64: MC: 2:     0MB 3:     0MB

EDAC amd64: MC: 4:     0MB 5:     0MB

EDAC amd64: MC: 6:     0MB 7:     0MB

EDAC amd64: using x8 syndromes.

EDAC amd64: MCT channel count: 1

EDAC amd64: CS2: Registered DDR3 RAM

EDAC amd64: CS3: Registered DDR3 RAM

EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2

 

Note that the MCT channel count is now 1. There are still two csrows involved

for the single DIMM in slot P2-DIMM3A (it is dual ranked), but the total size

for each csrow is now only 2048. There is nothing in DCT1 which is channel 1.

 

The total for the entire memory controller mc3 with one DIMM is 4096 as expected:

# cd /sys/devices/system/edac/mc/mc3

# cat size_mb

4096

 

The size_mb files for mc3/csrow2 and mc3/csrow3 now contain:

 

# cat csrow2/size_mb

2048

 

# cat csrow3/size_mb

2048

 

That is 2048MB, or one half of the DIMM, allocated to each csrow (rank).

 

It should be obvious now that the EDAC log messages and error messages do not

by default show the actual physical DIMM slot on the motherboard. That has to

be deduced from the triplet of mc/row/channel as explained in the conclusion.

 

*******************************************************************************

5. Conclusion

 

Take a look at the EDAC error one more time:

 

# dmesg | grep -E -i edac\|northbridge

Northbridge Error (node 3): DRAM ECC error detected on the NB.

EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710

EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3,

channel 1, label "": amd64_edac

 

As we said before, the error is on MC3, row 3, channel 1. We now know that MC3

is managing the second 4 slots of processor 2's eight slots, and that row 3 is

the 2nd rank of a dual ranked DIMM. There have also been EDAC errors for row

2, channel 1, which makes perfect sense. Row 2 is the first rank on the same DIMM.

 

But what physical DIMM slot contains the defective DIMM?

 

The reported channel number, in this case 1, corresponds to DCT1 (the 2nd

channel) which is DIMM4A or DIMM4B. We now know that it must be DIMM4A because

rows 2&3 correspond to the A slots and rows 0&1 correspond to the B slots. But

we also know that we don't have any DIMMS in the B slots! That helps.

 

So the defective DIMM is P2-DIMM4A.

 

Here is a diagram for processor 2 showing the correspondence between rows,

channels, and DIMMS. Recall that the MCx tells us which processor, as explained above.

 

MC2

Channel 0 (DCT0)

 row0 row1 P2-DIMM1B

 row2 row3 P2-DIMM1A

 row4 row5 unused

 row6 row7 unused

Channel 1 (DCT1)

 row0 row1 P2-DIMM2B

 row2 row3 P2-DIMM2A

 row4 row5 unused

 row6 row7 unused

MC3

Channel 0 (DCT0)

 row0 row1 P2-DIMM3B

 row2 row3 P2-DIMM3A

 row4 row5 unused

 row6 row7 unused

Channel 1 (DCT1)

 row0 row1 P2-DIMM4B

 row2 row3 P2-DIMM4A

 row4 row5 unused

 row6 row7 unused
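
The diagram can be turned into a small lookup. This is a sketch only, and it encodes nothing beyond what was observed above on this particular board (Supermicro H8QG6, two MCs per processor, rows 0-1 in the B slots, rows 2-3 in the A slots). The function name edac_slot_label is hypothetical, and the mapping should not be assumed to hold on other boards or with other DIMM types.

```shell
# Hypothetical helper: map an EDAC mc/row/channel triplet to the physical
# slot label, following the diagram above. Valid for this board only.
edac_slot_label() {
    mc=$1
    row=$2
    channel=$3
    proc=$(( mc / 2 + 1 ))                    # MC0,MC1 -> P1; MC2,MC3 -> P2; ...
    dimm=$(( (mc % 2) * 2 + channel + 1 ))    # even MC: DIMM1-2, odd MC: DIMM3-4
    case $row in
        0|1) bank=B ;;                        # rows 0-1 are the B slots
        2|3) bank=A ;;                        # rows 2-3 are the A slots
        *)   echo "unused"; return 1 ;;       # rows 4-7 are unused in this setup
    esac
    echo "P${proc}-DIMM${dimm}${bank}"
}
```

For the error discussed here, edac_slot_label 3 3 1 prints P2-DIMM4A.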

 

*******************************************************************************

Appendix

 

On this Supermicro H8QG6 with AMD processors and the amd64 EDAC driver code,

there is a strange occurrence. If you populate the B DIMM slots their memory

will show up in csrows 0 and 1. I did experiments to demonstrate this and it

seems to be linked to the fact that the DMI enumeration recognizes the B slots

before the A slots. I expected the A slots to come first, but that expectation

may be mistaken.

 

Here is the output of dmidecode for the memory devices. As you can see, the

info for P1_DIMM1B shows up before P1_DIMM1A:

 

# dmidecode -t 17

 

SMBIOS 2.6 present.

 

Handle 0x001E, DMI type 17, 28 bytes

Memory Device

       Array Handle: 0x001C

       Error Information Handle: Not Provided

       Total Width: Unknown

       Data Width: Unknown

       Size: No Module Installed

       Form Factor: <OUT OF SPEC>

       Set: None

       Locator: P1_DIMM1B

       Bank Locator: BANK0

       Type: <OUT OF SPEC>

       Type Detail: None

       Speed: Unknown

       Manufacturer:               

       Serial Number:         

       Asset Tag:             

       Part Number:                   

       Rank: Unknown

 

Handle 0x0020, DMI type 17, 28 bytes

Memory Device

       Array Handle: 0x001C

       Error Information Handle: Not Provided

       Total Width: 72 bits

       Data Width: 64 bits

       Size: 4096 MB

       Form Factor: DIMM

       Set: None

       Locator: P1_DIMM1A

       Bank Locator: BANK1

       Type: DDR3

       Type Detail: Synchronous

       Speed: 1333 MHz

       Manufacturer: Samsung       

       Serial Number: 34363238

       Asset Tag:             

       Part Number: M393B5170FH0-CH9  

       Rank: 2
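
The full dmidecode output is long, so it helps to reduce it to one line per slot in enumeration order. The following is a minimal sketch: the function name dmi_dimm_slots is invented, and it assumes that the Size field comes before the Locator field within each record, as in the output above.

```shell
# Hypothetical helper: read "dmidecode -t 17" output on stdin and print
# "<locator>: <size>" for each memory device, in enumeration order.
# "Bank Locator:" lines are not matched because the pattern is anchored.
dmi_dimm_slots() {
    awk -F': *' '
        /^[[:space:]]*Size:/    { size = $2 }
        /^[[:space:]]*Locator:/ { print $2 ": " size }
    '
}
```

For example: dmidecode -t 17 | dmi_dimm_slots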



There is a package named edac-util that has a helpful script for examining the

contents of the /sys/devices/system/edac/mc directories. It is available

via yum as an rpm on CentOS.

 

dmidecode is also very helpful with the -t 16 or -t 17 switches.

 
