Find the first DIMM slot using dmidecode output
*******************************************************************************
1. Which EDAC modules are in use? This HowTo is for the amd64_edac module.
# lsmod | grep -i amd
amd64_edac_mod 55921 0
edac_mc 61217 1 amd64_edac_mod
*******************************************************************************
2. Get the memory error information from the kernel log.
# dmesg | grep -E -i edac\|northbridge
Northbridge Error (node 3): DRAM ECC error detected on the NB.
EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710
EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3,
channel 1, label "": amd64_edac
The salient parts are: MC3, row 3, and channel 1.
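The same triplet can be pulled out of the log mechanically. The following is
a minimal sketch, assuming the CE message arrives on a single line in the
format shown above; the function name edac_triplet is made up for this
example and is not part of any EDAC tool.

```shell
# Print mc/row/channel for each corrected-error (CE) line read from stdin.
# Assumes the "EDAC MCn: CE ... row R, channel C ..." format shown above.
edac_triplet() {
    sed -n 's/.*EDAC MC\([0-9][0-9]*\): CE .*row \([0-9][0-9]*\), channel \([0-9][0-9]*\).*/mc=\1 row=\2 channel=\3/p'
}
```

Typical use would be: dmesg | edac_triplet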
*******************************************************************************
3. Get the memory controller (MCx) device information.
If you have cleared the kernel log then you will have to reboot. With a new
log, you will have the EDAC driver messages which help identify the DIMMS.
(blank lines have been added to the output for clarity)
# dmesg | grep -E -i edac\|northbridge
EDAC MC: Ver: 2.0.1 Oct 20 2011
EDAC amd64_edac: Ver: 3.4.0
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC0: Giving out device to amd64_edac F10h: DEV 0000:00:18.2
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC1: Giving out device to amd64_edac F10h: DEV 0000:00:19.2
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 2).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC2: Giving out device to amd64_edac F10h: DEV 0000:00:1a.2
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 3).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 4).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC4: Giving out device to amd64_edac F10h: DEV 0000:00:1c.2
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 5).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC5: Giving out device to amd64_edac F10h: DEV 0000:00:1d.2
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 6).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC6: Giving out device to amd64_edac F10h: DEV 0000:00:1e.2
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 7).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC7: Giving out device to amd64_edac F10h: DEV 0000:00:1f.2
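With eight nearly identical blocks, it is easy to lose track of which chip
selects are populated. The sketch below condenses the boot messages to one
line per populated chip select. It assumes the message format shown above
(which may differ in other driver versions); the function name
populated_chip_selects is invented for this example.

```shell
# Read EDAC boot messages on stdin and list every non-empty chip select.
populated_chip_selects() {
    awk '
        # Remember the node number from "F10h detected (node N)."
        /detected \(node [0-9]+\)/ {
            node = $0
            sub(/.*\(node /, "", node)
            sub(/\).*/, "", node)
        }
        # Remember which DCT (channel) the following lines belong to.
        /DCT[0-9]+ chip selects:/ {
            dct = $0
            sub(/.*DCT/, "", dct)
            sub(/ chip selects:.*/, "", dct)
        }
        # Lines such as "EDAC amd64: MC: 2: 2048MB 3: 2048MB"
        /EDAC amd64: MC: / {
            line = $0
            sub(/.*EDAC amd64: MC: /, "", line)
            n = split(line, f, /[: \t]+/)
            for (i = 1; i + 1 <= n; i += 2)
                if (f[i + 1] != "0MB" && f[i + 1] != "")
                    printf "node=%s dct=%s cs=%s size=%s\n", node, dct, f[i], f[i + 1]
        }
    '
}
```

Typical use would be: dmesg | populated_chip_selects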
*******************************************************************************
4. Analysis of the information given.
This board, a Supermicro H8QG6, has 4 processors each having 8 DIMM slots. As
seen above, the EDAC driver has enumerated them such that there are 8 memory
controller instances (MC0-MC7). There are two MCs for each processor. Each MC
serves 4 DIMM slots. This board is physically labeled like this: P1-DIMM1A,
P1-DIMM1B, P1-DIMM2A, P1-DIMM2B ... P1-DIMM4B, and on up to P4-DIMM4B in the
same manner.
In order to make sure that you are interpreting the EDAC information
correctly, you have to know the current actual DIMM setup. I have four 4GB
DIMMS in the 'A' slots of each processor. That is a total of 16GB per
processor and 64GB on the board. There are 16 DIMMS installed total.
The actual memory controller/EDAC device control files can be examined by
looking into the directory: /sys/devices/system/edac/mc. There you will find
the control files that enable logging of correctable and uncorrectable errors
(log_ce and log_ue), and a directory for each memory controller instance.
# ls -F1 /sys/devices/system/edac/mc
log_ce
log_ue
mc0/
mc1/
mc2/
mc3/
mc4/
mc5/
mc6/
mc7/
panic_on_ue
poll_msec
Here is the error again:
Northbridge Error (node 3): DRAM ECC error detected on the NB.
EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710
EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3, channel 1, label "": amd64_edac
You can see the last line states EDAC MC3 so we can look into the mc3 directory:
# cd /sys/devices/system/edac/mc
# ls -F1 mc3
ce_count
ce_noinfo_count
csrow2/
csrow3/
device@
mc_name
reset_counters
seconds_since_reset
size_mb
ue_count
ue_noinfo_count
All of these files except for the device link are text files so they can be
easily examined. Look at the file size_mb for the entire controller instance:
# cd mc3
# cat size_mb
8192
This is half of the 16GB that are present for processor number 2. Again, I am
using 4GB DDR3 DIMMS. Remember that each memory controller instance is
managing half of the slots adjacent to each processor. This board has 8 slots
per processor and currently has 4 DIMMS installed in the A slots of each
processor. There is a total of 64GB of RAM on the board: 16GB per processor,
8GB per MC, and 4GB per DIMM. Processor 2 is served by MC2 and MC3.
Each of the DIMMS is 'dual ranked' which means that there are 2GB per 'chip
select row' (csrow). A 'rank' corresponds to a populated csrow. Thus, these
4GB DIMMS show up in two csrows.
The csrow2/ and csrow3/ directories contain the following files:
# ls -1 csrow2
ce_count
ch0_ce_count
ch0_dimm_label
dev_type
edac_mode
mem_type
size_mb
ue_count
The size_mb file contains the amount of RAM that this chip select row manages:
# cat csrow2/size_mb
4096
# cat csrow3/size_mb
4096
Why 4096 and not 2048 (one half of the DIMM) in both rows? Because the csrows
are interleaved across two channels! This means that the memory of the two
4GB DIMMS served by this controller (P2-DIMM3A on channel 0 and P2-DIMM4A on
channel 1) shows up in two rows and two channels. For MC3, the csrow2 and
csrow3 size_mb files together account for the total memory managed by this
memory controller instance: 4096 + 4096 = 8192MB. (The other 8GB of
processor 2 is managed by MC2.)
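The arithmetic behind those numbers can be written down as a small sketch. It
only restates the reasoning in the text (one rank per csrow, repeated on
every populated channel) and is not taken from the driver source; the
function name csrow_size_mb is made up for this example.

```shell
# Expected size_mb of one csrow: the size of one rank, times the number of
# populated channels that the csrow spans.
# Arguments: DIMM size in MB, ranks per DIMM, populated channels.
csrow_size_mb() {
    dimm_mb=$1 ranks=$2 channels=$3
    echo $(( dimm_mb / ranks * channels ))
}

csrow_size_mb 4096 2 2   # two dual ranked 4GB DIMMS, one per channel: 4096
csrow_size_mb 4096 2 1   # a single dual ranked 4GB DIMM on one channel: 2048
```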
This can be confusing. Here is the correspondence between memory controllers
and processors:
MC0, MC1 -> processor 1
MC2, MC3 -> processor 2
MC4, MC5 -> processor 3
MC6, MC7 -> processor 4
Memory controller MC2 is managing slots 1-4 for processor 2. MC3 is managing
slots 5-8 for processor 2. The first 4 slots are P2-DIMM1A, P2-DIMM1B,
P2-DIMM2A, P2-DIMM2B, and the second 4 slots are P2-DIMM3A, P2-DIMM3B,
P2-DIMM4A, P2-DIMM4B.
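The correspondence can be expressed as a short calculation. This is a sketch
only: it assumes that every processor on this board follows the same pattern
as processor 2 (two MCs per processor, even MC first), and the function name
mc_slots is invented for this example.

```shell
# Print the four slot labels served by memory controller instance MCn.
# Assumes the H8QG6 layout described above: two MCs per processor, the
# even-numbered MC serving DIMM1/DIMM2 and the odd-numbered MC DIMM3/DIMM4.
mc_slots() {
    mc=$1
    proc=$(( mc / 2 + 1 ))
    first=$(( (mc % 2) * 2 + 1 ))
    second=$(( first + 1 ))
    echo "P${proc}-DIMM${first}A P${proc}-DIMM${first}B P${proc}-DIMM${second}A P${proc}-DIMM${second}B"
}

mc_slots 3   # P2-DIMM3A P2-DIMM3B P2-DIMM4A P2-DIMM4B
```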
Take a look at the EDAC messages for MC3 again:
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 3).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2
This memory controller has 8 chip selects (the entries numbered 0-7 after
"MC:") on each of its 2 channels (DCT0 and DCT1), and with the current DIMM
installation both channels are populated. The printout is confusing because
the two characters, MC, are used in multiple places and seem to mean
different things.
If we remove the DIMM in P2-DIMM4A, the EDAC driver messages would look like this:
EDAC amd64: ECC is enabled by BIOS.
EDAC amd64: F10h detected (node 3).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 0MB 3: 0MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 1
EDAC amd64: CS2: Registered DDR3 RAM
EDAC amd64: CS3: Registered DDR3 RAM
EDAC MC3: Giving out device to amd64_edac F10h: DEV 0000:00:1b.2
Note that the MCT channel count is now 1. There are still two csrows involved
for the single DIMM in slot P2-DIMM3A (it is dual ranked), but the total size
for each csrow is now only 2048. There is nothing in DCT1 which is channel 1.
The total for the entire memory controller mc3 with one DIMM is 4096 as expected:
# cd /sys/devices/system/edac/mc/mc3
# cat size_mb
4096
The size_mb files for mc3/csrow2 and mc3/csrow3 now contain:
# cat csrow2/size_mb
2048
# cat csrow3/size_mb
2048
That is 2048MB, or one half of the DIMM, allocated to each csrow (rank).
It should be obvious now that the EDAC log messages and error messages do not
by default show the actual physical DIMM slot on the motherboard. That has to
be deduced from the triplet of mc/row/channel as explained in the conclusion.
*******************************************************************************
5. Conclusion
Take a look at the EDAC error one more time:
# dmesg | grep -E -i edac\|northbridge
Northbridge Error (node 3): DRAM ECC error detected on the NB.
EDAC amd64 MC3: CE ERROR_ADDRESS= 0x6281d4710
EDAC MC3: CE page 0x6281d4, offset 0x710, grain 0, syndrome 0x2845, row 3,
channel 1, label "": amd64_edac
As we said before, the error is on MC3, row 3, channel 1. We now know that MC3
is managing the second 4 slots of processor 2's eight slots, and that row 3 is
the 2nd rank of a dual ranked DIMM. There have also been EDAC errors for row
2, channel 1 which makes perfect sense. Row 2 is the first rank on the same
DIMM.
But what physical DIMM slot contains the defective DIMM?
The reported channel number, in this case 1, corresponds to DCT1 (the 2nd
channel), which on MC3 serves DIMM4A and DIMM4B. It must be DIMM4A because
rows 2&3 correspond to the A slots and rows 0&1 correspond to the B slots.
We also know that there are no DIMMS in the B slots, which confirms it.
So the defective DIMM is P2-DIMM4A.
Here is a diagram for processor 2 showing the correspondence between rows,
channels, and DIMMS. Recall that the MCx tells us which processor, as explained
above.
MC2
    Channel 0 (DCT0)
        row0 row1    P2-DIMM1B
        row2 row3    P2-DIMM1A
        row4 row5    unused
        row6 row7    unused
    Channel 1 (DCT1)
        row0 row1    P2-DIMM2B
        row2 row3    P2-DIMM2A
        row4 row5    unused
        row6 row7    unused
MC3
    Channel 0 (DCT0)
        row0 row1    P2-DIMM3B
        row2 row3    P2-DIMM3A
        row4 row5    unused
        row6 row7    unused
    Channel 1 (DCT1)
        row0 row1    P2-DIMM4B
        row2 row3    P2-DIMM4A
        row4 row5    unused
        row6 row7    unused
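
The diagram can be turned into a lookup from the mc/row/channel triplet to a
slot label. This is a sketch for this particular board only. It assumes that
processors 1, 3 and 4 are laid out like processor 2, which the diagram does
not show, and the function name edac_slot_label is made up for this example.

```shell
# Map an EDAC triplet (mc, row, channel) to the silk-screen slot label,
# following the diagram above. Valid only for this board layout.
edac_slot_label() {
    mc=$1 row=$2 channel=$3
    proc=$(( mc / 2 + 1 ))                    # two MCs per processor
    dimm=$(( (mc % 2) * 2 + channel + 1 ))    # DIMM1..DIMM4 on that processor
    case $row in
        0|1) bank=B ;;                        # rows 0 and 1 are the B slot
        2|3) bank=A ;;                        # rows 2 and 3 are the A slot
        *)   echo "unused"; return 1 ;;       # rows 4-7 are not wired to a slot
    esac
    echo "P${proc}-DIMM${dimm}${bank}"
}

edac_slot_label 3 3 1   # the error above: P2-DIMM4A
```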
*******************************************************************************
Appendix
On this Supermicro H8QG6 with AMD processors and the amd64 EDAC driver code,
there is a strange occurrence. If you populate the B DIMM slots their memory
will show up in csrows 0 and 1. I did experiments to demonstrate this and it
seems to be linked to the fact that the DMI enumeration recognizes the B slots
before the A slots. I expected the A slots to come first, but that assumption
may have been mistaken.
Here is the output of dmidecode for the memory devices. As you can see, the
info for P1_DIMM1B shows up before P1_DIMM1A:
# dmidecode -t 17
SMBIOS 2.6 present.
Handle 0x001E, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x001C
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: <OUT OF SPEC>
Set: None
Locator: P1_DIMM1B
Bank Locator: BANK0
Type: <OUT OF SPEC>
Type Detail: None
Speed: Unknown
Manufacturer:
Serial Number:
Asset Tag:
Part Number:
Rank: Unknown
Handle 0x0020, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x001C
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: DIMM
Set: None
Locator: P1_DIMM1A
Bank Locator: BANK1
Type: DDR3
Type Detail: Synchronous
Speed: 1333 MHz
Manufacturer: Samsung
Serial Number: 34363238
Asset Tag:
Part Number: M393B5170FH0-CH9
Rank: 2
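To see the enumeration order at a glance, the dmidecode output can be reduced
to one line per slot. This is a minimal sketch that assumes the Size field
precedes the Locator field in each record, as it does in the output above;
the function name dimm_inventory is invented for this example.

```shell
# Read "dmidecode -t 17" output on stdin and print "<locator>: <size>" for
# each memory device, in the order the DMI table lists them.
dimm_inventory() {
    awk -F': ' '
        /^Handle /        { size = "unknown" }     # start of a new record
        /^[ \t]*Size:/    { size = $2 }
        /^[ \t]*Locator:/ { print $2 ": " size }   # skips "Bank Locator:"
    '
}
```

Typical use would be: dmidecode -t 17 | dimm_inventory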
There is a package named edac-utils that provides the edac-util command, which
is helpful for examining the contents of the /sys/devices/system/edac/mc
directories. It is available via yum as an rpm on CentOS.
dmidecode is also very helpful with the -t 16 or -t 17 switches.
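
If edac-util is not installed, the per-csrow counters can also be read
directly. The following sketch assumes the sysfs layout shown in section 4
(mcN/csrowN directories containing size_mb and ce_count); other kernel
versions may lay the tree out differently. The function name edac_ce_report
is made up for this example.

```shell
# Print size and corrected-error count for every csrow under the EDAC
# sysfs tree. An alternate root directory may be given as the first argument.
edac_ce_report() {
    root=${1:-/sys/devices/system/edac/mc}
    for csrow in "$root"/mc*/csrow*; do
        [ -d "$csrow" ] || continue            # no match: glob stays literal
        mc=$(basename "$(dirname "$csrow")")
        printf '%s %s size_mb=%s ce_count=%s\n' \
            "$mc" "$(basename "$csrow")" \
            "$(cat "$csrow/size_mb")" "$(cat "$csrow/ce_count")"
    done
}
```

Run it as edac_ce_report with no arguments to inspect the live system.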