Heads up: FMA additions/changes for Opteron/Athlon 64/Turion 64


Date: Sat, 07 Oct 2006 02:01:28 +0100
From: Gavin Maltby <Gavin.Maltby at Sun dot COM>
To: onnv-gate at onnv dot sfbay dot sun dot com, on-all at Sun dot COM, fma-interest at Sun dot COM
Subject: Heads up: FMA additions/changes for Opteron/Athlon 64/Turion 64

Followups set to fma-interest at sun dot com;  I'll mail more gory detail
to that alias and fm-discuss at opensolaris dot org soon.

Yesterday I integrated

     PSARC 2006/564 FMA for Athlon 64 and Opteron Rev F/G Processors

(and a number of related bugfixes) into onnv.  In addition to delivering
FMA support for revision F, which previously had none, this putback
makes some changes to the FMA support for earlier Opteron revisions.

Revision F is already shipping in the "M2" products such as the
Ultra 20 M2 and Sun Fire X2100 M2, and will be used in upcoming updates to
other products.  These are the new socket F(1207) and socket AM2 processors
from AMD with DDR-2 memory.  The FMA support applies whether your system is
a Sun product or sourced elsewhere.

More information on the revision F support and associated changes is available via
the Sun-internal FMA portfolio document and the links it provides:

     http://fma.eng/documents/engineering/portfolios/2006/019.opteron-rev-f-g/

If your revision B to E Opteron/Athlon 64/Turion 64 system (i.e., anything prior
to rev F) has some history of memory errors then this putback
constitutes a minor heads-up.  The waffle below provides some detail, but
since (hopefully) relatively few systems have a history of memory errors
you can choose to ignore it (on the grounds that the new rules will
still catch anything that is bad) or simply drop me, fma-interest at sun dot com,
or fm-discuss at opensolaris dot org a line and we'll work through things with you.

Heads-up detail:

If the errors have already been diagnosed to a dimm fault ('fmadm faulty'
shows a dimm as degraded) then there is no issue.  If there have been a small
number of memory errors, and perhaps associated page retirements, but not enough
to diagnose a dimm as faulty, then the interim diagnosis state associated with
those errors will not carry forward, so we'll essentially be starting from
scratch for all dimms.  If a dimm is really bad then we'll diagnose it soon
enough anyway, but it will take just a bit longer to get there.

To tell whether you have diagnosis state for a dimm that has not yet
resulted in a fault diagnosis, run the following.  You can run this before
or after installing the new bits.

# fmstat -m eft -s | grep memory
serd.memory.dimm_sb@motherboard0/chip2/memory-controller0/dimm0  >2    3d   2        1231294498902ns pend
serd.memory.page_sb@motherboard0/chip2/memory-controller0/dimm0  >2    3d   2        1231294496735ns pend

This shows SERD engines associated with a DIMM.  The new bits use
SERD engines with different paths (foo@.../dimm/rank) and the above
SERD engines will never be "fed" again after upgrade to the new bits.
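
To pick out just the old-style engines, something like the following might
do.  It is only a sketch: the function name is made up for illustration, and
it assumes that the engine name is the first whitespace-separated field (as
in the sample output above) and that an old-style path ends at the dimm
component while a new-style path carries a further rank component after it.

```shell
# Hypothetical helper: reads 'fmstat -m eft -s' output on stdin and prints
# the names of memory SERD engines whose path stops at the dimm, i.e. the
# old-style engines that the new bits will never feed again.
# Assumption: the engine name is field 1, as in the sample output.
old_style_serd_engines() {
    awk '$1 ~ /^serd\.memory\./ && $1 ~ /\/dimm[0-9]+$/ { print $1 }'
}

# Usage on a live system:
#   fmstat -m eft -s | old_style_serd_engines
```

No output means there are no old-style memory SERD engines left behind.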

If in the past the SERD engines have fired a few times to produce page
retirements, then

# fmstat -m eft | grep page_fault

will show some output such as

    stat.rules2 5 stat.page_fault@motherboard0/chip0/memory-controller0/dimm0

Again, the new rules hold these stats against individual ranks so the above
will be forgotten.  The pages that were retired for these faults will
continue to be retired on reboot, however.
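
If several dimms show up, it can help to line the counts up side by side.
The sketch below is illustrative only: the function name is made up, and it
assumes that the number immediately before the stat.page_fault@... name is
the count (5 in the sample line above), which is a reading of the sample
output rather than a documented format.

```shell
# Hypothetical helper: reads 'fmstat -m eft' output on stdin and prints
# "<count> <dimm path>" for each page_fault stat, largest count first.
# Assumption: the field just before the stat.page_fault@... name is the count.
page_fault_counts() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^stat\.page_fault@/) {
                sub(/^stat\.page_fault@/, "", $i)   # keep only the path
                print $(i - 1), $i
            }
    }' | sort -rn
}

# Usage on a live system:
#   fmstat -m eft | page_fault_counts
```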

If you see output from the above command but 'fmadm faulty' does not show that
we've already faulted the dimm, then a) be aware that we've forgotten some
diagnosis state, and perhaps make a fault decision yourself if the page_fault
stat above shows large numbers of faults, and b) after checking that
'fmstat -m eft -s' does not show any cpu statistics, perform an
'fmadm reset eft' to clear the disused stats.

Yes, for the S10 backport we'll see whether this can be automated or at least
made more elegant.

Gavin

last modified by danmcd on 2009/11/24 14:23