Flag Day: FMA for Athlon 64 and Opteron Processors


Date: Sat, 11 Feb 2006 16:01:35 -0800 (PST)
From: Cynthia McGuire <cindi at ozz dot sfbay dot sun dot com>
To: eversholt-interest at sun dot com, fma-interest at sun dot com, on-all at eng dot sun dot com
Subject: Flag Day: FMA for Athlon 64 and Opteron Processors

Today's putback for:

PSARC 2006/020 FMA for Athlon 64 and Opteron Processors
PSARC 2006/028 eversholt language enhancements
6359264 Provide FMA support for AMD64 processors

represents a flag day in that new user and kernel components are
introduced, so you should not mix and match userland and kernel across
this flag day.  As usual, BFU can be used to get you a consistent system.
A large number of files have also changed; you should do a clobber build
once you bringover the changes from the FMA putback.  Specific flag day
details are described below.

This project brings the same level of FMA support to our Athlon 64 and Opteron
family of platforms that we have for SPARC-based platforms.  Detailed
information describing the following features is found at
http://ctg.central.sun.com/wiki/index.php/FMA_x64_cpu/mem.  FMA for x64
provides:

  Error handling and ereport generation for Machine Check Architecture (MCA)
  errors as well as background polling for correctable errors.

  Diagnosis of faulty CPUs and DIMMs related to those errors.

  Automatic page retire and CPU offline responses to faulty CPUs and DIMMs.

  Diagnosis and response activities are fully integrated with the fault
  manager daemon, fmd(1M), and the syslog messaging agent to produce a
  standard FMA diagnosis message, such as:

  SUNW-MSG-ID: AMD-8000-5M, TYPE: Fault, VER: 1, SEVERITY: Major
  EVENT-TIME: Tue Feb  7 12:03:02 PST 2006
  PLATFORM: Sun Fire X4200 Server, CSN: 0000000000, HOSTNAME: vcr
  SOURCE: eft, REV: 1.16
  EVENT-ID: cc22e400-1e60-ee9f-81f3-af3d035f4dd8
  DESC: The number of errors associated with this CPU has exceeded acceptable
  levels.  Refer to http://sun.com/msg/AMD-8000-5M for more information.
  AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
  IMPACT: Performance of this system may be affected.
  REC-ACTION: Schedule a repair procedure to replace the affected CPU.  Use
  fmdump -v -u <EVENT_ID> to identify the module.

  The EVENT-ID can be used to learn more about the diagnosis and impact
  on system resources:

	# fmdump -v -u cc22e400-1e60-ee9f-81f3-af3d035f4dd8
	TIME                 UUID                                 SUNW-MSG-ID
	Feb 07 12:03:02.5062 cc22e400-1e60-ee9f-81f3-af3d035f4dd8 AMD-8000-5M
	  100%  fault.cpu.amd.l2cachedata

       		 Problem in: hc:///motherboard=0/chip=0/cpu=0
       		    Affects: cpu:///cpuid=0
       		        FRU: hc:///motherboard=0/chip=0

	# psrinfo
	0       faulted   since 02/07/2006 12:03:02
	1       on-line   since 02/07/2006 11:58:29
	2       on-line   since 02/07/2006 11:58:31
	3       on-line   since 02/07/2006 11:58:33

   This means that CPU 0 has a bad L2 data cache; to repair the problem,
   the CPU module should be replaced.  Specific details regarding repair
   policies for the CPU are found at http://sun.com/msg/AMD-8000-5M.
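
   As a minimal sketch of how the psrinfo listing above could be checked
   from a script, the filter below prints the IDs of processors in the
   "faulted" state.  The function name faulted_cpus and the sample variable
   are hypothetical, and the sketch assumes the default psrinfo layout shown
   above (processor ID in the first column, state in the second).

```shell
#!/bin/sh
# Hypothetical helper: print the ID of every processor that psrinfo
# reports as "faulted".  Assumes the default layout shown above:
#   <id> <state> since <date> <time>
faulted_cpus() {
    awk '$2 == "faulted" { print $1 }'
}

# Sample input copied from the transcript above.  On a live system
# the equivalent would be: psrinfo | faulted_cpus
psrinfo_sample='0       faulted   since 02/07/2006 12:03:02
1       on-line   since 02/07/2006 11:58:29
2       on-line   since 02/07/2006 11:58:31
3       on-line   since 02/07/2006 11:58:33'

printf '%s\n' "$psrinfo_sample" | faulted_cpus
```

   Matching on the state column rather than searching the whole line avoids
   false hits if the word "faulted" ever appears elsewhere in the output.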

Flag Day information:

  BFU Changes

     We have made minor modifications to BFU itself.  If you use an outdated
     BFU to install the new archives on a lab test machine on which the FMA
     group was at one point testing older FMA bits, you may experience
     problems with FMA features.  Please use the new BFU or update yours
     from the new source in usr/src/tools.

  BFU Conflict Resolution

     You will see conflicts in the following files:

        etc/driver_aliases
        etc/name_to_major

     You will need to resolve all of these conflicts in order to enable FMA.
     As usual, the BFU acr utility will do the right thing.

  Impact to Platform Developers

     This project removes all of the project-private .topo files in favor
     of standard topologies for SPARC and x64 systems.  This change will
     help platform teams deliver consistent topologies for use in their
     eft diagnosis rules without having to deliver additional
     platform-specific .topo files.  Topologies may be viewed with the
     internal fmtopo command:

	# /usr/lib/fm/fmd/fmtopo -v
	Topology Snapshot 22704dd4-d473-e3ac-a03b-af5e98bcabe9
	hc:///motherboard=0
       		 ASRU: -
       		 FRU: hc:///motherboard=0
       		 Label: MB
	hc:///motherboard=0/chip=0
       		 ASRU: -
       		 FRU: hc:///motherboard=0/chip=0
       		 Label: -
	hc:///motherboard=0/chip=0/cpu=0
       		 ASRU: cpu:///cpuid=0
       		 FRU: hc:///motherboard=0/chip=0
       		 Label: -

     This output is suitable for inclusion in section 3 of a platform
     FMA portfolio.
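
     As a rough sketch of how this output could be post-processed, the
     filter below lists each topology node that has an ASRU, together with
     that ASRU.  The function name nodes_with_asru and the sample variable
     are hypothetical.  The sketch assumes the fmtopo -v layout shown above
     (an hc:// line followed by indented ASRU:, FRU: and Label: lines);
     since fmtopo is an internal command, that layout may change.

```shell
#!/bin/sh
# Hypothetical helper: print "<hc node> <ASRU>" for every topology
# node whose ASRU is not "-".  Assumes the fmtopo -v layout shown
# above, where each hc:// line is followed by its property lines.
nodes_with_asru() {
    awk '
        $1 ~ /^hc:/                { node = $1 }       # remember current node
        $1 == "ASRU:" && $2 != "-" { print node, $2 }  # node has an ASRU
    '
}

# Sample input copied from the transcript above.  On a live system the
# equivalent would be: /usr/lib/fm/fmd/fmtopo -v | nodes_with_asru
fmtopo_sample='Topology Snapshot 22704dd4-d473-e3ac-a03b-af5e98bcabe9
hc:///motherboard=0
    ASRU: -
    FRU: hc:///motherboard=0
    Label: MB
hc:///motherboard=0/chip=0
    ASRU: -
    FRU: hc:///motherboard=0/chip=0
    Label: -
hc:///motherboard=0/chip=0/cpu=0
    ASRU: cpu:///cpuid=0
    FRU: hc:///motherboard=0/chip=0
    Label: -'

printf '%s\n' "$fmtopo_sample" | nodes_with_asru
```

     Property lines such as "FRU: hc:///motherboard=0" are not mistaken for
     node lines because the match is made on the first field only.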

More Information:

  More information about the FMA Program in general is available at
  http://fma.eng.  Specific information on the x64 FMA project is available
  at http://ctg.central.sun.com/wiki/index.php/FMA_x64_cpu/mem.  You will want
  to take a look if you are an Opteron software or hardware developer who
  wants to learn more about the error handling and diagnosis capabilities
  offered via Solaris on our Galaxy, Andromeda, Marrakesh and Thumper
  families.

  As always, if you are interested in talking with people interested in fault
  management or want to participate in discussions about features and RFEs,
  please sign up for fma-interest at sun dot com using netadmin, or join the
  fault management discussion (fm: discuss) forum at
  http://www.opensolaris.org/os/discussions.

  Feel free to send any questions or comments directly to the FMA core team at
  fma-core at sun dot com.  The same bug categories for FMA-related bugs and
  RFEs cover this project:

   kernel/fm - Solaris kernel FMA infrastructure
  library/fm - Solaris FMA libraries
  utility/fm - Solaris utilities fmdump(1M), fmstat(1M), fmadm(1M), fmd(1M)
      fma/io - i/o error handling, telemetry, diagnosis engines, agents
     fma/cpu - cpu error handling, telemetry, diagnosis engines, agents
     fma/mem - memory error handling, telemetry, diagnosis engines, agents
   fma/other - incoming triage category, requests for new fma features

  If you're not sure which category to use, file a bug in fma/other and we
  will be happy to recategorize it for you.

Cindi, humble servant of the FMA x64 I-Team

last modified by alanbur on 2009/11/20 23:47