FMA Event Protocol
Copyright 2006, Sun Microsystems, Inc
Policy Synopsis
The FMA Event Protocol is a formal specification of the data transmitted with error report and fault events, together with a general syntax for naming the resources associated with errors and faults. The protocol defines a common set of name-value pair members that are used to describe all possible error reports, fault events, and lists of suspected faults.
Contents
Overview
| Category | Software |
|---|---|
| Owner | SAC |
| Author | Cynthia McG. |
| Changes | PSARC |
| Authority | PSARC |
| Policy Version | 1.6 |
| Status | Approved PSARC 2003/06/04 |
| Last Reviewed | 2004/10/08 |
Background
- Added specifications for defect and upset event classes and payload
- Updated FMRI specifications for dev and hc
- Removed FMRI specifications for svc and diag-engine schemes
- Added FMRI specifications for pkg and mod schemes
- Updated examples and content to reflect Solaris FMA implementation
BestPractice
- Applies to: Sun Internal Solaris developers. The Solaris FMA Event Protocol and FMRI specification are considered Sun Private.
- Authority: PSARC
- Approval: PSARC/2004/694
- Policy: Error and fault event information is required to diagnose faults, effect recovery and log fault messages. This paper specifies an event protocol and a general syntax for resource naming, both of which are needed to communicate error and fault event data between Solaris subsystems, a Solaris domain and its service processor, or disjoint Solaris domains. The protocol may also be used in non-Solaris Sun products, but is intended to complement and not replace existing heterogeneous standard management protocols. We focus on Solaris in this document because it is the first target environment in which the protocol will be used. We expect that future work will extend its use to other Sun products.
- Details:
The lack of a consistent error and fault reporting mechanism for Solaris platforms inhibits our ability to automate fault diagnosis, repair, fault logging and recovery activities. The current methodology for reporting error and fault conditions is built upon unstable syslog text messages, subsystem-specific kstats and polled extraction of other subsystem-specific data. None of these mechanisms provides a consistent namespace, syntax or semantics for the reporting of error and fault data, or the level of detail necessary to automate core fault management activities.
A fault management architecture requires a protocol that can be used to transport events with a consistent namespace, syntax, and semantics. The FMA Event Protocol and Resource Identification proposes a structure for the content and type of the data that is transmitted as the result of an error observation or fault diagnosis. Our event protocol is designed to communicate error and fault event data between Solaris subsystems, a Solaris domain and its service processor, or disjoint Solaris domains by providing a common programming interface for software-based diagnosis, fault message logging, and administrative and service tools. As described in our extended one-pager [1], error events are not propagated beyond the fault manager and its diagnosis engines, as they exist only to provide input to the diagnosis. The resulting fault events are transmitted to agents, and they may also be transmitted outside of the system (potentially translated to a different protocol) or recorded persistently in stable logs. The fault manager and its agents will live within the same fault region as the Solaris system, within a surrounding fault region, or both, as this colocation is required to implement consistent and immediate fault response and repair.
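The event flow described above can be sketched in a few lines of Python. This is only an illustrative model, not the Solaris implementation: the `Event`, `FaultManager`, and `make_threshold_engine` names, the `ereport.` and `fault.` class prefixes, the payload member names, and the threshold-based diagnosis rule are all assumptions made for the example. The resource string is a placeholder rather than a real FMRI.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Event:
    """An event: a class name plus a set of name-value pair members."""
    event_class: str                      # hypothetical, e.g. "ereport.example.correctable"
    payload: Dict[str, Any] = field(default_factory=dict)


def make_threshold_engine(limit: int) -> Callable[[Event], List[Event]]:
    """Toy diagnosis engine (an assumption for this sketch): it diagnoses
    a fault once `limit` error reports name the same resource."""
    counts: Dict[str, int] = {}

    def diagnose(ereport: Event) -> List[Event]:
        resource = ereport.payload["resource"]
        counts[resource] = counts.get(resource, 0) + 1
        if counts[resource] == limit:
            return [Event("fault.example.too-many-errors",
                          {"resource": resource, "error-count": limit})]
        return []

    return diagnose


class FaultManager:
    """Routes error reports to a diagnosis engine and fault events to agents."""

    def __init__(self,
                 diagnose: Callable[[Event], List[Event]],
                 agents: List[Callable[[Event], None]]) -> None:
        self.diagnose = diagnose
        self.agents = agents

    def post(self, event: Event) -> None:
        if event.event_class.startswith("ereport."):
            # Error events stop here: they are only input to the diagnosis.
            for fault in self.diagnose(event):
                self.post(fault)
        elif event.event_class.startswith("fault."):
            # Fault events are handed to every registered agent.
            for agent in self.agents:
                agent(event)
        else:
            raise ValueError(f"unknown event class: {event.event_class}")


# Usage: an agent that simply records the fault events it receives.
received: List[Event] = []
manager = FaultManager(make_threshold_engine(limit=2), [received.append])
for _ in range(3):
    manager.post(Event("ereport.example.correctable",
                       {"resource": "example-resource-0"}))
print([e.event_class for e in received])
```

In this sketch the agent never sees the three error reports. It receives only the single fault event that the toy engine produces, which mirrors the separation between error events and fault events described above.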
We have designed a new protocol for this purpose to implement a set of new ideas about fault management in a complex software stack such as Solaris, and to permit rapid innovation in Solaris error handling code, error event producers, diagnosis software and response agents. In addition, we have designed our protocol to leverage the existing event transport, SysEvent, which meets our transport requirements, and to leverage existing data marshalling and file encoding capabilities available in the Solaris kernel and in userland. Our protocol is not intended to replace or circumvent existing heterogeneous management protocols, such as SNMP or WBEM/CIM. Instead, we view these protocols as complementary technologies that would be suitable for an FMA agent to use to export data to a network-based heterogeneous fault management or administration product. A similar strategy has been employed by leaders in this area: DEC, HP, and IBM.
An evaluation of existing public protocols for use within our Solaris FMA implementation showed that these protocols do not provide the level of detail needed to perform automated fault diagnosis, repair or recovery of hardware and software components, and that they are focused on network-based management tools. These protocols would also needlessly complicate the development and rapid evolution of Solaris FMA producers and consumers while providing no benefit to customers: error events are an internal implementation detail of Solaris diagnosis software, and fault events can be translated to any number of other protocols by an FMA agent as required by future product directions.
Standard notification schemes, such as CIM alerts or SNMP traps, provide some cursory level of event notification information, but lack the ability to classify, organize and specify the detail required to perform automated fault diagnosis or recovery actions for a specific hardware or software component. For example, a CIM alert notification can take an enumerated value of Communications Alert, Quality of Service Alert, Device Alert, Environmental Alert, Model Change, Security Alert or Other, along with a perceived system impact and probable cause. This alert schema does not provide any of the vital data captured at the time of error detection: it is suited only to what we denote as a fault event. Similarly, using SNMP traps to communicate error event data between the kernel and a userland fault manager would require developing an extensible private MIB for all of our error events and a new implementation of a kernel-to-user SNMP trap mechanism, only so that we could translate the SNMP data back to another form suitable for diagnosis engines. All of this complexity would have no benefit to the end user of the system, as our entire purpose is to abstract error events away from them in our implementation.
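To make the loss of detail concrete, the following Python sketch maps a set of name-value pair members onto an alert that carries only the three items named above: an enumerated alert type, a perceived impact, and a probable cause. The enumeration values come from the text. Everything else is hypothetical and exists only for illustration: the `to_alert` and `dropped_members` helpers, the `category`, `impact`, and `cause` member names, and the sample payload. None of it represents the CIM schema or an FMA payload definition.

```python
from enum import Enum
from typing import Any, Dict, List


class AlertType(Enum):
    """The enumerated alert values listed in the text."""
    COMMUNICATIONS = "Communications Alert"
    QUALITY_OF_SERVICE = "Quality of Service Alert"
    DEVICE = "Device Alert"
    ENVIRONMENTAL = "Environmental Alert"
    MODEL_CHANGE = "Model Change"
    SECURITY = "Security Alert"
    OTHER = "Other"


# Hypothetical mapping from an event's "category" member to an alert type.
CATEGORY_TO_ALERT = {
    "device": AlertType.DEVICE,
    "environment": AlertType.ENVIRONMENTAL,
}

# The only members (hypothetical names) that the alert form can carry.
KEPT_MEMBERS = {"category", "impact", "cause"}


def to_alert(event: Dict[str, Any]) -> Dict[str, str]:
    """Lossy translation of an event's members into an alert-style record."""
    alert_type = CATEGORY_TO_ALERT.get(event.get("category"), AlertType.OTHER)
    return {
        "alert_type": alert_type.value,
        "perceived_impact": str(event.get("impact", "unknown")),
        "probable_cause": str(event.get("cause", "unknown")),
    }


def dropped_members(event: Dict[str, Any]) -> List[str]:
    """Names of the members that have no place in the alert record."""
    return sorted(set(event) - KEPT_MEMBERS)


# Usage with a made-up payload containing detection-time detail.
sample = {
    "category": "device",
    "impact": "degraded",
    "cause": "uncorrectable memory error",
    "detector": "example-detector",
    "syndrome": 0x1F,
}
print(to_alert(sample))
print(dropped_members(sample))
```

Under these assumptions, the detection-time members (`detector` and `syndrome` here) have nowhere to go in the alert record, which is the limitation the paragraph above describes.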
For these reasons, we believe our design, implementation, and innovation needs will best be served by the new proposed protocol for use within Sun systems and service processors, and that the FMA protocol can be usefully complemented by existing heterogeneous protocols should Sun wish to develop network fault management products or provide agents for existing ISV products.
CaseHistory
References
Specification