Fault Management

Fault Management Community

Fault Management in Solaris 11 - What's New

See this page for a description of new fault management features in Solaris 11!

About Predictive Self-Healing

For users and administrators of a modern operating system, self-healing functionality provides fine-grained fault isolation and, where possible, restart of any component (hardware or software) that experiences a problem. To do so, the system must include intelligent, automated, proactive diagnosis of the errors observed on the system. The diagnosis system triggers targeted automated responses or guided human intervention that mitigates a specific problem. Finally, these new system capabilities are connected to a new model for system administrators oriented around simpler, higher-level abstractions.

About Fault Management

The Fault Management effort (often abbreviated as FMA, for "Fault Management Architecture") provides an architecture for building resilient error handlers, structured error telemetry, automated diagnosis software, response agents, and a consistent model of system failures for a management stack. The architecture is not Solaris-specific; it is intended to span multiple fault domains and to facilitate the sharing of fault information between disjoint authorities such as a system service processor and the Solaris instance(s) running on the platform. Many parts of Solaris are FMA-aware, including CPU and memory error handling, the PCI and PCI-E subsystems, the main HBA drivers, many NIC drivers, disks, ZFS, and more.

The legacy UNIX failure model left error handling up to each subsystem author and provided only the ability to emit a human-readable error message to the system log in a non-standard format. When a subsystem is converted to participate in Fault Management, its error handling is made resilient so that the system can continue to operate despite some underlying failure, and it produces telemetry events that drive automated diagnosis and response. The Fault Management tools and architecture enable development of self-healing content for software and hardware failures, for both microscopic and macroscopic system resources, all with a unified, simple view for administrators and system management software.
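The flow described above (structured telemetry events feeding a diagnosis step, which in turn drives a response agent) can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not the fmd(1M) programming interface: the names ErrorReport, ThresholdDiagnoser, and offline_agent, the event class strings, and the "N reports within T seconds" rule are all hypothetical and chosen purely for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ErrorReport:
    """A structured telemetry event (hypothetical shape, not the FMA format)."""
    event_class: str   # illustrative name, e.g. "ereport.example.mem.ce"
    resource: str      # illustrative identifier of the affected component
    timestamp: float   # seconds


@dataclass
class ThresholdDiagnoser:
    """Toy diagnosis engine: declare a fault once `count` reports for the
    same resource arrive within `window` seconds (an assumed rule)."""
    count: int
    window: float
    _recent: dict = field(default_factory=dict)
    faulted: set = field(default_factory=set)

    def receive(self, report):
        """Consume one report; return a fault description or None."""
        if report.resource in self.faulted:
            return None  # already diagnosed; do not diagnose it again
        times = self._recent.setdefault(report.resource, deque())
        times.append(report.timestamp)
        # Drop reports that have aged out of the time window.
        while report.timestamp - times[0] > self.window:
            times.popleft()
        if len(times) >= self.count:
            self.faulted.add(report.resource)
            return {"fault_class": "fault.example.too-many-errors",
                    "resource": report.resource}
        return None


def offline_agent(fault, offlined):
    """Toy response agent: record the resource as taken out of service."""
    offlined.append(fault["resource"])


if __name__ == "__main__":
    diagnoser = ThresholdDiagnoser(count=3, window=10.0)
    offlined = []
    for t in (0.0, 1.0, 2.0):
        fault = diagnoser.receive(
            ErrorReport("ereport.example.mem.ce", "dimm0", t))
        if fault is not None:
            offline_agent(fault, offlined)
    print(offlined)
```

The point of the sketch is the separation of roles: the error handler only emits a structured report, the diagnoser alone decides when a pattern of reports amounts to a fault, and the agent acts only on diagnosed faults. Contrast this with the legacy model, where each subsystem would write free-form text to the system log.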

Information Resources

Documentation

  • FMA Events and Messages
    • Diagnosis results obtained from the Fault Management software in Solaris contain links to the Knowledge Article Web.
    • The FMA Event Registry is the central repository for all fault management events passed between error handlers, the fault manager and its agents.
  • Fault Management for System Administrators. This article is linked from "Working with the Fault Manager" in the Troubleshooting section of the OpenSolaris System Administration Guide.
  • Man Pages
    • fmd(1M)
    • fmadm(1M)
      In OpenSolaris 2008.11, the fmadm repair command was replaced by the following new commands: fmadm repaired (synonymous with fmadm repair), fmadm replaced, and fmadm acquit.
    • fmdump(1M)
      In OpenSolaris 2008.11, the preferred method to display fault information and determine the FRUs involved is the fmadm faulty command, not the fmdump command.
    • fmstat(1M)
  • Writing Device Drivers for FMA. The Writing Device Drivers guide contains a section "Sun Fault Management Architecture I/O Fault Services" in "Chapter 13, Hardening Solaris Drivers." This section describes the steps and techniques used to write an FMA-aware driver.
  • Fault Management Daemon Programmer's Reference Manual. The FMD PRM [PDF] is a description of the internal architecture of the Sun Fault Management Daemon, fmd(1M), and the programming interfaces exported by the daemon.
    • FMDPRM 1.4 April 2008. Added -b option to the fmtopo command. Changed descriptions of TOPO_WALK_CHILD and TOPO_WALK_SIBLING.
    • FMDPRM 1.3 March 2008. Added repaircode to the table of Fault Management Configuration Properties.
    • FMDPRM 1.2 August 2007. Initial post of this document.
  • Eversholt Fault Tree Description Language. The Eversholt Fault Tree Description Language [PDF] explains how to use the eversholt language to describe fault trees in the Sun Fault Management Architecture.
    • Eversholt Fault Tree Description Language, 1.8, November 2008. Initial post of this document.
Created by on 2009/10/26 11:40
Last modified by Gavin Maltby on 2011/11/09 21:00
