Oracle Solaris 11: New Fault Management Features

Chassis and Receptacle Information - Disk Name Mapping

A long-standing challenge in Solaris administration is identifying a physical disk given its logical /dev name - e.g., which physical disk does /dev/dsk/c2t3d0 or /dev/dsk/c3t5000C50009404407d0 correspond to? For names such as c2t3d0 the physical location has a bearing on the logical name (t3 is target 3, which was usually a fixed location on the copper SCSI chain serviced by controller c2, wherever that is). For disk names formed from a property of the disk itself (such as the World Wide Name 5000C50009404407), the logical name only identifies a controller that has a path to that disk, and the name may well be unchanged if the disk is moved to a different bay. If a system is configured with even a modest amount of storage (say a few disk shelves) there is room for confusion.

In Solaris 11, chassis "receptacles" (think of disk bays) and their "occupants" (typically disks) can be queried using the new diskinfo(1m) and croinfo(1m) commands. Moreover, descriptive aliases can be established for storage chassis: instead of a product name and serial number one can list a datacenter name and location therein, for example. These aliases are persistent and replace the default names in commands such as format(1m).

Here's the default output:

# croinfo
D:devchassis-path                                  t:occupant-type  c:occupant-compdev   
-------------------------------------------------  ---------------  ---------------------
/dev/chassis/SYS/HD0/disk                          disk             c10t0d0              
/dev/chassis/SYS/HD1/disk                          disk             c10t1d0              
/dev/chassis/SYS/HD2                               -                -                    
/dev/chassis/SYS/HD3                               -                -                    
/dev/chassis/SUN-Storage-J4410.A3451/DISK_00/disk  disk             c3t5000C50009404407d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_01/disk  disk             c3t5000C5000940262Bd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_02/disk  disk             c3t5000C500094036D3d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_03/disk  disk             c3t5000C5000940F6BBd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_04/disk  disk             c3t5000C5000940F6ABd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_05/disk  disk             c3t5000C5000940F57Fd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_06/disk  disk             c3t5000C5000940F553d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_07/disk  disk             c3t5000C5000940F62Fd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_08/disk  disk             c3t5000C5000940EE5Fd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_09/disk  disk             c3t5000C5000940F597d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_10/disk  disk             c3t5000C5000940ED47d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_11/disk  disk             c3t5000C5000940EDBBd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_12/disk  disk             c3t5000C5000940F757d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_13/disk  disk             c3t5000C5000940F7C7d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_14/disk  disk             c3t5000C5000940FB0Fd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_15/disk  disk             c3t5000C5000940F3EFd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_16/disk  disk             c3t5000C5000940F78Bd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_17/disk  disk             c3t5000C5000940F5CFd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_18/disk  disk             c3t5000C5000940F707d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_19/disk  disk             c3t5000C5000940F7C3d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_20/disk  disk             c3t5000C5000940F55Bd0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_21/disk  disk             c3t5000C5000940F547d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_22/disk  disk             c3t5000C5000940F7D3d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_23/disk  disk             c3t5000C5000940F567d0

The system has a couple of internal disks (under the well-known SYS alias) and a single Sun Storage J4410 JBOD with serial number A3451. Already a big step forward! The croinfo(1m) manpage describes a host of query options, for example to ask where c3t5000C50009404407d0 is, to list all disks in a chassis with a given serial number, or to list disks with a given firmware version.
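
Where the output has been saved to a file or is being post-processed in a script, the same questions can be answered by parsing the default columns. The following is a minimal sketch, not a replacement for the query options in the manpage; croinfo_sample, bay_of and empty_bays are hypothetical helper names, and the sample rows are copied from the listing above.

```shell
#!/bin/sh
# Sketch: locate a disk's bay by parsing saved `croinfo` output.
# croinfo_sample is a hypothetical stand-in for the real command; on a live
# system you would pipe the output of `croinfo` into the helpers instead.

croinfo_sample() {
cat <<'EOF'
D:devchassis-path                                  t:occupant-type  c:occupant-compdev
-------------------------------------------------  ---------------  ---------------------
/dev/chassis/SYS/HD0/disk                          disk             c10t0d0
/dev/chassis/SYS/HD1/disk                          disk             c10t1d0
/dev/chassis/SYS/HD2                               -                -
/dev/chassis/SYS/HD3                               -                -
/dev/chassis/SUN-Storage-J4410.A3451/DISK_00/disk  disk             c3t5000C50009404407d0
/dev/chassis/SUN-Storage-J4410.A3451/DISK_01/disk  disk             c3t5000C5000940262Bd0
EOF
}

# Print the /dev/chassis path whose occupant device name equals $1.
# The first two lines (column headings and dashes) are skipped.
bay_of() {
    awk -v dev="$1" 'NR > 2 && $3 == dev { print $1 }'
}

# Print receptacles that currently have no occupant (shown as "-").
empty_bays() {
    awk 'NR > 2 && $2 == "-" { print $1 }'
}

croinfo_sample | bay_of c3t5000C50009404407d0
croinfo_sample | empty_bays
```

The first call prints the DISK_00 path of the J4410; the second prints the two empty internal bays, HD2 and HD3.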

Next we'll assign an alias to the J4410 attached to this system: "DC1-ROW11-RACK3-SHELF2", which identifies this particular J4410 JBOD (chassis serial number A3451) within our datacenter. Note that you can't specify aliases for individual receptacles within a chassis - those are discovered from the storage enclosure using the SCSI Enclosure Services protocol, which provides the labels as silkscreened on the receptacles themselves.

# fmadm add-alias SUN-Storage-J4410.A3451 DC1-ROW11-RACK3-SHELF2

# diskinfo
D:devchassis-path                                 c:occupant-compdev   
------------------------------------------------  ---------------------
...
/dev/chassis/DC1-ROW11-RACK3-SHELF2/DISK_00/disk  c3t5000C50009404407d0
/dev/chassis/DC1-ROW11-RACK3-SHELF2/DISK_01/disk  c3t5000C5000940262Bd0
...

# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c3t5000C50009404407d0 <SEAGATE-ST330057SSUN300G-0205-279.40GB>
          /scsi_vhci/disk@g5000c50009404407
          /dev/chassis/DC1-ROW11-RACK3-SHELF2/DISK_00/disk

...
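
With more than one shelf, it may be convenient to keep the mapping from default chassis name to datacenter location in a small inventory file and generate the fmadm add-alias commands from it. The sketch below only prints the commands; the inventory format (two whitespace-separated columns, "#" for comments) and the function name alias_commands are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: turn "<default-chassis-name> <location-alias>" pairs read on stdin
# into fmadm add-alias commands. The commands are printed, not executed.

alias_commands() {
    while read -r chassis location; do
        # Skip blank lines and comment lines.
        case "$chassis" in ''|'#'*) continue ;; esac
        printf 'fmadm add-alias %s %s\n' "$chassis" "$location"
    done
}

alias_commands <<'EOF'
# default chassis name      datacenter location
SUN-Storage-J4410.A3451     DC1-ROW11-RACK3-SHELF2
EOF
```

Printing rather than executing leaves room to review the list before it is run as root.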

Email Notification of Problem Diagnoses and Updates

An overdue feature in Solaris was email notification of new problem diagnoses and of subsequent updates to those diagnoses. In Oracle Solaris 11 this functionality is provided by the svc:/system/fm/smtp-notify:default service instance (delivered in the system/fault-management/smtp-notify package). Use svcs -n to list the current notification preferences, and svccfg setnotify to express new preferences. The manpages smtp-notify(1m) and svccfg(1m) include additional information and examples. Notification preferences are expressed using the SMF command-line tools rather than fmadm, mostly because notification of SMF service instance state transitions (see below) uses the same mechanism.

After initial install the defaults apply:

# svcs -n
Notification parameters for FMA Events
    Event: problem-diagnosed
        Notification Type: smtp
            Active: true
            reply-to: root@localhost
            to: root@localhost

        Notification Type: snmp
            Active: true

        Notification Type: syslog
            Active: true

    Event: problem-repaired
        Notification Type: snmp
            Active: true

    Event: problem-resolved
        Notification Type: snmp
            Active: true

Thus a new problem diagnosis sends an email to root@localhost, raises an SNMP trap (more configuration is required to integrate this completely) and renders a summary to the console and messages file. When a problem is repaired (e.g. through fmadm repair or through detection of a FRU replacement), and again when all isolated resources are confirmed back online after the repair, we raise an SNMP trap.
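
When checking several hosts it can help to condense this listing to one line per event. The following is a rough sketch that assumes the output layout shown above; svcs_n_sample is a hypothetical stand-in for the real command and summarise_notifications is an illustrative name.

```shell
#!/bin/sh
# Sketch: summarise `svcs -n` output as "<event>: <active notification types>".
# svcs_n_sample holds a copy of the default listing; on a live system pipe
# the output of `svcs -n` into summarise_notifications instead.

svcs_n_sample() {
cat <<'EOF'
Notification parameters for FMA Events
    Event: problem-diagnosed
        Notification Type: smtp
            Active: true
            reply-to: root@localhost
            to: root@localhost

        Notification Type: snmp
            Active: true

        Notification Type: syslog
            Active: true

    Event: problem-repaired
        Notification Type: snmp
            Active: true

    Event: problem-resolved
        Notification Type: snmp
            Active: true
EOF
}

summarise_notifications() {
    awk '
        # A new event starts: flush the previous one, if any.
        $1 == "Event:" { if (event != "") print event ":" types; event = $2; types = "" }
        # Remember the notification type until its Active line is seen.
        $1 == "Notification" && $2 == "Type:" { type = $3 }
        $1 == "Active:" && $2 == "true" { types = types " " type }
        END { if (event != "") print event ":" types }
    '
}

svcs_n_sample | summarise_notifications
```

For the defaults this reports smtp, snmp and syslog for problem-diagnosed, and snmp alone for the other two events.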

For example, to send email for newly diagnosed problems to a different alias, and also to log problem resolutions via syslog to the messages file:

# svccfg setnotify problem-diagnosed mailto:ops@somewhere.org
# svccfg setnotify problem-resolved syslog:active

Below is an example email notification, showing some of the header information as well as the body.

Date: Sun, 6 Nov 2011 21:02:57 -0800 (PST)
From: No Access User <noaccess@hyper.mydomain.com>
Message-Id: <201111070502.pA752vpD471737@hyper.mydomain.com>
X-FMEV-HOSTNAME: hyper
X-FMEV-CLASS: list.suspect
X-FMEV-UUID: e7b60570-c12b-c5cf-fe20-eddddf8e9834
X-FMEV-CODE: SMF-8000-YX
X-FMEV-SEVERITY: major
Reply-To: root@hyper.mydomain.com
Subject: Fault Management Event: hyper:SMF-8000-YX
To: ops@somewhere.org

SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
EVENT-TIME: Sun Nov  6 21:02:57 PST 2011
PLATFORM: Sun-Fire-X4600-M2, CSN: 0706BE118B, HOSTNAME: hyper
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: e7b60570-c12b-c5cf-fe20-eddddf8e9834
DESC: A service failed - the instance is restarting too quickly.
AUTO-RESPONSE: The service has been placed into the maintenance state.
IMPACT: svc:/system/auditd:default is unavailable.
REC-ACTION: Run 'svcs -xv svc:/system/auditd:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted. Please refer to the associated reference document at http://sun.com/msg/SMF-8000-YX for the latest service procedures and policies regarding this diagnosis.
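
The X-FMEV-* headers make these messages easy to filter or route mechanically. As a minimal sketch (fmev_header and sample_mail are hypothetical names, and the sample is an abbreviated copy of the message above), a header value can be extracted from a message on standard input like this:

```shell
#!/bin/sh
# Sketch: pull one X-FMEV-* header out of a notification email read on stdin.
# Parsing stops at the first blank line, which separates headers from body.

fmev_header() {
    # $1 is the header suffix, e.g. SEVERITY for X-FMEV-SEVERITY.
    awk -v want="X-FMEV-$1:" '
        /^$/ { exit }
        $1 == want { print $2; exit }
    '
}

# Abbreviated copy of the example message.
sample_mail() {
cat <<'EOF'
X-FMEV-HOSTNAME: hyper
X-FMEV-CLASS: list.suspect
X-FMEV-UUID: e7b60570-c12b-c5cf-fe20-eddddf8e9834
X-FMEV-CODE: SMF-8000-YX
X-FMEV-SEVERITY: major
Subject: Fault Management Event: hyper:SMF-8000-YX

SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
EOF
}

sample_mail | fmev_header SEVERITY
sample_mail | fmev_header UUID
```

A mail filter could use the severity or message code to decide which alias or pager a message should go to.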

Additional information is available in the manpages mentioned above, and in this blog entry.

Fan and Power-Supply Coverage

The Oracle Solaris fault manager now has visibility of fan and power-supply problems (diagnosed on the service processor). If a fan or power-supply should develop a fault, that fault will be visible in fmadm faulty output on the host.  This functionality is built on top of the Sensor Abstraction Layer, described below.

FRU Monitor

The fru-monitor plugin replaces the disk-monitor fault manager plugin, and monitors the addition and removal of FRUs (field-replaceable units) to/from the system. It is also responsible for lighting service LEDs on FRUs diagnosed as faulty (although for now it is restricted to operating on disks for that functionality). Tracking the insertion and removal of hot-pluggable FRUs, particularly of disks, is essential in keeping the fault diagnosis state up-to-date.

Identity-Based Faults

When a hardware problem is diagnosed an internal fault manager representation of the faulty resource is used to track the problem. That representation encodes both current location information (such as a particular disk bay) and serial number information. If the component is moved within the system it used to be the case for some faults that the associated fault information would not be translated for the new location, and we would no longer be able to perform an isolation action (such as to offline the component). New functionality in Oracle Solaris 11 now assures that we correctly track the resource using the invariant "identity" information (such as serial number).

Internal SAS Disk Enumeration and Diagnosis

In the past, Solaris disk diagnosis has applied primarily to disks housed in external storage chassis from which we can retrieve the necessary information using SCSI Enclosure Services. Internal disk bays usually lack this information, and so while such disks could raise error reports we could not diagnose those reports on a number of platforms (the exceptions being those few that had manually captured the details of the internal disk layout in Solaris code).

A new feature in Solaris 11 allows Solaris fault management software to enumerate internal SAS disk bays for Oracle products, and with that follows diagnosis of internal disk error reports.

Kernel Panic Modelling

If the operating system kernel should panic, then on the subsequent reboot, when we save a crashdump for post-mortem analysis, we now also raise a problem diagnosis to highlight the issue. The event raised for this problem, which can be transmitted via Oracle ASR if the system is registered (see System Reporting below), includes the panic string and panic stack for initial classification and recognition of the underlying problem. Here's how a typical panic would appear in the messages file (in the example the fmdump -m option is used to illustrate this output):

# fmdump -m -u 708e3733-a6b2-6739-a013-a494d189a1ba
SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Jun 14 09:04:37 PDT 2011
PLATFORM: Sun-Fire-X4600-M2, CSN: 0706BE118B, HOSTNAME: hyper
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: 708e3733-a6b2-6739-a013-a494d189a1ba
DESC: The system has rebooted after a kernel panic.
AUTO-RESPONSE: The failed system image was dumped to the dump device.  If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory /var/crash/hyper.
IMPACT: There may be some performance impact while the panic is copied to the savecore directory.  Disk space usage by panics can be substantial.
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. If savecore is not enabled then please take steps to preserve the crash image. Use 'fmdump -Vp -u 708e3733-a6b2-6739-a013-a494d189a1ba' to view more panic detail. Please refer to the associated reference document at http://sun.com/msg/SUNOS-8000-KL for the latest service procedures and policies regarding this diagnosis.

Running the suggested command shows the event in full and illustrates what would be transmitted through ASR if configured:

# fmdump -Vp -u 708e3733-a6b2-6739-a013-a494d189a1ba
TIME                           UUID                                 SUNW-MSG-ID
Jun 14 2011 09:04:37.816512000 708e3733-a6b2-6739-a013-a494d189a1ba SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  Jun 14 09:04:25.3100 ireport.os.sunos.panic.dump_available 0x0000000000000000
  Jun 14 09:01:25.8865 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 708e3733-a6b2-6739-a013-a494d189a1ba
        code = SUNOS-8000-KL
        diag-time = 1308067477 784010
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        __case_state = 0x1
        topo-uuid = da8119a1-3ff1-40e3-ec74-f703faa8fca3
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash/hyper/.708e3733-a6b2-6739-a013-a494d189a1ba
                resource = sw:///:path=/var/crash/hyper/.708e3733-a6b2-6739-a013-a494d189a1ba
                savecore-succcess = 1
                dump-dir = /var/crash/hyper
                dump-files = vmdump.3
                os-instance-uuid = 708e3733-a6b2-6739-a013-a494d189a1ba
                panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff007aba40d0 addr=0 occurred in module "unix" due to a NULL pointer dereference
                panicstack = unix:die+d8 () | unix:trap+150c () | unix:_cmntrap+e6 () | unix:bcopy+55a () | e1000g:e1000g_m_tx+6b () | mac:mac_tx+2df () | dld:str_mdata_fastpath_put+99 () | ip:ip_xmit+946 () | ip:ire_send_wire_v4+33a () | ip:conn_ip_output+2eb () | ip:tcp_send_data+80 () | ip:tcp_timer+a61 () | ip:tcp_timer_handler+39 () | ip:squeue_drain+20b () | ip:squeue_worker+147 () | unix:thread_start+8 () |
                crashtime = 1308066640
                panic-time = 14 June 2011 08:50:40 AM PDT PDT
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x4df78695 0x30aafc00
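
The panicstack member packs the whole stack onto one line, with frames separated by " | ". For reading or comparing stacks it can be handy to print one frame per line; the following is a small sketch (split_panicstack is an illustrative name, and the sample value is a shortened copy of the stack above).

```shell
#!/bin/sh
# Sketch: print one stack frame per line from a panicstack value.
# Frames are separated by " | "; the trailing separator leaves an empty
# record, which is dropped along with surrounding spaces.

split_panicstack() {
    printf '%s\n' "$1" | tr '|' '\n' | sed -e 's/^ *//' -e 's/ *$//' -e '/^$/d'
}

# First five frames of the panicstack from the event above.
stack='unix:die+d8 () | unix:trap+150c () | unix:_cmntrap+e6 () | unix:bcopy+55a () | e1000g:e1000g_m_tx+6b () |'

split_panicstack "$stack"
```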

Sensor Abstraction Layer

The Sensor Abstraction Layer extends the topology library (libtopo) such that sensors and indicators can also be represented in our topology in a fashion that allows for the association of sensors and indicators to corresponding hardware resource to be programmatically determined.

The sensor abstraction layer also provides a layer of abstraction between the topology library and the lower-level interfaces that are used to control a given sensor or indicator, making it easy to write high-level software for reading sensors and controlling indicators.

More information is available in blog entries here and here.

SMF Service Maintenance State Modelling and Notification

There are a number of conditions under which an SMF service instance may transition to the maintenance state, such as the start method failing after detecting a configuration issue, or failing repeatedly (for example by dumping core on every attempt). Service instances in maintenance state are modelled with problem diagnoses in Solaris 11; they show up in fmadm faulty output and can be directed through the same notification mechanisms (email, snmp, syslog) as all other problems (configure "problem-diagnosed" preferences with svccfg as above). Maintenance state continues to appear in svcs -x output as before. You can clear maintenance state through the conventional svcadm clear (or svcadm disable) or through fmadm repair, which is equivalent to svcadm clear.

The example below will also serve to show how you can generate test diagnoses for validating your configuration - such as to confirm that email notification works as expected, or to send a test SNMP trap. Producing such test diagnoses in the past has not been possible without proprietary (and very dangerous!) hardware error injectors - now we can force a diagnosis simply by abusing an SMF service. The approach is simple: repeatedly kill all processes running under a particular service instance until SMF stops restarting it because it is restarting too frequently. Below we use svc:/system/auditd:default - if that service is sensitive for your installation then choose one that isn't, or write a simple service manifest for a service you can target and use that.

root@hyper:~# svcs auditd
STATE          STIME    FMRI
online         Oct_26   svc:/system/auditd:default

root@hyper:~# pkill auditd
root@hyper:~# pkill auditd
root@hyper:~# pkill auditd
root@hyper:~# pkill auditd
root@hyper:~# pkill auditd
root@hyper:~# pkill auditd

root@hyper:~# svcs auditd
STATE          STIME    FMRI
maintenance    21:02:57 svc:/system/auditd:default
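
The repeated pkill can also be wrapped in a loop that stops as soon as the instance reaches maintenance. This is a sketch only, to be used on a service you can afford to break; force_maintenance is a hypothetical function name, and the limit of ten attempts and the one-second pause are arbitrary choices.

```shell
#!/bin/sh
# Sketch: kill a service's processes until SMF gives up restarting it.
# Arguments: $1 = service FMRI, $2 = process name for pkill,
#            $3 = maximum number of kills (default 10, an arbitrary limit).

force_maintenance() {
    fmri=$1
    proc=$2
    max=${3:-10}
    attempt=0
    # svcs -H suppresses the header line; -o state prints only the state column.
    while [ "$(svcs -H -o state "$fmri")" != "maintenance" ]; do
        if [ "$attempt" -ge "$max" ]; then
            echo "$fmri not in maintenance after $max kills" >&2
            return 1
        fi
        pkill "$proc" || true    # the service may be between restarts
        attempt=$((attempt + 1))
        sleep 1                  # give the restarter time to relaunch it
    done
    echo "$fmri reached maintenance after $attempt kills"
}
```

It would be invoked as, for example, force_maintenance svc:/system/auditd:default auditd.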

Wait a second between pkill repeats.  You may have to kill one or two more times - it depends on how quickly the service restarts. Here's what fmadm shows:

root@hyper:~# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 21:02:57 e7b60570-c12b-c5cf-fe20-eddddf8e9834  SMF-8000-YX    major     

Host        : hyper
Platform    : Sun-Fire-X4600-M2 Chassis_id  : 0706BE118B
Product_sn  :

Fault class : defect.sunos.smf.svc.maintenance
Affects     : svc:///system/auditd:default
                  faulted and taken out of service
Problem in  : svc:///system/auditd:default
                  faulted and taken out of service

Description : A service failed - the instance is restarting too quickly.

Response    : The service has been placed into the maintenance state.

Impact      : svc:/system/auditd:default is unavailable.

Action      : Run 'svcs -xv svc:/system/auditd:default' to determine the
              generic reason why the service failed, the location of any
              logfiles, and a list of other services impacted. Please refer to
              the associated reference document at
              http://sun.com/msg/SMF-8000-YX for the latest service procedures
              and policies regarding this diagnosis.

When we've fixed the issue we can attempt to bring the service instance online again using either svcadm clear auditd or fmadm repair e7b60570-c12b-c5cf-fe20-eddddf8e9834.
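
Typing event UUIDs by hand is error-prone, so a script might pull them from the summary rows of fmadm faulty instead. The sketch below assumes the column layout shown above and only prints the repair commands; faulty_event_ids and faulty_sample are hypothetical names, and the sample is an abbreviated copy of the output.

```shell
#!/bin/sh
# Sketch: list EVENT-ID values from `fmadm faulty` output read on stdin.
# A summary row is recognised by its fourth field being a 36-character
# UUID made of five hyphen-separated hexadecimal groups.

faulty_event_ids() {
    awk '$4 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ && length($4) == 36 { print $4 }'
}

# Abbreviated copy of the output above, standing in for the real command.
faulty_sample() {
cat <<'EOF'
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Nov 06 21:02:57 e7b60570-c12b-c5cf-fe20-eddddf8e9834  SMF-8000-YX    major

Host        : hyper
Fault class : defect.sunos.smf.svc.maintenance
EOF
}

# Print (do not run) a repair command for each diagnosed problem.
faulty_sample | faulty_event_ids | while read -r uuid; do
    echo "fmadm repair $uuid"
done
```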

Note that service instance maintenance problem diagnoses do not result in service requests through Oracle ASR - they are usually to be resolved locally with no support required.

SMF Service Instance State Change Notification

While the SMF maintenance state problem modelling described above applies only to service instances entering maintenance state, you can also configure notification preferences for any service instance transition (i.e. not just to maintenance state).

We only model maintenance state as a problem, and so "problem-diagnosed" preferences expressed in svccfg as above can only notify of that transition. An administrator can express notification preferences for any SMF transition (including to maintenance, so there's some overlap here), also using svccfg setnotify. Global preferences can be expressed as well as per-service-instance preferences, and svcs -n lists these preferences along with those for problem lifecycle events. The svccfg(1m) manpage has examples of this.

# svccfg setnotify -g to-offline,to-maintenance mailto:admin@somehost.com
# svccfg -s svc:/network/smtp:sendmail setnotify to-offline mailto:admin@somehost.com
# svccfg -s svc:/network/smtp:sendmail setnotify to-offline snmp:active

The first command sets an email preference for all services transitioning to either the maintenance or the offline state. The second command expresses an email preference for an individual service instance, while the third enables SNMP trap notification for a single service instance.
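
To apply the same per-instance preference to a handful of services, the commands can be generated rather than typed. This is a sketch that prints the commands for review; notify_commands is a hypothetical helper, and the second FMRI in the example call is just the auditd instance used earlier on this page.

```shell
#!/bin/sh
# Sketch: print one `svccfg -s <fmri> setnotify ...` command per service.
# Arguments: $1 = transition list, $2 = notification target, rest = FMRIs.
# Nothing is executed; review the output before running it as root.

notify_commands() {
    transitions=$1
    target=$2
    shift 2
    for fmri in "$@"; do
        printf 'svccfg -s %s setnotify %s %s\n' "$fmri" "$transitions" "$target"
    done
}

notify_commands to-offline mailto:admin@somehost.com \
    svc:/network/smtp:sendmail svc:/system/auditd:default
```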

Additional information is available in the manpages mentioned above, and in this blog entry.

SNMP Monitoring

SNMP trap notification of problem diagnoses and updates is nothing new; however, the implementation has now moved from a fault manager plugin to a separate service, svc:/system/fm/snmp-notify:default, delivered via the system/fault-management/snmp-notify package. Also new is the ability to configure preferences for individual problem lifecycle events, such as to trap for problem-diagnosed but not for problem-resolved - see the svccfg examples above.

The SNMP MIBs are delivered in /etc/net-snmp/snmp/mibs/. The SUN-FM-MIB.mib file describes the SNMP trap available for problem lifecycle events, in addition to the full browsable MIB available if you configure the MIB plugin into your SNMP agent. The SUN-IREPORT-MIB.mib describes the trap available for SMF state transitions.

Subscribing to Fault Manager Problem Lifecycle Events

For applications wishing to subscribe to fault manager events directly, a new Committed C API is delivered with libfmevent(3fm). The libfmevent(3fm) and fmev_shdl_init(3fm) manpages document the API in detail.

System Reporting

Oracle Solaris 11 includes a form of "phone home" offering which relays newly-diagnosed problems (as you'd see with fmadm faulty - new diagnoses in the Fault Management subsystem) to the Oracle ASR service so that problem trends may be established and a support call may be opened if appropriate. This service requires registration of the system using a valid My Oracle Support (MOS) account and the asradm(1m) command, and you also need to login to your MOS account and enable the system for monitoring. In addition to newly-diagnosed problems a periodic audit heartbeat is transmitted.

Whether registered or not, you are able to see what information would be relayed to MOS using the -n option to asradm. The asr-notify(1m) manpage describes the message format and content used by the ASR service. A new SMF service instance svc:/system/fm/asr-notify:default implements the ASR functionality.

Topology Snapshots

In some cases a reboot/panic or other major event can intercede between raising of an error report and delivery of that error report for diagnosis. An example is a cpu which suffers some uncorrectable error causing a kernel panic - telemetry is gathered at the time of the error but only diagnosed on the subsequent reboot. If the system topology changes between the event itself and the subsequent diagnosis (for example the system firmware chooses to unconfigure the errant cpu) then we need to know the topology from the time of the event in order to diagnose - the topology at diagnosis time is inadequate. This used to lead to a corner case in which we'd be unable to diagnose some problems (and we'd complain about that!). With the new feature of topology snapshots we maintain a snapshot of past topologies for replay of telemetry delayed as described.

A related feature, and also more "under the hood", is that of Configuration Numeric Association (CNA). Previously every dynamic reconfiguration (DR) event resulted in a full rediscovery of system topology, whereas only a relatively small minority of cases really require a rediscovery. It was possible to suffer bursts of DR events triggering numerous rediscoveries and consuming corresponding CPU time.

USB Fault Management

Oracle Solaris 11 includes fault management features for USB. USB drivers have been hardened to be more robust in the face of USB errors and to raise structured error reports, and diagnosis rules are included to process those reports and highlight USB bus and client device faults.

Zones

The fault manager service svc:/system/fmd:default is now active in non-global Solaris 11 zones. Most fault manager plugins are not present in such a zone as they are concerned with hardware only visible from the global zone. The main role today of fmd in a non-global zone is to play its part in the diagnosis and messaging on SMF instance maintenance, and for notification of SMF instance state transitions (for transitions other than to maintenance, as described above). Note that you may need to pkg install the system/fault-management/smtp-notify and system/fault-management/snmp-notify packages in non-global zones in which you wish to utilize those notification methods.

Created by Gavin Maltby on 2011/11/09 20:57
Last modified by Gavin Maltby on 2011/11/09 20:57
