A long-standing challenge in Solaris administration is to identify a physical disk given its logical /dev name - e.g., which physical disk does /dev/dsk/c2t3d0 or /dev/dsk/c3t5000C50009404407d0 correspond to? For names such as c2t3d0 the physical location has a bearing on the chosen logical name (t3 is target 3, which was usually a fixed location on the copper SCSI chain serviced by controller c2, wherever that is). For disk names formed using a property of the disk (such as worldwide number 5000C50009404407), however, the logical name only identifies a controller that has a path to that disk, and the name would be unchanged if the disk were in a different bay. If a system is configured with even a modest amount of storage (say a few disk shelves) there is room for confusion.
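To make the naming scheme concrete, here is a minimal Python sketch that splits a logical disk name into its parts. The regular expression and the "16 hex digits means a worldwide number" heuristic are assumptions made for illustration, not a Solaris interface, and names without a target component are not handled.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: c<controller>t<target>d<disk>[s<slice>], where the target
# is either a small decimal number or an uppercase-hex worldwide number.
_DISK_NAME = re.compile(r"^c(\d+)t([0-9A-F]+)d(\d+)(?:s(\d+))?$")


class LogicalDisk(NamedTuple):
    controller: int
    target: str            # kept as text: "3" or "5000C50009404407"
    disk: int
    slice: Optional[int]
    target_is_wwn: bool    # heuristic only: 16 hex digits looks like a WWN


def parse_logical_disk(path: str) -> LogicalDisk:
    """Split a /dev/dsk-style name into its components."""
    name = path.rsplit("/", 1)[-1]
    match = _DISK_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized logical disk name: {path!r}")
    ctrl, target, disk, slc = match.groups()
    return LogicalDisk(int(ctrl), target, int(disk),
                       int(slc) if slc is not None else None,
                       len(target) == 16)


for example in ("/dev/dsk/c2t3d0", "/dev/dsk/c3t5000C50009404407d0"):
    print(example, "->", parse_logical_disk(example))
```

In the first name the target says something about where the disk sits on the chain; in the second it only identifies the disk itself, which is why a separate location query is needed.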
In Solaris 11, chassis "receptacles" (think of disk bays) and their "occupants" (typically disks) can be queried using the new diskinfo(1m) and croinfo(1m) commands. Moreover, descriptive aliases can be established for storage chassis: instead of a product name and serial number one can list a datacenter name and location therein, for example. These aliases are persistent and replace the default names in commands such as format(1m).
Here's the default output:
The system has a couple of internal disks (under the well-known SYS alias) and a single Sun Storage J4410 JBOD with serial number A3541. Already a big step forward! The croinfo(1m) manpage describes a host of query options: ask where c3t5000C50009404407d0 is, list all disks in a chassis of a given serial number, list disks with a given firmware version, and so on.
Next we'll appoint an alias for the J4410 attached to this system: "DC1-ROW11-RACK3-SHELF2" - which identifies this particular J4410 JBOD with chassis serial number A3541 within our datacenter. Note that you can't specify aliases for individual receptacles within a chassis - those are discovered from the storage enclosure using SCSI enclosure services protocol which provides the labels as silkscreened on the receptacles themselves.
An overdue feature in Solaris was the notification via email of new problem diagnoses, and of subsequent updates to those diagnoses. In Oracle Solaris 11 this functionality is provided by the svc:/system/fm/smtp-notify:default service instance (delivered in the system/fault-management/smtp-notify package). Use svcs -n to list the current notification preferences, and svccfg setnotify to express new preferences. The manpages smtp-notify(1m) and svccfg(1m) include additional information and examples. Notification preferences are expressed using the SMF command lines rather than fmadm, mostly because notification of SMF service instance state transitions (see below) uses the same mechanism.
After initial install the defaults apply:
Thus a new problem diagnosis sends an email to root@localhost, raises an SNMP trap (more configuration is required to integrate this completely) and renders a summary to the console and messages file. When a problem is repaired (e.g., through fmadm repair or through detection of a FRU replacement), and when all isolated resources are confirmed back online after the repair, we raise an SNMP trap.
To configure, for example, notification by email for new problems diagnosed to another alias and to also render problem resolutions to syslog and messages:
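As a rough sketch of such a configuration, the Python helper below only builds svccfg setnotify command lines and prints them; nothing is executed. The argument order and the event class names other than problem-diagnosed and problem-resolved reflect my reading of svccfg(1m) and should be checked against your release, and the email address is a placeholder.

```python
import shlex

# Lifecycle event classes as I understand them from svccfg(1m); verify locally.
PROBLEM_EVENTS = {"problem-diagnosed", "problem-updated",
                  "problem-repaired", "problem-resolved"}


def setnotify_command(events, parameter):
    """Return the argv for one `svccfg setnotify` call (nothing is executed)."""
    unknown = set(events) - PROBLEM_EVENTS
    if unknown:
        raise ValueError(f"unknown event class(es): {sorted(unknown)}")
    return ["svccfg", "setnotify", ",".join(events), parameter]


# Hypothetical address; substitute your own alias.
commands = [
    setnotify_command(["problem-diagnosed"], "mailto:admins@example.com"),
    setnotify_command(["problem-resolved"], "syslog:active"),
]
for argv in commands:
    print(shlex.join(argv))
```

After applying preferences like these on a real system, svcs -n is the way to confirm what is actually in effect.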
Below is an example email notification showing some of the header information as well as the body.
Additional information is available in the manpages mentioned above, and in this blog entry.
The Oracle Solaris fault manager now has visibility of fan and power-supply problems (diagnosed on the service processor). If a fan or power-supply should develop a fault, that fault will be visible in fmadm faulty output on the host. This functionality is built on top of the Sensor Abstraction Layer, described below.
The fru-monitor plugin replaces the disk-monitor fault manager plugin, and monitors the addition and removal of FRUs (field-replaceable units) to/from the system. It is also responsible for lighting service LEDs on FRUs diagnosed as faulty (although for now it is restricted to operating on disks for that functionality). Tracking the insertion and removal of hot-pluggable FRUs, particularly of disks, is essential in keeping the fault diagnosis state up-to-date.
When a hardware problem is diagnosed an internal fault manager representation of the faulty resource is used to track the problem. That representation encodes both current location information (such as a particular disk bay) and serial number information. If the component is moved within the system it used to be the case for some faults that the associated fault information would not be translated for the new location, and we would no longer be able to perform an isolation action (such as to offline the component). New functionality in Oracle Solaris 11 now assures that we correctly track the resource using the invariant "identity" information (such as serial number).
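The idea of keying fault state on identity rather than location can be illustrated with a small Python sketch. This is a toy model under my own assumptions (the class, serial numbers and bay labels are all hypothetical), not how the fault manager is implemented.

```python
class FaultTracker:
    """Tracks faults by serial number and remembers the last known location."""

    def __init__(self):
        self._location = {}    # serial -> current location (e.g. a bay label)
        self._faulty = set()   # serials with an open diagnosis

    def seen(self, serial, location):
        """Record that a component was enumerated at a location."""
        self._location[serial] = location

    def diagnose(self, serial):
        """Mark the component with this serial number as faulty."""
        self._faulty.add(serial)

    def faulty_locations(self):
        """Where each faulty component sits now, wherever it was diagnosed."""
        return {s: self._location.get(s) for s in self._faulty}


tracker = FaultTracker()
tracker.seen("SN123", "bay 4")   # hypothetical serial and bay labels
tracker.diagnose("SN123")
tracker.seen("SN123", "bay 7")   # the disk is moved to another bay
print(tracker.faulty_locations())
```

Because the diagnosis is attached to the serial number, the fault follows the disk to its new bay, and any isolation action can be aimed at the right place.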
In the past, Solaris disk diagnosis has applied primarily to disks housed in external storage chassis from which we can retrieve the necessary information using SCSI Enclosure Services. Internal disk bays usually lack this information, and so while such disks could raise error reports we could not diagnose those reports on a number of platforms (the exceptions being those few that had manually captured the details of the internal disk layout in Solaris code).
A new feature in Solaris 11 allows Solaris fault management software to enumerate internal SAS disk bays for Oracle products, and with that follows diagnosis of internal disk error reports.
If the operating system kernel should panic then on subsequent reboot when we save a crashdump for post-mortem analysis we now also raise a problem diagnosis to highlight the issue. The event raised for this problem, which can be transmitted via Oracle ASR if registered (as above), includes the panic string and panic stack for initial classification and recognition of the underlying problem. Here's how a typical panic would appear in the messages file (in the example the fmdump -m option is used to illustrate this output):
Running the suggested command shows the event in full and illustrates what would be transmitted through ASR if configured:
The Sensor Abstraction Layer extends the topology library (libtopo) such that sensors and indicators can also be represented in our topology, in a fashion that allows the association of sensors and indicators with their corresponding hardware resources to be programmatically determined.
The sensor abstraction layer also provides a layer of abstraction between the topology library and the lower-level interfaces that are used to control a given sensor or indicator, making it easy to write high-level software for reading sensors and controlling indicators.
More information is available in blog entries here and here.
There are a number of conditions under which an SMF service instance may transition to the maintenance state, such as through the start method failing after detecting a configuration issue, or repeatedly failing such as through coredump on every attempt. Service instances in maintenance state are modelled with problem diagnoses in Solaris 11: they show up in fmadm faulty output and can be directed through the same notification mechanisms (email, SNMP, syslog) as for all problems (configure "problem-diagnosed" preferences with svccfg as above). Maintenance state continues to appear in svcs -x output as before. You can clear maintenance state through the conventional svcadm clear (or svcadm disable) or through fmadm repair, which is equivalent to svcadm clear.
The example below will also serve to show how you can generate test diagnoses for validating your configuration - such as to confirm that email notification works as expected, or to send a test SNMP trap. Producing such test diagnoses in the past has not been possible without proprietary (and very dangerous!) hardware error injectors - now we can force a diagnosis simply by abusing an SMF service. The approach is simple - repeatedly kill all processes running under a particular service instance until SMF stops restarting it because things are restarting too frequently; below we use the svc:/network/ntp:default - if that service is sensitive for your installation then choose one that isn't, or write a simple service manifest for a service we can target and use that.
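The kill-until-maintenance loop can be sketched in Python as follows. The state query and the kill action are passed in as callables, so the logic can be tried against a simulation first; FakeService is a hypothetical stand-in, and on a real system the callables would wrap commands such as svcs and pkill (an assumption, not shown here).

```python
import time


def force_maintenance(get_state, kill_processes, max_kills=10, pause=1.0):
    """Kill a service's processes until SMF gives up and reports maintenance.

    get_state and kill_processes are caller-supplied callables, so the loop
    can be tried against a simulation before touching a real service.
    Returns the number of kills issued.
    """
    kills = 0
    while get_state() != "maintenance":
        if kills >= max_kills:
            raise RuntimeError("service never reached maintenance")
        kill_processes()
        kills += 1
        time.sleep(pause)  # give the restarter time to react
    return kills


class FakeService:
    """Stand-in that enters maintenance once restarts are exhausted."""

    def __init__(self, tolerated_restarts=3):
        self.remaining = tolerated_restarts
        self.state = "online"

    def kill(self):
        if self.state == "maintenance":
            return
        self.remaining -= 1
        if self.remaining < 0:
            self.state = "maintenance"


fake = FakeService(tolerated_restarts=3)
kills = force_maintenance(lambda: fake.state, fake.kill, pause=0)
print(f"maintenance after {kills} kills")
```

The max_kills bound is a design choice: if the service never gives up restarting, the loop stops with an error rather than killing processes forever.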
Wait a second between pkill repeats. You may have to kill one or two more times - it depends on how quickly the service restarts. Here's what fmadm shows:
When we've fixed the issue we can attempt to bring the service instance online again using either svcadm clear auditd or fmadm repair e7b60570-c12b-c5cf-fe20-eddddf8e9834.
Note that service instance maintenance problem diagnoses do not result in service requests through Oracle ASR - they are usually to be resolved locally with no support required.
While the SMF maintenance state problem modelling described above applies only to service instances entering maintenance state, you can also configure notification preferences for any service instance transition (i.e. not just to maintenance state).
We only model maintenance state as a problem and so "problem-diagnosed" preferences expressed in svccfg as above can only notify of that transition. An administrator can express notification preferences for any SMF transition (including to maintenance, so there's some overlap here) also using svccfg setnotify. Global preferences can be expressed as well as per-service-instance preferences, and svcs -n lists these preferences along with those for problem lifecycle events. The svccfg(1m) manpage has examples of this.
The first command sets an email preference for all services transitioning either to maintenance or offline state. The second command expresses an email preference for an individual service instance, while the third enables SNMP trap notification for a single service instance.
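For illustration, the three kinds of preference can be sketched with a Python helper that builds the command lines without running them. The -g and -s forms, the to-/from- transition syntax and the state names follow my understanding of svccfg(1m) and smf(5) and should be verified; the service instance and email address are hypothetical.

```python
import shlex

# State names assumed from smf(5); check them against your release.
SMF_STATES = {"uninitialized", "offline", "online",
              "degraded", "maintenance", "disabled"}


def transition_set(to_states=(), from_states=()):
    """Build a comma-separated transition set such as 'to-maintenance'."""
    for state in (*to_states, *from_states):
        if state not in SMF_STATES:
            raise ValueError(f"unknown SMF state: {state!r}")
    parts = [f"to-{s}" for s in to_states] + [f"from-{s}" for s in from_states]
    if not parts:
        raise ValueError("at least one transition is required")
    return ",".join(parts)


def transition_notify(tset, parameter, instance=None):
    """argv for svccfg setnotify: global (-g) when no instance is given."""
    if instance is None:
        return ["svccfg", "setnotify", "-g", tset, parameter]
    return ["svccfg", "-s", instance, "setnotify", tset, parameter]


instance = "svc:/application/example:default"   # hypothetical service instance
commands = [
    transition_notify(transition_set(["maintenance", "offline"]),
                      "mailto:admins@example.com"),
    transition_notify(transition_set(["maintenance"]),
                      "mailto:admins@example.com", instance),
    transition_notify(transition_set(["maintenance"]), "snmp:active", instance),
]
for argv in commands:
    print(shlex.join(argv))
```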
Additional information is available in the manpages mentioned above, and in this blog entry.
SNMP trap notification of problem diagnoses and updates is nothing new; however, the implementation has now moved from a fault manager plugin to a separate service, svc:/system/fm/snmp-notify:default, delivered via the system/fault-management/snmp-notify package. Also new is the ability to configure preferences for individual problem lifecycle events, such as to trap for problem-diagnosed but not for problem-resolved - see svccfg examples above.
The SNMP MIBs are delivered in /etc/net-snmp/snmp/mibs/. The SUN-FM-MIB.mib file describes the SNMP trap available for problem lifecycle events, in addition to the full browsable MIB available if you configure the MIB plugin into your SNMP agent. The SUN-IREPORT-MIB.mib describes the trap available for SMF state transitions.
For applications wishing to subscribe to fault manager events directly, a new Committed C API is delivered with libfmevent(3fm). The libfmevent(3fm) and fmev_shdl_init(3fm) manpages document the API in detail.
Oracle Solaris 11 includes a form of "phone home" offering which relays newly-diagnosed problems (as you'd see with fmadm faulty - new diagnoses in the Fault Management subsystem) to the Oracle ASR service so that problem trends may be established and a support call may be opened if appropriate. This service requires registration of the system using a valid My Oracle Support (MOS) account and the asradm(1m) command, and you also need to log in to your MOS account and enable the system for monitoring. In addition to newly-diagnosed problems a periodic audit heartbeat is transmitted.
Whether registered or not, you are able to see what information would be relayed to MOS using the -n option to asradm. The asr-notify(1m) manpage describes the message format and content used by the ASR service. A new SMF service instance svc:/system/fm/asr-notify:default implements the ASR functionality.
In some cases a reboot/panic or other major event can intercede between raising of an error report and delivery of that error report for diagnosis. An example is a cpu which suffers some uncorrectable error causing a kernel panic - telemetry is gathered at the time of the error but only diagnosed on the subsequent reboot. If the system topology changes between the event itself and the subsequent diagnosis (for example the system firmware chooses to unconfigure the errant cpu) then we need to know the topology from the time of the event in order to diagnose - the topology at diagnosis time is inadequate. This used to lead to a corner case in which we'd be unable to diagnose some problems (and we'd complain about that!). With the new feature of topology snapshots we maintain a snapshot of past topologies for replay of telemetry delayed as described.
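The replay idea can be illustrated with a short Python sketch: keep timestamped snapshots and, for a delayed error report, pick the snapshot that was current when the event occurred. This is a conceptual model only; the class, timestamps and topology contents are hypothetical and do not reflect how fmd stores snapshots.

```python
from bisect import bisect_right


class TopologyHistory:
    """Keeps timestamped topology snapshots, oldest first."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def record(self, timestamp, topology):
        """Append a snapshot; timestamps must not go backwards."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("snapshots must be recorded in time order")
        self._times.append(timestamp)
        self._snapshots.append(topology)

    def at(self, event_time):
        """Return the snapshot that was current when the event occurred."""
        index = bisect_right(self._times, event_time)
        if index == 0:
            raise LookupError("no snapshot predates this event")
        return self._snapshots[index - 1]


history = TopologyHistory()
history.record(100, {"cpus": ["cpu0", "cpu1"]})   # before the panic
history.record(200, {"cpus": ["cpu0"]})           # firmware unconfigured cpu1

error_time = 150   # telemetry captured at panic time, diagnosed after reboot
print(history.at(error_time))
```

Diagnosing against the snapshot from time 100 still sees cpu1, whereas the topology at diagnosis time no longer contains the errant cpu.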
A related feature, and also more "under the hood", is that of Configuration Numeric Association (CNA). Previously every dynamic reconfiguration event resulted in a full rediscovery of system topology, whereas only a relatively small minority of cases really require a rediscovery. It would be possible to suffer bursts of DR events triggering numerous rediscoveries, and consuming corresponding CPU time.
Oracle Solaris 11 includes fault management features for USB. USB drivers have been hardened to be more robust in the face of USB errors and to raise structured error reports, and diagnosis rules are included to process those reports and highlight USB bus and client device faults.
The fault manager service svc:/system/fmd:default is now active in non-global Solaris 11 zones. Most fault manager plugins are not present in such a zone as they are concerned with hardware only visible from the global zone. The main role today of fmd in a non-global zone is to play its part in the diagnosis and messaging on SMF instance maintenance, and for notification of SMF instance state transitions (for transitions other than to maintenance, as described above). Note that you may need to pkg install the system/fault-management/smtp-notify and system/fault-management/snmp-notify packages in non-global zones in which you wish to utilize those notification methods.