Solaris10-Branded Zone Developer Guide
This guide familiarizes readers with the solaris10 zone brand, which allows OpenSolaris administrators to create Solaris 10 Containers. The guide explains how the brand affects Solaris development and how developers can enhance the brand so that it can cope with changes to Solaris 10 and OpenSolaris. This document is aimed at all Solaris kernel developers whose work might affect the solaris10 brand's functionality. (The introduction and the section entitled "What kinds of changes to OpenSolaris and Solaris 10 can break the solaris10 brand?" delineate the kinds of projects and fixes that are impacted.)
Note that this is a living document: The guide changes as the Solaris 10 Containers development team receives feedback from readers. Please send questions, comments, suggestions, and corrections to s10c-core@sun.com. Please ensure that your emails' subject lines start with "S10C Dev Guide" if the emails specifically address this guide.
Table of Contents
- Introduction
- What kinds of changes to OpenSolaris and Solaris 10 can break the solaris10 brand?
- What techniques can be used to make solaris10-branded zones work with my changes?
- Where is the source code?
- How does system call emulation work in the solaris10 brand?
- How can I add to or update the emulation?
- How can I set up the brand to use a native command?
- How is versioning handled for different Solaris 10 updates or patches running in the zone when they need different emulation?
- What is the procedure for backporting an incompatible change to a Solaris 10 update release?
- How can I test my OpenSolaris changes with the brand?
- How can I test my Solaris 10 changes with the brand?
Introduction
This section introduces solaris10-branded zones by providing brief overviews of zones, branded zones, what solaris10-branded zones do, and the reasons why Solaris kernel developers should take solaris10-branded zones into consideration when fixing bugs or adding new features to Solaris 10 or OpenSolaris.
What are zones?
Solaris Zones (also known as Solaris Containers or simply zones) are lightweight virtual machines that isolate user-level workloads on Solaris systems. They differ markedly from virtual machines created with other virtualization technologies such as Xen, VirtualBox, and Logical Domains (LDoms) in that zones do not rely on hypervisors to provide abstracted hardware resources and isolate themselves from the native host system and each other. Zones are built into the Solaris kernel and many of the kernel's subsystems are zone-aware (i.e., they associate their abstractions with zones and base decisions in part on such associations). Consequently, processes executing within zones experience little overhead (a high estimate is 5% of total execution time) and thus come close to achieving bare-metal performance. Furthermore, the lack of overhead makes zones highly scalable: Even low-end consumer desktops are capable of running dozens of zones at a time. The noticeable lack of execution overhead experienced by zoned processes, the relative ease of zone creation and management, and the maturity of zones technology make zones one of the most popular virtualization technologies (if not the most popular virtualization technology) supported on Solaris 10 and OpenSolaris.
However, zones have a few well-known, fundamental limitations. Zones cannot host processes running programs compiled for non-native architectures because zones lack hypervisors. Furthermore, not all Solaris kernel subsystems are zone-aware, which limits the kinds of resources that can be fully isolated and supported within zones. Finally, ordinary zones cannot host user environments from non-native operating systems (OSes). The last limitation is largely eliminated through the use of branded zones, which will be discussed shortly.
You can learn more about zones by visiting the OpenSolaris.org Zones Community Page, the Solaris Containers BigAdmin Page, or the zones chapter of the System Administration Guide on docs.sun.com (Solaris 10 only).
What are branded zones?
Branded zones are zones that are capable of emulating user environments from OSes other than Solaris 10 and OpenSolaris. Branded zones achieve this by emulating the non-native OSes' system calls (syscalls). Syscalls constitute the sole interface between user environments and kernels; therefore, if a branded zone emulates syscalls such that they have the same side effects as the syscalls of a particular non-native OS (e.g., Linux 2.4), then processes running within the zone will act as though they are running on the targeted non-native OS. A branded zone's brand is the collection of support libraries, support hooks, and auxiliary data files that make emulating the zone's targeted OS possible. Brands are named after the OSes whose syscalls they emulate. For example, there are solaris8 and solaris9 brands on Solaris 10 that allow Solaris 10 zones to host Solaris 8 and Solaris 9 user environments, respectively. Zones that host native Solaris user environments (i.e., zones that lack syscall emulation) are native-branded zones (often simply called native zones). (OpenSolaris' native zones currently use the ipkg brand.)
Maintaining brands can be incredibly difficult. Changes in Solaris syscalls' semantics (i.e., their parameters and side effects) can break emulation provided by brands if the brands' support libraries are not updated to account for the changes. Similarly, changes in non-native OSes' syscalls' semantics can break the emulation provided by brands. Maintaining a brand is especially difficult when the user-kernel interfaces exported by both the native Solaris kernel and the hosted non-native OS are continually in flux (as would be the case if we were to maintain a brand for the latest development releases of Linux 2.6 on OpenSolaris).
You can learn more about branded zones and the framework that makes them possible by visiting the OpenSolaris.org BrandZ Community Page, which describes the brand framework and the lx [Linux 2.4] brand on OpenSolaris, or the Solaris Containers BigAdmin Page.
What are solaris10-branded zones?
solaris10-branded zones host Solaris 10 (S10) user environments inside zones on OpenSolaris. They are meant to help maintainers of Solaris 10 systems consolidate their production environments onto systems running OpenSolaris. Workloads running within solaris10-branded zones can take advantage of the performance improvements made to the OpenSolaris kernel and utilize some of the innovative technologies available only on OpenSolaris (e.g., Crossbow VNICs). Only Solaris 10u8 and beyond are supported and tested in such zones.
Ultimately, the purpose of solaris10-branded zones is to provide the proper emulation for Solaris 10 processes running inside the zones so that they work correctly with the OpenSolaris kernel. This is summarized in the following principle, which should serve as a guide for enhancing and maintaining solaris10-branded zones: Any script or program that works in native Solaris 10 zones should also work in solaris10-branded zones.
You can learn more about the solaris10 brand and track the project's progress by visiting the OpenSolaris.org solaris10 Brand Project Page.
Why should I care about solaris10-branded zones?
Because a solaris10-branded zone is running Solaris 10 user-level binaries on top of the OpenSolaris kernel, mismatches in the user-kernel interfaces provided by both systems are possible. This does not happen within normal OpenSolaris zones (i.e., native OpenSolaris zones) because such zones run OpenSolaris user-level binaries, which are built to run in sync with the OpenSolaris kernel. If you make changes to either OpenSolaris or Solaris 10 that impact their user-kernel boundaries, then you will have to take solaris10-branded zones into consideration. If your changes break the emulation provided by solaris10-branded zones, then you will have to enhance the brand's emulation layer so that it can cope with such changes. OpenSolaris community contribution sponsors should ensure that ON contributions will not break the solaris10 brand.
Does the Solaris ABI take care of this for me?
No. The ABI deals with published, well-documented interfaces such as those provided by libc and documented in the section 2 man pages. The solaris10 brand must deal with undocumented, unstable interfaces such as how libc traps into the kernel. Normally changes to such interfaces are hidden within libraries such as libc in a compatible way so that applications don't notice them. However, solaris10-branded zones run the Solaris 10 version of libc (as well as other Solaris 10 libraries), which is not built to work with the OpenSolaris kernel. Therefore, the brand emulation layer is needed to translate between the Solaris 10 user level code and the OpenSolaris kernel code.
Here is a simple example: issetugid(2) is a libc function whose semantics haven't changed between Solaris 10 and OpenSolaris. However, Solaris 10's libc invokes the SYS_issetugid syscall (syscall number 75), which doesn't take arguments, while OpenSolaris' libc invokes the SYS_privsys syscall (number 82), which takes six arguments. Furthermore, the OpenSolaris kernel replaced SYS_issetugid with SYS_sidsys, which has radically different semantics. Consequently, if a Solaris 10 process linked with the Solaris 10 libc running on top of the OpenSolaris kernel were to invoke issetugid(2), then it would issue the wrong syscall. This might result in the Solaris 10 process dumping core or producing incorrect results. The solution is to emulate the Solaris 10 SYS_issetugid syscall so that SYS_privsys is issued instead (with proper arguments, of course).
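The translation described above can be sketched in userland as follows. This is only an illustration, not the brand's actual code: the sysret_t layout is simplified, mock_systemcall() stands in for the real __systemcall(), the function name s10_issetugid() is chosen for illustration, and the PRIVSYS_ISSETUGID subcode value is an assumption. Only the syscall numbers 75 and 82 and the six-argument shape of SYS_privsys come from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>

#define SYS_issetugid      75    /* Solaris 10 syscall number, per the text */
#define SYS_privsys        82    /* OpenSolaris syscall number, per the text */
#define NATIVE_OFFSET      1024  /* marks a syscall as native (not emulated) */
#define PRIVSYS_ISSETUGID  2     /* ASSUMED subcode value, for illustration */

/* Simplified stand-in for the real sysret_t. */
typedef struct {
    long sys_rval1;
    long sys_rval2;
} sysret_t;

/* Pretend kernel state: would issetugid() report a set-id process? */
static int mock_setugid_state = 1;

/*
 * Mock of __systemcall(): understands only the native SYS_privsys call
 * with the PRIVSYS_ISSETUGID subcode and rejects everything else.
 */
static int
mock_systemcall(sysret_t *rval, int sysnum, ...)
{
    va_list ap;
    int subcode;

    if (sysnum != SYS_privsys + NATIVE_OFFSET)
        return (ENOSYS);

    va_start(ap, sysnum);
    subcode = va_arg(ap, int);
    va_end(ap);

    if (subcode != PRIVSYS_ISSETUGID)
        return (EINVAL);

    rval->sys_rval1 = mock_setugid_state;
    rval->sys_rval2 = 0;
    return (0);
}

/*
 * Emulation function for the argument-less Solaris 10 SYS_issetugid:
 * reissue the request as a six-argument native SYS_privsys call.
 */
static int
s10_issetugid(sysret_t *rval)
{
    return (mock_systemcall(rval, SYS_privsys + NATIVE_OFFSET,
        PRIVSYS_ISSETUGID, 0, 0, 0, 0, 0));
}
```

The emulation function returns the error code and leaves the syscall's result in rval, so the Solaris 10 libc sees the answer it expects even though a different syscall did the work.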
Does this restrict my ability to innovate or make incompatible changes?
No. The purpose of solaris10-branded zones is to translate between the old Solaris 10 code running within them and the new OpenSolaris kernel. There are existing proofs of concept with the lx brand running Linux on Solaris 10 and OpenSolaris as well as the solaris8 and solaris9 brands running those releases on Solaris 10. The capability to run Linux on OpenSolaris did not stop developers from creating ZFS, DTrace, or Crossbow (to name just a few prominent examples of Solaris innovation). You can continue to make new, innovative changes to OpenSolaris, but you will have to take the solaris10 brand into account during your development and you may need to modify the brand so that solaris10-branded zones continue to work with your innovation. This guide tells you how to do this.
What kinds of changes to OpenSolaris and Solaris 10 can break the solaris10 brand?
Changes that cross the user-kernel boundary affect solaris10-branded zones. In other words, beware all changes that cause cap-I Install(1) flag days. The following list details these kinds of changes:
- Changes in syscalls
Does the change touch usr/src/uts/common/os/sysent.c? If a new syscall is added, then there probably will not be an issue. If an existing syscall is removed or changed, then the change will likely impact the brand. Likewise for syscall parameters: if a syscall's parameters' structures or semantics change, then the brand will almost certainly be impacted.
- Changes in ioctls
Changes in ioctl commands or parameters for devices that are accessible within zones will likely impact the brand. Disks and NICs are the devices most commonly added to zone configurations. Exclusive-stack zones, which have dedicated NIC devices and are responsible for their administration, are expected to be ubiquitous on OpenSolaris because VNICs and the Crossbow project provide a wealth of interesting networking options for branded zones. Explorer data suggest that it is uncommon for other devices to be configured inside zones.
The following devices are available to solaris10-branded zones by default: arp, conslog, console, cpu/self/cpuid, crypto, cryptoadm, dtrace/dtrace, dtrace/helper, dtrace/provider/*, fd/*, ipnet/{netif}, ipnet/lo0, kstat, lo0, log, logindmux, msglog, net/{netif}, null, openprom, poll, pool, ptmx, pts/*, random, sad/user, stderr, stdin, stdout, syscon, sysevent, sysmsg, systty, tcp, tcp6, ticlts, ticots, ticotsord, tty, udp, udp6, urandom, zconsole, zero, zfs
- Changes in libraries
Bugfixes and projects that modify libraries that interact closely with the kernel could impact the brand. The following are examples of such libraries:
- libaio.so (a real library in s10)
- libbc.so
- libbsm.so
- libc.so
- libc_db.so
- libcontract.so
- libdlpi.so
- libdoor.so
- libdtrace.so
- libkstat.so
- libpkcs11.so (CRYPTO_GET_FUNCTION_LIST ioctl)
- libproc.so
- libproject.so
- librt.so (a real library in s10)
- librtld_db.so
- libzfs.so
- Changes in signals
There have been no signal changes between Solaris 10 and OpenSolaris. However, making such changes could impact the brand. For example, Solaris 10 changed the range for real time signals, which forced solaris8 and solaris9 branded zones to emulate the Solaris 8 and 9 real time signal ranges.
- Changes in auditing or accounting
If auditing or accounting change in an incompatible way, then the brand could be impacted. Although this has not occurred between Solaris 10 and OpenSolaris, this was an issue for the solaris8 brand on Solaris 10 and could become a problem for the solaris10 brand if such a change were made to Solaris 10 or OpenSolaris.
- Changes in /proc
Changes in the /proc file system or libproc can impact the ptools and other commands linked with libproc. At this point there have been no /proc changes that need emulation but the lx brand does provide /proc emulation and this could be done for the solaris10 brand in the future if necessary.
- Changes in native commands or daemons used within solaris10-branded zones
If an OpenSolaris command or daemon (both will hereafter be simply referred to as 'commands' for brevity's sake) that replaces a command within a solaris10-branded zone is changed, then backwards compatibility with the replaced Solaris 10 command could break. Changes to such commands must be made such that the requirements for the commands to be used within solaris10-branded zones continue to be satisfied. (See "Use native commands or daemons" under "What techniques can be used to make solaris10-branded zones work with my changes?" below for the list of requirements.)
The following native commands are used within solaris10-branded zones:
- automount
- automountd
- ifconfig
- Changes in kstats
kstats are generally private and unstable; however, they are consumed by various user-level utilities. If the kstats on which Solaris 10 utilities depend are removed or changed, then those utilities could break. If this happens, one option is to use the native version of the utility in place of the Solaris 10 version. This issue is currently hypothetical: No kstat issues have been observed.
The following utilities use various kstats:
- dladm: name:mac
- fssnap: mod:fssnap, name: highwater
- fsstat: variety
- fuser: mod:unix, name:var
- in.routed: class:net
- iostat: mod:cpu_info; mod:cpu, name:vm; mod:cpu, name:sys; mod:unix, name:system_misc, statistic:clk_intr
- mibiisa: name:mac
- mpstat: mod:cpu_info; mod:cpu, name:vm; mod:cpu, name:sys; mod:unix, name:system_misc, statistic:clk_intr
- netstat: class:net; mod:sockfs, name:sock_unix_list
- nfsstat: mod:unix, name:rpc_*; mod:nfs, name:nfs_client; mod:nfs, name:nfs_server; mod:nfs, name:rfs*; mod:nfs_acl, name:acl*
- poolstat: mod:cpu, name:sys
- psrinfo: mod:cpu_info
- rpc.rstatd: mod:unix, name:system_misc; mod:cpu, name:sys; mod:cpu, name:vm; class:disk; class:net
- sar: mod:unix, name:sysinfo; mod:unix, name:vminfo; mod:unix, name:var; mod:unix, name:system_misc; mod:unix, name:file_cache; mod:ufs, name:inode_cache; mod:vmem, name:kmem_oversize
- sendmail: mod:unix, name:system_misc, statistic:avenrun_1min
- svc.startd: mod:unix, name:system_misc, statistic:boot_time
- umount(nfs): mod:nfs, name:mntinfo
- vmstat: mod:cpu_info; mod:cpu, name:vm; mod:cpu, name:sys; mod:unix, name:system_misc, statistic:clk_intr
- Changes in supported platforms
If support for a new platform is added to OpenSolaris but is withheld from Solaris 10, then the solaris10 brand could be impacted.
- Changes in privileges
If existing privileges are broken up or removed, then the brand could be impacted. Adding new privileges should work because nothing in Solaris 10 should use those privileges and properly written Solaris 10 applications should cope with new, unfamiliar privileges.
What techniques can be used to make solaris10-branded zones work with my changes?
- New features
In general, new features that will only exist in OpenSolaris do not need to work in solaris10-branded zones because Solaris 10 programs and libraries will not expect them. Note that branded zones can leverage new features that are not manageable within their emulated user environments. For example, solaris8 branded zones can reside on ZFS and DTrace can be used from the native user environment (a.k.a. the global zone) on the zones' processes. Similarly, solaris10-branded zones can use Crossbow network configurations managed by the global zone.
- Maintain compatibility
Change the interfaces in a backward compatible manner. This is the simplest thing to do and is often the approach a developer automatically uses without even thinking about it.
- Add emulation
The brand module can be enhanced to interpose on syscalls and act as a filter between Solaris 10 and OpenSolaris behaviors. This includes interposing on the ioctl syscall and translating commands and parameters from one format to the other. See "How can I add to or update the emulation?" for details about emulating ioctls and syscalls.
- Use native commands or daemons
If the change is in the private interaction between a command or daemon (both will hereafter be simply referred to as 'commands' for brevity's sake) and the kernel where no published API is involved, then the zone can be configured to use the native OpenSolaris command instead of the Solaris 10 command. (See "How can I set up the brand to use a native command?" below for details about how to do this.) For example, the native ifconfig command can be used to configure network interfaces from within exclusive stack solaris10-branded zones. Syscalls issued by native commands are not emulated.
An OpenSolaris replacement command must meet several requirements:
- Any related configuration files must be compatible between Solaris 10 and OpenSolaris.
- The replacement command's command line interface (CLI) must be backwards-compatible with that of the replaced command.
- The interprocess communication (IPC) protocols utilized by the replacement command must be backwards-compatible with those of the replaced command. (If they are not, then the commands with which the replaced command communicates can also be replaced by OpenSolaris commands if and only if they also meet these requirements.)
- Scripts, libraries, and executables that expect a particular output format from the replaced Solaris 10 command must be able to parse output provided by the replacement OpenSolaris command. This requirement can be relaxed only if:
- it is known that the output of the replaced command is never or rarely consumed by scripts, libraries, or other commands and it is acceptable for known consumers to fail to parse such output; or
- it is acceptable for the outputs of some of the replacement program's behaviors to differ from those of the equivalent behaviors in the replaced command and it is acceptable for known consumers to fail to parse such outputs.
- If the replacement command modifies environment variables, then it must do so in a backwards-compatible manner.
- The replacement command's network communication protocols must be backwards-compatible with those of the replaced command.
- The replacement command must expect the same file permissions as the replaced command when accessing files.
- Keep existing kernel support
If major changes are made to the kernel, it is possible to keep the existing behavior with the solaris10 brand as the only consumer. The brand can be enhanced to interact with the legacy behavior.
- Backport to Solaris 10
The changes can be backported to Solaris 10 and a patch can be produced for use in the zone. The brand can be enhanced to check for certain patches: See the section entitled "How is versioning handled for different Solaris 10 updates or patches running in the zone when they need different emulation?" for details.
- Disallow certain configurations
The brand can validate zones and disallow unsupported configurations. For example, if a specific device is problematic, the brand can prevent zones from being configured or from booting with that device. This would only be practical for corner cases.
- Platform-specific libraries
If support is added to OpenSolaris for a new hardware platform but the platform will not be supported by Solaris 10, then the associated OpenSolaris psr libraries can be installed into solaris10-branded zones from the global zone. Doing so ensures that solaris10-branded zones will be able to function correctly on new platforms.
Where is the source code?
The solaris10 brand's source tree is laid out thus:
- usr/src/lib/brand/solaris10
This directory contains source files for the userland brand emulation library, various /usr/sbin/zoneadm and /usr/lib/zones/zoneadmd hooks, and auxiliary brand data files.
- usr/src/lib/brand/solaris10/s10_brand
This directory contains the source files that construct the userland brand emulation library. All syscall and ioctl emulation is defined in here (see common/s10_brand.c).
- usr/src/lib/brand/solaris10/zone
This directory contains various shell scripts and auxiliary data files that serve as zone state transition hooks for /usr/sbin/zoneadm and /usr/lib/zones/zoneadmd. s10_boot.ksh is the solaris10 brand's boot hook script: You should modify this file if you plan to use a native OpenSolaris command or daemon inside of solaris10-branded zones. version specifies the maximum emulation version supported by the solaris10 brand. (See the section on emulation versioning for details.)
- usr/src/lib/brand/solaris10/s10_support
This directory contains support C functions utilized by /usr/sbin/zoneadm and /usr/lib/zones/zoneadmd.
- usr/src/uts/common/brand/solaris10
This directory contains source files that constitute the solaris10 brand's kernel module, which resides in /usr/kernel/brand and /usr/kernel/brand/amd64 on x86 systems and /platform/sun4u/kernel/brand/sparcv9 and /platform/sun4v/kernel/brand/sparcv9 on sparc systems.
How does system call emulation work in the solaris10 brand?
Syscall emulation is what makes the solaris10 brand work. As mentioned in the introduction, the syscall interface (which includes ioctls) is the sole interface by which user processes interact with the kernel. By emulating syscalls, the solaris10 brand can make Solaris 10 processes running within solaris10-branded zones act as though they are communicating with a Solaris 10 kernel.
When a process execs within a solaris10-branded zone, the dynamic linker loads the solaris10 brand emulation library (which resides in the global zone) prior to all other dynamic libraries to which the process' executable is linked. The emulation library initializes its data structures and registers itself with the solaris10 kernel module via a special brand syscall (SYS_brand with a special subcode). Thus the kernel module will know how to transfer control to the emulation library in the event a syscall is issued from the associated process. Once the emulation library finishes initializing, the dynamic linker continues to set up the associated process and transfers control to its start routine.
When a non-native process in a solaris10-branded zone issues a syscall, the following steps occur in order:
- The thread issuing the syscall traps to the kernel's syscall-handling routines.
- The syscall-handling routines determine that the thread's process resides in a solaris10-branded zone and invoke a special syscall-handling callback registered by the solaris10 brand's kernel module.
- The callback looks up the syscall's entry in s10_emulation_table, an array of flags defined by the brand's kernel module and indexed by syscall number. If the syscall's entry is not flagged (i.e., the entry is zero), then the callback returns to the kernel's syscall-handling routines: The syscall is not emulated by the brand. Otherwise, the callback transfers control to the brand's emulation library (which resides in userland) at a registered entry point.
- The brand emulation library's entry point looks up the syscall's entry in s10_sysent_table, an array mapping syscalls to library functions that emulate the syscalls. If the syscall's entry indicates that a library function handles the syscall, then the handler is invoked with all of the syscall's parameters; otherwise, the emulation library sends SIGSYS (bad syscall) to the associated process.
- After the handler returns to the emulation library's entry point, the thread returns to its process' code at the instruction following the syscall. The library's entry point makes the emulated syscall's handler's return value and error code visible to the thread's process.
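The steps above can be modelled in a few lines of ordinary C. The sketch below is a userland simulation under stated assumptions, not the brand's code: the table size, the handler signature, the table names emulation_flags and sysent_handlers, and the MOCK_SIGSYS and NOT_EMULATED return codes are all invented for illustration. Only the two-table structure (a kernel-side flag array and a library-side handler array) and the syscall number 190 for SYS_sigqueue come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define NSYSCALL      256   /* ASSUMED table size */
#define SYS_sigqueue  190   /* per the text */
#define MOCK_SIGSYS   (-1)  /* stands in for delivering SIGSYS */
#define NOT_EMULATED  (-2)  /* stands in for "kernel handles it natively" */

/* Simplified stand-in for the real sysret_t. */
typedef struct {
    long sys_rval1;
    long sys_rval2;
} sysret_t;

typedef int (*handler_t)(sysret_t *rv, long a0, long a1, long a2, long a3);

/* Kernel-module side: nonzero means "hand this syscall to the library". */
static unsigned char emulation_flags[NSYSCALL];

/* Library side: NULL means that no emulation function is registered. */
static handler_t sysent_handlers[NSYSCALL];

/* Trivial handler that reports success without doing any work. */
static int
demo_sigqueue(sysret_t *rv, long pid, long signo, long value, long si_code)
{
    (void) pid; (void) signo; (void) value; (void) si_code;
    rv->sys_rval1 = 0;
    rv->sys_rval2 = 0;
    return (0);
}

/* Walks the lookup sequence described in the steps above. */
static int
dispatch(int sysnum, sysret_t *rv, long a0, long a1, long a2, long a3)
{
    /* Out-of-range numbers are treated as bad syscalls in this sketch. */
    if (sysnum < 0 || sysnum >= NSYSCALL)
        return (MOCK_SIGSYS);

    /* Kernel callback: an unflagged syscall is not emulated. */
    if (emulation_flags[sysnum] == 0)
        return (NOT_EMULATED);

    /* Library entry point: a flagged syscall without a handler is bad. */
    if (sysent_handlers[sysnum] == NULL)
        return (MOCK_SIGSYS);

    /* Invoke the handler; its result flows back to the caller. */
    return (sysent_handlers[sysnum](rv, a0, a1, a2, a3));
}

/* Registers the demo handler in both tables. */
static void
register_demo(void)
{
    emulation_flags[SYS_sigqueue] = 1;
    sysent_handlers[SYS_sigqueue] = demo_sigqueue;
}
```

Before register_demo() runs, dispatch(SYS_sigqueue, ...) yields NOT_EMULATED; afterwards it reaches demo_sigqueue(). A syscall that is flagged on the kernel side but has no library handler ends in MOCK_SIGSYS, which is why both tables must be updated together.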
How can I add to or update the emulation?
The solaris10 brand's emulation library is constructed from usr/src/lib/brand/solaris10/s10_brand/common/s10_brand.c in the source tree.
Emulating a System Call
To emulate a Solaris 10 syscall, you must create a static function in the brand emulation library with the following signature:
static int <function-name>(sysret_t *<rv>, <syscall-args>)
where <function-name> is the name of the function and <syscall-args> are the syscall's parameter declarations. The function should return the errno value that the emulated Solaris 10 syscall would produce. You should store the emulated syscall's return value in <rv>.
You must make two additional changes, one to the brand emulation library and the other to the solaris10 brand's kernel module (usr/src/uts/common/brand/solaris10/s10_brand.c in the source tree):
- Find the definition of the array s10_sysent_table at the bottom of the brand emulation library's source file and locate the row/entry corresponding to the numeric identifier of the emulated syscall. (Comments have been added to each row/entry indicating its corresponding numeric syscall identifier.) Change the entry from "NOSYS" to
EMULATE(<function-name>, <num-syscall-args> | <syscall-RV-flags>)
where <function-name> is the name of the emulation function in the brand library that will handle the syscall, <num-syscall-args> is the number of parameters for the emulated Solaris 10 syscall, and <syscall-RV-flags> is one of the following constants:
- RV_DEFAULT: The syscall returns "default" values. Use this when the Solaris 10 syscall is defined in the sysent table in uts/common/os/sysent.c with SYSENT_C, SYSENT_CI, or SYSENT_CL.
- RV_32RVAL2: The syscall returns two 32-bit values. Use this when the Solaris 10 syscall is defined with SYSENT_2CI.
- RV_64RVAL: The syscall returns a single 64-bit value. Use this when the Solaris 10 syscall is defined with SYSENT_AP.
- Find the definition of _init() in the solaris10 brand kernel module and add
s10_emulation_table[<SYS-identifier>] = 1;
near the beginning of the function, where <SYS-identifier> is the numeric code of the emulated syscall as defined in usr/src/uts/common/sys/syscall.h. There are already several such lines and they should be ordered by syscall number.
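The second argument to EMULATE() combines an argument count and a return-value flag with a bitwise OR, which works only if the two occupy disjoint bits. The following sketch shows one way such a packed value could be decoded. The mask and flag values, the demo_sysent_t type, and the helper names are assumptions made for illustration; consult the brand library's source for the real definitions.

```c
#include <assert.h>

/* ASSUMED bit layout: low byte holds the count, next byte holds the flag. */
#define NARGS_MASK  0x000000ff
#define RV_MASK     0x0000ff00
#define RV_DEFAULT  0x00000100
#define RV_32RVAL2  0x00000200
#define RV_64RVAL   0x00000400

/* Hypothetical table entry: a handler name plus the packed value. */
typedef struct {
    const char *name;
    int         args_and_flags;
} demo_sysent_t;

/* Extracts the number of syscall arguments from a packed entry. */
static int
entry_nargs(const demo_sysent_t *e)
{
    return (e->args_and_flags & NARGS_MASK);
}

/* Extracts the return-value flag from a packed entry. */
static int
entry_rvflags(const demo_sysent_t *e)
{
    return (e->args_and_flags & RV_MASK);
}

/* Mirrors the shape of EMULATE(s10_sigqueue, 4 | RV_DEFAULT). */
static const demo_sysent_t demo_sigqueue_entry = {
    "s10_sigqueue", 4 | RV_DEFAULT
};
```

With this layout, entry_nargs(&demo_sigqueue_entry) is 4 and entry_rvflags(&demo_sigqueue_entry) is RV_DEFAULT, so a single integer carries both pieces of information.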
An Example
In this example, we will emulate the Solaris 10 SYS_sigqueue syscall. We will name the emulation function s10_sigqueue() and define it thus:
/*
 * New last arg "block" flag should be zero. The block flag is used by
 * the OpenSolaris AIO implementation, which is now part of libc.
 */
static int
s10_sigqueue(sysret_t *rval, pid_t pid, int signo, void *value, int si_code)
{
    return (__systemcall(rval, SYS_sigqueue + 1024, pid, signo, value,
        si_code, 0));
}
(For information about __systemcall(), see "Issuing System Calls within Emulation Functions".)
We need to modify the entry in the brand emulation library's s10_sysent_table array corresponding to SYS_sigqueue (entry 190) so that s10_sigqueue() will be invoked when processes issue SYS_sigqueue. SYS_sigqueue takes four arguments and returns "default" values in Solaris 10, so the correct entry is:
EMULATE(s10_sigqueue, 4 | RV_DEFAULT)
Finally, we need to modify the solaris10 brand kernel module so that it knows to emulate SYS_sigqueue. We would add the following line to the beginning of the _init() function in the kernel module:
s10_emulation_table[SYS_sigqueue] = 1;
Emulating an Ioctl
All ioctls are issued via the SYS_ioctl syscall, which the brand emulation library emulates in s10_ioctl(). All ioctls are currently emulated by first checking the request argument to ioctl(2) (the cmd argument to s10_ioctl()) and taking appropriate action based on its value (usually calling separate emulation functions to handle subsystem- or device-specific ioctls, such as zfs_ioctl() to handle ZFS ioctls). If the argument does not match any emulated ioctls, then the syscall is passed to the OpenSolaris kernel.
Here is an example of s10_ioctl():
int
s10_ioctl(sysret_t *rval, int fdes, int cmd, intptr_t arg)
{
    /*
     * Is the ioctl command a specific ioctl that needs to
     * be emulated?
     */
    switch (cmd) {
    case CRYPTO_GET_FUNCTION_LIST:
        return (crypto_ioctl(rval, fdes, cmd, arg));
    case CT_TGET:
        return (ctfs_ioctl(rval, fdes, cmd, arg));
    case CT_TSET:
        return (ctfs_ioctl(rval, fdes, cmd, arg));
    }

    /*
     * Does the ioctl belong to the ZFS ioctl family?
     */
    if ((cmd & 0xff00) == ZFS_IOC)
        return (zfs_ioctl(rval, fdes, cmd, arg));

    /*
     * The ioctl doesn't need to be emulated. Pass it to the
     * OpenSolaris kernel.
     */
    return (__systemcall(rval, SYS_ioctl + 1024, fdes, cmd, arg));
}
Note that the same ioctl command number might be used by two different devices: Checking the command number does not determine which of the two is being controlled. Although most Solaris devices have unique ioctl commands, there is no guarantee that a Solaris device-specific ioctl command is not used by a third-party device. If this is a concern, then ioctl emulation code can gather more information about the targeted device by performing an fstat operation (SYS_fstat) and checking the results.
For example, suppose that we will emulate the contract file system's (CTFS's) CT_TGET and CT_TSET ioctl commands in the emulation function ctfs_ioctl(). Suppose further that we want to ensure that we only emulate these ioctls when they target CTFS files. We can issue a SYS_fstat syscall and check that the targeted file's filesystem's type is MNTTYPE_CTFS:
static int
ctfs_ioctl(sysret_t *rval, int fdes, int cmd, intptr_t arg)
{
    int err;
    struct stat statbuf;

    /* ... */

    if ((err = __systemcall(rval, SYS_fstat + 1024, fdes, &statbuf)) != 0)
        return (err);
    if (strcmp(statbuf.st_fstype, MNTTYPE_CTFS) != 0) {
        /*
         * The target doesn't reside on CTFS. Assume that the ioctl
         * doesn't need to be emulated and pass it to the kernel.
         */
        return (__systemcall(rval, SYS_ioctl + 1024, fdes, cmd, arg));
    }

    /* ... */
}
(For information about __systemcall(), see "Issuing System Calls within Emulation Functions".)
Notice that the above function assumes that CT_TGET and CT_TSET ioctls should not be emulated if they are not intended for CTFS files. This might not be the case if another device uses either ioctl command and needs to be emulated.
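The family test in s10_ioctl(), (cmd & 0xff00) == ZFS_IOC, relies on the convention that an ioctl command carries a family identifier in bits 8 through 15. The sketch below isolates that test. It assumes that a family constant has the form (letter << 8); DEMO_ZFS_IOC and DEMO_OTHER_IOC are illustrative stand-ins, not the real header definitions, and ioctl_family_matches() is a hypothetical helper.

```c
#include <assert.h>

#define IOC_FAMILY_MASK  0xff00
#define DEMO_ZFS_IOC     ('Z' << 8)  /* ASSUMED to mirror the shape of ZFS_IOC */
#define DEMO_OTHER_IOC   ('t' << 8)  /* an unrelated, made-up family */

/* Returns nonzero if cmd carries the given family byte in bits 8-15. */
static int
ioctl_family_matches(int cmd, int family)
{
    return ((cmd & IOC_FAMILY_MASK) == family);
}
```

Because the mask looks only at one byte, any command from any driver that happens to carry the same byte passes the test, regardless of its other bits. This is the collision risk described above, and it is the reason an fstat check on the target file can be worthwhile before emulating.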
General Emulation Techniques
The following sections detail common techniques used in syscall and ioctl emulation functions.
Issuing System Calls within Emulation Functions
Many emulation functions need to issue syscalls to the native OpenSolaris kernel. To do so, invoke the __systemcall() function as follows:
__systemcall(<rv>, <SYS-identifier> + 1024, <syscall-args>);
<SYS-identifier> is the numeric code of the syscall being issued as defined in usr/src/uts/common/sys/syscall.h (e.g., SYS_fstat) and <syscall-args> are the arguments to the syscall. The return value of the syscall is stored in <rv>, which has type sysret_t *. The return value of __systemcall() indicates whether or not an error occurred. If it is zero, then no errors occurred.
Notice that you should add 1024 to the syscall's numeric identifier. The brand framework treats all syscalls whose identifiers are less than 1024 as emulated syscalls and will bounce them back to the brand emulation library, whereas syscalls whose identifiers are offset by 1024 are treated as native syscalls and are not emulated. If you do not add 1024 to the identifier, then the brand emulation library will handle the syscall, which is probably not what you want. Not adding 1024 might also cause infinite recursion if the emulation function inadvertently invokes itself while issuing the syscall, resulting in stack overflows and, ultimately, core dumps.
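The routing rule above can be modeled in a few lines of portable C. This is only a sketch under stated assumptions: model_dispatch() and model_emulate() are hypothetical stand-ins for the brand framework and an emulation function, not real brand code, and the value 28 is SYS_fstat's number as quoted elsewhere in this guide.

```c
#define NATIVE_OFFSET   1024    /* the offset described in the text */
#define MODEL_SYS_FSTAT 28      /* SYS_fstat's number, as quoted in this guide */

static int emulated_calls;      /* times the emulation path ran */
static int native_calls;        /* times the native path ran */

static int model_dispatch(int code);

/* Hypothetical emulation function: does its work, then goes native. */
static int
model_emulate(int code)
{
    emulated_calls++;
    /* Adding the offset sends the call straight to the native path. */
    return (model_dispatch(code + NATIVE_OFFSET));
}

/* Hypothetical model of the routing rule: identifiers below 1024 are emulated. */
static int
model_dispatch(int code)
{
    if (code < NATIVE_OFFSET)
        return (model_emulate(code));
    native_calls++;
    return (0);
}
```

In this model, dropping "+ NATIVE_OFFSET" from model_emulate() would make it call itself through model_dispatch() without end, which mirrors the recursion hazard described above.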
An Example
Refer to the example emulation function s10_sigqueue() provided in the last section, which emulates the SYS_sigqueue syscall:
/*
* New last arg "block" flag should be zero. The block flag is used by
* the OpenSolaris AIO implementation, which is now part of libc.
*/
static int
s10_sigqueue(sysret_t *rval, pid_t pid, int signo, void *value, int si_code)
{
return (__systemcall(rval, SYS_sigqueue + 1024, pid, signo, value,
si_code, 0));
}
In Solaris 10, SYS_sigqueue takes four arguments, but the OpenSolaris version takes five. The new fifth argument is a flag used by OpenSolaris' AIO subsystem and should be zero when SYS_sigqueue is issued from within a solaris10-branded zone. Reissuing the syscall within the emulation function with a zeroed fifth argument solves the problem. The above code issues a native SYS_sigqueue syscall (notice that the function offsets the syscall identifier by 1024), passing all of the arguments provided by the calling process untouched and adding a zero as the fifth argument. The return value of the syscall is stored in rval, which is handed to the calling process when the emulation function completes. The error code (if any) produced by the syscall is returned to the calling process.
Inserting truss Points
No truss points are triggered when a syscall is emulated by the brand library. If a native syscall is issued from an emulation function (see "Issuing System Calls within Emulation Functions" above), then the native syscall's truss point is triggered. However, if the emulation function does not issue a native syscall, then you should insert a truss point via the S10_TRUSS_POINT_* macros. These macros issue a SYS_brand syscall in order to simulate a truss point.
The macros have the following signature:
S10_TRUSS_POINT_N(<rv>, <SYS-identifier>, <errno-value>, <arguments>)
N is an integer between one and five (inclusive) that specifies the number of arguments to report in the truss point. <rv> is a non-NULL sysret_t * that stores the return value of the syscall that performs the truss operation. <SYS-identifier> is the numeric identifier of the syscall being issued as defined in usr/src/uts/common/sys/syscall.h (e.g., SYS_fstat). <errno-value> is an integer such that if it is nonzero, then the SYS_brand syscall that simulates the truss point stores it in the calling thread's errno. <arguments> is a comma-separated list of N values to report in the truss point.
The macros return zero for success and an errno error code for failure.
An Example
In this example, SYS_systeminfo is emulated entirely in the brand emulation library by the function s10_sysinfo(). If a process running in a solaris10-branded zone issues SYS_systeminfo and an instance of truss is observing the process' syscalls, then the latter won't see the SYS_systeminfo syscall because s10_sysinfo() never issues a native syscall. Creating a simulated truss point solves this problem:
int
s10_sysinfo(sysret_t *rv, int command, char *buf, long count)
{
char *value;
int len;
/* ... */
/*
* On success, sysinfo(2) returns the size of buffer required to hold
* the complete value plus its terminating NULL byte.
*/
(void) S10_TRUSS_POINT_3(rv, SYS_systeminfo, 0, command, buf, count);
rv->sys_rval1 = len;
rv->sys_rval2 = 0;
return (0);
}
Notice that the function passes zero as <errno-value> in the truss macro in order to tell observing truss processes that the syscall completed successfully.
Copying Memory
If you need to copy data to or from a buffer or structure provided by the calling process, then you should use s10_uucopy() and s10_uucopystr(). Both prevent the emulation library from performing illegal memory accesses if the calling process provides junk pointers. The signatures of the two functions are identical:
static int s10_uucopy[str](const void *from, void *to, size_t size)
s10_uucopy() copies size bytes from from to to. s10_uucopystr() functions like strncpy(3C) in that it copies at most size characters from from to to, but unlike strncpy(3C) it never adds a terminating NULL byte. Both functions indicate success by returning zero and indicate failure by returning an errno value (e.g., EFAULT).
An Example
This example expands the example of inserting truss points, which showcased part of the emulation function for SYS_systeminfo.
int
s10_sysinfo(sysret_t *rv, int command, char *buf, long count)
{
char *value;
int len;
/*
* We must interpose on the sysinfo(2) commands SI_RELEASE and
* SI_VERSION; all others get passed to the native sysinfo(2)
* command.
*/
switch (command) {
case SI_RELEASE:
value = "5.10";
break;
case SI_VERSION:
value = "Generic_Virtual";
break;
default:
/* ... */
}
/*
* Copy the string to the buffer provided by the calling
* process. Use s10_uucopystr() in order to properly
* handle memory access/protection faults. (We can't
* trust pointers provided by the calling process!)
*/
len = strlen(value) + 1;
if (count > 0) {
if (s10_uucopystr(value, buf, count) != 0)
return (EFAULT);
/* Ensure NULL termination of buf as s10_uucopystr() doesn't. */
if (len > count && s10_uucopy("\0", buf + (count - 1), 1) != 0)
return (EFAULT);
}
/*
* On success, sysinfo(2) returns the size of buffer required to hold
* the complete value plus its terminating NULL byte.
*/
(void) S10_TRUSS_POINT_3(rv, SYS_systeminfo, 0, command, buf, count);
rv->sys_rval1 = len;
rv->sys_rval2 = 0;
return (0);
}
When the calling process requests either the release (SI_RELEASE) or the version (SI_VERSION) of the kernel, the emulation function produces fake values so that the calling process will see values reflecting a Solaris 10 kernel. Once the appropriate string is selected, s10_uucopystr() copies it to the buffer provided by the calling process. However, if the buffer is too small to hold the complete string (len > count), then the copy is truncated and s10_uucopystr() leaves the buffer unterminated, so the function invokes s10_uucopy() to store a NULL byte in the last byte of the buffer.
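The truncation handling in the example can be tried outside the brand library with a small portable model. This is a sketch only: model_copystr() is a hypothetical stand-in that mimics the copy semantics described in this section, and unlike the real s10_uucopystr() it does not protect against bad pointers.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for s10_uucopystr(): copies at most size bytes and
 * stops after copying a terminating byte, but never appends one itself.
 */
static int
model_copystr(const char *from, char *to, size_t size)
{
    size_t i;

    for (i = 0; i < size; i++) {
        to[i] = from[i];
        if (from[i] == '\0')
            break;
    }
    return (0);
}

/*
 * Mirrors the copy-out logic of the s10_sysinfo() example: copy, then
 * terminate by hand if the value was truncated. Returns the buffer size
 * needed to hold the complete value, as the example's comment describes.
 */
static long
model_copy_out(const char *value, char *buf, long count)
{
    long len = (long)strlen(value) + 1;

    if (count > 0) {
        (void) model_copystr(value, buf, (size_t)count);
        if (len > count)
            buf[count - 1] = '\0';
    }
    return (len);
}
```

With a four-byte buffer and the value "5.10", the model leaves "5.1" in the buffer and still reports that five bytes are required, so the caller can detect the truncation.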
Modifying Parameter Structures
Some syscalls' and ioctls' parameters' structures and semantics differ between Solaris 10 and OpenSolaris. For example, the contract file system's ct_param_t structure changed in OpenSolaris so that it looks like
/* OpenSolaris version */
typedef struct ct_param {
uint32_t ctpm_id;
uint32_t ctpm_size;
void *ctpm_value;
} ct_param_t;
instead of
/* Solaris 10 version */
typedef struct ct_param {
uint32_t ctpm_id;
uint32_t ctpm_pad;
uint64_t ctpm_value;
} ct_param_t;
Notice that the second field changed from padding (ctpm_pad) to a size (ctpm_size) and that the third field, ctpm_value, changed from uint64_t to void *. Unfortunately, this means that the size of the OpenSolaris structure is 12 bytes on 32-bit systems and 16 bytes on 64-bit systems (the Solaris 10 structure is 16 bytes on both). Solaris 10 processes issuing CT_TGET or CT_TSET contract file system ioctls will pass the Solaris 10 version of the structure as the ioctl argument, which is invalid on 32-bit OpenSolaris systems.
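The size arithmetic can be checked with fixed-width types. The sketch below is illustrative only: the two model_ct_param_* types are hypothetical, with a uint32_t or uint64_t field standing in for a 4-byte or 8-byte void * so that both data models can be examined from a single build.

```c
#include <stdint.h>

/* Solaris 10 layout, as shown above: 4 + 4 + 8 bytes. */
typedef struct s10_ct_param {
    uint32_t ctpm_id;
    uint32_t ctpm_pad;
    uint64_t ctpm_value;
} s10_ct_param_t;

/* Model of the OpenSolaris layout when void * is 4 bytes (32-bit). */
typedef struct model_ct_param_ilp32 {
    uint32_t ctpm_id;
    uint32_t ctpm_size;
    uint32_t ctpm_value;    /* stands in for a 4-byte void * */
} model_ct_param_ilp32_t;

/* Model of the OpenSolaris layout when void * is 8 bytes (64-bit). */
typedef struct model_ct_param_lp64 {
    uint32_t ctpm_id;
    uint32_t ctpm_size;
    uint64_t ctpm_value;    /* stands in for an 8-byte void * */
} model_ct_param_lp64_t;
```

Even where the sizes agree, the meaning of ctpm_value differs (a value in Solaris 10, a pointer to a value in OpenSolaris), so matching sizes alone do not make the two structures interchangeable.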
Other examples of incompatible structure changes include new fields, deleted fields, new flag field values, and field value range changes.
Structure changes can be overcome as follows:
- Define the Solaris 10 version of the structure in the brand emulation library.
- Declare a stack instance (local variable) of the OpenSolaris version of the structure in the associated emulation function.
- Copy the contents of the structure parameter to the local structure variable in the appropriate places (see "Copying Memory" above).
- Adjust the fields of the structure on the stack as necessary.
- Issue the system call or ioctl, passing the OpenSolaris structure instead of the Solaris 10 structure.
- Copy modified fields from the OpenSolaris structure to the Solaris 10 structure if any of the former's fields were modified by the syscall or ioctl.
Sometimes a stack instance of the Solaris 10 structure must also be declared and utilized: see the example below for a case in which this is necessary.
An Example
The following example shows how the aforementioned ct_param_t structure change can be circumvented by the brand emulation library. ctfs_ioctl() emulates the contract file system CT_TGET and CT_TSET ioctls in the emulation library. Here is its definition:
#include <sys/ctfs.h>
/* Solaris 10 version of ct_param_t */
typedef struct s10_ct_param {
uint32_t ctpm_id;
uint32_t ctpm_pad;
uint64_t ctpm_value;
} s10_ct_param_t;
/*
* We have to emulate process contract ioctls for init(1M) because the
* ioctl parameter structure changed between Solaris 10 and OpenSolaris.
* This is a relatively simple process of filling OpenSolaris structure
* fields, shuffling values, and initiating a native syscall.
*/
static int
ctfs_ioctl(sysret_t *rval, int fdes, int cmd, intptr_t arg)
{
int err;
s10_ct_param_t s10param; /* Solaris 10 version */
ct_param_t param; /* OpenSolaris version */
/*
* Copy the structure provided by the caller to the stack
* and fill the fields of the OpenSolaris version of the structure
* with appropriate values. Then issue the syscall with
* the OpenSolaris version of the structure.
*/
if (s10_uucopy((const void *)arg, &s10param, sizeof (s10param)) != 0)
return (EFAULT);
param.ctpm_id = s10param.ctpm_id;
param.ctpm_size = sizeof (uint64_t);
param.ctpm_value = &s10param.ctpm_value;
if ((err = __systemcall(rval, SYS_ioctl + 1024, fdes, cmd, &param))
!= 0)
return (err);
/*
* If the ioctl is CT_TGET, then the syscall stored a value in
* the 'ctpm_value' field of the Solaris 10 structure (notice that
* the OpenSolaris version's 'ctpm_value' field is a pointer to the
* Solaris 10 version's 'ctpm_value' field). Copy the entire
* Solaris 10 structure back to the caller so that it sees the
* new value.
*/
if (cmd == CT_TGET)
return (s10_uucopy(&s10param, (void *)arg, sizeof (s10param)));
return (0);
}
Producing Architecture-Specific Code
The solaris10 brand emulation library is compiled twice for both x86 and sparc in order to produce 32- and 64-bit shared libraries. Consequently, it is possible to restrict pieces of code in the emulation library to particular architectures via the standard architecture preprocessor symbols: __x86 for 32- or 64-bit x86; __i386 for 32-bit x86; __amd64 for 64-bit x86 (i.e., x86-64); __sparc for any sparc architecture; __sparcv7, __sparcv8, and __sparcv9 for sparc V7, V8, and V9, respectively; and _LP64 for any 64-bit architecture. For example, suppose that a particular ioctl only needs to be emulated when issued by a 64-bit x86 process. We can implement such emulation by wrapping the emulation code (and its invocations) with preprocessor conditional statements:
#ifdef __amd64
static int
s10_some_ioctl_emulation_function(sysret_t *rval, int fdes, int cmd,
intptr_t arg)
{
/* ... */
}
#endif /* __amd64 */
int
s10_ioctl(sysret_t *rval, int fdes, int cmd, intptr_t arg)
{
/* ... */
#ifdef __amd64
if (cmd == SOME_IOCTL)
return (s10_some_ioctl_emulation_function(rval, fdes, cmd,
arg));
#endif /* __amd64 */
/* ... */
}
However, care must be taken when restricting syscall emulation to specific architectures. If the syscall emulation function will only exist for some architectures, then the s10_sysent_table array in the emulation library should be configured such that it specifies NOSYS for the architectures for which the syscall will not be emulated. The brand's kernel module's s10_emulation_table should also be properly configured so that the syscall is only emulated for those architectures for which emulation is necessary. (For more information about s10_sysent_table and s10_emulation_table, see "Emulating a System Call" above.) For example, if SYS_fstat were to be emulated on sparc alone, then we should define the emulation function thus:
#ifdef __sparc
static int
s10_fstat(sysret_t *rval, int fd, struct stat *sb)
{
/* ... */
}
#endif /* __sparc */
We would have to modify s10_sysent_table as follows (note that SYS_fstat is syscall number 28):
s10_sysent_table_t s10_sysent_table[] = {
/* ... */
NOSYS, /* 27 */
#ifdef __sparc
EMULATE(s10_fstat, 2 | RV_DEFAULT), /* 28 */
#else /* !__sparc */
NOSYS, /* 28 */
#endif /* !__sparc */
NOSYS, /* 29 */
/* ... */
};
Finally, we would have to modify _init() in the kernel module thus:
int
_init(void)
{
/* ... */
#ifdef __sparc
s10_emulation_table[SYS_fstat] = 1; /* 28 */
#endif /* __sparc */
/* ... */
}
If a syscall will only be emulated on one or more 64-bit architectures, then an emulation function usually has to be created for 32-bit architectures that only passes the syscall's arguments to the native kernel. For example, if SYS_fstat should only be emulated for 64-bit x86 processes, then we should define the emulation function thus:
#ifdef __x86
static int
s10_fstat(sysret_t *rval, int fd, struct stat *sb)
{
#ifdef __amd64
/* ... */
#else /* !__amd64 */
/*
* The brand library won't do anything special for SYS_fstats issued by
* 32-bit x86 processes, so we'll hand the syscall back to the kernel.
*/
return (__systemcall(rval, SYS_fstat + 1024, fd, sb));
#endif /* !__amd64 */
}
#endif /* __x86 */
We would have to modify s10_sysent_table as follows:
s10_sysent_table_t s10_sysent_table[] = {
/* ... */
NOSYS, /* 27 */
#ifdef __x86
EMULATE(s10_fstat, 2 | RV_DEFAULT), /* 28 */
#else /* !__x86 */
NOSYS, /* 28 */
#endif /* !__x86 */
NOSYS, /* 29 */
/* ... */
};
This specifies that SYS_fstat will be emulated by s10_fstat() on both 32- and 64-bit x86 but not on sparc. Finally, we would have to modify _init() in the kernel module thus:
int
_init(void)
{
/* ... */
#ifdef __amd64
/*
* NOTE: Even though SYS_fstat will only be emulated when running a
* 64-bit x86 kernel, SYS_fstat will be emulated for both 32- and 64-
* bit processes.
*/
s10_emulation_table[SYS_fstat] = 1; /* 28 */
#endif /* __amd64 */
/* ... */
}
The above code ensures that only the 64-bit x86 kernel module will pass SYS_fstat syscalls to the brand emulation library. However, both 32- and 64-bit processes are capable of running on 64-bit kernels and the kernel brand framework does not discriminate between syscalls issued by 32- and 64-bit processes. Consequently, s10_sysent_table must be configured so that SYS_fstat is handled by the brand emulation library for both 32- and 64-bit processes even though the 32-bit emulation function simply hands control back to the kernel. If s10_sysent_table were instead configured thus:
s10_sysent_table_t s10_sysent_table[] = {
/* ... */
NOSYS, /* 27 */
#ifdef __amd64 /* fstat() is only emulated for 64-bit x86 processes, right? */
EMULATE(s10_fstat, 2 | RV_DEFAULT), /* 28 */
#else /* !__amd64 */
NOSYS, /* 28 */
#endif /* !__amd64 */
NOSYS, /* 29 */
/* ... */
};
then if a 32-bit x86 process running on a 64-bit kernel were to issue SYS_fstat, the brand's kernel module would pass the syscall to the brand emulation library and the library would signal the calling process with SIGSYS (bad syscall) because the 32-bit library does not emulate SYS_fstat.
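The interaction between the two tables can be summarized with a toy model. Everything in it is hypothetical: kernel_flag models the syscall's s10_emulation_table entry, handler models the s10_sysent_table entry of whichever library build (32- or 64-bit) is loaded in the calling process, and NULL stands in for NOSYS.

```c
#include <stddef.h>

typedef int (*model_handler_t)(void);

enum model_outcome { MODEL_NATIVE, MODEL_EMULATED, MODEL_SIGSYS };

/* Hypothetical stand-in for an emulation function such as s10_fstat(). */
static int
model_s10_fstat(void)
{
    return (0);
}

/* Decide what happens to one syscall under the modeled configuration. */
static enum model_outcome
model_issue(int kernel_flag, model_handler_t handler)
{
    if (!kernel_flag)
        return (MODEL_NATIVE);      /* kernel handles the syscall itself */
    if (handler == NULL)
        return (MODEL_SIGSYS);      /* library has no handler: bad syscall */
    (void) handler();
    return (MODEL_EMULATED);
}
```

The misconfiguration described above corresponds to model_issue(1, NULL): the kernel forwards the syscall, but the 32-bit library has nothing to run.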
General Emulation Considerations
Keep the following considerations in mind when you modify the emulation library:
- Do not set errno! The brand emulation library has its own copy of errno, so processes running in solaris10-branded zones will not see any modifications to errno that are made from the emulation library. Emulation functions should return errno values: The brand framework and Solaris 10 libc move these values into errno as seen by the processes running in solaris10-branded zones. For example, if a syscall emulation function succeeds, then it should return zero (i.e., return (0);) and ensure that the sysret_t value passed to the emulation function is properly set. On the other hand, if the emulation function needs to indicate EFAULT to the invoking process, then it should return EFAULT (i.e., return (EFAULT);).
- Take care regarding which library calls you invoke. The brand emulation library is linked to its own copy of the Solaris 10 libc and other libraries, so it can invoke standard functions such as strtol(3C). However, you must be careful when you invoke a library function because it might issue syscalls, which might result in recursion as described in "Issuing System Calls within Emulation Functions" above.
- Always leave comments explaining why you are providing emulation. Make everyone's life easier. People reviewing the brand emulation library will most likely not be intimately familiar with the subsystem(s) whose behaviors you are emulating.
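The first consideration can be illustrated with a simplified, self-contained sketch. model_sysret_t and model_emulate_len() are hypothetical names used only for illustration; a real emulation function would receive a sysret_t and would use s10_uucopy() rather than a NULL check to validate caller-supplied pointers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for sysret_t; only the fields used in this guide. */
typedef struct model_sysret {
    long sys_rval1;
    long sys_rval2;
} model_sysret_t;

/*
 * Hypothetical emulation function that reports the length of a string.
 * Failure is reported through the return value; errno is never assigned.
 */
static int
model_emulate_len(model_sysret_t *rv, const char *buf)
{
    size_t len = 0;

    if (buf == NULL)
        return (EFAULT);    /* return the errno value; do not set errno */
    while (buf[len] != '\0')
        len++;
    rv->sys_rval1 = (long)len;  /* success: fill in the return value */
    rv->sys_rval2 = 0;
    return (0);
}
```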
How can I set up the brand to use a native command?
Modify the solaris10 brand boot script (usr/src/lib/brand/solaris10/zone/s10_boot.ksh in the source tree) as follows:
- Navigate to the section labeled "STEP ONE".
- If the replaced command's path is /d_1/d_2/.../d_N/B, where d_i is a directory (i is between 1 and N) and B is the name of the command, then add the following lines to the script under "STEP ONE":
safe_dir /d_1
safe_dir /d_1/d_2
...
safe_dir /d_1/d_2/.../d_N
For example, if you want to replace /foo/bar/baz/some_command, then add the following lines to the script under "STEP ONE":
safe_dir /foo
safe_dir /foo/bar
safe_dir /foo/bar/baz
You do not need to duplicate these lines for two commands residing within the same directory. For example, if you want to replace /foo/bar/baz/command_1 and /foo/bar/baz/command_2, then you only need to add the above lines once.
- Navigate to the section labeled "STEP TWO".
- If the path of the replacement command is identical to that of the replaced command, then add the following line to the script under "STEP TWO": replace_with_native <abs-path> <mode> <owner>:<group>, where <abs-path> is the absolute path to the replacement command, <mode> is the octal UNIX permissions that the replacement command will have, <owner> is the name of the command's owner, and <group> is the owner's group.
The following example replaces /sbin/ifconfig with the native OpenSolaris ifconfig, both of which reside in /sbin; gives the replacement owner, group, and world read and execute permissions; and makes the zone's root user (group bin) the replacement's owner:
replace_with_native /sbin/ifconfig 0555 root:bin
- If the path of the replacement command differs from that of the replaced command, then add the following line to the script under "STEP TWO": safe_replace $ZONEROOT/<replaced-abs-path> <replacement-abs-path> <mode> <owner>:<group> remove, where <replaced-abs-path> is the absolute path of the replaced command (as seen from within the zone), <replacement-abs-path> is the absolute path of the replacement command, and <mode>, <owner>, and <group> have the same semantics as in the previous step.
The following example replaces the Solaris 10 /usr/bin/true with the OpenSolaris /usr/bin/false, gives the replacement owner, group, and world read and execute permissions, and makes the zone's root user (group bin) the replacement's owner:
safe_replace $ZONEROOT/usr/bin/true /usr/bin/false 0555 root:bin remove
The brand boot script executes whenever a solaris10-branded zone boots. Once zone bootup is complete, every replaced command X will have been backed up to X.pre_p2v. If you need to restore the replaced command's binary within a zone, then execute mv X.pre_p2v X within the zone, where X is the full path of the replaced command.
Here is an example that replaces /usr/bin/zcat with OpenSolaris' /usr/bin/zcat (with owner root, group bin, and UNIX file permissions 0555) and /usr/demo/dtrace/kstat.d with OpenSolaris' /usr/bin/true (with owner root, group bin, and UNIX file permissions 0644):
#
# STEP ONE
#
# Validate that the zone filesystem looks like we expect it to.
#
safe_dir /usr
safe_dir /usr/bin
safe_dir /usr/demo
safe_dir /usr/demo/dtrace
# ...
#
# STEP TWO
#
# Replace Solaris 10 binaries with OpenSolaris binaries.
#
replace_with_native /usr/bin/zcat 0555 root:bin
safe_replace $ZONEROOT/usr/demo/dtrace/kstat.d /usr/bin/true 0644 root:bin \
remove
# ...
A Note Concerning Split Binaries
Some commands in /usr/bin, such as /usr/bin/pmap, are split into 32- and 64-bit versions that reside in architecture-dependent subdirectories of /usr/bin. For example, the 32-bit version of pmap resides in /usr/bin/i86 while the 64-bit version resides in /usr/bin/amd64 on x86 systems. Sparc systems only have a 64-bit version of pmap: /usr/bin/sparcv9/pmap. To replace a split command such as pmap, you must replace both the 32-bit binary (if it exists) and the 64-bit binary.
Suppose that you want to replace the split command /usr/bin/X with the OpenSolaris version of /usr/bin/X. Then skip steps four and five above and add the following lines below "STEP TWO" instead:
if [ -n "$ARCH32" ]; then
replace_with_native /usr/bin/$ARCH32/X <mode> <owner>:<group>
fi
if [ -n "$ARCH64" ]; then
replace_with_native /usr/bin/$ARCH64/X <mode> <owner>:<group>
fi
<mode>, <owner>, and <group> have the same semantics as in steps four and five above.
Here is an example that replaces the split command /usr/bin/pmap:
if [ -n "$ARCH32" ]; then
replace_with_native /usr/bin/$ARCH32/pmap 0555 root:bin
fi
if [ -n "$ARCH64" ]; then
replace_with_native /usr/bin/$ARCH64/pmap 0555 root:bin
fi
If the replacement command's path differs from that of the replaced command, then add the following lines below "STEP TWO" instead:
if [ -n "$ARCH32" ]; then
safe_replace $ZONEROOT/usr/bin/$ARCH32/X \
<32-bit-replacement-abs-path> <mode> <owner>:<group> remove
fi
if [ -n "$ARCH64" ]; then
safe_replace $ZONEROOT/usr/bin/$ARCH64/X \
<64-bit-replacement-abs-path> <mode> <owner>:<group> remove
fi
X is the name of the replaced command. <32-bit-replacement-abs-path> and <64-bit-replacement-abs-path> are the absolute paths of the replacement commands for the 32- and 64-bit binaries (respectively).
How is versioning handled for different Solaris 10 updates or patches running in the zone when they need different emulation?
Any OpenSolaris changes that require updates to the brand's emulation must ensure that Solaris 10u8 and later will function correctly inside of solaris10-branded zones. This is simply a matter of straightforward compatibility and no special versioning support is needed.
Versioning becomes a factor when making changes to Solaris 10 code. A single OpenSolaris system can host solaris10-branded zones running different releases of Solaris 10 (u8, u9, etc.). For example, an OpenSolaris system might have two zones foo and bar that host Solaris 10u8 and Solaris 10u9 environments, respectively. This is problematic because different Solaris 10 releases might require different emulation due to backports and other Solaris 10 changes. The solaris10 brand overcomes this problem through a basic versioning system.
Note that enhancing the brand's emulation library to dynamically detect the presence or absence of features in hosted Solaris 10 environments is preferable to versioning. You can see an example of this in s10_lwp_private(), which determines whether a zone's libc works when the %fs x86 segment register is zero. (Solaris 10u8 sets %fs to a nonzero value in 64-bit x86 processes whereas OpenSolaris clears it.) If it does not, then s10_lwp_private() changes the current LWP's %fs register to a nonzero value and adjusts the brand emulation library so that future LWPs created via the SYS_lwp_create syscall will have nonzero %fs registers; otherwise, the emulation library does not change %fs. Read the subsection entitled "Dynamic Feature Detection: Sysent Table Patching" to learn about techniques that improve the performance of emulation functions that utilize dynamic feature detection.
In order to use versioning, you must change OpenSolaris thus:
- Create a new enumeration constant in s10_emulated_features that represents the Solaris 10 backport or fix. Choose a name that starts with S10_FEATURE_ and describes the fix (e.g., S10_FEATURE_ALTERED_MNTFS_IOCTL or S10_FEATURE_CR_6813502). Place the new enumeration value immediately before S10_NUM_EMUL_FEATURES. For example:
enum s10_emulated_features {
    /* ... */
    S10_FEATURE_ALTERED_MNTFS_IOCTL,
    S10_NUM_EMUL_FEATURES    /* This must be the last entry! */
};
If your new constant conflicts with someone's putback (i.e., someone else adds a new constant to the enumeration), then merge the changes such that your new constant appears immediately before S10_NUM_EMUL_FEATURES. For example, if your fix adds a new constant named S10_FEATURE_ALTERED_MNTFS_IOCTL (as in the example above) but the addition conflicts with another developer's putback because the developer added a constant named S10_FEATURE_SOME_BACKPORT, then you should merge the changes as follows:
enum s10_emulated_features {
    /* ... */
    S10_FEATURE_SOME_BACKPORT,
    S10_FEATURE_ALTERED_MNTFS_IOCTL,
    S10_NUM_EMUL_FEATURES    /* This must be the last entry! */
};
Remember the value of your new constant. You will need it in later steps.
- Alter the fix's emulation in the brand's emulation library so that it performs conditional emulation. For example, if some emulation will not be necessary when your backport integrates into Solaris 10, then you can alter the emulation so that it only occurs when the backport is not present in the hosted Solaris 10 environment. On the other hand, if your Solaris 10 fix will require new emulation in the solaris10 brand's emulation library, then you should modify the library so that the new emulation is only performed when the backport is present in the hosted Solaris 10 environment. See "How can I add to or update the emulation?" for details about adding and updating syscall and ioctl emulation.
The emulation library provides the S10_FEATURE_IS_PRESENT() macro so that emulation functions can easily detect the presence of backports. S10_FEATURE_IS_PRESENT(N) is true iff the backport represented by the s10_emulated_features constant N is present in the associated zone's Solaris 10 environment. Armed with this macro and your new s10_emulated_features constant, you can change the emulation library so that some syscalls or ioctls are conditionally emulated based on whether your backport is present in a zone's Solaris 10 environment.
For example, suppose that your backport will obsolete some emulation. You cannot delete the emulation because older Solaris 10 environments (e.g., u8 environments) will need the emulation in order to continue to function properly; therefore, the emulation library should perform the emulation iff your backport is not present in the zone's Solaris 10 environment. Suppose that the backport's s10_emulated_features constant is S10_FEATURE_ALTERED_MNTFS_IOCTL. Then you should alter the affected emulation so that it looks something like the following:
if (S10_FEATURE_IS_PRESENT(S10_FEATURE_ALTERED_MNTFS_IOCTL)) {
    /*
     * Emulation isn't necessary. Do whatever the
     * native kernel would do.
     */
    return (__systemcall(/* ... */));
} else {
    /* Perform the emulation... */
}
On the other hand, if the backport represented by S10_FEATURE_ALTERED_MNTFS_IOCTL requires new emulation, then you should swap the branches in the above example. In other words, the new emulation should be performed iff S10_FEATURE_IS_PRESENT(S10_FEATURE_ALTERED_MNTFS_IOCTL) is true.
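The conditional pattern can be exercised in isolation with a small model. The names below are hypothetical: MODEL_FEATURE_IS_PRESENT() and the model_present array merely imitate the role of S10_FEATURE_IS_PRESENT(), and this guide does not describe how the real macro records which features are present.

```c
/* Hypothetical feature list; the real enum is s10_emulated_features. */
enum model_features {
    MODEL_FEATURE_ALTERED_MNTFS_IOCTL,
    MODEL_NUM_EMUL_FEATURES     /* must stay last */
};

/* One flag per feature, an assumption made purely for this sketch. */
static int model_present[MODEL_NUM_EMUL_FEATURES];

#define MODEL_FEATURE_IS_PRESENT(n) (model_present[(n)] != 0)

enum model_path { MODEL_PASS_THROUGH, MODEL_EMULATE };

/* The backport obsoletes the emulation, so emulate only when it is absent. */
static enum model_path
model_mntfs_ioctl(void)
{
    if (MODEL_FEATURE_IS_PRESENT(MODEL_FEATURE_ALTERED_MNTFS_IOCTL))
        return (MODEL_PASS_THROUGH);
    return (MODEL_EMULATE);
}
```

For a backport that requires new emulation instead, the two return statements would trade places.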
Afterwards, you must do the following as part of your Solaris 10 backport or fix:
- Create a file named usr/src/lib/brand/solaris10/M in Solaris 10's source tree, where M is the integral value of your new s10_emulated_features constant. For example, if your new s10_emulated_features constant evaluates to four, then your fix should create a file named usr/src/lib/brand/solaris10/4. The file can be empty, but it might be useful to insert a short description of the fix (what it changes, its CR number, etc.) so that developers will be able to more easily determine the file's associated fix.
- Update Solaris 10's usr/src/lib/brand/solaris10/Makefile to install the new file created in step (1) into the /usr/lib/brand/solaris10 directory in the proto area.
- Update the Solaris 10 SVr4 package that delivers your fix such that it also delivers the new file created in step (1) into the /usr/lib/brand/solaris10 directory. This ensures that the new file is installed whenever your fix is installed.
The brand will disallow booting solaris10-branded zones hosting Solaris 10 images with fixes and backports that deliver unrecognized numbered files to /usr/lib/brand/solaris10. In other words, solaris10-branded zones hosting newer versions of Solaris 10 will not boot if the solaris10 brand does not provide sufficiently up-to-date emulation (i.e., OpenSolaris is not sufficiently up-to-date). For example, the brand will not permit a solaris10-branded zone containing a Solaris 10 image that has /usr/lib/brand/solaris10/5 to boot when the brand only recognizes s10_emulated_features constants that are less than five (i.e., S10_NUM_EMUL_FEATURES is five).
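The boot-time rule can be expressed as a short check over the names found in /usr/lib/brand/solaris10. This is a hedged sketch, not the brand's actual implementation: the function names are hypothetical, the directory listing is passed in as an array rather than read from disk, and MODEL_NUM_EMUL_FEATURES is fixed at five to match the example above.

```c
#include <ctype.h>
#include <stdlib.h>

#define MODEL_NUM_EMUL_FEATURES 5   /* hypothetical: features 0..4 are known */

/* Returns 1 if name consists only of decimal digits, else 0. */
static int
model_is_numbered(const char *name)
{
    if (*name == '\0')
        return (0);
    for (; *name != '\0'; name++) {
        if (!isdigit((unsigned char)*name))
            return (0);
    }
    return (1);
}

/* Returns 1 if the zone may boot, 0 if any numbered file is unrecognized. */
static int
model_zone_may_boot(const char *const names[], int count)
{
    int i;

    for (i = 0; i < count; i++) {
        if (model_is_numbered(names[i]) &&
            strtol(names[i], NULL, 10) >= MODEL_NUM_EMUL_FEATURES)
            return (0);
    }
    return (1);
}
```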
An Example
Suppose that you need to change a device's ioctl in OpenSolaris and backport the change to Solaris 10u9 at the same time. Suppose further that the device is used in native Solaris 10 zones. Then you will have to change OpenSolaris as follows:
- Create a new constant in s10_emulated_features (say, S10_FEATURE_DEV_IOCTL_BACKPORT) using the procedure described above.
- Update the solaris10 brand's emulation library so that it will emulate the ioctl's old Solaris 10 format when S10_FEATURE_IS_PRESENT(S10_FEATURE_DEV_IOCTL_BACKPORT) is false. This ensures that Solaris 10 images that lack your backported changes will continue to run properly in solaris10-branded zones.
Additionally, you should do the following in the Solaris 10 backport:
- Create the file usr/src/lib/brand/solaris10/M in the Solaris 10 source tree, where M is the numeric value of S10_FEATURE_DEV_IOCTL_BACKPORT.
- Modify Solaris 10's usr/src/lib/brand/solaris10/Makefile so that it installs the new file into the /usr/lib/brand/solaris10 directory.
- Ensure that the appropriate Solaris 10 SVr4 packages install the new file.
These changes ensure that solaris10-branded zones will emulate the ioctl's old format while hosting Solaris 10 environments that do not contain the backport.
Dynamic Feature Detection: Sysent Table Patching
Syscall emulation functions can dynamically adjust syscalls' s10_sysent_table entries so that future invocations of the syscalls invoke different emulation functions. This technique, called sysent table patching, eliminates the need to dynamically detect features for every syscall at the cost of coding and maintaining multiple emulation functions.
s10_lwp_private() uses sysent table patching. As described above, s10_lwp_private() determines whether the Solaris 10 libc can function when the %fs x86 segment register is zero. If libc cannot, then s10_lwp_private() modifies the SYS_lwp_create syscall's entry in s10_sysent_table so that s10_lwp_create_correct_fs handles SYS_lwp_create syscalls rather than s10_lwp_create, the default handler. s10_lwp_create_correct_fs ensures that new LWPs start in s10_lwp_create_entry_point, which sets the new LWP's %fs register to the legacy nonzero Solaris 10 selector value before making the LWP jump to its true entry point. s10_lwp_create simply hands the SYS_lwp_create syscall to the kernel. s10_lwp_private() is invoked once after a process execs and only while the process is single-threaded; therefore, its sysent table patch, if applied, affects all of the process' SYS_lwp_create syscalls.
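The patching behavior described above can be modeled with a one-entry table of function pointers. The sketch is self-contained and uses only hypothetical names (model_table, model_first(), and so on); it is not taken from the brand library, and model_needs_emulation stands in for whatever feature test the real code performs.

```c
typedef int (*model_handler_t)(int);

static int model_needs_emulation;   /* stands in for the feature check */
static int model_checks;            /* how many times the check ran */

static int model_first(int arg);

/* Single-entry stand-in for s10_sysent_table. */
static model_handler_t model_table[1] = { model_first };

/* Pass-through handler: no emulation. */
static int
model_noemu(int arg)
{
    return (arg);
}

/* Emulating handler: visibly alters the result. */
static int
model_emulate(int arg)
{
    return (arg + 100);
}

/* Initial handler: runs the check once, patches the table, then delegates. */
static int
model_first(int arg)
{
    model_checks++;
    model_table[0] = model_needs_emulation ? model_emulate : model_noemu;
    return (model_table[0](arg));
}

/* Every modeled syscall goes through the table. */
static int
model_syscall(int arg)
{
    return (model_table[0](arg));
}
```

After the first call the table no longer points at model_first(), so later calls skip the feature check entirely, which is the performance benefit that the technique trades against maintaining several handlers.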
In general, do the following when you want a syscall to use sysent table patching:
- Create two emulation functions: one that does not emulate the syscall (i.e., one that simply passes the syscall to the kernel) and one that does. The next step will refer to the former as <first-emulation-function> and the latter as <second-emulation-function>.
- Create an additional emulation function as follows:
    static int <function-name>(sysret_t *<rv>, <syscall-args>)
    {
        /* Determine whether the syscall should be emulated. */
        if (/* The syscall needs to be emulated */) {
            s10_sysent_table[<SYS-identifier>].st_callc =
                (sysent_cb_t)<second-emulation-function>;
            return (<second-emulation-function>(<rv>, <syscall-args>));
        } else {
            s10_sysent_table[<SYS-identifier>].st_callc =
                (sysent_cb_t)<first-emulation-function>;
            return (<first-emulation-function>(<rv>, <syscall-args>));
        }
    }
where <function-name> is the name of the emulation function, <syscall-args> is the comma-separated list of the syscall's parameters, <SYS-identifier> is the syscall's numeric identifier as defined in usr/src/uts/common/sys/syscall.h, and <first-emulation-function> and <second-emulation-function> are the names of the emulation functions defined in the previous step.
- Configure the syscall's s10_sysent_table entry to use the emulation function defined in the last step. (See the section entitled "Emulating a System Call" above for instructions on modifying s10_sysent_table.)
For example, suppose that SYS_fstat needs to be emulated when a zone's Solaris 10 environment contains the fix represented by a hypothetical s10_emulated_features constant named S10_FEATURE_FSTAT_CHANGE. We would follow the above procedure thus:
- Define a function named s10_noemu_fstat:
    /*
     * This function simply hands SYS_fstat syscalls to the kernel. It doesn't
     * emulate the syscall.
     */
    static int s10_noemu_fstat(sysret_t *rv, int fd, struct stat *sb)
    {
        return (__systemcall(rv, SYS_fstat + 1024, fd, sb));
    }
- Define another function named s10_emulate_fstat:
    /*
     * This function emulates SYS_fstat syscalls when S10_FEATURE_FSTAT_CHANGE is
     * in the zone.
     */
    static int s10_emulate_fstat(sysret_t *rv, int fd, struct stat *sb)
    {
        /* Emulation... */
    }
- Define an emulation function named s10_fstat:
    /*
     * This emulation function is the initial handler for SYS_fstat syscalls.
     * Its sole purpose is to patch the sysent table when the process issues its
     * first SYS_fstat syscall. If the Solaris 10 fix represented by
     * S10_FEATURE_FSTAT_CHANGE is in the zone, then the sysent table will
     * be patched so that s10_emulate_fstat() will handle all future SYS_fstat
     * syscalls; otherwise, the sysent table will be patched so that it uses
     * s10_noemu_fstat() instead.
     */
    static int s10_fstat(sysret_t *rv, int fd, struct stat *sb)
    {
        /*
         * SYS_fstat must be emulated if S10_FEATURE_FSTAT_CHANGE is present
         * in the zone.
         */
        if (S10_FEATURE_IS_PRESENT(S10_FEATURE_FSTAT_CHANGE)) {
            s10_sysent_table[SYS_fstat].st_callc =
                (sysent_cb_t)s10_emulate_fstat;
            return (s10_emulate_fstat(rv, fd, sb));
        } else {
            s10_sysent_table[SYS_fstat].st_callc =
                (sysent_cb_t)s10_noemu_fstat;
            return (s10_noemu_fstat(rv, fd, sb));
        }
    }
- Modify SYS_fstat's entry in s10_sysent_table so that s10_fstat() handles SYS_fstat syscalls:
EMULATE(s10_fstat, 2 | RV_DEFAULT)
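The self-patching pattern can also be tried outside the brand library. The following stand-alone C sketch mimics it with a mock dispatch table. Every name in it (mock_sysent_table, mock_dispatch(), mock_feature_present, and so on) is hypothetical and exists only for illustration; the real s10_sysent_table, sysent_cb_t, and S10_FEATURE_IS_PRESENT() have different definitions, and the "emulation" here merely negates its argument so that the two handlers can be told apart.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type; the brand's sysent_cb_t is defined differently. */
typedef int (*mock_handler_t)(int *rv, int arg);

enum { MOCK_SYS_FSTAT = 0, MOCK_NSYSCALLS = 1 };

int mock_feature_present;	/* stand-in for S10_FEATURE_IS_PRESENT() */
int mock_detect_calls;		/* counts how often feature detection ran */

static int mock_fstat(int *rv, int arg);	/* initial, self-patching handler */

/* Mock syscall table: every entry starts out at its patching handler. */
static mock_handler_t mock_sysent_table[MOCK_NSYSCALLS] = { mock_fstat };

/* "No emulation": pass the argument through unchanged. */
static int
mock_noemu_fstat(int *rv, int arg)
{
	*rv = arg;
	return (0);
}

/* "Emulation": negate the argument so the two paths are distinguishable. */
static int
mock_emulate_fstat(int *rv, int arg)
{
	*rv = -arg;
	return (0);
}

/*
 * Runs feature detection once, patches the table entry so that later
 * calls bypass this function, and then services the current call.
 */
static int
mock_fstat(int *rv, int arg)
{
	mock_handler_t handler;

	mock_detect_calls++;
	handler = mock_feature_present ? mock_emulate_fstat : mock_noemu_fstat;
	mock_sysent_table[MOCK_SYS_FSTAT] = handler;
	return (handler(rv, arg));
}

/* Dispatches a mock syscall through the (possibly patched) table. */
int
mock_dispatch(int nr, int *rv, int arg)
{
	assert(nr >= 0 && nr < MOCK_NSYSCALLS);
	return (mock_sysent_table[nr](rv, arg));
}
```

Calling mock_dispatch(MOCK_SYS_FSTAT, &rv, 7) twice runs the detection logic only once: the first call replaces the table entry, so the second call goes straight to the selected handler. Changing mock_feature_present after the first call has no effect, which mirrors the trade-off described above: detection cost is paid once per process, but the decision is then fixed.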
What is the procedure for backporting an incompatible change to a Solaris 10 update release?
Please see the section entitled "How is versioning handled for different Solaris 10 updates or patches running in the zone when they need different emulation?" for the full backporting and conditional emulation procedure.
How can I test my OpenSolaris changes with the brand?
There are two ways to test your OpenSolaris fixes and projects with solaris10-branded zones:
Run do-it-yourself (DIY) tests
The ability to test using DIY will be available within one or two builds after integration. This section will be updated once DIY has this new capability in place.
Manually set up a zone on a test system and test your changes
Use zonecfg(1M) to configure a new solaris10-branded zone. Be sure to specify SUNWsolaris10 when issuing the create subcommand. For example:
    # zonecfg -z testzone
    testzone: No such zone configured
    Use 'create' to begin configuring a new zone.
    zonecfg:testzone> create -t SUNWsolaris10
Each zone configuration must specify a zonepath; every other property is optional. Note that the zonepath of a solaris10-branded zone must reside on a ZFS filesystem. The following is an example of a minimal configuration for a solaris10-branded zone:
    # zonecfg -z testzone
    testzone: No such zone configured
    Use 'create' to begin configuring a new zone.
    zonecfg:testzone> create -t SUNWsolaris10
    zonecfg:testzone> set zonepath=/export/zones/testzone
    zonecfg:testzone> info
    zonename: testzone
    zonepath: /export/zones/testzone
    brand: solaris10
    autoboot: false
    bootargs:
    pool:
    limitpriv:
    scheduling-class:
    ip-type: shared
    hostid:
    zonecfg:testzone> exit
Once you have configured a solaris10-branded zone, install it via zoneadm(1M) using the install subcommand. You must specify either the -a or the -d option. -a <archive-path> specifies an archive of the installed Solaris 10 system (either a physical system or a native Solaris 10 zone) whose files will be installed into the new zone. The archive can be a flash_archive(4); a cpio(1) archive, which can be compressed with either gzip(1) or bzip2(1); a pax(1) "xustar" archive; or a level zero ufsdump(1M). -d <s10-system-root-path> specifies the full path of the root directory of an installed Solaris 10 system (again, either a physical system or a native Solaris 10 zone) whose files will be installed into the new zone. You must also specify either -p to preserve the virtualized system's configuration or -u to run sys-unconfig(1M) within the zone after it is installed. The following example installs the zone configured above from a flash archive of a physical Solaris 10 update 8 system and indicates that sys-unconfig(1M) should be run within the zone:
# zoneadm -z testzone install -a /net/kodiak.sfbay/gates/s10brand/public/images/s10u8b8ax.flar -u
A ZFS file system has been created for this zone.
Log File: /var/tmp/testzone.install_log.esaZ0g
Installing: This may take several minutes...
Postprocessing: This may take a while...
Postprocess: Updating the image to run within a zone
Result: Installation completed successfully.
Log File: /export/zones/testzone/root/var/log/testzone.install111754.log
A variety of pre-built Solaris 10u8 images are available within Sun for testing. They are located in /net/kodiak.sfbay/gates/s10brand/public/images.
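The option rules for the install subcommand (exactly one of -a or -d, plus exactly one of -p or -u) can be encoded so that a test harness cannot construct an invalid combination. The following C sketch is illustrative only: build_install_cmd() and its enum types are hypothetical names, not part of zoneadm(1M) or the brand, and the helper performs no shell quoting of the zone name or path.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types: each enum models one "pick exactly one" rule. */
enum install_source { SRC_ARCHIVE, SRC_DIRECTORY };	/* -a or -d */
enum sysid_mode { SYSID_PRESERVE, SYSID_UNCONFIG };	/* -p or -u */

/*
 * Formats a zoneadm install command line into buf.
 * Returns 0 on success, or -1 if an argument is missing or buf is too small.
 */
int
build_install_cmd(char *buf, size_t len, const char *zone,
    enum install_source source, const char *path, enum sysid_mode mode)
{
	int n;

	assert(buf != NULL && len > 0);
	if (zone == NULL || path == NULL ||
	    strlen(zone) == 0 || strlen(path) == 0)
		return (-1);

	n = snprintf(buf, len, "zoneadm -z %s install %s %s %s", zone,
	    source == SRC_ARCHIVE ? "-a" : "-d", path,
	    mode == SYSID_PRESERVE ? "-p" : "-u");

	/* snprintf() reports the untruncated length; treat truncation as error. */
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

For instance, build_install_cmd(cmd, sizeof (cmd), "testzone", SRC_ARCHIVE, "/export/home/test.flar", SYSID_UNCONFIG) fills cmd with "zoneadm -z testzone install -a /export/home/test.flar -u".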
Boot the zone via the zoneadm(1M) boot subcommand:
# zoneadm -z testzone boot
Once the zone boots, you can log into the zone via the zlogin(1) command. If you specified -u when you installed the zone, then you should grab the zone's console and step through the sys-unconfig(1M) screens via zlogin -C. For example:
    # zlogin -C testzone
    [Connected to zone 'testzone' console]
    What type of terminal are you using?
     1) ANSI Standard CRT
     2) DEC VT52
     3) DEC VT100
     4) Heathkit 19
     5) Lear Siegler ADM31
     6) PC Console
     7) Sun Command Tool
     8) Sun Workstation
     9) Televideo 910
     10) Televideo 925
     11) Wyse Model 50
     12) X Terminal Emulator (xterms)
     13) CDE Terminal Emulator (dtterm)
     14) Other
    Type the number of your choice and press Return:
The zone's filesystem is accessible from the global zone via <zonepath>/root, where <zonepath> is the zone's path as specified in its configuration.
Once you boot a solaris10-branded zone, you can test your changes in two ways. These options are not mutually exclusive; in fact, doing both is highly recommended.
Manually test your changes
If your changes only affect a few binaries (e.g., your changes alter private interfaces between special libraries and the kernel), then it is usually sufficient to test the affected binaries in your zone. For example, if your changes alter a ZFS ioctl, then you can execute relevant ZFS commands in your zone in order to ensure that the ioctl functions properly. If your changes affect a syscall, then you can create and execute tests that invoke library functions that issue the syscall. If you are using native command replacement, then you can log into your zone and test the functionality of the replacement binary's CLI and the functionality of commands that fork the binary if it is a command or test the functionality of commands that communicate with the replacement if it is a daemon. You should also consider running multithreaded tests that stress the interfaces affected by your changes if the interfaces are thread-safe in Solaris 10.
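As one possible shape for such a multithreaded test, the following C sketch has several threads call fstat(2) concurrently on a shared file descriptor and checks that every call reports the same device and inode as a single-threaded baseline. It is a minimal illustration under stated assumptions, not part of any existing test suite: fstat_stress(), stress_worker(), and STRESS_MAX_THREADS are hypothetical names, and fstat(2) is used only because it is the syscall in the earlier example.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/stat.h>
#include <unistd.h>

#define	STRESS_MAX_THREADS	64

/* Per-thread state; each worker touches only its own structure. */
struct stress_arg {
	int fd;			/* shared descriptor under test */
	int iterations;		/* number of fstat() calls to issue */
	struct stat expect;	/* single-threaded baseline result */
	int failures;		/* calls that failed or disagreed */
};

static void *
stress_worker(void *p)
{
	struct stress_arg *a = p;
	struct stat sb;
	int i;

	for (i = 0; i < a->iterations; i++) {
		if (fstat(a->fd, &sb) != 0 ||
		    sb.st_dev != a->expect.st_dev ||
		    sb.st_ino != a->expect.st_ino)
			a->failures++;
	}
	return (NULL);
}

/*
 * Runs nthreads workers against path.
 * Returns the total failure count, or -1 on a setup error.
 */
int
fstat_stress(const char *path, int nthreads, int iterations)
{
	pthread_t tids[STRESS_MAX_THREADS];
	struct stress_arg args[STRESS_MAX_THREADS];
	struct stat base;
	int fd, i, started = 0, failures = 0;

	assert(nthreads > 0 && nthreads <= STRESS_MAX_THREADS);
	if ((fd = open(path, O_RDONLY)) < 0)
		return (-1);
	if (fstat(fd, &base) != 0) {
		(void) close(fd);
		return (-1);
	}
	for (i = 0; i < nthreads; i++) {
		args[i].fd = fd;
		args[i].iterations = iterations;
		args[i].expect = base;
		args[i].failures = 0;
		if (pthread_create(&tids[i], NULL, stress_worker,
		    &args[i]) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++) {
		(void) pthread_join(tids[i], NULL);
		failures += args[i].failures;
	}
	(void) close(fd);
	return (started == nthreads ? failures : -1);
}
```

A call such as fstat_stress("/", 4, 1000) is expected to return 0 when the syscall behaves consistently. Running the same program in a native Solaris 10 environment and in the solaris10-branded zone gives a direct comparison of the two.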
Remember, syscalls, ioctls, and other facets of the user-kernel interface as seen from within solaris10-branded zones must function as they do in Solaris 10. The same is true of commands and daemons within solaris10-branded zones.
Run all of the relevant PIT tests or just the MSTC portion
Consider running one or more PIT tests, such as MSTC, in your zone in addition to manually testing your changes if your changes are extensive or affect frequently-used syscalls or ioctls (e.g., SYS_lwp_create or a tty ioctl).
NOTE: The following procedure uses Sun-internal links to the test pages. Links to procedures for external testing will be provided soon.
Determine which PIT tests run within solaris10-branded zones. Many but not all of the existing test suites work in solaris10-branded zones. Any suite that runs in native Solaris 10 zones should run identically in solaris10-branded zones. You can do one of the following to determine whether a test will work:
- Go to the Solaris 10 PIT chessboard, which is accessible from the main PIT page, and navigate to the "Virtualization/Zones/Native" icon. Clicking it should give you a list of tests that have been run inside of native Solaris 10 zones.
- From the main PIT page, select "Search PIT results database". Click on the "All Test Suites" pull-down menu, select the test suite that you want to run, and click "Go". You should get a list of all the test runs associated with the selected test suite. If you see any results associated with a native Solaris 10 zone, then the test can be run in solaris10-branded zones. Any host that ends in "-z*" (e.g., foo-z1) is a zone.
Afterwards, run the test suite. All PIT test suites can be run via the STEP tool. Click here for instructions on how to run STEP. You should run STEP from the global zone. STEP will query you for the hostname of your test system, at which point you should enter the name of the solaris10-branded zone that you created for the test. Specify "zlogin_console" when STEP queries for the connection type.
How can I test my Solaris 10 changes with the brand?
You will need to follow the procedures for manually testing the zone. However, you will not be able to use one of the pre-built Solaris 10u8 images. Instead, as part of your testing of the Solaris 10 change, you should create a flash archive of a physical system onto which you installed your change. For example:
# flar create -n testflar /export/home/test.flar
Instead of test.flar, you can name the flar as you choose. If your physical Solaris 10 test system has a ZFS root, then you must create the flar with an explicit cpio or pax archive using the -L option:
# flar create -n testflar -L cpio /export/home/test.flar
You can now use this flar to install the zone and complete the testing. When installing the zone using your new flar, you must also use the -F option, which bypasses the built-in version checking. This is needed because only Solaris 10u8 and later are supported, and a BFU image does not look like a supported version of Solaris 10.
# zoneadm -z testzone install -a /home/foo/test.flar -F -u
If your code changes are simple, an alternative to creating your own flar is to use one of the pre-existing archives to install the zone and then replace the updated user-level binaries with the new binaries that you've built. If you use this technique, you must be sure that the base archive you're using is fully compatible with the changed files you're replacing.