=== 1. Introduction

This project is delivering two sets of functionality:

* The BrandZ infrastructure, which enables the creation of zones that provide alternate operating environment personalities, or //brands//, on a Solaris(tm) 10 system.
* //lx//, a brand that supports the execution of Linux applications

This document describes the design and implementation of both BrandZ and the //lx// brand.

=== 2. BrandZ Infrastructure

==== 2.1. Zones Integration

It is a goal of this project that all brand management should be performed as simple extensions to the current zones infrastructure:

* The zones configuration utility will be extended to configure zones of particular brands.
* The zone administration tools will be extended to display the brand of any created zone, as well as to list the brand types available.

BrandZ adds functionality to the zones infrastructure by allowing zones to be assigned a brand type. This type is used to determine which scripts are executed when a zone is installed and booted. In addition, a zone’s brand is used to properly identify the correct application type at application launch time. All zones have an associated brand for configuration purposes; the default is the ’native’ brand.

===== 2.1.1 Configuration

The {{code}}zonecfg(1M){{/code}} utility (as well as {{code}}libzonecfg{{/code}}) is modified to be brand-aware. There is a new option to the ’create’ command that allows a zone to be created with a specific brand:

{{{

zonecfg:myzone> create -B lx

}}}

Once a zone has been assigned a brand, that brand cannot be changed or removed.

The rest of the user interface for the zone configuration process remains the same. To support this, each brand delivers a configuration file at {{code}}/usr/lib/brand/<name>/config.xml{{/code}}. The file describes how to install, configure, and boot the zone.
It also identifies the name of the kernel module (if any) that provides kernel-level functionality for the brand. A sample of the file follows:

{{{

<!DOCTYPE brand PUBLIC "-//Sun Microsystems Inc//DTD Brands//EN"
    "file:///usr/share/lib/xml/dtd/brand.dtd.1">

<brand name="lx">
    <install>/usr/lib/brand/lx/lx_install %z %R %*</install>
    <boot>/usr/lib/brand/lx/lx_boot %R %z</boot>
    <halt>/usr/lib/brand/lx/lx_halt %z</halt>
    <postclone>/usr/bin/true</postclone>
    <modname>lx_brand</modname>
    <initname>/sbin/init</initname>
    <platform name="platform.xml" />
</brand>

}}}

===== 2.1.2 Installation

The current zone install mechanism is hardwired to execute {{code}}lucreatezone{{/code}} to install a zone. To support arbitrary installation methods, the {{code}}config.xml{{/code}} file contains a line of the form:

{{{

<install>/usr/lib/lu/lucreatezone -z %z</install>

}}}

For the //lx// brand, this line will refer to {{code}}/usr/lib/brand/lx/lx_install{{/code}} instead of {{code}}lucreatezone{{/code}}. All of the command tags support the following substitutions: ’%z’ (for zonename) and ’%R’ (for zone root path). In addition, the zoneadm(1M) command will pass additional arguments to the program as specified on the command line. This will allow runtime installation arguments, such as the location of source media:

{{{

# zoneadm -z linux install -d /net/installsrv/redhat30

}}}

===== 2.1.3 Virtual Platform

The virtual platform consists of internal mounted filesystems, as well as the device tree and any network devices. The virtual platform is controlled by an XML file that is separate from the main brand config file, {{code}}config.xml{{/code}}. The properties it controls are outlined below.

====== 2.1.3.1 Mounted Filesystems

Before handing off control to the virtual platform, {{code}}zoneadmd{{/code}} does some work of its own to set up well-known mounts and do some basic verification. It performs Solaris-specific mounts ({{code}}/proc{{/code}}, {{code}}/system/contract{{/code}}, etc.), which need to be defined for each brand. We also need to mount native versions of some files in order to support the interpretation layer (for native access to {{code}}/proc{{/code}}, for example).

The set of filesystems mounted by {{code}}zoneadmd{{/code}} is specified in the virtual platform configuration file:

{{{

<filesystem spec="/proc" dir="/native/proc" fstype="proc" />
<filesystem spec="/usr" dir="/native/usr" fstype="lofs" />

}}}

====== 2.1.3.2 Device Configuration

To create device nodes within a zone, {{code}}zoneadmd{{/code}} calls {{code}}devfsadm{{/code}}. The devfsadm process walks the current list of devices in {{code}}/dev{{/code}}, and calls into {{code}}libzonecfg{{/code}} to determine if the device should be exported.

libzonecfg will consult a brand-specific platform configuration file that describes how to apply the following semantics:

* **Policy restrictions**: Control which devices appear in the non-global zone. This is the only aspect currently implemented, and it is hardcoded in {{code}}libzonecfg{{/code}}.
* **Device inheritance**: Describes a device which matches exactly a device in the global zone. Currently this is assumed behavior, but it needs to be configurable.
* **Device renaming**: Allows devices to appear in different locations within the zone. This is needed to support devices that are emulated via layered drivers.

The platform configuration file will have elements to perform all the above tasks.
A sample scheme could be:

{{{

<!~-- Devices to create under /dev ~-->

<device match="pts/*" />
<device match="ptmx" />
<device match="random" />
<device match="urandom" />

<device match="zero" />
<device match="null" />

<device match="tty" />
<device match="tcp" />
<device match="tcp6" />
<device match="udp" />

<device match="udp6" />

<!~-- Symlinks to create under /dev ~-->

<symlink source="stderr" target="./fd/2" />
<symlink source="stdin" target="./fd/0" />
<symlink source="stdout" target="./fd/1" />

<symlink source="systty" target="zconsole" />
<symlink source="log" target="/var/run/syslog" />

<!~-- Renamed devices to create under /dev ~-->

<device match="brand/lx/ptmx_linux" name="ptmx" />

<!~--
    /dev/console can’t be a symlink because this breaks
    login security checks via /dev/securetty
~-->

<device match="zconsole" name="console" />

}}}

===== 2.1.4 Booting a Branded Zone

In addition to the install-time command attribute, the configuration file can also provide an optional ’boot’ script. This script will be run after the zone has been brought to the ’Ready’ state - immediately before the zone is booted. This allows a brand to perform any needed setup tasks after the virtual platform has been established, but before the first branded process has been launched, giving the brand complete flexibility in how the zone is brought online.

===== 2.1.5 Spawning init

The {{code}}init(1M){{/code}} binary is spawned by the kernel, allowing the kernel to control its restart. The name and path of a zone’s init(1M) binary is a per-zone attribute, which is stored in the brand configuration file and passed to the kernel by zoneadm immediately prior to booting the zone.

Linux does not allow kill(2) to send any signals to init(1M) for which a signal handler has not been registered (including SIGSTOP and SIGKILL, as by definition handlers **cannot** be registered for those signals.) We solve this by having lx_kill() make the same check and only allow a signal to be sent to the Linux init process if a handler for the signal has already been registered. The Linux man pages are silent on whether the Linux //kernel// can send init an unhandled signal, so we do not perform any particular checks to prevent the Solaris kernel from doing so.

===== 2.1.6 Shutdown

The standard {{code}}sysvinit{{/code}} Linux init package operates in much the same way as our {{code}}init{{/code}} did prior to the introduction of {{code}}smf(5){{/code}}. The ’init N’ form of the command writes to the init FIFO, which happens to be {{code}}/dev/initctl{{/code}}. The ’lx’ brand’s boot-time script creates this link from the global zone before the zone boots up.

The {{code}}init{{/code}} command then does a {{code}}kill(1, SIGHUP){{/code}} to notify the running init process that there has been a change of runlevel. We interpose on the {{code}}kill(2){{/code}} syscall and translate PID 1 into whatever the zone-local PID is.

To actually power off or reboot the system, {{code}}init{{/code}} calls the {{code}}reboot(){{/code}} function with ’magic numbers’ defined in {{code}}sys/reboot.h{{/code}}. This function translates directly to the {{code}}reboot{{/code}} system call. This call, and the associated magic numbers, will be translated into an appropriate {{code}}uadmin(){{/code}} call.

The Linux {{code}}simpleinit{{/code}} program works in a nearly identical fashion. The main differences are in how it processes {{code}}/etc/rc{{/code}} scripts.

==== 2.2. Kernel Integration

===== 2.2.1 Brand Framework

Brands are loaded into the system in the form of brand modules.
A brand module is defined through a new linkage type:

{{{

struct modlbrand {
    struct mod_ops *brand_modops;
    char *brand_linkinfo;
    struct brand *brand_branddef;
};

}}}

Each brand must declare {{code}}struct brand{{/code}} as part of the registration process.

{{{

typedef struct brand {
    int b_version;
    char *b_name;
    struct brand_ops *b_ops;
    struct brand_mach_ops *b_machops;
} brand_t;

}}}

This structure declares the name of the brand, the version of the brand interface against which it was compiled, and ops vectors containing both common and platform-dependent functionality. The {{code}}b_version{{/code}} field is currently used to determine whether a brand is eligible to be loaded into the system. If the version number does not match the one compiled into the kernel, then we simply refuse to load the brand module. In theory, this version number could also be used to interpret the contents of the two ops vectors, allowing us to continue supporting older brand modules.

It is important to note that even though we have defined a linkage type for brands and have implemented a versioning mechanism, we are not defining a formal ABI for brands. The relationship between brands and the kernel is so intimate that we cannot hope to properly support the development of brands outside the ON consolidation. This does not mean that we will do anything to prevent the development of brands outside of ON, but we must minimize the possibility of an out-of-date brand accidentally damaging something within the kernel.

===== 2.2.2 System Call Interposition

The system call invocation mechanism is an implementation detail of both the operating system and the hardware architecture. On SPARC machines, a system call is initiated via a trap (Solaris chooses to use trap numbers 8 and 9.)
On x86 machines, there are a variety of different methods available to enter the kernel: {{code}}sysenter{{/code}} and {{code}}syscall{{/code}} instructions, {{code}}lcalls{{/code}}, and software-triggered interrupts. Solaris has used all of these mechanisms at various points, and maintaining binary compatibility requires that we continue to support them all.

Supporting a different version of Solaris requires interposing on each of these mechanisms. Before we begin executing the system call support code for the native version of Solaris, we must ensure that the process does not belong to a brand that has a different implementation of those calls. To do that, we must check the proc structure of the initiating process to determine whether it belongs to a foreign brand, and if so, whether that brand has chosen to interpose on that entry point. This check is carried out by a {{code}}BRAND_CALLBACK(){{/code}} macro placed at the top of the handling routine for each of the existing entry points. If the brand wishes to interpose on the entry point, control is passed to that brand’s kernel module. If not, control returns to the handler and the standard Solaris system call code is executed.

Linux has a single means of entering the kernel: executing an ’int 0x80’ instruction. Since this mechanism is not used by any version of Solaris, there is no existing mechanism on which we can interpose. Therefore the creation of a Linux brand requires that we provide a new entry point into the kernel specifically for this brand. As with the existing entry points, the {{code}}BRAND_CALLBACK(){{/code}} macro is executed by this handler. In this case, there is no standard system call code to execute if the handler returns. Instead the program is killed with a General Protection Fault.
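
The decision made at the top of each entry point can be sketched in plain C. This is only an illustration of the logic described above, not the real {{code}}BRAND_CALLBACK(){{/code}} macro (which is low-level kernel code): every type, enum, and function name below is hypothetical, and the sketch folds the "is this process branded" test into the same routine for simplicity.

```c
#include <stddef.h>

/* Hypothetical identifiers for a few kernel entry points. */
enum entry_point { EP_SYSENTER, EP_SYSCALL, EP_INT80, EP_COUNT };

/* Possible outcomes of the check made when an entry point is hit. */
enum dispatch { DISPATCH_BRAND, DISPATCH_NATIVE, DISPATCH_FAULT };

typedef void (*brand_handler_t)(void);

/* Simplified stand-in for a brand: one optional handler per entry point. */
struct sketch_brand {
    const char *name;
    brand_handler_t handlers[EP_COUNT];
};

/* Simplified stand-in for the proc structure; a NULL brand means native. */
struct sketch_proc {
    const struct sketch_brand *brand;
};

/* Placeholder for the brand's Linux system call handler. */
static void
sketch_lx_int80(void)
{
}

/* A brand that interposes only on the Linux software-interrupt entry. */
static const struct sketch_brand sketch_lx = {
    .name = "lx",
    .handlers = { [EP_INT80] = sketch_lx_int80 },
};

/*
 * Decide where control goes: to the brand if it interposes on this
 * entry point, otherwise to the native system call path. The Linux
 * entry point has no native path, so it ends in a fault instead.
 */
enum dispatch
sketch_brand_callback(const struct sketch_proc *p, enum entry_point ep)
{
    if (p->brand != NULL && p->brand->handlers[ep] != NULL)
        return (DISPATCH_BRAND);
    if (ep == EP_INT80)
        return (DISPATCH_FAULT);
    return (DISPATCH_NATIVE);
}
```

With this model, a process in the hypothetical {{code}}sketch_lx{{/code}} brand is redirected only on the Linux entry point, while a native process that hits the same entry point ends up on the fault path.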

While introducing a new entry point is a trivial task within the Solaris engineering organization, it again demonstrates that the ability for third parties to distribute radically new brands will be limited. A third party may certainly distribute a modified version of OpenSolaris that includes the changes they need, but any brands that depend on those changes would not work properly with our Solaris until/unless those changes were adopted by Sun.

For performance reasons, the {{code}}BRAND_CALLBACK(){{/code}} macro is only invoked for branded processes. Each system call mechanism actually has two different entry points defined: one for branded processes and one for non-branded processes. A context handler is used to modify the appropriate CPU registers or structures to switch to the appropriate set of entry points during a context switch operation. The callback macro is only a dozen or so instructions long, but this switching mechanism ensures that non-branded processes are not subject to even that minimal overhead.

This interposition mechanism is executed before any other code in the system call path. Specifically, it is executed before the standard system call prologue. In a sense, this means that we are executing in kernel mode, but don’t consider ourselves to have entered the kernel. This may sound esoteric, but it has concrete implications for the brand-specific code. The brand code cannot take any locks, do any I/O, or do anything else that might cause the thread to block. This means that the brand code can only do the simplest computations or transformations before returning to userspace or to the standard system call flow.

As described below, the ’lx’ brand returns immediately to userspace, where the bulk of the emulation takes place. To implement a brand in the kernel, the interposition routine could transform the incoming system call into a brand() system call and return to the normal system call path.
After executing the standard system call prologue, control would be vectored to the brand’s xxx_brandsys() routine, where the emulation could be carried out.

===== 2.2.3 Other Interposition Points

Other interposition points will be placed in the code paths for process creation/exit, thread creation/teardown, signal delivery, and so on.

The interposition points are identified by means of a pair of ops vectors (one generic and one platform-specific), similar to those used by VFS for filesystems and the VM subsystem for segments. The generic ops vector used by BrandZ is shown below:

{{{

struct brand_ops {
    int (*b_brandsys)(int, int64_t *, uintptr_t, uintptr_t,
        uintptr_t, uintptr_t, uintptr_t, uintptr_t);
    void (*b_setbrand)(struct proc *);
    void (*b_copy_procdata)(struct proc *, struct proc *);
    void (*b_proc_exit)(struct proc *, klwp_t *);
    void (*b_exec)();
    void (*b_lwp_setrval)(klwp_t *, int, int);
    void (*b_shmexit)(struct proc *);
    int (*b_initlwp)(klwp_t *);
    void (*b_forklwp)(klwp_t *, klwp_t *);
    void (*b_freelwp)(klwp_t *);
    void (*b_lwpexit)(klwp_t *);
    int (*b_elfexec)(struct vnode *vp, struct execa *uap,
        struct uarg *args, struct intpdata *idata, int level,
        long *execsz, int setid, caddr_t exec_file,
        struct cred *cred);
};

}}}

A brief description of each entry follows:

* {{code}}b_brandsys{{/code}}: Routine that implements a per-brand ’brandsys’ system call that can be used for any brand-specific functionality.
* {{code}}b_setbrand{{/code}}: Used to associate a new brand type with a process. This is called when the zone’s init process is hand-crafted, or when a process uses zone_enter() to enter a branded zone.
* {{code}}b_copy_procdata{{/code}}: Copies per-process brand data at fork time.
* {{code}}b_proc_exit{{/code}}: Called at process exit time to free any brand-specific process state.
* {{code}}b_exec{{/code}}: Called at process exec time to initialize any brand-specific process state.
* {{code}}b_lwp_setrval{{/code}}: Sets the syscall() return value for the newly created lwp before the creating fork() system call returns.
* {{code}}b_shmexit{{/code}}: Called when a shared memory segment is detached from an exiting branded process.
* {{code}}b_initlwp{{/code}}: Called by {{code}}lwp_create(){{/code}} to initialize any brand-specific per-lwp data.
* {{code}}b_forklwp{{/code}}: Called by {{code}}forklwp(){{/code}} to copy any brand-specific per-lwp data from the parent to child lwps.
* {{code}}b_freelwp{{/code}}: Called by {{code}}lwp_create(){{/code}} on error to do any brand-specific cleanup.
* {{code}}b_lwpexit{{/code}}: Called by {{code}}lwp_exit(){{/code}} to do any brand-specific cleanup.
* {{code}}b_elfexec{{/code}}: Called to load a branded executable.

The x86-specific ops vector is:

{{{

struct brand_mach_ops {
    void (*b_sysenter)(void);
    void (*b_int80)(void);
    void (*b_int91)(void);
    void (*b_syscall)(void);
    void (*b_syscall32)(void);
    greg_t (*b_fixsegreg)(greg_t, model_t);
};

}}}

The first 5 entries of this vector allow a brand to override the standard system call paths with its own interpretations. The final entry protects Solaris from brands that make different use of the segment registers in userspace, and vice-versa.

The SPARC-specific ops vector is:

{{{

struct brand_mach_ops {
    void (*b_syscall)(void);
    void (*b_fasttrap)(void);
};

}}}

Of these routines, only the {{code}}int80{{/code}} entry of the x86 vector is needed for the initial //lx// brand. The other entries are included for completeness and are only used by a trivial ’Solaris 10’ brand used for basic testing on SPARC platforms.
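
As a rough illustration of how the version check from section 2.2.1 and an ops vector fit together, the following user-space sketch registers a brand only if its interface version matches. It is not the kernel's registration code: the {{code}}sketch_{{/code}} names, the trimmed-down ops vector, the version constant, and the fixed-size table are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical interface version "compiled into the kernel". */
#define SKETCH_BRAND_VERSION 1
#define SKETCH_MAX_BRANDS 4

/* Trimmed-down ops vector with two of the per-lwp hooks. */
struct sketch_brand_ops {
    int (*b_initlwp)(void *lwp);
    void (*b_lwpexit)(void *lwp);
};

/* Simplified counterpart of brand_t. */
typedef struct sketch_brand {
    int b_version;
    const char *b_name;
    const struct sketch_brand_ops *b_ops;
} sketch_brand_t;

static const sketch_brand_t *sketch_brands[SKETCH_MAX_BRANDS];
static int sketch_nbrands;

/*
 * Register a brand. Returns 0 on success and -1 if the brand is
 * malformed, the table is full, or the brand was built against a
 * different interface version (in which case it is simply refused).
 */
int
sketch_brand_register(const sketch_brand_t *b)
{
    if (b == NULL || b->b_name == NULL || b->b_ops == NULL)
        return (-1);
    if (b->b_version != SKETCH_BRAND_VERSION)
        return (-1);
    if (sketch_nbrands == SKETCH_MAX_BRANDS)
        return (-1);
    sketch_brands[sketch_nbrands++] = b;
    return (0);
}

/* Look up a registered brand by name; returns NULL if it is unknown. */
const sketch_brand_t *
sketch_brand_lookup(const char *name)
{
    int i;

    for (i = 0; i < sketch_nbrands; i++) {
        if (strcmp(sketch_brands[i]->b_name, name) == 0)
            return (sketch_brands[i]);
    }
    return (NULL);
}
```

A brand built against a stale interface version never enters the table, so no later code path can call through its ops vector.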

These ops vectors are sufficient for the initial Linux brand. Adding support for a new Linux distribution based on the 2.4 kernel can probably also be done without modifying this interface. However, it is likely that adding any new brand that is substantially different from the initial Linux brand will require additional interposition points. For example, adding support for a 2.6-based Linux distribution could require modifications. Supporting a whole new operating system such as FreeBSD or Apple’s OS X would almost certainly require modifying this interface.

=== 3. The //lx// Brand

==== 3.1 Brands and Distributions

The //lx// Brand is intended to emulate the kernel interfaces expected by the Red Hat Enterprise Linux 3 userspace. The freely available CentOS 3 distribution is built from the same SRPMs as RHEL, so it is expected to work as well.

The interface between the kernel and userspace is largely encapsulated by the version of //glibc// that ships with the distribution. As such, the interface we emulate will likely support other distributions that make use of glibc 2.3.2. Debian 3.1 also uses this version of glibc, so adding support for that distro to //lx// should be straightforward.

The further removed one gets from that version of //glibc//, the less likely it is that the current //lx// brand will be able to support that distro. For example, RHEL 4 is based on glibc 2.4.7, which represents a significant change to glibc. While adding support for RHEL 4 would likely require additional development work, we expect that the work could be done within the scope of the //lx// brand. In the unlikely event that the //lx// brand could not be extended to support RHEL 4, we would introduce a new //lx4// brand.

Finally, it should be noted that supporting a new distribution will always require a new install script.
For RPM-based distributions, it might be sufficient to update the existing scripts with new package lists. Adding support for distributions such as Debian, which do not use the RPM packaging format, will require entirely new install scripts to be created. It is relatively simple to have a variety of different install scripts within a single brand, so simply changing the packaging format does not require the creation of a new brand.

==== 3.2 Installation of a Linux zone

Most Linux distributions’ native installation procedures start by probing and configuring the hardware on the system and partitioning the hard drives. The user then selects which packages to install and sets configuration variables such as the hostname and IP address. Finally the installer lays out the filesystems, installs the packages, modifies some configuration files, and reboots.

When installing a Linux zone, we can obviously skip the device probing and disk partitioning. For the remainder of the installation procedure, there are several different approaches we could take.

* One approach is to execute a distribution’s installation tool directly. Most of them are based on shell scripts, Python, or some other interpreted language, so we could theoretically run the tools before having a Linux environment running. This approach relies on the existing install tools being fairly robust and would be hard to sustain between releases if the tools change.
* Another option is to develop our own package installation tool, which extracts the desired software from a standard set of distribution media and copies it into the zone.
* A third option would be to adopt a flasharchive-like approach, in which we would simply unpack a prebuilt tarball or cpio image into the target filesystem. This image could be built by us, or by a customer that already had an installed Linux system.
If we were to build this image ourselves, this approach would give us complete control over the final image and allow us to manually handle any particularly ugly early-stage installation issues. This would also make it trivial for a customer to "just try it out," an approach that has been successful for VMware, Usermode Linux, and QEMU.

It is our intention to support the last two options.

We will provide an installation script that will extract a known set of RPMs from a known set of Red Hat or CentOS installation media. We will also allow a zone to be installed from a user-specified tarball. We will document how a user can construct an installable tarball that can be used for flasharchive-like installations.

==== 3.3. Execution of Linux Binaries

When executing Linux binaries, we follow the same architectural approach taken by the SunOS(tm) Binary Compatibility Project (SBCP), which provides support for pre-Solaris 2.0 binaries.

===== 3.3.1. Loading Linux Binaries

When a non-native ELF binary is execed inside a branded zone, the brand’s exec handler is given control. The {{code}}lx{{/code}} brand-specific exec handler execs the brand’s Solaris support library (akin to the ’sbcp’ command) and maps the non-native ELF executable and its interpreter into the address space.

The handler also places all the extra information needed to exec the Linux binary on the stack in the form of aux vector entries.
Specifically, the handler passes the following values to the support library via the aux vector:

{{{

AT_SUN_BRAND_BASE:     Base address of non-native interpreter
AT_SUN_BRAND_LDDATA:   Address of non-native interpreter’s debug data
AT_SUN_BRAND_LDENTRY:  Non-native interpreter’s entry point
AT_SUN_BRAND_PHDR:     Non-native executable ELF information needed by
AT_SUN_BRAND_PHENT:    non-native interpreter
AT_SUN_BRAND_PHNUM:
AT_SUN_BRAND_ENTRY:    Entry point of non-native executable

}}}

===== 3.3.2. Running Linux Binaries

Rather than executing the Linux application directly, the exec handler starts the program execution in the brand support library. The library then runs whatever initialization code it needs before starting the Linux application.

For its initialization, the {{code}}lx{{/code}} brand library uses the {{code}}brandsys(2){{/code}} system call to pass the following data structure to the kernel:

{{{

typedef struct lx_brand_registration {
    uint_t lxbr_version;    /* version number */
    void *lxbr_handler;     /* base address of handler */
    void *lxbr_traceflag;   /* address of trace flag */
} lx_brand_registration_t;

}}}

This structure contains a version number, ensuring that the brand library and the brand kernel module are compatible with one another. It also contains two addresses in the application’s address space: the starting address of the system call emulation code and the address of a flag indicating whether the process is being traced. The use of these two addresses is discussed in sections 3.4 and 3.9 respectively.

Once the support library has finished initialization, it fixes up the aux vector for the Linux interpreter to run, and jumps to the interpreter’s entry point.
The brand’s exec handler replaces the standard aux vector entries with the corresponding values above, clears the above vectors (by setting their type to AT_IGNORE), resets the stack to its pre-Solaris-linker state, and then jumps to the non-native interpreter, which then runs the executable as it would on its own native system.

The advantages of this design are:

* No modifications to the Solaris runtime linker are necessary.
* Virtually all knowledge of branding is isolated to the kernel brand module and userland brand support library.
* It keeps us aligned with the de-facto standard for non-native emulation established with SunOS 4 BCP.

==== 3.4. System Call Emulation

The most common method for applications to interact with the operating system is through system calls. Therefore, the bulk of the work required to support the execution of foreign binaries is to emulate the system call interface to which those binaries have been written.

Linux applications do not use the {{code}}syscall/sysenter{{/code}} instructions. Instead, they use interrupt 0x80 to initiate a system call. Because Solaris has no current handler for that interrupt, one had to be added as part of this project. As noted above, the handler in the core Solaris code does nothing but pass control to the int80_handler routine in the brand module. It is then up to that brand to interpret and execute the system call.

As with executable launching, the approach we have chosen to take is the one originally implemented by the SBCP. In this model, the trap handler in the kernel is nothing more than a trampoline, redirecting the execution of the system call to a library in userspace. The library performs any necessary data mangling and then calls the proper native system call. Ideally, this method would require almost no code in the kernel and would have no impact whatsoever on the behavior of a system with no SunOS binaries running.

In practice, the user-space approach turns out to be less clean and self-contained for Linux than for the original SunOS project. In that case, the binary model being emulated was simpler than Solaris, less fully featured, and still closely related. In the Linux case, there are system calls that must be supported that do not have obvious equivalents in Solaris (e.g., futex()) and there are differences in fundamental OS abstractions (Linux ’threads’ are almost full blown processes, each with its own PID).

The steps involved in emulating a fairly straightforward Linux system call are as follows:

1. The Linux application marshals parameters into registers and issues an {{code}}int80{{/code}}.
1. The Solaris int80 handler checks to see if the process is branded. An unbranded process will continue along the standard code path. For {{code}}int80{{/code}}, there is no standard behavior so the process would die with a General Protection fault. Thus, a Solaris application cannot successfully execute any Linux system calls.
1. Solaris passes control to the brand-specific routine indicated in the brandops structure.
1. The //lx// brand module immediately trampolines into the user-space emulation library.
1. The emulation library does any necessary argument manipulation and calls the appropriate Solaris system call(s).
1. Solaris carries out the system call and returns to the brand library.
1. The brand library performs any necessary manipulation of the return values and error code.
1. The brand library returns directly to the Linux application; it does not return through the kernel.

The diagram below illustrates these steps:

[[image:syscallprocess.gif||alt="syscallprocess.gif"]]

Each Linux system call can be classified as one of three types: pass-through, simple emulation, and complex emulation.
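
The user-space half of the steps above (argument handling, the native call, and return-value conversion) can be sketched as a small dispatch table. This is a minimal illustration rather than the brand library's actual code: the {{code}}sketch_{{/code}} functions and {{code}}LX_SYS_{{/code}} macros are hypothetical names, only three calls are wired up, and the translation between Solaris and Linux error numbers is left out. The system call numbers are the Linux i386 values.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Linux i386 system call numbers for the calls handled below. */
#define LX_SYS_WRITE   4
#define LX_SYS_CLOSE   6
#define LX_SYS_GETPID  20
#define LX_NSYSCALLS   21   /* size of this sketch's table only */

typedef long (*sketch_handler_t)(uintptr_t, uintptr_t, uintptr_t);

/* Each handler calls the native routine and returns -errno on failure. */
static long
sketch_write(uintptr_t fd, uintptr_t buf, uintptr_t len)
{
    ssize_t r = write((int)fd, (const void *)buf, (size_t)len);

    return (r < 0 ? -errno : (long)r);
}

static long
sketch_close(uintptr_t fd, uintptr_t a2, uintptr_t a3)
{
    (void)a2;
    (void)a3;
    return (close((int)fd) < 0 ? -errno : 0);
}

static long
sketch_getpid(uintptr_t a1, uintptr_t a2, uintptr_t a3)
{
    (void)a1;
    (void)a2;
    (void)a3;
    return ((long)getpid());
}

/* Table indexed by Linux system call number; unset slots stay NULL. */
static const sketch_handler_t sketch_table[LX_NSYSCALLS] = {
    [LX_SYS_WRITE] = sketch_write,
    [LX_SYS_CLOSE] = sketch_close,
    [LX_SYS_GETPID] = sketch_getpid,
};

/*
 * Entry point the kernel trampoline would jump to. Unknown or
 * unimplemented calls fail with -ENOSYS, following the Linux
 * convention of returning a negative error number.
 */
long
sketch_emulate(int num, uintptr_t a1, uintptr_t a2, uintptr_t a3)
{
    if (num < 0 || num >= LX_NSYSCALLS || sketch_table[num] == NULL)
        return (-ENOSYS);
    return (sketch_table[num](a1, a2, a3));
}
```

Keeping the table indexed by the foreign system call number means the trampoline only has to hand over the number and the raw register arguments; everything else stays in the library.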

===== 3.4.1 Pass-Through

A pass-through call is one that requires no data transformation and for which the Solaris 10 semantics match those of the Linux system call. These can be implemented in userland by immediately calling the equivalent system call.

For example:

{{{

int
lx_read(int fd, void *buf, size_t bytes)
{
    int rval;

    rval = read(fd, buf, bytes);

    /* Linux reports failure as a negative error number. */
    return (rval < 0 ? -errno : rval);
}

}}}

Other examples of pass-through calls are close(), write(), mkdir(), and munmap().

Although the arguments to the system call are identical, the method for returning an error to the caller differs between Solaris and Linux. In Solaris, the system call returns -1 and the error number is stored in the thread-specific variable {{code}}errno{{/code}}. In Linux, the error number is returned as part of the {{code}}rval{{/code}}.

There are also differences in the error numbers between Solaris and Linux. The {{code}}lx_read(){{/code}} routine is called by {{code}}lx_emulate(){{/code}}, which handles the translation between Linux and Solaris error codes for all system calls.

===== 3.4.2 Simple Emulation

One step up in complexity is a simple emulated system call. This is a call where the original arguments, the return value, or both require some degree of simple transformation from the Solaris equivalent. Simple transformations include changing data types or the moving of values into a structure. These calls can be built entirely from standard Solaris system calls and userland transformations.
415: 416: For example: 417: 418: {{{ 419: 420: int 421: lx_uname(uintptr_t p1) 422: { 423: struct lx_utsname *un = (struct lx_utsname *)p1; 424: char buf[LX_SYS_UTS_LN + 1]; 425: 426: strlcpy(un->sysname, LX_SYSNAME, LX_SYS_UTS_LN); 427: strlcpy(un->release, lx_release, LX_SYS_UTS_LN); 428: strlcpy(un->version, LX_VERSION, LX_SYS_UTS_LN); 429: strlcpy(un->machine, LX_MACHINE, LX_SYS_UTS_LN); 430: gethostname(un->nodename, sizeof (un->nodename)); 431: if ((sysinfo(SI_SRPC_DOMAIN, buf, LX_SYS_UTS_LN) < 0)) 432: un->domainname[0] = '\0'; 433: else 434: strlcpy(un->domainname, buf, LX_SYS_UTS_LN); 435: return (0); 436: } 437: 438: }}} 439: 440: Other examples of simple emulated calls are stat(), mlock(), and getdents(). 441: 442: ===== 3.4.3 Complex Emulation 443: 444: The calls requiring the most in-kernel support are the complex emulated system calls. These are calls that: 445: 446: * Require significant transformation to input arguments or return values 447: * Are partially or wholly unique within the //lx// brand implementation 448: * Possibly require a new system call that has no underlying Solaris counterpart 449: 450: Some examples of complex emulated calls are {{code}}clone(){{/code}}, {{code}}sigaction(){{/code}}, and {{code}}futex(){{/code}}. The implementation of the {{code}}clone(){{/code}} system call is described below. 451: 452: ==== 3.5 Other Issues 453: 454: ===== 3.5.1 Linux Threading 455: 456: Linux implements threads via the {{code}}clone(){{/code}} system call. Among the arguments to the call is a set of flags, four of which determine the level of sharing within the address space: CLONE_VM, CLONE_FS, CLONE_FILES, and CLONE_SIGHAND. When all four flags are clear, the clone is equivalent to a fork; when they are all set, it is equivalent to creating another lwp in the address space. Any other combination of flags reflects a thread/process construct that does not match any existing Solaris model.
Since these other combinations are rarely, if ever, encountered on a system, this project will not be adding the abstractions necessary to support them. 457: 458: The following table lists all of the flags for the {{code}}clone(2){{/code}} system call, and whether the ’lx’ brand supports them. If an application uses an unsupported flag, or combination of flags, a detailed error message is emitted and ENOTSUP is returned. 459: 460: |=Flag|=Supported? 461: | CLONE_VM | Yes 462: | CLONE_FS | Yes 463: | CLONE_FILES | Yes 464: | CLONE_SIGHAND | Yes 465: | CLONE_PID | Yes 466: | CLONE_PTRACE | Yes 467: | CLONE_PARENT | Partial. Not supported for fork()-style clone() operations. 468: | CLONE_THREAD | Yes 469: | CLONE_SYSVSEM | Yes 470: | CLONE_SETTLS | Yes 471: | CLONE_PARENT_SETTID | Yes 472: | CLONE_CHILD_CLEARTID | Yes 473: | CLONE_DETACH | Yes 474: | CLONE_CHILD_SETTID | Yes 475: 476: When an application uses {{code}}clone(2){{/code}} to fork a new process, the {{code}}lx_clone(){{/code}} routine simply calls {{code}}fork1(2){{/code}}. When an application uses {{code}}clone(2){{/code}} to create a new thread, we call the {{code}}thr_create(3C){{/code}} routine in the Solaris {{code}}libc{{/code}}. 477: 478: The Linux application provides the address of a function at which the new thread should begin executing as an argument to the system call. However, the Linux kernel does not actually start execution at that address. Instead, the kernel essentially does a {{code}}fork(2){{/code}} of a new thread which, like a forked process, starts with exactly the same state as the parent thread. As a result, the new thread starts executing in the middle of the {{code}}clone(2){{/code}} system call, and it is the {{code}}glibc{{/code}} wrapper that causes it to jump to the user-specified address. 479: 480: This Linux implementation detail means that when we call {{code}}thr_create(3C){{/code}} to create our new thread, we cannot provide the user’s start address to that routine.
Instead, all new Linux threads begin by executing a routine that we provide, called {{code}}clone_start(){{/code}}. This routine does some final initialization, notifies the brand’s kernel module that we have created a new Linux thread, and then returns to {{code}}glibc{{/code}}. 481: 482: A by-product of the threads implementation in Linux is that every thread has a unique PID. To mimic this behavior in the //lx// brand, every thread created by a Linux binary reserves a PID from the PID list. This reservation is performed as part of the {{code}}clone_start(){{/code}} routine. 483: 484: This reserved PID is never seen by Solaris processes, but it is used by Linux processes. When a Linux thread calls {{code}}getpid(2){{/code}}, it is returned the standard Solaris PID of the process. When it calls {{code}}gettid(2){{/code}}, it is returned the PID that was reserved at thread creation time. Similarly, {{code}}kill(2){{/code}} sends a signal to the entire process represented by the supplied PID, while {{code}}tkill(2){{/code}} sends a signal to the specific thread represented by the supplied PID. 485: 486: The Linux thread model supported by modern Red Hat systems is provided by the Native POSIX Thread Library (NPTL). NPTL uses three consecutive descriptor entries in the Global Descriptor Table (GDT) to manage thread local storage. One of the arguments to {{code}}clone(){{/code}} is an optional descriptor entry for TLS. More commonly used is the {{code}}set_thread_area(){{/code}} system call, which takes a descriptor as an argument and returns the entry number in the GDT in which it has been stored. The NPTL then uses this to initialize the {{code}}%gs{{/code}} register. The descriptors are per thread, so they have to be stored in per-thread storage and the GDT entries must be re-initialized on context switch. This is done via a {{code}}restore{{/code}} ctx operation.
487: 488: Since both NPTL and the Solaris {{code}}libc{{/code}} rely on {{code}}%gs{{/code}} to access per-thread data, we have added code to virtualize its usage. The first thing our user-space emulation library does is: 489: 490: {{{ 491: 492: /* 493: * Save the Linux libc’s %gs and switch to the Solaris libc’s %gs 494: * segment so we have access to the Solaris errno, etc. 495: */ 496: pushl %gs 497: pushl $LWPGS_SEL 498: popl %gs 499: 500: }}} 501: 502: This sequence ensures that we always enter our Solaris code using the well-known value used for our {{code}}%gs{{/code}}. We also stash the current value of {{code}}%gs{{/code}} on the stack, so we can restore it prior to returning to Linux code. 503: 504: ===== 3.5.2 EFAULT/SIGSEGV 505: 506: If the user-space emulation library were to access an argument from a system call which had an invalid address, a SIGSEGV signal would be generated. For proper Linux emulation, the desired result in this situation is to generate an error return from the system call with an EFAULT errno. 507: 508: To deliver the expected behavior, we will introduce a new system call ({{code}}uucopy(){{/code}}), which copies data from one user address to another. Any attempt to use an illegal address will cause the call to return an error. Otherwise, the data will be copied as if we had performed a standard {{code}}bcopy(){{/code}} operation. 509: 510: For example: 511: 512: {{{ 513: 514: int 515: lx_system_call(int *arg) 516: { 517: int local_arg; 518: int rval; 519: 520: /* 521: * catch EFAULT: copy the caller’s argument into local_arg 522: */ 523: if ((rval = uucopy(arg, &local_arg, sizeof (int))) < 0) 524: return (rval); /* errno is set to EFAULT */ 525: 526: /* 527: * transform the arg, now in local_arg, to Solaris format 528: */ 529: return (solaris_system_call(&local_arg)); 530: } 531: 532: }}} 533: 534: This functionality seems to be generically useful, so the {{code}}uucopy(){{/code}} call will be implemented in {{code}}libc{{/code}}, where it will be available to any application.
535: 536: If the overhead imposed by this system call dramatically limits performance, we may include an environment variable that causes the brand library to perform a standard userspace copy rather than the kernel-based copy. Setting this variable would lead to higher performance, but some system calls would segfault rather than returning EFAULT. 537: 538: ==== 3.6 Signal Handling 539: 540: Delivering signals to a Linux process is complicated by differences in signal numbering, stack structure and contents, and the action taken when a signal handler exits. In addition, many signal-related structures, such as {{code}}sigset_t{{/code}}, vary between Solaris and Linux. 541: 542: The simplest transformation that must be done when sending signals is to translate between Linux and Solaris signal numbers. 543: 544: Major signal number differences between Linux and Solaris|=Number|=Linux|=Solaris 545: | 10 | SIGUSR1 | SIGBUS 546: | 12 | SIGUSR2 | SIGSYS 547: | 16 | SIGSTKFLT | SIGUSR1 548: | 17 | SIGCHLD | SIGUSR2 549: | 18 | SIGCONT | SIGCHLD 550: | 19 | SIGSTOP | SIGPWR 551: | 20 | SIGTSTP | SIGWINCH 552: | 21 | SIGTTIN | SIGURG 553: | 22 | SIGTTOU | SIGPOLL 554: | 23 | SIGURG | SIGSTOP 555: | 24 | SIGXCPU | SIGTSTP 556: | 25 | SIGXFSZ | SIGCONT 557: | 26 | SIGVTALRM | SIGTTIN 558: | 27 | SIGPROF | SIGTTOU 559: | 28 | SIGWINCH | SIGVTALRM 560: | 29 | SIGPOLL | SIGPROF 561: | 30 | SIGPWR | SIGXCPU 562: | 31 | SIGSYS | SIGXFSZ 563: 564: When a Linux process sends a signal using the {{code}}kill(2){{/code}} system call, we translate the signal into the Solaris equivalent before handing control off to the standard signalling mechanism. When a signal is delivered to a Linux process, we translate the signal number from Solaris back to Linux. Translating signals both at generation and at delivery time ensures both that Solaris signals are sent properly to Linux applications and that signals’ default behavior works as expected.
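The two-way translation described above can be sketched as a lookup table for the non-real-time signals. This is a sketch under stated assumptions, not the brand’s actual code: the names {{code}}sketch_ltos_signo{{/code}}, {{code}}sketch_lx_to_solaris(){{/code}}, and {{code}}sketch_solaris_to_lx(){{/code}} are hypothetical. The entries for 10, 12, and 16 through 31 follow the table above; the identity entries and the mapping of Linux signal 7 (SIGBUS) to Solaris 10 are filled in from the standard x86 signal numbering rather than from this document.

```c
#include <stddef.h>

#define	SKETCH_NSIG	32

/*
 * Linux signal number -> Solaris signal number.
 * A zero entry means there is no Solaris analog (Linux 16, SIGSTKFLT).
 */
static const int sketch_ltos_signo[SKETCH_NSIG] = {
	0,			/*  0: not a signal */
	1, 2, 3, 4, 5, 6,	/*  1-6: same number on both systems */
	10,			/*  7: SIGBUS (assumed, not in the table) */
	8, 9,			/*  8-9: same number on both systems */
	16,			/* 10: SIGUSR1 */
	11,			/* 11: SIGSEGV */
	17,			/* 12: SIGUSR2 */
	13, 14, 15,		/* 13-15: same number on both systems */
	0,			/* 16: SIGSTKFLT, no Solaris analog */
	18,			/* 17: SIGCHLD */
	25,			/* 18: SIGCONT */
	23,			/* 19: SIGSTOP */
	24,			/* 20: SIGTSTP */
	26,			/* 21: SIGTTIN */
	27,			/* 22: SIGTTOU */
	21,			/* 23: SIGURG */
	30,			/* 24: SIGXCPU */
	31,			/* 25: SIGXFSZ */
	28,			/* 26: SIGVTALRM */
	29,			/* 27: SIGPROF */
	20,			/* 28: SIGWINCH */
	22,			/* 29: SIGPOLL */
	19,			/* 30: SIGPWR */
	12,			/* 31: SIGSYS */
};

/* Translate at generation time; returns -1 if untranslatable. */
int
sketch_lx_to_solaris(int lx_signo)
{
	if (lx_signo <= 0 || lx_signo >= SKETCH_NSIG ||
	    sketch_ltos_signo[lx_signo] == 0)
		return (-1);
	return (sketch_ltos_signo[lx_signo]);
}

/* Translate at delivery time by searching the same table in reverse. */
int
sketch_solaris_to_lx(int sol_signo)
{
	int lx;

	if (sol_signo <= 0)
		return (-1);
	for (lx = 1; lx < SKETCH_NSIG; lx++) {
		if (sketch_ltos_signo[lx] == sol_signo)
			return (lx);
	}
	return (-1);
}
```

Keeping a single table and deriving the reverse direction from it avoids the two directions drifting apart; a real implementation would more likely precompute the reverse array once.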
565: 566: One issue is that Linux supports 32 real time signals, typically starting at or near 32 ({{code}}SIGRTMIN{{/code}}) and proceeding to 63 ({{code}}SIGRTMAX{{/code}}). ({{code}}SIGRTMIN{{/code}} is "at or near" 32 because glibc usually "steals" one or more of these signals for its own internal use, adjusting {{code}}SIGRTMIN{{/code}} and {{code}}SIGRTMAX{{/code}} as needed.) Conversely, Solaris actively uses signals 32-40 for other purposes and only supports 8 real time signals, in the range 41 ({{code}}SIGRTMIN{{/code}}) to 48 ({{code}}SIGRTMAX{{/code}}). 567: 568: At present, attempting to translate a Linux signal greater than 39 (corresponding to the maximum real time signal number Solaris can support) will generate an error. We have not yet found an application that attempts to send such a signal. 569: 570: Branded processes are set up to ignore any Solaris signal for which there is no direct Linux analog, preventing the delivery of untranslatable signals from the global zone. 571: 572: ===== 3.6.1 Signal Delivery 573: 574: To support user-level signal handlers, BrandZ uses a double layer of indirection to process and deliver signals to branded threads. 575: 576: In a normal Solaris process, libc interposes on signal delivery for any thread that registers a signal handler. Libc needs to do various bits of magic to provide thread-safe critical regions, so it registers its own handler, named {{code}}sigacthandler{{/code}}, using the {{code}}sigaction(2){{/code}} system call. When a signal is received, {{code}}sigacthandler(){{/code}} is called, and after some processing, libc calls the user’s signal handler via a routine named {{code}}call_user_handler(){{/code}}. 577: 578: Adding a Linux branded thread to the mix complicates things somewhat.
First, when a thread receives a signal, it could be running with a Linux value in the x86 %gs segment register as opposed to the value Solaris threads expect; if control were passed directly to Solaris code, such as libc’s {{code}}sigacthandler(){{/code}}, that code would experience a segmentation fault the first time it tried to dereference a memory location using %gs. 579: 580: Second, the signal number translation referenced above must take place. (As an example, {{code}}SIGCONT{{/code}} is equivalent in function in Linux and Solaris, but Linux’ {{code}}SIGCONT{{/code}} is signal 18 while Solaris’ is signal //25//.) Further, as was the case with Solaris libc, before the Linux signal handler is called, the value of the %gs segment register must be restored to the value Linux code expects. 581: 582: This need to translate signal numbers and manipulate the %gs register means that while with standard Solaris libc, following a signal from generation to delivery looks something like: 583: 584: {{{ 585: 586: kernel -> 587: sigacthandler() -> 588: call_user_handler() -> 589: user signal handler 590: 591: }}} 592: 593: for BrandZ Linux threads, this instead would look like this: 594: 595: {{{ 596: 597: kernel -> 598: lx_sigacthandler() -> 599: sigacthandler() -> 600: call_user_handler() -> 601: 602: lx_call_user_handler() -> 603: Linux user signal handler 604: 605: }}} 606: 607: The new addtions are: 608: 609: * **lx_sigacthandler()** 610: This routine is responsible for setting the %gs segment register to the value Solaris expects, and for jumping to Solaris’ libc signal interposition handler, sigacthandler(). 611: * **lx_call_user_handler()** 612: This routine is responsible for translating Solaris signal numbers to their Linux equivalents, building a Linux signal stack based on the information Solaris has provided, and passing the stack to the registered Linux signal handler. It is, in effect, the Linux thread equivalent to libc’s {{code}}call_user_handler{{/code}}. 
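The layering above can be illustrated with a portable sketch in which one handler is registered with {{code}}sigaction(2){{/code}} and forwards to a second routine that translates the signal number before calling the application’s handler. It is only a sketch under assumptions: every {{code}}sketch_*{{/code}} name is hypothetical, the %gs manipulation and the construction of a Linux signal frame are omitted entirely, and the translation covers just four signals, written against the host’s symbolic names so that it behaves the same on any POSIX system.

```c
#ifndef _POSIX_C_SOURCE
#define	_POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

#define	SKETCH_NSIG	65

/* Hypothetical type for a handler that expects Linux signal numbers. */
typedef void (*sketch_lx_handler_t)(int lx_signo);

static sketch_lx_handler_t sketch_lx_handlers[SKETCH_NSIG];

/* Last Linux signal number seen by the demonstration handler below. */
volatile sig_atomic_t sketch_last_lx_signo;

/* Host signal number -> Linux signal number, for a few signals only. */
static int
sketch_host_to_lx_signo(int signo)
{
	switch (signo) {
	case SIGUSR1:
		return (10);
	case SIGUSR2:
		return (12);
	case SIGCHLD:
		return (17);
	case SIGCONT:
		return (18);
	default:
		return (signo);	/* sketch only: assumes identical numbering */
	}
}

/* Inner layer: plays the role described for lx_call_user_handler(). */
static void
sketch_call_user_handler(int signo)
{
	sketch_lx_handler_t handler = sketch_lx_handlers[signo];

	if (handler != NULL)
		handler(sketch_host_to_lx_signo(signo));
}

/* Outer layer: the routine actually registered with the kernel. */
static void
sketch_sigacthandler(int signo)
{
	if (signo > 0 && signo < SKETCH_NSIG)
		sketch_call_user_handler(signo);
}

/* Record the application's handler, then install the outer layer. */
int
sketch_install(int signo, sketch_lx_handler_t handler)
{
	struct sigaction sa;

	if (signo <= 0 || signo >= SKETCH_NSIG)
		return (-1);
	sketch_lx_handlers[signo] = handler;

	memset(&sa, 0, sizeof (sa));
	sa.sa_handler = sketch_sigacthandler;
	sigemptyset(&sa.sa_mask);
	return (sigaction(signo, &sa, NULL));
}

/* Demonstration "Linux" handler: remembers the number it was given. */
void
sketch_record_handler(int lx_signo)
{
	sketch_last_lx_signo = lx_signo;
}
```

After {{code}}sketch_install(SIGUSR1, sketch_record_handler){{/code}} and {{code}}raise(SIGUSR1){{/code}}, the recorded value is the Linux number for SIGUSR1 regardless of what number the host uses for that signal.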
613: 614: Installing {{code}}lx_sigacthandler(){{/code}} is a bit tricky, as normally libc’s {{code}}sigacthandler(){{/code}} routine is hidden from user programs. To facilitate this, a new private function was added to libc, {{code}}setsigacthandler(){{/code}}: 615: 616: void setsigacthandler(void (*new_handler)(int, siginfo_t *, void *), void (**old_handler)(int, siginfo_t *, void *)) 617: 618: The routine works by replacing, in the per-thread data structure that libc already maintains, the address of libc’s own interposition handler with the address passed in; the old handler’s address is stored in the pointer pointed to by the second argument, if it is non-NULL, mimicking the behavior of {{code}}sigaction(){{/code}} itself. Once {{code}}setsigacthandler(){{/code}} has been executed, all future branded threads this thread may create will automatically have the proper interposition handler installed as the result of a normal {{code}}sigaction(){{/code}} call. 619: 620: Note that none of this interposition is necessary unless a Linux thread registers a user signal handler, because the default action for all signals is the same between Solaris and Linux save for one signal, {{code}}SIGPWR{{/code}}. For this reason, BrandZ always installs its own internal signal handler for {{code}}SIGPWR{{/code}} that translates the action to the Linux default, to terminate the process. (Solaris’ default action is to ignore {{code}}SIGPWR{{/code}}.) 621: 622: It is also important to note that when signals are not translated, BrandZ relies upon code interposing upon the {{code}}wait(2){{/code}} system call to translate signals to their proper values for any Linux threads retrieving the status of others.
So, while the Solaris signal number for a particular signal is set in the data structures for a process (and would be returned as the result of, for example, {{code}}WTERMSIG(){{/code}}), the BrandZ interposition upon {{code}}wait(2){{/code}} is responsible for translating the value that {{code}}WTERMSIG(){{/code}} would return from a Solaris signal number to the appropriate Linux value. 623: 624: ===== 3.6.2 Returning From Signals 625: 626: The process of returning to an interrupted thread of execution from a user signal handler is entirely different between Solaris and Linux. While Solaris generally expects to set the context to the interrupted one on a normal return from a signal handler, in the normal case Linux instead sets the return address from the signal handler to point to code that calls one of two specific Linux system calls, {{code}}sigreturn(2){{/code}} or {{code}}rt_sigreturn(2){{/code}}. Thus, when a Linux signal handler completes execution, instead of returning through what would, in Solaris’ libc, be a call to {{code}}setcontext(2){{/code}}, the {{code}}sigreturn(2){{/code}} or {{code}}rt_sigreturn(2){{/code}} Linux system calls are responsible for accomplishing much the same thing. 627: 628: This trampoline code (for a call to {{code}}sigreturn(2){{/code}}) looks like this: 629: 630: {{{ 631: pop %eax 632: mov $LX_SYS_sigreturn, %eax 633: int $0x80 634: }}} 635: 636: such that when the Linux user signal handler is eventually called, the stack looks like this: 637: 638: | Pointer to //sigreturn// trampoline code 639: | Linux signal number 640: | Pointer to Linux siginfo_t 641: | Pointer to Linux ucontext_t 642: | Linux ucontext_t 643: | Linux fpstate 644: | Linux siginfo_t 645: 646: BrandZ takes the approach of intercepting the Linux {{code}}sigreturn(2){{/code}} (or {{code}}rt_sigreturn(2){{/code}}) system call in order to turn it into a return through the libc call stack that Solaris expects.
This is done by the {{code}}lx_sigreturn(){{/code}} or {{code}}lx_rt_sigreturn(){{/code}} routines, which remove the Linux signal frame from the stack and pass the resulting stack pointer to another routine, {{code}}lx_sigreturn_tolibc(){{/code}}, which makes libc believe the user signal handler it had called returned. 647: 648: When control then returns to libc’s {{code}}call_user_handler(){{/code}} routine, a {{code}}setcontext(2){{/code}} will be done that (in most cases) returns the thread executing the code back to the location originally interrupted by receipt of the signal. 649: 650: One final complication in this process is the restoration of the %gs segment register when returning from a user signal handler. Prior to BrandZ, Solaris’ libc forced the value of %gs to a known value when calling {{code}}setcontext(){{/code}} to return to an interrupted thread from a user signal handler (as libc uses %gs internally as a pointer to curthread, it is a way of ensuring a good "known value" for curthread.) 651: 652: Since BrandZ requires that setcontext() restore a Linux value for %gs when returning from a Linux signal handler, we made this forced restoration optional on a per-process basis. This was accomplished by means of a new private routine to libc: 653: 654: {{{ 655: 656: void set_setcontext_enforcement(int on) 657: 658: }}} 659: 660: By default, the "curthread pointer" value enforcement is enabled. When this routine is called with an argument of ’0’, the mechanism is disabled for this process. 661: 662: Shutting off this mechanism will not have any correctness or security implications. Writing to the %gs segment register is not a privileged operation and as such %gs can be set to any value at any time by user code. The only drawback to disabling the mechanism is that if a bad value is set for %gs, the broken application will likely segmentation fault deep within libc. 663: 664: === 3.7. 
Device and {{code}}ioctl(){{/code}} support 665: 666: ==== 3.7.1 Determining Which Devices to Support 667: 668: Our investigation showed that the following devices are the minimum set required to support Linux branded zones: 669: 670: {{{ 671: 672: /dev/null 673: /dev/zero 674: /dev/ptmx 675: /dev/pts/* 676: /dev/tty 677: /dev/console 678: /dev/random 679: /dev/urandom 680: /dev/fd/* 681: 682: }}} 683: 684: The following devices were considered, but aren’t actually necessary for Linux branded zones. 685: 686: {{code}}/dev/ttyd?{{/code}} - These are serial port devices. 687: 688: {{code}}/dev/pty[pqr]?{{/code}} and {{code}}/dev/tty[pqr]?{{/code}} - These are old style terminal devices provided for compatibility purposes. They currently do not exist in native non-global zones. The Unix98 specification replaced these devices with {{code}}/dev/ptmx{{/code}} and {{code}}/dev/pts/*{{/code}}, which have been used as the standard terminal devices for Solaris and Linux for a long time. While these devices do still exist on Red Hat 2.4 systems, they have officially been unsupported since Linux 2.1.115. An inspection of a Linux 2.4 system didn’t reveal any applications that were actually using these devices. 689: 690: ===== 3.7.1.1 Networking Devices 691: 692: Native Solaris non-global zones have a network interface that is visible (reported via ifconfig), but there are no actual network device nodes accessible via {{code}}/dev{{/code}}.
Certain higher level network protocol devices are accessible in native zones: 693: 694: {{{ 695: 696: /dev/arp, /dev/icmp, /dev/tcp, /dev/tcp6, /dev/udp, /dev/udp6 697: /dev/ticlts, /dev/ticots, /dev/ticotsord 698: 699: }}} 700: 701: Notably missing from the list above is: 702: 703: {{{ 704: 705: /dev/ip 706: 707: }}} 708: 709: Looking at a native Linux 2.4 system, we see that the following network devices exist: 710: 711: {{{ 712: 713: /dev/inet/egp, /dev/inet/ggp, /dev/inet/icmp, /dev/inet/idp 714: /dev/inet/ip, /dev/inet/ipip, /dev/inet/pup, /dev/inet/rawip 715: /dev/inet/tcp, /dev/inet/udp 716: 717: }}} 718: 719: {{code}}Documentation/devices.txt{{/code}} describes all of these as ’iBCS-2 compatibility devices’, so we will not be supporting them. 720: 721: Additionally, Linux does not create {{code}}/dev{{/code}} entries for networking devices. Network interface names are mapped to kernel modules via aliases defined in {{code}}/etc/modules.conf{{/code}}. Interface plumbing (via ifconfig) is all done via ioctls to sockets. Reporting status of interfaces (via ifconfig) is done either by socket ioctls or by accessing files in /proc/net/. Finally, network accesses (telnet and ping) are all done via socket calls. 722: 723: This indicates that initial Linux zones networking support has no actual device requirements, but it does require ioctl translations. (This issue is addressed later in this document.) Given the lack of device-specific configuration, the current native zones network interface can be leveraged in Linux branded zones. 724: 725: ==== 3.7.2 Major/Minor Device Number Mapping 726: 727: Linux has explicitly hardcoded knowledge about how major and minor device numbers map to drivers and paths. (See {{code}}Documentation/devices.txt{{/code}} in the Linux source.) But on Solaris, major device number to driver mapping is dynamically defined via {{code}}/etc/name_to_major{{/code}}. Minor device number name space is managed by individual drivers. 
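Because the Solaris mapping is data rather than a compiled-in constant, any code that needs a driver’s major number has to look it up by name. The sketch below shows one way such a lookup could work over text in the style of {{code}}/etc/name_to_major{{/code}}, assuming each line holds a driver name followed by its major number. The function name {{code}}sketch_name_to_major(){{/code}} is hypothetical, the table is passed in as a string so the example does not depend on the file being present, and any numbers used with it are illustrative rather than real assignments.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a driver's major number in name_to_major-style text, where
 * each line is assumed to be "<driver name> <major number>".
 * Returns the major number, or -1 if the driver is not listed.
 */
int
sketch_name_to_major(const char *table, const char *driver)
{
	const char *line = table;
	char name[64];
	int major;

	while (line != NULL && *line != '\0') {
		/* Parse one "name number" pair starting at this line. */
		if (sscanf(line, "%63s %d", name, &major) == 2 &&
		    strcmp(name, driver) == 0)
			return (major);

		/* Advance to the character after the next newline. */
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}
	return (-1);
}
```

With a made-up table such as {{code}}"cn 0\nmm 13\nptm 23\n"{{/code}}, looking up {{code}}mm{{/code}} yields 13 and looking up a driver that is not listed yields -1.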
728: 729: Also, there is not a 1:1 major/minor device number mapping between Linux and Solaris device nodes, and emulation of the functionality provided by a given Linux driver might involve using multiple Solaris drivers. For example, in Linux, both {{code}}/dev/null{{/code}} and {{code}}/dev/random{{/code}} are implemented using the same driver, so both of these devices have the same major number. In Solaris, these devices are implemented with separate drivers, each with its own major number. 730: 731: Major/minor device numbers are exposed to Linux applications in multiple places. Some examples are: 732: 733: * the {{code}}stat(2){{/code}} / {{code}}statvfs(2){{/code}} family of system calls 734: * the filesystem name space via {{code}}/dev/pts/*{{/code}} 735: * {{code}}/proc/<pid>/stat{{/code}} 736: * certain ioctls (specifically {{code}}LX_TIOCGPTN{{/code}} ) 737: 738: One important question to answer is whether this device number translation is actually necessary. Are major and minor device numbers actually consumed by Linux applications, and if so, how? As it turns out, the answer to this question is unfortunately yes. {{code}}glibc’s{{/code}} {{code}}ptsname(){{/code}} does "sanity checks" of pts paths to make sure they have expected dev_t values. These "sanity checks" make assumptions about expected major //and// minor device values. 739: 740: Therefore, we will be required to provide a device number mapping mechanism. For the required devices listed earlier, this gives us: 741: 742: {{{ 743: 744: Device Solaris Driver Linux Major/Minor 745: ~------ ~-------------- ~----------------- 746: /dev/null mm 1 / 3 747: /dev/zero mm 1 / 5 748: /dev/random random 1 / 8 749: /dev/urandom random 1 / 9 750: 751: /dev/tty sy 5 / 0 752: /dev/console zcons 5 / 1 753: /dev/ptmx ptm 5 / 2 [1] 754: /dev/pts/* pts 136 / * [2] 755: 756: Notes: 757: [1] ptm is a clone device, so this translation is tricky.
758: Basically, the /dev/ptmx node in a native zone actually 759: points to the clone device. But when an open is done 760: on this device, the vnode that is returned actually corresponds 761: to a ptm node (and not a clone node). This means that 762: on a Solaris system, a stat of /dev/ptmx will return different 763: dev_t values than an fstat(2) of an fd that was created by 764: opening /dev/ptmx. On Linux, both of these operations need to 765: return the same result. So once again, we are mapping 766: multiple major/minor Solaris device numbers to a single 767: Linux device number. 768: 769: [2] For pts devices, there should probably be no translation done 770: for device minor node numbers. 771: 772: }}} 773: 774: ==== 3.7.3 Ioctl Translation Support 775: 776: ===== 3.7.3.1 Ioctl Support - General Issues 777: 778: A quick investigation shows that most of the required ioctl support isn’t actually for devices at all. Most of the necessary ioctls are for non-device filesystem nodes. Here’s a list of the most obviously needed ioctls, broken up into categories of files that support these ioctls: 779: 780: {{{ 781: 782: 1) All file ioctls (regular files, streams devices, sockets, fifos): 783: FIONREAD, FIONBIO 784: 785: 2) Streams file ioctls (Streams device, sockets, fifos): 786: TCSBRK, TCXONC, TCFLSH, TIOCEXCL, TIOCNXCL, TIOCSPGRP, 787: TIOCSTI, TIOCSWINSZ, TIOCMBIS, TIOCMBIC, TIOCMSET, 788: TIOCSETD, FIOASYNC, FIOSETOWN, TCSETS, TCSETSW, 789: TCSETSF, TCSETA, TCSETAW, TCSETAF, TIOCGPGRP, TCSBRKP 790: 791: 3) Socket ioctls: 792: FIOGETOWN, SIOCSPGRP, SIOCGPGRP, SIOCATMARK, 793: SIOCGIFFLAGS, SIOCGIFADDR, SIOCGIFDSTADDR, 794: SIOCGIFBRDADDR, SIOCGIFNETMASK, SIOCGIFMETRIC, 795: SIOCGIFMTU, SIOCGIFCONF, SIOCGIFNUM 796: 797: 4) Streams device - ptm 798: TIOCGWINSZ, UNLKPT, LX_TIOCGPTN 799: 800: 5) Streams /w ttcompat module - pts 801: TIOCGETD 802: 803: 6) Streams /w ldterm, ptem or ttcompat module - pts 804: TCGETS 805: 806: 7) Streams /w ldterm or ptem module - pts 807:
TCGETA 808: 809: 8) Streams device - pts 810: LX_TIOCSCTTY, LX_TIOCNOTTY 811: 812: }}} 813: 814: Most of these ioctls are streams ioctls, and since FIFOs and sockets are implemented via streams in Solaris, any FIFO or socket supports most of these ioctls. Of the 45 ioctls listed above, only 8 are actually device-specific ioctls. 815: 816: This indicates that doing ioctl translations via layered drivers is not the best approach, since this would only address a minor subset of the total ioctls that need to be supported. Because supporting non-device ioctls will require the creation of a non-layered driver ioctl translation mechanism, it seems more appropriate to handle device ioctls via this same mechanism as well. 817: 818: With this in mind, it’s more interesting if the categories above are renamed in terms of their vnode v_type and v_rdev values. If we do this, we get: 819: 820: {{{ 821: 822: 1) VREG, VFIFO, VSOCK, VCHR[ptm, pts, sy, zcons] 823: 2) VFIFO, VSOCK, VCHR[ptm, pts, sy, zcons] 824: 3) VSOCK 825: 4) VCHR[ptm] 826: 5, 6, 7, 8) VCHR[pts] 827: 828: }}} 829: 830: Supporting ioctls on these vnodes will require a switch table. In addition to the ioctl number, the translation mechanism must look at the type of the file descriptor an ioctl is targeting to determine what translation needs to be done. Hence, the translation layer will need to look at the v_type and the major portion of v_rdev associated with the target file descriptor. These fields are easily accessible from the kernel and are also available via st_mode and st_rdev from fstat(2). So this translation could occur either in the kernel or in userland. 831: 832: The only tricky part about this determination is that we don’t want to hard code the Solaris driver major number into any translation code, since these numbers are allocated dynamically via {{code}}/etc/name_to_major{{/code}} in the global zone.
Therefore, device ioctl translators should be bound to specific Solaris drivers by their driver name, and when an application attempts to perform an ioctl to a driver, the translation code will need to be able to resolve the driver name to driver major number mapping. This translation code should not have any impact on how devices are managed in the global zone. 833: 834: ===== 3.7.3.2 Ioctl support - nits 835: 836: There are other more minor issues surrounding ioctl support that are worth mentioning. 837: 838: Ioctl cmd symbols that represent the same ioctl command on Solaris and Linux can have different underlying symbol values. For example, TCSBRK on Solaris is 0x5405, while on Linux it’s 0x5409. Any translation layers will have to be aware of the Linux ioctl values and translate them into Solaris ioctl values. 839: 840: One final note: BrandZ will take an "opt-in" approach to ioctls. Only those ioctls that we have explicitly added support for will be executed. All others will return EINVAL. The alternative approach, simply passing the unrecognized ioctls through to Solaris proper, is risky for the following reasons: 841: 842: * Nondeterministic behavior: Since ioctl cmd values can be different from Solaris, and devices on Solaris and Linux can support different behaviors, simply passing on unknown ioctls could result in nondeterministic device and/or application behavior. 843: 844: * Inadvertent breakage and maintainability issues: If an application depends on certain ioctls that are not explicitly listed as supported in brand-specific code, then there is an increased chance that a developer might attempt to change the implementation of one of those ioctls without knowing that the ioctl is also being consumed by a brand, and thereby inadvertently break the brand.
If all ioctls supported by a brand are explicitly listed in the brand support code, then when developers search for consumers of a given ioctl, they will see that the ioctl is exported for application consumption in a brand environment. 845: 846: ==== 3.7.4 Device Minor Nodes and Path Management 847: 848: Multiple zones (both native and non-native) will be sharing devices with the global zone and with each other. Therefore, we must ensure that this device sharing doesn’t create security problems in which one zone could affect another. Here’s a look at the device nodes that will be present in zones and what types of risks/issues these devices present. 849: 850: {{{ 851: 852: Device Notes 853: ~------ ~----- 854: /dev/null read-write, doesn’t have any consumer state 855: /dev/zero read-write, doesn’t have any consumer state 856: 857: /dev/random Writable only by root, protected from root 858: zone writes via the ’sys_devices’ privilege. 859: /dev/urandom Writable only by root, protected from root 860: zone writes via the ’sys_devices’ privilege. 861: 862: /dev/tty Pass through device, 863: doesn’t have any consumer state 864: 865: /dev/ptmx Cloning device. Each open results in a unique 866: minor node, so a minor node can only exist 867: in one zone at any given time. 868: 869: /dev/pts/* All minor nodes are visible in all zones. 870: Currently, this driver has been made zone 871: aware to prevent multiple zones from 872: accessing the same minor nodes concurrently. 873: 874: /dev/console Minor nodes of this device should not be 875: shared across different zones. Each 876: instance of this device in a zone should 877: have a unique minor number. This is the 878: current behavior for native zones. 879: 880: }}} 881: 882: The other important aspect of device paths is how brand/zone-specific device paths are seen from the global zone. Currently, the only brand/zone-specific device that exists is the zcons console device.
Here are examples of zcons device paths that could exist in the global zone (with two native zones booted):

{{{

/dev/zcons/<zone1_name>/masterconsole ->
    /devices/pseudo/zconsnex@1/zcons@0:masterconsole

/dev/zcons/<zone1_name>/zoneconsole ->
    /devices/pseudo/zconsnex@1/zcons@0:zoneconsole

/dev/zcons/<zone2_name>/masterconsole ->
    /devices/pseudo/zconsnex@1/zcons@1:masterconsole

/dev/zcons/<zone2_name>/zoneconsole ->
    /devices/pseudo/zconsnex@1/zcons@1:zoneconsole

}}}

These device paths are very zcons-centric and don’t extend well if we attempt to introduce any new brand-specific devices. If we decide to add any new brand-specific devices, then these paths should probably be changed. For instance, if we add a Linux brand-specific driver that layers on top of {{code}}/dev/ptmx{{/code}}, then the following device paths might make more sense:

{{{

/dev/zones/<brand>/<zone_name>/masterconsole ->
    /devices/pseudo/zonesnex@1/zcons@0:masterconsole

/dev/zones/<brand>/<zone_name>/zoneconsole ->
    /devices/pseudo/zonesnex@1/zcons@0:zoneconsole

/dev/zones/<brand>/ptmx ->
    /devices/pseudo/ptm_linux@0:ptmx

}}}

==== 3.7.5 Device-Specific Issues

===== 3.7.5.1 /dev/console

Each branded zone will need its own console device, just like native zones today. Whenever possible, a brand should leverage the zcons driver and use it as the {{code}}/dev/console{{/code}} device in a non-native zone. We have done this in the //lx// brand.

===== 3.7.5.2 /dev/ptmx and /dev/pts

These devices need special management. When Linux applications access these devices, we need to do two things.

1. After an open of {{code}}/dev/ptmx{{/code}}, we need to ensure that the associated {{code}}/dev/pts/*{{/code}} device link exists (since it can be created asynchronously to ptmx opens), and that it has its ownership changed to match that of the process that opened this instance of {{code}}/dev/ptmx{{/code}}.
1. We need to ensure that after any pts device is opened by a Linux application, the following modules get pushed onto its stream: {{code}}ptem, ldterm, ttcompat, and ldlinux{{/code}}

In Solaris, issue 1 above is done by libc`ptsname() with the help of an su binary. Issue 2 above is done by client applications that are allocating terminals (for example, this is done in xterm`spawn()).

For BrandZ, there are two possible approaches for solving these problems.

The first is to continue with the initially prototyped approach and do both of these things post-open in an interception layer. For the ptm device (issue 1 above), this interception layer would have to be in the kernel, since it involves changing the ownership of a device and {{code}}/dev{{/code}} is mounted read-only in zones. The post-open pts device processing (issue 2 above) could be done in the kernel or in userland. This approach is unattractive because its implementation is spread across both userland and the kernel. It also involves post-processing all opens to determine if additional work is necessary.

We will adopt an alternate approach and replace the ptm driver with a layered driver in Linux branded zones. The current ptmx driver will be replaced with a self-cloning layered driver in the Linux zones. Upon open, this layered driver will:

* Open an instance of the real {{code}}/dev/ptmx{{/code}}.
* Wait for the corresponding {{code}}/dev/pts/*{{/code}} node to be created.
* Set the permissions on the corresponding {{code}}/dev/pts/*{{/code}} node.
* Set up the auto push mechanism (via kstr_autopush()) to automatically push the required strmods onto the corresponding {{code}}/dev/pts/*{{/code}} node when it is actually opened by a Linux application.

Upon final close of a Linux ptm node, this driver will remove the auto push configuration it created and close the underlying {{code}}/dev/ptmx{{/code}} node that it opened.

===== 3.7.5.3 /dev/fd/*

The entries in {{code}}/dev/fd{{/code}} are not actually devices. The entries in {{code}}/dev/fd/{{/code}} allow a process access to its own file descriptors via another namespace. Thus, opens of entries in this directory map to re-opens of the corresponding file descriptor in the current process.

In Solaris, {{code}}/dev/fd{{/code}} is implemented via a filesystem. A readdir(3C) of {{code}}/dev/fd{{/code}} might not return an accurate reflection of the current file descriptor state of a process, but opens of specific entries in the directory will succeed if that file descriptor is valid for the process performing the open.

In Linux, {{code}}/dev/fd{{/code}} is implemented as a symbolic link to {{code}}/proc/self/fd{{/code}}. This {{code}}/proc{{/code}} filesystem directory is similar to the Solaris {{code}}/proc/<pid>/fd{{/code}} directory in that it contains an accurate representation of a process’s current file descriptor state. But aside from just providing access to the process’s current file descriptors, on Linux the files in this directory are actually symbolic links to the underlying files referenced by the process’s file descriptors. This is similar to the functionality in Solaris provided by {{code}}/proc/<pid>/paths{{/code}}.

The most common uses for {{code}}/dev/fd{{/code}} entries are in suid shell scripts and as parameters to commands that don’t natively support I/O to {{code}}stdin/stdout{{/code}}.
Given these use cases, it seems that a simple mount of the existing Solaris {{code}}/dev/fd{{/code}} filesystem in the Linux zone should be sufficient for compatibility purposes. Of course, it is possible that other Linux applications exist that utilize some of the additional functionality of the Linux {{code}}/dev/fd{{/code}} implementation that isn’t available in Solaris. (For example, {{code}}/dev/fd{{/code}} on Linux provides an accurate reflection of the current fd state of the process and allows applications to determine file descriptor to file path mappings.) If such applications are discovered, then we might need to revisit this strategy.

===== 3.7.5.4 audio devices

Linux has two different audio subsystems: OSS and ALSA. To determine which audio subsystem to support, we identified some common/popular applications that utilize audio and checked which subsystem they use. We found:

{{{

OSS only:
    skype, real/helix media player, flash, quake, sox

OSS or ALSA:
    xmms (selectable via plugins)

}}}

Our survey identified no popular applications that require ALSA, so we will only be supporting OSS audio.

Audio device access on Linux and Solaris is done via reads, writes, and ioctls to different devices.

{{{

OSS devices:
    /dev/dsp, /dev/mixer
    /dev/dsp[0-9]+, /dev/mixer[0-9]+

Solaris devices:
    /dev/audio, /dev/audioctl
    /dev/sound/[0-9]+, /dev/sound/[0-9]+ctl

}}}

Unfortunately, we can’t simply map the Solaris {{code}}/dev/audio{{/code}} and {{code}}/dev/audioctl{{/code}} devices to {{code}}/dev/dsp{{/code}} and {{code}}/dev/mixer{{/code}} devices in a Linux zone and expect the ioctl translator to handle everything else for us. Some of the reasons for this are:

* The admin/user may not always want a Linux branded zone to have access to system audio devices.
* There may be multiple audio devices on a system, each of which may support only input, only output, or both input and output. In Solaris a user can specify which audio device an application should access by providing a {{code}}/dev/sound/*{{/code}} path to the desired device. But in the Linux zone the admin might want the Linux audio device to map to separate Solaris audio devices for input and/or output.
* Linux ioctl translation is done using dev_t major values. On Solaris, opening {{code}}/dev/audio{{/code}} will result in accessing different device drivers based on what the underlying audio hardware is, and these different drivers may have different dev_t values. Hence, if audio devices were directly imported, the dev_t translator would need to have knowledge of every potential audio device driver on the system, and as new audio drivers are added to the system this translator would need to be made aware of them as well.
* In Linux, audio devices are character devices and support mmap operations. On Solaris, audio devices are streams based and do not support mmap operations.

To deal with these problems the following components are provided:

* A way for the user to enable audio support in a zone via zonecfg. The user enables audio via a zonecfg boolean attribute called "audio". (The absence of this attribute implies a value of false.) Adding this resource to a zone via zonecfg looks like this:

{{{
~--
zonecfg:centos> add attr
zonecfg:centos:attr> set name="audio"
zonecfg:centos:attr> set type=boolean
zonecfg:centos:attr> set value=true
zonecfg:centos:attr> end
zonecfg:centos> commit
zonecfg:centos> exit
~--
}}}

By default, when a Linux audio application attempts to open {{code}}/dev/dsp{{/code}}, this access is mapped to {{code}}/dev/audio{{/code}} under Solaris.
(Linux application accesses to {{code}}/dev/mixer{{/code}} are mapped to {{code}}/dev/audioctl{{/code}}.)
To allow an admin to control which Solaris devices a Linux zone can send input/output to, we provide two additional attributes that have string values:

{{{
audio_inputdev = none | [0-9]+
audio_outputdev = none | [0-9]+
}}}

If audio_inputdev is set to none, then audio input is disabled. If audio_inputdev is set to an integer, then when a Linux application attempts to open {{code}}/dev/dsp{{/code}} for input this access is mapped to {{code}}/dev/sound/<audio_inputdev attribute value>{{/code}}. The same behavior applies to audio_outputdev for Linux audio output accesses.
If audio_inputdev or audio_outputdev exist but the audio attribute is missing (or set to false), audio will not be enabled for the zone.
Currently there is no mechanism outside of zonecfg itself for verifying zone attributes. So if a user specifies invalid types or values for these attributes, there is no way to alert them during the zone configuration stage. (Later, when attempting to boot the zone, we will have an opportunity to verify these attributes and can fail to boot the zone if they are invalid.)
* Create a new layered driver to act as a Linux audio device.
This device will always be mapped into the zone. Linux can change the ownership of this device node as it sees fit.
Since this layered driver will be accessing Solaris devices from within the kernel, there will be no problems with device ownership. The Solaris audio devices will continue to be owned by whoever is logged in on the console, but a Linux zone will also be able to access the device whenever necessary.
Luckily, the Solaris audio architecture includes an integrated audio output stream mixer so that multiple programs can open the audio device and output data at the same time.
Unfortunately, it’s not possible to virtualize all audio features in this same manner. For instance, there can only be one active audio recording stream on a given audio device at a time. It would also be difficult to virtualize global mixer controls. Hence, by allowing a zone shared access to audio devices owned by the console user, it would be possible for zone users and the console user to compete for any non-virtualized audio resources. These limitations seem acceptable if it’s assumed that the user who owns the console is the same user who is running Linux audio applications. We assume that this will be the common case for most audio-enabled Linux zones. If these limitations are not acceptable, then an admin always has the option of updating {{code}}/etc/logindevperm{{/code}} to deny console users access to audio devices (thereby giving a zone exclusive access to the device).
This device is implemented as a character driver (instead of as a streams driver). This allows it to more easily support Linux memory-mapped audio device access. Theoretically, mmap device access could be simulated in the ioctl translation layer, but this would add quite a bit of extra complexity to the translation system. It would require that the ioctl translation layer start to maintain state across multiple ioctl operations, something it currently does not do. It would also require userland helper threads and support for handling and redirecting signals that might get delivered to those threads.
* Provide hooks into the zone state transition brand callback mechanism to propagate zone audio device configuration to the Linux layered audio device.
This is done via a program that opens the layered Linux audio device when the zone is booted and uses ioctls to configure the device. When a zone is halted, this same program is used to notify the driver that any previous configuration for a given zone is no longer valid.
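The attribute rules described above (the "audio" boolean gating everything, and audio_inputdev / audio_outputdev taking either none or a device number) can be sketched as a small resolver. This is only an illustration of the mapping rules as stated in this document: the function name, the enum, and the return conventions are hypothetical and are not taken from the actual //lx// brand sources.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes; not part of any real BrandZ interface. */
enum audio_map {
	AUDIO_MAP_OK = 0,       /* path was filled in */
	AUDIO_MAP_DISABLED = 1, /* audio (or this direction) is disabled */
	AUDIO_MAP_INVALID = -1  /* attribute value is malformed */
};

/*
 * Resolve the Solaris device that /dev/dsp should map to for one
 * direction (input or output).
 *
 *   audio_enabled - value of the zone's boolean "audio" attribute
 *                   (a missing attribute counts as false)
 *   attr          - value of audio_inputdev or audio_outputdev,
 *                   or NULL if the attribute is not set
 */
int
audio_resolve(int audio_enabled, const char *attr, char *path, size_t len)
{
	const char *p;
	int n;

	if (!audio_enabled)
		return (AUDIO_MAP_DISABLED);

	if (attr == NULL) {
		/* No override: use the default /dev/audio mapping. */
		n = snprintf(path, len, "/dev/audio");
	} else if (strcmp(attr, "none") == 0) {
		return (AUDIO_MAP_DISABLED);
	} else {
		/* Anything else must match [0-9]+ */
		if (*attr == '\0')
			return (AUDIO_MAP_INVALID);
		for (p = attr; *p != '\0'; p++) {
			if (!isdigit((unsigned char)*p))
				return (AUDIO_MAP_INVALID);
		}
		n = snprintf(path, len, "/dev/sound/%s", attr);
	}

	if (n < 0 || (size_t)n >= len)
		return (AUDIO_MAP_INVALID);
	return (AUDIO_MAP_OK);
}
```

A check of this kind would be one place to reject invalid attribute values at zone boot, since zonecfg itself cannot validate them at configuration time.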

This model provides each Linux zone with one audio device. There are no inherent limits in this approach that would prevent future support for multiple virtual audio devices in a Linux zone. The {{code}}zonecfg{{/code}} attribute configuration mechanism could be extended to allow the administrator to specify what the mappings for additional audio devices should be. The audio layered driver could also be enhanced to export multiple audio and mixer minor nodes, which would all be imported into any Linux zone on the system. Lastly, the zone boot configuration callbacks could be enhanced to parse the additional {{code}}zonecfg{{/code}} audio configuration attributes and pass the additional configuration information down to the audio layered driver.

=== 3.8 NFS

The BrandZ //lx// brand will have NFSv3/2 client support. NFSv4 is not supported, since the version of RedHat/CentOS that the //lx// brand currently supports does not support NFSv4 itself. Having an NFSv3/2 server in a branded zone is also not supported, since this is an existing limitation of native zones and BrandZ is not removing this limitation.

==== 3.8.1 NFSv3/2 Client Support

NFS client support consists of two major components:

* A kernel filesystem module that can "mount" nfs shares from a server and service accesses to that filesystem via VOP_ interfaces
* {{code}}lockd{{/code}} and {{code}}statd{{/code}} rpc services to handle nfs locking requests.

Due to licensing issues we can’t port the Linux NFS kernel filesystem module to Solaris.

On Solaris, {{code}}lockd{{/code}} is a userland daemon with a significant kernel component: {{code}}klmmod{{/code}}. Most of the {{code}}lockd{{/code}} functionality is actually implemented in the kernel in {{code}}klmmod{{/code}}.
The kernel {{code}}lockd{{/code}} component also uses a private, undocumented interface (NSM_ADDR_PROGRAM) to communicate with {{code}}statd{{/code}}.

On Linux, {{code}}lockd{{/code}} is actually entirely contained within the kernel. When the kernel starts up the {{code}}lockd{{/code}} services, it creates a fake process that is visible via the {{code}}ps{{/code}} command but lacks most normal /proc style entries.

Given how closely integrated the separate components of the NFS client are on Solaris, and given that most of the NFS client on Linux is in the kernel and therefore not usable by the //lx// brand, the approach taken to support the NFS client in the //lx// brand was to simply run the Solaris NFS client within the //lx// zone.

Adding support for using all the Solaris NFS client components in a zone involved modifications in BrandZ, the //lx// brand, and base Solaris. Some of these areas and the modifications that were required are described below.

===== 3.8.1.1 {{code}}mount(2){{/code}} support

The first component of supporting NFS client operations in an //lx// branded zone is translating Linux NFS mount system call requests into Solaris mount system call requests. This requires translating the arguments and option strings normally passed to the Linux kernel into the formats that the Solaris kernel is expecting.

===== 3.8.1.2 Starting and stopping of the {{code}}lockd{{/code}} and {{code}}statd{{/code}} daemons.

The version of RedHat/CentOS Linux that the //lx// brand supports has an {{code}}rc.d{{/code}} startup script that is used to start, stop, and check the status of {{code}}statd{{/code}} and {{code}}lockd{{/code}}.
To support the execution of the Solaris versions of {{code}}lockd{{/code}} and {{code}}statd{{/code}} in a zone, after install we replace the Linux {{code}}lockd{{/code}} and {{code}}statd{{/code}} binaries with symlinks to scripts in {{code}}/native/usr/lib/lx/brand{{/code}} that start the Solaris versions of the two daemons. The startup script has also been modified to allow for the startup, stopping, and querying of status for {{code}}lockd{{/code}} and {{code}}statd{{/code}}. This approach preserves the existing techniques for administering NFS locking under RedHat/CentOS in the //lx// brand.

{{code}}lockd{{/code}} and {{code}}statd{{/code}} are also run under different user and group ids under Linux than under Solaris. The startup wrapper scripts pass command line options to {{code}}lockd{{/code}} and {{code}}statd{{/code}} to indicate what user and group should be used during normal operations.

===== 3.8.1.3 Running native Solaris processes in a zone

Normally all processes in a non-native zone are branded non-native processes. NFS client support is the only exception to this rule, since it introduces the execution of two native Solaris processes into a non-native zone. To support the basic execution of a native Solaris process in a non-native zone, the following steps were taken.

===== 3.8.1.3.1 Brand aware exec path

The internal kernel exec path (starting at {{code}}exec_common(){{/code}}) was updated to have a new brand operation flag. The possible values for this flag are: clear (no special brand operation), brand-exec (indicates that the current process should become branded if the exec operation succeeds), and native-exec (indicates that the current process should become a native process if the exec operation succeeds).

Note that these extended exec flags are not accessible through the normal exec() system call path. The normal exec() system call path always defaults to this flag being clear.
To change the branding of a process via exec, a special brand operation (invoked via the {{code}}brandsys(2){{/code}} system call) is used.

===== 3.8.1.3.2 Chroot

One problem with running native Solaris binaries in a branded zone is that both the native binaries and the native libraries that they use expect to be able to access native Solaris paths and files that may not exist inside a branded zone. Rather than implementing a path mapping mechanism to redirect filesystem accesses for native binaries to paths under /native, during the startup of these daemons we do a chroot("/native"). We’ve also ensured that there is enough of the native Solaris environment created in /native to allow {{code}}lockd{{/code}} and {{code}}statd{{/code}} to run properly.

===== 3.8.1.4 Allowing {{code}}lockd{{/code}} and {{code}}statd{{/code}} to communicate with Linux services/interfaces within the zone.

{{code}}lockd{{/code}} and {{code}}statd{{/code}} are fairly self-contained, but they do require access to certain services for which the native Solaris versions won’t be available in a zone. An audit of {{code}}lockd{{/code}} and {{code}}statd{{/code}} reveals that these daemons depend on access to the following services:

* naming services (via libnsl.so)
* syslog (via libc.so)

Normally, these daemons simply access these services via local libraries. These libraries in turn use local files, other libraries, and various network-based resources to resolve requests. In a branded zone most of these resources will not be available. For example, we can’t expect the Solaris libnsl.so library to know how to parse Linux NIS configuration files.

To handle these requests we need to be able to leverage existing Linux services and interfaces.
This requires translating certain Solaris {{code}}lockd{{/code}} and {{code}}statd{{/code}} service requests into Linux service requests, and then translating any results back into a format that Solaris libraries and utilities are expecting. In the //lx// brand we’ve decided to call this process of translating service requests //thunking// (akin to a 32-bit OS calling into 16-bit BIOS code). To service these requests we have created a thunking layer which translates Solaris calls into Linux calls.

This thunking layer works as follows:

1. When {{code}}lockd{{/code}} or {{code}}statd{{/code}} make a request that requires thunking, this request ends up getting directed into a library in the process called lx_thunk.so (the mechanism used to direct requests into this library varies based on the type of request being serviced and is discussed further below).
1. The lx_thunk.so library packs up this request and sends it via a door to a child Linux process called lx_thunk.
1. If the lx_thunk process does not exist, then the lx_thunk.so library will {{code}}fork(2)/exec(2){{/code}} it.
1. The lx_thunk process is a one-line /bin/sh script that attempts to execute itself and is executed in a Linux shell. When the brand emulation library (lx_brand.so) detects that it is executing as the lx_thunk process and it is attempting to re-exec itself, the library takes over the process and sets itself up as a doors server.
1. When the lx_thunk process receives a door request from the lx_thunk.so library in a native process, it unpacks the request and uses a Linux thread to invoke Linux interfaces to service the request.
1. Once it is done servicing the request, it packs up any results and returns them via the same door call that it received the request on.
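The "pack up the request" and "unpack the request" steps above can be illustrated with a minimal marshalling sketch. The wire layout, operation codes, and function names below are assumptions made purely for illustration; this document does not specify the real lx_thunk.so request format, and the door transport itself is omitted so that only the flat-buffer encoding is shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation codes for requests that need thunking. */
enum thunk_op {
	THUNK_GETHOSTBYNAME = 1,
	THUNK_SYSLOG = 2
};

/* Hypothetical fixed header that precedes the string argument. */
typedef struct thunk_hdr {
	uint32_t th_op;      /* one of enum thunk_op */
	uint32_t th_arg_len; /* argument length, including the NUL */
} thunk_hdr_t;

/*
 * Client side: flatten (op, arg) into buf so that it could be handed
 * to a door call. Returns the number of bytes used, or 0 if the
 * request does not fit.
 */
size_t
thunk_pack(char *buf, size_t buflen, uint32_t op, const char *arg)
{
	thunk_hdr_t hdr;
	size_t arg_len = strlen(arg) + 1;

	if (arg_len > UINT32_MAX || buflen < sizeof (hdr) ||
	    arg_len > buflen - sizeof (hdr))
		return (0);

	hdr.th_op = op;
	hdr.th_arg_len = (uint32_t)arg_len;
	memcpy(buf, &hdr, sizeof (hdr));
	memcpy(buf + sizeof (hdr), arg, arg_len);
	return (sizeof (hdr) + arg_len);
}

/*
 * Server side: validate a received buffer and return a pointer to the
 * NUL-terminated argument inside it, or NULL if it is malformed.
 */
const char *
thunk_unpack(const char *buf, size_t len, uint32_t *op)
{
	thunk_hdr_t hdr;

	if (len < sizeof (hdr))
		return (NULL);
	memcpy(&hdr, buf, sizeof (hdr));

	if (hdr.th_arg_len == 0 || hdr.th_arg_len > len - sizeof (hdr))
		return (NULL);
	if (buf[sizeof (hdr) + hdr.th_arg_len - 1] != '\0')
		return (NULL);

	*op = hdr.th_op;
	return (buf + sizeof (hdr));
}
```

Keeping the request self-describing (an explicit length plus a terminator check) lets the server reject truncated or corrupt buffers before it calls into any Linux interface.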

This thunking layer means that the //lx// brand is now dependent upon Linux interfaces, so we need to worry about Linux interfaces changing and breaking the lx_thunk server process. To help avoid this possibility, most of the Linux interfaces that we’ve chosen to use are extremely well known and listed in the glibc ABI. All of the interfaces are used by many applications outside of glibc. Here are the Linux interfaces currently used by the lx_thunk process:

* {{code}}gethostbyname_r{{/code}}
* {{code}}gethostbyaddr_r{{/code}}
* {{code}}getservbyname_r{{/code}}
* {{code}}getservbyport_r{{/code}}
* {{code}}openlog{{/code}}
* {{code}}syslog{{/code}}
* {{code}}closelog{{/code}}
* {{code}}__progname{{/code}}

Also worth mentioning is the means by which service requests that require thunking are directed to lx_thunk.so. To intercept name service requests, the //lx// brand is introducing a new libnsl.so plugin name-to-address translation library. Libnsl already supports name-to-address translation plugin libraries that can be specified via netconfig(4). For //lx// branded zones there will be a custom netconfig file installed into /native/etc/netconfig that will instruct libnsl.so to redirect name service lookup requests to a new library called lx_nametoaddr.so. This library will in turn resolve name service requests using private interfaces exported from the thunking library, lx_thunk.so.

===== 3.8.1.5 rpcbind vs portmap

{{code}}lockd{{/code}} and {{code}}statd{{/code}} are both rpc-based services. Therefore, when they start up they must register with an rpc portmapper. This registration is done via a standardized rpc protocol. The difficulty with this registration is that there are multiple versions of the protocol.

Initially the protocol was called the portmapper protocol and only supported IP-based transports.
This is the version of the protocol that RedHat/CentOS support. Solaris long ago upgraded the version of the protocol it supports to rpcbind, which supports registrations on non-IP-based transports. The problem faced by BrandZ is that all the libnsl.so interfaces designed for talking to a local portmapper assume that the local portmapper supports the rpcbind protocol. In the case of an //lx// branded zone we’re actually running the Linux portmapper, which doesn’t support the rpcbind protocol. To work around this, a new command line flag has been added to {{code}}lockd{{/code}} and {{code}}statd{{/code}} to indicate that they should attempt to use the portmapper protocol instead of the rpcbind protocol for registering rpc services. Also, new private interfaces have been added to libnsl.so to allow it to support communication via the portmapper protocol instead of the rpcbind protocol.

===== 3.8.1.6 Privilege retention for {{code}}lockd{{/code}} and {{code}}statd{{/code}}

On Solaris, {{code}}lockd{{/code}} and {{code}}statd{{/code}} are privilege-aware daemons, and upon startup they drop all privileges they won’t need for normal execution. When running in a Linux zone, these daemons need additional privileges so that they can {{code}}chroot(2){{/code}}, {{code}}fork(2){{/code}}, {{code}}exec(2){{/code}}, and run the lx_thunk process. So the lx_thunk.so library, which is preloaded into these processes, also prevents them from dropping the following privileges:

* {{code}}proc_exec{{/code}}
* {{code}}proc_fork{{/code}}
* {{code}}proc_chroot{{/code}}
* {{code}}file_dac_read{{/code}}
* {{code}}file_dac_write{{/code}}
* {{code}}file_dac_search{{/code}}

==== 3.8.2 Automounters

Linux supports two automounters: "amd" and "automount".

amd is implemented as a userland NFS server.
It mounts NFS filesystems on directories where it will provide automount services, and specifies itself as the server for these NFS filesystems. Supporting amd only required adding translation support for all the Linux mount system call options it expects to work.

automount, the more common (and often default) automounter, is substantially more complex than amd. automount relies on a filesystem module called autofs. Upon startup, automount mounts the autofs filesystem onto all automount-controlled directories. As an option to the mount command it passes a file descriptor that indicates a pipe that will be used to send requests to the automount process. automount listens for requests on this pipe. When it gets a request, it looks up shares via whatever name services are configured, executes {{code}}mount(2){{/code}} system calls as necessary, and notifies the autofs filesystem that a request has been serviced. The exact semantics of the interfaces between automount and autofs are versioned and appear to differ based on the Linux kernel version. To support automount, the //lx// brand will introduce a new filesystem module called lx_autofs. When the automount process attempts to mount the autofs filesystem, we will instead mount the lx_autofs filesystem, which will emulate the behavior of one specific version of the autofs filesystem.

=== 3.9 Debugging and Observability

==== 3.9.1 Ptrace

In order to support Linux {{code}}strace{{/code}}, it will be necessary to duplicate almost the full functionality of the {{code}}ptrace{{/code}} system call. Rather than trying to implement this giant wad in the kernel, we will implement it in userland on top of a native {{code}}/proc{{/code}}, which will be mounted at {{code}}/native/proc{{/code}}. It is worth noting that the Solaris {{code}}/proc{{/code}} already has a ptrace compatibility flag, which provides much of the semantics we want (e.g., interaction through wait(2)).

Two difficult parts of ptrace are tracing Linux system calls and attaching to existing processes.

To implement system call tracing, we want to stop the program in userland before the interposition library has done any work. To do this, we introduce a per-brand scratch space in the ulwp_t structure, similar to that used by the DTrace PID provider, and use it to hold a tracing flag. When we want to trace a system call entry or exit, we set this flag from within the kernel via a brand-specific system call. In the trampoline code, we check the status of this flag and issue another brand-specific system call that will stop us with an appropriately formed stack. Note that it’s generally not possible to "hide" the interposition library when it comes to signals. Because we would first have to figure out where we are in the brand library (if we’re in it at all), we would probably do more harm than good by trying to hide this behavior.

When a debugger attaches to another process using PTRACE_ATTACH under Linux, it actually becomes the parent of the target process in all ways, except that getppid() still returns the original value. Since this is an implementation detail and an undesirable pollution of the Solaris process model, we will instead add a per-brand child list that the wait(2) system call will also check. When we attach to a process, we add it to the debugger’s brand list.

Finally, there are also significant issues around multithreaded apps. Trying to reconcile the differences between the Linux and Solaris threading models as well as the differences between {{code}}ptrace(2){{/code}} and {{code}}/proc{{/code}} appears to be a nearly intractable problem. For at least the initial BrandZ release, we do not expect to be able to support {{code}}ptrace(2){{/code}} for multithreaded applications.

==== 3.9.2 librtld_db: MDB and Solaris ptools Support

Since the Linux binaries and applications are standard ELF objects, Solaris tools are able to process them in essentially the same way that Solaris binaries are processed. The main objective in doing so is to retrieve symbols and thereby aid debugging and observability.

mdb and the ptools (pstack, pmap, etc.) use interfaces provided by {{code}}librtld_db{{/code}} to debug live processes and core files. {{code}}librtld_db{{/code}} discovers ELF objects which have been mapped into the target’s address space (a target can be either a live process or a core file) and reports these back to the {{code}}librtld_db{{/code}} client. The {{code}}librtld_db{{/code}} library understands enough of the internals of the Solaris runtime linker to iterate over the linker’s private link maps and process the objects it finds. {{code}}librtld_db{{/code}} allows our tools to debug the Solaris portions of a branded process (such as the brand library, libc, etc.), but they cannot understand any Linux objects that are mapped into the address space, because the Solaris linker only has Solaris objects on its link maps.

In order to give Solaris tools visibility into Linux binaries, a brand helper-library framework is implemented in {{code}}librtld_db{{/code}}. When {{code}}librtld_db{{/code}} is asked to examine a branded target process or core file, it uses the AT_SUN_BRANDNAME aux vector to get the brand name of the process. Once it has the brand name, it {{code}}dlopen(3C){{/code}}s a shared object from: {{code}}/usr/lib/brand/<brandname>/<brandname>_librtld_db.so.1{{/code}}

Brand helpers must provide a vector of operations that are called from critical {{code}}librtld_db{{/code}} hooks (e.g. helper initialization, loaded object iteration, etc.).
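As a rough sketch of the two ideas above, the following shows how a helper path could be composed from a brand name (the brand directory under /usr/lib/brand, followed by a file named after the brand with the suffix _librtld_db.so.1) and what a helper operations vector might look like. The structure, its member names, and the function name are hypothetical and chosen only for illustration; the real {{code}}librtld_db{{/code}} helper interface may differ.

```c
#include <stdio.h>

#define	BRAND_DIR	"/usr/lib/brand"

/* Hypothetical callback invoked once per object the brand's linker loaded. */
typedef int (*helper_obj_cb_t)(const char *objname, void *arg);

/*
 * Hypothetical operations vector exported by a brand helper library.
 * librtld_db would call these from its own hooks.
 */
typedef struct brand_helper_ops {
	/* Set up helper state for a target process or core file. */
	void *(*bho_init)(void *target);
	/* Tear down state created by bho_init. */
	void (*bho_fini)(void *helper_data);
	/* Report each brand-specific loaded object to the callback. */
	int (*bho_loadobj_iter)(void *helper_data, helper_obj_cb_t cb,
	    void *arg);
} brand_helper_ops_t;

/*
 * Build the path of the helper library for the given brand name.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int
brand_helper_path(const char *brand, char *buf, size_t len)
{
	int n = snprintf(buf, len, "%s/%s/%s_librtld_db.so.1",
	    BRAND_DIR, brand, brand);

	if (n < 0 || (size_t)n >= len)
		return (-1);
	return (0);
}
```

For the //lx// brand, the resulting path would be passed to {{code}}dlopen(3C){{/code}}, after which the operations vector would be looked up in the loaded object.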

Once loaded, the helper library is responsible for finding any brand-specific information it needs, such as the brand equivalent of {{code}}AT_SUN_LDDATA{{/code}} (used by {{code}}librtld_db{{/code}} to find the Solaris link maps), and preparing to return details about the objects loaded into the address space by the brand’s linker.

When a client of {{code}}librtld_db{{/code}} operating on a branded process asks what objects are loaded in the target, {{code}}librtld_db{{/code}} walks the Solaris link maps and iterates over each object it finds there, handing information about each to the caller. It then calls down into the helper library, which does the same for the brand-specific objects used by the target. In this manner, the client of {{code}}librtld_db{{/code}} does not need any modification to operate on branded processes, nor does it need any special knowledge of brands; all data is passed via established {{code}}librtld_db{{/code}} APIs.

Because {{code}}mdb{{/code}} and the ptools are Solaris binaries, they cannot run inside a non-Solaris branded zone. They must therefore be run from the global zone, where they can operate on individual processes within the branded zone.

==== 3.9.3 Core Files

Since dumping core is handled by the kernel, it will produce Solaris core files that cannot be understood by the native Linux tools. While it would be possible to provide a brand-specific tool to convert between Solaris core files and Linux core files, such a tool will not be provided as part of the initial release. Unless there is significant customer demand for it, it is unlikely to be included in subsequent releases either. Our preferred method for examining the core files of Linux processes is {{code}}mdb{{/code}}.

==== 3.9.4 DTrace

With the addition of brand support in {{code}}librtld_db.so{{/code}}, the DTrace PID provider is able to instrument Linux processes correctly with no additional work.

We have also added an {{code}}lx-syscall{{/code}} provider, which allows DTrace to trace the Linux system calls issued by the application.

=== 3.10. /proc

The //lx// brand will deliver an lx_proc kernel module that provides the necessary semantics of a Linux /proc filesystem.

Linux tends to use /proc as a dumping ground for all things system-related, although the introduction of sysfs in the 2.6 kernel reduces this. As a result, there are a large number of elements that we will not be able to emulate from within a zone. Examples of unsupported functionality include physical device characteristics, the USB device tree, access to kernel memory, etc. Because various commands expect these files to be present, but do not actually act on their contents, a number of these files will exist but otherwise be empty.

We are able to emulate the per-process directories completely. The following table shows the support status of other /proc system files.

|=File|=Supported|=Description
| cmdline | empty | Kernel command line options
| cpuinfo | empty | Physical chip characteristics
| crypto | no | Kernel crypto module info
| devices | empty | Major number mappings
| dma | empty | ???
| driver/* | no | Per-driver configuration settings
| execdomains | no | ???
| fb | no | Framebuffer device
| filesystems | empty | Available kernel filesystems
| fs/* | no | ???
| ide/* | no | Kernel IDE driver info
| interrupts | empty | Kernel interrupt table
| iomem | no | I/O memory regions
| ioports | empty | I/O port bindings
| irq/* | no | ???
| kcore | empty | Kernel core image
| kmsg | empty | Kernel message queue
| ksyms | no | Kernel symbols
| loadavg | yes | System load average
| locks | no | ???
| mdstat | no | ???
| meminfo | yes | Virtual memory information
| misc | no | ???
| modules | no | Random module information
| mounts | yes | System mount table
| mpt/* | no | ???
| mtrr | no | ???
| net/* | no* | Various network configuration
| partitions | empty | Partition table
| pci | no | PCI device info
| scsi/* | no | SCSI device info
| slabinfo | no | Kernel slab allocator stats
| stat | yes | General system-wide statistics
| swaps | no | Swap device information
| sys/* | no | ???
| sysrq-trigger | no | ???
| sysvipc/* | no | System V IPC statistics
| tty/* | no | ???
| uptime | yes | System uptime
| version | empty | Kernel version

=== 4. Deliverables

==== 4.1 Source delivered into ON

Below is a summary of the new sources being added to ON.

===== New code for brand support:

| usr/src/lib/libbrand | support for reading the new XML files defining brands and virtual platforms
| usr/src/lib/brand/native/ | virtual platform and template files for non-branded zones
| usr/src/uts/common/os/brand.c | manages the BrandZ framework, tracks loaded brands, and so on
| usr/src/uts/common/syscall/brandsys.c | the brandsys() system call

===== New directories created as holding areas for per-brand code:

| usr/src/lib/brand | for the userspace components of brands
| usr/src/uts/common/brand | for platform-independent brand code
| usr/src/uts/intel/brand | for Intel-specific brand code
| usr/src/uts/sparc/brand | for SPARC-specific brand code

===== For the Solaris 10 ’test’ brand:

| usr/src/pkgdefs/SUNWs10r | The package containing all the kernel-space pieces of the //s10// brand
| usr/src/pkgdefs/SUNWs10u | The package containing all the user-space pieces of the //s10// brand
| usr/src/lib/brand/s10/s10_brand | contains the emulation code and the zones integration support
| usr/src/uts/common/brand/s10 | source for kernel-level //s10// support
| usr/src/uts/intel/s10_brand | where the //s10// Intel brand module is built
| usr/src/uts/sparc/s10_brand | where the //s10// SPARC brand module is built

===== For the //lx// brand:

| usr/src/pkgdefs/SUNWlxr | The package containing the kernel-space components of the //lx// brand
| usr/src/pkgdefs/SUNWlxu | The package containing the user-space components of the //lx// brand
| usr/src/lib/brand/lx/lx_brand | contains the emulation code, as well as the zones integration support (install scripts and so on)
| usr/src/lib/brand/lx/librtld_db | the rtld_db plugin for the //lx// brand
| usr/src/uts/common/brand/lx | source for kernel-level //lx// support
| usr/src/uts/common/brand/lx/procfs | source for the //lx// /proc
| usr/src/uts/common/brand/lx/dtrace | source for the //lx// syscall provider
| usr/src/uts/intel/lx_brand | where the //lx// brand module is built
| usr/src/uts/intel/lx_syscall | where the //lx// syscall provider is built
| usr/src/uts/intel/lx_proc | where the //lx// /proc filesystem is built

==== 4.2 Components installed with the //lx// brand

===== Userspace components:

| usr/lib/brand/lx/lx_brand.so.1 | //lx// emulation library
| usr/lib/brand/lx/[amd64]/lx_librtld_db.so.1 | //lx// rtld_db plugin
| usr/lib/brand/lx/[amd64]/lx_thunk.so | //lx// thunking support
| usr/lib/brand/lx/[amd64]/lx_nametoaddr.so.1 | Wrapper for Solaris name services
| usr/lib/brand/lx/config.xml | Definition of the //lx// brand
| usr/lib/brand/lx/platform.xml | Definition of the //lx// virtual platform
| usr/lib/brand/lx/SUNWblank.xml | Values for a blank zone configuration
| usr/lib/brand/lx/SUNWdefault.xml | Values for a default zone configuration
| usr/lib/brand/lx/install | The scripts needed to install supported distros
| usr/lib/brand/lx/lx_audio_config | Audio device configuration
| usr/lib/brand/lx/lx_lockd | Wrapper for Solaris lockd
| usr/lib/brand/lx/lx_statd | Wrapper for Solaris statd
| usr/lib/brand/lx/lx_native | Wrapper for native Solaris processes
| usr/lib/brand/lx/lx_boot | Boot-time script
| usr/lib/brand/lx/lx_halt | Halt-time script
| usr/lib/brand/lx/lx_install | Main //lx// install script
| usr/lib/devfsadm/linkmod/SUNW_lx_link_i386.so | Brand-aware devfsadm link module

===== Kernel components:

| kernel/brand/[amd64]/lx_brand | //lx// brand kernel module
| kernel/drv/[amd64]/lx_audio | //lx// Linux-compatible audio driver
| kernel/drv/[amd64]/lx_systrace | //lx// syscall provider
| kernel/drv/[amd64]/lx_ptm | //lx// Linux-compatible ptm driver
| kernel/fs/[amd64]/lx_afs | //lx// automounter support
| kernel/fs/[amd64]/lx_proc | //lx// /proc filesystem
| kernel/strmod/[amd64]/ldlinux | //lx// Linux-compatible ldterm module