CPU Power Management
OpenSolaris supports Dynamic Voltage and Frequency Scaling (DVFS) across a range of Intel- and AMD-based processors. DVFS lets a processor operate across a range of clock frequencies and voltages, allowing performance to be traded off against power consumption. Intel's "Enhanced SpeedStep Technology" and AMD's "PowerNow!" are examples of DVFS. On x86 architectures, DVFS features are exposed to the OpenSolaris kernel through ACPI Performance States (P-States), where a P-State is an abstraction for a power/performance state. Changing P-States typically causes the processor to change its operating voltage and frequency, resulting in a corresponding change in performance and power consumed.
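The performance/power trade-off can be illustrated with the usual first-order approximation that dynamic CMOS power scales with V² × f. The sketch below is illustrative only: the `PState` type, the table entries, and the frequency and voltage values are hypothetical and are not taken from any real processor or from the OpenSolaris sources. Static (leakage) power is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    name: str
    freq_mhz: int    # core clock frequency
    voltage: float   # core voltage in volts


# Hypothetical P-state table; P0 is the highest-performing state.
PSTATES = [
    PState("P0", 2400, 1.20),
    PState("P1", 1800, 1.05),
    PState("P2", 1200, 0.90),
]


def relative_dynamic_power(p: PState, ref: PState) -> float:
    """Dynamic power of p relative to ref, using P ~ C * V^2 * f.

    The switched capacitance C is the same for both states and cancels out.
    """
    return (p.voltage ** 2 * p.freq_mhz) / (ref.voltage ** 2 * ref.freq_mhz)


def relative_performance(p: PState, ref: PState) -> float:
    """Crude performance estimate: the ratio of clock frequencies."""
    return p.freq_mhz / ref.freq_mhz
```

With these made-up numbers, P2 runs at half the clock rate of P0 but at roughly 28% of its dynamic power, which is why lowering voltage together with frequency pays off more than lowering frequency alone.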
The ACPI standard also defines processor power states (C-States), where a C-State is an abstraction for a state that a processor may enter while idle. Entering a C-State typically causes the processor to suspend instruction execution. Because the processor is not doing any work, it can power down various micro-architectural components, allowing voltage to drop further than would be possible in active operation. Deeper C-States consume less power, at the cost of increased latency to return to active operation.
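One simple way to reason about the power/latency trade-off is to pick the deepest C-State whose exit latency still fits within what the system can tolerate. The sketch below assumes exactly that policy; the `CState` type, the `pick_cstate` helper, and the latency and power figures are hypothetical and do not describe the actual OpenSolaris idle loop.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CState:
    name: str
    exit_latency_us: int  # time needed to return to active execution
    power_mw: int         # power drawn while resident in this state


# Hypothetical table, ordered from shallowest to deepest.
CSTATES = [
    CState("C1", 1, 1000),
    CState("C2", 20, 500),
    CState("C3", 100, 100),
]


def pick_cstate(states: Sequence[CState], latency_tolerance_us: int) -> CState:
    """Return the deepest state whose exit latency fits the tolerance.

    Assumes `states` is ordered shallowest to deepest. Falls back to the
    shallowest state when nothing fits.
    """
    chosen = states[0]
    for state in states:
        if state.exit_latency_us <= latency_tolerance_us:
            chosen = state
    return chosen
```

For example, `pick_cstate(CSTATES, 50)` selects C2 with this table: C3 would save more power, but its 100 µs exit latency exceeds the 50 µs tolerance.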
This page describes the Tesla team's project work, which collectively provides OpenSolaris with a next-generation CPU power management architecture that is fully event-driven, predictive, adaptive, and integrated with the kernel's thread scheduler/dispatcher subsystems.
Projects
OpenSolaris Power Aware Dispatcher (PAD)

CPU power management as it is implemented today is relatively isolated from the rest of the system. As such, it is forced to poll periodically to measure the utilization of the system's CPU resources. When CPU utilization drops to a sufficiently low level, the power state (P-State) of the CPU is lowered. Likewise, when a CPU's utilization increases, the P-State is raised to a higher-performing state. The present-day architecture suffers from a few shortcomings:
- Polling is a poor fit for the power management subsystem. First, how often should it poll? Polling more often improves responsiveness to changes in utilization, while polling less often minimizes overhead. The current architecture polls relatively infrequently, which means that a thread may run for a non-trivial time on a clocked-down CPU before the PM subsystem notices that utilization has increased (and that the CPU should be clocked up), or that a CPU that has become idle remains clocked up until the PM subsystem notices it should be clocked down. Polling is also inefficient: even on an otherwise quiescent system, the power management implementation must still wake up (bringing at least some resources into a higher power-consuming state) to check whether the system is still idle.
- Power management is decoupled from resource management. The thread dispatcher is the kernel subsystem responsible for deciding where (on which CPUs) threads should be scheduled to run. At present, it has no notion of CPU power/performance states. Meanwhile, the CPU power management subsystem polls for idle CPU resources to power manage. With the two subsystems decoupled, they can undermine each other's efforts: performance suffers when threads are inadvertently run on clocked-down CPUs, and power savings are lost when utilization remains light overall but is spread so widely across the system that nothing is quiescent enough to be power managed.
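The polling trade-off described in the first item can be made concrete with a small sketch. It assumes a poller that fires at exact multiples of a fixed interval; the function names and the millisecond values are hypothetical and are not taken from the existing CPUPM implementation.

```python
import math


def polling_reaction_delay(change_time_ms: int, poll_interval_ms: int) -> int:
    """Delay between a utilization change and the next poll that observes it.

    Assumes polls fire at exact multiples of poll_interval_ms.
    """
    next_poll = math.ceil(change_time_ms / poll_interval_ms) * poll_interval_ms
    return next_poll - change_time_ms


def polls_while_idle(idle_duration_ms: int, poll_interval_ms: int) -> int:
    """Number of wakeups a poller incurs on an otherwise idle system."""
    return idle_duration_ms // poll_interval_ms
```

With a 100 ms interval, a utilization change at t = 130 ms goes unnoticed for 70 ms, and one second of complete idleness still costs 10 wakeups. Shortening the interval reduces the first number and increases the second, which is the tension an event-driven design avoids.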
This project extends the kernel's existing topology-aware scheduling facility to bring "power domain" awareness to the dispatcher. With this awareness in place, the dispatcher can implement a coalescence dispatching policy that consolidates utilization onto a smaller subset of CPU domains, freeing up other domains to be power managed. In addition to being domain aware, the dispatcher will tend to prefer domains already running at higher power/performance states. This increases the duration and extent to which domains can remain quiescent, improving the kernel's ability to take advantage of features like deep C-states. Because the dispatcher tracks power domain utilization along the way, it can drive active domain state changes in an event-driven fashion, eliminating the need for the CPUPM subsystem to poll.
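A minimal sketch of the idea, under simplifying assumptions: each power domain is modeled as a count of running threads, "busiest domain with an idle CPU" stands in for "domain already at a higher power/performance state", and state changes are reported through a callback instead of being discovered by polling. The `PowerDomain`, `dispatch`, and `retire` names are hypothetical and do not correspond to the actual PAD code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PowerDomain:
    name: str
    ncpus: int
    running: int = 0  # threads currently placed in this domain

    def has_idle_cpu(self) -> bool:
        return self.running < self.ncpus


def dispatch(domains: List[PowerDomain],
             on_change: Callable[[PowerDomain, str], None]) -> Optional[PowerDomain]:
    """Place one thread on the busiest domain that still has an idle CPU.

    Returns the chosen domain, or None if every CPU is occupied.
    """
    candidates = [d for d in domains if d.has_idle_cpu()]
    if not candidates:
        return None
    target = max(candidates, key=lambda d: d.running)
    target.running += 1
    if target.running == 1:
        on_change(target, "busy")  # idle -> busy transition, raised as an event
    return target


def retire(domain: PowerDomain,
           on_change: Callable[[PowerDomain, str], None]) -> None:
    """Remove one thread from a domain, signalling when it becomes idle."""
    domain.running -= 1
    if domain.running == 0:
        on_change(domain, "idle")  # busy -> idle transition, raised as an event
```

With two domains of two CPUs each, the first two threads both land on the first domain and the second domain stays fully idle until a third thread arrives. The power management code only hears about a domain when it crosses the idle/busy boundary, so no periodic sampling is needed.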
Status
PAD and Deep C-states support integrated into Nevada build 110 on Wednesday Feb 25th, 2009. For more information, please see the Flag Day and Heads Up announcement.
Bugs/RFE CR numbers
Integration:
- 6567156 bring CPU power awareness to the dispatcher
Documents
- Status update presentation, 7/8/08
- PSARC 2008/777 cpupm keyword mode extensions
- Overview and Code Walkthrough, 1/15/09
Power Aware Dispatcher Source Repository
- pad-gate: This repository is closed, as the project has integrated. Please see the current ON source base
OpenSolaris Deep C-State Support
Modern x86 processors support several different idle states for power conservation. ACPI defines these as C-states. Solaris as of onnv_102 supports only ACPI C1 via the HLT (halt) and MONITOR/MWAIT instructions. The deeper ACPI C-states C2 and C3 can conserve more power, but they can take longer to enter and resume. The ACPI specification allows the CPU's internal clock to halt during C2, and caches may lose state in ACPI C3. Operating system support is required because of possible CPU state loss in C2 and C3 and because of the additional idle wakeup latency.
There is currently OpenSolaris Deep C-state work ongoing in several areas:
- cpudrv: The existing cpudrv is being modified to support Deep C-states. It will:
  - detect processors which support deep C-states
  - query ACPI properties
  - implement idle loops to enter deep C-states via ACPI methods
- kernel: General kernel work to support Deep C-states:
  - support different CPU wakeup mechanisms for CPUs in different idle states
  - read timers such as the local APIC timer, and expiry times such as that of the top cyclic on a CPU's cyclic heap
  - scheduler improvements to choose CPUs based on idle states
  - support Real Time (RT) thread scheduling time requirements on CPUs with variable wakeup latencies
- HPET: Solaris uses the local APIC timer to generate interrupts for the Cyclic Backend (CBE). The lAPIC timer in a CPU may stop counting and will not generate interrupts while the processor is in ACPI states C2 and C3. Work is ongoing to use the High Precision Event Timer (HPET) as a proxy for stalled lAPIC timers. The HPET is located on the chipset, isolated from CPU C-State power side effects. CPUs must schedule their next CBE interrupt on the HPET when they enter a deep C-state.
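The HPET proxy arrangement can be sketched as a single shared timer that is always armed for the earliest deadline among the CPUs currently in a deep C-state. The sketch below is a simplified model under that assumption: the `HpetProxy` class and its method names are hypothetical, the `programmed` attribute merely stands in for writing a hardware comparator, and the case of a CPU waking early for some other reason is not handled.

```python
import heapq
from typing import List, Optional, Tuple


class HpetProxy:
    """One shared timer standing in for stalled per-CPU lAPIC timers."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int]] = []  # (expiry_ns, cpu_id)
        self.programmed: Optional[int] = None   # expiry the shared timer is armed for

    def enter_deep_idle(self, cpu_id: int, next_expiry_ns: int) -> None:
        """Called by a CPU just before it enters a deep C-state."""
        heapq.heappush(self._heap, (next_expiry_ns, cpu_id))
        self._reprogram()

    def fire(self, now_ns: int) -> List[int]:
        """Handle a proxy interrupt: return the CPUs whose deadlines have passed."""
        woken = []
        while self._heap and self._heap[0][0] <= now_ns:
            _, cpu_id = heapq.heappop(self._heap)
            woken.append(cpu_id)
        self._reprogram()
        return woken

    def _reprogram(self) -> None:
        # Arm the shared timer for the earliest outstanding deadline, if any.
        self.programmed = self._heap[0][0] if self._heap else None
```

A min-heap keeps the earliest deadline at the front, so re-arming after each registration or expiry is cheap even when many CPUs are idle at once.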
Status
Deep C-states support integrated (along with PAD) into Nevada build 110 on Wednesday Feb 25th, 2009. For more information, please see the Flag Day and Heads Up announcement.
Bugs/RFE CR numbers
- 6700904 deeper C-State support required on follow-ons to Intel Penryn processor generation microarchitecture
- C-State Development bugs
Deep C-State HPET Source Repository
C-State work has merged with Power Aware Dispatcher work. Please see pad-gate above.
HPET and C-State work was developed in separate gates to maintain quality of other gates. These gates are no longer active.
Mercurial Repositories
Please read these instructions on how to use Mercurial repositories. For help with using Mercurial, or the ON tools, you can also:
- Ask on the tools-discuss@opensolaris.org mailing list (subscribe here).
- Check out the Mercurial how-to page.
- To make a (debug) kernel (using pad-gate as an example workspace, and opensolaris.sh as the environment file)
$ cd pad-gate
$ /opt/onbld/bin/bldenv -d /opt/onbld/bin/opensolaris.sh
$ cd usr/src/tools
$ dmake install
$ cd $CODEMGR_WS/usr/src/uts
$ dmake install
- To create a kernel tarball to install (x86)...
$ /opt/onbld/bin/Install -G my_pad-gate_kernel -k i86pc
- To build BFU archives, you need to get (and extract) the "closed bins" tarball(s) into your workspace. See above for current pointers (you must use versions appropriate for the build of onnv against which your repo is synced).
$ cd pad-gate
$ tar xf on-closed-bins.i386.tar
$ /opt/onbld/bin/nightly /opt/onbld/bin/opensolaris.sh
- See the OpenSolaris Developer's Reference for details on how to use kernel tarballs generated by Install(1).