Andrei Dorofeev
Alexander Kolbasov
Jonathan Chew
This document proposes a new resource control facility for the Solaris™ Operating Environment, called CPU Caps. This facility provides a hard, fine-grained limit on the amount of CPU resources that can be used by all user processes within a project or a zone. The CPU Caps facility complements existing resource control facilities such as the Fair Share Scheduler and resource pools.
The document describes the motivation for limiting CPU resources and the functional requirements, provides a short implementation overview, and discusses observability issues. We also show suggested changes to the existing Solaris™ documentation.
The on-line version of this paper is available in [1].
Customers running Solaris™ systems in data centers often require facilities to partition a system into smaller subsystems. This allows various departments or projects to share a single system gracefully. One of the most important system resources is CPU time. Without good CPU partitioning, CPU-hungry software may starve other applications of valuable cycles.
Solaris provides different mechanisms which can be used to control CPU usage of applications:
CPU binding and processor sets provide hard limits for applications since they cannot run outside a bound CPU or a set, but the minimum granularity for both is a single CPU. The Fair Share Scheduler, FSS(7), provides soft limits, which are enforced only when there is enough competition for CPU resources and are specified in units expressing the relationship of processes to each other rather than in units related to the number of CPUs. Various customers requested a way to set a limit on CPU usage that cannot be exceeded (hard) and that can be specified in fractions of a CPU (fine-grained). For practical purposes it is enough to provide such a limit for zones or projects; it does not seem very useful to provide such a limit for individual processes or threads.
Customers provided the following major reasons for having such hard fine-grained CPU usage limits:
In some form, such a mechanism for limiting CPU resources is provided by other operating systems (see section 2.1, CPU Caps on other systems). It has become one of many checkbox items that customers look at when comparing resource management solutions from Sun and other vendors.
To provide such a facility, Solaris needs a new mechanism that provides a hard CPU usage cap which is enforced even if some CPUs are idle. This is a stronger performance guarantee than the one provided by FSS shares, where the end result greatly depends on other workloads running on the system at the same time. Such a mechanism should have low overhead and should not penalize users who do not wish to use it.
This paper describes such a mechanism, called CPU caps, which allows system administrators to define a hard upper limit (or a cap) on how much CPU time can be consumed by applications. The cap can be specified in fractions of a single CPU. It can be set for any project or zone using two new resource controls, project.cpu-cap and zone.cpu-cap. The cap value is a percentage of a single CPU that all threads belonging to the project or zone can consume. For example, if a zone CPU cap is set to 50, then that zone is only allowed to use one half of a single processor (on a 2-CPU system that zone is allowed to use 25% of all CPUs). Note that, unlike processor binding and processor sets, applications in capped projects or zones can run on any valid CPU and their combined CPU usage is capped.
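The relationship between a cap value and the share of the whole machine can be sketched with a few lines of Python. This is only an illustration of the arithmetic in the example above; the function name cap_share_of_system is hypothetical and is not part of any Solaris interface.

```python
def cap_share_of_system(cap_percent: float, ncpus: int) -> float:
    """Return the fraction of total system CPU capacity that a cap allows.

    cap_percent is expressed in per cent of a single CPU, as in the
    project.cpu-cap and zone.cpu-cap resource controls.
    """
    if cap_percent <= 0:
        raise ValueError("cap value must be greater than zero")
    if ncpus <= 0:
        raise ValueError("ncpus must be positive")
    # A cap larger than the machine cannot grant more than all CPUs.
    cpus_allowed = min(cap_percent / 100.0, ncpus)
    return cpus_allowed / ncpus


# A cap of 50 on a 2-CPU system allows 25% of all CPUs.
print(cap_share_of_system(50, 2))   # 0.25
# A cap of 600 (6 CPUs) on an 8-CPU system allows 75% of all CPUs.
print(cap_share_of_system(600, 8))  # 0.75
```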
In the rest of this document we define the requirements, show some usage examples, discuss observability and other issues and, finally, propose changes to existing documentation.
The discussion above forms the basis for the requirements. Here we discuss in more detail the functionality, accuracy, observability and other requirements.
Linux has patches against the 2.6.17 kernel, provided by Aurema, that implement per-task CPU caps. Capping for task aggregation is not currently supported. Child tasks inherit the cap value from the parent. It seems that the Linux implementation includes some of the FSS concepts by providing the notion of a "soft cap" which can be exceeded if there is some idle CPU time. More information is available at
HP-UX provides CPU caps via the WLM workload manager. CPU caps can not be used with shares at the same time. When caps are turned on, the number of shares becomes a cap. More information about the HP support for CPU caps is available at http://docs.hp.com/en/B8733-90017/ch02s02.html#cjadichh.
The AIX 5L workload manager redbook at http://www.redbooks.ibm.com/redbooks/pdfs/sg245977.pdf describes what the AIX WLM can do. It provides hard and soft CPU limits which can be used in combination with shares. Read pp. 47-49 for details.
CPU caps can be set and enforced for projects and zones. The cap value is specified in units of one per cent of a CPU. For example, a CPU cap of 50 limits CPU resources to 50% of one CPU regardless of how many CPUs are available and their characteristics.
A zone CPU cap is represented by the zone.cpu-cap resource control. A project CPU cap is represented by the project.cpu-cap resource control. Caps are only enforced when privileged limits are set. These resource controls can be set statically for projects in the project(4) file and for zones using the zonecfg(1M) command. They can also be modified or removed on a running system using the prctl(1) utility. The cap value must be greater than zero.
A project CPU cap is represented by the project.cpu-cap resource control. It is associated with the privileged level and can be modified only by privileged (superuser) callers (see resource_controls(5) and prctl(1) for the description of the privileged level).
The project.cpu-cap resource limit can be set statically for projects in the project(4) file. For example, the following line in the project(4) file sets a persistent CPU cap of 6 CPUs for user akolb:
user.akolb:1234::::project.cpu-cap=(privileged,600,none)
A project CPU cap can also be dynamically modified or removed on a running system using the prctl(1) utility. For example, the following command modifies the CPU cap to limit user akolb to 3 CPUs:
$ prctl -r -t privileged -n project.cpu-cap -v 300 -i project user.akolb
To remove a project cap, the following command can be used:
$ prctl -x -n project.cpu-cap $$
To dynamically change CPU caps for projects or zones, use the -r ("replace") option of the prctl(1) command. For example, the following command changes the cap for project group.staff to 80%:
$ prctl -r -t privileged -n project.cpu-cap -v 80 -i project group.staff
Adding the following line to /etc/project file:
A zone CPU cap is represented by the zone.cpu-cap resource control. Similar to the project cap, it is associated with the privileged level and can be modified only by privileged (superuser) callers.
The zone.cpu-cap resource control can be set for a zone using the zonecfg(1M) command. The following example configures a CPU cap of 3 CPUs for a zone:
zonecfg:myzone> add rctl
zonecfg:myzone:rctl> set name=zone.cpu-cap
zonecfg:myzone:rctl> add value (priv=privileged,limit=300,action=none)
zonecfg:myzone:rctl> end
The zone cap can be dynamically changed using the prctl(1) command. For example, the following command sets the zone CPU cap for the global zone to 80% of a CPU:
$ prctl -t privileged -n zone.cpu-cap -v 80 -i zone global
This cap can be changed to 50% later using the following command:
$ prctl -r -t privileged -n zone.cpu-cap -v 50 -i zone global
The zonecfg(1M) is extended with a new resource called capped-cpu, as described in PSARC/2006/496 [3]. The resource value, called ncpus, maps to the zone.cpu-cap rctl. This case formalizes the proposal from PSARC/2006/496 [3] and commits to this new interface.
The capped-cpu resource has a single ncpus property which is a positive decimal with two digits to the right of the decimal point. This property is implemented as a special case of the zonecfg cpu-cap alias. The special case handling of this property normalizes the value so that it corresponds to units of CPUs and is similar to the ncpus property under the dedicated-cpu resource group. Unlike dedicated-cpu, it will not accept a range and it will accept a decimal number. For example, when using ncpus in the dedicated-cpu resource group, a value of 1 means one dedicated CPU. When using ncpus in the capped-cpu resource group, a value of 1 means 100% of a CPU as the zone.cpu-cap setting. A value of 1.25 means 125%, since 100% corresponds to one full CPU on the system when using CPU caps. The intention here is to align the ncpus units as closely as possible in these two cases (dedicated-cpu vs. capped-cpu), given the limitations and capabilities of the two underlying mechanisms (pset vs. rctl). See PSARC/2006/496 [3] for a description of the dedicated-cpu resource and its ncpus property.
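The normalization from ncpus to a zone.cpu-cap value can be illustrated with a small Python sketch. It only mirrors the rules stated above (positive decimal, at most two fractional digits, no ranges); the function name ncpus_to_cpu_cap is hypothetical and this is not the zonecfg implementation.

```python
from decimal import Decimal, InvalidOperation


def ncpus_to_cpu_cap(ncpus: str) -> int:
    """Convert a capped-cpu ncpus string to a zone.cpu-cap percentage."""
    try:
        value = Decimal(ncpus)
    except InvalidOperation:
        # Ranges such as "1-2" are not valid decimals and are rejected here.
        raise ValueError(f"ncpus must be a decimal number: {ncpus!r}")
    if not value.is_finite() or value <= 0:
        raise ValueError("ncpus must be a positive decimal")
    # Assumption: more than two digits after the decimal point is an error.
    if value.as_tuple().exponent < -2:
        raise ValueError("ncpus allows at most two fractional digits")
    return int(value * 100)


print(ncpus_to_cpu_cap("1"))     # 100
print(ncpus_to_cpu_cap("1.25"))  # 125
print(ncpus_to_cpu_cap("3"))     # 300
```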
The following example sets zone CPU cap using the capped-cpu resource:
zonecfg:myzone> add capped-cpu
zonecfg:myzone:capped-cpu> set ncpus=3
zonecfg:myzone:capped-cpu> end
Here we provide a short overview of CPU caps implementation. See the implementation guide [2] for more details.
A CPU cap can be set for any project or any zone. A zone CPU cap limits the CPU usage of all projects running inside the zone. If the zone CPU cap is set below the project CPU cap, the latter will have no effect.
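The interaction between the two caps can be summarized as "the lower limit wins". The following Python sketch computes an upper bound on what a single project can consume; it is a simplification, since the zone cap actually applies to the combined usage of all projects in the zone. The function name is hypothetical.

```python
from typing import Optional


def effective_project_limit(project_cap: Optional[int],
                            zone_cap: Optional[int]) -> Optional[int]:
    """Upper bound (per cent of one CPU) on a project's usage.

    None means that no cap of that kind is set. The zone cap is shared
    by every project in the zone, so the real limit may be lower still.
    """
    caps = [cap for cap in (project_cap, zone_cap) if cap is not None]
    return min(caps) if caps else None


# Zone cap below the project cap: the project cap has no effect.
print(effective_project_limit(project_cap=300, zone_cap=50))   # 50
# Project cap below the zone cap: the project cap is the limit.
print(effective_project_limit(project_cap=70, zone_cap=300))   # 70
# No caps at all.
print(effective_project_limit(None, None))                     # None
```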
For all threads running in capped projects or zones, the system keeps track of their CPU usage over short periods of time. When CPU usage of projects or zones reaches specified caps, threads in them do not get scheduled and instead are placed on the special wait queues in the kernel. These threads will become runnable again only when CPU usage drops below the cap level.
Each zone and each project has its own wait queue. The time spent by threads on wait queues is reported as "wait-cpu" (latency) time by procfs. Wait times can be seen in the LAT column when prstat(1M) is invoked with the -m option. Time spent by threads on wait queues is also accumulated in the LMS_WAIT_CPU micro-state accounting state. This time, however, is not accounted for when calculating CPU load averages, unlike the time spent by threads in the runnable state (waiting on run queues). There is no separate accounting for time spent on wait queues.
We decided to combine the wait times that threads spent waiting on run queues and wait queues into a single bucket because currently there is a fixed set of micro-state accounting types. Any extension of this set will cause offsets of fields in data structures, embedding micro-state data, to change. This, in turn, implies that all proc(4) consumers should be recompiled. We feel that CPU wait time is general enough to include time spent waiting on both run queues and wait queues and it is reasonable to combine them together until micro-state accounting framework is extended.
All CPU usage accounting data is collected using micro-state accounting facility and the provided accuracy depends on the accuracy of the micro-state data. The per-thread CPU usage is aggregated to project usage and the project usage is accumulated to zone usage. The implementation uses a decay formula that decays one per cent of the value on every clock tick.
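The exact decay formula is given in the implementation guide and is not reproduced here. As a rough illustration only, the following Python sketch assumes that on every clock tick the usage estimate loses one per cent of its value and gains one per cent of the CPU time consumed during that tick, which makes it an exponentially smoothed average. The function name and the scaling of the new sample are assumptions.

```python
def decay_usage(usage: float, consumed: float,
                decay_percent: float = 1.0) -> float:
    """Advance a smoothed usage estimate by one clock tick.

    usage and consumed are both in per cent of one CPU. The weighting
    of the new sample is an assumption made for this sketch.
    """
    usage -= usage * decay_percent / 100.0     # decay 1% of the old value
    usage += consumed * decay_percent / 100.0  # blend in this tick's usage
    return usage


# A project that used a full CPU and then goes idle for 100 ticks:
usage = 100.0
for _ in range(100):
    usage = decay_usage(usage, consumed=0.0)
print(round(usage, 1))  # roughly 36.6, i.e. 100 * 0.99**100

# A project that steadily uses half a CPU converges towards 50:
usage = 0.0
for _ in range(1000):
    usage = decay_usage(usage, consumed=50.0)
print(round(usage, 1))
```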
Once the CPU usage of a project or zone exceeds its cap, all user threads running there are marked with a special flag. The preemption code places them on wait queues once they cross the user-kernel boundary.
CPU caps are enforced only for threads running in the TS, IA, FX, and FSS scheduling classes. A CPU cap has no effect on threads running in the RT (real-time) scheduling class.
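The effect of this enforcement can be illustrated with a toy simulation. The Python sketch below models one CPU-bound thread in a capped project and assumes a usage estimate that loses one per cent of its value per clock tick, as described above; the thread runs on ticks where usage is below the cap and waits otherwise. It ignores the user-kernel boundary, multiple threads and scheduling classes, and all names are hypothetical.

```python
def simulate_capped_project(cap_percent: float, ticks: int,
                            decay: float = 0.01) -> tuple:
    """Return (on_cpu_ticks, waiting_ticks) for one CPU-bound thread."""
    usage = 0.0      # smoothed usage, per cent of one CPU
    on_cpu = 0
    waiting = 0
    for _ in range(ticks):
        if usage >= cap_percent:
            # Over the cap: the thread sits on the wait queue this tick.
            waiting += 1
            consumed = 0.0
        else:
            # Under the cap: the thread runs for the whole tick.
            on_cpu += 1
            consumed = 100.0
        # Assumed smoothing: move 1% of the way towards this tick's usage.
        usage += decay * (consumed - usage)
    return on_cpu, waiting


on_cpu, waiting = simulate_capped_project(cap_percent=40, ticks=100_000)
print(f"on CPU: {on_cpu / 100_000:.1%}, waiting: {waiting / 100_000:.1%}")
```

With a cap of 40 the simulated thread spends close to 40% of the ticks on CPU and the rest waiting, which is the behaviour a hard cap is meant to produce.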
There are several ways to observe the impact of CPU caps at a zone or project level and at a process or thread level. Each project or zone CPU cap exports information via kstats (see 5.1).
The DTrace sched provider is extended with two new probes which can be used for gathering detailed data for times spent on wait queues (see 5.3). These probes provide thread and process level observability for CPU caps.
Zone and project CPU cap kstats contain the following information:
For example, when a project CPU cap is set to 50% for project 1234, the following command will show cap information:
$ kstat -m caps
module: caps instance: 0
name: cpucaps_project_1234 class: project_caps
above_sec 787
below_sec 260551
nwait 0
usage 1
maxusage 51
value 50
The following example is for zone kstats:
module: caps instance: 14
name: cpucaps_zone_14 class: zone_caps
above_sec 0
below_sec 3
maxusage 255
nwait 0
usage 19
value 300
For both zone and project kstats, the kstat instance is the same as the zone ID. The PSARC 2006/598 Swap resource control case [4] provides a precedent for such kstats and establishes the naming conventions.
The maximum usage statistic provides a good way for users to estimate how to set up their CPU caps. They can set the cap to a very high value and observe the maximum usage while their application runs for a while. This gives an estimate of the maximum CPU requirements for the workload.
The global zone sees kstats for all zones, while non-global zones see only kstats with a matching zone ID.
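The kstat values shown in the examples above can be turned into two simple indicators. The Python sketch below assumes that above_sec and below_sec count the seconds spent above and below the cap, and it uses a hypothetical 10% headroom heuristic for choosing a cap from maxusage; neither the helper names nor the heuristic come from the CPU caps design.

```python
import math


def fraction_above_cap(above_sec: int, below_sec: int) -> float:
    """Fraction of observed time during which usage was above the cap."""
    total = above_sec + below_sec
    return above_sec / total if total else 0.0


def suggested_cap(maxusage: int, headroom_percent: int = 10) -> int:
    """Hypothetical heuristic: observed maximum usage plus some headroom."""
    return math.ceil(maxusage * (1 + headroom_percent / 100))


# Project 1234 from the example above: above_sec=787, below_sec=260551.
print(f"{fraction_above_cap(787, 260551):.2%}")  # 0.30%

# Zone 14 from the example above: maxusage=255 under a cap of 300.
print(suggested_cap(255))  # 281
```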
When a cap is set on a zone, all projects within this zone are automatically capped, but their kstats will show the cap value of zero (unless some of these projects have specific caps of their own). Projects with the cap value of zero participate in zone CPU usage accounting, but are not actually used to enforce project caps. For example, on a system with zone cap set on zone 1 and project cap set for project 10 in global zone:
$ kstat caps
module: caps instance: 0
name: cpucaps_project_10 class: project_caps
above_sec 439
below_sec 655
nwait 1
usage 70
maxusage 71
value 70
module: caps instance: 1
name: cpucaps_project_0 class: project_caps
above_sec 0
below_sec 2
nwait 0
usage 7
maxusage 10
value 0
module: caps instance: 1
name: cpucaps_project_1 class: project_caps
above_sec 0
below_sec 48
nwait 0
usage 42
maxusage 51
value 0
module: caps instance: 1
name: cpucaps_zone_1 class: zone_caps
above_sec 32
below_sec 235
nwait 3
usage 50
maxusage 51
value 50
As we can see, there are two projects belonging to zone 1, and three threads are waiting on the zone cap because the zone has reached its limit.
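For scripted monitoring, the same data can be read from the parseable output of kstat -p, which prints one module:instance:name:statistic key and a tab-separated value per line. The Python sketch below parses such text and reports which caps currently have waiting threads. The sample input is hand-written from the values shown above rather than captured from a real system, and the function name is hypothetical.

```python
def parse_caps_kstats(text: str) -> dict:
    """Parse `kstat -p -m caps` style output into {(instance, name): stats}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("\t")
        module, instance, name, statistic = key.split(":", 3)
        if module != "caps":
            continue
        try:
            parsed = int(value)
        except ValueError:
            parsed = value.strip()  # e.g. class names or timestamps
        stats.setdefault((int(instance), name), {})[statistic] = parsed
    return stats


sample = (
    "caps:0:cpucaps_project_10:nwait\t1\n"
    "caps:0:cpucaps_project_10:usage\t70\n"
    "caps:0:cpucaps_project_10:value\t70\n"
    "caps:1:cpucaps_project_0:nwait\t0\n"
    "caps:1:cpucaps_project_0:value\t0\n"
    "caps:1:cpucaps_zone_1:nwait\t3\n"
    "caps:1:cpucaps_zone_1:usage\t50\n"
    "caps:1:cpucaps_zone_1:value\t50\n"
)

caps = parse_caps_kstats(sample)
for (instance, name), s in sorted(caps.items()):
    if s.get("nwait", 0) > 0:
        print(f"{name} (zone {instance}): {s['nwait']} waiting thread(s)")
```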
The kstat(1M) command running in a zone only shows CPU caps relevant for that zone and its projects. We can use various modifications of kstat(1M) command to examine only project or zone caps.
For example,
#
# Show project caps only
#
$ kstat -c project_caps
#
# Show project caps for zone 1
#
$ kstat -c project_caps -i 1
Similar observability kstats are also introduced by the Swap resource control [4] project.
The ps(1) command shows threads on the wait queue by displaying "W" for their state. For example:
$ ps -o pid,s,comm -p 101262
PID S COMMAND
101262 W /usr/perl5/bin/perl
The prstat(1M) command shows the "wait" state for threads sitting on wait queues. For example:
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
100686 akolb    4272K 1516K wait     0    0   0:51:20  25% perl/1
CPU caps provide two new DTrace sched provider probes for observing scheduling impact for threads and processes:
For example, the following D script shows the number of seconds processes spend on CPU and on wait queues. It gives a reasonable estimate of which processes are affected by CPU caps and to what extent.
#!/usr/sbin/dtrace -s
#pragma D option quiet
/* Mark the time process is placed on wait queue */
sched:::cpucaps-sleep
{
sleep[args[1]->pr_pid] = timestamp;
}
/* Thread leaves wait queue */
sched:::cpucaps-wakeup
/sleep[args[1]->pr_pid]/
{
/* this->delta is time spent on wait queue */
this->delta = timestamp - sleep[args[1]->pr_pid];
@sleeps[args[1]->pr_fname] = sum(this->delta);
@total[args[1]->pr_fname] = sum(this->delta);
}
sched:::on-cpu
/sleep[curpsinfo->pr_pid]/
{
/* Mark the time process is placed on CPU */
oncpu[curpsinfo->pr_pid] = timestamp;
}
sched:::off-cpu
/oncpu[curpsinfo->pr_pid]/
{
/* this->delta is time spent on CPU */
this->delta = timestamp - oncpu[curpsinfo->pr_pid];
@cpu[curpsinfo->pr_fname] = sum(this->delta);
@total[curpsinfo->pr_fname] = sum(this->delta);
}
END
{
/* Normalize data to print results in seconds */
normalize (@cpu, 1000000000);
normalize (@sleeps, 1000000000);
normalize (@total, 1000000000);
printf ("ON-CPU times:\n");
printa ("%-18s %@u\n", @cpu);
printf ("\nWait times:\n");
printa ("%-18s %@u\n", @sleeps);
printf ("\nTotal times:\n");
printa ("%-18s %@u\n", @total);
}
This script ran for a while on a system with a 40% project cap set and two CPU-bound processes running in the project:
ON-CPU times:
cpudrain-amd       75
project_001        76

Wait times:
project_001        297
cpudrain-amd       297

Total times:
project_001        373
cpudrain-amd       373
As we can see, each of the two processes spends about 20% of its time on CPU and 80% on the wait queue, so together they use about 40% of a single CPU, which is exactly what the cap allows them.
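The percentages quoted above follow directly from the script output. The short Python check below redoes the arithmetic using the numbers from that output; the variable names are illustrative only.

```python
# Seconds reported by the D script for the two CPU-bound processes.
on_cpu = {"cpudrain-amd": 75, "project_001": 76}
total = {"cpudrain-amd": 373, "project_001": 373}

# Share of its own lifetime that each process spent on CPU.
shares = {name: on_cpu[name] / total[name] for name in on_cpu}
for name, share in shares.items():
    print(f"{name}: {share:.1%} on CPU, {1 - share:.1%} waiting")

# Both processes ran over the same interval, so their shares add up to
# the project's usage of a single CPU, which should be close to the cap.
print(f"combined: {sum(shares.values()):.1%} of one CPU (cap is 40%)")
```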
There is some inherent error in the usage accounting performed by the system. Even micro-state accounting is not 100% correct. The CPU usage is aggregated for all threads running in a capped project or zone. When the system is executing many threads within a project or a zone, the aggregated usage error may increase significantly and noticeably reduce the accuracy of the CPU caps.
Due to the non-extensible nature of micro-state accounting, the current design uses the LMS_WAIT_CPU micro-state to keep track of both on-waitq and on-runq CPU times. There is no separate micro-state for wait time only. Adding any extra states would require recompilation of many existing tools that read and process /proc data. Still, the observability hooks described in section 5 provide enough tools to get around this issue.
Threads which get pinned by interrupt threads don't change their micro-states. Clock tick processing won't happen for pinned threads, but it might look like they've used more CPU time than they actually did just by looking at their micro-state counters. This is a generic micro-state accounting problem though.
The following description should be added to resource_controls(5) man page:
The Solaris Dynamic Tracing Guide should be updated to include information about new process states and two new sched provider probes.
The proc Provider section should be updated to reflect the addition of the new SWAIT process state (table 25-5 in the Solaris Dynamic Tracing Guide).
The Probes section (table 26-1 in the Solaris Dynamic Tracing Guide) should be updated with information about two new probes:
The Arguments section (table 26-2 in the Solaris Dynamic Tracing Guide) should be updated with the information in table 1, which describes the arguments of the CPU Caps probes.
Table 1: sched Probe Arguments
| Probe | args[0] | args[1] |
| cpucaps-sleep | lwpsinfo_t * | psinfo_t * |
| cpucaps-wakeup | lwpsinfo_t * | psinfo_t * |
The Examples section should be updated with examples for CPU Caps probe arguments.
You can use the cpucaps-sleep and cpucaps-wakeup probes to understand the impact CPU Caps have on specific processes and threads. The following example shows how much time various processes spend on wait queues:
sched:::cpucaps-sleep
{
sleep[args[1]->pr_pid] =
timestamp;
}
sched:::cpucaps-wakeup
/sleep[args[1]->pr_pid]/
{
@sleeps[args[1]->pr_fname] =
quantize(timestamp - sleep[args[1]->pr_pid]);
sleep[args[1]->pr_pid] = 0;
}
Running the above script results in output similar to the following example:
# ./capswait.d
dtrace: script './capswait.d' matched 2 probes
^C
exmh
value ------------- Distribution ------------- count
8388608 | 0
16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
33554432 | 0
scan
value ------------- Distribution ------------- count
16777216 | 0
33554432 |@@@@@@@@@@@@@@@@@@@@ 1
67108864 | 0
134217728 |@@@@@@@@@@@@@@@@@@@@ 1
268435456 | 0
firefox-bin
value ------------- Distribution ------------- count
4194304 | 0
8388608 |@@ 1
16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 19
33554432 |@@@@ 2
67108864 | 0
:
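The values in the quantize() output are in nanoseconds, because the script subtracts two timestamp readings, and each row counts waits that fall between that power-of-two value and the next one. The Python sketch below converts a row value to a range in milliseconds to make the histograms easier to read; the helper name is hypothetical.

```python
def bucket_range_ms(value_ns: int) -> tuple:
    """Return (low, high) in milliseconds for a quantize() row.

    A row labelled value_ns counts samples with
    value_ns <= sample < 2 * value_ns.
    """
    if value_ns <= 0:
        raise ValueError("expected a positive power-of-two bucket value")
    return value_ns / 1e6, 2 * value_ns / 1e6


# The exmh row with four waits: each lasted roughly 16.8 to 33.6 ms.
low, high = bucket_range_ms(16777216)
print(f"{low:.1f} ms <= wait < {high:.1f} ms")

# The longest scan wait: roughly 134.2 to 268.4 ms.
low, high = bucket_range_ms(134217728)
print(f"{low:.1f} ms <= wait < {high:.1f} ms")
```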
Alexander Kolbasov 2007-01-09