= CPU Caps design

**Andrei Dorofeev**

**Alexander Kolbasov**

**Jonathan Chew**

[[PDF Version>>Project rm.design.pdf]]

=== Abstract:

This document proposes a new resource control facility for the Solaris™ Operating Environment, called //CPU Caps//. This facility provides a hard, fine-grained limit on the amount of CPU resources that can be used by all user processes within a project or a zone. The //CPU Caps// facility complements existing resource control facilities such as the Fair Share Scheduler and resource pools.

This document describes the motivation for limiting CPU resources and the functional requirements, provides a short implementation overview, and discusses observability issues. It also shows suggested changes to the existing Solaris™ documentation.

The on-line version of this paper is available in [[[1>>#opensolaris:cpucaps]]].

= 1 Introduction

Customers running Solaris™ systems in data centers often require facilities to partition a system into smaller subsystems. This allows various departments or projects to share a single system gracefully. One of the most important system resources is CPU time. Without good CPU partitioning, CPU-hungry software may starve other applications of valuable cycles.

Solaris provides different mechanisms which can be used to control the CPU usage of applications:

* Processor binding (see [[pbind(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqbg?a=view]]) provides a way to limit a process or a set of processes to a single CPU. It allows other unbound threads to run on the same CPU and compete for CPU cycles with bound threads.
* Processor sets (see [[psrset(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqd0?a=view]]) provide a way to limit process execution to a set of CPUs. They also prohibit threads not belonging to the set from running within the set.
* Dynamic resource pools (see [[pooladm(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqbv?a=view]]) integrate processor sets with Zones.
* Fair Share Scheduler (see [[FSS(7)>>http://docs.sun.com/app/docs/doc/816-5177/6mbbc4g5u?a=view]]) provides a mechanism to share available CPUs within given proportions (shares). The //Fair Share Scheduler// only controls CPU usage relative to the CPU usage of other applications. CPU shares do not limit CPU usage when there are idle CPU resources.

CPU binding and processor sets provide //hard// limits for applications, since they cannot run outside a bound CPU or a set, but the minimum granularity for both is a single CPU. [[FSS(7)>>http://docs.sun.com/app/docs/doc/816-5177/6mbbc4g5u?a=view]] provides //soft// limits, which are enforced only when there is enough competition for CPU resources and are specified in units expressing the relationship of processes to each other rather than in units related to the number of CPUs. Various customers requested a limit on CPU usage that cannot be exceeded (//hard//) and that can be specified in fractions of a CPU (//fine-grained//). For practical purposes it is enough to provide such a limit for zones or projects; it does not seem very useful to provide one for individual processes or threads.

Customers provided the following major reasons for having such hard fine-grained CPU usage limits:

* Selling CPU resources.
Customers selling their CPU resources want to provide usage limits based on the amount of CPU power purchased.
* Managing expectations.
Customers want the clients buying CPU resources to have the same experience in terms of application performance, independent of the machine load. With the FSS, clients running their applications on an otherwise idle machine will see better performance and may expect to get the same level of performance on a loaded machine.
Hard limits provide the same performance levels even when there are idle CPUs.
* Over-subscription.
Some customers want to use hard limits as a mechanism to over-subscribe users. They may sell more CPU resources than are available and will be able to provide the promised level of service unless they run out of CPU cycles, in which case there will be some decline in performance. For example, on a four-CPU machine they may sell 125% of a CPU to each of four customers. If all four customers make full use of their CPU resources, each will only get 100% of a CPU, but when some CPU resources are available they will get close to 125%.

Some form of such a mechanism for limiting CPU resources is provided by other operating systems (see section [[2.1>>#sec:competition]] //CPU Caps on other systems//). It has become one of many checkbox items which customers look at when comparing resource management solutions from Sun and other vendors.

To provide such a facility, Solaris needs a new mechanism that provides a hard CPU usage cap which is enforced even if some CPUs are idle. This is a stronger performance guarantee than the one provided by FSS shares, where the end result greatly depends on other workloads that run on the system at the same time. Such a mechanism should have low overhead and should not penalize users who do not wish to use it.

This paper describes such a mechanism, called //CPU caps//, which allows system administrators to define a hard upper limit (or cap) on how much CPU time can be consumed by applications. The cap can be specified in fractions of a single CPU. It can be set for any //project// or //zone// using two new resource controls, project.cpu-cap and zone.cpu-cap. The cap value is a percentage of a single CPU that all threads belonging to the project or zone can consume.
For example, if a zone CPU cap is set to 50, then that zone is only allowed to use one half of a single processor (on a 2-CPU system that zone is allowed to use 25% of all CPUs). Note that, unlike processor binding and processor sets, applications in capped projects or zones can run on any valid CPU; it is their combined CPU usage that is capped.

In the rest of this document we define the requirements, show some usage examples, discuss observability and other issues and, finally, propose changes to existing documentation.

= 2 Requirements

The discussion above forms the basis for the requirements. Here we discuss in more detail the functionality, accuracy, observability and other requirements.

* Functionality
For any project or zone with a cap set to C, the combined usage of all LWPs running in that project or zone should not exceed C% of a single CPU. It should be possible to set project caps for projects belonging to capped zones.
* Orthogonality
CPU caps should be orthogonal to all other system facilities. In particular, they should be compatible with all supported scheduling classes. For practical reasons the //Real-Time// (RT) scheduling class ignores all CPU caps.
They should also be compatible with resource shares. For example, a zone may have its CPU cap set to 50% and have all of its CPU cycles distributed between its projects according to their CPU shares. Or a project may have both a cap and shares set. When the cap is reached, the project should get scheduled less frequently, but it should still be scheduled according to the ratio between the amount of its shares and the shares of all other active projects running on the same processor set in its zone.
CPU caps should also be compatible with resource pools, which limit the CPUs that can be used by a zone. CPU caps will additionally limit the CPU consumption within a pool. One application of this may be setting up several capped projects within a pool.
* Accuracy
CPU caps must be as accurate as practically possible. We aim at providing accuracy within 1% of the overall CPU capacity.
* Stability
The adjustments performed by the CPU caps enforcement mechanisms should preserve system load stability. Given a stable workload, the capped system should provide stable load levels, and small changes in the workload should lead to small changes in the system load. The implementation should be careful to avoid oscillating system load.
* Observability
CPU caps should provide enough observability for users to estimate the impact of CPU limits on specific applications. They should also give users ways to estimate CPU usage requirements.
* Performance impact
There should be no visible performance impact when CPU caps are not used (there are no projects or zones with defined caps). When caps are enabled but not reached, the impact on performance (mostly caused by the extra accounting work that needs to be done) should be as small as possible. When caps are enabled and reached by projects or zones, their threads will get scheduled less often anyway, so it is hard to make any performance guarantees about that case.

== 2.1 CPU Caps on other systems

=== 2.1.1 CPU Caps on Linux

Linux has patches against the 2.6.17 kernel, provided by Aurema, that implement per-task CPU caps. Capping for task aggregations is not currently supported. Child tasks inherit the cap value from the parent. The Linux implementation appears to include some of the FSS concepts by providing the notion of a "soft cap", which can be exceeded if there is some idle CPU time.
More information is available at

* [[http://ebs.aurema.com/>>http://ebs.aurema.com/]]
* [[http://lkml.org/lkml/2006/5/26/7>>http://lkml.org/lkml/2006/5/26/7]]
* [[http://lwn.net/Articles/188862/>>http://lwn.net/Articles/188862/]]

=== 2.1.2 CPU Caps on HP-UX

HP-UX provides CPU caps via the WLM workload manager. CPU caps cannot be used with shares at the same time. When caps are turned on, the number of shares becomes a cap. More information about the HP support for CPU caps is available at [[http://docs.hp.com/en/B8733-90017/ch02s02.html#cjadichh>>http://docs.hp.com/en/B8733-90017/ch02s02.html#cjadichh]].

=== 2.1.3 CPU Caps on AIX

The AIX 5L Workload Manager redbook at [[http://www.redbooks.ibm.com/redbooks/pdfs/sg245977.pdf>>http://www.redbooks.ibm.com/redbooks/pdfs/sg245977.pdf]] describes what AIX WLM can do. It provides hard and soft CPU limits which can be used in combination with shares. See pp. 47-49 for details.

= 3 Administrative Interface

CPU caps can be set and enforced for projects and zones. The cap value is specified in units of one per cent of a CPU. For example, a CPU cap of 50 limits CPU resources to 50% of one CPU regardless of how many CPUs are available and their characteristics.

A zone CPU cap is represented by the zone.cpu-cap resource control. A project CPU cap is represented by the project.cpu-cap resource control. Caps are only enforced when //privileged// limits are set[[1>>#foot105]]. These resource controls can be set statically for projects in the [[project(4)>>http://docs.sun.com/app/docs/doc/816-5174/6mbb98uiu?a=view]] file and for zones using the [[zonecfg(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqm2?a=view]] command. They can also be modified or removed on a running system using the [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] utility. The cap value should be greater than zero[[2>>#foot336]].
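
As a rough illustration of the units described above, the following Python sketch converts a cap value into an equivalent number of CPUs and into a share of the whole machine. The function names are hypothetical helpers introduced here for illustration only; they are not part of any Solaris interface.

```python
def cap_to_cpus(cap_percent):
    """Convert a cpu-cap value (per cent of one CPU) to a number of CPUs."""
    if cap_percent <= 0:
        # The document states that the cap value should be greater than zero.
        raise ValueError("the cap value should be greater than zero")
    return cap_percent / 100.0


def cap_share_of_system(cap_percent, ncpus):
    """Fraction of the total machine capacity that the cap allows."""
    return cap_to_cpus(cap_percent) / ncpus


print(cap_to_cpus(50))             # 0.5: half of one CPU
print(cap_share_of_system(50, 2))  # 0.25: a quarter of a 2-CPU system
print(cap_to_cpus(600))            # 6.0: six CPUs
```

The same cap value therefore represents a smaller share of a larger machine, which is why the cap is expressed relative to a single CPU rather than to the whole system.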

== 3.1 Project CPU caps

A project CPU cap is represented by the project.cpu-cap resource control. It is associated with the //privileged level// and can be modified only by privileged (superuser) callers (see [[resource_controls(5)>>http://docs.sun.com/app/docs/doc/816-5175/6mbba7f37?a=view]] and [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] for the description of the //privileged level//).

The project.cpu-cap resource limit can be set statically for projects in the [[project(4)>>http://docs.sun.com/app/docs/doc/816-5174/6mbb98uiu?a=view]] file. For example, the following line in the [[project(4)>>http://docs.sun.com/app/docs/doc/816-5174/6mbb98uiu?a=view]] file sets a persistent CPU cap of 6 CPUs for user akolb:

{{{
user.akolb:1234::::project.cpu-cap=(privileged,600,none)
}}}

A project CPU cap can also be dynamically modified or removed on a running system using the [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] utility. For example, the following command modifies the CPU cap to limit user akolb to 3 CPUs:

{{{
$ prctl -r -t privileged -n project.cpu-cap -v 300 -i project user.akolb
}}}

To remove a project cap, the following command can be used:

{{{
$ prctl -x -n project.cpu-cap $$
}}}

To dynamically change CPU caps for projects or zones, use the -r ("replace")[[3>>#foot338]] option of the [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] command. For example, the following command changes the cap for project group.staff to 80%:

{{{
$ prctl -r -t privileged -n project.cpu-cap -v 80 -i project group.staff
}}}

Adding the following line to /etc/project file:

== 3.2 Zone CPU cap

A zone CPU cap is represented by the zone.cpu-cap resource control.
Similar to the project cap, it is associated with the //privileged level// and can be modified only by privileged (superuser) callers.

The zone.cpu-cap resource control can be set for a zone using the [[zonecfg(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqm2?a=view]] command. The following example configures a CPU cap of 3 CPUs for a zone:

{{{
zonecfg:myzone> add rctl
zonecfg:myzone:rctl> set name=zone.cpu-cap
zonecfg:myzone:rctl> add value (priv=privileged,limit=300,action=none)
zonecfg:myzone:rctl> end
}}}

The zone cap can be dynamically changed using the [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] command. For example, the following command sets the zone CPU cap for the global zone to 80% of a CPU:

{{{
$ prctl -t privileged -n zone.cpu-cap -v 80 -i zone global
}}}

This cap can be changed to 50% later using the following command:

{{{
$ prctl -r -t privileged -n zone.cpu-cap -v 50 -i zone global
}}}

== 3.3 [[zonecfg(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqm2?a=view]] extensions for CPU caps

[[zonecfg(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqm2?a=view]] is extended with a new resource called capped-cpu, as described in PSARC/2006/496 [[[3>>#PSARC2006496]]]. The resource property, called ncpus, maps to the zone.cpu-cap rctl. This case formalizes the proposal from PSARC/2006/496 [[[3>>#PSARC2006496]]] and commits to this new interface.

The capped-cpu resource has a single ncpus property, which is a positive decimal number with two digits to the right of the decimal point. This property is implemented as a special case of the zonecfg cpu-cap alias. The special-case handling of this property normalizes the value so that it corresponds to units of CPUs and is similar to the ncpus property under the dedicated-cpu resource group.
Unlike dedicated-cpu, it will not accept a range, but it will accept a decimal number. For example, when using ncpus in the dedicated-cpu resource group, a value of 1 means one dedicated CPU. When using ncpus in the capped-cpu resource group, a value of 1 means 100% of a CPU as the zone.cpu-cap setting. A value of 1.25 means 125%, since 100% corresponds to one full CPU on the system when using CPU caps. The intention here is to align the ncpus units as closely as possible in these two cases (dedicated-cpu vs. capped-cpu), given the limitations and capabilities of the two underlying mechanisms (pset vs. rctl). See PSARC/2006/496 [[[3>>#PSARC2006496]]] for a description of the dedicated-cpu resource and its ncpus property.

The following example sets a zone CPU cap using the capped-cpu resource:

{{{
zonecfg:myzone> add capped-cpu
zonecfg:myzone:capped-cpu> set ncpus=3
zonecfg:myzone:capped-cpu> end
}}}

= 4 Implementation Overview

Here we provide a short overview of the CPU caps implementation. See the implementation guide [[[2>>#opensolaris:implementation]]] for more details.

A CPU cap can be set for any project or any zone. A zone CPU cap limits the CPU usage of all projects running inside the zone. If the zone CPU cap is set below the project CPU cap, the latter will have no effect.

For all threads running in capped projects or zones, the system keeps track of their CPU usage over short periods of time. When the CPU usage of a project or zone reaches its specified cap, its threads do not get scheduled and instead are placed on special //wait queues// in the kernel. These threads become runnable again only when CPU usage drops below the cap level.

Each zone and each project has its own //wait queue//. The time spent by threads on //wait queues// is reported as "//wait-cpu//" (latency) time by //procfs//[[4>>#foot343]].
Wait times can be seen in the LAT column when [[prstat(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqcm?a=view]] is invoked with the -m option. Time spent by threads on wait queues is also accumulated in the LMS_WAIT_CPU micro-state accounting state[[5>>#foot173]]. This time, however, is not counted when calculating CPU load averages, unlike the time spent by threads in the runnable state (waiting on run queues). There is no separate accounting for time spent on //wait queues//.

We decided to combine the times that threads spend waiting on //run queues// and //wait queues// into a single bucket because there is currently a fixed set of micro-state accounting types. Any extension of this set would change the offsets of fields in data structures that embed micro-state data. This, in turn, implies that all [[proc(4)>>http://docs.sun.com/app/docs/doc/816-5174/6mbb98uiq?a=view]] consumers would have to be recompiled. We feel that CPU wait time is general enough to include time spent waiting on both //run queues// and //wait queues//, and that it is reasonable to combine them until the micro-state accounting framework is extended.

All CPU usage accounting data is collected using the micro-state accounting facility, and the accuracy provided depends on the accuracy of the micro-state data. The per-thread CPU usage is aggregated into project usage, and the project usage is accumulated into zone usage. The implementation uses a decay formula that decays one per cent of the value on every clock tick.

Once the CPU cap of a project or zone is exceeded, all user threads running there are marked with a special flag. The preemption code places them on //wait queues// once they cross the user-kernel boundary.

CPU caps are enforced only for threads running in the //TS//, //IA//, //FX//, and //FSS// scheduling classes. A CPU cap has no effect on threads running in the //RT// (Real-Time) scheduling class.
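
The interaction between decayed usage accounting and wait queues can be illustrated with a small simulation. The Python sketch below is not the kernel algorithm: it assumes a single CPU-bound thread, one accounting step per clock tick, and a usage value that loses one per cent per tick and gains the fraction of the tick the thread spent on CPU. Under these assumptions a thread that always runs converges to a usage of 100, so the value can be compared directly with the cap. The function name and parameters are hypothetical.

```python
def simulate_cap(cap_percent, ticks, decay=0.01):
    """Return the fraction of ticks a CPU-bound thread spends on CPU.

    Assumed model (illustration only): usage decays by `decay` on every
    tick, and the thread waits whenever usage has reached the cap.
    """
    usage = 0.0
    ticks_on_cpu = 0
    for _ in range(ticks):
        on_cpu = usage < cap_percent      # otherwise the thread sits on a wait queue
        consumed = 1.0 if on_cpu else 0.0
        ticks_on_cpu += on_cpu
        usage = usage * (1.0 - decay) + consumed
    return ticks_on_cpu / ticks


# A cap of 40 should keep the thread on CPU for roughly 40% of the time.
print(round(simulate_cap(40, 100_000), 2))
# A cap above 100 never throttles a single thread.
print(simulate_cap(300, 1_000))
```

In this model the usage value hovers just around the cap, which matches the kstat examples in section 5.1 where maxusage is only slightly above value.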

= 5 CPU Caps Observability

There are several ways to observe the impact of CPU caps at a zone or project level and at a process or thread level. Each project or zone CPU cap exports information via kstats (see [[5.1>>#observability:kstat]]).

The DTrace sched provider is extended with two new probes which can be used for gathering detailed data on time spent on wait queues (see [[5.3>>#dtrace]]). These probes provide thread and process level observability for CPU caps.

== 5.1 Zones and Project observability

Zone and project CPU cap kstats contain the following information:

; **value**
: - the cap value as a percentage of a single CPU
; **usage**
: - current aggregated CPU usage for all threads belonging to a capped project or zone, as a percentage of a single CPU
; **maxusage**
: - maximum observed CPU usage
; **above_sec**
: - total time in seconds spent above the cap
; **below_sec**
: - total time in seconds spent below the cap
; **nwait**
: - number of threads on the cap wait queue

For example, when a project CPU cap is set to 50% for project 1234, the following command will show cap information:

{{{
$ kstat -m caps
module: caps                            instance: 0
name:   cpucaps_project_1234            class:    project_caps
        above_sec                       787
        below_sec                       260551
        nwait                           0
        usage                           1
        maxusage                        51
        value                           50
}}}

The following example is for zone kstats:

{{{
module: caps                            instance: 14
name:   cpucaps_zone_14                 class:    zone_caps
        above_sec                       0
        below_sec                       3
        maxusage                        255
        nwait                           0
        usage                           19
        value                           300
}}}

For both zone and project kstats, the kstat instance is the same as the zone ID. The PSARC 2006/598 //Swap resource control// case [[[4>>#PSARC2006598]]] provides a precedent for such kstats and establishes the naming conventions.
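
To show how these counters might be consumed, the following Python sketch parses text shaped like the project kstat example above and computes the share of observed time spent above the cap. It is only an illustration: the sample text is copied from the example, the helper names are hypothetical, and the parser assumes one "name value" pair per line rather than any particular kstat output option.

```python
SAMPLE = """\
module: caps instance: 0
name: cpucaps_project_1234 class: project_caps
above_sec 787
below_sec 260551
nwait 0
usage 1
maxusage 51
value 50
"""


def parse_cap_kstat(text):
    """Collect the integer statistics from one caps kstat record."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Statistic lines have exactly two fields: a name and an integer.
        if len(fields) == 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats


def fraction_above_cap(stats):
    """Share of the observed time during which usage was above the cap."""
    total = stats["above_sec"] + stats["below_sec"]
    return stats["above_sec"] / total if total else 0.0


stats = parse_cap_kstat(SAMPLE)
print(stats["value"], stats["maxusage"])
print(f"{fraction_above_cap(stats):.2%} of the observed time was above the cap")
```

A small fraction, as in this sample, suggests that the cap is rarely the limiting factor for the workload.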

The maximum usage statistic provides a good way for users to estimate how to set up their CPU caps. They can set the cap to a very high value and observe the maximum usage while their application runs for a while. This gives an estimate of the maximum CPU requirements of the workload.

The kstat instance ID is always equal to the zone ID. The global zone will see kstats for all zones, while non-global zones will only see kstats with a matching zone ID.

When a cap is set on a zone, all projects within this zone are automatically capped, but their kstats will show a cap value of zero (unless some of these projects have specific caps of their own). Projects with a cap value of zero participate in zone CPU usage accounting, but are not actually used to enforce project caps. For example, on a system with a zone cap set on zone 1 and a project cap set for project 10 in the global zone:

{{{
kstat caps
module: caps                            instance: 0
name:   cpucaps_project_10              class:    project_caps
        above_sec                       439
        below_sec                       655
        nwait                           1
        usage                           70
        maxusage                        71
        value                           70

module: caps                            instance: 1
name:   cpucaps_project_0               class:    project_caps
        above_sec                       0
        below_sec                       2
        nwait                           0
        usage                           7
        maxusage                        10
        value                           0

module: caps                            instance: 1
name:   cpucaps_project_1               class:    project_caps
        above_sec                       0
        below_sec                       48
        nwait                           0
        usage                           42
        maxusage                        51
        value                           0

module: caps                            instance: 1
name:   cpucaps_zone_1                  class:    zone_caps
        above_sec                       32
        below_sec                       235
        nwait                           3
        usage                           50
        maxusage                        51
        value                           50
}}}

As we can see, there are two projects belonging to zone 1, and three threads are waiting on the zone cap because the zone has reached its limit[[6>>#foot202]].

The kstat(1M) command running in a zone only shows CPU caps relevant for that zone and its projects.
We can use various options of the kstat(1M) command to examine only project or zone caps.

For example,

{{{
#
# Show project caps only
#
$ kstat -c project_caps
#
# Show project caps for zone 1
#
$ kstat -c project_caps -i 1
}}}

Similar observability kstats are also introduced by the //Swap resource control// [[[4>>#PSARC2006598]]] project.

== 5.2 Thread level observability

The [[ps(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9pk?a=view]] command shows threads on the wait queue by displaying "W" for their state. For example:

{{{
$ ps -o pid,s,comm -p 101262
   PID S COMMAND
101262 W /usr/perl5/bin/perl
}}}

The [[prstat(1M)>>http://docs.sun.com/app/docs/doc/816-5166/6mbb1kqcm?a=view]] command shows the //"wait"// state for threads sitting on wait queues. For example:

{{{
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
100686 akolb    4272K 1516K wait     0    0   0:51:20  25% perl/1
}}}

== 5.3 DTrace changes for CPU caps

CPU caps provide two new DTrace sched provider probes for observing the scheduling impact on threads and processes:

; **cpucaps-sleep**
: Probe that fires immediately before the current thread is placed on a wait queue. The lwpsinfo_t of the waiting thread is pointed to by args[0]. The psinfo_t of the process containing the waiting thread is pointed to by args[1].
; **cpucaps-wakeup**
: Probe that fires immediately after a thread is removed from a wait queue. The lwpsinfo_t of the waiting thread is pointed to by args[0]. The psinfo_t of the process containing the waiting thread is pointed to by args[1].

For example, the following D script shows the number of seconds processes spend on CPU and on wait queues. It gives a reasonable estimate of which processes are affected by CPU caps and to what extent.

{{{
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Mark the time process is placed on wait queue */
sched:::cpucaps-sleep
{
        sleep[args[1]->pr_pid] = timestamp;
}

/* Thread leaves wait queue */
sched:::cpucaps-wakeup
/sleep[args[1]->pr_pid]/
{
        /* this->delta is time spent on wait queue */
        this->delta = timestamp - sleep[args[1]->pr_pid];
        @sleeps[args[1]->pr_fname] = sum(this->delta);
        @total[args[1]->pr_fname] = sum(this->delta);
}

sched:::on-cpu
/sleep[curpsinfo->pr_pid]/
{
        /* Mark the time process is placed on CPU */
        oncpu[curpsinfo->pr_pid] = timestamp;
}

sched:::off-cpu
/oncpu[curpsinfo->pr_pid]/
{
        /* this->delta is time spent on CPU */
        this->delta = timestamp - oncpu[curpsinfo->pr_pid];
        @cpu[curpsinfo->pr_fname] = sum(this->delta);
        @total[curpsinfo->pr_fname] = sum(this->delta);
}

END
{
        /* Normalize data to print results in seconds */
        normalize (@cpu, 1000000000);
        normalize (@sleeps, 1000000000);
        normalize (@total, 1000000000);

        printf ("ON-CPU times:\n");
        printa ("%-18s %@u\n", @cpu);
        printf ("\nWait times:\n");
        printa ("%-18s %@u\n", @sleeps);
        printf ("\nTotal times:\n");
        printa ("%-18s %@u\n", @total);
}
}}}

This script was run for a while on a system with a 40% project cap set and two CPU-bound processes running in the project:

{{{
ON-CPU times:
cpudrain-amd       75
project_001        76

Wait times:
project_001        297
cpudrain-amd       297

Total times:
project_001        373
cpudrain-amd       373
}}}

As we can see, each of the two processes spends 20% of its time on CPU and 80% on the wait queue, so together they use 40% of a single CPU, which is exactly what the cap allows them.
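
The percentages quoted above follow directly from the script output. The short Python sketch below repeats the arithmetic using the numbers printed in the example; the dictionary and function names are illustrative only.

```python
# Seconds reported by the D script in the example output above.
on_cpu = {"cpudrain-amd": 75, "project_001": 76}
total = {"cpudrain-amd": 373, "project_001": 373}


def cpu_fraction(name):
    """Fraction of the observed time the process spent on CPU."""
    return on_cpu[name] / total[name]


# Summing the per-process fractions gives the combined usage of the project.
combined = sum(cpu_fraction(name) for name in on_cpu)

for name in sorted(on_cpu):
    print(f"{name}: {cpu_fraction(name):.1%} on CPU")
print(f"combined: {combined:.1%} of a single CPU")
```

Each process comes out at about one fifth of a CPU, and the combined figure is close to the 40% project cap.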

= 6 Issues

== 6.1 Accounting accuracy

There is some inherent error in the usage accounting performed by the system. Even micro-state accounting is not 100% correct[[7>>#foot348]]. CPU usage is aggregated over all threads running in a capped project or zone. When the system is executing many threads within a project or a zone, the aggregated usage error may increase significantly and noticeably reduce the accuracy of the CPU caps.

== 6.2 Dedicated micro-state for wait queues

Due to the non-extensible nature of micro-state accounting, the current design uses the LMS_WAIT_CPU micro-state to keep track of both on-waitq and on-runq CPU times. There is no separate micro-state for wait time only. Adding any extra states would require recompilation of many existing tools that read and process /proc data. Still, the observability hooks described in section [[5>>#sec:observability]] provide enough tools to get around this issue.

== 6.3 Interrupts and pinned threads

Threads which get pinned by interrupt threads do not change their micro-states. Clock tick processing will not happen for pinned threads, but judging only by their micro-state counters it may look as if they have used more CPU time than they actually did. This is a generic micro-state accounting problem, though.

== 6.4 Related Bugs

; **[[6327235>>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6327235]]**
: PSARC/2004/402 CPU caps
; **[[6468003>>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6468003]]**
: prctl should support the notion of default and infinity
; **[[6468451>>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6468451]]**
: Errors from setting resource controls should propagate to the caller
; **[[6194864>>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6194864]]**
: simultaneous setproject()'s on the same project can fail to set rctl

= A Manual Page Changes

== A.1 [[resource_controls(5)>>http://docs.sun.com/app/docs/doc/816-5175/6mbba7f37?a=view]]

The following description should be added to the [[resource_controls(5)>>http://docs.sun.com/app/docs/doc/816-5175/6mbba7f37?a=view]] man page:

; **project.cpu-cap**
: Maximum amount of CPU resources a project can use. The unit used is the percentage of a single CPU (an integer). The cap does not apply to threads running in the real-time scheduling class.
; **zone.cpu-cap**
: Sets a limit on the amount of CPU time that can be used by a zone. The cap does not apply to threads running in the real-time scheduling class. Projects within the zone can have their own CPU caps. The minimum cap value takes precedence. Expressed as an integer, denoting the percentage of a single CPU that can be used by all user threads in a zone.

== A.2 Solaris Dynamic Tracing Guide

The Solaris Dynamic Tracing Guide should be updated to include information about new process states and two new sched provider probes.

=== A.2.1 proc Provider

The proc //Provider// section should be updated to reflect the addition of the new SWAIT process state ([[table 25-5>>http://docs.sun.com/app/docs/doc/817-6223/6mlkidll3?a=view#tbl-sched-state]] in the Solaris Dynamic Tracing Guide).

; **SWAIT(W)**
: The thread is waiting on a wait queue. The sched:::cpucaps-sleep probe will fire immediately before a thread state is transitioned to SWAIT.

=== A.2.2 Probes

The //Probes// section ([[table 26-1>>http://docs.sun.com/app/docs/doc/817-6223/6mlkidll8?a=view#tbl-sched]] in the Solaris Dynamic Tracing Guide) should be updated with information about two new probes:

; **cpucaps-sleep**
: Probe that fires immediately before the current thread is placed on a wait queue. The lwpsinfo_t of the waiting thread is pointed to by args[0]. The psinfo_t of the process containing the waiting thread is pointed to by args[1].
; **cpucaps-wakeup**
: Probe that fires immediately after a thread is removed from a wait queue. The lwpsinfo_t of the waiting thread is pointed to by args[0]. The psinfo_t of the process containing the waiting thread is pointed to by args[1].

=== A.2.3 Arguments

The //Arguments// section should be updated with the information in table [[1>>#probe:arguments]], which describes the arguments of the CPU Caps probes ([[table 26-2>>http://docs.sun.com/app/docs/doc/817-6223/6mlkidll9?a=view#tbl-sched-args]] in the Solaris Dynamic Tracing Guide).

**Table 1:** sched Probe Arguments

||**Probe**|args[0]|args[1]
|cpucaps-sleep|lwpsinfo_t *|psinfo_t *
|cpucaps-wakeup|lwpsinfo_t *|psinfo_t *

=== A.2.4 cpucaps-sleep and cpucaps-wakeup Examples

The //Examples// section should be updated with examples for the CPU Caps probes.

You can use the cpucaps-sleep and cpucaps-wakeup probes to understand the impact CPU Caps have on specific processes and threads.
The following example shows how much time various processes spend on wait queues:

{{{

#!/usr/sbin/dtrace -s

/* Record when a thread of the process is placed on a wait queue. */
sched:::cpucaps-sleep
{
        /* Keyed by pid: overlapping sleeps within one process overwrite each other. */
        sleep[args[1]->pr_pid] = timestamp;
}

/* On wakeup, aggregate the time (in nanoseconds) spent waiting, per executable name. */
sched:::cpucaps-wakeup
/sleep[args[1]->pr_pid]/
{
        @sleeps[args[1]->pr_fname] =
            quantize(timestamp - sleep[args[1]->pr_pid]);
        sleep[args[1]->pr_pid] = 0;
}

}}}

Running the above script results in output similar to the following example:

{{{

# ./capswait.d
dtrace: script './capswait.d' matched 2 probes
^C

  exmh
           value  ------------- Distribution ------------- count
         8388608 |                                         0
        16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
        33554432 |                                         0

  scan
           value  ------------- Distribution ------------- count
        16777216 |                                         0
        33554432 |@@@@@@@@@@@@@@@@@@@@                     1
        67108864 |                                         0
       134217728 |@@@@@@@@@@@@@@@@@@@@                     1
       268435456 |                                         0

  firefox-bin
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@                                       1
        16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      19
        33554432 |@@@@                                     2
        67108864 |                                         0

}}}

== References

; 1
: CPU caps web page on OpenSolaris.org
[[http://www.opensolaris.org/os/project/rm/rctls/cpu-caps/>>Project rm.cpu-caps]]
; 2
: Implementation description
[[http://www.opensolaris.org/os/project/rm/rctls/cpu-caps/caps_implementation/>>Project rm.caps_implementation]]
; 3
: PSARC/2006/496 Improved Zones/RM Integration
[[http://sac.sfbay.sun.com/PSARC/2006/496>>http://sac.sfbay.sun.com/PSARC/2006/496]]
[[http://www.opensolaris.org/os/community/arc/caselog/2006/496>>Community Group arc.496]]
; 4
: PSARC/2006/598 Swap resource control; locked memory RM improvements
[[http://sac.sfbay.sun.com/PSARC/2006/598>>http://sac.sfbay.sun.com/PSARC/2006/598]]
[[http://www.opensolaris.org/os/community/arc/caselog/2006/598>>Community Group arc.598]]

----

==== Footnotes

; ... set[[1>>#tex2html13]]
: See [[resource_controls(5)>>http://docs.sun.com/app/docs/doc/816-5175/6mbba7f37?a=view]] and [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] for the description of privileged limits.
; ... zero[[2>>#tex2html16]]
: Because of bug [[6468451>>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6468451]], [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] will not complain when the cap value is set to zero, but the kernel will ignore such an attempt. As a result, [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]] will show the resource value as being zero.
; ... ("replace")[[3>>#tex2html23]]
: An attempt to change the CPU cap by setting the privileged resource to a new value will create two values of the resource, only one of which is active. To avoid confusion, it is best to always replace the old resource value by giving the -r option to [[prctl(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9p6?a=view]].
; ... procfs[[4>>#tex2html35]]
: See the pr_wtime field of struct prusage in [[proc(4)>>http://docs.sun.com/app/docs/doc/816-5174/6mbb98uiq?a=view]].
; ... state[[5>>#tex2html37]]
: Currently there is a fixed set of micro-state accounting types. Any extension of this set will change the offsets of fields in the data structures that embed micro-state data.
; ... limit[[6>>#tex2html40]]
: In the kstat output, the combined project usage for a zone may not be the same as the zone usage due to rounding error. Internally, usage is kept as nanoseconds per tick and is rounded to an integer percentage value.
; ... correct[[7>>#tex2html43]]
: See bug [[6498304>>http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6498304]] as a good example.

----
Alexander Kolbasov 2007-01-09