| Solaris |
|
|
Andrei Dorofeev
Alexander Kolbasov
Jonathan Chew
This document describes implementation of the CPU Caps projects. It provides the general implementation overview, discusses the major data structures and interaction with scheduling classes. The CPU Caps design is discussed in [1]. It also outlines the general implementation strategy. The on-line version of this paper is available at [2].
Solaris provides different mechanisms which can be used to control CPU usage of applications - mainly Processor Binding, Processor Sets, Dynamic Resource Pools and Fair Share Scheduler.
There are disadvantages associated with using processor and processor set bindings to control CPU usage. They are too coarse (have CPU granularity), require users to choose specific processors to be used for binding, and can prevent certain Dynamic reconfiguration (DR) operations.
The Fair Share Scheduler provides a way to limit CPU usage of zones or projects in situations when workloads compete with each other for the same CPU resources. However, in situations when not all CPU resources are being actively used, CPU shares do not limit CPU usage in any way.
Various customers requested a hard and fine-grained limit on the CPU usage by a zone or a project. The primary motivation for such request is providing consistent user experience independent of the actual machine load.
CPU caps provide a hard CPU usage cap which is enforced even if some CPUs are idle. This is a stronger performance guarantee than the one, provided by FSS shares, where the end result greatly depends on other workloads that run on the system at the same time.
The rest of this document describes the implementation of CPU caps.
A CPU cap can be set on any project or any zone. Zone CPU cap limits the CPU usage for all projects running inside the zone. For any project or zone with a cap set to C the combined usage of all LWPs running in that project or zone should not exceed C% of a single CPU. It is possible to set project caps for projects belonging to capped zones. CPU caps are enforced only for threads running in TS, IA, FX, and FSS scheduling classes.
When CPU usage of projects or zones reaches specified caps, LWPs in them do not get scheduled and, instead, are placed on special wait queues, associated with the project or zone reaching the cap. These LWPs will start running again only when CPU usage drops below the cap level. Each cap has its own wait queue. The time, spent by threads on wait queues is reported as ``wait-cpu'' (latency) time by procfs1. Wait times can be seen in the LAT column when prstat(1M) is invoked with -m (report microstate process accounting information) option. CPU time spent by threads on wait queues is also accumulated at the LMS_WAIT_CPU micro-state accounting state2. This time, however, is not accounted for, when calculating CPU load averages, unlike the time spent by threads in runnable (sitting on run queues) state. There is no separate accounting for time spent on the wait queue.
When a CPU cap is set for a project or zone, the kernel continously keeps track of the CPU time used by all user threads within capped zones and projects over a short time interval and calculates their current CPU usage as a percentage. When the accumulated usage reaches the cap, LWPs running in the user-land (when they are not holding any critical kernel locks) are placed on special wait queues until their project's or zone's CPU usage drops below the cap.
The system maintains a list of all capped projects and all capped zones. On every clock tick every active thread belonging to a capped project adds total accumulated CPU usage for this thread since it was last checked to its project. The threads on-CPU time is also added to its project when a thread is leaving CPU or is exiting. On every clock tick the current project usage value is decayed by one per cent of its value. Usage from all projects belonging to a capped zone is aggregated to get the zone usage. All accounting uses thread micro-state accounting data.
When the current CPU usage is above the cap, a project or zone is considered over-capped. Every user thread caught running in an over-capped project or zone is marked by setting TS_PROJWAITQ flag in thread's t_schedflag field and is requested to surrender its CPU. This causes scheduling class specific CL_TRAP() callback to be invoked. The callback function places threads marked as TS_PROJWAIT on a wait queue and calls switch().
An important design decision is to only put threads on wait queues after being trapped in the user-land (they could be holding some user locks, but no kernel locks) and while returning from the trap back to the user-land when no kernel locks are held either. Putting threads on wait queues in random places while running in the kernel might lead to all kinds of locking problems.
Previous section described the overall CPU caps implementation. Here we describe the major data structures used in the implementation. The two major data structures are the cpucap structure (3.1) and the wait queue structure (3.2). We also introduce the caps_sc data structure (3.3) used to hold per-thread caps-specific data for its scheduling class.
Most of the per-project or per-zone state related to CPU caps is kept in the following cpucap structure:
typedef struct cpucap {
kproject_t *cap_project;
zone_t *cap_zone;
list_node_t cap_link;
int64_ t cap_value;
int64_t cap_usage;
disp_lock_t cap_usagelock;
uint_t cap_flags;
waitq_t cap_waitq;
uint64_t cap_below;
uint64_t cap_above;
kstat_t *cap_kstat;
} cpucap_t;
The fields of this structure have the following meaning:
The cpucap structure is associated with the project or zone the first time a cap is set for that project or zone. The cap_flags field tells whether this is a project or zone cap. For project caps the cap_project field points to the associated project and for zone caps the cap_zone field points to the associated zone.
All enabled project caps are linked together in a single list called capped_projects and all enabled zone caps are linked together in a single list called capped_zones. Both lists are protected with the caps_lock mutex. The lists are traversed by the clock thread and are modified when the cap is modified or removed, so there is usually no lock contention for the mutex.
When a zone cap is enabled, all projects belonging to that zone are also automatically enabled, but they have a special cap_value of zero. If a project already has its own cap, its cap value is unchanged when its zone gets a cap as well. Projects with a cap value of zero participate in CPU usage accounting for the zone, but are never used to enforce a cap.
Whenever an LWP looses its CPU, its on-CPU time, obtained from micro-state accounting data, is added to its project usage. Also, for any thread found on CPU during clock() thread scan, its new CPU usage is added to its project usage.
During each clock tick, the clock() calls the cpucaps_clock_callout() function which scans all capped project and decays each cap usage by one per cent of its value. For caps belonging to capped zones, it also aggregates the cap usage into their zone cap. The cpucaps_clock_callout() function also sets or clears the CAP_REACHED flag if the cap usage is above or below the cap, respectively. If the cap is not reached and there are threads waiting on its wait queue, a single thread is removed from the wait queue and is made runnable. Note that only one threads is relased from the wait queue per clock tick. This slows down the potential load increase and prevents system thrashing when threads are constantly put on and off wait queues.
Each cap keeps some statistics that it exports through kstats. Currently it exports the time spent below and above the cap. The time is calculated in ticks, but is exported in seconds. Kstats are only present for enabled caps and allow users and administrators to check what caps are enabled, what their value and CPU usage is and how much time LWPs they spend above and below the cap. They also show the maximum CPU usage reached.
Most of the fields in the cpucap structure are protected by the caps_lock mutex. The cap_usage field is protected by cap_usagelock dispatcher lock.
CPU Caps introcude the notion of the wait queue. This queue hosts threads sorted in priority order while they can't run because their project or zone CPU usage exceeded its cap limits. Wait queues are always associated with a cpucap structure (see 3.1). The wait queue has the following definition:
typedef struct waitq {
disp_lock_t wq_lock;
kthread_t *wq_first;
int wq_count;
boolean_t wq_blocked;
} waitq_t;
Here is the description of the various fields of the wait queue structure:
The wait queue is very similar to the sleep queue and uses the same mechanism to provide list of threads sorted by priority order.
A wait queue is a singly linked NULL-terminated list with doubly linked circular sublists. The singly linked list is in descending priority order and FIFO for threads of the same priority. It links through the t_link field of the thread structure. The doubly linked sublists link threads of the same priority. They use the t_priforw and t_priback fields of the thread structure.
There are three interesting operations on a waitq list: inserting a thread into the proper position according to priority; removing a thread given a pointer to it; and walking the list, possibly removing threads along the way. This design allows all three operations to be performed efficiently and easily.
To insert a thread, traverse the list looking for the sublist of the same priority as the thread (or one of a lower priority, meaning there are no other threads in the list of the same priority). This can be done without touching all threads in the list by following the links between the first threads in each sublist. Given a thread t that is the head of a sublist (the first thread of that priority found when following the t_link pointers), t->t_priback->t_link points to the head of the next sublist. It's important to do this since a waitq may contain lots of threads.
Removing a thread from the list is also efficient. First, the t_waitq field contains a pointer to the waitq on which a thread is waiting (or NULL if it's not on a waitq). This is used to determine if the given thread is on the given waitq without searching the list. Assuming it is, if it's not the head of a sublist, just remove it from the sublist and use the t_priback pointer to find the thread that points to it with t_link. If it is the head of a sublist, search for it by walking the sublist heads, similar to searching for a given priority level when inserting a thread.
To walk the list, simply follow the t_link pointers. Removing threads along the way can be done easily if the code maintains a pointer to the t_link field that pointed to the thread being removed.
Each project and zone cap has its own wait queue. The project wait queue is preferred over the zone wait queue so when both the project and zone caps are reached the thread is placed on the project wait queue since project usage is usually more accurate. Threads on the wait queue are considered to be in the TS_WAIT state. The wait queue abstraction provides the following interface:
Threads on wait queues are marked as non-swappable to avoid having situations where a thread is on the wait queue but it can't be made runnable quickly. The same thing happens for threads on the run queues, and wait queues are viewed as another place where threads are not supposed to spend a lot time on. Threads can be swapped out when they become runnable again.
There is a small amount of accounting data that should be kept by each scheduling class for each thread which is only used by CPU caps code. This data is kept in the caps_sc structure which is transparent for all scheduling classes:
typedef struct caps_sc {
clock_t csc_timestamp;
hrtime_t csc_cputime;
} caps_sc_t;
The structure has the following fields:
The caps_sc structure is used to keep track of the CPU usage by using micro-state accounting data.
Whenever LWPs is switched off its CPU (by going through TS_PREEMPT() or TS_SLEEP or TS_EXIT) scheduling classes call the cpucaps_charge_adjust() function which does the following:
void
cpucaps_charge_adjust(kthread_t *t,
caps_sc_t *csc)
{
clock_t timestamp = lbolt;
uint64_t usage = mstate_thread_onproc_time(t);
clock_t delta = timestamp -
csc->csc_timestamp;
/*
* Check what time is it now and when
* was the last time we charged this
* thread. If the delta is within two
* seconds, trust the data, otherwise
* discard it as stale.
*/
if (delta >= 0 &&
delta <= two_seconds_tck) {
int64_t usage_delta = usage -
csc->csc_cputime;
if (usage_delta > 0) {
kproject_t *kpj = ttoproj(t);
cap_project_charge(kpj->kpj_cpucap,
usage_delta);
}
}
csc->csc_timestamp = timestamp;
csc->csc_cputime = usage;
}
This function is also called once per tick for all running threads.
The CPU Caps facility provides the following interfaces to the rest of the system:
In addition, the following two macros are provided to quickly check whether the any caps are set or not:
CPU caps are supported by TS/IA, FSS, and FX scheduling classes only. Real-time (RT) scheduling class threads cannot be capped (or more accurately, caps defined for RT threads will have no effect). Each scheduling class, supporting CPU caps should provide the following:
The CPU caps code provides common functions for scheduling classes to perform these tasks.
Once per tick the CL_TICK() callback should call the cpucaps_charge() function and pass it a pointer to a thread and a pointer to the caps_sc structure. The function returns True if the thread should be surrender its CPU because of cap violation and also sets TS_PROJWAITQ and TS_ZONEWAITQ bits in the t_schedflag field for the thread if it project and/or zone caps are violated.
The cpucaps_enforce() function, called by CL_PREEMPT() class methods, is responsible for actually placing threads on wait queues. It returns True if the thread passed as an argument was placed on wait queue and False otherwise.
Threads running in FSS class whose time quantums have not yet expired but which should be put on wait queues because their project/zone caps is reached need to have their priority recalculated because their priority, in part, depends on the overall CPU usage of the project they belong to. The FSS code is trying to simulate the same behavior as if thread's time quantum have just expired.
All CPU caps accounting is done by adding LMS_USER, LMS_SYSTEM and LMS_TRAP micro-state accounting buckets. Any inaccuracy in the micro-state accounting data will influence the accuracy of the CPU caps mechanism3.
It is extremely important to avoid over-accounting. Since project usage is an aggregate of threads usage and number of running threads may be huge, a small per-thread over-accounting adds up in the project and zone usage. As a result, the usage may become very high and this will cause all running threads to be placed on the wait queue. When the usage decays below the cap value, threads will be released from the wait queue and the cycle may repeat, creating load thrashing. Doing accounting updates in small increments whenever thread lives CPU and, in addition to that, once per tick, avoids this problem.
The accounting accuracy depends on the micro-state data and reflects whatever inaccuracies are present there. Since usage for a project is aggregated over all threads, any accounting error is a sum of per-thread error and may become significant when the number of threads is high.
CPU usage is decayed by the caps_update() routine which is called once per every clock tick. It walks lists of project caps and decays their usages. If CPU usage drops below cap levels, threads on wait queues are made runnable again, one thread per clock tick. When caps are removed, all threads on wait queue are made runnable immediately.
The CPU caps implementation provides two facilities for the observability. One is the per cap kstat which shows aggregated project and zone information. Another is the extension of the DTrace sched provider for wait queues:
Internally, the probe contains a single pointer to a thread as an argument and this pointer can be used to find the cap structure, controlling the thread.
A new TS_WAIT thread state is introduced for threads placed on the wait queue. This state can only be entered from the TS_ONPROC state. Threads in the TS_WAIT state can only transition to the TS_RUN and (less frequently) TS_ONPROC states. The state can be checked while holding thread lock. From the userland, /proc's lwpsinfo structure will report pr_sname set to ``W'' for threads sitting on the wait queues.
The clock() function should walk the list of CPUs starting from different places. If clock() always walk the list of CPUs starting from the same CPU with CPU caps present in a steady case when the usage is waving around the cap level, threads running on CPUs that are closer to the head of the list are more likely to be placed on wait queues. This problem is avoided by making clock() walk the list of CPUs in more random order.
:
Alexander Kolbasov 2006-12-15
Terms of Use
|
Privacy
|
Trademarks
|
Copyright Policy
|
Site Guidelines
|
Site Map
|
Help
Your use of this web site or any of its content or software indicates your agreement to be bound by these Terms of Use.
© 2012, Oracle Corporation and/or its affiliates.