Issues
Various Issues with CPU Caps
Dedicated micro-state for wait queues.
One of the main design issues is whether a new micro-state should be added. Right now, the LMS_WAIT_CPU micro-state is used to keep track of both on-waitq and on-runq CPU times. Adding a new micro-state in an update release may be impossible for compatibility reasons. It might be interesting to provide DTrace tools that use special probes to report the CPU time spent on wait queues alone.
It is possible to allow a zero cap on a zone, effectively stopping all processes in the zone. It is interesting to explore whether such a feature provides real value for customers and whether it is dangerous.
Clock rate
Increasing the clock() rate from 100 times/second to 1000 times/second might improve accuracy as well, but it requires careful analysis: it might hurt performance on large SMP systems, where clock has to do many more things, and it might also increase power consumption on laptops. As one data point, Linux went from a clock rate of 100 to 1000, and then fell back to 500 times/second.
CPU usage decay
The CPU decay function currently is ``lose 1/100th of the current CPU usage every clock tick''. Ideally, regardless of what the current CPU usage is, if CPU usage stops growing, it should drop down to 0 in a fixed amount of time (ideally one second). This would prevent projects with really high and really low CPU usages from having different rates at which threads on their wait queues become runnable again. If usage decays down to zero from any level in just one second, then in order to achieve a 1% CPU cap goal we'll need to schedule at most one thread for one clock tick every second. If that accrued usage drops down to 0 in exactly one second, we'll be scheduling that thread again only after 99 additional ticks. The new function would probably be more complicated (and therefore more expensive), so there should be clear benefits to justify the extra complexity. For code references, see the cap_project_usage_walker() function, which does the decay part, and cap_project_charge(), which does the accrue part.
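To make the difference between the two decay policies concrete, here is a minimal user-level sketch in C. It is not the code in cap_project_usage_walker(); the function names, the use of double, and the TICKS_PER_SEC constant are assumptions made for illustration only. The first function models the ``lose 1/100th per tick'' rule described above, the second models a hypothetical linear decay that reaches zero after exactly one second of ticks.

```c
/* Assumed clock rate: 100 ticks per second, as described in the text. */
#define	TICKS_PER_SEC	100

/*
 * Model of the current rule: on every tick, usage loses 1/100th of its
 * current value. Returns the usage remaining after `ticks` ticks with no
 * new CPU time accrued. (Hypothetical helper, not kernel code.)
 */
double
cap_decay_proportional(double usage, int ticks)
{
	int i;

	for (i = 0; i < ticks; i++)
		usage -= usage / TICKS_PER_SEC;
	return (usage);
}

/*
 * Model of the proposed alternative: usage drops by a fixed step per tick,
 * chosen so that any starting level reaches zero after TICKS_PER_SEC ticks.
 * (Hypothetical helper, not kernel code.)
 */
double
cap_decay_linear(double usage, int ticks)
{
	if (ticks <= 0)
		return (usage);
	if (ticks >= TICKS_PER_SEC)
		return (0.0);
	return (usage - (usage / TICKS_PER_SEC) * ticks);
}
```

Under these assumptions, a starting usage of 10000 still has roughly 37% of its value (about 3660) left after 100 ticks with the proportional rule, while the linear variant is at zero after the same 100 ticks regardless of the starting level.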
Interrupts and pinned threads
Threads which get pinned by interrupt threads don't change their micro-states. Clock tick processing won't happen for pinned threads, but it might look like they've used more CPU time than they actually did just by looking at their micro-state counters. This is a generic micro-state accounting problem though.
Strategy for scheduling threads from wait queues
Currently only one thread is removed from the wait queue per clock tick. On multi-processor boxes we may investigate starting multiple threads instead. Running an experiment on a large SMP box while setting a cap very close to the maximum possible usage (e.g., NCPU*100 - 5) would tell us if there's a problem. Comparing just a few vs. lots of spinning threads in a capped project would also be interesting.
Right now, we can make at most 100 threads runnable per second, that is, one thread per clock tick for each project. Imagine a system with 100 CPUs and a project with its cap set to 9900. When we reach this cap utilization, our usage is going to be decayed down to approx. 9800 on every clock tick. That means that if there's one thread sitting on a wait queue, then it should be made runnable again. If our usage drops down further and faster, then it could be reasonable to make more than one thread runnable again. I think that would only happen if there was a sudden change in the workload's CPU usage patterns, so that our usage would drop down really fast while there were many threads sitting on wait queues. I haven't explored this area yet.
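The arithmetic in the 100-CPU example, and one possible policy for waking more than one thread per tick, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, usage and cap are expressed in percent of one CPU (so 100 means one full CPU), and the rule ``wake one thread per 100 units of headroom, but at least one'' is an assumed policy that has not been evaluated, not the current behavior.

```c
/* Assumption: usage and caps are in percent of a single CPU. */
#define	PCT_PER_CPU	100

/* One tick of the current decay rule: lose 1/100th of the usage. */
unsigned int
cap_usage_after_tick(unsigned int usage)
{
	return (usage - usage / 100);
}

/*
 * Hypothetical policy: how many waiting threads to make runnable on this
 * tick. Nothing is woken while usage is at or above the cap. Below the
 * cap, wake one thread per full CPU of headroom, at least one thread,
 * and never more than are actually waiting.
 */
unsigned int
cap_threads_to_wake(unsigned int cap, unsigned int usage,
    unsigned int waiting)
{
	unsigned int n;

	if (usage >= cap || waiting == 0)
		return (0);
	n = (cap - usage) / PCT_PER_CPU;
	if (n == 0)
		n = 1;
	return (n < waiting ? n : waiting);
}
```

With a cap of 9900, one tick of decay takes usage from 9900 to 9801, which under this sketch wakes a single thread, matching today's one-thread-per-tick behavior. Only a much larger drop (say to 9500) would wake several threads at once, which corresponds to the sudden workload change described above.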
Scalability
Traversing a global list of projects and zones may present a scalability
problem when the number of zones or projects in the system is high. We may
need more careful algorithms to provide a scalable solution in this case.