LatencyTOP High Level Design

1 Overview

1.1 Purpose and Scope

Latency causes applications to run slowly even on a high end, fast system. It is easy to notice when a problem caused by latency happens, but it is usually difficult to identify what causes the latency.

LatencyTOP is a tool recently developed for the Linux OS (www.latencytop.org) to identify outstanding latency problems in a running system. Statistics are collected from the OS scheduling subsystem, and collected on both a system-wide and a per-process basis. Application developers can look at the statistics and try to avoid latency problems. A similar tool will also benefit developers writing applications under Solaris.

LatencyTOP monitors the system for latencies: processes that do nothing but wait for some condition to happen. LatencyTOP detects latencies that result in a process sleep or block on some condition. It also detects some known busy loop conditions in the Solaris kernel such as lock spinning. LatencyTOP does not detect busy loop conditions inside user applications. Neither does it detect delay that is not caused by waiting, for example, a process not running because a higher priority process takes a lot of CPU time.

LatencyTOP does not track latency that continues from a point of time preceding the execution of LatencyTOP.

DTrace is the advanced tracing mechanism in Solaris for troubleshooting the kernel as well as applications. With DTrace, it is very easy to gather statistics from Solaris without having to patch the kernel as LatencyTOP for Linux does.

1.2 Design Overview

The goal of the LatencyTOP utility is to analyze the cause of latency in a running system using DTrace probes. In order to do so, every time a LWP is going into and out of SLEEP state, the timing is recorded and the kernel call stack is captured. The utility periodically collects data from DTrace, searches call stacks for symbols that are known to cause latency, and updates the statistics.

The symbols that are used to scan call stacks are hand picked from OpenSolaris kernel source code, for example, zfs`zfs_read. If such a symbol is found in the call stack, this latency is described as "caused by ZFS read".

There are some other kinds of latencies that do not cause LWP to enter SLEEP state. For example, lock spinning causes LWP to wait in busy loop. These latencies are collected by using specific DTrace probes. For example, lock spinning is traced by using the lockstat module.

LatencyTOP uses a console based UI to display the statistics. The windowing is based on libcurses.

1.3 Key Challenges

The actual pattern of what is causing the latency varies in different systems. It is difficult to properly define and categorize all significant causes for latency once and for all. The design of the tool itself can be decided before implementation, but a configuration that only picks causes that "really matter" and organizes them in the most clear way would require test after development work is done, as well as feedback from real use cases.

1.4 Dependencies

LatencyTOP uses the DTrace framework to collect data. The implementation depends on the undocumented interface of libdtrace(3LIB).

2 Software Design

2.1 DTrace Usage in LatencyTOP

LatencyTOP uses DTrace probes sched::resume:off-cpu and sched::resume:on-cpu to capture when a LWP goes into and out of SLEEP. There might be a gap between the time a LWP goes from SLEEP to RUN and the time the LWP is actually running on a CPU after swtch(), so sched:::enqueue is also used to trace when the LWP actually wakes up. When on-cpu is fired, LatencyTOP calculates latency using timestamp values from off-cpu, enqueue, and on-cpu, and captures the kernel call stack.

LatencyTOP uses DTrace aggregations to store statistics. Every set of different pid/tid/stack() will have a separate entry. Values of count(), sum(), and max() are recorded.

When a LWP goes to the RUN state, it might not be placed on the CPU and start immediately because of scheduling and/or its priority. The time between when a thread is enqueued and when it is actually on the CPU and running is counted as "waiting for CPU".

LatencyTOP uses probes lockstat:::adaptive-spin and lockstat:::spin-spin to detect lock spinning. Stack is not used to track this kind of latency. Instead, two special aggregations with known names are used to track them. In case they get higher than expected, lockstat(1M) can be used to further drill the problem down. Suggestion of using this tool will be visible on screen when LatencyTOP is running.

D script used at runtime is embedded in the LatencyTOP binary. Since LatencyTOP makes heavy use of aggregations, the aggsize is changed to 8M in the script. For a very busy system, LatencyTOP may need a larger value to avoid dropping statistics.

LatencyTOP updates aggregations every time latency is discovered by on-cpu probe. On a busy system, there can be hundreds of thousands sleep/wake events every second, and the update will be costly. In order to reduce the impact to the system, especially CPU utilization, an alternative method is used. In this method, all latencies below a specified threshold (for example, 50ms) do not cause aggregation update. Instead, a per-thread counter is increased. When the counter reaches a given value, for example, 100, the aggregations are updated with the values count = 100, sum = 100 * the_100th_latency, max = the_100th_latency" to entry of the 100th call stack. This effectively reduces the update to 1/100 of the original method, while the data remains still statistically correct.

2.2 Identify Latency Causes

LatencyTOP uses a configuration file to identify causes of latencies found in the system. An example of configuration looks like:

# ZFS
D 60    sdt:zfs:zil_commit_writer:zil-cw1  sdt:zfs:zil_commit_writer:zil-cw4 ZFS ZIL writer I/O

Each line defines one tagging rule, except for lines starting with special character "#". The line has the following format:

[:blank:]* "D" [:blank:]+ <priority> [:blank:]+ <entry probe> [:blank:]+ <exit probe> [:blank:]+ <Cause>

These rules will be translated to DTrace probes, and tag latencies when found. Rules will not affect the stack trace and the data itself collected from DTrace. From the moment an entry probe is fired, latencies found from on-cpu probe will be tagged with the Cause string, until corresponding exit probe is fired.

Rules that contain the same Cause string are considered in the same group, and their statistics will be counted together.

Rules can be cascaded. If another entry is fired when previous Cause is still active, i.e. previous exit probe is not fired, priority is used to control which one is used. The rule with the higher priority value overrides rules with lower priority.

Lines starting with "#" are comments and will be ignored during parsing.

LatencyTOP will match latencies to syscalls if no other rules apply. Syscall rules have low priority. The syscall probe is defined in the D script and cannot be modified in the configuration file.

Configuration is embedded in LatencyTOP binary.

2.3 Data Analysis and UI

LatencyTOP periodically creates snapshot of DTrace data and walks through it. After the data is collected, the UI is updated and key presses are processed. So there is no concurrency problem between updating the data and displaying it.

LatencyTOP has similar look and feel as LatencyTOP for Linux version 0.4. The screen layout will look as follows:

                  LatencyTOP for OpenSolaris, version x.y
      Cause                    Count      Average      Maximum      Percentage
System wide latencies
<System-wide Cause 1>          <count>    <Avg> msec   <Max> msec   <Percent> %
<System-wide Cause 2>          <count>    <Avg> msec   <Max> msec   <Percent> %
<System-wide Cause 3>          <count>    <Avg> msec   <Max> msec   <Percent> %

Process <exec name> (<pid>)   Total:   <total latency> msec in <X> threads   CP
<Per-Process Cause 1>          <count>    <Avg> msec   <Max> msec   <Percent> %
<Per-Process Cause 2>          <count>    <Avg> msec   <Max> msec   <Percent> %

<-  <Process1>...                                                <ProcessN>  ->
                  Hint line

The top half of the screen displays system-wide statistics. Percent is calculated within a given period, for example, 5 seconds. Percent = sum(causeA)/sum(all)*100%. Avg means the average latency. Avg = sum(causeA)/count(causeA). Max represents the largest period of process being blocked. For example, if for the same cause, cv_wait() is called several times within a given period, and in the worst case the process is blocked X ms, Max = X. Count represents how many times a cause is found. Up to 10 causes will be displayed.

The bottom half of the screen displays per-process statistics. Latencies are ordered in the same way as system-wide statistics. Up to 8 causes will be displayed. The user can choose a different process to analyze by pressing "<" or ">" and cycling through the process list at the very bottom of the screen.

Latencies are sorted in descending order by percentage (which is the same as by total). The user can choose to sort in descending order by different criteria by pressing hotkeys.

LatencyTOP collects data periodically and updates the UI. Between data collection, the data on the UI remains the same. The data refresh period is equal to the period in which statistics are calculated.

An arrow will appear on the left or right side of the process tab, if there are more processes in that direction. Also, the hotkeys and hints will appear at the very bottom of the screen for ease of use.

It is also possible to trace latencies at the thread level. By pressing "t", the user can toggle the per-process and per-thread views. In thread mode, the bottom half of the screen represents values from a thread instead of the whole process in this mode. The flag at column 80 of the process name line indicates the current mode (process or thread).

The user can also switch to different views by pressing the number keys. Currently defined views are shown in the following table. The flag at column 79 of the process name line indicates the current view.

KeyColumn 79 of Process NameView
1CShows latencies grouped by causes in the kernel stack,
2SShows only special causes such as lock contention or scheduler queue.
3LShows causes grouped by their synchronization objects.

2.4 Logging Stack Traces

If none of the symbols defined in the configuration file is found in a stack trace, the latency is considered "not categorized" in the active configuration file. A command option is available in LatencyTOP to dump all such causes, one entry for each stack trace. After LatencyTOP runs for a while, develop/test engineers can look into the OpenSolaris kernel source code for the logged stack and update the configuration file.

There are 4 log levels in LatencyTOP:

  • No log at all
  • Log only stack traces that do not match to any rule
  • Log only stack traces that match to some rule
  • Log all stack traces

By default, the log will be written to the file when LatencyTOP ends. The user can also choose to write and reset the log periodically to the file to have "snapshots" of what happens.

3. Implementation and Testing Consideration

While the system is very busy, for example generating hundreds of thousands sleep/wake events every second, the CPU utilization will go high with LatencyTOP running. This is because the data collection in the D script will take a little CPU time to run every time a wakeup happens. In order to reduce the impact to the system, a less accurate sampling method is used as an alternative. See section 2.1 for details.

The initial release includes internal experience. To help define the initial symbols/stack traces of interest, LatencyTOP needs to be tested with several real workloads before release.

4. Future Improvements

The initial release of LatencyTOP aims to provide similar features and look-and-feel as LatencyTOP for Linux version 0.4. For the initial release, only a limited set of latency causes is defined in the configuration file. More causes will be added after internal and external tests. It is also possible to create multiple sets that group latency causes in different ways, for example categorized based on syscall types versus resource types to suit the need in a different environment, and later choose one to load when LatencyTOP launches by passing it in -c parameter.

Because DTrace can also capture user call stack, it would be useful to add a feature if a latency meets some condition (for example, goes above a threshold), and record both kernel and user stack. This would immediately pinpoint the cause in the user application.

last modified by alta on 2009/10/27 17:21
Collectives
Project


© Sun Microsystems Inc. 2009
XWiki Enterprise 1.8.2.19075 - Documentation
Terms Of Use | Privacy | Trademarks | Copyright Policy | Site Guidelines | Site map | Help
Your use of this web site or any of its content or software indicates your agreement to be bound by these Terms of Use.