-- Main.AlexanderKolbasov - 06 Jan 2006
NUMA Observability Page
Wish List
While the Solaris kernel provides support for NUMA platforms, there are currently no tools to observe what it is actually doing and how it aligns with specific application requirements. There is a need for specific tools that can explore the system from a NUMA standpoint and provide enough information to understand application behavior, verify that an application behaves as expected by the developer, and accurately diagnose and, possibly, fix any problems. Potential users of such tools are:
- System administrators who need to explore system behavior as a whole and quickly spot issues with system performance
- Application developers and performance engineers who need to explore performance issues with specific applications
- OS engineers who need to diagnose and repair any potential system misbehavior
All these users need tools to do the following:
- Observability: Observe the behavior of the system as a whole and of specific applications, and be able to spot any abnormalities
- Diagnosability: Diagnose what went wrong and why
- Control: Adjust the behavior of the system or specific processes
Observability, diagnosability and control would be immediately useful for user-level processes and threads, but equivalent information for kernel threads and memory would also be very useful.
In the next section we explore the "ideal" set of tools that would help all three classes of users to observe, diagnose and control their system and applications.
Observability
- System configuration:
- * Are we dealing with a UMA or a NUMA system?
- * What is the lgroup hierarchy?
- * What does each lgroup contain?
- * What are the characteristics of each lgroup (e.g. latency)?
- * lgrpinfo(1) answers all these system configuration questions.
- Overall system behavior:
- * How are threads distributed across lgroups?
- * ps(1), prstat(1M) extensions - see below.
- * How is load average distributed across lgroups?
- * lgrpinfo(1), kstats
- * How successful are threads at running at home? Are there any excessive migrations from CPU to CPU and from home lgroup to remote lgroup?
- * We can keep per-thread statistics, export them via /proc and use prstat microstate mode to show them.
- * We can also use DTrace-based profiling with the existing sched provider probes. Specific scripts need to be written for such monitoring.
- * What is the overall rate of lgroup-specific events such as migrations and non-local allocations? This would give the user an overall "feel" for a "healthy" versus "unhealthy" system in the same way mpstat(1M), vmstat(1M) and iostat(1M) do.
- * Need more per-CPU and per-lgroup kstats
- * Can be done using DTrace scripts
- * Need some monitoring tool based on DTrace, kstats, or both
- * Is there enough memory in each lgroup to satisfy requests for local allocations?
- * lgrpinfo(1) and kstats
- * How successful are threads at accessing local memory?
- * Nothing....Need dprofile, a VM sampling mechanism, or CPU hardware performance counter(s).
- Process/threads to lgroup relationships:
- * What processes/threads run in what lgroups?
- * Can be done using DTrace sched provider probes
- * What are the home lgroups of various threads?
- * ps -H, prstat -H, plgrp(1)
- * What processes or threads run in specific lgroup(s)?
- * ps -h, prstat -h
- * What lgroups provide memory for a process?
- * pmap -L
- * What processes use memory from an lgroup?
- * Nothing....Need a system monitoring tool? Could be expensive to collect incrementally or all at once.
- * May use the existing page_get DTrace probe to collect data at run time.
- * Need to add a page_get probe to page_get_anylist()
- * How much memory does a process use per lgroup?
- * For a single process, can aggregate over pmap -L output
- * For many processes at once: Nothing....Maybe have RSS per lgroup?
- * What are a process's memory advice and memory allocation policies?
- * Nothing....Would have to remember the advice given, create a new API to get it and the memory allocation policy, and change pmap(1) to display them.
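The idea of aggregating pmap -L output for a single process can be sketched as follows. This is a minimal illustration only: the input layout (address, size with a K suffix, mode, lgroup, mapping name) is an assumed, simplified format, and the function name memory_per_lgroup and the sample data are hypothetical. Real pmap -L output may be laid out differently and would need its own parsing.

```python
from collections import defaultdict


def memory_per_lgroup(lines):
    """Sum mapping sizes (in KB) per lgroup.

    Assumes each line has the hypothetical layout:
        <address> <size>K <mode> <lgroup> <mapping name...>
    Lines that do not match (headers, totals, "-" for no lgroup) are skipped.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        size, lgroup = fields[1], fields[3]
        if not (size.endswith("K") and size[:-1].isdigit() and lgroup.isdigit()):
            continue
        totals[int(lgroup)] += int(size[:-1])
    return dict(totals)


# Made-up sample in the assumed layout, not captured from a real system.
SAMPLE = """\
00010000 8K r-x-- 2 /usr/bin/example
00020000 16K rwx-- 1 [ heap ]
FF3A0000 64K r-x-- 2 /lib/libc.so.1
FFBFC000 16K rw--- - [ stack ]
"""

if __name__ == "__main__":
    for lgrp, kb in sorted(memory_per_lgroup(SAMPLE.splitlines()).items()):
        print(f"lgroup {lgrp}: {kb}K")
```

Doing this for every process on the system means one pmap run per process, which is why a per-lgroup RSS counter maintained by the kernel is suggested above as the cheaper alternative.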
Conclusions
- lgrpinfo(1) provides adequate system configuration information
- ps(1) and prstat(1M) extensions provide good thread-level observability
- Need a system monitoring tool, which may be based on DTrace and kstats
- Existing DTrace probes are almost enough for thread-level observability
- Additional DTrace probes may be needed for memory observability
- Profiling or VM sampling mechanism can really show access patterns
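A system monitoring tool of the kind mentioned above would mostly turn cumulative counters into per-interval rates, the way mpstat-style tools do. The sketch below shows only that arithmetic. The counter names (migrations, remote_allocs) and the snapshot structure are hypothetical; no such per-lgroup kstats are assumed to exist today.

```python
def event_rates(prev, curr, interval_s):
    """Return per-second event rates for every lgroup counter.

    prev and curr map lgroup id -> {counter name: cumulative value},
    taken interval_s seconds apart. Counters missing from prev are
    treated as starting from zero.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    rates = {}
    for lgrp, counters in curr.items():
        before = prev.get(lgrp, {})
        rates[lgrp] = {
            name: (value - before.get(name, 0)) / interval_s
            for name, value in counters.items()
        }
    return rates


if __name__ == "__main__":
    # Two made-up snapshots taken 5 seconds apart.
    first = {0: {"migrations": 100, "remote_allocs": 40}}
    second = {0: {"migrations": 150, "remote_allocs": 45}}
    for lgrp, r in event_rates(first, second, 5).items():
        print(lgrp, r)
```

The same calculation applies whether the snapshots come from kstats or from DTrace aggregations; only the collection step differs.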
Diagnosability
The list above provides a pretty good observability picture for system administrators, application developers and OS developers. Once some problems are observed, we need tools to get to the root cause of the problem. Such tools should provide answers to the following questions:
- What processes or threads are spending too much time away from home, and why?
- * What part - can be done with prstat(1M) microstate extensions.
- * Can be done with DTrace sched provider probes
- * Why part - potentially additional microstate extensions to show what is causing stealing. Can also be done with DTrace sched provider probes. May need additional probes to pinpoint migration details.
- Process/threads profile:
- * How much does a thread run in each lgroup?
- * How much memory does a thread allocate in each lgroup?
- * Need profiling tool for these
- * May use DTrace-based profiling for the first one
- System profile:
- * How successful is each lgroup in running its threads at home?
- * Nothing....May be implemented as per-lgroup kstat.
- * May be implemented using DTrace, may need monitoring script
- * How successful is the system as a whole in running threads at home?
- * Nothing....Aggregation of per-lgroup kstats or profiling tool (or dtrace script)?
- * How successful are local memory allocation requests for each lgroup?
- * Per-lgroup kstats. Probably should be cleaned up a bit.
- * DTrace scripts around page_get probe.
- * What were typical reasons for failing local memory allocations?
- * _Nothing....More specific per-lgroup kstats in page_get_xxx() functions?_
- What system activity causes excessive migrations (e.g. preemption, interrupts, job stealing from idle CPUs, run-queue balancing, etc.)?
- * DTrace scripts
- * Need more: extended microstate accounting + prstat(1M) extensions to observe it. What tool should expose this? mpstat(1M)? System monitor?
- What processes consume most of the given lgroup memory?
- * Nothing: Something like per-lgroup RSS? Also need some system monitoring tool to display this.
- * pmap -L is prohibitively expensive for this.
- * Maybe an estimate from simple per-thread counters?
- What is the memory access pattern for a specific thread? What processes or threads exercise most non-local memory accesses and what is/are the lgroup(s) they access the most? [...] most from local or interleaved memory?
- * Nothing....Need dprofile, VM sampling, and/or CPU hardware performance counters, and need to observe each thread in the system.
- Why can memory not be allocated in the requested lgroup?
- * Nothing....Maybe per-lgroup kstats + extra DTrace probes?
- What are the recommendations for system administrators or users to fix any observed problems?
- * Nothing....Have a document or some sort of smart system monitor?
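The system profile questions above (how successful each lgroup, and the system as a whole, is at running threads at home) reduce to a ratio over per-lgroup counters. The following is a minimal sketch under the assumption that each lgroup exported two hypothetical counters: the number of times its threads ran at home and the number of times they ran away from home. Neither counter nor the function names below exist in the current kstats or libraries.

```python
def home_ratio(home_runs, away_runs):
    """Fraction of runs that happened in the home lgroup, or None if idle."""
    total = home_runs + away_runs
    return home_runs / total if total else None


def system_home_ratio(per_lgroup):
    """Aggregate (home_runs, away_runs) pairs from all lgroups into one ratio."""
    home = sum(h for h, _ in per_lgroup.values())
    away = sum(a for _, a in per_lgroup.values())
    return home_ratio(home, away)


if __name__ == "__main__":
    # Made-up counters: lgroup id -> (home_runs, away_runs).
    counters = {1: (90, 10), 2: (50, 50)}
    for lgrp, (home, away) in sorted(counters.items()):
        print(f"lgroup {lgrp}: {home_ratio(home, away)}")
    print(f"system: {system_home_ratio(counters)}")
```

The system-wide figure sums the raw counters rather than averaging the per-lgroup ratios, so a busy lgroup weighs more than an idle one.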
Conclusions
- No existing tools
- DTrace may cover a lot, but need custom scripts
- Even better to have a small NUMA DTrace toolkit
- Or even a special system monitor based on DTrace/kstats
- Need some in-kernel work for more accurate kstats and additional probes
Control
Once the root cause is discovered, we need to be able to "fix" some of the problems or provide specific recommendations to remedy the situation. Some fixes may require administrative intervention. For example, if there are not enough system resources, the system administrator may add CPUs or memory, or stop some applications that are consuming too many resources. Other fixes may require the following:
- Providing hints, describing application behaviour, to the OS.
- * pmadvise(1)
- * madv.so.1(1)
- * madvise(3C)
- Moving processes from one lgroup to another
- * plgrp(1)
- * lgrp_affinity_set(3LGRP). Need a way to set the home lgroup without setting lgroup affinity?
- * _Need LD_PRELOAD or policies?_
- Moving process memory from one lgroup to another
- * pmadvise(1)
- * madv.so.1(1)
- * madvise(3C)
- * No way to move memory to a specific lgroup....Should there be?
- Changing application policies
- * Nothing....Need policies to affect thread placement, a way to inherit policies, and APIs and tools for these.
Once we understand specific properties of applications, we may want to apply permanent "fixes" to them without modifying the application. This may require methods to do the following:
- Distribute or consolidate application threads among several lgroups
- * TAGs
- Place application threads in specific lgroups
- * LD_PRELOAD tool for thread placement
- * Inheriting home lgroup on fork()
- Affect how memory is allocated
- * madv.so.1(1)
- Specify policies for an application
- * Nothing....See above
Conclusions
- Existing tools (plgrp, pmadvise) provide some control over thread and memory placement
- Tricks with preloaded libraries may allow running applications with "predefined" behavior.
- Need additional APIs for affecting thread home lgroup and dealing with process/thread policies.
- TAGs may provide very useful functionality.
Overall Conclusions
- The set of proposed tools provides pretty good observability coverage
- More observability + diagnosability can be obtained with a DTrace toolkit
- May need some overall system monitor to integrate DTrace scripts and kstats data
- Proposed tools provide some level of control, more work needed (mainly TAGs, policies)
- Profiling tool may provide lots of otherwise unobtainable information
- Memory observability/diagnosability/control is worse than that for thread placement
Suggested extensions to commands
ps(1)
-h Lists only processes homed to the specified lgroups.
-H Prints the home lgroup of the process under an additional column header, HOME.
- In addition, a new output format specifier, home, is added so that shell scripts can easily get the home lgroup of a specific process by issuing the following command: $ ps -o home= -p $$
- See suggested man pages diffs.
prstat(1M)
The prstat(1M) command is extended with two additional flags:
-h Lists only processes homed to the specified lgroups.
-H Prints the home lgroup of the process under an additional column header, HOME.
Suggested extensions to liblgrp(3LIB) library
- lgrp_home_set()
- * should refuse to set a home incompatible with the processor set
- * should refuse to set a home incompatible with strong affinities
- * will set the home regardless of weak affinities
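The three rules proposed for lgrp_home_set() can be modelled as a small decision function. This is a sketch of one possible reading, not an implementation. It assumes that "incompatible with processor set" means the target lgroup has no CPUs in the thread's processor set, and that "incompatible with strong affinities" means the thread has a strong affinity for some lgroup(s) and the target is not one of them. The function can_set_home, its arguments, and the Affinity enum are hypothetical; the enum only mirrors the none/weak/strong levels used by lgrp_affinity_set(3LGRP).

```python
from enum import Enum


class Affinity(Enum):
    NONE = 0
    WEAK = 1
    STRONG = 2


def can_set_home(target, pset_lgroups, affinities):
    """Decide whether a thread may be re-homed to the target lgroup.

    target       -- lgroup id requested as the new home
    pset_lgroups -- set of lgroup ids that have CPUs in the thread's processor set
    affinities   -- dict mapping lgroup id -> Affinity for this thread
    Returns (allowed, reason).
    """
    # Rule 1: refuse a home the processor set cannot run the thread in.
    if target not in pset_lgroups:
        return False, "target lgroup has no CPUs in the processor set"
    # Rule 2: refuse a home that contradicts existing strong affinities.
    strong = {lgrp for lgrp, aff in affinities.items() if aff is Affinity.STRONG}
    if strong and target not in strong:
        return False, "target lgroup conflicts with a strong affinity"
    # Rule 3: weak affinities are deliberately not consulted.
    return True, "ok"


if __name__ == "__main__":
    print(can_set_home(2, {1, 2}, {1: Affinity.WEAK}))
    print(can_set_home(2, {1, 2}, {1: Affinity.STRONG}))
```

Checking the processor set first means a request that fails both tests reports the processor set conflict, which is arguably the harder constraint.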
Suggested additional kstats
GUI Ideas
The observability and control tools described above lend themselves pretty well to the
graphical management paradigm. We can imagine a GUI that allows the following:
- Walking the lgroup hierarchy and showing content of each lgroup
- Showing processes/threads in each lgroup
- Moving processes/threads between lgroups (e.g. by dragging them from one lgroup to another)
- Looking at the process address map (e.g. by clicking on a process)
- Applying advice to regions by selecting them
- Grouping threads into TAGs by selecting threads and applying properties to selections
- Creating processor sets by dragging CPUs and processes into them.
- Binding threads to CPUs by "pinning" them
- Viewing per-lgroup loads and other stats visually