NUMA Observability Tools PSARC case
== SUMMARY

This document explains the architecture of the observability and control tools needed for Memory Placement Optimization (MPO). New tools, additions to existing tools, and corrections to previously proposed but not yet integrated tools are proposed:

- Two new tools are proposed: [[lgrpinfo(1)>>../lgrpinfo/lgrpinfo_man]] for displaying the lgroup hierarchy, and [[plgrp(1)>>../plgrp/plgrp_man]] for observing and affecting lgroup affinities for specified threads.

- A new Lgrp Perl module is introduced as a Perl interface to [[liblgrp(3LIB)>>http://docs.sun.com/app/docs/doc/816-5173/6mbb8adu6?a=view]]. [[lgrpinfo(1)>>../lgrpinfo/lgrpinfo_man]] uses this module because it is a Perl script.

- New flags are proposed as additions to the existing [[ps(1)>>../ps/ps.1.txt]] and prstat(1M) commands for displaying the home lgroup of all processes (ps) or active processes (prstat), and for listing the processes or threads homed in a given lgroup.

- Some minor corrections are needed to the output format of [[pmadvise(1)>>../pmadvise/pmadvise_man]], to the -L option of [[pmap(1)>>../pmap/pmap.1.txt]], and to the syntax of the -A option of [[pmap(1)>>../pmap/pmap.1.txt]] (these were originally introduced in PSARC 2004/484 and 2004/485 but have not been integrated into Solaris yet).

These new tools and changes are discussed below, along with our previously proposed tools, as part of the MPO observability and control tools architecture, and in separate supporting documents where an individual tool needs to be explained in more detail.

== BACKGROUND

As part of the Memory Placement Optimization feature in Solaris, we have added a "locality group" (lgroup) abstraction that describes which resources are near each other on a NUMA machine, and a framework that uses it to optimize performance through locality.

Each locality group represents a set of CPU-like and memory-like hardware resources that are at most some latency apart from each other. A Uniform Memory Access (UMA) machine is represented by a single lgroup (the root lgroup). A Non-Uniform Memory Access (NUMA) machine is represented by a hierarchy of lgroups that shows the corresponding levels of locality. The lgroup hierarchy is organized to facilitate finding the nearest resources. Each parent lgroup in the hierarchy contains the resources of its children plus the next nearest resources.

Upon creation, each thread in the system is assigned to a "home" lgroup where the operating system will try to run the thread and allocate its memory and other resources to improve its performance via locality. If the desired resources aren’t available in the thread’s home lgroup, the operating system will traverse the lgroup hierarchy from the thread’s home lgroup to find the nearest available resources.

== MOTIVATION

MPO tries to provide good performance by default. We expect this to suffice for the majority of applications, but a small minority may need more. Tools can make it easier to understand what MPO is doing and to tune performance beyond what is provided by default.

Specifically, tools are needed to facilitate observability, diagnosability, and control of the Solaris lgroup framework and its optimizations for locality on NUMA machines. So far, the lgroup framework and APIs have been provided to allow some observability and control, but few or no tools have been provided.

Basic tools are needed to at least display the lgroup hierarchy, its contents, and characteristics, and to observe and affect thread and memory placement among lgroups, since placement is essential to locality.

== USERS

The intended consumers of the tools are system administrators, developers, performance engineers, systems programmers, and support engineers. These consumers may be interested in knowing more about the system, application(s), or both.

We believe that most questions from these consumers boil down to one of the following:

- What is the system configuration?
- Are the system or application resources balanced or placed well among lgroups?
- Is MPO successful?
- Why did that happen?

== TOOLS

Basic observability and control tools are essential to addressing these fundamental questions of system configuration, balance or placement, success, and diagnosability. These tools mostly help answer questions about system configuration and balance or placement, but they also provide the basic information and mechanism needed to determine whether MPO is successful and to diagnose problems related to MPO.

To answer the question of whether MPO is successful, profiling and statistics would probably be most helpful. However, it is important to know the thread’s affinities for lgroups (such as its home lgroup) and where its memory is allocated to determine whether MPO //should// be successful in providing good locality and subsequently good performance. In addition to that, a tool that profiles where a given thread runs and which memory it accesses most (relative to lgroups) would be useful for determining whether MPO is //really// successful.

For diagnosability, and to understand why something is happening, one first has to understand what happened. We have found that our observability tools at least help to show what is happening, and that our tools for affecting thread and memory placement provide a way to gain a deeper understanding of what an application is doing or needs through experimentation, especially when the source isn’t available.

To really be able to find out why something happened (like a thread not running in its home lgroup or not allocating local memory), we believe that dtrace(1M) and potentially some more instrumentation in the kernel will be needed.

In this PSARC case, we would like to propose the basic observability and control tools needed for MPO. As explained above, these tools are essential to observability, control, performance analysis, and diagnosability. While they don’t completely address the areas of performance analysis and diagnosability, they give what’s needed to start and should be very useful now. Moreover, we believe that the additional tools needed for performance analysis and diagnosability will overlap the proposed tools little, if at all, because they require different mechanisms.

Here is a small table that shows which question or area is addressed by which tool(s) for observability and control:

|=|=OBSERVE |=CONTROL
| CONFIGURATION | [[lgrpinfo(1)>>../lgrpinfo/lgrpinfo_man]] | |
| THREAD PLACEMENT | [[plgrp(1)>>../plgrp/plgrp_man]], ps, prstat | [[plgrp(1)>>../plgrp/plgrp_man]] |
| MEMORY PLACEMENT | [[pmap(1)>>../pmap/pmap.1.txt]] | [[pmadvise(1)>>../pmadvise/pmadvise_man]], [[madv.so.1(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9ld?a=view]] |

=== LGROUP HIERARCHY

The [[lgrpinfo(1)>>../lgrpinfo/lgrpinfo_man]] utility can be used to display the lgroup hierarchy, its contents, and characteristics and to easily determine the following:

- Whether the system is a UMA or NUMA machine
- Which CPUs are near each other, which have memory near them, and how much
- What the relative latencies are between the CPUs and different memory
- How the operating system has organized these CPU and memory resources into a hierarchy to facilitate finding the nearest resources quickly
- How each lgroup relates to the other lgroups
- Lgroup thread and memory loads (e.g., load average and amount of memory in use and free)

It can be useful for the following:

- Observing and verifying the lgroup hierarchy
- Understanding the context in which the operating system is trying to optimize applications for locality
- Observing whether system (CPU and memory) resources are well balanced or placed across lgroups

Overall, the tool has been very helpful in understanding the system better and in recognizing and diagnosing some problems at the system level.

Please see the [[lgrpinfo~(1~) writeup>>../lgrpinfo/lgrpinfo_psarc.txt]] for more discussion on [[lgrpinfo(1)>>../lgrpinfo/lgrpinfo_man]], the [[lgroup perl module PSARC writeup>>../perl_lgrp/lgrp_mod_psarc.txt]] for a discussion of the supporting liblgrp Perl module, and the [[Lgrp(1) man page>>../perl_lgrp/Lgrp_man]] for its specification.

=== PLACEMENT

Thread and memory placement among lgroups are essential to optimizing for locality. Thus, the ability to observe and affect how threads and memory are placed among lgroups is important for understanding and affecting the performance of the system and applications on NUMA machines.
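
To illustrate the home lgroup fallback described in BACKGROUND, here is a minimal sketch in Python. The Lgroup class, its fields, and the lgroup IDs are hypothetical stand-ins and not the kernel’s or [[liblgrp(3LIB)>>http://docs.sun.com/app/docs/doc/816-5173/6mbb8adu6?a=view]]’s data structures. It models only free memory and only the walk from a home lgroup toward the root.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Lgroup:
    """Toy locality group (hypothetical; not a kernel or liblgrp structure)."""
    lgrp_id: int
    free_mem_kb: int = 0                      # memory directly in this lgroup
    parent: Optional["Lgroup"] = None
    children: list = field(default_factory=list)

    def add_child(self, child: "Lgroup") -> "Lgroup":
        child.parent = self
        self.children.append(child)
        return child

    def resources_free_kb(self) -> int:
        # A parent contains the resources of its children plus its own.
        return self.free_mem_kb + sum(c.resources_free_kb() for c in self.children)


def nearest_lgroup_with_memory(home: Lgroup, needed_kb: int) -> Optional[Lgroup]:
    """Walk from the home lgroup toward the root and return the first
    lgroup whose resources can satisfy the request, or None."""
    node = home
    while node is not None:
        if node.resources_free_kb() >= needed_kb:
            return node
        node = node.parent
    return None


# Two leaf lgroups under a root; the thread's home (lgroup 1) has no free memory.
root = Lgroup(0)
home = root.add_child(Lgroup(1, free_mem_kb=0))
other = root.add_child(Lgroup(2, free_mem_kb=4096))

chosen = nearest_lgroup_with_memory(home, 1024)
print(chosen.lgrp_id)  # prints 0: the request is satisfied one level up
```

In this toy model, a request that the home lgroup cannot satisfy is resolved at the nearest ancestor whose combined resources are sufficient, which mirrors the idea that each parent adds the next nearest resources.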

==== THREAD

The following tools are for observing and affecting the placement of threads among lgroups:

- [[ps(1)>>../ps/ps.1.txt]] for observing the home lgroup of every user process or thread in the system
- [[prstat(1M)>>../prstat/prstat.1M.txt]] for observing the home lgroup of the active processes or threads in the system
- [[plgrp(1)>>../plgrp/plgrp_man]] for observing and affecting thread placement among lgroups

To provide a system view of how all user processes and threads are placed among lgroups, a new -H option is proposed for [[prstat(1M)>>../prstat/prstat.1M.txt]] to display the home lgroup of active user processes and threads and for [[ps(1)>>../ps/ps.1.txt]] to show the home lgroup of all user processes and threads. Furthermore, a new -h option is proposed for [[ps(1)>>../ps/ps.1.txt]] and [[prstat(1M)>>../prstat/prstat.1M.txt]] to show all user processes or threads that have a specified lgroup as their home. A new "lgrp" format specifier is proposed for [[ps(1)>>../ps/ps.1.txt]] to allow for custom output formatting.

The new [[plgrp(1)>>../plgrp/plgrp_man]] tool is for observing and controlling the placement of threads among lgroups. It can get and set the home lgroup and lgroup affinities of a given set of threads. It does so either by reading the information that /proc already provides or by using the /proc agent LWP to make calls from within the target process on the tool’s behalf.

To facilitate observing the home lgroup of a thread in a live process or core file, a new pr~_lgrp field has been added to lwpsinfo~_t in /proc. This field is documented in [[proc(4)>>attach:proc.4.txt]] as containing the home lgroup of the corresponding thread. Similarly, the lwpsinfo~_t exposed by the dtrace proc provider now includes the new pr~_lgrp field.
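
The selection idea behind the proposed -h option, and the kind of balance question a home lgroup column helps answer, can be sketched as follows. This is an illustration only: the LwpInfo record and the sample values are hypothetical, and the sketch does not reproduce how ps(1) or prstat(1M) are implemented or what their output looks like. Only the pr~_lgrp field name comes from the lwpsinfo~_t change described above.

```python
from typing import NamedTuple


class LwpInfo(NamedTuple):
    """Hypothetical per-thread record; pr_lgrp mirrors the new lwpsinfo_t field."""
    pid: int
    lwpid: int
    pr_lgrp: int   # home lgroup of the thread


def threads_homed_in(lwps, lgroups):
    """Keep only threads whose home lgroup is in the requested set."""
    wanted = set(lgroups)
    return [t for t in lwps if t.pr_lgrp in wanted]


def count_by_home(lwps):
    """Count threads per home lgroup, a quick view of balance across lgroups."""
    counts = {}
    for t in lwps:
        counts[t.pr_lgrp] = counts.get(t.pr_lgrp, 0) + 1
    return counts


# Made-up sample data: two processes, three threads.
sample = [LwpInfo(100, 1, 1), LwpInfo(100, 2, 2), LwpInfo(200, 1, 1)]

for t in threads_homed_in(sample, [1]):
    print(t.pid, t.lwpid, t.pr_lgrp)
print(count_by_home(sample))
```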

Please see the [[lwpsinfo PSARC writeup>>attach:lwpsinfopsarc.txt]] for more details on the changes to lwpsinfo~_t, and the [[proc(4)>>attach:proc.diffs.txt]], [[ps(1)>>../ps/ps.diffs.txt]], and [[prstat(1M)>>../prstat/prstat.diffs.txt]] man page diffs for the interface changes.

==== MEMORY

The tools for observing and affecting the placement of memory among lgroups are the following:

- [[pmap(1)>>../pmap/pmap.1.txt]] for observing memory placement among lgroups (PSARC 2004/485)
- [[pmadvise(1)>>../pmadvise/pmadvise_man]] for applying advice to virtual memory ranges, offering fine-grained control of memory placement among lgroups through madvise(MADV~_ACCESS~_*) (PSARC 2004/484)
- [[madv.so.1(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9ld?a=view]] for applying advice to all kinds of memory (e.g., heap, stack, private, shared, mapped, anonymous memory), offering coarse-grained control of memory placement among lgroups through madvise(MADV~_ACCESS~_*) (PSARC 2002/030)

When the -L option is given, [[pmap(1)>>../pmap/pmap.1.txt]] will display the lgroup that directly contains the physical memory backing some given virtual memory. In addition, a new -A option was proposed in PSARC 2004/485 to make it possible to specify a virtual address range of interest, since using the -L option can result in one line per page when contiguous physical pages don’t back a given portion of the virtual address space.

The [[pmadvise(1)>>../pmadvise/pmadvise_man]] tool is for affecting how memory is placed among lgroups. It uses a /proc agent LWP to make calls to madvise(3C) with the MADV~_ACCESS~_* flags in the target process. The madvise(MADV~_ACCESS~_*) calls give a hint to the operating system of how the application will access a specified virtual address range. On NUMA machines, the operating system will use this hint to determine how to allocate memory for the specified range.
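
The remark above that -L can produce one line per page, and the motivation for restricting output to an address range, can be illustrated with a small sketch. The function name, the 4096-byte page size, and the sample mappings are assumptions made for illustration. This is not how [[pmap(1)>>../pmap/pmap.1.txt]] is implemented.

```python
def coalesce_by_lgroup(page_lgroups, page_size=4096, start=None, end=None):
    """Merge adjacent pages backed by the same lgroup into (addr, size, lgroup) ranges.

    page_lgroups maps a page-aligned virtual address to the lgroup backing it.
    start/end optionally restrict the report to addresses in [start, end).
    """
    ranges = []
    for addr in sorted(page_lgroups):
        if start is not None and addr < start:
            continue
        if end is not None and addr >= end:
            continue
        lgrp = page_lgroups[addr]
        if ranges:
            prev_addr, prev_size, prev_lgrp = ranges[-1]
            # Extend the previous range only if this page is contiguous
            # with it and backed by the same lgroup.
            if prev_lgrp == lgrp and prev_addr + prev_size == addr:
                ranges[-1] = (prev_addr, prev_size + page_size, lgrp)
                continue
        ranges.append((addr, page_size, lgrp))
    return ranges


base = 0x10000
# Four pages all backed by lgroup 1: one output range.
local = {base + i * 4096: 1 for i in range(4)}
# Four pages alternating between lgroups 1 and 2: one range per page.
striped = {base + i * 4096: 1 + (i % 2) for i in range(4)}

print(len(coalesce_by_lgroup(local)))     # prints 1
print(len(coalesce_by_lgroup(striped)))   # prints 4
# Restricting the address range trims the report to the pages of interest.
print(len(coalesce_by_lgroup(striped, start=base, end=base + 2 * 4096)))  # prints 2
```

When neighboring pages are backed by different lgroups, nothing can be merged and the report grows with the number of pages, which is why an address range filter is useful.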

Besides [[pmadvise(1)>>../pmadvise/pmadvise_man]], [[madv.so.1(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9ld?a=view]] can also be used to affect how memory is placed among lgroups, but it uses a different mechanism to do so. Instead of using /proc like [[pmadvise(1)>>../pmadvise/pmadvise_man]], [[madv.so.1(1)>>http://docs.sun.com/app/docs/doc/816-5165/6mbb0m9ld?a=view]] is an LD~_PRELOAD library that interposes on system calls for allocating virtual memory (e.g., brk(2), mmap(2), shmat(2)) and calls madvise(3C) on the newly allocated memory after making the system call.

Please see the [[pmap(1) PSARC writeup>>../pmap/pmap_psarc]] for an explanation of the changes needed to the previously proposed but not yet integrated -L and -A options to [[pmap(1)>>../pmap/pmap.1.txt]], and the [[pmap(1)>>../pmap/pmap.1.txt]] and [[pmadvise(1)>>../pmadvise/pmadvise_man]] man pages for the specifications.

== ISSUES

Overall, the biggest issue for the tools is virtualization (e.g., Xen, or the sun4v hypervisor, also known as LDOMs). Virtualization can make it hard or impossible to determine which hardware resources are near each other in a NUMA machine. It can change virtual hardware resources out from under a guest OS after the guest OS //thinks// that it knows how the hardware resources relate to each other.

Currently, there is no lgroup platform support for either Xen or the sun4v hypervisor (LDOMs), so only one lgroup containing all the CPU and memory resources is created. Consequently, the lgroup tools and [[liblgrp(3LIB)>>http://docs.sun.com/app/docs/doc/816-5173/6mbb8adu6?a=view]] APIs will only export a single lgroup to applications and users, which makes the machine appear to have Uniform Memory Access (UMA) instead of being NUMA. This keeps virtualization from confusing anything or anyone trying to understand or optimize for NUMA using lgroups.

In the future, we anticipate that the guest OS will need to become virtualization aware and/or the virtualization will need to become NUMA aware. Some cooperation between the guest OS and hypervisor will probably be needed to provide very good performance on NUMA machines. When this happens, we may need to revisit how virtualization affects lgroups and their APIs and tools, but it should be possible to export a reasonable lgroup abstraction or fall back to exporting a single lgroup as is done now.

== CONCLUSION

This document has explained the architecture of the observability and control tools needed for MPO and referred to additional documentation for the individual tools as needed. All of the proposed tools and changes have a stability level of Unstable and a release binding of Patch. This seems like the conservative choice given the issues above and given that virtualization needs to be developed further before its ramifications for lgroups and NUMA are fully understood.