~-- Main.AlexanderKolbasov - 06 Jan 2006

NUMA Observability Page

== Wish List

While the Solaris kernel supports NUMA platforms, there are currently no tools to observe what it is actually doing and how that aligns with specific application requirements. Tools are needed that can explore the system from a NUMA standpoint and provide enough information to understand application behavior, verify that an application behaves as its developer expects, and accurately diagnose and, where possible, fix any problems. Potential users of such tools are:

* System administrators who need to explore system behavior as a whole and quickly spot issues with system performance
* Application developers and performance engineers who need to explore performance issues with specific applications
* OS engineers who need to diagnose and repair any potential system misbehavior

All these users need tools to do the following:

* **Observability:** Observe the behavior of the system as a whole and of specific applications, and be able to spot any abnormalities
* **Diagnosability:** Diagnose what went wrong and why
* **Control:** Adjust the behavior of the system or of specific processes

Observability, diagnosability and control would be immediately useful for user-level processes and threads, but it is also very useful to get the equivalent information for kernel threads and memory.

The next section explores the "ideal" set of tools that would help all three classes of users observe, diagnose and control their systems and applications.

=== Observability

* System configuration:
** Are we dealing with a UMA or a NUMA system?
** What is the lgroup hierarchy?
** What does each lgroup contain?
** What are the characteristics of each lgroup (e.g. latency)?
*** [[lgrpinfo(1)>>../tools/lgrpinfo]] //answers all these system configuration questions.//

* Overall system behavior:
** How are threads distributed across lgroups?
*** ps(1), prstat(1) extensions - see below.
** How is the load average distributed across lgroups?
*** lgrpinfo(1), kstats
** **How successful are threads at running at home? Are there excessive migrations from CPU to CPU and from the home lgroup to remote lgroups?**
*** We can keep per-thread statistics, export them via /proc and use prstat microstate mode to show them.
*** We can also use DTrace-based profiling with the existing {{code}}sched{{/code}} provider probes. Specific scripts need to be written for such monitoring.
** **What is the overall rate of lgroup-specific events such as migrations and non-local allocations? This would give the user an overall "feel" for a "healthy" versus an "unhealthy" system, in the same way mpstat(1), vmstat(1) and iostat(1) do.**
*** //Need more per-CPU kstats/per-lgroup kstats//
*** //Can be done using DTrace scripts//
*** //Need some monitoring tool based on DTrace, kstats, or both//
** Is there enough memory in each lgroup to satisfy requests for local allocations?
*** lgrpinfo(1) and kstats
** **How successful are threads at accessing local memory?**
*** //Nothing....Need dprofile, VM sampling mechanism, or CPU hardware performance counter(s).//

* Process/threads to lgroup relationships:
** **What processes/threads run in what lgroups?**
*** //Can be done using DTrace {{code}}sched{{/code}} provider probes//
** What are the home lgroups of various threads?
*** {{code}}ps -H{{/code}}, {{code}}prstat -H{{/code}}, {{code}}plgrp(1){{/code}}
** What processes or threads run in specific lgroup(s)?
*** {{code}}ps -h{{/code}}, {{code}}prstat -h{{/code}}
** What lgroups provide memory for a process?
*** {{code}}pmap -L{{/code}}
** **What processes use memory from an lgroup?**
*** //Nothing....Need system monitoring tool? Could be expensive to collect incrementally or all at once.//
*** May use the existing {{code}}page_get{{/code}} DTrace probe to collect data at run time.
*** Need to add a {{code}}page_get{{/code}} probe to {{code}}page_get_anylist(){{/code}}
** **How much memory does a process use per lgroup?**
*** For a single process, can aggregate over {{code}}pmap -L{{/code}} output
*** For many processes at once: //Nothing....Maybe have RSS per lgrp?//
** **What are the process memory advice and memory allocation policies?**
*** //Nothing....Would have to remember advice given, make new API to get this and memory allocation policy, and change pmap(1) to display.//

==== Conclusions

* {{code}}lgrpinfo(1){{/code}} provides adequate system configuration information
* {{code}}ps(1){{/code}} and {{code}}prstat(1){{/code}} extensions provide good thread-level observability
* Need a system monitoring tool, possibly based on DTrace and kstats
* Existing DTrace probes are almost enough for thread-level observability
* Additional DTrace probes may be needed for memory observability
* A profiling or VM sampling mechanism can show actual access patterns

=== Diagnosability

The list above provides a pretty good observability picture for system administrators, application developers and OS developers. Once problems are observed, we need tools to get to their root cause. Such tools should provide answers to the following questions:

* **Which processes or threads are spending too much time away from home, and why?**
** The "what" part can be done with prstat(1) microstate extensions.
** It can also be done with DTrace {{code}}sched{{/code}} provider probes.
** The "why" part: potentially additional microstate extensions to show what is causing stealing.
This can also be done with DTrace {{code}}sched{{/code}} provider probes. Additional probes may be needed to pinpoint migration details.

* Process/threads profile:
** How much does a thread run in each lgroup?
** How much memory does a thread allocate in each lgroup?
*** Need a profiling tool for these
*** May use DTrace-based profiling for the first one

* System profile:
** How successful is each lgroup in running its threads at home?
*** //Nothing....May be implemented as per-lgroup kstat.//
*** May be implemented using DTrace; may need a monitoring script
** How successful is the system as a whole in running threads at home?
*** //Nothing....Aggregation of per-lgroup kstats or profiling tool (or dtrace script)?//
** How successful are local memory allocation requests for each lgroup?
*** Per-lgroup kstats. //Probably should be cleaned up a bit.//
*** DTrace scripts around the {{code}}page_get{{/code}} probe.
** What were typical reasons for failing local memory allocations?
*** //Nothing....More specific per-lgroup kstats in {{code}}page_get_xxx(){{/code}} functions?//

* What system activity causes excessive migrations (e.g. preemption, interrupts, job stealing from idle CPUs, run-queue balancing, etc.)?
** DTrace scripts
** //Need more: extended microstate accounting + {{code}}prstat(1){{/code}} extensions to observe it. What tool should expose this? mpstat(1M)? System monitor?//
* What processes consume most of the given lgroup memory?
** //Nothing: Something like per-lgroup RSS? Also need some system monitoring tool to display this.//
** {{code}}pmap -L{{/code}} is prohibitively expensive for this.
** Maybe an estimate from simple per-thread counters?

* What is the memory access pattern for a specific thread? What processes or threads exercise most non-local memory accesses and what is/are the lgroup(s) they access the most?
most from local or interleaved memory?
** //Nothing....Need dprofile, VM sampling, and/or CPU hardware performance counters, and to observe each thread in the system.//

* Why can memory not be allocated in the requested lgroup?
** //Nothing....Maybe per-lgroup kstats + extra DTrace probes?//

* **What are the recommendations for system administrators or users to fix any observed problems?**
** //Nothing....Have a document or some sort of smart system monitor?//

==== Conclusions

* No existing tools
* DTrace may cover a lot, but needs custom scripts
* Even better to have a small NUMA DTrace toolkit
* Or even a special system monitor based on DTrace/kstats
* Need some in-kernel work for more accurate kstats and additional probes

=== Control

Once the root cause is discovered, we need to be able to "fix" some of the problems or provide specific recommendations to remedy the situation. Some fixes may require administrative intervention. For example, if there are not enough system resources, the system administrator may add CPUs or memory, or stop some applications that are consuming too many resources. Other fixes may require the following:

* Providing hints that describe application behaviour to the OS.
** {{code}}pmadvise(1){{/code}}
** {{code}}madv.so.1(1){{/code}}
** {{code}}madvise(3C){{/code}}
* **Moving processes from one lgroup to another**
** {{code}}plgrp{{/code}}
** {{code}}lgrp_affinity_set(3LGRP){{/code}} //Need way to set home lgroup w/o setting lgroup affinity?//
** //Need LD_PRELOAD or policies?//
* **Moving process memory from one lgroup to another**
** {{code}}pmadvise(1){{/code}}
** {{code}}madv.so.1(1){{/code}}
** {{code}}madvise(3C){{/code}}
** //No way to move memory to specific lgroup....Should there be?//
* **Changing application policies**
** //Nothing....Need policies to affect thread placement, way to inherit policies, APIs, and tools for these.//

Once we understand the specific properties of applications, we may want to apply permanent "fixes" to them without modifying the application. This may require methods to do the following:

* Distribute or consolidate application threads among several lgroups
** TAGs
* Place application threads in specific lgroups
** {{code}}LD_PRELOAD{{/code}} tool for thread placement
** Inheriting home lgroup on {{code}}fork(){{/code}}
* Affect how memory is allocated
** {{code}}madv.so.1(1){{/code}}
* **Specify policies for an application**
** //Nothing....See above//

==== Conclusions

* Existing tools ({{code}}plgrp{{/code}}, {{code}}pmadvise{{/code}}) provide some control over thread and memory placement
* Tricks with preloaded libraries may allow running applications with "predefined" behavior.
* Need additional APIs for affecting the thread home lgroup and dealing with process/thread policies.
* TAGs may provide very useful functionality.
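
To make the "distribute or consolidate application threads among several lgroups" idea in this section concrete, here is a minimal Python sketch of how such a placement plan might be computed. It is an illustration only, not an existing tool or API: the function names {{code}}plan_placement{{/code}} and {{code}}threads_per_lgroup{{/code}}, the mode names, and the thread and lgroup IDs are all hypothetical. The sketch only computes a plan; applying it would still require a tool such as {{code}}plgrp{{/code}} or one of the new interfaces this section asks for.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def plan_placement(thread_ids: Iterable[int], lgroup_ids: List[int],
                   mode: str = "distribute") -> Dict[int, int]:
    """Return a hypothetical {thread_id: home_lgroup_id} plan.

    "distribute" spreads threads round-robin over the given lgroups;
    "consolidate" homes every thread in the first lgroup listed.
    """
    if not lgroup_ids:
        raise ValueError("at least one lgroup is required")
    if mode not in ("distribute", "consolidate"):
        raise ValueError(f"unknown mode: {mode!r}")

    plan: Dict[int, int] = {}
    for index, tid in enumerate(thread_ids):
        if mode == "distribute":
            # Round-robin: thread i goes to lgroup i modulo the lgroup count.
            plan[tid] = lgroup_ids[index % len(lgroup_ids)]
        else:
            # Consolidate: keep all threads (and their memory) together.
            plan[tid] = lgroup_ids[0]
    return plan


def threads_per_lgroup(plan: Dict[int, int]) -> Dict[int, int]:
    """Count how many threads the plan homes in each lgroup."""
    counts: Dict[int, int] = defaultdict(int)
    for lgroup in plan.values():
        counts[lgroup] += 1
    return dict(counts)


if __name__ == "__main__":
    threads = [1, 2, 3, 4, 5]   # made-up thread IDs
    lgroups = [1, 2]            # made-up leaf lgroup IDs
    for mode in ("distribute", "consolidate"):
        plan = plan_placement(threads, lgroups, mode)
        print(mode, plan, threads_per_lgroup(plan))
```

A grouping mechanism such as TAGs or a preloaded library could choose the mode per application. Round-robin is the simplest possible distribution rule; a real policy would probably also weigh per-lgroup load and free memory.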

=== Overall Conclusions

* The set of proposed tools provides pretty good observability coverage
* More observability and diagnosability can be obtained with a DTrace toolkit
* May need an overall system monitor to integrate DTrace scripts and kstats data
* The proposed tools provide some level of control; more work is needed (mainly TAGs and policies)
* A profiling tool may provide lots of otherwise unobtainable information
* Memory observability/diagnosability/control is worse than that for thread placement

== Suggested extensions to commands

=== ps(1)

* **-h** Lists only processes homed to the specified lgroups.
* **-H** Prints the home lgroup of the process under an additional column header, {{code}}HOME{{/code}}.
* In addition, a new output format specifier {{code}}home{{/code}} is added, so a shell script can easily get the home lgroup of a specific process by issuing the following command: {{code}}$ ps -o home= -p $${{/code}}
* See [[suggested man pages diffs>>../tools/ps]].

=== prstat(1M)

The {{code}}prstat(1M){{/code}} command is extended with two additional flags:

* **-h** Lists only processes homed to the specified lgroups.
* **-H** Prints the home lgroup of the process under an additional column header, {{code}}HOME{{/code}}.
* See [[suggested man pages diffs>>../tools/prstat]].

== Suggested extensions to {{code}}liblgrp(3LIB){{/code}} library

* {{code}}lgrp_home_set(){{/code}}
** should refuse to set a home incompatible with the processor set
** should refuse to set a home incompatible with strong affinities
** will set the home regardless of weak affinities

== Suggested additional kstats

== GUI Ideas

The observability and control tools described above lend themselves pretty well to the graphical management paradigm.
We can imagine a GUI that allows the following:

* Walking the lgroup hierarchy and showing the contents of each lgroup
* Showing processes/threads in each lgroup
* Moving processes/threads between lgroups (e.g. by dragging them from one lgroup to another)
* Looking at the process address map (e.g. by clicking on a process)
* Applying advice to regions by selecting them
* Grouping threads into TAGs by selecting threads and applying properties to selections
* Creating processor sets by dragging CPUs and processes into them
* Binding threads to CPUs by "pinning" them
* Viewing per-lgroup loads and other stats visually

== Links