Virtual Memory - HAT (Hardware Address Translation) Layer
Status 06/05/2007
With the 2nd release of source the VM and the HAT layers are functional. The kernel is in control of the MMU even though we haven't yet executed a bop_quisce. We still rely on the prom interface for the console, print and network connection. However, in kernel memory management there is a comfort level with its functionality. If you have reviewed the Openfirmware task you will have noticed the issues related to the ODW firmware not running in virtual memory mode. Overall this masked a number of items, such that when we got VOF up and running we actually regressed in this area. However, that is behind us.
More details to follow along with the 2nd source release.
Initial Review of the HAT Layer - 1/1/06
Way back when, before even writing a line of code, Guy did an assessment of the original 2.6 code to see what was usable in a 2.11 port project. Below are his notes from that review.
Virtual Memory: HAT - Hardware Address Translation Layer - Guy Shaw
The Virtual Memory system can be considered the core of a Solaris system, and the implementation of Solaris virtual memory affects just about every other subsystem in the operating system. Rather than managing every byte of memory, Solaris uses page-size pieces of memory to minimize the amount of work the virtual memory system has to do to maintain virtual-to-physical memory mappings. Figure 4.1 shows how the management and translation of the virtual view of memory (the address space) to physical memory is performed by hardware known as the virtual memory management unit (MMU).
--
Guy has evaluated HAT interface changes based on a difference listing of 2.6 vs 2.10. usr/src/uts/common/vm/hat.h in /ws/on998-gate vs /ws/onnv-gate.
The following are new functions:
hat_dump
hat_thread_exit
hat_unload_callback
hat_register_callback
hat_add_callback
hat_delete_callback
hat_getkpfnum_badcall
hat_reserve
hat_page_demote
/* Kernel Physical Mapping (segkpm) hat interface routines. */
hat_kpm_mapin
hat_kpm_mapout
hat_kpm_page2va
hat_kpm_vaddr2page
hat_kpm_fault
hat_kpm_mseghash_clear
hat_kpm_mseghash_update
hat_kpm_addmem_mseg_update
hat_kpm_addmem_mseg_insert
hat_kpm_addmem_memsegs_update
hat_kpm_mseg_reuse
hat_kpm_delmem_mseg_update
hat_kpm_split_mseg_update
hat_kpm_walk
va_to_pfn
va_to_pa
The following functions have been removed
hat_pageflip
The following functions have a change in function signature
hat_share
hat_unshare
hat_dump() is small and is entirely processor-independent code.
hat_thread_exit() is small but the underlying function that implements it,
hat_switch(), does processor-specific and mmu-specific things to switch
from one thread to another. Not a big problem.
The hat callback family of functions is currently implemented on Sparc only.
We can just supply pacifiers to comply with the new interface.
hat_getkpfnum() is deprecated. There are a few places left that still call
hat_getkpfnum(). Those have all been changed to hat_getkpfnum_badcall()
so that hat_getkpfnum() can be eliminated from the HAT interface. That way,
nobody is tempted to write new code that uses hat_getkpfnum().
hat_getkpfnum_badcall() is just the implementation of what used to be
hat_getkpfnum(). This is an easy change.
hat_reserve() does nothing.
hat_page_demote() is a significant amount of work. Much of it is
processor-independent, because it has to do with the way Solaris allocates
and deallocates Hardware Mapping Entries (HMEs). However, this can be
deferred, because it is only used for mappings of large page sizes. We
don't have to exploit large page sizes in userland in the first cut.
Kernel Physical Mapping (segkpm) hat interface routines, hat_kpm_*(), are
processor-specific, but are trivial. Many would be no-ops on PowerPC.
va_to_pfn() is used only at boot time, while the boot loader is in charge of
the MMU. It is illegal to use it after that. Whoever writes the boot stuff
can do what he wants. We may need to coordinate on this item.
va_to_pa() is declared in the common hat interface but is really only
implemented on Sparc and only sparc-specific code (drivers, etc.) call it.
We not only don't have to implement it, we don't even have to define it.
The removal of hat_pageflip() is not a problem. The PowerPC
implementation just returned a status indicating that this feature was not
supported.
The change to hat_share() and hat_unshare() involves adding an argument, a
page size code, to indicate the desired page size for shared mappings.
This can be made simple by restricting the variety of page sizes we will
deal with. For starters, we don't even have to support Intimate Shared
Memory (ISM) at all.
A few flags have been added for some functions:
HAT_RELOAD_SHARE
HAT_NO_KALLOC
HAT_LOAD_AUTOLPG
HAT_INIT
All of these are either trivial or can be deferred. That's about it for
interface changes. Please see below for comprehensive details on the HAT port.
Study of the Feasibility of Reusing Solaris PPC 2.6 HAT Code
Guy has composed a more structured overview which can be seen below
Study of the Feasibility of Reusing Solaris/PPC 2.6 HAT Code
Background
Sun has already done a port of Solaris to PowerPC. In 1995, Solaris
release 2.5.1 supported PowerPC. Additional work was done for Solaris
rev 2.6. After 2.6, PPC support was removed, for commercial reasons rather
than any technical failure.
A big question in considering how to do the new Solaris/PPC is:
How much code from the Solaris/PPC 2.6 release can and should be reused, if
any?
This document is concerned with answering that question only for
the HAT layer and other processor-specific VM code, and portions of
the boot that deal with VM.
There are many pieces of processor-specific code involved in any
port of Solaris to a new processor, but the HAT layer is a large and
critical part. Whether a new HAT layer is written from scratch or
existing code is reused and upgraded, it is necessary to have some
idea of the costs of the HAT layer in order to have any hope of
reasoning about the total costs of the porting project.
HAT Roles
The HAT layer has many roles with respect to other parts of the
system, including hardware and other software. In order to make
decisions about the suitability of the existing code to be reused,
all of these roles must be examined, in light of the changes that
have taken place over the last decade.
The roles the HAT layer plays are:
- HAT/MMU - Manager of hardware MMU resources
- HAT/provider - Provider of kernel services
- HAT/consumer - Consumer of kernel services
- HAT/boot - Partner during boot
The term HAT/runtime is used to denote HAT/MMU, HAT/provider,
and HAT/consumer, together; that is, everything except HAT/boot.
Keep in mind that the boundary between HAT roles is just logical, for
the sake of decomposing the analysis of changes in requirements. It is
not that there are separate files, packages, modules, or functions
(whatever) that keep the code for these roles separate. There is
separate code for HAT/boot vs HAT/runtime, but it is not possible to
separate out HAT/runtime roles in any coarse-grain fashion. A single
function can be HAT/provider in one line and call some function
(HAT/consumer) in the next, then immediately do some low-level TLB
management (HAT/MMU).
The Decision Process
It could be that the existing code is simply too far out of date with
respect to any combination of these four roles. In that case the
issue of re-usability would be a no-brainer, just scrap the old code.
In the case of interaction with boot, a decision can be made separately
to scrap most of the boot-related code, but keep all code related to
the other HAT roles.
If there is enough value in the old code, then things are not quite
so simple, because decisions can be influenced by other factors, such
as schedule and budget and willingness to drop or defer development
of some functionality.
For example, if rapid bring-up is an absolute requirement, then shortcuts
can be taken for the sake of quick results, on the understanding that
cleanup or major revisions will have to be done later. Examples of possible
deferred HAT functionality are:
- support for Intimate Shared Memory (ISM)
- support for 64-bit machines
- support for multiprocessor machines
The following four sections will present an evaluation of the
suitability of the Solaris 2.6 code. Each role will be evaluated
in terms of a quick go/no-go test, then in terms of time and
optional features.
HAT/MMU - Manager of hardware MMU resources
64-bit
Solaris/PPC 2.6 has no support for 64-bit models. Not for 64-bit
kernel and not for 64-bit applications. That is bad news. But it
does not necessarily mean that Solaris/PPC 2.6 HAT is unsuitable
for reuse. If we wrote a new HAT from scratch, it would still be more work
to support both 32-bit and 64-bit kernel and applications. Since the
new Solaris/PPC port will see use in embedded systems, we believe that we
would not contemplate supporting only a 64-bit kernel, as is done
on Sparc. Even if we did that, in order to eliminate one of the four
combinations, it would not save as much as 1/4 of the effort.
What this means is that it boils down to 2 questions:
1) Is the 64-bit MMU hardware so fundamentally different that the
32-bit code cannot (or should not) be reused?
2) Was the Solaris/PPC 2.6 HAT code designed in a way that makes it
unnecessarily difficult to support both 32-bit and 64-bit hardware?
You might think that the MMU hardware would be fundamentally different
between 32-bit and 64-bit, and necessarily so. The major rewrite
of Solaris/x86 HAT code was triggered by the port to AMD64. But,
that was only the proximate cause.
Intel's x86 hardware was designed much earlier than PowerPC.
Intel did not start out with a road-map for 64-bit kernel or userland.
Solaris/x86 was not designed with a 64-bit future in mind. But, the
PowerPC was designed from the beginning to be a 64-bit architecture
with a 32-bit subset. That applies to the MMU design as well as ISA
(Instruction Set Architecture).
The PowerPC has hashed page tables, unlike the x86, which has
forward-mapped page tables. Also, there is one global page table, no
per-process or per-group or separate kernel vs userland page tables.
That is not a decision made by a kernel developer; it is pretty much
dictated by the PowerPC MMU design, and we do not want to fight the
hardware. 64-bit addresses have segment IDs that are 32 bits longer,
but the role of the lower order bits in hashing and indexing into
the page table is the same for 32-bit and 64-bit. The designers of
Solaris/PPC knew this at the time and kept it in mind. Although it
has not been put to the test, the code appears to be sufficiently
64-bit clean.
Conclusion: going to 64-bit HAT is nowhere near as traumatic as it
was for x86 and AMD64.
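
To make the hashing argument concrete, here is a minimal sketch of the 32-bit hashed page table lookup, written from the PowerPC architecture description as I understand it and not taken from the Solaris/PPC 2.6 source. The function names and the way the SDR1 mask is passed in are assumptions made for illustration. The 64-bit case uses a wider segment ID and hash, but the low-order bits play the same role.

```c
#include <assert.h>
#include <stdint.h>

#define	PAGE_SHIFT	12		/* 4K pages */
#define	PTEG_SHIFT	6		/* one PTE group: 8 PTEs * 8 bytes */
#define	HASH_MASK	0x7FFFFu	/* the 32-bit hash is 19 bits wide */

/* Primary hash: low 19 bits of the VSID XOR the 16-bit page index. */
uint32_t
pteg_hash_primary(uint32_t vsid, uint32_t ea)
{
	uint32_t page_index = (ea >> PAGE_SHIFT) & 0xFFFFu;

	return ((vsid & HASH_MASK) ^ page_index);
}

/* Secondary hash: one's complement of the primary hash, same width. */
uint32_t
pteg_hash_secondary(uint32_t hash)
{
	return (~hash & HASH_MASK);
}

/*
 * Byte offset of the PTE group within the page table.  htabmask is
 * the 9-bit mask field from SDR1; 0 selects the minimum table of
 * 1024 groups (64K bytes).
 */
uint32_t
pteg_offset(uint32_t hash, uint32_t htabmask)
{
	uint32_t index_mask;

	assert(htabmask <= 0x1FFu);
	index_mask = (htabmask << 10) | 0x3FFu;
	return ((hash & index_mask) << PTEG_SHIFT);
}
```

For example, with VSID 0x123456 and effective address 0x00012000 the primary hash is 0x23444, and in a minimum-size table the PTE group sits at byte offset 0x1100.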
Other aspects of 64-bit Solaris/PPC are outside the scope of this
document. They would include design of a 64-bit ABI, link editor,
etc. and getting consensus from all stake-holders. Historically,
arriving at a consensus has been known to consume a great deal of time.
But, reusing Solaris/PPC 2.6 HAT code does not add to this problem.
Location and size of page tables
There is a possibility of running into problems with a larger page
table on a 64-bit system with more physical memory. This is because
the pagetable is contiguous physical memory and a larger page table
might conflict with something else that needs to be in lower memory
or upper memory. But, I don't think this is too likely. In any case,
this problem is not made worse by reusing Solaris/PPC 2.6 code.
MMU related traps
Older models of PowerPC generated traps for every TLB miss. Newer
models can reload the TLB from the page tables without any traps;
the only page fault is a major page fault. This is a welcome change.
We could leave the trap handler in the code for the sake of older
models, or we could purge it if we know we will never encounter
hardware that generates TLB-miss traps. Better to have and not need
than to need and not have. However, finding machines to test this
case could be difficult.
Endianness
PowerPC can operate in either big-endian or little-endian mode.
Solaris/PPC runs in little-endian mode. This decision was made
primarily because of customer requirements at the time (1993-1995),
not for any purely technical reason. Without those commercial
requirements, there would be a slight advantage to running big-endian.
For one thing, PowerPC page tables are big-endian, independent of the
overall endian mode of the machine. The designers of Solaris/PPC knew
at the time that this might be controversial and subject to change,
and they coded accordingly. The HAT layer used accessor functions
for all read and modify operations on the page tables. Other parts of
Solaris have some endianness dependencies, but I believe they are not
a huge problem. In fact, we have a running big-endian version
of Solaris/PPC that was done as a feasibility project. So, we might
be able to use that as a starting point.
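
The accessor idea can be sketched as follows. This is not the Solaris/PPC 2.6 code; the type hw_pte_t and the names pte_get_word() and pte_set_word() are hypothetical. The point is only that every read and write of a page table word goes through one place that fixes the byte order, so the rest of the HAT does not care which endian mode the kernel runs in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory PTE: two 32-bit words, always big-endian. */
typedef struct hw_pte {
	uint8_t bytes[8];
} hw_pte_t;

/* Read word 0 or 1 of a PTE, independent of host byte order. */
uint32_t
pte_get_word(const hw_pte_t *pte, int word)
{
	const uint8_t *p;

	assert(pte != NULL && (word == 0 || word == 1));
	p = &pte->bytes[word * 4];
	return (((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3]);
}

/* Store word 0 or 1 of a PTE in big-endian order. */
void
pte_set_word(hw_pte_t *pte, int word, uint32_t val)
{
	uint8_t *p;

	assert(pte != NULL && (word == 0 || word == 1));
	p = &pte->bytes[word * 4];
	p[0] = (uint8_t)(val >> 24);
	p[1] = (uint8_t)(val >> 16);
	p[2] = (uint8_t)(val >> 8);
	p[3] = (uint8_t)val;
}
```

A real HAT would need single-word loads and stores here, because the hardware walks the table concurrently; the bytewise version only illustrates the byte order.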
Cache Implementation
--
The PowerPC architecture leaves the details of cache implementation
pretty wide open so that each model can be free to implement caching
in its own way. A model of PowerPC is allowed to implement no cache
at all. It is possible that Solaris/PPC 2.6 code, which supported
a small number of early PowerPC models, would have to be modified to
handle a wider range of cache behaviors in order to support newer
models. Besides cache geometry (number of levels, size of each level,
line sizes), which ought to be parameterized, other implementation
details are possible, which may require more significant code changes
to support. For example, on some models, a cache might be virtually
indexed, as is the case for some Sparc models. In that case, code
to handle page coloring would need to be added for performance.
At least in the case of the MPC-7450, this will not be necessary.
Multiprocessor
--
Solaris/PPC has been written with multiprocessor machines in mind.
There was a working version running in the lab, but it was not
integrated into Solaris 2.7, because Solaris/PPC was canceled by
then. Multiprocessor machines are much more common, today, and so
expectations are higher. A PPC port of Solaris would still require
much effort to verify multiprocessor support, mostly in testing.
---
HAT/provider - Provider of kernel services
-
The HAT layer is inherently processor and platform dependent.
For that reason, Solaris HAT interfaces are pretty well-defined,
much more so than some other parts of the kernel which have not
had to be ported several times over Solaris's life. Therefore, of
all parts of the kernel, the HAT layer is among the least likely to
suffer from illegitimate interfaces, such as unintended dependencies,
spooky action at a distance, lack of decomposability, etc.
Every legitimate HAT interface is defined in usr/src/uts/common/vm/hat.h.
This basic source code structure has not changed. It was a good idea
then, and it is a good idea now.
How has the HAT interface changed since Solaris 2.6? This can be answered by
examining a difference listing of usr/src/uts/common/vm/hat.h between
2.6 and the current version of Solaris.
The two source code gates to be compared are:
/ws/on297-gate
/ws/onnv-gate
The following are new functions:
hat_dump
hat_thread_exit
hat_unload_callback
hat_register_callback
hat_add_callback
hat_delete_callback
hat_getkpfnum_badcall
hat_reserve
hat_page_demote
/* Kernel Physical Mapping (segkpm) hat interface routines. */
hat_kpm_mapin
hat_kpm_mapout
hat_kpm_page2va
hat_kpm_vaddr2page
hat_kpm_fault
hat_kpm_mseghash_clear
hat_kpm_mseghash_update
hat_kpm_addmem_mseg_update
hat_kpm_addmem_mseg_insert
hat_kpm_addmem_memsegs_update
hat_kpm_mseg_reuse
hat_kpm_delmem_mseg_update
hat_kpm_split_mseg_update
hat_kpm_walk
va_to_pfn
va_to_pa
The following functions have been removed:
hat_pageflip
The following functions have a change in function signature:
hat_share
hat_unshare
hat_dump() is small and is entirely processor-independent code.
hat_thread_exit() is small but the underlying function that implements
it, hat_switch(), does processor-specific and MMU-specific things to
switch from one thread to another. Not a big problem.
The hat callback family of functions is currently implemented on Sparc
only. We can just supply pacifiers to comply with the new interface.
hat_getkpfnum() is deprecated. There are a few places left
that still call hat_getkpfnum(). Those have all been changed to
hat_getkpfnum_badcall() so that hat_getkpfnum() can be eliminated
from the HAT interface. That way, nobody is tempted to write new
code that uses hat_getkpfnum(). hat_getkpfnum_badcall() is just
the implementation of what used to be hat_getkpfnum(). This is an
easy change.
hat_reserve() does nothing.
hat_page_demote() is a significant amount of work. Much of it is
processor-independent, because it has to do with the way Solaris
allocates and deallocates Hardware Mapping Entries (HMEs). However,
this can be deferred, because it is only used for mappings of large
page sizes. We don't have to exploit large page sizes in userland
in the first cut.
Kernel Physical Mapping (segkpm) hat interface routines, hat_kpm_*(),
are processor-specific, but are trivial. Many would be no-ops on PowerPC.
va_to_pfn() is used only at boot time, while the boot loader is in
charge of the MMU. It is illegal to use it after that. Whoever writes
the boot stuff can do what he wants. We may need to coordinate on
this item.
va_to_pa() is trivial, and is the same for all processors. It is
just va_to_pfn() with the page offset of the given virtual address
blended back in to give the corresponding physical address.
The removal of hat_pageflip() is not a problem. The PowerPC
implementation just returned a status indicating that this feature
was not supported.
The change to hat_share() and hat_unshare() involves adding an
argument, a page size code, to indicate the desired page size for
shared mappings. This can be made simple by restricting the variety
of page sizes we will deal with. For starters, we don't even have
to support Intimate Shared Memory (ISM) at all.
Flags
--
A few flags have been added for some functions:
HAT_RELOAD_SHARE
HAT_NO_KALLOC
HAT_LOAD_AUTOLPG
HAT_INIT
All of these are either trivial or can be deferred.
Intimate Shared Memory (ISM)
--
ISM is strictly a performance feature. It does not involve any change
to the HAT interface. ISM is a term used to refer to the cases when
multiple processes can share not only mappings to the same physical
memory, but also MMU resources used for those mappings. For example,
in the case of x86, with forward-mapped page tables, entire pages of
Page Table Entries (PTEs) can be shared, provided that the virtual
addresses and size just happen to be suitable for sharing pages
of PTEs. Let's use the term "PTE-page-span" to describe the size
mapped by an entire page of PTEs. It is not required that all the
mappings to the same physical memory have the same virtual address.
But, the virtual addresses must all be aligned on a PTE-page-span
boundary, and their sizes must be a multiple of the PTE-page-span.
Any mappings that are less strict about VA alignment and size cannot
share page tables without violating Unix memory mapping semantics
and/or security principles. If VA alignment and size are even more
strict, then 2nd-level and even higher level pages of directory entries
could be shared. Something very similar has been done on Itanium and
MIPS hardware, except that those machines have linear page tables,
rather than forward-mapped.
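
The alignment and size rule for sharing whole pages of PTEs can be written down as a small check. This is only a sketch of the rule as stated above; the function names are made up for illustration, and the 4K page with 1024 PTEs per page-table page is an assumed example geometry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Size of virtual address space mapped by one full page of PTEs.
 * Example: 4K pages with 1024 PTEs per page-table page gives 4M.
 */
uint64_t
pte_page_span(uint64_t page_size, uint64_t ptes_per_page)
{
	return (page_size * ptes_per_page);
}

/*
 * A mapping can share whole pages of PTEs only if its start address
 * is aligned on a span boundary and its size is a nonzero multiple
 * of the span.
 */
bool
can_share_pte_pages(uint64_t va, uint64_t size, uint64_t span)
{
	assert(span != 0);
	return (size != 0 && va % span == 0 && size % span == 0);
}
```

With a 4M span, an 8M mapping at 0x10000000 qualifies, while the same mapping shifted by one page, or a 5M mapping at the same address, does not.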
Solaris/PPC 2.6 does not implement ISM. But, the absence of ISM
support is clean. That is, the fact that ISM was not supported
in Solaris/PPC 2.6 does not affect any decision to reuse the
existing code. The same work would have to be done whether adding
functionality to the old code or writing all new code. Whether any
functionality we add is easy or difficult, it is pure and simple
addition of functionality. Nothing about the Solaris/PPC 2.6 HAT
design involved work that would have to be undone or commitments to
a way of doing things that we might regret.
PowerPC MMU does not have any such thing as pages of PTEs. The only
possible way to support any sharing of MMU resources on PowerPC is to
use Block Address Translation (BAT) registers. BAT registers are the
only mechanism for mapping regions of memory with a page size larger
than 4K. There are only a handful of BAT registers. ISM implemented
this way would have more strict alignment requirements, because a
single BAT entry with a large page size would require:
1) that all mappings be naturally aligned with respect to page size;
2) that the requested size must be exactly 1 page size;
3) that the underlying physical memory be contiguous and naturally aligned physical addresses.
An unlimited number of processes could share the same memory, but at
any time, only a very small number of these mappings can be supported.
On an embedded system, there might be an application for which this
support is just perfect. Even a single very large mapping shared
by 2 processes could be a big win for the right kind of application.
It could save a great deal of pressure on the page table. Let's see
... large mappings can save 256 PTEs per megabyte. A 1 GByte mapping
for shared data could save 1/4 megaPTEs. In order to do this, there
would have to be some mechanism for preventing physical memory from
getting fragmented beyond redemption before we even get to the first
userland process. There is no interface to do this. There would
probably have to be something in /etc/system to tell the kernel to
reserve physical memory early on.
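
The three conditions and the PTE arithmetic above can be checked with a few lines of C. This is a sketch under stated assumptions: the names are hypothetical, the physical range is assumed to be contiguous (it is described by a single base address), and no attempt is made to model which block sizes a given BAT implementation actually accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	BASE_PAGE_SIZE	4096u

/* Number of 4K PTEs that one large mapping of len bytes replaces. */
uint64_t
ptes_saved(uint64_t len)
{
	return (len / BASE_PAGE_SIZE);
}

/*
 * Check the three conditions for a single block mapping.  The
 * physical range [pa, pa + len) is assumed to be contiguous.
 * block_size must be a power of two.
 */
bool
bat_mappable(uint64_t va, uint64_t pa, uint64_t len, uint64_t block_size)
{
	uint64_t mask;

	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
	mask = block_size - 1;
	if ((va & mask) != 0)		/* 1) VA naturally aligned */
		return (false);
	if (len != block_size)		/* 2) exactly one block */
		return (false);
	return ((pa & mask) == 0);	/* 3) PA naturally aligned */
}
```

ptes_saved() gives 256 for one megabyte and 262144 (a quarter of a mega-PTE) for one gigabyte, matching the figures in the text.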
---
HAT/consumer - Consumer of kernel services
-
The HAT layer, proper, is pretty low down in the dependency tree of all
kernel services. This is especially true of the pure TLB management
functions. We would be in trouble if the data types and functions
provided by the kernel had changed significantly in the last decade.
But it looks like we are in pretty good shape.
Data types
--
The HAT does interact with some other kernel data structures.
- HAT uses page_t's which describe pages of physical memory.
The machine-dependent page_t, machpage_t, is a pure extension of
the page_t data type; the page_t structure is not modified in any
other way. No part of Solaris uses the machpage_t extensions.
Functions
--
The HAT layer needs locking primitives and some atomic operations.
Function calls are used and the data types used with these functions
are either opaque objects or primitive data types. So, the HAT does
depend on functions such as: mutex_*(), cv_*(), atomic_*(), cas().
The good news is that these functions are pretty low level and their
interfaces are stable.
XXX Better separation of pure TLB management functions.
XXX Move to separate library
XXX It may be a good idea to change use of cv_*() functions
---
HAT/boot - Partner during boot
-
Solaris boot has changed considerably since 2.5.1 and 2.6. Almost all
boot-related HAT code will have to be thrown out, no matter what.
It is almost a complete write-off. Certainly, all the code related
to boot-time device support is useless. Some snippets related to VOF,
such as getting properties, can be used as a design suggestion.
XXX How much has VOF changed? Not much, we hope.
The good news is that modern boot makes many things easier. The basic
problem of handing off allocated memory and mappings from boot to the
kernel HAT is not much different, so some small pieces can be reused.
Another bit of good news is that some things that are done for good
hygiene can be deferred. For example, we can just waste some memory
owned by boot, bypassing the tricky hand-off code for those pages
of memory. This is a good trade, for the sake of rapid bring-up.
Whether we reuse Solaris/PPC 2.6 code or not, I strongly recommend
that we invest a great deal in enforcing the contract between boot and
the HAT layer, much more than has been done for Sparc and x86, even
more than was done for Solaris/IA64, which invested heavily in this.
This kind of investment is one that is tempting to short-stroke in
the interests of quick startup, but it pays big-time, unless all the
developers are perfect in every way, or extremely lucky. In fact,
I recommend that we deliberately change the contract, a few times
during development, just to keep us safe from inadvertent dependency
creep. For example, page table size and location can be changed,
within reason; allocation of BAT registers can be changed, for no
particular reason.
There is processor-dependent code to handle userland process address
space allocations. It is not really part of the HAT, proper, but
the developer who writes and maintains the HAT usually maintains
this bit part, as well. In addition to changing HAT/boot contract,
I recommend changing some aspects of VM layout, such as text start
address. It is not that we cannot decide on a value and stick with it.
It is a bit of a jolt to the system, just to keep things on track.
Better to do it early, rather than later.
XXX More on boot/HAT contract, later.
Data types
--
XXX memseg structures changed?
Functions
--
Flow of control
--
XXX flow of control from _starup() ... hat_kern_setup()
--
The following is a quick overview of HAT features
and an assessment of:
- how easy it is to implement;
- whether it can be deferred (from a purely technical perspective, not whether it is considered to be a critical requirement);
- how much demand there is, particularly for embedded systems;
- how urgent is the requirement;
- how much of an embarrassment would it be if this feature were not implemented, considering things like how much expectations have changed in the last decade, what has Linux already done, etc.
- how much more testing has to be done, let alone implementation cost. Some things could be coded right away, but have serious testing implications. For example, do we want to retain support for older models? That is a coding no-op, but a huge increase in testing complexity, hardware procurement, and so on.
                                                critical  testing
Feature         easy?  defer?  demand  urgency  factor    burden
--------------  -----  ------  ------  -------  --------  -------
Multiprocessor  ???    yes     high    soon     high      high
64-bit          maybe  yes     ???     low      low       high
endian change   yes    yes     ???     high     low       low
ISM             no     yes     low     low      low       medium
--
XXX UPOD schedule vs quick&dirty schedule
XXX UPOD := Under-Promise Over-Deliver
--
OPINION on HAT DATA Structures
The hat data structure should be an opaque data type, preferably void. That is, nothing outside the HAT should refer to "struct hat", but to hat_t. So, hat_t is void, as far as everyone is concerned, except the HAT implementation.
If we wrote the kernel in C++, we could make hat_t a class with private members.
We should be able to change all usage of "struct hat" to hat_t.
If that is not acceptable, then we ought to at least redefine the struct hat so that it has one member which is a simple data type and has an unlikely name, like none_of_your_business. Failing that, we ought to be able to redefine struct hat so that all the members are the same order, data type, offset, and size, but the member names have been changed, for example by prefixing each member name with some unlikely prefix.
vm/hat.h could have some preprocessor code like so:
#if defined(HAT_IMPLEMENTATION)
struct hat { some_type member1; ... };	/* real member names */
#else
struct hat { some_type member1; ... };	/* same layout, disguised member names */
#endif
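
As a rough illustration of the first option (hat_t as void outside the HAT), the following sketch shows the pattern in a single file. The hat_sketch_*() names and the refcnt member are invented for the example; they are not the real hat_alloc()/hat_free() interfaces, and a real implementation would split the two halves between vm/hat.h and the HAT's private header.

```c
#include <assert.h>
#include <stdlib.h>

/* What the rest of the kernel would see: an opaque type, no members. */
typedef void hat_t;

hat_t *hat_sketch_alloc(void);
int hat_sketch_refcnt(const hat_t *);
void hat_sketch_free(hat_t *);

/* What only the HAT implementation would see. */
struct hat_impl {
	int refcnt;	/* illustrative member only */
};

hat_t *
hat_sketch_alloc(void)
{
	struct hat_impl *h = calloc(1, sizeof (*h));

	if (h != NULL)
		h->refcnt = 1;
	return (h);
}

int
hat_sketch_refcnt(const hat_t *hat)
{
	const struct hat_impl *h = hat;

	assert(h != NULL);
	return (h->refcnt);
}

void
hat_sketch_free(hat_t *hat)
{
	free(hat);
}
```

Callers can hold and pass a hat_t pointer, but any attempt to dereference it or name a member fails to compile, which is the whole point.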
END of OPINION
---
HAT/boot Food Taster and HAT Debugging Tools
-
Most of this document covers changes to Solaris/PPC 2.6 code that are
imposed by external factors: changing hardware, evolution of Solaris,
changes to boot. But, there are a few changes recommended here,
simply because they are an important improvement in HAT construction
technology. By a very wide margin, the top two are:
- Extensive HAT/boot food taster
- HAT Debugging toolkit
These additions do not come free of cost, so they need to be mentioned
here. However, they have a very good chance of leading to a net
reduction in system bringup time, and they contribute to more
reliable time budgets, because big surprises are reduced.
---
HAT/boot Food Taster
-
Things don't go well if the HAT consumes anything toxic. Things can go
especially badly early on and in mysterious ways if the HAT inherits
MMU state from boot which is not compatible with the state for which
it is designed. I recommend spending some time up front in writing
a significant amount of code which tests the MMU state and other
conditions, as they are when HAT first takes control. If there is
anything that is not in order, the HAT/boot food taster should not
be sparing in its effort to explain clearly what is expected and
what it got, and then die a quick and merciful death. Perhaps in
the process it can deliberately trigger the debugger (either hardware
debugger or kmdb), if present, in a special way. The alternative to
fail-fast semantics is system delusion, so fail fast and fail noisily.
This sort of thing was done on Solaris/IA64, and has been proven to save
a great deal of time when integrating work done by several developers
each working on different pieces, and possibly misunderstanding the
HAT/boot contract.
Even if it were the case that large pieces of HAT/boot code could
be reused, the amount of work needed to add a HAT/boot food taster
is pretty much the same for a HAT code update or for a brand new
HAT layer.
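
A food taster check might look something like the sketch below. Everything in it is hypothetical: struct boot_mmu_state, its fields, and hat_taste() are invented for illustration, and the two rules checked (page table size a power of two of at least 64K, base aligned on its size) are my reading of the 32-bit PowerPC requirements, not a statement of the actual HAT/boot contract. The caller would panic or enter the debugger on a nonzero return.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of what boot hands over to the HAT. */
struct boot_mmu_state {
	uint64_t pt_base;	/* physical address of the page table */
	uint64_t pt_size;	/* size of the page table in bytes */
};

/*
 * Taste the state: return the number of violations found, and say
 * clearly what was expected and what was found for each one.
 */
int
hat_taste(const struct boot_mmu_state *s, FILE *out)
{
	int bad = 0;

	assert(s != NULL && out != NULL);
	if (s->pt_size < 65536 || (s->pt_size & (s->pt_size - 1)) != 0) {
		fprintf(out, "page table size: expected a power of two "
		    ">= 64K, got 0x%llx\n",
		    (unsigned long long)s->pt_size);
		bad++;
	} else if ((s->pt_base & (s->pt_size - 1)) != 0) {
		fprintf(out, "page table base: expected alignment on "
		    "0x%llx, got base 0x%llx\n",
		    (unsigned long long)s->pt_size,
		    (unsigned long long)s->pt_base);
		bad++;
	}
	return (bad);
}
```

The useful property is that each check reports both halves, expected and found, before the system dies, so the developer on the other side of the contract knows what to fix.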
---
HAT Debugging Toolkit
-
It is common to sprinkle some ASSERTs in the code.
Also, most HAT developers have a stash of HAT debugging
aids, such as a handy-dandy pagetable hashing pocket calculator
or pagetable walker/navigator. But, I believe it is a good idea
to include a set of kernel functions and userland tools
as a first class citizen in the software release.
Solaris/IA64 had an extensive set of HAT debug helper functions,
as well as userland tools to do HAT-specific monitoring of
correctness and performance. Our new Solaris/PPC port can do even more
because there are more hardware and software tools available,
such as hardware debugger (when available) and DTrace.
The amount of work needed to add a HAT debugging toolkit
is pretty much the same for a HAT code update or for a
brand new HAT layer.
Some of the components of the HAT debugging toolkit are:
- Fault injection
- Pagetable and HME verifier
- Pagetable and HME statistics
- Pathological workloads (small scale)
Fault Injection
--
One example of the use of fault injection for the HAT layer is
to arrange for races to be lost often. The function to atomically
replace a PTE can be implemented with a version that deliberately
causes it to fail with a specified probability, but being careful to
limit the number of consecutive failures from the same caller so that
we don't block forward progress of the system.
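
A minimal user-level sketch of such a wrapper is shown below, using C11 atomics in place of the kernel's cas(). The structure cas_inject, the cap of three consecutive injected failures, and the use of rand() are all assumptions for the sake of the example; a kernel version would keep the state per caller or per CPU and use its own random source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define	MAX_CONSEC_FAILS	3	/* assumed cap, for forward progress */

/* Per-caller injection state; fail_pct is a probability in percent. */
struct cas_inject {
	unsigned fail_pct;
	unsigned consec_fails;
};

/*
 * Atomically replace *pte with newval if it still holds oldval, but
 * pretend to lose the race with probability fail_pct.  Never injects
 * more than MAX_CONSEC_FAILS failures in a row for the same caller.
 */
bool
pte_cas_inject(_Atomic uint64_t *pte, uint64_t oldval, uint64_t newval,
    struct cas_inject *inj)
{
	assert(pte != NULL && inj != NULL && inj->fail_pct <= 100);
	if (inj->consec_fails < MAX_CONSEC_FAILS &&
	    (unsigned)(rand() % 100) < inj->fail_pct) {
		inj->consec_fails++;
		return (false);		/* injected failure; *pte untouched */
	}
	inj->consec_fails = 0;
	return (atomic_compare_exchange_strong(pte, &oldval, newval));
}
```

Callers already have to loop on a failed compare-and-swap, so the injected failures exercise exactly the retry paths that are otherwise almost never taken.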
Pagetable and HME verifier
--
A pagetable and HME verifier is to the HAT data structures what fsck
is to UFS filesystem metadata. On a running system things change too
quickly to check consistency of the entire system, but at the time of a
kernel panic, it can be done without disturbing the existing mappings.
Also, most of the consistency checking is decomposable. That is,
much can be determined about the internal consistency of a subset of
page tables, and it can be tested quickly and nondestructively.
Pagetable and HME statistics
--
Before DTrace, it would have been very difficult to generate statistics
about things like hash collisions without writing extra helper
functions and rolling your own methods for enabling and disabling
probing; and it was even more difficult to get statistics out of the
kernel so that some userland monitoring / visualization program can
slice and dice and present the data. DTrace makes this sort of thing
a great deal easier. However, I believe the HAT may still have some
hooks in it with DTrace and userland reporting programs in mind.
Pathological workloads
--
In order to exercise the logic for handling rare cases, such as
clustering of hash collisions leading to full PTE groups, small
workloads can be constructed that generate very unfortunate reference
patterns.
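
As one possible example, the sketch below generates addresses within a single segment that all hash to the same primary PTE group. It assumes the 32-bit hash (low 19 bits of the VSID XOR the 16-bit page index), a minimum-size table of 1024 groups, and 8 PTEs per group; the function names are invented. Touching more than 16 such pages should overflow both the primary and the secondary group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	PTEG_INDEX_BITS	10	/* minimum table: 1024 PTE groups (assumed) */

/* PTE group index for a 32-bit hashed page table of minimum size. */
uint32_t
pteg_index(uint32_t vsid, uint32_t page_index)
{
	return (((vsid & 0x7FFFFu) ^ (page_index & 0xFFFFu)) &
	    ((1u << PTEG_INDEX_BITS) - 1));
}

/*
 * Fill out[] with up to n effective addresses in one segment whose
 * page indexes differ only above the PTEG index bits, so that every
 * page hashes to the same primary PTE group.  Returns the count.
 */
size_t
colliding_addrs(uint32_t seg_base, uint32_t *out, size_t n)
{
	size_t i;

	assert(out != NULL && (seg_base & 0x0FFFFFFFu) == 0);
	if (n > 64)
		n = 64;	/* only 6 page-index bits lie above the index */
	for (i = 0; i < n; i++)
		out[i] = seg_base | ((uint32_t)i << (PTEG_INDEX_BITS + 12));
	return (n);
}
```

The resulting addresses are 4M apart, so a small test program can provoke full PTE groups by touching one byte at each address.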
--
There are several ideas for new functionality and performance
enhancements, but they are not as important and certainly not as
urgent as debugging aids. So, they will be collected in a document
yet to come.