System V IPC resource controls
Background
Traditionally, the behavior of the System V IPC facilities (shared memory, message queues, and semaphores) was influenced through a large set of /etc/system tunables. While some of the tunables allowed one to set meaningful administrative limits (e.g. maximum shared memory segment size), many simply exposed implementation details (e.g. the number of undo entries in an undo structure).
There were many serious problems with the traditional implementation:
- Relying on /etc/system as an administrative mechanism meant reconfiguration required a reboot.
- Many parameters were used to size data structures allocated at boot (or module load) time. There was a penalty for sizing the parameters larger than was needed.
- There was a large variety of parameters to change, many of which were implementation-specific and didn't align well with public interface boundaries. Yet they were necessary to configure the system for different workloads.
- The tunables, named by combining a three-character facility abbreviation with a three-character parameter abbreviation, were a veritable alphabet soup. It was very easy for an administrator to misconfigure the system (see 4381822).
- The algorithms used by the traditional implementation assumed statically-sized data structures. Even if there was an interface which let one do so, changing many of the tunables at run-time wouldn't have been possible.
- There was no way to allocate additional resources to one user without allowing all users those resources. Since the amount of resources was always fixed, one user could trivially prevent another from performing its desired allocations.
- There was no good way to observe the values of the parameters.
Additionally, a perpetual complaint was that the default values for these tunables were too small.
The Solution
In Solaris 10 (build 28), these problems were solved by reworking much of the System V IPC implementation to not require as much administrative hand-holding (removing unnecessary tunables), and by using task-based resource controls to limit users' access to the System V IPC facilities (replacing the remaining tunables). At the same time, the default values for those limits which remained were raised to more reasonable values. Lastly, for compatibility, the legacy tunables are interpreted and used to initialize the default privileged limit for the new resource controls. The new resource controls are:
| Resource control | Similar tunable | Old default | New default | Max value |
| --- | --- | --- | --- | --- |
| project.max-shm-ids | shminfo_shmmni | 100 | 128 | 1<<24 |
| project.max-msg-ids | msginfo_msgmni | 50 | 128 | 1<<24 |
| project.max-sem-ids | seminfo_semmni | 10 | 128 | 1<<24 |
| project.max-shm-memory | shminfo_shmmax | 0x800000 | 1/4 physical | UINT64_MAX |
| process.max-sem-nsems | seminfo_semmsl | 25 | 512 | SHRT_MAX |
| process.max-sem-ops | seminfo_semopm | 10 | 512 | INT_MAX |
| process.max-msg-qbytes | msginfo_msgmnb | 4096 | 65536 | ULONG_MAX |
| process.max-msg-messages | msginfo_msgtql | 40 | 8192 | UINT_MAX |
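When auditing an old /etc/system file, it can help to have the name mapping from the table above in machine-readable form. The following is a minimal sketch of such a lookup; the names `ipc_rctl_map`, `ipc_rctl_table`, and `rctl_for_tunable` are hypothetical and are not part of any Solaris interface.

```c
#include <stddef.h>
#include <string.h>

/* One row of the table above: a legacy tunable and the similar control. */
struct ipc_rctl_map {
	const char *tunable;	/* legacy /etc/system parameter name */
	const char *rctl;	/* resource control listed beside it */
};

static const struct ipc_rctl_map ipc_rctl_table[] = {
	{ "shminfo_shmmni", "project.max-shm-ids" },
	{ "msginfo_msgmni", "project.max-msg-ids" },
	{ "seminfo_semmni", "project.max-sem-ids" },
	{ "shminfo_shmmax", "project.max-shm-memory" },
	{ "seminfo_semmsl", "process.max-sem-nsems" },
	{ "seminfo_semopm", "process.max-sem-ops" },
	{ "msginfo_msgmnb", "process.max-msg-qbytes" },
	{ "msginfo_msgtql", "process.max-msg-messages" },
};

/* Return the resource control similar to a legacy tunable, or NULL. */
const char *
rctl_for_tunable(const char *tunable)
{
	size_t i;
	size_t n = sizeof (ipc_rctl_table) / sizeof (ipc_rctl_table[0]);

	for (i = 0; i < n; i++) {
		if (strcmp(ipc_rctl_table[i].tunable, tunable) == 0)
			return (ipc_rctl_table[i].rctl);
	}
	return (NULL);
}
```

The mapping is by similarity only. For example, shminfo_shmmax limited a single segment, while project.max-shm-memory limits a project's total, so old values should not be copied across blindly.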
The following tunables no longer have any effect (tunables listed in italics were removed by previous efforts):
| semsys:seminfo_semmns | semsys:seminfo_semvmx | semsys:seminfo_semmnu |
| semsys:seminfo_semaem | semsys:seminfo_semume | semsys:seminfo_semusz |
| semsys:seminfo_semmap | shmsys:shminfo_shmseg | shmsys:shminfo_shmmin |
| msgsys:msginfo_msgmap | msgsys:msginfo_msgseg | msgsys:msginfo_msgssz |
| msgsys:msginfo_msgmax |
The end result of all this is that all the problems listed above have been addressed. The specific improvements are:
- It is now possible to limit use of the System V IPC facilities on a per-process or per-project basis (depending on the resource being limited), without rebooting the system.
- None of these limits affect allocation directly; they can be made as large as possible without any immediate effect on the system. (Note that doing so would allow a user to allocate resources without bound, which would have an effect on the system.)
- Implementation internals are no longer exposed to the administrator, simplifying configuration greatly.
- The resource controls are fewer, and are more verbosely and intuitively named, than the tunables.
- Limit settings can be observed using the common resource control interfaces, such as prctl(1) and getrctl(2).
- Shared memory is limited based on the total amount allocated per project, not a per segment limit. This means that an administrator can give a user the ability to allocate a lot of segments and large segments, without having to give the user the ability to create a lot of large segments.
- Because resource controls are the administrative mechanism, this configuration can be made persistent using project(4), and can also be made via the network.
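As an illustration of observing a limit programmatically, the sketch below reads the first value in a resource control's value sequence with getrctl(2). The helper name `first_rctl_value` is hypothetical. The Solaris-specific calls are guarded by `__sun`, and the fallback for other systems (failing with ENOTSUP) is an assumption made only so the example compiles elsewhere.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef __sun
#include <rctl.h>
#endif

/*
 * Fetch the first value set on the named resource control as seen by
 * the calling process. Returns 0 and stores the value on success, or
 * -1 with errno set on failure. Outside Solaris this always fails with
 * ENOTSUP (an assumption of this sketch, since getrctl(2) is absent).
 */
int
first_rctl_value(const char *name, uint64_t *valuep)
{
#ifdef __sun
	rctlblk_t *blk = malloc(rctlblk_size());
	int rc, saved;

	if (blk == NULL)
		return (-1);
	rc = getrctl(name, NULL, blk, RCTL_FIRST);
	saved = errno;
	if (rc == 0)
		*valuep = (uint64_t)rctlblk_get_value(blk);
	free(blk);
	errno = saved;
	return (rc);
#else
	(void) name;
	(void) valuep;
	errno = ENOTSUP;
	return (-1);
#endif
}
```

A caller might pass "project.max-shm-memory" and print the result. The first value returned is not necessarily the privileged one; walking the rest of the sequence with RCTL_NEXT and checking each block's privilege is left out to keep the sketch short.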
The following major implementation changes were made (for all the details, see os/ipc.c, os/msg.c, os/shm.c, syscall/sem.c):
- Message headers are allocated dynamically. Previously all message headers were allocated at module load time, linked together into a global freelist, and allocated from there. (The locking on this list also caused a scalability problem.)
- Semaphore arrays are allocated dynamically. Previously semaphore arrays were allocated from a seminfo_semmns sized vmem arena, which meant that allocations could fail due to fragmentation.
- Semaphore undo structures are allocated dynamically, and are per-process and per-semaphore array. They are unlimited in number and are always as large as the semaphore array they correspond to. Previously there was a limited number of per-process undo structures, allocated at module load time. Furthermore, the undo structures each had the same, fixed size. It was possible for a process to not be able to allocate an undo structure, or for the process's undo structure to be full.
- Semaphore undo structures maintain their undo values as signed integers, so no semaphore value is too large to be undone.
- All facilities used to allocate objects from a fixed size namespace, allocated at module load time. All facility namespaces are now resizable, and will grow as demand increases.
Resource Controls
project.max-shm-ids
Maximum number of shared memory ids allowed a project.
When shmget() is used to allocate a shared memory segment, one id is allocated. If the id allocation doesn't succeed, shmget() fails and errno is set to ENOSPC (previously returned when "the system-imposed limit on the maximum number of allowed shared memory identifiers system-wide would be exceeded"). Upon successful shmctl(, IPC_RMID) the id is deallocated.
project.max-sem-ids
Maximum number of semaphore ids allowed a project.
When semget() is used to allocate a semaphore set, one id is allocated. If the id allocation doesn't succeed, semget() fails and errno is set to ENOSPC (previously returned when "the system-imposed limit on the maximum number of allowed semaphores or semaphore identifiers system-wide would be exceeded"). Upon successful semctl(, IPC_RMID) the id is deallocated.
project.max-msg-ids
Maximum number of message queue ids allowed a project.
When msgget() is used to allocate a message queue, one id is allocated. If the id allocation doesn't succeed, msgget() fails and errno is set to ENOSPC (previously returned when "the system-imposed limit on the maximum number of allowed message queue identifiers system wide would be exceeded"). Upon successful msgctl(, IPC_RMID) the id is deallocated.
project.max-shm-memory
Total amount of shared memory allowed a project.
When shmget() is used to allocate a shared memory segment, the segment's size is allocated against this limit. If the space allocation doesn't succeed, shmget() fails and errno is set to EINVAL (currently returned when "The size argument is less than the system-imposed minimum or greater than the system-imposed maximum."). The size is deallocated once the last process has detached the segment and the segment has been successfully removed with shmctl(, IPC_RMID).
process.max-sem-nsems
Maximum number of semaphores allowed per semaphore set.
When semget() is used to allocate a semaphore set, the size of the set is compared with this limit. If the number of semaphores exceeds the limit, semget() fails and errno is set to EINVAL (previously returned when "The nsems argument is ... greater than the system-imposed limit").
process.max-sem-ops
Maximum number of semaphore operations allowed per semop call.
When semget() successfully allocates a semaphore set, the minimum enforced value of this limit is used to initialize the "system-imposed maximum" number of operations a semop() call for this set can perform.
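To show where the two semaphore limits come into play, the following sketch creates a two-semaphore set and applies two operations in a single semop() call. The function name `sem_two_ops` and the union name `sem_arg` are hypothetical. The E2BIG remark in the comments comes from the semop(2) manual page rather than from the text above.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* semctl() takes a caller-defined union; the name here is arbitrary. */
union sem_arg {
	int val;
	struct semid_ds *buf;
	unsigned short *array;
};

/*
 * Create a set of two semaphores, zero them, add 1 and 2 to them in one
 * semop() call, and return the sum of the resulting values. Returns -1
 * with errno set on failure. The set is removed before returning.
 */
int
sem_two_ops(void)
{
	struct sembuf ops[2];
	union sem_arg arg;
	int id, i, v, sum = 0, saved;

	/* The set size (2) is what process.max-sem-nsems is compared to. */
	id = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
	if (id == -1)
		return (-1);

	arg.val = 0;
	for (i = 0; i < 2; i++) {
		if (semctl(id, i, SETVAL, arg) == -1)
			goto fail;
	}

	ops[0].sem_num = 0; ops[0].sem_op = 1; ops[0].sem_flg = 0;
	ops[1].sem_num = 1; ops[1].sem_op = 2; ops[1].sem_flg = 0;

	/* semop(2) documents E2BIG when nsops exceeds the per-set maximum. */
	if (semop(id, ops, 2) == -1)
		goto fail;

	for (i = 0; i < 2; i++) {
		if ((v = semctl(id, i, GETVAL)) == -1)
			goto fail;
		sum += v;
	}
	(void) semctl(id, 0, IPC_RMID);
	return (sum);

fail:
	saved = errno;
	(void) semctl(id, 0, IPC_RMID);
	errno = saved;
	return (-1);
}
```

Because the operations-per-call maximum is fixed when the set is created, raising process.max-sem-ops afterwards would presumably affect only sets created later.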
process.max-msg-qbytes
Maximum number of bytes of messages on a message queue.
When msgget() successfully allocates a message queue, the minimum enforced value of this limit is used to initialize msg_qbytes (which was previously "set to the system limit").
process.max-msg-messages
Maximum number of messages on a message queue.
When msgget() successfully allocates a message queue, the minimum enforced value of this limit is used to initialize a per-queue limit on the number of messages.
Reference Materials
Official documentation:
Solaris Tunable Parameters Reference Manual
Solaris 10 What's New (see 3/05)
ARC cases:
PSARC 2002/694 System V IPC resource controls
PSARC 2003/047 process.max-msg-messages resource control
Bugs:
4269168 Can't run ICEM Surf with IPC kernel parameters default values.
4325644 Shared Memory limits should be exported
4381822 System V IPC settings should be dynamic
4678797 Solaris default value of shminfo_shmmax is too small
4715904 System V semaphores still suffer from false sharing
4715918 semaphore operations do not scale when SEM_UNDO is specified
4743342 race between msgget and message queue consumers
4743359 race between semget and semaphore consumers
4780063 ipcs may incorrectly report that a message queue has waiting readers
4788894 ipcs may fail to report waiting message queue writers
4808173 System V message queues don't scale, period