ccNUMA Overview

The NUMA application programming interfaces (APIs) allow applications to make scheduling and resource
allocation decisions based on advance knowledge of the application’s resource needs and behavior. Proper
manipulation of system resources and process scheduling through NUMA APIs has the following potential
advantages:
• An application can notify the operating system of relationships between processes and threads that should be
  scheduled on the same RAD and, if migration to another RAD becomes advantageous, must be moved together.
• A very large and complex application whose resource demands and number of threads exceed the capacity of
  one RAD can stripe its CPU cycles, I/O load, and the memory that contains program data across RADs.
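
The striping idea can be illustrated with a small model. The following Python sketch is not a NUMA API call; the function name stripe_pages and its parameters are hypothetical. It only shows one simple policy, round-robin placement, under which an application's memory pages are spread across RADs so that no single RAD has to hold the entire working set.

```python
from collections import Counter

def stripe_pages(num_pages, num_rads):
    """Return the RAD index assigned to each page, in round-robin order.

    Hypothetical helper for illustration only; a real system would make
    this request through the platform's NUMA API.
    """
    if num_rads <= 0:
        raise ValueError("num_rads must be positive")
    return [page % num_rads for page in range(num_pages)]

# Ten pages striped across four RADs.
placement = stripe_pages(num_pages=10, num_rads=4)
per_rad = Counter(placement)  # how many pages each RAD received

print(placement)
print(dict(per_rad))
```

With ten pages and four RADs, the first two RADs receive three pages each and the remaining two receive two each, so the load differs by at most one page per RAD.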
Linux
Linux does not have any inherent ccNUMA support.
Linux has limited support for SMP architectures. It is optimized for two to four CPUs, with support for up to eight
CPUs. The implementation does not preclude Non-Uniform Memory Access, and cache coherency is assumed, so
ccNUMA is available—it is just not very efficient. Furthermore, there is limited support for memory locks and
semaphores, and little attention is paid by the kernel developers to efficiency of the multiprocessor versions of
Linux. (For more information, see http://sources.redhat.com/ecos/docs-latest/ref/hal-smp-support.html.)
IBM NUMA-Q
IBM NUMA-Q implements a ccNUMA architecture using Intel Pentium® processors. The system is organized into
four-CPU quads that include memory and disk controllers. The quads are linked together using a relatively
low-bandwidth interconnect called IQ-Link. All memory is shared among all processors, and I/O devices are also
shared. A primary goal of this architecture is to ensure continuous access to attached I/O devices, and it
accomplishes this through redundant links to external devices.
The operating system, DYNIX/ptx, maximizes system performance by locating the memory and I/O connections
close to the calling process, preferably on the same quad. This operating system is based on UNIX System V
Release 4 (SVR4) and is extended to handle large numbers of processors and users.
Windows NT® and Linux are also available, but each is limited to deployment within a single quad and therefore
does not need to support any ccNUMA features.
IBM pSeries
AIX 5L has some features to accommodate the different latencies of the Regatta architecture. The basic physical
arrangement of the POWER4 series processors is four dual-core POWER4 processors, four 32 MB L3 caches
shared by all eight CPUs, and a multi-chip module that ties the processors and memory together, along with I/O
and a connection to the rest of the system. A p690 system consists of four of these modules. AIX 5L provides some
specialized scheduling algorithms to keep applications near their data in order to reduce memory traffic and
latency.
AIX 5L also implements a Large Page feature. By default, pages are 4 KB. The administrator can designate a
certain amount of the total memory to be made of Large Pages, defined as 16 MB. Upon a reboot, these pages
are available to applications, if the user and the application have been authorized by the administrator to take
advantage of this feature.
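
To give a sense of scale for the two page sizes, the short Python calculation below counts how many pages are needed to map a region of memory. The 1 GB region size and the helper name pages_needed are assumptions chosen for illustration; only the 4 KB and 16 MB page sizes come from the text above.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def pages_needed(region_bytes, page_bytes):
    """Number of pages required to cover region_bytes, rounded up."""
    return -(-region_bytes // page_bytes)  # ceiling division on integers

region = 1 * GB  # hypothetical working-set size

small = pages_needed(region, 4 * KB)    # default 4 KB pages
large = pages_needed(region, 16 * MB)   # 16 MB Large Pages

print(small, large, small // large)
```

Under these assumptions the region takes 262144 default pages but only 64 Large Pages, a factor of 4096 fewer address translations for the system to track.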
Sun
Sun high-end servers, such as the Enterprise 15000, are ccNUMA systems.
Sun provides a tool called Memory Placement Optimization (MPO) to optimize application performance. It
attempts to place processes as close as possible to the memory they are using in order to reduce latency and
memory contention.
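
The general idea of placing a process near its memory can be sketched as follows. This is not Sun's MPO algorithm, whose details are not given here; the function choose_home and the sample page counts are hypothetical. The sketch simply selects the node that already holds most of a process's pages and reports what fraction of its memory would remain remote.

```python
def choose_home(pages_per_node):
    """Pick the node holding the most of the process's pages.

    pages_per_node maps a node id to the number of the process's pages
    resident there. Ties go to the lowest node id. Illustrative policy only.
    """
    if not pages_per_node:
        raise ValueError("no nodes given")
    return min(pages_per_node, key=lambda node: (-pages_per_node[node], node))

# Hypothetical distribution of one process's pages over three nodes.
pages = {0: 120, 1: 800, 2: 80}

home = choose_home(pages)
remote_fraction = 1 - pages[home] / sum(pages.values())

print(home, round(remote_fraction, 2))
```

In this example the process would be placed on node 1, leaving about one fifth of its pages on remote nodes.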