ccNUMA Overview

The NUMA application programming interfaces (APIs) allow applications to make scheduling and resource

allocation decisions based on advance knowledge of the application’s resource needs and behavior. Proper

manipulation of system resources and process scheduling through NUMA APIs has the following potential

advantages:

• An application can notify the operating system of relationships between processes and threads that should be

scheduled on the same RAD and, if migration to another RAD becomes advantageous, must be moved together.

• A very large and complex application whose resource demands and number of threads exceed the capacity of

one RAD can stripe its CPU cycles, I/O load, and the memory that contains program data across RADs.
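The two ideas in the list above can be illustrated with a small toy model. This is only a sketch and not a real NUMA API: the function names `place_groups` and `stripe_pages` are hypothetical, and a RAD is modeled as nothing more than an integer index with a fixed thread capacity.

```python
from collections import defaultdict


def place_groups(groups, num_rads, rad_capacity):
    """Assign each group of related threads to a single RAD (toy model).

    groups: dict mapping a group name to its thread count.
    Returns a dict mapping group name -> RAD index.
    Raises ValueError if a group cannot fit on any one RAD.
    """
    free = [rad_capacity] * num_rads
    placement = {}
    # Place the largest groups first so they are most likely to fit whole.
    for name, threads in sorted(groups.items(), key=lambda kv: -kv[1]):
        # Pick the RAD with the most free slots (lowest index wins ties).
        rad = max(range(num_rads), key=lambda r: free[r])
        if free[rad] < threads:
            raise ValueError(f"group {name!r} does not fit on one RAD")
        free[rad] -= threads
        placement[name] = rad
    return placement


def stripe_pages(num_pages, num_rads):
    """Spread pages round-robin across RADs; returns RAD index -> page list."""
    layout = defaultdict(list)
    for page in range(num_pages):
        layout[page % num_rads].append(page)
    return dict(layout)


if __name__ == "__main__":
    # Related threads stay together on one RAD.
    print(place_groups({"db": 3, "web": 2, "log": 1}, num_rads=2, rad_capacity=4))
    # An application too large for one RAD stripes its memory across all of them.
    print(stripe_pages(8, 4))
```

The first function keeps every group whole, which mirrors the first bullet. The second deliberately gives up locality in exchange for using the capacity of every RAD, which mirrors the second.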

Linux

Linux does not have any inherent ccNUMA support.

Linux has limited support for SMP architectures. It is optimized for two to four CPUs, with support for up to eight

CPUs. The implementation does not preclude Non-Uniform Memory Access, and cache coherency is assumed, so

ccNUMA is available; it is just not very efficient. Furthermore, there is limited support for memory locks and

semaphores, and the kernel developers pay little attention to the efficiency of the multiprocessor versions of

Linux. (For more information, see http://sources.redhat.com/ecos/docs-latest/ref/hal-smp-support.html.)

IBM NUMA-Q

IBM NUMA-Q implements a ccNUMA architecture using Intel Pentium processors. The system is organized into

four-CPU quads that include memory and disk controllers. The quads are linked together using a relatively

low-bandwidth interconnect called IQ-Link. All memory is shared among all processors, and I/O devices are also

shared. A primary goal of this architecture is to ensure continuous access to attached I/O devices, and it

accomplishes this through redundant links to external devices.

The operating system, DYNIX/ptx, maximizes system performance by locating the memory and I/O connections

close to the calling process, preferably on the same quad. This operating system is based on System V Release 4 and is

extended to handle large numbers of processors and users.

Windows NT and Linux are also available, but they can be deployed only within a single quad and therefore do

not need to support any ccNUMA features.

IBM pSeries

AIX 5L has some features to accommodate the different latencies of the Regatta architecture. The basic physical

arrangement of the POWER4 series processors is four dual-core POWER4 processors, four 32 MB L3 caches

shared by all eight CPUs, and a multi-chip module that ties the processors and memory together, along with I/O

and a connection to the rest of the system. A p690 system consists of four of these modules. AIX 5L provides some

specialized scheduling algorithms to keep applications near their data in order to reduce memory traffic and

latency.
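The processor and cache counts implied by the module layout described above can be worked out directly. The short calculation below uses only the figures stated in the text and assumes a fully populated p690; the variable names are illustrative.

```python
# Figures taken from the description of the POWER4 multi-chip module.
CORES_PER_CHIP = 2         # each POWER4 processor is dual-core
CHIPS_PER_MODULE = 4       # four processors per multi-chip module
L3_CACHES_PER_MODULE = 4   # four L3 caches per module
L3_CACHE_MB = 32           # each L3 cache is 32 MB
MODULES_PER_P690 = 4       # a p690 consists of four modules

cpus_per_module = CORES_PER_CHIP * CHIPS_PER_MODULE
l3_per_module_mb = L3_CACHES_PER_MODULE * L3_CACHE_MB
cpus_per_p690 = cpus_per_module * MODULES_PER_P690
l3_per_p690_mb = l3_per_module_mb * MODULES_PER_P690

print(f"per module: {cpus_per_module} CPUs, {l3_per_module_mb} MB L3")
print(f"per p690:   {cpus_per_p690} CPUs, {l3_per_p690_mb} MB L3")
```

Under these assumptions a module holds 8 CPUs sharing 128 MB of L3 cache, and a full system holds 32 CPUs.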

AIX 5L also implements a Large Page feature. By default, pages are 4 KB. The administrator can designate a

certain amount of the total memory to be made up of Large Pages, each 16 MB in size. After a reboot, these pages

are available to applications, provided that the administrator has authorized both the user and the application to take

advantage of this feature.
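To give a sense of scale for the two page sizes, the sketch below counts how many pages are needed to cover a region of memory. The 1 GB region and the helper name `pages_needed` are assumptions chosen for illustration; only the 4 KB and 16 MB page sizes come from the text.

```python
SMALL_PAGE = 4 * 1024            # 4 KB default page size
LARGE_PAGE = 16 * 1024 * 1024    # 16 MB Large Page size


def pages_needed(region_bytes, page_size):
    """Number of pages required to cover a region, rounding up."""
    return -(-region_bytes // page_size)


region = 1024 ** 3  # hypothetical 1 GB data region

small = pages_needed(region, SMALL_PAGE)
large = pages_needed(region, LARGE_PAGE)

print(f"4 KB pages needed:  {small}")                      # 262144
print(f"16 MB pages needed: {large}")                      # 64
print(f"size ratio:         {LARGE_PAGE // SMALL_PAGE}")   # 4096
```

One Large Page covers the same memory as 4096 default pages, so an application with a large working set needs far fewer address translations to reach its data.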

Sun

Sun high-end servers, such as the Enterprise 15000, are ccNUMA systems.

Sun provides a tool, called Memory Placement Optimization (MPO), to optimize application performance. It

attempts to place processes as close as possible to the memory they are using in order to reduce latency and

memory contention.
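The general idea of placing a process near its memory can be sketched with a simple heuristic. The code below is not Sun's MPO algorithm, whose details are not given here; `preferred_node` and `remote_fraction` are hypothetical helpers that assume the node holding each of a process's pages is already known.

```python
from collections import Counter


def preferred_node(page_nodes):
    """Return the node that holds the most of a process's pages.

    page_nodes: iterable of node indices, one entry per resident page.
    Ties are broken in favor of the lowest node index.
    """
    counts = Counter(page_nodes)
    if not counts:
        raise ValueError("process has no resident pages")
    return min(counts, key=lambda node: (-counts[node], node))


def remote_fraction(page_nodes, cpu_node):
    """Fraction of pages that live on a node other than cpu_node."""
    pages = list(page_nodes)
    if not pages:
        raise ValueError("process has no resident pages")
    remote = sum(1 for node in pages if node != cpu_node)
    return remote / len(pages)


if __name__ == "__main__":
    pages = [0, 0, 1, 2, 0, 1]  # hypothetical page-to-node map
    best = preferred_node(pages)
    print(best, remote_fraction(pages, best))
```

Running the process on the node returned by `preferred_node` minimizes the share of memory accesses that must cross the interconnect in this simplified model, which is the latency effect the text describes.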