ccNUMA Overview
Past implementations
With HP-UX 11i v1, all memory is interleaved over all the cells. The interleaving is done on a cache-line basis,
which makes memory latency appear uniform. Note, however, that latency is greater on large systems than on
small ones, since memory references cross more crossbars on average.
For example, in a 16 CPU, 4-cell system, a memory reference is satisfied locally 1/4 of the time and remotely
3/4 of the time. Each remote access is satisfied with one hop across the crossbar (see Figure 1), so the average
latency is (T_local + 3 T_remote) / 4.
For a 32 CPU, 8-cell system, 1/8 of the references will be local and 7/8 will be remote, but the remote accesses
will have different latencies: 3/8 of the references cross one crossbar, and 4/8 cross two crossbars. This yields
an average latency of (T_local + 3 T_remote + 4 T_veryremote) / 8.
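The two averages above follow the same pattern: a latency weighted by how many of the cell references fall at each hop distance. A minimal sketch of that calculation, with illustrative placeholder latency values (not HP hardware figures):

```python
# Weighted-average memory latency for cache-line-interleaved memory.
# Each entry pairs a reference count with the latency at that hop distance.

def average_latency(counts_and_latencies):
    """Return the average latency over (count, latency) pairs."""
    total = sum(count for count, _ in counts_and_latencies)
    return sum(count * lat for count, lat in counts_and_latencies) / total

# Placeholder latencies in arbitrary units (illustrative, not measured).
T_LOCAL, T_REMOTE, T_VERYREMOTE = 1.0, 2.0, 3.0

# 16-CPU, 4-cell system: 1 local reference, 3 one-hop references.
avg_4cell = average_latency([(1, T_LOCAL), (3, T_REMOTE)])

# 32-CPU, 8-cell system: 1 local, 3 one-hop, 4 two-hop references.
avg_8cell = average_latency([(1, T_LOCAL), (3, T_REMOTE), (4, T_VERYREMOTE)])
```

With these placeholder values, avg_4cell evaluates to (1 + 3·2)/4 = 1.75 and avg_8cell to (1 + 3·2 + 4·3)/8 = 2.375, matching the formulas above.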
A key feature of the original HP high-end SMP implementation is that memory bandwidth from the local cell is
essentially the same as the bandwidth available through the interconnect. Combined with the fact that data is
interleaved over all the cells, this makes the system perform essentially as a Uniform Memory Access (UMA)
system. This makes application performance uniform and repeatable, so from the application's perspective
there is no need to distinguish between local and remote memory when making memory accesses.
Figure 2. Interleave mapping of cache lines over cells
Locality domains
A useful concept is the locality domain (LDOM). A locality domain consists of a related collection of processors,
memory, and peripheral resources that compose a fundamental building block of the system. All processors and
peripheral devices in a given locality domain have equal latency to the memory contained within that locality
domain.
A cell is a locality domain. The interleave memory region is a locality, but not a locality domain since it contains
no processors or peripherals.
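The defining property of a locality domain can be captured in a toy model: latency depends only on whether the requesting processor and the target memory belong to the same domain. The names and latency values below are illustrative assumptions, not an HP-UX interface:

```python
# Toy model of locality domains (LDOMs): all processors within a domain
# have equal latency to that domain's memory; crossing domains costs more.

LOCAL_NS = 100   # placeholder latency within a locality domain (ns)
REMOTE_NS = 300  # placeholder latency to another locality domain (ns)

def latency_ns(cpu_ldom, mem_ldom):
    """Latency depends only on whether CPU and memory share an LDOM."""
    return LOCAL_NS if cpu_ldom == mem_ldom else REMOTE_NS

# Every CPU in LDOM 0 sees the same latency to LDOM 0 memory,
# and a uniformly higher latency to memory in LDOM 1.
same_domain = latency_ns(0, 0)
cross_domain = latency_ns(0, 1)
```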