ccNUMA Overview
Cache coherency
All modern multiprocessor systems are cache coherent. This is necessary because processor caches hold copies of data that also resides in memory. When a processor modifies data at a memory location, it does so in its cache, on a cache-line basis. The modification must also include notifying any other processors that hold that cache line that the copy in their caches is now invalid. The next time the cache line is referenced, the memory system forces the owning processor (the one that modified the data) to write the data back to memory and then, in a separate step or simultaneously, reloads it into the consuming processor's cache. This is not a simple problem, and solving it efficiently is crucial to the successful implementation of the overall memory system. HP high-end SMP servers implement coherency with a directory-based scheme, which has proven to be very efficient and scalable.
Directory-based coherency systems differ from simpler schemes, such as "snoopy" protocols, in that the coherency state is catalogued in the memory system itself. Snoopy methods "listen" to the memory bus, watching for addresses that a processor may have encached. The directory resides in the memory system, and it tells any requesting processor whether a cache line is clean and unmodified, encached in some other processor's cache but not (yet) modified, or modified and dirty, and which cache owns it. This method is far more scalable, and simpler to implement on a large-scale system, than traditional snoopy methods.
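The bookkeeping a directory performs can be sketched as a small state machine. The sketch below is illustrative only: the state names, simplifications (a single shared/modified state per line, no transient states), and the functions are assumptions for the example, not HP's actual protocol.

```python
# Toy directory-based coherence sketch (illustrative assumptions, not HP's
# implementation). One DirectoryEntry tracks one cache line.

class DirectoryEntry:
    """Tracks a cache line's state and which processors hold a copy."""
    def __init__(self):
        self.state = "UNCACHED"   # UNCACHED, SHARED, or MODIFIED
        self.sharers = set()      # IDs of processors holding a clean copy
        self.owner = None         # processor holding the dirty copy, if any

def read(entry, cpu):
    """A processor requests a read copy of the line."""
    if entry.state == "MODIFIED":
        # Force the owner to write back; both now share a clean copy.
        entry.sharers = {entry.owner, cpu}
        entry.owner = None
    else:
        entry.sharers.add(cpu)
    entry.state = "SHARED"

def write(entry, cpu):
    """A processor requests exclusive write access to the line."""
    # Invalidate every other cached copy, then grant ownership.
    entry.sharers = {cpu}
    entry.owner = cpu
    entry.state = "MODIFIED"
```

Because the directory already knows exactly which caches hold the line, invalidations go only where they are needed; a snoopy scheme would instead broadcast on the bus and rely on every cache to check every transaction.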
Interleaved memory
Interleaved memory is memory for shared objects or data structures. A portion of memory is taken from cells of the system (typically all of the cells) and interleaved in round-robin fashion in cache-line-sized chunks. The result is that memory accesses take a uniform amount of time: latency is the same no matter which processor makes the access.
Local memory
Local memory is memory for private objects or data structures. It is still accessible to any processor, but processors on the same cell enjoy the lowest access latency. Accesses from other cells take longer, that is, they have greater latency than accesses from within the same cell.
It should be pointed out that local memory is interleaved only over the memory banks of the local cell.
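Local memory applies the same round-robin idea, but only within one cell: a line's home bank is its cache-line index modulo the number of banks in that cell. The bank count and line size here are assumptions for illustration.

```python
BANKS_PER_CELL = 8   # memory banks within one cell (assumed for illustration)
LINE_BYTES = 128     # bytes per cache line (assumed)

def local_bank(addr):
    """Within a cell, consecutive cache lines rotate across the local banks."""
    return (addr // LINE_BYTES) % BANKS_PER_CELL
```

This keeps bank-level parallelism inside the cell while guaranteeing that every access to the region stays local, so the cell's processors never pay a crossbar hop for it.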
HP high-end SMP servers and the cellular architecture
HP high-end SMP servers are organized into cells of processors, memory, and I/O connections. These cells are connected by crossbars: four cells connect to each crossbar, and the crossbars connect to one another to form the interconnect. Memory latency is smallest when referencing memory locally, in other words, memory on the cell where the processor is located. The next-nearest memory locality is the memory in the other three cells connected to the same crossbar; after that comes the memory in the cells connected to the other three crossbars. A memory transaction never crosses more than two crossbars, but memory traffic between system cabinets does have greater latency than traffic within a cabinet. This is illustrated in Figure 1.
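The locality tiers above can be sketched as a hop count: local memory costs zero crossbar hops, memory in a sibling cell on the same crossbar costs one, and memory behind another crossbar costs two. The cell numbering below (consecutive IDs, four per crossbar, per the text) is an assumption for the example.

```python
CELLS_PER_XBAR = 4  # four cells per crossbar, as stated in the text

def crossbar_hops(cell_a, cell_b):
    """0 hops for local memory, 1 within a crossbar, 2 across crossbars."""
    if cell_a == cell_b:
        return 0
    if cell_a // CELLS_PER_XBAR == cell_b // CELLS_PER_XBAR:
        return 1
    return 2
```

A hop count of two is the worst case regardless of system size, which is why latency grows in discrete tiers rather than with the number of cells.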
Figure 1. HP high-end SMP servers—logical diagram