ccNUMA Overview
Cache coherency
All modern multiprocessor systems are cache coherent. This is necessary because processor caches hold copies of data that also resides in memory. When a processor modifies data at a memory location, it does so in its cache, on a cache-line basis. The modification must also include notifying any other processors that hold that cache line that the copy in their caches is now invalid. The next time the cache line is referenced, the memory system forces the owning processor (the one that modified the data) to write the data back to memory and then, in a separate step or simultaneously, reloads it into the consuming processor's cache. This is not a simple problem, and solving it efficiently is crucial to the successful implementation of the overall memory system. HP high-end SMP servers implement coherency with a directory-based scheme, which has proven to be very efficient and scalable.
Directory-based coherency systems differ from simpler schemes, such as "snoopy" protocols, in that the coherency state is catalogued in the memory system itself. Snoopy methods "listen" to the memory bus, watching for addresses that a processor may have encached. The directory resides in the memory system, and it tells any requesting processor whether a cache line is clean and unmodified, encached in some other processor's cache but not (yet) modified, or modified and dirty, and which cache owns it. This method is far more scalable, and simpler to implement on a large-scale system, than traditional snoopy methods.
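The bookkeeping a directory performs can be sketched as a small state machine. The sketch below is illustrative only: the state names, simplifications (a single shared/modified state per line, no transient states), and the functions are assumptions for the example, not HP's actual protocol.

```python
# Toy directory-based coherence sketch (illustrative assumptions, not HP's
# implementation). One DirectoryEntry tracks one cache line.

class DirectoryEntry:
    """Tracks a cache line's state and which processors hold a copy."""
    def __init__(self):
        self.state = "UNCACHED"   # UNCACHED, SHARED, or MODIFIED
        self.sharers = set()      # IDs of processors holding a clean copy
        self.owner = None         # processor holding the dirty copy, if any

def read(entry, cpu):
    """A processor requests a read copy of the line."""
    if entry.state == "MODIFIED":
        # Force the owner to write back; both now share a clean copy.
        entry.sharers = {entry.owner, cpu}
        entry.owner = None
    else:
        entry.sharers.add(cpu)
    entry.state = "SHARED"

def write(entry, cpu):
    """A processor requests exclusive write access to the line."""
    # Invalidate every other cached copy, then grant ownership.
    entry.sharers = {cpu}
    entry.owner = cpu
    entry.state = "MODIFIED"
```

Because the directory already knows exactly which caches hold the line, invalidations go only where they are needed; a snoopy scheme would instead broadcast on the bus and rely on every cache to check every transaction.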
Interleaved memory
Interleaved memory is memory for shared objects or data structures. A portion of memory is taken from cells of the system (typically all of the cells) and interleaved in round-robin fashion in cache-line-sized chunks. The result is that memory accesses take a uniform amount of time: latency is the same no matter which processor makes the access.
Local memory
Local memory is memory for private objects or data structures. It is still accessible to any processor, but processors on the same cell enjoy the lowest access latency. Accesses from other cells take longer, that is, they have greater latency than accesses from within the same cell.
It should be pointed out that local memory is interleaved only over the memory banks of the local cell.
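Local memory applies the same round-robin idea, but only within one cell: a line's home bank is its cache-line index modulo the number of banks in that cell. The bank count and line size here are assumptions for illustration.

```python
BANKS_PER_CELL = 8   # memory banks within one cell (assumed for illustration)
LINE_BYTES = 128     # bytes per cache line (assumed)

def local_bank(addr):
    """Within a cell, consecutive cache lines rotate across the local banks."""
    return (addr // LINE_BYTES) % BANKS_PER_CELL
```

This keeps bank-level parallelism inside the cell while guaranteeing that every access to the region stays local, so the cell's processors never pay a crossbar hop for it.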
HP high-end SMP servers and the cellular architecture
HP high-end SMP servers are organized into cells of processors, memory, and I/O connections. These cells are connected by crossbars: four cells connect to each crossbar, and the crossbars connect to one another to form the interconnect. Memory latency is smallest when referencing memory locally, in other words, memory on the cell where the processor is located. The next-nearest memory locality is the memory in the other three cells connected to the same crossbar; after that comes the memory in the cells connected to the other three crossbars. A memory transaction never crosses more than two crossbars, but memory traffic between system cabinets does have greater latency than traffic within a cabinet. This is illustrated in Figure 1.
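The locality tiers above can be sketched as a hop count: local memory costs zero crossbar hops, memory in a sibling cell on the same crossbar costs one, and memory behind another crossbar costs two. The cell numbering below (consecutive IDs, four per crossbar, per the text) is an assumption for the example.

```python
CELLS_PER_XBAR = 4  # four cells per crossbar, as stated in the text

def crossbar_hops(cell_a, cell_b):
    """0 hops for local memory, 1 within a crossbar, 2 across crossbars."""
    if cell_a == cell_b:
        return 0
    if cell_a // CELLS_PER_XBAR == cell_b // CELLS_PER_XBAR:
        return 1
    return 2
```

A hop count of two is the worst case regardless of system size, which is why latency grows in discrete tiers rather than with the number of cells.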
Figure 1. HP high-end SMP servers—logical diagram