ccNUMA Overview
The HP 9000 Superdome running HP-UX 11i v1 acts as a uniform memory access (UMA) system with an effective latency closer to 2-hop
performance. For example, an 8-cell configuration has an effective latency of 323 ns. The HP Integrity server
delivers 243 ns latency and 7.2 GB/s memory bandwidth if the dataset can be contained in local memory. This is
a 33% improvement in memory latency (323 ns / 243 ns ≈ 1.33), available to applications that can take advantage of
the performance features of the memory system. By default, HP-UX 11i v2 allocates memory for an application from
local memory, that is, from the locality domain (LDOM) in which the application runs. For most applications,
particularly those that do not use multiple processors, this ensures optimal performance of both the application
and the memory system.
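The latency comparison above can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted in the text (323 ns and 243 ns). The function names are illustrative, and the weighted-average model in effective_latency is a simplifying assumption rather than a description of how the hardware behaves; the remote latency passed to it is a stand-in chosen by the caller, not a measured value for any system.

```python
# Latency figures quoted in the text, in nanoseconds.
SUPERDOME_UMA_NS = 323.0    # 8-cell HP 9000 Superdome, effective latency
INTEGRITY_LOCAL_NS = 243.0  # HP Integrity, dataset held in local memory


def improvement_ratio(old_ns: float, new_ns: float) -> float:
    """How much slower the old latency is than the new one (0.33 means 33%)."""
    return old_ns / new_ns - 1.0


def latency_reduction(old_ns: float, new_ns: float) -> float:
    """Fraction of the old latency that is removed (0.25 means 25% lower)."""
    return (old_ns - new_ns) / old_ns


def effective_latency(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Assumed model: average latency as a simple mix of local and remote accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    print(f"improvement: {improvement_ratio(SUPERDOME_UMA_NS, INTEGRITY_LOCAL_NS):.1%}")
    print(f"reduction:   {latency_reduction(SUPERDOME_UMA_NS, INTEGRITY_LOCAL_NS):.1%}")
    # Hypothetical: half of all accesses are remote, using 323 ns as a stand-in.
    print(f"mixed:       {effective_latency(0.5, INTEGRITY_LOCAL_NS, SUPERDOME_UMA_NS):.0f} ns")
```

Expressed as a ratio of old to new latency the gain is roughly one third; expressed as a reduction from the old figure it is roughly one quarter. Under the assumed model, the benefit shrinks as a larger share of accesses falls outside local memory, which is why local allocation is the default.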
For parallel applications or multiprocess applications, the situation is somewhat different, but correct behavior is
still assured. Existing HP-UX 11i v1.6 applications will work correctly. To run optimally, some applications may
need to be slightly modified, but most will not. For many applications, the only changes are in the launcher
scripts.
This paper documents the tools and resources available for developers to take maximum advantage of the features
of HP-UX 11i v2.
Competitive situation
Most commercially available multiprocessors implement some variation of a ccNUMA architecture. However, each
vendor has implemented its own interface for exploiting the ccNUMA features of its systems, and there is not yet a
standard way to program these systems.
SGI IRIX
SGI IRIX was the first commercially successful ccNUMA operating system. Its ccNUMA support was introduced in 1996
with the Origin2000 system, the first widely used system of this class.
The Origin is organized in nodes. Each node has two processors connected to a hub, which is also connected to the
node’s memory. Each node shares an I/O hub, which is connected to peripherals, with one other node, and is itself
connected to the interconnect network.
SGI has provided two main tools for developers and users: dlook and dplace. The dlook diagnostic tool helps
developers and administrators understand what the application is physically doing on the system in terms of
memory usage, contention, and bandwidth.
The dplace tool is encountered most often, since it is used in scripts and queuing systems to actually launch the
application in an efficient manner, considering current activity on the system and what resources the user wants to
make use of.
SGI Altix 64-bit Linux systems
The SGI Altix is organized the same way as the Origin system. The main difference is the incorporation of the
Intel® Itanium® 2 processor and the use of Linux as the operating system.
SGI has extended Linux to work efficiently on a ccNUMA multiprocessor. They have also ported and provided
dlook and dplace to aid developers and users in making the most efficient use of the system.
Tru64
On Tru64 UNIX® systems, the building blocks that make up a NUMA system are mapped to structures called
Resource Affinity Domains (RADs). A RAD identifies the set of CPUs, memory arrays, and I/O buses that, when
used together, allow the system to work most efficiently.
Starting with Tru64 UNIX version 5.1, the operating system makes a best effort to:
• Schedule all threads of a multithreaded application on CPUs in the same RAD
• Allocate memory for each process or application thread in the same RAD as the CPU where the process or
thread is running
The default NUMA-aware algorithms for scheduling and allocating resources to a process or thread work well
when the resources in one RAD can accommodate the number of threads and the memory demands in any one
application.
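The default behavior described above can be illustrated with a toy model. The sketch below is not the Tru64 UNIX API or its scheduler; the Rad class, the place function, and all capacities are hypothetical names and numbers introduced only to show why the defaults work well when a single RAD can hold the whole application, and what changes when it cannot.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rad:
    """Hypothetical model of a Resource Affinity Domain (not a Tru64 structure)."""
    rad_id: int
    free_cpus: int
    free_mem_mb: int


def place(rads: List[Rad], n_threads: int, mem_mb: float) -> List[Tuple[int, int, float]]:
    """Return (rad_id, threads, mem_mb) assignments for one application.

    Best effort, as in the text: keep all threads and memory in one RAD when
    one RAD can accommodate them. Otherwise spill threads across RADs and keep
    each thread's share of memory in the RAD where that thread runs.
    Simplification: the spill path checks CPU capacity only, not memory.
    """
    if n_threads <= 0:
        raise ValueError("n_threads must be positive")

    # Preferred case: a single RAD satisfies both CPU and memory demand.
    for rad in rads:
        if rad.free_cpus >= n_threads and rad.free_mem_mb >= mem_mb:
            return [(rad.rad_id, n_threads, mem_mb)]

    # Fallback: spread threads over several RADs, memory following the threads.
    per_thread_mb = mem_mb / n_threads
    remaining = n_threads
    placement = []
    for rad in rads:
        if remaining == 0:
            break
        take = min(rad.free_cpus, remaining)
        if take == 0:
            continue
        placement.append((rad.rad_id, take, take * per_thread_mb))
        remaining -= take
    if remaining:
        raise RuntimeError("not enough free CPUs in any combination of RADs")
    return placement


if __name__ == "__main__":
    rads = [Rad(0, 4, 8192), Rad(1, 4, 8192)]  # hypothetical two-RAD system
    print(place(rads, 4, 4096))  # fits in one RAD: all accesses stay local
    print(place(rads, 6, 6144))  # spans two RADs: some sharing becomes remote
```

In the first call everything lands in one RAD, which is the case the default algorithms handle well. In the second call the application spans two RADs, so any data shared between its threads must cross the interconnect, and this is where explicit placement controls become relevant.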