ccNUMA Overview


The HP 9000 Superdome running HP-UX 11i v1.1 behaves as a uniform memory access (UMA) system, with an
effective latency closer to 2-hop performance. For example, an 8-cell configuration has an effective latency
of 323 ns. The HP Integrity server delivers 243 ns latency and 7.2 GB/s memory bandwidth if the data set fits
in local memory. That is about 25% lower latency (equivalently, the 323 ns figure is about 33% higher than
243 ns), provided the application can take advantage of the performance features of the memory system. By
default, HP-UX 11i v2 allocates memory for applications from local memory, that is, from the locality domain
(LDOM) in which the application runs. For most applications, specifically applications that do not use
multiple processors, this ensures optimal performance of the applications and the memory system.
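The two latency figures quoted above can be compared in two ways, and the choice changes the percentage. The short sketch below only does that arithmetic; the constant and function names are illustrative and do not come from any HP tool.

```python
# Latency figures quoted in the text, in nanoseconds.
EIGHT_CELL_EFFECTIVE_NS = 323.0  # HP 9000 Superdome, 8-cell effective latency
LOCAL_MEMORY_NS = 243.0          # HP Integrity server, data set in local memory


def latency_reduction(old_ns: float, new_ns: float) -> float:
    """Fraction by which latency drops, measured against the old value."""
    return (old_ns - new_ns) / old_ns


def latency_excess(old_ns: float, new_ns: float) -> float:
    """Fraction by which the old latency exceeds the new value."""
    return (old_ns - new_ns) / new_ns


if __name__ == "__main__":
    reduction = latency_reduction(EIGHT_CELL_EFFECTIVE_NS, LOCAL_MEMORY_NS)
    excess = latency_excess(EIGHT_CELL_EFFECTIVE_NS, LOCAL_MEMORY_NS)
    print(f"latency reduction relative to 323 ns: {reduction:.1%}")
    print(f"323 ns relative to 243 ns:            {excess:.1%} higher")
```

Measured against the 323 ns figure, the saving is just under 25%. Measured against the 243 ns figure, the difference is just under 33%.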

For parallel applications or multiprocess applications, the situation is somewhat different, but correct behavior is

still assured. Existing HP-UX 11i v1.6 applications will work correctly. To run optimally, some applications may

need to be slightly modified, but most will not. For many applications, the only changes are in the launcher

scripts.

This paper documents the tools and resources available for developers to take maximum advantage of the features

of HP-UX 11i v2.

Competitive situation

Most commercially available multiprocessors implement some variation of a ccNUMA architecture. However, each
vendor has implemented a different interface for exploiting the ccNUMA features of its systems, and there is
not yet a standard way to program these systems.

SGI IRIX

SGI IRIX was the first commercially successful ccNUMA operating system. It was introduced in 1997 with the

Origin2000 system, marking the first instance of a widely used system of this class.

The Origin is organized in nodes. Each node has two processors connected to a hub, which is also connected to
the node’s memory. Each node shares an I/O hub, which connects to peripherals, with one other node. Each node
is also connected to the interconnect network.
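As a rough illustration of the node organization described above, the sketch below models nodes with two processors each and pairs of nodes sharing one I/O hub. The class, the function names, and the consecutive numbering scheme are assumptions made for the example; they are not SGI's own identifiers or numbering.

```python
from dataclasses import dataclass
from typing import List

CPUS_PER_NODE = 2     # from the text: two processors connected to each hub
NODES_PER_IO_HUB = 2  # from the text: a node shares an I/O hub with one other node


@dataclass
class Node:
    node_id: int
    cpu_ids: List[int]  # processors attached to this node's hub
    io_hub_id: int      # I/O hub shared with the neighbouring node


def build_nodes(node_count: int) -> List[Node]:
    """Build nodes with consecutive (hypothetical) CPU and I/O hub numbers."""
    nodes = []
    for n in range(node_count):
        first_cpu = n * CPUS_PER_NODE
        cpus = list(range(first_cpu, first_cpu + CPUS_PER_NODE))
        nodes.append(Node(n, cpus, n // NODES_PER_IO_HUB))
    return nodes


def is_local(nodes: List[Node], cpu_id: int, memory_node_id: int) -> bool:
    """True when the CPU sits on the node that owns the memory."""
    return cpu_id in nodes[memory_node_id].cpu_ids


if __name__ == "__main__":
    for node in build_nodes(4):
        print(node)
```

In this model, a memory access is local only when the requesting processor and the memory belong to the same node; every other access crosses the interconnect.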

SGI has provided two main tools for developers and users: dlook and dplace. The dlook diagnostic tool helps

developers and administrators understand what the application is physically doing on the system in terms of

memory usage, contention, and bandwidth.

The dplace tool is encountered most often, since it is used in scripts and queuing systems to actually launch the

application in an efficient manner, considering current activity on the system and what resources the user wants to

make use of.

SGI Altix 64-bit Linux systems

The SGI Altix is organized the same way as the Origin system. The main difference is the incorporation of the
Intel Itanium 2 processor and the use of Linux as the operating system.

SGI has extended Linux to work efficiently on a ccNUMA multiprocessor. They have also ported and provided

dlook and dplace to aid developers and users in making the most efficient use of the system.

Tru64

On Tru64 UNIX systems, the building blocks that make up a NUMA system are mapped to structures called
Resource Affinity Domains (RADs). A RAD identifies the set of CPUs, memory arrays, and I/O buses that, when
used together, allow the system to work most efficiently.

Starting with Tru64 UNIX version 5.1, the operating system makes a best effort to:

• Schedule all threads of a multithreaded application on CPUs in the same RAD

• Allocate memory for each process or application thread in the same RAD as the CPU where the process or thread is running

The default NUMA-aware algorithms for scheduling and allocating resources to a process or thread work well

when the resources in one RAD can accommodate the number of threads and the memory demands in any one

application.
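The condition described above, that one RAD must be able to hold all of an application's threads and memory, can be illustrated with a toy placement model. This is a sketch under stated assumptions, not the Tru64 UNIX scheduler: the `Rad` class, the first-fit rule, and the capacity numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rad:
    rad_id: int
    free_cpus: int
    free_memory_mb: int


def pick_single_rad(rads: List[Rad], threads: int, memory_mb: int) -> Optional[Rad]:
    """Return the first RAD that can hold every thread and all memory, else None."""
    for rad in rads:
        if rad.free_cpus >= threads and rad.free_memory_mb >= memory_mb:
            return rad
    return None


def place(rads: List[Rad], threads: int, memory_mb: int) -> str:
    """Reserve resources in one RAD if possible; otherwise report a spill."""
    rad = pick_single_rad(rads, threads, memory_mb)
    if rad is None:
        # No single RAD fits, so some accesses would have to be remote.
        return "spans multiple RADs"
    rad.free_cpus -= threads
    rad.free_memory_mb -= memory_mb
    return f"RAD {rad.rad_id}"


if __name__ == "__main__":
    rads = [Rad(0, 4, 8192), Rad(1, 4, 8192)]
    for threads, memory_mb in [(4, 4096), (2, 1024), (4, 1024)]:
        print(threads, memory_mb, "->", place(rads, threads, memory_mb))
```

In the sample run, the first two requests each fit in one RAD, while the third cannot, because no single RAD has four free CPUs left. That is the case where default placement stops being sufficient.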