Process and Interrupt Affinity on Intel® Xeon® Processor E5 Servers with Intel® DDIO Technology
The designs of the new Intel® Xeon® processor E5 servers have introduced some
complexities that, if not accommodated in the application design, may prevent the
application from realizing the full available performance gain. Creating core affinity is
critical in a Direct Data I/O (Intel DDIO)/NUMA environment. Whenever possible,
processes identified as performance-critical should be given affinity with the local
socket connection rather than with remote socket connections. To that end, the most
foolproof method for identifying the local and remote sockets is to refer to the block
diagram of the server platform. This may not always be as easy as it is with the Intel
Software Development Vehicle (SDV), code-named Rose City, where all the PCIe slots
branch off a single socket (CPU1).
The example in Figure 1 shows the I/O riser module from a Dell R720* series server,
an Intel Xeon processor E5 platform, where the riser clearly identifies slot affinity to a
given CPU or socket.
Figure 1. I/O Riser Identifies Slot Affinity to CPU or Socket
Another way to determine socket correlation is to use the APIC ID reported in
/proc/cpuinfo, or possibly the CPU-Z detection engine for Windows*. (This assumes
the APIC tables are implemented as defined in the ACPI 5.0 specification and that the
operating system is able to read and parse the information for application
consumption.) Once the local socket is identified, determine how to distribute the
processes. (Keep in mind that simultaneous multi-threading (SMT) is especially
effective when each thread performs a different type of operation and does so using
under-used CPU cycles.) Then use the Thread Affinity Interface to assign the
processes to the cores as needed.
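
As one concrete illustration (a minimal sketch, not taken from the Thread Affinity
Interface itself), the C program below pins the calling process to a single core using
the Linux sched_setaffinity(2) call. The core number LOCAL_CORE is an assumption
and should be chosen from the "processor"/"physical id" pairs reported in
/proc/cpuinfo for the socket local to the I/O slot.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LOCAL_CORE 2  /* assumption: a core on the socket local to the I/O slot */

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(LOCAL_CORE, &mask);

        /* pid 0 applies the mask to the calling process */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        printf("Pinned to core %d\n", LOCAL_CORE);
        /* performance-critical work placed here now runs on the chosen core */
        return EXIT_SUCCESS;
    }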
One of the characteristics of the NUMA architecture is that each processor has its own
localized memory module that it can access directly, which gives it a distinct
performance advantage. A processor can also access any memory module belonging
to another processor over the QPI link, but that remote access carries a
performance/latency penalty relative to an access on the local memory bus. Optimal
performance therefore requires adherence to the strategies for NUMA optimization
outlined in the Optimizing Software Applications for NUMA whitepaper.
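
One such strategy is local allocation. The following sketch, assuming libnuma is
installed (link with -lnuma), allocates a working buffer on the NUMA node local to the
executing core so that the buffer is reached over the local memory bus rather than the
QPI link; the buffer size is arbitrary.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE (4 * 1024 * 1024)

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA not available on this system\n");
            return EXIT_FAILURE;
        }

        /* find the NUMA node that owns the core we are running on */
        int node = numa_node_of_cpu(sched_getcpu());

        /* allocate the buffer from that node's local memory module */
        char *buf = numa_alloc_onnode(BUF_SIZE, node);
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return EXIT_FAILURE;
        }

        memset(buf, 0, BUF_SIZE);  /* touch pages so they fault in locally */
        printf("Allocated %d bytes on NUMA node %d\n", BUF_SIZE, node);

        numa_free(buf, BUF_SIZE);
        return EXIT_SUCCESS;
    }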
For specifics about how to pin the processes to the chosen cores, see the 82575,
82576, 82598, and 82599 Ethernet Controllers Interrupts Application Note.
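
On Linux, interrupt affinity is commonly set by writing a hexadecimal CPU mask to
/proc/irq/<N>/smp_affinity; the sketch below illustrates only that mechanism, and the
IRQ number (24) and mask (0x4, that is, core 2) are assumptions. The actual IRQ for a
given Ethernet port should be read from /proc/interrupts, the write requires root
privileges, and the application note above remains the authoritative procedure.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path = "/proc/irq/24/smp_affinity"; /* hypothetical IRQ 24 */
        FILE *f = fopen(path, "w");

        if (f == NULL) {
            perror(path);  /* typically requires root */
            return EXIT_FAILURE;
        }
        fprintf(f, "4\n");  /* mask 0x4: deliver this IRQ to core 2 only */
        fclose(f);
        return EXIT_SUCCESS;
    }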
§ §