Process and Interrupt Affinity on Intel® Xeon® Processor E5 Servers with Intel® DDIO Technology
The designs of the new Intel® Xeon® processor E5 servers have introduced some
complexities that, if not accommodated in the application design, may prevent the
application from realizing the full available performance gain. Creating core affinity is
critical in a Direct Data I/O (Intel DDIO)/NUMA environment. Whenever possible,
processes identified as performance-critical should be given affinity with the local
socket connection rather than with remote socket connections. To that end, the most
foolproof method for identifying the local and remote sockets is to refer to the block
diagram of the server platform. This may not always be as easy as it is with the Intel
Software Development Vehicle (SDV), code-named Rose City, where all the PCIe slots
branch off a single socket (CPU1).
The example in Figure 1 shows the I/O riser module from a Dell R720* series server,
an Intel Xeon processor E5 platform, where the riser clearly identifies slot affinity to a
given CPU or socket.
Figure 1. I/O Riser Identifies Slot Affinity to CPU or Socket
Another way to determine socket correlation is to use the APIC ID reported in
/proc/cpuinfo, or possibly the CPU-Z detection engine for Windows*. (This assumes
the APIC tables are implemented as defined in the ACPI 5.0 specification and that the
operating system is able to read and parse the information for application
consumption.) Once the local socket is identified, determine how to distribute the
processes. (Keep in mind that simultaneous multi-threading (SMT) is especially
effective when each thread performs a different type of operation and does so using
under-used CPU cycles.) Then use the Thread Affinity Interface to assign the
processes to the cores as needed.
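
As one concrete illustration (a minimal sketch, not taken from the Thread Affinity
Interface itself), the C program below pins the calling process to a single core using
the Linux sched_setaffinity(2) call. The core number LOCAL_CORE is an assumption
and should be chosen from the "processor"/"physical id" pairs reported in
/proc/cpuinfo for the socket local to the I/O slot.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LOCAL_CORE 2  /* assumption: a core on the socket local to the I/O slot */

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(LOCAL_CORE, &mask);

        /* pid 0 applies the mask to the calling process */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        printf("Pinned to core %d\n", LOCAL_CORE);
        /* performance-critical work placed here now runs on the chosen core */
        return EXIT_SUCCESS;
    }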
One of the characteristics of the NUMA architecture is that each processor has its own
localized memory module that it can access directly, which gives it a distinct
performance advantage. A processor can also access any memory module belonging
to another processor over the QPI link, but that remote access carries a
performance/latency penalty relative to an access on the local memory bus. Optimal
performance therefore requires adherence to the strategies for NUMA optimization
outlined in the Optimizing Software Applications for NUMA whitepaper.
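
One such strategy is local allocation. The following sketch, assuming libnuma is
installed (link with -lnuma), allocates a working buffer on the NUMA node local to the
executing core so that the buffer is reached over the local memory bus rather than the
QPI link; the buffer size is arbitrary.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE (4 * 1024 * 1024)

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA not available on this system\n");
            return EXIT_FAILURE;
        }

        /* find the NUMA node that owns the core we are running on */
        int node = numa_node_of_cpu(sched_getcpu());

        /* allocate the buffer from that node's local memory module */
        char *buf = numa_alloc_onnode(BUF_SIZE, node);
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return EXIT_FAILURE;
        }

        memset(buf, 0, BUF_SIZE);  /* touch pages so they fault in locally */
        printf("Allocated %d bytes on NUMA node %d\n", BUF_SIZE, node);

        numa_free(buf, BUF_SIZE);
        return EXIT_SUCCESS;
    }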
For specifics about how to pin the processes to the chosen cores, see the 82575,
82576, 82598, and 82599 Ethernet Controllers Interrupts Application Note.
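
On Linux, interrupt affinity is commonly set by writing a hexadecimal CPU mask to
/proc/irq/<N>/smp_affinity; the sketch below illustrates only that mechanism, and the
IRQ number (24) and mask (0x4, that is, core 2) are assumptions. The actual IRQ for a
given Ethernet port should be read from /proc/interrupts, the write requires root
privileges, and the application note above remains the authoritative procedure.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *path = "/proc/irq/24/smp_affinity"; /* hypothetical IRQ 24 */
        FILE *f = fopen(path, "w");

        if (f == NULL) {
            perror(path);  /* typically requires root */
            return EXIT_FAILURE;
        }
        fprintf(f, "4\n");  /* mask 0x4: deliver this IRQ to core 2 only */
        fclose(f);
        return EXIT_SUCCESS;
    }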
§ §