Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems Application Note Publication # 40555 Issue Date: June 2006 Revision: 3.
© 2006 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice.
Contents
Revision History
Chapter 1 Introduction
1.1 Related Documents
A.2.1 What Resources Are Used When a Single Read-Only or Write-Only Thread Accesses Remote Data?
A.2.2 What Resources Are Used When Two Write-only Threads Fire at Each Other (Crossfire) on an Idle System?
A.2.3 What Role Do Buffers Play in the Throughput Observed?
List of Figures
Figure 1. Quartet Topology
Figure 2. Internal Resources Associated with a Quartet Node
Figure 3. Write-Only Thread Running on Node 0, Accessing Data from 0, 1 and 2 Hops Away on an Idle System
Revision History
Date        Revision   Description
June 2006   3.00       Initial release.
Chapter 1 Introduction
The AMD Athlon™ 64 and AMD Opteron™ family of single-core and dual-core multiprocessor systems is based on the cache-coherent Non-Uniform Memory Access (ccNUMA) architecture.
bandwidth test, it exercises both of these modes of operation. The test serves as a latency-sensitive test case when the test threads perform read-only operations and as a bandwidth-sensitive test when the test threads carry out write-only operations.
[12] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_heapmm.asp
[13] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp
[14] http://msdn2.microsoft.com/en-us/library/tt15eb9t.aspx
[15] https://www.pathscale.com/docs/UserGuide.pdf
[16] http://docs.sun.com/source/819-3688/parallel.
Chapter 2 Experimental Setup
This chapter describes the experimental environment in which the performance study was carried out: the hardware configuration and the software test framework used.
2.1 System Used
All experiments and analysis discussed in this application note were performed on a Quartet system having four 2.
[Figure 1. Quartet Topology: nodes N0, N1, N2 and N3 connected by four links]
The term hop is commonly used to describe access distances on NUMA systems. When a thread accesses memory on the same node as that on which it is running, it is a 0-hop access or local access. If a thread is running on one node but accessing memory that is resident on a different node, the access is a remote access.
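The hop count between a thread's node and the node holding its data can be sketched as a shortest-path search over the node links. The wiring below is an assumption, not taken from Figure 1 itself: it is a square (N0-N1, N0-N2, N1-N3, N2-N3) chosen to be consistent with the hop counts printed on the graph labels later in this document (node 0 to node 1 or 2 is 1 hop, node 1 to node 3 is 1 hop, node 0 to node 3 is 2 hops). The names `QUARTET_LINKS` and `hop_distance` are illustrative only.

```python
from collections import deque

# Assumed Quartet wiring (a square): N0-N1, N0-N2, N1-N3, N2-N3.
QUARTET_LINKS = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}


def hop_distance(thread_node, memory_node, links=QUARTET_LINKS):
    """Return the minimum number of links between two nodes (breadth-first search)."""
    if thread_node == memory_node:
        return 0  # local (0-hop) access
    seen = {thread_node}
    queue = deque([(thread_node, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in links[node]:
            if neighbor == memory_node:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    raise ValueError("nodes are not connected")
```

Under this assumed wiring, `hop_distance(0, 0)` is a local access, `hop_distance(0, 1)` is a 1-hop remote access, and `hop_distance(0, 3)` is a 2-hop remote access.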
[Figure 2. Internal Resources Associated with a Quartet Node: cores C0 and C1; three HyperTransport™ (HT) technology links, each 4 GB/s per direction @ 2 GHz data rate]
From the perspective of the MCT, a memory request may come from either the local core or from another core over a coherent HyperTransport link.
resources approach saturation. The test has two modes: read-only and write-only. When the test threads are read-only, the throughput does not stress the capacity of the system resources and, thus, the test is more sensitive to latency. However, when the threads are write-only, there is a heavy throughput load on the system. This is described in detail in later sections of this document.
characterization of the resource behavior in the system. These recommendations, coupled with these interesting cases, provide an understanding of the low-level behavior of the system, which is crucial to the analysis of larger real-world workloads.
2.3 Reading and Interpreting Test Graphs
Figure 3 below shows one of the graphs that will be discussed in detail later.
2.3.2 Labels Used
Each of the bars on the graph is labeled with the hop information for the thread.
2.3.3 Y-Axis Display
For the one-thread test cases on the idle system, the graphs show the time taken by a single thread, normalized to the time taken by the fastest single-thread case, which here is the time it takes a read-only thread to do local accesses on an idle system.
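The normalization used for the Y-axis can be sketched as follows. This is a minimal illustration, not the test framework's code: the function name `normalize_times` and the timings in `example` are hypothetical placeholders, not measured results.

```python
def normalize_times(times):
    """Scale each time so that the fastest case equals 100 (percent)."""
    fastest = min(times.values())
    return {label: round(100.0 * t / fastest) for label, t in times.items()}


# Hypothetical timings in seconds, for illustration only (not measured data).
example = {"read-local": 2.0, "write-local": 2.5, "write-remote": 3.0}
normalized = normalize_times(example)
# normalized == {"read-local": 100, "write-local": 125, "write-remote": 150}
```

A bar labeled 125% on such a graph therefore means the case took 1.25 times as long as the fastest case.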
Chapter 3 Analysis and Recommendations
This chapter lays out recommendations for developers. Several of these recommendations are accompanied by empirical results collected from test cases, with analysis where applicable.
3.1.2 Multiple Threads-Shared Data
When scheduling multiple threads that share data on an idle system, it is preferable to schedule the threads on both cores of an idle node first, then on both cores of the next idle node, and so on. In other words, schedule using core-major order first, followed by node-major order.
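The scheduling order described above can be sketched as a simple enumeration of (node, core) slots. This sketch assumes every node is idle, that nodes are visited in numeric order, and that the system has four dual-core nodes; the function names are illustrative and do not correspond to any OS scheduler API.

```python
def core_major_order(num_nodes, cores_per_node):
    """List (node, core) slots, filling every core of a node before the next node."""
    return [(node, core)
            for node in range(num_nodes)
            for core in range(cores_per_node)]


def place_threads(num_threads, num_nodes=4, cores_per_node=2):
    """Assign threads that share data to slots in core-major, then node-major order."""
    slots = core_major_order(num_nodes, cores_per_node)
    if num_threads > len(slots):
        raise ValueError("more threads than cores")
    return slots[:num_threads]
```

For three threads sharing data, `place_threads(3)` yields `[(0, 0), (0, 1), (1, 0)]`: both cores of node 0 are used before any core on node 1.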
distance. If they are indirectly connected to each other in a 4P configuration, it is considered a 2-hop access distance. The following example, extracted from mining the results of the synthetic test case, substantiates the recommendation to keep data local.
[Figure 5. Bar chart of time for write. Bar labels: 0.0.w.0, 0.0.w.1, 0.0.w.2, 0.0.w. Hop annotations: 0 Hop, 1 Hop, 1 Hop, 2 Hop. Values shown: 113%, 127%, 129%, 149%.]
A ccNUMA-aware OS keeps data local on the node where first-touch occurs as long as there is enough physical memory available on that node. If enough physical memory is not available on the node, then various advanced techniques are used to determine where to place the data, depending on the OS. Once placed on a node by first touch, data normally resides on that node for its lifetime.
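The first-touch behavior can be illustrated with a toy model. This is a sketch under stated assumptions, not how any particular OS implements placement: the class `FirstTouchMemory` is hypothetical, and the fallback when a node is full (pick the node with the most free pages) merely stands in for the OS-specific techniques mentioned above.

```python
class FirstTouchMemory:
    """Toy model: a page is placed on the node of the first thread that touches it."""

    def __init__(self, free_pages_per_node):
        self.free = dict(free_pages_per_node)  # node -> number of free pages
        self.home = {}                         # page -> node that holds it

    def touch(self, page, thread_node):
        """Return the node holding the page, placing the page on first touch."""
        if page in self.home:
            return self.home[page]  # already placed: the page stays where it is
        node = thread_node
        if self.free[node] == 0:
            # Assumed fallback (real policies are OS-specific): most free pages wins.
            node = max(self.free, key=self.free.get)
            if self.free[node] == 0:
                raise MemoryError("no free pages on any node")
        self.free[node] -= 1
        self.home[page] = node
        return node
```

In this model, a page first touched by a thread on node 0 stays on node 0 even if a thread on node 1 later accesses it, which is why the initializing thread determines data placement.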
afterwards no longer needs the data structure and if only one of the worker threads needs the data structure. In other words, the data structure is not truly shared between the worker threads. It is best in this case to use a data initialization scheme that avoids incorrect data placement due to first touch.
SPECjbb2005 was run using the NUMA tools provided by Linux® to measure the performance improvement with node interleaving. The results were obtained on the same internal 4P Quartet system used for the synthetic tests. 3.
• Threads firing at each other (crossfire)
The first thread runs on node 0 and writes to memory on node 1 (1 hop). The second thread runs on node 1 and writes to memory on node 0 (1 hop). In each case, the two threads are run on core 0 of whichever node they are running on. The system is left idle except for the two threads.
Here the same two foreground threads as before were run through the same cases as before: local, crossfire and no crossfire.
[Figure 8. Bar chart, VERY HIGH: Total Time for both threads (write-write). Thread pairs: 0.0.w.0 and 1.0.w.1 (0 Hops, 0 Hops); 0.0.w.1 and 1.0.w.3 (1 Hops, 1 Hops, NO Xfire); 0.0.w.1 and 1.0.w.0 (1 Hops, 1 Hops, Xfire). Values shown: 158%, 186%, 195%.]
[Figure 9. Bar chart, VERY HIGH: Total Time for both threads (write-write). Thread pairs: 0.0.w.0 and 1.0.w.1 (0 Hops, 0 Hops); 0.0.w.1 and 1.0.w.3 (1 Hops, 1 Hops, NO Xfire); 0.0.w.1 and 1.0.w.0 (1 Hops, 1 Hops, Xfire). Values shown: 156%, 202%, 216%.]
This analogy clearly communicates the performance effects of queuing time versus latency.
However, as shown in Figure 11, when both threads are write-only, the 0 hop-1 hop and 0 hop-2 hop cases are faster than the 0 hop-0 hop case.
[Bar chart: Total Time for both threads (write-write). Thread pairs: 0.0.w.0 with 0.1.w.0, 0.1.w.1, 0.1.w.2 and 0.1.w. Hop annotations: 0 Hop-0 Hop, 0 Hop-1 Hop, 0 Hop-1 Hop, 0 Hop-2 Hop. Values shown: 125%, 126%, 136%, 147%.]
In addition, three background threads are running on nodes 1, 2 and 3. Each of these background threads accesses data locally. The rate of memory demand by each of these threads is varied simultaneously from low to medium to high to very high as shown in Table 1.
[Figure 13. Bar chart, Medium: Total Time for both threads (write-write). Thread pairs: 0.0.w.0 and 0.1.w.0 (0 Hops, 0 Hops); 0.0.w.0 and 0.1.w.1 (0 Hops, 1 Hops); 0.0.w.0 and 0.1.w.2 (0 Hops, 1 Hops); 0.0.w.0 and 0.1.w.3 (0 Hops, 2 Hops). Values shown: 129%, 129%, 139%, 146%.]
[Figure 15. Bar chart, Very High: Total Time for both threads (write-write). Thread pairs: 0.0.w.0 and 0.1.w.0 (0 Hops, 0 Hops); 0.0.w.0 and 0.1.w.1 (0 Hops, 1 Hops); 0.0.w.0 and 0.1.w.2 (0 Hops, 1 Hops); 0.0.w.0 and 0.1.w.3 (0 Hops, 2 Hops). Values shown: 147%, 158%, 158%, 169%.]
3.6 Parallelism Exposed by Compilers on AMD ccNUMA Multiprocessor Systems
Several compilers for AMD multiprocessor systems provide additional hooks to allow automatic parallelization of otherwise serial programs. Several compilers also support the OpenMP API for parallel programming.
Chapter 4 Conclusions
The single most important recommendation for most applications is to keep data local to the node where it is being accessed. As long as a thread initializes the data it needs, in other words writes to it for the first time, a ccNUMA-aware OS will typically keep the data local on the node where the thread runs.
Data placement tools can also come in handy when a thread needs more data than the amount of physical memory available on a node. Certain OSs also allow data migration with these tools or APIs. Using this feature, data can be migrated from the node where it was first touched to the node where it is subsequently accessed.
Appendix A
The following sections provide additional explanatory information on topics discussed in the previous sections of this document.
A.1 Description of the Buffer Queues
Figure 16 shows the internal resources in each Quartet node.
Likewise, packets to be transmitted from the MCT to the XBar are queued in the “MCT-to-XBar” buffers. The buffers in the SRI, XBar and MCT can be viewed as staggered queues on the various units. A.
4.4 GB/s necessary. The two coherent HyperTransport links are loaded at 3.5 GB/s each. Thus the utilization of each of the two coherent HyperTransport links that connect node 0 and node 1 is about 87% (3.5 ÷ 4 = 0.875). A.2.
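The utilization arithmetic above can be checked with a short calculation. This is only a worked example of the division in the text: the 3.5 GB/s load and the 4 GB/s divisor come from the passage, while the function name `link_utilization` is illustrative.

```python
def link_utilization(load_gb_s, capacity_gb_s):
    """Return the fraction of a link's capacity consumed by the offered load."""
    if capacity_gb_s <= 0:
        raise ValueError("capacity must be positive")
    return load_gb_s / capacity_gb_s


# 3.5 GB/s of traffic on a link that carries 4 GB/s per direction.
print(f"{link_utilization(3.5, 4.0):.1%}")  # prints "87.5%"
```

The exact ratio is 87.5%, which the text reports as 87%.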
A.3
A.5 Why Is 0 Hop-1 Hop Case Slower Than 0 Hop-0 Hop Case on a System under High Background Load (High Subscription) for Write-Only Threads?
When a 0 hop-0 hop scenario is subjected to a very high background load, the system sees the following traffic pattern, where each node gets memory requests from the threads as described:
• Node 0: 2 foreground threads.
A.7 Tools and APIs for Thread/Process and Memory Placement (Affinity) for AMD64 ccNUMA Multiprocessor Systems
The following sections discuss tools and APIs available for assigning thread/process and memory affinity under various operating systems.
A.7.
Controlling Memory Affinity
Both numactl and libnuma library functions can be used to set memory affinity [5]. Memory affinity set by tools like numactl applies to all the data accessed by the entire program (including child processes). Memory affinity set by libnuma or other library functions can be made to apply only to specific data as determined by the program.
The function to set memory affinity for a thread is VirtualAlloc() [9]. This function gives the developer the choice to bind memory immediately on allocation or to defer binding until first touch.
A.8.4 Node Interleaving Configuration in the BIOS
AMD Opteron™ and Athlon™ 64 ccNUMA multiprocessor systems can be configured in the BIOS to interleave all memory across all nodes on a page basis (4 KB for regular pages and 2 MB for large pages).
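Page-granular node interleaving can be sketched as a mapping from a physical address to a node. The round-robin order used here (consecutive pages on consecutive nodes) is an assumption made for illustration; the text states only that memory is interleaved across all nodes on a page basis. The function name and the default of four nodes are likewise illustrative.

```python
PAGE_SIZE_REGULAR = 4 * 1024       # 4 KB regular page
PAGE_SIZE_LARGE = 2 * 1024 * 1024  # 2 MB large page


def interleaved_node(address, num_nodes=4, page_size=PAGE_SIZE_REGULAR):
    """Assumed round-robin mapping: consecutive pages land on consecutive nodes."""
    return (address // page_size) % num_nodes
```

With 4 KB pages on four nodes, addresses 0 through 4095 map to node 0, the next page maps to node 1, and the fifth page wraps back to node 0, so a large array is spread evenly across all memory controllers.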