Computer Hardware User's Manual

Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems

Application Note

Publication # 40555
Revision: 3.00
Issue Date: June 2006

Summary of content (48 pages)

PAGE 1
Performance Guidelines for AMD Athlon™ 64 and AMD Opteron™ ccNUMA Multiprocessor Systems Application Note Publication # 40555 Issue Date: June 2006 Revision: 3.
PAGE 2
© 2006 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice.
PAGE 3
Contents

Revision History . . . 7
Chapter 1 Introduction . . . 9
1.1 Related Documents . . .
PAGE 4
A.2.1 What Resources Are Used When a Single Read-Only or Write-Only Thread Accesses Remote Data? . . . 40
A.2.2 What Resources Are Used When Two Write-Only Threads Fire at Each Other (Crossfire) on an Idle System? . . . 40
A.2.3 What Role Do Buffers Play in the Throughput Observed? . . .
PAGE 5
List of Figures

Figure 1. Quartet Topology . . . 14
Figure 2. Internal Resources Associated with a Quartet Node . . . 15
Figure 3. Write-Only Thread Running on Node 0, Accessing Data from 0, 1 and 2 Hops Away on an Idle System . . .
PAGE 7
Revision History

Date         Revision   Description
June 2006    3.00       Initial release.
PAGE 9
Chapter 1 Introduction

The AMD Athlon™ 64 and AMD Opteron™ family of single-core and dual-core multiprocessor systems are based on the cache coherent Non-Uniform Memory Access (ccNUMA) architecture.
PAGE 10
bandwidth test, it exercises both of these modes of operation. The test serves as a latency sensitive test case when the test threads perform read-only operations and as a bandwidth sensitive test when the test threads carry out write-only operations.
PAGE 11
[12] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_heapmm.asp
[13] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/low_fragmentation_heap.asp
[14] http://msdn2.microsoft.com/en-us/library/tt15eb9t.aspx
[15] https://www.pathscale.com/docs/UserGuide.pdf
[16] http://docs.sun.com/source/819-3688/parallel.
PAGE 13
Chapter 2 Experimental Setup

This chapter describes the experimental environment in which the following performance study was carried out, including the hardware configuration and the software test framework used.

2.1 System Used

All experiments and analysis discussed in this application note were performed on a Quartet system having four 2.
PAGE 14
[Figure 1. Quartet Topology: nodes N0, N1, N2 and N3 connected by four links]

The term hop is commonly used to describe access distances on NUMA systems. When a thread accesses memory on the same node as that on which it is running, it is a 0-hop access or local access. If a thread is running on one node but accessing memory that is resident on a different node, the access is a remote access.
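The hop terminology can be illustrated with a short sketch that counts the links between two nodes. The link list below is an assumption, not taken from Figure 1 itself: it models the Quartet as a square in which N0-N3 and N1-N2 are not directly connected, which is consistent with the figure labels later in this note (node 0 to node 3 is shown as 2 hops, node 1 to node 3 as 1 hop). The names `QUARTET_LINKS` and `hop_count` are hypothetical.

```python
from collections import deque

# Assumed link list for the four-node Quartet: a square where
# N0-N3 and N1-N2 have no direct link (illustrative reading of Figure 1).
QUARTET_LINKS = [(0, 1), (0, 2), (1, 3), (2, 3)]


def hop_count(src, dst, links=QUARTET_LINKS):
    """Return the minimum number of links between src and dst (breadth-first search)."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops  # 0 means a local (0-hop) access
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    raise ValueError(f"no path from node {src} to node {dst}")
```

Under this assumed topology, a thread on node 0 touching memory on node 0 is a 0-hop access, node 1 or node 2 is 1 hop away, and node 3 is 2 hops away.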
PAGE 15
[Figure 2. Internal Resources Associated with a Quartet Node: cores C0 and C1; HyperTransport™ technology (HT) links, each labeled 4 GB/s per direction @ 2 GHz data rate]

From the perspective of the MCT, a memory request may come from either the local core or from another core over a coherent HyperTransport link.
PAGE 16
resources approach saturation. The test has two modes: read-only and write-only. When the test threads are read-only, the throughput does not stress the capacity of the system resources and, thus, the test is more sensitive to latency. However, when the threads are write-only, there is a heavy throughput load on the system. This is described in detail in later sections of this document.
PAGE 17
characterization of the resource behavior in the system. These recommendations, coupled with these interesting cases, provide an understanding of the low-level behavior of the system, which is crucial to the analysis of larger real-world workloads.

2.3 Reading and Interpreting Test Graphs

Figure 3 below shows one of the graphs that will be discussed in detail later.
PAGE 18
2.3.2 Labels Used

Each of the bars on the graph is labeled with the hop information for the thread.

2.3.3 Y-Axis Display

For the one-thread test cases on the idle system, the graphs show the time taken by a single thread, normalized to the time taken by the fastest single-thread case: in this case, the time it takes a read-only thread to do local accesses on an idle system.
PAGE 19
Chapter 3 Analysis and Recommendations

This section lays out recommendations to developers. Several of these recommendations are accompanied by empirical results collected from test cases with analysis, as applicable.
PAGE 20
3.1.2 Multiple Threads-Shared Data

When scheduling multiple threads that share data on an idle system, it is preferable to schedule the threads on both cores of an idle node first, then on both cores of the next idle node, and so on. In other words, schedule using core major order first followed by node major order.
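A minimal sketch of the core major ordering described above, assuming a four-node system with two cores per node. The function names `core_major_order` and `place_threads` are hypothetical, and the sketch only computes the placement order; it does not bind threads to cores.

```python
def core_major_order(num_nodes, cores_per_node):
    """List (node, core) slots, filling every core of a node before moving to the next node."""
    return [(node, core)
            for node in range(num_nodes)
            for core in range(cores_per_node)]


def place_threads(num_threads, num_nodes=4, cores_per_node=2):
    """Assign threads that share data to slots in core major order (assumed 4 nodes x 2 cores)."""
    slots = core_major_order(num_nodes, cores_per_node)
    if num_threads > len(slots):
        raise ValueError("more threads than available cores")
    return slots[:num_threads]
```

With three threads that share data, this ordering uses both cores of node 0 and then core 0 of node 1, so two of the three threads sit on the same node.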
PAGE 21
distance. If they are indirectly connected to each other in a 4P configuration, it is considered a 2-hop access distance. The following example, extracted from the results of the synthetic test case, substantiates the recommendation to keep data local.
PAGE 22
[Figure 5. Chart "Time for write": bars 0.0.w.0, 0.0.w.1, 0.0.w.2 and 0.0.w. with hop annotations 0 Hop, 1 Hop, 1 Hop and 2 Hop; data labels 113%, 127%, 129% and 149%]
PAGE 23
A ccNUMA-aware OS keeps data local on the node where first-touch occurs as long as there is enough physical memory available on that node. If enough physical memory is not available on the node, then various advanced techniques are used to determine where to place the data, depending on the OS. Data once placed on a node due to first touch normally resides on that node for its lifetime.
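The first-touch behavior described above can be modeled with a toy simulation. This is only a sketch: the class `FirstTouchMemory`, its per-node page capacities, and the fallback rule (pick the node with the most free pages) are all assumptions made for illustration, since the actual fallback technique depends on the OS.

```python
class FirstTouchMemory:
    """Toy model of first-touch page placement (not an OS implementation)."""

    def __init__(self, free_pages_per_node):
        # free_pages_per_node: {node: free page count}; capacities are hypothetical.
        self.free = dict(free_pages_per_node)
        self.home = {}  # page id -> node where the page was placed

    def touch(self, page, node):
        """Return the node holding `page`, placing it on first touch."""
        if page in self.home:
            # Already placed: the page normally stays on its node for its lifetime.
            return self.home[page]
        if self.free.get(node, 0) > 0:
            target = node  # enough memory on the touching node: keep the data local
        else:
            # Assumed fallback: the node with the most free pages.
            target = max(self.free, key=self.free.get)
            if self.free[target] == 0:
                raise MemoryError("no free pages on any node")
        self.free[target] -= 1
        self.home[page] = target
        return target
```

In this model, a page first touched by a thread on node 0 stays on node 0 even if a thread on node 1 accesses it later, which is the situation the initialization advice in this chapter is meant to avoid.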
PAGE 24
afterwards no longer needs the data structure and if only one of the worker threads needs the data structure. In other words, the data structure is not truly shared between the worker threads. It is best in this case to use a data initialization scheme that avoids incorrect data placement due to first touch.
PAGE 25
SPECjbb2005 was run using the NUMA tools provided by Linux® to measure the performance improvement with node interleaving. The results were obtained on the same internal 4P Quartet system used for the synthetic tests. 3.
PAGE 26
• Threads firing at each other (crossfire)
The first thread runs on node 0 and writes to memory on node 1 (1 hop). The second thread runs on node 1 and writes to memory on node 0 (1 hop). In each case, the two threads are run on core 0 of whichever node they are running on. The system is left idle except for the two threads.
PAGE 27
Here the same two foreground threads as before were run through the same cases as before: local, crossfire and no crossfire.
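The three two-thread cases can be told apart from where each thread runs and where it writes. The sketch below is illustrative: `classify_pair` is a hypothetical helper, each thread is represented as a `(run_node, memory_node)` tuple, and any remote pattern that is not a mutual exchange is treated as "no crossfire", which is a simplifying assumption.

```python
def classify_pair(thread_a, thread_b):
    """Classify two threads, each given as (run_node, memory_node)."""
    (run_a, mem_a), (run_b, mem_b) = thread_a, thread_b
    if run_a == mem_a and run_b == mem_b:
        return "local"          # both threads access memory on their own node
    if mem_a == run_b and mem_b == run_a:
        return "crossfire"      # each thread writes to the other thread's node
    return "no crossfire"       # remote access without a mutual exchange (assumed catch-all)
```

For example, a thread on node 0 writing to node 1 paired with a thread on node 1 writing to node 0 is the crossfire case, while pairing it with a thread on node 1 writing to node 3 is the no-crossfire case.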
PAGE 28
[Figure 8. Chart "VERY HIGH: Total Time for both threads (write-write)": thread pairs 0.0.w.0 and 1.0.w.1 (0 Hop, 0 Hop); 0.0.w.1 and 1.0.w.3 (1 Hop, 1 Hop, no crossfire); 0.0.w.1 and 1.0.w.0 (1 Hop, 1 Hop, crossfire); data labels 186%, 158% and 195%]
PAGE 29
[Figure 9. Chart "VERY HIGH: Total Time for both threads (write-write)": thread pairs 0.0.w.0 and 1.0.w.1 (0 Hop, 0 Hop); 0.0.w.1 and 1.0.w.3 (1 Hop, 1 Hop, no crossfire); 0.0.w.1 and 1.0.w.0 (1 Hop, 1 Hop, crossfire); data labels 216%, 202% and 156%]
PAGE 30
This analogy clearly communicates the performance effects of queuing time versus latency.
PAGE 31
However, as shown in Figure 11 on page 31, when both threads are write-only, the 0 hop-1 hop and 0 hop-2 hop cases are faster than the 0 hop-0 hop case.

[Chart "Total Time for both threads (write-write)": first thread 0.0.w.0 in every pair; second thread 0.1.w.0, 0.1.w.1, 0.1.w.2 or 0.1.w.; data labels 147%, 126%, 125% and 136%]
PAGE 32
In addition, three background threads are running on nodes 1, 2 and 3. Each of these background threads accesses data locally. The rate of memory demand by each of these threads is varied simultaneously from low to medium to high to very high as shown in Table 1 on page 16.
PAGE 33
[Figure 13. Chart "Medium: Total Time for both threads (write-write)": thread pairs 0.0.w.0 and 0.1.w.0 (0 Hops, 0 Hops); 0.0.w.0 and 0.1.w.1 (0 Hops, 1 Hop); 0.0.w.0 and 0.1.w.2 (0 Hops, 1 Hop); 0.0.w.0 and 0.1.w.3 (0 Hops, 2 Hops); data labels 146%, 129%, 129% and 139%]
PAGE 34
[Figure 15. Chart "Very High: Total Time for both threads (write-write)": thread pairs 0.0.w.0 and 0.1.w.0 (0 Hops, 0 Hops); 0.0.w.0 and 0.1.w.1 (0 Hops, 1 Hop); 0.0.w.0 and 0.1.w.2 (0 Hops, 1 Hop); 0.0.w.0 and 0.1.w.3 (0 Hops, 2 Hops); data labels 158%, 158%, 147% and 169%]
PAGE 35
3.6 Parallelism Exposed by Compilers on AMD ccNUMA Multiprocessor Systems

Several compilers for AMD multiprocessor systems provide additional hooks to allow automatic parallelization of otherwise serial programs. Several compilers also support the OpenMP API for parallel programming.
PAGE 37
Chapter 4 Conclusions

The single most important recommendation for most applications is to keep data local on the node where it is being accessed. As long as a thread initializes the data it needs, in other words writes to it for the first time, a ccNUMA-aware OS will typically keep the data local on the node where the thread runs.
PAGE 38
Data placement tools can also come in handy when a thread needs more data than the amount of physical memory available on a node. Certain OSs also allow data migration with these tools or APIs. Using this feature, data can be migrated from the node where it was first touched to the node where it is subsequently accessed.
PAGE 39
Appendix A

The following sections provide additional explanatory information on topics discussed in the previous sections of this document.

A.1 Description of the Buffer Queues

Figure 16 shows the internal resources in each Quartet node.
PAGE 40
Likewise packets to be transmitted from the MCT to the XBar are queued in the “MCT-to-XBar” buffers. The buffers in the SRI, XBar and MCT can be viewed as staggered queues on the various units. A.
PAGE 41
4.4 GB/s necessary. The two coherent HyperTransport links are loaded at 3.5 GB/s each. Thus the utilization of each of the two coherent HyperTransport links that connect node 0 and node 1 equals 87% (3.5÷4). A.2.
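The link utilization quoted above is a simple ratio of offered load to link capacity. The sketch below reproduces that arithmetic; the helper name `link_utilization` is hypothetical, and the 4 GB/s default is taken from the divisor in the text (3.5÷4).

```python
def link_utilization(load_gb_s, capacity_gb_s=4.0):
    """Fraction of one link direction in use: offered load divided by capacity."""
    if capacity_gb_s <= 0:
        raise ValueError("capacity must be positive")
    return load_gb_s / capacity_gb_s


# 3.5 GB/s on a 4 GB/s link is 0.875, which the text reports as 87%.
utilization = link_utilization(3.5)
headroom_gb_s = 4.0 - 3.5  # bandwidth left on the link before saturation
```

The same ratio can be applied to any other resource in the path, provided its capacity is known.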
PAGE 42
A.3
PAGE 43
A.5 Why Is 0 Hop-1 Hop Case Slower Than 0 Hop-0 Hop Case on a System under High Background Load (High Subscription) for Write-Only Threads?

When a 0 hop-0 hop scenario is subjected to a very high background load, the system sees the following traffic pattern, where each node gets memory requests from the threads as described:
• Node 0: 2 foreground threads.
PAGE 44
A.7 Tools and APIs for Thread/Process and Memory Placement (Affinity) for AMD64 ccNUMA Multiprocessor Systems

The following sections discuss tools and APIs available for assigning thread/process and memory affinity under various operating systems. A.7.
PAGE 45
Controlling Memory Affinity

Both numactl and libnuma library functions can be used to set memory affinity[5]. Memory affinity set by tools like numactl applies to all the data accessed by the entire program (including child processes). Memory affinity set by libnuma or other library functions can be made to apply only to specific data as determined by the program.
PAGE 46
The function to set memory affinity for a thread is VirtualAlloc()[9]. This function gives the developer the choice to bind memory immediately on allocation or to defer binding until first touch.
PAGE 47
A.8.4 Node Interleaving Configuration in the BIOS

AMD Opteron™ and Athlon™ 64 ccNUMA multiprocessor systems can be configured in the BIOS to interleave all memory across all nodes on a page basis (4 KB for regular pages and 2 MB for large pages).
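Page-granular interleaving can be pictured as a mapping from physical address to node. The sketch below assumes a plain round-robin assignment of consecutive pages to nodes on a four-node system; the actual mapping used by the BIOS is not described here, and `interleaved_node`, `PAGE_4K` and `PAGE_2M` are hypothetical names.

```python
PAGE_4K = 4 * 1024          # regular page size in bytes
PAGE_2M = 2 * 1024 * 1024   # large page size in bytes


def interleaved_node(phys_addr, num_nodes=4, page_size=PAGE_4K):
    """Node holding phys_addr if consecutive pages rotate across nodes (assumed round-robin)."""
    if phys_addr < 0:
        raise ValueError("address must be non-negative")
    return (phys_addr // page_size) % num_nodes
```

Under this assumption, four consecutive 4 KB pages land on nodes 0, 1, 2 and 3, so a thread streaming through a large buffer spreads its accesses evenly across all memory controllers instead of concentrating them on one node.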