Specifications
April 2012 v1 AMD Opteron™ 6200 Linux Tuning Guide
GCC 4.6.0 or later can deliver at least 20% better STREAM performance, but these later versions
are unlikely to be included in Linux distributions today. The AMD Open64 compiler, however, can
deliver 70% better STREAM performance.
2.6 High Performance STREAM Using AMD Open64 Compiler
Build STREAM using the Open64 compiler when attempting to measure the best achievable memory
performance.
• Download and install the AMD Open64 compiler.
- Download the Open64 compiler from http://developer.amd.com/tools/open64/pages/default.aspx and
follow the installation instructions.
• AMD Open64 compiler flags.
- Use AMD Open64 compiler version 4.5.1 or later with the following flags:
-march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on
-LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4
-static
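As a rough illustration, the flags above could be assembled into a single compile command as sketched below. The driver name opencc, the source file name stream.c, and the helper name build_cmd are assumptions made for this sketch and do not come from this guide; adjust them to match the local installation. The output name stream matches the ./stream binary used in the runs that follow.

```shell
#!/bin/sh
# Sketch only: opencc and stream.c are assumed names, not taken from the guide.
O64_CC="opencc"        # assumed Open64 C compiler driver
SRC="stream.c"         # assumed STREAM source file name
OUT="stream"           # matches the ./stream binary run in the guide

# Flags exactly as listed in the guide, joined onto one line.
FLAGS="-march=bdver1 -mp -Ofast -LNO:simd=2 -WOPT:sib=on \
-LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 \
-static"

# Print the full compile command so it can be reviewed before it is run.
build_cmd() {
    echo "$O64_CC $FLAGS $SRC -o $OUT"
}

build_cmd

# Compile only when both the compiler and the source file are present.
if command -v "$O64_CC" >/dev/null 2>&1 && [ -f "$SRC" ]; then
    $O64_CC $FLAGS "$SRC" -o "$OUT"
fi
```

Printing the command before running it makes it easy to confirm that every flag was carried over unchanged.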
• Run STREAM with the following expected performance.
- Run on all cores.
The following is an example run on 2P AMD Opteron™ 6276 Series processors (32 cores) with 64GB
(8x8GB) 1600MHz memory:
> export OMP_NUM_THREADS=32
> ./stream
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 67973.1136 0.0208 0.0206 0.0211
Scale: 70406.8166 0.0200 0.0199 0.0201
Add: 65922.3917 0.0319 0.0318 0.0321
Triad: 65656.1828 0.0321 0.0319 0.0324
Note: The AMD Open64 version of STREAM yields 70% better bandwidth than the GCC version when
running on all 32 cores.
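When comparing compilers or thread counts across many runs, it can help to pull a single rate out of the STREAM output automatically. The following is a minimal sketch that assumes the output layout shown above (kernel name, then Rate in MB/s as the second field); the function name stream_rate is hypothetical.

```shell
#!/bin/sh
# Print the "Rate (MB/s)" column for one STREAM kernel (Copy, Scale, Add, Triad).
# Reads STREAM output on stdin; stream_rate is a hypothetical helper name.
stream_rate() {
    awk -v k="$1:" '$1 == k { print $2 }'
}

# Sample input: the all-core results reported in this guide.
sample='Function Rate (MB/s) Avg time Min time Max time
Copy: 67973.1136 0.0208 0.0206 0.0211
Scale: 70406.8166 0.0200 0.0199 0.0201
Add: 65922.3917 0.0319 0.0318 0.0321
Triad: 65656.1828 0.0321 0.0319 0.0324'

# Prints 65656.1828
printf '%s\n' "$sample" | stream_rate Triad
```

Against a real run, the same function could be used as ./stream | stream_rate Triad.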
- Run on all cores of each NUMA node.
The following is an example on the same system but running only on the first NUMA node, cores 0-7.
> export O64_OMP_AFFINITY="TRUE"
> export O64_OMP_AFFINITY_MAP="0,1,2,3,4,5,6,7"
> export OMP_NUM_THREADS=8
> ./stream
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 17311.7443 0.0812 0.0808 0.0828
Scale: 18046.0464 0.0779 0.0775 0.0784