Specifications

April 2012 v1 AMD Opteron™ 6200 Linux Tuning Guide
RUN STREAM
Run STREAM on all 32 cores of a 2P system with AMD Opteron™ 6276 Series processors and 64 GB
(8 x 8 GB) of 1600 MHz memory as follows:
> export OMP_NUM_THREADS=32
> ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 87380000, Offset = 1840
Total memory required = 2000.0 MB.
Each test is run 30 times, but only
the *best* time for each is used.
-------------------------------------------------------------
(lines deleted)
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          36916.7828     0.0386     0.0379     0.0390
Scale:         37034.2906     0.0384     0.0378     0.0387
Add:           41602.0300     0.0514     0.0504     0.0518
Triad:         41769.3595     0.0512     0.0502     0.0517
-------------------------------------------------------------
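As a sanity check, the reported rates can be approximately reproduced from the array size and the minimum times. The following is a minimal sketch, assuming the byte counting used in the STREAM source: Copy and Scale touch two arrays per iteration, Add and Triad touch three, each element is 8 bytes, and the rate is 1.0E-06 x bytes / min time. The function name stream_rate is hypothetical and not part of STREAM. Because the printed times are rounded to four decimals, the result matches the reported rate only to within a fraction of a percent.

```shell
# Hypothetical helper (not part of STREAM): estimate a STREAM rate in MB/s.
#   $1 = number of arrays the kernel touches (2 for Copy/Scale, 3 for Add/Triad)
#   $2 = array size in elements (8 bytes per element assumed)
#   $3 = minimum time in seconds, as printed by STREAM
stream_rate() {
    LC_ALL=C awk -v k="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.0f\n", 1e-6 * k * 8 * n / t }'
}

stream_rate 2 87380000 0.0379   # Copy on all 32 cores, prints 36889
stream_rate 3 87380000 0.0502   # Triad on all 32 cores, prints 41775
```

These estimates are close to the 36916.7828 and 41769.3595 MB/s reported above.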
Running on all cores of only one NUMA node (i.e., one CPU die) of the same 2P Opteron™ 6276
system (32 cores, 64 GB (8 x 8 GB) of 1600 MHz memory), we observe a little over a quarter of
the bandwidth measured when running on all cores:
> export OMP_NUM_THREADS=8
> export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"
> ./stream
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          10520.9029     0.1371     0.1329     0.1417
Scale:         10583.1833     0.1358     0.1321     0.1382
Add:           12010.5944     0.1806     0.1746     0.1843
Triad:         12062.1800     0.1802     0.1739     0.1839
-------------------------------------------------------------
Note: With only one of the four NUMA nodes running STREAM, the memory bandwidth is roughly
28 to 29% of that measured on all 32 cores above, slightly more than the 25% that one of four
nodes would deliver under perfectly even scaling.
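The ratio can be computed directly from the two result tables. This is a minimal sketch; bw_ratio is a hypothetical helper name, and the two arguments are simply rates copied from the STREAM output above.

```shell
# Hypothetical helper: print rate $1 as a percentage of rate $2.
bw_ratio() {
    LC_ALL=C awk -v part="$1" -v whole="$2" \
        'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

# Triad: one NUMA node (8 cores) versus all four nodes (32 cores)
bw_ratio 12062.1800 41769.3595   # prints 28.9
```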
Note: Also run STREAM on each die separately to verify that the results are the same on every die.
If they are not, the STREAM result on all cores is lower than it should be. This may result from an
empty memory channel (i.e., a DIMM plugged into the wrong slot).
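The per-die check can be scripted. The following is a sketch under stated assumptions: cores are numbered consecutively within each NUMA node (0-7 on node 0, 8-15 on node 1, and so on, consistent with the 0-7 list used above but worth confirming with numactl --hardware), ./stream was built with GCC OpenMP so that GOMP_CPU_AFFINITY applies, and the helper names node_cores and run_stream_per_node are hypothetical.

```shell
# Hypothetical helper: print the space-separated core IDs of one NUMA node.
#   $1 = node index (0-based), $2 = cores per node
# ASSUMPTION: core IDs are consecutive within a node.
node_cores() {
    nc_i=$(( $1 * $2 ))
    nc_last=$(( nc_i + $2 - 1 ))
    nc_list=""
    while [ "$nc_i" -le "$nc_last" ]; do
        nc_list="${nc_list:+$nc_list }$nc_i"
        nc_i=$(( nc_i + 1 ))
    done
    echo "$nc_list"
}

# Hypothetical helper: run STREAM once per node and show each Triad line.
#   $1 = number of NUMA nodes, $2 = cores per node
run_stream_per_node() {
    rs_node=0
    while [ "$rs_node" -lt "$1" ]; do
        rs_cores=$(node_cores "$rs_node" "$2")
        echo "NUMA node $rs_node (cores $rs_cores)"
        # Pin one thread to each core of this node only
        OMP_NUM_THREADS=$2 GOMP_CPU_AFFINITY="$rs_cores" ./stream | grep '^Triad:'
        rs_node=$(( rs_node + 1 ))
    done
}
```

On the system described above, which has four NUMA nodes of eight cores each, the check would be started with run_stream_per_node 4 8. A node whose Triad rate is clearly below the others is the one to inspect for an unpopulated memory channel.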