
April 2012 v1 AMD Opteron™ 6200 Linux Tuning Guide

3.5 Compiling for AMD’s New Core Architecture Instructions

The shared floating point unit for the AMD Opteron™ 6200 Series processors features new FMA4 and XOP instructions that can improve floating point throughput for floating point-intensive workloads. For more details on the new instructions, see the AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions at http://support.amd.com/us/Embedded_TechDocs/43479.pdf. The following graphic summarizes the new instructions:

FMA4 Overview (AMD Unique)

Performs fused multiply–add (FMA) operations. The FMA operation has the form d = a + b × c. FMA4 allows a, b, c, and d to be four different registers, providing programming flexibility.

• A fast FMA can speed up computations that involve the accumulation of products.

• FMA capabilities are also available in IBM Power, SPARC, and Itanium CPUs.

• Intel is anticipated to introduce FMA3, a more limited implementation of FMA (where d is the same register as either a, b, or c), to Xeon in the 2013 timeframe.*

XOP Overview (AMD Unique)

Provides three- and four-operand non-destructive destination encoding, an expansive new opcode space, and extension of SIMD floating point operations to 256 bits.

• Horizontal integer add/subtract

• Integer multiply/accumulate

• Shift/rotate with per-element counts

• Integer compare

• Byte permute

• Bit-wise conditional move

• Fraction extract

• Half-precision convert
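
The "accumulation of products" pattern that FMA targets can be sketched in plain C. This is a minimal illustration, not code from the guide: the function name dot is hypothetical, and whether a compiler turns the loop body into an FMA4 instruction depends on the target flags and floating point contraction settings in use.

```c
#include <stddef.h>

/* Accumulate the products x[i] * y[i].
 * Each iteration has the d = a + b * c shape that a fused
 * multiply-add computes in a single instruction. The source is
 * portable C; emitting FMA4 for it is up to the compiler. */
double dot(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc = acc + x[i] * y[i];   /* d = a + b * c */
    }
    return acc;
}
```

The inner loop of a matrix multiply has the same shape, which is why routines such as DGEMM are the usual beneficiaries.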

The classic example of floating point-intensive code is the DGEMM (Double-precision GEneral Matrix Multiply) routine, which is heavily used by the HPL benchmark. FMA4 instructions have a latency of five cycles. Individual SSE and AVX add and multiply instructions also have a latency of five cycles, so a multiply followed by a dependent add costs two five-cycle instructions where a single FMA4 instruction would do. As a result, binaries compiled to run on previous generations of hardware with the SSE/SSE2 instruction sets will not run as fast on the AMD Opteron™ 6200 Series processors as binaries rebuilt to use FMA4.

Because of this, it is essential to recompile any floating point-intensive application with appropriate FMA4 compiler flags and to link with optimized libraries to ensure that the application can take advantage of the new floating point unit’s full capability.

Users should start with the flags recommended in the latest Compiler Options Quick Reference Guide for the AMD Opteron™ 4200/6200 Series processors based on the new core architecture. The guide can be found at: http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf. The overall recommendation for performance on the 6200 processor is to have the compiler generate both FMA4 and AVX128 instructions (the exact flag depends on the compiler).
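
One way to confirm that a build was actually configured for these instruction sets is to inspect the compiler's predefined macros. The sketch below assumes a GCC-compatible compiler, where options such as -mavx, -mfma4 and -mxop define __AVX__, __FMA4__ and __XOP__; those option names are not taken from this guide, and the Quick Reference Guide remains the authority on which flags to use. The function name and the ISA_* constants are hypothetical.

```c
/* Bitmask values for the instruction sets discussed in this section. */
#define ISA_AVX  0x1u
#define ISA_FMA4 0x2u
#define ISA_XOP  0x4u

/* Report which instruction sets the compiler was asked to generate,
 * based on GCC-style predefined macros. This reflects build flags
 * only; it says nothing about the CPU the binary later runs on. */
unsigned compiled_isa_mask(void)
{
    unsigned mask = 0;
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __FMA4__
    mask |= ISA_FMA4;
#endif
#ifdef __XOP__
    mask |= ISA_XOP;
#endif
    return mask;
}
```

A build that was meant to target FMA4 but returns a mask without ISA_FMA4 was most likely compiled without the intended flag.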

If an application binary currently includes instructions that are common to AMD’s new core architecture and to Intel CPUs (e.g., AVX, SSE3, SSE4.1, SSE4.2, AES-NI), then this code will run well on the AMD Opteron™ 6200 CPU, as long as the binary checks only the ISA feature bits in the CPUID. Unfortunately, much code generated by the Intel compiler and Intel libraries inserts checks that the CPU vendor string is “GenuineIntel” and will thus either fail or execute an inefficient code sequence on AMD processors. Recompile such software.
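
A dispatch routine that relies only on feature bits, as recommended above, might look like the following sketch. It assumes GCC or Clang on x86, where <cpuid.h> provides __get_cpuid, and it uses the FMA4 feature bit (ECX bit 16 of CPUID leaf 0x80000001). The names choose_path and detect_path are hypothetical, and the sketch deliberately omits the additional check that the operating system has enabled AVX register state, which production code also needs.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* FMA4 feature bit in ECX of CPUID leaf 0x80000001. */
#define EXT_ECX_FMA4 (1u << 16)

enum code_path { PATH_GENERIC, PATH_FMA4 };

/* Pure decision function: it looks only at feature bits and never at
 * the vendor string, so any CPU that reports FMA4 gets the FMA4 path. */
enum code_path choose_path(unsigned ext_ecx)
{
    return (ext_ecx & EXT_ECX_FMA4) ? PATH_FMA4 : PATH_GENERIC;
}

/* Query the running CPU. Falls back to the generic path when the
 * leaf is unavailable or the build does not target x86. */
enum code_path detect_path(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
        return choose_path(ecx);
    }
#endif
    return PATH_GENERIC;
}
```

Keeping the decision in a function that takes the register value as an argument makes the selection logic testable on any machine, independent of the CPU it happens to run on.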