Specifications
April 2012 v1 AMD Opteron™ 6200 Linux Tuning Guide
3.5 Compiling for AMD’s New Core Architecture Instructions
The shared floating point unit in the AMD Opteron™ 6200 Series processors features new FMA4 and XOP
instructions that can improve throughput for floating point-intensive workloads. For more details on the new
instructions, see the AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4
Instructions at http://support.amd.com/us/Embedded_TechDocs/43479.pdf. The following overview summarizes the
new instructions:
FMA4 Overview (AMD Unique)
Performs fused multiply–add (FMA) operations. The FMA operation has the form d = a + b × c. FMA4
allows a, b, c, and d to be four different registers, providing programming flexibility.
• A fast FMA can speed up computations which involve the accumulation of products
• FMA capabilities are also available in IBM Power, SPARC, and Itanium CPUs.
• Intel is anticipated to introduce FMA3, a more limited implementation of FMA (where d is the
same register as either a, b, or c) to Xeon in 2013 timeframe*
XOP Overview (AMD Unique)
Provides three- and four-operand non-destructive destination encoding, an expansive new opcode
space, and extension of SIMD floating point operations to 256 bits.
• Horizontal integer add/subtract
• Integer multiply/accumulate
• Shift/rotate with per-element counts
• Integer compare
• Byte permute
• Bit-wise conditional move
• Fraction extract
• Half-precision convert
The classic example of floating point-intensive code is the DGEMM (Double-precision GEneral Matrix Multiply)
routine, which is heavily used by the HPL benchmark. FMA4 instructions have a latency of five cycles, and
individual SSE and AVX add and multiply instructions also have a latency of five cycles. A multiply followed by
a dependent add therefore takes ten cycles as two separate instructions, but only five cycles as a single FMA4
instruction. As a result, binaries compiled to run on previous generations of hardware with the SSE/SSE2
instruction sets will not run as fast on the AMD Opteron™ 6200 Series processors as they could if they were
recompiled to use FMA4.
Because of this, it is essential to recompile any floating point-intensive application with appropriate FMA4
compiler flags and to link with optimized libraries to ensure that the application can take advantage of the new
floating point unit’s full capability.
Users should start with the flags recommended in the latest Compiler Options Quick Reference Guide for the
AMD Opteron™ 4200/6200 Series processors based on the new core architecture. The guide can be found
at: http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf. The overall recommendation for
performance on the 6200 processor is to compile generating both FMA4 and AVX128 instructions (the exact
flag depends on the compiler).
If an application binary currently includes the instructions that are common to AMD’s new core architecture
and to the Intel CPUs (e.g., AVX, SSE3, SSE4.1, SSE4.2, AES-NI), then this code will run well on the AMD
Opteron™ 6200 CPU, as long as the binary checks only the ISA feature bits in the CPUID. Unfortunately, much
code generated by the Intel compiler and Intel libraries checks that the CPU vendor string is “GenuineIntel”
and will thus either fail or execute an inefficient code sequence on AMD processors. Recompile such software.