Technical information
Before You Begin: Important Concepts
XAPP1206 v1.1 June 12, 2014 www.xilinx.com
NEON Performance Limit
When using NEON to optimize software algorithms, it is important to determine how much
performance improvement you can expect.
An ARM document, the Cortex-A9 NEON Media Processing Engine Technical Reference
Manual [Ref 4], contains detailed timing information for each VFP and NEON instruction.
Table 2 and Table 3 are subsets of the information in that manual, included here for quick reference.
The Cycles column lists the number of issue cycles the specific instruction needs and is the
absolute minimum number of cycles if no operand interlocks are present. In other words, this
represents the NEON performance upper limit. For example, in Table 3 you can see that the
NEON floating-point instructions VMUL and VMLA can issue an operation on Q registers in just
two cycles. This means that NEON can do four single-precision floating-point multiplications
in two cycles, or two per cycle. Thus, if NEON runs at 1 GHz, it can achieve a maximum of
2 GFLOPS for single-precision floating-point multiplication. In Table 2, you can also see that
VFP requires two cycles to issue one double-precision multiply-accumulate.
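The peak-throughput arithmetic in the preceding paragraph can be written out as a short C sketch. The function name peak_gflops and its parameters are illustrative and do not come from the ARM or Xilinx documentation. The sketch assumes back-to-back issue with no operand interlocks or memory stalls, and counts one floating-point operation per lane.

```c
#include <assert.h>

/* Hypothetical helper: theoretical peak, in GFLOPS, for one instruction type.
 *   clock_ghz    - NEON clock in GHz
 *   lanes        - results produced per instruction (4 for a Q-register
 *                  single-precision operation)
 *   issue_cycles - value from the Cycles column of the timing table
 * Assumes back-to-back issue with no operand interlocks or memory stalls. */
double peak_gflops(double clock_ghz, unsigned lanes, unsigned issue_cycles)
{
    assert(issue_cycles > 0);
    return clock_ghz * (double)lanes / (double)issue_cycles;
}
```

With the Q-register VMUL figures (four lanes, two issue cycles) at 1 GHz, peak_gflops(1.0, 4, 2) evaluates to 2.0, which matches the 2 GFLOPS upper limit stated in the text.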
Even though the NEON cycle timing information can be found in the ARM documentation, it is
difficult to determine how many cycles are required in real-world applications, even for a trivial
piece of code. The actual time required depends not only on the instruction sequence but also
on the cache and memory system.
Note:
Use profiling tools (described above) to obtain the most accurate data for your application.
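Where a profiler is not at hand, a rough first estimate can come from timing the code directly. The following is a minimal sketch, not a replacement for the profiling tools mentioned in the note. The names kernel_fn, scale_by_two, and average_seconds are hypothetical, and the standard C clock() function has coarse resolution, so short kernels need many repetitions. Results still depend on cache and memory state.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature used only for this sketch. */
typedef void (*kernel_fn)(float *dst, const float *src, size_t n);

/* Example kernel: multiply every element by 2. */
void scale_by_two(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}

/* Returns the average CPU time, in seconds, of one call to fn,
 * measured over `repeats` consecutive calls. */
double average_seconds(kernel_fn fn, float *dst, const float *src,
                       size_t n, unsigned repeats)
{
    assert(fn != NULL && repeats > 0);
    clock_t start = clock();
    for (unsigned r = 0; r < repeats; ++r)
        fn(dst, src, n);
    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC / (double)repeats;
}
```

Comparing the measured time against the theoretical minimum from the Cycles column gives a sense of how much of the run time is spent on interlocks, cache misses, and memory traffic rather than on instruction issue.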
NEON Benefits
The merits of NEON for embedded systems include:
• Simple DSP algorithms can show a larger performance boost (4x-8x)
• Generally, about 60-150% performance boost on complex video codecs
• Better efficiency for memory access with wide registers
• Saves power because the processor can finish a task more quickly and enter sleep mode
sooner
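As an illustration of the kind of simple DSP loop the first bullet refers to, the sketch below multiplies two float arrays element by element, four values at a time, using NEON intrinsics from arm_neon.h (vld1q_f32, vmulq_f32, vst1q_f32). The function name mul_f32 is hypothetical. The scalar fallback is an addition for portability so that the same source builds on targets without NEON. The actual speedup is not guaranteed and depends on the compiler, cache, and memory system.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Element-wise product: dst[i] = a[i] * b[i] for i in [0, n).
 * mul_f32 is a hypothetical name chosen for this sketch. */
void mul_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four floats per iteration, one Q register per operand. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vmulq_f32(va, vb));
    }
#endif
    /* Scalar loop: handles the tail, or every element when NEON is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

When building for a Cortex-A9 with GCC, NEON code generation typically has to be enabled explicitly, for example with the -mfpu=neon option; the matching float ABI setting depends on the toolchain in use.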
Table 2: Part of VFP Instruction Timing
Name                      Format                    Cycles  Source  Result  Writeback
VADD, VSUB                .F Sd,Sn,Sm; .D Dd,Dn,Dm  1       -,1,1   4       4
VMUL, VNMUL               .F Sd,Sn,Sm               1       -,1,1   5       5
                          .D Dd,Dn,Dm               2       -,1,1   6       6
VMLA, VMLS, VNMLS, VNMLA  .F Sd,Sn,Sm               1       -,1,1   8       8
                          .D Dd,Dn,Dm               2       -,1,1   9       9
Table 3: Part of Advanced SIMD (NEON) Floating-Point Instructions
Name                      Format     Cycles  Source  Result  Writeback
VADD, VSUB, VABD, VMUL    Dd,Dn,Dm   1       -,2,2   5       6
                          Qd,Qn,Qm   2       -,2,2   5       6
                                             -,3,3   6       7
VMLA, VMLS                Dd,Dn,Dm   1       3,2,2   9       10
                          Qd,Qn,Qm   2       3,2,2   9       10
                                             4,3,3   10      11