

Before You Begin: Important Concepts

XAPP1206 v1.1 June 12, 2014 www.xilinx.com 9

NEON Performance Limit

When using NEON to optimize software algorithms, it is important to determine how much performance improvement you can expect.

An ARM document, the Cortex-A9 NEON Media Processing Engine Technical Reference Manual [Ref 4], contains detailed timing information for each VFP and NEON instruction. Table 2 and Table 3 are subsets of that information, provided in this document for quick reference. The Cycles column lists the number of issue cycles the instruction needs, which is the absolute minimum number of cycles if no operand interlocks are present. In other words, it represents the upper limit of NEON performance. For example, Table 3 shows that the NEON floating-point instructions VMUL and VMLA can complete an operation on Q registers in just two cycles. This means that NEON can do four single-precision floating-point multiplications in two cycles. Thus, if NEON runs at 1 GHz, it can achieve a maximum of 2 GFLOPS for single-precision floating-point operations. Table 2 also shows that VFP requires two cycles to finish one double-precision multiply-accumulate.
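The peak-rate arithmetic above can be written out as a short C sketch. The function name peak_gops and its parameters are illustrative only and do not come from the ARM or Xilinx documentation. The result counts one operation per element per instruction, as the text does, and the 1 GHz clock is the same assumption the text makes.

```c
#include <stdio.h>

/*
 * Peak rate in giga-operations per second for one instruction type,
 * assuming no operand interlocks and one operation per element.
 *   lanes        - elements processed per instruction (4 floats in a Q register)
 *   issue_cycles - value from the Cycles column of Table 2 or Table 3
 *   clock_ghz    - assumed core clock in GHz (hypothetical, not measured)
 */
double peak_gops(int lanes, int issue_cycles, double clock_ghz)
{
    return (double)lanes / (double)issue_cycles * clock_ghz;
}

/* Print the two cases discussed in the text. */
void print_peak_examples(void)
{
    /* NEON VMUL on Q registers: 4 lanes, 2 issue cycles. */
    printf("NEON VMUL (Q, single): %.1f Gops/s\n", peak_gops(4, 2, 1.0));
    /* VFP double-precision multiply-accumulate: 1 lane, 2 issue cycles. */
    printf("VFP VMLA (double):     %.1f Gops/s\n", peak_gops(1, 2, 1.0));
}
```

With these inputs the NEON case evaluates to 2.0, which matches the 2 GFLOPS figure in the text, and the VFP double-precision case evaluates to 0.5. Both are upper limits, not expected results.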

Even though the NEON cycle timing information can be found in the ARM documentation, it is difficult to determine how many cycles are required in real-world applications, even for a trivial piece of code. The actual time required depends not only on the instruction sequence but also on the cache and memory system.

Note:

Use profiling tools (described above) to obtain the most accurate data for your application.
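As a rough illustration of why measured figures fall below the theoretical limit, the following C sketch times an element-wise multiply and compares the achieved rate with a given peak. It is not a substitute for a profiling tool. All function names here (mul_arrays, measure_gmults, fraction_of_peak) are hypothetical, the loop may or may not be vectorized depending on the compiler and its options, and an optimizing compiler may remove repeated calls whose results are unused.

```c
#include <stddef.h>
#include <time.h>

/* Element-wise multiply; whether this maps to NEON depends on the compiler. */
void mul_arrays(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * b[i];
}

/*
 * Time `reps` passes over the arrays with clock() and return the achieved
 * rate in giga-multiplications per second. Returns 0.0 when the run is
 * too short for the resolution of clock().
 */
double measure_gmults(float *dst, const float *a, const float *b,
                      size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        mul_arrays(dst, a, b, n);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return (double)n * (double)reps / seconds / 1e9;
}

/* Ratio of the measured rate to the theoretical peak (same units). */
double fraction_of_peak(double measured, double peak)
{
    return measured / peak;
}
```

Running the measurement with arrays that fit in the L1 cache and again with arrays much larger than the L2 cache is one simple way to see the cache and memory effects mentioned above.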

NEON Benefits

The merits of NEON for embedded systems include:

• Simple DSP algorithms can show a larger performance boost (4x-8x)

• Generally, about 60-150% performance boost on complex video codecs

• Better efficiency for memory access with wide registers

• Saves power because the processor can finish a task more quickly and enter sleep mode sooner

Table 2: Part of VFP Instruction Timing

Name                        Format        Cycles  Source  Result  Writeback
VADD, VSUB                  .F Sd,Sn,Sm   1       -,1,1   4       4
                            .D Dd,Dn,Dm   1       -,1,1   4       4
VMUL, VNMUL                 .F Sd,Sn,Sm   1       -,1,1   5       5
                            .D Dd,Dn,Dm   2       -,1,1   6       6
VMLA, VMLS, VNMLS, VNMLA    .F Sd,Sn,Sm   1       -,1,1   8       8
                            .D Dd,Dn,Dm   2       -,1,1   9       9
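Because the Cycles column gives the minimum issue cost when no operand interlocks are present, a lower bound for a short VFP instruction sequence can be estimated by summing that column. The C sketch below does this with the values from Table 2. The enum and function names are hypothetical, and the result ignores result latency, interlocks, and memory effects, so real code will usually take longer.

```c
#include <stddef.h>

/* Instruction groups and formats from Table 2 (names are illustrative). */
enum vfp_op {
    VFP_ADD_F, VFP_ADD_D,   /* VADD, VSUB */
    VFP_MUL_F, VFP_MUL_D,   /* VMUL, VNMUL */
    VFP_MLA_F, VFP_MLA_D,   /* VMLA, VMLS, VNMLS, VNMLA */
    VFP_OP_COUNT
};

/* Issue cycles copied from the Cycles column of Table 2. */
static const int vfp_issue_cycles[VFP_OP_COUNT] = {
    [VFP_ADD_F] = 1, [VFP_ADD_D] = 1,
    [VFP_MUL_F] = 1, [VFP_MUL_D] = 2,
    [VFP_MLA_F] = 1, [VFP_MLA_D] = 2,
};

/* Sum of issue cycles: a lower bound that assumes no operand interlocks. */
long min_issue_cycles(const enum vfp_op *seq, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += vfp_issue_cycles[seq[i]];
    return total;
}
```

For example, a sequence of one single-precision VMUL, two double-precision VMLA instructions, and one double-precision VADD has a lower bound of 1 + 2 + 2 + 1 = 6 issue cycles.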

Table 3: Part of Advanced SIMD (NEON) Floating-Point Instructions

Name                      Format      Cycles  Source  Result  Writeback
VADD, VSUB, VABD, VMUL    Dd,Dn,Dm    1       -,2,2   5       6
                          Qd,Qn,Qm    2       -,2,2
                                              -,3,3
VMLA, VMLS                Dd,Dn,Dm    1       3,2,2   9       10
                          Qd,Qn,Qm    2       3,2,2
                                              4,3,3