Technical information

Boost NEON Performance by Improving Memory Access Efficiency
XAPP1206 v1.1 June 12, 2014 www.xilinx.com 23
The algorithm is still the dot product of two float vectors (length = 1024), this time written
in assembly. There are two versions: one optimized with preload (PLD) instructions, and one
without, in which all PLD instructions are commented out.
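For reference, the computation itself is simple; a plain scalar C version (an illustrative sketch, not the NEON implementation shown below) looks like:

```c
#include <stddef.h>

/* Scalar reference for the dot product: sum of element-wise
   products of two float vectors of length n. */
float dot_product_ref(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```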
.align 4
.global neon_dot_product_vec16_pld
.arm
neon_dot_product_vec16_pld:
pld [r0, #0]
pld [r1, #0]
pld [r0, #32]
pld [r1, #32]
vmov.i32 q10, #0
vmov.i32 q11, #0
vmov.i32 q12, #0
vmov.i32 q13, #0
.L_mainloop_vec_16_pld:
@ load current set of values
vldm r0!, {d0, d1, d2, d3, d4, d5, d6, d7} @ q0-q3 <- 16 floats from r0
vldm r1!, {d10, d11, d12, d13, d14, d15, d16, d17} @ q5-q8 <- 16 floats from r1
pld [r0]
pld [r1]
pld [r0, #32]
pld [r1, #32]
@ calculate values for current set
vmla.f32 q10, q0, q5
vmla.f32 q11, q1, q6
vmla.f32 q12, q2, q7
vmla.f32 q13, q3, q8
@ loop control
subs r2, r2, #16
bgt .L_mainloop_vec_16_pld @ loop if r2 > 0, i.e., more elements to process
.L_return_vec_16_pld:
@ calculate the final result
vadd.f32 q15, q10, q11
vadd.f32 q15, q15, q12
vadd.f32 q15, q15, q13
vpadd.f32 d30, d30, d31
vpadd.f32 d30, d30, d30
vmov.32 r0, d30[0]
@ return
bx lr
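The structure of the assembly routine can be modeled in C as follows (a sketch for illustration, assuming the length is a multiple of 16): four independent accumulators play the role of q10-q13 and hide the latency of the vmla instructions, each iteration consumes 16 floats per input, and the accumulators are reduced at the end, as the vadd/vpadd sequence does.

```c
#include <stddef.h>

/* C model of the NEON loop: four accumulators, 16 floats per
   iteration, final reduction into a single sum.
   Assumes n is a multiple of 16 (illustrative only). */
float dot_product_vec16_model(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (size_t i = 0; i < n; i += 16) {
        for (size_t j = 0; j < 4; j++) {      /* 4 lanes per q register */
            acc0 += a[i + j]      * b[i + j];      /* q10 += q0 * q5 */
            acc1 += a[i + 4 + j]  * b[i + 4 + j];  /* q11 += q1 * q6 */
            acc2 += a[i + 8 + j]  * b[i + 8 + j];  /* q12 += q2 * q7 */
            acc3 += a[i + 12 + j] * b[i + 12 + j]; /* q13 += q3 * q8 */
        }
    }
    /* Final reduction, mirroring the vadd/vpadd sequence. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```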
On the console, you can see the execution times and observe the following:
- The non-preload-optimized assembler function takes approximately 11.8 s to execute. This is
slightly slower than the compiler-optimized version because the example above is for
demonstration purposes and does not apply other low-level optimization techniques.
- The preload-optimized assembler function takes around 9.5 s to execute. This is faster than
the compiler-optimized version.
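In C, a similar effect can be approximated with the GCC/Clang builtin `__builtin_prefetch`, which typically lowers to a PLD instruction on ARM. The sketch below is illustrative; the 32-byte lookahead mirrors the `#32` offsets in the assembly version, but the best prefetch distance is platform dependent.

```c
#include <stddef.h>

/* Dot product with software prefetch: hint the cache to fetch data
   ahead of the current position, so it is resident when the loop
   reaches it. Prefetching past the end of the array is harmless;
   a prefetch hint cannot fault. */
float dot_product_prefetch(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8]); /* 8 floats = 32 bytes ahead */
        __builtin_prefetch(&b[i + 8]);
        sum += a[i] * b[i];
    }
    return sum;
}
```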
Now you can check the software performance with a hot cache. Lab 2 focuses on testing
performance conservatively; that is, it assumes a cold cache. Cold cache means there is no
data in the cache when the algorithm starts. The coding associated with