Technical information

Boost NEON Performance by Improving Memory Access Efficiency
XAPP1206 v1.1 June 12, 2014 www.xilinx.com 23
The algorithm is still the dot product of two float vectors (length = 1024), this time written
in assembly. There are two versions: one optimized with preload (PLD) instructions, and one
without, in which all PLD instructions are commented out.
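For reference, the computation itself is simple; a plain scalar C version (an illustrative sketch, not the NEON implementation shown below) looks like:

```c
#include <stddef.h>

/* Scalar reference for the dot product: sum of element-wise
   products of two float vectors of length n. */
float dot_product_ref(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```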
.align 4
.global neon_dot_product_vec16_pld
.arm
neon_dot_product_vec16_pld:
pld [r0, #0]
pld [r1, #0]
pld [r0, #32]
pld [r1, #32]
vmov.i32 q10, #0
vmov.i32 q11, #0
vmov.i32 q12, #0
vmov.i32 q13, #0
.L_mainloop_vec_16_pld:
@ load current set of values
vldm r0!, {d0, d1, d2, d3, d4, d5, d6, d7} @ q0-q3 <- 16 floats from r0
vldm r1!, {d10, d11, d12, d13, d14, d15, d16, d17} @ q5-q8 <- 16 floats from r1
pld [r0]
pld [r1]
pld [r0, #32]
pld [r1, #32]
@ calculate values for current set
vmla.f32 q10, q0, q5
vmla.f32 q11, q1, q6
vmla.f32 q12, q2, q7
vmla.f32 q13, q3, q8
@ loop control
subs r2, r2, #16
bgt .L_mainloop_vec_16_pld @ loop if r2 > 0, i.e., more elements to process
.L_return_vec_16_pld:
@ calculate the final result
vadd.f32 q15, q10, q11
vadd.f32 q15, q15, q12
vadd.f32 q15, q15, q13
vpadd.f32 d30, d30, d31
vpadd.f32 d30, d30, d30
vmov.32 r0, d30[0]
@ return
bx lr
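The structure of the assembly routine can be modeled in C as follows (a sketch for illustration, assuming the length is a multiple of 16): four independent accumulators play the role of q10-q13 and hide the latency of the vmla instructions, each iteration consumes 16 floats per input, and the accumulators are reduced at the end, as the vadd/vpadd sequence does.

```c
#include <stddef.h>

/* C model of the NEON loop: four accumulators, 16 floats per
   iteration, final reduction into a single sum.
   Assumes n is a multiple of 16 (illustrative only). */
float dot_product_vec16_model(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (size_t i = 0; i < n; i += 16) {
        for (size_t j = 0; j < 4; j++) {      /* 4 lanes per q register */
            acc0 += a[i + j]      * b[i + j];      /* q10 += q0 * q5 */
            acc1 += a[i + 4 + j]  * b[i + 4 + j];  /* q11 += q1 * q6 */
            acc2 += a[i + 8 + j]  * b[i + 8 + j];  /* q12 += q2 * q7 */
            acc3 += a[i + 12 + j] * b[i + 12 + j]; /* q13 += q3 * q8 */
        }
    }
    /* Final reduction, mirroring the vadd/vpadd sequence. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```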
On the console, you can see the execution times and observe the following:
- The non-preload-optimized assembler function takes approximately 11.8 s to execute. This is
slightly slower than the compiler-optimized version because the example above is for
demonstration purposes and does not apply other low-level optimization techniques.
- The preload-optimized assembler function takes around 9.5 s to execute. This is faster than
the compiler-optimized version.
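In C, a similar effect can be approximated with the GCC/Clang builtin `__builtin_prefetch`, which typically lowers to a PLD instruction on ARM. The sketch below is illustrative; the 32-byte lookahead mirrors the `#32` offsets in the assembly version, but the best prefetch distance is platform dependent.

```c
#include <stddef.h>

/* Dot product with software prefetch: hint the cache to fetch data
   ahead of the current position, so it is resident when the loop
   reaches it. Prefetching past the end of the array is harmless;
   a prefetch hint cannot fault. */
float dot_product_prefetch(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8]); /* 8 floats = 32 bytes ahead */
        __builtin_prefetch(&b[i + 8]);
        sum += a[i] * b[i];
    }
    return sum;
}
```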
Now you can check the software performance with a hot cache. Lab 2 focuses on testing
performance conservatively; that is, it assumes a cold cache. Cold cache means there is no
data in the cache when the algorithm starts. The coding associated with