

Software Performance Optimization Methods

XAPP1206 v1.1 June 12, 2014 www.xilinx.com 18

Accessing Two D-registers of a Q-register

The two D-registers (the 64-bit halves) of a Q-register can be accessed using vget_low and vget_high, as shown below:

vec64a = vget_low_u32(vec128); // split 128 bit vector

vec64b = vget_high_u32(vec128); // into 2x 64 bit vectors
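To make it concrete which lanes each half holds, here is a minimal sketch that splits a four-lane vector and recombines the halves in swapped order. The function name swap_halves_u32 is hypothetical, and the scalar branch is an added assumption so that the sketch also builds on a host without NEON:

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes {in[2], in[3], in[0], in[1]} to out. */
void swap_halves_u32(const uint32_t in[4], uint32_t out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t vec128 = vld1q_u32(in);          /* lanes 0..3 in one Q-register */
    uint32x2_t vec64a = vget_low_u32(vec128);   /* lanes 0 and 1 */
    uint32x2_t vec64b = vget_high_u32(vec128);  /* lanes 2 and 3 */
    /* vcombine places its first argument in the low half of the result. */
    vst1q_u32(out, vcombine_u32(vec64b, vec64a));
#else
    /* Scalar fallback (assumption: only for builds without NEON). */
    out[0] = in[2];
    out[1] = in[3];
    out[2] = in[0];
    out[3] = in[1];
#endif
}
```

vget_low returns lanes 0 and 1, and vget_high returns lanes 2 and 3, so the output starts with the two upper input elements.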

Casting NEON Variables Between Different Types

NEON intrinsics are strongly typed, so you cannot cast between vector types as freely as you can
between scalar types in C. If a vector must be treated as a different type, use vreinterpret,
which does not generate any code but lets you cast between the NEON types:

uint8x8_t byteval;

uint32x2_t wordval;

byteval = vreinterpret_u8_u32(wordval);

Note that the destination type u8 is listed first after vreinterpret.
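As a small illustration that vreinterpret only changes the type and leaves the bits alone, the sketch below views two 32-bit words as eight bytes. The function name words_to_bytes is hypothetical, a little-endian target is assumed, and the memcpy branch is an added fallback for hosts without NEON:

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: exposes the 64 bits of two words as eight bytes. */
void words_to_bytes(const uint32_t words[2], uint8_t bytes[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x2_t wordval = vld1_u32(words);
    /* Same 64 bits, now typed as eight 8-bit lanes; no conversion occurs. */
    uint8x8_t byteval = vreinterpret_u8_u32(wordval);
    vst1_u8(bytes, byteval);
#else
    /* Fallback (assumption: only for builds without NEON): copy the raw bits. */
    memcpy(bytes, words, 8);
#endif
}
```

On a little-endian target, the words 0x04030201 and 0x08070605 come out as the bytes 1 through 8 in order, which shows that no value conversion takes place.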

As a more complete illustration of how NEON intrinsics can be used, the following moderately
complex example calculates the dot product of two vectors:

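#include <arm_neon.h>

float dot_product_intrinsic(float * __restrict vec1,
                            float * __restrict vec2, int n)
{
    float32x4_t vec1_q, vec2_q;
    float32x4_t sum_q = vdupq_n_f32(0.0f);   // four partial sums, all zero
    float32x2_t tmp[2];
    float result;
    int i;

    // Main loop: multiply-accumulate four elements per iteration.
    for (i = 0; i < (n & ~3); i += 4)
    {
        vec1_q = vld1q_f32(&vec1[i]);
        vec2_q = vld1q_f32(&vec2[i]);
        sum_q = vmlaq_f32(sum_q, vec1_q, vec2_q);
    }

    // Reduce the four partial sums to a single value.
    tmp[0] = vget_high_f32(sum_q);
    tmp[1] = vget_low_f32(sum_q);
    tmp[0] = vpadd_f32(tmp[0], tmp[1]);
    tmp[0] = vpadd_f32(tmp[0], tmp[0]);
    result = vget_lane_f32(tmp[0], 0);

    // Scalar tail: the last n % 4 elements, which the vector loop skips.
    for (; i < n; i++)
        result += vec1[i] * vec2[i];

    return result;
}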

Note: As stated above, the header file arm_neon.h must be included to use NEON types and intrinsics.
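The final reduction in the dot-product example (the vget_high, vget_low, and two vpadd calls) can be traced lane by lane. The following is a minimal sketch of that step in isolation. The function name horizontal_sum_f32 is hypothetical, and the scalar branch is an added assumption so that the sketch also builds without NEON:

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: adds the four floats in v using the same
 * sequence of steps as the end of dot_product_intrinsic. */
float horizontal_sum_f32(const float v[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t sum_q = vld1q_f32(v);        /* {v0, v1, v2, v3} */
    float32x2_t hi = vget_high_f32(sum_q);   /* {v2, v3} */
    float32x2_t lo = vget_low_f32(sum_q);    /* {v0, v1} */
    float32x2_t pair = vpadd_f32(hi, lo);    /* {v2+v3, v0+v1} */
    pair = vpadd_f32(pair, pair);            /* both lanes: (v2+v3)+(v0+v1) */
    return vget_lane_f32(pair, 0);
#else
    /* Scalar fallback with the same pairing order (assumption: non-NEON build). */
    return (v[2] + v[3]) + (v[0] + v[1]);
#endif
}
```

Each vpadd adds adjacent pairs of lanes, so two of them are enough to collapse four partial sums into one value.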

Compiling NEON Intrinsics with GCC

Unlike the complex options for compiling C code with automatic vectorization, compiling NEON

intrinsics is fairly simple, and only a few compiler options are needed:

• -On. Set the optimization level, where n is the level (for example, -O2). GCC uses -O0 when no -O option is given.

• -mcpu=cortex-a9. Set the processor type for Zynq-7000 AP SoC to 'cortex-a9'.

• -mfpu=neon. Tell the compiler to generate NEON instructions for Zynq-7000 AP SoC.
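One way to confirm that these options took effect is to test the macro that GCC predefines when NEON code generation is enabled. The sketch below assumes a GCC-compatible compiler. The function names, the file name check_neon.c, and the cross-compiler prefix in the comment are illustrative assumptions, not part of the original text:

```c
#include <stdio.h>

/* Example build line (the toolchain prefix and file name are assumptions):
 *   arm-linux-gnueabihf-gcc -O2 -mcpu=cortex-a9 -mfpu=neon -c check_neon.c
 */

/* Returns 1 if this file was compiled with NEON enabled, 0 otherwise.
 * GCC predefines __ARM_NEON__ when NEON is enabled; __ARM_NEON is the
 * name used by the ARM C Language Extensions. */
int neon_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: prints the result, for example from main(). */
void report_neon(void)
{
    printf("NEON code generation: %s\n",
           neon_enabled() ? "enabled" : "disabled");
}
```

If the function returns 0 for a Zynq-7000 build, the -mfpu=neon option most likely did not reach the compiler.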

Optimizing NEON Assembler Code

Sometimes NEON assembler code is the only way to achieve optimal performance. Even with NEON
intrinsics, the compiler cannot always generate the fastest possible binary. In these cases,
carefully hand-written assembler code can yield the best results from NEON, especially for
performance-critical applications.