Technical information
Software Performance Optimization Methods
XAPP1206 v1.1 June 12, 2014 www.xilinx.com 18
Accessing Two D-registers of a Q-register
This can be done using vget_low and vget_high, as shown below:
vec64a = vget_low_u32(vec128); // split 128 bit vector
vec64b = vget_high_u32(vec128); // into 2x 64 bit vectors
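As a minimal sketch of this split in context, the helper below loads four 32-bit values into a Q-register, takes its two D-register halves, and adds them lane by lane. The function name sum_halves_u32 is hypothetical and does not come from this note. The #else branch is a plain C fallback, included only on the assumption that the snippet may also be built on a host without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[0] = in[0] + in[2], out[1] = in[1] + in[3].
void sum_halves_u32(const uint32_t in[4], uint32_t out[2])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t vec128 = vld1q_u32(in);          // load 4 lanes into a Q-register
    uint32x2_t vec64a = vget_low_u32(vec128);   // lanes 0 and 1 (low D-register)
    uint32x2_t vec64b = vget_high_u32(vec128);  // lanes 2 and 3 (high D-register)
    vst1_u32(out, vadd_u32(vec64a, vec64b));    // lane-wise add, store 2 lanes
#else
    // Portable fallback with the same lane layout (assumption: no NEON available).
    out[0] = in[0] + in[2];
    out[1] = in[1] + in[3];
#endif
}
```

Calling it with the input {1, 2, 10, 20} pairs lane 0 with lane 2 and lane 1 with lane 3.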
Casting NEON Variables Between Different Types
NEON intrinsics are strongly typed, so you cannot cast between vector types as freely as
you can between scalar types in C. When a cast between vectors of different types is required,
use vreinterpret, which generates no code but lets you reinterpret one NEON type as another:
uint8x8_t byteval;
uint32x2_t wordval;
byteval = vreinterpret_u8_u32(wordval);
Note that the destination type u8 is listed first after vreinterpret.
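The following sketch illustrates that vreinterpret changes only the type and leaves the 64 bits of data as they are. The helper name words_to_bytes is hypothetical, and the byte order noted in the comments assumes a little-endian target. The #else branch is a plain C fallback, included only on the assumption that the snippet may also be built on a host without NEON.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: view two 32-bit words as eight bytes.
void words_to_bytes(const uint32_t in[2], uint8_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x2_t wordval = vld1_u32(in);                 // two 32-bit lanes
    uint8x8_t  byteval = vreinterpret_u8_u32(wordval); // same 64 bits, new type
    vst1_u8(out, byteval);                             // eight 8-bit lanes
#else
    // Portable fallback: copy the same 8 bytes (assumption: no NEON available).
    memcpy(out, in, 8);
#endif
}
```

On a little-endian target, the word 0x04030201 appears as the bytes 0x01, 0x02, 0x03, 0x04 in lanes 0 to 3 of the result.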
The following moderately complex example, which calculates the dot product of two vectors,
gives a broader view of how NEON intrinsics can be used:
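#include <arm_neon.h>

float dot_product_intrinsic(float * __restrict vec1,
                            float * __restrict vec2, int n)
{
    float32x4_t vec1_q, vec2_q;
    float32x4_t sum_q = vdupq_n_f32(0.0f);   // four running partial sums
    float32x2_t tmp[2];
    float result;
    int i;

    // Main loop: multiply-accumulate four floats per iteration.
    for (i = 0; i < (n & ~3); i += 4)
    {
        vec1_q = vld1q_f32(&vec1[i]);
        vec2_q = vld1q_f32(&vec2[i]);
        sum_q = vmlaq_f32(sum_q, vec1_q, vec2_q);
    }

    // Reduce the four partial sums to a single scalar.
    tmp[0] = vget_high_f32(sum_q);
    tmp[1] = vget_low_f32(sum_q);
    tmp[0] = vpadd_f32(tmp[0], tmp[1]);
    tmp[0] = vpadd_f32(tmp[0], tmp[0]);
    result = vget_lane_f32(tmp[0], 0);

    // Scalar tail: handle the last n % 4 elements, which the main loop skips.
    for (; i < n; i++)
        result += vec1[i] * vec2[i];

    return result;
}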
Note: As stated above, to use NEON types and intrinsics, a header file, arm_neon.h, must be included.
Compiling NEON Intrinsics with GCC
Unlike the complex options for compiling C code with automatic vectorization, compiling NEON
intrinsics is fairly simple, and only a few compiler options are needed:
• -On (default). Set the optimization level.
• -mcpu=cortex-a9. Set the processor type to cortex-a9, the CPU in the Zynq-7000 AP SoC.
• -mfpu=neon. Tell the compiler to generate NEON instructions for Zynq-7000 AP SoC.
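One way to confirm that these options took effect is to test the NEON feature macros that the compiler predefines. The sketch below is an illustration only: the function name neon_enabled, the file name neon_check.c, and the cross-compiler name in the comment are assumptions, so substitute the names used by your own toolchain. Depending on the toolchain's default, a floating-point ABI option such as -mfloat-abi=softfp or -mfloat-abi=hard may also be needed before GCC enables NEON.

```c
// Example build line (assumed toolchain and file names):
//   arm-linux-gnueabihf-gcc -O2 -mcpu=cortex-a9 -mfpu=neon -c neon_check.c

// Hypothetical helper: returns 1 when this file was compiled with NEON
// enabled, and 0 otherwise.
int neon_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```

If the function returns 0 on the target, the NEON options were probably not passed to the compiler for this file.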
Optimizing NEON Assembler Code
Sometimes NEON assembler code is the only way to achieve optimal performance. When
looking at NEON intrinsics, it is apparent that in some cases the compiler might not be able to
generate the fastest binary possible. In these cases, carefully hand-written assembler code can
yield the best results from NEON, especially for performance-critical applications.