Technical information

Software Performance Optimization Methods
XAPP1206 v1.1 June 12, 2014 www.xilinx.com 16
produce a result different from non-NEON optimized code for floating-point numbers. Typically,
however, the difference is not significant. When you validate the code by comparing
computational results, be aware that “equal” for the float or double data type does not
mean bit-for-bit identical; it means that the difference is within an acceptable tolerance.
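One possible way to express such a tolerance-based comparison is sketched below. The helper name nearly_equal and both tolerance parameters are assumptions made for illustration; they are not part of the lab sources, and suitable tolerance values depend on the algorithm being validated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when a and b differ by no more than
 * abs_tol, or by no more than rel_tol times the larger magnitude.
 * If either input is NaN, every comparison is false and the result is false. */
bool nearly_equal(float a, float b, float rel_tol, float abs_tol)
{
    assert(rel_tol >= 0.0f && abs_tol >= 0.0f);

    float diff = (a > b) ? (a - b) : (b - a);      /* |a - b| */
    if (diff <= abs_tol)
        return true;                               /* handles values near zero */

    float mag_a = (a < 0.0f) ? -a : a;
    float mag_b = (b < 0.0f) ? -b : b;
    float largest = (mag_a > mag_b) ? mag_a : mag_b;

    return diff <= rel_tol * largest;              /* relative comparison */
}
```

A validation routine could then call nearly_equal(reference[i], neon_result[i], 1e-5f, 1e-8f) for each element instead of testing with the == operator.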
Lab 1
1. Create a new project in SDK.
2. Import lab1 source files.
3. Run the application on hardware, and check the output on the console.
4. Open the generated ELF file to check how the instructions are being generated.
5. Observe the following:
- The manually unrolled loop is vectorized and takes more time.
- The execution time of the automatically vectorized code is about 10.8 s, even when the
optimization level is set to -O3 and NEON automatic vectorization is enabled.
- When the optimization level is set to -O3, PLD instructions are inserted. (This is
discussed later.)
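For readers without the lab files at hand, the following sketch shows what a plain loop and a manually unrolled variant of the same loop might look like. It is not the lab1 source; the function names and the choice of an element-wise addition are assumptions. With GCC, options such as -O3 -mfpu=neon -ftree-vectorize are the usual way to request automatic vectorization, and the generated instructions can be checked in the ELF disassembly as in step 4.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: a typical candidate for automatic vectorization.
 * The restrict qualifiers tell the compiler that the arrays do not overlap. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Manually unrolled by four. This sketch assumes n is a multiple of 4. */
void add_arrays_unrolled4(float *restrict dst, const float *restrict a,
                          const float *restrict b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Both functions compute the same result; only the shape of the loop presented to the compiler differs.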
Using NEON Intrinsics
NEON C/C++ intrinsics are available for armcc, GCC/g++, and LLVM. Because these compilers
use the same syntax, source code that uses intrinsics can be compiled by any of them, which
provides excellent code portability.
Essentially, NEON intrinsics are C function wrappers around NEON assembler instructions. They
provide a way to write NEON code that is easier to maintain than NEON assembler code, while
preserving fine-granularity control of the generated NEON instructions. In addition, there are
new data type definitions that correspond to NEON registers (both D-registers and Q-registers)
containing different sized elements, allowing C variables to be created that map directly onto
NEON registers. These variables can be passed to NEON intrinsic functions directly. The
compiler then generates NEON instructions instead of incurring an actual subroutine call.
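As a minimal sketch of this idea, the function below adds two float arrays four elements at a time. The function name add_f32 and the HAVE_NEON macro are hypothetical; the intrinsics vld1q_f32, vaddq_f32, and vst1q_f32 and the float32x4_t type come from arm_neon.h. A scalar path is kept so that the same source also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    size_t i = 0;
#if HAVE_NEON
    /* Each float32x4_t variable maps onto one 128-bit Q-register. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);     /* load 4 floats      */
        float32x4_t vb = vld1q_f32(b + i);     /* load 4 floats      */
        vst1q_f32(dst + i, vaddq_f32(va, vb)); /* add and store back */
    }
#endif
    /* Scalar loop: handles the remaining elements, or everything
     * when NEON is not available. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Each intrinsic call here is expected to compile to a NEON load, add, or store instruction rather than to a function call, although the exact instruction sequence remains the compiler's decision.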
NEON intrinsics provide low-level access to NEON instructions but with the compiler doing
some of the hard work normally associated with writing assembly language, such as:
- Register allocation.
- Code scheduling or re-ordering instructions to achieve the highest performance. The C
compilers can be told which processor is being targeted, and they can reorder code to
ensure the CPU pipeline is running in an optimized way.
The main disadvantage of intrinsics is that you cannot force the compiler to output exactly the
code you want. In some cases, therefore, hand-written NEON assembler code can still yield
further improvement.
For details about NEON intrinsics, see the following:
- RealView Compilation Tools Compiler Reference Guide [Ref 9]
- GCC documentation [Ref 10]
NEON Types in C
The ARM C Language Extensions [Ref 9] contains a full list of NEON types. The format is:
<basic type>x<number of elements>_t
To use NEON types and intrinsics, a header file, arm_neon.h, must be included.
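To illustrate the naming scheme, the sketch below uses uint16x4_t (four 16-bit lanes, 64 bits, a D-register), uint32x4_t (four 32-bit lanes, 128 bits, a Q-register), and uint32x2_t (two 32-bit lanes, a D-register). The function name sum_u16x4 is hypothetical, and the scalar branch is included only so that the example also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Sum four 16-bit values without overflow by widening to 32 bits. */
uint32_t sum_u16x4(const uint16_t *src)
{
    assert(src != NULL);
    uint16x4_t v    = vld1_u16(src);   /* uint16 x 4 lanes = 64 bits  */
    uint32x4_t wide = vmovl_u16(v);    /* uint32 x 4 lanes = 128 bits */
    uint32x2_t pair = vadd_u32(vget_low_u32(wide), vget_high_u32(wide));
    return vget_lane_u32(pair, 0) + vget_lane_u32(pair, 1);
}
#else
/* Scalar equivalent for targets without NEON. */
uint32_t sum_u16x4(const uint16_t *src)
{
    assert(src != NULL);
    return (uint32_t)src[0] + src[1] + src[2] + src[3];
}
#endif
```

The element count in the type name multiplied by the element width gives the register size: 64 bits for D-register types and 128 bits for Q-register types.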