This is at line 67 of the source file benchmarking.c. You can see that before each run, the L1 and L2 caches are flushed by a call to Xil_DCacheFlush(). Comment out this line to see the hot-cache performance. The execution time drops to around 2.67 s, demonstrating that the cache can improve performance significantly. In this example, because the latency of the PLD instructions is much longer than the computation time, not all PLD instructions take effect.
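For reference, a minimal benchmarking harness along these lines might look as follows. It assumes the Xilinx standalone BSP (xil_cache.h and xtime_l.h), and process_buffer() is a hypothetical stand-in for the algorithm under test:

    #include <stdio.h>
    #include "xil_cache.h"   /* Xil_DCacheFlush() */
    #include "xtime_l.h"     /* XTime, XTime_GetTime(), COUNTS_PER_SECOND */

    extern void process_buffer(float *buf, int len);  /* kernel under test */

    void benchmark(float *buf, int len)
    {
        XTime start, end;

        /* Flush the L1 and L2 data caches so every run starts cold;
         * comment this out to measure hot-cache performance instead. */
        Xil_DCacheFlush();

        XTime_GetTime(&start);
        process_buffer(buf, len);
        XTime_GetTime(&end);

        printf("Elapsed: %.3f s\n",
               (double)(end - start) / COUNTS_PER_SECOND);
    }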
Two additional methods for improving the cache hit rate and system performance are as follows:
• Create a preload routine for the algorithm to load some of the leading data into cache, and run it some time before the actual algorithm computation routine.
• Increase the preload advancement steps in the actual algorithm computation to make the preload continuous.
If properly tuned, these methods can achieve performance very close to that of a hot cache.
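The following sketch illustrates both ideas in a hypothetical NEON accumulation kernel. __builtin_prefetch() is the GCC builtin that emits PLD on ARM; PLD_AHEAD is an assumed tuning parameter, not a value from this application note:

    #include <arm_neon.h>

    #define PLD_AHEAD 128   /* bytes to preload ahead; tune per platform */

    /* Sums an array of floats; len is assumed to be a multiple of 4. */
    float sum_array(const float *src, int len)
    {
        float32x4_t acc = vdupq_n_f32(0.0f);

        /* Preload routine: warm the first few cache lines (32 bytes
         * each on the Cortex-A9) before the main computation starts. */
        __builtin_prefetch(src);
        __builtin_prefetch(src + 8);
        __builtin_prefetch(src + 16);

        for (int i = 0; i < len; i += 4) {
            /* Keep the preload running continuously ahead of the
             * computation by a fixed advancement distance. */
            __builtin_prefetch((const char *)(src + i) + PLD_AHEAD);
            acc = vaddq_f32(acc, vld1q_f32(src + i));
        }

        /* Horizontal add of the four accumulator lanes. */
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        return vget_lane_f32(vpadd_f32(s, s), 0);
    }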
However, to use data preloading efficiently, you must consider what the lead time should be. If the preload is done too early, the preloaded data might be evicted by other code before it is used. If it is done too late, the data might not be available in the cache when needed, which lowers system performance. The key factor is the main memory access latency. Fortunately, you do not need to write code to test it: the open source project lmbench already has test code to identify this parameter in an embedded system. For a typical Zynq device configuration (CPU running at 667 MHz, DDR3 at 533 MHz), the latency is about 60 to 80 CPU cycles. This provides adequate information about where to insert the preload routine before the actual computation routine.
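As an illustration with assumed numbers, not measurements from this application note: if the memory latency is about 70 CPU cycles and the inner loop consumes one 32-byte cache line in roughly 18 cycles, the preload must be issued about 70 / 18 ≈ 4 cache lines (128 bytes) ahead of the current computation for the data to arrive in time.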
You can also try optimizing a memcpy() written in C with data preloading. The performance boost is around 25%. This is not as significant as the results above because there is no computation to compensate for the data preload latency.
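A sketch of such a memcpy() variant is shown below. It is illustrative only: it assumes 4-byte-aligned buffers, a length that is a multiple of the 32-byte Cortex-A9 cache line, and an assumed preload distance of 128 bytes:

    #include <stddef.h>
    #include <stdint.h>

    void *memcpy_pld(void *dst, const void *src, size_t n)
    {
        uint32_t *d = (uint32_t *)dst;
        const uint32_t *s = (const uint32_t *)src;

        /* Copy one 32-byte cache line (eight words) per pass. */
        for (size_t i = 0; i < n / 4; i += 8) {
            __builtin_prefetch(&s[i] + 32);   /* preload 128 bytes ahead */
            d[i]     = s[i];
            d[i + 1] = s[i + 1];
            d[i + 2] = s[i + 2];
            d[i + 3] = s[i + 3];
            d[i + 4] = s[i + 4];
            d[i + 5] = s[i + 5];
            d[i + 6] = s[i + 6];
            d[i + 7] = s[i + 7];
        }
        return dst;
    }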
Using Tiles to Prevent Cache Thrashing
For Zynq-7000 devices, each of the two Cortex-A9 processors has separate 32 KB level-1
instruction and data caches, and both caches are 4-way set-associative. The L2 cache is an 8-way set-associative 512 KB cache shared by the two Cortex-A9 cores. These
parameters are critical to predicting when cache thrashing will occur.
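To make these numbers concrete (assuming the Cortex-A9 32-byte cache line): the 32 KB 4-way L1 data cache has 32 KB / 4 = 8 KB per way, which is 256 sets, so any two addresses whose distance is a multiple of 8 KB map to the same set. As soon as more than four such addresses are in active use, the 4-way cache can no longer hold them all, and lines are repeatedly evicted and refetched.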
Before trying to identify a solution, you need to know why the issue occurs. Start with the simplest cache implementation, a direct-mapped cache.
In a direct-mapped cache, each location in main memory maps to a single location in the cache. Figure 6 shows a simplified small cache (64 bytes) example, with four words per line and four lines. In this example, address bits [3:2] act as the offset to select a word within a cache line, and address bits [5:4] act as the index to select one of the four available cache lines. Address bits [31:6] are used as the tag value for each cache line.
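This bit-field decomposition is easy to verify in code. The short program below (a standalone illustration, not from the accompanying design files) decodes an address into tag, index, and offset for this toy cache:

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a 32-bit address for the toy direct-mapped cache of
     * Figure 6: 64 bytes total, four lines, four 4-byte words per line. */
    int main(void)
    {
        uint32_t addr = 0x00001234;            /* arbitrary example address */

        uint32_t offset = (addr >> 2) & 0x3;   /* bits [3:2]: word in line  */
        uint32_t index  = (addr >> 4) & 0x3;   /* bits [5:4]: cache line    */
        uint32_t tag    = addr >> 6;           /* bits [31:6]: tag          */

        printf("addr=0x%08x tag=0x%06x index=%u offset=%u\n",
               addr, tag, index, offset);
        return 0;
    }

Note that any two addresses that differ only in the tag bits contend for the same cache line, which is the root cause of the thrashing discussed next.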