This is at line 67 of the source file benchmarking.c. You can see that before each run, the L1 and L2 caches are flushed by a call to Xil_DCacheFlush(). Comment out this line to see the hot-cache performance. The execution time drops to around 2.67 s, demonstrating that the cache can improve performance significantly. In this example, because the latency of the PLD instructions is much longer than the computation time, not all PLD instructions take effect.
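For reference, a minimal benchmarking harness along these lines might look as follows. It assumes the Xilinx standalone BSP (xil_cache.h and xtime_l.h), and process_buffer() is a hypothetical stand-in for the algorithm under test:

    #include <stdio.h>
    #include "xil_cache.h"   /* Xil_DCacheFlush() */
    #include "xtime_l.h"     /* XTime, XTime_GetTime(), COUNTS_PER_SECOND */

    extern void process_buffer(float *buf, int len);  /* kernel under test */

    void benchmark(float *buf, int len)
    {
        XTime start, end;

        /* Flush the L1 and L2 data caches so every run starts cold;
         * comment this out to measure hot-cache performance instead. */
        Xil_DCacheFlush();

        XTime_GetTime(&start);
        process_buffer(buf, len);
        XTime_GetTime(&end);

        printf("Elapsed: %.3f s\n",
               (double)(end - start) / COUNTS_PER_SECOND);
    }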
Two additional methods for improving the cache hit rate and system performance are as follows:
• Create a preload routine for the algorithm to load some of the leading data into cache, and run it some time before the actual algorithm computation routine.
• Increase the preload advancement steps in the actual algorithm computation to make the preload continuous.
If properly tuned, these methods can achieve performance very close to that of a hot cache.
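The following sketch illustrates both ideas in a hypothetical NEON accumulation kernel. __builtin_prefetch() is the GCC builtin that emits PLD on ARM; PLD_AHEAD is an assumed tuning parameter, not a value from this application note:

    #include <arm_neon.h>

    #define PLD_AHEAD 128   /* bytes to preload ahead; tune per platform */

    /* Sums an array of floats; len is assumed to be a multiple of 4. */
    float sum_array(const float *src, int len)
    {
        float32x4_t acc = vdupq_n_f32(0.0f);

        /* Preload routine: warm the first few cache lines (32 bytes
         * each on the Cortex-A9) before the main computation starts. */
        __builtin_prefetch(src);
        __builtin_prefetch(src + 8);
        __builtin_prefetch(src + 16);

        for (int i = 0; i < len; i += 4) {
            /* Keep the preload running continuously ahead of the
             * computation by a fixed advancement distance. */
            __builtin_prefetch((const char *)(src + i) + PLD_AHEAD);
            acc = vaddq_f32(acc, vld1q_f32(src + i));
        }

        /* Horizontal add of the four accumulator lanes. */
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        return vget_lane_f32(vpadd_f32(s, s), 0);
    }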
However, to use data preloading efficiently, you must consider what the lead time should be. If the preload is done too early, the preloaded data might be evicted by other code before it is used. If it is done too late, the data might not be available in the cache when needed, which lowers system performance. The key factor is the main memory access latency. Fortunately, you do not need to write code to test it: the open source project lmbench already has test code to identify this parameter in an embedded system. For a typical Zynq device configuration (CPU running at 667 MHz, DDR3 at 533 MHz), the latency is about 60 to 80 CPU cycles. This provides adequate information about where to insert the preload routine before the actual computation routine.
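As an illustration with assumed numbers, not measurements from this application note: if the memory latency is about 70 CPU cycles and the inner loop consumes one 32-byte cache line in roughly 18 cycles, the preload must be issued about 70 / 18 ≈ 4 cache lines (128 bytes) ahead of the current computation for the data to arrive in time.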
You can also try optimizing a memcpy() written in C with data preloading. The performance boost is around 25%. This is not as significant as the results above because there is no computation to compensate for the data preload latency.
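A sketch of such a memcpy() variant is shown below. It is illustrative only: it assumes 4-byte-aligned buffers, a length that is a multiple of the 32-byte Cortex-A9 cache line, and an assumed preload distance of 128 bytes:

    #include <stddef.h>
    #include <stdint.h>

    void *memcpy_pld(void *dst, const void *src, size_t n)
    {
        uint32_t *d = (uint32_t *)dst;
        const uint32_t *s = (const uint32_t *)src;

        /* Copy one 32-byte cache line (eight words) per pass. */
        for (size_t i = 0; i < n / 4; i += 8) {
            __builtin_prefetch(&s[i] + 32);   /* preload 128 bytes ahead */
            d[i]     = s[i];
            d[i + 1] = s[i + 1];
            d[i + 2] = s[i + 2];
            d[i + 3] = s[i + 3];
            d[i + 4] = s[i + 4];
            d[i + 5] = s[i + 5];
            d[i + 6] = s[i + 6];
            d[i + 7] = s[i + 7];
        }
        return dst;
    }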
Using Tiles to Prevent Cache Thrashing
For Zynq-7000 devices, each of the two Cortex-A9 processors has separate 32 KB level-1
instruction and data caches, and both caches are 4-way set-associative. The L2 cache is an 8-way set-associative 512 KB cache shared by the two Cortex-A9 cores. These
parameters are critical to predicting when cache thrashing will occur.
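To make these numbers concrete (assuming the Cortex-A9 32-byte cache line): the 32 KB 4-way L1 data cache has 32 KB / 4 = 8 KB per way, which is 256 sets, so any two addresses whose distance is a multiple of 8 KB map to the same set. As soon as more than four such addresses are in active use, the 4-way cache can no longer hold them all, and lines are repeatedly evicted and refetched.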
Before trying to identify a solution, you need to know why the issue occurs. Start with the simplest cache implementation, a direct-mapped cache.
In a direct-mapped cache, each location in main memory maps to a single location in the cache. Figure 6 shows a simplified small cache (64 bytes) example, with four words per line and four lines. In this example, address bits [3:2] act as the offset to select a word within a cache line, and address bits [5:4] act as the index to select one of the four available cache lines. Address bits [31:6] are used as the tag value for each cache line.
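This bit-field decomposition is easy to verify in code. The short program below (a standalone illustration, not from the accompanying design files) decodes an address into tag, index, and offset for this toy cache:

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a 32-bit address for the toy direct-mapped cache of
     * Figure 6: 64 bytes total, four lines, four 4-byte words per line. */
    int main(void)
    {
        uint32_t addr = 0x00001234;            /* arbitrary example address */

        uint32_t offset = (addr >> 2) & 0x3;   /* bits [3:2]: word in line  */
        uint32_t index  = (addr >> 4) & 0x3;   /* bits [5:4]: cache line    */
        uint32_t tag    = addr >> 6;           /* bits [31:6]: tag          */

        printf("addr=0x%08x tag=0x%06x index=%u offset=%u\n",
               addr, tag, index, offset);
        return 0;
    }

Note that any two addresses that differ only in the tag bits contend for the same cache line, which is the root cause of the thrashing discussed next.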