Technical information

Boost NEON Performance by Improving Memory Access Efficiency

XAPP1206 v1.1 June 12, 2014 www.xilinx.com 26

Fully associative cache can solve this issue. With this solution, any specific location in main

memory can be stored in any cache line. However, building such a cache is impractical unless

it is very small. The control hardware can be complex and the power consumption high.

In practice, N-way set associative cache is more widely used to balance complexity and

performance. In this solution, cache is divided into N pages and each location in main memory

can be stored in N possible locations in the cache.

In fact, fully associative cache can be regarded as N-way set associative, where N is the

quotient of cache size divided by cache line size. Studies on the choice of N show performance

improvements are minimal for level 1 caches above 4-way associativity, and 8-way or 16-way

associativity are appropriate for larger level 2 caches. Figure 7 is a simple diagram showing the

structure of a 4-way set associative cache.

But even for the fully associative cache, the cache thrashing issue, though alleviated, still exists

in extreme conditions. For example, with the above code example running on a 4-way set

associative cache, when the number of source vectors exceeds four (even three, depending on

the cache allocation policy), cache thrashing occurs.

A typical scenario for this issue is accessing elements in the same column sequentially within

a two-dimensional array. The document Cortex-A Series Programmer's Guide [Ref 7] provides

an example of matrix multiplication. Below is a straightforward code example for matrix

multiplication:

for(i=0;i<N;i++)

for(j=0;j<N;j++)

for(k=0;k<N;k++)

result[i][j] += a[i][k]*b[k][j];

In this case, the contents of matrix a are accessed sequentially, but accessing matrix b

advances by row. Therefore, cache misses are likely while accessing each element in matrix b

because matrix sizes are so large that the cache cannot contain them.

To solve this issue, divide the large matrix into smaller partitions and confine the computations

within these smaller partitions. The partitions are also known as tiles. Here you assume the

data type for matrix is int. For Cortex-A9 processors, the L1 cache line is 32 bytes, so one L1

cache line can hold eight elements. In this case you can rewrite the code using 8*8 tiles to

improve cache hit rates. Below is an example of optimized code:

for (io = 0; io < N; io += 8)

for (jo = 0; jo < N; jo += 8)

for (ko = 0; ko < N; ko += 8)

for (ii = 0, rresult = &result[io][jo], ra = &a[io][ko];

X-Ref Target - Figure 7

Figure 7: Structure of a 4-way Set Associative Cache

:D\ 7DJ

6HW

,QGH[

2IIVHW

'DWD5$0 7DJ5$0

/LQH

;