Technical information
Boost NEON Performance by Improving Memory Access Efficiency
XAPP1206 v1.1 June 12, 2014 www.xilinx.com 26
Fully associative cache can solve this issue. With this solution, any specific location in main
memory can be stored in any cache line. However, building such a cache is impractical unless
it is very small. The control hardware can be complex and the power consumption high.
In practice, N-way set associative cache is more widely used to balance complexity and
performance. In this solution, cache is divided into N pages and each location in main memory
can be stored in N possible locations in the cache.
In fact, fully associative cache can be regarded as N-way set associative, where N is the
quotient of cache size divided by cache line size. Studies on the choice of N show performance
improvements are minimal for level 1 caches above 4-way associativity, and 8-way or 16-way
associativity are appropriate for larger level 2 caches. Figure 7 is a simple diagram showing the
structure of a 4-way set associative cache.
But even for the fully associative cache, the cache thrashing issue, though alleviated, still exists
in extreme conditions. For example, with the above code example running on a 4-way set
associative cache, when the number of source vectors exceeds four (even three, depending on
the cache allocation policy), cache thrashing occurs.
A typical scenario for this issue is accessing elements in the same column sequentially within
a two-dimensional array. The document Cortex-A Series Programmer's Guide [Ref 7] provides
an example of matrix multiplication. Below is a straightforward code example for matrix
multiplication:
for(i=0;i<N;i++)
for(j=0;j<N;j++)
for(k=0;k<N;k++)
result[i][j] += a[i][k]*b[k][j];
In this case, the contents of matrix a are accessed sequentially, but accessing matrix b
advances by row. Therefore, cache misses are likely while accessing each element in matrix b
because matrix sizes are so large that the cache cannot contain them.
To solve this issue, divide the large matrix into smaller partitions and confine the computations
within these smaller partitions. The partitions are also known as tiles. Here you assume the
data type for matrix is int. For Cortex-A9 processors, the L1 cache line is 32 bytes, so one L1
cache line can hold eight elements. In this case you can rewrite the code using 8*8 tiles to
improve cache hit rates. Below is an example of optimized code:
for (io = 0; io < N; io += 8)
for (jo = 0; jo < N; jo += 8)
for (ko = 0; ko < N; ko += 8)
for (ii = 0, rresult = &result[io][jo], ra = &a[io][ko];
X-Ref Target - Figure 7
Figure 7: Structure of a 4-way Set Associative Cache
:D\ 7DJ
6HW
,QGH[
2IIVHW
'DWD5$0 7DJ5$0
/LQH
;