Technical data

Cache Effects
It is good policy to write loops that take the effect of the cache into account,
with or without parallelism. The technique for the best cache performance is
also quite simple: make the loop step through the array in the same way that
the array is laid out in memory. Fortran stores arrays in column-major order,
so this means stepping through the array without any gaps (unit stride) and
with the leftmost subscript varying the fastest.
Note that this optimization does not depend on multiprocessing, nor is it
required in order for multiprocessing to work correctly. However,
multiprocessing can affect how the cache is used, so the interaction between
the two is worth understanding.
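The stepping rule above can be illustrated with a minimal sketch. The program name STRIDE, the size N = 500, and the variables S1 and S2 are assumptions made for illustration only. Both loop nests compute the same sum; they differ only in the order in which memory is visited.

```fortran
C     Sketch only: STRIDE, N, S1 and S2 are illustrative names.
      PROGRAM STRIDE
      INTEGER N
      PARAMETER (N = 500)
      DOUBLE PRECISION A(N,N), S1, S2
      INTEGER I, J
C     Fill the array in memory order (leftmost subscript fastest).
      DO J = 1, N
         DO I = 1, N
            A(I,J) = 1.0D0
         END DO
      END DO
C     Cache-friendly: the inner loop walks down a column, so
C     consecutive iterations touch adjacent memory locations.
      S1 = 0.0D0
      DO J = 1, N
         DO I = 1, N
            S1 = S1 + A(I,J)
         END DO
      END DO
C     Cache-unfriendly: the inner loop walks along a row, so
C     consecutive iterations are N elements apart in memory.
      S2 = 0.0D0
      DO I = 1, N
         DO J = 1, N
            S2 = S2 + A(I,J)
         END DO
      END DO
      PRINT *, S1, S2
      END
```

The two sums are equal. Any difference in run time between the two nests depends on the machine, the cache size, and the compiler, which may itself interchange the loops when optimizing.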
Example 1: Matrix Multiply
      DO I = 1, N
         DO K = 1, N
            DO J = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO
This is the same as Example 1 in Work Quantum on page 90. To get the best
cache performance, the I loop should be innermost, because I is the leftmost
subscript of both A and B. At the same time, to get the best multiprocessing
performance, the outermost loop should be parallelized. For this example, you
can interchange the I and J loops and get the best of both optimizations:
C$DOACROSS LOCAL(I, J, K)
      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO
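As a hedged check that the interchange does not change the result, the following sketch runs both loop orders on the same data and counts differing elements. The program name MMSWAP, the size N = 50, the arrays A1 and A2, and the fill values are assumptions chosen for illustration; small integer-valued data is used so that every sum is exact.

```fortran
C     Sketch only: MMSWAP, N, A1, A2 and the fill values are
C     illustrative assumptions, not part of the original example.
      PROGRAM MMSWAP
      INTEGER N
      PARAMETER (N = 50)
      DOUBLE PRECISION A1(N,N), A2(N,N), B(N,N), C(N,N)
      INTEGER I, J, K, NBAD
C     Small integer-valued data keeps every sum exact.
      DO J = 1, N
         DO I = 1, N
            A1(I,J) = 0.0D0
            A2(I,J) = 0.0D0
            B(I,J) = DBLE(I + J)
            C(I,J) = DBLE(I - J)
         END DO
      END DO
C     Original order: I outermost, J innermost.
      DO I = 1, N
         DO K = 1, N
            DO J = 1, N
               A1(I,J) = A1(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO
C     Interchanged order: J outermost, I innermost. A compiler
C     that does not recognize the directive below treats it as
C     an ordinary comment line.
C$DOACROSS LOCAL(I, J, K)
      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               A2(I,J) = A2(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO
C     Count the elements that differ between the two results.
      NBAD = 0
      DO J = 1, N
         DO I = 1, N
            IF (A1(I,J) .NE. A2(I,J)) NBAD = NBAD + 1
         END DO
      END DO
      PRINT *, 'MISMATCHES:', NBAD
      END
```

The interchange is safe to parallelize because each iteration of the outer J loop writes only column J of the result, so no two iterations update the same element.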