Cache Effects

It is good policy to write loops that take the effect of the cache into account,
with or without parallelism. The technique for the best cache performance is
also quite simple: make the loop step through the array in the same way that
the array is laid out in memory. For Fortran, this means stepping through the
array without any gaps and with the leftmost subscript varying the fastest.
Note that this optimization does not depend on multiprocessing, nor is it
required in order for multiprocessing to work correctly. However,
multiprocessing can affect how the cache is used, so it is worthwhile to
understand.
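
As a minimal sketch of this rule, the program below contrasts the two traversal orders on a two-dimensional array. The program name, the array X, and the size N are illustrative assumptions and do not come from the examples in this section.

```fortran
      PROGRAM STRIDE
!     Illustrative sketch only: X and N are hypothetical names.
      INTEGER N
      PARAMETER (N = 512)
      REAL X(N,N)
      INTEGER I, J
!     Cache-friendly: the leftmost subscript I varies fastest, so
!     successive iterations touch adjacent memory locations.
      DO J = 1, N
         DO I = 1, N
            X(I,J) = 0.0
         END DO
      END DO
!     Cache-unfriendly: the inner loop varies J, so successive
!     iterations are N elements apart in memory.
      DO I = 1, N
         DO J = 1, N
            X(I,J) = X(I,J) + 1.0
         END DO
      END DO
      PRINT *, X(1,1), X(N,N)
      END
```

Both nests visit every element exactly once; only the order of the memory accesses differs. The first nest walks down each column of X in storage order, while the second jumps a full column length between consecutive accesses.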

Example 1: Matrix Multiply

      DO I = 1, N
         DO K = 1, N
            DO J = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO

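This is the same as Example 1 in “Work Quantum” on page 90. To get the best
cache performance, the I loop should be innermost, because I is the leftmost
subscript of both A and B. At the same time, to get the best multiprocessing
performance, the outermost loop should be parallelized. For this example, you
can interchange the I and J loops and get the best of both optimizations. Each
J iteration updates a different column of A, so the iterations of the outer
loop are independent of one another: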

C$DOACROSS LOCAL(I, J, K)
      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               A(I,J) = A(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO
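
One way to check that the interchange leaves the result unchanged is to run both loop orders on the same data and compare. The following is a sketch under stated assumptions: the program name, the arrays A1 and A2, the size N = 4, and the fill values are all hypothetical. The parallel directive is left out so that the sketch runs serially on any compiler.

```fortran
      PROGRAM INTCHG
!     Hypothetical test harness: N and the fill values are assumed.
      INTEGER N
      PARAMETER (N = 4)
      REAL A1(N,N), A2(N,N), B(N,N), C(N,N)
      REAL DIFF
      INTEGER I, J, K
!     Fill B and C with small whole numbers; clear both results.
      DO J = 1, N
         DO I = 1, N
            A1(I,J) = 0.0
            A2(I,J) = 0.0
            B(I,J) = REAL(I + J)
            C(I,J) = REAL(I - J)
         END DO
      END DO
!     Original order: I outermost, J innermost (strided access).
      DO I = 1, N
         DO K = 1, N
            DO J = 1, N
               A1(I,J) = A1(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO
!     Interchanged order: J outermost, I innermost (stride 1).
      DO J = 1, N
         DO K = 1, N
            DO I = 1, N
               A2(I,J) = A2(I,J) + B(I,K) * C(K,J)
            END DO
         END DO
      END DO
!     Largest absolute difference between the two results.
      DIFF = 0.0
      DO J = 1, N
         DO I = 1, N
            DIFF = MAX(DIFF, ABS(A1(I,J) - A2(I,J)))
         END DO
      END DO
      PRINT *, 'max difference: ', DIFF
      END
```

The reported difference should be zero: in both nests each element A(I,J) accumulates its K terms in the same order, so only the order in which the elements are visited changes.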