Chapter 5: Fortran Enhancements for Multiprocessors
This can be parallelized on the I loop. Because the inner loop goes from 1 to
I, the first block of iterations of the outer loop will end long before the last
block of iterations of the outer loop.
In this example, this is easy to see and predictable, so you can change the
program:
      NUM_THREADS = MP_NUMTHREADS()
C$DOACROSS LOCAL(I, J, K)
      DO K = 1, NUM_THREADS
         DO I = K, N, NUM_THREADS
            DO J = 1, I
               A(J, I) = A(J, I) + B(J)*C(I)
            END DO
         END DO
      END DO
This rewritten version breaks the I loop into interleaved blocks instead of
contiguous blocks. Thus, each execution thread receives
some small values of I and some large values of I, giving a better balance of
work between the threads. Interleaving usually, but not always, helps cure a
load balancing problem.
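
The imbalance can be checked with a little arithmetic. The following is a minimal sketch, not part of the original example: it counts the inner-loop iterations each thread would execute when the I loop is split into contiguous blocks and when it is interleaved. The values N = 1000 and NT = 4 threads, and all variable names, are assumptions chosen for illustration. The program is serial and uses no multiprocessing directives.

```fortran
      PROGRAM BALANC
C     Count inner-loop iterations per thread for the triangular loop
C     (DO I = 1, N / DO J = 1, I) under two ways of dividing I.
C     N and NT are illustrative values, not taken from the text.
      INTEGER N, NT
      PARAMETER (N = 1000, NT = 4)
      INTEGER K, I, BLK, LO, HI, WBLOCK, WINTER
C     Size of each contiguous block, rounded up.
      BLK = (N + NT - 1) / NT
      DO K = 1, NT
C        Contiguous block: thread K gets I = LO..HI.
         LO = (K - 1)*BLK + 1
         HI = MIN(K*BLK, N)
         WBLOCK = 0
         DO I = LO, HI
            WBLOCK = WBLOCK + I
         END DO
C        Interleaved: thread K gets I = K, K+NT, K+2*NT, ...
         WINTER = 0
         DO I = K, N, NT
            WINTER = WINTER + I
         END DO
         PRINT *, 'thread', K, ' block:', WBLOCK,
     &            ' interleaved:', WINTER
      END DO
      END
```

With these values, the contiguous split gives the four threads 31375, 93875, 156375, and 218875 inner iterations, so the last thread does about seven times the work of the first. The interleaved split gives 124750, 125000, 125250, and 125500, which differ by less than one percent.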
Because this transformation is often desirable, you can have it done
automatically by using the MP_SCHEDTYPE clause.
C$DOACROSS LOCAL(I, J), MP_SCHEDTYPE=INTERLEAVE
      DO 20 I = 1, N
         DO 10 J = 1, I
            A(J, I) = A(J, I) + B(J)*C(I)
 10      CONTINUE
 20   CONTINUE
This has the same meaning as the rewritten form above.
Note that interleaving can cause poor cache performance because each thread
is no longer stepping through the array at stride 1. This can be somewhat
improved by adding a CHUNK clause. CHUNK=4 or CHUNK=8 is often a good choice.
Each small chunk will have stride 1 to improve cache performance, while the
chunks are interleaved to improve load balancing.
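
As an illustration, the interleaved loop with a chunk size of 4 might be written as in the sketch below. It assumes that CHUNK is given as one more comma-separated clause on the same C$DOACROSS line, following the pattern of the MP_SCHEDTYPE clause. The program name, the array size N = 64, and the initial values are hypothetical and serve only to make the example complete. A compiler that does not recognize the directive treats that line as a comment and runs the loop serially.

```fortran
      PROGRAM CHUNKD
C     Sketch only: N and the initial values are illustrative.
      INTEGER N
      PARAMETER (N = 64)
      REAL A(N, N), B(N), C(N)
      INTEGER I, J
C     Give the arrays known starting values.
      DO I = 1, N
         B(I) = 1.0
         C(I) = 2.0
         DO J = 1, N
            A(J, I) = 0.0
         END DO
      END DO
C     Interleave chunks of 4 consecutive I values across the threads.
C     The CHUNK=4 clause placement is an assumption (see text above).
C$DOACROSS LOCAL(I, J), MP_SCHEDTYPE=INTERLEAVE, CHUNK=4
      DO I = 1, N
         DO J = 1, I
            A(J, I) = A(J, I) + B(J)*C(I)
         END DO
      END DO
      PRINT *, A(1, 1), A(N, N)
      END
```

With four threads, the first thread would be expected to receive I = 1 to 4, 17 to 20, 33 to 36, and so on. Within each chunk the thread works on adjacent columns of A, while the chunks themselves are still spread across small and large values of I.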










