Chapter 5: Fortran Enhancements for Multiprocessors

This can be parallelized on the I loop. Because the inner loop goes from 1 to I, the first block of iterations of the outer loop finishes long before the last block.

In this example, the imbalance is easy to see and predict, so you can change the program by hand:

      NUM_THREADS = MP_NUMTHREADS()
C$DOACROSS LOCAL(I, J, K)
      DO K = 1, NUM_THREADS
        DO I = K, N, NUM_THREADS
          DO J = 1, I
            A(J, I) = A(J, I) + B(J)*C(I)
          END DO
        END DO
      END DO

In this rewritten version, the I loop is broken into interleaved blocks instead of contiguous blocks. Each execution thread therefore receives some small values of I and some large values of I, which balances the work better between the threads. Interleaving usually, but not always, helps cure a load balancing problem.
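
The effect of interleaving on the triangular loop nest can be checked with a small serial program. The following is a minimal sketch, not part of the multiprocessing library: the function name IWORK and the values N = 16 and NT = 4 are illustrative assumptions, and the contiguous case assumes NT divides N evenly. It counts how many inner-loop iterations each thread would execute under each way of dividing the I loop.

```fortran
C     Count inner-loop iterations per thread for the nest
C        DO I = 1, N / DO J = 1, I
C     N = 16 and NT = 4 are illustrative values only.
      PROGRAM BALNCE
      INTEGER N, NT
      PARAMETER (N = 16, NT = 4)
      INTEGER K, IWORK
      EXTERNAL IWORK
      DO K = 1, NT
         PRINT *, 'thread', K, ' contiguous:', IWORK(K, N, NT, 0),
     &            ' interleaved:', IWORK(K, N, NT, 1)
      END DO
      END

C     Hypothetical helper: work done by thread K (1-based).
C     MODE = 0: contiguous blocks of N/NT iterations of I
C               (assumes NT divides N).
C     MODE = 1: interleaved, I = K, K+NT, K+2*NT, ...
      INTEGER FUNCTION IWORK(K, N, NT, MODE)
      INTEGER K, N, NT, MODE
      INTEGER I, LEN
      IWORK = 0
      IF (MODE .EQ. 0) THEN
         LEN = N / NT
         DO I = (K - 1)*LEN + 1, K*LEN
C           The inner J loop runs I times for this value of I.
            IWORK = IWORK + I
         END DO
      ELSE
         DO I = K, N, NT
            IWORK = IWORK + I
         END DO
      END IF
      END
```

With these values the contiguous split gives the four threads 10, 26, 42, and 58 inner iterations, while the interleaved split gives 28, 32, 36, and 40. The total is 136 in both cases, but the busiest thread does 40 iterations instead of 58.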

Because this transformation is so often desirable, you can have it applied automatically by using the MP_SCHEDTYPE clause.

C$DOACROSS LOCAL(I, J), MP_SCHEDTYPE=INTERLEAVE
      DO 20 I = 1, N
        DO 10 J = 1, I
          A(J, I) = A(J, I) + B(J)*C(I)
   10   CONTINUE
   20 CONTINUE

This has the same meaning as the rewritten form above.

Note that interleaving can cause poor cache performance because each thread no longer steps through the array at stride 1. Adding a CHUNK clause can improve this somewhat; CHUNK=4 or CHUNK=8 is often a good choice. Each small chunk then has stride 1, which improves cache performance, while the chunks themselves are interleaved to improve load balancing.
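
As a sketch of how the two clauses might be combined, the subroutine below adds CHUNK=4 to the interleaved directive. The exact spelling of the combined directive is an assumption based on the clause names used in this section; a compiler that does not recognize C$DOACROSS treats the line as an ordinary comment, so the code still compiles and runs serially. The function IOWNER is a hypothetical helper, not a library routine, that models which thread would execute a given iteration if chunks are dealt out to threads in round-robin order.

```fortran
C     Sketch: interleaved scheduling in chunks of 4 iterations of I.
C     The combined clause spelling is assumed from the text above.
      SUBROUTINE UPDATE(A, B, C, N)
      INTEGER N, I, J
      REAL A(N, N), B(N), C(N)
C$DOACROSS LOCAL(I, J), MP_SCHEDTYPE=INTERLEAVE, CHUNK=4
      DO I = 1, N
         DO J = 1, I
            A(J, I) = A(J, I) + B(J)*C(I)
         END DO
      END DO
      END

C     Hypothetical helper: thread (1..NT) that would run iteration I
C     when chunks of ICHUNK iterations are assigned round-robin.
      INTEGER FUNCTION IOWNER(I, NT, ICHUNK)
      INTEGER I, NT, ICHUNK
      IOWNER = MOD((I - 1)/ICHUNK, NT) + 1
      END
```

Under this model, with 4 threads and a chunk size of 4, iterations 1 through 4 go to thread 1, iterations 5 through 8 to thread 2, and iterations 17 through 20 return to thread 1. Each thread therefore works on four adjacent values of I at a time and still receives a mix of small and large values.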