Chapter 5: Fortran Enhancements for Multiprocessors

Example 2: Trade-Offs

Sometimes you must choose between the possible optimizations and their costs. Look at the following code segment:

      DO J = 1, N
         DO I = 1, M
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

This loop can be parallelized on I but not on J, because every iteration of J updates the same elements A(1) through A(M). You could interchange the loops to put I on the outside, thus getting a bigger work quantum:

C$DOACROSS LOCAL(I,J)
      DO I = 1, M
         DO J = 1, N
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

However, putting J on the inside means that you will step through the C array in the wrong direction. Fortran stores arrays in column-major order, so the leftmost subscript should be the one that varies the fastest. It is possible to parallelize the I loop where it stands:

      DO J = 1, N
C$DOACROSS LOCAL(I)
         DO I = 1, M
            A(I) = A(I) + B(J)*C(I,J)
         END DO
      END DO

but M needs to be large for the work quantum to outweigh the overhead of starting the parallel loop on each of the N passes through the J loop.

In this particular example, A(I) is used to do a sum reduction, and it is possible to use the reduction techniques shown in Example 4 of “Breaking Data Dependencies” on page 85 to rewrite this in a parallel form. (Recall that there is no support for an entire array as a member of the REDUCTION clause on a DOACROSS.) However, that involves converting array A from a one-dimensional array to a two-dimensional array to hold the partial sums; this is analogous to the way the scalar summation variable was converted into an array of partial sums in that example.
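
The partial-sum rewrite described above might look like the following minimal sketch. It is an illustration under stated assumptions, not the code from Example 4. The names PART, NP, CHUNK, JLO, and JHI are hypothetical, and the number of pieces NP is fixed at compile time instead of being taken from the number of threads. The J range is split into NP pieces, each piece accumulates into its own column of PART, and a short serial loop then adds the columns into A. The directive is shown as an ordinary comment so that the program compiles on any Fortran compiler. On a compiler that supports DOACROSS, it would presumably be written starting in column 1 as C$DOACROSS.

```fortran
      PROGRAM PSUM
      IMPLICIT NONE
      INTEGER M, N, NP
      PARAMETER (M = 4, N = 10, NP = 3)
      REAL A(M), AREF(M), B(N), C(M,N), PART(M,NP)
      INTEGER I, J, P, CHUNK, JLO, JHI
      LOGICAL OK

! Fill the arrays with small whole numbers so every sum is exact.
      DO J = 1, N
         B(J) = REAL(J)
         DO I = 1, M
            C(I,J) = REAL(I + J)
         END DO
      END DO
      DO I = 1, M
         A(I) = 1.0
         AREF(I) = 1.0
      END DO

! Serial reference: the original loop nest from the text.
      DO J = 1, N
         DO I = 1, M
            AREF(I) = AREF(I) + B(J)*C(I,J)
         END DO
      END DO

! Partial sums: piece P owns column P of PART (hypothetical name),
! so the iterations of the P loop do not write to shared elements.
      CHUNK = (N + NP - 1) / NP
! C$DOACROSS LOCAL(P, J, I, JLO, JHI)  (assumed directive placement)
      DO P = 1, NP
         JLO = (P - 1)*CHUNK + 1
         JHI = MIN(P*CHUNK, N)
         DO I = 1, M
            PART(I,P) = 0.0
         END DO
         DO J = JLO, JHI
! I is still the inner loop, so C is read with stride 1.
            DO I = 1, M
               PART(I,P) = PART(I,P) + B(J)*C(I,J)
            END DO
         END DO
      END DO

! Serial combine: add the NP partial sums into A.
      DO P = 1, NP
         DO I = 1, M
            A(I) = A(I) + PART(I,P)
         END DO
      END DO

! Compare against the serial reference.
      OK = .TRUE.
      DO I = 1, M
         IF (A(I) .NE. AREF(I)) OK = .FALSE.
      END DO
      IF (OK) THEN
         PRINT *, 'partial sums match'
      ELSE
         PRINT *, 'MISMATCH'
      END IF
      END
```

In this sketch the parallel loop runs over the NP pieces, so each iteration carries roughly N/NP passes of the inner loop, and C is still traversed along its leftmost subscript. The cost is the extra M-by-NP array and the final combining loop. With real data, the partial sums are added in a different order than in the serial loop, so results may differ by rounding.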