Technical data

Cache Effects

If A is large, however, that may take more memory than you can spare.

      NUM = MP_NUMTHREADS()
      IPIECE = (N + (NUM-1)) / NUM
C     Each K sums its own block of J into column K of PARTIAL_A.
C$DOACROSS LOCAL(K,J,I)
      DO K = 1, NUM
        DO J = K*IPIECE - IPIECE + 1, MIN(N, K*IPIECE)
          DO I = 1, M
            PARTIAL_A(I,K) = PARTIAL_A(I,K) + B(J)*C(I,J)
          END DO
        END DO
      END DO
C     Combine the NUM partial columns into the final result A.
C$DOACROSS LOCAL (I,K)
      DO I = 1, M
        DO K = 1, NUM
          A(I) = A(I) + PARTIAL_A(I,K)
        END DO
      END DO

You must trade off the various possible optimizations to find the
combination that is right for the particular job.
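
The block bounds used in the first loop nest of the partial-sum example
can be checked in isolation. The following is a minimal sketch and not
part of the original example: BLKBND is a hypothetical helper name, and
it only repeats the IPIECE arithmetic so that the range of J given to
each value of K can be inspected.

```fortran
C     Sketch only: BLKBND is a hypothetical helper name.
C     It returns the first and last J handled by block K when
C     N iterations are split into NUM contiguous pieces.
      SUBROUTINE BLKBND(N, NUM, K, JLO, JHI)
      INTEGER N, NUM, K, JLO, JHI, IPIECE
C     Integer division rounds N/NUM up to the next whole number.
      IPIECE = (N + (NUM-1)) / NUM
      JLO = K*IPIECE - IPIECE + 1
C     MIN keeps the last block from running past N.
      JHI = MIN(N, K*IPIECE)
      END
```

For N = 10 and NUM = 4, IPIECE is 3 and the four blocks cover J = 1 to 3,
4 to 6, 7 to 9, and 10. For N = 9 and NUM = 4, the fourth block gets
JLO = 10 and JHI = 9, so under standard Fortran 77 DO semantics its loop
executes zero times.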

Load Balancing

When the Fortran compiler divides a loop into pieces, by default it uses the
simple method of separating the iterations into contiguous blocks of equal
size for each process. It can happen that some iterations take significantly
longer to complete than other iterations. At the end of a parallel region, the
program waits for all processes to complete their tasks. If the work is not
divided evenly, time is wasted waiting for the slowest process to finish.

Example:

      DO I = 1, N
        DO J = 1, I
          A(J, I) = A(J, I) + B(J)*C(I)
        END DO
      END DO
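
To see how uneven the default division can be for a loop like this one,
the inner iterations performed by each block can be counted. The sketch
below is illustrative only: TRIWRK is a hypothetical helper name, and it
assumes that the outer I loop is the one divided among processes and that
the block size is the same rounded-up quotient used for IPIECE earlier.

```fortran
C     Sketch only: TRIWRK is a hypothetical helper name.
C     It counts the inner-loop iterations done by block K when
C     the outer I loop (1 to N) is split into NUM contiguous
C     blocks of equal size (an assumption, see the text).
      INTEGER FUNCTION TRIWRK(N, NUM, K)
      INTEGER N, NUM, K, IPIECE, I, J
      IPIECE = (N + (NUM-1)) / NUM
      TRIWRK = 0
      DO I = K*IPIECE - IPIECE + 1, MIN(N, K*IPIECE)
C       Iteration I of the outer loop runs the inner loop I times.
        DO J = 1, I
          TRIWRK = TRIWRK + 1
        END DO
      END DO
      END
```

With N = 8 and NUM = 4, the four blocks perform 3, 7, 11, and 15 inner
iterations. Every block holds two values of I, yet the last process does
five times the work of the first.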