Cache Effects
If A is large, however, that may take more memory than you can spare.
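C     PARTIAL_A(M,NUM) is assumed to be zero on entry.
C     Each thread K sums its own block of J into its own column.
      NUM = MP_NUMTHREADS()
      IPIECE = (N + (NUM-1)) / NUM
C$DOACROSS LOCAL(K,J,I)
      DO K = 1, NUM
         DO J = K*IPIECE - IPIECE + 1, MIN(N, K*IPIECE)
            DO I = 1, M
               PARTIAL_A(I,K) = PARTIAL_A(I,K) + B(J)*C(I,J)
            END DO
         END DO
      END DO
C     Combine the per-thread partial sums into A.
C$DOACROSS LOCAL(I,K)
      DO I = 1, M
         DO K = 1, NUM
            A(I) = A(I) + PARTIAL_A(I,K)
         END DO
      END DO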
You must trade off the various possible optimizations to find the
combination that is right for the particular job.
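The block bounds used in the loop above can be checked in isolation. The following is a minimal sketch, not part of the original example: the program name PIECES, the variables JLO and JHI, and the values N = 10 and NUM = 4 are assumptions chosen only for illustration. With these values IPIECE is 3, and the four blocks cover J = 1 to 3, 4 to 6, 7 to 9, and 10 to 10.

```fortran
      PROGRAM PIECES
! Hypothetical sizes chosen only for illustration.
      INTEGER N, NUM, IPIECE, K, JLO, JHI
      N = 10
      NUM = 4
! Ceiling division: the smallest block size for which NUM blocks
! cover all N iterations.
      IPIECE = (N + (NUM-1)) / NUM
      DO K = 1, NUM
! Same bounds as the J loop in the blocked example.
         JLO = K*IPIECE - IPIECE + 1
         JHI = MIN(N, K*IPIECE)
         PRINT *, K, JLO, JHI
      END DO
      END
```

The MIN keeps the last block from running past N when N is not a multiple of NUM.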
Load Balancing
When the Fortran compiler divides a loop into pieces, by default it uses the
simple method of separating the iterations into contiguous blocks of equal
size for each process. It can happen that some iterations take significantly
longer to complete than other iterations. At the end of a parallel region, the
program waits for all processes to complete their tasks. If the work is not
divided evenly, time is wasted waiting for the slowest process to finish.
Example:
      DO I = 1, N
         DO J = 1, I
            A(J, I) = A(J, I) + B(J)*C(I)
         END DO
      END DO
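To see how uneven the work in this loop can be, you can count the inner-loop iterations that fall in each contiguous block of I. The following is a rough sketch under stated assumptions: it assumes the I loop is the one run in parallel, it approximates the default division with equal-sized contiguous blocks, and the program name IMBAL, the variables ILO, IHI, and WORK, and the values N = 1000 and NUM = 4 are hypothetical.

```fortran
      PROGRAM IMBAL
! N and NUM are hypothetical values chosen only for illustration.
      INTEGER N, NUM, IPIECE, K, I, ILO, IHI, WORK
      N = 1000
      NUM = 4
! Assumed approximation of the default division: contiguous
! blocks of equal size, one per process.
      IPIECE = (N + (NUM-1)) / NUM
      DO K = 1, NUM
         ILO = K*IPIECE - IPIECE + 1
         IHI = MIN(N, K*IPIECE)
         WORK = 0
! Iteration I of the outer loop runs the inner J loop I times.
         DO I = ILO, IHI
            WORK = WORK + I
         END DO
! Prints: block number, first I, last I, inner iterations.
         PRINT *, K, ILO, IHI, WORK
      END DO
      END
```

With these values the four blocks perform 31375, 93875, 156375, and 218875 inner iterations, so the last block has about seven times the work of the first, and the other processes wait while it finishes.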










