Technical data

Parallel Programming Exercise
Multiprocessing has helped very little compared with the single-process run
of the modified code: the program is running slower than the original. What
happened? The cycle counts tell the story. The routine calc_ is what remains
of the original routine after the C$DOACROSS loop _calc_88_aaaa is
extracted (refer to Loop Transformation on page 104 for details about loop
naming conventions). calc_ still takes nearly 70 percent of the time of the
original. When you pulled the code for FORCE into a separate loop, you had
to remove too much from the loop. The serial part is still too large.
Additionally, there seems to be a load-balancing problem. The master is
spending a large fraction of its time waiting for the slave to complete. But
even if the load were perfectly balanced, there would still be the 30 percent
additional work of the multiprocessed version. Trying to fix the load
balancing right now will not solve the general problem.
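A rough estimate shows why the serial part dominates. It assumes that the "nearly 70 percent" figure is the serial fraction s of the original run time, that there are p = 2 processors (one master and one slave), and that there is no multiprocessing overhead or load imbalance. These are assumptions for illustration, not measurements from the exercise.

```latex
\frac{T_p}{T_1} = s + \frac{1 - s}{p} = 0.7 + \frac{0.3}{2} = 0.85,
\qquad
\text{speedup} = \frac{T_1}{T_p} = \frac{1}{0.85} \approx 1.18
```

Even under these ideal conditions the parallel run would save only about 15 percent of the original time, so any extra work or waiting can easily erase the gain.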
Regroup and Attack Again
Now is the time to try a different approach. If the first attempt does not give
precisely the desired result, regroup and attack from a new direction.
Repeat Step 3: Analyze
At this point, round-off errors might not be so terrible. Perhaps you can try
to adapt the sum reduction technique to the original code.
Although the calculations on FORCE are not quite the same as a sum
reduction, you can use the same technique: give the reduction variable one
extra dimension so that each thread gets its own separate memory location.
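
The following is a minimal sketch of that idea, not the exercise's actual code. The routine name FSUM, the arrays PAIRI, PAIRJ, F, and PARTIAL, and the thread count NTHR are hypothetical names chosen for illustration. It assumes each interaction P adds F(P) to one atom and subtracts it from another. The outer K loop stands in for the threads: iteration K writes only to column K of PARTIAL, so the iterations are independent.

```fortran
C     Sketch only. FSUM, NTHR, PARTIAL, PAIRI, PAIRJ and F are
C     hypothetical names, not taken from the exercise source.
      SUBROUTINE FSUM(NATOMS, NPAIRS, NTHR, PAIRI, PAIRJ, F,
     &                PARTIAL, FORCE)
      INTEGER NATOMS, NPAIRS, NTHR
      INTEGER PAIRI(NPAIRS), PAIRJ(NPAIRS)
      REAL F(NPAIRS), FORCE(NATOMS)
C     The extra (second) dimension gives each thread its own column.
      REAL PARTIAL(NATOMS, NTHR)
      INTEGER I, K, P
C     Iteration K clears column K, then accumulates every NTHR-th
C     interaction into that column only.
C$DOACROSS LOCAL(K, I, P),
C$&        SHARE(PARTIAL, PAIRI, PAIRJ, F, NATOMS, NPAIRS, NTHR)
      DO 30 K = 1, NTHR
         DO 10 I = 1, NATOMS
            PARTIAL(I, K) = 0.0
   10    CONTINUE
         DO 20 P = K, NPAIRS, NTHR
            PARTIAL(PAIRI(P), K) = PARTIAL(PAIRI(P), K) + F(P)
            PARTIAL(PAIRJ(P), K) = PARTIAL(PAIRJ(P), K) - F(P)
   20    CONTINUE
   30 CONTINUE
C     Serial step: fold the per-thread columns into FORCE.
      DO 50 I = 1, NATOMS
         FORCE(I) = 0.0
         DO 40 K = 1, NTHR
            FORCE(I) = FORCE(I) + PARTIAL(I, K)
   40    CONTINUE
   50 CONTINUE
      RETURN
      END
```

Because the contributions are added in a different order than in the serial loop, the result can differ from the serial result by round-off. The final combining loop is serial, but it does only NATOMS times NTHR additions.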