User's guide

Cray XMT Programming Environment User's Guide
to fields of the aggregate inside the loop will be replaced with the temporaries. This
can be useful if scalar replacement is unsafe or undesirable for portions of a routine,
but needed to achieve good performance in specific loops. The loop variant can also
be used to achieve parallelization of the loop in the previous example:
        | #pragma mta no inline
        | void doit(int *c) {
        |   int i;
        |
        | #pragma mta assert noalias *this
        | #pragma mta assert loop can replace *this
        |   for (i = 1; i < n; ++i) {
    5 L |     b[i] = b[i-1] + c[i-1];
** scalar replacing *this
        |   }
        | };
        | };
The exact syntax of these pragmas is described in Appendix C.3 of Cray XMT
Programming Environment User's Guide.
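The effect of loop-level scalar replacement can be pictured by doing the same
transformation by hand. The following is a minimal sketch, not compiler output: the
struct name Accumulator and the temporaries n_tmp and b_tmp are hypothetical, and
only the fields n and b and the loop body come from the listing above. The sketch
assumes that nothing written inside the loop overlaps the object itself, which is
the condition the noalias assertion states.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical aggregate standing in for the class in the listing above.
// Only the field names n and b are taken from the listing.
struct Accumulator {
    int n;   // number of elements in b
    int *b;  // array updated by the loop, length n

    // Hand-written equivalent of loop-level scalar replacement:
    // the fields are read into temporaries once, before the loop,
    // and the loop body refers only to the temporaries.
    void doit(const int *c) {
        assert(b != 0 && c != 0);
        const int n_tmp = n;    // temporary for this->n
        int *const b_tmp = b;   // temporary for this->b
        for (int i = 1; i < n_tmp; ++i) {
            b_tmp[i] = b_tmp[i - 1] + c[i - 1];
        }
    }
};
```

If a store through b could modify n or b themselves, reading the fields once before
the loop would change the program's behavior, which is why the replacement is
requested only together with an assertion that such aliasing does not occur.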
9.2 Optimizing Calls to memcpy and memset
The compiler option -enable_memcmd_opt enables a compiler optimization that
replaces calls to memcpy/memset with versions of the functions that were built for
the current parallel mode, which the compiler can inline. This allows the compiler
to potentially merge the parallel region in the memory routine with any surrounding
parallel region, which can reduce the cost of having to tear down and restart parallel
regions in order to call memcpy or memset. However, when this optimization is
enabled and these functions are called from within a parallel loop, this creates nested
parallel regions. The result is a potentially significant performance degradation.
The compiler flag -disable_memcmd_opt was added to disable this optimization
when it causes performance problems, such as the case described above. However,
because these functions may be called indirectly, it is not always easy to
determine that a call to memcpy or memset is causing a performance problem. For
example, this can happen if a program calls a function in the C++ STL that calls
memcpy. For this reason, the compiler disables this optimization by default and
lets users enable it with the option -enable_memcmd_opt. Use this option only when
you know there is no risk of memcpy or memset being called from within a parallel
loop.
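An indirect call of this kind can be hard to spot in source code. The following
sketch is illustrative only: the function name copy_all_rows is hypothetical, and
whether a given standard library implements the vector assignment with memcpy or
memmove is an assumption that depends on the library. It shows a loop that contains
no visible call to memcpy but may still reach one on every iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies every row of src into dst. No memcpy appears in this source, but a
// standard library may implement the vector assignment below with memcpy or
// memmove when the element type is trivially copyable.
void copy_all_rows(const std::vector<std::vector<int> > &src,
                   std::vector<std::vector<int> > &dst) {
    assert(src.size() == dst.size());
    for (std::size_t r = 0; r < src.size(); ++r) {
        dst[r] = src[r];  // possible hidden memcpy inside a candidate parallel loop
    }
}
```

If the loop over rows runs in parallel and the optimization is enabled, each hidden
copy could open its own parallel region inside the outer one, which is the nested
case described above.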
For additional control over the parallelism used by memcpy or memset, you can
directly call versions of these functions that use single-stream, single-processor,
or multiprocessor parallelism. The memcpy functions are called memcpy_ss,
memcpy_sp, and memcpy_mp, respectively. The corresponding memset functions are
called memset_ss, memset_sp, and memset_mp. These functions are declared in
string.h and are documented in the memcpy(3) and memset(3) man pages.
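One way to use the explicit variants is to choose the single-stream version for
copies made inside a loop that is itself expected to run in parallel, and the
multiprocessor version for one large copy made from serial code. The following is a
sketch under stated assumptions, not a definitive usage: it assumes that memcpy_ss
and memcpy_mp take the same arguments as memcpy (check the memcpy(3) man page) and
that the XMT compiler predefines the macro __MTA__. The wrapper and function names
are hypothetical. On other systems the wrappers fall back to plain memcpy so that
the sketch still builds.

```cpp
#include <cassert>
#include <cstddef>
#include <string.h>

// Wrapper for a single-stream copy. The memcpy_ss signature is assumed
// to match memcpy; elsewhere this falls back to memcpy.
static void *copy_single_stream(void *dst, const void *src, std::size_t n) {
#ifdef __MTA__
    return memcpy_ss(dst, src, n);
#else
    return memcpy(dst, src, n);
#endif
}

// Wrapper for a multiprocessor copy, under the same assumptions.
static void *copy_multiprocessor(void *dst, const void *src, std::size_t n) {
#ifdef __MTA__
    return memcpy_mp(dst, src, n);
#else
    return memcpy(dst, src, n);
#endif
}

// Copies a rows x cols matrix one row at a time. The loop over rows is the
// one intended to run in parallel, so each copy inside it uses one stream
// and does not start a parallel region of its own.
void copy_rows(int *dst, const int *src, std::size_t rows, std::size_t cols) {
    assert(dst != 0 && src != 0);
    for (std::size_t r = 0; r < rows; ++r) {
        copy_single_stream(dst + r * cols, src + r * cols, cols * sizeof(int));
    }
}

// Copies one large block from serial code, where the copy itself is the
// only source of parallelism.
void copy_block(int *dst, const int *src, std::size_t count) {
    assert(dst != 0 && src != 0);
    copy_multiprocessor(dst, src, count * sizeof(int));
}
```

Keeping the choice of variant in small wrappers makes it easy to change the level
of parallelism for a particular call site after measuring its performance.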