Cray XMT™ Programming Environment User’s Guide

to fields of the aggregate inside the loop will be replaced with the temporaries. This can be useful if scalar replacement is unsafe or undesirable for portions of a routine, but needed to achieve good performance in specific loops. The loop variant can also be used to achieve parallelization of the loop in the previous example:

      | #pragma mta no inline
      | void doit(int *c) {
      |   int i;
      | #pragma mta assert noalias *this
      | #pragma mta assert loop can replace *this
      |   for (i = 1; i < n; ++i) {
  5 L |     b[i] = b[i-1] + c[i-1];
** scalar replacing *this
      |   }
      | }
      | };

The exact syntax of these pragmas is described in Appendix C.3 of the Cray XMT Programming Environment User's Guide.
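To picture what scalar replacement of *this means for such a loop, the following is a hand-written sketch of a roughly equivalent transformation. It is an illustration only, not the compiler's actual output. The class name Series, its members, and the local names b_local and n_local are hypothetical and chosen to mirror the listing above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical class for illustration; only the loop body mirrors the listing.
struct Series {
    int n;
    std::vector<int> b;

    explicit Series(int size)
        : n(size), b(static_cast<std::size_t>(size), 0) {}

    void doit(const int *c) {
        assert(c != nullptr);
        // Manual "scalar replacement": read the fields of *this once into
        // local temporaries, then use only the temporaries inside the loop.
        int *const b_local = b.data();
        const int n_local = n;
        for (int i = 1; i < n_local; ++i) {
            b_local[i] = b_local[i - 1] + c[i - 1];
        }
    }
};
```

Because the loop no longer reads n or b through this, nothing written inside the loop can change the loop bound or the array base. That is the property the noalias and can replace assertions give the compiler.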

9.2 Optimizing Calls to memcpy and memset

The compiler option -enable_memcmd_opt enables an optimization that replaces calls to memcpy and memset with versions of those functions that were built for the current parallel mode and that the compiler can inline. Inlining lets the compiler potentially merge the parallel region in the memory routine with any surrounding parallel region, which can reduce the cost of tearing down and restarting parallel regions around each call to memcpy or memset. However, when this optimization is enabled and these functions are called from within a parallel loop, the result is nested parallel regions, which can degrade performance significantly.

The compiler flag -disable_memcmd_opt was added to disable this optimization when it causes performance problems, such as the case described above. However, because these functions may be called indirectly, it is not always easy to determine that a call to memcpy or memset is causing a performance problem. For example, this can happen if a program calls a function in the C++ STL that calls memcpy. For this reason, the compiler disables this optimization by default, and you can enable it with the -enable_memcmd_opt option. Use this option only when you know there is no risk of memcpy or memset being called from within a parallel loop.
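The pattern to look for can be sketched as follows. This is a minimal illustration, not code from the XMT libraries: the helper copy_rows and its parameters are hypothetical, and guarding the pragma with __MTA__ assumes that the XMT compiler predefines that macro.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies `rows` rows of `cols` ints from src to dst.
void copy_rows(int *dst, const int *src, std::size_t rows, std::size_t cols) {
    assert(dst != nullptr && src != nullptr);
    // The rows do not overlap, so the iterations are independent and the
    // loop is a candidate for parallel execution.
#ifdef __MTA__  // ASSUMPTION: macro predefined by the XMT compiler
#pragma mta assert parallel
#endif
    for (std::size_t r = 0; r < rows; ++r) {
        // With -enable_memcmd_opt, this call may be replaced by an inlined
        // parallel version, which would nest a parallel region inside the
        // parallel loop. This is the case the text warns about.
        std::memcpy(dst + r * cols, src + r * cols, cols * sizeof(int));
    }
}
```

The same situation arises when the memcpy call is hidden inside a library routine that the loop body calls, which is why it can be hard to find by reading the source.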

For additional control over the parallelism used by memcpy or memset, you can directly call versions of these functions that use a single stream, single-processor parallelism, or multiprocessor parallelism. The memcpy functions are called memcpy_ss, memcpy_sp, and memcpy_mp, respectively. The corresponding memset functions are called memset_ss, memset_sp, and memset_mp. These functions are declared in string.h and are documented in the memcpy(3) and memset(3) man pages.
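As a rough sketch of how the explicit variants might be chosen, the example below uses the single-stream copy inside a loop and the multiprocessor copy for one large transfer. It assumes that the variants take the same arguments as memcpy and memset; check the man pages for the actual prototypes. The helper names are hypothetical, and the stand-in definitions exist only so that the sketch builds on systems other than the XMT (this again assumes that the XMT compiler predefines __MTA__).

```cpp
#include <cassert>
#include <stddef.h>
#include <string.h>

#ifndef __MTA__
// Stand-ins for builds off the XMT. ASSUMPTION: the real functions declared
// in string.h on the XMT take the same arguments as memcpy and memset.
static inline void *memcpy_ss(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }
static inline void *memcpy_mp(void *dst, const void *src, size_t n) { return memcpy(dst, src, n); }
static inline void *memset_sp(void *dst, int value, size_t n) { return memset(dst, value, n); }
#endif

// Hypothetical helper: per-row copies issued from a loop that may itself run
// in parallel, so each copy is kept on a single stream.
void copy_rows_single_stream(int *dst, const int *src, size_t rows, size_t cols) {
    assert(dst != NULL && src != NULL);
    for (size_t r = 0; r < rows; ++r) {
        memcpy_ss(dst + r * cols, src + r * cols, cols * sizeof(int));
    }
}

// Hypothetical helper: one large copy made outside any parallel loop, where
// multiprocessor parallelism inside the copy is the goal.
void copy_bulk(int *dst, const int *src, size_t count) {
    assert(dst != NULL && src != NULL);
    memcpy_mp(dst, src, count * sizeof(int));
}

// Hypothetical helper: zero a buffer using single-processor parallelism.
void clear_buffer(int *buf, size_t count) {
    assert(buf != NULL);
    memset_sp(buf, 0, count * sizeof(int));
}
```

Choosing the variant by call site keeps the decision visible in the source, instead of depending on whether -enable_memcmd_opt was passed when the file was compiled.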

S–2479–20