Intel® Xeon Phi™ Coprocessor DEVELOPER’S QUICK START GUIDE
Vector Reduction with Offload
Each core on the Intel® Xeon Phi™ Coprocessor has a vector processing unit (VPU). Auto-vectorization is
enabled by default in the offload compiler. Alternatively, as seen in the example below, the programmer can
use the Intel® Cilk™ Plus Extended Array Notation to maximize vectorization and take advantage of the 32
512-bit vector registers in each Intel® MIC Architecture core. The offloaded code is executed by a single
thread on a single core. That thread calls the built-in reduction function __sec_reduce_add(), which reduces
the elements in the array sixteen at a time, because each 512-bit register holds sixteen 32-bit floats.
float reduction(float *data, int size)
{
    float ret = 0;
    #pragma offload target(mic) in(data:length(size))
    ret = __sec_reduce_add(data[0:size]); // Intel® Cilk™ Plus Extended Array Notation
    return ret;
}
Code Example 3: Vector Reduction with Offload in C/C++
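To make "sixteen at a time" concrete, the following is a minimal sketch in portable C that emulates a 16-lane reduction with scalar code. It is an illustration only, not what the compiler generates for __sec_reduce_add(); the helper name reduce_add_16_lanes and the LANES constant are hypothetical.

```c
#include <stddef.h>

#define LANES 16  /* 512-bit register / 32-bit float = 16 lanes */

/* Hypothetical helper: scalar emulation of a 16-lane vector reduction. */
float reduce_add_16_lanes(const float *data, size_t size)
{
    float lanes[LANES] = {0.0f};  /* one partial sum per vector lane */
    float sum = 0.0f;
    size_t i = 0;
    size_t l;

    /* Main loop: consume 16 consecutive floats per step,
       as a single 512-bit vector add would. */
    for (; i + LANES <= size; i += LANES)
        for (l = 0; l < LANES; ++l)
            lanes[l] += data[i + l];

    /* Horizontal step: collapse the 16 partial sums into one value. */
    for (l = 0; l < LANES; ++l)
        sum += lanes[l];

    /* Remainder: elements left over when size is not a multiple of 16. */
    for (; i < size; ++i)
        sum += data[i];

    return sum;
}
```

Because the additions happen in a different order than in a simple sequential loop, the floating-point result can differ slightly from a scalar sum for inputs that are not exactly representable.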
Asynchronous Offload and Data Transfer
Asynchronous offload and data transfer between the host and the Intel® Xeon Phi™ Coprocessor are available.
For details see the “About Asynchronous Computation” and “About Asynchronous Data Transfer” sections in
the Intel® C++ Compiler User and Reference Guide (under “Key Features/Programming for the Intel® MIC
Architecture”).
For an example showing the use of asynchronous offload and transfer, refer to
/opt/intel/composerxe/Samples/en_US/C++/mic_samples/intro_sampleC/sampleC13.c
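As a rough illustration of the asynchronous computation pattern, the sketch below starts an offload with a signal clause and later blocks on it with offload_wait. The signal and wait clauses are the ones covered in the compiler guide's section on asynchronous computation; the function name async_sum and the tag variable done_tag are hypothetical, and this is not the code from sampleC13.c. A compiler without offload support ignores the pragmas and simply runs the loop synchronously on the host.

```c
/* Hypothetical example: sum an array on the coprocessor while the
   host thread remains free to do other work. */
float async_sum(float *data, int n)
{
    float sum = 0.0f;
    char done_tag;  /* only its address matters: it tags this offload */

    /* Start the offload; control returns to the host immediately. */
    #pragma offload target(mic:0) in(data:length(n)) inout(sum) signal(&done_tag)
    {
        for (int i = 0; i < n; ++i)
            sum += data[i];
    }

    /* Independent host work could run here. It must not read sum,
       which is not valid on the host until the wait below completes. */

    /* Block until the offload tagged with &done_tag has finished. */
    #pragma offload_wait target(mic:0) wait(&done_tag)

    return sum;
}
```

The tag is an address chosen by the programmer; using the address of a dedicated local variable keeps it unique for the lifetime of the call.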
Note that the Explicit Memory Copy Model in C/C++ supports arrays only when the array element type is a
scalar or a bitwise-copyable struct or class; arrays of pointers are therefore not supported. For complex
C/C++ data structures, use the Implicit Memory Copy Model. For more information, consult the section
“Restrictions on Offload Code Using a Pragma” in the Intel® C++ Compiler User and Reference Guide.
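The restriction above can be sketched as follows: an array whose elements are a struct of plain scalars can be copied with an in clause, whereas a struct that carries a pointer member could not, because only the pointer value would be copied and not the data it refers to. The type Point and the function sum_x are hypothetical names chosen for this illustration.

```c
/* Bitwise copyable: only scalar members, no pointers. */
typedef struct {
    float x;
    float y;
} Point;

/* Not suitable for the explicit copy model (shown for contrast only):
   typedef struct { float *samples; int count; } Series;            */

/* Hypothetical example: offload a reduction over an array of structs. */
float sum_x(Point *pts, int n)
{
    float total = 0.0f;

    /* length(n) counts elements of type Point, not bytes. */
    #pragma offload target(mic) in(pts:length(n))
    {
        for (int i = 0; i < n; ++i)
            total += pts[i].x;
    }

    return total;
}
```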
Using the Offload Compiler – Implicit Memory Copy Model
Intel Composer XE 2013 includes two additional keyword extensions for C and C++ (but not Fortran),
_Cilk_shared and _Cilk_offload, that provide a “shared memory” offload programming model appropriate for
complex, pointer-based data structures such as linked lists, binary trees, and the like. This
model places variables to be shared between the host and coprocessor (marked with the _Cilk_shared
keyword) at the same virtual addresses on both machines, and synchronizes their values at the beginning and
end of offload function calls marked with the _Cilk_offload keyword. Data to be synchronized can also
be dynamically allocated using special allocation and free calls that ensure the allocated memory exists at the
same virtual addresses on both machines.
APIs for dynamic shared memory allocation:
void *_Offload_shared_malloc(size_t size);
void _Offload_shared_free(void *p);
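A minimal sketch of the two keywords, assuming statically allocated shared variables rather than the dynamic allocation calls listed above: the host fills a _Cilk_shared array, a _Cilk_shared function runs through _Cilk_offload, and the host reads the synchronized result afterwards. The names values, total, sum_values and run_shared_sum are hypothetical. The #if fallback at the top is an assumption added for this illustration so that the sketch also builds as ordinary host code with a compiler that lacks these keywords.

```c
#if !defined(__INTEL_OFFLOAD) && !defined(__MIC__)
/* Assumed fallback for compilers without the offload extensions:
   the keywords vanish and everything runs on the host. */
#define _Cilk_shared
#define _Cilk_offload
#endif

#define N 1000

/* Shared variables: same virtual address on host and coprocessor. */
_Cilk_shared int values[N];
_Cilk_shared int total;

/* Shared function: compiled for both host and coprocessor. */
_Cilk_shared void sum_values(void)
{
    int s = 0;
    for (int i = 0; i < N; ++i)
        s += values[i];
    total = s;
}

/* Hypothetical driver, called on the host. */
int run_shared_sum(void)
{
    for (int i = 0; i < N; ++i)
        values[i] = i;

    /* Shared data is synchronized at the start and end of this call. */
    _Cilk_offload sum_values();

    return total;
}
```

No in or out clauses appear here; the runtime decides what to synchronize based on the _Cilk_shared markings, which is what makes this model a better fit for pointer-based structures.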