System information

Intel® Xeon Phi™ Coprocessor DEVELOPER’S QUICK START GUIDE

Vector Reduction with Offload

Each core on the Intel® Xeon Phi™ Coprocessor has a vector processing unit (VPU). Auto-vectorization is enabled by default in the offload compiler. Alternatively, as in the example below, the programmer can use the Intel® Cilk™ Plus Extended Array Notation to maximize vectorization and take advantage of the 32 512-bit vector registers in each Intel® MIC Architecture core. The offloaded code is executed by a single thread on a single core. The thread calls the built-in reduction function __sec_reduce_add(), which uses those vector registers to reduce the array sixteen single-precision elements at a time.

float reduction(float *data, int size)
{
    float ret = 0;
    #pragma offload target(mic) in(data:length(size))
    ret = __sec_reduce_add(data[0:size]); // Intel® Cilk™ Plus Extended Array Notation
    return ret;
}

Code Example 3: Vector Reduction with Offload in C/C++
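To make "sixteen at a time" concrete, the following portable C sketch emulates the reduction pattern by hand: sixteen partial sums, one per 32-bit lane of a 512-bit register, are accumulated in lockstep and then combined. It is an illustration only, not the code the Intel compiler generates. The names LANES and reduction_by_lanes are hypothetical, and the sketch runs entirely on the host with no offload.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one 512-bit register holds 512 / 32 = 16 single-precision lanes. */
#define LANES 16

/* Hypothetical helper: sums `size` floats using 16 independent partial sums,
 * mimicking how a vector reduction consumes sixteen elements per step. */
float reduction_by_lanes(const float *data, int size)
{
    float partial[LANES] = {0};
    float total = 0.0f;
    int i, lane;

    assert(size >= 0);
    assert(data != NULL || size == 0);

    /* Main loop: each iteration consumes one full group of 16 elements. */
    for (i = 0; i + LANES <= size; i += LANES)
        for (lane = 0; lane < LANES; ++lane)
            partial[lane] += data[i + lane];

    /* Remainder loop: leftover elements that do not fill a full group. */
    for (lane = 0; i < size; ++i, ++lane)
        partial[lane] += data[i];

    /* Horizontal add: combine the 16 partial sums into one scalar. */
    for (lane = 0; lane < LANES; ++lane)
        total += partial[lane];
    return total;
}
```

Because the elements are added in a different order than in a simple sequential loop, the result for general floating-point data may differ from a scalar sum in the last few bits.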

Asynchronous Offload and Data Transfer

Asynchronous offload and asynchronous data transfer between the host and the Intel® Xeon Phi™ Coprocessor are supported. For details, see the “About Asynchronous Computation” and “About Asynchronous Data Transfer” sections in the Intel® C++ Compiler User and Reference Guide (under “Key Features/Programming for the Intel® MIC Architecture”).

For an example showing the use of asynchronous offload and transfer, refer to /opt/intel/composerxe/Samples/en_US/C++/mic_samples/intro_sampleC/sampleC13.c
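As a rough sketch of the asynchronous pattern, the fragment below starts an offload with a signal clause, leaves room for host work, and then blocks on offload_wait. It is not taken from sampleC13.c. The function names sum_squares and async_sum_squares are hypothetical, and the Intel-specific pragmas and attribute are guarded by __INTEL_OFFLOAD on the assumption that this macro is defined only when the Intel compiler builds with offload enabled, so other compilers simply run everything on the host. Check the clause details against the compiler guide before relying on them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: marked for the coprocessor only under the Intel offload compiler. */
#ifdef __INTEL_OFFLOAD
__attribute__((target(mic)))
#endif
float sum_squares(const float *a, int n)
{
    float s = 0.0f;
    int i;
    for (i = 0; i < n; ++i)
        s += a[i] * a[i];
    return s;
}

/* Hypothetical host routine: overlaps an offloaded sum with host-side work. */
float async_sum_squares(float *data, int n, int *host_steps)
{
    float result = 0.0f;
    char tag = 0;   /* its address serves as the signal tag */
    int i;

    assert(n >= 0 && host_steps != NULL);

#ifdef __INTEL_OFFLOAD
    /* Start the offload and return control to the host immediately. */
    #pragma offload target(mic:0) in(data:length(n)) out(result) signal(&tag)
#endif
    {
        result = sum_squares(data, n);
    }

    /* Host work that touches neither `data` nor `result` can run meanwhile. */
    for (i = 0; i < 3; ++i)
        ++*host_steps;

#ifdef __INTEL_OFFLOAD
    /* Block until the offload tagged with &tag has finished. */
    #pragma offload_wait target(mic:0) wait(&tag)
#endif
    (void)tag;      /* silences unused-variable warnings in host-only builds */
    return result;
}
```

The host must not read result before the wait, since the value is only guaranteed to be back on the host once the tagged offload has completed.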

Note that when using the Explicit Memory Copy Model in C/C++, arrays are supported provided the array element type is a scalar or a bitwise-copyable struct or class. Arrays of pointers are therefore not supported. For complex C/C++ data structures, use the Implicit Memory Copy Model. For more information, consult the section “Restrictions on Offload Code Using a Pragma” in the Intel® C++ Compiler User and Reference Guide.

Using the Offload Compiler – Implicit Memory Copy Model

Intel Composer XE 2013 includes two additional keyword extensions for C and C++ (but not Fortran), _Cilk_shared and _Cilk_offload, which provide a “shared memory” offload programming model suited to complex, pointer-based data structures such as linked lists and binary trees. This model places variables marked with the _Cilk_shared keyword at the same virtual addresses on the host and the coprocessor, and synchronizes their values at the beginning and end of each function call offloaded with the _Cilk_offload keyword. Data to be synchronized can also be allocated dynamically using special allocation and free calls that ensure the allocated memory exists at the same virtual addresses on both machines.

APIs for dynamic shared memory allocation:

void *_Offload_shared_malloc(size_t size);
void _Offload_shared_free(void *p);
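The following is a minimal sketch of how these pieces might fit together for a linked list: nodes are allocated with the shared allocator, and a function marked _Cilk_shared walks the list when called through _Cilk_offload. The macro names (SHARED_FN, RUN_ON_MIC, shared_alloc, shared_release) and the functions sum_list and build_and_sum are hypothetical. The sketch assumes that __INTEL_OFFLOAD is defined only in Intel offload builds and that offload.h declares the allocators; the exact placement of _Cilk_shared on pointer types may need adjusting for your compiler version. With any other compiler the macros fall back to malloc and free and everything runs on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef __INTEL_OFFLOAD
#include <offload.h>                  /* assumed to declare the shared allocators */
#define SHARED_FN       _Cilk_shared
#define RUN_ON_MIC      _Cilk_offload
#define shared_alloc    _Offload_shared_malloc
#define shared_release  _Offload_shared_free
#else                                 /* host-only fallback for other compilers */
#define SHARED_FN
#define RUN_ON_MIC
#define shared_alloc    malloc
#define shared_release  free
#endif

typedef struct node {
    int value;
    struct node *next;
} node;

/* Walks the list and sums its values; under the Intel compiler this
 * function is built for both the host and the coprocessor. */
int SHARED_FN sum_list(node *head)
{
    int total = 0;
    for (; head != NULL; head = head->next)
        total += head->value;
    return total;
}

/* Builds n nodes holding 1..n in shared memory, sums them, then frees them. */
int build_and_sum(int n)
{
    node *head = NULL;
    int i, total;

    assert(n >= 0);
    for (i = n; i >= 1; --i) {
        node *p = (node *)shared_alloc(sizeof *p);
        assert(p != NULL);
        p->value = i;
        p->next = head;   /* valid on both sides: same virtual address */
        head = p;
    }

    /* Offloaded call: shared data is synchronized on entry and exit. */
    total = RUN_ON_MIC sum_list(head);

    while (head != NULL) {
        node *next = head->next;
        shared_release(head);
        head = next;
    }
    return total;
}
```

The next pointers need no translation because each node sits at the same virtual address on the host and on the coprocessor, which is what makes pointer-based structures practical in this model.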