White Paper Intel® Xeon Phi™ Coprocessor DEVELOPER’S QUICK START GUIDE Version 1.
Contents

Introduction
Goals
Asynchronous Offload and Data Transfer
Using the Offload Compiler – Implicit Memory Copy Model
Native Compilation
Introduction

This document will help you get started writing code and running applications on a system (host) that includes the Intel® Xeon Phi™ Coprocessor based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). It describes the available tools and includes simple examples to show how to get C/C++ and Fortran-based programs up and running.
NAcc – Native Acceleration – a mode or form of Intel® MKL in which both the data being processed and the MKL function processing it reside on the Intel® Xeon Phi™ Coprocessor.

Offload Compilers – The Intel® C/C++ Compiler and the Intel® Fortran Compiler, which can generate binaries for both the host system and the Intel® Xeon Phi™ Coprocessor.
Device Driver: At the bottom of the software stack, in kernel space, is the Intel® Xeon Phi™ Coprocessor device driver. The device driver manages device initialization and communication between the host and target devices.

Libraries: The libraries sit on top of the device driver in user and system space.
Intel® Many Integrated Core Architecture Overview

The Intel® Xeon Phi™ Coprocessor has up to 61 in-order Intel® MIC Architecture processor cores running at 1 GHz (up to 1.3 GHz). The Intel® MIC Architecture is based on the x86 ISA, extended with 64-bit addressing and new 512-bit wide SIMD vector instructions and registers. Each core supports 4 hardware threads.
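The figures above translate directly into sizing arithmetic. The following is a minimal sketch in C, not part of any Intel header: the enum and function names are hypothetical, and the constants are simply the numbers quoted in this overview.

```c
#include <assert.h>
#include <stddef.h>

/* Figures quoted in the overview; the names are hypothetical. */
enum {
    MIC_MAX_CORES        = 61,   /* "up to 61" cores */
    MIC_THREADS_PER_CORE = 4,    /* hardware threads per core */
    MIC_SIMD_BITS        = 512   /* width of one vector register */
};

/* Upper bound on hardware threads for a given number of cores. */
int mic_hw_threads(int cores)
{
    assert(cores >= 0 && cores <= MIC_MAX_CORES);
    return cores * MIC_THREADS_PER_CORE;
}

/* Number of elements of elem_size bytes that fit in one vector register. */
int mic_simd_lanes(size_t elem_size)
{
    assert(elem_size > 0 && elem_size <= (size_t)(MIC_SIMD_BITS / 8));
    return (int)((size_t)(MIC_SIMD_BITS / 8) / elem_size);
}
```

With these figures, a fully populated 61-core part exposes at most 61 × 4 = 244 hardware threads, and one 512-bit register holds 16 single-precision or 8 double-precision values.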
Administrative Tasks

If you purchased the Intel® Xeon Phi™ Coprocessor from an equipment manufacturer, please go to the Intel® Developer Zone page http://software.intel.com/mic-developer and click on the “Tools & Downloads” tab, then select the “Intel® Many Integrated Core Architecture (Intel® MIC Architecture) Platform Software Stack” link on this page.
Verify that the Driver Version, MPSS Version, and Flash Version match the following table:

MPSS stack installed              Driver Version   MPSS Version   Flash Version
mpss_gold_update_3-2.1.6720-13    6720-13          2.1.6720-13    2.1.02.0386
KNC_gold_update_2-2.1.5889-16     5889-16          2.1.5889-16    2.1.05.0385
KNC_gold_update_1-2.1.4982-15     4982-15          2.1.4982-15    2.1.05.0375
KNC_gold-2.1.4346-xx              4346-xx          2.1.4346-xx    2.1.01.
3. Verify that the card is working by running a sample program (located in /opt/intel/composerxe/Samples/en_US/C++/mic_sample for C/C++ code or in /opt/intel/composerxe/Samples/en_US/Fortran/mic_sample for Fortran code) with “setenv H_TRACE 2” or “export H_TRACE=2” to display the dialog between the host and the Intel® Xeon Phi™ Coprocessor (messages from the coprocessor will be prefixed with “MIC:”).
5. Update the flash on your card(s) as detailed in section 7.2 of readme.
    sudo micctrl -w
    sudo /opt/intel/mic/bin/micinfo

If the Intel® MPSS service is not running properly, then you need to restart the driver and all connected coprocessors:

    sudo service mpss stop
    sudo service mpss unload
    sudo service mpss start
    sudo micctrl -w
    sudo /opt/intel/mic/bin/micinfo

Monitoring the Intel® Xeon Phi™ Coprocessor

If you want to monitor the load on your coprocessor, its temperature, etc.
coprocessor were installed, it would be called “mic1” and located at 172.31.2.1, and it would see the host as 172.31.2.254. For detailed information on setting up the card for non-root users, adjusting the network configuration, mounting an NFS file system exported by the host for use on the Intel® Xeon Phi™ coprocessor, etc., please see the document Intel® MPSS Boot Configuration Guide.
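The addressing described above can be sketched as a small helper. This is an illustration only: it assumes the pattern generalizes from the “mic1” example, so that card N sits on subnet 172.31.(N+1) with the card at .1 and the host at .254. The struct and function names are hypothetical, and a customized network configuration may differ.

```c
#include <assert.h>
#include <stdio.h>

/* Four octets of an IPv4 address (hypothetical helper type). */
struct mic_addr { unsigned char octet[4]; };

/* Address of card micN (host_side == 0) or of the host as seen from
 * that card (host_side != 0), ASSUMING the 172.31.(N+1).x pattern. */
struct mic_addr mic_default_addr(int index, int host_side)
{
    struct mic_addr a;
    assert(index >= 0 && index < 254);
    a.octet[0] = 172;
    a.octet[1] = 31;
    a.octet[2] = (unsigned char)(index + 1);
    a.octet[3] = (unsigned char)(host_side ? 254 : 1);
    return a;
}

/* Formats the address as dotted decimal; returns snprintf's result. */
int mic_format_addr(struct mic_addr a, char *buf, size_t len)
{
    return snprintf(buf, len, "%d.%d.%d.%d",
                    a.octet[0], a.octet[1], a.octet[2], a.octet[3]);
}
```

For index 1 this yields 172.31.2.1 for the card and 172.31.2.254 for the host, matching the “mic1” example in the text.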
Development Environment: Available Compilers and Libraries

Compilers
  o Intel® C++ Composer XE 2013 for building applications that run on Intel® 64 architecture and Intel® MIC Architecture
  o Intel® Fortran Composer XE 2013 for building applications that run on Intel® 64 architecture and Intel® MIC Architecture

Libraries packaged with the compilers include:
  o Intel® Math Kernel Library (Intel® MKL) optimized for the Intel® MIC Architectur
  o Intel TBB: /opt/intel/composerxe/tbb/bin/tbbvars.csh or tbbvars.sh with intel64 as the argument.
  o Intel® MKL: /opt/intel/composerxe/mkl/bin/mklvars.csh or mklvars.sh with intel64 as the argument.

Documentation and Sample Code

• The most useful documentation can be found in /opt/intel/composerxe/Documentation/en_US/ including:
  o compiler_c/main_cls/index.htm and compiler_f/main_cls/index.
• Some sample offload code using the explicit memory copy model can be found in:
  o C++: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/intro_sampleC/
  o Fortran: /opt/intel/composerxe/Samples/en_US/Fortran/mic_samples/
  o Intel® MKL: /opt/intel/composerxe/mkl/examples/mic*
  o For examples of Intel® MKL automated offload: /opt/intel/composerxe/mkl/examples/mic_ao/blasc and …/mic_ao/blasf
  o The rest of the samples demonstrate use of MKL via c
For csh – setenv H_TRACE 2
For sh – export H_TRACE=2

When printing the compiler’s internal offload timers, a value of 1 reports just the time the offload took, as measured by the host, and the amount of computation time spent on the coprocessor. A value of 2 adds information on how much data was transferred in either direction.
Host Version: The following sample shows C code that implements this version of the reduction.

float reduction(float *data, int size) { float ret = 0.
Vector Reduction with Offload

Each core on the Intel® Xeon Phi™ Coprocessor has a VPU. Auto-vectorization is enabled by default in the offload compiler. Alternatively, as seen in the example below, the programmer can use the Intel® Cilk™ Plus Extended Array Notation to maximize vectorization and take advantage of the 32 512-bit registers in each Intel® MIC Architecture core. The offloaded code is executed by a single thread on a single core.
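A single-threaded offloaded reduction of the kind described here might look like the following sketch. It is not the guide's own listing: it uses a plain loop rather than the Intel® Cilk™ Plus array notation and relies on the compiler's auto-vectorizer. The function name is hypothetical, and the pragma is guarded by the `__INTEL_OFFLOAD` macro so that a compiler without offload support simply runs the loop on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Sums size floats. With an Intel offload compiler the block below runs
 * on the coprocessor; otherwise the pragma is skipped and it runs on the
 * host. Either way a single thread executes the loop. */
float reduction_offload(float *data, int size)
{
    float ret = 0.0f;

    assert(data != NULL && size >= 0);

#ifdef __INTEL_OFFLOAD
    /* Copy data[0..size) to the coprocessor before the block runs.
     * Scalars such as ret and size are transferred automatically. */
#pragma offload target(mic) in(data : length(size))
#endif
    {
        int i;
        for (i = 0; i < size; ++i)
            ret += data[i];   /* candidate for auto-vectorization */
    }
    return ret;
}
```

Because a vectorized floating-point sum may add elements in a different order than the scalar loop, results can differ in the last bits for inputs that are not exactly representable.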
APIs for Dynamic Aligned Shared memory allocation

void *_Offload_shared_aligned_malloc(size_t size, size_t alignment);
void _Offload_shared_aligned_free(void *p);

Note that this is not actually “shared memory”: there is no hardware that maps some portion of the memory on the Intel® Xeon Phi™ Coprocessor to the host system.
Code Example 4: Using the “_Cilk_shared” and “_Cilk_offload” Keywords with Dynamic Allocation in C/C++

Note: For more examples on using the implicit memory copy model, see:
  C: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleC and …/LEO_tutorial
  C++: /opt/intel/composerxe/Samples/en_US/C++/mic_samples/shrd_sampleCPP

For more information, users are encouraged to read the Intel C++ Compiler User and Reference Guides and/or the Intel For
    ulimit -s unlimited

7. Go to /tmp and run a.out:

    cd /tmp
    ./a.out

Parallel Programming Options on the Intel® Xeon Phi™ Coprocessor

Most of the parallel programming options available on the host systems are available for the Intel® Xeon Phi™ Coprocessor. These include the following: 1. 2. 3. 4.
    }
  }
  return ret;
}

Code Example 5: C/C++: Using OpenMP in Offloaded Reduction Code

real function FTNReductionOMP(data, size)
implicit none
integer :: size
real, dimension(size) :: data
real :: ret = 0.
  }
  return ret;
}

Code Example 7: Array Reduction Using OpenMP and Intel® Cilk™ Plus in C/C++

Parallel Programming on the Intel® Xeon Phi™ Coprocessor: Intel® Cilk™ Plus

Intel Cilk Plus header files are not available on the target environment by default.
Code Example 10: Wrapping the Intel TBB Header Files in C/C++

Functions called from within the offloaded construct, and global data required on the Intel® Xeon Phi™ Coprocessor, should be prefixed with the special function attribute __attribute__((target(mic))). As an example, parallel_reduce recursively splits an array into subranges for each thread to work on.
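The recursive splitting idea can be illustrated without Intel TBB. The sketch below is a serial C stand-in, not the TBB API: it halves a range until it is small enough to sum directly, which is the point at which a parallel runtime could hand each half to a different thread. The function name and the GRAIN cutoff are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: ranges of this size or smaller are summed directly. */
#define GRAIN 4

/* Sums data[begin..end) by recursively splitting the range in half. */
float split_reduce(const float *data, size_t begin, size_t end)
{
    assert(data != NULL && begin <= end);

    if (end - begin <= GRAIN) {
        float sum = 0.0f;
        size_t i;
        for (i = begin; i < end; ++i)
            sum += data[i];
        return sum;
    }

    /* Split point; a parallel runtime could process the two halves
     * on different threads and then join the partial results. */
    size_t mid = begin + (end - begin) / 2;
    return split_reduce(data, begin, mid) + split_reduce(data, mid, end);
}
```

Calling split_reduce(data, 0, size) covers the whole array. The join step here is a plain addition, which plays the role of the combining operation in a reduction.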
Code Example 12: Prefixing an Intel TBB Function for Intel® MIC Architecture code generation in C/C++

3. Use #pragma offload target(mic) to offload the parallel code using Intel TBB to the coprocessor.

float MICReductionTBB(float *data, int size) { float ret(0.
Step 2: Send the data over to the Intel® Xeon Phi™ Coprocessor using #pragma offload. In this example, the free_if(0) qualifier is used to make the data persistent on the Intel® Xeon Phi™ Coprocessor.
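The persistence pattern in this step can be sketched as follows. This is an illustration under assumptions, not the guide's own listing: the function name is hypothetical, the data is sent once with alloc_if(1) free_if(0), reused by a later offload with alloc_if(0) free_if(0), and released by the last offload with free_if(1). The pragmas are guarded by `__INTEL_OFFLOAD`, so without an offload compiler everything runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Sends data once, then sums it in two separate offloads that reuse the
 * copy kept on the coprocessor. Returns the two sums added together. */
float persistent_sum_twice(float *data, int n)
{
    float first = 0.0f, second = 0.0f;

    assert(data != NULL && n >= 0);

#ifdef __INTEL_OFFLOAD
    /* Allocate on the coprocessor, copy in, and keep the buffer. */
#pragma offload_transfer target(mic:0) in(data : length(n) alloc_if(1) free_if(0))
#endif

#ifdef __INTEL_OFFLOAD
    /* Reuse the resident buffer: no allocation, no copy, no free. */
#pragma offload target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(0)) out(first)
#endif
    {
        int i;
        first = 0.0f;
        for (i = 0; i < n; ++i)
            first += data[i];
    }

#ifdef __INTEL_OFFLOAD
    /* Last use: reuse the buffer, then free it on the coprocessor. */
#pragma offload target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(1)) out(second)
#endif
    {
        int i;
        second = 0.0f;
        for (i = 0; i < n; ++i)
            second += data[i];
    }

    return first + second;
}
```

Keeping the buffer resident avoids repeating the host-to-coprocessor copy for every offload, which matters when the same input feeds several computations.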
out(C:length(matrix_elements) alloc_if(0) free_if(0)) // output data
{
    omp_set_num_threads(64); // set the number of OpenMP threads
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

Code Example 17: Controlling Threads on the Intel® Xeon Phi™ Coprocessor Using omp_set_num_threads()

Intel® MKL Automatic Offload Model

A few of the host Intel® MKL functions are Automatic Offload-aware: you call them as you normally would on the host
About the Authors

Sudha Udanapalli Thiagarajan received a Bachelor’s degree in Computer Science and Engineering from Anna University Chennai, India in 2008 and a Master’s degree in Computer Engineering from Clemson University in May 2010. She joined Intel in 2010 and has been working as an enabling Application Engineer, focusing on optimizing applications for ISVs and developing collateral for Intel® Many Integrated Core Architecture.
Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
Performance Notice

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.