HP XC System Software User's Guide, Version 4.
© Copyright 2003, 2005, 2006, 2007, 2008, 2009 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Table of Contents About This Document.......................................................................................................11 Intended Audience................................................................................................................................11 New and Changed Information in This Edition...................................................................................11 Typographic Conventions.....................................................................
2.3.1 Determining the LSF Cluster Name and the LSF Execution Host..........................................32 2.4 Getting System Help and Information............................................................................................32 3 Configuring Your Environment with Modulefiles.......................................................33 3.1 Overview of Modules......................................................................................................................33 3.
5.2 Submitting a Serial Job Using LSF...................................................................................................49 5.2.1 Submitting a Serial Job with the LSF bsub Command............................................................49 5.2.2 Submitting a Serial Job Through SLURM Only......................................................................50 5.3 Submitting a Parallel Job.........................................................................................................
9.7 Job Accounting................................................................................................................................84 9.8 Fault Tolerance................................................................................................................................84 9.9 Security............................................................................................................................................84 10 Using LSF.............................................
A.4 Launching a Parallel Interactive Shell Through LSF....................................................................117 A.5 Submitting a Simple Job Script with LSF.....................................................................................119 A.6 Submitting an Interactive Job with LSF........................................................................................120 A.7 Submitting an HP-MPI Job with LSF...................................................................................
List of Figures
4-1 Library Directory Structure...........................................................................................................47
4-2 Recommended Library Directory Structure..................................................................................47
10-1 How LSF and SLURM Launch and Manage a Job........................................................................93
11-1 MPICH Wrapper Script.............................................................................
List of Tables 1-1 1-2 3-1 4-1 5-1 10-1 10-2 10-3 Determining the Node Platform...................................................................................................20 HP XC System Interconnects.........................................................................................................22 Supplied Modulefiles....................................................................................................................34 Compiler Commands........................................
List of Examples 5-1 5-2 5-3 5-4 5-5 5-6 5-7 5-8 5-9 5-10 5-11 5-12 5-13 5-14 5-15 5-16 5-17 5-18 8-1 8-2 9-1 9-2 9-3 9-4 9-5 9-6 9-7 9-8 10-1 10-2 10-3 10-4 10-5 10-6 10-7 10-8 10-9 10-10 10 Submitting a Job from the Standard Input....................................................................................50 Submitting a Serial Job Using LSF ................................................................................................50 Submitting an Interactive Serial Job Using LSF only.......
About This Document This document provides information about using the features and functions of the HP XC System Software. It describes how the HP XC user and programming environments differ from standard Linux® system environments.
Ctrl+x A key sequence. A sequence such as Ctrl+x indicates that you must hold down the key labeled Ctrl while you press another key or mouse button. ENVIRONMENT VARIABLE The name of an environment variable, for example, PATH. [ERROR NAME] The name of an error, usually returned in the errno variable. Key The name of a keyboard key. Return and Enter both refer to the same key. Term The defined use of an important word or phrase. User input Commands and other text that you type.
HP XC System Software User's Guide Provides an overview of managing the HP XC user environment with modules, managing jobs with LSF, and describes how to build, run, debug, and troubleshoot serial and parallel applications on an HP XC system.
software components are generic, and the HP XC adjective is not added to any reference to a third-party or open source command or product name. For example, the SLURM srun command is simply referred to as the srun command. The location of each Web site or link to a particular topic listed in this section is subject to change without notice by the site provider. • http://www.platform.com Home page for Platform Computing, Inc., the developer of the Load Sharing Facility (LSF).
• http://www.balabit.com/products/syslog_ng/ Home page for syslog-ng, a logging tool that replaces the traditional syslog functionality. The syslog-ng tool is a flexible and scalable audit trail processing tool. It provides a centralized, securely stored log of all devices on the network. • http://systemimager.org Home page for SystemImager®, which is the underlying technology that distributes the golden image to all nodes and distributes configuration changes throughout the system.
MPI Web Sites • http://www.mpi-forum.org Contains the official MPI standards documents, errata, and archives of the MPI Forum. The MPI Forum is an open group with representatives from many organizations that define and maintain the MPI standard. • http://www-unix.mcs.anl.gov/mpi/ A comprehensive site containing general information, such as the specification and FAQs, and pointers to other resources, including tutorials, implementations, and other MPI-related sites. Compiler Web Sites • http://www.
Manpages for third-party software components might be provided as a part of the deliverables for that component. Using discover(8) as an example, you can use either one of the following commands to display a manpage: $ man discover $ man 8 discover If you are not sure about a command you need to use, enter the man command with the -k option to obtain a list of commands that are related to a keyword. For example: $ man -k keyword HP Encourages Your Comments HP encourages comments concerning this document.
1 Overview of the User Environment The HP XC system is a collection of computer nodes, networks, storage, and software, built into a cluster, that work together. It is designed to maximize workload and I/O performance, and to provide the efficient management of large, complex, and dynamic workloads.
$ head /proc/cpuinfo Table 1-1 presents the representative output for each of the platforms. This output may differ according to changes in models and so on.
distributes login requests from users. A node with the login role is referred to as a login node in this manual. compute role The compute role is assigned to nodes where jobs are to be distributed and run. Although all nodes in the HP XC system are capable of carrying out computations, the nodes with the compute role are the primary nodes used to run jobs. Nodes with the compute role become a part of the resource pool used by LSF and SLURM, which manage and distribute the job workload.
the HP XC. So, for example, if the HP XC system interconnect is based on a Quadrics® QsNet II® switch, then the SFS will serve files over ports on that switch. The file operations are able to proceed at the full bandwidth of the HP XC system interconnect because these operations are implemented directly over the low-level communications libraries.
Additional information on supported system interconnects is provided in the HP XC Hardware Preparation Guide. 1.1.8 Network Address Translation (NAT) The HP XC system uses Network Address Translation (NAT) to enable nodes in the HP XC system that do not have direct external network connections to open outbound network connections to external network resources. 1.
Modulefiles can be loaded into your environment automatically when you log in to the system, or any time you need to alter the environment. The HP XC system does not preload modulefiles. See Chapter 3 “Configuring Your Environment with Modulefiles” for more information.
1.3.3 Commands
The HP XC user environment includes standard Linux commands, LSF commands, SLURM commands, HP-MPI commands, and modules commands. This section provides a brief overview of these command sets.
1.4.2 Serial Applications You can build and run serial applications under the HP XC development environment. A serial application is a command or application that does not use any form of parallelism. Full details and examples of how to build, run, debug, and troubleshoot serial applications are provided in “Building Serial Applications”. 1.5 Run-Time Environment This section describes LSF, SLURM, and HP-MPI, and how these components work together to provide the HP XC run-time environment.
1.5.4 How LSF and SLURM Interact In the HP XC environment, LSF cooperates with SLURM to combine the powerful scheduling functionality of LSF with the scalable parallel job launching capabilities of SLURM. LSF acts primarily as a workload scheduler on top of the SLURM system, providing policy and topology-based scheduling for end users. SLURM provides an execution and monitoring layer for LSF.
1.6 Components, Tools, Compilers, Libraries, and Debuggers
This section provides a brief overview of some of the common tools, compilers, libraries, and debuggers available for use on HP XC. An HP XC system is integrated with several open source software components. HP XC incorporates a Linux operating system and its standard commands and tools, and maintains the Linux ABI.
2 Using the System
This chapter describes the tasks and commands that the general user must know to use the system. It addresses the following topics:
• “Logging In to the System” (page 29)
• “Overview of Launching and Managing Jobs” (page 29)
• “Performing Other Common User Tasks” (page 31)
• “Getting System Help and Information” (page 32)

2.1 Logging In to the System
Logging in to an HP XC system is similar to logging in to any standard Linux system.
overview about some basic ways of running and managing jobs. Full information and details about the HP XC job launch environment are provided in the “Using SLURM” chapter and the “Using LSF” chapter of this document.
2.2.1 Introduction
As described in “Run-Time Environment” (page 25), SLURM and LSF cooperate to run and manage jobs on the HP XC system, combining LSF's powerful and flexible scheduling functionality with SLURM's scalable parallel job-launching capabilities.
For more information about using this command and a sample of its output, see “Examining System Core Status” (page 95) • The LSF lshosts command displays machine-specific information for the LSF execution host node. $ lshosts For more information about using this command and a sample of its output, see “Getting Information About the LSF Execution Host Node” (page 95) . • The LSF lsload command displays load information for the LSF execution host node.
2.3.1 Determining the LSF Cluster Name and the LSF Execution Host The lsid command returns the LSF cluster name, the LSF version, and the name of the LSF execution host: $ lsid Platform LSF HPC version, Update n, build date stamp Copyright 1992-2008 Platform Computing Corporation My cluster name is hptclsf My master name is lsfhost.localdomain In this example, hptclsf is the LSF cluster name, and lsfhost.
3 Configuring Your Environment with Modulefiles
The HP XC system supports the use of Modules software to make it easier to configure and modify your environment. Modules software enables dynamic modification of your environment by the use of modulefiles.
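For example, the following short session is only a sketch: the module commands themselves are standard, but the modulefile names available on your system depend on what your administrator has installed (the totalview modulefile shown here is one of those listed in Table 3-1):

$ module avail           # list the modulefiles available on the system
$ module load totalview  # load a modulefile into the current environment
$ module list            # display the modulefiles that are currently loaded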
access the mpi** scripts and libraries. You can specify the compiler it uses through a variety of mechanisms long after the modulefile is loaded. The previous scenarios were chosen in particular because the HP-MPI mpicc command uses heuristics to try to find a suitable compiler when MPI_CC or other default-overriding mechanisms are not in effect. It is possible that mpicc will choose a compiler inconsistent with the most recently loaded compiler module.
Table 3-1 Supplied Modulefiles (continued)

Modulefile           Sets the HP XC User Environment to Use:
icc/8.1/default      Intel C/C++ Version 8.1 compilers.
icc/9.0/default      Intel C/C++ Version 9.0 compilers.
icc/9.1/default      Intel C/C++ Version 9.1 compilers.
idb/7.3/default      Intel IDB debugger.
idb/9.0/default      Intel IDB debugger.
idb/9.1/default      Intel IDB debugger.
ifort/8.0/default    Intel Fortran Version 8.0 compilers.
ifort/8.1/default    Intel Fortran Version 8.1 compilers.
ifort/9.
Each module supplies its own online help. See “Viewing Modulefile-Specific Help” for information on how to view it. 3.3 Modulefiles Automatically Loaded on the System The HP XC system does not load any modulefiles into your environment by default. However, there may be modulefiles designated by your system administrator that are automatically loaded. “Viewing Loaded Modulefiles” describes how you can determine what modulefiles are currently loaded on your system.
For example, if you wanted to automatically load the TotalView modulefile when you log in, edit your shell startup script to include the following instructions. This example uses bash as the login shell. Edit the ~/.bashrc file as follows: # if the 'module' command is defined, $MODULESHOME # will be set if [ -n "$MODULESHOME" ]; then module load totalview fi From now on, whenever you log in, the TotalView modulefile is automatically loaded in your environment. 3.
In this example, a user attempted to load the ifort/8.0 modulefile. After the user issued the command to load the modulefile, an error message occurred, indicating a conflict between this modulefile and the ifort/8.1 modulefile, which is already loaded. When a modulefile conflict occurs, unload the already loaded modulefile before loading the new modulefile. In this example, you should unload the ifort/8.1 modulefile before loading the ifort/8.0 modulefile.
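For example, the following sequence (a sketch based on the modulefiles named above) resolves the conflict:

$ module unload ifort/8.1   # unload the modulefile that is already loaded
$ module load ifort/8.0     # load the modulefile you originally wanted
$ module list               # verify which modulefiles are now loaded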
4 Developing Applications
This chapter discusses topics associated with developing applications in the HP XC environment. Before reading this chapter, you should read and understand Chapter 1 “Overview of the User Environment” and Chapter 2 “Using the System”.
HP UPC is a parallel extension of the C programming language, which runs on both common types of multiprocessor systems: those with a common global address space (such as SMP) and those with distributed memory. UPC provides a simple shared memory model for parallel programming, allowing data to be shared or distributed among a number of communicating processors.
4.3 Examining Nodes and Partitions Before Running Jobs Before launching an application, you can determine the availability and status of the system's nodes and partitions. Node and partition information is useful to have before launching a job so that you can launch the job to properly match the resources that are available on the system.
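For example, the SLURM sinfo command summarizes the partitions and the state of their nodes. The output below is only representative; the partition name, node names, and counts depend on how your system is configured:

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite      6  idle n[5-10]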
4.6.1 Serial Application Build Environment You can build and run serial applications in the HP XC programming environment. A serial application is a command or application that does not use any form of parallelism. An example of a serial application is a standard Linux command, such as the ls or hostname command. A serial application is basically a single-core application that has no communication library calls such as MPI. 4.6.
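As an illustration (a minimal sketch; the file and program names are arbitrary), a serial program can be built with a standard compiler and launched on a compute node with the SLURM srun command:

$ cat hello.c
#include <stdio.h>

int main(void)
{
    /* A serial program: no MPI or other communication library calls. */
    printf("Hello from a serial program\n");
    return 0;
}
$ gcc -o hello hello.c
$ srun ./hello
Hello from a serial program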
4.7.1.1 Modulefiles The basics of your working environment are set up automatically by your system administrator during the installation of HP XC. However, your application development environment can be modified by means of modulefiles, as described in “Overview of Modules”. There are modulefiles available that you can load yourself to further tailor your environment to your specific application development requirements. For example, the TotalView module is available for debugging applications.
To compile programs that use SHMEM, it is necessary to include the shmem.h file and to use the SHMEM and Elan libraries. For example: $ gcc -o shping shping.c -lshmem -lelan 4.7.1.6 MPI Library The MPI library supports MPI 1.2 as described in the 1997 release of MPI: A Message Passing Interface Standard. Users should note that the MPI specification describes the application programming interface, but does not specify the contents of the MPI header files, mpi.h and mpif.
4.7.1.12 MKL Library MKL is a math library that references pthreads, and in enabled environments, can use multiple threads. MKL can be linked in a single-threaded manner with your application by specifying the following in the link command: • On the CP3000 and CP4000 platforms (as appropriate): -L/opt/intel/mkl70/lib/32 -lmkl_ia32 -lguide -pthread -L/opt/intel/mkl70/lib/em64t -lmkl_em64t -lguide -pthread • On the CP6000 platforms: -L/opt/intel/mkl70/lib/64 -lmkl_ipf -lguide -pthread 4.7.1.
To compile and link a C application using the mpicc command:
$ mpicc -o mycode hello.c
To compile and link a Fortran application using the mpif90 command:
$ mpif90 -o mycode hello.f
In the above examples, the HP-MPI commands invoke compiler utilities that call the C and Fortran compilers with the appropriate libraries and search paths specified to build the parallel application from hello.c or hello.f. The -o option specifies that the resulting program is named mycode.
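The source files hello.c and hello.f are not reproduced here; a minimal C program consistent with the mpicc compile line above might look like the following sketch:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI run time */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}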
names. However, HP recommends an alternative method. The dynamic linker, during its attempt to load libraries, will suffix candidate directories with the machine type. The HP XC system on the CP4000 platform uses i686 for 32-bit binaries and x86_64 for 64-bit binaries. HP recommends structuring directories to reflect this behavior.
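The following sketch illustrates the recommended layout; the directory and library names are hypothetical. The 32-bit and 64-bit builds of a library are placed in i686 and x86_64 subdirectories of a common parent. At link time, point -L at the subdirectory that matches the build; at run time, point LD_LIBRARY_PATH at the parent directory, and the dynamic linker appends the machine type itself when it searches for the library:

/opt/mylib/lib/i686/libmine.so       # 32-bit build of the library
/opt/mylib/lib/x86_64/libmine.so     # 64-bit build of the library

$ gcc -o myapp myapp.c -L/opt/mylib/lib/x86_64 -lmine
$ export LD_LIBRARY_PATH=/opt/mylib/lib:$LD_LIBRARY_PATH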
NOTE: There is no shortcut as there is for the dynamic loader.
5 Submitting Jobs
This chapter describes how to submit jobs on the HP XC system; it addresses the following topics:
• “Overview of Job Submission” (page 49)
• “Submitting a Serial Job Using LSF” (page 49)
• “Submitting a Parallel Job” (page 51)
• “Submitting a Parallel Job That Uses the HP-MPI Message Passing Interface” (page 52)
• “Submitting a Batch Job or Job Script” (page 56)
• “Submitting a Job from a Host Other Than an HP XC Host” (page 61)
• “Running Preexecution Programs” (page 61)
The srun command is only necessary to launch the job on the allocated node if the HP XC JOB STARTER script is not configured to run a job on the compute nodes in the lsf partition. The jobname parameter can be the name of an executable or a batch script. If jobname is an executable, the job is launched on the LSF execution host node. If jobname is a batch script (containing srun commands), the job is launched on the LSF node allocation (compute nodes).
#include <unistd.h>
#include <stdio.h>

int main()
{
    char name[100];
    gethostname(name, sizeof(name));
    printf("%s says Hello!\n", name);
    return 0;
}
The following is the command line used to compile this program:
$ cc hw_hostname.c -o hw_hostname
NOTE: The following invocations of the sample hw_hostname program are run on a SLURM non-root default partition, which is not the default SLURM partition for the HP XC System Software.
The bsub command submits the job to LSF. The -n num-procs parameter, which is required for parallel jobs, specifies the number of cores requested for the job. The num-procs parameter may be expressed as minprocs[,maxprocs] where minprocs specifies the minimum number of cores and the optional value maxprocs specifies the maximum number of cores. The SLURM srun command is required to run jobs on an LSF node allocation. The srun command is the user job launched by the LSF bsub command.
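For example, the following line (a sketch; the core counts and command are arbitrary) asks LSF for at least two and at most four cores and runs the job interactively on whatever allocation is granted:

$ bsub -n 2,4 -I srun hostname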
variable that was set by LSF; this environment variable is equivalent to the number provided by the -n option of the bsub command. Any additional SLURM srun options are job specific, not allocation-specific. The mpi-jobname is the executable file to be run. The mpi-jobname must be compiled with the appropriate HP-MPI compilation utility. For more information, see the section titled Compiling applications in the HP-MPI User's Guide.
options that specify the minimum number of nodes required for the job, specific nodes for the job, and so on. Note: The SLURM external scheduler is a plug-in developed by Platform Computing for LSF; it is not actually part of SLURM. This plug-in communicates with SLURM to gather resource information and request allocations of nodes, but it is integrated with the LSF scheduler.
Example 5-9 Using the External Scheduler to Submit a Job to Run on Specific Nodes $ bsub -n4 -ext "SLURM[nodelist=n6,n8]" -I srun hostname Job <70> is submitted to default queue . <> <> n6 n6 n8 n8 In the previous example, the job output shows that the job was launched from the LSF execution host lsfhost.localdomain, and it ran on four cores on the specified nodes, n6 and n8.
Example 5-13 Using the External Scheduler to Constrain Launching to Nodes with a Given Feature $ bsub -n 10 -ext "SLURM[constraint=dualcore]" -I srun hostname You can use the bqueues command to determine the SLURM scheduler options that apply to jobs submitted to a specific LSF queue, for example: $ bqueues -l dualcore | grep SLURM MANDATORY_EXTSCHED: SLURM[constraint=dualcore] 5.
Example 5-15 Submitting a Batch Script with the LSF-SLURM External Scheduler Option $ bsub -n4 -ext "SLURM[nodes=4]" -I ./myscript.sh Job <79> is submitted to default queue . <> <> n1 n2 n3 n4 Hello world! I'm 0 of 4 on n1 Hello world! I'm 1 of 4 on n2 Hello world! I'm 2 of 4 on n3 Hello world! I'm 3 of 4 on n4 Example 5-16 and Example 5-17 show how the jobs inside the script can be manipulated within the allocation.
Example 5-18 Environment Variables Available in a Batch Job Script $ cat ./envscript.sh #!/bin/sh name=`hostname` echo "hostname = $name" echo "LSB_HOSTS = '$LSB_HOSTS'" echo "LSB_MCPU_HOSTS = '$LSB_MCPU_HOSTS'" echo "SLURM_JOBID = $SLURM_JOBID" echo "SLURM_NPROCS = $SLURM_NPROCS" $ bsub -n4 -I ./envscript.sh Job <82> is submitted to default queue . <> <
The ping_pong_ring application is submitted twice in a Makefile named mymake; the first time as run1 and the second as run2: $ cat mymake PPR_ARGS=10000 NODES=2 TASKS=4 all: run1 run2 run1: mpirun -srun -N ${NODES} -n ${TASKS} ./ping_pong_ring ${PPR_ARGS} run2: mpirun -srun -N ${NODES} -n ${TASKS} ./ping_pong_ring ${PPR_ARGS} The following command line makes the program and executes it: $ bsub -o %J.out -n2 -ext "SLURM[nodes=2]" make -j2 -f .
1 This line attempts to submit a program that does not exist. The following command line makes the program and executes it: $ bsub -o %J.out -n2 -ext "SLURM[nodes=2]" make -j3 \ -f ./mymake PPR_ARGS=100000 Job <117> is submitted to default queue . The output file contains error messages related to the attempt to launch the nonexistent program. $ cat 117.out . . . mpirun -srun -N 2 -n 4 ./ping_pong_ring 100000 mpirun -srun -N 2 -n 4 ./ping_pong_ring 100000 mpirun -srun -N 2 -n 4 .
5.6 Submitting a Job from a Host Other Than an HP XC Host To submit a job from a host other than an HP XC host to the HP XC system, use the LSF -R option, and the HP XC host type SLINUX64 (defined in lsf.shared) in the job submission resource requirement string.
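For example, a submission from a non-XC host might look like the following sketch; the exact resource requirement string, core count, and command shown here are illustrative only:

$ bsub -R "type==SLINUX64" -n4 -I srun hostname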
6 Debugging Applications
This chapter describes how to debug serial and parallel applications in the HP XC development environment. In general, effective debugging of applications requires the applications to be compiled with debug symbols, typically the -g switch. Some compilers allow -g with optimization. This chapter addresses the following topics:
• “Debugging Serial Applications” (page 63)
• “Debugging Parallel Applications” (page 63)
6.2.1 Debugging with TotalView
TotalView is a full-featured, GUI-based debugger specifically designed to meet the requirements of parallel applications running on many cores. You can purchase the TotalView debugger from Etnus, Inc. for use on the HP XC cluster. TotalView is not included with the HP XC software, and technical support is not provided by HP. Contact Etnus, Inc. for any issues with TotalView. This section provides only the minimum instructions to get you started using TotalView.
6.2.1.3 Using TotalView with SLURM Use the following commands to allocate the nodes you need before you debug an application with SLURM, as shown here: $ srun -Nx -A $ mpirun -tv -srun application These commands allocate x nodes and run TotalView to debug the program named application. Be sure to exit from the SLURM allocation created with the srun command when you are done. 6.2.1.4 Using TotalView with LSF HP recommends the use of xterm when debugging an application with LSF.
6.2.1.6 Debugging an Application This section describes how to use TotalView to debug an application. 1. Compile the application to be debugged. For example: $ mpicc -g -o Psimple simple.c -lm Use the -g option to enable debugging information. 2. Run the application in TotalView: $ mpirun -tv -srun -n2 ./Psimple 3. The TotalView main control window, called the TotalView root window, opens. It displays the following message in the window header: Etnus TotalView Version# 4.
6.2.1.7 Debugging Running Applications As an alternative to the method described in “Debugging an Application”, it is also possible to "attach" an instance of TotalView to an application which is already running. 1. Compile a long-running application as in “Debugging an Application”: $ mpicc -g -o Psimple simple.c -lm 2. Run the application: $ mpirun -srun -n2 Psimple 3. Start TotalView: $ totalview 4. Select Unattached in the TotalView Root Window to display a list of running processes.
7 Monitoring Node Activity
This chapter describes the optional utilities that provide performance information about the set of nodes associated with your jobs. It addresses the following topics:
• “The Xtools Utilities” (page 69)
• “Running Performance Health Tests” (page 70)

7.1 The Xtools Utilities
Two Xtools utilities help you monitor your HP XC system:
xcxclus    Enables you to monitor a number of nodes simultaneously.
7.2 Running Performance Health Tests You can run the ovp command to generate reports on the performance health of the nodes. Use the following format to run a specific performance health test: ovp [options] [-verify=perf_health/test] Where: options Specify additional command line options for the test. The ovp --help perf_health command lists the command line options for each test. The following options apply to all the tests: NOTE: • • Use the --opts= option to pass this option.
NOTE: Except for the network_stress and network_bidirectional tests, these tests only apply to systems that install LSF incorporated with SLURM. The network_stress and network_bidirectional tests also function under Standard LSF. You can list the available tests with the ovp -l command: $ ovp -l Test list for perf_health: cpu_usage memory_usage cpu memory network_stress network_bidirectional network_unidirectional By default, the ovp command reports if the nodes passed or failed the given test.
Verify perf_health: Testing cpu_usage ... The headnode is excluded from the cpu usage test. Number of nodes allocated for this test is 14 Job <102> is submitted to default queue . <> <>> All nodes have cpu usage less than 10%. +++ PASSED +++ This verification has completed successfully. A total of 1 test was run. Details of this verification have been recorded in: HOME_DIRECTORY/ovp_n16_mmddyy.
The tests execution directory has been saved in: HOME_DIRECTORY/ovp_n16_mmddyy.tests Details of this verification have been recorded in: HOME_DIRECTORY/ovp_n16_mmddyyr1.log 7.
8 Tuning Applications This chapter discusses how to tune applications in the HP XC environment. 8.1 Using the Intel Trace Collector and Intel Trace Analyzer This section describes how to use the Intel Trace Collector (ITC) and Intel Trace Analyzer (ITA) with HP-MPI on an HP XC system. The Intel Trace Collector/Analyzer were formerly known as VampirTrace and Vampir, respectively.
Example 8-1 The vtjacobic Example Program For the purposes of this example, the examples directory under /opt/IntelTrace/ITC is copied to the user's home directory and renamed to examples_directory. The GNU Makefile looks as follows: CC F77 CLINKER FLINKER IFLAGS CFLAGS FFLAGS LIBS CLDFLAGS = = = = = = = = = mpicc.mpich mpif77.mpich mpicc.mpich mpif77.
8.2 The Intel Trace Collector and Analyzer with HP-MPI on HP XC
NOTE: The Intel Trace Collector (ITC) was formerly known as VampirTrace. The Intel Trace Analyzer was formerly known as Vampir.

8.2.1 Installation Kit
The following are installation-related notes. There are two installation kits for the Intel Trace Collector and Analyzer:
• ITC-IA64-LIN-MPICH-PRODUCT.4.0.2.1.tar.gz
• ITA-IA64-LIN-AS21-PRODUCT.4.0.2.1.tar.gz
The Intel Trace Collector is installed in the /opt/IntelTrace/ITC directory.
Running a Program Ensure that the -static-libcxa flag is used when you use mpirun.mpich to launch a C or Fortran program. The following is a C example called vtjacobic: # mpirun.mpich -np 2 ~/xc_PDE_work/ITC_examples_xc6000/vtjacobic warning: this is a development version of HP-MPI for internal R&D use only /nis.home/user_name/xc_PDE_work/ITC_examples_xc6000/vtjacobic: 100 iterations in 0.228252 secs (28.712103 MFlops), m=130 n=130 np=2 [0] Intel Trace Collector INFO: Writing tracefile vtjacobic.
[0] Intel Trace Collector INFO: Writing tracefile vtjacobif.stf in /nis.home/user_name/xc_PDE_work/ITC_examples_xc6000
mpirun exits with status: 0

Running a Program Across Nodes (Using LSF)
The following is an example that uses the LSF bsub command to run the program named vtjacobic across four nodes:
# bsub -n4 -I mpirun.mpich -np 2 ./vtjacobic
The license file and the ITC directory need to be distributed across the nodes.
9 Using SLURM HP XC uses the Simple Linux Utility for Resource Management (SLURM) for system resource management and job scheduling.
The srun command handles both serial and parallel jobs. The srun command has a significant number of options to control the execution of your application closely. However, you can use it for a simple launch of a serial program, as Example 9-1 shows. Example 9-1 Simple Launch of a Serial Program $ srun hostname n1 9.3.1 The srun Roles and Modes The srun command submits jobs to run under SLURM management. The srun command can perform many roles in launching and managing your job.
Example 9-2 Displaying Queued Jobs by Their JobIDs
$ squeue --jobs 12345,12346
  JOBID PARTITION  NAME  USER ST  TIME_USED  NODES NODELIST(REASON)
  12345     debug  job1  jody  R       0:21      4 n[9-12]
  12346     debug  job2  jody PD       0:00      8
The squeue command can report on jobs in the job queue according to their state; possible states are: pending, running, completing, completed, failed, timeout, and node_fail. Example 9-3 uses the squeue command to report on failed jobs.
Example 9-8 Reporting Reasons for Downed, Drained, and Draining Nodes
$ sinfo -R
REASON              NODELIST
Memory errors       n[0,5]
Not Responding      n8

9.7 Job Accounting
HP XC System Software provides an extension to SLURM for job accounting. The sacct command displays job accounting data in a variety of forms for your analysis. Job accounting data is stored in a log file; the sacct command filters that log file to report on your jobs, jobsteps, status, and errors.
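For example (a sketch; the job ID shown is arbitrary), the following command reports the accounting records for a single job. The output includes one line per job and job step, with columns such as Jobstep, Jobname, Partition, Ncpus, Status, and Error:

$ sacct -j 123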
10 Using LSF The Load Sharing Facility (LSF) from Platform Computing is a batch system resource manager used on the HP XC system. On an HP XC system, a job is submitted to LSF, which places the job in a queue and allows it to run when the necessary resources become available. In addition to launching jobs, LSF provides extensive job management and information capabilities.
The LSF environment is set up automatically for the user on login; LSF commands and their manpages are readily accessible:
• The bhosts command is useful for viewing LSF batch host information.
• The lshosts command provides static resource information.
• The lsload command provides dynamic resource information.
• The bsub command is used to submit jobs to LSF.
• The bjobs command provides information on batch jobs.
“Translating SLURM and LSF JOBIDs” describes the relationship between the SLURM_JOBID and the LSF JOBID. SLURM_NPROCS This environment variable passes along the total number of tasks requested with the bsub -n command to all subsequent srun commands. User scripts can override this value with the srun -n command, but the new value must be less than or equal to the original number of requested tasks. LSF regards the entire HP XC system as a “SLURM machine.
Example 10-2 Examples of Launching LSF Jobs Without the srun Command The following bsub command line invokes the bash shell to run the hostname command with the pdsh command: [lsfadmin@n16 ~]$ bsub -n4 -I -ext "SLURM[nodes=4]" /bin/bash -c 'pdsh -w "$LSB_HOSTS" hostname' Job <118> is submitted to default queue . <> <
• LSF integrated with SLURM only runs daemons on one node within the HP XC system. This node hosts an HP XC LSF Alias, which is an IP address and corresponding host name specifically established for LSF integrated with SLURM on HP XC to use. The HP XC system is known by this HP XC LSF Alias within LSF. Various LSF commands, such as lsid, lshosts, and bhosts, display the HP XC LSF Alias in their output. The default value of the HP XC LSF Alias, lsfhost.
sometime in the future, depending on resource availability and batch system scheduling policies. Batch job submissions typically provide instructions on I/O management, such as files from which to read input and filenames to collect output. By default, LSF jobs are batch jobs. The output is e-mailed to the user, which requires that e-mail be set up properly. SLURM batch jobs are submitted with the srun -b command. By default, the output is written to $CWD/slurm-SLURMjobID.
allocates the appropriate whole node for exclusive use by the serial job in the same manner as it does for parallel jobs, hence the name “pseudo-parallel”. Parallel job A job that requests more than one slot, regardless of any other constraints. Parallel jobs are allocated up to the maximum number of nodes specified by the following specifications: • SLURM[nodes=min-max] (if specified) • SLURM[nodelist=node_list] (if specified) • bsub -n Parallel jobs and serial jobs cannot run on the same node.
request more than one core for a job. This option, coupled with the external SLURM scheduler, discussed in “LSF-SLURM External Scheduler”, gives you much flexibility in selecting resources and shaping how the job is executed on those resources. LSF reserves the requested number of nodes and executes one instance of the job on the first reserved node, when you request multiple nodes. Use the srun command or the mpirun command with the -srun option in your jobs to launch parallel applications.
Figure 10-1 How LSF and SLURM Launch and Manage a Job
[Figure: a user on login node n16 submits $ bsub -n4 -ext "SLURM[nodes=4]" -o output.out ./myscript. The LSF execution host (lsfhost.localdomain) runs the job_starter.sh script, which issues $ srun -n1 myscript with SLURM_JOBID=53 and SLURM_NPROCS=4. The script myscript runs hostname, srun hostname, and mpirun -srun ./hellompi, and the work is carried out on the allocated compute nodes n1 through n4.]
4. LSF prepares the user environment for the job on the LSF execution host node and dispatches the job with the job_starter.sh script. This user environment includes standard LSF environment variables and two SLURM-specific environment variables: SLURM_JOBID and SLURM_NPROCS. SLURM_JOBID is the SLURM job ID of the job. Note that this is not the same as the LSF jobID. “Translating SLURM and LSF JOBIDs” describes the relationship between the SLURM_JOBID and the LSF JOBID.
10.10.1 Examining System Core Status The bhosts command displays LSF resource usage information. This command is useful to examine the status of the system cores. The bhosts command provides a summary of the jobs on the system and information about the current state of LSF. For example, it can be used to determine if LSF is ready to start accepting batch jobs.
10.10.3 Getting Host Load Information
The LSF lsload command displays load information for LSF execution hosts.
$ lsload
HOST_NAME    status  r15s  r1m  r15m  ut  pg  ls  it  tmp  swp  mem
lsfhost.loc  ok      -     -    -     -   -   4   -   -    -    -
In the previous example output, the LSF execution host (lsfhost.localdomain) is listed under the HOST_NAME column. The status is listed as ok, indicating that it can accept remote jobs. The ls column shows the number of current login users on this host.
on this topic. See the LSF manpages for full information about the commands described in this section. The following LSF commands are described in this section: bjobs “Examining the Status of a Job” bhist “Viewing the Historical Information for a Job” 10.11.1 Getting Job Allocation Information Before a job runs, LSF integrated with SLURM allocates SLURM compute nodes based on job resource requirements.
date and time stamp: Started on 4 Hosts/Processors <4*lsfhost.
Example 10-6 Using the bjobs Command (Long Output) $ bjobs -l 24 Job <24>, User ,Project ,Status , Queue , Interactive pseudo-terminal shell mode, Extsched , Command date and time stamp: Submitted from host , CWD <$HOME>, 4 Processors Requested, Requested Resources ; date and time stamp: Started on 4 Hosts/Processors <4*lsfhost.
Table 10-2 Output Provided by the bhist Command (continued) Field Description UNKWN The total unknown time of the job. TOTAL The total time that the job has spent in all states. For detailed information about a finished job, add the -l option to the bhist command, shown in Example 10-8. The -l option specifies that the long format is requested.
$ sacct -j 123
Jobstep    Jobname             Partition  Ncpus  Status    Error
---------  ------------------  ---------  -----  --------  -----
123        hptclsf@99          lsf            8  RUNNING       0
123.0      hptclsf@99          lsf            0  RUNNING       0
In these examples, the job name is hptclsf@99; the LSF job ID is 99. Note that the scontrol show job command keeps jobs briefly after they finish, then it purges itself; the bjobs command behaves similarly.
$ export SLURM_JOBID=150 $ srun hostname n1 n2 n3 n4 Note: Be sure to unset the SLURM_JOBID when you are finished with the allocation, to prevent a previous SLURM JOBID from interfering with future jobs: $ unset SLURM_JOBID The following examples illustrate launching interactive MPI jobs. They use the hellompi job script introduced in Section 5.3.2 (page 52).
Note: If you exported any variables, such as SLURM_JOBID and SLURM_NPROCS, be sure to unset them as follows before submitting any further jobs from the second terminal: $ unset SLURM_JOBID $ unset SLURM_NPROCS You do not need to launch the /bin/bash shell to be able to interact with any compute node resources; any running job will suffice. This is excellent for checking on long-running jobs.
Table 10-3 LSF Equivalents of SLURM srun Options (continued)

srun Option: -w, --nodelist=node1,...nodeN
Description: Requests a specific list of nodes. The job will at least contain these nodes. The list may be specified as a comma-separated list of nodes, or a range of nodes. By default, job does not require -ext.
LSF Equivalent: -ext "SLURM[nodelist=node1,...nodeN]"

srun Option: --exclude=node1,...nodeN
Description: Requests that a specific list of hosts be excluded in the resource allocated to
LSF Equivalent: -ext "SLURM[exclude=node1,...
Table 10-3 LSF Equivalents of SLURM srun Options (continued)

srun Option: -s, --share
Description: Share nodes with other running jobs.
LSF Equivalent: SHARED=FORCE shares all nodes in partition. You cannot use this option. LSF uses this option to create allocation. SHARED=YES shares nodes if and only if --share is specified. SHARED=NO means do not share the node.

srun Option: -O
Description: Overcommit resources.
LSF Equivalent: Use when launching parallel tasks.

Description: Submit in “batch mode”.
LSF Equivalent: Meaningless under LSF integrated with SLURM.
Table 10-3 LSF Equivalents of SLURM srun Options (continued)

srun Option: -Q
Description: Suppress informational message.
LSF Equivalent: Use as an argument to srun when launching parallel tasks.

srun Option: --core=type
Description: Adjust corefile format for parallel job.
LSF Equivalent: Use as an argument to srun when launching parallel tasks.

srun Option: -a
Description: Attach srun to a running job.
LSF Equivalent: Meaningless under LSF integrated with SLURM.

Description: Join with running job.
LSF Equivalent: Meaningless under LSF integrated with SLURM.

Description: Steal connection to running job.
11 Advanced Topics
This chapter covers topics intended for the advanced user. This chapter addresses the following topics:
• “Enabling Remote Execution with OpenSSH” (page 107)
• “Running an X Terminal Session from a Remote Node” (page 107)
• “Using the GNU Parallel Make Capability” (page 109)
• “Local Disks on Compute Nodes” (page 112)
• “I/O Performance Considerations” (page 113)
• “Communication Between Nodes” (page 113)
$ echo $DISPLAY :0 Next, get the name of the local machine serving your display monitor: $ hostname mymachine Then, use the host name of your local machine to retrieve its IP address: $ host mymachine mymachine has address 192.0.2.134 Step 2. Logging in to HP XC System Next, you need to log in to a login node on the HP XC system. For example: $ ssh user@xc-node-name Once logged in to the HP XC system, you can start an X terminal session using SLURM or LSF.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite      2  idle n[46,48]
According to the information returned about this HP XC system, LSF has two nodes available for use, n46 and n48. Determine the address of your monitor's display server, as shown at the beginning of “Running an X Terminal Session from a Remote Node”. You can start an X terminal session using this address information in a bsub command with the appropriate options. For example:
$ bsub -n4 -Ip srun -n1 xterm -display 192.0.
proceed while another is waiting for I/O. On an HP XC system, there is the potential to use compute nodes to do compilations, and there are a variety of ways to make this happen. One way is to prefix the actual compilation line in the rule with an srun command. So, instead of executing cc foo.c -o foo.o it would execute srun cc foo.c -o foo.o. With concurrency, multiple command nodes would have multiple srun commands instead of multiple cc commands.
for i in ${HYPRE_DIRS}; \ do \ if [ -d $$i ]; \ then \ echo "Making $$i ..."; \ (cd $$i; make); \ echo ""; \ fi; \ done clean: @ \ for i in ${HYPRE_DIRS}; \ do \ if [ -d $$i ]; \ then \ echo "Cleaning $$i ..."; \ (cd $$i; make clean); \ fi; \ done veryclean: @ \ for i in ${HYPRE_DIRS}; \ do \ if [ -d $$i ]; \ then \ echo "Very-cleaning $$i ..."; \ (cd $$i; make veryclean); \ fi; \ done 11.3.
Modified Makefile: all: $(MAKE) $(MAKE_J) struct_matrix_vector/libHYPRE_mv.a struct_linear_solvers/libHYPRE_ls.a utilities/libHYPRE_utilities.a $(PREFIX) $(MAKE) -C test struct_matrix_vector/libHYPRE_mv.a: $(PREFIX) $(MAKE) -C struct_matrix_vector struct_linear_solvers/libHYPRE_ls.a: $(PREFIX) $(MAKE) -C struct_linear_solvers utilities/libHYPRE_utilities.a: $(PREFIX) $(MAKE) -C utilities The modified Makefile is invoked as follows: $ make PREFIX='srun -n1 -N1' MAKE_J='-j4' 11.3.
11.5 I/O Performance Considerations
Before building and running your parallel application, I/O performance issues on the HP XC cluster must be considered. The I/O control system provides two basic types of standard file system views to the application:
• Shared
• Private

11.5.1 Shared File View
Although a file opened by multiple processes of an application is shared, each core maintains a private file pointer and file position.
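The distinction matters when many processes write output concurrently. The following sketch (not taken from this guide) shows one common way to avoid contention on a shared file: each rank derives its own output file name from its MPI rank, so no two processes share a file pointer:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    char path[64];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One private output file per rank, for example out.0, out.1, ... */
    snprintf(path, sizeof(path), "out.%d", rank);
    fp = fopen(path, "w");
    if (fp != NULL) {
        fprintf(fp, "rank %d wrote its own file\n", rank);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}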
respectively. These subsections are not full solutions for integrating MPICH with the HP XC System Software.

Figure 11-1 MPICH Wrapper Script
#!/bin/csh
srun csh -c 'echo `hostname`:2' | sort | uniq > machinelist
set hostname = `head -1 machinelist | awk -F: '{print $1}'`
ssh $hostname /opt/mpich/bin/mpirun options... -machinefile machinelist a.out

The wrapper script is based on the following assumptions:
• Each node in the HP XC system contains two CPUs.
A Examples This appendix provides examples that illustrate how to build and run applications on the HP XC system. The examples in this section show you how to take advantage of some of the many methods available, and demonstrate a variety of other user commands to monitor, control, or kill jobs. The examples in this section assume that you have read the information in previous chapters describing how to use the HP XC commands to build and run parallel applications.
Examine the partition information:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite      6  idle n[5-10]

Examine the local host information:
$ hostname
n2

Examine the job information:
$ bjobs
No unfinished job found

Run the LSF bsub -Is command to launch the interactive shell:
$ bsub -Is -n1 /bin/bash
Job <120> is submitted to default queue .
<> <
date and time stamp: Submitted from host , CWD <$HOME>, 2 Processors Requested; date and time stamp: Started on 2 Hosts/Processors <2*lsfhost.localdomain>; date and time stamp: slurm_id=24;ncpus=4;slurm_alloc=n[13-14]; date and time stamp: Done successfully. The CPU time used is 0.0 seconds.
steps through a series of commands that illustrate what occurs when you launch an interactive shell. Examine the LSF execution host information: $ bhosts HOST_NAME STATUS lsfhost.
Summary of time in seconds spent in various states by date and time PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL 11 0 124 0 0 0 135 Exit from the shell: $ exit exit Examine the finished job's information: $ bhist -l 124 Job <124>, User , Project , Interactive pseudo-terminal shell mode, Extsched , Command date and time stamp: Submitted from host , to Queue , CWD <$HOME>, 4 Processors Requested, Requested Resources ; date and time stamp: Dispat
srun hostname
srun uname -a

Run the job:
$ bsub -I -n4 myjobscript.sh
Job <1006> is submitted to default queue .
<> <>
n14
n14
n16
n16
Linux n14 2.4.21-15.3hp.XCsmp #2 SMP date and time stamp ia64 ia64 ia64 GNU/Linux
Linux n14 2.4.21-15.3hp.XCsmp #2 SMP date and time stamp ia64 ia64 ia64 GNU/Linux
Linux n16 2.4.21-15.3hp.XCsmp #2 SMP date and time stamp ia64 ia64 ia64 GNU/Linux
Linux n16 2.4.21-15.3hp.
Show the SLURM job ID: $ env | grep SLURM SLURM_JOBID=74 SLURM_NPROCS=8 Run some commands from the pseudo-terminal: $ srun hostname n13 n13 n14 n14 n15 n15 n16 n16 $ srun -n3 hostname n13 n14 n15 Exit the pseudo-terminal: $ exit exit View the interactive jobs: $ bjobs -l 1008 Job <1008>, User smith, Project , Status , Queue , Interactive pseudo-terminal mode, Command date and time stamp: Submitted from host n16, CWD <$HOME/tar_drop1/test>, 8 Processors Requested; date and
View the node state:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite      4  idle n[13-16]

A.7 Submitting an HP-MPI Job with LSF
This example shows how to run an MPI job with the bsub command.

Show the environment:
$ lsid
Platform LSF HPC version, Update n, build date stamp
Copyright 1992-2008 Platform Computing Corporation
My cluster name is penguin
My master name is lsfhost.
EXTERNAL MESSAGES: MSG_ID FROM POST_TIME 0 1 lsfadmin date and time MESSAGE SLURM[nodes=2] ATTACHMENT N View the finished job: $ bhist -l 1009 Job <1009>, User , Project , Interactive mode, Extsched , Command date and time stamp: Submitted from host , to Queue ,CWD <$HOME>, 6 Processors Requested; date and time stamp: Dispatched to 6 Hosts/Processors <6*lsfhost.
Glossary A administration branch The half (branch) of the administration network that contains all of the general-purpose administration ports to the nodes of the HP XC system. administration network The private network within the HP XC system that is used for administrative operations. availability set An association of two individual nodes so that one node acts as the first server and the other node acts as the second server of a service. See also improved availability, availability tool.
operating system and its loader. Together, these provide a standard environment for booting an operating system and running preboot applications. enclosure The hardware and software infrastructure that houses HP BladeSystem servers. extensible firmware interface See EFI. external network node A node that is connected to a network external to the HP XC system. F fairshare An LSF job-scheduling policy that specifies how resources should be shared by competing users.
image server A node specifically designated to hold images that will be distributed to one or more client systems. In a standard HP XC installation, the head node acts as the image server and golden client. improved availability A service availability infrastructure that is built into the HP XC system software to enable an availability tool to fail over a subset of eligible services to nodes that have been designated as a second server of the service See also availability set, availability tool.
LVS Linux Virtual Server. Provides a centralized login capability for system users. LVS handles incoming login requests and directs them to a node with a login role. M Management Processor See MP. master host See LSF master host. MCS An optional integrated system that uses chilled water technology to triple the standard cooling capacity of a single rack. This system helps take the heat out of high-density deployments of servers and blades, enabling greater densities in data centers.
onboard administrator See OA. P parallel application An application that uses a distributed programming model and can run on multiple processors. An HP XC MPI application is a parallel application. That is, all interprocessor communication within an HP XC parallel application is performed through calls to the MPI message passing library. PXE Preboot Execution Environment.
an HP XC system, the use of SMP technology increases the number of CPUs (amount of computational power) available per unit of space. ssh Secure Shell. A shell program for logging in to and executing commands on a remote computer. It can provide secure encrypted communications between two untrusted hosts over an insecure network. standard LSF A workload manager for any kind of batch job.
Index A ACML library, 45 application development, 39 building parallel applications, 45 building serial applications, 42 communication between nodes, 113 compiling and linking parallel applications, 45 compiling and linking serial applications, 42 debugging parallel applications, 63 debugging serial applications, 63 debugging with TotalView, 64 determining available resources for, 94 developing libraries, 46 developing parallel applications, 42 developing serial applications, 41 examining core availability,
CP3000, 20 MKL library, 45 system interconnect, 22 CP3000BL, 20 CP4000, 20 ACML library, 45 compilers, 40, 44 designing libraries for, 46 MKL library, 45 software packages, 27 system interconnect, 22 CP6000 MKL library, 45 system interconnect, 22 examining core availability, 41 external scheduler, 49 F fault tolerance SLURM, 84 feedback e-mail address for documentation, 17 file system local disk, 112 Fortran, 27 building parallel applications, 44 G D DDT, 63 debugger TotalView, 64 debugging DDT, 63 gdb,
job accounting, 84 job allocation information obtaining, 97 job manager, 86 job scheduler, 86 JOBID translation, 100 L launching jobs srun, 81 libraries, 27 building parallel applications, 45 library development, 46 Linux manpages, 32 local disk configuring, 112 login node, 39 login procedure, 29 LSF, 85, 86 bhist command, 100 bhosts command, 95 bjobs command, 98, 100 bqueues command, 96 bsub command, 91, 100 determining available system resources, 94 differences from standard LSF, 88 displaying host load
developing, 39 environment for developing, 24 examples of, 115 partition reporting state of, 83 PATH environment variable setting with a module, 34 Pathscale building parallel applications, 44 Pathscale compilers, 40 Pathscale Fortran (see Fortran) performance considerations, 113 performance health tests, 70–73 pgdbg, 63 PGI building parallel applications, 44 PGI compilers, 40 PGI Fortran (see Fortran) private file view, 113 /proc/cpuinfo, 19 program development (see application development) programming env
UPC, 39 user environment, 33 V Vampir, 77 VampirTrace/Vampir, 75 W Web site HP XC System Software documentation, 12 X xterm running from remote node, 107 135