HP XC System Software User's Guide Version 3.
© Copyright 2003, 2005, 2006 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Table of Contents

About This Document ........ 13
    1 Intended Audience ........ 13
    2 New and Changed Information in This Edition ........ 13
    3 Typographic Conventions

3 Configuring Your Environment with Modulefiles ........ 31
    3.1 Overview of Modules ........ 31
    3.2 Supplied Modulefiles ........ 32
    3.3 Modulefiles Automatically Loaded on the System

5.4 Submitting a Batch Job or Job Script ........ 53
5.5 Submitting a Job from a Host Other Than an HP XC Host ........ 55
5.6 Running Preexecution Programs ........ 56
6 Debugging Applications

10.5 Using LSF-HPC Integrated with SLURM in the HP XC Environment ........ 87
    10.5.1 Useful Commands ........ 87
    10.5.2 Job Startup and Job Control ........ 87
    10.5.3 Preemption

List of Figures
4-1 Library Directory Structure ........ 44
4-2 Recommended Library Directory Structure ........ 44
7-1 The xcxclus Utility Display ........ 64
7-2 The xcxclus Utility Display Icon
7-3 The clusplot Utility Display
7-4 The xcxperf Utility Display
7-5 The perfplot Utility Display
10-1 How LSF-HPC and SLURM Launch and Manage a Job

List of Tables
1-1 Determining the Node Platform ........ 20
1-2 HP XC System Interconnects ........ 22
3-1 Supplied Modulefiles ........ 32
4-1 Compiler Commands
5-1 Arguments for the SLURM External Scheduler
10-2 LSF-HPC Equivalents of SLURM srun Options

List of Examples
5-1 Submitting a Job from the Standard Input ........ 48
5-2 Submitting a Serial Job Using LSF-HPC ........ 48
5-3 Submitting an Interactive Serial Job Using LSF-HPC only
About This Document This document provides information about using the features and functions of the HP XC System Software. It describes how the HP XC user and programming environments differ from standard Linux® system environments.
Ctrl+x                  A key sequence. A sequence such as Ctrl+x indicates that you must hold down the key labeled Ctrl while you press another key or mouse button.
ENVIRONMENT VARIABLE    The name of an environment variable, for example, PATH.
[ERROR NAME]            The name of an error, usually returned in the errno variable.
Key                     The name of a keyboard key. Return and Enter both refer to the same key.
Term                    The defined use of an important word or phrase.
User input, Variable, [], {}, ..., |, WARNING, CAUTION, IMPORTANT, NOTE
See the following sources for information about related HP products.

HP XC Program Development Environment
The Program Development Environment home page provides pointers to tools that have been tested in the HP XC program development environment (for example, TotalView® and other debuggers, compilers, and so on).
http://h20311.www2.hp.com/HPC/cache/276321-0-0-0-121.
— Administering Platform LSF
— Administration Primer
— Platform LSF Reference
— Quick Reference Card
— Running Jobs with Platform LSF

LSF procedures and information supplied in the HP XC documentation, particularly the documentation relating to the LSF-HPC integration with SLURM, supersede the information supplied in the LSF manuals from Platform Computing Corporation. The Platform Computing Corporation LSF manpages are installed by default.
• http://sourceforge.net/projects/modules/
  Web site for Modules, which provide for easy dynamic modification of a user's environment through modulefiles, which typically instruct the module command to alter or set shell environment variables.

• http://dev.mysql.com/
  Home page for MySQL AB, developer of the MySQL database. This Web site contains a link to the MySQL documentation, particularly the MySQL Reference Manual.
Software RAID Web Sites
• http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html and
  http://www.ibiblio.org/pub/Linux/docs/HOWTO/other-formats/pdf/Software-RAID-HOWTO.pdf
  A document (in two formats: HTML and PDF) that describes how to use software RAID under a Linux operating system.

• http://www.linuxdevcenter.com/pub/a/linux/2002/12/05/RAID.html
  Provides information about how to use the mdadm RAID management utility.
1 Overview of the User Environment The HP XC system is a collection of computer nodes, networks, storage, and software, built into a cluster, that work together. It is designed to maximize workload and I/O performance, and to provide the efficient management of large, complex, and dynamic workloads.
Table 1-1 Determining the Node Platform

Platform      Partial Output of /proc/cpuinfo

CP3000        processor  : 0
              vendor_id  : GenuineIntel
              cpu family : 15
              model      : 3
              model name : Intel(R) Xeon(TM)

CP4000        processor  : 0
              vendor_id  : AuthenticAMD
              cpu family : 15
              model      : 5
              model name : AMD Opteron(tm)

CP6000        processor  : 0
              vendor     : GenuineIntel
              arch       : IA-64
              family     : Itanium 2
              model      : 1

CP3000BL      processor  : 0
(Blade-only   vendor_id  : GenuineIntel
XC systems)   cpu family : 15
              model      : 6
              model name : Intel(R) Xeon(TM)
nodes must be launched from nodes with the login role. Nodes with the compute role are referred to as compute nodes in this manual. 1.1.5 Storage and I/O The HP XC system supports both shared (global) and private (local) disks and file systems. Shared file systems can be mounted on all the other nodes by means of Lustre or NFS. This gives users a single view of all the shared data on disks attached to the HP XC system.
and keeps software from conflicting with user installed software.
Use the following command to display the amount of free and used memory in megabytes:
free -m

Use the following command to display the disk partitions and their sizes:
cat /proc/partitions

Use the following command to display the swap usage summary by device:
swapon -s

Use the following commands to display the cache information; this is not available on all systems:
cat /proc/pal/cpu0/cache_info
cat /proc/pal/cpu1/cache_info
SLURM commands
    HP XC uses the Simple Linux Utility for Resource Management (SLURM) for system resource management and job scheduling. Standard SLURM commands are available through the command line. SLURM functionality is described in Chapter 9 “Using SLURM”. Descriptions of SLURM commands are available in the SLURM manpages. Invoke the man command with the SLURM command name to access them.

HP-MPI commands
    You can run standard HP-MPI commands from the command line.

Modules commands
1.5.2 Load Sharing Facility (LSF-HPC) The Load Sharing Facility for High Performance Computing (LSF-HPC) from Platform Computing Corporation is a batch system resource manager that has been integrated with SLURM for use on the HP XC system. LSF-HPC for SLURM is included with the HP XC System Software, and is an integral part of the HP XC environment. LSF-HPC interacts with SLURM to obtain and allocate available resources, and to launch and control all the jobs submitted to LSF-HPC.
HP-MPI Determines HOW the job runs. It is part of the application, so it performs communication. HP-MPI can also pinpoint the processor on which each rank runs. 1.5.5 HP-MPI HP-MPI is a high-performance implementation of the Message Passing Interface (MPI) standard and is included with the HP XC system. HP-MPI uses SLURM to launch jobs on an HP XC system — however, it manages the global MPI exchange so that all processes can communicate with each other. See the HP-MPI documentation for more information.
2 Using the System
This chapter describes the tasks and commands that the general user must know to use the system. It addresses the following topics:
• “Logging In to the System” (page 27)
• “Overview of Launching and Managing Jobs” (page 27)
• “Performing Other Common User Tasks” (page 29)
• “Getting System Help and Information” (page 30)

2.1 Logging In to the System
Logging in to an HP XC system is similar to logging in to any standard Linux system.
2.2.1 Introduction As described in “Run-Time Environment” (page 24), SLURM and LSF-HPC cooperate to run and manage jobs on the HP XC system, combining LSF-HPC's powerful and flexible scheduling functionality with SLURM's scalable parallel job-launching capabilities. SLURM is the low-level resource manager and job launcher, and performs core allocation for jobs. LSF-HPC gathers information about the cluster from SLURM.
For more information about using this command and a sample of its output, see “Getting Information About the LSF Execution Host Node” (page 91).

• The LSF lsload command displays load information for the LSF execution host node.
  $ lsload
  For more information about using this command and a sample of its output, see “Getting Host Load Information” (page 91).

2.2.4 Getting Information About System Partitions
You can view information about system partitions with the SLURM sinfo command.
My cluster name is hptclsf
My master name is lsfhost.localdomain

In this example, hptclsf is the LSF cluster name, and lsfhost.localdomain is the name of the virtual IP address used by the node where LSF-HPC is installed and running (LSF execution host).

2.4 Getting System Help and Information
In addition to the hardcopy documentation described in the preface of this document (“About This Document”), the HP XC system also provides system help and information in the form of online manpages.
3 Configuring Your Environment with Modulefiles
The HP XC system supports the use of Modules software to make it easier to configure and modify your environment. Modules software enables dynamic modification of your environment by the use of modulefiles.
(perhaps with incompatible shared objects) installed, it is probably wise to set MPI_CC (and others) explicitly to the commands made available by the compiler's modulefile. The contents of the modulefiles in the modulefiles_hptc RPM use the vendor-intended location of the installed software. In many cases, this is under the /opt directory, but in a few cases (for example, the PGI compilers and the TotalView debugger) this is under the /usr directory.
Table 3-1 Supplied Modulefiles (continued)

Modulefile             Sets the HP XC User Environment to Use:
imkl/8.0 (default)     Intel Math Kernel Library.
intel/7.1              Intel Version 7.1 compilers.
intel/8.0              Intel Version 8.0 compilers.
intel/8.1              Intel Version 8.1 compilers.
intel/9.0              Intel Version 9.0 compilers.
intel/9.1 (default)    Intel Version 9.1 compilers.
mlib/intel/7.1         HP Math Library for Intel 7.1 compilers.
mlib/intel/8.0         HP Math Library for Intel 8.0 compilers.
mlib/intel/8.
3.5 Viewing Loaded Modulefiles
A loaded modulefile is a modulefile that has been explicitly loaded in your environment by the module load command. To view the modulefiles that are currently loaded in your environment, issue the module list command:
$ module list

3.6 Loading a Modulefile
You can load a modulefile into your environment to enable easier access to software that you want to use by executing the module load command.
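For example, the following commands load the modulefile for the HP-MPI compiler utilities (used elsewhere in this guide) and then list the loaded modulefiles to confirm the result:

$ module load mpi
$ module list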
3.8 Viewing Modulefile-Specific Help You can view help information for any of the modulefiles on the HP XC system. For example, to access modulefile-specific help information for TotalView, issue the module help command as follows: $ module help totalview ----------- Module Specific Help for 'totalview/default' ----------------This loads the TotalView environment.
To install a product or package, you should look at the manpages for modulefiles, examine the existing modulefiles, and create a new modulefile for the product being installed, using the existing modulefiles as a template.
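A minimal sketch of that workflow follows; the directory and product names are placeholders, and the location of the supplied modulefiles depends on your installation:

$ mkdir -p ~/modulefiles
$ cp /path/to/an/existing/modulefile ~/modulefiles/myproduct
$ vi ~/modulefiles/myproduct
$ module use ~/modulefiles
$ module load myproduct

The module use command adds the private directory to the modulefile search path so that the new modulefile can be loaded like any supplied modulefile.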
4 Developing Applications
This chapter discusses topics associated with developing applications in the HP XC environment. Before reading this chapter, you should read and understand Chapter 1 “Overview of the User Environment” and Chapter 2 “Using the System”.
Table 4-1 Compiler Commands

Type            C     C++    Fortran   Notes
Standard Linux  gcc   g++    g77       All HP XC platforms. The HP XC System Software supplies these compilers by default.
Intel           icc   icc    ifort     Version 9.0 compilers, for use on the Intel 64-bit platform.
Intel           icc   icc    ifort     Version 8.0 compilers, for use on the Intel 64-bit platform.
Intel           ecc   ecc    efc       Version 7.1 compilers, for use on the Intel 64-bit platform. These compilers can be used but Intel may not support them much longer.
The Ctrl/Z key sequence is ignored. 4.5 Setting Debugging Options In general, the debugging information for your application that is needed by most debuggers can be produced by supplying the -g switch to the compiler. For more specific information about debugging options, see the documentation and manpages associated with your compiler. 4.6 Developing Serial Applications This section describes how to build and run serial applications in the HP XC environment.
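As a brief sketch (the source file and program names are placeholders), a serial application can be compiled with the default GNU compiler on a login node and then launched on a compute node with the srun command:

$ gcc -g -o hello hello.c
$ srun ./hello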
For further information about developing parallel applications in the HP XC environment, see the following:
• “Launching Jobs with the srun Command” (page 79)
• “Debugging Parallel Applications” (page 57)
• Chapter “Advanced Topics” (page 103)

4.7.1 Parallel Application Build Environment
This section discusses the parallel application build environment on an HP XC system.
Intel    -pthread
PGI      -lpgthread

For example:
$ mpicc object1.o ... -pthread -o myapp.exe

4.7.1.5 Quadrics SHMEM
The Quadrics implementation of SHMEM runs on HP XC systems with Quadrics switches. SHMEM is a collection of high-performance routines (that support a distributed-memory model) for data passing between parallel executables. To compile programs that use SHMEM, it is necessary to include the shmem.h file and to use the SHMEM and Elan libraries. For example:
$ gcc -o shping shping.
Information about using the GNU parallel Make is provided in “Using the GNU Parallel Make Capability”. For further information about using GNU parallel Make, see the make manpage. For additional sources of GNU information, see the references provided in the front of this manual, located in “About This Document”. 4.7.1.12 MKL Library MKL is a math library that references pthreads, and in enabled environments, can use multiple threads.
If you have not already loaded the mpi compiler utilities module, load it now as follows:
$ module load mpi

To compile and link a C application using the mpicc command:
$ mpicc -o mycode hello.c

To compile and link a Fortran application using the mpif90 command:
$ mpif90 -o mycode hello.f

In the above examples, the HP-MPI commands invoke compiler utilities that call the C and Fortran compilers with the appropriate libraries and search paths specified to build the parallel application called mycode.
For released libraries, dynamic and archive, the usual custom is to have a ../lib directory that contains the libraries. This, by itself, will work if the 32-bit and 64-bit libraries have different names. However, HP recommends an alternative method. The dynamic linker, during its attempt to load libraries, will suffix candidate directories with the machine type. The HP XC system on the CP4000 platform uses i686 for 32-bit binaries and x86_64 for 64-bit binaries.
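For example, under a hypothetical installation prefix /opt/mylib, the recommended layout keeps both variants under a single lib directory and lets the dynamic linker pick the matching subdirectory:

/opt/mylib/lib/i686/libexample.so        (32-bit library)
/opt/mylib/lib/x86_64/libexample.so      (64-bit library)

$ export LD_LIBRARY_PATH=/opt/mylib/lib:$LD_LIBRARY_PATH

Because the dynamic linker suffixes each candidate directory with the machine type, a 32-bit binary resolves libexample.so from lib/i686 and a 64-bit binary resolves it from lib/x86_64 without any change to the user's settings.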
NOTE: There is no shortcut as there is for the dynamic loader. 4.
5 Submitting Jobs
This chapter describes how to submit jobs on the HP XC system; it addresses the following topics:
• “Overview of Job Submission” (page 47)
• “Submitting a Serial Job Using LSF-HPC” (page 47)
• “Submitting a Parallel Job” (page 49)
• “Submitting a Parallel Job That Uses the HP-MPI Message Passing Interface” (page 50)
• “Submitting a Batch Job or Job Script” (page 53)
• “Submitting a Job from a Host Other Than an HP XC Host” (page 55)
• “Running Preexecution Programs” (page 56)
launched on the LSF-HPC node allocation (compute nodes). The LSF-HPC node allocation is created by the -n num-procs parameter, which specifies the number of cores the job requests. The SLURM srun job launch command is only needed if the LSF-HPC JOB_STARTER script is not configured for the intended queue, but it can be used regardless of whether or not the script is configured. You can use the bqueues command to confirm whether or not the JOB_STARTER script exists; see bqueues(1) for information on the bqueues command.
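For example (the queue name normal is illustrative), the long listing for a queue includes a JOB_STARTER line when the script is configured for that queue:

$ bqueues -l normal | grep JOB_STARTER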
return 0; } The following is the command line used to compile this program: $ cc hw_hostname.c -o hw_hostname When run on the login node, it shows the name of the login node, n16 in this case: $ ./hw_hostname n16 says Hello! When you use the srun command to submit this program, it runs on one of the compute nodes. In this instance, it runs on node n13: $ srun ./hw_hostname n13 says Hello! Submitting the same program again with the srun command may run this program on another node, as shown here: $ srun .
The SLURM srun command is required to run jobs on an LSF-HPC node allocation. The srun command is the user job launched by the LSF bsub command. SLURM launches the jobname in parallel on the reserved cores in the lsf partition. The jobname parameter is the name of an executable file or command to be run in parallel. Example 5-5 illustrates a non-MPI parallel job submission. The job output shows that the job “srun hostname” was launched from the LSF execution host lsfhost.
Example 5-7 Submitting an MPI Job $ bsub -n4 -I mpirun -srun ./hello_world Job <24> is submitted to default queue . <> <> Hello world! Hello world! I'm 1 of 4 on host1 Hello world! I'm 3 of 4 on host2 Hello world! I'm 0 of 4 on host1 Hello world! I'm 2 of 4 on host2 You can use the LSF-SLURM External Scheduler option to add capabilities at the job level and queue level by including several SLURM options in the command line.
bsub -n num-procs -ext "SLURM[slurm-arguments]" [bsub-options] [-srun [srun-options]] [jobname] [job-options]

The slurm-arguments parameter can be one or more of the following srun options, separated by semicolons, as described in Table 5-1.

Table 5-1 Arguments for the SLURM External Scheduler

SLURM Arguments    Function
nodes=min[-max]    Specifies the minimum and maximum number of nodes allocated to the job. The job allocation will contain at least the minimum number of nodes.
Example 5-11 Using the External Scheduler to Submit a Job That Excludes One or More Nodes $ bsub -n4 -ext "SLURM[nodes=4; exclude=n3]" -I srun hostname Job <72> is submitted to default queue . <> <> n1 n2 n4 n5 This example runs the job exactly the same as in Example 5-10 “Using the External Scheduler to Submit a Job to Run One Task per Node”, but additionally requests that node n3 is not to be used to run the job.
Example 5-14 Submitting a Job Script $ cat myscript.sh #!/bin/sh srun hostname mpirun -srun hellompi $ bsub -I -n4 myscript.sh Job <29> is submitted to default queue . <> <> n2 n2 n4 n4 Hello world! I'm 0 of 4 on n2 Hello world! I'm 1 of 4 on n2 Hello world! I'm 2 of 4 on n4 Hello world! I'm 3 of 4 on n4 Example 5-15 runs the same script but uses the LSF-SLURM External Scheduler option to specify different resources (here, 4 compute nodes).
Example 5-17 Submitting a Batch job Script That Uses the srun --overcommit Option $ bsub -n4 -I ./myscript.sh Job <81> is submitted to default queue . <> <
5.6 Running Preexecution Programs
A preexecution program is a program that performs setup tasks that an application needs. It may create directories, input files, and so on. Though LSF-HPC daemons run only on a node with the resource management role, batch jobs can run on any compute node that satisfies the scheduling and allocation requirements.
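One common approach, sketched below with placeholder script and application names, is to perform the setup with srun at the top of the job script so that it runs on the allocated compute nodes before the application starts:

$ cat myscript.sh
#!/bin/sh
srun mkdir -p /tmp/$USER/scratch
mpirun -srun ./myapp
$ bsub -n4 -I ./myscript.sh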
6 Debugging Applications
This chapter describes how to debug serial and parallel applications in the HP XC development environment. In general, effective debugging of applications requires the applications to be compiled with debug symbols, typically the -g switch. Some compilers allow -g with optimization. This chapter addresses the following topics:
• “Debugging Serial Applications” (page 57)
• “Debugging Parallel Applications” (page 57)
This section provides only minimum instructions to get you started using TotalView. Instructions for installing TotalView are included in the HP XC System Software Installation Guide. Read the TotalView documentation for full information about using TotalView; the TotalView documentation set is available directly from Etnus, Inc. at the following URL: http://www.etnus.com 6.2.1.
6.2.1.4 Using TotalView with LSF-HPC HP recommends the use of xterm when debugging an application with LSF-HPC. You also need to allocate the nodes you will need.
Use the -g option to enable debugging information.
2. Run the application in TotalView:
   $ mpirun -tv -srun -n2 ./Psimple
3. The TotalView main control window, called the TotalView root window, opens. It displays the following message in the window header:
   Etnus TotalView Version#
4. The TotalView process window opens. This window contains multiple panes that provide various debugging functions and debugging information.
$ mpicc -g -o Psimple simple.c -lm
2. Run the application:
   $ mpirun -srun -n2 Psimple
3. Start TotalView:
   $ totalview
4. Select Unattached in the TotalView Root Window to display a list of running processes. Double-click on the srun process to attach to it.
5. The TotalView Process Window appears, displaying information on the srun process.
6. Select Attached in the TotalView Root Window.
7. Double-click one of the remote srun processes to display it in the TotalView Process Window.
7 Monitoring Node Activity This chapter describes the optional utilities that provide performance information about the set of nodes associated with your jobs.
Figure 7-1 The xcxclus Utility Display The icons show most node utilization statistics as a percentage of the total resource utilization. For example, Figure 7-1 indicates that the CPU cores are almost fully utilized, at 94 per cent, and 95 per cent of available CPU time. These values are rounded to the nearest integer. Selecting (that is, clicking on) an icon automatically invokes another utility, xcxperf, described in “Using the xcxperf Utility to Display Node Performance” (page 66).
1. The node designator is on the upper left of the icon.
2. The left portion of the icon represents the Ethernet connection or connections. In this illustration, two Ethernet connections are used. The data for eth0 is above the data for eth1. As many as 4 Ethernet connections can be displayed.
3. The center portion of the icon displays core usage data for each CPU core in the node. As many as 4 CPU cores can be displayed.
4, 5, 6. The right portion of the icon displays memory statistics.
Figure 7-3 The clusplot Utility Display The clusplot utility uses the GNUplot open source plotting program. 7.4 Using the xcxperf Utility to Display Node Performance The xcxperf utility provides a graphic display of node performance for a variety of metrics. You can invoke the xcxperf utility either by entering it on the command line or by selecting a node icon in the xcxclus display. The xcxperf utility displays a dynamic graph showing the performance metrics for the node.
$ xcxperf -o test

Figure 7-4 The xcxperf Utility Display

Specifying the data file prefix when you invoke the xcxperf utility from the command line plays back the display according to the recorded data. The following command line plays back the test.xcxperf data file:
$ xcxperf test

The graphical display differs from the depiction in Figure 7-4 because there is an additional pull-down menu named Control next to the File menu. Choosing the Play...
Figure 7-5 The perfplot Utility Display

7.6 Running Performance Health Tests
You can run the ovp command to generate reports on the performance health of the nodes. Use the following format to run a specific performance health test:

ovp [options] [--verify=perf_health/test]

Where:

options    Specify additional command line options for the test. The ovp --help perf_health command lists the command line options for each test.
NOTE: The --nodelist=nodelist option is particularly useful for determining problematic nodes. If you use this option and the --nnodes=n option, the --nnodes=n option is ignored.

test    Indicates the test to perform. The following tests are available:

    cpu          Tests CPU core performance using the Linpack benchmark.
    cpu_usage    Tests CPU core usage. All CPU cores should be idle during the test. This test reports a node if it is using more than 10% (by default) of its CPU cores.
$ ovp --verify=perf_health/cpu_usage
XC CLUSTER VERIFICATION PROCEDURE
date time
Verify perf_health:
Testing cpu_usage ...
+++ PASSED +++
This verification has completed successfully.
A total of 1 test was run.
Details of this verification have been recorded in:
HOME_DIRECTORY/ovp_n16_mmddyy.log

The following example runs the same test but with the --verbose option to show additional output.
Verify perf_health:
Testing memory ...
Specified nodelist is n[11-15]
Number of nodes allocated for this test is 5
Job <103> is submitted to default queue .
<>
<>>
Detailed streams results for each node can be found in path_name if the --keep flag was specified.
Streams memory results summary (all values in mBytes/sec):
min:      1272.062500
max:      2090.885900
median:   2059.492300
mean:     1865.301018
range:     818.823400
variance: 128687.
8 Tuning Applications
This chapter discusses how to tune applications in the HP XC environment.

8.1 Using the Intel Trace Collector and Intel Trace Analyzer
This section describes how to use the Intel Trace Collector (ITC) and Intel Trace Analyzer (ITA) with HP-MPI on an HP XC system. The Intel Trace Collector and Intel Trace Analyzer were formerly known as VampirTrace and Vampir, respectively.
Example 8-1 The vtjacobic Example Program

For the purposes of this example, the examples directory under /opt/IntelTrace/ITC is copied to the user's home directory and renamed to examples_directory. The GNU Makefile looks as follows:

CC       = mpicc.mpich
F77      = mpif77.mpich
CLINKER  = mpicc.mpich
FLINKER  = mpif77.
IFLAGS   =
CFLAGS   =
FFLAGS   =
LIBS     =
CLDFLAGS =
8.2 The Intel Trace Collector and Analyzer with HP-MPI on HP XC

NOTE: The Intel Trace Collector (ITC) was formerly known as VampirTrace. The Intel Trace Analyzer was formerly known as Vampir.

8.2.1 Installation Kit
The following are installation-related notes. There are two installation kits for the Intel Trace Collector and Analyzer:
• ITC-IA64-LIN-MPICH-PRODUCT.4.0.2.1.tar.gz
• ITA-IA64-LIN-AS21-PRODUCT.4.0.2.1.tar.gz
The Intel Trace Collector is installed in the /opt/IntelTrace/ITC directory.
Running a Program Ensure that the -static-libcxa flag is used when you use mpirun.mpich to launch a C or Fortran program. The following is a C example called vtjacobic: # mpirun.mpich -np 2 ~/xc_PDE_work/ITC_examples_xc6000/vtjacobic warning: this is a development version of HP-MPI for internal R&D use only /nis.home/user_name/xc_PDE_work/ITC_examples_xc6000/vtjacobic: 100 iterations in 0.228252 secs (28.712103 MFlops), m=130 n=130 np=2 [0] Intel Trace Collector INFO: Writing tracefile vtjacobic.
86 Difference is 2.809467246160129E-005 88 Difference is 2.381154327036583E-005 90 Difference is 2.018142964565221E-005 92 Difference is 1.710475838933507E-005 94 Difference is 1.449714388058985E-005 96 Difference is 1.228707004052045E-005 98 Difference is 1.041392661369357E-005 [0] Intel Trace Collector INFO: Writing tracefile vtjacobif.stf in /nis.
9 Using SLURM HP XC uses the Simple Linux Utility for Resource Management (SLURM) for system resource management and job scheduling.
Example 9-1 Simple Launch of a Serial Program $ srun hostname n1 9.3.1 The srun Roles and Modes The srun command submits jobs to run under SLURM management. The srun command can perform many roles in launching and managing your job. The srun command operates in several distinct usage modes to accommodate the roles it performs. 9.3.1.
Example 9-3 Reporting on Failed Jobs in the Queue

$ squeue --state=FAILED
JOBID PARTITION     NAME  USER  ST  TIME  NODES NODELIST(REASON)
   59      amt1 hostname  root   F  0:00      0

9.5 Terminating Jobs with the scancel Command
The scancel command cancels a pending or running job or job step. It can also be used to send a specified signal to all processes on all nodes associated with a job. Only job owners or administrators can cancel jobs. Example 9-4 terminates job #415 and all its jobsteps.
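A sketch of the command that example describes (the job ID is taken from the surrounding text):

$ scancel 415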
# chmod a+r /hptc_cluster/slurm/job/jobacct.log You can find detailed information on the sacct command and job accounting data in the sacct(1) manpage. 9.8 Fault Tolerance SLURM can handle a variety of failure modes without terminating workloads, including crashes of the node running the SLURM controller. User jobs may be configured to continue execution despite the failure of one or more nodes on which they are executing.
10 Using LSF-HPC The Load Sharing Facility (LSF-HPC) from Platform Computing Corporation is a batch system resource manager used on the HP XC system. On an HP XC system, a job is submitted to LSF-HPC, which places the job in a queue and allows it to run when the necessary resources become available. In addition to launching jobs, LSF-HPC provides extensive job management and information capabilities.
• The bsub command is used to submit jobs to LSF.
• The bjobs command provides information on batch jobs.

10.2 Overview of LSF-HPC Integrated with SLURM
LSF-HPC was integrated with SLURM for the HP XC system to merge the scalable and efficient resource management of SLURM with the extensive scheduling capabilities of LSF-HPC. In this integration:
• SLURM manages the compute resources.
• LSF-HPC performs the job management.
10.3 Differences Between LSF-HPC and LSF-HPC Integrated with SLURM LSF-HPC integrated with SLURM for the HP XC environment supports all the standard features and functions that LSF-HPC supports, except for those items described in this section, in “Using LSF-HPC Integrated with SLURM in the HP XC Environment”, and in the HP XC release notes for LSF-HPC. • By LSF-HPC standards, the HP XC system is a single host.
$ lshosts
HOST_NAME    type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
lsfhost.loc  SLINUX6  Opteron8  60.0      8   2007M          Yes     (slurm)

$ ssh n15 lshosts
HOST_NAME    type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
lsfhost.loc  SLINUX6  Opteron8  60.0      8   2007M          Yes     (slurm)
n15          UNKNOWN  UNKNOWN_   1.0                         No      ()

• The job-level run-time limits enforced by LSF-HPC integrated with SLURM are not supported.
Pseudo-parallel job
A job that requests only one slot but specifies any of these constraints:
• mem
• tmp
• nodes=1
• mincpus > 1
Pseudo-parallel jobs are allocated one node for their exclusive use.

NOTE: Do NOT rely on this feature to provide node-level allocation for small jobs in job scripts. Use the SLURM[nodes] specification instead, along with the mem, tmp, and mincpus allocation options.

LSF-HPC considers this job type as a parallel job because the job requests explicit node resources.
10.6 Submitting Jobs The bsub command submits jobs to LSF-HPC; it is used to request a set of resources on which to launch a job. This section focuses on enhancements to this command from the LSF-HPC integration with SLURM on the HP XC system; this section does not discuss standard bsub functionality or flexibility. See the Platform LSF documentation and the bsub(1) manpage for more information on this important command.
Figure 10-1 How LSF-HPC and SLURM Launch and Manage a Job

(The figure illustrates the numbered steps described below: a user on login node n16 runs bsub -n4 -ext "SLURM[nodes=4]" -o output.out ./myscript; on the LSF execution host lsfhost.localdomain, the job_starter.sh script launches the job with srun, with SLURM_JOBID=53 and SLURM_NPROCS=4 set in the environment; myscript then runs hostname, srun hostname, and mpirun -srun commands on compute nodes n1 and n2.)
4 LSF-HPC prepares the user environment for the job on the LSF execution host node and dispatches the job with the job_starter.sh script. This user environment includes standard LSF environment variables and two SLURM-specific environment variables: SLURM_JOBID and SLURM_NPROCS. SLURM_JOBID is the SLURM job ID of the job. Note that this is not the same as the LSF-HPC jobID. “Translating SLURM and LSF-HPC JOBIDs” describes the relationship between the SLURM_JOBID and the LSF-HPC JOBID.
LSF-HPC daemons run on only one node in the HP XC system, so the bhosts command will list one host, which represents all the resources of the HP XC system. The total number of cores for that host should be equal to the total number of cores assigned to the SLURM lsf partition. By default, this command returns the host name, host status, and job state statistics. The following example shows the output from the bhosts command: $ bhosts HOST_NAME STATUS JL/U MAX lsfhost.
In the previous example output, the LSF execution host (lsfhost.localdomain) is listed under the HOST_NAME column. The status is listed as ok, indicating that it can accept remote jobs. The ls column shows the number of current login users on this host. See the OUTPUT section of the lsload manpage for further information about the output of this example. In addition, see the Platform Computing Corporation LSF documentation and the lsload(1) manpage for more information about the features of this command.
After LSF-HPC integrated with SLURM allocates nodes for a job, it attaches allocation information to the job. The bjobs -l command provides job allocation information on running jobs. The bhist -l command provides job allocation information for a finished job. For details about using these commands, see the LSF manpages .
Example 10-2 Job Allocation Information for a Finished Job $ bhist -l 24 Job <24>, User , Project , Interactive pseudo-terminal shell mode, Extsched , Command date and time stamp: Submitted from host , to Queue , CWD <$HOME>, 4 Processors Requested, Requested Resources ; date and time stamp: Dispatched to 4 Hosts/Processors <4*lsfhost.
Example 10-4 Using the bjobs Command (Long Output) $ bjobs -l 24 Job <24>, User ,Project ,Status , Queue , Interactive pseudo-terminal shell mode, Extsched , Command date and time stamp: Submitted from host , CWD <$HOME>, 4 Processors Requested, Requested Resources ; date and time stamp: Started on 4 Hosts/Processors <4*lsfhost.
For detailed information about a finished job, add the -l option to the bhist command, shown in Example 10-6. The -l option specifies that the long format is requested.
123 123.0 hptclsf@99 hptclsf@99 lsf lsf 8 RUNNING 0 RUNNING 0 0 In these examples, the job name is hptclsf@99; the LSF job ID is 99. Note that the scontrol show job command keeps jobs briefly after they finish, then it purges itself; this is similar with the bjobs command. The sacct command continues to provide job information after the job has finished; this is similar to bhist command: $ sacct -j Jobstep ---------123 123.
You can simplify this by first setting the SLURM_JOBID environment variable to the SLURM JOBID in the environment, as follows: $ export SLURM_JOBID=150 $ srun hostname n1 n2 n3 n4 Note: Be sure to unset the SLURM_JOBID when you are finished with the allocation, to prevent a previous SLURM JOBID from interfering with future jobs: $ unset SLURM_JOBID The following examples illustrate launching interactive MPI jobs. They use the hellompi job script introduced in Section 5.3.2 (page 50).
$ export SLURM_JOBID=150
$ export SLURM_NPROCS=4
$ mpirun -tv -srun additional parameters as needed

After you finish with this interactive allocation, exit the /bin/bash process in the first terminal; this ends the LSF job.
Table 10-2 LSF-HPC Equivalents of SLURM srun Options (continued)

-C, --constraint=list
    Specifies a list of constraints. The list may include multiple features separated by the & character (meaning ANDed) or the | character (meaning ORed).
    LSF-HPC equivalent: -ext "SLURM[constraint=list]". By default, the job does not require -ext.

--nodelist=node1,...nodeN
    Requests a specific list of nodes. The job will at least contain these nodes.
    LSF-HPC equivalent: -ext "SLURM[nodelist=node1,...nodeN]"
Table 10-2 LSF-HPC Equivalents of SLURM srun Options (continued)

-s, --share
    Share nodes with other running jobs.
    LSF-HPC equivalent: You cannot use this option; LSF-HPC uses it to create the allocation. SHARED=FORCE shares all nodes in the partition. SHARED=YES shares nodes if and only if --share is specified. SHARED=NO means do not share the node.

-O, --overcommit
    Overcommit resources. Use when launching parallel tasks.
    LSF-HPC equivalent: Submit in “batch mode”.
Table 10-2 LSF-HPC Equivalents of SLURM srun Options (continued)

-j
    Join with running job. Meaningless under LSF-HPC integrated with SLURM.

    Steal connection to running job. Meaningless under LSF-HPC integrated with SLURM.
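As an illustration of reading this table (the option values are arbitrary), a request that would be expressed directly to SLURM as:

$ srun -n8 -C dualcore ./myapp

is expressed under LSF-HPC integrated with SLURM as:

$ bsub -n8 -ext "SLURM[constraint=dualcore]" -I srun ./myapp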
11 Advanced Topics
This chapter covers topics intended for the advanced user. This chapter addresses the following topics:
• “Enabling Remote Execution with OpenSSH” (page 103)
• “Running an X Terminal Session from a Remote Node” (page 103)
• “Using the GNU Parallel Make Capability” (page 105)
• “Local Disks on Compute Nodes” (page 109)
• “I/O Performance Considerations” (page 109)
• “Communication Between Nodes” (page 109)
$ hostname mymachine Then, use the host name of your local machine to retrieve its IP address: $ host mymachine mymachine has address 14.26.206.134 Step 2. Logging in to HP XC System Next, you need to log in to a login node on the HP XC system. For example: $ ssh user@xc-node-name Once logged in to the HP XC system, you can start an X terminal session using SLURM or LSF-HPC. Both methods are described in the following sections. Step 3.
First, examine the available nodes on the HP XC system. For example:

$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite      2 idle  n[46,48]

According to the information returned about this HP XC system, LSF-HPC has two nodes available for use, n46 and n48. Determine the address of your monitor's display server, as shown at the beginning of “Running an X Terminal Session from a Remote Node”.
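A sketch of one way to do this with LSF-HPC, using the display address and node availability from the preceding examples, is to submit xterm as an interactive job that opens its window back on your local display:

$ bsub -n1 -I xterm -display 14.26.206.134:0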
the rule). Typically the rule for an object-file target is a single compilation line, so it is common to talk about concurrent compilations, though GNU make is more general. On non-cluster platforms or command nodes, matching concurrency to the number of cores often works well. It also often works well to specify a few more jobs than cores so that one job can proceed while another is waiting for I/O.
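For example, on a four-core node a reasonable starting point is:

$ make -j4

Increasing the count slightly beyond the number of cores (for example, -j6) can keep the cores busy while some jobs wait for I/O.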
test all:
	@ \
	for i in ${HYPRE_DIRS}; \
	do \
	  if [ -d $$i ]; \
	  then \
	    echo "Making $$i ..."; \
	    (cd $$i; make); \
	    echo ""; \
	  fi; \
	done

clean:
	@ \
	for i in ${HYPRE_DIRS}; \
	do \
	  if [ -d $$i ]; \
	  then \
	    echo "Cleaning $$i ..."; \
	    (cd $$i; make clean); \
	  fi; \
	done

veryclean:
	@ \
	for i in ${HYPRE_DIRS}; \
	do \
	  if [ -d $$i ]; \
	  then \
	    echo "Very-cleaning $$i ..."; \
	    (cd $$i; make veryclean); \
	  fi; \
	done
$ make PREFIX='srun -n1 -N1' MAKE_J='-j4'

11.3.2 Example Procedure 2
Go through the directories in parallel and have the make procedure within each directory be serial. For the purpose of this exercise we are only parallelizing the “make all” component. The “clean” and “veryclean” components can be parallelized in a similar fashion.

Modified Makefile:

all:
	$(MAKE) $(MAKE_J) struct_matrix_vector/libHYPRE_mv.a struct_linear_solvers/libHYPRE_ls.a utilities/libHYPRE_utilities.
The modified Makefile is invoked as follows: $ make PREFIX='srun -n1 -N1' MAKE_J='-j4' 11.4 Local Disks on Compute Nodes The use of a local disk for private, temporary storage may be configured on the compute nodes of your HP XC system. Contact your system administrator to find out about the local disks configured on your system. A local disk is a temporary storage space and does not hold data across execution of applications.
Verify with your system administrator that MPICH has been installed on your system. The HP XC System Software Administration Guide provides procedures for setting up MPICH. MPICH jobs must not run on nodes allocated to other tasks. HP strongly recommends that all MPICH jobs request node allocation through either SLURM or LSF and that MPICH jobs restrict themselves to using only those resources in the allocation. Launch MPICH jobs using a wrapper script, such as the one shown in Figure 11-1.
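Figure 11-1 itself is not reproduced here. The following is only a rough sketch of what such a wrapper might do, with file names and the process count chosen for illustration: it records the hosts in the SLURM allocation in a machine file and then uses MPICH's own mpirun to launch the application on them.

$ cat mpich_wrapper.sh
#!/bin/sh
# Build a machine file from the nodes SLURM allocated to this job.
srun hostname > $HOME/machines.$SLURM_JOBID
# Launch the MPICH application on those nodes.
mpirun.mpich -np 4 -machinefile $HOME/machines.$SLURM_JOBID ./myapp
rm -f $HOME/machines.$SLURM_JOBID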
IMPORTANT: Be sure that the number of nodes and processors in the bsub command corresponds to the number specified by the appropriate options in the wrapper script. NOTE: This method assumes that the communication among nodes is performed using ssh and that passwords are not required. 11.
A Examples This appendix provides examples that illustrate how to build and run applications on the HP XC system. The examples in this section show you how to take advantage of some of the many methods available, and demonstrate a variety of other user commands to monitor, control, or kill jobs. The examples in this section assume that you have read the information in previous chapters describing how to use the HP XC commands to build and run parallel applications.
Examine the partition information:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite      6 idle  n[5-10]

Examine the local host information:
$ hostname
n2

Examine the job information:
$ bjobs
No unfinished job found

Run the LSF bsub -Is command to launch the interactive shell:
$ bsub -Is -n1 /bin/bash
Job <120> is submitted to default queue .
<> <
View the job: $ bjobs -l 8 Job <8>, User , Project , Status , Queue , Interactive mode, Extsched , Command date and time stamp: Submitted from host , CWD <$HOME>, 2 Processors Requested; date and time stamp: Started on 2 Hosts/Processors <2*lsfhost.localdomain>; date and time stamp: slurm_id=24;ncpus=4;slurm_alloc=n[13-14]; date and time stamp: Done successfully. The CPU time used is 0.0 seconds.
A.4 Launching a Parallel Interactive Shell Through LSF-HPC This section provides an example that shows how to launch a parallel interactive shell through LSF-HPC. The bsub -Is command is used to launch an interactive shell through LSF-HPC. This example steps through a series of commands that illustrate what occurs when you launch an interactive shell. Examine the LSF execution host information: $ bhosts HOST_NAME STATUS lsfhost.
date and time stamp: Submitted from host , to Queue , CWD <$HOME>,4 Processors Requested, Requested Resources ; date and time stamp: Dispatched to 4 Hosts/Processors <4*lsfhost.
$ lshosts
HOST_NAME    type     model    cpuf  ncpus  maxmem  maxswp  server  RESOURCES
lsfhost.loc  SLINUX6  DEFAULT   1.0      8      1M          Yes     (slurm)

$ bhosts
HOST_NAME    STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
lsfhost.loc  ok         -    8      0    0      0      0    0

Display the script:
$ cat myjobscript.sh
#!/bin/sh
srun hostname
srun uname -a

Run the job:
$ bsub -I -n4 myjobscript.sh
Job <1006> is submitted to default queue .
<> <>
n14
n14
n16
n16
Linux n14 2.4.21-15.3hp.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf       up    infinite      4 idle  n[13-16]

Submit the job:
$ bsub -n8 -Ip /bin/sh
Job <1008> is submitted to default queue .
<> <
loadSched loadStop - - - - - - - - - - - View the finished jobs: $ bhist -l 1008 Job <1008>, User smith, Project , Interactive pseudo-terminal mode, Command date and time stamp: Submitted from host n16, to Queue , CWD <$HOME/tar_drop1/test>, 8 Processors Requested; date and time stamp: Dispatched to 8 Hosts/Processors <8*lsfhost.
Greetings from process 2! from (n14 pid 14011)
Greetings from process 3! from (n14 pid 14012)
Greetings from process 4! from (n15 pid 18227)
Greetings from process 5! from (n15 pid 18228)
mpirun exits with status: 0

View the running job:
$ bjobs -l 1009
Job <1009>, User , Project , Status , Queue , Interactive mode, Extsched , Command date and time stamp: Submitted from host
If myjob runs on an HP XC host, the SLURM[nodes=4-4] allocation option is applied. If it runs on an Alpha/AXP host, the SLURM option is ignored.

• Run myjob on any host type, and apply allocation options appropriately:
$ bsub -n 8 -R "type==any" \
    -ext "SLURM[nodes=4-4];RMS[ptile=2]" myjob

If myjob runs on an HP XC host, the SLURM[nodes=4-4] option is applied. If myjob runs on an HP AlphaServer SC host, the RMS ptile option is applied.
Glossary

A

administration branch
    The half (branch) of the administration network that contains all of the general-purpose administration ports to the nodes of the HP XC system.

administration network
    The private network within the HP XC system that is used for administrative operations.

availability set
    An association of two individual nodes so that one node acts as the first server and the other node acts as the second server of a service. See also improved availability, availability tool.

external network node
    A node that is connected to a network external to the HP XC system.

F

fairshare
    An LSF job-scheduling policy that specifies how resources should be shared by competing users. A fairshare policy defines the order in which LSF attempts to place jobs that are in a queue or a host partition.

FCFS
    First-come, first-served.

Integrated Lights Out
    See iLO.

interconnect
    A hardware component that provides high-speed connectivity between the nodes in the HP XC system. It is used for message passing and remote memory access capabilities for parallel applications.

interconnect module
    A module in an HP BladeSystem server.

MCS
    An optional integrated system that uses chilled water technology to triple the standard cooling capacity of a single rack. This system helps take the heat out of high-density deployments of servers and blades, enabling greater densities in data centers.

Modular Cooling System
    See MCS.

module
    A package that provides for the dynamic modification of a user's environment by means of modulefiles. See also modulefile.

PXE
    Preboot Execution Environment. A standard client/server interface that enables networked computers that are not yet installed with an operating system to be configured and booted remotely. PXE booting is configured at the BIOS level.

R

resource management role
    Nodes with this role manage the allocation of resources to user applications.

role
    A set of services that are assigned to a node.

Root Administration Switch
    A component of the administration network.