Using Platform LSF™ HPC
Version 7 Update 3
Release date: May 2008
Last modified: May 8, 2008
Support: support@platform.com
Comments to: doc@platform.com
Copyright © 1994-2008, Platform Computing Inc. We’d like to hear from you. You can help us make this document better by telling us what you think of the content, organization, and usefulness of the information. If you find an error, or just want to make a suggestion for improving this document, please address your comments to doc@platform.com. Your comments should pertain only to Platform documentation. For product support, contact support@platform.com.
Contents
1  Installing and Upgrading Platform LSF
   Installing Platform LSF
   Upgrading Platform LSF
2  About Platform LSF
   What Is Platform LSF?
   LSF Components
3  Running Parallel Jobs
   ...
   Submitting IBM POE Jobs over InfiniBand
6  Using Platform LSF HPC for Linux/QsNet
   About Platform LSF HPC for Linux/QsNet
   Configuring Platform LSF HPC for Linux/QsNet
   Operating Platform LSF HPC for Linux/QsNet
   Submitting and Monitoring Jobs
7  Using Platform LSF with SGI Cpusets
   About SGI cpusets
   ...
13 Using Platform LSF with Intel® MPI
   About Platform LSF and the Intel® MPI Library
   Configuring LSF to Work with Intel MPI
14 Using Platform LSF with Open MPI
   Submitting Open MPI Jobs
   Using LSF with ANSYS
   Using LSF with NCBI BLAST
C H A P T E R 1
Installing and Upgrading Platform LSF

Contents
◆ “Installing Platform LSF” on page 8
◆ “Upgrading Platform LSF” on page 18
Installing Platform LSF

Installing Platform LSF involves the following steps:
1 “Get a Platform LSF license”.
2 “Download Platform LSF Packages”.
3 “Run lsfinstall”.
4 “Run hostsetup” to configure host-based resources and set up automatic LSF startup on server hosts. Running hostsetup is optional on AIX and Linux. You must run hostsetup on SGI hosts (IRIX, TRIX, and Altix), HP-UX hosts, and Linux QsNet hosts.

ENABLE_HPC_INST
Make sure ENABLE_HPC_INST=Y is specified in install.config.
Begin Host
HOST_NAME   MXJ   r1m       pg     ls      tmp   DISPATCH_WINDOW   # Keywords
#hostA      ()    3.5/4.5   15/    12/15   0     ()                # Example
default     !     ()        ()     ()      ()    ()
HPPA11      !     ()        ()     ()      ()    ()                # pset host
End Host

lsb.modules
◆ Adds the external scheduler plugin module names to the PluginModule section of lsb.modules:
Begin PluginModule
SCH_PLUGIN
schmod_default
schmod_fcfs
schmod_fairshare
schmod_limit
schmod_reserve
schmod_preemption
schmod_advrsv
...
Begin Queue
QUEUE_NAME          = hpc_ibm
PRIORITY            = 30
NICE                = 20
# ...
RES_REQ             = select[ poe > 0 ]
EXCLUSIVE           = Y
REQUEUE_EXIT_VALUES = 133 134 135
DESCRIPTION         = Platform HPC 7 for IBM. This queue is to run POE jobs ONLY.
End Queue

Begin Queue
QUEUE_NAME          = hpc_ibm_tv
PRIORITY            = 30
NICE                = 20
# ...
RES_REQ             = select[ poe > 0 ]
REQUEUE_EXIT_VALUES = 133 134 135
TERMINATE_WHEN      = LOAD PREEMPT WINDOW
RERUNNABLE          = NO
INTERACTIVE         = NO
DESCRIPTION         = Platform HPC 7 for IBM TotalView debug queue.
specified by the JOB_CONTROLS parameter. A sample termination job control script is described in “Sample job termination script for queue job control” on page 50.
◆ Configures the rms queue for RMS jobs running in LSF for Linux/QsNet.
Begin Queue
QUEUE_NAME       = rms
PJOB_LIMIT       = 1
PRIORITY         = 30
NICE             = 20
STACKLIMIT       = 5256
DEFAULT_EXTSCHED = RMS[RMS_SNODE]  # LSF uses this scheduling policy if
                                   # -extsched is not defined.
◆ On SGI IRIX and SGI Altix hosts, sets the full path to the SGI vendor MPI library libxmpi.so:
❖ On SGI IRIX: LSF_VPLUGIN="/usr/lib32/libxmpi.so"
❖ On SGI Altix: LSF_VPLUGIN="/usr/lib/libxmpi.so"
You can specify multiple paths for LSF_VPLUGIN, separated by colons (:). For example, the following configures both /usr/lib32/libxmpi.so for SGI IRIX and /usr/lib/libxmpi.so for SGI Altix:
LSF_VPLUGIN="/usr/lib32/libxmpi.so:/usr/lib/libxmpi.so"
schroedinger      Boolean   ()   ()   (schroedinger availability)
hmmer             Boolean   ()   ()   (hmmer availability)
adapter_windows   Numeric   30   N    (free adapter windows on css0 on IBM SP)
ntbl_windows      Numeric   30   N    (free ntbl windows on IBM HPS)
poe               Numeric   30   N    (poe availability)
css0              Numeric   30   N    (free adapter windows on css0 on IBM SP)
csss              Numeric   30   N    (free adapter windows on csss on IBM SP)
dedicated_tasks   Numeric   ()   Y    (running dedicated tasks)
ip_tasks          Numeric   ()   Y    (running IP tasks)
us_tasks          Numeric   ()   Y    (running US tasks)
End Resource
◆ Single-user—Your user account must be the primary LSF administrator. You will be able to start LSF daemons, but only your user account can submit jobs to the cluster. Your user account must be able to read the system kernel information, such as /dev/kmem. To run IBM POE jobs, you must manually change the ownership and setuid bit for swtbl_api and ntbl_api to root, and the file permission mode to -rwsr-xr-x (4755) so that the user ID bit for the owner is setuid.
Optional configuration
After installation, you can define the following in lsf.conf:
◆ LSF_LOGDIR=directory
In large clusters, you should set LSF_LOGDIR to a local file system (for example, /var/log/lsf).
◆ LSB_RLA_WORKDIR=directory parameter, where directory is the location of the status files for RLA. Allows RLA to recover its original state when it restarts. When RLA first starts, it creates the directory defined by LSB_RLA_WORKDIR if it does not exist, then creates subdirectories for each host.
❖ -extsched option of bsub
❖ DEFAULT_EXTSCHED and MANDATORY_EXTSCHED in lsb.queues
Default: 1024
◆ LSB_RMS_MAXNUMRAILS=integer
Maximum number of rails in a Linux/QsNet system. Specifies a maximum value for the rails argument to the topology scheduler options specified in:
❖ -extsched option of bsub
❖ DEFAULT_EXTSCHED and MANDATORY_EXTSCHED in lsb.queues
Default: 32
◆ LSB_RMS_MAXPTILE=integer
Maximum number of CPUs per node in a Linux/QsNet system.
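Taken together, a sketch of how these optional settings might appear in lsf.conf; all paths and values below are illustrative only, not shipped defaults:
LSF_LOGDIR=/var/log/lsf
LSB_RLA_WORKDIR=/var/tmp/lsf_rla
LSB_RMS_MAXNUMRAILS=32
LSB_RMS_MAXPTILE=4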
For more information
◆ See lsf7.0_lsfinstall/lsf_quick_admin.html to learn more about your new LSF cluster.
◆ See the Platform LSF Command Reference for information about using lsfinstall.
◆ See the Platform LSF Configuration Guide for information about the install.config and the slave.config files.

Where to go next
Learn about using Platform LSF, as described in Chapter 2, “About Platform LSF”.
Upgrading Platform LSF

Contents
◆ “Before upgrading”
◆ “What lsfinstall does for upgrade”
◆ “Run lsfinstall to upgrade”
◆ “Run hostsetup”
◆ “After upgrading”

CAUTION: If your cluster was installed or upgraded with lsfsetup, DO NOT use these steps. Before upgrading Platform LSF, upgrade your cluster to at least Platform LSF Version 6.0.

Before upgrading
1 Back up your existing LSF_CONFDIR, LSB_CONFDIR, and LSB_SHAREDIR according to the procedures at your site.
LSB_SUB_COMMANDNAME (lsf.conf)
If LSB_SUB_COMMANDNAME=N is already defined in lsf.conf, lsfinstall does not change this parameter; you must manually set it to LSB_SUB_COMMANDNAME=Y to enable the LSF_SUB_COMMANDLINE environment variable required by esub.

SGI cpuset host upgrade
For SGI cpuset hosts, lsfinstall updates the following files:
◆ lsb.modules—adds the schmod_cpuset external scheduler plugin module name to the PluginModule section, and comments out the schmod_topology module line
◆ lsf.conf
Run lsfinstall to upgrade
Make sure the following install.config variables are set for upgrade:
◆ ENABLE_HPC_INST=Y enables Platform LSF installation.
◆ LSF_TARDIR specifies the location of distribution packages for upgrade. For example: LSF_TARDIR=/tmp

Migrate from LSF to LSF HPC
To migrate an existing Platform LSF Version 7 cluster to Platform LSF HPC, comment out LSF_TARDIR and make sure that no distribution tar files are in the directory where you run lsfinstall.
To run hostsetup
1 Log on to each LSF server host as root. Start with the LSF master host.
2 Run hostsetup on each LSF server host. For example:
# cd /usr/share/hpc/7.0/install
# ./hostsetup --top="/usr/share/hpc" --boot="y"

After upgrading
1 Log on to the LSF master host as root.
2 Set your environment:
❖ For csh or tcsh: % source /LSF_TOP/conf/cshrc.lsf
❖ For sh, ksh, or bash: # . /LSF_TOP/conf/profile.lsf
3 Follow the steps in lsf7Update3_lsfinstall/lsf_quick_admin.html to update your license.
C H A P T E R 2
About Platform LSF

Contents
◆ “What Is Platform LSF?” on page 24
◆ “LSF Components” on page 27
What Is Platform LSF?
Platform LSF™ HPC (“LSF”) is the distributed workload management solution for maximizing the performance of High Performance Computing (HPC) clusters. Platform LSF HPC is fully integrated with Platform LSF, the industry-standard workload management software product, to provide load sharing in a distributed system and batch scheduling for compute-intensive jobs.
Parallel application support
Platform LSF supports jobs using the following parallel job launchers:
POE
The IBM Parallel Operating Environment (POE) interfaces with the Resource Manager to allow users to run parallel jobs requiring dedicated access to the high performance switch. The LSF integration for IBM High-Performance Switch (HPS) systems provides support for submitting POE jobs from AIX hosts to run on IBM HPS hosts.
OpenMP
PVM
MPI
PAM The Parallel Application Manager (PAM) is the point of control for LSF. PAM is fully integrated with LSF. PAM interfaces the user application with LSF.
LSF Components
Platform LSF HPC takes full advantage of the resources of Platform LSF for resource selection and batch job process invocation and control.
User requests
Batch job submission to LSF using the bsub command.
mbatchd
Master Batch Daemon (MBD) is the policy center for LSF. It maintains information about batch jobs, hosts, users, and queues. All of this information is used in scheduling batch jobs to hosts.
LIM
Load Information Manager is a daemon process running on each execution host.
bsub -a lammpi bsub_options mpirun.lsf myjob
The method name lammpi uses the esub for LAM/MPI jobs (LSF_SERVERDIR/esub.lammpi), which sets the environment variable LSF_PJL_TYPE=lammpi. The job launcher, mpirun.lsf, reads the environment variable LSF_PJL_TYPE=lammpi and generates the appropriate command line to invoke LAM/MPI as the PJL to start the job.
C H A P T E R 3
Running Parallel Jobs

Contents
◆ “blaunch Distributed Application Framework” on page 30
◆ “OpenMP Jobs” on page 38
◆ “PVM Jobs” on page 39
◆ “SGI Vendor MPI Support” on page 40
◆ “HP Vendor MPI Support” on page 43
◆ “LSF Generic Parallel Job Launcher Framework” on page 45
◆ “How the Generic PJL Framework Works” on page 46
  ❖ “Integration Method 1” on page 52
  ❖ “Integration Method 2” on page 54
◆ “Tuning PAM Scalability and Fault Tolerance” on page 56
◆ “Running Jobs with Task Geometry”
blaunch Distributed Application Framework Most MPI implementations and many distributed applications use rsh and ssh as their task launching mechanism. The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.
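For example, a minimal submission that uses blaunch inside the job (myapp is a placeholder for your own distributed application):
bsub -n 4 "blaunch myapp"
LSF allocates 4 slots, and blaunch starts myapp on the allocated hosts so that every task runs under LSF job control and resource accounting.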
See the Platform LSF Command Reference for more information about the blaunch command. LSF APIs for the blaunch distributed application framework LSF provides the following APIs for programming your own applications to use the blaunch distributed application framework: ◆ lsb_launch()—a synchronous API call to allow source level integration with vendor MPI implementations. This API will launch the specified command (argv) on the remote nodes in parallel.
and lsgrun unless the user is either an LSF administrator or root. LSF_ROOT_REX must be defined for remote execution by root. Other remote execution commands, such as ch and lsmake, are not affected.
Temporary directory for tasks launched by blaunch
By default, LSF creates a temporary directory for a job only on the first execution host. If LSF_TMPDIR is set in lsf.conf, the path of the job temporary directory on the first execution host is set to LSF_TMPDIR/job_ID.tmpdir.
Configuring application profiles for the blaunch framework
Handle remote task exit
You can configure an application profile in lsb.applications to control the behavior of a parallel or distributed application when a remote task exits. Specify a value for RTASK_GONE_ACTION in the application profile to define what LSF does when a remote task exits. The default behaviour is:
When ... LSF ...
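As a sketch, an application profile entry in lsb.applications might look like the following; the profile name is a placeholder and the keyword values shown (IGNORE_TASKCRASH, KILLJOB_TASKEXIT) should be verified against the lsb.applications reference for your release:
Begin Application
NAME              = blaunch_app
RTASK_GONE_ACTION = "IGNORE_TASKCRASH KILLJOB_TASKEXIT"
End Application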
Set up job launching environment
LSF can run an appropriate script that is responsible for setup and cleanup of the job launching environment. You can specify the name of the appropriate script in an application profile in lsb.applications.
Use DJOB_ENV_SCRIPT to define the path to a script that sets the environment for the parallel or distributed job launcher. The script runs as the user, and is part of the job. DJOB_ENV_SCRIPT only applies to the blaunch distributed application framework.
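A hedged sketch of an application profile that points to such a script; the profile name and script path are placeholders:
Begin Application
NAME            = djob
DJOB_ENV_SCRIPT = /usr/share/lsf/scripts/djob_env.sh
End Application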
Update job heartbeat and resource usage Use DJOB_RU_INTERVAL in an application profile in lsb.applications to configure an interval in seconds used to update the resource usage for the tasks of a parallel or distributed job. DJOB_RU_INTERVAL only applies to the blaunch distributed application framework. When DJOB_RU_INTERVAL is specified, the interval is scaled according to the number of tasks in the job: max(DJOB_RU_INTERVAL, 10) + host_factor where host_factor = 0.
◆ Submit a job to an application profile
bsub -n 4 -app djob blaunch myjob

Example execution scripts
Launching MPICH-P4 tasks
To launch MPICH-P4 tasks through LSF using the blaunch framework, substitute the path to rsh or ssh with the path to blaunch. For example:
Sample mpirun script changes:
...
# Set default variables
AUTOMOUNTFIX="sed -e s@/tmp_mnt/@/@g"
DEFAULT_DEVICE=ch_p4
RSHCOMMAND="$LSF_BINDIR/blaunch"
SYNCLOC=/bin/sync
CC="cc"
...
case $VERSION in
10.0) # version list entry
    export ANSYS_DIR=/usr/share/app/ansys_inc/v100/Ansys
    export ANSYSLMD_LICENSE_FILE=1051@licserver.company.
OpenMP Jobs
Platform LSF provides the ability to start parallel jobs that use OpenMP to communicate between processes on shared-memory machines and MPI to communicate across networked and non-shared-memory machines. This implementation allows you to specify the number of machines and to reserve an equal number of processors per machine. When the job is dispatched, PAM only starts one process per machine.
PVM Jobs Parallel Virtual Machine (PVM) is a parallel programming system distributed by Oak Ridge National Laboratory. PVM programs are controlled by the PVM hosts file, which contains host names and other information. PVM esub An esub for PVM jobs, esub.pvm, is installed with Platform LSF. The PVM esub calls the pvmjob script. Use bsub -a pvm to submit PVM jobs. pvmjob script The pvmjob shell script is invoked by esub.pvm to run PVM programs as parallel LSF jobs.
SGI Vendor MPI Support
Compiling and linking your MPI program
You must use the SGI C compiler (cc by default). You cannot use mpicc to build your programs. For example, use one of the following compilation commands to build the program mpi_sgi:
◆ On IRIX/TRIX:
cc -g -64 -o mpi_sgi mpi_sgi.c -lmpi
f90 -g -64 -o mpi_sgi mpi_sgi.c -lmpi
cc -g -n32 -mips3 -o mpi_sgi mpi_sgi.c -lmpi
◆ On Altix:
efc -g -o mpi_sgi mpi_sgi.f -lmpi
ecc -g -o mpi_sgi mpi_sgi.c -lmpi
gcc -g -o mpi_sgi mpi_sgi.
On SGI IRIX:
LSF_VPLUGIN="/usr/lib32/libxmpi.so"
❖ On SGI Altix:
LSF_VPLUGIN="/usr/lib/libxmpi.so"
You can specify multiple paths for LSF_VPLUGIN, separated by colons (:). For example, the following configures both /usr/lib32/libxmpi.so for SGI IRIX and /usr/lib/libxmpi.so for SGI Altix:
❖ LSF_VPLUGIN="/usr/lib32/libxmpi.so:/usr/lib/libxmpi.so"
libxmpi.so file permission
◆ LSF_PAM_USE_ASH=Y enables LSF to use the SGI Array Session Handler (ASH) to propagate signals to the parallel jobs.
Examples
Running a job
To run a job and have LSF select the host, the command:
mpirun -np 4 a.out
is entered as:
bsub -n 4 pam -mpi -auto_place a.out

Running a job on a single host
To run a single-host job and have LSF select the host, the command:
mpirun -np 4 a.out
is entered as:
bsub -n 4 -R "span[hosts=1]" pam -mpi -auto_place a.out

Running a job on multiple hosts
To run a multihost job (5 processors per host) and have LSF select the hosts, the following command:
mpirun hosta -np 5 a.
HP Vendor MPI Support When you use mpirun in stand-alone mode, you specify host names to be used by the MPI job. Automatic HP MPI library configuration During installation, lsfinstall sets LSF_VPLUGIN in lsf.conf to the full path to the MPI library libmpirm.sl. For example: LSF_VPLUGIN="/opt/mpi/lib/pa1.1/libmpirm.sl" On Linux On Linux hosts running HP MPI, you must manually set the full path to the HP vendor MPI library libmpirm.so.
The a.out and b.out processes may run on a different host, depending on the resources available and LSF scheduling algorithms. More details on mpirun For a complete list of mpirun options and environment variable controls, refer to the mpirun man page and the HP MPI User's Guide.
LSF Generic Parallel Job Launcher Framework Any parallel execution environment (for example a vendor MPI, or an MPI package like MPICH-GM, MPICH-P4, or LAM/MPI) can be made compatible with LSF using the generic parallel job launcher (PJL) framework. All LSF Version 7 distributions support running parallel jobs with the generic PJL integration. Vendor MPIs for SGI MPI and HP MPI are already integrated with Platform LSF.
How the Generic PJL Framework Works

Terminology
First execution host: The host name at the top of the execution host list as determined by LSF. Starts PAM.
Execution hosts: The most suitable hosts to execute the batch job as determined by LSF.
task: A process that runs on a host; the individual process of a parallel application.
parallel job: A parallel job consists of multiple tasks that could be executed on different hosts.
PJL: Parallel Job Launcher.
Architecture
Running a parallel job using a non-integrated PJL
[Diagram: the PJL on the first execution host starts tasks directly on the first and second execution hosts.]
Without the generic PJL framework, the PJL starts tasks directly on each host, and manages the job. Even if the MPI job was submitted through LSF, LSF never receives information about the individual tasks. LSF is not able to track job resource usage or provide job control.
2 The PJL wrapper starts the PJL (for example, mpirun).
3 Instead of starting tasks directly, PJL starts TS on each host selected to run the parallel job.
4 TS starts the task. Each TS reports its task PID and host name back to PAM. Now PAM can perform job control and resource usage collection through RES. TaskStarter also collects the exit status of the task and reports it to PAM. When PJL exits, PAM exits with the same termination status as the PJL.
Using the pam -n option (SGI MPI only)
The -n option on the pam command line specifies the number of tasks that PAM should start. You can use both bsub -n and pam -n in the same job submission. The number specified in the pam -n option should be less than or equal to the number specified by bsub -n. If the number of tasks specified with pam -n is greater than the number specified by bsub -n, the pam -n value is ignored. For example, you can specify:
bsub -n 5 pam -n 2 -mpi a.
JOB_CONTROLS = TERMINATE[kill -CONT -$LSB_JOBRES_PID; kill -TERM -$LSB_JOBRES_PID]
◆ If pam and the job RES are in different process groups (for example, pam is started by a wrapper, which could set its own PGID), use both LSB_JOBRES_PID and LSB_PAMPID to make sure your parallel jobs are cleaned up:
JOB_CONTROLS = TERMINATE[kill -CONT -$LSB_JOBRES_PID -$LSB_PAMPID; kill -TERM -$LSB_JOBRES_PID -$LSB_PAMPID]
LSB_PAM_PID may not be available when the job first starts.
foundPamPid="N"
for apid in $PIDS
do
    if [ "$apid" = "$LSB_PAM_PID" ]; then
        # pam is running
        foundPamPid="Y"
        break
    fi
done
if [ "$foundPamPid" == "N" ]; then
    break # pam has exited
fi
sleep 2
done
fi
# Use other terminate signals if SIGTERM is
# caught and ignored by your application.
kill -TERM -$LSB_JOBRES_PID >>$JOB_CONTROL_LOG 2>&1
exit 0

To configure the script in the hpc_linux queue
1 Create a job control script named job_terminate_control.sh.
Integration Method 1
When to use this integration method
In this method, PAM rewrites the PJL command line to insert TS in the correct position, and sets callback information for TS to communicate with PAM.
For more detailed examples
See “Example Integration: LAM/MPI” on page 62
Integration Method 2 When to use this integration method In this method, you rewrite or wrap the PJL to include TS and callback information for TS to communicate with PAM. This method of integration is the most flexible, but may be more difficult to implement.
Your job script is:
#!/bin/sh
if [ -n "$ENV1" ]; then
    pjl -opt1 job1
else
    pjl -opt2 -opt3 job2
fi

After
After the integration, your job submission command line includes the pam command:
bsub -n 2 pam -g new_jobscript
Your new job script inserts TS and LSF_TS_OPTIONS before the jobs:
#!/bin/sh
if [ -n "$ENV1" ]; then
    pjl -opt1 usr/share/lsf/TaskStarter $LSF_TS_OPTIONS job1
else
    pjl -opt2 -opt3 usr/share/lsf/TaskStarter $LSF_TS_OPTIONS job2
fi

For more detailed examples
See “Example Integration: LAM/MPI”
Tuning PAM Scalability and Fault Tolerance To improve performance and scalability for large parallel jobs, tune the following parameters. Parameters for PAM (lsf.conf) For better performance, you can adjust the following parameters in lsf.conf. The user's environment can override these. LSF_HPC_PJL_LOADENV_TIMEOUT Timeout value in seconds for PJL to load or unload the environment. For example, the time needed for IBM POE to load or unload adapter windows.
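As a sketch, the timeout can be raised in lsf.conf for very large jobs; the value shown is illustrative, not necessarily the shipped default:
LSF_HPC_PJL_LOADENV_TIMEOUT=300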
Running Jobs with Task Geometry Specifying task geometry allows you to group tasks of a parallel job step to run together on the same node. Task geometry allows for flexibility in how tasks are grouped for execution on system nodes. You cannot specify the particular nodes that these groups run on; the scheduler decides which nodes run the specified groupings. Task geometry is supported for all Platform LSF MPI integrations including IBM POE, LAM/MPI, MPICH-GM, MPICH-P4, and Intel® MPI.
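The groupings themselves are defined through the LSB_PJL_TASK_GEOMETRY environment variable before submission, as in the examples that follow. A minimal sketch in csh syntax (the grouping shown is illustrative):
setenv LSB_PJL_TASK_GEOMETRY "{(0,1)(2)}"
This asks for tasks 0 and 1 to run together on one node and task 2 to run on another; the scheduler decides which nodes are used.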
bsub -n 3 -R "span[ptile=1]" -I -a mpich_gm mpirun.lsf my_job
Job <564> is submitted to queue .
<>
<>
...
Planning your task geometry specification
You should plan your task geometry in advance and specify the job resource requirements for LSF to select hosts appropriately.
bsub -n 12 -R "span[ptile=3]" -a poe mpirun.lsf myjob
If task 6 is an OpenMP job that spawns 4 threads, the job submission is:
bsub -n 20 -R "span[ptile=5]" -a poe mpirun.lsf myjob
Do not use -a openmp or set LSF_PAM_HOSTLIST_USE for OpenMP jobs.
A POE job has three tasks: task0, task1, and task2, and task2 spawns 3 threads. Tasks task0 and task1 run on one node and task2 runs on the other node. The job submission is:
bsub -a poe -n 6 -R "span[ptile=3]" mpirun.
Enforcing Resource Usage Limits for Parallel Tasks A typical Platform LSF parallel job launches its tasks across multiple hosts. By default you can enforce limits on the total resources used by all the tasks in the job. Because PAM only reports the sum of parallel task resource usage, LSF does not enforce resource usage limits on individual tasks in a parallel job.
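Task-level enforcement is typically switched on through the LSF_HPC_EXTENSIONS parameter in lsf.conf; a hedged sketch (verify the extension names against the lsf.conf reference for your release):
LSF_HPC_EXTENSIONS="TASK_MEMLIMIT TASK_SWAPLIMIT"
With such a setting, memory and swap limits are applied to each task individually rather than to the job total.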
◆ When a parallel job is terminated because of task limit enforcement, LSF logs the job termination reason in the lsb.acct file:
❖ TERM_SWAP for swap limit
❖ TERM_MEMLIMIT for memory limit
and bacct displays the termination reason.
Example Integration: LAM/MPI
The script lammpirun_wrapper is the PJL wrapper. Use either “Integration Method 1” on page 52 or “Integration Method 2” on page 54 to call this script:
pam [other_pam_options] -g num_args lammpirun_wrapper job [job_options]
pam [other_pam_options] -g lammpirun_wrapper job [job_options]

Example script
#!/bin/sh
#
# -----------------------------------------------------
# Source the LSF environment. Optional.
# -----------------------------------------------------
.
    _my_name=`whoami | sed -e "s/[ ]//g"`
else
    _my_name=`id | sed -e 's/[^(]*(\([^)]*\)).
LAMHALT_CMD="lamhalt"
#
# -----------------------------------------------------
# Define an exit value to rerun the script if it fails
# - create and set the variable EXIT_VALUE to represent the requeue exit value
# - we assume you have enabled job requeue in LSF
# - we assume 66 is one of the job requeue values you specified in LSF
# -----------------------------------------------------
#
# EXIT_VALUE should not be set to 0
EXIT_VALUE="66"
#
# -----------------------------------------------------
# Write the firs
#
#
# start a new host file from scratch
rm -f $LAMHOST_FILE
echo "# LAMMPI host file created by LSF on `date`" >> $LAMHOST_FILE
# check if we were able to start writing the conf file
if [ -f $LAMHOST_FILE ]; then
    :
else
    echo "$0: can't create $LAMHOST_FILE"
    exit 1
fi
HOST=""
NUM_PROC=""
FLAG=""
TOTAL_CPUS=0
for TOKEN in $LSB_MCPU_HOSTS
do
    if [ -z "$FLAG" ]; then
        HOST="$TOKEN"
        FLAG="0"
    else
        NUM_PROC="$TOKEN"
        TOTAL_CPUS=`expr $TOTAL_CPUS + $NUM_PROC`
        FLAG="1"
    fi
    if [ "$FLAG" = "1" ]; then
        _x=0
        while [ $_x -l
# -----------------------------------------------------
# Process the command line:
# - extract [mpiopts] from the command line
# - extract jobname [jobopts] from the command line
# -----------------------------------------------------
ARG0=`$LAMMPIRUN_CMD -h 2>&1 | \
    egrep '^[[:space:]]+-[[:alpha:][:digit:]-]+[[:space:]][[:space:]]' | \
    awk '{printf "%s ", $1}'`
# get -ton,t and -w / nw options
TMPARG=`$LAMMPIRUN_CMD -h 2>&1 | \
    egrep '^[[:space:]]+-[[:alpha:]_-]+[[:space:]]*(,|/)[[:space:]][[:alpha:]]*' | sed
        if [ $option = "$1" ]; then
            MPIRunOpt="1"
            case "$1" in
            -v)
                shift
                ;;
            *)
                LAMMPI_OPTS="$LAMMPI_OPTS $1"
                shift
                ;;
            esac
            break
        fi
    done
fi
if [ $MPIRunOpt = "1" ]; then
    :
else
    JOB_CMDLN="$*"
    break
fi
done
# -----------------------------------------------------------------------------
# Set up the CMD_LINE variable representing the integrated section of the
# command line:
# - LSF_TS, script variable representing the TaskStarter binary.
#   TaskStarter must start each and every job task process.
# - capture the result of tping and test for success before proceeding
# - exits with the "requeue" exit value if pre-execution setup failed
# -----------------------------------------------------
#
LAM_MPI_SOCKET_SUFFIX="${LSB_JOBID}_${LSB_JOBINDEX}"
export LAM_MPI_SOCKET_SUFFIX
echo $LAMBOOT_CMD $LAMHOST_FILE >>$LOGFILE
$LAMBOOT_CMD $LAMHOST_FILE >>$LOGFILE 2>&1
echo $TPING_CMD h -c 1 >>$LOGFILE
$TPING_CMD N -c 1 >>$LOGFILE 2>&1
EXIT_VALUE="$?"
if [ "$EXIT_VALUE" = "0" ]; then
    #
    # -----------------------------------------------------
exit $EXIT_VALUE
#
# -----------------------------------------------------
# End the script.
Tips for Writing PJL Wrapper Scripts
A wrapper script is often used to call the PJL. We assume the PJL is not integrated with LSF, so if PAM was to start the PJL directly, the PJL would not automatically use the hosts that LSF selected, or allow LSF to collect resource information. The wrapper script can set up the environment before starting the actual job.
Script log file
The script should create and use its own log file, for troubleshooting purposes.
Command aliases
Signal handling
Vendor-specific post-exec
Depending on the vendor, the PJL may require some special post-execution work, such as stopping daemons. You should log each post-exec task in the log file, and also check the result and handle errors if any task failed.
Script post-exec
The script should exit gracefully. This might include closing files it used, removing files it created, shutting down daemons it started, and recording each action in the log file for troubleshooting purposes.
Other Integration Options Once the PJL integration is successful, you might be interested in the following LSF features. For more information about these features, see the LSF documentation. Using a job starter A job starter is a wrapper script that can set up the environment before starting the actual job.
C H A P T E R 4
Using Platform LSF with HP-UX Processor Sets

LSF makes use of HP-UX processor sets (psets) to create an efficient execution environment that allows a mix of users and jobs to coexist in the HP Superdome cell-based architecture.
About HP-UX Psets
HP-UX processor sets (psets) are available as an optional software product for HP-UX 11i Superdome multiprocessor systems. A pset is a set of active processors grouped for the exclusive access of the application assigned to the set. A pset manages processor resources among applications and users. The operating system restricts applications to run only on the processors in their assigned psets.
After the job finishes, LSF destroys the pset. If no host meets the CPU requirements, the job remains pending until processors become available to allocate the pset. CPU 0 in the default pset 0 is always considered last for a job, and cannot be taken out of pset 0, since all system processes are running on it. LSF cannot create a pset with CPU 0; it only uses the default pset if it cannot create a pset without CPU 0. LSF topology adapter for psets (RLA) RLA runs on each HP-UX11i host.
Configuring LSF with HP-UX Psets

Automatic configuration at installation
lsb.modules
During installation, lsfinstall adds the schmod_pset external scheduler plugin module name to the PluginModule section of lsb.modules:
Begin PluginModule
SCH_PLUGIN
schmod_default
schmod_fcfs
schmod_fairshare
schmod_limit
schmod_preemption
...
Begin Resource
RESOURCENAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
...
pset           Boolean   ()         ()           (PSET)
...
End Resource
You should add the pset resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name. Hosts without the pset resource specified are not considered for scheduling pset jobs.
lsb.hosts
For each pset host, lsfinstall enables "!" in the MXJ column of the HOSTS section of lsb.hosts for the HPPA11 host type.
MANDATORY_EXTSCHED options override any conflicting job-level options set by -extsched options on the bsub command. For example, if the queue specifies:
MANDATORY_EXTSCHED=PSET[CELLS=2]
and a job is submitted with no topology requirements requesting 6 CPUs (bsub -n 6), a pset is allocated using 2 cells with 3 CPUs in each cell.
Using LSF with HP-UX Psets Specifying pset topology options To specify processor topology scheduling policy options for pset jobs, use: ◆ The -extsched option of bsub. You can abbreviate the -extsched option to -ext. ◆ DEFAULT_EXTSCHED or MANDATORY_EXTSCHED, or both, in the queue definition (lsb.queues). If LSB_PSET_BIND_DEFAULT is set in lsf.conf, and no pset options are specified for the job, Platform LSF binds the job to the default pset 0.
The LSF job uses only the cells in the specified cell list to allocate the pset. For example, if CELL_LIST=1,2, and the job requests 8 processors (bsub -n 8) on a 4-CPU/cell HP Superdome system with no other jobs running, the pset uses cells 1 and 2, and the allocation is 4 CPUs on each cell. If LSF cannot satisfy the CELL_LIST request, the job remains pending.
PTILE
You can use CELL_LIST with the PSET[PTILE=cpus_per_cell] option. The PTILE option allows the job pset to spread across several cells. The number of required cells equals the number of requested processors divided by the PTILE value. The resulting number of cells must be less than or equal to the number of cells in the cell list; otherwise, the job remains pending.
Thu Jan 22 12:04:39: Running with execution home , Execution CWD , Execution Pid <18440>; Thu Jan 22 12:05:39: Done successfully. The CPU time used is 0.
◆ Submit a pset job specifying 1 CPU per cell: bsub -n 6 -ext "PSET[PTILE=1]" myjob A pset containing 6 processors is created for the job. The allocation uses 6 cells with 1 processor per cell. ◆ Submit a pset job specifying 4 cells: bsub -n 6 -ext "PSET[CELLS=4]" myjob A pset containing 6 processors is created for the job. The allocation uses 4 cells: 2 cells with 2 processors and 2 cells with 1 processor.
C H A P T E R 5
Using Platform LSF with IBM POE

Contents
◆ “Running IBM POE Jobs” on page 86
◆ “Migrating IBM Load Leveler Job Scripts to Use LSF Options” on page 93
◆ “Controlling Allocation and User Authentication for IBM POE Jobs” on page 100
◆ “Submitting IBM POE Jobs over InfiniBand” on page 103
Running IBM POE Jobs The IBM Parallel Operating Environment (POE) interfaces with the Resource Manager to allow users to run parallel jobs requiring dedicated access to the high performance switch. The LSF integration for IBM High-Performance Switch (HPS) systems provides support for submitting POE jobs from AIX hosts to run on IBM HPS hosts. An IBM HPS system consists of multiple nodes running AIX.
Resource usage defined in the ReservationUsage section overrides the cluster-wide RESOURCE_RESERVE_PER_SLOT parameter defined in lsb.params if it also exists.
Begin ReservationUsage
RESOURCE          METHOD
adapter_windows   PER_SLOT
ntbl_windows      PER_SLOT
csss              PER_SLOT
css0              PER_SLOT
End ReservationUsage

2. Optional. Enable exclusive mode (lsb.queues)
To support the MP_ADAPTER_USE and -adapter_use POE job options, you must enable the LSF exclusive mode for each queue. To enable exclusive mode, edit lsb.queues and set EXCLUSIVE = Y for the queue.
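For example, an abridged queue definition with exclusive mode enabled, modeled on the hpc_ibm queue shown earlier:
Begin Queue
QUEUE_NAME = hpc_ibm
EXCLUSIVE  = Y
...
End Queue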
Begin Queue
NAME=hpc_ibm
...
lock=0
...
End Queue
The scheduling threshold on the lock index prevents dispatching to nodes which are being used in exclusive mode by other jobs.

4. Optional. Define system partitions (spname)
If you schedule jobs based on system partition names, you must configure the static resource spname. System partitions are collections of HPS nodes that together contain all available HPS nodes without any overlap.
RESOURCENAME      TYPE      INTERVAL   INCREASING   DESCRIPTION
...
adapter_windows   Numeric   30         N            (free adapter windows on css0 on IBM SP)
ntbl_windows      Numeric   30         N            (free ntbl windows on IBM HPS)
poe               Numeric   30         N            (poe availability)
css0              Numeric   30         N            (free adapter windows on css0 on IBM SP)
csss              Numeric   30         N            (free adapter windows on csss on IBM SP)
dedicated_tasks   Numeric   ()         Y            (running dedicated tasks)
ip_tasks          Numeric   ()         Y            (running IP tasks)
us_tasks          Numeric   ()         Y            (running US tasks)
...
End Resource
You must edit lsf.
PAM updates resource usage for each task every SBD_SLEEP_TIME + num_tasks * 1 seconds (by default, SBD_SLEEP_TIME=15). For large parallel jobs, this interval is too long. As the number of parallel tasks increases, LSF_PAM_RUSAGE_UPD_FACTOR causes more frequent updates.
Default: LSF_PAM_RUSAGE_UPD_FACTOR=0.01 (for large clusters)

7. Reconfigure to apply the changes
1 Run badmin ckconfig to check the configuration changes. If any errors are reported, fix the problem and check the configuration again.
POE PJL wrapper (poejob)
The POE PJL (Parallel Job Launcher) wrapper, poejob, parses the POE job options and filters out those that have been set by LSF.

Submitting POE jobs
Use bsub to submit POE jobs, including parameters required for the application and POE. PAM launches POE and collects resource usage for all running tasks in the parallel job.
Syntax
bsub -a poe [bsub_options] mpirun.lsf program_name [program_options] [poe_options]
where:
-a poe
Invokes esub.poe.
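For example, a simple POE job submission might look like the following (myjob is a placeholder for your POE program):
bsub -a poe -n 4 mpirun.lsf myjob -euilib us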
lslpp -l | grep ssp.basic
ssp.basic    3.2.0.9    COMMITTED    SP System Support Package
ssp.basic    3.2.0.9    COMMITTED    SP System Support Package
To verify the switch type, run:
SDRGetObjects Adapter css_type
Switch type               Value
SP_Switch_Adapter         2
SP_Switch_MX_Adapter      3
SP_Switch_MX2_Adapter     3
SP_Switch2_Adapter        5
SP_Switch2_Adapter indicates that you are using SP Switch2. Use these values to configure the device_type variable in the script LSF_BINDIR/poejob. The default for device_type is 3.
Migrating IBM Load Leveler Job Scripts to Use LSF Options You can integrate LSF with your POE jobs by modifying your job scripts to convert POE Load Leveler options to LSF options.
-nodes combinations -nodes -tasks_per_nodes -nodes combination Cannot convert to LSF. You must use span[host=1] bsub -n a*b -R "span[ptile=b]" bsub -n a*b -R "span[ptile=b]" Only use if the poe options are: ◆ Only use if the poe options are: poe -nodes a -tasks_per_nodes b -nodes b poe -nodes a -tasks_per_nodes b -procs a*b -nodes -procs ◆

Load Leveler directives
Load Leveler job commands are handled as follows:
◆ Ignored by LSF
◆ Converted to bsub options (or queue options in lsb.
Load Leveler Command   Ignored   bsub option              Special Handling
notify_user                                               Set in lsf.conf
output                           bsub -o
parallel_path          Y
preferences                      bsub -R "select[...]
queue                            bsub -q
requirements                     bsub -R and -m
resources                        bsub -R                  Set rusage for each task according to the Load Leveler equivalent
rss_limit                        bsub -M
shell                  Y
stack_limit                      bsub -S
startdate                        bsub -b
step_name              Y
task_geometry                                             Use the LSB_PJL_TASK_GEOMETRY environment variable to specify task geometry for your jobs.
#@ notification = never
#@ queue
# ---------------------------------------------
# Copy required workfiles to $WORKDIR, which is set
# to /scr/$user under the large GPFS work filesystem,
# named /scr.
cp ~/TESTS/mpihello $WORKDIR/mpihello
# Change directory to $WORKDIR
cd $WORKDIR
# Execute program mypoejob
poe mypoejob
poe $WORKDIR/mpihello
# Copy output data from $WORKDIR to appropriate archive FS,
# since we are currently running within a volatile
# "scratch" filesystem.
# Copy output data from $WORKDIR to appropriate archive FS, # since we are currently running within a volatile # "scratch" filesystem. 4 # Clean unneeded files from $WORKDIR after job ends.
The following examples show how to convert the POE options in a Load Leveler command file to LSF options in your job scripts for several kinds of jobs. -adapter_use dedicated and -cpu_use unique ◆ This example uses following POE job script: #!/bin/csh #@ shell = /bin/csh #@ environment = ENVIRONMENT=BATCH; COPY_ALL;\ # MP_EUILIB=us; MP_STDOUTMODE=ordered; MP_INFOLEVEL=0; #@ network.MPI = switch,dedicated,US #@ job_type = parallel #@ job_name = batch-test #@ output = $(job_name).log #@ error = $(job_name).
#BSUB -q hpc_ibm
setenv ENVIRONMENT BATCH
setenv MP_EUILIB us
# Copy required workfiles to $WORKDIR, which is set
# to /scr/$user under the large GPFS work filesystem,
# named /scr.
cp ~/TESTS/mpihello $WORKDIR/mpihello
# Change directory to $WORKDIR
cd $WORKDIR
# Execute program(s)
mpirun.lsf mypoejob -euilib us
mpirun.lsf $WORKDIR/mpihello -euilib us
# Copy output data from $WORKDIR to appropriate archive FS,
# since we are currently running within a volatile
# "scratch" filesystem.
Controlling Allocation and User Authentication for IBM POE Jobs About POE authentication Establishing authentication for POE jobs means ensuring that users are permitted to run parallel jobs on the nodes they intend to use. POE supports two types of user authentication: ◆ AIX authentication (the default) Uses /etc/hosts.equiv or $HOME/.
Configuring POE allocation and authentication support
Configure services
1 Register the pmv4lsf (pmv3lsf) service with inetd:
a Add the following line to /etc/inetd.conf:
pmv4lsf stream tcp nowait root /etc/pmdv4lsf pmdv4lsf
b Make a symbolic link from pmd_w to /etc/pmdv4lsf. For example:
# ln -s $LSF_BINDIR/pmd_w /etc/pmdv4lsf
Both $LSF_BINDIR and /etc must be owned by root for the symbolic link to work.
c Add pmv4lsf to /etc/services.
Example job scripts
For IP jobs
For the following job script:
# mypoe_jobscript
#!/bin/sh
#BSUB -o out.%J
#BSUB -n 2
#BSUB -m "hostA"
#BSUB -a poe
export MP_EUILIB=ip
mpirun.lsf ./hmpis
Submit the job script as a redirected job, specifying the appropriate resource requirement string:
bsub -R "select[poe>0]" < mypoe_jobscript
For US jobs:
For the following job script:
# mypoe_jobscript
#!/bin/sh
#BSUB -o out.%J
#BSUB -n 2
#BSUB -m "hostA"
#BSUB -a poe
export MP_EUILIB=us
mpirun.lsf .
Submitting IBM POE Jobs over InfiniBand
Platform LSF installation adds a shared nrt_windows resource to run and monitor POE jobs over the InfiniBand interconnect.
lsb.shared
Begin Resource
RESOURCENAME      TYPE      INTERVAL   INCREASING   DESCRIPTION
...
poe               Numeric   30         N            (poe availability)
dedicated_tasks   Numeric   ()         Y            (running dedicated tasks)
ip_tasks          Numeric   ()         Y            (running IP tasks)
us_tasks          Numeric   ()         Y            (running US tasks)
nrt_windows       Numeric   30         N            (free nrt windows on IBM poe over IB)
...
End Resource
lsf.
C H A P T E R 6
Using Platform LSF HPC for Linux/QsNet

RMS Version 2.8.1 and 2.8.
About Platform LSF HPC for Linux/QsNet

Contents
◆ “What Platform LSF HPC for Linux/QsNet does”
◆ “Assumptions and limitations”
◆ “Compatibility with earlier releases”

What Platform LSF HPC for Linux/QsNet does
Platform LSF HPC for Linux/QsNet combines the strengths of Platform LSF, the Quadrics Resource Management System (RMS), and the Quadrics QsNet data network to provide a comprehensive Distributed Resource Management (DRM) solution on Linux.
◆ Kernel-level checkpointing is not available on Linux/QsNet systems.
◆ When LSF selects RMS jobs to preempt, jobs to be preempted are selected from the list of preemptable candidates based on the topology-aware allocation algorithm. Allocation always starts from the smallest numbered node on the LSF node and works from this node up.
◆ Some specialized preemption preferences, such as MINI_JOB and LEAST_RUN_TIME in the PREEMPT_FOR parameter in lsb.
and DEFAULT_EXTSCHED=RMS_SNODE in the queue. Before Version 7, the net effect is:
RMS_SNODE;ptile=1
In Version 7, ptile=1 is not an RMS allocation option, so it is ignored and the net result is:
RMS_SNODE
◆ The following install.config options are obsolete. You do not need to specify them when running lsfinstall:
❖ LSF_ENABLE_EXTSCHEDULER="Y"
❖ CONFIGURE_LSB_RLA_PORT="Y"
❖ LOGDIR="path"
◆ For job control action configuration, the batchid in the RMS rcontrol command must include the LSF cluster name.
Configuring Platform LSF HPC for Linux/QsNet

Contents
◆ “Automatic configuration at installation”
◆ “Setting dedicated LSF partitions (recommended)”
◆ “Customizing job control actions (optional)”
◆ “Configuration notes”

Automatic configuration at installation
lsb.hosts
For the default host, lsfinstall enables "!" in the MXJ column of the HOSTS section of lsb.hosts. For example:
Begin Host
HOST_NAME   MXJ   r1m
#hostA      ()    3.5/4.5
default     !     ()
End Host
lsb.
To make the rms queue the default queue, set DEFAULT_QUEUE=rms in lsb.params. Use the bqueues -l command to view the queue configuration details. Before using LSF, see the Platform LSF Configuration Guide to understand queue configuration parameters in lsb.queues. lsf.conf During installation, lsfinstall sets the following parameters in lsf.conf: ◆ LSF_ENABLE_EXTSCHEDULER=Y LSF uses an external scheduler for RMS allocation.
Customizing job control actions (optional)
By default, LSF carries out job control actions by sending the appropriate signal to suspend, terminate, or resume a job. If your jobs need special job control actions, use the RMS rcontrol command in the rms queue configuration for RMS jobs to change the default job controls.
JOB_CONTROLS parameter in lsb.queues
Use the JOB_CONTROLS parameter in lsb.queues to configure suspend,
Per-processor job slot limit (PJOB_LIMIT in lsb.queues) By default, the per-processor job slot limit is 1 (PJOB_LIMIT=1 in the rms queue in lsb.queues). Do not change this default. Maximum number of sbatchd connections (lsb.params) If LSF operates on a large system (for example, a system with more than 32 hosts), you may need to configure the parameter MAX_SBD_CONNS in lsb.params. MAX_SBD_CONNS controls the maximum number of files mbatchd can have open and connected to sbatchd.
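A sketch of the corresponding lsb.params entry (the value shown is illustrative; size it for your cluster):
Begin Parameters
...
MAX_SBD_CONNS = 64
...
End Parameters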
If topology options (nodes, ptile, or base) or rail flags (rails or railmask) are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -extsched option of bsub. For example, if DEFAULT_EXTSCHED=RMS[nodes=2], and you do not want to specify any node option at all, use -extsched "RMS[RMS_SNODE;nodes=]". See “bsub command” on page 117 for more information.
Operating Platform LSF HPC for Linux/QsNet

Contents
◆ “RMS hosts and RMS jobs”
◆ “Platform LSF RMS topology support plugin”
◆ “LSF scheduling policies and RMS topology support”
◆ “LSF host preference and RMS allocation options”
◆ “RMS rail allocation options”

RMS hosts and RMS jobs
An RMS host has the rms Boolean resource in the RESOURCES column of the host section in lsf.cluster.cluster_name.
RMS option    Description                         LSF equivalent
-I            Allocate immediately or fail        LSF overrides; uses immediately to allocate
-N            Number of nodes to use              -extsched "RMS[nodes=nodes]"
-n            Number of CPUs to use               -n
-R            immediate (same as -I)              LSF overrides
              rails                               Passed with -extsched
              railmask                            Passed with -extsched
-i/ -o/ -e    Input/output/error redirection      bsub -i/ -o/ -e
-m            Block or cyclic job distribution    Must be passed directly (not via LSF) by the user on prun command line.
LSF checks the validity of rail options at job submission against the LSB_RMS_MAXNUMRAILS parameter if it is set in lsf.conf, which specifies a maximum value for the rails option. The default is 32. If incorrect rail option values pass this check, the job pends forever.
Submitting and Monitoring Jobs

Contents
◆ “bsub command”
◆ “Running jobs on any host type”
◆ “Viewing nodes allocated to your job”
◆ “Example job submissions”

bsub command
To submit a job, use the bsub command.
Syntax
bsub -ext[sched] "RMS[[allocation_type][;topology][;flags]]" job_name
Specify topology scheduling policy options for RMS jobs either in the -extsched option, or with DEFAULT_EXTSCHED or MANDATORY_EXTSCHED in the rms queue definition in lsb.queues.
This is the default allocation policy for RMS jobs. LSF sorts nodes according to RMS topology (numbering of nodes and domains), which takes precedence over LSF sorting order. LSF host preferences (for example, bsub -m hostA) are not taken into account. The allocation is more compact than in RMS_SLOAD and starts from the leftmost node allowed by the LSF host list and continues rightward until the allocation specification is satisfied.
❖ nodes=nodes | ptile=cpus_per_node Specifies the number of nodes the allocation requires or the number of CPUs per node. The ptile topology option is different from the LSF ptile keyword used in the span section of the resource requirement string (bsub -R "span[ptile=n ]"). If the ptile topology option is specified in -extsched, the value of bsub -n must be an exact multiple of the ptile value.
If topology options (nodes, ptile, or base) or flags (rails or railmask) are set in DEFAULT_EXTSCHED, and you do not want to specify values for these options, use the keyword with no value in the -extsched option of bsub. For example, if DEFAULT_EXTSCHED=RMS[nodes=2], and you do not want to specify any node option at all, use -extsched "RMS[RMS_SNODE;nodes=]".
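Putting these options together, an example submission for the rms queue might look like the following (myjob is a placeholder for your application):
bsub -q rms -n 8 -ext "RMS[RMS_SNODE;nodes=4]" myjob
This requests 8 CPUs allocated across 4 nodes using the sorted-nodes (RMS_SNODE) policy.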
loadStop        -     -     -
EXTERNAL MESSAGES:
MSG_ID   FROM    POST_TIME      MESSAGE                    ATTACHMENT
0        -       -              -                          -
1        user1   Aug 6 17:09    RMS[nodes=3;base=hostA]    N

Finished jobs (bhist -l)
Use bhist -l to see RMS allocation information for finished jobs.
SUMMARY:      ( time unit: second )
Total number of done jobs: 0          Total number of exited jobs:
Total CPU time consumed: 0.4          Average CPU time consumed:
Maximum CPU time of a job: 0.4        Minimum CPU time of a job:
Total wait time in queues: 8.0        Average wait time in queue: 8.0
Maximum wait time in queue: 8.0       Minimum wait time in queue:
Average turnaround time: 43 (seconds/job)
Maximum turnaround time: 43           Minimum turnaround time:
Average hog factor of a job: 0.
◆ About job starters, see Administering Platform LSF
◆ About bacct, bhist, bjobs, and bsub, see the Platform LSF Command Reference
◆ About lsb.queues and lsf.
C H A P T E R 7
Using Platform LSF with SGI Cpusets

LSF makes use of SGI cpusets to enforce processor limits for LSF jobs. When a job is submitted, LSF creates a cpuset and attaches it to the job before the job starts running. After the job finishes, LSF deallocates the cpuset. If no host meets the CPU requirements, the job remains pending until processors become available to allocate the cpuset.
About SGI cpusets
An SGI cpuset is a named set of CPUs. The processes attached to a cpuset can only run on the CPUs belonging to that cpuset.
Dynamic cpusets
Jobs are attached to a cpuset dynamically created by LSF. The cpuset is deleted when the job finishes or exits. If not specified, the default cpuset type is dynamic.
Static cpusets
Jobs are attached to a static cpuset specified by users at job submission. This cpuset is not deleted when the job finishes or exits.
◆ The new cpuset integration cannot coexist with the old integration within the same cluster.
◆ Under the MultiCluster lease model, both clusters must use the same version of the cpuset integration.
Since backfill and slot reservation are based on an entire host, they may not work correctly if your cluster contains hosts that use both static and dynamic cpusets or multiple static cpusets.
Array services authentication (Altix only)
For PAM jobs on Altix, the SGI Array Services daemon arrayd must be running and AUTHENTICATION must be set to NONE in the SGI array services authentication file /usr/lib/array/arrayd.auth (comment out the AUTHENTICATION NOREMOTE method and uncomment the AUTHENTICATION NONE method). To run multihost MPI applications, you must also enable rsh without password prompt between hosts:
◆ The remote host must be defined in the arrayd configuration.
◆ Configure .
Configuring LSF with SGI Cpusets

Automatic configuration at installation and upgrade
lsb.modules
During installation and upgrade, lsfinstall adds the schmod_cpuset external scheduler plugin module name to the PluginModule section of lsb.modules:
Begin PluginModule
SCH_PLUGIN       RB_PLUGIN   SCH_DISABLE_PHASES
schmod_default   ()          ()
schmod_cpuset    ()          ()
End PluginModule
The schmod_cpuset plugin name must be configured after the standard LSF plugin names in the PluginModule list.
lsf.shared
During installation and upgrade, lsfinstall defines the cpuset Boolean resource in lsf.shared:
Begin Resource
RESOURCENAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
...
cpuset         Boolean   ()         ()           (cpuset host)
...
End Resource
You should add the cpuset resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name. Hosts without the cpuset resource specified are not considered for scheduling cpuset jobs.
lsf.cluster.
You should not use a CXFS file system for LSB_RLA_WORKDIR. ◆ LSF_PIM_SLEEPTIME_UPDATE=Y On Altix hosts, use this parameter to improve job throughput and reduce a job’s start time if there are many jobs running simultaneously on a host. This parameter reduces communication traffic between sbatchd and PIM on the same host.
The names dcpus and scpus can be any name you like.
2 Edit lsf.cluster.cluster_name to map the resources to hosts. For example:
Begin ResourceMap
RESOURCENAME   LOCATION
dcpus          (4@[hosta])   # total cpus - cpus in static cpusets
scpus          (8@[hostc])   # static cpusets
End ResourceMap
For dynamic cpuset resources, the value of the resource should be the number of free CPUs on the host; that is, the number of CPUs outside of any static cpusets on the host.
Configuring default and mandatory cpuset options
Use the DEFAULT_EXTSCHED and MANDATORY_EXTSCHED queue parameters in lsb.queues to configure default and mandatory cpuset options. Use the keywords SGI_CPUSET[] or CPUSET[] to identify the external scheduler parameters. The keyword SGI_CPUSET[] is deprecated. The keyword CPUSET[] is preferred.
DEFAULT_EXTSCHED=[SGI_]CPUSET[cpuset_options]
Specifies default cpuset external scheduling options for the queue.
-extsched options on the bsub command are merged with MANDATORY_EXTSCHED options, and MANDATORY_EXTSCHED options override any conflicting job-level options set by -extsched.
The cpuset option in the job submission overrides the DEFAULT_EXTSCHED, so the job will run in a cpuset allocated with a maximum of 1 CPU per node, honoring the job-level MAX_CPU_PER_NODE option. If the queue specifies:
MANDATORY_EXTSCHED=CPUSET[MAX_CPU_PER_NODE=2]
and the job is submitted with:
bsub -n 4 -ext "CPUSET[MAX_CPU_PER_NODE=1]" myjob
The job will run in a cpuset allocated with a maximum of 2 CPUs per node, honoring the MAX_CPU_PER_NODE option in the queue.
Using LSF with SGI Cpusets

Specifying cpuset properties for jobs
To specify cpuset properties for LSF jobs, use:
◆ The -extsched option of bsub.
◆ DEFAULT_EXTSCHED or MANDATORY_EXTSCHED, or both, in the queue definition (lsb.queues).
If a job is submitted with the -extsched option, LSF submits jobs with hold, then resumes the job before dispatching it to give time for LSF to attach the -extsched options. The job starts on the first execution host.
◆ CPU_LIST=cpu_ID_list; cpu_ID_list is a list of CPU IDs separated by commas. The CPU ID is a positive integer or a range of integers. If incorrect CPU IDs are specified, the job remains pending until the specified CPUs are available. You must specify at least as many CPU IDs as the number of CPUs the job requires (bsub -n). If you specify more CPU IDs than the job requests, LSF selects the best CPUs from the list.
span[ptile] resource requirement
CPU_LIST is interpreted as a list of possible CPU selections, not a strict requirement. For example, if you submit a job with the -R "span[ptile]" option:
bsub -R "span[ptile=1]" -ext "CPUSET[CPU_LIST=1,3]" -n2 ...
◆ SGI Altix Linux ProPack 4 and ProPack 5 do not support CPUSET_OPTIONS=CPUSET_MEMORY_MANDATORY or CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE attributes. If the cpuset job runs on an Altix host, the cpusets created on the Altix system will have their memory usage restricted to the memory nodes containing the CPUs assigned to the cpuset. The CPUSET_MEMORY_MANDATORY and CPUSET_CPU_EXCLUSIVE attributes are ignored.
CPU radius and processor topology If LSB_CPUSET_BESTCPUS is set in lsf.conf, LSF can choose the best set of CPUs that can create a cpuset. The best cpuset is the one with the smallest CPU radius that meets the CPU requirements of the job. CPU radius is determined by the processor topology of the system and is expressed in terms of the number of router hops between CPUs.
If you define the external scheduler option CPUSET[CPUSET_TYPE=none], no cpusets are allocated and the job is dispatched and run outside of any cpuset. Spanning multiple hosts is not supported on IRIX or TRIX. Platform HPC creates cpusets on a single host (or on the first host in the allocation).
LSB_HOST_CPUSETS environment variable
After dynamic cpusets are allocated and before the job starts running, LSF sets the LSB_HOST_CPUSETS environment variable.
Example
Assume a host with 2 nodes, 2 CPUs per node (total of 4 CPUs):
Node   CPUs
0      0, 2
1      1, 3
When a job running within a cpuset that contains cpu 1 is suspended:
1 The job processes are detached from the cpuset and suspended.
2 The cpuset is destroyed.
When the job is resumed:
1 A cpuset with the same name is recreated.
2 The processes are resumed and attached to the cpuset.
The RESUME_OPTION parameter determines which CPUs are used to recreate the cpuset: If RESUME_OPTION=ORIG_CPUS, only CPUs from t
/reg62@221;NCPUS=2; Thu Dec 15 14:20:03: Done successfully. The CPU time used is 0.0 seconds.
CPU_T   WAIT   TURNAROUND   STATUS   HOG_FACTOR   MEM   SWAP
0.03    3      7            done     0.0042       0K    0K
------------------------------------------------------------------------------
SUMMARY:      ( time unit: second )
Total number of done jobs: 1          Total number of exited jobs:
Total CPU time consumed: 0.0          Average CPU time consumed:
Maximum CPU time of a job: 0.0        Minimum CPU time of a job:
Total wait time in queues: 3.0        Average wait time in queue: 3.0
Maximum wait time in queue: 3.
bsub -n 8 -extsched "CPUSET[CPU_LIST=1, 5, 7-12; CPUSET_OPTIONS=CPUSET_CPU_EXCLUSIVE|CPUSET_MEMORY_LOCAL]" myjob
The job myjob will succeed if CPUs 1, 5, 7, 8, 9, 10, 11, and 12 are available.
◆ Specify a static cpuset:
bsub -n 8 -extsched "CPUSET[CPUSET_TYPE=static; CPUSET_NAME=MYSET]" myjob
Specifying a cpuset name implies that the cpuset type is static:
bsub -n 8 -extsched "CPUSET[CPUSET_NAME=MYSET]" myjob
Jobs are attached to a static cpuset specified by users at job submission. This cpuset is not deleted when the job finishes or exits.
Using SGI Comprehensive System Accounting facility (CSA)
The SGI Comprehensive System Accounting facility (CSA) provides data for collecting per-process resource usage, monitoring disk usage, and chargeback to specific login accounts. If CSA is enabled on your system, LSF writes records for LSF jobs to CSA. SGI CSA writes an accounting record for each process in the pacct file, which is usually located in the /var/adm/acct/day directory.
# csaedit -P /var/csa/day/pacct -A
For each LSF job, you should see two lines similar to the following:
---------------------------------------------------------------------------------------
37 Raw-Workld-Mgmt user1 0x19ac91ee000064f2 0x0000000000000000 0 REQID=1771 ARRAYID=0 PROV=LSF START=Jun 4 15:52:01 ENTER=Jun 4 15:51:49 TYPE=INIT SUBTYPE=START MACH=hostA REQ=myjob QUE=normal
…
39 Raw-Workld-Mgmt user1 0x19ac91ee000064f2 0x0000000000000000 0 REQID=1771 ARRAYID=0 PROV=LSF START=Jun 4 16:09:14 TYPE=TERM
Using SGI User Limits Database (ULDB—IRIX only)
The SGI user limits database (ULDB) allows user-specific limits for jobs. If no ULDB is defined, job limits are the same for all jobs. If you use ULDB, you can configure LSF so that jobs submitted to a host with the SGI job limits package installed are subject to the job limits configured in the ULDB. Set LSF_ULDB_DOMAIN=domain_name in lsf.conf to specify the name of the LSF domain in the ULDB domain directive.
◆ SWAPLIMIT—Corresponds to JLIMIT_VMEM; use process limit RLIMIT_VMEM Increasing the default MEMLIMIT for ULDB In some pre-defined LSF queues, such as normal, the default MEMLIMIT is set to 5000 (5 MB). However, if ULDB is enabled (LSF_ULDB_DOMAIN is defined) the MEMLIMIT should be set greater than 8000 in lsb.queues. Example ULDB domain configuration The following steps enable the ULDB domain LSF for user user1: 1 Define the LSF_ULDB_DOMAIN parameter in lsf.conf: ... LSF_ULDB_DOMAIN=LSF ...
SGI Job Container and Process Aggregate Support An SGI job contains all processes created in a login session, including array sessions and session leaders. Job limits set in ULDB are applied to SGI jobs either at creation time or through the lifetime of the job. Job limits can also be reset on a job during its lifetime.
EXTERNAL MESSAGES:
MSG_ID FROM   POST_TIME      MESSAGE                               ATTACHMENT
0
1
2      root   Jan 20 12:41   JID=0x2bc0000000001f7a; ASH=0x2bc0f   N

bhist -l 640
Job <640>, User , Project , Command
Sat Oct 19 14:52:14: Submitted from host , to Queue , CWD <$HOME>, Requested Resources ;
Sat Oct 19 14:52:22: Dispatched to ;
Sat Oct 19 14:52:22: CPUSET_TYPE=none;NHOSTS=1;ALLOCINFO=hostA;
Sat Oct 19 14:52:23: Starting (Pid 5020232);
Sat Oct 19 14:52:23: Run
C H A P T E R 8 Using Platform LSF with LAM/MPI Contents ◆ ◆ ◆ “About Platform LSF and LAM/MPI” on page 154 “Configuring LSF to work with LAM/MPI” on page 156 “Submitting LAM/MPI Jobs” on page 157 Using Platform LSF HPC 153
About Platform LSF and LAM/MPI LAM (Local Area Multicomputer) is an MPI programming environment and development system for heterogeneous computers on a network. With LAM, a dedicated cluster or an existing network computing infrastructure can act as one parallel computer solving one problem. System requirements ❏ LAM/MPI version 6.5.
Begin Resource
RESOURCE_NAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
...
lammpi          Boolean   ()         ()           (LAM MPI)
...
End Resource
The lammpi Boolean resource is used for mapping hosts with LAM/MPI available. You should add the lammpi resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name (see the example below).
◆ Parameter to lsf.conf
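For example, to mark two hosts as able to run LAM/MPI jobs, the Host section of lsf.cluster.cluster_name mentioned above might contain entries like the following; hostA, hostB, and the other column values are placeholders for your existing configuration:
Begin Host
HOSTNAME   model   type   server   r1m   mem   swp   RESOURCES
hostA      !       !      1        -     -     -     (lammpi)
hostB      !       !      1        -     -     -     (lammpi)
End Host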
Configuring LSF to work with LAM/MPI
System setup
1 For troubleshooting LAM/MPI jobs, edit the LSF_BINDIR/lammpirun_wrapper script, and specify a log directory that all users can write to. For example:
LOGDIR="/mylogs"
Do not use LSF_LOGDIR for this log directory.
2 Add the LAM/MPI home directory to your path. The LAM/MPI home directory is the directory that you specified as the prefix during LAM/MPI installation.
Submitting LAM/MPI Jobs
bsub command
Use bsub to submit LAM/MPI jobs:
bsub -a lammpi -n number_cpus [-q queue_name] mpirun.lsf [-pam "pam_options"] [mpi_options] job [job_options]
◆ -a lammpi tells esub the job is a LAM/MPI job and invokes esub.lammpi.
◆ -n number_cpus specifies the number of processors required to run the job.
◆ -q queue_name specifies a LAM/MPI queue that is configured to use the custom termination action. If no queue is specified, the hpc_linux queue is used.
◆ mpirun.lsf reads the environment variable LSF_PJL_TYPE=lammpi set by esub.lammpi.
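For example, the following runs an 8-way LAM/MPI job in the hpc_linux queue; myapp and its input file are placeholders:
% bsub -a lammpi -n 8 -q hpc_linux mpirun.lsf ./myapp data.in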
Log files
For troubleshooting LAM/MPI jobs, define LOGDIR in the LSF_BINDIR/lammpirun_wrapper script. Log files (lammpirun_wrapper.job[job_ID].log) are written to the LOGDIR directory. If LOGDIR is not defined, log messages are written to /dev/null. For example, the log file for the job with job ID 123 is: lammpirun_wrapper.job123.log
C H A P T E R 9 Using Platform LSF with MPICH-GM Contents ◆ ◆ ◆ ◆ “About Platform LSF and MPICH-GM” on page 160 “Configuring LSF to Work with MPICH-GM” on page 162 “Submitting MPICH-GM Jobs” on page 164 “Using AFS with MPICH-GM” on page 165 Using Platform LSF HPC 159
About Platform LSF and MPICH-GM
MPICH is a freely available, portable implementation of the MPI Standard for message-passing libraries, developed by Argonne National Laboratory jointly with Mississippi State University. MPICH is designed to provide high performance, portability, and a convenient programming environment. MPICH-GM is used with high-performance Myrinet networks. Myrinet is a high-speed network that allows OS-bypass communications in large clusters.
These files...       Are installed to...
TaskStarter          LSF_BINDIR
pam                  LSF_BINDIR
esub.mpich_gm        LSF_SERVERDIR
gmmpirun_wrapper     LSF_BINDIR
mpirun.lsf           LSF_BINDIR
pjllib.sh            LSF_BINDIR

Resources and parameters configured by lsfinstall
◆ External resources in lsf.shared:
Begin Resource
RESOURCE_NAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
...
mpich_gm        Boolean   ()         ()           (MPICH GM MPI)
...
End Resource
The mpich_gm Boolean resource is used for mapping hosts with MPICH-GM available.
Configuring LSF to Work with MPICH-GM
Configure GM port resources (optional)
If there are more processors on a node than there are available GM ports, you should configure the external static resource name gm_ports to limit the number of jobs that can launch on that node.
lsf.shared
Add the external static resource gm_ports in lsf.shared to keep track of the number of free Myrinet ports available on a host:
Begin Resource
RESOURCENAME   TYPE      ...
gm_ports       Numeric   ...
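As a sketch, the gm_ports value for each Myrinet host could then be attached in the ResourceMap section of lsf.cluster.cluster_name, and jobs could reserve a port at submission time. The host names, the count of 8 ports, and the reservation string are placeholders; the exact rusage expression depends on how many tasks you place on each host:
Begin ResourceMap
RESOURCENAME   LOCATION
gm_ports       (8@[hostA] 8@[hostB])
End ResourceMap
% bsub -a mpich_gm -n 8 -R "rusage[gm_ports=1]" mpirun.lsf ./myapp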
lsf.conf (optional)
LSF_STRIP_DOMAIN
If the gm_board_info command returns host names that include domain names, you cannot define LSF_STRIP_DOMAIN in lsf.conf. If the gm_board_info command returns host names without domain names, but LSF commands return host names that include domain names, you must define LSF_STRIP_DOMAIN in lsf.conf.
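For example, if gm_board_info reports hostA but LSF commands report hostA.example.com, a line like the following could be added to lsf.conf; the domain suffix is a placeholder for your own domain:
LSF_STRIP_DOMAIN=.example.com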
Submitting MPICH-GM Jobs
bsub command
Use bsub to submit MPICH-GM jobs:
bsub -a mpich_gm -n number_cpus mpirun.lsf [-pam "pam_options"] [mpi_options] job [job_options]
◆ -a mpich_gm tells esub the job is an MPICH-GM job and invokes esub.mpich_gm.
◆ -n number_cpus specifies the number of processors required to run the job.
◆ mpirun.lsf reads the environment variable LSF_PJL_TYPE=mpich_gm set by esub.mpich_gm.
Using AFS with MPICH-GM
Complete the following steps only if you are planning to use AFS with MPICH-GM. The MPICH-GM package contains an esub.afs file which combines the esub for MPICH-GM and the esub for AFS so that MPICH-GM and AFS can work together.
Steps
1 Install and configure LSF for AFS.
2 Edit mpirun.ch_gm. The location of this script is defined with the MPIRUN_CMD parameter in the script LSF_BINDIR/gmmpirun_wrapper.
C H A P T E R 10 Using Platform LSF with MPICH-P4 Contents ◆ ◆ ◆ “About Platform LSF and MPICH-P4” on page 168 “Configuring LSF to Work with MPICH-P4” on page 170 “Submitting MPICH-P4 Jobs” on page 171 Using Platform LSF HPC 167
About Platform LSF and MPICH-P4
MPICH is a freely available, portable implementation of the MPI Standard for message-passing libraries, developed by Argonne National Laboratory jointly with Mississippi State University. MPICH is designed to provide high performance, portability, and a convenient programming environment. MPICH-P4 is an MPICH implementation for the ch_p4 device, which supports SMP nodes, MPMD programs, and heterogeneous collections of systems.
Requirements
❏ MPICH version 1.2.
These files...       Are installed to...
TaskStarter          LSF_BINDIR
pam                  LSF_BINDIR
esub.mpichp4         LSF_SERVERDIR
mpichp4_wrapper      LSF_BINDIR
mpirun.lsf           LSF_BINDIR
pjllib.sh            LSF_BINDIR

Resources and parameters configured by lsfinstall
◆ External resources in lsf.shared:
Begin Resource
RESOURCE_NAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
...
mpichp4         Boolean   ()         ()           (MPICH P4 MPI)
...
End Resource
The mpichp4 Boolean resource is used for mapping hosts with MPICH-P4 available.
Configuring LSF to Work with MPICH-P4 mpichp4_wrapper script Modify the mpichp4_wrapper script in LSF_BINDIR to set MPICH_HOME. The default is: MPICH_HOME="/opt/mpich-1.2.5.
Submitting MPICH-P4 Jobs
bsub command
Use bsub to submit MPICH-P4 jobs:
bsub -a mpichp4 -n number_cpus mpirun.lsf [-pam "pam_options"] [mpi_options] job [job_options]
◆ -a mpichp4 tells esub the job is an MPICH-P4 job and invokes esub.mpichp4.
◆ -n number_cpus specifies the number of processors required to run the job.
◆ mpirun.lsf reads the environment variable LSF_PJL_TYPE=mpichp4 set by esub.mpichp4.
For information on generic PJL wrapper script components, see Chapter 3, “Running Parallel Jobs”. See Administering Platform LSF for information about submitting jobs with job scripts.
C H A P T E R 11 Using Platform LSF with MPICH2 Contents ◆ ◆ ◆ ◆ “About Platform LSF and MPICH2” on page 174 “Configuring LSF to Work with MPICH2” on page 176 “Building Parallel Jobs” on page 178 “Submitting MPICH2 Jobs” on page 179 Using Platform LSF HPC 173
About Platform LSF and MPICH2
MPICH is a freely available, portable implementation of the MPI Standard for message-passing libraries, developed by Argonne National Laboratory jointly with Mississippi State University. MPICH is designed to provide a high performance, portable, and convenient programming environment. MPICH2 implements both MPI-1 and MPI-2. The mpiexec command of MPICH2 spawns all tasks, while LSF retains full control over the tasks spawned.
These files...       Are installed to...
TaskStarter          LSF_BINDIR
pam                  LSF_BINDIR
esub.mpich2          LSF_SERVERDIR
mpich2_wrapper       LSF_BINDIR
mpirun.lsf           LSF_BINDIR
pjllib.sh            LSF_BINDIR

Resources and parameters configured by lsfinstall
◆ External resources in lsf.shared:
Begin Resource
RESOURCE_NAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
...
mpich2          Boolean   ()         ()           (MPICH2 MPI)
...
End Resource
The mpich2 Boolean resource is used for mapping hosts with MPICH2 available.
Configuring LSF to Work with MPICH2
1 Make sure MPICH2 commands are in the PATH environment variable. MPICH2 commands include mpiexec, mpd, mpdboot, mpdtrace, and mpdexit. For example:
[174]- which mpiexec
/pcc/app/mpich2/kernel2.4-glibc2.3-x86/bin/mpiexec
2 Add an mpich2 boolean resource to the $LSF_ENVDIR/lsf.shared file.
Make sure $HOME/.mpd.conf has a permission mode of 600 after you finish the modification. iii Set LSF_START_MPD_RING=N in your job script or in the environment for all users. If you want to start an MPD ring on all hosts, follow the steps described in the MPICH2 documentation to start an MPD ring across all LSF hosts for each user. The user MPD ring must be running all the time, and you must set LSF_START_MPD_RING=N in your job script or in the environment for all users.
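As a sketch, each user's MPD configuration file can be created and protected as follows before submitting MPICH2 jobs; the secret word is a placeholder, and some MPICH2 versions expect the key MPD_SECRETWORD rather than secretword:
% echo "secretword=mysecret" > $HOME/.mpd.conf
% chmod 600 $HOME/.mpd.conf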
Building Parallel Jobs
1 Use mpicc -o to compile your source code. For example:
[178]- which mpicc
/pcc/app/mpich2/kernel2.4-glibc2.3-x86/bin/mpicc
5:19pm Mon, Sep-19-2005 qat21:~/milkyway/bugfix/test
[179]- mpicc -o hw.mpich2 hw.c
2 Make sure the compiled binary can run under the root MPD ring outside Platform LSF. For example:
[180]- mpiexec -np 2 hw.mpich2
Submitting MPICH2 Jobs
bsub command
Use the bsub command to submit MPICH2 jobs.
1 Submit a job from the console command line:
bsub -n <###> -a mpich2 mpirun.lsf job
Note that the -np option of mpiexec will be ignored. For example:
bsub -I -n 8 -R "span[ptile=4]" -a mpich2 -W 2 mpirun.lsf -np 3 ./hw.mpich2
2 Submit a job using a script:
bsub < myjobscript.sh
where myjobscript.sh looks like:
#!/bin/sh
#BSUB -n 8
#BSUB -a mpich2
mpirun.lsf ./hw.mpich2
C H A P T E R 12 Using Platform LSF with MVAPICH Contents ◆ ◆ ◆ “About Platform LSF and MVAPICH” on page 182 “Configuring LSF to Work with MVAPICH” on page 184 “Submitting MVAPICH Jobs” on page 185 Using Platform LSF HPC 181
About Platform LSF and MVAPICH MVAPICH is an open-source product developed in the Department of Computer and Information Science, The Ohio State University. MVAPICH is MPI-1 over VAPI for InfiniBand. It is an MPI-1 implementation on Verbs Level Interface (VAPI), developed by Mellanox Technologies. The implementation is based on MPICH and MVICH. The LSF MVAPICH MPI integration is based on the LSF generic PJL framework.
Files installed by lsfinstall
During installation, lsfinstall copies these files to the following directories:
These files...                                   Are installed to...
TaskStarter                                      LSF_BINDIR
pam                                              LSF_BINDIR
esub.mvapich (sets the mode: rsh, ssh, or mpd)   LSF_SERVERDIR
mvapich_wrapper                                  LSF_BINDIR
mpirun.lsf                                       LSF_BINDIR
pjllib.sh                                        LSF_BINDIR

Resources and parameters configured by lsfinstall
◆ External resources in lsf.shared:
Begin Resource
RESOURCE_NAME   TYPE      ...
mvapich         Boolean   ...
...
Configuring LSF to Work with MVAPICH
esub.mvapich script
Modify the esub.mvapich in LSF_SERVERDIR to set MVAPICH_START_CMD to one of ssh, rsh, or mpd. The default value is ssh.
mvapich_wrapper script
Modify the mvapich_wrapper script in LSF_BINDIR to set MVAPICH_HOME. The defaults are:
◆ Topspin MPI: MVAPICH_HOME="/usr/local/topspin"
◆ IBRIX Roll MPI: MVAPICH_HOME="/opt/mpich/infiniband/gnu"
◆ Generic MVAPICH: defined by your site.
Submitting MVAPICH Jobs
bsub command
Use bsub -a mvapich to submit jobs. If the starting command is mpd, you must submit your MVAPICH jobs as exclusive jobs (bsub -x).
bsub -a mvapich -n number_cpus mpirun.lsf [-pam "pam_options"] [mpi_options] job [job_options]
◆ -a mvapich tells esub the job is an MVAPICH job and invokes esub.mvapich.
◆ -n number_cpus specifies the number of processors required to run the job.
◆ mpirun.lsf reads the environment variable LSF_PJL_TYPE=mvapich set by esub.mvapich.
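For example, with the default ssh start command a 16-way MVAPICH job could be submitted as shown on the first line below, and the second line shows the same job submitted exclusively for the mpd start command; myapp is a placeholder:
% bsub -a mvapich -n 16 mpirun.lsf ./myapp
% bsub -x -a mvapich -n 16 mpirun.lsf ./myapp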
C H A P T E R 13 Using Platform LSF with Intel® MPI Contents ◆ ◆ ◆ ◆ “About Platform LSF and the Intel® MPI Library” on page 188 “Configuring LSF to Work with Intel MPI” on page 190 “Working with the Multi-purpose Daemon (MPD)” on page 191 “Submitting Intel MPI Jobs” on page 192 Using Platform LSF HPC 187
About Platform LSF and the Intel® MPI Library
The Intel® MPI Library (“Intel MPI”) is a high-performance message-passing library for developing applications that can run on multiple cluster interconnects chosen by the user at runtime. It supports TCP, shared memory, and high-speed interconnects like InfiniBand and Myrinet. Intel MPI supports all MPI-1 features and many MPI-2 features, including file I/O, generalized requests, and preliminary thread support. It is based on the MPICH2 specification.
◆ See www-unix.mcs.anl.gov/mpi/mpich2/ for more information about MPICH2.
◆ See the Intel Software Network > Software Products > Cluster Tools > Intel MPI Library at www.intel.com for more information about the Intel MPI Library.
◆ See Getting Started with the Intel® MPI Library (Getting_Started.pdf in the Intel MPI installation documentation directory) for more information about using the Intel MPI library and commands.
Configuring LSF to Work with Intel MPI
intelmpi_wrapper script
Modify the intelmpi_wrapper script in LSF_BINDIR to set MPI_TOPDIR. The default value is:
MPI_TOPDIR="/opt/intel/mpi/2.0"
lsf.conf (optional)
To improve performance and scalability for large parallel jobs, tune the following parameters as described in “Tuning PAM Scalability and Fault Tolerance” on page 56:
◆ LSF_HPC_PJL_LOADENV_TIMEOUT
◆ LSF_PAM_RUSAGE_UPD_FACTOR
The user's environment can override these.
Working with the Multi-purpose Daemon (MPD)
The Intel® MPI Library (“Intel MPI”) uses a Multi-Purpose Daemon (MPD) job startup mechanism. MPD daemons must be up and running on the hosts where an MPI job is supposed to start before mpiexec is started.
How Platform LSF manages MPD rings
LSF manages MPD rings for users automatically using the mpdboot and mpdtrace commands. Each MPI job running under LSF uses a uniquely labeled MPD ring.
Submitting Intel MPI Jobs
bsub command
Use bsub -a intelmpi to submit jobs. If the starting command is mpd, you must submit your Intel MPI jobs as exclusive jobs (bsub -x).
bsub -a intelmpi -n number_cpus mpirun.lsf [-pam "pam_options"] [mpi_options] job [job_options]
◆ -a intelmpi tells esub the job is an Intel MPI job and invokes esub.intelmpi.
◆ -n number_cpus specifies the number of processors required to run the job.
◆ mpirun.lsf reads the environment variable LSF_PJL_TYPE=intelmpi set by esub.intelmpi.
% hostname
hosta
% mpiexec -l -n 2 -host hosta.domain.com ./hmpi
mpdrun: unable to start all procs; may have invalid machine names
    remaining specified hosts:
    hosta.domain.com
% mpiexec -l -n 2 -host hosta ./hmpi
0: myrank 0, n_processes 2
1: myrank 1, n_processes 2
0: From process 1: Slave process 1!
-genvlist option
The -genvlist option does not work if the configuration file for -configfile has more than one entry.
C H A P T E R 14 Using Platform LSF with Open MPI Contents ◆ ◆ ◆ “About Platform LSF and the Open MPI Library” on page 196 “Configuring LSF to Work with Open MPI” on page 198 “Submitting Open MPI Jobs” on page 199 Using Platform LSF HPC 195
About Platform LSF and the Open MPI Library The Open MPI Library is a high-performance message-passing library for developing applications that can run on multiple cluster interconnects chosen by the user at runtime. Open MPI supports all MPI-1 and MPI-2 features. The LSF Open MPI integration is based on the LSF generic PJL framework. It supports the LSF task geometry feature. Requirements ❏ Open MPI version 1.1 or later You should upgrade all your hosts to the same version of Open MPI.
These files...   Are installed to...
mpirun.lsf       LSF_BINDIR
pjllib.sh        LSF_BINDIR

Resources and parameters configured by lsfinstall
◆ External resources in lsf.shared:
Begin Resource
RESOURCE_NAME   TYPE      INTERVAL   INCREASING   DESCRIPTION
...
openmpi         Boolean   ()         ()           (Open MPI)
...
End Resource
The openmpi Boolean resource is used for mapping hosts with Open MPI available. You should add the openmpi resource name under the RESOURCES column of the Host section of lsf.cluster.cluster_name.
Configuring LSF to Work with Open MPI
◆ The mpirun command must be included in the $PATH environment variable on all LSF hosts.
◆ Make sure LSF uses system host official names (/etc/hosts): this will prevent problems when you run the application.
❖ Configure the $LSF_CONFDIR/hosts file and the $LSF_ENVDIR/lsf.cluster.cluster_name file. For example:
172.25.238.91 scali scali.lsf.platform.com
172.25.238.96 scali1 scali1.lsf.platform.com
Submitting Open MPI Jobs
bsub command
Use bsub -a openmpi to submit jobs. For example:
bsub -a openmpi -n number_cpus mpirun.lsf a.out
◆ -a openmpi tells esub the job is an Open MPI job and invokes esub.openmpi.
◆ -n number_cpus specifies the number of processors required to run the job.
◆ mpirun.lsf reads the environment variable LSF_PJL_TYPE=openmpi set by esub.openmpi.
C H A P T E R 15 Using Platform LSF Parallel Application Integrations Contents ◆ ◆ ◆ ◆ ◆ ◆ ◆ “Using LSF with ANSYS” on page 202 “Using LSF with NCBI BLAST” on page 205 “Using LSF with FLUENT” on page 206 “Using LSF with Gaussian” on page 210 “Using LSF with Lion Bioscience SRS” on page 211 “Using LSF with LSTC LS-Dyna” on page 212 “Using LSF with MSC Nastran” on page 218 Using Platform LSF HPC 201
Using LSF with ANSYS
LSF supports various ANSYS solvers through a common integration console built into the ANSYS GUI. The only change the average ANSYS user sees is the addition of a Run using LSF? button on the standard ANSYS console. Using ANSYS with LSF simplifies distribution of jobs, and improves throughput by removing the need for engineers to worry about when or where their jobs run.
Initial Jobname    The name given to the job for easier recognition at runtime.
Input filename     Specifies the file of ANSYS commands you are submitting for batch execution. You can either type in the desired file name or click on the ... button, to display a file selection dialog box.
Output filename    Specifies the file to which ANSYS directs text output by the program.
Memory requested
Run using LSF?
Run in
Available Hosts    Allows users to specify a specific host to run the job on.
Queue              Allows users to specify which queue they desire instead of the default.
Host Types         Allows users to specify a specific architecture for their job.
Submitting jobs through the ANSYS command-line
Submitting a command line job requires extra parameters to run correctly through LSF.
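As a sketch, a batch ANSYS run could be submitted from the command line by combining bsub options with the ANSYS batch flags. The executable name ansys_cmd, the resource string, and the file names below are placeholders, and the exact extra parameters depend on your ANSYS release and site configuration:
% bsub -R "ansys" -o job%J.out ansys_cmd -b -i input.dat -o output.out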
Using LSF with NCBI BLAST
LSF accepts jobs running NCBI BLAST (Basic Local Alignment Search Tool).
Requirements
◆ Platform LSF
◆ BLAST, available from the National Center for Biotechnology Information (NCBI)
Configuring LSF for BLAST jobs
During installation, lsfinstall adds the Boolean resource blast to the Resource section of lsf.shared.
Host configuration
If only some of your hosts can accept BLAST jobs, configure the Host section of lsf.cluster.cluster_name to identify those hosts.
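For example, a BLAST search could be restricted to hosts that have the blast resource; the program, database, and file names below are placeholders:
% bsub -R "blast" -o blast%J.out blastall -p blastn -d nt -i query.fa -o result.out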
Using LSF with FLUENT LSF is integrated with products from Fluent Inc., allowing FLUENT jobs to take advantage of the checkpointing and migration features provided by LSF. This increases the efficiency of the software and means data is processed faster. FLUENT 5 offers versions based on system vendors’ parallel environments (usually MPI using the VMPI version of FLUENT 5.) Fluent also provides a parallel version of FLUENT 5 based on its own socket-based message passing library (the NET version).
echkpnt and erestart
LSF installs echkpnt.fluent and erestart.fluent, which are special versions of echkpnt and erestart that allow checkpointing with FLUENT. Use bsub -a fluent to make sure your job uses these files.
Checkpoint directories
When you submit a checkpointing job, you specify a checkpoint directory. Before the job starts running, LSF sets the environment variable LSB_CHKPNT_DIR. The value of LSB_CHKPNT_DIR is a subdirectory of the checkpoint directory specified in the command line.
-k checkpoint_dir
Regular option to bsub that specifies the name of the checkpoint directory.
checkpoint_period
Regular option to bsub that specifies the time interval in minutes that LSF will automatically checkpoint jobs.
FLUENT command
Regular command used with FLUENT software.
-lsf
Special option to the FLUENT command. Specifies that FLUENT is running under LSF, and causes FLUENT to check for trigger files in the checkpoint directory if the environment variable LSB_CHKPNT_DIR is set.
Examples ◆ Sequential FLUENT batch job with checkpoint and restart % bsub -a fluent -k "/home/username 60" fluent 3d -g -i journal_file -lsf Submits a job that uses the checkpoint/restart method echkpnt.fluent and erestart.fluent, /home/username as the checkpoint directory, and a 60 minute duration between automatic checkpoints. FLUENT checks if there is a checkpoint trigger file /home/username/exit or /home/username/check.
Using LSF with Gaussian
Platform HPC accepts jobs running the Gaussian electronic structure modeling program.
Requirements
◆ Platform LSF
◆ Gaussian 98, available from Gaussian, Inc.
Configuring LSF for Gaussian jobs
During installation, lsfinstall adds the Boolean resource gaussian to the Resource section of lsf.shared.
Host configuration
If only some of your hosts can accept Gaussian jobs, configure the Host section of lsf.cluster.cluster_name to identify those hosts.
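For example, a Gaussian 98 run could be limited to hosts that have the gaussian resource; the input and output file names are placeholders, and the whole command is quoted so that the redirection happens on the execution host:
% bsub -R "gaussian" "g98 < input.com > output.log"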
Using LSF with Lion Bioscience SRS SRS is Lion Bioscience’s Data Integration Platform, in which data is extracted by all other Lion Bioscience applications or third-party products. LSF works with the batch queue feature of SRS to provide load sharing and allow users to manage their running and completed jobs. Requirements ◆ ◆ Platform LSF SRS 6.
Using LSF with LSTC LS-Dyna LSF is integrated with products from Livermore Software Technology Corporation (LSTC). LS-Dyna jobs can use the checkpoint and restart features of LSF and take advantage of both SMP and distributed MPP parallel computation. To submit LS-Dyna jobs through LSF, you only need to make sure that your jobs are checkpointable. See Administering Platform LSF for more information about checkpointing in LSF.
With pam and Task Starter, you can track resources of MPP jobs, but cannot checkpoint them. If you do not use pam and Task Starter, checkpointing of MPP jobs is supported, but resource tracking is not.
echkpnt and erestart
LSF installs echkpnt.ls_dyna and erestart.ls_dyna, which are special versions of echkpnt and erestart that allow checkpointing with LS-Dyna. Use bsub -a ls_dyna to make sure your job uses these files.
◆ ◆ When LS-Dyna jobs are restarted from a checkpoint, the job will use the checkpoint environment instead of the job submission environment. You can restore your job submission environment if you submit your job with a job script that includes your environment settings. LS-Dyna jobs must run in the directory that LSF sets in the LSB_CHKPNT_DIR environment variable. This lets you submit multiple LS-Dyna jobs from the same directory but is also required if you are submitting one job.
If you do not set your environment variables in the job script, then you must add some lines to the script to restore environment variables. For example:
if [ -f $LSB_CHKPNT_DIR/.envdump ]; then
  . $LSB_CHKPNT_DIR/.envdump
fi
Change directory
Ensure that your jobs run in the checkpoint directory set by LSF, by adding the following line after your bsub commands:
cd $LSB_CHKPNT_DIR
LS-Dyna
Write the LS-Dyna command you want to run.
Example job submission script:
#!/bin/sh
#BSUB -J LS_DYNA
#BSUB -k "/usr/share/checkpoint_dir method=ls_dyna"
cd $LSB_CHKPNT_DIR
# after the first checkpoint
if [ -f $LSB_CHKPNT_DIR/.envdump ]; then
  . $LSB_CHKPNT_DIR/.envdump
fi
/usr/share/ls_dyna_path/ls960 endtime=2 i=/usr/share/ls_dyna_path/airbag.deploy.k ncpu=1
exit $?
◆ Job script running MPP LS-Dyna job embedded in the script. Without PAM and TaskStarter, the job can be checkpointed, but resource usage tracking and job control are not available.
Specifies checkpoint and exit. The job will be killed immediately after being checkpointed. When the job is restarted, it continues from the last checkpoint. ◆ job_ID Job ID of the LS-Dyna job. Specifies which job to checkpoint. Each time the job is migrated, the job is restarted and assigned a new job ID. See Platform LSF Command Reference for more information about bchkpnt.
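For example, to checkpoint a running LS-Dyna job and have it exit immediately after the checkpoint completes, where the job ID 640 is a placeholder:
% bchkpnt -k 640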
Using LSF with MSC Nastran MSC Nastran Version 70.7.2 (“Nastran”) runs in a Distributed Parallel mode, and automatically detects a job launched by LSF, and transparently accepts the execution host information from LSF. The Nastran application checks if the LSB_HOSTS or LSB_MCPU_HOSTS environment variable is set in the execution environment. If either is set, Nastran uses the value of the environment variable to produce a list of execution nodes for the solver command line.
Note that both the bsub -n 4 and Nastran dmp=4 options are used. The value for -n and dmp must be the same. ◆ Parallel job through LSF requesting 4 processors, no more than 1 processor per host: % bsub -n 4 -a nastran -R "nastran span[ptile=1]" nastran example dmp=4 Nastran on Linux using LAM/MPI You must write a script that will pick up the LSB_HOSTS variable and provide the chosen hosts to the Nastran program.
lamboot -v ${LSB_HOST_FILE} >> ${LOG} 2>&1
NDMP=`sed -n -e '$=' ${LSB_HOST_FILE}`
HOST="n0"
(( i=1 ))
while (( i < $NDMP )) ; do
    HOST="$HOST:n$i"
    (( i += 1 ))
done
echo DAT=${DAT##*/}
pwd
nast707t2 ${DAT##*/} dmp=${NDMP} scr=yes bat=no hosts=$HOST >> ${LOG} 2>&1
wipe -v ${LSB_HOST_FILE} >> ${LOG} 2>&1
#
# Bring back files:
DATL=${DAT##*/}
rcp ${DATL%.dat}.log ${LSB_SUB_HOST}:${DAT%/*}
rcp ${DATL%.dat}.f04 ${LSB_SUB_HOST}:${DAT%/*}
rcp ${DATL%.dat}.
C H A P T E R 16 Using Platform LSF with the Etnus TotalView® Debugger Contents ◆ ◆ ◆ “How LSF Works with TotalView” on page 222 “Running Jobs for TotalView Debugging” on page 224 “Controlling and Monitoring Jobs Being Debugged in TotalView” on page 227 Using Platform LSF HPC 221
How LSF Works with TotalView Platform LSF is integrated with Etnus TotalView® multiprocess debugger. You should already be familiar with using TotalView software and debugging parallel applications. Debugging LSF jobs with TotalView Etnus TotalView is a source-level and machine-level debugger for analyzing, debugging, and tuning multiprocessor or multithreaded programs.
Setting TotalView preferences Before running and debugging jobs with TotalView, you should set the following options in your $HOME/.preferences.
Running Jobs for TotalView Debugging
Submit jobs in one of two ways:
◆ Start a job and TotalView together through LSF
◆ Start TotalView and attach the LSF job
You must set the path to the TotalView binary in the $PATH environment variable on the submission host, and the $DISPLAY environment variable to console_name:0.0.
Compiling your program for debugging
Before submitting your job in LSF for debugging in TotalView, compile your source code with the -g compiler option.
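As a sketch, you might compile with debugging symbols and then submit the job through a TotalView-enabled queue; the compiler, application name, queue name hpc_linux_tv, and the lammpi esub shown below are placeholders that depend on how your cluster and MPI integration are configured:
% mpicc -g -o myapp myapp.c
% bsub -a lammpi -n 2 -q hpc_linux_tv mpirun.lsf ./myapp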
Depending on your TotalView preferences, you may see the Stop Before Going Parallel dialog. Click Yes. Use the Parallel page on the File > Preferences dialog to change the setting of When a job goes parallel or calls exec() radio buttons. The process starts running and stops at the first breakpoint you set. For MPICH-GM jobs, TotalView stops at two breakpoints: one in pam, and one in MPI_init(). Click Go to continue debugging. 3 Debug your job as you would normally in TotalView.
When you are finished debugging your program, choose File > Exit to exit TotalView, and click Yes in the Exit dialog. As TotalView exits it kills the pam process. In a few moments, LSF detects that PAM has exited and your job exits as Done successfully. Viewing source code while debugging Use View > Lookup Function to view the source code of your application while debugging. Enter main in the Name field and click OK.
Controlling and Monitoring Jobs Being Debugged in TotalView
Controlling jobs
While your job is running and you are using TotalView to debug it, you cannot use LSF job control commands:
◆ bchkpnt and bmig are not supported
◆ Default TotalView signal processing prevents bstop and bresume from suspending and resuming jobs, and bkill from terminating jobs
◆ brequeue causes TotalView to display all jobs in error status. Click Go and the jobs will rerun. Job rerun within TotalView is not supported.
% bjobs -l 341
Job <341>, User , Project , Status , Queue , Command
Wed Oct 16 09:59:42: Submitted from host , CWD , Execution Home , Execution CWD ;
Wed Oct 16 10:01:19: Done successfully. The CPU time used is 97.0 seconds.
C H A P T E R 17 pam Command Reference Contents ◆ ◆ ◆ ◆ ◆ “SYNOPSIS” on page 230 “DESCRIPTION” on page 230 “OPTIONS” on page 231 “EXIT STATUS” on page 233 “SEE ALSO” on page 233 Using Platform LSF HPC 229
pam
Parallel Application Manager - job starter for MPI applications
SYNOPSIS
HP-UX vendor MPI syntax
bsub pam -mpi mpirun [mpirun_options] mpi_app [argument ...]
SGI vendor MPI syntax
bsub pam [-n num_tasks] -mpi -auto_place mpi_app [argument ...]
Generic PJL framework syntax
bsub pam [-t] [-v] [-n num_tasks] -g [num_args] pjl_wrapper [pjl_options] mpi_app [argument ...]
pam [-h] [-V]
DESCRIPTION
The Parallel Application Manager (PAM) is the point of control for Platform LSF.
❖ TS starts the tasks on each execution host, reports the process ID to PAM, and waits for the task to finish.
OPTIONS
OPTIONS FOR VENDOR MPI JOBS
-auto_place
The -auto_place option on the pam command line tells the SGI IRIX mpirun library to launch the MPI application according to the resources allocated by LSF.
-mpi
In the SGI environment, the -mpi option on the bsub and pam command line is equivalent to the mpirun command.
OPTIONS FOR LSF HPC GENERIC PJL JOBS
-t
This option tells pam not to print out the MPI job tasks summary report to the standard output. By default, the summary report prints out the task ID, the host on which it was executed, the command that was executed, the exit status, and the termination time.
-v
Verbose mode. Displays the name of the execution host or hosts.
-g [num_args] pjl_wrapper [pjl_options]
The -g option is required to use the LSF generic PJL framework.
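As a sketch, a direct pam invocation through the generic PJL framework might look like the following, where my_pjl_wrapper and myapp are placeholders and num_args is 1 because the wrapper takes no options; in practice the mpirun.lsf script normally builds this pam command line for you:
% bsub -n 4 pam -t -v -g 1 my_pjl_wrapper ./myapp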
EXIT STATUS pam exits with the exit status of mpirun or the PJL wrapper.