Running Jobs with Platform LSF® Version 6.2 September 2005 Comments to: doc@platform.
Copyright © 1994-2005 Platform Computing Corporation. All rights reserved.

We’d like to hear from you

You can help us make this document better by telling us what you think of the content, organization, and usefulness of the information. If you find an error, or just want to make a suggestion for improving this document, please address your comments to doc@platform.com. Your comments should pertain only to Platform documentation. For product support, contact support@platform.com.
Contents

Welcome
About This Guide
Learn About Platform Products
Get Technical Support
About Platform LSF
Job Life Cycle
Working with Jobs
Viewing Jobs in Job Groups
Viewing Information about Resource Allocation Limits
Index
Welcome

Contents
◆ “About This Guide”
◆ “Learn About Platform Products”
◆ “Get Technical Support”

About Platform Computing
Platform Computing is the largest independent grid software developer, delivering intelligent, practical enterprise grid software and services that allow organizations to plan, build, run and manage grids by optimizing IT resources.
About This Guide

This guide introduces the basic concepts of the Platform LSF® software (“LSF”) and describes how to use your LSF installation to run and monitor jobs.

September 30 2005
Latest version: www.platform.com/Support/Documentation.htm

Who should use this guide
This guide is intended for LSF users and administrators who want to understand the fundamentals of Platform LSF operation and use. This guide assumes that you have access to one or more Platform LSF products at your site.
Typographical conventions

Typeface: Courier
  Meaning: The names of on-screen computer output, commands, files, and directories
  Example: The lsid command

Typeface: Bold Courier
  Meaning: What you type, exactly as shown
  Example: Type cd /bin

Typeface: Italics
  Meaning:
  ◆ Book titles, new words or terms, or words to be emphasized
  ◆ Command-line placeholders: replace with a real name or value
  Example: The queue specified by queue_name

Typeface: Bold Sans Serif
  Meaning:
  ◆ Names of GUI elements that you manipulate
  Example: Click OK

Command notation

Notation  Meaning  Example
Quotes "
Learn About Platform Products

World Wide Web and FTP
The latest information about all supported releases of Platform LSF is available on the Platform Web site at www.platform.com. Look in the Online Support area for current README files, Release Notes, Upgrade Notices, Frequently Asked Questions (FAQs), Troubleshooting, and other helpful information. The Platform FTP site (ftp.platform.
Get Technical Support

Contact Platform
Contact Platform Computing or your LSF vendor for technical support. Use one of the following to contact Platform technical support:

Email: support@platform.com
World Wide Web: www.platform.com
Mail: Platform Support
      Platform Computing Corporation
      3760 14th Avenue
      Markham, Ontario
      Canada L3R 3T7

When contacting Platform, please include the full name of your company. See the Platform Web site at www.platform.com/contactus for other contact information.
Chapter 1
About Platform LSF

Contents
◆ “Cluster Concepts”
◆ “Job Life Cycle”
Cluster Concepts

[Figure: cluster diagram with the labels Submission Host, Master Host, Compute Host, and Commands]

Clusters, jobs, and queues

Cluster
A group of computers (hosts) running LSF that work together as a single unit, combining computing power and sharing workload and resources. A cluster provides a single-system image for disparate computing resources. Hosts can be grouped into clusters in a number of ways.
Job slot
A job slot is a bucket into which a single unit of work is assigned in the LSF system. Hosts are configured to have a number of job slots available and queues dispatch jobs to fill job slots.

Commands
◆ bhosts: View job slot limits for hosts and host groups
◆ bqueues: View job slot limits for queues
◆ busers: View job slot limits for users and user groups

Configuration
◆ Define job slot limits in lsb.resources.
Hosts

Host
An individual computer in the cluster. Each host may have more than one processor. Multiprocessor hosts are used to run parallel jobs. A multiprocessor host with a single process queue is considered a single machine, while a box full of processors that each have their own process queue is treated as a group of separate machines.
Master host
Where the master LIM and mbatchd run. An LSF server host that acts as the overall coordinator for that cluster. Each cluster has one master host to do all job scheduling and dispatch. If the master host goes down, another LSF server in the cluster becomes the master host. All LSF daemons run on the master host. The LIM on the master host is the master LIM.
◆ badmin hrestart: Restarts sbatchd

Configuration
◆ Port number defined in lsf.conf

res
Remote Execution Server (RES) running on each server host. Accepts remote execution requests to provide transparent and secure remote execution of jobs and tasks.

Commands
◆ lsadmin resstartup: Starts res
◆ lsadmin resshutdown: Shuts down res
◆ lsadmin resrestart: Restarts res

Configuration
◆ Port number defined in lsf.conf

lim
Load Information Manager (LIM) running on each server host.
Master LIM
The LIM running on the master host. Receives load information from the LIMs running on hosts in the cluster. Forwards load information to mbatchd, which forwards this information to mbschd to support scheduling decisions. If the master LIM becomes unavailable, a LIM on another host automatically takes over.
The bsub command stops display of output from the shell until the job completes, and no mail is sent to you by default. Use Ctrl-C at any time to terminate the job.

Commands
◆ bsub -I: Submit an interactive job

Interactive task
A command that is not submitted to a batch queue and scheduled by LSF, but is dispatched immediately. LSF locates the resources needed by the task and chooses the best host among the candidate hosts that has the required resources and is lightly loaded.
All computers that run the same operating system on the same computer architecture are of the same type; in other words, they are binary-compatible with each other. Each host type usually requires a different set of LSF binary files.

Commands
◆ lsinfo -t: View all host types defined in lsf.shared

Configuration
◆ Defined in lsf.shared
◆ Mapped to hosts in lsf.cluster.cluster_name

Host model
The combination of host type and CPU speed (CPU factor) of the computer.
Resources

Resource usage
The LSF system uses built-in and configured resources to track resource availability and usage. Jobs are scheduled according to the resources available on individual hosts. LSF monitors the resources that submitted jobs use while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling.
To schedule a job on a host, the load levels on that host must satisfy both the thresholds configured for that host and the thresholds for the queue from which the job is being dispatched. The value of a load index may either increase or decrease with load, depending on the meaning of the specific load index. Therefore, when comparing the host load conditions with the threshold values, you need to use either greater than (>) or less than (<), depending on the load index.
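The comparison rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LSF code: the index names (`r1m`, `mem`), the direction table, every threshold value, and the decision to treat a value exactly equal to its threshold as acceptable are all assumptions made for the example.

```python
# Illustrative sketch only: index names, directions, and values are assumptions.

# True means the index value grows as the host gets busier (for example, a
# run-queue length); False means it shrinks as the host gets busier (for
# example, available memory).
INCREASES_WITH_LOAD = {"r1m": True, "mem": False}


def within_threshold(index, value, threshold):
    """Return True if one load index is on the acceptable side of its threshold."""
    if INCREASES_WITH_LOAD[index]:
        return value <= threshold  # must stay at or below the threshold
    return value >= threshold      # must stay at or above the threshold


def host_is_eligible(load, host_thresholds, queue_thresholds):
    """A host qualifies only if it satisfies both host and queue thresholds."""
    for thresholds in (host_thresholds, queue_thresholds):
        for index, threshold in thresholds.items():
            if not within_threshold(index, load[index], threshold):
                return False
    return True


load = {"r1m": 0.8, "mem": 512}
print(host_is_eligible(load, {"r1m": 1.0}, {"mem": 256}))  # True
print(host_is_eligible(load, {"r1m": 0.5}, {"mem": 256}))  # False
```

Keeping the direction of each index in one table makes the choice between "greater than" and "less than" explicit instead of scattering it through the comparison logic.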
Job Life Cycle

[Figure: job life cycle, steps 1 to 6, with the labels Submit job (bsub), Queue, Job PEND, dispatch job, Job RUN, job report (output, errors, info), email job report, Submission Host, Master Host, and Compute Host]

1 Submit a job
You submit a job from an LSF client or server with the bsub command. If you do not specify a queue when submitting the job, the job is submitted to the default queue. Jobs are held in a queue waiting to be scheduled and have the PEND state.
4 Run job
sbatchd handles job execution.
Chapter 2
Working with Jobs

Contents
◆ “Submitting Jobs (bsub)”
◆ “Modifying a Submitted Job (bmod)”
  ❖ “Modifying Pending Jobs (bmod)”
  ❖ “Modifying Running Jobs”
◆ “Controlling Jobs”
  ❖ “Killing Jobs (bkill)”
  ❖ “Suspending and Resuming Jobs (bstop and bresume)”
  ❖ “Changing Job Order Within Queues (bbot and btop)”
  ❖ “Controlling Jobs in Job Groups”
◆ “Submitting a Job to Specific Hosts”
Submitting Jobs (bsub)

In this section
◆ “bsub command”
◆ “Submitting a job to a specific queue (bsub -q)”
◆ “Submitting a job associated to a project (bsub -P)”
◆ “Submitting a job associated to a user group (bsub -G)”
◆ “Submitting a job with a job name (bsub -J)”
◆ “Submitting a job to a service class (bsub -sla)”
◆ “Submitting a job under a job group (bsub -g)”

bsub command
You submit a job with the bs
Viewing available queues
To see available queues, use the bqueues command.

Use bqueues -u user_name to specify a user or user group so that bqueues displays only the queues that accept jobs from these users.

The bqueues -m host_name option allows users to specify a host name or host group name so that bqueues displays only the queues that use these hosts to run jobs.

You can submit jobs to a queue as long as its STATUS is Open.
Submitting a job associated to a user group (bsub -G)
You can use the bsub -G user_group option to submit a job and associate it with a specified user group. This option is only useful with fairshare scheduling. For more details on fairshare scheduling, see Administering Platform LSF.

You can specify any user group to which you belong as long as it does not contain any subgroups. You must be a direct member of the specified user group.
Submitting a job under a job group (bsub -g)
Use bsub -g to submit a job into a job group. The job group does not have to exist before submitting the job. For example:
% bsub -g /risk_group/portfolio1/current myjob
Job <105> is submitted to default queue.
Submits myjob to the job group /risk_group/portfolio1/current.

If group /risk_group/portfolio1/current exists, job 105 is attached to the job group.
Modifying a Submitted Job (bmod)

In this section
◆ “Modifying Pending Jobs (bmod)”
◆ “Modifying Running Jobs”
◆ “Controlling Jobs”
Modifying Pending Jobs (bmod)

If your submitted jobs are pending (bjobs shows the job in PEND state), use the bmod command to modify job submission parameters. You can also modify entire job arrays or individual elements of a job array. See the bmod command in the Platform LSF Reference for more details.

Replacing the job command-line
To replace the job command line, use the bmod -Z "new_command" option.
Modifying a job submitted to a job group
Use the -g option of bmod and specify a job group path to move a job or a job array from one job group to another. For example:
% bmod -g /risk_group/portfolio2/monthly 105
moves job 105 to job group /risk_group/portfolio2/monthly.

Like bsub -g, if the job group does not exist, LSF creates it. bmod -g cannot be combined with other bmod options. It can operate on finished, running, and pending jobs.
Modifying Running Jobs

Modifying resource reservation
A job is usually submitted with a resource reservation for the maximum amount required. Use bmod -R to modify the resource reservation for a running job. This command is usually used to decrease the reservation, allowing other jobs access to the resource.
Controlling Jobs

LSF controls jobs dispatched to a host to enforce scheduling policies, or in response to user requests. The LSF system performs the following actions on a job:
◆ Suspend by sending a SIGSTOP signal
◆ Resume by sending a SIGCONT signal
◆ Terminate by sending a SIGKILL signal

On Windows, equivalent functions have been implemented to perform the same tasks.
Killing Jobs (bkill)

The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs. Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd, which waits for the job to exit before reporting the status.
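Because SIGINT and SIGTERM arrive before SIGKILL, a job has a window in which it can clean up. The following Python sketch shows one way a job script might use that window on UNIX. The handler name, the `cleaned_up` list, and the cleanup action are hypothetical, and the signal is raised locally here only to simulate what a job would receive.

```python
import signal

cleaned_up = []  # records which signals triggered cleanup (illustration only)


def handle_termination(signum, frame):
    """Hypothetical cleanup hook, run when SIGINT or SIGTERM is delivered."""
    cleaned_up.append(signal.Signals(signum).name)
    # A real job would remove temporary files, flush output, and then exit.


# SIGKILL cannot be caught, so cleanup has to happen on the earlier signals.
signal.signal(signal.SIGINT, handle_termination)
signal.signal(signal.SIGTERM, handle_termination)

# Simulate delivery of SIGTERM to this process (requires Python 3.8 or later).
signal.raise_signal(signal.SIGTERM)
print(cleaned_up)
```

A job that ignores both signals gets no further chance: once SIGKILL is delivered, the process ends without running any handler.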
Suspending and Resuming Jobs (bstop and bresume)

The bstop and bresume commands allow you to suspend or resume a job.

A job can also be suspended by its owner or the LSF administrator with the bstop command. These jobs are considered user-suspended and are displayed by bjobs as USUSP.

When the user restarts the job with the bresume command, the job is not started immediately to prevent overloading.
Resuming a job

bresume command
To resume a job, use the bresume command.

Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
Changing Job Order Within Queues (bbot and btop)

By default, LSF dispatches jobs in a queue in the order of arrival (that is, first-come-first-served), subject to availability of suitable server hosts.

Use the btop and bbot commands to change the position of pending jobs, or of pending job array elements, to affect the order in which jobs are considered for dispatch.
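The effect of btop and bbot on dispatch order can be pictured with a plain Python list standing in for the pending jobs in one queue. This is a conceptual sketch only: the function names and job IDs are hypothetical, and any additional rules LSF applies when reordering (for example, whose jobs a user is allowed to move past) are not modeled.

```python
def move_to_top(pending, job_id):
    """Return a new order in which job_id is considered first (like btop)."""
    if job_id not in pending:
        raise ValueError(f"job {job_id} is not pending")
    return [job_id] + [j for j in pending if j != job_id]


def move_to_bottom(pending, job_id):
    """Return a new order in which job_id is considered last (like bbot)."""
    if job_id not in pending:
        raise ValueError(f"job {job_id} is not pending")
    return [j for j in pending if j != job_id] + [job_id]


# First-come-first-served order: the earliest submission is at index 0.
pending = [5308, 5309, 5310, 5311]

print(move_to_top(pending, 5311))     # [5311, 5308, 5309, 5310]
print(move_to_bottom(pending, 5308))  # [5309, 5310, 5311, 5308]
```

Both helpers return a new list and leave the relative order of the other jobs unchanged, which mirrors the idea that only the selected job changes position.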
Controlling Jobs in Job Groups

Stopping (bstop)
Use the -g option of bstop and specify a job group path to suspend jobs in a job group:
% bstop -g /risk_group 106
Job <106> is being stopped

Use job ID 0 (zero) to suspend all jobs in a job group:
% bstop -g /risk_group/consolidate 0
Job <107> is being stopped
Job <108> is being stopped
Job <109> is being stopped

Resuming (bresume)
Use the -g option of bresume and specify a job group path to resume suspended jobs in a job group:
% bkill -g /risk_group/consolidate 0
Job <116> is being terminated

Deleting (bgdel)
Use the bgdel command to remove a job group. The job group cannot contain any jobs. For example:
% bgdel /risk_group
Job group /risk_group is deleted.
This command deletes the job group /risk_group and all its subgroups.

For more information
See Administering Platform LSF for more information about using job groups.
Submitting a Job to Specific Hosts

To indicate that a job must run on one of the specified hosts, use the bsub -m "hostA hostB ..." option. By specifying a single host, you can force your job to wait until that host is available and then run on that host. For example:
% bsub -q idle -m "hostA hostD hostB" myjob
This command submits myjob to the idle queue and tells LSF to choose one host from hostA, hostD and hostB to run the job.
Submitting a Job and Indicating Host Preference

When several hosts can satisfy the resource requirements of a job, the hosts are ordered by load. However, in certain situations it may be desirable to override this behavior to give preference to specific hosts, even if they are more heavily loaded.
Submitting a job with different levels of host preference
You can indicate different levels of preference by specifying a number after the plus sign (+). The larger the number, the higher the preference for that host or host group. You can also specify the + with the keyword others. For example:
% bsub -m "groupA+2 groupB+1 groupC" myjob
In this example, LSF gives first preference to hosts in groupA, second preference to hosts in groupB and last preference to those in groupC.
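The ordering in this example can be reproduced with a short Python sketch that parses a preference string of the form passed to bsub -m. This is a simplified illustration, not how LSF implements the option: the function name is hypothetical, and treating a bare + as level 1 and a name without + as level 0 are assumptions of the sketch.

```python
def parse_host_preferences(spec):
    """Parse a string such as "groupA+2 groupB+1 groupC".

    Returns (name, level) pairs ordered from most to least preferred.
    Assumptions of this sketch: a bare "+" means level 1, no "+" means level 0.
    """
    prefs = []
    for token in spec.split():
        name, plus, level = token.partition("+")
        if not plus:
            pref = 0            # no preference given
        elif level == "":
            pref = 1            # "+" with no number
        else:
            pref = int(level)   # explicit preference level
        prefs.append((name, pref))
    # sorted() is stable, so entries with equal levels keep their written order.
    return sorted(prefs, key=lambda item: -item[1])


print(parse_host_preferences("groupA+2 groupB+1 groupC"))
# [('groupA', 2), ('groupB', 1), ('groupC', 0)]
```

The sketch only ranks the names by preference level; load on the individual hosts, which LSF also takes into account, is left out.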
Using LSF with Non-Shared File Space

LSF is usually used in networks with shared file space. When shared file space is not available, use the bsub -f command to have LSF copy needed files to the execution host before running the job, and copy result files back to the submission host after the job completes.

LSF attempts to run the job in the directory where the bsub command was invoked.
If the submission and execution hosts have different directory structures, you must ensure that the directory where remote_file and local_file will be placed exists. LSF tries to change the directory to the same path name as the directory where the bsub command was run. If this directory does not exist, the job is run in your home directory on the execution host.
Reserving Resources for Jobs

About resource reservation
When a job is dispatched, the system assumes that the resources that the job consumes will be reflected in the load information. However, many jobs do not consume the resources they require when they first start. Instead, they will typically use the resources over a period of time.

For example, a job requiring 100 MB of swap is dispatched to a host having 150 MB of available swap.
Submitting a Job with Start or Termination Times

By default, LSF dispatches jobs as soon as possible, and then allows them to finish, although resource limits might terminate the job before it finishes. You can specify a time of day at which to start or terminate a job.

Submitting a job with a start time
If you do not want to start your job immediately when you submit it, use bsub -b to specify a start time. LSF will not dispatch the job before this time.
Chapter 3
Viewing Information About Jobs

Use the bjobs and bhist commands to view information about jobs:
◆ bjobs reports the status of jobs, and its options allow you to display specific information.
◆ bhist reports the history of one or more jobs in the system.

You can also find jobs on specific queues or hosts, find jobs submitted by specific projects, and check the status of specific jobs using their job IDs or names.
Viewing Job Information (bjobs)

The bjobs command has options to display the status of jobs in the LSF system. For more details on these or other bjobs options, see the bjobs command in the Platform LSF Reference.

Unfinished current jobs
The bjobs command reports the status of LSF jobs. When no options are specified, bjobs displays information about jobs in the PEND, RUN, USUSP, PSUSP, and SSUSP states for the current user.
Viewing Job Pend and Suspend Reasons (bjobs -p)

When you submit a job, it may be held in the queue before it starts running and it may be suspended while running. You can find out why jobs are pending or in suspension with the bjobs -p option.

You can combine bjobs options to tailor the output. For more details on these or other bjobs options, see the bjobs command in the Platform LSF Reference.
Viewing suspend reasons only
The -s option of bjobs displays reasons for suspended jobs only.
Viewing Detailed Job Information (bjobs -l)

The -l option of bjobs displays detailed information about job status and parameters, such as the job’s current working directory, parameters specified when the job was submitted, and the time when the job started running. For more details on bjobs options, see the bjobs command in the Platform LSF Reference.
Viewing Job Resource Usage (bjobs -l)

LSF monitors the resources jobs consume while they are running. The -l option of the bjobs command displays the current resource usage of the job. For more details on bjobs options, see the bjobs command in the Platform LSF Reference.
Viewing Job History (bhist)

Sometimes you want to know what has happened to your job since it was submitted. The bhist command displays a summary of the pending, suspended and running time of jobs for the user who invoked the command. Use bhist -u all to display a summary for all users in the cluster. For more details on bhist options, see the bhist command in the Platform LSF Reference.
Viewing history of jobs not listed in active event log
LSF periodically backs up and prunes the job history log. By default, bhist only displays job history from the current event log file. You can use bhist -n num_logfiles to display the history for jobs that completed some time ago and are no longer listed in the active event log.
Viewing Job Output (bpeek)

The output from a job is normally not available until the job is finished. However, LSF provides the bpeek command for you to look at the output the job has produced so far.

By default, bpeek shows the output from the most recently submitted job. You can also select the job by queue or execution host, or specify the job ID or job name on the command line.
Viewing Information about SLAs and Service Classes

Monitoring the progress of an SLA (bsla)
Use bsla to display the properties of service classes configured in lsb.serviceclasses and dynamic state information for each service class.

Examples
◆ One velocity goal of service class Tofino is active and on time. The other configured velocity goal is inactive.
% bsla Kyuquot
SERVICE CLASS NAME: Kyuquot
 -- Daytime/Nighttime SLA
PRIORITY: 23
USER_GROUP: user1 user2

GOAL: VELOCITY
ACTIVE WINDOW: (9:00-17:30)
STATUS: Active:On time
VELOCITY: 8
CURRENT VELOCITY: 0

GOAL: DEADLINE
ACTIVE WINDOW: (17:30-9:00)
STATUS: Inactive

NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
0      0     0    0      0      0

◆ The throughput goal of service class Inuvik is always active.
Tracking historical behavior of an SLA (bacct)
Use bacct to display historical performance of a service class. For example, service classes Inuvik and Tuktoyaktuk configure throughput goals.
% bsla
SERVICE CLASS NAME: Inuvik
 -- throughput 6
PRIORITY: 20

GOAL: THROUGHPUT
ACTIVE WINDOW: Always Open
STATUS: Active:On time
SLA THROUGHPUT: 10.
Maximum wait time in queue:18912.0
Minimum wait time in queue: 7.0
Average turnaround time: 12268 (seconds/job)
Maximum turnaround time: 22079
Minimum turnaround time: 1713
Average hog factor of a job: 0.00 ( cpu time / turnaround time )
Maximum hog factor of a job: 0.00
Minimum hog factor of a job: 0.00
Total throughput: 8.94 (jobs/hour) during 20.
Viewing Jobs in Job Groups

Viewing job group information (bjgroup)
Use the bjgroup command to see information about jobs in specific job groups.
% bjgroup
GROUP_NAME  NJOBS  PEND  RUN  SSUSP  USUSP  FINISH
/fund1_grp  5      4     0    1      0      0
/fund2_grp  11     2     5    0      0      4
/bond_grp   2      2     0    0      0      0
/risk_grp   2      1     1    0      0      0
/admi_grp   4      4     0    0      0      0

Viewing jobs by job group (bjobs)
Use the -g option of bjobs and specify a job group path to view jobs attached to the specified group.
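If you capture a bjgroup listing like the one shown earlier in this section, a small script can turn it into a dictionary for further checks. The Python sketch below is an illustration under stated assumptions: it expects the whitespace-separated column layout of the sample, the function name is hypothetical, and the sample text is pasted in rather than read from a live command.

```python
# Sample listing copied from this section; a live listing may differ.
SAMPLE = """\
GROUP_NAME NJOBS PEND RUN SSUSP USUSP FINISH
/fund1_grp 5 4 0 1 0 0
/fund2_grp 11 2 5 0 0 4
/bond_grp 2 2 0 0 0 0
/risk_grp 2 1 1 0 0 0
/admi_grp 4 4 0 0 0 0
"""


def parse_bjgroup(text):
    """Map each job group name to a dict of its integer job counters."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    groups = {}
    for row in rows:
        # header[0] is GROUP_NAME; the remaining columns are counters.
        groups[row[0]] = dict(zip(header[1:], (int(v) for v in row[1:])))
    return groups


groups = parse_bjgroup(SAMPLE)
print(groups["/fund2_grp"]["RUN"])  # 5

# In this sample, NJOBS equals the sum of the per-state columns in every row.
for name, counts in groups.items():
    states = sum(v for key, v in counts.items() if key != "NJOBS")
    print(name, counts["NJOBS"] == states)
```

The final loop is a sanity check on the sample data only; it is not a statement about how LSF computes NJOBS in general.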
Viewing Information about Resource Allocation Limits

Your job may be pending because some configured resource allocation limit has been reached. Use the blimits command to show the dynamic counters of resource allocation limits configured in Limit sections in lsb.resources. blimits displays the current resource usage to show what limits may be blocking your job.
Examples
For the following limit definitions:
Begin Limit
NAME = limit1
USERS = user1
PER_QUEUE = all
PER_HOST = hostA hostC
TMP = 30%
SWP = 50%
MEM = 10%
End Limit

Begin Limit
NAME = limit_ext1
PER_HOST = all
RESOURCE = ([user1_num, 30] [hc_num, 20])
End Limit

blimits displays the following:
% blimits
INTERNAL RESOURCE LIMITS:
NAME limit1 limit1 limit1 USERS user1 user1 user1 QUEUES q2 q3 q4 HOSTS hostA hostA hostC PROJECTS - SLOTS - MEM 10/25 -