Platform LSF® Administrator’s Primer Version 6.2 December 2005 Comments to: doc@platform.
Copyright © 1994-2005 Platform Computing Corporation
All rights reserved.

We’d like to hear from you
You can help us make this document better by telling us what you think of the content, organization, and usefulness of the information. If you find an error, or just want to make a suggestion for improving this document, please address your comments to doc@platform.com. Your comments should pertain only to Platform documentation. For product support, contact support@platform.com.
Contents
Welcome
    About This Guide
    Learn About Platform LSF
    Get Technical Support
Working with LSF
Welcome

Contents
◆ “About This Guide” on page 6
◆ “Learn About Platform LSF” on page 8
◆ “Get Technical Support” on page 9

About Platform Computing
Platform Computing is the largest independent grid software developer, delivering intelligent, practical enterprise grid software and services that allow organizations to plan, build, run and manage grids by optimizing IT resources.
About This Guide

Last update: December 20 2005
Latest version: www.platform.com/Support/Documentation.htm

Purpose of this guide
This guide is your starting point for learning how to manage and use your new cluster running the Platform LSF® software (“LSF”). It provides an overview of LSF concepts, basic commands to test your new cluster, how to run applications through LSF, LSF licensing, and some troubleshooting tips.
Notation and meaning
◆ Ellipsis (…): The argument before the ellipsis can be repeated. Do not enter the ellipsis.
◆ Replaceable argument: The argument must be replaced with a real value you provide.
◆ Vertical bar: You must enter one of the items separated by the bar. You cannot enter more than one item. Do not enter the bar.
◆ Literal text: Must be entered exactly as shown.
◆ Brackets: The argument within the brackets is optional. Do not enter the brackets.
Learn About Platform LSF

World Wide Web and FTP
The latest information about all supported releases of Platform LSF is available on the Platform Web site at www.platform.com. Look in the Online Support area for current README files, Release Notes, Upgrade Notices, Frequently Asked Questions (FAQs), Troubleshooting, and other helpful information. The Platform FTP site (ftp.platform.
Get Technical Support

Contact Platform
Contact Platform Computing or your LSF vendor for technical support. Use one of the following to contact Platform technical support:

Email            support@platform.com
World Wide Web   www.platform.com
Mail             Platform Support
                 Platform Computing Corporation
                 3760 14th Avenue
                 Markham, Ontario
                 Canada L3R 3T7

When contacting Platform, please include the full name of your company.
See the Platform Web site at www.platform.com/contactus for other contact information.
Chapter 1
About Your Cluster

Contents
◆ “Cluster Characteristics” on page 12
◆ “LSF Directories and Configuration Files” on page 13
Cluster Characteristics

Contents
◆ “Cluster name and administrators” on page 12
◆ “LSF hosts” on page 12

Cluster name and administrators
The cluster name you specified at installation is part of the name of LSF_CONFDIR/lsf.cluster.cluster_name. For example:
    /usr/share/lsf/lsf_62/conf/lsf.cluster.lsf_62
Cluster administrators are listed in the ClusterAdmins section of LSF_CONFDIR/lsf.cluster.cluster_name.
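The file naming rule can be illustrated with a small shell sketch that recovers the cluster name from the path of an lsf.cluster file. The function name cluster_name_from_file is a hypothetical helper introduced here for illustration; it is not an LSF command.

```shell
#!/bin/sh
# Print the cluster name encoded in an lsf.cluster.<cluster_name> path.
# cluster_name_from_file is a hypothetical helper, not part of LSF.
cluster_name_from_file() {
    file=$(basename "$1")
    case "$file" in
        lsf.cluster.?*)
            # Strip the fixed prefix; what remains is the cluster name.
            printf '%s\n' "${file#lsf.cluster.}"
            ;;
        *)
            echo "not an lsf.cluster file: $1" >&2
            return 1
            ;;
    esac
}

# Example path taken from this guide; prints lsf_62.
cluster_name_from_file /usr/share/lsf/lsf_62/conf/lsf.cluster.lsf_62
```

A path that does not follow the lsf.cluster.cluster_name pattern makes the function return a non-zero status, so it can be used in a shell conditional.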
LSF Directories and Configuration Files

Contents
◆ “Four important LSF configuration files” on page 13
◆ “Default directory structure” on page 13
◆ “LSF directories” on page 14
◆ “LSF cluster configuration files” on page 15
◆ “LSF Batch configuration files” on page 15
◆ “Daemon log files” on page 16

Four important LSF configuration files
LSF configuration is administered through several configuration files, which you use to modify the behavior of your cluster.
[Figure: default directory structure under LSF_TOP, showing the conf, work, log, and version directories, with subdirectories including man, misc, bin, etc, install, and lib, and files such as license.dat, lsf.cluster.cluster_name, lsf.conf, lsf.shared, lsf.task, profile.lsf, cshrc.lsf, lsb.hosts, lsb.params, and lsb.queues.]
Directory       Description                                  Example
LSF_MANDIR      LSF man pages                                /usr/share/lsf/lsf_62/6.2/man/
LSF_MISC        Examples and other miscellaneous files       /usr/share/lsf/lsf_62/6.2/misc/
LSF_SERVERDIR   Server daemon binaries, scripts and other    /usr/share/lsf/lsf_62/6.2/sparc-sol2/etc/
                utilities, shared by all hosts of the
                same type
LSF_TOP         Top-level installation directory             /usr/share/lsf/lsf_62/

Other configuration directories are specified in LSF_CONFDIR/lsf.conf.
LSF Directories and Configuration Files File Example Server hosts and their attributes, such as scheduling load thresholds, dispatch windows, and job slot limits. If no hosts are defined in this file, then all LSF server hosts listed in LSF_CONFDIR/lsf.cluster.cluster_name are assumed to be LSF Batch server hosts. LSF scheduler and resource broker plugin modules. If no scheduler or resource broker modules are configured, LSF uses the default scheduler plugin module named schmod_default.
Chapter 2
Working with LSF

Contents
◆ “Starting, Stopping, and Reconfiguring LSF” on page 18
◆ “Checking LSF Status” on page 21
◆ “Running LSF Jobs” on page 26
Starting, Stopping, and Reconfiguring LSF

Contents
◆ “Two LSF administration commands (lsadmin and badmin)” on page 18
◆ “Setting up the LSF environment (cshrc.lsf and profile.lsf)” on page 18
◆ “Starting your cluster” on page 18
◆ “Stopping your cluster” on page 19
◆ “Reconfiguring your cluster” on page 19

Two LSF administration commands (lsadmin and badmin)
Only LSF administrators or root can run these commands.
Your user account must be able to read the system kernel information, such as /dev/kmem.

lsadmin and badmin
Use lsadmin and badmin to start the LSF daemons.
1 Log on as root to each LSF server host. If you installed a single-user cluster as a non-root user, log on as primary LSF administrator. Start with the LSF master host, and repeat these steps on all LSF hosts.
This command also reads the LSF_LOGDIR/lsb.events file, so it can take some time to complete if a lot of jobs are running.
See Administering Platform LSF for information about which command to run after modifying LSF configuration files.
Checking LSF Status

Contents
◆ “Example command output” on page 21
◆ “Checking cluster configuration (lsadmin)” on page 21
◆ “Finding out cluster status (lsid and lsload)” on page 22
◆ “Checking LSF Batch configuration (badmin)” on page 23
◆ “Finding out LSF Batch system status (bhosts and bqueues)” on page 23

Example command output
The LSF commands shown in this section show examples of typical output. The output you see will differ according to your configuration.
Finding out cluster status (lsid and lsload)

lsid
Tells you if your LSF environment is set up properly. lsid displays the current LSF version number, cluster name, and host name of the current LSF master host for your cluster. The LSF master name displayed by lsid may vary, but it is usually the first host configured in the Hosts section of LSF_CONFDIR/lsf.cluster.cluster_name.
% lsid
Platform LSF 6.
Other useful commands
◆ The lshosts command displays configuration information for LSF hosts and their static resource information.
◆ The lsinfo command displays cluster configuration information about resources, host types, and host models.

Checking LSF Batch configuration (badmin)

badmin ckconfig -v
The badmin command controls and monitors the operation of the LSF Batch system. Use the badmin ckconfig command to check the LSF Batch configuration files.
after starting or reconfiguring LSF, wait a few seconds and try bhosts again to give mbatchd time to initialize. If the problem persists, see “Top 10 LSF problems” on page 55 for help.

bqueues
LSF Batch queues organize jobs with different priorities and different scheduling policies. The bqueues command displays the status of available queues and their configuration parameters. For a queue to accept and dispatch jobs, the status should be Open:Active.
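The Open:Active check can be automated with a short filter. The following is a minimal sketch, assuming bqueues prints a header row that contains a STATUS column followed by one row per queue; closed_or_inactive_queues is a hypothetical helper name, not an LSF command.

```shell
#!/bin/sh
# Read bqueues-style output on stdin and print the name and status of every
# queue whose STATUS column is not Open:Active.
# closed_or_inactive_queues is a hypothetical helper, not part of LSF.
closed_or_inactive_queues() {
    awk '
        # Header row: remember which column is labelled STATUS.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "STATUS") col = i; next }
        # Data rows: report queues that cannot accept and dispatch jobs.
        col && NF >= col && $col != "Open:Active" { print $1, $col }
    '
}
```

On a host with the LSF environment set up, it could be used as `bqueues | closed_or_inactive_queues`; no output means every listed queue is Open:Active.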
◆ Hosts and users able to use the queue
◆ Scheduling threshold values:
  ❖ loadSched is the threshold for LSF to stop dispatching jobs automatically
  ❖ loadStop is the threshold for LSF to suspend a job automatically

Other useful commands
◆ The bparams command displays information about the LSF Batch configuration parameters.
◆ The bhist command displays historical information about jobs.
Running LSF Jobs

Contents
◆ “Commands for running LSF jobs (lsrun and bsub)” on page 26
◆ “Submitting batch jobs (bsub)” on page 26
◆ “Displaying job status (bjobs)” on page 27
◆ “Controlling job execution (bstop, bresume, bkill)” on page 27
◆ “Running interactive tasks (lsrun and lsgrun)” on page 28
◆ “Integrating your applications with LSF” on page 28

Commands for running LSF jobs (lsrun and bsub)
You use two basic commands to run jobs through LSF: bsub submits jobs to LSF Batch.
To submit a batch interactive job by using a pseudo-terminal, use the bsub -Ip option. To submit a batch interactive job and create a pseudo-terminal with shell mode support, use the bsub -Is option.

Displaying job status (bjobs)
The status of each LSF job is updated periodically, and you can use the job ID to monitor and manipulate the job status.

bjobs
The bjobs command displays the job ID and other job status.
Running interactive tasks (lsrun and lsgrun)

lsrun
The lsrun command runs a task on either the current local host or remotely on the best available host, provided it can find the necessary resources and the appropriate host type. For example, the following command runs the UNIX ls command.
#! /bin/sh
bsub -R "rusage[abc_license=1:duration=1]" abc_real

When users run abc, they are actually running a script to submit a job abc_real to LSF using 1 shared resource named abc_license.
For more information about specifying shared resources using the resource requirement (rusage) string on the -R option of bsub, see Chapter 5, “Using Shared Resources to Manage Software Licenses”.
Chapter 3
Managing Your Cluster

Contents
◆ “Managing Users, Hosts, and Queues” on page 32
◆ “Configuring LSF Startup” on page 38
Managing Users, Hosts, and Queues

Contents
◆ “Making your cluster available to users (cshrc.lsf and profile.lsf)” on page 32
◆ “Adding a host to your cluster” on page 32
◆ “Removing a host from your cluster” on page 35
◆ “Adding a queue” on page 36
◆ “Removing a queue” on page 36

Making your cluster available to users (cshrc.lsf and profile.lsf)
To set up the LSF environment for your users, use the following two shell files:
◆ LSF_CONFDIR/cshrc.
1 Install LSF binaries for a new host type
Use lsfinstall to add new host types to your cluster. You can skip these steps if you already have the executables.
  1 Log on as root to any host that can access the LSF install script directory.
  2 Change to the LSF install script directory. For example:
    # cd /usr/share/lsf/lsf_62/6.2/install
  3 Edit install.config to specify desired options for new host types. You do not need to specify LSF_LICENSE.
Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on ...... done
Restart LIM on ...... done
Restart LIM on ...... done

The lsadmin reconfig command checks for configuration errors. If no errors are found, you are asked to confirm that you want to restart lim on all hosts and lim is reconfigured. If fatal errors are found, reconfiguration is aborted.

5 Reconfigure mbatchd:
% badmin reconfig
Checking configuration files ...
Removing a host from your cluster

CAUTION
Never remove the master host from LSF. If you want to remove your current default master from LSF, change lsf.cluster.cluster_name to assign a different default master host. Then remove the host that was once the master host.

1 Log on to the host as root or the primary LSF administrator.
2 Run badmin hclose to close the host. This prevents jobs from being dispatched to the host and allows running jobs to finish.
Removing hosts dynamically
Dynamic host configuration allows you to remove hosts from the cluster without manually changing the LSF configuration. See Administering Platform LSF for details about removing hosts dynamically.

If you get errors
See Chapter 6, “Troubleshooting LSF Problems” or the Platform LSF Reference for help with some common configuration errors.

Adding a queue
Adding a queue does not affect pending or running jobs.
% badmin qclose night
Queue is closed
3 Move all pending and running jobs into another queue.
Configuring LSF Startup

Contents
◆ “Allowing LSF administrators to start LSF daemons (lsf.sudoers)” on page 38
◆ “Setting up automatic LSF startup” on page 38

Allowing LSF administrators to start LSF daemons (lsf.sudoers)
To allow LSF administrators to start and stop LSF daemons, you should configure the /etc/lsf.sudoers file. If lsf.sudoers does not exist, only root can start and stop LSF daemons.
1 Log on as root to each LSF server host.
Chapter 4
Working with LSF Licenses

Contents
◆ “About LSF Licenses” on page 40
◆ “Setting up a Permanent LSF License” on page 42
About LSF Licenses

Contents
◆ “Types of LSF licenses” on page 40
◆ “Where the license file is located” on page 41

Types of LSF licenses
LSF uses two types of licenses:
◆ File-based DEMO licenses, which do not need a license server. These are typically used while evaluating LSF and usually expire after 30 days. Each FEATURE line in the license contains an expiry date and ends with DEMO.
Example permanent license file

In this example:
◆ Server name: hosta
◆ Host ID (lmhostid): 880a0748a
◆ License port number: 1700
◆ LSF vendor daemon: lsf_ld
◆ License vendor daemon path: /usr/share/lsf/lsf_62/6.2/sparc-sol2/etc/lsf_ld

SERVER hosta 880a0748a 1700
DAEMON lsf_ld /usr/share/lsf/lsf_62/6.2/sparc-sol2/etc/lsf_ld
FEATURE lsf_base lsf_ld 6.200 1-jun-0000 10 DCF7C3D92A5471A12345 "Platform"
FEATURE lsf_manager lsf_ld 6.200 1-jun-0000 10 4CF7D37944B023A12345 "Platform"
FEATURE lsf_sched_fairshare lsf_ld 6.
Setting up a Permanent LSF License

Contents
◆ “Getting information needed for a permanent license” on page 42
◆ “Getting a license key” on page 42
◆ “Example permanent license file” on page 41
◆ “Preparing and installing a permanent license” on page 43
◆ “Starting the license server daemon (lmgrd)” on page 42
◆ “Checking the license status (lmstat)” on page 44
◆ “Updating an existing permanent license” on page 45
◆ “If you have problems” on page 45

See Administering Plat
See the lmgrd(8) man page for information about FLEXlm commands.

Multiple license servers
If you use multiple FLEXlm license servers, start lmgrd on all license servers.

If lmgrd does not start
If you get the message port already in use, the license port number defined in LSF_LICENSE_FILE or in LSF_CONFDIR/license.dat is in use by another application. The default port is 1700.
FEATURE lsf_base lsf_ld 6.200 1-jun-0000 10 DCF7C3D92A5471A12345 "Platform"
FEATURE lsf_manager lsf_ld 6.200 1-jun-0000 10 4CF7D37944B023A12345 "Platform"
the PRODUCTS line in LSF_CONFDIR/lsf.cluster.cluster_name must contain:
PRODUCTS=LSF_Base LSF_Manager
If you do not have licenses for some features in the PRODUCTS line, contact Platform at license@platform.com or your LSF vendor.
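To compare the licensed features with the PRODUCTS line, it can help to print both side by side. The following is a minimal sketch that only extracts the relevant lines; it assumes the license file uses FEATURE lines as in the example above and that the cluster file contains a line beginning with PRODUCTS=. The function names license_features and products_line are hypothetical helpers, not LSF or FLEXlm commands.

```shell
#!/bin/sh
# Print the feature name (second field) of every FEATURE line in a
# license file. license_features is a hypothetical helper.
license_features() {
    awk '$1 == "FEATURE" { print $2 }' "$1"
}

# Print the PRODUCTS line of an lsf.cluster.cluster_name file.
# products_line is a hypothetical helper.
products_line() {
    grep '^[[:space:]]*PRODUCTS[[:space:]]*=' "$1"
}
```

For example, `license_features /usr/share/lsf/lsf_62/conf/license.dat` lists the feature names, which you can then check by eye against the output of `products_line` for your cluster file.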
For example, depending on the LSF features installed, the output of the command should look something like the following:
% lmstat -a -c /usr/share/lsf/lsf_62/conf/license.dat
lmstat - Copyright (C) 1989-2000 Globetrotter Software, Inc.
Flexible License Manager status on Fri 3/15/2005 08:39
License server status: 1700@hosta
License file(s) on hosta: /usr/share/lsf/lsf_62/conf/license.dat:
hosta: license server UP (MASTER) v7.
◆ See the FLEXlm End Users Guide, available for download from GLOBEtrotter Software, Inc. at www.globetrotter.com, for more information about FLEXlm.

Where to go next
Learn how to set up an LSF External LIM (ELIM) to monitor dynamic shared resources, described in Chapter 5, “Using Shared Resources to Manage Software Licenses”.
Chapter 5
Using Shared Resources to Manage Software Licenses

Contents
◆ “Managing Software Licenses and Other Shared Resources” on page 48
Managing Software Licenses and Other Shared Resources

This chapter uses managing software licenses as an example of how to set up an LSF External LIM (ELIM) to monitor dynamic shared resources.
RESOURCENAME  TYPE     INTERVAL  INCREASING  RELEASE  DESCRIPTION   # Keywords
license1      Numeric  30        N           Y        (license1 resource)
license2      Numeric  30        N           Y        (license2 resource)
End Resource

◆ The TYPE of shared resource can be:
  ❖ Numeric
  ❖ Boolean
  ❖ String
  In this case, the resource is numeric.
◆ The INTERVAL specifies how often the value should be refreshed; in this case, the ELIM updates the shared resource values every 30 seconds.
◆ The value of the second load index (2)
and so on.

Writing the ELIM program
The ELIM must be an executable program, named elim, located in the LSF_SERVERDIR directory. When lim is started or restarted, it invokes elim on the same host and takes the standard output of the external load indices sent by elim. In general, you can define any quantifiable resource as an external load index, write an ELIM to report its value, and use it as an LSF resource.
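The following is a minimal sketch of such an ELIM for the two numeric resources license1 and license2, not a definitive implementation. It assumes the report line consists of the number of indices followed by name and value pairs, as outlined above. The function count_free_licenses is a hypothetical placeholder: this guide does not prescribe how to count free licenses, so replace its body with whatever is appropriate at your site. The names elim_report and elim_main are also introduced here for illustration only.

```shell
#!/bin/sh
# Minimal ELIM sketch for two numeric shared resources, license1 and license2.

# Hypothetical placeholder: print the number of free licenses for the
# resource named in $1. This stub always reports 0.
count_free_licenses() {
    echo 0
}

# Print one report line on standard output:
# number of indices, then a name and value pair for each index.
elim_report() {
    printf '2 license1 %s license2 %s\n' \
        "$(count_free_licenses license1)" \
        "$(count_free_licenses license2)"
}

# Report forever, once every 30 seconds, to match the INTERVAL
# configured for these resources.
elim_main() {
    while :; do
        elim_report
        sleep 30
    done
}
```

A script installed as LSF_SERVERDIR/elim would end with a call to elim_main. The loop is kept in a function here so that elim_report can be run once by hand to check the report line before the script is installed.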
Begin Queue
QUEUE_NAME = license1
RES_REQ=rusage[license1=1:duration=1]
...
End Queue

Then submit a batch job using one license1 resource using a command like:
% bsub -q license1 myjob
When licenses are available, LSF runs your jobs right away; when all licenses are in use, LSF puts your jobs in a queue and dispatches them as licenses become available. This way, all of your licenses are used to the best advantage.
Chapter 6
Troubleshooting LSF Problems

This chapter covers solutions to the top 10 LSF problems in the order that you would most likely encounter them as you begin to use LSF. If you cannot find a solution to your problem here, contact your Platform system engineer or support@platform.com.
Common LSF Problems

Contents
◆ “Finding LSF error logs” on page 54
◆ “For most LSF problems” on page 54
◆ “Top 10 LSF problems” on page 55

Finding LSF error logs
When something goes wrong, LSF server daemons log error messages in the LSF log directory (LSF_LOGDIR). Make sure that the primary LSF administrator owns LSF_LOGDIR, and that root can write to this directory. If an LSF server is unable to write to LSF_LOGDIR, then the error logs are created in /tmp.
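When looking for a daemon's error log, the fallback rule can be sketched as follows. This is an illustration under stated assumptions: log files are assumed to be named daemon.log.host_name in both locations, and the writability test is made for the user running the script, which may differ from the account the daemon runs as. The function names effective_log_dir and daemon_log_file are hypothetical helpers, not LSF commands.

```shell
#!/bin/sh
# Print the directory where error logs are expected: LSF_LOGDIR when it is
# set, exists, and is writable by the current user, otherwise /tmp.
# effective_log_dir is a hypothetical helper.
effective_log_dir() {
    if [ -n "${LSF_LOGDIR:-}" ] && [ -d "$LSF_LOGDIR" ] && [ -w "$LSF_LOGDIR" ]; then
        printf '%s\n' "$LSF_LOGDIR"
    else
        printf '%s\n' /tmp
    fi
}

# Print the expected log file path for a daemon ($1, for example lim or res)
# on a host ($2). daemon_log_file is a hypothetical helper.
daemon_log_file() {
    printf '%s/%s.log.%s\n' "$(effective_log_dir)" "$1" "$2"
}
```

For example, with LSF_LOGDIR set in your environment, `daemon_log_file lim hosta` prints the path to check for lim messages on hosta.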
Top 10 LSF problems

1 Cannot open lsf.conf file
You might see this message running lsid. This means that LSF_CONFDIR/lsf.conf is not accessible to LSF. By default, LSF checks the directory defined by LSF_ENVDIR for lsf.conf. If lsf.conf is not in LSF_ENVDIR, LSF looks for it in /etc.
◆ Make sure that there is either a symbolic link from /etc/lsf.conf to lsf.conf or
◆ Use cshrc.lsf or profile.lsf to set up your LSF environment. Make sure that cshrc.lsf or profile.
the PRODUCTS line in LSF_CONFDIR/lsf.cluster.cluster_name must contain:
PRODUCTS=LSF_Base LSF_Manager
Modify the PRODUCTS line to fix the error. See Chapter 4, “Working with LSF Licenses” for information about working with a permanent LSF license.
◆ lsf.conf is not in the location specified by LSF_ENVDIR. Check that the LSF_LICENSE_FILE parameter in lsf.conf is correct.
If lim has just been started, this is normal; lim needs time to read configuration files and contact lim daemons on other hosts. If lim does not respond within one or two minutes, check the lim error log (LSF_LOGDIR/lim.log.host_name) for the host you are working on.
When the local lim is running but there is no master lim in the cluster, LSF applications display the following message:
Cannot locate master LIM now, try later.
lim problems can have several causes.
LSB_SBD_PORT=6882
The port numbers can be any numbers ranging from 1024 to 65535 that are not already used by other services. To make sure that the port numbers you supply are not already used by applications registered in your service database, check /etc/services.
❖ To change the port numbers:
  a Shut down your cluster.
  b Edit LSF_CONFDIR/lsf.conf.
  c Restart LSF.
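Before editing lsf.conf, a candidate port number can be checked against both rules. The following is a minimal sketch, assuming the usual services file layout of a service name followed by port/protocol; port_in_range and port_in_services are hypothetical helper names, not LSF commands. It only inspects the services file and does not detect ports used by applications that are not registered there.

```shell
#!/bin/sh
# Succeed if $1 is an integer in the range 1024-65535.
# port_in_range is a hypothetical helper.
port_in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}

# Succeed if port $1 is registered (for any protocol) in the services
# file $2, which defaults to /etc/services.
# port_in_services is a hypothetical helper.
port_in_services() {
    awk -v port="$1" '
        { sub(/#.*/, "") }                  # drop comments
        NF >= 2 {
            split($2, a, "/")               # second field is port/protocol
            if (a[1] == port) found = 1
        }
        END { exit !found }
    ' "${2:-/etc/services}"
}
```

For example, `port_in_range 6882 && ! port_in_services 6882` succeeds only when 6882 is in the allowed range and is not already listed in /etc/services.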
Run lsadmin ckconfig -v and correct the problems shown in the command output. See problem “2 Host does not have a software license” on page 55 and Chapter 4, “Working with LSF Licenses” for more information.
◆ Ownership of the LSF files and directories. The LSF primary administrator should own all LSF directories and most files. In particular, LSB_SHAREDIR (e.g., /usr/share/lsf/lsf_62/work) must be owned and writable by the LSF primary administrator.
❖ The messages in LSF_LOGDIR/res.log.host_name on the execution host. res is responsible for authenticating users in LSF
❖ The setting of LSF authentication (LSF_AUTH in LSF_CONFDIR/lsf.conf):
  ✧ LSF default authentication is eauth (LSF_AUTH is not defined or is defined as eauth in lsf.conf)
  ✧ If LSF_AUTH is defined as identd in lsf.
If none of these applies to your situation, contact support@platform.com.

8 Application runs fine under UNIX or with lsrun, but fails or hangs when submitted through bsub
On some UNIX systems, certain applications only run with specific limit values. Different limit values or no limits can cause problems for these applications.
◆ Permissions or ownership of your submission directory is incorrect for the home directory on the execution host
◆ You have a non-shared file system
A command may fail with the following error message due to a non-uniform file name space.
chdir(...