LSF Version 7.3 - Administering Platform LSF
Tuning LSF for Large Clusters
Some operating systems, such as Linux and AIX, let you increase the number of file
descriptors that can be allocated on the master host. You do not need to limit the
number of file descriptors to 1024 if you want fast job dispatching. To take
advantage of the greater number of file descriptors, you must set
LSB_MAX_JOB_DISPATCH_PER_SESSION to a value greater than 300.
Set LSB_MAX_JOB_DISPATCH_PER_SESSION to one-half the value of MAX_SBD_CONNS. This
setting configures mbatchd to dispatch jobs at a high rate while maintaining the
processing speed of other mbatchd tasks.
MAX_SBD_CONNS in lsb.params
The maximum number of open file connections between mbatchd and sbatchd.
Specify a value equal to the number of hosts in your cluster plus a buffer. For
example, if your cluster includes 4000 hosts, set:
MAX_SBD_CONNS=4100
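The sizing rules above can be sketched as a small calculation. This is an illustrative helper, not part of LSF: the function name recommended_settings and the default buffer of 100 are assumptions, with the buffer taken from the 4000-host example.

```python
def recommended_settings(num_hosts: int, buffer: int = 100) -> dict:
    """Suggest values for the two parameters from a host count.

    Hypothetical helper: the default buffer of 100 is an assumption
    borrowed from the 4000-host example, not a value mandated by LSF.
    """
    if num_hosts <= 0 or buffer < 0:
        raise ValueError("num_hosts must be positive and buffer non-negative")
    # MAX_SBD_CONNS (lsb.params): number of hosts plus a buffer.
    max_sbd_conns = num_hosts + buffer
    # LSB_MAX_JOB_DISPATCH_PER_SESSION (lsf.conf): one-half of MAX_SBD_CONNS.
    dispatch = max_sbd_conns // 2
    return {
        "MAX_SBD_CONNS": max_sbd_conns,
        "LSB_MAX_JOB_DISPATCH_PER_SESSION": dispatch,
    }


if __name__ == "__main__":
    for name, value in recommended_settings(4000).items():
        print(f"{name}={value}")
```

For 4000 hosts this yields MAX_SBD_CONNS=4100 and LSB_MAX_JOB_DISPATCH_PER_SESSION=2050. For small clusters the halved value can fall to 300 or below, in which case it no longer satisfies the "greater than 300" requirement and should be reviewed by hand.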
LSF_SERVER_HOSTS in lsf.conf
Highly recommended for large clusters to decrease the load on the master LIM.
Forces the client sbatchd to contact the local LIM for host status and load
information. The client sbatchd contacts the master LIM or a LIM on one of the
LSF_SERVER_HOSTS only if it cannot find the information locally.
Enable fast job dispatch
1 Log in to the LSF master host as the root user.
2 Increase the system-wide file descriptor limit of your operating system if you
have not already done so.
3 In lsb.params, set MAX_SBD_CONNS equal to the number of hosts in the cluster
plus a buffer.
4 In lsf.conf, set the parameter LSB_MAX_JOB_DISPATCH_PER_SESSION to a
value greater than 300 and less than or equal to one-half the value of
MAX_SBD_CONNS.
For example, for a cluster with 4000 hosts:
LSB_MAX_JOB_DISPATCH_PER_SESSION=2050
MAX_SBD_CONNS=4100
5 In lsf.conf, define the parameter LSF_SERVER_HOSTS to decrease the load on
the master LIM.
6 In the shell you used to increase the file descriptor limit, shut down the LSF
batch daemons on the master host:
badmin hshutdown
7 Run badmin mbdrestart to restart the LSF batch daemons on the master host.
8 Run badmin hrestart all to restart every sbatchd in the cluster.
NOTE: When you shut down the batch daemons on the master host, all LSF services are
temporarily unavailable, but existing jobs are not affected. When mbatchd is later started by
sbatchd, its previous status is restored and job scheduling continues.
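Before restarting the daemons, the values chosen in steps 2 to 4 can be sanity-checked with a short script. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the idea that the per-process file descriptor limit should be at least MAX_SBD_CONNS (plus optional headroom) is an assumption drawn from the discussion above, not a documented LSF rule. Only the standard Python resource module is used, so it applies to UNIX-like hosts.

```python
import resource


def dispatch_value_ok(dispatch: int, max_sbd_conns: int) -> bool:
    """Check step 4: greater than 300 and at most one-half of MAX_SBD_CONNS."""
    return dispatch > 300 and 2 * dispatch <= max_sbd_conns


def fd_limit_ok(soft_limit: int, max_sbd_conns: int, headroom: int = 0) -> bool:
    """Assumed rule: the soft descriptor limit should cover MAX_SBD_CONNS."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit imposed
    return soft_limit >= max_sbd_conns + headroom


def current_soft_fd_limit() -> int:
    """Return the soft RLIMIT_NOFILE of the current process (and its shell)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    max_sbd_conns, dispatch = 4100, 2050  # values from the 4000-host example
    print("dispatch value valid:", dispatch_value_ok(dispatch, max_sbd_conns))
    print("fd limit sufficient:", fd_limit_ok(current_soft_fd_limit(), max_sbd_conns))
```

Run the script from the same shell in which the file descriptor limit was raised, because the limit is inherited per process and a different shell may still report the old value.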