Platform LSF Administration Guide Version 6.2
Administering Platform LSF
Fault Tolerance
LSF is designed to continue operating even if some of the hosts in the cluster are
unavailable. One host in the cluster acts as the master, but if the master host becomes
unavailable another host takes over. LSF is available as long as there is one available host
in the cluster.
LSF can tolerate the failure of any host or group of hosts in the cluster. When a host
crashes, all jobs running on that host are lost. No other pending or running jobs are
affected. Important jobs can be submitted with the bsub -r option so that they are rerun
automatically if they are lost because of a host failure.
Dynamic master host
The LSF master host is chosen dynamically. If the current master host becomes
unavailable, another host takes over automatically. The master host selection is based on
the order in which hosts are listed in the lsf.cluster.cluster_name file. If the first host in
the file is available, that host acts as the master. If the first host is unavailable, the
second host takes over, and so on. LSF might be unavailable for a few minutes while
hosts are waiting to be contacted by the new master.
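The selection rule above can be illustrated with a minimal Python sketch. This is not LSF code: the function name select_master, the host names, and the is_available callback are hypothetical, and the host list is assumed to have already been read, in order, from the lsf.cluster.cluster_name file.

```python
from typing import Callable, Optional, Sequence


def select_master(hosts: Sequence[str],
                  is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available host in configuration order, or None."""
    for host in hosts:
        if is_available(host):
            return host
    return None


# Hypothetical host names, in the order they appear in the cluster file.
hosts = ["hosta", "hostb", "hostc"]
down = {"hosta"}  # assumption for the example: the first host is unavailable

master = select_master(hosts, lambda h: h not in down)
print(master)  # hostb
```

Because the list is always scanned from the top, the sketch returns the first host again as soon as it becomes available.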
Running jobs are managed by sbatchd on each server host. When the new mbatchd
starts, it polls the sbatchd on each host and finds the current status of its jobs. If
sbatchd fails but the host is still running, jobs running on the host are not lost. When
sbatchd is restarted, it regains control of all jobs running on the host.
Network failure
If the cluster is partitioned by a network failure, a master LIM takes over on each side
of the partition. Interactive load-sharing remains available, as long as each host still has
access to the LSF executables.
Event log file (lsb.events)
Fault tolerance in LSF depends on the event log file, lsb.events, which is kept on the
primary file server. Every event in the system is logged in this file, including all job
submissions and job and host status changes. If the master host becomes unavailable, a
new master is chosen by lim, and sbatchd on the new master starts a new mbatchd. The
new mbatchd reads the lsb.events file to recover the state of the system.
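The idea of recovering state by replaying an event log can be sketched in Python as follows. The record layout used here, a tuple of event type, job ID, and detail, is a simplified assumption for illustration and is not the real lsb.events format. The replay function and the sample events are likewise hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Each record is (event_type, job_id, detail). This is a simplified,
# hypothetical layout, not the real lsb.events format.
Event = Tuple[str, int, str]


def replay(events: Iterable[Event]) -> Dict[int, str]:
    """Rebuild the latest known status of every job from an ordered log."""
    jobs: Dict[int, str] = {}
    for event_type, job_id, detail in events:
        if event_type == "JOB_NEW":
            # A newly submitted job starts out pending.
            jobs[job_id] = "PEND"
        elif event_type == "JOB_STATUS" and job_id in jobs:
            # Later status events overwrite earlier ones.
            jobs[job_id] = detail
    return jobs


log = [
    ("JOB_NEW", 101, ""),
    ("JOB_NEW", 102, ""),
    ("JOB_STATUS", 101, "RUN"),
    ("JOB_STATUS", 101, "DONE"),
]
state = replay(log)
print(state)  # {101: 'DONE', 102: 'PEND'}
```

Since every change is appended to the log in order, reading it from the beginning is enough to reach the same state the previous master held, provided the file is still accessible.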
For sites not wanting to rely solely on a central file server for recovery information, LSF
can be configured to maintain a duplicate event log by keeping a replica of lsb.events.
The replica is stored on the file server and is used if the primary copy is unavailable.
When the duplicate event log function is used, the primary event log is stored on the
first master host and is re-synchronized with the replicated copy when the host recovers.
Partitioned network
If the network is partitioned, only one of the partitions can access lsb.events, so
batch services are available on only one side of the partition. A lock file is used to make
sure that only one mbatchd is running in the cluster.
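The general lock-file technique can be sketched in Python as follows. The source does not describe how LSF implements its lock, so this shows only the common pattern of exclusive file creation. The function names and the lock file path are hypothetical.

```python
import os
import tempfile


def acquire_lock(path: str) -> bool:
    """Try to create the lock file atomically; return True on success."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        # Record the owner's process ID for diagnostics.
        f.write(str(os.getpid()))
    return True


def release_lock(path: str) -> None:
    """Remove the lock file so that another process can take over."""
    os.remove(path)


# Hypothetical lock location in a fresh temporary directory.
lock_path = os.path.join(tempfile.mkdtemp(), "batch.lock")
first = acquire_lock(lock_path)
second = acquire_lock(lock_path)
print(first, second)  # True False
release_lock(lock_path)
```

The second attempt fails because the file already exists, which is how a second daemon would learn that another instance holds the lock.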