LSF Version 7.3 - Administering Platform LSF
Fault Tolerance
LSF is designed to continue operating even if some of the hosts in the cluster are
unavailable. One host in the cluster acts as the master, but if the master host
becomes unavailable, another host takes over. LSF is available as long as there is one
available host in the cluster.
LSF can tolerate the failure of any host or group of hosts in the cluster. When a host
crashes, all jobs running on that host are lost. No other pending or running jobs are
affected. Important jobs can be submitted to LSF with an option to automatically
restart if the job is lost because of a host failure.
Dynamic master host
The LSF master host is chosen dynamically. If the current master host becomes
unavailable, another host takes over automatically. The failover master host is
selected from the list defined in LSF_MASTER_LIST in lsf.conf (specified in
install.config at installation). The first available host in the list acts as the
master. LSF might be unavailable for a few minutes while hosts are waiting to be
contacted by the new master.
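The selection rule above (the first available host in LSF_MASTER_LIST becomes the master) can be sketched in a few lines of Python. This is only an illustration of the rule, not how lim implements it. The host names, the sample lsf.conf line, and the helper functions parse_master_list and pick_master are hypothetical.

```python
def parse_master_list(conf_text):
    """Return the candidate hosts named by LSF_MASTER_LIST, in listed order."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not KEY=value
        key, _, value = line.partition("=")
        if key.strip() == "LSF_MASTER_LIST":
            return value.strip().strip('"').split()
    return []


def pick_master(candidates, available):
    """Return the first candidate that is currently available, or None."""
    for host in candidates:
        if host in available:
            return host
    return None


# Hypothetical lsf.conf content and host names, for illustration only.
conf = 'LSF_MASTER_LIST="hosta hostb hostc"\n'
candidates = parse_master_list(conf)

# hosta is down, so the next host in the list takes over.
master = pick_master(candidates, available={"hostb", "hostc"})
```

Because the list is ordered, placing the preferred master first and the fallback hosts after it determines which host takes over in a failure.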
Running jobs are managed by sbatchd on each server host. When the new mbatchd
starts, it polls the sbatchd on each host and finds the current status of its jobs. If
sbatchd fails but the host is still running, jobs running on the host are not lost.
When sbatchd is restarted, it regains control of all jobs running on the host.
Network failure
If the cluster is partitioned by a network failure, a master LIM takes over on each
side of the partition. Interactive load-sharing remains available, as long as each host
still has access to the LSF executables.
Event log file (lsb.events)
Fault tolerance in LSF depends on the event log file, lsb.events, which is kept on
the primary file server. Every event in the system is logged in this file, including all
job submissions and job and host status changes. If the master host becomes
unavailable, a new master is chosen by lim. The sbatchd on the new master starts a
new mbatchd. The new mbatchd reads the lsb.events file to recover the state of the
system.
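Recovering state from an event log amounts to replaying the logged events in order. The sketch below shows that idea with a deliberately simplified, hypothetical record layout (timestamp, job ID, status). It is not the real lsb.events format, and the function name replay is an assumption made for the example.

```python
def replay(events):
    """Rebuild a job-id -> status map by applying events in log order."""
    jobs = {}
    for line in events:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _timestamp, job_id, status = fields
        jobs[job_id] = status  # a later event overwrites the earlier status
    return jobs


# Hypothetical log records: "<timestamp> <job id> <status>".
log = [
    "1000 101 PEND",
    "1005 102 PEND",
    "1010 101 RUN",
    "1050 101 DONE",
]

# After replay, each job carries the status from its most recent event.
state = replay(log)
```

Since every status change is appended to the log, a process that starts with no memory of the cluster can reach the same view of the jobs as its predecessor by reading the file from the beginning.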
For sites not wanting to rely solely on a central file server for recovery information,
LSF can be configured to maintain a duplicate event log by keeping a replica of
lsb.events. The replica is stored on the file server, and used if the primary copy is
unavailable. When using LSF’s duplicate event log function, the primary event log
is stored on the first master host, and re-synchronized with the replicated copy
when the host recovers.
Partitioned network
If the network is partitioned, only one of the partitions can access lsb.events, so
batch services are only available on one side of the partition. A lock file is used to
make sure that only one mbatchd is running in the cluster.
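The general lock-file technique can be sketched as follows: a process creates the file exclusively and treats a failure to create it as a sign that another instance already holds the lock. This is a minimal illustration under assumed names (acquire_lock, release_lock, demo.lock). The source does not describe how LSF's own lock file works, and how reliable exclusive creation is can depend on the file system.

```python
import os
import tempfile


def acquire_lock(path):
    """Try to create the lock file exclusively; return True on success."""
    try:
        # O_EXCL makes the call fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for troubleshooting
    return True


def release_lock(path):
    """Remove the lock file so another process can acquire it."""
    os.remove(path)


# Hypothetical lock location in a scratch directory.
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

first = acquire_lock(lock_path)   # no lock yet, so this succeeds
second = acquire_lock(lock_path)  # lock is held, so this is refused
release_lock(lock_path)
third = acquire_lock(lock_path)   # lock was released, so this succeeds
```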