LSF Version 7.3 - Administering Platform LSF

Fault Tolerance

LSF is designed to continue operating even if some of the hosts in the cluster are
unavailable. One host in the cluster acts as the master, but if the master host
becomes unavailable another host takes over. LSF is available as long as there is one
available host in the cluster.

LSF can tolerate the failure of any host or group of hosts in the cluster. When a host
crashes, all jobs running on that host are lost. No other pending or running jobs are
affected. Important jobs can be submitted to LSF with an option to automatically
restart if the job is lost because of a host failure.

Dynamic master host

The LSF master host is chosen dynamically. If the current master host becomes
unavailable, another host takes over automatically. The failover master host is
selected from the list defined in LSF_MASTER_LIST in lsf.conf (specified in
install.config at installation). The first available host in the list acts as the
master. LSF might be unavailable for a few minutes while hosts are waiting to be
contacted by the new master.
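
The "first available host in the list" rule can be illustrated with a small sketch. This is only a model of the selection rule described above, not LSF's implementation. The host names and the pick_master function are hypothetical, and the availability check is a stand-in for whatever LSF uses internally.

```python
from typing import Callable, Optional, Sequence


def pick_master(master_list: Sequence[str],
                is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available host in master_list, or None if none is up."""
    for host in master_list:
        if is_available(host):
            return host
    return None


# Hypothetical candidates, in the order they might appear in LSF_MASTER_LIST.
candidates = ["hosta", "hostb", "hostc"]
down = {"hosta"}  # assume the current master has become unavailable

new_master = pick_master(candidates, lambda h: h not in down)
print(new_master)
```

Because the list is scanned in order, the position of a host in the list determines its priority as a failover master.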

Running jobs are managed by sbatchd on each server host. When the new mbatchd
starts, it polls the sbatchd on each host and finds the current status of its jobs. If
sbatchd fails but the host is still running, jobs running on the host are not lost.
When sbatchd is restarted it regains control of all jobs running on the host.

Network failure

If the cluster is partitioned by a network failure, a master LIM takes over on each
side of the partition. Interactive load-sharing remains available, as long as each host
still has access to the LSF executables.

Event log file (lsb.events)

Fault tolerance in LSF depends on the event log file, lsb.events, which is kept on
the primary file server. Every event in the system is logged in this file, including all
job submissions and job and host status changes. If the master host becomes
unavailable, a new master is chosen by lim. sbatchd on the new master starts a new
mbatchd. The new mbatchd reads the lsb.events file to recover the state of the
system.
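
Recovering state from an event log amounts to replaying the logged events in order. The sketch below shows the general idea only. The record layout, event names, and status strings are simplified and chosen for illustration; they are not the actual lsb.events format, and replay is a hypothetical function.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical, simplified event records: (event_type, job_id, detail).
# The real lsb.events file uses a different, richer format.
Event = Tuple[str, int, str]


def replay(events: Iterable[Event]) -> Dict[int, str]:
    """Rebuild a job_id -> status map by applying events in logged order."""
    jobs: Dict[int, str] = {}
    for kind, job_id, detail in events:
        if kind == "JOB_NEW":
            jobs[job_id] = "PEND"      # a newly submitted job starts as pending
        elif kind == "JOB_STATUS" and job_id in jobs:
            jobs[job_id] = detail      # a later status change overwrites the old one
    return jobs


log = [
    ("JOB_NEW", 101, ""),
    ("JOB_NEW", 102, ""),
    ("JOB_STATUS", 101, "RUN"),
    ("JOB_STATUS", 101, "DONE"),
]
state = replay(log)
print(state)
```

Since the result depends only on the log contents, any host that can read the log can rebuild the same state, which is why access to the log matters for failover.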

For sites not wanting to rely solely on a central file server for recovery information,
LSF can be configured to maintain a duplicate event log by keeping a replica of
lsb.events. The replica is stored on the file server, and used if the primary copy is
unavailable. When using LSF's duplicate event log function, the primary event log
is stored on the first master host, and re-synchronized with the replicated copy
when the host recovers.

Partitioned network

If the network is partitioned, only one of the partitions can access lsb.events, so
batch services are only available on one side of the partition. A lock file is used to
make sure that only one mbatchd is running in the cluster.
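
As a generic illustration of how a lock file can enforce a single running instance, the sketch below uses exclusive file creation. It shows the general technique only and makes no claim about how LSF implements its lock. The file name and the try_acquire function are hypothetical.

```python
import os
import tempfile


def try_acquire(lock_path: str) -> bool:
    """Atomically create lock_path; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL makes the call fail if the file is already present.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for diagnostics
    return True


with tempfile.TemporaryDirectory() as d:
    lock = os.path.join(d, "example.lock")  # hypothetical lock file name
    first = try_acquire(lock)    # the first instance obtains the lock
    second = try_acquire(lock)   # a second instance is refused
    print(first, second)
```

The second attempt fails because the file already exists, so only one process at a time can consider itself the owner.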