Platform LSF Administration Guide Version 6.2

Fault Tolerance

Administering Platform LSF

LSF is designed to continue operating even if some of the hosts in the cluster are
unavailable. One host in the cluster acts as the master, but if the master host becomes
unavailable, another host takes over. LSF remains available as long as at least one host
in the cluster is available.

LSF can tolerate the failure of any host or group of hosts in the cluster. When a host
crashes, all jobs running on that host are lost. No other pending or running jobs are
affected. Important jobs can be submitted with an option that automatically restarts the
job if it is lost because of a host failure.

Dynamic master host

The LSF master host is chosen dynamically. If the current master host becomes
unavailable, another host takes over automatically. The master host selection is based on
the order in which hosts are listed in the lsf.cluster.cluster_name file. If the first
host in the file is available, that host acts as the master. If the first host is
unavailable, the second host takes over, and so on. LSF might be unavailable for a few
minutes while hosts are waiting to be contacted by the new master.
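The selection rule above can be sketched in a few lines of Python. This is only an illustration of "first available host in configured order", not LSF's actual implementation; the host names and the `available` set are hypothetical stand-ins for whatever availability information the cluster really uses.

```python
from typing import Iterable, Optional, Set


def select_master(configured_hosts: Iterable[str], available: Set[str]) -> Optional[str]:
    """Return the first configured host that is currently available.

    configured_hosts is assumed to follow the order of the host list in
    lsf.cluster.cluster_name; available is a hypothetical set of hosts
    that currently respond.
    """
    for host in configured_hosts:
        if host in available:
            return host
    return None  # no host is available, so there is no master


hosts = ["hostA", "hostB", "hostC"]  # hypothetical host names, in file order
print(select_master(hosts, {"hostA", "hostB", "hostC"}))  # hostA
print(select_master(hosts, {"hostB", "hostC"}))           # hostB
print(select_master(hosts, set()))                        # None
```

Because the result depends only on the configured order and on current availability, every host that evaluates the rule with the same information arrives at the same master.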

Running jobs are managed by sbatchd on each server host. When the new mbatchd
starts, it polls the sbatchd on each host and finds the current status of its jobs. If
sbatchd fails but the host is still running, jobs running on the host are not lost. When
sbatchd is restarted, it regains control of all jobs running on the host.

Network failure

If the cluster is partitioned by a network failure, a master LIM takes over on each side
of the partition. Interactive load-sharing remains available as long as each host still
has access to the LSF executables.

Event log file (lsb.events)

Fault tolerance in LSF depends on the event log file, lsb.events, which is kept on the
primary file server. Every event in the system is logged in this file, including all job
submissions and all job and host status changes. If the master host becomes unavailable,
a new master is chosen by lim. The sbatchd on the new master then starts a new mbatchd,
and the new mbatchd reads the lsb.events file to recover the state of the system.
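Recovering state from an event log amounts to replaying the logged events in order and keeping the latest status of each job. The sketch below illustrates that idea only. The record layout `(event_type, job_id, detail)` and the event names used here are simplified assumptions, not the real lsb.events format, and `replay_events` is a hypothetical helper.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical, simplified event record: (event_type, job_id, detail).
# This is NOT the real lsb.events layout.
Event = Tuple[str, int, str]


def replay_events(events: Iterable[Event]) -> Dict[int, str]:
    """Rebuild the latest known status of every job from an ordered event log."""
    status: Dict[int, str] = {}
    for event_type, job_id, detail in events:
        if event_type == "JOB_NEW":
            status[job_id] = "PEND"   # a newly submitted job starts out pending
        elif event_type == "JOB_STATUS":
            status[job_id] = detail   # a later event overrides the earlier status
    return status


log = [
    ("JOB_NEW", 101, ""),
    ("JOB_NEW", 102, ""),
    ("JOB_STATUS", 101, "RUN"),
    ("JOB_STATUS", 101, "DONE"),
]
print(replay_events(log))  # {101: 'DONE', 102: 'PEND'}
```

Because the log is append-only and replayed from the start, a replacement process needs nothing from the failed one except access to the file.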

For sites that do not want to rely solely on a central file server for recovery
information, LSF can be configured to maintain a duplicate event log by keeping a replica
of lsb.events. The replica is stored on the file server and is used if the primary copy
is unavailable. When using LSF’s duplicate event log function, the primary event log is
stored on the first master host and is re-synchronized with the replicated copy when the
host recovers.
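The fallback from the primary copy to the replica can be pictured as trying a list of locations in order of preference. The following is a minimal sketch of that idea under stated assumptions: `pick_event_log` is a hypothetical helper, the caller supplies the paths, and re-synchronization of the two copies is not modeled.

```python
import os
from typing import Optional, Sequence


def pick_event_log(candidates: Sequence[str]) -> Optional[str]:
    """Return the first readable file among candidates, in preference order.

    candidates is assumed to list the primary copy first and the replica
    second; the paths themselves are supplied by the caller.
    """
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    return None  # neither copy can be read
```

A caller would pass the primary location followed by the replica location, and treat a `None` result as "no recovery information available".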

Partitioned network

If the network is partitioned, only one of the partitions can access lsb.events, so
batch services are available on only one side of the partition. A lock file is used to
make sure that only one mbatchd is running in the cluster.
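A lock file enforces "only one instance" by relying on exclusive file creation: the first process to create the file holds the lock, and every later attempt fails. The sketch below shows the general technique in Python; it is not how mbatchd itself is implemented. `acquire_lock` and `release_lock` are hypothetical helpers, and handling of stale locks left behind by a crashed process is omitted.

```python
import os


def acquire_lock(lock_path: str) -> bool:
    """Try to create lock_path exclusively; return True if the lock is now ours."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False  # another process already holds the lock
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for troubleshooting
    return True


def release_lock(lock_path: str) -> None:
    """Remove the lock file so that another process can acquire it."""
    os.remove(lock_path)
```

A second call to `acquire_lock` on the same path returns `False` until `release_lock` has been called, which is the property that keeps two instances from running at once.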