LSF Version 7.3 - Administering Platform LSF

How the System Works

Host failure

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. To recover from a host failure, jobs can be submitted as rerunnable, so that they automatically run again from the beginning, or as checkpointable, so that they restart from a checkpoint on another host.
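
As a minimal sketch of the two submission styles, the helper below assembles a bsub command line using the -r option (rerunnable) and the -k option (checkpoint directory). The helper name build_bsub_command, the job command, and the path /share/chkpnt are hypothetical and chosen only for illustration; the sketch builds the argument list and does not submit anything. Check the bsub reference for the full -k syntax (checkpoint period and method) before relying on it.

```python
import shlex


def build_bsub_command(job_command, rerunnable=False, checkpoint_dir=None):
    """Return the argument list for a bsub submission (illustrative helper)."""
    argv = ["bsub"]
    if rerunnable:
        # -r: rerun the job from the beginning if its host fails
        argv.append("-r")
    if checkpoint_dir is not None:
        # -k: make the job checkpointable, storing checkpoints in this directory
        argv.extend(["-k", checkpoint_dir])
    # Split the job command the way a POSIX shell would
    argv.extend(shlex.split(job_command))
    return argv


# Hypothetical job command and checkpoint directory
rerun_argv = build_bsub_command("./simulate --steps 1000", rerunnable=True)
chkpnt_argv = build_bsub_command(
    "./simulate --steps 1000", checkpoint_dir="/share/chkpnt"
)

print(shlex.join(rerun_argv))
print(shlex.join(chkpnt_argv))
```

Returning a list rather than a single string lets the result be passed directly to subprocess.run without extra shell quoting.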

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the hosts went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues and are scheduled as hosts become available.
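
The recovery rule described above can be pictured with a toy model. This is only a sketch of the stated behavior, not LSF code: it does not read or parse lsb.events, and the Job class, the function recover_after_cluster_outage, and the sample jobs are all hypothetical. It assumes each job is either pending (PEND) or running (RUN) at the time of the outage.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    user: str
    state: str  # "PEND" or "RUN" in this toy model


def recover_after_cluster_outage(jobs):
    """Apply the recovery rule to the last known job states.

    Returns the recovered jobs and the list of users to notify by email.
    """
    recovered = []
    users_to_notify = []
    for job in jobs:
        if job.state == "RUN":
            # Jobs that were running are assumed to have exited
            recovered.append(Job(job.job_id, job.user, "EXIT"))
            users_to_notify.append(job.user)
        else:
            # Pending jobs stay in their queues, to be scheduled later
            recovered.append(job)
    return recovered, users_to_notify


# Hypothetical job states at the moment the cluster went down
jobs_before = [Job(101, "alice", "RUN"), Job(102, "bob", "PEND")]
jobs_after, notified = recover_after_cluster_outage(jobs_before)

for job in jobs_after:
    print(job.job_id, job.state)
```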

Job exception handling

You can configure hosts and queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.

See Handling Host-level Job Exceptions on page 98 and Handling Job Exceptions in Queues on page 110 for more information about job-level exception management.