LSF Version 7.3 - Administering Platform LSF

How the System Works
Host failure
If an LSF server host fails, jobs running on that host are lost. No other jobs are
affected. Jobs can be submitted as rerunnable, so that they automatically run again
from the beginning, or as checkpointable, so that they start again from a checkpoint
on another host, if they are lost because of a host failure.
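The two submission modes above correspond to the bsub options -r (rerunnable) and -k (checkpointable). The following Python sketch only assembles the bsub argument list; it does not submit anything. The helper name build_bsub_command, the job command, and the checkpoint directory are hypothetical and chosen for illustration. Checkpointing also depends on the application and site configuration supporting it, which this sketch does not check.

```python
import shlex


def build_bsub_command(job_command, rerunnable=False,
                       checkpoint_dir=None, checkpoint_period=None):
    """Assemble a bsub argument list (hypothetical helper, not part of LSF).

    rerunnable        -> adds -r, so a lost job runs again from the beginning
    checkpoint_dir    -> adds -k "dir [period]", so a lost job can restart
                         from a checkpoint
    checkpoint_period -> optional checkpoint interval, in minutes
    """
    argv = ["bsub"]
    if rerunnable:
        argv.append("-r")
    if checkpoint_dir is not None:
        spec = checkpoint_dir
        if checkpoint_period is not None:
            spec = f"{checkpoint_dir} {checkpoint_period}"
        # bsub expects the directory and period as one quoted argument
        argv += ["-k", spec]
    # The job's own command line follows the bsub options
    argv += shlex.split(job_command)
    return argv


if __name__ == "__main__":
    # Print the command lines instead of running them
    print(shlex.join(build_bsub_command("./sim --steps 100", rerunnable=True)))
    print(shlex.join(build_bsub_command("./sim", checkpoint_dir="/tmp/ckpt",
                                        checkpoint_period=30)))
```

To submit for real, the returned list could be passed to subprocess.run on a host where bsub is installed.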
If all of the hosts in a cluster go down, all running jobs are lost. When a host comes
back up and takes over as master, it reads the lsb.events file to get the state of
all batch jobs. Jobs that were running when the systems went down are assumed to
have exited, and email is sent to the submitting user. Pending jobs remain in their
queues, and are scheduled as hosts become available.
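The recovery rule described above can be illustrated with a toy model. This is only a sketch of the rule as stated in the text, not LSF's implementation: it does not parse lsb.events, and the Job class and recover_after_cluster_restart function are hypothetical. The in-memory job list stands in for the state that the new master reconstructs from the event log.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    user: str
    state: str  # one of "PEND", "RUN", "DONE", "EXIT"


def recover_after_cluster_restart(jobs):
    """Apply the recovery rule to jobs in place.

    Jobs that were running are assumed to have exited, and their
    submitting users are collected for notification. Pending jobs
    are left untouched so that they can be scheduled later.
    Returns a list of (user, job_id) pairs to notify by email.
    """
    notifications = []
    for job in jobs:
        if job.state == "RUN":
            job.state = "EXIT"
            notifications.append((job.user, job.job_id))
    return notifications


if __name__ == "__main__":
    jobs = [Job(1, "alice", "RUN"), Job(2, "bob", "PEND"), Job(3, "alice", "DONE")]
    for user, job_id in recover_after_cluster_restart(jobs):
        print(f"notify {user}: job {job_id} is assumed to have exited")
```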
Job exception handling
You can configure hosts and queues so that LSF detects exceptional conditions
while jobs are running and takes appropriate action automatically. You can
customize which exceptions are detected and the corresponding actions. By default,
LSF does not detect any exceptions.
See Handling Host-level Job Exceptions on page 98 and Handling Job Exceptions in
Queues on page 110 for more information about job-level exception management.