Platform LSF Administration Guide Version 6.2
Chapter 2
How the System Works
Administering Platform LSF
Host failure
If an LSF server host fails, jobs running on that host are lost. No other jobs are affected.
Jobs can be submitted so that they are automatically rerun from the beginning or
restarted from a checkpoint on another host if they are lost because of a host failure.
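The outcomes described above can be sketched as a small decision function. This is a minimal illustration, not LSF code: the Job fields, the outcome labels, and the choice to prefer a checkpoint restart over a rerun when a job has both are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    # Hypothetical model of a running job; field names are illustrative only.
    job_id: int
    host: str
    rerunnable: bool = False              # submitted as automatically rerunnable
    checkpoint_dir: Optional[str] = None  # set if the job was submitted as checkpointable


def outcome_after_host_failure(job: Job, failed_host: str) -> str:
    """Return what happens to a running job when failed_host goes down."""
    if job.host != failed_host:
        return "unaffected"               # jobs on other hosts keep running
    if job.checkpoint_dir is not None:
        # Assumption: a checkpoint restart is preferred when available.
        return "restart_from_checkpoint"
    if job.rerunnable:
        return "rerun_from_beginning"
    return "lost"
```

For example, a job submitted with neither option that was running on the failed host yields "lost", while the same job running on any other host yields "unaffected".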
If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back
up and takes over as master, it reads the lsb.events file to get the state of all batch jobs.
Jobs that were running when the systems went down are assumed to have exited, and email
is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as
hosts become available.
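The recovery behavior described above amounts to replaying an event log and then adjusting the state of jobs that were still running. The following is a minimal sketch under stated assumptions: the (event_type, job_id) tuples are a simplified, hypothetical stand-in and not the actual lsb.events record format, and the state names are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical simplified event: (event_type, job_id).
# This is NOT the real lsb.events record layout.
Event = Tuple[str, int]


def recover_job_states(events: Iterable[Event]) -> Tuple[Dict[int, str], List[int]]:
    """Replay events in order, then treat jobs still running as exited.

    Returns the recovered state per job and the list of job IDs whose
    submitting users should be notified.
    """
    states: Dict[int, str] = {}
    for kind, job_id in events:
        if kind == "JOB_NEW":
            states[job_id] = "PEND"
        elif kind == "JOB_START":
            states[job_id] = "RUN"
        elif kind == "JOB_FINISH":
            states[job_id] = "DONE"

    notify: List[int] = []
    for job_id, state in list(states.items()):
        if state == "RUN":
            # The job was running when the cluster went down: assume it exited.
            states[job_id] = "EXIT"
            notify.append(job_id)
    return states, notify


log = [("JOB_NEW", 1), ("JOB_NEW", 2), ("JOB_START", 1), ("JOB_NEW", 3),
       ("JOB_START", 3), ("JOB_FINISH", 3)]
states, notify = recover_job_states(log)
```

In this example, job 1 was running at the time of the outage, so it is marked as exited and its user is queued for notification; job 2 stays pending and job 3 stays finished.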
Job exception handling
You can configure hosts and queues so that LSF detects exceptional conditions while
jobs are running, and takes appropriate action automatically. You can customize which
exceptions are detected, and the corresponding actions. By default, LSF does not detect
any exceptions.
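The opt-in nature of exception detection can be illustrated with a short sketch. Everything here is an assumption made for the example: the threshold names, the two exception types ("overrun" and "idle"), and the way they are computed are hypothetical and are not LSF configuration parameters. The point shown is only that a threshold left unset means the corresponding exception is never detected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueThresholds:
    # Hypothetical per-queue settings; None means "do not detect".
    overrun_minutes: Optional[float] = None  # flag jobs running longer than this
    idle_factor: Optional[float] = None      # flag jobs whose CPU/run-time ratio is below this


def detect_exceptions(run_minutes: float, cpu_minutes: float,
                      thresholds: QueueThresholds) -> List[str]:
    """Return the exceptions detected for one running job."""
    found: List[str] = []
    if thresholds.overrun_minutes is not None and run_minutes > thresholds.overrun_minutes:
        found.append("overrun")
    if (thresholds.idle_factor is not None and run_minutes > 0
            and cpu_minutes / run_minutes < thresholds.idle_factor):
        found.append("idle")
    return found


# With default (unset) thresholds nothing is detected, mirroring the default behavior.
default_result = detect_exceptions(600.0, 1.0, QueueThresholds())
configured_result = detect_exceptions(
    600.0, 1.0, QueueThresholds(overrun_minutes=120.0, idle_factor=0.1))
```

A caller would then map each detected exception to a configured action, such as notifying an administrator.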
See “Handling Host-level Job Exceptions” on page 124 and “Handling Job Exceptions”
on page 137 for more information about host-level and job-level exception management.