Platform LSF Administration Guide Version 6.2

Chapter 2

How the System Works

Administering Platform LSF

Host failure

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected.

Jobs can be submitted so that they are automatically rerun from the beginning or restarted from a checkpoint on another host if they are lost because of a host failure.

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.
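
The recovery behavior described above can be pictured as replaying an event log. The following Python sketch is a toy model only: the event tuples, the status names, and the recover_job_states function are hypothetical and do not reflect the actual format of lsb.events or the internals of LSF.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event record: (job_id, new_status).
# This is NOT the real lsb.events record format.
Event = Tuple[int, str]


def recover_job_states(events: Iterable[Event]) -> Tuple[Dict[int, str], List[int]]:
    """Replay events in order, then treat jobs still marked RUN as exited.

    Returns the recovered job states and the IDs of the jobs whose
    submitting users should be notified.
    """
    states: Dict[int, str] = {}
    for job_id, status in events:
        states[job_id] = status  # the last event recorded for a job wins

    notify: List[int] = []
    for job_id, status in states.items():
        if status == "RUN":
            # The job was running when the hosts went down,
            # so it is assumed to have exited.
            states[job_id] = "EXIT"
            notify.append(job_id)
        # Jobs in PEND are left untouched and stay in their queues.
    return states, notify


# Example log: job 101 was running, 102 was pending, 103 had finished.
events = [
    (101, "PEND"), (102, "PEND"), (101, "RUN"),
    (103, "PEND"), (103, "RUN"), (103, "DONE"),
]
states, notify = recover_job_states(events)
print(states)
print(notify)
```

In this model only job 101 is marked as exited and queued for user notification; job 102 stays pending and job 103 keeps its finished state.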

Job exception handling

You can configure hosts and queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.
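
As a rough illustration of configurable exception detection, the Python sketch below checks running jobs against user-supplied rules and returns the matching actions. The rule names, job fields, thresholds, and action labels are invented for this example and are not LSF configuration parameters. With no rules configured, nothing is detected, which mirrors the default described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Job:
    job_id: int
    run_minutes: float  # wall-clock time the job has been running
    cpu_minutes: float  # CPU time the job has consumed


# Hypothetical rule: (exception name, predicate on a job, action label).
Rule = Tuple[str, Callable[[Job], bool], str]


def detect_exceptions(jobs: List[Job], rules: List[Rule]) -> Dict[int, List[Tuple[str, str]]]:
    """Return, for each job that matches a rule, its (exception, action) pairs."""
    found: Dict[int, List[Tuple[str, str]]] = {}
    for job in jobs:
        hits = [(name, action) for name, matches, action in rules if matches(job)]
        if hits:
            found[job.job_id] = hits
    return found


jobs = [Job(1, 120.0, 110.0), Job(2, 30.0, 0.1)]

# Illustrative rules only; the thresholds are arbitrary.
rules: List[Rule] = [
    ("overrun", lambda j: j.run_minutes > 60, "notify_admin"),
    ("idle", lambda j: j.run_minutes > 0 and j.cpu_minutes / j.run_minutes < 0.05, "notify_admin"),
]

default_result = detect_exceptions(jobs, [])       # no rules: nothing detected
configured_result = detect_exceptions(jobs, rules)
print(default_result)
print(configured_result)
```

Keeping the rules as data rather than hard-coded checks reflects the idea that both the detected exceptions and the resulting actions are customizable.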

See “Handling Host-level Job Exceptions” on page 124 and “Handling Job Exceptions” on page 137 for more information about host-level and job-level exception management.