LSF Version 7.3 - Administering Platform LSF


Duplicate Logging of Event Logs


To recover from server failures, host reboots, or mbatchd restarts, LSF uses information stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain copies of these logs, to use as a backup.

If the host that contains the primary copy of the logs fails, LSF continues to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.

How duplicate logging works

By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue. LSB_SHAREDIR must be accessible from all potential LSF master hosts.

When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:

◆ Creates backup copies of lsb.events

◆ Reduces the load on the central file server

◆ Increases the load on the LSF master host
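
The primary-plus-replica arrangement described above can be pictured with a small Python model. This is only an illustrative sketch, not how mbatchd is implemented: the function names `log_event` and `demo`, the temporary directories standing in for LSB_LOCALDIR and LSB_SHAREDIR, and the simplified event strings are all assumptions made for the example. It also copies the whole file on every event, which a real system would be unlikely to do.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def log_event(event: str, local_dir: Path, share_dir: Path,
              name: str = "lsb.events") -> None:
    """Append one event to the primary copy, then refresh the duplicate."""
    primary = local_dir / name
    # The primary copy lives in the directory standing in for LSB_LOCALDIR.
    with primary.open("a", encoding="utf-8") as fh:
        fh.write(event + "\n")
    # Toy replication: overwrite the replica in the directory standing in
    # for LSB_SHAREDIR with the full contents of the primary.
    shutil.copyfile(primary, share_dir / name)


def demo() -> Tuple[str, str]:
    """Log two placeholder events and return (primary, replica) contents."""
    with tempfile.TemporaryDirectory() as local, \
            tempfile.TemporaryDirectory() as share:
        local_dir, share_dir = Path(local), Path(share)
        # Placeholder event text, not the real lsb.events record format.
        log_event("JOB_NEW 101", local_dir, share_dir)
        log_event("JOB_START 101", local_dir, share_dir)
        primary_text = (local_dir / "lsb.events").read_text(encoding="utf-8")
        replica_text = (share_dir / "lsb.events").read_text(encoding="utf-8")
        return primary_text, replica_text
```

Calling `demo()` returns two identical strings. In the model, every write is made on the host holding the primary copy and then copied to the file server, which matches the last effect listed above: the extra copying work falls on the master host.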

Failure of file server

If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which read LSB_SHAREDIR directly, will not work.

When the file server recovers, the current log files are replicated to LSB_SHAREDIR.

Failure of first master host

If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. Then, a new master host is selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. There is no duplication by the second or any subsequent LSF master hosts.

When the first master host becomes available after a failure, it updates the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continues operations as before.

If the first master host does not recover, LSF continues to use the files in LSB_SHAREDIR, but there is no more duplication of the log files.

Simultaneous failure of both hosts

If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
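
The failure cases in this section can be condensed into a small decision table. The Python sketch below is a reading aid under stated assumptions, not LSF code: `LogSource`, `active_log`, and `duplication_active` are hypothetical names, and the model covers only the steady state after a failure, not the resynchronization that happens when a host recovers.

```python
from enum import Enum


class LogSource(Enum):
    LOCALDIR = "LSB_LOCALDIR"    # primary copy on the first master host
    SHAREDIR = "LSB_SHAREDIR"    # duplicate copy on the file server
    UNAVAILABLE = "unavailable"  # neither copy can be reached


def active_log(first_master_up: bool, file_server_up: bool) -> LogSource:
    """Return which copy of the event log LSF works from."""
    if first_master_up:
        # The first master host keeps logging to its primary copy,
        # even while the file server is down.
        return LogSource.LOCALDIR
    if file_server_up:
        # A new master host restores state from the duplicate copy
        # and logs future events there.
        return LogSource.SHAREDIR
    # Both the first master host and the file server are down.
    return LogSource.UNAVAILABLE


def duplication_active(first_master_up: bool, file_server_up: bool) -> bool:
    """Duplicates are written only while both hosts are up."""
    return first_master_up and file_server_up


if __name__ == "__main__":
    for master_up in (True, False):
        for server_up in (True, False):
            print(master_up, server_up,
                  active_log(master_up, server_up).value,
                  duplication_active(master_up, server_up))
```

Running the module prints one row per combination of host states. Only the row where both hosts are down yields an unavailable log, and duplication is reported only when both hosts are up.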

Network partitioning

We assume that network partitioning does not cause a cluster to split into two independent clusters, each simultaneously running mbatchd.