Platform LSF Administration Guide Version 6.2


Chapter 43

Error and Event Logging

Administering Platform LSF

Duplicate Logging of Event Logs

To recover from server failures, host reboots, or mbatchd restarts, LSF uses information stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain copies of these logs to use as a backup.

If the host that contains the primary copy of the logs fails, LSF continues to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.

How duplicate logging works

By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue.

LSB_SHAREDIR must be accessible from all potential LSF master hosts.

When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:

◆ Creates backup copies of lsb.events

◆ Reduces the load on the central file server

◆ Increases the load on the LSF master host
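
The replication flow described above can be pictured with a small Python model. This is only an illustrative sketch, not a description of how mbatchd is implemented: the function name log_event, the placeholder event records, and the temporary directories standing in for LSB_LOCALDIR and LSB_SHAREDIR are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path

EVENT_FILE = "lsb.events"


def log_event(local_dir: Path, share_dir: Path, record: str) -> bool:
    """Append a record to the primary log, then try to refresh the replica.

    Returns True if the replica was updated, or False if the shared
    directory could not be written (the primary write still succeeds).
    """
    primary = local_dir / EVENT_FILE
    with primary.open("a") as fh:
        fh.write(record + "\n")
    try:
        # Copy the whole primary file over the replica (hypothetical strategy).
        shutil.copy2(primary, share_dir / EVENT_FILE)
        return True
    except OSError:
        # Replica location unreachable: keep logging locally.
        return False


# Hypothetical stand-ins for LSB_LOCALDIR and LSB_SHAREDIR.
local_dir = Path(tempfile.mkdtemp(prefix="localdir_"))
share_dir = Path(tempfile.mkdtemp(prefix="sharedir_"))

replicated = log_event(local_dir, share_dir, "event-1")
primary_text = (local_dir / EVENT_FILE).read_text()
replica_text = (share_dir / EVENT_FILE).read_text()
```

In this model every write lands on the local disk first, which is why the load moves from the file server to the master host; the replica is refreshed afterward and may lag if the shared directory is unreachable.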

Failure of file server

If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which read LSB_SHAREDIR directly, will not work.

When the file server recovers, the current log files are replicated to LSB_SHAREDIR.

Failure of first master host

If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. Then, a new master host is selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. There is no duplication by the second or any subsequent LSF master hosts.

When the first master host becomes available after a failure, it updates the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continues operations as before.

If the first master host does not recover, LSF continues to use the files in LSB_SHAREDIR, but there is no more duplication of the log files.

Simultaneous failure of both hosts

If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
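
The failure cases described in this section can be summarized as a small decision function. The following Python sketch is a reading aid under stated assumptions, not LSF code: the names LogPlan and plan_logging are hypothetical, and the two boolean inputs simplify host and file server health to up or down.

```python
from typing import NamedTuple, Optional


class LogPlan(NamedTuple):
    active_log_dir: Optional[str]  # where the running master logs events
    duplicating: bool              # whether a second copy is being maintained


def plan_logging(first_master_up: bool, file_server_up: bool) -> LogPlan:
    """Map host availability to the logging behavior described in the text."""
    if first_master_up and file_server_up:
        # Normal operation: primary in LSB_LOCALDIR, replica in LSB_SHAREDIR.
        return LogPlan("LSB_LOCALDIR", True)
    if first_master_up:
        # File server down: jobs continue; the replica is refreshed
        # only after the file server recovers.
        return LogPlan("LSB_LOCALDIR", False)
    if file_server_up:
        # First master down: the new master uses the duplicates in
        # LSB_SHAREDIR and does not duplicate further.
        return LogPlan("LSB_SHAREDIR", False)
    # Both down: LSF is unavailable.
    return LogPlan(None, False)


normal = plan_logging(first_master_up=True, file_server_up=True)
failover = plan_logging(first_master_up=False, file_server_up=True)
```

Here, normal describes a cluster logging to LSB_LOCALDIR with duplication active, and failover describes a cluster running from LSB_SHAREDIR with no duplication.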

Network partitioning

We assume that network partitioning does not cause a cluster to split into two independent clusters, each simultaneously running mbatchd.

Such a split may happen given certain network topologies and failure modes. For example, connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 will run mbatchd service with M1 logging