Platform LSF Administration Guide Version 6.2

Chapter 43
Error and Event Logging
Administering Platform LSF
Duplicate Logging of Event Logs
To recover from server failures, host reboots, or mbatchd restarts, LSF uses
information stored in lsb.events. To improve the reliability of LSF, you can
configure LSF to maintain copies of these logs, to use as a backup.
If the host that contains the primary copy of the logs fails, LSF will continue to operate
using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update
the primary copies.
How duplicate logging works
By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR
resides on a reliable file server that also contains other critical applications necessary for
running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a
secondary issue.
LSB_SHAREDIR must be accessible from all potential LSF master hosts.
When you configure duplicate logging, the duplicates are kept on the file server, and the
primary event logs are stored on the first master host. In other words, LSB_LOCALDIR
is used to store the primary copy of the batch state information, and the contents of
LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central
file server. This has the following effects:
Creates backup copies of lsb.events
Reduces the load on the central file server
Increases the load on the LSF master host
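The arrangement described above can be pictured with a small toy model. The following Python sketch is illustrative only and does not describe how mbatchd is implemented: the function name log_event is hypothetical, and two temporary directories stand in for LSB_LOCALDIR and LSB_SHAREDIR. It appends each event to a primary file, then refreshes the duplicate, and keeps going if the duplicate location cannot be reached.

```python
import os
import shutil
import tempfile

EVENT_FILE = "lsb.events"


def log_event(local_dir, shared_dir, event):
    """Append an event to the primary log, then refresh the duplicate.

    Hypothetical helper for illustration. Returns True if the duplicate
    was refreshed, False if the shared directory was unreachable. The
    primary log is updated in both cases.
    """
    primary = os.path.join(local_dir, EVENT_FILE)
    with open(primary, "a") as f:
        f.write(event + "\n")
    try:
        # Copy the whole primary file over the duplicate.
        shutil.copy2(primary, os.path.join(shared_dir, EVENT_FILE))
        return True
    except OSError:
        # Stand-in for "file server unavailable": keep logging locally.
        return False


if __name__ == "__main__":
    local_dir = tempfile.mkdtemp()   # stands in for LSB_LOCALDIR
    shared_dir = tempfile.mkdtemp()  # stands in for LSB_SHAREDIR
    print(log_event(local_dir, shared_dir, "event A"))
```

In this model every write touches the local disk first, which mirrors the trade-off listed above: less work for the central file server, more for the master host.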
Failure of file server
If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs.
Client commands such as bhist, which directly read LSB_SHAREDIR, will not work.
When the file server recovers, the current log files are replicated to LSB_SHAREDIR.
Failure of first master host
If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become
unavailable. Then, a new master host is selected. The new master host uses the duplicate
files (in LSB_SHAREDIR) to restore its state and to log future events. There is no
duplication by the second or any subsequent LSF master hosts.
When the first master host becomes available after a failure, it will update the primary
copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continue operations
as before.
If the first master host does not recover, LSF will continue to use the files in
LSB_SHAREDIR, but there is no more duplication of the log files.
Simultaneous failure of both hosts
If the master host containing LSB_LOCALDIR and the file server containing
LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
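The failure cases above can be summarized as a small decision table. The Python sketch below is only a restatement of those cases for illustration. The function log_source and its return convention are hypothetical and are not part of LSF.

```python
def log_source(first_master_up, file_server_up):
    """Summarize the failure cases described in this section.

    Returns a tuple (directory the active mbatchd works from,
    whether duplication is currently taking place). The directory
    is None when LSF is unavailable.
    """
    if first_master_up and file_server_up:
        # Normal operation: primary in LSB_LOCALDIR, replica in LSB_SHAREDIR.
        return ("LSB_LOCALDIR", True)
    if first_master_up:
        # File server down: jobs are still processed; the replica is
        # brought up to date when the file server recovers.
        return ("LSB_LOCALDIR", False)
    if file_server_up:
        # First master down: a new master uses the duplicates and
        # does not duplicate them further.
        return ("LSB_SHAREDIR", False)
    # Both down at the same time: LSF is unavailable.
    return (None, False)


if __name__ == "__main__":
    for master_up in (True, False):
        for server_up in (True, False):
            print(master_up, server_up, log_source(master_up, server_up))
```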
Network partitioning
We assume that network partitioning does not cause a cluster to split into two
independent clusters, each simultaneously running mbatchd.
Such a split may happen given certain network topologies and failure modes. For example,
connectivity is lost between the first master, M1, and both the file server and the
secondary master, M2. Both M1 and M2 will run mbatchd service with M1 logging