LSF Version 7.3 - Administering Platform LSF
Duplicate Logging of Event Logs
To recover from server failures, host reboots, or mbatchd restarts, LSF uses
information stored in lsb.events. To improve the reliability of LSF, you can
configure LSF to maintain duplicate copies of these logs to use as a backup.
If the host that contains the primary copy of the logs fails, LSF will continue to
operate using the duplicate logs. When the host recovers, LSF uses the duplicate
logs to update the primary copies.
How duplicate logging works
By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR
resides on a reliable file server that also contains other critical applications
necessary for running jobs, so if that host becomes unavailable, the subsequent
failure of LSF is a secondary issue.
LSB_SHAREDIR must be accessible from all potential LSF master hosts.
When you configure duplicate logging, the duplicates are kept on the file server, and
the primary event logs are stored on the first master host. In other words,
LSB_LOCALDIR is used to store the primary copy of the batch state information, and
the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which
resides on a central file server. This has the following effects:
◆ Creates backup copies of lsb.events
◆ Reduces the load on the central file server
◆ Increases the load on the LSF master host
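As a rough illustration of the primary and replica relationship described above, the following Python sketch appends events to a primary log in a local directory and then copies it to a second directory. It is a toy model, not how mbatchd performs duplication: the temporary directories merely stand in for LSB_LOCALDIR and LSB_SHAREDIR, the function names are hypothetical, and the event strings are placeholders rather than the real lsb.events record format.

```python
import shutil
import tempfile
from pathlib import Path

LOG_NAME = "lsb.events"


def log_event(local_dir: Path, event: str) -> None:
    """Append one event record to the primary log in the local directory."""
    with open(local_dir / LOG_NAME, "a", encoding="utf-8") as f:
        f.write(event + "\n")


def replicate(local_dir: Path, shared_dir: Path) -> None:
    """Copy the primary log over the replica (stand-in for LSF's duplication)."""
    shutil.copy2(local_dir / LOG_NAME, shared_dir / LOG_NAME)


def demo() -> bool:
    """Write two placeholder events, replicate, and report whether both copies match."""
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as shared:
        local_dir, shared_dir = Path(local), Path(shared)  # stand-ins for LSB_LOCALDIR / LSB_SHAREDIR
        log_event(local_dir, "placeholder event 1")
        log_event(local_dir, "placeholder event 2")
        replicate(local_dir, shared_dir)
        primary = (local_dir / LOG_NAME).read_text(encoding="utf-8")
        replica = (shared_dir / LOG_NAME).read_text(encoding="utf-8")
        return primary == replica
```

In this model all writes go to the local directory and the shared directory only receives copies, which mirrors the stated trade-off: less load on the central file server, more on the master host.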
Failure of file server
If the file server containing LSB_SHAREDIR goes down, LSF continues to process
jobs. Client commands such as bhist, which read LSB_SHAREDIR directly, will not
work.
When the file server recovers, the current log files are replicated to
LSB_SHAREDIR.
Failure of first master host
If the first master host fails, the primary copies of the files (in LSB_LOCALDIR)
become unavailable. Then, a new master host is selected. The new master host uses
the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events.
There is no duplication by the second or any subsequent LSF master hosts.
When the first master host becomes available after a failure, it will update the
primary copies of the files (in LSB_LOCALDIR) from the duplicates (in
LSB_SHAREDIR) and continue operations as before.
If the first master host does not recover, LSF will continue to use the files in
LSB_SHAREDIR, but there is no more duplication of the log files.
Simultaneous failure of both hosts
If the master host containing LSB_LOCALDIR and the file server containing
LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
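The three failure cases above can be summarized as a small decision table. The Python sketch below is only a model of the behavior described in this section, not LSF code; the names LogSource, active_log_source, and duplication_active are hypothetical and exist purely for illustration.

```python
from enum import Enum


class LogSource(Enum):
    """Which copy of the event logs the acting master works from."""
    LOCALDIR = "LSB_LOCALDIR"   # primary copy on the first master host
    SHAREDIR = "LSB_SHAREDIR"   # duplicate copy on the central file server
    NONE = "unavailable"        # neither copy can be reached


def active_log_source(first_master_up: bool, file_server_up: bool) -> LogSource:
    """Return the log copy in use for a given combination of host states."""
    if first_master_up:
        # The first master keeps processing jobs from its local primary copy,
        # even while the file server is down.
        return LogSource.LOCALDIR
    if file_server_up:
        # A new master restores state from, and logs to, the duplicate copy.
        return LogSource.SHAREDIR
    return LogSource.NONE


def duplication_active(first_master_up: bool, file_server_up: bool) -> bool:
    """Duplication needs both the first master host and the file server."""
    return first_master_up and file_server_up
```

For example, active_log_source(False, True) yields LogSource.SHAREDIR while duplication_active(False, True) is False, matching the statement that the second and subsequent master hosts do not duplicate the logs.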
Network partitioning
We assume that network partitioning does not cause a cluster to split into two
independent clusters, each simultaneously running mbatchd.