LSF Version 7.3 - Administering Platform LSF

Error and Event Logging
This may happen given certain network topologies and failure modes. For example, suppose connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 will run the mbatchd service, with M1 logging events to LSB_LOCALDIR and M2 logging to LSB_SHAREDIR. When connectivity is restored, the changes made by M2 to LSB_SHAREDIR will be lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.
The archived event files are only available on LSB_LOCALDIR, so in the case of network partitioning, commands such as bhist cannot access these files. As a precaution, you should periodically copy the archived files from LSB_LOCALDIR to LSB_SHAREDIR.
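
The periodic copy described above could be scripted along the following lines. This is a minimal sketch, not an LSF-provided tool: the function name copy_archived_events and both directory arguments are hypothetical, and the exact subdirectories that hold lsb.events.n under LSB_LOCALDIR and LSB_SHAREDIR are an assumption you should check against your own cluster.

```shell
#!/bin/sh
# Sketch: copy archived event logs (lsb.events.1, lsb.events.2, ...) from a
# local directory to a shared directory.
# ASSUMPTION: both directories are passed in by the caller; this script does
# not know where LSB_LOCALDIR or LSB_SHAREDIR keep their event files.

copy_archived_events() {
    src=$1
    dst=$2

    # Both directories must already exist.
    [ -d "$src" ] && [ -d "$dst" ] || return 1

    # Match only archived logs (numeric suffix), not the active lsb.events.
    for f in "$src"/lsb.events.[0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -f "$f" ] || continue
        name=${f##*/}

        # Copy only when the destination copy is missing or differs.
        if [ ! -f "$dst/$name" ] || ! cmp -s "$f" "$dst/$name"; then
            cp -p "$f" "$dst/$name" || return 1
        fi
    done
    return 0
}
```

A call such as copy_archived_events /path/to/local/logdir /path/to/shared/logdir (both paths hypothetical) could then be run from cron on the first master host. The active lsb.events file is deliberately left alone, since LSF synchronizes it itself.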
Setting an event update interval
If NFS traffic is too high and you want to reduce network traffic, use EVENT_UPDATE_INTERVAL in lsb.params to specify how often to back up the data and synchronize the LSB_SHAREDIR and LSB_LOCALDIR directories.
The directories are always synchronized when data is logged to the files, or when
mbatchd is started on the first LSF master host.
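
As an illustration, an lsb.params entry for this parameter might look like the fragment below. The value 300 is a hypothetical choice, and it is assumed here that the interval is given in seconds; confirm the unit and valid range in the lsb.params reference for your LSF version before using it.

```
Begin Parameters
# ASSUMPTION: interval in seconds; 300 is an illustrative value only
EVENT_UPDATE_INTERVAL = 300
End Parameters
```

A longer interval reduces NFS traffic but widens the window in which LSB_SHAREDIR can lag behind LSB_LOCALDIR.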
Automatic archiving and duplicate logging
Event logs
Archived event logs, lsb.events.n, are not replicated to LSB_SHAREDIR. If LSF starts a new event log while the file server containing LSB_SHAREDIR is down, you might notice a gap in the historical data in LSB_SHAREDIR.
Configure duplicate logging
To enable duplicate logging, set LSB_LOCALDIR in lsf.conf to a directory on the first master host (the first host configured in lsf.cluster.cluster_name) that will be used to store the primary copies of lsb.events. This directory should only exist on the first master host.
1 Edit lsf.conf and set LSB_LOCALDIR to a local directory that exists only on the first master host.
2 Use the commands lsadmin reconfig and badmin mbdrestart to make the changes take effect.
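
The two steps above can be sketched as follows. The directory /usr/share/lsbatch/loginfo is only an example path, not a required location; substitute a directory that exists on your first master host and nowhere else.

```
# Step 1: in lsf.conf (example path; choose your own local directory)
LSB_LOCALDIR=/usr/share/lsbatch/loginfo

# Step 2: run as the LSF administrator to apply the change
lsadmin reconfig
badmin mbdrestart
```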
LSF Job Termination Reason Logging
When a job finishes, LSF reports the last job termination action it took against the job and logs it into lsb.acct.
If a running job exits because of node failure, LSF sets the correct exit information in lsb.acct, lsb.events, and the job output file.
View logged job exit information (bacct -l)
1 Use bacct -l to view job exit information logged to lsb.acct:
bacct -l 7265
Accounting information about jobs that are: