HP XC System Software Administration Guide Version 3.0

The bstop command suspends the execution of a running job.
The bresume command resumes the execution of a suspended job.
For more information, see bkill(1), bstop(1), and bresume(1).
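As a minimal sketch of how the two commands combine, the following shell function suspends a job for a fixed interval and then resumes it. The function name pause_job and its arguments are hypothetical and are not part of LSF; only bstop and bresume, each called with a job ID, come from the text above.

```shell
# Hypothetical helper (not part of LSF): suspend a job, wait, then resume it.
# Usage: pause_job JOB_ID SECONDS
pause_job() {
    bstop "$1" || return 1    # suspend the running job; give up if that fails
    sleep "$2"                # leave the job suspended for the interval
    bresume "$1"              # resume the suspended job
}
```

For example, to suspend job 231 for one minute:
$ pause_job 231 60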
Job Accounting
Standard LSF job accounting is available through the bacct command. In addition, the output of each job
includes a resource usage summary with the total CPU time and memory usage:
$ cat 231.out
.
.
.
Resource usage summary:
CPU time : 8252.65 sec.
Max Memory : 4 MB
Max Swap : 113 MB
.
.
.
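The summary fields can be pulled out of a job output file with standard tools. The following is a sketch only: the function name usage_summary is hypothetical, and it assumes that the "CPU time" and "Max Memory" lines have exactly the layout shown in the sample above, with memory reported in MB.

```shell
# Hypothetical helper: print the CPU time (seconds) and maximum memory (MB)
# found in the resource usage summary of a job output file.
# Assumes lines of the form "CPU time : 8252.65 sec." and "Max Memory : 4 MB".
usage_summary() {
    awk -F':' '
        /^[[:space:]]*CPU time/   { split($2, a, " "); cpu = a[1] }
        /^[[:space:]]*Max Memory/ { split($2, b, " "); mem = b[1] }
        END { printf "cpu_sec=%s max_mem_mb=%s\n", cpu, mem }
    ' "$1"
}
```

For example:
$ usage_summary 231.out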
The LSF bacct command provides accurate job accounting data on the following:
• Jobs submitted by all users
• Jobs accounted on all projects
• Jobs completed normally or exited
• Jobs executed on all hosts
• Jobs submitted to all queues
• Jobs accounted on all service classes
Consider using the -l option of the bacct command to display the accounting data in its long format:
$ bacct -l job_number
For more information, see bacct(1).
LSF-HPC Failover
This section discusses aspects of the LSF-HPC failover mechanism.
Overview of LSF-HPC Monitoring and Failover Support
Note
At least two nodes must have the resource management role to enable LSF-HPC failover. One node is selected
as the master (the primary LSF execution host), and the others are considered backup nodes. At any time,
LSF-HPC daemons start and run only on the master node.
The Nagios LSF-HPC failover module monitors the virtual IP associated with the primary LSF execution host.
When LSF-HPC failover is enabled on the HP XC system and the primary LSF execution host fails, the
Nagios LSF-HPC failover module detects that the node is unresponsive and initiates failover:
1. The Nagios module attempts to contact the node hosting the IP to ensure that LSF-HPC for SLURM is
   shut down and that virtual IP hosting is disabled.
2. A new primary LSF execution host from the backup nodes is selected. The LSF daemons start on the
   backup node.
3. The Nagios module tries to re-establish the virtual IP on the new node.
4. LSF-HPC is restarted on that host.
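The order of the failover steps above can be illustrated with a short shell sketch. It is illustrative only and is not the code of the Nagios LSF-HPC failover module: every function name is a hypothetical placeholder that only prints what the real step would do, and choosing the first listed backup node is an assumption made for the example.

```shell
# Illustrative only: these placeholders print the action instead of performing it.
stop_lsf()  { echo "stop LSF-HPC and release virtual IP on $1"; }
claim_vip() { echo "establish virtual IP on $1"; }
start_lsf() { echo "start LSF-HPC on $1"; }

# Usage: failover OLD_MASTER BACKUP_NODE...
failover() {
    old=$1; shift
    # Failover needs at least one backup node.
    [ $# -ge 1 ] || { echo "no backup node available" >&2; return 1; }
    # Ensure LSF-HPC is down on the old master (best effort, since the
    # failed node may be unreachable).
    stop_lsf "$old" || echo "warning: could not reach $old" >&2
    new=$1                          # assumption: take the first backup node
    claim_vip "$new" || return 1    # re-establish the virtual IP
    start_lsf "$new"                # restart LSF-HPC on the new host
}
```

For example, with hypothetical node names n1 (failed master), n2, and n3:
$ failover n1 n2 n3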