HP XC System Software Administration Guide Version 3.0
The bstop command suspends the execution of a running job.
The bresume command resumes the execution of a suspended job.
For more information, see bkill(1), bstop(1), and bresume(1).
Job Accounting
Standard LSF job accounting is available through the bacct command. In addition, the output file of a
job includes a resource usage summary that reports total CPU time and memory usage:
$ cat 231.out
.
.
.
Resource usage summary:
CPU time : 8252.65 sec.
Max Memory : 4 MB
Max Swap : 113 MB
.
.
.
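The summary fields can be extracted from a job output file with standard text tools. The following is a minimal sketch, assuming the output file uses the "label : value unit" layout shown above. The function name usage_field is hypothetical and is not part of LSF.

```shell
#!/bin/sh
# usage_field LABEL FILE
# Print the numeric value that follows "LABEL :" in a job output file.
# Assumes the "Resource usage summary" layout shown above; usage_field
# is an illustrative helper, not an LSF command.
usage_field() {
    awk -F':' -v label="$1" '
        {
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim blanks around the label
            if (key == label) {
                split($2, parts, " ")           # parts[1] = value, parts[2] = unit
                print parts[1]
                exit
            }
        }
    ' "$2"
}
```

For the output shown above, usage_field "CPU time" 231.out prints 8252.65, and usage_field "Max Swap" 231.out prints 113.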
The LSF bacct command provides accurate job accounting data on the following:
• Jobs submitted by all users
• Jobs accounted on all projects
• Jobs completed normally or exited
• Jobs executed on all hosts
• Jobs submitted to all queues
• Jobs accounted on all service classes
Consider using the -l option of the bacct command to display the accounting data in its long format:
$ bacct -l job_number
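When long-format accounting is needed for several jobs, the same command can be repeated for each job number. The following is a minimal sketch, assuming bacct is on the PATH. The function name long_accounting and the separator line it prints are illustrative only.

```shell
#!/bin/sh
# long_accounting JOB_ID...
# Run "bacct -l" once per job ID and print the reports one after another.
# Arguments that are empty or not purely numeric are skipped with a
# warning on standard error. long_accounting is a hypothetical helper.
long_accounting() {
    for job_id in "$@"; do
        case "$job_id" in
            ''|*[!0-9]*)
                echo "skipping invalid job ID: $job_id" >&2
                continue
                ;;
        esac
        echo "==== Job $job_id ===="
        bacct -l "$job_id"
    done
}
```

For example, long_accounting 231 232 runs bacct -l 231 followed by bacct -l 232.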
For more information, see bacct(1).
LSF-HPC Failover
This section discusses aspects of the LSF-HPC failover mechanism.
Overview of LSF-HPC Monitoring and Failover Support
Note
At least two nodes must have the resource management roles to enable LSF-HPC failover. One is selected
as the master (primary LSF execution host), and the others are considered backup nodes. At any time, LSF-HPC
daemons start and run only on the master node.
The Nagios LSF-HPC failover module monitors the virtual IP associated with the primary LSF execution host.
When LSF-HPC failover is enabled on the HP XC system and the primary LSF execution host fails, the
Nagios LSF-HPC failover module detects that the node is unresponsive and initiates failover:
• The Nagios module attempts to contact the node hosting the IP to ensure that LSF-HPC for SLURM is
shut down and that virtual IP hosting is disabled.
• A new primary LSF execution host is selected from the backup nodes. The LSF daemons start on
that node.
• The Nagios module tries to re-establish the virtual IP on the new node.
• LSF-HPC is restarted on that host.
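The order of these steps can be illustrated with a dry-run sketch that only prints a plan and does not touch any service. It is not the Nagios module's implementation. The function names, the node names passed to them, and the "first node in the list" selection policy are all assumptions made for illustration; this guide does not state how a backup node is chosen.

```shell
#!/bin/sh
# pick_backup FAILED_NODE NODE...
# Print the first listed node that is not the failed one.
# ASSUMPTION: "first in list" is an illustrative policy only; the real
# selection rule used by the failover module is not described here.
pick_backup() {
    failed=$1
    shift
    for node in "$@"; do
        if [ "$node" != "$failed" ]; then
            echo "$node"
            return 0
        fi
    done
    return 1    # no backup node available
}

# failover_plan FAILED_NODE NODE...
# Print the four failover steps in order as a dry run.
failover_plan() {
    new_primary=$(pick_backup "$@") || {
        echo "no backup node available" >&2
        return 1
    }
    echo "1. confirm LSF-HPC is shut down and the virtual IP is released on $1"
    echo "2. select $new_primary as the new primary LSF execution host"
    echo "3. re-establish the virtual IP on $new_primary"
    echo "4. restart LSF-HPC on $new_primary"
}
```

For example, failover_plan n1 n1 n2 n3 prints a plan that moves the primary role from the hypothetical node n1 to n2. With only one node listed, the function returns a nonzero status, which mirrors the requirement that at least two nodes have the resource management role.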