HP XC System Software Administration Guide Version 4.0

.
#
# o Define the job accounting mechanism
#
JobAcctType=jobacct/log
#
# o Define the location where job accounting logs are to
# be written. For
# - jobacct/none - this parameter is ignored
# - jobacct/log - the fully-qualified file name
# for the data file
#
JobAcctLogfile=/hptc_cluster/slurm/job/jobacct.log
JobAcctFrequency=10
.
.
.
g. Save the file.
5. Restart the slurmctld and slurmd daemons:
# cexec -a "service slurm restart"
15.5 Monitoring SLURM
The SLURM squeue, sinfo, and scontrol commands and the Nagios system monitoring
utility provide the means for monitoring and controlling SLURM on your HP XC system.
For status at a glance, the Nagios system monitor provides a global view of your system and
includes details about the state of SLURM. Chapter 8 (page 105) provides information about
Nagios on the HP XC system.
You can run the scontrol utility to confirm that your control daemons are active. In the following
example, node n5, which runs the primary slurmctld, and node n8, which runs the backup,
are both up.
# scontrol ping
Slurmctld(primary/backup) at n5/n8 are UP/UP
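For scripted checks, the status line shown above can be tested automatically. The following is a minimal sketch, not part of the HP XC software: the function name slurmctld_all_up is hypothetical, and the code assumes the output format shown in the example, with the daemon states in the last field separated by a slash.

```shell
# Hypothetical helper (not an HP XC or SLURM command).
# Reads "scontrol ping" output on standard input and succeeds only if
# a Slurmctld status line is present and every state listed is UP.
# Assumes the line format shown in the example above.
slurmctld_all_up() {
    awk '
        /^Slurmctld/ {
            seen = 1
            # The last field holds the states, separated by "/".
            n = split($NF, states, "/")
            for (i = 1; i <= n; i++)
                if (states[i] != "UP")
                    bad = 1
        }
        END {
            if (seen && !bad)
                exit 0
            exit 1
        }
    '
}

# Usage on a live system:
#   scontrol ping | slurmctld_all_up || echo "a slurmctld daemon is not UP"
```

The function returns a nonzero exit status when no status line is found, so an empty or failed scontrol ping is treated as a problem rather than as success.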
The sinfo command reports the status of both nodes and partitions. Consider this example:
# sinfo --all
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf up infinite 122 idle n[5-16,18-127]
lsf up infinite 1 down n17
swaptest up infinite 4 idle n[1-4]
In this example, node n17 in the lsf partition is down and all other nodes are idle.
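To pick out the down nodes from a longer listing, the sinfo output can be filtered by its STATE column. This is a sketch under the assumption that the output has the six-column layout shown above; the function name sinfo_down_nodes is hypothetical.

```shell
# Hypothetical helper (not an HP XC or SLURM command).
# Reads sinfo output on standard input and prints the NODELIST field of
# every row whose STATE field begins with "down".
# Assumes the six-column layout shown in the example above.
sinfo_down_nodes() {
    awk '$1 != "PARTITION" && $5 ~ /^down/ { print $6 }'
}

# Usage on a live system:
#   sinfo --all | sinfo_down_nodes
```

Applied to the example output, the function prints n17.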
The squeue utility reports the state of jobs currently running under SLURM's control. For
more information about the squeue utility, see squeue(1).
The SLURM log files in /var/slurm/log on each node are helpful for diagnosing specific
problems. The slurmctld.log and slurmd.log files record entries from their respective
daemons. Both of these log files have the following format:
[date and time stamp] Log Entry
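Because every entry begins with a bracketed stamp, the entry text can be separated from it when searching the logs. The following sketch assumes only the format given above; the function name log_entries is hypothetical.

```shell
# Hypothetical helper (not an HP XC or SLURM command).
# Reads log lines on standard input and prints only the entry text,
# dropping the leading "[...]" date and time stamp and the spaces after it.
# Lines that do not start with "[" pass through unchanged.
log_entries() {
    sed 's/^\[[^]]*] *//'
}

# Usage on a live system:
#   log_entries < /var/slurm/log/slurmd.log | grep -i error
```

Stripping the stamp first keeps a search pattern from matching text inside the date and time field.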
15.6 Draining Nodes
Use the SLURM scontrol command to change a node's state. SLURM provides DRAIN and
DOWN states for taking nodes out of service. Draining a node means that the current job is allowed
to finish on that node while no other jobs are scheduled for that node.
You might need to drain a node for a variety of reasons. For example, you may want exclusive
use of a node to perform diagnostics on it, or you may need to replace it.
To drain one or more nodes, use the scontrol command as follows: