HP XC System Software Administration Guide Version 3.0

# for the data file
#
JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
JobAcctParameters="Frequency=10"
.
.
.
g. Save the file.
5. Restart the slurmctld and slurmd daemons:
# cexec -a "service slurm restart"
Monitoring SLURM
The SLURM squeue, sinfo, and scontrol utilities and the Nagios system monitoring utility provide the
means for monitoring and controlling SLURM on your HP XC system.
For status at a glance, the Nagios system monitor provides a global view of your system and includes details
about the state of SLURM. “Nagios” (page 62) provides information about Nagios on the HP XC system.
You can run the scontrol utility to confirm that your control daemons are active. In the following example,
node n5, which runs the primary slurmctld, and node n8, which runs the backup, are both up.
# scontrol ping
Slurmctld(primary/backup) at n5/n8 are UP/UP
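The check above can be scripted so that a monitoring job fails when a controller is not up. The following is a minimal sketch, not part of the HP XC software: the function name check_slurmctld is hypothetical, and it assumes that the scontrol ping output has the form shown in the example, with the states in the last field separated by slashes.

```shell
# check_slurmctld (hypothetical helper): read `scontrol ping` output on
# stdin and succeed only if every controller listed is reported UP.
# Assumes the output form shown in the example above:
#   Slurmctld(primary/backup) at n5/n8 are UP/UP
check_slurmctld() {
    awk '
        /^Slurmctld/ {
            seen = 1
            # The last field holds the states, for example "UP/UP".
            n = split($NF, state, "/")
            for (i = 1; i <= n; i++)
                if (state[i] != "UP")
                    bad = 1
        }
        END {
            if (seen && !bad)
                exit 0
            exit 1
        }
    '
}
```

A typical use would be: scontrol ping | check_slurmctld || echo "a slurmctld daemon is not UP"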
The sinfo command reports the status of both nodes and partitions. Consider this example:
# sinfo --all
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf up infinite 122 idle n[5-16,18-127]
lsf up infinite 1 down n17
swaptest up infinite 4 idle n[1-4]
In this example, node n17 is down.
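On a large system it can be convenient to filter the sinfo output rather than read it by eye. The sketch below is illustrative only: the function name down_nodes is hypothetical, and it assumes the six-column layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST).

```shell
# down_nodes (hypothetical helper): read `sinfo` output on stdin and print
# "partition: nodelist" for every row whose STATE column begins with "down".
# Assumes the default six-column layout shown in the example above.
down_nodes() {
    # NR > 1 skips the header row; $5 is STATE, $1 PARTITION, $6 NODELIST.
    awk 'NR > 1 && $5 ~ /^down/ { print $1 ": " $6 }'
}
```

For example, sinfo --all | down_nodes applied to the output above would print the single line "lsf: n17".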
The squeue utility reports the state of jobs currently running under SLURM control. For more information
about the squeue utility, see squeue(1).
The SLURM log files in /var/slurm/log on each node are helpful for diagnosing specific problems. The
slurmctld.log and slurmd.log files record entries from their respective daemons. Both log files
have the following format:
[date and time stamp] Log Entry
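Because every entry begins with a bracketed time stamp, the logs are easy to filter with standard tools. The following sketch is an assumption-laden illustration rather than a supplied utility: the function name log_errors is hypothetical, and it assumes only that problem entries contain the word "error" in some letter case.

```shell
# log_errors (hypothetical helper): read a SLURM log on stdin, keep the
# entries that mention "error" (any case), and strip the leading
# "[date and time stamp] " prefix so that repeated messages line up.
log_errors() {
    grep -i 'error' | sed 's/^\[[^]]*\] *//'
}
```

For example, log_errors < /var/slurm/log/slurmd.log | sort | uniq -c would summarize how often each error message occurs on a node.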
Draining Nodes
Use the SLURM scontrol command to change a node's state. SLURM provides DRAIN and DOWN states
for taking nodes out of service. Draining a node means that the current job is allowed to finish on that node
while no other jobs are scheduled for that node.
A node might need to be drained for a variety of reasons. For example, you may want exclusive use of
a node to perform diagnostics on it, or you may need to replace it.
To drain one or more nodes, use the scontrol command as follows:
# scontrol update NodeName=nodelist State=drain Reason="describe reason here"
See “Interpreting the nodelist Parameter” (page 27) for a discussion on the use of the nodelist parameter.
The reason that you provide for the node draining is displayed by the sinfo command. Be brief but
descriptive.
In Example 12-2, node n17 is drained so that it can be removed from service for maintenance:
Example 12-2. Draining a Node
# scontrol update nodename=n17 state=drain reason="maintenance"
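If nodes are drained often, a small wrapper can make sure that a reason is always supplied. This is a minimal sketch under stated assumptions: the function name drain_nodes and the SCONTROL override variable are hypothetical and are not part of SLURM or HP XC. Only the scontrol update syntax shown above is used.

```shell
# drain_nodes NODELIST REASON... (hypothetical helper)
# Refuses to run without a reason, then issues the scontrol command shown
# above. SCONTROL is a hypothetical override for dry runs; setting it to
# "echo" prints the arguments instead of executing scontrol.
drain_nodes() {
    if [ "$#" -lt 2 ]; then
        echo "usage: drain_nodes nodelist reason" >&2
        return 2
    fi
    nodelist=$1
    shift
    # All remaining words form the reason text.
    reason=$*
    ${SCONTROL:-scontrol} update NodeName="$nodelist" State=drain Reason="$reason"
}
```

For example, drain_nodes n17 maintenance issues the same request as Example 12-2, while running it with SCONTROL set to echo only prints the arguments that would be passed.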
After the node has drained, use the scontrol command to remove the node from service. Example 12-3
shows the command to remove the node that was drained in Example 12-2.