HP XC System Software Administration Guide Version 2.1
# for the data file
#
JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log
JobAcctParameters="Frequency=10"
.
.
.
g. Save the file.
5. Restart the slurmctld and slurmd daemons:
# cexec -a "service slurm restart"
11.5 Managing SLURM
The SLURM squeue, sinfo, and scontrol utilities and the Nagios system monitoring
utility provide the means for monitoring and controlling SLURM on your HP XC system.
11.5.1 Monitoring SLURM
For status at a glance, the Nagios system monitor provides a global view of your system and
includes details about the state of SLURM. Section 6.2.2 provides information about Nagios on
the HP XC system.
You can run the scontrol utility to confirm that your control daemons are active. In the
following example, node n5, which runs the primary slurmctld, and node n8, which runs
the backup, are both up.
# scontrol ping
Slurmctld(primary/backup) at n5/n8 are UP/UP
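For scripted health checks, the output of scontrol ping can be tested rather than read by eye. The following is a minimal sketch, not part of the HP XC software: the function name controllers_up is hypothetical, and it assumes the output always has the form shown above, with the slash-separated states as the last field of the line.

```shell
# Hypothetical helper: reads `scontrol ping` output on stdin and succeeds
# only if a Slurmctld line is present and every state it reports is UP.
# Assumes the line format shown in the example above.
controllers_up() {
    awk '
        /^Slurmctld/ {
            seen = 1
            # The last field holds the states, for example "UP/UP".
            n = split($NF, states, "/")
            for (i = 1; i <= n; i++)
                if (states[i] != "UP") bad = 1
        }
        END { if (!seen || bad) exit 1 }
    '
}

# Possible use on a running system:
#   scontrol ping | controllers_up || echo "a SLURM control daemon is not UP"
```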
The sinfo command reports the status of both nodes and partitions. Consider this example:
# sinfo --all
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf up infinite 122 idle n[5-16,18-127]
lsf up infinite 1 down n17
swaptest up infinite 4 idle n[1-4]
In this example, node n17 is down.
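On a large system it can be convenient to extract only the nodes that are down. The following is a minimal sketch, assuming the six-column layout shown in the example above (STATE in column 5, NODELIST in column 6). The function name down_nodes is hypothetical.

```shell
# Hypothetical helper: reads `sinfo` output on stdin and prints the
# NODELIST field of every row whose STATE field begins with "down".
# Assumes the column layout shown in the example above.
down_nodes() {
    awk 'NR > 1 && $5 ~ /^down/ { print $6 }'
}

# Possible use on a running system:
#   sinfo --all | down_nodes
```

Applied to the example output above, this sketch prints n17.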
The squeue utility reports the state of jobs currently running under SLURM’s control. See
the squeue(1) manpage for more information about this command.
The SLURM log files on each node in /var/slurm/log are helpful for diagnosing specific
problems. The log files slurmctld.log and slurmd.log log entries from their respective
daemons. Both these log files have the following format:
[date and time stamp] Log Entry
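Because each entry begins with a bracketed time stamp, the logs are easy to filter with standard tools. The following is a minimal sketch, assuming only that every entry starts with a time stamp enclosed in square brackets. The function name log_entries is hypothetical, and the exact time stamp layout is not assumed.

```shell
# Hypothetical helper: reads a SLURM log on stdin, strips the leading
# "[time stamp]" from each entry, and prints the entries that match the
# pattern given as the first argument (case-insensitive).
log_entries() {
    sed -n 's/^\[[^]]*\] *//p' | grep -i -- "$1"
}

# Possible use on a node:
#   log_entries error < /var/slurm/log/slurmd.log
```

Lines that do not begin with a bracketed time stamp are skipped, so the helper returns only well-formed entries.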
11.5.2 Changing a Node’s State
Use the scontrol command to take a node out of service and to return it to service by
changing its state.
SLURM provides a ‘drain’ state for taking nodes out of service. Draining a node means that the
current job is allowed to finish on that node while no other jobs are scheduled for that node.
The following command removes node n17 from service:
# scontrol update nodename=n17 state=drain reason="maintenance"
Setting the parameter state=drain causes SLURM to allow any active jobs to complete.
After the node has drained, use the following command to take the node out of service:
# scontrol update nodename=n17 state=down