HP XC System Software Administration Guide Version 2.1

# for the data file

JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log

JobAcctParameters="Frequency=10"

g. Save the file.

5. Restart the slurmctld and slurmd daemons:

# cexec -a "service slurm restart"

11.5 Managing SLURM

The SLURM squeue, sinfo, and scontrol utilities and the Nagios system monitoring utility provide the means for monitoring and controlling SLURM on your HP XC system.

11.5.1 Monitoring SLURM

For status at a glance, the Nagios system monitor provides a global view of your system and includes details about the state of SLURM. Section 6.2.2 provides information about Nagios on the HP XC system.

You can run the scontrol utility to confirm that your control daemons are active. In the following example, node n5, which runs the primary slurmctld, and node n8, which runs the backup, are both up.

# scontrol ping

Slurmctld(primary/backup) at n5/n8 are UP/UP
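The check above can be scripted. The following is a minimal sketch, not part of the HP XC software: the function name ctld_all_up is hypothetical, and it assumes that the status is the last word of the scontrol ping output and that a daemon that is not responding is reported as DOWN in that word (for example, UP/DOWN).

```shell
# ctld_all_up (hypothetical helper): read one line of "scontrol ping"
# output on stdin and succeed only if no control daemon is reported DOWN.
ctld_all_up() {
    # Fail if there is no input at all.
    read -r line || [ -n "$line" ] || return 1
    # Assumption: the last word is the status, such as "UP/UP".
    case "${line##* }" in
        *DOWN*) return 1 ;;
        UP*)    return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage on a live system (requires scontrol in the PATH):
#   scontrol ping | ctld_all_up || echo "a slurmctld daemon is not responding"
```

The function reads standard input rather than calling scontrol itself, so it can be tried against saved output before being used in a monitoring script.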

The sinfo command reports the status of both nodes and partitions. Consider this example:

# sinfo --all

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST

lsf up infinite 122 idle n[5-16,18-127]

lsf up infinite 1 down n17

swaptest up infinite 4 idle n[1-4]

In this example, node n17 is down.
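On a large system, it can be easier to filter the report than to scan it by eye. The following sketch assumes the six-column layout shown in the example above (STATE in column 5, NODELIST in column 6); the function name down_nodes is hypothetical.

```shell
# down_nodes (hypothetical helper): read "sinfo" output on stdin and
# print the NODELIST of every row whose STATE begins with "down".
down_nodes() {
    # Skip the header row; column 5 is STATE, column 6 is NODELIST.
    awk 'NR > 1 && $5 ~ /^down/ { print $6 }'
}

# Usage on a live system (requires sinfo in the PATH):
#   sinfo --all | down_nodes
```

Applied to the example output above, this prints only n17.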

The squeue utility reports the state of jobs currently running under SLURM's control. See the squeue(1) manpage for more information about this command.

The SLURM log files on each node in /var/slurm/log are helpful for diagnosing specific problems. The log files slurmctld.log and slurmd.log record entries from their respective daemons. Both log files have the following format:

[date and time stamp] Log Entry
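Because every entry starts with a bracketed time stamp, the logs can be searched with standard text tools. The sketch below is illustrative only: the function name log_errors is hypothetical, and the assumption that problem entries contain the word "error" is not taken from this guide; substitute whatever text you are looking for.

```shell
# log_errors (hypothetical helper): print log entries whose text, after
# the "[date and time stamp] " prefix, contains "error" (any case).
# Reads the named files, or stdin when no file is given.
log_errors() {
    grep -i '^\[[^]]*] .*error' "$@"
}

# Usage on a node (log location as described above):
#   log_errors /var/slurm/log/slurmd.log
```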

11.5.2 Changing a Node’s State

Use the scontrol command to take a node out of service and to return it to service by changing its state.

SLURM provides a 'drain' state for taking nodes out of service. Draining a node means that the current job is allowed to finish on that node while no other jobs are scheduled for that node.

The following command removes node n17 from service:

# scontrol update nodename=n17 state=drain reason="maintenance"

Setting the parameter state=drain causes SLURM to allow any active jobs to complete. After the node has drained, use the following command to take the node out of service:

# scontrol update nodename=n17 state=down
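The two commands above can be combined into one step that waits for the node to finish draining. This is a sketch under stated assumptions, not a procedure from this guide: the function name drain_and_down and the POLL_SECONDS variable are hypothetical, and it assumes that your version of sinfo accepts the -h (no header), -n (node name), and -o %T (state) options and reports "draining" while jobs are still running and "drained" once they have finished. Check the sinfo(1) manpage on your system before relying on it.

```shell
# drain_and_down NODE [REASON] (hypothetical helper):
# drain NODE, wait until its active jobs have finished, then mark it down.
drain_and_down() {
    node=$1
    reason=${2:-maintenance}

    # Stop new jobs from being scheduled on the node.
    scontrol update nodename="$node" state=drain reason="$reason" || return 1

    # Assumption: sinfo reports "drained" once no jobs remain on the node.
    until sinfo -h -n "$node" -o %T | grep -q '^drained'; do
        sleep "${POLL_SECONDS:-30}"
    done

    # Take the node out of service.
    scontrol update nodename="$node" state=down
}

# Usage (as root, on a node where scontrol and sinfo are available):
#   drain_and_down n17 "maintenance"
```

Polling is used here only because it keeps the sketch short; the interval defaults to 30 seconds and can be changed by setting POLL_SECONDS.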
