HP XC System Software Administration Guide Version 3.0


# for the data file

JobAcctLoc=/hptc_cluster/slurm/job/jobacct.log

JobAcctParameters="Frequency=10"

g. Save the file.

5. Restart the slurmctld and slurmd daemons:

# cexec -a "service slurm restart"

Monitoring SLURM

The SLURM squeue, sinfo, and scontrol utilities and the Nagios system monitoring utility provide the

means for monitoring and controlling SLURM on your HP XC system.

For status at a glance, the Nagios system monitor provides a global view of your system and includes details

about the state of SLURM. “Nagios” (page 62) provides information about Nagios on the HP XC system.

You can run the scontrol utility to confirm that your control daemons are active. In the following example,

node n5, which runs the primary slurmctld, and node n8, which runs the backup, are both up.

# scontrol ping

Slurmctld(primary/backup) at n5/n8 are UP/UP
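For unattended checks, the output of scontrol ping can be tested in a script. The following is a minimal sketch, assuming the output keeps the single-line form shown above. The function name check_ctld is hypothetical and is not part of SLURM or the HP XC software.

```shell
# check_ctld: read `scontrol ping` output on stdin and succeed only if
# a status was found and no controller is reported DOWN.
# Assumes the one-line format shown above ("... are UP/UP").
# The function name is hypothetical.
# Intended use:  scontrol ping | check_ctld
check_ctld() {
    # Extract the trailing status field, for example "UP/UP".
    state=$(sed -n 's/^Slurmctld.* are \([A-Z/]*\)$/\1/p')
    case "$state" in
        ""|*DOWN*)
            echo "slurmctld problem: ${state:-no status found}"
            return 1
            ;;
        *)
            echo "slurmctld OK: $state"
            return 0
            ;;
    esac
}
```

The function returns a nonzero exit status when either daemon is reported DOWN or when no status line is recognized, so it can be used in a cron job or a monitoring wrapper.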

The sinfo command reports the status of both nodes and partitions. Consider this example:

# sinfo --all

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST

lsf up infinite 122 idle n[5-16,18-127]

lsf up infinite 1 down n17

swaptest up infinite 4 idle n[1-4]

In this example, node n17 is down.
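On a large system it can be convenient to filter the sinfo output for nodes that are down. The following is a minimal sketch, assuming the six-column layout shown in the example above. The function name down_nodes is hypothetical.

```shell
# down_nodes: read `sinfo` output on stdin and print the partition name
# and node list of every row whose STATE column starts with "down".
# Assumes the column order shown above:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# The function name is hypothetical.
# Intended use:  sinfo --all | down_nodes
down_nodes() {
    awk '$5 ~ /^down/ { print $1, $6 }'
}
```

Applied to the example output above, this prints the single line "lsf n17".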

The squeue utility reports the state of jobs currently running under SLURM control. For more information

about the squeue utility, see squeue(1).

The SLURM log files, located in /var/slurm/log on each node, are helpful for diagnosing specific problems. The

slurmctld.log and slurmd.log files record entries from their respective daemons. Both log files

have the following format:

[date and time stamp] Log Entry
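Because each entry begins with a bracketed time stamp, the log files are easy to filter with standard text tools. The following is a minimal sketch that prints only entries mentioning an error, with the time stamp removed. It assumes that the time stamp is enclosed in square brackets at the start of the line and that relevant entries contain the word "error"; the function name log_errors is hypothetical.

```shell
# log_errors: read slurmctld.log or slurmd.log lines on stdin and print
# the text of every entry containing "error" (case-insensitive), with the
# leading [time stamp] removed.
# Assumes the "[date and time stamp] Log Entry" format described above.
# The function name is hypothetical.
# Intended use:  log_errors < /var/slurm/log/slurmctld.log
log_errors() {
    grep -i 'error' | sed 's/^\[[^]]*] *//'
}
```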

Draining Nodes

Use the SLURM scontrol command to change a node's state. SLURM provides DRAIN and DOWN states

for taking nodes out of service. Draining a node means that the current job is allowed to finish on that node

while no other jobs are scheduled for that node.

A node might need to be drained for a variety of reasons. For example, you may want exclusive use of

a node to perform diagnostics on it, or you may need to replace it.

To drain one or more nodes, use the scontrol command as follows:

# scontrol update NodeName=nodelist State=drain Reason="describe reason here"

See “Interpreting the nodelist Parameter” (page 27) for a discussion on the use of the nodelist parameter.

The reason that you provide for the node draining is displayed by the sinfo command. Be brief but

descriptive.

In Example 12-2, node n17 is drained so that it can be removed from service for maintenance:

Example 12-2. Draining a Node

# scontrol update nodename=n17 state=drain reason="maintenance"
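If nodes are drained often, a small wrapper can make sure that a reason is always supplied. The following is a minimal sketch that only prints the scontrol command in the syntax shown above, so that it can be reviewed before it is run. The function name drain_cmd is hypothetical.

```shell
# drain_cmd: print (do not run) the scontrol command that drains a node
# list. Refuses an empty node list or an empty reason, because the reason
# is what sinfo displays for the drained node.
# The function name is hypothetical.
# Intended use:  drain_cmd n17 "maintenance"
drain_cmd() {
    nodelist=$1
    reason=$2
    if [ -z "$nodelist" ] || [ -z "$reason" ]; then
        echo "usage: drain_cmd nodelist reason" >&2
        return 1
    fi
    printf 'scontrol update NodeName=%s State=drain Reason="%s"\n' \
        "$nodelist" "$reason"
}
```

Printing the command rather than running it is a deliberate choice in this sketch: the output can be inspected and then pasted at the root prompt.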

After the node has drained, use the scontrol command to remove the node from service. Example 12-3

shows the command to remove the node that was drained in Example 12-2.
