slurm.conf The SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
This file contains all the information that describes how SLURM is
configured on HP XC systems, including the following:
• Logging (syslog is the default logging mechanism)
• Debug level (debug levels range from 1 to 7; the default is 3)
• Nodes (all nodes are listed by default)
• Node partitions (only one by default)
• Authentication (MUNGE is used by default)
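The following excerpt is a simplified sketch of the kinds of entries involved; the node and
partition names shown are illustrative only, so consult the slurm.conf file on your system for
the actual parameters and values:
ControlMachine=n16
AuthType=auth/munge
SlurmctldDebug=3
SlurmdDebug=3
NodeName=n[1-16] Procs=2 TmpDisk=4096
PartitionName=lsf Nodes=n[1-16] Default=YES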
SLURM uses the MUNGE package to authenticate users between nodes in the system. Both MUNGE and
SLURM require files that contain encrypted keys. The names of the SLURM files are configured in the
/hptc_cluster/slurm/etc/slurm.conf file. The MUNGE key file is
/opt/hptc/munge/etc/keys/.munge_key. These files must be replicated on every node in the HP XC
system, which should occur by default through SystemImager; see Chapter 8: Distributing Software
Throughout the System (page 79) for more information on software distribution.
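One way to verify that the key files are identical on every node is to compare their checksums
cluster-wide; for example:
# cexec -a md5sum /opt/hptc/munge/etc/keys/.munge_key
All nodes should report the same checksum for the file.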
SLURM and MUNGE expect the system configuration to have the following characteristics. Errors can
result unless all these conditions are true (the example after the list shows one way to verify the
first and third conditions):
• Each node must be synchronized to the correct time. Communication errors occur if the node clocks
differ.
• User authentication must be available on every node. If not, non-root users will be unable to run jobs.
• The /hptc_cluster directory must be properly shared. It is exported from the head node and mounted
on all the other nodes. If this directory is not properly shared, the slurm.conf configuration file will
not be found and errors will result.
• On systems using Quadrics system interconnects, the /opt/hptc/libelanhosts/etc/elanhosts
file must be properly configured with the spconfig command, as described in the HP XC System
Software Installation Guide. Otherwise, system interconnect errors will occur, and you must restart
SLURM. See “Configuring SLURM System Interconnect Support” (page 103) for more information.
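For example, the following commands report the clock setting and the /hptc_cluster mount on every
node, which makes it easy to spot a node that is out of step (adjust to your own verification
procedure as needed):
# cexec -a date
# cexec -a df /hptc_cluster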
SLURM Run-Time Troubleshooting
The following describes how to overcome problems reported by SLURM while the HP XC system is running:
Healthy node is down The most common reason for SLURM to list an apparently
healthy node as down is that a specified resource has dropped
below the level defined for the node in the
/hptc_cluster/slurm/etc/slurm.conf file.
For example, if the temporary disk space specification is
TmpDisk=4096, but the available temporary disk space
falls below 4 GB on the node, SLURM marks the node as down.
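Commands such as the following can help identify why SLURM considers a node
to be down and what resources that node is reporting (the node
name n5 is an example):
# sinfo -R
# scontrol show node n5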
SLURM refuses to operate on some nodes If SLURM refuses to operate on some or all nodes, and the
log files in /var/slurm/log report problems with
credentials, execute the following command to confirm that
all nodes display the same time:
# cexec -a date
A difference of a few seconds is inconsequential, but SLURM is
unable to recognize the credentials of nodes within the HP
XC system that are more than 5 minutes out of
synchronization.
Checking SLURM daemons Use the following command to confirm that your control
daemons are up and running:
# scontrol ping
Checking node and partition status Use the following command to examine the status of your
nodes and partitions:
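# sinfo
(The sinfo command is the standard SLURM utility for reporting partition and node states;
its output varies with the configuration of your system.)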