HP XC System Software Administration Guide Version 2.1


• Node partitions (only one by default)

• Authentication (MUNGE is used by default)

SLURM uses the MUNGE package to authenticate users between nodes in the system. Both
MUNGE and SLURM require files that contain encrypted keys. The names of the SLURM files
are configured in the /hptc_cluster/slurm/etc/slurm.conf file; the MUNGE
key file is /opt/hptc/munge/etc/keys/.munge_key. These files must be replicated
on every node in the HP XC system, which should occur by default through SystemImager;
see Chapter 7 for more information on software distribution.
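The replication requirement above can be spot-checked by comparing a checksum of the key file from every node. The following is a minimal sketch, not an HP XC tool: the function name keys_consistent and the "node checksum" input format are assumptions made for illustration, and how the per-node lines are gathered (for example, by running md5sum on the key file on each node) is left to the administrator.

```shell
# keys_consistent (hypothetical helper, not part of HP XC or MUNGE)
# Reads "node checksum" pairs on standard input, one per line (assumed format).
# Succeeds only if exactly one distinct checksum appears, that is, if every
# listed node holds an identical copy of the key file.
keys_consistent() {
    awk '
        NF >= 2 && !seen[$2]++ { n++ }      # count each distinct checksum once
        END { if (n == 1) exit 0; exit 1 }  # pass only when all nodes agree
    '
}
```

A non-zero exit status means that at least one node reported a different checksum (or that no input was supplied), which suggests the key file was not replicated correctly.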

SLURM and MUNGE expect the following characteristics of the system configuration. Errors
may result unless all these conditions are true:

• Each node must be synchronized to the correct time. Communication errors occur if the
node clocks differ.

• User authentication must be available on every node. If not, non-root users will be unable
to run jobs.

• The /hptc_cluster directory must be properly shared. It is exported from the head
node and mounted on all the other nodes. If this directory is not properly shared, the
slurm.conf configuration file will not be found and errors will result.

• On systems using Quadrics system interconnects, the
/opt/hptc/libelanhosts/etc/elanhosts file must be properly configured with the spconfig
command, as described in the HP XC System Software Installation Guide. Otherwise,
system interconnect errors will occur, and SLURM must be restarted. See Section 11.6 for
more information.
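The shared-directory condition in the list above can be tested on any node by checking that the SLURM configuration file is readable there. This is a minimal sketch under stated assumptions: conf_readable is a hypothetical helper name, and the default path is the slurm.conf location given in this section.

```shell
# conf_readable [PATH] (hypothetical helper, not part of HP XC or SLURM)
# Succeeds if the SLURM configuration file is readable on the local node.
# PATH defaults to the slurm.conf location named in this section.
conf_readable() {
    [ -r "${1:-/hptc_cluster/slurm/etc/slurm.conf}" ]
}
```

If the function fails on a node other than the head node, a likely cause is that /hptc_cluster is not mounted on that node.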

16.2.2 SLURM Run-Time Troubleshooting

The following describes how to overcome problems reported by SLURM while the HP XC
system is running.

Healthy node is down

The most common reason for SLURM to list an apparently healthy node as down is
that a specified resource has dropped below the level defined for the node in the
/hptc_cluster/slurm/etc/slurm.conf file.

For example, if the temporary disk space specification is TmpDisk=4096, but the
available temporary disk space falls below 4 GB on the system, SLURM marks the node
as down.
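The TmpDisk comparison in this example can be reproduced by hand. The following is a minimal sketch, assuming that TmpDisk is expressed in megabytes (consistent with 4096 corresponding to 4 GB above) and that the temporary file system is /tmp; the function names tmpdisk_ok and tmp_available_mb are hypothetical and not part of SLURM.

```shell
# tmpdisk_ok REQUIRED_MB AVAILABLE_MB (hypothetical helper)
# Succeeds if the available temporary space meets the TmpDisk value.
tmpdisk_ok() {
    required_mb=$1
    available_mb=$2
    [ "$available_mb" -ge "$required_mb" ]
}

# tmp_available_mb [DIR] (hypothetical helper)
# Prints the space available on the file system holding DIR (default /tmp),
# in whole megabytes, using the portable POSIX output format of df.
tmp_available_mb() {
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { print int($4 / 1024) }'
}
```

On a node, running tmpdisk_ok 4096 "$(tmp_available_mb)" then returns a non-zero status when less than 4096 MB is available, which is the situation in which the node would be expected to be marked down.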

SLURM refuses to operate on some nodes

If SLURM refuses to operate on some or all nodes, and the log files in
/var/slurm/log complain of problems with credentials, execute the following
command to confirm that all nodes display the same time:

# cexec -a date

A matter of a few seconds is inconsequential, but SLURM is unable to recognize the

credentials of nodes within the HP XC system that are more than five minutes out

of synchronization.
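The five-minute limit described above can be checked mechanically once each node's time is available as seconds since the epoch. The following is a minimal sketch: skewed_nodes is a hypothetical helper, the "node epoch_seconds" input format is an assumption, and the per-node values would need to be collected separately (for example, with date +%s on each node).

```shell
# skewed_nodes REFERENCE_EPOCH [LIMIT_SECONDS] (hypothetical helper)
# Reads "node epoch_seconds" pairs on standard input (assumed format) and
# prints the name of every node whose clock differs from REFERENCE_EPOCH by
# more than LIMIT_SECONDS. The default limit is 300 seconds (five minutes).
skewed_nodes() {
    awk -v ref="$1" -v limit="${2:-300}" '
        NF >= 2 {
            d = $2 - ref
            if (d < 0) d = -d          # absolute difference in seconds
            if (d > limit) print $1    # node is outside the allowed window
        }
    '
}
```

For example, with the head node's time as the reference, any node name printed by skewed_nodes is a candidate for the credential errors described in this section.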

Checking SLURM daemons

Use the following command to confirm that your control daemons are up and running:

# scontrol ping
