
Healthy node is down
The most common reason for SLURM to list an apparently healthy node as down is that a specified resource has dropped below the level defined for the node in the /hptc_cluster/slurm/etc/slurm.conf file. For example, if the temporary disk space specification is TmpDisk=4096 but the available temporary disk space on the node falls below 4 GB, SLURM marks the node as down.
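To see why SLURM set a node down and to check the resource in question, commands such as the following can help; sinfo -R lists the reason recorded for each down or drained node, n15 is a placeholder node name, and the df check assumes that temporary space is provided by /tmp:
# sinfo -R
# scontrol show node n15
# df -h /tmp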
SLURM refuses to operate on some nodes
If SLURM refuses to operate on some or all nodes, and the log files in /var/slurm/log report problems with credentials, execute the following command to confirm that all nodes display the same time:
# cexec -a date
A difference of a few seconds is inconsequential, but SLURM cannot recognize the credentials of nodes within the HP XC system that are more than 5 minutes out of synchronization.
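Once the clocks are synchronized again, a node that was marked down because of credential errors can be returned to service with scontrol; n15 is a placeholder for the affected node:
# scontrol update NodeName=n15 State=RESUME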
Checking SLURM daemons
Use the following command to confirm that your control daemons are up and running:
# scontrol ping
Checking node and partition status
Use the following command to examine the status of your nodes and partitions:
# sinfo --all
20.6 LSF-HPC Troubleshooting
Take the following steps if you have trouble submitting jobs or controlling LSF-HPC with SLURM; command examples for several of these steps follow the list:
• Ensure that the number of nodes in the lsf partition is less than or equal to the number of nodes reported in the XC.lic file. Sample entries follow:
  INCREMENT XC-CPUS Compaq auth.number exp. date nodes ...
  INCREMENT XC-PROCESSORS Compaq auth.number exp. date nodes ...
  The value for nodes in the XC-CPUS or XC-PROCESSORS entry specifies the number of licensed nodes for this system. If this value does not match the actual number of nodes, the LSF service may fail to start LSF. (Example below.)
• Use the lshosts command to determine the number of processors reported by LSF.
• Ensure that the date and time are synchronized throughout the HP XC system.
• Verify that the /hptc_cluster directory (file system) is properly mounted on all nodes; SLURM relies on this file system. (Example below.)
• Ensure that SLURM is configured, up, and running properly.
• Examine the SLURM log files in the /var/slurm/log/ directory on the SLURM master node for any problems. (Example below.)
• If the sinfo command reports that a node is down while its daemons are running, compare the number of processors actually available on the node with the Procs setting in the slurm.conf file. (Example below.)
• Ensure that the lsf partition is configured correctly. (Example below.)
• Verify that system licensing is operational; use the lmstat -a command.
• Ensure that munge is running on all compute nodes. (Example below.)
• If you are experiencing LSF communication problems, check for potential firewall issues. (Example below.)
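To cross-check the licensed node count, you can compare the nodes field in the license file with the size of the lsf partition and the hosts that LSF reports; the license file path below is a placeholder for wherever XC.lic is installed on your system:
# grep INCREMENT /path/to/XC.lic
# sinfo -p lsf -h -o "%D"
# lshosts
The nodes value in the XC-CPUS or XC-PROCESSORS entry should be greater than or equal to the node count that sinfo reports for the lsf partition.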
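To verify the /hptc_cluster mount on every node, you can run df cluster-wide with the cexec command used earlier; a node on which the file system is not mounted shows a different file system (typically the root file system) in the df output:
# cexec -a df /hptc_cluster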
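The following example searches the SLURM logs for errors; the file names shown assume the common slurmctld.log and slurmd.log names, so adjust them to match the SlurmctldLogFile and SlurmdLogFile settings in your slurm.conf file:
# grep -i error /var/slurm/log/slurmctld.log
# grep -i error /var/slurm/log/slurmd.log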
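To compare a down node's processor count with its Procs setting and to review the lsf partition, commands such as the following can be used; n15 is a placeholder for the node that sinfo reports as down:
# scontrol show node n15
# grep -i "procs=" /hptc_cluster/slurm/etc/slurm.conf
# scontrol show partition lsf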
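To confirm that munge is running on all compute nodes, you can search for its daemon cluster-wide; this assumes the daemon process is named munged:
# cexec -a pgrep munged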
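To look for firewall rules that could interfere with LSF communication, list the packet-filtering rules on the nodes involved; the ports LSF uses (for example, LSF_LIM_PORT) are defined in the lsf.conf file:
# iptables -L -n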