HP XC System Software Administration Guide Version 3.1
20 Troubleshooting.......................................................................................................229
20.1 General Troubleshooting..................................................................................................................229
20.1.1 Mismatched Secure Shell Keys..................................................................................................229
20.2 Nagios Troubleshooting...................................................................................................................229
20.2.1 Determining the Status of the Nagios Service.............................................................................229
20.2.2 Nagios Fails to Start.................................................................................................................230
20.2.3 Nagios Log Files......................................................................................................................230
20.2.4 Running Nagios Plug-Ins Manually..........................................................................................230
20.2.5 Using the nrg Command's Analyze Mode..................................................................................231
20.3 Messages Reported by Nagios..........................................................................................................232
20.4 System Interconnect Troubleshooting................................................................................................235
20.4.1 Myrinet System Interconnect Troubleshooting...........................................................................235
20.4.2 Quadrics System Interconnect Troubleshooting..........................................................................236
20.4.3 InfiniBand System Interconnect Troubleshooting.......................................................................238
20.5 SLURM Troubleshooting..................................................................................................................240
20.5.1 SLURM Configuration Issues....................................................................................................240
20.5.2 SLURM Run-Time Troubleshooting..........................................................................................240
20.6 LSF-HPC Troubleshooting................................................................................................................241
21 Servicing the HP XC System....................................................................................243
21.1 Adding a Node................................................................................................................................243
21.2 Replacing a Client Node...................................................................................................................244
21.3 Replacing a System Interconnect Board in a CP6000 System..............................................................246
21.4 Software RAID Disk Replacement.....................................................................................................246
21.4.1 Replacing a RAID Disk.............................................................................................................246
21.4.2 Writing a Boot Block to the RAID Disk......................................................................................248
A Installing LSF-HPC with SLURM into an Existing Standard LSF Cluster...............251
A.1 Assumptions....................................................................................................................................251
A.2 Requirement....................................................................................................................................252
A.3 Sample Case.....................................................................................................................................252
A.4 HP XC Preparation...........................................................................................................................252
A.5 Installing LSF-HPC with SLURM.......................................................................................................256
A.6 Perform Post Installation Tasks..........................................................................................................259
A.7 Configuring the LSF Alias.................................................................................................................260
A.8 Starting LSF on the HP XC System.....................................................................................................261
A.9 Sample Running Jobs........................................................................................................................261
A.10 Troubleshooting..............................................................................................................................262
B Installing Standard LSF on a Subset of Nodes.......................................................265
B.1 Requirements...................................................................................................................................265
B.2 Assumptions....................................................................................................................................265
B.3 Sample Case.....................................................................................................................................266
B.4 Instructions......................................................................................................................................266
C Setting Up MPICH.....................................................................................................271
C.1 Downloading the MPICH Source Files...............................................................................................271
C.2 Building MPICH on the HP XC System..............................................................................................271
C.3 Running the MPICH Self-Tests..........................................................................................................272
C.4 Installing MPICH..............................................................................................................................272
D HP MCS Monitoring..................................................................................................273
D.1 Customizing the Configuration for Your Installation...........................................................................273
D.2 Regenerating the Nagios MCS Configuration.....................................................................................274