HP XC System Software Administration Guide Version 4.0

XC Administration Guide
Table of Contents
About This Document
- Intended Audience
- New and Changed Information in This Edition
- Typographic Conventions
- HP XC and Related HP Products Information
- Related Information
- Manpages
- HP Encourages Your Comments
1 HP XC Administration Environment
- 1.1 Understanding Nodes, Services, and Roles
- 1.2 File System
  - 1.2.1 Key Operating System Directories
  - 1.2.2 Log Files
- 1.3 HP XC Command Environment
- 1.4 Configuration and Management Database
- 1.5 HP XC Configuration File Guidelines
- 1.6 Installation and Software Distribution
  - 1.6.1 Determining the Installation Type
  - 1.6.2 Software Distribution
- 1.7 Improved Availability
- 1.8 Networking
- 1.9 Modulefiles
- 1.10 Security
- 1.11 Recommended Administrative Tasks
2 Improved Availability
- 2.1 Purpose of the Availability Tool
- 2.2 Services Eligible for Improved Availability
- 2.3 Availability Sets
  - 2.3.1 Determining Which Nodes Are in the Availability Set
  - 2.3.2 Reconfiguring an Availability Set
- 2.4 HP Serviceguard Tasks
- 2.5 Transferring Control of Services
3 Starting Up and Shutting Down the HP XC System
- 3.1 Understanding the Node States
- 3.2 Starting Up the HP XC System
- 3.3 Shutting Down the HP XC System
- 3.4 Shutting Down One or More Nodes
- 3.5 Determining a Node's Power Status
- 3.6 Locating a Given Node
- 3.7 Disabling and Enabling a Node
4 Managing and Customizing System Services
- 4.1 HP XC System Services
- 4.2 Displaying Services Information
- 4.3 Restarting a Service
- 4.4 Stopping a Service
- 4.5 Global System Services
- 4.6 Customizing Services and Roles
5 Managing Licenses
- 5.1 License Manager and License File
- 5.2 Determining If the License Manager Is Running
- 5.3 Starting and Stopping the License Manager
6 Managing the Configuration and Management Database
- 6.1 Accessing the Configuration and Management Database
- 6.2 Querying the Configuration and Management Database
- 6.3 Finding and Setting System Attribute Values
- 6.4 Backing Up the Configuration Database
- 6.5 Restoring the Configuration Database from a Backup File
- 6.6 Archiving Sensor Data from the Configuration Database
- 6.7 Restoring the Sensor Data from an Archive File
- 6.8 Purging Sensor Data from the Configuration and Management Database
- 6.9 Dumping the Configuration and Management Database
7 Monitoring the System
- 7.1 Monitoring Tools
- 7.2 Monitoring Strategy
- 7.3 Displaying System Environment Data
- 7.4 Monitoring Disks
- 7.5 Displaying System Statistics
- 7.6 Logging Node Events
- 7.7 The collectl Utility
- 7.8 Using HP Graph To Display Network Bandwidth and System Use
- 7.9 The resmon Utility
- 7.10 The kdump Mechanism and the crash Utility
8 Monitoring the System with Nagios
- 8.1 Nagios Overview
- 8.2 Using the Nagios Web Interface
- 8.3 Adjusting the Nagios Configuration
- 8.4 Configuring Nagios on HP XC Systems
- 8.5 Using the Nan Notification Aggregator and Delimiter To Control Nagios Messages
- 8.6 Nagios Report Generator Utility
- 8.7 Modifying Nagios To Effect Changes
9 Network Administration
- 9.1 Network Address Translation Administration
- 9.2 Network Time Protocol Service
- 9.3 Changing the External IP Address of a Head Node
- 9.4 Modifying Sendmail
- 9.5 Bonding Ethernet Network Interface Cards for Failover
10 Managing Patches and RPM Updates
- 10.1 Sources for Software Packages and Information
- 10.2 Digital Signature
- 10.3 Downloading and Installing Patches
- 10.4 Rebuild Kernel Dependent Modules
- 10.5 Rebuilding Serviceguard Modules
11 Distributing Software Throughout the System
- 11.1 Overview of the Image Replication and Distribution Environment
- 11.2 Installing and Distributing Software Patches
- 11.3 Adding Software or Modifying Files on the Golden Client
- 11.4 Determining Which Nodes Will Be Imaged
- 11.5 Updating the Golden Image
- 11.6 Propagating the Golden Image to All Nodes
  - 11.6.1 Using the Full Imaging Installation
  - 11.6.2 Using the cexec Command
- 11.7 Maintaining a Global Service Configuration
12 Opening an IP Port in the Firewall
- 12.1 Open Ports
- 12.2 Opening Ports in the Firewall
  - 12.2.1 Opening a Temporary Port in the Firewall
  - 12.2.2 Opening an IP Port in the Firewall Persistently
13 Connecting to a Remote Console
- 13.1 Console Management Facility
- 13.2 Accessing a Remote Console
14 Managing Local User Accounts and Passwords
- 14.1 HP XC User and Group Accounts
- 14.2 General Procedures for Administering Local User Accounts
- 14.3 Adding a Local User Account
- 14.4 Modifying a Local User Account
- 14.5 Deleting a Local User Account
- 14.6 Configuring the ssh Keys for a User
- 14.7 Synchronizing the NIS Database
- 14.8 Changing Administrative Passwords
15 Managing SLURM
- 15.1 Overview of SLURM
- 15.2 Configuring SLURM
- 15.3 Restricting User Access to Nodes
- 15.4 Job Accounting
- 15.5 Monitoring SLURM
- 15.6 Draining Nodes
- 15.7 Configuring the SLURM Epilog Script
- 15.8 Maintaining the SLURM Daemon Log
- 15.9 Enabling SLURM to Recognize a New Node
- 15.10 Removing SLURM
16 Managing LSF
- 16.1 Standard LSF
- 16.2 LSF with SLURM
  - 16.2.1 Integration of LSF with SLURM
- 16.3 Switching the Type of LSF Installed
- 16.4 LSF with SLURM Installation
- 16.5 LSF with SLURM Startup and Shutdown
  - 16.5.1 Starting Up LSF with SLURM
  - 16.5.2 Shutting Down LSF with SLURM
- 16.6 Controlling the LSF with SLURM Service
- 16.7 Launching Jobs with LSF with SLURM
- 16.8 Monitoring and Controlling LSF with SLURM Jobs
- 16.9 Maintaining Shell Prompts in LSF Interactive Shells
- 16.10 Job Accounting
- 16.11 LSF Daemon Log Maintenance
- 16.12 Load Indexes and Resource Information
- 16.13 LSF with SLURM Monitoring
- 16.14 LSF with SLURM Failover
- 16.15 Moving SLURM and LSF Daemons to Their Backup Nodes
- 16.16 Enhancing LSF with SLURM
  - 16.16.1 LSF with SLURM Enhancement Settings
  - 16.16.2 Thresholds in LSF with SLURM and SLURM Interplay
- 16.17 Configuring an External Virtual Host Name for LSF with SLURM on HP XC Systems
17 Managing Modulefiles
18 Mounting File Systems
- 18.1 Overview of the Network File System on the HP XC System
- 18.2 Understanding the Global fstab File
- 18.3 Mounting Internal File Systems Throughout the HP XC System
  - 18.3.1 Understanding the csys Utility in the Mounting Instructions
  - 18.3.2 Mounting Internal File Systems
- 18.4 Mounting Remote File Systems
  - 18.4.1 Understanding the Mounting Instructions
  - 18.4.2 Mounting a Remote File System
- 18.5 Improved Availability of the /hptc_cluster File System
19 Managing Software RAID Arrays
- 19.1 Overview of Software RAID
  - 19.1.1 Software RAID-0
  - 19.1.2 Software RAID-1
- 19.2 Installing Software RAID on the Head Node
- 19.3 Installing Software RAID on Client Nodes
- 19.4 Examining a Software RAID Array
- 19.5 Error Reporting
- 19.6 Removing Software RAID from Client Nodes
20 Using Diagnostic Tools
- 20.1 Using the sys_check Utility
- 20.2 Using the ovp Utility for System Verification
- 20.3 Using the dgemm Utility to Analyze Performance
- 20.4 Using the System Interconnect Diagnostic Tools
21 Troubleshooting
- 21.1 General Troubleshooting
- 21.2 Nagios Troubleshooting
- 21.3 Messages Reported by Nagios
- 21.4 System Interconnect Troubleshooting
- 21.5 Improved Availability Issues
- 21.6 SLURM Troubleshooting
  - 21.6.1 SLURM Configuration Issues
  - 21.6.2 SLURM Run-Time Troubleshooting
- 21.7 LSF Troubleshooting
22 Servicing the HP XC System
- 22.1 Adding a Node
- 22.2 Replacing a Client Node
- 22.3 Actualizing Planned Nodes
- 22.4 Replacing a Server Blade Enclosure OnBoard Administrator
- 22.5 Replacing a System Interconnect Board in an HP CP6000 System
- 22.6 Software RAID Disk Replacement
  - 22.6.1 Replacing a RAID Disk
  - 22.6.2 Writing a Boot Block to the RAID Disk
- 22.7 Incorporating External Network Interface Cards
A Installing LSF with SLURM into an Existing Standard LSF Cluster
- A.1 Assumptions
- A.2 Requirement
- A.3 Sample Case
- A.4 HP XC Preparation
- A.5 Installing LSF with SLURM
- A.6 Perform Post Installation Tasks
- A.7 Configuring the LSF Alias
- A.8 Starting LSF on the HP XC System
- A.9 Sample Running Jobs
- A.10 Troubleshooting
B Setting Up MPICH
- B.1 Downloading the MPICH Source Files
- B.2 Building MPICH on the HP XC System
- B.3 Running the MPICH Self-Tests
- B.4 Installing MPICH
C HP MCS Monitoring
- C.1 Customizing the Configuration for Your Installation
- C.2 Regenerating the Nagios MCS Configuration
- C.3 Useful Administrative Commands
- C.4 MCS Log Files
- C.5 Nagios Plug-Ins for MCS
D CPU Frequency-Based Power-Saving Feature
Glossary
Index

Table 16-1 LSF with SLURM Interpretation of SLURM Node States (continued)

Node: In Use
Description: A node in any of the following states:
- ALLOCATED: The node is allocated to a job.
- COMPLETING: The node is allocated to a job that is in the process of completing. The node state is removed when all the job processes have ended and the SLURM epilog program (if any) has ended.
- DRAINING: The node is currently running a job but will not be allocated to additional jobs. The node state changes to state DRAINED when the last job on it completes.

Node: Unavailable
Description: A node that is not available for use; its status is one of the following:
- DOWNED: The node is not available for use.
- DRAINED: The node is not available for use per system administrator request.
- UNKNOWN: The SLURM controller has just started and the node state is not yet determined.
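
The grouping in this part of Table 16-1 can be expressed as a simple lookup. The following Python sketch is illustrative only: the state names and the two categories come from the table, while the dictionary and the function name `lsf_category` are hypothetical and are not part of LSF or SLURM. Because the table is continued from an earlier page, states listed there are not covered and return `None`.

```python
# Illustrative lookup built from the rows of Table 16-1 shown above.
# The names below are hypothetical helpers, not an LSF or SLURM API.
SLURM_STATE_TO_LSF_CATEGORY = {
    "ALLOCATED": "In Use",
    "COMPLETING": "In Use",
    "DRAINING": "In Use",
    "DOWNED": "Unavailable",
    "DRAINED": "Unavailable",
    "UNKNOWN": "Unavailable",
}


def lsf_category(slurm_state):
    """Return the table category for a SLURM node state.

    Returns None for any state that is not in this part of the table.
    """
    return SLURM_STATE_TO_LSF_CATEGORY.get(slurm_state.strip().upper())


if __name__ == "__main__":
    for state in ("ALLOCATED", "draining", "DRAINED"):
        print(state, "->", lsf_category(state))
```

Note that DRAINING counts as In Use while DRAINED counts as Unavailable: a draining node is still running a job.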

16.2.1.4 LSF with SLURM Failover

Failover of the LSF component of the integrated LSF with SLURM product is a critical concern because only one node in the HP XC system runs the LSF with SLURM daemons. During installation, you select the primary LSF execution host from the nodes on the HP XC system that have the resource management role. That node can also be a compute node, but this configuration is not recommended. Other nodes that have the resource management role are designated as potential backup LSF execution hosts.

To address this concern, LSF with SLURM is configured on HP XC with a virtual host name (vhost) and a virtual IP (vIP). The virtual IP and host name are used because they can be moved from one node to another while maintaining a consistent LSF interface. By default, the virtual IP is an internal IP on the HP XC administration network, and the virtual host name is lsfhost.localdomain. The LSF execution host is first configured to host the vIP; the LSF with SLURM daemons are then started on that node.

The Nagios infrastructure contains a module that monitors the LSF with SLURM virtual IP. If the module detects a problem with the virtual IP (for example, the virtual IP does not respond to a ping), the monitoring code assumes the node is down. It then chooses a new LSF execution host from the backup candidate nodes, sets up the virtual IP on that node, and restarts LSF with SLURM there.

See “LSF with SLURM Failover” (page 203) for more information.
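
The failover decision described in this section can be sketched in a few lines of Python. This is a simplified illustration, not the Nagios module that HP XC ships: the function name, the two probe callables, and the node names are all hypothetical, and picking the first usable backup candidate is an assumption, because this section does not say how the new host is chosen.

```python
def choose_execution_host(current_host, backup_candidates, vip_reachable, node_is_up):
    """Return the node that should host the LSF virtual IP.

    vip_reachable: callable with no arguments; True if the virtual IP
        answers a probe (for example, a ping).
    node_is_up: callable taking a node name; True if that node is usable.

    Both callables are placeholders for real checks. Returns None when
    the virtual IP is unreachable and no backup candidate is usable.
    """
    if vip_reachable():
        # The virtual IP responds, so the current host keeps the service.
        return current_host
    # The virtual IP does not respond: assume the current host is down
    # and pick a backup candidate (first usable one; an assumption).
    for node in backup_candidates:
        if node != current_host and node_is_up(node):
            return node
    return None


if __name__ == "__main__":
    # Hypothetical node names; the probes are stubbed with lambdas.
    new_host = choose_execution_host(
        "n16", ["n15", "n14"],
        vip_reachable=lambda: False,
        node_is_up=lambda node: node == "n14",
    )
    print(new_host)
```

After a new host is chosen, the real mechanism sets up the virtual IP on that node and restarts LSF with SLURM there; those steps are outside this sketch.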

16.3 Switching the Type of LSF Installed

The HP XC system installation process offers a choice of two types of LSF. The default choice is LSF with SLURM, which requires that SLURM be installed and configured when you run the cluster_config utility. The second type is Standard LSF, which does not interact with SLURM.

If you made the wrong LSF selection while running the cluster_config utility, perform the following procedure to remove the installed type of LSF and install the other type:

1. Log in as superuser (root) on the head node.
