HP XC System Software Administration Guide Version 4.0

XC Administration Guide
Table of Contents
About This Document
- Intended Audience
- New and Changed Information in This Edition
- Typographic Conventions
- HP XC and Related HP Products Information
- Related Information
- Manpages
- HP Encourages Your Comments
1 HP XC Administration Environment
- 1.1 Understanding Nodes, Services, and Roles
- 1.2 File System
  - 1.2.1 Key Operating System Directories
  - 1.2.2 Log Files
- 1.3 HP XC Command Environment
- 1.4 Configuration and Management Database
- 1.5 HP XC Configuration File Guidelines
- 1.6 Installation and Software Distribution
  - 1.6.1 Determining the Installation Type
  - 1.6.2 Software Distribution
- 1.7 Improved Availability
- 1.8 Networking
- 1.9 Modulefiles
- 1.10 Security
- 1.11 Recommended Administrative Tasks
2 Improved Availability
- 2.1 Purpose of the Availability Tool
- 2.2 Services Eligible for Improved Availability
- 2.3 Availability Sets
  - 2.3.1 Determining Which Nodes Are in the Availability Set
  - 2.3.2 Reconfiguring an Availability Set
- 2.4 HP Serviceguard Tasks
- 2.5 Transferring Control of Services
3 Starting Up and Shutting Down the HP XC System
- 3.1 Understanding the Node States
- 3.2 Starting Up the HP XC System
- 3.3 Shutting Down the HP XC System
- 3.4 Shutting Down One or More Nodes
- 3.5 Determining a Node's Power Status
- 3.6 Locating a Given Node
- 3.7 Disabling and Enabling a Node
4 Managing and Customizing System Services
- 4.1 HP XC System Services
- 4.2 Displaying Services Information
- 4.3 Restarting a Service
- 4.4 Stopping a Service
- 4.5 Global System Services
- 4.6 Customizing Services and Roles
5 Managing Licenses
- 5.1 License Manager and License File
- 5.2 Determining If the License Manager Is Running
- 5.3 Starting and Stopping the License Manager
6 Managing the Configuration and Management Database
- 6.1 Accessing the Configuration and Management Database
- 6.2 Querying the Configuration and Management Database
- 6.3 Finding and Setting System Attribute Values
- 6.4 Backing Up the Configuration Database
- 6.5 Restoring the Configuration Database from a Backup File
- 6.6 Archiving Sensor Data from the Configuration Database
- 6.7 Restoring the Sensor Data from an Archive File
- 6.8 Purging Sensor Data from the Configuration and Management Database
- 6.9 Dumping the Configuration and Management Database
7 Monitoring the System
- 7.1 Monitoring Tools
- 7.2 Monitoring Strategy
- 7.3 Displaying System Environment Data
- 7.4 Monitoring Disks
- 7.5 Displaying System Statistics
- 7.6 Logging Node Events
- 7.7 The collectl Utility
- 7.8 Using HP Graph To Display Network Bandwidth and System Use
- 7.9 The resmon Utility
- 7.10 The kdump Mechanism and the crash Utility
8 Monitoring the System with Nagios
- 8.1 Nagios Overview
- 8.2 Using the Nagios Web Interface
- 8.3 Adjusting the Nagios Configuration
- 8.4 Configuring Nagios on HP XC Systems
- 8.5 Using the Nan Notification Aggregator and Delimiter To Control Nagios Messages
- 8.6 Nagios Report Generator Utility
- 8.7 Modifying Nagios To Effect Changes
9 Network Administration
- 9.1 Network Address Translation Administration
- 9.2 Network Time Protocol Service
- 9.3 Changing the External IP Address of a Head Node
- 9.4 Modifying Sendmail
- 9.5 Bonding Ethernet Network Interface Cards for Failover
10 Managing Patches and RPM Updates
- 10.1 Sources for Software Packages and Information
- 10.2 Digital Signature
- 10.3 Downloading and Installing Patches
- 10.4 Rebuild Kernel Dependent Modules
- 10.5 Rebuilding Serviceguard Modules
11 Distributing Software Throughout the System
- 11.1 Overview of the Image Replication and Distribution Environment
- 11.2 Installing and Distributing Software Patches
- 11.3 Adding Software or Modifying Files on the Golden Client
- 11.4 Determining Which Nodes Will Be Imaged
- 11.5 Updating the Golden Image
- 11.6 Propagating the Golden Image to All Nodes
  - 11.6.1 Using the Full Imaging Installation
  - 11.6.2 Using the cexec Command
- 11.7 Maintaining a Global Service Configuration
12 Opening an IP Port in the Firewall
- 12.1 Open Ports
- 12.2 Opening Ports in the Firewall
  - 12.2.1 Opening a Temporary Port in the Firewall
  - 12.2.2 Opening an IP Port in the Firewall Persistently
13 Connecting to a Remote Console
- 13.1 Console Management Facility
- 13.2 Accessing a Remote Console
14 Managing Local User Accounts and Passwords
- 14.1 HP XC User and Group Accounts
- 14.2 General Procedures for Administering Local User Accounts
- 14.3 Adding a Local User Account
- 14.4 Modifying a Local User Account
- 14.5 Deleting a Local User Account
- 14.6 Configuring the ssh Keys for a User
- 14.7 Synchronizing the NIS Database
- 14.8 Changing Administrative Passwords
15 Managing SLURM
- 15.1 Overview of SLURM
- 15.2 Configuring SLURM
- 15.3 Restricting User Access to Nodes
- 15.4 Job Accounting
- 15.5 Monitoring SLURM
- 15.6 Draining Nodes
- 15.7 Configuring the SLURM Epilog Script
- 15.8 Maintaining the SLURM Daemon Log
- 15.9 Enabling SLURM to Recognize a New Node
- 15.10 Removing SLURM
16 Managing LSF
- 16.1 Standard LSF
- 16.2 LSF with SLURM
  - 16.2.1 Integration of LSF with SLURM
- 16.3 Switching the Type of LSF Installed
- 16.4 LSF with SLURM Installation
- 16.5 LSF with SLURM Startup and Shutdown
  - 16.5.1 Starting Up LSF with SLURM
  - 16.5.2 Shutting Down LSF with SLURM
- 16.6 Controlling the LSF with SLURM Service
- 16.7 Launching Jobs with LSF with SLURM
- 16.8 Monitoring and Controlling LSF with SLURM Jobs
- 16.9 Maintaining Shell Prompts in LSF Interactive Shells
- 16.10 Job Accounting
- 16.11 LSF Daemon Log Maintenance
- 16.12 Load Indexes and Resource Information
- 16.13 LSF with SLURM Monitoring
- 16.14 LSF with SLURM Failover
- 16.15 Moving SLURM and LSF Daemons to Their Backup Nodes
- 16.16 Enhancing LSF with SLURM
  - 16.16.1 LSF with SLURM Enhancement Settings
  - 16.16.2 Thresholds in LSF with SLURM and SLURM Interplay
- 16.17 Configuring an External Virtual Host Name for LSF with SLURM on HP XC Systems
17 Managing Modulefiles
18 Mounting File Systems
- 18.1 Overview of the Network File System on the HP XC System
- 18.2 Understanding the Global fstab File
- 18.3 Mounting Internal File Systems Throughout the HP XC System
  - 18.3.1 Understanding the csys Utility in the Mounting Instructions
  - 18.3.2 Mounting Internal File Systems
- 18.4 Mounting Remote File Systems
  - 18.4.1 Understanding the Mounting Instructions
  - 18.4.2 Mounting a Remote File System
- 18.5 Improved Availability of the /hptc_cluster File System
19 Managing Software RAID Arrays
- 19.1 Overview of Software RAID
  - 19.1.1 Software RAID-0
  - 19.1.2 Software RAID-1
- 19.2 Installing Software RAID on the Head Node
- 19.3 Installing Software RAID on Client Nodes
- 19.4 Examining a Software RAID Array
- 19.5 Error Reporting
- 19.6 Removing Software RAID from Client Nodes
20 Using Diagnostic Tools
- 20.1 Using the sys_check Utility
- 20.2 Using the ovp Utility for System Verification
- 20.3 Using the dgemm Utility to Analyze Performance
- 20.4 Using the System Interconnect Diagnostic Tools
21 Troubleshooting
- 21.1 General Troubleshooting
- 21.2 Nagios Troubleshooting
- 21.3 Messages Reported by Nagios
- 21.4 System Interconnect Troubleshooting
- 21.5 Improved Availability Issues
- 21.6 SLURM Troubleshooting
  - 21.6.1 SLURM Configuration Issues
  - 21.6.2 SLURM Run-Time Troubleshooting
- 21.7 LSF Troubleshooting
22 Servicing the HP XC System
- 22.1 Adding a Node
- 22.2 Replacing a Client Node
- 22.3 Actualizing Planned Nodes
- 22.4 Replacing a Server Blade Enclosure OnBoard Administrator
- 22.5 Replacing a System Interconnect Board in an HP CP6000 System
- 22.6 Software RAID Disk Replacement
  - 22.6.1 Replacing a RAID Disk
  - 22.6.2 Writing a Boot Block to the RAID Disk
- 22.7 Incorporating External Network Interface Cards
A Installing LSF with SLURM into an Existing Standard LSF Cluster
- A.1 Assumptions
- A.2 Requirement
- A.3 Sample Case
- A.4 HP XC Preparation
- A.5 Installing LSF with SLURM
- A.6 Perform Post Installation Tasks
- A.7 Configuring the LSF Alias
- A.8 Starting LSF on the HP XC System
- A.9 Sample Running Jobs
- A.10 Troubleshooting
B Setting Up MPICH
- B.1 Downloading the MPICH Source Files
- B.2 Building MPICH on the HP XC System
- B.3 Running the MPICH Self-Tests
- B.4 Installing MPICH
C HP MCS Monitoring
- C.1 Customizing the Configuration for Your Installation
- C.2 Regenerating the Nagios MCS Configuration
- C.3 Useful Administrative Commands
- C.4 MCS Log Files
- C.5 Nagios Plug-Ins for MCS
D CPU Frequency-Based Power-Saving Feature
Glossary
Index

Table 16-3 Environment Variables for LSF with SLURM Enhancement (lsf.conf File) (continued)

Environment Variable: LSF_HPC_EXTENSIONS="ext_name,..."

Description: This setting enables Platform LSF extensions. It is undefined by default. The following extension names are supported:

- SHORT_EVENTFILE

  This compresses long host name lists when event records are written to the lsb.events and lsb.acct files for large parallel jobs. The short host string has the format:

    number_of_hosts*real_host_name

  When SHORT_EVENTFILE is enabled, older daemons and commands (prior to LSF Version 7.3) cannot recognize the lsb.acct and lsb.events file format.

  For example, the original host list record is as follows:

    6 "hostA" "hostA" "hostA" "hostA" "hostB" "hostC"

  Redundant host names are removed and the host count is changed so that the short host list record is as follows:

    3 "4*hostA" "hostB" "hostC"

  When LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is set and LSF reads the host list from the lsb.events or lsb.acct files, the compressed host list is expanded into a normal host list. This setting applies to the following events:

  - JOB_START: when a normal job is dispatched.
  - JOB_FORCE: when a job is forced with the brun command.
  - JOB_CHUNK: when a job is inserted into a job chunk.
  - JOB_FORWARD: when a job is forwarded to a MultiCluster leased host.
  - JOB_FINISH in lsb.acct.

- SHORT_PIDLIST

  This shortens the output from the bjobs command to eliminate many of the process IDs (PIDs) for a job. The bjobs command displays only the first ID and a count of the process group IDs (PGIDs) and process IDs for the job. Without the SHORT_PIDLIST setting, the bjobs -l command displays all the PGIDs and PIDs for the job. With SHORT_PIDLIST set, the bjobs -l command displays a count of the PGIDs and PIDs.

- RESERVE_BY_STARTTIME

  LSF with SLURM selects the reservation that gives the job the earliest predicted start time. By default, if multiple host groups are available for reservation, LSF with SLURM chooses the largest possible reservation based on the number of slots. When backfill is configured, this default can keep larger jobs from running, because their start times are pushed further into the future.

- BRUN_WITH_TOPOLOGY

  If a topology request can be satisfied for a brun job, brun preserves the topology request. LSF with SLURM allocates the resource according to the request and tries to run the job with the requested topology. If allocation fails because the topology request cannot be satisfied, the job is queued again. If BRUN_WITH_TOPOLOGY is not specified, the scheduler ignores the job topology request when it creates an allocation.

Environment Variable: LSF_HPC_NCPU_COND=and|or

Description: This entry in the lsf.conf file defines how any two LSF_HPC_NCPU_* thresholds are combined. The default value is or.
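The host list compression that Table 16-3 describes for SHORT_EVENTFILE can be illustrated with a short Python sketch. This is not LSF code: the function names compress_hosts, expand_hosts, and format_record are hypothetical, and the sketch assumes that only consecutive repeats of a host name are collapsed, which the table does not state explicitly. It reproduces the hostA/hostB/hostC example from the table.

```python
from itertools import groupby


def compress_hosts(hosts):
    """Collapse consecutive repeats of a host name into 'count*name' entries.

    Assumption: only adjacent duplicates are merged (hypothetical helper).
    """
    short = []
    for name, run in groupby(hosts):
        count = len(list(run))
        short.append(f"{count}*{name}" if count > 1 else name)
    return short


def expand_hosts(short):
    """Expand 'count*name' entries back into one entry per host."""
    hosts = []
    for entry in short:
        count, sep, name = entry.partition("*")
        if sep and count.isdigit():
            hosts.extend([name] * int(count))
        else:
            hosts.append(entry)
    return hosts


def format_record(hosts):
    """Render a host list as the entry count followed by quoted entries."""
    return " ".join([str(len(hosts))] + [f'"{h}"' for h in hosts])


original = ["hostA", "hostA", "hostA", "hostA", "hostB", "hostC"]
short = compress_hosts(original)
print(format_record(original))  # 6 "hostA" "hostA" "hostA" "hostA" "hostB" "hostC"
print(format_record(short))     # 3 "4*hostA" "hostB" "hostC"
print(expand_hosts(short) == original)  # True
```

The round trip through expand_hosts mirrors what the table says happens when LSF reads a compressed host list back from the lsb.events or lsb.acct files.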

16.16 Enhancing LSF with SLURM 209
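As a rough illustration of the LSF_HPC_NCPU_COND entry in Table 16-3, the following Python sketch shows what combining two threshold checks with and or or means. The function combine_thresholds and its boolean arguments are hypothetical. The individual LSF_HPC_NCPU_* thresholds and the way LSF evaluates them are not described on this page, so each check is represented only as an already-computed true or false value.

```python
def combine_thresholds(first_met, second_met, cond="or"):
    """Combine two threshold results as an and|or setting would.

    first_met, second_met: whether each (hypothetical) threshold check passed.
    cond: "and" or "or"; "or" is the default, matching the table.
    """
    if cond == "and":
        return first_met and second_met
    if cond == "or":
        return first_met or second_met
    raise ValueError(f"expected 'and' or 'or', got {cond!r}")


# One threshold met, the other not:
print(combine_thresholds(True, False))         # True  (default: or)
print(combine_thresholds(True, False, "and"))  # False
```

With the default value of or, meeting either threshold is enough; with and, both must be met.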