HP XC System Software Administration Guide Version 4.0
Table Of Contents
- XC Administration Guide
- Table of Contents
- About This Document
- 1 HP XC Administration Environment
- 1.1 Understanding Nodes, Services, and Roles
- 1.2 File System
- 1.3 HP XC Command Environment
- 1.4 Configuration and Management Database
- 1.5 HP XC Configuration File Guidelines
- 1.6 Installation and Software Distribution
- 1.7 Improved Availability
- 1.8 Networking
- 1.9 Modulefiles
- 1.10 Security
- 1.11 Recommended Administrative Tasks
- 2 Improved Availability
- 3 Starting Up and Shutting Down the HP XC System
- 4 Managing and Customizing System Services
- 4.1 HP XC System Services
- 4.2 Displaying Services Information
- 4.3 Restarting a Service
- 4.4 Stopping a Service
- 4.5 Global System Services
- 4.6 Customizing Services and Roles
- 4.6.1 Overview of the HP XC Services Configuration
- 4.6.2 Service Configuration Sequence of Operation
- 4.6.3 Assigning Roles with the cluster_config Utility
- 4.6.4 The *config.d Directories
- 4.6.5 Configuration Scripts
- 4.6.6 Understanding Global Configuration Scripts
- 4.6.7 Advance Planning
- 4.6.8 Editing the roles_services.ini File
- 4.6.9 Creating a service.ini File
- 4.6.10 Adding a New Service
- 4.6.11 Verifying a New Service
- 5 Managing Licenses
- 6 Managing the Configuration and Management Database
- 6.1 Accessing the Configuration and Management Database
- 6.2 Querying the Configuration and Management Database
- 6.3 Finding and Setting System Attribute Values
- 6.4 Backing Up the Configuration Database
- 6.5 Restoring the Configuration Database from a Backup File
- 6.6 Archiving Sensor Data from the Configuration Database
- 6.7 Restoring the Sensor Data from an Archive File
- 6.8 Purging Sensor Data from the Configuration and Management Database
- 6.9 Dumping the Configuration and Management Database
- 7 Monitoring the System
- 7.1 Monitoring Tools
- 7.2 Monitoring Strategy
- 7.3 Displaying System Environment Data
- 7.4 Monitoring Disks
- 7.5 Displaying System Statistics
- 7.6 Logging Node Events
- 7.7 The collectl Utility
- 7.8 Using HP Graph To Display Network Bandwidth and System Use
- 7.9 The resmon Utility
- 7.10 The kdump Mechanism and the crash Utility
- 8 Monitoring the System with Nagios
- 8.1 Nagios Overview
- 8.2 Using the Nagios Web Interface
- 8.3 Adjusting the Nagios Configuration
- 8.4 Configuring Nagios on HP XC Systems
- 8.5 Using the Nan Notification Aggregator and Delimiter To Control Nagios Messages
- 8.6 Nagios Report Generator Utility
- 8.7 Modifying Nagios To Effect Changes
- 9 Network Administration
- 10 Managing Patches and RPM Updates
- 11 Distributing Software Throughout the System
- 11.1 Overview of the Image Replication and Distribution Environment
- 11.2 Installing and Distributing Software Patches
- 11.3 Adding Software or Modifying Files on the Golden Client
- 11.4 Determining Which Nodes Will Be Imaged
- 11.5 Updating the Golden Image
- 11.6 Propagating the Golden Image to All Nodes
- 11.7 Maintaining a Global Service Configuration
- 12 Opening an IP Port in the Firewall
- 13 Connecting to a Remote Console
- 14 Managing Local User Accounts and Passwords
- 14.1 HP XC User and Group Accounts
- 14.2 General Procedures for Administering Local User Accounts
- 14.3 Adding a Local User Account
- 14.4 Modifying a Local User Account
- 14.5 Deleting a Local User Account
- 14.6 Configuring the ssh Keys for a User
- 14.7 Synchronizing the NIS Database
- 14.8 Changing Administrative Passwords
- 14.8.1 Changing the Superuser Password
- 14.8.2 Changing the CMDB Password
- 14.8.3 Changing the Interconnect Password
- 14.8.4 Changing the Console Port Password
- 14.8.5 Synchronizing the BMC/IPMI Password for CP6000 Systems
- 14.8.6 Changing the Nagios Administrator Password
- 14.8.7 Changing the LSF Administrator Password
- 15 Managing SLURM
- 16 Managing LSF
- 16.1 Standard LSF
- 16.2 LSF with SLURM
- 16.3 Switching the Type of LSF Installed
- 16.4 LSF with SLURM Installation
- 16.5 LSF with SLURM Startup and Shutdown
- 16.6 Controlling the LSF with SLURM Service
- 16.7 Launching Jobs with LSF with SLURM
- 16.8 Monitoring and Controlling LSF with SLURM Jobs
- 16.9 Maintaining Shell Prompts in LSF Interactive Shells
- 16.10 Job Accounting
- 16.11 LSF Daemon Log Maintenance
- 16.12 Load Indexes and Resource Information
- 16.13 LSF with SLURM Monitoring
- 16.14 LSF with SLURM Failover
- 16.15 Moving SLURM and LSF Daemons to Their Backup Nodes
- 16.16 Enhancing LSF with SLURM
- 16.17 Configuring an External Virtual Host Name for LSF with SLURM on HP XC Systems
- 17 Managing Modulefiles
- 18 Mounting File Systems
- 19 Managing Software RAID Arrays
- 20 Using Diagnostic Tools
- 20.1 Using the sys_check Utility
- 20.2 Using the ovp Utility for System Verification
- 20.3 Using the dgemm Utility to Analyze Performance
- 20.4 Using the System Interconnect Diagnostic Tools
- 21 Troubleshooting
- 21.1 General Troubleshooting
- 21.2 Nagios Troubleshooting
- 21.3 Messages Reported by Nagios
- 21.4 System Interconnect Troubleshooting
- 21.5 Improved Availability Issues
- 21.5.1 How To Start HP Serviceguard When Only the Head Node is Running
- 21.5.2 Restart Serviceguard Quorum Server if Quorum Server Node is Re-imaged
- 21.5.3 Known Limitation if Nagios is Configured for Improved Availability
- 21.5.4 Network Restart Command Negatively Affects Serviceguard
- 21.5.5 Problem Failing Over Database Package Under Serviceguard
- 21.6 SLURM Troubleshooting
- 21.7 LSF Troubleshooting
- 22 Servicing the HP XC System
- 22.1 Adding a Node
- 22.2 Replacing a Client Node
- 22.3 Actualizing Planned Nodes
- 22.4 Replacing a Server Blade Enclosure OnBoard Administrator
- 22.5 Replacing a System Interconnect Board in an HP CP6000 System
- 22.6 Software RAID Disk Replacement
- 22.7 Incorporating External Network Interface Cards
- 22.7.1 Gathering Information
- 22.7.2 Editing the platform_vars.ini File
- 22.7.3 Using the device_config Command
- 22.7.4 Updating the Database for the External Network Card
- 22.7.5 Updating the Firewall Custom Configuration
- 22.7.6 Reconfiguring the Nodes
- 22.7.7 Verifying Success
- 22.7.8 Updating the Golden Image
- A Installing LSF with SLURM into an Existing Standard LSF Cluster
- B Setting Up MPICH
- C HP MCS Monitoring
- D CPU Frequency-Based Power-Saving Feature
- Glossary
- Index
7. Modify the Head Node.
These steps modify the head node and propagate those changes to the rest of the HP XC
system. The recommended method is to use the updateimage command as documented
in Chapter 11: Distributing Software Throughout the System (page 141). Make the
following modifications first, then propagate them:
a. Lower the firewall on the HP XC external network.
LSF daemons communicate through pre-configured ports in the lsf.conf configuration
file, but the LSF commands open random ports for receiving information when they
communicate with the LSF daemons. Because an LSF cluster needs this "open" network
environment, trying to maintain a firewall becomes challenging. Security-aware
customers are welcome to try to get LSF running with firewalls, but those procedures
are beyond the scope of this documentation.
For this procedure, open the unprivileged ports (1024-65535) and one privileged port
(1023) on the external network by adding the following lines to
/etc/sysconfig/iptables.proto on the head node:
-A RH-Firewall-1-INPUT -i External -p tcp -m tcp --dport 1023:65535 -j ACCEPT
-A RH-Firewall-1-INPUT -i External -p udp -m udp --dport 1023:65535 -j ACCEPT
This file establishes the initial firewall rules for all nodes on the HP XC system. These new
rules open all the unprivileged ports and one privileged port (1023) on the external
network. Opening the privileged port allows LSF commands run as root on HP XC to
communicate with non-XC LSF daemons, because LSF commands executed by root use
privileged ports. If root does not need this capability, you can avoid opening the
privileged port by starting the port range at 1024 (--dport 1024:65535).
These new rules need to be set up on every node in the HP XC system that could be
selected to run the LSF with SLURM daemons. A later step explains how to generate
a new /etc/sysconfig/iptables file on each HP XC node from the recently modified
iptables.proto file.
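As a quick check after editing, the following shell sketch reports whether each of the two rules shown above appears verbatim in a rules file. The function name check_lsf_rules is hypothetical and is not part of the HP XC software; the file to inspect is passed as an argument, and the function only reads the file.

```shell
#!/bin/sh
# Hypothetical helper (not an HP XC command): report whether the two
# LSF firewall rules from this step are present in the given rules file.
check_lsf_rules() {
    proto_file=$1
    missing=0
    for proto in tcp udp; do
        rule="-A RH-Firewall-1-INPUT -i External -p $proto -m $proto --dport 1023:65535 -j ACCEPT"
        # -x: match the whole line, -F: fixed string, --: the rule starts with "-"
        if grep -qxF -- "$rule" "$proto_file"; then
            echo "present: $proto"
        else
            echo "missing: $proto"
            missing=1
        fi
    done
    # Return non-zero if at least one rule is absent.
    return $missing
}
```

For example, check_lsf_rules /etc/sysconfig/iptables.proto on the head node prints one line per protocol and returns a non-zero status if either rule is absent.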
b. Node-to-node communication.
LSF uses rsh and ssh (if configured) to control all the LSF daemons in the HP XC
system. LSF expects the selected mechanism to allow access to all nodes without a
password.
HP discourages the use of the rsh command because it transmits unencrypted passwords
through the network, where they can be captured by any standard network-sniffing
program. The rsh command and its related packages are not installed by default for
this reason. Instead, HP recommends the ssh command.
If you want to continue to use the rsh command within the Standard LSF cluster, install
its RPM packages on the head node now; these packages are available on the HP XC
DVD.
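One way to confirm that the selected mechanism allows access without a password is sketched below. It assumes OpenSSH, whose BatchMode=yes option makes ssh fail instead of prompting for a password. The function name check_passwordless and the REMOTE_SHELL override variable are hypothetical names introduced for this illustration, and the node names are supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper (not an HP XC or LSF command): run a trivial remote
# command on each named node and report whether it succeeded without a prompt.
check_passwordless() {
    # REMOTE_SHELL is an assumed override for illustration; the default uses
    # OpenSSH in batch mode so that a password prompt counts as a failure.
    remote_cmd=${REMOTE_SHELL:-ssh -o BatchMode=yes -o ConnectTimeout=5}
    status=0
    for node in "$@"; do
        # $remote_cmd is intentionally unquoted so its options split into words.
        if $remote_cmd "$node" true </dev/null >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            status=1
        fi
    done
    return $status
}
```

Calling check_passwordless with a list of node names prints one line per node; any FAILED line indicates a node that LSF could not reach without a password.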
c. Set up the expected LSF environment.
A typical LSF installation provides two environment setup files that, when sourced by
the user, will enable access to LSF binaries, man pages, and libraries by adjusting the
user's environment. These files are named profile.lsf and cshrc.lsf by default.
When LSF is installed locally on HP XC, two custom files are created that automatically
source the LSF environment setup files so that the user has access to LSF immediately
after logging in to the HP XC system. These two files are /etc/profile.d/lsf.sh and
/etc/profile.d/lsf.csh. Before they were removed in Step 3, the content of these files
was as shown below. We will be replacing the old LSF_TOP, which was
/opt/hptc/lsf/top, with the new LSF_TOP location that is shared between the two
clusters, which is /shared/lsf in our example.
# cat lsf.sh
case $PATH in
*-slurm/etc:*) ;;
*:/opt/hptc/lsf/top*) ;;
*)
if [ -f /opt/hptc/lsf/top/conf/profile.lsf ]; then
. /opt/hptc/lsf/top/conf/profile.lsf