HP XC System Software Installation Guide Version 3.2.1
© Copyright 2003–2007 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Table of Contents
About This Document ... 15
Intended Audience ... 15
How to Use This Document ... 15
Naming Conventions Used in This Document ...
2.4.2 Run the SVA Installation Script to Install SVA ... 48
2.4.3 Install Optional Linux RPMs ... 49
2.5 You Are Done ... 49
3 Configuring and Imaging the System ...
3.13 Task 12: Run the startsys Utility to Start the System and Propagate the Golden Image ... 102
3.14 Task 13: Perform Postconfiguration Tasks for the InfiniBand Interconnect ... 105
3.15 Task 14: Create a Lock LUN Device File ... 106
3.16 Task 15: Start Availability Tools ... 106
7 Installing and Using PBS Professional ... 141
7.1 PBS Professional Overview ... 141
7.2 Before You Begin ... 141
7.3 Plan the Installation ...
10.5 Troubleshooting the OVP ... 162
10.5.1 OVP network_bidirectional Test Might Report False Error on HP Server Blades ... 164
10.5.2 OVP Reports Benign Nagios Warnings ... 165
10.5.3 OVP qsnet_database Test May Fail Due to Benign Errors Returned By the qsctrl Utility ... 165
10.6 Troubleshooting SLURM ...
F.2.4 Special Considerations for Improved Availability ... 200
F.3 Role Definitions ... 201
F.3.1 Availability Role ... 201
F.3.2 Avail_node_management Role ...
List of Figures
10-1 Discovery Flowchart ...
List of Tables
1 Installation Types ... 15
2 Naming Conventions ...
List of Examples
1-1 Sample XC.lic File ... 25
3-1 Default Content Of The base_addr.ini File ... 60
3-2 Default Content Of The base_addrV2.ini File ...
About This Document This document describes how to install and configure HP XC System Software Version 3.2.1 on HP Cluster Platforms 3000, 4000, and 6000. An HP XC system is integrated with several open source software components. Some open source software components are being used for underlying technology, and their deployment is transparent.
To avoid duplication of information, from time to time you are instructed to refer to information in the other documents in the HP XC System Software Documentation Set. To reduce the size of screen displays and command output in this document, a three- or four-node system was used to generate most of the sample command output shown in this document. Naming Conventions Used in This Document This document uses the naming conventions and sample IP addresses listed in Table 2.
Ctrl-x    A key sequence. A sequence such as Ctrl-c indicates that you must hold down the key labeled Ctrl while you press the key for the letter c.
ENVIRONMENT VARIABLE    The name of an environment variable, for example, PATH.
[ERROR NAME]    The name of an error, usually returned in the errno variable.
Key    The name of a keyboard key. Return and Enter both refer to the same key.
Term    The defined use of an important word or phrase.
User input, Variable, [], {}, ..., |, WARNING, CAUTION, IMPORTANT, NOTE
HP XC System Software User's Guide Provides an overview of managing the HP XC user environment with modules, managing jobs with LSF, and describes how to build, run, debug, and troubleshoot serial and parallel applications on an HP XC system.
software components are generic, and the HP XC adjective is not added to any reference to a third-party or open source command or product name. For example, the SLURM srun command is simply referred to as the srun command. The location of each website or link to a particular topic listed in this section is subject to change without notice by the site provider. • http://www.platform.com Home page for Platform Computing Corporation, the developer of the Load Sharing Facility (LSF).
• http://www.balabit.com/products/syslog_ng/ Home page for syslog-ng, a logging tool that replaces the traditional syslog functionality. The syslog-ng tool is a flexible and scalable audit trail processing tool. It provides a centralized, securely stored log of all devices on the network. • http://systemimager.org Home page for SystemImager®, which is the underlying technology that distributes the golden image to all nodes and distributes configuration changes throughout the system.
MPI Websites • http://www.mpi-forum.org Contains the official MPI standards documents, errata, and archives of the MPI Forum. The MPI Forum is an open group with representatives from many organizations that define and maintain the MPI standard. • http://www-unix.mcs.anl.gov/mpi/ A comprehensive site containing general information, such as the specification and FAQs, and pointers to other resources, including tutorials, implementations, and other MPI-related sites. Compiler Websites • http://www.
Manpages for third-party software components might be provided as a part of the deliverables for that component. Using discover(8) as an example, you can use either one of the following commands to display a manpage:
$ man discover
$ man 8 discover
If you are not sure about a command you need to use, enter the man command with the -k option to obtain a list of commands that are related to a keyword. For example:
$ man -k keyword
HP Encourages Your Comments
HP encourages comments concerning this document.
1 Preparing for a New Installation This chapter describes preinstallation tasks to perform before you install HP XC System Software Version 3.2.1.
1.3 Task 3: Prepare Existing HP XC Systems This task applies to anyone who is installing HP XC System Software Version 3.2.1 on an HP XC system that is already installed with an older version of the HP XC System Software. Omit this task if you are installing HP XC System Software Version 3.2.1 on new hardware for the first time. Before using the procedures described in this document to install and configure HP XC System Software Version 3.2.
1.6 Task 6: Arrange for IP Address Assignments and Host Names Make arrangements with your site's network administrator to assign IP addresses for the following system components. All IP addresses must be defined in the site's Domain Name System (DNS) configuration: • The external IP address of the HP XC system, if it is to be connected to an external network. The name associated with this interface is known as the Linux Virtual Server (LVS) alias or cluster alias.
NOTICE="Authorization = BM05WHITMORE19772031 - permanent - HP \ XC System Software - BASE License" INCREMENT XC-PROCESSORS Compaq 3.0 permanent 68 7BA7E0876F0F \ NOTICE="Date 30-Jan-2007 01:29:36 - License Number = \ LAGA4D1958DL - Qty 68 - 434066-B21 - HP XC System Software 1 \ Proc Flex License" INCREMENT lsf_xc Compaq 6.
server of the service. Improved availability protects against service failure if a node that is serving vital services becomes unresponsive or goes down. You have the flexibility to decide which availability tool you want to use to manage and migrate specified services to a second server if the first server is not available. You can install one or more availability tools to manage the services that have been configured for improved availability.
Table 1-1 Improved Availability Summary
Task / Where Task Details Are Provided:
1. Decide which availability tool or tools you want to use; this tool manages the services for which improved availability has been configured. Then, obtain or purchase, install, and configure the availability tool or tools. See "Choosing an Availability Tool" (page 28).
2. Write translator and supporting scripts for the availability tool if you are not using HP Serviceguard. See "Writing Translator and
Availability Tools from Other Vendors If you prefer to use another availability tool, such as Heartbeat Version 1 or Version 2 (which is an open source tool), you must obtain the tool and configure it for use on your own. Third-party vendors are responsible for providing customer support for their tools. Installation and configuration instructions for any third-party availability tools you decide to use are outside the scope of this document. See the vendor documentation for instructions. 1.9.
1.9.6 Assigning Node Roles For Improved Availability An important part of planning your strategy for improved availability is to determine the services for which availability is vital to the system operation. Services are delivered in node roles. A node role is an abstraction that combines one or more services into a group and provides a convenient way of installing services on a node. In this release, improved availability is supported for the services listed in Table 1-2.
Table 1-2 Role and Service Placement for Improved Availability (continued)
Service Name / Service Is Delivered in This Role / Special Considerations for Role Assignment
Within the availability set, the higher numbered node is the LVS director, and the lower numbered node is the backup for the LVS director. Thus, to achieve improved availability of the LVS director service, you must assign at least three nodes with the login role:
• Assign the login role to the first node in the availability set.
by placing the resource_management role on two or more nodes. These nodes are not members of any availability set, and the SLURM and LSF-HPC with SLURM software is not managed by any availability tool. When you assign two or more nodes with the resource_management role, SLURM availability is automatically enabled. If you assign the resource_management role to two or more nodes, you must manually enable availability for LSF-HPC with SLURM; see "Perform LSF Postconfiguration Tasks" (page 109) for instructions.
Table 1-3 Availability Sets Worksheet
Availability Set Configuration: First Node Name, Second Node Name, Availability Tool to Manage This Availability Set, Roles to Assign to Nodes in the Availability Set
First node in the availability set:
• _________________________
• _________________________
• _________________________
• _________________________
• _________________________
Second node in the availability set:
• _________________________
• _________________________
• _________________________
• _________________________
• ______
2 Installing Software on the Head Node This chapter contains an overview of the software installation process and describes software installation tasks. These tasks must be performed in the following order: • “Task 1: Gather Information Required for the Installation” (page 39) • “Task 2: Start the Installation Process” (page 42) • “Task 3: Install Additional RPMs from the HP XC DVD” (page 47) 2.
Table 2-1 HP XC Software Stack Software Product Name Description HP MPI HP MPI provides optimized libraries for message passing designed specifically to make high-performance use of the system interconnect. HP MPI complies fully with the MPI-1.2 standard. HP MPI also complies with the MPI-2 standard, with restrictions. HP Scalable Visualization Array The HP Scalable Visualization Array (SVA) provides a visualization component for applications that require visualization in addition to computation.
Table 2-1 HP XC Software Stack (continued) Software Product Name Description Standard LSF Standard LSF is the industry standard Platform Computing LSF product used for workload management across clusters of compute resources. It features comprehensive workload management policies in addition to simple first-come, first-serve scheduling (fairshare, preemption, backfill, advance reservation, service-level agreement, and so on).
size to calculate the size of each disk partition can result in needlessly large partition sizes when the installation disk is larger than 36 GB. Thus, limits have been set on partition sizes to leave space on the disk for other user-defined file systems and partitions. Use the Linux Disk Druid disk partitioning utility to partition the remaining disk space according to your needs.
the appropriately sized partition. The guidelines depend on whether /hptc_cluster is located on an HP StorageWorks Scalable File Server (SFS) or is created on the local system disk:
• "Determining The Size of /hptc_cluster When It Is Located On An SFS Server"
• "Determining The Size of /hptc_cluster When It Is Located On A Local Disk On The Head Node"
2.1.5.
Table 2-5 Chip Architecture by Cluster Platform
Cluster Platform Model                        Chip Architecture
Cluster Platform 3000 (CP3000)                Intel Xeon with EM64T
Cluster Platform 3000BL (HP server blades)    Intel Xeon with EM64T
Cluster Platform 4000 (CP4000)                AMD Opteron
Cluster Platform 4000BL (HP server blades)    AMD Opteron
Cluster Platform 6000 (CP6000)                Intel Itanium 2
4. Ensure that you have in your possession the DVD distribution media that is appropriate for the cluster platform architecture.
Table 2-6 Information Required for the Kickstart Installation Session (continued)
Item: Where to create a partition for the /hptc_cluster file system
Description and User Action: The /hptc_cluster file system is the global, or clusterwide, file system on an HP XC system. This file system is shared and mounted by all nodes and contains configuration and log file information that is required for all nodes in the system.
Table 2-6 Information Required for the Kickstart Installation Session (continued)
Item: Time zone
Description and User Action: Select the time zone in which the system is located. The default is America/New_York (Eastern Standard Time, which is Greenwich Mean Time minus 5 hours). Use the Tab key to move through the list of time zones, and use the spacebar to highlight the selection. Then, use the Tab key to move to OK, and press the spacebar to select OK.
IMPORTANT: Specific head node hardware models might require additional command-line parameters that were identified after this document was published. Before booting the head node, look in the HP XC System Software Release Notes at http://www.docs.hp.com/en/linuxhpc.html to make sure no additional command-line options are required for your model of head node.
Table 2-7 Kickstart Boot Command Line
Cluster Platform or Hardware Model    Chip Architecture Type
CP3000 and CP4000
8. 9. Log in as the root user when the login screen appears, and enter the root password you previously defined during the software installation process. Open a terminal window when the desktop appears: a. Click on the Linux for High Performance Computing splash screen to close it. b. Click Applications→System Tools→Terminal to open a terminal window. 10. Proceed to “Task 3: Install Additional RPMs from the HP XC DVD” (page 47). 2.3.
2. Use the menus on the Insight Display panel to manually set a static IP address and subnet mask for the Onboard Administrator. You can use any valid IP address because there is no connection to a public network. All static addresses must be in the same network. For example, assume the network is 172.100.100.0 and the netmask is 255.255.255.0. In this case, the static IP addresses might be:
• IP address of the installation PC: 172.100.100.
d. Click the Power button and then click Momentary Press to turn on power to the server and start booting from the DVD.
e. Proceed to step 6.
Mozilla Firefox
If you are using Firefox as your browser, do the following:
a. Click the Remote Console link to open the virtual console window.
b. In the iLO2 Web Administration window, click the Virtual Devices tab.
c. In the left frame, click the Virtual Media link.
IMPORTANT: After the software load is complete, ensure that the DVD is ejected from the drive before continuing. On systems with a retractable DVD device, you must remove the installation DVD before the system reboots. This is especially important if the head node is an HP workstation, which never ejects the DVD. If you do not remove the DVD, a second installation process is initiated from the DVD when the system reboots. If a second installation process is started, halt the process and remove the DVD.
# cd
# umount /dev/cdrom
2.4.2 Run the SVA Installation Script to Install SVA
The HP Scalable Visualization Array (SVA) is a scalable visualization solution that brings the power of parallel computing to bear on many demanding visualization challenges. SVA is integrated with HP XC and shares a single interconnect with the compute nodes and a storage system.
2.4.3 Install Optional Linux RPMs
Follow this procedure to install additional, optional Linux RPMs from the HP XC distribution DVD:
1. If the DVD is not already mounted, insert the installation DVD into the DVD drive and mount it on the default location (the default location is the /media/cdrom directory):
# mount /dev/cdrom
2. Change to the following directory:
# cd /media/cdrom/LNXHPC/RPMS
3. Find the Linux RPM you want to install and issue the appropriate command to install it.
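For example, assuming a hypothetical package file name (substitute the actual file name listed in the RPMS directory):
# rpm -ivh optional_package-1.0-1.x86_64.rpm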
3 Configuring and Imaging the System This chapter contains an overview of the initial system configuration and imaging process and describes system configuration tasks, which must be performed in the following order: • “Task 1: Prepare for the System Configuration” (page 53) • “Task 2: Change the Default IP Address Base (Optional)” (page 60) • “Task 3: Run the cluster_prep Command to Prepare the System” (page 61) • “Task 4: Install Patches or RPM Updates” (page 64) • “Task 5: Run the discover Command to Dis
Command or Utility Name    Description
cluster_config    Populates the configuration and management database with node role assignments, starts all services on the head node, and creates the golden system image
startsys    Turns on power to each node and downloads the SystemImager automatic installation environment to install and configure each node from the golden image
3.1.2 Internal Node Naming
It is important to understand how internal node names are assigned.
As part of the initial software installation, the head node is configured as the golden client, which is the node that represents the configuration from which all other nodes are replicated. Next, a golden image is created from the golden client, which is a replication of the local file system directories and files, starting from root (/). The golden image is stored on the image server, which is also resident on the head node in this release.
Table 3-1 Information Required by the cluster_prep Command
Item: Node name prefix
Description and User Action: During the system discovery process, each node is automatically assigned an internal name. This name is based on a prefix defined by you. The default node prefix is the letter n. All node names consist of the prefix and a number based on the node's topographical location in the system.
Table 3-1 Information Required by the cluster_prep Command (continued)
Item: IPv6 address
Description and User Action: Provide the IPv6 address of the head node's Ethernet connection to the external network, if applicable. Specifying this address is optional and is intended for sites that use IPv6 addresses for the rest of the network.
Table 3-2 Information Required by the discover Command
Item: Total number of nodes in this cluster
Description and User Action: Enter the total number of nodes in the system configuration that are to be discovered at this time. Make sure the number you enter includes the head node and all compute nodes. You are not prompted for this information if you are discovering a multi-region, large-scale system.
Table 3-2 Information Required by the discover Command (continued)
Item: User name and password for the console port management devices
Description and User Action: Supply the common user name and password that you set for the console port management devices (that is, the MP, iLO and LO-100i devices) when you prepared the hardware.
Table 3-3 Information Required by the cluster_config Utility (continued)
Item: Number of NFS daemons
Description and User Action: You are prompted to supply the number of NFS daemons to be run on the head node and on any other NFS server within the system to support the number of NFS clients in the system. A default is provided based on the number of nodes in the hardware configuration.
Item: Nagios configuration
Description and User Action: You are prompted to enable web access to the Nagios monitoring application. HP recommends that you enable web access because it is the only mechanism with which you can view the data collected by Nagios. You must supply a password for the Nagios administration user. This password does not have to match any other password you previously provided.
Item: SLURM configuration
Description and User Action: You are prompted for the following information about the SLURM configuration:
• Whether you want to configure SLURM. You are not required to configure SLURM; however, SLURM is required by LSF-HPC with SLURM and SVA.
• A SLURM user name. A default value is provided, but you can specify your own SLURM user name.
Example 3-2 Default Content Of The base_addrV2.ini File
Base = 172
nodeBase = 172.20
cpBase = 172.21
swBase = 172.20.65
icBase = 172.22
otherBase = 172.23.0
netMask = 255.224.0.0
dyStart=172.31.48.1
dyEnd=172.31.63.254
The dyStart and dyEnd parameters are present only in the base_addrV2.ini file and do not need to be in the base_addr.ini file. If you change these IP addresses, do not add these parameters to the base_addr.ini file.
connection on the head node, starts the MySQL service, and initializes the configuration and management database.
1. Begin this procedure as the root user on the head node.
2. Change to the following directory:
# cd /opt/hptc/config/sbin
3. Start the system preparation process by invoking the cluster_prep command.
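A minimal sketch of the invocation, assuming no additional command-line options are required for your hardware configuration:
# ./cluster_prep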
You have the option to override the system default MTU value. You can enter 9000 to enable jumbo frames, press the [ ] keys to delete the value shown and use the system default, or press the Enter key to accept the value shown. MTU value (optional) []: Enter IP address for DNS name server [ ]: your_IPaddress Provide one or more DNS domains to use for search paths or press the Enter key to accept the default response. Enter one domain name on a line, and after the last domain name, enter a period (.
3.5 Task 4: Install Patches or RPM Updates For each supported version of the HP XC System Software, HP releases all Linux security updates and HP XC software patches on the HP IT Resource Center (ITRC) website. Software patches might also be available for other HP products that you are installing, such as SVA, RGS, or Serviceguard. To determine if software patches are available, go to the product-specific location on the ITRC.
5. From the IT Resource Center home page, select patch/firmware database from the maintenance and support (hp products) list. 6. From the patch / firmware database page, select Linux under find individual patches. 7. From the search for patches page, in step 1 of the search utility, select vendor and version, select hpxc as the vendor and select the HP XC version that is appropriate for the cluster platform. If you are installing patches for SVA, select hpsva as the vendor. 8.
NOTE: Follow this procedure before you run the discover command if you want to locate the console port of a non-blade head node on the administration network and not on the external network:
1. Set the IP address for the head node console port to a static IP address that is not currently in use by the HP XC system. Typically, this address can be 172.31.47.240, which is the top end of the addresses defined for the HP XC switches in the /opt/hptc/config/base_addrV2.ini file.
Table 3-2 (page 56) and discover(8) contain information about additional keywords you can add to the command line to omit some of the questions that will be asked during the discovery process. Use of these keywords is optional. If you encounter problems during the discovery process, see “Troubleshooting the Discovery Process” (page 155) for troubleshooting guidelines. NOTE: The discover command does not properly discover HP ProLiant DL140 and DL145 servers until the password is set.
Waiting for power daemon ... done
switchName necs1-1 switchIP 172.20.65.2 type 2650
switchName nems1-1 switchIP 172.20.65.1 type 2848
Attempting to power on nodes with nodestring 8n[13-15]
Powering on all known nodes ... done
Discovering Nodes...
running port_discover on 172.20.65.1
nodes Found = 1 nodes Expected = 4
running port_discover on 172.20.65.1
nodes Found = 1 nodes Expected = 4
running port_discover on 172.20.65.1
nodes Found = 1 nodes Expected = 4
running port_discover on 172.20.65.
not indicate a failure unless a network component is plugged into that port on the switch. If necessary, see Chapter 10 (page 155) for information about troubleshooting problems you might encounter during the discovery process. Example 3-4 shows the unique command output for a large-scale system with two regions; all other command output is similar to the previous example. Example 3-4 discover Command Output For Large-Scale Systems The discover process has detected 2 regions.
NOTE: The following procedure assumes that all enclosures have been physically set up and populated with nodes, all components have been cabled together as described in the HP XC Hardware Preparation Guide, you have prepared the head node and the non-blade server nodes according to the instructions in the HP XC Hardware Preparation Guide, and the server blade head node is installed with HP XC System Software.
1. Begin this procedure as the root user on the head node.
make the appropriate BIOS settings on all server blade nodes. These tasks are documented in the HP XC Hardware Preparation Guide (BIOS settings depend upon the hardware model type). Return here when you have finished with those tasks and proceed to “Discover All Nodes and Enclosures”. 3.6.2.3 Discover All Nodes and Enclosures Follow this procedure to discover all enclosures and all nodes (including server blades) in the hardware configuration.
Attempting to start hpls power daemon ... done
Waiting for power daemon ... done
Checking if all console ports are reachable ...
number of cps to check, 5
checking 172.31.16.5
checking 172.31.16.1
checking 172.31.16.4
checking 172.31.16.3
checking 172.31.16.2
.done
Starting CMF for discover...
Stopping cmfd: [FAILED]
Starting cmfd: [ OK ]
Waiting for CMF to establish console connections .......... done
uploading database
Restarting dhcpd
Opening /etc/hosts
Opening /etc/hosts.new.XC
Opening /etc/powerd.
# script your_filename
4. Change to the following directory:
# cd /opt/hptc/config/sbin
5. Start the discovery process:
# ./discover --enclosurebased --verbose --single
Command output is similar to the following:
Discovery - XC Cluster version HP XC V3.2.
7. Do one of the following: • Proceed to “Modify the Default Password for HP ProLiant DL140 and DL145 Hardware Models” if the hardware configuration contains HP ProLiant DL140 and DL145 hardware models. • Proceed to “Task 6: Set Up the System Environment” (page 76) if the hardware configuration does not contain HP ProLiant DL140 and DL145 hardware models. 3.6.
password. Changing the default password is not required, but HP recommends changing the factory default value for security purposes. Omit this step for all other server models.
1. Use the method of your choice to view the /etc/dhcpd.conf file and look for the characters cp- in host names to determine console port names.
2. Use the telnet command and the internal name of the console port to log in to each node's console management device and change the default password.
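For example, a hedged sketch of these two steps (the console port name cp-n2 is hypothetical; use the cp- names that appear in your /etc/dhcpd.conf file):
# grep cp- /etc/dhcpd.conf
# telnet cp-n2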
3.7 Task 6: Set Up the System Environment Table 3-5 lists the tasks that are required to set up the system environment. Perform these tasks now before the system is configured so that the information is propagated to the appropriate client nodes during the initial image synchronization. Some tasks are required and other tasks are conditionally optional, as listed in Table 3-5.
3.7.2 Configure Interconnect Switch Monitoring Line Cards (Required) You must configure Quadrics switch controller cards, InfiniBand switch controller cards, and Myrinet monitoring line cards on the interconnect to diagnose and debug problems with the interconnect. These cards must be configured to pass the operation verification program (OVP), which is used to verify the proper operation of the system after it has been installed and configured.
9. Repeat steps 5 and 6 and make the same changes to the sendmail.cf file.
10. Save the changes to the file and exit the text editor.
11. Restart sendmail:
# service sendmail restart
To forward mail to users, the sendmail service requires users to create .forward files in their home directories to specify where mail is to be sent. If you intend to make additional, more advanced modifications to sendmail, HP recommends that you do not modify the .cf files directly. Rather, modify the .
Install the software now before the system is configured so that the software is transparently propagated to all nodes during the initial image synchronization. The remainder of this section provides information about installing additional software products on an HP XC system. The following topics are addressed: • “Install Additional HP Software Products” (page 79) • “Install Third-Party Software Products” (page 81) • “Install Compilers” (page 81) 3.7.7.
1. pidentd-3.0.15sg-1.{architecture}
2. qs-A.02.00.03-0.{architecture}
3. serviceguard-A.11.16.04-0.{architecture}
NOTE: HP recommends that you obtain the latest available release of Serviceguard for Linux Version 11.16 and all available patches so that you install the most up-to-date version of Serviceguard. New versions of Serviceguard Version 11.16 interoperate with HP XC Version 3.2.1 as expected.
The quorum server RPM, qs-A.02.00.03-0.product.redhat.
3.7.7.1.3 HP Remote Graphics Software HP Remote Graphics Software (RGS) is an optional software product that displays images created on a remote SVA display device. If you want to use RGS with SVA, you must purchase RGS separately from HP. See the RGS product documentation for installation instructions. If you install RGS, the HP XC cluster_config utility prompts you for specific RGS configuration information when you configure the system.
During the compiler installation process, use the default locations suggested by the installation process, if possible. If you change the installation directory and modules are being used, you must edit the corresponding modulefile to point to their new location or create a corresponding symbolic link. Modulefiles are located in the /opt/module/modulefiles directory.
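For example, one way to satisfy the symbolic link option is to link the default location to the actual installation directory so that the unmodified modulefile still resolves (both paths in this sketch are hypothetical; adjust them to your installation):
# ln -s /data/compilers/intel /opt/intel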
The command depends on the workstation model:
• Enter the following command to change the name of xw8400 workstations:
# ./modify_node_type "hp workstation xw8200" \
"HP xw8400 Workstation" node_list
• Enter the following command to change the name of xw9400 workstations:
# ./modify_node_type "hp workstation xw9300" \
"HP xw9400 Workstation" node_list
3.7.10 Enable Software RAID-0 or RAID-1 on Client Nodes
This task is optional.
Special Considerations for Nagios and LSF During the system configuration phase, the cluster_config command attempts to create a nagios and an lsfadmin account for use by Nagios and LSF, respectively. To use existing nagios and lsfadmin user accounts from a site wide NIS system (or some other external user authentication system), you must manually create local XC accounts that mirror the site wide accounts (with matching user identification (UID) and group identification (GID) values).
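For example, if the sitewide nagios account used UID and GID 500 (the numeric values and home directory in this sketch are hypothetical; substitute the values from your NIS maps, and repeat the procedure for lsfadmin), matching local accounts could be created as follows:
# groupadd -g 500 nagios
# useradd -u 500 -g nagios -d /home/nagios nagios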
However, if the client nodes require a different partitioning scheme, you have the flexibility to apply customized, fixed partition sizes or to assign partition sizes based on a percentage of the disk size to all or a subset of client nodes. See Appendix E (page 189) for instructions. 3.7.14 Create the HP Modular Cooling System Configuration File This task is optional. Perform this task only if the hardware configuration includes a water-based HP Modular Cooling System (MCS) device.
name=mcs2 ipaddr=172.23.0.2 location=Cab CBB2 nodes=n[37-72] status=offline [mcs3] name=mcs3 ipaddr=172.23.0.3 location=Cab CBB3 nodes=n[73-108] status=offline [mcs4] name=mcs4 ipaddr=172.23.0.4 location=Cab CBB4 nodes=n[109-144] status=offline [mcs5] name=mcs5 ipaddr=172.23.0.5 location=Cab CBB5 nodes=n[145-180] status=offline 3.7.15 Mount Network File Systems This task is optional. If you plan to mount NFS file systems, add the mount points to the /hptc_cluster/etc/fstab.
if [ -f $NEWF ]
then
    echo "replacing $i with $NEWF"
    rm $i
    mv $NEWF $i
fi
fi
done
3.8 Task 7: Run the cluster_config Utility to Configure the System
The next step in the process is to configure the system. In this task, use the cluster_config utility to review default role assignments and modify the role assignments on all nodes in the system based on the size of the system (and other considerations, such as improved availability).
If you answer yes, the database backup file is stored in the /var/hptc/database directory. 5. Proceed to one of the following sections: • If you have installed and configured an availability tool (such as HP Serviceguard), proceed to “Task 8: Configure Availability Sets” to configure availability sets. • If you are not configuring availability sets, proceed to “Task 9: Modify and Assign Node Roles” (page 90). 3.9 Task 8: Configure Availability Sets This task is optional.
2. Do one of the following: a. Enter the letter e to create an availability set. Proceed to step 3. b. Enter the letter p to omit availability sets and proceed to “Task 9: Modify and Assign Node Roles” (page 90). You cannot enable improved availability without first configuring at least one availability set. c. Enter the letter q to exit the cluster_config utility. Are you sure you want to quit? [y/n] y 3.
Do you want to proceed to roles configuration? [y/n] Proceed to “Task 9: Modify and Assign Node Roles” to evaluate the default node role configuration and re-adjust the default role assignments where necessary. 3.10 Task 9: Modify and Assign Node Roles Follow this procedure to use the cluster_config utility to view the default system configuration and modify default role assignments and assign additional roles to nodes where necessary.
Continue? [y/n] y
Node: n12
location: Level 1 Switch 172.20.65.1, Port 18
Roles assigned: compute
Node: n11
location: Level 1 Switch 172.20.65.1, Port 17
Roles assigned: compute
NOTE: If the hardware configuration contains HP server blades and enclosures, the node's enclosure location is provided:
Node: n11
location: Enclosure n-enc00110a881340
Roles assigned: compute
3. Do the following to determine whether you have to modify the default role assignments:
to continue with the system configuration process and proceed to “Task 10: Respond to Configuration Questions”. NOTE: If you are an experienced HP XC administrator, see “Customize Service and Client Configurations” (page 211) for more information about customizing the services configuration. • Enter the letter p to continue with the system configuration process.
2. Specify a quorum server node name or a full path to the lock LUN (for example, /dev/sdb1, where 1 is the partition number) if you have configured availability sets to use Serviceguard as the availability tool. See "Deciding on the Method to Achieve Quorum for HP Serviceguard Clusters" (page 80) or the Serviceguard documentation for more information.
Executing C20smartd gconfigure
Executing C30syslogng_forward gconfigure
Executing C35dhcp gconfigure
Executing C42mcs gconfigure
Executing C50cmf gconfigure
Executing C50lvs gconfigure
7. Define an LVS alias if you assigned a login role to one or more nodes. This alias is the name by which users will log in to the system. If you did not assign a login role to any node, you are not asked to supply an LVS alias.
Executing C50supermond gconfigure
Executing C51nagios_monitor gconfigure
Executing C60nis gconfigure
Executing C51nagios_monitor gconfigure
Executing C52snmp_traps gconfigure
Configuring the snmptrapd service for the cluster.
SNMP traps are configured to be received over the following interfaces:
loopback
Admin
Snmptrapd is also configured to listen to the Nagios IP alias.
13. Supply the name or IP address of the external NIS master server and the NIS domain name if you assigned the nis_server role to one or more nodes to configure them as NIS slave servers. If you did not assign a nis_server role to any node, you are not asked to supply this information. Network Information Service (NIS) Configuration This step sets up one or more NIS servers within the XC system that are "slaves" to an external NIS "master".
partition. If you want additional partitions, configure them manually in the /hptc_cluster/slurm/etc/slurm.conf file.
The current Node Partition configuration is:
PartitionName=lsf RootOnly=YES Shared=FORCE Nodes=n[11-16]
Do you want to enable SLURM-controlled user-access to the compute nodes? (y/n) [n]: n
SLURM configuration complete.
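If you do add partitions manually, an entry follows the same key=value format shown in the output above; for example (the partition name and node range here are hypothetical):
PartitionName=debug RootOnly=NO Shared=NO Nodes=n[1-4]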
Table 3-9 LSF-HPC with SLURM and Standard LSF Features
LSF-HPC with SLURM:
• Parallel support (through SLURM) for:
— Accounting
— Signal propagation
— I/O
— Job launching
Standard LSF:
• Ideal for serial jobs because it is a load-based scheduler
• Finds the free resource that is the least loaded and dispatches the job to that node.
• Sufficient for sites that do not need the type of parallel job support provided by LSF-HPC with SLURM.
Logging installation sequence in /opt/hptc/lsf/files/lsfslurm/install-20060925114531/lsf6.2_lsfinstall/Install.log
1) linux2.6-glibc2.3-x86_64-slurm
Press 1 or Enter to install this host type:
Enter 1
The sample command output was obtained from an HP ProLiant server. The tar file name is linux2.6-glibc2.3-x86_64-slurm (the string x86_64 signifies an Opteron or Xeon chip architecture). The string ia64 is included in the file name for HP Integrity servers.
21. Do one of the following: • Enter the letter p to create the golden image. Proceed to step 22. • Enter the letter q to exit the cluster_config utility and change your responses. Exiting now does not create the golden image, and your previous responses are stored in the configuration management database. The next time you re-run the cluster_config utility, your previous responses are used as the default responses. 22.
info: Executing C90munge cconfigure
info: Executing C90slurm cconfigure
info: Executing C95lsf cconfigure
info: nconfig shut down
info: nconfig started
info: Executing on head node
info: Executing C02network nrestart
info: Executing C03nicbond nrestart
info: Executing C04ip6tables nrestart
info: Ex
IMPORTANT: Do not perform a global replace of the /etc/dhcpd.conf file with the entire contents of the .ORIG backup copy because the new version of the /etc/dhcpd.conf file might contain new or additional information that is not present in the backup copy.
4. Save your changes to the /etc/dhcpd.conf file and exit the text editor.
5. Restart the dhcpd service:
# service dhcpd restart
3.
# ls /opt/hptc/etc/license
CAUTION: You cannot continue if the license key file is not present in this directory. See "Task 7: Have the License Key File Ready" (page 25) and "Put the License Key File in the Correct Location (Required)" (page 76) for more information about obtaining and positioning the license key file if you have not already done so.
3. Ensure that the power is off on all nodes except the head node.
For more information about startsys command-line options and option values, see startsys(8).
5. If you want to watch as the startsys command images and turns on power to the nodes, open a second terminal window and issue a tail command to view the following log files:
• /hptc_cluster/adm/logs/imaging.log
• /hptc_cluster/adm/logs/startsys.
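For example, to follow the imaging log in real time:
# tail -f /hptc_cluster/adm/logs/imaging.log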
Fri Jul 06 09:01:22 2007 Powering on for boot: 1 node -> n15
Fri Jul 06 09:02:33 2007 Retrying power --on command: 1 node -> n15
Fri Jul 06 09:04:18 2007 Processing completed for: 1 node -> n15
*** Fri Jul 06 09:04:33 2007 Current statistics:
Booted and available: 1 node -> n15
Waiting for hierarchy to boot: 14 nodes -> n[1-14]
Progress: You must manually power on the following nodes:
n1
Press enter after applying power to these nodes. continuing ........
# cat /root/.ssh/id_dsa.pub | ssh root@IR0N00 cat '>>' /.ssh/authorized_keys You might be prompted to enter the InfiniBand switch root password. The factory default password is br6000. If you set a unique password for the switch, enter that password instead. Proceed to “Task 14: Create a Lock LUN Device File”. 3.15 Task 14: Create a Lock LUN Device File Perform this task only if you configured improved availability of services and are using Serviceguard as the availability tool.
startAvailTool: ========== Executing '/opt/hptc/availability/serviceguard/start_avail'... Configuring HP Serviceguard cluster [-C /usr/local/cmcluster/conf/avail1.config -P /usr/local/cmcluster/conf/nat.n7/nat.n7.config -P /usr/local/cmcluster/conf/nat.n8/nat.n8.config -P /usr/local/cmcluster/conf/lvs.n8/lvs.n8.config -P /usr/local/cmcluster/conf/nagios.n8/nagios.n8.config -P /usr/local/cmcluster/conf/dbserver.n8/dbserver.n8.config ]. Applying the cluster configuration. Begin cluster verification...
/hptc_cluster/adm/log/consolidated.log file and the /hptc_cluster/adm/logs/snmptraps.log file. This enables Nagios to generate alerts for all MCS traps posted with a priority of WARNING or greater. 1.
Restarting SLURM... SLURM Post-Configuration Done. In this example, the spconfig utility was run on a system with a QsNetII interconnect. For all other interconnect types, the first two lines of the command output are not displayed. NOTE: If a compute node did not boot up, the spconfig utility configures the node as follows: Configured unknown node n14 with 1 CPU and 1 MB of total memory... After the node has been booted up, re-run the spconfig utility to configure the correct settings. 3. 4.
Remainder Applies to LSF-HPC with SLURM: The remainder of this procedure applies to LSF-HPC with SLURM. If standard LSF is configured, omit the remaining steps.
4. If you assigned two or more nodes with the resource_management role and want to enable LSF failover, enter the following command; otherwise, proceed to step 5.
# controllsf enable failover
5. Determine the node on which the LSF daemons are running:
# controllsf show current
LSF is currently running on node n32, and assigned to node n32
3.21 You Are Done You have completed the mandatory system configuration tasks. Proceed to Chapter 4 (page 113) to verify the successful installation and configuration of the system. 3.
4 Verifying the System and Creating a Baseline Record of the Configuration Complete the tasks described in this chapter to verify the successful installation and configuration of the HP XC system components. With the exception of the tasks that are identified as optional, HP recommends that you perform all tasks in this chapter.
# lsid
Platform LSF 6.2, LSF_build_date
Copyright 1992-2005 Platform Computing Corporation
My cluster name is hptclsf
My master name is n13
[root@n16 ~]# lshosts
HOST_NAME   type      model
n13         LINUX64   Itanium2
n16         LINUX64   Itanium2
n1          LINUX64   Itanium2
n2          LINUX64   Itanium2
n3          LINUX64   Itanium2
n4          LINUX64   Itanium2
n5          LINUX64   Itanium2
n6          LINUX64   Itanium2
n7          LINUX64   Itanium2
n8          LINUX64   Itanium2
n9          LINUX64   Itanium2
n10         LINUX64   Itanium2
n11         LINUX64   Itanium2
n12         LINUX64   Itanium2
# shownode config hostgroups hostgroups: headnode: n16 serviceguard:avail1: n14 n16 In this example, one availability set, avail1, has been configured. 2. On one node in each availability set, view the status of the Serviceguard cluster: # pdsh -w n14 /usr/local/cmcluster/bin/cmviewcl NOTE: The /usr/local/cmcluster/bin directory is the default location of the cmviewcl command. If you installed Serviceguard in a location other than the default, look in the /etc/cmcluster.
The OVP also runs the following benchmark tests. These tests compare values relative to each node and report results with values more than three standard deviations from the mean:
• LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. This test is CPU intensive and stresses the nodes, with limited data exchange.
• PALLAS exercises the interconnect connection between compute nodes to evaluate MPI performance.
6. When all OVP tests pass, proceed to “Task 4: Run the SVA OVP Utility” (if SVA is installed) or “Task 5: View System Health”. 4.4 Task 4: Run the SVA OVP Utility Run the SVA OVP utility only if you installed and configured SVA. The SVA OVP runs a series of Chromium demonstration applications on all defined display surfaces, which verifies the successful installation of SVA. Follow this procedure to start the SVA OVP: 1.
Host Monitor IP Assignment - DHCP Load Average LSF Failover Monitor Nagios Monitor NodeInfo PING Interconnect Resource Monitor Resource Status Root key synchronization Sensor Collection Monitor Slurm Monitor Slurm Status Supermon Metrics Monitor Switch Switch Data Collection Syslog Alert Monitor Syslog Alerts System Event Log System Event Log Monitor System Free Space Totals: 1-Ok 1-Ok 10-Ok 1-Ok 1-Ok 10-Ok 10-Ok 1-Ok 10-Ok 1-Ok 1-Ok 1-Ok 10-Ok 1-Ok 2-Ok 1-Ok 1-Ok 10-Ok 9-Ok 1-Ok 10-Ok 115-Ok 0-Warn 0-War
4.7 Task 7: Create a Baseline Report of the System Configuration The sys_check utility is a data collection tool that is used to diagnose system errors and problems. Use the sys_check utility now to create a baseline report of the system configuration (software and hardware). The sys_check utility collects configuration data only for the node on which it is run unless you set and export the SYS_CHECK_SYSWIDE variable, which collects configuration data for all nodes in the HP XC system.
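A minimal sketch, assuming the variable only needs to be set to a nonempty value and that the sys_check utility is in the root user's command search path (the output file name here is arbitrary):
# export SYS_CHECK_SYSWIDE=1
# sys_check > /tmp/sys_check_baseline.out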
5 Upgrading Your HP XC System To This Release This chapter describes how to use the upgrade process to install HP XC System Software Version 3.2.1 on an HP XC system that is running a previous version of the HP XC System Software.
NOTE: If you are not sure what version of the HP XC System Software is installed on the system, enter the following command: # cat /etc/hptc-release 5.1.2 Is Upgrading Appropriate for Your System Configuration? An upgrade path from previous HP XC System Software releases to Version 3.2.1 is provided, but HP recommends a new installation of Version 3.2.1.
Table 5-2 Upgrade Characteristics (continued)
Characteristic: Effect on configuration files
Description: The RPM upgrade process saves user customizations to existing HP XC configuration files by renaming them with a .rpmsave file extension. If you want to retain your customizations to these files, or any customizations you made to standard Linux configuration files, you must manually merge the customizations into the newly delivered version of the file (which might have changed).
2. 3. Notify users in advance that a software upgrade is planned. Because you must shut down all nodes, plan the upgrade for a time when system activity is at its lightest. Use your preferred backup solution to perform a full backup of the head node, which can be done while the head node is up and when jobs are running. If the /hptc_cluster file system is located on an HP SFS server, ensure you back up the SFS server as well.
9. Read the HP Scalable Visualization Array, V2.1.1 Release Notes before beginning the upgrade procedure. Upgrade and reinstall procedures for SVA are part of Section 5.7 (page 128) Proceed to “Task 2: Prepare the System State”. 5.3 Task 2: Prepare the System State On the head node, follow this procedure to ensure that the system is in the appropriate state for the upgrade: 1.
# mkdir /mnt/cdrom
3. Mount the DVD on /mnt/cdrom:
# mount /dev/cdrom /mnt/cdrom
4. Change to the following directory:
# cd /mnt/cdrom/HPC/RPMS
5. Find the name of the hptc-upgrade RPM:
# ls hptc-upgrade*
hptc-upgrade-1-n.noarch.rpm
In the previous example, n in the RPM name represents the RPM version number. The version number varies from release to release.
6. Install the RPM:
# rpm -Uvh hptc-upgrade-1-n.noarch.rpm
7. Before running the preupgradesys command, remove the HP SFS RPM files.
NOTE: Because the upgraderpms command output spans several pages, it is shown in Appendix L (page 233). Return here when command processing is complete.
3. Unmount the DVD:
# cd
# umount /dev/cdrom
4. Eject the DVD from the drive.
5. Perform the following tasks to verify that the RPM upgrade process was successful.
a. Use the method of your choice to view the log file that contains the results of the Linux and HP XC RPM upgrade process. This example uses the more command.
# more /var/log/yum_upgrade.
Linux sup16n0 2.6.9-55.9hp.4sp.XCsmp #1 SMP Tue Nov 27 18:32:01 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
3. If you are using HP SFS, verify that the ib0 interface is up using ifconfig ib0.
NOTE: The Voltaire IB stack interface was named ipoib0. This name is not valid for OFED.
If ib0 is not running, you must manually create the /etc/sysconfig/network-scripts/ifcfg-ib0 file and start the interface using ifup ib0. Sample ifcfg-ib0 file:
DEVICE="ib0"
ONBOOT="yes"
BOOTPROTO="none"
IPADDR="172.22.0.
# rpm -ivh pidentd-{version}
# rpm -ivh qs-A.{version}.rhel{version}
# rpm -ivh serviceguard-{version}.rhel{version}
6. You might need to upgrade firmware according to the master firmware list for this release: http://www.docs.hp.com/en/linuxhpc.html
7. If you want to configure eligible services for improved availability, you must install and configure an availability tool now.
Table 5-5 Files Containing User Customizations (continued)
File Name:
/etc/my.cnf.rpmsave
/opt/hptc/systemimager/etc/updgi_exclude_file.rpmsave
/opt/hptc/systemimager/etc/chkconfig.map.rpmsave
/opt/hptc/systemimager/etc/base_exclude_file.rpmsave
/opt/hptc/systemimager/etc/*.conf.rpmsave
/opt/hptc/config/*.rpmsave
/opt/hptc/config/etc/*.rpmsave
4. Perform this step only if you know you changed standard Linux configuration files.
upgradesys output logged to /var/log/upgradesys/upgradesys.log
CAUTION: Do not proceed to the next step in the upgrade process if the output from the upgradesys script indicates failures. If you cannot determine how to resolve these errors, contact your local HP support center.
2. Review the /opt/hptc/systemimager/etc/base_exclude_file to determine if you want to exclude files from the golden image beyond what is already excluded.
6. Change directory to the configuration directory:
# cd /opt/hptc/config/sbin
7. Specify one of the following cluster_config options:
• To migrate the existing system configuration:
# ./cluster_config --migrate
• To apply new default role assignments to the existing system configuration:
# ./cluster_config --init
NOTE: To avoid duplicating command output here, cluster_config output is shown in Section 3.11 (page 92). Continue to the next step in this procedure when the cluster_config processing is complete. 13. Look at the backup copy of the slurm.conf file, which is located in the /hptc_cluster/slurm/etc/slurm.conf.bak file. If you previously customized this file, you must merge those customizations into the new version of the /hptc_cluster/slurm/etc/slurm.conf file. Otherwise, omit this step. 14.
Table 5-8 Upgrade startsys Command-Line Options Based on Hardware Configuration
Hardware Configuration: 300 nodes or fewer
startsys Command Line: For small-scale hardware configurations, nodes are imaged and rebooted in one operation. The nodes complete their per-node configuration phase, thus completing the installation. This option applies only for nodes that have previously been set up to network boot.
6. If LSF is not running on the head node, log in to the node that is running LSF. If LSF is running on the head node, omit this step.
# ssh n32
7. Restart the LIM daemon:
# lsadmin limrestart
Checking configuration files ... No errors found.
Restart LIM on ...... done
Restarting the LIM daemon is required because the licensing of LSF-HPC with SLURM occurs when the LIM daemon is started.
6 Reinstalling HP XC System Software Version 3.2.1 This chapter describes how to reinstall HP XC System Software Version 3.2.1 on a system that is already running Version 3.2.1. Reinstalling an HP XC system with the same release might be necessary if you participated as a field test site of an advance development kit (ADK) or an early release candidate kit (RC).
# scontrol update NodeName=n[1-5] State=IDLE
6.2 Reinstalling Systems with HP Integrity Hardware Models
This section describes the following tasks:
• "Reinstalling the Entire System" (page 138)
• "Reinstalling One or More Nodes" (page 138)
6.2.1 Reinstalling the Entire System
Follow this procedure to reinstall HP XC System Software Version 3.2.1 on systems comprised of HP Integrity hardware models. A reinstallation requires that all nodes are set to network boot.
# stopsys n[1-5]
# startsys --image_and_boot n[1-5]
All nodes reboot automatically when the installation is finished.
5. Run the transfer_to_avail command to shut down all HP XC services and IP aliases that will be managed by an availability tool. After shutting down these services and IP aliases, the transfer_to_avail command starts each availability tool. Then, the availability tool starts up the services and IP aliases it is managing.
7 Installing and Using PBS Professional This chapter addresses the following topics: • “PBS Professional Overview” (page 141) • “Before You Begin” (page 141) • “Plan the Installation” (page 141) • “Perform Installation Actions Specific to HP XC” (page 142) • “Configure PBS Professional under HP XC” (page 142) • “Replicate Execution Nodes” (page 143) • “Enter License Information” (page 144) • “Start the Service Daemons” (page 144) • “Set Up PBS Professional at the User Level” (page 144) • “Run HP MPI Tasks”
7.4 Perform Installation Actions Specific to HP XC Follow this installation procedure: 1. Install the PBS server node (front-end node) first, using the installation script provided by the software vendor, and specify the following values: a. Accept the default value offered for the PBS_HOME directory, which is /var/spool/PBS. b. When prompted for the type of PBS installation, select: option 1 (Server, execution and commands). c. If available, enter the license key during the interactive installation.
7.5.1 Configure the OpenSSH scp Utility By default, PBS Professional uses the rcp utility to copy files between nodes. The default HP XC configuration disables rcp in favor of the more secure scp command provided by OpenSSH. To use PBS Professional on HP XC, configure HP XC to default to scp as follows: 1. 2. Using a text editor of your choice, open the /etc/pbs.conf file on the server node.
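The remaining steps of this procedure are not reproduced here. As a sketch only, and assuming the PBS_SCP parameter supported by PBS Professional (confirm the parameter name against the PBS Professional documentation for your release), the change amounts to pointing PBS at the scp binary in the /etc/pbs.conf file:

PBS_SCP=/usr/bin/scp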
# pdcp -rp -w "x[n-n]" /usr/pbs /usr
# pdcp -rp -w "x[n-n]" /var/spool/PBS /var/spool
# pdcp -p -w "x[n-n]" /etc/pbs.conf /etc
# pdcp -p -w "x[n-n]" /etc/init.d/pbs /etc/init.d
Use the following as an example:
• You have installed the PBS server on node n100.
• The first PBS execution node is node n49.
• You want to replicate the execution environment to nodes n1 through n48.
In this case, the value of the node list expression is: "n[1-48]".
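Substituting the values from that example, the node list expression becomes "n[1-48]" and the commands are entered as follows:

# pdcp -rp -w "n[1-48]" /usr/pbs /usr
# pdcp -rp -w "n[1-48]" /var/spool/PBS /var/spool
# pdcp -p -w "n[1-48]" /etc/pbs.conf /etc
# pdcp -p -w "n[1-48]" /etc/init.d/pbs /etc/init.d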
7.10 Run HP MPI Tasks The PBS Professional distribution contains a wrapper script named pbs_mpihp, which is used to run HP MPI jobs. The wrapper script uses information about the current PBS Professional allocation to construct a command line, and optionally, an appfile suitable for HP MPI. The wrapper also sets the MPI_REMSH environment variable to the PBS Professional pbs_tmrsh remote shell utility.
8 Installing the Maui Scheduler This chapter describes how to install and configure the Maui Scheduler software tool to interoperate with SLURM on an HP XC system. It addresses the following topics: • “Maui Scheduler Overview” (page 147) • “Readiness Criteria” (page 147) • “Preparing for the Installation” (page 147) • “Installing the Maui Scheduler” (page 148) • “Verifying the Successful Installation of the Maui Scheduler” (page 150) 8.
• • http://www.chpc.utah.edu/docs/manuals/software/maui.html http://www.clusterresources.com/products/maui/ Ensure That LSF-HPC with SLURM Is Not Activated HP does not support the use of the Maui Scheduler with LSF-HPC with SLURM. These schedulers have not been integrated and will not work together on an HP XC system. Before you install the Maui Scheduler on an HP XC system, you must be sure that the HP XC version of LSF-HPC with SLURM is not activated on the system.
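One quick way to check is the controllsf utility, which is also used by the OVP to query the LSF virtual host name; the subcommand for deactivating LSF is not shown here, so consult the HP XC administration documentation if the output indicates that LSF is active:

# controllsf show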
• • “Task 4: Edit the SLURM Configuration File” (page 150) “Task 5: Configure the Maui Scheduler” (page 150) 8.4.1 Task 1: Download the Maui Scheduler Kit Follow this procedure to download the Maui Scheduler kit: 1. 2. Log in as the root user on the head node. Download the Maui Scheduler kit to a convenient directory on the system. The Maui Scheduler kit is called maui-3.2.6p9, and it is available at: http://www.clusterresources.com/products/maui/ 8.4.
NODECFG[n16] PARTITION=PARTA
NODECFG[n15] PARTITION=PARTA
NODECFG[n14] PARTITION=PARTA
NODECFG[n13] PARTITION=PARTA
6. Save the changes to the file and exit the text editor.
8.4.4 Task 4: Edit the SLURM Configuration File
Uncomment the following lines in the /hptc_cluster/slurm/etc/slurm.conf SLURM configuration file:
SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
8.4.
StartTime: Fri Jul 06 20:30:43 Total Tasks: 6 Req[0] TaskCount: 6 Partition: lsf Network: [NONE] Memory >= 1M Disk >= 1M Swap >= 0 Opsys: [NONE] Arch: [NONE] Features: [NONE] NodeCount: 1 Allocated Nodes: [n16:4][n15:2] IWD: [NONE] Executable: [NONE] Bypass: 0 StartCount: 1 PartitionMask: [lsf] Reservation '116' (00:00:00 -> 1:00:00 Duration: 1:00:00) PE: 6.00 StartPriority: 1 Table 8-2 lists several commands that provide diagnostic information about various aspects of resources, workload, and scheduling.
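Table 8-2 is not reproduced here. The commands listed there are standard Maui utilities; typical examples, to be confirmed against your installation, are showq to summarize active, idle, and blocked jobs, checkjob to display the detailed state of a single job, and diagnose -n to report per-node state:

# showq
# checkjob 116
# diagnose -n

The job identifier 116 corresponds to the reservation shown in the previous output and is used here only as an example.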
9 Adding Visualization Nodes to An Existing HP XC System This chapter describes how to install and configure visualization nodes into an existing HP XC system after the SVA nodes have been fully integrated into the hardware configuration. It addresses the following topics: • • “Prerequisites” (page 153) “Installation Scenarios” (page 153) 9.
9.2.2 New Visualization Nodes Do Not Exceed the Maximum Number of Nodes Supplied to the cluster_prep Command
Follow this procedure if you added visualization nodes to the existing system and the number of new nodes does not exceed the maximum node number you set during the initial cluster_prep process. Because the number of new nodes does not exceed the previous maximum number of nodes, you do not need to run the cluster_prep command again, but you do have to discover the new nodes. 1. 2.
10 Troubleshooting This chapter addresses the following topics: • “Troubleshooting the Discovery Process” (page 155) • “Troubleshooting the Cluster Configuration Process” (page 158) • “Troubleshooting LSF and Licensing” (page 162) • “Troubleshooting the Imaging Process” (page 160) • “Troubleshooting the OVP” (page 162) • “Troubleshooting SLURM” (page 166) • “Troubleshooting the Software Upgrade Procedure” (page 167) 10.
• • • • • • “Discovery Process Hangs While Discovering Console Ports” (page 156) “ProCurve Switches Do Not Obtain Their IP Addresses” (page 156) “ProCurve Switches Can Take Time to Get IP Addresses” (page 156) “Not All Console Ports Are Discovered” (page 156) “Some Console Ports Have Not Obtained Their IP Addresses” (page 157) “Not All Nodes Are Discovered” (page 157) After performing the suggested corrective action, rerun the discover command. 10.1.
NOTE: If the --oldmp option was used on the discover command line, it is assumed that all Management Processors (MPs) have their IP addresses set statically, and therefore are not subject to this step in the discovery process. If some console ports are not configured to use DHCP, they will not be discovered. Therefore, the first item to verify is whether the nondiscovered console ports are configured to use DHCP.
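One way to confirm whether a console port is requesting an address is to watch the DHCP log on the head node while you reset or power-cycle the console port, and look for DHCPDISCOVER or DHCPREQUEST entries from the console port MAC address; the log file name follows the one referenced later in this chapter:

# tail -f /var/log/dhcpd.log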
. . .
Switch 172.20.65.3 port 4 ... Node Found
Switch 172.20.65.3 port 5 ... Node Found
Switch 172.20.65.3 port 6 ... NO Node Found
Switch 172.20.65.4 port 1 ... Node Found
. . .
In this case, a node is plugged into port 6 of the Branch Root switch at address 172.20.65.3. To resolve the discovery problem, examine this node to see what actions it is taking during power-on.
gethostbyaddr failure To resolve this problem, edit the /etc/resolv.conf file and fix incorrect DNS entries. • Nodes that fail the configuration phase are put into single-user mode and marked as disabled in the database if an essential service failed. 10.2.1 lsadmin limrestart Command Fails “Task 18: Finalize the Configuration of Compute Resources” (page 108) describes LSF postconfiguration tasks.
# service mysqld restart The command you were trying to initiate should now be able to connect to the database. 10.3 Troubleshooting the Imaging Process This section describes hints to troubleshoot the imaging process. System imaging and node configuration information is stored in the following log files: • • • /hptc_cluster/adm/logs/imaging.log /var/log/systemimager/rsyncd /hptc_cluster/adm/logs/startsys.
Table 10-1 Diagnosing System Imaging Problems (continued)
Symptom: The network boot times out. The system boots from local disk and runs nconfigure. You can verify this by checking messages written to the imaging.log file.
How To Diagnose:
• Verify DHCP settings and status of daemon.
• Verify network status and connections.
• Monitor the /var/log/dhcpd.log file for DHCPREQUEST messages from the client node MAC address.
• Check boot order and BIOS settings.
Enter the following command on the affected node to fix the network boot problem: setnode --resync node_name 10.3.3 How To Monitor An Imaging Session To monitor an imaging operation, use the tail -f command in another terminal window to view the imaging log files. It is possible to actually view an installation through the remote serial console, but to do so, you must edit the /tftpboot/pxelinux.cfg/default file before the installation begins and add the correct serial console device to the APPEND line.
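As a sketch only, the serial console device and speed are hardware dependent and the values shown here are assumptions; the edited APPEND line might look similar to the following, with the console argument added to the end of the existing options:

APPEND <existing options> console=ttyS0,115200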
Verify perf_health: Testing memory_usage ... The headnode is excluded from the memory usage test. Number of nodes allocated for this test is 14 Job <2049> is submitted to default queue << Waiting for dispatch ...>> <> The following node has memory usage more than 25%: n3: memory usage is 34.38%, 12.
Virtual hostname is lsfhost.localdomain Comparing ncpus from Lsf lshosts to Slurm cpu count. The Lsf and Slurm cpu count are NOT in sync. The lshosts 'ncpus' value of 1560 differs from the cpu total of 2040 calculated from the sinfo output. Suggest running 'lshosts -w' manually and compare the ncpus value with the output from sinfo --- FAILED --Testing hosts_status ... Running 'bhosts -w'. Checking output from bhosts. Running 'controllsf show' to determine virtual hostname. Checking output from controllsf.
nodes ibblc64 and ibblc65 have an Exchange value of 2077.790000 10.5.2 OVP Reports Benign Nagios Warnings The OVP might return the following Nagios warning messages. These messages are benign and you can ignore them. Verify nagios: Testing configuration ... Running basic sanity check on the Nagios configuration file. Starting the command: /opt/hptc/bin/nagios -v /opt/hptc/nagios/etc/nagios_local.cfg Here is the output from the command: Warnings were reported. Nagios 2.3.
# qsctrl
qsctrl: QR0N00:00:0:0 <--> Elan:0:0 state 3 should be 4
qsctrl: QR0N00:00:0:1 <--> Elan:0:1 state 3 should be 4
qsctrl: QR0N00:00:0:2 <--> Elan:0:2 state 3 should be 4
qsctrl: QR0N00:00:0:3 <--> Elan:0:3 state 3 should be 4
qsctrl: QR0N00:00:1:0 <--> Elan:0:4 state 3 should be 4
qsctrl: QR0N00:00:1:1 <--> Elan:0:5 state 3 should be 4
qsctrl: QR0N00:00:1:2 <--> Elan:0:6 state 3 should be 4
qsctrl: QR0N00:00:1:3 <--> Elan:0:7 state 3 should be 4
qsctrl: QR0N00:00:2:0 <--> Elan:0:8 state 3 should be 4
The sinfo example shown in this section illustrates the Low RealMemory reason. It is more obscure and can be a side effect of the system configuration process. This error is reported because the SLURM slurm.conf file is configured with a RealMemory value that is higher than the MemTotal value in the /proc/meminfo file that is being reported by the compute node. SLURM does not automatically restore a node that had failed at any point because of this reason.
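To compare the two values and, after correcting the slurm.conf entry, return the node to service manually (the node name is an example; the scontrol form follows the one used earlier in this guide):

# grep MemTotal /proc/meminfo
# grep RealMemory /hptc_cluster/slurm/etc/slurm.conf
# scontrol update NodeName=n3 State=IDLE

Run the first command on the affected compute node and the other two commands on the head node.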
Table 10-2 Software Upgrade Log Files (continued)
File Name: /var/log/yum.log
Contents: Results of the YUM upgrade
File Name: /var/log/upgrade/kernel/RPMS
Contents: Symbolic links to HP XC kernel-related RPMs on the HP XC System Software Version 3.2.1 DVD
File Name: /var/log/upgrade/RPMS
Contents: Symbolic links to HP XC RPMs on the HP XC System Software Version 3.2.1 DVD
If you see errors in the /var/log/postinstall.log or /var/log/yum_upgrade.log files, fix the problem by manually installing the RPMs that failed to upgrade properly: 1.
# touch /hptc_cluster/adm/logs/imaging.log 3. If hptc-ire-serverlog is not running, start the service: # service hptc-ire-serverlog start 10.7.3 External Ethernet Connection Fails To Come Up It is possible for an external Ethernet connection to occasionally fail to come up after invoking the cluster_config --init | --migrate command. You may see messages similar to the following in the /var/log/nconfig.log file: Bringing up interface eth2: SIOCSIFFLAGS: Cannot allocate memory Failed to bring up eth2.
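The remainder of this troubleshooting item is not reproduced here. As a general sketch on a Red Hat based node, you can retry the interface manually and confirm that it came up; whether this matches the remedy recommended in the full text of this section is an assumption:

# ifup eth2
# ifconfig eth2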
A Installation and Configuration Checklist Table A-1 provides a list of tasks performed during a new installation. Use this checklist to ensure you complete all installation and configuration tasks in the correct order. Perform all tasks on the head node unless otherwise noted.
Table A-1 Installation and Configuration Checklist Task and Description Reference Preparing for the Installation 1. Read related documents, especially the HP XC System Software Release Notes. “Task 1: Read Related Documentation” (page 23) If the hardware configuration contains HP blade servers and enclosures, download and print the HP XC Systems With HP Server Blades and Enclosures HowTo. 2. Plan for future releases. “Task 2: Plan for Future HP XC Releases” (page 23) 3.
Table A-1 Installation and Configuration Checklist (continued) Task and Description Reference 18. Perform the following tasks to define and set up the system environment “Task 6: Set Up the System before the golden image is created: Environment” (page 76) • Put the XC.lic license key file in the /opt/hptc/etc/license directory (required). • Configure interconnect switch line monitoring cards (required). • Configure sendmail (required). • Customize the Nagios environment (required).
Table A-1 Installation and Configuration Checklist (continued) Task and Description Reference Verifying the System 32. Verify proper operation of LSF if you installed LSF. "Task 1: Verify the LSF Configuration" (page 113) 33. Verify proper operation of availability tools if you installed and configured an availability tool. "Task 2: Verify Availability Tools" (page 114) 34. Run the operation verification program (OVP).
B Host Name and Password Guidelines This appendix contains guidelines for making informed decisions about information you are asked to supply during the installation and configuration process. It addresses the following topics: • “Host Name Guidelines” (page 175) • “Password Guidelines” (page 175) B.1 Host Name Guidelines Follow these guidelines when deciding on a host name: • Host names can contain from 2 to 63 alphanumeric uppercase or lowercase characters (a-z, A-Z, 0-9).
When choosing a password, do not use any of the following: • Single words found in any dictionary in any language. • Personal information about you or your family or significant others such as first and last names, addresses, birth dates, telephone numbers, names of pets, and so on. • Any combination of single words in the dictionary and personal information. • An obvious sequence of numbers or letters, such as 789 or xyz.
C Enabling telnet on iLO and iLO2 Devices
The procedure described in this appendix applies only to HP XC systems with nodes that use Integrated Lights Out (iLO or iLO2) as the console management device. New nodes that are managed through iLO or iLO2 console management connections, and that have never been installed with HP XC software, might have iLO interfaces that are not configured properly for HP XC operation.
2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. Do one of the following: • If you cannot find an entry corresponding to the new node, check the network connections. Make repairs and rerun the discover command. • If you do find an entry corresponding to the new node, note the IP address on the line that begins with the string fixed-address, and proceed to step 3. Open a web browser on the head node.
option host-name "cp-n2"; fixed-address 172.21.0.2; # location "Level 2 Switch 172.20.65.4, Port 2"; } host cp-n3 { hardware ethernet 00:11:0a:30:b0:bc; option host-name "cp-n3"; fixed-address 172.21.0.3; # location "Level 2 Switch 172.20.65.4, Port 3"; } host cp-n4 { hardware ethernet 00:11:0a:2f:8d:fc; option host-name "cp-n4"; fixed-address 172.21.0.4; # location "Level 2 Switch 172.20.65.4, Port 4"; } 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17.
D Configuring Interconnect Switch Monitoring Cards You must configure the Quadrics switch controller cards, the InfiniBand switch controller cards, and the Myrinet monitoring line cards on the system interconnect to diagnose and debug problems with the system interconnect.
Table D-1 Quadrics Switch Controller Card Naming Conventions and IP Addresses for Reduced Bandwidth (continued) Number of Nodes Node-Level Switch Name 1025 to 2048 QR0N00 to QR0N31 (P) 172.20.66.1 to 172.20.66.32 QR0N00_S to QR0N31_S (S) 172.20.66.33 to Secondary not applicable 172.20.66.64 1 2 Node-Level IP Address Top-Level Switch Name Top-Level Switch IP Address QR0T00 to QR0T31 172.20.66.65 to 172.20.66.96 Secondary not applicable (P) represents the primary switch controller.
1. 2. 3. 4. 5. 6. 7. 8. 9. Show network settings Change network settings Run jtest Set module mode Firmware upgrade Quit Reboot Access Settings Self Test Enter 1-9 and press return: 3. Enter 2 to access the Change network settings menu option and set the switch to STATIC IP addresses: Quadrics Switch Control Select Protocol 1. BOOTP 2. STATIC 3.
When the connection is established, use the Quadrics login password you set in step 5 to log in to the switch controller. D.2 Configure Myrinet Switch Monitoring Line Cards You can use the Myrinet switch monitoring line card to run diagnostic tools and to check for events on each port of the line card. Table D-3 provides the switch names and associated IP addresses you need during the configuration procedure.
# location "Level 2 Switch 172.20.65.4, Port 1"; } . . . host n3 { hardware ethernet 00:11:0a:ea:ea:41; option host-name "n3"; fixed-address 172.20.0.3; option xc-macaddress "00:11:0a:ea:ea:41"; # location "Level 2 Switch 172.20.65.3, Port 3"; } host MR0N00 { hardware ethernet your_MAC_address; option host-name "MR0N00"; fixed-address 172.20.66.1; } } } 5. 6. Save your changes and exit the text editor. Copy the contents of the /etc/dhcpd.
IMPORTANT: The IP address base differs if the hardware configuration contains HP server blades and enclosures, and you must use the IP addresses listed in Table D-5 instead of the addresses listed in Table D-4. Table D-4 InfiniBand Switch Controller Card Naming Conventions and IP Addresses Switch Order Switch Name IP Address First switch IR0N00 172.20.66.1 Second switch IR0N01 172.20.66.2 Third switch IR0N02 172.20.66.3 Last switch IR0N0n 172.20.66.
ISR-9024# password update enable 8. Access the configuration mode: ISR-9024# config 9. Access the interface fast mode: ISR-9024(config)# interface fast 10. Set the IP address of the switch and the netmask using the data in Table D-4 or Table D-5 as a reference. Set the netmask to 255.255.0.0 as shown: ISR-9024(config-if-fast)# ip-address-fast set IP_address 255.255.0.0 11. Confirm the settings: ISR-9024(config-if-fast)# ip-address-fast show 12.
23. Repeat this entire procedure for each switch controller card, modifying the host name and switch address using the data in Table D-4 as a reference. After completing this procedure, you can access the switch controller cards using either the switch name or IP address: # telnet IR0N00 When the connection is established, use the password you set in step 6 to log in to the switch controller. D.3.
E Customizing Client Node Disks Use the information in this appendix to customize the disk partition layout on client node disk devices. It addresses the following topics: • “Overview of Client Node Disk Imaging” (page 189) • “Dynamically Configuring Client Node Disks” (page 189) • “Statically Configuring Client Node Disks ” (page 195) E.
issues encountered when the golden client node disk configuration differs from the client disk configuration. You also have the flexibility to configure client node disks on a per-image and per-node basis and to create an optional scratch partition. Partition sizes can be fixed or can be based on a percentage of total disk size. You can set the appropriate variables in the /opt/hptc/systemimager/etc/make_partitions.sh file or in user-defined files with a .part extension.
E.2.2 Example 1: Modifying Partitions Using Fixed Sizes and Defining an Additional Partition This example applies fixed sizes to modify the default partition sizes on all compute nodes and creates an additional /scratch partition on each compute node. The user-defined .part files allow partition modifications to be done on a per-image or per-node basis. 1. Use the text editor of your choice to create the following file to define the partition format for compute nodes: /var/lib/systemimager/scripts/compute.
7. Do one of the following to install and image the client nodes: • If the client nodes were not previously installed with the HP XC System Software, see “Task 12: Run the startsys Utility to Start the System and Propagate the Golden Image” (page 102) to continue the initial installation procedure.
8. Run the cluster_config utility, choosing the default answers, to create a new master autoinstallation script (/var/lib/systemimager/scripts/base_image.master.0) and generate an updated version of the golden image: # /opt/hptc/config/sbin/cluster_config 9. After the cluster_config utility completes its processing, the client nodes are ready to be installed.
# shownode servers lvs n[135-136] 9. Create a symbolic link from the node names of the login nodes to the newly created master autoinstallation script. Note that the node name is appended with a .sh extension: for i in n135 n136 do ln -sf login.master.0 $i.sh done 10.
NOTE: With software RAID, the Linux boot loader requires the /boot partition to be mirrored (RAID1). In addition, the swap partitions are not raided (with striping) because the operating system stripes them automatically. 6.
E.3.1 Enable Static Disk Configuration Dynamic disk configuration method is the default behavior for the HP XC imaging environment. Before you can use the static disk configuration method to customize the client disk configuration you must enable it. You cannot use a combination of the two methods simultaneously. Follow this procedure to enable static disk configuration: 1. Use the text editor of your choice to open the following file: /etc/systemimager/systemimager.conf 2.
# setnode --resync --all # stopsys # startsys --image_and_boot Wait until the stopsys command completes before invoking the startsys command. E.3.2.2 Example 2: Creating a New .conf File and Associated Master Autoinstallation Script If necessary, you can create your own master autoinstallation script with static disk configuration included from a customized .conf file by following the procedure shown here.
F Description of Node Roles, Services, and the Default Configuration This appendix addresses the following topics: • “Default Node Role Assignments” (page 199) • “Special Considerations for Modifying Node Role Assignments” (page 199) • “Role Definitions” (page 201) F.1 Default Node Role Assignments Table F-1 lists the default role assignments. The default assignments are based on the number of total nodes in the system.
and LSF controller daemons run on that node, and no fail over of these components is possible. HP recommends that you configure at least two nodes with the resource_management role to distribute the work of these components and provide a failover configuration.
F.3 Role Definitions A node role is defined by the services provided to the node. The role is an abstraction that combines one or more services into a group. Roles provide a convenient way of installing services on a node.
HP recommends that you assign this role to the head node and another service node. Assigning this role to another service node enables the Subnet Manager, and therefore the InfiniBand network, to continue to function even if the head node is down. Either the cisco_hsm or the voltaire_hsm roles might be present in a system, but not both. F.3.4 Common Role The common role is automatically assigned to all nodes, and it cannot be removed. This role runs services that must be present on every node.
F.3.7 Disk_io Role Nodes with the disk_io role provide access to storage and file systems mounted locally on the node. This role can be located on any node that provides local file system access to all nodes using NFS. Assign this role to any node that is exporting SAN storage. The configuration and management database for the NFS Server service supplied by this role is nfs_server. Nodes with this role normally reside in the utility cabinet of the cluster and have the most direct access to storage.
If you add the management_hub role to nodes in the system, HP recommends that you also add the console_network role as well. If the console_network role is not on a management_hub node, the hub picks a node that is assigned with the console_network role. See “Console_network Role” (page 202) for more information. F.3.11 Management Server Role The management_server role contains services that manage the overall management infrastructure.
The license manager service enables some software components in the system when a valid license is present. F.3.14 Resource Management Role Nodes with the resource_management role provide the services necessary to support SLURM and LSF. On systems with fewer than 63 total nodes, this role is assigned by default to the head node. On large-scale systems with more than 64 nodes, this role is assigned by default to the node with the internal node name that is one less than the head node.
G Using the cluster_config Command-Line Menu
This appendix describes how to use the configuration command-line menu that is displayed by the cluster_config utility. It addresses the following topics:
• "Overview of the cluster_config Command-Line Menu" (page 207)
• "Displaying Node Configuration Information" (page 207)
• "Modifying a Node" (page 208)
• "Analyzing Current Role Assignments Against HP Recommendations" (page 210)
• "Customize Service and Client Configurations" (page 211)
G.
G.3 Modifying a Node From the command-line menu of the cluster_config utility, enter the letter m to modify node role assignments and Ethernet connections: [L]ist Nodes, [M]odify Nodes, [A]nalyze, [H]elp, [P]roceed, [Q]uit: m You are prompted to supply the node name of the node you want to modify. All operations you perform from this point are performed on this node until you specify a different node name.
5. After you have added the Ethernet connections, you have the option to do the following: • Enter the letter e to add an Ethernet connection on another node. • Enter the letter d to remove an Ethernet connection. • Enter the letter b to return to the previous menu. G.5 Modifying Node Role Assignments The cluster configuration menu enables you to assign roles to specific nodes. “Role Definitions” (page 201) provides definitions of all node roles and the services they provide.
G.6 Analyzing Current Role Assignments Against HP Recommendations From the command-line menu of the cluster_config utility, enter the letter a to analyze current node role assignments with those recommended by HP. [L]ist Nodes, [M]odify Nodes, [A]nalyze, [H]elp, [P]roceed, [Q]uit: a HP recommends using this option any time you run the cluster_config utility, even when you are configuring the system for the first time.
Note: n499 does not have external connection recommended by resource_management Role Rec: Role Recommended HN Req: Head Node Required HN Rec: Head Node Recommended Exc Rec: Exclusivity Recommended Ext Req: External Connection Required Ext Rec: External Connection Recommended Table G-3 provides an explanation of the analysis.
The following prompt appears: [S]ervices Config, [P]roceed, [Q]uit: Do one of the following: • Enter the letter s to perform customized services configuration on the nodes in the system. This option is intended for experienced HP XC administrators who want to customize service servers and clients. Intervention like this is typically not required for HP XC systems. See “Services Configuration Commands” (page 212) for information about each services configuration command.
Example: Disable a Client Similarly, to disable node n1 as a client of the supermond service, enter the node attribute na_disable_client.supermond to the affected node. In this instance, it is not the server but a client of the service that is affected. Creating and Adding Node Attributes Using the previous two examples, enter the following commands to create and add node attributes: svcs> create na_disable_server.cmf Attribute "na_disable_server.cmf" created svcs> create na_disable_client.
Table G-4 Service Configuration Command Descriptions (continued) Command Description and Sample Use [d]estroy attribute_name Destroys an existing attribute with a specific name; it is the opposite of the create function. All occurrences of this attribute on a node or nodes are removed as well. You will receive an error if you try to destroy an attribute that does not exist. Sample use: svcs> destroy na_disable_server.
Table G-4 Service Configuration Command Descriptions (continued) Command Description and Sample Use [h]elp Displays a help message. Sample use: svcs> help [b]ack Returns to the previous menu. Sample use: svcs> back G.
H Determining the Network Type The information in this appendix applies only to cluster platforms with a QsNetII interconnect. During the processing of the cluster_config utility, the swmlogger gconfig script prompts you to supply the network type of the system. The network type reflects the maximum number of ports the switch can support, and the network type is used to create the qsnet diagnostics database.
I LSF and SLURM Environment Variables This appendix lists the default values for LSF and SLURM environment variables that were set during the HP XC System Software installation process. For more information about setting LSF environment variables and parameters listed in Table I-1, see the Platform LSF Reference manual.
Table I-1 Default Installation Values for LSF and SLURM (continued) Environment Variable Default Value Description XC_LIBLIC /opt/hptc/lib/libsyslic.so Location of the HP XC OEM license module. Where is this Value Stored? lsf.conf file LSF_NON_PRIVILEGED_PORTS Y When LSF commands are run by the root user, this variable configures them to communicate using non-privileged ports (> 1024). lsf.conf file LSB_RLA_UPDATE 120 Controls how often LSF synchronizes with SLURM.
Table I-1 Default Installation Values for LSF and SLURM (continued) Where is this Value Stored? Environment Variable Default Value Description One LSF partition RootOnly YES Specifies that only the root user or the SLURM administrator can create allocations for normal user jobs. slurm.conf file One LSF partition Shared FORCE Specifies that more than one job can run on the same node. LSF uses this facility to support preemption and scheduling of multiple serial jobs on the same node.
J Customizing the SLURM Configuration This appendix describes customizations you can make to the /hptc_cluster/slurm/etc/slurm.conf SLURM configuration file. It addresses the following topics: • “Assigning Features” (page 223) • “Creating Additional SLURM Partitions” (page 223) • “Required Customizations for SVA” (page 223) J.1 Assigning Features Assigning features to nodes is common if the compute resources of the cluster are not consistent.
SVA with LSF-HPC with SLURM If you installed LSF, you must create two SLURM partitions: one partition for visualization jobs and one partition for LSF jobs. A node can be present in one partition only. The following procedure provides an example of a cluster that has five nodes: node 5 is the head node, nodes 1 and 2 are visualization nodes, and nodes 3 and 4 are compute nodes. Using that example, you must modify the slurm.conf file to create two partitions: 1.
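The numbered steps themselves are not reproduced here. A minimal sketch of the resulting partition definitions in the slurm.conf file for that five-node example follows; the visualization partition name and the exclusion of the head node from both partitions are assumptions, and the lsf partition settings follow the defaults listed in Appendix I:

PartitionName=lsf RootOnly=YES Shared=FORCE Nodes=n[3-4]
PartitionName=visualization Nodes=n[1-2]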
K OVP Command Output This appendix provides command output from the OVP utility, which verifies successful installation and configuration of software and hardware components. # ovp --verbose XC CLUSTER VERIFICATION PROCEDURE Fri Jul 06 08:03:03 2007 Verify connectivity: Testing etc_hosts_integrity ... There are 47 IP addresses to ping. A total of 47 addresses were pinged. Test completed successfully. All IP addresses were reachable. +++ PASSED +++ Verify client_nodes: Testing network_boot ...
Running verify_server_status Starting the command: /opt/hptc/sbin/lmstat Here is the output from the command: lmstat - Copyright (c) 1989-2004 by Macrovision Corporation. All right s reserved. Flexible License Manager status on Fri 7/06/2007 08:03 License server status: 27000@ n16 License file(s) on n16: /opt/hptc/etc/license/XC.lic: n16: license server UP (MASTER) v9.2 Vendor daemon status (on n16): Compaq: UP v9.2 Checking output from command. +++ PASSED +++ Verify SLURM: Testing spconfig ...
Here is the output from the command: n[3-16] 14 lsf idle Checking for non-idle node states. +++ PASSED +++ Verify LSF: Testing identification ... Starting the command: /opt/hptc/lsf/top/6.2/linux2.6-glibc2.3-ia64-slurm/bin/lsid Here is the output from the command: Platform LSF HPC 6.2 for SLURM, May 10 2006 Copyright 1992-2005 Platform Computing Corporation My cluster name is hptclsf My master name is lsfhost.localdomain Checking output from command. +++ PASSED +++ Testing hosts_static_resource_info ...
Nagios 2.3.1 Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org) Last Modified: 05-15-2006 License: GPL Reading configuration data... Running pre-flight check on configuration data... Checking services... Checked 158 services. Checking hosts... Checked 16 hosts. Checking host groups... Checked 13 host groups. Checking service groups... Checked 0 service groups. Checking contacts... Checked 1 contacts. Checking contact groups... Checked 1 contact groups. Checking service escalations...
Syslog Alert Monitor       1-Ok   0-Warn 0-Crit 0-Pend 0-Unk
Syslog Alerts              14-Ok  0-Warn 0-Crit 0-Pend 0-Unk
System Event Log           14-Ok  0-Warn 0-Crit 0-Pend 0-Unk
System Event Log Monitor   1-Ok   0-Warn 0-Crit 0-Pend 0-Unk
System Free Space          14-Ok  0-Warn 0-Crit 0-Pend 0-Unk
Totals:                    157-Ok 0-Warn 0-Crit 0-Pend 0-Unk
+++ PASSED +++
Verify xring: Testing ...
Send 200 messages with a size of 1024 bytes to 14 hosts.
Waiting for dispatch ... Starting on lsfhost.localdomain Detailed streams results for each node can be found in /hptc_cluster/ovp/ovp_ n16_032807.tests/tests/100.perf_health/40.memory if the --keep flag was specified. Streams memory results summary (all values in mBytes/sec): min: 596.868100 max: 1054.011600 median: 1009.190000 mean: 937.954567 range: 457.143500 variance: 18632.672317 std_dev: 136.501547 All nodes were in range for this test. +++ PASSED +++ Testing network_stress ...
All nodes were in range for this test. +++ PASSED +++ Testing network_unidirectional ... Number of nodes allocated for this test is 14 Job 114 is submitted to default queue interactive . Waiting for dispatch ... Starting on lsfhost.localdomain [0: n3:1] ping-pong 7718.08 usec/msg 518.26 MB/sec [1: n4:2] ping-pong 7613.24 usec/msg 525.40 MB/sec [2: n5:3] ping-pong 7609.81 usec/msg 525.64 MB/sec [3: n6:4] ping-pong 7529.98 usec/msg 531.21 MB/sec [4: n7:5] ping-pong 7453.28 usec/msg 536.
L upgraderpms Command Output This appendix provides command output from the upgraderpms command, which is run during a software upgrade. Command output is similar to the following: Use the upgraderpms utility only if you are performing a minor upgrade to install the new HP XC release on your system. Before running the upgraderpms utility, you must mount the new XC release DVD on the /mnt/cdrom directory and then use the cd command to go to that directory.
---> Downloading header for glibc-devel to pack into transaction set. ---> Package glibc-devel.ia64 0:2.3.4-2.19 set to be updated ---> Downloading header for slurm-devel to pack into transaction set. ---> Package slurm-devel.ia64 0:1.0.15-1hp set to be updated ---> Downloading header for qsswm to pack into transaction set. ---> Package qsswm.ia64 0:2.1.1-1.2hptc set to be updated ---> Downloading header for openssl to pack into transaction set. ---> Package openssl.ia64 0:0.9.7a-43.
---> Downloading header for slurm-switch-elan to pack into transaction set. ---> Package slurm-switch-elan.ia64 0:1.0.15-1hp set to be updated ---> Downloading header for pam-devel to pack into transaction set. ---> Package pam-devel.ia64 0:0.77-66.14 set to be updated ---> Downloading header for ipvsadm to pack into transaction set. ---> Package ipvsadm.ia64 0:1.24-6 set to be updated ---> Downloading header for shadow-utils to pack into transaction set. ---> Package shadow-utils.ia64 2:4.0.3-60.
---> Package krb5-libs.ia64 0:1.3.4-27 set to be updated ---> Downloading header for slurm to pack into transaction set. ---> Package slurm.ia64 0:1.0.15-1hp set to be updated ---> Downloading header for hptc-nconfig to pack into transaction set. ---> Package hptc-nconfig.noarch 0:1.0-68 set to be updated ---> Downloading header for OpenIPMI-libs to pack into transaction set. ---> Package OpenIPMI-libs.ia64 0:1.4.14-1.4E.
---> Downloading header for newt-devel to pack into transaction set. ---> Package newt-devel.ia64 0:0.51.6-7.rhel4 set to be updated ---> Downloading header for hptc-qsnet2-diag to pack into transaction set. ---> Package hptc-qsnet2-diag.noarch 0:1-19 set to be updated ---> Downloading header for ypserv to pack into transaction set. ---> Package ypserv.ia64 0:2.13-9.1hptc set to be updated ---> Downloading header for keyutils to pack into transaction set. ---> Package keyutils.ia64 0:1.
--> Processing Dependency: libkeyutils.so.1(KEYUTILS_1.0)(64bit) for package: keyutils --> Restarting Dependency Resolution with new changes. --> Populating transaction set with selected packages. Please wait. ---> Downloading header for hptc-supermon-modules-source to pack into transaction set. ---> Package hptc-supermon-modules-source.ia64 0:2-0.18 set to be updated ---> Downloading header for keyutils-libs to pack into transaction set. ---> Package keyutils-libs.ia64 0:1.
hptc-power noarch 3.0-32 hpcrpms hptc-qsnet2-diag noarch 1-19 hpcrpms hptc-slurm noarch 1.0-3hp hpcrpms hptc-supermon ia64 2-0.18 hpcrpms hptc-supermon-config noarch 1-30 hpcrpms hptc-supermon-modules ia64 2-6.k2.6.9_34.7hp.XCsmp hpcrpms 117 k hptc-syslogng noarch 1-24 hpcrpms hptc-sysman noarch 1-1.71 hpcrpms hptc-sysmandb noarch 1-44 hpcrpms hptc_ovp ia64 1.15-21 hpcrpms hptc_release noarch 1.0-15 hpcrpms hwdata noarch 0.146.18.EL-1 linuxrpms ia32el ia64 1.2-5 linuxrpms ibhost-biz ia64 3.5.5_21-2hptc.k2.
selinux-policy-targeted noarch 1.17.30-2.126 linuxrpms 119 k shadow-utils warning: /etc/localtime created as /etc/localtime.rpmnew Stopping sshd:[ OK ] Starting sshd:[ OK ] warning: /etc/security/limits.conf created as /etc/security/limits.conf.rpmnew /opt/hptc/lib warning: /etc/systemimager/autoinstallscript.template saved as /etc/systemimager/autoinstallscript.template.rpmsave warning: /etc/localtime created as /etc/localtime.rpmnew warning: /etc/nsswitch.conf created as /etc/nsswitch.conf.
0:2.6.9-34.7hp.XC Dependency Installed: Tk.ia64 0:804.027-1hp audit.ia64 0:1.0.12-1.EL4 hptc-supermon-modules-source.ia64 0:2-0.18 keyutils-libs.ia64 0:1.0-2 modules.ia64 0:3.1.6-4hptc Updated: IO-Socket-SSL.ia64 0:0.96-98 MAKEDEV.ia64 0:3.15.2-3 OpenIPMI.ia64 0:1.4.14-1.4E.12 OpenIPMI-libs.ia64 0:1.4.14-1.4E.12 autofs.ia64 1:4.1.3-169 binutils.ia64 0:2.15.92.0.2-18 bzip2.ia64 0:1.0.2-13.EL4.3 bzip2-devel.ia64 0:1.0.2-13.EL4.3 bzip2-libs.ia64 0:1.0.2-13.EL4.3 chkconfig.ia64 0:1.3.13.3-2 cpp.ia64 0:3.4.
Setting up repositories Reading repository metadata in from local files Parsing package install arguments Resolving Dependencies --> Populating transaction set with selected packages. Please wait. ---> Downloading header for iptables-ipv6 to pack into transaction set. ---> Package iptables-ipv6.ia64 0:1.2.11-3.1.
Text-DHCPparse ia64 collectl-utils noarch hptc-avail noarch hptc-ibmon noarch hptc-mcs noarch hptc-mdadm noarch hptc-smartd noarch hptc-snmptrapd noarch rrdtool ia64 xcgraph noarch Installing for dependencies: cgilib ia64 net-snmp-perl ia64 0.07-2hp 1.3.10-1 1.0-1.19 1-3 1-6 1-1 1-1 1-8 1.2.15-0.2hp 0.1-11 hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms 9.4 482 19 8.9 46 3.2 4.0 120 1.4 23 k k k k k k k k M k 0.5-0.2hp 5.1.2-11.EL4.
Glossary A administration branch The half (branch) of the administration network that contains all of the general-purpose administration ports to the nodes of the HP XC system. administration network The private network within the HP XC system that is used for administrative operations. availability set An association of two individual nodes so that one node acts as the first server and the other node acts as the second server of a service. See also improved availability, availability tool.
operating system and its loader. Together, these provide a standard environment for booting an operating system and running preboot applications. enclosure The hardware and software infrastructure that houses HP BladeSystem servers. extensible firmware interface See EFI. external network node A node that is connected to a network external to the HP XC system. F fairshare An LSF job-scheduling policy that specifies how resources should be shared by competing users.
image server A node specifically designated to hold images that will be distributed to one or more client systems. In a standard HP XC installation, the head node acts as the image server and golden client. improved availability A service availability infrastructure that is built into the HP XC system software to enable an availability tool to fail over a subset of eligible services to nodes that have been designated as a second server of the service See also availability set, availability tool.
LVS Linux Virtual Server. Provides a centralized login capability for system users. LVS handles incoming login requests and directs them to a node with a login role. M Management Processor See MP. master host See LSF master host. MCS An optional integrated system that uses chilled water technology to triple the standard cooling capacity of a single rack. This system helps take the heat out of high-density deployments of servers and blades, enabling greater densities in data centers.
onboard administrator See OA. P parallel application An application that uses a distributed programming model and can run on multiple processors. An HP XC MPI application is a parallel application. That is, all interprocessor communication within an HP XC parallel application is performed through calls to the MPI message passing library. PXE Preboot Execution Environment.
an HP XC system, the use of SMP technology increases the number of CPUs (amount of computational power) available per unit of space. ssh Secure Shell. A shell program for logging in to and executing commands on a remote computer. It can provide secure encrypted communications between two untrusted hosts over an insecure network. standard LSF A workload manager for any kind of batch job.
Index A adduser command, 83 administration network activating, 69 testing, 115 using as interconnect network, 66 administrator password ProCurve switch, 56 Apache self-signed certificate, 59 configuring, 95 avail_node_management role, 201 availability role, 201 availability set choosing nodes as members, 29 configuring with cluster_config, 88 defined, 26 availability tool, 26 Heartbeat, 29 Serviceguard, 28 starting, 106 verifying operation, 114 B back up CMDB, 118 CMDB before cluster_config, 87 SFS server,
disk partition sizes, 37 file system layout, 37 node role assignments, 199 system configuration, 199 dense node names, 52 development environment tools, 81 disabled node inserted in database, 158 discover command, 65 command line options, 66 enclosures, 71 flowchart, 155 HP ProLiant DL140, 67 HP ProLiant DL145 G2, 67 HP ProLiant DL145 G3, 67 information required by, 55 no node found, 157 no switch found, 68 nodes, 71 - -oldmp option, 57 switches, 69 troubleshooting, 155 disk configuration file, 195 disk par
providing feedback for, 22 HP Integrity BMC/IPMI password, 78 HP MPI defined, 35 HP ProLiant DL140 discovering, 67 HP ProLiant DL145 G2 discovering, 67 HP ProLiant DL145 G3 discovering, 67 HP Remote Graphics Software (see RGS) HP Scalable Visualization Array (see SVA) HP Serviceguard (see Serviceguard) HP StorageWorks Scalable File Share (see SFS) HP XC system software defined, 35 determining installed version, 122 installation process, 35 software stack, 35 hpasm, 202 hptc-ire-serverlog service, 168 /hptc_
J jumbo frames, 55 K kernel dependent modules, 65 kernel modules rebuilding, 65 Kickstart file, 35, 37 Kickstart installation (see installation) ks.cfg file, 37 L license HP XC system software, 25 location of license key file, 76 management, 26 troubleshooting, 162 XC.
N O Nagios configuration details, 59 customizing, 78 defined, 35 enabling web access, 94 improved availability, 31 service assignment, 204 user account, 84 verifying system health, 117 warnings from OVP, 165 naming conventions, 16 NAT configuration, 59 NAT service, 203 configuring, 94 netboot failure, 161 network connectivity testing, 113, 115 network mask, 54 network type, 58, 217 configuring, 93 NFS daemon, 58 configuring, 92 NFS server service, 203 NIS configuration, 59 NIS slave server configuring, 96
QsNet network type, 58 node- and top-level switches, 93 Quadrics interconnect, 58 (see also QsNet) connecting to the switch monitoring line card, 183 logging in to line monitoring card, 183 network type, 217 switch controller card, 181 QuickSpecs, 81 quorum configuring a lock LUN, 80 configuring a quorum server, 80 quorum server configuring, 80 R real enclosure defined, 52 real server defined, 58 reinstall software, 137 release version, 122 remote graphics software RGS, 81 Virtual GL, 81 reporting document
patch download site, 64 reinstalling HP XC system software, 137 software development tools, 81 software installation (see installation) software patches, 64 software RAID, 83 documentation, 21 enabling on client nodes, 83 mdadm utility, 21 mirroring, 83 striping, 83 software stack, 35 software upgrade (see upgrade) software version, 122 sparse node numbering, 52 spconfig utility, 108 ssh configuring on InfiniBand switch, 105 ssh key, 57 standard LSF configuring failover, 32 defined, 35 features, 97 installi
V /var file system, 37 /var/lib/systemimager/images/base_image, 100 /var/log/nconfig.log file, 53 /var/log/postinstall.log file, 35 virtual console and media, 44 virtual enclosure defined, 52 Virtual GL, 81 voltaire_hsm role, 205 W website HP software patches, 64 HP XC System Software documentation, 17 ITRC, 64 workstation nodes, 56 changing database name, 82 X XC software version, 122 XC.