HP XC System Software Installation Guide Version 3.0
© Copyright 2003, 2004, 2005, 2006 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Table of Contents
About This Document  13
    Intended Audience  13
    Document Organization  13
    HP XC Information
Put the License Key File in the Correct Location (Required)  46
Configure the Interconnect Switch Monitoring Line Cards (Required)  46
Configure sendmail (Required)  47
Create the /hptc_cluster File System (Optional)
Resource Management Role  95
A  Installation and Configuration Checklist  97
B  Host Name and Password Guidelines  101
    Defining Host Names  101
    Setting Strong Passwords
Readiness Criteria  135
Before You Begin  135
Installation Procedure  136
Task 1: Download the Maui Scheduler Kit
List of Figures
K-1  Discovery Flowchart
List of Tables
1-1  Installation Types  21
1-2  Naming Conventions  21
3-1  HP XC Software Stack
List of Examples
2-1  Sample XC.lic File  24
4-1  Sample sfstab.proto File  49
7-1  Successful Content in /var/log/postinstall.log File  81
7-2  Failure in /var/log/postinstall.log File
About This Document This document describes the procedures and tools that are required to install and configure HP XC System Software Version 3.0 on HP Cluster Platforms 3000, 4000, and 6000. An HP XC system is integrated with several open source software components. Some open source software components are being used for underlying technology, and their deployment is transparent.
• Appendix B: Host Name and Password Guidelines (page 101) provides guidelines for various pieces of user-defined information that you are asked to supply during the installation and configuration process. • Appendix C: Enabling telnet on iLO Devices (page 103) describes how to enable telnet on nodes using Integrated Lights Out (iLO) as the console port management device.
The QuickSpecs are located at: http://www.hp.com/techservers/clusters/xc_clusters.html HP XC Program Development Environment The following URL provides pointers to tools that have been tested in the HP XC program development environment (for example, TotalView® and other debuggers, compilers, and so on): ftp://ftp.compaq.com/pub/products/xc/pde/index.html HP Message Passing Interface HP Message Passing Interface (MPI) is an implementation of the MPI standard for HP systems.
• LSF administrator tasks are documented in the HP XC System Software Administration Guide. • LSF user tasks such as launching and managing jobs are documented in the HP XC System Software User's Guide. • To view LSF manpages supplied by Platform Computing Corporation use the following command: $ man lsf_command_name • The LSF Administrator and Reference guides developed by Platform are also available at: http://www.hp.com/techservers/clusters/xc_clusters.html • http://www.llnl.
Manpages for third-party vendor software components may be provided as a part of the deliverables for that component. Using the discover(8) manpage as an example, you can use either of the following commands to display a manpage:
$ man discover
$ man 8 discover
If you are not sure about a command you need to use, enter the man command with the -k option to obtain a list of commands that are related to the keyword.
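For example, to list all manpages whose descriptions mention node discovery (the keyword shown here is illustrative):
$ man -k discover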
Related Compiler Web Sites • http://www.intel.com/software/products/compilers/index.htm Web site for Intel® compilers. • http://support.intel.com/support/performancetools/ Web site for general Intel software development information. • http://www.pgroup.com/ Home page for The Portland Group™, supplier of the PGI® compiler.
IMPORTANT   This alert provides essential information to explain a concept or to complete a task. NOTE   A note contains additional information to emphasize or supplement important points of the main text. HP Encourages Your Comments HP encourages your comments concerning this document. We are committed to providing documentation that meets your needs. Send any errors found, suggestions for improvement, or compliments to: feedback@fc.hp.
1 Document Overview This chapter addresses the following topics: • How to Use This Document (page 21) • Naming Conventions Used in This Document (page 21) How to Use This Document Table 1-1 lists the types of installations you can perform: a new installation, a software upgrade, or a reinstallation of Version 3.0. This document is designed to lead you through the installation and configuration procedures in a specific sequence.
2 Preparing for a New Installation This chapter describes preinstallation tasks to perform before you install HP XC System Software Version 3.0.
Task 5: Arrange for IP Address Assignments and Host Names Make arrangements with your site's network administrator to assign IP addresses for the following system components. All IP addresses must be defined in your site's Domain Name System (DNS) configuration: • The external IP address of the system, if it is to be connected to an external network. The name associated with this interface is known as the LVS or cluster alias.
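After the addresses are assigned, you can confirm that the external name resolves in your site's DNS before proceeding; the cluster alias shown here is a placeholder for the name your network administrator assigns:
$ host xc-cluster.example.com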
http://www.macrovision.com Task 7: Purchase and License Additional Software from Third-Party Vendors An HP XC system supports the use of several third-party software products. Use of these products is optional; the purchase and installation of these components is your decision and depends on your site's requirements. The XC software does not bundle, nor does HP resell, the TotalView debugger or the Intel and The Portland Group compilers.
3 Installing Software on the Head Node This chapter contains an overview of the installation process and describes installation tasks.
Table 3-1 HP XC Software Stack Software Product Name Description HP MPI HP MPI provides optimized libraries for message passing designed specifically to make high-performance use of the system interconnect. HP MPI complies fully with the MPI-1.2 standard. HP MPI also complies with the MPI-2 standard, with restrictions. HP XC System Software Version 3.
Table 3-2 lists the default values defined in the Kickstart file, regardless of the cluster platform. These default values reduce the number of answers you have to provide during the installation session. Table 3-2 Default Values in the ks.cfg File Item Default Value Keyboard type United States (U.S.) Mouse Generic three button mouse emulation Language used by the installation process U.S. English Language installed on the system U.S.
Table 3-5 Distribution Media Cluster Platform Model Software DVD Title CP3000 HP XC System Software for Intel Xeon Systems Version 3.0 CP4000 HP XC System Software for AMD Opteron Systems Version 3.0 CP6000 HP XC System Software for Intel Itanium 2 Processors Version 3.0 • Be prepared to supply the information listed in Table 3-6 during the Kickstart installation session.
5. This step applies only to Itanium-based systems (HP Integrity servers). Skip this step for all other chip architectures and proceed to step 6. After displaying some power-on messages, the node power-on process displays the Boot Menu. Use the arrow keys on the keyboard to select the DVD ROM as the boot device. How you do so depends upon how the EFI environment is configured. • Select the preconfigured option to boot from the CD or DVD device if it is available. Proceed to step 6.
the system will boot from the original, completed installation (assuming that the second installation had not begun). 8. Do not press any keys while the system is coming up and while the new kernel boots. Ignore the Welcome to Kudzu utility. Do not press any keys until the login screen is displayed. 9. Log in as the root user when the login screen is displayed, and enter the root password you previously defined during the software installation process. 10.
• HP XC System Software QuickSpecs The HP XC System Software QuickSpecs contain a list of commercially available software packages that have been tested and qualified to interoperate with the HP XC System Software. If you want to learn more about these supported software packages, see the QuickSpecs at: http://www.hp.com/techservers/clusters/xc_clusters.
4 Configuring and Imaging the System This chapter contains an overview of the initial system configuration and imaging process (System Configuration and Imaging Overview) and describes system configuration tasks, which must be performed in the following order: • Task 1: Prepare for the System Configuration (page 36) • Task 2: Change the Default IP Address Base (Optional) (page 39) • Task 3: Run the cluster_prep Command to Prepare the System (page 40) • Task 4: Run the discover Command to Discover Sys
• Partitions the local disk. By default, the first disk is used because disks other than the first disk are not supported due to restrictions in the image replication environment for this release. • Creates file systems. • Downloads the golden image. • Installs the appropriate boot loader. When the automatic installation process completes, each node is rebooted and continues its configuration process, eventually ending with the login prompt.
Table 4-1 Information Required During System Configuration Item Description and User Action Information Requested by the cluster_prep Command Node name prefix During the system discovery process, each node is automatically assigned a name. This name consists of a user-defined prefix and a number based on the node's topographical location in the system. The default node prefix is the letter n; a 6-character maximum is allowed in the node prefix.
Item Description and User Action Number of nodes that are workstations Enter the number of nodes that are workstations, that is, nodes that do not have console ports. There is no default response. Enter 0 (zero) if your system does not contain workstations. You will not be prompted for this information if you are discovering a multi-region, large-scale system. You can include the ws= keyword on the discover command line to bypass this question during the discovery process.
Item Description and User Action NIS configuration If you modified the default role assignments and assigned a nis_server role to configure one or more nodes as a NIS slave server, you are prompted to enter the name of your NIS master server or its IP address as well as your NIS domain name.
Proceed to “Task 3: Run the cluster_prep Command to Prepare the System” (page 40) to begin the discovery process. Task 3: Run the cluster_prep Command to Prepare the System The first step in the configuration process is to prepare the system by running the cluster_prep command. This process sets up the node naming prefix, the number of nodes in your system, the database administrator's password, and the external Ethernet connection on the head node. 1.
In this example, the maximum number of nodes supported by the interconnect switch is 16. Enter the maximum number of nodes allowed by your interconnect switch. 3 The internal host name of the head node is based on the default node naming prefix and the maximum number of nodes in the system. Therefore, as shown in this example, the head node host name is set to n16. Restart the X server to accommodate the head node host name change: a. Press Ctrl+Alt+Backspace to restart the X server. b.
the interconnect found on the head node; specific command options are not necessary for those interconnect types. • HP recommends that you include the --verbose option because it provides useful feedback and enables you to follow the discovery process. • Table 4-1 (page 37) and discover(8) contain information about additional keywords you can add to the command line to bypass some of the questions that will be asked during the discovery process. Use of these keywords is strictly optional.
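As a minimal sketch, an invocation on a system whose interconnect is detected automatically from the head node might look like the following; only the --verbose option is shown because the exact set of options and keywords depends on your hardware (see discover(8)):
# discover --verbose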
Discovering 172.20.65.2 port 20 ... Console Port OK
Discovering 172.20.65.2 port 19 ... Console Port OK
Discovering 172.20.65.2 port 18 ... Console Port OK
Discovering 172.20.65.2 port 17 ... Console Port OK
Restarting dhcpd
discovered_cps is 6
Checking if all console ports are reachable ...
number of cps to check, 5
pinging 172.21.0.13    no response from 172.21.0.13
pinging 172.21.0.15    no response from 172.21.0.15
pinging 172.21.0.12
pinging 172.21.0.14
pinging 172.21.0.
running port_discover on 172.20.65.1 nodes Found = 6 nodes Expected = 6 All nodes initialized. Powering off all nodes but head node ... done Switch 172.20.65.1 port 22 ... Node Found Switch 172.20.65.1 port 21 ... Node Found Switch 172.20.65.1 port 20 ... Node Found Switch 172.20.65.1 port 19 ... Node Found Switch 172.20.65.1 port 18 ... Node Found Switch 172.20.65.1 port 17 ... Node Found Switch 172.20.65.1 port 16 ...
a. Use the telnet command to log in to each node's LO-100i console management device and change the password. To determine console port names, view the /etc/dhcpd.conf file and look for the characters cp- in the host name. Use the factory default user name admin and the default password admin to log in.
   # telnet cp-node_name
   login: admin
   password: admin
b. Press Escape and Shift+9 to enter the command-line mode.
c. Use the C[hange Password] option to change the console port management device password.
Install the updated RPM packages now before the system is configured so that the software updates are propagated to all nodes during the initial image synchronization. Procedure to Download Patches Follow this procedure to download XC patches from the ITRC Web site: 1. Create a temporary directory for the patch or patches on the head node. You can name this temporary directory anything you want; this procedure creates a directory called /root/patches: # mkdir /root/patches 2.
Appendix D (page 105) describes how to configure the line monitoring cards for each interconnect type. Return to this chapter when you are finished configuring the cards. Configure sendmail (Required) LSF requires a mail program to send job output to users submitting jobs and to send administrative messages to the LSF administrator. By default, LSF uses the mail program /usr/lib/sendmail.
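A quick way to confirm that the mail program on the head node can deliver messages is to send yourself a short test message; the recipient address here is an assumption for illustration:
# echo "sendmail test from the head node" | /usr/lib/sendmail -v root@localhost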
• Install HP SFS Client RPMs To Configure the /hptc_cluster File System On An HP SFS Server (page 48) • Make and Mount the /hptc_cluster File System Locally (page 49) Install HP SFS Client RPMs To Configure the /hptc_cluster File System On An HP SFS Server HP StorageWorks Scalable File Share (HP SFS) is based on Lustre® technology developed by Cluster File Systems, Inc. HP SFS is a turnkey Lustre system that can be configured to operate with an HP XC system.
6. Create an /etc/sfstab.proto file on the head node to ensure the persistence of SFS mounts. The entries in this file use the following basic syntax: ldap://sfs_server_name/filesystem mountpoint sfs optionlist 0 0 The /etc/sfstab.proto file enables you to customize the /etc/sfstab file on each node. The sfs service reads the /etc/sfstab.proto file as it starts and edits the /etc/sfstab file. In the /etc/sfstab.
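As a sketch only, an entry that follows this syntax might look like the following; the HP SFS server name, file system name, mount point, and option list are placeholders that you must replace with values appropriate for your site:
ldap://sfs1.example.com/scratch    /scratch    sfs    defaults    0 0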
Create Local User Accounts (Optional) If you intend to create local user accounts on your HP XC system rather than manage user accounts through another user authentication method (such as NIS or LDAP), use the Linux adduser command to create local user accounts on the system now, before the system is configured. See the HP XC System Software Administration Guide if you need more information about creating local user accounts.
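For example, to create one local account and set its password before the system is configured (the user name is hypothetical):
# adduser jsmith
# passwd jsmith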
Task 6: Run the cluster_config Utility to Configure the System The next step in the process is to configure the system. The cluster_config utility assigns roles to all nodes in your system based on the size of the system (and other considerations). A node role is defined by the services provided by the node. A role is an abstraction that groups one or more services. You have the option to modify this default configuration.
node: n16 location: Level 1 Switch 172.20.65.1, Port 42 CURRENT HEAD NODE roles assigned: management_server management_hub external disk_io compute External Ethernet Name: penguin.southpole.com, ipaddr: 192.0.2.0, netmask: 255.255.252.0, gateway: 192.0.2.51 node: n15 location: Level 1 Switch 172.20.65.
• Enter S to perform customized services configuration. This option is intended for experienced XC administrators who want to customize role services. Intervention like this is typically not required for HP XC systems. HP recommends that you enter p to continue with the system configuration process and proceed to step 10. Note If you are an experienced XC administrator, refer to Using the Configuration Menu (page 117) for more information about customizing the services configuration.
Configuring the following nodes as ntp servers for the cluster: n16 You must now specify the clock source for the server nodes. If the nodes have external connections, you may specify up to 4 external NTP servers. Otherwise, you must use the node's system clock. Enter the IP address or host name of the first external NTP server or leave blank to use the system clock on the NTP server node: Enter Renaming previous /etc/ntp.conf to /etc/ntp.conf.bak 3.
In addition, to complete this configuration, you will need to provide 1) the name or IP address of the NIS master, and 2) the NIS domain name hosted by the NIS master Enter the Enter the Executing Executing 7.
After cluster_config processing is complete, you have the option to modify default SLURM compute node and partition information. This information is described in “Task 8: Modify SLURM Characteristics (Optional)” (page 59). 8. Decide whether or not you want to install LSF as your job management system: Do you want to install LSF locally now? (y|n) [y]: Do one of the following: • To install LSF-HPC with SLURM or standard LSF, enter y or press the Enter key. Proceed to step 9.
11. Provide responses to install and configure LSF. This requires you to supply information about the primary LSF administrator and administrator's password. The user name lsfadmin is the default user name for the primary LSF administrator. If you accept the default user name and a NIS account exists with the same name, LSF-HPC with SLURM will be configured with the existing NIS account. You will not be prompted to supply a password for the lsfadmin account. Otherwise, accept all default answers.
your cluster "hptclsf" is running correctly, see "/opt/hptc/lsf/top/6.1/hpc_quick_admin.html" to learn more about your new LSF cluster. ***Begin LSF-HPC Post-Processing*** Created '/hptc_cluster/lsf/tmp'... Editing /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf... Moving /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf to /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf.old.6490... Editing /opt/hptc/lsf/top/conf/lsf.conf... Moving /opt/hptc/lsf/top/conf/lsf.conf to /opt/hptc/lsf/top/conf/lsf.conf.old.6490...
info: Executing C51nrpe nconfigure
info: Executing C90munge nconfigure
info: Executing C90slurm nconfigure
info: Executing C95lsf nconfigure
info: Executing C30syslogng_forward cconfigure
info: Executing C35dhcp cconfigure
info: Executing C50supermond cconfigure
info: Executing C90munge cconfigure
info: Executing C90slurm cconfigure
info: Executing C95lsf cconfigure
info: nconfig shut down
info: nconfig started
info: Executing on head node
temporary disk space and SLURM features. SLURM features are useful for distinguishing different types of nodes. You can also modify this file to configure additional node partitions. HP recommends that you review the /hptc_cluster/slurm/etc/slurm.conf file, particularly to address the following configuration aspects: • Compute node characteristics Initially, all nodes with the compute role are listed as SLURM compute nodes, and these nodes are configured statically with a processor count of two.
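As a sketch, a customized compute-node entry and the default partition entry in the slurm.conf file might look like the following; the node range, processor count, temporary disk size, and feature name are illustrative assumptions that you must adjust to match your hardware:
NodeName=n[11-15] Procs=4 TmpDisk=10240 Feature=opteron
PartitionName=lsf RootOnly=YES Shared=FORCE Nodes=n[11-16]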
to a multicast from the designated image server. Multicast imaging provides very little resource drain on the image server as compared to other file transfer technologies, and therefore, allows systems of all sizes to be installed relatively quickly. Multicast imaging uses the udpcast open source package, and the flamethrower functionality of SystemImager. A series of udp-sender daemons are run on the image server, and each client node runs a series of udp-receiver daemons during the imaging operation.
Power down required after image load -> n[14-15] Nodes requiring a power off -> n[14-15] Powering off -> n[14-15] Powering on for boot -> n14 Current statistics: Currently processing -> n[14-15] Waiting to boot -> n15 Nodes with a valid loaded image -> n[14-15] Current statistics: Currently processing -> n[14-15] Waiting to boot -> n15 Nodes with a valid loaded image -> n[14-15] Current statistics: Currently processing -> n[14-15] Waiting to boot -> n15 Nodes with a valid loaded image -> n[14-15] Processing
If your system is using a QsNetII interconnect, ensure that the number of node entries in the /opt/hptc/libelanhosts/etc/elanhosts file matches the expected number of operational nodes in the cluster. If the number does not match, verify the status of the nodes to ensure that they are all up and running and rerun the spconfig script. Output from the spconfig utility looks similar to the following for all other interconnect types: Configured unknown node n14 with 1 CPU and 4872 MB of total memory...
# lshosts
HOST_NAME    type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
lsfhost.loc  SLINUX6  Opteron8  60.0  6      1M      -       Yes     (slurm)
c. Verify the dynamic resource information:
# bhosts
HOST_NAME           STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
lsfhost.localdomai  ok      -     6    0      0    0      0      0
See the troubleshooting information in the HP XC System Software Administration Guide if you do not receive a status of ok from the bhosts command. • Standard LSF: a. Verify that standard LSF is running: # lsid Platform LSF 6.
See the troubleshooting information in the HP XC System Software Administration Guide if you do not receive a status of ok from the bhosts command. For more information about where to find LSF-HPC with SLURM or standard LSF documentation, see the Preface at the beginning of this document . You Are Done You have completed the mandatory system configuration tasks. Proceed to Chapter 5 (page 67) to verify the successful installation and configuration of your system.
5 Verifying the System Complete the tasks described in this chapter to verify the successful installation and configuration of the HP XC system components. HP recommends that you perform all tasks described in this chapter.
Now checking if /hptc_cluster is mounted on those nodes. +++ PASSED +++ Verify time_synchronism: Testing ... Comparing time on all nodes with time on head node. n11 n12 n13 n14 n15 time time time time time diff diff diff diff diff 0 0 0 0 0 ok. ok. ok. ok. ok. +++ PASSED +++ Verify license: Testing file_integrity ... Checking license file: /opt/hptc/etc/license/XC.lic +++ PASSED +++ Testing server_status ...
Here is the output from the command: Slurmctld(primary/backup) at n16/(NULL) are UP/DOWN Checking output from scontrol. +++ PASSED +++ Testing partition_state ... Starting the command: /opt/hptc/bin/sinfo --all Here is the output from the command: PARTITION AVAIL lsf up TIMELIMIT NODES infinite 6 STATE NODELIST idle n[11-16] Checking output from command. +++ PASSED +++ Testing node_state ...
Running 'lshosts -w'. Checking output from lshosts. Running 'controllsf show' to determine virtual hostname. Checking output from controllsf. Virtual hostname is lsfhost.localdomain Comparing ncpus from Lsf lshosts to Slurm cpu count. +++ PASSED +++ Testing hosts_status ... Running 'bhosts -w'. Checking output from bhosts. Running 'controllsf show' to determine virtual hostname. Checking output from controllsf. Virtual hostname is lsfhost.
Checked 1 contact groups.
Checking service escalations...
Checked 0 service escalations.
Checking service dependencies...
Checked 156 service dependencies.
Checking host escalations...
Checked 0 host escalations.
Checking host dependencies...
Checked 0 host dependencies.
Checking commands...
Checked 53 commands.
Checking time periods...
Checked 4 time periods.
Checking extended host info definitions...
Checked 0 extended host info definitions.
For more information about verifying individual cluster components on demand, see ovp(8) and the HP XC System Software Administration Guide . Interconnect diagnostic tests are documented in the installation and operation guide for your model of HP cluster platform and in the HP XC System Software Administration Guide. When all OVP tests pass, proceed to “Task 2: Take a Snapshot of the Database”.
6 Configuring SAN Storage Devices This chapter addresses the following topics: • SAN Storage Overview (page 73) • Installing and Configuring the EVA3000 or EVA5000 SAN Storage Devices (page 74) • Making and Mounting the SAN Storage File Systems (page 74) SAN Storage Overview An MSA1000 storage array is an option in an HP Cluster Platform cluster. The MSA1000 is shipped with a cable that enables you to establish an administrative connection to the array’s storage controllers.
# updateimage --gc 'nodename' 7. Follow the procedure described in the HP XC System Software Administration Guide to reimage the node (by running the updateclient command). For more information about SAN devices, see the storage array systems home page at: http://h10018.www1.hp.com/wwsolutions/linux/products/storage/storagearray.
Vendor: COMPAQ Model: HSV110 (C)COMPAQ Type: Direct-Access Host: scsi2 Channel: 00 Id: 02 Lun: 00 Vendor: COMPAQ Model: HSV110 (C)COMPAQ Type: Unknown Host: scsi2 Channel: 00 Id: 02 Lun: 01 Vendor: COMPAQ Model: HSV110 (C)COMPAQ Type: Direct-Access Host: scsi2 Channel: 00 Id: 03 Lun: 00 Vendor: COMPAQ Model: HSV110 (C)COMPAQ Type: Unknown Host: scsi2 Channel: 00 Id: 03 Lun: 01 Vendor: COMPAQ Model: HSV110 (C)COMPAQ Type: Direct-Access 3.
32768 blocks per group, 32768 fragments per group 16384 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624 Writing inode tables: done Creating journal (8192 blocks): done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 20 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. 6.
7 Upgrading Your HP XC System This chapter addresses the following topics: • Supported Upgrade Paths (page 77) • Software Upgrade Overview (page 77) • Is Upgrading Right for Your System? (page 78) • Task 1: Prepare for the Upgrade (page 78) • Task 2: Install the Upgrade RPM and Prepare Your System (page 79) • Task 3: Upgrade Linux and HP XC Specific RPMs (page 79) • Task 4: View the Results of the RPM Upgrade (page 80) • Task 5: Install Patches and Additional Software (page 81) • Task 6: Ma
Table 7-2 Upgrade Characteristics Characteristic Description Effect on system availability There is a period of system down time because all nodes have to be shut down in order to upgrade to the next release. Therefore, to minimize user interruption, schedule the upgrade for a time when system use and user activity is low. For your site, this might be over the weekend or perhaps after work hours or late in the evening. Effect on HP XC licensing The upgrade does not affect XC licensing.
3. Read the HP XC System Software Release Notes for Version 3.0 before beginning the upgrade procedure. This document describes any last-minute changes to software, firmware, or hardware that may affect the upgrade process. The HP XC System Software Release Notes for each XC software release are available at: http://www.hp.com/techservers/clusters/xc_clusters.html 4. Set all nodes to network boot so that all client nodes can be reimaged after the head node is updated: # setnode --resync --all 5.
1. Make sure the DVD is still in the DVD drive, and reboot the head node: # reboot 2. At the boot prompt, enter the appropriate boot command to start the upgrade process and specify the name of the Kickstart upgrade file. You must act fast, because the system does not pause very long at the boot prompt. As listed in Table 7-4, the command line is different depending upon your cluster platform architecture and how your boot device was set.
3. Search for the word error in this log file to find upgrade errors. You can safely ignore the error rpmdb: No such file or directory . These errors are benign. However, you must resolve failed dependencies and RPM conflicts. See “Troubleshooting the Software Upgrade Procedure” (page 146) for information about resolving upgrade errors. If you cannot determine how to resolve these errors, contact the XC support team at: xc_support@hp.com 4.
You may have to reboot the head node if there is a patch to the kernel; the patch README file provides instructions if a reboot is required. 2. Reinstall or upgrade any previously installed HP, open source, or third-party vendor RPMs that you specifically installed (for example, HP StorageWorks File System (SFS), TotalView, Intel compilers, and so on).
Task 7: Configure the System and Propagate the Golden Image Follow this procedure to configure your upgraded system and propagate the new golden image to all client nodes: 1. Run the following utility to back up the existing database and migrate existing data to the new release format: # upgradesys Command output looks similar to the following: The upgradesys utility performs all the necessary steps to upgrade your cluster.
3. The cluster_config utility displays the following menu. Enter the letter p to proceed with the system configuration process; refer to Appendix F (page 117) for information about using this menu. Important If the head node was configured as a NIS slave server in the previous release, you must assign the nis_server role to the head node now, because this role is being introduced in this release. Use the [M]odify option of the cluster_config utility to assign this role to the head node.
Executing Executing Executing Executing Executing Executing Executing C10hptc_cluster_fs gconfigure C20gmmon gconfigure C30swmlogger gconfigure C30syslogng_forward gconfigure C35dhcp gconfigure C50cmf gconfigure C50nagios gconfigure Would you like to enable web based monitoring? ([y]/n) y Enter the password for the 'nagiosadmin' web user: New password: Re-type new password: Adding password for user nagiosadmin Executing C50nat gconfigure Executing C50supermond gconfigure Executing C51nagios_monitor gconfi
NOTE: The only Partition created by default is the lsf partition. If you want additional partitions, configure them manually in the /hptc_cluster/slurm/etc/slurm.conf file. The current Node Partition configuration is: PartitionName=lsf RootOnly=YES Shared=FORCE Nodes= n[11-16] Do you want to enable SLURM-controlled user-access to the compute nodes? (y/n) [n]: n SLURM configuration complete.
Replaced default lsb.queues with a preconfigured lsb.queues. C95lsf finished Configuring the image replication environment Initializing 172.20.0.
# . /opt/hptc/lsf/top/conf/profile.lsf Task 8: Verify the Upgrade Run the operation verification program (OVP) to verify the integrity of the upgrade process and test the successful operation of your upgraded system: # ovp --verbose For a list of tests run by the OVP and for an example of successful completion of the OVP, see Chapter 5 (page 67).
8 Node Roles, Services, and the Default Configuration This chapter addresses the following topics: • Default Node Role Assignments (page 91) • Special Considerations for Modifying Default Node Role Assignments (page 91) • Role Definitions (page 92) Default Node Role Assignments Table 8-1 lists the default role assignments. The default assignments are based on the number of total nodes in your system.
• No nodes are configured with the nis_server role. Assigning a nis_server role to a node establishes the node as a NIS slave server. Any node assigned with the nis_server role must also have an external network connection defined. • You must assign an external role to any node that has an external network connection configured.
• Parallel distributed shell (pdsh) • SLURM launch (munge) These services provide functionality that is required on all nodes and are fundamental to the proper functioning of the cluster. Compute Role Jobs are distributed to and run on nodes with the compute role. This role provides the services required for the node to be an allocated resource of the SLURM central control service (slurmcd).
• SuperMon aggregator (supermond) • Syslogng (syslogng_forward) These services are used to support scaling of the cluster. Nodes with this role provide local storage for aggregation of system logs and performance information. Management hub services typically report up to the node with the management_server role. You can assign this role to several nodes of the cluster, and HP recommends that you consider using one management hub for every 64 to 128 nodes.
The license manager service enables some software components in the system when a valid license is present. Resource Management Role Nodes with the resource_management role provide the services necessary to support SLURM and LSF. On systems with fewer than 63 total nodes, this role is assigned by default to the head node. On large-scale systems with more than 64 nodes, this role is assigned by default to the node with the internal node name that is one less than the head node.
Appendix A Installation and Configuration Checklist Table A-1 provides a list of tasks performed during a new installation. Use this checklist to ensure you complete all installation and configuration tasks in the correct order. Perform all tasks on the head node unless otherwise noted.
Table A-1 Installation and Configuration Checklist Task Description Reference Preparing for the Installation 1. Read related documents, especially the HP XC System Software Release Notes. Page 23 2. Back up user data and accounts on systems installed with an older version of the HP XC System Software. Page 23 3. Prepare the hardware according to the instructions in the HP XC Hardware Preparation Guide. Page 23 4.
Task Description Reference 24. Run the sys_check utility. Page 72 Is Task Complete? Optional Configuration Tasks 25. Configure SAN storage devices.
Appendix B Host Name and Password Guidelines This appendix contains guidelines for making informed decisions about information you are asked to supply during the installation and configuration process. This appendix addresses the following topics: • Defining Host Names (page 101) • Setting Strong Passwords (page 101) Defining Host Names Follow these guidelines when deciding on a host name: • Names can contain from 2 to 63 alphanumeric uppercase or lowercase characters (a-z, A-Z, 0-9).
• Single words found in any dictionary in any language. • Personal information about you or your family or significant others such as first and last names, addresses, birth dates, telephone numbers, names of pets, and so on. • Any combination of single words in the dictionary and personal information. • An obvious sequence of numbers or letters, such as 789 or xyz.
Appendix C Enabling telnet on iLO Devices The procedure described in this appendix applies only to HP XC systems with nodes that use Integrated Lights Out (iLO) as the console management device. New nodes that are managed with iLO console management connections that have never been installed with HP XC software may have iLO interfaces that have not been configured properly for HP XC operation.
3. Open the Mozilla Web browser on the head node, and in the Web address field at the top of the window, enter the IP address you noted in step 2 appended with /ie_index.htm, similar to the following: https://172.20.0.n/ie_index.htm 4. 5. Click OK twice to accept the security certificates. In the Account Login window, enter Administrator as the login name, and enter the password that is shown on the information tag that is attached to the iLO.
Appendix D Configuring Interconnect Switch Monitoring Cards You must configure the Quadrics switch controller cards, the InfiniBand switch controller cards, and the Myrinet monitoring line cards on the system interconnect in order to diagnose and debug problems with the system interconnect.
Table D-2 Quadrics Switch Controller Card Naming Conventions and IP Addresses For Full Bandwidth Number of Nodes Node-Level Switch Name Node-Level IP Address 1 to 64 QR0N00 QR0N00_B (P)1 172.20.66.1 Top-Level Switch Name Top-Level Switch IP Address Not applicable Not applicable 2 (S) 172.20.66.2 65 to 256 QR0N00 to QR0N03 (P) 172.20.66.1 to QR0N00_B to QR0N03_B 172.20.66.4 (S) 172.20.66.5 to 172.20.66.8 QR0T00 to QR0T01 257 to 512 QR0N00 to QR0N07 (P) 172.20.66.
option host-name "QR0N00"; fixed-address 172.20.66.1; filename "503-upgrade.tar"; } In the following example, the entry is added to the end of the shared-network XC { block: #Built by Blddhcpd in DiscoverTools.pm ddns-update-style none; deny unknown-clients; allow bootp; default-lease-time 480; max-lease-time 480; option xc-macaddress code 232 = string; shared-network XC { subnet 172.0.0.0 netmask 255.224.0.0 { next-server 172.20.0.8; filename = "/pxelinux.
Configuring the Myrinet Switch Monitoring Line Card You can use the Myrinet switch monitoring line card to run diagnostic tools and to check for events on each port of the line card. Table D-3 provides the switch names and associated IP addresses you need during the configuration procedure. The IP addresses for the switch monitoring line cards is based on the swBase address in the base_addr.ini file. The default address base for the switch monitoring line cards is 172.20.
} host MR0N00 { hardware ethernet your_MAC_address; option host-name "MR0N00"; fixed-address 172.20.66.1; } } } 5. Restart the DHCP service: # service dhcpd restart 6. Use the text editor of your choice to open the /etc/hosts file to include an entry for each monitoring line card, using the data in Table D-3 as a reference: 172.20.66.
ISR-9024# password update admin 7. Change the default enable password: ISR-9024# password update enable 8. Access the configuration mode: ISR-9024# config 9. Access the interface fast mode: ISR-9024(config)# interface fast 10. Set the IP address of the switch and the netmask using the data in Table D-4 as a reference. Set the netmask to 255.255.0.0 as shown: ISR-9024(config-if-fast)# ip-address-fast set IP_address 255.255.0.0 11. Confirm the settings: ISR-9024(config-if-fast)# ip-address-fast show 12.
Appendix E Customizing Client Node Disks Use the information in this appendix to customize the disk partition layout on client node disk devices. This appendix addresses the following topics: • Overview of Client Node Disk Imaging (page 111) • Dynamic Disk Configuration (page 111) • Static Disk Configuration (page 114) Overview of Client Node Disk Imaging The HP XC client node imaging process requires a single system disk on each client node on which the operating system is installed.
The component files are as follows: • /opt/hptc/systemimager/etc/make_partitions.sh Identifies the client disk type and size and creates the default partition table. When changing the default sizes of partitions or swap space, you will edit this file to effect the change. Read the comments in the file for more details. See the example in “Example 1: Changing Default Partition Sizes and Swap Space for All Client Nodes” (page 112) for information about how to make such a change.
4. Change the MEM_PERCENTAGE variable to 1.5 to create a swap partition that is 1.5 times the size of physical memory. This will only be effective if your physical memory size is greater than 6 GB and less than 16 GB because swap partition size is bounded by these limits.
   MEM_PERCENTAGE="1.5"
5. Save your changes to the file.
6. Run the cluster_config utility, choosing the default answers, to create a new master autoinstallation script (/var/lib/systemimager/scripts/base_image.master.
ln -sf login.master.0 $i.sh done 8. Do one of the following to install the client nodes: • If the client nodes were not previously installed with the HP XC System Software, see “Task 9: Run the startsys Utility To Start the System and Propagate the Golden Image” (page 60) to continue the initial installation procedure.
/etc/systemimager/systemimager.conf 2. Search for the following variable at the bottom of the file: DYNAMIC_DISK_PROCESSING 3. Set the value to FALSE: DYNAMIC_DISK_PROCESSING = FALSE 4. Save your change to the file. The static disk configuration method is now persistently enabled. Each time the cluster_config utility is run, one (or more) master autoinstallation scripts are created with static disk configuration information from the appropriate .conf files.
3. Use the text editor of your choice to modify the login_node.conf file with the necessary disk configuration changes. See autoinstallscript.conf(5) for formatting guidelines. 4. Save your changes to the file. 5. Create a new autoinstallation script: # mkautoinstallscript --quiet \ --no-listing \ --image base_image \ --script login_node \ --force YES \ --config-file /opt/hptc/systemimager/etc/login_node.conf \ --ip-assignment static \ --post-install reboot See mkautoinstallscript(8) for more information.
Appendix F Using the Configuration Menu This appendix describes how to use the text-based cluster configuration menu that is displayed by the cluster_config utility.
Please enter node name or [C]ancel: n15 Current Node: n15 [E]xternal Network Configuration, [R]oles, [H]elp, [B]ack: At this point you have the following options: • Enter the letter E and proceed to “Adding an Ethernet Connection” (page 118) to configure an external Ethernet connection on any node. • Enter the letter R and proceed to “Modifying Node Role Assignments” (page 118) to modify node role assignments. • Enter the letter B to go back to the previous menu.
At this point you have the following options: • Enter the letter O to accept the role assignments you just made. • Enter the letter R to start the assignment process again. The Reassign Roles option does not apply the role assignments, it simply allows you to go back to the list of roles and make adjustments to the role assignments. • Enter the letter C to cancel the operation and make no changes to the currently assigned roles.
HN Req: Head Node Required HN Rec: Head Node Recommended Ext Req: External Connection Required Ext Rec: External Connection Recommended Table F-3 Specific Node-By-Node Output of the Analyze Option Column Heading Description Recommend Displays the number of nodes recommended for a particular role based on the number of nodes in your system. Assigned Displays the actual number of nodes assigned with a particular role.
digits, periods, and underscores, up to a maximum of 64 characters. The services configuration option of the cluster_config utility provides an interface that enables you to manipulate node attributes in the database. Individual nodes can be disabled as servers or disabled as clients of specific services. You must create a node attribute before you can add it to a node.
Service Configuration Command Descriptions Table F-4 provides more information about the services configuration commands. All commands are invoked from the svcs> prompt.
Table F-4 Service Configuration Command Descriptions Command Description and Usage [l]ist Displays user-created node attribute and attribute assignments sorted by attribute name. Sample use: svcs> list Attributes: na_disable_client.supermond na_disable_server.cmf Assignments: na_disable_client.supermond: n1 na_disable_server.cmf: n3 [n]odes Displays user-created node attribute and attribute assignments sorted by node name. Sample use: svcs> list Attributes: na_disable_client.
Command Description and Usage [s]ervices Displays node to service mappings.
Appendix G Determining the Network Type The information in this appendix applies only to cluster platforms with a Quadrics QsNetII interconnect. During the processing of the cluster_config utility, the swmlogger gconfig script prompts you to supply the network type of your system. The network type reflects the maximum number of ports the switch can support. The network type is used to create the qsnet diagnostics database.
Appendix H LSF Installation Values This appendix lists the LSF values that were configured for your system during the HP XC System Software installation process.
Table H-1 Default Installation Values for LSF Environment Variable Value Description JOB_ACCEPT_INTERVAL 1 This value is set to 0 when LSF-HPC with SLURM is installed, and left as the default (1) when standard LSF is installed. Where Is This Value Stored? lsb.params file This environment variable controls how many jobs are dispatched to a host during a scheduling cycle.
Environment Variable Value Description Where Is This Value Stored? MinJobAge 1 hour Defines the amount of time a completed job may persist in the memory of the slurmctld daemon before it is purged. slurm.conf file ReturnToService 1 Specifies that a DOWN node will become available for use upon registration. slurm.conf file One LSF partition RootOnly YES Specifies that only the root user or the SLURM administrator can create allocations for normal user jobs. slurm.
Appendix I Reinstalling an HP XC System with Version 3.0 This appendix describes how to reinstall HP XC System Software Version 3.0 on a system that is already running Version 3.0. Reinstalling an HP XC system with the same release may be necessary if you participated as a field test site of an advance development kit (ADK) or an early release candidate kit (RC).
• Reinstalling the Entire System (page 132) • Reinstalling One or More Nodes (page 133) Reinstalling the Entire System A reinstallation requires that all nodes are set to network boot. The setnode --resync command automatically sets all nodes to network boot, however, the setnode --resync command requires that Netboot is defined as the common user-specified label for the network boot option that is configured in the EFI shell environment for each node.
n3: n3: n3: n3: BootCurrent: 0000 BootOrder: 0000,0001 Boot0000* Linux Boot0001* Netboot <--- 4. Prepare all client nodes to network boot rather than boot from local disk: # setnode --resync --all Note If you did not follow the instructions in step 3 to set Netboot as the common network boot label, you cannot use the setnode command to set all nodes to network boot.
The network boot option label for node n5 is set to eth1boot. This value must be named Netboot for the setnode command to operate properly. Do one of the following: 3 b. • If all boot labels are consistently named Netboot, there is no need to change anything. Proceed to step 4. • If the boot label names for all nodes are not consistently named Netboot, complete step 3c through step 3e to change the label for all nodes that do not have the label Netboot. c.
Appendix J Installing the Maui Scheduler This appendix describes how to install and configure the Maui Scheduler software tool to interoperate with SLURM on an HP XC system.
Ensure That LSF Is Not Activated HP does not support the use of the Maui Scheduler with LSF. These schedulers have not been integrated and will not work together on an HP XC system. Before you install the Maui Scheduler on an HP XC system, you must be sure that the HP XC version of LSF is not activated on your system. If LSF is activated, you must deactivate it before proceeding. The following procedure describes how to determine if LSF is activated and running on your system, and how to deactivate it.
Task 2: Compile the Maui Scheduler from Its Source Distribution To compile the Maui Scheduler from its source distribution, go to the directory where you downloaded the Maui Scheduler kit and enter the following commands:
1. ./configure --with-key=42 --with-wiki --prefix=/opt/hptc/maui --exec-prefix=/opt/hptc/maui --with-spooler=/hptc_cluster/maui
2. gmake
3.
this document. The best resource for configuring the Maui Scheduler is the online information available at the Maui Scheduler developer's Web site, which is located at: http://www.supercluster.org Verifying Success You can verify the success of the Maui Scheduler installation and configuration by executing various Maui Scheduler commands. Maui Scheduler commands such as showq, showstate, checknode, and checkjob can help you to determine if the Maui Scheduler is operating properly on your system.
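For example, after the scheduler starts, the following commands give a quick view of queued jobs and overall node state:
# showq
# showstate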
Several commands provide diagnostic information about various aspects of resources, workload, and scheduling. For example: • diagnose n diagnoses nodes • diagnose j diagnoses jobs • diagnose t diagnoses partitions Additional information about Maui Scheduler commands is available in the Maui Scheduler Administrator’s Guide and at: http://www.clusterresources.com/products/maui/docs/a.gcommandoverview.
Appendix K Troubleshooting This appendix addresses the following topics: • Troubleshooting the Discovery Process (page 141) • Troubleshooting the Cluster Configuration Process (page 144) • Troubleshooting Licensing Issues (page 146) • Troubleshooting the Imaging Process (page 144) • Troubleshooting the Software Upgrade Procedure (page 146) Troubleshooting the Discovery Process Figure K-1 provides a high-level flowchart that illustrates the processing performed by the discover command.
• ProCurve Switches May Take Time to Get IP Addresses (page 142) • Not All Console Ports Are Discovered (page 142) • Some Console Ports Have Not Obtained Their IP Address (page 143) • Not All Nodes Are Discovered (page 143) After performing the suggested corrective action, rerun the discover command. Discovery Process Hangs While Discovering Console Ports This information applies only to HP XC systems with nodes that use Integrated Lights Out (iLO) as the console management device.
Some Console Ports Have Not Obtained Their IP Address Use the following procedure to determine why all console ports have been discovered, but some have not obtained their IP addresses after a reasonable time: 1. View the system log file: # tail -f /var/log/messages 2. Look for instances where a network component issues a DHCPREQUEST, the head node sends back a DHCPOFFER, but the console port does not send back a corresponding DHCPACK.
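Rather than watching the log interactively, you can also search it for the relevant DHCP exchanges; this uses a generic grep pattern rather than an HP XC specific tool:
# grep -E 'DHCP(DISCOVER|OFFER|REQUEST|ACK)' /var/log/messages | tail -40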
The most common reasons for this are: • The console port or node NIC is not plugged in. • The node is not set to network boot. After the discovery process is complete, a list of the nodes that were not found is displayed. Determine why the node or nodes was not found and rerun the discover command with the --replacenode= option to properly discover the node. The discover command attempts to account for all the nodes that it expects to find.
Table K-1 Diagnosing System Imaging Problems
Symptom: A node boots to local disk and runs through the node configuration phase (nconfigure) instead of imaging.
How To Diagnose: An nconfig starting entry appears in the imaging.log file.
Possible Solution: Verify BIOS settings to ensure that the node is set to network boot and that the correct network adapter is at the top of the boot order.
Symptom: A node hangs while imaging.
How To Diagnose: You can determine when a node hangs during imaging by monitoring the imaging.
It is possible to actually view an installation through the remote serial console, but to do so, you must edit the /tftpboot/pxelinux.cfg/default file before the installation begins and add the correct serial console device to the APPEND line. If this is done, the cmfd services should be disabled and a smaller group of nodes should be imaged at any one time. The network traffic caused by the serial console can adversely affect the imaging operation.
# mount /dev/cdrom /mnt/cdrom 2. Change to the following directory: # cd /var/log/upgrade/RPMS 3. Install the RPM or RPMs that failed to upgrade properly: # rpm -Uvh rpm_name • An upgraded system contains the xc_prev_version attribute in the database. Use the following command to determine if this attribute exists: # shownode config sysparams | grep version cmdb_version: 1.26 xc_prev_version: V2.1 xc_version: V3.
Glossary A administration branch The half (branch) of the administration network that contains all of the general-purpose administration ports to the nodes of the HP XC system. administration network The private network within the HP XC system that is used for administrative operations. B base image The collection of files and directories that represents the common files and configuration data that are applied to all nodes in an HP XC system. branch switch A component of the Administration Network.
FCFS First-come, first-served. An LSF job-scheduling policy that specifies that jobs are dispatched according to their order in a queue, which is determined by job priority, not by order of submission to the queue. first-come, first-served See FCFS. G global storage Storage within the HP XC system that is available to all of the nodes in the system. Also known as local storage. golden client The node from which a standard file system image is created.
L Linux Virtual Server See LVS. load file A file containing the names of multiple executables that are to be launched simultaneously by a single command. Load Sharing Facility See LSF-HPC with SLURM. local storage Storage that is available or accessible from one node in the HP XC system. LSF execution host The node on which LSF runs. A user's job is submitted to the LSF execution host. Jobs are launched from the LSF execution host and are executed on one or more compute nodes.
Network Information Services See NIS. NIS Network Information Services. A mechanism that enables centralization of common data that is pertinent across multiple machines in a network. The data is collected in a domain, within which it is accessible and relevant. The most common use of NIS is to maintain user account information across a set of networked hosts. NIS client Any system that queries NIS servers for NIS database information.
SMP Symmetric multiprocessing. A system with two or more CPUs that share equal (symmetric) access to all of the facilities of a computer system, such as the memory and I/O subsystems. In an HP XC system, the use of SMP technology increases the number of CPUs (amount of computational power) available per unit of space. ssh Secure Shell. A shell program for logging in to and executing commands on a remote computer.
Index A adduser command, 50 administration network running as interconnect, 41 B back up database, 72 /boot/efi file system, 29 /boot file system, 29 C case sensitivity, 117 checklist of installation tasks, 97 client nodes default disk partition layout, 50 cluster configuration modifying, 117 cluster configuration menu, 117 cluster platform CP3000, 29 CP4000, 29 CP6000, 29 cluster_config command, 51 cluster_config utility log files, 36 text-based menu, 117 troubleshooting, 144 cluster_prep command, 40 com
HP MPI defined, 27 HP ProLiant DL140 discovering, 42 HP ProLiant DL145 G2 discovering, 42 HP SFS, 48 HP StorageWorks Scalable File Share (see HP SFS) HP XC system software defined, 27 /hptc_cluster file system, 29 creating, 47 making and mounting, 49 I iLO settings on DL585, 103 image server, 35 imaging monitoring, 145 nodes, 36 troubleshooting, 144 imaging log file, 36 InfiniBand interconnect connecting to the switch monitoring line card, 110 logging in to line monitoring card, 110 node-list file, 63 swit
N Nagios defined, 27 user account, 50 nagios user account, 50 naming conventions, 21 network connectivity testing, 67 network mask, 37 network type, 125 NFS daemon, 53 no node found command output, 143 no switch found command output, 44 node imaging problems, 36 not discovered, 143 roles, 92 node list condensed, 123 explicit, 123 node management role, 94 node naming, 40 node role defined, 91 node role assignment default, 91 modifying, 118 node-list file for InfiniBand interconnect, 63 O ovp utility log fil
defined, 27 system configuration, 35 defined, 35 system imaging, 35, 41 monitoring, 145 troubleshooting, 144 system imaging log file, 36 system interconnect (see interconnect) system verification, 67 T third-party software, 25 U udp-sender daemon, 60 udpcast, 60 UIDs, 50 upgrade cluster_config options, 83 log files, 146 manually merging customized files, 82 overview of, 77 readiness criteria, 78 verifying, 89 viewing results, 80 upgrade path, 77 upgrade procedure, 78 upgrade script, 79 upgradesys script,