HP XC System Software Installation Guide Version 3.2
© Copyright 2003, 2004, 2005, 2006, 2007 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Table of Contents (excerpts): About This Document (Intended Audience; How to Use This Document; Naming Conventions Used in This Document); 2 Installing Software on the Head Node (2.4.2 Run the SVA Installation Script to Install SVA; 2.4.3 Install Optional Linux RPMs; 2.5 You Are Done); 3 Configuring and Imaging the System (3.16 Task 15: Start Availability Tools; 3.17 Task 16: Configure SNMP Trap Destination for Enclosures; 3.18 Task 17: Configure SNMP Trap Destination for Modular Cooling System Devices; 3.19 Task 18: Finalize the Configuration of Compute Resources); 7 Installing HP XC System Software on Red Hat Enterprise Linux (7.3 Task 1: Prepare for the Installation; 7.4 Task 2: Install the Red Hat Software; 7.5 Task 3: Install Additional RPMs; 7.6 Task 4: Install the HP XC System Software); 11 Adding Visualization Nodes to an Existing HP XC System (11.1 Prerequisites; 11.2 Installation Scenarios); E Customizing Client Node Disks (E.1 Overview of Client Node Disk Imaging; E.2 Dynamically Configuring Client Node Disks); J Customizing the SLURM Configuration (J.1 Assigning Features; J.2 Creating Additional SLURM Partitions; J.3 Required Customizations for SVA)
List of Figures (excerpt): 12-1 Discovery Flowchart
List of Tables (excerpts): 1 Installation Types; 2 Naming Conventions; Service Configuration Command Descriptions; Network Type Based on System Topology; Default Installation Values for LSF and SLURM
List of Examples (excerpts): 1-1 Sample XC.lic File; 3-1 discover Command Output On A Small Non-Blade Configuration; 3-2 discover Command Output For Large-Scale Systems; 3-3 Sample mcs.ini File
About This Document This document describes how to install and configure HP XC System Software Version 3.2 on HP Cluster Platforms 3000, 4000, and 6000. An HP XC system is integrated with several open source software components. Some open source software components are being used for underlying technology, and their deployment is transparent.
Table 1 Installation Types (continued)
HP XC System Software installation on Red Hat Enterprise Linux: Installs and configures Red Hat Enterprise Linux Version RHEL4 U4 on the head node, and then installs the HP XC System Software Version 3.2 as a layered product. Documented in Chapter 7 (page 137).
HP XC System Software upgrade on Red Hat Enterprise Linux: Upgrades the HP XC System Software to Version 3.2 on a system where it is installed as a layered product on Red Hat Enterprise Linux. Documented in Chapter 8.
Typographic Conventions
This document uses the following typographical conventions:
%, $, or #: A percent sign represents the C shell system prompt. A dollar sign represents the system prompt for the Korn, POSIX, and Bourne shells. A number sign represents the superuser prompt.
audit(5): A manpage. The manpage name is audit, and it is located in Section 5.
The document also defines conventions for command names, computer output, Ctrl-x key sequences, environment variables, error names, keyboard keys, terms, user input, variables, the [ ], { }, ..., and | command syntax elements, and the WARNING, CAUTION, IMPORTANT, and NOTE notices.
The HP XC System Software Documentation Set includes the following core documents: HP XC System Software Release Notes Describes important, last-minute information about firmware, software, or hardware that might affect the system. This document is not shipped on the HP XC documentation CD. It is available only on line.
HP Cluster Platform The cluster platform documentation describes site requirements, shows you how to set up the servers and additional devices, and provides procedures to operate and manage the hardware. These documents are available at the following website: http://www.docs.hp.com/en/linuxhpc.html HP Integrity and HP ProLiant Servers Documentation for HP Integrity and HP ProLiant servers is available at the following website: http://www.docs.hp.com/en/hw.
• http://www.nagios.org/ Home page for Nagios®, a system and network monitoring application that is integrated into an HP XC system to provide monitoring capabilities. Nagios watches specified hosts and services and issues alerts when problems occur and when problems are resolved. • http://oss.oetiker.ch/rrdtool Home page of RRDtool, a round-robin database tool and graphing system. In the HP XC system, RRDtool is used with Nagios to provide a graphical view of system status. • http://supermon.
Linux Websites • http://www.redhat.com Home page for Red Hat®, distributors of Red Hat Enterprise Linux Advanced Server, a Linux distribution with which the HP XC operating environment is compatible. • http://www.linux.org/docs/index.html The website for the Linux Documentation Project (LDP) contains guides that describe aspects of working with Linux, from creating your own Linux system from scratch to bash script writing.
Software RAID Websites • http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html and http://www.ibiblio.org/pub/Linux/docs/HOWTO/other-formats/pdf/Software-RAID-HOWTO.pdf A document (in two formats: HTML and PDF) that describes how to use software RAID under a Linux operating system. • http://www.linuxdevcenter.com/pub/a/linux/2002/12/05/RAID.html Provides information about how to use the mdadm RAID management utility.
1 Preparing for a New Installation This chapter describes preinstallation tasks to perform before you install HP XC System Software Version 3.2.
1.3 Task 3: Prepare Existing HP XC Systems This task applies to anyone who is installing HP XC System Software Version 3.2 on an HP XC system that is already installed with an older version of the HP XC System Software. Omit this task if you are installing HP XC System Software Version 3.2 on new hardware for the first time. Before using the procedures described in this document to install and configure HP XC System Software Version 3.
1.6 Task 6: Arrange for IP Address Assignments and Host Names Make arrangements with your site's network administrator to assign IP addresses for the following system components. All IP addresses must be defined in the site's Domain Name System (DNS) configuration: • The external IP address of the HP XC system, if it is to be connected to an external network. The name associated with this interface is known as the Linux Virtual Server (LVS) alias or cluster alias.
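As a planning aid only, the addresses you reserve with the network administrator might eventually correspond to DNS or /etc/hosts entries similar to the following sketch; the host names and addresses shown here are placeholders, not values defined by HP XC:

192.0.2.10   xc.example.com       xc       # LVS alias (cluster alias) used for user logins
192.0.2.11   xc-head.example.com  xc-head  # external Ethernet interface of the head node

The authoritative definitions are made in the site's DNS configuration; these lines only illustrate the kind of name-to-address mapping you are arranging for.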
NOTICE="Authorization = BM05WHITMORE19772031 - permanent - HP \ XC System Software - BASE License" INCREMENT XC-PROCESSORS Compaq 3.0 permanent 68 7BA7E0876F0F \ NOTICE="Date 30-Jan-2007 01:29:36 - License Number = \ LAGA4D1958DL - Qty 68 - 434066-B21 - HP XC System Software 1 \ Proc Flex License" INCREMENT lsf_xc Compaq 6.
server of the service. Improved availability protects against service failure if a node that is serving vital services becomes unresponsive or goes down. You have the flexibility to decide which availability tool you want to use to manage and migrate specified services to a second server if the first server is not available. You can install one or more availability tools to manage the services that have been configured for improved availability.
Table 1-1 Improved Availability Summary
1. Decide which availability tool or tools you want to use; this tool manages the services for which improved availability has been configured. Then, obtain or purchase, install, and configure the availability tool or tools. Task details are provided in “Choosing an Availability Tool” (page 30).
2. Write translator and supporting scripts for the availability tool if you are not using HP Serviceguard. Task details are provided in “Writing Translator and ...”.
Availability Tools from Other Vendors
If you prefer to use another availability tool, such as Heartbeat Version 1 or Version 2 (which is an open source tool), you must obtain the tool and configure it for use on your own. Third-party vendors are responsible for providing customer support for their tools. Installation and configuration instructions for any third-party availability tools you decide to use are outside the scope of this document. See the vendor documentation for instructions.
1.9.6 Assigning Node Roles For Improved Availability An important part of planning your strategy for improved availability is to determine the services for which availability is vital to the system operation. Services are delivered in node roles. A node role is an abstraction that combines one or more services into a group and provides a convenient way of installing services on a node. In this release, improved availability is supported for the services listed in Table 1-2.
Table 1-2 Role and Service Placement for Improved Availability (continued)
Special Considerations for Role Assignment: Within the availability set, the higher numbered node is the LVS director, and the lower numbered node is the backup for the LVS director. Thus, to achieve improved availability of the LVS director service, you must assign at least three nodes with the login role:
• Assign the login role to the first node in the availability set.
by placing the resource_management role on two or more nodes. These nodes are not members of any availability set, and the SLURM and LSF-HPC with SLURM software is not managed by any availability tool. When you assign two or more nodes with the resource_management role, SLURM availability is automatically enabled. If you assign the resource_management role to two or more nodes, you must manually enable availability for LSF-HPC with SLURM; see “Perform LSF Postconfiguration Tasks” (page 108) for instructions.
Table 1-3 Availability Sets Worksheet
For each availability set, record the following information: the name of the first node, the name of the second node, the availability tool that manages the availability set, and the roles to assign to the nodes in the availability set (list the roles for the first node and the roles for the second node separately).
2 Installing Software on the Head Node This chapter contains an overview of the software installation process and describes software installation tasks. These tasks must be performed in the following order: • “Task 1: Gather Information Required for the Installation” (page 41) • “Task 2: Start the Installation Process” (page 44) • “Task 3: Install Additional RPMs from the HP XC DVD” (page 49) 2.
Table 2-1 HP XC Software Stack Software Product Name Description HP MPI HP MPI provides optimized libraries for message passing designed specifically to make high-performance use of the system interconnect. HP MPI complies fully with the MPI-1.2 standard. HP MPI also complies with the MPI-2 standard, with restrictions. HP Scalable Visualization Array The HP Scalable Visualization Array (SVA) provides a visualization component for applications that require visualization in addition to computation.
Table 2-1 HP XC Software Stack (continued) Software Product Name Description Standard LSF Standard LSF is the industry standard Platform Computing LSF product used for workload management across clusters of compute resources. It features comprehensive workload management policies in addition to simple first-come, first-serve scheduling (fairshare, preemption, backfill, advance reservation, service-level agreement, and so on).
space on the disk for other user-defined file systems and partitions. Use the Linux Disk Druid disk partitioning utility to partition the remaining disk space according to your needs. During the Kickstart installation procedure, messages notify you if a calculated disk partition size exceeds the limit and the maximum partition size is applied instead.
the appropriately sized partition. The guidelines depend on whether /hptc_cluster is located on an HP StorageWorks Scalable File Server (SFS) or is created on the local system disk.
• “Determining The Size of /hptc_cluster When It Is Located On An SFS Server”
• “Determining The Size of /hptc_cluster When It Is Located On A Local Disk On The Head Node”
Table 2-5 Chip Architecture by Cluster Platform
Cluster Platform 3000 (CP3000): Intel Xeon with EM64T
Cluster Platform 3000BL (HP server blades): Intel Xeon with EM64T
Cluster Platform 4000 (CP4000): AMD Opteron
Cluster Platform 4000BL (HP server blades): AMD Opteron
Cluster Platform 6000 (CP6000): Intel Itanium 2
4. Ensure that you have in your possession the DVD distribution media that is appropriate for the cluster platform architecture.
Table 2-6 Information Required for the Kickstart Installation Session (continued)
Where to create a partition for the /hptc_cluster file system: The /hptc_cluster file system is the global, or clusterwide, file system on an HP XC system. This file system is shared and mounted by all nodes and contains system configuration and log file information that is required for all nodes in the system.
Table 2-6 Information Required for the Kickstart Installation Session (continued) Item Description and User Action Time zone Select the time zone in which the system is located. The default is America/New York (Eastern Standard Time, which is Greenwich Mean Time minus 5 hours). Use the Tab key to move through the list of time zones, and use the spacebar to highlight the selection. Then, use the Tab key to move to OK, and press the space bar to select OK.
IMPORTANT: Specific head node hardware models might require additional parameters on the command line that were identified after this document was published. Before booting the head node, look in the HP XC System Software Release Notes at http://www.docs.hp.com/en/linuxhpc.html to make sure no additional command-line options are required for your model of head node.
Table 2-7 Kickstart Boot Command Line (columns: Cluster Platform or Hardware Model, Chip Architecture Type)
CP3000 and CP4000
8. 9. Log in as the root user when the login screen appears, and enter the root password you previously defined during the software installation process. Open a terminal window when the desktop appears: a. Click on the Linux for High Performance Computing splash screen to close it. b. Click Applications→System Tools→Terminal to open a terminal window. 10. Proceed to “Task 3: Install Additional RPMs from the HP XC DVD” (page 49). 2.3.
2. Use the menus on the Insight Display panel to manually set a static IP address and subnet mask for the Onboard Administrator. You can use any valid IP address because there is no connection to a public network. All static addresses must be in the same network. For example, assume the network is 172.100.100.0 and the netmask is 255.255.255.0. In this case, the static IP addresses might be:
• IP address of the installation PC: 172.100.100.
d. Click the Power button and then click Momentary Press to turn on power to the server and start booting from the DVD.
e. Proceed to step 6.
Mozilla Firefox
If you are using Firefox as your browser, do the following:
a. Click the Remote Console link to open the virtual console window.
b. In the iLO2 Web Administration window, click the Virtual Devices tab.
c. In the left frame, click the Virtual Media link.
IMPORTANT: After the software load is complete, ensure that the DVD is ejected from the drive before continuing. On systems with a retractable DVD device, you must remove the installation DVD before the system reboots. This is especially important if the head node is an HP workstation, which never ejects the DVD. If you do not remove the DVD, a second installation process is initiated from the DVD when the system reboots. If a second installation process is started, halt the process and remove the DVD.
# cd
# umount /dev/cdrom
2.4.2 Run the SVA Installation Script to Install SVA
The HP Scalable Visualization Array (SVA) is a scalable visualization solution that brings the power of parallel computing to bear on many demanding visualization challenges. SVA is integrated with HP XC and shares a single interconnect with the compute nodes and a storage system.
2.4.3 Install Optional Linux RPMs Follow this procedure to install additional, optional Linux RPMs from the HP XC distribution DVD: 1. If the DVD is not already mounted, insert the installation DVD into the DVD drive and mount it on the default location (the default location is the /media/cdrom directory): # mount /dev/cdrom 2. Change to the following directory: # cd /media/cdrom/LNXHPC/RPMS 3. Find the Linux RPM you want to install and issue the appropriate command to install it.
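For example, assuming a hypothetical package file name, an optional RPM can be installed from the mounted DVD with the standard rpm command; substitute the actual file name of the RPM you located in the directory:

# cd /media/cdrom/LNXHPC/RPMS
# rpm -Uvh optional-package-1.0-1.x86_64.rpm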
3 Configuring and Imaging the System This chapter contains an overview of the initial system configuration and imaging process and describes system configuration tasks, which must be performed in the following order: • “Task 1: Prepare for the System Configuration” (page 55) • “Task 2: Change the Default IP Address Base (Optional)” (page 61) • “Task 3: Run the cluster_prep Command to Prepare the System” (page 63) • “Task 4: Install Patches or RPM Updates” (page 65) • “Task 5: Run the discover Command to Dis
cluster_config: Populates the configuration and management database with node role assignments, starts all services on the head node, and creates the golden system image.
startsys: Turns on power to each node and downloads the SystemImager automatic installation environment to install and configure each node from the golden image.
3.1.2 Internal Node Naming
It is important to understand how internal node names are assigned.
As part of the initial software installation, the head node is configured as the golden client, which is the node that represents the configuration from which all other nodes are replicated. Next, a golden image is created from the golden client, which is a replication of the local file system directories and files, starting from root (/). The golden image is stored on the image server, which is also resident on the head node in this release.
Table 3-1 Information Required by the cluster_prep Command Item Description and User Action Node name prefix During the system discovery process, each node is automatically assigned an internal name. This name is based on a prefix defined by you. The default node prefix is the letter n. All node names consist of the prefix and a number based on the node's topographical location in the system.
Table 3-1 Information Required by the cluster_prep Command (continued) Item Description and User Action IPv6 address Provide the IPv6 address of the head node's Ethernet connection to the external network, if applicable. Specifying this address is optional and is intended for sites that use IPv6 addresses for the rest of the network.
Table 3-2 Information Required by the discover Command Item Description and User Action Total number of nodes in this cluster Enter the total number of nodes in the system configuration that are to be discovered at this time. Make sure the number you enter includes the head node and all compute nodes. You are not prompted for this information if you are discovering a multi-region, large-scale system. If the hardware configuration contains HP server blades, you are not prompted for this information.
Table 3-2 Information Required by the discover Command (continued)
Number of nodes plugged into the Root Administration Switch: If you are required to use the --oldmp option on the discover command line for HP XC systems with an HP Integrity head node, you are prompted to supply the number of nodes that are plugged into the Root Administration Switch.
Number of nodes plugged into application cabinets
Table 3-3 Information Required by the cluster_config Utility (continued) Item Description and User Action Number of QsNetII node-level and top-level switches For systems with a QsNetII interconnect, you are asked to supply the number of node-level and top-level switches in the configuration. LVS configuration If you modified the default role assignments and assigned a login role to one or more nodes, you are prompted to enter an LVS alias.
Table 3-3 Information Required by the cluster_config Utility (continued) Item Description and User Action SVA and remote graphics software configuration If you installed SVA or optional remote graphics software,1 you are prompted to supply the following information: • Whether the visualization nodes (the workstations) have a KVM attached • The host names for display nodes, that is, the nodes that have monitors connected to them • Remote graphics software configuration information: — The host names of the
and the file you modify depends on whether or not HP server blades and enclosures are included in the hardware configuration.
base_addr.ini file: Modify this file if the hardware configuration does not contain HP server blades and enclosures.
base_addrV2.ini file: Modify this file if the hardware configuration contains HP server blades and enclosures. You must make the same modifications to the base_addr.ini file.
Follow this procedure to change the default IP address base for the XC private networks: 1.
5. If you made changes to the base_addrV2.ini file, repeat steps 1 through 4 to edit the base_addr.ini file and make the same changes. Proceed to “Task 3: Run the cluster_prep Command to Prepare the System” . 3.4 Task 3: Run the cluster_prep Command to Prepare the System The first step in the configuration process is to prepare the system by running the cluster_prep command.
You can enter an IPv6 address, press the [ ] keys to delete the value shown, or press the Enter key to accept the value shown. IPv6 address (optional) []: Enter Gateway IP address []: your_IPaddress You have the option to override the system default MTU value. You can enter 9000 to enable jumbo frames, press the [ ] keys to delete the value shown and use the system default, or press the Enter key to accept the value shown.
c. Return all cabling to its original configuration.
d. Press the reset button on the Onboard Administrator.
8. Click Applications→System Tools→Terminal to open a terminal window.
3.5 Task 4: Install Patches or RPM Updates
For each supported version of the HP XC System Software, HP releases all Linux security updates and HP XC software patches on the HP IT Resource Center (ITRC) website.
4. From the registration confirmation window, select the option to go directly to the ITRC home page. 5. From the IT Resource Center home page, select patch/firmware database from the maintenance and support (hp products) list. 6. From the patch / firmware database page, select Linux under find individual patches. 7.
NOTE: Follow this procedure before you run the discover command if you want to locate the console port of a non-blade head node on the administration network and not on the external network: 1. 2. Set the IP address for the head node console port to a static IP address that is not currently in use by the HP XC system. Typically, this address can be 172.31.47.240, which is the top end of addresses defined for the HP XC switches defined in the /opt/hptc/config/base_addrV2.ini file.
• • • 5. Table 3-2 (page 58) and discover(8) contain information about additional keywords you can add to the command line to omit some of the questions that will be asked during the discovery process. Use of these keywords is optional. If you encounter problems during the discovery process, see “Troubleshooting the Discovery Process” (page 165) for troubleshooting guidelines. The discover command does not properly discover HP ProLiant DL140 and DL145 servers until the password is set.
switchName necs1-1 switchIP 172.20.65.2 type 2650 switchName nems1-1 switchIP 172.20.65.1 type 2848 Attempting to power on nodes with nodestring 8n[13-15] Powering on all known nodes ... done Discovering Nodes... running port_discover on 172.20.65.1 nodes Found = 1 nodes Expected = 4 running port_discover on 172.20.65.1 nodes Found = 1 nodes Expected = 4 running port_discover on 172.20.65.1 nodes Found = 1 nodes Expected = 4 running port_discover on 172.20.65.
If necessary, see Chapter 12 (page 165) for information about troubleshooting problems you might encounter during the discovery process. Example 3-2 shows the unique command output for a large-scale system with two regions; all other command output is similar to the previous example. Example 3-2 discover Command Output For Large-Scale Systems The discover process has detected 2 regions. Is this correct? [y/n] y switchName nems0-1-0 switchIP 172.20.65.
NOTE: The following procedure assumes that all enclosures have been physically set up and populated with nodes, all components have been cabled together as described in the HP XC Hardware Preparation Guide, you have prepared the head node and the non-blade server nodes according to the instructions in the HP XC Hardware Preparation Guide, and the server blade head node is installed with HP XC System Software.
1. Begin this procedure as the root user on the head node.
3.6.2.3 Discover All Nodes and Enclosures
Follow this procedure to discover all enclosures and all nodes (including server blades) in the hardware configuration. This discovery process assigns IP addresses to all hardware components:
1. Begin this procedure as the root user on the head node.
2. Start a script to capture command output into a file. This step is optional, but HP recommends doing so.
# script your_filename
3. Change to the following directory:
# cd /opt/hptc/config/sbin
4.
checking 172.31.16.1 checking 172.31.16.4 checking 172.31.16.3 checking 172.31.16.2 .done Starting CMF for discover... Stopping cmfd: [FAILED] Starting cmfd: [ OK ] Waiting for CMF to establish console connections .......... done 1 uploading database Restarting dhcpd Opening /etc/hosts Opening /etc/hosts.new.XC Opening /etc/powerd.conf Building /etc/powerd.conf ... done Attempting to start hpls power daemon ... done Waiting for power daemon ...
New password confirmed. Lights-Out> exit • For BMC Firmware Version 1.24 or higher: a. Press Esc and Shift-9 to enter into the command-line mode. b. Change to the following directory: /directory_name/-> cd map1/accounts c. List the pre-defined users (by default, 16 users have been pre-defined): /directory_name/-> show d. One user at a time, use the show command until you find the first instance of the user name admin .
Table 3-4 System Environment Setup Tasks (continued)
Required Tasks / Conditionally Optional Tasks:
• “Create the HP Modular Cooling System Configuration File” (page 83)
• “Mount Network File Systems” (page 85)
• “Update initrd Files With Required Hardware”
When you have finished setting up the system environment, proceed to “Task 7: Run the cluster_config Utility to Configure the System” (page 85) to begin the system configuration process.
sendmail Configuration Requirements on an HP XC System
Although Linux sendmail typically functions correctly as shipped, current HP XC host naming conventions cause sendmail to improperly identify itself to other mail servers. This improper identification can lead to the mail being rejected by the remote server. To remedy this issue, adjust the sendmail configuration on all nodes with an external connection that will send mail; a generic sketch of one common approach is shown below.
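This is a minimal sketch of sendmail domain masquerading, assuming a Red Hat style sendmail layout with the sendmail-cf package installed; the domain name is a placeholder, and the approach shown is generic, not necessarily the exact procedure HP prescribes for HP XC:

# cat >> /etc/mail/sendmail.mc <<'EOF'
MASQUERADE_AS(`example.com')dnl
FEATURE(`masquerade_envelope')dnl
EOF
# m4 /etc/mail/sendmail.mc > /etc/mail/sendmail.cf
# service sendmail restart

Masquerading makes outgoing mail identify itself with the site domain instead of an internal HP XC host name, which is one way to avoid rejection by remote mail servers.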
3.7.4 Customize the Nagios Environment (Required)
Nagios is a highly customizable system monitoring tool that you can tailor to specific installation and monitoring requirements. HP recommends that you consider certain aspects of the Nagios environment as part of the initial system setup to optimize the type of system events reported to you as well as the frequency of alerts.
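For illustration only, alert frequency and the event types that generate notifications are typically tuned through standard Nagios object definitions such as contacts. The following is a generic Nagios snippet with placeholder names; it is not an HP XC specific configuration file, and the file locations on an HP XC system may differ:

# Notify this contact only on CRITICAL service problems, host DOWN events, and recoveries
define contact {
    contact_name                   xc_admin
    email                          admin@example.com
    service_notification_period    24x7
    host_notification_period       24x7
    service_notification_options   c,r
    host_notification_options      d,r
    service_notification_commands  notify-by-email
    host_notification_commands     host-notify-by-email
}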
NOTE: HP SFS, Serviceguard, and RGS have not been qualified on HP XC System Software that is installed on Red Hat Enterprise Linux (described in Chapter 7 (page 137)). If you are installing HP XC System Software on Red Hat Enterprise Linux, do not install these HP software products. 3.7.7.1.1 HP StorageWorks Scalable File Share The HP XC System Software enables Lustre1 client services for high-performance and high-availability file I/O.
Deciding on the Method to Achieve Quorum for HP Serviceguard Clusters In a Serviceguard configuration, each availability set becomes its own two-node Serviceguard cluster, and each Serviceguard cluster requires some form of quorum. The quorum acts as a tie breaker in the Serviceguard cluster running on each availability set. If connectivity is lost between the nodes of the Serviceguard cluster, the node that can access the quorum continues to run the cluster and the other node is considered down.
3.7.7.2 Install Third-Party Software Products An HP XC system supports the use of several third-party software products. Use of these products is optional; the purchase and installation of these components is your decision depending on the software requirements. Potentially important software that is not bundled with the HP XC software includes the Intel Fortran and C compilers, The Portland Group PGI compiler, and the TotalView Debugger.
• Intel compilers http://www.intel.com/software/products/compilers/index.htm • The Portland Group, supplier of the C/C++ and Fortran compilers http://www.pgroup.com/ 3.7.8 Create the /hptc_cluster File System This task is optional. Do one of the following only if you chose to install the /hptc_cluster file system somewhere other than on the installation disk on the head node.
You have the option to enable software RAID-0 (striping) or software RAID-1 (mirroring) on client nodes. RAID is an acronym for redundant array of inexpensive (or independent) disks. RAID is a way of combining multiple disks into a single entity to improve performance or reliability or both. Software RAID-0 (striping) enables client nodes that have more than one storage disk to split data evenly across the disks. Striping is typically used to increase performance.
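For background only: outside of the HP XC imaging framework, software striping and mirroring are commonly built with the mdadm utility referenced earlier in this document. The device names below are placeholders, and on HP XC client nodes RAID is actually enabled through the imaging configuration rather than by running mdadm manually:

# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1

To mirror the same two partitions (RAID-1) instead of striping them (RAID-0):

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1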
Table 3-5 lists the user and group account IDs that are configured by default on an HP XC system if they are not already in use. If any of the default user and group identifiers conflict with other accounts or are not suitable for your environment, you can override them by creating the user accounts manually now (before running the cluster_config utility).
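A minimal sketch of creating such an account manually follows; the account name and numeric identifiers are hypothetical, so use the names listed in Table 3-5 and identifier values appropriate for your site:

# groupadd -g 5001 exampleuser
# useradd -u 5001 -g exampleuser -d /var/exampleuser -s /sbin/nologin exampleuser

Creating the account before running cluster_config ensures that the configuration tools use your chosen identifiers instead of the defaults.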
3. Copy the MCS template file into a file called mcs.ini:
# cp mcs_template.ini mcs.ini
4. Use the text editor of your choice to populate the mcs.ini file with site-specific MCS device information. A sample of the mcs.ini file is shown in Example 3-3 (page 84). The following list describes the parameters in the mcs.ini file:
mcs_units: Specifies each MCS device name, separated by a comma.
The remaining parameters are mcs_server_units, name, ipaddr, location, nodes, and status.
status=offline
[mcs5]
name=mcs5
ipaddr=172.23.0.5
location=Cab CBB5
nodes=n[145-180]
status=offline
3.7.15 Mount Network File Systems
This task is optional. If you plan to mount NFS file systems, add the mount points to the /hptc_cluster/etc/fstab.proto file now so that the mount points are propagated to the golden image.
NOTE: See the HP XC System Software Administration Guide if you need more information about how to modify the /hptc_cluster/etc/fstab.proto file.
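As an illustration, an NFS mount point added for propagation to all nodes generally resembles a standard fstab entry like the following; the server name, paths, and mount options are placeholders, and the exact syntax expected by the fstab.proto file is described in the HP XC System Software Administration Guide:

nfsserver.example.com:/export/projects  /projects  nfs  rw,hard,intr  0 0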
Depending upon the role assignments, the cluster_config utility prompts you to configure services (such as NAT, NIS, LVS) and prompts you to configure software components such as LSF (either LSF-HPC with SLURM or standard LSF), SLURM, and Nagios.
The following criteria must be met to configure improved availability of services: • • You have purchased, licensed, installed, and configured an availability tool (such as HP Serviceguard) . You have written and positioned tool-specific translator and other related scripts in the /opt/hptc/availability/availability_tool directory. NOTE: Scripts for Serviceguard were automatically created and positioned for you when you installed the HP XC Serviceguard RPM.
b. Create an availability set that will be managed by the selected tool. Specify both node names to associate into the availability set. “Choosing Nodes as Members of Availability Sets” (page 31) described how to choose the nodes to associate in an availability set. avail> create n7 n8 Availability Set serviceguard: (n7 n8) created. 5.
[L]ist Nodes, [M]odify Nodes, [A]nalyze, [H]elp, [P]roceed, [Q]uit: l Output Type - a[B]breviated, [A]ll, [C]ancel: • The a[B]breviated output is similar to the following. Nodes with the same role assignments are condensed into one line item. Node: n16 location: Level 1 Switch 172.20.65.1, Port 42 CURRENT HEAD NODE Roles assigned: compute console_network disk_io external management_hub management_server resource_management External Ethernet name: penguin.southpole.com ipaddr: 192.0.2.0 netmask: 255.255.
c. d. If you plan to configure an external Ethernet connection on any node, you must assign the external role to that node. By default, the head node is assigned with the console_network and management_hub roles. You have the flexibility to assign the console_network and management_hub roles to multiple nodes and HP recommends that you consider using one management hub for every 64 to 128 nodes.
NOTE: Table 3-3 (page 59) describes each prompt and provides information to help you with your answers. 1.
You must now specify the clock source for the server nodes. If the nodes have external connections, you may specify up to 4 external NTP servers. Otherwise, you must use the node's system clock. Enter the IP address or host name of the first external NTP server or leave blank to use the system clock on the NTP server node: IP_address Enter the IP address or host name of the second external NTP server or leave blank if you have no more servers: Enter Renaming previous /etc/ntp.conf to /etc/ntp.conf.bak 4.
8. Enable web access to the Nagios monitoring application and create a password for the nagiosadmin user. This password does not have to match any other password on the system. In this example, the Nagios service has been configured with improved availability. Executing C50nagios gconfigure Availability can be configured for nagios in one of several ways. The choices are: 1: standard 2: serviceguard A choice of 'standard' (1) means no improved availability.
Enter the host name of the first display node. You can enter names like n[1-4] or n[1,3] too n[1-5] Enter the host name of the next display node leave blank if you have no more display nodes: You must now specify the Remote Graphics nodes for the cluster. Each Remote Graphics node must have an external ethernet address configured. Enter the host name of the first Remote Graphics node.
Executing C90munge gconfigure Executing C90slurm gconfigure 14. Decide whether you want to configure SLURM. SLURM is required if you installed SVA and if you plan to install LSF-HPC with SLURM. Do you want to configure SLURM? (y/n) [y]: Do one of the following: • • If you intend to install LSF-HPC with SLURM or if you intend to install the Maui Scheduler, or if you have already installed SVA, enter y and proceed to step 15. If you intend to install standard LSF do not install SLURM and enter n.
Do one of the following: • To install LSF, enter y or press the Enter key. — If you elected to install SLURM in step 14, proceed to step 17 to choose the type of LSF to install. — If you did not elect to install SLURM, you cannot choose the type of LSF to install, and standard LSF is installed automatically. Proceed to step 19. • If you intend to install another job management system, such as PBS Professional (see Chapter 9) or the Maui Scheduler (see Chapter 10) enter n. Proceed to step 22.
19. Provide responses to install and configure LSF. This requires you to supply information about the primary LSF administrator and administrator's password. The default user name for the primary LSF administrator is lsfadmin. If you accept the default user name and a NIS account exists with the same name, LSF is configured with the existing NIS account, and you are not be prompted to supply a password. Otherwise, accept all default answers.
see "/opt/hptc/lsf/top/6.2/lsf_quick_admin.html" to learn more about your new LSF cluster. ***Begin LSF-HPC Post-Processing*** Created '/hptc_cluster/lsf/tmp'... Editing /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf... Moving /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf to /opt/hptc/lsf/top/conf/lsf.cluster.hptclsf.old.7858... Editing /opt/hptc/lsf/top/conf/lsf.conf... Moving /opt/hptc/lsf/top/conf/lsf.conf to /opt/hptc/lsf/top/conf/lsf.conf.old.7858...
info: Executing C04iptables nconfigure
info: Executing C06nfs_server nconfigure
info: Executing C08ntp nconfigure
info: Executing C10hptc_cluster_fs nconfigure
info: Executing C10hptc_cluster_fs_client nconfigure
info: Executing C11avail nconfigure
info: Executing C12dbserver nconfigure
info: Executing C20gmmon nconfigure
info: Executing C20smartd nconfigure
info: Execut
NOTE: If necessary, see “Troubleshooting the Imaging Process” (page 170) for information about using the imaging log files to troubleshoot the imaging process. Proceed to “Task 12: Run the startsys Utility to Start the System and Propagate the Golden Image”. 3.12 Task 11: Edit the /etc/dhcpd.conf File This task is optional.
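For reference, a host declaration in a standard ISC /etc/dhcpd.conf file generally has the following form; the node name, MAC address, and IP address shown are placeholders, and on an HP XC system the actual entries are generated by the configuration tools, so this sketch only illustrates the general format of what you might edit:

host n128 {
    hardware ethernet 00:17:a4:00:11:22;
    fixed-address 172.31.0.128;
}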
1. Determine whether you want to override files delivered in the golden image. This step is optional, and typically during an initial system installation and configuration, it is not necessary. However, be aware that you can modify the files delivered in the golden image, and the HP XC System Software Administration Guide describes how to do so.
Table 3-9 startsys Command-Line Options Based on Hardware Configuration Hardware Configuration startsys Command Line Fewer than 300 nodes For small-scale hardware configurations, nodes are imaged and rebooted in one operation. The nodes complete their per-node configuration phase, thus completing the installation. This option applies only for nodes that have previously been set up to network boot.
You must manually power on the following nodes: n1 Press enter after applying power to these nodes. continuing ........
12 nodes -> n[2-6,8-14] Fri Jul 06 09:06:48 2007 Processing completed for: 1 node -> n1 Fri Jul 06 09:07:03 2007 Processing completed for: 1 node -> n7 Fri Jul 06 09:07:18 2007 Processing completed for: 9 nodes -> n[4-5,8-14] *** Fri Jul 06 09:07:33 2007 Current statistics: Booted and available: 15 nodes -> n[1-15] Progress: Fri Jul 06 09:07:33 2007 Processing completed for: 3 nodes -> n[2-3,6] *** Fri Jul 06 09:07:33 2007 Current statistics: Booted and available: 15 nodes -> n[1-15] Progress: Fri Jul 06 09
If you have used a lock LUN to achieve quorum on one or more availability sets, create the device file on the second node in the appropriate availability sets. In the following example, /dev/sdb1 is the full path to the disk and partition of the lock LUN on the first disk. Run the following command on the second disk, specifying the same disk and partition number as the first disk: # sfdisk -R /dev/sdb1 Proceed to “Task 15: Start Availability Tools”. 3.
Adding node n8 to cluster avail1. Completed the cluster creation. Starting the cluster. cmruncl : Validating network configuration... cmruncl : Network validation complete cmruncl : Waiting for cluster to form.... cmruncl : Cluster successfully formed. cmruncl : Check the syslog files on all nodes in the cluster cmruncl : to verify that no warnings occurred during startup.
# grep nh /etc/hosts
IP_address n16 nh hplsadm n16.localhost.localdomain
2. Configure the MCS devices to send their SNMP traps to the management server IP alias; use the IP alias obtained in the previous step:
# mcs_webtool -H MCS_IP_address -p MCS_Admin_password -i -a IP_address -e IP_address
Adding trap receiver 1: IP_address
Enabling trap receiver 1: IP_address \
Trap Receiver Status for MCS at MCS_IP_address:
Authentication Traps: Disable
Trap Receiver 1: IP_address Enable
Trap Receiver 2: 0.0.0.
NOTE: If a compute node did not boot up, the spconfig utility configures the node as follows: Configured unknown node n14 with 1 CPU and 1 MB of total memory... After the node has been booted up, re-run the spconfig utility to configure the correct settings. 3. 4. If the system is using a QsNetII interconnect, ensure that the number of node entries in the /opt/hptc/libelanhosts/etc/elanhosts file matches the expected number of operational nodes in the cluster.
# controllsf enable failover 5. Determine the node on which the LSF daemons are running: # controllsf show current LSF is currently running on node n32, and assigned to node n32 6. Log in to the node that is running LSF if it is not running on the head node. If LSF is running on the head node, omit this step. # ssh n32 7. Restart the LIM daemon: # lsadmin limrestart Checking configuration files ... No errors found. Restart LIM on ......
4 Verifying the System and Creating a Baseline Record of the Configuration Complete the tasks described in this chapter to verify the successful installation and configuration of the HP XC system components. With the exception of the tasks that are identified as optional, HP recommends that you perform all tasks in this chapter.
# lsid
Platform LSF 6.2, LSF_build_date
Copyright 1992-2005 Platform Computing Corporation
My cluster name is hptclsf
My master name is n13
[root@n16 ~]# lshosts
HOST_NAME type model
n13 LINUX64 Itanium2
n16 LINUX64 Itanium2
n1 LINUX64 Itanium2
n2 LINUX64 Itanium2
n3 LINUX64 Itanium2
n4 LINUX64 Itanium2
n5 LINUX64 Itanium2
n6 LINUX64 Itanium2
n7 LINUX64 Itanium2
n8 LINUX64 Itanium2
n9 LINUX64 Itanium2
n10 LINUX64 Itanium2
n11 LINUX64 Itanium2
n12 LINUX64 Itanium2
2.
# shownode config hostgroups hostgroups: headnode: n16 serviceguard:avail1: n14 n16 In this example, one availability set, avail1, has been configured. 2. On one node in each availability set, view the status of the Serviceguard cluster: # pdsh -w n14 /usr/local/cmcluster/bin/cmviewcl NOTE: The /usr/local/cmcluster/bin directory is the default location of the cmviewcl command. If you installed Serviceguard in a location other than the default, look in the /etc/cmcluster.
The OVP also runs the following benchmark tests. These tests compare values relative to each node and report results with values more than three standard deviations from the mean: • • • • LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. This test is CPU intensive and stresses the nodes, with limited data exchange. PALLAS exercises the interconnect connection between compute nodes to evaluate MPI performance.
6. When all OVP tests pass, proceed to “Task 4: Run the SVA OVP Utility” (if SVA is installed) or “Task 5: View System Health”. 4.4 Task 4: Run the SVA OVP Utility Run the SVA OVP utility only if you installed and configured SVA. The SVA OVP runs a series of Chromium demonstration applications on all defined display surfaces, which verifies the successful installation of SVA. Follow this procedure to start the SVA OVP: 1.
Host Monitor IP Assignment - DHCP Load Average LSF Failover Monitor Nagios Monitor NodeInfo PING Interconnect Resource Monitor Resource Status Root key synchronization Sensor Collection Monitor Slurm Monitor Slurm Status Supermon Metrics Monitor Switch Switch Data Collection Syslog Alert Monitor Syslog Alerts System Event Log System Event Log Monitor System Free Space Totals: 1-Ok 1-Ok 10-Ok 1-Ok 1-Ok 10-Ok 10-Ok 1-Ok 10-Ok 1-Ok 1-Ok 1-Ok 10-Ok 1-Ok 2-Ok 1-Ok 1-Ok 10-Ok 9-Ok 1-Ok 10-Ok 115-Ok 0-Warn 0-War
4.7 Task 7: Create a Baseline Report of the System Configuration The sys_check utility is a data collection tool that is used to diagnose system errors and problems. Use the sys_check utility now to create a baseline report of the system configuration (software and hardware). The sys_check utility collects configuration data only for the node on which it is run unless you set and export the SYS_CHECK_SYSWIDE variable, which collects configuration data for all nodes in the HP XC system.
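A minimal sketch of creating the baseline follows, assuming that setting SYS_CHECK_SYSWIDE to 1 enables the clusterwide collection and that the report is redirected to a file of your choosing; see sys_check(8) for the exact usage and output format:

# export SYS_CHECK_SYSWIDE=1
# sys_check > /var/log/sys_check_baseline.out

Keeping the baseline file outside frequently changed directories makes later comparisons against the post-installation configuration easier.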
5 Upgrading an HP XC System This chapter describes how to use the upgrade process to install HP XC System Software Version 3.2 on an HP XC system that is already running a previous version of the HP XC System Software.
NOTE: Enter the following command if you are not sure what version of the HP XC System Software is installed on the system: # cat /etc/hptc-release 5.1.2 Is Upgrading Appropriate for Your System Configuration? An upgrade path from previous HP XC System Software releases to Version 3.2 is provided, but HP recommends a new installation of Version 3.2.
Table 5-2 Upgrade Characteristics (continued)
Effect on nodes that are down: If one or more nodes are in the DOWN state during the upgrade, you must reimage those nodes as soon as they are returned to operation.
Effect on number of nodes: You cannot add or remove nodes during the upgrade process.
5.1.4 Upgrade Commands
Table 5-3 lists the commands and utilities that are run as part of a software upgrade process.
4. If you are using LSF, make sure that you save the LSF /hptc_cluster/lsf/conf/ configuration directory. Also save any other known LSF customizations, such as elim scripts. One way to preserve these files is sketched after this list.
5. If the system is using HP StorageWorks Scalable File Share, see the SFS documentation to determine if you are required to update the SFS server and client software for the new HP XC release. If an SFS software update is required, you perform that task in “Task 6: Install Patches and Reinstall Additional Software” (page 124) .
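One way to preserve the LSF configuration before the upgrade is to archive it to a location outside the upgraded directories; the destination paths and the elim script location shown here are placeholders:

# tar -czf /root/lsf_conf_pre_upgrade.tar.gz /hptc_cluster/lsf/conf
# cp -a /path/to/elim/scripts /root/elim_pre_upgrade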
5.4 Task 3: Install the Upgrade RPM and Prepare the System Follow this procedure to install the upgrade Red Hat Package Manager (RPM), and run the preupgradesys script, which performs the necessary preprocessing to prepare the system: 1. 2. Insert the HP XC System Software Version 3.2 DVD into the DVD drive on the head node. Create the following directory: # mkdir /mnt/cdrom 3. Mount the DVD on /mnt/cdrom: # mount /dev/cdrom /mnt/cdrom 4. Change to the following directory: # cd /mnt/cdrom/HPC/RPMS 5.
Follow this procedure to upgrade the RPMs on the system: 1. Change directory to the mount point of the DVD: # cd /mnt/cdrom 2. Update the Linux and HP XC RPMs: # upgraderpms NOTE: Because the upgraderpms command output spans several pages, it is shown in Appendix L (page 239). Return here when command processing is complete. 3. Unmount the DVD: # cd # umount /dev/cdrom 4. 5. Eject the DVD from the drive. Reboot the head node, thus booting the new HP XC kernel: # reboot 6.
3. 4. If a new kernel is supplied in a patch, you must rebuild kernel-dependent modules. See the HP XC System Software Administration Guide for more information about rebuilding kernel-dependent modules. Reinstall or upgrade any HP, open source, or third-party vendor software products that you specifically installed (for example, debuggers, compilers, and so on) on the HP XC system.
Table 5-5 Files Containing User Customizations
*.bak files (for example, /hptc_cluster/slurm/etc/slurm.conf.bak)
*.rpmsave files (for example, /etc/sysconfig/iptables.proto.rpmsave): If the head node was previously configured as a NIS slave server, do not merge the nis ports from the iptables.proto.rpmsave file into the iptables.proto file because the nis_server service automatically opens the necessary ports when it is configured during cluster_config processing.
/etc/my.cnf.
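Before merging customizations, it can help to review how a saved file differs from the newly installed version; a minimal sketch using the iptables example from Table 5-5:

# diff -u /etc/sysconfig/iptables.proto.rpmsave /etc/sysconfig/iptables.proto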
Removing XC MLIB RPMs upgradesys output logged to /var/log/upgradesys/upgradesys.log IMPORTANT: Do not proceed to the next step in the upgrade process if the output from the upgradesys script indicates failures. If you cannot determine how to resolve these errors, contact the HP XC Support team at the following e-mail address: xc_support@hp.com 2. 3. Review the /opt/hptc/systemimager/etc/base_exclude_file to determine if you want to exclude files from the golden image beyond what is already excluded.
• • • 8. HP recommends that you use the [L]ist Nodes option to see the roles assigned to each node and make adjustments if required. If you ran the cluster_config command with the --init option, use the [M]odify Nodes option to reassign any role assignments you customized in the previous release. For example, if the system configuration had login roles on one or more nodes, you must assign a login role on any node on which you want users to be able to log in.
12. Compare the final LSF configuration with the saved version to ensure that only the appropriate changes relevant to the upgrade have occurred. Also ensure (or restore, if required) the elim scripts. HP recommends that you use another terminal window to install the elim scripts at the point when the cluster_config utility displays the following prompt: All user specified configuration is complete. The Golden Image will be created next. [P]roceed, [Q]uit: Thus, changes to the /opt/hptc/lsf/top/6.
Table 5-8 Upgrade startsys Command-Line Options Based on Hardware Configuration Hardware Configuration startsys Command Line Fewer than 300 nodes For small-scale hardware configurations, nodes are imaged and rebooted in one operation. The nodes complete their per-node configuration phase, thus completing the installation. This option applies only for nodes that have previously been set up to network boot.
6. Log in to the node that is running LSF if it is not running on the head node. If LSF is running on the head node, omit this step. # ssh n32 7. Restart the LIM daemon: # lsadmin limrestart Checking configuration files ... No errors found. Restart LIM on ...... done Restarting the LIM daemon is required because the licensing of LSF-HPC with SLURM occurs when the LIM daemon is started.
6 Reinstalling HP XC System Software Version 3.2 This chapter describes how to reinstall HP XC System Software Version 3.2 on a system that is already running Version 3.2. Reinstalling an HP XC system with the same release might be necessary if you participated as a field test site of an advance development kit (ADK) or an early release candidate kit (RC).
# scontrol update NodeName=n[1-5] State=IDLE 6.2 Reinstalling Systems with HP Integrity Hardware Models This section describes the following tasks: • “Reinstalling the Entire System” (page 134) • “Reinstalling One or More Nodes” (page 134) 6.2.1 Reinstalling the Entire System Follow this procedure to reinstall HP XC System Software Version 3.2 on systems comprised of HP Integrity hardware models. A reinstallation requires that all nodes are set to network boot.
# stopsys n[1-5] # startsys --image_and_boot n[1-5] All nodes reboot automatically when the installation is finished. 5. 6. Run the transfer_to_avail command to shut down all HP XC services and IP aliases that will be managed by an availability tool. After shutting down these services and IP aliases, the transfer_to_avail command starts each availability tool. Then, the availability tool starts up the services and IP aliases it is managing.
7 Installing HP XC System Software on Red Hat Enterprise Linux This chapter describes how to install and configure Red Hat Enterprise Linux on the head node and then install the HP XC System Software as a layered product (also known as layered HP XC).
7.2 Caveats The following caveats apply to installing HP XC System Software on Red Hat Enterprise Linux: • • • • • HP XC System Software has been tested and certified only on the interconnect types listed in Table 7-1. The HP StorageWorks Scalable File Share (SFS) product is not supported. The OpenFabrics Enterprise Distribution (OFED) has not been qualified. The improved availability feature of HP XC System Software has not been certified.
Table 7-2 Red Hat Installation Settings Item Selection Firewall configuration In the Firewall Configuration window: • Click No firewall • Disable the Security Enhanced Linux (SELinux) feature Head node disk partitions and file systems Use the Disk Druid utility to partition the installation disk on the head node as follows: • swap space - 6 GB • /boot or /boot/efi - 1 GB • / (root) - 25 GB • /var - 30 GB • /hptc_cluster - 8 GB Software You have considerable flexibility in selecting the Red Hat softwa
# chkconfig openibd off • squid software The squid high-performance web proxy cache can improve response times and reduce bandwidth. HP server blade enclosures require the squid software for the Onboard Administrator (OA). Follow this procedure if you need to install the squid software: 1. 2. Mount CD 2 of the Red Hat installation kit. Install the squid software: # rpm -Uvh RedHat/RPMS/squid-*.rpm 3.
7.7 Task 5: Configure, Image, and Verify the System Follow this procedure to complete the system configuration and verification process: 1. 2. If your system environment requires additional software from the HP XC DVD, install the software now by following the instructions in “Task 3: Install Additional RPMs from the HP XC DVD” (page 49). Configure and image the system by performing all tasks documented in Chapter 3 (page 53).
Table 7-3 Determining the Appropriate Support Contact Type of Problem Contact You encounter a problem with the Red Hat software. If you purchased the Red Hat software from Red Hat directly, contact Red Hat Support. If you purchased the Red Hat software from HP directly, contact HP Support. You encounter a problem with the HP XC System Software and you have a support contract. Contact the HP XC Support team: xc_support@hp.
8 Upgrading HP XC System Software on Red Hat Enterprise Linux
This chapter addresses the following topics:
• “HP XC on Red Hat Enterprise Linux Software Upgrade Overview” (page 143)
• “Task 1: Prepare for the HP XC on Red Hat Enterprise Linux Upgrade” (page 144)
• “Task 2: Prepare the System State” (page 145)
• “Task 3: Install the Upgrade RPM and Prepare the System” (page 145)
• “Task 4: Upgrade HP XC RPMs for HP XC on Red Hat Enterprise Linux” (page 146)
• “Task 5: View the Results of the RPM Upgrade” (pa
8.1.3 HP XC on Red Hat Enterprise Linux Upgrade Characteristics Before you begin an upgrade, become familiar with the upgrade characteristics listed in Table 5-2 (page 120). 8.1.4 HP XC on Red Hat Enterprise Linux Upgrade Commands Table 8-1 lists the commands and utilities that are run as part of an HP XC on Red Hat Enterprise Linux software upgrade process.
4. If the system is configured with an InfiniBand interconnect, upgrade the HCA cards on all nodes now.
Proceed to “Task 2: Prepare the System State”.

8.3 Task 2: Prepare the System State
On the head node, follow this procedure to ensure that the system is in the appropriate state for the upgrade:
1. Set all nodes to network boot so that all client nodes can be reimaged after the head node is updated:
   # setnode --resync --all
2. Stop all running jobs (one way to check what is still running is shown in the sketch after this list).
3.
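This check is not part of the documented procedure; it is a hedged sketch that assumes the standard SLURM and LSF commands are available in the path:
# squeue
# bjobs -u all
Review the output and wait for the listed jobs to finish, or cancel them, before continuing with the upgrade.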
IMPORTANT: Do not proceed to the next step in the upgrade process if the output from the preupgradesys-lxc script indicates failures. If you cannot determine how to resolve these errors, contact the HP XC Support team at the following e-mail address: xc_support@hp.com 9. Save the current system configuration to a file to create a snapshot of the system configuration before the upgrade: # shownode config --admin > /var/log/preupgradesys-lxc/shownode_config.txt 8.
1. Install all patches from the HP IT Resource Center (ITRC) website that might be available for HP XC on Red Hat Enterprise Linux. “Download and Install Patches” (page 65) describes how to download patches from this website. IMPORTANT: You might have to reboot the head node if a patch to the kernel is required. The patch README file provides instructions. You must stop all user applications before rebooting the head node.
3. The base_exclude file is read when the golden image is re-created as part of the upgrade process. The HP XC System Software Administration Guide describes how to add exclusions to this file.
4. Use the information in Table 5-6 (page 127) to decide on the cluster_config option to use to configure the upgraded system.
5. Change directory to the configuration directory:
   # cd /opt/hptc/config/sbin
8.10 Task 9: Image and Boot the HP XC on Red Hat Enterprise Linux System and Start Compute Resources Follow the procedure in “Task 9: Image and Boot the System and Start Compute Resources” (page 129) to image and boot the system and start compute resources. Proceed to “Task 10: Verify the HP XC on Red Hat Enterprise Linux Upgrade”. 8.
9 Installing and Using PBS Professional
This chapter addresses the following topics:
• “PBS Professional Overview” (page 151)
• “Before You Begin” (page 151)
• “Plan the Installation” (page 151)
• “Perform Installation Actions Specific to HP XC” (page 152)
• “Configure PBS Professional under HP XC” (page 152)
• “Replicate Execution Nodes” (page 153)
• “Enter License Information” (page 154)
• “Start the Service Daemons” (page 154)
• “Set Up PBS Professional at the User Level” (page 154)
• “Run HP MPI Tasks”
9.4 Perform Installation Actions Specific to HP XC
Follow this installation procedure:
1. Install the PBS server node (front-end node) first, using the installation script provided by the software vendor, and specify the following values:
   a. Accept the default value offered for the PBS_HOME directory, which is /var/spool/PBS.
   b. When prompted for the type of PBS installation, select: option 1 (Server, execution and commands).
   c. If available, enter the license key during the interactive installation.
9.5.1 Configure the OpenSSH scp Utility
By default, PBS Professional uses the rcp utility to copy files between nodes. The default HP XC configuration disables rcp in favor of the more secure scp command provided by OpenSSH. To use PBS Professional on HP XC, configure HP XC to default to scp as follows:
1. Using a text editor of your choice, open the /etc/pbs.conf file on the server node.
2.
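The remainder of this procedure is not shown here. As a hedged illustration only (verify the parameter name against the PBS Professional documentation for your release), the edit typically amounts to pointing PBS Professional at scp in /etc/pbs.conf:
PBS_SCP=/usr/bin/scp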
# pdcp -rp -w "x[n-n]" /usr/pbs /usr
# pdcp -rp -w "x[n-n]" /var/spool/PBS /var/spool
# pdcp -p -w "x[n-n]" /etc/pbs.conf /etc
# pdcp -p -w "x[n-n]" /etc/init.d/pbs /etc/init.d

Use the following as an example:
• You have installed the PBS server on node n100.
• The first PBS execution node is node n49.
• You want to replicate the execution environment to nodes n1 through n48.
In this case, the value of the node list expression is: "n[1-48]".
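Applying those example values to the commands above (replacing the x[n-n] placeholder with the node list), the replication commands would look like this:
# pdcp -rp -w "n[1-48]" /usr/pbs /usr
# pdcp -rp -w "n[1-48]" /var/spool/PBS /var/spool
# pdcp -p -w "n[1-48]" /etc/pbs.conf /etc
# pdcp -p -w "n[1-48]" /etc/init.d/pbs /etc/init.d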
9.10 Run HP MPI Tasks The PBS Professional distribution contains a wrapper script named pbs_mpihp , which is used to run HP MPI jobs. The wrapper script uses information about the current PBS Professional allocation to construct a command line, and optionally, an appfile suitable for HP MPI. The wrapper also sets the MPI_REMSH environment variable to the PBS Professional pbs_tmrsh remote shell utility.
10 Installing the Maui Scheduler
This chapter describes how to install and configure the Maui Scheduler software tool to interoperate with SLURM on an HP XC system. It addresses the following topics:
• “Maui Scheduler Overview” (page 157)
• “Readiness Criteria” (page 157)
• “Preparing for the Installation” (page 157)
• “Installing the Maui Scheduler” (page 158)
• “Verifying the Successful Installation of the Maui Scheduler” (page 160)
10.
• http://www.chpc.utah.edu/docs/manuals/software/maui.html
• http://www.clusterresources.com/products/maui/

Ensure That LSF-HPC with SLURM Is Not Activated
HP does not support the use of the Maui Scheduler with LSF-HPC with SLURM. These schedulers have not been integrated and will not work together on an HP XC system. Before you install the Maui Scheduler on an HP XC system, you must be sure that the HP XC version of LSF-HPC with SLURM is not activated on the system.
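This guide does not reproduce the check at this point. As a hedged sketch, one way to confirm the LSF controller state is the controllsf utility, which is used elsewhere in this guide; review its output and make sure LSF-HPC with SLURM is not running before you continue:
# controllsf show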
• “Task 4: Edit the SLURM Configuration File” (page 160)
• “Task 5: Configure the Maui Scheduler” (page 160)

10.4.1 Task 1: Download the Maui Scheduler Kit
Follow this procedure to download the Maui Scheduler kit:
1. Log in as the root user on the head node.
2. Download the Maui Scheduler kit to a convenient directory on the system. The Maui Scheduler kit is called maui-3.2.6p9, and it is available at:
   http://www.clusterresources.com/products/maui/
10.4.
NODECFG[n16] PARTITION=PARTA
NODECFG[n15] PARTITION=PARTA
NODECFG[n14] PARTITION=PARTA
NODECFG[n13] PARTITION=PARTA
6. Save the changes to the file and exit the text editor.

10.4.4 Task 4: Edit the SLURM Configuration File
Uncomment the following lines in the /hptc_cluster/slurm/etc/slurm.conf SLURM configuration file:
SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
10.4.
StartTime: Fri Jul 06 20:30:43
Total Tasks: 6

Req[0]  TaskCount: 6  Partition: lsf
Network: [NONE]  Memory >= 1M  Disk >= 1M  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1
Allocated Nodes: [n16:4][n15:2]

IWD: [NONE]  Executable: [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [lsf]
Reservation '116' (00:00:00 -> 1:00:00  Duration: 1:00:00)
PE: 6.00  StartPriority: 1

Table 10-2 lists several commands that provide diagnostic information about various aspects of resources, workload, and scheduling.
11 Adding Visualization Nodes to An Existing HP XC System
This chapter describes how to install and configure visualization nodes into an existing HP XC system after the SVA nodes have been fully integrated into the hardware configuration. It addresses the following topics:
• “Prerequisites” (page 163)
• “Installation Scenarios” (page 163)
11.
11.2.2 New Visualization Nodes Do Not Exceed the Maximum Number of Nodes Supplied to the cluster_prep Command Follow this procedure if you added visualization nodes to the existing system and the number of new nodes does not exceed the maximum node number you set during the initial cluster_prep process. Because the number of new nodes does not exceed the previous maximum number of nodes, you do not need to run cluster_prep command again, but you do have to discover the new nodes. 1. 2.
12 Troubleshooting
This chapter addresses the following topics:
• “Troubleshooting the Discovery Process” (page 165)
• “Troubleshooting the Cluster Configuration Process” (page 168)
• “Troubleshooting LSF and Licensing” (page 172)
• “Troubleshooting the Imaging Process” (page 170)
• “Troubleshooting the OVP” (page 172)
• “Troubleshooting SLURM” (page 176)
• “Troubleshooting the Software Upgrade Procedure” (page 177)
• “Troubleshooting HP XC on Red Hat Enterprise Linux” (page 179)
12.
The remainder of this section provides troubleshooting hints to help you solve some common problems that may occur during the discovery process.
12.1.4 Not All Console Ports Are Discovered The discovery process queries the ProCurve switches to obtain the MAC addresses of all console ports. The MAC addresses are logged in the switch as the console port device issues a DHCP request to get an IP address. NOTE: If the --oldmp option was used on the discover command line, it is assumed that all Management Processors (MPs) have their IP addresses set statically, and therefore are not subject to this step in the discovery process.
To determine what node failed in the discover process, examine the output of the discover command when it is parsing the switch output to gather the node input. For example, assume that the following was displayed during the discovery process:
.
.
.
Switch 172.20.65.3 port 4 ...
Switch 172.20.65.3 port 5 ...
Switch 172.20.65.3 port 6 ...
Switch 172.20.65.4 port 1 ...
.
.
.
gethostbyaddr failure
To resolve this problem, edit the /etc/resolv.conf file and fix incorrect DNS entries.
• Nodes that fail the configuration phase are put into single-user mode and marked as disabled in the database if an essential service failed.

12.2.1 lsadmin limrestart Command Fails
“Task 18: Finalize the Configuration of Compute Resources” (page 107) describes LSF postconfiguration tasks.
# service mysqld restart
The command you were trying to initiate should now be able to connect to the database.

12.3 Troubleshooting the Imaging Process
This section describes hints to troubleshoot the imaging process. System imaging and node configuration information is stored in the following log files:
• /hptc_cluster/adm/logs/imaging.log
• /var/log/systemimager/rsyncd
• /hptc_cluster/adm/logs/startsys.
Table 12-1 Diagnosing System Imaging Problems (continued)

Symptom: The network boot times out.
How To Diagnose: The system boots from local disk and runs nconfigure. You can verify this by checking messages written to the imaging.log file.
Possible Solution:
• Verify DHCP settings and status of daemon.
• Verify network status and connections.
• Monitor the /var/log/dhcpd.log file for DHCPREQUEST messages from the client node MAC address.
• Check boot order and BIOS settings.
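For example, one quick way to watch for those DHCPREQUEST messages is to filter a live tail of the DHCP log; this is a sketch only, and <MAC_address> is a placeholder for the client node's actual MAC address:
# tail -f /var/log/dhcpd.log | grep -i "dhcprequest.*<MAC_address>"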
Enter the following command on the affected node to fix the network boot problem:
setnode --resync node_name

12.3.3 How To Monitor An Imaging Session
To monitor an imaging operation, use the tail -f command in another terminal window to view the imaging log files. It is possible to actually view an installation through the remote serial console, but to do so, you must edit the /tftpboot/pxelinux.cfg/default file before the installation begins and add the correct serial console device to the APPEND line.
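For example, to follow an imaging session with tail -f, using the imaging log file location listed at the beginning of this section:
# tail -f /hptc_cluster/adm/logs/imaging.log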
Verify perf_health: Testing memory_usage ... The headnode is excluded from the memory usage test. Number of nodes allocated for this test is 14 Job <2049> is submitted to default queue << Waiting for dispatch ...>> <> The following node has memory usage more than 25%: n3: memory usage is 34.38%, 12.
Virtual hostname is lsfhost.localdomain Comparing ncpus from Lsf lshosts to Slurm cpu count. The Lsf and Slurm cpu count are NOT in sync. The lshosts 'ncpus' value of 1560 differs from the cpu total of 2040 calculated from the sinfo output. Suggest running 'lshosts -w' manually and compare the ncpus value with the output from sinfo --- FAILED --Testing hosts_status ... Running 'bhosts -w'. Checking output from bhosts. Running 'controllsf show' to determine virtual hostname. Checking output from controllsf.
nodes ibblc64 and ibblc65 have an Exchange value of 2077.790000 12.5.2 OVP Reports Benign Nagios Warnings The OVP might return the following Nagios warning messages. These messages are benign and you can ignore them. Verify nagios: Testing configuration ... Running basic sanity check on the Nagios configuration file. Starting the command: /opt/hptc/bin/nagios -v /opt/hptc/nagios/etc/nagios_local.cfg Here is the output from the command: Warnings were reported. Nagios 2.3.
# qsctrl qsctrl: QR0N00:00:0:0 <--> Elan:0:0 state 3 should be 4 qsctrl: QR0N00:00:0:1 <--> Elan:0:1 state 3 should be 4 qsctrl: QR0N00:00:0:2 <--> Elan:0:2 state 3 should be 4 qsctrl: QR0N00:00:0:3 <--> Elan:0:3 state 3 should be 4 qsctrl: QR0N00:00:1:0 <--> Elan:0:4 state 3 should be 4 qsctrl: QR0N00:00:1:1 <--> Elan:0:5 state 3 should be 4 qsctrl: QR0N00:00:1:2 <--> Elan:0:6 state 3 should be 4 qsctrl: QR0N00:00:1:3 <--> Elan:0:7 state 3 should be 4 qsctrl: QR0N00:00:2:0 <--> Elan:0:8 state 3 should be 4
The sinfo example shown in this section illustrates the Low RealMemory reason. It is more obscure and can be a side effect of the system configuration process. This error is reported because the SLURM slurm.conf file is configured with a RealMemory value that is higher than the MemTotal value in the /proc/meminfo file that is being reported by the compute node. SLURM does not automatically restore a node that had failed at any point because of this reason.
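As a hedged illustration of diagnosing and clearing this condition (the node name n3 is a placeholder), you can compare the configured value with what the node actually reports, correct the RealMemory entry in slurm.conf if necessary, and then return the node to service with scontrol, as shown earlier in this guide:
# grep RealMemory /hptc_cluster/slurm/etc/slurm.conf
# ssh n3 grep MemTotal /proc/meminfo
# scontrol update NodeName=n3 State=IDLE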
Table 12-2 Software Upgrade Log Files (continued)

File Name: /var/log/yum.log
Contents: Results of the YUM upgrade

File Name: /var/log/upgrade/kernel/RPMS
Contents: Symbolic links to HP XC kernel-related RPMs on the HP XC System Software Version 3.2 DVD

File Name: /var/log/upgrade/RPMS
Contents: Symbolic links to HP XC RPMs on the HP XC System Software Version 3.2 DVD

If you see errors in the /var/log/postinstall.log or /var/log/yum_upgrade.log files, fix the problem by manually installing the RPMs that failed to upgrade properly:
1.
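The numbered steps are not reproduced here. As a rough sketch only (the package name below is a placeholder), the manual fix generally amounts to reinstalling the packages that failed from the symbolic-link directories listed in Table 12-2:
# cd /var/log/upgrade/RPMS
# rpm -Uvh failed-package-name*.rpm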
# touch /hptc_cluster/adm/logs/imaging.log
3. If hptc-ire-serverlog is not running, start the service:
   # service hptc-ire-serverlog start

12.7.3 External Ethernet Connection Fails To Come Up
It is possible for an external Ethernet connection to occasionally fail to come up after invoking the cluster_config --init | --migrate command. You may see messages similar to the following in the /var/log/nconfig.log file:
Bringing up interface eth2: SIOCSIFFLAGS: Cannot allocate memory
Failed to bring up eth2.
A Installation and Configuration Checklist Table A-1 provides a list of tasks performed during a new installation. Use this checklist to ensure you complete all installation and configuration tasks in the correct order. Perform all tasks on the head node unless otherwise noted.
Table A-1 Installation and Configuration Checklist

Preparing for the Installation
1. Read related documents, especially the HP XC System Software Release Notes. If the hardware configuration contains HP blade servers and enclosures, download and print the HP XC Systems With HP Server Blades and Enclosures HowTo.
   Reference: “Task 1: Read Related Documentation” (page 25)
2. Plan for future releases.
   Reference: “Task 2: Plan for Future HP XC Releases” (page 25)
3.
Table A-1 Installation and Configuration Checklist (continued)

18. Perform the following tasks to define and set up the system environment before the golden image is created:
   • Put the XC.lic license key file in the /opt/hptc/etc/license directory (required).
   • Configure interconnect switch line monitoring cards (required).
   • Configure sendmail (required).
   • Customize the Nagios environment (required).
   Reference: “Task 6: Set Up the System Environment” (page 74)
Table A-1 Installation and Configuration Checklist (continued)

Verifying the System
32. Verify proper operation of LSF if you installed LSF.
   Reference: “Task 1: Verify the LSF Configuration” (page 111)
33. Verify proper operation of availability tools if you installed and configured an availability tool.
   Reference: “Task 2: Verify Availability Tools” (page 112)
34. Run the operation verification program (OVP).
B Host Name and Password Guidelines
This appendix contains guidelines for making informed decisions about information you are asked to supply during the installation and configuration process. It addresses the following topics:
• “Host Name Guidelines” (page 185)
• “Password Guidelines” (page 185)

B.1 Host Name Guidelines
Follow these guidelines when deciding on a host name:
• Host names can contain from 2 to 63 alphanumeric uppercase or lowercase characters (a-z, A-Z, 0-9).
When choosing a password, do not use any of the following: • Single words found in any dictionary in any language. • Personal information about you or your family or significant others such as first and last names, addresses, birth dates, telephone numbers, names of pets, and so on. • Any combination of single words in the dictionary and personal information. • An obvious sequence of numbers or letters, such as 789 or xyz.
C Enabling telnet on iLO and iLO2 Devices The procedure described in this appendix applies only to HP XC systems with nodes that use Integrated Lights Out (iLO or iLO2) as the console management device. New nodes that are managed with iLO or iLO2 console management connections that have never been installed with HP XC software might have iLO interfaces that have not been configured properly for HP XC operation.
2. Do one of the following:
   • If you cannot find an entry corresponding to the new node, check the network connections. Make repairs and rerun the discover command.
   • If you do find an entry corresponding to the new node, note the IP address on the line that begins with the string fixed-address, and proceed to step 3.
3. Open a web browser on the head node.
    option host-name "cp-n2";
    fixed-address 172.21.0.2;
    # location "Level 2 Switch 172.20.65.4, Port 2";
}
host cp-n3 {
    hardware ethernet 00:11:0a:30:b0:bc;
    option host-name "cp-n3";
    fixed-address 172.21.0.3;
    # location "Level 2 Switch 172.20.65.4, Port 3";
}
host cp-n4 {
    hardware ethernet 00:11:0a:2f:8d:fc;
    option host-name "cp-n4";
    fixed-address 172.21.0.4;
    # location "Level 2 Switch 172.20.65.4, Port 4";
}
D Configuring Interconnect Switch Monitoring Cards You must configure the Quadrics switch controller cards, the InfiniBand switch controller cards, and the Myrinet monitoring line cards on the system interconnect to diagnose and debug problems with the system interconnect.
Table D-1 Quadrics Switch Controller Card Naming Conventions and IP Addresses for Reduced Bandwidth (continued)

Number of Nodes: 1025 to 2048
Node-Level Switch Name: QR0N00 to QR0N31 (P); QR0N00_S to QR0N31_S (S)
Node-Level IP Address: 172.20.66.1 to 172.20.66.32 (P); 172.20.66.33 to 172.20.66.64 (S)
Top-Level Switch Name: QR0T00 to QR0T31; Secondary not applicable
Top-Level Switch IP Address: 172.20.66.65 to 172.20.66.96; Secondary not applicable

(P) represents the primary switch controller.
(S) represents the secondary switch controller.
1. Show network settings
2. Change network settings
3. Run jtest
4. Set module mode
5. Firmware upgrade
6. Quit
7. Reboot
8. Access Settings
9. Self Test
Enter 1-9 and press return:
3. Enter 2 to access the Change network settings menu option and set the switch to STATIC IP addresses:
   Quadrics Switch Control
   Select Protocol
   1. BOOTP
   2. STATIC
   3.
When the connection is established, use the Quadrics login password you set in step 5 to log in to the switch controller. D.2 Configure Myrinet Switch Monitoring Line Cards You can use the Myrinet switch monitoring line card to run diagnostic tools and to check for events on each port of the line card. Table D-3 provides the switch names and associated IP addresses you need during the configuration procedure.
# location "Level 2 Switch 172.20.65.4, Port 1"; } . . . host n3 { hardware ethernet 00:11:0a:ea:ea:41; option host-name "n3"; fixed-address 172.20.0.3; option xc-macaddress "00:11:0a:ea:ea:41"; # location "Level 2 Switch 172.20.65.3, Port 3"; } host MR0N00 { hardware ethernet your_MAC_address; option host-name "MR0N00"; fixed-address 172.20.66.1; } } } 5. 6. Save your changes and exit the text editor. Copy the contents of the /etc/dhcpd.
IMPORTANT: The IP address base differs if the hardware configuration contains HP server blades and enclosures, and you must use the IP addresses listed in Table D-5 instead of the addresses listed in Table D-4.

Table D-4 InfiniBand Switch Controller Card Naming Conventions and IP Addresses
Switch Order    Switch Name    IP Address
First switch    IR0N00         172.20.66.1
Second switch   IR0N01         172.20.66.2
Third switch    IR0N02         172.20.66.3
Last switch     IR0N0n         172.20.66.
11. Confirm the settings:
    ISR-9024(config-if-fast)# ip-address-fast show
12. Exit the interface fast mode:
    ISR-9024(config-if-fast)# exit
13. Access the route mode:
    ISR-9024(config)# route
14. Set the gateway IP address, which is the internal IP address of the head node:
    ISR-9024(config-route)# default-gw fast set IP_address
15. Confirm the gateway IP setting:
    ISR-9024(config-route)# default-gw show
16. Exit the route and configuration modes:
    ISR-9024(config-route)# exit
17.
E Customizing Client Node Disks
Use the information in this appendix to customize the disk partition layout on client node disk devices. It addresses the following topics:
• “Overview of Client Node Disk Imaging” (page 199)
• “Dynamically Configuring Client Node Disks” (page 199)
• “Statically Configuring Client Node Disks” (page 205)
E.
issues encountered when the golden client node disk configuration differs from the client disk configuration. You also have the flexibility to configure client node disks on a per-image and per-node basis and to create an optional scratch partition. Partition sizes can be fixed or can be based on a percentage of total disk size. You can set the appropriate variables in the /opt/hptc/systemimager/etc/make_partitions.sh file or in user-defined files with a .part extension.
E.2.2 Example 1: Modifying Partitions Using Fixed Sizes and Defining an Additional Partition This example applies fixed sizes to modify the default partition sizes on all compute nodes and creates an additional /scratch partition on each compute node. The user-defined .part files allow partition modifications to be done on a per-image or per-node basis. 1. Use the text editor of your choice to create the following file to define the partition format for compute nodes: /var/lib/systemimager/scripts/compute.
7. Do one of the following to install and image the client nodes: • If the client nodes were not previously installed with the HP XC System Software, see “Task 12: Run the startsys Utility to Start the System and Propagate the Golden Image” (page 100) to continue the initial installation procedure.
8. Run the cluster_config utility, choosing the default answers, to create a new master autoinstallation script (/var/lib/systemimager/scripts/base_image.master.0) and generate an updated version of the golden image: # /opt/hptc/config/sbin/cluster_config 9. After the cluster_config utility completes its processing, the client nodes are ready to be installed.
# shownode servers lvs n[135-136]
9. Create a symbolic link from the node names of the login nodes to the newly created master autoinstallation script. Note that the node name is appended with a .sh extension:
   for i in n135 n136
   do
       ln -sf login.master.0 $i.sh
   done
10.
NOTE: With software RAID, the Linux boot loader requires the /boot partition to be mirrored (RAID1). In addition, the swap partitions are not raided (with striping) because the operating system stripes them automatically. 6.
E.3.1 Enable Static Disk Configuration
The dynamic disk configuration method is the default behavior for the HP XC imaging environment. Before you can use the static disk configuration method to customize the client disk configuration, you must enable it. You cannot use a combination of the two methods simultaneously. Follow this procedure to enable static disk configuration:
1. Use the text editor of your choice to open the following file:
   /etc/systemimager/systemimager.conf
2.
# setnode --resync --all
# stopsys
# startsys --image_and_boot
Wait until the stopsys command completes before invoking the startsys command.

E.3.2.2 Example 2: Creating a New .conf File and Associated Master Autoinstallation Script
If necessary, you can create your own master autoinstallation script with static disk configuration included from a customized .conf file by following the procedure shown here.
F Description of Node Roles, Services, and the Default Configuration
This appendix addresses the following topics:
• “Default Node Role Assignments” (page 209)
• “Special Considerations for Modifying Default Node Role Assignments” (page 209)
• “Role Definitions” (page 210)

F.1 Default Node Role Assignments
Table F-1 lists the default role assignments. The default assignments are based on the number of total nodes in the system.
and LSF controller daemons run on that node, and no fail over of these components is possible. HP recommends that you configure at least two nodes with the resource_management role to distribute the work of these components and provide a failover configuration.
You can define multiple roles on any node. The head node, in particular, can have all of these roles if you are setting up a small cluster. If you need more information about services and node roles, see the HP XC System Software Administration Guide. F.3.1 Availability Role The availability role is automatically assigned to all nodes that are members of availability sets. You cannot assign this role to any node.
Any node with this role may be called upon to execute jobs scheduled by the users. Nodes with this role are also often called compute nodes. It is your responsibility to remove this role on a node if it is not wanted or required. To enable monitoring of the nodes, run the Nagios remote plug-in execution agent on nodes with the compute role. F.3.5 Console_network Role The console_network role enables you to map HP XC services with physical console requirements to the appropriate nodes.
The login role supplies a node with the LVS director service, which handles the placement of user login sessions on login nodes when a user logs in to the cluster alias. To achieve improved availability of the LVS director service, you must assign the login role to three nodes. See the role assignment guidelines listed in Table 1-2 (page 32) for more information. The configuration and management database name for the service supplied by this role is lvs. F.3.
• Dynamic Host Configuration Protocol (DHCP) server (dhcp)
• HPTC file system server (hptc_cluster_fs)
• License manager (hptc-lm)
• InfiniBand switch monitor (ibmon)
• Image server (imageserver)
• HP MPI interconnect setup (mpiic)
• Myrinet switch monitor (gmmon)
• NFS server (nfs_server)
• Network time protocol (ntp)
• Power daemon server (pwrmgmtserver)
• Quadrics switch monitor (swmlogger)
The configuration and management database, in which all management configuration information is stored, runs on a node wi
G Using the cluster_config Command-Line Menu
This appendix describes how to use the configuration command-line menu that is displayed by the cluster_config utility. It addresses the following topics:
• “Overview of the cluster_config Command-Line Menu” (page 215)
• “Displaying Node Configuration Information” (page 215)
• “Modifying a Node” (page 216)
• “Analyzing Current Role Assignments Against HP Recommendations” (page 218)
• “Customize Service and Client Configurations” (page 219)
G.
G.3 Modifying a Node From the command-line menu of the cluster_config utility, enter the letter m to modify node role assignments and Ethernet connections: [L]ist Nodes, [M]odify Nodes, [A]nalyze, [H]elp, [P]roceed, [Q]uit: m You are prompted to supply the node name of the node you want to modify. All operations you perform from this point are performed on this node until you specify a different node name.
5. After you have added the Ethernet connections, you have the option to do the following:
   • Enter the letter e to add an Ethernet connection on another node.
   • Enter the letter d to remove an Ethernet connection.
   • Enter the letter b to return to the previous menu.

G.5 Modifying Node Role Assignments
The cluster configuration menu enables you to assign roles to specific nodes. “Role Definitions” (page 210) provides definitions of all node roles and the services they provide.
G.6 Analyzing Current Role Assignments Against HP Recommendations From the command-line menu of the cluster_config utility, enter the letter a to analyze current node role assignments with those recommended by HP. [L]ist Nodes, [M]odify Nodes, [A]nalyze, [H]elp, [P]roceed, [Q]uit: a HP recommends using this option any time you run the cluster_config utility, even when you are configuring the system for the first time.
Role Rec: Role Recommended
HN Req: Head Node Required
HN Rec: Head Node Recommended
Exc Rec: Exclusivity Recommended
Ext Req: External Connection Required
Ext Rec: External Connection Recommended

Table G-3 provides an explanation of the analysis.

Table G-3 Second Portion of cluster_config Analysis Option

Column Heading: Recommend
Description: Displays the number of nodes recommended for a particular role based on the number of nodes in the system.
Do one of the following: • Enter the letter s to perform customized services configuration on the nodes in the system. This option is intended for experienced HP XC administrators who want to customize service servers and clients. Intervention like this is typically not required for HP XC systems. See “Services Configuration Commands” (page 220) for information about each services configuration command. • Enter the letter p to continue with the system configuration process.
Creating and Adding Node Attributes
Using the previous two examples, enter the following commands to create and add node attributes:
svcs> create na_disable_server.cmf
Attribute "na_disable_server.cmf" created
svcs> create na_disable_client.supermond
Attribute "na_disable_client.supermond" created
svcs> add na_disable_server.cmf n3
Attribute "na_disable_server.cmf" added to n3
svcs> add na_disable_client.supermond n1
Attribute "na_disable_client.
Table G-4 Service Configuration Command Descriptions (continued)

Command: [a]dd attribute_name node|node_list
Description: Adds a node attribute to a specific node or node list. The attribute must have been created previously. Node lists can take two forms: explicit (such as n1, n2, n3, n5) or condensed (such as n[1–3,5]). In the node list examples, the node prefix is the letter n. Replace n with the node-naming prefix you chose for your nodes.
Sample use:
svcs> add na_disable_client.
H Determining the Network Type The information in this appendix applies only to cluster platforms with a QsNetII interconnect. During the processing of the cluster_config utility, the swmlogger gconfig script prompts you to supply the network type of the system. The network type reflects the maximum number of ports the switch can support, and the network type is used to create the qsnet diagnostics database.
I LSF and SLURM Environment Variables This appendix lists the default values for LSF and SLURM environment variables that were set during the HP XC System Software installation process. For more information about setting LSF environment variables and parameters listed in Table I-1, see the Platform LSF Reference manual.
Table I-1 Default Installation Values for LSF and SLURM (continued)

Environment Variable: XC_LIBLIC
Default Value: /opt/hptc/lib/libsyslic.so
Description: Location of the HP XC OEM license module.
Where is this Value Stored? lsf.conf file

Environment Variable: LSF_NON_PRIVILEGED_PORTS
Default Value: Y
Description: When LSF commands are run by the root user, this variable configures them to communicate using non-privileged ports (> 1024).
Where is this Value Stored? lsf.conf file

Environment Variable: LSB_RLA_UPDATE
Default Value: 120
Description: Controls how often LSF synchronizes with SLURM.
Table I-1 Default Installation Values for LSF and SLURM (continued)

Environment Variable: RootOnly
Default Value: YES
Description: Specifies that only the root user or the SLURM administrator can create allocations for normal user jobs.
Where is this Value Stored? slurm.conf file, One LSF partition

Environment Variable: Shared
Default Value: FORCE
Description: Specifies that more than one job can run on the same node. LSF uses this facility to support preemption and scheduling of multiple serial jobs on the same node.
Where is this Value Stored? slurm.conf file, One LSF partition
J Customizing the SLURM Configuration
This appendix describes customizations you can make to the /hptc_cluster/slurm/etc/slurm.conf SLURM configuration file. It addresses the following topics:
• “Assigning Features” (page 229)
• “Creating Additional SLURM Partitions” (page 229)
• “Required Customizations for SVA” (page 229)

J.1 Assigning Features
Assigning features to nodes is common if the compute resources of the cluster are not consistent.
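The remainder of this section is not reproduced here. As a hedged sketch only: features are typically attached to node definitions in slurm.conf, as in the hypothetical lines below. The node names and feature labels are placeholders, and the exact keyword (Feature or Features) depends on the SLURM release, so confirm it against the slurm.conf reference for your system:
# Hypothetical node definitions tagging inconsistent compute resources
NodeName=n[1-8] Procs=2 Feature=bigmem
NodeName=n[9-16] Procs=2 Feature=standard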
SVA with LSF-HPC with SLURM
If you installed LSF, you must create two SLURM partitions: one partition for visualization jobs and one partition for LSF jobs. A node can be present in one partition only. The following procedure provides an example of a cluster that has five nodes: node 5 is the head node, nodes 1 and 2 are visualization nodes, and nodes 3 and 4 are compute nodes. Using that example, you must modify the slurm.conf file to create two partitions (a sketch of the resulting partition definitions follows these steps):
1.
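The numbered steps themselves are not reproduced here. As a hedged sketch of the end result (the partition names, node membership, and head node placement are assumptions, not the documented values), the two partition definitions in slurm.conf might resemble the following, reusing the RootOnly and Shared attributes listed in Table I-1 for the LSF partition:
# Hypothetical partition layout for the five-node example
PartitionName=lsf Nodes=n[3-4] RootOnly=YES Shared=FORCE
PartitionName=visualization Nodes=n[1-2]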
K OVP Command Output This appendix provides command output from the OVP utility, which verifies successful installation and configuration of software and hardware components. # ovp --verbose XC CLUSTER VERIFICATION PROCEDURE Fri Jul 06 08:03:03 2007 Verify connectivity: Testing etc_hosts_integrity ... There are 47 IP addresses to ping. A total of 47 addresses were pinged. Test completed successfully. All IP addresses were reachable. +++ PASSED +++ Verify client_nodes: Testing network_boot ...
Running verify_server_status Starting the command: /opt/hptc/sbin/lmstat Here is the output from the command: lmstat - Copyright (c) 1989-2004 by Macrovision Corporation. All right s reserved. Flexible License Manager status on Fri 7/06/2007 08:03 License server status: 27000@ n16 License file(s) on n16: /opt/hptc/etc/license/XC.lic: n16: license server UP (MASTER) v9.2 Vendor daemon status (on n16): Compaq: UP v9.2 Checking output from command. +++ PASSED +++ Verify SLURM: Testing spconfig ...
Here is the output from the command: n[3-16] 14 lsf idle Checking for non-idle node states. +++ PASSED +++ Verify LSF: Testing identification ... Starting the command: /opt/hptc/lsf/top/6.2/linux2.6-glibc2.3-ia64-slurm/bin/lsid Here is the output from the command: Platform LSF HPC 6.2 for SLURM, May 10 2006 Copyright 1992-2005 Platform Computing Corporation My cluster name is hptclsf My master name is lsfhost.localdomain Checking output from command. +++ PASSED +++ Testing hosts_static_resource_info ...
Nagios 2.3.1 Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org) Last Modified: 05-15-2006 License: GPL Reading configuration data... Running pre-flight check on configuration data... Checking services... Checked 158 services. Checking hosts... Checked 16 hosts. Checking host groups... Checked 13 host groups. Checking service groups... Checked 0 service groups. Checking contacts... Checked 1 contacts. Checking contact groups... Checked 1 contact groups. Checking service escalations...
Syslog Syslog System System System Alert Monitor Alerts Event Log Event Log Monitor Free Space Totals: 1-Ok 14-Ok 14-Ok 1-Ok 14-Ok 0-Warn 0-Warn 0-Warn 0-Warn 0-Warn 0-Crit 0-Crit 0-Crit 0-Crit 0-Crit 0-Pend 0-Pend 0-Pend 0-Pend 0-Pend 0-Unk 0-Unk 0-Unk 0-Unk 0-Unk 157-Ok 0-Warn 0-Crit 0-Pend 0-Unk +++ PASSED +++ Verify xring: Testing ... Send 200 messages with a size of 1024 bytes to 14 hosts.
Waiting for dispatch ... Starting on lsfhost.localdomain Detailed streams results for each node can be found in /hptc_cluster/ovp/ovp_ n16_032807.tests/tests/100.perf_health/40.memory if the --keep flag was specified. Streams memory results summary (all values in mBytes/sec): min: 596.868100 max: 1054.011600 median: 1009.190000 mean: 937.954567 range: 457.143500 variance: 18632.672317 std_dev: 136.501547 All nodes were in range for this test. +++ PASSED +++ Testing network_stress ...
All nodes were in range for this test. +++ PASSED +++ Testing network_unidirectional ... Number of nodes allocated for this test is 14 Job 114 is submitted to default queue interactive . Waiting for dispatch ... Starting on lsfhost.localdomain [0: n3:1] ping-pong 7718.08 usec/msg 518.26 MB/sec [1: n4:2] ping-pong 7613.24 usec/msg 525.40 MB/sec [2: n5:3] ping-pong 7609.81 usec/msg 525.64 MB/sec [3: n6:4] ping-pong 7529.98 usec/msg 531.21 MB/sec [4: n7:5] ping-pong 7453.28 usec/msg 536.
L upgraderpms Command Output This appendix provides command output from the upgraderpms command, which is run during a software upgrade. Command output is similar to the following: Use the upgraderpms utility only if you are performing a minor upgrade to install the new HP XC release on your system. Before running the upgraderpms utility, you must mount the new XC release DVD on the /mnt/cdrom directory and then use the cd command to go to that directory.
---> Downloading header for glibc-devel to pack into transaction set. ---> Package glibc-devel.ia64 0:2.3.4-2.19 set to be updated ---> Downloading header for slurm-devel to pack into transaction set. ---> Package slurm-devel.ia64 0:1.0.15-1hp set to be updated ---> Downloading header for qsswm to pack into transaction set. ---> Package qsswm.ia64 0:2.1.1-1.2hptc set to be updated ---> Downloading header for openssl to pack into transaction set. ---> Package openssl.ia64 0:0.9.7a-43.
---> Downloading header for slurm-switch-elan to pack into transaction set. ---> Package slurm-switch-elan.ia64 0:1.0.15-1hp set to be updated ---> Downloading header for pam-devel to pack into transaction set. ---> Package pam-devel.ia64 0:0.77-66.14 set to be updated ---> Downloading header for ipvsadm to pack into transaction set. ---> Package ipvsadm.ia64 0:1.24-6 set to be updated ---> Downloading header for shadow-utils to pack into transaction set. ---> Package shadow-utils.ia64 2:4.0.3-60.
---> Package krb5-libs.ia64 0:1.3.4-27 set to be updated ---> Downloading header for slurm to pack into transaction set. ---> Package slurm.ia64 0:1.0.15-1hp set to be updated ---> Downloading header for hptc-nconfig to pack into transaction set. ---> Package hptc-nconfig.noarch 0:1.0-68 set to be updated ---> Downloading header for OpenIPMI-libs to pack into transaction set. ---> Package OpenIPMI-libs.ia64 0:1.4.14-1.4E.
---> Downloading header for newt-devel to pack into transaction set. ---> Package newt-devel.ia64 0:0.51.6-7.rhel4 set to be updated ---> Downloading header for hptc-qsnet2-diag to pack into transaction set. ---> Package hptc-qsnet2-diag.noarch 0:1-19 set to be updated ---> Downloading header for ypserv to pack into transaction set. ---> Package ypserv.ia64 0:2.13-9.1hptc set to be updated ---> Downloading header for keyutils to pack into transaction set. ---> Package keyutils.ia64 0:1.
--> Processing Dependency: libkeyutils.so.1(KEYUTILS_1.0)(64bit) for package: keyutils --> Restarting Dependency Resolution with new changes. --> Populating transaction set with selected packages. Please wait. ---> Downloading header for hptc-supermon-modules-source to pack into transaction set. ---> Package hptc-supermon-modules-source.ia64 0:2-0.18 set to be updated ---> Downloading header for keyutils-libs to pack into transaction set. ---> Package keyutils-libs.ia64 0:1.
hptc-power noarch 3.0-32 hpcrpms hptc-qsnet2-diag noarch 1-19 hpcrpms hptc-slurm noarch 1.0-3hp hpcrpms hptc-supermon ia64 2-0.18 hpcrpms hptc-supermon-config noarch 1-30 hpcrpms hptc-supermon-modules ia64 2-6.k2.6.9_34.7hp.XCsmp hpcrpms 117 k hptc-syslogng noarch 1-24 hpcrpms hptc-sysman noarch 1-1.71 hpcrpms hptc-sysmandb noarch 1-44 hpcrpms hptc_ovp ia64 1.15-21 hpcrpms hptc_release noarch 1.0-15 hpcrpms hwdata noarch 0.146.18.EL-1 linuxrpms ia32el ia64 1.2-5 linuxrpms ibhost-biz ia64 3.5.5_21-2hptc.k2.
selinux-policy-targeted noarch 1.17.30-2.126 linuxrpms 119 k shadow-utils warning: /etc/localtime created as /etc/localtime.rpmnew Stopping sshd:[ OK ] Starting sshd:[ OK ] warning: /etc/security/limits.conf created as /etc/security/limits.conf.rpmnew /opt/hptc/lib warning: /etc/systemimager/autoinstallscript.template saved as /etc/systemimager/autoinstallscript.template.rpmsave warning: /etc/localtime created as /etc/localtime.rpmnew warning: /etc/nsswitch.conf created as /etc/nsswitch.conf.
0:2.6.9-34.7hp.XC Dependency Installed: Tk.ia64 0:804.027-1hp audit.ia64 0:1.0.12-1.EL4 hptc-supermon-modules-source.ia64 0:2-0.18 keyutils-libs.ia64 0:1.0-2 modules.ia64 0:3.1.6-4hptc Updated: IO-Socket-SSL.ia64 0:0.96-98 MAKEDEV.ia64 0:3.15.2-3 OpenIPMI.ia64 0:1.4.14-1.4E.12 OpenIPMI-libs.ia64 0:1.4.14-1.4E.12 autofs.ia64 1:4.1.3-169 binutils.ia64 0:2.15.92.0.2-18 bzip2.ia64 0:1.0.2-13.EL4.3 bzip2-devel.ia64 0:1.0.2-13.EL4.3 bzip2-libs.ia64 0:1.0.2-13.EL4.3 chkconfig.ia64 0:1.3.13.3-2 cpp.ia64 0:3.4.
Setting up repositories Reading repository metadata in from local files Parsing package install arguments Resolving Dependencies --> Populating transaction set with selected packages. Please wait. ---> Downloading header for iptables-ipv6 to pack into transaction set. ---> Package iptables-ipv6.ia64 0:1.2.11-3.1.
Text-DHCPparse ia64 collectl-utils noarch hptc-avail noarch hptc-ibmon noarch hptc-mcs noarch hptc-mdadm noarch hptc-smartd noarch hptc-snmptrapd noarch rrdtool ia64 xcgraph noarch Installing for dependencies: cgilib ia64 net-snmp-perl ia64 0.07-2hp 1.3.10-1 1.0-1.19 1-3 1-6 1-1 1-1 1-8 1.2.15-0.2hp 0.1-11 hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms hpcrpms 9.4 482 19 8.9 46 3.2 4.0 120 1.4 23 k k k k k k k k M k 0.5-0.2hp 5.1.2-11.EL4.
Glossary A administration branch The half (branch) of the administration network that contains all of the general-purpose administration ports to the nodes of the HP XC system. administration network The private network within the HP XC system that is used for administrative operations. availability set An association of two individual nodes so that one node acts as the first server and the other node acts as the second server of a service. See also improved availability, availability tool.
operating system and its loader. Together, these provide a standard environment for booting an operating system and running preboot applications. enclosure The hardware and software infrastructure that houses HP BladeSystem servers. extensible firmware interface See EFI. external network node A node that is connected to a network external to the HP XC system. F fairshare An LSF job-scheduling policy that specifies how resources should be shared by competing users.
image server A node specifically designated to hold images that will be distributed to one or more client systems. In a standard HP XC installation, the head node acts as the image server and golden client. improved availability A service availability infrastructure that is built into the HP XC system software to enable an availability tool to fail over a subset of eligible services to nodes that have been designated as a second server of the service See also availability set, availability tool.
LVS Linux Virtual Server. Provides a centralized login capability for system users. LVS handles incoming login requests and directs them to a node with a login role. M Management Processor See MP. master host See LSF master host. MCS An optional integrated system that uses chilled water technology to triple the standard cooling capacity of a single rack. This system helps take the heat out of high-density deployments of servers and blades, enabling greater densities in data centers.
onboard administrator See OA. P parallel application An application that uses a distributed programming model and can run on multiple processors. An HP XC MPI application is a parallel application. That is, all interprocessor communication within an HP XC parallel application is performed through calls to the MPI message passing library. PXE Preboot Execution Environment.
an HP XC system, the use of SMP technology increases the number of CPUs (amount of computational power) available per unit of space. ssh Secure Shell. A shell program for logging in to and executing commands on a remote computer. It can provide secure encrypted communications between two untrusted hosts over an insecure network. standard LSF A workload manager for any kind of batch job.
Index A adduser command, 82 administration network activating, 70 testing, 113 using as interconnect network, 67 administrator password ProCurve switch, 58 Apache self-signed certificate, 61 configuring, 94 avail_node_management role, 211 availability role, 211 availability set choosing nodes as members, 31 configuring with cluster_config, 86 defined, 28 availability tool, 28 Heartbeat, 31 Serviceguard, 30 starting, 105 verifying operation, 112 B back up CMDB, 116 CMDB before cluster_config, 86 SFS server,
dense node names, 54 development environment tools, 80 disabled node inserted in database, 168 discover command, 66 command line options, 67 enclosures, 72 flowchart, 165 HP ProLiant DL140, 68 HP ProLiant DL145 G2, 68 HP ProLiant DL145 G3, 68 information required by, 57 no node found, 167 no switch found, 69 nodes, 72 - -oldmp option, 59 switches, 70 troubleshooting, 165 disk configuration file, 205 disk partition default file system layout, 39 default sizes on installation disk, 39 maximum size of, 39 on c
defined, 37 HP ProLiant DL140 discovering, 68 HP ProLiant DL145 G2 discovering, 68 HP ProLiant DL145 G3 discovering, 68 HP Remote Graphics Software (see RGS) HP Scalable Visualization Array (see SVA) HP Serviceguard (see Serviceguard) HP StorageWorks Scalable File Share (see SFS) HP XC Support team, 141 HP XC system software defined, 37 determining installed version, 120 installation process, 37 software stack, 37 hpasm, 211 hptc-ire-serverlog service, 178 /hptc_cluster failure to mount, 171 /hptc_cluster f
K kernel dependent modules, 66 kernel modules rebuilding, 66 Kickstart file, 37, 39 Kickstart installation (see installation) ks.cfg file, 39 L license HP XC system software, 27 location of license key file, 75 management, 28 troubleshooting, 172 XC.
improved availability, 33 service assignment, 213 user account, 82 verifying system health, 115 warnings from OVP, 175 naming conventions, 18 NAT configuration, 60 NAT service, 212 configuring, 93 netboot failure, 171 network connectivity testing, 111, 113 network mask, 56 network type, 59, 223 configuring, 92 NFS daemon, 59 configuring, 91 NFS server service, 212 NIS configuration, 61 NIS slave server configuring, 94 no node found, 167 no switch found, 69 node discovering, 72 imaging problems, 55 in down s
logging in to line monitoring card, 193 network type, 223 switch controller card, 191 QuickSpecs, 80 quorum configuring a lock LUN, 79 configuring a quorum server, 79 quorum server configuring, 79 R real enclosure defined, 54 real server defined, 60 Red Hat Enterprise Linux installing HP XC system software on, 137 troubleshooting, 179 Red Hat network drivers troubleshooting, 179 reinstall software, 133 release version, 120 remote graphics software RGS, 79 Virtual GL, 80 reporting documentation errors feedb
software installation (see installation) software patches, 65 software RAID, 81 documentation, 24 enabling on client nodes, 81 mdadm utility, 24 mirroring, 81 striping, 81 software stack, 37 software upgrade (see upgrade) software version, 120 sparse node numbering, 54 spconfig utility, 107 ssh configuring on InfiniBand switch, 104 ssh key, 59 standard LSF configuring failover, 34 defined, 37 features, 96 installing, 95 verify configuration, 111 startsys command, 100 STREAMS, 114 striping, 81 supermon defin
V /var file system, 39 /var/lib/systemimager/images/base_image, 98 /var/log/nconfig.log file, 55 /var/log/postinstall.log file, 37 virtual console and media, 46 virtual enclosure defined, 54 Virtual GL, 80 W website HP software patches, 65 HP XC System Software documentation, 19 ITRC, 65 workstation nodes, 58 changing database name, 81 X XC software version, 120 XC.lic file, 27 xc_support@hp.