HP XC System Software Version 3.1 Release Notes
© Copyright 2006, 2007, 2008 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
About This Document
This document contains release notes for HP XC System Software Version 3.1. It contains important information about firmware, software, or hardware that may affect the system. An HP XC system is integrated with several open source software components. Some open source software components are used for underlying technology, and their deployment is transparent.
Variable: The name of a placeholder in a command, function, or other syntax display that you replace with an actual value.
[ ]: The contents are optional in syntax. If the contents are a list separated by |, you can choose one of the items.
{ }: The contents are required in syntax. If the contents are a list separated by |, you must choose one of the items.
...: The preceding element can be repeated an arbitrary number of times.
|: Separates items in a list of choices.
WARNING: Calls attention to important information that, if not understood or followed, can result in personal injury.
CAUTION: Calls attention to important information that, if not understood or followed, can result in data loss, data corruption, or damage to hardware or software.
IMPORTANT: Calls attention to essential information.
NOTE: Contains additional or supplementary information.
See the following sources for information about related HP products.
HP XC Program Development Environment
The Program Development Environment home page provides pointers to tools that have been tested in the HP XC program development environment (for example, TotalView® and other debuggers, compilers, and so on).
http://h20311.www2.hp.com/HPC/cache/276321-0-0-0-121.
Standard LSF is also available as an alternative resource management system (instead of LSF-HPC with SLURM) for HP XC. This is the version of LSF that is widely discussed on the Platform Web site.
• http://linuxvirtualserver.org Home page for the Linux Virtual Server (LVS), the load balancer running on the Linux operating system that distributes login requests on the HP XC system. • http://www.macrovision.com Home page for Macrovision®, developer of the FLEXlm™ license management utility, which is used for HP XC license management. • http://sourceforge.
Compiler Web Sites • http://www.intel.com/software/products/compilers/index.htm Web site for Intel® compilers. • http://support.intel.com/support/performancetools/ Web site for general Intel software development information. • http://www.pgroup.com/ Home page for The Portland Group™, supplier of the PGI® compiler. Debugger Web Site http://www.etnus.com Home page for Etnus, Inc., maker of the TotalView® parallel debugger. Software RAID Web Sites • http://www.tldp.org/HOWTO/Software-RAID-HOWTO.
HP Encourages Your Comments HP encourages comments concerning this document. We are committed to providing documentation that meets your needs. Send any errors found, suggestions for improvement, or compliments to: feedback@fc.hp.com Include the document title, manufacturing part number, and any comment, error found, or suggestion for improvement you have concerning this document.
1 New and Changed Features
This chapter describes the new and changed features delivered in HP XC System Software Version 3.1.
1.1 Base Distribution and Kernel
The following lists the base distribution and kernel for this release as compared to the last HP XC release.

HP XC Version 3.1:
• Enterprise Linux 4 Update 3
• HP XC kernel version 2.6.9-34.7hp.XC, based on Red Hat kernel version 2.6.9-34.0.2.EL

HP XC Version 3.0:
• Enterprise Linux 4 Update 2
• Based on Red Hat kernel version 2.6.
infrastructure allows you the flexibility to select and use the availability tool you prefer to manage the services. In this release, HP Serviceguard is the recommended availability tool, and it must be purchased and licensed separately from HP. In general, availability tools monitor nodes, services and resources, and restart the services and resources on another node, as necessary.
1.6 Support for IPv6 Addresses Support for IPv6 addresses has been added in this release. When you invoke the cluster_prep command, you can specify an IPv6 address for the head node's Ethernet connection to the external network. Specifying this address is optional and is intended for sites that use IPv6 addresses for the rest of the network. Similarly, when you use the cluster_config utility to add an external Ethernet connection to any node, you have the option to specify an IPv6 address. 1.
In an enclosure-based system, the discover command uses a sparse node numbering scheme. This means that internal node names are assigned based on the enclosure in which the node is located and the slot the node is plugged into. For example, if a node is plugged into slot 10 of enclosure 1, the node is numbered n10. In a configuration with two enclosures in which there might be 16 nodes in each enclosure, the node in slot 10 in enclosure 2 is numbered n27.
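The following is a minimal sketch, not an HP XC tool, that illustrates this sparse numbering scheme. It assumes a fixed stride of 17 node numbers per enclosure, which is inferred from the n10/n27 example above rather than taken from the product documentation.

# Hypothetical illustration of sparse node numbering.
# The stride of 17 numbers per enclosure is an assumption inferred from
# the example above (enclosure 1/slot 10 -> n10, enclosure 2/slot 10 -> n27).
enclosure=2
slot=10
node_number=$(( (enclosure - 1) * 17 + slot ))
echo "n${node_number}"    # prints n27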
Table 1-1 Upgrade Types

Major: You perform a major upgrade on systems that are installed with an HP XC release that is based on an older version of Enterprise Linux (for example, Enterprise Linux 3 [EL3]) and the new HP XC release is based on the next Enterprise Linux release version (for example, EL4).

Minor: You perform a minor upgrade when the new HP XC release is based on the same version of the Enterprise Linux (EL) operating system that is currently installed on the system.
• The RRDtool software is integrated into the HP XC system to create and display graphs about network bandwidth and other system utilization. You can access this display by selecting HP Graph in the Nagios menu.
• NRG provides status summaries using the information collected by Nagios. A non-graphical user interface provides status from the command line. You can use NRG on management_server nodes for complete system status or on management_hub nodes to report individual hub status.
The xcxclus utility is a graphical utility that enables you to monitor multiple nodes simultaneously. The utility displays an array of icons.
2 Important Release Information This chapter contains information that is important to know for this release. 2.1 Firmware Versions The HP XC System Software is tested against specific minimum firmware versions. Follow the instructions in the accompanying HP Cluster Platform documents to ensure that all hardware components are installed with the latest firmware version. The master firmware tables for this release are available at the following Web site: http://www.docs.hp.com/en/highperfcomp.
3 Hardware Preparation Hardware preparation tasks are documented in the HP XC Hardware Preparation Guide. This chapter contains information that was not included in that document at the time of publication. 3.1 Preparing HP Server Blades and Enclosures If the hardware configuration contains HP server blades and enclosures, download and print the HP XC Systems With HP Server Blades and Enclosures HowTo from the following Web site: http://www.docs.hp.com/en/highperfcomp.
3.4 Preparing HP ProLiant DL385 G2 Nodes On HP ProLiant DL385 G2 servers, use the following tools to configure the appropriate settings for an HP XC system: • Integrated Lights Out (iLO) Setup Utility • ROM-Based Setup Utility (RBSU) Perform the following procedure from the iLO Setup Utility for each HP ProLiant DL385 G2 server in the hardware configuration: 1. Use the instructions in the accompanying hardware documentation to connect a monitor, mouse, and keyboard to the node. 2. Turn on power to the node.
Table 3-2 RBSU Settings for HP ProLiant DL385 G2 Nodes (continued)

Standard Boot Order (IPL): Set the following boot order on all nodes except the head node:
• IPL:1 CD-ROM
• IPL:2 Floppy Drive (A:)
• IPL:3 USB Drive Key (C:)
• IPL:4 PCI Embedded HP NC373i Multifunction Gigabit Adapter
• IPL:5 Hard Drive C:

Standard Boot Order (IPL): Set the following boot order on the head node:
• IPL:1 CD-ROM
• IPL:2 Floppy Drive (A:)
• IPL:3 USB Drive Key (C:)
• IPL:4 Hard Drive C:

BIOS Serial Console and
Table 3-3 iLO Settings for HP ProLiant DL585 G2 Nodes

Menu Name: User
SubMenu Name: Add
Set To This Value: Create a common iLO user name and password for every node in the hardware configuration. The password must have a minimum of 8 characters by default, but this value is configurable. The user Administrator is predefined by default, but you must create your own user name and password. For security purposes, HP recommends that you delete the Administrator user.
Table 3-4 RBSU Settings for HP ProLiant DL585 G2 Nodes (continued)

Menu Name: Advanced
• BIOS Interface Mode: Command Line
• Linux x86_64 HPET Option: Disabled

2. Press the Esc key to exit the RBSU.
3. Press the F10 key to confirm your choice and restart the boot sequence.

Repeat this procedure for every HP ProLiant DL585 G2 node in the hardware configuration.
d. Press the Enter key to access the MP. If there is no response, press the MP reset pin on the back of the MP and try again.
e. Log in to the MP using the default user name and password shown on the screen.

The following MP Main Menu is displayed:

MP MAIN MENU:
CO: Console
VFP: Virtual Front Panel
CM: Command Menu
SMCLP: Server Management Command Line Protocol
CL: Console Log
SL: Show Event Logs
HE: Main Help Menu
X: Exit Connection

4. Enter SL to show event logs.
14. Press the Esc key or enter x as many times as necessary to return to the Boot Menu.
15. Turn off power to the node:
a. Press Ctrl+b to exit the console mode.
b. Enter CM to display the Command Menu.
c. Enter PC and enter off to turn off power to the node.

3.7 Minimum Firmware Versions for InfiniBand HCA
Refer to the Master Firmware List for the latest firmware version information, and refer to the Cluster Platform InfiniBand documentation for proper firmware upgrade procedures.
4 Head Node Installation This chapter contains notes that apply to the HP XC System Software Kickstart installation session. 4.1 Notes to Read Before the Kickstart Installation Session Read the notes in this section before starting the Kickstart installation session. 4.1.1 Additional Kickstart Boot Command Options For Some Hardware Models Some hardware models require additional options to be included on the boot command line.
5 System Configuration This chapter contains information about configuring the system. Notes that describe additional configuration tasks are mandatory and have been organized chronologically. Perform these tasks in the sequence presented in this chapter. The HP XC system configuration procedure is documented in the HP XC System Software Installation Guide. 5.1 Notes That Apply Before You Invoke the cluster_prep Utility Read the notes in this section before you invoke the cluster_prep utility. 5.1.
4. Use the text editor of your choice to edit the /etc/sysconfig/network-scripts/ifcfg-eth[0,1,2,3] files, and remove the HWADDR line from each file if it is present.
5. If you made changes, save your changes and exit each file.
6. Reload the modules:
# modprobe tg3
# modprobe e1000
7. Follow the instructions in the HP XC System Software Installation Guide to complete the cluster configuration process (beginning with the cluster_prep command).
During normal operation of a ProLiant DL145 G2 node, the console port may lose connectivity to the internal Administration Network for periods of tens of seconds to several minutes. This loss of connection causes a disruption in node management and monitoring. The connection loss is characterized by "no route to host" messages when using the ping command to contact the node.
#!/bin/bash
# Rebuild every initrd image under /boot so that it includes the aacraid driver.
FILES=`find /boot -name initrd\* -print`
for i in $FILES
do
    # Extract the kernel version string from the initrd file name.
    VERSION=`echo $i | sed -e 's/^.*initrd-\(.*\).img/\1/'`
    if [ ! -z "$VERSION" ]
    then
        NEWF=/tmp/initrd-$VERSION.img
        echo "/sbin/mkinitrd --with=aacraid -f $NEWF $VERSION"
        /sbin/mkinitrd --with=aacraid -f $NEWF $VERSION
        # Replace the original initrd only if the new image was created.
        if [ -f $NEWF ]
        then
            echo "replacing $i with $NEWF"
            rm $i
            mv $NEWF $i
        fi
    fi
done

5.5.4 Change the InfiniBand Switch Root Password
The InfiniBand switches run Linux and have a root password.
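Because each switch runs Linux, the root password can typically be changed with the standard passwd command after logging in to the switch as root. The following is a minimal sketch only; the switch name is a placeholder, and the login method for your switch model may differ.

# Example only; "ib-switch-1" is a hypothetical switch host name or IP address.
ssh root@ib-switch-1
passwd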
To work around this issue, use the text editor of your choice to edit the /etc/hosts file and remove the trailing spaces on the IRxNxx lines.

5.6 Benign Messages Seen During cluster_config Processing
This note applies only if the hardware configuration contains HP server blades and enclosures. During cluster_config processing, if you see any messages stating "Warning - cannot map [node] to switch or enclosure" where node is the head node (typically node 0), you can safely ignore the message.
Use the procedure in the HP XC System Software Administration Guide, which describes how to use file overrides to the golden image, to edit the /etc/modprobe.conf.vapi file to have different values for nodes with an architecture that is different from the head node. • For Opteron nodes in a hardware configuration with a Xeon head node, edit the /etc/modprobe.conf.
/var/lib/systemimager/images/base_image/boot/grub/grub.conf
2. Add the noapic option to the end of the kernel boot options as follows:
#boot=/dev/cciss/c0d0
default=0
timeout=5
hiddenmenu
title Linux for High Performance Computing 4 (Savant) (2.6.9-34.4hp.XCsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-34.4hp.XCsmp ro root=LABEL=/ console=ttyS2 noapic
        initrd /initrd-2.6.9-34.4hp.XCsmp.img
3. Save your changes and exit the text editor.
Nagios 2.3.1 Copyright (c) 1999-2006 Ethan Galstad (http://www.nagios.org) Last Modified: 05-15-2006 License: GPL Reading configuration data... Warning: Duplicate definition found for service 'nagiosmonitor' (config file '/opt/hptc/nagios/etc/xc-monitor-n8.cfg', starting on line 195) Warning: Duplicate definition found for service 'hostmonitor' (config file '/opt/hptc/nagios/etc/xc-monitor-n8.
qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: qsctrl: QR0N00:00:0:3 <--> Elan:0:3 state 3 should be 4 QR0N00:00:1:0 <--> Elan:0:4 state 3 should be 4 QR0N00:00:1:1 <--> Elan:0:5 state 3 should be 4 QR0N00:00:1:2 <--> Elan:0:6 state 3 should be 4 QR0N00:00:1:3 <--> Elan:0:7 state 3 should be 4 QR0N00:00:2:0 <--> Elan:0:8 state 3 should be 4 QR0N00:00:2:1 <--> Elan:0:9 state 3 should be 4 QR0N00:00:2:2 <--> Elan:0:10 st
6 Software Upgrades This chapter contains notes about upgrading the HP XC System Software from a previous release to this release. Installation release notes described in Chapter 4 (page 37) and system configuration release notes described in Chapter 5 (page 39) also apply when you upgrade the HP XC System Software from a previous release to this release. Therefore, when performing an upgrade, make sure you also read and follow the instructions in those chapters.
error: Failed dependencies: libgm.so.0()(64bit) is needed by (installed) lustre-lite-1.4.2-2.6.9_22.11hp.3sp.XCsmp_3.1.0_SFS_2.1_1.2.x86_64 Use the following command to manually remove the gm RPM before proceeding to the next step in the upgrade process. # rpm -ev --nodeps gm InfiniBand An InfiniBand RPM dependency failure looks similar to the following: Removing infiniband RPM and directories error: error: Failed dependencies: Failed dependencies: ibhost-biz is needed by (installed) paracomp-ib-1.
3. If hptc-ire-serverlog is not running, start the service: # service hptc-ire-serverlog start 6.2 Notes That Apply to Major Upgrades The notes in this section apply to major upgrades. 6.2.1 Required Task: Add Missing MAC Addresses Before you begin to upgrade the HP XC Version 2.
1. Run the following device_config command to set the external device MAC address in the database for a specific node. Use the MAC addresses you obtained in "Required Task: Determine MAC Addresses For Client Nodes With External Devices".
# cd /opt/hptc/config/sbin
# ./device_config --type E MAC_address --host host_name
2. Run a similar device_config command for every node that has an external connection.
If you want to use the Voltaire MPI tools, you can work around this issue by adding /usr/voltaire/mpi/bin to the PATH in the /etc/profile file, similar to the following: # Voltaire InfiniBand Stack Services - Start PATH="${PATH}:/usr/voltaire/bin:/usr/voltaire/scripts:/usr/mellanox/bin:\ /usr/voltaire/mpi/bin" export MTHOME="/usr/mellanox" export NOGAMLA=1 export MTCONF=release # Voltaire InfiniBand Stack Services - End 6.2.
7 System Administration and Management This chapter contains notes about system administration and management commands and tasks. 7.1 Multiple %EXPR% Expressions Are Not Accepted In the nagios_vars.ini File The nagios_vars.ini file is intended for site specific customizations. Problems occur if you modify the entries in this file to contain more than one %EXPR% variable. For example, the following entry causes the Nagios plug-in(s) that use this information to fail to report the correct status.
# ifconfig ethy:extn down 3. Change to the following directory: # cd /etc/sysconfig/network-scripts 4. Remove the internal NAT server alias ifcfg file so that this alias is not restarted upon a reboot: # rm ifcfg-ethx:intnat{nodename} 5. Remove the external NAT server alias ifcfg file so that this alias is not restarted upon a reboot: # rm ifcfg-ethy:extnat{nodename} 6. Restart the availability tools: # transfer_to_avail 7.5 Re-edit the /etc/dhcpd.
# ps -eaf | grep mysql Three processes should be listed: grep, mysqld_safe, and mysqld. If you do not see mysqld_safe and mysqld, proceed to step 4. 3. Use the process ID (PID) of /usr/libexec/mysqld (the number just after the process owner name) to kill mysqld manually. If the mysqld process is not listed, but there is a mysqld_safe process, use that PID instead. # kill mysqld_PID This process should kill both mysqld and mysqld_safe. 4.
info: Executing C11avail nconfigure /bin/mknod: `/dev/pidentd': File exists /bin/mknod: `/dev/deadman': File exists /etc/rc.d/init.d/cmcluster.init: line 107: let: +: syntax error: operand expected (error token is "+") cmviewconf: Either binary file does not exist, or the user doesn't have access to view the cluster configuration. You might also see errors about uninitialized variables in AdapterMap.pm at or near lines 711 and 713. You can safely ignore these messages. 7.9.
3. cmmodpkg -e dbserver.{nodename}

To start any services that are dependent on the database service, issue the following commands for each package. These commands enable Serviceguard to restart the database package and start the remaining packages.
1. cmrunpkg -n {other node in avail set} {service}.{nodename}
2. cmmodpkg -e {service}.{nodename}
8 Load Sharing Facility and Job Management This chapter addresses the following topics: • Load Sharing Facility (page 61) • Job Management (page 63) 8.1 Load Sharing Facility This section contains notes about LSF-HPC with SLURM on HP XC and standard LSF. 8.1.1 Maintaining Shell Prompts in LSF-HPC Interactive Shells Launching an interactive shell under LSF-HPC integrated with SLURM removes shell prompts.
To verify the change, submit an interactive job similar to the following: [lsfadmin@n16 ~]$hostname n16 [lsfadmin@n16 ~]$ bsub -Is -n8 /bin/bash -i Job <261> is submitted to the default queue . <> <> [lsfadmin@n4 ~]$ hostname n4 [lsfadmin@n4 ~]$ srun hostname n4 n4 n4 n4 n5 n5 n5 n5 [lsfadmin@n4 ~]$ exit exit [lsfadmin@n16 ~]$ hostname n16 [lsfadmin@n16 ~]$ 8.1.
8.2 Job Management
At the time of publication, no release notes are specific to the Simple Linux Utility for Resource Management (SLURM). SLURM provides commands for launching, monitoring, and controlling jobs. Refer to the HP XC System Software User's Guide for more information about using SLURM.
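For reference, the following are a few representative SLURM commands; the node count and job ID shown here are placeholders.

srun -N 2 hostname    # launch a command on two nodes
squeue                # display jobs in the queue
scancel 42            # cancel the job with ID 42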
9 Programming and User Environment
This chapter contains information that applies to the programming and user environment.

9.1 Required HP-MPI Option on Systems With a Mix of InfiniBand PCI-X and PCI Express
To run MPI jobs on a set of nodes that have a mix of PCI-X and PCI Express InfiniBand boards, you must include the following option with the HP-MPI mpirun command:
-e MPI_IB_MTU=1024
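For example, a launch command might look like the following sketch. Only the -e MPI_IB_MTU=1024 option is the required addition; the launch method (-srun), process count, and program name are placeholders for your usual invocation.

# Hypothetical invocation; adjust the srun options and program name as needed.
mpirun -e MPI_IB_MTU=1024 -srun -n 16 ./a.out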
10 Cluster Platform 3000 At the time of publication, no release notes are specific to Cluster Platform 3000 systems.
11 Cluster Platform 4000 At the time of publication, no release notes are specific to Cluster Platform 4000 systems.
12 Cluster Platform 6000 This chapter contains information that applies only to Cluster Platform 6000 systems. 12.1 Network Boot Operation and Imaging Failures on HP Integrity rx2600 Systems An underlying issue in the kernel is causing MAC addresses on HP Integrity rx2600 systems to be set to all zeros (for example, 00.00.00.00.00), which results in network boot and imaging failures. To work around this issue, enter the following commands on the head node to network boot and image an rx2600 system: 1.
13 Integrated Lights Out Console Management Devices This chapter contains information that applies to the integrated lights out (iLO and iLO2) console management device. 13.1 iLO2 Devices Can Become Unresponsive There is a known problem with the iLO2 console management devices causes the iLO2 to become unresponsive to certain tools including the HP XC power daemon and the iLO2 Web interface. When this happens, you will see CONNECT_ERROR messages from the power daemon.
14 Interconnects This chapter contains information that applies to the supported interconnect types: • InfiniBand Interconnect (page 75) • Myrinet Interconnect (page 76) • QsNetII Interconnect (page 76) 14.1 InfiniBand Interconnect The following notes are specific to the InfiniBand® interconnect. 14.1.1 No InfiniBand Graphs with Firmware Older Than Version 3.4.2 The HP Graph system status graphs in Nagios require a feature in the InfiniBand software that was first implemented in firmware version 3.4.2.
14.2 Myrinet Interconnect The following release notes are specific to the Myrinet interconnect. 14.2.1 Myrinet Monitoring Line Card Can Become Unresponsive A Myrinet monitoring line card can become unresponsive some period of time after it has been set up with an IP address with DHCP. This is a problem known to Myricom. For more information, see the following: http://www.myri.com/fom-serve/cache/321.
# mysql -u root -p qsnet mysql> delete from switch_modules where name="QR0N03"; mysql> quit # service swm restart In addition to the previous problem, the IP address of a switch module may be incorrectly populated in the switch_modules table, and you might see the following message: # qsctrl qsctrl: failed to parse module name 172.20.66.2 . . .
15 Documentation This chapter describes known issues and omissions in the HP XC System Software Documentation Set and HP XC manpages. 15.1 Documentation CD Search Option If you are viewing the main page of the HP XC Documentation CD, you cannot perform a literature search from the Search: option box at the top of the page. To search docs.hp.com or to search all of HP's global Web service, click on the link for More options.
3. Create a new module dependency list (modules.dep):
# depmod -a

15.2.4 Missing Key Sequence
In Section 3.6.1 "Modify the Default Password for HP ProLiant DL140 and DL145 Hardware Models" on Page 58, add the following step before step a. in the procedure shown for BMC firmware version 1.24 or higher:
• For BMC Firmware Version 1.24 or higher:
a. Press Esc and Shift+9 to enter the command-line mode.
Continue with steps a through f as documented on Page 58.
15.3.4 Moving SLURM and LSF to Their Backup Nodes
This procedure is not documented in the HP XC System Software Administration Guide, but it will be included in a future version. To move the SLURM and LSF daemons from their primary node to their backup node (perhaps due to a maintenance need on the primary node), follow this procedure:
1. Log in to the backup node as root.
2. Shut down the backup slurmctld daemon:
# pkill slurmctld
3. 4. 5.