HP XC System Software Release Notes Version 3.0
© Copyright 2003–2007 Hewlett-Packard Development Company, L.P. Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license. The information contained herein is subject to change without notice.
Table of Contents
About This Document.........................................................................................................9
1 Intended Audience...............................................................................................................................9
2 Document Organization......................................................................................................................9
3 HP XC Information..........................................................
5 Configuration Notes....................................................................................................31
5.1 Tasks to Perform Before Running the cluster_prep Utility.............................................................31
5.1.1 Required Task for Some NIC Adapter Models: Correct NIC Device Driver Mapping..........31
5.2 Node Name Prefix Cannot Contain a Hyphen...............................................................................31
5.
8 Programming and User Environment Notes..............................................................47
8.1 Notes About HP MPI and Modulefiles...........................................................................................47
8.2 Configuring the Intel Trace Collector and Analyzer with HP MPI on HP XC...............................47
8.2.1 Installation Notes....................................................................................................................47
8.2.
14.5.2 discover(8).............................................................................................................................66
14.5.3 startsys(8)...............................................................................................................................66
Index.................................................................................................................................
List of Tables
3-1 Corrected BIOS Settings for the HP ProLiant DL140 G2 Nodes...................................................25
4-1 Boot Command Line Based on Hardware Model.........................................................................29
6-1 Upgrade Boot Command Line Based on Hardware Model..........................................................
About This Document

This document contains release notes for HP XC System Software Version 3.0. It contains important information about firmware, software, or hardware that may affect your system. An HP XC system is integrated with several open source software components. Some open source components are used as underlying technology, and their deployment is transparent.
• Chapter 9: Load Sharing Facility and Job Management Notes (page 51) contains notes that apply to the Load Sharing Facility (LSF®) and interactive job management commands.
• Chapter 10: Cluster Platform 3000 Notes (page 55) contains notes that apply only to Xeon® with EM64T-based systems.
• Chapter 11: Cluster Platform 4000 Notes (page 57) contains notes that apply only to AMD Opteron™-based systems.
HP XC Program Development Environment
The following URL provides pointers to tools that have been tested in the HP XC program development environment (for example, TotalView® and other debuggers, compilers, and so on):
ftp://ftp.compaq.com/pub/products/xc/pde/index.html

HP Message Passing Interface
HP Message Passing Interface (MPI) is an implementation of the MPI standard for HP systems. The home page is located at:
http://www.hp.
— LSF administrator tasks are documented in the HP XC System Software Administration Guide.
— LSF user tasks such as launching and managing jobs are documented in the HP XC System Software User's Guide.
— To view LSF manpages supplied by Platform Computing Corporation, use the following command:
  $ man lsf_command_name
— The LSF Administrator and Reference guides developed by Platform are also available at:
  http://www.docs.hp.com/en/highperfcomp.html
• http://www.llnl.
3.3 Manpages Manpages provide online reference and command information from the command line. Manpages are supplied with the HP XC system for standard HP XC components, Linux user commands, LSF commands, and other software components that are distributed with the HP XC system. Manpages for third-party vendor software components may be provided as a part of the deliverables for that component.
Related MPI Web Sites
• http://www.mpi-forum.org
  Contains the official MPI standards documents, errata, and archives of the MPI Forum. The MPI Forum is an open group with representatives from many organizations that defines and maintains the MPI standard.
• http://www-unix.mcs.anl.gov/mpi/
  A comprehensive site containing general information, such as the specification and FAQs, and pointers to a variety of other resources, including tutorials, implementations, and other MPI-related sites.
User input    Commands and other text that you type.
Variable      The name of a placeholder in a command, function, or other syntax display that you replace with an actual value.
[]            The contents are optional in syntax. If the contents are a list separated by |, you can choose one of the items.
{}            The contents are required in syntax. If the contents are a list separated by |, you must choose one of the items.
...           The preceding element can be repeated an arbitrary number of times.
| WARNING CAUTION IMPORTANT NOTE
1 New and Changed Features

This chapter describes the new and changed features delivered in HP XC System Software Version 3.0.

1.1 Base Distribution and Kernel
The following table lists the updates made to the base distribution and kernel since the last release.

HP XC Version 3.0                      HP XC Version 2.1
Enterprise Linux 4 Update 2            Enterprise Linux 3 Update 4
Base Red Hat kernel 2.6.9-22.0.1.EL    Base Red Hat kernel 2.4.21-27.0.2.EL

1.
The following changes have been made to the Kickstart installation process:
• In this release, the Kickstart installation will not prompt you to specify the external Ethernet device to use as the external network device on the head node. The cluster_prep command now prompts you for this information.
• Because the external Ethernet connection is not configured during the Kickstart installation process, the head node does not have network connectivity until after the cluster_prep command is run.

1.
Enter s to perform customized services configuration on the nodes in your system.
• The ability to configure a node or nodes as a NIS slave server has been automated, thereby eliminating the manual process that was required in previous releases. If you modify the default role assignments and assign a nis_server role to configure one or more nodes as a NIS slave server, you are prompted to enter the name or IP address of your NIS master server as well as your NIS domain name.
1.10 LSF and SLURM

The following enhancements were made to the Load Sharing Facility (LSF) and Simple Linux Utility for Resource Management (SLURM) on HP XC:
• The most recent versions of the LSF and SLURM products are provided in this release:
  — LSF-HPC with SLURM and Standard LSF Version 6.1
  — SLURM Version 0.5.0-1
• This release introduces the option to install and configure standard LSF. In previous releases, standard LSF was not offered as an option.
Refer to ipmitool(1) for details.

1.16 HP Mathematical Library (MLIB)
HP XC System Software Version 2.1 (released in June 2005) included Linux HP MLIB libraries as a supported and integral part of the software. Starting with this release, the HP XC System Software no longer includes support for HP MLIB. There are no direct HP replacement products for the HP MLIB library.
2 Important Release Information

This chapter contains information that is important to know for this release.

2.1 Firmware Revisions
The HP XC System Software is qualified against specific firmware revisions. Follow the instructions in the accompanying HP Cluster Platform documents to ensure that all hardware components are running the latest firmware version. The master firmware list for this release is available online at:
http://www.docs.hp.com/en/highperfcomp.
3 Hardware Preparation Notes Hardware preparation tasks are documented in the HP XC Hardware Preparation Guide. This chapter contains information that was not included or was inaccurate in that document at the time of publication. 3.1 BIOS Settings for HP ProLiant DL140 G2 Nodes Some BIOS settings shown in Table 3–3 in the HP XC Hardware Preparation Guide for the HP ProLiant DL140 G2 nodes are incorrect.
   d. Press the Enter key to access the MP. If there is no response, press the MP reset pin on the back of the MP and try again.
   e. Log in to the MP using the default user name and password shown on the screen. The MP Main Menu is displayed.
3. Enter SL to show event logs. Then, enter C to clear all log files and Y to confirm.
4. Enter CM to display the Command Menu.
5. Enter UC and use the menu options to remove the default MP user name and password and create your own unique user name and password.
      1) Enable the Acpi(HWP0002,700)/Pci(1|1)/Uart(9600 N81)/VenMsg(Vt100+) option.
      2) Enable the Acpi(HWP0002,700)/Pci(2|0) option.
      3) If prompted, save the setting to NVRAM.
      4) Enter x to return to the previous menu.
   c. Select the Select Error Console option to enable console messages to be displayed on the screen when you turn on the system.
      1) Enable the Acpi(HWP0002,700)/Pci(1|1)/Uart(9600 N81)/VenMsg(Vt100+) option.
      2) Enable the Acpi(HWP0002,700)/Pci(2|0) option.
9. Perform this step on all nodes except the head node. From the Boot Menu screen, which is displayed during the power on of the node, select the Boot Configuration Menu. Do the following from the Boot Configuration Menu:
   a. Select Add Boot Entry.
   b. Select Load File [Core LAN Gb A] as the network boot choice, which is a Gigabit Ethernet (GigE) port.
   c. Enter the string Netboot as the boot option description. This entry is required and must be set to the string Netboot (with a capital letter N).
   d.
4 Installation Notes This chapter contains notes that apply to the HP XC System Software Kickstart installation session. 4.1 Notes to Read Before the Kickstart Installation Session Read the notes in this section before starting the Kickstart installation session. 4.1.1 Kickstart Boot Command Line Options Based on Hardware Model To start the Kickstart installation, some hardware models require additional options to be included on the boot command line.
4.4 MySQL Database Daemon Does Not Start

If the mysql database daemon does not start, check the appropriate log file in /var/lib/mysql/node_name.err for errors. One of the most common reasons for this type of failure is that the mysql database daemon could not create a temporary file in the /tmp directory on start up. Follow this procedure to resolve the problem:
1. Verify the /tmp directory permissions and change the permissions if needed:
   # ls -l -d /tmp
   # chmod 1777 /tmp
2.
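If the /tmp permissions were the problem, restarting the daemon and re-reading its log confirms the fix. This is a sketch: the mysqld service name is an assumption based on typical Red Hat Enterprise Linux layouts, and the log path follows the note above, with node_name taken to be the node's short host name.

```shell
# node_name in the log path is assumed to be the short host name.
node=$(hostname -s)
logfile="/var/lib/mysql/${node}.err"
echo "log file to check: ${logfile}"
# On the affected node you would then run (not executed here):
#   service mysqld restart
#   tail -20 "${logfile}"
```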
5 Configuration Notes This chapter contains information about configuring the system. Notes that describe additional configuration tasks are mandatory and have been organized chronologically. Perform these tasks in the sequence presented in this chapter. The HP XC system configuration procedure is documented in Chapter 4 in the HP XC System Software Installation Guide. 5.
Including a hyphen in the node name prefix adversely affects the bsub command, and you may see messages similar to the following if a node name contains a hyphen:

# bsub -n 2 -ext "SLURM[nodelist=xc-n[1-10]]" hostname
Syntax error, incorrect SLURM option for NODELIST
Extsched option syntax error. Job not submitted.

5.3 Database Administrator's Password Cannot Contain an Equals Sign
Do not include an equals sign character (=) as part of the database administrator password.
The head node console port may be either internally or externally connected to the HP XC system. Follow the procedure that is appropriate for your system. 5.5.2.
The cluster_config utility rewrites the /etc/dhcpd.conf file, and if you edit this file before running the utility, your changes to this file are lost. Therefore, enter this command to save your customizations to the /etc/dhcpd.conf file before running the cluster_config utility:

# cp /etc/dhcpd.conf /etc/dhcpd.conf.ORIG

5.5.4 Required Task on HP xw9300 Client Nodes: Remove acpi=off
HP xw9300 workstations do not need to be booted with acpi=off.
5.7.1 Required Task: Restart the LSF LIM Daemon This release note applies to systems where LSF-HPC with SLURM has been installed and configured; skip this task if your system is installed and configured with standard LSF. The LSF Load Information Manager (LIM) daemon must be restarted so that it can be properly licensed for all compute node processors. At this point, LSF is only licensed for a subset of the available compute node processors. Follow this procedure to start the LSF LIM daemon: 1.
Copyright (c) 1999-2005 Ethan Galstad (http://www.nagios.org)
Last Modified: 11-30-2005
License: GPL

Reading configuration data...

Warning: Duplicate definition found for service 'slurmstatus' (config file '/opt/hptc/nagios/etc/xc-monitor-n16.cfg', starting on line 415)
Warning: Duplicate definition found for service 'resourcestatus' (config file '/opt/hptc/nagios/etc/xc-monitor-xc6n16.
Note: You must only configure out links where the ELAN connections are identified with the Elan designator in the destination field. For example:

qsctrl: QR0N00:02:3:3 <--> Elan:0:47 link state normal

Links that have the QR0Nxx designator in both the origin and destination field must not be configured out. Doing so will cause the whole chip to go into reset.
6 Software Upgrade Notes This chapter contains notes about upgrading the HP XC System Software from a previous release to this release. 6.1 Installation and Configuration Release Notes Also Apply To An Upgrade Installation release notes described in Chapter 4 (page 29) and system configuration release notes described in Chapter 5 (page 31) also apply when you upgrade the HP XC System Software from a previous release to this release.
6.5 Required Task: Remove Nagios Symbolic Links

You must remove old Nagios symbolic links after you run the upgradesys script:
# /bin/rm -f /etc/rc.d/rc3.d/S345nagios
# /bin/rm -f /etc/rc.d/rc4.d/S345nagios
# /bin/rm -f /etc/rc.d/rc5.d/S345nagios

6.6 Required Task: Edit the /etc/logrotate.
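The three removal commands differ only in the run level, so they can also be issued as one loop. This sketch echoes the commands rather than running them, since the rc directories only exist on an upgraded HP XC head node:

```shell
# Echo (rather than execute) the removal command for each run level
# named in the note above.
for lvl in 3 4 5; do
    echo "/bin/rm -f /etc/rc.d/rc${lvl}.d/S345nagios"
done
```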
Bringing up interface eth2: SIOCSIFFLAGS: Cannot allocate memory
Failed to bring up eth2.

If the output from the ifconfig command does not contain the external interface, run the ifup command. In the following example, external_device_name is the name of an external device such as eth2.

# ifup external_device_name

Because this issue is intermittent, HP recommends that you run the cluster_config utility directly from the console (and not remotely) in case you lose your external connection.

6.
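The manual check above can be scripted. This sketch works on captured text so it runs anywhere; on a live head node you would pipe the real ifconfig output instead, and eth2 here is only a stand-in for your external device name:

```shell
# Stand-in for interface names parsed from ifconfig output; eth2 is
# deliberately absent to show the "interface down" branch.
ifconfig_out="eth0
eth1
lo"
dev=eth2   # hypothetical external device name

if printf '%s\n' "${ifconfig_out}" | grep -q "^${dev}\$"; then
    echo "${dev} is up"
else
    echo "${dev} is down; run: ifup ${dev}"
fi
```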
7 System Administration and Management Notes

This chapter contains notes about system administration and management commands and tasks. Perform the tasks only when necessary.

7.1 Multiple %EXPR% Expressions Are Not Accepted in the nagios_vars.ini File
The nagios_vars.ini file is intended for site-specific customizations. Problems occur if you modify the entries in this file to contain more than one %EXPR% variable.
   # pdsh -a -x n122,n126,n129 cp /etc/init.d/default_gateway \
   /etc/init.d/default_gatewaySAVE
3. Use the dbsysparams command to modify the value of NAT_GATEWAYS from multiple to single:
   # /opt/hptc/sbin/dbsysparams "NAT_GATEWAYS"
   NAT_GATEWAYS: multiple
   # /opt/hptc/sbin/dbsysparams -s "NAT_GATEWAYS" "single"
   # /opt/hptc/sbin/dbsysparams "NAT_GATEWAYS"
   NAT_GATEWAYS: single
4. Rerun nconfig and cconfig to rewrite the /etc/init.
pdsh@n3: n1: ssh exited with exit code 3
pdsh@n3: n2: ssh exited with exit code 3
0 headerinfo Mon Sep 26 15:49:59 EDT 2005

7.7 Benign pdsh Messages
The pdsh command will display the following message if you enter an invalid node name (in this example, 1166). This message is benign and can be safely ignored:

# pdsh -w 1166 hostname
pdsh@n3: 1166: ssh exited with exit code 255

7.
7.11 Possible Problem with ext3 File Systems on SAN Storage

Issues have been reported when an ext3 file system fills up to the point where ENOSPC is returned to write requests for a long period of time, and the file system is subsequently unmounted. A forced check is initiated (fsck -fy) before the next mount. It appears that the fsck checks might cause corruption of the file system inode information.
8 Programming and User Environment Notes

This chapter contains information that applies to the programming and user environment.

8.1 Notes About HP MPI and Modulefiles
If the HP Message Passing Interface (MPI) is being used, it is important to make sure the mpi* compiler scripts use the intended compiler, for example, by setting the MPI_CC or MPI_F90 environment variables (or both).
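As a sketch of the selection step, assuming the Intel compilers are the intended ones (icc and ifort are example choices here, not a requirement):

```shell
# Tell the HP MPI wrapper scripts which underlying compilers to invoke.
# icc and ifort are examples; substitute the compilers you actually use.
export MPI_CC=icc
export MPI_F90=ifort

# The wrappers then call the selected compilers (not executed here):
#   mpicc  -o hello_c hello.c
#   mpif90 -o hello_f hello.f90
echo "MPI_CC=${MPI_CC} MPI_F90=${MPI_F90}"
```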
FLDFLAGS = -static-libcxa -L$(VT_ROOT)/lib -lnsl -lm -lelf -lpthread $(TLIB) -lvtunwind -ldwarf

When the Intel compilers are used, add -static-libcxa to the link line; otherwise, the following errors are generated at run time:

[n1]/nis.home/sballe/xc_PDE_work/ITC_examples_xc6000 > mpirun.mpich -np 2 ~/xc_PDE_work/ITC_examples_xc6000/vtjacobic
warning: this is a development version of HP MPI for internal R&D use only
/nis.
66 Difference is 1.469404085386082E-004
68 Difference is 1.245266549586746E-004
70 Difference is 1.055343296682637E-004
72 Difference is 8.944029434752290E-005
74 Difference is 7.580169395426893E-005
76 Difference is 6.424353519703476E-005
78 Difference is 5.444822123484475E-005
80 Difference is 4.614672291984789E-005
82 Difference is 3.911112299221254E-005
84 Difference is 3.314831465581266E-005
86 Difference is 2.809467246160129E-005
88 Difference is 2.381154327036583E-005
90 Difference is 2.
9 Load Sharing Facility and Job Management Notes

This chapter addresses the following topics:
• Load Sharing Facility (page 51)
• SLURM and Job Management (page 53)

9.1 Load Sharing Facility
This section contains notes about LSF-HPC with SLURM on XC and standard LSF.

9.1.1 Maintaining Shell Prompts in LSF-HPC Interactive Shells
Launching an interactive shell under LSF-HPC integrated with SLURM removes shell prompts.
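One common workaround, offered here as an assumption rather than the guide's prescribed fix, is to export an explicit bash prompt before submitting, so the dispatched interactive shell inherits a usable PS1 through its environment:

```shell
# Set an explicit bash prompt; the value is only an example.
# Whether the dispatched shell inherits it depends on the LSF/SLURM
# environment propagation on your system (an assumption here).
export PS1='[\u@\h \W]\$ '
echo "PS1 is set"
# Then launch the interactive shell as usual (not executed here):
#   bsub -Is -n8 /bin/bash -i
```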
To verify the change, submit an interactive job similar to the following:

[lsfadmin@n16 ~]$ hostname
n16
[lsfadmin@n16 ~]$ bsub -Is -n8 /bin/bash -i
Job <261> is submitted to the default queue .
<>
<>
[lsfadmin@n4 ~]$ hostname
n4
[lsfadmin@n4 ~]$ srun hostname
n4
n4
n4
n4
n5
n5
n5
n5
[lsfadmin@n4 ~]$ exit
exit
[lsfadmin@n16 ~]$ hostname
n16
[lsfadmin@n16 ~]$

9.1.
9.1.4 Short LSF Queue RUN_WINDOW Can Suspend Other Jobs

A job that does not complete within the RUN_WINDOW of its queue is suspended and may prevent other jobs on other queues from running, even if those other jobs were submitted to a higher priority queue. At the next instance of the queue's RUN_WINDOW, the job resumes execution and the other jobs can be scheduled. Consider this example:
1. Job #75 is scheduled on a queue named night.
2. The RUN_WINDOW opens for the night queue.
If SLURM has been installed and configured but is not required, use the following procedures to deactivate it:
1. As root on the head node, shut down SLURM:
   # scontrol shutdown
2. Unconfigure SLURM on the head node:
   # /opt/hptc/slurm/etc/gconfig.d/slurm_gconfig.pl gunconfigure
   # /opt/hptc/slurm/etc/nconfig.d/slurm_nconfig.pl nunconfigure
3. Update the golden image.
4. Propagate the new golden image to all nodes.
10 Cluster Platform 3000 Notes At the time of publication, no release notes are specific to Cluster Platform 3000 systems.
11 Cluster Platform 4000 Notes

This chapter contains information that applies only to Cluster Platform 4000 systems.

11.1 HP ProLiant DL145 G2 Console Port Might Lose Connectivity to the Administration Network
During normal operation of an HP ProLiant DL145 G2 node, the console port may lose connectivity to the administration network for periods of tens of seconds to several minutes. This loss of connection causes a disruption in node management and monitoring.
12 Cluster Platform 6000 Notes At the time of publication, no release notes are specific to CP6000 systems.
13 Interconnect Notes

This chapter contains information that applies to the supported interconnect types:
• InfiniBand Interconnect (page 61)
• Myrinet Interconnect (page 61)
• QsNetII Interconnect (page 61)

13.1 InfiniBand Interconnect
At the time of publication, no release notes are specific to the InfiniBand® interconnect.

13.2 Myrinet Interconnect
The following release notes are specific to the Myrinet® interconnect.

13.2.
13.3.3 Possible Conflict with Use of SIGUSR2

The Quadrics QsNetII software internally uses SIGUSR2 to manage the interconnect. This can conflict with any user applications that use SIGUSR2, including for debugger use. To work around this conflict, set the LIBELAN4_TRAPSIG environment variable for the application to a signal number other than the default value of 12, which corresponds to SIGUSR2.
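For example, before launching the application. The signal number 34 (SIGRTMIN on many Linux systems) is an arbitrary choice for illustration; pick any signal your application does not already use, and note that the application name below is a placeholder:

```shell
# Redirect the Quadrics library away from SIGUSR2 (signal 12) to an
# unused signal; 34 (SIGRTMIN on many Linux systems) is one option.
export LIBELAN4_TRAPSIG=34
echo "LIBELAN4_TRAPSIG=${LIBELAN4_TRAPSIG}"
# Then run the application as usual (not executed here):
#   srun ./my_app     # my_app is a hypothetical application name
```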
14 Documentation Notes

The notes in this section apply to the HP XC System Software Documentation Set and HP XC manpages.

14.1 HP XC Hardware Preparation Guide
Callout 2 in Figure 2–6 “ProCurve 2824 Root Administration Switch” incorrectly states that Port 23 is used as the interconnect to the Root Console Switch. Port 24 is used as the interconnect to the Root Console Switch, not Port 23.

14.
14.3.1 System Event Logs This new functionality was delivered to your HP XC system through the PK01 patch for Version 3.0. Each hardware platform provided by HP supplies an event logging mechanism to capture platform-specific events to track hardware states and changes.
14.3.4 Moving SLURM and LSF to Their Backup Nodes

This procedure is not documented in the HP XC System Software Administration Guide but it will be included in a future version. To move the SLURM and LSF daemons from their primary node to their backup node (perhaps due to a maintenance need on the primary node), follow this procedure:
1. Log in to the backup node as root.
2. Shut down the backup slurmctld daemon:
   # pkill slurmctld
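A quick way to confirm that the pkill in step 2 took effect; pgrep is standard procps, and this check is a convenience sketch rather than part of the documented procedure:

```shell
# pkill returns nonzero when no process matched; ignore that here.
pkill slurmctld 2>/dev/null || true
# pgrep -x exits nonzero when no slurmctld process remains.
if ! pgrep -x slurmctld >/dev/null; then
    echo "slurmctld is no longer running"
fi
```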
During an LSF failover scenario, any interactive LSF jobs are terminated because their I/O operates through the LSF daemons. However, LSF batch jobs will run undisturbed if their nodes remain up.

14.4.2 Changing the Root Password
Step 4 in the "Changing the Root Password" procedure in Chapter 11 contains a misspelling. The /etc/passwd file name is incorrectly shown as /etc/password. The corrected text is as follows:
4.
Index

A
attribute caching, 45
B
base operating system, 17
BMC password, 32
C
cannot connect to MySQL database, 45
clear_counters command, 61
cluster_config utility, 18
cluster_prep command, 18
collectl utility, 20
    log files stop rolling, 43
console connection to DL145 G2, 57
console port connection failure, 46
controllsf manpage, 66
CP3000 system, 55
CP4000 system, 57
CP6000 syste
ipmitool, 32
ipmitool command, 20
J
job management, 53
K
kernel version, 17
Kickstart installation, 29
    new features, 17
LSF errors, 35
Nagios errors, 35
qsnet_database test failure, 36
P
password
    MP, 26, 27
patches, 32
pdsh command, 44
Q
qsctrl utility, 36
qsnet diagnostics database, 62
Quadrics QsNet interconnect, 61
R
RPM errors, 29
S
sendmail, 45, 63
signal
    Quadrics QsNet, 62
single default gateway, 43
SLURM, 53
    deactivating, 53
    moving to backup node, 65
    new features, 20
    user processes terminated, 53
    version, 20
slurm.epilog.