SGI® Altix® XE Clusters Quick Reference Guide 007-5474-003
COPYRIGHT © 2008-2009 SGI. All rights reserved; provided portions may be copyright in third parties, as indicated elsewhere herein. No permission is granted to copy, distribute, or create derivative works from the contents of this electronic documentation in any manner, in whole or in part, without the prior written permission of SGI.
Record of Revision

Version  Description
-001     March 2008. First publication. Note that substantial content included in this document was originally published in SGI publication 007-4979-00x.
-002     July 2008. Modifications to accommodate the Platform Manager software (formerly Scali Manage) release 5.7 for use with SGI Altix XE clusters, plus other miscellaneous updates.
-003     March 2009. Updates to cover new hardware nodes available and changes covered by the release of Platform Manager 5.7.2.
Contents

1. SGI Altix XE Cluster Quick-reference . . . 1
   Overview . . . 1
   Site Plan Verification . . . 3
   Unpacking and Installing a Cluster Rack . . . 3
   Booting the XE Cluster
   Third-Party Clustering Documents . . . 29
      Voltaire Product Guides . . . 29
      SMC Product Guides . . . 29
      Platform Manage Product Guides . . . 30
      QLogic Product Guides . . . 30
   Customer Service and Removing Parts
   Serial-over-lan Commands . . . 59
      Configuring SOL . . . 59
      Connecting to Node Console via SOL . . . 60
      Deactivating an SOL Connection . . . 60
   Displaying all Objects in SDR
Chapter 1. SGI Altix XE Cluster Quick-reference

Overview

Your SGI® Altix® XE cluster system ships with a variety of hardware and software documents in both hard copy and soft copy formats. Hard copy documents are in the packing box, and soft copy documents are located on your system hard disk in both /opt/sgi/Factory-Install/Docs and /opt/sgi/Factory-Install/CFG. Additional third-party documentation may be shipped on removable media (CD/DVD) included with your shipment.
1: SGI Altix XE Cluster Quick-reference

For instance, one application may run on 16 processors in the cluster while another application runs on a different set of 8 processors. Very large clusters may run dozens of separate, independent applications at the same time.

Important: In a cluster using older and newer compute nodes (for example, XE310, XE320, and XE340 nodes), parallel calculations will be executed at the rate of the slowest node.
not connect directly to the “outside world” because mixing external and internal cluster network traffic could impact application performance.

Site Plan Verification

Ensure that all site requirements are met before you install and boot your system. If you have questions about the site requirements or would like to order full-size floor templates for your site, contact an SGI site planning representative by e-mail (site@sgi.com).
SGI Altix XE250 and XE270 Node Front Controls and Indicators

The front control panel on the SGI Altix XE250 or XE270 head node or compute node (see Figure 1-1) has six LED indicators to the left of the power and reset buttons. The LEDs provide critical server-related information. The two head node models have virtually identical front panel controls, although their internal circuitry and processors are different.
Altix XE320 or XE340 Compute Node Controls and Indicators

[Figure 1-2, SGI Altix XE320/XE340 Compute Node Controls and Indicators, shows the control panels for node board 1 and node board 2; callouts include the reset button, power button, power LED, HDD activity LED, NIC 1 and NIC 2 activity LEDs, and the overheat/fan fail LED.]

Table 1-1 SGI Altix XE320/XE340 Compute Node Controls and Indicator Descriptions

Feature  Description
RESET    Press the reset button to reboot only the node board.
Cluster Configuration Overview

The following figures are intended to represent some of the general types of cluster configurations used with SGI Altix XE cluster systems.

Note: These configuration drawings are for informational purposes only and are not meant to represent any specific cluster system.

Figure 1-3 on page 7 diagrams a basic Gigabit Ethernet configuration using a single Ethernet switch for node-to-node communication.
[Figures 1-3 through 1-8 are cluster configuration diagrams. The recoverable labels show configurations built around a base Gigabit Ethernet switch for administration; configurations that add an InfiniBand switch for MPI; and a NAS configuration with a dedicated Gigabit Ethernet switch for NAS alongside the base administration switch.]
Power Down the Cluster

Note: You can also use the baseboard management controller (BMC) interface to perform power management and other administrative functions. Refer to the SGI Altix XE340 User’s Guide, publication number 007-5536-00x, for more information about the BMC interface. See the SGI Altix XE320 User’s Guide, publication number 007-5466-00x, for information on its BMC. Remote power management is done via Platform Manager’s GUI or CLI.
Powering Off Manually

To power off your cluster system manually, follow these steps:

Caution: If you power off the cluster before you halt the operating system, you can lose data.

1. Shut down the operating system by entering the following command:

   # init 0

2. Press the power button on the head node(s) that you want to power off. You may have to hold the button down for up to 5 seconds. You may power off the nodes in any order.

3.
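The manual procedure can also be scripted from the head node with ipmitool. The sketch below is a dry run: it only prints the power-off commands it would issue, using illustrative BMC addresses and the admin/admin credentials mentioned elsewhere in this guide. The address range is an assumption; verify it against your own address tables before running anything for real.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) ipmitool soft power-off
# commands for four BMCs at hypothetical addresses 10.0.30.1-10.0.30.4.
print_poweroff_cmds() {
    for i in 1 2 3 4; do
        echo "ipmitool -I lanplus -H 10.0.30.$i -U admin -P admin chassis power soft"
    done
}
print_poweroff_cmds
```

Removing the leading `echo` (after checking addresses and credentials) would turn the sketch into an actual ordered shutdown loop.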
Ethernet Network Interface Card (NIC) Guidelines

While Ethernet ports can vary across a cluster, the following rules generally apply to the cluster head node:
• The server motherboard’s nic1 is always the public IP interface on the head node.
• The server motherboard’s nic2 is always the private administrative network connection.
• nic3 is always a PCI expansion controller port. It is typically used to handle MPI traffic.
Table 1-2 Head Node Ethernet Address Listings

Head node number | nic2 internal management IP address | nic3 (GigE) MPI, NAS/SAN option | InfiniBand IP address | Baseboard Management Control (IPMI) address
1 | 10.0.10.1 | 172.16.10.1 | 192.168.10.1 | 10.0.30.1
2 | 10.0.10.2 | 172.16.10.2 | 192.168.10.2 | 10.0.30.2
3 | 10.0.10.3 | 172.16.10.3 | 192.168.10.3 | 10.0.30.3
4 | 10.0.10.4 | 172.16.10.4 | 192.168.10.4 | 10.0.30.

(nic1 is the customer-domain public interface; see “Changing the NIC1 (Customer Domain) IP Address”.)
Changing the NIC1 (Customer Domain) IP Address

– Click the “IP Address” box for device eth0 and change the IP address.
– Click the “Subnet” box for each network and select (arrow) the new subnet.

9. Click the “Default Gateway” tab. Click “Gateway IP Address” and change it to your network address.
10. Click the “NAT Settings” tab and configure any NAT settings (if applicable). See the Add and Remove buttons (lower right) in the window.
11.
Cluster Compute Node IP Addresses

The cluster system can have multiple compute nodes, each using up to three IP addresses (plus the InfiniBand IP address). As with the head nodes, the fourth octet of each address increments by one as each compute node is added to the list. Table 1-3 shows the factory-assigned IP address settings for compute nodes one through four.
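The iterate-the-fourth-octet scheme can be sketched in a few lines of shell. The base prefixes below simply mirror the head-node table and are assumptions for illustration; the real compute-node values live in Table 1-3.

```shell
#!/bin/sh
# Sketch: print per-node addresses by incrementing the fourth octet,
# the way the factory addressing scheme does. Base prefixes are
# assumed, not the actual Table 1-3 values.
list_node_ips() {
    n=$1   # number of compute nodes
    i=1
    while [ "$i" -le "$n" ]; do
        echo "node$i  mgmt=10.0.10.$i  mpi=172.16.10.$i  ib=192.168.10.$i"
        i=$((i + 1))
    done
}
list_node_ips 4
```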
Web or Telnet Access to the Maintenance Port on the Gigabit Ethernet Switch

Your switch(es) setup is configured in the factory before shipment and should be accessible via telnet or a web browser. The switch can be a single switch or a stacked master/slave combination. You can connect to a console directly from the head node through the administration network using telnet.

To access the switch via telnet:

telnet 10.0.20.
Serial Access to the SMC Switch

Use of a serial interface to the switch should only be needed if the factory-assigned IP address for the switch has somehow been deleted, altered, or corrupted. Otherwise, use of the web or telnet access procedure is recommended. To use a serial interface with the switch, connect a laptop or PC to the switch’s console port. Refer to Figure 1-9 for the location of the console port and use the steps that follow for access.
InfiniBand Switch Connect and IP Address

The subsection “Web or Telnet Access to the InfiniBand Switch” on page 21 lists the factory IP address settings for your InfiniBand switch or switch “stack” used with the cluster. For clusters with more than 288 network ports, consult SGI Professional Services for specific IP address configuration information.
Serial Access to the Switch

Connect a Voltaire serial cable (either DV-9 to DB-9 or DB-9 to DB-9), which comes with the 24-port switch, from a PC/laptop directly to the switch for serial access. Use of a serial interface to the switch should only be needed if the factory-assigned IP address for the switch has somehow been deleted, altered, or corrupted. Otherwise, use of the web or telnet access procedure is recommended.
3. Set up the network for your InfiniBand switch cluster configuration using the following information and the IP reference provided in “Web or Telnet Access to the InfiniBand Switch” on page 21. Enter the following commands to set up the network:

   ISR-xxxx# config
   ISR-xxxx(config)# interface fast
   ISR-xxxx(config-if-fast)# ip-address-fast set [10.0.21.1] 255.255.0.0
   ISR-xxxx(config-if-fast)# broadcast-fast set 10.0.255.
Installing or Updating Software

Platform Manage offers a mechanism to upload and install software across the cluster. This upload and installation process requires that the software be in RPM format. Tarball software distributions can also be installed across a cluster; see the Platform scarcp (cluster remote copy) and scash (cluster remote shell) commands in the Platform Manage User’s Guide.
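As a rough sketch, a cluster-wide install with these tools might look like the following. scarcp and scash are named in the Platform Manage User’s Guide, but their exact option syntax is not reproduced here, so the script only prints candidate commands (a dry run), and the package name is hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print candidate commands for copying an RPM to all
# nodes (scarcp) and installing it everywhere (scash). Consult the
# Platform Manage User's Guide for the tools' actual syntax.
RPM="myapp-1.0-1.x86_64.rpm"   # hypothetical package name
stage_cmds() {
    echo "scarcp $RPM /tmp/$RPM"
    echo "scash rpm -Uvh /tmp/$RPM"
}
stage_cmds
```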
Platform Manage Troubleshooting Tips

Note: The DEL key and F2 key work only if the proper ASCII terminal settings are in place. Many Linux distributions default to varied ASCII settings. In the case of the SGI Altix XE340 or XE320 compute node, or the Altix XE250 or XE270 head node, the DEL key should always generate an “ASCII DEL”. If it does not, type Ctrl-Backspace to enter the BIOS setup menu.

Important: The BIOS comes preconfigured with the SGI recommended settings.
have trouble that is more hardware related, see “Customer Service and Removing Parts” on page 31.

NFS Quick Reference Points

The cluster head node exports an NFS filesystem that the compute nodes import. The cluster comes with a pre-configured NFS mount: the head node exports the /data filesystem, and the compute nodes mount the head node’s /data1 on /cluster.
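As a sketch only, the export/mount relationship described above would look something like the fragment below. The paths come from this section; the export options, client network range, and head-node address are assumptions, not the factory-configured values.

```shell
# Hypothetical head-node /etc/exports entry (options assumed):
#   /data1   10.0.0.0/255.255.0.0(rw,sync)
#
# Hypothetical compute-node /etc/fstab entry mounting the head node's
# /data1 on /cluster (10.0.10.1 is an assumed head-node management
# address, per Table 1-2):
#   10.0.10.1:/data1   /cluster   nfs   defaults   0 0
#
# Equivalent one-off mount command on a compute node:
#   mount -t nfs 10.0.10.1:/data1 /cluster
```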
Related Publications

The following SGI system documents may be useful or necessary while configuring and operating your Altix XE cluster system:

• Manufacturing Audit Checklist (P/N 007-4942-00x)
This document contains the network configuration/validation switch IP addresses for your system.
• SGI Altix® Systems Dual-Port Gigabit Ethernet Board User's Guide, Publication Number 007-4326-00x
This guide describes the two versions of the optional SGI dual-port Gigabit Ethernet board, shows you how to connect the boards to an Ethernet network, and explains how to operate the boards. You can use the dual-port Gigabit Ethernet board to replace or supplement the built-in Ethernet network adapters in your system.
Third-Party Clustering Documents

The SGI Altix XE Cluster is provided in different configurations, and not all the third-party documents listed here will be applicable to every system. Note that Linux is the only operating system supported with the SGI Altix XE cluster.
• SMC® TigerStack™ II Gigabit Ethernet Switch Management Guide
Use this guide to manage the operations of your SMC8824M 24-port switch or SMC8848M 48-port switch.

Platform Manage Product Guides

• Platform Manage™ User’s Guide
This document provides an overview of a Platform-managed system and instructions for building a Platform Manage administered cluster system.
Customer Service and Removing Parts

If you are experiencing trouble with the cluster and determine that a replacement part will be needed, please contact your SGI service representative using the information in “Contacting the SGI Customer Service Center” on page 31. Return postage information is included with replacement parts.
Latin America  +55 11.5185.2860
Europe         +44 118.912.7500
Japan          +81 3.5488.1811
Asia Pacific   +1 650.933.3000

Cluster Administration Training from SGI

SGI offers customer training classes covering all current systems, including clusters. If you have a maintenance agreement in place with SGI, contact SGI Customer Education at 1-800-361-2621 for information on the time, location, and cost of the applicable training course you are interested in.
Chapter 2. Administrative Tips and Adding a Node

This chapter provides general administrative information as well as basic instructions on starting and using the Platform Manage GUI to add a node in a Platform managed cluster. For information on using the Platform Manage command line interface to add a node, refer to the Platform Manage User’s Guide.
Administrative Tips

Root password and administrative information:
• Root password = sgisgi (head node and compute nodes)
• ipmitool user/password: User = admin, Password = admin

Refer to Table 1-2 on page 16 and Table 1-3 on page 18 for listings of the IPMI IP addresses for nodes.
The Platform Manage installer directory (/usr/local/Platform###) contains the code used to install Platform Cluster Management Software. The Factory-Install directory is located on the head node server at /usr/local/Factory-Install.
Start the Platform Manager GUI

Log in to the Platform Manager interface as root; the factory password is sgisgi. Use your system name and log in as root. Refer to Figure 2-1 for an example.

Note: SGI Altix XE clusters using Altix XE340 or XE270 servers as compute nodes or head nodes must use Platform Manager release 5.7.2 or later.
Head Node Information Screen

You can view and confirm the head node information from the main GUI screen. Click on the node icon (cl1n001 in the example below) for name and subnet information on your cluster head node.
Adding a Node Starting from the Main GUI Screen

Add a node when you need to upgrade. To add a cluster node, open the Clusters tree. Move your cursor over the cluster tree (cluster cl1 in the example screen) and click the right mouse button. Then click the left mouse button on “New” in the popup window. Refer to Figure 2-3.
Adding a Cluster Compute Node

These steps should only be taken if the cluster needs to be upgraded or re-created. Select the option “Extend existing cluster” and provide the number of new servers (2 in the example). Then select the “Cluster Name” (cl1 in the example). Select the server template and click “Next” to move to the following screen.
Selecting the Server Type

Click on “Edit” to bring up the “Node Hardware Configuration” network panel. Scroll down the menu and select the server type you are adding. Then enter the BMC user ID (admin) and the password (admin).
Network BMC Configuration

Click on the “Edit” button. Assign the new BMC IP address, stepping, and BMC host name. Click “OK” when the appropriate information is entered. Click “Next” to move to the following screen.
Select Preferred Operating System

Select the option to provision the new node’s operating system. Enter the sgisgi factory password or whatever new password may have been assigned. Click “Next” to move to the following screen.
Node Network Configuration Screen

Use this screen to assign Ethernet 0 (eth0) as your network interface port. Fill in the additional information as it applies to your local network. Click “OK” to continue.
Enter the default gateway information (refer to the example in Figure 2-9) and select “Next” to continue.
DNS and NTP Configuration Screen

This screen extracts the name server numbers for use with the system configuration files. Enter the appropriate domain name enabling information, or disable the function by un-checking the box. Click “Next” when complete.
NIS Configuration Screen

This screen allows you to specify, enable, or disable a Network Information Service (NIS) for the new node. Assign your domain name (see Figure 2-11 for an example) and click “Next” to go to the following screen.
Platform Manager Options Screen

This screen provides the options shown, including installation of MPI, your software version, monitor options, and more. Click “Next” to move to the following screen.
Configuration Setup Complete Screen

This screen allows you to install the operating system and Platform Manager immediately, or store the configuration for later use. Click “Finish” after you make your selection.
Checking the Log File Entries (Optional)

You can check the log file entries during configuration of the new node(s) to confirm that a log file has been created and to view the entries.
Setting a Node Failure Alarm on Platform Manage

This section shows how to create an alarm using a “Node Down” alarm as an example:

1. Start the GUI. Refer to “Start the Platform Manager GUI” on page 36 if needed.
2. Using the mouse, select the “Edit Alarms” submenu from the “Monitoring” menu item.
3. Select a node (or list of nodes) for which you want to define the alarm.
4. Select “Add Alarm” to add the alarm; a pop-up window appears, see Figure 2-15.
5.
6. At this time you must enter the criteria that trigger the alarm. Click on “Add Criteria” (refer to Figure 2-16).

Figure 2-16 Add Criteria Screen Example

7. Another popup presents itself. For this example we picked a “Filter” criterion for the node status. See the example in Figure 2-17.
Figure 2-17 Define Chart Data Popup Example (Filter Selected)

Next, choose the priority for this alarm. The example assigns a critical priority for the “Node Down” alarm. We want this alarm to be triggered at most once. To enable the alarm, click on “Apply Alarm”; refer to Figure 2-18 on page 53. This alarm does not define any action to be taken when it fires; that can easily be done by selecting a predefined action.
For example, Platform Manager can send an e-mail to a system administrator or e-mail alias. You must pick the appropriate action and supply the e-mail address or alias.
To illustrate how an alarm makes its appearance, we intentionally brought down the node. A few seconds thereafter, the GUI indicates a node failure by changing the node icon in the cluster tree; refer to Figure 2-19. A few seconds later the alarm is triggered and shows up in the alarm log, see Figure 2-20 on page 55.
Figure 2-20 Node Down Alarm Screen Example
Chapter 3. IPMI Commands Overview

This chapter provides a set of example IPMI commands and is not meant to be a comprehensive guide to ipmitool. Its purpose is to briefly describe some of the commonly used IPMI commands to help you get started with your cluster administration.

ipmitool is a command-line utility for issuing common IPMI requests; it allows remote operation. Usage:

ipmitool [-v] [-I interface] [-o oemtype] [-H bmc-ip-address] [-k key] [-U user] [-P password] [-E] command...
User Administration

The BMC supports multiple users; a username/password is required for remote connections.
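A minimal sketch of BMC user administration follows, using stock ipmitool `user` subcommands. The BMC address, channel, user ID, names, and passwords are illustrative only, and the script prints the commands rather than running them, so it can be reviewed before being used against a live BMC.

```shell
#!/bin/sh
# Dry-run sketch: print typical ipmitool user-administration commands
# for BMC channel 1, user ID 2 (all values here are examples).
ipmi_user_cmds() {
    echo "ipmitool -H 10.0.30.1 -U admin -P admin user list 1"
    echo "ipmitool -H 10.0.30.1 -U admin -P admin user set name 2 operator1"
    echo "ipmitool -H 10.0.30.1 -U admin -P admin user set password 2 secret"
}
ipmi_user_cmds
```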
ipmitool lan set 1 netmask x.x.x.x
ipmitool lan set 1 arp respond on
ipmitool lan set 1 arp generate on

To check your lan settings:

ipmitool lan print 1

Serial-over-lan Commands

Serial-Over-Lan (SOL) comes preconfigured and enabled on each node of your cluster.
Connecting to Node Console via SOL

ipmitool sol activate

Deactivating an SOL Connection

In certain cases when using the Platform Manager GUI to access a console, you may need to deactivate the SOL connection from the command line to free up the SOL session:

ipmitool sol deactivate

Sensor Commands

Sensor commands may be used to display objects, individual sensors, or all sensors in a system.
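For instance, the queries below use standard ipmitool sensor subcommands; the sensor name and BMC address are examples (actual sensor names vary by node model), and the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch: print example sensor queries instead of executing
# them against a BMC. Address, credentials, and sensor name are
# illustrative.
sensor_cmds() {
    echo "ipmitool -H 10.0.30.1 -U admin -P admin sdr list"      # all SDR objects
    echo "ipmitool -H 10.0.30.1 -U admin -P admin sensor list"   # all sensors with thresholds
    echo "ipmitool -H 10.0.30.1 -U admin -P admin sensor get 'CPU1 Temp'"  # one named sensor
}
sensor_cmds
```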
Chassis Commands Chassis Commands Use the following chassis commands to administer the cluster. Note that you can also use the BMC interface to perform chassis power commands on cluster nodes. Chassis Identify Note: The following ipmitool chassis identify command works only on the SGI Altix XE head node.