OFED+ Host Software Release 1.5.4 User Guide
Information furnished in this manual is believed to be accurate and reliable. However, QLogic Corporation assumes no responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use. QLogic Corporation reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only.
Table of Contents
Preface: Intended Audience; Related Materials; Documentation Conventions; License Agreements; Technical Support
1 Introduction
2 Step-by-Step Cluster Setup and MPI Usage Checklists
3 InfiniBand® Cluster Setup and Administration
4 Running MPI on QLogic Adapters
5 Using Other MPIs
6 SHMEM Description and Configuration
7 Virtual Fabric support in PSM
8 Dispersive Routing
9 gPXE
A Benchmark Programs
Additional appendices cover integration with batch queuing systems, troubleshooting (installation, cluster administration, MPI, SRP, and ULPs such as VirtualNIC), and the command and utility reference (iba_packet_capture, ibhosts, ibstatus, ibtracert, ibv_devinfo, ident, ipath_checkout, and others).
Preface
The QLogic OFED+ Host Software User Guide shows end users how to use the installed software to set up the fabric. End users include both the cluster administrator and the Message-Passing Interface (MPI) application programmers, who have different but overlapping interests in the details of the technology.
Preface License Agreements Text in blue font indicates a hyperlink (jump) to a figure, table, or section in this guide, and links to Web sites are shown in underlined blue. For example: Table 9-2 lists problems related to the user interface and remote agent. See “Installation Checklist” on page 3-6. For more information, visit www.qlogic.com. Text in bold font indicates user interface elements such as menu items, buttons, check boxes, or column headings.
Preface Technical Support
Customers should contact their authorized maintenance provider for technical support of their QLogic products. QLogic-direct customers may contact QLogic Technical Support; others will be redirected to their authorized maintenance provider. Visit the QLogic support Web site listed in Contact Information for the latest firmware and software updates.
Preface Technical Support Knowledge Database The QLogic knowledge database is an extensive collection of QLogic product information that you can search for specific solutions. We are constantly adding to the collection of information in our database to provide answers to your most urgent questions. Access the database from the QLogic Support Center: http://support.qlogic.com.
1 Introduction
How this Guide is Organized
The QLogic OFED+ Host Software User Guide is organized into these sections:
Section 1 provides an overview and describes interoperability.
Section 2 describes how to set up your cluster to run high-performance MPI jobs.
Section 3 describes the lower levels of the supplied QLogic OFED+ Host software. This section is of interest to an InfiniBand® cluster administrator.
1–Introduction Overview Appendix C, describes two methods the administrator can use to allow users to submit MPI jobs through batch queuing systems. Appendix D, provides information for troubleshooting installation, cluster administration, and MPI. Appendix E, provides information for troubleshooting the upper layer protocol utilities in the fabric.
1–Introduction Interoperability An embedded subnet manager can be used in one or more managed switches. QLogic offers the QLogic Embedded Fabric Manager (FM) for both DDR and QDR switch product lines supplied by your IB switch vendor. A host-based subnet manager can be used. QLogic provides the QLogic Fabric Manager (FM), as a part of the QLogic InfiniBand® Fabric Suite (IFS). Interoperability QLogic OFED+ participates in the standard IB subnet management protocols for configuration and monitoring.
2 Step-by-Step Cluster Setup and MPI Usage Checklists This section describes how to set up your cluster to run high-performance Message Passing Interface (MPI) jobs. Cluster Setup Perform the following tasks when setting up the cluster. These include BIOS, adapter, and system settings. 1.
2–Step-by-Step Cluster Setup and MPI Usage Checklists
8. Set up the host environment to use ssh. Two methods are discussed in “Host Environment Setup for MPI” on page 3-40.
9. Verify the cluster setup. See “Checking Cluster and Software Status” on page 3-44.
Using MPI
1. Verify that the QLogic hardware and software has been installed on all the nodes you will be using, and that ssh is set up on your cluster (see all the steps in the Cluster Setup checklist).
2. Set up Open MPI.
3 InfiniBand® Cluster Setup and Administration This section describes what the cluster administrator needs to know about the QLogic OFED+ software and system administration. Introduction The IB driver ib_qib, QLogic Performance Scaled Messaging (PSM), accelerated Message-Passing Interface (MPI) stack, the protocol and MPI support libraries, and other modules are components of the QLogic OFED+ software. This software provides the foundation that supports the MPI implementation.
3–InfiniBand® Cluster Setup and Administration Installed Layout
This section describes the default installed layout for the QLogic OFED+ software and QLogic-supplied MPIs. QLogic-supplied Open MPI, MVAPICH, and MVAPICH2 RPMs with PSM support and compiled with GCC, PGI, and the Intel compilers are installed in directories using the following format: /usr/mpi/<compiler>/<mpi>-<mpi_version>-qlc For example: /usr/mpi/gcc/openmpi-1.
3–InfiniBand® Cluster Setup and Administration IB and OpenFabrics Driver Overview IB and OpenFabrics Driver Overview The ib_qib module provides low-level QLogic hardware support, and is the base driver for both MPI/PSM programs and general OpenFabrics protocols such as IPoIB and sockets direct protocol (SDP). The driver also supplies the Subnet Management Agent (SMA) component.
3–InfiniBand® Cluster Setup and Administration IPoIB Network Interface Configuration This example assumes that no hosts files exist, the host being configured has the IP address 10.1.17.3, and DHCP is not used. NOTE Instructions are only for this static IP address case. Configuration methods for using DHCP will be supplied in a later release. 1. Type the following command (as a root user): ifconfig ib0 10.1.17.3 netmask 0xffffff00 2.
3–InfiniBand® Cluster Setup and Administration IPoIB Administration NOTE The configuration must be repeated each time the system is rebooted. IPoIB-CM (Connected Mode) is enabled by default. The setting in /etc/infiniband/openib.conf is SET_IPOIB_CM=yes. To use datagram mode, change the setting to SET_IPOIB_CM=no. Setting can also be changed when asked during initial installation (./INSTALL).
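For example, to confirm the current mode and switch to datagram mode, a sketch of the steps (the driver must be restarted for the change to take effect):
# grep SET_IPOIB_CM /etc/infiniband/openib.conf
SET_IPOIB_CM=yes
# sed -i 's/^SET_IPOIB_CM=yes/SET_IPOIB_CM=no/' /etc/infiniband/openib.conf
# /etc/init.d/openibd restart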
3–InfiniBand® Cluster Setup and Administration IB Bonding NAME field specified in the CREATE block. The following is an example of the ifcfg-NAME file: DEVICE=ib1 BOOTPROTO=static BROADCAST=192.168.18.255 IPADDR=192.168.18.120 NETMASK=255.255.255.0 ONBOOT=yes NM_CONTROLLED=no NOTE For IPoIB, the INSTALL script for the adapter now helps the user create the ifcfg files. 2. After modifying the /etc/sysconfig/ipoib.cfg file, restart the IPoIB driver with the following: /etc/init.
3–InfiniBand® Cluster Setup and Administration IB Bonding Red Hat EL5 and EL6 The following is an example for bond0 (master). The file is named /etc/sysconfig/network-scripts/ifcfg-bond0: DEVICE=bond0 IPADDR=192.168.1.1 NETMASK=255.255.255.0 NETWORK=192.168.1.0 BROADCAST=192.168.1.255 ONBOOT=yes BOOTPROTO=none USERCTL=no MTU=65520 BONDING_OPTS="primary=ib0 updelay=0 downdelay=0" The following is an example for ib0 (slave).
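A minimal sketch of the corresponding slave file, /etc/sysconfig/network-scripts/ifcfg-ib0, assuming bond0 is the master and following standard Red Hat EL5/EL6 bonding conventions:
DEVICE=ib0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
USERCTL=no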
3–InfiniBand® Cluster Setup and Administration IB Bonding SuSE Linux Enterprise Server (SLES) 10 and 11 The following is an example for bond0 (master). The file is named /etc/sysconfig/network-scripts/ifcfg-bond0: DEVICE="bond0" TYPE="Bonding" IPADDR="192.168.1.1" NETMASK="255.255.255.0" NETWORK="192.168.1.0" BROADCAST="192.168.1.
3–InfiniBand® Cluster Setup and Administration IB Bonding Verify the following line is set to the value of yes in /etc/sysconfig/boot: RUN_PARALLEL="yes" Verify IB Bonding is Configured After the configuration scripts are updated, and the service network is restarted or a server reboot is accomplished, use the following CLI commands to verify that IB bonding is configured.
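Typical verification commands are sketched below; /proc/net/bonding/bond0 is the standard Linux bonding status file and lists the bonding mode, the currently active slave, and the state of each IB slave interface:
# cat /proc/net/bonding/bond0
# ifconfig bond0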
3–InfiniBand® Cluster Setup and Administration Subnet Manager Configuration Example of ifconfig output: st2169:/etc/sysconfig # ifconfig bond0 Link encap:InfiniBand HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:192.168.1.1 Mask:255.255.255.0 Bcast:192.168.1.
3–InfiniBand® Cluster Setup and Administration Subnet Manager Configuration OpenSM is a component of the OpenFabrics project that provides a Subnet Manager (SM) for IB networks. This package can optionally be installed on any machine, but only needs to be enabled on the machine in the cluster that will act as a subnet manager. You cannot use OpenSM if any of your IB switches provide a subnet manager, or if you are running a host-based SM, for example the QLogic Fabric Manager.
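As a sketch, using the standard service scripts, OpenSM would be enabled only on the node chosen to act as the subnet manager:
# /sbin/chkconfig opensmd on
# /etc/init.d/opensmd start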
3–InfiniBand® Cluster Setup and Administration QLogic Distributed Subnet Administration QLogic Distributed Subnet Administration As InfiniBand® clusters are scaled into the Petaflop range and beyond, a more efficient method for handling queries to the Fabric Manager is required. One of the issues is that while the Fabric Manager can configure and operate that many nodes, under certain conditions it can become overloaded with queries from those same nodes.
3–InfiniBand® Cluster Setup and Administration QLogic Distributed Subnet Administration Virtual Fabrics and the Distributed SA The IBTA standard states that applications can be identified by a Service ID (SID). The QLogic Fabric Manager uses SIDs to identify applications. One or more applications can be associated with a Virtual Fabric using the SID.
3–InfiniBand® Cluster Setup and Administration QLogic Distributed Subnet Administration If you are using the QLogic Fabric Manager in its default configuration, and you are using the standard QLogic PSM SIDs, this arrangement will work fine and you will not need to modify the Distributed SA's configuration file; but notice that the Distributed SA has restricted the range of SIDs it cares about to those that were defined in its configuration file.
[Figures: Distributed SA Multiple Virtual Fabrics examples. The diagrams show Virtual Fabrics such as Admin, Reserved, Storage, PSM_MPI, and Default with their PKeys and the SID ranges the Distributed SA looks for in each.]
3–InfiniBand® Cluster Setup and Administration QLogic Distributed Subnet Administration Second, the Distributed SA handles overlaps by taking advantage of the fact that Virtual Fabrics have unique numeric indexes. These indexes are assigned by the QLogic Fabric Manager in the order in which the Virtual Fabrics appear in the configuration file. These indexes can be seen by using the iba_saquery -o vfinfo command.
3–InfiniBand® Cluster Setup and Administration QLogic Distributed Subnet Administration SID The SID is the primary configuration setting for the Distributed SA, and it can be specified multiple times. The SIDs identify applications that will use the Distributed SA to determine their path records. The default configuration for the Distributed SA includes all the SIDs defined in the default QLogic Fabric Manager configuration for use by MPI.
3–InfiniBand® Cluster Setup and Administration QLogic Distributed Subnet Administration Dbg This parameter controls how much logging the Distributed SA will do. It can be set to a number between one and seven, where one indicates no logging and seven includes informational and debugging messages. To change the Dbg setting for Distributed SA, find the line in qlogic_sa.conf that reads Dbg=5 and change it to a different value, between 1 and 7.
3–InfiniBand® Cluster Setup and Administration Changing the MTU Size
The Maximum Transfer Unit (MTU) size enabled by the IB HCA and set by the driver is 4KB. To see the current MTU size, and the maximum supported by the adapter, type the command:
$ ibv_devinfo
If the switches are set at 2K MTU size, then the HCA will automatically use this as the active MTU size; there is no need to change any file on the hosts.
3–InfiniBand® Cluster Setup and Administration Managing the ib_qib Driver NOTE To use 4K MTU, set the switch to have the same 4K default. If you are using QLogic switches, the following applies: For the Externally Managed 9024, use 4.2.2.0.3 firmware (9024DDR4KMTU_firmware.emfw) for the 9024 EM. This has the 4K MTU default, for use on fabrics where 4K MTU is required. If 4K MTU support is not required, then use the 4.2.2.0.2 DDR *.emfw file for DDR externally-managed switches.
3–InfiniBand® Cluster Setup and Administration Managing the ib_qib Driver See the ib_qib man page for more details. Configure the ib_qib Driver State Use the following commands to check or configure the state. These methods will not reboot the system. To check the configuration state, use this command.
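A sketch, assuming the standard openibd init script shipped with QLogic OFED+:
# /sbin/chkconfig --list openibd
To configure the driver to start, or not start, at boot:
# /sbin/chkconfig openibd on
# /sbin/chkconfig openibd off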
3–InfiniBand® Cluster Setup and Administration Managing the ib_qib Driver You can check to see if opensmd is configured to autostart by using the following command (as a root user); if there is no output, opensmd is not configured to autostart: # /sbin/chkconfig --list opensmd | grep -w on Unload the Driver/Modules Manually You can also unload the driver/modules manually without using /etc/init.d/openibd.
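A sketch of a manual unload sequence, assuming no processes are using the adapter and that IPoIB is the only dependent module loaded (other ULP modules such as ib_srp would have to be removed first if present):
# /sbin/modprobe -r ib_ipoib
# /sbin/modprobe -r ib_qib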
3–InfiniBand® Cluster Setup and Administration More Information on Configuring and Loading Drivers /ipathfs/1/counter_names /ipathfs/1/counters The driver_stats file contains general driver statistics. There is one numbered subdirectory per IB device on the system.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips Performance Tuning Tuning compute or storage (client or server) nodes with IB HCAs for MPI and verbs performance can be accomplished in several ways: Run the ipath_perf_tuning script in automatic mode (See “Performance Tuning using ipath_perf_tuning Tool” on page 3-34) (easiest method) Run the ipath_perf_tuning script in interactive mode (See “Performance Tuning using ipath_perf_tuning Tool” on page 3-34 or see man
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips If cpuspeed or powersaved are being used as part of implementing Turbo modes to increase CPU speed, then they can be left on. With these daemons left on, IB micro-benchmark performance results may be more variable from run-to-run. For compute nodes, set the default runlevel to 3 to reduce overheads due to unneeded processes. Reboot the system for this change to take effect.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips Increasing the number of kernel receive queues allows more CPU cores to be involved in the processing of verbs traffic. This is important when using parallel file systems such as Lustre or IBM's GPFS (General Parallel File System). The module parameter that sets this number is krcvqs.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips A value of 1 is the default setting, so if the table recommends '1', krcvqs does not need to be set. In the rare case that the node has more than 64 cores, and it is desired to run MPI on more than 64 cores, then two HCAs are required and settings can be made, using the rules in Table 3-2, as though half the cores were assigned to each HCA.
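As a sketch of how a chosen krcvqs value is applied, a line such as the following would be added to the ib_qib modprobe configuration file described in “Affected Files” on page 3-37 (the value 4 is only an illustration; use the value recommended by Table 3-2 for your core count and workload), after which the driver must be restarted or the node rebooted:
options ib_qib krcvqs=4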
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips For setting all C-States to 0 where there is no BIOS support: 1. Add kernel boot option using the following command: processor.max_cstate=0 2. Reboot the system. If the node uses a single-port HCA, and is not a part of a parallel file system cluster, there is no need for performance tuning changes to a modprobe configuration file.
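As a sketch of step 1 above on a grub-based distribution (the file name and kernel version shown are illustrative; the option is simply appended to the existing kernel line in /boot/grub/grub.conf, followed by a reboot):
kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=/dev/sda1 processor.max_cstate=0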
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips High Risk Tuning for Intel Harpertown CPUs
The following tuning for the Harpertown generation of Intel Xeon CPUs entails a higher risk factor but includes a bandwidth benefit. For nodes with Intel Harpertown Xeon 54xx CPUs, you can add pcie_caps=0x51 and pcie_coalesce=1 to the modprobe.conf file.
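A sketch of the corresponding line in the ib_qib modprobe configuration file (the same file used for the other ib_qib options discussed in this section):
options ib_qib pcie_caps=0x51 pcie_coalesce=1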
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips Additional Driver Module Parameter Tunings Available Setting driver module parameters on Per-unit or Per-port basis The ib_qib driver allows the setting of different driver parameter values for the individual HCAs and ports. This allows the user to specify different values for each port on a HCA or different values for each HCA in the system.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips value is the parameter value for the particular unit or port. The fields in the square brackets are optional; however, either a default or a per-unit/per-port value is required.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips This command lets the driver automatically decide on the allocation behavior and disables this feature on platforms with AMD and Intel Westmere-or-earlier CPUs, while enabling it on newer Intel CPUs. Tunable options:
options ib_qib numa_aware=0
This command disables the NUMA awareness when allocating memory within the driver. Memory allocation requests will be satisfied on the NUMA node of the CPU that executes the request.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips For example: # cat /etc/modprobe.d/ib_ipoib.conf alias ib0 ib_ipoib alias ib1 ib_ipoib options ib_ipoib recv_queue_size=512 Performance Tuning using ipath_perf_tuning Tool The ipath_perf_tuning tool is intended to adjust parameters to the IB QIB driver to optimize the IB and application performance.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips Table 3-3. Checks Performed by ipath_perf_tuning Tool
cstates: Check whether (and which) C-States are enabled. C-States should be turned off for best performance.
services: Check whether certain system services (daemons) are enabled. These services should be turned off for best performance.
The values picked for the various checks and tests may depend on the type of node being configured.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips AUTOMATIC vs. INTERACTIVE MODE The tool performs different functions when running in automatic mode compared to running in the interactive mode. The differences include the node type selection, test execution, and applying the results of the executed tests. Node Type Selection The tool is capable of configuring compute nodes or storage nodes (see Compute Nodes and Storage (Client or Server) Nodes).
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips Table 3-5. Test Execution Modes Test services Mode Test is performed in both modes but the user is notified of running services only if the tool is in interactive mode. In that case, the user is queried whether to turn the services off. Applying the Results Automatic mode versus interactive mode also has an effect when the tool is committing the changes to the system.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips rpm (see “rpm” on page G-32) strings (see “strings” on page G-32) NOTE Run these tools to gather information before reporting problems and requesting support. Adapter and Other Settings The following adapter and other settings can be adjusted for better performance. NOTE For the most current information on performance tuning refer to the QLogic OFED+ Host Software Release Notes.
3–InfiniBand® Cluster Setup and Administration Performance Settings and Management Tips Remove Unneeded Services The cluster administrator can enhance application performance by minimizing the set of system services running on the compute nodes. Since these are presumed to be specialized computing appliances, they do not need many of the service daemons normally running on a general Linux computer. Following are several groups constituting a minimal necessary set of services.
3–InfiniBand® Cluster Setup and Administration Host Environment Setup for MPI Other services may be required by your batch queuing system or user community. If your system is running the daemon irqbalance, QLogic recommends turning it off. Disabling irqbalance will enable more consistent performance with programs that use interrupts. Use this command: # /sbin/chkconfig irqbalance off See “Erratic Performance” on page D-10 for more information.
3–InfiniBand® Cluster Setup and Administration Host Environment Setup for MPI “Configuring for ssh Using ssh-agent” on page 3-43 shows how an individual user can accomplish the same thing using ssh-agent. The example in this section assumes the following: Both the cluster nodes and the front end system are running the openssh package as distributed in current Linux systems.
3–InfiniBand® Cluster Setup and Administration Host Environment Setup for MPI 3. On each of the IB node systems, create or edit the file /etc/ssh/ssh_known_hosts. You will need to copy the contents of the file /etc/ssh/ssh_host_dsa_key.pub from ip-fe to this file (as a single line), and then edit that line to insert ip-fe ssh-dss at the beginning of the line. This is very similar to the standard known_hosts file for ssh.
3–InfiniBand® Cluster Setup and Administration Host Environment Setup for MPI At this point, any end user should be able to login to the ip-fe front end system and use ssh to login to any IB node without being prompted for a password or pass phrase. Configuring for ssh Using ssh-agent The ssh-agent, a daemon that caches decrypted private keys, can be used to store the keys. Use ssh-add to add your private keys to ssh-agent’s cache.
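A sketch of the typical per-user sequence with standard OpenSSH commands (default key locations assumed; ssh-keygen is only needed once to create the key pair):
$ ssh-keygen -t rsa
$ eval `ssh-agent`
$ ssh-add
ssh-add prompts for the pass phrase and then caches the decrypted private key for the life of the agent.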
3–InfiniBand® Cluster Setup and Administration Checking Cluster and Software Status 5. Finally, test by logging into the front end node, and from the front end node to a compute node, as follows: $ ssh frontend_node_name $ ssh compute_node_name For more information, see the man pages for ssh(1), ssh-keygen(1), ssh-add(1), and ssh-agent(1). Process Limitation with ssh Process limitation with ssh is primarily an issue when using the mpirun option -distributed=off.
3–InfiniBand® Cluster Setup and Administration Checking Cluster and Software Status iba_opp_query iba_opp_query is used to check the operation of the Distributed SA. You can run it from any node where the Distributed SA is installed and running, to verify that the replica on that node is working correctly. See “iba_opp_query” on page G-4 for detailed usage information.
3–InfiniBand® Cluster Setup and Administration Checking Cluster and Software Status mtu 0x4 rate 0x6 pkt_life 0x10 preference 0x0 resv2 0x0 resv3 0x0 ibstatus Another useful program is ibstatus that reports on the status of the local HCAs.
3–InfiniBand® Cluster Setup and Administration Checking Cluster and Software Status ibv_devinfo ibv_devinfo queries RDMA devices. Use the -v option to see more information. Sample usage: $ ibv_devinfo hca_id: qib0 fw_ver: 0.0.
4 Running MPI on QLogic Adapters This section provides information on using the Message-Passing Interface (MPI) on QLogic IB HCAs. Examples are provided for setting up the user environment, and for compiling and running MPI programs. Introduction The MPI standard is a message-passing library or collection of routines used in distributed-memory parallel programming. It is used in data exchange and task synchronization between processes.
4–Running MPI on QLogic Adapters Open MPI Installation Follow the instructions in the QLogic Fabric Software Installation Guide for installing Open MPI. Newer versions of Open MPI released after this QLogic OFED+ release will not be supported (refer to the OFED+ Host Software Release Notes for version numbers). QLogic does not recommend installing any newer versions of Open MPI. If a newer version is required it can be found on the Open MPI web site (http://www.open-mpi.
4–Running MPI on QLogic Adapters Open MPI Table 4-2. Command Line Options for Scripts Command Meaning man mpicc (mpif90, mpicxx, etc.) Provides help -showme Lists each of the compiling and linking commands that would be called without actually invoking the underlying compiler -showme:compile Shows the compile-time flags that would be supplied to the compiler -showme:link Shows the linker flags that would be supplied to the compiler for the link phase.
4–Running MPI on QLogic Adapters Open MPI The first choice will use verbs by default, and any with the _qlc string will use PSM by default. If you chose openmpi_gcc_qlc-1.4.3, for example, then the following simple mpirun command would run using PSM:
$ mpirun -np 4 -machinefile mpihosts mpi_app_name
To run over IB Verbs instead of the default PSM transport in openmpi_gcc_qlc-1.4.3, the PSM transport must be bypassed and a verbs-capable transport selected on the mpirun command line.
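A sketch of such a command, assuming Open MPI 1.4.x component names (ob1 is the point-to-point layer that uses BTLs, and openib is the verbs BTL); excluding the PSM MTL with -mca mtl ^psm achieves a similar effect:
$ mpirun -np 4 -machinefile mpihosts -mca pml ob1 -mca btl sm,self,openib mpi_app_name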
4–Running MPI on QLogic Adapters Open MPI Configuring MPI Programs for Open MPI When configuring an MPI program (generating header files and/or Makefiles) for Open MPI, you usually need to specify mpicc, mpicxx, and so on as the compiler, rather than gcc, g++, etc.
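For example, a typical autoconf-based MPI application would be configured along these lines (a sketch; the exact variable names depend on the package's build system):
$ ./configure CC=mpicc CXX=mpicxx F77=mpif77 FC=mpif90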
4–Running MPI on QLogic Adapters Open MPI The easiest way to use other compilers with any MPI that comes with QLogic OFED+ is to use mpi-selector to change the selected MPI/compiler combination, see “Managing MVAPICH, and MVAPICH2 with the mpi-selector Utility” on page 5-5. These compilers can be invoked on the command line by passing options to the wrapper scripts. Command line options override environment variables, if set. Tables 4-3 and 4-4 show the options for each of the compilers. In each case, ....
4–Running MPI on QLogic Adapters Open MPI For Fortran 90 programs: $ mpif90 -f90=pgf90 -show pi3f90.f90 -o pi3f90 pgf90 -I/usr/include/mpich/pgi5/x86_64 -c -I/usr/include pi3f90.f90 -c pgf90 pi3f90.o -o pi3f90 -lmpichf90 -lmpich -lmpichabiglue_pgi5 Fortran 95 programs will be similar to the above. For C programs: $ mpicc -cc=pgcc -show cpi.c pgcc -c cpi.c pgcc cpi.o -lmpich -lpgftnrtl -lmpichabiglue_pgi5 Compiler and Linker Variables When you use environment variables (e.g.
4–Running MPI on QLogic Adapters Open MPI Table 4-5. Available Hardware and Software Contexts Adapter QLE7342/ QLE7340 Available Hardware Contexts (same as number of supported CPUs) Available Contexts when Software Context Sharing is Enabled 16 64 The default hardware context/CPU mappings can be changed on the QDR IB Adapters (QLE734x). See “IB Hardware Contexts on the QDR IB Adapters” on page 4-8 for more details. Context sharing is enabled by default.
4–Running MPI on QLogic Adapters Open MPI Performance can be improved in some cases by disabling IB hardware contexts when they are not required so that the resources can be partitioned more effectively. To disable this behavior, explicitly configure for the number you want to use with the cfgctxts module parameter in the modprobe configuration file (see “Affected Files” on page 3-37 for exact file name and location). The maximum that can be set is 18 on QDR IB Adapters.
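As a sketch, limiting the driver to 16 hardware contexts would be done by adding the following line to the ib_qib modprobe configuration file and restarting the driver; the value must not exceed the 18-context maximum noted above:
options ib_qib cfgctxts=16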
4–Running MPI on QLogic Adapters Open MPI To explicitly disable context sharing, set this environment variable in one of the two following ways: PSM_SHAREDCONTEXTS=0 PSM_SHAREDCONTEXTS=NO The default value of PSM_SHAREDCONTEXTS is 1 (enabled).
4–Running MPI on QLogic Adapters Open MPI Context Sharing Error Messages The error message when the context limit is exceeded is: No free InfiniPath contexts available on /dev/ipath This message appears when the application starts. Error messages related to contexts may also be generated by ipath_checkout or mpirun. For example: PSM found 0 available contexts on InfiniPath device The most likely cause is that the cluster has processes using all the available PSM contexts.
4–Running MPI on QLogic Adapters Open MPI mpihosts File Details As noted in “Create the mpihosts File” on page 4-3, a hostfile (also called machines file, nodefile, or hostsfile) has been created in your current working directory. This file names the nodes on which the node programs may run. The two supported formats for the hostfile are:
hostname1
hostname2
...
or
hostname1 slots=process_count
hostname2 slots=process_count
...
4–Running MPI on QLogic Adapters Open MPI The command line option -hostfile can be used as shown in the following command line:
$ mpirun -np n -hostfile mpihosts [other options] program-name
The option -machinefile is a synonym for -hostfile. In this case, if the named file cannot be opened, the MPI job fails. An alternate mechanism to -hostfile for specifying hosts is the -H, -hosts, or --host option followed by a host list.
4–Running MPI on QLogic Adapters Open MPI This option spawns n instances of program-name. These instances are called node programs. Generally, mpirun tries to distribute the specified number of processes evenly among the nodes listed in the hostfile. However, if the number of processes exceeds the number of nodes listed in the hostfile, then some nodes will be assigned more than one instance of the program.
4–Running MPI on QLogic Adapters Open MPI NOTE The node that invoked mpirun need not be the same as the node where the MPI_COMM_WORLD rank 0 process resides. Open MPI handles the redirection of mpirun's standard input to the rank 0 process. Open MPI directs UNIX standard output and error from remote nodes to the node that invoked mpirun and prints it on the standard output/error of mpirun. Local processes inherit the standard output/error of mpirun and transfer to it directly.
4–Running MPI on QLogic Adapters Open MPI Open MPI adds the base-name of the current node’s bindir (the directory where Open MPI’s executables are installed) to the prefix and uses that to set the PATH on the remote node. Similarly, Open MPI adds the base-name of the current node’s libdir (the directory where Open MPI’s libraries are installed) to the prefix and uses that to set the LD_LIBRARY_PATH on the remote node.
4–Running MPI on QLogic Adapters Open MPI Setting MCA Parameters The -mca switch allows the passing of parameters to various Modular Component Architecture (MCA) modules. MCA modules have direct impact on MPI programs because they allow tunable parameters to be set at run time (such as which BTL communication device driver to use, what parameters to pass to that BTL, and so on.). The -mca switch takes two arguments: key and value.
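For example (a sketch), explicitly selecting the PSM MTL for a run would look like:
$ mpirun -np 4 -machinefile mpihosts -mca mtl psm mpi_app_name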
4–Running MPI on QLogic Adapters Open MPI Environment Variables Table 4-6 contains a summary of the environment variables that are relevant to any PSM including Open MPI. Table 4-7 is more relevant for the MPI programmer or script writer, because these variables are only active after the mpirun command has been issued and while the MPI processes are active. Open MPI provides the environmental variables shown in Table 4-7 that will be defined on every MPI process.
4–Running MPI on QLogic Adapters Open MPI Table 4-6. Environment Variables Relevant for any PSM (Continued) Name IPATH_NO_CPUAFFINITY Description When set to 1, the PSM library will skip trying to set processor affinity. This is also skipped if the processor affinity mask is set to a list smaller than the number of processors prior to MPI_Init() being called. Otherwise the initialization code sets cpu affinity in a way that optimizes cpu and memory locality and load.
4–Running MPI on QLogic Adapters Open MPI Table 4-6. Environment Variables Relevant for any PSM (Continued) Name LD_LIBRARY_PATH Description This variable specifies the path to the run-time library. Default: Unset Table 4-7.
4–Running MPI on QLogic Adapters Open MPI and Hybrid MPI/OpenMP Applications
Open MPI supports hybrid MPI/OpenMP applications, provided that MPI routines are called only by the master OpenMP thread. This approach is called the funneled thread model. Instead of MPI_Init/MPI_INIT (for C/C++ and Fortran respectively), the program can call MPI_Init_thread/MPI_INIT_THREAD to determine the level of thread support, and the value MPI_THREAD_FUNNELED will be returned.
4–Running MPI on QLogic Adapters Debugging MPI Programs NOTE With Open MPI, and other PSM-enabled MPIs, you will typically want to turn off PSM's CPU affinity controls so that the OpenMP threads spawned by an MPI process are not constrained to stay on the CPU core of that process, causing over-subscription of that CPU. Accomplish this using the IPATH_NO_CPUAFFINITY=1 setting as follows:
OMP_NUM_THREADS=8 (typically set in the ~/.bashrc file)
mpirun -np 2 -H host1,host2 -x IPATH_NO_CPUAFFINITY=1 .
4–Running MPI on QLogic Adapters Debugging MPI Programs NOTE The TotalView® debugger can be used with the Open MPI supplied in this release. Consult the TotalView documentation for more information: http://www.open-mpi.
5 Using Other MPIs This section provides information on using other MPI implementations. Detailed information on using Open MPI is provided in Section 4, and will be covered in this Section in the context of choosing among multiple MPIs or in tables which compare the multiple MPIs available. Introduction Support for multiple high-performance MPI implementations has been added. Most implementations run over both PSM and OpenFabrics Verbs (see Table 5-1).
5–Using Other MPIs Installed Layout Table 5-1. Other Supported MPI Implementations (Continued)
Table Notes: MVAPICH and Open MPI have been compiled for PSM to support the following versions of the compilers: (GNU) gcc 4.1.0, (PGI) pgcc 9.0, (Intel) icc 11.1. These MPI implementations run on multiple interconnects, and have their own mechanisms for selecting the interconnect that they run on.
5–Using Other MPIs Open MPI Open MPI Open MPI is an open source MPI-2 implementation from the Open MPI Project. Pre-compiled versions of Open MPI version 1.4.3 that run over PSM and are built with the GCC, PGI, and Intel compilers are available with the QLogic download. Details on Open MPI operation are provided in Section 4. MVAPICH Pre-compiled versions of MVAPICH 1.2 built with the GNU, PGI, and Intel compilers, and that run over PSM, are available with the QLogic download.
5–Using Other MPIs MVAPICH2 Here is an example of a simple mpirun command running with four processes: $ mpirun -np 4 -hostfile mpihosts mpi_app_name Password-less ssh is used unless the -rsh option is added to the command line above. Further Information on MVAPICH For more information about MVAPICH, see: http://mvapich.cse.ohio-state.edu/ MVAPICH2 Pre-compiled versions of MVAPICH2 1.7 built with the GNU, PGI, and Intel compilers, and that run over PSM, are available with the QLogic download.
5–Using Other MPIs Managing MVAPICH, and MVAPICH2 with the mpi-selector Utility Running MVAPICH2 Applications By default, the MVAPICH2 options in mpi-selector with 'qlc' as part of their name run over PSM once it is installed. Here is an example of a simple mpirun command running with four processes: $ mpirun_rsh -np 4 -hostfile mpihosts ./mpi_app_name Further Information on MVAPICH2 For more information about MVAPICH2, see: http://mvapich.cse.ohio-state.edu/support/mvapich2-1.7-quick-start.
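A sketch of typical mpi-selector usage (the exact names reported by --list depend on which MPI RPMs are installed; the name below is illustrative):
$ mpi-selector --list
$ mpi-selector --set mvapich2_gcc_qlc-1.7
$ mpi-selector --query
The new selection takes effect in shells started after the change, for example after logging out and back in.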
5–Using Other MPIs Platform MPI 8 The example shell scripts mpivars.sh and mpivars.csh, for registering with mpi-selector, are provided as part of the mpi-devel RPM in $prefix/share/mpich/mpi-selector-{intel, gnu, pgi} directories. For all non-GNU compilers that are installed outside standard Linux search paths, set up the paths so that compiler binaries and runtime libraries can be resolved. For example, set LD_LIBRARY_PATH, both in your local environment and in an rc file (such as .mpirunrc, .
5–Using Other MPIs Intel MPI to, MPI_ICMOD_PSM__PSM_PATH = "^" Compiling Platform MPI 8 Applications As with Open MPI, QLogic recommends that you use the included wrapper scripts that invoke the underlying compiler (see Table 5-4). Table 5-4. Platform MPI 8 Wrapper Scripts Wrapper Script Name Language mpicc C mpiCC C mpi77 Fortran 77 mpif90 Fortran 90 To compile your program in C using the default compiler, type: $ mpicc mpi_app_name.
5–Using Other MPIs Intel MPI Installation
Follow the instructions for download and installation of Intel MPI from the Intel web site.
Setup
Intel MPI can be run over the Tag Matching Interface (TMI). The setup for Intel MPI is described in the following steps: 1. Make sure that the TMI psm provider is installed on every node and all nodes have the same version installed. The TMI is supplied with the Intel MPI distribution.
5–Using Other MPIs Intel MPI Using DAPL 2.0. $ rpm -qa | grep dapl dapl-devel-static-2.0.19-1 compat-dapl-1.2.14-1 dapl-2.0.19-1 dapl-debuginfo-2.0.19-1 compat-dapl-devel-static-1.2.14-1 dapl-utils-2.0.19-1 compat-dapl-devel-1.2.14-1 dapl-devel-2.0.19-1 2. Verify that there is a /etc/dat.conf file. It should be installed by the dapl- RPM. The file dat.conf contains a list of interface adapters supported by uDAPL service providers. In particular, it must contain mapping entries for OpenIB-cma for dapl 1.
5–Using Other MPIs Intel MPI Substitute bin if using 32-bit. Compiling Intel MPI Applications As with Open MPI, QLogic recommended that you use the included wrapper scripts that invoke the underlying compiler. The default underlying compiler is GCC, including gfortran. Note that there are more compiler drivers (wrapper scripts) with Intel MPI than are listed here (see Table 5-5); check the Intel documentation for more information. Table 5-5.
5–Using Other MPIs Intel MPI uDAPL 1.2: -genv I_MPI_DEVICE rdma:OpenIB-cma uDAPL 2.0: -genv I_MPI_DEVICE rdma:ofa-v2-ib To help with debugging, you can add this option to the Intel mpirun command: TMI: -genv TMI_DEBUG 1 uDAPL: -genv I_MPI_DEBUG 2 Further Information on Intel MPI For more information on using Intel MPI, see: http://www.intel.
5–Using Other MPIs Improving Performance of Other MPIs Over IB Verbs Improving Performance of Other MPIs Over IB Verbs Performance of MPI applications when using an MPI implementation over IB Verbs can be improved by tuning the IB MTU size. NOTE No manual tuning is necessary for PSM-based MPIs, since the PSM layer determines the largest possible IB MTU for each source/destination path. The maximum supported MTU size of IB adapter cards is 4K. Support for 4K IB MTU requires switch support for 4K MTU.
6 SHMEM Description and Configuration Overview QLogic SHMEM is a user-level communications library for one-sided operations. It implements the SHMEM Application Programming Interface (API) and runs on the QLogic IB stack. The SHMEM API provides global distributed shared memory across a network of hosts. Details of the API implementation are included in an appendix. SHMEM is quite distinct from local shared memory (often abbreviated as "shm" or even “shmem”).
6–SHMEM Description and Configuration Installation The -qlc suffix denotes that this is the QLogic PSM version. MVAPICH version 1.2.0 compiled for PSM. This is provided by QLogic IFS and can be found in the following directories: /usr/mpi/gcc/mvapich-1.2.0-qlc /usr/mpi/intel/mvapich-1.2.0-qlc /usr/mpi/pgi/mvapich-1.2.0-qlc The -qlc suffix denotes that this is the QLogic PSM version. MVAPICH2 version 1.7 compiled for PSM.
6–SHMEM Description and Configuration SHMEM Programs By default QLogic SHMEM is installed with a prefix of /usr/shmem/qlogic into the following directory structure:
/usr/shmem/qlogic
/usr/shmem/qlogic/bin
/usr/shmem/qlogic/bin/mvapich
/usr/shmem/qlogic/bin/mvapich2
/usr/shmem/qlogic/bin/openmpi
/usr/shmem/qlogic/lib64
/usr/shmem/qlogic/lib64/mvapich
/usr/shmem/qlogic/lib64/mvapich2
/usr/shmem/qlogic/lib64/openmpi
/usr/shmem/qlogic/include
QLogic recommends that /usr/shmem/qlogic/bin is added onto your $PATH.
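For example, a user could add the following line to a shell startup file such as ~/.bashrc (a bash sketch):
export PATH=/usr/shmem/qlogic/bin:$PATH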
6–SHMEM Description and Configuration SHMEM Programs NOTE These instructions assume a standard SHMEM installation and that /usr/shmem/qlogic/bin has been added to the $PATH. The % character in the previous example is used to indicate the shell prompt and is followed by a command. The program can be compiled and linked using the shmemcc wrapper script: % shmemcc shmem_world.c -o shmem_world The program can be run using the shmemrun wrapper script: % shmemrun -m hosts -np 2 .
6–SHMEM Description and Configuration SHMEM Programs -Wl,--export-dynamic,--allow-shlib-undefined -L $SHMEM_DIR/lib64/default -lqlogic_shmem Where $SHMEM_DIR in both of the options denotes the top-level directory of the SHMEM installation, typically the directory is /usr/shmem/qlogic. The -L option uses the default version of the SHMEM libraries. The default is actually a symbolic link to libraries built for a specific MPI implementation.
6–SHMEM Description and Configuration SHMEM Programs By default mpirun is picked up from the path and is assumed to be called mpirun. Alternatively, the pathname of mpirun can be specified with the $SHMEM_MPIRUN environment variable. There is also support for integration with slurm (see Slurm Integration).
6–SHMEM Description and Configuration QLogic SHMEM Relationship with MPI QLogic SHMEM Relationship with MPI QLogic SHMEM requires the QLogic PSM layer to provide the network transport function and this runs exclusively on QLogic IB HCAs. It also requires a compatible MPI implementation (also running over PSM) to provide program start up and other miscellaneous services.
6–SHMEM Description and Configuration Slurm Integration Slurm Integration QLogic SHMEM relies on an MPI implementation to provide a run-time environment for jobs. This includes job start-up, stdin/stdout/stderr routing, and other low performance control mechanisms. QLogic SHMEM programs are typically started using shmemrun which is a wrapper script around mpirun.
6–SHMEM Description and Configuration Sizing Global Shared Memory The salloc allocates 16 nodes and runs one copy of shmemrun on the first allocated node which then creates the SHMEM processes. shmemrun invokes mpirun, and mpirun determines the correct set of hosts and required number of processes based on the slurm allocation that it is running inside of. Since shmemrun is used in this approach there is no need for the user to set up the environment.
6–SHMEM Description and Configuration Sizing Global Shared Memory NOTE There is a connection between the sizing of the global shared memory and local shared memory because of the mechanism used for accessing global shared memory in a PE that happens to be on the same host. The QLogic SHMEM library pre-allocates room in the virtual address space according to $SHMEM_SHMALLOC_MAX_SIZE (default of 4GB). It then populates this with enough pages to cover $SHMEM_SHMALLOC_INIT_SIZE (default 16MB).
6–SHMEM Description and Configuration Progress Model Alternatively, if $SHMEM_SHMALLOC_BASE_ADDR is specified as 0, then each SHMEM process will independently choose its own base virtual address for the global shared memory segment. In this case, the values for a symmetric allocation using shmalloc() are no longer guaranteed to be identical across the PEs. The QLogic SHMEM implementation takes care of this asymmetry by using offsets relative to the base of the symmetric heap in its protocols.
6–SHMEM Description and Configuration Progress Model Active Progress In the active progress mode SHMEM progress is achieved when the application calls into the SHMEM library. This approach is well matched to applications that call into SHMEM frequently, for example, to have a fine grained mix of SHMEM operations and computation. This mix is typical of many SHMEM applications.
6–SHMEM Description and Configuration Environment Variables SHMEM's long message protocol is disabled. This is because the long message protocol implementation does not support passive progress. The effect of disabling this is to reduce long message bandwidth to that which can be achieved with the short message protocol. There is no effect on the bandwidth for message sizes below the long message break-point, which is set to 16KB by default.
6–SHMEM Description and Configuration Environment Variables Table 6-1. SHMEM Run Time Library Environment Variables (Continued)
$SHMEM_SHMALLOC_CHECK (default: on) - Shared memory consistency checks; set to 0 to disable and 1 to enable. These are good checks for correctness but degrade the performance of shmalloc() and shfree(). These routines are usually not important for benchmark performance, so for now the checks are turned on to catch bugs early.
6–SHMEM Description and Configuration Implementation Behavior Table 6-1. SHMEM Run Time Library Environment Variables (Continued) Environment Variable $SHMEM_PUT_REPLY_COMBINING_COUNT Default 8 Description Number of consecutive put replies on a flow to combine together into a single reply. The command shmemrun automatically propagates SHMEM* environment variables from its own environment to all the SHMEM processes.
6–SHMEM Description and Configuration Implementation Behavior For a put operation, these descriptions use the terms "local completion" and “remote completion”. Once a put is locally complete, the source buffer on the initiating PE is available for reuse. Until a put is locally complete the source buffer must not be modified since that buffer is in use for the put operation. A blocking put is locally complete immediately upon return from the put.
6–SHMEM Description and Configuration Application Programming Interface 8 byte put to a sync location Target side: Wait for the sync location to be written Now it is safe to make observations on all puts prior to fence shmem_int_wait(), shmem_long_wait(), shmem_longlong_wait(), shmem_short_wait(), shmem_wait(), shmem_int_wait_until(), shmem_long_wait_until(), shmem_longlong_wait_until(), shmem_short_wait_until(), shmem_wait_until() - These SHMEM operations are provided for waiting for a va
6–SHMEM Description and Configuration Application Programming Interface and SHMEM Benchmark Programs
[Tables 6-3 through 6-8: SHMEM Application Programming Interface routines and SHMEM benchmark programs.]
7 Virtual Fabric support in PSM Introduction Performance Scaled Messaging (PSM) provides support for full Virtual Fabric (vFabric) integration, allowing users to specify IB Service Level (SL) and Partition Key (PKey), or to provide a configured Service ID (SID) to target a vFabric. Support for using IB path record queries to the QLogic Fabric Manager during connection setup is also available, enabling alternative switch topologies such as Mesh/Torus.
7–Virtual Fabric support in PSM Virtual Fabric Support Virtual Fabric Support Virtual Fabric (vFabric) in PSM is supported with the QLogic Fabric Manager. The latest version of the QLogic Fabric Manager contains a sample qlogic_fm.xml file with pre-configured vFabrics for PSM. Sixteen unique Service IDs have been allocated for PSM enabled MPI vFabrics to ease their testing however any Service ID can be used. Refer to the QLogic Fabric Manager User Guide on how to configure vFabrics.
7–Virtual Fabric support in PSM Using Service ID
Full vFabric integration with PSM is available, allowing the user to specify a SID. For correct operation, PSM requires the following components to be available and configured correctly.
QLogic host Fabric Manager Configuration – PSM MPI vFabrics need to be configured and enabled correctly in the qlogic_fm.xml file. 16 unique SIDs have been allocated in the sample file.
OFED+ library needs to be installed on all nodes.
7–Virtual Fabric support in PSM Verifying SL2VL tables on QLogic 7300 Series Adapters Verifying SL2VL tables on QLogic 7300 Series Adapters iba_saquery can be used to get the SL2VL mapping for any given port however, QLogic 7300 series adapters exports the SL2VL mapping via sysfs files. These files are used by PSM to implement the SL2VL tables automatically. The SL2VL tables are per port and available under /sys/class/infiniband/hca name/ports/port #/sl2vl.
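For example, on a node whose first adapter is named qib0 (the device name and the exact layout under sl2vl are assumptions to verify on your system), the port 1 mapping can be inspected with:
$ ls /sys/class/infiniband/qib0/ports/1/sl2vl
$ cat /sys/class/infiniband/qib0/ports/1/sl2vl/0
where the second command would show the VL assigned to SL 0 on that port.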
8 Dispersive Routing InfiniBand® uses deterministic routing that is keyed from the Destination LID (DLID) of a port. The Fabric Manager programs the forwarding tables in a switch to determine the egress port a packet takes based on the DLID. Deterministic routing can create hotspots even in full bisection bandwidth (FBB) fabrics for certain communication patterns if the communicating node pairs map onto a common upstream link, based on the forwarding tables.
8–Dispersive Routing Internally, PSM utilizes dispersive routing differently for small and large messages. Large messages are any messages greater than or equal to 64K. For large messages, the message is split into message fragments of 128K by default (called a window). Each of these message windows is sprayed across a distinct path between ports. All packets belonging to a window utilize the same path; however, the windows themselves can take different paths through the fabric.
8–Dispersive Routing Static_Dest: The path selection is based on the CPU index of the destination process. Multiple paths can be used if data transfer is to different remote processes within a node. If multiple processes from Node A send a message to a single process on Node B only one path will be used across all processes. Static_Base: The only path that is used is the base path [SLID,DLID] between nodes regardless of the LMC of the fabric or the number of paths available.
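The policy is normally chosen through a PSM environment variable; assuming the variable name PSM_PATH_SELECTION and the value names used by QLogic PSM releases of this generation (an assumption to verify against the release notes), an Open MPI run forcing adaptive selection would look like:
$ mpirun -np 4 -machinefile mpihosts -x PSM_PATH_SELECTION=adaptive mpi_app_name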
9 gPXE gPXE is an open source (GPL) network bootloader. It provides a direct replacement for proprietary PXE ROMs. See http://etherboot.org/wiki/index.php for documentation and general information. gPXE Setup At least two machines and a switch are needed (or connect the two machines back-to-back and run QLogic Fabric Manager on the server). A DHCP server A boot server or http server (can be the same as the DHCP server) A node to be booted Use a QLE7340 or QLE7342 adapter for the node.
9–gPXE Preparing the DHCP Server in Linux
The boot server can serve any of the following:
- A Linux install image like kickstart, which then installs software to the local hard drive(s). Refer to http://www.faqs.org/docs/Linux-HOWTO/KickStart-HOWTO.html
- A second stage boot loader
- A live CD Linux image
- A gPXE script
Required Steps
1. Download a copy of the gPXE image.
9–gPXE Preparing the DHCP Server in Linux Installing DHCP gPXE requires that the DHCP server runs on a machine that supports IP over IB. NOTE Prior to installing DHCP, make sure that QLogic OFED+ is already installed on your DHCP server. 1. Download and install the latest DHCP server from www.isc.org. Standard DHCP fields holding MAC address are not large enough to contain an IPoIB hardware address.
9–gPXE Preparing the DHCP Server in Linux Configuring DHCP 1. From the client host, find the GUID of the HCA by using p1info, or look at the GUID label on the IB adapter. 2. Turn the GUID into a MAC address and specify the port of the IB adapter that is going to be used at the end, using b0 for port 0 or b1 for port 1. For example, for a GUID that reads 0x00117500005a6eec, the MAC address would read: 00:11:75:00:00:5a:6e:ec:b0 3. Add the MAC address to the DHCP server.
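A hypothetical dhcpd.conf host entry using the derived address might look like the following. The host name and fixed address are placeholders, and the entry assumes a dhcpd build (or patch) that accepts the extended nine-byte hardware address described above.
host ib-node-01 {
    # port GUID 0x00117500005a6eec converted to a MAC-style address, port 0 (b0)
    hardware ethernet 00:11:75:00:00:5a:6e:ec:b0;
    fixed-address 172.26.32.100;    # example client IP address
    filename "http://172.26.32.9/images/uniboot/uniboot.php";
}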
9–gPXE Netbooting Over IB 4. Restart the DHCP server Netbooting Over IB The following procedures are an example of netbooting over IB, using an HTTP boot server. Prerequisites Required steps from above have been executed. The BIOS has been configured to enable booting from the IB adapter. The gPXE IB device should be listed as the first boot device. Apache server has been configured with PHP on your network, and is configured to serve pages out of /vault.
9–gPXE Netbooting Over IB 1. Install Apache. 2. Create an images.conf file and a kernels.conf file and place them in the /etc/httpd/conf.d directory. These files set up aliases for the images and kernels directories and tell Apache where to find them:
/images — http://10.252.252.1/images/
/kernels — http://10.252.252.1/kernels/
The following is an example of the images.conf file:
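(A minimal sketch, assuming the images are stored under /vault/images; adjust the directory and access directives for your site.)
Alias /images /vault/images
<Directory /vault/images>
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>
A kernels.conf file is built the same way, aliasing /kernels to the directory that holds the kernels.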
9–gPXE Netbooting Over IB To add an IB driver into the initrd file, the IB modules need to be copied to the diskless image. The host machine needs to be pre-installed with the QLogic OFED+ Host Software that is appropriate for the kernel version the diskless image will run. The QLogic OFED+ Host Software is available for download from http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/default.aspx NOTE The remainder of this section assumes that QLogic OFED+ has been installed on the host machine.
9–gPXE Netbooting Over IB b. The infinipath rpm will install the file /usr/share/infinipath/gPXE/gpxe-qib-modify-initrd with contents similar to the following example. You can either run the script to generate a new initrd image, or use it as an example, and customize as appropriate for your site.
# This assumes you will use the currently running version of linux, and
# that you are starting from a fully configured machine of the same type
# (hardware configuration), and BIOS settings.
9–gPXE Netbooting Over IB
# extract previous contents
gunzip -dc ../initrd-ib-${kern}.img | cpio --quiet -id
# add infiniband modules
mkdir -p lib/ib
find /lib/modules/${kern}/updates -type f | \
  egrep '(iw_cm|ib_(mad|addr|core|sa|cm|uverbs|ucm|umad|ipoib|qib).ko|rdma_|ipoib_helper)' | \
  xargs -I '{}' cp -a '{}' lib/ib
# Some distros have ipoib_helper, others don't require it
if [ -e lib/ib/ipoib_helper ]; then
  helper_cmd='/sbin/insmod /lib/ib/ipoib_helper.
9–gPXE Netbooting Over IB
IFS=' '
v6cmd='/sbin/insmod /lib/'${xfrm}'.ko '"$v6cmd"
crypto=$(modinfo -F depends $xfrm)
if [ ${crypto} ]; then
  cp $(find /lib/modules/$(uname -r) -name ${crypto}.ko) lib
  IFS=' '
  v6cmd='/sbin/insmod /lib/'${crypto}'.ko '"$v6cmd"
fi
fi
fi
# we need insmod to load the modules; if it is not present, copy it
mkdir -p sbin
grep -q insmod ..
9–gPXE Netbooting Over IB
/sbin/insmod /lib/ib/ib_sa.ko
/sbin/insmod /lib/ib/ib_cm.ko
/sbin/insmod /lib/ib/ib_uverbs.ko
/sbin/insmod /lib/ib/ib_ucm.ko
/sbin/insmod /lib/ib/ib_umad.ko
/sbin/insmod /lib/ib/iw_cm.ko
/sbin/insmod /lib/ib/rdma_cm.ko
/sbin/insmod /lib/ib/rdma_ucm.ko
$dcacmd
/sbin/insmod /lib/ib/ib_qib.ko
$helper_cmd
/sbin/insmod /lib/ib/ib_ipoib.
9–gPXE Netbooting Over IB
# and show the differences.
echo -e '\nChanges in files in initrd image\n'
diff Orig-listing New-listing
# copy the new initrd to wherever you have configured the dhcp server to look
# for it (here we assume it's /images)
mkdir -p /images
cp initrd-${kern}.img /images
echo -e '\nCompleted initrd for IB'
ls -l /images/initrd-${kern}.img
c. Run the /usr/share/infinipath/gPXE/gpxe-qib-modify-initrd script to create the initrd.img file. At this stage, the initrd.
9–gPXE Netbooting Over IB The following is an example of a uniboot.php file:
header ( 'Content-type: text/plain' );
function strleft ( $s1, $s2 ) {
    return substr ( $s1, 0, strpos ( $s1, $s2 ) );
}
function baseURL() {
    $s = empty ( $_SERVER["HTTPS"] ) ? '' : ( $_SERVER["HTTPS"] == "on" ) ? "s" : "";
    $protocol = strleft ( strtolower ( $_SERVER["SERVER_PROTOCOL"] ), "/" ).$s;
    $port = ( $_SERVER["SERVER_PORT"] == "80" ) ? "" : ( ":".$_SERVER["SERVER_PORT"] );
    return $protocol."://".$_SERVER['SERVER_NAME'].
9–gPXE HTTP Boot Setup This is the kernel that will boot. This file can be copied from any machine that has RHEL5.3 installed. 2. Start httpd Steps on the gPXE Client 1. Ensure that the HCA is listed as the first bootable device in the BIOS. 2. Reboot the test node(s) and enter the BIOS boot setup. This is highly dependent on the BIOS for the system but you should see a menu for boot options and a submenu for boot devices. Select gPXE IB as the first boot device.
9–gPXE HTTP Boot Setup 5. Create an images.conf file and a kernels.conf file using the examples in Step 2 of Boot Server Setup and place them in the /etc/httpd/conf.d directory. 6. Edit the /etc/dhcpd.conf file to boot the clients using HTTP: filename "http://172.26.32.9/images/uniboot/uniboot.php"; 7. Restart the DHCP server. 8. Start HTTP if it is not already running: /etc/init.
A Benchmark Programs Several MPI performance measurement programs are installed by default with the MPIs you choose to install (such as Open MPI, MVAPICH2 or MVAPICH). This appendix describes a few of these benchmarks and how to run them. Several of these programs are based on code from the group of Dr. Dhabaleswar K. Panda at the Network-Based Computing Laboratory at the Ohio State University. For more information, see: http://mvapich.cse.ohio-state.
A–Benchmark Programs Benchmark 1: Measuring MPI Latency Between Two Nodes The program osu_latency, from Ohio State University, measures the latency for a range of message sizes from 0 bytes to 4 megabytes. It uses a ping-pong method, where the rank zero process initiates a series of sends and the rank one process echoes them back, using the blocking MPI send and receive calls for all operations.
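A typical invocation is shown below; the benchmark path is an assumption based on the Open MPI test layout shown later in this appendix, so adjust it for your installation.
$ mpirun -H host1,host2 \
    /usr/mpi/gcc/openmpi-1.4.3-qlc/tests/osu_benchmarks-3.1.1/osu_latency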
A–Benchmark Programs Benchmark 1: Measuring MPI Latency Between Two Nodes -H (or --hosts) allows the specification of the host list on the command line instead of using a host file (with the -m or -machinefile option). Since only two hosts are listed, this implies that two host programs will be started (as if -np 2 were specified). The output of the program looks like:
# OSU MPI Latency Test v3.1.1
# Size        Latency (us)
0             1.67
1             1.68
2             1.69
4             1.68
8             1.68
16            1.93
32            1.92
64            1.92
128           1.
A–Benchmark Programs Benchmark 2: Measuring MPI Bandwidth Between Two Nodes Benchmark 2: Measuring MPI Bandwidth Between Two Nodes The osu_bw benchmark measures the maximum rate that you can pump data between two nodes.
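It can be launched the same way as osu_latency; the path below is an assumption and should be adjusted for your installation.
$ mpirun -H host1,host2 \
    /usr/mpi/gcc/openmpi-1.4.3-qlc/tests/osu_benchmarks-3.1.1/osu_bw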
A–Benchmark Programs Benchmark 2: Measuring MPI Bandwidth Between Two Nodes Typical output might look like:
# OSU MPI Bandwidth Test v3.1.1
# Size        Bandwidth (MB/s)
1             2.35
2             4.69
4             9.38
8             18.80
16            34.55
32            68.89
64            137.87
128           265.80
256           480.19
512           843.70
1024          1353.48
2048          1984.11
4096          2152.61
8192          2249.00
16384         2680.75
32768         2905.83
65536         3170.05
131072        3224.15
262144        3241.35
524288        3270.21
1048576       3286.05
2097152       3292.64
4194304       3283.
A–Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks Benchmark 3: Messaging Rate Microbenchmarks OSU Multiple Bandwidth / Message Rate test (osu_mbw_mr) osu_mbw_mr is a multi-pair bandwidth and message rate test that evaluates the aggregate uni-directional bandwidth and message rate between multiple pairs of processes. Each of the sending processes sends a fixed number of messages (the window size) back-to-back to the paired receiving process before waiting for a reply from the receiver.
A–Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks This was run on 12-core compute nodes, so we used Open MPI's -npernode 12 option to place 12 MPI processes on each node (for a total of 24) to maximize message rate. Note that the output below indicates that there are 12 pairs of communicating processes.
# OSU MPI Multiple Bandwidth / Message Rate Test v3.1.1
# [ pairs: 12 ] [ window size: 64 ]
# Size        MB/s        Messages/s
1             22.77       22768062.43
2             44.90       22449128.66
4             91.75       22938300.
A–Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks N/2 is dynamically calculated at the end of the run. You can use the -b option to get bidirectional message rate and bandwidth results. Scalability has been improved for larger core-count nodes.
A–Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks The benchmark has been updated with code to dynamically determine which processes are on which host. The following is an example of the output when running mpi_multibw:
$ mpirun -H host1,host2 -npernode 12 \
/usr/mpi/gcc/openmpi-1.4.3-qlc/tests/qlogic/mpi_multibw
# PathScale Modified OSU MPI Bandwidth Test (OSU Version 2.2, PathScale $Revision: 1.1.2.
A–Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks Note the improved message rate at small message sizes of ~25 million compared to the rate of 22.8 million measured with osu_mbw_mr. Also note that it only takes a message of size 121 bytes to generate half of the peak uni-directional bandwidth. The following is an example output when running with the bidirectional option (-b): $ mpirun -H host1,host2 -np 24 \ /usr/mpi/gcc/openmpi-1.4.
A–Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks Note the higher peak bi-directional messaging rate of 34.6 million messages per second at the 1-byte size, compared to 25 million messages per second when run unidirectionally.
B SRP Configuration SRP Configuration Overview SRP stands for SCSI RDMA Protocol. It allows the SCSI protocol to run over IB for Storage Area Network (SAN) usage. SRP interfaces directly to the Linux file system through the SRP Upper Layer Protocol (ULP). SRP storage can be treated as another device. In this release, two versions of SRP are available: QLogic SRP and OFED SRP. QLogic SRP is available as part of the QLogic OFED Host Software, QLogic IFS, Rocks Roll, and Platform PCM downloads.
B–SRP Configuration QLogic SRP Configuration An SRP Initiator Extension is a 64-bit numeric value that is appended to the port GUID of the SRP initiator port, which allows an SRP initiator port to have multiple SRP maps associated with it. Maps apply to the FVIC only; IB-attached storage uses its own mechanism, since maps are not necessary. An SRP Initiator is the combination of an SRP initiator port and an SRP initiator extension.
B–SRP Configuration QLogic SRP Configuration Stopping, Starting and Restarting the SRP Driver To stop the qlgc_srp driver, use the following command: /etc/init.d/qlgc_srp stop To start the qlgc_srp driver, use the following command: /etc/init.d/qlgc_srp start To restart the qlgc_srp driver, use the following command: /etc/init.d/qlgc_srp restart Specifying a Session In the SRP configuration file, a session command is a block of configuration commands, surrounded by begin and end statements.
B–SRP Configuration QLogic SRP Configuration 1. By the port GUID of the IOC, or 2. By the IOC profile string that is created by the VIO device (i.e., a string containing the chassis GUID, the slot number, and the IOC number). The FVIC creates the device in this manner; other devices have their own naming methods. To specify the host IB port to use, the user can either specify the port GUID of the local IB port, or simply use the index numbers of the cards and the ports on the cards.
B–SRP Configuration QLogic SRP Configuration The system returns output similar to the following: st187:~/qlgc-srp-1_3_0_0_1 # ib_qlgc_srp_query QLogic Corporation. Virtual HBA (SRP) SCSI Query Application, version 1.3.0.0.1 1 IB Host Channel Adapter present in system.
B–SRP Configuration QLogic SRP Configuration 0x0000494353535250 service 3 : name SRP.T10:0000000000000004 id 0x0000494353535250 Target Path(s): HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a21dd000021 HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a21dd000021 SRP IOC Profile : Chassis 0x00066A0050000135, Slot 5, IOC 1 SRP IOC GUID : 0x00066a013800016c SRP IU SIZE : 320 SRP IU SG SIZE: 15 SRP IO CLASS : 0xff00 service 0 : name SRP.
B–SRP Configuration QLogic SRP Configuration Enter ib_qlgc_srp_build_cfg. The system provides output similar to the following: # qlgc_srp.cfg file generated by /usr/sbin/ib_qlgc_srp_build_cfg, version 1.3.0.0.17, on Mon Aug 25 13:42:16 EDT 2008 #Found QLogic OFED SRP registerAdaptersInOrder: ON # ============================================================= # IOC Name: BC2FC in Chassis 0x0000000000000000, Slot 6, Ioc 1 # IOC GUID: 0x00066a01e0000149 SRP IU SIZE : 320 # service 0 : name SRP.
B–SRP Configuration QLogic SRP Configuration noverify: 0 description: "SRP Virtual HBA 0" end The ib_qlgc_srp_build_cfg command creates a configuration file based on discovered target devices. By default, the information is sent to stdout. In order to create a configuration file, output should be redirected to a disk file. Enter ib_qlgc_srp_build_cfg -h for a list and description of the option flags.
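For example, to capture the generated configuration directly into the file read by the driver and activate it (file and init script paths as used elsewhere in this appendix):
ib_qlgc_srp_build_cfg > /etc/sysconfig/qlgc_srp.cfg
/etc/init.d/qlgc_srp restart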
B–SRP Configuration QLogic SRP Configuration NOTE When using this method, if the port GUIDs are changed, they must also be changed in the configuration file. Specifying an SRP Target Port The SRP target can be specified in two different ways. To connect to a particular SRP target no matter where it is in the fabric, use the first method (by IOCGUID).
B–SRP Configuration QLogic SRP Configuration Specifying a SRP Target Port of a Session by IOCGUID The following example specifies a target by IOC GUID: session begin card: 0 port: 1 targetIOCGuid: 0x00066A013800016c #IOC GUID of the InfiniFibre port end 0x00066a10dd000046 0x00066a20dd000046 Specifying a SRP Target Port of a Session by Profile String The following example specifies a target by Profile String: session begin card: 0 port: 1 # FVIC in Chassis 0x00066A005000010E, # Slot number 1, port
B–SRP Configuration QLogic SRP Configuration Restarting the SRP Module For changes to take effect, including changes to the SRP map on the VIO card, SRP will need to be restarted. To restart the qlgc_srp driver, use the following command: /etc/init.d/qlgc_srp restart Configuring an Adapter with Multiple Sessions Each adapter can have an unlimited number of sessions attached to it. Unless round robin is specified, SRP will only use one session at a time.
B–SRP Configuration QLogic SRP Configuration When the qlgc_srp module encounters an adapter command, that adapter is assigned all previously defined sessions (that have not been assigned to other adapters). This makes it easy to configure a system for multiple SRP adapters.
B–SRP Configuration QLogic SRP Configuration
end

adapter
begin
description: "Test Device 1"
end
Configuring Fibre Channel Failover
Fibre Channel failover is essentially failing over from one session in an adapter to another session in the same adapter. Following is a list of the different types of failover scenarios:
- Failing over from one SRP initiator port to another.
- Failing over from a port on the VIO hardware card to another port on the VIO hardware card.
B–SRP Configuration QLogic SRP Configuration Failover Configuration File 1: Failing over from one SRP Initiator port to another In this failover configuration file, the first session (using adapter Port 1) is used to reach the SRP Target Port. If a problem is detected in this session (e.g., the IB cable on port 1 of the adapter is pulled), then the second session (using adapter Port 2) will be used. # service 0: name SRP.
B–SRP Configuration QLogic SRP Configuration
adapterIODepth: 1000
lunIODepth: 16
adapterMaxIO: 128
adapterMaxLUNs: 512
adapterNoConnectTimeout: 60
adapterDeviceRequestTimeout: 2
# set to 1 if you want round robin load balancing
roundrobinmode: 0
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"
end
Failover Configuration File 2: Failing over from a port on the VIO hardware card to another port on the VIO hardware card
session
begin
card: 0 (InfiniS
B–SRP Configuration QLogic SRP Configuration On the VIO hardware side, the following needs to be ensured:
- The target device is discovered and configured for each of the ports that is involved in the failover.
- The SRP Initiator is discovered and configured once for each different initiatorExtension.
- Each map should use a different Configured Device; e.g., Configured Device 1 has the Target being discovered over FC Port 1, and Configured Device 2 has the Target being discovered over FC Port 2.
B–SRP Configuration QLogic SRP Configuration On the VIO hardware side, the following need to be ensured on each FVIC involved in the failover:
- The target device is discovered and configured through the appropriate FC port.
- The SRP Initiator is discovered and configured once for the proper initiatorExtension.
B–SRP Configuration QLogic SRP Configuration
- The target device is discovered and configured through the appropriate FC port.
- The SRP Initiator is discovered and configured once for the proper initiatorExtension.
- The SRP map created for the initiator connects to the same target.
Configuring Fibre Channel Load Balancing
The following examples display typical scenarios for how to configure Fibre Channel load balancing.
B–SRP Configuration QLogic SRP Configuration 2 Adapter Ports and 2 Ports on a Single VIO Module In this example, traffic is load balanced between adapter Port 2/VIO hardware Port 1 and adapter Port 1/VIO hardware Port 1. If one of the sessions goes down (due to an IB cable failure or an FC cable failure), all traffic will begin using the other session.
B–SRP Configuration QLogic SRP Configuration Using the roundrobinmode Parameter In this example, the two sessions use different VIO hardware cards as well as different adapter ports. Traffic will be load-balanced between the two sessions. If there is a failure in one of the sessions (e.g., one of the VIO hardware cards is rebooted), traffic will begin using the other session.
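A minimal adapter fragment with round robin enabled might look like the following; other parameters are omitted here, so see the full adapter examples earlier in this appendix.
adapter
begin
# set to 1 to load balance traffic across all sessions assigned to this adapter
roundrobinmode: 1
description: "SRP Virtual HBA 0"
end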
B–SRP Configuration QLogic SRP Configuration Configuring SRP for Native IB Storage 1. Review ib_qlgc_srp_query. QLogic Corporation. Virtual HBA (SRP) SCSI Query Application, version 1.3.0.0.1 1 IB Host Channel Adapter present in system.
B–SRP Configuration QLogic SRP Configuration 2. Edit /etc/sysconfig/qlgc_srp.cfg to add this information. # service : name SRP.
B–SRP Configuration QLogic SRP Configuration
roundrobinmode: 0
# set to 1 if you do not want target connectivity verification
noverify: 0
description: "SRP Virtual HBA 0"
end
Note the correlation between the output of ib_qlgc_srp_query and qlgc_srp.cfg:
Target Path(s):
HCA 0 Port 1 0x0002c9020026041d -> Target Port GID 0xfe8000000000000000066a11dd000021
HCA 0 Port 2 0x0002c9020026041e -> Target Port GID 0xfe8000000000000000066a11dd000021
qlgc_srp.cfg:
session
begin
. . . .
B–SRP Configuration OFED SRP Configuration Additional Details All LUNs found are reported to the Linux SCSI mid-layer. Linux may need the max_scsi_luns (2.4 kernels) or max_luns (2.6 kernels) parameter configured in scsi_mod. Troubleshooting For troubleshooting information, refer to “Troubleshooting SRP Issues” on page E-9. OFED SRP Configuration To use OFED SRP, follow these steps: 1. Add the line SRP_LOAD=yes to the module list in /etc/infiniband/openib.conf to have it automatically loaded. 2.
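The configuration change in step 1 is a single line in the OpenFabrics configuration file; loading the initiator module by hand, shown commented out, is an assumed shortcut if you do not want to restart the stack immediately.
# /etc/infiniband/openib.conf - enable automatic loading of the SRP initiator
SRP_LOAD=yes
# Optionally (assumed), load the OFED SRP module right away:
# modprobe ib_srp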
B–SRP Configuration OFED SRP Configuration 3.
C Integration with a Batch Queuing System Most cluster systems use some kind of batch queuing system as an orderly way to provide users with access to the resources they need to meet their job’s performance requirements. One task of the cluster administrator is to allow users to submit MPI jobs through these batch queuing systems. For Open MPI, there are resources at openmpi.org that document how to use the MPI with three batch queuing systems.
C–Integration with a Batch Queuing System Clean-up PSM Shared Memory Files This command displays a list of processes using InfiniPath.
C–Integration with a Batch Queuing System Clean-up PSM Shared Memory Files
#!/bin/sh
files=`/bin/ls /dev/shm/psm_shm.* 2> /dev/null`;
for file in $files;
do
    /sbin/fuser $file > /dev/null 2>&1;
    if [ $? -ne 0 ]; then
        /bin/rm $file > /dev/null 2>&1;
    fi;
done;
When the system is idle, the administrators can remove all of the shared memory files, including stale files, by using the following command:
# rm -rf /dev/shm/psm_shm.*
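One way to use the script (the installation path and epilogue hook are hypothetical and depend on your batch system) is to install it on each compute node and call it from the scheduler's job epilogue so that stale files are removed after every job:
install -m 0755 clean_psm_shm.sh /usr/local/sbin/clean_psm_shm.sh
# then reference /usr/local/sbin/clean_psm_shm.sh from the batch system's job epilogue script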
D Troubleshooting This appendix describes some of the tools you can use to diagnose and fix problems.
D–Troubleshooting BIOS Settings Table D-1. LED Link and Data Indicators (Continued)
LED States: Green ON, Amber OFF
Indication: Signal detected and the physical link is up. Ready to talk to the SM to bring the link fully up. If this state persists, the SM may be missing or the link may not be configured. Use ipath_control -i to verify the software state. If all IB adapters are in this state, then the SM is not running. Check the SM configuration, or install and run opensmd.
D–Troubleshooting Kernel and Initialization Issues Driver Load Fails Due to Unsupported Kernel If you try to load the InfiniPath driver on a kernel that InfiniPath software does not support, the load fails. Error messages similar to this display:
modprobe: error inserting '/lib/modules/2.6.3-1.1659-smp/updates/kernel/drivers/infiniband/hw/qib/ib_qib.ko': -1 Invalid module format
To correct this problem, install one of the appropriate supported Linux kernel versions, then reload the driver.
D–Troubleshooting Kernel and Initialization Issues A zero count in all CPU columns means that no InfiniPath interrupts have been delivered to the processor.
D–Troubleshooting Kernel and Initialization Issues InfiniPath ib_qib Initialization Failure There may be cases where ib_qib was not properly initialized. Symptoms of this may show up in error messages from an MPI job or another program.
D–Troubleshooting OpenFabrics and InfiniPath Issues MPI Job Failures Due to Initialization Problems If one or more nodes do not have the interconnect in a usable state, messages similar to the following appear when the MPI program is started: userinit: userinit ioctl failed: Network is down [1]: device init failed userinit: userinit ioctl failed: Fatal Error in keypriv.
D–Troubleshooting OpenFabrics and InfiniPath Issues Manual Shutdown or Restart May Hang if NFS in Use If you are using NFS over IPoIB and use the manual /etc/init.d/openibd stop (or restart) command, the shutdown process may silently hang on the fuser command contained within the script. This is because fuser cannot traverse down the tree from the mount point once the mount point has disappeared. To remedy this problem, the fuser process itself needs to be killed.
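A hedged example of clearing the hang; process listings will vary by system, and the PID shown is a placeholder taken from the ps output.
# Find the fuser invocation spawned by the openibd script
ps -ef | grep '[f]user'
# Kill it by PID so the stop or restart can complete
kill -9 <pid>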
D–Troubleshooting System Administration Troubleshooting ibsrpdm Command Hangs when Two Host Channel Adapters are Installed but Only Unit 1 is Connected to the Switch If multiple IB adapters (unit 0 and unit 1) are installed and only unit 1 is connected to the switch, the ibsrpdm command (to set up an SRP target) can hang. If unit 0 is connected and unit 1 is disconnected, the problem does not occur. When only unit 1 is connected to the switch, use the -d option with ibsrpdm.
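For example, assuming unit 1 corresponds to the second userspace MAD device (verify the device number under /dev/infiniband on your system):
# Run ibsrpdm against the umad device for the connected adapter (unit 1 assumed)
ibsrpdm -d /dev/infiniband/umad1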
D–Troubleshooting Performance Issues Broken Intermediate Link Sometimes message traffic passes through the fabric while other traffic appears to be blocked. In this case, MPI jobs fail to run. In large cluster configurations, switches may be attached to other switches to supply the necessary inter-node connectivity. Problems with these inter-switch (or intermediate) links are sometimes more difficult to diagnose than failure of the final link between a switch and a node.
D–Troubleshooting Performance Issues Erratic Performance Sometimes erratic performance is seen on applications that use interrupts. An example is inconsistent SDP latency when running a program such as netperf. This may be seen on AMD-based systems using the QLE7240 or QLE7280 adapters. If this happens, check to see if the program irqbalance is running. This program is a Linux daemon that distributes interrupts across processors.
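To check for and disable irqbalance on a Red Hat style system (the service and chkconfig commands are assumed to be available on your distribution):
# Is the irqbalance daemon running?
/sbin/service irqbalance status
# Stop it now and keep it from starting at boot
/sbin/service irqbalance stop
/sbin/chkconfig irqbalance off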
D–Troubleshooting Performance Issues This method is not the first choice because, on some systems, there may be two rows of ib_qib output, and you will not know which one of the two numbers to choose. However, if you cannot find $my_irq listed under /proc/irq (Method 1), this type of system most likely has only one line for ib_qib listed in /proc/interrupts, so you can use Method 2.
D–Troubleshooting Open MPI Troubleshooting Performance Warning if ib_qib Shares Interrupts with eth0 When ib_qib shares interrupts with eth0, the performance of the OFED ULPs, such as IPoIB, may be affected. A warning message appears in syslog, and also on the console or tty session where /etc/init.d/openibd start is run (if messages are set up to be displayed).
E ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues To verify that an IB host can access an Ethernet system through the EVIC, issue a ping command to the Ethernet system from the IB host. Make certain that the route to the Ethernet system is using the VIO hardware by using the Linux route command on the IB host, then verify that the route to the subnet is using one of the virtual Ethernet interfaces (i.e., an EIOC).
E–ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues Verify that the proper VirtualNIC driver is running Check that a VirtualNIC driver is running by issuing an lsmod command on the IB host. Make sure that the qlgc_vnic is displayed on the list of modules.
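For example:
# Confirm the VirtualNIC module is loaded
lsmod | grep qlgc_vnic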
E–ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues Verifying that the host can communicate with the I/O Controllers (IOCs) of the VIO hardware To display the Ethernet VIO cards that the host can see and communicate with, issue the command ib_qlgc_vnic_query.
E–ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues
ID: Chassis 0x00066A00010003F2, Slot 1, IOC 3
service entries: 2
service[ 0]: 1000066a00000003 / InfiniNIC.InfiniConSys.Control:03
service[ 1]: 1000066a00000103 / InfiniNIC.InfiniConSys.Data:03
When ib_qlgc_vnic_query is run with the -e option, it reports the IOCGUID information; with the -s option, it reports the IOCSTRING information for the Virtual I/O hardware IOCs present on the fabric.
E–ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues If the host cannot see the applicable IOCs, there are two things to check. First, verify that the adapter port specified in the eioc definition of the /etc/infiniband/qlgc_vnic.cfg file is active. This is done by using the ibv_devinfo command on the host, then checking the value of state. If the state is not Port_Active, the adapter port is not logically connected to the fabric.
E–ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues Another reason why the host might not be able to see the necessary IOCs is that the subnet manager has gone down. Issue an iba_saquery command to make certain that the response shows all of the nodes in the fabric. If an error is returned and the adapter is physically connected to the fabric, then the subnet manager has gone down, and this situation needs to be corrected.
E–ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues
DEVICE=eioc1
BOOTPROTO=static
IPADDR=172.26.48.132
BROADCAST=172.26.63.130
NETMASK=255.255.240.0
NETWORK=172.26.48.0
ONBOOT=yes
TYPE=Ethernet
Example of ifcfg-eiocx setup for SuSE and SLES systems:
BOOTPROTO='static'
IPADDR='172.26.48.130'
BROADCAST='172.26.63.255'
NETMASK='255.255.240.0'
NETWORK='172.26.48.
E–ULP Troubleshooting Troubleshooting VirtualNIC and VIO Hardware Issues There are up to 6 IOC GUIDs on each VIO hardware module (6 for the IB/Ethernet Bridge Module, 2 for the EVIC), one for each Ethernet port.
E–ULP Troubleshooting Troubleshooting SRP Issues Troubleshooting SRP Issues ib_qlgc_srp_stats showing session in disconnected state Problem: If the session is part of a multi-session adapter, ib_qlgc_srp_stats will show it to be in the disconnected state.
E–ULP Troubleshooting Troubleshooting SRP Issues : 0x0000000000000000 Errors Completed Receives : 0x00000000000002c0 | Receive : 0x0000000000000000 Connect Attempts : 0x0000000000000000 : 0x0000000000000000 | Test Attempts Total SWUs : 0x00000000000003e8 | Available : 0x00000000000003e8 SWUs Busy SWUs : 0x00000000000003e8 : 0x0000000000000000 | SRP Req Limit SRP Max ITIU : 0x0000000000000140 : 0x0000000000000140 | SRP Max TIIU Host Busys Used : 0x000000000000000f Session : Disconnected : 0x00000
E–ULP Troubleshooting Troubleshooting SRP Issues Solution: Perhaps an interswitch cable has been disconnected, or the VIO hardware is offline, or the Chassis/Slot does not contain a VIO hardware card. Instead of looking at this file, use the ib_qlgc_srp_query command to verify that the desired adapter port is in the active state. NOTE It is normal to see the "Can not find a path" message when the system first boots up.
E–ULP Troubleshooting Troubleshooting SRP Issues Following is an example: SCSI Host # ROUNDROBIN : 17 | Mode : Trgt Adapter Depth : 1000 | Verify Target Rqst Adapter Depth : 1000 | Rqst LUN Depth Tot Adapter Depth : 1000 | Tot LUN Depth Act Adapter Depth : 998 | Act LUN Depth : 512 | Max IO : 256 | Max SG Depth : Yes : 16 : 16 : 16 Max LUN Scan : 131072 (128 KB) Max Sectors : 33 T/O Session Count : 60 Second(s) Register In Order : 2 Second(s) Description Session : Disconnected Source GI
E–ULP Troubleshooting Troubleshooting SRP Issues SWUs : 0x00000000000003e8 Busy SWUs : 0x00000000000003e8 : 0x0000000000000000 | SRP Req Limit SRP Max ITIU : 0x0000000000000140 : 0x0000000000000140 | SRP Max TIIU Host Busys Used : 0x000000000000000f Session : Disconnected : 0x0000000000000000 | SRP Max SG : Session 2 Source GID | State : 0xfe8000000000000000066a000100d052 Destination GID : 0xfe8000000000000000066a0260000165 SRP IOC Profile 1, IOC 2 SID : Chassis 0x00066A0001000481, Slot SRP T
E–ULP Troubleshooting Troubleshooting SRP Issues Solution 1: The host initiator has not been configured as an SRP initiator on the VIO hardware SRP Initiator Discovery screen. Via Chassis Viewer, bring up the SRP Initiator Discovery screen and either Click on 'Add New' to add a wildcarded entry with the initiator extension to match what is in the session entry in the qlgc_srp.
E–ULP Troubleshooting Troubleshooting SRP Issues Solution: This indicates a problem in the path between the VIO hardware and the target storage device. After an SRP host has connected to the VIO hardware successfully, the host sends a “Test Unit Ready” command to the storage device. After five seconds, if that command is not responded to, the SRP host brings down the session and retries in five seconds.
E–ULP Troubleshooting Troubleshooting SRP Issues Solution 2: Make certain that all sessions have a map to the same disk defined. The fact that the session is active means that the session can see a disk. However, if one of the sessions is using a map with the 'wrong' disk, then the round-robin method could lead to a disk or disks not being seen.
E–ULP Troubleshooting Troubleshooting SRP Issues In a failover configuration, if everything is configured correctly, one session will be Active and the rest will be Connected. The transition of a session from Connected to Active will not be attempted until that session needs to become Active, due to the failure of the previously Active session.
E–ULP Troubleshooting Troubleshooting SRP Issues The system displays information similar to the following:
st106:~ # ibv_devinfo -i 1
hca_id: mthca0
        fw_ver:          5.1.9301
        node_guid:       0006:6a00:9800:6c9f
        sys_image_guid:  0006:6a00:9800:6c9f
        vendor_id:       0x066a
        vendor_part_id:  25218
        hw_ver:          0xA0
        board_id:        SS_0000000005
        phys_port_cnt:   2
        port: 1
                state:       PORT_ACTIVE (4)
                max_mtu:     2048 (4)
                active_mtu:  2048 (4)
                sm_lid:      71
                port_lid:    60
                port_lmc:    0x00
st106:~ # ibv_devinfo -i 2
hca_id: mthca0
        fw_ver:          5.1.
E–ULP Troubleshooting Troubleshooting SRP Issues Need to determine the SRP driver version. Solution: To determine the SRP driver version number, enter the command modinfo -d qlgc-srp, which returns information similar to the following: st159:~ # modinfo -d qlgc-srp QLogic Corp. Virtual HBA (SRP) SCSI Driver, version 1.0.0.0.
F Write Combining Introduction Write Combining improves write bandwidth to the QLogic driver by writing multiple words in a single bus transaction (typically 64 bytes). Write combining applies only to x86_64 systems. The x86 Page Attribute Table (PAT) mechanism allocates Write Combining (WC) mappings for the PIO buffers, and is the default mechanism for WC.
F–Write Combining MTRR Mapping and Write Combining Revert to using MTRR-only behavior by following one of the two suggestions in MTRR Mapping and Write Combining. The driver must be restarted after the changes have been made. NOTE There will not be a WC entry in /proc/mtrr when using PAT. MTRR Mapping and Write Combining Two suggestions for properly enabling MTRR mapping for write combining are described in the following sections.
F–Write Combining Verify Write Combining is Working The test results will list any problems, if they exist, and provide suggestions on what to do. To fix the MTRR registers, use: # ipath_mtrr -w Restart the driver after fixing the registers. This script needs to be run after each system reboot. It can be set to run automatically upon restart by adding this line in /etc/sysconfig/infinipath: IPATH_MTRR_ACTIVE=1 See the ipath_mtrr(8) man page for more information on other options.
F–Write Combining Verify Write Combining is Working Notes F-4 IB0054606-02 A
G Commands and Files The most useful commands and files for debugging, and common tasks, are presented in the following sections. Many of these commands and files have been discussed elsewhere in the documentation. This information is summarized and repeated here for your convenience. Check Cluster Homogeneity with ipath_checkout Many problems can be attributed to the lack of homogeneity in the cluster environment. Use the following items as a checklist for verifying homogeneity.
G–Commands and Files Restarting InfiniPath Restarting InfiniPath When the driver status appears abnormal on any node, you can try restarting (as a root user). Type: # /etc/init.d/openibd restart These two commands perform the same function as restart: # /etc/init.d/openibd stop # /etc/init.d/openibd start Also check the /var/log/messages file for any abnormal activity. Summary and Descriptions of Commands Commands are summarized in Table G-1.
G–Commands and Files Summary and Descriptions of Commands Table G-1. Useful Programs (Continued)
Program Name – Function
ibtracert a – Determines the path that IB packets travel between two nodes
ibv_devinfo a – Lists information about IB devices in use. Use when OpenFabrics is enabled.
ident b – Identifies RCS keyword strings in files. Can check for dates, release versions, and other identifying information.
G–Commands and Files Summary and Descriptions of Commands a These programs are contained in the OpenFabrics openib-diags RPM. b These programs are contained within the rcs RPM for your distribution. c These programs are contained in the Open mpi-frontend RPM. d These programs are contained within the binutils RPM for your distribution. dmesg dmesg prints out bootup messages. It is useful for checking for initialization problems.
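For example, to look for InfiniPath driver messages from the current boot (the module and driver names are those used elsewhere in this guide):
# Show driver initialization and error messages for the ib_qib driver
dmesg | grep -i 'ib_qib\|infinipath'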
G–Commands and Files Summary and Descriptions of Commands -S/--sgid GID — Source GID. (Can be in GID (“0x########:0x########”) or inet6 format (“##:##:##:##:##:##:##:##”)) -D/--dgid GID — Destination GID. (Can be in GID (“0x########:0x########”) or inet6 format (“##:##:##:##:##:##:##:##”)) -k/--pkey pkey — Partition Key -i/--sid sid — Service ID -h/--hca hca — The HCA to use. (Defaults to the first HCA.) The HCA can be identified by name (“mthca0”, “qib1”, et cetera) or by number (1, 2, 3, et cetera).
G–Commands and Files Summary and Descriptions of Commands Sample output: # iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107 Query Parameters: resv1 0x0000000000000107 dgid :: sgid :: dlid 0x75 slid 0x31 hop 0x0 flow 0x0 tclass 0x0 num_path 0x0 pkey 0x0 qos_class 0x0 sl 0x0 mtu 0x0 rate 0x0 pkt_life 0x0 preference 0x0 resv2 0x0 resv3 0x0 Using HCA qib0 Result: G-6 resv1 0x0000000000000107 dgid fe80::11:7500:79:e54a sgid fe80::11:7500:79:e416 dlid 0x75 slid 0x3
G–Commands and Files Summary and Descriptions of Commands resv2 0x0 resv3 0x0 Explanation of Sample Output: This is a simple query, specifying the source and destination LIDs and the desired SID. The first half of the output shows the full “query” that will be sent to the Distributed SA. Unused fields are set to zero or are blank. In the center, the line “Using HCA qib0” tells us that, because we did not specify which HCA to query against, the tool chose one for us.
G–Commands and Files Summary and Descriptions of Commands Examples:
Query by LID and SID:
iba_opp_query -s 0x31 -d 0x75 -i 0x107
iba_opp_query --slid 0x31 --dlid 0x75 --sid 0x107
Queries using octal or decimal numbers:
iba_opp_query --slid 061 --dlid 0165 --sid 0407 (using octal numbers)
iba_opp_query --slid 49 --dlid 113 --sid 263 (using decimal numbers)
Note that these queries are the same as the first two, only the base of the numbers has changed.
G–Commands and Files Summary and Descriptions of Commands iba_hca_rev This command scans the system and reports hardware and firmware information about all the HCAs in the system. Running iba_hca_rev -v(as a root user) produces output similar to the following when run from a node on the IB fabric: # iba_hca_rev -v ###################### st2092 - HCA 0a:00.0 ID: FALCON QDR PN: MHQH29B-XTR EC: A2 SN: MT1029X00540 V0: PCIe Gen2 x8 V1: N/A YA: N/A FW: 2.9.1000 Image type: ConnectX FW Version: 2.9.
G–Commands and Files Summary and Descriptions of Commands [ADAPTER] PSID = MT_0D80120009 pcie_gen2_speed_supported = true adapter_dev_id = 0x673c silicon_rev = 0xb0 gpio_mode1 = 0x0 gpio_mode0 = 0x050e070f gpio_default_val = 0x0502010f [HCA] hca_header_device_id = 0x673c hca_header_subsystem_id = 0x0017 dpdp_en = true eth_xfi_en = true mdio_en_port1 = 0 [IB] phy_type_port1 = XFI phy_type_port2 = XFI read_cable_params_port1_en = true read_cable_params_port2_en = true ;;Polarity eth_tx_lane_polarity_port1=0x
G–Commands and Files Summary and Descriptions of Commands port1_sd2_ob_preemp_pre_qdr = 0x0 port2_sd2_ob_preemp_pre_qdr = 0x0 port1_sd3_ob_preemp_pre_qdr = 0x0 port2_sd3_ob_preemp_pre_qdr = 0x0 port1_sd0_ob_preemp_post_qdr = 0x6 port2_sd0_ob_preemp_post_qdr = 0x6 port1_sd1_ob_preemp_post_qdr = 0x6 port2_sd1_ob_preemp_post_qdr = 0x6 port1_sd2_ob_preemp_post_qdr = 0x6 port2_sd2_ob_preemp_post_qdr = 0x6 port1_sd3_ob_preemp_post_qdr = 0x6 port2_sd3_ob_preemp_post_qdr = 0x6 port1_sd0_ob_preemp_main_qdr = 0x0 po
G–Commands and Files Summary and Descriptions of Commands port2_sd3_muxmain_qdr = 0x1f mellanox_qdr_ib_support = true mellanox_ddr_ib_support = true spec1_2_ib_support = true spec1_2_ddr_ib_support = true spec1_2_qdr_ib_support = true auto_qdr_tx_options = 8 auto_qdr_rx_options = 7 auto_ddr_option_0.tx_preemp_pre = 0x2 auto_ddr_option_0.tx_preemp_msb = 0x1 auto_ddr_option_0.tx_preemp_post = 0x0 auto_ddr_option_0.tx_preemp_main = 0x1b auto_ddr_option_1.tx_preemp_pre = 0x8 auto_ddr_option_1.
G–Commands and Files Summary and Descriptions of Commands auto_ddr_option_4.tx_preemp = 0x0 auto_ddr_option_5.tx_preemp_pre = 0x5 auto_ddr_option_5.tx_preemp_msb = 0x1 auto_ddr_option_5.tx_preemp_post = 0x3 auto_ddr_option_5.tx_preemp_main = 0x13 auto_ddr_option_5.tx_preemp = 0x0 auto_ddr_option_6.tx_preemp_pre = 0x3 auto_ddr_option_6.tx_preemp_msb = 0x1 auto_ddr_option_6.tx_preemp_post = 0x4 auto_ddr_option_6.tx_preemp_main = 0x1f auto_ddr_option_6.tx_preemp = 0x0 auto_ddr_option_7.
G–Commands and Files Summary and Descriptions of Commands auto_ddr_option_11.tx_preemp_msb = 0x0 auto_ddr_option_11.tx_preemp_post = 0x3 auto_ddr_option_11.tx_preemp_main = 0x19 auto_ddr_option_11.tx_preemp = 0x0 auto_ddr_option_12.tx_preemp_pre = 0xf auto_ddr_option_12.tx_preemp_msb = 0x0 auto_ddr_option_12.tx_preemp_post = 0x3 auto_ddr_option_12.tx_preemp_main = 0x19 auto_ddr_option_12.tx_preemp = 0x0 auto_ddr_option_13.tx_preemp_pre = 0x0 auto_ddr_option_13.tx_preemp_msb = 0x0 auto_ddr_option_13.
G–Commands and Files Summary and Descriptions of Commands auto_ddr_option_6.rx_offs_lowpass_en = 0x0 auto_ddr_option_7.rx_offs_lowpass_en = 0x0 auto_ddr_option_0.rx_offs = 0x0 auto_ddr_option_1.rx_offs = 0x0 auto_ddr_option_2.rx_offs = 0x0 auto_ddr_option_3.rx_offs = 0x0 auto_ddr_option_4.rx_offs = 0x0 auto_ddr_option_5.rx_offs = 0x0 auto_ddr_option_6.rx_offs = 0x0 auto_ddr_option_7.rx_offs = 0x0 auto_ddr_option_0.rx_equal_offs = 0x0 auto_ddr_option_1.rx_equal_offs = 0x0 auto_ddr_option_2.
G–Commands and Files Summary and Descriptions of Commands auto_ddr_option_5.rx_main = 0xe auto_ddr_option_6.rx_main = 0xf auto_ddr_option_7.rx_main = 0xf auto_ddr_option_0.rx_extra_hs_gain = 0x0 auto_ddr_option_1.rx_extra_hs_gain = 0x3 auto_ddr_option_2.rx_extra_hs_gain = 0x2 auto_ddr_option_3.rx_extra_hs_gain = 0x4 auto_ddr_option_4.rx_extra_hs_gain = 0x1 auto_ddr_option_5.rx_extra_hs_gain = 0x2 auto_ddr_option_6.rx_extra_hs_gain = 0x7 auto_ddr_option_7.rx_extra_hs_gain = 0x0 auto_ddr_option_0.
G–Commands and Files Summary and Descriptions of Commands auto_ddr_option_11.rx_muxeq = 0x04 auto_ddr_option_11.rx_muxmain = 0x1f auto_ddr_option_11.rx_main = 0xf auto_ddr_option_11.rx_extra_hs_gain = 0x4 auto_ddr_option_11.rx_equalization = 0x7f auto_ddr_option_12.rx_muxeq = 0x6 auto_ddr_option_12.rx_muxmain = 0x1f auto_ddr_option_12.rx_main = 0xf auto_ddr_option_12.rx_extra_hs_gain = 0x4 auto_ddr_option_12.rx_equalization = 0x7f auto_ddr_option_13.rx_muxeq = 0x0 auto_ddr_option_13.
G–Commands and Files Summary and Descriptions of Commands lbist_shift_freq = 3 pll_stabilize = 0x13 flash_div = 0x3 lbist_array_bypass = 1 lbist_pat_cnt_lsb = 0x2 core_f = 44 core_r = 27 ddr_6_db_preemp_pre = 0x3 ddr_6_db_preemp_main = 0xe [FW] Firmware Verification: FS2 failsafe image. Start address: 0x0. Chunk size 0x80000: NOTE: The addresses below are contiguous logical addresses.
G–Commands and Files Summary and Descriptions of Commands FW image verification succeeded. Image is bootable. ###################### iba_manage_switch (Switch) Allows management of externally managed switches (including 12200, 12200-18, and HP BLc QLogic 4X QDR) without using the IFS software. It is designed to operate on one switch at a time, taking a mandatory target GUID parameter.
G–Commands and Files Summary and Descriptions of Commands linkwidth (link width supported) – use -i for integer value (1=1X, 2=4X, 3=1X/4X, 4=8X, 5=1X/8X, 6=4X/8X, 7=1X/4X/8X) vlcreditdist (VL credit distribution) – use -i for integer value (0, 1, 2, 3, or 4) linkspeed (link speed supported) – use -i for integer value (1=SDR, 2=DDR, 3=SDR/DDR, 4=QDR, 7=SDR/DDR/QDR) -i integer-value – integer value -s string-value – string value -c captureFile – filename of capture output file operation – operation to perfo
G–Commands and Files Summary and Descriptions of Commands Example
iba_manage_switch -t 0x00066a00e3001234 -f QLogic_12000_V1_firmware.7.0.0.0.27.emfw fwUpdate
iba_manage_switch -t 0x00066a00e3001234 reboot
iba_manage_switch -t 0x00066a00e3001234 showFwVersion
iba_manage_switch -t 0x00066a00e3001234 -s i12k1234 setIBNodeDesc
iba_manage_switch -t 0x00066a00e3001234 -C mtucap -i 4 setConfigValue
iba_manage_switch -H
The results are recorded in the iba_manage_switch.res file in the current directory.
G–Commands and Files Summary and Descriptions of Commands
-a alarm – number of seconds for alarm trigger to dump capture and exit
-s maxblocks – max 64 byte blocks of data to capture in units of Mi (1024*1024)
-v – verbose output
To stop capture and trigger dump, kill with SIGINT (Ctrl-C) or SIGUSR1 (with the kill command). The program will dump packets to file and exit. A sample filter file is located at /opt/iba/samples/filterFile.txt.
G–Commands and Files Summary and Descriptions of Commands Following is a sample output for the DDR adapters:
# ibstatus
Infiniband device 'qib0' port 1 status:
        default gid:  fe80:0000:0000:0000:0011:7500:0078:a5d2
        base lid:     0x1
        sm lid:       0x4
        state:        4: ACTIVE
        phys state:   5: LinkUp
        rate:         40 Gb/sec (4X QDR)
        link_layer:   InfiniBand
ibtracert
The tool ibtracert determines the path that IB packets travel between two nodes. It is installed from the openib-diag RPM.
G–Commands and Files Summary and Descriptions of Commands ibv_devinfo This program displays information about IB devices, including various kinds of identification and status data. It is installed from the openib-diag RPM. Use this program when OpenFabrics is enabled. ibv_devinfo queries RDMA devices. Use the -v option to see more information. For example:
# ibv_devinfo
hca_id: qib0
        transport: InfiniBand (0)
        fw_ver:    0.0.
G–Commands and Files Summary and Descriptions of Commands NOTE For QLogic RPMs on a RHEL distribution, the drivers folder is in the updates folder instead of the kernels folder as follows: /lib/modules/OS_version/updates/drivers/infiniband/hw/qib/ib_qib.ko If the /lib/modules/OS_version/updates directory is not present, then the driver in use is the one that comes with the core kernel. In this case, either the kernel-ib RPM is not installed or it is not configured for the current running kernel.
G–Commands and Files Summary and Descriptions of Commands NOTE The hostnames in the nodefile are Ethernet hostnames, not IPv4 addresses. To create a nodefile, use the ibhosts program. It will generate a list of available nodes that are already connected to the switch. ipath_checkout performs the following seven tests on the cluster: 1. Executes the ping command to all nodes to verify that they all are reachable from the front end. 2.
G–Commands and Files Summary and Descriptions of Commands Table G-2. ipath_checkout Options (Continued) Command Meaning -k, --keep This option keeps intermediate files that were created while performing tests and compiling reports. Results are saved in a directory created by mktemp and named infinipath_XXXXXX or in the directory name given to --workdir. --workdir=DIR Use DIR to hold intermediate files created while running tests. DIR must not already exist.
G–Commands and Files Summary and Descriptions of Commands Here is sample usage and output:
% ipath_control -i
$Id: QLogic OFED Release x.x.x $ $Date: yyyy-mm-dd-hh:mm $
0: Version: ChipABI 2.0, InfiniPath_QLE7342, InfiniPath1 6.
G–Commands and Files Summary and Descriptions of Commands MTRR is used by the InfiniPath driver to enable write combining to the QLogic on-chip transmit buffers. This option improves write bandwidth to the QLogic chip by writing multiple words in a single bus transaction (typically 64 bytes). This option applies only to x86_64 systems. It can often be set in the BIOS. However, some BIOS’ do not have the MTRR mapping option.
G–Commands and Files Summary and Descriptions of Commands Test the IB link and bandwidth between two InfiniPath IB adapters. Using an IB loopback connector, test the link and bandwidth within a single InfiniPath IB adapter. The ipath_pkt_test program runs in either ping-pong mode (send a packet, wait for a reply, repeat) or in stream mode (send packets as quickly as possible, receive responses as they come back).
G–Commands and Files Summary and Descriptions of Commands mpirun mpirun determines whether the program is being run against a QLogic or non-QLogic driver. It is installed from the mpi-frontend RPM. Sample commands and results are shown in the following paragraphs. QLogic-built: $ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0 asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1 active chips) asus-01:0.
G–Commands and Files Common Tasks and Commands This option poisons receive buffers at initialization and after each receive; pre-initialize with random data so that any parts that are not being correctly updated with received data can be observed later. See the mpi_stress(1) man page for more information. rpm To check the contents of an installed RPM, use these commands: $ rpm -qa infinipath\* mpi-\* $ rpm -q --info infinipath # (etc) The option-q queries. The option --qa queries all.
G–Commands and Files Common Tasks and Commands Table G-3. Common Tasks and Commands Summary
Function: Check the system state
Command:
ipath_checkout [options] hostsfile
ipathbug-helper -m hostsfile > ipath-info-allhosts
mpirun -m hostsfile -ppn 1 -np numhosts -nonmpi ipath_control -i
Also see the file: /sys/class/infiniband/ipath*/device/status_str where * is the unit number. This file provides information about the link state, possible cable/switch problems, and hardware errors.
G–Commands and Files Summary and Descriptions of Useful Files Table G-3.
G–Commands and Files Summary and Descriptions of Useful Files This information is useful for reporting problems to Technical Support. NOTE This file returns information on which form factor adapter is installed. The PCIe half-height, short form factor is referred to as the QLE7140, QLE7240, QLE7280, QLE7340, or QLE7342. status_str Check the file status_str to verify that the InfiniPath software is loaded and functioning.
G–Commands and Files Summary of Configuration Files This same directory contains other files with information related to status. These files are summarized in Table G-6. Table G-6. Status—Other Files File Name Contents lid IB LID. The address on the IB fabric, similar conceptually to an IP address for TCP/IP. Local refers to it being unique only within a single IB fabric. mlid The Multicast Local ID (MLID), for IB multicast. Used for InfiniPath ether broadcasts, since IB has no concept of broadcast.
G–Commands and Files Summary of Configuration Files Table G-7. Configuration Files
/etc/modprobe.conf – Specifies options for modules when added or removed by the modprobe command. Also used for creating aliases. The PAT write-combining option is set here. For Red Hat 5.X systems.
/etc/modprobe.d/ib_qib.conf – Specifies options for modules when added or removed by the modprobe command. Also used for creating aliases. The PAT write-combining option is set here. For Red Hat 6.X systems.
H Recommended Reading Reference material for further reading is provided in this appendix. References for MPI The MPI Standard specification documents are located at: http://www.mpi-forum.org/docs The MPICH implementation of MPI and its documentation are located at: http://www-unix.mcs.anl.gov/mpi/mpich/ The ROMIO distribution and its documentation are located at: http://www.mcs.anl.
H–Recommended Reading OpenFabrics OpenFabrics Information about the OpenFabrics Alliance (OFA) is located at: http://www.openfabrics.org Clusters Gropp, William, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with Linux, Second Edition, 2003, MIT Press, ISBN 0-262-69292-9 Networking The Internet Frequently Asked Questions (FAQ) archives contain an extensive Request for Comments (RFC) section. Numerous documents on networking and configuration can be found at: http://www.faqs.org/rfcs/index.
Corporate Headquarters QLogic Corporation 26650 Aliso Viejo Parkway Aliso Viejo, CA 92656 949.389.6000 www.qlogic.com International Offices UK | Ireland | Germany | France | India | Japan | China | Hong Kong | Singapore | Taiwan © 2012 QLogic Corporation. Specifications are subject to change without notice. All rights reserved worldwide. QLogic, the QLogic logo, and the Powered by QLogic logo are registered trademarks of QLogic Corporation.