QLogic InfiniPath User Guide Version 2.0
Q InfiniPath User Guide Version 2.0 Information furnished in this manual is believed to be accurate and reliable. However, QLogic Corporation assumes no responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use. QLogic Corporation reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only.
Revision history for InfiniPath User Guide Version 2.0:
- Added information about using MPI over uDAPL; the rdma_cm and rdma_ucm modules need to be loaded (3.7).
- Added section on error messages generated by mpirun, which explains more about the types of errors found in the sub-sections; also added error messages related to failed connections between nodes (C.8.12).
- Added mpirun error message about stray processes to the error message section (C.8.12.2).
- Added driver and link error messages reported by MPI programs (C.8.12).
InfiniPath User Guide Version 2.0 Q © 2006, 2007 QLogic Corporation. All rights reserved worldwide. © PathScale 2004, 2005, 2006. All rights reserved. First Published: August 2005 Printed in U.S.A.
Table of Contents

Section 1   Introduction
Section 2   InfiniPath Cluster Administration
Section 3   Using InfiniPath MPI
Appendix A  Benchmark Programs
Appendix B  Integration with a Batch Queuing System
Appendix C  Troubleshooting
Appendix D  Recommended Reading
Appendix E  Glossary
Index
Section 1 Introduction This chapter describes the objectives, intended audience, and organization of the InfiniPath User Guide. The InfiniPath User Guide is intended to give the end users of an InfiniPath cluster what they need to know to use it. In this case, end users are understood to include both the cluster administrator and the MPI application programmers, who have different but overlapping interests in the details of the technology.
Q 1 – Introduction Interoperability ■ Appendix E Glossary of technical terms ■ Index In addition, the InfiniPath Install Guide contains information on InfiniPath hardware and software installation. 1.3 Overview The material in this documentation pertains to an InfiniPath cluster. This is defined as a collection of nodes, each attached to an InfiniBand™-based fabric through the InfiniPath Interconnect. The nodes are Linux-based computers, each having up to eight processors.
Q 1 – Introduction What’s New in this Release NOTE: OpenFabrics was known as OpenIB until March 2006. All relevant references to OpenIB in this documentation have been updated to reflect this change. See the OpenFabrics website at http://www.openfabrics.org for more information on the OpenFabrics Alliance. 1.6 What’s New in this Release QLogic Corp. acquired PathScale in April 2006. In this 2.0 release, product names, internal program and output message names now refer to QLogic rather than PathScale.
Q 1 – Introduction Supported Distributions and Kernels Support for multiple versions of MPI has been added. You can use a different version of MPI and achieve the high-bandwidth and low-latency performance that is standard with InfiniPath MPI. Also included is expanded operating system support, and support for the latest OpenFabrics software stack. Multiple InfiniPath cards per node are supported. A single software installation works for all the cards.
Q 1 – Introduction Documentation and Technical Support NOTE: 32 bit OpenFabrics programs using the verb interfaces are not supported in this InfiniPath release, but will be supported in a future release. 1.9 Conventions Used in this Document This Guide uses these typographical conventions: Table 1-3.
Q 1 – Introduction Documentation and Technical Support ■ Readme file The Troubleshooting Appendix for installation, InfiniPath and OpenFabrics administration, and MPI issues is located in the InfiniPath User Guide. Visit the QLogic support Web site for documentation and the latest software updates. http://www.qlogic.
Section 2 InfiniPath Cluster Administration This chapter describes what the cluster administrator needs to know about the InfiniPath software and system administration. 2.1 Introduction The InfiniPath software consists of the InfiniPath driver ib_ipath, the layered Ethernet driver ipath_ether, OpenSM, other kernel modules, and the protocol and MPI support libraries. Together these components provide the foundation that supports the MPI implementation. Figure 2-1, below, shows these relationships.
2 – InfiniPath Cluster Administration Memory Footprint Q MPI include files are in: /usr/include MPI programming examples and source for several MPI benchmarks are in: /usr/share/mpich/examples InfiniPath utility programs, as well as MPI utilities and benchmarks are installed in: /usr/bin The InfiniPath kernel modules are installed in the standard module locations in: /lib/modules (version dependent) They are compiled and installed when the infinipath-kernel RPM is installed.
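To confirm where a given installation actually placed these files, you can query the installed RPMs directly; for example (the RPM names follow the naming used elsewhere in this guide, so verify the exact names on your system with rpm -qa):

$ rpm -qa | egrep 'infinipath|mpi-'
$ rpm -ql infinipath-kernel | head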
Q 2 – InfiniPath Cluster Administration Memory Footprint on system configuration. OpenFabrics support is under development and has not been fully characterized. This table summarizes the guidelines. Table 2-1. Memory Footprint of the InfiniPath Adapter on Linux x86_64 Systems Adapter component Required/ optional Memory Footprint Comment InfiniPath Driver Required 9 MB Includes accelerated IP support. Includes tables space to support up to 1000 node systems.
Q 2 – InfiniPath Cluster Administration Configuration and Startup This breaks down to a memory footprint of 331MB per node, as follows: Table 2-2. Memory Footprint, 331 MB per Node Component Footprint (in MB) Breakdown Driver 9 Per node MPI 316 4*71 MB (MPI per process) + 32 MB (shared memory per node) OpenFabrics 6 6 MB + 200 KB per node 2.4 Configuration and Startup 2.4.1 BIOS Settings A properly configured BIOS is required.
Q 2 – InfiniPath Cluster Administration Configuration and Startup You can check and adjust these BIOS settings using the BIOS Setup Utility. For specific instructions on how to do this, follow the hardware documentation that came with your system. 2.4.2 InfiniPath Driver Startup The ib_ipath module provides low level InfiniPath hardware support. It does hardware initialization, handles infinipath-specific memory management, and provides services to other InfiniPath and OpenFabrics modules.
2 – InfiniPath Cluster Administration Configuration and Startup Q and unmounted when the infinipath script is invoked with the "stop" option (e.g. at system shutdown). The layout of the filesystem is as follows: atomic_stats 00/ 01/ ... The atomic_stats file contains general driver statistics. There is one numbered directory per InfiniPath device on the system.
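As an illustration, assuming the default mount point used by the infinipath script is /ipathfs (the mount point on your system may differ), the driver statistics can be read directly:

$ ls /ipathfs
$ cat /ipathfs/atomic_stats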
Q 2 – InfiniPath Cluster Administration Configuration and Startup You must create a network device configuration file for the layered Ethernet device on the InfiniPath adapter. This configuration file will resemble the configuration files for the other Ethernet devices on the nodes. Typically on servers there are two Ethernet devices present, numbered as 0 (eth0) and 1 (eth1). This example assumes that we create a third device, eth2.
Q 2 – InfiniPath Cluster Administration Configuration and Startup If you are using DHCP (dynamic host configuration protocol), add the following lines to ifcfg-eth2:

# QLogic Interconnect Ethernet
DEVICE=eth2
ONBOOT=yes
BOOTPROTO=dhcp

If you are using static IP addresses, use the following lines instead, substituting your own IP address for the sample one given here. The normal matching netmask is shown.

# QLogic Interconnect Ethernet
DEVICE=eth2
BOOTPROTO=static
ONBOOT=YES
IPADDR=192.168.5.
Q 2 – InfiniPath Cluster Administration Configuration and Startup Step 3 is applicable only to SLES 10; it is required because SLES 10 uses a newer version of the udev subsystem. NOTE: The MAC address (media access control address) is a unique identifier attached to most forms of networking equipment. Step 2 below determines the MAC address to use, and will be referred to as $MAC in the subsequent steps. $MAC must be replaced in each case with the string printed in step 2.
2 – InfiniPath Cluster Administration Configuration and Startup Q Check each of the lines starting with SUBSYSTEM=, to find the highest numbered interface. (For standard motherboards, the highest numbered interface will typically be 1.) Add a new line at the end of the file, incrementing the interface number by one. In this example, it becomes eth2.
Q 2 – InfiniPath Cluster Administration Configuration and Startup 6. To verify that the configuration files are correct, you will normally now be able to run the commands: # ifup eth2 # ifconfig eth2 Note that it may be necessary to reboot the system before the configuration changes will work. 2.4.7 OpenFabrics Configuration and Startup In the prior InfiniPath 1.3 release the InfiniPath (ipath_core) and OpenFabrics (ib_ipath) modules were separate.
Q 2 – InfiniPath Cluster Administration Configuration and Startup To verify the configuration, type: # ifconfig ib0 The output from this command should be similar to this: ib0 Link encap:InfiniBand HWaddr 00:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:10.1.17.3 Bcast:10.1.17.255 Mask:255.255.255.
Q 2 – InfiniPath Cluster Administration Starting and Stopping the InfiniPath Software and you can stop it again like this: # /etc/init.d/opensmd stop If you wish to pass any arguments to the OpenSM program, modify the file: /etc/init.d/opensmd and add the arguments to the "OPTIONS" variable. Here is an example: # Use the UPDN algorithm instead of the Min Hop algorithm. OPTIONS="-u" 2.5 SRP SRP stands for SCSI RDMA Protocol.
2 – InfiniPath Cluster Administration Starting and Stopping the InfiniPath Software Q To disable the driver on the next system boot, use the command (as root): # chkconfig infinipath off NOTE: This does not stop and unload the driver, if it is already loaded. You can start, stop, or restart (as root) the InfiniPath support with: # /etc/init.d/infinipath [start | stop | restart] This method will not reboot the system. The following set of commands shows how this script can be used.
Q 2 – InfiniPath Cluster Administration Configuring ssh and sshd Using shosts.equiv If there is output, you should look at the output from this command to determine if it is configured: $ /sbin/ifconfig -a Finally, if you need to find which InfiniPath and OpenFabrics modules are running, try the following command: $ lsmod | egrep ’ipath_|ib_|rdma_|findex’ 2.8 Software Status InfiniBand status can be checked by running the program ipath_control.
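For example, running ipath_control with the -i option prints adapter, link, and version information (sample output from this invocation is shown later under Performance and Management Tips):

$ ipath_control -i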
2 – InfiniPath Cluster Administration Configuring ssh and sshd Using shosts.equiv Q This next example assumes the following: ■ Both the cluster nodes and the front end system are running the openssh package as distributed in current Linux systems. ■ All cluster users have accounts with the same account name on the front end and on each node, either by using NIS or some other means of distributing the password file. ■ The front end is called ip-fe.
Q 2 – InfiniPath Cluster Administration Performance and Management Tips 0zwxSL7GP1nEyFk9wAxCrXv3xPKxQaezQKs+KL95FouJvJ4qrSxxHdd1NYNR0D avEBVQgCaspgWvWQ8cL 0aUQmTbggLrtD9zETVU5PCgRlQL6I3Y5sCCHuO7/UvTH9nneCg== Change the file to mode 600 when finished editing. 4. On each node, the system file /etc/ssh/sshd_config must be edited, so that the following four lines are uncommented (no # at the start of the line) and are set to yes.
2 – InfiniPath Cluster Administration Performance and Management Tips Q nodes. Since these are presumed to be specialized computing appliances, they do not need many of the service daemons normally running on a general Linux computer. Following are several groups constituting a minimal necessary set of services. These are all services controlled by chkconfig.
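As an illustration only (the service names here are examples and will differ from site to site), services outside this minimal set can be listed and disabled with chkconfig:

# /sbin/chkconfig --list | grep ':on'
# /sbin/chkconfig cups off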
Q 2 – InfiniPath Cluster Administration Performance and Management Tips For SUSE 9.3 and 10.0 run this command as root: # /sbin/chkconfig --level 12345 powersaved off After running either of these commands, the system will need to be rebooted for these changes to take effect. 2.10.3 Balanced Processor Power Higher processor speed is good. However, adding more processors is good only if processor speed is balanced. Adding processors with different speeds can result in load imbalance. 2.10.
2 – InfiniPath Cluster Administration Performance and Management Tips Q 2.10.6 Hyper-Threading If you are using Intel processors that support Hyper-Threading, it is recommended that Hyper-Threading be turned off in the BIOS. This provides more consistent performance. You can check and adjust this setting using the BIOS Setup Utility. For specific instructions on how to do this, follow the hardware documentation that came with your system. 2.10.
Q 2 – InfiniPath Cluster Administration Performance and Management Tips 00: LID=0x30 MLID=0x0 GUID=00:11:75:00:00:07:11:97 Serial: 1236070407 Note that ipath_control will report whether the installed adapter is the QHT7040, QHT7140, or the QLE7140. It will also report whether the driver is InfiniPath-specific or not with the output associated with $Id.
Q 2 – InfiniPath Cluster Administration Customer Acceptance Utility $Id: kernel.org InfiniPath Release 2.0 $ $Date: 2006-09-15-04:16 $ /lib/modules/2.6.16.21-0.8-smp/updates/ipath.ko: $Id: kernel.org InfiniPath Release2.0 $ $Date: 2006-09-15-04:20 $ NOTE: ident is in the optional rcs RPM, and is not always installed. strings The command strings can also be used. Here is a sample: $ strings /usr/lib/libinfinipath.so.4.0 | grep Date: will produce output like this: Date: 2006-09-15 04:07 Release2.
Q 2 – InfiniPath Cluster Administration Customer Acceptance Utility 3. Gather and analyze system configuration from nodes. 4. Gather and analyze RPMs installed on nodes. 5. Verify InfiniPath hardware and software status and configuration. 6. Verify ability to mpirun jobs on nodes. 7. Run bandwidth and latency test on every pair of nodes and analyze results. The possible options to ipath_checkout are: -h, --help Displays help messages giving defined usage.
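A typical first run needs only a hosts file listing the hostnames of the nodes, one per line (see appendix C.9.8 for more detail on ipath_checkout):

$ ipath_checkout hostsfile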
Section 3 Using InfiniPath MPI This chapter provides information on using InfiniPath MPI. Examples are provided for compiling and running MPI programs. 3.1 InfiniPath MPI QLogic’s implementation of the MPI standard is derived from the MPICH reference implementation Version 1.2.6. The InfiniPath MPI libraries have been highly tuned for the InfiniPath Interconnect, and will not run over other interconnects. InfiniPath MPI is an implementation of the original MPI 1.2 standard.
Q 3 – Using InfiniPath MPI Getting Started with MPI These examples assume that: ■ Your cluster administrator has properly installed InfiniPath MPI and the PathScale compilers. ■ Your cluster’s policy allows you to use the mpirun script directly, without having to submit the job to a batch queuing system. ■ You or your administrator has properly set up your ssh keys and associated files on your cluster. See section 3.5.1 and section 2.9 for details on ssh administration.
Q 3 – Using InfiniPath MPI Getting Started with MPI Here ./cpi designates the executable of the example program in the working directory. The -np parameter to mpirun defines the number of processes to be used in the parallel computation. Now try it with four processes: $ mpirun -np 4 -m mpihosts ./cpi Process 3 on hostname1 Process 0 on hostname2 Process 2 on hostname2 Process 1 on hostname1 pi is approximately 3.1416009869231249, Error is 0.0000083333333318 wall clock time = 0.
3 – Using InfiniPath MPI Configuring MPI Programs for InfiniPath MPI Q and run it with: $ mpirun -np 2 -m mpihosts ./pi3f90 The C++ program hello++.cc is a parallel processing version of the traditional “Hello, World” program. Notice that this version makes use of the external C bindings of the MPI functions if the C++ bindings are not present. Compile it: $ mpicxx -o hello hello++.cc
Q 3 – Using InfiniPath MPI InfiniPath MPI Details You may need to instead pass arguments to configure directly, in a fashion similar to this: $ ./configure -cc=mpicc -fc=mpif77 -c++=mpicxx -c++linker=mpicxx Sometimes you may need to edit a Makefile to achieve this result, adding lines similar to: CC=mpicc F77=mpif77 F90=mpif90 F95=mpif95 CXX=mpicxx In some cases, the configuration process may specify the linker. It is recommended that the linker be specified as mpicc, mpif90, etc. in these cases.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details The process is shown in the following steps: 1. Create a key pair. Use the default file name, and be sure to enter a passphrase. $ ssh-keygen -t rsa 2. Enter a passphrase for your key pair when prompted. Note that the key agent does not survive X11 logout or system reboot: $ ssh-add 3. This tells ssh that your key pair should let you in: $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys edit ~/.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details 3.5.2 Compiling and Linking These scripts invoke the compiler and linker for programs in each of the respective languages, and take care of referring to the correct include files and libraries in each case. mpicc mpicxx mpif77 mpif90 mpif95 On x86_64, by default these call the PathScale compiler and linker. To use other compilers, see section 3.5.3. NOTE: The 2.x PathScale compilers aren’t currently supported on systems that use the GNU 4.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details line options. See the PathScale compiler documentation and the man pages for pathcc and pathf90 for complete information on its options. See the corresponding documentation for any other compiler/linker you may call for its options. 3.5.3 To Use Another Compiler In addition to the PathScale Compiler Suite, InfiniPath MPI supports a number of other compilers. These include PGI 5.2 and 6.0, Intel 9.0, the GNU gcc 3.3.x, 3.4.x, and 4.0.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details To use the Intel compiler for Fortran90/Fortran95 programs, use: $ mpif90 -f90=ifort ..... $ mpif95 -f95=ifort ..... Usage for other compilers will be similar to the examples above, substituting the options following -cc, -CC, -f77, -f90, or -f95. Consult the documentation for specific compilers for more details. Also, use mpif77, mpif90, or mpif95 for linking, otherwise you may have problems with .true. having the wrong value.
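The same pattern applies to the other supported compilers; as a sketch for the GNU compilers (the program name is a placeholder):

$ mpicc -cc=gcc -o myprog myprog.c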
3 – Using InfiniPath MPI InfiniPath MPI Details Q The current workaround for this is to compile on a supported and compatible distribution, then run the executable on one of the systems that uses the GNU 4.x compilers and environment. ■ To run on FC4 or FC5, install FC3 or RHEL4/CentOS on your build machine. Compile your application on this machine. ■ To run on SLES 10, install SUSE 9.3 on your build machine. Compile your application on this machine.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details program-name will generally be the pathname to the executable MPI program. If the MPI program resides in the current directory and the current directory is not in your search path, then program-name must begin with ‘./’, such as: ./program-name Unless you want to run only one instance of the program, you need to use the -np option, as in: $ mpirun -np n [other options] program-name This spawns n instances of program-name.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details programs will be started on that host before using the next entry in the mpihosts file. If the full mpihosts file is processed, and there are still more processes requested, processing starts again at the start of the file. You have several alternative ways of specifying the mpihosts file. 1. First, as noted in section 3.3.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details LD_LIBRARY_PATH, and other environment variables for the node programs through the use of the -rcfile option of mpirun: $ mpirun -np n -m mpihosts -rcfile mpirunrc program In the absence of this option, mpirun checks to see if a file called $HOME/.mpirunrc exists in the user's home directory. In either case, the file is sourced by the shell on each node at time of startup of the node program. The .mpirunrc should not contain any interactive commands.
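For example, a minimal non-interactive .mpirunrc might do nothing more than extend the search paths (the directories shown are placeholders for a non-default installation):

export PATH=/path/to/alternate/bin:$PATH
export LD_LIBRARY_PATH=/path/to/alternate/lib:$LD_LIBRARY_PATH

The section on hybrid MPI/OpenMP applications describes another common use of this file, setting OMP_NUM_THREADS for the node programs.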
Q 3 – Using InfiniPath MPI InfiniPath MPI Details 3.5.9 Multiprocessor Nodes Another command line option, -ppn, instructs mpirun to assign a fixed number p of node programs to each node, as it distributes the n instances among the nodes: $ mpirun -np n -m mpihosts -ppn p program-name This option overrides the :p specifications, if any, in the lines of the MPI hosts file.
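For example, on a cluster of nodes with four processors each, either of the following approaches runs four node programs per host (the hostnames and counts are illustrative). Using :p specifications in the hosts file:

hostname1:4
hostname2:4

$ mpirun -np 8 -m mpihosts ./program-name

Using -ppn instead:

$ mpirun -np 8 -ppn 4 -m mpihosts ./program-name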
Q 3 – Using InfiniPath MPI InfiniPath MPI Details -verbose Print diagnostic messages from mpirun itself. Can be useful in troubleshooting Default: Off -version, -v Print MPI version. Default: Off -help, -h Print mpirun help message. Default: Off -rcfile node-shell-script Startup script for setting environment on nodes. Default: $HOME/.mpirunrc -in-xterm Run each process in an xterm window. Default: Off -display X-server X Display for xterm.
Q 3 – Using InfiniPath MPI InfiniPath MPI Details -nonmpi Run a non-MPI program. Required if the node program makes no MPI calls. Default: Off -quiescence-timeout, seconds Wait time in seconds for quiescence (absence of MPI communication) on the nodes. Useful for detecting deadlocks. 0 disables quiescence detection. Default: 900 -disable-mpi-progress-check This option disables MPI communication progress check, without disabling the ping reply check. Default: Off.
Q 3 – Using InfiniPath MPI MPD -statsfile file-prefix Specifies alternate file to receive the output from the -print-stats option. Default: stderr 3.6 Using Other MPI Implementations Support for multiple MPI implementations has been added. You can use a different version of MPI and achieve the high-bandwidth and low-latency performance that is standard with InfiniPath MPI. The currently supported implementations are HP-MPI, OpenMPI and Scali.
Q 3 – Using InfiniPath MPI File I/O in MPI 3.8.1 MPD Description The Multi-Purpose Daemon (MPD) was developed by Argonne National Laboratory (ANL), as part of the MPICH-2 system. While the ANL MPD had certain advantages over the use of their mpirun (faster launching, better cleanup after crashes, better tolerance of node failures), the InfiniPath mpirun offers the same advantages. The disadvantage of MPD is reduced security, since it does not use ssh to launch node programs.
Q 3 – Using InfiniPath MPI InfiniPath MPI and Hybrid MPI/OpenMP Applications accessed via some network file system, typically NFS. Parallel programs usually need to have some data in files to be shared by all of the processes of an MPI job. Node programs may also use non-shared, node-specific files, such as for scratch storage for intermediate results or for a node’s share of a distributed database. There are different styles of handling file I/O of shared data in parallel programming.
Q 3 – Using InfiniPath MPI Debugging MPI Programs may be desirable to run multiple MPI processes and multiple OpenMP threads per node. The number of OpenMP threads is typically controlled by the OMP_NUM_THREADS environment variable in the .mpirunrc file. This may be used to adjust the split between MPI processes and OpenMP threads. Usually the number of MPI processes (per node) times the number of OpenMP threads will be set to match the number of CPUs per node.
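As a sketch, on nodes with four CPUs you could run two MPI processes per node, each with two OpenMP threads, by putting the following in .mpirunrc:

export OMP_NUM_THREADS=2

and then launching two processes per node (an 8-process job across four nodes is assumed here):

$ mpirun -np 8 -ppn 2 -m mpihosts ./hybrid-program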
Q 3 – Using InfiniPath MPI InfiniPath MPI Limitations Symbolic debugging is easier than machine language debugging. To enable symbolic debugging you must have compiled with the -g option to mpicc so that the compiler will have included symbol tables in the compiled object code. To run your MPI program with a debugger use the -debug or -debug-no-pause and -debugger options to mpirun. See the man pages to pathdb, gdb, and strace for details.
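For example (the program and hosts file names are placeholders):

$ mpicc -g -o myprog myprog.c
$ mpirun -np 2 -m mpihosts -debug ./myprog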
Q 3 – Using InfiniPath MPI InfiniPath MPI Limitations No ports available on /dev/ipath NOTE: If port sharing is enabled, this limit is raised to 16 and 8 respectively. To enable port sharing, set PSM_SHAREDPORTS=1 in your environment There are no C++ bindings to MPI -- use the extern C MPI function calls. In MPI-IO file I/O calls in the Fortran binding, offset or displacement arguments are limited to 32 bits.
Appendix A Benchmark Programs Several MPI performance measurement programs are installed from the mpi-benchmark RPM. This Appendix describes these useful benchmarks and how to run them. These programs are based on code from the group of Dr. Dhabaleswar K. Panda at the Network-Based Computing Laboratory at the Ohio State University. For more information, see: http://nowlab.cis.ohio-state.edu/ These programs allow you to measure the MPI latency and bandwidth between two or more nodes in your cluster.
Q A – Benchmark Programs Benchmark 2: Measuring MPI Bandwidth Between Two Nodes This benchmark always involves just two node programs. You can run it with the command: $ mpirun -np 2 -ppn 1 -m mpihosts osu_latency The -ppn 1 option is needed to be certain that the two communicating processes are on different nodes.
Q A – Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks MPI_Isend function, while the receiving node consumes them as quickly as it can using the non-blocking MPI_Irecv, and then returns a zero-length acknowledgement when all of the set has been received. You can run this program with:

$ mpirun -np 2 -ppn 1 -m mpihosts osu_bw

Typical output might look like:

# OSU MPI Bandwidth Test (Version 2.0)
# Size      Bandwidth (MB/s)
1           2.250465
2           4.475789
4           8.979276
8           17.952547
16          27.615041
32          52.
Q A – Benchmark Programs Benchmark 3: Messaging Rate Microbenchmarks benchmark (as shown in the example above). It has been enhanced with the following additional functionality: ■ Messaging rate reported as well as bandwidth ■ N/2 dynamically calculated at end of run ■ Allows user to run multiple processes per node and see aggregate bandwidth and messaging rates The benchmark has been updated with code to dynamically determine which processes are on which host.
Q A – Benchmark Programs Benchmark 4: Measuring MPI Latency in Host Rings A.4 Benchmark 4: Measuring MPI Latency in Host Rings The program mpi_latency can be used to measure latency in a ring of hosts. Its syntax is a bit different from Benchmark 1 in that it takes command line arguments that let you specify the message size and the number of messages over which to average the results.
Appendix B Integration with a Batch Queuing System Most cluster systems use some kind of batch queuing system as an orderly way to provide users with access to the resources they need to meet their job’s performance requirements. One of the tasks of the cluster administrator is to provide means for users to submit MPI jobs through such batch queuing systems. This can take the form of a script, which your users can invoke much as they would invoke mpirun to submit their MPI jobs.
B – Integration with a Batch Queuing System A Batch Queuing Script Q require that his node program be the only application running on each node CPU. In a typical batch environment, the MPI user would still specify the number of node programs, but would depend on the batch system to allocate specific nodes when the required number of CPUs becomes available.
Q B – Integration with a Batch Queuing System A Batch Queuing Script by mpirun. Each line consists of a node name, a colon, and the number of processes to start on that node. NOTE: This is one of two formats that the file may use. See section 3.5.6 for more information. B.1.3 Simple Process Management At this point, your script has enough information to be able to run an MPI program.
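A minimal sketch of such a wrapper is shown below; the variable holding the list of allocated nodes ($BATCH_NODE_LIST) and the process count ($NP) are hypothetical and depend on your batch system:

#!/bin/sh
# Build an mpihosts file in the node:processes format described above
MPIHOSTS=$HOME/mpihosts.$$
for node in $BATCH_NODE_LIST ; do
    echo "${node}:2" >> $MPIHOSTS
done
# Run the user's MPI program on the allocated nodes
mpirun -np $NP -m $MPIHOSTS "$@"
rm -f $MPIHOSTS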
Q B – Integration with a Batch Queuing System Lock Enough Memory on Nodes When Using SLURM The following command will terminate all processes using the InfiniPath interconnect: # /sbin/fuser -k /dev/ipath For more information, see the man pages for fuser(1) and lsof(8). NOTE: Run these commands as root to insure that all processes are reported. B.2 Lock Enough Memory on Nodes When Using SLURM This is identical to information provided in appendix C.8.11. It is repeated here for your convenience.
Appendix C Troubleshooting This Appendix describes some of the existing provisions for diagnosing and fixing problems. The sections are organized in the following order: ■ C.1 “Troubleshooting InfiniPath adapter installation” ■ C.2 “BIOS settings” ■ C.3 “Software installation issues” ■ C.4 “Kernel and initialization issues” ■ C.5 “OpenFabrics issues” ■ C.6 “System administration troubleshooting” ■ C.7 “Performance issues” ■ C.8 “InfiniPath MPI troubleshooting” ■ C.
Q C – Troubleshooting BIOS Settings states of the LEDs. The green LED will normally illuminate first. The normal state is Green On, Amber On. Table C-1. LED Link and Data Indicators LED Color Status Power Green ON Signal detected. Ready to talk to an SM to bring link fully up. OFF Switch not powered up. Software not installed or started. Loss of signal. Check cabling. Link Amber ON Link configured. Properly connected and ready to receive data and link packets. OFF SM may be missing.
Q C – Troubleshooting BIOS Settings C.2.1 MTRR Mapping and Write Combining MTRR (Memory Type Range Registers) is used by the InfiniPath driver to enable write combining to the InfiniPath on-chip transmit buffers. This improves write bandwidth to the InfiniPath chip by writing multiple words in a single bus transaction (typically 64). This applies only to x86_64 systems.
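You can inspect the MTRR configuration from Linux; a write-combining entry covering the adapter's transmit buffers should be present when the feature is working (the exact register ranges vary from system to system):

$ cat /proc/mtrr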
Q C – Troubleshooting BIOS Settings C.2.3 Incorrect MTRR Mapping Causes Unexpected Low Bandwidth This same MTRR Mapping setting as described in the previous section can also cause unexpected low bandwidth if it is set incorrectly. The setting should look like this: MTRR Mapping [Discrete] The MTRR Mapping needs to be set to Discrete if there is 4GB or more memory in the system; it affects where the PCI, PCIe, and HyperTransport i/o addresses (BARs) are mapped.
Q C – Troubleshooting Software Installation Issues C.3 Software Installation Issues This section covers issues related to software installation. C.3.1 OpenFabrics Dependencies You need to install sysfsutils for your distribution before installing the OpenFabrics RPMs, as there are dependencies. If sysfsutils has not been installed, you might see error messages like this: error: Failed dependencies: libsysfs.so.1()(64bit) is needed by libipathverbs-2.0-1_100.77_fc3_psc.x86_64 libsysfs.so.
Q C – Troubleshooting Software Installation Issues In older distributions, such as RHEL4, the 32-bit glibc will be contained in the libgcc RPM. The RPM will be named similarly to: libgcc-3.4.3-9.EL4.i386.rpm In newer distributions, glibc is an RPM name. The 32-bit glibc will be named similarly to: glibc-2.3.4-2.i686.rpm or glibc-2.3.4-2.i386.rpm Check your distribution for the exact RPM name. C.3.4 Installing Newer Drivers from Other Distributions The driver source now resides in infinipath-kernel.
Q C – Troubleshooting Kernel and Initialization Issues 8. Reload all modules by using this command (as root): # /etc/init.d/infinipath start An alternate mechanism can be used, if provided as part of your alternate installation. 9. Run an OpenFabrics test program, such as ibstatus, to verify that your InfiniPath card(s) work correctly. C.3.
C – Troubleshooting Kernel and Initialization Issues Q C.4.1 Kernel Needs CONFIG_PCI_MSI=y If the InfiniPath driver is being compiled on a machine without CONFIG_PCI_MSI=y configured, you will get a compilation error similar to this: ib_ipath/ipath_driver.c:46:2: #error "InfiniPath driver can only be used with kernels with CONFIG_PCI_MSI=y" make[3]: *** [ib_ipath/ipath_driver.o] Error 1 Some kernels, such as some versions of FC4 (2.6.16), have CONFIG_PCI_MSI=n as the default.
Q C – Troubleshooting Kernel and Initialization Issues NOTE: This problem has been fixed in the 2.6.17 kernel.org kernel. C.4.3 Driver Load Fails Due to Unsupported Kernel If you try to load the InfiniPath driver on a kernel that InfiniPath software does not support, the load fails. Error messages similar to this appear: modprobe: error inserting ’/lib/modules/2.6.3-1.1659-smp/kernel/drivers/infiniband/hw/ipath/ ib_ipath.
C – Troubleshooting Kernel and Initialization Issues Q A zero count in all CPU columns means that no interrupts have been delivered to the processor.
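To check this, look for the InfiniPath entry in /proc/interrupts (the exact interrupt name varies with the driver and kernel version):

$ grep -i ipath /proc/interrupts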
Q C – Troubleshooting Kernel and Initialization Issues C.4.6 InfiniPath ib_ipath Initialization Failure There may be cases where ib_ipath was not properly initialized. Symptoms of this may show up in error messages from an MPI job or another program. Here is a sample command and error message: $ mpirun -np 2 -m ~/tmp/mbu13 osu_latency :The link is down MPIRUN: Node program unexpectedly quit. Exiting.
Q C – Troubleshooting System Administration Troubleshooting C.5 OpenFabrics Issues This section covers items related to OpenFabrics, including OpenSM. C.5.1 Stop OpenSM Before Stopping/Restarting InfiniPath OpenSM must be stopped before stopping or restarting InfiniPath. If not, error messages such as the following will occur: # /etc/init.d/infinipath stop Unloading infiniband modules: sdp cm umad uverbs ipoib sa ipath mad coreFATAL:Module ib_umad is in use.
Q C – Troubleshooting InfiniPath MPI Troubleshooting C.6.1 Broken Intermediate Link Sometimes message traffic passes through the fabric while other traffic appears to be blocked. In this case, MPI jobs fail to run. In large cluster configurations, switches may be attached to other switches in order to supply the necessary inter-node connectivity. Problems with these inter-switch (or intermediate) links are sometime more difficult to diagnose than failure of the final link between a switch and a node.
Q C – Troubleshooting InfiniPath MPI Troubleshooting $ mpirun -v MPIRUN:Infinipath Release2.0 : Built on Wed Nov 19 17:28:58 PDT 2006 by mee The following is the error that occurs when mpirun from the 2.0 release is being used with the 1.3 libraries: $ mpirun-ipath-ssh -np 2 -ppn 1 -m ~/tmp/idev osu_latency MPIRUN: mpirun from the 2.0 software distribution requires all node processes to be running 2.0 software. At least node uses non-2.0 MPI libraries C.8.2 Cross-compilation Issues The 2.
Q C – Troubleshooting InfiniPath MPI Troubleshooting On a SLES 10 system, you would need: ■ compat-libstdc++ (for FC3) ■ compat-libstdc++5 (for SLES 10) Depending upon the application, you may need to use the -Wl,-Bstatic option to use the static versions of some libraries. C.8.3 Compiler/Linker Mismatch This is a typical error message if the compiler and linker do not match in C and C++ programs: $ export MPICH_CC=gcc $ mpicc mpiworld.
Q C – Troubleshooting InfiniPath MPI Troubleshooting For these examples in Section C.8.5 below, we assume that these new locations are: /path/to/devel (for mpi-devel-*) /path/to/libs (for mpi-libs-*) C.8.5 Compiling on Development Nodes If the mpi-devel-* rpm is installed with the --prefix /path/to/devel option then mpicc, etc. will need to be passed -I/path/to/devel/include in order for the compiler to find the MPI include files, as in this example: $ mpicc myprogram.
Q C – Troubleshooting InfiniPath MPI Troubleshooting The above compiler command insures that the program will run using this path on any machine. For the second option, we change the file /etc/ld.so.conf on the compute nodes rather than using the -Wl,-rpath, option when compiling on the development node. We assume that the mpi-lib-* rpm is installed on the compute nodes with the same --prefix /path/to/libs option as on the development nodes.
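A sketch of that approach follows; the lib subdirectory name is an assumption and should be adjusted to wherever the RPM actually placed the libraries:

# echo /path/to/libs/lib >> /etc/ld.so.conf
# /sbin/ldconfig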
C – Troubleshooting InfiniPath MPI Troubleshooting Q Examples are given below. In the following command, the HP-MPI version of mpirun is invoked by the full pathname. However, the program mpi_nxnlatbw was compiled with the QLogic version of mpicc. The mismatch will produce errors similar to this: $ /opt/hpmpi/bin/mpirun -hostlist "bbb-01,bbb-02,bbb-03,bbb-04" -np 4 /usr/bin/mpi_nxnlatbw bbb-02: Not running from mpirun?.
Q C – Troubleshooting InfiniPath MPI Troubleshooting The following two commands will both work properly: QLogic mpirun and executable used together: $ mpirun -m ~/host-bbb -np 4 /usr/bin/mpi_nxnlatbw HP-MPI mpirun and executable used together: $ /opt/hpmpi/bin/mpirun -hostlist \ "bbb-01,bbb-02,bbb-03,bbb-04" -np 4 ./hpmpi-mpi_nxnlatbw Hints: Use the rpm command to find out which RPM is installed in the standard installed layout. For example: # rpm -qf /usr/bin/mpirun mpi-frontend-2.0-964.731_fc3_psc.
Q C – Troubleshooting InfiniPath MPI Troubleshooting ^ pathf95-389 pathf90: ERROR BORDERS, File = communicate.F, Line = 407, Column = 18 No specific match can be found for the generic subprogram call "MPI_RECV". If it is necessary to use a non-standard argument list, it is advisable to create your own MPI module file, and compile the application with it, rather than the standard MPI module file that is shipped in the mpi-devel-* RPM.
Q C – Troubleshooting InfiniPath MPI Troubleshooting

  integer count, datatype, root, comm, ierror
  ! Call the Fortran 77 style implicit interface to "mpi_bcast"
  external mpi_bcast
  call mpi_bcast(buffer, count, datatype, root, comm, ierror)
end subroutine additional_mpi_bcast_for_character
end module additional_bcast

program myprogram
  use mpi
  use additional_bcast
  implicit none
  character*4 c
  integer master, ierr, i
  ! Explicit integer version obtained from module "mpi"
  call mpi_bcast(i, 1, MPI_INTEGER, master,
Q C – Troubleshooting InfiniPath MPI Troubleshooting If this file is not present or the node has not been rebooted after the infinipath RPM has been installed, a failure message similar to this will be generated: $ mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000 node-00:1.ipath_update_tid_err: failed: Cannot allocate memory mpi_latency: /fs2/scratch/infinipath-build-2.0/mpi-2.0/mpich/psm/src mq_ips.c:691: mq_ipath_sendcts: Assertion ‘rc == 0’ failed. MPIRUN: Node program unexpectedly quit. Exiting.
Q C – Troubleshooting InfiniPath MPI Troubleshooting Found unknown timer type type unknown frame type type recv done: available_tids now n, but max is m (freed p) cancel recv available_tids now n, but max is m (freed %p) [n] Src lid error: sender: x, exp send: y Frame receive from unknown sender. exp.
Q C – Troubleshooting InfiniPath MPI Troubleshooting The following message indicates that a node program may not be processing incoming packets, perhaps due to a very high system load: eager array full after overflow, flushing (head h, tail t) The following indicates an invalid InfiniPath link protocol version: InfiniPath version ERROR: Expected version v, found w (memkey h) The following error messages should rarely occur and indicate internal software problems: ExpSend opcode h tid=j, rhf_error k: str
Q C – Troubleshooting InfiniPath MPI Troubleshooting These messages appear in the mpirun output. Most are followed by an abort, and possibly a backtrace. Each is preceded by the name of the function in which the exception occurred. Error sending packet: description Error receiving packet: description A fatal protocol error occurred while trying to send an InfiniPath packet. On Node n, process p seems to have forked. The new process id is q. Forking is illegal under InfiniPath. Exiting.
Q C – Troubleshooting InfiniPath MPI Troubleshooting There is no route to any host: $ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100 ssh: connect to host port 22: No route to host ssh: connect to host port 22: No route to host MPIRUN: All node programs ended prematurely without connecting to mpirun. Node jobs have started, but one host couldn’t connect back to mpirun: $ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100 9139.
Q C – Troubleshooting InfiniPath MPI Troubleshooting $ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000 MPIRUN: MPI progress Quiescence Detected after 9000 seconds. MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress. MPIRUN: Per-rank details are the following: MPIRUN: Rank 0 () caused MPI progress Quiescence. MPIRUN: Rank 1 () caused MPI progress Quiescence. MPIRUN: both MPI progress and Ping Quiescence Detected after 120 seconds.
Q C – Troubleshooting InfiniPath MPI Troubleshooting C.8.13 MPI Stats Using the -print-stats option to mpirun will result in a listing to stderr of various MPI statistics. Here is example output for the -print-stats option when used with an 8-rank run of the HPCC benchmark. MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPIRUN: MPI Statistics Summary Messages sent Eager count Eager aggregate bytes Rendezvous count Rendezvous agg.
Q C – Troubleshooting Useful Programs and Files for Debugging C.9 Useful Programs and Files for Debugging The most useful programs and files for debugging are listed in the sections below. Many of these programs and files have been discussed elsewhere in the documentation: this information is summarized and repeated here for your convenience. C.9.1 Check Cluster Homogeneity with ipath_checkout Many problems can be attributed to the lack of homogeneity in the cluster environment.
Q C – Troubleshooting Useful Programs and Files for Debugging C.9.3 Summary of Useful Programs and Files Useful programs and files are summarized in the table below. Descriptions for some of the programs and files follow. Check man pages for more information on the programs. Table C-2. Useful Programs and Files Program or file name C-30 Function Use to verify homogeneity? boardversion File. Check the version of the installed InfiniPath software.
Q C – Troubleshooting Useful Programs and Files for Debugging Table C-2. Useful Programs and Files (Continued) Program or file name Function Use to verify homogeneity? modprobe Adds or removes modules from the Linux kernel. Used to configure ipath_ether module on SUSE. No mpirun A front end program that starts an MPI job on an InfiniPath cluster. Can be used to check the origin of the drivers. Yes ps Displays information on current active processes.
Q C – Troubleshooting Useful Programs and Files for Debugging C.9.5 ibstatus This program displays basic information on the status of InfiniBand devices that are currently in use when the OpenFabrics modules are loaded. C.9.6 ibv_devinfo This program displays information about InfiniBand devices, including various kinds of identification and status data. Use this program when OpenFabrics is enabled. C.9.7 ident ident strings are available in ib_ipath.ko.
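For example (the module path varies with the kernel version and is shown only as an illustration; as noted earlier, ident is part of the optional rcs RPM):

$ ident /lib/modules/$(uname -r)/updates/ib_ipath.ko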
Q C – Troubleshooting Useful Programs and Files for Debugging C.9.8 ipath_checkout ipath_checkout is a bash script used to verify that the installation is correct and that all the nodes of the network are functioning and mutually connected by the InfiniPath fabric. It is to be run on a front end node, and requires specification of a hosts file: $ ipath_checkout [options] hostsfile where hostsfile designates a file listing the hostnames of the nodes of the cluster, one hostname per line.
Q C – Troubleshooting Useful Programs and Files for Debugging --workdir=DIR Use DIR to hold intermediate files created while running tests. DIR must not already exist. -k, --keep Keep intermediate files that were created while performing tests and compiling reports. Results will be saved in a directory created by mktemp and named infinipath_XXXXXX or in the directory name given to --workdir. --skip=LIST Skip the tests in LIST (e.g.
Q C – Troubleshooting Useful Programs and Files for Debugging 00: LID=0x30 MLID=0x0 GUID=00:11:75:00:00:07:11:97 Serial: 1236070407 C.9.10 ipathbug-helper The tool ipathbug-helper is useful for verifying homogeneity. Prior to seeking assistance from QLogic technical support, you should run this script on the head node of your cluster and the compute nodes which are suspected to have problems. Inspection of the output will often help you to see the problem.
C – Troubleshooting Useful Programs and Files for Debugging Q C.9.13 lsmod If you need to find which InfiniPath and OpenFabrics modules are running, try the following command: # lsmod | egrep ’ipath_|ib_|rdma_|findex’ C.9.14 mpirun mpirun can give information on whether the program is being run against a QLogic or non-QLogic driver. Sample commands and results are given below. QLogic-built: $ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0 asus-01:0.
Q C – Troubleshooting Useful Programs and Files for Debugging The following table shows the possible contents of the file, with brief explanations of the entries. Table C-3. status_str File File contents Description Initted The driver has loaded and successfully initialized the IBA6110. Present The IBA6110 has been detected (but not initialized unless Initted is also here). IB_link_up The IB link has been configured and is in the active state; packets can be sent and received.
C – Troubleshooting Useful Programs and Files for Debugging Q C.9.17 strings The command strings can also be used. Its format is as follows: $ strings /usr/lib/libinfinipath.so.4.0 | grep Date: will produce output like this: $Date: 2006-09-15 04:07 Release2.0 InfiniPath $ NOTE: strings is part of binutils (a development RPM), and may not be available on all machines. C.9.
Appendix D Recommended Reading Reference material for further reading is provided here. D.1 References for MPI The MPI Standard specification documents. http://www.mpi-forum.org/docs The MPICH implementation of MPI and its documentation. http://www-unix.mcs.anl.gov/mpi/mpich/ The ROMIO distribution and its documentation. http://www.mcs.anl.gov/romio D.2 Books for Learning MPI Programming Gropp, William, Ewing Lusk, and Anthony Skjellum, Using MPI, Second Edition, 1999, MIT Press, ISBN 0-262-57134-X.
D – Recommended Reading Rocks Q D.6 Clusters Gropp, William, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with Linux, Second Edition, 2003, MIT Press, ISBN 0-262-69292-9. D.7 Rocks Extensive documentation on installing Rocks and custom Rolls. http://www.rocksclusters.
Appendix E Glossary A glossary is provided below for technical terms used in the documentation. bandwidth The rate at which data can be transmitted. This represents the capacity of the network connection. Theoretical peak bandwidth is fixed, but the effective bandwidth is the ideal rate reduced by overhead in the hardware and the computer operating system. Usually measured in bits/megabits or bytes/megabytes per second. Bandwidth is related to latency. BIOS For Basic Input/Output System.
Q E – Glossary E-2 GID For Global Identifier. Used for routing between different InfiniBand subnets. GUID For Globally Unique Identifier for the InfiniPath chip. Equivalent to Ethernet MAC address. head node Same as front end node. HCA For Host Channel Adapter. HCAs are I/O engines located within processing nodes, connecting them to the InfiniBand fabric. hosts file Same as mpihosts file. Not the same as the /etc/hosts file.
Q IB6054601-00 D E – Glossary LID For Local Identifier. Assigned by the Subnet Manager (SM) to each visible node within a single InfiniBand fabric. It is similar conceptually to an IP address for TCP/IP. Lustre Open source project to develop scalable cluster file systems. MAC Address For Media Access Control Address. It is a unique identifier attached to most forms of networking equipment. machines file Same as mpihostsfile. MADs For Management Datagrams.
Q E – Glossary E-4 MTRR For Memory Type Range Registers. Used by the InfiniPath driver to enable write combining to the InfiniPath on-chip transmit buffers. This improves write bandwidth to the InfiniPath chip, by writing multiple words in a single bus transaction (typically 64). Applies only to x86_64 systems. MTU For Maximum Transfer Unit. The largest packet size that can be transmitted over a given network.
Q IB6054601-00 D E – Glossary SDP For Sockets Direct Protocol. An InfiniBand-specific upper layer protocol. It defines a standard wire protocol to support stream sockets networking over InfiniBand. SRP For SCSI RDMA Protocol. The implementation of this protocol is under development for utilizing block storage devices over an InfiniBand fabric. SM For Subnet Manager.