Myrinet-2000 Installation and Troubleshooting Guide
Myricom, Inc.
Draft: 07 April 2007

The most recent version of this document can be downloaded from http://www.myri.com/scs/doc/troubleshooting_guide.pdf

© 2007 Myricom, Inc.
Table of Contents
I. Introduction
II. What Hardware Is Required?
III. Hardware Installation
IV.
I. Introduction

This Myrinet-2000 Installation and Troubleshooting Guide describes the hardware and software installation procedures for a Myrinet-2000 cluster. Section II summarizes the required hardware, and Section III provides detailed installation instructions for each hardware component. Sections IV, V, VI, and VII address the software installation of MX, GM-2, or GM-1, and Section VIII describes the testing and validation of the Myrinet cluster.
“Guide to Switches and Switch Networks” http://www.myri.com/myrinet/m3switch/guide/

For Myrinet-2000 M3-CLOS-ENCL-* or M3-SPINE-ENCL-* switches, please read: http://www.myri.com/myrinet/14U_switches/ http://www.myri.com/scs/14U_switches/ and the following section of the Myrinet FAQ (http://www.myri.com/cgi-bin/fom?file=369).

For Myrinet-2000 PCI-X-based NICs, we recommend reading: “Guide to Myrinet PCI-X Network Interface Cards” http://www.myri.com/scs/doc/guide_to_pcix_nics.
• If your Myrinet-2000 M3-E* switch is equipped with a monitoring line card (located in the top slot of the switch), this monitoring line card contains dual 10Base-T ethernet ports, and DHCP is required for its installation.
• A Myrinet-2000 switch does not require any configuration.
• Switch line cards (M3-SW16-8E, M3-SW16-8F, M3-SPINE-8F, M3-BLANK) are hot-swappable.
• A line card, a fan tray, or an enclosure is a Field Replaceable Unit (FRU).
• The M3-CLOS-ENCL-B (or M3-SPINE-ENCL-B) enclosure contains two 840W power supplies that can be individually hot-swapped and that operate in an autoparallel mode, in which any one power supply is sufficient to supply the maximum power a unit may require.
• Two fan assemblies are included in each M3-CLOS-ENCL-* and M3-SPINE-ENCL-* enclosure, and they can be individually hot-swapped.
• The line cards, the power supplies, and the fan assemblies are Field Replaceable Units (FRU).
• http://helics.iwr.uni-heidelberg.de/gallery/index.html

Installation of the Myrinet PCI-X/PCI Network Interface Cards (NICs)

Following the installation instructions in the “Guide to Myrinet PCI-X Network Interface Cards” or the “Guide to Myrinet/PCI Host Interfaces” document, you will perform the following steps:
1. Install the Myrinet NIC(s) into your host(s).
2. Power on the host(s).
3. Detect the NIC(s) in your host(s) using the Linux command /sbin/lspci.
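The detection step with /sbin/lspci can be scripted so that every host reports how many Myrinet NICs it sees. The following is a minimal sketch, not a Myricom tool: the function name count_myrinet_nics is hypothetical, and it assumes that each NIC appears in the lspci output as one line containing the string "Myri" (the exact wording depends on the PCI ID database installed on the host).

```shell
#!/bin/sh
# Count Myrinet NICs in lspci output read from standard input.
# Assumption: each NIC shows up as one line containing "Myri"
# (for example "MYRICOM Inc."); the match is case-insensitive.
count_myrinet_nics() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when the count is 0.
    grep -ic 'myri' || true
}

# Typical use on a host (hypothetical invocation):
#   /sbin/lspci | count_myrinet_nics
```

A result of 0 on a host that should contain a NIC points to the seating, riser card, or BIOS questions discussed in this guide. Where the lspci version supports vendor filtering, `lspci -d 14c1:` (14c1 being Myricom's PCI vendor ID) is an alternative that does not depend on the device name string.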
• Have you tried using a different riser card and/or a different brand of riser card?
• Have you tried using a newer BIOS for this motherboard?

Installation of the Myrinet switch and cables

Once the Myrinet NIC(s) have been installed and correctly detected in your host(s), you can now proceed to install the switch(es) and connect the cables. Separate instructions are included below for M3-E* Switches and M3-CLOS-ENCL or M3-SPINE-ENCL Switches.
server will then serve this static IP address to the monitoring line card when it boots and asks for it. On Linux, this file is /etc/dhcpd.conf. The MAC address consists of six two-digit hexadecimal numbers separated by colons and should have the form 00:60:dd:??:??:??.

Step 2. Before seating the monitoring line card into the top slot of the Myrinet-2000 switch, connect at least the first ethernet port to the LAN. For high availability, the second ethernet port can also be connected.
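The static entry in /etc/dhcpd.conf described above can be generated rather than typed by hand. The sketch below assumes the ISC DHCP server syntax (a host block with "hardware ethernet" and "fixed-address" statements); the function name dhcpd_host_entry, the entry name, and the addresses in the usage comment are hypothetical. It rejects any MAC address that does not have the 00:60:dd:??:??:?? form.

```shell
#!/bin/sh
# Print an ISC dhcpd "host" entry that pins an IP address to a MAC address.
# Arguments: <entry-name> <mac> <ip>  (all supplied by the caller).
dhcpd_host_entry() {
    name=$1
    # Normalize hexadecimal digits to lower case before checking the prefix.
    mac=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    ip=$3
    case $mac in
        00:60:dd:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]) ;;
        *) echo "unexpected MAC address: $2" >&2; return 1 ;;
    esac
    printf 'host %s {\n    hardware ethernet %s;\n    fixed-address %s;\n}\n' \
        "$name" "$mac" "$ip"
}

# Example (hypothetical values); append the output to /etc/dhcpd.conf:
#   dhcpd_host_entry myri-switch1 00:60:dd:12:34:56 192.168.1.50
```

The DHCP server must be restarted or reloaded after the file is changed for the new entry to take effect.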
1. Plug in the power cord of the switch, and the color TFT display (driven by the monitoring line card) will illuminate and exhibit a color-bar display. After the operating system finishes booting (about 10 seconds), the color-bar display will change to a virtual image of the switch. Do not yet connect ethernet to the monitoring line card (in the left-most slot of the switch chassis). Configuration of the monitoring line card will be performed after all of the fiber cables have been connected.
2.
Step 6: As soon as the ethernet port is connected, the upper green LED on the RJ45 connector will illuminate.
Step 7: When the monitoring line card has received its IP address, it is reachable. You can ping the card, open a web browser to it, or walk the SNMP MIB.
Step 8: If you make a mistake and cannot ping the switch, use the TFT display to set static addressing to "no" and reboot.

To assign an IP address via DHCP:
Step 1.
Each time a monitoring line card is powered on, it will ask for its IP address (and netmask) via DHCP. You can specify a gateway with the DHCP "routers" option. To test that the monitoring line card is properly configured, you can ping its IP address or open a web browser to its IP address. We suggest that you familiarize yourself with the features of the HTTP interface to the monitoring line card, as many of these features can be very useful diagnostic tools.
MX-2G or GM-2 software is required for use with the Myrinet-2000 M3-CLOS-ENCL-* and M3-SPINE-ENCL-* switches. MX-2G and GM 2.1.x support multi-path, dispersive routing, a technique that improves the utilization of the network bisection in large networks. GM-2 software is required for ethernet-emulation interoperability with M3-SW16-8E switch line cards. MX-2G does not provide support for the M3-SW16-8E switch line cards. If you are using GM-2, GM-2.1.
$ ./configure --with-linux=<linux-source-dir>

where <linux-source-dir> specifies the directory for the Linux kernel source. The kernel header files MUST match the running kernel exactly: not only should they both be from the same version, but they should also contain the same kernel configuration options.

Note:
• For Linux 2.6 kernels, the kernel headers/scripts often come in two parts in two different directories; you might need to use both --with-linux and --with-linuxbuild.
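Before running configure, it can help to confirm that the kernel tree passed to --with-linux reports the same release as the running kernel. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and it assumes the tree has been configured so that UTS_RELEASE is defined in one of the three headers listed (which header holds it varies between kernel versions). It checks only the release string; it does not verify that the kernel configuration options match.

```shell
#!/bin/sh
# Print the UTS_RELEASE string recorded in a configured kernel tree.
# Assumption: UTS_RELEASE is defined in one of the headers tried below.
kernel_tree_release() {
    tree=$1
    for f in include/generated/utsrelease.h \
             include/linux/utsrelease.h \
             include/linux/version.h; do
        if [ -f "$tree/$f" ]; then
            rel=$(sed -n 's/^#define[[:space:]]*UTS_RELEASE[[:space:]]*"\(.*\)".*/\1/p' \
                  "$tree/$f" | head -n 1)
            if [ -n "$rel" ]; then
                printf '%s\n' "$rel"
                return 0
            fi
        fi
    done
    return 1
}

# Succeed only if the tree's release equals the given release string.
kernel_tree_matches() {
    [ "$(kernel_tree_release "$1")" = "$2" ]
}

# Example (hypothetical path):
#   kernel_tree_matches /usr/src/linux "$(uname -r)" || echo "kernel tree mismatch"
```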
MX libraries are installed in /lib32 and the 64-bit MX libraries are installed in /lib64. The /lib directory is a symbolic link to either lib64 or lib32 depending on the native wordsize detected by configure. E.g., on most ppc64 distributions, gcc defaults to 32-bit, which means that lib links to lib32. However, on most x86_64 distributions, gcc defaults to 64-bit, so lib links to lib64.
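To see which word size configure selected, you can inspect where the lib symbolic link points. This is a small sketch with a hypothetical function name, mx_lib_target; the MX installation directory is passed in by the caller, and readlink is assumed to be available (it is part of GNU coreutils on Linux).

```shell
#!/bin/sh
# Report which library directory the "lib" symbolic link points to.
# Argument: the MX installation directory (supplied by the caller).
mx_lib_target() {
    link=$1/lib
    if [ -L "$link" ]; then
        # Print only the final path component, e.g. lib32 or lib64.
        basename "$(readlink "$link")"
    else
        echo "not a symbolic link: $link" >&2
        return 1
    fi
}

# Example (hypothetical path):
#   mx_lib_target /opt/mx
```

If the result does not match the word size of the application being linked, point the linker at lib32 or lib64 explicitly instead of lib.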
The yellow "Lanai" LED is controlled by the Lanai processor, and will pulse like a heartbeat while the MCP/firmware is running. If an error occurs, the yellow "Lanai" LED will pulse an S.O.S signal. If the yellow LED is not pulsing, the MX-2G MCP is not loaded or is not running. Refer to the FAQ entry "How can I tell if the MX Mapper has correctly detected all of the hosts in my Myrinet network?" (http://www.myri.com/cgi-bin/fom?file=427).
or (for RedHat Linux):

chkconfig --add mx

Alternatively, you may start and stop the driver manually using

su root
/etc/init.d/mx start
/etc/init.d/mx stop

or

su root
/etc/init.d/mx restart

The mx “stop” script performs the following operations:
• Shuts down the mx_mapper daemon
• Brings down the myri* ethernet devices (using ifconfig)
• Unloads the MX modules (using rmmod)

The mx "restart" script performs an mx stop followed by an mx start.

Note:
1. Legacy PCI64-based and PCI32-based Myrinet NICs are not supported.
3. Enabling IP over Myrinet (Ethernet emulation) (OPTIONAL)

If you wish to run IP over Myrinet (ethernet emulation), the Linux command to enable IP over MX is as follows:

/sbin/ifconfig myri0 up

where you must replace myri0 with the appropriate name (myri1, myri2, etc.) if you have more than one Myrinet NIC per host.

VI. GM-2 Software Installation

GM-2 installation is performed in three easy steps:
1. Configuring and compiling GM-2.
2. Installing the GM-2 driver.
3.
If you would like to have FMS diagnostic monitoring with GM-2, refer to the FMS Download page (http://www.myri.com/scs/fms/) for installation instructions. If you are building GM-2 on SuSE SLES9 on PowerPC64 or AMD64 or EM64T, you may need to explicitly point configure at the kernel source and object trees. For example, .
• If you do not see any green “link” LEDs illuminated, is the switch powered on?
• If the green “link” LEDs are dark on only one specific line card, is the line card properly seated in the enclosure? (Refer to the “Guide to Switches and Switch Networks” for the proper procedure to insert/remove a line card.
/var/run/gm_mapper/pid.{board_id}, and the map files are stored in /var/run/gm_mapper/map.{board_id}. Further details about the mapper in GM-2 can be found on the following webpage: http://www.myri.com/scs/mapper_gm2.html

Refer to the FAQ entry "How can I tell if the GM-2 Mapper has correctly detected all of the hosts in my Myrinet network?" (http://www.myri.com/cgi-bin/fom?file=273).

Important note: Stopping the gm_mapper while GM-2 is running is not supported.
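Because the mapper must stay running while GM-2 is in use, a quick liveness check against the pid file can be useful. The following is a sketch only: the function name gm_mapper_running is hypothetical, and it assumes that the pid file contains the mapper's process ID as the first token on its first line. It should be run as root, since a signal-0 probe of a process owned by another user fails even when that process exists.

```shell
#!/bin/sh
# Check whether the process recorded in a gm_mapper pid file is alive.
# Arguments: <board_id> [run_dir]; run_dir defaults to /var/run/gm_mapper.
# Assumption: the first token on the first line of the file is the PID.
gm_mapper_running() {
    board=$1
    dir=${2:-/var/run/gm_mapper}
    pidfile=$dir/pid.$board
    [ -r "$pidfile" ] || return 1
    pid=$(sed -n '1s/^[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$pidfile")
    [ -n "$pid" ] || return 1
    # Signal 0 tests for the existence of the process without affecting it.
    kill -0 "$pid" 2>/dev/null
}

# Example: report the state of the mapper for board 0.
#   gm_mapper_running 0 && echo "mapper running" || echo "mapper not running"
```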
• Shuts down the gm_mapper daemon
• Brings down the myri* ethernet devices (using ifconfig)
• Unloads the GM module (using rmmod)

The gm "restart" script performs a gm stop followed by a gm start.

Note:
1. GM is not in the critical performance path, so it does not need to be built with specialized compilers and flags. GM should be built with GNU gcc, using only the -O level of optimization.
2. GM should be installed in an NFS-mounted area.
3. gm_install_drivers and /etc/init.
VII. GM-1 Software Installation

GM-1 installation is performed in four easy steps:
1. Configuring and compiling GM-1.
2. Installing the GM-1 driver.
3. Running the GM-1 mapper.
4. Enabling IP over Myrinet (ethernet emulation) (OPTIONAL)

For detailed installation instructions for GM-1 with FMS diagnostic monitoring, refer to the FMS webpage (http://www.myri.com/scs/fms/#install-tarball). After you have completed these steps, proceed to Section VIII. Testing/Validation (page 27).

1.
/sbin/gm_install_drivers
/etc/init.d/gm start

on each machine to install/copy the driver on that machine. When the hardware is connected through a cable to another operating component and the GM-1 firmware has been loaded, a green “link” LED and a yellow/amber “Lanai” LED will illuminate on NICs and a green “link” LED will illuminate on connected ports on the line cards.
If you wish the driver to auto-load at boot, you must create appropriate links in the /etc/rcN directories to the /etc/init.d/gm script. Alternatively, you may start and stop the driver manually using

su root
/etc/init.d/gm start
/etc/init.d/gm stop

or

su root
/etc/init.d/gm restart

Note:
1. GM is not in the critical performance path, so it does not need to be built with specialized compilers and flags. GM should be built with GNU gcc, using only the -O level of optimization.
Important points to note:
• The GM-1 mapper is ONLY run on one node in the cluster. You should choose one node in the cluster to be the mapper node, and any subsequent invocations of the mapper should be done on this node only.
• The GM-1 mapper must be run before any communication over Myrinet can occur.
• If a host is rebooted, you must reload the GM driver and rerun the GM-1 mapper.
• If any topological change occurs in the cluster, the GM-1 mapper must be rerun.
If the GM tree is not mounted by NFS, copy the 3 files created by this command (static.map, static.routes, and static.hosts) to each /sbin/ directory on each host. For auto-mapping at boot time, add the following command to the boot scripts of the host (scripts in /etc/init.d or /etc/rc.d/init.d).

cd /sbin/
su root
./file_mapper ../etc/gm/file.args

“High Availability” Mapping is the third way in which the GM mapper may be used.
VIII. Testing/Validation

Once the MX, GM-2, or GM-1 firmware is running on all hosts in the cluster, and all host-to-switch and switch-to-switch cables have been connected, you are ready to verify the health of all of the Myrinet hardware components in the Myrinet installation by performing the following sequence of tests. The Fabric Management System (FMS) is the recommended diagnostic tool for Myrinet-2000 networks. Requirements for the installation of FMS are summarized on the FMS webpage (http://www.
$ fm_switch -a <switch>

where <switch> is the DNS name or IP address for the monitoring line card in the specific switch enclosure. If you need to remove a switch from the database, run

$ fm_switch -d <switch>

If the monitoring line card has not yet been installed in the switch(es), refer to "How do I install the monitoring line card in my Myrinet-2000 M3-E* switch?" (http://www.myri.
the other end. On the host, there will be a green LED illuminated and a flashing yellow/amber LED illuminated on each NIC. If the LED of a connected port is not illuminated in green, refer to "Run fm_db2wirelist and look for any missing links". If FMS is not available, please consult the diagnostic procedures in Appendix B "Isolating the Cause of a Hardware Problem". If you're using an M3-CLOS-ENCL-* or M3-SPINE-ENCL-* switch, please consult the following webpage (http://www.myri.
If you must have two PCI devices sharing the same PCI bus, and both devices are able to run at 133MHz, but the PCI bus is not running at 133MHz, are you sure that the motherboard can sustain two PCI devices on the same PCI bus running at full speed? • Or, if you are using a riser card, there could be a problem with the riser card. Not all 64-bit riser cards will run at 133MHz. Refer to the FAQ entry “My PCI-X slot should run at 133MHz, but gm_debug reports 66MHz or 100MHz. What’s wrong?” (http://www.myri.
mpicc to compile mx/unit_test/src/mpi/mpi_stress.c. The executable mpi_stress can then be run like any other MPI program using mpirun.ch_mx or mpirun.ch_gm. If the GM firmware is installed on the cluster, the GM-specific stress program, gm_stress.c, can also be used to stress the network. Full details of how to run gm_stress can be found on the FAQ entry (http://www.myri.com/cgi-bin/fom?file=53). 8. Run fm_show_alerts for diagnostic information on any damaged/failing hardware component.
Appendix A: Determining if a Problem is Hardware or Software Related

Diagnosing a problem as hardware- or software-related can be difficult. The first goal is to isolate where the problem resides:
• Host computer hardware (e.g., a bad PCI slot, a defective or inadequate riser card, or a buggy BIOS)
• Host computer software (e.g.
• Is there a monitoring line card installed in each Myrinet-2000 switch? If yes, do you see a high number of bad CRCs reported in the switch counters? If you're using a Myrinet-2000 M3-E* switch, this information can be obtained with the following command:

lynx -dump /all | grep badCrcs

If you're using a Myrinet-2000 M3-CLOS-ENCL or M3-SPINE-ENCL switch, this information can be obtained with the following command:

lynx -dump /cgi/web.
• Did the firmware (MX or GM) load properly on all nodes in the cluster? Were there any error messages in the system log (dmesg or /var/log/messages) output on any of the nodes when you loaded the firmware? Sections V, VI, and VII address software installation and troubleshooting issues. Run-time diagnostic error messages are also explained in the Myrinet FAQ (http://www.myri.com/scs/FAQ/).
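Scanning the system log on every node by eye is tedious, so the check for firmware load errors can be reduced to a filter. The sketch below is an assumption-laden illustration: the function name driver_errors is hypothetical, and it assumes that driver messages contain the tag "mx" or "gm" as a separate word and that problem lines contain "error", "fail", or "warn". The actual message wording is not specified in this guide, so adjust the patterns to what your logs show.

```shell
#!/bin/sh
# Filter log text on standard input: keep lines that mention the driver
# tag as a whole word and that also look like errors or warnings.
# Argument: the tag to look for (defaults to "mx"; use "gm" for GM).
driver_errors() {
    tag=${1:-mx}
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iw -- "$tag" | grep -iE 'error|fail|warn' || true
}

# Examples:
#   dmesg | driver_errors mx
#   driver_errors gm < /var/log/messages
```

Empty output means that no line matched the assumed patterns; it does not by itself prove that the firmware loaded correctly.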
If you are using M3-E* switches, two other useful hardware counters for diagnosing hardware failures are the switch counters called serdesFaultTrap and missedBeatTrap. Note that these two traps can be harmless and may merely indicate that a port on a switch line card is unconnected.
Appendix B: Isolating the Cause of a Hardware Problem

The following diagnostic procedures will need to be used if you are unable to install the Fabric Management System (FMS). Two of the most commonly reported hardware failures are damaged cables and damaged port connectors.
mx_stop_mapper
mx_msg_loop -n <hostname>
mx_counters | grep Bad

where <hostname> is the name of the host on which the test is being run. Note that after running the test, the mx_mapper process must be restarted on the host, as follows:

cd /sbin/
su root
mx_start_mapper

If you’re using GM-2, the gm_allsize "loopback test" is performed as follows:
1. Reset the host counters
cd /bin/
su root
./gm_counters -C
2. On each node, run: .
If the badcrc_cnt (reported in gm_counters) increased significantly after the test on any of the hosts, then you have identified a possible hardware trouble spot in your cluster, and you must now determine whether the bad CRCs are coming from the Myrinet NIC, the cable, or the port on the Myrinet switch.

B.1. How do I determine if a cable has failed?

In most cases, an increasing Bad CRC8 or badcrc__invalid (or badcrc_cnt) count is caused by a damaged cable.
B.3. How do I determine if a Myrinet NIC has failed?

If exchanging the cable and the port on the switch line card does not eliminate the errors, then the Myrinet NIC may be the point of failure. Here are some suggestions for determining whether a Myrinet NIC has failed. First, try using the NIC in isolation by running the mx_pingpong "hardware loopback test" or gm_allsize "hardware loopback test". The hardware loopback test is performed as follows:
1.
4. If you're using GM-1, run the gm_allsize "hardware loopback test" as follows:

gm_counters [--board=n]
gm_simpleroute --loopback [--board=n]
gm_allsize --geometric --exit-on-error [--board=n]
gm_counters [--board=n]

The --board flag is only necessary if the board number is other than 0.
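The loopback procedure runs gm_counters before and after the test, and what matters is how much badcrc_cnt grew in between. The comparison can be sketched as follows. The function names are hypothetical, and the parsing rests on an assumption about the output layout: each counter is taken to appear on its own line, with the counter name as the first field and its value as the last field. Save the two gm_counters outputs to files and adapt the parsing if your output differs.

```shell
#!/bin/sh
# Print the value of one counter from a saved counter listing.
# Arguments: <counter-name> <file>
# Assumption: the name is the first field and the value the last field.
counter_value() {
    awk -v name="$1" '$1 == name { v = $NF } END { print v + 0 }' "$2"
}

# Print how much a counter grew between two saved listings.
# Arguments: <counter-name> <before-file> <after-file>
counter_delta() {
    before=$(counter_value "$1" "$2")
    after=$(counter_value "$1" "$3")
    echo $((after - before))
}

# Example (hypothetical file names):
#   gm_counters > before.txt
#   ... run the loopback test ...
#   gm_counters > after.txt
#   counter_delta badcrc_cnt before.txt after.txt
```

A counter that is missing from a listing is treated as 0, so a typo in the counter name yields a delta of 0 rather than an error.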
Appendix C: Troubleshooting Performance

If you suspect a performance anomaly, we suggest:
1. Run mx_dmabench or gm_debug -L on each node in the cluster to ensure that all nodes report consistent read/write performance and PCI speed.
2. If you are using the Fabric Management System (FMS), does fm_show_alerts detect significant badcrcs in the fabric? Alternatively, check for badcrcs in the mx_counters or gm_counters output, as well as the hardware counters on the switch.
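Once a per-node figure has been collected from step 1 (for example, a DMA read bandwidth per host), inconsistent nodes can be flagged automatically. The sketch below does not parse mx_dmabench or gm_debug output, whose format is not described in this guide; it assumes you have already reduced the results to lines of the form "<host> <value>". The function name slow_nodes and the 90 percent default threshold are hypothetical choices.

```shell
#!/bin/sh
# Read "<host> <value>" lines on standard input and print the hosts whose
# value is below a given percentage of the largest value seen.
# Argument: the percentage threshold (defaults to 90).
slow_nodes() {
    pct=${1:-90}
    awk -v pct="$pct" '
        NF >= 2 {
            host[NR] = $1
            val[NR] = $2 + 0
            if (val[NR] > max) max = val[NR]
        }
        END {
            for (i = 1; i <= NR; i++)
                if ((i in val) && val[i] * 100 < max * pct)
                    print host[i]
        }
    '
}

# Example (hypothetical file of collected results):
#   slow_nodes 90 < dma_results.txt
```

Nodes listed by this filter are candidates for the PCI bus speed and riser card checks described earlier in this guide.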
significantly degraded performance. For a list of all options to gm_allsize, type gm_allsize --help or refer to the FAQ. For sample output of gm_allsize, refer to the FAQ entry “What are the run-time options to gm_allsize?” (http://www.myri.com/cgi-bin/fom?file=79).
and on host2 type:

gm_allsize --both-ways --bandwidth \
    --remote-host=host1 --size=15 --geometric

where the length of the messages sent is (2**size) - 8 bytes. This test has GM streaming packets in both directions (both nodes are always sending) and it causes GM to report the sum of the send and receive bandwidths. The output from this command will consist of two columns of data: the first column lists the message size (in bytes) and the second column lists the bandwidth (in MB/s).
4.