Troubleshooting guide

If two PCI devices must share the same PCI bus, and both devices are capable of
running at 133MHz but the bus is not running at 133MHz, check whether the
motherboard can sustain two PCI devices on the same PCI bus at full speed.
If you are using a riser card, the riser card itself may be the problem: not all
64-bit riser cards will run at 133MHz. Refer to the FAQ entry “My PCI-X slot should
run at 133MHz, but gm_debug reports 66MHz or 100MHz. What’s wrong?”
(http://www.myri.com/cgi-bin/fom?file=281). Try using the Myrinet NIC without the
riser card and see whether the NIC is correctly detected.
The motherboard may also need a BIOS update.
Finally, there could be a problem with the PCI slot on the motherboard. Try using a
different PCI slot.
Sample PCI Bus Performance for Myrinet/PCI-X NICs
(http://www.myri.com/scs/performance/PCIX_motherboards/) is available. Performance
measurements (http://www.myri.com/scs/performance/Myrinet-2000/) for MX and GM
are also available.
6. Test performance between each host and the switch
On every node, run mx_pingpong with shared memory disabled to check for consistent
unidirectional bandwidth.
export MX_DISABLE_SHMEM=1
export MX_RCACHE=1
mx_pingpong -e 0 -r 1 -S 0 -E 10000000 -M 1.7 &
mx_pingpong -e 1 -r 0 -S 0 -E 10000000 -M 1.7 -d `hostname`:0
On PCIXD and PCIXF NICs, the result should be very close to the 250 MB/s line rate
(about 246 MB/s). On PCIXE NICs, it should be very close to the 500 MB/s line rate.
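When checking many nodes, it can help to compare each measured value against the nominal line rate automatically. The following is only a minimal sketch, not part of the MX distribution: the function name check_bw and the 95% pass threshold are assumptions made for illustration, and the measured value must be read from the mx_pingpong output by hand, since the output format is not described here.

```shell
#!/bin/sh
# Hypothetical helper (not part of MX): compare a measured bandwidth in MB/s
# against the nominal line rate for a NIC family.
# The 95% pass threshold is an assumption; adjust it to your own tolerance.
check_bw() {
    nic=$1          # NIC family: PCIXD, PCIXF, or PCIXE
    measured=$2     # measured unidirectional bandwidth in MB/s

    case "$nic" in
        PCIXD|PCIXF) line_rate=250 ;;
        PCIXE)       line_rate=500 ;;
        *) echo "unknown NIC type: $nic" >&2; return 2 ;;
    esac

    # awk does the floating-point comparison; exit status 0 means "pass".
    if awk -v m="$measured" -v r="$line_rate" \
        'BEGIN { exit !(m >= 0.95 * r) }'; then
        echo "OK"
    else
        echo "LOW"
        return 1
    fi
}
```

For example, "check_bw PCIXD 246" prints OK, while "check_bw PCIXE 300" prints LOW and returns a non-zero status, which marks that node for a closer look at its cable, slot, or switch port.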
7. Run mpi_stress or gm_stress to stress all of the connections in the Myrinet fabric
Two stress programs have been developed to exercise all of the connections in the
Myrinet fabric. Note that these stress programs are NOT performance benchmarks.
They are designed to flood the network with sends and receives among multiple hosts
in order to expose any link that may have a damaged cable or other damaged hardware
component. These stress programs can be run on a subset of nodes or on the whole
cluster.
One of the stress programs is an MPI program, mpi_stress.c, and is available in the MX
distribution. Configure, compile, and install MPICH-MX or MPICH-GM, and then use
© 2007 Myricom, Inc. DRAFT