D–Troubleshooting
QLogic MPI Troubleshooting
D-30 IB6054601-00 H
The following message indicates that the program on one node exited unexpectedly:
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100000 1000000
MPIRUN: <nodename> node program unexpectedly quit: Exiting.
The Quiescence Detected message is printed when an MPI job is not making
progress. The default timeout is 900 seconds; when it expires, all the
node processes are terminated. The timeout can be extended or disabled with the
-quiescence-timeout option of mpirun (abbreviated -q in the following example).
$ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.
MPIRUN: Per-rank details are the following:
MPIRUN: Rank 0 (<nodename> ) caused MPI progress Quiescence.
MPIRUN: Rank 1 (<nodename> ) caused MPI progress Quiescence.
MPIRUN: both MPI progress and Ping Quiescence Detected after 120 seconds.
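When many ranks are involved, it can help to extract the stalled ranks from the mpirun output programmatically. The following Python sketch is one possible approach, not part of the QLogic software. It assumes the message wording matches the sample above, which may differ between releases; the function name parse_quiescence and the node names node-a and node-b are hypothetical.

```python
import re

# Patterns modeled on the sample output above (an assumption; wording may vary).
SUMMARY_RE = re.compile(
    r"MPIRUN: (\d+) out of (\d+) ranks showed no MPI send or receive progress"
)
RANK_RE = re.compile(
    r"MPIRUN: Rank (\d+) \((.*?)\s*\) caused MPI progress Quiescence"
)


def parse_quiescence(log_text):
    """Return (stalled_count, total_ranks, {rank: nodename}) from mpirun output.

    stalled_count and total_ranks are None if no summary line is found.
    """
    stalled = total = None
    ranks = {}
    for line in log_text.splitlines():
        m = SUMMARY_RE.search(line)
        if m:
            stalled, total = int(m.group(1)), int(m.group(2))
            continue
        m = RANK_RE.search(line)
        if m:
            # Map the rank number to the node name reported in parentheses.
            ranks[int(m.group(1))] = m.group(2)
    return stalled, total, ranks


# Hypothetical node names stand in for <nodename>.
sample = """\
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.
MPIRUN: Per-rank details are the following:
MPIRUN: Rank 0 (node-a ) caused MPI progress Quiescence.
MPIRUN: Rank 1 (node-b ) caused MPI progress Quiescence.
"""
print(parse_quiescence(sample))
```

If every rank is listed, as here, the job is likely deadlocked as a whole; if only some ranks are listed, those nodes are a reasonable place to start investigating.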
Occasionally, a stray process continues to exist outside of its job context. mpirun
checks for stray processes and kills any that it detects. The following is
an example of the messages displayed in this case:
$ mpirun -np 2 -ppn 1 -m ~/tmp/mfast mpi_latency 500000 2000
iqa-38: Received 1 out-of-context eager message(s) from stray process PID=29745 running on host 192.168.9.218
iqa-35: PSM pid 10513 on host IP 192.168.9.221 has detected that I am a stray process, exiting.
2000 5.222116
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host IP=192.168.9.218 sent 1 stray message(s) and was told so 1 time(s) (first stray message at 0.7s (13%),last at 0.7s (13%) into application run)
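To confirm afterward that the reported stray processes are really gone, the PID and host can be pulled out of the messages. The Python sketch below is an illustration only, assuming the "Received ... from stray process" wording shown in the sample; the function name find_strays is hypothetical. It collapses whitespace first, so it works whether or not the message is wrapped across lines.

```python
import re

# Pattern modeled on the sample message above (an assumption; wording may vary).
STRAY_RE = re.compile(
    r"stray process PID=(\d+) running on host (\d+\.\d+\.\d+\.\d+)"
)


def find_strays(log_text):
    """Return a sorted list of unique (pid, host_ip) pairs reported as strays."""
    # Collapse all whitespace runs, including line breaks, to single spaces.
    flat = " ".join(log_text.split())
    return sorted({(int(pid), host) for pid, host in STRAY_RE.findall(flat)})


sample = """\
iqa-38: Received 1 out-of-context eager message(s) from stray
process PID=29745
running on host 192.168.9.218
"""
print(find_strays(sample))
```

Each returned pair can then be checked on the named host, for example with ps, to verify that the process has exited.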
The following message should never occur. If it does, notify Technical Support:
Internal Error: NULL function/argument found:func_ptr(arg_ptr)
Driver and Link Error Messages Reported by MPI Programs
The following driver and link error messages are reported by MPI programs.
When the InfiniBand link fails during a job, a message is reported once per
occurrence. The message will be similar to:
ipath_check_unit_status: IB Link is down
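Because the message is reported once per occurrence, counting matching lines in a job's output gives a rough count of link failures during the run. The following Python sketch assumes the message text matches the example exactly, which the wording "similar to" does not guarantee; the function name count_link_down and the sample lines are hypothetical.

```python
# Marker text taken from the example above; actual messages may vary.
LINK_DOWN_MARKER = "ipath_check_unit_status: IB Link is down"


def count_link_down(log_lines):
    """Count the lines that report an InfiniBand link failure."""
    return sum(1 for line in log_lines if LINK_DOWN_MARKER in line)


# Hypothetical job output: one link failure between two result lines.
sample_lines = [
    "1000 4.98",
    "ipath_check_unit_status: IB Link is down",
    "2000 5.22",
]
print(count_link_down(sample_lines))
```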