HP-MPI V2.3 for Linux Release Note
rank has called MPI_Comm_dup() on the communicator. After all ranks have called
MPI_Comm_dup(), the parent communicator may again be used for point-to-point
communication. MPI_Comm_dup() can be called successfully even after a failure is
observed on the communicator. Because the result of a collective call can vary by rank,
write the application so that it cannot deadlock. Handling failures correctly across
multiple communicators can be very difficult, as the following code demonstrates:
...
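err = MPI_Bcast(buffer, len, type, root, commA);
if (err) {
    MPI_Error_class(err, &class);
    if (class == MPI_ERR_EXITED) {
        /* A failure was observed on commA: duplicate it to get a usable communicator. */
        err = MPI_Comm_dup(commA, &new_commA);
        if (err != MPI_SUCCESS) {
            cleanup_and_exit();
        }
        /* MPI_Comm_free() takes a pointer to the communicator handle. */
        MPI_Comm_free(&commA);
        commA = new_commA;
    }
}
/* Ranks whose MPI_Bcast() succeeded arrive here without calling MPI_Comm_dup(). */
/* Argument order is: dest, sendtag, source, recvtag. */
err = MPI_Sendrecv_replace(buffer2, len2, type2, dest, tag1, src, tag2,
                           commB, &status);
if (err) {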
....
...
In this case, some ranks exit successfully from MPI_Bcast() and move on to the
MPI_Sendrecv_replace() operation on a different communicator. The ranks that
call MPI_Comm_dup() cause only operations on commA to fail. Therefore, some
ranks may never return from the MPI_Sendrecv_replace() call on commB
if their partners are also members of commA and are in the MPI_Comm_dup()
call on commA. This is just one example of why care is needed
when dealing with multiple communicators. In this example, if the intersection of
commA and commB is MPI_COMM_SELF, it is simpler to write an application that does
not deadlock during failure.
The -ha:recover option is available only on HP hardware. Using it on
non-HP hardware results in an error message. On third-party systems, a failed
communicator can continue to be used for point-to-point communication, but no
recovery mechanism is available.
1.2.7.7.7 Network High Availability (-ha:net)
The net option to -ha enables network high availability, which attempts to
insulate an application from errors in the network. In this release,
-ha:net has an effect only on IBV with OFED 1.2 or later, where it causes Automatic
Path Migration to be used. This option currently has no effect on TCP connections.
The -ha:net option is available only on HP hardware. Using it on non-HP
hardware results in an error message.
1.2.7.7.8 Failure Detection (-ha:detect)
When the -ha:detect option is used, communication failures are detected, and a detected
failure does not interfere with the application's ability to continue communicating with
processes that have not been affected by the failure. In addition to specifying