HP-MPI Version 2.3.1 for Linux Release Note

3.8.4 Using MPI_Comm_disconnect

In high availability mode, MPI_Comm_disconnect is collective only across the local
group of the calling process. This enables a process group to independently break a
connection to the remote group in an intercommunicator without synchronizing with
those processes. Unreceived messages on the remote side are buffered and might be
received until the remote side calls MPI_Comm_disconnect.

Receive calls that cannot be satisfied by a buffered message fail on the remote processes
after the local processes have called MPI_Comm_disconnect. Send calls on either side
of the intercommunicator fail after either side has called MPI_Comm_disconnect.
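
The send and receive rules above can be summarized as a small state model. The following C sketch is only an illustration of those rules, not HP-MPI code: the side_model type, the recv_result values, and both functions are hypothetical names invented for this example, and the "would wait" outcome is an assumption about a receive that finds an empty buffer while the peer is still connected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of one side of an intercommunicator under -ha.
   These names are illustrative only; none of them is an HP-MPI API. */
typedef struct {
    int  buffered;           /* messages delivered here but not yet received */
    bool self_disconnected;  /* this side has called MPI_Comm_disconnect */
    bool peer_disconnected;  /* the other side has called MPI_Comm_disconnect */
} side_model;

typedef enum { RECV_OK, RECV_WOULD_WAIT, RECV_FAIL } recv_result;

/* Sends fail on both sides once either side has disconnected. */
static bool model_send_ok(const side_model *s)
{
    assert(s != NULL);
    return !s->self_disconnected && !s->peer_disconnected;
}

/* Buffered messages stay receivable until this side itself disconnects.
   With an empty buffer, the receive fails once the peer has disconnected. */
static recv_result model_recv(side_model *s)
{
    assert(s != NULL && s->buffered >= 0);
    if (s->self_disconnected)
        return RECV_FAIL;    /* nothing is received after our own disconnect */
    if (s->buffered > 0) {
        s->buffered--;       /* satisfied by a buffered message */
        return RECV_OK;
    }
    /* Empty buffer: fail if the peer is gone, otherwise the call would wait
       (the waiting case is an assumption of this model). */
    return s->peer_disconnected ? RECV_FAIL : RECV_WOULD_WAIT;
}
```

In this model, a side holding one buffered message whose peer has already disconnected can complete one more receive, after which further receives and all sends fail.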

3.8.5 Instrumentation and High Availability Mode

HP-MPI lightweight instrumentation is now supported when using -ha and singletons.
If some ranks terminate during or before MPI_Finalize(), the lowest rank ID in
MPI_COMM_WORLD produces the instrumentation output file on behalf of the
application, and instrumentation data for the exited ranks is not included. For
other enhancements to instrumentation in this release, see “Expanded Lightweight
Instrumentation” (page 23).

The use of -ha and -i is available only on HP hardware. Usage on third-party hardware
results in an error message.

3.8.6 Failure Recovery (-ha:recover)

Fault-Tolerant MPI_Comm_dup() That Excludes Failed Ranks

When using -ha:recover, the functionality of MPI_Comm_dup() enables an
application to recover from errors.

IMPORTANT: The MPI_Comm_dup() function is not standard compliant because a
call to MPI_Comm_dup() always terminates all outstanding communications on the
communicator with failures, regardless of whether errors are present.

When one or more pairs of ranks within a communicator are unable to communicate
because a rank has exited or the communication layers have returned errors, a call to
MPI_Comm_dup attempts to return the largest communicator containing ranks that
were fully interconnected at some point during the MPI_Comm_dup call. Because new
errors can occur at any time, the returned communicator might not be completely error
free. However, two ranks in the original communicator that were unable to communicate
with each other before the call are never both included in a communicator generated by
MPI_Comm_dup.
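
To make "largest communicator containing ranks that were fully interconnected" concrete, the following C sketch searches a small pairwise connectivity table by brute force. It is a toy model under stated assumptions, not the algorithm HP-MPI uses: the ok table, MAX_RANKS, and all three function names are hypothetical, and the exhaustive search is practical only for a handful of ranks.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 16  /* hypothetical limit so a rank set fits in a bitmask */

/* ok[i][j] is true when ranks i and j can still communicate (assumed input). */
static bool fully_connected(unsigned set, int n, bool ok[MAX_RANKS][MAX_RANKS])
{
    for (int i = 0; i < n; i++) {
        if (!(set & (1u << i)))
            continue;
        for (int j = i + 1; j < n; j++) {
            if ((set & (1u << j)) && !ok[i][j])
                return false;  /* one broken pair disqualifies the set */
        }
    }
    return true;
}

/* Counts the ranks present in a bitmask. */
static int count_ranks(unsigned set)
{
    int count = 0;
    while (set) {
        count += (int)(set & 1u);
        set >>= 1;
    }
    return count;
}

/* Returns a bitmask of the largest fully interconnected group containing `me`.
   Ties are resolved in favor of the numerically smallest bitmask. */
static unsigned largest_group_with(int me, int n, bool ok[MAX_RANKS][MAX_RANKS])
{
    assert(n > 0 && n <= MAX_RANKS && me >= 0 && me < n);
    unsigned best = 1u << me;  /* a rank can always form a group of one */
    for (unsigned set = 0; set < (1u << n); set++) {
        if (!(set & (1u << me)))
            continue;
        if (count_ranks(set) > count_ranks(best) && fully_connected(set, n, ok))
            best = set;
    }
    return best;
}
```

With four ranks where rank 3 cannot reach anyone, the model places ranks 0 to 2 in one group (bitmask 0x7) and leaves rank 3 alone (0x8). If the table instead splits the ranks into {0, 1} and {2, 3}, callers on opposite sides obtain different groups.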

Communication failures can partition ranks into two groups, A and B, so that no rank
in group A can communicate with any rank in group B and vice versa. A call to
MPI_Comm_dup() can behave similarly to a call to MPI_Comm_split(), returning
different legal communicators to different callers. When a larger communicator exists
