Scalability
To use daemon communication, specify the -commd option in the mpirun
command. Once you have set the -commd option, you can use the
MPI_COMMD environment variable to specify the number of
shared-memory fragments used for inbound and outbound messages.
Refer to “mpirun” on page 74 and “MPI_COMMD” on page 150 for more
information. Because daemon communication can lower application
performance, use it only to scale an application to a large number of
ranks when the operating system file descriptor limits cannot be raised
to the required values.
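For example, a job can be launched with daemon communication enabled as
follows. This is a sketch: the two MPI_COMMD values (assumed here to
name the outbound and inbound fragment counts), the rank count, and the
application name a.out are illustrative.

    % export MPI_COMMD=64,64
    % mpirun -commd -np 256 ./a.out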
Resource usage of RDMA communication modes
When using InfiniBand or GM, a certain amount of memory is pinned; that
is, it is locked in physical memory and cannot be paged out.
The amount of pre-pinned memory HP-MPI uses can be adjusted using
several tunables, such as MPI_RDMA_MSGSIZE, MPI_RDMA_NENVELOPE,
MPI_RDMA_NSRQRECV, and MPI_RDMA_NFRAGMENT.
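For example, one of these tunables can be propagated to the ranks on
the mpirun command line. This is a sketch; the -e option for setting
environment variables and the envelope count shown are illustrative:

    % mpirun -e MPI_RDMA_NENVELOPE=128 -np 64 ./a.out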
By default, when the number of ranks is less than or equal to 512, each
rank pre-pins 256 KB per remote rank, so each rank pins up to 128 MB. If
the number of ranks is above 512 but less than or equal to 1024, each
rank pre-pins only 96 KB per remote rank, so each rank pins up to 96 MB.
If the number of ranks exceeds 1024, the shared receiving queue option
is used, which reduces the pre-pinned memory for each rank to a fixed
64 MB regardless of how many ranks are used.
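These tiers can be restated as a small shell sketch (a hypothetical
helper, not part of HP-MPI; it only echoes the default per-rank
pre-pinned sizes described above, in kilobytes):

    prepin_kb() {
        nranks=$1
        if [ "$nranks" -le 512 ]; then
            echo $(( nranks * 256 ))   # 256 KB per remote rank, up to 128 MB
        elif [ "$nranks" -le 1024 ]; then
            echo $(( nranks * 96 ))    # 96 KB per remote rank, up to 96 MB
        else
            echo $(( 64 * 1024 ))      # shared receiving queue: fixed 64 MB
        fi
    }

For example, prepin_kb 1024 prints 98304 (96 MB).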
HP-MPI also provides the safeguard variables MPI_PHYSICAL_MEMORY and
MPI_PIN_PERCENTAGE, which set an upper bound on the total amount of
memory an HP-MPI job will pin. An error is reported during startup if
this bound is not large enough to accommodate the pre-pinned memory.
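As a sketch, the safeguard might be set on the mpirun command line. The
-e option, the memory amount, and its unit here are illustrative
assumptions; refer to the descriptions of MPI_PHYSICAL_MEMORY and
MPI_PIN_PERCENTAGE for the exact formats:

    % mpirun -e MPI_PHYSICAL_MEMORY=4096 -e MPI_PIN_PERCENTAGE=20 -np 64 ./a.out

Here at most 20 percent of an assumed 4096 MB of physical memory could
be pinned.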