Altair AcuSolve Performance
Dell EMC Ready Solution for HPC Digital Manufacturing Altair Performance
These benchmarks were carried out on a cluster of eight servers, each with Xeon Gold 6252 processors
(48 cores per server). The results are presented as performance relative to the single-node result. On the
surface, the results for the “Riser” model appear surprising: performance increases by more than a factor of
two going from one to two nodes. However, this behavior can be explained by “cache effects”: as the data set
is distributed among a greater number of nodes, there can be a point where each node's share of the problem
fits entirely into cache, and the speed of the solver can increase dramatically. Such cache effects are highly
problem specific.
In general, distributed memory parallelism involves a tradeoff: cache performance typically improves as
the problem is distributed to more nodes, but communication overhead also increases, counteracting the
caching benefit. Overall, the datasets show excellent parallel speedup up to 4 nodes. The largest model,
“Nozzle”, displays nearly linear parallel scaling up to 8 nodes.
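The relative-performance metric used above can be sketched as follows. This is a minimal illustration, assuming performance is computed as the ratio of single-node solve time to multi-node solve time; the function names and the timing values are hypothetical and are not measurements from this paper. An efficiency above 1.0 marks superlinear scaling of the kind attributed to cache effects.

```python
def relative_performance(solve_times):
    """Map node count -> performance relative to the smallest node count.

    solve_times: dict of {node_count: wall-clock seconds}.
    """
    base_nodes = min(solve_times)
    base_time = solve_times[base_nodes]
    return {n: base_time / t for n, t in sorted(solve_times.items())}


def parallel_efficiency(rel_perf):
    """Relative performance divided by the ideal (linear) speedup."""
    base_nodes = min(rel_perf)
    return {n: p / (n / base_nodes) for n, p in rel_perf.items()}


# Hypothetical wall-clock times in seconds (illustrative only).
example_times = {1: 1000.0, 2: 450.0, 4: 240.0, 8: 130.0}

perf = relative_performance(example_times)
eff = parallel_efficiency(perf)

for nodes in perf:
    # Efficiency > 1.0 means faster than linear scaling (superlinear).
    note = "  (superlinear)" if eff[nodes] > 1.0 else ""
    print(f"{nodes} node(s): {perf[nodes]:.2f}x, efficiency {eff[nodes]:.2f}{note}")
```

With these made-up timings, the two-node run is more than twice as fast as the single-node run, which is the pattern described for the “Riser” model, while efficiency drops below 1.0 at eight nodes as communication overhead grows.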
AcuSolve is a hybrid parallel application: it can use shared memory parallelism within a node and
distributed memory parallelism both within a node and across nodes. Finding the proper balance between
shared memory and distributed memory parallelism within a node can be daunting. Figure 6 shows the
parallel performance for these models as the number of shared memory parallel threads is adjusted from 1
to 8 threads per domain. The number of domains per server was the total number of cores per server divided
by the number of threads per domain.
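The relationship between threads per domain and domains per server can be sketched as below. This is an illustration only: the 48 cores per server is taken from the Figure 5 axis, the function name is hypothetical, and the sketch assumes that only thread counts dividing the core count evenly are used, which the text does not state explicitly.

```python
def domain_layouts(cores_per_server, max_threads=8):
    """List (threads_per_domain, domains_per_server) pairs.

    Assumption: only thread counts that divide the core count evenly
    are considered, so every core is assigned to exactly one domain.
    """
    layouts = []
    for threads in range(1, max_threads + 1):
        if cores_per_server % threads == 0:
            layouts.append((threads, cores_per_server // threads))
    return layouts


# 48 cores per server, as on the Figure 5 axis.
for threads, domains in domain_layouts(48):
    print(f"{threads} thread(s) per domain -> {domains} domains per server")
```

For a 48-core server this yields 48, 24, 16, 12, 8, and 6 domains for 1, 2, 3, 4, 6, and 8 threads per domain respectively.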
Figure 5: AcuSolve Parallel Scaling. Performance relative to 48 cores (axis 1.0 to 8.0) versus number of cores (number of nodes): 48 (1), 96 (2), 192 (4), 384 (8). Models: Riser, Windmill, Nozzle.