
Altair AcuSolve Performance

Dell EMC Ready Solution for HPC Digital Manufacturing: Altair Performance

These benchmarks were carried out on a cluster of eight servers, each with Intel Xeon Gold 6252 processors (48 cores per server). Results are presented as performance relative to the single-node result. At first glance the results for the “Riser” model appear surprising: performance more than doubles going from one to two nodes. This behavior can be explained by cache effects. When the data set is distributed among a greater number of nodes, the portion of the problem held on each node can become small enough to fit entirely into cache, and the speed of the solver can increase dramatically. Such cache effects are highly problem specific. In general, distributed memory parallelism involves a tradeoff: cache performance typically improves as the problem is distributed to more nodes, but communication overhead also increases, counteracting the caching benefit. Overall, the datasets show excellent parallel speedup up to 4 nodes. The largest model, “Nozzle”, displays nearly linear parallel scaling up to 8 nodes.
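The relative-performance metric and the notion of better-than-linear scaling described above can be made concrete with a minimal sketch. The function names, the 5% tolerance, and the timings below are assumptions chosen for illustration; the timings are hypothetical and are not measurements from these benchmarks.

```python
def relative_performance(single_node_time, multi_node_time):
    """Performance relative to the single-node run (higher is better)."""
    return single_node_time / multi_node_time


def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear scaling achieved; 1.0 means perfectly linear."""
    return speedup / nodes


def classify_scaling(speedup, nodes, tol=0.05):
    """Label a result as superlinear, near-linear, or sublinear.

    The tolerance `tol` is an arbitrary illustrative threshold.
    """
    eff = parallel_efficiency(speedup, nodes)
    if eff > 1 + tol:
        return "superlinear"  # e.g. cache effects outweigh communication cost
    if eff >= 1 - tol:
        return "near-linear"
    return "sublinear"        # communication overhead dominates


# Hypothetical wall-clock times in seconds, keyed by node count.
# These are NOT results from the white paper.
hypothetical_times = {1: 1000.0, 2: 450.0, 4: 260.0, 8: 160.0}

for nodes, seconds in hypothetical_times.items():
    speedup = relative_performance(hypothetical_times[1], seconds)
    print(nodes, round(speedup, 2), classify_scaling(speedup, nodes))
```

A speedup of more than 2 on two nodes, as reported for the “Riser” model, corresponds to a parallel efficiency above 1.0 in this sketch.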

AcuSolve is a hybrid parallel application: it can use shared memory parallelism within a node and distributed memory parallelism both within a node and across nodes. Finding the proper balance between shared memory and distributed memory parallelism within a node can be daunting. Figure 6 shows the parallel performance for these models as the number of shared memory parallel threads is varied from 1 to 8 threads per domain. The number of domains per server is the total number of cores per server divided by the number of threads per domain.
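The relationship between threads per domain and domains per server can be illustrated with a short sketch. It assumes 48 cores per server (the single-node core count in Figure 5) and lists every thread count from 1 to 8 that divides the core count evenly. The function name is hypothetical, and the text does not state which of these combinations were actually benchmarked.

```python
def domain_layouts(cores_per_server, max_threads=8):
    """Return (threads_per_domain, domains_per_server) pairs that use every core.

    Only thread counts that divide the core count evenly are kept, so that
    threads_per_domain * domains_per_server == cores_per_server.
    """
    layouts = []
    for threads in range(1, max_threads + 1):
        if cores_per_server % threads == 0:
            layouts.append((threads, cores_per_server // threads))
    return layouts


# Assumption: 48 cores per server, as in the single-node case of Figure 5.
for threads, domains in domain_layouts(48):
    print(f"{threads} thread(s) per domain -> {domains} domains per server")
```

With 48 cores, 1 thread per domain gives 48 domains per server, while 8 threads per domain gives 6 domains per server.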

Figure 5: AcuSolve Parallel Scaling. [Chart: performance relative to 48 cores (y-axis, 1.0 to 8.0) versus number of cores (number of nodes): 48 (1), 96 (2), 192 (4), 384 (8). Series: Riser, Windmill, Nozzle.]