White Papers

Altair OptiStruct Performance
Dell EMC Ready Solution for HPC Digital Manufacturing Altair Performance
benchmark model and server. Typically, OptiStruct scales better with more DMP partitions per node and
fewer SMP threads per partition than with fewer DMP partitions and more SMP threads per partition.
However, memory and I/O requirements typically increase with the number of DMP partitions, so it is possible to
run out of memory or create I/O bottlenecks with too many DMP partitions per node.
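The trade-off described above can be sketched as a simple feasibility check. This is a minimal illustration only: it assumes that memory and scratch usage grow roughly linearly with the number of DMP partitions, which is a simplification, and the function name, parameters, and all numbers are hypothetical placeholders rather than measurements from this study.

```python
def max_feasible_partitions(candidates, mem_per_partition_gb,
                            scratch_per_partition_gb,
                            node_mem_gb, node_scratch_gb):
    """Return the largest candidate DMP partition count whose estimated
    memory and scratch footprint fits on one node, or None if none fit.

    Assumes (hypothetically) a linear cost per partition.
    """
    feasible = [
        n for n in candidates
        if n * mem_per_partition_gb <= node_mem_gb
        and n * scratch_per_partition_gb <= node_scratch_gb
    ]
    return max(feasible) if feasible else None


# Placeholder inputs, not values from the benchmarks in this paper.
best = max_feasible_partitions(
    candidates=[1, 2, 4, 8, 16],
    mem_per_partition_gb=40,
    scratch_per_partition_gb=300,
    node_mem_gb=384,
    node_scratch_gb=1000,
)
print(best)
```

With these placeholder inputs the scratch limit, not the memory limit, is what caps the partition count, which is the kind of I/O constraint the text warns about.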
Overall, for the small Engine model, performance correlated well with the total
number of cores on the system, particularly for the DMP solver, which showed significantly better overall
performance than the SMP solver. The results from the Taurus model displayed fairly uniform behavior for the
SMP results across the various servers, similar to what was observed with the Engine model. For systems with
sufficient memory, the DMP version displayed similar increases in performance as a function of increasing
number of cores per server.
The performance of the DMP solver depends on how the simulation is laid out on the server in terms of the
number of DMP partitions and the number of SMP threads per partition. For practical reasons, it was
assumed that there would be no more than one active thread per physical core, so the product of the number
of DMP partitions on the node and the number of threads per partition was never greater than the total
number of physical cores. Figure 11 shows the relative performance of the various domain layouts for the
servers and benchmark models tested.
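The layout constraint above (DMP partitions times SMP threads per partition never exceeding the physical core count) can be sketched as follows. This is an illustrative helper, not part of OptiStruct; the function name is hypothetical, the candidate partition counts simply mirror the domain counts shown in Figure 11, and the 24-core node is an assumed example.

```python
def dmp_smp_layouts(physical_cores, partition_counts=(1, 2, 4, 8, 16)):
    """List (partitions, threads_per_partition) pairs that keep at most
    one active thread per physical core.

    For each partition count, the thread count is the largest integer
    such that partitions * threads <= physical_cores.
    """
    layouts = []
    for partitions in partition_counts:
        threads = physical_cores // partitions
        if threads >= 1:  # skip layouts with more partitions than cores
            layouts.append((partitions, threads))
    return layouts


# Assumed example: a node with 24 physical cores.
layouts_24 = dmp_smp_layouts(24)
print(layouts_24)
```

When the core count is not evenly divisible by the partition count (for example 16 partitions on 24 cores), this sketch leaves some cores idle rather than oversubscribing them.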
Here, the reference value of 1.0 is the single-domain result on the 6136-based server. For the cases
tested, eight domains appeared to be a good value, offering significantly better performance than two or four
domains. However, the I/O storage requirements increase with the number of MPI domains, such that there
can be a practical limit on the number of domains possible for larger models. As an example, there was
insufficient disk space available on the 6142-based server with 1 TB of scratch space to use four MPI domains
with the Taurus benchmark.
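As an illustration of how a relative-performance value of this kind can be derived, the sketch below normalizes elapsed times against a chosen reference run. It assumes that performance is the inverse of elapsed time, which the text does not state explicitly; the function name and the timings are hypothetical placeholders, not results from this study.

```python
def relative_performance(elapsed_seconds, reference_key):
    """Normalize elapsed times so the reference run scores 1.0.

    Assumes performance is inversely proportional to elapsed time,
    so a run that finishes in half the reference time scores 2.0.
    """
    reference = elapsed_seconds[reference_key]
    return {key: reference / t for key, t in elapsed_seconds.items()}


# Placeholder timings in seconds, not measured data.
timings = {"1-domain": 1000.0, "2-domains": 800.0, "8-domains": 500.0}
scores = relative_performance(timings, "1-domain")
print(scores)
```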
Apart from providing better single-server performance than the SMP solver, the DMP solver allows
simulations to be run across multiple servers. Our experience with MPI-based CAE packages is that a
high-bandwidth, low-latency network is typically required to carry out MPI-based simulations across multiple nodes.
Figure 11: OptiStruct DMP performance: Domains per node
[Chart. X-axis: domains per node (1, 2, 4, 8, 16). Y-axis, as labeled: Performance Relative to E5-2667v4 (0.00 to 2.50). Series: 6136-Engine, 6136-Taurus, 6142-Engine.]