15 Dell HPC System for Manufacturing—System Architecture and Application Performance
4 System Performance
This section presents the performance results obtained from the reference system described in Section 3.
Basic performance of the servers was measured first, prior to any application benchmarking. This was
done to ensure that individual server sub-systems were performing as expected and that the systems were
stable. The STREAM memory bandwidth test was used to check memory performance, and HPL was
used to check the computational subsystem, verify power delivery, and stress test the individual servers. After
basic system performance was verified, the Fluent, ANSYS Mechanical, STAR-CCM+, and LS-DYNA
benchmark cases were measured on the system.
4.1 STREAM
The STREAM benchmark results for the computational building blocks are listed in Table 4. The results in
the table are the minimum, maximum, and average memory bandwidth from the Triad test for three runs
of STREAM on all of the building blocks in the reference system. These results demonstrate sustained
memory bandwidth of 129 GBps for the Explicit building block, 117 GBps for the Implicit building block,
and 129 GBps for the Implicit GPGPU building block. This performance is as expected.
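To make the Triad figure concrete, the following is a minimal Python sketch of how a STREAM-style Triad rate is derived: the kernel computes a[i] = b[i] + scalar * c[i], and the rate is the bytes moved (two reads and one write per element) divided by the elapsed time. The function names, array size, and repeat count are illustrative assumptions, not part of the STREAM distribution, and an interpreted loop will report rates far below what the compiled benchmark achieves on the same hardware.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM-style Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth_mbps(n=200_000, scalar=3.0, repeats=3):
    """Time the Triad kernel `repeats` times; return (min, max, avg) in MB/s.

    n and repeats are illustrative values, not STREAM's defaults.
    """
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    # Two arrays are read (b, c) and one is written (a) per iteration.
    bytes_moved = 3 * n * a.itemsize
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, scalar)
        elapsed = time.perf_counter() - start
        rates.append(bytes_moved / elapsed / 1e6)  # 1 MB = 10**6 bytes
    return min(rates), max(rates), sum(rates) / len(rates)


lo, hi, avg = triad_bandwidth_mbps()
print(f"Triad MB/s  min={lo:.1f}  max={hi:.1f}  avg={avg:.1f}")
```

The minimum, maximum, and average returned here mirror the three columns reported for each building block; only the relationship between bytes moved and elapsed time carries over, not the absolute numbers.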
The memory bandwidth for the Implicit building block is lower than that of the other two building blocks
because of the specific processor selected for this system. The Intel Xeon E5-2667 v4 has only a single
memory controller, whereas the processors used in the other two building blocks have two memory
controllers. As a result, the total available memory bandwidth per processor is lower; however, memory
bandwidth per core is high because the E5-2667 v4 is an 8-core processor.
The memory bandwidth of the master node and the VDI system was also verified prior to application
benchmarking.
Table 4 STREAM Benchmark Results
Building Block    Triad MBps (min)    Triad MBps (max)    Triad MBps (avg)
Explicit          129,410             129,927             129,665
Implicit          117,528             117,918             117,712
Implicit GPGPU    129,063             129,350             129,222
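The Table 4 values can be summarized with a few lines of Python, shown here as a sketch. The Triad figures are copied from the table; the helper names are hypothetical, and the per-core calculation assumes a two-socket Implicit building block (the socket count is not stated in this section, so treat it as an assumption).

```python
# Triad results in MBps, copied from Table 4.
STREAM_TRIAD_MBPS = {
    "Explicit": {"min": 129_410, "max": 129_927, "avg": 129_665},
    "Implicit": {"min": 117_528, "max": 117_918, "avg": 117_712},
    "Implicit GPGPU": {"min": 129_063, "max": 129_350, "avg": 129_222},
}


def spread_percent(row):
    """Min-to-max spread as a percentage of the average."""
    return 100.0 * (row["max"] - row["min"]) / row["avg"]


def per_core_mbps(avg_mbps, sockets, cores_per_socket):
    """Average bandwidth divided evenly across all cores in the node."""
    return avg_mbps / (sockets * cores_per_socket)


for name, row in STREAM_TRIAD_MBPS.items():
    print(f"{name}: avg {row['avg'] / 1000:.1f} GBps, spread {spread_percent(row):.2f}%")

# Implicit building block relative to the Explicit building block.
ratio = STREAM_TRIAD_MBPS["Implicit"]["avg"] / STREAM_TRIAD_MBPS["Explicit"]["avg"]
print(f"Implicit / Explicit bandwidth ratio: {ratio:.3f}")

# ASSUMPTION: two sockets per node; 8 cores per E5-2667 v4 is from the text.
implicit_per_core = per_core_mbps(
    STREAM_TRIAD_MBPS["Implicit"]["avg"], sockets=2, cores_per_socket=8
)
print(f"Implicit per-core bandwidth (assumed 2 sockets): {implicit_per_core:.0f} MBps")
```

Core counts for the other two building blocks are not given in this section, so the per-core comparison is left to the reader to complete with the values from Section 3.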
4.2 High Performance Linpack (HPL)
High Performance Linpack (HPL) is a popular, computationally intensive benchmark. It is used to
rank the TOP500 fastest supercomputers in the world and is an important burn-in test. Although it is not
usually representative of real-world application performance, as a burn-in test it helps to quickly identify
unstable components and verify power delivery to the system.
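One simple way to use HPL as a burn-in check is to compare each server's result with the median across identical servers and flag any that fall well below it. The sketch below illustrates that idea only; the function name, the 5% tolerance, and the node names and values in the usage example are hypothetical placeholders, not measurements from the reference system.

```python
from statistics import median


def flag_slow_nodes(results_gflops, tolerance=0.05):
    """Return names of nodes whose HPL result is more than `tolerance`
    (as a fraction) below the median of all nodes.

    The 5% default is an illustrative threshold, not a recommended value.
    """
    mid = median(results_gflops.values())
    cutoff = mid * (1.0 - tolerance)
    return sorted(name for name, value in results_gflops.items() if value < cutoff)


# Hypothetical placeholder values, chosen only to exercise the function.
example = {"node01": 1000.0, "node02": 990.0, "node03": 1010.0, "node04": 850.0}
print(flag_slow_nodes(example))
```

A node flagged this way would warrant a closer look at cooling, power delivery, or BIOS settings before application benchmarking begins.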
The precompiled HPL binary from Intel MKL was used in this test. The results for individual computational
building blocks are listed in Table 5. This table presents the minimum and maximum results from three runs
of HPL for all of the individual building blocks in the system. The variation observed for the eight Explicit