White Papers

15 Dell HPC System for Manufacturing—System Architecture and Application Performance

4 System Performance

This section presents the performance results obtained from the reference system described in Section 3.

Basic server performance was measured before any application benchmarking to ensure that the
individual server subsystems were performing as expected and that the systems were stable. The
STREAM memory bandwidth test was used to check memory performance, and HPL was used to check the
computational subsystem, verify power delivery, and stress test the individual servers. After
basic system performance was verified, the Fluent, ANSYS Mechanical, STAR-CCM+, and LS-DYNA
benchmark cases were measured on the system.

4.1 STREAM

The STREAM benchmark results for the computational building blocks are listed in Table 4. The
table gives the minimum, maximum, and average memory bandwidth from the Triad test for three runs
of STREAM on all of the building blocks in the reference system. These results demonstrate
sustained memory bandwidth of approximately 129 GBps for the Explicit building block, 117 GBps
for the Implicit building block, and 129 GBps for the Implicit GPGPU building block. This
performance is as expected.
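For readers unfamiliar with the Triad test, the following minimal sketch shows the kernel and how STREAM converts a timing into a bandwidth figure: each element costs two 8-byte reads and one 8-byte write, and 1 MB is counted as 10^6 bytes. The function names `triad` and `triad_mbps` and the array size are illustrative assumptions. Pure Python is limited by the interpreter, so the printed number is not a hardware measurement and is not comparable to the results reported here.

```python
import time
from array import array

BYTES_PER_DOUBLE = 8


def triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_mbps(n_elements, seconds):
    """Bandwidth as STREAM counts it for Triad.

    Two arrays are read and one is written, so 3 * 8 bytes move per element.
    STREAM reports MB/s with 1 MB = 10**6 bytes.
    """
    bytes_moved = 3 * BYTES_PER_DOUBLE * n_elements
    return bytes_moved / seconds / 1e6


# Illustrative array size (an assumption); the real benchmark uses arrays
# far larger than the processor caches.
n = 200_000
a = array("d", [0.0]) * n
b = array("d", [1.0]) * n
c = array("d", [2.0]) * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

print(f"{triad_mbps(n, elapsed):.1f} MBps (interpreter-bound, not a hardware result)")
```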

The memory bandwidth of the Implicit building block is lower than that of the other two building
blocks because of the specific processor selected for this system. The Intel Xeon E5-2667 v4 has
only a single memory controller, whereas the processors used in the other two building blocks
have two memory controllers. As a result, the total available memory bandwidth per processor is
lower; however, memory bandwidth per core is high because the E5-2667 v4 is an 8-core processor.
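The per-core argument can be made concrete with a short calculation. This is a sketch under stated assumptions: the average Triad values are the ones reported in Table 4 and the 8-core count comes from the text, but the two-socket configuration is assumed here, and the 16-core comparison processor is a hypothetical placeholder because this section does not give the core counts of the other building blocks.

```python
def bandwidth_per_core_mbps(total_mbps, sockets, cores_per_socket):
    """Divide node-level Triad bandwidth evenly across all cores in the server."""
    return total_mbps / (sockets * cores_per_socket)


# Implicit building block: average Triad result reported in Table 4 and
# 8 cores per E5-2667 v4 (from the text).
# ASSUMPTION: two processor sockets per server.
implicit = bandwidth_per_core_mbps(117_712, sockets=2, cores_per_socket=8)

# HYPOTHETICAL comparison: the Explicit average from Table 4 on a server with
# two 16-core processors. The 16-core count is a placeholder for illustration,
# not a value taken from this section.
hypothetical = bandwidth_per_core_mbps(129_665, sockets=2, cores_per_socket=16)

print(f"Implicit (8-core):         {implicit:.0f} MBps per core")
print(f"Hypothetical 16-core part: {hypothetical:.0f} MBps per core")
```

Under these assumptions the lower node-level bandwidth still leaves each core of the 8-core processor with substantially more bandwidth than a core on a higher core-count part would receive.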

The memory bandwidth of the master node and the VDI system was also verified prior to application
benchmarking.

Table 4 STREAM Benchmark Results

Building Block    Triad MBps (min)    Triad MBps (max)    Triad MBps (avg)
Explicit          129,410             129,927             129,665
Implicit          117,528             117,918             117,712
Implicit GPGPU    129,063             129,350             129,222
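As a quick consistency check on the STREAM results, the run-to-run spread can be computed directly from the reported minimum, maximum, and average values. The sketch below uses only the numbers in the table; the helper names `spread_percent` and `to_gbps` are illustrative, and the GBps conversion assumes decimal units (1 GBps = 1000 MBps), matching STREAM's convention of 1 MB = 10^6 bytes.

```python
# Triad results in MBps copied from the STREAM results table:
# (min, max, avg) for each building block.
STREAM_TRIAD_MBPS = {
    "Explicit": (129_410, 129_927, 129_665),
    "Implicit": (117_528, 117_918, 117_712),
    "Implicit GPGPU": (129_063, 129_350, 129_222),
}


def spread_percent(lowest, highest, average):
    """Range between the slowest and fastest result, as a percentage of the average."""
    return 100.0 * (highest - lowest) / average


def to_gbps(mbps):
    """Convert MBps to GBps using decimal units (assumption: 1 GBps = 1000 MBps)."""
    return mbps / 1000.0


for name, (lowest, highest, average) in STREAM_TRIAD_MBPS.items():
    print(
        f"{name:<15} avg {to_gbps(average):7.3f} GBps  "
        f"spread {spread_percent(lowest, highest, average):.2f}%"
    )
```

For all three building blocks the spread is below 0.5 percent of the average, which is consistent with the statement that the systems were stable.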

4.2 High Performance Linpack (HPL)

High Performance Linpack (HPL) is a popular, computationally intensive benchmark. It is used to
rank the TOP500 list of the fastest supercomputers in the world and is also an important burn-in
test. Although HPL is not usually representative of real-world application performance, as a
burn-in test it helps to quickly identify unstable components and verify power delivery to the
system.
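HPL results are commonly judged against a theoretical peak, computed as sockets x cores per socket x clock rate x floating-point operations per cycle. The sketch below illustrates that arithmetic only. All inputs are assumptions that do not come from this section: a two-socket server, the 3.2 GHz nominal base clock of the E5-2667 v4, and 16 double-precision operations per cycle per core for AVX2 with two FMA units. Sustained clocks under AVX load are typically lower than the nominal base clock, so a real efficiency calculation should use the frequency actually observed during the run.

```python
def theoretical_peak_gflops(sockets, cores_per_socket, clock_ghz, flops_per_cycle):
    """Theoretical double-precision peak in GFLOPS."""
    return sockets * cores_per_socket * clock_ghz * flops_per_cycle


def hpl_efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak achieved by a measured HPL result."""
    return measured_gflops / peak_gflops


# ASSUMPTIONS (not taken from this section): 2 sockets, 3.2 GHz nominal base
# clock, and 16 double-precision FLOPs per cycle per core (AVX2, two FMA units).
# The 8 cores per socket is the E5-2667 v4 core count given in the text.
peak = theoretical_peak_gflops(
    sockets=2, cores_per_socket=8, clock_ghz=3.2, flops_per_cycle=16
)
print(f"Theoretical peak under these assumptions: {peak:.1f} GFLOPS")
```

A measured result from a burn-in run could then be passed to `hpl_efficiency` together with `peak`; a server whose efficiency is noticeably lower than that of otherwise identical servers is a candidate for closer inspection.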

The precompiled HPL binary from Intel MKL was used in this test. The results for individual computational

building blocks are listed in Table 5. This table presents the minimum and maximum results from three runs

of HPL for all of the individual building blocks in the system. The variation observed for the eight Explicit