GCC compiler          4.8.5
Intel Compiler        2017.0.098

Applications
HPL                   hpl_cuda_8_ompi165_gcc_485_pascal_v1
LAMMPS                Lammps-30Sep16
NAMD                  NAMD_2.12_Source
GROMACS               2016.1
HOOMD-blue            2.1.2
Amber                 16update7
ANSYS Mechanical      17.0
RELION                2.0.3
High Performance Linpack (HPL)
HPL is a parallel application, designed to be run at very large scale across many nodes, that measures how fast a system solves a dense n-by-n system of linear equations using LU decomposition with partial row pivoting. The HPL build run on the test cluster uses double precision floating point operations.
Figure 1 shows the HPL performance on the tested P100-PCIe cluster. A single P100 is 3.6x faster than 2 x E5-2690 v4 CPUs, and HPL scales very well with additional GPUs both within a node and across nodes. Recall that each server holds four P100s, so the 8-, 12- and 16-GPU results come from 2, 3 and 4 servers respectively. 16 P100 GPUs achieve a 14.9x speedup over a single P100.

The overall efficiency is calculated as HPL Efficiency = rMax / (CPU rPeak + GPU rPeak), where rPeak is the highest theoretical FLOPS that could be achieved at the base clock and rMax is the actual performance reported by HPL. HPL cannot run at the maximum boost clock; it typically runs at some clock in between, and the average is closer to the base clock than to the max boost clock, which is why the base clock is used for the rPeak calculation. Although CPU rPeak is included in the efficiency calculation, we set DGEMM_SPLIT=1.0 when running HPL on the P100s, so the CPUs do not take part in the DGEMM and contribute few FLOPS; even though they stayed fully utilized, they were only handling overhead and the data movement needed to keep the GPUs fed. The key point for the P100 is that rMax itself is very large.
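As a worked example of that calculation, the sketch below derives rPeak from unit counts, FLOPs per cycle and base clocks, then combines it with an rMax value of the kind HPL reports. The hardware figures and the rMax value here are illustrative assumptions, not measured results from this study.

    # Worked example of the efficiency formula above. The hardware figures and
    # the rMax value are illustrative assumptions, not measurements from this paper.

    def rpeak_gflops(units, flops_per_cycle, base_clock_ghz):
        # Theoretical peak = execution units x FLOPs per cycle x base clock (GHz), in GFLOPS
        return units * flops_per_cycle * base_clock_ghz

    def hpl_efficiency(rmax_gflops, cpu_rpeak_gflops, gpu_rpeak_gflops):
        # HPL Efficiency = rMax / (CPU rPeak + GPU rPeak)
        return rmax_gflops / (cpu_rpeak_gflops + gpu_rpeak_gflops)

    # Assumed double-precision figures at base clock (illustrative only):
    cpu_rpeak = 2 * rpeak_gflops(units=14,   flops_per_cycle=16, base_clock_ghz=2.6)    # 2 x 14-core CPUs
    gpu_rpeak = 1 * rpeak_gflops(units=1792, flops_per_cycle=2,  base_clock_ghz=1.126)  # 1 x P100-PCIe (FP64)

    rmax = 3800.0  # hypothetical rMax reported by HPL, in GFLOPS
    print(f"HPL efficiency: {hpl_efficiency(rmax, cpu_rpeak, gpu_rpeak):.1%}")

With these assumed numbers the combined rPeak is about 5.2 TFLOPS, so the hypothetical rMax of 3.8 TFLOPS corresponds to an efficiency of roughly 73 percent; the same arithmetic with the measured rMax values yields the efficiencies reported for the cluster.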