GCC compiler          4.8.5
Intel Compiler        2017.0.098

Applications
HPL                   hpl_cuda_8_ompi165_gcc_485_pascal_v1
LAMMPS                Lammps-30Sep16
NAMD                  NAMD_2.12_Source
GROMACS               2016.1
HOOMD-blue            2.1.2
Amber                 16update7
ANSYS Mechanical      17.0
RELION                2.0.3
High Performance Linpack (HPL)
HPL is a parallel application, designed to be run at very large scale across many nodes, that measures how fast a system solves a dense n-by-n system of linear equations using LU decomposition with partial row pivoting. The HPL build run on the test cluster uses double precision floating point operations.
Figure 1 shows the HPL performance on the tested P100-PCIe cluster. A single P100 is 3.6x faster than 2 x E5-2690 v4 CPUs, and HPL scales very well with additional GPUs both within a node and across nodes. Recall that each server holds four P100s, so the 8-, 12- and 16-GPU results come from 2, 3 and 4 servers respectively. 16 P100 GPUs achieve a 14.9x speedup over a single P100.

The overall efficiency is calculated as HPL Efficiency = rMax / (CPU rPeak + GPU rPeak), where rPeak is the highest theoretical FLOPS that could be achieved at the base clock and rMax is the actual performance reported by HPL. HPL cannot run at the maximum boost clock; it typically runs at some clock in between, and the average is closer to the base clock than to the max boost clock, which is why the base clock is used for the rPeak calculation. Although CPU rPeak is included in the efficiency calculation, we set DGEMM_SPLIT=1.0 when running HPL on the P100s, so the CPUs do not take part in the DGEMM and contribute few FLOPS; even though they stayed fully utilized, they were only handling overhead and the data movement needed to keep the GPUs fed. The key point for the P100 is that rMax itself is very large.
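As a worked example of that calculation, the sketch below derives rPeak from unit counts, FLOPs per cycle and base clocks, then combines it with an rMax value of the kind HPL reports. The hardware figures and the rMax value here are illustrative assumptions, not measured results from this study.

    # Worked example of the efficiency formula above. The hardware figures and
    # the rMax value are illustrative assumptions, not measurements from this paper.

    def rpeak_gflops(units, flops_per_cycle, base_clock_ghz):
        # Theoretical peak = execution units x FLOPs per cycle x base clock (GHz), in GFLOPS
        return units * flops_per_cycle * base_clock_ghz

    def hpl_efficiency(rmax_gflops, cpu_rpeak_gflops, gpu_rpeak_gflops):
        # HPL Efficiency = rMax / (CPU rPeak + GPU rPeak)
        return rmax_gflops / (cpu_rpeak_gflops + gpu_rpeak_gflops)

    # Assumed double-precision figures at base clock (illustrative only):
    cpu_rpeak = 2 * rpeak_gflops(units=14,   flops_per_cycle=16, base_clock_ghz=2.6)    # 2 x 14-core CPUs
    gpu_rpeak = 1 * rpeak_gflops(units=1792, flops_per_cycle=2,  base_clock_ghz=1.126)  # 1 x P100-PCIe (FP64)

    rmax = 3800.0  # hypothetical rMax reported by HPL, in GFLOPS
    print(f"HPL efficiency: {hpl_efficiency(rmax, cpu_rpeak, gpu_rpeak):.1%}")

With these assumed numbers the combined rPeak is about 5.2 TFLOPS, so the hypothetical rMax of 3.8 TFLOPS corresponds to an efficiency of roughly 73 percent; the same arithmetic with the measured rMax values yields the efficiencies reported for the cluster.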