Figure 1: HPL performance on P100-PCIe

NAMD

NAMD (NAnoscale Molecular Dynamics) is a molecular dynamics application designed for high-performance simulation of large biomolecular systems. The dataset we used is Satellite Tobacco Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities website. The performance metric in the application’s output log is “days/ns” (lower is better).
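Since days/ns and ns/day are reciprocals, a throughput figure can be derived from the log with a few lines of Python. The sketch below is illustrative only: it assumes the log reports each measurement as a number followed by the literal text `days/ns`, and it assumes the values are averaged in days/ns before inverting (averaging the inverted values instead would give a slightly different number). The function name and the sample log lines are hypothetical, not taken from an actual run.

```python
import re
from statistics import mean

# Assumption: every measurement in the log appears as "<number> days/ns".
DAYS_PER_NS = re.compile(r"(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+days/ns")


def ns_per_day_from_log(log_text: str) -> float:
    """Average all days/ns values found in the log and return the inverse (ns/day)."""
    values = [float(m.group(1)) for m in DAYS_PER_NS.finditer(log_text)]
    if not values:
        raise ValueError("no days/ns values found in log")
    return 1.0 / mean(values)


# Hypothetical log excerpt with made-up numbers, for illustration only.
SAMPLE_LOG = """\
Info: Benchmark time: 28 CPUs 0.0058 s/step 0.5 days/ns 2000.0 MB memory
Info: Benchmark time: 28 CPUs 0.0029 s/step 0.25 days/ns 2000.0 MB memory
"""

if __name__ == "__main__":
    # mean(0.5, 0.25) = 0.375 days/ns, so this prints about 2.67 ns/day.
    print(f"{ns_per_day_from_log(SAMPLE_LOG):.2f} ns/day")
```

Raising an error on an empty match list avoids silently reporting a throughput for a run that never reached the benchmark output.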

Our plots use the inverse metric, “ns/day” (higher is better), since that is what most molecular dynamics users focus on. We used the average of all occurrences of this value in the output log. Figure 2 shows the performance within 1 node. Performance with 2 P100 GPUs is better than with 4 P100 GPUs. This is probably because of communication among the CPU threads: the application launches a set of worker threads that handle the computation and a set of communication threads that handle the data communication. As more GPUs are used, more communication threads are used and more synchronization is needed. In addition, profiling with nvprof, NVIDIA’s CUDA profiler, shows that with 1 P100 the GPU computation takes less than 50% of the whole application time.

According to Amdahl’s law, the speedup from adding GPUs is therefore limited by the remaining 50% or more of the work that is not parallelized on the GPU, so additional GPUs cannot even double the performance of a single GPU. Based on this observation, we further ran the application on multiple nodes with two different settings (2 GPUs/node and 4 GPUs/node); the results are shown in Figure 3. No matter how many nodes are used, the performance with 2 GPUs/node is always better than with 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x faster than the dual CPUs.
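The Amdahl's law argument can be made concrete with a small calculation. The sketch below assumes that exactly 50% of the 1-GPU run time is GPU computation (the upper bound suggested by the nvprof measurement) and that this portion scales perfectly with the number of GPUs; both are idealizations, and the function name is illustrative.

```python
def amdahl_speedup(parallel_fraction: float, n_gpus: int) -> float:
    """Ideal speedup over the 1-GPU run when only `parallel_fraction`
    of that run's time scales with the number of GPUs (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_gpus)


if __name__ == "__main__":
    p = 0.5  # assumed: GPU share of the 1-GPU run time (measured as "less than 50%")
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} GPUs: at most {amdahl_speedup(p, n):.2f}x over 1 GPU")
```

Under these assumptions the bound is about 1.33x with 2 GPUs and 1.6x with 4 GPUs, and it never reaches 2x. The model only caps the possible gain; it does not by itself explain why 4 GPUs run slower than 2, which the text attributes to thread communication and synchronization overhead.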

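[Figure 1 chart: “HPL Performance Scaling on P100-PCIe”. Categories: CPU (2x 2690 v4), 1 P100, 2 P100, 4 P100, 8 P100, 12 P100, 16 P100. Series: TFLOPS and Efficiency. Axes: Performance (TFLOPS) and Efficiency (%). Numeric labels as extracted, in order: 1.1, 3.9, 7.9, 15.5, 29.4, 41.8, 57.8, 100.]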