Dell - Internal Use - Confidential
Figure 1: HPL performance on P100-PCIe
NAMD
NAMD (NAnoscale Molecular Dynamics) is a molecular dynamics application designed for
high-performance simulation of large biomolecular systems. The dataset we used is the Satellite Tobacco
Mosaic Virus (STMV), a small, icosahedral plant virus that worsens the symptoms of infection by Tobacco
Mosaic Virus (TMV). This dataset has 1,066,628 atoms and is the largest dataset on the NAMD utilities
website. The performance metric reported in the application's output log is “days/ns” (lower is better).
Our plots use the inverse metric, “ns/day” (higher is better), since that is what most molecular dynamics
users focus on. We report the average of all occurrences of this value in the output log. Figure 2 shows the
performance within 1 node. Using 2 P100 GPUs gives better performance than using 4 P100 GPUs. This is
probably due to communication among the CPU threads: the application launches a set of worker threads
that handle the computation and a set of communication threads that handle the data communication. As
more GPUs are used, more communication threads are used and more synchronization is needed. In
addition, profiling with nvprof, NVIDIA’s CUDA profiler, shows that with 1 P100 the GPU computation
takes less than 50% of the total application time. According to Amdahl’s law, the speedup from adding
GPUs is limited by the remaining 50% or more of the work that is not accelerated by the GPU: even if the
GPU portion took no time at all, the application would run less than 2x faster than it does with 1 P100.
Based on this observation, we also ran the application on multiple nodes with two different settings
(2 GPUs/node and 4 GPUs/node); the results are shown in Figure 3. Regardless of how many nodes are
used, 2 GPUs/node always performs better than 4 GPUs/node. Within a node, 2 P100 GPUs are 9.5x
faster than dual CPUs.
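
The conversion from the logged “days/ns” values to the plotted “ns/day” value can be sketched in a few lines of Python. This is a minimal sketch, not the script used for this study: the log excerpt and its numbers are made up for illustration, the helper name `mean_ns_per_day` is hypothetical, and the regular expression only assumes that each relevant log line carries a number immediately before the token `days/ns`. The sketch averages the days/ns values and then inverts the mean; averaging the already inverted values would give a slightly different number, and the text does not say which order was used.

```python
import re
from statistics import mean

# Matches a decimal number immediately followed by the "days/ns" token.
DAYS_PER_NS = re.compile(r"([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\s+days/ns")


def mean_ns_per_day(log_text: str) -> float:
    """Average every days/ns value found in the log, then invert to ns/day."""
    values = [float(m.group(1)) for m in DAYS_PER_NS.finditer(log_text)]
    if not values:
        raise ValueError("no days/ns entries found in log text")
    return 1.0 / mean(values)


# Hypothetical log excerpt: the layout and numbers are illustrative only,
# not measured results from this study.
SAMPLE_LOG = """\
Info: Benchmark time: 16 CPUs 0.0500 s/step 0.50 days/ns 1000.0 MB memory
Info: Benchmark time: 16 CPUs 0.0300 s/step 0.30 days/ns 1000.0 MB memory
"""

if __name__ == "__main__":
    print(f"{mean_ns_per_day(SAMPLE_LOG):.2f} ns/day")
```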
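
The Amdahl's law argument can be made concrete with a short calculation. This is a minimal sketch under simplifying assumptions that do not come from the measurements: it treats the GPU share of the runtime as exactly 0.5 (the profile only says "less than 50%"), assumes that share speeds up perfectly with the number of GPUs, and ignores the extra communication and synchronization cost described in this section. The function and constant names are illustrative.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the runtime runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / factor)


# Assumed upper bound on the GPU share of the runtime with 1 P100.
GPU_FRACTION = 0.5

if __name__ == "__main__":
    for gpus in (2, 4, 8):
        bound = amdahl_speedup(GPU_FRACTION, gpus)
        print(f"{gpus} GPUs: at most {bound:.2f}x faster than 1 GPU")
    # Limit as the GPU portion shrinks toward zero time.
    print(f"upper limit: {1.0 / (1.0 - GPU_FRACTION):.2f}x")
```

Under these idealized assumptions, going from 2 to 4 GPUs raises the bound only from about 1.33x to 1.6x, so a modest increase in synchronization cost could be enough to cancel the gain.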
[Figure 1 chart data: HPL Performance Scaling on P100-PCIe. Axes: Performance (TFLOPS) and Efficiency (%).]
Configurations, in plotted order: CPU (2x 2690 v4), 1 P100, 2 P100, 4 P100, 8 P100, 12 P100, 16 P100
Performance (TFLOPS): 1.1, 3.9, 7.9, 15.5, 29.4, 41.8, 57.8
Efficiency (%): 93, 81, 82, 86, 84, 81, 85
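
The data labels of the HPL Performance Scaling on P100-PCIe chart can be turned into speedup figures with simple arithmetic. The sketch below rests on two assumptions that the chart itself does not state: that the i-th TFLOPS label and the i-th efficiency label belong to the i-th configuration in plotted order, and that efficiency means measured performance divided by theoretical peak. The function names are illustrative.

```python
# Data labels read from the chart, in plotted order (pairing is an assumption).
CONFIGS = ["CPU (2x 2690 v4)", "1 P100", "2 P100", "4 P100",
           "8 P100", "12 P100", "16 P100"]
TFLOPS = [1.1, 3.9, 7.9, 15.5, 29.4, 41.8, 57.8]
EFFICIENCY_PCT = [93, 81, 82, 86, 84, 81, 85]


def speedup_over_cpu(tflops: float) -> float:
    """Speedup relative to the CPU-only configuration (first entry)."""
    return tflops / TFLOPS[0]


def implied_peak(tflops: float, efficiency_pct: float) -> float:
    """Theoretical peak implied by efficiency = measured / peak (assumed definition)."""
    return tflops / (efficiency_pct / 100.0)


if __name__ == "__main__":
    for name, perf, eff in zip(CONFIGS, TFLOPS, EFFICIENCY_PCT):
        print(f"{name:>18}: {perf:5.1f} TFLOPS, "
              f"{speedup_over_cpu(perf):5.1f}x over CPU, "
              f"implied peak {implied_peak(perf, eff):5.1f} TFLOPS")
```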