White Papers
Dell - Internal Use - Confidential
Figure 8: HOOMD-blue Performance on CPU and P100-PCIe
Amber
Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics
simulations, particularly on biomolecules. The term Amber also refers to the empirical force fields
implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. A single P100
is 6.3x faster than the dual-CPU configuration, and 2 P100 GPUs are 1.2x faster than 1 P100. However,
performance drops significantly when 4 or more GPUs are used. The reason is that, like LAMMPS and
HOOMD-blue, this application relies heavily on P2P access, but configuration G supports P2P only within
pairs of GPUs. We verified this by running the application again on a configuration B node. There, the
performance with 4 P100 GPUs improved to 791 ns/day, compared with 315 ns/day in configuration G: a
151% performance improvement, or a speedup of 2.5x.

Even in configuration B, however, multi-GPU scaling remains poor. When Amber's multi-GPU support was
originally designed, the PCIe bus was Gen2 x16 and the GPUs were C1060s or C2050s. The current
Pascal-generation GPUs are more than 16x faster than the C1060s, while PCIe bus speed has only increased
by 2x (Gen2 x16 to Gen3 x16) and InfiniBand interconnects have improved by about the same amount. The
Amber website explicitly states: “It should be noted that while the legacy MPI and GPU-Direct methods
of multi-GPU communication are still supported, and will be used by the code automatically if peer to
peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs
for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to
scale.” This is consistent with our multi-node results: Figure 9 shows that the more nodes are used,
the worse the performance becomes.
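
The arithmetic behind these numbers can be checked with a short Python sketch. The throughput values (791 and 315 ns/day) and the hardware ratios (more than 16x for GPU speed, 2x for PCIe bandwidth) are taken from the text above. The function names and the "gap" figure are illustrative assumptions, and 16x is treated as a lower bound, so the result is a rough indication rather than a performance model.

```python
def speedup(new: float, old: float) -> float:
    """Ratio of new throughput to old throughput (higher is better)."""
    return new / old


def percent_improvement(new: float, old: float) -> float:
    """Relative gain of new over old, expressed as a percentage."""
    return (new - old) / old * 100.0


# Amber throughput with 4 P100 GPUs, in ns/day (values from the text).
config_b_ns_per_day = 791.0
config_g_ns_per_day = 315.0

print(f"speedup: {speedup(config_b_ns_per_day, config_g_ns_per_day):.1f}x")
print(f"improvement: {percent_improvement(config_b_ns_per_day, config_g_ns_per_day):.0f}%")

# Hardware ratios quoted in the text. The GPU ratio is stated as "> 16x",
# so 16.0 is used here as a lower bound (an assumption of this sketch).
gpu_speed_ratio = 16.0
pcie_bandwidth_ratio = 2.0

# How much faster compute has grown than the bus that feeds it.
gap = gpu_speed_ratio / pcie_bandwidth_ratio
print(f"compute grew at least {gap:.0f}x faster than PCIe bandwidth")
```

The last figure suggests why communication that was acceptable on C1060-class GPUs can dominate on Pascal: the GPUs finish their share of the work far sooner, while the data exchange between them has sped up only modestly.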
[Figure 8 plot: HOOMD-blue Performance with Microsphere Dataset. Axes: Hours for 10e6 steps (lower is better) and Speedup over 1 P100. Data labels recovered from the plot:]

Configuration       Hours for 10e6 steps   Speedup over 1 P100
CPU (2x 2690 v4)    328.2                  -
1 P100              24.6                   1.0
2 P100              16.8                   1.5
4 P100              11.7                   2.1
8 P100              7.9                    3.1
12 P100             7.1                    3.5
16 P100             6.3                    3.9