Figure 8: HOOMD-blue Performance on CPU and P100-PCIe

Amber

Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. The term Amber also refers to the empirical force fields implemented in this suite. Figure 9 shows the performance of Amber on CPU and P100-PCIe. One P100 is 6.3x faster than the dual-CPU baseline, and two P100 GPUs are 1.2x faster than one. However, performance drops significantly when 4 or more GPUs are used. The reason is that, like LAMMPS and HOOMD-blue, Amber relies heavily on P2P access, but configuration G supports P2P only between the two GPUs of each pair. We verified this by again testing the application on a configuration B node. There, the performance with 4 P100 GPUs improved to 791 ns/day, compared to 315 ns/day in configuration G, a 151% improvement (a 2.5x speedup).

Even in configuration B, however, multi-GPU scaling is still not good enough. When Amber's multi-GPU support was originally designed, the PCIe bus was Gen2 x16 and the GPUs were C1060s or C2050s. Current Pascal-generation GPUs are more than 16x faster than the C1060, while PCIe bus speed has only increased by 2x (PCIe Gen2 x16 to PCIe Gen3 x16), and InfiniBand interconnects by about the same amount. The Amber website explicitly states: “It should be noted that while the legacy MPI and GPU-Direct methods of multi-GPU communication are still supported, and will be used by the code automatically if peer to peer communication is not available, you are very unlikely to see any speedup by using multiple GPUs for a single job if the GPUs are newer than C2050s. Multi-node runs are almost impossible to get to scale.” This is consistent with our multi-node results: as Figure 9 shows, the more nodes are used, the worse the performance.
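The relationship between the quoted "151% improvement" and "2.5x speedup" figures can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are hypothetical, and the only inputs taken from the text are the two 4-GPU throughput values (315 and 791 ns/day) and the quoted growth factors for GPU speed (more than 16x) and PCIe bus speed (2x).

```python
# Illustrative arithmetic only; function and variable names are hypothetical.

def speedup(new: float, old: float) -> float:
    """Ratio of new throughput to old throughput."""
    return new / old

def percent_improvement(new: float, old: float) -> float:
    """Relative gain of new over old, expressed in percent."""
    return (new - old) / old * 100.0

# Amber throughput with 4 P100 GPUs, in ns/day, as quoted in the text.
config_g_ns_per_day = 315.0
config_b_ns_per_day = 791.0

s = speedup(config_b_ns_per_day, config_g_ns_per_day)
p = percent_improvement(config_b_ns_per_day, config_g_ns_per_day)
print(f"speedup: {s:.1f}x")       # 2.5x
print(f"improvement: {p:.0f}%")   # 151%

# Growth factors quoted in the text: GPUs are more than 16x faster than the
# C1060, while PCIe bandwidth grew 2x (Gen2 x16 to Gen3 x16).
gpu_growth_lower_bound = 16.0
pcie_growth = 2.0
compute_to_bus_gap = gpu_growth_lower_bound / pcie_growth
print(f"compute has outgrown the bus by at least {compute_to_bus_gap:.0f}x")
```

A speedup of 2.5x and an improvement of 151% are the same measurement expressed two ways: the improvement is the speedup minus one, times 100. The last figure gives a rough sense of why communication over PCIe dominates on newer GPUs.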

[Figure: HOOMD-blue Performance with Microsphere Dataset. Categories: CPU (2x 2690 v4), 1 P100, 2 P100, 4 P100, 8 P100, 12 P100, 16 P100. Axes: hours for 10e6 steps (lower is better); speedup over 1 P100. Data labels, hours: 328.2, 24.6, 16.8, 11.7, 7.9, 7.1, 6.3. Data labels, speedup over 1 P100: 1.0, 1.5, 2.1, 3.1, 3.5, 3.9.]
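The speedup values in the HOOMD-blue figure can be reproduced from the run times, and the same numbers give a rough parallel efficiency. This is a minimal sketch that assumes the hour labels 24.6, 16.8, 11.7, 7.9, 7.1, and 6.3 map in order to 1, 2, 4, 8, 12, and 16 P100 GPUs and that 328.2 hours is the dual-CPU run; parallel efficiency (speedup divided by GPU count) is not reported in the figure and is added here for illustration.

```python
# Illustrative only: the mapping of data labels to GPU counts is an assumption
# based on the order in which the labels appear in the figure.
gpus = [1, 2, 4, 8, 12, 16]
hours = [24.6, 16.8, 11.7, 7.9, 7.1, 6.3]   # hours for 10e6 steps
cpu_hours = 328.2                            # dual-CPU (2x 2690 v4) run

baseline = hours[0]  # run time on 1 P100

# Speedup over 1 P100: baseline time divided by time on n GPUs.
speedups = [baseline / h for h in hours]

# Parallel efficiency: fraction of ideal linear scaling actually achieved.
efficiency = [s / n for s, n in zip(speedups, gpus)]

for n, h, s, e in zip(gpus, hours, speedups, efficiency):
    print(f"{n:2d} P100: {h:5.1f} h, speedup {s:.1f}x, efficiency {e:.0%}")

# Speedup of a single P100 over the dual-CPU run.
print(f"1 P100 vs CPU: {cpu_hours / baseline:.1f}x")
```

The computed speedups round to the values labeled in the figure (1.0, 1.5, 2.1, 3.1, 3.5, 3.9), which supports the assumed mapping.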