Amber benchmark suite
This suite includes the Joint AMBER-CHARMM (JAC) benchmark, which simulates dihydrofolate reductase (DHFR) in an explicit water bath with cubic periodic boundary conditions. The major assumptions are that the DHFR molecule is fully solvated with no surface effects and that its motion follows the microcanonical (NVE) ensemble, which holds the amount of substance (N), volume (V), and energy (E) constant. The sum of kinetic (KE) and potential energy (PE) is therefore conserved; in other words, temperature (T) and pressure (P) are unregulated. The JAC benchmark repeats the simulations with the isothermal-isobaric (NPT) ensemble, which holds N, P, and T constant; this corresponds most closely to laboratory conditions, with a flask open to ambient temperature and pressure. Besides these settings, Particle Mesh Ewald (PME) is the algorithm used to calculate electrostatic forces in the molecular dynamics simulations. Other biomolecules simulated in this benchmark suite are Factor IX (one of the serine proteases of the coagulation system), cellulose, and Satellite Tobacco Mosaic Virus (STMV). Here, we report the results for the DHFR and STMV datasets.
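The ensemble settings described above map directly onto an AMBER &cntrl input. The sketch below shows, for illustration only, how NVE and NPT inputs for a JAC-style DHFR run might be generated and launched with pmemd.cuda; the file names (JAC.prmtop, JAC.inpcrd) and the specific parameter values are assumptions, not the official JAC benchmark inputs.

```python
#!/usr/bin/env python3
"""Illustrative driver for a JAC-style DHFR run with pmemd.cuda.

The &cntrl settings only sketch the ensembles described in the text
(NVE vs. NPT, PME electrostatics); they are assumptions, not the
official JAC benchmark inputs. JAC.prmtop / JAC.inpcrd are placeholders.
"""
import subprocess
import textwrap


def mdin(ensemble_lines):
    """Build a minimal AMBER input file with the given ensemble settings."""
    return textwrap.dedent(f"""\
        JAC-style DHFR with PME electrostatics
        &cntrl
          imin=0, irest=1, ntx=5,    ! restart MD with velocities
          nstlim=10000, dt=0.002,    ! 10,000 steps of 2 fs
          ntc=2, ntf=2,              ! SHAKE on bonds to hydrogen
          cut=8.0,                   ! direct-space cutoff; PME handles long range
          {ensemble_lines}
          ntpr=1000, ntwx=1000,
        &end
    """)


# NVE: no thermostat or barostat, constant-volume periodic box.
NVE = mdin("ntb=1, ntt=0, ntp=0,       ! constant V, no T/P coupling -> NVE")
# NPT: Langevin thermostat plus isotropic pressure coupling.
NPT = mdin("ntb=2, ntt=3, ntp=1, temp0=300.0, gamma_ln=2.0, pres0=1.0,  ! NPT")


def run(label, mdin_text):
    """Write the input file and launch a single-GPU pmemd.cuda run."""
    with open(f"mdin.{label}", "w") as fh:
        fh.write(mdin_text)
    subprocess.run(["pmemd.cuda", "-O",
                    "-i", f"mdin.{label}", "-o", f"mdout.{label}",
                    "-p", "JAC.prmtop", "-c", "JAC.inpcrd",
                    "-r", f"restrt.{label}"], check=True)


if __name__ == "__main__":
    run("nve", NVE)
    run("npt", NPT)
```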
Figure 14 and Figure 15 illustrate AMBER’s results with the DHFR and STMV datasets. On the SXM2 system (Config K), AMBER scales weakly with 2 and 4 GPUs. Even though the scaling is not strong, the V100 shows a noticeable improvement over the P100, delivering a ~78% increase in single-card runs, and 1x V100 is actually 23% faster than 4x P100. On the PCIe side (Config G), one and two cards perform similarly to SXM2; however, performance with four cards drops sharply. This is because the PCIe configuration (Config G) only supports Peer-to-Peer access between GPU0/GPU1 and between GPU2/GPU3, not among all four GPUs. AMBER redesigned the way data is transferred among GPUs to address the PCIe bottleneck, so it relies heavily on Peer-to-Peer access for performance with multiple GPU cards. Hence a fast, direct interconnect like NVLink between all GPUs, as in SXM2 (Config K), is vital for AMBER multi-GPU performance. To compensate for a single job’s weak scaling across multiple GPUs, the AMBER developers promote another use case: running multiple jobs on the same node concurrently, where each job uses only 1 or 2 GPUs. Figure 5 shows the results of 1-4 individual jobs on one C4130 with V100s, and the numbers indicate that those individual jobs have little impact on each other. This is because AMBER is designed to run almost entirely on the GPUs and has very little dependency on the CPU, so the aggregate throughput of multiple individual jobs scales linearly in this case. Without any card-to-card communication, the 5% better performance on SXM2 is attributable to its higher clock speed.
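In practice, this multi-job use case can be scripted by pinning each independent pmemd.cuda process to its own GPU with CUDA_VISIBLE_DEVICES. The following is a minimal sketch under that assumption; the input file names are placeholders rather than the benchmark’s actual files.

```python
#!/usr/bin/env python3
"""Sketch of the multi-job use case: several independent single-GPU AMBER
jobs running concurrently on one node, each pinned to its own GPU via
CUDA_VISIBLE_DEVICES. mdin, JAC.prmtop, and JAC.inpcrd are placeholders,
not the actual benchmark inputs."""
import os
import subprocess

NUM_JOBS = 4  # e.g. one job per V100 in the C4130


def launch(job_id):
    env = os.environ.copy()
    # Restrict this job to a single GPU; because AMBER runs almost entirely
    # on the GPU, concurrent jobs barely interfere with one another.
    env["CUDA_VISIBLE_DEVICES"] = str(job_id)
    return subprocess.Popen(
        ["pmemd.cuda", "-O",
         "-i", "mdin", "-o", f"mdout.job{job_id}",
         "-p", "JAC.prmtop", "-c", "JAC.inpcrd",
         "-r", f"restrt.job{job_id}"],
        env=env)


if __name__ == "__main__":
    procs = [launch(i) for i in range(NUM_JOBS)]
    for p in procs:
        p.wait()
    # Aggregate throughput (ns/day summed over the mdout files) should scale
    # close to linearly with the number of jobs, per the results above.
```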
Figure 14 AMBER JAC Benchmark
Figure 15 AMBER STMV Benchmark