Amber benchmark suite
This suite includes the Joint AMBER-CHARMM (JAC) benchmark, which simulates dihydrofolate reductase (DHFR) in an explicit water bath with cubic periodic boundary conditions. The major assumptions are that the DHFR molecule is fully solvated with no surface effects and that its motion follows the microcanonical (NVE) ensemble, which holds the amount of substance (N), volume (V), and energy (E) constant. The sum of kinetic (KE) and potential energy (PE) is therefore conserved; in other words, temperature (T) and pressure (P) are unregulated. The JAC benchmark repeats the simulations with the isothermal-isobaric (NPT) ensemble, which holds N, P, and T constant; this corresponds most closely to laboratory conditions, with a flask open to ambient temperature and pressure. Besides these settings, Particle Mesh Ewald (PME) is the algorithm used to calculate electrostatic forces in the molecular dynamics simulations. Other biomolecules simulated in this benchmark suite are Factor IX (one of the serine proteases of the coagulation system), cellulose, and Satellite Tobacco Mosaic Virus (STMV). Here, we report the results for the DHFR and STMV datasets.
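The ensemble settings described above map directly onto an AMBER &cntrl input. The sketch below shows, for illustration only, how NVE and NPT inputs for a JAC-style DHFR run might be generated and launched with pmemd.cuda; the file names (JAC.prmtop, JAC.inpcrd) and the specific parameter values are assumptions, not the official JAC benchmark inputs.

```python
#!/usr/bin/env python3
"""Illustrative driver for a JAC-style DHFR run with pmemd.cuda.

The &cntrl settings only sketch the ensembles described in the text
(NVE vs. NPT, PME electrostatics); they are assumptions, not the
official JAC benchmark inputs. JAC.prmtop / JAC.inpcrd are placeholders.
"""
import subprocess
import textwrap


def mdin(ensemble_lines):
    """Build a minimal AMBER input file with the given ensemble settings."""
    return textwrap.dedent(f"""\
        JAC-style DHFR with PME electrostatics
        &cntrl
          imin=0, irest=1, ntx=5,    ! restart MD with velocities
          nstlim=10000, dt=0.002,    ! 10,000 steps of 2 fs
          ntc=2, ntf=2,              ! SHAKE on bonds to hydrogen
          cut=8.0,                   ! direct-space cutoff; PME handles long range
          {ensemble_lines}
          ntpr=1000, ntwx=1000,
        &end
    """)


# NVE: no thermostat or barostat, constant-volume periodic box.
NVE = mdin("ntb=1, ntt=0, ntp=0,       ! constant V, no T/P coupling -> NVE")
# NPT: Langevin thermostat plus isotropic pressure coupling.
NPT = mdin("ntb=2, ntt=3, ntp=1, temp0=300.0, gamma_ln=2.0, pres0=1.0,  ! NPT")


def run(label, mdin_text):
    """Write the input file and launch a single-GPU pmemd.cuda run."""
    with open(f"mdin.{label}", "w") as fh:
        fh.write(mdin_text)
    subprocess.run(["pmemd.cuda", "-O",
                    "-i", f"mdin.{label}", "-o", f"mdout.{label}",
                    "-p", "JAC.prmtop", "-c", "JAC.inpcrd",
                    "-r", f"restrt.{label}"], check=True)


if __name__ == "__main__":
    run("nve", NVE)
    run("npt", NPT)
```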
Figure 14 and Figure 15 illustrate AMBER’s results with the DHFR and STMV datasets. On the SXM2 system (Config K), AMBER scales weakly with 2 and 4 GPUs. Even though the scaling is not strong, the V100 shows a noticeable improvement over the P100, delivering a ~78% increase in single-card runs, and 1x V100 is actually 23% faster than 4x P100. On the PCIe side (Config G), one and two cards perform similarly to SXM2; however, performance with four cards drops sharply. This is because the PCIe configuration (Config G) only supports Peer-to-Peer access between GPU0/GPU1 and between GPU2/GPU3, not among all four GPUs. AMBER redesigned the way data is transferred among GPUs to address the PCIe bottleneck, so it relies heavily on Peer-to-Peer access for performance with multiple GPU cards. Hence a fast, direct interconnect like NVLink between all GPUs, as in SXM2 (Config K), is vital for AMBER multi-GPU performance. To compensate for a single job’s weak scaling across multiple GPUs, the AMBER developers promote another use case: running multiple jobs on the same node concurrently, where each job uses only 1 or 2 GPUs. Figure 5 shows the results of 1-4 individual jobs on one C4130 with V100s, and the numbers indicate that those individual jobs have little impact on each other. This is because AMBER is designed to run almost entirely on the GPUs and has very little dependency on the CPU, so the aggregate throughput of multiple individual jobs scales linearly in this case. Without any card-to-card communication, the 5% better performance on SXM2 is attributable to its higher clock speed.
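In practice, this multi-job use case can be scripted by pinning each independent pmemd.cuda process to its own GPU with CUDA_VISIBLE_DEVICES. The following is a minimal sketch under that assumption; the input file names are placeholders rather than the benchmark’s actual files.

```python
#!/usr/bin/env python3
"""Sketch of the multi-job use case: several independent single-GPU AMBER
jobs running concurrently on one node, each pinned to its own GPU via
CUDA_VISIBLE_DEVICES. mdin, JAC.prmtop, and JAC.inpcrd are placeholders,
not the actual benchmark inputs."""
import os
import subprocess

NUM_JOBS = 4  # e.g. one job per V100 in the C4130


def launch(job_id):
    env = os.environ.copy()
    # Restrict this job to a single GPU; because AMBER runs almost entirely
    # on the GPU, concurrent jobs barely interfere with one another.
    env["CUDA_VISIBLE_DEVICES"] = str(job_id)
    return subprocess.Popen(
        ["pmemd.cuda", "-O",
         "-i", "mdin", "-o", f"mdout.job{job_id}",
         "-p", "JAC.prmtop", "-c", "JAC.inpcrd",
         "-r", f"restrt.job{job_id}"],
        env=env)


if __name__ == "__main__":
    procs = [launch(i) for i in range(NUM_JOBS)]
    for p in procs:
        p.wait()
    # Aggregate throughput (ns/day summed over the mdout files) should scale
    # close to linearly with the number of jobs, per the results above.
```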
Figure 14 AMBER JAC Benchmark
Figure 15 AMBER STMV Benchmark