
Performance characterization
Figure 20 N to N Sequential Performance
From the results we can observe that write performance rises with the number of threads and reaches a plateau at around 64 threads. Read performance also rises quickly with the number of threads, plateaus at around 128 threads, and remains stable up to the maximum number of threads that IOzone allows; therefore, large-file sequential performance is stable even with 1024 concurrent clients. Write performance drops about 10% at 1024 threads. However, since the client cluster has fewer cores than that number of threads, it is uncertain whether the drop is due to swapping and similar overhead that goes unnoticed on spinning media (since NVMe latency is very low compared to spinning media), or whether the RAID 10 data synchronization is becoming a bottleneck. More clients are needed to clarify that point. An anomaly was observed for reads at 64 threads, where performance did not scale at the rate seen for the previous data points and then, at the next data point, moved to a value very close to the sustained performance. More testing is needed to find the reason for that anomaly, but it is out of the scope of this document.
The maximum read performance was below the theoretical performance of the NVMe devices (~102 GB/s) and below the performance of the EDR links, even assuming that one link was mostly used for NVMe over Fabrics traffic (4x EDR BW ~96 GB/s).
However, this is not a surprise, since the hardware configuration is not balanced with respect to the NVMe devices and IB HCAs under each CPU socket. One CX6 adapter is under CPU1, while CPU2 has all the NVMe devices and the second CX6 adapter. Any storage traffic using the first HCA must traverse the UPI links to reach the NVMe devices. In addition, any core used on CPU1 must access devices or memory assigned to CPU2, so data locality suffers and the UPI links are used. That can explain the reduction in maximum performance compared to the maximum performance of the NVMe devices or the line speed of the CX6 HCAs. The alternative that would remove this limitation is a balanced hardware configuration, which implies cutting density in half: an R740 with four x16 slots, two x16 PCIe expanders to distribute the NVMe devices equally across the two CPUs, and one CX6 HCA under each CPU.
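This imbalance can be verified directly on a storage node, since sysfs reports the NUMA node that each PCIe device is attached to. The short sketch below is a generic illustration (not part of the PixStor tooling): it prints the NUMA node for every NVMe controller and every InfiniBand HCA, so on the configuration described above the NVMe controllers and one CX6 adapter would report one node while the other CX6 adapter reports the other.

import glob
import os

def numa_node(class_entry):
    # Each /sys/class entry links to its PCI device, which exposes its NUMA node.
    path = os.path.join(class_entry, "device", "numa_node")
    try:
        with open(path) as f:
            return f.read().strip()   # "-1" means the firmware did not report a node
    except OSError:
        return "unknown"

def main():
    for entry in sorted(glob.glob("/sys/class/nvme/nvme*")):
        print(os.path.basename(entry), "-> NUMA node", numa_node(entry))
    for entry in sorted(glob.glob("/sys/class/infiniband/*")):
        print(os.path.basename(entry), "-> NUMA node", numa_node(entry))

if __name__ == "__main__":
    main()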