
Performance characterization
Figure 20 N to N Sequential Performance
From the results we can observe that write performance rises with the number of threads and reaches a plateau at around 64 threads. Read performance also rises quickly with the number of threads, plateaus at around 128 threads, and remains stable up to the maximum number of threads that IOzone allows; therefore, large-file sequential performance is stable even with 1024 concurrent clients. Write performance drops about 10% at 1024 threads. However, since the client cluster has fewer cores than that number of threads, it is uncertain whether the drop is due to swapping and similar overhead that goes unnoticed on spinning media (since NVMe latency is very low compared to spinning media), or whether the RAID 10 data synchronization is becoming a bottleneck. More clients are needed to clarify that point. An anomaly was observed for reads at 64 threads, where performance did not scale at the rate seen for the previous data points and then, at the next data point, moved to a value very close to the sustained performance. More testing is needed to find the reason for that anomaly, but it is out of the scope of this document.
The maximum read performance was below the theoretical performance of the NVMe devices (~102 GB/s) and below the performance of the EDR links, even assuming that one link was mostly used for NVMe over Fabrics traffic (4x EDR BW ~96 GB/s).
However, this is not a surprise, since the hardware configuration is not balanced with respect to the NVMe devices and IB HCAs under each CPU socket. One CX6 adapter is under CPU1, while CPU2 has all the NVMe devices and the second CX6 adapter. Any storage traffic using the first HCA must traverse the UPI links to reach the NVMe devices. In addition, any core used on CPU1 must access devices or memory assigned to CPU2, so data locality suffers and the UPI links are used. That can explain the reduction in maximum performance compared to the maximum performance of the NVMe devices or the line speed of the CX6 HCAs. The alternative that would remove this limitation is a balanced hardware configuration, which implies cutting density in half: an R740 with four x16 slots, two x16 PCIe expanders to distribute the NVMe devices equally across the two CPUs, and one CX6 HCA under each CPU.
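This imbalance can be verified directly on a storage node, since sysfs reports the NUMA node that each PCIe device is attached to. The short sketch below is a generic illustration (not part of the PixStor tooling): it prints the NUMA node for every NVMe controller and every InfiniBand HCA, so on the configuration described above the NVMe controllers and one CX6 adapter would report one node while the other CX6 adapter reports the other.

import glob
import os

def numa_node(class_entry):
    # Each /sys/class entry links to its PCI device, which exposes its NUMA node.
    path = os.path.join(class_entry, "device", "numa_node")
    try:
        with open(path) as f:
            return f.read().strip()   # "-1" means the firmware did not report a node
    except OSError:
        return "unknown"

def main():
    for entry in sorted(glob.glob("/sys/class/nvme/nvme*")):
        print(os.path.basename(entry), "-> NUMA node", numa_node(entry))
    for entry in sorted(glob.glob("/sys/class/infiniband/*")):
        print(os.path.basename(entry), "-> NUMA node", numa_node(entry))

if __name__ == "__main__":
    main()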