wire speed. Clearly, this software-only virtualization configuration is insufficient to support the high-performance computing demands of the NASA Center for Climate Simulation. On the other hand, the figures in the third column show that virtualizing I/O with the help of hardware acceleration drives throughput up considerably, although the highest throughput figures achieved in this test case are less than 65 percent of wire speed. The rightmost column of Table 2 shows dramatic throughput improvement in the virtualized environment when SR-IOV is utilized. In fact, the figures in this column approach those of the bare-metal case, indicating that a properly configured virtualized network can deliver throughput that is roughly equivalent to that of a non-virtualized one.
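For readers who want to reproduce this kind of measurement, the sketch below shows how a bulk TCP transfer can be timed and expressed as a percentage of wire speed. It is not Nuttcp itself, and the port number, test duration, and 10 Gb/s line rate are illustrative assumptions rather than details taken from the NASA test bed.

# Minimal bulk-TCP throughput sketch (not Nuttcp). Run "python throughput.py"
# on the receiver, then "python throughput.py <receiver-host>" on the sender.
import socket
import sys
import time

PORT = 5001          # assumed port
DURATION = 10.0      # seconds of bulk transfer (assumed)
LINE_RATE_GBPS = 10  # assumed 10 Gigabit Ethernet link

def serve():
    """Receive bytes until the sender closes, then report throughput."""
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            total = 0
            start = time.time()
            while True:
                data = conn.recv(1 << 20)
                if not data:          # sender finished and closed the connection
                    break
                total += len(data)
            elapsed = time.time() - start
    gbps = total * 8 / elapsed / 1e9
    print(f"{gbps:.2f} Gb/s = {100 * gbps / LINE_RATE_GBPS:.1f}% of wire speed")

def send(host):
    """Blast a fixed 1 MB buffer at the receiver for DURATION seconds."""
    buf = b"\0" * (1 << 20)
    with socket.socket() as cli:
        cli.connect((host, PORT))
        end = time.time() + DURATION
        while time.time() < end:
            cli.sendall(buf)

if __name__ == "__main__":
    serve() if len(sys.argv) == 1 else send(sys.argv[1])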
To expand on the Nuttcp results, the test team performed trials on the other two benchmarks with different message sizes. Figure 1 shows throughput (left chart) and latency (right chart) results for the Ohio State MPI benchmark. Surprisingly, the test configuration that uses SR-IOV actually outperforms the bare-metal one. The test team postulates that this performance differential is due to inefficiencies in the Linux* kernel that are overcome by direct assignment under SR-IOV. In any event, this test result does support the finding above that, in some cases, virtualized performance with SR-IOV can be comparable to equivalent non-virtualized performance.
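The throughput and latency curves in Figure 1 come from message-passing tests of the kind sketched below. This sketch is not the Ohio State benchmark suite itself; it assumes mpi4py is available, and the message sizes and iteration count are illustrative.

# Minimal MPI ping-pong sketch in the spirit of the OSU latency/bandwidth tests.
# Run with two ranks, e.g.: mpirun -np 2 python pingpong.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERATIONS = 1000

for size in (1, 1024, 1 << 20):              # message sizes in bytes (assumed)
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(ITERATIONS):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)    # rank 0 sends, then waits for the echo
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)  # rank 1 echoes the message back
            comm.Send(buf, dest=0, tag=0)
    elapsed = MPI.Wtime() - start
    if rank == 0:
        latency_us = elapsed / ITERATIONS / 2 * 1e6        # one-way latency
        bandwidth = size * 2 * ITERATIONS / elapsed / 1e6  # MBytes/sec, both directions
        print(f"{size:>8} bytes: {latency_us:9.2f} us   {bandwidth:10.2f} MB/s")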
SINGLE-ROOT I/O VIRTUALIZATION (SR-IOV) DEFINED
Supported by Intel® Ethernet Server Adapters, SR-IOV is a standard mechanism for devices to advertise their ability to be simultaneously shared among multiple virtual machines (VMs). SR-IOV allows for the partitioning of a PCI function into many virtual functions (VFs) for the purpose of sharing resources in virtual or non-virtual environments. Each VF can support a unique and separate data path for I/O-related functions, so, for example, the bandwidth of a single physical port can be partitioned into smaller slices that may be allocated to specific VMs or guests.
Finally, the test team considered the results of throughput testing with the Intel MKL implementation of LINPACK, as shown in Figure 2. Here, while the SR-IOV implementation increases performance relative to the non-SR-IOV case, its performance is somewhat lower than that of the bare-metal case.
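As background, LINPACK measures the rate at which a dense system of linear equations can be solved. The sketch below illustrates the metric using NumPy rather than the Intel MKL LINPACK binary used in the tests; the problem size and the conventional 2/3·N³ + 2·N² operation count are illustrative assumptions (NumPy delegates the factorization to whatever BLAS/LAPACK library it was built against, which may itself be MKL).

# Minimal LINPACK-style GFLOPS sketch (not the Intel MKL LINPACK binary).
import time
import numpy as np

N = 4000                                  # assumed problem size
rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
b = rng.standard_normal(N)

start = time.perf_counter()
x = np.linalg.solve(A, b)                 # LU factorization plus triangular solves
elapsed = time.perf_counter() - start

flops = 2.0 / 3.0 * N**3 + 2.0 * N**2     # conventional LINPACK operation count
print(f"N={N}: {elapsed:.2f} s, {flops / elapsed / 1e9:.2f} GFLOPS")
print("residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))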
Figure 1. Virtualized and non-virtualized performance results for the Ohio State University MPI benchmarks: throughput in MBytes/sec versus message size (higher is better) and latency versus message size (lower is better), comparing bare metal to bare metal, VM to VM with SR-IOV, and VM to VM with virtualized I/O but without SR-IOV.