High-Performance Cluster for Biomedical Research Using 10 Gigabit Ethernet iWARP Fabric

The iWARP protocol was developed to perform within an Ethernet infrastructure, and thus does not require any modifications to existing Ethernet networks or equipment. At the same time, iWARP’s Ethernet compatibility enables IT organizations to take advantage of enhancements to Ethernet, such as Data Center Bridging, low-latency switches, and IP security.
Standard Ethernet switches and routers carry iWARP traffic over existing TCP/IP protocols. Because iWARP is layered over TCP, network equipment does not need to process the iWARP layer, nor does it require any special-purpose functionality. This enables the use of industry-accepted management consoles that use existing IP management protocols. The OpenFabrics Alliance (www.openfabrics.org) provides an open source RDMA software stack for iWARP that is both hardware-agnostic and application-agnostic. These characteristics allow iWARP to be readily integrated into existing environments while meeting stringent cost and performance requirements.
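
As a minimal sketch of how an application sits on that stack (this is not code from the project described here), the following C fragment uses the librdmacm connection-manager interface distributed by the OpenFabrics Alliance; the peer name compute-node-01, the port 7471, and the queue depths are placeholder assumptions.

    /*
     * Minimal RDMA client sketch built on the OpenFabrics librdmacm API.
     * The peer name, port, and queue depths below are placeholders.
     * Build (typical): gcc rdma_client.c -o rdma_client -lrdmacm -libverbs
     */
    #include <stdio.h>
    #include <string.h>
    #include <rdma/rdma_cma.h>

    int main(void)
    {
        struct rdma_addrinfo hints, *res = NULL;
        struct ibv_qp_init_attr qp_attr;
        struct rdma_cm_id *id = NULL;

        memset(&hints, 0, sizeof(hints));
        hints.ai_port_space = RDMA_PS_TCP;   /* iWARP uses the TCP port space */

        /* Resolve the peer; "compute-node-01" and "7471" are assumptions. */
        if (rdma_getaddrinfo("compute-node-01", "7471", &hints, &res)) {
            perror("rdma_getaddrinfo");
            return 1;
        }

        memset(&qp_attr, 0, sizeof(qp_attr));
        qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 4;
        qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;
        qp_attr.qp_type = IBV_QPT_RC;        /* reliably connected queue pair */

        /* Create an endpoint on whichever RDMA-capable device routes to the
         * destination; the same call works for iWARP or InfiniBand adapters. */
        if (rdma_create_ep(&id, res, NULL, &qp_attr)) {
            perror("rdma_create_ep");
            rdma_freeaddrinfo(res);
            return 1;
        }

        /* Establish the connection. Over iWARP this rides on ordinary TCP/IP,
         * so standard Ethernet switches and routers forward it unmodified. */
        if (rdma_connect(id, NULL)) {
            perror("rdma_connect");
            rdma_destroy_ep(id);
            rdma_freeaddrinfo(res);
            return 1;
        }

        puts("RDMA connection established");
        rdma_disconnect(id);
        rdma_destroy_ep(id);
        rdma_freeaddrinfo(res);
        return 0;
    }

Because the hardware-specific details stay below this interface, the same application code runs over any vendor's iWARP adapter, which is the hardware-agnostic property noted above.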
Performance and Scalability Results
Using this cluster in the lab with the HPL benchmark running on 4,000 cores, project engineers attained performance of 35.81 TeraFLOPS at 84.14 percent efficiency, as shown in Figure 3. The HPL problem size used was 1,200,000, and the problem size necessary to achieve half the performance (N/2 problem size) was 300,000. Importantly, the performance data scales in a nearly linear fashion as the number of cores applied to the benchmark workload increases.
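
For context, simple arithmetic recovers two figures the text does not state explicitly: the theoretical peak implied by the measured efficiency, and the double-precision memory footprint implied by the HPL problem size (the per-core values assume an even split across the 4,000 cores).

\[
R_{\text{peak}} = \frac{R_{\max}}{\text{efficiency}} = \frac{35.81\ \text{TFLOPS}}{0.8414} \approx 42.6\ \text{TFLOPS} \approx 10.6\ \text{GFLOPS per core}
\]
\[
\text{HPL matrix footprint} = 8N^{2}\ \text{bytes} = 8 \times \left(1.2\times 10^{6}\right)^{2} \approx 11.5\ \text{TB} \approx 2.9\ \text{GB per core}
\]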
From an engineering perspective, the linearity of scaling in the results helps ensure the viability of the topology for large-scale computational problems. This cluster demonstrates the best efficiency of any Ethernet-based solution among the systems on the June 2010 Top 500¹ list, as well as a placement within the range of the top 100 supercomputers on that list for efficiency overall. Moreover, because the data does not show an obvious drop-off in efficiency at this cluster size, it suggests that the solution is scalable beyond the size shown here, although that hypothesis would need to be tested to verify its validity. From a budgetary perspective, the results demonstrate that each compute node added to the cluster, up to at least 500 nodes, provides value commensurate with the overall cost of the cluster.

Figure 3. As measured using the HPL (High-Performance LINPACK) benchmark, the cluster achieves performance of 35.81 TeraFLOPS at 84.14 percent efficiency using iWARP and 10 Gigabit Ethernet. (The chart plots performance in GF/s and efficiency in percent against the number of cores.)
These performance and efficiency results must be considered in the context that this cluster configuration oversubscribes the connections to the Arista 7xxx switches by a factor of 2.475 to 1. Making additional connections from the racks to the network fabric using free ports to reduce the oversubscription could potentially result in higher performance. This is a possible area for future inquiry.
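
For readers unfamiliar with the term, the oversubscription ratio compares aggregate node-facing bandwidth with aggregate uplink bandwidth at the edge of the fabric; the port counts in the example below are hypothetical, chosen only to reproduce the stated 2.475:1 figure.

\[
\text{oversubscription} = \frac{\text{node-facing bandwidth}}{\text{uplink bandwidth}}, \qquad \text{for example}\ \frac{99 \times 10\ \text{GbE}}{40 \times 10\ \text{GbE}} = 2.475 : 1
\]

Cabling free ports as additional uplinks increases the denominator, moving the ratio toward 1:1.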
June 2010 Top 500 Entry:
Performance (Rmax): 35.81 TeraFLOPS
Rank: #208
Efficiency (Rmax ÷ Rpeak): 84.14%
Rank: #84