
• Apache* Hive Join, a workload that is both CPU- and I/O-intensive. This workload provides performance benchmarks for more structured datasets.

• Page Rank, a MapReduce workload that uses a well-known search engine algorithm for ranking pages.

The tests consisted of running the workloads against two configurations:

• A six-node, fully optimized baseline configuration that used dual-socket servers based on the Intel Xeon processor E5-2680 with Intel SSDs and 10 gigabit Intel Ethernet Server Adapters.

• A three-node enhanced configuration that used four-socket servers based on the Intel Xeon processor E7-4890 v2 with Intel SSDs and 10 gigabit Intel Ethernet Server Adapters.

The results are normalized to the servers configured with the Intel Xeon processor E5 family.

The benchmarks demonstrate significant performance gains for the servers equipped with the Intel Xeon processor E7 v2 family over servers equipped with the previous-generation Intel Xeon processor E5 family. The I/O-intensive workloads (Sort and Page Rank) showed 2.6 and 2.7 times the baseline performance, while the CPU-intensive workloads (Apache Hive Join and K-means) showed the greatest advantage at 3.2 and 3.5 times the performance of the E5-based servers. TeraSort, which is both I/O- and CPU-intensive, performed nearly 2.9 times faster.
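The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not Intel's test harness: it assumes that performance is the inverse of job completion time, that the baseline cluster scores 1.0, and that each pair of figures in the text follows the order in which the workloads are named. The function names are hypothetical.

```python
from math import prod

# Ratios quoted in the text, relative to the E5-based baseline (1.0).
# Assumption: each pair of numbers follows the order in which the
# workloads are named in the sentence that reports them.
REPORTED_RATIOS = {
    "Sort": 2.6,
    "Page Rank": 2.7,
    "TeraSort": 2.9,
    "Apache Hive Join": 3.2,
    "K-means": 3.5,
}


def normalized_performance(baseline_seconds: float, measured_seconds: float) -> float:
    """Return performance relative to the baseline run (baseline = 1.0).

    Assumes performance is the inverse of job completion time.
    """
    if baseline_seconds <= 0 or measured_seconds <= 0:
        raise ValueError("completion times must be positive")
    return baseline_seconds / measured_seconds


def geometric_mean(ratios) -> float:
    """Geometric mean, a common way to summarize a set of ratios."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("at least one ratio is required")
    return prod(ratios) ** (1.0 / len(ratios))


if __name__ == "__main__":
    for name, ratio in REPORTED_RATIOS.items():
        print(f"{name:18s} {ratio:.1f}x")
    summary = geometric_mean(REPORTED_RATIOS.values())
    print(f"{'geometric mean':18s} {summary:.2f}x")
```

Under these assumptions, the geometric mean of the five reported ratios comes out just under 3, which summarizes the spread between the I/O-intensive and CPU-intensive results in a single figure.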

In recent tests, Intel engineers compared the performance of 1 GbE and 10 GbE networks when importing data into an Apache Hadoop cluster and replicating it across worker nodes. The results demonstrated a five-fold increase in data loading and replication performance with 10 GbE.
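For a sense of scale, the following Python sketch computes the ideal time to move a data set over a single link at full line rate. It is a back-of-the-envelope model, not a reproduction of the Intel tests: it assumes decimal units (1 GB = 8 gigabits), ignores protocol overhead, disk throughput, and replication traffic, and uses the data set sizes shown in Figure 2. The function name is hypothetical.

```python
def ideal_transfer_minutes(dataset_gb: float, link_gbps: float) -> float:
    """Lower bound on minutes needed to push `dataset_gb` through one link.

    Assumes the link runs at full line rate with no overhead, which real
    imports will not achieve.
    """
    if dataset_gb < 0 or link_gbps <= 0:
        raise ValueError("size must be non-negative and link rate positive")
    gigabits = dataset_gb * 8          # 1 GB = 8 gigabits (decimal units)
    seconds = gigabits / link_gbps
    return seconds / 60


if __name__ == "__main__":
    print("size (GB)   1 GbE (min)   10 GbE (min)")
    for size_gb in (30, 60, 120, 240, 300):
        slow = ideal_transfer_minutes(size_gb, 1.0)
        fast = ideal_transfer_minutes(size_gb, 10.0)
        print(f"{size_gb:9d}   {slow:11.1f}   {fast:12.1f}")
```

In this idealized model the two link speeds differ by a factor of ten. The measured five-fold gain is smaller, which would be consistent with other components (such as storage or replication) also contributing to total import time, although the source does not break this down.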

Putting It All Together: Benchmarking Apache Hadoop Clusters with Intel Technologies

In recent internal tests, Intel engineers combined high-performance Intel CPU, SSD, and networking technologies to determine the performance benefits across a range of CPU- and I/O-intensive Apache Hadoop workloads. These workloads included:

• Sort, an I/O-intensive workload that transforms data from one format to another. Sort is representative of a typical real-world MapReduce task.

• TeraSort, a popular industry-standard benchmark for large-size data sorting.

• K-means, a CPU-intensive workload that uses a well-known clustering algorithm for data mining and machine learning.

Intel® Ethernet Server Adapters: Higher Throughput for Distributed Clusters

The distributed architecture of Apache Hadoop depends heavily on fast and reliable network communication. Many enterprises use gigabit Ethernet (GbE) network fabrics to connect Apache Hadoop nodes, but as the frequency of workload requests and data velocity increase, along with CPU and storage speeds, network speeds must also increase.

A common technique for increasing network throughput is Ethernet bonding, where multiple physical Ethernet ports are bonded together into a higher-bandwidth logical Ethernet port. This method can provide short-term performance gains, but it increases complexity and costs.

The 10 gigabit Intel® Ethernet Server Adapters provide higher performance while decreasing port counts, cabling, and power consumption. More importantly, 10 GbE also provides scalability benefits for Apache Hadoop clusters.
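The port-count argument can be made concrete with a short Python sketch. It is illustrative only: it assumes bonded bandwidth is the simple sum of the member ports (in practice a bonded link often delivers less, particularly for a single traffic flow), and it uses a six-node cluster as an example size. The function names are hypothetical.

```python
import math


def ports_needed(target_gbps: float, port_gbps: float) -> int:
    """Ports per node required to reach `target_gbps` of nominal bandwidth.

    Assumes bonded bandwidth is the plain sum of its member ports.
    """
    if target_gbps <= 0 or port_gbps <= 0:
        raise ValueError("bandwidth values must be positive")
    return math.ceil(target_gbps / port_gbps)


def cluster_port_count(nodes: int, target_gbps: float, port_gbps: float) -> int:
    """Total adapter ports (and cables) across a cluster of `nodes` servers."""
    if nodes < 0:
        raise ValueError("node count must be non-negative")
    return nodes * ports_needed(target_gbps, port_gbps)


if __name__ == "__main__":
    nodes = 6  # example cluster size, chosen for illustration
    for label, port_gbps in (("bonded 1 GbE", 1.0), ("single 10 GbE", 10.0)):
        per_node = ports_needed(10.0, port_gbps)
        total = cluster_port_count(nodes, 10.0, port_gbps)
        print(f"{label:14s} {per_node:2d} ports per node, {total:2d} ports in total")
```

Each additional port also implies a cable and a switch port, which is where the added complexity and cost of bonding come from as a cluster grows.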

Figure 2: The 10 gigabit Intel® Ethernet Server Adapters demonstrate a five-fold increase in data loading and replication performance.

[Chart: Data Loading and Replication Performance. X-axis: Data Set Size (30 GB, 60 GB, 120 GB, 240 GB, 300 GB). Y-axis: Import Time (Minutes). Series: Gigabit Ethernet, 10 Gigabit Ethernet.]

Accelerate Big Data Analysis with Intel® Technologies