
enterprises with small data analysis needs
to enterprises with growing data analysis
requirements that call for clusters able to
scale with faster, more powerful hardware.
Three components make up the core of
Apache Hadoop version 1.x:
• Apache Hadoop Distributed File
System (HDFS*), which provides a
high-performance file system that
can span and replicate data across the
nodes of an Apache Hadoop cluster.
Important features of HDFS include
fault tolerance and performance for
large datasets.
• MapReduce, a processing framework
that provides parallel processing
across large, unstructured datasets.
MapReduce includes two functions:
map, which sorts and filters the data,
and reduce, which further processes
the output of map into a final result
(a sketch of both functions follows
this list).
• Apache Hadoop Common, which ties
HDFS and MapReduce together.
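To make the two MapReduce functions concrete, the following is a minimal word-count sketch written against the classic Hadoop 1.x mapred API; the class and variable names are illustrative rather than taken from this paper. The map function emits a count of 1 for every word it finds, and the reduce function sums those counts into a final total per word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // map: emit (word, 1) for each word; the framework then sorts
  // and groups the pairs by word before they reach reduce.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // reduce: combine all counts for one word into a final total.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}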
An Apache Hadoop cluster consists of
master nodes and worker nodes. When
a client sends a request to a master
node, the node processes the request
with two components:
• NameNode, a component of HDFS
that keeps track of data within the
cluster nodes.
• JobTracker, which divides an analysis
request into smaller tasks based on
where in the cluster the data resides,
and then assigns those tasks to
specific worker nodes (see the job
submission sketch below).
After a master node processes a request,
it communicates with three services on
the worker nodes:
• DataNode, a component of HDFS that
manages data on the worker nodes.
• TaskTracker, a service that receives
and runs MapReduce tasks from a
master node’s JobTracker service.
• MapReduce, which performs the
assigned tasks.
As MapReduce on each worker node
finishes its assigned tasks, the worker
nodes return the results to the master
node. Since the tasks can run in parallel
on multiple worker nodes, the master
node waits for all of the tasks to complete
on the worker nodes, compiles the results,
and then returns the combined result to
the client.
Performance Bottlenecks
Apache Hadoop benets from its
distributed architecture, as worker
nodes do not require high-availability
congurations due to the HDFS ability to
create multiple copies of data across the
worker nodes. Any worker node within
the cluster can fail without data loss or
interruption to the rest of the cluster.
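The number of copies HDFS keeps is governed by the dfs.replication setting. The following minimal sketch, with a hypothetical file path, shows the setting applied through the Hadoop 1.x Java API; in practice the value is more commonly set once in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication controls how many worker nodes hold a copy of
    // each HDFS block (the Hadoop 1.x default is 3). With three copies,
    // any single worker node can fail without data loss.
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // The replication factor of an existing file can also be changed;
    // the path here is hypothetical.
    fs.setReplication(new Path("/data/clickstream.log"), (short) 3);
  }
}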
But as the number of worker nodes in
an Apache Hadoop cluster increases, the
strain on the master node, specifically
the NameNode and JobTracker services,
increases. As the volume and velocity of
data increase, the master node services
can become overwhelmed, reducing
performance across the cluster.
Networking and storage I/O bottlenecks
can also affect cluster performance. A
master node must wait for all tasks on the
worker nodes to complete before it can
compile the results and return the results
to the client. Therefore, slow worker
nodes, whether they are limited by
CPU or I/O speeds, can hamper analytics
and batch tasks. At the worker node,
reading data from disk into memory to
perform a task, and then sending the
results across the network to the master
node can introduce delays, especially
where high-velocity data is concerned.
Increase Apache Hadoop Cluster
Performance with Intel® Technologies
Intel provides a number of technologies
that can help dramatically improve Apache
Hadoop performance across CPU- and
I/O-intensive workloads. Combined, these
technologies can help enterprises scale
Apache Hadoop to address increasing
data analysis demands.