
enterprises with small data analysis needs
to enterprises with growing data analysis
requirements that call for clusters able to
scale with faster, more powerful hardware.
Three components make up the core of
Apache Hadoop version 1.x:
• Apache Hadoop Distributed File
System (HDFS*), which provides a
high-performance file system that
can span and replicate data across the
nodes of an Apache Hadoop cluster.
Important features of HDFS include
fault tolerance and performance for
large datasets.
• MapReduce, a processing framework
that provides parallel processing
across large, unstructured datasets.
MapReduce includes two functions:
map, which sorts and filters the data,
and reduce, which further processes
the output of map into a final result
(a sketch of both functions follows
this list).
• Apache Hadoop Common, which ties
HDFS and MapReduce together.
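To make the two MapReduce functions concrete, the following is a minimal word-count sketch written against the classic Hadoop 1.x mapred API; the class and variable names are illustrative rather than taken from this paper. The map function emits a count of 1 for every word it finds, and the reduce function sums those counts into a final total per word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // map: emit (word, 1) for each word; the framework then sorts
  // and groups the pairs by word before they reach reduce.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // reduce: combine all counts for one word into a final total.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}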
An Apache Hadoop cluster consists of
master nodes and worker nodes. When
a client sends a request to a master
node, the node processes the request
with two components:
• NameNode, a component of HDFS
that keeps track of data within the
cluster nodes.
• JobTracker, which divides an analysis
request into smaller tasks based on
where in the cluster the data resides,
and then assigns those tasks to
specific worker nodes (see the job
submission sketch below).
After a master node processes a request,
it communicates with three services on
the worker nodes:
• DataNode, a component of HDFS that
manages data on the worker nodes.
• TaskTracker, a service that receives
and runs MapReduce tasks from a
master node’s JobTracker service.
• MapReduce, which performs the
assigned tasks.
As MapReduce on each worker node
finishes its assigned tasks, the worker
nodes return the results to the master
node. Since the tasks can run in parallel
on multiple worker nodes, the master
node waits for all of the tasks to complete
on the worker nodes, compiles the results,
and then returns the combined result to
the client.
Performance Bottlenecks
Apache Hadoop benets from its
distributed architecture, as worker
nodes do not require high-availability
congurations due to the HDFS ability to
create multiple copies of data across the
worker nodes. Any worker node within
the cluster can fail without data loss or
interruption to the rest of the cluster.
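The number of copies HDFS keeps is governed by the dfs.replication setting. The following minimal sketch, with a hypothetical file path, shows the setting applied through the Hadoop 1.x Java API; in practice the value is more commonly set once in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication controls how many worker nodes hold a copy of
    // each HDFS block (the Hadoop 1.x default is 3). With three copies,
    // any single worker node can fail without data loss.
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // The replication factor of an existing file can also be changed;
    // the path here is hypothetical.
    fs.setReplication(new Path("/data/clickstream.log"), (short) 3);
  }
}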
But as the number of worker nodes in
an Apache Hadoop cluster increases, the
strain on the master node, specifically
the NameNode and JobTracker services,
increases. As the volume and velocity of
data increase, the master node services
can become overwhelmed, reducing
performance across the cluster.
Networking and storage I/O bottlenecks
can also affect cluster performance. A
master node must wait for all tasks on the
worker nodes to complete before it can
compile the results and return the results
to the client. Therefore, slow worker
nodes, whether they are limited by
CPU or I/O speeds, can hamper analytics
and batch tasks. At the worker node,
reading data from disk into memory to
perform a task, and then sending the
results across the network to the master
node can introduce delays, especially
where high-velocity data is concerned.
Increase Apache Hadoop Cluster
Performance with Intel® Technologies
Intel provides a number of technologies
that can help dramatically improve Apache
Hadoop performance across CPU- and
I/O-intensive workloads. Combined, these
technologies can help enterprises scale
Apache Hadoop to address increasing
data analysis demands.