User's Guide

Run mapreduce on a Hadoop Cluster


In this section...

“Cluster Preparation”

“Output Format and Order”

“Calculate Mean Delay”

Cluster Preparation

Before you can run mapreduce on a Hadoop cluster, make sure that the cluster and client machine are properly configured. Consult your system administrator, or see “Configure a Hadoop Cluster”.

Output Format and Order

When running mapreduce on a Hadoop cluster with binary output (the default), the resulting KeyValueDatastore points to Hadoop sequence files instead of the binary MAT-files that mapreduce generates in other environments. For more information, see the 'OutputType' argument description on the mapreduce reference page.
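As a rough illustration of the point above, the following sketch requests binary output explicitly and then inspects the returned datastore. It assumes that mr is a mapreducer for an already configured Hadoop cluster and that ds is a datastore over data the cluster can read. The names myMapper and myReducer and the output path are hypothetical placeholders, not part of the original example.

```matlab
% Sketch only. Assumes mr and ds already exist (see the lead-in).
% myMapper, myReducer, and the HDFS path are hypothetical placeholders.
outds = mapreduce(ds, @myMapper, @myReducer, mr, ...
    'OutputType', 'Binary', ...                    % the default, stated explicitly
    'OutputFolder', 'hdfs:///user/someuser/out');  % hypothetical location

% On a Hadoop cluster, the files listed here are sequence files,
% not MAT-files.
disp(outds.Files)

% readall works the same way regardless of the underlying file format.
result = readall(outds);
```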

When running mapreduce on a Hadoop cluster, the order of the key-value pairs in the output can differ from the order produced when running mapreduce in other environments. If your application depends on the arrangement of data in the output, you must sort the data according to your own requirements.
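One possible way to impose an order, shown here as a minimal sketch, is to read the output into a table and sort it by key. The table below is made-up data that stands in for the result of readall on the output datastore, which returns the pairs in variables named Key and Value.

```matlab
% Stand-in for: result = readall(outds);
% The keys and values here are invented for illustration.
result = table({'UA'; 'AA'; 'DL'}, {12.5; 7.1; 9.8}, ...
    'VariableNames', {'Key', 'Value'});

% Sort the rows by key so that later code sees the same order
% regardless of the environment that produced the output.
sorted = sortrows(result, 'Key');
```

Sorting by key is only one choice. An application that depends on value order, or on a custom key order, would sort on whatever criterion it needs.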

Calculate Mean Delay

This example shows how to modify the MATLAB example for calculating mean airline delays so that it runs on a Hadoop cluster.

First, set the environment variables and cluster properties as appropriate for your specific Hadoop configuration. See your system administrator for the values of these and other properties necessary for submitting jobs to your cluster.

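% Point MATLAB at the Hadoop installation folder.
setenv('HADOOP_HOME','/share/hadoop/a2.2.0');
% Create the cluster object, then set the site-specific Hadoop properties.
cluster = parallel.cluster.Hadoop;
cluster.HadoopProperties('mapred.job.tracker') = 'hadoophost1:50031';
cluster.HadoopProperties('fs.default.name') = 'hdfs://hadoophost2:8020';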
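With the cluster object configured, the remaining steps might look like the following sketch. It assumes that the airline data set is stored at a hypothetical HDFS location, that the output folder path is likewise a placeholder, and that the mapper and reducer functions from the MATLAB mean-delay example (here assumed to be named meanArrivalDelayMapper and meanArrivalDelayReducer) are on the MATLAB path.

```matlab
% Sketch only. Assumes cluster was created and configured as in the lines above.
mr = mapreducer(cluster);

% Hypothetical HDFS location of the airline data. Only the arrival
% delay column is read.
ds = datastore('hdfs:///shared_data/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', 'ArrDelay');

% Hypothetical output folder on HDFS.
outputFolder = 'hdfs:///user/someuser/out1';

% The mapper and reducer names are assumed to match the MATLAB example.
meanDelay = mapreduce(ds, @meanArrivalDelayMapper, ...
    @meanArrivalDelayReducer, mr, 'OutputFolder', outputFolder);

% Read the result back into the MATLAB client.
readall(meanDelay)
```

Passing mr explicitly makes it clear which execution environment the call uses, which helps when a session also has a local mapreducer defined.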