User's guide

Run mapreduce on a Hadoop Cluster
In this section...
“Cluster Preparation”
“Output Format and Order”
“Calculate Mean Delay”
Cluster Preparation
Before you can run mapreduce on a Hadoop® cluster, make sure that the cluster and
client machine are properly configured. Consult your system administrator, or see
“Configure a Hadoop Cluster”.
Output Format and Order
When running mapreduce on a Hadoop cluster with binary output (the default), the
resulting KeyValueDatastore points to Hadoop Sequence files, instead of the binary
MAT-files that mapreduce generates in other environments. For more information, see the
'OutputType' argument description on the mapreduce reference page.
When running mapreduce on a Hadoop cluster, the order of the key-value pairs in the
output can differ from the order produced when running mapreduce in other
environments. If your application depends on the arrangement of data in the output,
you must sort the data according to your own requirements.
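As a minimal sketch of sorting the output yourself, the following assumes that the result has been read into a table with Key and Value variables, as readall returns for a KeyValueDatastore. The table contents are hypothetical stand-in data, not output from a real job.

```matlab
% Hypothetical stand-in for: result = readall(outds);
% Keys arrive in an arbitrary order, as they might from a Hadoop cluster.
result = table({'UA';'AA';'DL'}, {12.5;7.1;9.8}, ...
    'VariableNames', {'Key','Value'});

% Sort rows by key so that later processing does not depend on output order.
sorted = sortrows(result, 'Key');
disp(sorted)
```

Sorting by key is only one possible choice. If your application expects a different arrangement, for example ordering by value, sort on that variable instead.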
Calculate Mean Delay
This example shows how to modify the MATLAB example for calculating mean airline
delays to run on a Hadoop cluster.
First, you must set environment variables and cluster properties as appropriate for your
specific Hadoop configuration. See your system administrator for the values for these and
other properties necessary for submitting jobs to your cluster.
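% Point MATLAB at the Hadoop installation on the client machine.
setenv('HADOOP_HOME','/share/hadoop/a2.2.0');
cluster = parallel.cluster.Hadoop;
% Address of the job tracker that accepts submitted jobs.
cluster.HadoopProperties('mapred.job.tracker') = 'hadoophost1:50031';
% URI of the default file system (the HDFS name node).
cluster.HadoopProperties('fs.default.name') = 'hdfs://hadoophost2:8020';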