User's Guide
6 Programming Overview
Partition a Datastore in Parallel
Partitioning a datastore in parallel, with a portion of the datastore on each worker in a
parallel pool, can provide benefits in many cases:
• Perform some action on only one part of the whole datastore, or on several defined
parts simultaneously.
• Search for specific values in the datastore, with all workers acting simultaneously on
their own partitions.
• Perform a reduction calculation on the workers across all partitions.
This example shows how to use partition to parallelize the reading of data from a
datastore. It uses a small datastore of airline data provided with MATLAB, and finds the
mean of the non-NaN values in its 'ArrDelay' column.
A simple way to calculate the mean is to divide the sum of all the non-NaN values by the
number of non-NaN values. The following code first does this for the datastore without
parallel execution. To begin, define a function that accumulates the count and sum. To
run this example, copy and save this function in a folder on the MATLAB command
search path.
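function [total,count] = sumAndCountArrivalDelay(ds)
    % Accumulate the sum and the count of the non-NaN ArrDelay values in ds.
    total = 0;
    count = 0;
    while hasdata(ds)
        data = read(ds);   % read the next block of rows as a table
        total = total + sum(data.ArrDelay,1,'OmitNaN');   % block sum, ignoring NaN
        count = count + sum(~isnan(data.ArrDelay));       % number of non-NaN entries
    end
end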
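In formula form, the quantity that this function helps compute can be sketched as follows. The symbols are introduced here for illustration only and do not appear in the code: x_i denotes the entries of ArrDelay, V the set of indices whose entries are not NaN, and k indexes the successive blocks returned by read, with total_k and count_k the sum and non-NaN count of block k.

```latex
\[
\bar{x}
  \;=\; \frac{\sum_{i \in V} x_i}{|V|}
  \;=\; \frac{\sum_{k} \mathrm{total}_k}{\sum_{k} \mathrm{count}_k}
\]
```

Because the numerator and the denominator are both plain sums, the per-block contributions can be accumulated in any order and combined at the end, which is the property a reduction across partitions relies on.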
The following code creates a datastore, calls the function, and calculates the mean
without any parallel execution. The tic and toc functions are used to time the
execution, here and in the later parallel cases.
ds = datastore(repmat({'airlinesmall.csv'},20,1),'TreatAsMissing','NA');
ds.SelectedVariableNames = 'ArrDelay';
reset(ds);
tic