User's Guide

6 Programming Overview


Partition a Datastore in Parallel

Partitioning a datastore in parallel, with a portion of the datastore on each worker in a parallel pool, can provide benefits in many cases:

• Perform some action on only one part of the whole datastore, or on several defined parts simultaneously.
• Search for specific values in the datastore, with all workers acting simultaneously on their own partitions.
• Perform a reduction calculation on the workers across all partitions.

This example shows how to use partition to parallelize the reading of data from a datastore. It uses a small datastore of airline data provided in MATLAB, and finds the mean of the non-NaN values from its 'ArrDelay' column.
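As a minimal sketch of the basic call, assuming the airlinesmall.csv sample file that ships with MATLAB is on the search path, the following code splits a datastore into its default number of partitions and reads one block from the first partition. The variable names n, subds, and firstBlock are illustrative and are not part of the original example.

```matlab
% Build a datastore over 20 copies of the sample file and keep one column.
ds = datastore(repmat({'airlinesmall.csv'},20,1),'TreatAsMissing','NA');
ds.SelectedVariableNames = 'ArrDelay';

% Default number of partitions for this datastore (no parallel pool needed).
n = numpartitions(ds);

% Datastore that covers only the first of the n partitions.
subds = partition(ds,n,1);

% Read one block of data from that partition; the result is a table.
firstBlock = read(subds);
```

Each index from 1 to n selects a different part of the data, so separate workers can each be given their own index.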

A simple way to calculate the mean is to divide the sum of all the non-NaN values by the number of non-NaN values. The following code does this for the datastore first in a nonparallel way. To begin, define a function that accumulates the sum and the count. If you want to run this example, copy and save this function in a folder on the MATLAB search path.

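function [total,count] = sumAndCountArrivalDelay(ds)
% Accumulate the sum and the count of the non-NaN ArrDelay values in ds.
total = 0;
count = 0;
while hasdata(ds)
    data = read(ds);  % read the next block of the datastore as a table
    total = total + sum(data.ArrDelay,1,'omitnan');
    count = count + sum(~isnan(data.ArrDelay));
end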
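The sum-and-count logic can be checked on a small in-memory vector before running it on the datastore. This is only a sketch: the values in x are made up for illustration, and the variable avg is not part of the original example.

```matlab
x = [4; NaN; 10; NaN; 1];      % illustrative data with two missing values

total = sum(x,1,'omitnan');    % sum of the non-NaN entries: 4 + 10 + 1
count = sum(~isnan(x));        % number of non-NaN entries
avg   = total/count;           % mean of the non-NaN entries

% Cross-check against the built-in NaN-aware mean.
assert(avg == mean(x,'omitnan'));
```

Keeping total and count as separate running values, rather than averaging each block, matters because different blocks can contain different numbers of non-NaN values.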

The following code creates a datastore, calls the function, and calculates the mean without any parallel execution. The tic and toc functions are used to time the execution, here and in the later parallel cases.

ds = datastore(repmat({'airlinesmall.csv'},20,1),'TreatAsMissing','NA');

ds.SelectedVariableNames = 'ArrDelay';

reset(ds);

tic