User's Manual

Performance Considerations for Streams and Nodes
Note: The following databases support temporary tables for the purpose of caching: DB2,
Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database
caching. The SQL code can be customized for specific databases; contact Support for assistance.
Performance: Process Nodes
Sort. The Sort node must read the entire input data set before it can be sorted. The data is stored in
memory up to some limit, and the excess is spilled to disk. The sorting algorithm is a combination
algorithm: data is read into memory up to the limit and sorted using a fast hybrid quick-sort
algorithm. If all the data fits in memory, then the sort is complete. Otherwise, a merge-sort
algorithm is applied. The sorted data is written to file and the next chunk of data is read into
memory, sorted, and written to disk. This is repeated until all the data has been read; then the sorted
chunks are merged. Merging may require repeated passes over the data stored on disk. At peak
usage, the Sort node will have two complete copies of the data set on disk: sorted and unsorted.
The overall running time of the algorithm is on the order of N*log(N), where N is the number
of records. Sorting in memory is faster than merging from disk, so the actual running time can
be reduced by allocating more memory to the sort. The algorithm allocates to itself a fraction of
physical RAM controlled by the IBM® SPSS® Modeler Server configuration option Memory
usage multiplier. To increase the memory used for sorting, provide more physical RAM or increase
this value. Note that when the proportion of memory used exceeds the working set of the process,
so that part of the memory is paged to disk, performance degrades because the memory-access
pattern of the in-memory sort algorithm is random and can cause excessive paging. The sort
algorithm is used by several nodes other than the Sort node, but the same performance rules apply.
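The external-sort scheme described above can be sketched as follows. This is an illustrative simplification, not SPSS Modeler's implementation: sorted chunks are spilled to temporary files when the in-memory limit (here a record count, standing in for the Memory usage multiplier setting) is reached, then the runs are merged.

```python
import heapq
import os
import tempfile

def external_sort(records, memory_limit):
    """Sort integer records using at most `memory_limit` records in RAM."""
    runs = []          # file paths of sorted runs spilled to disk
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= memory_limit:
            runs.append(_spill(sorted(chunk)))   # sort chunk, write to disk
            chunk = []
    chunk.sort()
    if not runs:
        return chunk                # everything fit in memory: no merge pass
    if chunk:
        runs.append(_spill(chunk))
    # Merge phase: stream the sorted runs back from disk with a k-way merge.
    files = [open(r) for r in runs]
    try:
        return [int(line) for line in heapq.merge(*files, key=int)]
    finally:
        for f in files:
            f.close()
        for r in runs:
            os.remove(r)

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{rec}\n" for rec in sorted_chunk)
    return path
```

Raising `memory_limit` produces fewer, larger runs and less merge work, which mirrors why giving the Sort node more memory reduces actual running time.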
Binning. The Binning node reads the entire input data set to compute the bin boundaries before
it allocates records to bins. The data set is cached while the boundaries are computed; then it is
rescanned for allocation. When the binning method is fixed-width or mean+standard deviation,
the data set is cached directly to disk. These methods have a linear running time and require
enough disk space to store the entire data set. When the binning method is ranks or tiles, the data
set is sorted using the sort algorithm described earlier, and the sorted data set is used as the cache.
Sorting gives these methods a running time of M*N*log(N), where M is the number of binned
fields and N is the number of records; it requires disk space equal to twice the data set size.
Generating a Derive node based on the generated bins will improve performance in subsequent
passes. Derive operations are much faster than binning.
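The two-pass structure of fixed-width binning can be sketched as below (an illustrative simplification, not the product's code): the first scan of the cached data computes the boundaries, and the second scan allocates each record to a bin, giving the linear running time noted above.

```python
def fixed_width_bins(values, n_bins):
    """Return (boundaries, bin index per value) for fixed-width binning."""
    # Pass 1: scan the cached data to compute the bin boundaries.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # guard against a constant field
    boundaries = [lo + i * width for i in range(1, n_bins)]
    # Pass 2: rescan the cached data and allocate each record to a bin.
    assign = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return boundaries, assign
```

Ranks or tiles cannot be computed this way from min and max alone, which is why those methods fall back to the sort algorithm described earlier.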
Merge by Key (Join). The Merge node, when the merge method is keys (equivalent to a database
join), sorts each of its input data sets by the key fields. This part of the procedure has a running
time of M*N*log(N), where M is the number of inputs and N is the number of records in the largest
input; it requires sufficient disk space to store all of its input data sets plus a second copy of the
largest data set. The running time of the merge itself is proportional to the size of the output
data set, which depends on the frequency of matching keys. In the worst case, where the output
is the Cartesian product of the inputs, the running time may approach N^M. This is rare; most
joins have many fewer matching keys. If one data set is relatively larger than the other(s), or if
the incoming data is already sorted by a key field, then you can improve the performance of
this node using the Optimization tab.
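A sort-merge join of this kind can be sketched as follows (a hypothetical simplification to two inputs and a single key column, not SPSS Modeler's implementation). Each input is sorted by key, then two cursors advance in step; a run of equal keys on both sides produces its cross product, which is why the worst case approaches the Cartesian product of the inputs.

```python
from operator import itemgetter

def merge_by_key(left, right):
    """Join two lists of (key, payload) rows on the key column."""
    left = sorted(left, key=itemgetter(0))     # sort phase: M*N*log(N)
    right = sorted(right, key=itemgetter(0))
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Collect the run of equal keys on each side.
            i2 = i
            while i2 < len(left) and left[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == key:
                j2 += 1
            # Emit the cross product of the two runs: output size, and
            # hence merge time, is driven by matching-key frequency.
            for l in left[i:i2]:
                for r in right[j:j2]:
                    out.append((key, l[1], r[1]))
            i, j = i2, j2
    return out
```

If one side is already sorted by the key field, its sort phase can be skipped entirely, which is the saving the Optimization tab exposes.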