User's Manual

Performance Considerations for Streams and Nodes
Note: The following databases support temporary tables for the purpose of caching: DB2,
Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database
caching. The SQL code can be customized for specific databases; contact Support for assistance.
Performance: Process Nodes
Sort. The Sort node must read the entire input data set before it can be sorted. The data is stored in
memory up to some limit, and the excess is spilled to disk. The sorting algorithm is a combination
algorithm: data is read into memory up to the limit and sorted using a fast hybrid quick-sort
algorithm. If all the data fits in memory, then the sort is complete. Otherwise, a merge-sort
algorithm is applied. The sorted data is written to file and the next chunk of data is read into
memory, sorted, and written to disk. This is repeated until all the data has been read; then the sorted
chunks are merged. Merging may require repeated passes over the data stored on disk. At peak
usage, the Sort node will have two complete copies of the data set on disk: sorted and unsorted.
The overall running time of the algorithm is on the order of N*log(N), where N is the number
of records. Sorting in memory is faster than merging from disk, so the actual running time can
be reduced by allocating more memory to the sort. The algorithm allocates to itself a fraction of
physical RAM controlled by the IBM® SPSS® Modeler Server configuration option Memory
usage multiplier. To increase the memory used for sorting, provide more physical RAM or increase
this value. Note that when the proportion of memory used exceeds the working set of the process,
so that part of the memory is paged to disk, performance degrades because the memory-access
pattern of the in-memory sort algorithm is random and can cause excessive paging. The sort
algorithm is used by several nodes other than the Sort node, but the same performance rules apply.
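The external-sort scheme described above can be sketched as follows. This is an illustrative simplification, not SPSS Modeler's implementation: sorted chunks are spilled to temporary files when the in-memory limit (here a record count, standing in for the Memory usage multiplier setting) is reached, then the runs are merged.

```python
import heapq
import os
import tempfile

def external_sort(records, memory_limit):
    """Sort integer records using at most `memory_limit` records in RAM."""
    runs = []          # file paths of sorted runs spilled to disk
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) >= memory_limit:
            runs.append(_spill(sorted(chunk)))   # sort chunk, write to disk
            chunk = []
    chunk.sort()
    if not runs:
        return chunk                # everything fit in memory: no merge pass
    if chunk:
        runs.append(_spill(chunk))
    # Merge phase: stream the sorted runs back from disk with a k-way merge.
    files = [open(r) for r in runs]
    try:
        return [int(line) for line in heapq.merge(*files, key=int)]
    finally:
        for f in files:
            f.close()
        for r in runs:
            os.remove(r)

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{rec}\n" for rec in sorted_chunk)
    return path
```

Raising `memory_limit` produces fewer, larger runs and less merge work, which mirrors why giving the Sort node more memory reduces actual running time.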
Binning. The Binning node reads the entire input data set to compute the bin boundaries before
it allocates records to bins. The data set is cached while the boundaries are computed; then it is
rescanned for allocation. When the binning method is fixed-width or mean+standard deviation,
the data set is cached directly to disk. These methods have a linear running time and require
enough disk space to store the entire data set. When the binning method is ranks or tiles, the data
set is sorted using the sort algorithm described earlier, and the sorted data set is used as the cache.
Sorting gives these methods a running time of M*N*log(N), where M is the number of binned
fields and N is the number of records; it requires disk space equal to twice the data set size.
Generating a Derive node based on the generated bins will improve performance in subsequent
passes. Derive operations are much faster than binning.
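The two-pass structure of fixed-width binning can be sketched as below (an illustrative simplification, not the product's code): the first scan of the cached data computes the boundaries, and the second scan allocates each record to a bin, giving the linear running time noted above.

```python
def fixed_width_bins(values, n_bins):
    """Return (boundaries, bin index per value) for fixed-width binning."""
    # Pass 1: scan the cached data to compute the bin boundaries.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # guard against a constant field
    boundaries = [lo + i * width for i in range(1, n_bins)]
    # Pass 2: rescan the cached data and allocate each record to a bin.
    assign = [min(int((v - lo) / width), n_bins - 1) for v in values]
    return boundaries, assign
```

Ranks or tiles cannot be computed this way from min and max alone, which is why those methods fall back to the sort algorithm described earlier.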
Merge by Key (Join). The Merge node, when the merge method is keys (equivalent to a database
join), sorts each of its input data sets by the key fields. This part of the procedure has a running
time of M*N*log(N), where M is the number of inputs and N is the number of records in the largest
input; it requires sufficient disk space to store all of its input data sets plus a second copy of the
largest data set. The running time of the merge itself is proportional to the size of the output
data set, which depends on the frequency of matching keys. In the worst case, where the output
is the Cartesian product of the inputs, the running time may approach N^M. This is rare; most
joins have many fewer matching keys. If one data set is relatively larger than the other(s), or if
the incoming data is already sorted by a key field, then you can improve the performance of
this node using the Optimization tab.
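A sort-merge join of this kind can be sketched as follows (a hypothetical simplification to two inputs and a single key column, not SPSS Modeler's implementation). Each input is sorted by key, then two cursors advance in step; a run of equal keys on both sides produces its cross product, which is why the worst case approaches the Cartesian product of the inputs.

```python
from operator import itemgetter

def merge_by_key(left, right):
    """Join two lists of (key, payload) rows on the key column."""
    left = sorted(left, key=itemgetter(0))     # sort phase: M*N*log(N)
    right = sorted(right, key=itemgetter(0))
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Collect the run of equal keys on each side.
            i2 = i
            while i2 < len(left) and left[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == key:
                j2 += 1
            # Emit the cross product of the two runs: output size, and
            # hence merge time, is driven by matching-key frequency.
            for l in left[i:i2]:
                for r in right[j:j2]:
                    out.append((key, l[1], r[1]))
            i, j = i2, j2
    return out
```

If one side is already sorted by the key field, its sort phase can be skipped entirely, which is the saving the Optimization tab exposes.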