user manual
Performance Considerations for Streams and Nodes
The following operations cannot be performed in most databases. They should be placed in the
stream after the operations in the preceding list:
Operations on any nondatabase data, such as flat files
Merge by order
Balance
Distinct operations in discard mode or where only a subset of fields are selected as distinct
Any operation that requires accessing data from records other than the one being processed
State and count field derivations
History node operations
Operations involving “@” (time-series) functions
Type-checking modes Warn and Abort
Model construction, application, and analysis
Note: Decision trees, rulesets, linear regression, and factor-generated models can generate
SQL and can therefore be pushed back to the database.
Data output to anywhere other than the same database that is processing the data
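The ordering advice above can be illustrated with a small sketch. It is not how IBM SPSS Modeler decides what to push back. It assumes that database execution stops at the first operation the database cannot perform. The operation names and their true/false flags are hypothetical values chosen for the illustration.

```python
from typing import List, Tuple

# Each entry is (operation name, can the database perform it?).
# The names and flags are hypothetical and only illustrate the ordering idea.
Stream = List[Tuple[str, bool]]


def pushback_prefix(stream: Stream) -> List[str]:
    """Return the leading operations that could run in the database.

    Assumption for this sketch: once an operation cannot be performed in
    the database, every later operation also runs outside the database.
    """
    prefix = []
    for name, in_database in stream:
        if not in_database:
            break
        prefix.append(name)
    return prefix


# Balance is placed early, so it cuts off the database-capable run.
poor_order = [("select", True), ("balance", False), ("aggregate", True)]
# Balance is placed after the database-capable operations.
good_order = [("select", True), ("aggregate", True), ("balance", False)]

print(pushback_prefix(poor_order))
print(pushback_prefix(good_order))
```

Under that assumption, moving Balance to the end lets both of the earlier operations stay in the database.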
Node Caches
To optimize stream running, you can set up a cache on any nonterminal node. When you set up a
cache on a node, the cache is filled with the data that passes through the node the next time you
run the data stream. From then on, the data is read from the cache (which is stored on disk in a
temporary directory) rather than from the data source.
Caching is most useful following a time-consuming operation such as a sort, merge, or
aggregation. For example, suppose that you have a source node set to read sales data from a
database and an Aggregate node that summarizes sales by location. You can set up a cache on the
Aggregate node rather than on the source node because you want the cache to store the aggregated
data rather than the entire data set.
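The behavior described above (fill the cache on the first run, then read from disk) can be sketched in plain Python. This is a conceptual illustration only, not the IBM SPSS Modeler implementation or its scripting API. The class name CachedStep, the file name node_cache.pkl, and the sample sales rows are all made up for the example.

```python
import os
import pickle
import tempfile
from collections import defaultdict


class CachedStep:
    """Wraps a computation and stores its result on disk after the first run."""

    def __init__(self, compute, cache_dir):
        self.compute = compute
        # Hypothetical cache file inside a temporary directory.
        self.path = os.path.join(cache_dir, "node_cache.pkl")
        self.runs = 0  # number of times the wrapped computation executed

    def run(self):
        # Later runs: read the stored result instead of recomputing.
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                return pickle.load(f)
        # First run: compute, then fill the cache.
        self.runs += 1
        result = self.compute()
        with open(self.path, "wb") as f:
            pickle.dump(result, f)
        return result


def read_sales():
    # Stand-in for a source node reading from a database (made-up rows).
    return [("north", 100), ("south", 250), ("north", 50)]


def aggregate_by_location():
    # Stand-in for an Aggregate node that summarizes sales by location.
    totals = defaultdict(int)
    for location, amount in read_sales():
        totals[location] += amount
    return dict(totals)


cache_dir = tempfile.mkdtemp()
aggregate_node = CachedStep(aggregate_by_location, cache_dir)

first = aggregate_node.run()   # computes and fills the cache
second = aggregate_node.run()  # served from the cache file
print(first, second, aggregate_node.runs)
```

The cache wraps the aggregation step rather than the source read, so the stored file holds only the summarized totals and the second run skips both the read and the aggregation.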
Note: Caching at source nodes, which simply stores a copy of the original data as it is read into
IBM® SPSS® Modeler, will not improve performance in most circumstances.
Nodes with caching enabled are displayed with a small document icon at the top right corner.
When the data is cached at the node, the document icon is green.