user manual
Understanding Data Mining
Classification nodes
The Auto Classifier node creates and compares a number of different models for
binary outcomes (yes or no, churn or do not churn, and so on), allowing you to
choose the best approach for a given analysis. A number of modeling algorithms are
supported, making it possible to select the methods you want to use, the specific
options for each, and the criteria for comparing the results. The node generates a set
of models based on the specified options and ranks the best candidates according to
the criteria you specify.
The Auto Numeric node estimates and compares models for continuous numeric
range outcomes using a number of different methods. The node works in the same
manner as the Auto Classifier node, allowing you to choose the algorithms to use
and to experiment with multiple combinations of options in a single modeling pass.
Supported algorithms include neural networks, C&R Tree, CHAID, linear regression,
generalized linear regression, and support vector machines (SVM). Models can be
compared based on correlation, relative error, or number of variables used.
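The comparison criteria named above can be illustrated with a minimal sketch in plain Python. It is not the node's implementation. The model names and predictions are made up for illustration, and the definition of relative error used here (squared error of the model divided by squared error of always predicting the mean) is an assumption, not a definition stated in this manual.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def relative_error(observed, predicted):
    """Assumed definition: model squared error relative to a mean-only baseline."""
    m = mean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - m) ** 2 for o in observed)
    return sse / sst

# Hypothetical observed values and predictions from two candidate models.
observed = [10.0, 12.0, 15.0, 19.0, 24.0]
candidates = {
    "model_a": [11.0, 12.5, 14.0, 18.0, 25.0],
    "model_b": [14.0, 13.0, 15.0, 17.0, 20.0],
}

# Rank candidates from best to worst by relative error (lower is better).
ranking = sorted(candidates, key=lambda name: relative_error(observed, candidates[name]))

for name in ranking:
    preds = candidates[name]
    print(name, round(pearson(observed, preds), 3), round(relative_error(observed, preds), 3))
```

Ranking by correlation instead would sort in descending order, since higher correlation indicates a closer fit.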
The Classification and Regression (C&R) Tree node generates a decision tree that
allows you to predict or classify future observations. The method uses recursive
partitioning to split the training records into segments by minimizing the impurity
at each step, where a node in the tree is considered “pure” if 100% of cases in the
node fall into a specific category of the target field. Target and input fields can
be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary
(only two subgroups).
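To make the idea of minimizing impurity concrete, the following sketch finds the best binary split on one numeric input. It assumes the Gini index as the impurity measure, which is one common choice and is not stated in this description. The function names and the sample data are hypothetical, and the sketch performs a single split, not the full recursive procedure.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the segment is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`."""
    n = len(values)
    best = (None, gini(labels))  # start from the unsplit segment
    for t in sorted(set(values))[:-1]:  # the largest value cannot split anything off
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical training records: customer age and a churn flag.
ages = [22, 25, 31, 40, 47, 52]
churn = ["yes", "yes", "yes", "no", "no", "no"]

threshold, impurity = best_binary_split(ages, churn)
print(threshold, impurity)  # both resulting segments are pure for this data
```

A full tree would apply the same search again to each resulting segment until a stopping rule is met.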
The QUEST node provides a binary classification method for building decision trees,
designed to reduce the processing time required for large C&R Tree analyses while
also reducing the tendency found in classification tree methods to favor inputs that
allow more splits. Input fields can be numeric ranges (continuous), but the target field
must be categorical. All splits are binary.
The CHAID node generates decision trees using chi-square statistics to identify
optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate
nonbinary trees, meaning that some splits have more than two branches. Target and
input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is
a modification of CHAID that does a more thorough job of examining all possible
splits but takes longer to compute.
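The chi-square statistic behind this kind of split selection can be sketched as follows. This is only the Pearson chi-square computation for a table of input categories against target classes. The category merging and significance testing that a CHAID implementation performs are not reproduced, and the field names and counts are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows are categories of a candidate input; columns are target counts (yes, no).
by_region = [[30, 10], [20, 20], [10, 30]]  # three categories: a three-way split
by_gender = [[31, 29], [29, 31]]            # two categories: a two-way split

print(chi_square(by_region), chi_square(by_gender))
```

In this made-up data the three-category input shows a much stronger association with the target, so it would be the more promising split and would produce three branches.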
The C5.0 node builds either a decision tree or a rule set. The model works by splitting
the sample based on the field that provides the maximum information gain at each
level. The target field must be categorical. Multiple splits into more than two
subgroups are allowed.
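Information gain, as described above, can be sketched with entropy computed over a categorical target. This is plain information gain only; it is not the C5.0 algorithm, and any refinements the node applies are not shown. The field names and records are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction from splitting into one subgroup per feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical records: two candidate input fields and a categorical target.
candidates = {
    "plan":   ["basic", "basic", "plus", "plus", "pro", "pro"],
    "region": ["north", "south", "north", "south", "north", "south"],
}
churn = ["yes", "yes", "no", "no", "no", "yes"]

# Choose the field with the maximum information gain for this level.
best = max(candidates, key=lambda name: information_gain(candidates[name], churn))
print(best)
```

Because the chosen field has three values, the split produces three subgroups, which is the multiway behavior noted above.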
The Decision List node identifies subgroups, or segments, that show a higher or lower
likelihood of a given binary outcome relative to the overall population. For example,
you might look for customers who are unlikely to churn or are most likely to respond
favorably to a campaign. You can incorporate your business knowledge into the
model by adding your own custom segments and previewing alternative models side
by side to compare the results. Decision List models consist of a list of rules in which
each rule has a condition and an outcome. Rules are applied in order, and the first rule
that matches determines the outcome.
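The first-match behavior of a rule list can be sketched in a few lines. The rule conditions, field names, and the default outcome for records that match no rule are all hypothetical and chosen only for illustration.

```python
def apply_decision_list(rules, record, default):
    """Return the outcome of the first rule whose condition matches the record."""
    for condition, outcome in rules:
        if condition(record):
            return outcome
    return default  # no rule matched

# Each rule pairs a condition with an outcome; order matters.
rules = [
    (lambda r: r["tenure_months"] >= 36, "unlikely to churn"),
    (lambda r: r["complaints"] >= 3, "likely to churn"),
    (lambda r: r["monthly_spend"] < 20, "likely to churn"),
]

# This record satisfies the first and second rules; only the first one counts.
loyal_but_unhappy = {"tenure_months": 48, "complaints": 5, "monthly_spend": 60}
new_customer = {"tenure_months": 2, "complaints": 0, "monthly_spend": 45}

print(apply_decision_list(rules, loyal_but_unhappy, "remainder"))
print(apply_decision_list(rules, new_customer, "remainder"))
```

Reordering the rules can change the outcome for records that match more than one condition, which is why the order of segments is part of the model.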
Linear regression models predict a continuous target based on linear relationships
between the target and one or more predictors.
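For the simplest case of a single predictor, a linear relationship can be estimated by ordinary least squares. The sketch below assumes that estimation method and uses made-up data; it does not show how the node itself fits or reports a model.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor and a continuous target."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted target for a new predictor value
print(slope, intercept, prediction)
```

With more than one predictor the same idea extends to one coefficient per predictor plus an intercept.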