Understanding Data Mining

Classification nodes

The Auto Classifier node creates and compares a number of different models for
binary outcomes (yes or no, churn or do not churn, and so on), allowing you to
choose the best approach for a given analysis. A number of modeling algorithms are
supported, making it possible to select the methods you want to use, the specific
options for each, and the criteria for comparing the results. The node generates a set
of models based on the specified options and ranks the best candidates according to
the criteria you specify.
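The idea of generating several candidate models and ranking them by a chosen criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the node's implementation: the names `accuracy`, `rank_models`, the toy churn records, and the three rule-like "models" are all hypothetical, and accuracy stands in for whatever ranking criterion you select.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Model = Callable[[Record], bool]  # hypothetical: a model maps a record to yes/no


def accuracy(model: Model, records: List[Record], labels: List[bool]) -> float:
    """Fraction of records whose predicted outcome matches the known outcome."""
    hits = sum(1 for r, y in zip(records, labels) if model(r) == y)
    return hits / len(labels)


def rank_models(candidates: Dict[str, Model],
                records: List[Record],
                labels: List[bool],
                criterion=accuracy,
                keep: int = 3) -> List[Tuple[str, float]]:
    """Score every candidate with the criterion and return the best ones first."""
    scored = [(name, criterion(m, records, labels)) for name, m in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


# Toy data (invented for illustration): did the customer churn?
records = [
    {"tenure": 2, "complaints": 3},
    {"tenure": 30, "complaints": 0},
    {"tenure": 5, "complaints": 1},
    {"tenure": 48, "complaints": 2},
    {"tenure": 1, "complaints": 0},
    {"tenure": 24, "complaints": 4},
]
labels = [True, False, True, False, True, False]

# Three deliberately simple candidate "models".
candidates = {
    "short_tenure": lambda r: r["tenure"] < 12,
    "many_complaints": lambda r: r["complaints"] >= 2,
    "always_churn": lambda r: True,
}

ranking = rank_models(candidates, records, labels)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In practice the criterion would be evaluated on data held out from training; the sketch scores on a single toy sample only to keep the example short.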

The Auto Numeric node estimates and compares models for continuous numeric
range outcomes using a number of different methods. The node works in the same
manner as the Auto Classifier node, allowing you to choose the algorithms to use
and to experiment with multiple combinations of options in a single modeling pass.
Supported algorithms include neural networks, C&R Tree, CHAID, linear regression,
generalized linear regression, and support vector machines (SVM). Models can be
compared based on correlation, relative error, or number of variables used.
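Two of the comparison criteria named above, correlation and relative error, can be computed from a model's predictions as follows. This is a sketch under assumptions: correlation is taken to be the Pearson correlation between observed and predicted values, and relative error is taken to be the sum of squared prediction errors divided by the sum of squared deviations of the observed values from their mean. The function names and the toy predictions are hypothetical.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def relative_error(observed: List[float], predicted: List[float]) -> float:
    """Squared error of the model relative to always predicting the mean.

    Values near 0 indicate a good fit; values near 1 indicate the model
    does little better than the mean (assumed definition, see lead-in).
    """
    mean = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean) ** 2 for o in observed)
    return sse / sst


observed = [10.0, 20.0, 30.0, 40.0]
predictions: Dict[str, List[float]] = {   # invented outputs of two candidate models
    "model_a": [12.0, 18.0, 33.0, 39.0],
    "model_b": [25.0, 25.0, 25.0, 26.0],
}

# Rank candidates by relative error, smallest first.
ranked = sorted(predictions, key=lambda m: relative_error(observed, predictions[m]))
for name in ranked:
    print(name,
          round(pearson(observed, predictions[name]), 3),
          round(relative_error(observed, predictions[name]), 3))
```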

The Classification and Regression (C&R) Tree node generates a decision tree that
allows you to predict or classify future observations. The method uses recursive
partitioning to split the training records into segments by minimizing the impurity
at each step, where a node in the tree is considered “pure” if 100% of cases in the
node fall into a specific category of the target field. Target and input fields can
be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary
(only two subgroups).
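To make "minimizing the impurity" concrete, the following sketch finds the single best binary split of one numeric input for a categorical target. It assumes the Gini index as the impurity measure, which is one common choice; the function names and the toy data are hypothetical, and a full tree would apply the same search recursively to each resulting segment.

```python
from collections import Counter
from typing import List, Optional, Tuple


def gini(labels: List[str]) -> float:
    """Gini impurity: 0.0 when every case belongs to one category (a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values: List[float],
                      labels: List[str]) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'.

    The threshold is None when no split improves on the unsplit node.
    """
    n = len(labels)
    best: Tuple[Optional[float], float] = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy data (invented): customer age and whether the customer churned.
ages = [22, 25, 31, 40, 47, 52]
churn = ["yes", "yes", "yes", "no", "no", "no"]

threshold, impurity = best_binary_split(ages, churn)
print(threshold, impurity)
```

With this data the split at 35.5 separates the two categories completely, so both child nodes are pure and the weighted impurity is zero.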

The QUEST node provides a binary classification method for building decision trees,
designed to reduce the processing time required for large C&R Tree analyses while
also reducing the tendency found in classification tree methods to favor inputs that
allow more splits. Input fields can be numeric ranges (continuous), but the target field
must be categorical. All splits are binary.

The CHAID node generates decision trees using chi-square statistics to identify
optimal splits. Unlike the C&R Tree and QUEST nodes, CHAID can generate
nonbinary trees, meaning that some splits have more than two branches. Target and
input fields can be numeric range (continuous) or categorical. Exhaustive CHAID is
a modification of CHAID that does a more thorough job of examining all possible
splits but takes longer to compute.
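The chi-square statistic that drives split selection can be illustrated with a small contingency table of input categories against target categories. The sketch below computes the Pearson chi-square statistic only; it does not reproduce CHAID's category merging or significance adjustments. The function name and the counts are hypothetical, and the code assumes every row and column total is nonzero.

```python
from typing import List


def chi_square(table: List[List[int]]) -> float:
    """Pearson chi-square statistic for a table of observed counts.

    Rows are categories of the input field, columns are categories of the target.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if input and target were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts: three plan types (rows) by churn yes/no (columns).
by_plan = [
    [30, 10],
    [20, 20],
    [10, 30],
]
# Invented counts for an input that is unrelated to the target.
by_region = [
    [10, 10],
    [20, 20],
]

print(chi_square(by_plan))    # larger statistic: stronger association with the target
print(chi_square(by_region))  # zero: no association
```

A larger statistic (for the same degrees of freedom) indicates a stronger association, so the plan type would be the more promising field to split on here, and the split could have three branches, one per plan.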

The C5.0 node builds either a decision tree or a rule set. The model works by splitting
the sample based on the field that provides the maximum information gain at each
level. The target field must be categorical. Multiple splits into more than two
subgroups are allowed.
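Information gain can be sketched as the reduction in entropy of the target after the sample is split on a field. The code below is a minimal illustration of that calculation for categorical fields, not the C5.0 algorithm itself; the function names, field names, and data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the target categories."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(field_values: List[str], labels: List[str]) -> float:
    """Entropy before the split minus the weighted entropy of the subgroups."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for value, label in zip(field_values, labels):
        groups[value].append(label)  # one subgroup per field category
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy sample (invented): two candidate fields and a categorical target.
target = ["yes", "yes", "no", "no"]
fields = {
    "region": ["north", "north", "south", "south"],
    "gender": ["m", "f", "m", "f"],
}

gains = {name: information_gain(values, target) for name, values in fields.items()}
best_field = max(gains, key=gains.get)
print(gains, best_field)
```

Here splitting on `region` removes all uncertainty about the target (gain of 1 bit), while splitting on `gender` removes none, so `region` would be chosen at this level.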

The Decision List node identifies subgroups, or segments, that show a higher or lower
likelihood of a given binary outcome relative to the overall population. For example,
you might look for customers who are unlikely to churn or are most likely to respond
favorably to a campaign. You can incorporate your business knowledge into the
model by adding your own custom segments and previewing alternative models side
by side to compare the results. Decision List models consist of a list of rules in which
each rule has a condition and an outcome. Rules are applied in order, and the first rule
that matches determines the outcome.
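The "first matching rule wins" behavior can be sketched as an ordered list of (condition, outcome) pairs. The rules, field names, and the default outcome for records that match no rule are hypothetical choices made for this illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, int]
Rule = Tuple[Callable[[Record], bool], str]  # (condition, outcome)


def apply_decision_list(rules: List[Rule], record: Record,
                        default: str = "no_prediction") -> str:
    """Return the outcome of the first rule whose condition matches the record."""
    for condition, outcome in rules:
        if condition(record):
            return outcome  # later rules are not consulted
    return default  # assumption: unmatched records get a default outcome


# Invented rules, listed in priority order.
rules: List[Rule] = [
    (lambda r: r["complaints"] >= 3, "churn"),
    (lambda r: r["tenure_months"] >= 24, "stay"),
    (lambda r: r["tenure_months"] < 6, "churn"),
]

# Matches rules 1 and 2; rule 1 comes first, so the outcome is "churn".
print(apply_decision_list(rules, {"complaints": 4, "tenure_months": 36}))
# Matches only rule 2.
print(apply_decision_list(rules, {"complaints": 0, "tenure_months": 36}))
# Matches no rule, so the default applies.
print(apply_decision_list(rules, {"complaints": 0, "tenure_months": 12}))
```

Because order matters, moving a rule up or down the list can change the outcome for records that satisfy more than one condition.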

Linear regression models predict a continuous target based on linear relationships
between the target and one or more predictors.
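For the simplest case of a single predictor, the linear relationship can be estimated with ordinary least squares in closed form. The sketch below assumes exactly one predictor and uses hypothetical names and data; models with several predictors solve the same least-squares problem in matrix form.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept: float, slope: float, x: float) -> float:
    """Predicted target value for a new predictor value."""
    return intercept + slope * x


# Invented data that lies exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(xs, ys)
print(intercept, slope, predict(intercept, slope, 5.0))
```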