user manual
Chapter 6
Handling Missing Values
Overview of Missing Values
During the Data Preparation phase of data mining, you will often want to replace missing values
in the data. Missing values are values in the data set that are unknown, uncollected, or incorrectly
entered. Usually, such values are invalid for their fields. For example, the field Sex should contain
the values M and F. If you discover the values Y or Z in the field, you can safely assume that such
values are invalid and should therefore be interpreted as blanks. Likewise, a negative value for the
field Age is meaningless and should also be interpreted as a blank. Frequently, such obviously
wrong values are purposely entered, or fields left blank, during a questionnaire to indicate a
nonresponse. At times, you may want to examine these blanks more closely to determine whether
a nonresponse, such as the refusal to give one’s age, is a factor in predicting a specific outcome.
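The idea above can be sketched outside SPSS Modeler in a few lines of plain Python. This is only an illustration, not how the product works internally. The field names Sex and Age come from the example in the text; the sample records, the helper functions, and the `_missing` indicator fields are hypothetical. Invalid values are replaced with None (standing in for a blank), and an indicator field records the nonresponse so that it can later be examined as a possible predictor.

```python
from typing import Optional

# Valid categories for the Sex field, as in the example in the text.
VALID_SEX = {"M", "F"}


def clean_sex(value: Optional[str]) -> Optional[str]:
    """Return the value if it is a valid category, otherwise None (a blank)."""
    return value if value in VALID_SEX else None


def clean_age(value: Optional[float]) -> Optional[float]:
    """Treat a missing or negative age as a blank."""
    if value is None or value < 0:
        return None
    return value


# Hypothetical sample records; the values Y and -1 are invalid for their fields.
records = [
    {"Sex": "M", "Age": 34},
    {"Sex": "Y", "Age": 51},
    {"Sex": "F", "Age": -1},
]

cleaned = []
for rec in records:
    sex = clean_sex(rec["Sex"])
    age = clean_age(rec["Age"])
    cleaned.append({
        "Sex": sex,
        "Age": age,
        # Indicator fields keep the nonresponse available for later analysis.
        "Sex_missing": sex is None,
        "Age_missing": age is None,
    })
```

Keeping the indicator fields separate from the cleaned values means the blank itself is not lost when the invalid value is discarded.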
Some modeling techniques handle missing data better than others. For example, C5.0 and
Apriori cope well with values that are explicitly declared as “missing” in a Type node. Other
modeling techniques have trouble dealing with missing values and may experience longer training
times and produce less accurate models.
There are several types of missing values recognized by IBM® SPSS® Modeler:
Null or system-missing values. These are nonstring values that have been left blank in the
database or source file and have not been specifically defined as “missing” in a source or
Type node. System-missing values are displayed as $null$. Note that empty strings are not
considered nulls in SPSS Modeler, although they may be treated as nulls by certain databases.
Empty strings and white space. Empty string values and white space (strings with no visible
characters) are treated as distinct from null values. Empty strings are treated as equivalent to
white space for most purposes. For example, if you select the option to treat white space as
blanks in a source or Type node, this setting applies to empty strings as well.
Blank or user-defined missing values. These are values such as unknown, 99, or –1 that are
explicitly defined in a source node or Type node as missing. Optionally, you can also choose
to treat nulls and white space as blanks, which allows them to be flagged for special treatment
and to be excluded from most calculations. For example, you can use the @BLANK function to
treat these values, along with other types of missing values, as blanks.
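The three categories above can be illustrated with a small plain-Python sketch. It is not SPSS Modeler code and does not reproduce the @BLANK function. The function name `classify`, its two option parameters, and the `USER_BLANKS` set are hypothetical names chosen for illustration, and None stands in for $null$. The sketch assumes that empty strings are grouped with white space, and that nulls and white space count as blanks only when the corresponding option is switched on.

```python
import math

# Hypothetical user-defined missing values, taken from the examples in the text.
USER_BLANKS = {"unknown", 99, -1}


def classify(value, treat_null_as_blank=True, treat_whitespace_as_blank=True):
    """Return (kind, is_blank) for a single field value.

    kind is one of "null", "white space", "user-defined blank", or "valid".
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        kind = "null"
    elif isinstance(value, str) and value.strip() == "":
        # Empty strings are grouped with white space, not with nulls.
        kind = "white space"
    elif value in USER_BLANKS:
        kind = "user-defined blank"
    else:
        kind = "valid"

    is_blank = (
        kind == "user-defined blank"
        or (kind == "null" and treat_null_as_blank)
        or (kind == "white space" and treat_whitespace_as_blank)
    )
    return kind, is_blank


# Classify a few hypothetical values with the default options.
for sample in [None, "", "   ", "unknown", 99, "M"]:
    print(repr(sample), classify(sample))
```

With both options switched off, only the user-defined values are reported as blanks, while nulls and white space are still identified by kind.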
© Copyright IBM Corporation 1994, 2012.