User's Manual

Handling Missing Values
In general terms, there are two approaches you can follow:
You can exclude fields or records with missing values.
You can impute, replace, or coerce missing values using a variety of methods.
Both of these approaches can be largely automated using the Data Audit node. For example, you
can generate a Filter node that excludes fields with too many missing values to be useful in
modeling, and generate a Supernode that imputes missing values for any or all of the fields that
remain. This is where the real power of the audit comes in, allowing you not only to assess the
current state of your data, but also to take action based on the assessment.
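The audit-then-act workflow described above can be illustrated outside the product with a minimal Python sketch. This is not the Data Audit node or its generated Filter node and Supernode; the function names `audit` and `filter_and_impute`, the `max_missing` threshold, the sample data, the use of `None` for missing values, and the choice of mean imputation are all assumptions made for illustration, and every field is assumed to be numeric.

```python
from statistics import mean

MISSING = None  # assumption: missing values are represented as None


def audit(records):
    """Return the fraction of missing values for each field."""
    fields = records[0].keys()
    return {f: sum(r[f] is MISSING for r in records) / len(records) for f in fields}


def filter_and_impute(records, max_missing=0.5):
    """Drop fields that are missing too often, then fill remaining gaps with the field mean.

    Assumes numeric fields and max_missing < 1, so every kept field
    has at least one observed value to average.
    """
    report = audit(records)
    kept = [f for f, fraction in report.items() if fraction <= max_missing]
    means = {f: mean(r[f] for r in records if r[f] is not MISSING) for f in kept}
    return [{f: (means[f] if r[f] is MISSING else r[f]) for f in kept} for r in records]


# Hypothetical sample data: "bonus" is missing in two of three records.
data = [
    {"income": 40.0, "balance": 10.0, "bonus": None},
    {"income": None, "balance": 30.0, "bonus": None},
    {"income": 60.0, "balance": 20.0, "bonus": 5.0},
]

cleaned = filter_and_impute(data, max_missing=0.5)
print(cleaned)
```

In this sketch, `bonus` is excluded because two thirds of its values are missing, and the single gap in `income` is filled with the mean of the observed values.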
Handling Records with Missing Values
If the majority of missing values is concentrated in a small number of records, you can just
exclude those records. For example, a bank usually keeps detailed and complete records on
its loan customers. If, however, the bank is less restrictive in approving loans for its own staff
members, data gathered for staff loans is likely to have several blank fields. In such a case, there
are two options for handling these missing values:
You can use a Select node to remove the staff records.
If the data set is large, you can discard all records with blanks.
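As a rough illustration of these two options outside the product, the following Python sketch filters a small, made-up list of loan records. The field names (`customer`, `is_staff`, `income`, `employer`), the sample values, and the helper `has_blank` are hypothetical, and a blank is assumed to be either `None` or an empty string.

```python
def has_blank(record):
    """Return True if any field is None or an empty/whitespace-only string."""
    return any(
        value is None or (isinstance(value, str) and value.strip() == "")
        for value in record.values()
    )


# Hypothetical loan data; the staff record (B) has several blanks.
loans = [
    {"customer": "A", "is_staff": False, "income": 52000, "employer": "Acme"},
    {"customer": "B", "is_staff": True, "income": None, "employer": ""},
    {"customer": "C", "is_staff": False, "income": 61000, "employer": None},
]

# Option 1: remove the staff records (comparable in spirit to a Select node).
non_staff = [r for r in loans if not r["is_staff"]]

# Option 2: discard every record that contains a blank.
complete = [r for r in loans if not has_blank(r)]

print([r["customer"] for r in non_staff])
print([r["customer"] for r in complete])
```

The first option keeps record C even though it still has a blank, while the second option keeps only fully populated records, which is why it is better suited to large data sets.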
Handling Fields with Missing Values
If the majority of missing values is concentrated in a small number of fields, you can address them
at the field level rather than at the record level. This approach also allows you to experiment with
the relative importance of particular fields before deciding on an approach for handling missing
values. If a field is unimportant in modeling, it probably is not worth keeping, regardless of how
many missing values it has.
For example, a market research company may collect data from a general questionnaire
containing 50 questions. Two of the questions address age and political persuasion, information
that many people are reluctant to give. In this case, Age and Political_persuasion have many
missing values.
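To see why field-level handling can be preferable in a case like this, the sketch below counts how many records remain complete with and without the two sparse fields. The survey rows, the question fields `Q1` and `Q2`, and the helper `complete_records` are made up for illustration; only the names Age and Political_persuasion come from the example above, and missing answers are assumed to be stored as `None`.

```python
def complete_records(records, fields):
    """Return the records that have a value for every field in `fields`."""
    return [r for r in records if all(r[f] is not None for f in fields)]


# Hypothetical survey responses (a real questionnaire would have 50 questions).
survey = [
    {"Q1": 3, "Q2": 5, "Age": None, "Political_persuasion": None},
    {"Q1": 4, "Q2": 2, "Age": 41, "Political_persuasion": None},
    {"Q1": 1, "Q2": 4, "Age": None, "Political_persuasion": "Centrist"},
    {"Q1": 2, "Q2": None, "Age": 29, "Political_persuasion": "Left"},
]

all_fields = list(survey[0])
sparse = ("Age", "Political_persuasion")
without_sparse = [f for f in all_fields if f not in sparse]

# Record-level handling: discard any record with a blank in any field.
n_all = len(complete_records(survey, all_fields))

# Field-level handling: set the two sparse fields aside first.
n_without = len(complete_records(survey, without_sparse))

print(n_all, n_without)
```

With this sample, discarding incomplete records across all fields leaves nothing to model, whereas setting the two sparse fields aside preserves three of the four records.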
Field Measurement Level
In determining which method to use, you should also consider the measurement level of fields
with missing values.
Numeric fields. For numeric field types, such as Continuous, you should always eliminate any
non-numeric values before building a model, because many models will not function if blanks are
included in numeric fields.
Categorical fields. For categorical fields, such as Nominal and Flag, altering missing values is not
necessary but can increase the accuracy of the model. For example, a model that uses the field Sex
will still function with meaningless values, such as Y and Z, but removing all values other than M
and F can increase the accuracy of the model.
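The distinction between the two measurement levels can be sketched in Python as follows. This is an illustration only, not how the product treats these fields internally: the helpers `to_number` and `clean_category`, the `Income` field, and the sample rows are hypothetical, and invalid values are assumed to be replaced with `None` so that they can be excluded or imputed later.

```python
import math


def to_number(value):
    """Return `value` as a finite float, or None if it is not numeric."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # Treat NaN and infinities as missing as well.
    return number if math.isfinite(number) else None


def clean_category(value, valid):
    """Return `value` if it is one of the valid categories, else None."""
    return value if value in valid else None


# Hypothetical rows mixing usable values with blanks and meaningless codes.
rows = [
    {"Income": "52000", "Sex": "M"},
    {"Income": "", "Sex": "Y"},
    {"Income": "n/a", "Sex": "F"},
    {"Income": 61000, "Sex": "Z"},
]

cleaned_rows = [
    {
        "Income": to_number(r["Income"]),
        "Sex": clean_category(r["Sex"], {"M", "F"}),
    }
    for r in rows
]

print(cleaned_rows)
```

For the numeric field, cleaning is a precondition for modeling; for the categorical field, it is optional and only removes noise such as Y and Z.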