User's Manual
Handling Missing Values
In general terms, there are two approaches you can follow:
You can exclude fields or records with missing values.
You can impute, replace, or coerce missing values using a variety of methods.
Both of these approaches can be largely automated using the Data Audit node. For example, you
can generate a Filter node that excludes fields with too many missing values to be useful in
modeling, and generate a Supernode that imputes missing values for any or all of the fields that
remain. This is where the real power of the audit comes in, allowing you not only to assess the
current state of your data, but to take action based on the assessment.
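
The filter-then-impute idea described above can be sketched outside the product in plain Python. This is only an illustration, not what the generated Filter node or Supernode does internally: the sample records, the 0.5 cutoff, and the choice of mean imputation are all assumptions made for the example.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "fax": None},
    {"age": None, "income": 61000, "fax": None},
    {"age": 45, "income": None, "fax": None},
    {"age": 29, "income": 48000, "fax": "555-0100"},
]

MAX_MISSING_FRACTION = 0.5  # assumed cutoff, not a product default


def missing_fraction(rows, field):
    """Return the share of rows in which `field` is missing."""
    return sum(row[field] is None for row in rows) / len(rows)


def filter_and_impute(rows, max_missing):
    """Drop sparse fields, then mean-impute the numeric fields that remain."""
    fields = list(rows[0])
    kept = [f for f in fields if missing_fraction(rows, f) <= max_missing]

    # Compute a mean only for fields whose present values are all numeric.
    means = {}
    for f in kept:
        present = [row[f] for row in rows if row[f] is not None]
        if present and all(isinstance(v, (int, float)) for v in present):
            means[f] = sum(present) / len(present)

    # Missing values in non-numeric fields are left as None.
    return [
        {f: (means.get(f, row[f]) if row[f] is None else row[f]) for f in kept}
        for row in rows
    ]


cleaned = filter_and_impute(records, MAX_MISSING_FRACTION)
```

Here the fax field is missing in three of four records, so it is excluded, while the single gaps in age and income are filled with the mean of the observed values.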
Handling Records with Missing Values
If the majority of missing values is concentrated in a small number of records, you can just
exclude those records. For example, a bank usually keeps detailed and complete records on
its loan customers. If, however, the bank is less restrictive in approving loans for its own staff
members, data gathered for staff loans is likely to have several blank fields. In such a case, there
are two options for handling these missing values:
You can use a Select node to remove the staff records.
If the data set is large, you can discard all records with blanks.
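
The two options above can be illustrated with a short Python sketch. It is not the Select node itself; the loan records, the is_staff field, and the is_blank helper are hypothetical names introduced for the example.

```python
# Hypothetical loan records; None marks a blank field.
loans = [
    {"customer": "A", "is_staff": False, "income": 40000, "term": 36},
    {"customer": "B", "is_staff": True, "income": None, "term": None},
    {"customer": "C", "is_staff": False, "income": 55000, "term": 60},
    {"customer": "D", "is_staff": False, "income": None, "term": 24},
]


def is_blank(value):
    """Treat None and empty or whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and not value.strip())


# Option 1: remove the staff records, whatever their contents.
without_staff = [row for row in loans if not row["is_staff"]]

# Option 2: discard every record that contains at least one blank.
complete_only = [
    row for row in loans if not any(is_blank(v) for v in row.values())
]
```

Note that the first option keeps customer D, whose income is blank, whereas the second removes it. Which behavior is preferable depends on how much data you can afford to lose.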
Handling Fields with Missing Values
If the majority of missing values is concentrated in a small number of fields, you can address them
at the field level rather than at the record level. This approach also allows you to experiment with
the relative importance of particular fields before deciding on an approach for handling missing
values. If a field is unimportant in modeling, it probably is not worth keeping, regardless of how
many missing values it has.
For example, a market research company may collect data from a general questionnaire
containing 50 questions. Two of the questions address age and political persuasion, information
that many people are reluctant to give. In this case, Age and Political_persuasion have many
missing values.
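
To see which fields carry most of the missing values, you can count blanks per field. The following Python sketch assumes a tiny, made-up set of questionnaire responses; Q1 stands in for the other questions and is a hypothetical field name.

```python
# Hypothetical questionnaire responses; None marks an unanswered question.
responses = [
    {"Q1": "yes", "Age": None, "Political_persuasion": None},
    {"Q1": "no", "Age": 41, "Political_persuasion": None},
    {"Q1": "yes", "Age": None, "Political_persuasion": "independent"},
    {"Q1": None, "Age": 27, "Political_persuasion": None},
]


def missing_counts(rows):
    """Return a dict mapping each field to its number of missing values."""
    counts = {field: 0 for field in rows[0]}
    for row in rows:
        for field, value in row.items():
            if value is None:
                counts[field] += 1
    return counts


counts = missing_counts(responses)

# Fields ordered from most to fewest missing values.
ranked = sorted(counts, key=counts.get, reverse=True)
```

A ranking like this only shows where the gaps are. Whether a sparse field is worth keeping still depends on its importance in modeling.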
Field Measurement Level
In determining which method to use, you should also consider the measurement level of fields
with missing values.
Numeric fields. For numeric field types, such as Continuous, you should always eliminate any
non-numeric values before building a model, because many models will not function if blanks are
included in numeric fields.
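
As a minimal sketch of eliminating non-numeric values, the following Python function converts anything that cannot be read as a finite number to None so that it can then be excluded or imputed. The function name to_number and the sample values are assumptions for illustration only.

```python
import math


def to_number(value):
    """Return value as a float, or None if it is not a finite number."""
    if value is None or isinstance(value, bool):
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Blank strings and text such as "n/a" cannot be parsed.
        return None
    # Reject NaN and infinities, which float() accepts as input.
    return number if math.isfinite(number) else None


raw_ages = [34, "41", "", "n/a", None, "27.5"]
clean_ages = [to_number(v) for v in raw_ages]
```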
Categorical fields. For categorical fields, such as Nominal and Flag, altering missing values is not
necessary but will increase the accuracy of the model. For example, a model that uses the field Sex
will still function with meaningless values, such as Y and Z, but removing all values other than M
and F will increase the accuracy of the model.
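
The Sex example can be sketched in Python as follows. The set of valid codes comes from the example above; trimming whitespace and upper-casing before the comparison are extra assumptions made for this sketch, and clean_category is a hypothetical helper name.

```python
VALID_SEX = {"M", "F"}


def clean_category(value, valid):
    """Return the normalized value if it is a valid code, otherwise None."""
    if isinstance(value, str):
        candidate = value.strip().upper()  # assumed normalization step
        if candidate in valid:
            return candidate
    return None


raw_sex = ["M", "F", "Y", "Z", " f ", None]
clean_sex = [clean_category(v, VALID_SEX) for v in raw_sex]
```

Values mapped to None can then be handled like any other missing value, for example by excluding the record or imputing a category.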