User's Manual
Understanding Data Mining
together. It may not even be online. If it exists only on paper, data entry will be required before
you can begin data mining.
Check whether the data covers the relevant attributes
The object of data mining is to identify relevant attributes, so including this check may seem odd
at first. It is very useful, however, to look at what data is available and to try to identify the likely
relevant factors that are not recorded. In trying to predict ice cream sales, for example, you may
have a lot of information about retail outlets or sales history, but you may not have weather
and temperature information, which is likely to play a significant role. Missing attributes do
not necessarily mean that data mining will not produce useful results, but they can limit the
accuracy of resulting predictions.
A quick way of assessing the situation is to perform a comprehensive audit of your data.
Before moving on, consider attaching a Data Audit node to your data source and running it to
generate a full report.
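For readers who want to see the kind of summary such an audit produces, the following is a minimal Python sketch, not the Data Audit node itself. It assumes that records are held as a list of dictionaries and that missing values appear as None or as empty strings; the function name audit_fields and the sample records are hypothetical.

```python
def audit_fields(records):
    """Summarize each field: record count, missing values, distinct values, numeric range."""
    # Collect field names in the order they first appear.
    fields = []
    for rec in records:
        for name in rec:
            if name not in fields:
                fields.append(name)

    report = {}
    for name in fields:
        values = [rec.get(name) for rec in records]
        # Assumption: None and "" both mean "not recorded".
        present = [v for v in values if v is not None and v != ""]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        report[name] = {
            "records": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return report


# Hypothetical ice cream sales records; temperature is missing for one of them.
sales = [
    {"outlet": "A", "units_sold": 120, "temperature": 24.5},
    {"outlet": "B", "units_sold": 95, "temperature": None},
    {"outlet": "A", "units_sold": 140, "temperature": 29.0},
]

for field, summary in audit_fields(sales).items():
    print(field, summary)
```

A field with many missing values, or a likely factor that does not appear as a field at all, is a signal that predictions may be less accurate.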
Beware of noisy data
Data often contains errors or may contain subjective, and therefore variable, judgments. These
phenomena are collectively referred to as noise. Sometimes noise in data is normal. There may
well be underlying rules, but they may not hold for 100% of the cases.
Typically, the more noise there is in data, the more difficult it is to get accurate results.
However, SPSS Modeler’s machine-learning methods are able to handle noisy data and have been
used successfully on data sets containing almost 50% noise.
Ensure that there is sufficient data
In data mining, it is not necessarily the size of a data set that is important. The representativeness
of the data set is far more significant, together with its coverage of possible outcomes and
combinations of variables.
Typically, the more attributes that are considered, the more records that will be needed to
give representative coverage.
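One way to see why more attributes call for more records is to count how many combinations of attribute values are possible and how many actually appear in the data. The sketch below is illustrative only: the function name combination_coverage and the sample attributes are assumptions, and the approach applies to categorical attributes whose possible values are known in advance.

```python
from math import prod


def combination_coverage(records, domains):
    """Return (observed, possible) counts of attribute-value combinations.

    domains maps each attribute name to the set of values it can take.
    """
    names = sorted(domains)
    possible = prod(len(domains[n]) for n in names)
    observed = {tuple(rec[n] for n in names) for rec in records}
    return len(observed), possible


# Hypothetical attributes and records.
domains = {
    "weather": {"sunny", "cloudy", "rainy"},
    "day_type": {"weekday", "weekend"},
}
records = [
    {"weather": "sunny", "day_type": "weekend"},
    {"weather": "sunny", "day_type": "weekday"},
    {"weather": "rainy", "day_type": "weekday"},
    {"weather": "sunny", "day_type": "weekend"},
]

observed, possible = combination_coverage(records, domains)
print(f"{observed} of {possible} combinations observed")

# A third attribute with four values multiplies the possible combinations by four.
domains_more = dict(domains, season={"spring", "summer", "autumn", "winter"})
possible_more = prod(len(values) for values in domains_more.values())
print(f"{possible_more} combinations with three attributes")
```

Because the number of possible combinations is the product of the number of values for each attribute, every added attribute increases the number of records needed to see each combination at least once.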
If the data is representative and there are general underlying rules, it may well be that a data
sample of a few thousand (or even a few hundred) records will give results as good as those from a
million, and you will get the results more quickly.
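The effect of working from a small representative sample can be illustrated with a simple random sample. This is a sketch under stated assumptions: the population is synthetic, with exactly 30% of records flagged as buyers, and the names population, sample, and buyer_rate are hypothetical. With real data, a sample is representative only if the sampling itself is unbiased.

```python
import random


def buyer_rate(records):
    """Fraction of records flagged as buyers (True)."""
    return sum(records) / len(records)


# Synthetic population of 1,000,000 records: exactly 3 in every 10 are buyers.
population = [(i % 10) < 3 for i in range(1_000_000)]

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = rng.sample(population, 2_000)  # simple random sample without replacement

full_rate = buyer_rate(population)
sample_rate = buyer_rate(sample)
print(f"full data: {full_rate:.3f}  sample of 2,000: {sample_rate:.3f}")
```

The sample estimate will usually land within a few percentage points of the full-data figure while touching only a small fraction of the records.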
Seek out the experts on the data
In many cases, you will be working on your own data and will therefore be highly familiar with
its content and meaning. However, if you are working on data for another department of your
organization or for a client, it is highly desirable that you have access to experts who know the
data. They can guide you in the identification of relevant attributes and can help to interpret the
results of data mining, distinguishing the true nuggets of information from “fool’s gold,” or
artifacts caused by anomalies in the data sets.