User Guide

Chapter 33
K-Means Cluster Analysis
This procedure attempts to identify relatively homogeneous groups of cases based
on selected characteristics, using an algorithm that can handle large numbers of
cases. However, the algorithm requires you to specify the number of clusters. You
can specify initial cluster centers if you know this information. You can select one
of two methods for classifying cases, either updating cluster centers iteratively or
classifying only. You can save cluster membership, distance information, and final
cluster centers. Optionally, you can specify a variable whose values are used to label
casewise output. You can also request analysis of variance F statistics. While these
statistics are opportunistic (the procedure tries to form groups that do differ), the
relative size of the statistics provides information about each variable’s contribution
to the separation of the groups.
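The two classification methods described above can be illustrated with a minimal sketch in Python. It is not the procedure's actual implementation: the function names (nearest, k_means), the method values "iterate" and "classify", and the max_iter default are hypothetical, and plain Euclidean distance with simple mean updates is assumed. Options such as running means and the default choice of initial centers are not modeled.

```python
import math


def nearest(point, centers):
    """Return (index, distance) of the center closest to point (Euclidean)."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def k_means(cases, centers, method="iterate", max_iter=10):
    """Assign each case to a cluster, given user-supplied initial centers.

    method="iterate":  repeatedly reassign cases and recompute each center as
                       the mean of its members, until the centers stop moving
                       or max_iter passes have been made.
    method="classify": keep the supplied centers fixed and only assign cases.
    Returns (membership, distances, final_centers).
    """
    centers = [[float(v) for v in c] for c in centers]
    if method == "iterate":
        for _ in range(max_iter):
            membership = [nearest(p, centers)[0] for p in cases]
            updated = []
            for k, old in enumerate(centers):
                members = [p for p, m in zip(cases, membership) if m == k]
                if members:
                    # New center = per-variable mean of the cluster's cases.
                    updated.append([sum(col) / len(members) for col in zip(*members)])
                else:
                    # Assumption: an empty cluster keeps its previous center.
                    updated.append(old)
            if updated == centers:
                break
            centers = updated
    # Final assignment against whichever centers are now in effect.
    results = [nearest(p, centers) for p in cases]
    membership = [idx for idx, _ in results]
    distances = [dist for _, dist in results]
    return membership, distances, centers
```

For example, with the four cases (1, 1), (1, 2), (8, 8), (9, 8) and initial centers (0, 0) and (10, 10), both methods place the first two cases in one cluster and the last two in the other. With "iterate" the final centers move to (1, 1.5) and (8.5, 8), so every case lies 0.5 from its center. With "classify" the centers stay where they were supplied and the distances are measured from those fixed points.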
Example. What are some identifiable groups of television shows that attract similar
audiences within each group? With k-means cluster analysis, you could cluster
television shows (cases) into k homogeneous groups based on viewer characteristics.
This can be used to identify segments for marketing. Or you can cluster cities
(cases) into homogeneous groups so that comparable cities can be selected to test
various marketing strategies.
Statistics. Complete solution: initial cluster centers, ANOVA table. Each case: cluster
information, distance from cluster center.
Data. Variables should be quantitative at the interval or ratio level. If your variables
are binary or counts, use the Hierarchical Cluster Analysis procedure.
Case and Initial Cluster Center Order. The default algorithm for choosing initial cluster
centers is not invariant to case ordering. The Use running means option on the
Iterate dialog box makes the resulting solution potentially dependent upon case
order regardless of how initial cluster centers are chosen. If you are using either of
these methods, you may want to obtain several different solutions with cases sorted
in different random orders to verify the stability of a given solution. Specifying