This is an excerpt from Dr. Ham's premier book "Oracle Data Mining: Mining Gold from your Warehouse".
The Wizard will automatically trim outliers and impute missing data by
substituting the mean for numerical attributes and the mode for
categorical attributes. Normalization of numerical values is
also performed using the Min/Max
technique. You can change these default settings by clicking
on Advanced Settings when you finish the New Activity Wizard.
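To make the default preparation steps concrete, here is a minimal Python sketch of mean imputation, mode imputation, and Min/Max normalization. It is an illustration only, not the Wizard's actual implementation: the function names and sample values are made up, and outlier trimming is left out because the excerpt does not describe how the Wizard performs it.

```python
from collections import Counter
from statistics import mean


def impute_numeric(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace None with the most frequent observed value (the mode)."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample columns, for illustration only.
ages = impute_numeric([20, None, 40, 60])
regions = impute_categorical(["N", "S", None, "N"])
scaled = min_max_normalize(ages)
```

Normalizing to a common range matters for distance-based clustering, because otherwise attributes with large numeric ranges would dominate the distance calculation.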
In the Build Tab, the default number of clusters is set at 10.
K-means must have a number of clusters to start with, in contrast to
O-Cluster,
which finds the number of clusters best suited to the dataset.
We'll keep the default build settings and build the model.
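The need to supply a cluster count up front is easy to see in a textbook k-means loop. The sketch below is plain Lloyd's algorithm in Python, offered as an assumption-laden illustration: it is not Oracle's implementation (which, judging by the intermediate clusters in the build results, produces a hierarchy of clusters), and the naive initialization and toy points are made up.

```python
import math


def kmeans(points, k, iterations=20):
    """Textbook k-means on a list of numeric tuples; k must be given."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignment


# Made-up, already-normalized points forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0),
          (1.0, 0.8), (0.2, 0.1), (0.8, 0.9)]
centroids, assignment = kmeans(points, k=2)
```

Nothing in the loop can discover that a different `k` would fit the data better, which is the contrast with O-Cluster drawn above.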
After the "Mining Activity"
completes, click on Build results to view the clusters. In
this view, all clusters are shown. Cluster #1 has all 1044
cases. Cluster #2 is an intermediate cluster created from
Cluster #1, with 615 cases, and Cluster #4 was created from Cluster
#3. The check box Show Leaves Only will
display the final clustering.
Next, highlight a cluster and
click the "Detail" button to view a histogram of the cluster centroid
attributes and their corresponding values. Keep in mind that
clustering is an unsupervised data mining technique: there is no
target attribute to predict. Even so, if by serendipity the
clustering algorithm happens to split on the target attribute, we can
learn more about the similarities of customers who purchased insurance.
Finding majority cohort values
Even though there is usually no
"pure" sample of customers with the target value of interest, we may
find a cohort of the population that has more or less a majority of
that attribute value. To explore this possibility, select each
of the leaf clusters, choose Detail, and highlight the CARAVAN
attribute. As it turns out, three clusters (#11, #14 and #16)
consist mostly of customers who carry mobile home insurance. Let's choose
Cluster #16 and compare it with Cluster #18, a cohort of customers
without CARAVAN insurance.
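If the cluster assignments were exported alongside the target value, the same check could be scripted rather than clicked through. The following Python sketch assumes a hypothetical list of (cluster, target) pairs; the cluster labels and rows are invented and do not reproduce the book's data.

```python
from collections import defaultdict


def target_share_by_cluster(rows):
    """rows: iterable of (cluster_id, target) pairs with target in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])  # cluster_id -> [cases, target == 1]
    for cluster_id, target in rows:
        counts[cluster_id][0] += 1
        counts[cluster_id][1] += target
    return {cid: ones / n for cid, (n, ones) in counts.items()}


def majority_cohorts(shares, threshold=0.5):
    """Clusters in which more than `threshold` of the cases have target 1."""
    return sorted(cid for cid, share in shares.items() if share > threshold)


# Invented toy assignments, for illustration only.
rows = [("A", 1), ("A", 1), ("A", 0),
        ("B", 0), ("B", 0),
        ("C", 1), ("C", 0)]
shares = target_share_by_cluster(rows)
cohorts = majority_cohorts(shares)
```

Raising `threshold` toward 1.0 narrows the result to nearly pure cohorts; lowering it admits clusters where the target value is merely over-represented.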
You can see from the cluster details that Cluster #16 has 84 customers
who all have insurance, while Cluster #18 consists of 95
customers without insurance. We didn't plan these divisions;
the algorithm found these naturally occurring clusters in the
dataset. In fact, running the clustering algorithm on the
entire dataset of 5822 cases does not yield any cluster in which the
CARAVAN target is 1 for every case.
We influenced this pure sample by stratifying the data so that the
values of 1 and 0 were more evenly distributed in the starting
cluster, Cluster #1. You might use these subsets of the case
dataset to define very homogeneous populations or cohorts of
customers, hospital patients, sales executives or whatever business
you may be investigating.
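The excerpt does not say exactly how the stratified sample was drawn, so the following Python sketch shows just one common approach, under stated assumptions: keep every case with target 1 and randomly downsample the cases with target 0. The function name, the `ratio` parameter, and the toy rows are all hypothetical.

```python
import random


def balance_by_downsampling(rows, target_key, ratio=1.0, seed=0):
    """Keep every positive case plus a random sample of negatives.

    ratio: desired number of negatives per positive (hypothetical knob).
    """
    positives = [r for r in rows if r[target_key] == 1]
    negatives = [r for r in rows if r[target_key] == 0]
    n_neg = min(len(negatives), round(len(positives) * ratio))
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    sample = positives + rng.sample(negatives, n_neg)
    rng.shuffle(sample)
    return sample


# Invented data: 5 positives among 100 cases.
rows = [{"id": i, "CARAVAN": 1 if i < 5 else 0} for i in range(100)]
balanced = balance_by_downsampling(rows, "CARAVAN", ratio=2.0)
```

With a rare target value, evening out the classes this way gives the clustering algorithm a better chance of isolating the minority cases in their own clusters.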
Next, we proceed by clicking through each attribute to find those
whose values are most different between the cohorts, as shown in the
following examples. To quickly review the values, you can
place the Cluster Detail
windows side by side.
As you can see from these
examples, there are clear differences in various attributes between
those customers who purchased CARAVAN insurance and those who did
not, including any other insurance purchased, size of household,
number of children in the household, and amount of money spent on
third-party insurance.
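Clicking through each attribute by hand amounts to ranking attributes by how far apart the two cohorts' centroid values are. Here is a minimal Python sketch of that comparison, assuming the centroid values have been copied out into dictionaries; the attribute names and numbers are invented for illustration and are not the book's results.

```python
def rank_attribute_differences(centroid_a, centroid_b):
    """Return (attribute, |difference|) pairs, largest difference first."""
    shared = centroid_a.keys() & centroid_b.keys()
    diffs = [
        (name, abs(centroid_a[name] - centroid_b[name])) for name in shared
    ]
    # Sort by descending difference, breaking ties alphabetically.
    return sorted(diffs, key=lambda item: (-item[1], item[0]))


# Invented normalized centroid values with hypothetical attribute names.
cohort_with = {"household_size": 0.75, "num_children": 0.5,
               "other_insurance": 1.0}
cohort_without = {"household_size": 0.5, "num_children": 0.5,
                  "other_insurance": 0.0}
ranking = rank_attribute_differences(cohort_with, cohort_without)
```

Comparing normalized values keeps attributes on different scales comparable; for categorical attributes, a comparison of the most frequent value in each cohort would be needed instead.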