Using the K-Means Wizard

Data warehouse tips by Burleson Consulting

This is an excerpt from Dr. Ham's premier book "Oracle Data Mining: Mining Gold from your Warehouse".

The Wizard will automatically trim outliers and impute missing data by substituting the mean for numerical attributes and the mode for categorical attributes. Normalization of numerical values is also performed using the Min/Max technique. You can change these default settings by clicking on Advanced Settings when you finish the New Activity Wizard. In the Build Tab, the default number of clusters is set at 10. K-means requires the number of clusters to be specified up front, in contrast to O-Cluster, which finds the number of clusters best suited to the dataset. We'll keep the default build settings and build the model.
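To make the default preparation steps concrete, here is a minimal Python sketch of mean imputation, mode imputation, and Min/Max normalization. It is only an illustration of the techniques named above, not the Wizard's actual implementation. The function names are invented for this example, missing values are assumed to be represented as None, and outlier trimming is left out.

```python
from collections import Counter


def impute_mean(values):
    """Replace missing numeric values (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_mode(values):
    """Replace missing categorical values (None) with the most frequent value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 (an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Because K-means relies on distances between cases, rescaling every numeric attribute to the same range keeps attributes with large raw values from dominating the result.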

After the "Mining Activity" completes, click on Build results to view the clusters. In this view, all clusters are shown. Cluster #1 has all 1044 cases. Cluster #2 is an intermediate cluster created from Cluster #1, with 615 cases, and Cluster #4 was created from Cluster #3. Selecting the Show Leaves Only check box displays only the final clusters.
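The relationship between intermediate and final clusters can be pictured as a tree, where a leaf is any cluster that was never split further. The sketch below assumes the hierarchy is available as a mapping from each cluster ID to its parent ID. The mapping shown is a toy example chosen only to agree with the parent/child relationships mentioned above; it is not the actual tree from the build results.

```python
def leaf_clusters(parent_of):
    """Return the IDs of clusters that are not the parent of any other cluster."""
    parents = {p for p in parent_of.values() if p is not None}
    return sorted(c for c in parent_of if c not in parents)


# Hypothetical hierarchy: cluster ID -> parent cluster ID (None marks the root).
toy_hierarchy = {1: None, 2: 1, 3: 1, 4: 3, 5: 3, 6: 2, 7: 2}

final_clusters = leaf_clusters(toy_hierarchy)
```

In this toy tree, clusters 1, 2, and 3 are intermediate, so only the clusters they were split into remain in the leaves-only view.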

Next, highlight a cluster and click the "Detail" button to view a histogram of the cluster centroid attributes and corresponding values. Keep in mind that clustering is an unsupervised data mining technique, meaning that there is no target attribute to predict. Even so, if the clustering algorithm happens by serendipity to split on the target attribute, we can learn more about the similarities among customers who purchased insurance.

Finding majority cohort values

Even though there is usually no "pure" sample of customers with the target value of interest, we may find a cohort of the population in which that attribute value is in the majority. To explore this possibility, select each of the leaf clusters, choose Detail, and highlight the CARAVAN attribute. As it turns out, three clusters (#11, #14, and #16) consist mostly of customers who carry mobile home insurance. Let's choose Cluster #16 and compare it with Cluster #18, a cohort of customers without CARAVAN insurance.
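The same check can be made outside the Detail window if the cluster assignments are exported. The following sketch assumes each case is available as a (cluster ID, CARAVAN value) pair and computes the share of cases with CARAVAN = 1 in each cluster. The function names and the 0.5 threshold are assumptions made for illustration.

```python
from collections import defaultdict


def target_share_by_cluster(assignments):
    """Map each cluster ID to the fraction of its cases whose target equals 1."""
    counts = defaultdict(lambda: [0, 0])  # cluster ID -> [positives, total]
    for cluster_id, target in assignments:
        if target == 1:
            counts[cluster_id][0] += 1
        counts[cluster_id][1] += 1
    return {c: pos / total for c, (pos, total) in counts.items()}


def majority_clusters(assignments, threshold=0.5):
    """Return the cluster IDs whose share of target = 1 exceeds the threshold."""
    shares = target_share_by_cluster(assignments)
    return sorted(c for c, share in shares.items() if share > threshold)
```

A share of 1.0 corresponds to a "pure" cohort, while a share just above the threshold marks a cluster in which the target value is merely in the majority.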

You can see from the cluster details that Cluster #16 has 84 customers who all have insurance, while Cluster #18 consists of 95 customers without insurance. We didn't plan these divisions; the algorithm found these naturally occurring clusters in the dataset. In fact, running the clustering algorithm on the entire dataset of 5822 cases does not yield any cluster in which every case has a CARAVAN target value of 1.

We influenced this pure sample by stratifying the data so that the values of 1 and 0 were more evenly distributed in the starting cluster, Cluster #1. You might use these subsets of the case dataset to define very homogeneous populations or cohorts of customers, hospital patients, sales executives or whatever business you may be investigating.
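One simple way to obtain a more even distribution of the target is to keep every case with CARAVAN = 1 and draw a random sample of the cases with CARAVAN = 0. The sketch below shows that idea under stated assumptions: rows are dictionaries with a CARAVAN key, and the ratio of negatives to positives is a free parameter. The text does not say which ratio or sampling method was used, so both are hypothetical here.

```python
import random


def stratify_by_target(rows, target_key="CARAVAN", negatives_per_positive=1.0, seed=0):
    """Keep all target = 1 rows and a random sample of target = 0 rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [r for r in rows if r[target_key] == 1]
    negatives = [r for r in rows if r[target_key] == 0]
    n_negatives = min(len(negatives), int(len(positives) * negatives_per_positive))
    sample = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(sample)  # avoid leaving all positives at the front
    return sample
```

For example, calling stratify_by_target(rows, negatives_per_positive=2.0) returns a sample in which roughly one case in three has the target value 1, provided enough negative cases are available.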

Next, we proceed by clicking through each attribute to find those whose values are most different between the cohorts, as shown in the following examples. To quickly review the values, you can place the Cluster Detail windows side by side.
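Clicking through attributes one at a time amounts to ranking them by how far apart the two cohorts are. The sketch below assumes the two cluster centroids are available as dictionaries of numeric attribute values already normalized to the same range; that representation and the function name are assumptions for illustration. For categorical attributes, the histograms in the Detail window would need to be compared instead.

```python
def rank_attribute_differences(centroid_a, centroid_b):
    """Rank shared attributes by the absolute difference between two centroids."""
    shared = centroid_a.keys() & centroid_b.keys()
    diffs = {name: abs(centroid_a[name] - centroid_b[name]) for name in shared}
    # Largest difference first; ties are broken alphabetically by attribute name.
    return sorted(diffs.items(), key=lambda item: (-item[1], item[0]))
```

The attributes at the top of the returned list are the ones worth inspecting first when comparing two cohorts such as Cluster #16 and Cluster #18.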

As you can see from these examples, there are clear differences in various attributes between those customers who purchased CARAVAN insurance and those who did not, including any other insurance purchased, size of household, number of children in the household, and amount of money spent on third party insurance.

For more tips and tricks for Oracle data warehouse analysis, see Dr. Ham's premier book "Oracle Data Mining: Mining Gold from your Warehouse"

