In this article we will explore clustering and customer segmentation using transaction data.
We will mainly focus on k-means clustering and determining the optimal number of groups, but we will also briefly look at the PAM algorithm, dendrograms and the gap statistic.
Finally we will examine the created group composition using bar charts and word clouds.
The data set we will be using is taken from John W. Foreman's book Data Smart, and consists of a file containing descriptions of discounted wine offers, and another with customers' discount purchases.
k-means is a clustering method which iteratively assigns each data point to a cluster, depending on which cluster centre it is currently closest to. After assigning each point to a cluster a new cluster centre is created based on the mean position of the new members of that cluster. Points are then re-assigned to the nearest new cluster centre. This process is repeated until there are either no more changes, or a set number of iterations is reached.
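As a minimal sketch of the interface, base R's `kmeans()` runs exactly this loop. Here `offer_matrix` stands for the customer-by-offer matrix we build in the next step:

```r
# A minimal sketch: base R's kmeans() implements the loop described above.
# `offer_matrix` is a placeholder for the customer-by-offer matrix built below.
set.seed(42)                  # the initial centres are chosen at random
km <- kmeans(offer_matrix, centers = 3, iter.max = 100, nstart = 25)
head(km$cluster)              # cluster assignment per customer
km$tot.withinss               # total within-cluster sum of squares
```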
Because k-means assigns cluster membership by distance, we first arrange the data into an n-dimensional space in which every offer has the same magnitude.
This is effectively a frequency table, showing which offers a customer purchased.
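A sketch of the reshaping step, assuming the transactions file has been read into `transactions` with columns `customer` and `offer_id` (the names are illustrative, not necessarily those in the raw file):

```r
library(dplyr)
library(tidyr)

# Pivot the long transaction list into a wide customer-by-offer matrix.
offer_matrix <- transactions %>%
  count(customer, offer_id) %>%            # one row per purchase
  pivot_wider(names_from  = offer_id,
              values_from = n,
              values_fill = 0) %>%         # 0 = offer not purchased
  tibble::column_to_rownames("customer") %>%
  as.matrix()
```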
How many different clusters?
An important decision is how many clusters to create. If we create too few, some groups will be missed or merged; if we create too many, resources are wasted, genuine groups are split, and the analysis becomes needlessly complicated.
Using the broom package we can create multiple models before choosing the number of k-means clusters we’d like to use.
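One possible pattern (a sketch, reusing the `offer_matrix` built above) uses purrr::map() to fit a model for each candidate k and broom to collect the results:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

# Fit k-means for k = 1..9, then gather the per-model summaries with
# glance() and the per-point assignments with augment().
set.seed(42)
kclusts <- tibble(k = 1:9) %>%
  mutate(kclust    = map(k, ~ kmeans(offer_matrix, centers = .x, nstart = 25)),
         glanced   = map(kclust, glance),
         augmented = map(kclust, augment, data = as.data.frame(offer_matrix)))

clusterings <- kclusts %>% unnest(glanced)     # one summary row per k
assignments <- kclusts %>% unnest(augmented)   # .cluster for every customer
```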
Graphing the clusters
Our data is in multidimensional space, so we need to reduce it to two-dimensional space in order to visually inspect the different groupings.
For this we use Principal Component Analysis (PCA), which transforms the data into new variables containing the same information, expressed in a different coordinate space.
These new variables are listed in descending order of the variance they explain, so we pick the first two for our graphs, retaining as much information as possible while working in two dimensions.
Here we view the proportion of the variance contained in the first five principal components; the first two, which we will use, contain 25% of the variance between them.
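A sketch of this step with `prcomp()` (no rescaling here, since the offer columns are already on a common 0/1 scale):

```r
# Run PCA on the offer matrix and view the variance explained by the
# first five principal components.
pca <- prcomp(offer_matrix)
summary(pca)$importance[, 1:5]
```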
Using PCA we now transform our data, and then graph the result.
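A sketch of the plotting step, reusing `pca` and the broom `assignments` table from above, and assuming the rows stay in the same order as `offer_matrix`:

```r
library(ggplot2)

# Project each customer onto PC1/PC2, colour by cluster assignment,
# one panel per candidate k.
plot_df <- assignments %>%
  mutate(PC1 = rep(pca$x[, 1], times = max(k)),
         PC2 = rep(pca$x[, 2], times = max(k)))

ggplot(plot_df, aes(PC1, PC2, colour = .cluster)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ k) +
  labs(colour = "cluster")
```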
Visualising k-means groups
From a visual inspection of the graphs below, the clearest groups appear to be at k=2, k=3 or k=4, although it should be remembered that we are only looking at 25% of the variance.
We can also determine the optimal number of groups by looking at the total within-cluster sum of squares, which represents the variance within the clusters.
Looking at the graph below, the tot.withinss value decreases as k increases, and at k=3 we can see an 'elbow': clusters beyond this point still have decreasing variance, but the rate of change has dropped, so k=3 might be a good number of clusters.
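The elbow plot can be drawn directly from the `glance()` summaries collected earlier:

```r
library(ggplot2)

# Total within-cluster sum of squares against k.
ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:9)
```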
Gap Statistic
The graph below shows the gap statistic, another measure for choosing the number of clusters; the local maximum at k=3 is one way of ascertaining the optimal cluster count.
The Tibs2001SEmax criterion from the cluster package is another method (Tibshirani et al., 2001), and this metric also gives 3 clusters.
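A sketch of both steps with `cluster::clusGap()` and `maxSE()` (the `K.max` and `B` settings here are illustrative):

```r
library(cluster)

# Compute the gap statistic for k = 1..10, then apply the
# Tibshirani et al. (2001) rule to pick k.
set.seed(42)
gap <- clusGap(offer_matrix, FUNcluster = kmeans, K.max = 10, B = 50,
               nstart = 25)
plot(gap)
maxSE(f = gap$Tab[, "gap"], SE.f = gap$Tab[, "SE.sim"],
      method = "Tibs2001SEmax")
```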
Another method we could use to assign groups is the PAM (partitioning around medoids) algorithm, a clustering method related to k-means which is said to be more robust to noise and outliers, because it minimises a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
Using the Tibs2001SEmax criterion here suggests the optimal number of clusters is 5.
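A sketch of PAM with `cluster::pam()`; note that `clusGap()` expects its clustering function to return a `cluster` component, so PAM needs a small wrapper before the Tibs2001SEmax rule can be applied:

```r
library(cluster)

# Fit PAM with the k suggested above.
pam_fit <- pam(offer_matrix, k = 5)
table(pam_fit$clustering)    # cluster sizes

# Wrapper so clusGap() can drive pam(); pam returns $clustering,
# not the $cluster component that clusGap() looks for.
pam_wrap <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))
set.seed(42)
gap_pam <- clusGap(offer_matrix, FUNcluster = pam_wrap, K.max = 10, B = 50)
maxSE(gap_pam$Tab[, "gap"], gap_pam$Tab[, "SE.sim"],
      method = "Tibs2001SEmax")
```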
Another method of grouping is hierarchical clustering, which forms groups from the bottom up. This creates a hierarchy based on a measure of dissimilarity (distance) between points.
We can see that our dendrogram naturally splits into 3 groups, although one of those groups is very small and one very large.
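A sketch of the hierarchical step with base R (the Ward linkage used here is an assumption):

```r
# Euclidean distances between customers, agglomerative clustering with
# Ward linkage, then cut the dendrogram into 3 groups.
d  <- dist(offer_matrix)
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE, hang = -1)   # the dendrogram
table(cutree(hc, k = 3))              # group sizes
```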
Clustering Summary
This brief overview of a few clustering methods shows that choosing the correct number of clusters isn’t always straightforward.
But for our purposes k-means with k=3 seems a good choice, giving clear groupings. This will also help to keep our results simple.
Examining the Groups
Now that we have our groups, we need to understand the differences between them, so that we can then target future offers appropriately.
Each offer is described by several variables, e.g. varietal (grape type), country of origin and amount of discount, and any of these, or any combination of them, might be a great way of understanding the differences between our groups.
A thorough analysis would look at all of these, but we will just look at varietal.
First we need to join our groupings with our data and tidy it a little.
We also create some helper functions, sketched below.
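A sketch of both steps, with illustrative column names: `customer` links the transactions to the clusters, and `offer_id` links them to the offers table (which holds varietal, origin, discount and so on):

```r
library(dplyr)

# Fit the chosen model and attach a group label to every transaction.
set.seed(42)
k3 <- kmeans(offer_matrix, centers = 3, nstart = 25)

grouped <- transactions %>%
  left_join(tibble(customer = rownames(offer_matrix),
                   group    = factor(k3$cluster)),
            by = "customer") %>%
  left_join(offers, by = "offer_id")

# A hypothetical helper: purchase counts of a variable within one group.
group_counts <- function(df, grp, var) {
  df %>% filter(group == grp) %>% count({{ var }}, sort = TRUE)
}
```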
Visualising the Groups
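One way to draw the comparison is purchases per varietal, one panel per group (a sketch using the `grouped` table and illustrative `varietal` column from above):

```r
library(ggplot2)
library(forcats)

# Bar chart of purchase counts by varietal, faceted by group.
ggplot(grouped, aes(x = fct_rev(fct_infreq(varietal)))) +
  geom_bar() +
  coord_flip() +
  facet_wrap(~ group) +
  labs(x = "varietal", y = "purchases")
```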
From these graphs it can be seen that the groups have the following preferences:
Group 1
Pinot Noir, Champagne and Cabernet Sauvignon
Group 2
Espumante, Malbec, Pinot Grigio, Prosecco
Group 3
Mainly Champagne
Using Word Clouds
Using word clouds shows the same groupings.
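A sketch of how such clouds could be drawn with the wordcloud package, reusing the hypothetical `group_counts()` helper from above:

```r
library(wordcloud)

# One word cloud per group, with word size proportional to the group's
# purchase count for each varietal.
for (g in levels(grouped$group)) {
  counts <- group_counts(grouped, g, varietal)
  wordcloud(words = counts$varietal, freq = counts$n,
            min.freq = 1, random.order = FALSE)
}
```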
Summary
We have looked at 3 different clustering methods (k-means, PAM and hierarchical), and at determining the optimal number of clusters, before choosing k-means with k=3.
We then visualised the results showing some clear differences between groups. These differences were emphasised using word clouds.
The next step would be to target these groups with offers which are tailor made for their preferences, and then analyse the difference in sales.
Feel free to comment below with any thoughts, corrections etc.