Cluster Analysis using K-means Algorithm
In this post, I illustrate the commonly used clustering algorithm called K-means using a self-written Octave functions/codes. Indeed, there are readily available R, python and octave packages/functions which I listed at the end of this post. Cluster analysis groups similar objects together so that objects in the same group are more similar. It can be used for market segmentation, social network analysis, astronomical data analysis, and so on (see for example Bjiuraj etal). K-means clustering is unsupervised learning algorithm that groups a given dataset into k coherent clusters so that each observation belongs to a group with the nearest mean. The number of clusters (Ks) should be known or determined a priori, and for each cluster we need to provide initial means(centroids).
The data used for this post is obtained from the course material of Machine Learning course offered by Prof Andrew Ng from Stanford University.
The steps listed below were followed to implement k-means algorithm:
Step-1: Randomly initialize k points in the dataset as cluster centroid
Step-2: Assign all observations into one of k groups based on which cluster centroid is closest to the observation. The closeness is measured by computing the square of the difference between each observation and the centroids (square of Euclidean distance).
Step-3: Update cluster centroids as the average of the observations inside the k cluster.
Step-4: Repeat steps 2 and 3 until the algorithm converges.
The code in this example is executable in Octave / Matlab Software.
Preparing and Exploring Data
First, import the data
% import the data
load('ex7data2.mat');
The data has 2 variables and 300 observations. The first five observations are shown below:
N | VAR1 | VAR2 |
---|---|---|
1 | 1.84208 | 4.60757 |
2 | 5.65858 | 4.79996 |
3 | 6.35258 | 3.29085 |
4 | 2.90402 | 4.61220 |
5 | 3.23198 | 4.93989 |
Before analyzing the data, it is good to explore and visualize to have a better sense of the data. For example, check the presence of missing values and visualize the distribution of the data. The summary statistics and scatter plot of the two continuous variables are shown below. There was no missing data in any of the variables. In addition the distribution of the two variables are similar.
Table 1: Summary Statistics
N | Mean | Std Dev. | Min. | 1st Qu. | Median | 3rd Qu. | Max. | |
VAR1 | 300 | 3.684 | 1.926 | 0.245 | 2.156 | 3.232 | 5.510 | 8.203 |
VAR2 | 300 | 2.987 | 1.685 | 0.206 | 1.211 | 2.950 | 4.808 | 5.784 |
Figure 1: Scatter Plot of Variables
Determine Number of Clusters
The scatter plot indicates that there are three possible clusters. Often, the number of clusters are not clear or the number of variables are more than two and not straightforward to visualize. In such cases, one approach is to determine the optimal number of clusters using elbow method. The elbow method helps to visualize the sum of square errors (SSE) against the number of clusters, and helps to choose the best number of clusters based on a reasonably minimum SSE achieved for a reasonable small number of clusters. To make this plot, we need to run the algorithm for a number of clusters, for example 1 to 10. To confirm what was seen in the above figure, I plotted SSE against K. The variation on SSE after K=3 is minimal and therefore the figure indicates the presence of 3 clusters.
Figure 2: Sum of Square Error against Number of Clusters
Steps to Implement K-means Cluster
Step-1: Randomly Initialize k Centroids
Three observations are randomly selected from the dataset as initial centroids. Note that the centroids should be unique; however this doesn't guarantee that the k-means can't fall on local optima. Running the algorithm for several different random initials may help to avoid from falling on local optima.
% K (number of centroids)=3
K=3;
% X is a two dimensional dataset from which initial centroids are randomly selected
rand_idx= randperm(size(X,1));
initial_centroids= X(rand_idx(1:K), :);
centroids = initial_centroids;
previous_centroids = centroids;
Step-2: Assign each observation into one of the centroids
Assign each observation in the dataset to the closest centroid. The function findClosestCentroids
, defined below, returns a vector containing the closest centroid for each observation.
function idx = findClosestCentroids(X, centroids)
[m,n]= size(X); % m and n are number of rows and columns of the dataset X ;
% Set K
K = size(centroids, 1);
dist= zeros(size(X, 1), K);
c=0;
for i= 1:m
for k=1:K
for j= 1:n
% the distance from each observation to a centroid
c= c+ (X(i,j)-centroids(k,j)).^2 ;
end
dist(i,k)= c; %distance of each observation from each centroid
c=0 ; % restart c to zero for next iteration
end
end
[min idx]= min(dist , [], 2);
idx; % a vector containing centroids closest to each observation
end
Step-3: Update cluster centroids
% compute new centroids for each cluster using a user-defined function “updateCentroids”.
function centroids = updateCentroids(X, idx, K)
clstr=0;
for k=1:K
clstr= find(idx==k); % to find row index where idx==k
centroids(k,:)= mean(X(clstr,:));
clstr=0;
end
end
Step-4: Repeat steps 2 and 3 until the algorithm converges.
In this example, the process stops when the maximum number of iteration reached.
% maximum iteration to be 10
max_iters = 10;
K=3;
% Plot the progress
figure;
hold on;
% Create palette for coloring
palette = hsv(K + 1);
%-------------------------------
%Run K-Means algorithm
%-------------------------------
for i=1:max_iters
% Output progress
fflush(stdout);
idx = findClosestCentroids(X, centroids);
% plot the progress
% Plot simillar clustergroups with the same color
colors = palette(idx, :);
scatter(X(:,1), X(:,2), 15, colors);
% Plot the centroids as black x's
plot(centroids(:,1), centroids(:,2), 'x', 'MarkerEdgeColor','k', 'MarkerSize', 10, 'LineWidth', 3);
% Plot the history of the centroids with lines
for j=1:size(centroids,1)
plot([centroids(j, 1), previous_centroids(j, 1)], [centroids(j, 2), previous_centroids(j, 2)]);
end
% Title
title(sprintf('Iteration number %d', i))
previous_centroids = centroids;
% compute new centroids for each cluster
centroids = updateCentroids(X, idx, K);
end;
hold off;
Figure 3: Scatter plot along with three clusters colored in red, green and blue.
The ☓s show the centroids at each iteration for a total number of 10 iterations.
Summary:
I hope this post gives an insight on how k-mean algorithm works. Alternatively, we can use the existing Octave, R or Python functions that can handle K-means cluster analysis. These functions are:
Octave: use the function kmeans
from statistics
package. For the details, you can refer to Octave documentation
R: use kmeans
function from stats
package. For the details, you can refer to R documentation.
Python: From scikit-learn
package, you can use sklearn.cluster.KMeans
class. For the details, you can refer to sklearn.cluster.KMeans
documentation.
Or from graphlab
package you can use kmeans.create()
to perform cluster analysis. For the details, you can refer to graphlab.kmeans.create
documentation.