In order to cluster the training data using a k-means algorithm, the following steps are iteratively performed:
- The first step is for the user to assign the number of clusters (k). Each initial cluster centroid is chosen at random.
- The data points are assigned to their nearest cluster centroids.
- The centroids of the respective clusters are calculated after the data point assignment. The centroid of a cluster is usually the mean distance of the data points in the cluster, which can be computed from the Euclidean distance or correlation or cosine similarity or any other similarity measure. This generates a centroid for the clusters.
- Step 2 and Step 3 are repeated iteratively until a convergence is achieved. That is, the iteration stops when there are no changes to the data points assignment.
The following table show the steps with a graphical demonstration (image source: David Sontag)
However, it can be argued that the two-cluster example is too obvious and often found inapplicable in real-world applications. To demonstrate the working of the method in a more complex environment, let us consider an exercise where the K is set to be three. Figure 5 shows such an approach with K = 3.
One of the open issues of K-means clustering algorithm for which the method is very often criticised is the selection of the value of K. In most cases, this value is provided a priori by the user. And, in some practical scenarios, this introduces a serious bias. To overcome this, some more sophisticated versions of the K-means algorithm have been proposed in the literature. For the sake of simplicity, we do not explore these in detail in this lesson, although interested readers may refer to the appropriate source. However, below are details of one of the few methods available for deciding the optimal number of the clusters.