Clustering is an unsupervised learning technique. We have looked previously at supervised learning problems, where a set of features is given and the goal is to predict an output label or class. In unsupervised learning, the features are given but the label or class is not known. The goal is therefore to discover interesting structure in the data, such as patterns or groupings.
Unsupervised learning is challenging: since the data carries no labels, the exercise tends to be more subjective, and it is usually performed as part of an exploratory data analysis. It is also hard to assess the quality of the results, since there is no universally accepted validation mechanism. In classification, it is possible to check how well a model performs by comparing its predictions with the actual labels, but this is not usually possible in an unsupervised learning scenario.
In clustering, the task is to find subgroups or clusters within a data set. The essence is to divide or partition the data into distinct homogeneous groups, such that observations within each group are quite similar to each other, while observations in different groups are dissimilar. To achieve this, there must be a formal definition of what makes observations similar and what makes them different. This definition is domain-specific, and therefore depends largely on knowledge of the data being observed. For example, the criteria that make marketing data similar or different may not be the same as those for a biological data set.
Figure 1. An example of clustering. (source: Noel Bambrick)
Clustering differs from classification since clustering is an unsupervised learning problem while classification is a supervised learning problem. The table below summarises some of the key differences.
A large number of clustering algorithms are available in the literature, with many of them sharing the same underlying clustering technique. A clustering technique refers to the principles governing how the clusters are formed, rather than the specifics of any one method. Well-known techniques include hierarchical clustering, partition clustering, density-based clustering, model-based clustering and grid-based clustering, with hierarchical and partitioning being the most popular.
In agglomerative hierarchical clustering, each element is seen as a cluster of its own at the beginning. Then the clusters are aggregated or merged until a desired cluster structure is obtained.
In divisive hierarchical clustering, all the elements are considered to belong to a single cluster at the beginning. The cluster is then divided into smaller clusters. The smaller clusters are further sub-divided until the desired cluster structure is obtained.
Hierarchical clustering is also known as connectivity-based clustering and it is based on the idea that similar objects tend to be close to each other, whereas dissimilar objects are far away from each other. The technique uses various criteria to decide at each step which clusters should be aggregated or divided. This is usually based on the measure of the clusters’ proximity.
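The agglomerative procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the text: the sample points, the single-linkage merging criterion and the stopping condition (a target number of clusters) are all assumptions.

```python
# Minimal agglomerative clustering sketch with single linkage.
# Sample points and the stopping condition are illustrative assumptions.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]        # start: each point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9)]
result = agglomerate(points, 2)
# the three points near the origin end up in one cluster,
# the two points near (8, 8) in the other
```

A real implementation would compute the pairwise distances once and cache them; this quadratic rescan at every merge is kept only for readability.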
Achieving an optimal partition would require an exhaustive enumeration of all possible partitions. This is infeasible in practice, so heuristics are used instead: an initial cluster partition is iteratively improved by moving data points between clusters until a locally optimal partition is reached.
Partition clustering is more applicable to large data sets, since it is difficult to represent large data in the tree form used by the hierarchical approach.
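The iterative-improvement idea behind partition clustering can be illustrated with a minimal k-means sketch. K-means is one well-known partitioning method; the data points and starting centroids below are illustrative assumptions, chosen so the run is deterministic.

```python
# Minimal k-means sketch: alternate between assigning points to their
# nearest centroid and moving each centroid to its group's mean.
# Data and starting centroids are illustrative assumptions.

def kmeans(points, centroids, iters=10):
    groups = []
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        groups = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[d.index(min(d))].append(p)
        # update step: move each centroid to the mean of its group
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g))
            for g in groups if g
        ]
    return centroids, groups

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, groups = kmeans(points, centroids=[(0, 0), (10, 10)])
# each of the two natural groups of three points gets its own cluster
```

Note that the user supplies the number of clusters up front (via the number of starting centroids), which is exactly the contrast with the hierarchical approach drawn in the comparison below.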
By comparing the two clustering techniques above, the following conclusions can be deduced:

Hierarchical clustering:
1. There is no need to specify the number of clusters in advance.
2. The hierarchical structure matches human intuition in some application domains.
3. It does not scale well, especially on large data sets.

Partition clustering:
1. The user must specify the number of clusters in advance.
2. It scales well to large data sets.
3. It is non-hierarchical: each data instance is placed in exactly one of the pre-specified number of clusters.
As you have already learned, given some data points, the idea in clustering is to group them so that members of the same cluster are close to each other in similarity, whereas members of different clusters are far apart. The data points in each cluster should therefore be close to one another and relatively well separated from the data points of other clusters, as shown in Figure 2.
Calculating the distance between data points in a two-dimensional space is easy to do and to visualise. However, a problem arises as the dimensionality of the data increases, for example to 10 or 100 dimensions.
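A small experiment (an illustrative sketch, not from the text) makes the problem concrete: as the dimension grows, the gap between the nearest and the farthest point tends to shrink relative to the distances themselves, so "close" and "far" become harder to distinguish.

```python
# Compare how well distances separate near from far points in low
# and high dimensions. The point counts and seed are arbitrary choices.
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance, for random points
    in the unit hypercube measured from the origin."""
    rng = random.Random(seed)
    origin = [0.0] * dim
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(origin, p) for p in pts]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(2)      # distances spread out widely
high = relative_contrast(100)   # distances bunch together
```

With this setup the contrast in 100 dimensions comes out much smaller than in 2, which is one reason distance-based clustering gets harder as the dimension grows.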
Different distance measures exist for clustering. The two most popular are the Euclidean distance and the Manhattan distance (shown in Figure 3).
The Euclidean distance between two points $x = [x_1, \ldots, x_n]$ and $y = [y_1, \ldots, y_n]$ is expressed as:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

whereas the Manhattan distance between the same two points is expressed as:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
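Both measures are straightforward to compute directly from their formulas; the points used below are illustrative.

```python
# Euclidean and Manhattan distances, computed term by term
# from the formulas above. The example points are arbitrary.
import math

def euclidean(x, y):
    # square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of the absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = (1, 2, 3), (4, 6, 3)
print(euclidean(x, y))   # 5.0  (sqrt of 9 + 16 + 0)
print(manhattan(x, y))   # 7    (3 + 4 + 0)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which is one practical way to sanity-check an implementation.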