Clustering is an unsupervised learning technique. We have looked previously at supervised learning problems, where a set of features is given and the goal is to predict the output label or class. In unsupervised learning, the features are given but the label or class is not known. The goal is therefore to discover interesting structure in the data, such as patterns or groupings.

Unsupervised learning is challenging: because there is no label for the data, the exercise tends to be more subjective, and it is usually performed as part of an exploratory data analysis. It is also hard to assess the quality of the results, since there is no universally accepted mechanism for validation. In classification it is possible to check how well a model is performing by comparing the predicted labels with the actual labels, but this is not usually possible in an unsupervised learning scenario.

In clustering, the task is to find subgroups, or clusters, within a data set. The aim is to divide or partition the data into distinct homogeneous groups such that the observations within each group are similar to each other, while observations in different groups are dissimilar. To achieve this, there must be a formal definition of what makes observations similar and what makes them different. This definition is domain-specific and therefore depends largely on knowledge of the data being observed. For example, the criteria that make marketing data similar or different may not be the same as those used for a biological data set.

 

Figure 1. An example of clustering. (source: Noel Bambrick[1])

 
Clustering is applied to a wide range of problem domains. One example is cancer detection. If a data set contains tissue samples from cancer patients, each described by a number of features, clustering could be used to find the similarity between the different tissues and even to discover unknown subtypes of the disease. This is therefore an unsupervised problem, since we are trying to discover distinct structures and subgroups within the data. Another area of application is marketing. For example, a data set of customers' purchase records may include household income, occupation, distance from the nearest urban area, number of dependents, and so on. The goal could be to segment the customers and identify subgroups of people who are more likely to purchase a particular product.

 


[1] http://blog.aylien.com/new-feature-news-api-clustering/

Clustering differs from classification since clustering is an unsupervised learning problem while classification is a supervised learning problem. The table below summarises some of the key differences.

 

A large number of clustering algorithms are available in the literature, with some of these algorithms using the same clustering technique. Clustering techniques have to do with the underlying principles governing how the clusters are created rather than the specifics of a method. Some of the most well-known clustering techniques include hierarchical and partition clustering, density-based clustering, model-based clustering and grid-based clustering, with the most popular being the hierarchical and partitioning techniques.

  1. Hierarchical clustering: This method constructs the clusters by recursively partitioning the instances. This can be done in a top-down fashion, where the method is known as divisive, or in a bottom-up fashion, where the method is known as agglomerative.

In agglomerative hierarchical clustering, each element is seen as a cluster of its own at the beginning. Then the clusters are aggregated or merged until a desired cluster structure is obtained.

In divisive hierarchical clustering, all the elements are considered to belong to a single cluster at the beginning. The cluster is then divided into smaller clusters. The smaller clusters are further sub-divided until the desired cluster structure is obtained.

Hierarchical clustering is also known as connectivity-based clustering and is based on the idea that similar objects tend to be close to each other, whereas dissimilar objects are far away from each other. At each step, the technique uses various criteria, usually based on a measure of the clusters' proximity, to decide which clusters should be aggregated or divided.
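To make the agglomerative (bottom-up) approach concrete, here is a minimal sketch using SciPy on a small, made-up two-dimensional data set; the points, the Ward merging criterion and the number of clusters cut from the tree are illustrative assumptions, not part of the text above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups of points in two dimensions (illustrative values).
points = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.3, 0.9],   # group A
    [5.0, 5.1], [5.3, 4.8], [4.9, 5.4],   # group B
])

# Each point starts as its own cluster; pairs of clusters are then merged
# step by step according to a proximity criterion (Ward's criterion here).
merges = linkage(points, method="ward")

# Cut the resulting tree to obtain the desired cluster structure, e.g. two clusters.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```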

  2. Partition clustering. In partition clustering, the data points are separated into a number of different partitions or clusters. Starting from an initial partitioning, the data points are then relocated by moving them from one cluster to another. This method typically requires the number of clusters to be provided a priori by the user.

To find an optimal partition, an exhaustive enumeration of all possible partitions would be required. This is, however, not feasible in practice, so heuristics are used instead. In summary, the starting (initial) partition is iteratively improved by moving data points between clusters until a locally optimal partition is reached.

Partition clustering is more applicable to large data sets, since it is difficult to represent such data in the tree form used by the hierarchical approach.
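As a contrast with the hierarchical sketch above, here is a minimal sketch of partition clustering using k-means from scikit-learn; the data, the number of clusters and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# The same kind of toy data as before (illustrative values).
points = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.3, 0.9],
    [5.0, 5.1], [5.3, 4.8], [4.9, 5.4],
])

# The number of clusters must be supplied a priori by the user.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Starting from an initial partitioning, the assignment of points is
# iteratively improved until a locally optimal partition is reached.
labels = kmeans.fit_predict(points)

print(labels)                   # cluster assignment of each data point
print(kmeans.cluster_centers_)  # centres of the final partitions
```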

 

By comparing the two clustering techniques above, the following observations can be made:

Hierarchical clustering

1. There is no need to specify the number of clusters in advance.

2. The hierarchical structure matches human intuition in some application domains.

3. It does not scale well, especially to large data sets.

Partition clustering

1. The user must specify the number of clusters in advance.

2. It scales well to large data sets.

3. It is non-hierarchical, and each data instance is placed in exactly one of the pre-specified number of clusters.

  3. Density-based clustering. This technique assumes that the data points in each cluster are drawn from a certain probability distribution, whereas the overall distribution of the data is assumed to be a mixture of several distributions. The aim is to identify the clusters and their distribution parameters.
  4. Grid-based clustering. This technique first partitions the object space into a finite number of cells that form a grid. The clustering operation is then performed on the grid structure. The density of each cell in the grid is calculated and the cells are sorted according to their densities, which allows cluster centres to be identified and the neighbouring cells to be traversed. This approach differs from the other approaches because they are concerned with the data points themselves, whereas the grid-based approach is concerned with the value space that surrounds the data points.
  5. Model-based clustering. This technique tries to fit the data to a given mathematical model by estimating the parameters that best fit it. It tries to find not only the clusters of the data points, but also the characteristics of each cluster according to the chosen model.
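As a rough illustration of the distribution-fitting idea described above, the following sketch fits a Gaussian mixture model with scikit-learn and reads off the estimated parameters of each cluster; the synthetic data and the number of components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two different normal distributions (illustrative).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.7, size=(50, 2)),
])

# Fit a mixture of two Gaussian components to the data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(points)

labels = gmm.predict(points)  # cluster membership of each point
print(gmm.means_)             # estimated distribution parameters (component means)
print(labels[:10])
```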

As you have already learned, given some data points, the idea of clustering is to group the data such that members of the same cluster are close to each other in terms of similarity, whereas members of different clusters are far apart. The data points in each cluster should therefore be close to each other and relatively well separated from the data points of other clusters, as shown in Figure 2.

               Figure 2. Cluster distances (Source: Medium[1])

Calculating the distance between data points in a two-dimensional space is easy to do and to visualise. However, a problem arises when the dimensionality of the data increases, for example, to 10 or 100 dimensions.

Different distance measures exist for clustering. The two most popular among them are the Euclidean distance and the Manhattan distance (shown in Figure 3).

 

The Euclidean distance between two points x = [x1, …, xn] and y = [y1, …, yn] is expressed as:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

whereas the Manhattan distance between the same two points is expressed as:

$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
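The following short sketch computes both distances with NumPy for a pair of illustrative points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: square root of the sum of squared coordinate differences.
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(x - y))

print(euclidean)  # ~3.606
print(manhattan)  # 5.0
```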

 


[1] https://cdn-images-1.medium.com/max/1600/1*kFAJCaRaOKk2GqqiJDX4Ag.jpeg