Top Machine Learning Algorithms for Clustering
How to find the underlying structure of data
Clustering is a way of grouping a set of data points so that similar points end up in the same group. Clustering algorithms therefore look for similarities or dissimilarities among data points. Clustering is an unsupervised learning method, so there are no labels associated with the data points. Clustering algorithms try to find the underlying structure of the data.
There are different approaches and algorithms for clustering tasks, which can be divided into three sub-categories:
Partition-based clustering
Hierarchical clustering
Density-based clustering
All of these approaches aim to group data into clusters; the methods they use differ slightly. I will explain each approach in detail along with a common algorithm for each. Before diving in, I would like to point out the difference between clustering and classification.
Clustering vs Classification
Observations (or data points) in a classification task have labels. Each observation is classified according to some measurements. Classification algorithms try to model the relationship between measurements (features) on observations and their assigned class. Then the model predicts the class of new observations.
Observations (or data points) in clustering do not have labels. We expect the model to find structures in the dataset so that similar observations can be grouped into clusters. We basically ask the model to label observations.
Partition-based clustering
Partition-based clustering techniques try to create partitions of data based on a distance measurement applied to data points. The most common algorithm of this approach is k-means clustering.
K-means clustering aims to partition data into k clusters in a way that data points in the same cluster are similar and data points in different clusters are far apart.
Similarity of two points is determined by the distance between them.
There are many ways to measure distance. Euclidean distance (Minkowski distance with p=2) is one of the most commonly used distance measurements.
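As a quick illustration, here is a minimal sketch using NumPy (which the article does not mention explicitly, so treat it as my assumption) showing that Euclidean distance is just the Minkowski distance with p=2; the two points are made up:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Minkowski distance with p=2 gives the same value
p = 2
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)

print(euclidean, minkowski)  # both are 5.0
```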
K-means clustering tries to minimize distances within a cluster and maximize the distances between clusters. The k-means algorithm cannot determine the number of clusters itself; we need to define it when creating the KMeans object, which may be a challenging task.
K-means is an iterative process built on the expectation-maximization algorithm. After the number of clusters is determined, it works by executing the following steps:
Randomly select centroids (center of cluster) for each cluster.
Calculate the distance of all data points to the centroids.
Assign data points to the closest cluster.
Find the new centroids of each cluster by taking the mean of all data points in the cluster.
Repeat steps 2, 3, and 4 until the assignments converge and the cluster centers stop moving.
Note: Initial centroids are chosen randomly, which may cause the final clusters to differ from run to run. To mitigate this, scikit-learn provides the n_init parameter: the k-means algorithm runs n_init times with different initial centroids, and the best of these runs is kept as the final result.
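Here is a minimal sketch of this workflow with scikit-learn's KMeans; the toy data and parameter values are illustrative assumptions, not from the article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two loosely separated groups of 2-D points (illustrative only)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# n_clusters must be chosen by us; n_init re-runs the algorithm with
# different random initial centroids and keeps the best run
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids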
Pros:
Relatively fast
Scalable for large data sets
Able to choose the positions of the initial centroids in a smart way (k-means++ initialization), which speeds up convergence
Guarantees convergence
Cons:
The number of clusters must be pre-determined. The k-means algorithm cannot guess how many clusters exist in the data, and determining the number of clusters may well be a challenging task.
Can only draw linear boundaries. If there is a non-linear structure separating groups in the data, k-means will not be a good choice.
Slows down as the number of samples increases, because at each step the k-means algorithm accesses all data points and computes distances. An alternative is to use a subset of the data points to update the centroid locations (i.e. sklearn.cluster.MiniBatchKMeans; see the sketch after this list)
Sensitive to outliers
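A hedged sketch of the mini-batch variant mentioned above; the random data, batch size, and cluster count are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Random data standing in for a large dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))

# Each centroid update uses only batch_size samples instead of all points
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=0)
mbk.fit(X)
print(mbk.cluster_centers_)
```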
Hierarchical clustering
This approach works by building a hierarchy of clusters. Hierarchical clustering means creating a tree of clusters by iteratively grouping or separating data points. There are two types of hierarchical clustering:
Agglomerative clustering
Divisive clustering
One of the advantages of hierarchical clustering is that we do not have to specify the number of clusters (but we can).
Agglomerative clustering is a bottom-up approach. Each data point is assumed to be a separate cluster at first. Then the most similar clusters are iteratively merged. Let’s go over an example to explain the concept clearly.
Suppose we have a dataset consisting of 9 samples, each associated with a number chosen to demonstrate the concept of similarity. At each iteration (or level), the closest numbers (i.e. samples) are merged. As the figure below shows, we start with 9 clusters. The closest ones are merged at the first level, leaving 7 clusters. The number of black lines that intersect the blue lines represents the number of clusters.
Dendrogram
The figure above is called a dendrogram, a diagram representing this tree-based approach. In hierarchical clustering, dendrograms are used to visualize the relationships among clusters.
As we go up, the number of clusters decreases as more samples are combined. After level 6, all samples are combined under one big cluster.
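To draw a dendrogram like the one described above, SciPy's hierarchy module can be used (SciPy and Matplotlib are my assumptions here, since the article only mentions scikit-learn); the nine one-dimensional sample values are made up to mirror the example:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Nine one-dimensional samples (made-up values)
samples = np.array([1, 2, 4, 7, 8, 13, 15, 19, 25]).reshape(-1, 1)

# Agglomerative linkage: iteratively merges the closest clusters
Z = linkage(samples, method="single")

dendrogram(Z, labels=[str(v) for v in samples.ravel()])
plt.title("Dendrogram of the 9 samples")
plt.show()
```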
As noted above, we do not have to specify the number of clusters beforehand. However, it is not wise to combine all data points into one cluster; we should stop merging at some point. Scikit-learn provides two options for this (sketched in the code after the list):
Stop after a number of clusters is reached (n_clusters)
Set a threshold value for linkage (distance_threshold). If the distance between two clusters is above the threshold, these clusters will not be merged.
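A minimal sketch of both stopping options with scikit-learn's AgglomerativeClustering; the data and threshold value are made up for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

# Option 1: stop when a fixed number of clusters is reached
agg_k = AgglomerativeClustering(n_clusters=2)
print(agg_k.fit_predict(X))

# Option 2: stop merging once the linkage distance exceeds a threshold
# (n_clusters must be set to None in this case)
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
print(agg_t.fit_predict(X))
```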
Divisive clustering is not commonly used in practice, so I will mention it only briefly. Put simply, divisive clustering is the opposite of agglomerative clustering: we start with one giant cluster containing all data points, then iteratively split it into smaller clusters. It is a top-down approach.
Hierarchical clustering is useful and gives better results if the underlying data has some sort of hierarchy.
Some common use cases of hierarchical clustering:
Genetic or other biological data can be used to create a dendrogram to represent mutation or evolution levels. Phylogenetic trees are used to show evolutionary relationships based on similarities and differences.
Hierarchical clustering is also used for grouping text documents.
Another common use case of hierarchical clustering is social network analysis.
Hierarchical clustering is also used for outlier detection.
Pros
We do not have to specify the number of clusters beforehand, whereas the number of clusters must be specified for the k-means algorithm.
It is easy to implement and interpretable with the help of dendrograms.
Always generates the same clusters, whereas k-means clustering may produce different clusters depending on how the centroids (cluster centers) are initialized.
Cons
It is slower than k-means. Hierarchical clustering takes a long time to run, especially for large datasets.
Density-based clustering
Partition-based and hierarchical clustering techniques are highly efficient for compact, roughly spherical clusters. However, when it comes to arbitrarily shaped clusters or detecting outliers, density-based techniques are more efficient.
Consider datasets in which the data points form arbitrary shapes or include outliers. Density-based clustering algorithms are very efficient at finding high-density regions and outliers, and detecting outliers is very important for some tasks, e.g. anomaly detection. One of the most commonly used algorithms of this approach is DBSCAN.
DBSCAN stands for density-based spatial clustering of applications with noise. It can find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
There are two key parameters of DBSCAN:
eps: The distance that specifies the neighborhoods. Two points are considered neighbors if the distance between them is less than or equal to eps.
minPts: Minimum number of data points to define a cluster.
Based on these two parameters, points are classified as core point, border point, or outlier:
Core point: A point is a core point if there are at least minPts number of points (including the point itself) in its surrounding area with radius eps.
Border point: A point is a border point if it is reachable from a core point and there are less than minPts number of points within its surrounding area.
Outlier: A point is an outlier if it is not a core point and not reachable from any core points.
These points may be better explained with visualizations. The following figure is taken from Wikipedia:
In this case, minPts is 4. The red points are core points because there are at least 4 points within their surrounding area with radius eps; this area is shown with circles in the figure. The yellow points are border points because they are reachable from a core point but have fewer than 4 points within their neighborhood. Reachable means being in the surrounding area of a core point. The points B and C have two points (including the point itself) within their neighborhood (i.e. the surrounding area with radius eps). Finally, N is an outlier because it is not a core point and cannot be reached from any core point.
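The core-point rule can also be checked by hand. The following sketch (with made-up points and parameter values) simply counts how many points fall within eps of each point; separating border points from outliers would additionally require checking reachability from a core point:

```python
import numpy as np

# Made-up 2-D points and parameter values
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]])
eps, min_pts = 1.5, 4

for i, p in enumerate(points):
    # Count points (including p itself) within distance eps of p
    distances = np.linalg.norm(points - p, axis=1)
    n_neighbors = np.sum(distances <= eps)
    kind = "core" if n_neighbors >= min_pts else "border or outlier"
    print(f"point {i}: {n_neighbors} neighbors -> {kind}")
```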
Having defined the parameters and the different types of points, we can now talk about how the algorithm works. It is actually quite simple:
minPts and eps are determined.
A starting point is selected at random, and its neighborhood is determined using radius eps. If there are at least minPts points in this neighborhood, the point is marked as a core point and a cluster formation starts. If not, the point is marked as noise. Once a cluster formation starts (say cluster A), all points within the neighborhood of the initial point become part of cluster A. If these new points are also core points, the points in their neighborhoods are added to cluster A as well.
Note: A point that is marked as noise may be revisited and be part of a cluster.
The next step is to randomly choose another point among those not visited in the previous steps; the same procedure then applies.
This process is finished when all points are visited.
The distance between points is determined using a distance measurement method, as in the k-means algorithm. The most commonly used method is Euclidean distance.
By applying these steps, the DBSCAN algorithm is able to find high-density regions and separate them from low-density regions.
A cluster includes core points that are neighbors (i.e. reachable from one another) and all the border points of these core points. The required condition to form a cluster is to have at least one core point. Although very unlikely, we may have a cluster with only one core point and its border points.
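A hedged sketch with scikit-learn's DBSCAN on the classic two-moons dataset; the eps and min_samples values are illustrative, not tuned:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a non-linear structure k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples corresponds to minPts above
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))                        # cluster labels; -1 marks noise (outliers)
print((labels == -1).sum(), "points labeled as noise")
```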
Pros:
Does not require the number of clusters to be specified beforehand.
Performs well with arbitrarily shaped clusters.
Robust to outliers and able to detect them.
Cons:
In some cases, determining an appropriate neighborhood distance (eps) is not easy and requires domain knowledge.
If clusters vary widely in density, DBSCAN is not well suited to defining them. The characteristics of the clusters are determined by the eps-minPts combination, and since we pass a single combination to the algorithm, it cannot generalize well to clusters with very different densities.
Thanks for reading. Please let me know if you have any feedback.