K-Means Clustering for Unsupervised Machine Learning

The Beginner’s Guide to Unsupervised Learning


Artificial Intelligence (AI) and Machine Learning (ML) have revolutionized every aspect of our lives and disrupted how we do business, unlike any other technology in the history of mankind. Such disruption brings many challenges for professionals and businesses. In this article, I will provide an introduction to one of the most commonly used machine learning methods, K-Means.

First things first: What is Machine Learning (ML) anyway?! And is it a new paradigm?

Machine learning is a scientific method that utilizes statistical methods along with the computational power of machines to convert data to wisdom that humans or the machine itself can use for taking certain actions. “It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.” (SaS)

If you think ML is a new paradigm, you should know that the name “machine learning” was coined in 1959 by Arthur Samuel. However, this came after a proposal by Alan Turing in the 1950s, in which he replaced the question “Can machines think?” with “Can machines do what we (as thinking entities) can do?”, or in other words, “Can machines learn?”

So, ML has been around for half a century. However, with the recent advancements in the computational power of machines, and also the sheer amount of data that we are generating, collecting and storing, ML has surfaced as the next big thing in many industries.

What are the Main Fields in Machine Learning?

There are many fields in ML, but we can name the three main fields as:

Supervised Learning (SL): SL is when the ML model is built and trained using a set of inputs (predictors) and desired outputs (targets). Many regression models (simple or multiple) and classification models fall under this category.

Unsupervised Learning (UL): UL is used when the target is not known and the objective is to infer patterns or trends in the data that can inform a decision, or sometimes to convert the problem to an SL problem (also known as Transfer Learning, TL). This article is focused on UL clustering, and specifically, the K-Means method.

Reinforcement Learning (RL): This paradigm is more complex than SL and UL; however, this article provides a simple yet technical definition of RL. Generally, RL is concerned with how an “agent” (e.g. a model) takes actions in an environment, and in each step attempts to maximize a reward (e.g. an optimization function). A good example of RL is route optimization using genetic algorithms and brute force (more on this in later articles).

The graphic below by Abdul Wahid nicely shows these main areas of ML.

From https://www.slideshare.net/awahid/big-data-and-machine-learning-for-businesses , Credit: Abdul Wahid

K-Means Clustering, Defined

K-Means clustering is a method that originated in signal processing, with the objective of partitioning the observations into k clusters in which each observation belongs to the cluster with the nearest mean. These clusters are also known as Voronoi cells in mathematics.

Before getting into the details of Python codes, let’s look at the fundamentals of K-Means clustering.

How Does K-Means Cluster the Observations?

The main input to the clustering algorithm is the number of clusters (herein called k). k determines the clustering mechanism and how the clusters form. It can be challenging to come up with the number of clusters before you know which observations should belong to which cluster, especially because you are dealing with an unsupervised learning problem.

There are other unsupervised learning methods to determine the right number of clusters for a K-Means clustering method, including Hierarchical Clustering, but we are not getting into that topic in this article. Our assumption is that you know the number of clusters, or have a general sense of the right number of clusters. The best approach would be to do a few rounds of trial and error to find the best number of clusters.

Once you know the number of clusters, there are three different ways to assign the cluster centers:

Manually,

Randomly, and

“k-means++” in scikit-learn

The latter selects the initial cluster centers for k-means clustering in a smart way to speed up convergence. You can find more on this here.
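To make the three options concrete, here is how each one maps onto the init parameter of scikit-learn’s KMeans (a small sketch; the data and the manual center values are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 3 well-separated blobs
X_init_demo, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# 1. Manually: pass an explicit array of starting centers (values invented)
manual_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
km_manual = KMeans(n_clusters=3, init=manual_centers, n_init=1).fit(X_init_demo)

# 2. Randomly: pick k observations at random as the starting centers
km_random = KMeans(n_clusters=3, init="random", random_state=42).fit(X_init_demo)

# 3. "k-means++" (the default): spread the starting centers apart
#    to speed up convergence
km_pp = KMeans(n_clusters=3, init="k-means++", random_state=42).fit(X_init_demo)
```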

It should be noted that the initial cluster centers do have an effect on the final clustering results, for reasons that are explained next. Given the initial cluster centers, the algorithm repeats the following steps until it converges:

Illustration of K-Means Algorithm, Wikipedia Creative Commons, credit: Chire

Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance; this is intuitively the “nearest” mean.

Update step: Calculate the new means (centroids) of the observations in the new clusters.

Check for Convergence: The algorithm assumes convergence when the assignments no longer change.

One thing to keep in mind is that K-Means almost always converges, but it is not guaranteed to find the optimal solution, because it terminates the cycle at a local minimum and may never reach the global minimum.
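The three steps above can be sketched in plain NumPy (a minimal illustration that assumes no cluster ever becomes empty; this is not the scikit-learn implementation):

```python
import numpy as np

def kmeans_sketch(X, initial_centers, max_iter=100):
    """Minimal K-Means loop: assignment, update, convergence check."""
    centers = initial_centers.astype(float).copy()
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each new center is the mean of its assigned points
        new_centers = np.array(
            [X[labels == j].mean(axis=0) for j in range(len(centers))]
        )
        # Convergence check: stop when the centers (and hence the
        # assignments) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```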

Alright! Enough about the algorithm. Let’s get to the exciting part which is the Python code.

Note on Scaling of Data for K-Means

Since K-Means works based on the distance of data points to a cluster center, scaling of data to the same scale is critical to the accuracy of the results.
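As a tiny illustration (the feature values are invented), scikit-learn’s StandardScaler brings each feature to zero mean and unit variance before clustering:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. age vs. income:
# without scaling, income would dominate every Euclidean distance
X_raw = np.array([[25.0, 50_000.0],
                  [40.0, 80_000.0],
                  [35.0, 60_000.0]])

# Rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_raw)
```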

K-Means in Python

We are going to use the scikit-learn library for this purpose. You can read the documentation for the K-Means clustering package here.

Let’s import the packages first.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

To illustrate how this algorithm works, we are going to use the make_blobs function in sklearn.datasets. The code snippet below will generate 5 clusters. We will not be using the cluster designations (y) here for our clustering.

# Create 5 blobs of 2,000 random data points
n_samples = 2000
random_state = 42  # value assumed, for reproducibility
X, y = make_blobs(n_samples=n_samples, centers=5, random_state=random_state)

Let’s visualize the clusters to see where they are.

# Plot the random blob data
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], s=5)
plt.title("No Clusters Assigned")

Looking at the blobs, we can see that we have three different “zones”, consisting of 5 blobs:

There is a blob in the lower left,

There are two blobs in the upper left zone in the general vicinity of each other, and

There are two blobs, almost overlapping, in the middle right zone.

Let’s see how K-Means clustering handles this. We are going to look at different cluster numbers, between 1 and 10. The code is provided below, and the resulting graphs are put together in an animation.

# Plot the data and color code based on clusters,
# changing the number of clusters
for i in range(1, 11):
    y_pred = KMeans(n_clusters=i, random_state=random_state).fit_predict(X)
    # Plotting the clusters
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=5)
    plt.title(f"Number of Clusters: {i}")
    plt.show()

The animated plot was made using the imageio package. For more information on this, refer to Johannes Huessy’s blog (Click Here).

Evaluating the K-Means Clustering Algorithm

So you have done the clustering, but how good is this clustering, and how can you measure the performance of the algorithm?

Inertia: We talked about one metric in the previous section, which is the within-cluster sum of squares of distances to the cluster center. This is called “inertia”. The algorithm aims to choose centroids that minimize the inertia, which can be recognized as a measure of how internally coherent clusters are.

You can use the following code to get the inertia score for the clusters:

km = KMeans(n_clusters=5, random_state=random_state)  # e.g. k = 5
km.fit(X)
km.inertia_
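As a sanity check, the reported inertia is exactly the within-cluster sum of squared distances to the centroids; the small self-contained sketch below (blob parameters invented) verifies this by hand:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative blob data (parameters assumed)
X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=42)
km_demo = KMeans(n_clusters=4, random_state=42).fit(X_demo)

# Sum of squared distances from each point to its assigned centroid
manual_inertia = sum(
    ((X_demo[km_demo.labels_ == j] - km_demo.cluster_centers_[j]) ** 2).sum()
    for j in range(4)
)
```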

The code below calculates the inertia score for the 10 different cluster numbers we used before, and saves them in a list that we use to plot (more on this later). The plot of inertia score vs. the number of clusters is called the “Elbow Curve”.

Silhouette Score: Silhouette score is based on a combination of cluster Cohesion (how close points in a cluster are relative to each other) and Separation (how far the clusters are relative to each other).

Silhouette score is between -1 (poor clustering) and +1 (excellent clustering).
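Concretely, each point’s score compares a, the mean distance to the other points in its own cluster (cohesion), with b, the mean distance to the points in the nearest other cluster (separation), as s = (b - a) / max(a, b); the overall score is the mean of s over all points. A tiny made-up example:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated clusters: the score should be close to +1
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
pts_labels = np.array([0, 0, 1, 1])

score = silhouette_score(pts, pts_labels)
```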

Calculating Inertia and Silhouette Scores

# Calculating the inertia and silhouette score
inertia = []
sil = []

# Changing the number of clusters
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=random_state)
    km.fit(X)
    inertia.append((k, km.inertia_))
    sil.append((k, silhouette_score(X, km.labels_)))

Now that we have the inertia and silhouette scores, let’s plot them and evaluate the performance of the clustering algorithm.

fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Plotting the Elbow Curve
x_iner = [x[0] for x in inertia]
y_iner = [x[1] for x in inertia]
ax[0].plot(x_iner, y_iner)
ax[0].set_xlabel('Number of Clusters')
ax[0].set_ylabel('Inertia')
ax[0].set_title('Elbow Curve')

# Plotting the Silhouette Score
x_sil = [x[0] for x in sil]
y_sil = [x[1] for x in sil]
ax[1].plot(x_sil, y_sil)
ax[1].set_xlabel('Number of Clusters')
ax[1].set_ylabel('Silhouette Score')
ax[1].set_title('Silhouette Score Curve')

You can see that the inertia score always drops as you increase the number of clusters. However, the elbow curve tells you that above 4 clusters, the change in inertia is no longer significant. Now, let’s look at the silhouette curve. You can see that the maximum score occurs at 4 clusters (the higher the silhouette score, the better the clustering).

Coupling the elbow curve with the silhouette score curve provides invaluable insight into the performance of K-Means.

Other Use Cases for K-Means

The K-Means method has many use cases, from image vectorization to text document clustering. You can find some examples here.

I hope you found this guide useful in understanding the K-Means clustering method using Python’s scikit-learn package. Stay tuned for more on similar topics!

Nick Minaie, PhD ( LinkedIn Profile ) is a senior consultant and a visionary data scientist, and represents a unique combination of leadership skills, world-class data-science expertise, business acumen, and the ability to lead organizational change. His mission is to advance the practice of Artificial Intelligence (AI) and Machine Learning in the industry.
