Stop Using Elbow Method in K-means Clustering

The elbow method is a graphical technique for finding the optimal ‘K’ in k-means clustering. You plot a cost curve against K and pick the K-value at which the curve bends like an elbow. However, this is not the best way to find the optimal ‘K’.

Elbow Method Definition

The elbow method is a graphical method for finding the optimal K value in a k-means clustering algorithm. The elbow graph shows the within-cluster sum of squares (WCSS) on the y-axis for different values of K on the x-axis. The optimal K value is the point at which the graph forms an elbow.

In this blog, we will look at the most practical way of finding the number of clusters (or K) for your k-means clustering algorithm and why the elbow method isn’t the answer.

Following are the topics that we will cover in this blog:

What is K-means clustering?
What is the elbow method?
What are the drawbacks of the elbow method?
Why the Silhouette Method is better than the elbow method.
How to do the elbow method in Python.
How to do the Silhouette method in Python.

Let’s get started.

What Is K-means Clustering?

K-means clustering is a distance-based, unsupervised clustering algorithm that groups data points that are close to each other into a given number of clusters, K.
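To make this concrete, here is a minimal sketch using scikit-learn's KMeans on the Iris data set (the same data set used later in this blog). The choice of 3 clusters and the n_init and random_state values are assumptions made only for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

# Load the four Iris measurements (150 rows, 4 columns)
X = datasets.load_iris().data

# Group the points into 3 clusters; n_init and random_state are illustrative choices
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(labels[:10])                # cluster index assigned to the first 10 points
print(km.cluster_centers_.shape)  # one centroid per cluster
```

Each point receives the index of its nearest centroid, and each centroid is the mean of the points assigned to it.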

What Is the Elbow Method?

As I mentioned, the elbow method involves finding the optimal K via a graphical representation. It works by computing the within-cluster sum of squares (WCSS), i.e., the sum of the squared distances between the points in each cluster and that cluster’s centroid.

The elbow graph shows WCSS values on the y-axis corresponding to the different values of K on the x-axis. When we see an elbow shape in the graph, we pick the K-value where the elbow gets created. We can call this the elbow point. Beyond the elbow point, increasing the value of ‘K’ does not lead to a significant reduction in WCSS.
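The WCSS values behind an elbow graph can be computed directly. The sketch below assumes the Iris data set and a range of K from 1 to 10; the helper name wcss is hypothetical. It also checks that the hand-computed value agrees with the inertia_ attribute that scikit-learn's KMeans exposes.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

def wcss(X, labels, centers):
    """Sum of squared distances from each point to its own cluster centroid."""
    return float(np.sum((X - centers[labels]) ** 2))

wcss_by_k = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss_by_k[k] = wcss(X, km.labels_, km.cluster_centers_)
    # scikit-learn reports the same quantity as `inertia_`
    assert np.isclose(wcss_by_k[k], km.inertia_)

# These are the y-values of the elbow graph, one per K
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

Plotting these values against K gives the elbow graph described above.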

What Are the Drawbacks of the Elbow Method?

The elbow curve is expected to look like this:

[Figure: an idealized elbow curve with a clearly defined elbow point]

But here’s what it typically looks like:

[Figure: a typical elbow curve with no clearly defined elbow point]

So, in most real-world data sets, the curve has no clear elbow point from which to identify the right ‘K’. This makes it easy to pick the wrong K.

Why the Silhouette Method Is Better Than the Elbow Method

The Silhouette score is a very useful way to find the right K when the elbow curve doesn’t show a clear elbow point.

The value of the Silhouette score ranges from -1 to 1. Following is the interpretation of the Silhouette score.

1: Points are perfectly assigned in a cluster and clusters are easily distinguishable.
0: Clusters are overlapping.
-1: Points are wrongly assigned in a cluster.


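Silhouette Score = (b - a) / max(a, b)

Where:

a = average intra-cluster distance, i.e., the average distance between a point and the other points in its own cluster.
b = average nearest-cluster distance, i.e., the average distance between a point and the points in the nearest neighboring cluster.

The Silhouette score of a clustering is the mean of this value over all points.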
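The formula can be checked by hand. The sketch below assumes the Iris data set, K=3 and Euclidean distance; the helper name silhouette_of_point is hypothetical. It computes a and b for every point and compares the result with scikit-learn's silhouette_samples and silhouette_score.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = datasets.load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

def silhouette_of_point(i, X, labels):
    """Silhouette coefficient (b - a) / max(a, b) for the point at index i."""
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance to every point
    own = labels == labels[i]
    if own.sum() == 1:
        return 0.0  # a cluster with a single point scores 0 by convention
    # a: mean distance to the other points in the same cluster (the point itself excluded)
    a = dists[own].sum() / (own.sum() - 1)
    # b: mean distance to the points of the nearest other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of_point(i, X, labels) for i in range(len(X))])

# The per-point values should agree with scikit-learn up to floating-point error
assert np.allclose(manual, silhouette_samples(X, labels), atol=1e-6)
print(round(manual.mean(), 3), round(silhouette_score(X, labels), 3))
```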

How to Use the Elbow Method in Python

Let’s compare the elbow method and the silhouette score using the Iris data set. We’ll start with creating an elbow curve in Python.

The elbow curve can be created using the following code:

# Install yellowbrick to visualize the elbow curve (in a notebook: !pip install yellowbrick)

from sklearn import datasets
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Load the Iris data set
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Instantiate the clustering model and visualizer
km = KMeans(random_state=42)
visualizer = KElbowVisualizer(km, k=(2, 10))

visualizer.fit(X)   # Fit the data to the visualizer
visualizer.show()   # Finalize and render the figure

[Figure: elbow curve for the Iris data set produced by KElbowVisualizer]

The above graph selects an elbow point at K=4, but K=3 also looks like a plausible elbow point. So, it’s not clear which of the two is the right elbow point.

How to Use the Silhouette Method in Python

Let’s validate the value of K with Silhouette plots, using the code below.

from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from yellowbrick.cluster import SilhouetteVisualizer

# Load the Iris data set
iris = datasets.load_iris()
X = iris.data
y = iris.target

# One subplot per value of K, arranged in a 2x2 grid
fig, ax = plt.subplots(2, 2, figsize=(15, 8))
for i in [2, 3, 4, 5]:
    # Create a KMeans instance for each number of clusters
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=42)
    # Map K = 2, 3, 4, 5 to grid positions (0, 0), (0, 1), (1, 0), (1, 1)
    q, mod = divmod(i, 2)
    # Create a SilhouetteVisualizer for the KMeans instance and fit it
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(X)

[Figure: Silhouette plots for the Iris data set with K=2, 3, 4 and 5]

The Silhouette score is highest (0.68) for K=2, but that alone is not sufficient to select the optimal K.

The following conditions should be checked to pick the right ‘K’ using the Silhouette plots:

For a particular K, every cluster should have a Silhouette score greater than the average score of the data set, which is represented by the red dotted line. The x-axis represents the Silhouette score. K=4 and K=5 get eliminated because they don’t meet this condition.
There shouldn’t be wide fluctuations in the size of the clusters. The width of each cluster represents the number of data points in it. For K=2, the blue cluster is almost twice as wide as the green cluster. For K=3, this blue cluster gets broken down into two sub-clusters, which results in clusters of uniform size.
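The two conditions above can also be read off numerically instead of from the plots. This is only a rough sketch: it assumes that "a cluster has a score greater than the average" means that the highest per-point Silhouette value in the cluster exceeds the data set average, and it reports cluster sizes rather than judging them. The visual reading of the plots may still differ from what these numbers suggest.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X = datasets.load_iris().data

summary = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores = silhouette_samples(X, labels)
    average = scores.mean()  # corresponds to the red dotted line in the plots
    # Condition 1 (assumed reading): every cluster reaches past the average line
    all_above = all(scores[labels == c].max() > average for c in range(k))
    # Condition 2: cluster sizes, to be compared for wide fluctuations
    sizes = np.bincount(labels, minlength=k).tolist()
    summary[k] = (average, all_above, sizes)
    print(k, round(average, 3), all_above, sizes)
```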

So, the silhouette plot approach gives us K=3 as the optimal value.

We should select K=3 for the final clustering on the Iris data set.

import plotly.graph_objects as go  # for the 3D plot

# K-means using K = 3 (KMeans and X come from the earlier code)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# 3D plot
Scene = dict(
    xaxis=dict(title='sepal_length -->'),
    yaxis=dict(title='sepal_width--->'),
    zaxis=dict(title='petal_length-->'),
)
labels = kmeans.labels_
trace = go.Scatter3d(
    x=X[:, 0], y=X[:, 1], z=X[:, 2],
    mode='markers',
    marker=dict(color=labels, size=10, line=dict(color='black', width=10)),
)
layout = go.Layout(margin=dict(l=0, r=0), scene=Scene, height=800, width=800)
data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

[Figure: 3D scatter plot of the Iris data set colored by the K=3 cluster labels]

I also validated the output clusters by indexing/checking the distribution of the input features within the clusters.
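One way such a check might look is sketched below. It assumes pandas is available and uses a per-cluster mean of each input feature as the summary, which is an illustrative choice rather than the exact validation used for this blog.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Attach the K=3 cluster label to every row
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Mean of each input feature within each cluster
profile = df.groupby("cluster").mean()
print(profile.round(2))

# Number of points in each cluster
print(df["cluster"].value_counts().sort_index())
```

Clusters whose feature means are clearly separated are easier to interpret than clusters that differ only marginally.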


Elbow Method vs. Silhouette Method

The elbow curve and Silhouette plots are both very useful techniques for finding the optimal K for k-means clustering. In real-world data sets, you will find quite a lot of cases where the elbow curve is not sufficient to find the right ‘K’. In such cases, you should use Silhouette plots to figure out the optimal number of clusters for your data set.

I would recommend using both techniques together to figure out the optimal K for k-means clustering.
