K-means Clustering
The K-means algorithm is a method to automatically cluster similar data points together.
- Concretely, we are given a training set {x^(1), ..., x^(m)}, and we want to group the data into a few cohesive “clusters”.
K-means is an iterative procedure that starts by guessing the initial centroids, and then refines this guess by
- Repeatedly assigning examples to their closest centroids, and then
- Recomputing the centroids based on the assignments.
In pseudocode, the K-means algorithm is as follows:
# Initialize centroids
# K is the number of clusters
centroids = kMeans_init_centroids(X, K)

for iter in range(iterations):
    # Cluster assignment step:
    # Assign each data point to the closest centroid.
    # idx[i] corresponds to the index of the centroid
    # assigned to example i
    idx = find_closest_centroids(X, centroids)

    # Move centroid step:
    # Compute means based on centroid assignments
    centroids = compute_centroids(X, idx, K)
The inner loop of the algorithm repeatedly carries out two steps:
- Assigning each training example x^(i) to its closest centroid, and
- Recomputing the mean of each centroid using the points assigned to it.
The K-means algorithm will always converge to some final set of means for the centroids.
However, the converged solution may not always be ideal and depends on the initial setting of the centroids.
- Therefore, in practice the K-means algorithm is usually run a few times with different random initializations.
- One way to choose between these different solutions from different random initializations is to choose the one with the lowest cost function value (distortion).
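The starter code provides a kMeans_init_centroids helper for this (it is called in the pseudocode below). A minimal sketch of such an initializer, together with a hypothetical distortion function for picking the best run, might look like the following (both bodies are assumptions, not the exact starter code):

import numpy as np

def kMeans_init_centroids(X, K):
    """Pick K distinct training examples at random as the initial centroids."""
    # Randomly reorder the indices of the examples
    randidx = np.random.permutation(X.shape[0])
    # Take the first K examples as the initial centroids
    return X[randidx[:K]]

def distortion(X, centroids, idx):
    """Mean squared distance from each example to its assigned centroid."""
    return np.mean(np.sum((X - centroids[idx]) ** 2, axis=1))

The run whose final distortion value is lowest is then kept.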
We will implement the two phases of the K-means algorithm separately in the next sections. We will start by completing find_closest_centroids and then proceed to compute_centroids.
Finding closest centroids
With the find_closest_centroids function, we will compute the closest centroid for every example.
- This function takes the data matrix X and the locations of all centroids inside centroids.
- It should output a one-dimensional array idx (which has the same number of elements as X) that holds the index of the closest centroid (a value in {0, ..., K−1}, where K is the total number of centroids) for every training example. (Note: the index range 0 to K−1 differs slightly from what is shown in the lectures (i.e. 1 to K) because Python list indices start at 0 instead of 1.)
- Specifically, for every example x^(i) we set

c^(i) := arg min_j ||x^(i) − μ_j||²

where
- c^(i) is the index of the centroid that is closest to x^(i), and
- μ_j is the position (value) of the j-th centroid (stored in centroids in the starter code), and
- ||x^(i) − μ_j|| is the L2-norm.
def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example

    Args:
        X (ndarray): (m, n) Input values
        centroids (ndarray): (K, n) centroids

    Returns:
        idx (array_like): (m,) closest centroids
    """
    # Set K
    K = centroids.shape[0]

    # You need to return the following variables correctly
    idx = np.zeros(X.shape[0], dtype=int)

    for i in range(X.shape[0]):
        # Distance from example i to each of the K centroids
        dist = []
        for j in range(K):
            norm_ij = np.linalg.norm(X[i] - centroids[j])
            dist.append(norm_ij)
        # Assign example i to the nearest centroid
        idx[i] = np.argmin(dist)

    return idx
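The double loop above is easy to read but slow for large datasets. The same computation can be vectorized with NumPy broadcasting; a sketch (the function name here is ours, not part of the starter code):

def find_closest_centroids_vec(X, centroids):
    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n);
    # the norm over axis 2 gives all pairwise distances as (m, K)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the nearest centroid for each of the m examples
    return np.argmin(dists, axis=1)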
Computing centroid means
For every centroid μ_j we set

μ_j = (1 / |C_j|) Σ_{i ∈ C_j} x^(i)

where
- C_j is the set of examples that are assigned to centroid j, and
- |C_j| is the number of examples in that set.
def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the
    data points assigned to each centroid.

    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each
                       example in X. Concretely, idx[i] contains the index of
                       the centroid closest to example i
        K (int):       number of centroids

    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    m, n = X.shape
    centroids = np.zeros((K, n))

    for k in range(K):
        # All examples currently assigned to centroid k
        points = X[idx == k]
        # The new centroid is the mean of those examples
        centroids[k] = np.mean(points, axis=0)

    return centroids
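The per-cluster loop can likewise be vectorized. A sketch (the helper name is ours; it assumes every cluster has at least one assigned point, since an empty cluster would cause a division by zero here, just as np.mean over an empty slice would produce NaN above):

def compute_centroids_vec(X, idx, K):
    n = X.shape[1]
    sums = np.zeros((K, n))
    # Accumulate each example into the row of its assigned centroid
    np.add.at(sums, idx, X)
    # Count how many examples are assigned to each centroid
    counts = np.bincount(idx, minlength=K)
    # Divide row-wise; assumes counts[k] > 0 for every cluster
    return sums / counts[:, None]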
K-means on a sample dataset
import os
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
from utils import *

# Load the data stored in array X from the data folder (ex7data2.mat)
data = sio.loadmat(os.path.join('dataPath', 'ex7data2.mat'))
X = data['X']

print("First five elements of X are:\n", X[:5])
print('The shape of X is:', X.shape)
First five elements of X are:
[[1.84207953 4.6075716 ]
[5.65858312 4.79996405]
[6.35257892 3.2908545 ]
[2.90401653 4.61220411]
[3.23197916 4.93989405]]
The shape of X is: (300, 2)
# Select an initial set of centroids (3 Centroids)
initial_centroids = np.array([[3,3], [6,2], [8,5]])
# Find closest centroids using initial_centroids
idx = find_closest_centroids(X, initial_centroids)
# Print closest centroids for the first three elements
print("First three elements in idx are:", idx[:3])
# UNIT TEST
from public_tests import *
find_closest_centroids_test(find_closest_centroids)
First three elements in idx are: [0 2 1]
def run_kMeans(X, initial_centroids, max_iters=10, plot_progress=False):
    """
    Runs the K-Means algorithm on data matrix X, where each row of X
    is a single example
    """
    # Initialize values
    m, n = X.shape
    K = initial_centroids.shape[0]
    centroids = initial_centroids
    previous_centroids = centroids
    idx = np.zeros(m)
    plt.figure(figsize=(8, 6))

    # Run K-Means
    for i in range(max_iters):
        # Output progress
        print("K-Means iteration %d/%d" % (i, max_iters - 1))

        # For each example in X, assign it to the closest centroid
        idx = find_closest_centroids(X, centroids)

        # Optionally plot progress
        if plot_progress:
            plot_progress_kMeans(X, centroids, previous_centroids, idx, K, i)
            previous_centroids = centroids

        # Given the memberships, compute new centroids
        centroids = compute_centroids(X, idx, K)

    plt.show()
    return centroids, idx
# Set initial centroids
initial_centroids = np.array([[3,3],[6,2],[8,5]])
# Number of iterations
max_iters = 10
# Run K-Means
centroids, idx = run_kMeans(X, initial_centroids, max_iters, plot_progress=True)
K-Means iteration 0/9
K-Means iteration 1/9
K-Means iteration 2/9
K-Means iteration 3/9
K-Means iteration 4/9
K-Means iteration 5/9
K-Means iteration 6/9
K-Means iteration 7/9
K-Means iteration 8/9
K-Means iteration 9/9
Image compression with K-means
In a straightforward 24-bit color representation of an image, each pixel is represented as three 8-bit unsigned integers (ranging from 0 to 255) that specify the red, green, and blue intensity values. This encoding is often referred to as the RGB encoding.
- Our image contains thousands of colors, and in this part of the exercise, we will reduce the number of colors to 16.
- By making this reduction, it is possible to represent (compress) the photo in an efficient way.
- Specifically, we only need to store the RGB values of the 16 selected colors, and for each pixel in the image you now need to only store the index of the color at that location (where only 4 bits are necessary to represent 16 possibilities).
In this part, we will use the K-means algorithm to select the 16 colors that will be used to represent the compressed image.
- Concretely, we will treat every pixel in the original image as a data example and use the K-means algorithm to find the 16 colors that best group (cluster) the pixels in the 3-dimensional RGB space.
- Once we have computed the cluster centroids on the image, we will use the 16 colors to replace the pixels in the original image.
Load a sample Image
original_img = plt.imread('bird_small.png')
# Visualizing the image
plt.imshow(original_img)
Now let’s check the dimensions of the image
print("Shape of original_img is:", original_img.shape)
Shape of original_img is: (128, 128, 3)
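Note that plt.imread returns a PNG image as a floating-point array with values already scaled to [0, 1], so the pixel values can be fed to K-means directly, without any further normalization.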
K-Means on image pixels
K = 16
max_iters = 10
X_img = np.reshape(original_img, (original_img.shape[0] * original_img.shape[1], 3))
# Using the function we have implemented above.
initial_centroids = kMeans_init_centroids(X_img, K)
# Run K-Means - this can take a couple of minutes depending on K and max_iters
centroids, idx = run_kMeans(X_img, initial_centroids, max_iters)
print("Shape of idx:", idx.shape)
print("Closest centroid for the first five elements:", idx[:5])
# Plot the colors of the image and mark the centroids
plot_kMeans_RGB(X_img, centroids, idx, K)
Shape of idx: (16384,)
Closest centroid for the first five elements: [14 14 14 14 14]
The plot shows all the colors found in the original image. As mentioned earlier, the color of each pixel is represented by RGB values so the plot should have 3 axes — R, G, and B. You’ll notice a lot of dots below representing thousands of colors in the original image. The red markers represent the centroids after running K-means. These will be the 16 colors that we will use to compress the image.
# Visualize the 16 colors selected
show_centroid_colors(centroids)
Compress the image
After finding the K = 16 colors to represent the image, we can now assign each pixel position to its closest centroid using the find_closest_centroids function. This allows us to represent the original image using the centroid assignments of each pixel.
Notice that we have significantly reduced the number of bits that are required to describe the image.
- The original image required 24 bits (i.e. 8 bits x 3 channels in RGB encoding) for each one of the 128×128 pixel locations, resulting in total size of 128×128×24=393,216 bits.
- The new representation requires some overhead storage in form of a dictionary of 16 colors, each of which require 24 bits, but the image itself then only requires 4 bits per pixel location.
- The final number of bits used is therefore 16×24+128×128×4=65,920 bits, which corresponds to compressing the original image by about a factor of 6.
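A quick sanity check of this bit accounting (a sketch using the image dimensions and palette size from above):

# Bit accounting for the compression described above
h, w = 128, 128                                   # image dimensions
palette = 16                                      # number of colors kept
original_bits = h * w * 24                        # 24 bits per pixel
compressed_bits = palette * 24 + h * w * 4        # palette + 4-bit indices
print(original_bits, compressed_bits)             # 393216 65920
print(round(original_bits / compressed_bits, 2))  # ~5.97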
# Find the closest centroid of each pixel
idx = find_closest_centroids(X_img, centroids)
# Replace each pixel with the color of the closest centroid
X_recovered = centroids[idx, :]
# Reshape image into proper dimensions
X_recovered = np.reshape(X_recovered, original_img.shape)
Finally, we can view the effects of the compression by reconstructing the image based only on the centroid assignments.
- Specifically, we replaced each pixel with the value of the centroid assigned to it.
- The figure below shows a sample reconstruction. Even though the resulting image retains most of the characteristics of the original, we will also see some compression artifacts because fewer colors are used.
- The code below shows how the image is reconstructed using the 16 colors selected earlier.
# Display original image
fig, ax = plt.subplots(1, 2, figsize=(16, 16))
plt.axis('off')

ax[0].imshow(original_img)
ax[0].set_title('Original')
ax[0].set_axis_off()

# Display compressed image
ax[1].imshow(X_recovered)
ax[1].set_title('Compressed with %d colors' % K)
ax[1].set_axis_off()
The complete code can be found in this GitHub repository.