K-Means Clustering

This is a simple representation learning algorithm, which groups similar data points together in clusters.

Visualization

Link:

Kaggle Notebook on Visualizing High Dimensional Clusters
Visualizing the clusters article

It seems like this algorithm is used quite a lot!

Poker hand clustering
Object Detection for Anchor Boxes size clustering
Image Segmentation to downsample an image to $K$ colours

The k-means clustering algorithm divides the training set into $k$ different clusters of examples that are near each other.

The K-means algorithm works in three steps:

Specify the number of $K$ clusters provided by algorithm
Initialize centroids by first shuffling the dataset and then randomly selecting $K$ data points for the centroids without replacement.
Iterate until centroids don’t change by $ϵ$ :
1. Compute sum of Squared Distance between data points and all centroids.
2. Allocate each data point to the cluster that is closest to the others (centroid).
3. Set new centroids to be average of the all data points that belong to a given cluster

One of the big issues is bad centroid values because it highly depends on initialization. In practice, you do various trials of K-Means Clustering, and select the trial with the lowest variance for the centroids.

Time complexity? https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Distance Metrics

Why EMD vs. simpler metrics like Euclidean Distance? See this Stackoverflow thread.

K-Means++

Instead of choosing initial cluster centres randomly, we choose them in a smart way.

Tricks

Use the Triangle Inequality for speedup: https://www.researchgate.net/publication/2480121_Using_the_Triangle_Inequality_to_Accelerate_K-Means

Determining the number of Clusters

At first, for my Poker AI project, I thought I could just take the number that they worked with. But there are some techniques that allows me to compute the optimal number of clusters:

Elbow method
Average silhouette method
Gap statistic method

🛠️ Steven Gong

Table of Contents

K-Means Clustering

Visualization

Distance Metrics

K-Means++

Tricks

Determining the number of Clusters

Extra Resources

In Spark Shown in CS451

Graph View

Backlinks