Unsupervised Learning

K-Means Clustering

K-Means is a simple clustering algorithm that groups similar data points into clusters by iteratively assigning points to the nearest centroid.

Common applications:

The algorithm works in three steps:

  1. Specify the number of clusters
  2. Initialize centroids by shuffling the dataset and randomly selecting data points for the centroids without replacement
  3. Iterate until centroids don’t change by :
    1. Compute sum of Squared Distance between data points and all centroids
    2. Allocate each data point to the cluster whose centroid is closest
    3. Set new centroids to the average of all data points in that cluster

Intuition

Voronoi cells, refined. Each centroid “owns” the region of space closer to it than to any other centroid. Assign each point to its owner (E-like step), then slide each centroid to the mean of the points it owns (M-like step), and the cells rearrange. Repeat. The within-cluster sum-of-squares can only go down at each step, so convergence is guaranteed, but the objective is non-convex, so the final answer depends on where you started (hence k-means++)

Bad centroids from bad initialization

Results depend heavily on initial centroid placement. In practice, run several trials of K-Means and select the one with the lowest variance. K-Means++ picks better starting centroids instead of random

Links:

Distance Metrics

Why EMD vs simpler metrics like Euclidean Distance? See this Stackoverflow thread.

Tricks

Use the Triangle Inequality for speedup: https://www.researchgate.net/publication/2480121_Using_the_Triangle_Inequality_to_Accelerate_K-Means

Determining the number of Clusters

Techniques for computing the optimal number of clusters:

  • Elbow method
  • Average silhouette method
  • Gap statistic method

Extra Resources

In Spark Shown in CS451