Unsupervised Learning Clustering Method

K-Means Clustering

This is a simple representation learning algorithm, which groups similar data points together in clusters.

Visualization

Link:

It seems like this algorithm is used quite a lot!

The k-means clustering algorithm divides the training set into different clusters of examples that are near each other.

The K-means algorithm works in three steps:

  1. Specify the number of clusters provided by algorithm
  2. Initialize centroids by first shuffling the dataset and then randomly selecting data points for the centroids without replacement.
  3. Iterate until centroids don’t change by :
    1. Compute sum of Squared Distance between data points and all centroids.
    2. Allocate each data point to the cluster that is closest to the others (centroid).
    3. Set new centroids to be average of the all data points that belong to a given cluster

One of the big issues is bad centroid values because it highly depends on initialization. In practice, you do various trials of K-Means Clustering, and select the trial with the lowest variance for the centroids.

Time complexity? https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Distance Metrics

Why EMD vs. simpler metrics like Euclidean Distance? See this Stackoverflow thread.

K-Means++

Instead of choosing initial cluster centres randomly, we choose them in a smart way.

Tricks

Use the Triangle Inequality for speedup: https://www.researchgate.net/publication/2480121_Using_the_Triangle_Inequality_to_Accelerate_K-Means

Determining the number of Clusters

At first, for my Poker AI project, I thought I could just take the number that they worked with. But there are some techniques that allows me to compute the optimal number of clusters:

  1. Elbow method
  2. Average silhouette method
  3. Gap statistic method

Extra Resources

In Spark Shown in CS451