Unsupervised Learning Clustering Method

# K-Means Clustering

This is a simple representation learning algorithm, which groups similar data points together in clusters.

### Visualization

Link:

- Kaggle Notebook on Visualizing High Dimensional Clusters
- Visualizing the clusters article

It seems like this algorithm is used quite a lot!

- Poker hand clustering
- Object Detection for Anchor Boxes size clustering
- Image Segmentation to downsample an image to $K$ colours

The k-means clustering algorithm divides the training set into $k$ different clusters of examples that are near each other.

The K-means algorithm works in three steps:

- Specify the number of $K$ clusters provided by algorithm
- Initialize centroids by first shuffling the dataset and then randomly selecting $K$ data points for the centroids without replacement.
- Iterate until centroids don’t change by $ϵ$:
- Compute sum of Squared Distance between data points and all centroids.
- Allocate each data point to the cluster that is closest to the others (centroid).
- Set new centroids to be average of the all data points that belong to a given cluster

One of the big issues is bad centroid values because it highly depends on initialization. In practice, you do various trials of K-Means Clustering, and select the trial with the lowest variance for the centroids.

Time complexity? https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

### Distance Metrics

Why EMD vs. simpler metrics like Euclidean Distance? See this Stackoverflow thread.

### K-Means++

Instead of choosing initial cluster centres randomly, we choose them in a smart way.

### Tricks

Use the Triangle Inequality for speedup: https://www.researchgate.net/publication/2480121_Using_the_Triangle_Inequality_to_Accelerate_K-Means

### Determining the number of Clusters

At first, for my Poker AI project, I thought I could just take the number that they worked with. But there are some techniques that allows me to compute the optimal number of clusters:

- Elbow method
- Average silhouette method
- Gap statistic method