Expectation Maximization (EM)
EM is a soft clustering algorithm that iteratively estimates parameters of latent-variable models by alternating expectation and maximization steps.
Why EM instead of direct MLE on a GMM?
The analytical MLE for a GMM has no closed form (the derivative of the log of a sum of Gaussians is ugly). EM introduces latent cluster assignments so that each M-step has a closed form.
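To see the problem concretely, direct MLE would maximize the incomplete-data log-likelihood (standard GMM notation, matching the symbols used below):

$$\ell(\theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right)$$

Setting $\partial \ell / \partial \mu_k = 0$ gives $\mu_k = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$, where $\gamma_{ik}$ is the posterior weight of cluster $k$ for point $i$ (the responsibility defined in the E-step below). Since $\gamma_{ik}$ itself depends on $\mu_k$, the stationarity conditions are all coupled; EM decouples them by freezing the $\gamma_{ik}$ at each iteration.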
Unlike K-Means Clustering (hard clustering), EM assigns soft probabilities of cluster membership.
Intuition
Chicken-and-egg. If you knew which Gaussian each point came from, fitting the means and variances is trivial (just MLE per cluster). If you knew the means and variances, assigning points is trivial (pick the most likely cluster). EM alternates: given current params, soft-assign points (E-step); given soft assignments, refit params (M-step). Each full pass provably increases the data log-likelihood, so you climb a hill in parameter space until it stops moving.
We can equivalently describe a GMM with latent variables:

$$z_i \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K), \qquad x_i \mid z_i = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$

where:
- $z_i \in \{1, \dots, K\}$ is the (hidden) cluster assignment for point $x_i$
- $\pi_k = p(z_i = k)$ is the prior probability of cluster $k$

The joint distribution of observed and hidden is

$$p(x_i, z_i = k) = \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$

The marginal $p(x_i)$:

$$p(x_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$

If the latent $z_i$ were observed, the complete-data log-likelihood would be

$$\log p(x, z) = \sum_{i=1}^{N} \left[ \log \pi_{z_i} + \log \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \right]$$

which is easy to optimize in closed form. Since the $z_i$ are hidden, EM replaces them with their expected values (responsibilities $\gamma_{ik}$), leading to the iterative updates below.
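To make the generative story concrete, here is a minimal NumPy sketch (parameter values are made up for illustration) of sampling from a two-component 1-D mixture: draw the hidden $z_i$ first, then draw $x_i$ from the chosen Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical K = 2 mixture: weights, means, standard deviations
pi = np.array([0.3, 0.7])     # mixing weights, p(z = k)
mu = np.array([-2.0, 3.0])    # component means
sigma = np.array([0.5, 1.0])  # component standard deviations

N = 1000
z = rng.choice(len(pi), size=N, p=pi)  # latent assignment z_i ~ Categorical(pi)
x = rng.normal(mu[z], sigma[z])        # observed x_i ~ N(mu_{z_i}, sigma_{z_i}^2)
# EM only ever sees x; the z that generated each point stays hidden.
```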
Resources
- https://arxiv.org/pdf/cs/0412015
- https://chrischoy.github.io/posts/Expectation-Maximization-and-Variational-Inference/
- Paper tutorial: https://ieeexplore-ieee-org.proxy.lib.uwaterloo.ca/stamp/stamp.jsp?tp=&arnumber=543975
You first assume the data is sampled from a set of $K$ Gaussians.
Basic EM Algorithm for Gaussian Mixture
- Initialization: assume $K$ Gaussians with parameters $(\mu_k, \Sigma_k)$. Initialize $\mu_k$, $\Sigma_k$, and mixing weights $\pi_k$ randomly
- E-step (Expectation): compute the Posterior Probability (responsibility) $\gamma_{ik}$ that point $x_i$ belongs to component $k$:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

The responsibility is the softmax-looking answer to “how much does cluster $k$ claim point $x_i$?”. Numerator is “how likely cluster $k$ would have produced $x_i$”; denominator normalizes across clusters.
- M-step (Maximization): update parameters using weighted averages. This is the standard Gaussian MLE, except each point contributes with weight $\gamma_{ik}$ instead of 0/1. A point sitting between two clusters pulls both means partially toward itself.

$$\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_i}{\sum_{i=1}^{N} \gamma_{ik}}, \qquad \Sigma_k = \frac{\sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\top}}{\sum_{i=1}^{N} \gamma_{ik}}$$

Mixing weights:

$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}$$
- Repeat: alternate E-step and M-step until convergence (parameters stabilize or log-likelihood stops improving)
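Putting the steps together, here is a minimal NumPy sketch of EM for a 1-D Gaussian mixture (my own variable names and initialization choices, not a reference implementation; the general multivariate case swaps the variances for covariance matrices):

```python
import numpy as np

def gmm_em(x, K, n_iters=100, tol=1e-6, seed=0):
    """Fit a 1-D Gaussian mixture to data x with EM.
    Returns (pi, mu, var, log_likelihood_trace)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # Initialization: random means drawn from the data, shared variance, uniform weights
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)

    prev_ll, lls = -np.inf, []
    for _ in range(n_iters):
        # E-step: responsibilities gamma[i, k], computed in log space for stability
        log_p = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)  # (N, K): log pi_k + log N(x_i | mu_k, var_k)
        log_norm = np.logaddexp.reduce(log_p, axis=1)   # (N,): log p(x_i)
        gamma = np.exp(log_p - log_norm[:, None])       # rows sum to 1

        # M-step: weighted Gaussian MLE, each point weighted by gamma[i, k]
        Nk = gamma.sum(axis=0)                          # effective count per cluster
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / N

        # Convergence check: the log-likelihood is guaranteed not to decrease
        ll = log_norm.sum()
        lls.append(ll)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var, lls
```

Run on the synthetic data from the sampling sketch above, this should recover parameters close to the generating `pi`, `mu`, and `sigma**2` (up to a permutation of cluster labels).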