Likelihood Function
The likelihood function (often simply called the likelihood) is the Joint Probability of the observed data viewed as a function of the parameters of the chosen statistical model.
“Probability of what you see given your model”
Key idea for MLE: $\hat{\theta} = \arg\max_\theta p(x \mid \theta)$, where
- $x$ is your data
- $\theta$ are your model parameters
The other way around, $p(\theta \mid x)$, is the Posterior.
It can be shown that minimizing the KL Divergence is equivalent to minimizing the Negative Log Likelihood, which is what we usually do when training a classifier, for example.
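A quick numerical check (a minimal numpy sketch with made-up numbers): for a one-hot target, the KL divergence to the model's predicted distribution equals the negative log likelihood of the true class, since the one-hot distribution has zero entropy.

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])  # one-hot "true" distribution
q = np.array([0.2, 0.7, 0.1])  # model's predicted distribution

# KL(p || q) = sum over p > 0 of p * log(p / q)
mask = p > 0
kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))

# NLL of the observed class under the model
nll = -np.log(q[np.argmax(p)])

print(kl, nll)  # both ~0.3567: minimizing one minimizes the other
```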
Likelihood Function (Definition)
If $X = (X_1, \dots, X_n)$, where $X_i$ are i.i.d. RVs with observations $x_1, \dots, x_n$, then $\mathcal{L}(\theta \mid x) = \prod_{i=1}^{n} f(x_i \mid \theta)$
- $f(x \mid \theta)$ is the Probability Density Function
- $\theta$ are the parameters we are trying to estimate; what they are depends on the distribution. Ex: for a Gaussian Distribution, $\theta = (\mu, \sigma^2)$
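A minimal sketch of MLE under this definition, assuming Gaussian data with known $\sigma = 1$ and a simple grid search over candidate means (the data and grid values here are made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=50)  # observed i.i.d. data

mus = np.linspace(0, 6, 601)                 # candidate values of mu
# log-likelihood of each candidate: sum of log pdfs over the data
log_lik = np.array([norm.logpdf(x, loc=mu, scale=1.0).sum() for mu in mus])

mu_hat = mus[np.argmax(log_lik)]
print(mu_hat, x.mean())  # MLE of the mean lands near the sample mean
```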
Probability vs. Likelihood
- Probability is assigning the probability of a data value given a distribution, i.e. $P(x \mid \theta)$ with $\theta$ fixed and $x$ varying
- Likelihood is the probability of a distribution given data values, i.e. $\mathcal{L}(\theta \mid x) = P(x \mid \theta)$ with $x$ fixed and $\theta$ varying
https://www.youtube.com/watch?v=pYxNSUDSFH4&ab_channel=StatQuestwithJoshStarmer
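A small scipy sketch of the distinction (all numbers made up): probability fixes the parameters and varies the data; likelihood fixes the data and varies the parameters.

```python
from scipy.stats import norm

# Probability: fix the distribution (mu=0, sigma=1), evaluate the data.
p = norm.pdf(1.5, loc=0, scale=1)        # p(x = 1.5 | mu=0, sigma=1)

# Likelihood: fix the data (x = 1.5), vary the parameters.
lik_mu0 = norm.pdf(1.5, loc=0, scale=1)  # L(mu=0 | x=1.5)
lik_mu1 = norm.pdf(1.5, loc=1, scale=1)  # L(mu=1 | x=1.5)

print(lik_mu1 > lik_mu0)  # True: mu=1 explains x=1.5 better than mu=0
```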
Negative Log Likelihood
First heard from Andrej Karpathy.
Log likelihood: $\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$
We use the log because products of probabilities can be very small, so we work with the Log Function instead.
We negate it because the log of a probability in $(0, 1]$ is non-positive, so the negative log likelihood is a positive quantity we can minimize.
One super neat trick from the Log Rules is that instead of multiplying everything, we can just add all the logs, i.e. log(a*b*c) = log(a) + log(b) + log(c)
We do this because a*b*c might be an extremely small number (it can underflow to zero), so we perform addition instead.
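A minimal numpy sketch of why (the probabilities are made up): the raw product underflows to zero, while the sum of logs stays perfectly representable.

```python
import numpy as np

probs = np.full(1000, 0.01)      # 1000 events, each with probability 0.01

product = np.prod(probs)         # 0.01**1000 underflows to 0.0
log_sum = np.sum(np.log(probs))  # 1000 * log(0.01) ~= -4605.17

print(product)   # 0.0  (underflow)
print(log_sum)   # -4605.17...
print(-log_sum)  # negative log likelihood: a positive number to minimize
```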
How likelihood is used
I’m still trying to wrap my head around this, but essentially you apply a series of likelihood updates.
Your prior is your belief distribution before seeing new observations. The belief distribution refers to a probability distribution over possible outcomes or states, typically representing subjective probabilities based on a person’s knowledge or judgment.
When a new observation arrives, the likelihood scores how probable that observation is under each possible state, and Bayes’ rule combines it with the prior: posterior ∝ likelihood × prior.
Our goal is to find the posterior.
Example:
- Based on what you observe with the measurements, update the position (state) of the dog
- The position is the prior (before the update) and the posterior (after the update)
- The measurement is used to update the prior through the likelihood: for each possible position, the likelihood is the probability of getting that measurement if the dog were actually there
This chapter covers it: https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python/blob/master/02-Discrete-Bayes.ipynb
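A minimal sketch of one such update, loosely following the linked Discrete Bayes chapter (the hallway size and sensor numbers are made up): the posterior is just the normalized product of prior and likelihood.

```python
import numpy as np

# The dog is in one of 5 hallway positions; uniform belief to start.
prior = np.full(5, 0.2)                   # belief before the measurement

# Likelihood: P(measurement | position) for each position. Suppose the
# sensor reading is most consistent with position 2.
likelihood = np.array([0.1, 0.2, 0.6, 0.2, 0.1])

posterior = prior * likelihood            # Bayes: posterior ~ likelihood * prior
posterior /= posterior.sum()              # normalize back to a distribution

print(posterior)  # belief shifts toward position 2
```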
In Robotics
Imitation Learning is often formulated as behavior cloning, which uses supervised learning to learn a policy $\pi_\theta(a \mid s)$ parameterized by $\theta$ to maximize the log-likelihood of actions in a dataset $\mathcal{D}$: $\max_\theta \sum_{(s, a) \in \mathcal{D}} \log \pi_\theta(a \mid s)$.
- Here $\pi_\theta(a \mid s)$ is the likelihood, since it’s the probability of assigning the action $a$ given the state $s$
Maximizing π(a∣s) just means: “make the model put high probability on the demonstrated (expert) actions.”
- In discrete spaces → cross-entropy.
- In continuous spaces → likelihood under a distribution (e.g., Gaussian).
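A minimal numpy sketch of both cases (made-up numbers, not any particular library's loss): each one is just the negative log likelihood of the expert's action under the policy.

```python
import numpy as np

# Discrete actions -> cross-entropy: -log of the probability the policy
# assigns to the expert's chosen action.
policy_probs = np.array([0.1, 0.7, 0.2])  # pi(a | s) over 3 actions
expert_action = 1
ce_loss = -np.log(policy_probs[expert_action])

# Continuous actions -> Gaussian NLL: the policy outputs a mean and std,
# and we score the expert action under that Gaussian.
mu, sigma = 0.4, 0.1                      # policy's predicted action dist
a_expert = 0.5
gauss_nll = 0.5 * np.log(2 * np.pi * sigma**2) \
    + (a_expert - mu)**2 / (2 * sigma**2)

print(ce_loss, gauss_nll)
```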
If the policy “just always outputs one action,” it’s probably overfitting or collapsing due to data bias or model simplicity.
KL divergence shows up when comparing your learned policy to another (e.g., expert or prior) distribution.