# Maximum Likelihood Estimation (MLE)

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data.

https://www.youtube.com/watch?v=XepXtl9YKwc&ab_channel=StatQuestwithJoshStarmer

### General Template for Deriving MLE

We always use the log-likelihood since it makes differentiation much easier. See Logarithm Rules; the key fact is $\ln(abc) = \ln a + \ln b + \ln c$, which turns products into sums.

You then take the derivative with respect to the parameter and set it to 0, since you want to maximize the log-likelihood.

For the binomial we write a single observation, $Y \sim Bin(n, \theta)$: since a Binomial is a sum of $n$ i.i.d. Bernoulli trials, one Binomial observation already carries $n$ trials' worth of information.

For the other distributions, we instead state an i.i.d. sample, e.g. $X_{1},\dots,X_{n} \sim Exp(\theta)$, and the likelihood is the product of the $n$ densities. This is why the setups look different.

We define

- $L(θ)$ as the likelihood
- $l(θ)$ as the log likelihood
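As a sketch of this template end to end, here is a hypothetical example (assumed Geometric($\theta$) data, a distribution not covered in these notes): write $l(\theta)$, then maximize it. The grid scan stands in for "take the derivative and set it to 0", and the result matches the closed-form answer $1/\bar{y}$.

```python
import math

# Hypothetical Geometric(theta) observations: number of trials until
# the first success. (Assumed numbers, just for illustration.)
data = [2, 5, 1, 3, 4]

def log_lik(theta):
    # L(theta) = prod (1 - theta)^(y_i - 1) * theta
    # l(theta) = sum(y_i - 1) * ln(1 - theta) + n * ln(theta)
    n = len(data)
    return sum(y - 1 for y in data) * math.log(1 - theta) + n * math.log(theta)

# Numerically maximize l(theta) by scanning a grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)

# Closed form for the geometric MLE: 1 / ybar.
ybar = sum(data) / len(data)
print(theta_hat, 1 / ybar)
```

The grid maximizer agrees with $1/\bar{y}$ to the grid's resolution, which is the whole point of the template: the calculus gives the same answer the numeric search finds.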

#### Binomial MLE

Suppose $Y \sim Bin(n, \theta)$, with $y$ observed successes. What is the MLE $\hat{\theta}(y)$?

The likelihood is
$$L(\theta) = P(Y=y) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y}$$
Let's derive $\hat{\theta}(y)$ using the log-likelihood. Maximizing the log-likelihood is the same as maximizing the likelihood: with $l(\theta) = \ln L(\theta)$,
$$\hat{\theta} \text{ maximizes } L(\theta) \iff \hat{\theta} \text{ maximizes } l(\theta)$$
Writing $k = \binom{n}{y}$ (constant in $\theta$),
$$l(\theta) = \ln k + y\ln\theta + (n-y)\ln(1-\theta)$$
$$\frac{dl(\theta)}{d\theta} = 0 \implies \frac{y}{\theta} - \frac{n-y}{1-\theta} = 0 \implies \hat{\theta} = \frac{y}{n}$$

- For the Binomial Distribution, the MLE is simply the sample proportion of successes, $\hat{p} = \frac{\#\text{observed successes}}{\#\text{total trials}}$, which intuitively should make sense.
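A quick numeric sanity check of this derivation, with assumed values $n = 20$, $y = 7$: the score (the derivative of $l$) vanishes at $\theta = y/n$, and nearby values of $\theta$ give a lower log-likelihood.

```python
import math

# Assumed example numbers: 20 trials, 7 successes.
n, y = 20, 7
theta_hat = y / n  # the claimed MLE, 0.35

def score(theta):
    # dl/dtheta = y/theta - (n - y)/(1 - theta)
    return y / theta - (n - y) / (1 - theta)

def log_lik(theta):
    # ln C(n, y) dropped: constant in theta
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

print(score(theta_hat))                   # ~0 at the MLE
print(log_lik(theta_hat) > log_lik(0.3))  # nearby thetas do worse
print(log_lik(theta_hat) > log_lik(0.4))
```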

#### Poisson MLE

Let $Y_{1},Y_{2},\dots,Y_{n} \sim Poi(\theta)$ with observations $\{y_{1},\dots,y_{n}\}$. What is the MLE of $\theta$?
$$\hat{\theta} = \frac{1}{n}\sum y_{i} = \bar{y}$$

- Remember that for the Poisson Distribution the parameter is $\lambda$ and $E(Y) = \lambda$, so the MLE is simply the sample mean.
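To check this on assumed data: dropping the $\ln(y_i!)$ term (constant in $\theta$), the Poisson log-likelihood is $l(\theta) = \sum y_i \ln\theta - n\theta$, and the sample mean beats any other candidate rate.

```python
import math

# Hypothetical Poisson counts (assumed numbers).
counts = [3, 1, 4, 2, 0, 5, 3]
n = len(counts)
theta_hat = sum(counts) / n  # ybar = 18/7

def log_lik(theta):
    # l(theta) = sum(y_i) * ln(theta) - n * theta
    # (the sum of ln(y_i!) terms is dropped: constant in theta)
    return sum(counts) * math.log(theta) - n * theta

# The sample mean maximizes l: every other rate tried scores lower.
for other in (theta_hat - 0.5, theta_hat + 0.5, 1.0, 4.0):
    assert log_lik(theta_hat) > log_lik(other)
print(theta_hat)
```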

I got some practice deriving the exponential case as well, and it comes out as follows.

#### Exponential MLE

$$\hat{\lambda} = \frac{1}{\bar{y}}$$

- Remember that if $X \sim Exp(\lambda)$, then $E(X) = \frac{1}{\lambda} = \mu$, so the parameter is $\lambda = \frac{1}{\mu}$.
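A sketch with assumed waiting times $X_i \sim Exp(\lambda)$: the log-likelihood is $l(\lambda) = n\ln\lambda - \lambda\sum x_i$, maximized at $\hat{\lambda} = 1/\bar{x}$, and the implied mean estimate $1/\hat{\lambda}$ is just the sample mean.

```python
import math

# Hypothetical waiting times (assumed numbers).
times = [0.4, 1.9, 0.7, 1.2, 0.3]
n = len(times)
xbar = sum(times) / n

lam_hat = 1 / xbar  # MLE of the rate

def log_lik(lam):
    # l(lambda) = n * ln(lambda) - lambda * sum(x_i)
    return n * math.log(lam) - lam * sum(times)

# lam_hat beats nearby rates, and 1/lam_hat recovers the sample mean.
assert all(log_lik(lam_hat) > log_lik(l) for l in (0.8 * lam_hat, 1.2 * lam_hat))
print(lam_hat, 1 / lam_hat, xbar)
```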

#### Normal MLE

Suppose $Y_{1},\dots,Y_{n} \sim N(\mu,\sigma^{2})$ with observed data $\{y_{1},\dots,y_{n}\}$.

What are the MLEs of $\mu$ and $\sigma^{2}$?
$$\hat{\mu} = \bar{y}, \qquad \hat{\sigma}^{2} = \frac{1}{n}\sum(y_{i}-\bar{y})^{2}$$
Am I supposed to use $n-1$ or $n$? The MLE uses $n$: we are maximizing the likelihood, and the maximizer divides by $n$. The sample variance $s^{2} = \frac{1}{n-1}\sum(y_{i}-\bar{y})^{2}$ uses $n-1$ to be an unbiased estimator of the population variance; the MLE $\hat{\sigma}^{2}$ is slightly biased downward.
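The $n$ versus $n-1$ distinction can be seen directly on assumed data: both estimators share the same sum of squared deviations and differ only in the divisor.

```python
# Hypothetical observations (assumed numbers).
ys = [2.0, 3.5, 1.5, 4.0, 3.0]
n = len(ys)
ybar = sum(ys) / n

ss = sum((y - ybar) ** 2 for y in ys)  # sum of squared deviations
mle_var = ss / n        # MLE: divides by n, biased downward
s2 = ss / (n - 1)       # sample variance: divides by n-1, unbiased

print(ybar, mle_var, s2)
```

The sample variance is always at least as large as the MLE, since it divides the same sum by a smaller number.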

##### Derivation for Normal MLE

We use the definition of Likelihood:

$$\begin{align} L(\mu, \sigma^2) &= \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(y_i - \mu)^2}{2\sigma^2}} \\ &={\frac {1}{\sigma^n (2\pi)^\frac{n}{2} }}e^{-{\frac {1}{2\sigma^2}}\sum \left({y_i-\mu }\right)^{2}} \\ l(\mu, \sigma^2) &= -n \log \sigma - \frac{n}{2} \log 2\pi - \frac{1}{2\sigma^2} \sum(y_i - \mu)^2 \end{align}$$

This likelihood is maximized when the derivative is 0 (similar ideas in [[notes/Least Squares|Least Squares]]).

### Properties of the [[notes/Maximum Likelihood Estimation|MLE]]

For discrete distributions, $L(\theta)$ is the probability of observing the data under $\theta$. For continuous distributions, recall that a [[notes/Probability Mass Function|p.m.f.]] gives a probability directly, while a [[notes/Probability Density Function|p.d.f.]] does not: $f(y_i)$ is a density, not a probability.

1. Consistency - As $n \rightarrow \infty$, $\widehat{\theta} \rightarrow \theta$ (our estimate converges to the true value)
2. Efficiency - We want a minimum variance when finding $\widehat{\theta}$
3. [[notes/Invariance|Invariance]] - If $\widehat{\theta}$ is the MLE of $\theta$, then $g(\widehat{\theta})$ is the MLE of $g(\theta)$

Other Notes

- We assume that the class of the distribution has been properly identified
- We assume that we have [[notes/independent and Identically Distributed|i.i.d]] datasets

### Related

- [[notes/Likelihood Function|Likelihood Function]]
- [[notes/Relative Likelihood Function|Relative Likelihood Function]]