What is Maximum Likelihood Estimation?
Maximum Likelihood Estimation (MLE) is a way to estimate the parameter(s) of a probability distribution, given a set of data. The gist of MLE is, as the name suggests, that the best estimate of a given set of parameters is the one that maximizes a likelihood function.
We often use it to estimate parameters of statistical models.
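In symbols, writing $\theta$ for the parameter(s) and $L$ for the likelihood function, the maximum likelihood estimate is:

$$\hat{\theta} = \arg\max_{\theta} \, L(\theta \mid \text{data})$$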
How it Works — Linear Regression
For any given model, we can construct a likelihood function that tells us, given the observed data, how well the model fits, i.e. how plausible a particular set of parameter values is. In the case of a (multiple) linear regression, those parameters would be our coefficients.
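Concretely, for independent observations $y_1, \dots, y_n$ whose assumed density is $f(y_i \mid \theta)$, the likelihood is just the joint density of the data, viewed as a function of the parameters:

$$L(\theta \mid y_1, \dots, y_n) = \prod_{i=1}^{n} f(y_i \mid \theta)$$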
The process works like this:
- Estimate some starting parameters. These starting values don’t have to be perfect, because the point of the process is to optimize them.
- Estimate the predicted values $\hat{y}_i$ for our starting betas, e.g. $\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$
- For each observation, calculate the residual: $e_i = y_i - \hat{y}_i$
- Assume a distribution for the residuals. In a linear regression, we assume that the residuals will be normally distributed with $\mu = 0$ and some $\sigma$. The value of $\sigma$ will depend on the scale of your data, but if the data is standardized, then $\sigma = 1$ is reasonable.
- For each residual, calculate the probability density function under the assumed distribution. We usually just get the computer to do this for us, since any stats software will have these PDFs already. But, FWIW, the normal PDF is: $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
- Calculate the loglikelihood by summing the logs of those densities. We use the loglikelihood rather than the regular likelihood because calculating the likelihood requires taking the product of all of the residual densities, whereas the loglikelihood only requires summing, since $\log(ab) = \log(a) + \log(b)$. Doing that much multiplication, especially with small probabilities, can be numerically unstable.
- Next, in practice, we often take the negative loglikelihood, since many optimizers seek to minimize a loss function, and minimizing the negative loglikelihood is equivalent to maximizing the likelihood. (A code sketch of these steps follows this list.)
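To make the steps above concrete, here’s a minimal sketch in Python. The function name and the fixed $\sigma = 1$ are just illustrative choices, and any stats library with a normal log-PDF would work in place of scipy:

```python
import numpy as np
from scipy.stats import norm

def neg_loglik(betas, X, y, sigma=1.0):
    """Negative loglikelihood of a linear model with normal residuals.

    betas : coefficient vector (intercept first).
    X     : (n, p) design matrix whose first column is all ones.
    y     : (n,) observed responses.
    sigma : assumed residual standard deviation (1.0 suits standardized data).
    """
    y_hat = X @ betas      # predictions under the current betas
    residuals = y - y_hat  # one residual per observation
    # Log of the normal PDF for each residual, then sum: summing logs
    # avoids multiplying many small probabilities together.
    loglik = norm.logpdf(residuals, loc=0.0, scale=sigma).sum()
    return -loglik         # negate so an optimizer can minimize it
```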
This is how we calculate the loglikelihood given a model (a set of parameters) and a dataset. We can then use an optimizer to find the minimum negative loglikelihood and return the parameters ($\beta$s, in this case) that yield this value.
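As a sketch of that optimization step, here’s scipy’s general-purpose minimizer applied to the neg_loglik function above; the simulated data and true betas are made up purely for illustration:

```python
from scipy.optimize import minimize
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
true_betas = np.array([2.0, -0.5])
y = X @ true_betas + rng.normal(size=n)                # normal residuals, sigma = 1

start = np.zeros(2)                        # imperfect starting values are fine
result = minimize(neg_loglik, start, args=(X, y))
print(result.x)                            # should land near [2.0, -0.5]
```

The optimizer repeatedly evaluates the negative loglikelihood at different betas and converges on the values that minimize it, which are exactly the maximum likelihood estimates.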