Linear regression is a fundamental approach to both prediction and inference in statistical applications. It is one of the most basic models we can fit in a supervised learning context, and it is the basis for more advanced generalized linear models.

Linear regression works by estimating the outcome variable as a linear combination of weighted input variables, e.g.

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

or, in matrix notation,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
In the case where there is one predictor variable, this amounts to modeling the relationship between $x$ and $y$ as a straight line.
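
As a quick illustration (not part of the original discussion), here is a minimal sketch that fits a straight line to a single simulated predictor. It assumes scikit-learn is available; the data and coefficient values are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate a single predictor with a known linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 2.0 + 3.0 * x.ravel() + rng.normal(scale=1.0, size=100)

# Fit y = beta_0 + beta_1 * x
model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)  # estimates should be close to 2.0 and 3.0
```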

Benefits

Although linear regression is a very simple model (relatively speaking), there are still several reasons why you might choose it over more complex models.

First, and probably most importantly, linear models are very easy to interpret. The parameters we estimate in linear regression models can be interpreted as the “effect” of the given predictor on the outcome after controlling for the effects of all other predictors. This interpretability makes linear models particularly compelling for estimating causal effects (although causal claims obviously require much more than a statistical model).
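
To make that interpretation concrete, here is a hedged sketch (assuming statsmodels, with simulated data and made-up coefficient values): with two correlated predictors, the estimated coefficient on the first predictor is its estimated effect on the outcome while holding the second predictor fixed.

```python
import numpy as np
import statsmodels.api as sm

# Two correlated predictors; the true effect of x1 holding x2 fixed is 1.5
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)
y = 1.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # intercept, effect of x1 (≈ 1.5), effect of x2 (≈ -2.0)
```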

Linear regression is also useful in cases where we have a small sample size, where there’s a low signal-to-noise ratio, or where we have sparse data.

Bias and Variance

Linear regression is a high-bias/low-variance model (see the Bias Variance Tradeoff). This means that the structural error caused by underfitting is relatively high (high bias). On the other hand, the fitted function changes very little when exposed to new data, making the model very stable (low variance).

Estimation

Least squares is one of the most common methods for estimating the parameters of a linear regression. This approach estimates the coefficients to minimize the residual sum of squares (RSS), e.g.:

$$\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$

or, in matrix notation,

$$\text{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

Since this is a quadratic function, we can minimize it by taking the first derivative and setting it equal to 0:

$$\frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0$$

which we can rewrite as:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Assumptions

Linear regression models assume the following:

  1. Linearity: The relationship between $\mathbf{X}$ and $\mathbf{y}$ is linear in the parameters.
  2. Independence: All observations are independent.
  3. Homoskedasticity: Constant error variance. Another way to frame this is that the variance of the errors does not depend on the values of the predictors.
  4. Normality: Errors are normally distributed.
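
As an illustrative sketch (assuming statsmodels and SciPy, with simulated data), the normality and homoskedasticity assumptions can be checked with standard residual diagnostics such as the Shapiro-Wilk and Breusch-Pagan tests:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit a small OLS model on simulated data (a stand-in for a real dataset)
rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)
fit = sm.OLS(y, X).fit()
resid = fit.resid

# Normality of errors: Shapiro-Wilk test on the residuals
print(stats.shapiro(resid))

# Homoskedasticity: Breusch-Pagan test
# (a large p-value is consistent with constant error variance)
print(het_breuschpagan(resid, X)[1])  # LM test p-value
```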