3.2 Linear Regression Models and Least Squares

We have an input vector \(X^\top = (X_1, \dots, X_p)\), we want to predict a real-valued output \(Y\). The linear regression model has the form

\[f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j\]

Here \(X_j\) can be quantitative inputs, transformations of quantitative inputs, or numerical encoding of qualitative inputs.

Typically we have a set of training data \((x_1, y_1), \dots, (x_N, y_N)\) from which to estimate the parameters :math`beta`. The most popular estimation method is least squares, in which we pick the coefficients \(\beta = (\beta_0, \dots, \beta_p)^\top\) to minimize the residual sum of squares

\[\text{RSS}(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2\]

From a statistical point of view, this criterion is reasonable if the training observations \((x_i, y_i)\) represent independent draws from their population. Even if the \(x_i\)’s were not drawn randomly, the criterion is still valid if the \(y_i\)’s are conditionally independent given the inputs \(x_i\).

Let \(\mathbf{X}_{N \times (p+1)}\) and \(\mathbf{y}_{N \times 1}\) be the matrix representation of the training data. We can write the residual sum-of-squares as

\[RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^\top (\mathbf{y} - \mathbf{X}\beta)\]

Differentiating w.r.t. \(\beta\) we obtain

\[\begin{split}\frac{\partial \text{RSS}}{\partial \beta} & = -2\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\beta) \\ \frac{\partial^2 \text{RSS}}{\partial\beta\partial\beta^\top} & = 2\mathbf{X}^\top\mathbf{X}\end{split}\]

Assuming that \(\mathbf{X}\) has full rank, and hence \(\mathbf{X}^\top\mathbf{X}\) is positive definite, we set the first derivative to zero

\[\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\beta) = 0 \label{eq:eq3-1}\]

to obtain the unique solution

(1)\[\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

The fitted values at the training inputs are

\[\hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

The matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) is sometimes called the “hat” matrix because it puts the hat on \(\mathbf{y}\).

The figure below shows a different geometrical representation of the least squares estimate in \(\mathbb{R}^N\). We minimize \(\text{RSS}(\beta)\) by choosing \(\hat{\beta}\) so that the residual error \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to the column space of \(\mathbf{X}\). The orthogonality is expressed in Equation (1), and the resulting estimate \(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto this subspace. The hat matrix \(\mathbf{H}\) computes the orthogonal projection, and hence it is also known as a projection matrix.

3.2.1 Example: Prostate Cancer

3.2.2 The Gauss-Markov Theorem

We focus on estimation of any linear combination of the parameters \(\theta = a^\top\beta\). The least squares estimate of \(a^\top\beta\) is

\[\hat{\theta} = a^\top\hat{\beta} = a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}\]

Considering \(\mathbf{X}\) to be fixed, this is a linear function \(\mathbf{c}_0^\top\mathbf{y}\) of the response vector \(\mathbf{y}\). If we assume that the linear model is correct, \(a^\top\hat{\beta}\) is unbiased since

\[\begin{split}\text{E}(a^\top\hat{\beta}) & = \text{E}(a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}) \nonumber \\ & = a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \text{E}(\mathbf{y}) \nonumber \\ & = a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{X}\beta \nonumber \\ & = a^\top \beta\end{split}\]

The Gauss-Markov theorem states that if we have any other linear estimator \(\tilde{\theta} = \mathbf{c}^\top\mathbf{y}\) that is unbiased for \(a^\top\beta\), that is \(\text{E}(\mathbf{c}^\top\mathbf{y}) = a^\top\beta\), then

\[\text{Var}(a^\top\hat{\beta}) \leq \text{Var}(\mathbf{c}^\top\mathbf{y})\]

For simplicity we have stated the result in terms of estimation of a single parameter \(a^\top\beta\). The proof is left as Exercise 3.3.

Consider the mean squared error of an estimator \(\tilde{\theta}\) in estimating \(\theta\):

\[\begin{split}\text{MSE}(\tilde{\theta}) & = \text{E}(\tilde{\theta} - \theta)^2 \nonumber \\ & = \text{E}(\tilde{\theta} - \text{E}(\tilde{\theta}) + \text{E}(\tilde{\theta}) - \theta)^2 \nonumber \\ & = \text{E}\left[(\tilde{\theta} - \text{E}(\tilde{\theta}))^2 + 2 (\tilde{\theta} - \text{E}(\tilde{\theta}))(\text{E}(\tilde{\theta}) - \theta) + (\text{E}(\tilde{\theta}) - \theta)^2 \right] \nonumber \\ & = \text{E}(\tilde{\theta} - \text{E}(\tilde{\theta}))^2 + 2 (\text{E}(\tilde{\theta}) - \theta) \text{E}(\tilde{\theta} - \text{E}(\tilde{\theta})) + [\text{E}(\tilde{\theta}) - \theta]^2 \nonumber \\ & = \text{Var}(\tilde{\theta}) + [\text{E}(\tilde{\theta}) - \theta]^2\end{split}\]

The first term is the variance, while the second term is the squared bias.

The Gauss-Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimators with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance.

Consider the prediction of the new response at input \(x_0\),

\[Y_0 = f(x_0) + \varepsilon_0\]

Then the expected prediction error of an estimate \(\tilde{f}(x_0) = x_0^\top \tilde{\beta}\) is

\[\begin{split}\text{E}(Y_0 - \tilde{f}(x_0))^2 & = \text{E}(f(x_0) + \varepsilon_0 - x_0^\top \tilde{\beta})^2 \nonumber \\ & = \text{E}(\varepsilon_0^2) + 2\text{E}(\varepsilon_0(f(x_0) - x_0^\top\tilde{\beta})) + \text{E}(x_0^\top\tilde{\beta} - f(x_0))^2 \nonumber \\ & = \sigma^2 + \text{E}(x_0^\top\tilde{\beta} - f(x_0))^2 \nonumber \\ & = \sigma^2 + \text{MSE}(\tilde{f}(x_0))\end{split}\]

Therefore, expected prediction error and mean squared error differ only by the constant \(\sigma^2\), representing the variance of the new observation \(y_0\).

3.2.3 Multiple Regression from Simple Univariate Regression

3.2.4 Multiple Outputs

Suppose we have multiple outputs \(Y_1, \dots, Y_K\) that we wish to predict from our inputs \(X_0, \dots, X_p\). We assume a linear model for each output

\[\begin{split}Y_k & = \beta_{0k} + \sum_{j=1}^p X_j\beta_{jk} + \varepsilon_k \\ & = f_k(X) + \varepsilon_k\end{split}\]

With \(N\) training cases we can write the model in matrix notation

\[\mathbf{Y} = \mathbf{XB} + \mathbf{E}\]

Here \(\mathbf{Y}\) is the \(N \times K\) response matrix, \(\mathbf{X}\) is the \(N \times (p + 1)\) input matrix, \(\mathbf{B}\) is the \((p + 1) \times K\) matrix of parameters, and \(\mathbf{E}\) is the \(N \times K\) matrix of errors. A straightforward generalization of the univariate loss function is

\[\begin{split}\text{RSS}(\mathbf{B}) & = \sum_{k=1}^K \sum_{i=1}^N (y_{ik} - f_k(x_i))^2 \\ & = \text{tr}[(\mathbf{Y} - \mathbf{XB})^\top (\mathbf{Y} - \mathbf{XB})]\end{split}\]

The least squares estimates have the same form as before

(2)\[\hat{\mathbf{B}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X^\top Y}\]

Multiple outputs do not affect one another’s least squares estimates.

If the errors \(\varepsilon = (\varepsilon_1, \dots, \varepsilon_K)\) are correlated, then it might seem appropriate to modify RSS in favor of a multivariate version. Specifically, suppose \(\text{Cov}(\varepsilon) = \mathbf{\Sigma}\), then the multivariate weighted criterion

\[\text{RSS}(\mathbf{B}; \mathbf{\Sigma}) = \sum_{i=1}^N (y_i - f(x_i))^\top \mathbf{\Sigma}^{-1}(y_i - f(x_i))\]

arises naturally from multivariate Gaussian theory. However, it can be shown again the solution is given by Equation eq3.39 (Exercise 3.11). If \(\mathbf{\Sigma}_i\) vary among observations, then this is no longer the case and the solution for \(\mathbf{B}\) no longer decouples.

Warning

Add solution to Exercise 3.11.