3.2 Linear Regression Models and Least Squares¶
We have an input vector \(X^\top = (X_1, \dots, X_p)\), we want to predict a real-valued output \(Y\). The linear regression model has the form
Here \(X_j\) can be quantitative inputs, transformations of quantitative inputs, or numerical encoding of qualitative inputs.
Typically we have a set of training data \((x_1, y_1), \dots, (x_N, y_N)\) from which to estimate the parameters :math`beta`. The most popular estimation method is least squares, in which we pick the coefficients \(\beta = (\beta_0, \dots, \beta_p)^\top\) to minimize the residual sum of squares
From a statistical point of view, this criterion is reasonable if the training observations \((x_i, y_i)\) represent independent draws from their population. Even if the \(x_i\)’s were not drawn randomly, the criterion is still valid if the \(y_i\)’s are conditionally independent given the inputs \(x_i\).
Let \(\mathbf{X}_{N \times (p+1)}\) and \(\mathbf{y}_{N \times 1}\) be the matrix representation of the training data. We can write the residual sum-of-squares as
Differentiating w.r.t. \(\beta\) we obtain
Assuming that \(\mathbf{X}\) has full rank, and hence \(\mathbf{X}^\top\mathbf{X}\) is positive definite, we set the first derivative to zero
to obtain the unique solution
The fitted values at the training inputs are
The matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\) is sometimes called the “hat” matrix because it puts the hat on \(\mathbf{y}\).
The figure below shows a different geometrical representation of the least squares estimate in \(\mathbb{R}^N\). We minimize \(\text{RSS}(\beta)\) by choosing \(\hat{\beta}\) so that the residual error \(\mathbf{y} - \hat{\mathbf{y}}\) is orthogonal to the column space of \(\mathbf{X}\). The orthogonality is expressed in Equation (1), and the resulting estimate \(\hat{\mathbf{y}}\) is the orthogonal projection of \(\mathbf{y}\) onto this subspace. The hat matrix \(\mathbf{H}\) computes the orthogonal projection, and hence it is also known as a projection matrix.
3.2.1 Example: Prostate Cancer¶
3.2.2 The Gauss-Markov Theorem¶
We focus on estimation of any linear combination of the parameters \(\theta = a^\top\beta\). The least squares estimate of \(a^\top\beta\) is
Considering \(\mathbf{X}\) to be fixed, this is a linear function \(\mathbf{c}_0^\top\mathbf{y}\) of the response vector \(\mathbf{y}\). If we assume that the linear model is correct, \(a^\top\hat{\beta}\) is unbiased since
The Gauss-Markov theorem states that if we have any other linear estimator \(\tilde{\theta} = \mathbf{c}^\top\mathbf{y}\) that is unbiased for \(a^\top\beta\), that is \(\text{E}(\mathbf{c}^\top\mathbf{y}) = a^\top\beta\), then
For simplicity we have stated the result in terms of estimation of a single parameter \(a^\top\beta\). The proof is left as Exercise 3.3.
Consider the mean squared error of an estimator \(\tilde{\theta}\) in estimating \(\theta\):
The first term is the variance, while the second term is the squared bias.
The Gauss-Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimators with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance.
Consider the prediction of the new response at input \(x_0\),
Then the expected prediction error of an estimate \(\tilde{f}(x_0) = x_0^\top \tilde{\beta}\) is
Therefore, expected prediction error and mean squared error differ only by the constant \(\sigma^2\), representing the variance of the new observation \(y_0\).
3.2.3 Multiple Regression from Simple Univariate Regression¶
3.2.4 Multiple Outputs¶
Suppose we have multiple outputs \(Y_1, \dots, Y_K\) that we wish to predict from our inputs \(X_0, \dots, X_p\). We assume a linear model for each output
With \(N\) training cases we can write the model in matrix notation
Here \(\mathbf{Y}\) is the \(N \times K\) response matrix, \(\mathbf{X}\) is the \(N \times (p + 1)\) input matrix, \(\mathbf{B}\) is the \((p + 1) \times K\) matrix of parameters, and \(\mathbf{E}\) is the \(N \times K\) matrix of errors. A straightforward generalization of the univariate loss function is
The least squares estimates have the same form as before
Multiple outputs do not affect one another’s least squares estimates.
If the errors \(\varepsilon = (\varepsilon_1, \dots, \varepsilon_K)\) are correlated, then it might seem appropriate to modify RSS in favor of a multivariate version. Specifically, suppose \(\text{Cov}(\varepsilon) = \mathbf{\Sigma}\), then the multivariate weighted criterion
arises naturally from multivariate Gaussian theory. However, it can be shown again the solution is given by Equation eq3.39 (Exercise 3.11). If \(\mathbf{\Sigma}_i\) vary among observations, then this is no longer the case and the solution for \(\mathbf{B}\) no longer decouples.
Warning
Add solution to Exercise 3.11.