3.2 Linear Regression Models and Least Squares
=====================================

We have an input vector :math:`X^\top = (X_1, \dots, X_p)`, we want to predict a real-valued output :math:`Y`. The linear regression model has the form

.. math::

  f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j

Here :math:`X_j` can be quantitative inputs, transformations of quantitative inputs, or numerical encoding of qualitative inputs.

Typically we have a set of training data :math:`(x_1, y_1), \dots, (x_N, y_N)` from which to estimate the parameters :math`\beta`. The most popular estimation method is *least squares*, in which we pick the coefficients :math:`\beta = (\beta_0, \dots, \beta_p)^\top` to minimize the residual sum of squares

.. math::

  \text{RSS}(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2

From a statistical point of view, this criterion is reasonable if the training observations :math:`(x_i, y_i)` represent independent draws from their population. Even if the :math:`x_i`'s were not drawn randomly, the criterion is still valid if the :math:`y_i`'s are conditionally independent given the inputs :math:`x_i`.

Let :math:`\mathbf{X}_{N \times (p+1)}` and :math:`\mathbf{y}_{N \times 1}` be the matrix representation of the training data. We can write the residual sum-of-squares as

.. math::

  RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^\top (\mathbf{y} - \mathbf{X}\beta)

Differentiating w.r.t. :math:`\beta` we obtain

.. math::

  \frac{\partial \text{RSS}}{\partial \beta} & = -2\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\beta) \\
  \frac{\partial^2 \text{RSS}}{\partial\beta\partial\beta^\top} & = 2\mathbf{X}^\top\mathbf{X}

Assuming that :math:`\mathbf{X}` has full rank, and hence :math:`\mathbf{X}^\top\mathbf{X}` is positive definite, we set the first derivative to zero

.. math::

  \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\beta) = 0 \label{eq:eq3-1}

to obtain the unique solution

.. math::
  :label: eq3-6

  \hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

The fitted values at the training inputs are

.. math::

  \hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

The matrix :math:`\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top` is sometimes called the "hat" matrix because it puts the hat on :math:`\mathbf{y}`.

The figure below shows a different geometrical representation of the least squares estimate in :math:`\mathbb{R}^N`. We minimize :math:`\text{RSS}(\beta)` by choosing :math:`\hat{\beta}` so that the residual error :math:`\mathbf{y} - \hat{\mathbf{y}}` is orthogonal to the column space of :math:`\mathbf{X}`. The orthogonality is expressed in Equation :eq:`eq3-6`, and the resulting estimate :math:`\hat{\mathbf{y}}` is the *orthogonal projection* of :math:`\mathbf{y}` onto this subspace. The hat matrix :math:`\mathbf{H}` computes the orthogonal projection, and hence it is also known as a projection matrix.

3.2.1 Example: Prostate Cancer
-------------------------------------

3.2.2 The Gauss-Markov Theorem
-------------------------------------

We focus on estimation of any linear combination of the parameters :math:`\theta = a^\top\beta`. The least squares estimate of :math:`a^\top\beta` is

.. math::

  \hat{\theta} = a^\top\hat{\beta} = a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}

Considering :math:`\mathbf{X}` to be fixed, this is a linear function :math:`\mathbf{c}_0^\top\mathbf{y}` of the response vector :math:`\mathbf{y}`. If we assume that the linear model is correct, :math:`a^\top\hat{\beta}` is unbiased since

.. math::

  \text{E}(a^\top\hat{\beta}) & = \text{E}(a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}) \nonumber \\
	& = a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \text{E}(\mathbf{y}) \nonumber \\
  & = a^\top (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{X}\beta \nonumber \\
	& = a^\top \beta

*The Gauss-Markov theorem* states that if we have any other linear estimator :math:`\tilde{\theta} = \mathbf{c}^\top\mathbf{y}` that is unbiased for :math:`a^\top\beta`, that is :math:`\text{E}(\mathbf{c}^\top\mathbf{y}) = a^\top\beta`, then

.. math::

  \text{Var}(a^\top\hat{\beta}) \leq \text{Var}(\mathbf{c}^\top\mathbf{y})

For simplicity we have stated the result in terms of estimation of a single parameter :math:`a^\top\beta`. The proof is left as Exercise 3.3.

Consider the mean squared error of an estimator :math:`\tilde{\theta}` in estimating :math:`\theta`:

.. math::

  \text{MSE}(\tilde{\theta}) & = \text{E}(\tilde{\theta} - \theta)^2 \nonumber \\
	& = \text{E}(\tilde{\theta} - \text{E}(\tilde{\theta}) + \text{E}(\tilde{\theta}) - \theta)^2 \nonumber \\
	& = \text{E}\left[(\tilde{\theta} - \text{E}(\tilde{\theta}))^2 + 2 (\tilde{\theta} - \text{E}(\tilde{\theta}))(\text{E}(\tilde{\theta}) - \theta) + (\text{E}(\tilde{\theta}) - \theta)^2 \right] \nonumber \\
	& = \text{E}(\tilde{\theta} - \text{E}(\tilde{\theta}))^2 + 2 (\text{E}(\tilde{\theta}) - \theta) \text{E}(\tilde{\theta} - \text{E}(\tilde{\theta})) + [\text{E}(\tilde{\theta}) - \theta]^2 \nonumber \\
	& = \text{Var}(\tilde{\theta}) + [\text{E}(\tilde{\theta}) - \theta]^2

The first term is the variance, while the second term is the squared bias.

The Gauss-Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimators with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. **From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance.**

Consider the prediction of the new response at input :math:`x_0`,

.. math::

  Y_0 = f(x_0) + \varepsilon_0

Then the expected prediction error of an estimate :math:`\tilde{f}(x_0) = x_0^\top \tilde{\beta}` is

.. math::

  \text{E}(Y_0 - \tilde{f}(x_0))^2 & = \text{E}(f(x_0) + \varepsilon_0 - x_0^\top \tilde{\beta})^2 \nonumber \\
	& = \text{E}(\varepsilon_0^2) + 2\text{E}(\varepsilon_0(f(x_0) - x_0^\top\tilde{\beta})) + \text{E}(x_0^\top\tilde{\beta} - f(x_0))^2 \nonumber \\
	& = \sigma^2 + \text{E}(x_0^\top\tilde{\beta} - f(x_0))^2 \nonumber \\
	& = \sigma^2 + \text{MSE}(\tilde{f}(x_0))

Therefore, expected prediction error and mean squared error differ only by the constant :math:`\sigma^2`, representing the variance of the new observation :math:`y_0`.

3.2.3 Multiple Regression from Simple Univariate Regression
-------------------------------------

3.2.4 Multiple Outputs
-------------------------------------

Suppose we have multiple outputs :math:`Y_1, \dots, Y_K` that we wish to predict from our inputs :math:`X_0, \dots, X_p`. We assume a linear model for each output

.. math::

  Y_k & = \beta_{0k} + \sum_{j=1}^p X_j\beta_{jk} + \varepsilon_k \\
	& = f_k(X) + \varepsilon_k

With :math:`N` training cases we can write the model in matrix notation

.. math::

  \mathbf{Y} = \mathbf{XB} + \mathbf{E}

Here :math:`\mathbf{Y}` is the :math:`N \times K` response matrix, :math:`\mathbf{X}` is the :math:`N \times (p + 1)` input matrix, :math:`\mathbf{B}` is the :math:`(p + 1) \times K` matrix of parameters, and :math:`\mathbf{E}` is the :math:`N \times K` matrix of errors. A straightforward generalization of the univariate loss function is

.. math::

  \text{RSS}(\mathbf{B}) & = \sum_{k=1}^K \sum_{i=1}^N (y_{ik} - f_k(x_i))^2 \\
	& = \text{tr}[(\mathbf{Y} - \mathbf{XB})^\top (\mathbf{Y} - \mathbf{XB})]

The least squares estimates have the same form as before

.. math::
  :label: eq3.39

  \hat{\mathbf{B}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X^\top Y}

Multiple outputs do not affect one another's least squares estimates.

If the errors :math:`\varepsilon = (\varepsilon_1, \dots, \varepsilon_K)` are correlated, then it might seem appropriate to modify RSS in favor of a multivariate version. Specifically, suppose :math:`\text{Cov}(\varepsilon) = \mathbf{\Sigma}`, then the multivariate weighted criterion

.. math::

  \text{RSS}(\mathbf{B}; \mathbf{\Sigma}) = \sum_{i=1}^N (y_i - f(x_i))^\top \mathbf{\Sigma}^{-1}(y_i - f(x_i))

arises naturally from multivariate Gaussian theory. However, it can be shown again the solution is given by Equation :ref:`eq3.39` (Exercise 3.11). If :math:`\mathbf{\Sigma}_i` vary among observations, then this is no longer the case and the solution for :math:`\mathbf{B}` no longer decouples.

.. warning::

  Add solution to Exercise 3.11.