Applied Linear Regression (4e)
CHAPTER 12
Binomial and Poisson Regression
In this chapter we show how nearly all the methods described in this book can be extended to problems in which the response variable is a count rather than a measured variable. We will consider binomial regression, which includes the case of a binary categorical response, and also the closely related Poisson regression. We first review a bit about the binomial and Poisson distributions, and then describe regression models for counted responses with either binomial or Poisson distributions, emphasizing the connections to the rest of this book.
Books dedicated to binomial regression include Collett (2003) and Hosmer et al. (2013). Counted data more generally is covered in Agresti (2007, 2013). Counted data models can also be studied in the framework of generalized linear models, in which the linear model, binomial model, and Poisson model are all special cases. McCullagh and Nelder (1989) provided the basis for this approach.
12.1 DISTRIBUTIONS FOR COUNTED DATA
12.1.1 Bernoulli Distribution
Suppose the random variable Y has two possible values, perhaps called “success” or “failure,” with probability of success equal to θ where 0 ≤ θ ≤ 1. We label the possible outcomes of Y as y = 1 if success occurs, and y = 0 if success does not occur. We will say that Y with these characteristics has a Bernoulli distribution1 with probability of success θ. Using Appendix A.2,
\[\operatorname{E}(Y) = \theta \qquad \operatorname{Var}(Y) = \theta(1-\theta) \tag{12.1}\]
1 Named for Jacob Bernoulli, 1654–1705.
An important feature of the Bernoulli distribution is that the variance depends on the mean, Var(Y) = E(Y)(1 − E(Y)). The variance is largest for θ = 1/2, and smallest for θ close to 0 or 1. The Bernoulli is the only distribution for a random variable with sample space {0, 1}.
12.1.2 Binomial Distribution
The binomial distribution generalizes the Bernoulli. Suppose we have m random variables B1, B2, . . . , Bm, such that (1) each Bj has a Bernoulli distribution with the same probability θ of success; and (2) all the Bj are independent. Then if Y is the number of successes in the m trials, Y = ∑Bj, we say that Y has a binomial distribution with m trials and probability of success θ. We write this as Y ∼ Bin(m, θ). Each of the Bernoulli variables is Bj ∼ Bin(1, θ).
The probability mass function for the binomial is
\[\Pr(Y=j) = \binom{m}{j} \theta^j (1-\theta)^{(m-j)} \tag{12.2}\]
for j ∈ {0, 1, . . . , m}. The mean and variance of a binomial are
\[\operatorname{E}(Y) = m\theta \qquad\qquad \operatorname{Var}(Y) = m\theta(1-\theta) \tag{12.3}\]
Since m is known, both the mean and variance are determined by θ only. Both assumptions of constant θ and of independence are required for the binomial distribution to apply to the number of successes in m trials. For example, in a survey of family members about their view on a particular political issue, the number in favor of the issue will likely not be binomially distributed because the views of members of the same family are unlikely to be independent.
12.1.3 Poisson Distribution
Whereas the binomial distribution concerns the distribution of the number of successes in a fixed number m of trials, the Poisson distribution2 concerns the number of events of a specific type that occur in a fixed time or space. A Poisson variable Y can take the value of any nonnegative integer {0, 1, 2, . . .}. For example, if customers to an ice cream store arrive independently, at random but at a constant rate, then the number of customers arriving in any fixed period of time will follow a Poisson distribution.
We will say Y has a Poisson distribution with rate λ, Y ∼ Po(λ) if
\[\Pr(Y = y) = \exp(-\lambda)\lambda^{y} / y! \qquad\qquad y = 0, 1, 2, \dots \tag{12.4}\]
2 Named for Siméon Denis Poisson, 1781–1840.
Using Appendix A.2, it is not hard to show that
\[\operatorname{E}(Y) = \lambda \qquad\qquad\qquad \operatorname{Var}(Y) = \lambda \tag{12.5}\]
so the mean and variance of a Poisson variable are equal.
In the ice cream store arrival example, the count Y depends on both the assumption of independence of customers and constant arrival rate. The rate could vary with time of day or outdoor temperature, and so a Poisson assumption may be appropriate for short time intervals but not for longer intervals. Arrivals could be correlated if customers arrive in groups, for example, at the end of a high school sports event, again suggesting that the number of arrivals may not follow a Poisson distribution.
There are many interesting and useful relationships between the Poisson and the binomial that suggest that regression models for both types of responses should be studied together. In particular, suppose that Y1 ∼ Po(λ1) and Y2 ∼ Po(λ2), and suppose Y1 and Y2 are independent. Y1 could be the number of ice cream customers who arrive and buy an ice cream cone, and Y2 could be the number who arrive and buy a yogurt cone. Then the sum (Y1 + Y2) ∼ Po(λ1 + λ2) is the number of customers who arrive in the time period and buy a cone of either type. The conditional distribution of Y1 given the total number of customers arriving is binomial, Y1|(Y1 + Y2) ∼ Bin(Y1 + Y2, λ1/(λ1 + λ2)).
12.2 REGRESSION MODELS FOR COUNTS
The big idea is that the parameter for the counted distribution, θ for the binomial or λ for the Poisson, can depend on the values of predictors.
12.2.1 Binomial Regression
We consider the binomial case first. We assume that θ(x) depends on the values x of the regressors only through a linear combination β′x for some unknown β. We can write θ(x) as a function of β′x,
\[ \theta(\mathbf{x}) = m(\boldsymbol{\beta}'\mathbf{x}) \tag{12.6} \]
The quantity β′x is called the linear predictor. As in nonlinear models, the function m is called a kernel mean function. Because the left side of (12.6) is a probability, m(β′x) must map β′x, which can take any possible value in (−∞, ∞), into the range (0, 1). The most frequently used kernel mean function for binomial regression, and the only one we discuss in this book, is the logistic function,

Figure 12.1 The logistic kernel mean function.
\[\theta(\mathbf{x}) = m(\boldsymbol{\beta}'\mathbf{x}) = \frac{\exp(\boldsymbol{\beta}'\mathbf{x})}{1 + \exp(\boldsymbol{\beta}'\mathbf{x})} = \frac{1}{1 + \exp(-\boldsymbol{\beta}'\mathbf{x})} \tag{12.7}\]
The last two forms are equivalent representations for the same function. A graph of the logistic function is shown in Figure 12.1.
Most presentations of logistic regression work with the link function, which is the inverse of the kernel mean function. Solving (12.7) for β′x, we find
\[\log\left(\frac{\theta(\mathbf{x})}{1-\theta(\mathbf{x})}\right) = \boldsymbol{\beta}'\mathbf{x} \tag{12.8}\]
The left side of (12.8) is called a logit or log-odds and the right side is the linear predictor β′x. If we were to draw a graph of log{θ(x)/[1 − θ(x)]} versus β′x, we would get a straight line.
The ratio θ(x)/[1 − θ(x)] is the odds of success. For example, if the probability of success is 0.25, the odds of success are 0.25/(1 − 0.25) = 1/3, one success to each three failures. If the probability of success is 0.8, then the odds of success are 0.8/0.2 = 4, or four successes to one failure. Whereas probabilities are bounded between 0 and 1, odds can be any nonnegative number. The logit is the logarithm of the odds. According to Equation (12.8), the logit is equal to a linear combination of the regressors.
The data for logistic regression for the ith observation consist of the observed number of successes yi, the observed number of trials mi, and a vector xi of regressors computed from the predictors in the problem, as in Chapters 4 and 5. The three components of the logistic regression model are the following:
Distribution The distribution of (Yi|Xi = xi) ∼ Bin(mi, θ(xi)). Both the mean and the variance of Yi depend only on the known mi and on θ(xi).
Linear predictor The parameter θ(xi) depends on xi only through the linear predictor β′xi for some unknown parameter vector β.
Link function There is a link function, or equivalently its inverse the kernel mean function, that specifies the connection between θ(xi) and the linearly related regressors β′xi, such as the logit link at (12.8).
Logistic regression models are not fit with ols. Rather, maximum likelihood estimation is used, based on the binomial distribution; see Appendix A.11.2. Most statistics packages will make fitting logistic models easy, and the results will look just like the results for fitting ols.
Blowdown
On July 4, 1999, a storm with winds exceeding 90 miles per hour hit the Boundary Waters Canoe Area Wilderness in northeastern Minnesota, causing serious damage to the forest. Rich et al. (2007) studied the effects of this storm using a very extensive ground survey of the area, determining status, either alive or dead, of more than 3600 trees. Suppose θ(x) is the probability of death by blowdown for a tree with characteristics given by the regressors x.
We start with one predictor, and consider the dependence of the probability of blowdown on the diameter d of the tree, measured to the nearest 0.5 cm, for black spruce trees only. We will use log(d) as the single regressor beyond the intercept. The data file BlowBS contains d, the number of trees m of that diameter that were measured, and died, the number of trees that died in the blowdown. We view m as fixed, and model died given m as a binomial response, with regressors for the intercept and log(d). The data file has n = 35 rows, corresponding to the 35 unique values of d. It represents a total of ∑ m= 659 trees, with m ranging from 1 tree for some of the larger diameters to 91 for d= 6 cm.
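As a concrete sketch, and not a reproduction of the book's own scripts, a fit of this form can be obtained in R with the glm function, assuming the BlowBS data frame with columns d, m, and died has already been loaded; the object name fit1 is ours:
> # binomial response given as (successes, failures) = (died, m - died)
> fit1 <- glm(cbind(died, m - died) ~ log(d), family = binomial, data = BlowBS)
> summary(fit1)  # coefficient table comparable to Table 12.1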
As usual, we begin by graphing the data in Figure 12.2, with the blowdown fraction died/m on the vertical axis and the regressor log(d) on the horizontal axis. Since the sample sizes m are highly variable in this example, points in the graph are drawn with area proportional to m. Larger points with more trees are more important in fitting. Concentrating on the larger points, the probability of blowdown increases with log(d). The points based on few trees are further from a trend, as should be expected. For example, if m = 3, the value of died/m could only equal 0, 1/3, 2/3, or 1, and all these values are likely to be far from any fitted regression line.
The results of fitting the logistic model are summarized in Table 12.1. The coefficient summary is similar to the output for simple regression, giving the name of the regressor, its coefficient estimate, and standard error.
| | Estimate | Std. Error | z-Value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | −7.8925 | 0.6325 | −12.48 | 0.0000 |
| log(d) | 3.2643 | 0.2761 | 11.82 | 0.0000 |
Table 12.1 Logistic Regression Summary for the Black Spruce Blowdown Data
Residual deviance = 49.891 (33 df). Null deviance = 250.856 (34 df).

Figure 12.2 Plot of the blowdown fraction versus d, with the horizontal axis in log scale. The plotted points have area proportional to the sample size for that point. The solid line is the fitted logistic regression.
The ratio of the estimate to its standard error is called a z-value rather than a t-value. Since logistic regression does not have a variance parameter, the t-distribution is not appropriate, and the large sample normal approximation is used to get the significance levels in the last column. Also included in the output is the deviance, which is analogous to the residual sum of squares in linear regression and is described in Section 12.2.2.
The very small p-values suggest that both the intercept and the coefficient for log(d) are unlikely to be equal to 0. Since the logistic model is in the scale of the logarithm of the odds, we can use Section 4.1.6 on responses in log scale to interpret coefficients. From (4.5), if the diameter d is increased by 10%, then the odds of death by blowdown are multiplied by exp[log(1.1) × 3.264] ≈ 1.36.
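As an arithmetic check, and assuming the hypothetical fit1 object from the sketch above:
> exp(log(1.1) * coef(fit1)["log(d)"])  # about 1.36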
The fitted curve, with equation 1/{1 + exp[−(−7.8925 + 3.2643 × log(d))]} is shown on Figure 12.2. The agreement of the line to the points with larger m is encouraging.
To account for the place-to-place variation in storm intensity that is likely to change the probability of death by blowdown, a measure s of local severity of the storm was computed as the fraction of the total basal area of trees of four major species that died near the measured tree. The data file Blowdown includes d, s, and a factor spp for species with 9 levels for all n = 3666 trees, with one row in the file for each tree that was measured. The variable y equals 1 if a particular tree died as a result of the blowdown, and 0 if it survived. The data on black spruce trees used previously are included and require 659 rows, one for each of the trees.
The Wilkinson–Rogers notation for specifying a model was designed for linear models, but it is often used for logistic, Poisson, and other generalized linear models. The model we consider is y ∼ log(d) + s + log(d):s, allowing for an interaction between s and log(d). The “∼” should be read as “is modeled using,” since the response depends on the regressors only through the dependence of the log-odds on the regressors.
The regression summary using the data in Blowdown with spp equal to black spruce, and therefore ignoring trees of all other species, is given in Table 12.2. Since an interaction is included, the z-tests for main effects are relevant only if the test for the interaction suggests the interaction is not needed. The tiny p-value for the interaction suggests all the terms should be maintained, even though the z-value for the main effect of log(d) is small.
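A hedged R sketch of a fit of this form is given below; the species label "black spruce" and the object names bs and fit2 are assumptions about how the Blowdown file is coded:
> bs <- subset(Blowdown, spp == "black spruce")
> # log(d)*s expands to log(d) + s + log(d):s
> fit2 <- glm(y ~ log(d) * s, family = binomial, data = bs)
> summary(fit2)  # compare with Table 12.2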
The effects plot is shown in Figure 12.3. The model returns fitted values in the logit scale, and these were transformed first to fitted odds by exponentiating, and then to the fitted probabilities shown on the plot.
| | Estimate | Std. Error | z-Value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | −3.678 | 1.425 | −2.58 | 0.010 |
| log(d) | 0.578 | 0.633 | 0.91 | 0.361 |
| s | −11.205 | 3.638 | −3.08 | 0.002 |
| log(d):s | 7.085 | 1.645 | 4.31 | 0.000 |
Table 12.2 Black Spruce Blowdown Data
Residual deviance = 541.7 (655 df). Null deviance = 856.2 (658 df).

Figure 12.3 Effects plot for black spruce trees in the blowdown data with both d and s as predictors.
The horizontal axis is the diameter d. Because of the interaction, the dependence of the probability of blowdown on d will be different for each value of s. Three curves are shown in Figure 12.3, corresponding roughly to the 25%, 50%, and 75% points of the distribution of s for the black spruce trees. The effect of d increases fastest for the highest quartile and slowest for the lowest quartile. This plot could be drawn with many variations, including using log(d) on the horizontal axis, using odds or log-odds on the vertical axis, and reversing the roles of d and s in the plot, but the presentation here seems to catch the essence of the solution.
12.2.2 Deviance
In multiple linear regression (Chapter 6), the residual sum of squares provides the basis for tests for comparing mean functions. In logistic and Poisson regression, the residual sum of squares is replaced by the deviance, which is often called G2 . The deviance is defined for logistic regression to be
\[G^2 = 2\sum\_{i=1}^n \left[ \mathbf{y}\_i \log \left(\frac{\mathbf{y}\_i}{\hat{\mathbf{y}}\_i}\right) + (m\_i - \mathbf{y}\_i) \log \left(\frac{m\_i - \mathbf{y}\_i}{m\_i - \hat{\mathbf{y}}\_i}\right) \right] \tag{12.9}\]
where ŷi = miθ̂(xi) are the fitted number of successes in mi trials. The df associated with the deviance is equal to the number of cases n used in the calculation minus the number of elements of β that were estimated; for the black spruce data fit in Table 12.2, df = 659 − 4 = 655.
Methodology for comparing models parallels the results in Section 6.1. Write β′x = β1′x1 + β2′x2, and consider testing
\[\begin{aligned} \text{NH:} & \quad \theta(\mathbf{x}) = m(\boldsymbol{\beta}\_1' \mathbf{x}\_1) \\ \text{AH:} & \quad \theta(\mathbf{x}) = m(\boldsymbol{\beta}\_1' \mathbf{x}\_1 + \boldsymbol{\beta}\_2' \mathbf{x}\_2) \end{aligned}\]
Obtain the deviance G²NH and degrees of freedom dfNH under the null hypothesis, and then obtain G²AH and dfAH under the alternative hypothesis. As with linear models, we will have evidence against the null hypothesis if G²NH − G²AH is too large. To get a p-value, we compare the difference G²NH − G²AH with the χ2 distribution with df = dfNH − dfAH, not with an F-distribution as was done for linear models.
The NH that the probability of blowdown is constant versus AH that the probability depends on a set of regressors is equivalent to the overall test in linear models, and it is based on the difference between the null deviance and the residual deviance. For the model summarized in Table 12.1, this is G2 = 250.86 − 49.89 = 200.97, which is compared with the χ2 distribution with 34 − 33 = 1 df. The significance level is very tiny, suggesting the hypothesis of probability of blowdown independent of d is firmly rejected. Similarly, the overall test for the model summarized in Table 12.2 is G2 = 856.2 − 541.7 = 314.5, which is compared with the χ2 distribution with 658 − 655 = 3 df. Once again the significance level is very small, providing evidence that the probability of blowdown is not constant.
For the next example, we consider testing NH: y ∼ log(d) versus AH: y ∼ log(d) + s + log(d):s. The NH model was fit in Table 12.1 using the group of trees with the same diameter as the unit of analysis, while the AH model was fit using each tree as the unit of analysis. We need to refit the NH model using the tree as the unit of analysis, with the data file Blowdown. Estimates and standard errors are the same fitting either the grouped binomial data or the individual Bernoulli data, but the deviance is different. Fitting to the individual trees, the residual deviance is 655.24 with 657 df, and the test is G2 = 113.50 with 2 df. Once again, the significance level is very small, and the NH is firmly rejected.
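Assuming the hypothetical objects from the earlier sketches, the same change-in-deviance test can be computed with anova(), which reports the χ2 significance level when test = "Chisq" is requested:
> fit_nh <- glm(y ~ log(d), family = binomial, data = bs)
> anova(fit_nh, fit2, test = "Chisq")  # difference in deviance, 113.50 with 2 df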
Tests with logistic models, and with the Poisson models to be introduced later, are often summarized in an Analysis of Deviance table that is directly analogous to the Analysis of Variance table used to study linear models. As an example, we consider a third model for blowdown probability that uses all the data, adding a factor with nine levels for species, and allowing each species to have its own log(d) : s interaction. This corresponds to fitting with main effects, all two-factor interactions, and a three-factor interaction. The results may be summarized in the Analysis of Deviance table in Table 12.3. This is a Type II table, as in Section 6.2, and is interpreted and used in the same way. Starting at the bottom of the table, the three-factor interaction test is considered first. Testing stops because it has a very small significance level, as lower-order effects are not tested when higher-order effects in the same predictors are nonzero.
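One way to produce a Type II table of this kind in R is the Anova function in the car package of Fox and Weisberg (2011); the sketch below assumes that package is installed and continues our earlier naming:
> library(car)
> fit3 <- glm(y ~ log(d) * s * spp, family = binomial, data = Blowdown)
> Anova(fit3)  # Type II likelihood ratio tests, as in Table 12.3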
Figure 12.4 is a summary effects plot for the blowdown data. The model fit suggests that effects of s and d are needed for most tree species. We see there are interesting differences between species. For red pine, the probability of blowdown appears to decrease with d, while for jack pine, the probability of blowdown may be independent of d. Cedar trees were relatively immune to blowdown except in areas of very high severity. Further analysis of these
| | df | G2 | Pr(>χ2) |
|---|---|---|---|
| log(d) | 1 | 227.8 | 0.0000 |
| s | 1 | 594.0 | 0.0000 |
| spp | 8 | 509.5 | 0.0000 |
| log(d):s | 1 | 41.6 | 0.0000 |
| log(d):spp | 8 | 71.9 | 0.0000 |
| s:spp | 8 | 36.5 | 0.0000 |
| log(d):s:spp | 8 | 20.4 | 0.0088 |
Table 12.3 Analysis of Deviance for Blowdown

Figure 12.4 Effects plots in the blowdown data with both d and s as predictors.
data would require more work, and quite likely a separate analysis for each species could be enlightening.
12.3 POISSON REGRESSION
When the data are to be modeled as if they are Poisson counts, the rate parameter is assumed to depend on the regressors with linear predictor β′x through the link function
\[\log[\lambda(\boldsymbol{\beta}'\mathbf{x})] = \boldsymbol{\beta}'\mathbf{x} \tag{12.10}\]
Poisson regression models are often called log-linear models.
The data for Poisson regression for the ith observation consist of the observed number of events yi, and the values of the regressors. The three components of the Poisson regression model are as follows:
Distribution The distribution of (Yi|Xi = xi) ∼ Po[λ(xi)]. Linear predictor The parameter λ(xi) depends on xi only through the linear predictor β′xi for some unknown parameter vector β. Link function The link function is the log-link (12.10).
Maximum likelihood estimation is the usual method used to fit Poisson regression models. The deviance for Poisson regression is given by
\[G^2 = 2\sum\_{i=1}^n \left[ y\_i \log \left(\frac{y\_i}{\hat{y}\_i}\right) - \left( y\_i - \hat{y}\_i \right) \right] \tag{12.11}\]
where ŷi is the fitted value exp(β̂′xi).
Mathematical Sciences PhDs
The data in Table 12.4 gives the number of PhD degrees awarded in the mathematical sciences in the United States in 2008–2009 (Phipps et al., 2009). The rows of the table correspond to the factor type, with six levels. The first four rows correspond to mathematics departments grouped into Type I public and private for the largest universities, and Types II and III for smaller universities. Type IV corresponds to programs in biostatistics or statistics in any university. Type Va is for applied mathematics in any university. Columns subdivide the counts further by sex and citizenship of the PhD recipient. We can view each of the cell counts as a Poisson random variable with possibly different rates. The data are in the file AMSsurvey in a format that is suitable for fitting with Poisson regression. The data file has one row for each cell in the table, so there are n = 24 rows. Columns are given for type, sex, and citizen. An additional column called count gives the cell counts shown in the table.
| Type | Non-U.S. Female | Non-U.S. Male | U.S. Female | U.S. Male |
|---|---|---|---|---|
| I(Pu) | 29 | 130 | 35 | 132 |
| I(Pr) | 25 | 79 | 20 | 87 |
| II | 50 | 89 | 47 | 96 |
| III | 39 | 53 | 32 | 47 |
| IV | 105 | 122 | 54 | 71 |
| Va | 12 | 28 | 14 | 34 |
Table 12.4 American Mathematics Society PhD Survey, 2008–2009
| | df | G2 | Pr(>χ2) |
|---|---|---|---|
| type | 5 | 233.3 | 0.0000 |
| sex | 1 | 183.0 | 0.0000 |
| citizen | 1 | 5.9 | 0.0149 |
| type:sex | 5 | 69.1 | 0.0000 |
| type:citizen | 5 | 24.0 | 0.0002 |
| sex:citizen | 1 | 0.5 | 0.4635 |
| type:sex:citizen | 5 | 1.4 | 0.9222 |
Table 12.5 Analysis of Deviance for Mathematical Sciences PhDs
Table 12.6 Poisson Regression Summary for the Mathematical Sciences PhDs
| | Estimate | Std. Error | z-Value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 3.0992 | 0.1646 | 18.83 | 0.0000 |
| typeI(Pu) | 0.3417 | 0.2143 | 1.59 | 0.1109 |
| typeII | 0.7681 | 0.2026 | 3.79 | 0.0002 |
| typeIII | 0.5436 | 0.2150 | 2.53 | 0.0114 |
| typeIV | 1.5310 | 0.1870 | 8.19 | 0.0000 |
| typeVa | −0.6296 | 0.2814 | −2.24 | 0.0253 |
| sexMale | 1.3053 | 0.1681 | 7.77 | 0.0000 |
| citizenUS | 0.0284 | 0.1377 | 0.21 | 0.8364 |
| typeI(Pu):sexMale | 0.1041 | 0.2184 | 0.48 | 0.6335 |
| typeII:sexMale | −0.6597 | 0.2097 | −3.15 | 0.0017 |
| typeIII:sexMale | −0.9628 | 0.2288 | −4.21 | 0.0000 |
| typeIV:sexMale | −1.1115 | 0.1993 | −5.58 | 0.0000 |
| typeVa:sexMale | −0.4363 | 0.2878 | −1.52 | 0.1296 |
| typeI(Pu):citizenUS | 0.0207 | 0.1767 | 0.12 | 0.9070 |
| typeII:citizenUS | −0.0001 | 0.1821 | −0.00 | 0.9997 |
| typeIII:citizenUS | −0.1808 | 0.2061 | −0.88 | 0.3805 |
| typeIV:citizenUS | −0.6251 | 0.1771 | −3.53 | 0.0004 |
| typeVa:citizenUS | 0.1539 | 0.2545 | 0.60 | 0.5455 |
Residual deviance = 1.957 (6 df). Null deviance = 521.444 (23 df).
Table 12.5 gives the Type II Analysis of Deviance table for the fit of the log-linear Poisson model with all main effects, two-factor interactions, and the three-factor interaction. Starting as usual at the bottom, both the type:sex:citizen and the sex:citizen interactions have large p-values and appear to be negligible. The remaining two-factor interactions have small p-values and are not negligible. Before further summarization, we fit the Poisson model including only the important two-factor interactions.
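A sketch of this reduced fit in R, assuming the AMSsurvey data frame has columns count, type, sex, and citizen, is:
> fit_ams <- glm(count ~ type + sex + citizen + type:sex + type:citizen,
+                family = poisson, data = AMSsurvey)
> summary(fit_ams)  # coefficient table comparable to Table 12.6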
The regression summary is provided in Table 12.6. As is true with any regression model with interactions present, interpretation of coefficient estimates is challenging because the parameters depend on the choice of regressors used to represent the factors. The interaction parameters are the most easily interpretable. For example, the coefficient for U.S. citizens at type IV institutions is −0.6251, and this describes the difference between citizens and noncitizens.

Figure 12.5 Effects plots for Mathematical Sciences PhDs.
Since the Poisson model uses a log-link, the expected number of U.S. citizen PhDs is exp(−0.6251) ≈ 0.5 times the expected number of non-U.S. citizen PhDs. The coefficients for the main effects are not easily interpretable. The difference between Male and Female is not reflected by the coefficient for sex because this difference depends on the value of type.
A better way to understand the fitted model is to get estimated cell counts for each of the 24 cells based on the model, and then view them in the effects plots shown in Figure 12.5, one for each of the two-factor interactions. In both plots, the horizontal axis is levels of type. The levels have been ordered according to the total number of PhDs awarded, as this makes the graphs easier to read. The vertical axis is the fitted number of PhDs. The Poisson model used a log-link, so an alternative of plotting log-fitted values on the vertical axis could have been used. The lines joining the points in the plot are just visual aids, as fitted values are available only at the points shown. The error bars are 95% confidence intervals, without adjusting for multiple inferences, for the estimated number of PhDs awarded.
Figure 12.5a is for the type:citizen interaction. The number of PhDs for citizens and noncitizens are essentially the same for all types of institutions except for Type IV, statistics and biostatistics programs, which have many more noncitizen PhDs awarded, although this difference is exaggerated because the vertical axis is not in log scale. The picture for the type:sex interaction is a little more complicated. Males outnumber females at all levels of type, except perhaps for Type IV. The sex differences vary by type, and are largest in the Type I public and private universities.
12.3.1 Goodness of Fit Tests
If a Poisson mean function is correctly specified, the residual deviance G2 will be distributed as a χ2 (n − p′) random variable, where n is the number of cells and p′ is the number of regressors fit. If the mean function is not correctly specified, or if the Poisson assumption is wrong, then G2 will generally be too large, and so a lack of fit test can be obtained by comparing the value of G2 to the relevant χ2 distribution. For the model summarized in Table 12.6, the deviance is G2 = 1.96, and when compared with the χ2 (6) distribution, we get a significance level of 0.92, suggesting no lack of fit of the model used.
The same idea can be used for binomial regression when the sample sizes mi are larger than 1. For Table 12.1, we have G2 = 49.89 with 33 df, corresponding to a p-value of 0.03, providing modest evidence of lack of fit. Since we found that adding s to that model improved the fit, finding that the initial model is inadequate is not surprising.
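These significance levels are upper-tail χ2 probabilities and can be computed directly, for example:
> pchisq(1.957, df = 6, lower.tail = FALSE)   # about 0.92, for the Table 12.6 model
> pchisq(49.891, df = 33, lower.tail = FALSE) # about 0.03, for the Table 12.1 model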
An alternative to using G2 for lack of fit testing is to use Pearson’s X2 for testing, given by the familiar formula
\[X^2 = \sum\_{i=1}^n \frac{\left(\mathbf{y}\_i - \hat{\mathbf{y}}\_i\right)^2}{\hat{\mathbf{y}}\_i} \tag{12.12}\]
Like G2 , X2 is compared with χ2 (n − p′) to get significance levels. In large samples, the two tests will give the same inference, but in smaller samples X2 is generally more powerful.
In binomial regression with all or nearly all the mi = 1, neither G2 nor X2 provides a lack of fit test.
12.4 TRANSFERRING WHAT YOU KNOW ABOUT LINEAR MODELS
Most of the methodology developed in this book transfers to problems with binomial or Poisson responses. In this section, important connections are briefly summarized.
12.4.1 Scatterplots and Regression
Graphing data, Chapter 1, is just as important in binomial and Poisson regression as it is in linear regression. In problems with a binary response, plots of the response versus predictors or regressors are generally not very helpful because the response only has two values. Smoothers, however, can help look at these plots as well. Plots of predictors with color used to indicate the level of the response can also be helpful.
12.4.2 Simple and Multiple Regression
The general ideas in Chapters 2 and 3 apply to binomial and Poisson models, even if the details differ. With the counted data models, estimates β̂ and Var(β̂|X) are computed using the appropriate maximum likelihood methods, not with the formulas in these chapters. Once these are found, they can be used in the formulas and methods given in the text. For example, a point estimate and standard error for a linear combination of the elements of β is given by (3.26), but with σ̂2 set equal to 1, and (X′X)−1 replaced by the covariance matrix of β̂ from the binomial or Poisson fit. Confidence intervals and tests use the standard normal rather than a t-distribution.
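For example, here is a sketch of this computation for a linear combination a′β̂, using the hypothetical fit2 object from Section 12.2.1, which has four coefficients; the weights in a are purely illustrative:
> a <- c(0, 1, 0, 0.5)
> est <- sum(a * coef(fit2))
> se <- sqrt(drop(t(a) %*% vcov(fit2) %*% a))
> est + c(-1, 1) * 1.96 * se  # normal-theory 95% confidence interval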
12.4.3 Model Building
Chapters 4 and 5 apply with modest modification in binomial and Poisson regression. Since both binomial and Poisson models use logarithms in their link functions, the results of Section 4.1.7 can be useful.
12.4.4 Testing and Analysis of Deviance
The t-tests discussed in Chapters 2, 3, and 6 are replaced by z-tests for binomial and Poisson models. The F-tests in Chapter 6 are replaced by χ2 tests based on changes in deviance. The marginality principle, Section 6.2, is the guiding principle for testing with counted responses.
In linear models, the t-test and the F-test of the same hypothesis are equivalent, with F = t2. With binomial and Poisson responses, the corresponding z-tests and G2 tests are equivalent only for very large samples, and in small samples they can give conflicting summaries. The G2 tests are generally preferred.
12.4.5 Variances
Failure of the assumptions needed for binomial or Poisson fitting may be reflected in overdispersion, meaning that the variation between observations given the predictors is larger than the value required by the model. One general approach to overdispersion is to fit models that allow for it, such as the binomial or Poisson mixed models similar to those in Section 7.4. Other models, for example, using negative binomial distributions rather than binomial (Hilbe, 2011), can account for overdispersion. Alternatively, variance corrections like those in Section 7.2.1 are also available, and some software packages including Stata offer them as “robust” standard errors.
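In R, one simple adjustment is to refit with a quasibinomial or quasipoisson family, which estimates a dispersion factor and inflates the standard errors accordingly; a sketch using the hypothetical fit_ams object from Section 12.3:
> fit_q <- update(fit_ams, family = quasipoisson)
> summary(fit_q)$dispersion  # estimated dispersion; values near 1 suggest little overdispersion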
12.4.6 Transformations
Transformation of the response is not relevant with binomial and Poisson models. Transformation of predictors is relevant, however, and all the methodology in Chapter 8 can be used.
12.4.7 Regression Diagnostics
Many diagnostic methods depend on residuals. In binomial and Poisson models, the variance depends on the mean, and any useful residuals must be scaled to account for variance. A generalization of the Pearson residuals defined in Section 9.1.3 is appropriate for most purposes. Fox and Weisberg (2011, chapter 6) provide examples of applying diagnostic methods for binomial and Poisson models.
12.4.8 Variable Selection
All the ideas discussed in Chapter 10 carry over to binomial and Poisson models.
12.5 GENERALIZED LINEAR MODELS
The multiple linear regression, logistic, and Poisson log-linear models are particular instances of generalized linear models. They share three basic characteristics:
- 1. The conditional distribution of the response Y|X is distributed according to an exponential family distribution. The important members of this class include the normal, binomial, Poisson, and gamma distributions.
- 2. The response Y depends on the regressors only through the linear combination of terms β′x.
- 3. The mean E(Y|X = x) = m(β′x) for some kernel mean function m. For the multiple linear regression model, m is the identity function, and for logistic regression it is the logistic function. The Poisson was specified using the log link, so its m is the inverse of the log, or the exponential function. Other choices of the kernel mean function are possible but are used less often in practice.
These three components are enough to specify completely a regression problem along with methods for computing estimates and making inferences. The methodology for these models generally builds on the methods in this book, usually with only minor modification. Generalized linear models were first suggested by Nelder and Wedderburn (1972) and are discussed at length by McCullagh and Nelder (1989). Some statistical packages use common software to fit all generalized linear models, including the multiple linear regression model.
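In R, for example, the glm function fits all three models with the same syntax, changing only the family argument; the names y, x, and dat below are placeholders:
> glm(y ~ x, family = gaussian, data = dat)  # multiple linear regression
> glm(y ~ x, family = binomial, data = dat)  # logistic regression
> glm(y ~ x, family = poisson, data = dat)   # Poisson log-linear regression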
12.6 PROBLEMS
- 12.1 (Data file: Blowdown )
- 12.1.1 Create a table that gives the number of trees that survived and the number that died of each of the nine species.
- 12.1.2 Select the rows from the data file with spp equal to black spruce to get the data on the black spruce trees only. Draw the graph of status y versus log(d), and add a smoother. Does the graph support fitting a logistic model?
- 12.1.3 Fit the same model as is used in Table 12.1, but fit the Bernoulli regression model by fitting to the individual trees. Show that the estimates and standard errors are identical to those in Table 12.1, but the deviance and df are different.
- 12.1.4 Add (log(d))2 to the mean function to allow for a possible decline in the probability of blowdown for the largest trees. Obtain the z-test that the coefficient for the quadratic term is 0, and also obtain the G2 test for the same hypothesis. Show that these two tests are not identical, that is G2 ≠ z2 , and state the conclusions from the tests. Draw the effects plot for d; does the quadratic model allow for declining probabilities?
- 12.2 Professor ratings (Data file: Rateprof) Problem 6.10 concerned learning about quality rating as a function of several other predictors. In this problem, take as the response variable pepper with values yes and no, where yes means that the consensus of the raters is that the instructor is physically attractive. The predictors are gender, discipline, quality, easiness, and raterInterest. Find a set of regressors that appears to model the probability that pepper = yes, and summarize your results. (Hint: In some computer programs you may need to convert the values no and yes to 0 and 1. R will do this automatically, ordering the levels alphabetically.)
- 12.3 Downer data (Data file: Downer) For unknown reasons, dairy cows sometimes become recumbent—they lie down. Called downers, these cows may have a serious illness that may lead to their death. These data are from a study of blood samples of over 400 downer cows studied at the Ruakura New Zealand Animal Health Laboratory during 1983–1984. A variety of blood tests were performed, and for many of the animals, the outcome (survived, died) was determined. The goal is to see if survival can be predicted from the blood measurements. The variables in the data file are described in Table 12.7. These data were collected from veterinary records, and not all variables were recorded for all cows.
- 12.3.1 Consider first predicting outcome from myopathy. Find the fraction of surviving cows for myopathy = 0 and for myopathy = 1.
- 12.3.2 Fit the logistic regression outcome ∼ myopathy. Write a sentence that explains the meaning of each of the coefficient estimates, and provide 95% confidence intervals. Obtain the estimated probability of survival when myopathy = 0 and when
| Variable | n | Description |
|---|---|---|
| ast | 429 | Serum aspartate amino transferase (U/L at 30°C) |
| calving | 431 | Factor with levels before and after calving |
| ck | 413 | Serum creatine phosphokinase (U/L at 30°C) |
| daysrec | 432 | Days recumbent when measurements were done |
| inflamat | 136 | Is inflammation present? no or yes |
| myopathy | 222 | Is muscle disorder present? a factor with levels absent and present |
| pcv | 175 | Packed cell volume (hematocrit), percentage |
| urea | 266 | Serum urea (mmol/l) |
| outcome | 435 | survived or died |
Table 12.7 The Downer Data
Source: Clark et al. (1987).
myopathy = 1, and compare with the observed survival fractions in Problem 12.3.1.
- 12.3.3 Next, consider the regression problem with only ck as a predictor. Since ck is observed more often than is myopathy, this regression will be based on more cases than were used in the first two parts of this problem. Fit the logistic regression mean function with log(ck) as the only regressor beyond the intercept. Summarize results.
- 12.3.4 Fit the logistic mean function y ∼ myopathy + log(ck) + myopathy:log(ck). Obtain a Type II Analysis of Deviance table and summarize the results. Draw and interpret an effects plot, assuming the interaction is significant.
- 12.4 Starting with (12.7), derive (12.8).
- 12.5 Donner party (Data file: Donner) In the winter of 1846–1847, about 90 wagon train emigrants in the Donner party were unable to cross the Sierra Nevada Mountains of California before winter, and almost half of them starved to death. The data in file Donner from Johnson (1996) include some information about each of the members of the party. The variables include age, the age of the person; sex, whether male or female; status, whether the person was a member of a family group, a hired worker for one of the family groups, or a single individual who did not appear to be a hired worker or a member of any of the larger family groups; and y, a factor with levels died and survived.
- 12.5.1 How many men and women were in the Donner Party? What was the survival rate for each sex? Obtain a test that the survival rates were the same against the alternative that they were different. What do you conclude?
- 12.5.2 Fit the logistic regression model y∼age, and provide an interpretation for the fitted coefficient for age.
- 12.5.3 Use your computer package to draw a scatterplot of the Pearson residuals from the model fit in Problem 12.5.2 versus age. The residual plot will consist of two curves, with the curve of all positive values corresponding to survivors and the negative curve for deaths. As in Chapter 9, residual plots can be used to diagnose curvature, but this is quite hard without the aid of a smoother added to the plot.
- 12.5.4 Fit the logistic regression model y ∼ age + age2 + sex + status and summarize results.
- 12.6 Challenger (Data file: Challeng) These data from Dalal et al. (1989) record the performance of O-rings for the 23 U.S. space shuttle missions prior to the Challenger disaster of January 28, 1986. For each of the previous missions, the temperature at takeoff and the pressure of a prelaunch test were recorded, along with the number of O-rings that failed out of 6.
Use these data to try to understand the probability of failure as a function of temperature, and of temperature and pressure. Use your fitted model to estimate the probability of failure of an O-ring when the temperature was 31°F, the launch temperature on January 28, 1986.
- 12.7 Titanic (Data file: Whitestar) The Titanic was a British luxury passenger liner that sank when it struck an iceberg about 640 km south of Newfoundland on April 14–15, 1912, on its maiden voyage to New York City from Southampton, England. Of 2201 known passengers and crew, only 711 are reported to have survived. These data from Dawson (1995) classify the people on board the ship according to their sex as male or female; age, either child or adult; and class, either first, second, third, or crew. Not all combinations of the three factors occur in the data, since no children were members of the crew. For each age/sex/class combination, the number of people m and the number surviving Surv are also reported. The data are shown in Table 12.8.
- 12.7.1 Fit a logistic regression model with terms for factors sex, age, and class. On the basis of examination of the data in Table 12.8, explain why you expect that this mean function will be inadequate to explain these data.
- 12.7.2 Fit a logistic regression model that includes all the terms of the last part, plus all the two-factor interactions. Use appropriate testing procedures to decide if any of the two-factor interactions can be eliminated. Assuming that the mean function you have obtained matches the data well, summarize the results you have obtained by interpreting the parameters to describe different
| Class | Female Adult | Female Child | Male Adult | Male Child |
|---|---|---|---|---|
| Crew | 20/23 | — | 192/862 | — |
| First | 140/144 | 1/1 | 57/175 | 5/5 |
| Second | 80/93 | 13/13 | 14/168 | 11/11 |
| Third | 76/165 | 14/31 | 75/462 | 13/48 |
Table 12.8 Data from the Titanic Disaster of 1912. Each Cell Gives Surv/m, the Number of Survivors and the Number of People in the Cell
survival rates for various factor combinations. (Hint: How does the survival of the crew differ from the passengers? First class from third class? Males from females? Children versus adults? Did children in first class survive more often than children in third class?)
12.8 More Blowdown (Data file: Blowdown )
- 12.8.1 For the blowdown example, fit the model y ∼ log(d) + s + log(d):s for spp = paper birch and summarize results.
- 12.8.2 Repeat for spp= aspen.
12.9 (Data file: AMSsurvey)
- 12.9.1 The example discussed in Section 12.3 concerns 2008–2009 PhDs in the mathematical sciences. Also included in the data file is an additional variable count11 that gives the number of mathematical sciences PhDs in 2011–2012. Analyze these data to parallel the analysis in the text and summarize your results.
- 12.9.2 View these data as a four-dimensional table, with the dimensions type, sex, citizen, and year, where year is either 2008–2009 or 2011–2012. Fit models to this four-factor problem, and summarize results. (Hint: This will require that you reform the data file to have 48 rows. For example, in R, the user-unfriendly reshape function can do this.
> AMS1 <- reshape(AMSsurvey, varying = c("count", "count11"),
+     v.names = "y", times = c("2008-09", "2011-12"),
+     timevar = "year", direction = "long")
> AMS1$year <- factor(AMS1$year)
The variable year is now a factor and y is the count of PhDs.)
Appendix
A.1 WEBSITE
The web address for this book is http]umnedualred .
The website includes information about using R with this book, a description of an R package called alr that includes all the data files described, and solutions to odd-numbered problems.
A.2 MEANS, VARIANCES, COVARIANCES, AND CORRELATIONS
Suppose we let u1, u2, . . . , un be n random variables.1 Also let a0, a1, . . . , an be n + 1 known constants.
A.2.1 The Population Mean and E Notation
The symbol E(ui) is read as the expected value of the random variable ui. The phrase “expected value” is the same as the phrase “mean value.” Informally, the expected value of ui is the average value of a very large sample drawn from the distribution of ui. If E(ui) = 0, then the average value we would get for ui if we sampled its distribution repeatedly is 0. Since ui is a random variable, any particular realization of ui is likely to be nonzero.
The expected value is a linear operator, which means
\[\begin{aligned} \mathbb{E}(a\_0 + a\_1 u\_1) &= a\_0 + a\_1 \mathbb{E}(u\_1) \\ \mathbb{E}\left(a\_0 + \sum a\_i u\_i\right) &= a\_0 + \sum a\_i \mathbb{E}(u\_i) \end{aligned} \tag{A.1}\]
1 Formally, we have random variables U1, . . . , Un and u1, . . . , un that are realizations of the Ui, but we ignore here the distinction between a random variable and its realization.
For example, suppose all the ui have the same expected value and we write E(ui) = µ, i = 1, . . . , n. The sample mean of the ui is ū = ∑ui/n = (1/n)∑ui, and the expected value of the sample mean is
\[\operatorname{E}(\overline{u}) = \operatorname{E}\left(\frac{1}{n}\sum u\_i\right) = \frac{1}{n}\sum \operatorname{E}(u\_i) = \frac{1}{n}(n\mu) = \mu\]
We say that ū is an unbiased estimate of the population mean µ, since its expected value is µ.
A.2.2 Variance and Var Notation
The symbol Var(ui) is for the variance of ui. The variance is defined by the equation Var(ui) = E[ui − E(ui)]2 , the expected squared difference between an observed value for ui and its mean value. The larger Var(ui), the more variable observed values for ui are likely to be. The symbol σ2 is often used for a variance, or σu 2 might be used for the variance of the identically distributed ui if several variances are being discussed. The square root of a variance, often σ or σu, is the standard deviation, and is in the same units as the units of the random variable ui. For example, if the ui are heights in centimeters, then units of σu are also centimeters. The units of σu 2 are cm2 , which can be much harder to interpret.
The general rule for the variance of a sum of uncorrelated random variables is
\[\operatorname{Var}\left(a\_0 + \sum a\_i u\_i\right) = \sum a\_i^2 \operatorname{Var}(u\_i) \tag{A.2}\]
The a0 term vanishes because the variance of a constant is 0. Assuming that Var(ui) = σ2 , we can find the variance of the sample mean of independently, identically distributed ui:
\[\operatorname{Var}(\overline{u}) = \operatorname{Var}\left(\frac{1}{n}\sum u\_i\right) = \frac{1}{n^2}\sum \operatorname{Var}(u\_i) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}\]
The standard deviation of a sum is found by computing the variance of the sum and then taking a square root.
A.2.3 Covariance and Correlation
The symbol Cov(ui, uj) is read as the covariance between the random variables ui and uj and is also an expected value defined by the equation
\[\text{Cov}(u\_i, u\_j) = \mathbb{E}\left\{ \left[ u\_i - \mathbb{E}(u\_i) \right] \left[ u\_j - \mathbb{E}(u\_j) \right] \right\} = \text{Cov}(u\_j, u\_i)\]
The covariance describes the way two random variables vary jointly. If the two variables are independent, then Cov(ui, uj) = 0, but zero correlation does not imply independence. The variance is a special case of covariance, since Cov(ui, ui) = Var(ui).
When covariance is nonzero, common language is to say that two variables are correlated. Formally, the correlation coefficient is defined by
\[\rho(u\_{i}, u\_{j}) = \frac{\mathrm{Cov}(u\_{i}, u\_{j})}{\sqrt{\mathrm{Var}(u\_{i})\mathrm{Var}(u\_{j})}}\]
The correlation does not depend on units of measurement and has a value between −1 and 1, with ρ(ui, uj) = 0 only if Cov(ui, uj) = 0.
The rule for covariances is
\[\text{Cov}(a\_0 + a\_1 u\_1, a\_3 + a\_2 u\_2) = a\_1 a\_2\text{Cov}(u\_1, u\_2)\]
It is left as an exercise to show that
\[ \rho(a\_0 + a\_1 u\_1, a\_3 + a\_2 u\_2) = \rho(u\_1, u\_2), \]
so the unit-free correlation coefficient does not change if the random variables are rescaled or centered.
The general form for the variance of a linear combination of random variables is
\[\operatorname{Var}\left(a\_0 + \sum a\_i u\_i\right) = \sum\_{i=1}^n a\_i^2 \operatorname{Var}(u\_i) + 2\sum\_{i=1}^{n-1} \sum\_{j=i+1}^n a\_i a\_j \operatorname{Cov}(u\_i, u\_j) \tag{A.3}\]
A.2.4 Conditional Moments
Throughout the book, we use notation like E(Y|X) or E(Y|X = x) to denote the mean of the random variable Y in the population for which the value of X is fixed. Similarly, Var(Y|X) or Var(Y|X = x) is the variance of the random variable Y in the population for which X is fixed.
There are simple relationships between the conditional mean and variance of Y given X and the unconditional mean and variances (Casella and Berger, 2001):
\[\mathbb{E}(Y) = \mathbb{E}\left[\mathbb{E}(Y|X)\right] \tag{A.4}\]
\[\operatorname{Var}(Y) = \mathbb{E}\left[\operatorname{Var}(Y|X)\right] + \operatorname{Var}\left[\operatorname{E}(Y|X)\right] \tag{A.5}\]
For example, suppose that when we condition on the predictor X we have a simple linear regression mean function with constant variance, E(Y|X = x) = β0 + β1x, Var(Y|X = x) = σ2. In addition, suppose the unconditional moments of the predictor are E(X) = µx and Var(X) = τx². Then for the unconditional random variable Y,
\[\begin{aligned} \operatorname{E}(Y) &= \operatorname{E}\left[\operatorname{E}(Y|X=x)\right] \\ &= \operatorname{E}\left[\beta\_0 + \beta\_1 x\right] \\ &= \beta\_0 + \beta\_1 \mu\_x \end{aligned}\]
\[\begin{aligned} \operatorname{Var}(Y) &= \operatorname{E}\left[\operatorname{Var}(Y|X=x)\right] + \operatorname{Var}\left[\operatorname{E}(Y|X=x)\right] \\ &= \operatorname{E}[\sigma^2] + \operatorname{Var}[\beta\_0 + \beta\_1 x] \\ &= \sigma^2 + \beta\_1^2 \tau\_x^2 \end{aligned}\]
The expected value of the unconditional variable Y is obtained by substituting the expected value of the unconditional variable X into the conditional expected value formula, and the unconditional variance of Y equals the conditional variance plus an additional quantity that depends on both β1² and on τx².
A.3 LEAST SQUARES FOR SIMPLE REGRESSION
The ols estimates of β0 and β1 in simple regression are the values that minimize the residual sum of squares function,
\[\text{RSS}(\beta\_0, \beta\_1) = \sum\_{i=1}^{n} \left(y\_i - \beta\_0 - \beta\_1 x\_i\right)^2 \tag{A.6}\]
One method of finding the minimizer is to differentiate with respect to β0 and β1, set the derivatives equal to 0, and solve
\[\frac{\partial \text{RSS}(\beta\_0, \beta\_1)}{\partial \beta\_0} = -2 \sum\_{i=1}^n \left( y\_i - \beta\_0 - \beta\_1 x\_i \right) = 0\]
\[\frac{\partial \text{RSS}(\beta\_0, \beta\_1)}{\partial \beta\_1} = -2 \sum\_{i=1}^n x\_i \left( y\_i - \beta\_0 - \beta\_1 x\_i \right) = 0\]
Upon rearranging terms, we get
\[ \begin{aligned} \beta\_0 n + \beta\_1 \sum x\_i &= \sum y\_i \\ \beta\_0 \sum x\_i + \beta\_1 \sum x\_i^2 &= \sum x\_i y\_i \end{aligned} \tag{A.7} \]
Equations (A.7) are called the normal equations for the simple linear regression model (2.1). The normal equations depend on the data only through the sufficient statistics ∑xi, ∑yi, ∑xi², and ∑xiyi. Using the formulas
\[\begin{aligned} \text{SXX} &= \sum (\mathbf{x}\_{i} - \overline{\mathbf{x}})^{2} &= \sum \mathbf{x}\_{i}^{2} - n\overline{\mathbf{x}}^{2} \\ \text{SXY} &= \sum (\mathbf{x}\_{i} - \overline{\mathbf{x}})(\mathbf{y}\_{i} - \overline{\mathbf{y}}) = \sum \mathbf{x}\_{i}\mathbf{y}\_{i} - n\overline{\mathbf{x}}\overline{\mathbf{y}} \end{aligned} \tag{A.8}\]
equivalent and numerically more stable sufficient statistics are given by x̄, ȳ, SXX, and SXY. Solving (A.7), we get
\[ \hat{\beta}\_0 = \overline{y} - \hat{\beta}\_1 \overline{x} \qquad \hat{\beta}\_1 = \frac{\text{SXY}}{\text{SXX}} \tag{A.9} \]
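A small numerical sketch of (A.8) and (A.9) in R, using made-up data:
> x <- c(1, 2, 3, 4, 5); y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
> SXX <- sum((x - mean(x))^2)
> SXY <- sum((x - mean(x)) * (y - mean(y)))
> b1 <- SXY/SXX; b0 <- mean(y) - b1 * mean(x)
> c(b0, b1)  # agrees with coef(lm(y ~ x))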
A.4 MEANS AND VARIANCES OF LEAST SQUARES ESTIMATES
The least squares estimates are linear combinations of the observed values y1, . . . , yn of the response, so we can apply the results of Appendix A.2 to the estimates found in Appendix A.3 to get the means, variances, and covariances of the estimates. Assume the simple regression model (2.1) is correct. The estimator β̂1 given at (A.9) can be written as β̂1 = ∑ciyi, where for each i, ci = (xi − x̄)/SXX. Since we are conditioning on the values of X, the ci are fixed numbers. By (A.1),
\[\begin{aligned} \operatorname{E}(\hat{\beta}\_{l}|X) &= \operatorname{E}\left(\sum c\_{i}\mathbf{y}\_{i}|X=\mathbf{x}\_{i}\right) = \sum c\_{i}\operatorname{E}(\mathbf{y}\_{i}|X=\mathbf{x}\_{i}) \\ &= \sum c\_{i}(\beta\_{0}+\beta\_{1}\mathbf{x}\_{i}) \\ &= \beta\_{0}\sum c\_{i}+\beta\_{1}\sum \mathbf{c}\_{i}\mathbf{x}\_{i} \end{aligned}\]
By direct summation, ∑ci = 0 and ∑cixi = 1, giving
\[\operatorname{E}(\hat{\beta}\_1 \vert X) = \beta\_1\]
which shows that β̂1 is unbiased for β1. A similar computation will show that β̂0 is an unbiased estimate of β0.
Since the yi are assumed independent, the variance of β̂1 is found by an application of (A.2),
\[\begin{aligned} \operatorname{Var}(\hat{\beta}\_l | X) &= \operatorname{Var}\left(\sum c\_i y\_i | X = x\_i\right) \\ &= \sum c\_i^2 \operatorname{Var}(y\_i | X = x\_i) \\ &= \sigma^2 \sum c\_i^2 \\ &= \sigma^2 / \operatorname{SXX} \end{aligned}\]
This computation also used ∑ci² = ∑(xi − x̄)²/SXX² = 1/SXX. Computing the variance of β̂0 requires an application of (A.3). We write
\[\begin{split} \text{Var}(\hat{\beta}\_{0}|X) &= \text{Var}(\overline{\mathbf{y}} - \hat{\beta}\_{l}\overline{x}|X) \\ &= \text{Var}(\overline{\mathbf{y}}|X) + \overline{x}^{2}\text{Var}(\hat{\beta}\_{l}|X) - 2\overline{x}\text{Cov}(\overline{\mathbf{y}}, \hat{\beta}\_{l}|X) \end{split} \tag{A.10}\]
To complete this computation, we need to compute the covariance,
\[\begin{aligned} \text{Cov}(\overline{\mathbf{y}}, \hat{\beta}\_1 | X) &= \text{Cov}\left(\frac{1}{n} \sum \mathbf{y}\_i, \sum c\_i \mathbf{y}\_i \Big| X\right) \\ &= \frac{1}{n} \sum c\_i \operatorname{Var}(\mathbf{y}\_i | X) \\ &= \frac{\sigma^2}{n} \sum c\_i \\ &= 0 \end{aligned}\]
because the yi are independent and ∑ci = 0. Substituting into (A.10) and simplifying,
\[\operatorname{Var}(\hat{\beta}\_0 | X) = \sigma^2 \left( \frac{1}{n} + \frac{\overline{x}^2}{\text{SXX}} \right)\]
Finally,
\[\begin{aligned} \text{Cov}(\hat{\beta}\_0, \hat{\beta}\_1 | X) &= \text{Cov}(\overline{\mathbf{y}} - \hat{\beta}\_1 \overline{\mathbf{x}}, \hat{\beta}\_1 | X) \\ &= \text{Cov}(\overline{\mathbf{y}}, \hat{\beta}\_1 | X) - \overline{\mathbf{x}} \text{Cov}(\hat{\beta}\_1, \hat{\beta}\_1 | X) \\ &= 0 - \sigma^2 \frac{\overline{\mathbf{x}}}{\text{SXX}} \\ &= -\sigma^2 \frac{\overline{\mathbf{x}}}{\text{SXX}} \end{aligned}\]
Further application of these results gives the variance of a fitted value, ŷ = β̂0 + β̂1x:
\[\begin{split} \text{Var}(\hat{\mathbf{y}}|X=\mathbf{x}) &= \text{Var}(\hat{\beta}\_0 + \hat{\beta}\_1 \mathbf{x} | X=\mathbf{x}) \\ &= \text{Var}(\hat{\beta}\_0 | X=\mathbf{x}) + \mathbf{x}^2 \text{Var}(\hat{\beta}\_1 | X=\mathbf{x}) + 2\mathbf{x} \text{Cov}(\hat{\beta}\_0, \hat{\beta}\_1 | X=\mathbf{x}) \\ &= \sigma^2 \left( \frac{1}{n} + \frac{\overline{\mathbf{x}}^2}{\text{SXX}} \right) + \sigma^2 \mathbf{x}^2 \frac{1}{\text{SXX}} - 2\sigma^2 \mathbf{x} \frac{\overline{\mathbf{x}}}{\text{SXX}} \\ &= \sigma^2 \left( \frac{1}{n} + \frac{(\mathbf{x} - \overline{\mathbf{x}})^2}{\text{SXX}} \right) \end{split} \tag{A.11}\]
A prediction ỹ* at the future value x* is just β̂0 + β̂1x*. The variance of a prediction consists of the variance of the fitted value at x* given by (A.11) plus σ2, the variance of the error that will be attached to the future value,
\[\operatorname{Var}(\tilde{y}\_{*}|X=x\_{*}) = \sigma^2 \left( \frac{1}{n} + \frac{(x\_{*} - \overline{x})^2}{\operatorname{SXX}} \right) + \sigma^2\]
as given by (2.16).
A.5 ESTIMATING E(Y|X) USING A SMOOTHER
For a 2D scatterplot of Y versus X, a scatterplot smoother provides an estimate of the mean function E(Y|X = x) as x varies, without making parametric assumptions about the mean function. Many smoothing methods are used, and the smoother we use most often in this book is the simplest case of the loess smoother, Cleveland (1979); see also the first step in Algorithm 6.1.1 in Härdle (1990, p. 192). This smoother estimates E(Y|X = xg) by ŷg via a weighted least squares (wls) simple regression, giving more weight to points close to xg than to points distant from xg. Here is the method:
- 1. Select a value for a smoothing parameter f, a number between 0 and 1. Values of f close to 1 will give curves that are too smooth and will be close to a straight line, while small values of f give curves that are too rough and match all the wiggles in the data. The value of f must be chosen to balance the bias of oversmoothing with the variability of undersmoothing. Remarkably, for many problems f ≈ 2/3 is a good choice. There is a substantial literature on the appropriate ways to estimate a smoothing parameter for loess and for other smoothing methods, but for the purposes of using a smoother to help us look at a graph, optimal choice of a smoothing parameter is not critical.
- 2. Find the fn closest points to xg. For example, if n = 100, and f = 0.6, then find the fn = 60 closest points to xg. Every time the value of xg is changed, the points selected may change.
- 3. Among these fn nearest neighbors to xg, compute the wls estimates for the simple regression of Y on X, with weights determined so that points close to xg have the highest weight, and the weights decline toward 0 for points farther from xg. We use a triangular weight function that gives maximum weight to data at xg, and weights that decrease linearly to 0 at the edge of the neighborhood. If a different weight function is used, the answers will be somewhat different.
- 4. The value of ŷg is the fitted value at xg from the wls regression using the nearest neighbors found at step 2 as the data and the weights from step 3.

Figure A.1 Three choices of the smoothing parameter for a loess smooth. The data used in this plot are discussed in Section 8.1.2.
5. Repeat steps 1–4 for many values of xg that form a grid of points that cover the interval on the x-axis of interest. Join the points.
Figure A.1 shows a plot of Height versus Diameter for western cedar trees in the Upper Flat Creek data, along with four smoothers. The first smoother is the ols simple regression line, which does not match the data well because the mean function for these data is probably curved, not straight. The loess smooth with f = 0.1 is, as expected, very wiggly, matching the local variation rather than the mean. The line for f = 2/3 seems to match the data very well, while the loess fit for f = .95 is nearly the same as for f = 2/3 but tends toward oversmoothing, approaching the ols line. We would conclude from this graph that a straight-line mean function is likely to be inadequate because it does not match the data very well. Loader (2004) presents a bootstrap-based lack-of-fit test that compares parametric and nonparametric estimates of the mean function.
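R's built-in loess() function implements a closely related local regression smoother; it differs in detail from the method described above (it uses tricube rather than triangular weights and defaults to local quadratic fits), but for the purpose of looking at a graph the results are similar. The sketch below assumes a data frame called ufc with columns Diameter and Height; these names are hypothetical stand-ins for the Upper Flat Creek western cedar data.

```r
# A minimal sketch of a plot like Figure A.1, assuming a data frame `ufc`
# with columns Diameter and Height (hypothetical names).
plot(Height ~ Diameter, data = ufc)
abline(lm(Height ~ Diameter, data = ufc), lty = 2)          # ols straight line
grid <- seq(min(ufc$Diameter), max(ufc$Diameter), length.out = 101)
for (f in c(0.1, 2/3, 0.95)) {
  fit <- loess(Height ~ Diameter, data = ufc, span = f, degree = 1)
  lines(grid, predict(fit, data.frame(Diameter = grid)))    # loess smooth
}
```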
The loess smoother is an example of a nearest neighbor smoother. Local polynomial regression smoothers and kernel smoothers are similar to loess, except that they give positive weight to all cases within a fixed distance of the point of interest rather than to a fixed number of points. There is a large literature on nonparametric regression, for which scatterplot smoothing is a primary tool. Recent references on this subject include Simonoff (1996), Bowman and Azzalini (1997), and Loader (1999).
A.6 A BRIEF INTRODUCTION TO MATRICES AND VECTORS
We provide only a brief introduction to matrices and vectors. More complete references include Seber (2008), Schott (2005), or any good linear algebra book.
Boldface type is used to indicate matrices and vectors. We will say that X is an r × c matrix if it is an array of numbers with r rows and c columns. A specific 4 × 3 matrix X is
\[\mathbf{X} = \begin{pmatrix} 1 & 2 & 1 \\ 1 & 1 & 5 \\ 1 & 3 & 4 \\ 1 & 8 & 6 \end{pmatrix} = \begin{pmatrix} \mathbf{x}\_{11} & \mathbf{x}\_{12} & \mathbf{x}\_{13} \\ \mathbf{x}\_{21} & \mathbf{x}\_{22} & \mathbf{x}\_{23} \\ \mathbf{x}\_{31} & \mathbf{x}\_{32} & \mathbf{x}\_{33} \\ \mathbf{x}\_{41} & \mathbf{x}\_{42} & \mathbf{x}\_{43} \end{pmatrix} = \begin{pmatrix} \mathbf{x}\_{ij} \end{pmatrix} \tag{A.12}\]
The element xij of X is the number in the ith row and the jth column. For example, in the preceding matrix, x32 = 3.
A vector is a matrix with just one column. A specific 4 × 1 matrix y, which is a vector of length 4, is given by
\[\mathbf{y} = \begin{pmatrix} 2 \\ 3 \\ -2 \\ 0 \end{pmatrix} = \begin{pmatrix} y\_1 \\ y\_2 \\ y\_3 \\ y\_4 \end{pmatrix}\]
The elements of a vector are generally singly subscripted; thus, y3 = −2. A row vector is a matrix with one row. We do not use row vectors in this book. If a vector is needed to represent a row, a transpose of a column vector will be used, Appendix A.6.4.
A square matrix has the same number of rows and columns, so r = c. A square matrix Z is symmetric if zij = zji for all i and j. A square matrix is diagonal if all elements off the main diagonal are 0, zij = 0, unless i = j. The matrices C and D below are symmetric and diagonal, respectively:
\[\mathbf{C} = \begin{pmatrix} 7 & 3 & 2 & 1 \\ 3 & 4 & 1 & -1 \\ 2 & 1 & 6 & 3 \\ 1 & -1 & 3 & 8 \end{pmatrix} \quad \mathbf{D} = \begin{pmatrix} 7 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 \\ 0 & 0 & 6 & 0 \\ 0 & 0 & 0 & 8 \end{pmatrix}\]
The diagonal matrix with all elements on the diagonal equal to 1 is called the identity matrix, for which the symbol I is used. The 4 × 4 identity matrix is
\[\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\]
A scalar is a 1 × 1 matrix, an ordinary number.
A.6.1 Addition and Subtraction
Two matrices can be added or subtracted only if they have the same number of rows and columns. The sum C = A + B of r × c matrices is also r × c. Addition is done elementwise:
\[\mathbf{C} = \mathbf{A} + \mathbf{B} = \begin{pmatrix} a\_{11} & a\_{12} \\ a\_{21} & a\_{22} \\ a\_{31} & a\_{32} \end{pmatrix} + \begin{pmatrix} b\_{11} & b\_{12} \\ b\_{21} & b\_{22} \\ b\_{31} & b\_{32} \end{pmatrix} = \begin{pmatrix} a\_{11} + b\_{11} & a\_{12} + b\_{12} \\ a\_{21} + b\_{21} & a\_{22} + b\_{22} \\ a\_{31} + b\_{31} & a\_{32} + b\_{32} \end{pmatrix}\]
Subtraction works the same way, with the “+” signs changed to “−” signs. The usual rules for addition of numbers apply to addition of matrices, namely commutativity, A + B = B + A, and associativity, (A + B) + C = A + (B + C).
A.6.2 Multiplication by a Scalar
If k is a number and A is an r × c matrix with elements (aij), then kA is an r × c matrix with elements (kaij). For example, the matrix σ2 I has all diagonal elements equal to σ2 and all off-diagonal elements equal to 0.
A.6.3 Matrix Multiplication
Multiplication of matrices follows rules that are more complicated than are the rules for addition and subtraction. For two matrices to be multiplied together in the order AB, the number of columns of A must equal the number of rows of B. For example, if A is r × c, and B is c × q, then C = AB is r × q. If the elements of A are (aij) and the elements of B are (bij), then the elements of C = (cij) are given by the formula
\[c\_{ij} = \sum\_{k=1}^{c} a\_{ik} b\_{kj}\]
This formula says that cij is formed by taking the ith row of A and the jth column of B, multiplying the first element of the specified row in A by the first element in the specified column in B, multiplying second elements, and so on, and then adding the products together.
If A is 1 × c and B is c × 1, then the product AB is 1 × 1, an ordinary number. For example, if A and B are
\[\mathbf{A} = \begin{pmatrix} 1 & 3 & 2 & -1 \end{pmatrix} \quad \mathbf{B} = \begin{pmatrix} 2 \\ 1 \\ -2 \\ 4 \end{pmatrix}\]
then the product AB is
\[\mathbf{AB} = (1 \times 2) + (3 \times 1) + (2 \times -2) + (-1 \times 4) = -3\]
AB is not the same as BA. For the preceding matrices, the product BA will be a 4 × 4 matrix:
\[\mathbf{BA} = \begin{pmatrix} 2 & 6 & 4 & -2 \\ 1 & 3 & 2 & -1 \\ -2 & -6 & -4 & 2 \\ 4 & 12 & 8 & -4 \end{pmatrix}\]
The following small example illustrates what happens when all the dimensions are bigger than 1. A 3 × 2 matrix A times a 2 × 2 matrix B is given as
\[ \begin{pmatrix} a\_{11} & a\_{12} \\ a\_{21} & a\_{22} \\ a\_{31} & a\_{32} \end{pmatrix} \begin{pmatrix} b\_{11} & b\_{12} \\ b\_{21} & b\_{22} \end{pmatrix} = \begin{pmatrix} a\_{11}b\_{11} + a\_{12}b\_{21} & a\_{11}b\_{12} + a\_{12}b\_{22} \\ a\_{21}b\_{11} + a\_{22}b\_{21} & a\_{21}b\_{12} + a\_{22}b\_{22} \\ a\_{31}b\_{11} + a\_{32}b\_{21} & a\_{31}b\_{12} + a\_{32}b\_{22} \end{pmatrix} \]
Using numbers, an example of multiplication of two matrices is
\[ \begin{pmatrix} 3 & 1 \\ -1 & 0 \\ 2 & 2 \end{pmatrix} \begin{pmatrix} 5 & 1 \\ 0 & 4 \end{pmatrix} = \begin{pmatrix} 15+0 & 3+4 \\ -5+0 & -1+0 \\ 10+0 & 2+8 \end{pmatrix} = \begin{pmatrix} 15 & 7 \\ -5 & -1 \\ 10 & 10 \end{pmatrix} \]
In this example, BA is not defined because the number of columns of B is not equal to the number of rows of A. However, the associative law holds: If A is r × c, B is c × q, and C is q × p, then A(BC) = (AB)C, and the result is an r × p matrix.
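As a brief illustration in R (the statistical language referred to elsewhere in this appendix), the operator %*% performs matrix multiplication, and the numerical example above can be reproduced as follows.

```r
# matrix() fills by column, so the elements are entered column by column.
A <- matrix(c(3, -1, 2,    # first column of A
              1,  0, 2),   # second column of A
            nrow = 3)
B <- matrix(c(5, 0,        # first column of B
              1, 4),       # second column of B
            nrow = 2)
A %*% B   # the 3 x 2 product computed above
# B %*% A is not defined: R stops with a "non-conformable arguments" error
# because the number of columns of B differs from the number of rows of A.
```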
A.6.4 Transpose of a Matrix
The transpose of an r × c matrix X is a c × r matrix called X′ such that if the elements of X are (xij), then the elements of X′ are (xji). For the matrix X given at (A.12),
\[\mathbf{X}' = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & 3 & 8 \\ 1 & 5 & 4 & 6 \end{pmatrix}\]
The transpose of a column vector is a row vector. The transpose of a product (AB)′ is the product of the transposes, in opposite order, so (AB)′ = B′A′.
Suppose that a is an r × 1 vector with elements a1, . . . , ar. Then the product a′a will be a 1 × 1 matrix or scalar, given by
\[\mathbf{a}'\mathbf{a} = a\_1^2 + a\_2^2 + \dots + a\_r^2 = \sum\_{i=1}^r a\_i^2 \tag{A.13}\]
Thus, a′a provides a compact notation for the sum of the squares of the elements of a vector a. The square root of this quantity (a′a) 1/2 is called the norm or length of the vector a. Similarly, if a and b are both r × 1 vectors, then we obtain
\[\mathbf{a}'\mathbf{b} = a\_1b\_1 + a\_2b\_2 + \dots + a\_rb\_r = \sum\_{i=1}^r a\_ib\_i = \sum\_{i=1}^r b\_ia\_i = \mathbf{b}'\mathbf{a}\]
The fact that a′b = b′a is often quite useful in manipulating the vectors used in regression calculations.
Another useful formula in regression calculations is obtained by applying the distributive law
\[(\mathbf{a} - \mathbf{b})'(\mathbf{a} - \mathbf{b}) = \mathbf{a}'\mathbf{a} + \mathbf{b}'\mathbf{b} - 2\mathbf{a}'\mathbf{b} \tag{A.14}\]
A.6.5 Inverse of a Matrix
For any real number c ≠ 0, there is another number called the inverse of c, say d, such that the product cd = 1. For example, if c = 3, then d = 1/c = 1/3, and the inverse of 3 is 1/3. Similarly, the inverse of 1/3 is 3. The number 0 does not have an inverse because there is no other number d such that 0 × d = 1.
Square matrices can also have an inverse. We will say that the inverse of a matrix C is another matrix D, such that CD = I, and we write D = C−1 . Not all square matrices have an inverse. The collection of matrices that have an inverse are called full rank, invertible, or nonsingular. A square matrix that is not invertible is of less than full rank, or singular. If a matrix has an inverse, it has a unique inverse.
The inverse is easy to compute only in special cases, and its computation in general can require a very tedious calculation that is best done on a computer. High-level matrix and statistical languages such as Matlab, Maple, Mathematica, and R include functions for inverting matrices, or returning an appropriate message if the inverse does not exist.
The identity matrix I is its own inverse. If C is a diagonal matrix, say
\[\mathbf{C} = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\]
then C−1 is the diagonal matrix
\[\mathbf{C}^{-1} = \begin{pmatrix} \frac{1}{3} & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & \frac{1}{4} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\]
as can be verified by direct multiplication. For any diagonal matrix with nonzero diagonal elements, the inverse is obtained by inverting the diagonal elements. If any of the diagonal elements are 0, then no inverse exists.
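A short sketch of this computation in R, using diag() to build the diagonal matrix and solve() to invert it:

```r
C <- diag(c(3, -1, 4, 1))   # the diagonal matrix C above
solve(C)                    # its inverse, diag(1/3, -1, 1/4, 1)
round(solve(C) %*% C, 10)   # recovers the 4 x 4 identity matrix
```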
A.6.6 Orthogonality
Two vectors a and b of the same length are orthogonal if a′b = 0. An r × c matrix Q has orthonormal columns if its columns, viewed as a set of c ≤ r different r × 1 vectors, are orthogonal and in addition have length 1. This is equivalent to requiring that Q′Q = I, the r × r identity matrix. A square matrix A is orthogonal if A′A = AA′ = I, and so A−1 = A′. For example, the matrix
\[\mathbf{A} = \begin{pmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & 0 & -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} \end{pmatrix}\]
can be shown to be orthogonal by showing that A′A = I, and therefore
\[\mathbf{A}^{-1} = \mathbf{A}' = \begin{pmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{2}} & 0 & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{6}} & -\frac{2}{\sqrt{6}} & \frac{1}{\sqrt{6}} \end{pmatrix}\]
A.6.7 Linear Dependence and Rank of a Matrix
Suppose we have an n × p matrix X with columns given by the vectors x1, . . . , xp; we consider only the case p ≤ n. We will say that x1, . . . , xp are linearly dependent if we can find multipliers a1, . . . , ap, not all of which are 0, such that
\[\sum\_{i=1}^{p} a\_i \mathbf{x}\_i = \mathbf{0} \tag{A.15}\]
If no such multipliers exist, then we say that the vectors are linearly independent, and the matrix is full rank. In general, the rank of a matrix is the maximum number of xi that form a linearly independent set.
For example, the matrix X given at (A.12) can be shown to have linearly independent columns because no ai not all equal to zero can be found that satisfy (A.15). On the other hand, the matrix
\[\mathbf{X} = \begin{pmatrix} 1 & 2 & 5 \\ 1 & 1 & 4 \\ 1 & 3 & 6 \\ 1 & 8 & 11 \end{pmatrix} = (\mathbf{x}\_1, \mathbf{x}\_2, \mathbf{x}\_3) \tag{A.16}\]
has linearly dependent columns and is singular because x3 = 3x1 + x2. The matrix has rank 2, because the linearly independent subset of the columns with the most elements has two elements.
The matrix X′X is a p × p matrix. If X has rank p, so does X′X. Full-rank square matrices always have an inverse. Square matrices of less than full rank never have an inverse.
A.7 RANDOM VECTORS
An n × 1 vector Y is a random vector if each of its elements is a random variable. The mean of an n × 1 random vector Y is also an n × 1 vector whose elements are the means of the elements of Y. The variance of an n × 1 vector Y is an n × n square symmetric matrix, often called a covariance matrix, written Var(Y) with Var(yi) as its (i, i) element and Cov(yi, yj) = Cov(yj, yi) as both the (i, j) and (j, i) element.
The rules for means and variances of random vectors are matrix equivalents of the scalar versions in Appendix A.2. If a0 is a vector of constants, and A is a matrix of constants,
\[\mathbf{E}(\mathbf{a}\_0 + \mathbf{A}\,\mathbf{Y}) = \mathbf{a}\_0 + \mathbf{A}\mathbf{E}(\mathbf{Y})\tag{A.17}\]
\[\operatorname{Var}(\mathbf{a}\_0 + \mathbf{A}\mathbf{Y}) = \mathbf{A}\operatorname{Var}(\mathbf{Y})\mathbf{A}^\prime \tag{A.18}\]
A.8 LEAST SQUARES USING MATRICES
The multiple linear regression model can be written as
\[\operatorname{E}(Y|X=\mathbf{x}) = \boldsymbol{\beta}'\mathbf{x} \qquad \operatorname{Var}(Y|X=\mathbf{x}) = \sigma^2\]
The matrix version is
\[\mathbb{E}(\mathbf{Y}|\mathbf{X}) = \mathbf{X}\boldsymbol{\beta} \quad \text{Var}(\mathbf{Y}|\mathbf{X}) = \sigma^2 \mathbf{I}\]
where Y is the n × 1 vector of response values and X is an n × p′ matrix. If the mean function includes an intercept, then the first column of X is a vector of ones, and p′ = p + 1. If the mean function does not include an intercept, then the column of ones is not included in X and p′ = p. The ith row of the n × p′ matrix X is x′i, and β is a p′ × 1 vector of parameters for the mean function.
The ols estimator β̂ of β is given by the arguments that minimize the residual sum of squares function,
\[\text{RSS}(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\]
Using (A.14)
\[\text{RSS}(\boldsymbol{\beta}) = \mathbf{Y}'\mathbf{Y} + \boldsymbol{\beta}'(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} - 2\mathbf{Y}'\mathbf{X}\boldsymbol{\beta} \tag{A.19}\]
RSS(β) depends on only three functions of the data: Y′Y, X′X, and Y′X. Any two data sets that have the same values of these three quantities will have the same least squares estimates. Using (A.8), the information in these quantities is equivalent to the information contained in the sample means of the regressors plus the sample covariances of the regressors and the response.
To minimize (A.19), differentiate with respect to β and set the result equal to 0. This leads to the matrix version of the normal equations,
\[\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y} \tag{A.20}\]
The ols estimates are any solution to these equations. If the inverse of (X′X) exists, as it will if the columns of X are linearly independent, the ols estimates are unique and are given by
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \tag{A.21} \]
If the inverse does not exist, then the matrix (X′X) is of less than full rank, and the ols estimate is not unique. In this case, most computer programs will use a linearly independent subset of the columns of X in fitting the model, so that the reduced model matrix does have full rank. This is discussed in Section 4.1.4.
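As a hedged sketch, the R code below forms the normal equations (A.20) and solves them, first with an explicit inverse as in (A.21) and then by solving the linear system directly, which is numerically preferable; the simulated data and coefficient values are only for illustration.

```r
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                # n x p' matrix, intercept in column 1
Y <- drop(X %*% c(2, 1, -1)) + rnorm(n)          # response with known coefficients
betahat1 <- solve(t(X) %*% X) %*% t(X) %*% Y     # literal version of (A.21)
betahat2 <- solve(crossprod(X), crossprod(X, Y)) # solve the normal equations (A.20)
all.equal(c(betahat1), c(betahat2))              # the two agree
unname(coef(lm(Y ~ X - 1)))                      # and both match lm()
```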
A.8.1 Properties of Estimates
Using the rules for means and variances of random vectors, (A.17) and (A.18), we find
\[\begin{aligned} \mathrm{E}(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= \mathrm{E}[(\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{Y} | \mathbf{X}] \\ &= [(\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime] \mathrm{E}(\mathbf{Y}|\mathbf{X}) \\ &= (\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{X} \boldsymbol{\beta} \\ &= \boldsymbol{\beta} \end{aligned} \tag{A.22}\]
so β̂ is unbiased for β, as long as the mean function that was fit is the true mean function. The variance of β̂ is
\[\begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= \text{Var}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}|\mathbf{X}] \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\text{Var}(\mathbf{Y}|\mathbf{X})] \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\sigma^2\mathbf{I}] \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \end{aligned} \tag{A.23}\]
The variances and covariances are compactly determined as σ2 times a matrix whose elements are determined only by X and not by Y.
A.8.2 The Residual Sum of Squares
Let Ŷ = Xβ̂ be the n × 1 vector of fitted values corresponding to the n cases in the data, and ê = Y − Ŷ the vector of residuals. One representation of the residual sum of squares, which is the residual sum of squares function evaluated at β̂, is
\[\text{RSS} = (\mathbf{Y} - \hat{\mathbf{Y}})'(\mathbf{Y} - \hat{\mathbf{Y}}) = \hat{\mathbf{e}}'\hat{\mathbf{e}} = \sum\_{i=1}^{n} \hat{e}\_i^2\]
which suggests that the residual sum of squares can be computed by squaring the residuals and adding them up. In multiple linear regression, it can also be computed more efficiently on the basis of summary statistics. Using (A.19) and the summary statistics X′X, X′Y, and Y′Y, we write
\[\text{RSS} = \text{RSS}(\hat{\boldsymbol{\beta}}) = \mathbf{Y}'\mathbf{Y} + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} - 2\mathbf{Y}'\mathbf{X}\hat{\boldsymbol{\beta}}\]
We will first show that β̂′X′Xβ̂ = Y′Xβ̂. Substituting (X′X)−1X′Y for one of the β̂s, we get
\[ \hat{\boldsymbol{\beta}}^{\prime} \mathbf{X}^{\prime} \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{Y} = \hat{\boldsymbol{\beta}}^{\prime} \mathbf{X}^{\prime} \mathbf{Y} = \mathbf{Y}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}} \]
The last result follows because taking the transpose of a 1 × 1 matrix does not change its value. The residual sum of squares function can now be rewritten as
\[\begin{aligned} \text{RSS} &= \mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} \\ &= \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{Y}}'\hat{\mathbf{Y}} \end{aligned} \]
where Ŷ = Xβ̂ are the fitted values. The residual sum of squares is the difference in the squares of the lengths of the two vectors Y and Ŷ. Another useful form for the residual sum of squares is
\[\text{RSS} = \text{SYY}(1 - R^2)\]
where R2 is the square of the sample correlation between Ŷ and Y.
A.8.3 Estimate of Variance
Under the assumption of constant variance, the estimate of σ2 is
\[ \hat{\sigma}^2 = \frac{\text{RSS}}{d} \tag{A.24} \]
with d df, where d is equal to the number of cases n minus the number of regressors with estimated coefficients in the model. If the matrix X is of full rank, then d = n − p′, where p′ = p for mean functions without an intercept, and p′ = p + 1 for mean functions with an intercept. The number of estimated coefficients will be less than p′ if X is not of full rank.
A.8.4 Weighted Least Squares
From Section 7.1, the wls model can be written in matrix notation as
\[\operatorname{E}(\mathbf{Y}|\mathbf{X}) = \mathbf{X}\boldsymbol{\beta} \qquad \operatorname{Var}(\mathbf{Y}|\mathbf{X}) = \sigma^2 \mathbf{W}^{-1} \tag{A.25}\]
To distinguish ols and wls results, we will use a subscript W on several quantities. In practice, there is no need to distinguish between ols and wls, and this subscript is dropped elsewhere in the book.
• The wls estimator β̂W of β is given by the arguments that minimize the residual sum of squares function,
\[\begin{aligned} \text{RSS}\_W(\boldsymbol{\beta}) &= (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})' \mathbf{W} (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{Y}'\mathbf{W}\mathbf{Y} + \boldsymbol{\beta}'(\mathbf{X}'\mathbf{W}\mathbf{X})\boldsymbol{\beta} - 2\mathbf{Y}'\mathbf{W}\mathbf{X}\boldsymbol{\beta} \end{aligned}\]
• The wls estimator solves the weighted normal equations
\[\mathbf{X}'\mathbf{W}\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{W}\mathbf{Y}\]
• The wls estimate is
\[ \hat{\boldsymbol{\beta}}\_{W} = (\mathbf{X}'\mathbf{WX})^{-1}\mathbf{X}'\mathbf{WY} \tag{A.26} \]
• β̂W is unbiased:
\[\begin{aligned} \mathrm{E}(\hat{\boldsymbol{\beta}}\_{W} | \mathbf{X}) &= \mathrm{E} \left[ (\mathbf{X}^\prime \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{W} \mathbf{Y} | \mathbf{X} \right] \\ &= (\mathbf{X}^\prime \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{W} \mathrm{E}(\mathbf{Y} | \mathbf{X}) \\ &= (\mathbf{X}^\prime \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{W} \mathbf{X} \boldsymbol{\beta} \\ &= \boldsymbol{\beta} \end{aligned} \tag{A.27}\]
• The variance of β̂W is
\[\begin{split} \operatorname{Var}(\hat{\boldsymbol{\beta}}\_{W}|\mathbf{X}) &= \operatorname{Var}((\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{W}\mathbf{Y}|\mathbf{X}) \\ &= (\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{W}[\operatorname{Var}(\mathbf{Y}|\mathbf{X})] \mathbf{W}\mathbf{X}(\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1} \\ &= (\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{W}[\sigma^{2}\mathbf{W}^{-1}] \mathbf{W}\mathbf{X}(\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1} \\ &= \sigma^{2}(\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1} \end{split} \tag{A.28}\]
• The RSSW can be computed from
\[\text{RSS}\_{W} = \mathbf{Y}'\mathbf{W}\mathbf{Y} - \hat{\boldsymbol{\beta}}\_W'\mathbf{X}'\mathbf{W}\mathbf{X}\hat{\boldsymbol{\beta}}\_W\]
• The estimated variance is
\[ \hat{\sigma}^2 = \frac{\text{RSS}\_W}{d} \tag{A.29} \]
with d df, where d is equal to the number of cases n minus the number of regressors with estimated coefficients in the model.
• Confidence intervals are the same for both ols and wls as long as (A.28) and (A.29) are used. Testing procedures in Chapter 6 are the same with ols and wls subject to the changes described here. In particular, standard computer programs produce output that will look the same with ols and wls and the output can be interpreted similarly.
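The wls estimate (A.26) in the list above can be checked numerically against lm() with a weights argument; a brief sketch in R with simulated data and hypothetical known weights w:

```r
set.seed(2)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                  # model matrix with intercept
w <- runif(n, 0.5, 2)                              # hypothetical known weights
Y <- drop(X %*% c(2, 1, -1)) + rnorm(n) / sqrt(w)  # errors with variance sigma^2/w
W <- diag(w)                                       # the matrix W in (A.25)
bW <- solve(t(X) %*% W %*% X, t(X) %*% W %*% Y)    # wls estimate (A.26)
all.equal(c(bW), unname(coef(lm(Y ~ X - 1, weights = w))))
```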
A.9 THE QR FACTORIZATION
Most of the formulas given in this book are convenient for derivations but can be inaccurate when used on a computer because inverting a matrix such as (X′X) leaves open the possibility of introducing significant rounding errors into calculations. Most statistical packages will use better methods of computing, and understanding how they work is useful.
We start with the basic n × p′ matrix X of regressors. Suppose we could find an n × p′ matrix Q and a p′ × p′ matrix R such that (1) X = QR; (2) Q has orthonormal columns, meaning that Q′Q = Ip′; and (3) R is an upper triangular matrix, meaning that all the entries in R below the diagonal are equal to 0, but those on or above the diagonal can be nonzero.
Using the basic properties of matrices, we can write
\[\begin{aligned} \mathbf{X} &= \mathbf{Q} \mathbf{R} \\ \mathbf{X}' \mathbf{X} &= (\mathbf{Q} \mathbf{R})'(\mathbf{Q} \mathbf{R}) = \mathbf{R}' \mathbf{R} \end{aligned}\]
\[(\mathbf{X}'\mathbf{X})^{-1} = (\mathbf{R}'\mathbf{R})^{-1} = \mathbf{R}^{-1}(\mathbf{R}')^{-1} \tag{A.30}\]
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{R}^{-1}(\mathbf{Q}'\mathbf{Y})\tag{A.31} \]
\[\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{Q}\mathbf{Q}' \tag{A.32}\]
Equation (A.30) follows because R is a square matrix, and the inverse of the product of square matrices is the product of the inverses in opposite order. From (A.31), to compute β̂, first compute Q′Y, which is a p′ × 1 vector, and multiply on the left by R to get
\[\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^{\prime}\mathbf{Y} \tag{A.33}\]
This last equation is very easy to solve because R is a triangular matrix and so we can use backsolving. For example, to solve the equations
\[\begin{pmatrix} 7 & 4 & 2 \\ 0 & 2 & 1 \\ 0 & 0 & 1 \end{pmatrix} \hat{\boldsymbol{\beta}} = \begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}\]
first solve the last equation, so β̂3 = 1; substitute into the equation above it, so 2β̂2 + 1 = 2, giving β̂2 = 1/2. Finally, the first equation is 7β̂1 + 4(1/2) + 2(1) = 3, so β̂1 = −1/7.
Equation (A.32) shows how the elements of the n × n hat matrix H can be computed without inverting a matrix, and without using all the storage needed to save H in full. If q′i is the ith row of Q, then an element hij of the H matrix is simply computed as hij = q′iqj.
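R's qr() function returns the factorization used here, and backsolve() solves the triangular system (A.33); a sketch with simulated data, checking the coefficients against lm() and the leverages against (A.32):

```r
set.seed(3)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
Y <- drop(X %*% c(2, 1, -1)) + rnorm(n)
QR <- qr(X)
Q  <- qr.Q(QR)                       # n x p' matrix with orthonormal columns
R  <- qr.R(QR)                       # p' x p' upper triangular matrix
betahat <- backsolve(R, t(Q) %*% Y)  # solve R betahat = Q'Y by backsolving
all.equal(c(betahat), unname(coef(lm(Y ~ X - 1))))
h <- rowSums(Q^2)                    # leverages, the diagonal of H = QQ'
all.equal(h, unname(hatvalues(lm(Y ~ X - 1))))
```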
Golub and Van Loan (1996) provide a complete treatment of computing and using the QR factorization. Very high-quality computer code for computing this and related quantities for statistics is provided in the publicly available LAPACK package, described on the internet at www.netlib.org/lapack/lug. This code is also used in many standard statistical packages.
A.10 SPECTRAL DECOMPOSITION
The spectral decomposition provides a very useful representation of a square symmetric matrix (Schott, 2005; Christensen, 2011; Golub and Van Loan, 1996). Suppose S is a p × p symmetric matrix. Then the spectral theorem says that there exists a matrix U that is p × p and orthogonal, so U′U = UU′ = I, and a diagonal matrix D with diagonal elements d1 ≥ d2 ≥ ··· ≥ dp ≥ 0 such that
\[\mathbf{S} = \mathbf{U}\mathbf{D}\mathbf{U}^{\prime}\tag{A.34}\]
The dj are called the eigenvalues of S, and the columns (u1, . . . , up) of U are called the corresponding eigenvectors. The eigenvectors are unique if all the eigenvalues are unequal. The number of nonzero eigenvalues of S is equal to the rank of S. If all the eigenvalues are positive, then
\[\mathbf{S}^{-1} = \mathbf{U}(\mathbf{D})^{-1}\mathbf{U}'\]
This is particularly useful in computations because inverting S requires only inverting a diagonal matrix D.
Equation (A.34) can be rewritten in scalar form as
\[\mathbf{S} = \sum\_{j=1}^{p} d\_{j} \mathbf{u}\_{j} \mathbf{u}\_{j}^{\prime}\]
For any vector a with a′a = 1,
\[\mathbf{a}'\mathbf{S}\mathbf{a} = \sum\_{j=1}^{p} d\_j(\mathbf{a}'\mathbf{u}\_j)(\mathbf{u}\_j'\mathbf{a}) = \sum\_{j=1}^{p} d\_j(\mathbf{a}'\mathbf{u}\_j)^2\]
Now for each j, (a′uj)2 is bounded between 0 and 1. If we set a = u1, then (a′u1) = (u1′u1) = 1, and (a′uj) = (u1′uj) = 0 for all j > 1 because U is an orthogonal matrix. For this case the sum in the last equation is equal to d1, and this is the largest possible value of a′Sa.
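A sketch of the spectral decomposition in R with eigen(); the matrix S below is an arbitrary small symmetric example.

```r
S <- matrix(c(7, 3, 2,
              3, 4, 1,
              2, 1, 6), nrow = 3)   # a symmetric 3 x 3 matrix
e <- eigen(S, symmetric = TRUE)
U <- e$vectors                      # eigenvectors; U'U = UU' = I
D <- diag(e$values)                 # eigenvalues, ordered largest first
all.equal(S, U %*% D %*% t(U))      # reproduces S, as in (A.34)
all.equal(solve(S), U %*% diag(1 / e$values) %*% t(U))  # inverse from (A.34)
e$values[1]                         # the maximum of a'Sa over vectors with a'a = 1
```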
A.11 MAXIMUM LIKELIHOOD ESTIMATES
A.11.1 Linear Models
Maximum likelihood estimation is probably the most frequently used method of deriving estimates in statistics. A general treatment is given by Casella and Berger (2001, section 7.2.2); here we derive the maximum likelihood estimates for the linear regression model assuming normality, without proof or much explanation. Our goal is to establish notation and define quantities that will be used in the discussion of Box–Cox transformations, and estimation for generalized linear models in Chapter 12.
The normal multiple linear regression model specifies for the ith observation that
\[(y\_i|\mathbf{x}\_i) \sim \mathrm{N}(\boldsymbol{\beta}'\mathbf{x}\_i, \sigma^2)\]
Given this model, the density for the ith observation yi is the normal density function,
\[f\_{\mathrm{N}}(y\_i|\mathbf{x}\_i, \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y\_i - \boldsymbol{\beta}'\mathbf{x}\_i)^2}{2\sigma^2}\right)\]
Assuming the observations are independent, the likelihood function is just the product of the densities for each of the n observations, viewed as a function of the parameters with the data fixed rather than a function of the data with the parameters fixed:
\[\begin{aligned} L(\boldsymbol{\beta}, \sigma^2 | Y) &= \prod\_{i=1}^n f\_{\mathrm{N}}(y\_i | \mathbf{x}\_i, \boldsymbol{\beta}, \sigma^2) \\ &= \prod\_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y\_i - \boldsymbol{\beta}'\mathbf{x}\_i)^2}{2\sigma^2}\right) \\ &= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum\_{i=1}^n (y\_i - \boldsymbol{\beta}'\mathbf{x}\_i)^2\right) \end{aligned}\]
The maximum likelihood estimates are simply the values of β and σ2 that maximize the likelihood function.
The values that maximize the likelihood will also maximize the logarithm of the likelihood
\[\log\left[L(\boldsymbol{\beta}, \sigma^2 | Y)\right] = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum\_{i=1}^n (y\_i - \boldsymbol{\beta}'\mathbf{x}\_i)^2 \tag{A.35}\]
The log-likelihood function (A.35) is a sum of three terms. Since β is included only in the third term and this term has a negative sign in front of it, we recognize that maximizing the log-likelihood over β is the same as minimizing the third term, which, apart from constants, is the same as the residual sum of squares function (see Section 3.4.3). We have just shown that the maximum likelihood estimate of β for the normal linear regression problem is the same as the ols estimator. Fixing β at the ols estimator ˆ b, (A.35) becomes
\[\log\left[L(\hat{\boldsymbol{\beta}}, \sigma^2 | Y)\right] = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\text{RSS} \tag{A.36}\]
and differentiating (A.36) with respect to σ2 and setting the result to 0 gives the maximum likelihood estimator for σ2 as RSS/n, the same estimate we have been using, apart from division by n rather than n − p′.
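As a small numerical check (a sketch with simulated data), the log-likelihood (A.36) evaluated at the ols estimates and at σ̂2 = RSS/n agrees with the value reported by R's logLik() for the fitted model.

```r
set.seed(4)
n <- 40
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
RSS <- sum(residuals(fit)^2)
sigma2_mle <- RSS / n        # mle of sigma^2: divisor n, not n - p'
loglik <- -n/2 * log(2 * pi) - n/2 * log(sigma2_mle) - RSS / (2 * sigma2_mle)
all.equal(loglik, as.numeric(logLik(fit)))
```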
Maximum likelihood estimates have many important properties that make them useful. They are approximately normally distributed in large samples, and their large-sample variance achieves the lower bound for the variance of all unbiased estimates.
A.11.2 Logistic Regression
In logistic regression we have (y1, . . . , yn) independent with yi ∼ Bin(mi, θ(xi)). The likelihood based on (y1, . . . , yn) is obtained by multiplying the likelihood for each observation,
\[\begin{aligned} L &= \prod\_{i=1}^{n} \binom{m\_i}{y\_i} (\theta(\mathbf{x}\_i))^{y\_i} (1 - \theta(\mathbf{x}\_i))^{m\_i - y\_i} \\ &\approx \prod\_{i=1}^{n} (\theta(\mathbf{x}\_i))^{y\_i} (1 - \theta(\mathbf{x}\_i))^{m\_i - y\_i} \end{aligned}\]
In the last expression, we have dropped the binomial coefficients because they do not depend on the parameters. After minor rearranging, the log-likelihood is
\[\log(L) \approx \sum\_{i=1}^{n} \left[ y\_i \log \left( \frac{\theta(\mathbf{x}\_i)}{1 - \theta(\mathbf{x}\_i)} \right) + m\_i \log(1 - \theta(\mathbf{x}\_i)) \right]\]
Next, we substitute for θ(xi) using Equation (12.8) to get
\[\log(L(\boldsymbol{\beta})) = \sum\_{i=1}^{n} \left[ (\boldsymbol{\beta}'\mathbf{x}\_i) y\_i - m\_i \log(1 + \exp(\boldsymbol{\beta}'\mathbf{x}\_i)) \right] \tag{A.37}\]
The log-likelihood depends on the regression parameters β explicitly, and we can maximize (A.37) to get estimates. An iterative procedure is required. Most computer packages use the Fisher scoring algorithm for the computing, which amounts to a sequence of weighted least squares computations with the weights depending on the current estimates (Fox and Weisberg, 2011, section 5.12). The more general Newton–Raphson algorithm can also be used. Details of the computational method are provided by McCullagh and Nelder (1989, section 2.5), Collett (2003, section 3.12), Hosmer et al. (2013), and Agresti (2013), among others.
The estimated covariance matrix of the estimates is given by
\[\operatorname{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\hat{\mathbf{W}}\mathbf{X})^{-1}\]
where Ŵ is a diagonal matrix with entries miθ̂(xi)(1 − θ̂(xi)), and X is the matrix with ith row x′i.
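In R, glm() with family = binomial maximizes (A.37) by Fisher scoring. The sketch below uses simulated grouped binomial data (the sample size, coefficients, and number of trials are arbitrary) and verifies the covariance matrix formula above against vcov().

```r
set.seed(5)
n <- 30
x <- rnorm(n)
m <- rep(10, n)                                # m_i trials for each case
theta <- 1 / (1 + exp(-(-0.5 + x)))            # true success probabilities
y <- rbinom(n, size = m, prob = theta)         # observed successes
fit <- glm(cbind(y, m - y) ~ x, family = binomial)   # fit by Fisher scoring
X <- model.matrix(fit)
what <- m * fitted(fit) * (1 - fitted(fit))    # diagonal entries of W-hat
V <- solve(t(X) %*% diag(what) %*% X)          # (X' W-hat X)^{-1}
all.equal(unname(V), unname(vcov(fit)), tolerance = 1e-6)
```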
A.12 THE BOX–COX METHOD FOR TRANSFORMATIONS
A.12.1 Univariate Case
Box and Cox (1964) derived the Box–Cox method for selecting a transformation using a likelihood-like method. They supposed that, for some value of λ, ψM(Y, λ), given by (8.5) in Section 8.1.3, is normally distributed. With n independent observations, therefore, the log-likelihood function for (β, σ2, λ) is given by (A.35), but with yi replaced by ψM(yi, λ),2
\[\log(L(\boldsymbol{\beta}, \sigma^2, \lambda | Y)) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum\_{i=1}^n (\psi\_M(y\_i, \lambda) - \boldsymbol{\beta}'\mathbf{x}\_i)^2 \tag{A.38}\]
For a fixed value of λ, (A.38) is the same as (A.35), and so the maximum likelihood estimates for β and σ2 are obtained from the regression of ψM(Y, λ) on X, and the value of the log-likelihood evaluated at these estimates is
\[\log(L(\hat{\boldsymbol{\beta}}(\lambda), \hat{\sigma}^2(\lambda), \lambda | Y)) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\text{RSS}(\lambda)/n) - \frac{n}{2} \tag{A.39}\]
where RSS(λ) is the residual sum of squares in the regression of ψM(Y, λ) on X, as defined in Section 8.1.3. Only the second term in (A.39) involves data, and so the global maximum likelihood estimate of λ minimizes RSS(λ).
Standard likelihood theory can be applied to get a (1 − α) × 100% confidence interval for λ to be the set
\[\left\{\lambda : 2\left[\log(L(\hat{\boldsymbol{\beta}}(\hat{\lambda}), \hat{\sigma}^2(\hat{\lambda}), \hat{\lambda}|Y)) - \log(L(\hat{\boldsymbol{\beta}}(\lambda), \hat{\sigma}^2(\lambda), \lambda|Y))\right] < \chi^2(1, 1-\alpha)\right\}\]
Or, setting α = .05 so that χ2(1, .95) = 3.84, and using (A.39),
\[\left\{\lambda : (n/2)\left(\log(\text{RSS}(\lambda)) - \log(\text{RSS}(\hat{\lambda}))\right) < 1.92\right\}\tag{A.40}\]
2 As λ is varied, the units of ψM(Y, λ) can change, and so the joint density of the transformed data should require a Jacobian term; see Casella and Berger (2001, section 4.3). The modified power transformations are defined so the Jacobian of the transformation is always equal to 1, and it can therefore be ignored.
Many statistical packages will have routines that will provide a graph of RSS(λ) versus λ, or of (n/2) log(RSS(λ)) versus λ as shown in Figure 8.7, for the highway accident data. Equation (A.40) shows that the confidence interval for λ includes all values of λ for which the log-likelihood is within 1.92 units of the maximum value of the log-likelihood, or between the two vertical lines in the figure.
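A sketch of this computation in R: code the modified power transformation (8.5) directly, compute RSS(λ) from the regression of ψM(y, λ) on the regressors over a grid of λ values, and plot the profile with the bound from (A.40). The response y and model matrix X below are simulated stand-ins; in practice the powerTransform function in the car package (Fox and Weisberg, 2011) automates computations of this kind.

```r
psiM <- function(y, lambda) {        # modified power family (8.5)
  gm <- exp(mean(log(y)))            # geometric mean of y
  if (abs(lambda) < 1e-8) gm * log(y)
  else gm^(1 - lambda) * (y^lambda - 1) / lambda
}
set.seed(6)
n <- 60
X <- cbind(1, runif(n, 1, 10))
y <- exp(drop(X %*% c(0.2, 0.3)) + rnorm(n, sd = 0.3))   # a positive response
lambdas <- seq(-1, 1.5, by = 0.05)
RSSlam <- sapply(lambdas, function(lam)
  sum(residuals(lm(psiM(y, lam) ~ X - 1))^2))            # RSS(lambda)
lambdas[which.min(RSSlam)]                               # lambda-hat minimizes RSS(lambda)
plot(lambdas, (n / 2) * log(RSSlam), type = "l")
abline(h = min((n / 2) * log(RSSlam)) + 1.92, lty = 2)   # interval bound from (A.40)
```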
A.12.2 Multivariate Case
Although the material in this section uses more mathematical statistics than most of this book, it is included because the details of computing the multivariate extension of Box–Cox transformations are not published elsewhere. The basic idea was proposed by Velilla (1993).
Suppose X is a set of p variables we wish to transform and define
\[\psi\_M(X, \boldsymbol{\lambda}) = (\psi\_M(X\_1, \lambda\_1), \dots, \psi\_M(X\_p, \lambda\_p))'\]
We have used the modified power transformations (8.5) for each element of X, but the same general idea can be applied using other transformations such as the Yeo–Johnson family introduced in Section 8.4. In analogy to the univariate case, we assume that for some λ, we will have
\[\psi\_M(X, \boldsymbol{\lambda}) \sim \mathrm{N}(\boldsymbol{\mu}, \mathbf{V}),\]
where V is an unknown positive definite symmetric matrix that needs to be estimated. If xi is the observed value of X for the ith observation, then the likelihood function is given by
\[\begin{split} L(\boldsymbol{\mu}, \mathbf{V}, \boldsymbol{\lambda} | X) &= \prod\_{i=1}^{n} \frac{1}{(2\pi)^{p/2}|\mathbf{V}|^{1/2}} \\ &\times \exp\left( -\frac{1}{2} (\boldsymbol{\psi}\_{M}(\mathbf{x}\_{i}, \boldsymbol{\lambda}) - \boldsymbol{\mu})^{\prime} \mathbf{V}^{-1} (\boldsymbol{\psi}\_{M}(\mathbf{x}\_{i}, \boldsymbol{\lambda}) - \boldsymbol{\mu}) \right) \end{split} \tag{A.41}\]
where |V| is the determinant.3 After rearranging terms, the log-likelihood is given by
\[\begin{aligned} \log(L(\boldsymbol{\mu}, \mathbf{V}, \boldsymbol{\lambda} | X)) &= -\frac{np}{2} \log(2\pi) - \frac{n}{2} \log(|\mathbf{V}|) \\ &\quad - \frac{1}{2} \sum\_{i=1}^{n} (\boldsymbol{\psi}\_{M}(\mathbf{x}\_{i}, \boldsymbol{\lambda}) - \boldsymbol{\mu})' \mathbf{V}^{-1} (\boldsymbol{\psi}\_{M}(\mathbf{x}\_{i}, \boldsymbol{\lambda}) - \boldsymbol{\mu}) \end{aligned} \tag{A.42}\]
3 The determinant is defined in any linear algebra textbook.

If we fix λ, then (A.42) is the standard log-likelihood for the multivariate normal distribution. The values of V and µ that maximize (A.42) are the sample mean and sample covariance matrix, the latter with divisor n rather than n − 1,
\[\begin{aligned} \boldsymbol{\mu}(\boldsymbol{\lambda}) &= \frac{1}{n} \sum\_{i=1}^{n} \boldsymbol{\psi}\_{M}(\mathbf{x}\_{i}, \boldsymbol{\lambda}) \\ \mathbf{V}(\boldsymbol{\lambda}) &= \frac{1}{n} \sum\_{i=1}^{n} (\boldsymbol{\psi}\_{M}(\mathbf{x}\_{i}, \boldsymbol{\lambda}) - \boldsymbol{\mu}(\boldsymbol{\lambda})) (\boldsymbol{\psi}\_{M}(\mathbf{x}\_{i}, \boldsymbol{\lambda}) - \boldsymbol{\mu}(\boldsymbol{\lambda}))' \end{aligned}\]
Substituting these estimates into (A.42) gives the profile log-likelihood for λ,
\[\log(L(\boldsymbol{\mu}(\boldsymbol{\lambda}), \mathbf{V}(\boldsymbol{\lambda}), \boldsymbol{\lambda}|X)) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log(|\mathbf{V}(\boldsymbol{\lambda})|) - \frac{np}{2} \tag{A.43}\]
This equation will be maximized by minimizing the determinant of V(λ) over values of λ. This is a numerical problem for which there is no closed-form solution, but it can be solved using a general-purpose function minimizer.
Standard theory for maximum likelihood estimates can provide tests concerning λ and standard errors for the elements of λ. To test the hypothesis that λ = λ0 against a general alternative, compute
\[G^2 = 2\left[\log(L(\boldsymbol{\mu}(\hat{\boldsymbol{\lambda}}), \mathbf{V}(\hat{\boldsymbol{\lambda}}), \hat{\boldsymbol{\lambda}})) - \log(L(\boldsymbol{\mu}(\boldsymbol{\lambda}\_0), \mathbf{V}(\boldsymbol{\lambda}\_0), \boldsymbol{\lambda}\_0))\right]\]
and compare G2 with a chi-squared distribution with p df. The standard error of λ̂ is obtained from the inverse of the expected information matrix evaluated at λ̂. The expected information for λ̂ is just the matrix of second derivatives of (A.43) with respect to λ, evaluated at λ̂. Many optimization routines, such as optim in R, will return the matrix of estimated second derivatives if requested; all that is required is inverting this matrix, and then the square roots of the diagonal elements are the estimated standard errors.
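A hedged sketch of the computation in R: minimize log|V(λ)| (the only part of (A.43) that depends on λ) with a general-purpose optimizer, and obtain approximate standard errors from the returned matrix of second derivatives. The matrix X of positive variables is simulated for illustration.

```r
psiM <- function(y, lambda) {        # modified power family (8.5)
  gm <- exp(mean(log(y)))
  if (abs(lambda) < 1e-8) gm * log(y)
  else gm^(1 - lambda) * (y^lambda - 1) / lambda
}
set.seed(7)
n <- 100
X <- cbind(exp(rnorm(n)), runif(n, 1, 20))     # two strictly positive variables
negloglik <- function(lambda) {                # minus the profile log-likelihood, up to constants
  Z <- sapply(seq_along(lambda), function(j) psiM(X[, j], lambda[j]))
  V <- cov(Z) * (n - 1) / n                    # mle of V, divisor n
  (n / 2) * log(det(V))
}
out <- optim(c(1, 1), negloglik, hessian = TRUE)
out$par                                        # lambda-hat
sqrt(diag(solve(out$hessian)))                 # approximate standard errors
```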
A.13 CASE DELETION IN LINEAR REGRESSION
Suppose X is the n × p′ matrix of regressors with linearly independent columns. We use the subscript “(i)” to mean “without case i,” so that X(i) is an (n − 1) × p′ matrix. We can compute (X′(i)X(i))−1 from the remarkable formula
\[(\mathbf{X}'\_{(i)}\mathbf{X}\_{(i)})^{-1} = (\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\_i\mathbf{x}'\_i(\mathbf{X}'\mathbf{X})^{-1}}{1 - h\_{ii}} \tag{A.44}\]
where hii = x′i(X′X)−1xi is the ith leverage value, a diagonal value from the hat matrix. This formula was used by Gauss (1821); a history of it and many variations is given by Henderson and Searle (1981). It can be applied to give all the results that one would want relating multiple linear regression with and without the ith case. For example,
\[ \hat{\boldsymbol{\beta}}\_{(i)} = \hat{\boldsymbol{\beta}} - \frac{(\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{x}\_i \hat{e}\_i}{1 - h\_{ii}} \tag{A.45} \]
Writing ri = êi/(σ̂√(1 − hii)), the estimate of variance is
\[ \hat{\sigma}\_{(i)}^2 = \hat{\sigma}^2 \left( \frac{n - p' - r\_i^2}{n - p' - 1} \right) \tag{A.46} \]
and the studentized residual ti is
\[t\_i = r\_i \left(\frac{n - p' - 1}{n - p' - r\_i^2}\right)^{1/2} \tag{A.47}\]
The diagnostic statistics examined in this book were first thought to be practical because simple updating formulas like these gave the case-deletion statistics without recomputing the estimates. Advances in computing over the last 30 years have made the computational burden of refitting without a case much less onerous, and so diagnostic methods equivalent to those discussed here can be applied to problems other than linear regression where the updating formulas are not available.
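As a numerical illustration (a sketch with simulated data), the identities (A.45) and (A.47) can be checked in R against a refit that simply omits case i and against the built-in rstudent() function.

```r
set.seed(8)
n <- 25
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)                            # p' = 2: intercept and slope
i <- 3                                      # delete the third case
X <- model.matrix(fit)
h <- hatvalues(fit)[i]
e <- residuals(fit)[i]
beta_i <- coef(fit) - drop(solve(crossprod(X)) %*% X[i, ]) * e / (1 - h)  # (A.45)
all.equal(unname(beta_i), unname(coef(lm(y ~ x, subset = -i))))
r_i <- e / (sigma(fit) * sqrt(1 - h))               # standardized residual r_i
t_i <- r_i * sqrt((n - 2 - 1) / (n - 2 - r_i^2))    # studentized residual (A.47)
all.equal(unname(t_i), unname(rstudent(fit)[i]))
```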