Bootstrapping Regression Models in R

An Appendix to An R Companion to Applied Regression, Second Edition
John Fox & Sanford Weisberg

last revision: 10 October 2017

Abstract

The bootstrap is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data at hand. This appendix to Fox and Weisberg (2011) briefly describes the rationale for the bootstrap and explains how to bootstrap regression models using the Boot function, which was added to the car package in 2012 and therefore is not described in Fox and Weisberg (2011). This function provides a simple way to access the power of the boot function (lower-case "b") in the boot package. In 2017, the generic Boot function was extensively revised so that it will now work with many regression problems. Examples of the use of the function with the betareg and crch packages are given.
1 Basic Ideas
The bootstrap is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data at hand. The term "bootstrapping," due to Efron (1979), is an allusion to the expression "pulling oneself up by one's bootstraps," in this case, using the sample data as a population from which repeated samples are drawn. At first blush, the approach seems circular, but it has been shown to be sound.

At least two R packages for bootstrapping are associated with extensive treatments of the subject: Efron and Tibshirani's (1993) bootstrap package, and Davison and Hinkley's (1997) boot package. Of the two, boot, programmed by A. J. Canty, is somewhat more capable and is a part of the standard R distribution. The bootstrap is potentially very flexible and can be used in many different ways, and as a result using the boot package requires some programming. In this appendix we will mostly discuss the car function Boot, which provides a simplified front-end to the boot package.

Confusion alert: Boot with a capital "B" is a function in the car package, and is the primary function used in this appendix. It is really just a convenience function that calls the function boot with a lower-case "b" in a package that is also called boot, also with a lower-case "b". We hope you will get a kick out of all the boots.

There are several forms of the bootstrap, and, additionally, several other resampling methods that are related to it, such as jackknifing, cross-validation, randomization tests, and permutation tests. We will stress the nonparametric bootstrap.

Suppose that we draw a sample $S = \{X_1, X_2, \ldots, X_n\}$ from a population $P = \{x_1, x_2, \ldots, x_N\}$; imagine further, at least for the time being, that N is very much larger than n, and that S is a
simple random sample.[1] We will briefly consider other sampling schemes at the end of the appendix. It is helpful initially to think of the elements of the population and, hence, of the sample as scalar values, but they could just as easily be vectors.

Suppose that we are interested in some statistic T = t(S) as an estimate of the corresponding population parameter θ = t(P). Again, θ could be a vector of parameters and T the corresponding vector of estimates, but for simplicity assume that θ is a scalar. A traditional approach to statistical inference is to make assumptions about the structure of the population, such as an assumption of normality, and, along with the stipulation of random sampling, to use these assumptions to derive the sampling distribution of T, on which classical inference is based. In certain instances, the exact distribution of T may be intractable, and so we instead derive its asymptotic distribution. This familiar approach has two potentially important deficiencies:

1. If the assumptions about the population are wrong, then the corresponding sampling distribution of the statistic may be seriously inaccurate. If asymptotic results are relied upon, these may not hold to the required level of accuracy in a relatively small sample.

2. The approach requires sufficient mathematical prowess to derive the sampling distribution of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.

In contrast, the nonparametric bootstrap allows us to estimate the sampling distribution of a statistic empirically without making assumptions about the form of the population, and without deriving the sampling distribution explicitly. The essential idea of the nonparametric bootstrap is as follows: We proceed to draw a sample of size n from among the elements of the sample S, sampling with replacement. Call the resulting bootstrap sample $S_1^* = \{X_{11}^*, X_{12}^*, \ldots, X_{1n}^*\}$. It is necessary to sample with replacement, because we would otherwise simply reproduce the original sample S. In effect, we are treating the sample S as an estimate of the population P; that is, each element $X_i$ of S is selected for the bootstrap sample with probability 1/n, mimicking the original selection of the sample S from the population P. We repeat this procedure a large number of times, R, selecting many bootstrap samples; the bth such bootstrap sample is denoted $S_b^* = \{X_{b1}^*, X_{b2}^*, \ldots, X_{bn}^*\}$. The key bootstrap analogy is therefore as follows:
The population is to the sample as the sample is to the bootstrap samples.
Next, we compute the statistic T for each of the bootstrap samples; that is, $T_b^* = t(S_b^*)$. Then the distribution of $T_b^*$ around the original estimate T is analogous to the sampling distribution of the estimator T around the population parameter θ. For example, the average of the bootstrapped statistics,

$$\bar{T}^* = \widehat{E}^*(T^*) = \frac{\sum_{b=1}^{R} T_b^*}{R}$$

estimates the expectation of the bootstrapped statistics; then $\widehat{B}^* = \bar{T}^* - T$ is an estimate of the bias of T, that is, of T − θ. Similarly, the estimated bootstrap variance of T*,

$$\widehat{\mathrm{Var}}^*(T^*) = \frac{\sum_{b=1}^{R} (T_b^* - \bar{T}^*)^2}{R - 1}$$

estimates the sampling variance of T. The square root of this quantity,

$$\widehat{\mathrm{SE}}^*(T^*) = \sqrt{\frac{\sum_{b=1}^{R} (T_b^* - \bar{T}^*)^2}{R - 1}}$$

is the bootstrap estimated standard error of T.

The random selection of bootstrap samples is not an essential aspect of the nonparametric bootstrap, and at least in principle we could enumerate all bootstrap samples of size n. Then we could calculate $E^*(T^*)$ and $\mathrm{Var}^*(T^*)$ exactly, rather than having to estimate them. The number of bootstrap samples, however, is astronomically large unless n is tiny.[2] There are, therefore, two sources of error in bootstrap inference: (1) the error induced by using a particular sample S to represent the population; and (2) the sampling error produced by failing to enumerate all bootstrap samples. The latter source of error can be controlled by making the number of bootstrap replications R sufficiently large.

[1] Alternatively, P could be an infinite population, specified, for example, by a probability distribution function.

[2] If we distinguish the order of elements in the bootstrap samples and treat all of the elements of the original sample as distinct (even when some have the same values), then there are $n^n$ bootstrap samples, each occurring with probability $1/n^n$.
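To make these ideas concrete, the following small R sketch (an illustration added here, not code from the original appendix) carries out the nonparametric bootstrap by hand for a simple statistic, the sample median of a made-up variable x; the data, the number of replications R = 999, and the object names are all arbitrary choices.

## Illustrative sketch: nonparametric bootstrap "by hand" for the median of x
set.seed(123)                       # for reproducibility
x <- rchisq(50, df = 3)             # hypothetical sample of size n = 50
n <- length(x)
R <- 999                            # number of bootstrap replications
T.hat <- median(x)                  # the original-sample estimate T
T.star <- numeric(R)
for (b in 1:R) {
  xb <- sample(x, size = n, replace = TRUE)  # bootstrap sample S*_b
  T.star[b] <- median(xb)                    # T*_b = t(S*_b)
}
T.bar.star <- mean(T.star)          # average of the bootstrapped statistics
bias.star <- T.bar.star - T.hat     # bootstrap estimate of the bias of T
var.star <- sum((T.star - T.bar.star)^2)/(R - 1)  # bootstrap variance
se.star <- sqrt(var.star)           # bootstrap standard error of T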
2 Bootstrap Confidence Intervals
There are several approaches to constructing bootstrap confidence intervals. The normal-theory interval assumes that the statistic T is normally distributed, which is often approximately the case for statistics in sufficiently large samples, and uses the bootstrap estimate of sampling variance, and perhaps of bias, to construct a 100(1 − α)% confidence interval of the form

$$\theta = (T - \widehat{B}^*) \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}^*(T^*)$$

where $z_{1-\alpha/2}$ is the 1 − α/2 quantile of the standard-normal distribution (e.g., 1.96 for a 95% confidence interval, when α = .05).

An alternative approach, called the bootstrap percentile interval, is to use the empirical quantiles of $T_b^*$ to form a confidence interval for θ:

$$T^*_{(\mathrm{lower})} < \theta < T^*_{(\mathrm{upper})}$$

where $T^*_{(1)}, T^*_{(2)}, \ldots, T^*_{(R)}$ are the ordered bootstrap replicates of the statistic; lower = [(R + 1)α/2]; upper = [(R + 1)(1 − α/2)]; and the square brackets indicate rounding to the nearest integer. For example, if α = .05, corresponding to a 95% confidence interval, and R = 999, then lower = 25 and upper = 975.

The bias-corrected, accelerated (or BCa) percentile intervals perform somewhat better than the percentile intervals just described. To find the BCa interval for θ:
Calculate

$$z = \Phi^{-1}\left[\frac{\#_{b=1}^{R}\,(T_b^* \le T)}{R + 1}\right]$$
where $\Phi^{-1}(\cdot)$ is the standard-normal quantile function, and $\#_{b=1}^{R}(T_b^* \le T)/(R + 1)$ is the (adjusted) proportion of bootstrap replicates at or below the original-sample estimate T of θ. If the bootstrap sampling distribution is symmetric, and if T is unbiased, then this proportion will be close to .5, and the correction factor z will be close to 0.

Let $T_{(-i)}$ represent the value of T produced when the ith observation is deleted from the sample;[3] there are n of these quantities. Let $\bar{T}$ represent the average of the $T_{(-i)}$; that is, $\bar{T} = \sum_{i=1}^{n} T_{(-i)}/n$. Then calculate
$$a = \frac{\sum_{i=1}^{n} \left(\bar{T} - T_{(-i)}\right)^3}{6\left[\sum_{i=1}^{n} \left(T_{(-i)} - \bar{T}\right)^2\right]^{3/2}}$$
With the correction factors z and a in hand, compute

$$a_1 = \Phi\left[z + \frac{z - z_{1-\alpha/2}}{1 - a(z - z_{1-\alpha/2})}\right]$$

$$a_2 = \Phi\left[z + \frac{z + z_{1-\alpha/2}}{1 - a(z + z_{1-\alpha/2})}\right]$$
where $\Phi(\cdot)$ is the standard-normal cumulative distribution function. The values $a_1$ and $a_2$ are used to locate the endpoints of the corrected percentile confidence interval:

$$T^*_{(\mathrm{lower*})} < \theta < T^*_{(\mathrm{upper*})}$$
where lower* = $[Ra_1]$ and upper* = $[Ra_2]$. When the correction factors a and z are both 0, $a_1 = \Phi(-z_{1-\alpha/2}) = \Phi(z_{\alpha/2}) = \alpha/2$, and $a_2 = \Phi(z_{1-\alpha/2}) = 1 - \alpha/2$, which corresponds to the (uncorrected) percentile interval.

To obtain sufficiently accurate 95% bootstrap percentile or BCa confidence intervals, the number of bootstrap samples, R, should be on the order of 1000 or more; for normal-theory bootstrap intervals we can get away with a smaller value of R, say, on the order of 100 or more, because all we need to do is estimate the standard error of the statistic.
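The following sketch (again an illustration added here, not code from the appendix) traces the BCa calculations above for the median of a hypothetical sample; the data, R = 1999, and α = .05 are arbitrary choices, and no care is taken with edge cases such as an endpoint index of 0.

## Illustrative sketch: BCa interval "by hand", following the formulas above
set.seed(123)
x <- rchisq(50, df = 3)                     # hypothetical sample
n <- length(x); R <- 1999; alpha <- 0.05
T.hat <- median(x)                          # original-sample estimate T
T.star <- replicate(R, median(sample(x, n, replace = TRUE)))  # bootstrap replicates T*_b
z <- qnorm(sum(T.star <= T.hat)/(R + 1))    # bias-correction factor z
T.jack <- sapply(1:n, function(i) median(x[-i]))              # jackknife values T(-i)
T.bar <- mean(T.jack)
a <- sum((T.bar - T.jack)^3) / (6 * sum((T.jack - T.bar)^2)^(3/2))  # acceleration a
zq <- qnorm(1 - alpha/2)                    # z_{1 - alpha/2}
a1 <- pnorm(z + (z - zq)/(1 - a*(z - zq)))
a2 <- pnorm(z + (z + zq)/(1 - a*(z + zq)))
sort(T.star)[round(R * c(a1, a2))]          # BCa interval endpoints

In practice these intervals need not be computed by hand: the boot.ci function in the boot package reports normal-theory, percentile, and BCa intervals for a bootstrap object, and the car package supplies corresponding convenience methods for objects produced by Boot.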
3 Bootstrapping Regressions
Recall Duncan's regression of prestige on income and education for 45 occupations from Chapters 1 and 6 in Fox and Weisberg (2011).[4] In the on-line appendix on robust regression, we refit this regression using an M-estimator with the Huber weight function, employing the rlm function in the MASS package, which is available when you load the car package:

library(car)
library(MASS)
mod.duncan.hub <- rlm(prestige ~ income + education, data = Duncan)

[3] The $T_{(-i)}$ are called the jackknife values of the statistic T. Although we will not pursue the subject here, the jackknife values can also be used as an alternative to the bootstrap to find a nonparametric confidence interval for θ.

[4] R functions used but not described in this appendix are discussed in Fox and Weisberg (2011). All the R code in this appendix can be downloaded from http://tinyurl.com/carbook. Alternatively, if you are running R and connected to the Internet, load the car package and enter the command carWeb(script="appendix-bootstrap") to view the R command file for the appendix in your browser.
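As a preview of how the Boot function described above might then be applied to this model, here is a minimal sketch; the object name duncan.boot, the random seed, and the choice of R = 999 replications are illustrative, and the summary and confint calls use the methods that car supplies for the resulting "boot" object.

## Illustrative sketch: case-resampling bootstrap of the Huber M-estimator's coefficients
set.seed(12345)                              # bootstrap resampling is random
duncan.boot <- Boot(mod.duncan.hub, R = 999) # Boot's defaults: f = coef, method = "case"
summary(duncan.boot)                         # bootstrap standard errors and bias estimates
confint(duncan.boot, type = "bca")           # BCa confidence intervals for the coefficients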