1 An introduction to the Bootstrap
The bootstrap is an important tool of modern statistical analysis. It establishes a general framework for simulation-based statistical inference. In simple situations the uncertainty of an estimate may be gauged by analytical calculations leading, for example, to the construction of confidence intervals based on an assumed probability model for the available data. The bootstrap replaces complicated and often inaccurate approximations to biases, variances and other measures of uncertainty by computer simulations.

The idea of the bootstrap:
• The random sample $Y_1, \dots, Y_n$ is generated by drawing observations independently from the underlying population (with distribution function $F$). For each interval $[a,b]$ the probability of drawing an observation in $[a,b]$ is given by $P(Y \in [a,b]) = F(b) - F(a)$.
• $n$ large: the empirical distribution of the sample values is "close" to the distribution of $Y$ in the underlying population. The relative frequency $F_n(b) - F_n(a)$ of observations in $[a,b]$ converges to $P(Y \in [a,b]) = F(b) - F(a)$ as $n \to \infty$.
• The idea of the bootstrap consists in mimicking the data generating process: random sampling from the true population is replaced by random sampling from the observed data. This is justified by the insight that the empirical distribution of the observed data is "similar" to the true distribution ($F_n \to F$ as $n \to \infty$).

Literature: Davison, A.C. and Hinkley, D.V. (2005): Bootstrap Methods and their Application; Cambridge University Press
Setup:
• Original data: i.i.d. random sample $Y_1, \dots, Y_n$; the distribution of $Y_i$ depends on an unknown parameter (vector) $\theta$
• The data $Y_1, \dots, Y_n$ is used to estimate $\theta$ $\Rightarrow$ estimator $\hat\theta \equiv \hat\theta(Y_1, \dots, Y_n)$
• We are interested in evaluating the distribution of $\hat\theta$ (resp. $\hat\theta - \theta$) in order to provide standard errors, to construct confidence intervals, or to perform tests of hypotheses.

The bootstrap approach:
1) Bootstrap samples: random samples $Y_1^*, \dots, Y_n^*$ are generated by drawing observations independently and with replacement from the available sample $Y_1, \dots, Y_n$.
2) Bootstrap estimates: $\hat\theta^* \equiv \hat\theta(Y_1^*, \dots, Y_n^*)$
3) In practice: steps 1) and 2) are repeated $m$ times (e.g. $m = 2000$) $\Rightarrow$ $m$ values $\hat\theta_1^*, \hat\theta_2^*, \dots, \hat\theta_m^*$
4) The (empirical) distribution of $\hat\theta^*$ is used to approximate the distribution of $\hat\theta$.
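Steps 1)-4) translate directly into a short simulation loop. The following minimal sketch in Python/NumPy is our own illustration (the function name and the simulated data are not part of the original notes):

```python
import numpy as np

def bootstrap_distribution(y, estimator, m=2000, seed=0):
    """Steps 1)-3): draw m bootstrap samples and return the m bootstrap estimates."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    reps = np.empty(m)
    for b in range(m):
        y_star = rng.choice(y, size=n, replace=True)  # 1) bootstrap sample
        reps[b] = estimator(y_star)                   # 2) bootstrap estimate
    return reps                                       # 4) empirical distribution of theta*

# illustration: bootstrap distribution of the sample mean of simulated data
y = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=100)
theta_star = bootstrap_distribution(y, np.mean)
print(theta_star.std())  # bootstrap estimate of the standard error of the mean
```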
1.1 Why does the bootstrap work?
The theoretical justification of the bootstrap is based on asymptotic arguments. Usually the bootstrap does not provide very good approximations for extremely small sample sizes. It must, however, be emphasized that in some cases bootstrap confidence intervals can be more accurate for moderate sample sizes than confidence intervals based on standard asymptotic approximations.

Example 1: Estimating a proportion
• Data: i.i.d. random sample $Y_1, \dots, Y_n$; $Y_i \in \{0,1\}$ is dichotomous, $P(Y_i = 1) = p$, $P(Y_i = 0) = 1 - p$. The problem is to estimate $p$.
• Let $S$ denote the number of $Y_i$ which are equal to 1. The maximum likelihood estimate of $p$ is $\hat p = S/n$.
• Recall: $n\hat p = S \sim B(n, p)$
• As $n \to \infty$ the central limit theorem implies that
$$\frac{\sqrt{n}(\hat p - p)}{\sqrt{p(1-p)}} \to_L N(0, 1)$$
• $n$ large: the distributions of $\sqrt{n}(\hat p - p)$ and $\hat p - p$ can be approximated by $N(0, p(1-p))$ and $N(0, p(1-p)/n)$, respectively. For simplicity we will write $\mathrm{distr}(\sqrt{n}(\hat p - p)) \approx N(0, p(1-p))$ as well as $\mathrm{distr}(\hat p - p) \approx N(0, p(1-p)/n)$.

Bootstrap:
• Random sample $Y_1^*, \dots, Y_n^*$ generated by drawing observations independently and with replacement from $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$. Let $S^*$ denote the number of $Y_i^*$ which are equal to 1.
• Bootstrap estimate of $p$: $\hat p^* = S^*/n$. The distribution of $\hat p^*$ depends on the observed sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$! A different sample will lead to a different distribution. The bootstrap now tries to approximate the true distribution of $\hat p - p$ by the conditional distribution of $\hat p^* - \hat p$ given the observed sample $\mathcal{Y}_n$. The bootstrap is called consistent if asymptotically ($n \to \infty$) the conditional distribution of $\hat p^* - \hat p$ coincides with the true distribution of $\hat p - p$ (note: a proper scaling is required!).
• We obtain $P^*(Y_i^* = 1) = P(Y_i^* = 1 \mid \mathcal{Y}_n) = \hat p$, $P^*(Y_i^* = 0) = P(Y_i^* = 0 \mid \mathcal{Y}_n) = 1 - \hat p$ and
$$E^*(\hat p^*) = E(\hat p^* \mid \mathcal{Y}_n) = \hat p, \qquad Var^*(\hat p^*) = E[(\hat p^* - \hat p)^2 \mid \mathcal{Y}_n] = \frac{\hat p(1 - \hat p)}{n}$$
• The conditional distribution of $n\hat p^* = S^*$ given $\mathcal{Y}_n$ is equal to $B(n, \hat p)$. In a slight abuse of notation we will write $(n\hat p^* \mid \mathcal{Y}_n) \sim B(n, \hat p)$ or $\mathrm{distr}(n\hat p^* \mid \mathcal{Y}_n) = B(n, \hat p)$.
• As $n \to \infty$ the central limit theorem implies that the (conditional) distribution of $\bigl(\frac{\sqrt{n}(\hat p^* - \hat p)}{\sqrt{\hat p(1-\hat p)}} \mid \mathcal{Y}_n\bigr)$ converges (stochastically) to a $N(0,1)$-distribution. Moreover, $\hat p$ is a consistent estimator of $p$ and therefore $\hat p(1-\hat p) \to_P p(1-p)$ as $n \to \infty$. This implies that asymptotically $\hat p(1-\hat p)$ may be replaced by $p(1-p)$, and

the law of $\bigl(\frac{\sqrt{n}(\hat p^* - \hat p)}{\sqrt{p(1-p)}} \mid \mathcal{Y}_n\bigr)$ converges stochastically to a $N(0,1)$-distribution.

More precisely, as $n \to \infty$,
$$\sup_\delta \Bigl| P\Bigl(\frac{\sqrt{n}(\hat p^* - \hat p)}{\sqrt{p(1-p)}} \le \delta \,\Big|\, \mathcal{Y}_n\Bigr) - \Phi(\delta) \Bigr| \to_P 0,$$
where $\Phi$ denotes the distribution function of the standard normal distribution.
• We can conclude that for large $n$
$$\mathrm{distr}(\sqrt{n}(\hat p^* - \hat p) \mid \mathcal{Y}_n) \approx \mathrm{distr}(\sqrt{n}(\hat p - p)) \approx N(0, p(1-p))$$
as well as
$$\mathrm{distr}(\hat p^* - \hat p \mid \mathcal{Y}_n) \approx \mathrm{distr}(\hat p - p) \approx N(0, p(1-p)/n)$$
$\Rightarrow$ Bootstrap consistent
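As a quick numerical illustration of this consistency statement, the following sketch (ours, in Python/NumPy) exploits the fact noted above that $\mathrm{distr}(n\hat p^* \mid \mathcal{Y}_n) = B(n, \hat p)$, so $S^*$ can be drawn directly from a binomial distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 0.3
y = rng.binomial(1, p, size=n)          # observed dichotomous sample
p_hat = y.mean()

# conditional on the data, n*p_hat^* = S^* ~ B(n, p_hat)
m = 2000
p_star = rng.binomial(n, p_hat, size=m) / n

print(np.var(p_star - p_hat))           # close to p_hat*(1 - p_hat)/n
print(p * (1 - p) / n)                  # variance of the normal limit of p_hat - p
```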
Example 2: Estimating a population mean
• Let $Y_1, \dots, Y_n$ denote an i.i.d. random sample with mean $\mu$ and variance $\sigma^2$. In the following $F$ will denote the corresponding distribution function.
• $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$ is an unbiased estimator of $\mu$
• Problem: construct a confidence interval

Traditional approach for constructing a $1-\alpha$ confidence interval:
• $\bar Y \sim N(\mu, \frac{\sigma^2}{n})$
• Estimation of $\sigma^2$: $S^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar Y)^2$
• This implies $\sqrt{n}\,\frac{\bar Y - \mu}{S} \sim t_{n-1}$, and hence
$$P\Bigl(-t_{n-1,1-\frac{\alpha}{2}} \frac{S}{\sqrt{n}} \le \bar Y - \mu \le t_{n-1,1-\frac{\alpha}{2}} \frac{S}{\sqrt{n}}\Bigr) = 1 - \alpha$$
• $1-\alpha$ confidence interval: $[\bar Y - t_{n-1,1-\frac{\alpha}{2}} \frac{S}{\sqrt{n}}, \; \bar Y + t_{n-1,1-\frac{\alpha}{2}} \frac{S}{\sqrt{n}}]$

Remark: The construction relies on the assumption that $\bar Y \sim N(\mu, \frac{\sigma^2}{n})$. This is necessarily true if $Y$ is normally distributed. If the underlying distribution is not normal, then this condition is approximately fulfilled if the sample size $n$ is sufficiently large (central limit theorem). In this case the constructed confidence interval must also be seen as an approximation.

The bootstrap offers an alternative method for constructing such confidence intervals.
The bootstrap approach:
• Random samples $Y_1^*, \dots, Y_n^*$ are generated by drawing observations independently and with replacement from the available sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$.
• $Y_1^*, \dots, Y_n^*$ $\Rightarrow$ estimator $\bar Y^* = \frac{1}{n}\sum_{i=1}^n Y_i^*$
• Means and variances of the conditional distributions of $Y_i^*$ and $\bar Y^*$ given $\mathcal{Y}_n$:
$$E^*(Y_i^*) = E(Y_i^* \mid \mathcal{Y}_n) = \bar Y, \qquad Var^*(Y_i^*) = E[(Y_i^* - \bar Y)^2 \mid \mathcal{Y}_n] = \tilde S^2 := \frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y)^2$$
Moreover, $E^*(\bar Y^*) = \bar Y$, $Var^*(\bar Y^*) = \tilde S^2/n$.
• As $n \to \infty$ the central limit theorem implies that the (conditional) distribution of $\bigl(\frac{\sqrt{n}(\bar Y^* - \bar Y)}{\tilde S} \mid \mathcal{Y}_n\bigr)$ converges (stochastically) to a $N(0,1)$-distribution. Moreover, $\tilde S^2$ is a consistent estimator of $\sigma^2$ and therefore $\tilde S^2 \to_P \sigma^2$ as $n \to \infty$. This implies that asymptotically $\tilde S$ may be replaced by $\sigma$, and

the law of $\bigl(\frac{\sqrt{n}(\bar Y^* - \bar Y)}{\sigma} \mid \mathcal{Y}_n\bigr)$ converges stochastically to a $N(0,1)$-distribution.

More precisely, as $n \to \infty$,
$$\sup_\delta \Bigl| P\Bigl(\frac{\sqrt{n}(\bar Y^* - \bar Y)}{\sigma} \le \delta \,\Big|\, \mathcal{Y}_n\Bigr) - \Phi(\delta) \Bigr| \to_P 0,$$
where $\Phi$ denotes the distribution function of the standard normal distribution.
We can conclude that for large $n$
$$\mathrm{distr}(\sqrt{n}(\bar Y^* - \bar Y) \mid \mathcal{Y}_n) \approx \mathrm{distr}(\sqrt{n}(\bar Y - \mu)) \approx N(0, \sigma^2)$$
as well as
$$\mathrm{distr}(\bar Y^* - \bar Y \mid \mathcal{Y}_n) \approx \mathrm{distr}(\bar Y - \mu) \approx N(0, \sigma^2/n)$$
$\Rightarrow$ Bootstrap consistent

Construction of a symmetric confidence interval of level $1-\alpha$:
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat t_{\frac{\alpha}{2}}$ and $\hat t_{1-\frac{\alpha}{2}}$ of the conditional distribution of $\bar Y^*$ given $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$ (the "bootstrap distribution"):
$$P^*(\bar Y^* \le \hat t_{\frac{\alpha}{2}}) \approx \frac{\alpha}{2}, \qquad P^*(\bar Y^* > \hat t_{\frac{\alpha}{2}}) \approx 1 - \frac{\alpha}{2},$$
$$P^*(\bar Y^* \le \hat t_{1-\frac{\alpha}{2}}) \approx 1 - \frac{\alpha}{2}, \qquad P^*(\bar Y^* > \hat t_{1-\frac{\alpha}{2}}) \approx \frac{\alpha}{2}.$$
Here, $P^*$ denotes probabilities with respect to the conditional distribution of $\bar Y^*$ given $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$.
• In practice:
– Draw $m$ bootstrap samples (e.g. $m = 2000$) and calculate the corresponding estimates $\bar Y_1^*, \bar Y_2^*, \dots, \bar Y_m^*$.
– Order the resulting estimates $\Rightarrow$ $\bar Y_{(1)}^* \le \bar Y_{(2)}^* \le \dots \le \bar Y_{(m)}^*$.
– Set $\hat t_{\frac{\alpha}{2}} := \bar Y^*_{([m+1]\frac{\alpha}{2})}$ and $\hat t_{1-\frac{\alpha}{2}} := \bar Y^*_{([m+1][1-\frac{\alpha}{2}])}$.
A basic bootstrap confidence interval:
• By construction of $\hat t_{\frac{\alpha}{2}}$ and $\hat t_{1-\frac{\alpha}{2}}$ we have
$$P^*(\bar Y^* - \bar Y \le \hat t_{\frac{\alpha}{2}} - \bar Y) \approx \frac{\alpha}{2}, \qquad P^*(\bar Y^* - \bar Y \le \hat t_{1-\frac{\alpha}{2}} - \bar Y) \approx 1 - \frac{\alpha}{2}.$$
• We have seen that the bootstrap is consistent, and therefore $\mathrm{distr}(\bar Y^* - \bar Y \mid \mathcal{Y}_n) \approx \mathrm{distr}(\bar Y - \mu)$ asymptotically. This implies that for large $n$
$$P(\bar Y - \mu \le \hat t_{\frac{\alpha}{2}} - \bar Y) \approx \frac{\alpha}{2}, \qquad P(\bar Y - \mu \le \hat t_{1-\frac{\alpha}{2}} - \bar Y) \approx 1 - \frac{\alpha}{2},$$
and therefore
$$P\bigl(\bar Y - (\hat t_{1-\frac{\alpha}{2}} - \bar Y) \le \mu \le \bar Y - (\hat t_{\frac{\alpha}{2}} - \bar Y)\bigr) \approx 1 - \alpha$$
• $\Rightarrow$ Approximate $1-\alpha$ (symmetric) confidence interval:
$$[2\bar Y - \hat t_{1-\frac{\alpha}{2}}, \; 2\bar Y - \hat t_{\frac{\alpha}{2}}]$$

The percentile interval:
• In the older bootstrap literature the so-called percentile interval $[\hat t_{\frac{\alpha}{2}}, \hat t_{1-\frac{\alpha}{2}}]$ is usually recommended as a $1-\alpha$ confidence interval.
• The percentile interval can easily be justified if all underlying distributions are symmetric, i.e. $\mathrm{distr}(\bar Y^* - \bar Y \mid \mathcal{Y}_n) \approx \mathrm{distr}(\bar Y - \bar Y^* \mid \mathcal{Y}_n)$ and $\mathrm{distr}(\bar Y - \mu) \approx \mathrm{distr}(\mu - \bar Y)$.
• In practice the percentile interval is usually less precise than the basic interval discussed above; there are, however, some bias-corrected modifications of the percentile interval which allow better approximations.
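Both the basic interval and the percentile interval are easily computed from the ordered bootstrap estimates. A minimal sketch in Python/NumPy (our own illustration; it uses np.quantile instead of the explicit order-statistic rule above, which is an equivalent approximation):

```python
import numpy as np

def mean_bootstrap_cis(y, alpha=0.05, m=2000, seed=0):
    """Return the basic interval [2*ybar - t_hi, 2*ybar - t_lo]
    and the percentile interval [t_lo, t_hi] for the mean."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    ybar = y.mean()
    means = np.array([rng.choice(y, size=len(y), replace=True).mean()
                      for _ in range(m)])
    t_lo, t_hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return (2 * ybar - t_hi, 2 * ybar - t_lo), (t_lo, t_hi)

y = np.random.default_rng(1).exponential(scale=2.0, size=50)
basic, percentile = mean_bootstrap_cis(y)
print(basic, percentile)
```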
General Setup: The nonparametric (naive) bootstrap
• Data: random sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$; the distribution of $Y_i$ depends on an unknown parameter (vector) $\theta$
• The data $Y_1, \dots, Y_n$ is used to estimate $\theta$ $\Rightarrow$ estimator $\hat\theta \equiv \hat\theta(Y_1, \dots, Y_n)$
• Bootstrap: random samples $Y_1^*, \dots, Y_n^*$ are generated by drawing observations independently and with replacement from the available sample $Y_1, \dots, Y_n$ $\Rightarrow$ bootstrap estimates $\hat\theta^* \equiv \hat\theta(Y_1^*, \dots, Y_n^*)$
• $\mathrm{distr}(\hat\theta^* - \hat\theta \mid \mathcal{Y}_n)$ is used to approximate $\mathrm{distr}(\hat\theta - \theta)$

The bootstrap "works" for a large number of statistical and econometric problems. Indeed, it can be shown that under some mild regularity conditions the bootstrap is consistent if
1) the generation of the bootstrap sample "reflects" appropriately the way in which the original sample has been generated (i.i.d. sampling!), and
2) the distribution of the estimator $\hat\theta$ is asymptotically normal. More precisely,
– single parameter ($\theta \in \mathbb{R}$): $\sqrt{n}(\hat\theta - \theta) \to N(0, v^2)$, $v$ the standard error of $\sqrt{n}(\hat\theta - \theta)$;
– multivariate parameter vector ($\theta \in \mathbb{R}^d$): $\sqrt{n}(\hat\theta - \theta) \to N_d(0, V)$, $V$ the covariance matrix of $\sqrt{n}(\hat\theta - \theta)$.

Consistent bootstrap: $\mathrm{distr}(\sqrt{n}(\hat\theta^* - \hat\theta) \mid \mathcal{Y}_n) \approx \mathrm{distr}(\sqrt{n}(\hat\theta - \theta))$ [and $\mathrm{distr}(\hat\theta^* - \hat\theta \mid \mathcal{Y}_n) \approx \mathrm{distr}(\hat\theta - \theta)$] if $n$ is sufficiently large. $\Rightarrow$ Bootstrap confidence intervals, tests, etc.
Note:
• Standard approaches to construct confidence intervals and tests are usually based on asymptotic normal approximations. For example, if $\theta \in \mathbb{R}$ and $\sqrt{n}(\hat\theta - \theta) \to N(0, v^2)$ one usually tries to determine an approximation $\hat v$ of $v$ from the data. An approximate $1-\alpha$ confidence interval is then given by
$$[\hat\theta - z_{1-\frac{\alpha}{2}} \frac{\hat v}{\sqrt{n}}, \; \hat\theta + z_{1-\frac{\alpha}{2}} \frac{\hat v}{\sqrt{n}}]$$
• In some cases it is very difficult to obtain approximations $\hat v$ of $v$. Statistical inference is then usually based on the bootstrap.
• In contemporary statistical analysis the bootstrap is frequently used even for standard problems, where estimates $\hat v$ of $v$ are easily constructed. The reason is that in many situations it can be shown that bootstrap confidence intervals or tests are more precise than those determined analytically from asymptotic formulas.

It must be emphasized that the bootstrap does not always work. The bootstrap may fail if one of the above conditions 1) or 2) is violated. Examples are:
• The naive bootstrap will not work if the i.i.d. re-sample $Y_1^*, \dots, Y_n^*$ from $Y_1, \dots, Y_n$ does not properly reflect the way in which $Y_1, \dots, Y_n$ is generated from the underlying population (e.g. dependent data; $Y_1, \dots, Y_n$ not i.i.d.).
• The distribution of the estimator $\hat\theta$ is not asymptotically normal (e.g. extreme value problems).
General approach: Basic bootstrap $1-\alpha$ confidence interval

Random sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$; unknown parameter (vector) $\theta$. We will assume that the bootstrap is consistent: $\mathrm{distr}(\hat\theta^* - \hat\theta \mid \mathcal{Y}_n) \approx \mathrm{distr}(\hat\theta - \theta)$ if $n$ is sufficiently large.
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat t_{\frac{\alpha}{2}}$ and $\hat t_{1-\frac{\alpha}{2}}$ of the conditional distribution of $\hat\theta^*$ given $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$ (the "bootstrap distribution"):
$$P^*(\hat\theta^* \le \hat t_{\frac{\alpha}{2}}) \approx \frac{\alpha}{2}, \qquad P^*(\hat\theta^* > \hat t_{\frac{\alpha}{2}}) \approx 1 - \frac{\alpha}{2},$$
$$P^*(\hat\theta^* \le \hat t_{1-\frac{\alpha}{2}}) \approx 1 - \frac{\alpha}{2}, \qquad P^*(\hat\theta^* > \hat t_{1-\frac{\alpha}{2}}) \approx \frac{\alpha}{2}.$$
Here, $P^*$ denotes probabilities with respect to the conditional distribution of $\hat\theta^*$ given $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$.
• Consistency of the bootstrap implies that for large $n$
$$P(\hat\theta - \theta \le \hat t_{\frac{\alpha}{2}} - \hat\theta) \approx \frac{\alpha}{2}, \qquad P(\hat\theta - \theta \le \hat t_{1-\frac{\alpha}{2}} - \hat\theta) \approx 1 - \frac{\alpha}{2},$$
and therefore
$$P\bigl(\hat\theta - (\hat t_{1-\frac{\alpha}{2}} - \hat\theta) \le \theta \le \hat\theta - (\hat t_{\frac{\alpha}{2}} - \hat\theta)\bigr) \approx 1 - \alpha$$
• $\Rightarrow$ Approximate $1-\alpha$ (symmetric) confidence interval:
$$[2\hat\theta - \hat t_{1-\frac{\alpha}{2}}, \; 2\hat\theta - \hat t_{\frac{\alpha}{2}}]$$
Example: Bootstrap confidence interval for a median

Given: i.i.d. sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$; $Y_i$ possesses a continuous distribution with (unknown) density $f$. We are now interested in estimating the median $\mu_{med}$ of the underlying distribution. Recall that the median is defined by $P(Y_i \le \mu_{med}) = P(Y_i \ge \mu_{med}) = 0.5$.

$\mu_{med}$ is estimated by the sample median $\hat\mu_{med}$. Based on the ordered sample $Y_{(1)} \le Y_{(2)} \le \dots \le Y_{(n)}$, $\hat\mu_{med}$ is given by
$$\hat\mu_{med} = \begin{cases} Y_{(\frac{n+1}{2})} & \text{if } n \text{ is an odd number} \\ (Y_{(\frac{n}{2})} + Y_{(\frac{n}{2}+1)})/2 & \text{if } n \text{ is an even number} \end{cases}$$
Construction of a confidence interval for $\mu_{med}$ is not an easy task. Asymptotically we obtain
$$\sqrt{n}(\hat\mu_{med} - \mu_{med}) \to_L N\Bigl(0, \frac{1}{4 f(\mu_{med})^2}\Bigr)$$
The problem is that the density $f$ is unknown. In principle it may be estimated by nonparametric kernel density estimation, and a corresponding plug-in estimate $\hat f(\mu_{med})$ may be used to approximate the asymptotic variance. However, the bootstrap offers a simple alternative.

Construction of a bootstrap confidence interval:
• Draw i.i.d. random samples $Y_1^*, \dots, Y_n^*$ from $\mathcal{Y}_n$ and determine the corresponding medians $\hat\mu_{med}^*$
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat t_{\frac{\alpha}{2}}$ and $\hat t_{1-\frac{\alpha}{2}}$ of the conditional distribution of $\hat\mu_{med}^*$ given $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$.
$\Rightarrow$ Approximate $1-\alpha$ (symmetric) confidence interval: $[2\hat\mu_{med} - \hat t_{1-\frac{\alpha}{2}}, \; 2\hat\mu_{med} - \hat t_{\frac{\alpha}{2}}]$
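A sketch of this construction in Python/NumPy (our own illustration with simulated heavy-tailed data):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_t(df=3, size=101)     # heavy-tailed sample, true median 0
med = np.median(y)

m = 2000
med_star = np.array([np.median(rng.choice(y, size=len(y), replace=True))
                     for _ in range(m)])
t_lo, t_hi = np.quantile(med_star, [0.025, 0.975])
print(2 * med - t_hi, 2 * med - t_lo)  # basic bootstrap 95% interval for the median
```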
1.2 Pivot statistics and the "bootstrap-t method"
In many situations it is possible to obtain more accurate bootstrap confidence intervals by using the bootstrap-t method (one also speaks of "studentized bootstrap confidence intervals"). The construction relies on so-called pivot statistics.

Let $Y_1, \dots, Y_n$ be an i.i.d. random sample and assume that the distribution of $Y$ depends on an unknown parameter (or parameter vector) $\theta$.
• A statistic $T_n \equiv T(Y_1, \dots, Y_n)$ is called a "pivot statistic" if the distribution of $T_n$ does not depend on any unknown parameter.
• A statistic $T_n \equiv T(Y_1, \dots, Y_n)$ is called an "asymptotic pivot statistic" if for suitable sequences $a_n, b_n$ of real numbers the transformed statistic $a_n T_n + b_n$ possesses a well-defined, non-degenerate asymptotic distribution which does not depend on the parameters of the unknown distribution of $Y$.

Example: Population mean: $Y_1, \dots, Y_n$ i.i.d. with mean $\mu$, variance $\sigma^2 > 0$, and $E|Y|^3 = \beta < \infty$. If $Y$ is normally distributed we obtain
$$T_n = \frac{\sqrt{n}(\bar Y - \mu)}{S} \sim t_{n-1}$$
with $S^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar Y)^2$, where $t_{n-1}$ denotes Student's t-distribution with $n-1$ degrees of freedom. We can conclude that $T_n$ is a pivot statistic.

Even if $Y$ is not normally distributed, the central limit theorem implies that
$$T_n = \frac{\sqrt{n}(\bar Y - \mu)}{S} \to_L N(0, 1)$$
In this case $T_n$ is an asymptotic pivot statistic.

Bootstrap:
• i.i.d. re-sample $Y_1^*, \dots, Y_n^*$ from $\mathcal{Y}_n$ $\Rightarrow$ estimators $\bar Y^* = \frac{1}{n}\sum_{i=1}^n Y_i^*$ and $S^{*2} = \frac{1}{n-1}\sum_{i=1}^n (Y_i^* - \bar Y^*)^2$
• $n$ large $\Rightarrow$ approximately
$$\mathrm{distr}\Bigl(\frac{\sqrt{n}(\bar Y^* - \bar Y)}{S^*} \,\Big|\, \mathcal{Y}_n\Bigr) \approx \mathrm{distr}\Bigl(\frac{\sqrt{n}(\bar Y - \mu)}{S}\Bigr) \approx N(0, 1)$$
or
$$\mathrm{distr}\Bigl(\frac{\bar Y^* - \bar Y}{S^*} \,\Big|\, \mathcal{Y}_n\Bigr) \approx \mathrm{distr}\Bigl(\frac{\bar Y - \mu}{S}\Bigr)$$
Therefore, the (conditional) distribution of $\frac{\bar Y^* - \bar Y}{S^*}$ (given $\mathcal{Y}_n$) can be used to approximate the distribution of $\frac{\bar Y - \mu}{S}$.
Construction of a bootstrap-t confidence interval of level $1-\alpha$:
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat\tau_{\frac{\alpha}{2}}$ and $\hat\tau_{1-\frac{\alpha}{2}}$ of the conditional distribution of $\frac{\bar Y^* - \bar Y}{S^*}$ given $\mathcal{Y}_n$:
$$P^*\Bigl(\frac{\bar Y^* - \bar Y}{S^*} \le \hat\tau_{\frac{\alpha}{2}}\Bigr) \approx \frac{\alpha}{2}, \qquad P^*\Bigl(\frac{\bar Y^* - \bar Y}{S^*} > \hat\tau_{\frac{\alpha}{2}}\Bigr) \approx 1 - \frac{\alpha}{2},$$
$$P^*\Bigl(\frac{\bar Y^* - \bar Y}{S^*} \le \hat\tau_{1-\frac{\alpha}{2}}\Bigr) \approx 1 - \frac{\alpha}{2}, \qquad P^*\Bigl(\frac{\bar Y^* - \bar Y}{S^*} > \hat\tau_{1-\frac{\alpha}{2}}\Bigr) \approx \frac{\alpha}{2}.$$
• In practice:
– Draw $m$ bootstrap samples (e.g. $m = 2000$) and calculate the corresponding studentized values $Z_1^* := \frac{\bar Y_1^* - \bar Y}{S_1^*}, Z_2^* := \frac{\bar Y_2^* - \bar Y}{S_2^*}, \dots, Z_m^* := \frac{\bar Y_m^* - \bar Y}{S_m^*}$.
– Order the resulting values $\Rightarrow$ $Z_{(1)}^* \le Z_{(2)}^* \le \dots \le Z_{(m)}^*$.
– Set $\hat\tau_{\frac{\alpha}{2}} := Z^*_{([m+1]\frac{\alpha}{2})}$ and $\hat\tau_{1-\frac{\alpha}{2}} := Z^*_{([m+1][1-\frac{\alpha}{2}])}$.
• Consistency of the bootstrap implies that asymptotically also
$$P\Bigl(\frac{\bar Y - \mu}{S} \le \hat\tau_{\frac{\alpha}{2}}\Bigr) \approx \frac{\alpha}{2}, \qquad P\Bigl(\frac{\bar Y - \mu}{S} > \hat\tau_{\frac{\alpha}{2}}\Bigr) \approx 1 - \frac{\alpha}{2},$$
$$P\Bigl(\frac{\bar Y - \mu}{S} \le \hat\tau_{1-\frac{\alpha}{2}}\Bigr) \approx 1 - \frac{\alpha}{2}, \qquad P\Bigl(\frac{\bar Y - \mu}{S} > \hat\tau_{1-\frac{\alpha}{2}}\Bigr) \approx \frac{\alpha}{2}.$$
• This yields the $1-\alpha$ confidence interval
$$[\bar Y - \hat\tau_{1-\frac{\alpha}{2}} S, \; \bar Y - \hat\tau_{\frac{\alpha}{2}} S]$$
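A minimal implementation of this bootstrap-t interval for the mean (our own sketch in Python/NumPy, following the construction above):

```python
import numpy as np

def bootstrap_t_ci(y, alpha=0.05, m=2000, seed=0):
    """Bootstrap-t interval [ybar - tau_hi*S, ybar - tau_lo*S] based on the
    quantiles of (ybar* - ybar)/S* as described above."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n, ybar, s = len(y), y.mean(), y.std(ddof=1)
    z = np.empty(m)
    for b in range(m):
        ys = rng.choice(y, size=n, replace=True)
        z[b] = (ys.mean() - ybar) / ys.std(ddof=1)   # studentized replicate
    tau_lo, tau_hi = np.quantile(z, [alpha / 2, 1 - alpha / 2])
    return ybar - tau_hi * s, ybar - tau_lo * s

y = np.random.default_rng(1).lognormal(size=40)      # skewed data
print(bootstrap_t_ci(y))
```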
General construction of a bootstrap-t interval (unknown real-valued parameter $\theta \in \mathbb{R}$):

Random sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$; unknown parameter $\theta$. Assume that the estimator $\hat\theta$ of $\theta$ is asymptotically normal,
$$\sqrt{n}(\hat\theta - \theta) \to_L N(0, v^2) \quad \Rightarrow \quad \sqrt{n}\,\frac{\hat\theta - \theta}{v} \to_L N(0, 1),$$
and that a consistent estimator $\hat v \equiv \hat v(Y_1, \dots, Y_n)$ of $v$ is available. One might then replace $v$ by $\hat v$ to obtain
$$\sqrt{n}\,\frac{\hat\theta - \theta}{\hat v} \to_L N(0, 1)$$
Obviously, $\sqrt{n}\,\frac{\hat\theta - \theta}{\hat v}$ and $\frac{\hat\theta - \theta}{\hat v}$ are asymptotic pivot statistics.
• Based on an i.i.d. re-sample $Y_1^*, \dots, Y_n^*$ from $\{Y_1, \dots, Y_n\}$, calculate bootstrap estimates $\hat\theta^*$ and $\hat v^*$.
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat\tau_{\frac{\alpha}{2}}$ and $\hat\tau_{1-\frac{\alpha}{2}}$ of the conditional distribution of $\frac{\hat\theta^* - \hat\theta}{\hat v^*}$ given $\mathcal{Y}_n$.
• Bootstrap-t interval:
$$[\hat\theta - \hat\tau_{1-\frac{\alpha}{2}} \hat v, \; \hat\theta - \hat\tau_{\frac{\alpha}{2}} \hat v]$$
1.3 The parametric bootstrap
A further increase of accuracy can be obtained in applications where the distribution of $Y$ is known up to some parameter vectors $\theta, \omega$ (e.g. $Y$ is normal with mean $\mu$ and variance $\sigma^2$; $Y$ follows an exponential distribution with parameter $\lambda$). The difference to the nonparametric bootstrap discussed above consists in the way a bootstrap re-sample $Y_1^*, \dots, Y_n^*$ is generated.

Let $\theta = (\theta_1, \dots, \theta_p)'$, and for some known $F$ let $F(y, \theta, \omega)$ denote the distribution function of $Y$ as a function of $\theta, \omega$. $F$ is assumed to be known. For simplicity, we will concentrate on constructing a confidence interval for $\theta$. The parametric bootstrap now proceeds as follows:
• The unknown parameter vectors $\theta, \omega$ are estimated by the maximum likelihood method $\Rightarrow$ maximum likelihood estimators $\hat\theta, \hat\omega$.
• An i.i.d. re-sample $Y_1^*, \dots, Y_n^*$ is generated by randomly drawing observations from an $F(\cdot, \hat\theta, \hat\omega)$ distribution (using a random number generator) $\Rightarrow$ $\hat\theta^*, \hat\omega^*$.
• The conditional distribution of $\hat\theta^*$ given $F(\cdot, \hat\theta, \hat\omega)$ is used to approximate the distribution of the estimator $\hat\theta$.

In almost all cases of practical interest, confidence intervals based on the parametric bootstrap are more accurate than standard intervals based on first-order asymptotic approximations. The parametric bootstrap usually also provides more accurate approximations than its nonparametric counterpart discussed above. Of course, this requires that the underlying distributional assumption is satisfied (otherwise the parametric bootstrap will lead to incorrect results).
Basic parametric bootstrap confidence interval:
$$[2\hat\theta - \hat t_{1-\frac{\alpha}{2}}, \; 2\hat\theta - \hat t_{\frac{\alpha}{2}}],$$
where $\hat t_{\frac{\alpha}{2}}$ and $\hat t_{1-\frac{\alpha}{2}}$ now denote the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the conditional distribution of $\hat\theta^*$ given $F(\cdot, \hat\theta, \hat\omega)$.

Bootstrap-t intervals:
• Assume that the standard error $v(\theta, \omega)$ of $\sqrt{n}(\hat\theta - \theta)$ can be determined in dependence of the parameter (vectors) $\theta, \omega$.
• An i.i.d. re-sample $Y_1^*, \dots, Y_n^*$ is generated by randomly drawing observations from an $F(\cdot, \hat\theta, \hat\omega)$ distribution $\Rightarrow$ parameter estimates $\hat\theta^*, \hat\omega^*$ as well as bootstrap approximations $v(\hat\theta^*, \hat\omega^*)$ of the standard error.
• Bootstrap-t interval:
$$[\hat\theta - \hat\tau_{1-\frac{\alpha}{2}} v(\hat\theta, \hat\omega), \; \hat\theta - \hat\tau_{\frac{\alpha}{2}} v(\hat\theta, \hat\omega)],$$
where $\hat\tau_{\frac{\alpha}{2}}$ and $\hat\tau_{1-\frac{\alpha}{2}}$ now denote the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the conditional distribution of $\frac{\hat\theta^* - \hat\theta}{v(\hat\theta^*, \hat\omega^*)}$ given $F(\cdot, \hat\theta, \hat\omega)$.

Note: Sometimes the following modification leads to even more accurate intervals:
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\tilde\tau_{\frac{\alpha}{2}}$ and $\tilde\tau_{1-\frac{\alpha}{2}}$ of the conditional distribution of $\frac{\hat\theta^* - \hat\theta}{v(\hat\theta, \hat\omega^*)}$ given $F(\cdot, \hat\theta, \hat\omega)$.
• Asymptotically we obtain
$$P\Bigl(\tilde\tau_{\frac{\alpha}{2}} \le \frac{\hat\theta - \theta}{v(\theta, \hat\omega)} \le \tilde\tau_{1-\frac{\alpha}{2}}\Bigr) \approx 1 - \alpha$$
• $1-\alpha$ confidence interval: set of all $\theta$ with $\tilde\tau_{\frac{\alpha}{2}} \le \frac{\hat\theta - \theta}{v(\theta, \hat\omega)} \le \tilde\tau_{1-\frac{\alpha}{2}}$
Example: Exponential distribution

Assume that $Y$ follows an exponential distribution with parameter $\lambda$. Density and distribution function are then given by
$$f(y, \lambda) = \frac{1}{\lambda} e^{-y/\lambda}, \qquad F(y, \lambda) = 1 - e^{-y/\lambda}$$
We have $E(Y_i) = \lambda$ and $Var(Y_i) = \lambda^2$. The maximum likelihood estimator of $\lambda$ is given by $\hat\lambda = \frac{1}{n}\sum_{i=1}^n Y_i$, and $Var(\hat\lambda) = \frac{\lambda^2}{n}$.

The parametric bootstrap can then be used to construct confidence intervals. The following procedure is straightforward, but there also exist alternative approaches.
• An i.i.d. re-sample $Y_1^*, \dots, Y_n^*$ is generated by randomly drawing observations from an exponential distribution with parameter $\hat\lambda$.
• $Y_1^*, \dots, Y_n^*$ $\Rightarrow$ estimator $\hat\lambda^*$
• Calculation of the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\tilde\tau_{\frac{\alpha}{2}}$ and $\tilde\tau_{1-\frac{\alpha}{2}}$ with
$$P^*\Bigl(\frac{\hat\lambda^* - \hat\lambda}{\hat\lambda} \le \tilde\tau_{\frac{\alpha}{2}}\Bigr) = \frac{\alpha}{2}, \qquad P^*\Bigl(\frac{\hat\lambda^* - \hat\lambda}{\hat\lambda} \le \tilde\tau_{1-\frac{\alpha}{2}}\Bigr) = 1 - \frac{\alpha}{2},$$
where $P^*(\cdot)$ denotes probabilities calculated with respect to the exponential distribution with parameter $\hat\lambda$.
• This yields
$$P\Bigl(\tilde\tau_{\frac{\alpha}{2}} \le \frac{\hat\lambda - \lambda}{\lambda} \le \tilde\tau_{1-\frac{\alpha}{2}}\Bigr) = 1 - \alpha$$
$\Rightarrow$ Confidence interval:
$$\Bigl[\frac{\hat\lambda}{1 + \tilde\tau_{1-\frac{\alpha}{2}}}, \; \frac{\hat\lambda}{1 + \tilde\tau_{\frac{\alpha}{2}}}\Bigr]$$
It can be shown that for any finite sample of size $n$ the coverage probability of this interval is exactly equal to $1-\alpha$: $(\hat\lambda - \lambda)/\lambda$ is an exact pivot, and for every $n$ its distribution coincides exactly with the conditional distribution of $(\hat\lambda^* - \hat\lambda)/\hat\lambda$.
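A sketch of this parametric bootstrap in Python/NumPy (our own illustration; the true $\lambda = 2$ is of course unknown in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
y = rng.exponential(scale=2.0, size=n)     # "scale" is lambda in the mean parametrization
lam_hat = y.mean()                         # maximum likelihood estimator

# parametric bootstrap: draw from Exp(lam_hat) instead of resampling the data
m = 5000
lam_star = rng.exponential(scale=lam_hat, size=(m, n)).mean(axis=1)
t = (lam_star - lam_hat) / lam_hat         # bootstrap analogue of (lam_hat - lambda)/lambda
tau_lo, tau_hi = np.quantile(t, [0.025, 0.975])

print(lam_hat / (1 + tau_hi), lam_hat / (1 + tau_lo))   # 95% interval for lambda
```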
1.4 More on Bootstrap Confidence Intervals
Setup: i.i.d. random sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$; unknown parameter (vector) $\theta$. We will assume that the bootstrap is consistent: $\mathrm{distr}(\hat\theta^* - \hat\theta \mid \mathcal{Y}_n) \approx \mathrm{distr}(\hat\theta - \theta)$ if $n$ is sufficiently large. In the previous sections we have already defined basic bootstrap confidence intervals as well as bootstrap-t intervals.

1.4.1 Basic confidence interval

$$[2\hat\theta - \hat t_{1-\frac{\alpha}{2}}, \; 2\hat\theta - \hat t_{\frac{\alpha}{2}}],$$
where $\hat t_{\frac{\alpha}{2}}$ and $\hat t_{1-\frac{\alpha}{2}}$ are the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the conditional distribution of $\hat\theta^*$ given $\mathcal{Y}_n$.

1.4.2 Bootstrap-t Intervals

$$[\hat\theta - \hat\tau_{1-\frac{\alpha}{2}} \hat v, \; \hat\theta - \hat\tau_{\frac{\alpha}{2}} \hat v],$$
where $\hat\tau_{\frac{\alpha}{2}}$ and $\hat\tau_{1-\frac{\alpha}{2}}$ are the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles of the conditional distribution of $\frac{\hat\theta^* - \hat\theta}{\hat v^*}$ given $\mathcal{Y}_n$.

1.4.3 Percentile Intervals

The "classical" percentile confidence interval is given by
$$[\hat t_{\frac{\alpha}{2}}, \; \hat t_{1-\frac{\alpha}{2}}]$$
Generally, this interval does not work particularly well in practice.
The so-called BCa method allows the construction of better confidence intervals. The term BCa stands for "bias-corrected and accelerated". The BCa interval of intended coverage $1-\alpha$ is given by
$$[\hat t_{\alpha_1}, \; \hat t_{\alpha_2}],$$
where $\hat t_{\alpha_1}$ and $\hat t_{\alpha_2}$ are the $\alpha_1$ and $\alpha_2$ quantiles of the conditional distribution of $\hat\theta^*$ given $\mathcal{Y}_n$, and
$$\alpha_1 = \Phi\Bigl(\hat\zeta + \frac{\hat\zeta + z_{\frac{\alpha}{2}}}{1 - \hat a(\hat\zeta + z_{\frac{\alpha}{2}})}\Bigr), \qquad \alpha_2 = \Phi\Bigl(\hat\zeta + \frac{\hat\zeta + z_{1-\frac{\alpha}{2}}}{1 - \hat a(\hat\zeta + z_{1-\frac{\alpha}{2}})}\Bigr),$$
where $\Phi$ is the standard normal distribution function, and where $z_\alpha$ is the $\alpha$ quantile of a standard normal distribution. Note that the BCa interval reduces to the standard percentile interval if $\hat\zeta = \hat a = 0$. However, a different choice of $\hat\zeta$ and $\hat a$ leads to more accurate intervals. The value of the bias correction $\hat\zeta$ can be obtained from the proportion of the bootstrap replications less than the original estimate $\hat\theta$:
$$\hat\zeta = \Phi^{-1}\bigl(P^*[\hat\theta^* < \hat\theta]\bigr)$$
Calculation of the acceleration $\hat a$ is slightly more complicated. It is based on jackknife values of the estimator $\hat\theta$: for any $i = 1, \dots, n$ calculate the estimate $\hat\theta_{-i}$ from the sample $Y_1, \dots, Y_{i-1}, Y_{i+1}, \dots, Y_n$ with the $i$-th observation deleted. Let $\tilde\theta = \frac{1}{n}\sum_{i=1}^n \hat\theta_{-i}$ and determine
$$\hat a = \frac{\sum_{i=1}^n (\tilde\theta - \hat\theta_{-i})^3}{6\bigl[\sum_{i=1}^n (\tilde\theta - \hat\theta_{-i})^2\bigr]^{3/2}}$$
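The formulas for $\hat\zeta$, $\hat a$, $\alpha_1$ and $\alpha_2$ can be implemented in a few lines. The following sketch is ours and assumes that SciPy is available for $\Phi$ and $\Phi^{-1}$:

```python
import numpy as np
from scipy.stats import norm

def bca_interval(y, estimator, alpha=0.05, m=2000, seed=0):
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    theta_hat = estimator(y)
    theta_star = np.array([estimator(rng.choice(y, size=n, replace=True))
                           for _ in range(m)])
    # bias correction: proportion of replicates below the original estimate
    zeta = norm.ppf(np.mean(theta_star < theta_hat))
    # acceleration from the jackknife values theta_{-i}
    theta_jack = np.array([estimator(np.delete(y, i)) for i in range(n)])
    d = theta_jack.mean() - theta_jack
    a = np.sum(d**3) / (6 * np.sum(d**2) ** 1.5)
    # adjusted quantile levels alpha_1, alpha_2
    z = norm.ppf([alpha / 2, 1 - alpha / 2])
    levels = norm.cdf(zeta + (zeta + z) / (1 - a * (zeta + z)))
    return tuple(np.quantile(theta_star, levels))

y = np.random.default_rng(1).gamma(shape=2.0, size=60)
print(bca_interval(y, np.mean))
```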
The BCa interval is motivated by theoretical results which show that it is second-order accurate. Consider generally $1-\alpha$ confidence intervals of the form $[t_{low}, t_{up}]$ for $\theta$. Upper and lower bounds of such intervals are determined from the data, $t_{low} \equiv t_{low}(Y_1, \dots, Y_n)$, $t_{up} \equiv t_{up}(Y_1, \dots, Y_n)$, and their accuracy depends on the particular procedure applied.
• (Symmetric) confidence intervals are said to be first-order accurate if there exist some constants $d_1, d_2 < \infty$ such that for sufficiently large $n$
$$\Bigl|P(\theta < t_{low}) - \frac{\alpha}{2}\Bigr| \le \frac{d_1}{\sqrt{n}}, \qquad \Bigl|P(\theta > t_{up}) - \frac{\alpha}{2}\Bigr| \le \frac{d_2}{\sqrt{n}}.$$
• (Symmetric) confidence intervals are said to be second-order accurate if there exist some constants $d_3, d_4 < \infty$ such that for sufficiently large $n$
$$\Bigl|P(\theta < t_{low}) - \frac{\alpha}{2}\Bigr| \le \frac{d_3}{n}, \qquad \Bigl|P(\theta > t_{up}) - \frac{\alpha}{2}\Bigr| \le \frac{d_4}{n}.$$
If the distribution of $\hat\theta$ is asymptotically normal, then under some additional regularity conditions it can usually be shown that
• standard confidence intervals based on asymptotic approximations are first-order accurate; the same holds for the basic bootstrap intervals $[2\hat\theta - \hat t_{1-\frac{\alpha}{2}}, 2\hat\theta - \hat t_{\frac{\alpha}{2}}]$ as well as for the classical percentile method;
• bootstrap-t intervals as well as BCa intervals are second-order accurate.

The difference between first- and second-order accuracy is not just a theoretical nicety. In many practically important situations second-order accurate intervals lead to much better approximations.

Another approach for constructing confidence intervals is the ABC method: ABC, standing for approximate bootstrap confidence intervals, allows the BCa interval endpoints to be approximated analytically, without using any Monte Carlo replications at all ($\Rightarrow$ reduced computational costs). The procedure works by approximating the bootstrap sampling results by Taylor expansions. It is then, however, required that $\hat\theta \equiv \hat\theta(Y_1, \dots, Y_n)$ is a smooth function of $Y_1, \dots, Y_n$. This is, for example, not true for the sample median.
1.5 Subsampling: Inference for a sample maximum
Data: i.i.d. random sample $\mathcal{Y}_n := \{Y_1, \dots, Y_n\}$. We now consider the situation that $Y_i$ only takes values in a compact interval $[0, \theta]$ such that $P(Y_i \in [0, \theta]) = 1$. Furthermore, $Y_i$ possesses a density $f$ which is continuous on $[0, \theta]$ and satisfies $f(y) > 0$ for $y \in (0, \theta]$ and $f(y) = 0$ for $y \notin [0, \theta]$. The maximum $\theta$ of $Y_i$ is unknown and has to be estimated from the data.

Similar types of extreme value problems frequently arise in econometrics. An example is the analysis of production efficiencies of different firms. The above situation may arise if we consider production outputs $Y_i$ of a sample of firms with identical inputs. A firm then is "efficient" if its output equals the maximal possible value $\theta$. Note that in practice usually more complicated problems have to be considered, where production outputs depend on individually different values of input variables $\Rightarrow$ "Frontier Analysis".

Consistent estimator $\hat\theta$ of $\theta$:
$$\hat\theta := \max_{i=1,\dots,n} Y_i$$
Constructing a confidence interval for $\theta$ is not an easy task. The distribution of $\hat\theta$ is not asymptotically normal. Indeed, it can be shown that $n(\theta - \hat\theta)$ asymptotically follows an exponential distribution with parameter $\lambda = \frac{1}{f(\theta)}$:
$$n(\theta - \hat\theta) \to_L \mathrm{Exp}\Bigl(\frac{1}{f(\theta)}\Bigr)$$
The naive bootstrap fails:
• i.i.d. re-sample $Y_1^*, \dots, Y_n^*$ from $\{Y_1, \dots, Y_n\}$ $\Rightarrow$ bootstrap estimator $\hat\theta^* := \max_{i=1,\dots,n} Y_i^*$
• Unfortunately, the bootstrap is not consistent. The reason is as follows: $\hat\theta = Y_{(n)}$, and hence $\hat\theta^* = \hat\theta = Y_{(n)}$ whenever $Y_{(n)} \in \{Y_1^*, \dots, Y_n^*\}$. Some calculations then show that for large $n$
$$P^*(\hat\theta - \hat\theta^* = 0) = P(\hat\theta - \hat\theta^* = 0 \mid \mathcal{Y}_n) \approx 1 - e^{-1},$$
while $P(\theta - \hat\theta = 0) = 0$!
• One can conclude that even for large sample sizes $\mathrm{distr}(\hat\theta - \hat\theta^* \mid \mathcal{Y}_n)$ will be very different from $\mathrm{distr}(\theta - \hat\theta)$ $\Rightarrow$ basic bootstrap confidence intervals are incorrect.

A possible remedy is to use subsampling. Similar to the ordinary bootstrap, subsampling relies on i.i.d. re-sampling from $\mathcal{Y}_n$; the only difference consists in the fact that subsampling is based on drawing a smaller number $\kappa < n$ of observations.
Subsampling bootstrap:
• Choose some $\kappa < n$
• Determine an i.i.d. re-sample $Y_1^*, \dots, Y_\kappa^*$ by randomly drawing $\kappa$ observations from $\{Y_1, \dots, Y_n\}$ $\Rightarrow$ bootstrap estimator $\hat\theta_\kappa^* := \max_{i=1,\dots,\kappa} Y_i^*$
• For the above problem subsampling is consistent. If $\kappa = n^\delta$ for some $0 < \delta < 1$, then

the law of $(\kappa(\hat\theta - \hat\theta_\kappa^*) \mid \mathcal{Y}_n)$ converges stochastically to an $\mathrm{Exp}\bigl(\frac{1}{f(\theta)}\bigr)$-distribution.

More precisely, as $n \to \infty$ with $\kappa = n^\delta$ for some $0 < \delta < 1$,
$$\sup_\delta \Bigl| P\bigl(\kappa(\hat\theta - \hat\theta_\kappa^*) \le \delta \mid \mathcal{Y}_n\bigr) - F\Bigl(\delta; \frac{1}{f(\theta)}\Bigr) \Bigr| \to_P 0,$$
where $F(\cdot; \frac{1}{f(\theta)})$ denotes the distribution function of an exponential distribution with parameter $\lambda = \frac{1}{f(\theta)}$.
• Asymptotically: $\mathrm{distr}(\kappa(\hat\theta - \hat\theta_\kappa^*) \mid \mathcal{Y}_n) \approx \mathrm{distr}(n(\theta - \hat\theta))$.

The subsampling bootstrap works under extremely general conditions, and it can often be applied in situations where the ordinary bootstrap fails. However, it usually does not make any sense to apply subsampling in regular cases where the standard nonparametric bootstrap is consistent: subsampling is then less efficient, and confidence intervals based on subsampling are less accurate. In practice, a major problem is the choice of $\kappa$.
Confidence interval based on subsampling:
• Calculation of the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat t_{\frac{\alpha}{2}}$ and $\hat t_{1-\frac{\alpha}{2}}$ with
$$P^*(\hat\theta - \hat\theta_\kappa^* \le \hat t_{\frac{\alpha}{2}}) = \frac{\alpha}{2}, \qquad P^*(\hat\theta - \hat\theta_\kappa^* \le \hat t_{1-\frac{\alpha}{2}}) = 1 - \frac{\alpha}{2},$$
where $P^*(\cdot)$ denotes probabilities calculated with respect to the conditional distribution of $\hat\theta_\kappa^*$ given $\mathcal{Y}_n$.
• This yields $P^*\bigl(\kappa \hat t_{\frac{\alpha}{2}} \le \kappa(\hat\theta - \hat\theta_\kappa^*) \le \kappa \hat t_{1-\frac{\alpha}{2}}\bigr) \approx 1 - \alpha$, and consistency of the bootstrap implies
$$P\bigl(\kappa \hat t_{\frac{\alpha}{2}} \le n(\theta - \hat\theta) \le \kappa \hat t_{1-\frac{\alpha}{2}}\bigr) \approx 1 - \alpha.$$
• Confidence interval for $\theta$:
$$\Bigl[\hat\theta + \frac{\kappa}{n} \hat t_{\frac{\alpha}{2}}, \; \hat\theta + \frac{\kappa}{n} \hat t_{1-\frac{\alpha}{2}}\Bigr]$$
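A sketch of the subsampling interval for the uniform maximum (our own illustration in Python/NumPy, with $\kappa = n^{1/2}$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 1000, 1.0
y = rng.uniform(0.0, theta, size=n)      # density positive on (0, theta]
theta_hat = y.max()

kappa = int(n ** 0.5)                    # kappa = n^delta with delta = 1/2
m = 2000
diff = np.array([theta_hat - rng.choice(y, size=kappa, replace=True).max()
                 for _ in range(m)])     # theta_hat - theta*_kappa
t_lo, t_hi = np.quantile(diff, [0.025, 0.975])

print(theta_hat + kappa / n * t_lo, theta_hat + kappa / n * t_hi)
```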
1.6 Appendix

1.6.1 The empirical distribution function
Data: i.i.d. sample $X_1, \dots, X_n$; ordered sample $X_{(1)} \le \dots \le X_{(n)}$. The distribution of $X_i$ possesses a distribution function $F$ defined by
$$F(x) = P(X_i \le x)$$
Let $H_n(x)$ denote the number of observations $X_i$ satisfying $X_i \le x$. The empirical distribution function is then defined by
$$F_n(x) = H_n(x)/n = \text{proportion of observations } X_i \text{ with } X_i \le x$$
Properties:
• $0 \le F_n(x) \le 1$
• $F_n(x) = 0$ if $x < X_{(1)}$
• $F_n(x) = 1$ if $x \ge X_{(n)}$
• $F_n$ is a monotonically increasing step function
Example:
x1: 5.20, x2: 4.80, x3: 5.40, x4: 4.60, x5: 6.10, x6: 5.40, x7: 5.80, x8: 5.50
Empirical distribution function: [figure: step plot of $F_n(x)$ for these data, rising from 0 to 1 over the range 4.0 to 6.5]
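For completeness, a short sketch (ours, in Python/NumPy) of how $F_n$ can be evaluated for the data of this example:

```python
import numpy as np

def ecdf(sample):
    """Return a function x -> F_n(x) for the given sample."""
    xs = np.sort(np.asarray(sample))
    return lambda x: np.searchsorted(xs, x, side="right") / len(xs)

Fn = ecdf([5.20, 4.80, 5.40, 4.60, 6.10, 5.40, 5.80, 5.50])
print(Fn(5.40))   # proportion of observations <= 5.40, here 5/8 = 0.625
print(Fn(4.00))   # 0.0, below the smallest observation
```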
Theoretical properties of $F_n$

Theorem: For every $x \in \mathbb{R}$ we obtain $nF_n(x) \sim B(n, F(x))$, i.e. $nF_n(x)$ follows a binomial distribution with parameters $n$ and $F(x)$. The probability distribution of $F_n(x)$ is thus given by
$$P\Bigl(F_n(x) = \frac{m}{n}\Bigr) = \binom{n}{m} F(x)^m (1 - F(x))^{n-m}, \quad m = 0, 1, \dots, n$$
Consequences:
• $E(F_n(x)) = F(x)$, i.e. $F_n(x)$ is an unbiased estimator of $F(x)$
• $Var(F_n(x)) = \frac{1}{n} F(x)(1 - F(x))$ $\Rightarrow$ the standard error of $F_n(x)$ decreases as $n$ increases; $F_n(x)$ is a consistent estimator of $F(x)$.

Theorem of Glivenko-Cantelli:
$$P\Bigl(\lim_{n\to\infty} \sup_{x\in\mathbb{R}} |F_n(x) - F(x)| = 0\Bigr) = 1$$
1.6.2 Consistency of estimators
Any reasonable estimator $\hat\theta$ of a parameter $\theta$ must be consistent. Intuitively this means that the distribution of $\hat\theta \equiv \hat\theta_n$ must become more and more concentrated around the true value $\theta$ as $n \to \infty$. The mathematical formalization of consistency relies on general concepts quantifying convergence of random variables.

Convergence in probability: Let $X_1, X_2, \dots$ and $X$ be random variables defined on a probability space $(\Omega, \mathcal{A}, P)$. $X_n$ converges in probability to $X$ if
$$\lim_{n\to\infty} P[|X_n - X| < \epsilon] = 1$$
for every $\epsilon > 0$. One often uses the notation $X_n \to_P X$.

Weak consistency: An estimator $\hat\theta$ is called "weakly consistent" if $\hat\theta_n \to_P \theta$.

Convergence in mean square: Let $X_1, X_2, \dots$ and $X$ be random variables defined on a probability space $(\Omega, \mathcal{A}, P)$. $X_n$ converges in mean square to $X$ if
$$\lim_{n\to\infty} E\bigl(|X_n - X|^2\bigr) = 0$$
Notation: $X_n \to_{MSE} X$.

Mean square consistency: $\hat\theta$ is "mean square consistent" if $\hat\theta_n \to_{MSE} \theta$.
Strong convergence (convergence with probability 1): Let $X_1, X_2, \dots$ and $X$ be random variables defined on a probability space $(\Omega, \mathcal{A}, P)$. $X_n$ converges with probability 1 (or "almost surely") to $X$ if
$$P\Bigl[\lim_{n\to\infty} X_n = X\Bigr] = 1$$
Notation: $X_n \to_{a.s.} X$.

Strong consistency (consistency with probability 1): An estimator $\hat\theta$ is "strongly consistent" if $\hat\theta_n \to_{a.s.} \theta$.

• $X_n \to_{MSE} X$ implies $X_n \to_P X$
• $X_n \to_{a.s.} X$ implies $X_n \to_P X$

Application: Law of large numbers. We obtain $E(\bar X) = \mu$ as well as $Var(\bar X) = \frac{\sigma^2}{n}$
$$\Rightarrow \; MSE(\bar X) := E((\bar X - \mu)^2) = Var(\bar X) = \frac{\sigma^2}{n} \to_{n\to\infty} 0 \quad \Rightarrow \quad \bar X \to_P \mu \text{ as } n \to \infty$$

Example: Consider a normally distributed random variable $X \sim N(\mu, 0.18^2)$ with unknown mean but known standard deviation $\sigma = 0.18$. Random sample $X_1, \dots, X_n$ $\Rightarrow$ estimator $\bar X$ of $\mu$. Recall: $\bar X \sim N(\mu, \frac{\sigma^2}{n}) = N(\mu, \frac{0.18^2}{n})$.
$n = 9$: standard error $= 0.06$, $MSE(\bar X) = 0.0036$
$n = 144$: standard error $= 0.015$, $MSE(\bar X) = 0.000225$
$n = 9$: $P[\mu - 0.1176 \le \bar X \le \mu + 0.1176] = 0.95$
$n = 144$: $P[\mu - 0.0294 \le \bar X \le \mu + 0.0294] = 0.95$
[Figure: densities of $\bar X$ for $n = 9$ and $n = 144$, centered at $\mu$; each tail outside the indicated interval has probability 0.025.]
1.6.3 Convergence in distribution

Let $Z_1, Z_2, \dots$ be a sequence of random variables with distribution functions $F_1, F_2, \dots$, and let $Z$ be a random variable with distribution function $F$. $Z_n$ converges in distribution to $Z$ if
$$\lim_{n\to\infty} F_n(t) = F(t)$$
at every continuity point $t$ of $F$. Notation: $Z_n \to_L Z$.

The central limit theorem

Theorem (Ljapunov): Let $X_1, X_2, \dots$ be a sequence of independent random variables with means $E(X_i) = \mu_i$ and variances $Var(X_i) = E((X_i - \mu_i)^2) = \sigma_i^2 > 0$. Furthermore assume that $E(|X_i - \mu_i|^3) = \beta_i < \infty$. If
$$\frac{(\sum_{i=1}^n \beta_i)^{1/3}}{(\sum_{i=1}^n \sigma_i^2)^{1/2}} \to 0 \quad \text{as } n \to \infty,$$
then
$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{(\sum_{i=1}^n \sigma_i^2)^{1/2}} \to_L N(0, 1)$$
Sometimes the notation $Z_n \sim AN(0,1)$ is used instead of $Z_n \to_L N(0,1)$. Important information about the speed of convergence to a normal distribution is given by the Berry-Esseen theorem:
Theorem (Berry-Esseen): Let $X_1, X_2, \dots$ be a sequence of i.i.d. random variables with mean $E(X_i) = \mu$ and variance $Var(X_i) = E((X_i - \mu)^2) = \sigma^2 > 0$. Then, if $G_n$ denotes the distribution function of $\frac{\sqrt{n}(\bar X - \mu)}{\sigma}$,
$$\sup_t |G_n(t) - \Phi(t)| \le \frac{33}{4} \cdot \frac{E(|X_i - \mu|^3)}{\sigma^3 n^{1/2}}$$
1.6.4 Stochastic order symbols (rates of convergence)

In mathematical notation the symbols $O(\cdot)$ and $o(\cdot)$ are often used in order to quantify the speed (rate) of convergence of a sequence of numbers. Let $\alpha_1, \alpha_2, \alpha_3, \dots$ and $r_1, r_2, r_3, \dots$ be (deterministic) sequences of numbers.
• The notation $\alpha_n = O(1)$ indicates that the sequence $\alpha_1, \alpha_2, \dots$ is bounded. More precisely, there exists an $M < \infty$ such that $|\alpha_n| \le M$ for all $n \in \mathbb{N}$.
• $\alpha_n = o(1)$ means that $\alpha_n \to 0$.
• $\alpha_n = O(r_n)$ means that $|\alpha_n|/|r_n| = O(1)$.
• $\alpha_n = o(r_n)$ means that $|\alpha_n|/|r_n| \to 0$.
Examples: $\sum_{i=1}^n i = O(n^2)$, $\sum_{i=1}^n i = o(n^3)$

Stochastic order symbols $O_P(\cdot)$ and $o_P(\cdot)$ are used to quantify the speed (rate) of convergence of a sequence of random variables. Let $Z_1, Z_2, Z_3, \dots$ be a sequence of random variables, and let $V_1, V_2, \dots$ be either a deterministic sequence of numbers or a sequence of random variables.
• We will write $Z_n = O_P(1)$ if for every $\epsilon > 0$ there exist an $M_\epsilon < \infty$ and an $n_\epsilon \in \mathbb{N}$ such that
$$P(|Z_n| > M_\epsilon) \le \epsilon \quad \text{for all } n \ge n_\epsilon.$$
In other words, $Z_n = O_P(1)$ indicates that the random variables $Z_n$ are stochastically bounded.
• We will write $Z_n = o_P(1)$ if and only if $Z_n \to_P 0$.
• $Z_n = O_P(V_n)$ means that $|Z_n|/|V_n| = O_P(1)$.
• $Z_n = o_P(V_n)$ means that $|Z_n|/|V_n| \to_P 0$.
Example: $\bar X - \mu = O_P(n^{-1/2})$
1.6.5 Important inequalities
Chebyshev inequality:
$$P[|X - \mu| > k\sigma] \le \frac{1}{k^2} \quad \text{for all } k > 0 \qquad \Rightarrow \qquad P[\mu - k\sigma \le X \le \mu + k\sigma] \ge 1 - \frac{1}{k^2}$$
$k = 2$: $P[\mu - 2\sigma \le X \le \mu + 2\sigma] \ge 1 - \frac{1}{4} = 0.75$
$k = 3$: $P[\mu - 3\sigma \le X \le \mu + 3\sigma] \ge 1 - \frac{1}{9} \approx 0.89$
$k = 4$: $P[\mu - 4\sigma \le X \le \mu + 4\sigma] \ge 1 - \frac{1}{16} = 0.9375$

Generalization:
$$P[|X - \mu| > k] \le \frac{E(|X - \mu|^r)}{k^r} \quad \text{for all } k > 0, \; r = 1, 2, \dots$$
Cauchy-Schwarz inequality:
• Let $x_1, \dots, x_n$ and $y_1, \dots, y_n$ be arbitrary real numbers. Then
$$\Bigl(\sum_{i=1}^n x_i y_i\Bigr)^2 \le \Bigl(\sum_{i=1}^n x_i^2\Bigr)\Bigl(\sum_{i=1}^n y_i^2\Bigr)$$
• Integrated version:
$$\Bigl(\int_a^b f(x) g(x)\,dx\Bigr)^2 \le \Bigl(\int_a^b f(x)^2\,dx\Bigr)\Bigl(\int_a^b g(x)^2\,dx\Bigr)$$
• Application to random variables:
$$(E(XY))^2 \le E(X^2) \cdot E(Y^2)$$

Hölder inequality: Let $p > 1$ and $\frac{1}{p} + \frac{1}{q} = 1$.
• Let $x_i, y_i \ge 0$, $i = 1, \dots, n$, be arbitrary numbers. Then
$$\sum_{i=1}^n x_i y_i \le \Bigl(\sum_{i=1}^n x_i^p\Bigr)^{1/p} \Bigl(\sum_{i=1}^n y_i^q\Bigr)^{1/q}$$
• Integrated version ($f(x) \ge 0$, $g(x) \ge 0$):
$$\int_a^b f(x) g(x)\,dx \le \Bigl(\int_a^b f(x)^p\,dx\Bigr)^{1/p} \Bigl(\int_a^b g(x)^q\,dx\Bigr)^{1/q}$$
• Application to random variables:
$$E(|X| \cdot |Y|) \le (E(|X|^p))^{1/p} \cdot (E(|Y|^q))^{1/q}$$
2 Bootstrap and Regression Models
Problem: Analyze the influence of some explanatory ("independent") variables $X_1, X_2, \dots, X_p$ on a response variable (or "dependent" variable) $Y$.
• Observations $(Y_1, X_{11}, \dots, X_{1p}), (Y_2, X_{21}, \dots, X_{2p}), \dots, (Y_n, X_{n1}, \dots, X_{np})$
• Model:
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i$$
$$\epsilon_1, \dots, \epsilon_n \text{ i.i.d.}, \quad E(\epsilon_i) = 0, \quad Var(\epsilon_i) = \sigma^2, \quad [\epsilon_i \sim N(0, \sigma^2)]$$
• The linear structure of the regression function as postulated by the model,
$$\beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} = m(X_{i1}, \dots, X_{ip}) = E(Y \mid X_1 = X_{i1}, \dots, X_p = X_{ip}),$$
is necessarily fulfilled if $(Y_i, X_{i1}, X_{i2}, \dots, X_{ip})^T$ is a multivariate normal random vector.
Remark: Regression analysis is usually a conditional analysis. The goal is to estimate the regression function $m$, which is the conditional expectation of $Y$ given $X_1, \dots, X_p$. Standard inference studies the behavior of estimators conditional on the observed values. However, different types of bootstrap may be used depending on how the data is generated.
1) Random design: $(Y_1, X_{11}, \dots, X_{1p}), (Y_2, X_{21}, \dots, X_{2p}), \dots, (Y_n, X_{n1}, \dots, X_{np})$ is a sample of i.i.d. random vectors, i.e. observations are independent and identically distributed. Example: $p+1$ measurements from $n$ individuals randomly drawn from an underlying population.
2) $(X_{j1}, \dots, X_{jp})$, $j = 1, \dots, n$, are random vectors which are, however, not independent or not identically distributed (e.g. time series data: the X-variables are observed in successive time periods).
3) Fixed design: data are collected at pre-specified, non-random values $X_{jk}$ (corresponding for example to different experimental conditions).
The model can be rewritten in matrix notation:
$$Y = X \cdot \beta + \epsilon, \qquad E(\epsilon) = 0, \quad Cov(\epsilon) = \sigma^2 \cdot I_n, \quad [\epsilon \sim N_n(0, \sigma^2 \cdot I_n)]$$
with
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
(the leading column of ones in $X$ corresponds to the intercept $\beta_0$).

The parameter vector $\beta = (\beta_0, \dots, \beta_p)^T$ is usually estimated by least squares:
• Least squares method: determine $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$ by minimizing
$$Q(\beta_0, \dots, \beta_p) = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_{i1} - \dots - \beta_p X_{ip})^2$$
• Least squares estimator: $\hat\beta = [X^T X]^{-1} X^T Y$
Let $E_\epsilon$ and $Cov_\epsilon$ denote conditional expectation and covariance given the observed X-values.

Properties of $\hat\beta$:
1. $\hat\beta$ is an unbiased estimator of $\beta$:
$$E_\epsilon(\hat\beta) = \begin{pmatrix} E_\epsilon(\hat\beta_0) \\ \vdots \\ E_\epsilon(\hat\beta_p) \end{pmatrix} = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} = \beta$$
2. Covariance matrix:
$$Cov_\epsilon(\hat\beta) = Cov_\epsilon([X^T X]^{-1} X^T Y) = [X^T X]^{-1} X^T Cov(Y) X [X^T X]^{-1} = \sigma^2 [X^T X]^{-1} X^T X [X^T X]^{-1} = \sigma^2 [X^T X]^{-1}$$
3. Distribution under normality: if $\epsilon_i \sim N(0, \sigma^2)$ then $\epsilon \sim N_n(0, \sigma^2 I_n)$, and consequently
$$\hat\beta \sim N_{p+1}\bigl(\beta, \sigma^2 [X^T X]^{-1}\bigr)$$
4. Asymptotic distribution: assume that $\frac{1}{n}\sum_i X_{ij} X_{ik} \to c_{jk}$ as well as $\frac{1}{n}\sum_i X_{ij} \to c_{0j}$ as $n \to \infty$. Note that $c_{jk} = E(X_j X_k)$ and $c_{0j} = E(X_j)$ in the case of random design. Furthermore, let $C$ denote the $(p+1) \times (p+1)$ matrix with elements $c_{jk}$, $j, k = 0, \dots, p$, $c_{00} = 1$, $c_{j0} = c_{0j}$, and assume that $C$ is of full rank. Then
$$\sqrt{n}(\hat\beta - \beta) \sim N_{p+1}\bigl(0, \sigma^2 C^{-1}\bigr)$$
Estimation of $\sigma^2$:
• The residuals $\hat\epsilon_i = Y_i - \hat Y_i = Y_i - \hat\beta_0 - \sum_{j=1}^p \hat\beta_j X_{ij}$ "estimate" the error terms $\epsilon_i$
• Estimator $\hat\sigma^2$ of $\sigma^2$:
$$\hat\sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (Y_i - \hat Y_i)^2$$
• $\hat\sigma^2$ is an unbiased estimator of $\sigma^2$
• If the true error terms $\epsilon_i$ are normally distributed, then $(n - p - 1)\frac{\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-p-1}$

Let $\gamma_{ij}$, $i, j = 1, \dots, p+1$, denote the elements of the matrix $\Gamma = [X^T X]^{-1}$. Then, for normal errors,
$$\frac{\hat\beta_j - \beta_j}{\hat\sigma \sqrt{\gamma_{jj}}} \sim t_{n-p-1}$$
$\Rightarrow$ standard confidence intervals and tests for the parameter estimates.

Note: Under the normality assumption, $\frac{\hat\beta_j - \beta_j}{\hat\sigma \sqrt{\gamma_{jj}}}$ is a pivot statistic. In the general case (under some weak regularity conditions), this quantity is an asymptotic pivot statistic: $n\gamma_{jj}$ converges to the $j$-th diagonal element of the matrix $C^{-1}$, and therefore
$$\frac{\hat\beta_j - \beta_j}{\hat\sigma \sqrt{\gamma_{jj}}} \to_L N(0, 1) \quad \text{as } n \to \infty$$
2.1 Bootstrapping Pairs
The usual nonparametric bootstrap is applicable if the data is generated by a random design. Let $X_i = (X_{i1}, \dots, X_{ip})$. The construction of bootstrap confidence intervals then proceeds as follows:

Basic bootstrap confidence interval:
• Original data: i.i.d. sample $(Y_1, X_1), \dots, (Y_n, X_n)$
• Random samples $(Y_1^*, X_1^*), \dots, (Y_n^*, X_n^*)$ are generated by drawing observations independently and with replacement from the available sample $\mathcal{Y}_n := \{(Y_1, X_1), \dots, (Y_n, X_n)\}$.
• $(Y_1^*, X_1^*), \dots, (Y_n^*, X_n^*)$ $\Rightarrow$ least squares estimators $\hat\beta_j^*$, $j = 0, \dots, p$.
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat t_{\frac{\alpha}{2},j}$ and $\hat t_{1-\frac{\alpha}{2},j}$ of the conditional distribution of $\hat\beta_j^*$ given $\mathcal{Y}_n := \{(Y_1, X_1), \dots, (Y_n, X_n)\}$:
$$P^*(\hat\beta_j^* \le \hat t_{\frac{\alpha}{2},j}) \approx \frac{\alpha}{2}, \qquad P^*(\hat\beta_j^* > \hat t_{\frac{\alpha}{2},j}) \approx 1 - \frac{\alpha}{2},$$
$$P^*(\hat\beta_j^* \le \hat t_{1-\frac{\alpha}{2},j}) \approx 1 - \frac{\alpha}{2}, \qquad P^*(\hat\beta_j^* > \hat t_{1-\frac{\alpha}{2},j}) \approx \frac{\alpha}{2}.$$
Here, $P^*$ denotes probabilities with respect to the conditional distribution of $\hat\beta_j^*$ given $\mathcal{Y}_n$.
• $\Rightarrow$ Approximate $1-\alpha$ (symmetric) confidence interval: $[2\hat\beta_j - \hat t_{1-\frac{\alpha}{2},j}, \; 2\hat\beta_j - \hat t_{\frac{\alpha}{2},j}]$
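A sketch of bootstrapping pairs for the case $p = 1$ (our own illustration in Python/NumPy; note that the basic interval remains valid under the heteroscedastic errors simulated here, as discussed in the remark below):

```python
import numpy as np

def pairs_bootstrap(y, x, m=2000, seed=0):
    """Bootstrapping pairs for simple linear regression; returns m x 2 replicates."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta_star = np.empty((m, 2))
    for b in range(m):
        idx = rng.integers(0, n, size=n)              # draw pairs with replacement
        beta_star[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return beta_star

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=80)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)          # heteroscedastic errors
beta_hat = np.linalg.lstsq(np.column_stack([np.ones(80), x]), y, rcond=None)[0]
bs = pairs_bootstrap(y, x)
t_lo, t_hi = np.quantile(bs[:, 1], [0.025, 0.975])
print(2 * beta_hat[1] - t_hi, 2 * beta_hat[1] - t_lo)  # basic interval for beta_1
```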
Remark: Under some weak regularity conditions the bootstrap is consistent whenever $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \epsilon_i$ for independent errors $\epsilon_i$ with $E(\epsilon_i) = 0$ and $var(\epsilon_i) = \sigma^2(X_i) < \infty$. In other words, the basic bootstrap confidence interval provides an asymptotically (first-order) accurate confidence interval even if the errors are heteroscedastic (unequal variances)! This is not true for the standard t-intervals.

Modification: Bootstrap-t intervals:
• Random samples $(Y_1^*, X_1^*), \dots, (Y_n^*, X_n^*)$ are generated by drawing observations independently and with replacement from the available sample $\mathcal{Y}_n := \{(Y_1, X_1), \dots, (Y_n, X_n)\}$.
• Use $(Y_1^*, X_1^*), \dots, (Y_n^*, X_n^*)$ to determine least squares estimators $\hat\beta_j^*$, $j = 0, \dots, p$, as well as estimators $(\hat\sigma^2)^*$ of the error variance $\sigma^2$.
• With $\gamma_{jj}^*$ denoting the $j$-th diagonal element of the matrix $\Gamma^* = [(X^*)^T X^*]^{-1}$, compute
$$\frac{\hat\beta_j^* - \hat\beta_j}{\hat\sigma^* \sqrt{\gamma_{jj}^*}}$$
• Determine the $\frac{\alpha}{2}$ and $1-\frac{\alpha}{2}$ quantiles $\hat\tau_{\frac{\alpha}{2},j}$ and $\hat\tau_{1-\frac{\alpha}{2},j}$ of the conditional distribution of $\frac{\hat\beta_j^* - \hat\beta_j}{\hat\sigma^* \sqrt{\gamma_{jj}^*}}$.
• This yields the $1-\alpha$ confidence interval
$$[\hat\beta_j - \hat\tau_{1-\frac{\alpha}{2},j}\, \hat\sigma \sqrt{\gamma_{jj}}, \; \hat\beta_j - \hat\tau_{\frac{\alpha}{2},j}\, \hat\sigma \sqrt{\gamma_{jj}}]$$
Different from the basic bootstrap interval, this bootstrap-t interval will be incorrect for heteroscedastic errors.
In order to understand bootstrap behavior for random design let us analyze the simplest case with p = 1. Then Yi = β0 +β1 Xi +ϵi . Consider the estimator ∑ ∑ 1 ¯ ¯ (X − X)Y i i i (Xi − X)ϵi i n ˆ β1 = ∑ = β1 + 1 ∑ 2 ¯ ¯ 2 (X − X) i i i (Xi − X) n of the slope β1 . Random design implies that (Yi , Xi ), and hence (ϵi , Xi ), i = 1, . . . , n are independent and identically distributed. Under some regularity conditions (existence of moments) we have 1∑ 2 ¯ 2 →p E(Xi − µx )2 = σX (Xi − X) , n i and the central limit theorem implies that 1 ∑ 2 ¯ i →L N (0, vϵ,X √ ), (Xi − X)ϵ n i where
( ) 2 vϵ,X = E (Xi − µx )2 ϵ2i .
If ϵi and Xi are independent and σ 2 = var(ϵi ) does not depend 2 2 on Xi , then vϵ,X = σX σ . We then generally obtain for large n ( 1 ∑ ) ¯ √ (X − X)ϵ i i √ i n ∑ distr( n(βˆ1 − β1 )) ≈ distr 1 ¯ 2 i (Xi − X) n ( 1 ∑ ) ¯ 2 √ (X − X)ϵ vϵ,X i i i n ≈ distr ≈ N (0, 4 ) 2 σX σx
Inference@LS-Kneip
2–8
Now consider the bootstrap estimator βˆ1∗ , ∑ ∑ 1 ∗ ∗ ∗ ¯ ∗ )ˆ ¯ ∗ )Y ∗ (X − X ϵ (X − X i i i i i ∗ i n ˆ1 + ∑ βˆ1 = ∑ = β 1 ∗−X ∗−X ¯ ∗ )2 ¯ ∗ )2 , (X (X i i i i n where ϵˆ∗i = Yi∗ − βˆ0 − βˆ1 Xi∗ . Recall that by definition, (Yi∗ , Xi∗ ), and hence (ˆ ϵ∗i , Xi∗ ), i = 1, . . . , n are independent and identically distributed observations (condi∑ ¯ ∗ )2 |Yn ) = 1 ∑ (Xi − tional on Yn ). We obtain E( n1 i (Xi∗ − X i n 2 2 ¯ =: σ X) ˆX , and 1∑ ∗ 2 ¯ ∗ )2 − σ | (Xi − X ˆX | →P 0 n i as n → ∞. Moreover, E ( var
(
√1 n
∑
∗ i (Xi
1 ∑ ∗ ¯ ∗ )ˆ √ (Xi − X ϵ∗i |Yn n i
)
¯ ∗ )ˆ −X ϵ∗i |Yn = 0 and
)
1∑ ¯ 2 ϵˆ2i = (Xi − X) n i
By the central limit theorem we obtain that for large n ∑ 1 (√ ) ¯ 2 ˆ2 i i (Xi − X) ϵ ∗ n ˆ ˆ distr n(β1 − β1 )|Yn ≈ N (0, ). 4 σ ˆX ∑ ¯ 2 ϵˆ2 →P v 2 , σ Since n1 i (Xi − X) i ϵ,X ˆx →P σx , we can conclude that asymptotically (√ ) √ ∗ distr( n(βˆ1 − β1 )) ≈ distr n(βˆ1 − βˆ1 )|Yn ⇒ Bootstrap consistent
Inference@LS-Kneip
2–9
2.2
Bootstrapping Residuals
Bootstrapping residuals is applicable independent of the particular design of the regression model. The only crucial assumption is that the error terms ϵi are i.i.d. with constant variance σ 2 . Residuals: ϵˆi = Yi − Yˆi = Yi − βˆo −
p ∑
βˆj Xij
j=1
Matrix notation: ϵˆ 1 . ϵˆ = .. = (I − X[XT X]−1 XT )Y = (I − X[XT X]−1 XT )ϵ | {z } H ϵˆn ⇒
Cov(ϵ) = σ 2 (I − H)
With hii > 0 denoting the i-th diagonal element of H we thus obtain var(ˆ ϵi ) = σ 2 (1 − hii ) < σ 2 Standardized residuals: r˜i =
ϵˆi 1 − hii
⇒ var(ri ) = σ 2
∑
We have i ϵˆi = 0. For the standardized residuals it is, however, ∑ 1 not guaranteed that r¯ = n i r˜i is equal to zero. The residual bootstrap thus relies on resampling centered standardized residuals ri := r˜i − r¯.
Inference@LS-Kneip
2–10
Note: Residual plots play an important role in validating regression models. a.) Nonlinear model:
−2
0
residuals
2
4
Mangelnde Modellanpassung
0
50
100
150
fitted y
b.) Heteroscedasticity
−50 −200
−150
−100
Residuals
0
50
100
Heteroskedadastizität
0
50
100
150
fitted y_i
Inference@LS-Kneip
2–11
Bootstrapping Residuals • Original data: i.i.d. sample (Y1 , X1 ), . . . , (Yn , Xn ) ⇒ Estimator βˆ • Calculate (centered) standardized residuals r˜i =
ϵˆi , ri = r˜i − r¯, 1 − hii
i = 1, . . . , n
• Generate random samples ϵˆ∗1 , . . . , ϵˆ∗n of residuals by drawing observations independently and with replacement from {r1 , . . . , rn }. • Calculate Yi∗ = βˆ0 +
p ∑
βˆj Xij + ϵˆ∗i ,
i = 1, . . . , n
j=1
• Bootstrap estimators βˆ∗ are determined by least squares estimation from the data (Y1∗ , X1 ), . . . , (Yn∗ , Xn ). Basic bootstrap confidence intervals: • Determine α2 and 1 − α2 quantiles tˆα2 ,j and tˆ1− α2 ,j of the conditional distribution of βˆj∗ . α P ∗ (βˆj∗ ≤ tˆα2 ,j ) ≈ , 2
α P ∗ (βˆj∗ > tˆα2 ,j ≈ 1 − , 2
α P ∗ (βˆj∗ ≤ tˆ1− α2 ,j ) ≈ 1 − , 2
α P ∗ (βˆj∗ > tˆ1− α2 ,j ) ≈ , 2
Here, P ∗ denotes probabilities with respect to conditional distribution of βˆj∗ given Yn . • ⇒ Approximate 1 − α (symmetric) confidence interval: [2βˆj − tˆ1− α2 ,j , 2βˆj − tˆα2 ,j ] Bootstrap-t intervals can be determined similarly. Inference@LS-Kneip
2–12
In order to understand the residual bootstrap let us again analyze the simplest case with p = 1, and recall that ∑ ∑ 1 ¯ ¯ (X − X)Y i i i (Xi − X)ϵi i n ˆ β1 = ∑ = β1 + 1 ∑ 2 ¯ ¯ 2 (X − X) i i i (Xi − X) n ∑ 1 2 ¯ 2 . If the errors ϵi are i.i.d. zero mean Let σ ˆX := n i (Xi − X) random variables with var(ϵi ) = σ 2 , then (under some regularity conditions) the central limit theorem implies that conditional on the observed values X1 , . . . , Xn ) ( 1 ∑ ¯ √ (X − X)ϵ i i √ σ2 i n ˆ ∑ distr( n(β1 − β1 )) = distr ≈ N (0, 2 ) 1 2 ¯ σ ˆX i (Xi − X) n holds for large n. By definition, ∑ ¯ ∗ ∗ i (Xi − X)Yi ˆ β1 = ∑ = βˆ1 + 2 ¯ i (Xi − X)
1 n 1 n
∑ ¯ ϵ∗ i i (Xi − X)ˆ ∑ . 2 ¯ (X − X) i i
We have E(ˆ ϵ∗i |Yn )
= 0,
var(ˆ ϵ∗i |Yn )
1∑ 2 ri =: σ ˆ2, = n i
and therefore ( ) ∑ 1 1∑ ∗ ¯ 2σ ¯ var √ (Xi − X) ˆ2 (Xi − X)ˆ ϵi Yn = n i n i The central limit theorem then leads to (√ ) 2 σ ˆ ∗ distr n(βˆ1 − βˆ1 )|Yn ≈ N (0, 2 ). σ ˆX ⇒ Bootstrap consistent, since σ ˆ 2 →P σ 2 as n → ∞. Inference@LS-Kneip
2–13
2.3
Wild Bootstrap
The residual bootstrap is not consistent if the errors ϵi are heteroscedastic, i.e. var(ϵi ) = σi2 . In this case the wild bootstrap offers an alternative. There are several versions of the wild bootstrap. In its simplest form this procedure works as follows: Conditional on Yn , a bootstrap sample ϵˆ∗1 , . . . , ϵˆ∗n of residuals is determined by generating n independent random variables from the following binary distributions: ( √ ) 1− 5 = π, P ϵˆ∗i = ϵˆi 2 ( √ ) 1− 5 P ϵˆ∗i = ϵˆi = 1 − π, 2 i = 1, . . . , n, where π =
√ 5+ 5 10 .
The constants are chosen in such a way that ϵ∗i ) = 0 • E(ˆ ϵ∗i |Yn ) = E ∗ (ˆ ϵ∗i ) = ϵˆ2i • var(ˆ ϵ∗i |Yn ) = var∗ (ˆ ϵ∗i )3 ) = ϵˆ3i • E((ˆ ϵ∗i )3 |Yn ) = E ∗ ((ˆ
Inference@LS-Kneip
2–14
Implementation of the wild bootstrap: • Original data: i.i.d. sample (Y1 , X1 ), . . . , (Yn , Xn ) ⇒ Estimator βˆ • Calculate (centered) standardized residuals r˜i =
ϵˆi , ri = r˜i − r¯, 1 − hii
i = 1, . . . , n
• Generate n independent random variables ϵˆ∗i from binary distributions, ( √ ) 1− 5 P ϵˆ∗i = ϵˆi = π, 2 ( √ ) 1− 5 = 1 − π, P ϵˆ∗i = ϵˆi 2 i = 1, . . . , n, where π =
√ 5+ 5 10 .
• Calculate Yi∗ = βˆ0 +
p ∑
βˆj Xij + ϵˆ∗i ,
i = 1, . . . , n
j=1
• Bootstrap estimators βˆ∗ are determined by least squares estimation from the data (Y1∗ , X1 ), . . . , (Yn∗ , Xn ). Basic bootstrap confidence intervals: • Determine α2 and 1 − α2 quantiles tˆα2 ,j and tˆ1− α2 ,j of the conditional distribution of βˆj∗ . • ⇒ Approximate 1 − α (symmetric) confidence interval: [2βˆj − tˆ1− α2 ,j , 2βˆj − tˆα2 ,j ] Bootstrap-t intervals can be determined similarly. Inference@LS-Kneip
2–15
In order to understand the basic intuition let us again analyze the simplest case with p = 1, and recall that ∑ ¯ i (Xi − X)Yi ˆ = β1 + β1 = ∑ 2 ¯ i (Xi − X)
1 n 1 n
∑ ¯ i (Xi − X)ϵi ∑ ¯ 2 i (Xi − X)
It is now assumed that the errors ϵi are independent with var(ϵi ) = ∑ 2 ¯ 2 and vˆ2 = 1 ∑ (Xi − X) ¯ 2 σ 2 . Unσi2 . Let σ ˆX := n1 i (Xi − X) i ϵ,X i n der some regularity conditions the central limit theorem implies that conditional on the observed values X1 , . . . , Xn ( 1 ∑ ) ¯ 2 √ (X − X)ϵ vˆϵ,X i i √ i n ∑ distr( n(βˆ1 − β1 )) = distr ≈ N (0, 4 ) 1 2 ¯ σ ˆX i (Xi − X) n holds for large n. As above, ∑ ¯ ∗ ∗ i (Xi − X)Yi ˆ β1 = ∑ = βˆ1 + 2 ¯ i (Xi − X)
1 n 1 n
∑ ¯ ϵ∗ i i (Xi − X)ˆ ∑ , 2 ¯ (X − X) i i
and by construction ( ) ∑ 1∑ 1 ∗ 2 ¯ ¯ 2 ϵˆ2i =: w var √ (Xi − X)ˆ ϵi Yn = (Xi − X) ˆϵ,X . n i n i For large n, the central limit theorem then leads to distr
(√
n(βˆ1∗
− βˆ1 )|Yn
)
2 w ˆϵ,X ≈ N (0, 4 ). σ ˆX
We have Eϵ (ˆ ϵ2i ) = σi2 + O( n1 ), and thus for large n 2 ) Eϵ (w ˆϵ,X
1∑ 2 ¯ 2 Eϵ (ˆ = (Xi − X) ϵ2i ) ≈ vˆϵ,X n i
Under some regularity conditions the law of large numbers then Inference@LS-Kneip
2–16
2 2 implies that |w ˆϵ,X − vˆϵ,X | → 0 as n → ∞. ⇒ Wild bootstrap consistent.
2.4
Generalizations
The above types of bootstrap (bootstrapping pairs, bootstrapping residuals, wild bootstrap) can also be useful in more complex regression setups. An appropriate method then has to be selected in dependence of existing knowledge about underlying design and structure of residuals. 1) Nonlinear regression: Yi = g(Xi , β) + ϵi , where g is a nonlinear function of β. Example: Depreciation of a car (CV Citroen ) X
-
Age of the car (in years)
Y
-
depreciation =
selling price original price (new car)
0.6 0.4 0.0
0.2
Y = relativer Wertverlust
0.8
1.0
Wertverlust eines Autos
0
2
4
6
8
10
X= Alter in Jahren
Inference@LS-Kneip
2–17
Yi = e−βXi + ϵi An estimator βˆ is determined by (nonlinear) least squares; ˆ residual: ϵˆi = Yi − e−βXi Model:
Bootstrap: Random design ⇒ bootstrapping pairs; bootstrapping residuals for homoscedastic errors; wild bootstrap for heteroscedastic errors. 2) Median Regression: Linear model: Yi = β0 +
∑ j
βj Xij + ϵi
In some applications the errors possess heavy tails (→ outliers!). In such situations estimation of β by least squares may not be appropriate, and statisticians tend to use more robust method. A sensible procedure then is to determine estimates βˆ by minimizing n ∑
|Yi − β0 −
i=1
∑
βj Xij |
j
over all possible β. Solutions can be determined by numerical optimization algorithms. Inference is the usually based on the bootstrap. Random design ⇒ bootstrapping pairs; bootstrapping residuals for homoscedastic errors; wild bootstrap for heteroscedastic errors. 3) Nonparametric regression: Model: Yi = m(Xi ) + ϵi for some unknown function m. The function m can be estimated by nonparametric smoothing procedures (kernel estimation; local linear estimation; spline estimation). Inference is often based on the bootstrap. Inference@LS-Kneip
2–18
2.5
Time series
The general idea of the residual bootstrap can be adapted to many different situations. For example, it can also be used in the context of time series models. Example: AR(1)-process: Xt = ρXt−1 + ϵt ,
t = 1, . . . , n
for i.i.d zero mean error terms with var(ϵt ) = σ 2 . If |ρ| < 1 this defines a stationary stochastic process. Standard estimator of ρ: ∑n ¯ ¯ i=2 (Xt − X)(Xt−1 − X) ∑n ρˆ = ¯ 2 i=1 (Xt − X) Asymptotic distribution: √
n(ˆ ρ − ρ) →L N (0, 1 − ρ2 )
Bootstrapping residuals • Calculate centered residuals
1 ∑ ϵ˜t = Xt − ρˆXt−1 , ϵˆt = ϵ˜t − ϵ˜t , n−1 t
t = 2, . . . , n
• For some k > 0 generate random samples ϵˆ∗−k , ϵˆ∗−k+1 , . . . , ϵˆ∗0 , ϵˆ∗1 , . . . , ϵˆ∗n of residuals by drawing n + k + 1 observations independently and with replacement from {ˆ ϵ1 , . . . , ϵˆn }. ∗ • Generate a bootstrap time series by X−k = ϵ∗−k and
Xt = ρˆXt−1 + ϵˆ∗1 ,
t = −k + 1, . . . , n
• Determine bootstrap estimators ρˆ∗ from X1∗ , . . . , Xn∗ . Inference@LS-Kneip
2–19
Under the standard assumptions of AR(1) models this bootstrap is consistent. Basic bootstrap confidence intervals: • Determine α2 and 1 − α2 quantiles tˆα2 and tˆ1− α2 of the conditional distribution of ρˆ∗ . • ⇒ Approximate 1 − α (symmetric) confidence interval: [2ˆ ρ − tˆ1− α2 , 2ˆ ρ − tˆα2 ,j ] Bootstrap-t intervals can be determined similarly.
Inference@LS-Kneip
2–20