Bootstrap Variable-Selection and Confidence Sets

Rudolf Beran
University of California, Berkeley

This paper analyzes estimation by bootstrap variable-selection in a simple Gaussian model where the dimension of the unknown parameter may exceed that of the data. A naive use of the bootstrap in this problem produces risk estimators for candidate variable-selections that have a strong upward bias. Resampling from a less overfitted model removes the bias and leads to bootstrap variable-selections that minimize risk asymptotically. A related bootstrap technique generates confidence sets that are centered at the best bootstrap variable-selection and have two further properties: the asymptotic coverage probability for the unknown parameter is as desired; and the confidence set is geometrically smaller than a classical competitor. The results suggest a possible approach to confidence sets in other inverse problems where a regularization technique is used.

Key words and phrases. Coverage probability, geometric loss, $C_p$-estimator.
1. Introduction
Certain statistical estimation problems, such as curve estimation, signal recovery, or image reconstruction, share two distinctive features: the dimension of the parameter space exceeds that of the data; and each component of the unknown parameter may be important. In such problems, ordinary least squares or maximum likelihood estimation typically overfits the model. One general approach to estimation in such problems has three stages: First, devise a promising class of candidate estimators, such as penalized maximum likelihood estimators corresponding to a family of penalty functions or Bayes estimators generated by a family of prior distributions. This step is sometimes called using a regularization technique. Second, estimate the risk of each candidate estimator. Third, use the candidate estimator with smallest estimated risk.

Largely unresolved to date is the question of constructing accurate confidence sets based on such adaptive, regularized estimators. Even obtaining reliable estimators of risk can be difficult. This paper treats both matters
in the following problem, which is relatively simple to analyze explicitly, yet sufficiently general to indicate potential directions for other problems that involve a regularization technique.

Suppose that $X_n$ is an observation on a discretized signal that is measured with error at $n$ time points. The errors are independent, identically distributed, Gaussian random variables with means zero. Thus, $X_n$ is a random vector whose distribution is $N(\xi_n, \sigma_n^2 I_n)$. Both $\xi_n$ and $\sigma_n^2$ are unknown. The problem is to estimate the signal $\xi_n$. The integrated squared error of an estimator $\hat\xi_n$ is

(1.1)  $L_n(\hat\xi_n, \xi_n) = n^{-1}|\hat\xi_n - \xi_n|^2$,

where $|\cdot|$ is Euclidean norm. Under this loss, Stein (1956) showed that $X_n$, the maximum likelihood or least squares estimator of $\xi_n$, is inadmissible for $n \ge 3$. Better estimators for $\xi_n$ include the James-Stein (1961) estimator, locally smoothed estimators such as the kernel variety treated by Rice (1984), and variable-selection estimators, to be described in the next paragraph. Each of these improved estimators accepts some bias in return for a greater reduction in variance.

A variable-selection approach to estimating $\xi_n$ consists of three steps: first, transform $X_n$ orthogonally to $X_n' = OX_n$; second, replace selected components of $X_n'$ with zero; and third, apply the inverse rotation $O^{-1}$ to the outcome of step two. The vector generated by such a process will be called a variable-selection estimator of $\xi_n$.

How shall we choose the orthogonal matrix $O$? Ideally, the components of the rotated mean vector $O\xi_n$ would be either very large or very small relative to measurement error. The nature of the experiment that generated $X_n$ may suggest that $O$ be a finite Fourier transform, or an analysis of variance transform, or an orthogonal polynomial transform, or a wavelet transform. Important though it is, we will not deal further, in this paper, with the choice of $O$.

Having rotated $X_n$, how shall we choose which components of $X_n'$ to zero out? Thereafter, how shall we construct, around the variable-selection estimator, an accurate confidence set for $\xi_n$? A plausible answer is to compare candidate variable-selections through their bootstrap risks, and then to bootstrap the empirically best candidate estimator to obtain a confidence set for $\xi_n$. Efron and Tibshirani (1993, Chapter 17) discussed simple bootstrap estimators of mean squared prediction error. However, Freedman et al. (1988) and Breiman (1992) showed that simple bootstrap estimators of mean-squared prediction error can be untrustworthy for variable-selection.

This paper treats variable-selection for estimation rather than prediction and allows the dimension of the unknown parameter to increase with sample size $n$. The second point is very important. A stronger model assumption used by Speed and Yu (1993) and others, namely that the dimension of the parameter space is fixed for all $n$, restricts the possible bias induced by candidate variable-selections. In such restricted models, variable-selection by $C_p$ does not choose well. On the other hand, $C_p$ can be asymptotically correct when the dimension of the parameter space increases quickly with $n$ and the selection class is not too large (cf. Section 2). Rice (1984, Section 3) and Speed and Yu (1993, Section 4) discuss other instances and aspects of this phenomenon.

Section 2 of this paper proves for our estimation problem that naive bootstrapping, which resamples from a $N(X_n, \hat\sigma_n^2 I_n)$ model where $\hat\sigma_n^2$ estimates $\sigma_n^2$, yields upwardly biased risk estimators for candidate variable-selections. However, resampling from a $N(\tilde\xi_n, \hat\sigma_n^2 I_n)$ distribution, where $\tilde\xi_n$ is obtained by suitably shrinking some of the components of $X_n'$ toward zero, corrects the bias and generates a good bootstrap variable-selection estimator $\hat\xi_{n,B}$ for $\xi_n$. Using a related shrinkage bootstrap, Section 3 then constructs confidence sets centered at $\hat\xi_{n,B}$ that have correct asymptotic coverage probability for $\xi_n$ and small geometrical error. Here as well, two plausible but naive bootstrap algorithms give wrong answers.
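The three-step recipe just described is easy to express in code. The sketch below is illustrative only and is not part of the paper's formal development: it uses an orthonormal discrete cosine transform as one possible choice of $O$, and a Boolean mask to mark which rotated coordinates are retained; the function name `variable_selection_estimate` and all parameter names are ours.

```python
import numpy as np
from scipy.fft import dct, idct  # orthonormal DCT serves as one choice of O


def variable_selection_estimate(x, keep):
    """Variable-selection estimator: rotate, zero out unselected
    coordinates, rotate back.

    x    : observed signal vector X_n of length n
    keep : Boolean array of length n; True marks coordinates of O X_n
           that are retained, False marks those replaced by zero
    """
    x_rot = dct(x, norm="ortho")       # step 1: X_n' = O X_n
    x_rot[~keep] = 0.0                 # step 2: zero the unselected components
    return idct(x_rot, norm="ortho")   # step 3: apply O^{-1}
```

For instance, with `keep = np.arange(len(x)) < len(x) // 4` the estimator retains only the lowest-frequency quarter of the transformed coefficients.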
2. Bootstrap Selection Estimators
This section proposes bootstrap selection estimators for $\xi_n$ and analyzes their asymptotic losses (which equal the asymptotic risks). The choice of bootstrap algorithm proves critical to the success of bootstrap selection. Naive bootstrapping does not work.

The signal vector $X_n = (X_{n,1}, \ldots, X_{n,n})'$ has a $N(\xi_n, \sigma_n^2 I_n)$ distribution on $R^n$. For brevity, write $\theta_n = (\xi_n, \sigma_n^2)$ and let $P_{\theta_n,n}$ denote the above normal distribution. Because the estimation problem is invariant under rotation of the coordinate system, we will simplify notation by assuming, without any loss of generality, that the orthogonal matrix $O$ is the identity matrix. Then the variable selection is done directly on the components of $X_n$. Consider candidate estimators for $\xi_n$ that have the form

(2.1)  $\hat\xi_n(A) = (a_{n,1}(A)X_{n,1}, \ldots, a_{n,n}(A)X_{n,n})'$,

where $A$ ranges over subsets of $[0,1]$ and $a_{n,i}(A) = 1$ if $i/(n+1) \in A$ and vanishes otherwise. The goal is to choose $A$, on the basis of the data $X_n$, so as to minimize, at least asymptotically, the loss of the corresponding candidate estimator $\hat\xi_n(A)$.

Success of this formulation of variable selection appears to require restrictions on the possible values of $A$. In the paper, we assume that $A$ is the union of $m$ ordered closed intervals:

(2.2)  $A = \bigcup_{i=1}^{m}[t_{2i-1}, t_{2i}]$,

where $0 \le t_1 \le \ldots \le t_{2m} \le 1$ and $m$ is fixed. The pseudo-distance between two such sets $A$ and $B$ is defined to be

(2.3)  $d(A, B) = \lambda(A \triangle B)$,

where $\lambda$ is Lebesgue measure. After forming equivalence classes, the collection $S(m)$ of all subsets having the form (2.2) is a compact metric space under $d$. Let $\mathcal{A}$ be a compact subset of $S(m)$, possibly $S(m)$ itself, that contains the unit interval $[0,1]$ as an element. Consider the candidate estimators $\hat\xi_n(A)$ that are generated as $A$ ranges over $\mathcal{A}$. Since $[0,1]$ is an element of $\mathcal{A}$, the unbiased estimator $X_n$ is among these candidate estimators.

Let $A^c$ be the complement of $A$ in $[0,1]$. The quadratic loss of $\hat\xi_n(A)$ is then

(2.4)  $L_n(\hat\xi_n(A), \xi_n) = n^{-1}|\hat\xi_n(A) - \xi_n|^2 = n^{-1}\sum_{i/(n+1)\in A}(X_{n,i} - \xi_{n,i})^2 + \tau_n(A^c)$,

where $\tau_n$ is the non-negative measure defined by

(2.5)  $\tau_n(A) = n^{-1}\sum_{i/(n+1)\in A}\xi_{n,i}^2$.

Estimators of this loss or of the associated risk are naturally phrased in terms of the discrete uniform measure

(2.6)  $\lambda_n(A) = n^{-1}\sum_{i/(n+1)\in A} 1$

and the empirical measure

(2.7)  $\hat\tau_n(A) = n^{-1}\sum_{i/(n+1)\in A} X_{n,i}^2$.
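A candidate set of the form (2.2) is just a finite union of subintervals of $[0,1]$, and the measures $\lambda_n(A)$ and $\hat\tau_n(A)$ are averages over the design points $i/(n+1)$ that fall in $A$. The following minimal sketch, in our own notation (intervals are passed as a list of `(left, right)` pairs), makes the bookkeeping concrete; it is not code from the paper.

```python
import numpy as np


def in_A(n, intervals):
    """Boolean mask: which design points i/(n+1), i = 1..n, lie in
    A = union of the given closed intervals."""
    t = np.arange(1, n + 1) / (n + 1)
    mask = np.zeros(n, dtype=bool)
    for left, right in intervals:
        mask |= (t >= left) & (t <= right)
    return mask


def lambda_n(n, intervals):
    """Discrete uniform measure lambda_n(A) of (2.6)."""
    return in_A(n, intervals).mean()


def tau_hat_n(x, intervals):
    """Empirical measure tau_hat_n(A) of (2.7); x is the data vector X_n."""
    n = len(x)
    mask = in_A(n, intervals)
    return np.sum(x[mask] ** 2) / n
```

As $n$ grows, `lambda_n` approximates the Lebesgue measure $\lambda(A)$ of the candidate set.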
Consider the following two bootstrap risk estimators.

Naive bootstrap. Suppose $\hat\sigma_n^2$ is a consistent estimator of $\sigma_n^2$, such as the variance estimators to be discussed in Section 3. Let $X_n^*$ be a random vector such that the conditional distribution of $X_n^*$ given $X_n$ is $N(X_n, \hat\sigma_n^2 I_n)$. Let $\xi_n^*(A)$ denote the recalculation from $X_n^*$ of the candidate estimator $\hat\xi_n(A)$. Let $E^*$ denote expectation with respect to the conditional distribution of $X_n^*$ given $X_n$. The naive bootstrap risk estimator produced by this scheme is

(2.8)  $\hat R_{n,N}(A, \hat\sigma_n^2) = E^* L_n(\xi_n^*(A), X_n) = E^*\bigl[n^{-1}\sum_{i/(n+1)\in A}(X_{n,i}^* - X_{n,i})^2 + n^{-1}\sum_{i/(n+1)\in A^c} X_{n,i}^2\bigr] = \hat\sigma_n^2\lambda_n(A) + \hat\tau_n(A^c)$.

Unfortunately, if $\tau_n$ converges weakly to $\tau$ and $\sigma_n^2$ converges to $\sigma^2$ as $n$ increases, then $\hat R_{n,N}(A, \hat\sigma_n^2)$ converges in probability to

(2.9)  $\rho_N(A) = \sigma^2 + \tau(A^c)$.

The actual asymptotic loss or risk of $\hat\xi_n(A)$ is

(2.10)  $\rho(A) = \sigma^2\lambda(A) + \tau(A^c)$,

where $\lambda$ is Lebesgue measure. Theorem 2.1 below gives details. The upward asymptotic bias in $\hat R_{n,N}(A, \hat\sigma_n^2)$ renders it useless for selection among the candidate estimators.

Shrink bootstrap. Let $[\cdot]_+$ denote the positive-part function. The modified estimator

(2.11)  $\hat R_{n,B}(A, \hat\sigma_n^2) = \hat\sigma_n^2\lambda_n(A) + [\hat\tau_n(A^c) - \hat\sigma_n^2\lambda_n(A^c)]_+$

corrects the asymptotic bias in $\hat R_{n,N}$ and converges in probability to $\rho(A)$, the correct asymptotic loss of $\hat\xi_n(A)$; see Theorem 2.1. Moreover, the risk estimator $\hat R_{n,B}(A, \hat\sigma_n^2)$ can also be viewed as a bootstrap estimator. Let

(2.12)  $\hat s_n(A^c) = [1 - \hat\sigma_n^2\lambda_n(A^c)/\hat\tau_n(A^c)]_+$

and define $\tilde\xi_n(A) = (\tilde\xi_{n,1}(A), \ldots, \tilde\xi_{n,n}(A))'$ by

(2.13)  $\tilde\xi_{n,i}(A) = X_{n,i}$ if $i/(n+1) \in A$; $\quad\tilde\xi_{n,i}(A) = \hat s_n^{1/2}(A^c)X_{n,i}$ if $i/(n+1) \in A^c$.

Let $X_n^*$ now be a random vector such that the conditional distribution of $X_n^*$ given $X_n$ is $N(\tilde\xi_n(A), \hat\sigma_n^2 I_n)$. As before, let $\xi_n^*(A)$ denote the recalculation from $X_n^*$ of the candidate estimator $\hat\xi_n(A)$. Now the bootstrap risk is

(2.14)  $E^* L_n(\xi_n^*(A), \tilde\xi_n(A)) = E^*\bigl[n^{-1}\sum_{i/(n+1)\in A}(X_{n,i}^* - X_{n,i})^2 + n^{-1}\sum_{i/(n+1)\in A^c}\hat s_n(A^c)X_{n,i}^2\bigr] = \hat\sigma_n^2\lambda_n(A) + \hat s_n(A^c)\hat\tau_n(A^c) = \hat R_{n,B}(A, \hat\sigma_n^2)$.

The shrink bootstrap method just described has two notable features: it depends on the candidate set $A$; and it shrinks some, but not all, of the components of $X_n$ towards the origin. In defining $\tilde\xi_n(A)$, we could shrink as well the components of $X_n$ for which $i/(n+1) \in A$ without changing the final evaluation in (2.14). In this sense, $\hat R_{n,B}(A, \hat\sigma_n^2)$ is the bootstrap risk generated by a family of shrink bootstrap algorithms. The shrinkage factor in (2.13) corrects the overfitting of $\xi_n$ that occurs in the naive bootstrap.

The idea of bootstrap variable selection is to choose the candidate estimator whose estimated loss is smallest. Thus, let $\hat A_{n,B}$ be any set in $\mathcal{A}$ such that

(2.15)  $\hat R_{n,B}(\hat A_{n,B}, \hat\sigma_n^2) = \min_{A\in\mathcal{A}}\hat R_{n,B}(A, \hat\sigma_n^2)$.

The minimum is achieved because, for each $n$, $\hat R_{n,B}(\cdot, \hat\sigma_n^2)$ has a finite number of possible values. We will call

(2.16)  $\hat\xi_{n,B} = \hat\xi_n(\hat A_{n,B})$

a bootstrap selection estimator generated by the candidate estimators $\{\hat\xi_n(A): A \in \mathcal{A}\}$.

Let $\|\cdot\|_{\mathcal{A}}$ denote supremum norm taken over all sets $A \in \mathcal{A}$. To study the locally uniform convergence of $\hat R_{n,N}$ and $\hat R_{n,B}$, we introduce two conditions:
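Because both risk estimators reduce to the closed forms (2.8) and (2.11), no simulation is needed to evaluate them; bootstrap selection only requires a search over the candidate class. The sketch below is our own illustration, reusing the helper `in_A` defined earlier and searching a finite grid of single intervals (a discretized subset of $S(1)$); it is one possible implementation, not the paper's.

```python
import numpy as np


def naive_risk(x, intervals, sigma2_hat):
    """Naive bootstrap risk estimator (2.8); upwardly biased."""
    n = len(x)
    mask = in_A(n, intervals)
    lam_A = mask.mean()
    tau_Ac = np.sum(x[~mask] ** 2) / n
    return sigma2_hat * lam_A + tau_Ac


def shrink_risk(x, intervals, sigma2_hat):
    """Shrink bootstrap risk estimator (2.11)."""
    n = len(x)
    mask = in_A(n, intervals)
    lam_A = mask.mean()
    tau_Ac = np.sum(x[~mask] ** 2) / n
    return sigma2_hat * lam_A + max(tau_Ac - sigma2_hat * (1.0 - lam_A), 0.0)


def select_A(x, sigma2_hat, grid_size=50):
    """Minimize (2.15) over single intervals [t1, t2] on a grid; return
    the selected set and the estimator (2.16)."""
    n = len(x)
    grid = np.linspace(0.0, 1.0, grid_size + 1)
    best = (np.inf, [(0.0, 1.0)])
    for i, t1 in enumerate(grid):
        for t2 in grid[i:]:
            r = shrink_risk(x, [(t1, t2)], sigma2_hat)
            if r < best[0]:
                best = (r, [(t1, t2)])
    mask = in_A(n, best[1])
    xi_hat = np.where(mask, x, 0.0)   # hat xi_n(A): keep selected, zero the rest
    return best[1], xi_hat
```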
C1. $\mathcal{A}$ is a compact subset of $S(m)$, in the metric $d$, that contains $[0,1]$ as an element. The sequence $\{\theta_n = (\xi_n, \sigma_n^2): n \ge 1\}$ is such that

(2.17)  $\lim_{n\to\infty}\|\tau_n - \tau\|_{\mathcal{A}} = 0, \qquad \lim_{n\to\infty}\sigma_n^2 = \sigma^2$,

for some bounded, $d$-continuous, non-negative measure $\tau$ on $\mathcal{A}$ and some finite positive $\sigma^2$.

C2. For every sequence $\{\theta_n: n \ge 1\}$ that satisfies Condition C1,

(2.18)  $\operatorname{plim}_{n\to\infty}\hat\sigma_n^2 = \sigma^2$.

Here plim stands for the limit in $P_{\theta_n,n}$-probability.
Theorem 2.1. Suppose that Conditions C1 and C2 hold. Then

(2.19)  $\operatorname{plim}_{n\to\infty}\|\hat R_{n,N}(\cdot, \hat\sigma_n^2) - \rho_N\|_{\mathcal{A}} = 0$,

where $\rho_N$ is defined in (2.9). By contrast,

(2.20)  $\operatorname{plim}_{n\to\infty}\|L_n(\hat\xi_n(\cdot), \xi_n) - \rho\|_{\mathcal{A}} = 0, \qquad \operatorname{plim}_{n\to\infty}\|\hat R_{n,B}(\cdot, \hat\sigma_n^2) - \rho\|_{\mathcal{A}} = 0$,

where $\rho$ is defined in (2.10). Consequently,

(2.21)  $\operatorname{plim}_{n\to\infty} L_n(\hat\xi_{n,B}, \xi_n) = \min_{A\in\mathcal{A}}\rho(A)$.

If the limiting loss $\rho$ has a unique minimizer $A_0 \in \mathcal{A}$, then

(2.22)  $\operatorname{plim}_{n\to\infty} d(\hat A_{n,B}, A_0) = 0$.

The theorem proof is in Section 4. By equations (2.20) and (2.21), the limiting loss of $\hat\xi_{n,B}$ coincides with the limiting loss of the unrealizable candidate estimator that minimizes loss over all selections $A$ in $\mathcal{A}$. In this sense, the bootstrap selection estimator $\hat\xi_{n,B}$ is asymptotically optimal. Because $[0,1]$ is an element of $\mathcal{A}$,

(2.23)  $\min_{A\in\mathcal{A}}\rho(A) \le \rho([0,1]) = \sigma^2$,

with equality only in special circumstances (e.g. $\tau = \sigma^2\lambda$, or $\lambda(A_0) = 1$ and $\tau(A_0^c) = 0$). Thus, $\hat\xi_{n,B}$ asymptotically dominates the unbiased estimator $X_n$.

An alternative to the shrink bootstrap risk estimator $\hat R_{n,B}$ replaces the positive-part function in (2.11) with the identity function. The result is the risk or loss estimator

(2.24)  $\hat R_{n,C}(A, \hat\sigma_n^2) = \hat\sigma_n^2[\lambda_n(A) - \lambda_n(A^c)] + \hat\tau_n(A^c) = \hat\tau_n(A^c) + \hat\sigma_n^2[2\lambda_n(A) - 1]$.

Unlike $\hat R_{n,B}$, this risk estimator can assume negative values. Let $\hat A_{n,C}$ be any value of $A \in \mathcal{A}$ that minimizes $\hat R_{n,C}(A, \hat\sigma_n^2)$. We will call $\hat\xi_{n,C} = \hat\xi_n(\hat A_{n,C})$ a $C_p$-estimator generated by the candidate estimators $\{\hat\xi_n(A): A \in \mathcal{A}\}$. This terminology recognizes the analogy between (2.24) and risk estimators discussed by Mallows (1973) in a different context. Conclusions (2.20), (2.21) and (2.22) remain valid when $\hat R_{n,B}$, $\hat A_{n,B}$, $\hat\xi_{n,B}$ are replaced by $\hat R_{n,C}$, $\hat A_{n,C}$, $\hat\xi_{n,C}$ respectively.

Other variable selection criteria, such as Akaike's (1974) AIC, Shibata's (1981) method, and several competitors discussed by Rice (1984, Section 3) and Speed and Yu (1993, Section 4), might also be used to choose $A$. Under Conditions C1 and C2, these methods do not minimize asymptotic loss in the sense of (2.21).
A con dence ball for n , centered at an estimator ^n and having radius ^dn , is (3.1) Cn(^n ; d^n) = ft 2 Rk : j^n ? tj d^n g: This section studies con dence balls centered at the bootstrap selection estimator ^n;B . The rst goal is to devise a bootstrap radius d^n;B such that the coverage probability P;n [Cn(^n;B ; d^n;B ) 3 n] converges to as n increases. The second goal is to determine the geometric loss of Cn (^n; d^n ) for various choices of (^n; d^n ): GLn (Cn; n ) = n? = sup jt ? n j t2C ? = (3.2) = n j^n ? j + n? = d^n : 1 2
n
1 2
8
1 2
Geometric loss measures the error of Cn(^n ; d^n ) as a set-valued estimator of n . It has a projection-pursuit interpretation that stems from the identity jxj = supfu0x: juj = 1g. Both the de nition of ^n;B and the construction of con dence balls centered at ^n;B require a good estimator of n . One possibility, used in Rice (1984), is n X (3.3) ^n; = [2(n ? 1)]? (Xn;i ? Xn;i? ) : 2
2
1
1
1
i=2
2
The consistency or asymptotic normality of ^n; requires that the rst-order squared dierences f(n;i ? n;i? ) g be suciently small, in a sense that Condition D1 below makes precise. A second estimator of n works under the assumption that n lies in a subspace of dimension n0 < n. Suppose that n0 is the integer part of cn, where c is a fraction strictly between 0 and 1. By making an appropriate orthogonal transformation, assume without loss of generality that Xn = (Xn ; Yn?n ), where Xn has a N (n ; nIn ) distribution in n0 dimensions, Yn?n has a N (0; n In?n ) distribution in n ? n0 dimensions, and Xn , Yn?n are independent. In this canonical formulation, a bootstrap selection estimator of n can be formed from Xn and the variance estimator 2
1
2
1
2
0
0
0
0
2
2
0
0
0
0
0
0
0
^n; = (n ? n0)? jYn?n j :
(3.4)
2
1
2
0
2
The distribution of (n ? n0)^n; =n is chi-squared with n ? n0 degrees of freedom. The essential features of ^n; and ^n; are expressed in the following two assumptions: 2
2
2
2
2
1
2
D1. The variance estimator ^n; is de ned by (3.3). The sequence of mean vectors fn : n 1g satis es 2
(3.5)
nlim !1
1
X n?1=2 (n;i ? n;i?1 )2 = 0: n
i=2
D2. The variance estimator ^n; and Xn are independent random variables. The distribution of bn^n; =n is chi-squared, where fbn : n 1g is a sequence of constants such that limn!1 bn =n = b < 1. 2
2
2
2
2
9
Under D1 and A1, the asymptotic distribution of n = (^n; ? n) is N (0; 3 ), as in Gasser et al. (1986) or by the reasoning in Section 4. Under D2 and A1, the asymptotic distribution of n? = (^n; ? n) is N (0; 2b? ). To construct con dence balls, we begin by nding the asymptotic distribution of (3.6) Dn (n ; Xn ; ^n;j ) = n = [Ln(^n;B ; n) ? R^n;C (A^n;B ; ^n;j )]: The quantity Dn compares the loss of ^n;B with the simple estimator (2.24) of its risk. On the one hand, the asymptotic distribution of Dn turns out to be normal with mean zero (Theorem 3.2 below). On the other hand, referring Dn (n ; Xn ; ^n;j ) to the th quantile of its bootstrap distribution generates a con dence ball centered at ^n;B that has asymptotic coverage probability for n (Theorem 3.3 below). There is no apparent advantage to replacing R^n;C in (3.6) with the more complex bootstrap risk estimator R^n;B . In the remainder of the paper, the notation Dj stands for either condition D1 or D2, according to the value of j . Theorem 3.1 Suppose that Conditions C1 and Dj hold and that the limiting loss has a unique minimum at A 2 A. Then (3.7) L[Dn (n ; Xn; ^n;j )jP ;n ] ) N (0; j (; ; A )); where (3.8) (; ; A ) = 2 + [2(A ) ? 1] + 4 (Ac ) and (3.9) (; ; A ) = 2 + 2b? [2(A ) ? 1] + 4 (Ac ): This result is proved in Section 4. The same asymptotic distributions hold if the bootstrap selection estimator in the de nition of Dn (n ; Xn ; ^n;j ) is replaced by the Cp-estimator ^n;C . Moreover, comparing the proof of Theorem 3.1 with its counterpart for the Cp-estimator establishes the following asymptotic equivalence in loss: = jL (^ ; ) ? L (^ ; )j = 0: (3.10) plim n n n;B n n n;C n n!1 1 2
1 2
2
2
2
2
1
2
2
4
1
1 2
4
2
2
0
2
2
2 1
2 2
2
4
0
4
0
2
2
n
4
1
2
0
4
0
0
2
2
0
2
0
2
1 2
To successfully bootstrap the sampling distribution of Dn (n ; Xn ; ^n;j ) requires an algorithm that recognizes both the data-based selection A^n;B 2
10
and the structure of the variance estimator ^n;j . For every A 2 A, let (3.11) D~ n (n ; A; Xn; ^n;j ) = n = [Ln(^n (A); n) ? R^n;C (A; ^n;j )]: 2
2
1 2
2
We consider two cases: Bootstrapping Dn (n ; Xn ; ^ n; ). Let (3.12) s^n = [1 ? ^n; n (A^cn;B )=^n (A^cn;B )] and de ne ~n = (~n; ; : : : ; ~n;n ) by 8 < Xn;i if i=(n + 1) 2 A^n;B ~n;i = : = (3.13) : s^n Xn;i if i=(n + 1) 2 A^cn;B ) be a random vector such that the conditional Let En = (En; ; : : : ; En;n )0 distribution of En given Xn is N (0; ^n; In). De ne Xn = (Xn; ; : : :; Xn;n and n; by Xn = ~n + En n X ? E ) : n; = ^n; + [2(n ? 1)]? (En;i (3.14) n;i? 2
1
2
+
1
1
1 2
1
2
2 1
2 1
2
1
1
1
1
1
i=2
2
The partial bootstrap estimator of L[Dn (n ; Xn ; ^n; )jP ;n ] is then (3.15) H^ n;B; = L[Dn (~n ; A^n;B ; Xn; n; )jXn ]: Bootstrapping Dn (n ; Xn ; ^ n; ). Rede ne s^n and En above by replacing ^n; with ^n; . De ne ~n by (3.13) and Xn as in (3.14). Let n; be a random variable such that L[bnn; =^n; jXn ] is chi-squared with bn degrees of freedom and such that n; , En are conditionally independent, given Xn . The actual construction of n; will normally require a separate bootstrap scheme. The partial bootstrap estimator of L[Dn (n ; Xn ; ^n; )jP ;n] is then (3.16) H^ n;B; = L[Dn (~n; A^n;B ; Xn; ^n; )jXn ] Theorem 3.2 Suppose that Conditions C1 and Dj hold and that the limiting loss has a unique minimum at A 2 A. Then (3.17) H^ n;B;j ) N (0; j (; ; A )) in P ;n-probability, where j (; ; A ) is de ned by (3.8) and (3.9). 2
2
2
1
2
2 2 2 2
2
2 2
2
2 2
2
2
n
2
2 2
2
0
2
n
n
2 1
1
2
1
2
2
0
11
2
0
Both bootstrap algorithms in Theorem 3.2 shrink toward zero these components of Xn that are not selected by ^n;B . The construction (3.13) of ~n is critical for the weak convergence (3.17). If we took instead ~n = Xn , over tting n , then the asymptotic variance in (3.17) would become j ( + ; ; A ). If we used ~n = ^n;B , under tting n for bootstrap purposes, the asymptotic variance in (3.17) would become j (0; ; A ). These conclusions follow by the method used to prove Theorem 3.2. Thus, neither of these alternative bootstrap algorithms yield consistent estimators of the sampling distribution of Dn (n; Xn ; ^n;j ). ? () be the th quantile of For strictly between 0 and 1, let H^ n;B;j the bootstrap distribution H^ n;B;j de ned in (3.15) or (3.16). Under Condition Dj , de ne the bootstrap selection con dence set for n to be Cn;B;j = Cn(^n;B ; d^n;B;j ), where ? ()] = : (3.18) d^n;B;j = [nR^ n;C (A^n;B ; ^n;j ) + n = H^ n;B;j The following theorem justi es this con dence set centered at ^n;B . Theorem 3.3 Suppose that Conditions C1 and Dj hold and that the limiting risk has a unique minimizer A 2 A such that (A ) > 0. Then (3.19) nlim !1 P ;n (Cn;B;j 3 n ) = and = (3.20) plim GL n (Cn;B;j ; n ) = 2 (A ): n!1 If (A ) = 0 then (3.21) lim inf P ;n(Cn;B;j 3 n ) : n!1 Remarks. The exceptional case (A ) = 0 arises only when (A ) = (Ac ) = 0. This occurs when all but an asymptotically vanishing fraction of the components of n are zero. A more familiar con dence set for n in the normal model is Cn;F = Cn(Xn ; ^n?n = ()), where ?n = () is the square root of the th quantile of the chi-squared distribution with n degrees of freedom. Under Conditions C1 and C2, nlim !1 P ;n (Cn;F 3 n ) = (3.22) plim GLn (Cn;F ; n ) = 2; n!1 2
2
2
0
2
2
0
2
1
2
1 2
0
1 2 +
1
0
n
1 2
0
0
n
0
0
1 2
1 2
n
12
0
the second convergence relying on (3.2) and the normal approximation to the chi-squared distribution. It follows from (2.23), (3.20) and (3.22) that, at asymptotic coverage probability , the bootstrap-selection con dence balls Cn;B;j are both asymptotically smaller than the con dence ball Cn;F . As an alternative to bootstrapping, the asymptotic variances in Theorem 3.1 may be estimated consistently from the sample, using ^n;j for and [^n(A^cn;B ) ? ^n;j (A^cn;B )] for (Ac ). Equation (4.4) below justi es the second of these estimators. The estimated normal limit distributions then yield critical values and con dence sets for which an analog of Theorem 3.3 holds. 2
2
+
2
0
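The two radii being compared are simple functions of quantities already computed. The sketch below is our own code, not the paper's: `q_alpha` denotes the Monte Carlo $\alpha$-quantile of $\hat H_{n,B,1}$ from the previous sketch, `cp_risk` is the earlier helper, and scipy is used only for the chi-squared quantile in the classical radius.

```python
import numpy as np
from scipy.stats import chi2


def radius_bootstrap(x, A_hat, sigma2_hat, q_alpha):
    """Radius (3.18) of the bootstrap-selection confidence ball."""
    n = len(x)
    inner = n * cp_risk(x, A_hat, sigma2_hat) + np.sqrt(n) * q_alpha
    return np.sqrt(max(inner, 0.0))


def radius_classical(n, sigma2_hat, alpha=0.95):
    """Radius of the classical ball C_{n,F}: sigma_hat times the square
    root of the alpha-quantile of chi-squared with n degrees of freedom."""
    return np.sqrt(sigma2_hat) * np.sqrt(chi2.ppf(alpha, df=n))
```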
4. Proofs
The theorem proofs rely on ideas from Beran (1994) augmented by bootstrap considerations. Let $E_{n,i} = X_{n,i} - \xi_{n,i}$ and, for every set $A \in \mathcal{A}$, define

(4.1)  $W_{n,1}(A, \theta_n) = n^{-1/2}\sum_{i/(n+1)\in A}(E_{n,i}^2 - \sigma_n^2), \qquad W_{n,2}(A, \theta_n) = n^{-1/2}\sum_{i/(n+1)\in A}\xi_{n,i}E_{n,i}$.

Let $D(\mathcal{A})$ denote the set of all bounded functions having at most jump discontinuities on the compact set $\mathcal{A}$. Metrize $D(\mathcal{A})$ by the supremum norm $\|\cdot\|_{\mathcal{A}}$. The $\sigma$-algebra is that generated by open balls. Under Condition C1, the two processes $W_{n,j}(\theta_n) = \{W_{n,j}(A, \theta_n): A \in \mathcal{A}\}$ are random elements of $D(\mathcal{A})$.

Let $B_j = \{B_j(A): A \in \mathcal{A}\}$ be two independent Gaussian processes on $\mathcal{A}$ with mean zero and covariance structure

(4.2)  $\operatorname{Cov}[B_1(A), B_1(A')] = \lambda(A \cap A'), \qquad \operatorname{Cov}[B_2(A), B_2(A')] = \tau(A \cap A')$,

where $\lambda$ is Lebesgue measure and $\tau$ is the bounded non-negative measure defined in Condition C1. Both processes are random elements of $D(\mathcal{A})$ that have $d$-continuous sample paths.

Lemma 1. Suppose that Condition C1 holds. Then the bivariate processes $\{(W_{n,1}(\theta_n), W_{n,2}(\theta_n))\}$ converge weakly, as random elements of $D(\mathcal{A}) \times D(\mathcal{A})$, to the process $(2^{1/2}\sigma^2 B_1, \sigma B_2)$.

Convergence of the finite-dimensional distributions is straightforward. For tightness, see LeCam (1983, Lemma 4) or Alexander and Pyke (1986, Section 4).
1 2
1 2
1
2
Consequently, by Lemma 1, (4.4) plim k^n () ? [ () + ()]kA = 0: n!1 2
Then (2.19) and the second convergence in (2.20) follow from (4.4), Condition C2, and the de nitions (2.8), (2.11) of R^ n;N and R^ n;B . On the other hand, by (2.4) and (4.1), (4.5) Ln (^n(A); n ) = n (Ac) + nn (A) + n? = Wn; (A; n): 2
1 2
1
The rst convergence in (2.20) follows from Lemma 4.1. De nition (2.15) of ^n;B , (2.20), and the triangle inequality imply plim R^ n;B (A^n;B ; ^n) = min (A) (4.6) 2
n!1
A2A
and (4.7) plim [R^n;B (A^n;B ; ^n) ? Ln(^n;B ; n )] = 0: n!1 Conclusion (2.21) thus follows. Limit (4.6) and the second limit in (2.20) imply that (4.8) plim (A^n;B ) = (A ); 2
0
n!1
where A is the unique minimizer of over A. Suppose that (2.22) does not hold. By considering a subsequence, we may assume without loss of generality that convergence (4.8) occurs almost surely while (4.9) P ;n [d(A^n;B ; A ) > ] 0
0
n
for some positive and . Because is d-continuous on the compact A and A uniquely minimizes over A, the almost sure version of (4.8) implies 0
14
that d(A^n;B ; A ) ! 0 with probability one. This contradicts (4.8), thereby proving (2.22). 0
Proof of Theorem 3.1. As above, the de nitions (2.4) and (2.24) of
Ln and R^ n;C respectively entail that (4.10) Ln(^n;B ; n ) = (A^cn;B ) + nn (A^n;B ) + n? = Wn; (A^n;B ; n) 2
1 2
1
and R^ n;C (A^n;B ; ^n;j ) = n(A^cn;B ) + n n(A^cn;B ) + n? = Wn; (A^cn;B ; n) (4.11) +2n? = Wn; (A^cn;B ; n) + ^n;j [n(A^n;B ) ? n (A^cn;B )]: Consequently, Dn(n ; Xn ; ^n;j ) = n = [Ln(^n;B ; n ) ? R^ n;C (A^n;B ; ^n;j )] (4.12) = Wn; (A^n;B ; n) ? Wn; (A^cn;B ; n) ? 2Wn; (A^cn;B ) ?n? = (^n;j ? n )[2n(A^n;B ) ? 1]: Under Condition D1, 2
2
1 2
2
1 2
1
2
2
1 2
2
1
1
1 2
2
2
2
n X ^n; = 2? [(n ? 1)? (En;i + En;i? )] i n X +(n ? 1)? En;i En;i? + op (n? = ) 2
1
1
2
2
1
1
=2
1
1 2
1
i=2
+ 2?1fWn;1([2=(n + 1); 1]; n ) + Wn;1([0; (n ? 1)=(n + 1)]; n)g
(4.13) = n n X ? En;i En;i? + op (n? = ): +(n ? 1) 2
1
1 2
1
i=2
The argument for Lemma 1 and the martingale central limit theorem, applied to the quadratic term in the last line of (4.13), which is uncorrelated with Wn; ; Wn; , imply that (4.14) n = (^n; ? n) ) 2 = B ([0; 1]) + Z; where Z is a N (0; 1) random variable such that B , B and Z are independent. Moreover, (4.15) Dn (n ; Xn ; ^n; ) = Sn(En ; n; n ) + op(1); 1
2
1 2
2
1
1 2
2
2
2
1
1
2
2
1
15
2
where Sn(En ; n ; n) is de ned by substituting (4.13) into (4.12) and dropping the remainder term. The d-continuity of and together with the convergence (2.22) imply that plimn!1 (A^n;B ) = (A ) and that plimn!1 (A^n;B ) = (A ). The foregoing considerations yield 2
0
0
Sn (En; n ; n) ) 2 = B (A ) ? 2 = B (Ac ) ? 2B (Ac ) (4.16) ?[2(A ) ? 1][2 = B ([0; 1]) + Z ] : For j = 1, the weak convergence (3.7) follows from (4.2), (4.15) and (4.16). We will use (4.16) again to prove Theorem 3.2. Under Condition D2, ^n; is independent of Xn and 2
1 2
2
1
1 2
0
1 2
0
2
1
2
0
0
2
1
2
n = (^n; ? n) ) 2 = b? = Z; where Z again is a N (0; 1) random variable such that B , B and Z are independent. Because of (4.12), (2.22) and Lemma 1, Dn (n ; Xn ; ^n; ) ) 2 = B (A ) ? 2 = B (Ac ) ? 2B (Ac ) (4.18) ?[2(A ) ? 1]2 = b? = Z; which establishes (3.7) for j = 2. (4.17)
1 2
2
2
2
1 2
1 2
2
1
2
1 2
2
2
1
1 2
0
1 2
0
1
2
0
1 2
2
0
2
Proof of Theorem 3.2. Suppose that fn g, fn g, and fAn 2 Ag are 2
such that (4.19)
nlim !1 kn ? ~k
2
;
nlim !1 n
= ; 2
2
nlim !1 d(An ; A0) = 0
and ~ is d-continuous. By the reasoning for Theorem 3.1, (4.20) D~ n (n ; An; Xn ; ^n; ) = S~n(En ; An; n; n) + op (1); 2
2
1
where S~n(En ; An; n ; n) is obtained from the de nition of Sn (En; n ; n) by replacing A^n;B with An. As in Theorem 3.1, (4.21) L[S~n(En ; An; n; n)jP ;n ] ) N (0; (~ ; ; A )) 2
2
2
and (4.22)
n
2 1
2
0
L[D~ n (n ; An; Xn ; ^n; jP ;n ] ) N (0; (~ ; ; A )): 2
2
n
16
2 2
2
0
Next, consider the empirical measure X ~ ~n (A) = n? n;i 1
2
i=(n+1)2A
= ^n (A \ A^n;B ) + s^n^n (A \ A^cn;B ):
(4.23)
Under either Condition D1 or D2, it follows from (4.4) and (2.22) that plim k~n ? ~kA = 0; plim ^ = ; n!1 n!1 n;j
(4.24)
2
2
where ~ is the measure on A de ned by (4.25)
~(A) = (A \ A ) + (A \ A ) +s[ (A \ Ac ) + (A \ Ac )] s = (Ac )=[ (Ac ) + (Ac )]: 2
0
0
2
0
0
0
0
2
0
For the case j = 1, (3.14) and the reasoning for (4.20) entail that (4.26) Dn (~n ; A^n;B ; Xn; ^n; ) = S~n(En ; A^n;B ; ~n ; ^n; ): 2 1
2 1
Consequently, by (4.24), (2.22) and (4.21), (4.27) H^ n;B; ) N (0; (~ ; ; A )) 2 1
1
2
0
in P ;n-probability. This limit law agrees with (3.17) because ~(Ac ) from (4.25) equals (Ac ) in (3.8). For the case j = 2, (4.24), (2.22) and (4.22) yield (4.28) H^ n;B; ) N (0; (~ ; ; A )); n
0
0
2 2
2
2
0
which agrees with (3.17) because ~(Ac ) from (4.25) again equals (Ac ) in (3.9). 0
0
Proof of Theorem 3.3. This result follows from Theorems 3.1 and 3.2.
The argument parallels the proof of Theorem 3.2 in Beran (1994).
17
Acknowledgements
This paper was written while I was a guest of the Statistics group in the Institut fur Angewandte Mathematik at Universitat Heidelberg. The research was supported in part by National Science Foundation grant DMS 9224868. Comments by Lutz Dumbgen were especially helpful. REFERENCES Akaike, H. (1974). A new look at statistical model identi cation. IEEE Trans. Auto. Cont. 19, 716-723. Alexander, K.S. and Pyke, R. (1986). A uniform central limit theorem for set-indexed partial-sum processes with nite variance. Ann. Probab. 14, 582-597. Beran, R. (1994). Con dence sets centered at Cp-estimators. Unpublished preprint. Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X - xed prediction error. J. Amer. Statist. Assoc. 87, 738-754. Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York. Freedman, D.A., Navidi, W., and Peters, S.C. (1988). On the impact of variable selection in tting regression equations. On Model Uncertainty and its Statistical Implications (T.K. Dijkstra, ed.), 1-16. Springer-Verlag, Berlin. Gasser, T., Sroka, L., and Jennen-Steinmetz, C. (1986). Residual variance and residual pattern in nonlinear regression. Biometrika 73, 625-633. James, W. and Stein, C. (1961). Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Statist. Prob. Vol. 1, 361-380. Univ. of California Press, Berkeley. LeCam, L. (1983). A remark on empirical measures. Festschrift for Erich Lehmann (P.J. Bickel, K. Doksum, and J.L. Hodges, eds.), 305-327. Wadsworth, Belmont, California. Mallows, C. (1973). Some comments on Cp. Technometrics 15, 661-675. 18
Rice, J. (1984). Bandwidth choice for nonparametric regression. Ann. Statist. 12, 1215-1230. Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68, 45-54. Speed, T.P. and Yu, B. (1993). Model selection and prediction: normal regression. Ann. Inst. Statist. Math. 45, 35-54. Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Third Berkeley Symp. Math. Statist. Prob. Vol. 1, 197-206. Univ. of California Press, Berkeley. Rudolf Beran Department of Statistics University of California, Berkeley Berkeley, CA 94720{3860 U.S.A.
29 November 1994
19