How Accurate Can Any Regression Procedure Be?

Yuhong Yang
Department of Statistics, Iowa State University

October, 2000

Abstract
Various parametric and nonparametric regression procedures have been constructed according to different possible characteristics of the underlying regression function. To reduce the dependence on subjective assumptions, the theme of adaptive estimation is to construct a procedure that provides an accurate estimate of the regression function for various scenarios without knowing which one describes the data well. A closely related question is: given a regression procedure, how many regression functions are estimated accurately? In this work, for a given sequence of prescribed estimation accuracies (indexed by sample size), we give an upper bound (in terms of metric entropy) on the number of regression functions for which the accuracy is achieved. A consequence is that if one demands near optimal performance for a target class of regression functions, then the same accuracy cannot be achieved for many additional regression functions. This has a negative implication on adaptive estimation. The main result is also applied to show that, as far as polynomial rates of convergence are concerned, any regression procedure is essentially no better than a method based on sparse approximation.

Key words and phrases: Adaptive estimation, nonparametric regression, rate of convergence.
1 Introduction

(This research was supported by the University Research Grant of Iowa State University.)

Nonparametric function estimation techniques have been proposed in the past few decades and are now widely used in statistical applications. Many fundamental questions on function estimation have also been successfully addressed. Universally consistent estimators have been found (e.g., Stone (1977) and Devroye and Wagner (1980)). However, the convergence of such a function estimator to the true underlying function can be arbitrarily slow (see, e.g., Devroye (1982)). Under certain conditions on the function to be estimated, different types of estimators have been shown to converge at certain rates in statistical risk. Minimax risks are often considered for studying the problem of estimating a function assumed to be in a function class with certain characteristics (e.g., monotonicity or smoothness). Such characteristics are usually closely related to how fast the minimax risk converges to zero. For example, the smoother the function to be estimated is, the more accurately it can be estimated (see, e.g., Ibragimov and Hasminskii (1977), Bretagnolle and Huber (1979), Stone (1982)). More generally, the relationship
between the minimax rate of convergence and the largeness of the function class as measured in terms of metric entropy is now well understood under some familiar loss functions (e.g., Birge (1986), Yatracos (1988) and Yang and Barron (1999)).

Since the early original work of Efromovich and Pinsker (1984), adaptive function estimation has received considerable attention. Take the problem of estimating a smooth univariate density function on the unit interval as an example. If one knew that the density has $r$ derivatives, then various procedures (e.g., kernel or polynomial) could be applied with a suitable (data-independent) choice of certain hyper-parameters (e.g., bandwidth for kernel or order of polynomial) according to $r$. The resulting estimators are often shown to converge at the order $n^{-2r/(2r+1)}$ under the squared $L_2$ loss, which is known to be the optimal rate of convergence for a density with $r$ derivatives. From a practical point of view, however, one does not know the value of $r$. Thus one wants a procedure that automatically converges at the optimal rate without the knowledge of $r$. This is the spirit of adaptive function estimation. Various adaptive procedures have been proposed. See, e.g., Barron, Birge and Massart (1999) or Yang (2000b) for some recent references on adaptive function estimation.

Clearly adaptation is a desirable property for function estimation. Ideally, one would wish to use an adaptive procedure by which a very large class of functions is accurately estimated based on a sample. A natural question then is: how adaptive can any procedure be? This motivates the study of the following problem: for a given function estimation procedure, how many functions are estimated well by it? In this paper, we focus on the case of estimating a regression function and provide some negative results showing that for any given regression procedure, the set of regression functions that can be well estimated is fundamentally limited in size.

The rest of the paper is organized as follows. In Section 2, we show that for a given class of probability density functions, good news in estimation directly forecasts bad news. In Section 3, we present the main result that no estimator can converge fast for many underlying functions. In Section 4, we give a negative consequence of the main result on adaptive function estimation. In Section 5, we show that in some sense, every regression procedure is essentially no better than a method based on a certain sparse approximation.
2 Lower bounding minimax risk through upper bounds

Let $X_1, X_2, \ldots, X_n$ be i.i.d. observations with probability density function $p(x)$, $x \in \mathcal{X}$, with respect to a $\sigma$-finite measure $\mu$. Here the space $\mathcal{X}$ is general and could be of any dimension. One needs to estimate the unknown density $p$ based on the data. The Kullback-Leibler (K-L) divergence between two densities $p$ and $q$ is defined as $D(p \| q) = \int p \log (p/q) \, d\mu$. Let $d$ be a distance (metric) between densities. Examples are the Hellinger distance $d_H(p, q) = \left( \int (\sqrt{p} - \sqrt{q})^2 \, d\mu \right)^{1/2}$ and the $L_2$ distance $\| p - q \| = \left( \int (p - q)^2 \, d\mu \right)^{1/2}$. The loss $d^2(p, \hat{p})$ will be considered for density estimation in this section.
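As a quick numerical illustration of these distances (not part of the development below), the following minimal sketch approximates the K-L divergence, the Hellinger distance, and the $L_2$ distance on a grid for two Gaussian densities; the choice of Gaussians, the grid resolution, and the use of numpy/scipy are assumptions made only for this example.

```python
import numpy as np
from scipy.stats import norm

# Two densities on the real line: p = N(0,1), q = N(0.5,1).
# Numerically approximate D(p||q), the Hellinger distance, and the L2 distance
# on a fine grid; for two unit-variance Gaussians, D(p||q) has the closed form
# (difference of means)^2 / 2, which serves as a check.
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=0.5, scale=1.0)

kl = np.sum(p * np.log(p / q)) * dx                        # D(p||q) = int p log(p/q) dmu
hellinger = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)
l2 = np.sqrt(np.sum((p - q) ** 2) * dx)

print(f"KL  (numeric) = {kl:.6f},  closed form = {0.5**2 / 2:.6f}")
print(f"Hellinger     = {hellinger:.6f}")
print(f"L2 distance   = {l2:.6f}")
```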
A minimax risk measures the difficulty of estimation in a uniform sense. Let $\mathcal{F}$ be a class of density functions. Then the minimax risk for estimating a density in $\mathcal{F}$ at sample size $n$ under the $d^2$ loss is defined as
$$R(\mathcal{F}; d; n) = \inf_{\hat{p}} \sup_{p \in \mathcal{F}} E \, d^2(p, \hat{p}),$$
where the infimum is taken over all density estimators.

Metric entropy is a fundamental concept describing the massiveness or largeness of a set (see Kolmogorov and Tihomirov (1959) or Lorentz et al (1996, Chapter 15)). A set $N_\epsilon$ is said to be an $\epsilon$-packing set in $\mathcal{F}$ if $N_\epsilon \subset \mathcal{F}$ and any two distinct members in $N_\epsilon$ are more than $\epsilon$ apart under the metric $d$. The packing $\epsilon$-entropy of the set $\mathcal{F}$ under the metric $d$, denoted $M(\epsilon; \mathcal{F})$, is then defined to be the logarithm of the size of the largest $\epsilon$-packing set in $\mathcal{F}$. It is known that the convergence rate of the minimax risk is determined by the metric entropy of the class $\mathcal{F}$. Recently Yang and Barron (1999) characterized both minimax upper and lower bounds directly in terms of the metric entropy using information-theoretic tools. The results in this section are essentially in Yang and Barron (1999) (not formally given there but contained in the proofs).

Let $p^n$ denote the $n$-fold product density with the same marginal density $p$, i.e., $p^n(x^n) = \prod_{i=1}^n p(x_i)$. The following lemma follows from the proof of Theorem 1 in Yang and Barron (1999).

Lemma 1: Assume there exists a density function $q_n$ on $\mathcal{X}^n$ such that
$$\sup_{p \in \mathcal{F}} D(p^n \| q_n) \leq n \epsilon_n^2.$$
Let $\epsilon_n$ be chosen to satisfy
$$M(\epsilon_n; \mathcal{F}) = \left\lceil 2 \left( n \epsilon_n^2 + \log 2 \right) \right\rceil.$$
Then the minimax risk of the density class $\mathcal{F}$ satisfies
$$R(\mathcal{F}; d; n) \geq \frac{\epsilon_n^2}{8}.$$
From Lemma 1, if a good upper bound on $\sup_{p \in \mathcal{F}} D(p^n \| q_n)$ is available with a suitable choice of $q_n$, then a lower bound on the minimax risk is obtained. The next lemma provides a good choice of $q_n$ based on a sequence of estimators of $p$ under the K-L loss.

Lemma 2: Suppose there exists a sequence of estimators $\tilde{p}_k$ based on $X_1, \ldots, X_k$ for $k \geq 1$ such that
$$\sup_{p \in \mathcal{F}} E D(p \| \tilde{p}_k) \leq b_k^2, \quad k = 1, \ldots, n-1.$$
Let $b_0^2 = \sup_{p \in \mathcal{F}} D(p \| p_0)$ for any chosen density $p_0$. Then there exists a density $q_n$ such that
$$\sup_{p \in \mathcal{F}} D(p^n \| q_n) \leq \sum_{i=0}^{n-1} b_i^2.$$
Proof: Given the estimator sequence $\tilde{p}_k$, define $g_k(x_{k+1}; x^k) = \tilde{p}_k(x_{k+1})$ for $k \geq 1$ and $g_0(x_1; x^0) = p_0(x_1)$. Let $q_n(x_1, \ldots, x_n) = \prod_{k=0}^{n-1} g_k(x_{k+1}; x^k)$. Then $q_n$ is a probability density function on $\mathcal{X}^n$. Then by the information chain rule as used in Barron (1987),
$$D(p^n \| q_n) = \int \prod_{i=1}^n p(x_i) \log \frac{\prod_{i=1}^n p(x_i)}{\prod_{i=0}^{n-1} g_i(x_{i+1}; x^i)} \, \mu(dx_1) \cdots \mu(dx_n)$$
$$= \sum_{i=1}^n \int \prod_{j=1}^n p(x_j) \log \frac{p(x_i)}{g_{i-1}(x_i; x^{i-1})} \, \mu(dx_1) \cdots \mu(dx_n)$$
$$= \sum_{i=1}^n \int \prod_{j=1}^i p(x_j) \log \frac{p(x_i)}{g_{i-1}(x_i; x^{i-1})} \, \mu(dx_1) \cdots \mu(dx_i)$$
$$= \sum_{i=1}^n E \int p(x_i) \log \frac{p(x_i)}{\tilde{p}_{i-1}(x_i)} \, \mu(dx_i)$$
$$= \sum_{i=1}^n E D(p \| \tilde{p}_{i-1}) \leq \sum_{i=0}^{n-1} b_i^2.$$
This completes the proof of Lemma 2.
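The construction above is simply sequential prediction: $q_n$ multiplies together the one-step predictive densities, so its K-L gap from $p^n$ is the sum of the individual K-L risks. The following small Monte Carlo check illustrates this identity in a Gaussian location model; the model, the choices $\tilde{p}_k = N(\bar{X}_k, 1)$ and $p_0 = N(0,1)$, and the sample sizes are assumptions made only for this illustration (numpy/scipy assumed).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta, n, n_mc = 1.0, 20, 200000

# Monte Carlo estimate of D(p^n || q_n), where q_n chains the predictive
# densities: q_n(x_1,...,x_n) = prod_k ptilde_{k-1}(x_k), with ptilde_0 = N(0,1)
# and ptilde_k = N(mean of the first k observations, 1).
X = rng.normal(theta, 1.0, size=(n_mc, n))
log_p = norm.logpdf(X, loc=theta, scale=1.0).sum(axis=1)

cummean = np.cumsum(X, axis=1) / np.arange(1, n + 1)        # running sample means
pred_means = np.concatenate([np.zeros((n_mc, 1)), cummean[:, :-1]], axis=1)
log_q = norm.logpdf(X, loc=pred_means, scale=1.0).sum(axis=1)

lhs = np.mean(log_p - log_q)                                 # ~ D(p^n || q_n)
# Exact K-L risks: E D(p || ptilde_0) = theta^2/2 and E D(p || ptilde_k) = 1/(2k).
rhs = theta**2 / 2 + sum(1.0 / (2 * k) for k in range(1, n))
print(f"Monte Carlo D(p^n||q_n) = {lhs:.4f},  sum of K-L risks = {rhs:.4f}")
```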
From Lemma 1 and Lemma 2, we have the following result.

Theorem 1: For a sequence of estimators $\hat{p}_k$ based on $X_1, \ldots, X_k$, $1 \leq k \leq n$, if $\sup_{p \in \mathcal{F}} E D(p \| \hat{p}_k) \leq b_k^2$, then for the minimax risk, we have
$$R(\mathcal{F}; d; n) \geq \frac{\epsilon_{n,d}^2}{8},$$
where $\epsilon_{n,d}$ is chosen such that
$$M(\epsilon_{n,d}; \mathcal{F}) = \left\lceil 2 \left( \sum_{i=0}^{n-1} b_i^2 + \log 2 \right) \right\rceil,$$
with $b_0^2 = \sup_{p \in \mathcal{F}} D(p \| p_0)$ for any chosen density $p_0$ as in Lemma 2.
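To see how the entropy equation translates an entropy level into a rate, one can solve the critical balance numerically. The sketch below does this for the simpler balance of Lemma 1, $M(\epsilon) = 2(n\epsilon^2 + \log 2)$, under an assumed entropy order $M(\epsilon) = (1/\epsilon)^{d/r}$ typical of a smoothness class; the values of $d$ and $r$ are illustrative assumptions, and the printout is meant only to show that the solution scales like the familiar rate $n^{-2r/(2r+d)}$.

```python
import numpy as np
from scipy.optimize import brentq

# Solve M(eps) = 2 * (n * eps^2 + log 2) for a hypothetical class with packing
# entropy M(eps) = (1/eps)^(d/r) (an assumed entropy order, not a statement
# about any particular class), and compare eps_n^2 with n^{-2r/(2r+d)}.
d, r = 1, 2   # dimension and smoothness; illustrative choices only

def critical_eps(n):
    f = lambda e: (1.0 / e) ** (d / r) - 2.0 * (n * e**2 + np.log(2))
    return brentq(f, 1e-8, 1e3)   # f > 0 at the left end, f < 0 at the right end

for n in [100, 1000, 10000, 100000]:
    eps = critical_eps(n)
    print(f"n={n:7d}  eps_n^2={eps**2:.3e}  n^(-2r/(2r+d))={n ** (-2*r/(2*r + d)):.3e}")
```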
In the theorem, K-L risks are involved. The K-L divergence is closely related to the $L_2$ and Hellinger distances (see, e.g., Yang and Barron (1999)). It may seem mysterious that an upper bound also forecasts a lower bound. An explanation is as follows. The smaller the upper bound is, the closer the densities in the class $\mathcal{F}$ are to a fixed density $q_n(x_1, \ldots, x_n)$ on $\mathcal{X}^n$ (as constructed in Lemma 2) in terms of the K-L divergence, which suggests the densities are harder to distinguish, as revealed by Fano's inequality (see, e.g., Yang and Barron (1999)).
3 How many regression functions can be served well by any given regression procedure?

We now apply the results in Section 2 to show that the collection of functions that are estimated with small risk by a given estimation procedure is fundamentally limited in size. We will study the problem in a nonparametric regression setting for convenience, but similar results can be obtained for density estimation as well.
Consider the regression model
$$Y_i = f(X_i) + \varepsilon_i, \quad i = 1, \ldots, n,$$
where $(X_i, Y_i)_{i=1}^n$ are i.i.d. copies from the joint distribution of $(X, Y)$ with $Y = f(X) + \varepsilon$. The explanatory variable $X$ (which could be of any dimension) has distribution $P_X$ and the random error $\varepsilon$ is assumed to have a normal distribution with mean zero and variance $\sigma^2 > 0$. One needs to estimate the regression function $f$ based on the data $Z^n = (X_i, Y_i)_{i=1}^n$. Since our main interest in this work is on the negative side (limits of estimation), unless stated otherwise, $\sigma^2$ is assumed to be known and then taken to be 1 without loss of generality. When $\sigma^2$ is unknown, the problem of estimating $f$ certainly cannot be easier. For the same reason, we assume $P_X$ is known in this paper unless stated otherwise.

Let $\delta$ be a regression estimation procedure producing estimator $\hat{f}_i(x) = \hat{f}_i(x; Z^i)$ at each sample size $i \geq 1$. Let $\| \cdot \|$ denote the $L_2$ norm with respect to the distribution of $X$, i.e., $\| g \| = \left( \int g^2(x) \, P_X(dx) \right)^{1/2}$. Let $R(f; n; \delta) = E \| f - \hat{f}_n \|^2$ denote the risk of the procedure at sample size $n$ under the squared $L_2$ loss. For a class of regression functions $\mathcal{F}$, let $R(\mathcal{F}; n) = \inf_{\hat{f}} \sup_{f \in \mathcal{F}} E \| f - \hat{f} \|^2$ denote its minimax risk.

Fix a regression procedure $\delta$. Let $b_n^2$ be a non-increasing sequence with $b_n^2 \to 0$ as $n \to \infty$. Assume that the true regression function has $L_2$ norm bounded by a known constant $A > 0$. Consider the class of regression functions
$$\mathcal{F}(\{b_n^2\}; \delta) = \{ f : \| f \| \leq A \text{ and } R(f; n; \delta) \leq b_n^2 \text{ for all } n \geq 1 \}. \tag{1}$$
It is the collection of the regression functions for which the estimation procedure $\delta$ achieves the given accuracy $b_n^2$ at each sample size $n$. Ideally, one wants this class to be as large as possible. As mentioned in the introduction, it is well-known that one cannot demand any universal convergence rate; i.e., for any given sequence $b_n^2 \downarrow 0$ and every estimation procedure $\delta$, there exists at least one regression function $f$ such that $R(f; n; \delta) \geq b_n^2$ for each $n \geq 1$ (see, e.g., Devroye (1982)). Thus one knows that $\mathcal{F}(\{b_n^2\}; \delta)$ cannot be the class of all possible regression functions. It is still unclear, however, how much smaller $\mathcal{F}(\{b_n^2\}; \delta)$ is compared to the class of all regression functions. We will provide an upper order on the largeness of $\mathcal{F}(\{b_n^2\}; \delta)$ in terms of metric entropy. The order can also be achieved by familiar regression procedures for smoothness function classes. Thus the largeness bound (order) that we will give cannot be improved in general.

The problem of upper bounding the metric entropy of $\mathcal{F}(\{b_n^2\}; \delta)$ is closely related to the problem of lower bounding the minimax risk of a general class of regression functions. Intuitively, if $\mathcal{F}(\{b_n^2\}; \delta)$ is too large, then the rate of convergence $b_n^2$ cannot be achieved uniformly. However, it seems that no lower bounds in the literature have direct implications on the size of $\mathcal{F}(\{b_n^2\}; \delta)$. In particular, it is not feasible to apply hyper-cube methods (often used in deriving minimax lower bounds) because little can be said about the structure and local properties of the class $\mathcal{F}(\{b_n^2\}; \delta)$, since no specific conditions are put on $\delta$. Theorem 1 in the previous section is handy for upper bounding the metric entropy of $\mathcal{F}(\{b_n^2\}; \delta)$.

Note that in the definition of $\mathcal{F}(\{b_n^2\}; \delta)$, the risk bounds are required to hold for each sample size. One might wonder why not consider one sample size at a time. Actually, if one modifies the definition of $\mathcal{F}(\{b_n^2\}; \delta)$ this way, i.e., defines $\mathcal{F}_n(b_n^2; \delta) = \{ f : \| f \| \leq A \text{ and } R(f; n; \delta) \leq b_n^2 \}$, then no general nontrivial bound is possible, as seen from the following simple example.

Example 1: Consider the procedure $\delta$ that produces the trivial estimator $\hat{f}_i(x) \equiv 0$ for all sample sizes. For any $\epsilon > 0$, the minimax risk of the class $\mathcal{F}_\epsilon = \{ f : \| f \|^2 \leq \epsilon \}$ is obviously no bigger than $\epsilon$ by using $\delta$, but the metric entropy of $\mathcal{F}_\epsilon$ is always infinite. This indicates that it is impossible to have a non-trivial upper bound on the metric entropy of the class of functions that $\delta$ serves well at a given sample size.

Let $b_0^2 = A^2 + 2 \log 2$ and define $B_k = \sum_{i=0}^k b_i^2$ for $k \geq 1$ and $B_0 = b_0^2$.

Theorem 2: Take $b_n^2 = C n^{-\alpha}$ for some constant $C > 0$ and $0 < \alpha \leq 1$. When $\alpha < 1$, for any regression procedure $\delta$ and any $\epsilon \leq 3 C^{1/2}$,
$$M\left( \epsilon; \mathcal{F}(\{b_n^2\}; \delta) \right) \leq C' \left( \frac{1}{\epsilon} \right)^{2(1-\alpha)/\alpha},$$
where $C'$ is a constant depending only on $\alpha$, $C$ and $A$. When $\alpha = 1$, for any regression procedure $\delta$ and any $\epsilon \leq 3 C^{1/2}$,
$$M\left( \epsilon; \mathcal{F}(\{b_n^2\}; \delta) \right) \leq C'' \log \frac{1}{\epsilon}$$
for some constant $C''$ depending only on $A$ and $C$. For a general sequence $\{b_n^2\}$, we have
$$M\left( 3 b_k; \mathcal{F}(\{b_n^2\}; \delta) \right) \leq \lceil B_{k-1} \rceil \quad \text{for all } k \geq 1.$$

Remarks:
1. The normality assumption on the errors is not essential. A similar result holds if the regression function is bounded in a known range $[-A, A]$ and the error distribution satisfies a mild condition as used for Theorem 1 in Yang (2000c).

2. The dependence of $C'$ or $C''$ on $\alpha$, $C$ and $A$ is given in the proof of the theorem below.

Proof: For a function $g$, let $p_g(x, y) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y - g(x))^2}$ denote the joint density of $(X, Y)$ with respect to the product measure of $P_X$ and Lebesgue measure when the regression function is $g$. It can be easily verified that the K-L divergence between two such densities satisfies $D(p_f \| p_g) = \frac{1}{2} \| f - g \|^2$. Let $\tilde{f}_k$, $k \geq 1$, be the estimators of $f$ produced by the procedure $\delta$ at each sample size respectively. Let $\tilde{f}_0(x) \equiv 0$; then $E \| f - \tilde{f}_0 \|^2 \leq A^2$ by assumption. Let
$$q_n(z^n) = p_{\tilde{f}_0}(x_1, y_1) \, p_{\tilde{f}_1}(x_2, y_2) \cdots p_{\tilde{f}_{n-1}}(x_n, y_n).$$
It is a probability density function in $z^n$ with respect to the $n$-fold product measure of the distribution of $X$ and Lebesgue measure. Then as in the proof of Lemma 2,
$$D(p_f^n \| q_n) = \sum_{i=1}^n E D(p_f \| p_{\tilde{f}_{i-1}}) = \frac{1}{2} \sum_{i=1}^n E \| f - \tilde{f}_{i-1} \|^2.$$
It follows that for $f \in \mathcal{F}(\{b_n^2\}; \delta)$, we have $D(p_f^n \| q_n) \leq \frac{1}{2} \left( A^2 + \sum_{i=1}^{n-1} b_i^2 \right)$. Let $\mathcal{F}$ denote $\mathcal{F}(\{b_n^2\}; \delta)$ for convenience. Choose $\epsilon_n$ such that
$$M(\epsilon_n; \mathcal{F}) = \left\lceil 2 \left( \frac{1}{2} \left( A^2 + \sum_{i=1}^{n-1} b_i^2 \right) + \log 2 \right) \right\rceil = \lceil B_{n-1} \rceil.$$
Then by Lemma 1, we have
$$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \geq \frac{\epsilon_n^2}{8}.$$
By definition of $\mathcal{F}(\{b_n^2\}; \delta)$, we have $\sup_{f \in \mathcal{F}} E \| f - \tilde{f}_n \|^2 \leq b_n^2$, and thus the minimax risk of $\mathcal{F}$ satisfies
$$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \leq b_n^2. \tag{2}$$
From the above, if $M(3 b_n; \mathcal{F}) > \lceil B_{n-1} \rceil$, then $3 b_n \leq \epsilon_n$ and
$$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} E \| f - \hat{f} \|^2 \geq \frac{(3 b_n)^2}{8} > b_n^2,$$
which would contradict (2). Thus we must have $M(3 b_n; \mathcal{F}) \leq \lceil B_{n-1} \rceil$.

For the case $b_n^2 = C n^{-\alpha}$ for some $0 < \alpha < 1$, $B_n \leq A^2 + 2 \log 2 + \frac{C n^{1-\alpha}}{1-\alpha}$. For any $0 < \epsilon \leq 3 C^{1/2}$, there exists $n \geq 1$ such that $3 C^{1/2} (n+1)^{-\alpha/2} \leq \epsilon < 3 C^{1/2} n^{-\alpha/2}$. Then $\left( \frac{3 C^{1/2}}{\epsilon} \right)^{2/\alpha} - 1 < n \leq \left( \frac{3 C^{1/2}}{\epsilon} \right)^{2/\alpha}$. It follows from these inequalities that when $0 < \epsilon \leq 3 C^{1/2}$,
$$M(\epsilon; \mathcal{F}) \leq M\left( 3 C^{1/2} (n+1)^{-\alpha/2}; \mathcal{F} \right) \leq B_n + 1 \leq A^2 + 2 \log 2 + 1 + \frac{C n^{1-\alpha}}{1-\alpha} \leq A^2 + 2 \log 2 + 1 + \frac{C \left(3 C^{1/2}\right)^{2(1-\alpha)/\alpha}}{1-\alpha} \left( \frac{1}{\epsilon} \right)^{2(1-\alpha)/\alpha}.$$
Thus
$$M(\epsilon; \mathcal{F}) \leq C' \left( \frac{1}{\epsilon} \right)^{2(1-\alpha)/\alpha}, \tag{3}$$
where $C'$ is a constant depending only on $\alpha$, $C$ and $A$.

For the case $b_n^2 = C n^{-1}$,
$$B_n = A^2 + 2 \log 2 + C \sum_{i=1}^{n-1} i^{-1} \leq A^2 + 2 \log 2 + C \left( 1 + \log(n-1) \right).$$
Similarly as in the case $0 < \alpha < 1$, we have for $0 < \epsilon \leq 3 C^{1/2}$,
$$M(\epsilon; \mathcal{F}) \leq A^2 + 2 \log 2 + 1 + C \left( 1 + \log \left( \frac{3 C^{1/2}}{\epsilon} \right)^2 \right) \leq A^2 + 2 \log 2 + 1 + C \left( 1 + 2 \log \left(3 C^{1/2}\right) \right) + 2 C \log \frac{1}{\epsilon}. \tag{4}$$
Thus $M(\epsilon; \mathcal{F}) \leq C'' \log \frac{1}{\epsilon}$ for some $C''$ depending only on $C$ and $A$. This completes the proof of Theorem 2.
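To get a feel for the general bound $M(3 b_k; \mathcal{F}(\{b_n^2\}; \delta)) \leq \lceil B_{k-1} \rceil$, the sketch below evaluates it for $b_n^2 = C n^{-\alpha}$ with illustrative values of $A$, $C$ and $\alpha$ (these numbers are assumptions for the example only) and compares the bound with the polynomial order $(1/\epsilon)^{2(1-\alpha)/\alpha}$ appearing in Theorem 2.

```python
import numpy as np

# Evaluate the general bound of Theorem 2: M(3*b_k; F) <= ceil(B_{k-1}) with
# B_k = A^2 + 2*log 2 + sum_{i=1}^k b_i^2, for b_n^2 = C * n^{-alpha}.
# The values of (A, C, alpha) below are illustrative choices only.
A, C, alpha = 1.0, 1.0, 0.5

def bound(k):
    b2 = C * np.arange(1, k) ** (-alpha)            # b_1^2, ..., b_{k-1}^2
    B_km1 = A**2 + 2 * np.log(2) + b2.sum()
    eps = 3 * np.sqrt(C) * k ** (-alpha / 2)        # separation level 3 * b_k
    return eps, np.ceil(B_km1)

for k in [10, 100, 1000, 10000]:
    eps, M = bound(k)
    rate = (1 / eps) ** (2 * (1 - alpha) / alpha)   # the polynomial order in Theorem 2
    print(f"k={k:6d}  eps={eps:.4f}  entropy bound={M:8.0f}  (1/eps)^(2(1-a)/a)={rate:10.1f}")
```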
It is well-known that for smoothness function classes (e.g., Sobolev or Besov), the minimax rate of convergence under the squared $L_2$ loss is usually determined by a certain smoothness parameter $r$ (e.g., the number of derivatives that the unknown regression function is assumed to have), with the rate $n^{-2r/(2r+d)}$, where $d$ is the dimension of the function being estimated. The metric entropy order of such a class is typically $(1/\epsilon)^{d/r}$ as $\epsilon \to 0$. Note that for $\alpha = 2r/(2r+d)$ we have $1 - \alpha = d/(2r+d)$ and hence $2(1-\alpha)/\alpha = d/r$, and accordingly the entropy upper bound given in Theorem 2 for $\mathcal{F}(\{b_n^2\}; \delta)$ is of order $(1/\epsilon)^{d/r}$. Notice that this order matches the metric entropy of smoothness classes with convergence rate $n^{-2r/(2r+d)}$. As is well-known, the convergence rate $1/n$ corresponds to parametric classes, and they usually have metric entropies of order $\log(1/\epsilon)$, which is the order given in Theorem 2 for $\alpha = 1$. Thus, in terms of order, the upper bounds in Theorem 2 cannot be generally improved.

From Theorem 2, no matter how carefully a regression procedure is constructed, it can converge fast for only a limited set of regression functions. Barron and Hengartner (1998) study super-efficiency in density estimation for both parametric and nonparametric classes. For a nonparametric class of densities, they show that for any given estimation procedure, the set of densities for which the procedure converges faster than the minimax rate of the class is asymptotically negligible compared to the whole class in terms of metric entropy order. The metric entropy bound in Theorem 2 can also be used to readily derive such a result for the regression problem.
4 Implications on adaptive estimation

The non-asymptotic nature of the upper bound in Theorem 2 is helpful for drawing some conclusions on limitations of adaptive estimation. Traditionally, adaptive estimation addresses the objective of achieving the minimax risk (often the rate of convergence) over smoothness function classes with the smoothness parameter unknown. As mentioned earlier, different values of the smoothness parameter are usually associated with different rates of convergence. The function classes have different sizes (in terms of metric entropy) and are usually nested. Yang (2000a, b, c) shows that an adaptive rate of convergence can be obtained for a general countable collection of function classes. The function classes are allowed to be completely different, which may be desirable when very distinct scenarios are explored in situations such as high-dimensional estimation.

Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be a countable collection of regression classes. Assume that the regression functions in the classes are uniformly bounded between $-A$ and $A$ for some $A > 0$, and the variance parameter $\sigma^2$ is upper bounded by a known constant $\bar{\sigma}^2$. Let $\{\pi_j, j \geq 1\}$ be positive numbers satisfying $\sum_{j=1}^{\infty} \pi_j = 1$. Yang (2000c) shows that one can construct an adaptive estimator $\hat{f}_n$ such that for each $j \geq 1$,
$$\sup_{f \in \mathcal{F}_j} E \| f - \hat{f}_n \|^2 \leq C \left( \frac{1}{n} \log \frac{1}{\pi_j} + \frac{1}{n} + R(\mathcal{F}_j; \lfloor n/2 \rfloor) \right), \tag{5}$$
where the constant $C$ depends only on $A$ and $\bar{\sigma}$. See the appendix for an adaptation risk bound that implies the above conclusion.

For a typical parametric or nonparametric function class, the minimax risk sequence is rate-regular in the sense that $R(\mathcal{F}_j; \lfloor n/2 \rfloor)$ and $R(\mathcal{F}_j; n)$ converge at the same order. If the regression function classes $\mathcal{F}_j$ are all rate-regular, then from (5), since for each fixed $j$ the term $\frac{1}{n} \log \frac{1}{\pi_j} + \frac{1}{n}$ does not change the rate of $R(\mathcal{F}_j; n)$, we conclude that the minimax rate of convergence is automatically achieved for every class $\mathcal{F}_j$. This is called minimax-rate adaptation. This notion of adaptation addresses the asymptotic performance for each class as $n \to \infty$. However, for each fixed $n$, since for a countable collection of classes $\pi_j$ necessarily goes to zero, we have $\frac{1}{n} \log \frac{1}{\pi_j} \to \infty$ as $j \to \infty$. Thus the penalty term $\frac{1}{n} \log \frac{1}{\pi_j}$ in (5) is large except for a few classes with high prior weights. A question then is: Can one construct a better adaptive method such that, for a constant $C$,
$$\sup_{f \in \mathcal{F}_j} E \| f - \hat{f} \|^2 \leq C R(\mathcal{F}_j; \lfloor n/2 \rfloor)$$
for all $j \geq 1$ and $n \geq 1$? If so, the adaptive estimator achieves the minimax risk up to a constant factor uniformly over both the classes and the sample sizes. Based on the known fact that no uniform rate of convergence is possible, it seems intuitively clear that the objective is simply too ambitious to achieve in general. We give a formal result below using the metric entropy bound of the previous section.

Corollary 1: There exist a collection of uniformly bounded classes of regression functions $\{\mathcal{F}_j, j \geq 1\}$ and a constant $B > 0$ such that for any regression procedure $\delta$ and each $\lambda > 1$, there can be at most $e^{B\lambda}$ many classes for which
$$\frac{\sup_{f \in \mathcal{F}_j} R(f; n; \delta)}{R(\mathcal{F}_j; n)} \leq \lambda \quad \text{for all } n \geq 1.$$

Remark: With a similar argument, it can be shown that if one is willing to lose a logarithmic factor $\log n$ in risk for each $n \geq 1$, then in general one can achieve this for not more than $n^{\gamma}$ many classes for some constant $\gamma > 0$.

Proof: Consider $X = (X_1, X_2, \ldots)$ taking values in $\mathcal{X} = [0,1]^{\infty}$ with independent and uniformly distributed components. Let $\mathcal{G} = \{ g(x) : g(x) = \theta x, \ 1 \leq \theta \leq 2 \}$ be a parametric class of functions on $[0,1]$. Let $\mathcal{F}_j$ be the collection of functions on $\mathcal{X}$ that actually depend only on $x_j$, with the univariate function belonging to $\mathcal{G}$. For the parametric family $\mathcal{F}_j$, it is not hard to show that $\frac{C_1}{n} \leq R(\mathcal{F}_j; n) \leq \frac{C_2}{n}$ for some positive constants $C_1$ and $C_2$. Note that the functions in different classes are well separated: for $f_1 \in \mathcal{F}_{j_1}$ and $f_2 \in \mathcal{F}_{j_2}$ with $j_1 \neq j_2$, one always has $\| f_1 - f_2 \| \geq 1/6$. As a consequence, the $\epsilon$-packing sets in different classes $\mathcal{F}_j$ are at least $1/6$ away from each other, and accordingly, the packing entropy of $\cup_{j \geq 1} \mathcal{F}_j$ is infinite when $\epsilon$ is smaller than $1/6$. Fix a constant $\lambda > 1$.
For any given regression procedure $\delta$, consider the classes $\mathcal{F}_j$ for which
$$\sup_{f \in \mathcal{F}_j} R(f; n; \delta) \leq \frac{\lambda C_2}{n} \quad \text{for all } n \geq 1.$$
Let $\Gamma$ be the collection of all such classes. Then we have
$$\sup_{f \in \cup_{j \in \Gamma} \mathcal{F}_j} R(f; n; \delta) \leq \frac{\lambda C_2}{n} \quad \text{for all } n \geq 1.$$
Thus $\cup_{j \in \Gamma} \mathcal{F}_j \subset \mathcal{F}(\{\lambda C_2 / n\}; \delta)$. It follows from Theorem 2 and (4) that $\cup_{j \in \Gamma} \mathcal{F}_j$ has packing entropy bounded above by $2 \lambda C_2 \log(1/\epsilon)$ asymptotically as $\epsilon \to 0$. It can be easily verified that each $\mathcal{F}_j$ has metric entropy uniformly lower bounded by $C_3 \log(1/\epsilon)$ for some constant $C_3 > 0$. Since the classes $\mathcal{F}_j$ are well separated, when $\epsilon < 1/6$, the maximum $\epsilon$-packing number of $\cup_{j \in \Gamma} \mathcal{F}_j$ is the sum of the maximum $\epsilon$-packing numbers of the classes. It follows that the packing entropy of $\cup_{j \in \Gamma} \mathcal{F}_j$ is the packing entropy of $\mathcal{G}$ plus the logarithm of the size of $\Gamma$. As a consequence, the logarithm of the size of $\Gamma$ is upper bounded by $B \lambda$ for a constant $B > 0$. Thus $|\Gamma| \leq e^{B\lambda}$. This completes the proof of Corollary 1.

From Corollary 1, adaptation up to a uniform constant factor cannot be achieved in general except for finitely many function classes. Note that the result does not exclude the possibility of adaptation within a constant factor for particular collections of function classes. See Barron, Birge and Massart (1999) for examples of such adaptation results for some smoothness classes.
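As a small sanity check on the separation claim used in the proof (that functions in different classes are at least $1/6$ apart in $\| \cdot \|$), the following sketch evaluates the exact squared $L_2(P_X)$ distance between $\theta_1 x_{j_1}$ and $\theta_2 x_{j_2}$ over a grid of $(\theta_1, \theta_2) \in [1,2]^2$; the closed form used in the code follows from $E X^2 = 1/3$ and $(EX)^2 = 1/4$ for a Uniform$[0,1]$ coordinate.

```python
import numpy as np

# Separation check for the construction in Corollary 1: for f1(x) = t1 * x_{j1}
# and f2(x) = t2 * x_{j2} with j1 != j2, t1, t2 in [1, 2], and independent
# Uniform[0,1] coordinates, the squared L2(P_X) distance has the closed form
# (t1^2 + t2^2)/3 - t1*t2/2.
t = np.linspace(1.0, 2.0, 1001)
t1, t2 = np.meshgrid(t, t)
sq_dist = (t1**2 + t2**2) / 3.0 - t1 * t2 / 2.0

print(f"min squared distance = {sq_dist.min():.4f}  (attained at t1=t2=1: 2/3 - 1/2 = 1/6)")
print(f"min distance         = {np.sqrt(sq_dist.min()):.4f}  >= 1/6 as claimed")
```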
5 Sparse approximation and estimation

In recent years, sparse approximation has attracted increasing attention in the field of function estimation (see, e.g., Donoho (1993), Donoho and Johnstone (1998), Yang and Barron (1998, 1999), Johnstone (1999) and Barron, Birge and Massart (1999)). It has been shown that sparse approximation together with suitable statistical methods leads to better estimation compared to traditional linear approximation in situations such as orthogonal wavelet expansion and neural network modeling. In this section, we show that in some sense (to be made clear), each regression procedure is essentially no better than a method based on sparse approximation.

Let $k$ be an index. For each $k$, let $\Phi_k = \{ \varphi_{k,1}, \ldots, \varphi_{k,L_k} \}$ be a collection of $L_k$ linearly independent functions. Given $k$, $1 \leq m \leq L_k$, and $I = I_{k,m} = \{ i_1, \ldots, i_m \}$ a subset of $\{1, 2, \ldots, L_k\}$ with $m$ terms in $\Phi_k$, consider approximation of a function $f$ by linear combinations
$$\sum_{l=1}^m \theta_l \varphi_{k, i_l}, \quad (\theta_1, \ldots, \theta_m) \in \mathbb{R}^m.$$
When $m$ is small compared to $L_k$, the terms used in the linear combination form a sparse subset of $\Phi_k$. This sparsity in fact is the key to improved accuracy in estimation compared to traditional linear approximation using all $L_k$ terms, when $f$ has certain sparsity characteristics. We call such approximation, which allows the use of sparse linear combinations, sparse approximation. In general, the choices of approximation systems $\{\Phi_k\}$ are also allowed to depend on the sample size.
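To make the setup concrete, here is a minimal sketch of fitting a single sparse subset model by least squares; the cosine dictionary, the particular subset $I$, and the data-generating function are all illustrative assumptions, not part of the theory above.

```python
import numpy as np

# Fit one sparse subset model: given a dictionary Phi_k = {phi_{k,1}, ..., phi_{k,L_k}}
# (here a cosine system, an illustrative choice) and a subset I = {i_1, ..., i_m},
# regress Y on the m selected terms by ordinary least squares.
rng = np.random.default_rng(1)
n, L_k = 200, 50
X = rng.uniform(0, 1, n)
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)   # assumed true f
Y = f(X) + 0.3 * rng.normal(size=n)

def phi(j, x):                      # dictionary functions phi_{k,j}, j = 1..L_k
    return np.cos(np.pi * j * x)

I = [1, 2, 3, 6]                    # a sparse subset of {1, ..., L_k}, m = 4 terms
design = np.column_stack([phi(j, X) for j in I])
theta_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)

fhat = lambda x: np.column_stack([phi(j, x) for j in I]) @ theta_hat
grid = np.linspace(0, 1, 1000)
print("approx. squared L2 error of this subset fit:",
      float(np.mean((f(grid) - fhat(grid)) ** 2)))
```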
For estimating a regression function $f$ based on the data $(X_i, Y_i)_{i=1}^n$, one can use sparse subset models corresponding to sparse approximation. For each choice of $(k, m, I)$, one fits the model
$$Y_i = \sum_{l=1}^m \theta_l \varphi_{k, i_l}(X_i) + \varepsilon_i, \quad 1 \leq i \leq n,$$
based on the observations. The use of sparse subsets can be advantageous in terms of estimation accuracy when a small number of terms can provide a good approximation of $f$, since using a sparse subset avoids the large variability that arises when all $L_k$ coefficients are estimated. Since one does not know which subset provides a good approximation, one may select a model according to a certain appropriate criterion. For some recent theoretical results on model selection for nonparametric regression, see, e.g., Yang (1999), Lugosi and Nobel (1999), and Barron, Birge and Massart (1999). As an alternative to selecting a single model, one could also average the sparse subset models. Proper averaging has been demonstrated to lead to reduced model uncertainty (e.g., Chatfield (1995), Hoeting et al (1999)), increased stability of the estimator (Breiman (1996)) and improved estimation accuracy in risk (Yang (2000c)) from different angles.

In this section, we assume that $P_X$ is dominated by a known probability measure $\nu$ with a density function $p_X(x)$ which is uniformly bounded above and below (away from zero). We assume $\sigma^2$ is upper bounded by a known constant $\bar{\sigma}^2 < \infty$.

Theorem 3: For any given regression procedure $\delta$, there exists a procedure $\tilde{\delta}$ based on sparse approximation (with a proper model averaging) such that for any regression function $f$ with $\| f \|_\infty < \infty$: if $R(f; n; \delta) \leq C n^{-\alpha}$ under $\sigma^2 = \bar{\sigma}^2$ for all $n$ for some constants $C > 0$ and $0 < \alpha < 1$, then $R(f; n; \tilde{\delta}) \leq \tilde{C} n^{-\alpha}$ holds under $\sigma^2 \leq \bar{\sigma}^2$ for all $n$ for some constant $\tilde{C} > 0$; and if $R(f; n; \delta) \leq C n^{-1}$ under $\sigma^2 = \bar{\sigma}^2$ for all $n$ for some constant $C > 0$, then $R(f; n; \tilde{\delta}) \leq \tilde{C} n^{-1} \log n$ holds under $\sigma^2 \leq \bar{\sigma}^2$ for some constant $\tilde{C} > 0$.

Remarks:

1. The sparse approximation systems constructed in the theorem depend on the procedure $\delta$.

2. The assumption on $P_X$ is needed so that the procedure $\tilde{\delta}$ can be constructed (in theory) based on $\delta$.

3. If there exists an estimator of $\sigma^2$ converging at rate $1/n$ under the squared error, then the condition that $\sigma^2$ is upper bounded by a known constant $\bar{\sigma}^2 < \infty$ is not needed.

The theorem says that as far as polynomial rates of convergence are concerned, under the squared $L_2$ loss, theoretically speaking, estimation based on a certain sparse approximation together with model averaging can do as well as any given regression procedure (though losing a logarithmic factor for the parametric rate of convergence).

Proof of Theorem 3: We first assume that $P_X$ is known. Without loss of generality, assume $\bar{\sigma}^2 = 1$. For a given regression procedure $\delta$, for each $C$ and $0 < \alpha \leq 1$, from Section 3, the set $\mathcal{F}(\{b_n^2\}; \delta)$ of regression functions as defined in (1) with $b_n^2 = C n^{-\alpha}$ has metric entropy bounded above by order $(1/\epsilon)^{2(1-\alpha)/\alpha}$ when $0 < \alpha < 1$ and by order $\log(1/\epsilon)$ when $\alpha = 1$. Note that when $P_X$ is known and $\sigma^2 = 1$, the set $\mathcal{F}(\{b_n^2\}; \delta)$ can be identified (theoretically speaking). We now denote $\mathcal{F}(\{b_n^2\}; \delta)$ by $\mathcal{F}(A; \{b_n^2\}; \delta)$ in this section, since different values will be considered for $A$.

For $0 < \alpha < 1$, with $\epsilon_n = C^{1/2} n^{-\alpha/2}$, from the derivation in the proof of Theorem 2, we have
$$M\left( \epsilon_n; \mathcal{F}(A; \{b_n^2\}; \delta) \right) \leq C' n^{1-\alpha},$$
where $C'$ can be taken as $\frac{A^2 + 2 \log 2 + C}{1 - \alpha} + 1$. Let $s = (A, C, \alpha)$. It follows that, with $P_X$ known, we can find (again theoretically speaking) a covering set $N(A, C, \alpha) = \{ f_{s,1}, \ldots, f_{s,J_s} \}$, with the $f_{s,i}$ bounded between $-A$ and $A$ and $J_s \leq \exp(C' n^{1-\alpha})$, such that for any $f \in \mathcal{F}(A; \{b_n^2\}; \delta)$ there exists $1 \leq i \leq J_s$ with $\| f - f_{s,i} \| \leq 2 \epsilon_n$. Similarly, when $\alpha = 1$, we can find a covering set $N(A, C, \alpha) = \{ f_{s,1}, \ldots, f_{s,J_s} \}$, with the $f_{s,i}$ bounded between $-A$ and $A$ and $J_s \leq n^{C''}$ for some constant $C'' > 0$, such that for any $f \in \mathcal{F}(A; \{b_n^2\}; \delta)$ there exists $1 \leq i \leq J_s$ with $\| f - f_{s,i} \| \leq 2 C^{1/2} n^{-1/2}$. Let $\Phi_s = N(A, C, \alpha)$.

Let $S = \{ s = (A, C, \alpha) : A \in (0, \infty), \ C \in (0, \infty), \ \alpha \in (0, 1] \}$. Let $Q$ denote the set of positive dyadic rationals. Let $S_D$ consist of all the points $(A, C, \alpha)$ in $S$ with $A, C, \alpha \in Q$. Now consider $T = \{ t = (s, j) : 1 \leq j \leq J_s, \ s \in S_D \}$. Clearly $T$ is countable. Let $\delta_t$, $t \in T$, be the procedure that gives the estimator $\hat{f}_{\delta_t, i} \equiv f_{s,j}$ for all $1 \leq i \leq n$. Then for $f \in \mathcal{F}(A; \{C n^{-\alpha}\}; \delta)$ with $s \in S_D$, there exists $1 \leq j \leq J_s$ such that
$$R(f; n; \delta_t) \leq 4 C n^{-\alpha}. \tag{6}$$
The procedures $\{\delta_t\}$ will be combined appropriately to obtain a small risk. We assign prior weights $\{\pi_t, t \in T\}$ based on a description of the index $t$ according to information theory (see, e.g., Rissanen (1983)). Every positive dyadic rational number $q$ can be written as $q = i(q) + \sum_{j=1}^{l} a_j(q) 2^{-j}$ for some $l \geq 1$, with the $a_j$'s being either 0 or 1 and $i(q)$ the integer part of $q$. To describe such a $q$, we just need to describe the integers $i$ and $l$ and the $a_j$'s. To describe an integer $i \geq 0$, we may use $\log^*(i) := \log_2(i+1) + 2 \log_2(\log_2(i+2))$ bits. Then describe $l$ using $\log^*(l)$ bits, and finally describe the $a_j$'s using $l$ bits. In this way, we describe the hyper-parameter components $A$, $C$, $\alpha$, and use $\log_2 J_s$ bits to describe $j$ for $t \in T$. The total description length for $t$ then is
$$\log^*(i(A)) + \log^*(l(A)) + l(A) + \log^*(i(C)) + \log^*(l(C)) + l(C) + \log^*(i(\alpha)) + \log^*(l(\alpha)) + l(\alpha) + \log_2 J_s.$$
The prior weight of the procedure $\delta_t$ in the countable collection is then assigned to be $\pi_t$ with $-\log_2 \pi_t$ equal to the above expression. The coding interpretation guarantees that $\{\pi_t : t \in T\}$ is a sub-probability (see, e.g., Cover and Thomas (1991, p. 52)), i.e., $\sum_{t \in T} \pi_t \leq 1$. One can either normalize $\pi_t$ to be a probability or put the remaining probability on any chosen procedure $\delta_t$, which does not have any effect on rates of convergence. Now we combine these procedures based on Proposition 0 in the appendix, using the prior weights described above, to get $\hat{f}_n$.
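The following minimal sketch computes this description length and the induced prior weight for one index $t$; the particular values of $(A, C, \alpha)$, the number of fractional binary digits, and the covering-set size $J_s$ are illustrative assumptions.

```python
import numpy as np

# Description-length prior: describe a positive dyadic rational
# q = i + sum_{j=1}^{l} a_j 2^{-j} by coding the integer part i, the number of
# binary digits l, and the digits themselves, with integers coded in
# log*(i) := log2(i+1) + 2*log2(log2(i+2)) bits.
def log_star(i):
    return np.log2(i + 1) + 2 * np.log2(np.log2(i + 2))

def dyadic_length(q, l):
    """Bits to describe q = integer part i plus l binary fractional digits."""
    i = int(np.floor(q))
    return log_star(i) + log_star(l) + l

A, C, alpha = 1.0, 1.0, 0.5           # hyper-parameters, here with l = 4 digits each
J_s = 1000                            # assumed size of the covering set for s = (A, C, alpha)
total_bits = (dyadic_length(A, 4) + dyadic_length(C, 4) + dyadic_length(alpha, 4)
              + np.log2(J_s))
pi_t = 2.0 ** (-total_bits)           # prior weight with -log2(pi_t) = description length
print(f"description length = {total_bits:.2f} bits,  prior weight pi_t = {pi_t:.3e}")
```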
Note that the combined estimator $\hat{f}_n$ is a convex combination of the $f_{s,j}$'s with $1 \leq j \leq J_s$, $s \in S_D$. It is an estimator based on the sparse approximation systems $\{\Phi_s : s \in S_D\}$. The risk of the combined procedure $\tilde{\delta}$ satisfies that for any $f$ with $\| f \|_\infty \leq A$ ($A$ need not be known), if $\sigma^2 \leq \bar{\sigma}^2$, then
$$E \| f - \hat{f}_n \|^2 \leq C_{A, \bar{\sigma}} \inf_{t \in T} \left( \frac{1}{n} \log \frac{1}{\pi_t} + \frac{1}{n} + R(f; \lfloor n/2 \rfloor; \delta_t) \right). \tag{7}$$
Fix $s_0 = (A_0, C_0, \alpha_0) \in S$ with $0 < \alpha_0 < 1$. For each $m \geq -\log_2 \alpha_0$, there exists $s^{(m)} = (A^{(m)}, C^{(m)}, \alpha^{(m)}) \in S_D$ such that $A_0 \leq A^{(m)} \leq A_0 + 2^{-m}$, $C_0 \leq C^{(m)} \leq C_0 + 2^{-m}$, and $\alpha_0 - 2^{-m} < \alpha^{(m)} \leq \alpha_0$. For $t = (s^{(m)}, j)$ with $1 \leq j \leq J_{s^{(m)}}$, noting that $i(\alpha) = 0$ for $0 < \alpha < 1$, the prior weight satisfies
$$\log \frac{1}{\pi_t} \leq \log^*(A_0 + 1) + \log^*(C_0 + 1) + 1 + 3 \log^*(m) + 3m + C' n^{1 - \alpha^{(m)}} \leq \bar{C} m + C' n^{1 - \alpha^{(m)}}, \tag{8}$$
where $\bar{C}$ is a constant depending on $A_0$ and $C_0$. Note that $f \in \mathcal{F}(A_0; \{C_0 n^{-\alpha_0}\}; \delta)$ implies that $f \in \mathcal{F}(A^{(m)}; \{C^{(m)} n^{-\alpha^{(m)}}\}; \delta)$. Then from (6), (7) and (8), we have that for $f \in \mathcal{F}(A_0; \{C_0 n^{-\alpha_0}\}; \delta)$,
$$R(f; n; \tilde{\delta}) \leq \tilde{C} \left( \frac{m}{n} + n^{-\alpha^{(m)}} \right),$$
where $\tilde{C}$ is a constant depending on $A_0$, $C_0$ and $\bar{\sigma}$. Taking $m$ of order $\log n$, and observing that $n^{-\alpha^{(m)}}$ is then of order $n^{-\alpha_0}$, we have that for $f \in \mathcal{F}(A_0; \{C_0 n^{-\alpha_0}\}; \delta)$, $R(f; n; \tilde{\delta}) \leq \tilde{C} n^{-\alpha_0}$ for some constant $\tilde{C}$ not depending on $n$.

For $s_0 = (A_0, C_0, \alpha_0) \in S$ with $\alpha_0 = 1$, there exists $s_1 = (A_1, C_1, 1) \in S_D$ such that $A_0 \leq A_1$ and $C_0 \leq C_1$. For $t = (s_1, j)$ with $1 \leq j \leq J_{s_1}$, applying the metric entropy bound in Theorem 2 for the case $\alpha = 1$, we have that the prior weight satisfies
$$\log \frac{1}{\pi_t} \leq C''' \log n,$$
where $C'''$ is a constant depending only on $s_1$. It follows, similarly as in the case $0 < \alpha_0 < 1$, that for $f \in \mathcal{F}(A_0; \{C_0 n^{-1}\}; \delta)$, when $\sigma^2 \leq \bar{\sigma}^2$, we have $R(f; n; \tilde{\delta}) \leq \tilde{C} \log n / n$ for some constant $\tilde{C} > 0$ not depending on $n$.

Now we assume that $P_X$ is unknown but known to have a probability density $p_X(x)$ with respect to a probability measure $\nu$, with $p_X$ bounded above and away from zero. Then for any function $g$,
$$\underline{C} \int g(x)^2 \, \nu(dx) \leq \int g(x)^2 \, P_X(dx) \leq \overline{C} \int g(x)^2 \, \nu(dx)$$
for some constants $0 < \underline{C} \leq \overline{C} < \infty$. It follows that $R(f; n; \delta)$ under the design distribution $P_X$ is bounded above and below by multiples of $R(f; n; \delta)$ under the design distribution $\nu$. One can then modify the derivation above slightly to show that the conclusion still holds under the relaxed condition on $P_X$. This completes the proof of Theorem 3.
6 Appendix: An adaptation risk bound

Let $\{\delta_j, j \geq 1\}$ be a countable (or finite) collection of regression procedures for the regression problem in Section 3. Let $\hat{f}_{j,i}$, $j \geq 1$, $i \geq 1$, denote the estimator produced by $\delta_j$ at sample size $i$. Assume that the error variance $\sigma^2$ is upper bounded by a known constant $\bar{\sigma}^2 < \infty$. Let $\{\pi_j : j \geq 1\}$ be chosen prior weights on the procedures with $\pi_j > 0$ and $\sum_j \pi_j = 1$. Yang (2000c) proposed an algorithm called three-stage ARM to combine the procedures. Let $\delta$ denote the combined procedure and let $\hat{f}_i$ be the corresponding estimator at sample size $i$, $i \geq 1$.

Proposition 0: Assume that $f$ is uniformly bounded with $\| f \|_\infty \leq A$ ($A$ need not be known). Then the risk of the combined procedure satisfies
$$E \| f - \hat{f}_n \|^2 \leq C \inf_j \left( \frac{1}{n} \log \frac{1}{\pi_j} + \frac{1}{n} + E \| f - \hat{f}_{j, \lfloor n/2 \rfloor} \|^2 \right),$$
where the constant $C$ depends only on $A$ and $\bar{\sigma}^2$.

Note that the risk bound above involves the individual risks of the original procedures at sample size $\lfloor n/2 \rfloor$ and the prior weights. Usually $E \| f - \hat{f}_{j, \lfloor n/2 \rfloor} \|^2$ is larger than $E \| f - \hat{f}_{j,n} \|^2$, but they converge at the same rate. When there are only a few candidate procedures, one could assign uniform weights. When there are a large number of procedures, or even countably many, to be combined, the assignment of weights becomes very important in order to have optimal convergence properties.
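As a rough illustration of combining candidate procedures with prior weights by data splitting, here is a minimal sketch; it uses a simple exponential-weighting step on a held-out half of the data and is only in the spirit of Proposition 0, not the actual three-stage ARM algorithm of Yang (2000c). The candidate procedures (polynomial fits), the prior weights, and the noise level are assumptions made for this example.

```python
import numpy as np

# Simplified combination by data splitting: fit each candidate procedure on the
# first half of the sample, weight the fits by prior * exp(-RSS / (2 sigma^2))
# evaluated on the second half, then average.  NOT the three-stage ARM algorithm.
rng = np.random.default_rng(2)
n = 400
X = rng.uniform(0, 1, n)
f = lambda x: np.sin(2 * np.pi * x)                  # assumed true regression function
Y = f(X) + 0.5 * rng.normal(size=n)

def fit_poly(x, y, deg):                             # candidate procedure j: degree-j fit
    coefs = np.polyfit(x, y, deg)
    return lambda t: np.polyval(coefs, t)

half = n // 2
fits = [fit_poly(X[:half], Y[:half], j) for j in range(10)]
prior = np.array([2.0 ** -(j + 1) for j in range(10)])   # prior weights pi_j = 2^{-(j+1)}

sigma2 = 0.25
rss = np.array([np.sum((Y[half:] - g(X[half:])) ** 2) for g in fits])
w = prior * np.exp(-(rss - rss.min()) / (2 * sigma2))    # subtract min for stability
w /= w.sum()

f_combined = lambda t: sum(wj * g(t) for wj, g in zip(w, fits))
grid = np.linspace(0, 1, 1000)
print("squared L2 error of combined estimator:",
      float(np.mean((f(grid) - f_combined(grid)) ** 2)))
```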
References
[1] Barron, A.R. (1987). Are Bayes rules consistent in information? In Open Problems in Communication and Computation (T. M. Cover and B. Gopinath, eds.), pp. 85-91. Springer-Verlag.
[2] Barron, A.R., Birge, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113, 301-413.
[3] Barron, A.R. and Hengartner, N. (1998). Information theory and superefficiency. Ann. Statist., 26, 1800-1825.
[4] Birge, L. (1986). On estimating a density using Hellinger distance and some other strange facts. Probability Theory and Related Fields, 71, 271-291.
[5] Bretagnolle, J. and Huber, C. (1979). Estimation des densités: risque minimax. Z. Wahrscheinlichkeitstheor. Verw. Geb., 47, 119-137.
[6] Chatfield, C. (1995). Model uncertainty, data mining and statistical inference. J. Roy. Statist. Soc., Ser. A, 158, 419-444.
[7] Devroye, L. (1982). Necessary and sufficient conditions for the pointwise convergence of nearest neighbor regression function estimates. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 467-481.
[8] Devroye, L. and Wagner, T.J. (1980). Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist., 8, 231-239.
[9] Donoho, D.L. (1993). Unconditional bases are optimal bases for data compression and for statistical estimation. Applied and Computational Harmonic Analysis, 1, 100-115.
[10] Donoho, D.L. and Johnstone, I.M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist., 26, 879-921.
[11] Efromovich, S.Yu. and Pinsker, M.S. (1984). A self-educating nonparametric filtration algorithm. Automation and Remote Control, 45, 58-65.
[12] Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian model averaging: a tutorial. Statistical Science, 14.
[13] Ibragimov, I.A. and Hasminskii, R.Z. (1977). On the estimation of an infinite-dimensional parameter in Gaussian white noise. Soviet Math. Dokl., 18, 1307-1309.
[14] Johnstone, I. (1999). Function Estimation in Gaussian Noise: Sequence Models. Book manuscript.
[15] Kolmogorov, A.N. and Tihomirov, V.M. (1959). ε-entropy and ε-capacity of sets in function spaces. Uspehi Mat. Nauk, 14, 3-86.
[16] Lorentz, G.G., Golitschek, M.v. and Makovoz, Y. (1996). Constructive Approximation: Advanced Problems. Springer, New York.
[17] Lugosi, G. and Nobel, A. (1999). Adaptive model selection using empirical complexities. Ann. Statist., 27, 1830-1864.
[18] Stone, C.J. (1977). Consistent nonparametric regression. Ann. Statist., 8, 1348-1360.
[19] Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10, 1040-1053.
[20] Yang, Y. (1999). Model selection for nonparametric regression. Statistica Sinica, 9, 475-499.
[21] Yang, Y. (2000a). Mixing strategies for density estimation. Ann. Statist., 28, 75-87.
[22] Yang, Y. (2000b). Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74, 135-161.
[23] Yang, Y. (2000c). Adaptive regression by mixing. Journal of the American Statistical Association, to appear.
[24] Yang, Y. and Barron, A.R. (1998). An asymptotic property of model selection criteria. IEEE Trans. on Information Theory, 44, 95-116.
[25] Yang, Y. and Barron, A.R. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist., 27, 1564-1599.
[26] Yatracos, Y.G. (1988). A lower bound on the error in nonparametric regression type problems. Ann. Statist., 16, 1180-1187.