Adaptive Estimation of a Distribution Function and its Density in Sup-Norm Loss by Wavelet and Spline Projections
arXiv:0805.1404v1 [math.ST] 9 May 2008
Evarist Giné and Richard Nickl
University of Connecticut
May 2008

Abstract

Given an i.i.d. sample from a distribution F on R with uniformly continuous density p_0, purely data-driven estimators are constructed that efficiently estimate F in sup-norm loss and simultaneously estimate p_0 at the best possible rate of convergence over Hölder balls, also in sup-norm loss. The estimators are obtained by applying a model selection procedure close to Lepski's method, with random thresholds, to projections of the empirical measure onto spaces spanned by wavelets or B-splines. Explicit constants in the asymptotic risk of the estimator are obtained, as well as oracle-type inequalities in sup-norm loss. The random thresholds are based on suprema of Rademacher processes indexed by wavelet or spline projection kernels. This requires Bernstein analogues of the inequalities in Koltchinskii (2006) for the deviation of suprema of empirical processes from their Rademacher symmetrizations.

MSC 2000 subject classification: Primary 62G07; Secondary 60F05.
Key words and phrases: adaptive estimation, Rademacher processes, sup-norm loss, wavelet estimator, spline estimator, oracle inequality, Lepski's method.
1 Introduction
If X_1, ..., X_n are i.i.d. with unknown distribution function F on R, then classical results of mathematical statistics establish optimality of the empirical distribution function F_n as an estimator of F. That is to say, if we assume no a priori knowledge whatsoever about F, and equip the set of all probability distribution functions with some natural loss function, such as sup-norm loss, then F_n is asymptotically sharp minimax for estimating F. (The same is true even if more is known about F, for instance if F is known to have a uniformly continuous density.) However, this does not preclude the existence of other estimators that are also asymptotically minimax for estimating F in sup-norm loss but improve upon F_n in other respects. In particular, if one believes that F is absolutely continuous, then one may want to simultaneously obtain a reasonable estimate of the density of F. What we have in mind as a 'reasonable estimate' of the density of F is a purely data-driven adaptive estimator that achieves the best rates of convergence in some relevant loss function over some prescribed classes of densities. Our goal in the present article is to construct density estimators that satisfy the two properties just described, more concretely, the functional central limit theorem (CLT) for the distribution function and adaptation in sup-norm loss to the unknown smoothness of the density,
assuming at most uniform continuity for the density of F, and involving reasonable constants in both the risk bounds and the estimation procedure. To achieve adaptation one can opt for several approaches, all of which are related. Among them we mention the penalization method of Barron, Birgé and Massart (1999), wavelet thresholding (Donoho, Johnstone, Kerkyacharian and Picard (1996)), and Lepski's (1991) method. Our choice for the goal at hand consists of using Lepski's method, with random thresholds, applied to wavelet and spline projection estimators of a density. The linear estimators underlying our procedure are projections of the empirical measure onto spaces spanned by wavelets, and wavelet theory is central to some of the derivations of this article. The wavelets most commonly used in statistics are those that are compactly supported (e.g., Daubechies' wavelets), and our results readily apply to these. However, for computational and other purposes, projections onto spline spaces are also interesting candidates for the estimators. Density estimators obtained from projecting the empirical measure onto Schoenberg spaces spanned by B-splines were studied by Huang and Studden (1993). As is well known in wavelet theory, the Schoenberg spline spaces with equally spaced knots have an orthonormal basis consisting of the Battle-Lemarié wavelets, so that the spline projection estimator is in fact exactly equal to the wavelet estimator based on Battle-Lemarié wavelets. These wavelets do not have compact support, but they enjoy exponential decay at infinity. Although we cannot handle exponentially decaying wavelets in general, we can still work with Battle-Lemarié wavelets because the B-spline expansion of the projections allows us to show that the relevant classes of functions are of Vapnik-Chervonenkis type, so that empirical process techniques can be applied.
In particular, the adaptive estimators we devise in Theorem 3 may be based either on spline projections or on compactly supported wavelets. And in the process of proving the main theorem, we also provide new asymptotic results for spline projection density estimators similar to those for wavelet estimators in Giné and Nickl (2007). We need to use Talagrand's inequality with sharp constants (Bousquet (2003), Klein and Rio (2005)) in the proofs, but to do this, we have to estimate the expectation of suprema of certain empirical processes that appear in the centering of Talagrand's inequality. The use of entropy-based moment inequalities for empirical processes typically results in too conservative constants (e.g., in Giné and Nickl (2008)). In order to remedy this problem, we adapt recent ideas due to Koltchinskii (2001, 2006) and Bartlett, Boucheron and Lugosi (2002) to density estimation: the entropy-based moment bounds are replaced by the sup-norm of the associated Rademacher averages, which are, with high probability, better estimates of the expected value of the supremum of the empirical process. We derive a Bernstein-type analogue of an exponential inequality in Koltchinskii (2006) that shows how the supremum of an empirical process deviates from the supremum of the associated Rademacher process. This Bernstein-type version allows us to use partial knowledge of the variance of the empirical processes involved, which is crucial for applications in our context of adaptive density estimation. Moreover, we show that one can use, instead of the supremum of the Rademacher process, its conditional expectation given the data (which is more stable). We should also remark on recent interest in obtaining inequalities similar to those in Koltchinskii (2006) for general bootstrap procedures; see Fromont (2007).
Since many bootstrap empirical processes (such as the ones obtained from Efron's bootstrap) are minorized by Rademacher processes, our inequalities apply there as well, but may give suboptimal constants. Adaptive estimation in sup-norm loss is a relatively recent subject, and we should mention first the results in Tsybakov (1998) and Golubev, Lepski and Levit (2001) that were obtained in the Gaussian white noise model. Tsybakov (1998) devises procedures that are sharp adaptive (attaining the optimal constant in the asymptotic risk) in sup-norm loss, when the unknown function lies in a Sobolev ball of order β (so that the rate of convergence is no better than it would be for functions that are Hölder-continuous of order β − 1/2). If – as Tsybakov (1998)
does – one uses Fourier expansions in the white noise framework, the Gaussian processes involved turn out to be stationary, so direct methods (e.g., the Rice formula) can be used to control suprema of the relevant random quantities. These methods extend to somewhat more general basis expansions, in which case the corresponding Gaussian processes are mildly nonstationary; see Lemma 1 in Golubev, Lepski and Levit (2001). However, if one is interested in adapting to a Hölder-continuous density in sup-norm loss in the i.i.d. density model on R, this greatly simplifying structure is not available; in particular, trigonometric basis expansions are not optimal for approximating Hölder-continuous functions in the supremum norm. Working with the more adequate wavelet bases seems to require different methods; in this case, even after reduction to a Gaussian model by strong approximation, the resulting Gaussian processes cannot be dealt with as in the aforementioned articles, for several reasons, including lack of stationarity, non-differentiability of the covariance function (for many wavelets), and the fact that we are interested in suprema over the whole real line. We show that empirical process techniques can be used in the i.i.d. density model to achieve rate-adaptive estimators, and we establish reasonable bounds for the asymptotic constant in sup-norm risk. The constants we obtain here are not sharp (as compared to the optimal ones obtained in Korostelev and Nussbaum (1999) for densities supported in [0, 1]): This does not come as a surprise since a) at least some loss has to be expected for adaptive procedures over Hölder classes, cf. Lepski (1992) and Tsybakov (1998), and b) Rademacher symmetrization increases the constants in the large deviation bounds we use. In the i.i.d. density model, a direct 'competitor' to the estimators constructed in this article is the hard thresholding wavelet density estimator introduced in Donoho et al.
(1996): as proved in Giné and Nickl (2007), its distribution function satisfies the functional CLT and it is adaptive in the sup-norm over Hölder balls; however, the proofs there require the additional assumption that dF integrates |x|^δ for some δ > 0, and the constants appearing in the threshold and the risk become quite large for δ small. The results in the present article hold under no moment condition whatsoever. Giné and Nickl (2008) construct another estimator that asymptotically optimally estimates F and its density in sup-norm loss. The estimator there was constructed by applying Lepski's method to classical kernel estimators, modified by imposing 'by force' that their distribution functions stay at a uniform distance o(1/√n) from F_n. In the present situation, if F has a uniformly continuous density, we do not need to force the estimator to stay in a o(1/√n) ball around F_n, which reduces the complexity of the method, and we also avoid the large constants that resulted from the entropy bounds in Giné and Nickl (2008). There has been recent interest in considering nonasymptotic risk bounds for adaptive estimators. Rigollet (2006) obtained sharp oracle inequalities (with monotone oracles) in L²(R)-loss, using a Stein-type density estimator. He builds on results of Cavalier and Tsybakov (2001) that were obtained in the Gaussian white noise model, and the methods employed there are closely tied to Hilbert-space structure. We prove an oracle inequality in sup-norm loss for estimators based on Haar wavelets, but with the following constraints: First, the constant we obtain cannot be made arbitrarily close to one, and second, we have to assume that the true density has at least one point where it attains a critical Hölder singularity. The latter condition is related to the notion of self-similar functions in Jaffard (1997a,b), and also arises in related problems such as the construction of pointwise adaptive confidence intervals, cf.
Picard and Tribouley (2000). The outline of the article is as follows: In Section 2 we define the basic linear estimators and give some of their asymptotic properties. In Section 3, building on Talagrand’s inequality, we derive a Bernstein-type inequality for the deviation of the (supremum of the) empirical process from the (supremum of the) associated Rademacher process. In Section 4, we construct the adaptive procedures and give the main results. Most of the proofs are deferred to Section 5.
2 Wavelet expansions and estimators
We start with some basic notation. If (S, S) is a measurable space, then for Borel-measurable functions h : S → R and Borel measures µ on S we set µh := ∫_S h dµ. We will denote by L^p(Q) := L^p(S, Q), 1 ≤ p ≤ ∞, the usual Lebesgue spaces on S w.r.t. a Borel measure Q, and if Q is Lebesgue measure on S = R we simply denote this space by L^p(R), and its norm by ‖·‖_p if p < ∞. We will use the symbol ‖h‖_∞ to denote sup_{x∈R} |h(x)| for h : R → R. For s ∈ N, denote by C^s(R) the space of functions f : R → R that are s-times differentiable with uniformly continuous D^s f, equipped with the norm ‖f‖_{s,∞} = ∑_{0≤α≤s} ‖D^α f‖_∞, with the convention that D^0 := id and that then C(R) := C^0(R) is the space of bounded uniformly continuous functions. For noninteger s > 0 and [s] the integer part of s, set

    C^s(R) = { f ∈ C^{[s]}(R) : ‖f‖_{s,∞} := ∑_{0≤α≤[s]} ‖D^α f‖_∞ + sup_{x≠y} |D^{[s]}f(x) − D^{[s]}f(y)| / |x − y|^{s−[s]} < ∞ }.
We also define the 'local' Hölder constant

    H(s, f) := sup_{|v|≤1} sup_{x∈R} |f(x + v) − f(x)| / |v|^s    (1)

for 0 < s < 1, and we set H(1, f) := ‖Df‖_∞.
2.1 Multiresolution analysis and wavelet bases
We recall here a few well-known facts about wavelet expansions; see, e.g., Härdle, Kerkyacharian, Picard and Tsybakov (HKPT, 1998). Let φ ∈ L²(R) be a father wavelet, that is, φ is such that {φ(· − k) : k ∈ Z} is an orthonormal system in L²(R), and moreover the linear spaces V_0 = {f(x) = ∑_k c_k φ(x − k) : {c_k}_{k∈Z} ∈ ℓ²}, V_1 = {h(x) = f(2x) : f ∈ V_0}, ..., V_j = {h(x) = f(2^j x) : f ∈ V_0}, ..., are nested (V_{j−1} ⊆ V_j for j ∈ N) and their union is dense in L²(R). In the case where φ is a bounded function that decays exponentially at infinity (i.e., |φ(x)| ≤ Ce^{−γ|x|} for some C, γ > 0) – which we assume for the rest of this subsection – the kernel of the projection onto the space V_j has certain properties: First, the series

    K(y, x) := K(φ, y, x) = ∑_{k∈Z} φ(y − k)φ(x − k)    (2)

converges pointwise, and we set K_j(y, x) := 2^j K(2^j y, 2^j x), j ∈ N ∪ {0}. Furthermore we have

    |K(y, x)| ≤ Φ(|y − x|)  and  sup_{x∈R} ∑_k |φ(x − k)| < ∞,    (3)

where Φ : R → R⁺ is bounded and has exponential decay. If f ∈ L^p(R), 1 ≤ p ≤ ∞, and j is fixed, then the projection of f onto V_j is

    K_j(f)(y) := ∫ K_j(x, y)f(x) dx = 2^j ∑_{k∈Z} φ(2^j y − k) ∫ φ(2^j x − k)f(x) dx,  y ∈ R,

the series converging pointwise. For f ∈ L¹(R), which is the main case in this article, the convergence of the series in fact takes place in L^p(R), 1 ≤ p ≤ ∞. This still holds true if f(x)dx is replaced by dµ(x), where µ is any finite signed measure. If now φ is a father wavelet and ψ the associated mother wavelet, so that {φ(· − k), 2^{l/2}ψ(2^l(·) − k) : k ∈ Z, l ∈ N} is an orthonormal basis of L²(R), then any f ∈ L^p(R) has the formal expansion

    f(y) = ∑_k α_k(f)φ(y − k) + ∑_{l=0}^∞ ∑_k β_{lk}(f)ψ_{lk}(y),    (4)
where ψ_{lk}(y) = 2^{l/2}ψ(2^l y − k), α_k(f) = ∫ f(x)φ(x − k) dx and β_{lk}(f) = ∫ f(x)ψ_{lk}(x) dx. Since (K_{l+1} − K_l)f = ∑_k β_{lk}(f)ψ_{lk}, the partial sums of the series (4) are in fact given by

    K_j(f)(y) = ∑_k α_k(f)φ(y − k) + ∑_{l=0}^{j−1} ∑_k β_{lk}(f)ψ_{lk}(y),    (5)
and, if φ, ψ are bounded and have exponential decay, then (5) holds pointwise, and it also holds in L^p(R), 1 ≤ p ≤ ∞, if f ∈ L¹(R) or if f is replaced by a finite signed measure. Now, using these facts one can furthermore show that the wavelet series (4) converges in L^p(R), p < ∞, for f ∈ L^p(R), and we also note that if p_0 is a uniformly continuous density, then its wavelet series converges uniformly.
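The Haar case makes (5) concrete: with φ = 1_{[0,1)}, the projection K_j(f) simply averages f over dyadic cells of length 2^{−j}. The following minimal sketch (our illustration, not from the paper; the exponential density and the evaluation grid are arbitrary choices) computes K_j(f) in closed form and shows the sup-norm approximation error over a fixed grid shrinking as j grows.

```python
# Minimal sketch (illustrative, not from the paper): Haar partial sums
# K_j(f) of (5).  For the Haar father wavelet phi = 1_[0,1) the kernel
# K(x, y) equals 1 iff x, y lie in the same unit cell, so K_j(f)(y) is
# the average of f over the dyadic cell [k 2^-j, (k+1) 2^-j) containing y.
import math

def haar_projection_exp(y, j):
    """K_j(f)(y) for f(x) = exp(-x) on [0, inf), computed in closed form."""
    k = math.floor(y * 2 ** j)
    a, b = k / 2 ** j, (k + 1) / 2 ** j
    if b <= 0:                      # cell entirely left of the support
        return 0.0
    a = max(a, 0.0)
    return 2 ** j * (math.exp(-a) - math.exp(-b))

# Sup-norm error over a grid in (0, 8] decreases as the resolution j grows.
grid = [i / 512 for i in range(1, 4097)]
errors = [max(abs(haar_projection_exp(y, j) - math.exp(-y)) for y in grid)
          for j in (2, 4, 6)]
print(errors)
```

The same dyadic-averaging picture is what the projection density estimators below exploit in the Haar case.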
2.2 Density estimation using wavelet and spline projection kernels
Let X_1, ..., X_n be i.i.d. random variables with common law P and density p_0 on R, and denote by P_n = n^{−1} ∑_{i=1}^n δ_{X_i} the associated empirical measure. A natural first step is to estimate the projection K_j(p_0) of p_0 onto V_j by

    p_n(y) := p_n(y, j) = (1/n) ∑_{i=1}^n K_j(y, X_i) = ∑_k α̂_k φ(y − k) + ∑_{l=0}^{j−1} ∑_k β̂_{lk} ψ_{lk}(y),  y ∈ R,    (6)
where K is as in (2), j ∈ N, and where α̂_k = ∫ φ(x − k) dP_n(x), β̂_{lk} = ∫ ψ_{lk}(x) dP_n(x) are the empirical wavelet coefficients. We note that for φ, ψ compactly supported (e.g., Daubechies' wavelets), there are only finitely many k's for which these coefficients are nonzero. This estimator was first studied by Kerkyacharian and Picard (1992) for compactly supported wavelets. If the wavelets φ and ψ do not have compact support, it may be impossible to compute the estimator exactly, since the sums over k consist of infinitely many summands. However, in the special case of the Battle-Lemarié family φ_r, r ≥ 1 (see, e.g., Section 6.1 in HKPT (1998)) – which is a class of non-compactly supported but exponentially decaying wavelets – the estimator has a simple form in terms of splines: the associated spaces V_{j,r} = {∑_k c_k 2^{j/2} φ_r(2^j(·) − k) : ∑_k c_k² < ∞} are in fact equal to the Schoenberg spaces generated by the Riesz basis of B-splines of order r, so that the sum in (6) can be computed by

    p_n(y, j) := (1/n) ∑_{i=1}^n κ_j(y, X_i) = (2^j/n) ∑_{i=1}^n ∑_k ∑_l b_{kl} N_{j,k,r}(X_i) N_{j,l,r}(y),  y ∈ R,    (7)
where the N_{j,k,r} are (suitably translated and dilated) B-splines of order r, the kernel κ is as in (34) below, and the b_{kl}'s are the entries of the inverse of the matrix defined in (33) below. An exact derivation of this spline projection, its wavelet representation and detailed definitions are given in Section 5.1. It turns out that for every sample point X_i and for every y, each of the last two sums extends only over r terms. We should note that this 'spline projection' estimator was first studied (outside of the wavelet setting) by Huang and Studden (1993), who
derived pointwise rates of convergence. See also Huang (1999), where some comparison between Daubechies' and spline wavelets can be found. In the course of proving the main theorem of this article, we will derive some basic results for the linear spline projection estimator (7), which we now state. For classical kernel estimators, results similar to those that follow were obtained in Giné and Guillou (2002) and Giné and Nickl (2008), and for wavelet estimators based on compactly supported wavelets, this was done in Giné and Nickl (2007).

Theorem 1 Suppose that P has a bounded density p_0. Assume j_n → ∞, n/(j_n 2^{j_n}) → ∞, j_n / log log n → ∞ and j_{2n} − j_n ≤ τ for some τ positive. Let p_n(y) = p_n(y, j_n) be the estimator from (7) for some r ≥ 1. Then

    lim sup_n √(n/(2^{j_n} j_n)) sup_{y∈R} |p_n(y) − Ep_n(y)| = C  a.s.

and, for 1 ≤ p < ∞,

    sup_n √(n/(2^{j_n} j_n)) ( E sup_{y∈R} |p_n(y) − Ep_n(y)|^p )^{1/p} ≤ C′,

where C and C′ depend only on ‖p_0‖_∞ and on r, p, τ. Furthermore, if p_0 ∈ C^t(R), with t ≤ r, one has

    sup_{y∈R} |p_n(y) − p_0(y)| = O(√(2^{j_n} j_n / n)) + O(2^{−t j_n})  both a.s. and in L^p(P),

and, if in addition 2^{j_n} ≃ (n/log n)^{1/(2t+1)}, then

    sup_{y∈R} |p_n(y) − p_0(y)| = O( (log n / n)^{t/(2t+1)} )  both a.s. and in L^p(P).
For the following central limit theorem, we denote by →_{ℓ^∞(R)} convergence in law for sample-bounded processes in the Banach space ℓ^∞(R) of bounded functions on R, and by G_P the usual P-Brownian bridge; see, e.g., Chapter 3 in Dudley (1999).

Theorem 2 Assume that the density p_0 of P is a bounded function (t = 0) or that p_0 ∈ C^t(R) for some t, 0 < t ≤ r. Let j_n satisfy n/(2^{j_n} j_n) → ∞ and n 2^{−j_n(t+1)} → 0 as n → ∞. If F is the distribution function of P and setting F_n^S(s) := ∫_{−∞}^s p_n(y, j_n) dy, then

    √n (F_n^S − F) →_{ℓ^∞(R)} G_P.

Proof. Given ε > 0, apply Proposition 8 below with λ = ε, so that ‖F_n^S − F_n‖_∞ = o_P(1/√n) follows, and use the fact that √n(F_n − F) converges in law in ℓ^∞(R) to G_P.

We should emphasize that the optimal bandwidth choice 2^{−j_n} ≃ n^{−1/(2t+1)} (or, if sup-norm loss is considered, with n replaced by n/log n) is admissible for every t > 0 in the last theorem.
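For the Haar basis the linear estimator (6) is computable directly from cell counts, which gives a quick way to experiment with the results above. The sketch below is our illustration (the Gaussian sample and the level j = 3 are arbitrary choices, and only the Haar case is implemented): it evaluates p_n(·, j) as a dyadic histogram and the plug-in distribution function F_n^S as its primitive.

```python
# Sketch (Haar case only, illustrative): the projection estimator (6).
# For phi = 1_[0,1), K_j(y, X_i) = 2^j 1{floor(2^j y) = floor(2^j X_i)},
# so p_n(., j) is a dyadic histogram and F_n^S is its primitive.
import math, random

def haar_density(sample, j):
    """p_n(., j) as a dict mapping dyadic cell k to the estimated height."""
    n, h = len(sample), 2.0 ** (-j)
    counts = {}
    for x in sample:
        k = math.floor(x / h)
        counts[k] = counts.get(k, 0) + 1
    return {k: c / (n * h) for k, c in counts.items()}

def distribution(sample, j, t):
    """F_n^S(t), the integral of p_n(., j) over (-infty, t]."""
    h = 2.0 ** (-j)
    return sum(height * (min(t, (k + 1) * h) - k * h)
               for k, height in haar_density(sample, j).items()
               if t > k * h)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
pn = haar_density(xs, j=3)
mass = sum(height * 2.0 ** (-3) for height in pn.values())
print(mass, distribution(xs, 3, 10.0))  # both essentially 1
```

F_n^S is piecewise linear and monotone, which is what makes the comparison with the empirical distribution function F_n in Theorem 2 straightforward in this case.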
3 Estimating Suprema of Empirical Processes
Talagrand's (1996) exponential inequality for empirical processes (see also Ledoux (2001)), which is a uniform Prohorov-type inequality, is not specific about constants. Constants in its Bernstein-type version have been specified by several authors (Massart (2000), Bousquet (2003) and Klein and Rio (2005)). Let X_i be the coordinates of the product probability space (S, S, P)^N, where P is any probability measure on (S, S), and let F be a countable class of measurable functions on S that take values in [−1/2, 1/2], or, if F is P-centered, in [−1, 1]. Let σ ≤ 1/2 and V be any two numbers satisfying

    σ² ≥ ‖Pf²‖_F ,   V ≥ nσ² + 2E ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ,    (8)

where V is also an upper bound for E ‖∑_{i=1}^n (f(X_i) − Pf)²‖_F (Klein and Rio (2005)). Then, taking into account that sup_{f∈F∪(−F)} ∑_{i=1}^n f(X_i) = sup_F |∑_{i=1}^n f(X_i)|, Bousquet's (2003) version of Talagrand's inequality is as follows: For every t > 0,
    Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≥ E ‖∑_{i=1}^n (f(X_i) − Pf)‖_F + t } ≤ exp( − t² / (2V + (2/3)t) ).    (9)
In the other direction, the Klein and Rio (2005) result is: For every t > 0,

    Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≤ E ‖∑_{i=1}^n (f(X_i) − Pf)‖_F − t } ≤ exp( − t² / (2V + 2t) ).    (10)
These inequalities can be applied in conjunction with an estimate of the expected value obtained via empirical process methods. Here we describe one such result for VC-type classes, i.e., for F satisfying the uniform metric entropy condition

    sup_Q N(F, L²(Q), τ) ≤ (A/τ)^v ,  0 < τ ≤ 1,  (A ≥ e, v ≥ 2),    (11)

with the supremum extending over all Borel probability measures Q on (S, S). [We denote here by N(G, L²(Q), τ) the usual covering numbers of a class G of functions by balls of radius less than or equal to τ in L²(Q)-distance.] Then one has, for every n,

    E ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≤ 2 [ 15 √(2vnσ² log(5A/σ)) + 1350 v log(5A/σ) ],    (12)

see Proposition 3 in Giné and Nickl (2008), with a change obtained by using V as in (8) instead of an earlier bound due to Talagrand for E ‖∑_{i=1}^n (f(X_i) − Pf)²‖_F. This type of inequality also has some history (Talagrand (1994), Einmahl and Mason (2000), Giné and Guillou (2001), Giné and Koltchinskii (2006), among others). The constants on the right-hand side of (12) may be far from best possible, but we prefer them over unspecified 'universal' constants. As is the case for Bernstein's inequality in R, Talagrand's inequality is especially useful in the Gaussian tail range, and, combining (9) and (12), one can obtain such a 'Gaussian tail' bound for the supremum of the empirical process that depends only on σ (similar to a bound in Giné and Guillou (2001)).
Proposition 1 Let F be a countable class of measurable functions that satisfies (11), and is uniformly bounded (in absolute value) by 1/2. Assume further that for some λ > 0,

    nσ² ≥ (λ²v/2) log(5A/σ).    (13)

Set c₁(λ) = 2[15 + 1350λ^{−1}] and let c₂(λ) ≥ 1 + 120λ^{−1} + 10800λ^{−2}. Then, if

    c₁(λ) √(2vnσ² log(5A/σ)) ≤ t ≤ (3/2) c₂(λ) nσ²,    (14)

we have

    Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≥ 2t } ≤ exp( − t² / (3c₂(λ)nσ²) ).    (15)
Proof. Under (13), inequality (12) gives

    E ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≤ c₁(λ) √(2vnσ² log(5A/σ)),

and (8) implies that we can take V = c₂(λ)nσ². Now the result follows from (9), taking into account that in the range of t's

    E ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≤ t ≤ 3V/2,

(9) becomes

    Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≥ 2t } ≤ exp( − t²/(3V) ).
The constants here may be too large for some applications, but they are not so in situations where λ can be taken very large, in particular in asymptotic considerations. [Then c₁(λ) → 30 and c₂(λ) → 1 as λ → ∞.]
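The constants and the admissible range (14) are elementary to tabulate. The following sketch (our illustration; the parameter values are arbitrary, and c₂ is taken with equality in its defining inequality) evaluates c₁(λ), c₂(λ) and the endpoints of (14), and checks the standing assumption (13).

```python
# Sketch: the explicit constants of Proposition 1 and the range (14).
# Parameter values below are illustrative; c2 is taken with equality.
import math

def c1(lam):
    return 2.0 * (15.0 + 1350.0 / lam)

def c2(lam):
    return 1.0 + 120.0 / lam + 10800.0 / lam ** 2

def range_14(n, sigma, v, A, lam):
    """Endpoints of (14); raises if the standing assumption (13) fails."""
    if n * sigma ** 2 < (lam ** 2 * v / 2.0) * math.log(5 * A / sigma):
        raise ValueError("(13) fails for these parameters")
    t_low = c1(lam) * math.sqrt(2 * v * n * sigma ** 2 * math.log(5 * A / sigma))
    t_high = 1.5 * c2(lam) * n * sigma ** 2
    return t_low, t_high

print(c1(1e6), c2(1e6))          # approach the limits 30 and 1
print(range_14(10 ** 6, 0.05, 2, math.e, 2.0))
```

As the remark above notes, the range is nontrivial and the tail (15) is genuinely Gaussian only once nσ² is large compared to the entropy term.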
3.1 Estimating the size of empirical processes by Rademacher averages
The constants one could obtain from Proposition 1 are not satisfactory for certain applications in adaptive estimation. We now propose a remedy for this problem, inspired by a nice idea of Koltchinskii (2001) and Bartlett, Boucheron and Lugosi (2002), which consists of replacing the expectation of the supremum of an empirical process by the supremum of the associated Rademacher process. To be more precise, these authors obtain a purely data-driven stochastic estimate of the supremum of an empirical process and apply it to problems in risk minimization and model selection. An inequality of this type (see Koltchinskii (2006), page 2602) is

    Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≥ 2 ‖∑_{i=1}^n ε_i f(X_i)‖_F + 3t } ≤ exp( − 2t²/(3n) ),    (16)

where ε_i, i ∈ N, are i.i.d. Rademacher random variables, independent of the X_i's, all defined as coordinates on a large product probability space. Note that this bound does not take the
variance V in (9) into account, but in the applications to density estimation that we have in mind, V is much smaller than n (it is of the order n2^{−j_n}, j_n → ∞). We need a similar inequality, with the quantity n in the bound replaced by V, valid in a large enough range of t's. It will be convenient to use the following well-known symmetrization inequality (e.g., Dudley (1999), p. 343):

    (1/2) E ‖∑_{i=1}^n ε_i f(X_i)‖_F − (√n/2) ‖Pf‖_F ≤ E ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≤ 2 E ‖∑_{i=1}^n ε_i f(X_i)‖_F .    (17)
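A toy simulation makes (17) concrete (our illustration, not from the paper; the class of scaled half-line indicators, the uniform sample and all sizes are arbitrary choices): the Rademacher average, which is observable given the data, dominates the centered empirical supremum after multiplication by 2.

```python
# Toy check (illustrative): the symmetrization inequality (17),
# E||sum (f(X_i) - Pf)||_F <= 2 E||sum eps_i f(X_i)||_F, for the class
# F = {x -> (1/2) 1{x <= s} : s in [0, 1]} with X_i uniform on [0, 1],
# so that Pf_s = s/2.  Both expectations are approximated by averaging.
import random

def sup_emp(xs, grid):
    return max(abs(sum((x <= s) * 0.5 - s * 0.5 for x in xs)) for s in grid)

def sup_rad(xs, eps, grid):
    return max(abs(sum(e * (x <= s) * 0.5 for x, e in zip(xs, eps))) for s in grid)

random.seed(1)
n, reps = 200, 50
grid = [i / 50 for i in range(51)]
emp = rad = 0.0
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    eps = [random.choice((-1, 1)) for _ in range(n)]
    emp += sup_emp(xs, grid) / reps
    rad += sup_rad(xs, eps, grid) / reps
print(emp, 2 * rad)  # the left value stays below the right one
```

The same observable quantity, ‖∑ ε_i f(X_i)‖_F, is what the thresholds of Section 4 are built from.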
The following exponential bound is the Bernstein-type analogue of (16). Denote by E^ε expectation w.r.t. the Rademacher variables only.

Proposition 2 Let F be a countable class of measurable functions, uniformly bounded (in absolute value) by 1/2. Then, for every t > 0,

    Pr { ‖∑_{i=1}^n (f(X_i) − Ef(X))‖_F ≥ 2 ‖∑_{i=1}^n ε_i f(X_i)‖_F + 3t } ≤ 2 exp( − t²/(2V′ + 2t) ),    (18)

as well as

    Pr { ‖∑_{i=1}^n (f(X_i) − Ef(X))‖_F ≥ 2 E^ε ‖∑_{i=1}^n ε_i f(X_i)‖_F + 3t } ≤ 2 exp( − t²/(2V′ + 2t) ),    (19)

where V′ = nσ² + 4E ‖∑_{i=1}^n ε_i f(X_i)‖_F.
Proof. We have

    Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≥ 2 ‖∑_{i=1}^n ε_i f(X_i)‖_F + 3t }
      ≤ Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≥ 2 E ‖∑_{i=1}^n ε_i f(X_i)‖_F + t }
      + Pr { ‖∑_{i=1}^n ε_i f(X_i)‖_F ≤ E ‖∑_{i=1}^n ε_i f(X_i)‖_F − t }.

For the first term, combining (17) with (9) gives

    Pr { ‖∑_{i=1}^n (f(X_i) − Pf)‖_F ≥ 2 E ‖∑_{i=1}^n ε_i f(X_i)‖_F + t } ≤ exp( − t²/(2V′ + (2/3)t) ).

For the second term, note that (10) applies to the randomized sums ∑_{i=1}^n ε_i f(X_i) as well, by just taking the class of functions

    G = { g(τ, x) = τ f(x) : f ∈ F },  τ ∈ {−1, 1},

instead of F and the probability measure P̄ = 2^{−1}(δ_{−1} + δ_1) × P instead of P. Hence

    Pr { ‖∑_{i=1}^n ε_i f(X_i)‖_F ≤ E ‖∑_{i=1}^n ε_i f(X_i)‖_F − t } ≤ exp( − t²/(2V′ + 2t) ),    (20)
since V′ ≥ nσ² + 2E ‖∑_{i=1}^n ε_i f(X_i)‖_F. Combining the bounds completes the proof of (18). It remains to prove (19). Let G, P̄ be as above, let Y_i = (ε_i, X_i), and note that P̄ is the law of Y_i. By convexity,

    E e^{−t E^ε ‖∑_{i=1}^n ε_i f(X_i)‖_F} ≤ E e^{−t ‖∑_{i=1}^n ε_i f(X_i)‖_F} = E e^{−t ‖∑_{i=1}^n g(Y_i)‖_G}

for all t. The Klein and Rio (2005) version (10) of Talagrand's inequality is in fact established by estimating the Laplace transform E e^{−t‖∑_{i=1}^n g(Y_i)‖_G}, and Theorem 1.2a in Klein and Rio (2005) implies

    E e^{−t E^ε ‖∑_{i=1}^n ε_i (f(X_i) − Pf)‖_F} ≤ exp( −t E ‖∑_{i=1}^n g(Y_i)‖_G + (V/9)(e^{3t} − 3t − 1) ),

for V ≥ nσ² + 2E ‖∑_{i=1}^n g(Y_i)‖_G, which, by their proof of the implication (a) ⇒ (c) in that theorem, gives

    Pr { E^ε ‖∑_{i=1}^n ε_i f(X_i)‖_F ≤ E ‖∑_{i=1}^n ε_i f(X_i)‖_F − t } ≤ exp( − t²/(2V′ + 2t) ).

The proof of (19) now follows as in the previous case.
For F of VC type, the moment bound (12) is usually proved as a consequence of a bound for the Rademacher process. In fact, the proof of Proposition 3 in Giné and Nickl (2008) shows

    E ‖∑_{i=1}^n ε_i f(X_i)‖_F ≤ 15 √(2vnσ² log(5A/σ)) + 1350 v log(5A/σ),    (21)
where σ is as in (8), which we use in the following corollary, together with the previous proposition. The constant c₂(λ) in the exponent below is still potentially large, but tends to one as λ → ∞.

Corollary 1 Let F be a countable class of measurable functions that satisfies (11), and assume it to be uniformly bounded (in absolute value) by 1/2. Assume further (13) for some λ > 0. Then, for 0 < t ≤ c₂(λ)nσ², […]

For r ≥ 1 fixed and n > 1, choose integers j_min := j_{min,n} and j_max := j_{max,n} such that 0 < j_min < j_max,

    2^{j_min} ≃ (n / log n)^{1/(2r+1)}  and  2^{j_max} ≃ n / (log n)²,

and set J := J_n = [j_min, j_max] ∩ N.
The number of elements in this grid is of order log n. We will consider several preliminary estimators j̄_n^ε, j̄_n, j̃_n^ε and j̃_n of the resolution level, and we discuss the main differences among them below. Let p_n(j) be as in (6) or (7). First, we set

    j̄_n^ε = min { j ∈ J : ‖p_n(j) − p_n(l)‖_∞ ≤ T(n, j, l) + 7‖Φ‖₂ ‖p_n(j_max)‖_∞^{1/2} √(2^l l / n)  ∀ l > j, l ∈ J },    (23)

where the function Φ is as in (3), and we discuss an explicit way to construct Φ in Remark 2 below. If the minimum does not exist, we set j̄_n^ε equal to j_max. Further we define j̄_n as the same minimum but with T(n, j, l) replaced by its Rademacher expectation E^ε T(n, j, l). An alternative estimator of the resolution level is

    j̃_n^ε = min { j ∈ J : ‖p_n(j) − p_n(l)‖_∞ ≤ (B(φ) + 1) R(n, l) + 7‖Φ‖₂ ‖p_n(j_max)‖_∞^{1/2} √(2^l l / n)  ∀ l > j, l ∈ J },    (24)

where B(φ) is a bound, uniform in j, for the operator norm in L^∞(R) of the projection π_j; see Remark 3 below. Again, if the minimum does not exist, we set j̃_n^ε equal to j_max. We also define j̃_n by replacing R(n, l) by E^ε R(n, l) in (24). Before we state the main result, we briefly discuss these procedures: The data-driven resolution levels j̃_n^ε and j̃_n in (24) are based on tests that use Rademacher analogues of the usual thresholds in Lepski's method: Starting with j_min, the main contribution to ‖p_n(j) − p_n(l)‖_∞ is the bias ‖Ep_n(j) − p_0‖_∞. The procedure should stop when the 'variance term' ‖p_n(l) − Ep_n(l)‖_∞ starts to dominate. Since this is an unknown quantity, and since we know no good nonrandom upper bound for it, we estimate it by the supremum of the associated Rademacher process, i.e., by R(n, l), or by its Rademacher expectation. The constant B(φ) is necessary to correct for the lack of monotonicity of the R(n, l)'s in the resolution level l. The estimators j̄_n^ε and j̄_n in (23) are somewhat more refined, but also slightly more complicated: They try to take advantage of the fact that in the 'small bias' domain,

    ‖p_n(j) − p_n(l)‖_∞ = ‖ (1/n) ∑_{i=1}^n (K_j − K_l)(X_i, ·) ‖_∞
should not exceed its Rademacher symmetrization

    T(n, j, l) = 2 ‖ (1/n) ∑_{i=1}^n ε_i (K_j − K_l)(X_i, ·) ‖_∞ ,
or its conditional expectation E^ε T(n, j, l), using the deviation inequality from Corollary 1. Yet another way of viewing these resolution levels is in terms of model selection: One starts with the smallest model V_{j_min} and compares it to a nested sequence of models {V_j}_{j∈J}, proceeding to a larger model V_j if all relevant blocks of wavelet coefficients between j and j_max are insignificant as compared to the corresponding Rademacher penalty. We now state the main result, whose proof is deferred to the next section. As usual, we say that a wavelet basis is s-regular, s ∈ N ∪ {0}, if either the father wavelet φ has s weak derivatives contained in L^p(R) for some p ≥ 1, or if the mother wavelet ψ satisfies ∫ x^α ψ(x) dx = 0 for α = 0, ..., s. Note that any compactly supported element of C^s(R), s > 0, is of bounded (1/s)-variation, so that the p-variation condition in the following theorem is satisfied, e.g., for all Daubechies wavelets. The assumption of uniform continuity in the following theorem can be relaxed at the expense of slightly modifying the definition of ĵ_n; see Remark 1 below. The estimators below achieve the optimal rate of convergence for estimating p_0 in sup-norm loss in the minimax sense (over Hölder balls); cf., e.g., Korostelev and Nussbaum (1999) for optimality of these rates.

Theorem 3 Let X_1, ..., X_n be i.i.d. on R with common law P that possesses a uniformly continuous density p_0. Let p_n(j) := p_n(y, j) be as in (6), where φ is either compactly supported, of bounded p-variation (p < ∞) and (r − 1)-regular, or φ = φ_r equals a Battle-Lemarié wavelet. Let the sequence {ĵ_n}_{n∈N} be either {j̄_n^ε}_{n∈N}, {j̄_n}_{n∈N}, {j̃_n^ε}_{n∈N} or {j̃_n}_{n∈N}, and let F_n(ĵ_n)(t) = ∫_{−∞}^t p_n(y, ĵ_n) dy. Then

    √n (F_n(ĵ_n) − F) →_{ℓ^∞(R)} G_P ,    (25)
the convergence being uniform over the set of all probability measures P on R with densities p_0 bounded by a fixed constant, in any distance that metrizes convergence in law. Furthermore, if C is any precompact subset of C(R), then

    sup_{p_0∈C} E sup_{y∈R} |p_n(y, ĵ_n) − p_0(y)| = o(1).    (26)
If, in addition, p_0 ∈ C^t(R) for some 0 < t ≤ r, then also

    sup_{p_0 : ‖p_0‖_{t,∞} ≤ D} E sup_{y∈R} |p_n(y, ĵ_n) − p_0(y)| = O( (log n / n)^{t/(2t+1)} ).    (27)
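To fix ideas, the selection rule (23) can be sketched end-to-end in the Haar case, where ‖Φ‖₂ = 1 (Remark 2) and all sup-norms are attained on dyadic cells. Everything below is an illustrative sketch of ours, not a tuned implementation: the Gaussian sample, the grid endpoints and the single Rademacher draw are arbitrary choices, and the sup-norm computation is slightly conservative.

```python
# Sketch (Haar case, illustrative): Lepski-type selection (23) with the
# Rademacher threshold T(n, j, l) = 2 ||(1/n) sum_i eps_i (K_j - K_l)(X_i, .)||_inf.
import math, random

def cell_sums(sample, j, weights=None):
    """Map dyadic cell k at level j to the sum of weights of points in it."""
    w = weights if weights is not None else [1.0] * len(sample)
    out = {}
    for x, wi in zip(sample, w):
        k = math.floor(x * 2 ** j)
        out[k] = out.get(k, 0.0) + wi
    return out

def sup_diff(sample, j, l, weights=None):
    """Sup-norm of (1/n) sum_i w_i (K_j - K_l)(X_i, .), j < l.  Slightly
    conservative: empty level-l subcells are covered via the coarse cells."""
    n = len(sample)
    cj, cl = cell_sums(sample, j, weights), cell_sums(sample, l, weights)
    sup = max((abs(2 ** l * c - 2 ** j * cj.get(k >> (l - j), 0.0)) / n
               for k, c in cl.items()), default=0.0)
    return max(sup, max((2 ** j * abs(c) / n for c in cj.values()), default=0.0))

def lepski_level(sample, jmin, jmax):
    """Data-driven resolution level per (23); ||Phi||_2 = 1 for Haar."""
    n = len(sample)
    pmax = max(cell_sums(sample, jmax).values()) * 2 ** jmax / n
    eps = [random.choice((-1.0, 1.0)) for _ in sample]
    for j in range(jmin, jmax):
        if all(sup_diff(sample, j, l) <= 2 * sup_diff(sample, j, l, eps)
               + 7 * math.sqrt(pmax) * math.sqrt(2 ** l * l / n)
               for l in range(j + 1, jmax + 1)):
            return j
    return jmax

random.seed(2)
xs = [random.gauss(0.0, 1.0) for _ in range(5000)]
print(lepski_level(xs, jmin=1, jmax=8))
```

At moderate sample sizes the deterministic part of the threshold dominates, so small levels pass the tests; the Rademacher term matters in the regime the theorem addresses, where the constants in front of the random threshold are what keep the risk bounds reasonable.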
Remark 1 Relaxing the uniform continuity assumption. The assumption of uniform continuity of the density of F can be relaxed by modifying the definition of ¯jn (or ˜jn ) along the lines of Gin´ √ e and Nickl (2008): The idea is to constrain all candidate estimators to lie in a ball of size o(1/ n) around the empirical distribution function Fn so that (25) holds automatically. Formally, this can be done by adding the requirement Z t 1 sup pn (y, j)dy − Fn (t) ≤ √ n log n t∈R −∞
in each test in (23) or (24). If this requirement does not even hold for j_max, it can be seen as evidence that F has no density, and one just uses F_n as the estimator, so as to obtain at least the functional CLT. If F has a bounded density, one can use the exponential bound in Proposition 8 in the proof to control rejection probabilities of these tests in the 'small bias' domain ĵ_n > j*, and Theorem 3 can then still be proved for this procedure, without any assumptions on F. See Giné and Nickl (2008) for more details on this procedure and its proof.

Remark 2 The constant ||Φ||_2. Once the wavelet φ has been chosen, ĵ_n is purely data driven, since the function Φ depends only on φ. For the Haar basis (φ = 1_{[0,1)}) we can take Φ = φ because in this case K(x, y) ≤ 1_{[0,1)}(|x − y|), so that ||Φ||_2 = 1. A general way to obtain majorizing kernels Φ is described in Section 8.6 of HKPT (1998) as follows: Let φ̄ be a non-increasing function in L^1(R_+) such that |φ(u)| ≤ φ̄(|u|). Then

|K(x, y)| ≤ C(φ̄) φ̄(δ|x − y|/2) =: Φ(|x − y|),

where δ < 1/4 is such that φ̄(δ/2) > 0, and

C(φ̄) = (2 φ̄(0) / φ̄(δ/2)) ( φ̄(0) + 2 ∫_R φ̄(|u|) du ).
For compactly supported φ, a crude choice of φ̄ is ||φ||_∞ times the indicator of its support, but numerical methods might give much better estimates. For Battle-Lemarié wavelets, the spline representation of the projection kernel is again useful for estimating ||Φ||_2. For example, if r = 2 (linear splines), one writes as in Huang and Studden (1993) (cf. also Lemma 1 below)

K(x, y) = Σ_{k,l} b_{kl} N_{0,2}(x − k) N_{0,2}(y − l) = c Σ_{k,l} λ^{|k−l|} N_{0,2}(x − k) N_{0,2}(y − l),

with c = √3 and λ = −2 + √3. Now, since N_{0,2} = 1_{[0,1)} ∗ 1_{[0,1)}, the kernel K(x, y) is easily seen to be majorized in absolute value by Φ(|x − y|), where Φ(|u|) = 4c|λ|^{(|u|−2)∨0}, which gives the (not necessarily sharp) bound ||Φ||_2 ≤ 15.5. For higher order spline wavelets similar computations apply. Again, numerical methods might be preferable here.
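The bound ||Φ||_2 ≤ 15.5 above is easy to verify numerically. A minimal sketch, assuming the majorant Φ(|u|) = 4c|λ|^{(|u|−2)∨0} with c = √3 and λ = −2 + √3 as in the remark:

```python
import math

# Majorant of the linear-spline (r = 2) projection kernel from Remark 2:
# Phi(|u|) = 4c * |lambda|^((|u| - 2) v 0), with c = sqrt(3), lambda = -2 + sqrt(3).
c = math.sqrt(3.0)
lam = abs(-2.0 + math.sqrt(3.0))          # |lambda| ~ 0.268

def Phi(u):
    return 4.0 * c * lam ** max(abs(u) - 2.0, 0.0)

# ||Phi||_2^2 = int_R Phi(|u|)^2 du, by the trapezoidal rule on [-40, 40];
# the neglected tail is of order lam^76, hence negligible.
N = 80_000
h = 80.0 / N
vals = [Phi(-40.0 + i * h) ** 2 for i in range(N + 1)]
phi_norm = math.sqrt(h * (sum(vals) - 0.5 * (vals[0] + vals[-1])))   # ~15.1
```

The numerical value (about 15.1) is consistent with, and slightly sharper than, the stated bound 15.5.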
Remark 3 The constant B(φ). To construct j̃_n one requires knowledge of the constant B(φ) that bounds the operator norm ||π_j||'_∞ of π_j viewed as an operator on L^∞(R). A simple way of obtaining a bound is as follows: for any f ∈ L^∞(R) we have, by (3),

|π_j(f)(x)| = | ∫ K_j(x, y) f(y) dy | ≤ ||Φ||_1 ||f||_∞,

that is,

||π_j||'_∞ ≤ ||Φ||_1.

In combination with the previous remark, one readily obtains possible values for B(φ). For instance, for the Haar wavelet, B(φ) ≤ 1. For spline wavelets, other methods are available. For example, for Battle-Lemarié wavelets arising from linear B-splines, ||π_j||'_∞ is bounded by 3, and Shadrin (2001, p.135) conjectures the bound 2r − 1 for general order r. See DeVore and Lorentz (1996, Chapter 13.4), Shadrin (2001) and references therein for more information.
4.1 Risk and Oracle-type inequalities for Haar wavelets
Theorem 3 is asymptotic in nature, and a natural question is how large the constants in the convergence rate (27) are. One way to address this question is by comparing the risk of the adaptive estimator to the risk of optimal linear estimators that could be constructed if more were known about p_0. While our method allows such comparisons, we should note in advance that the randomization techniques, as well as the relatively simple model selection procedure employed here, are not likely to produce optimal constants in these comparisons. Still, the constants obtained here are much better than what would be possible using moment inequalities for empirical processes directly (as in (12)), or any other method known to us. To reduce technicalities, we restrict ourselves here to the Haar wavelet φ = φ_1, but all the results below could also be proved (with modified constants) for the wavelets considered in Theorem 3. We first compare to an 'oracle' that only knows that p_0 ∈ C^t(R), with a bound on its Hölder norm. In this case, one could choose j*(t) so that the 'variance' term E||p_n(j) − E p_n(j)||_∞ and the bias term ||E p_n(j) − p_0||_∞ ≤ B(j, p_0) balance, where B(j, p_0) is defined in (45) below. Here, several possibilities arise: For instance, since

lim_n √(n/(2^{j_n} j_n)) sup_{y∈R} |p_n(y, j_n) − E p_n(y, j_n)| = √(2 log 2) ||p_0||_∞^{1/2}   a.s. and in L^p(P)    (28)

for the linear Haar-wavelet estimator (Theorem 2 in Giné and Nickl (2007)), a possible choice of j* is the resolution level that balances B(j, p_0) with √(2 log 2) ||p_0||_∞^{1/2} √(2^j j/n), see (49) below.
Proposition 3 Let the conditions of Theorem 3 hold and let j* := j*(t) be as in (49). Then if p_0 ∈ C^t(R) for some 0 < t ≤ 1, if φ = φ_1 is the Haar wavelet, and if ĵ_n is as in Theorem 3, we have, for every n,

E||p_n(ĵ_n) − p_0||_∞ ≤ 30 E||p_n(j*) − p_0||_∞ + O(1/√n) + O((log n / n)^{2t/(2t+1)}).
The (proof of the) previous proposition and (28) allow one to obtain an explicit upper bound for the asymptotic constant in the risk of the adaptive Haar-wavelet estimator. Recall the definition of H(s, f) from (1).

Proposition 4 Let the conditions of Theorem 3 hold. Then, if p_0 ∈ C^t(R) for some 0 < t ≤ 1, and if φ = φ_1 is the Haar wavelet, we have

lim sup_n (n / log n)^{t/(2t+1)} E||p_n(ĵ_n) − p_0||_∞ ≤ A(p_0),

where

A(p_0) = 26.6 ( (1/(√(2 log 2)(1 + t))) ||p_0||_∞^t H(t, p_0) )^{1/(2t+1)}.

For example, if t = 1, A(p_0) ≤ 20 ||p_0||_∞^{1/3} ||Dp_0||_∞^{1/3}. The best possible constant in the minimax risk is derived in Korostelev and Nussbaum (1999) for densities supported in [0, 1], and our bound misses the one in Korostelev and Nussbaum (1999) by a factor less than 20. Some loss of efficiency in the asymptotic constant of any adaptive estimator is to be expected in our estimation problem, cf. Lepski (1992) and also Tsybakov (1998). The choice j* in Proposition 3 above is based on replacing the variance term E||p_n(j) − E p_n(j)||_∞ by its limit, which might be suboptimal in finite samples. So, for better finite-sample performance, an oracle that knows p_0 ∈ C^t(R) would choose the resolution level j^# so as to balance B(j, p_0) and E||p_n(j) − E p_n(j)||_∞.
A slight modification of the procedure (24) allows one to obtain a comparison of the risk of the adaptive estimator to that of p_n(j^#). Let J be as in the previous section, and define ĵ_n by

ĵ_n = min{ j ∈ J : ||p_n(j) − p_n(l)||_∞ ≤ 5 R(n, l) + 10 ||p_n(j_max)||_∞^{1/2} √(2^l l/n)  ∀ l > j, l ∈ J }.    (29)

Proposition 5 Suppose p_0 is in C^t(R) for some 0 < t ≤ 1. Let φ = φ_1 be the Haar wavelet, and let ĵ_n and j^# be defined as in (29) and (56), respectively. Then, for every n,

E||p_n(ĵ_n) − p_0||_∞ ≤ 51 E||p_n(j^#) − p_0||_∞ + O(n^{−1/2}) + O((log n / n)^{2t/(2t+1)}).

We should note that E||p_n(j^#) − p_0||_∞ = O((n / log n)^{−t/(2t+1)}) can be shown to hold (using Lemma 7), so that this estimator satisfies the conclusion of Theorem 3 (with r = 1) as well. The 'oracles' from Propositions 3 and 5 only use knowledge of the fact that p_0 ∈ C^t(R). If p_0 were known completely, one could still improve on j^# by using directly ||E p_n(j) − p_0||_∞ instead of its upper bound B(j, p_0). In fact, under complete knowledge of p_0, a 'Haar-oracle' would choose a resolution level j^H that satisfies

inf_{j∈J} E||p_n(j) − p_0||_∞ = E||p_n(j^H) − p_0||_∞.    (30)
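A selection rule in the spirit of (29) is easy to sketch in code. The following is a schematic illustration for the Haar basis on [0, 1): the sample, the grid J, and the plug-in for ||p_n(j_max)||_∞ are illustrative choices, and the helper names (haar_density, rademacher_sup, sup_diff) are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.beta(2, 3, size=n)            # illustrative sample with a smooth density on [0, 1)

def haar_density(X, j):
    """Linear Haar projection estimator p_n(., j): histogram on 2^j dyadic cells."""
    cells = np.minimum((X * 2**j).astype(int), 2**j - 1)
    return 2**j * np.bincount(cells, minlength=2**j) / len(X)

def rademacher_sup(X, j, eps):
    """R(n, j) = sup_y |pi_j(P~_n)(y)| for Haar, with P~_n = (2n)^{-1} sum_i eps_i delta_{X_i}."""
    cells = np.minimum((X * 2**j).astype(int), 2**j - 1)
    sums = np.bincount(cells, weights=eps, minlength=2**j)
    return 2**j * np.abs(sums).max() / (2 * len(X))

J = list(range(2, 10))                 # illustrative grid of resolution levels
jmax = J[-1]
eps = rng.choice([-1.0, 1.0], size=n)  # one Rademacher draw
dens = {j: haar_density(X, j) for j in J}
M = np.max(dens[jmax])                 # plug-in for ||p_n(jmax)||_inf

def sup_diff(j, l):                    # ||p_n(j) - p_n(l)||_inf for j < l: compare on finer cells
    return np.abs(np.repeat(dens[j], 2**(l - j)) - dens[l]).max()

# Lepski-type rule as in (29): smallest j whose tests against all finer l pass.
jhat = jmax
for j in J:
    passes = all(
        sup_diff(j, l) <= 5 * rademacher_sup(X, l, eps)
        + 10 * np.sqrt(M) * np.sqrt(2**l * l / n)
        for l in J if l > j)
    if passes:
        jhat = j
        break
```

The procedure in the paper uses the same symmetrized measure in both the penalty and the estimator; here a single independent Rademacher draw is used for simplicity.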
To mimic such an oracle in sup-norm loss is more difficult. Note first that the procedures ĵ_n used here are all implicitly based on estimating the unknown bias-bound B(j, p_0). The space C^t(R) contains functions that are not contained in C^{t+δ}(R) for any δ, but still do not attain a critical Hölder-singularity of order t at any point x ∈ R. More precisely, let f ∈ C^t(R) for t ∈ (0, 1], set

H(f, x, v, t) := |f(x + v) − f(x)| / |v|^t,

and define the pointwise Hölder exponent t(f, x) = sup{t : H(f, x, v, t) ≤ C for some C and v in a neighborhood of x}. Several things can happen: For example, if the exponent is not attained (so that f is not t(f, x)-Hölder at x), the limit as |v| → 0 of H(f, x, v, t) equals 0 for every t < t(f, x). Even if f is t(f, x)-Hölder at x, it can happen that H(f, x, v, t(f, x)) → 0 (for instance, it could be of order 1/log(1/v)). Furthermore, lim_{|v|→0} H(f, x, v, t(f, x)) may fail to exist. We refer to Jaffard (1997a,b), where these phenomena are investigated in more detail, and where it is shown that this somewhat pathological behavior does not occur for a large class of 'self-similar' functions. As soon as the true density attains a critical Hölder-singularity at one point, the sup-norm risk of the oracle estimator is driven by the risk at such a 'critical' point, and then we can prove an oracle inequality. We note that similar assumptions were necessary in the construction of adaptive pointwise confidence intervals, cf. Picard and Tribouley (2000). Let now either p_0 ∈ C^1(R), or assume p_0 ∈ C^t(R) for some 0 < t < 1 but p_0 ∉ C^{t+δ}(R) for any δ > 0. For instance, if p_0 ∈ C^t(R) and, for some x′ ∈ R, lim inf_{v→0} H(p_0, x′, v, t)/ω(1/|v|) > 0, where x^δ ω(x) → ∞ as x → ∞ for every δ > 0, then this assumption is satisfied. Recall the definition of H(s, f) from (1). For x ∈ R, let k(x) be the integer satisfying 2^l x − 1 ≤ k(x) < 2^l x, and define

W(l, p_0) = [ (2^{lt}(t + 1)/H(t, p_0)) sup_{x∈R} | ∫_{k(x)−2^l x}^{k(x)−2^l x+1} (p_0(x + 2^{−l} u) − p_0(x)) du | ] ∧ 1.    (31)
We show in the proof of the following proposition that W(l, p_0) > 0 for all l ∈ N if p_0 is a uniformly continuous density. The (inverse of the) function W(l, p_0) measures the loss of adaptation compared to the oracle if the true density p_0 does not attain a critical Hölder singularity at any point. If p_0 is 'self-similar' in the sense that

sup_k |β_{lk}(p_0)| ≥ 2^{−l(t+1/2)} w(l)    (32)

for some positive function w(l), one can use, e.g., (63) to obtain a simple (not necessarily optimal) lower bound for W(l, p_0) of the form c w(l), where c > 0. For instance, if lim_{v→0} (p_0(x′ + v) − p_0(x′))/(|v|^t sign(v)) = H′ > 0 at some x′ ∈ R, then Condition (32) can be shown to hold with w(l) tending to a constant w ≥ H′(1 − 2^{−t})/(t + 1) as l → ∞. This implies that W(l, p_0) is bounded from below uniformly in l and converges to W ≥ H′(1 − 2^{−t})/H(t, p_0). In particular, if p_0 ∈ C^1(R), then W(l, p_0) → W ≥ 1/2. Note also that j^H → ∞ as n → ∞.

Proposition 6 Let p_n(ĵ_n) be the estimator from Proposition 5, and let j^H be as in (30). Suppose p_0 ∈ C^1(R), or assume p_0 ∈ C^t(R) for some 0 < t < 1 but p_0 ∉ C^{t+δ}(R) for any δ > 0. Then, for every n,

E||p_n(ĵ_n) − p_0||_∞ ≤ (52/W(j^H, p_0)) E||p_n(j^H) − p_0||_∞ + O(n^{−1/2}) + O((log n / n)^{2t/(2t+1)}).
5 Proofs of the Main Results

5.1 Projections onto spline spaces and their wavelet representation
We briefly review in this section how the wavelet estimator (6) for Battle-Lemarié wavelets can be represented as a spline projection estimator (7). We shall need the spline representation in some proofs, while the wavelet representation will be useful in others. Let T := T_j = {t_i}_{i=−∞}^{∞} = 2^{−j} Z, j ∈ Z, be a bi-infinite sequence of equally spaced knots. A function S is a spline of order r, or of degree m = r − 1, if on each interval (t_i, t_{i+1}) it is a polynomial of degree less than or equal to m (and exactly of degree m on at least one interval), and, at each breakpoint t_i, S is at least (m − 1)-times differentiable. The Schoenberg space S_r(T) := S_r(T, R) is defined as the set of all splines of order r, and it coincides with the space S_r(T, 1, R) in DeVore and Lorentz (1993, p.135). The space S_r(T_j) has a Riesz-basis formed by B-splines {N_{j,k,r}}_{k∈Z} that we now describe. [See Section 4.4 in Schumaker (1993) and p.138f. in DeVore and Lorentz (1993) for more details.] Define

N_{0,r}(x) = 1_{[0,1)} ∗ ... ∗ 1_{[0,1)}(x)  (r times) = (1/(r−1)!) Σ_{i=0}^{r} (−1)^i (r choose i) (x − i)_+^{r−1}.

For r = 2, this is the linear B-spline (the usual 'hat' function), for r = 3 it is the quadratic, and for r = 4 it is the cubic B-spline. Set N_{k,r}(x) := N_{0,r}(x − k). Then the elements of the Riesz-basis are given by

N_{j,k,r}(x) := N_{k,r}(2^j x) = N_{0,r}(2^j x − k).
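The explicit formula for N_{0,r} can be checked against the defining properties of B-splines; a small sketch (the partition-of-unity identity Σ_k N_{0,r}(x − k) = 1, which follows from the convolution definition, serves as the check):

```python
from math import comb, factorial

def N(r, x):
    """B-spline N_{0,r}(x) = (1/(r-1)!) sum_{i=0}^{r} (-1)^i C(r,i) (x - i)_+^{r-1}, r >= 2."""
    return sum((-1) ** i * comb(r, i) * max(x - i, 0.0) ** (r - 1)
               for i in range(r + 1)) / factorial(r - 1)

hat_peak = N(2, 1.0)     # the linear B-spline ('hat' function) peaks at 1 with value 1
# partition of unity: sum_k N_{0,r}(x - k) = 1; only r integer shifts meet the support [0, r)
totals = {r: sum(N(r, 0.37 + k) for k in range(r)) for r in (2, 3, 4)}
```

Since the formula is piecewise polynomial, the identities hold exactly up to floating-point error.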
By the Curry-Schoenberg theorem, any S ∈ S_r(T_j) can be uniquely represented as S(x) = Σ_{k∈Z} c_k N_{j,k,r}(x). The orthogonal projection π_j(f) of f ∈ L^2(R) onto S_r(T_j) ∩ L^2(R) is derived, e.g., in DeVore and Lorentz (1993, p.401f.), where it is shown that π_j(f) = 2^{j/2} Σ_{k∈Z} c_k N_{j,k,r}, with the coefficients c_k := c_k(f) satisfying (Ac)_k = 2^{j/2} ∫ N_{j,k,r}(x) f(x) dx, where the matrix A is given by

a_{kl} = 2^j ∫ N_{j,k,r}(x) N_{j,l,r}(x) dx = ∫ N_{k,r}(x) N_{l,r}(x) dx.    (33)

The inverse A^{−1} of the matrix A exists (see Corollary 4.2 on p.404 in DeVore and Lorentz (1993)), and if we denote its entries by b_{kl}, so that c_k = 2^{j/2} Σ_l b_{kl} ∫ N_{j,l,r}(x) f(x) dx, we have

π_j(f)(y) = 2^j ∫ Σ_k Σ_l b_{kl} N_{j,l,r}(x) N_{j,k,r}(y) f(x) dx = ∫ κ_j(x, y) f(x) dx,

where κ_j(x, y) = 2^j κ(2^j x, 2^j y) and where

κ(x, y) = Σ_k Σ_l b_{kl} N_{l,r}(x) N_{k,r}(y)    (34)
is the spline projection kernel. Note that κ is symmetric in its arguments. The idea behind the Battle-Lemarié wavelets is to diagonalize the kernel κ of the projection operator π_j or, what is the same, to construct an orthonormal basis for the space S_r(T_j). This led in fact to one of the first examples of wavelets, see, e.g., p.21f. and Section 2.3 in Meyer (1992), Section 5.4 in Daubechies (1992), or Section 6.1 in HKPT (1998). There it is shown that there exists an (r − 1)-times differentiable father wavelet φ_r with exponential decay, the Battle-Lemarié wavelet of order r, such that

S_r(T_j) ∩ L^2(R) = V_{j,r} = { Σ_k c_k 2^{j/2} φ_r(2^j(·) − k) : Σ_k c_k^2 < ∞ }.
This necessarily implies that the kernels κ and K = K(φ_r) describe the same projections in L^2(R), and the following simple lemma shows that these kernels are in fact pointwise the same.

Lemma 1 Let {N_{k,r}}_{k∈Z} be the Riesz-basis of B-splines of order r ≥ 1, and let φ_r be the associated Battle-Lemarié father wavelet. If K is as in (2) and κ is as in (34), then, for all x, y ∈ R, we have K(x, y) = κ(x, y).

Proof. If r = 1, then N_{0,1} = φ_1, since this is just the Haar basis. So consider r > 1. Since {φ_r(· − k) : k ∈ Z} is an orthonormal basis of S_r(Z) ∩ L^2(R) (cf., e.g., Theorem 1 on p.26 in Meyer (1992)), it follows that K and κ are the kernels of the same L^2-projection operator, and therefore, for all f, g ∈ L^2(R),

∫∫ (K(x, y) − κ(x, y)) f(x) g(y) dx dy = 0.

By density in L^2(R × R) of linear combinations of products of elements of L^2(R), this implies that κ and K are almost everywhere equal in R^2. We complete the proof by showing that both functions are continuous in R^2. For K, this follows from the decomposition

|K(x, y) − K(x′, y′)| ≤ Σ_k |φ_r(x − k) − φ_r(x′ − k)| |φ_r(y − k)| + Σ_k |φ_r(y − k) − φ_r(y′ − k)| |φ_r(x′ − k)|,

the uniform continuity of φ_r (r > 1) and relation (3). For κ we use the relation (36) below,

|κ(x, y) − κ(x′, y′)| ≤ Σ_i |N_{i,r}(x) − N_{i,r}(x′)| |H(y − i)| + Σ_i |H(y − i) − H(y′ − i)| |N_{i,r}(x′)|,

which implies continuity of κ on R^2, since N_{0,r} and H are uniformly continuous (as N_{0,r} is, and Σ_i |g(|i|)| < ∞), and since N_{0,r} has compact support.
The fact that these kernels are pointwise the same allows one to compute the estimator (6) for the Battle-Lemarié wavelets in terms of B-splines by the formula (7).
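For r = 2 the spline form of the kernel can be evaluated directly from the representation b_{kl} = c λ^{|k−l|} of Remark 2. A hedged numerical sketch, checking that the projection kernel is symmetric and reproduces constants (∫ K(x, y) dy = 1); the truncation window is an illustrative choice:

```python
import numpy as np

c, lam = np.sqrt(3.0), -2.0 + np.sqrt(3.0)     # b_kl = c * lam**|k-l| for r = 2 (Remark 2)

def hat(x):                                    # N_{0,2}, the linear B-spline on [0, 2)
    return np.maximum(1.0 - np.abs(x - 1.0), 0.0)

x = 0.3
ks = [-1, 0]                                   # the only k with N_{0,2}(x - k) != 0 for x = 0.3
# K(x, y) = sum_l w_l(x) N_{0,2}(y - l), with w_l(x) = sum_k c lam^|k-l| N_{0,2}(x - k);
# lam^|k-l| decays geometrically, so a window of |l| <= 40 is ample.
w = {l: sum(c * lam ** abs(k - l) * hat(x - k) for k in ks) for l in range(-40, 41)}

# Reproduction of constants: int K(x, y) dy = sum_l w_l * int N_{0,2} = sum_l w_l = 1.
mass = sum(w.values())

def K(xx, yy):
    return sum(c * lam ** abs(k - l) * hat(xx - k) * hat(yy - l)
               for k in range(-45, 45) for l in range(-45, 45))

sym_gap = abs(K(0.3, 1.7) - K(1.7, 0.3))       # symmetry of the kernel in its arguments
```

Reproduction of constants is the α = 0 case of the polynomial reproduction used later in the bias bound (39).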
5.2 An exponential inequality for the uniform deviations of the linear estimator
To control the uniform deviations of the linear estimators from their means, one can use inequalities for the empirical process indexed by classes of functions F contained in

K = { 2^{−j} K_j(·, y) : y ∈ R, j ∈ N ∪ {0} },    (35)
together with suitable bounds on σ. If K is a convolution kernel, then K is contained in the set of dilations and translations of a fixed function K, and then K is of VC-type (i.e., it satisfies (11)) if K is of bounded variation, a result due to Nolan and Pollard (1987). In fact, bounded variation can be replaced by bounded p-variation for p < ∞ (see Lemma 1 in Giné and Nickl (2007)), which also allows for α-Hölder kernels, α > 0. If K = K(φ) is a wavelet projection kernel as in (2), and if φ has compact support (and is of finite p-variation), it is proved in Lemma 2 in Giné and Nickl (2007) that the class K also satisfies the bound (11). However, the proof there does not apply to Battle-Lemarié wavelets. A different proof, using the Toeplitz and band-limited structure of the spline projection kernel, still enables us to prove that these classes of functions are of Vapnik-Červonenkis type.

Lemma 2 Let K be as in (35), where φ_r is a Battle-Lemarié wavelet for some r ≥ 1. Then there exist finite constants A ≥ 2 and v ≥ 2 such that

sup_Q N(K, L^2(Q), ε) ≤ (A/ε)^v

for 0 < ε < 1, where the supremum extends over all Borel probability measures Q on R.

Proof. In the case r = 1, φ_1 is just the Haar wavelet, in which case the result follows from Lemma 2 in Giné and Nickl (2007). Hence assume r ≥ 2. The matrix A is Toeplitz since, by a change of variables in (33), a_{kl} = a_{k+1,l+1} for all k, l ∈ Z, and it is band-limited because N_{0,r} has compact support. It follows that A^{−1} is also Toeplitz, and we denote its entries by b_{kl} = g(|k − l|) for some function g. Furthermore, it is known (e.g., Theorem 4.3 on p.404 in DeVore and Lorentz (1993)) that the entries of the inverse of any positive definite band-limited matrix satisfy |b_{kl}| ≤ c λ^{|k−l|} for some 0 < λ < 1 and c finite. Now, following Huang and Studden (1992), we write

Σ_k g(|l − k|) N_{k,r}(x) = Σ_k g(|l − k|) N_{k−l,r}(x − l) = Σ_k g(|k|) N_{k,r}(x − l),
so that

2^{−j} κ_j(·, y) = Σ_{l∈Z} N_{j,l,r}(y) H(2^j(·) − l),    (36)

where

H(x) = Σ_{k∈Z} g(|k|) N_{k,r}(x)
is a function of bounded variation: To see the last claim, note that N_{0,r} is of bounded variation, and hence ||N_{k,r}||_{TV} = ||N_{0,r}||_{TV} (where ||·||_{TV} denotes the usual total-variation norm), so that

||H||_{TV} ≤ ||N_{0,r}||_{TV} Σ_{k∈Z} |g(|k|)| < ∞,

because Σ_k |b_{l,l−k}| ≤ Σ_k c λ^{|k|} < ∞. The last fact implies that

H = { H(2^j(·) − l) : l ∈ Z, j ∈ N ∪ {0} }

satisfies, for finite constants B > 1 and w ≥ 1,

sup_Q N(H, L^2(Q), ε) ≤ (B ||H||_∞ / ε)^w   for 0 < ε < ||H||_∞,
as proved in Nolan and Pollard (1987). Since N_{j,0,r} is zero if y is not contained in [0, 2^{−j} r], the sum in (36), for fixed y and j, extends only over the l's such that 2^j y − r ≤ l < 2^j y, hence consists of at most r terms. This implies that K is contained in the set H_r of linear combinations of at most r functions from H, with coefficients bounded in absolute value by ||N_{j,l,r}||_∞ = ||N_{0,r}||_∞ < ∞. Given ε, let ε′ = ε/(2r max(||H||_∞, ||N_{0,r}||_∞)). Let α_1, ..., α_{n_1} be an ε′-dense subset of [−||N_{0,r}||_∞, ||N_{0,r}||_∞] which, for ε′ < ||N_{0,r}||_∞, has cardinality n_1 ≤ 3||N_{0,r}||_∞/ε′. Furthermore, let h_1, ..., h_{n_2} be a subset of H of cardinality n_2 = N(H, L^2(Q), ε′) which is ε′-dense in H in the L^2(Q)-metric. It follows that, for ε′ < min(||H||_∞, ||N_{0,r}||_∞), every Σ_{l∈Z} N_{j,l,r}(y) H(2^j(·) − l) is at L^2(Q)-distance at most ε from Σ_{l=1}^{r} α_{i(l)} h_{i′(l)} for some 1 ≤ i(l) ≤ n_1 and 1 ≤ i′(l) ≤ n_2. The total number of such linear combinations is dominated by (n_1 n_2)^r ≤ (B′/ε)^{(w+1)r}. This shows that the lemma holds for ε < 2r min{||H||_∞, ||N_{0,r}||_∞} max{||H||_∞, ||N_{0,r}||_∞} = 2r ||H||_∞ ||N_{0,r}||_∞ = U, which completes the proof by taking A = max(B′, U, e) (for ε ∈ [U, A] one ball covers the whole set).

This lemma implies the following result.

Proposition 7 Let K be as in (2) and assume either that φ has compact support and is of bounded p-variation (p < ∞), or that φ is a Battle-Lemarié father wavelet for some r ≥ 1. Suppose P has a bounded density p_0. Given C, T > 0, there exist finite positive constants C_1 = C_1(C, K, ||p_0||_∞) and C_2 = C_2(C, T, K, ||p_0||_∞) such that, if

√(n/(2^j j)) ≥ C   and   C_1 √(2^j j/n) ≤ t ≤ T,

then
Pr( sup_{y∈R} |p_n(y, j) − E p_n(y, j)| ≥ t ) ≤ exp(−C_2 n t^2 / 2^j).    (37)
Proof. We first prove the Battle-Lemarié wavelet case. If r > 1, the function K is continuous (see the proof of Lemma 1), and therefore the supremum in (37) is over a countable set. That this is also true for r = 1 follows from Remark 1 in Giné and Nickl (2007). We apply Proposition 1 and Lemma 2 to the supremum of the empirical process indexed by the classes of functions

K_j := { 2^{−j} K_j(·, y)/(2||Φ||_∞) : y ∈ R },

where Φ is a function majorizing K (as in (3)), so that K_j is uniformly bounded by 1/2. We next bound the second moments E(2^{−2j} K_j^2(X, y)). We have, using (3),

2^{−2j} ∫ K_j^2(x, y) p_0(x) dx ≤ ∫ Φ^2(|2^j(x − y)|) p_0(x) dx ≤ 2^{−j} ∫ Φ^2(|u|) p_0(y + 2^{−j} u) du ≤ 2^{−j} ||p_0||_∞ ||Φ||_2^2.    (38)

Hence we may take σ = √(2^{−j} ||Φ||_2^2 ||p_0||_∞)/(2||Φ||_∞), and the result is then a direct consequence of Proposition 1, which applies by Lemma 2. For compactly supported wavelets, the same proof applies, using Lemma 2 (and Remark 1) in Giné and Nickl (2007). Using Proposition 1, and by keeping track of the constants in the proof of Lemma 2, one could also obtain explicit constants in inequality (37). For applications in limit theorems unspecified constants suffice.

Proof. (Theorem 1) Using Lemma 2, the first two claims of the Theorem follow by the same proof as in Theorem 1 and Corollary 1 in Giné and Nickl (2007). For the bias term, we have the following argument. It is well known that if φ is m-times differentiable with derivatives in L^p(R) for some 1 ≤ p ≤ ∞, then the projection kernel reproduces polynomials of degree less than or equal to m, that is, for every x and 0 ≤ α ≤ m,

∫ K(x, y) y^α dy = x^α,

cf., e.g., Theorem 8.2 in HKPT (1998). Recall from Section 5.1 that φ_r is (r − 1)-times differentiable. If p_0 ∈ C^t(R), then we can write, by Taylor expansion and the mean value theorem,

|E p_n(x) − p_0(x)| = | ∫ K_j(x, y)(p_0(y) − p_0(x)) dy |
  ≤ ∫ |K_j(x, y)| |D^{[t]} p_0(x + ξ(y − x)) − D^{[t]} p_0(x)| |y − x|^{[t]} dy
  ≤ ∫ 2^j Φ(|2^j(y − x)|) |D^{[t]} p_0(x + ξ(y − x)) − D^{[t]} p_0(x)| |y − x|^{[t]} dy
  = 2^{−j[t]} ∫ Φ(|u|) |u|^{[t]} |D^{[t]} p_0(x + ξ 2^{−j} u) − D^{[t]} p_0(x)| du
  ≤ 2^{−jt} ||p_0||_{t,∞} C,    (39)

for t noninteger, where C := C(Φ) = ∫ Φ(|u|) |u|^t du. The proof of the same inequality for r ≥ t ∈ N is similar (in fact shorter), and omitted.
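The scaling 2^{−jt} in (39) is easy to illustrate for the Haar kernel: there E p_n(·, j) is the dyadic cell average of p_0, and halving the cell width should roughly halve the sup-norm bias when t = 1. A minimal numerical sketch with an assumed Lipschitz density on [0, 1]:

```python
import numpy as np

# assumed density on [0, 1] with Lipschitz constant pi (t = 1): p0(x) = 1 + 0.5 sin(2 pi x)
p0 = lambda x: 1.0 + 0.5 * np.sin(2.0 * np.pi * x)

def sup_bias(j, m=400):
    """sup_x |E p_n(x, j) - p0(x)| for the Haar kernel: E p_n is the dyadic cell average."""
    edges = np.linspace(0.0, 1.0, 2**j + 1)
    worst = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        xs = np.linspace(a, b, m)
        avg = np.trapz(p0(xs), xs) / (b - a)        # cell average = E p_n on this cell
        worst = max(worst, np.abs(p0(xs) - avg).max())
    return worst

ratio = sup_bias(5) / sup_bias(6)   # (39) with t = 1 predicts roughly a factor 2^t = 2
```

The observed ratio is close to 2, in line with the 2^{−jt} bound.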
5.3 An exponential inequality for the distribution function of the linear estimator
The quantity of interest in this subsection is the distribution function F_n^S of the linear projection estimator p_n from (7); more precisely, we will study the stochastic process

√n (F_n^S(s) − F(s)) = √n ∫_{−∞}^s (p_n(y, j) − p_0(y)) dy,   s ∈ R.

To prove a functional CLT for this process, it turns out that it is easier to compare F_n^S to F_n rather than to F. With F = {1_{(−∞,s]} : s ∈ R}, the decomposition

(F_n^S − F_n)(s) = (P_n − P)(π_j(f) − f) + ∫ (π_j(p_0) − p_0) f,   f ∈ F,    (40)

will be useful, since it splits the quantity of interest into a deterministic 'bias' term and an empirical process. We first give a bound on the deterministic term. To show that ∫ (π_j(p_0) − p_0) f = O(2^{−jt}) is quite straightforward by the usual bias techniques, but to obtain meaningful results for the most interesting choices of j, the sharper bound O(2^{−j(t+1)}) is crucial, and it can be obtained as follows.

Lemma 3 Assume that p_0 is a bounded function (t = 0), or that p_0 ∈ C^t(R) for some 0 < t ≤ r. Let F = {1_{(−∞,s]} : s ∈ R}. Then

| ∫_R (π_j(p_0) − p_0) f | ≤ C 2^{−j(t+1)}    (41)

for some constant C depending only on r and ||p_0||_{t,∞}.
Proof. If ψ := ψ_r is the mother wavelet associated to φ_r, we have, using that the wavelet series of p_0 ∈ L^1(R) converges in L^1(R),

π_j(p_0) − p_0 = − Σ_{l=j}^{∞} Σ_k β_{lk}(p_0) ψ_{lk},

in the L^1(R)-sense. Therefore, since f = 1_{(−∞,s]} ∈ L^∞(R), we have

− ∫_R (π_j(p_0) − p_0) f = ∫_R ( Σ_{l=j}^{∞} Σ_k β_{lk}(p_0) ψ_{lk}(x) ) f(x) dx
  = Σ_{l=j}^{∞} Σ_k β_{lk}(p_0) ∫_R f(x) ψ_{lk}(x) dx
  = Σ_{l=j}^{∞} Σ_k β_{lk}(p_0) β_{lk}(f).    (42)

The lemma now follows from an estimate for the decay of the wavelet coefficients of p_0 and f, namely the bounds

sup_{f∈F} Σ_k |β_{lk}(f)| ≤ c 2^{−l/2}   and   sup_k |β_{lk}(p_0)| ≤ c′ 2^{−l(t+1/2)}.    (43)

The first bound is proved as in the proof of Lemma 3 in Giné and Nickl (2007), noting that the identity before equation (37) in that proof also holds for spline wavelets by their exponential decay property. The second bound follows from

sup_k |β_{lk}(p_0)| ≤ c″ 2^{−l/2} ||K_{l+1}(p_0) − K_l(p_0)||_∞ ≤ c″ 2^{−l/2} (||K_l(p_0) − p_0||_∞ + ||K_{l+1}(p_0) − p_0||_∞) ≤ c′ 2^{−l/2} 2^{−lt},
where we used (9.35) in HKPT (1998) for the first inequality and (39) in the last. To control the fluctuations of the stochastic term, one applies Talagrand's inequality to the empirical process indexed by the 'shrinking' classes of functions {π_j(f) − f : f ∈ F}. These classes consist of differences of elements in F and in

K_j′ := { ∫_{−∞}^t K_j(·, y) dy : t ∈ R },

and we have to show that, for each j, this class satisfies the entropy condition (11). Again, for φ with compact support (and of finite p-variation), this result was proved in Lemma 2 in Giné and Nickl (2007), but we have to extend it now to the Battle-Lemarié wavelets considered here.

Lemma 4 Let K_j′ be as above, where φ_r is a Battle-Lemarié wavelet for r ≥ 1. Then there exist finite constants A ≥ e and v ≥ 2, independent of j, such that

sup_Q N(K_j′, L^2(Q), ε) ≤ (A/ε)^v,   0 < ε < 1,

where the supremum extends over all Borel probability measures on R.

Proof. In analogy to the proof of Lemma 2, one can write

∫_{−∞}^t K_j(·, y) dy = 2^j Σ_{l∈Z} ∫_{−∞}^t N_{j,l,r}(y) dy H(2^j(·) − l),
since the series (36) converges absolutely (in view of

Σ_l |H(2^j x − l)| ≤ Σ_l Σ_k |g(|k|)| N_{k,r}(2^j x − l) ≤ r ||N_{0,r}||_∞ Σ_k |g(|k|)| < ∞).

Recall that N_{j,l,r} is supported in the interval [2^{−j} l, 2^{−j}(r + l)]. Hence, if l > 2^j t, the last integral is zero. For l ≤ 2^j t − r, the integral equals the constant c = ∫_R N_{0,r}(y) dy, and for l ∈ [2^j t − r, 2^j t], the integral c_{j,l,r} is bounded by c, so that this sum in fact equals

c Σ_{l ≤ 2^j t − r} H(2^j(·) − l) + Σ_{2^j t − r < l ≤ 2^j t} c_{j,l,r} H(2^j(·) − l).
I) Since 2^j j ≥ c log n for some c > 0 independent of n, we have from Proposition 7, integrating tail probabilities, that

E||p_n(j) − E p_n(j)||_∞^p ≤ D^p (2^j j/n)^{p/2} := D^p σ^p(j, n)    (44)

for every j ∈ J, 1 ≤ p < ∞, and some 0 < D < ∞ depending only on ||p_0||_∞ and Φ. For the bias, we recall from (39) that, for 0 < t ≤ r,

|E p_n(y, j) − p_0(y)| ≤ 2^{−jt} ||p_0||_{t,∞} C(Φ) := B(j, p_0).    (45)

If the density p_0 is only uniformly continuous, then one still has from (3) and integrability of Φ that, uniformly in y ∈ R,

|E p_n(y, j) − p_0(y)| ≤ ∫ Φ(|u|) |p_0(y − 2^{−j} u) − p_0(y)| du := B(j, p_0) = o(1).    (46)
II) Define M̃ := M̃_n = C ||p_n(j_max)||_∞ with C = 49 ||Φ||_2^2, and define also M = C ||p_0||_∞ for the same C. We need to control the probability that M̃ > 1.01 M, or M̃ < 0.99 M, if p_0 is uniformly continuous. For some 0 < L < ∞ and n large enough we have

Pr( |M̃ − M| > 0.01 C ||p_0||_∞ ) = Pr( | ||p_n(j_max)||_∞ − ||p_0||_∞ | > 0.01 ||p_0||_∞ )
  ≤ Pr( ||p_n(j_max) − p_0||_∞ > 0.01 ||p_0||_∞ )
  ≤ Pr( ||p_n(j_max) − E p_n(j_max)||_∞ > 0.01 ||p_0||_∞ − B(j_max, p_0) )
  ≤ Pr( ||p_n(j_max) − E p_n(j_max)||_∞ > 0.009 ||p_0||_∞ )
  ≤ exp( −(log n)^2 / L )

by Proposition 7 and Step I). Furthermore, there exists a constant L′ such that E M̃ ≤ L′ for every n, in view of

E||p_n(j_max)||_∞ ≤ E||p_n(j_max) − E p_n(j_max)||_∞ + ||E p_n(j_max)||_∞ ≤ c + ||Φ||_1 ||p_0||_∞,

where we have used (3) and (44).

III) We need some observations on the Rademacher processes used in the definition of ĵ_n. First, for the symmetrized empirical measure P̃_n = (2n)^{−1} Σ_{i=1}^n ε_i δ_{X_i}, we have

R(n, j) = ||π_j(P̃_n)||_∞ = ||π_j(π_l(P̃_n))||_∞ ≤ ||π_j||'_∞ R(n, l) ≤ B(φ) R(n, l)    (47)
for every l > j: here ||π_j||'_∞ is the operator norm in L^∞(R) of the projection π_j, which admits bounds B(φ) independent of j. (Clearly, π_j acts on finite signed measures µ by duality, taking values in L^∞(R), since |π_j(µ)| = |∫ K_j(·, y) dµ(y)| ≤ 2^j ||Φ||_∞ |µ|(R).) See Remark 3 for details on how to obtain B(φ). Integrating the last chain of inequalities establishes (47) also for E_ε R(n, j). Furthermore, for j < l,

T(n, j, l) ≤ R(n, j) + R(n, l) ≤ (1 + B(φ)) R(n, l),    (48)

and the same inequality holds for the Rademacher expectations of T(n, j, l). We also record the following bound for the (full) expectation of R(n, l), l ∈ J: Using inequality (21) and the variance computation (38), we have that there exists a constant L depending only on ||p_0||_∞ and Φ such that, for every l ∈ J,

E R(n, l) ≤ L √(2^l l/n).
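The order √(2^l l/n) of E R(n, l) can be illustrated by direct simulation for the Haar basis, where π_l(P̃_n) is piecewise constant on dyadic cells. A sketch with assumed sample size, levels, and number of Monte Carlo draws:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.random(n)                     # uniform data on [0, 1), an illustrative choice

def R(X, l, eps):
    """R(n, l) = ||pi_l(P~_n)||_inf for Haar, P~_n = (2n)^{-1} sum_i eps_i delta_{X_i}."""
    cells = np.minimum((X * 2**l).astype(int), 2**l - 1)
    sums = np.bincount(cells, weights=eps, minlength=2**l)
    return 2**l * np.abs(sums).max() / (2 * len(X))

ratios = []
for l in (4, 7):
    draws = [R(X, l, rng.choice([-1.0, 1.0], size=n)) for _ in range(50)]
    ratios.append(np.mean(draws) / np.sqrt(2**l * l / n))   # should stay O(1) across levels
```

Both ratios come out of the same order, consistent with a level-independent constant L.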
Proof of (25). Let F = {1_{(−∞,s]} : s ∈ R}, and let f ∈ F. We have

√n ∫ (p_n(ĵ_n) − p_0) f = √n ∫ (p_n(j_max) − p_0) f + √n ∫ (p_n(ĵ_n) − p_n(j_max)) f.
The first term satisfies the CLT from Theorem 2 for the linear estimator with j_n = j_max. We now show that the second term converges to zero in probability. Observe first

p_n(ĵ_n)(y) − p_n(j_max)(y) = P_n(K_{ĵ_n}(·, y) − K_{j_max}(·, y)) = − Σ_{l=ĵ_n}^{j_max−1} Σ_k β̂_{lk} ψ_{lk}(y),

with convergence in L^1(R). Next, we have by (9.35) in HKPT (1998), for all l ∈ [ĵ_n, j_max − 1] and all k, by definition of ĵ_n, that for some 0 < D′ < ∞,

(1/D′) 2^{l/2} |β̂_{lk}| ≤ sup_{y∈R} |P_n(K_{l+1}(·, y)) − P_n(K_l(·, y))| = ||p_n(l + 1) − p_n(l)||_∞
  ≤ ||p_n(l + 1) − p_n(ĵ_n)||_∞ + ||p_n(l) − p_n(ĵ_n)||_∞
  ≤ (1 + B(φ))(R(n, l + 1) + R(n, l)) + 3 √(M̃ 2^l l/n)

in case ĵ_n = j̄_n or ĵ_n = j̄_n^ε, using also the inequality T(n, j̄_n, l) ≤ (1 + B(φ)) R(n, l) for l ≥ j̄_n (see (48)). Consequently, uniformly in f ∈ F,

E | ∫ (p_n(ĵ_n) − p_n(j_max)) f | = E | Σ_{l=ĵ_n}^{j_max−1} Σ_k β̂_{lk} ∫ ψ_{lk}(y) f(y) dy |
  ≤ E Σ_{l=j_min}^{j_max−1} [ (B(φ) + 1)(R(n, l + 1) + R(n, l)) + 3 √(M̃ 2^l l/n) ] D′ 2^{−l/2} Σ_k |β_{lk}(f)|
  ≤ (D″/√n) Σ_{l=j_min}^{j_max−1} 2^{−l/2} √l = o(1/√n),

using the moment bounds in II), III), ĵ_n ≥ j_min → ∞ as n → ∞ (by definition of J), and since sup_{f∈F} Σ_k |β_{lk}(f)| ≤ c 2^{−l/2} by (43) for some constant c.
Proof of (26) and (27): The proof of the case t = 0 follows from a simple modification of the arguments below, as in Theorem 2 in Giné and Nickl (2008), so we omit it. [In this case, one defines j* as j_max if t = 0, so that only the case ĵ_n ≤ j* has to be considered.] For t > 0, define j* := j(p_0) by the balance equation

j* = min{ j ∈ J : B(j, p_0) ≤ √(2 log 2) ||p_0||_∞^{1/2} ||Φ||_2 σ(j, n) }.    (49)

Using the results from I), it is easily verified that 2^{j*} ≃ (n/log n)^{1/(2t+1)} if p_0 ∈ C^t(R) for some 0 < t ≤ r, and that

σ(j*, n) = O((log n / n)^{t/(2t+1)})

is the rate of convergence required in (27). We will consider the cases {ĵ_n ≤ j*} and {ĵ_n > j*} separately. First, if ĵ_n is j̄_n, then we have

E( ||p_n(j̄_n) − p_0||_∞ I_{ {j̄_n ≤ j*} ∩ {M̃ ≤ 1.01M} } )
  ≤ E( ||p_n(j̄_n) − p_n(j*)||_∞ I_{ {j̄_n ≤ j*} ∩ {M̃ ≤ 1.01M} } ) + E||p_n(j*) − p_0||_∞
  ≤ (B(φ) + 1) E R(n, j*) + √(1.01 M) σ(j*, n) + E||p_n(j*) − p_0||_∞
  ≤ B′ √(2^{j*} j*/n) + B″ σ(j*, n) = O(σ(j*, n)),    (50)

by the definition of j̄_n, (48), the definitions of M and j*, (44), and the moment bound in III), and likewise if ĵ_n = j̄_n^ε. If ĵ_n is j̃_n or j̃_n^ε, then one has the same bound (without even using (48)).
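The balance defining j* in (49) is easy to examine numerically; a sketch with assumed normalizing constants (all set to 1 except the variance factor √(2 log 2), which is kept from (28)):

```python
import math

t = 1.0          # assumed Hoelder exponent; bias bound B(j) = 2^{-jt} with constants set to 1
n = 10**6

def jstar(n, t):
    """Smallest j with B(j, p0) <= sqrt(2 log 2) * sigma(j, n), sigma(j, n)^2 = 2^j j / n."""
    j = 1
    while 2.0 ** (-j * t) > math.sqrt(2.0 * math.log(2.0)) * math.sqrt(2.0 ** j * j / n):
        j += 1
    return j

j = jstar(n, t)
# 2^{j*} should be of the order (n / log n)^{1/(2t+1)}, as stated after (49)
ratio = 2.0 ** j / (n / math.log(n)) ** (1.0 / (2.0 * t + 1.0))
```

With these toy constants, 2^{j*} indeed sits within a bounded factor of (n/log n)^{1/(2t+1)}.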
Also, by the results in I), II),

E( ||p_n(ĵ_n) − p_0||_∞ I_{ {ĵ_n ≤ j*} ∩ {M̃ > 1.01M} } )
  ≤ Σ_{j∈J: j≤j*} E( [ ||p_n(j) − E p_n(j)||_∞ + B(j, p_0) ] I_{ĵ_n = j} I_{M̃ > 1.01M} )
  ≤ c log n [ D σ(j*, n) + B(j_min, p_0) ] · √( E 1_{ {M̃ > 1.01M} } )
  = o( (log n) √(exp(−(log n)^2 / L)) ) = o(σ(j*, n)).
We now turn to {ĵ_n > j*}. First,

E( ||p_n(ĵ_n) − p_0||_∞ I_{ {ĵ_n > j*} ∩ {M̃ ≥ 0.99M} } ) ≤ Σ_{j∈J: j>j*} D′ σ(j, n) · Pr( {ĵ_n = j} ∩ {0.99M ≤ M̃} )^{1/q}.

We show below that, for n large enough, some constant c, some δ > 0 and some q > 1,

Pr( {ĵ_n = j} ∩ {0.99M ≤ M̃} ) ≤ c 2^{−j(q/2+δ)},    (51)

which gives the bound

Σ_{j∈J: j>j*} D″ σ(j, n) · 2^{−j/2−jδ/q} = O(1/√n) = o(σ(j*, n)),
completing the proof, modulo verification of (51). To verify (51), we split the proof into two cases. Pick any j ∈ J so that j > j* and denote by j− the previous element in the grid (i.e., j− = j − 1).

Case I, ĵ_n = j̄_n or ĵ_n = j̄_n^ε: We give the proof for j̄_n^ε only, as the proof for j̄_n is the same given Corollary 1. One has

Pr( {j̄_n^ε = j} ∩ {0.99M ≤ M̃} ) ≤ Σ_{l∈J: l≥j} Pr( ||p_n(j−) − p_n(l)||_∞ > T(n, j−, l) + √(0.99M) σ(l, n) ).
We first observe that

||p_n(j−) − p_n(l)||_∞ ≤ ||p_n(j−) − p_n(l) − E p_n(j−) + E p_n(l)||_∞ + B(j−, p_0) + B(l, p_0),    (52)

where, setting

√(2 log 2) ||p_0||_∞^{1/2} ||Φ||_2 =: U(p_0, Φ),

B(j−, p_0) + B(l, p_0) ≤ 2 B(j*, p_0) ≤ 2 U(p_0, Φ) σ(j*, n) ≤ 2 U(p_0, Φ) σ(l, n), by definition of j* and since l > j− ≥ j*. Consequently, the l-th probability in the last sum is bounded by

Pr( ||p_n(j−) − p_n(l) − E p_n(j−) + E p_n(l)||_∞ > T(n, j−, l) + (√(0.99M) − 2U(p_0, Φ)) σ(l, n) ),    (53)

and we now apply Corollary 1 to this bound. Define the class of functions

F := F_{j−,l} = { 2^{−l} (K_{j−}(·, y) − K_l(·, y))/(4||Φ||_∞) : y ∈ R },
which is uniformly bounded by $1/2$ and satisfies (11) for some $A$ and $v$ independent of $l$ and $j^-$ by Lemma 2 (and a simple computation on covering numbers). We compute $\sigma$, using (38) and $l>j^-$:
$$
2^{-2l}E\big((K_{j^-}-K_l)(X,y)\big)^2 \le 2^{-2l+1}\big(EK_{j^-}^2(X,y)+EK_l^2(X,y)\big) \le 2^{-2l+1}\|\Phi\|_2^2\|p_0\|_\infty\big(2^{j^-}+2^l\big) \le 3\cdot 2^{-l}\|\Phi\|_2^2\|p_0\|_\infty,
$$
so that we can take
$$
\sigma^2 = 3\cdot 2^{-l}\,\frac{\|\Phi\|_2^2\|p_0\|_\infty}{16\|\Phi\|_\infty^2}.
$$
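The elementary inequality $2^{-2l+1}(2^{j^-}+2^l)\le 3\cdot 2^{-l}$ used in the variance computation needs only $j^-<l$; a check in exact arithmetic over a small grid:

```python
from fractions import Fraction

def lhs(j_minus, l):
    # 2^(-2l+1) * (2^(j-) + 2^l), computed as an exact rational
    return Fraction(2) ** (-2 * l + 1) * (Fraction(2) ** j_minus + Fraction(2) ** l)

# for every j- < l the left-hand side is at most 3 * 2^(-l)
for l in range(1, 30):
    for j_minus in range(0, l):
        assert lhs(j_minus, l) <= 3 * Fraction(2) ** (-l)
```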
Then the probability in (53) is equal to
$$
\Pr\left(\frac{2^l\,4\|\Phi\|_\infty}{n}\Big\|\sum_{i=1}^n\big(f(X_i)-Pf\big)\Big\|_{\mathcal F}>2\,\frac{2^l\,4\|\Phi\|_\infty}{n}\Big\|\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F}+\big(\sqrt{0.99M}-2U(p_0,\Phi)\big)\sigma(l,n)\right)
$$
$$
=\Pr\left(\Big\|\sum_{i=1}^n\big(f(X_i)-Pf\big)\Big\|_{\mathcal F}>2\Big\|\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F}+3\,\frac{n\big(\sqrt{0.99M}-2U(p_0,\Phi)\big)\sigma(l,n)}{3\cdot 2^l\cdot 4\|\Phi\|_\infty}\right).
$$
Since $n\sigma^2/\log(1/\sigma)\simeq n/(2^l l)\to\infty$ uniformly in $l\in\mathcal J$, there exists $\lambda_n\to\infty$ independent of $l$ such that (13) is satisfied, and the choice
$$
t=\frac{n\big(\sqrt{0.99M}-2U(p_0,\Phi)\big)\sigma(l,n)}{3\cdot 2^l\cdot 4\|\Phi\|_\infty}
$$
is admissible in Corollary 1 for $c_2(\lambda_n)=1+120\lambda_n^{-1}+10800\lambda_n^{-2}$. Hence, using Corollary 1, the last probability is bounded by
$$
2\exp\left(-\frac{n^2\big(\sqrt{0.99M}-2U(p_0,\Phi)\big)^2(2^l l/n)\,16\|\Phi\|_\infty^2}{9\cdot 6.3\cdot c_2(\lambda_n)\,2^{2l}\,n2^{-l}\|\Phi\|_2^2\|p_0\|_\infty\,16\|\Phi\|_\infty^2}\right) \le 2^{-l((q/2)+\delta)} \tag{54}
$$
for some $\delta>0$ and $q>1$, by definition of $M$. Since $\sum_{l\in\mathcal J:\,l\ge j}2^{-l((q/2)+\delta)} \le c\,2^{-j((q/2)+\delta)}$, we have proved (51).
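As a numerical sanity check of this geometric-series step (the values $q=1.05$, $\delta=0.01$ are illustrative choices with $q>1$, $\delta>0$, not the ones fixed in the proof), the tail $\sum_{l\ge j}2^{-l(q/2+\delta)}$ is bounded by $c\,2^{-j(q/2+\delta)}$ with $c=(1-2^{-(q/2+\delta)})^{-1}$:

```python
# sum_{l >= j} 2^(-l*a) = 2^(-j*a) / (1 - 2^(-a)) for any a = q/2 + delta > 0
q, delta = 1.05, 0.01        # illustrative admissible values (q > 1, delta > 0)
a = q / 2 + delta
c = 1.0 / (1.0 - 2.0 ** (-a))  # constant from the closed-form geometric sum

for j in range(1, 15):
    # truncated tail; the truncation error is astronomically small here
    tail = sum(2.0 ** (-l * a) for l in range(j, 2000))
    assert tail <= c * 2.0 ** (-j * a) * (1 + 1e-9)
```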
Case II, $\hat j_n=\tilde j_n$ or $\hat j_n=\tilde j_n^\varepsilon$: We again only prove the case $\hat j_n=\tilde j_n^\varepsilon$, in which one has
$$
\Pr\big(\{\tilde j_n^\varepsilon=j\}\cap\{0.99M\le\tilde M\}\big) \le \sum_{l\in\mathcal J:\,l\ge j}\Pr\Big(\|p_n(j^-)-p_n(l)\|_\infty>(B(\phi)+1)R(n,l)+\sqrt{0.99M}\,\sigma(l,n)\Big)
$$
$$
\le \sum_{l\in\mathcal J:\,l\ge j}\Pr\Big(\|p_n(j^-)-p_n(l)\|_\infty>T(n,j^-,l)+\sqrt{0.99M}\,\sigma(l,n)\Big),
$$
by inequality (48). The proof now reduces to the previous case.
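The exponent in (54) is proportional to $l$ because the powers of $2$ and of $n$ cancel exactly: $n^2\,(2^l l/n)/(2^{2l}\,n\,2^{-l})=l$ (we set the multiplicative norm constants $\|\Phi\|_2$, $\|\Phi\|_\infty$, $\|p_0\|_\infty$ aside, since they do not affect the cancellation). A check in exact arithmetic:

```python
from fractions import Fraction

def exponent_core(l, n):
    # n^2 * (2^l * l / n) / (2^(2l) * n * 2^(-l)), exact rational arithmetic
    num = Fraction(n) ** 2 * Fraction(2) ** l * l / Fraction(n)
    den = Fraction(2) ** (2 * l) * Fraction(n) * Fraction(2) ** (-l)
    return num / den

# the ratio equals l exactly, for every l and n
for l in range(1, 20):
    for n in (10, 1000, 10**6):
        assert exponent_core(l, n) == l
```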
5.5 Proofs for Section 4.1
The proofs will use three technical lemmas that we give at the end of the section.

Proofs of Propositions 3 and 4: We have that
$$
E\|p_n(\hat j_n)-p_0\|_\infty \le 8E\|p_n(j^*)-Ep_n(j^*)\|_\infty + \sqrt{1.01M}\,\sigma(j^*,n) + E\|p_n(j^*)-p_0\|_\infty + O(1/\sqrt n)
$$
$$
\le 8E\|p_n(j^*)-Ep_n(j^*)\|_\infty + \sqrt{1.01M}\Big(S(p_0)^{-1}E\|p_n(j^*)-Ep_n(j^*)\|_\infty + \sqrt{2^{\bar l(p_0)}\bar l(p_0)}/\sqrt n\Big) + E\|p_n(j^*)-p_0\|_\infty + O(1/\sqrt n) + O\big((\log n/n)^{2t/(2t+1)}\big)
$$
$$
= 30\,E\|p_n(j^*)-p_0\|_\infty + O(1/\sqrt n) + O\big((\log n/n)^{2t/(2t+1)}\big).
$$
The first inequality follows from collecting the bounds from the proof of Theorem 3 (in particular (50)) and desymmetrization (17) (using also $\|K_j(p_0)\|_\infty\le\|p_0\|_\infty$). The second inequality follows from
$$
\sqrt{\frac{2^{j^*}j^*}{n}}\big(1_{[j^*\le\bar l(p_0)]}+1_{[j^*>\bar l(p_0)]}\big) \le \sqrt{\frac{2^{\bar l(p_0)}\bar l(p_0)}{n}} + S(p_0)^{-1}E\|p_n(j^*)-Ep_n(j^*)\|_\infty + \frac{C2^{j^*}\log n}{n}
$$
and Lemma 8 below, using also that $2^{j^*}\simeq(n/\log n)^{1/(2t+1)}$. The last identity follows from Remark 3 (noting that $p_n(l)-Ep_n(l)=\pi_l(p_n(l)-p_0)$). This already proves Proposition 3, and Proposition 4 follows from the first inequality in the last display, the law of the logarithm for the Haar wavelet density estimator (28) and from the definition of $j^*$, after some computations, involving an upper bound for $(n/\log n)^{t/(2t+1)}\sqrt{2^{j^*}j^*/n}$.

Proofs of Propositions 5 and 6: We set shorthand $E(l):=E\|p_n(l)-Ep_n(l)\|_\infty$
and note that $E(l)\le E(j)$ holds for $l\le j$ in view of $p_n(l)-Ep_n(l)=\pi_l(p_n(j)-Ep_n(j))$ and Remark 3. Also, in the case of the Haar wavelet we can in fact take
$$
B(l,p_0):=\frac{2^{-lt}H(t,p_0)}{t+1} \tag{55}
$$
in the bound (45). Define $j^\#:=j^\#(p_0,n)$ by
$$
j^\#=\operatorname{argmin}_{l\in\mathcal J}\max\big(E(l),B(l,p_0)\big). \tag{56}
$$
Since $B(l,p_0)$ decreases and $E(l)$ is nondecreasing as $l$ increases, we have that $j^\#$ exists (and if the minimizer is not unique we take the smallest one) and
$$
B(l,p_0) \le E(l) = E\|p_n(l)-Ep_n(l)\|_\infty \quad\text{for all } l>j^\#. \tag{57}
$$
To see the latter, suppose to the contrary that $E(l)<B(l,p_0)$ for some $l>j^\#$. Then, since $B(l,p_0)<B(j^\#,p_0)$ by strict monotonicity, $l$ is a point where
$$
\max\big(B(l,p_0),E(l)\big)=B(l,p_0)<B(j^\#,p_0)\le\max\big(B(j^\#,p_0),E(j^\#)\big),
$$
a contradiction. We also note that by Lemma 7 below one has $2^{j^\#(p_0,n)}\simeq(n/\log n)^{1/(2t+1)}$.

Define $\tilde M=10^2\|p_n(j_{\max})\|_\infty$ and $M=10^2\|p_0\|_\infty$. We note in advance that, as in the proof of Theorem 3,
$$
E\|p_n(\hat j_n)-p_0\|_\infty 1_{\{\hat j_n\le j^\#\}\cap\{\tilde M>1.01M\}} = O\Big((\log n)\exp\Big(-\frac{(\log n)^2}{L}\Big)\Big) = O(n^{-\beta})
$$
as well as
$$
E\|p_n(\hat j_n)-p_0\|_\infty 1_{\{\hat j_n>j^\#\}\cap\{\tilde M<0.99M\}} = O(n^{-\beta})
$$
for every $\beta>0$. Hence it remains to consider the cases $1_{\{\hat j_n\le j^\#\}\cap\{\tilde M\le 1.01M\}}$ and $1_{\{\hat j_n>j^\#\}\cap\{\tilde M\ge 0.99M\}}$. First we have
$$
E\|p_n(\hat j_n)-p_0\|_\infty 1_{\{\hat j_n\le j^\#\}\cap\{\tilde M\le 1.01M\}} \le E\big(\|p_n(\hat j_n)-p_n(j^\#)\|_\infty+\|p_n(j^\#)-p_0\|_\infty\big)1_{\{\hat j_n\le j^\#\}\cap\{\tilde M\le 1.01M\}}
$$
$$
\le 5\,ER(n,j^\#)+\sqrt{1.01M}\sqrt{\frac{2^{j^\#}j^\#}{n}}+E\|p_n(j^\#)-p_0\|_\infty
$$
$$
\le 20E\|p_n(j^\#)-Ep_n(j^\#)\|_\infty+\sqrt{1.01M}\sqrt{\frac{2^{j^\#}j^\#}{n}}+E\|p_n(j^\#)-p_0\|_\infty+\frac{10\|p_0\|_\infty}{\sqrt n}
$$
by the definition of $\hat j_n$, and desymmetrization (17) (using also $\|K_j(p_0)\|_\infty\le\|p_0\|_\infty$). Second, using (44) and (57), and with $1/p+1/q=1$, $q>1$ arbitrary,
$$
E\|p_n(\hat j_n)-p_0\|_\infty 1_{\{\hat j_n>j^\#\}\cap\{0.99M\le\tilde M\}} \le 2\sum_{j\in\mathcal J:\,j>j^\#}\big(E\|p_n(j)-Ep_n(j)\|_\infty^p\big)^{1/p}\big(E1_{\{\hat j_n=j\}\cap\{0.99M\le\tilde M\}}\big)^{1/q}
$$
$$
\le \sum_{j\in\mathcal J:\,j>j^\#}2D\sqrt{\frac{2^j j}{n}}\cdot\Pr\big(\{\hat j_n=j\}\cap\{0.99M\le\tilde M\}\big)^{1/q}.
$$
We show in Lemma 5 below that for $n$ large enough, some $q>1$, some $\delta>0$ and some constant $c'$,
$$
\Pr\big(\{\hat j_n=j\}\cap\{0.99M\le\tilde M\}\big) \le c'\,2^{-j(q/2+\delta)}, \tag{58}
$$
which gives the bound
$$
\sum_{j\in\mathcal J:\,j>j^\#}D''\sqrt{\frac{2^j j}{n}}\cdot 2^{-j(1/2+\delta/q)} = O\Big(\frac 1{\sqrt n}\Big).
$$
Combining these bounds we have established
$$
E\|p_n(\hat j_n)-p_0\|_\infty \le 20E\|p_n(j^\#)-Ep_n(j^\#)\|_\infty+\sqrt{1.01M}\sqrt{\frac{2^{j^\#}j^\#}{n}}+E\|p_n(j^\#)-p_0\|_\infty+O\Big(\frac 1{\sqrt n}\Big). \tag{59}
$$
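The balancing index $j^\#=\operatorname{argmin}_l\max(E(l),B(l,p_0))$ from (56) can be illustrated numerically: with the surrogate $E(l)=\sqrt{2^l l/n}$ (the order given by Lemma 8 and (44)) and the Haar bias bound $B(l,p_0)=2^{-lt}H/(t+1)$ from (55), where $H=1$ and $t=1$ are hypothetical stand-ins, one recovers $2^{j^\#}\asymp(n/\log n)^{1/(2t+1)}$ as in Lemma 7, and $B(l)\le E(l)$ for $l>j^\#$ as in (57):

```python
import math

def balanced_index(n, t=1.0, H=1.0, jmax=40):
    E = lambda l: math.sqrt(2.0 ** l * l / n)        # surrogate stochastic term
    B = lambda l: 2.0 ** (-l * t) * H / (t + 1.0)    # Haar bias bound (55), H hypothetical
    j_sharp = min(range(1, jmax), key=lambda l: max(E(l), B(l)))
    # property (57): beyond the balance point the bias is dominated
    assert all(B(l) <= E(l) for l in range(j_sharp + 1, jmax))
    return j_sharp

for n in (10**4, 10**6, 10**8):
    j_sharp = balanced_index(n)
    # 2^{j#} stays within a constant factor of (n / log n)^{1/(2t+1)}, t = 1
    ratio = 2.0 ** j_sharp / (n / math.log(n)) ** (1.0 / 3.0)
    assert 0.1 < ratio < 10.0
```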
Let next $\bar l(p_0)$ be as in Lemma 8; then
$$
\sqrt{\frac{2^{j^\#}j^\#}{n}}\big(1_{[j^\#\le\bar l(p_0)]}+1_{[j^\#>\bar l(p_0)]}\big) \le \sqrt{\frac{2^{\bar l(p_0)}\bar l(p_0)}{n}}+S(p_0)^{-1}E\|p_n(j^\#)-Ep_n(j^\#)\|_\infty+\frac{C2^{j^\#}\log n}{n},
$$
so that (59) becomes
$$
E\|p_n(\hat j_n)-p_0\|_\infty \le \Big(20+\frac{\sqrt{1.01M}}{S(p_0)}\Big)E\|p_n(j^\#)-Ep_n(j^\#)\|_\infty+E\|p_n(j^\#)-p_0\|_\infty+O\Big(\frac 1{\sqrt n}\Big)+O\Big(\Big(\frac{\log n}{n}\Big)^{2t/(2t+1)}\Big), \tag{60}
$$
where we have also used Lemma 7. This completes the proof of Proposition 5, using $p_n(l)-Ep_n(l)=\pi_l(p_n(l)-p_0)$ and Remark 3, after computing the constant.

We now prove Proposition 6. Let $j^H$ be the resolution level of the oracle. Then we have from Lemma 6, and by definition of $j^\#$,
$$
E\|p_n(j^H)-p_0\|_\infty \ge W(j^H,p_0)\max\big(E(j^H),B(j^H,p_0)\big) \ge W(j^H,p_0)\max\big(E(j^\#),B(j^\#,p_0)\big),
$$
which, noting that $E\|p_n(j^\#)-Ep_n(j^\#)\|_\infty=E(j^\#)\le\max(E(j^\#),B(j^\#,p_0))$ as well as, using again Lemma 6, $E\|p_n(j^\#)-p_0\|_\infty\le 2\max(E(j^\#),B(j^\#,p_0))$, gives, by (60),
$$
E\|p_n(\hat j_n)-p_0\|_\infty \le \frac{22+\sqrt{1.01M}/S(p_0)}{W(j^H,p_0)}\,E\|p_n(j^H)-p_0\|_\infty+O\Big(\frac 1{\sqrt n}\Big)+O\Big(\Big(\frac{\log n}{n}\Big)^{2t/(2t+1)}\Big),
$$
completing the proof by definition of $M$, given the following lemmas.

Lemma 5 Let $\hat j_n$ and $j^\#$ be defined as in (29) and (56), respectively. Then, for every $j>j^\#$ and $n$ large enough (independent of $j$), we have that (58) holds.

Proof. Pick any $j\in\mathcal J$ so that $j>j^\#$ and denote by $j^-$ the previous element in the grid (i.e., $j^-=j-1$). Then
$$
\Pr\big(\{\hat j_n=j\}\cap\{0.99M\le\tilde M\}\big) \le \sum_{l\in\mathcal J:\,l\ge j}\Pr\bigg(\|p_n(j^-)-p_n(l)\|_\infty>5R(n,l)+\sqrt{0.99M}\sqrt{\frac{2^l l}{n}}\bigg).
$$
We first observe that
$$
\|p_n(j^-)-p_n(l)\|_\infty \le \|p_n(j^-)-p_n(l)-Ep_n(j^-)+Ep_n(l)\|_\infty+B(j^-,p_0)+B(l,p_0), \tag{61}
$$
where
$$
B(j^-,p_0)=2^t B(j^-+1,p_0)\le 2E(j^-+1)\le 2E(l) \qquad\text{and}\qquad B(l,p_0)\le E(l)
$$
by (57), since $j^-+1=j>j^\#$ and since $l\ge j>j^\#$. Consequently, the $l$-th probability in the last sum is bounded by
$$
\Pr\bigg(\|p_n(j^-)-p_n(l)-Ep_n(j^-)+Ep_n(l)\|_\infty>5R(n,l)-3E(l)+\sqrt{0.99M}\sqrt{\frac{2^l l}{n}}\bigg)
$$
$$
\le \Pr\bigg(\|p_n(j^-)-p_n(l)-Ep_n(j^-)+Ep_n(l)\|_\infty>2R(n,l)+(1-\alpha)\sqrt{0.99M}\sqrt{\frac{2^l l}{n}}\bigg)+\Pr\bigg(3R(n,l)<3E(l)-\alpha\sqrt{0.99M}\sqrt{\frac{2^l l}{n}}\bigg)=:A+B.
$$
The term $A$ is dominated by
$$
\Pr\bigg(\|p_n(j^-)-p_n(l)-Ep_n(j^-)+Ep_n(l)\|_\infty>T(n,j^-,l)+(1-\alpha)\sqrt{0.99M}\sqrt{\frac{2^l l}{n}}\bigg)
$$
in view of (48) and Remark 3, and we apply Corollary 1 to this probability. Arguing as in the bound (54) for (53), this probability is bounded by
$$
\exp\Big(-\frac{(1-\alpha)^2\,99\,l}{9\cdot 6.3\,c_1(\lambda_n)}\Big) \le 2^{-l(q/2+\delta)}
$$
for some $\delta>0$ and $q>1$ if $\alpha=0.5359$. Next, by symmetrization (17), the $B$-term is less than or equal to
$$
\Pr\bigg(R(n,l)<ER(n,l)-\frac\alpha 3\sqrt{0.99M}\sqrt{\frac{2^l l}{n}}\bigg)
$$
$$
=\Pr\bigg(\frac{2^l}{n}\Big\|\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F}<\frac{2^l}{n}\,E\Big\|\sum_{i=1}^n\varepsilon_i f(X_i)\Big\|_{\mathcal F}-\frac\alpha 6\sqrt{0.99M}\sqrt{\frac{2^l l}{n}}\bigg),
$$
where $\mathcal F:=\mathcal F_l=\{2^{-l}K_l(\cdot,y)/2:\,y\in\mathbb R\}$. Applying (20) and the variance computation (38) (and choosing $c_2(\lambda_n)\to\infty$ as in Corollary 1), we have that this probability is bounded by
$$
\exp\Big(-\frac{\alpha^2\cdot 99\,l}{36\cdot 2.1\,c_1(\lambda_n)}\Big) \le 2^{-l(q/2+\delta)}.
$$
Finally, since $\sum_{l\in\mathcal J:\,l\ge j}c\,2^{-l(q/2+\delta)}\le c'\,2^{-j(q/2+\delta)}$, we have proved the lemma.
Lemma 6 Let $W(l,p_0)$ be as in (31) and let $B(l,p_0)$ be as in (55). Let the conditions of Proposition 6 be satisfied. Then $W(l,p_0)>0$ for every $l\in\mathbb N$. Furthermore, we have for every $l,n\in\mathbb N$ that
$$
W(l,p_0)\max\big(E(l),B(l,p_0)\big) \le E\|p_n(l)-p_0\|_\infty \le 2\max\big(E(l),B(l,p_0)\big).
$$
Proof. We first make the general observation that
$$
\|Ep_n(l)-p_0\|_\infty = \sup_{x\in\mathbb R}\Big|\int_{\mathbb R}K(2^l x,2^l x+u)\big(p_0(x+2^{-l}u)-p_0(x)\big)\,du\Big| \tag{62}
$$
$$
\le 2^{-lt}\sup_{x\in\mathbb R}\int_{k(x)-2^l x}^{k(x)-2^l x+1}\frac{|p_0(x+2^{-l}u)-p_0(x)|}{|2^{-l}u|^t}\,|u|^t\,du \le B(l,p_0),
$$
where $k(x)$ is as before (31). To prove the first claim of the lemma, assume $W(l,p_0)<1$, in which case
$$
W(l,p_0)\,\frac{H(t,p_0)}{(t+1)2^{lt}} = \|Ep_n(l)-p_0\|_\infty.
$$
For any bounded $f$,
$$
\sup_l\sup_k\,2^l\Big|\int f(x)\psi(2^l x-k)\,dx\Big| = \sup_l\sup_k\Big|\int f\Big(\frac{u+k}{2^l}\Big)\psi(u)\,du\Big| \le \|f\|_\infty,
$$
which, applied to $f=Ep_n(l)-p_0$ and since $p_0$ is uniformly continuous (so that its wavelet series converges uniformly), gives
$$
\|Ep_n(l)-p_0\|_\infty \ge \sup_{m\ge l}\sup_k 2^{m/2}|\beta_{mk}(p_0)|. \tag{63}
$$
Suppose $|\beta_{mk}(p_0)|=0$ for all $m\ge l$ and all $k$. Then $p_0\in V_{l-1}$ is a piecewise constant function, which is impossible since $p_0$ is a uniformly continuous density. We conclude that $W(l,p_0)>0$ for all $l\in\mathbb N$.

For the upper bound, note that
$$
E\|p_n(l)-p_0\|_\infty \le E\|p_n(l)-Ep_n(l)\|_\infty+\|Ep_n(l)-p_0\|_\infty \le 2\max\big(E(l),B(l,p_0)\big)
$$
by the inequality below (62). For the lower bound, we have $p_n(l)-Ep_n(l)=\pi_l(p_n(l)-p_0)$ so that (cf. Remark 3) $E(l)=E\|p_n(l)-Ep_n(l)\|_\infty\le E\|p_n(l)-p_0\|_\infty$. Second, we have by Jensen's inequality
$$
E\|p_n(l)-p_0\|_\infty = E\|p_n(l)-Ep_n(l)+Ep_n(l)-p_0\|_\infty \ge \|Ep_n(l)-p_0\|_\infty,
$$
so that $E\|p_n(l)-p_0\|_\infty\ge W(l,p_0)B(l,p_0)$ by (62) and the definition of $B(l,p_0)$. Combining bounds, this completes the proof.

Lemma 7 We have, for all $n$ large enough, that
$$
2^{j^\#}\simeq\Big(\frac n{\log n}\Big)^{1/(2t+1)},
$$
so that in particular
$$
c(n/\log n)^{-t/(2t+1)} \le \sqrt{\frac{2^{j^\#}j^\#}{n}} \le C(n/\log n)^{-t/(2t+1)}
$$
for some constants $0<c<C<\infty$.
Proof. In view of Lemma 8, (44) and the definition of $\mathcal J$, we have, for appropriate constants $d,D$,
$$
d\sqrt{\frac{2^j j}{n}} \le E(j) \le D\sqrt{\frac{2^j j}{n}}
$$
as soon as $j_{\min}\ge\bar l(p_0)$, and the result follows from the definition of $j^\#$ and of $B(j,p_0)$.

Lemma 8 There exists $\bar l(p_0)$ finite such that for every $l>\bar l(p_0)$ and every $n$ we have
$$
E\|p_n(l)-Ep_n(l)\|_\infty \ge S(p_0)\sqrt{\frac{2^l l}{n}}-\frac{c''\,2^l\log n}{n},
$$
where
$$
S(p_0)=\sqrt{\frac{\|p_0\|_\infty}{4\pi\log 2}}
$$
and where $0<c''<\infty$ does not depend on $n$ or $l$.

Proof. Define $\bar l(p_0)$ as the smallest integer $l$ for which the following four conditions hold:

a) If $U(p_0)$ is the largest interval in $\{x\in\mathbb R: p_0(x)\ge\|p_0\|_\infty/2\}$ and if $|U(p_0)|$ denotes its length, then let $l$ be such that $|U(p_0)|>2/2^l$. Note that such an $l$ always exists by uniform continuity of $p_0$.

b) $\|p_0\|_\infty\le(1-\tau)2^l$, where $\tau=50/51$.

c) $l(\log 2-0.68)\ge-\log(|U(p_0)|/2)$.

d) $\sqrt{\dfrac{0.68\,l}{2\log 2}}-\sqrt 2\ge\sqrt{\dfrac{0.51\,l}{2\log 2}}$.

We now prove the lemma. We start with a reduction to Gaussian processes. By Theorem 3 in Komlós, Major and Tusnády (1975) and by integrating tail probabilities, there exists a sequence of Brownian bridges $B_n$ such that
$$
E\|F_n-F-n^{-1/2}B_n\circ F\|_\infty = O(\log n/n),
$$
where we have used that $F$ is a continuous distribution function. Note next that for each $y\in\mathbb R$ there exists $k:=k(y)$ such that
$$
p_n(l,y)-Ep_n(l,y) = 2^l\,\frac 1n\sum_{i=1}^n\big(1_{[k/2^l,(k+1)/2^l)}(X_i)-E1_{[k/2^l,(k+1)/2^l)}(X)\big) = 2^l\big[(F_n-F)((k+1)/2^l-)-(F_n-F)(k/2^l-)\big].
$$
Consequently, for
$$
G_n(k):=G_{n,l}(k)=B_n\circ F((k+1)/2^l)-B_n\circ F(k/2^l)
$$
and $G_n(y)=G_n(k(y))$, we have
$$
E\Big\|p_n(l)-Ep_n(l)-\frac{2^l}{\sqrt n}\,G_n\Big\|_\infty \le 2^{l+1}\,E\|F_n-F-n^{-1/2}B_n\circ F\|_\infty = O\Big(\frac{2^l\log n}{n}\Big).
$$
Therefore
$$
E\|p_n(l)-Ep_n(l)\|_\infty \ge \frac{2^l}{\sqrt n}\,E\sup_{k\in\mathbb Z}|G_n(k)|-\frac{c''\,2^l\log n}{n}
$$
for some $c''$ finite independent of $l$ and $n$. We now lower-bound the Gaussian expectation, and we write for short $P_k=P(X\in(k/2^l,(k+1)/2^l])$. We see that, for any $k,k'\in\mathbb Z$,
$$
E(G_n(k)-G_n(k'))^2 = P_k+P_{k'}-(P_k-P_{k'})^2 \ge \tau(P_k+P_{k'}),\qquad 0<\tau<1,
$$
if $P_k,P_{k'}$ are both less than or equal to $1-\tau$, which happens by b) in the definition of $\bar l(p_0)$. Consider the Gaussian process $\bar G(k)=\sqrt{\tau P_k}\,g_k$, where the $g_k$ are i.i.d. standard normal. Then, by the above inequality, $E(\bar G(k)-\bar G(k'))^2\le E(G_n(k)-G_n(k'))^2$, and by Gaussian comparison (Theorem 3.2.5 together with Example 3.2.7b in Fernique (1997)) we have
$$
E\sup_{k\in A}|G_n(k)| \ge E\sup_{k\in A}G_n(k) \ge E\sup_{k\in A}\bar G(k) \ge E\sup_{k\in A}|\bar G(k)|-\sqrt{2/\pi}\,\sup_{k\in A}\sqrt{E\bar G(k)^2}
$$
for any $A\subseteq\mathbb Z$. Let now $U(p_0)$ be the interval from a). By hypothesis, $|U(p_0)|>2/2^l$, and therefore
$$
\operatorname{card}\{k\in\mathbb Z: k/2^l\in U(p_0)\} \ge 2^{l-1}|U(p_0)| \ge 1.
$$
Then, taking $A=2^l U(p_0)\cap\mathbb Z$ in the Gaussian comparison inequality, we conclude
$$
E\sup_{k\in\mathbb Z}|G_n(k)| \ge E\max_{k\in 2^l U(p_0)\cap\mathbb Z}\big(\sqrt{\tau P_k}\,|g_k|\big)-\sqrt{\frac{2\tau\|p_0\|_\infty}{\pi 2^l}}.
$$
Furthermore, by Fernique (1997, p. 27, expression 1.7.1),
$$
E\max_{k\in 2^l U(p_0)}\big(\sqrt{\tau P_k}\,|g_k|\big) \ge \min_{k\in 2^l U(p_0)}\sqrt{\tau P_k}\;E\max_{k\in 2^l U(p_0)}|g_k| \ge \sqrt{\frac{\tau\|p_0\|_\infty\log(2^{l-1}|U(p_0)|)}{2^{l+1}\pi\log 2}} \ge \sqrt{\frac{\tau\|p_0\|_\infty\,0.68\,l}{2^{l+1}\pi\log 2}},
$$
again by condition c) in the definition of $\bar l(p_0)$. Hence, by the last condition on $\bar l(p_0)$, we have
$$
E\sup_{k\in\mathbb Z}|G_n(k)| \ge \sqrt{\frac{0.51\,\tau\,\|p_0\|_\infty}{2\pi\log 2}}\sqrt{\frac l{2^l}},
$$
and since $\tau=0.5/0.51$,
$$
E\|p_n(l)-Ep_n(l)\|_\infty \ge \sqrt{\frac{0.5\,\|p_0\|_\infty}{2\pi\log 2}}\sqrt{\frac{2^l l}{n}}-\frac{c''\,2^l\log n}{n},
$$
which completes the proof.

Acknowledgement. We thank Patricia Reynaud-Bouret and Benedikt Pötscher for helpful comments. The idea of using Rademacher thresholds in Lepski's method arose from a conversation with Patricia Reynaud-Bouret.
References

[1] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301-413.
[2] Bartlett, P., Boucheron, S. and Lugosi, G. (2002). Model selection and error estimation. Mach. Learn. 48 85-113.
[3] Bousquet, O. (2003). Concentration inequalities for sub-additive functions using the entropy method. In: Stochastic inequalities and applications, Progr. Probab. 56, E. Giné, C. Houdré, D. Nualart, eds., Birkhäuser, Boston, 213-247.
[4] Cavalier, L. and Tsybakov, A.B. (2001). Penalized blockwise Stein's method, monotone oracles and sharp adaptive estimation. Math. Methods Statist. 10 247-282.
[5] Daubechies, I. (1992). Ten lectures on wavelets. CBMS-NSF Reg. Conf. Ser. in Appl. Math. 61. Society for Industrial and Applied Mathematics, Philadelphia.
[6] DeVore, R.A. and Lorentz, G.G. (1993). Constructive approximation. Springer, Berlin.
[7] Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. and Picard, D. (1996). Density estimation by wavelet thresholding. Ann. Statist. 24 508-539.
[8] Dudley, R.M. (1999). Uniform central limit theorems. Cambridge University Press, Cambridge.
[9] Einmahl, U. and Mason, D.M. (2000). An empirical process approach to the uniform consistency of kernel-type function estimators. J. Theoret. Probab. 13 1-37.
[10] Fernique, X. (1997). Fonctions aléatoires gaussiennes, vecteurs aléatoires gaussiens. Université de Montréal, Centre de Recherches Mathématiques.
[11] Fromont, M. (2007). Model selection by bootstrap penalization for classification. Mach. Learn. 66 165-207.
[12] Giné, E. and Guillou, A. (2001). On consistency of kernel density estimators for randomly censored data: rates holding uniformly over adaptive intervals. Ann. Inst. H. Poincaré Probab. Statist. 37 503-522.
[13] Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist. 38 907-921.
[14] Giné, E. and Koltchinskii, V. (2006). Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34 1143-1216.
[15] Giné, E. and Nickl, R. (2007). Uniform limit theorems for wavelet density estimators. Preprint.
[16] Giné, E. and Nickl, R. (2008). An exponential inequality for the distribution function of the kernel density estimator, with applications to adaptive estimation. Probab. Theory Related Fields, forthcoming.
[17] Golubev, Y., Lepski, O. and Levit, B. (2001). On adaptive estimation for the sup-norm losses. Math. Methods Statist. 10 23-37.
[18] Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, approximation, and statistical applications. Lecture Notes in Statistics 129. Springer, New York.
[19] Huang, S.-Y. (1999). Density estimation by wavelet-based reproducing kernels. Statist. Sinica 9 137-151.
[20] Huang, S.-Y. and Studden, W.J. (1993). Density estimation using spline projection kernels. Comm. Statist. Theory Methods 22 3263-3285.
[21] Jaffard, S. (1997a). Multifractal formalism for functions. I. Results valid for all functions. SIAM J. Math. Anal. 28 944-970.
[22] Jaffard, S. (1997b). Multifractal formalism for functions. II. Self-similar functions. SIAM J. Math. Anal. 28 971-998.
[23] Kerkyacharian, G. and Picard, D. (1992). Density estimation in Besov spaces. Statist. Probab. Lett. 13 15-24.
[24] Klein, T. and Rio, E. (2005). Concentration around the mean for maxima of empirical processes. Ann. Probab. 33 1060-1077.
[25] Komlós, J., Major, P. and Tusnády, G. (1975). An approximation of partial sums of independent rv's, and the sample df. I. Z. Wahrscheinlichkeitstheorie verw. Gebiete 32 111-131.
[26] Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902-1914.
[27] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593-2656.
[28] Korostelev, A. and Nussbaum, M. (1999). The asymptotic minimax constant for sup-norm loss in nonparametric density estimation. Bernoulli 5 1099-1118.
[29] Ledoux, M. (2001). The concentration of measure phenomenon. Mathematical Surveys and Monographs 89. American Mathematical Society.
[30] Lepski, O.V. (1991). Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Theory Probab. Appl. 36 682-697.
[31] Lepski, O.V. (1992). On problems of adaptive estimation in white Gaussian noise. In: Topics in nonparametric estimation (R.Z. Khasminskii, ed.) 87-106. Amer. Math. Soc., Providence.
[32] Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab. 28 863-884.
[33] Meyer, Y. (1992). Wavelets and operators I. Cambridge University Press, Cambridge.
[34] Nolan, D. and Pollard, D. (1987). U-processes: rates of convergence. Ann. Statist. 15 780-799.
[35] Picard, D. and Tribouley, K. (2000). Adaptive confidence interval for pointwise curve estimation. Ann. Statist. 28 298-335.
[36] Rigollet, P. (2006). Adaptive density estimation using the blockwise Stein method. Bernoulli 12 351-370.
[37] Shadrin, A.Y. (2001). The $L_\infty$-norm of the $L_2$-spline projector is bounded independently of the knot sequence: a proof of de Boor's conjecture. Acta Math. 187 59-137.
[38] Schumaker, L.L. (1993). Spline functions: basic theory. Corrected reprint of the 1981 original. Krieger, Malabar.
[39] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22 28-76.
[40] Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505-563.
[41] Tsybakov, A.B. (1998). Pointwise and sup-norm sharp adaptive estimation of the functions on the Sobolev classes. Ann. Statist. 26 2420-2469.

Department of Mathematics
University of Connecticut
Storrs, CT 06269-3009, USA
E-mail:
[email protected],
[email protected]