Stat Meth Appl (2008) 17:321–334 DOI 10.1007/s10260-007-0064-6 ORIGINAL ARTICLE
A note on Bayesian nonparametric regression function estimation

Catia Scricciolo
Received: 10 July 2006 / Revised: 10 April 2007 / Accepted: 30 April 2007 / Published online: 6 June 2007 © Springer-Verlag 2007
Abstract  In this note the problem of nonparametric regression function estimation in a random design regression model with Gaussian errors is considered from the Bayesian perspective. It is assumed that the regression function belongs to a class of functions with a known degree of smoothness. A prior distribution on the given class can be induced by a prior on the coefficients in a series expansion of the regression function through an orthonormal system. The rate of convergence of the resulting posterior distribution is employed to provide a measure of the accuracy of the Bayesian estimation procedure defined by the posterior expected regression function. We show that the Bayes’ estimator achieves the optimal minimax rate of convergence under mean integrated squared error over the involved class of regression functions, thus being comparable to other popular frequentist regression estimators.

Keywords  Nonparametric regression · Posterior distribution · Rate of convergence · Sieve prior

C. Scricciolo, Istituto di Metodi Quantitativi, Università Commerciale “L. Bocconi”, Viale Isonzo 25, 20135 Milano, Italy. e-mail: [email protected]

1 Introduction

Regression methods provide statistical techniques intended to estimate, and to conduct inference about, models for conditional mean functions as well as conditional quantile functions that describe relationships of dependence among random variables. From classical least-squares methods for linear regression to modern nonparametric regression estimation, the literature on this topic is vast. For a concise review that clearly conveys the main ideas refer to Wasserman (2006, Chap. 5). The interest in regression
methods for predictive purposes is apparent in various applications ranging from Medicine to Economics. In this article we deal with estimation of conditional mean functions. The general formulation of the problem is as follows. Let (X, Y) be a random vector defined on a probability space (Ω, F, P) and taking values in (X × R, B(X) ⊗ B(R), η ⊗ λ), where X is a subset of some Euclidean space R^d, d ≥ 1, endowed with the Borel σ-field B(X), and ν := η ⊗ λ is the Lebesgue measure on the product σ-field B(X) ⊗ B(R). Assume that E[Y²] < ∞. It is known that the conditional expectation of Y given X = x, denoted by E[Y|X = x], is a Borel-measurable function defined for each x ∈ X such that

∫_B E[Y|X = x] P_X(dx) = ∫_{X^{-1}(B)} Y(ω) P(dω),   ∀ B ∈ B(X),
where P_X denotes the marginal probability law of X. Clearly, E[Y|X = ·] is determined uniquely up to sets of P_X-measure 0. The function r_0(·) := E[Y|X = ·] is called the regression function of Y on X. Let E[Y|X] denote the conditional expectation of Y given X, i.e., the conditional expected value of Y given the σ-field generated by X, E[Y|X] := E[Y|σ(X)]. Then, E[Y|X] can be regarded as the composition of r_0 and X,

r_0 ∘ X = E[Y|X]   a.s. [P],   (1)

namely, r_0 ∘ X is a version of E[Y|X], see, e.g., Laha and Rohatgi (1979, Chap. 6). We now proceed to write the model. To this aim, let ξ := Y − E[Y|X] denote the error made in replacing Y with its conditional expected value given X. Clearly, E[ξ|X] = 0 and E[ξ²|X] < ∞ a.s. [P]; consequently, E[ξ] = 0 and V[ξ] = E[ξ²] < ∞. Note also that ξ and X are uncorrelated, Cov(ξ, X) = 0. With this notation, Y = E[Y|X] + ξ, which, taking into account (1), can be rewritten as

Y = (r_0 ∘ X) + ξ   a.s. [P],   (2)
wherein Y can be interpreted as the response corresponding to the d-dimensional covariate X = (X (1), . . . , X (d)), via the unknown link function r0 , up to a perturbation ξ . In this article we consider the case when X is a one-dimensional covariate
taking values in some compact set X ⊂ R. Suppose we are given a random sample of n i.i.d. observations Z_1 = (X_1, Y_1), ..., Z_n = (X_n, Y_n) from the joint distribution of X and Y. According to the model in (2), the Y_i’s are noisy measurements of the (r_0 ∘ X_i)’s, the errors ξ_i being unobservable centered random variables with finite variance. The problem is to estimate the unknown regression function r_0 from Z_1, ..., Z_n. A model like the present one, wherein, for each 1 ≤ i ≤ n, the realization x_i of the covariate X_i is measured along with the corresponding value y_i of the response Y_i, is called a random design regression model. If, instead, the values x_1, ..., x_n are fixed a priori and only the corresponding Y-values y_1, ..., y_n are observed, then the model is called a fixed design regression model. We now make some assumptions about the covariate and the error term. We assume that the distribution of X is known. As spelled out in Appendix A, we can then focus on the case that P_X is uniform over [0, 1]. We assume that ξ and X are stochastically independent,

ξ ⊥⊥ X,   (A1)
This means that, for an observer who knows the realization x of X, the conditional distribution of ξ given X = x coincides with the unconditional one. Thus, from the point of view of the observer, the partial information possessed does not help to learn more about the underlying random mechanism. Furthermore, we assume that the error ξ has a normal distribution,

ξ ~ N(0, σ_0²),   (A2)

with known variance 0 < σ_0² < ∞. If, as assumed in the preceding, the error is a Gaussian random variable, then we speak of a random design Gaussian regression model. Due to (A1) and (A2), conditionally on X = x, the response Y is also a Gaussian random variable with mean r_0(x) and variance σ_0² that does not depend on x,

Y | (X = x) ~ N(r_0(x), σ_0²),

so the data are homoscedastic. Let P_0 denote the probability law of Z = (X, Y) under the true regression function r_0 and let f_0 stand for the corresponding density with respect to ν on [0, 1] × R,

f_0(z) = f_0(x, y) = f_0(y|x) f_X(x) = (σ_0 √(2π))^{-1} exp{−[y − r_0(x)]² / (2σ_0²)}.

As previously mentioned, we are interested in estimating the regression function r_0 on the basis of a random sample Z_1, ..., Z_n from P_0. We do not assume a parametric form for the regression function; rather, we make some smoothness assumptions. It has been pointed out in the literature that, unlike cumulative distribution functions, regression functions as well as probability density functions cannot be consistently estimated without making smoothness assumptions. Throughout this article we assume that r_0 belongs to a certain class of p-times differentiable functions of L²[0, 1]. In order to define the parameter space we need to introduce some more notation. Let (φ_j)_{j≥1} be the trigonometric orthonormal system of L²[0, 1] defined by φ_1(x) ≡ 1
and, for each k ≥ 1,

φ_{2k}(x) = √2 cos(2πkx),   φ_{2k+1}(x) = √2 sin(2πkx),   x ∈ [0, 1].

Note that

sup_{j≥1} ‖φ_j‖_∞ = √2,

where ‖φ_j‖_∞ := sup_{x∈[0,1]} |φ_j(x)| is the supremum norm of φ_j. For any r ∈ L²[0, 1], let (r_j)_{j≥1} be the sequence of its Fourier coefficients with respect to (φ_j)_{j≥1},

r_j = ∫_0^1 r(x) φ_j(x) dx,   j ≥ 1.
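As a concrete illustration (not part of the original paper), the short Python sketch below evaluates the trigonometric basis, approximates the Fourier coefficients r_j of a test function by quadrature, and simulates data from the random design Gaussian regression model with uniform design. The particular function r_0 and all numerical values are arbitrary choices made only for this example.

```python
import numpy as np

def phi(j, x):
    """Trigonometric orthonormal basis of L^2[0,1]:
    phi_1 = 1, phi_{2k} = sqrt(2) cos(2 pi k x), phi_{2k+1} = sqrt(2) sin(2 pi k x)."""
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * x)

def fourier_coefficients(r, J, n_grid=20_000):
    """Approximate r_j = int_0^1 r(x) phi_j(x) dx by the trapezoidal rule."""
    x = np.linspace(0.0, 1.0, n_grid)
    return np.array([np.trapz(r(x) * phi(j, x), x) for j in range(1, J + 1)])

# Illustrative regression function (a finite combination of basis functions).
r0 = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(6 * np.pi * x)

coef = fourier_coefficients(r0, J=15)
x = np.linspace(0.0, 1.0, 1_000)
r_rec = sum(c * phi(j, x) for j, c in enumerate(coef, start=1))
print("max reconstruction error:", np.abs(r0(x) - r_rec).max())  # close to 0

# Data from the model: X ~ U[0,1], Y = r0(X) + xi, xi ~ N(0, sigma0^2).
rng = np.random.default_rng(0)
n, sigma0 = 200, 0.3
X = rng.uniform(size=n)
Y = r0(X) + rng.normal(scale=sigma0, size=n)
```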
For some fixed integer p ≥ 1 and real Q > 0, let

R_p(Q) := { r ∈ L²[0, 1] : r(·) = Σ_{j=1}^∞ r_j φ_j(·),  Σ_{j=1}^∞ j^{2p} r_j² < Q }.
The class R_p(Q) comprises square-integrable functions on [0, 1] whose first p − 1 derivatives satisfy the boundary conditions r^(l)(0) = r^(l)(1) for l = 1, ..., p − 1, and whose generalized derivative r^(p) (in the sense that r^(p−1) is absolutely continuous) is bounded by a multiple of Q in squared L²-norm, ‖r^(p)‖_2² < π^{2p} Q. Indeed, the correspondence is exact if j^{2p} is replaced by (j − 1)^{2p} for odd values of j, see, e.g., Lemma A.3 in Tsybakov (2004, pp. 162–165). It is relevant for what follows to point out that the functions in R_p(Q) are uniformly bounded, that is, there exists a constant L > 0 such that

sup_{r ∈ R_p(Q)} ‖r‖_∞ < L.
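The following fragment (an illustrative check, not taken from the paper) verifies the ellipsoid condition for a candidate coefficient sequence and computes a crude uniform bound on the sup-norm obtained from the Cauchy–Schwarz inequality, which is one way to see why the functions in R_p(Q) are uniformly bounded when p ≥ 1.

```python
import numpy as np

def sobolev_sum(coef, p):
    """Weighted coefficient sum  sum_j j^{2p} r_j^2  defining the ellipsoid."""
    j = np.arange(1, len(coef) + 1)
    return float(np.sum(j ** (2 * p) * coef ** 2))

def supnorm_bound(coef, p):
    """sup_x |r(x)| <= sqrt(2) * sum_j |r_j|
                    <= sqrt(2) * (sum_j j^{2p} r_j^2)^(1/2) * (sum_j j^{-2p})^(1/2)
    by Cauchy-Schwarz; the last series converges for p >= 1."""
    j = np.arange(1, len(coef) + 1)
    return float(np.sqrt(2.0)
                 * np.sqrt(np.sum(j ** (2 * p) * coef ** 2))
                 * np.sqrt(np.sum(j ** (-2.0 * p))))

p, Q = 2, 10.0
rng = np.random.default_rng(1)
coef = rng.normal(size=200) * np.arange(1, 201) ** (-(p + 1.0))  # fast-decaying coefficients
print("ellipsoid sum:", sobolev_sum(coef, p), "< Q?", sobolev_sum(coef, p) < Q)
print("uniform sup-norm bound:", supnorm_bound(coef, p))
```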
For some known degree of smoothness p and bound Q, we take R_p(Q) as the parameter space. Various methods, either frequentist or Bayesian, may be employed to estimate a regression function in R_p(Q). There is a large literature on nonparametric regression estimation from the frequentist point of view. Next to the popular kernel method, on which the well-known Nadaraya–Watson estimator is based (cf. Nadaraya 1964; Watson 1964), other methods rely on series expansions, spline functions and wavelet approximations. A good deal is known about rates of convergence, exact constants in minimax bounds, and limit distributions of estimators based on these methods. Much less is, instead, known about the asymptotic properties of Bayesian nonparametric regression estimators. Since a Bayesian approach amounts to putting a prior on a given class of regression functions, the construction of prior measures has first of all attracted most of the attention and effort. Next to the Gaussian and Dirichlet processes, for the latter see, e.g., Cifarelli et al. (1981), other processes based on free-knot splines and orthogonal basis expansions have been proposed as priors. For an account of Bayesian methods for function estimation, see Choudhuri et al. (2005) and
the references therein. Only in recent years have large-sample properties like consistency and rates of convergence of posterior distributions for Bayesian nonparametric regression problems been studied, see Shen and Wasserman (2001), Amewou-Atisso et al. (2003), Ghosal and Roy (2006), Kleijn and van der Vaart (2006), Choudhuri et al. (2007). In all these articles, as well as in the present note, the frequentist or “what if” approach for studying Bayesian asymptotics is adopted. It consists in checking frequentist properties of Bayesian procedures. Such investigations give indications on the accuracy of posterior distributions in representing the uncertainty on the unknown parameter of the sampling distribution and help avoid those priors leading to posteriors that might have an unexpected behaviour in nonparametric models, see, e.g., Diaconis and Freedman (1986a,b). In this note we consider the same Bayesian framework for the above regression problem as in Shen and Wasserman (2001, pp. 695–699), from now on SW for short. They showed that the posterior distribution of a “sieve prior” restricted to R_p(Q) concentrates almost all its mass on L²-neighbourhoods of r_0 with radius a multiple of n^{−p/(2p+1)} as the sample size tends to infinity. Starting from this result, in Sect. 2, we show that the posterior expected regression function, which we refer to as the Bayes’ estimator, r̃_n(·) := E[r(·)|Z_1, ..., Z_n], achieves the optimal minimax rate of convergence for mean integrated squared error over Sobolev classes. Technical details are provided in Appendix B. In the last section we discuss some limitations of this work and identify open issues for future work.
2 Optimal convergence rate of the Bayes’ estimator

In this section, after describing the set-up considered by SW, we present a refinement of their result that better suits the purpose of proving that the Bayes’ estimator achieves the optimal convergence rate over Sobolev classes. We do not study the exact asymptotic behaviour of the risk for the Bayes’ estimator: we provide the rate of convergence but not the exact constant. We proceed by first proving that the mean integrated squared error for r̃_n converges to zero with optimal rate at any “point” r_0 ∈ R_p(Q) and then showing that r̃_n attains the optimal rate simultaneously for all functions r_0 ∈ R_p(Q′) with Q′ < Q. A Bayesian approach to the problem starts from the specification of a prior distribution on the given class R_p(Q), which can be induced by a prior for the Fourier coefficients on the corresponding ellipsoid of ℓ²,

E_p(Q) := { (r_j)_{j≥1} ∈ ℓ² : Σ_{j=1}^∞ j^{2p} r_j² < Q }.
Following SW, we adopt the prior

π = μ 1_{E_p(Q)} / μ(E_p(Q)),   (3)
which is the restriction to E_p(Q) of the mixture prior

μ = Σ_{k=1}^∞ λ_k μ_k,
where the mixing weights λ_k satisfy the conditions λ_k ≥ 0 and Σ_{k=1}^∞ λ_k = 1, and, for each k ≥ 1, μ_k makes the coefficients r_j mutually independent random variables such that

r_j ~ N(0, j^{−(2p+1)}),   j = 1, ..., k,      r_j = 0 a.s.,   j > k.
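As an aside (not in the paper), a draw from the mixture μ can be simulated as follows; restricting it to E_p(Q), i.e. sampling from prior (3), could then be done by rejection. The weights λ_k and all other choices below are illustrative.

```python
import numpy as np

def sample_sieve_prior(p, gamma=1.0, k_max=500, rng=np.random.default_rng()):
    """Draw (k, coefficients) from the sieve prior mu: first a model index k with
    weights proportional to exp(-gamma * k), then r_j ~ N(0, j^{-(2p+1)}) for j <= k."""
    ks = np.arange(1, k_max + 1)
    weights = np.exp(-gamma * ks)
    weights /= weights.sum()
    k = rng.choice(ks, p=weights)
    j = np.arange(1, k + 1)
    coef = rng.normal(scale=j ** (-(2.0 * p + 1.0) / 2.0))  # sd = j^{-(2p+1)/2}
    return k, coef

k, coef = sample_sieve_prior(p=2, rng=np.random.default_rng(0))
print("sampled dimension k =", k, "; first coefficients:", coef[:3])
```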
For each k ≥ 1, μ_k is formally a prior on the infinite-dimensional space R^∞, but the sequence (r_j)_{j≥1} in fact “lives” on R^k and r_k := (r_1, ..., r_k) has normal distribution N_k(0, Σ), with diagonal covariance matrix Σ whose jth entry is j^{−(2p+1)}. The overall prior μ is thus constructed from a sequence of priors supported on finite-dimensional spaces. Such a prior has therefore been called a “sieve prior” by SW. It was introduced by Zhao (2000), who showed that in the infinitely many normal means problem the corresponding Bayes’ estimator, unlike the one arising from any direct or un-sieved Gaussian prior, can achieve the optimal minimax rate of convergence under squared L²-risk over any Sobolev ellipsoid. SW employed prior (3) in the regression model herein specified and investigated the pointwise asymptotic behaviour of the corresponding posterior distribution. To review their result we need to introduce some more notation and further definitions. For any r ∈ R_p(Q), let P_r denote the probability law of Z = (X, Y) when the parameter is r and let f_r be the corresponding density (with respect to ν),

f_r(z) = f_r(x, y) = f_r(y|x) f_X(x) = (σ_0 √(2π))^{-1} exp{−[y − r(x)]² / (2σ_0²)}.

Let the class P := {P_r : r ∈ R_p(Q)} be equipped with the Hellinger distance

d_H(P_r, P_s) := [ ∫ ( √(dP_r/dν) − √(dP_s/dν) )² dν ]^{1/2} = [ ∫ ( √f_r − √f_s )² dν ]^{1/2}
             = √2 [ 1 − ∫_0^1 exp{ −[r(x) − s(x)]² / (8σ_0²) } dx ]^{1/2},   P_r, P_s ∈ P.
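For illustration (not from the paper), the closed-form expression above can be evaluated numerically; the two regression functions below are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad

def hellinger(r, s, sigma0):
    """d_H(P_r, P_s) = sqrt(2) * (1 - int_0^1 exp(-(r(x)-s(x))^2 / (8 sigma0^2)) dx)^(1/2)."""
    integral, _ = quad(lambda x: np.exp(-(r(x) - s(x)) ** 2 / (8.0 * sigma0 ** 2)), 0.0, 1.0)
    return np.sqrt(2.0 * (1.0 - integral))

r = lambda x: np.sin(2 * np.pi * x)
s = lambda x: np.sin(2 * np.pi * x) + 0.2 * np.cos(2 * np.pi * x)
print(hellinger(r, s, sigma0=0.5))
```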
Let Π denote the prior induced on P by π via the map r ↦ P_r, π being here used to indicate also the prior induced by (3) on R_p(Q). The posterior distribution Π(·|Z_1, ..., Z_n) is said to converge (at least) at rate ε_n relative to d_H, where ε_n ↓ 0, if for any sequence M_n → ∞ such that M_n ε_n → 0,

Π({P_r ∈ P : d_H(P_r, P_0) ≥ M_n ε_n} | Z_1, ..., Z_n) → 0   (4)

in probability or almost surely when sampling from P_0. Validity of (4) means that Hellinger neighbourhoods of P_0 contract as the radius M_n ε_n tends to zero, meanwhile still capturing almost all the posterior probability mass. Clearly, we search for the fastest sequence ε_n for which (4) holds true. Note that in order for convergence in (4) to take place it suffices that, for a large enough constant M > 0,

Π({P_r ∈ P : d_H(P_r, P_0) > M ε_n} | Z_1, ..., Z_n) → 0

in probability or almost surely. Ghosal et al. (2000) and SW stated theorems providing sufficient conditions for posterior distributions to asymptotically concentrate their mass on Hellinger neighbourhoods of P_0 at a rate determined by the size of the model, as measured by metric entropy (or the existence of certain tests), and the prior concentration rate around P_0. Requirements for assessing rates of convergence are a support condition and a smoothness condition. The support condition requires the prior probability of certain neighbourhoods of P_0 to be not exponentially small. The smoothness condition splits up into an entropy condition and a tail condition, both involving a sieve, that is, a sequence (P_n)_{n≥1} of approximating subsets of P whose “dimension” grows with the sample size. The entropy condition imposes a restriction on the growth rate of the size of P_n as measured by some entropy number. The tail condition requires P_n^c to have exponentially small prior probability, so that its posterior probability also turns out to be negligible. Regarding the problem under study, SW showed that if r_0 ∈ R_p(Q), then, for a large enough constant K > 0, possibly depending on r_0,

π({r ∈ R_p(Q) : ‖r − r_0‖_2 > K ε_n} | Z_1, ..., Z_n) → 0   (5)

in P_0^n-probability, as n → ∞, with

ε_n = n^{−p/(2p+1)},

namely, the posterior distribution on the space of regression functions asymptotically concentrates its mass on L²-neighbourhoods of r_0 with radius a multiple of ε_n. However, the authors did not directly study the large-sample behaviour of the posterior probability in (5); rather, they related it to the posterior probability on P of the complement of a Hellinger ball around P_0, by noting that Hellinger neighbourhoods of P_0 “translate” into L²-neighbourhoods of r_0 because the regression functions herein considered are uniformly bounded. This enabled them to appeal to theorems for deriving posterior rates of convergence. This key step deserves a careful examination. Using the chain of inequalities 1 − e^{−x} ≤ x ≤ e^x − 1, x ≥ 0,
the Hellinger metric is seen to be related to the L²-metric as follows:

e^{−L²/(4σ_0²)} ‖r − s‖_2 / (2σ_0) ≤ d_H(P_r, P_s) ≤ ‖r − s‖_2 / (2σ_0).   (6)
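A quick numerical sanity check of the two-sided bound (6), for two arbitrary functions bounded by L (all names and values below are illustrative assumptions):

```python
import numpy as np
from scipy.integrate import quad

sigma0, L = 0.5, 2.0

def hellinger(r, s):
    integral, _ = quad(lambda x: np.exp(-(r(x) - s(x)) ** 2 / (8 * sigma0 ** 2)), 0, 1)
    return np.sqrt(2 * (1 - integral))

def l2_distance(r, s):
    return np.sqrt(quad(lambda x: (r(x) - s(x)) ** 2, 0, 1)[0])

r = lambda x: np.sin(2 * np.pi * x)        # sup-norm 1 < L
s = lambda x: 0.8 * np.sin(2 * np.pi * x)  # sup-norm 0.8 < L

dh, d2 = hellinger(r, s), l2_distance(r, s)
lower = np.exp(-L ** 2 / (4 * sigma0 ** 2)) * d2 / (2 * sigma0)
upper = d2 / (2 * sigma0)
print(lower <= dh <= upper)  # expected: True
```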
Setting, for a constant M > 0,

K := e^{L²/(4σ_0²)} 2σ_0 M,   (7)

from the left-hand side of (6) we have that

Π({P_r ∈ P : d_H(P_r, P_0) > M ε_n} | Z_1, ..., Z_n) ≥ π({r ∈ R_p(Q) : ‖r − r_0‖_2 > K ε_n} | Z_1, ..., Z_n).   (8)
Thus, if the posterior on P converges (in probability or almost surely) at least at rate ε_n, then the posterior on R_p(Q) also converges at least at the same rate. Hence, the investigation of the asymptotic behaviour of the former yields a result also for the latter. The strategy of the proof consists in bounding the numerator and the denominator in the expression of the posterior probability on the left-hand side of (8), which, setting H_n^c := {P_r ∈ P : d_H(P_r, P_0) > M ε_n}, can be written as

Π({P_r ∈ P : d_H(P_r, P_0) > M ε_n} | Z_1, ..., Z_n) = ∫_{H_n^c} R_n(f_r) π(dr) / ∫_P R_n(f_s) π(ds),

where, for φ_{σ_0}(x) := σ_0^{−1} φ(x/σ_0), with φ(·) the standard normal density, R_n(f_r) is the likelihood ratio

R_n(f_r) := ∏_{i=1}^n f_r(Z_i)/f_0(Z_i) = ∏_{i=1}^n φ_{σ_0}(Y_i − r(X_i)) / φ_{σ_0}(Y_i − r_0(X_i)).
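The paper is purely theoretical, but the ratio representation above also indicates how r̃_n could be computed in practice: under a truncated sieve prior each k-dimensional component is a conjugate Gaussian linear model in the basis coefficients, so the posterior mean is a mixture of per-component posterior means, weighted by λ_k times the marginal likelihood. The sketch below is my own illustration, not the author’s code; it ignores the restriction to E_p(Q), truncates the mixture, and all names and values are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def phi(j, x):
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * x)

def sieve_posterior_mean(X, Y, x_grid, p=2, sigma0=0.3, k_max=25, gamma=1.0):
    """Posterior mean of r under a sieve prior truncated at k_max (restriction to
    E_p(Q) ignored): model averaging of conjugate Gaussian posteriors over k."""
    n = len(X)
    log_w, comp_means = [], []
    for k in range(1, k_max + 1):
        Phi = np.column_stack([phi(j, X) for j in range(1, k + 1)])       # n x k design
        prior_var = np.arange(1, k + 1) ** (-(2.0 * p + 1.0))             # diag of Sigma
        # Marginal law of Y under component k: N(0, Phi Sigma Phi' + sigma0^2 I).
        cov_y = Phi @ (prior_var[:, None] * Phi.T) + sigma0 ** 2 * np.eye(n)
        log_evidence = multivariate_normal(mean=np.zeros(n), cov=cov_y).logpdf(Y)
        log_w.append(log_evidence - gamma * k)                            # lambda_k ~ e^{-gamma k}
        # Conjugate posterior of the coefficients: N(m, V).
        V = np.linalg.inv(Phi.T @ Phi / sigma0 ** 2 + np.diag(1.0 / prior_var))
        m = V @ Phi.T @ Y / sigma0 ** 2
        Phi_grid = np.column_stack([phi(j, x_grid) for j in range(1, k + 1)])
        comp_means.append(Phi_grid @ m)
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return sum(wk * mk for wk, mk in zip(w, comp_means))

rng = np.random.default_rng(0)
r0 = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.cos(4 * np.pi * x)
n, sigma0 = 300, 0.3
X = rng.uniform(size=n)
Y = r0(X) + rng.normal(scale=sigma0, size=n)
x_grid = np.linspace(0.0, 1.0, 200)
r_tilde = sieve_posterior_mean(X, Y, x_grid, sigma0=sigma0)
print("RMSE of the posterior mean on a grid:", np.sqrt(np.mean((r_tilde - r0(x_grid)) ** 2)))
```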
The in-probability statement (5) can be enhanced to an almost sure assertion by simply employing a stronger prior mass condition than that used by SW to lower bound the denominator ∫_P R_n(f_s) π(ds). Instead of Lemma 1 of SW, p. 690, we can use Lemma 2, p. 691, which requires checking that, for constants c, c_1, C > 0,

Π({P_r ∈ P : ρ_α(P_0 ∥ P_r) ≤ c ε_n²}) ≥ c_1 e^{−C n ε_n²},   (9)

where, for any α ∈ (0, 1], the ρ_α-divergence is defined as

ρ_α(P_0 ∥ P_r) := (1/α) [ ∫ f_0^{1+α} / f_r^α dν − 1 ].
Using the identity

(1/(τ√(2π))) ∫_{−∞}^{+∞} exp{ −[A(y − a)² + B(y − b)²] / (2τ²) } dy = (1/√(A + B)) exp{ −AB(a − b)² / (2τ²(A + B)) },   ∀ τ > 0, A + B > 0,

it can be seen that in the present model, for any α ∈ (0, 1], the ρ_α-divergence is bounded above by the squared L²-metric,

ρ_α(P_0 ∥ P_r) ≤ e^{4L²/σ_0²} ‖r − r_0‖_2² / σ_0²,

so that condition (9) on the prior concentration rate is satisfied as soon as

π({r ∈ R_p(Q) : ‖r − r_0‖_2² ≤ c σ_0² e^{−4L²/σ_0²} ε_n²}) ≥ c_1 e^{−C n ε_n²}.   (10)
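A numerical illustration (mine, not the paper’s): specializing the identity above to the present model reduces ρ_α to a one-dimensional integral over the design space, which can then be compared with the bound just stated. The functions and constants below are arbitrary, with sup-norms below L.

```python
import numpy as np
from scipy.integrate import quad

sigma0, L, alpha = 0.5, 1.0, 0.5

r0 = lambda x: 0.8 * np.sin(2 * np.pi * x)
r  = lambda x: 0.8 * np.sin(2 * np.pi * x) + 0.1 * np.cos(2 * np.pi * x)

# Integrating out y with the Gaussian identity (A = 1 + alpha, B = -alpha, A + B = 1)
# gives rho_alpha = (1/alpha) * (int_0^1 exp(alpha(1+alpha)(r0-r)^2 / (2 sigma0^2)) dx - 1).
inner = lambda x: np.exp(alpha * (1 + alpha) * (r0(x) - r(x)) ** 2 / (2 * sigma0 ** 2))
rho_alpha = (quad(inner, 0, 1)[0] - 1) / alpha

l2_sq = quad(lambda x: (r(x) - r0(x)) ** 2, 0, 1)[0]
bound = np.exp(4 * L ** 2 / sigma0 ** 2) * l2_sq / sigma0 ** 2
print(rho_alpha, "<=", bound, ":", rho_alpha <= bound)
```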
Except possibly for constants, (10) is exactly the same condition eventually found by SW, pp. 697–699, starting from a different prior mass requirement involving Kullback–Leibler type neighbourhoods of P_0. SW established the validity of (10) by bounding the prior probability on the left-hand side from below by the prior probability of a k-dimensional Euclidean ball B(r_{0,k}; c_0 ε_n²) centered at the projection r_{0,k} of (r_{0,j})_{j≥1} on R^k,

π({r ∈ R_p(Q) : ‖r − r_0‖_2² ≤ c σ_0² e^{−4L²/σ_0²} ε_n²}) ≥ λ_k N_k(0, Σ)(B(r_{0,k}; c_0 ε_n²)),   (11)

where c_0 is a constant depending on r_0 and k is suitably chosen such that k ≡ k_n = O(n^{1/(2p+1)}). SW used specific exponentially decaying mixing weights λ_k: precisely, for some γ > 0, λ_k = e^γ (1 − e^{−γ}) e^{−γk}, k ≥ 1. Since they proved that, for constants b, B > 0,

N_k(0, Σ)(B(r_{0,k}; c_0 ε_n²)) ≥ b e^{−B n ε_n²},

an examination of (11) reveals that it is indeed enough to take mixing weights bounded below by an exponentially decreasing sequence, that is, for constants A, γ > 0,

λ_k ≥ A e^{−γk},   k ≥ 1,   (A3)

meaning that sieve priors with “heavy tailed” weights also work. We now formalize the result arising from the previous arguments in the following proposition, wherein P_0^∞ denotes the infinite product measure of P_0.
Proposition 1  Let (λ_k)_{k≥1} satisfy condition (A3). If r_0 ∈ R_p(Q), then, for a sufficiently large constant K > 0, the posterior probability

π({r ∈ R_p(Q) : ‖r − r_0‖_2 > K ε_n} | Z_1, ..., Z_n) → 0   (n → ∞)

P_0^∞-almost surely.

Remark 1  For later use, we stress that Proposition 1 follows from a parallel result for the posterior Π(·|Z_1, ..., Z_n). The result says more than stated: not only does the posterior converge at least at rate ε_n, but the convergence is exponentially fast, i.e., on a set of P_0^∞-probability one, for constants d, D > 0,

π({r ∈ R_p(Q) : ‖r − r_0‖_2 > K ε_n} | Z_1, ..., Z_n) ≤ Π({P_r ∈ P : d_H(P_r, P_0) > M ε_n} | Z_1, ..., Z_n) ≤ d e^{−D n ε_n²}   (12)
for all but finitely many n, where d, D, K and M may depend on r_0.

Remark 2  Proposition 1 is still valid if we use a sample-size-dependent prior

π_n = μ_n 1_{E_p(Q)} / μ_n(E_p(Q)),   (13)
where, for each n ≥ 1, μ_n = Σ_{k=1}^{k_n} λ_k^{(n)} μ_k is a sieve prior truncated at k_n = O(n^{1/(2p+1)}), with λ_k^{(n)} ≥ A_1 e^{−γk}, k = 1, ..., k_n, and Σ_{k=1}^{k_n} λ_k^{(n)} = 1.

As previously mentioned, constants appearing in (12) may crucially depend on the specific regression function r_0; thus, additional work is needed to prove that convergence holds uniformly over Sobolev classes and that r̃_n achieves the optimal minimax rate of convergence ε_n, that is, for some constant C > 0,

lim sup_{n→∞} ε_n^{−2} sup_{r_0 ∈ R_p(Q)} E_0[‖r̃_n − r_0‖_2²] ≤ C,
where E_0 denotes expectation with respect to P_0^∞. We recall that ε_n is the optimal rate of convergence in the sense that there exist constants 0 < c ≤ C < ∞ such that

c ≤ lim inf_{n→∞} ε_n^{−2} R_n^* ≤ lim sup_{n→∞} ε_n^{−2} R_n^* ≤ C,

where R_n^* is the minimax quadratic risk constructed on a sample of size n,

R_n^* := inf_{r̂_n} sup_{r_0 ∈ R_p(Q)} E_0[‖r̂_n − r_0‖_2²],
the infimum being taken over all possible estimators r̂_n of functions r_0 ∈ R_p(Q), see, e.g., Yang and Barron (1999, p. 1591). In the statement of the following proposition, whose proof is postponed to Appendix B, we write a_n ≍ b_n to mean that both a_n ≲ b_n and b_n ≲ a_n hold true, where a_n ≲ b_n if a_n = O(b_n), namely, there exists a constant c > 0 such that a_n ≤ c b_n for all large n.

Proposition 2  Let r̃_n be the Bayes’ estimator arising from prior (3) with mixing weights λ_k satisfying condition (A3). Then, for any 0 < Q′ < Q,

sup_{r_0 ∈ R_p(Q′)} E_0[‖r̃_n − r_0‖_2²] ≍ ε_n².
The result, which is also valid for the Bayes’ estimator arising from prior (13), provides the optimal rate of convergence for r̃_n over any class R_p(Q′). The restriction Q′ < Q might be just an artifact of the technique employed in the proof.

3 Final remarks

We have considered, from a Bayesian perspective, the problem of regression function estimation in a random design Gaussian regression model with known error variance, when the response function is assumed to be p-smooth. We have shown that, when the design points are sampled from a known distribution, the Bayes’ estimator arising from a sieve prior consistently estimates the regression function with optimal accuracy as measured by the rate of convergence. The result provides the optimal convergence rate, but not the optimal constant; thus, comparison with other estimators is not possible at the level of constants. It would be interesting to investigate how the results would change if a multi-dimensional random covariate with unknown law were considered. A sieve prior restricted to a Sobolev class leads to the optimal convergence rate, but knowledge of both Q and p is required for its definition. Moreover, the evaluation of the probability μ(E_p(Q)) represents a potential limitation to its use. When p is known, to circumvent this problem, we may think of proving the result of Proposition 2 using an unrestricted sieve prior, but this presumably requires a more involved technique because the key technical assumption of uniform boundedness of the regression functions is not preserved. When, furthermore, the regression function is assumed to be smooth, but no prior information on the degree of smoothness is available, we may also put a prior on the degree of smoothness. If an overall prior on the union of models for different degrees of smoothness leads to the same optimal convergence rate for the posterior as when only a single model is specified, then we have Bayesian adaptive estimation. These are lines for future research that are worth pursuing.

Acknowledgements  The author would like to thank an anonymous Referee, an Associate Editor and the Editor for constructive and detailed comments that have led to an improved presentation.
Appendix A

This appendix is mainly devoted to explaining why, when the law of the covariate is known, we can take it to be uniform over [0, 1]. First, we make a remark about the posterior distribution. Suppose that X has known distribution P_X on some compact set X ⊂ R. Let P_1 and P_0 be probability measures that are absolutely continuous
with respect to the Lebesgue measure ν on X × R. Let f_h(y|x) denote the conditional density of Y, given X = x, with respect to the Lebesgue measure λ on R when the probability measure is P_h, h = 0, 1. Note that f_h(y|x) can be seen as the joint density of (X, Y) with respect to the measure γ := P_X ⊗ λ. Since the Hellinger distance d_H(P_1, P_0) is unchanged if one chooses a different dominating measure instead of ν, we have that

d_H(P_1, P_0) = [ ∫ ( √(f_1(y|x)) − √(f_0(y|x)) )² γ(d(x, y)) ]^{1/2}.
Making use of the fact that P_X can be absorbed into the dominating measure γ for the model, Hellinger neighbourhoods of P_0 can be defined in terms of the conditional densities. Consequently, P_X cancels out of the expression for the posterior of the regression function. The explanation of why P_X can be taken to be uniform over [0, 1] closely follows that in Barron et al. (1999, p. 548). Let P_* be a probability measure on X with cumulative distribution function (cdf) F_*. The covariate X can be mapped onto [0, 1] by the probability integral transform. Let U := F_*(X); then U is uniformly distributed over [0, 1]. We can use a prior induced by a prior π as in (3) for the unknown distribution of (U, Y) and then map it back to a prior on the collection of distributions on X × R, say P_*. Let h_* : P → P_* be the mapping associating to each P_r on [0, 1] × R the probability measure h_*(P_r) with cdf F_r(F_*(x), y). Since d_H(P_r, P_s) = d_H(h_*(P_r), h_*(P_s)) for all P_r, P_s ∈ P, the mapping h_* is continuous in the Hellinger topology (hence measurable), and thus induces a prior on P_*, say Π_*, having the property that the expected prior marginal distribution of X is P_*,

F_*(x) = ∫_{−∞}^{+∞} { ∫_{P_*} F_P(x, y) Π_*(dP) } dy,   x ∈ R.
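A small numerical illustration of the reduction just described (the covariate law below is an arbitrary example): mapping X through its own cdf produces a uniform design variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # covariate with known law P_*
U = stats.gamma(a=2.0, scale=1.5).cdf(X)            # probability integral transform U = F_*(X)
# U should be (up to Monte Carlo error) uniform on [0, 1]:
print("mean:", U.mean(), "variance:", U.var())      # expected ~0.5 and ~1/12 ~ 0.0833
```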
Appendix B

Proof of Proposition 2  Let 0 < Q′ < Q be fixed. As mentioned in Sect. 2, it is known that

inf_{r̂_n} sup_{r_0 ∈ R_p(Q′)} E_0[‖r̂_n − r_0‖_2²] ≍ ε_n².

In particular, for the Bayes’ estimator r̃_n,

ε_n² ≲ sup_{r_0 ∈ R_p(Q′)} E_0[‖r̃_n − r_0‖_2²].   (14)
The assertion is proved once the reversed inequality in (14) is established. We begin by noting that, for any fixed, non-null function r* ∈ R_p(Q), there is a class around it where the convergence of the posterior Π(·|Z_1, ..., Z_n) is uniform. Precisely, there exists ρ* > 0 such that, for constants d, D, M > 0 depending on r*,

sup_{r_0 ∈ R_p(r*; ρ*)} E_0[Π({P_r ∈ P : d_H(P_r, P_0) > M ε_n} | Z_1, ..., Z_n)] ≤ d e^{−D n ε_n²},   (15)

where

R_p(r*; ρ*) := { r ∈ L²[0, 1] : r(·) = Σ_{j=1}^∞ r_j φ_j(·),  Σ_{j=1}^∞ j^{2p} (r_j − r*_j)² ≤ ρ* }.
(16)
√ ∗ ∗ We √ know∗ 2from the above arguments that for any fixed r , if 0 < ρ < ( Q − Q − δ ) , then the inclusion on the right-hand true. For 0 < β < 1 √side of√(16) holds ∗ = β( Q − ∗ )2 . It can be seen that Q − δ to be suitably chosen, we can write ρ √ √ if ( Q + Q − δ ∗ )2 ≤ ρ ∗ , then also the inclusion on the left-hand side holds true. This implies that r ∗ can be chosen so that
Q − δ ∗ ≤ ( β Q − Q )/(1 + β)
provided that Q /Q < β < 1. The radius ρ ∗ then remains determined by the choice of β and δ ∗ . Combining the above facts, we obtain that for all large n,
123
334
C. Scricciolo
sup
r0 ∈R p (Q )
≤ ≤
E0 [ ˜rn − r0 22 ]
sup
E0 [ ˜rn − r0 22 ]
sup
E0
r0 ∈R p (r ∗ ; ρ ∗ )
r0 ∈R p (r ∗ ; ρ ∗ )
< K 2 n2 + 4L 2 ≤ K 2 n2 + 4L 2
r − r0 22 π(dr |Z 1 , . . . , Z n )
sup
E0 [π({r ∈ R p (Q) : r − r0 2 > K n }|Z 1 , . . . , Z n )]
sup
E0 [Π ({Pr ∈ P : dH (Pr , P0 ) > M n }|Z 1 , . . . , Z n )]
r0 ∈R p (r ∗ ; ρ ∗ ) r0 ∈R p (r ∗ ; ρ ∗ )
n2 , where the third line follows from Jensen’s inequality and the last one from (15). Note that in virtue of (7), K depends on r ∗ through M. The proof is completed by combining sup
r0 ∈R p (Q )
with (14).
E0 [ ˜rn − r0 22 ] n2
References Amewou-Atisso M, Ghosal S, Ghosh JK, Ramamoorthi RV (2003) Posterior consistency for semiparametric regression problems. Bernoulli 9:291–312 Barron A, Schervish MJ, Wasserman L (1999) The consistency of posterior distributions in nonparametric problems. Ann Stat 27:536–561 Choudhuri N, Ghosal S, Roy A (2005) Bayesian methods for function estimation. In: Dey D, Rao CR (eds) Handbook of statistics. Bayesian thinking, modeling and computation, vol 25 Choudhuri N, Ghosal S, Roy A (2007) Nonparametric binary regression using a Gaussian process prior. Stat Methodol 4:227–243 Cifarelli DM, Muliere P, Scarsini M (1981) Il modello lineare nell’approccio bayesiano non parametrico. Istituto Matematico “G. Castelnuovo”, Università degli Studi di Roma “La Sapienza”, vol 15 Diaconis P, Freedman D (1986a) On the consistency of Bayes estimates. With a discussion and a rejoinder by the authors. Ann Stat 14:1–67 Diaconis P, Freedman D (1986b) On inconsistent Bayes estimates of location. Ann Stat 14:68–87 Ghosal S, Roy A (2006) Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann Stat 34:2413–2429 Ghosal S, Ghosh JK, van der Vaart AW (2000) Convergence rates of posterior distributions. Ann Stat 28:500–531 Kleijn BJK, van der Vaart AW (2006) Misspecification in infinite-dimensional Bayesian statistics. Ann Stat 34:837–877 Laha RG, Rohatgi VK (1979) Probability theory. Wiley, New York-Chichester-Brisbane Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141–142 Shen X, Wasserman L (2001) Rates of convergence of posterior distributions. Ann Stat 29:687–714 Tsybakov AB (2004) Introduction à l’estimation non-paramétrique. Springer, Berlin Wasserman L (2006) All of nonparametric statistics. Springer, New York Watson GS (1964) Smooth regression analysis. Sankhy¯a A 26:359–372 Yang Y, Barron A (1999) Information-theoretic determination of minimax rates of convergence. Ann Stat 27:1564–1599 Zhao LH (2000) Bayesian aspects of some nonparametric problems. Ann Stat 28:532–552
123