arXiv:math/0503674v1 [math.ST] 29 Mar 2005

The Annals of Statistics 2004, Vol. 32, No. 5, 2074–2097. DOI: 10.1214/009053604000000012. © Institute of Mathematical Statistics, 2004.

EQUIVALENCE THEORY FOR DENSITY ESTIMATION, POISSON PROCESSES AND GAUSSIAN WHITE NOISE WITH DRIFT

By Lawrence D. Brown^1, Andrew V. Carter, Mark G. Low^2 and Cun-Hui Zhang^3

University of Pennsylvania, University of California, Santa Barbara, University of Pennsylvania and Rutgers University

This paper establishes the global asymptotic equivalence between a Poisson process with variable intensity and white noise with drift under sharp smoothness conditions on the unknown function. This equivalence is also extended to density estimation models by Poissonization. The asymptotic equivalences are established by constructing explicit equivalence mappings. The impact of such asymptotic equivalence results is that an investigation in one of these nonparametric models automatically yields asymptotically analogous results in the other models.

1. Introduction. The purpose of this paper is to give an explicit construction of global asymptotic equivalence, in the sense of Le Cam (1964), between a Poisson process with variable intensity and white noise with drift. The construction is extended to density estimation models. It yields asymptotic solutions to both density estimation and Poisson process problems based on asymptotic solutions to white noise with drift problems, and vice versa.

Density estimation model. A random vector V⋆_n of length n is observed such that V⋆_n ≡ (V⋆_1, …, V⋆_n) is a sequence of i.i.d. variables with a common density f ∈ F.

Received April 2002; revised February 2004.
^1 Supported by NSF Grant DMS-99-71751.
^2 Supported by NSF Grant DMS-03-06576.
^3 Supported by NSF Grants DMS-01-02529 and DMS-02-03086.
AMS 2000 subject classifications. Primary 62B15; secondary 62G07, 62G20.
Key words and phrases. Asymptotic equivalence, decision theory, local limit theorem, quantile transform, white noise model.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2004, Vol. 32, No. 5, 2074–2097. This reprint differs from the original in pagination and typographic detail.


Poisson process. A random vector of random length {N, X_N} is observed such that N ≡ N_n is a Poisson variable with EN = n and that, given N = m, X_N = X_m ≡ (X_1, …, X_m) is a sequence of i.i.d. variables with a common density f ∈ F. The resulting observations are then distributed as a Poisson process with intensity function nf.

White noise. A Gaussian process Z⋆ ≡ Z⋆_n ≡ {Z⋆_n(t), 0 ≤ t ≤ 1} is observed such that

(1.1)    Z⋆_n(t) ≡ ∫_0^t √(f(x)) dx + B⋆(t)/(2√n),    0 ≤ t ≤ 1,

with a standard Brownian motion B⋆(t) and an unknown probability density function f ∈ F on [0, 1].
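The white noise with drift experiment (1.1) is easy to simulate on a grid. The following sketch is purely illustrative (the Euler discretization, the grid size and all function names are our own assumptions, not part of the paper):

```python
import numpy as np

def white_noise_with_drift(f, n, num_grid=1024, rng=None):
    """Simulate Z_n*(t) = int_0^t sqrt(f(x)) dx + B*(t)/(2 sqrt(n)) of (1.1)
    on a uniform grid of [0, 1] (illustrative Euler discretization)."""
    rng = np.random.default_rng(rng)
    dt = 1.0 / num_grid
    t = np.arange(1, num_grid + 1) * dt
    mid = t - dt / 2                       # cell midpoints for the drift integral
    drift = np.cumsum(np.sqrt(f(mid)) * dt)
    brownian = np.cumsum(rng.normal(0.0, np.sqrt(dt), num_grid))
    return t, drift + brownian / (2.0 * np.sqrt(n))

# For the uniform density f = 1, the drift at t = 1 is exactly 1,
# and the noise standard deviation at t = 1 is only 1/(2 sqrt(n)).
t, z = white_noise_with_drift(lambda x: np.ones_like(x), n=10_000)
```

As n grows, the drift ∫_0^t √f dominates the Brownian term, which is what makes the root density √f recoverable from the path.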

Asymptotic equivalence. For any two experiments ξ_1 and ξ_2 with a common parameter space Θ, Δ(ξ_1, ξ_2; Θ) denotes Le Cam's distance [cf., e.g., Le Cam (1986) or Le Cam and Yang (1990)], defined as

Δ(ξ_1, ξ_2; Θ) ≡ sup_L max_{j=1,2} sup_{δ^{(j)}} inf_{δ^{(k)}} sup_{θ∈Θ} |E^{(j)}_θ L(θ, δ^{(j)}) − E^{(k)}_θ L(θ, δ^{(k)})|,

where (a) the first supremum is taken over all decision problems with loss function ‖L‖_∞ ≤ 1; (b) given the decision problem and j = 1, 2, k ≡ 3 − j (k = 2 for j = 1 and k = 1 for j = 2), the "maximin" value of the maximum difference in risks over Θ is computed over all (randomized) statistical procedures δ^{(ℓ)} for ξ_ℓ; and (c) the expectations E^{(ℓ)}_θ are evaluated in the experiments ξ_ℓ with parameter θ, ℓ = j, k.

The statistical interpretation of the Le Cam distance is as follows: if Δ(ξ_1, ξ_2; Θ) < ε, then for any decision problem with ‖L‖_∞ ≤ 1 and any statistical procedure δ^{(j)} with the experiment ξ_j, j = 1, 2, there exists a (randomized) procedure δ^{(k)} with ξ_k, k = 3 − j, such that the risk of δ^{(k)} evaluated in ξ_k nearly matches (within ε) that of δ^{(j)} evaluated in ξ_j. Two sequences of experiments {ξ_{1,n}, n ≥ 1} and {ξ_{2,n}, n ≥ 1}, with a common parameter space F, are asymptotically equivalent if

Δ(ξ_{1,n}, ξ_{2,n}; F) → 0    as n → ∞.

The interpretation is that the risks of corresponding procedures converge. A key result of Le Cam (1964) is that this equivalence of experiments can be characterized using random transformations between the probability spaces. A random transformation T(X, U), which maps observations X into the space of observations Y (with possible dependence on an independent, uninformative random component U), also maps distributions in ξ_1 to approximations of the distributions in ξ_2 via P^{(1)}_θ T ≈ P^{(2)}_θ. For the mapping between the Poisson and Gaussian processes we shall restrict ourselves


to transformations T with deterministic inverses, T^{−1}(T(X, U)) = X. The experiments are asymptotically equivalent if the total-variation distance between P^{(2)}_θ and the distribution of T under P^{(1)}_θ converges to 0 uniformly in θ. As explained in Brown and Low (1996) and Brown, Cai, Low and Zhang (2002), knowing an appropriate T allows explicit construction of estimation procedures in ξ_1 by applying statistical procedures from ξ_2 to T(X, U). In general, asymptotic equivalence also implies a transformation from the P^{(1)}_θ to the P^{(2)}_θ and the corresponding total-variation distance bound. However, in the case of the equivalence between the Poisson process and white noise with drift, by requiring that the transformation be invertible, we have saved ourselves a step. The transformation in the other direction is T^{−1}, and

‖P^{(1)}_θ T − P^{(2)}_θ‖ ≥ ‖P^{(1)}_θ T T^{−1} − P^{(2)}_θ T^{−1}‖ = ‖P^{(1)}_θ − P^{(2)}_θ T^{−1}‖.

Therefore, it is sufficient if sup_θ ‖P^{(1)}_θ T − P^{(2)}_θ‖ → 0.

The equivalence mappings T_n constructed in this paper from the sample space of the Poisson process to the sample space of the white noise are invertible randomized mappings such that

(1.2)    sup_{f∈F} H_f(T_n(N, X_N), Z⋆_n) → 0

under certain conditions on the family F. Here H_f(Z_1, Z_2) denotes the Hellinger distance between stochastic processes or random vectors Z_1 and Z_2 living in the same sample space, when the true unknown density is f. Since the T_n are invertible randomized mappings, T_n(N, X_N) are sufficient statistics for the Poisson processes and their inverses T_n^{−1} are necessarily many-to-one deterministic mappings. Similar considerations apply for the mapping of the density estimation problem to the white noise with drift problem, although in that case there are two mappings: one from the density estimation model to the white noise with drift model and another from the white noise with drift model back to the density estimation model. These mappings are given in Section 2.

There have recently been several papers on the global asymptotic equivalence of nonparametric experiments. Brown and Low (1996) established global asymptotic equivalence of the white noise problem with unknown drift f to a nonparametric regression problem with deterministic design and unknown regression f when f belongs to a Lipschitz class with smoothness index α > 1/2. It has also been demonstrated that such nonparametric problems are typically asymptotically nonequivalent when the unknown f belongs to larger classes, for example, with smoothness index α ≤ 1/2. Brown and Low (1996) showed the asymptotic nonequivalence between the


white noise problem and nonparametric regression with deterministic design for α ≤ 1/2, and Efromovich and Samarov (1996) showed that the asymptotic equivalence may fail when α < 1/4. Brown and Zhang (1998) showed the asymptotic nonequivalence for α ≤ 1/2 between any pair of the following four experiments: white noise, the density problem, nonparametric regression with random design, and nonparametric regression with deterministic design. In Brown, Cai, Low and Zhang (2002) the asymptotic equivalence for nonparametric regression with random design was shown under Besov constraints which include Lipschitz classes with any smoothness index α > 1/2. Gramma and Nussbaum (1998) solved the fixed-design nonparametric regression problem for nonnormal errors. Milstein and Nussbaum (1998) showed that some diffusion problems can be approximated by discrete versions that are nonparametric autoregression models, and Golubev and Nussbaum (1998) established a discrete Gaussian approximation to the problem of estimating the spectral density of a stationary process. Most closely related to this paper is the work in Nussbaum (1996), where global asymptotic equivalence of the white noise problem to the nonparametric density problem with unknown density g = f²/4 is shown. In that paper the global asymptotic equivalence was established under the smoothness assumption that f belongs to a Lipschitz class with smoothness index α > 1/2.

The parameter spaces. The class of functions F will be assumed throughout to consist of densities with respect to Lebesgue measure on [0, 1] that are uniformly bounded away from 0. The smoothness conditions on F can be described in terms of Haar basis expansions of the densities. Let

(1.3)    θ_{k,ℓ} ≡ θ_{k,ℓ}(f) ≡ ∫ f φ_{k,ℓ},    ℓ = 0, …, 2^k − 1, k = 0, 1, …,

be the Haar coefficients of f, where

(1.4)    φ_{k,ℓ} ≡ 2^{k/2} (1_{I_{k+1,2ℓ}} − 1_{I_{k+1,2ℓ+1}})

are the Haar basis functions, with I_{k,ℓ} ≡ [ℓ/2^k, (ℓ+1)/2^k). The convergence of the Hellinger distance in (1.2) is established via an inequality in Theorem 3 in terms of the tails of the Besov norms ‖f‖_{1/2,2,2} and ‖f‖_{1/2,4,4} of the Haar coefficients θ_{k,ℓ} ≡ θ_{k,ℓ}(f) in (1.3). The Besov norms ‖f‖_{α,p,q} for the Haar coefficients, with smoothness index α and shape parameters p and q, are defined by

(1.5)    ‖f‖_{α,p,q} ≡ [ (∫_0^1 f)^q + Σ_{k=0}^∞ { 2^{k(α+1/2−1/p)} ( Σ_{ℓ=0}^{2^k−1} |θ_{k,ℓ}(f)|^p )^{1/p} }^q ]^{1/q}.


Let f̄_k be the piecewise average of f at resolution level k, that is, the piecewise constant function defined by

(1.6)    f̄_k ≡ f̄_k(t) ≡ Σ_{ℓ=0}^{2^k−1} 1{t ∈ I_{k,ℓ}} 2^k ∫_{I_{k,ℓ}} f.

Since ‖f̄_k − f̄_{k+1}‖_p^p = ∫ |Σ_ℓ θ_{k,ℓ} φ_{k,ℓ}|^p = Σ_ℓ |θ_{k,ℓ}|^p 2^{k(p/2−1)}, (1.5) can be written as

‖f‖_{α,p,q} ≡ { |f̄_0|^q + Σ_{k=0}^∞ (2^{kα} ‖f̄_k − f̄_{k+1}‖_p)^q }^{1/q},

and its tail at resolution level k_0 ≥ 0 is ‖f − f̄_{k_0}‖_{α,p,q}, k_0 ≥ 0, with

(1.7)    ‖f − f̄_{k_0}‖^q_{α,p,q} = Σ_{k=k_0}^∞ { 2^{k(α+1/2−1/p)} ( Σ_{ℓ=0}^{2^k−1} |θ_{k,ℓ}|^p )^{1/p} }^q.

Let B(α, p, q) be the Besov space B(α, p, q) = {f : ‖f‖_{α,p,q} < ∞}. The following two theorems on the equivalence of the white noise with drift, density estimation and Poisson estimation models are corollaries of our main result, Theorem 3, which bounds the squared Hellinger distance between particular invertible randomized mappings of the Poisson process and white noise with drift models. The randomized mappings are given in Section 2. Proofs of these theorems are given in the Appendix.

Theorem 1. Let Z⋆_n, {N, X_N} and V⋆_n be the Gaussian process, Poisson process and density estimation experiments, respectively. Suppose that H is compact in both B(1/2, 2, 2) and B(1/2, 4, 4) and that H ⊆ {f : inf_{0≤x≤1} f(x) ≥ ε_0} for some ε_0 > 0. Then these three experiments are asymptotically equivalent; in particular, for the mappings T_n and randomizations R_{n1} and R_{n2} given in Section 2,

(1.8)    sup_{f∈H} H(T_n R_{n1} V⋆_n, Z⋆_n) → 0,

(1.9)    sup_{f∈H} H(V⋆_n, R_{n2} T_n^{−1} Z⋆_n) → 0.

For 0 < β ≤ 1 the Lipschitz norm ‖f‖^{(L)}_β is defined by

‖f‖^{(L)}_β ≡ sup_{0≤x<y≤1} |f(x) − f(y)| / |x − y|^β,

and the Sobolev norm ‖f‖^{(S)}_α is defined in terms of the Fourier coefficients c_n(f) of f.

… > k_0, with the Poisson process (N, X_N), to strongly approximate the Gaussian variables. It can be easily verified from (1.1) that {Z̄⋆_{k_0,ℓ}, 0 ≤ ℓ < 2^{k_0}; W⋆_{k,2ℓ}, 0 ≤ ℓ < 2^{k−1}, k > k_0} are uncorrelated normal random variables with

(2.3)    E Z̄⋆_{k,ℓ} = h_{k,ℓ} ≡ 2^k ∫_{I_{k,ℓ}} h,    h ≡ √f,    Var^{1/2}(Z̄⋆_{k,ℓ}) = σ_k ≡ √(2^k/(4n)),

for ℓ = 0, …, 2^k − 1, and, for ℓ = 0, …, 2^{k−1} − 1,

(2.4)    E W⋆_{k,2ℓ} = (1/2)(h_{k,2ℓ} − h_{k,2ℓ+1}) = √(2^{k−1}) ∫ h φ_{k−1,ℓ},    Var^{1/2}(W⋆_{k,2ℓ}) = σ_{k−1}.


Let Ũ = {Ũ_{k,ℓ}, k ≥ k_0, ℓ ≥ 0} be a sequence of i.i.d. uniform variables in [−1/2, 1/2), independent of (N, X_N). For k = 0, 1, … and ℓ = 0, …, 2^k − 1 define

(2.5)    N_k ≡ {N_{k,ℓ}, 0 ≤ ℓ < 2^k},    N_{k,ℓ} ≡ #{X_i : X_i ∈ I_{k,ℓ}}.

We shall approximate Z⋆_k in (2.1) in distribution by

(2.6)    Z_k ≡ {Z̄_{k,ℓ}, 0 ≤ ℓ < 2^k},    Z̄_{k,ℓ} ≡ 2σ_k sgn(N_{k,ℓ} + Ũ_{k,ℓ}) √|N_{k,ℓ} + Ũ_{k,ℓ}|,

at the initial resolution level k = k_0. Since the N_{k,ℓ} are Poisson variables with

(2.7)    λ_{k,ℓ} ≡ E N_{k,ℓ} = n f_{k,ℓ}/2^k = f_{k,ℓ}/(4σ_k²),    f_{k,ℓ} ≡ 2^k ∫_{I_{k,ℓ}} f,

by Taylor expansion and central limit theory

Z̄_{k,ℓ} ≈ 2σ_k ( √λ_{k,ℓ} + (N_{k,ℓ} − λ_{k,ℓ})/(2 λ^{1/2}_{k,ℓ}) ) ≈ N(√f_{k,ℓ}, σ_k²)

as λ_{k,ℓ} → ∞, compared with (2.3). Note that √f_{k,ℓ} ≈ h_{k,ℓ} under suitable smoothness conditions on f, in view of (2.3) and (2.7). The Poisson variables N_{k,ℓ} can be fully recovered from Z̄_{k,ℓ}, while the randomization turns the N_{k,ℓ} into continuous variables.

Approximation of W⋆_{k,ℓ} for k > k_0 is more delicate, since the central limit theorem is not sufficiently accurate at high resolution levels. Let F_m be the cumulative distribution function of the independent sum of a binomial variable X̃_{m,1/2} with parameter (m, 1/2) and a uniform variable Ũ in [−1/2, 1/2),

(2.8)    F_m(x) ≡ P{X̃_{m,1/2} + Ũ ≤ x},

with F_0 being the uniform distribution in [−1/2, 1/2). Let Φ be the N(0, 1) cumulative distribution function. We shall approximate W⋆_k by using a quantile transformation of randomized versions of the Poisson random variables. More specifically, let

(2.9)    W_k ≡ {W_{k,ℓ}, 0 ≤ ℓ < 2^k},    W_{k,2ℓ} ≡ σ_{k−1} Φ^{−1}(F_{N_{k−1,ℓ}}(N_{k,2ℓ} + Ũ_{k,2ℓ})),

with W_{k,2ℓ+1} ≡ −W_{k,2ℓ}, ℓ = 0, …, 2^{k−1} − 1, and the σ_k in (2.3). Given N_{k−1,ℓ} = m,

(2.10)    N_{k,2ℓ} ∼ Bin(m, p_{k,2ℓ}),    p_{k,2ℓ} ≡ ∫_{I_{k,2ℓ}} f / ∫_{I_{k−1,ℓ}} f = f_{k,2ℓ}/(f_{k,2ℓ} + f_{k,2ℓ+1}),

so that W_{k,2ℓ} is distributed exactly according to N(0, σ²_{k−1}) for p_{k,2ℓ} = 1/2, compared with (2.4). Thus, the distributions of W⋆_{k,2ℓ} and W_{k,2ℓ} are close


at high resolution levels as long as f is sufficiently smooth, even for small N_{k−1,ℓ} = m.

The equivalence mappings T_n, with randomization through Ũ, are defined by

T_n : {N, X_N, Ũ} → W_{[k_0,∞)} → Z_n ≡ {Z_n(t) : 0 ≤ t ≤ 1},

where, for k_0 ≤ k ≤ ∞, W_{[k_0,k)} ≡ {Z_{k_0}, W_j, k_0 < j < k}, and Z_k and W_k are as in (2.6) and (2.9). The inverse of T_n is a deterministic many-to-one mapping defined by

T_n^{−1} : Z⋆_n → W⋆_{[k_0,∞)} → (N⋆, X⋆_{N⋆}),

where, for k_0 ≤ k ≤ ∞, W⋆_{[k_0,k)} ≡ {Z⋆_{k_0}, W⋆_j, k_0 < j < k}.

Remark 1. One need only carry out the above construction to k = k_1 with 2^{k_1} > εn, since we shall assume that f ∈ B(1/2, 2, 2), and then the observations W⋆_{[k_0,k)} ≡ {Z⋆_{k_0}, W⋆_j, k_0 < j < k} and W_{[k_0,k)} ≡ {Z_{k_0}, W_j, k_0 < j < k} are asymptotically sufficient for the Gaussian process and Poisson process experiments. See Brown and Low (1996) for a detailed argument in the context of nonparametric regression.

Mappings for the density estimation model. The constructive asymptotic equivalence between density estimation experiments and Gaussian experiments is established by first randomizing the density estimation experiment into an approximation of the Poisson process and then applying the randomized mapping given above. Set γ_k = sup_{f∈H} ‖f − f̄_k‖²_{1/2,2,2} and note that, since H is compact in B(1/2, 2, 2), γ_k ↓ 0. Now let k_0 be the smallest integer such that 4^{k_0}/n ≥ γ_{k_0} and divide the unit interval into subintervals of equal length 2^{−k_0}. Let f̃_n be the corresponding histogram estimate based on V⋆_n. Since the functions f ∈ H are bounded below by ε_0 > 0, it follows that

(2.11)    ∫_0^1 (√f̃_n − √f)² ≤ ∫_0^1 (√f̃_n − √f)² (√f̃_n + √f)²/ε_0 = ∫_0^1 (f̃_n − f)²/ε_0.

Now

(2.12)    E ∫_0^1 (f̃_n − f)² = E ∫_0^1 (f̃_n − f̄_{k_0})² + ∫_0^1 (f − f̄_{k_0})²,

and simple calculations show that the histogram estimate f̃_n satisfies E f̃_n(x) = f̄_{k_0}(x) and Var f̃_n(x) ≤ f̄_{k_0}(x) 2^{k_0}/n. Hence,

(2.13)    n^{1/2} E ∫_0^1 (f̃_n − f̄_{k_0})² ≤ n^{1/2} (2^{k_0}/n) ≤ 2γ^{1/2}_{k_0−1} → 0.

Now n^{1/2} ≤ 2^{k_0}/γ^{1/2}_{k_0} and hence, from (1.7),

(2.14)    n^{1/2} ∫_0^1 (f − f̄_{k_0})² ≤ γ^{−1/2}_{k_0} ‖f − f̄_{k_0}‖²_{1/2,2,2} ≤ γ^{1/2}_{k_0} → 0.

It thus follows from (2.11)–(2.14) that

(2.15)    n^{1/2} sup_{f∈H} E ∫_0^1 (√f̃_n − √f)² → 0.

Hence the density estimate is squared-Hellinger consistent at a rate faster than root-n.

Now generate Ñ, a Poisson random variable with expectation n, independent of V⋆_n. If Ñ > n, generate Ñ − n conditionally independent observations V⋆_{n+1}, …, V⋆_{Ñ} with common density f̃_n. Finally let (Ñ, X̃_{Ñ}) = (Ñ, V⋆_1, V⋆_2, …, V⋆_{Ñ}) and write R_{n1} for this randomization from V⋆_n to (Ñ, X̃_{Ñ}),

R_{n1} : V⋆_n → (Ñ, X̃_{Ñ}).

A map from the Poisson number of independent observations back to the fixed number of observations is obtained similarly. This time let f̂_n be the histogram estimator based on (N, X_N). If N < n, generate n − N additional conditionally independent observations with common density f̂_n. It is also easy to check that

(2.16)    n^{1/2} sup_{f∈H} E ∫_0^1 (√f̂_n − √f)² → 0.

Now label these observations V_n = (V_1, …, V_n) and write R_{n2} for this randomization from (N, X_N) to V_n,

R_{n2} : (N, X_N) → V_n.

Remark 2. It should also be possible to map the density estimation problem directly into an approximation of the white noise with drift model. Dividing the interval into 2^{k_0} subintervals and conditioning on the number of observations falling in each subinterval, the conditional distribution within each subinterval is the same as for the Poisson process. Therefore, it is only necessary to have a version of Theorem 4 for a 2^{k_0}-dimensional multinomial experiment. Carter (2002) provides a transformation from a 2^{k_0}-dimensional multinomial to a multivariate normal as in Theorem 4 such that the total-variation distance between the distributions is O(k_0 2^{k_0} n^{−1/2}). The transformation is similar to ours in that it adds uniform noise and then uses the square root as a variance-stabilizing transformation. However, the covariance structure


of the multinomial complicates the issue and necessitates using a multiresolution structure similar to the one applied here to the conditional experiments. The Carter (2002) result can be used in place of Theorem 4 to get a slightly weaker bound on the error in the approximation in Theorem 3 (because of the extra k_0 factor) when the total number of observations is fixed. This is enough to establish Theorem 2 if the inequalities bounding α and β are changed to strictly greater than. It is also enough to establish Theorem 1 if H is a Besov space with α > 1/2. Carter (2000) also showed that a somewhat more complicated transformation leads to a deficiency bound on the normal approximation to the multinomials without the added k_0 factor.

3. Main theorem. The theorems in Section 1 on the equivalence of white noise with drift experiments and Poisson process experiments are consequences of the following theorem, which uniformly bounds the Hellinger distance between the randomized mappings described in Section 2.

Theorem 3. Suppose inf_{0≤x≤1} f(x) ≥ ε_0 > 0. Then, for all k_1 > k_0,

H²(W⋆_{[k_0,k_1)}, W_{[k_0,k_1)}) ≤ C 4^{k_0}/(ε_0 n) + (D_1/ε_0²) Σ_{k=k_0}^∞ 2^k Σ_{ℓ=0}^{2^k−1} θ²_{k,ℓ} + (D_2 n/(ε_0³ 4^{k_0})) Σ_{k=k_0}^∞ 2^{3k} Σ_{ℓ=0}^{2^k−1} θ⁴_{k,ℓ}
    = C 4^{k_0}/(ε_0 n) + (D_1/ε_0²) ‖f − f̄_{k_0}‖²_{1/2,2,2} + (D_2 n/(ε_0³ 4^{k_0})) ‖f − f̄_{k_0}‖⁴_{1/2,4,4},

where the θ_{k,ℓ} are the Haar coefficients of f as in (1.3), f̄_k is as in (1.6) and ‖·‖_{1/2,p,p} are the Besov norms in (1.5).

Remark 3. Here the universal constant C is the same as the one in Theorem 4, while D_1 = 3D/8 + 2 and D_2 = 9 + 8D/3 for the D in Theorem 5.

The proof of Theorem 3 is based on the inequalities established in Sections 4 and 5 for the normal approximation of Poisson and binomial variables. Some additional technical lemmas are given in the Appendix.

Let X̃_{m,p} be a Bin(m, p) variable, let X̃_λ be a Poisson variable with mean λ, and let Ũ be a uniform variable in [−1/2, 1/2) independent of X̃_{m,p} and X̃_λ. Define

(3.1)    g̃_{m,p}(x) ≡ (d/dx) P{Φ^{−1}(F_m(X̃_{m,p} + Ũ)) ≤ x}

with the F_m in (2.8) and the N(0, 1) distribution function Φ, and define

(3.2)    g̃_λ(x) ≡ (d/dx) P{2 sgn(X̃_λ + Ũ) √|X̃_λ + Ũ| ≤ x}.

Write ϕ_b for the density of N(b, 1) variables.
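The density g̃_{m,p} of (3.1) is piecewise proportional to ϕ_0, so its Hellinger distance to ϕ_b can be evaluated numerically. The following sketch is our own illustration (grid, tolerances and names are arbitrary assumptions): it confirms that g̃_{m,p} is exactly standard normal for p = 1/2 and drifts away from N(b, 1) as p moves off 1/2.

```python
import numpy as np
from math import comb, log, sqrt
from statistics import NormalDist

ND = NormalDist()

def g_tilde(z, m, p):
    """Density of Phi^{-1}(F_m(X_{m,p} + U)): equals p^j (1-p)^(m-j) 2^m phi_0(z),
    where j is the integer cell with F_m(j - 1/2) <= Phi(z) < F_m(j + 1/2)."""
    u = ND.cdf(z)
    c, j = 0.0, 0
    while j < m:
        c += comb(m, j) / 2**m       # c = P{Bin(m, 1/2) <= j} = F_m(j + 1/2)
        if u < c:
            break
        j += 1
    return p**j * (1 - p)**(m - j) * 2**m * ND.pdf(z)

def hellinger_sq(m, p, lo=-8.0, hi=8.0, num=20001):
    """Numerical H^2(g_tilde_{m,p}, phi_b), with b = (sqrt(m)/2) log(p/(1-p))."""
    b = sqrt(m) / 2 * log(p / (1 - p))
    zs = np.linspace(lo, hi, num)
    dz = zs[1] - zs[0]
    g = np.array([g_tilde(z, m, p) for z in zs])
    phi_b = np.array([ND.pdf(z - b) for z in zs])
    return 2.0 - 2.0 * np.sum(np.sqrt(g * phi_b)) * dz
```

For p = 1/2 the product p^j(1−p)^{m−j}2^m is identically 1, so g̃_{m,1/2} = ϕ_0 and the distance vanishes up to quadrature error, in line with the exactness noted after (2.10).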

Proof of Theorem 3. Let g⋆_{[k_0,k)}(w_{[k_0,k)}) and g_{[k_0,k)}(w_{[k_0,k)}) be the joint densities of W⋆_{[k_0,k)} and W_{[k_0,k)}, let g⋆_k(w_k) be the joint density of W⋆_k, and let g_k(w_k | w_{[k_0,k)}) be the conditional joint density of W_k given W_{[k_0,k)}. Since W⋆_k is independent of W⋆_{[k_0,k)},

√(g⋆_{[k_0,k)} g_{[k_0,k)}) − √(g⋆_{[k_0,k+1)} g_{[k_0,k+1)}) = √(g⋆_{[k_0,k)} g_{[k_0,k)}) (1 − √(g⋆_k g_k)),

so that the Hellinger distance can be written as

(3.3)    H²_f(W⋆_{[k_0,k_1)}, W_{[k_0,k_1)}) = 2 ( 1 − ∫ √(g⋆_{[k_0,k_1)} g_{[k_0,k_1)}) )
    = 2 ( 1 − ∫ √(g⋆_{[k_0,k_0+1)} g_{[k_0,k_0+1)}) ) + Σ_{k_0<k<k_1} 2 ∫ √(g⋆_{[k_0,k)} g_{[k_0,k)}) ( 1 − ∫ √(g⋆_k g_k) ).

Theorem 5. There exists a universal constant C_1 > 0 such that, for all m ≥ 0,

(5.2)    H²(g̃_{m,p}, ϕ_b) = ∫ (√g̃_{m,p} − √ϕ_b)² dz ≤ C_1 (b²/m + b⁸/m²),

where b = (√m/2) log(p/(1 − p)). Consequently,

(5.3)    H²(g̃_{m,p}, ϕ_β) ≤ D ( (p − 1/2)² + m (p − 1/2)⁴ + (√m (2p − 1) − β)²/2 ).
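The bounds above rest on a Tusnády-type quantile coupling of the symmetric binomial with the normal (detailed in the Appendix). The cut points z_j with Φ(z_j) = P{Bin(m, 1/2) ≤ j − 1} can be compared with u_j = 2(j − 1/2 − m/2)/√m directly; the following check is ours, and the constant 3 in the final ratio bound is illustrative, not a constant from the paper.

```python
import numpy as np
from math import comb, log, sqrt
from statistics import NormalDist

PHI_INV = NormalDist().inv_cdf

def coupling_gaps(m):
    """Standardized normal cut points z_j, with Phi(z_j) = P{Bin(m,1/2) <= j-1},
    against u_j = 2(j - 1/2 - m/2)/sqrt(m), for j = 1, ..., m."""
    cdf = np.cumsum([comb(m, i) / 2**m for i in range(m + 1)])
    out = []
    for j in range(1, m + 1):
        # clamp: the float cdf saturates at 1.0 deep in the upper tail
        z = PHI_INV(min(float(cdf[j - 1]), 1.0 - 1e-15))
        u = 2.0 * (j - 0.5 - m / 2.0) / sqrt(m)
        out.append((u, z - u))
    return out

# Probe a bound of the shape (A.1): |z_j - u_j| <= (C0/m)(|u_j|^3 + log m)
# on u_j^2 <= m/2, with the illustrative constant C0 = 3.
m = 256
ratios = [abs(gap) * m / (abs(u)**3 + log(m))
          for u, gap in coupling_gaps(m) if u * u <= m / 2]
```

Near the center the gaps are tiny (the symmetric binomial has no skewness), and they grow like |u_j|³/m in the moderate-deviation range, which is exactly the shape of the bound (5.10) used in the proof.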

Proof. The case m = 0 is trivial because X = 0 with probability 1 and therefore g̃_{0,p} is exactly an N(0, 1) density. Thus the following assumes that m ≥ 1. It follows from (3.1) that

(5.4)    g̃_{m,p}(z) = p^j (1 − p)^{m−j} 2^m ϕ_0(z),

where j = j(z) is the integer between 0 and m such that

(5.5)    Φ^{−1}[F_m(j − 1/2)] ≤ z < Φ^{−1}[F_m(j + 1/2)].

Let θ = log(p/q) so that

log( g̃_{m,p}(z)/ϕ_0(z) ) = θ (j − m/2) + (m/2) log(4pq),

and the second term can be approximated by

(5.6)    −θ²/4 − θ⁴/24 ≤ log(4pq) = −log( (2 + e^θ + e^{−θ})/4 ) ≤ −θ²/4 + θ⁴/32.


Let h_1(θ) = (2 + e^θ + e^{−θ})/4. The second inequality in (5.6) follows from log(h_1(θ)) ≥ log(1 + θ²/4) ≥ θ²/4 − θ⁴/32. The first inequality in (5.6) follows from h_1(θ) ≤ 1 + θ²/4 + θ⁴/24 for |θ| ≤ 4, and from log(h_1(θ)) ≤ |θ| ≤ θ²/4 for |θ| > 4. Now, let

(5.7)    z′ = z′(z) = (j(z) − m/2)/(√m/2)    and    b = θ √m/2.

Then for some −1/24 ≤ h_2(θ) ≤ 1/32 the log ratio is

log( g̃_{m,p}(z)/ϕ_0(z) ) = z′ b − b²/2 + h_2(θ) m θ⁴.

The log ratio of normals with different means is log(ϕ_0/ϕ_b) = −zb + b²/2. Therefore the ratio with respect to the normal with mean b is

(5.8)    log( g̃_{m,p}/ϕ_b ) = h_2(θ) m θ⁴ − b (z − z′),    |h_2(θ)| ≤ 1/24.

Since y log(x/y) ≤ x − y ≤ x log(x/y) for all positive x and y,

(1/2) √g̃_{m,p} log( ϕ_b/g̃_{m,p} ) ≤ √ϕ_b − √g̃_{m,p} ≤ (1/2) √ϕ_b log( ϕ_b/g̃_{m,p} ),

so that by (5.8),

(5.9)    H²(g̃_{m,p}, ϕ_b) ≤ (1/4) ∫ [ log( ϕ_b/g̃_{m,p} ) ]² (ϕ_b + g̃_{m,p}) dz
    ≤ ∫ [ (mθ⁴/24)² + (b²/2)(z − z′)² ] (ϕ_b + g̃_{m,p}) dz.

It follows from Carter and Pollard (2004) that the difference between z and z′ = z′(z) is bounded by

(5.10)    |z − z′| ≤ C_2(m^{−1/2} + m^{−1}|z|³)    for all z,
          |z − z′| ≤ C_2(m^{−1/2} + m^{−1}|z′|³)    if |z| ≤ √(2m),

for some constant C_2. Thus,

(5.11)    ∫ (z − z′)² g̃_{m,p} dz ≤ 2C_2² ( 1/m + (1/m²) ∫ |z′|⁶ g̃_{m,p} dz + ∫_{z²>2m} (z⁶/m²) g̃_{m,p} dz ).

Since ∫ g̃_{m,p} 1{z′ = (j − m/2)/(√m/2)} dz = P{X̃_{m,p} = j},

∫ |z′|⁶ g̃_{m,p} dz = E( (X̃_{m,p} − m/2)/(√m/2) )⁶ = O( 1 + m³ (p − 1/2)⁶ ) = O(1 + b⁶)

uniformly in (m, p). It follows from (5.4) that

∫_{z²>2m} z⁶ g̃_{m,p} dz ≤ 2^m ∫_{z²>2m} z⁶ ϕ_0 dz = O(2^m m⁶ e^{−m}) = O(m^{−1}).


The above two inequalities and (5.11) imply

∫ (z − z′)² g̃_{m,p} dz ≤ 2C_2² O(1/m + b⁶/m²).

Similarly, ∫ (z − z′)² ϕ_b dz ≤ 2C_2² O(1/m + b⁶/m²). Inserting these two inequalities into (5.9) yields (5.2), in view of (5.7).

Now let us prove (5.3). The Hellinger distance is bounded by 2, so that b⁸/m² in (5.2) can be replaced by b⁴/m, and it suffices to consider |p − 1/2| ≤ 1/4 for the proof of (5.3). By inspecting the infinite series expansion of log(p/q) = log(1 + x) − log(1 − x) for x = 2p − 1, we find that for |p − 1/2| ≤ 1/4, |log(p/q)| ≤ (8/3)|2p − 1| and |log(p/q) − 4(p − 1/2)| ≤ (8/9)|2p − 1|³. These inequalities, respectively, imply

b²/m + b⁴/m ≤ (16/9)(2p − 1)² + (256/81) m (2p − 1)⁴

and |b − √m (2p − 1)|² ≤ (16/81) m |2p − 1|⁶ ≤ (4/81) m |2p − 1|⁴, in view of the definition of b, which then imply (5.3) via (5.2) and the fact that H²(ϕ_b, ϕ_β) ≤ (b − β)²/4.

APPENDIX

A.1. The Tusnády inequality. The coupling of symmetric binomials and normals maps the integers j onto intervals [β_j, β_{j+1}] such that the normal(m/2, m/4) probability in the interval is equal to the binomial probability (m choose j) 2^{−m} at j. Taking the standardized values

z_j = 2(β_j − m/2)/√m,    u_j = 2(j − 1/2 − m/2)/√m,

Carter and Pollard (2004) showed that, for m/2 < j < m and certain universal finite constants C_±,

C_− (u_j + 1)/m ≤ z_j − u_j ( 1 + (u_j²/(2m)) γ(u_j/√m) ) ≤ C_+ ( (u_j + log m)/m − log(1 − u_j²/m)/(2c u_j √m) ),

where c = 2 log 2 and γ is an increasing function with γ(0) = 1/12 and γ(1) = log 2 − 1/2. This immediately implies that

(A.1)    |z_j − u_j| ≤ (C_0/m)(|u_j|³ + log m)    for all u_j²/m ≤ 1/2,

for a certain universal constant C_0 < ∞. We shall prove (5.10) here based on (A.1). Because of the symmetry in both distributions, it is only necessary to consider z > 0.

It follows from (5.5) and (5.7) that

z_j ≤ z < z_{j+1}  ⟺  u_j ≤ z′ = z′(z) < u_{j+1}.

Let z_j ≤ z < z_{j+1}. Since u_{j+1} − u_j = 2/√m, for u²_{j+1} ≤ m/2, (A.1) implies

(A.2)    |z − z′| ≤ |z_j − u_j| ∨ |z_{j+1} − u_{j+1}| + 2/√m ≤ C_0′ ( 1/√m + (|z|³ ∧ |z′|³)/m ).

Since u_j and z_j are both increasing in j, it follows that (z ∧ z′)/√m is uniformly bounded away from zero for u_{j+1} ≥ √(m/2), so that

(A.3)    |z − z′| ≤ |z_j − u_j| ∨ |z_{j+1} − u_{j+1}| + 2/√m ≤ C_0″ (|z|³ ∧ |z′|³)/m

for (m + 1)/√m = u_{m+1} ≥ u_{j+1} ≥ √(m/2) and z ≤ √(2m). Since z ∨ z′ ≤ z ∨ u_{m+1} ≤ √2 z for z > √(2m), (A.2) and (A.3) imply

|z − z′| ≤ C_2(m^{−1/2} + m^{−1}|z|³)    for all z,
|z − z′| ≤ C_2(m^{−1/2} + m^{−1}|z′|³)    if |z| ≤ √(2m),

for a certain universal C_2 < ∞, that is, (5.10).

A.2. Technical lemmas. The following three lemmas simplify the rest of the proof of Theorem 3.

Lemma 1. (i) Let f_{k,ℓ} and h_{k,ℓ} be as in (2.7) and (2.3). Then

(A.4)    0 ≤ √f_{k,ℓ} − h_{k,ℓ} ≤ 2^{k−1} f_{k,ℓ}^{−3/2} ∫_{I_{k,ℓ}} (f − f_{k,ℓ})².

(ii) Let θ_{k,ℓ} be the Haar coefficients of f as in (1.3). Then

(A.5)    | ∫ h φ_{k,ℓ} − θ_{k,ℓ}/(2 √f_{k,ℓ}) | ≤ 2^{k/2−1} f_{k,ℓ}^{−3/2} ∫_{I_{k,ℓ}} (f − f_{k,ℓ})².

Proof. Let T = (f − f_{k,ℓ})/f_{k,ℓ} ≥ −1. By algebra,

√(1 + T) − 1 = T/(1 + √(1 + T)) = T/2 − T²/( 2(1 + √(1 + T))² ).

It follows from (2.3) and (2.7) that

h_{k,ℓ} = 2^k √f_{k,ℓ} ∫_{I_{k,ℓ}} √(1 + T) = 2^k √f_{k,ℓ} ∫_{I_{k,ℓ}} ( 1 + (f − f_{k,ℓ})/(2f_{k,ℓ}) − (f − f_{k,ℓ})²/( 2f_{k,ℓ}² (1 + √(1 + T))² ) ),

which implies (A.4), as 2^k ∫_{I_{k,ℓ}} 1 = 1 and ∫_{I_{k,ℓ}} (f − f_{k,ℓ}) = 0. For (ii) we have by (2.7)

∫ h φ_{k,ℓ} = √f_{k,ℓ} ∫ φ_{k,ℓ} √(1 + T) = √f_{k,ℓ} ∫ φ_{k,ℓ} ( 1 + (f − f_{k,ℓ})/(2f_{k,ℓ}) − (f − f_{k,ℓ})²/( 2f_{k,ℓ}² (1 + √(1 + T))² ) ),

which implies (A.5), as ∫ φ_{k,ℓ} = 0 and |φ_{k,ℓ}| ≤ 2^{k/2} by (1.4).

Lemma 2. Let θ_{k,ℓ} be the Haar coefficients in (1.3) and f_{k,ℓ} be as in (2.7). Then, for all c > 0,

Σ_{k=k_0}^∞ 2^k Σ_{ℓ=0}^{2^k−1} ( ∫_{I_{k,ℓ}} (f − f_{k,ℓ})² )² ≤ ( 2^{−ck_0}/(1 − 1/2^c)² ) Σ_{k=k_0}^∞ 2^{k(1+c)} Σ_{ℓ=0}^{2^k−1} θ⁴_{k,ℓ}.

Proof. Define

δ_{i,j,k,ℓ} ≡ 1 if I_{i,j} ⊆ I_{k,ℓ},    δ_{i,j,k,ℓ} ≡ 0 otherwise.

Since Σ_j δ_{i,j,k,ℓ} = 2^{i−k} for i ≥ k, using the Cauchy–Schwarz inequality twice yields

( ∫_{I_{k,ℓ}} (f − f̄_k)² )² = ( Σ_{i=k}^∞ Σ_{j=0}^{2^i−1} δ_{i,j,k,ℓ} θ²_{i,j} )²
    ≤ [ Σ_{i=k}^∞ 2^{−ic/2} ( 2^{ic} 2^{i−k} Σ_{j=0}^{2^i−1} δ_{i,j,k,ℓ} θ⁴_{i,j} )^{1/2} ]²
    ≤ ( Σ_{i=k}^∞ 2^{−ic} ) Σ_{i=k}^∞ 2^{ic} 2^{i−k} Σ_{j=0}^{2^i−1} δ_{i,j,k,ℓ} θ⁴_{i,j}
    ≤ ( 2^{−k(1+c)}/(1 − 1/2^c) ) Σ_{i=k}^∞ 2^{i(1+c)} Σ_{j=0}^{2^i−1} δ_{i,j,k,ℓ} θ⁴_{i,j}.

Since Σ_{ℓ=0}^{2^k−1} δ_{i,j,k,ℓ} = 1 for i ≥ k, the above inequality implies

Σ_{k=k_0}^∞ 2^k Σ_{ℓ=0}^{2^k−1} ( ∫_{I_{k,ℓ}} (f − f_{k,ℓ})² )² ≤ Σ_{k=k_0}^∞ ( 2^{−ck}/(1 − 1/2^c) ) Σ_{i=k}^∞ 2^{i(1+c)} Σ_{j=0}^{2^i−1} θ⁴_{i,j}
    = Σ_{i=k_0}^∞ ( Σ_{k=k_0}^i 2^{−ck}/(1 − 1/2^c) ) 2^{i(1+c)} Σ_{j=0}^{2^i−1} θ⁴_{i,j}
    ≤ ( 2^{−ck_0}/(1 − 1/2^c)² ) Σ_{i=k_0}^∞ 2^{i(1+c)} Σ_{j=0}^{2^i−1} θ⁴_{i,j}.

Lemma 3. Let X̃_λ be a Poisson random variable with mean λ. Then E(√X̃_λ − √λ)⁴ ≤ 4.

Proof. Since E(X̃_λ − λ)⁴ = λ(3λ + 1),

E(√X̃_λ − √λ)⁴ ≤ E(X̃_λ − λ)⁴/(√λ + 1)⁴ + λ² P(X̃_λ = 0) ≤ (3λ + 1)/(λ + 6) + 1 ≤ 4.

A.3. Proof of Theorem 1. First note that

H(T_n R_{n1} V⋆_n, Z⋆_n) ≤ H(T_n R_{n1} V⋆_n, T_n(N, X_N)) + H(T_n(N, X_N), Z⋆_n)

and

H(V⋆_n, R_{n2} T_n^{−1} Z⋆_n) ≤ H(V⋆_n, R_{n2}(N, X_N)) + H(R_{n2}(N, X_N), R_{n2} T_n^{−1} Z⋆_n).

Note also that since, for any randomization T and random X and Y, H(T X, T Y) ≤ H(X, Y), it follows that

H(T_n R_{n1} V⋆_n, T_n(N, X_N)) ≤ H(R_{n1} V⋆_n, (N, X_N))

and

H(R_{n2}(N, X_N), R_{n2} T_n^{−1} Z⋆_n) ≤ H((N, X_N), T_n^{−1} Z⋆_n) = H(T_n(N, X_N), Z⋆_n).

For the class H and the randomizations R_{n1} and R_{n2}, it follows from (2.15), (2.16) and the proof of Proposition 3 on page 508 of Le Cam (1986) that

sup_{f∈H} H(R_{n1} V⋆_n, (N, X_N)) → 0    and    sup_{f∈H} H(V⋆_n, R_{n2}(N, X_N)) → 0.

Hence (1.9) and (1.8) will follow once

(A.6)    sup_{f∈H} H(T_n(N, X_N), Z⋆_n) → 0


is established. By Theorem 3, for (A.6) to hold it is sufficient to show that

sup_{f∈H} ( 4^{k_0}/n + ‖f − f̄_{k_0}‖²_{1/2,2,2} + (n/4^{k_0}) ‖f − f̄_{k_0}‖⁴_{1/2,4,4} ) → 0.

If the class of functions H is a compact set in the Besov spaces, then the partial sums converge uniformly to 0,

sup_{f∈H} ‖f − f̄_k‖_{1/2,p,p} → 0

for p = 2 or 4 as k → ∞. This implies that there is a sequence γ_k → 0 such that γ_k^{−1} sup_{f∈H} ‖f − f̄_k‖⁴_{1/2,4,4} → 0. To be specific, let

γ_k = sup_{f∈H} ‖f − f̄_k‖²_{1/2,4,4}.

It is necessary to choose the sequence of integers k_0(n) that will be the critical dimension dividing the two techniques. Let k_0 be the smallest integer such that 4^{k_0}/n ≥ γ_{k_0}. Therefore k_0(n) → ∞, and as n → ∞,

sup_{f∈H} ( 4^{k_0}/n + ‖f − f̄_{k_0}‖²_{1/2,2,2} + (n/4^{k_0}) ‖f − f̄_{k_0}‖⁴_{1/2,4,4} )
    ≤ sup_{f∈H} ( 4γ_{k_0−1} + ‖f − f̄_{k_0}‖²_{1/2,2,2} + (1/γ_{k_0}) ‖f − f̄_{k_0}‖⁴_{1/2,4,4} ) → 0.

A.4. Proof of Theorem 2. Theorem 2 follows from Theorem 1 and the fact that the Lipschitz and Sobolev spaces described are compact in the Besov spaces. The Lipschitz class is equivalent to B(β, ∞, ∞) and therefore is compact in B(1/2, p, p) if β > 1/2. The Sobolev class is equivalent to B(α, 2, 2) and

‖f − f̄_{k_0}‖²_{α,2,2} ≤ C_α Σ_n |c_n(f)|² n^{2α},

where C_α depends only on α. Thus if F is compact in Sobolev(α) for α ≥ 1/2 then it is compact in B(1/2, 2, 2).

Further restrictions are required to show that the Sobolev(α) class is compact in B(1/2, 4, 4). If ‖f‖^{(L)}_β ≤ C_{(L)}, then ‖f̄_k − f̄_{k+1}‖_∞ ≤ C_{(L)} 2^{−kβ}, so that

‖f − f̄_{k_0}‖⁴_{1/2,4,4} ≤ C²_{(L)} Σ_{k=k_0}^∞ 2^{2k(1−β)} ∫ |f̄_k − f̄_{k+1}|² dx = C²_{(L)} ‖f − f̄_{k_0}‖²_{1−β,2,2}.

Therefore, for F bounded in Lipschitz(β), a compact Sobolev(α) set is also compact in B(1/2, 4, 4) if α ≥ 1 − β. Finally, if F is compact in Sobolev(α), α ≥ 3/4, then it immediately follows from the Sobolev embedding theorem that the functions are bounded in Lipschitz(1/4) [e.g., Folland (1984), pages 270 and 273], and it follows that F is compact in B(1/2, 4, 4).

Acknowledgments. We thank the referees and an Associate Editor for several suggestions which led to improvements in the final manuscript.

REFERENCES

Brown, L. D., Cai, T., Low, M. G. and Zhang, C.-H. (2002). Asymptotic equivalence theory for nonparametric regression with random design. Ann. Statist. 30 688–707. MR1922538
Brown, L. D. and Low, M. G. (1996). Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist. 24 2384–2398. MR1425958
Brown, L. D. and Zhang, C.-H. (1998). Asymptotic nonequivalence of nonparametric experiments when the smoothness index is 1/2. Ann. Statist. 26 279–287. MR1611772
Carter, A. V. (2000). Asymptotic equivalence of nonparametric experiments. Ph.D. dissertation, Yale Univ.
Carter, A. V. (2002). Deficiency distance between multinomial and multivariate normal experiments. Ann. Statist. 30 708–730. MR1922539
Carter, A. V. and Pollard, D. (2004). Tusnády's inequality revisited. Ann. Statist. 32. To appear.
Efromovich, S. and Samarov, A. (1996). Asymptotic equivalence of nonparametric regression and white noise has its limits. Statist. Probab. Lett. 28 143–145. MR1394666
Folland, G. B. (1984). Real Analysis. Wiley, New York. MR767633
Golubev, G. and Nussbaum, M. (1998). Asymptotic equivalence of spectral density and regression estimation. Technical report, Weierstrass Institute, Berlin.
Gramma, I. and Nussbaum, M. (1998). Asymptotic equivalence for nonparametric generalized linear models. Probab. Theory Related Fields 111 167–214. MR1633574
Le Cam, L. (1964). Sufficiency and approximate sufficiency. Ann. Math. Statist. 35 1419–1455. MR207093
Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York. MR856411
Le Cam, L. and Yang, G. (1990). Asymptotics in Statistics. Springer, New York. MR1066869
Milstein, G. and Nussbaum, M. (1998). Diffusion approximation for nonparametric autoregression. Probab. Theory Related Fields 112 535–543. MR1664703
Nussbaum, M. (1996). Asymptotic equivalence of density estimation and Gaussian white noise. Ann. Statist. 24 2399–2430. MR1425959

L. D. Brown, M. G. Low
Department of Statistics, The Wharton School
University of Pennsylvania
Philadelphia, Pennsylvania 19104-6340, USA
e-mail: [email protected]; [email protected]

A. V. Carter
Department of Statistics and Applied Probability
University of California, Santa Barbara
Santa Barbara, California 93106-3110, USA
e-mail: [email protected]

C.-H. Zhang
Department of Statistics
504 Hill Center, Busch Campus, Rutgers University
Piscataway, New Jersey 08854-8019, USA
