Estimating linear functionals in Poisson mixture models

Laurent Cavalier and Nicolas W. Hengartner

April 6, 2009

Abstract: This paper concerns the problem of estimating linear functionals of the mixing distribution from Poisson mixture observations. In particular, linear functionals for which the parametric rate of convergence cannot be achieved are studied. It appears that Gaussian functionals are rather easy to estimate. Estimation of the distribution function is then considered, by approximating this functional by Gaussian functionals. Finally, the case of smooth distribution functions is considered in order to deal with rather general linear functionals.

Keywords: Poisson mixtures, linear functionals, rates of convergence, density estimation, inverse problems.

AMS 1991 subject classifications: Primary 62G05, 62G20; secondary 62J99.

1 Introduction

The Poisson mixture model, a distribution with probability mass function

$$P_F[X_1 = k] = \pi_k(F) \stackrel{\triangle}{=} \int_0^{\infty} \frac{s^k}{k!}\, e^{-s}\, F(ds), \qquad k = 0, 1, 2, \ldots,$$

arises in many interesting applications. This corresponds to problems where the observations have a Poisson distribution P(s), but the parameter s is random with distribution function F on [0, ∞). For example, these models appear in various applications, from the number of detected particles from sources with varying intensities (Hengartner et al.

(1994)) to species richness estimation (Wang and Lindsay (2005)) or premarked cohort studies (Norris and Pollack (2001)).

Nonparametric estimation of the mixing distribution F is of interest and has been studied by Tucker (1963), Walter (1985), Pilla and Lindsay (2001), Holzmann, Munk and Stratmann (2004) and Wang (2007). Theoretical properties, and in particular optimal rates of convergence, have been obtained in Zhang (1995), Loh and Zhang (1996) and Hengartner (1997). These optimal rates are logarithmic, and hence very slow. Thus, nonparametric estimation in Poisson mixtures is a very difficult problem.

One of the reasons why this model is so difficult is related to the theory of inverse problems: one has to estimate an unknown function from indirect observations. Indeed, in the Poisson mixture, one observes the Xi with distribution P(s) and one wants to reconstruct the distribution of s. The statistical study of ill-posed inverse problems is nowadays rather well developed (van Rooij and Ruymgaart (1996), Cavalier et al (2002), Cavalier and Tsybakov (2002) and Cavalier (2008)), and more specifically the deconvolution problem (Fan (1991), Neumann (1997) and Pensky and Vidakovic (1999)). Moreover, in severely ill-posed problems or supersmooth deconvolution, the rates are known to be usually logarithmic, and hence very slow (Fan (1992), Efromovich (1997) and Cavalier (2008)).

Due to the logarithmic rates, the problem of estimating the density f of F in a Poisson mixture is too difficult from a statistical point of view. The aim is then to shift perspective and find another, easier, goal of estimation. From an inverse problem point of view, one is looking for well-posed questions in ill-posed problems. One main question in such a difficult framework is: what are we able to estimate well? This paper therefore considers an easier problem: estimating linear functionals of the mixing distribution F, IL[F] = ∫ψ(s)F(ds), from an n-sample X₁, ..., Xₙ of independent Poisson mixture random variables.

Naive estimates for these linear functionals are obtained by plugging consistent estimators F̂ of the mixing distribution function F into the functional. Since the collection of all bounded continuous functions is measure determining, there will exist linear functionals for which these plug-in estimators IL[F̂ₙ] = ∫ψ(s)F̂ₙ(ds) converge to IL[F] = ∫ψ(s)F(ds) at the same rate as the estimator F̂ₙ converges to F (see Bickel and Ritov (2003) and Giné and Nickl (2008)). However, by directly focusing on estimating linear functionals, it is sometimes possible to exploit special structures in the problem. We will give conditions for linear functionals to be estimated at polynomial n^{-γ} rates, with γ ≤ 1/2. Van der Vaart (1991) and Bickel and Ritov (1995) have studied the case of well-behaved linear functionals of the mixing distribution that can be estimated at the parametric n^{-1/2} rate. In this paper we consider a more general case for which the parametric rate is not achieved in general. The main interest is that the rates of convergence will cover a large scale, from logarithmic to parametric rates. This provides insight into what the well-posed questions are in this ill-posed problem.

The paper is organized as follows. In Section 2, we first study the link between the linear functional ψ to estimate and the function ϕ of the data that should be used; we obtain an explicit form in Proposition 1. We also give conditions for linear functionals to be estimated at rate n^{-γ} for all probability distributions F supported on a given finite interval. The idea is to identify the assumptions needed for the linear functional IL[F] to be consistently estimated by an average θ̂ₙ = (1/n) Σ_{j=1}^n ϕ(X_j). Section 3 considers estimation

of some specific linear functionals: Gaussian functionals and distribution functions. Rates of convergence for θ̂ₙ are obtained in Theorems 1 and 2. These rates are obtained for the very smooth Gaussian functionals in Section 3.1. In Section 3.2, estimation of the distribution function is considered and an approximation method is used. The point of view of Section 4 is different, since one considers rather general functionals; on the other hand, the class of mixing distributions F now belongs to a kernel family. By restricting the class of mixing distributions F, linear functionals defined by ψ that do not a priori satisfy the conditions of Section 2.1 can nevertheless be estimated by an average. As an example, in Section 4.3, the classes of reproducing kernel Hilbert spaces with Gaussian kernel are considered, and rates of convergence are obtained in Theorem 3. The Appendix gives the proof of Theorem 2.


2 General framework

2.1 Relationship between ϕ and ψ

One observes n independent Poisson mixture observations X₁, ..., Xₙ with the following probability mass function

$$P_F[X_1 = k] = \pi_k(F) \stackrel{\triangle}{=} \int_0^{\infty} \frac{s^k}{k!}\, e^{-s}\, F(ds), \qquad k = 0, 1, 2, \ldots$$
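To fix ideas, here is a minimal simulation sketch of this model (our illustration, not from the paper); the Gamma mixing law is an arbitrary choice for F.

```python
# Minimal simulation of the Poisson mixture: X_i ~ Poisson(s_i), s_i ~ F.
# The choice F = Gamma(2, 1) is illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
s = rng.gamma(shape=2.0, scale=1.0, size=n)  # latent intensities drawn from F
X = rng.poisson(s)                           # observed Poisson mixture sample
print(X[:10])
```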

The aim is then to estimate a linear functional of the mixing distribution F, IL[F] = ∫ψ(s)F(ds). However, in this setting one only has indirect observations: the data do not have distribution F. Thus, the first step is to understand which function ϕ(X) should be used as an estimator of ∫ψ(s)F(ds).

The second step is then to identify necessary conditions for a linear functional of the mixing distribution to be estimated by the corresponding empirical mean. By the SLLN (Chung (1974), page 126), the estimator

$$\hat\theta_n = \frac{1}{n}\sum_{j=1}^{n} \varphi(X_j) \qquad (1)$$

converges almost surely to its expected value IL[F] = IE_F[ϕ(X₁)] if

A) IE_F[|ϕ(X)|] < ∞.

For distributions F supported on [0, b], Requirement A) is satisfied when

$$\sum_{k=0}^{\infty} \frac{|\varphi(k)|}{k!}\, b^k < \infty.$$

Moreover, Condition B),

$$\text{B)}\quad \mathbb{E}_F[\varphi(X)] = \sum_{k=0}^{\infty} \varphi(k)\,\pi_k(F) = \int \psi(s)\,F(ds),$$

implies

$$\mathbb{E}_F[\varphi(X)] = \sum_{k=0}^{\infty} \varphi(k) \int_0^{b} \frac{s^k}{k!}\, e^{-s}\, F(ds).$$

Since the series is absolutely convergent, it is possible to interchange the sum and integral to get

$$\mathbb{E}_F[\varphi(X)] = \int_0^{b} \left[\sum_{k=0}^{\infty} \frac{\varphi(k)}{k!}\, s^k e^{-s}\right] F(ds) = \int_0^{b} \psi(s)\, F(ds).$$

This identity holds for all distributions F supported on [0, b] (and in particular for point masses), from which follows the representation

$$\psi(s) = \sum_{k=0}^{\infty} \frac{\varphi(k)}{k!}\, s^k e^{-s} \qquad \text{for } 0 \le s \le b. \qquad (2)$$

To determine the function ϕ(k) from ψ(s), note that (2) implies that e^s ψ(s) is analytic at the origin with radius of convergence ρ > b. Since e^s is entire, the same conclusion holds for ψ(s). Therefore

$$\varphi(k) = \frac{d^k}{ds^k}\left[e^s \psi(s)\right]_{s=0} = \sum_{\ell=0}^{k} \binom{k}{\ell}\, \psi^{(\ell)}(0).$$

This result is important: it shows how to construct explicitly the estimator θ̂ₙ = (1/n) Σ_{j=1}^n ϕ(X_j) of the linear functional IL[F] = ∫ψ(s)F(ds). It yields the following proposition.

Proposition 1 If the function ψ(s) that defines the linear functional IL[F] = ∫ψ(s)F(ds) is analytic at the origin with radius of convergence ρ > 0, then for all distributions F supported on (0, b), with b < ρ and b < ∞,

$$\int_0^{\infty} \psi(s)\, F(ds) = \sum_{k=0}^{\infty} \varphi(k)\,\pi_k(F), \qquad (3)$$

where

$$\varphi(k) = \frac{d^k}{ds^k}\left[e^s \psi(s)\right]_{s=0} = \sum_{\ell=0}^{k}\binom{k}{\ell}\,\psi^{(\ell)}(0).$$

Furthermore, the estimator θ̂ₙ defined in (1) is consistent for its expected value IL[F].

Remark: If ψ(s) has derivatives of all orders, the conclusion of Proposition 1 holds for all distributions F that satisfy

$$\int_0^{\infty} \sum_{k=0}^{\infty} \frac{|\varphi(k)|}{k!}\, s^k e^{-s}\, F(ds) < \infty.$$

The latter condition can be removed when ϕ(k) ≥ 0 for all k ∈ IN. The importance of this remark is that it removes the restriction on the support of F.
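As a concrete illustration of this inversion (our sketch, not from the paper; the choice ψ(s) = s² and all names are illustrative), ϕ(k) can be computed by symbolically differentiating e^s ψ(s) at the origin and then plugged into the empirical mean (1).

```python
# Sketch of Proposition 1's inversion: phi(k) = d^k/ds^k [e^s psi(s)] at s = 0,
# computed symbolically, then averaged as in (1).
import numpy as np
import sympy as sp

def make_phi(psi_expr, s, k_max):
    """Return the array phi(0), ..., phi(k_max)."""
    g = sp.exp(s) * psi_expr
    return np.array([float(sp.diff(g, s, k).subs(s, 0)) for k in range(k_max + 1)])

s = sp.symbols("s")
phi = make_phi(s**2, s, k_max=60)          # psi(s) = s^2 gives phi(k) = k(k - 1)

rng = np.random.default_rng(1)
lam = rng.uniform(0.0, 3.0, size=50_000)   # mixing draws from F = U[0, 3]
X = rng.poisson(lam)
assert X.max() <= 60                        # keep the lookup table large enough
print(phi[X].mean())                        # estimates int s^2 F(ds) = 3
```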

Hence the representation (2) is a necessary condition for the existence of an unbiased estimator of the linear functional ∫ψ(s)F(ds) from Poisson mixture observations. Also note that (2) can be expressed as

$$\psi(s) = \mathbb{E}\left[\varphi(X) \mid Y = s\right], \qquad (4)$$

where Y is distributed according to F, and X given Y = s has a Poisson distribution with mean s. A similar result was obtained by Van der Vaart (1991) in connection with characterizing the functionals of the mixing distribution that can be estimated at the parametric 1/√n rate. He considered square integrable functionals of the mixture observations, while in the present context the functionals are only assumed to be summable. When ϕ(X) is square integrable, √n(θ̂ₙ − IL[F]) is asymptotically Normal with mean zero and finite variance σ²(F), where σ²(F) denotes the variance of ϕ(X₁) under F. Things are different if IE[|ϕ(X)|^p] = ∞ for some 1 < p < 2.
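Representation (4) is easy to check by simulation; in the sketch below (ours), ϕ(k) = k(k − 1) and ψ(s) = s², anticipating the moment example of Section 2.2.

```python
# Monte Carlo check of (4): E[phi(X) | Y = s] = psi(s), here with
# phi(k) = k(k - 1) and psi(s) = s^2 (the r = 2 case of Section 2.2).
import numpy as np

rng = np.random.default_rng(2)
s = 1.7
X = rng.poisson(s, size=1_000_000)
print((X * (X - 1)).mean(), s**2)  # both approximately 2.89
```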

Lemma 1 If IE[|ϕ(X)|^p] = ∞ for some 1 < p < 2, then we never have

$$n^{1-1/p}\left(\hat\theta_n - \mathbb{L}[F]\right) \to 0, \quad \text{a.s.}$$

Proof: Suppose that n^{1−1/p}(θ̂ₙ − IL[F]) → 0 almost surely. Then

$$\frac{\varphi(X_n) - \mathbb{L}[F]}{n^{1/p}} = n^{1-1/p}\left(\hat\theta_n - \mathbb{L}[F]\right) - \left(1 - \frac{1}{n}\right)^{1/p}(n-1)^{1-1/p}\left(\hat\theta_{n-1} - \mathbb{L}[F]\right)$$

converges to zero almost surely. By Borel–Cantelli,

$$\sum_{n=0}^{\infty} \mathbb{P}\left[\left|\varphi(X_n) - \mathbb{L}[F]\right|^p > n\right] < \infty,$$

which implies IE[|ϕ(X) − IL[F]|^p] < ∞. This contradicts the assumption IE[|ϕ(X)|^p] = ∞.

Remark: This result means that n^{1/p−1} is a lower bound for the almost sure rate of convergence of θ̂ₙ to IL[F]. Thus, when ϕ(X) is not well behaved, i.e. not square integrable, the rate of convergence cannot be as fast as the parametric √n rate. In fact, we want to estimate rather difficult functionals.

2.2 Estimating the r-th moment of the mixing distribution F

In this section, we give an example where the results of Section 2.1 may be used. Consider the problem of estimating the r-th moment of the distribution F from a sample of n Poisson mixtures. The linear functional of interest is ψ(s) = s^r, which is entire when r is an integer. Proposition 1 gives

$$\varphi(k) = \frac{d^k}{ds^k}\left[e^s \psi(s)\right]_{s=0} = \sum_{\ell=0}^{\min(k,r)} \binom{k}{\ell}\frac{r!}{(r-\ell)!}\, s^{r-\ell}\Big|_{s=0} = \frac{k!}{(k-r)!}\, \mathbb{I}[k \ge r],$$

where II(A) denotes the indicator function of the set A. The estimator

$$\hat\theta_n = \frac{1}{n}\sum_{j=1}^{n} \frac{X_j!}{(X_j - r)!}\, \mathbb{I}[X_j \ge r] \qquad (5)$$

thus converges almost surely to ∫₀^∞ s^r F(ds) for all boundedly supported distributions F.

In fact, since ϕ(k) ≥ 0 for all k ∈ IN, it is consistent for all distributions on IR₊, with the understanding that θ̂ₙ converges almost surely to ∞ when ∫₀^∞ s^r F(ds) = ∞. Direct calculations show that, unless ∫₀^∞ s^{2r} F(ds) < ∞, the estimator (5) does not have finite variance.

When r is not an integer, ψ(s) is not analytic at the origin, and Proposition 1 does not apply.
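A short sketch of the estimator (5) (our code; the Gamma mixing law is an arbitrary test case). The product below is the falling factorial X(X−1)···(X−r+1), which equals X!/(X−r)! II[X ≥ r].

```python
# The moment estimator (5): average of the falling factorials X!/(X - r)!.
import numpy as np

def moment_estimator(X, r):
    X = np.asarray(X, dtype=np.float64)
    ff = np.ones_like(X)
    for i in range(r):                # falling factorial X(X-1)...(X-r+1),
        ff *= np.maximum(X - i, 0.0)  # which vanishes whenever X < r
    return ff.mean()

rng = np.random.default_rng(3)
lam = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # F = Gamma(2, 1)
X = rng.poisson(lam)
print(moment_estimator(X, r=2))  # estimates E[s^2] = Var(s) + E[s]^2 = 6
```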

3 Estimation of some specific linear functionals

In this section, the goal is to understand what happens for rather specific functionals: Gaussian functionals in Section 3.1 and the distribution function in Section 3.2.

3.1 Estimating Gaussian functionals

In this section, we consider very smooth functionals, the Gaussian functionals. The point is to show that these functionals are rather easy to estimate with the estimator θ̂ₙ. These Gaussian functionals will then be used as a tool in the rest of the paper. Moreover, they will provide a better understanding of which functionals are easy to estimate.


Consider the problem of estimating IL[F] = ∫₀^∞ ψ(s)F(ds) from Poisson mixture observations in the special case where ψ is called a Gaussian functional and can be written as

$$\psi(s) = \int_{-\infty}^{\infty} \Psi(t)\cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t-s)^2/2\sigma^2}\, dt \qquad (6)$$

for some bounded Ψ(t). These functionals are entire and Proposition 1 applies.

We now give conditions on the mixing distribution to guarantee that the estimator (1) converges in probability at rate n^{−γ}, for some 0 < γ < 1/2.

Theorem 1 Assume that the function

$$\psi(s) = \int_{-\infty}^{\infty} \Psi(t)\cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t-s)^2/2\sigma^2}\, dt,$$

where Ψ(t) is bounded and σ > 0, defines the linear functional IL[F] = ∫ψ(s)F(ds). Define

$$\varphi(k) = \sum_{\ell=0}^{k}\binom{k}{\ell}\frac{d^\ell}{ds^\ell}\psi(s)\Big|_{s=0}$$

and the estimator

$$\hat\theta_n = \frac{1}{n}\sum_{j=1}^{n}\varphi(X_j).$$

If the mixing distribution F satisfies

$$\int_0^{\infty} \exp\left\{c\, s^{\frac{2}{2-p}}\right\} F(ds) < \infty \qquad (7)$$

for some p ∈ (1, 2) and all c > 0, then the estimator θ̂ₙ satisfies

$$n^{1-1/p}\left(\hat\theta_n - \mathbb{L}[F]\right) \stackrel{P}{\longrightarrow} 0.$$

Proof: Following a theorem of Marcinkiewicz and Zygmund (Chow and Teicher (1988), page 125), the conclusion follows if, for

$$\varphi(k) = \sum_{\ell=0}^{k}\binom{k}{\ell}\frac{d^\ell}{ds^\ell}\psi(s)\Big|_{s=0} = \sum_{\ell=0}^{k}\binom{k}{\ell}\frac{d^\ell}{ds^\ell}\left[\int_{-\infty}^{\infty}\Psi(t)\cdot\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(t-s)^2/2\sigma^2}\, dt\right]_{s=0}, \qquad (8)$$

the random variable ϕ(X) has a finite p-th moment. Interchange the derivative and the integral in (8) and use the Rodrigues formula (Rainville (1967), page 189) to get

$$\frac{d^\ell}{ds^\ell}\left[\int_{-\infty}^{\infty}\frac{\Psi(t)}{\sqrt{2\pi}\,\sigma}\, e^{-(t-s)^2/2\sigma^2}\, dt\right]_{s=0} = \int_{-\infty}^{\infty}\Psi(t)\, H_\ell\!\left(\frac{t}{\sqrt{2}\,\sigma}\right)\left(\frac{-1}{\sqrt{2}\,\sigma}\right)^{\!\ell}\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-t^2/2\sigma^2}\, dt,$$

where H_ℓ(t) denotes the Hermite polynomial of order ℓ. A classical bound on the growth of Hermite polynomials (Beyer (1985), p. 375) is

$$|H_\ell(t)| \le C_1\, \sqrt{\ell!}\, \sqrt{2^{\ell}}\, e^{t^2/2},$$

with C₁ ≤ 1.09. Setting C₂ = C₁ · ∫_{−∞}^{∞} |Ψ(t)| (1/(√(2π)σ)) e^{−t²/4σ²} dt, the derivative is bounded by

$$\left|\frac{d^\ell}{ds^\ell}\psi(s)\Big|_{s=0}\right| \le C_2\, \sqrt{\ell!}\left(\frac{1}{\sigma}\right)^{\!\ell}.$$

Thus

$$|\varphi(k)| \le C_2 \sum_{\ell=0}^{k}\binom{k}{\ell}\sqrt{\ell!}\left(\frac{1}{\sigma}\right)^{\!\ell} \le C_2\, \sqrt{k!}\left(1 + \frac{1}{\sigma}\right)^{\!k}.$$

Interchanging the sum and the integral gives

$$\mathbb{E}\left[|\varphi(X)|^p\right] \le C_2^p \sum_{k=0}^{\infty}\left(\sqrt{k!}\left(1+\frac{1}{\sigma}\right)^{\!k}\right)^{\!p}\pi_k(F) \le 2C_2^p \int_0^{\infty}\sum_{k=0}^{\infty}\left\{\frac{\left[\left(2s\left(1+\sigma^{-1}\right)^{2}\right)^{\frac{2}{2-p}}\right]^{k}}{2^{k+1}\, k!}\right\}^{\frac{2-p}{2}} e^{-s}\, F(ds).$$

For 1 < p < 2, the ratio 0 < (2 − p)/2 < 1, so that by Jensen's inequality

$$\mathbb{E}\left[|\varphi(X)|^p\right] \le 2^{p/2} C_2^p \int_0^{\infty}\left\{\sum_{k=0}^{\infty}\frac{\left[\left(2s\left(1+1/\sigma\right)^{2}\right)^{\frac{2}{2-p}}\right]^{k}}{2^{k}\, k!}\right\}^{\frac{2-p}{2}} e^{-s}\, F(ds) = \left(\sqrt{2}\, C_2\right)^{p} \int \exp\left\{\frac{2-p}{4}\left(2s\left(1+\sigma^{-1}\right)^{2}\right)^{\frac{2}{2-p}}\right\} e^{-s}\, F(ds).$$

The latter integral is bounded by condition (7), and the conclusion follows.

Remark: Condition (7) implies that IE[|ϕ(X)|^p] is finite; thus we are not in the situation of Lemma 1. For compactly supported mixing distributions F, Condition (7) is satisfied for all 1 < p < 2.

Corollary 1 Under the assumptions of Theorem 1, if F has bounded support, then for all 0 < ε < 1,



$$n^{1-\varepsilon}\left(\hat\theta_n - \mathbb{L}[F]\right) \stackrel{P}{\longrightarrow} 0.$$
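To illustrate Theorem 1 numerically, the sketch below (ours; τ, σ and all names are illustrative choices) takes Ψ(t) = II{t > τ}, for which ψ(s) is the Gaussian survival function, computes ϕ(k) = Σ_ℓ C(k,ℓ)ψ^{(ℓ)}(0) symbolically, and averages. Note that ϕ(k) grows roughly like √(k!)(1+1/σ)^k, so the average can be noisy, consistent with the slower-than-parametric rates.

```python
# Sketch for Theorem 1 with Psi(t) = 1{t > tau}: psi(s) = P(N(s, sigma^2) > tau)
# is a Gaussian functional, and phi(k) = sum_l C(k,l) psi^(l)(0).
from math import comb
import numpy as np
import sympy as sp

def phi_gaussian(tau, sigma, k_max):
    s = sp.symbols("s")
    psi = (1 + sp.erf((s - tau) / (sp.sqrt(2) * sigma))) / 2  # Gaussian survival fn
    d = [float(sp.diff(psi, s, l).subs(s, 0)) for l in range(k_max + 1)]
    return np.array([sum(comb(k, l) * d[l] for l in range(k + 1))
                     for k in range(k_max + 1)])

rng = np.random.default_rng(4)
lam = rng.uniform(0.0, 2.0, size=200_000)   # compactly supported F = U[0, 2]
X = rng.poisson(lam)
phi = phi_gaussian(tau=1.0, sigma=1.0, k_max=X.max())
print(phi[X].mean())  # estimates int psi dF = 0.5 here, with visible noise
```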

Remark: For p = 2, a careful look at the proof of Theorem 1 reveals that IE[|ϕ(X)|²] is finite as soon as F has a finite exponential moment, in which case θ̂ₙ attains the parametric rate.

3.2 Estimating the distribution function

Consider estimating the linear functional

$$\mathbb{L}[F] = \int_0^{\infty} \mathbb{I}\{s > t\}\, F(ds) = 1 - F(t)$$

from a sample of n Poisson mixture observations. The explicit inversion of Proposition 1 cannot be applied since the radius of convergence of the series expansion of ψ(s) = II{s > t} at the origin is t, which is smaller than the support of F when F(t) < 1. However, the approximations

$$\psi_m(s) = \int_{-\infty}^{\infty} \mathbb{I}\{u > t\}\, \frac{m}{\sqrt{2\pi}}\, e^{-m^2(u-s)^2/2}\, du$$

are entire. Interchanging integration and differentiation and applying the Rodrigues formula gives

$$\frac{d^k}{ds^k}\psi_m(s)\Big|_{s=0} = \int_0^{\infty} \mathbb{I}[u \ge t]\; H_k\!\left(\frac{m}{\sqrt{2}}\,u\right)\left(\frac{m}{\sqrt{2}}\right)^{\!k}\frac{m}{\sqrt{2\pi}}\, e^{-m^2u^2/2}\, du.$$

By Proposition 1, the estimator

$$\hat\theta_{n,m} = \frac{1}{n}\sum_{j=1}^{n}\varphi_m(X_j), \qquad (11)$$

with

$$\varphi_m(k) = \int_0^{\infty} \mathbb{I}[u \ge t] \sum_{\ell=0}^{k}\binom{k}{\ell}\left(\frac{m}{\sqrt{2}}\right)^{\!\ell} H_\ell\!\left(\frac{m}{\sqrt{2}}\,u\right)\frac{m}{\sqrt{2\pi}}\, e^{-m^2u^2/2}\, du,$$

is consistent for

$$\mathbb{L}_m(F) = \int\!\!\int \mathbb{I}\{u > t\}\, \frac{m}{\sqrt{2\pi}}\, e^{-m^2(u-s)^2/2}\, F(ds)\, du.$$
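The coefficients ϕₘ(k) above can be evaluated numerically; the sketch below (ours; m, t and the uniform mixing law are illustrative values, not from the paper) sums the Hermite series with numpy and integrates over [t, ∞).

```python
# Numerical evaluation of phi_m(k) from the display above, using physicists'
# Hermite polynomials H_l and one-dimensional quadrature.
from math import comb, sqrt, pi
import numpy as np
from numpy.polynomial.hermite import hermval
from scipy.integrate import quad

def phi_m(k, t, m):
    c = np.array([comb(k, l) * (m / sqrt(2.0)) ** l for l in range(k + 1)])
    f = lambda u: (hermval(m * u / sqrt(2.0), c)
                   * m * np.exp(-m**2 * u**2 / 2) / sqrt(2 * pi))
    val, _ = quad(f, t, np.inf)
    return val

rng = np.random.default_rng(5)
lam = rng.uniform(0.0, 2.0, size=100_000)     # F = U[0, 2], so 1 - F(1) = 0.5
X = rng.poisson(lam)
m = 1.5                                       # small m: more bias, less variance
phis = np.array([phi_m(k, t=1.0, m=m) for k in range(X.max() + 1)])
print(phis[X].mean())                         # estimates IL_m(F), close to 0.5
```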

Upper bounds on the absolute error of the proposed estimator are given in the following theorem.

Theorem 2 Assume the distribution function F has a differentiable density in a neighborhood of t, and set, for any 0 < ε < 1, m = m(n) = (log n)^{1/2−ε/2} and r(n) = (log n)^{1−ε}. If, for some t < B < ∞, the distribution function satisfies F(t) < F(B) = 1, then the estimator θ̂ₙ,ₘ defined in (11) verifies

$$\mathbb{P}\left[r(n)\left|\hat\theta_{n,m} - (1 - F(t))\right| \ge c \ \text{ i.o.}\right] = 0.$$

Thus, as n → ∞,

$$\mathbb{E}\left|\hat\theta_{n,m} - (1 - F(t))\right| = O\left(r(n)^{-1}\right).$$

Proof: The proof may be found in the Appendix (Section 5).

Remark: The rate is slow, but not surprising. By comparison with the logarithmic lower and upper bounds for the rates in both Zhang (1995) and Hengartner (1997), we note that this rate is almost optimal. Logarithmic rates of convergence also appear in the related deconvolution problems (Efromovich (1997), Fan (1992)) and in the framework of severely ill-posed inverse problems (see for example Cavalier (2008) or Cavalier et al (2003)). This very slow rate means that estimating a linear functional such as II(s > t) is not easy; it is much more difficult than estimating a Gaussian functional as described in Section 3.1.

4 Restricting the class of distributions

In Section 3 we considered some specific functionals, such as Gaussian functionals and distribution functions. On the other hand, weak assumptions were made on the smoothness of the mixing distribution. An alternative point of view is to consider rather general functionals ψ, but then to restrict the class of mixing distributions F to a subset F₀ ⊂ F. The idea is to borrow smoothness from the distribution and lend it to ψ, making it easier to estimate linear functionals of smooth distributions in F₀ than of an arbitrary distribution. Indeed, as we have seen in Section 3, smooth functionals are rather easy to estimate.

We show in this section how restricting the collection of distributions to F₀ enables linear functionals IL[F] = ∫ψ(s)F(ds) to be estimated by an average, even if ψ is not analytic or has a radius of convergence ρ for which F(ρ) < 1. The idea is to replace the linear functional IL by a linear functional IL* that agrees with IL on F₀, i.e.,

$$\mathbb{L}[F] = \int \psi(s)\, F(ds) = \int \psi_*(s)\, F(ds) = \mathbb{L}_*[F] \qquad \text{for all } F \in \mathcal{F}_0. \qquad (12)$$

If ψ* verifies the conditions of Proposition 1, an explicit inverse for IL*[F] exists and can be used to estimate the functional IL[F], provided F ∈ F₀. Of course, the difficulty lies in identifying functionals IL*[F] for which (12) is true. We shall now proceed to give a recipe for finding IL*[F] in the special case where the density f of F ∈ F₀ is represented as

$$f(t) = \int K_f(t,s)\, f(s)\, ds$$

for suitable kernels K_f(t, s).

4.1 Kernel families of distributions

Let the kernel K(t, s) satisfy both the integrability condition

$$\int |K(t,z)\, K(z,s)|\, dz < \infty \qquad \text{for all } s, t,$$

and the stability condition

$$K(t,s) = \int K(t,z)\, K(z,s)\, dz. \qquad (13)$$

The kernel family (or K-family) of distributions F_K is the collection of distributions F

$$\mathcal{F}_K = \left\{F \text{ with density } f :\ f(t) = \int K(t,s)\, f(s)\, ds\right\}. \qquad (14)$$

This definition covers many interesting families of distributions. For example, the collection of distributions with square integrable densities for which the evaluation functional t ⟼ f(t) is bounded in L² is a K-family. Indeed, these densities belong to a reproducing kernel Hilbert space IH, the latter space being characterized by having a reproducing kernel K(t, s) that satisfies K(·, s) ∈ IH for all fixed s and, for all h ∈ IH,

$$h(t) = \int K(t,s)\, h(s)\, ds.$$

The reproducing kernels satisfy both the integrability and the stability conditions. A complete account of reproducing kernel Hilbert spaces is found in Scholkopf and Smola (2001). The definition of a K-family extends beyond reproducing kernel Hilbert spaces.

4.2 Constructing an equivalent linear functional on F_K

Again let IL[F] = ∫ψ(s)F(ds) be a linear functional, and suppose that for all f ∈ F_K

$$\int\!\!\int |\psi(t)\, K(t,s)|\, f(s)\, ds\, dt < \infty. \qquad (15)$$

The latter justifies interchanging the order of integration to give

$$\mathbb{L}[F] = \int \psi(s) f(s)\, ds = \int \psi(s)\left(\int K(s,t)\, f(t)\, dt\right) ds = \int \left(\int \psi(s)\, K(s,t)\, ds\right) f(t)\, dt = \mathbb{L}_*[F].$$

Define

$$\psi_*(s) = \int K(t,s)\, \psi(t)\, dt. \qquad (16)$$

Hence the linear functionals defined through ψ and ψ* agree on F_K.

Remark: Estimating functionals defined through ψ and ψ* is equivalent when F is "smooth". One can then replace the estimation of functionals of ψ by functionals of the much smoother ψ*.

Combining the assumption that

$$\int\!\!\int |\psi(s)\, K(s,t)\, K(t,z)|\, ds\, dt < \infty \qquad (17)$$

with the stability condition (13) on the kernel K implies that ψ*(s) is stable, in the sense that (ψ*)*(s) = ψ*(s). Hence ψ* can be viewed as a projection of ψ. The function ψ*(s) is an inner product between ψ(t) and K(t, s). If the kernel K(t, s) is entire in s for every fixed t and satisfies mild integrability conditions, the function of the Poisson mixture ϕ*(k) associated to ψ*(s) can be expressed as an inner product between ψ(t) and a function ϕ(t; k).

Proposition 2 Let F ∈ F_K, a kernel family of distributions as defined in (14). Assume that the kernel K(t, s) is entire in s for every t, satisfies the integrability condition (15), and that for every k = 1, 2, ... and s in a neighborhood of zero,

$$\int \left|\psi(t)\, \frac{\partial^k}{\partial s^k} K(t,s)\right| dt < \infty.$$

Define

$$\varphi_*(k) = \int \psi(t)\, \varphi(t;k)\, dt$$

with

$$\varphi(t;k) = \frac{\partial^k}{\partial s^k}\left[e^s K(t,s)\right]_{s=0} = \sum_{\ell=0}^{k}\binom{k}{\ell}\frac{\partial^\ell}{\partial s^\ell}K(t,s)\Big|_{s=0}.$$

Then

$$\mathbb{E}_F\left[\varphi_*(X)\right] = \mathbb{L}[F] \qquad \text{for all } F \in \mathcal{F}_K. \qquad (18)$$

Proof: The integrability condition (15) implies that the linear functionals defined via ψ(t) and ψ*(s) = ∫K(t,s)ψ(t)dt agree for all F ∈ F_K. Applying Proposition 1 to the latter and interchanging derivatives and integrals yields

$$\varphi_*(k) = \frac{d^k}{ds^k}\left[e^s \int K(t,s)\, \psi(t)\, dt\right]_{s=0} = \int \left\{\sum_{\ell=0}^{k}\binom{k}{\ell}\frac{\partial^\ell}{\partial s^\ell}K(t,s)\Big|_{s=0}\right\} \psi(t)\, dt = \int \psi(t)\, \varphi(t;k)\, dt.$$

4.3 Reproducing kernel Hilbert spaces with Gaussian kernel

In this section, consider a rather well-known kernel family: a reproducing kernel Hilbert space with Gaussian kernel. Let

$$K_\sigma(t,s) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-(t-s)^2/2\sigma^2\right\}, \qquad (19)$$

where σ > 0.

Remark: The function K_σ(t, s) is a reproducing kernel. Moreover, it is analytic in both t and s. The function K(x) = (1/(√(2π)σ)) e^{−x²/2σ²} has the well-known Fourier transform

$$\hat K(\omega) = \exp\left\{-\sigma^2\omega^2/2\right\}.$$

Define then the kernel family

$$\mathcal{F}_\sigma = \left\{F \text{ with density } f :\ f(t) = \int K_\sigma(t,s)\, f(s)\, ds\right\}, \qquad (20)$$

where K_σ(t, s) is defined in (19). Since K_σ(t, s) is a convolution kernel, these functions may be easily characterized by their Fourier transform. We then obtain

$$\mathcal{F}_\sigma = \left\{F \text{ with density } f :\ \int |\hat f(\omega)|^2 \exp\left\{\sigma^2\omega^2\right\} d\omega < \infty\right\},$$

where f̂ denotes the Fourier transform of f (see Steinwart et al (2006)).

These functions thus have rapidly (exponentially) decreasing Fourier transforms. This family contains very smooth functions, often called analytic functions in statistics (with an analytic extension); see Ibragimov and Khas'minskii (1982).

The RKHS with Gaussian kernel is now also rather well studied in learning theory (see Steinwart et al (2006) and Loustau (2008)).

Theorem 3 Define

$$\varphi(t;k) = \sum_{\ell=0}^{k}\binom{k}{\ell}\frac{\partial^\ell}{\partial s^\ell}K_\sigma(t,s)\Big|_{s=0}$$

and the estimator

$$\hat\theta_n = \int \psi(t)\left\{\frac{1}{n}\sum_{j=1}^{n}\varphi(t;X_j)\right\} dt.$$

Let F ∈ F_σ, defined in (20), and let ψ be a bounded functional. Then, under Assumption (7) of Theorem 1, we have

$$n^{1-1/p}\left(\hat\theta_n - \mathbb{L}[F]\right) \stackrel{P}{\longrightarrow} 0.$$

R∞ 0

∞ Z X IE|ϕ∗ (X)|p = k=0

0



ψ(t)ϕ(t, X)dt has finite moment of order p.

p ψ(t)ϕ(t, k)dt πk (F ).

Applying Proposition 2, and interchanging the sum and the integral gives k   p Z X ∞ X k ∂ ℓ πk (F )dt. IE|ϕ∗ (X)|p ≤ |ψ(t)|p K (t, s) σ ℓ ℓ ∂s s=0 ℓ=0 k=0

Then, the proof follows the same lines than the proof of Theorem 1, with the notation Ψ(t) replaced by ψ(t) and using that it is bounded. Remark : There is also no assumption (only bounded) on the linear functional ψ(s) in Theorem 3, when, in Theorem 1, the functional is very smooth. The goal in this section is to use the smoothness of the density in order to replace this non-smooth functional ψ by a smooth one ψ∗ . Remark : The assumptions in Theorem 3 are strong since f is very smooth and also fastly decreasing. One further possible work would then be to use a more specific form for ψ instead of just using the boundedness assumption in the proof of Theorem 1. If we compare Theorem 2 and 3 then we remark that we go from a logarithmic to a polynomial rate of convergence. In this setting, estimating the distribution function is 16

difficult. However, if we suppose that F is very smooth, i.e. in Fσ , then we obtain a reasonnable estimator.

An analog phenomena appears in the related problem of severely ill-posed problems or in supersmooth deconvolution. For these kinds of problems it is known (Fan (1992) and Efromovich (1997)) that rates of convergence are very slow, i.e. logarithmic. However, if the functions are supposed to be very smooth, i.e. analytic, then the rates become polynomial in n, (see Pensky and Vidacovic (1999) and Cavalier et al (2003)).

5

Appendix

When ϕm (X) has finite variance for all m, the usual bias-variance trade-off argument leads to choosing m = m(n) such that p



n |IE (ϕm (X)) − IL [F ]| . (21) √ The rate of convergence is then bounded by r(n) = n |IE (ϕm (X)) − IL [F ]| . If the Var(ϕm (X)) =

variance Var(ϕm (X)) = ∞ but IE [|ϕm (X)|p ] < ∞ for some 1 < p < 2, a similar argument

gives upper bounds for the in probability rate of convergence.

Lemma 2 Let {Xm,j } be a double array of random variables, where Xm,1 , Xm,2 , . . . are

i.i.d. Fm distributed random variables. Denote θm = IE [Xm,1 ]. Assume that for p ∈ (1, 2), q p IE [|Xm,1 − IE [Xm,1 ]|p ] ≡ V(m) < ∞,

is increasing with m while |IE[Xm,1 ] − θ| decreases to zero, for some θ. If m(n) solves

and

n1/p V(m(n)) = κ · n θm(n) − θ , r(n) =

then

1 , 2 θm(n) − θ

# n r(n) X Xm(n),j − θ ≥ 1 < 4κ2 . lim sup IP n n→∞ j=1 "

17

Proof : Define the truncations ( Xm,j if |Xm,j − θm | ≤ n1/p V(m) . Ym,j = θm otherwise When

1

r(m) =

, 2 |θm − θ| repeated use of the triangular inequality gives " # " # n n r(m) X r(m) X IP (Xm,j − θ) > 1 ≤ IP (Xm,j − θm ) + r(m) |θm − θ| > 1 n n j=1

j=1

# " n # " n X 1 1 1 1 X (Xm,j − Ym,j ) > (Ym,j − θm ) > + IP . ≤ IP 4r(m) 4r(m) n j=1 n j=1

Markov’s inequality bounds " n #  1 16r(m)2  1 X IP (Y − θ ) > ≤ IE (Ym,1 − θm )2 . m,j m 4r(m) n j=1 n

(22)

Using

    IE (Ym,1 − θm )2 = IE (Xm,1 − θm )2 II |Xm,1 − θm | ≤ n1/p V(m) 2−p 2−p ≤ n1/p V(m) IE [|Xm,1 − θm |p ] = n p V(m)2 ,

Equation (22) is bounded by # " n 1 1 16r(m)2 2−p 1 X (Ym,j − θm ) > ≤ n p V(m)2 ≤ 16V(m)2r(m)2 n2( p −1) . IP 4r(m) n j=1 n

Taking

leads to bound

Showing that

n1/p V(m(n)) = κ · n θm(n) − θ ,

" n # 1 X 1 IP (Ym,j − θm ) > ≤ 4κ2 . 4r(m) n j=1

" n #  1 1 X Xm(n),j − Ym(n),j > limsup IP =0 4r(m(n)) n j=1 n→0 18

completes the proof. For this, note that κ n1−1/p r(m(n)) = 2 V (m(n)) and write   n n X  X Xm(n),j − θm(n) 1/p . ≥n Xm(n),j − Ym(n),j ≤ Xm(n),j − θm (n) II V(m(n)) j=1

j=1

Apply Markov’s inequality to get # " n  1 r(m(n)) X Xm(n),j − θm(n) > ≤ IP 4 n j=1    Xm(n),j − θm(n) p Xm(n),j − θm(n) p II ≤ 2κIE V(m(n)) > n . V(m(n)) i h −θm(n) p X Since IE m(n),j ≡ 1, the expectation (23) converges to zero. V(m(n))

(23)

Proof of Theorem 2

Let Zm be a N(0, m12 ) distributed random variable. The bias of the estimator θˆn,m is bm (F ) = IE [F (x − Zm ) − F (x)] .

(24)

Expanding F (x − Zm ) in Taylor series around x, the bias is bounded by ′ 1 |bm (F )| ≤ f (x) 2 + o(m−2 ) (25) m Repeating the computation of the pth moment of |ϕm (X)| as in Theorem 1 results in the

bound

p

IE [|ϕm (X)| ] ≤ Cp By Lemma 2 and choosing

Z

B

exp

0



 2 2−p · (2s(1 + m)) 2−p e−s F (ds). 4

m(n) = (κp log n)

2−p 2

with

2 4p (2B) 2−p + 1, 2−p makes both the variance and the bias decrease to zero. Thus the rate is bounded by

κp =

r˜(n) = (κp log n)2−p . The result follows by choosing p = 1 + ε. Acknowledgements. The authors would like to thank the Associate Editor and the referee for comments which really improved this paper. 19

References

[1] Beyer, W.H. (1985) CRC Standard Mathematical Tables. 27th Edition. CRC Press, Boca Raton, Florida.
[2] Bickel, P.J. and Ritov, Y. (1995) Estimating linear functionals of a PET image. IEEE Journ. of Medical Imaging, 14, 81-87.
[3] Bickel, P.J. and Ritov, Y. (2003) Nonparametric estimators which can be "plugged-in". Annals of Statistics, 31, 1033-1053.
[4] Cavalier, L. (2008) Nonparametric estimation in inverse problems. Inverse Problems, 24, 1-19.
[5] Cavalier, L., Golubev, G.K., Lepski, O. and Tsybakov, A.B. (2003) Block thresholding and sharp adaptive estimation in severely ill-posed inverse problems. Theory Probab. and Appl., 48, 534-556 (SIAM version, 48, (2004), 426-446).
[6] Cavalier, L., Golubev, G.K., Picard, D. and Tsybakov, A.B. (2002) Oracle inequalities for inverse problems. Annals of Statist., 30, 843-874.
[7] Cavalier, L. and Tsybakov, A.B. (2002) Sharp adaptation for inverse problems with random noise. Probability Theory and Related Fields, 123, 323-354.
[8] Chow, Y.S. and Teicher, H. (1988) Probability Theory: Independence, Interchangeability, Martingales. 2nd Edition. Springer Verlag, New York.
[9] Chung, K.L. (1974) A Course in Probability Theory. 2nd Edition. Academic Press, New York.
[10] Efromovich, S. (1997) Density estimation for the case of supersmooth measurement error. Journ. American Stat. Association, 92, 526-535.
[11] Fan, J. (1991) On the optimal rates of convergence for nonparametric deconvolution problems. Annals of Statistics, 19, 1257-1272.
[12] Fan, J. (1992) Deconvolution with supersmooth distributions. Canadian Journal of Statistics, 20, 155-169.

[13] Giné, E. and Nickl, R. (2008) Uniform central limit theorems for kernel density estimators. Probability Theory and Related Fields, 141, 333-387.
[14] Hengartner, N.W. (1997) Adaptive demixing in Poisson mixture models. Annals of Statistics, 25, 917-928.
[15] Hengartner, N.W., Talbot, L., Shepard, I. and Bickel, P.J. (1994) Estimating the probability density of the scattering cross section from Rayleigh scattering experiments. Jour. of Optical Soc. Amer. A, 12, 1316-1323.
[16] Holzmann, H., Munk, A. and Stratmann, B. (2004) Identifiability of finite mixtures, with applications to circular distributions. Sankhya, 66, 440-449.
[17] Ibragimov, I.A. and Khas'minskii, R.Z. (1982) Estimation of distribution density belonging to a class of entire functions. Theory of Probability and its Applications, 27, 551-562.
[18] Loh, W-L. and Zhang, C-H. (1996) Global properties of kernel estimators for mixing densities in discrete exponential family models. Statistica Sinica, 6, 561-578.
[19] Loustau, S. (2008) Performances statistiques de méthodes à noyaux. PhD thesis, Université Aix-Marseille 1.
[20] Neumann, M.H. (1997) On the effect of estimating the error density in nonparametric deconvolution. J. Nonparametric Stat., 7, 307-330.
[21] Norris, J.L. and Pollack, K.H. (2001) Nonparametric MLE incorporation of heterogeneity and model testing into premarked cohort studies. Environ. Ecol. Stat., 27, 21-32.
[22] Pensky, M. and Vidakovic, B. (1999) Adaptive wavelet estimator for nonparametric density deconvolution. Annals of Statistics, 27, 2033-2053.
[23] Pilla, R.S. and Lindsay, B.G. (2001) Alternative EM methods for nonparametric finite mixture models. Biometrika, 88, 535-550.
[24] Rainville, D. (1967) Special Functions. 4th Edition. Macmillan Co., New York.


[25] van Rooij, A.C.M. and Ruymgaart, F.H. (1996) Asymptotic minimax rates for abstract linear estimators. Journ. Statist. Plann. Inference, 53, 389-402.
[26] Scholkopf, B. and Smola, A.J. (2001) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, USA.
[27] Steinwart, I., Hush, D. and Scovel, C. (2006) An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Trans. Information Theory, 52, 4635-4643.
[28] Tucker, H.G. (1963) An estimate of the compounding distribution of a compound Poisson distribution. Theor. Prob. Appli., 8, 195-200.
[29] Van der Vaart, A. (1991) On differentiable functionals. Annals of Statistics, 19, 178-204.
[30] Walter, G. (1985) Orthogonal polynomial estimators of the prior distribution of a compound Poisson distribution. Sankhya, Ser. A, 47, 222-230.
[31] Wang, J.-P. (2007) A linearization procedure and a VDM/ECM algorithm for penalized and constrained nonparametric maximum likelihood estimation for mixture models. Comput. Statistics Data Analysis, 51, 2946-2957.
[32] Wang, J.-P. and Lindsay, B.G. (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. Journ. Ameri. Statist. Assoc., 100, 942-959.
[33] Williamson, R.C., Smola, A.J. and Scholkopf, B. (2001) Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Trans. Information Theory, 47, 2516-2532.
[34] Zhang, C-H. (1995) On estimating mixing densities in exponential family models for discrete variables. Annals of Statistics, 23, 929-945.

Address:

Laurent Cavalier
Université Aix-Marseille 1
CMI, 39 rue Joliot-Curie
13453 Marseille cedex 13
France

Nicolas Hengartner
Department of Statistics
Los Alamos National Laboratory
USA
