Some Refinements of Large Deviation Tail Probabilities

László Györfi (Budapest University of Technology and Economics, Budapest, Hungary), Peter Harremoës (Copenhagen Business College, Copenhagen, Denmark; corresponding author), Gábor Tusnády (Rényi Institute of Mathematics, Budapest, Hungary)

arXiv:1205.1005v1 [math.ST], 4 May 2012. Preprint submitted to Statistics and Probability Letters, May 7, 2012.

Abstract

We study tail probabilities via some Gaussian approximations. Our results make refinements to large deviation theory. The proof builds on classical results by Bahadur and Rao. Binomial distributions and their tail probabilities are discussed in more detail.

Keywords: Binomial distribution, Gaussian distribution, large deviations, tail probability.

2000 MSC: primary 60F10, secondary 60E15

1. Introduction

Let $X_1, \dots, X_n$ be i.i.d. random variables such that the moment generating function $E[\exp(\beta X_1)]$ is finite in a neighborhood of the origin. For fixed $\mu > E[X_1]$, the aim of this paper is to approximate the tail probability
\[
P_{n,\mu} := P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}.
\]
If $\mu$ is close to the mean of $X_1$ one would usually approximate $P_{n,\mu}$ by a tail probability of a Gaussian random variable. If $\mu$ is far from the mean of $X_1$ the tail probability can be estimated using large deviation theory. According to the Sanov theorem the probability that the deviation from the mean is as large as $\mu$ is of the order $\exp(-nD)$, where $D$ is a constant. Bahadur and Rao [2] improved the estimate of this large deviation probability, and the goal of this paper is to extend the Gaussian tail approximations to situations where one normally uses large deviation techniques.

Let $\phi$ and $\Phi$ be the density function and the distribution function of the standard Gaussian, respectively. Let $P_0$ denote a probability measure describing the distribution of a random variable $X$. Consider the 1-dimensional exponential family $(P_\beta)$ based on $P_0$ and given by
\[
\frac{dP_\beta}{dP_0}(x) = \frac{\exp(\beta x)}{Z(\beta)}
\]


where the denominator is the moment generating function (partition function) given by
\[
Z(\beta) = \int \exp(\beta x)\, dP_0(x) = E\big[e^{\beta X}\big].
\]
The mean value of $P_\beta$ is
\[
\frac{Z'(\beta)}{Z(\beta)} \tag{1}
\]
and the range of this function will be denoted $M$ and will be called the mean value range of the exponential family. For $\mu$ in the interior of $M$ the maximum likelihood estimate $\hat\beta(\mu)$ equals the $\beta$ such that the mean value of $P_\beta$ equals $\mu$, which in this case is the average of the i.i.d. samples. Put $P^\mu = P_{\hat\beta(\mu)}$. An equivalent definition of $\hat\beta(\mu)$ is as the solution of the equation
\[
\frac{Z\big(\hat\beta(\mu)\big)}{e^{\hat\beta(\mu)\mu}}
= \frac{E\big[e^{\hat\beta(\mu)X}\big]}{e^{\hat\beta(\mu)\mu}}
= \min_{\beta>0} \frac{E\big[e^{\beta X}\big]}{e^{\beta\mu}}
= \min_{\beta>0} \frac{Z(\beta)}{e^{\beta\mu}}.
\]

Let $V(\mu)$ denote the variance of $P^\mu$. Information divergence is given by
\[
D(P^\mu \| P_0) = \int \ln\left( \frac{dP^\mu}{dP_0}(x) \right) dP^\mu(x).
\]
We see that
\[
D(P^\mu \| P_0) = -\ln \frac{E\big[e^{\hat\beta(\mu)X}\big]}{e^{\hat\beta(\mu)\mu}} = \hat\beta(\mu)\,\mu - \ln Z\big(\hat\beta(\mu)\big). \tag{2}
\]
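To make these quantities concrete, here is a small numerical sketch (not from the paper) that computes $\hat\beta(\mu)$, $D(\mu)$ and $V(\mu)$ from the minimization characterization above and from (2). It assumes, purely for illustration, that $P_0$ is the standard exponential distribution, so that $Z(\beta) = 1/(1-\beta)$ for $\beta < 1$; the closed forms quoted in the comments follow from this choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative base measure P0: the standard exponential distribution,
# for which Z(beta) = 1/(1 - beta) for beta < 1 and mu_0 = E[X] = 1.
def log_Z(beta):
    return -np.log1p(-beta)                  # ln Z(beta) = -ln(1 - beta)

def beta_hat(mu):
    # beta_hat(mu) minimizes ln Z(beta) - beta*mu,
    # i.e. it solves the minimization characterization above.
    res = minimize_scalar(lambda b: log_Z(b) - b * mu,
                          bounds=(1e-9, 1 - 1e-9), method="bounded")
    return res.x

def divergence(mu):
    # D(mu) = beta_hat(mu)*mu - ln Z(beta_hat(mu)), which is equation (2).
    b = beta_hat(mu)
    return b * mu - log_Z(b)

def variance(mu, h=1e-5):
    # V(mu) is the variance of P^mu, i.e. (ln Z)''(beta_hat(mu));
    # here it is obtained by a numerical second derivative.
    b = beta_hat(mu)
    return (log_Z(b + h) - 2 * log_Z(b) + log_Z(b - h)) / h ** 2

mu = 2.0                                     # a mean value above mu_0 = 1
print(beta_hat(mu))                          # closed form: 1 - 1/mu = 0.5
print(divergence(mu))                        # closed form: mu - 1 - ln(mu) ~ 0.3069
print(variance(mu))                          # closed form: mu^2 = 4
```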

2. Approximation of tail distributions for non-lattice valued variables

Introduce the notation $\mu^* := \sup\{\mu > \mu_0 : D(P^\mu \| P_0) < \infty\} = \sup M$, where $\mu_0$ denotes the mean of $P_0$. Bahadur and Rao [2] proved a refined version of the large deviation bound, but some aspects of their result date back to Cramér [4], and part of it was proved by a different method by Blackwell and Hodges [3]. For $\mu^* > \mu > \mu_0$, the Sanov theorem implies that
\[
-\frac{\ln P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}}{n} \to D(P^\mu \| P_0) \quad \text{for } n \to \infty.
\]
Bahadur and Rao [2] verified the following improvement of the Sanov theorem for non-lattice random variables:
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \frac{\exp\big(-nD(P^\mu \| P_0)\big)}{(2\pi n V(\mu))^{1/2}\,\hat\beta(\mu)}
\left( 1 + O\left(\frac{1}{\sqrt{n}}\right) \right) \quad \text{for } n \to \infty. \tag{3}
\]
We will write $D(\mu)$ as short for $D(P^\mu \| P_0)$.
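As a numerical illustration of (3) (not part of the paper), one can compare the right-hand side with the exact tail probability when $X_1$ is standard exponential: the sum of $n$ such variables is Gamma-distributed, so the exact tail is available in closed form. The values $\hat\beta(\mu) = 1 - 1/\mu$, $D(\mu) = \mu - 1 - \ln\mu$ and $V(\mu) = \mu^2$ are those of the exponential example from the previous sketch.

```python
import numpy as np
from scipy.stats import gamma

mu = 2.0                                   # target mean value, mu > mu_0 = 1
beta_hat = 1.0 - 1.0 / mu                  # tilt parameter for the Exp(1) example
D = mu - 1.0 - np.log(mu)                  # D(P^mu || P0)
V = mu ** 2                                # variance of the tilted law P^mu

for n in (10, 50, 200, 1000):
    # Exact tail: the sum of n i.i.d. Exp(1) variables is Gamma(n, 1).
    exact = gamma.sf(n * mu, a=n)
    sanov = np.exp(-n * D)                 # crude large deviation estimate
    bahadur_rao = np.exp(-n * D) / (np.sqrt(2 * np.pi * n * V) * beta_hat)  # eq. (3)
    print(n, exact, sanov, bahadur_rao, bahadur_rao / exact)
```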

Theorem 1. For $\mu^* > \mu > \mu_0$, one has that
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \Phi\left( -n^{1/2} \left( 2D\left(\mu - \frac{c_\mu}{n}\right) \right)^{1/2} \right)
\left( 1 + O\left(\frac{1}{\sqrt{n}}\right) \right) \quad \text{for } n \to \infty, \tag{4}
\]
where
\[
c_\mu = \frac{\ln \frac{(2D(\mu))^{1/2}}{V(\mu)^{1/2}\,\hat\beta(\mu)}}{\hat\beta(\mu)}. \tag{5}
\]
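Continuing the same illustrative exponential example, the sketch below (not part of the paper) evaluates the Gaussian-style approximation (4) with the constant $c_\mu$ from (5) and compares it with the exact Gamma tail; the ratio should approach 1 as $n$ grows.

```python
import numpy as np
from scipy.stats import gamma, norm

mu = 2.0
beta_hat = 1.0 - 1.0 / mu

def D(x):
    # Information divergence D(P^x || P0) for the Exp(1) example.
    return x - 1.0 - np.log(x)

V = mu ** 2
c_mu = np.log(np.sqrt(2 * D(mu)) / (np.sqrt(V) * beta_hat)) / beta_hat   # eq. (5)

for n in (10, 50, 200, 1000):
    exact = gamma.sf(n * mu, a=n)                              # exact Gamma tail
    refined = norm.cdf(-np.sqrt(2 * n * D(mu - c_mu / n)))     # right-hand side of (4)
    print(n, exact, refined, refined / exact)
```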

Proof. The $c_\mu$ defined by (5) satisfies the equation
\[
\left( \frac{2D(\mu)}{V(\mu)} \right)^{1/2} \frac{1}{\hat\beta(\mu)\, e^{c_\mu \hat\beta(\mu)}} = 1. \tag{6}
\]
The tail probabilities of the standard Gaussian satisfy
\[
\frac{\phi(z)}{z}\left( 1 - \frac{1}{z^2} \right) \le \Phi(-z) \le \frac{\phi(z)}{z} \quad \text{for } z > 0
\]
(cf. Feller [5, p. 179]), which implies that
\[
\Phi\left( -n^{1/2} \left( 2D\left(\mu - \frac{c_\mu}{n}\right) \right)^{1/2} \right)
= \frac{\exp\left( -nD\left(\mu - \frac{c_\mu}{n}\right) \right)}{(2\pi n)^{1/2} \left( 2D\left(\mu - \frac{c_\mu}{n}\right) \right)^{1/2}}
\left( 1 + O\left(\frac{1}{n}\right) \right),
\]
and so
\[
\frac{\exp\left( -nD\left(\mu - \frac{c_\mu}{n}\right) \right)}{(2\pi n)^{1/2} (2D(\mu))^{1/2}}
= \Phi\left( -n^{1/2} \left( 2D\left(\mu - \frac{c_\mu}{n}\right) \right)^{1/2} \right)
\left( 1 + O\left(\frac{1}{n}\right) \right). \tag{7}
\]

Because of (1) and (2), the derivative can be calculated as
\[
\frac{d}{d\mu} D(\mu) = \hat\beta(\mu),
\]
leading to the following Taylor expansion
\[
D\left(\mu - \frac{c_\mu}{n}\right) = D(\mu) - \hat\beta(\mu)\cdot\frac{c_\mu}{n} + O\left(\frac{1}{n^2}\right).
\]
Thus,
\begin{align*}
\frac{\exp\left( -nD\left(\mu - \frac{c_\mu}{n}\right) \right)}{(2\pi n)^{1/2} (2D(\mu))^{1/2}}
&= \frac{\exp\left( -n\left( D(\mu) - \hat\beta(\mu)\cdot\frac{c_\mu}{n} + O\left(\frac{1}{n^2}\right) \right) \right)}{(2\pi n)^{1/2} (2D(\mu))^{1/2}} \\
&= \frac{\exp\left( -nD(\mu) + \hat\beta(\mu)\, c_\mu + O\left(\frac{1}{n}\right) \right)}{(2\pi n)^{1/2} (2D(\mu))^{1/2}} \\
&= \frac{\exp(-nD(\mu))\, e^{c_\mu \hat\beta(\mu)}}{(2\pi n)^{1/2} (2D(\mu))^{1/2}} \left( 1 + O\left(\frac{1}{\sqrt{n}}\right) \right). \tag{8}
\end{align*}
According to (3) we also have
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \frac{\exp(-nD(\mu))}{(2\pi n V(\mu))^{1/2}\,\hat\beta(\mu)}
\left( 1 + O\left(\frac{1}{\sqrt{n}}\right) \right) \quad \text{for } n \to \infty, \tag{9}
\]
and therefore applying (6), (7), (8) and (9) the proof of Theorem 1 is complete. □

Remark 1. If in the approximation $c_\mu$ is replaced by any other constant $c$, then the ratio of the two approximations tends to a number which is not equal to 1:
\begin{align*}
\frac{\exp\left( -nD\left(\mu - \frac{c_\mu}{n}\right) \right)}{\exp\left( -nD\left(\mu - \frac{c}{n}\right) \right)}
&= \exp\left( -nD\left(\mu - \frac{c_\mu}{n}\right) + nD\left(\mu - \frac{c}{n}\right) \right) \\
&= \exp\left( \hat\beta(\mu)\cdot(c_\mu - c) + O\left(\frac{1}{n}\right) \right) \\
&\approx \exp\big( \hat\beta(\mu)\cdot(c_\mu - c) \big) \ne 1.
\end{align*}

Remark 2. If $X_1$ has a density with respect to the Lebesgue measure then Bahadur and Rao [2] proved the stronger result that
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \frac{\exp\big(-nD(P^\mu \| P_0)\big)}{(2\pi n V(\mu))^{1/2}\,\hat\beta(\mu)}
\left( 1 + O\left(\frac{1}{n}\right) \right).
\]
Using this result we get the following theorem: If $X_1$ has a density with respect to the Lebesgue measure then
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \Phi\left( -n^{1/2} \left( 2D\left(\mu - \frac{c_\mu}{n}\right) \right)^{1/2} \right)
\left( 1 + O\left(\frac{1}{n}\right) \right) \quad \text{for } n \to \infty,
\]
for any $\mu^* > \mu > \mu_0$.

3. Results for lattice valued variables

Now assume that $X_1, X_2, \dots$ is a sequence of i.i.d. random variables with values in a lattice of the type $\{kd + \delta \mid k \in \mathbb{Z}\}$. For such a sequence Bahadur and Rao [2] proved that
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \frac{\exp\big(-nD(P^\mu \| P_0)\big)}{(2\pi n V(\mu))^{1/2}\, \frac{1-\exp(-d\hat\beta(\mu))}{d}}
\left( 1 + O\left(\frac{1}{n}\right) \right) \tag{10}
\]
for any $n$ such that $P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i = \mu \right\} > 0$. We note that the result (3) for non-lattice variables can be considered as a limiting version of (10) for small $d > 0$ because
\[
\frac{1 - \exp(-d\beta)}{d} \to \beta \quad \text{for } d \to 0.
\]

Theorem 2. Assume that $X_1$ has values in the lattice $\{kd + \delta \mid k \in \mathbb{Z}\}$ and that $\mu^* > \mu > \mu_0$. Then for any $n$ such that $P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i = \mu \right\} > 0$ one has
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \Phi\left( -n^{1/2} \left( 2D\left(\mu - \frac{c_\mu}{n}\right) \right)^{1/2} \right)
\left( 1 + O\left(\frac{1}{n}\right) \right) \quad \text{for } n \to \infty,
\]
where
\[
c_\mu = \frac{\ln \frac{(2D(\mu))^{1/2}}{V(\mu)^{1/2}\, \frac{1-\exp(-d\hat\beta(\mu))}{d}}}{\hat\beta(\mu)}.
\]

Proof. If $X_1$ is lattice valued then the proof of Theorem 1 can be modified by replacing $\hat\beta(\mu)$ by $\frac{1-\exp(-d\hat\beta(\mu))}{d}$ at the appropriate places throughout the proof. The Taylor expansion step needs no modification. □
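A small illustrative check of Theorem 2 (not from the paper): take $P_0$ to be the Poisson distribution with mean $\lambda$, a lattice distribution with $d = 1$ and $\delta = 0$. In this case $\hat\beta(\mu) = \ln(\mu/\lambda)$, $D(\mu) = \mu\ln(\mu/\lambda) - (\mu - \lambda)$ and $V(\mu) = \mu$, and the exact tail of the sum is again a Poisson tail.

```python
import numpy as np
from scipy.stats import poisson, norm

lam, mu = 1.0, 2.0                          # base rate and target mean, mu > mu_0 = lam

def D(x):
    # D(P^x || P0) when P0 is Poisson(lam).
    return x * np.log(x / lam) - (x - lam)

beta_hat = np.log(mu / lam)                 # tilt parameter beta_hat(mu)
V = mu                                      # variance of the tilted law Poisson(mu)
lattice = 1.0 - np.exp(-beta_hat)           # (1 - exp(-d*beta_hat))/d with d = 1

# c_mu from Theorem 2 (the lattice version of (5)).
c_mu = np.log(np.sqrt(2 * D(mu)) / (np.sqrt(V) * lattice)) / beta_hat

for n in (10, 50, 200):
    # n*mu is an integer here, so P{sample mean = mu} > 0 as Theorem 2 requires.
    exact = poisson.sf(n * mu - 1, n * lam)            # P{sum X_i >= n*mu}
    approx = norm.cdf(-np.sqrt(2 * n * D(mu - c_mu / n)))
    print(n, exact, approx, approx / exact)
```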

We now turn to the special case where $X_1, \dots, X_n$ are i.i.d. Bernoulli random variables with
\[
X_i = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1-p. \end{cases}
\]
In this case $d = 1$, and $\sum_{i=1}^{n} X_i$ is a binomial $(n,p)$ random variable. For various refinements of (10), see Bahadur [1], Littlewood [8] and McKay [9].

Corollary 1. Put $\mu_n := \lceil n\mu \rceil / n$. Then for $1 > \mu > p$ one has that
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \Phi\left( -n^{1/2} \left( 2D\left(\mu_n - \frac{c_{\mu_n}}{n}\right) \right)^{1/2} \right)
\left( 1 + O\left(\frac{1}{n}\right) \right) \quad \text{for } n \to \infty,
\]
where
\[
D(\mu) = D(\mu \| p) = \mu \ln \frac{\mu}{p} + (1-\mu) \ln \frac{1-\mu}{1-p}
\]
and
\[
c_\mu = \frac{1}{2} + \frac{\ln \frac{2D(\mu \| p)\, p(1-p)}{(\mu-p)^2}}{2 \ln \frac{\mu(1-p)}{p(1-\mu)}}.
\]

Proof. Because of the definition of $\mu_n$,
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\} = P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu_n \right\},
\]
and the condition $P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i = \mu_n \right\} > 0$ is satisfied, so Theorem 2 implies that
\[
P\left\{ \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu \right\}
= \Phi\left( -n^{1/2} \left( 2D\left(\mu_n - \frac{c_{\mu_n}}{n}\right) \right)^{1/2} \right)
\left( 1 + O\left(\frac{1}{n}\right) \right) \quad \text{for } n \to \infty.
\]
We have to evaluate $c_\mu$. The distribution $P_\beta$ has
\[
P_\beta(X_i = 1) = \frac{p e^\beta}{1 - p + p e^\beta},
\]
which is also the mean of $P_\beta$. The equation
\[
\mu = \frac{p e^\beta}{1 - p + p e^\beta}
\]
is equivalent to
\[
e^\beta = \frac{\mu(1-p)}{p(1-\mu)},
\]
implying that
\[
\frac{1 - e^{-d\beta}}{d} = 1 - e^{-\beta} = 1 - \frac{p(1-\mu)}{\mu(1-p)} = \frac{\mu - p}{\mu(1-p)}.
\]
The variance function is $V(\mu) = \mu(1-\mu)$. Thus, we have
\begin{align*}
c_\mu &= \frac{\ln\left( \left( \frac{2D(\mu \| p)}{V(\mu)} \right)^{1/2} \frac{1}{1 - e^{-\hat\beta(\mu)}} \right)}{\hat\beta(\mu)} \\
&= \frac{\ln\left( \left( \frac{2D(\mu \| p)}{\mu(1-\mu)} \right)^{1/2} \frac{\mu(1-p)}{\mu - p} \right)}{\ln \frac{\mu(1-p)}{p(1-\mu)}} \\
&= \frac{1}{2} + \frac{\ln \frac{2D(\mu \| p)\, p(1-p)}{(\mu-p)^2}}{2 \ln \frac{\mu(1-p)}{p(1-\mu)}}. \qquad \square
\end{align*}
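The following sketch (not part of the paper) checks Corollary 1 numerically: it evaluates the right-hand side with $\mu_n = \lceil n\mu\rceil/n$ and compares it with the exact binomial tail. The values of $p$ and $\mu$ are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import binom, norm

def D(mu, p):
    # D(mu || p) = mu ln(mu/p) + (1 - mu) ln((1 - mu)/(1 - p))
    return mu * np.log(mu / p) + (1 - mu) * np.log((1 - mu) / (1 - p))

def c(mu, p):
    # c_mu as given in Corollary 1.
    return 0.5 + np.log(2 * D(mu, p) * p * (1 - p) / (mu - p) ** 2) / \
        (2 * np.log(mu * (1 - p) / (p * (1 - mu))))

p, mu = 0.5, 0.7                                  # illustrative choices
for n in (20, 100, 500):
    k = np.ceil(n * mu)                           # smallest count with sample mean >= mu
    mu_n = k / n                                  # mu_n = ceil(n*mu)/n as in Corollary 1
    exact = binom.sf(k - 1, n, p)                 # P{sum X_i >= n*mu}
    approx = norm.cdf(-np.sqrt(2 * n * D(mu_n - c(mu_n, p) / n, p)))
    print(n, exact, approx, approx / exact)
```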

Remark 3. For $p = 1/2$ one has $0.5 < c_\mu < 0.534$, and Table 1 shows some numerical values; these are well approximated by $c_\mu \approx 0.5 + (\mu - 0.5)/12$.

  mu     0.6    0.65   0.7    0.75   0.8    0.85   0.9
  c_mu   0.508  0.512  0.516  0.520  0.524  0.528  0.532

Table 1: Numerical values of c_mu for p = 1/2.
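As a quick check of Table 1 (not part of the paper), the formula for $c_\mu$ from Corollary 1 can be evaluated at $p = 1/2$ and compared with the linear rule $c_\mu \approx 0.5 + (\mu - 0.5)/12$:

```python
import numpy as np

def c(mu, p=0.5):
    # c_mu from Corollary 1.
    D = mu * np.log(mu / p) + (1 - mu) * np.log((1 - mu) / (1 - p))
    return 0.5 + np.log(2 * D * p * (1 - p) / (mu - p) ** 2) / \
        (2 * np.log(mu * (1 - p) / (p * (1 - mu))))

for mu in (0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9):
    print(mu, round(c(mu), 3), round(0.5 + (mu - 0.5) / 12, 3))
```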

4. Discussion

As discussed by Reiczigel, Rejtő and Tusnády [10] and by Harremoës and Tusnády [6], there are strong indications that these asymptotic results can be strengthened to sharp inequalities. Such sharp inequalities would imply the present asymptotic results as corollaries. We hope that the asymptotics presented here can help in proving the conjectured sharp inequalities. Related sharp inequalities have been discussed by Leon and Perron [7] and Talagrand [11]. Numerical experiments have also shown that our tail estimates are useful even for small values of n.

References

[1] Bahadur, R. R.: 1960, Some approximations to the binomial distribution function. Annals of Mathematical Statistics, 31, 43–54.
[2] Bahadur, R. R. and Rao, R. R.: 1960, On deviations of the sample mean. Annals of Mathematical Statistics, 31, 1015–1027.
[3] Blackwell, D. and Hodges, J. L.: 1959, The probability in the extreme tail of a convolution. Annals of Mathematical Statistics, 30, 1113–1120.
[4] Cramér, H.: 1938, Sur un nouveau théorème-limite de la théorie des probabilités. Actualités Scientifiques et Industrielles, No. 736, Hermann & Cie, Paris.
[5] Feller, W.: 1957, An Introduction to Probability Theory and its Applications, Vol. I. Wiley, New York.
[6] Harremoës, P. and Tusnády, G.: 2012, Information divergence is more χ²-distributed than the χ² statistic. 2012 IEEE International Symposium on Information Theory (ISIT 2012), Cambridge, Massachusetts, USA. Accepted. URL: http://arxiv.org/abs/1202.1125
[7] Leon, C. A. and Perron, F.: 2003, Extremal properties of sums of Bernoulli random variables. Statistics and Probability Letters, 62(4), 345–354.
[8] Littlewood, J. E.: 1969, On the probability in the tail of a binomial distribution. Advances in Applied Probability, 1, 43–72.
[9] McKay, B. D.: 1989, On Littlewood's estimate for the binomial distribution. Advances in Applied Probability, 21, 475–478.
[10] Reiczigel, J., Rejtő, L. and Tusnády, G.: 2011, A sharpening of Tusnády's inequality. arXiv:1110.3627v2.
[11] Talagrand, M.: 1995, The missing factor in Hoeffding's inequality. Ann. Inst. Henri Poincaré, 31, 689–702.
