Journal of Theoretical Probability, Vol. 16, No. 1, January 2003 (© 2003)
An Inequality for Tail Probabilities of Martingales with Differences Bounded from One Side

V. Bentkus¹

Received May 15, 2001; revised December 6, 2001

Let $M_n = X_1 + \cdots + X_n$ be a martingale with differences $X_k = M_k - M_{k-1}$ bounded from above such that $P\{X_k \le e_k\} = 1$ with some non-random positive $e_k$. Let the conditional variance $\tau_k^2 = E(X_k^2 \mid X_1, \ldots, X_{k-1})$ satisfy $\tau_k^2 \le \sigma_k^2$ with probability one, where the $\sigma_k^2$ are some non-random numbers. Write $s_k^2 = \max\{e_k^2, \sigma_k^2\}$ and $s^2 = s_1^2 + \cdots + s_n^2$. We prove the inequality
$$P\{M_n \ge x\} \le \min\{\exp\{-x^2/(2s^2)\},\; c_0(1 - \Phi(x/s))\}$$
with a constant $c_0 = 1/(1 - \Phi(\sqrt{3})) \le 25$.

KEY WORDS: Probabilities of large deviations; martingale; bounds for tail probabilities; inequalities; bounded differences and random variables; measure concentration phenomena; product spaces; Lipschitz functions; Hoeffding's inequalities; Azuma's inequality.
1. INTRODUCTION AND RESULTS

Our attention was drawn to this topic by a seminal paper of Hoeffding,(4) where the classical inequalities for probabilities of large deviations for sums of bounded independent random variables (as well as for martingales) were obtained. For almost 40 years these inequalities were not improved, with the exception of the papers of Pinelis,(9, 10) Talagrand,(13) and Bentkus.(1, 2) In this paper we extend and improve Hoeffding's Theorem 2. The result is new already for sums of independent random variables. Most probably (up to an absolute constant) our result is the best possible that can be achieved using normal-like tails as an upper bound.
¹ Vilnius Institute of Mathematics and Informatics, Akademijos 4, 232600 Vilnius, Lithuania. E-mail: [email protected]
Let $\mathcal{F}_0 = \{\emptyset, \Omega\} \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_n \subset \mathcal{F}$ be a family of $\sigma$-algebras of a measurable space $(\Omega, \mathcal{F})$. Let $M_n = X_1 + \cdots + X_n$ be a martingale (we define $M_0 = 0$) with differences $X_j = M_j - M_{j-1}$ bounded from above by some non-random $e_j \ge 0$ such that
$$P\{X_j \le e_j\} = 1, \quad \text{for } j = 1, \ldots, n. \tag{1.1}$$
Assume that the conditional variance $\tau_j^2 = E(X_j^2 \mid \mathcal{F}_{j-1})$ is bounded from above, that is, that $P\{\tau_j^2 \le \sigma_j^2\} = 1$ for some non-random $\sigma_j^2$. Write
$$s_j^2 = \max\{\sigma_j^2, e_j^2\}, \qquad s^2 = s_1^2 + \cdots + s_n^2.$$
Let $I(x) = 1 - \Phi(x) = \int_x^\infty \varphi(t)\,dt$ be the survival function of the standard normal distribution with density $\varphi(t) = (2\pi)^{-1/2}\exp\{-t^2/2\}$. Introduce $D(x) = 1$ for $x \le 0$, and
$$D(x) = \min\{\exp\{-x^2/2\},\; cI(x)\}, \quad \text{for } x \ge 0,$$
where $c$ is an absolute constant. Our result is the following upper bound for the tail probabilities of $M_n$.

Theorem 1.1. Let $2 \le c \le c_0$ with $c_0 = 1/I(\sqrt{3})$. Then, for $x \in \mathbb{R}$, we have
$$P\{M_n \ge x\} \le D(x/s). \tag{1.2}$$
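Remark (numerical illustration, not part of the original argument). The right-hand side of (1.2) is elementary to evaluate. The following Python sketch (ours; the function names are illustrative) computes $D(x/s)$ from the parameters $e_k$ and $\sigma_k$, together with the constant $c_0 = 1/I(\sqrt{3}) \approx 23.9 \le 25$.

```python
# Evaluating the bound of Theorem 1.1 -- an illustrative sketch.
import math

def survival(x):
    """I(x) = 1 - Phi(x), the standard normal survival function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

C0 = 1.0 / survival(math.sqrt(3.0))  # c_0 = 1/I(sqrt(3)) ~ 23.9 <= 25

def tail_bound(x, e, sigma, c=C0):
    """Right-hand side of (1.2): D(x/s), s^2 = sum_k max{e_k^2, sigma_k^2}."""
    s = math.sqrt(sum(max(ek * ek, sk * sk) for ek, sk in zip(e, sigma)))
    t = x / s
    if t <= 0:
        return 1.0
    return min(math.exp(-t * t / 2.0), c * survival(t))

# Example: n = 100 differences with e_k = 1 and sigma_k = 0.5, so s = 10.
print(tail_bound(30.0, [1.0] * 100, [0.5] * 100))
```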
Theorem 1.1 complements the following inequalities (1.5). For the differences $X_j$ consider the condition: $X_j$ have non-random sizes $2e_j$ such that
$$P\{a_j - e_j \le X_j \le a_j + e_j\} = 1, \quad \text{for } j = 1, \ldots, n, \tag{1.3}$$
where the $a_j$ are some random variables measurable with respect to the $\sigma$-algebra $\mathcal{F}_{j-1}$. Since $M_j = X_1 + \cdots + X_j$, the condition (1.3) is equivalent to
$$P\{b_j - e_j \le M_j \le b_j + e_j\} = 1, \quad \text{for } j = 1, \ldots, n, \tag{1.4}$$
where the $b_j$ are random variables measurable with respect to the $\sigma$-algebra $\mathcal{F}_{j-1}$. Assume that one of the conditions (1.3) or (1.4) is fulfilled. Then (see Bentkus(2)), for $x \in \mathbb{R}$, we have
$$P\{M_n \ge x\} \le D(x/e), \qquad P\{M_n > x\} \ge 1 - D(-x/e), \tag{1.5}$$
where $e^2 = e_1^2 + \cdots + e_n^2$, and where $D$ is defined with $c = 435$.
The upper bound $c \le c_0 \le 25$ for $c$ in (1.2) is not optimal and can be improved. Extending the methods of Bentkus,(1) one can show that
$$\sup_n \sup_{M_n} P\{M_n \ge x\} \ge 2I(x/e),$$
where the supremum is taken over all Bernoulli type martingales $M_n$ such that $e_1 = \cdots = e_n$ and $P\{|X_j| \le e_j\} = 1$, and over all $n$ (we call a martingale of Bernoulli type if the differences conditionally have two-point distributions). Hence, the constant $c$ in (1.2) has to satisfy $c \ge 2$. In the very special case of independent identically distributed Rademacher random variables $X_1, \ldots, X_n$ of size $2/\sqrt{n}$ such that $P\{X_1 = -1/\sqrt{n}\} = P\{X_1 = 1/\sqrt{n}\} = 1/2$, we have obviously $s^2 = 1$ and
$$\sup_n P\{M_n \ge x\} \ge I(x),$$
which, for larger $x$, differs by the factor $2 \le c \le c_0$ compared to (1.2). Thus, Theorem 1.1 shows that the martingale type dependence does not influence much the bounds for tail probabilities compared to the independent and i.i.d. cases.

Theorem 1.1 and (1.5) improve and extend an inequality of Hoeffding,(4) which is proved for martingales satisfying the condition (1.3) with non-random $a_j$. In our notation his inequality (Theorem 2 in Hoeffding(4)) reads as
$$P\{M_n \ge x\} \le \exp\{-x^2/(2e^2)\}, \quad x \ge 0. \tag{1.6}$$
In the case of the condition (1.3) with random $a_j$, the inequality (1.6) is contained in McDiarmid(6) as Theorem 6.7. Therefore, the new component in the bounds (1.2) and (1.5) is the inequality $P\{M_n \ge x\} \le cI(x/s)$. For larger $x \ge c_1 s$, the bounds (1.2) and (1.5) are better than (1.6) since $I(x) \le c_2 x^{-1}\exp\{-x^2/2\}$ for $x > 0$ (here the $c_j$ are absolute constants). Hence, the improvement is just the factor $s/x$.

Theorem 1.1 and (1.5) extend a result of Pinelis,(10) who proved $P\{M_n \ge x\} \le 4.48\,I(x/e)$ for martingales assuming that $a_j = 0$ for all $j$. The constant in Pinelis' inequality is better than our constants. However, the values of $c$ in (1.2) and (1.5) can be considerably improved. Hence, compared to the bound of Pinelis, we get rid of the symmetric boundedness assumption $a_j = 0$, which is very important in applications to measure concentration, and instead of two-sided boundedness of the differences we require only boundedness from above.

The inequalities of type (1.6) (and hence (1.2) and (1.5) as well) have extensive applications in combinatorics, operational research, computer
science, and random graphs (see McDiarmid(6)), and in the theory of Banach spaces (see Milman and Schechtman(7)). The result applies to measure concentration for separately Lipschitz functions on product spaces (see Bentkus(2) for applications of (1.5)), as well as to some non-linear statistics. In these models the bounds implied are the sharpest among the known ones. The space for improvements is restricted (cf. the discussion above). For statistical applications, optimal bounds for finite (that is, fixed) $n$ are of interest (see Bentkus and van Zuijlen(3)). In this sense our result is not optimal and can be improved by extending the methods of Bentkus.(1)

The history of inequalities for tail probabilities is very rich (see, for example, the books of Petrov(8) and Shorack and Wellner(11)). The names of Chernoff, Bennett, Prokhorov, Hoeffding, and others come to mind. Our methods are different from those of Hoeffding,(4) Pinelis,(9) and Talagrand.(13) The proof of Theorem 1.1 is based on induction on $n$ and multiple applications of Chebyshev's inequality. By Chebyshev's inequality we understand $\int f \le \int g$ if $f \le g$, and in this paper we always apply it with a quadratic function $g$. Our methods are well designed for applications where we have martingale type dependence. For independent random variables $X_1, \ldots, X_n$ satisfying (1.3) with $a_j = 0$ and $e_j = 1$ for all $j$, a bound $P\{M_n \ge x\} \le B_n(x)$ with some function $B_n(x)$ essentially smaller than $cI(x/e)$, $e = \sqrt{n}$, was obtained earlier in Bentkus.(1) One can show that the bound $B_n(x)$ is sharp on martingales, for integer $x$ and all $n$. Heuristically, the basic ideas and methods are already contained in that paper.

2. THE PROOF

By the definition of $c_0$, we have $c_0 I(\sqrt{3}) = 1$. Hence, $c_0 I(x) \ge 1$ for $x \le \sqrt{3}$, and the function $c_0 I$ is strictly decreasing. Below we prove the following bounds.

Theorem 2.1. For $x \in \mathbb{R}$, we have $P\{M_n \ge x\} \le c_0 I(x/s)$.

Theorem 2.2. For $x \ge 0$, we have $P\{M_n \ge x\} \le \exp\{-x^2/(2s^2)\}$.

Proof of Theorem 1.1. It suffices to combine the bounds of Theorems 2.1 and 2.2. ∎
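Remark (numerical comparison, not part of the original argument). A small Python sketch (ours, purely illustrative) makes the role of the two terms of $D$ visible: the exponential term of Theorem 2.2 is the smaller one for moderate $x/s$, while the normal-tail term $c_0 I$ of Theorem 2.1 takes over for large $x/s$, which is exactly the factor of order $s/x$ gained over (1.6).

```python
# Comparing the two terms of D(x) = min{exp(-x^2/2), c0 * I(x)}.
import math

def survival(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

C0 = 1.0 / survival(math.sqrt(3.0))

for x in [1.0, 3.0, 6.0, 9.0, 12.0]:
    exp_term = math.exp(-x * x / 2.0)   # Theorem 2.2, cf. Hoeffding's (1.6)
    tail_term = C0 * survival(x)        # Theorem 2.1
    winner = "c0*I" if tail_term < exp_term else "exp"
    print(f"x = {x:4.1f}: exp term {exp_term:.3e}, "
          f"c0*I term {tail_term:.3e} -> min is {winner}")
```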
Proof of Theorem 2.1. We apply induction on $n$. Without loss of generality we assume that $e_j > 0$ for all $j = 1, \ldots, n$. Rescaling if necessary, we can assume as well that $e_1 = 1$. Throughout we write
$$s^2 = r^2 + \sigma^2, \quad \text{where } r^2 = s_1^2 \text{ and } \sigma^2 = s_2^2 + \cdots + s_n^2.$$
Due to the assumption $e_1 = 1$, we have $P\{X_1 \le 1\} = 1$ and $s^2 \ge r^2 = s_1^2 \ge 1$. We have to prove that
$$P\{M_n \ge x\} \le c_0 I(x/s), \quad \text{for } x \ge s\sqrt{3}. \tag{2.1}$$
Indeed, for $x \le s\sqrt{3}$ the trivial bound $P\{M_n \ge x\} \le 1$ yields (2.1), since for $x \le s\sqrt{3}$ we have $1 \le c_0 I(x/s)$ due to the definition of $c_0$ and $s \ge 1$.

The Case $n = 1$. In essence, in this case there is nothing to prove. Notice that now $M_1 = X_1$ and $P\{X_1 \ge x\} \le \mathbb{I}\{x \le 1\}$ since $X_1$ is bounded from above by $1$, where $\mathbb{I}\{A\}$ is the indicator function of the event $A$. The definition of $c_0$ and $s \ge 1$ show that $\mathbb{I}\{x \le 1\} \le c_0 I(x/s)$, which concludes the proof for $n = 1$.

The Case $n > 1$. By the induction assumption we have
$$P\{Z_{n-1} \ge x\} \le c_0 I(x/b), \quad \text{for all } x \in \mathbb{R}, \tag{2.2}$$
for any martingale sequence $Z_0 = 0, Z_1, \ldots, Z_{n-1}$ such that
$$P\{Z_k - Z_{k-1} \le \alpha_k\} = 1, \qquad P\{E((Z_k - Z_{k-1})^2 \mid \mathcal{F}_{k-1}) \le \gamma_k^2\} = 1$$
with some non-random $\alpha_k$ and $\gamma_k$, where $b^2 = \max\{\alpha_1^2, \gamma_1^2\} + \cdots + \max\{\alpha_{n-1}^2, \gamma_{n-1}^2\}$. Notice that $\sigma > 0$. Using (2.2) and conditioning on $X_1$, we have
$$P\{M_n \ge x\} = E\,P\{X_2 + \cdots + X_n \ge x - X_1 \mid X_1\} \le E\,c_0 I\Big(\frac{x - X_1}{\sigma}\Big),$$
since, for given $X_1$, the sequence $Z_0 = 0,\; Z_1 = X_2,\; \ldots,\; Z_{n-1} = X_2 + \cdots + X_n$
is a martingale sequence with differences such that
$$P\{Z_k - Z_{k-1} \le e_{k+1}\} = 1, \qquad P\{E((Z_k - Z_{k-1})^2 \mid \mathcal{F}_k) \le \sigma_{k+1}^2\} = 1.$$
To simplify notation, write
$$\psi(t) = c_0 I\Big(\frac{x - t}{\sigma}\Big), \qquad \text{so that} \qquad E\,c_0 I\Big(\frac{x - X_1}{\sigma}\Big) = E\,\psi(X_1).$$
Notice that $\psi$ depends on $x$ and other parameters, which is not reflected in the notation. We have to prove that $E\,\psi(X_1) \le c_0 I(x/s)$. Let us note that

(i) the function $t \mapsto \psi''(t)$ is positive and strictly increasing for $t \le 1$.

Indeed, introducing the variable $z = (x - t)/\sigma$ such that $z \ge (x - 1)/\sigma$, the assertion (i) is equivalent to

(ii) the function $z \mapsto I''(z)$ is positive and strictly decreasing for $z \ge (x - 1)/\sigma$.

We have $I''(z) = z\varphi(z)$. The function $z \mapsto z\varphi(z)$ is positive and strictly decreasing for $z \ge 1$. Hence, to prove (i) it suffices to verify that $(x - 1)/\sigma \ge 1$, or, equivalently, that $x^2 \ge \sigma^2 + 2\sigma + 1$. Using $r \ge 1$ and $2r\sigma \le r^2 + \sigma^2$, we have
$$1 + 2\sigma + \sigma^2 \le r^2 + 2r\sigma + \sigma^2 \le 2s^2 \le x^2$$
since by our assumption $x^2 \ge 3s^2$. This proves (i). Due to (i), we can apply the following Lemma 2.3 (we give its proof later).

Lemma 2.3. Let $\psi: (-\infty, 1] \to [0, \infty)$ be a function such that the second derivative $t \mapsto \psi''(t)$ is a positive strictly increasing function of $t \le 1$. Let $z < 1$. Then the quadratic polynomial $P(t) = at^2 + bt + c$, where
$$a = (z-1)^{-2}\big((z-1)\,\psi'(z) - \psi(z) + \psi(1)\big),$$
$$b = (z-1)^{-2}\big((1-z^2)\,\psi'(z) + 2z\psi(z) - 2z\psi(1)\big),$$
$$c = (z-1)^{-2}\big((z^2-z)\,\psi'(z) + (1-2z)\,\psi(z) + z^2\psi(1)\big),$$
satisfies $P(t) \ge \psi(t)$, for all $t \le 1$.
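Remark (numerical check, not part of the original argument). The lemma can be sanity-checked numerically. The Python sketch below (ours; the parameter values are arbitrary) builds the quadratic majorant for $\psi(t) = \exp\{\eta t\}$, the case needed later in the proof of Theorem 2.2, and verifies $P \ge \psi$ on a grid in $(-\infty, 1]$.

```python
# Quadratic majorant of Lemma 2.3 for psi(t) = exp(eta*t), checked on a grid.
import math

def majorant_coeffs(psi, dpsi, z):
    """Coefficients a, b, c of the quadratic P(t) >= psi(t) from Lemma 2.3."""
    d = (z - 1.0) ** -2
    a = d * ((z - 1.0) * dpsi(z) - psi(z) + psi(1.0))
    b = d * ((1.0 - z * z) * dpsi(z) + 2.0 * z * (psi(z) - psi(1.0)))
    c = d * ((z * z - z) * dpsi(z) + (1.0 - 2.0 * z) * psi(z) + z * z * psi(1.0))
    return a, b, c

eta, z = 0.7, -2.0                       # arbitrary test values, z < 1
psi = lambda t: math.exp(eta * t)
dpsi = lambda t: eta * math.exp(eta * t)
a, b, c = majorant_coeffs(psi, dpsi, z)

gaps = (a * t * t + b * t + c - psi(t)
        for t in (i / 1000.0 - 10.0 for i in range(11001)))  # t in [-10, 1]
print(f"min of P - psi on the grid: {min(gaps):.2e}")  # ~0, >= 0 up to rounding
```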
Writing $z = -r^2$, the quadratic polynomial $P(t) = at^2 + bt + c$ from Lemma 2.3 satisfies
$$\psi(t) \le P(t), \quad \text{for all } t \le 1. \tag{2.3}$$
Using (2.3) and $EX_1 = 0$, a small elementary calculation shows that
$$E\,\psi(X_1) \le E\,P(X_1) \le c_0(1-p)\,I\Big(\frac{x + r^2}{\sigma}\Big) + c_0\,p\,I\Big(\frac{x - 1}{\sigma}\Big) \tag{2.4}$$
with
$$1 - p = \frac{1}{1 + r^2} \qquad \text{and} \qquad p = \frac{r^2}{1 + r^2}.$$
The inequality (2.4) reduces the proof of (2.1) to checking that
$$\Delta \stackrel{\mathrm{def}}{=} (1-p)\,I\Big(\frac{x + r^2}{\sigma}\Big) + p\,I\Big(\frac{x - 1}{\sigma}\Big) - I\Big(\frac{x}{s}\Big) \le 0. \tag{2.5}$$
Introduce a variable, say $u$, such that $0 \le u \le r/s$. Write $h = \sqrt{1 - u^2}$ and consider the function
$$w(u) \stackrel{\mathrm{def}}{=} (1-p)\,I\Big(\frac{x + rsu}{sh}\Big) + p\,I\Big(\frac{x - su/r}{sh}\Big) - I\Big(\frac{x}{s}\Big).$$
It is clear that $w(0) = 0$ and $w(r/s) = \Delta$. To simplify the notation, write $\lambda = x/s$. The condition $x^2 \ge 3s^2$ is equivalent to $\lambda^2 \ge 3$. Then
$$w(u) = (1-p)\,I\Big(\frac{\lambda + ru}{h}\Big) + p\,I\Big(\frac{\lambda - u/r}{h}\Big) - I(\lambda). \tag{2.6}$$
Let us prove (2.5). Using (2.6), we have
$$w'(u) = -(1-p)\,\varphi\Big(\frac{\lambda + ru}{h}\Big)\frac{\lambda u + r}{h^3} - p\,\varphi\Big(\frac{\lambda - u/r}{h}\Big)\frac{\lambda u - 1/r}{h^3}. \tag{2.7}$$
A bit later we prove that $w'(u) \le 0$ for $0 \le u \le r/s$. Therefore $w(u)$ is a decreasing function of $u \ge 0$, and to prove (2.5) (that is, to show that $w(r/s) = \Delta \le 0$) it suffices to check that $w(0) = 0$, which holds by the definition of $w$. The proof of (2.5) is completed.
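As a numerical illustration (ours, not part of the original argument; the parameter choices are arbitrary admissible values), one may tabulate $w$ of (2.6) on $[0, r/s]$ and observe the claimed monotonicity:

```python
# Numerical check that w of (2.6) decreases on [0, r/s] when lam^2 >= 3, r >= 1.
import math

def survival(t):
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def w(u, r, s, lam):
    h = math.sqrt(1.0 - u * u)
    p = r * r / (1.0 + r * r)
    return ((1.0 - p) * survival((lam + r * u) / h)
            + p * survival((lam - u / r) / h)
            - survival(lam))

for r, sigma in [(1.0, 2.0), (1.5, 1.0), (2.0, 3.0)]:
    s = math.sqrt(r * r + sigma * sigma)
    lam = 1.1 * math.sqrt(3.0)                      # any lam with lam^2 >= 3
    vals = [w(i / 200.0 * r / s, r, s, lam) for i in range(201)]
    decreasing = all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))
    print(f"r={r}, sigma={sigma}: decreasing={decreasing}, "
          f"Delta=w(r/s)={vals[-1]:.3e}")
```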
Let us prove that $w'(u) \le 0$. Using $\varphi(t) = (2\pi)^{-1/2}\exp\{-t^2/2\}$ and (2.7), the inequality $w'(u) \le 0$ is equivalent to
$$(\lambda u + r)\exp\{E\} + r^2(\lambda u - 1/r) \ge 0, \tag{2.8}$$
where
$$E = -\frac{\lambda u}{h^2}\Big(\frac{1}{r} + r\Big) + \frac{u^2}{2h^2}\Big(\frac{1}{r^2} - r^2\Big).$$
If $u = 0$ then (2.8) is just the equality $0 = 0$. If $\lambda u - 1/r \ge 0$ then (2.8) is obviously fulfilled. Hence, it suffices to prove (2.8) for $u$ such that $0 < u < 1/(r\lambda)$. Write
$$C = \lambda u + r, \qquad D = r - \lambda u r^2. \tag{2.9}$$
Then (2.8) is equivalent to the inequality $\frac{C}{D}\exp\{E\} \ge 1$ and therefore to the inequality
$$v \stackrel{\mathrm{def}}{=} \ln C - \ln D + E \ge 0. \tag{2.10}$$
A bit later we prove that $\partial_u v \ge 0$, that is, that $u \mapsto v(u)$ is an increasing function. This proves (2.10) since $v(0) = 0$.

Let us prove that $\partial_u v \ge 0$. We have
$$\partial_u C = \lambda, \qquad \partial_u D = -\lambda r^2, \tag{2.11}$$
and
$$\partial_u E = -\frac{\lambda(1 + u^2)}{h^4}\Big(\frac{1}{r} + r\Big) + \frac{u}{h^4}\Big(\frac{1}{r^2} - r^2\Big).$$
It is clear that
$$\partial_u v = \frac{\partial_u C}{C} - \frac{\partial_u D}{D} + \partial_u E$$
and
$$\mathrm{sign}(\partial_u v) = \mathrm{sign}\big(D\,\partial_u C - C\,\partial_u D + CD\,\partial_u E\big), \tag{2.12}$$
where the sign function is defined as $\mathrm{sign}(z) = -1$ for $z < 0$, $\mathrm{sign}(z) = 1$ for $z > 0$, and $\mathrm{sign}(z) = 0$ if $z = 0$. Using (2.9) and (2.11), the relation (2.12) is equivalent to
$$\mathrm{sign}(\partial_u v) = \mathrm{sign}\big(\lambda + \lambda r^2 + B\,\partial_u E\big), \tag{2.13}$$
where
$$B = (\lambda u + r)(1 - \lambda u r) = r + \lambda u(1 - r^2) - \lambda^2 r u^2.$$
We have
$$\partial_u E = \frac{1 + r^2}{r^2 h^4}\,A, \qquad A \stackrel{\mathrm{def}}{=} -\lambda r + u(1 - r^2) - \lambda r u^2.$$
Hence, the relation (2.13) is equivalent to
$$\mathrm{sign}(\partial_u v) = \mathrm{sign}\big(\lambda r^2 h^4(1 + r^2) + B(1 + r^2)A\big) = \mathrm{sign}\big(\lambda r^2 h^4 + BA\big).$$
Write
$$B = r + ug, \qquad A = -\lambda r + ut$$
with
$$g = \lambda(1 - r^2) - \lambda^2 ru, \qquad t = 1 - r^2 - \lambda ru.$$
Then $BA = -\lambda r^2 + u(rt - \lambda rg + utg)$ and, using $h^2 = 1 - u^2$, we obtain
$$\lambda r^2 h^4 + BA = uQ, \qquad Q \stackrel{\mathrm{def}}{=} -2\lambda r^2 u + \lambda r^2 u^3 + rt - \lambda rg + utg.$$
Hence, $\mathrm{sign}(\partial_u v) = \mathrm{sign}(Q)$. A small elementary calculation shows that
$$Q = r a_0 + \lambda u\,a_1 + 2\lambda^2 r u^2 a_2 + \lambda r^2 u^3(1 + \lambda^2), \tag{2.14}$$
where
$$a_0 = (r^2 - 1)(\lambda^2 - 1), \qquad a_1 = r^2\lambda^2 + 1 + r^4 - 5r^2, \qquad a_2 = r^2 - 1.$$
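The identity (2.14) is mechanical to verify; a symbolic computation (our sketch, using sympy, not part of the original argument) confirms it:

```python
# Symbolic verification of the identity (2.14).
import sympy as sp

lam, u, r = sp.symbols('lam u r')
t = 1 - r**2 - lam*r*u                  # t as defined above
g = lam*t                               # g = lam*(1 - r^2) - lam^2*r*u
Q = -2*lam*r**2*u + lam*r**2*u**3 + r*t - lam*r*g + u*t*g

a0 = (r**2 - 1) * (lam**2 - 1)
a1 = r**2*lam**2 + 1 + r**4 - 5*r**2
a2 = r**2 - 1
rhs = r*a0 + lam*u*a1 + 2*lam**2*r*u**2*a2 + lam*r**2*u**3*(1 + lam**2)

print(sp.expand(Q - rhs))               # prints 0
```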
We assume that $r \ge 1$ and $\lambda^2 \ge 3$. Therefore we have
$$a_1 \ge 1 + r^4 - 2r^2 = (r^2 - 1)^2 \ge 0.$$
The other terms in (2.14) are clearly non-negative. This means that $Q \ge 0$ and therefore $\partial_u v \ge 0$, proving the theorem. ∎

Proof of Theorem 2.2. The proof of this theorem is quite standard. Let $h \ge 0$. Replacing the indicator function $t \mapsto \mathbb{I}\{t \ge x\}$ by the function $t \mapsto \exp\{h(t - x)\}$, we have
$$P\{M_n \ge x\} \le \exp\{-hx\}\,E\exp\{hM_n\}. \tag{2.15}$$
Below we prove that
$$E\exp\{hM_n\} \le \exp\{h^2 s^2/2\}. \tag{2.16}$$
Choosing $h = x/s^2$ and combining (2.15) and (2.16), we get $P\{M_n \ge x\} \le \exp\{-x^2/(2s^2)\}$, which proves the theorem.

It remains to prove (2.16). A bit later we prove that
$$E\exp\{hM_n\} \le \prod_{k=1}^n F_k, \qquad F_k = (1 - p_k)\exp\{-he_k\gamma_k\} + p_k\exp\{he_k\} \tag{2.17}$$
with $\gamma_k = \max\{1, \sigma_k^2/e_k^2\}$ and $p_k = \gamma_k/(1 + \gamma_k)$. The inequality (2.17) yields (2.16). Indeed, we can apply to each factor $F_k$ in (2.17) the estimate of the following Lemma 2.4 with $\gamma = \gamma_k$ and $\mu = he_k$; since $e_k^2\gamma_k = s_k^2$, the product of the resulting bounds is $\exp\{h^2 s^2/2\}$ (we give the proof of Lemma 2.4 later).

Lemma 2.4. For $\mu \ge 0$ and $\gamma \ge 1$, we have
$$\frac{1}{1+\gamma}\exp\{-\mu\gamma\} + \frac{\gamma}{1+\gamma}\exp\{\mu\} \le \exp\Big\{\frac{\mu^2\gamma}{2}\Big\}.$$
Let us prove (2.17). It suffices to show that
$$E\exp\{hM_k\} \le F_k\,E\exp\{hM_{k-1}\}, \quad \text{for } k = 1, \ldots, n. \tag{2.18}$$
Using induction, (2.18) yields (2.17). Conditioning on $M_{k-1}$, we have
$$E\exp\{hM_k\} \le E\big[\exp\{hM_{k-1}\}\,E(\exp\{hX_k\} \mid M_{k-1})\big]. \tag{2.19}$$
Write $Z = X_k/e_k$. Then $P\{Z \le 1\} = 1$ and
$$E(\exp\{hX_k\} \mid M_{k-1}) = E(\psi(Z) \mid M_{k-1}) \qquad \text{with } \psi(t) = \exp\{he_k t\}. \tag{2.20}$$
The function $t \mapsto \psi(t)$ has a strictly increasing positive second derivative. Therefore we can apply Lemma 2.3 with $z = -\gamma_k$. Let $P(t) = at^2 + bt + c$ be the polynomial given in Lemma 2.3. Then $\psi(t) \le P(t)$, and we have
$$E(\psi(Z) \mid M_{k-1}) \le E(P(Z) \mid M_{k-1}) = a\,E(Z^2 \mid M_{k-1}) + c = a e_k^{-2}\,E(X_k^2 \mid M_{k-1}) + c \le a e_k^{-2} s_k^2 + c. \tag{2.21}$$
A small calculation shows that $a e_k^{-2} s_k^2 + c = F_k$. Hence, the relations (2.19)–(2.21) together yield (2.18), which concludes the proof of Theorem 2.2. ∎

Proof of Lemma 2.3. It is easy to check that
$$P(z) = \psi(z), \qquad P'(z) = \psi'(z), \qquad P(1) = \psi(1). \tag{2.22}$$
Furthermore, we have
$$P''(z) - \psi''(z) = (z-1)^{-2}\big(2(z-1)\,\psi'(z) - 2\psi(z) + 2\psi(1) - (z-1)^2\,\psi''(z)\big) = 2E(1 - \theta)\big(\psi''(z + (1-z)\theta) - \psi''(z)\big), \tag{2.23}$$
by an application of the Taylor expansion
$$\psi(1) = \psi(z) + (1-z)\,\psi'(z) + (1-z)^2\,E(1-\theta)\,\psi''(z + (1-z)\theta),$$
where $\theta$ is a random variable uniformly distributed in the interval $[0, 1]$. The function $z \mapsto \psi''(z)$ is a strictly increasing function of $z \le 1$. Since $z + (1-z)\theta > z$ for $\theta \ne 0$ and $z < 1$, the expression under the expectation sign in (2.23) is positive, and therefore $P''(z) - \psi''(z) > 0$. This means that the function $t \mapsto P(t) - \psi(t)$ is positive for $t$ sufficiently close to $z$ such that $t \ne z$.

Now we can prove that $P(t) - \psi(t) \ge 0$ for $t \le 1$. Assume that the inequality does not hold. Then there exists a $t_0 < 1$ such that $P(t_0) - \psi(t_0) < 0$. Due to (2.22) and the positivity of $P(t) - \psi(t)$ for $t$ close to $z$, the function $t \mapsto P(t) - \psi(t)$ has at least 4 zeroes, since by (2.22) it has 3 zeroes and at least one additional zero is guaranteed by $P(t_0) - \psi(t_0) < 0$. Therefore the function $P''(t) - \psi''(t) = 2a - \psi''(t)$ has at least 2 zeroes, which contradicts the assumption that $t \mapsto \psi''(t)$ is a strictly increasing function. ∎
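Lemma 2.4 likewise admits a brute-force numerical check (our sketch, not part of the original argument; the grid is arbitrary):

```python
# Grid check of Lemma 2.4: for mu >= 0, gamma >= 1,
#   (exp(-mu*gamma) + gamma*exp(mu)) / (1 + gamma) <= exp(mu^2 * gamma / 2).
import math

def gap(mu, gamma):
    lhs = (math.exp(-mu * gamma) + gamma * math.exp(mu)) / (1.0 + gamma)
    return math.exp(mu * mu * gamma / 2.0) - lhs    # should be >= 0

worst = min(gap(0.05 * i, 1.0 + 0.1 * j)
            for i in range(101) for j in range(91))
print(f"smallest gap on the grid: {worst:.3e}")     # non-negative (0 at mu = 0)
```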
Proof of Lemma 2.4. The inequality we have to prove is equivalent to
$$v(\mu) \stackrel{\mathrm{def}}{=} \exp\{-\gamma(\mu + \mu^2/2)\} + \gamma\exp\{\mu - \gamma\mu^2/2\} - 1 - \gamma \le 0. \tag{2.24}$$
Below we prove that $v'(\mu) \le 0$ for $\mu \ge 0$. Since $v(0) = 0$, this proves (2.24) and the lemma.

Let us prove that $v'(\mu) \le 0$. Using (2.24), the inequality $v'(\mu) \le 0$ is equivalent to
$$-(1 + \mu)\exp\{-\gamma\mu\} + (1 - \gamma\mu)\exp\{\mu\} \le 0. \tag{2.25}$$
The inequality (2.25) is clearly fulfilled if $1 - \gamma\mu \le 0$. Hence, we have to verify (2.25) only for $0 \le \mu < 1/\gamma$. For such $\mu$, the inequality (2.25) is equivalent to
$$\frac{1 + \mu}{1 - \gamma\mu}\exp\{-\gamma\mu - \mu\} \ge 1,$$
or to the inequality
$$u(\mu) \stackrel{\mathrm{def}}{=} \ln(1 + \mu) - \ln(1 - \gamma\mu) - \gamma\mu - \mu \ge 0. \tag{2.26}$$
To prove (2.26), it suffices to verify that $u'(\mu) \ge 0$. Elementary calculations show that $u'(\mu) \ge 0$ is equivalent to $\gamma + \gamma\mu \ge 1$. The inequality $\gamma + \gamma\mu \ge 1$ holds for all $\mu \ge 0$ since we assume that $\gamma \ge 1$. This proves (2.26) and the lemma. ∎

ACKNOWLEDGMENT

Research supported by the Max Planck Institute for Mathematics, Bonn.

REFERENCES

1. Bentkus, V. (2001). An inequality for large deviation probabilities of sums of bounded i.i.d. random variables. Lithuanian Math. J. 41, 144–153.
2. Bentkus, V. (2001). On measure concentration for separately Lipschitz functions in product spaces. To appear in Israel J. Math.
3. Bentkus, V., and van Zuijlen, M. (2001). Upper confidence bounds for mean. Submitted to Lithuanian Math. J.
4. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13–30.
5. Ledoux, M. (1999). Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de Probabilités, XXXIII, Lecture Notes in Math., Vol. 1709, Springer, Berlin, pp. 120–216.
6. McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, 1989 (Norwich, 1989), London Math. Soc. Lecture Note Ser., Vol. 141, Cambridge University Press, Cambridge, pp. 148–188.
7. Milman, V. D., and Schechtman, G. (1986). Asymptotic Theory of Finite-Dimensional Normed Spaces, Lecture Notes in Mathematics, Vol. 1200, Springer.
8. Petrov, V. V. (1975). Sums of Independent Random Variables, Ergebnisse der Mathematik und ihrer Grenzgebiete, Band 82, Springer-Verlag, New York/Heidelberg.
9. Pinelis, I. (1994). Extremal probabilistic problems and Hotelling's T² test under a symmetry assumption. Ann. Statist. 22(4), 357–368.
10. Pinelis, I. (1998). Optimal tail comparison based on comparison of moments. In High Dimensional Probability (Oberwolfach, 1996), Progr. Probab., Vol. 43, Birkhäuser, Basel, pp. 297–314.
11. Shorack, G. R., and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics, Wiley Series in Probability and Mathematical Statistics, Wiley, New York.
12. Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Inst. Hautes Études Sci. Publ. Math. 81, 73–205.
13. Talagrand, M. (1995). The missing factor in Hoeffding's inequalities. Ann. Inst. H. Poincaré Probab. Statist. 31(4), 689–702.