
Some Aspects in Large Deviations

Peter Eichelsbacher
Sonderforschungsbereich 343, Diskrete Strukturen in der Mathematik
Universität Bielefeld, Postfach 100131, 33501 Bielefeld, Germany

Abstract

Let $V$ be a topological space and $\mathcal{B}(V)$ the Borel $\sigma$-field on $V$. A sequence of probability measures $(P_n)_n$ satisfies a large deviation principle (LDP) with rate function $I(\cdot)$ if $P_n(A)$ can be approximated by $\exp(-n \inf_{x \in A} I(x))$ for appropriate events $A$ in $\mathcal{B}(V)$. In the proof of the LDP for m-variate von Mises-statistics and U-statistics, combinatorial factorization arguments can be dropped. In general, the governing rate function is not convex. Nevertheless, we are able to give a representation analogous to the convex conjugate concept. Moreover, we analyse a convex minorant of the rate function.

Short running title: Large deviations


1 Introduction

Let $V$ be a topological space and $\mathcal{B}(V)$ the Borel $\sigma$-field on $V$. A sequence of probability measures $(P_n)_n$ satisfies a large deviation principle (LDP) with rate function $I(\cdot)$ if $P_n(A)$ can be approximated by $\exp(-n \inf_{x \in A} I(x))$ for appropriate events $A$ in $\mathcal{B}(V)$. The following definition makes this more precise: A sequence of probability measures $(P_n)_n$ on $(V, \mathcal{B}(V))$ obeys a large deviation principle (LDP) with speed $a_n$ and rate function $I$ if there exists a sequence $(a_n)_n$ of positive numbers tending to infinity and a function $I : V \to [0, \infty]$ such that the following conditions hold:

1. $I(x)$ is a lower semicontinuous function, $I \not\equiv 0$.

2. $\{x : I(x) \le C\} \subset V$ is compact for all $C < +\infty$ (compact level sets).

3. For every open set $O \subset V$

$$\liminf_{n \to \infty} \frac{1}{a_n} \log P_n(O) \ge -\inf_{x \in O} I(x). \tag{1}$$

4. For every closed set $A \subset V$

$$\limsup_{n \to \infty} \frac{1}{a_n} \log P_n(A) \le -\inf_{x \in A} I(x). \tag{2}$$

A sequence $(X_n)_n$ of random variables with values in $V$ is said to obey an LDP with rate function $I$ satisfying 1. and 2. if the corresponding distributions satisfy 3. and 4. In this case we will briefly write $P(X_n \approx a) \approx e^{-a_n I(a)}$.
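The approximation $P_n(A) \approx \exp(-n \inf_{x \in A} I(x))$ can be checked numerically in the simplest i.i.d. case. The following is a minimal sketch, not taken from the paper: it assumes fair-coin tosses, speed $a_n = n$, the closed set $A = [0.7, 1]$, and the classical Cramér rate $I(a) = a \log(2a) + (1-a)\log(2(1-a))$. The function names are hypothetical.

```python
import math


def log_tail_prob(n: int, k0: int) -> float:
    """Return log P(S_n >= k0) for S_n ~ Binomial(n, 1/2), computed exactly."""
    # Exact integer count of the outcomes with at least k0 successes.
    count = sum(math.comb(n, k) for k in range(k0, n + 1))
    return math.log(count) - n * math.log(2.0)


def rate(a: float) -> float:
    """Cramér rate function of the fair-coin sample mean, for 0 < a < 1."""
    return a * math.log(2.0 * a) + (1.0 - a) * math.log(2.0 * (1.0 - a))


if __name__ == "__main__":
    # Assumed example: threshold a = 0.7, i.e. the event {S_n / n >= 0.7}.
    for n in (100, 1000, 2000):
        k0 = (7 * n) // 10
        print(n, -log_tail_prob(n, k0) / n, rate(0.7))
```

In this example $-\frac{1}{n}\log P(S_n/n \ge 0.7)$ stays above $I(0.7)$ for every $n$ (the Chernoff bound) and approaches it as $n$ grows, which is the behaviour that conditions 3. and 4. describe.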

Throughout this paper $(X, \mathcal{B}(X), \mu)$ denotes a probability space and, unless otherwise specified, $(\Omega, \mathcal{F}, P)$ is its $\mathbb{N}$-fold product. Define $(X_n)_n$ to be a sequence of i.i.d. random variables with values in $X$ and distribution (law) $\mu$, i.e. $P(X_i \in A) = \mu(A)$ for every $i$. Let $h : X^m \to \mathbb{R}^d$ be a measurable and (without loss of generality) symmetric function. The function

$$U_n := U_n(h; X_1, \dots, X_n) := \binom{n}{m}^{-1} \sum_{1 \le i_1 < \dots < i_m \le n} h(X_{i_1}, \dots, X_{i_m}), \qquad m \le n \tag{3}$$

[...]

[...] $> 0$, $L \in \mathbb{R}$, and took a continuous version $h_L^c$ according to Lusin's theorem. Now we had to prove that the continuous mappings

$$F_L(\nu) := \int_{X^m} h_L^c(x_1, \dots, x_m) \, d\nu^{\otimes m}(x_1, \dots, x_m) \tag{11}$$

converge to the contraction function $F$ in the sense that the two conditions in the contraction principle are satisfied. Checking the first condition that

$$\lim_{L \to \infty} \sup_{\nu \in N} \Big\| \int (h - h_L^c) \, d\nu^{\otimes m} \Big\| = 0$$

for any level set $N$ of $H(\cdot \,|\, \mu)$, we adopt the arguments given in [10], which are a direct consequence of an early result of [8], Lemma 5.1. The remaining part is to show that for every $\varepsilon > 0$

$$\lim_{L \to \infty} \limsup_{n \to \infty} \frac{1}{n} \log P(\| V_n(h - h_L^c) \| > \varepsilon) = -\infty,$$

or equivalently

$$\lim_{L \to \infty} \limsup_{n \to \infty} \frac{1}{n} \log P(\| U_n(h - h_L^c) \| > \varepsilon) = -\infty. \tag{12}$$

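This is a consequence of our Lemma 2.2. From the exponential boundedness of $h$ (5) it follows that for any non-negative $t \in \mathbb{R}$:

$$P(\| U_n(h - h_L) \| > \varepsilon) \le \exp(-t\varepsilon) E(\exp(t \| U_n(h - h_L) \|)) \le \exp(-t\varepsilon) M_t(U_n(\| h - h_L \|)) \le \exp(-t\varepsilon) \big( M_{t/k}(h - h_L) \big)^k,$$

where $k = \left[ \frac{n}{m} \right]$. Therefore substituting $t$ by $tk$ yields

$$P(\| U_n(h - h_L) \| > \varepsilon) \le \exp(-tk\varepsilon) \big( M_t(h - h_L) \big)^k,$$

thus

$$\frac{1}{n} \log P(\| U_n(h - h_L) \| > \varepsilon) \le -t \frac{k}{n} \varepsilon + \frac{k}{n} \log M_t(h - h_L), \tag{13}$$

and, since $k/n \to 1/m$, we conclude

$$\limsup_{n \to \infty} \frac{1}{n} \log P(\| U_n(h - h_L) \| > \varepsilon) \le -\frac{t\varepsilon}{m} + \frac{1}{m} \log M_t(h - h_L).$$

Since $t > 0$ has been arbitrary and $\varepsilon > 0$, we have

$$\lim_{L \to \infty} \limsup_{n \to \infty} \frac{1}{n} \log P(U_n(\| h - h_L \|) > \varepsilon) = -\infty.$$

We make use of the triangle inequality

$$P(\| U_n(h - h_L^c) \| > \varepsilon) \le P\big(U_n(\| h - h_L \|) > \tfrac{\varepsilon}{2}\big) + P\big(U_n(\| h_L - h_L^c \|) > \tfrac{\varepsilon}{2}\big)$$

and treat the second term analogously (see [10] for details). □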
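The proof above treats the U-statistic and the V-statistic of the same kernel side by side. As an illustration only, the sketch below computes both for a degree-2 kernel. It assumes the standard definitions: $U_n$ averages $h$ over increasing index sets as in (3), and $V_n$ averages $h$ over all $n^m$ index tuples. The kernel `half_sq_diff` and the function names are hypothetical choices, not the paper's notation.

```python
from itertools import combinations, product
from math import comb


def u_statistic(h, xs, m):
    """Average of h over all index sets 1 <= i_1 < ... < i_m <= n."""
    n = len(xs)
    total = sum(h(*c) for c in combinations(xs, m))
    return total / comb(n, m)


def v_statistic(h, xs, m):
    """Average of h over all n**m index tuples (repeated indices allowed)."""
    n = len(xs)
    total = sum(h(*t) for t in product(xs, repeat=m))
    return total / n ** m


def half_sq_diff(x, y):
    """Symmetric degree-2 kernel whose U-statistic is the unbiased sample variance."""
    return 0.5 * (x - y) ** 2


if __name__ == "__main__":
    sample = [1.0, 2.0, 4.0, 7.0]
    print(u_statistic(half_sq_diff, sample, 2))  # unbiased variance of the sample
    print(v_statistic(half_sq_diff, sample, 2))  # biased (1/n) variance of the sample
```

For this kernel the two statistics differ exactly by the factor $(n-1)/n$, a difference that vanishes as $n$ grows.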

Remarks 2.5

1. Although the $\tau$-topology on the space $M_1(X)$ of all probability measures on $X$ makes integration of bounded measurable functions continuous, it is unknown whether the map $\rho \mapsto \rho^{\otimes r}$ for $\rho \in M_1(X)$, $r \in \mathbb{N}$, is continuous if $M_1(X)$ is endowed with the $\tau$-topology. So we cannot use the fact that the LDP for the distribution of $L_n$ is proved in the $\tau$-topology by Groeneboom et al. ([13]). Therefore Lusin's theorem is needed.

2. Generalized statistics as an extension of U-statistics resp. V-statistics to the case of several samples are defined in a straightforward way: consider $l$ independent collections of independent observations with $l$ different distribution functions. The extension of the LDP to generalized statistics also works without the factorization theorem. The reason is that the representation of such a statistic as an average of dependent averages of i.i.d. random variables (as in Lemma 2.1) holds true. For simplicity look at a U-statistic of degree $(m_1, m_2)$ with kernel $h$, given by

$$U_{n_1,n_2} := \binom{n_1}{m_1}^{-1} \binom{n_2}{m_2}^{-1} \sum_c h(X_{i_1}, \dots, X_{i_{m_1}}, Y_{j_1}, \dots, Y_{j_{m_2}}),$$

where the summation $\sum_c$ extends over all $1 \le i_1 < \dots < i_{m_1} \le n_1$ and $1 \le j_1 < \dots < j_{m_2} \le n_2$; $(X_i)_i$, $(Y_i)_i$ are i.i.d. samples with values in $Z_1$ resp. $Z_2$, distributed according to $\mu_1$ resp. $\mu_2$, and $h : Z_1^{m_1} \times Z_2^{m_2} \to \mathbb{R}^d$ is a measurable, symmetric function. An upper bound for the moment generating function can be proved as in Lemma 2.2:

Lemma 2.6 Consider a U-statistic with degree $(m_1, m_2)$ and let $k := \min\left\{ \left[ \frac{n_1}{m_1} \right], \left[ \frac{n_2}{m_2} \right] \right\}$, then

$$M_t(U_{n_1,n_2}) \le \big( M_{t/k}(h) \big)^k.$$

Proof: Let

$$C(X_1, \dots, X_{n_1}, Y_1, \dots, Y_{n_2}) := \frac{1}{k} \big[ h(X_1, \dots, X_{m_1}, Y_1, \dots, Y_{m_2}) + h(X_{m_1+1}, \dots, X_{2m_1}, Y_{m_2+1}, \dots, Y_{2m_2}) + \dots + h(X_{km_1-m_1+1}, \dots, X_{km_1}, Y_{km_2-m_2+1}, \dots, Y_{km_2}) \big].$$

Then

$$U_{n_1,n_2} = \frac{1}{n_1! \, n_2!} \sum_{per} C(X_{i_1}, \dots, X_{i_{n_1}}, Y_{j_1}, \dots, Y_{j_{n_2}}),$$

where $\sum_{per}$ denotes the sum over all $n_1! \, n_2!$ permutations. The term on the right is an average of (dependent) averages of i.i.d. random variables, so the inequality follows as in Lemma 2.2. □
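The representation used in this proof, a two-sample U-statistic written as the permutation average of block averages, can be verified by brute force on a tiny example. The sketch below is illustrative only: the kernel, the sample values, and all function names are hypothetical, and exact rational arithmetic is used so that the two sides can be compared with `==`.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb, factorial


def two_sample_u(h, xs, ys, m1, m2):
    """U_{n1,n2}: average of h over all increasing index sets in both samples."""
    total = sum(h(*cx, *cy)
                for cx in combinations(xs, m1)
                for cy in combinations(ys, m2))
    return Fraction(total, comb(len(xs), m1) * comb(len(ys), m2))


def block_average(h, xs, ys, m1, m2):
    """C(x_1..x_n1, y_1..y_n2): mean of h over k disjoint consecutive blocks."""
    k = min(len(xs) // m1, len(ys) // m2)
    total = sum(h(*xs[b * m1:(b + 1) * m1], *ys[b * m2:(b + 1) * m2])
                for b in range(k))
    return Fraction(total, k)


def permutation_average(h, xs, ys, m1, m2):
    """Average of the block average C over all n1! * n2! reorderings."""
    total = sum(block_average(h, px, py, m1, m2)
                for px in permutations(xs)
                for py in permutations(ys))
    return total / (factorial(len(xs)) * factorial(len(ys)))


def kernel(x1, x2, y):
    """Hypothetical integer-valued kernel of degree (2, 1), symmetric in (x1, x2)."""
    return x1 * x2 + (x1 + x2) * y


if __name__ == "__main__":
    xs, ys = [1, 2, 3, 5], [2, 7]  # assumed data with n1 = 4, n2 = 2, hence k = 2
    print(two_sample_u(kernel, xs, ys, 2, 1))
    print(permutation_average(kernel, xs, ys, 2, 1))
```

The check relies on the kernel being symmetric within each sample. The enumeration costs $n_1! \, n_2!$ evaluations, so it is only meant for very small samples.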

3 A convex upper bound

Under the assumption of exponential boundedness we apply the techniques of Section 2 to derive an upper bound for the probability of large deviations, where the bound is a convex conjugate of a measurable function, more precisely the so-called Cramér functional, defined as follows: Given a function $f : \mathbb{R}^d \to \mathbb{R}$, its convex conjugate (also called the Legendre transform) is defined by

$$f^*(x) := \sup_{\lambda \in \mathbb{R}^d} \{ \langle \lambda, x \rangle - f(\lambda) \}$$

and the Cramér functional $\Lambda_f$ of $f$ is defined by

$$\Lambda_f(B) := \inf_{x \in B} f^*(x), \qquad B \in \mathcal{B}(\mathbb{R}^d),$$

where $\mathcal{B}(\mathbb{R}^d)$ denotes the Borel $\sigma$-algebra on $\mathbb{R}^d$.
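The two definitions above can be approximated numerically in dimension $d = 1$. The sketch below is a rough illustration under stated assumptions: the supremum over $\lambda$ is replaced by a maximum over a finite grid, the set $B$ is represented by finitely many sample points, and the test function $f(\lambda) = \lambda^2/2$ (whose conjugate is $x^2/2$) is an arbitrary choice. None of the names come from the paper.

```python
def convex_conjugate(f, lam_grid, x):
    """Approximate f*(x) = sup_lam (lam * x - f(lam)) by a maximum over lam_grid."""
    return max(lam * x - f(lam) for lam in lam_grid)


def cramer_functional(f, lam_grid, points):
    """Approximate the infimum of f* over a set B given by finitely many points."""
    return min(convex_conjugate(f, lam_grid, x) for x in points)


def quadratic(lam):
    """Assumed test function f(lam) = lam**2 / 2, with conjugate f*(x) = x**2 / 2."""
    return 0.5 * lam * lam


# Hypothetical grid on [-5, 5] with step 0.001; a finer or wider grid may be
# needed for other functions or for points x far from the origin.
LAM_GRID = [i / 1000 for i in range(-5000, 5001)]

if __name__ == "__main__":
    print(convex_conjugate(quadratic, LAM_GRID, 1.5))
    print(cramer_functional(quadratic, LAM_GRID, [1.0, 1.25, 1.5, 2.0]))
```

Because the grid maximum never exceeds the true supremum, the approximation of $f^*$ is always from below.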

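Theorem 3.1 Let $U_n$ be an $\mathbb{R}^d$-valued U-statistic. Assume that for all $\alpha > 0$

$$\int_{X^m} \exp(\alpha \| h(x_1, \dots, x_m) \|) \, d\mu^{\otimes m}(x_1, \dots, x_m) < \infty. \tag{14}$$

Then for every closed set $A \subset \mathbb{R}^d$

$$\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \in A) \le -\Lambda_C(A),$$

where $\Lambda_C(\cdot)$ denotes the Cramér functional of

$$C(\lambda) := C_h(\lambda) := \frac{1}{m} \log E(\exp(m \langle \lambda, h(X_1, \dots, X_m) \rangle)). \tag{15}$$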
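A worked special case may help to read the theorem. The example below is not taken from the paper. It assumes $d = 1$, $m = 2$, standard normal observations, and the kernel $h(x_1, x_2) = (x_1 + x_2)/2$, for which the U-statistic reduces to the sample mean. Here $C$ denotes the normalized log moment generating function of the kernel and $C^*$ its convex conjugate.

```latex
% Assumptions (illustrative only): X_i i.i.d. N(0,1), m = 2, h(x_1,x_2) = (x_1+x_2)/2.
\begin{align*}
U_n &= \binom{n}{2}^{-1} \sum_{i<j} \frac{X_i + X_j}{2}
     = \frac{(n-1)\sum_{i=1}^n X_i}{n(n-1)}
     = \frac{1}{n} \sum_{i=1}^n X_i, \\
C(\lambda) &= \frac{1}{2} \log E\bigl(\exp(2\lambda\, h(X_1, X_2))\bigr)
     = \frac{1}{2} \log E\bigl(\exp(\lambda (X_1 + X_2))\bigr)
     = \frac{1}{2} \cdot \lambda^2 = \frac{\lambda^2}{2}, \\
C^*(x) &= \sup_{\lambda \in \mathbb{R}} \Bigl\{ \lambda x - \frac{\lambda^2}{2} \Bigr\}
     = \frac{x^2}{2}, \\
\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \ge a) &\le -\inf_{x \ge a} \frac{x^2}{2}
     = -\frac{a^2}{2} \qquad (a \ge 0).
\end{align*}
```

In this particular case the upper bound coincides with the classical Cramér rate of the Gaussian sample mean, so the bound is attained. This should not be expected for a general kernel.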

Remark 3.2 The LDP for $(P \circ U_n^{-1})_n$ was an application of the i.i.d. LDP for the empirical law. Therefore the deviating behaviour of finite-dimensional vectors is described in terms of a variational formula in infinite dimensions. Here we have a finite-dimensional variational formula. But later we will see that the rate cannot be expressed as a Legendre transform in general.

The proof of the theorem relies on techniques which are, step by step, known from the Cramér-Chernoff theory for i.i.d. means. First we collect some elementary properties of $C(\lambda)$ and $C^*(x)$.

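Lemma 3.3 For every kernel function $h : X^m \to \mathbb{R}^d$, $C^*(x) \ge 0$; moreover

1. $C(\lambda)$ and $C^*(x)$ are convex and lower semicontinuous.

2. $C^*(\theta) = 0$ with $\theta := E(h) < \infty$.

3. $C^*$ has compact level sets.

4. Let $d = 1$. Then $C^*(x)$ is non-decreasing on $[\theta, \infty)$ and non-increasing on $(-\infty, \theta]$. In addition $C^*(x) = \sup_{\lambda \ge 0} \{ \lambda x - C(\lambda) \}$ for $x \ge \theta$ and $C^*(x) = \sup_{\lambda \le 0} \{ \lambda x - C(\lambda) \}$ for $x \le \theta$.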

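Proof: 1. Convexity of $C$ follows from Hölder's inequality, lower semicontinuity from Fatou's lemma. As the pointwise supremum of affine functions, $C^*$ is necessarily lower semicontinuous and convex. Since $\langle \lambda, x \rangle - C(\lambda) = 0$ for $\lambda = 0 \ (\in \mathbb{R}^d)$ and every $x \in \mathbb{R}^d$, $C^* \ge 0$.

2. By Jensen's inequality:

$$\frac{1}{m} \log E(\exp(m \langle \lambda, h(X_1, \dots, X_m) \rangle)) \ge \frac{1}{m} \, m \, E(\langle \lambda, h \rangle) = \langle \lambda, \theta \rangle \tag{16}$$

for all $\lambda \in \mathbb{R}^d$. Thus $\langle \lambda, \theta \rangle - C(\lambda) \le 0$ for all $\lambda \in \mathbb{R}^d$ and therefore $C^*(\theta) \le 0$. Since $C^* \ge 0$, it follows that $C^*(\theta) = 0$.

3. Consider $K_L := \{ x \in \mathbb{R}^d : C^*(x) \le L \}$. Since $C^*$ is lower semicontinuous, $K_L$ is closed, and for $y \in K_L$

$$\langle z, y \rangle \le C(z) + C^*(y) \le C(z) + L$$

for any $z \in \mathbb{R}^d$; thus for $R > 0$ there exists a suitable $A$ such that

$$\sup_{\|z\| \le R} \langle z, y \rangle \le \sup_{\|z\| \le R} C(z) + L \le A < \infty.$$

Hence $K_L$ is bounded.

4. If $d = 1$, the monotonicity behaviour is obviously true. Moreover for $x \ge \theta$ and $\lambda \le 0$, by (16),

$$\lambda x - C(\lambda) \le \lambda x - \lambda \theta = \lambda (x - \theta) \le 0$$

and it follows that $0 \le C^*(x) = \sup_{\lambda \ge 0} \{ \lambda x - C(\lambda) \}$. The case $x \le \theta$ is done analogously. □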

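Proof: (of Theorem 3.1) Step 1: Let $d = 1$ and $k = \left[ \frac{n}{m} \right]$. For $\lambda \ge 0$ we apply Lemma 2.2:

$$P(U_n \in [a, \infty)) \le \exp(-\lambda a) E(\exp(\lambda U_n)) \le \exp(-\lambda a) \big( M_{\lambda/k}(h) \big)^k.$$

Therefore substituting $\lambda$ by $n\lambda$ we have

$$\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \in [a, \infty)) \le -\lambda a + \frac{1}{m} \log M_{m\lambda}(h).$$

Now suppose that $a \ge \theta$; then as a consequence of Lemma 3.3 (4) it follows that

$$\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \in [a, \infty)) \le -C^*(a) = -\Lambda_C([a, \infty)).$$

Clearly, a similar calculation yields

$$\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \in (-\infty, a]) \le -C^*(a)$$

for $a \le \theta$. Let $A \subset \mathbb{R}$ be closed with $A \cap [\theta, \infty) \ne \emptyset$ and $A \cap (-\infty, \theta] \ne \emptyset$, and let $i_+ := \inf\{ x \ge \theta : x \in A \}$ and $i_- := \sup\{ x \le \theta : x \in A \}$. Then

$$\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \in A) \le \limsup_{n \to \infty} \frac{1}{n} \log \big( P(U_n \in (-\infty, i_-]) + P(U_n \in [i_+, \infty)) \big) \le -\inf_{x \in A} C^*(x).$$

Step 2: Now we will prove the statement for vector-valued kernel functions: if $0 < \Lambda_C(A) < \infty$, we can cover $A$ by finitely many closed half-spaces:

$$A \subset \bigcup_{i=1}^k H_+(a_i, \Lambda_C(A) - \varepsilon)$$

for any $0 < \varepsilon < \Lambda_C(A)$, where $a_1, \dots, a_k$ are nonzero points in $\mathbb{R}^d$ and $H_+(a, b) := \{ x \in \mathbb{R}^d : \langle a, x \rangle - C(a) \ge b \}$, $a \in \mathbb{R}^d$, $b \in \mathbb{R}$. This is an immediate consequence of Lemma VII.4.1 of [12]. Then Chebyshev's inequality implies

$$P(U_n \in A) \le \sum_{i=1}^k P(U_n \in H_+(a_i, \Lambda_C(A) - \varepsilon)) = \sum_{i=1}^k P(\langle a_i, U_n \rangle \ge C(a_i) + \Lambda_C(A) - \varepsilon) \le \sum_{i=1}^k \exp(-n(C(a_i) + \Lambda_C(A) - \varepsilon)) E(\exp(n \langle a_i, U_n \rangle)). \tag{17}$$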

Again applying Corollary 2.3 we obtain

$$\limsup_{n \to \infty} \frac{1}{n} \log E(\exp(n \langle a_i, U_n \rangle)) \le \frac{1}{m} \log E(\exp(m \langle a_i, h \rangle)) = C(a_i).$$

Using $\log(a + b) \le \log 2 + \max(\log a, \log b)$ it follows that

$$\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \in A) \le -\Lambda_C(A) + \varepsilon;$$

since $0 < \varepsilon < \Lambda_C(A)$ is arbitrary, this yields our result. If $\Lambda_C(A) = \infty$, Lemma VII.4.1 ([12]) yields a finite covering of $A$ which makes the calculation (17) work. □

U-statistics and V-statistics associated with $h$ are closely related: if condition (5) holds, we obtain

$$\lim_{n \to \infty} \frac{1}{n} \log P(\| U_n - V_n \| > \varepsilon) = -\infty$$

(see [10]). Now for each $\varepsilon > 0$

$$\frac{1}{n} \log P(V_n \in A) \le \frac{1}{n} \log 2 + \max\Big\{ \frac{1}{n} \log P(U_n \in A^\varepsilon), \ \frac{1}{n} \log P(\| U_n - V_n \| > \varepsilon) \Big\},$$

where $A^\varepsilon := \{ x \in \mathbb{R}^d : d(x, A) < \varepsilon \}$, $d(x, y) = \| x - y \|$ and $d(x, A) := \inf\{ d(x, y) ; y \in A \}$. We obtain as a consequence of (15):

$$\limsup_{n \to \infty} \frac{1}{n} \log P(V_n \in A) \le -\lim_{\varepsilon \searrow 0} \inf_{x \in A^\varepsilon} C^*(x) = -\Lambda_C(A).$$

Using Lemma 3.3, the last equality follows by

Lemma 3.4 (2.1.2 [6]) Let $F : \mathbb{R}^d \to [0, \infty]$ be lower semicontinuous with compact level sets. Then for each closed set $A \subset \mathbb{R}^d$:

$$\inf_{x \in A} F(x) = \lim_{\varepsilon \searrow 0} \inf_{x \in A^\varepsilon} F(x).$$
Thus we have proved:

Corollary 3.5 Let $V_n$ be an $\mathbb{R}^d$-valued V-statistic. Assume condition (5). Then

$$\limsup_{n \to \infty} \frac{1}{n} \log P(V_n \in A) \le -\Lambda_C(A)$$

for every closed set $A \subset \mathbb{R}^d$.

At last we prove a general result for Banach-valued kernel functions $h : X^m \to B$, $B$ a separable real Banach space. Denote by $B^*$ the dual space of $B$ and by $\langle \cdot, \cdot \rangle$ the dual relation. Then we define analogously the Cramér functional as

$$\Lambda_C(A) := \inf_{x \in A} \sup_{\lambda \in B^*} \{ \langle \lambda, x \rangle - C(\lambda) \},$$

where $C(\lambda) = \frac{1}{m} \log E(\exp(m \langle \lambda, h(X_1, \dots, X_m) \rangle))$.

Theorem 3.6 Let $h$ be $B$-valued, measurable and symmetric. Assume the exponential boundedness condition (14) for $h$. If for every $L > 0$ there exists a compact set $K_L \subset B$ and an $\hat{n} \in \mathbb{N}$ such that

$$P(U_n \in K_L^c) \le \exp(-nL) \quad \text{for all } n \ge \hat{n},$$

where $K_L^c := B \setminus K_L$ (i.e. $(P \circ U_n^{-1})_n$ is exponentially tight), then

$$\limsup_{n \to \infty} \frac{1}{n} \log P(U_n \in A) \le -\Lambda_C(A)$$

for every closed subset $A \subset B$.

Proof: As a consequence of Corollary 2.3

$$E(\exp(n \langle \lambda, U_n \rangle)) \le \Big( E\big(\exp\big(\tfrac{n}{k} \langle \lambda, h \rangle\big)\big) \Big)^k, \tag{18}$$

hence we are in the situation of Theorem 2.1 of de Acosta ([5]). He has proved that the upper bounds for large deviations of dependent random vectors depend only on a limiting inequality for the "free-energy" term (18). □

Remarks 3.7

1. $E(\exp(\langle \lambda, h \rangle)) < \infty$ is a well-known consequence of the exponential boundedness condition (14).

2. Theorem 3.1 is included in Theorem 3.6, but our direct proof does not use any abstract elements of large deviations theory such as the concept of exponential tightness. Nevertheless, if a sequence $(P_n)_n$ of probability measures on a Polish space satisfies an LDP, then $(P_n)_n$ is exponentially tight by Lemma 2.6 of Lynch and Sethuraman ([17]). Thus with Theorem 2.4, especially under the stronger condition (5), the conditions of Theorem 3.6 are fulfilled. But in the $\mathbb{R}^d$-case we do not need to take this detour. Independently of this we can prove the exponential tightness directly:

Lemma 3.8 Let $U_n$ be an $\mathbb{R}^d$-valued U-statistic. Then $(P \circ U_n^{-1})_n$ is exponentially tight.

Proof: Let $K_L$ be the set $\{ x \in \mathbb{R}^d : \| x \| \le L \}$. $K_L$ is compact and $P(U_n \notin K_L) = P(\| U_n \| > L) \le$ [...]

[...] $> 0$ there exists $n_0$ such that for all $n \ge n_0$

$$\Big( \int_X \exp(n \Phi(x)) \, dP_n(x) \Big)^{1/n} \le \exp(L(\Phi) + \varepsilon),$$

thus (22) holds. (Corollary 5.5:) Combine Theorem 5.1 and Corollary 5.4, resp. Theorem 5.2 and Corollary 5.4. □

Of course, the proof of Corollary 5.5 is in some sense a diversion, because the asymptotic value method is an approach to large deviations, but we start with an LDP. Dinwoodie [7] has proved the identification (23) in Corollary 5.5 directly without using Bryc's theorems. Thus we will do so for the identification (24), which can be applied in the U-statistic case: Let $Y$, $Y^*$, $X$, $D$ and $\mathcal{G}$ be as in Theorem 5.2.

Theorem 5.6 (Rate representation in LD-theory)

If $(P_n)_n$ satisfies an LDP with rate function $I$ and if $L(F)$ exists for each $F \in \mathcal{G}$, then

$$I(x) = \sup\{ F(x) - L(F) ; F \in \mathcal{G} \}. \tag{25}$$

Corollary 5.7 Let $U_n$ be a U-statistic with $\mathbb{R}^d$-valued kernel function satisfying the exponential boundedness condition (5). Then the rate function in Theorem 2.4 has a representation as in (25).

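Proof: (of Corollary 5.7) $P \circ U_n^{-1}$ satisfies an LDP with rate $I$. Let $F \in \mathcal{G}$ with representation $\min_i (\langle \lambda_i, x \rangle + c_i)$; we can calculate for all $i \in \{1, \dots, n\}$:

$$\int \exp(nF(x)) \, dP \circ U_n^{-1}(x) \le \int \exp(n \langle \lambda_i, x \rangle + n c_i) \, dP \circ U_n^{-1} \le \exp(n c_i) \Big( \int \exp(m \langle \lambda_i, h \rangle) \, dP \Big)^k, \tag{26}$$

where $k = \left[ \frac{n}{m} \right]$, and thus

$$\Big( \int \exp(nF(x)) \, dP \circ U_n^{-1} \Big)^{1/n} \le \exp(c_i) \big( E(\exp(m \langle \lambda_i, h \rangle)) \big)^{1/m},$$

and as in Theorem 4.3, $P \circ U_n^{-1}$ admits the asymptotic value over $\mathcal{G}$. □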

Proof: (of Theorem 5.6) $Y^*$ separates points of $X$ by the Hahn-Banach theorem (see e.g. [11], Chapter 1) and since $D$ is dense and linear, it separates points of $X$, too. Moreover $\mathcal{G}$ contains the constants and is closed under finite minima, thus the Stone-Weierstrass theorem (see e.g. Schaefer, p. 243) can be applied. As a consequence of $L(F) < \infty$ for each $F \in \mathcal{G}$ we have

$$L(F) = \sup_{x \in X} \{ F(x) - I(x) \} = -\inf_{x \in X} \{ I(x) - F(x) \}.$$

To prove this variational formula, pick $\varepsilon > 0$; then there exists an $n_0$ such that for all $n \ge n_0$

$$\Big( \int_X \exp(nF(x)) \, dP_n(x) \Big)^{1/n} \le \exp(L(F) + \varepsilon),$$

thus by Lemma 4.2, $L(F) = \sup\{ F(x) - I(x) \}$. An immediate consequence is that

$$I(x) \ge \sup_{F \in \mathcal{G}} \{ F(x) - L(F) \}.$$

To show that $I(x) \le \sup_{F \in \mathcal{G}} \{ F(x) - L(F) \}$, we follow an earlier result of Dinwoodie [7], Theorem 2.1: Without loss of generality we can assume $I(x) > 0$. Let $\alpha < I(x)$ and choose $\varepsilon > 0$ such that $I(y) > \alpha$ for each $y \in B_{x,2\varepsilon} := \{ y : d(y, x) \le 2\varepsilon \}$ (this is possible by the lower semicontinuity of $I$). For a fixed $N \ge 0$ let $g_N$ be a continuous bounded function defined on $X$ such that $g_N(y) = 0$ in $B_{x,\varepsilon}$, $g_N(y) = -N$ in $(B_{x,2\varepsilon})^c$ and $-N \le g_N(y) \le 0$ for $\varepsilon < d(y, x) < 2\varepsilon$ (see Theorem 1.2 of [2]). Denote by $\bar{B}$ the closure of $B \subset X$. Then by a simple calculation

$$L(g_N) \le -\min\{ I(\bar{B}_{x,2\varepsilon}), N \}: \tag{27}$$

$$L(g_N) = \lim_{n \to \infty} \frac{1}{n} \log \Big( \int_{d(x,y) < 2\varepsilon} \exp(n g_N) \, dP_n + \int_{d(x,y) \ge 2\varepsilon} \exp(n g_N) \, dP_n \Big) \le \limsup_{n \to \infty} \frac{1}{n} \log \big( P_n(\bar{B}_{x,2\varepsilon}) + \exp(-nN) \big) \le \max\{ -I(\bar{B}_{x,2\varepsilon}), -N \}. \tag{28}$$

Since $(P_n)_n$ is exponentially tight by Lemma 2.6 of [17], we can choose a large compact set $\hat{K}$. Then $K := \hat{K} \cup \{x\}$ is compact, too. By the Stone-Weierstrass theorem (see e.g. [20]) there is a finite collection $\{ g_i \}_{i=1}^n$ of functions in $\mathcal{G}$ such that

$$\sup_{y \in K} | g_N(y) - \max_i g_i(y) | \le \varepsilon.$$

Moreover, since $g_N(y) \le 0$, passing to $g_i \wedge 0$ if necessary, we assume $g_i \le 0$ for all $i$ and we can choose $g_0 = 0$. Without loss of generality let $\max_i g_i(y) = 0$ for all $y \in B_{x,\varepsilon} \cap K$. By Lemma 3.2 of [3],

$$L(\max(g_1, \dots, g_n)) = \max(L(g_1), \dots, L(g_n)),$$

if $L(g_i)$ exists for some measurable functions $g_i(x)$, $1 \le i \le n$. Define $L_A(F) := \lim_n \frac{1}{n} \log \int_A \exp(nF(x)) \, dP_n(x)$ for suitable, measurable $F : X \to \mathbb{R}$. Then for each $j$

$$L(g_N) \ge L_K(g_N) \ge L_K(\max_i g_i(x)) - \varepsilon = \max_i L_K(g_i) - \varepsilon \ge L_K(g_j) - \varepsilon,$$

thus using (27) we get

$$\min\{ I(\bar{B}_{x,2\varepsilon}), N \} \le -L_K(g_j) + \varepsilon = -L_K(g_j) + g_j(x) + \varepsilon,$$

if we choose $j$ such that $\max_i g_i(x) = g_j(x) \ (= 0)$. Now if $I(\bar{B}_{x,2\varepsilon}) = \infty$ we get

$$N \le g_j(x) - L_K(g_j) + \varepsilon.$$

Since $\max_i L(g_i) \ge L(g_0) = 0$, by Lemma 4.1 of [3]:

$$\lim_{n \to \infty} \Big( \frac{1}{n} \log \int \exp(n \max_i g_i) \, dP_n - \frac{1}{n} \log \int_K \exp(n \max_i g_i) \, dP_n \Big) = 0$$

and hence

$$I(x) = \infty = \sup_{g \in \mathcal{G}} \{ g(x) - L(g) \}.$$

Otherwise choose $N > I(\bar{B}_{x,2\varepsilon})$ and by an argument of [7] we obtain

$$\alpha < I(\bar{B}_{x,2\varepsilon}) \le g_j(x) - L_K(g_j) + \varepsilon \le \sup_{g \in \mathcal{G}} \{ g(x) - L(g) \}.$$

Now $\alpha$ was chosen arbitrarily such that $0 < \alpha < I(x) < \infty$. It follows that

$$I(x) \le \sup_{g \in \mathcal{G}} \{ g(x) - L(g) \},$$

which proves the result. □

Acknowledgment: I would like to thank Matthias Löwe for helpful comments.

References

[1] Baxter, J.R.; Jain, N.C.: A comparison principle for large deviations; Proceedings of the American Mathematical Society 103, No. 4, 1235–1240 (1988)

[2] Billingsley, P.: Convergence of Probability Measures; Wiley & Sons, New York (1968)

[3] Bryc, W.: Large deviations by the asymptotic value method; Diffusion Processes and Related Problems in Analysis 1 (M. Pinsky, ed.), Birkhäuser, Boston, 447–472 (1990)

[4] Bryc, W.: On the large deviation principle for stationary weakly dependent fields; Ann. Probab. 20, 1004–1030 (1992)

[5] de Acosta, A.: Upper bounds for large deviations of dependent random vectors; Z. Wahrsch. verw. Geb. 69, 551–565 (1985)

[6] Deuschel, J.-D.; Stroock, D.W.: Large Deviations; Academic Press 137 (1989)

[7] Dinwoodie, I.H.: Identifying a large deviation rate function; Ann. Probab. 21, 216–231 (1993)

[8] Donsker, M.D.; Varadhan, S.R.S.: Asymptotic evaluation of certain Markov process expectations for large time (I, III, IV); Commun. Pure Appl. Math. 28, 1–47 (1975); 29, 389–461 (1976); 36, 182–212 (1983)

[9] Eichelsbacher, P.: Varadhans Prinzip großer Abweichungen für spezielle Statistik-Klassen; Dissertation, Universität Bielefeld (1992)

[10] Eichelsbacher, P.; Löwe, M.: Large deviation principle for m-variate von Mises-statistics and U-statistics; preprint 93-001, Sonderforschungsbereich 343, Diskrete Strukturen in der Mathematik, Universität Bielefeld (1993)

[11] Ekeland, I.; Temam, R.: Convex analysis and variational problems; North Holland, Amsterdam (1979)

[12] Ellis, R.S.: Entropy, large deviations and statistical mechanics; Grundlehren der Mathematischen Wissenschaften vol. 271, Springer (1985)

[13] Groeneboom, P.; Oosterhoff, J.; Ruymgaart, F.H.: Large deviation theorems for empirical probability measures; Ann. Probab. 7, 553–586 (1979)

[14] Hoeffding, W.: A class of statistics with asymptotically normal distribution; Ann. of Math. Stat. 19, 293–325 (1948)

[15] Hoeffding, W.: Probability inequalities for sums of bounded random variables; Journal of the American Statistical Association 58, 13–30 (1963)

[16] Löwe, M.: Exponential Inequalities and Principles of Large Deviations for U-Statistics and von Mises-Statistics; Dissertation, Universität Bielefeld (1992)

[17] Lynch, J.; Sethuraman, J.: Large deviations for processes with independent increments; Ann. Probab. 15, 610–627 (1987)

[18] von Mises, R.: On the asymptotic distribution of differentiable statistical functions; Ann. of Math. Stat. 18, 309–348 (1947)

[19] Sanov, J.N.: On the probability of large deviations of random variables; Selected Translations in Mathematical Statistics and Probability I, 214–244 (1957)

[20] Schaefer, H.: Topological Vector Spaces; Springer-Verlag, New York (1971)
