# 146 (12-02). Dipartimento di economia politica e metodi quantitativi, Università degli studi di Pavia, Via San Felice 5, I-27100 Pavia. December 2002.
LIMIT THEOREMS FOR PREDICTIVE SEQUENCES OF RANDOM VARIABLES
PATRIZIA BERTI, LUCA PRATELLI, AND PIETRO RIGO
Abstract. A new type of stochastic dependence for a sequence of random variables is introduced and studied. Precisely, (Xn)n≥0 is said to be predictive, with respect to a filtration (Gn)n≥0 such that Gn ⊃ σ(X0, . . . , Xn), if X0 is distributed as X1 and, for each n ≥ 0, (Xk)k>n is identically distributed given the past Gn. In case Gn = σ(X0, . . . , Xn), a result of Kallenberg implies that (Xn)n≥0 is exchangeable if and only if it is stationary and predictive. After giving some natural examples of non exchangeable predictive sequences, it is shown that (Xn)n≥0 is exchangeable if and only if (Xτ(n))n≥0 is predictive for any finite permutation τ of N, and that the distribution of a predictive sequence agrees with an exchangeable law on a certain sub-σ-field. Moreover, (1/n) ∑_{k=0}^{n−1} f(Xk) converges a.s. and in L^1 whenever (Xn)n≥0 is predictive and f is a real measurable function such that E[|f(X0)|] < ∞. As to the CLT, three types of random centering are considered. One of such centerings, significant in Bayesian prediction and discrete time filtering, is E[f(Xn+1)|Gn]. For each centering, convergence in distribution of the corresponding empirical process is analyzed under uniform distance.

2000 Mathematics Subject Classification. Primary 60B10, 60G09; Secondary 60F15.
Key words and phrases. Central limit theorem; Convergence: almost sure, in distribution, in σ(L^1, L^∞), stable; Empirical process; Exchangeability; Predictive distribution; Strong law of large numbers; Uniform limit theorem.
1. Introduction and motivations

In this paper, a new type of stochastic dependence for a sequence (Xn)n≥0 of random variables is introduced and studied. Precisely, suppose the Xn are defined on the probability space (Ω, A, P) with values in the measurable space (E, E), and G = (Gn)n≥0 is an increasing filtration such that σ(X0, . . . , Xn) ⊂ Gn ⊂ A. Then, (Xn)n is said to be G-predictive whenever X1 ∼ X0 and (1)
Xk ∼PH Xn+1 for all n, k ∈ N with n < k and all events H ∈ Gn with P (H) > 0;
here, N = {0, 1, . . .}, PH denotes the conditional probability P(·|H), “∼PH” means “distributed as, under PH”, and “∼” stands for “∼P”. In case Gn = σ(X0, . . . , Xn), the filtration is not mentioned at all, and (Xn)n is just called predictive. In general, if (Xn)n is G-predictive then it is predictive and identically distributed. An obvious, equivalent formulation of (1) is (2)
E[f (Xk )|Gn ] = E[f (Xn+1 )|Gn ] a.s. for all n, k ∈ N with n < k and all bounded measurable f : E → R.
Roughly speaking, (2) means that, at each time n, the future observations (Xk)k>n are identically distributed given the past Gn. Another equivalent formulation of (1) is

(3)  (E[f(Xn+1)|Gn])n is a G-martingale for every bounded measurable f : E → R.
By general results on martingales, (3) amounts to E[f(X_{T+1})] = E[f(X1)] for all N-valued G-stopping times T and bounded measurable f. Hence, one more equivalent formulation of (1) is (4)
XT +1 ∼ X1 for each N-valued G-stopping time T.
Note also that, when Gn = σ(X0 , . . . , Xn ), conditions (1)-(4) all reduce to (5)
[X0 , . . . , Xn , Xn+2 ] ∼ [X0 , . . . , Xn , Xn+1 ] for all n ∈ N.
Exchangeable sequences are identically distributed and meet (5), so that they are predictive. Indeed, exchangeability is the most significant case of predictivity. Predictive sequences, however, need not be exchangeable. In fact, by a remarkable result of Kallenberg (1988, Proposition 2.1), exchangeability amounts to stationarity and condition (5). In Kallenberg's paper (cf. Proposition 2.2), it is also shown that conditions (3), (4) and (5) are equivalent in case Gn = σ(X0, . . . , Xn). However, apart from these results, condition (5) is not systematically investigated. In the present paper, instead, we focus on G-predictivity. As a first motivation, we give some examples where predictivity naturally arises while exchangeability may fail.

Example 1.1 (Clinical trials). Suppose patients sequentially arrive at a surgery and Xn represents the information gained by visiting the n-th patient. Then, the conditional distribution of future observations is to be updated after visiting each patient, but, if patients are sufficiently homogeneous, the conditional distribution of a single future patient should not depend on the time he/she is visited. Namely, (Xn)n should be taken G-predictive for some filtration G. On the other hand, it may be that some factors (increase in knowledge, improvement in the available techniques) advise against assuming stationarity of (Xn)n. In some (albeit extreme) cases, it may even be that the Xn exhibit a certain aptitude to converge, and this conflicts with (weak) stationarity. As a matter of fact, if (Xn)n is weakly stationary, converges in probability, and (Xn²)n is uniformly integrable, then Xn = X0 for all n a.s..

Example 1.2 (Stopping and sampling). Let Xn = Z_{T∧n}, where (Zn)n is exchangeable and T is a random variable with values in N ∪ {∞}. Then, (Xn)n is not exchangeable apart from trivial cases, but is predictive under natural conditions on T. In fact, if (Zn)n is predictive (and not necessarily exchangeable), then (Xn)n is predictive whenever T is independent of (Zn)n, or whenever T is a predictable stopping time for (Zn)n (i.e., T = T′ + 1 for some stopping time T′ of the filtration induced by (Zn)n). Thus, typically, predictivity is preserved under stopping while exchangeability is not. We now prove that (Xn)n is predictive if (Zn)n is predictive and T is a predictable stopping time for (Zn)n. Let S be an N-valued stopping time for (Xn)n. Since T = T′ + 1, with T′ a stopping time for (Zn)n, one has T ∧ (S + 1) = (T′ ∧ S) + 1, where T′ ∧ S is an N-valued stopping time for (Zn)n. Thus, predictivity of (Zn)n yields

X_{S+1} = Z_{T∧(S+1)} = Z_{(T′∧S)+1} ∼ Z1 ∼ Z0 = X0.
Hence, X1 ∼ X0 (letting S = 0) and condition (4) holds with Gn = σ(X0, . . . , Xn), that is, (Xn)n is predictive. By the same argument, it can be shown that predictivity is preserved under (strictly increasing) sampling. That is, (Z_{Tn})n is predictive whenever (Zn)n is predictive and (Tn)n is a sequence of N-valued predictable stopping times for (Zn)n such that T0 < T1 < . . . a.s..

In Example 1.2, if T is a.s. finite, (Xn)n is eventually constant with probability 1, so that (Xn)n converges a.s. but in a trivial way. The next example exhibits a situation where a predictive (non exchangeable) sequence converges a.s. in a non-trivial way.

Example 1.3 (Compensated sum of independent random variables). Given the real numbers 0 < b0 ≤ b1 ≤ b2 ≤ . . . < c, let us define γii = c and γij = bi ∧ bj for i ≠ j. On noting that Γn = (γij)_{0≤i,j≤n} is a symmetric positive definite matrix, (Xn)n can be taken such that [X0, . . . , Xn] ∼ N(0, Γn) for each n. Then, X1 ∼ X0 and [X0, . . . , Xn, Xn+2] ∼ N(0, Γn+1) for all n ≥ 0, that is, (Xn)n is predictive (by condition (5)). However, E[(Xn − Xm)²] = 2(c − bn ∧ bm). Thus, (Xn)n is not weakly stationary unless bn = b0 for all n, and E[(Xn − X)²] → 0, for some random variable X, if c = lim_n bn. Further, Xn → X a.s. whenever ∑_n (c − bn)^r < ∞ for some r > 0 (since E[|Xn − X|^{2r}] = γr (c − bn)^r for some constant γr). To explain the title of the example, we note that it is a particular case of the following general scheme. Let (Zn)n, (Un)n be independent sequences of independent real random variables and let

Xn = ∑_{i=0}^{n} Zi + Un,    Gn = σ(Z0, U0, . . . , Zn, Un).
Suppose also that Un compensates ∑_{i=0}^{n} Zi, in the sense that Xn ∼ X0 for all n, and that the characteristic function φ_{X0} of X0 is null on a set with void interior. Fix 0 ≤ n < k and a bounded Borel function f : R → R. Then,

E[f(Xk)|Gn] = ∫ f(x + ∑_{i=0}^{n} Zi) µk(dx)  a.s.,

where µk is the distribution of Xk − ∑_{i=0}^{n} Zi. But µk = µ_{n+1}, since φ_{X0} is null on a set with void interior, and thus (Xn)n is G-predictive. For instance, given any non degenerate and infinitely divisible law µ, the sequences (Zn)n and (Un)n can be taken such that the resulting (Xn)n is predictive, non exchangeable and with X0 ∼ µ. Finally, to recover the first part of the example, just take Zn ∼ N(0, bn − b_{n−1}) and Un ∼ N(0, c − bn), where b_{−1} = 0.

Example 1.4 (Modified Polya urns). An urn contains w0 > 0 white and r0 > 0 red balls. At each time n ≥ 0, a ball is drawn and then replaced together with dn more balls of the same colour. Let Xn be the indicator of the event {white ball at time n}. Then, E(X0) = w0/(w0 + r0) and

E[Xn+1 | X0, d0, . . . , Xn, dn] = (w0 + ∑_{i=0}^{n} di Xi) / (w0 + r0 + ∑_{i=0}^{n} di)  a.s. for all n ≥ 0.
In the usual Polya scheme, dn = d0 for all n, where d0 ≥ 1 is a fixed integer, and (Xn)n turns out to be exchangeable. Here, instead, we let (dn)n be any sequence of random variables, with values in {1, 2, . . .}, satisfying:

(i) dn is independent of σ(Xi, dj : i ≤ n, j < n) for all n ≥ 0, or
(ii) d0 is degenerate and σ(dn) ⊂ σ(X0, . . . , X_{n−1}) for all n ≥ 1.

Then, (Xn)n is predictive, but (apart from particular cases) non exchangeable. For instance, if all the dn are degenerate, (Xn)n is not exchangeable unless dn = d0 for all n. To prove predictivity, it is enough to check that X1 ∼ X0 (which is true under both (i) and (ii)) and that (E(Xn+1|Gn))n is a G-martingale for some filtration G such that Gn ⊃ σ(X0, . . . , Xn). Suppose (i) holds and let Gn = σ(X0, d0, . . . , Xn, dn, dn+1). Then, dn+1 is Gn-measurable and (i) implies E(Xn+1|Gn) = E(Xn+1|X0, d0, . . . , Xn, dn) a.s.. Hence,

E[E(Xn+2|Gn+1) | Gn] = (w0 + ∑_{i=0}^{n} di Xi + dn+1 E(Xn+1|Gn)) / (w0 + r0 + ∑_{i=0}^{n+1} di) = E(Xn+1|Gn)  a.s..

A similar argument works under (ii), after setting Gn = σ(X0, . . . , Xn).

A second reason for studying G-predictivity, in addition to its possible utility for modelling real phenomena, is that it is a key condition in uniform limit theorems for predictive distributions and empirical processes from dependent data. Precisely, suppose E is a Polish space, E = B(E), D is any class of bounded Borel functions on E, and

an(f) = E[f(Xn+1)|Gn]  for all f ∈ D

denotes the predictive distribution. In various problems, mainly in Bayesian predictive inference, discrete time filtering and sequential procedures, the main goal is just evaluating an, and good approximations ãn for an are needed. See for instance Algoet (1992, 1995), Ould-Said (1997), Berti and Rigo (1997, 2002), Modha and Masry (1998). Usually, ãn is asked to satisfy a consistency condition of the type sup_{f∈D} |ãn(f) − an(f)| → 0 a.s.. A further request is that, for suitable normalizing constants cn, the limit distribution of cn(ãn − an) can be evaluated. Here, cn(ãn − an) is viewed as a process (indexed by D) with paths in l∞(D), the space of bounded functions on D equipped with uniform distance; cf. van der Vaart and Wellner (1996). In this framework, possible choices for ãn and cn are the empirical distribution µn = (1/(n+1)) ∑_{i=0}^{n} δ_{Xi} and cn = √n. So, it is of some interest to give conditions for

(6)  sup_{f∈D} |µn(f) − an(f)| → 0  a.s.;

(7)  √n (µn − an) converges in distribution to some known limit.
Note that (7) implies (6) if a.s. convergence is weakened to convergence in probability. In any case, G-predictivity is fundamental for both (6) and (7). Proving (7) for a G-predictive sequence is one of the concerns of this paper, to be done in Section 4. As to (6), suppose D is a (suitable) class of indicators and the µn converge uniformly on D with probability 1. Then, (6) holds whenever (Xn )n is G-predictive; see Berti, Mattei and Rigo (2002).
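As a concrete illustration of (6), the following minimal simulation sketch (not taken from the paper; the choices w0 = 2, r0 = 3 and dn i.i.d. uniform on {1, 2, 3} are illustrative assumptions, one instance of condition (i) of Example 1.4) tracks the empirical frequency µn(f) and the predictive probability an(f) = E[Xn+1|Gn] for the modified Polya urn, with f the indicator of a white draw:

```python
import numpy as np

# Simulation sketch of the modified Polya urn of Example 1.4.
# The values below (w0 = 2, r0 = 3, d_n i.i.d. uniform on {1, 2, 3}) are
# illustrative assumptions, giving one instance of condition (i).
rng = np.random.default_rng(0)
w0, r0, n_steps = 2.0, 3.0, 10_000

white, total = w0, w0 + r0
draws = []                                    # X_0, X_1, ... (indicators of "white")
for _ in range(n_steps):
    x = float(rng.random() < white / total)   # draw a ball
    d = int(rng.integers(1, 4))               # reinforcement d_n
    white += d * x                            # replace it plus d_n of the same colour
    total += d
    draws.append(x)

mu_n = np.mean(draws)     # empirical frequency  mu_n(f),  f = indicator of "white"
a_n = white / total       # predictive probability  a_n(f) = E[X_{n+1} | G_n]
print(f"mu_n = {mu_n:.4f}   a_n = {a_n:.4f}   |mu_n - a_n| = {abs(mu_n - a_n):.4f}")
```

On typical runs |µn(f) − an(f)| shrinks as the number of draws grows, which is the behaviour (6) asks for in this single-function case.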
To sum up, G-predictivity seems interesting enough to deserve a systematic study, both from the theoretical and the applied point of view. This task is accomplished here from the first point of view, with special attention to limit theorems. The paper is organized into three further sections. Suppose (Xn)n is G-predictive and f : E → R is a measurable function. In Section 2, a few basic facts on (Xn)n are listed, including the SLLN. Indeed, (1/n) ∑_{i=0}^{n−1} f(Xi) converges a.s. and in L^1 whenever E[|f(X0)|] < ∞. A related result is that the distribution of (Xn)n coincides with an exchangeable law on a certain sub-σ-field of E^N. Moreover, (Xn)n is exchangeable if and only if (Xτ(n))n is predictive for any (finite) permutation τ of N. Section 3 includes versions of the CLT. Stable convergence (in particular, convergence in distribution) of √n ((1/n) ∑_{i=0}^{n−1} f(Xi) − Ln) is investigated for three different choices of the random centering Ln. In particular, conditions are given for convergence in distribution of

√n (µn(f) − an(f)) = √n ( (1/(n+1)) ∑_{i=0}^{n} f(Xi) − E[f(Xn+1)|Gn] ).

Such conditions, incidentally, work in Examples 1.3 and 1.4. Section 4 is devoted to uniform limit theorems. For each centering considered in Section 3, convergence in distribution of the corresponding empirical process is investigated under uniform distance. General statements for G-predictive sequences are obtained which, among other things, yield interesting (and possibly new) results in the particular case of exchangeable sequences.

2. Preliminary results and the SLLN

Let H be the class of measurable functions f : E → R such that E[|f(Xn)|] < ∞ for all n. We recall that, if f ∈ H and Y is an integrable random variable, f(Xn) → Y in the weak topology σ(L^1, L^∞) means E[f(Xn)Z] → E[Y Z] for each bounded random variable Z. Our starting point is a lemma on convergence in σ(L^1, L^∞) of G-predictive sequences.

Lemma 2.1. If (Xn)n is G-predictive and f ∈ H, then (f(Xn))n converges in σ(L^1, L^∞) to a random variable Vf which satisfies

(8)  E[Vf |Gn] = E[f(Xn+1)|Gn]  a.s. for every n ≥ 0.

Moreover, for any f1, . . . , fk bounded elements of H,

∏_{j=1}^{k} fj(X_{n+j−1}) → ∏_{j=1}^{k} V_{fj}  in σ(L^1, L^∞).
Proof. Fix f ∈ H. By (3), (E[f(Xn+1)|Gn])n is a G-martingale, and it is uniformly integrable since the Xn are identically distributed. Hence E[f(Xn+1)|Gn] → Vf, a.s. and in L^1, for some random variable Vf. In particular, Vf closes the martingale (E[f(Xn+1)|Gn])n, and thus condition (8) holds. Next, let Z be a bounded random variable. To prove convergence in σ(L^1, L^∞) of f(Xn) to Vf, it can be assumed that Z is Gm-measurable, for some m. Then

E[Vf Z] = lim_n E[E[f(Xn+1)|Gn] Z] = lim_n E[f(Xn+1) Z],
that is, f(Xn) → Vf in σ(L^1, L^∞). Finally, let f1, . . . , fk be bounded elements of H, with k > 1. Arguing by induction, suppose that

∏_{j=1}^{k−1} fj(X_{n+j−1}) → ∏_{j=1}^{k−1} V_{fj}  in σ(L^1, L^∞).

Again, let Z be bounded and Gm-measurable. Then, for any n > m − k + 1, one obtains

E[ ∏_{j=1}^{k} fj(X_{n+j−1}) Z ] = E[ ∏_{j=1}^{k−1} fj(X_{n+j−1}) Z E[fk(X_{n+k−1})|G_{n+k−2}] ]
  = E[ ∏_{j=1}^{k−1} fj(X_{n+j−1}) Z E[V_{fk}|G_{n+k−2}] ]
  = E[ ∏_{j=1}^{k−1} fj(X_{n+j−1}) Z V_{fk} ].

Hence, since Z V_{fk} is a.s. bounded, the inductive assumption yields

lim_n E[ ∏_{j=1}^{k} fj(X_{n+j−1}) Z ] = lim_n E[ ∏_{j=1}^{k−1} fj(X_{n+j−1}) Z V_{fk} ] = E[ ∏_{j=1}^{k−1} V_{fj} Z V_{fk} ].
From now on, when (Xn)n is predictive or G-predictive and f ∈ H, Vf always denotes a version of the limit in σ(L^1, L^∞) of (f(Xn))n. We now give the SLLN for predictive sequences.

Theorem 2.2 (SLLN). If (Xn)n is predictive and f ∈ H, then

(1/n) ∑_{k=0}^{n−1} f(Xk) → Vf  a.s. and in L^1.
Proof. Since (f(Xn))n is uniformly integrable, it is enough to prove a.s. convergence, and since f = f⁺ − f⁻ it can be assumed that f ≥ 0. Let Gn = σ(X0, . . . , Xn), fn = f I{f≤n} and Yn = fn(Xn). Since V_{fn} ↑ Vf a.s., one has E[Yn+1|Gn] = E[V_{fn+1}|Gn] → Vf a.s.. Thus, it suffices to prove

(a)  ∑_{n≥0} (Y_{n+1} − E[Y_{n+1}|Gn]) / (n+1)  converges a.s.,
(b)  P(f(Xn) ≠ Yn i.o.) = 0.

In fact, under (a)-(b), the Kronecker lemma implies (1/n) ∑_{k=0}^{n−1} (f(X_{k+1}) − E[Y_{k+1}|Gk]) → 0 a.s., and thus (1/n) ∑_{k=0}^{n−1} f(X_{k+1}) → Vf a.s.. It follows that (1/n) f(Xn) → 0 a.s. and (1/n) ∑_{k=0}^{n−1} f(Xk) → Vf a.s.. We now prove (a) and (b). Condition (a) follows from the martingale convergence theorem, after noting that

Zn = ∑_{k=0}^{n−1} (Y_{k+1} − E[Y_{k+1}|Gk]) / (k+1)
is a martingale and

sup_n E[Zn²] = ∑_{n≥0} (n+1)^{−2} E[(Y_{n+1} − E[Y_{n+1}|Gn])²]
  ≤ ∑_{n≥0} (n+1)^{−2} E[Y²_{n+1}]
  ≤ 2 ∑_{n≥0} (n+1)^{−2} ∫_0^{n+1} x P(f(X0) > x) dx
  ≤ 2 ∑_{n≥0} (n+1)^{−2} ∑_{k=1}^{n+1} k P(f(X0) > k−1)
  = 2 ∑_{k=1}^{∞} k P(f(X0) > k−1) ∑_{n=k−1}^{∞} (n+1)^{−2}
  ≤ 4 ∑_{k=1}^{∞} P(f(X0) > k−1) ≤ 4 [1 + E[f(X0)]].
Condition (b) follows from the Borel-Cantelli lemma, since

∑_{n≥0} P(f(Xn) ≠ Yn) = ∑_{n≥0} P(f(X0) > n) ≤ 1 + E[f(X0)].
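As a numerical illustration of Theorem 2.2 (a sketch, not part of the paper), consider Example 1.3 with the specific choices c = 1 and bn = 1 − 2^{−(n+1)}; then ∑_n (c − bn) < ∞, so Xn → X a.s., and with f the identity one has Vf = X, so the running average (1/n) ∑_{k=0}^{n−1} Xk should approach the same random limit as Xn itself.

```python
import numpy as np

# Sketch of Theorem 2.2 on Example 1.3 with the illustrative choices
# c = 1 and b_n = 1 - 2**-(n+1):  Z_n ~ N(0, b_n - b_{n-1}),  U_n ~ N(0, c - b_n),
# all independent, and X_n = Z_0 + ... + Z_n + U_n  (so [X_0,...,X_n] ~ N(0, Gamma_n)).
rng = np.random.default_rng(1)
c, N = 1.0, 5_000

b = 1.0 - 2.0 ** -(np.arange(N) + 1.0)     # b_0, ..., b_{N-1}
b_prev = np.concatenate(([0.0], b[:-1]))   # with b_{-1} = 0
Z = rng.normal(0.0, np.sqrt(b - b_prev))   # Z_n ~ N(0, b_n - b_{n-1})
U = rng.normal(0.0, np.sqrt(c - b))        # U_n ~ N(0, c - b_n)
X = np.cumsum(Z) + U                       # X_0, ..., X_{N-1}

running_avg = np.cumsum(X) / np.arange(1, N + 1)
print(f"X_n for large n : {X[-1]:.4f}")
print(f"running average : {running_avg[-1]:.4f}")   # close to the same random limit
```

Running the script several times gives different limits (the limit is random), but within each run the two printed numbers are close.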
A further relevant property of predictive sequences, which follows from results of Aldous (1985), is asymptotic exchangeability. By a Markov kernel on (X, R) × (Y, S), where (X, R) and (Y, S) are any measurable spaces, we mean a function M on X × S such that M(x, ·) is a probability on S for x ∈ X and M(·, A) is R-measurable for A ∈ S. A sequence (Zn)n of random variables, defined on (Ω, A, P) and taking values in a metric space Z, converges stably if, for every H ∈ A with P(H) > 0, (Zn)n converges in distribution under PH. If (Zn)n converges stably and Z is Polish, each limit law µH (where H ∈ A and P(H) > 0) can be represented as µH(·) = ∫ M(ω, ·) PH(dω) for some Markov kernel M on (Ω, σ(Z0, Z1, . . .)) × (Z, B(Z)). In this case, we say that (Zn)n converges stably to the Markov kernel M. See Hall and Heyde (1980), Aldous (1985), Letta and Pratelli (1996).

Theorem 2.3. Suppose E is a Polish space, E = B(E) and (Xn)n is predictive. Then, there is a Markov kernel M on (Ω, σ(X0, X1, . . .)) × (E^N, E^N) such that [Xn, Xn+1, . . .] converges stably to M, and M(ω, ·) is i.i.d. for all ω ∈ Ω (that is, the coordinate projections on E^N are independent and identically distributed under the law M(ω, ·)).

Proof. Let αn be a Markov kernel on (Ω, σ(X0, . . . , Xn)) × (E, E) such that αn(·, B) is a version of E[I{Xn+1∈B}|X0, . . . , Xn] for all B ∈ E. Since (Xn)n is predictive,

E[αn+1(·, B)|X0, . . . , Xn] = αn(·, B)  a.s. for all n ≥ 0 and B ∈ E.

By Lemma (7.14) of Aldous (1985, p. 53), P(αn → α weakly) = 1 for some Markov kernel α on (Ω, σ(X0, X1, . . .)) × (E, E). Next, fix m ≥ 0, H ∈ σ(X0, . . . , Xm) with P(H) > 0, and take a Markov kernel βn on (Ω, σ(X0, . . . , Xn)) × (E, E) such that βn(·, B) is a version of E_{PH}[I{Xn+1∈B}|X0, . . . , Xn] for all B ∈ E. Since H ∈ σ(X0, . . . , Xm), for n ≥ m one has βn(ω, ·) = αn(ω, ·) for almost all ω ∈ H. Thus, PH(βn → α weakly) = 1. By Lemma (8.2) of Aldous (1985, p. 61),
[Xn, Xn+1, . . .] converges in distribution under PH to the exchangeable law µH characterized by

µH(B0 × B1 × . . .) = ∫ (α(ω, B0) α(ω, B1) · · · ) PH(dω).

By standard arguments, it follows that [Xn, Xn+1, . . .] converges stably to the Markov kernel M on (Ω, σ(X0, X1, . . .)) × (E^N, E^N) characterized by

M(ω, B0 × B1 × . . .) = α(ω, B0) α(ω, B1) · · ·   for all ω.
By Theorem 2.3, [Xn, Xn+1, . . .] converges in distribution under PH to an exchangeable law whenever (Xn)n is predictive and H is a non null event. Incidentally, this fact directly implies Kallenberg's result that exchangeability is equivalent to stationarity and predictivity; cf. Section 1. In fact, if (Xn)n is stationary and predictive, the distribution of [Xn, Xn+1, . . .] does not depend on n (by stationarity) and converges weakly to an exchangeable law (by predictivity and Theorem 2.3).

One more consequence of Theorems 2.2 and 2.3 is that any predictive law on E^N is exchangeable on a suitable sub-σ-field. Let πn be the n-th coordinate projection on E^N and V the σ-field on E^N generated by

lim sup_n (1/n) [f(π0) + · · · + f(π_{n−1})]  for all bounded f ∈ H.

Theorem 2.4. Suppose E is a Polish space, E = B(E) and (Xn)n is predictive. Then, the probability distribution λ of (Xn)n coincides on V with the exchangeable law ∫ M(ω, ·) P(dω), where M is the Markov kernel involved in Theorem 2.3.

Proof. By Theorem 2.3, [Xn, Xn+1, . . .] converges in distribution to the exchangeable law µ(·) = ∫ M(ω, ·) P(dω). Given bounded continuous functions g1, . . . , gk on E, this fact and Lemma 2.1 imply

∫ ∏_{j=1}^{k} gj ◦ πj dµ = lim_n E[ ∏_{j=1}^{k} gj(X_{n+j}) ] = E[ ∏_{j=1}^{k} V_{gj} ].

Let V0 be the σ-field on E^N generated by lim sup_n [g(π0) + . . . + g(π_{n−1})]/n for all bounded continuous g on E, and let h be any product of generators of V0, i.e.,

h = ∏_{j=1}^{k} lim sup_n (1/n) ∑_{i=0}^{n−1} gj ◦ πi,  where g1, . . . , gk are bounded continuous.

By Theorem 2.2, lim sup_n (1/n) ∑_{i=0}^{n−1} gj(Xi) = V_{gj} a.s., and thus ∫ h dλ = E[ ∏_{j=1}^{k} V_{gj} ]. On the other hand, exchangeability of µ implies

∫ h dµ = lim_n ∫ ∏_{j=1}^{k} (1/n) ∑_{i=0}^{n−1} gj ◦ πi dµ = lim_n (1/n^k) ∫ ∏_{j=1}^{k} ∑_{i=0}^{n−1} gj ◦ πi dµ
  = ∫ ∏_{j=1}^{k} gj ◦ πj dµ = E[ ∏_{j=1}^{k} V_{gj} ] = ∫ h dλ.
Hence, λ = µ on V0. To conclude the proof, it suffices to show that V ⊂ σ(V0 ∪ N), where N = {A ∈ E^N : λ(A) = µ(A) = 0}. Let γ = (λ + µ)/2. Given a bounded measurable f on E and ε > 0, there is a bounded continuous g such that ∫ |f(π0) − g(π0)| dγ < ε. Since (πn)n is predictive under γ, Theorem 2.2 implies

∫ | lim sup_n (1/n) ∑_{i=0}^{n−1} f(πi) − lim sup_n (1/n) ∑_{i=0}^{n−1} g(πi) | dγ
  = lim_n ∫ | (1/n) ∑_{i=0}^{n−1} (f(πi) − g(πi)) | dγ ≤ ∫ |f(π0) − g(π0)| dγ < ε.

This concludes the proof.
Remark 2.5. In Theorem 2.4, V cannot be replaced by the invariant σ-field of (πn)n. In fact, if the distribution of (Xn)n agrees with an exchangeable law µ on the invariant σ-field of (πn)n, then

P(∃ lim Xn) = µ(∃ lim πn) = µ(∃ m with πn = πm for all n ≥ m) = P(∃ m with Xn = Xm for all n ≥ m),

where the second equality is due to exchangeability of µ. But there are predictive sequences for which

P(∃ lim Xn) = 1 > 0 = P(∃ m with Xn = Xm for all n ≥ m),

for instance the one exhibited in Example 1.3. It follows that, unlike in the exchangeable case, the invariant σ-field of (πn)n and V do not have the same completion under an arbitrary predictive law on E^N.

In view of the previous results, the empirical measures µn converge weakly with probability 1. The limit is the marginal of M, where M is the Markov kernel involved in Theorem 2.3, that is, M0(ω, A) = M(ω, {π0 ∈ A}) for all ω ∈ Ω and A ∈ E.

Corollary 2.6. If E is a Polish space, E = B(E) and (Xn)n is predictive, then

µn(ω, ·) = (1/(n+1)) ∑_{i=0}^{n} δ_{Xi(ω)} → M0(ω, ·)  weakly, for almost all ω ∈ Ω.
Proof. For all H ∈ A and A ∈ E with ∫ M0(ω, ∂A) P(dω) = 0, one obtains

∫_H M0(ω, A) P(dω) = lim_n P(H ∩ {Xn ∈ A})   by Theorem 2.3
  = E[I_H V_{I_A}]   by Lemma 2.1
  = E[I_H lim_n µn(·, A)]   by Theorem 2.2.
Hence, M0(·, A) = lim_n µn(·, A) a.s. for all such A, and since E is separable this concludes the proof.

We close this section with a characterization of exchangeability in terms of predictivity.

Theorem 2.7. Let I be the invariant σ-field of (Xn)n. The following statements are equivalent:
(i) (Xn)n is exchangeable;
(ii) for any H ∈ I with P(H) > 0, (Xn)n is predictive under PH;
(iii) for any (finite) permutation τ on N, (Xτ(n))n is predictive.
Proof. (i)⇒(ii). Obvious.

(ii)⇒(iii). Fix f ∈ H and note that, by Theorem 2.2, Vf can be taken to be I-measurable. Hence, by (ii), E[f(X0)|I] = Vf a.s.. Further, let n ∈ N, H ∈ σ(X0, . . . , Xn) and K ∈ I. For all k > n, condition (ii) implies E[f(Xk) I_H I_K] = E[f(Xn+1) I_H I_K]. Since f(Xk) → Vf in σ(L^1, L^∞) as k → ∞, one also obtains E[Vf I_H I_K] = E[f(Xn+1) I_H I_K]. Since Vf is I-measurable, this implies

E[f(Xn+1)|I] = Vf = E[f(Xn+1)|I ∨ σ(X0, . . . , Xn)]  a.s..

Thus, (Xn)n is exchangeable, and clearly this implies condition (iii).

(iii)⇒(i). We prove that, for any distinct integers n0, . . . , np, any r ≥ 1 and any m > max(n0, . . . , np),

(9)  E[ ∏_{j=1}^{r} fj(X_{m+j}) ∏_{l=0}^{p} gl(X_{nl}) ] = E[ ∏_{j=1}^{r} fj(X_{m+j}) ∏_{l=0}^{p} gl(Xl) ],

where f1, . . . , fr, g0, . . . , gp are bounded elements of H. We argue by induction on p ≥ 0. When p = 0, condition (9) follows by applying (5) to the predictive sequence (Xτ(n))n, where τ is a finite permutation such that τ(j) = m + j for j = 1, . . . , r, τ(r + 1) = n0, and τ(r + 2) = 0. Suppose now that (9) holds for some p. We have to prove

(10)  E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p+1} gl(X_{nl}) ] = E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p+1} gl(Xl) ]

for any k ≥ 1 and any m > max(n0, . . . , np, n_{p+1}). Let τ1 denote a finite permutation such that τ1(j) = m + j for j = 1, . . . , k, τ1(k + 1 + l) = nl for l = 0, . . . , p + 1, and τ1(k + 3 + p) = m + k + 1. Since (Xτ1(n))n is predictive, one has

E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p+1} gl(X_{nl}) ] = E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p} gl(X_{nl}) g_{p+1}(X_{n_{p+1}}) ]
  = E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p} gl(X_{nl}) g_{p+1}(X_{m+k+1}) ].

Let τ2 be a finite permutation such that τ2(j) = m + j for j = 1, . . . , k, τ2(k + 1 + l) = l for l = 0, . . . , p + 1, and τ2(k + 3 + p) = m + k + 1. Since (Xτ2(n))n is predictive, one also has

E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p+1} gl(Xl) ] = E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p} gl(Xl) g_{p+1}(X_{p+1}) ]
  = E[ ∏_{j=1}^{k} fj(X_{m+j}) ∏_{l=0}^{p} gl(Xl) g_{p+1}(X_{m+k+1}) ].
Hence, (10) follows from (9) with r = k + 1 and fk+1 = gp+1 .
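To make the gap between predictivity and exchangeability concrete, the following enumeration sketch (ours, purely illustrative) computes exact path probabilities for the modified Polya urn of Example 1.4 with w0 = r0 = 1 and degenerate reinforcements d0 = 1, d1 = 2, d2 = 3, an instance of condition (ii): the marginal laws of X0, X1, X2 coincide, as predictivity requires, while the joint law is not invariant under permutations.

```python
from fractions import Fraction
from itertools import product

# Exact path probabilities for the modified Polya urn of Example 1.4 with the
# illustrative choices w0 = r0 = 1 and fixed reinforcements d = (1, 2, 3).
w0, r0 = Fraction(1), Fraction(1)
d = [Fraction(1), Fraction(2), Fraction(3)]

def path_prob(xs):
    """P(X_0 = xs[0], ..., X_{len(xs)-1} = xs[-1]) for a 0/1 path xs."""
    w, r, p = w0, r0, Fraction(1)
    for x, dn in zip(xs, d):
        p *= w / (w + r) if x else r / (w + r)
        if x:
            w += dn        # reinforce white
        else:
            r += dn        # reinforce red
    return p

paths = list(product((0, 1), repeat=3))
for i in range(3):         # identical marginals: each equals 1/2
    print(f"P(X_{i} = 1) =", sum(path_prob(xs) for xs in paths if xs[i] == 1))
print("P(X = (1,1,0)) =", path_prob((1, 1, 0)))   # 1/15
print("P(X = (0,1,1)) =", path_prob((0, 1, 1)))   # 1/10
```

Here P(X0 = 1, X1 = 1, X2 = 0) = 1/15 while P(X0 = 0, X1 = 1, X2 = 1) = 1/10, so this urn is predictive but not exchangeable, in line with the discussion of Example 1.4 and Theorem 2.7.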
3. Some CLT's for predictive sequences

In this section, stable convergence (in particular, convergence in distribution) of

√n ( (1/n) ∑_{k=0}^{n−1} f(Xk) − Ln )
is investigated for three different choices of the random centering Ln. In all cases, our main tool is the following version of the martingale CLT; cf. Hall and Heyde (1980, Theorem 3.2, p. 58). Let {Ynk : n ≥ 1, k = 1, . . . , kn} be an array of real square integrable random variables, where kn ↑ ∞, and for all n let Fn0 ⊂ Fn1 ⊂ · · · ⊂ Fnkn ⊂ A be σ-fields with Fn0 = {∅, Ω}. If

(i) σ(Ynk) ⊂ Fnk, E[Ynk | Fn,k−1] = 0 a.s., Fnk ⊂ Fn+1,k,
(ii) max_{1≤k≤kn} |Ynk| → 0 in probability, sup_n E[max_{1≤k≤kn} Ynk²] < ∞,
(iii) ∑_{k=1}^{kn} Ynk² → L in probability, for some real random variable L,

then ∑_{k=1}^{kn} Ynk converges stably. Precisely, under PH (with H ∈ A and P(H) > 0), ∑_{k=1}^{kn} Ynk converges in distribution to the probability law

µH(·) = ∫ N(0, L(ω))(·) PH(dω),

where N(0, L(ω))(·) denotes the Gaussian law with mean 0 and variance L(ω) (with N(0, 0) = δ0). We abbreviate this latter fact by saying that ∑_{k=1}^{kn} Ynk converges stably to the Gaussian kernel N(0, L).

Let us start with the case Ln = Vf.

Theorem 3.1 (CLT, case I). Suppose (Xn)n is predictive, f and f² are in H, and there exists an integer m ≥ 0 such that (f(Xn+m) − Vf)n is predictive. Then,

Wn,f = (1/√n) (f(X0) + · · · + f(X_{n−1}) − n Vf)

converges stably (in particular, in distribution) to the Gaussian kernel N(0, V_{f²} − (Vf)²).

Proof. For each n > m, define
Ynk = n^{−1/2} (f(X_{k+m−1}) − Vf)  and  Fnk = σ(Yn1, . . . , Ynk)  for k = 1, . . . , n − m.

Since (f(Xn+m) − Vf)n is predictive, condition (i) holds. By Theorem 2.2, condition (iii) holds with L = V_{f²} − (Vf)². As to (ii), first note that it can be equivalently written as:

(ii)'  (1/√n) max_{m≤k<n} |f(Xk)| → 0 in probability.

Let Ank = {|f(Xk)| > √n}. Then, (ii)' follows from

P( max_{m≤k<n} |f(Xk)| > √n ) ≤ (1/n) ∑_{k=m}^{n−1} E[I_{Ank} f²(Xk)] = ((n − m)/n) E[I_{Anm} f²(Xm)]