arXiv:1712.07361v1 [math.PR] 20 Dec 2017
Uniform rates of the Glivenko-Cantelli convergence and their use in approximating Bayesian inferences

Emanuele Dolera and Eugenio Regazzini
Università degli Studi di Pavia

Abstract

This paper deals with the problem of quantifying the approximation of a probability measure by means of an empirical (in a wide sense) random probability measure $\hat{p}_n$, depending on the first $n$ terms of a sequence $\{\tilde{\xi}_i\}_{i\geq 1}$ of random elements. In Section 2, one studies the range of oscillation near zero of the Wasserstein distance $d^{(p)}_{[S]}$ between $p_0$ and $\hat{p}_n$, assuming that the $\tilde{\xi}_i$'s are i.i.d. with $p_0$ as common law. Theorem 2.3 deals with the case in which $p_0$ is fixed as a generic element of the space of all probability measures on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and $\hat{p}_n$ coincides with the empirical measure $\tilde{e}_n := \frac{1}{n}\sum_{i=1}^n \delta_{\tilde{\xi}_i}$. In Theorem 2.4 (Theorem 2.5, respectively) $p_0$ is a $d$-dimensional Gaussian distribution (an element of a distinguished type of statistical exponential family, respectively) and $\hat{p}_n$ is another $d$-dimensional Gaussian distribution with estimated mean and covariance matrix (another element of the same family with an estimated parameter, respectively). These new results improve on allied recent works (see, e.g., [31]) since they also provide uniform bounds with respect to $n$, meaning that the finiteness of the $p$-moment of the random variable $\sup_{n\geq 1} b_n d^{(p)}_{[S]}(p_0, \hat{p}_n)$ is proved for some suitable diverging sequence $b_n$ of positive numbers. In Section 3, under the hypothesis that the $\tilde{\xi}_i$'s are exchangeable, one studies the range of the random oscillation near zero of the Wasserstein distance between the conditional distribution (also called posterior) of the directing measure of the sequence, given $\tilde{\xi}_1, \dots, \tilde{\xi}_n$, and the point mass at $\hat{p}_n$. In a similar vein, a bound for the approximation of predictive distributions is given. Finally, Theorems 3.3 to 3.5 reconsider Theorems 2.3 to 2.5, respectively, according to a Bayesian perspective.
1 Introduction and description of the results
A recurrent problem in different branches of science concerns the approximation of a probability measure (p.m., for short), either fixed or random, by means of another random p.m., say $\hat{p}_n$, of an empirical nature, as function of a random sample $\tilde{\xi}^{(n)} := (\tilde{\xi}_1, \dots, \tilde{\xi}_n)$ from the original measure. The accuracy of such approximation is usually measured in terms of a distinguished form of distance between p.m.'s, and the final objective is to study the stochastic convergence to zero of that distance between the original p.m. and $\hat{p}_n$, as $n$ goes to infinity. This problem arose in connection with the issue, of a statistical nature, of estimating an unknown probability distribution (p.d., for short) on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by means of the empirical frequency distribution. Thus, the original formulation considered $\tilde{\xi}^{(n)}$ as the initial $n$-segment of a sequence $\{\tilde{\xi}_i\}_{i\geq 1}$ of i.i.d. random numbers with common p.d. $p_0$, and $\hat{p}_n$ was chosen equal to $\tilde{e}_n := \frac{1}{n}\sum_{i=1}^n \delta_{\tilde{\xi}_i}$. In the same year and journal, Cantelli [11], Glivenko [32] and Kolmogorov [38]
published their respective solutions concerning the almost sure convergence to zero of the so-called uniform (or Kolmogorov) distance between p0 and ˜en , while de Finetti [28] gave a solution in terms
of the Lévy distance. Since this achievement was considered by some authors so important as to earn the title of “fundamental theorem of mathematical statistics”, a flourishing line of research developed from it with a view to providing refinements and extensions in various directions, such as: i) replacing $\tilde{e}_n$ with smoothed versions of it; ii) providing central limit theorems for the above-mentioned distances; iii) quantifying the almost sure convergence; iv) relaxing the i.i.d. assumption on the $\tilde{\xi}_i$'s. See, e.g., the books [22, 49, 55] for a comprehensive treatment of both the original problem and its developments. In particular, the very fruitful line of research (also pursued in the present paper) which aims at providing quantitative rates of the Glivenko-Cantelli convergence has never ceased to be investigated since the end of the Sixties, due to its several connections with pure and applied mathematics. See, for example, [1, 4, 7, 8, 12, 21, 31, 33, 36, 40, 52, 54, 58] along with the references therein for possible applications to statistics, informatics, numerical analysis, particle physics and other areas.

The present paper tackles the aforesaid quantitative problem for various forms of $\hat{p}_n$ by studying rates with respect to both $L^p$ and almost sure convergence. In Section 2 $p_0$ is fixed and the $\tilde{\xi}_i$'s are i.i.d. from $p_0$, while in Section 3 the $\tilde{\xi}_i$'s are assumed to be exchangeable. In both sections, as in many other papers, the discrepancy between two p.m.'s is measured in terms of the distance $d^{(p)}_{[S]}$ (footnote 1), defined as follows: given a separable metric space $(S, d_S)$, $p \geq 1$, and $\mu_1, \mu_2$ in $[S]_p := \{\mu \text{ p.m. on } (S, \mathcal{B}(S)) \mid \int_S [d_S(x, x_0)]^p \mu(dx) < +\infty \text{ for some } x_0 \in S\}$, one puts

$$d^{(p)}_{[S]}(\mu_1, \mu_2) := \inf_{\gamma \in \mathcal{F}(\mu_1, \mu_2)} \left( \int_{S^2} [d_S(x, y)]^p \gamma(dxdy) \right)^{1/p} \tag{1.1}$$
where $\mathcal{F}(\mu_1, \mu_2)$ stands for the class of all p.m.'s on $(S^2, \mathcal{B}(S^2))$ with $i$-th marginal equal to $\mu_i$, $i = 1, 2$. Recall that the couple $([S]_p, d^{(p)}_{[S]})$ turns out to be a separable metric space, which is complete if $(S, d_S)$ is also complete, provided that $(S, d_S)$ satisfies the additional Radon property, i.e. every p.m. on $(S, \mathcal{B}(S))$ has the compact inner approximation property. See Definition 5.1.4 and Proposition 7.1.5 in [3] for detailed statements. Henceforth, every time $(S, d_S)$ is not explicitly characterized, it will always be assumed to be a metric subspace of a complete and separable metric space $(\hat{S}, d_{\hat{S}})$ with $S \in \mathcal{B}(\hat{S})$, which entails the Radon property. See, e.g., Theorem 7.1.4 in [23].

Footnote 1. Nowadays, one of the most common appellations of $d^{(p)}_{[S]}$ is $p$-Wasserstein distance, but, for a correct attribution, see the Bibliographical Notes at the end of Chapter 6 of [57].

From a practical point of view, it is worth remarking that the use of such a metric is fully justified whenever the base distance $d_S$ is meaningfully suggested, or even imposed, by the metric features of the concrete problem under study. Indeed, without this condition, resorting to $d^{(p)}_{[S]}$ could turn out to be a merely conventional act. A reasoned appreciation of $d^{(p)}_{[S]}$ as a distance between p.m.'s which takes account of the geometry of the base space is shown in Chapter 6 of [57].

In Section 2 one fixes $p \geq 1$, $p_0$ in $[S]_p$, and defines the sequence $\{\tilde{\xi}_i\}_{i\geq 1}$ to be the coordinate random elements of $S^\infty$, providing a formal definition of the sample evoked at the beginning of the section. Henceforth, one confines oneself to considering forms of $\hat{p}_n$ which, like $\tilde{e}_n$, are presentable as a $\sigma(\tilde{\xi}^{(n)})/\mathcal{B}([S]_p)$-measurable function, for every $n$. After providing $(S^\infty, \mathcal{B}(S^\infty))$ with the law $p_0^\infty$ which makes the coordinates i.i.d. random elements with common p.d. $p_0$, and denoting expectation with respect to $p_0^\infty$ by the same symbol, one aims at determining an increasing sequence $b_n$ of positive numbers, with $\lim_{n\to+\infty} b_n = +\infty$, and positive constants $C_p(p_0)$ and $Y_p(p_0)$ for which

$$p_0^\infty\left[ \sup_{n\geq 1} \left( b_n d^{(p)}_{[S]}(p_0, \hat{p}_n) \right)^p \right] \leq C_p(p_0) \tag{1.2}$$

and

$$p_0^\infty\left[ \limsup_{n\to+\infty} b_n d^{(p)}_{[S]}(p_0, \hat{p}_n) \leq Y_p(p_0) \right] = 1 . \tag{1.3}$$
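As a concrete illustration of definition (1.1), the following minimal sketch computes the $p$-Wasserstein distance between two empirical measures on the real line having the same number of atoms. It relies on a standard fact about one-dimensional optimal transport (for $p \geq 1$ the infimum over couplings is attained by matching order statistics), which is not a result of this paper; the function name and the sample values are illustrative assumptions.

```python
def wasserstein_empirical_1d(xs, ys, p=1.0):
    """p-Wasserstein distance between (1/n) sum delta_{x_i} and (1/n) sum delta_{y_i} on R.

    Illustrative helper (hypothetical name): both samples must have the same size n.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    if p < 1:
        raise ValueError("p must be >= 1")
    n = len(xs)
    # On the real line with cost |x - y|^p, p >= 1, the monotone (sorted) coupling is optimal.
    total = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (total / n) ** (1.0 / p)


# Made-up samples, only to exercise the function.
sample_a = [0.0, 1.0, 3.0]
sample_b = [2.0, 0.5, 1.0]
w1 = wasserstein_empirical_1d(sample_a, sample_b, p=1.0)
w2 = wasserstein_empirical_1d(sample_a, sample_b, p=2.0)
```

The restriction to equal sample sizes keeps the optimal coupling a permutation; unequal sizes or dimensions $d > 1$ would require a genuine optimal-transport solver.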
In the sequel, the expression “$p_0$ satisfies (1.2)” (“$p_0$ satisfies (1.3)”, respectively) will be used as a shorthand to mean that, given $p_0$, (1.2) ((1.3), respectively) is in force with a suitable constant $C_p(p_0)$ ($Y_p(p_0)$, respectively). Then, the first main result (Theorem 2.3) states the validity of (1.2)-(1.3) when $\hat{p}_n = \tilde{e}_n$, $S = \mathbb{R}^d$, $d_S$ coincides with the standard Euclidean distance, $p \in [1, +\infty) \cap (d/2, +\infty)$ and $b_n \sim (n/\log\log n)^{1/(2p)}$, provided that $\int_{\mathbb{R}^d} |x|^{2p+\delta} p_0(dx) < +\infty$ holds for some $\delta > 0$.

To introduce the second topic of the same section, one should notice that $b_n$ increases more and more slowly as $d$ gets large. As noted in [1, 31, 40, 50, 58], this drawback is an intrinsic feature of the problem since $\lim_{n\to+\infty} n^{1/d}\, p_0^\infty\big[ d^{(1)}_{[\mathbb{R}^d]}(p_0, \tilde{e}_n) \big] > 0$ when $p_0$ coincides with the uniform measure on the cube $[-1, 1]^d$. This phenomenon is ascribable to the fact that the use of $\tilde{e}_n$ in approximating $p_0$ is thoroughly justified when there is no significant a priori restriction on the subclass of $[S]_p$ in which $p_0$ can be fixed, so that the slowdown of the divergence of $b_n$ for large dimensions seems to be the price to pay for this generality. In point of fact, many problems in applied mathematics and statistics provide enough information to restrict the aforesaid admissible class to a family $\mathcal{M}$ of distinguished p.m.'s $\mu_\theta$ on $(S, \mathcal{B}(S))$, determined up to some finite-dimensional parameter $\theta$ in $\Theta \subseteq \mathbb{R}^k$ so that $\theta \mapsto \mu_\theta$ is injective. In this framework, the approximation of $p_0 = \mu_{\theta_0}$ proves to be more natural within the elements of $\mathcal{M}$, so that one first approximates $\theta_0$ by a suitable random element $\hat{\theta}_n \in \Theta$, depending on $\tilde{\xi}^{(n)}$ according to well-known principles of statistical estimation, and then approximates $p_0$ by $\hat{p}_n = \mu_{\hat{\theta}_n}$. The second main result supports this last method of approximation, in contrast to the rougher one based on the choice of $\tilde{e}_n$, by showing that, if $\hat{p}_n = \mu_{\hat{\theta}_n}$, one can put $p = 2$ and $b_n \sim (n/\log\log n)^{1/2}$ in (1.2)-(1.3)
independently of the dimension of the $\tilde{\xi}_i$'s and of $\Theta$, at least when $\mathcal{M}$ coincides with the class of all non-singular multidimensional Gaussian distributions (Theorem 2.4) and, more generally, with a distinguished type of statistical exponential family (Theorem 2.5).

Before explaining Section 3, it is worth commenting on the value of the problems associated with (1.2)-(1.3) and their solutions contained in Theorems 2.3-2.5. First, the use of the distance $d^{(p)}_{[S]}$ connects the present problems with the so-called Vapnik-Chervonenkis theory. See [22, 55, 56]. However, the actual novelty of the present study consists in finding rates according to a concept of strong (i.e. uniform with respect to $n$) convergence, obtained by suitable applications of classical inequalities concerning the dominated ergodic theorem, due to Siegmund and Teicher. In fact, the current literature has investigated, until now, only non-uniform bounds like $p_0^\infty\big[ \big( d^{(p)}_{[S]}(p_0, \hat{p}_n) \big)^p \big] \leq C'(p_0)\alpha(n)$, valid for all $n \in \mathbb{N}$ and suitable positive constants $C'(p_0)$, whilst (1.2) entails, of course, $p_0^\infty\big[ \sup_{n\geq N} \big( d^{(p)}_{[S]}(p_0, \hat{p}_n) \big)^p \big] \leq C_p(p_0)/b_N^p$, for all $N \in \mathbb{N}$. In spite of the strengthening given by the latter inequality, the rate $\alpha(n)$ given in the literature turns out to be comparable with our $1/b_n^p$, in the sense that $\alpha(n)$ goes to zero slightly faster than $1/b_n^p$ under the same assumption $\int_{\mathbb{R}^d} |x|^{2p+\delta} p_0(dx) < +\infty$, at least for small $\delta > 0$. Cf. Theorem 1 in [31] and Theorem 13 in [33]. Moreover, apart from the obvious formal improvement, uniform bounds, which are strictly connected with the concept of uniform integrability, prove to be of crucial importance in those applications where one is interested in showing that the union of a collection of “bad events” (and not only the single “bad events”) has a small probability. Indeed, one can notice that (1.2) entails

$$p_0^\infty\Big[ \bigcup_{n\geq 1} \big\{ d^{(p)}_{[S]}(p_0, \hat{p}_n) > M/b_n \big\} \Big] = p_0^\infty\Big[ \sup_{n\geq 1} b_n d^{(p)}_{[S]}(p_0, \hat{p}_n) > M \Big] \leq C_p(p_0)/M^p$$

by means of Markov's inequality, so that, given any $\eta > 0$, one can choose $M_\eta$ large enough to obtain $p_0^\infty\big[ \bigcap_{n\geq 1} \{ d^{(p)}_{[S]}(p_0, \hat{p}_n) \leq M_\eta/b_n \} \big] \geq 1 - \eta$. On the other hand, certain applications to mathematical statistics and mathematical physics would prove completely justified only in the presence of such uniform bounds, as suggested, for example, by Wiener in relation to the problem of connecting von Neumann's ergodic theorem with Birkhoff's. As for our personal motivations for studying the uniform convergence displayed in (1.2), we mention the paper [14], which indicates a line of research (also pursued in the next Section 3) focused on some approximations of posterior and predictive distributions in Bayesian statistics. Indeed, we realized that it is just a condition of the same type as (1.2) that allows the application of a suitable martingale argument due to Blackwell and Dubins, so that we seize the opportunity to highlight this connection.

In Section 3, the coordinate random elements $\tilde{\xi}_1, \tilde{\xi}_2, \dots$ are assumed to be exchangeable, i.e.
the joint distribution of every finite subset of $n$ of these elements depends only on $n$ and not on the particular subset, for every $n \in \mathbb{N}$. This condition is obviously satisfied if the $\tilde{\xi}_i$'s are i.i.d. Actually, exchangeability is the suitable assumption to be made to translate the usual statistical condition of analogy (not necessarily identity) of the observations into precise probabilistic terms. It is well-known that, given any exchangeable p.m. $\rho$ on $(S^\infty, \mathcal{B}(S^\infty))$, there is an extension of the Glivenko-Cantelli theorem, due to de Finetti [29, 30], which states the existence of a random p.m., say $\tilde{p}$, from $S^\infty$ to $[S]$, such that $\rho(\{\tilde{e}_n \Rightarrow \tilde{p}, \text{ as } n \to \infty\}) = 1$, $\Rightarrow$ denoting the limit with respect to weak convergence of p.m.'s, and $[S]$ the class of all p.m.'s on $(S, \mathcal{B}(S))$ endowed with the topology generated by $\Rightarrow$. Moreover, the $\tilde{\xi}_i$'s turn out to be conditionally i.i.d. given $\tilde{p}$, with common p.d. equal to $\tilde{p}$. These two facts are encapsulated in the well-known de Finetti representation theorem, equivalently reformulated as follows (see [2]): for any exchangeable p.d. $\rho$ on $(S^\infty, \mathcal{B}(S^\infty))$, there exists one and only one p.m. $\pi$ on $([S], \mathcal{B}([S]))$, equal to the p.d. of $\tilde{p}$, such that $\rho(A) = \int_{[S]} p^\infty(A)\pi(dp)$ holds for any $A \in \mathcal{B}(S^\infty)$. In Bayesian statistics, the p.m. $\pi$ is usually called prior distribution. In this framework, one encounters two new questions concerning posterior and predictive distributions, i.e. regular conditional p.d.'s $\pi(\tilde{\xi}^{(n)}) := \pi(\tilde{\xi}^{(n)}, \cdot)$ and $p(\tilde{\xi}^{(n)}) := p(\tilde{\xi}^{(n)}, \cdot)$ of $\tilde{p}$ and $\{\tilde{\xi}_{n+i}\}_{i\geq 1}$, respectively, given $\tilde{\xi}^{(n)}$. Apropos of these questions, Section 3 completes and enriches the previous work [14] by extending some of the statements therein and by providing a Bayesian interpretation of the main results (1.2)-(1.3) formulated in Section 2. As to the approximation of $\pi(\tilde{\xi}^{(n)})$, the aim is to show that, if $\rho[C_p(\tilde{p})] < +\infty$ with the same $C_p$ as in (1.2), then

$$\rho\left[ \limsup_{n\to+\infty} b_n d^{(p)}_{[[S]_p]}\big( \pi(\tilde{\xi}^{(n)}), \delta_{\hat{p}_n} \big) \leq Y_p(\tilde{p}) \right] = 1 \tag{1.4}$$
holds with the same $Y_p$ as in (1.3), where $d^{(p)}_{[[S]_p]}$ is defined according to (1.1) with the proviso that $(S, d_S)$ therein is replaced by $([S]_p, d^{(p)}_{[S]})$. See Theorem 3.1 in Section 3. Notice that $d^{(p)}_{[[S]_p]}\big( \pi(\tilde{\xi}^{(n)}), \delta_{\hat{p}_n} \big) = \big( \int_{[S]} [d^{(p)}_{[S]}(p, \hat{p}_n)]^p \pi(\tilde{\xi}^{(n)}, dp) \big)^{1/p}$. It is also worth remarking that the choice of $d^{(p)}_{[[S]_p]}$ as reference distance establishes an interesting link between (1.3) and (1.4), which shows how the latter can be viewed as an extension of the former: when the $\tilde{\xi}_i$'s are i.i.d. with common p.d. $p_0$, $\rho$ coincides with $p_0^\infty$, $\pi(\tilde{\xi}^{(n)})$ is equal to $\delta_{p_0}$ with $\rho$-probability 1, and $d^{(p)}_{[[S]_p]}\big( \pi(\tilde{\xi}^{(n)}), \delta_{\hat{p}_n} \big) = d^{(p)}_{[S]}(p_0, \hat{p}_n)$.

The approximation of $p(\tilde{\xi}^{(n)})$ is a more delicate problem, here solved, consistently with the general lines and notation given in [14, 15], by considering conditional p.d.'s, say $q_m(\tilde{\xi}^{(n)}) := q_m(\tilde{\xi}^{(n)}, \cdot)$, of $\tilde{e}_{n,m} := \frac{1}{m}\sum_{i=1}^m \delta_{\tilde{\xi}_{i+n}}$ given $\tilde{\xi}^{(n)}$, for every $m \in \mathbb{N}$. Therefore, the main result concerning the approximation of the predictive distributions reads

$$\rho\left[ \limsup_{n\to+\infty} b_n d^{(1)}_{[[S]_1]}\big( q_m(\tilde{\xi}^{(n)}), \hat{p}_n^\infty \circ \tilde{e}_{n,m}^{-1} \big) \leq Y_p(\tilde{p}) \right] = 1 \tag{1.5}$$
for every $m \in \mathbb{N}$. See Theorem 3.2 in Section 3. After establishing these general facts, the next new results, i.e. Theorems 3.3 to 3.5, reconsider Theorems 2.3 to 2.5, respectively, according to the present Bayesian perspective. Thus, explicit expressions for $\hat{p}_n$, $b_n$, $C_p(\tilde{p})$ and $Y_p(\tilde{p})$ are provided in each of Theorems 3.3 to 3.5.

The achievement of (1.4)-(1.5) represents a specific asymptotic analysis of both posterior and predictive distributions, which is relevant with a view to a better understanding of different kinds of empirical approximation to orthodox Bayesian methods. In fact, the impressive growth of Bayesian statistics in the last decades has produced new complex models, in particular of nonparametric type, which, although appreciated for their predictive features, very often generate serious hurdles from the point of view of direct computations. These difficulties are very often circumvented by the use of unorthodox, but more manageable, Bayesian methods, such as: empirical Bayes (see [25, 39, 41, 45]); partial and profile likelihood (see [16, 48]); numerical techniques based on simulation of random quantities, like the bootstrap (see [24, 25]). In particular, contrasting $\hat{p}_n^\infty \circ \tilde{e}_{n,m}^{-1}$ to $q_m(\tilde{\xi}^{(n)})$ as in (1.5) is a cornerstone in bootstrap techniques. Very recently, the question of merging of orthodox and empirical Bayes procedures has been studied in [17, 44, 43, 47], but with a significant difference: (1.4)-(1.5) are reformulated therein by replacing $\rho$ with some hypothetical distribution $p_\star^\infty$, in the spirit of the what if method proposed in [19].

In conclusion, the ultimate aim of producing bounds like (1.4)-(1.5) is to quantify the degree of accuracy in the approximation ensuing from one of the aforesaid empirical techniques. In fact, for any $\epsilon, \eta > 0$, one can find $n_0 = n_0(\epsilon, \eta)$ and $L = L(\eta)$ such that

$$\rho\Big[ \bigcap_{n\geq n_0} \big\{ d^{(p)}_{[[S]_p]}\big( \pi(\tilde{\xi}^{(n)}), \delta_{\hat{p}_n} \big) \leq (L + \epsilon)/b_n \big\} \Big] \geq 1 - \eta$$

and

$$\rho\Big[ \bigcap_{n\geq n_0} \big\{ d^{(p)}_{[[S]_p]}\big( q_m(\tilde{\xi}^{(n)}), \hat{p}_n^\infty \circ \tilde{e}_{n,m}^{-1} \big) \leq (L + \epsilon)/b_n \big\} \Big] \geq 1 - \eta$$

are in force. Thus, one can observe that the determination of $L$ follows from the specific form of $Y_p(\tilde{p})$, while that of $n_0$ represents an interesting open problem to be tackled in future works.
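The Markov-inequality step used earlier in this section, in passing from (1.2) to a bound on the union of “bad events”, can be made explicit. The short derivation below is a sketch that only rearranges (1.2); the threshold $M_\eta$ shown is one admissible choice, not claimed to be optimal.

```latex
\[
p_0^\infty\Big[\sup_{n\ge 1} b_n\, d^{(p)}_{[S]}(p_0,\hat p_n) > M\Big]
 = p_0^\infty\Big[\sup_{n\ge 1}\big(b_n\, d^{(p)}_{[S]}(p_0,\hat p_n)\big)^p > M^p\Big]
 \le \frac{1}{M^p}\, p_0^\infty\Big[\sup_{n\ge 1}\big(b_n\, d^{(p)}_{[S]}(p_0,\hat p_n)\big)^p\Big]
 \le \frac{C_p(p_0)}{M^p} .
\]
\[
M_\eta := \Big(\frac{C_p(p_0)}{\eta}\Big)^{1/p}
\quad\Longrightarrow\quad
p_0^\infty\Big[\bigcap_{n\ge 1}\big\{ d^{(p)}_{[S]}(p_0,\hat p_n) \le M_\eta / b_n \big\}\Big] \ge 1-\eta .
\]
```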
2 Results for the i.i.d. case
This section contains four propositions. The first one, of a general character, is a re-statement of an inequality for normed sums of i.i.d. random numbers, originally due to Siegmund [51] and Teicher [53], sometimes referred to as dominated ergodic theorem. See also Section 10.3 in [13]. In the following version, this inequality is reinforced, with respect to the original statements, by the explicit characterization of the upper bound, which plays a crucial role in the proofs of the remaining theorems exhibited in the present section. These last results, in turn, provide affirmative answers to the achievement of (1.2)-(1.3) in three situations of interest.
Hence, one starts with the reformulation of the Siegmund-Teicher inequality, in view of its important role.

Proposition 2.1. Let $\{X_i\}_{i\geq 1}$ be a sequence of i.i.d. random numbers on $(\Omega, \mathcal{F}, \mathsf{P})$ with $\mathsf{E}[X_1] = 0$ and $\mathsf{E}[|X_1|^r] < +\infty$ for some $r > 2$. Then, after putting $\mathrm{Log}_2(x) := \mathbb{1}\{x < e^e\} + (\log\log x)\,\mathbb{1}\{x \geq e^e\}$, one has

$$\mathsf{E}\left[ \sup_{n\geq 1} \frac{\big| \sum_{i=1}^n X_i \big|^r}{(n\,\mathrm{Log}_2(n))^{r/2}} \right] \leq \sigma^r \left[ \alpha_0(r) + \alpha_1(r) \left( \frac{\mathsf{E}[|X_1|^r]}{\sigma^r} \right)^{\lceil r \rceil} \right] \tag{2.1}$$

with suitable constants $\alpha_0(r)$, $\alpha_1(r)$, $\sigma^2 := \mathsf{E}[X_1^2]$ and $\lceil r \rceil := \inf\{n \in \mathbb{N} \mid n \geq r\}$.

See Subsection 4.1 for the proof.

Remark 2.2. With a view to successive applications of this proposition, it is crucial to underline that the constants $\alpha_0(r)$ and $\alpha_1(r)$ appearing in the upper bound do not depend on the common law of the $X_i$'s. Precise expressions for these constants can be drawn from the combination of specific passages of the proof, culminating with inequality (4.8).
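To make the normalization in (2.1) concrete, the sketch below implements $\mathrm{Log}_2$ exactly as defined in Proposition 2.1 and evaluates the normalized supremum over a finite stretch of partial sums. The truncation to finitely many terms, the function names and the toy input are assumptions made for illustration; the proposition itself concerns the supremum over all $n \geq 1$.

```python
import math

E_TO_THE_E = math.exp(math.e)  # the threshold e^e, roughly 15.15


def log2_iterated(x):
    """Log2(x) := 1 if x < e^e, and log(log(x)) otherwise (continuous at x = e^e)."""
    return 1.0 if x < E_TO_THE_E else math.log(math.log(x))


def normalizer(n, r=2.0):
    """Denominator (n * Log2(n))^(r/2) appearing in (2.1)."""
    return (n * log2_iterated(n)) ** (r / 2.0)


def normalized_partial_sum_sup(xs, r):
    """max over n <= len(xs) of |X_1 + ... + X_n|^r / (n Log2(n))^(r/2).

    Finite-horizon stand-in (hypothetical helper) for the supremum in (2.1).
    """
    best, partial = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        partial += x
        best = max(best, abs(partial) ** r / normalizer(n, r))
    return best


toy_value = normalized_partial_sum_sup([1.0, -1.0, 1.0, -1.0], r=3.0)
```

With $r = 2p$-type exponents, the same `normalizer` yields the sequences $b_n = (n/\mathrm{Log}_2 n)^{1/(2p)}$ and $b_n = (n/\mathrm{Log}_2 n)^{1/2}$ used in the theorems of this section.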
Combination of (2.1) with certain inequalities concerning $d^{(p)}_{[S]}$, originally proved in [18] and reformulated more recently in [31], yields the following

Theorem 2.3. If $S = \mathbb{R}^d$ and $d_S$ coincides with the Euclidean distance, assume that, for some $\delta > 0$ and $p \in [1, +\infty) \cap (d/2, +\infty)$, $\int_{\mathbb{R}^d} |x|^{2p+\delta} p_0(dx) < +\infty$ holds. Then, (1.2)-(1.3) hold true with $\hat{p}_n = \tilde{e}_n$, $b_n = (n/\mathrm{Log}_2 n)^{1/(2p)}$ and

$$C_p(p_0) = \frac{K(p, d)C(r)}{1 - 2^{-\lambda}} \left[ 1 + \frac{2^p}{1 - 2^{-\sigma}} \Big( p_0^\infty\big[ |\tilde{\xi}_1|^{2p+\delta} \big] \Big)^{\frac{3}{r} - 1} \right] \tag{2.2}$$

$$Y_p(p_0) = \left\{ \frac{\sqrt{2}K(p, d)}{1 - 2^{-(p-d/2)}} \left[ 1 + \frac{2^p}{1 - 2^{-\delta/2}} \Big( p_0^\infty\big[ |\tilde{\xi}_1|^{2p+\delta} \big] \Big)^{1/2} \right] \right\}^{1/p} \tag{2.3}$$

where $K(p, d)$, $r$, $C(r)$, $\lambda$ and $\sigma$ are suitable numerical constants.

See Subsection 4.2 for the proof, as well as for the exact evaluation of $K(p, d)$, $r$, $C(r)$, $\lambda$ and $\sigma$. Before introducing the other results, it is worth noticing once again the troublesome dependence, in the previous theorem, of $b_n$ and $p$ on $d$, which appears to be unavoidable when $p_0$ is viewed as any element of the whole class $[\mathbb{R}^d]_{2p+\delta}$. See [1, 31, 40, 50, 58]. Then, one is naturally led to search for situations in which the aforesaid drawback does not occur. As already remarked in Section 1, there are many significant problems in which it is natural to constrain both $p_0$ and $\hat{p}_n$ within a distinguished class $\mathcal{M}$ strictly contained in $[S]$. By way of illustration, maintaining
the choice $S = \mathbb{R}^d$, one now considers the noteworthy case in which $\mathcal{M} = \{\mu_\theta\}_{\theta\in\Theta}$ coincides with the class of the non-degenerate Gaussian p.d.'s, namely

$$\mu_\theta(dx) = \left( \frac{1}{2\pi} \right)^{d/2} (\det(V))^{-1/2} \exp\left\{ -\frac{1}{2}\, {}^t(x - m)V^{-1}(x - m) \right\} dx \qquad (x \in \mathbb{R}^d) \tag{2.4}$$

where $\theta = (m, V)$, $\Theta = \mathbb{R}^d \times GL^+_{sym}(d)$ and $GL^+_{sym}(d)$ denotes the class of symmetric and positive-definite $d \times d$ matrices with real entries. As for the approximating p.d. $\hat{p}_n$, in agreement with the so-called plug-in method of statistical point estimation, put $\hat{p}_n = \mu_{\hat{\theta}_n}$ with $\hat{\theta}_n = (\hat{m}_n, \hat{V}_n)$ and

$$\hat{m}_n = \frac{1}{n}\sum_{i=1}^n \tilde{\xi}_i , \qquad \hat{V}_n = \frac{1}{n}\sum_{i=1}^n (\tilde{\xi}_i - \hat{m}_n)\, {}^t(\tilde{\xi}_i - \hat{m}_n) , \tag{2.5}$$

with the proviso that $\mu_{\hat{\theta}_1} = \delta_{\tilde{\xi}_1}$, to obtain the following

Theorem 2.4. If $S = \mathbb{R}^d$ and $d_S$ coincides with the Euclidean distance, then the Gaussian p.d. $p_0 = \mu_{\theta_0}$, defined in (2.4) with $\theta_0 = (m_0, V_0) \in \mathbb{R}^d \times GL^+_{sym}(d)$, satisfies (1.2)-(1.3) with $p = 2$, $\hat{p}_n = \mu_{\hat{\theta}_n}$, $\hat{\theta}_n = (\hat{m}_n, \hat{V}_n)$ given by (2.5), $b_n = (n/\mathrm{Log}_2 n)^{1/2}$, and

$$C_2(p_0) = C(d)\,\mathrm{tr}[V_0] \tag{2.6}$$

$$Y_2(p_0) = (1 + \sqrt{2})\sqrt{\mathrm{tr}[V_0]} , \tag{2.7}$$

where $C(d)$ is a numerical constant and $\mathrm{tr}$ denotes the trace of a matrix.

See Subsection 4.2 for the proof, which contains the elements to get a precise expression of $C(d)$. In any case, this constant is the only quantity that depends explicitly on the dimension $d$, and it would be interesting to show whether or not it is possible to get rid of this dependence as well, by choosing a uniform constant. In such a circumstance, Theorem 2.4 would be directly extended to the case in which $S$ is a separable Hilbert space endowed with the norm distance.
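The plug-in construction (2.5) can be sketched in a few lines for the special case $d = 1$. The closed-form expression used below for the 2-Wasserstein distance between two univariate Gaussian laws, $\sqrt{(m_1 - m_2)^2 + (\sqrt{v_1} - \sqrt{v_2})^2}$, is a standard fact brought in for illustration; whether the proof in this paper proceeds through it is not asserted here. Function names and the sample are hypothetical.

```python
import math


def plug_in_gaussian(sample):
    """Estimates (m_hat, V_hat) of (2.5) for d = 1; note the 1/n (not 1/(n-1)) normalization."""
    n = len(sample)
    m_hat = sum(sample) / n
    v_hat = sum((x - m_hat) ** 2 for x in sample) / n
    return m_hat, v_hat


def w2_gaussian_1d(m1, v1, m2, v2):
    """2-Wasserstein distance between N(m1, v1) and N(m2, v2) on R (variances >= 0).

    A zero variance is read as a point mass, matching the proviso for n = 1.
    """
    return math.sqrt((m1 - m2) ** 2 + (math.sqrt(v1) - math.sqrt(v2)) ** 2)


# Made-up observations; the reference law N(0, 1) is an assumption of this example.
sample = [0.2, -1.1, 0.7, 1.9, -0.4]
m_hat, v_hat = plug_in_gaussian(sample)
distance = w2_gaussian_1d(0.0, 1.0, m_hat, v_hat)
```

For $d > 1$ the analogous closed form involves matrix square roots of the covariance matrices, so a linear-algebra library would be needed.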
A remarkably interesting fact is that an analogous conclusion holds true for a rather general type of exponential family which, besides including the previous class of Gaussian distributions as a distinguished, but particular, case, enjoys important, global properties from the point of view of mathematical statistics. As to terminology and basic relevant results about this family, see, e.g., [5, 10]. To get to the point, consider a $\sigma$-finite measure $\mu$ on the metric space $(S, d_S)$ and a Borel measurable map $t : S \to \mathbb{R}^k$ such that: i) the interior $\Theta$ of the convex hull of the support of $\mu \circ t^{-1}$ is non-void, ii) the set

$$\Lambda := \Big\{ y \in \mathbb{R}^k \;\Big|\; \int_S e^{y\cdot t(x)} \mu(dx) < +\infty \Big\}$$

is a non-void open subset of $\mathbb{R}^k$, which proves to be convex. Then, setting $M(y) := \log \int_S e^{y\cdot t(x)} \mu(dx) : \mathbb{R}^k \to (-\infty, +\infty]$, Theorems 1.13, 2.2 and 2.7 in [10] state that $M$ is strictly convex on $\Lambda$, lower semi-continuous on $\mathbb{R}^k$, of class $C^\infty(\Lambda)$ and analytic, while Corollary 5.3 in [5] implies that $M$ is steep (essentially smooth). Hence, from Theorem 3.6 in [10], $y \mapsto V(y) := \nabla M(y) = \int_S t(x) \exp\{y \cdot t(x) - M(y)\}\mu(dx)$ defines a smooth homeomorphism from $\Lambda$ onto $\Theta$. Apropos of the last integral and the family (as $y$ varies in $\Lambda$) of p.m.'s $\gamma_y(A) := \int_A \exp\{y \cdot t(x) - M(y)\}\mu(dx)$, $A \in \mathcal{B}(S)$, it is worth recalling that Corollary 2.5 in [10] entails the equivalence of the following facts (identifiability): $\gamma_{y_1} = \gamma_{y_2}$; $y_1 = y_2$; $\int t\, d\gamma_{y_1} = \int t\, d\gamma_{y_2}$. This paves the way for the mean value re-parametrization, i.e.

$$\mu_\theta(dx) := \exp\{V^{-1}(\theta) \cdot t(x) - M(V^{-1}(\theta))\}\mu(dx) \qquad (\theta \in \Theta) \tag{2.8}$$

yielding a family of p.m.'s, indicated by $\mathcal{M}$, that is the subject of the next theorem. More specifically, $p_0$ will be a fixed element $\mu_{\theta_0}$ of $\mathcal{M}$, and $\hat{p}_n$ the element $\mu_{\hat{\theta}_n}$ corresponding to

$$\hat{\theta}_n := \frac{1}{n}\sum_{i=1}^n t(\tilde{\xi}_i) \in \Theta \tag{2.9}$$
which represents, in this case, the maximum likelihood estimator of the parameter $\theta$. Other concepts to be introduced are: first, the cumulant generating function, say $\Psi$, of $t(\tilde{\xi}_1)$, namely $\Psi(y) := \log \int_S e^{y\cdot t(x)} \mu_{\theta_0}(dx)$; second, the Legendre transformation, say $I_\mu$, of $M$, that is

$$I_\mu(\theta) := \sup_{y\in\Lambda} \{\theta \cdot y - M(y)\} = V^{-1}(\theta) \cdot \theta - M(V^{-1}(\theta)) ; \tag{2.10}$$

third, the Legendre transformation, say $I_{\theta_0}$, of $\Psi$, given by

$$I_{\theta_0}(\theta) := \sup_{y\in\Lambda} \{\theta \cdot y - \Psi(y)\} = I_\mu(\theta) - [\theta \cdot y_0 - M(y_0)] \tag{2.11}$$

where $y_0 = V^{-1}(\theta_0)$. See Section 26 of [46] and, in particular, Theorem 26.4. Moreover, from Theorem 26.6 therein, one can consider the function

$$\Phi(\eta) := \sqrt{ \int_0^1 (1 - s)\, \| \mathrm{Hess}[I_\mu](\theta_0 + s(\eta - \theta_0)) \|\, ds } \qquad (\eta \in \Theta) \tag{2.12}$$

where $\| \cdot \|$ stands for the Frobenius norm of a matrix. It may be useful to observe that possible unboundedness of $\Phi(\eta)$ is caused by either one or both of the following facts: $|\eta|$ is large; $\eta$ approaches $\partial\Theta$. For the sake of definiteness, one assumes the existence of a pair of positive functions $\Phi_1$ and $\Phi_2$ on $[0, +\infty)$ and $\Theta$, respectively, satisfying

$$\begin{cases} \Phi(\eta) \leq \Phi_1(|\eta|) + \Phi_2(\eta) \text{ for all } \eta \in \Theta ; \\ \Phi_1 \text{ is increasing and diverges at infinity} ; \\ \forall\, \{\eta_n\}_{n\geq 1} \subset \Theta,\ \Phi_2(\eta_n) \to +\infty \text{ entails } d(\eta_n, \partial\Theta) \to 0, \text{ as } n \to +\infty . \end{cases} \tag{2.13}$$
Along with $\Phi_2$, one introduces the non-negative integer (possibly equal to $+\infty$) $N_{\Phi_2}(\rho, \sigma)$ which represents the minimum number of convex closed sets that form a covering of $R_1 := \{\theta \in \Theta \mid \Phi_2(\theta) \geq \frac{\sigma}{2}, |\theta - \theta_0| \leq \rho\}$ with the constraint that such a covering must be included in $R_2 := \{\theta \in \Theta \mid \Phi_2(\theta) \geq \frac{\sigma}{4}, |\theta - \theta_0| \leq \rho\}$.

Finally, it should be noticed that the involvement of $d^{(p)}_{[S]}$ in (1.2)-(1.3) forces one to assume that $\mathcal{M}$ is included in $[S]_p$ and, with a view to obtaining explicit upper bounds, leads one to focus attention on the so-called Talagrand inequality. With reference to the elements of $\mathcal{M}$ with $p = 2$, this inequality reads

$$d^{(2)}_{[S]}(\mu_{\theta_0}, \mu_\theta)^2 \leq C_T(\theta_0)\, K(\mu_\theta \mid \mu_{\theta_0}) \qquad (\theta_0, \theta \in \Theta) \tag{2.14}$$

where $K$ stands for the Kullback-Leibler information of $\mu_{\theta_0}$ at $\mu_\theta$, i.e.

$$K(\mu_\theta \mid \mu_{\theta_0}) := \int_S \log\left( \frac{d\mu_\theta}{d\mu_{\theta_0}}(x) \right) \mu_\theta(dx) .$$
See, e.g., [34] for further information, including both sufficient conditions for the validity of (2.14) and the interesting connection with the logarithmic Sobolev inequality as a means to estimate the constant $C_T(\theta_0)$.

Theorem 2.5. Suppose that i)-ii) are fulfilled and that the elements of the exponential family (2.8) belong to $[S]_2$ and meet (2.14). Then, $p_0 = \mu_{\theta_0}$ satisfies (1.3) with $p = 2$, $b_n = (n/\mathrm{Log}_2 n)^{1/2}$, $\hat{p}_n = \mu_{\hat{\theta}_n}$, $\hat{\theta}_n$ as in (2.9), and

$$Y_2(p_0) = \sqrt{2C_T(\theta_0)}\, \Phi(\theta_0)\, \sigma_{\max}(\theta_0) \tag{2.15}$$

where $\sigma^2_{\max}(\theta_0)$ is the largest eigenvalue of $\mathrm{Hess}[M](V^{-1}(\theta_0))$. Moreover, $p_0$ satisfies (1.2) with the same $p$, $b_n$, $\hat{p}_n$ and a suitable $C_2(p_0)$ specified in the proof if, in addition, (2.13) is fulfilled and there exist a positive constant $\tau(\theta_0)$ and functions $\rho, \sigma : [\tau(\theta_0), +\infty) \to [0, +\infty)$ satisfying

$$\int_{\tau(\theta_0)}^{+\infty} m(t)^{-2} e^{-m(t)}\, dt + \int_{\tau(\theta_0)}^{+\infty} N_{\Phi_2}(\rho(t), \sigma(t))\, h(t)^{-2} e^{-h(t)}\, dt < +\infty \tag{2.16}$$
σ(t) t where m(t) := min{ρ(t), σ(t) , Φ−1 1 ( 2 )} and h(t) := inf{Iθ0 (θ) | Φ2 (θ) ≥
σ(t) 4 , |θ
− θ0 | ≤ ρ(t)}.
See Subsection 4.4 for the proof, which also includes the elements to specify the constant $C_2(p_0)$. Before concluding the section, it is worth remarking that the abundance of assumptions in the last theorem counterbalances the generality of the family (2.8). While the checking of i)-ii), (2.13) and (2.14) represents a quite standard task for which one can find some auxiliary literature, the finding of $\tau(\theta_0)$, $\rho(t)$, $\sigma(t)$ and $N(t)$ could prove more laborious. Due to this fact, $C_2(p_0)$ does not admit, in general, as direct an interpretation as the one provided by (2.6).
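As a sanity check on the objects $M$, $V$, $I_\mu$ and $I_{\theta_0}$ entering Theorem 2.5, the sketch below works them out for the Poisson family. This family is chosen here purely for illustration and is not discussed in the paper: it takes $S = \{0, 1, 2, \dots\}$, $\mu(\{x\}) = 1/x!$ and $t(x) = x$, so that $M(y) = e^y$, $\Lambda = \mathbb{R}$, $\Theta = (0, +\infty)$ and $V(y) = e^y$. The code evaluates the supremum defining $I_{\theta_0}$ in closed form, using $\Psi(y) = M(y + y_0) - M(y_0)$, and compares it numerically with the Kullback-Leibler information computed directly from the series.

```python
import math


def M(y):
    """Log-partition function of the Poisson family (assumed example): M(y) = exp(y)."""
    return math.exp(y)


def V_inv(theta):
    """Inverse of V(y) = M'(y) = exp(y); defined for theta > 0."""
    return math.log(theta)


def I_mu(theta):
    """Legendre transform of M: sup_y {theta*y - M(y)}, attained at y = V_inv(theta)."""
    y = V_inv(theta)
    return y * theta - M(y)


def I_theta0(theta, theta0):
    """Legendre transform of Psi(y) = M(y + y0) - M(y0), with y0 = V_inv(theta0)."""
    y0 = V_inv(theta0)
    return I_mu(theta) - (theta * y0 - M(y0))


def kl_poisson(theta, theta0, terms=200):
    """K(mu_theta | mu_theta0) for Poisson laws, by summing the defining series."""
    total = 0.0
    for x in range(terms):
        log_p = -theta + x * math.log(theta) - math.lgamma(x + 1)
        log_q = -theta0 + x * math.log(theta0) - math.lgamma(x + 1)
        total += math.exp(log_p) * (log_p - log_q)
    return total


rate_fn = I_theta0(3.0, 2.0)
kl = kl_poisson(3.0, 2.0)
```

In this example the two quantities agree, and $I_{\theta_0}(\theta_0) = 0$, as expected of a rate function centred at $\theta_0$.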
3 Exchangeable random variables
As announced in Section 1, the random coordinates $\tilde{\xi}_1, \tilde{\xi}_2, \dots$ are now to be thought of as exchangeable random elements (observations, in statistics) distributed according to the p.m. $\rho$ on $(S^\infty, \mathcal{B}(S^\infty))$. The reader is recommended to resort to the representation theorem recalled therein and, in particular, to the meaning of the random p.m. $\tilde{p}$ as limit of the sequence of the empirical laws $\tilde{e}_n$'s. It is also worth recalling the symbols $\pi(\tilde{\xi}^{(n)})$ and $p(\tilde{\xi}^{(n)})$ to denote the posterior and the predictive distribution, respectively.

The present section opens with a theorem about the posterior distribution, with the aim of establishing the validity of (1.4). Its novelty is that it deals with the question of determining the range of the sequence $\{d^{(p)}_{[[S]_p]}(\pi(\tilde{\xi}^{(n)}), \delta_{\hat{p}_n})\}_{n\geq 1}$.

Theorem 3.1. Assume that, for some $p \geq 1$, $\hat{p}_n \in [S]_p$ for all $n \in \mathbb{N}$ and $\tilde{p} \in [S]_p$ with $\rho$-probability 1. Moreover, let $\{b_n\}_{n\geq 1}$ be an increasing and diverging sequence of positive numbers with respect to which $\tilde{p}$ satisfies (1.2)-(1.3) with $\rho$-probability 1. Then,

$$\rho\left[ \limsup_{n\to\infty} b_n d^{(p)}_{[[S]_p]}\big( \pi(\tilde{\xi}^{(n)}), \delta_{\hat{p}_n} \big) \leq Y_p(\tilde{p}) \right] = 1 \tag{3.1}$$

provided that $C_p(\tilde{p})$ and $Y_p(\tilde{p})$ are random numbers and $\rho[C_p(\tilde{p})] < +\infty$.

The proof is deferred to Subsection 4.5. Apropos of the assumption that $C_p(\tilde{p})$ and $Y_p(\tilde{p})$ are random numbers, one can notice that their checking, in concrete situations, boils down to the analysis of their explicit expressions. Cf., e.g., (2.2)-(2.3) and (2.6)-(2.7).

In the same vein, one now deals with the approximation of the distributions $q_m(\tilde{\xi}^{(n)})$ as in (1.5). To connect $q_m(\tilde{\xi}^{(n)})$ with $\pi(\tilde{\xi}^{(n)})$, suffice it to notice that $p(\tilde{\xi}^{(n)}, A) = \int_{[S]} p^\infty(A)\pi(\tilde{\xi}^{(n)}, dp)$ and $q_m(\tilde{\xi}^{(n)}, B) = p(\tilde{\xi}^{(n)}, \tilde{e}_{n,m}^{-1}(B))$ hold true for every $A \in \mathcal{B}(S^\infty)$ and $B \in \mathcal{B}([S])$. From a technical point of view, knowledge of the latter distribution may help to make the properties of the former more explicit. This is what happens when, by following the usual frequentistic interpretation of analogy of the observations, one approximates $q_m(\tilde{\xi}^{(n)})$ by means of the distinguished distribution of $\frac{1}{m}\sum_{i=1}^m \delta_{\tilde{\xi}_{i+n}}$ that one would obtain by considering the $\tilde{\xi}_{i+n}$'s as i.i.d. with common p.d. $\hat{p}_n$, for every $n$. Apropos of this, one states the second main result of this section as
Theorem 3.2. If $\hat{p}_n \in [S]_1$ for all $n \in \mathbb{N}$ and $\tilde{p} \in [S]_1$ with $\rho$-probability 1, then

$$\rho\Big[ d^{(1)}_{[[S]_1]}\big( q_m(\tilde{\xi}^{(n)}), \hat{p}_n^\infty \circ \tilde{e}_{n,m}^{-1} \big) \leq d^{(1)}_{[[S]_1]}\big( \pi(\tilde{\xi}^{(n)}), \delta_{\hat{p}_n} \big) \Big] = 1 \tag{3.2}$$

for every $m \in \mathbb{N}$. In particular, under the same assumptions of Theorem 3.1, one has

$$\rho\left[ \limsup_{n\to\infty} b_n d^{(1)}_{[[S]_1]}\big( q_m(\tilde{\xi}^{(n)}), \hat{p}_n^\infty \circ \tilde{e}_{n,m}^{-1} \big) \leq Y_p(\tilde{p}) \right] = 1 \tag{3.3}$$

for every $m \in \mathbb{N}$.

See Subsection 4.6 for the proof. Theorems 3.1 and 3.2 are deliberately stated in a very general form, as far as the nature of the space $S$ is concerned, but contain the strong assumption that (1.2)-(1.3) are in force. With a
view to their real use, one could do a good turn by providing possibly simple sufficient conditions for the validity of (1.2)-(1.3), even if one has to give up the generality of $S$. It is with this spirit that we state the following three theorems which, from a merely mathematical side, are immediate consequences of the combination of Theorems 3.1 and 3.2 with Theorems 2.3, 2.4 and 2.5, respectively. The first statement deals with the nonparametric model, and can be viewed as an extension of Theorem 4 in [14] from $\mathbb{R}$ to $\mathbb{R}^d$.

Theorem 3.3. If $S = \mathbb{R}^d$ and $d_S$ coincides with the Euclidean distance, assume that, for some $\delta > 0$ and $p \in [1, +\infty) \cap (d/2, +\infty)$,

$$\rho\left[ \Big( \int_{\mathbb{R}^d} |x|^{2p+\delta}\, \tilde{p}(dx) \Big)^{\frac{3}{r} - 1} \right] = \int_{[\mathbb{R}^d]} \Big( \int_{\mathbb{R}^d} |x|^{2p+\delta}\, p(dx) \Big)^{\frac{3}{r} - 1} \pi(dp) < +\infty$$

is valid with any suitable $r \in (2, 3)$ for which $p/d > 2 - \frac{3}{r}$ and $(2p + \delta)(\frac{3}{r} - 1) > p$. Then, (1.4)-(1.5) hold true with $\hat{p}_n = \tilde{e}_n$, $b_n = (n/\mathrm{Log}_2 n)^{1/(2p)}$ and

$$Y_p(\tilde{p}) = \left\{ \frac{\sqrt{2}K(p, d)}{1 - 2^{-(p-d/2)}} \left[ 1 + \frac{2^p}{1 - 2^{-\delta/2}} \Big( \int_{\mathbb{R}^d} |x|^{2p+\delta}\, \tilde{p}(dx) \Big)^{1/2} \right] \right\}^{1/p}$$

where $K(p, d)$ is the same numerical constant as in Theorem 2.3.

The next theorem provides a new result concerning the $d$-dimensional Gaussian model.

Theorem 3.4. If $S = \mathbb{R}^d$ and $d_S$ coincides with the Euclidean distance, suppose that $\tilde{p} = \mu_{\tilde{\theta}}$ with $\rho$-probability 1, where $\tilde{\theta} = (\tilde{m}, \tilde{V})$ is a random element taking values in $\Theta = \mathbb{R}^d \times GL^+_{sym}(d)$ and $\mu_\theta$ is referred to (2.4). If

$$\rho\big[ \mathrm{tr}[\tilde{V}] \big] = \int_0^{+\infty} x\, \pi_{\mathrm{tr}[\tilde{V}]}(dx) < +\infty$$

is valid, with $\pi_{\mathrm{tr}[\tilde{V}]}$ standing for the p.d. of $\mathrm{tr}[\tilde{V}]$, then (1.4)-(1.5) hold true with $\hat{p}_n = \mu_{\hat{\theta}_n}$, $b_n = (n/\mathrm{Log}_2 n)^{1/2}$ and $Y_2(\tilde{p}) = (1 + \sqrt{2})\sqrt{\mathrm{tr}[\tilde{V}]}$.

Finally, the third result concerns the exponential family considered in Theorem 2.5.

Theorem 3.5. Let $\tilde{p}$ equal $\mu_{\tilde{\theta}}$ with $\rho$-probability 1, where $\mu_\theta$ is referred to (2.8) and $\tilde{\theta}$ is a random element with values in $\Theta$. Suppose that i)-ii) are fulfilled along with (2.13). Moreover, assume that, with $\rho$-probability 1, $\mu_{\tilde{\theta}} \in [S]_2$, (2.14) holds with $\tilde{\theta}$ in place of $\theta_0$ and $\hat{\theta}_n$ in place of $\theta$, and (2.16) is in force. If the same constant $C_2(p_0)$ mentioned in Theorem 2.5 is such that $C_2(\mu_{\tilde{\theta}})$ is a random number satisfying $\rho[C_2(\mu_{\tilde{\theta}})] < +\infty$, then (1.4)-(1.5) hold true with $\hat{p}_n = \mu_{\hat{\theta}_n}$, $b_n = (n/\mathrm{Log}_2 n)^{1/2}$ and $Y_2(\tilde{p}) = \sqrt{2C_T(\tilde{\theta})}\, \Phi(\tilde{\theta})\, \sigma_{\max}(\tilde{\theta})$.

Remark 3.6. Theorems 3.1 and 3.2 face the challenging task of comparing laws of p.d.'s. Nevertheless, there are situations in which it suffices to have a partial knowledge of a p.d., such as
mean values or appropriate synthetic measures of interesting aspects of that p.d., presentable as functionals on classes of p.d.’s. Thus, Theorems 3.1 and 3.2 are ideally accompanied by results 1 Pm ˜ ˜(n) ] concerning the comparison between ρ[Φ(˜p) | ξ˜(n) ] and Φ(ˆpn ), and between ρ[Φ( m i=1 ξi+n ) | ξ Pm ˜ 1 and ˆp∞ n [Φ( m i=1 ξi+n )], respectively, where Φ stands for some specific functional. In particular,
if Φ is Lipschitz-continuous, one has
(p) ρ[Φ(˜ p) | ξ˜(n) ] − Φ(ˆpn ) ≤ Lip(Φ)d[[S]p ] π(ξ˜(n) ), δpˆn
and
m m 1 X˜ 1 X˜ (1) −1 en,m ) ξi+n ) | ξ˜(n) ] − ˆ p∞ [Φ( ξ )] ≤ Lip(Φ)d[[S]1 ] (qm (ξ˜(n) ), ˆp∞ ρ[Φ( i+n n ◦˜ n m i=1 m i=1
where Lip(Φ) stands for the Lipschitz constant of Φ. Due to its relevance in the literature, the empirical approximation of functionals will be studied in a forthcoming paper.
4 Proofs
Gathered here are the proofs of the main results.
4.1 Proof of Proposition 2.1
This subsection complements the arguments already developed to prove Lemma 1 and Theorem 4 in Section 10.3 of [13], in order to justify the bound (2.1). Accordingly, the notation is essentially the same as therein, and page and formula numbers refer to that book.

To start, consider a sequence $\{Y_n\}_{n\ge 1}$ of independent, nonnegative random numbers, and set $S := \sum_{n=1}^{+\infty} Y_n$ and $M_k := \sum_{n=1}^{+\infty} E[Y_n^k]$, for $k > 0$. An application of the inequality $(a+b)^m \le 2^{m-1}(a^m + b^m)$, valid for $a, b > 0$ and $m \ge 1$, yields $E[S^q] \le 2^{q-2}(M_q + M_1 E[S^{q-1}])$ for $q \in \{2,3,\dots\}$. Thus, by induction, one gets
$$E[S^q] \le \lambda_{q,q-1} M_1^q + \sum_{k=0}^{q-2} \lambda_{q,k}\, M_{q-k}\, M_1^k$$
for $q \in \{2,3,\dots\}$, where $\lambda_{2,1} := 1$, $\lambda_{q,0} := 2^{q-2}$ and $\lambda_{q+1,h} := 2^{q-1}\lambda_{q,h-1}$ if $h = 1,\dots,q$. If $q > 2$ and $q \notin \mathbb{N}$, one can write
$$E[S^q] \le 2^{[q]-1}\big\{M_q + M_{q-[q]}\,E[S^{[q]}]\big\} \le 2^{[q]-1}\Big\{M_q + \lambda_{[q],[q]-1}\, M_{q-[q]}\, M_1^{[q]} + M_{q-[q]} \sum_{k=0}^{[q]-2} \lambda_{[q],k}\, M_{[q]-k}\, M_1^k\Big\}.$$
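The recursion for the coefficients $\lambda_{q,k}$ and the resulting moment bound for integer $q$ can be checked numerically. The following is only a minimal sketch: the function names and the three toy laws are illustrative assumptions, not part of the paper, and finitely many discrete variables stand in for the sequence $\{Y_n\}$ (all remaining terms being zero).

```python
from itertools import product

def lambda_table(q_max):
    """Coefficients lambda_{q,k}, 2 <= q <= q_max, from the stated recursion."""
    lam = {(2, 0): 1, (2, 1): 1}  # lambda_{2,0} = 2^(2-2), lambda_{2,1} := 1
    for q in range(2, q_max):
        lam[(q + 1, 0)] = 2 ** (q - 1)  # lambda_{q+1,0} = 2^((q+1)-2)
        for h in range(1, q + 1):
            lam[(q + 1, h)] = 2 ** (q - 1) * lam[(q, h - 1)]
    return lam

def power_sums(laws, k):
    """M_k = sum_n E[Y_n^k] for finitely many discrete laws {value: prob}."""
    return sum(p * v ** k for law in laws for v, p in law.items())

def exact_moment(laws, q):
    """E[S^q] for S = sum of the independent Y_n, by full enumeration."""
    total = 0.0
    for combo in product(*(law.items() for law in laws)):
        prob, s = 1.0, 0.0
        for v, p in combo:
            prob *= p
            s += v
        total += prob * s ** q
    return total

def moment_bound(laws, q):
    """Right-hand side lambda_{q,q-1} M_1^q + sum_{k<=q-2} lambda_{q,k} M_{q-k} M_1^k."""
    lam = lambda_table(q)
    m1 = power_sums(laws, 1)
    bound = lam[(q, q - 1)] * m1 ** q
    for k in range(q - 1):
        bound += lam[(q, k)] * power_sums(laws, q - k) * m1 ** k
    return bound

# Hypothetical toy example: three independent nonnegative variables.
toy_laws = [{0.0: 0.5, 1.0: 0.5}, {0.0: 0.25, 2.0: 0.75}, {0.5: 1.0}]
for q in (2, 3, 4):
    print(q, exact_moment(toy_laws, q), moment_bound(toy_laws, q))
```

For each printed $q$ the exact moment should not exceed the bound; the bound is far from sharp, which is consistent with its role as a tool for proving finiteness rather than as an estimate.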
Now, the reasoning continues parallelling the method used to prove Theorem 4. First, assume that the distribution of the $X_n$'s is symmetric with $E[X_1^2] = 1$. Second, define $c_n := (n\,\mathrm{Log}_2(n))^{-1/2}$, $X'_n := X_n \mathbb{1}\{|X_n| \le n^{1/r}\}$, $X''_n := X_n \mathbb{1}\{|X_n| > n^{1/r}\}$ and decompose $S_n := \sum_{i=1}^n X_i$ as $S_n = S'_n + S''_n$, with $S'_n := \sum_{i=1}^n X'_i$ and $S''_n := \sum_{i=1}^n X''_i$. Apropos of $S''_n$, exploit the
monotonicity of the sequence $\{c_n\}_{n\ge 1}$, as in the last inequality on page 389, to obtain
$$E\Big[\sup_{n\ge 1} \frac{|S''_n|^r}{(n\,\mathrm{Log}_2(n))^{r/2}}\Big] \le E\Big[\Big(\sup_{n\ge 1} c_n \sum_{i=1}^n |X''_i|\Big)^r\Big] \le E\Big[\Big(\sum_{n\ge 1} c_n |X''_n|\Big)^r\Big].$$
Therefore, if $r \in \{3,4,\dots\}$, one gets
$$E\Big[\sup_{n\ge 1} \frac{|S''_n|^r}{(n\,\mathrm{Log}_2(n))^{r/2}}\Big] \le \lambda_{r,r-1}\Big(\sum_{n=1}^{+\infty} c_n E[|X''_n|]\Big)^r + \sum_{k=0}^{r-2} \lambda_{r,k}\Big(\sum_{n=1}^{+\infty} c_n^{r-k} E[|X''_n|^{r-k}]\Big)\cdot\Big(\sum_{n=1}^{+\infty} c_n E[|X''_n|]\Big)^k \tag{4.1}$$
whilst, if $r > 2$ and $r \notin \mathbb{N}$, the following bound
$$E\Big[\sup_{n\ge 1} \frac{|S''_n|^r}{(n\,\mathrm{Log}_2(n))^{r/2}}\Big] \le 2^{[r]-1}\Big\{\sum_{n=1}^{+\infty} c_n^r E[|X''_n|^r] + \lambda_{[r],[r]-1}\sum_{n=1}^{+\infty} c_n^{r-[r]} E[|X''_n|^{r-[r]}]\cdot\Big(\sum_{n=1}^{+\infty} c_n E[|X''_n|]\Big)^{[r]} + \sum_{n=1}^{+\infty} c_n^{r-[r]} E[|X''_n|^{r-[r]}]\cdot\sum_{k=0}^{[r]-2}\lambda_{[r],k}\sum_{n=1}^{+\infty} c_n^{[r]-k} E[|X''_n|^{[r]-k}]\cdot\Big(\sum_{n=1}^{+\infty} c_n E[|X''_n|]\Big)^k\Big\} \tag{4.2}$$
is in force. Now, combining the inequalities of Hölder and Markov, one gets $E[|X''_n|^h] \le (1/n)^{\frac{r-h}{r}} E[|X_1|^r]$ for every $n \in \mathbb{N}$ and $h \in (0,r]$ which, in turn, yields $\sum_{n=1}^{+\infty} c_n^h E[|X''_n|^h] \le K(h) E[|X_1|^r]$, with $K(h) := \sum_{n=1}^{+\infty} (1/n)^{\frac{r-h}{r}} c_n^h < +\infty$. Thus, one concludes that
$$E\Big[\sup_{n\ge 1} \frac{|S''_n|^r}{(n\,\mathrm{Log}_2(n))^{r/2}}\Big] \le \beta_1(r)\,(E[|X_1|^r])^{\lceil r\rceil} \tag{4.3}$$
is valid with a suitable numerical constant $\beta_1(r)$, determined as follows. First, recall that $E[X_1^2] = 1$ entails $E[|X_1|^r] \ge 1$. Then, if $r \in \{3,4,\dots\}$, invoke (4.1) to get $\beta_1(r) = \sum_{k=0}^{r-1} \lambda_{r,k} K(r-k)(K(1))^k$, whilst, if $r > 2$ and $r \notin \mathbb{N}$, use (4.2) to obtain
$$\beta_1(r) = 2^{[r]-1}\Big\{K(r) + \lambda_{[r],[r]-1} K(r-[r])(K(1))^{[r]} + K(r-[r])\sum_{k=0}^{[r]-2} \lambda_{[r],k} K([r]-k)(K(1))^k\Big\}.$$
Apropos of the term involving $S'_n$, one starts by observing that
$$E\Big[\Big(\sup_{n\ge 1} c_n |S'_n|\Big)^r\Big] \le u_0^r + r\lambda^r \int_{u_0/\lambda}^{+\infty} u^{r-1}\, P\Big[\sup_{n\ge 1} c_n |S'_n| > \lambda u\Big]\, du \tag{4.4}$$
holds for every $\lambda, u_0 > 0$. Then, after putting $n_k := [e^k]$ for $k \in \mathbb{N}_0$, formula (15) on page 390 can be invoked to write
$$P\Big[\sup_{n\ge 1} c_n |S'_n| > \lambda u\Big] \le 4\sum_{k=0}^{+\infty} P\big[c_{n_k} S'_{n_{k+1}} > \lambda u\big]. \tag{4.5}$$
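For completeness, the bound (4.4) can be recovered from the usual tail-integral representation of a moment. The lines below are a sketch in which $Z := \sup_{n\ge 1} c_n |S'_n|$ is a shorthand introduced here only for readability; the first integral is bounded by using $P[Z > t] \le 1$, and the second one is rewritten through the change of variable $t = \lambda u$.

```latex
\begin{align*}
E[Z^r] &= \int_0^{+\infty} r t^{r-1} P[Z > t]\, dt
        = \int_0^{u_0} r t^{r-1} P[Z > t]\, dt + \int_{u_0}^{+\infty} r t^{r-1} P[Z > t]\, dt \\
       &\le u_0^r + r \int_{u_0}^{+\infty} t^{r-1} P[Z > t]\, dt
        = u_0^r + r \lambda^r \int_{u_0/\lambda}^{+\infty} u^{r-1} P[Z > \lambda u]\, du .
\end{align*}
```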
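Moreover, on the same page, it is proved that $P[S'_n > x] \le \exp\{-tx + nt^2\}$ is valid for every $n \in \mathbb{N}$, $x > 0$ and $t \in [0, n^{-1/r}]$. Then, setting $n = n_{k+1}$, $\gamma_r := \sup_{k\in\mathbb{N}} (n_{k+1})^{1/r}\big(\frac{\mathrm{Log}_2(n_{k+1})}{n_{k+1}}\big)^{1/2} < +\infty$, $t = \gamma_r^{-1}\big(\frac{\mathrm{Log}_2(n_{k+1})}{n_{k+1}}\big)^{1/2}$ and $x = \frac{u}{\gamma_r c_{n_k}}$ yields
$$P\Big[c_{n_k} S'_{n_{k+1}} > \frac{u}{\gamma_r}\Big] \le \exp\Big\{-\frac{(\alpha u - 1)}{\gamma_r^2}\,\mathrm{Log}_2(n_{k+1})\Big\} \tag{4.6}$$
where $\alpha := \inf_{k\in\mathbb{N}_0}\big(\frac{n_k \mathrm{Log}_2(n_k)}{n_{k+1}\mathrm{Log}_2(n_{k+1})}\big)^{1/2} > 0$. Putting $\lambda = \gamma_r^{-1}$, the combination of (4.4), (4.5) and (4.6) leads to
$$E\Big[\Big(\sup_{n\ge 1} c_n |S'_n|\Big)^r\Big] \le u_0^r + 4r\gamma_r^{-r}\sum_{k=0}^{+\infty}\int_{u_0\gamma_r}^{+\infty} u^{r-1}\exp\Big\{-\frac{(\alpha u - 1)}{\gamma_r^2}\,\mathrm{Log}_2(n_{k+1})\Big\}\, du.$$
To conclude, it is enough to choose $u_0$ large enough so that $\alpha\gamma_r u_0 > 4\gamma_r^2 + 1$ is in force. In fact, this choice entails, for every $u \ge u_0\gamma_r$, $\alpha u > 4\gamma_r^2 + 1$ and $\frac{(\alpha u - 1)}{\gamma_r^2}\mathrm{Log}_2(n_{k+1}) \ge 2\,\mathrm{Log}_2(n_{k+1}) + \frac{2\alpha}{4\gamma_r^2 + 1}u$ which, in turn, give
$$E\Big[\Big(\sup_{n\ge 1} c_n |S'_n|\Big)^r\Big] \le u_0^r + 4r\gamma_r^{-r}\Big(\sum_{k=0}^{+\infty}\exp\{-2\,\mathrm{Log}_2(n_{k+1})\}\Big)\cdot\Big(\frac{4\gamma_r^2+1}{2\alpha}\Big)^r\Gamma(r) =: \beta_0(r). \tag{4.7}$$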
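The constants $\gamma_r$ and $\alpha$ entering (4.6)-(4.7) are easy to approximate numerically. The sketch below rests on two assumptions that are not stated in this section: the convention $\mathrm{Log}_2(n) = \log\log(\max\{n, e^e\})$, so that $\mathrm{Log}_2 \ge 1$, and a truncation of the supremum and infimum at a finite index `k_max`. Function names are hypothetical.

```python
import math

def log2_iter(n):
    # Assumed convention: Log_2(n) = log(log(max(n, e^e))), hence Log_2(n) >= 1.
    return math.log(math.log(max(n, math.e ** math.e)))

def n_k(k):
    # n_k := [e^k], the integer part of e^k.
    return int(math.exp(k))

def gamma_r(r, k_max=40):
    # Truncated version of sup_{k in N} n_{k+1}^{1/r} (Log_2(n_{k+1}) / n_{k+1})^{1/2}.
    return max(
        n_k(k + 1) ** (1.0 / r) * math.sqrt(log2_iter(n_k(k + 1)) / n_k(k + 1))
        for k in range(1, k_max + 1)
    )

def alpha(k_max=40):
    # Truncated version of inf_{k in N_0} (n_k Log_2(n_k) / (n_{k+1} Log_2(n_{k+1})))^{1/2}.
    return min(
        math.sqrt(n_k(k) * log2_iter(n_k(k)) / (n_k(k + 1) * log2_iter(n_k(k + 1))))
        for k in range(0, k_max + 1)
    )

def u0_threshold(r, k_max=40):
    # Any u_0 strictly above this value satisfies alpha * gamma_r * u_0 > 4 gamma_r^2 + 1.
    g, a = gamma_r(r, k_max), alpha(k_max)
    return (4.0 * g * g + 1.0) / (a * g)

print(gamma_r(3.0), alpha(), u0_threshold(3.0))
```

Under these assumptions the truncated values stabilise quickly, since the ratio defining $\alpha$ approaches $e^{-1/2}$ and the quantity defining $\gamma_r$ decreases to zero for $r > 2$.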
The proof can now be carried out by means of the following steps. First, combine (4.3) with (4.7) to get
$$E\Big[\Big(\sup_{n\ge 1} c_n |S_n|\Big)^r\Big] \le 2^{r-1} E\Big[\Big(\sup_{n\ge 1} c_n |S'_n|\Big)^r\Big] + 2^{r-1} E\Big[\Big(\sup_{n\ge 1} c_n |S''_n|\Big)^r\Big] \le 2^{r-1}\Big[\beta_0(r) + \beta_1(r)(E[|X_1|^r])^{\lceil r\rceil}\Big], \tag{4.8}$$
which is valid for symmetric $X_n$'s with $E[X_1^2] = 1$. Then, invoke the symmetrization argument at the end of page 390 to show that the same bound holds also for non-symmetric $X_n$'s, provided that $E[X_1] = 0$ and $E[X_1^2] = 1$. Finally, remove the assumption $E[X_1^2] = 1$, by replacing every $X_i$ with $X_i/\sigma$.
4.2 Proof of Theorem 2.3
Fix $d \in \mathbb{N}$, $p \in [1,+\infty) \cap (d/2,+\infty)$ and $\delta > 0$. After setting $\beta := 2p + \delta$, observe that it is possible to choose $r \in (2,3)$ so that $\lambda = \lambda(p,d,r) := p - d(2 - \frac{3}{r})$ and $\sigma = \sigma(p,d,r) := \beta(\frac{3}{r} - 1) - p$ are both strictly positive constants. In particular, the bound on $\sigma$ is equivalent to r
2
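from Markov's inequality, $p_0(B_s) \le 2^{-\beta(s-1)}\, p_0^\infty[|\tilde\xi_1|^\beta]$ holds with the same $\beta$ as defined at the beginning of this proof. To conclude the proof of (1.2), gather the above inequalities together to obtain
$$p_0^\infty\Big[\sup_{n\ge 1} b_n^p\,\big[d^{(p)}_{[\mathbb{R}^d]}(p_0, \tilde e_n)\big]^p\Big] \le \frac{K(p,d)\,C(r)}{1 - 2^{-\lambda}}\Big[1 + \frac{2^{p}}{1 - 2^{-\sigma}}\big(p_0^\infty[|\tilde\xi_1|^\beta]\big)^{\frac{3}{r}-1}\Big]$$
which provides the value of $C_p(p_0)$ displayed in (2.2). To prove (1.3), use again (4.9) to write
$$\limsup_{n\to+\infty} b_n^p\,\big[d^{(p)}_{[\mathbb{R}^d]}(p_0, \tilde e_n)\big]^p \le K(p,d)\sum_{s=0}^{+\infty} 2^{ps}\sum_{l=0}^{+\infty} 2^{-lp}\sum_{F\in\mathcal{P}_l}\limsup_{n\to+\infty} b_n^p\,\Big|p_0(2^s F\cap B_s) - \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\tilde\xi_i \in 2^s F\cap B_s\}\Big|.$$
Since, from Hartman-Wintner's law of the iterated logarithm,
$$\limsup_{n\to+\infty} b_n^p\,\Big|p_0(2^s F\cap B_s) - \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{\tilde\xi_i \in 2^s F\cap B_s\}\Big| = \sqrt{2\,p_0(2^s F\cap B_s)\big(1 - p_0(2^s F\cap B_s)\big)}$$
one can argue exactly as above to obtain
$$\limsup_{n\to+\infty} b_n^p\,\big[d^{(p)}_{[\mathbb{R}^d]}(p_0, \tilde e_n)\big]^p \le \frac{\sqrt{2}\,K(p,d)}{1 - 2^{-(p-d/2)}}\Big[1 + \frac{2^{p}}{1 - 2^{-\delta/2}}\big(p_0^\infty[|\tilde\xi_1|^\beta]\big)^{1/2}\Big]$$
which provides the value of $Y_p(p_0)$ displayed in (2.3).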
4.3 Proof of Theorem 2.4
From a well-known expression of the distance $d^{(2)}_{[\mathbb{R}^d]}$ between two multivariate Gaussian distributions (see, e.g., [20, 42]), one gets
$$\big[d^{(2)}_{[\mathbb{R}^d]}(\mu_{\theta_0}, \mu_{\hat\theta_n})\big]^2 = |m_0 - \hat m_n|^2 + \mathrm{tr}\big[V_0 + \hat V_n - 2(V_0\hat V_n)^{1/2}\big] \tag{4.10}$$
where the symbol $\mu_\theta$ is referred to (2.4), while $\hat m_n$ and $\hat V_n$ are referred to (2.5). It is worth observing, preliminarily, that the second summand on the RHS of (4.10) coincides with the squared distance $[d^{(2)}_{[\mathbb{R}^d]}]^2$ between two $d$-dimensional Gaussian distributions with zero means and covariance matrices equal to $V_0$ and $\hat V_n$, respectively. Thus, by exploiting the basic properties of the distance $d^{(2)}_{[\mathbb{R}^d]}$, one can write
$$\mathrm{tr}\big[V_0 + \hat V_n - 2(V_0\hat V_n)^{1/2}\big] \le \sum_{j=1}^d\Big(\sqrt{(V_0)_{j,j}} - \sqrt{(\hat V_n)_{j,j}}\Big)^2 \tag{4.11}$$
where, for any matrix $M$, the symbol $(M)_{l,h}$ will denote, from now on, the element of $M$ on the $l$-th row and on the $h$-th column. After these preliminaries, one can prove the validity of (1.2)-(1.3) with $b_n = (n/\mathrm{Log}_2(n))^{1/2}$. As for (1.3), start by applying the Hartman-Wintner law of the iterated logarithm to obtain
$$\limsup_{n\to+\infty} b_n |m_0 - \hat m_n| = \limsup_{n\to+\infty}\sqrt{\sum_{j=1}^d\Big[\frac{\sum_{i=1}^n(\tilde\xi_i^{(j)} - m_0^{(j)})}{\sqrt{n\,\mathrm{Log}_2(n)}}\Big]^2} = \sqrt{2\,\mathrm{tr}[V_0]}$$
where $\tilde\xi_i^{(j)}$ ($m_0^{(j)}$, respectively) denotes the $j$-th component of $\tilde\xi_i$ ($m_0$, respectively). To handle the second summand on the RHS of (4.10), it is enough to use (4.11) to get
$$\limsup_{n\to+\infty} b_n\sqrt{\mathrm{tr}\big[V_0 + \hat V_n - 2(V_0\hat V_n)^{1/2}\big]} \le \sqrt{\sum_{j=1}^d\limsup_{n\to+\infty}\frac{n}{\mathrm{Log}_2(n)}\Big(\sqrt{(V_0)_{j,j}} - \sqrt{(\hat V_n)_{j,j}}\Big)^2}$$
where, in view of a combination of (2.5) with the strong law of large numbers,
$$\sqrt{(\hat V_n)_{j,j}} = \sqrt{(V_0)_{j,j} + \frac{1}{n}\sum_{i=1}^n\big[(\tilde\xi_i^{(j)} - m_0^{(j)})^2 - (V_0)_{j,j}\big] - \Big(\frac{\sum_{i=1}^n(\tilde\xi_i^{(j)} - m_0^{(j)})}{n}\Big)^2} = \sqrt{(V_0)_{j,j}}\cdot\sqrt{1 + \frac{\tilde\epsilon_{n,j}}{(V_0)_{j,j}}} = \sqrt{(V_0)_{j,j}}\Big[1 + \frac{\tilde\epsilon_{n,j}}{2(V_0)_{j,j}} + o\Big(\frac{\tilde\epsilon_{n,j}}{(V_0)_{j,j}}\Big)\Big]$$
with $\tilde\epsilon_{n,j} := \frac{1}{n}\sum_{i=1}^n\big[(\tilde\xi_i^{(j)} - m_0^{(j)})^2 - (V_0)_{j,j}\big] - \Big(\frac{\sum_{i=1}^n(\tilde\xi_i^{(j)} - m_0^{(j)})}{n}\Big)^2$. Whence,
$$\limsup_{n\to+\infty}\sqrt{\frac{n}{\mathrm{Log}_2(n)}}\,\Big|\sqrt{(\hat V_n)_{j,j}} - \sqrt{(V_0)_{j,j}}\Big| = \frac{1}{2\sqrt{(V_0)_{j,j}}}\limsup_{n\to+\infty}\sqrt{\frac{n}{\mathrm{Log}_2(n)}}\,|\tilde\epsilon_{n,j}| = \sqrt{(V_0)_{j,j}}$$
again from the Hartman-Wintner law of the iterated logarithm, yielding
$$\limsup_{n\to+\infty} b_n\sqrt{\mathrm{tr}\big[V_0 + \hat V_n - 2(V_0\hat V_n)^{1/2}\big]} \le \sqrt{\mathrm{tr}[V_0]}.$$
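In dimension $d = 1$ the expression (4.10) reduces to $(m_0 - m_1)^2 + (\sigma_0 - \sigma_1)^2$, which can be compared with the quantile representation of the squared 2-Wasserstein distance on the real line. The snippet below is a minimal sketch of that comparison; the function names, the grid size and the numerical values are illustrative assumptions, and the midpoint rule is only a rough approximation near the endpoints of $(0,1)$.

```python
from statistics import NormalDist

def w2_squared_gaussian_1d(m0, s0, m1, s1):
    """Closed form (4.10) for d = 1: (m0 - m1)^2 + (s0 - s1)^2, s0 and s1 standard deviations."""
    return (m0 - m1) ** 2 + (s0 - s1) ** 2

def w2_squared_quantile(m0, s0, m1, s1, grid=20000):
    """Midpoint-rule approximation of the integral over (0,1) of (F0^{-1}(u) - F1^{-1}(u))^2."""
    f0, f1 = NormalDist(m0, s0), NormalDist(m1, s1)
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid
        total += (f0.inv_cdf(u) - f1.inv_cdf(u)) ** 2
    return total / grid

# Hypothetical parameters: N(0, 1) against N(0.3, 1.5^2).
print(w2_squared_gaussian_1d(0.0, 1.0, 0.3, 1.5))
print(w2_squared_quantile(0.0, 1.0, 0.3, 1.5))
```

The two printed values should agree up to the discretisation error of the quadrature.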
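To prove (1.2), one starts by analyzing the first summand on the RHS of (4.10). Thus, the combination of Lyapunov's inequality with (2.1) yields, for any $r > 2$,
$$\mu_{\theta_0}^\infty\Big[\Big(\sup_{n\ge 1} b_n |m_0 - \hat m_n|\Big)^2\Big] \le \Big\{\mu_{\theta_0}^\infty\Big[\Big(\sup_{n\ge 1}\frac{|\sum_{i=1}^n(\tilde\xi_i - m_0)|}{\sqrt{n\,\mathrm{Log}_2 n}}\Big)^r\Big]\Big\}^{2/r}$$
$$\le \Big\{d^{\frac{r}{2}-1}\sum_{j=1}^d\mu_{\theta_0}^\infty\Big[\Big(\sup_{n\ge 1}\frac{|\sum_{i=1}^n(\tilde\xi_i^{(j)} - m_0^{(j)})|}{\sqrt{n\,\mathrm{Log}_2 n}}\Big)^r\Big]\Big\}^{2/r}$$
$$\le d^{\frac{r-2}{r}}\Big\{\sum_{j=1}^d\sigma_j^r\Big[\alpha_0(r) + \alpha_1(r)\Big(\frac{\mu_{\theta_0}^\infty[|\tilde\xi_1^{(j)} - m_0^{(j)}|^r]}{\sigma_j^r}\Big)^{\lceil r\rceil}\Big]\Big\}^{2/r}$$
$$\le d^{\frac{r-2}{r}}\Big[\alpha_0(r) + \alpha_1(r)\Big(\frac{2^{r/2}}{\sqrt{\pi}}\,\Gamma\Big(\frac{r+1}{2}\Big)\Big)^{\lceil r\rceil}\Big]^{2/r}\mathrm{tr}[V_0]. \tag{4.12}$$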
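To study the second summand on the RHS of (4.10), start from an application of (4.11) to obtain
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,\mathrm{tr}\big[V_0 + \hat V_n - 2(V_0\hat V_n)^{1/2}\big]\Big] \le \sum_{j=1}^d\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\Big(\sqrt{(V_0)_{j,j}} - \sqrt{(\hat V_n)_{j,j}}\Big)^2\Big].$$
For fixed $j \in \{1,\dots,d\}$, define $\tilde\zeta_i := (\tilde\xi_i^{(j)} - m_0^{(j)})/\sqrt{(V_0)_{j,j}}$ for all $i \in \mathbb{N}$, and observe that, under $\mu_{\theta_0}^\infty$, $\{\tilde\zeta_i\}_{i\ge 1}$ is a sequence of i.i.d. random numbers with standard Normal distribution. Then, put $\tilde\nu_n := \frac{1}{n}\sum_{i=1}^n\tilde\zeta_i$ and $\hat\tau_n := \big[\frac{1}{n}\sum_{i=1}^n\tilde\zeta_i^2 - \tilde\nu_n^2\big]^{1/2}$, and notice that
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\Big(\sqrt{(V_0)_{j,j}} - \sqrt{(\hat V_n)_{j,j}}\Big)^2\Big] = (V_0)_{j,j}\,\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\Big]$$
where $\mu_{\theta_0}^\infty\big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\big]$ does not depend on $j$. Therefore, the proof is completed after showing that this last expectation is finite. To this aim, introduce the quantity $\tilde\delta_n := \frac{1}{n}\sum_{i=1}^n\tilde\zeta_i^2 - 1$ and observe that $\hat\tau_n = \big[1 + \tilde\delta_n - \tilde\nu_n^2\big]^{1/2}$ and that
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\Big] \le \mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\,\mathbb{1}\Big\{|\tilde\delta_n| \le \frac{1}{4},\ \tilde\nu_n^2 \le \frac{1}{4}\Big\}\Big] + \mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\,\mathbb{1}\Big\{|\tilde\delta_n| > \frac{1}{4}\Big\}\Big] + \mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\,\mathbb{1}\Big\{\tilde\nu_n^2 > \frac{1}{4}\Big\}\Big]. \tag{4.13}$$
To bound the first term on the above RHS, recall that $\sqrt{1+x} = 1 + \sum_{k=1}^{+\infty}\binom{1/2}{k}x^k$ is valid if $|x| < 1$ and notice that
$$|1 - \hat\tau_n|^2\,\mathbb{1}\Big\{|\tilde\delta_n| \le \frac{1}{4},\ \tilde\nu_n^2 \le \frac{1}{4}\Big\} \le 2c^2(\tilde\delta_n^2 + \tilde\nu_n^4)$$
with $c := \sum_{k=1}^{+\infty}\big|\binom{1/2}{k}\big|(1/2)^{k-1}$. Now, $\mu_{\theta_0}^\infty\big[\sup_{n\ge 1} b_n^2\tilde\delta_n^2\big]$ and $\mu_{\theta_0}^\infty\big[\sup_{n\ge 1} b_n^2\tilde\nu_n^4\big]$ turn out to be finite in view of a direct application of Proposition 2.1. As for the second term on the RHS of (4.13), start by writing
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\,\mathbb{1}\Big\{|\tilde\delta_n| > \frac{1}{4}\Big\}\Big] \le \sum_{n\ge 1} 2b_n^2\,\mu_{\theta_0}^\infty\Big[(1 + \hat\tau_n^2)\,\mathbb{1}\Big\{|\tilde\delta_n| > \frac{1}{4}\Big\}\Big]$$
$$\le \sum_{n\ge 1} 2b_n^2\,\mu_{\theta_0}^\infty\Big[|\tilde\delta_n| > \frac{1}{4}\Big] + \sum_{n\ge 1} 2b_n^2\,\mu_{\theta_0}^\infty\Big[\tilde\zeta_1^2\,\mathbb{1}\Big\{|\tilde\delta_n| > \frac{1}{4}\Big\}\Big] \le 2(1+\sqrt{3})\sum_{n\ge 1} b_n^2\Big(\mu_{\theta_0}^\infty\Big[|\tilde\delta_n| > \frac{1}{4}\Big]\Big)^{1/2}.$$
The last series turns out to be finite, since $\big(\mu_{\theta_0}^\infty\big[|\tilde\delta_n| > \frac{1}{4}\big]\big)^{1/2} \le e^{-\alpha n}$ holds with a suitable $\alpha > 0$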
in view of an application of the usual Chernoff bound. As for the third term on the RHS of (4.13), argue as above to obtain
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^2\,|1 - \hat\tau_n|^2\,\mathbb{1}\Big\{\tilde\nu_n^2 > \frac{1}{4}\Big\}\Big] \le 2(1+\sqrt{3})\sum_{n\ge 1} b_n^2\Big(\mu_{\theta_0}^\infty\Big[\tilde\nu_n^2 > \frac{1}{4}\Big]\Big)^{1/2}$$
and the finiteness of the last series follows from a direct application of the Markov inequality, after noticing that $\tilde\nu_n$ has a Normal distribution with zero mean and variance equal to $\frac{1}{n}$.

4.4 Proof of Theorem 2.5
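From the definition of the Kullback-Leibler information, one gets
$$K(\mu_{\hat\theta_n} \mid \mu_{\theta_0}) = \hat\theta_n\cdot\big[V^{-1}(\hat\theta_n) - V^{-1}(\theta_0)\big] - \big[M(V^{-1}(\hat\theta_n)) - M(V^{-1}(\theta_0))\big] = \int_0^1 (1-s)\;{}^t(\hat\theta_n - \theta_0)\,\mathrm{Hess}[I_\mu]\big(\theta_0 + s(\hat\theta_n - \theta_0)\big)\,(\hat\theta_n - \theta_0)\,ds$$
where the second identity follows from the combination of (2.10) with Bernstein's representation of the remainder term in Taylor's formula. Then, from (2.14),
$$d^{(2)}_{[S]}(\mu_{\theta_0}, \mu_{\hat\theta_n}) \le \sqrt{C_T(\theta_0)\int_0^1 (1-s)\;{}^t(\hat\theta_n - \theta_0)\,\mathrm{Hess}[I_\mu]\big(\theta_0 + s(\hat\theta_n - \theta_0)\big)\,(\hat\theta_n - \theta_0)\,ds} \le \sqrt{C_T(\theta_0)}\,|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)$$
and hence, in view of the $k$-dimensional version of the Hartman-Wintner LIL,
$$\limsup_{n\to+\infty}\sqrt{\frac{n}{\mathrm{Log}_2(n)}}\;d^{(2)}_{[S]}(\mu_{\theta_0}, \mu_{\hat\theta_n}) \le \sqrt{2C_T(\theta_0)}\,\Phi(\theta_0)\,\sigma_{\max}(\theta_0) \tag{4.14}$$
where $\sigma^2_{\max}(\theta_0)$ is the largest eigenvalue of $\mathrm{Hess}[M](V^{-1}(\theta_0))$. See, e.g., Theorem 3.1 in [26]. This proves (1.3) with $b_n = (n/\mathrm{Log}_2(n))^{1/2}$ and $Y(p_0) = \sqrt{2C_T(\theta_0)}\,\Phi(\theta_0)\,\sigma_{\max}(\theta_0)$.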
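To prove (1.2), define $\Theta_i := \{\theta \in \mathbb{R}^k \mid |\theta - \theta_0| \le \delta(\theta_0)\}$ and $\Theta_e := \Theta\setminus\Theta_i$, where $\delta(\theta_0)$ is chosen so that $\Theta_i$ is a proper subset of $\Theta$, and notice that
$$\sup_{n\ge 1} b_n|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n) \le \sup_{n\ge 1} b_n|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_i\} + \sup_{n\ge 1} b_n|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_e\}. \tag{4.15}$$
As to the first summand on the right-hand side, for any $r > 2$, one can write
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_i\}\Big] \le \max_{\eta\in\Theta_i}\Phi(\eta)\sum_{j=1}^k\Big(\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n^r|\hat\theta_n^{(j)} - \theta_0^{(j)}|^r\Big]\Big)^{1/r}$$
where $\hat\theta_n^{(j)}$ and $\theta_0^{(j)}$ stand for the $j$-th component of $\hat\theta_n$ and $\theta_0$, respectively. At this stage, invoke Proposition 2.1 to conclude that
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge 1} b_n|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_i\}\Big] \tag{4.16}$$
$$\le \max_{\eta\in\Theta_i}\Phi(\eta)\sum_{j=1}^k\sigma_j(\theta_0)\Big[\alpha_0(r) + \alpha_1(r)\Big(\frac{\mu_{\theta_0}^\infty[|t^{(j)}(\tilde\xi_1) - \theta_0^{(j)}|^r]}{\sigma_j(\theta_0)^r}\Big)^{\lceil r\rceil}\Big]^{1/r} \tag{4.17}$$
where $\sigma_j(\theta_0) := \sqrt{\partial^2_{j,j}M(V^{-1}(\theta_0))}$ and $t^{(j)}(\tilde\xi_1)$ stands for the $j$-th component of $t(\tilde\xi_1)$. This upper bound represents a first contribution to the determination of $C(p_0)$, to be now completed by bounding the second summand on the right-hand side of (4.15). Apropos of this, one can start with the following general considerations, valid for any $n_0 \in \mathbb{N}$:
$$\mu_{\theta_0}^\infty\Big[\sup_{n\ge n_0} b_n|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_e\}\Big] = \int_0^{+\infty}\mu_{\theta_0}^\infty\Big[\bigcup_{n\ge n_0}\Big\{b_n|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_e\} > t\Big\}\Big]\,dt$$
$$\le \sum_{n\ge n_0} b_n\int_0^{\tau(\theta_0)}\mu_{\theta_0}^\infty\Big[|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_e\} > t\Big]\,dt \tag{4.18}$$
$$+ \sum_{n\ge n_0} b_n\int_{\tau(\theta_0)}^{+\infty}\mu_{\theta_0}^\infty\Big[|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_e\} > t\Big]\,dt. \tag{4.19}$$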
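The term in (4.18) can be bounded thanks to the fact that $\big\{|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_e\} > t\big\} \subseteq \{\hat\theta_n \in \Theta_e\}$, yielding
$$\sum_{n\ge n_0} b_n\int_0^{\tau(\theta_0)}\mu_{\theta_0}^\infty\Big[|\hat\theta_n - \theta_0|\,\Phi(\hat\theta_n)\mathbb{1}\{\hat\theta_n \in \Theta_e\} > t\Big]\,dt \le \tau(\theta_0)\sum_{n\ge n_0} b_n\,\mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > \delta(\theta_0)\big].$$
Here and in other points of the present proof, some results in [59] are invoked to estimate tail probabilities of the type $\mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > a\big]$. More precisely, the analysis of (4.18) is based on the Corollary on page 491 of that paper, whose applicability to the present context follows after checking (2.1) therein. Since vectors $t(\tilde\xi_j) - \theta_0$ correspond to vectors $\xi_j$ in [59], in the place of $E[|\xi_j|^m]$ therein one considers
$$\mu_{\theta_0}^\infty\big[|t(\tilde\xi_j) - \theta_0|^m\big] \le k^{m/2-1}\sum_{i=1}^k\mu_{\theta_0}^\infty\big[|t^{(i)}(\tilde\xi_j) - \theta_0^{(i)}|^m\big].$$
Thanks to the existence of the moment generating function of $t(\tilde\xi_j) - \theta_0$, the classical Cauchy estimate shows that
$$\mu_{\theta_0}^\infty\big[|t^{(i)}(\tilde\xi_j) - \theta_0^{(i)}|^m\big] \le m!\,\frac{C(r(\theta_0))}{r(\theta_0)^m}$$
is valid for all $m \in \mathbb{N}$ with a suitably chosen $r(\theta_0) > 0$, provided that $C(r(\theta_0))$ is the maximum (as $i = 1,\dots,k$) of the maximum modulus of the moment generating function of $|t^{(i)}(\tilde\xi_j) - \theta_0^{(i)}|$ on $\{z \in \mathbb{C} \mid |z| = r(\theta_0)\}$. Plainly, the last inequality entails (2.1) in [59] with specific $H = H(\theta_0)$ and $b_j^2 = \mu_{\theta_0}^\infty[|t(\tilde\xi_j) - \theta_0|^2] =: B(\theta_0)^2$. Putting $B_n^2 := nB^2(\theta_0)$ and $x := \sqrt{n}\,\delta(\theta_0)/B(\theta_0)$, the thesis of the aforesaid corollary yields
$$\sum_{n\ge n_0} b_n\,\mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > \delta(\theta_0)\big] \le \sum_{n\ge n_0} b_n\exp\Big\{\frac{-n\,\delta(\theta_0)^2}{B(\theta_0)^2 + 1.62\,\delta(\theta_0)H(\theta_0)}\Big\} \le c\,\Big(\frac{\delta(\theta_0)^2}{B(\theta_0)^2 + 1.62\,\delta(\theta_0)H(\theta_0)}\Big)^{-2}\exp\Big\{\frac{-n_0\,\delta(\theta_0)^2}{B(\theta_0)^2 + 1.62\,\delta(\theta_0)H(\theta_0)}\Big\} \tag{4.20}$$
as a consequence of the elementary inequality
$$\sum_{n\ge n_0} b_n e^{-nz} \le \frac{c}{z^2}\,e^{-n_0 z} \tag{4.21}$$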
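valid for $z > 0$ with some constant $c > 0$. Estimate (4.20) provides the required upper bound for (4.18). As for (4.19), start by writing
$$\mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0|\Phi(\hat\theta_n) > t\big] \le \mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0|\Phi(\hat\theta_n) > t,\ \Phi(\hat\theta_n) \le \sigma(t)\big] + \mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0|\Phi(\hat\theta_n) > t,\ \Phi(\hat\theta_n) > \sigma(t)\big]$$
$$\le \mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > t/\sigma(t)\big] + \mu_{\theta_0}^\infty\big[\Phi(\hat\theta_n) > \sigma(t)\big]$$
$$\le \mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > t/\sigma(t)\big] + \mu_{\theta_0}^\infty\big[\Phi(\hat\theta_n) > \sigma(t),\ |\hat\theta_n - \theta_0| \le \rho(t)\big] + \mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > \rho(t)\big]$$
$$\le \mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > t/\sigma(t)\big] + \mu_{\theta_0}^\infty\Big[\Phi_1(|\hat\theta_n - \theta_0|) > \frac{\sigma(t)}{2}\Big] + \mu_{\theta_0}^\infty\Big[\Phi_2(\hat\theta_n) > \frac{\sigma(t)}{2},\ |\hat\theta_n - \theta_0| \le \rho(t)\Big] + \mu_{\theta_0}^\infty\big[|\hat\theta_n - \theta_0| > \rho(t)\big]$$
and, after introducing the function $m(t) := \min\{\rho(t), \frac{t}{\sigma(t)}, \Phi_1^{-1}(\frac{\sigma(t)}{2})\}$, the sum of the first, second and fourth term can be bounded by
$$3\exp\Big\{\frac{-n\,m(t)^2}{B(\theta_0)^2 + 1.62\,m(t)H(\theta_0)}\Big\} \le 3\exp\Big\{\frac{-n\,m(t)}{2H(\theta_0)}\Big\} \qquad (t > \tau(\theta_0))$$
invoking once again the Corollary in [59] with $x := \sqrt{n}\,m(t)/B(\theta_0)$. The validity of the above inequality rests on the fact that, by properly choosing $\rho$ and $\sigma$, $m$ turns out to be monotone and diverging at infinity. Now, from (4.21),
$$\sum_{n\ge n_0} 3b_n\exp\Big\{\frac{-n\,m(t)}{2H(\theta_0)}\Big\} \le 3c\,\Big(\frac{m(t)}{2H(\theta_0)}\Big)^{-2}\exp\Big\{\frac{-n_0\,m(t)}{2H(\theta_0)}\Big\} \qquad (t > \tau(\theta_0))$$
and one finds that the right-hand side is integrable on $[\tau(\theta_0),+\infty)$, thanks to (2.16), after a proper choice of $n_0$. At this stage, it remains only to bound $\mu_{\theta_0}^\infty\big[\Phi_2(\hat\theta_n) > \frac{\sigma(t)}{2},\ |\hat\theta_n - \theta_0| \le \rho(t)\big] = \mu_{\theta_0}^\infty[\hat\theta_n \in R_1(t)]$, a task which will be carried out by an application of some large deviation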
estimate. In fact, recall that
$$\mu_{\theta_0}^\infty[\hat\theta_n \in R] \le e^{-n\inf_{\theta\in R} I_{\theta_0}(\theta)}$$
holds true for every closed convex subset $R$ of $\Theta$. See, e.g., Section 2 of [9]. Therefore, since $R_1(t)$ is compact, consider a covering $\{V_i(t)\}_{i=1,\dots,N(t)}$ of $R_1(t)$, made by closed and convex subsets of $\Theta$, which is entirely contained in $R_2$, to obtain
$$\mu_{\theta_0}^\infty[\hat\theta_n \in R_1(t)] \le \mu_{\theta_0}^\infty\big[\hat\theta_n \in \cup_{i=1}^{N(t)} V_i(t)\big] \le \sum_{i=1}^{N(t)}\mu_{\theta_0}^\infty[\hat\theta_n \in V_i(t)] \le N(t)\,e^{-nh(t)}$$
where $h(t) := \inf\{I_{\theta_0}(\theta) \mid \Phi_2(\theta) \ge \frac{\sigma(t)}{4},\ |\theta - \theta_0| \le \rho(t)\}$ and $N(t)$ is an abbreviation for $N_{\Phi_2}(\rho(t), \sigma(t))$. The conclusion is reached by first applying (4.21), giving
$$\sum_{n\ge n_0} b_n e^{-nh(t)} \le c\,\Big(\frac{1}{h(t)}\Big)^2 e^{-n_0 h(t)}$$
and then invoking (2.16) to prove the integrability of $N(t)\big(\frac{1}{h(t)}\big)^2 e^{-n_0 h(t)}$ on $[\tau(\theta_0),+\infty)$. After choosing $n_0$ properly, sums of the type $\sum_{n=1}^{n_0-1}$ admit trivial bounds which, combined with the rest of the proof, lead to the desired result.
4.5 Proof of Theorem 3.1
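First, observe that the following two identities
$$\Big[d^{(p)}_{[[S]_p]}\big(\pi(\tilde\xi^{(n)}), \delta_{\hat p_n}\big)\Big]^p = \int_{[S]}\big[d^{(p)}_{[S]}(p, \hat p_n)\big]^p\,\pi(\tilde\xi^{(n)}, dp) = \rho\Big[\big[d^{(p)}_{[S]}(\tilde p, \hat p_n)\big]^p \,\Big|\, \tilde\xi^{(n)}\Big]$$
hold with $\rho$-probability 1, the former being a direct consequence of the definition (1.1). To derive the latter, it suffices to apply the disintegration theorem (see, e.g., Theorem 6.4 in [37]) after recalling that $\hat p_n$ is $\sigma(\tilde\xi^{(n)})/\mathcal{B}([S]_p)$-measurable. In the next step, parallelling the reasoning in [6] to prove Theorem 2 therein, one gets
$$\limsup_{n\to\infty}\rho\Big[\big[b_n d^{(p)}_{[S]}(\tilde p, \hat p_n)\big]^p \,\Big|\, \tilde\xi^{(n)}\Big] \le \lim_{k\to+\infty}\sup_{n\ge k,\,h\ge k}\rho\Big[\big[b_h d^{(p)}_{[S]}(\tilde p, \hat p_h)\big]^p \,\Big|\, \tilde\xi^{(n)}\Big] \le \lim_{k\to+\infty}\sup_{n\ge k}\rho\Big[\sup_{h\ge m}\big[b_h d^{(p)}_{[S]}(\tilde p, \hat p_h)\big]^p \,\Big|\, \tilde\xi^{(n)}\Big]$$
for every $m \in \mathbb{N}$. Then, one has that $X_m := \sup_{h\ge m}\big[b_h d^{(p)}_{[S]}(\tilde p, \hat p_h)\big]^p$ belongs to $L^1(S^\infty, \mathcal{B}(S^\infty), \rho)$, since
$$\rho[X_m] \le \rho[X_1] = \rho\big[\rho[X_1 \mid \tilde p]\big] \le \rho[C_p(\tilde p)] < +\infty.$$
Hence, the martingale convergence theorem gives
$$\lim_{k\to+\infty}\sup_{n\ge k}\rho\big[X_m \mid \sigma(\tilde\xi^{(n)})\big] = \rho\Big[X_m \,\Big|\, \sigma\Big(\bigcup_{n\ge 1}\sigma(\tilde\xi^{(n)})\Big)\Big]$$
with $\rho$-probability 1, and now it is enough to observe that $X_m$ is a $\sigma\big(\bigcup_{n\ge 1}\sigma(\tilde\xi^{(n)})\big)$-measurable random number, for any $m \in \mathbb{N}$, since $\rho[\tilde e_n \Rightarrow \tilde p, \text{ as } n\to+\infty] = 1$. Whence,
$$\rho\Big[\lim_{k\to+\infty}\sup_{n\ge k}\rho\big[X_m \mid \tilde\xi^{(n)}\big] = X_m\Big] = 1$$
for every $m \in \mathbb{N}$. In conclusion,
$$\limsup_{n\to\infty}\Big[b_n\,d^{(p)}_{[[S]_p]}\big(\pi(\tilde\xi^{(n)}), \delta_{\hat p_n}\big)\Big]^p \le \limsup_{m\to+\infty} X_m \le [Y_p(\tilde p)]^p$$
with $\rho$-probability 1, which amounts to (3.1).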
4.6 Proof of Theorem 3.2
Since $d^{(1)} \le d^{(p)}$, one gets immediately that (3.3) follows from the combination of (3.2) with (3.1). Thus, one proves (3.3) as a consequence of the following more general statement: for $p_0 \in [S]_1$ and $\mu \in [[S]_1]$ there holds
$$d^{(1)}_{[[S]_1]}\big(\rho[\mu]\circ\tilde e_{n,m}^{-1},\ p_0^\infty\circ\tilde e_{n,m}^{-1}\big) \le d^{(1)}_{[[S]_1]}(\mu, \delta_{p_0}) \tag{4.22}$$
for every $m \in \mathbb{N}$, where $\rho[\mu]$ is the p.m. on $(S^\infty, \mathcal{B}(S^\infty))$ given by $\rho[\mu](\cdot) := \int_{[S]_1} p^\infty(\cdot)\,\mu(dp)$. Indeed, taking account of the Kantorovich-Rubinstein dual representation of $d^{(1)}$ (see, e.g., Theorem 11.8.2 in [23]), (4.22) can be equivalently rewritten as
$$\sup_{h\in\mathrm{Lip}_1([S]_1;\mathbb{R})}\Big|\int_{[S]_1}\int_{S^m} h\Big(\frac{1}{m}\sum_{i=1}^m\delta_{x_i}\Big)p(dx_1)\dots p(dx_m)\,\mu(dp) - \int_{S^m} h\Big(\frac{1}{m}\sum_{i=1}^m\delta_{x_i}\Big)p_0(dx_1)\dots p_0(dx_m)\Big| \le \sup_{h\in\mathrm{Lip}_1([S]_1;\mathbb{R})}\Big|\int_{[S]_1} h(p)\,\mu(dp) - h(p_0)\Big| \tag{4.23}$$
where $\mathrm{Lip}_1([S]_1;\mathbb{R}) := \{h : [S]_1 \to \mathbb{R} \mid |h(p_1) - h(p_2)| \le d^{(1)}_{[S]}(p_1, p_2),\ \forall\, p_1, p_2 \in [S]_1\}$. The validity of (4.23) rests on the fact that, for any fixed $h \in \mathrm{Lip}_1([S]_1;\mathbb{R})$, the maps $F_h^{(m)}(p) := \int_{S^m} h\big(\frac{1}{m}\sum_{i=1}^m\delta_{x_i}\big)p(dx_1)\dots p(dx_m)$ belong to $\mathrm{Lip}_1([S]_1;\mathbb{R})$, for every $m \in \mathbb{N}$. Such a claim admits a direct proof based on the so-called Birkhoff theorem on optimal matching, under the additional condition that $S$ does not contain any isolated point. See Theorem 6.0.1 in [3] for a precise statement. Indeed, the additional hypothesis about the absence of isolated points is needed only to prove that any element $p \in [S]_1$ can be approximated with respect to the metric $d^{(1)}_{[S]}$, as closely as desired, by discrete p.m.'s of the form $\frac{1}{N}\sum_{i=1}^N\delta_{x_i}$, $x_1,\dots,x_N$ being distinct elements of $S$. This kind of approximation can be justified as follows: first, invoke Lemma 11.8.4 in [23] to provide an approximation of $p$, in the desired metric, of the form $\sum_{i=1}^k\alpha_i\delta_{a_i}$ with $a_1,\dots,a_k \in S$ and $\alpha_1,\dots,\alpha_k \in [0,1]$ satisfying $\sum_{i=1}^k\alpha_i = 1$. In view of a simple density argument, one can choose $\alpha_1,\dots,\alpha_k \in [0,1]\cap\mathbb{Q}$. Then, write the approximating measure as $\frac{1}{N}\sum_{i=1}^N n_i\delta_{a_i}$, with $N, n_1,\dots,n_N$ positive integers satisfying $\sum_{i=1}^N n_i = N$. Finally, for fixed $\epsilon > 0$, utilize the non-isolation hypothesis to find distinct points $b_i^{(j)} \in S$ in such a way that $\max_{i=1,\dots,N}\max_{j=1,\dots,n_i} d_S(a_i, b_i^{(j)}) \le \epsilon$, entailing
$$d^{(1)}_{[S]}\Big(\frac{1}{N}\sum_{i=1}^N n_i\delta_{a_i},\ \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^{n_i}\delta_{b_i^{(j)}}\Big) \le \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^{n_i} d^{(1)}_{[S]}\big(\delta_{a_i}, \delta_{b_i^{(j)}}\big) \le \epsilon$$
since $d^{(1)}_{[S]}(\delta_x, \delta_y) = d_S(x,y)$. At this stage, if $S$ does not contain any isolated point, the claim $F_h^{(m)} \in \mathrm{Lip}_1([S]_1;\mathbb{R})$ can be reduced to checking that
$$\Big|F_h^{(m)}\Big(\frac{1}{N}\sum_{i=1}^N\delta_{x_i}\Big) - F_h^{(m)}\Big(\frac{1}{N}\sum_{i=1}^N\delta_{y_i}\Big)\Big| \le d^{(1)}_{[S]}\Big(\frac{1}{N}\sum_{i=1}^N\delta_{x_i},\ \frac{1}{N}\sum_{i=1}^N\delta_{y_i}\Big)$$
holds for every $N \in \mathbb{N}$ and all collections $\{x_1,\dots,x_N\}$ and $\{y_1,\dots,y_N\}$ of distinct points. This check, which constitutes the heart of the proof, is encapsulated in the following group of relations
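$$\Big|F_h^{(m)}\Big(\frac{1}{N}\sum_{i=1}^N\delta_{x_i}\Big) - F_h^{(m)}\Big(\frac{1}{N}\sum_{i=1}^N\delta_{y_i}\Big)\Big| = \Big|\frac{1}{N^m}\sum_{i_1,\dots,i_m\in\{1,\dots,N\}}\Big[h\Big(\frac{1}{m}\sum_{j=1}^m\delta_{x_{i_j}}\Big) - h\Big(\frac{1}{m}\sum_{j=1}^m\delta_{y_{i_j}}\Big)\Big]\Big|$$
$$= \Big|\frac{1}{N^m}\sum_{i_1,\dots,i_m\in\{1,\dots,N\}}\Big[h\Big(\frac{1}{m}\sum_{j=1}^m\delta_{x_{i_j}}\Big) - h\Big(\frac{1}{m}\sum_{j=1}^m\delta_{y_{\rho(i_j)}}\Big)\Big]\Big| \tag{4.24}$$
$$\le \frac{1}{N^m}\sum_{i_1,\dots,i_m\in\{1,\dots,N\}}\Big|h\Big(\frac{1}{m}\sum_{j=1}^m\delta_{x_{i_j}}\Big) - h\Big(\frac{1}{m}\sum_{j=1}^m\delta_{y_{\rho(i_j)}}\Big)\Big|$$
$$\le \frac{1}{N^m}\sum_{i_1,\dots,i_m\in\{1,\dots,N\}} d^{(1)}_{[S]}\Big(\frac{1}{m}\sum_{k=1}^m\delta_{x_{i_k}},\ \frac{1}{m}\sum_{k=1}^m\delta_{y_{\rho(i_k)}}\Big)$$
$$\le \frac{1}{N^m}\sum_{i_1,\dots,i_m\in\{1,\dots,N\}}\Big\{\frac{1}{m}\sum_{k=1}^m d^{(1)}_{[S]}\big(\delta_{x_{i_k}}, \delta_{y_{\rho(i_k)}}\big)\Big\} \tag{4.25}$$
$$= \frac{1}{N^m}\sum_{i_1,\dots,i_m\in\{1,\dots,N\}}\Big\{\frac{1}{m}\sum_{k=1}^m d_S\big(x_{i_k}, y_{\rho(i_k)}\big)\Big\} = \frac{1}{m}\sum_{k=1}^m\frac{1}{N}\sum_{i=1}^N d_S(x_i, y_{\rho(i)}) = \frac{1}{N}\sum_{i=1}^N d_S(x_i, y_{\rho(i)})$$
$$= d^{(1)}_{[S]}\Big(\frac{1}{N}\sum_{i=1}^N\delta_{x_i},\ \frac{1}{N}\sum_{i=1}^N\delta_{y_i}\Big) \tag{4.26}$$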
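where: (4.24) holds for any permutation $\rho : \{1,\dots,N\} \to \{1,\dots,N\}$; (4.25) follows from the convexity property of $d^{(1)}_{[S]}$; (4.26) corresponds to the Birkhoff theorem when $\rho$ stands for the optimal coupling permutation.

To complete the proof it remains to bypass the introduction of the non-isolation hypothesis. With this aim, define $\overline{S} := S\times[-1,1]$ and $D_{\overline{S}}((x,s),(y,t)) := d_S(x,y) + |s-t|$, so that $(\overline{S}, D_{\overline{S}})$ turns out to be a metric space without isolated points. Then, for any $p \in [S]$, define $\overline{p}$ to be the extension to $\mathcal{B}(\overline{S})$ of $\overline{p}(A\times B) := p(A)\delta_0(B)$, for $A \in \mathcal{B}(S)$ and $B \in \mathcal{B}([-1,1])$. Whence, for any $p_1, p_2 \in [S]_1$, one has
$$d^{(1)}_{[\overline{S}]}(\overline{p}_1, \overline{p}_2) = \inf_{\pi\in\mathcal{F}(\overline{p}_1,\overline{p}_2)}\int_{\overline{S}^2} D_{\overline{S}}(\xi,\eta)\,\pi(d\xi\,d\eta) = \inf_{\pi\in\mathcal{F}(p_1,p_2)}\int_{S^2} d_S(x,y)\,\pi(dx\,dy) = d^{(1)}_{[S]}(p_1, p_2)$$
where the validity of the second equality follows from the fact that every $\pi \in \mathcal{F}(\overline{p}_1, \overline{p}_2)$ has a support contained in $S\times\{0\}$. Now, for every $h \in \mathrm{Lip}_1([S]_1;\mathbb{R})$, one can associate a function $\overline{h}$, defined on $\{p\otimes\delta_0 \mid p \in [S]_1\} \subset [\overline{S}]_1$ by $\overline{h}(p\otimes\delta_0) := h(p)$. Successively, by virtue of Proposition 11.2.3 in [23], $\overline{h}$ can be extended to a function, denoted with the same symbol and defined on the whole $[\overline{S}]_1$, belonging to $\mathrm{Lip}_1([\overline{S}]_1;\mathbb{R})$. Consistently with this, for any $h \in \mathrm{Lip}_1([S]_1;\mathbb{R})$ define $\overline{F}_h^{(m)} : [\overline{S}]_1 \to \mathbb{R}$ by
$$\overline{F}_h^{(m)}(\tau) := \int_{\overline{S}^m}\overline{h}\Big(\frac{1}{m}\sum_{i=1}^m\delta_{\xi_i}\Big)\tau(d\xi_1)\dots\tau(d\xi_m) \qquad (\tau \in [\overline{S}]_1).$$
Since $\big|\overline{F}_h^{(m)}(\tau_1) - \overline{F}_h^{(m)}(\tau_2)\big| \le d^{(1)}_{[\overline{S}]}(\tau_1, \tau_2)$ is valid for every $\tau_1, \tau_2 \in [\overline{S}]_1$ from the first part of the proof, then, for any pair $p_1, p_2$ of elements of $[S]_1$, one can write
$$\big|F_h^{(m)}(p_1) - F_h^{(m)}(p_2)\big| = \big|\overline{F}_h^{(m)}(\overline{p}_1) - \overline{F}_h^{(m)}(\overline{p}_2)\big| \le d^{(1)}_{[\overline{S}]}(\overline{p}_1, \overline{p}_2) = d^{(1)}_{[S]}(p_1, p_2)$$
and (4.23) is proved.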
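The identity used at (4.26), namely that the distance $d^{(1)}_{[S]}$ between two uniform discrete measures on $N$ distinct points equals the cost of the best permutation matching, can be illustrated by brute force for small $N$. The sketch below assumes $S = \mathbb{R}$ with $d_S(x,y) = |x - y|$; the function names and sample points are hypothetical. The comparison with the sorted matching relies on the classical one-dimensional fact that the monotone coupling is optimal for this cost, which is not used in the proof above.

```python
from itertools import permutations

def matching_cost(xs, ys, perm):
    """(1/N) * sum_i |x_i - y_{perm(i)}| for a given permutation perm."""
    return sum(abs(x - ys[j]) for x, j in zip(xs, perm)) / len(xs)

def w1_by_matching(xs, ys):
    """Minimum matching cost over all permutations (feasible only for small N)."""
    n = len(xs)
    return min(matching_cost(xs, ys, perm) for perm in permutations(range(n)))

def w1_sorted(xs, ys):
    """Cost of the monotone matching, optimal on the real line for the cost |x - y|."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Hypothetical sample points, all distinct.
xs = [0.1, 2.0, -1.5, 0.7]
ys = [1.0, -0.3, 3.2, 0.0]
print(w1_by_matching(xs, ys), w1_sorted(xs, ys))
```

Enumerating all $N!$ permutations is only meant to make the role of the optimal coupling permutation $\rho$ concrete; it is not a practical algorithm.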
Finally, to prove (3.2), it is enough to notice that $p(\tilde\xi^{(n)}, A) = \int_{[S]} p^\infty(A)\,\pi(\tilde\xi^{(n)}, dp)$ and $q_m(\tilde\xi^{(n)}, B) = p(\tilde\xi^{(n)}, \tilde e_{n,m}^{-1}(B))$ hold true, with $\rho$-probability 1, for every $A \in \mathcal{B}(S^\infty)$ and $B \in \mathcal{B}([S])$.
Acknowledgements. The authors are obliged to Giuseppe Savaré for illuminating conversations on Subsection 4.6. They are also grateful to P. Diaconis and to E.A. Carlen for a number of useful suggestions.
References
[1] Ajtai, M., Komlós, J. and Tusnády, G. (1984). On optimal matchings. Combinatorica 4, 259-264.
[2] Aldous, D.J. (1985). Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII, 1-198. Lecture Notes in Math. 1117. Springer, Berlin.
[3] Ambrosio, L., Gigli, N. and Savaré, G. (2008). Gradient Flows in Metric Spaces and in the Space of Probability Measures. 2nd ed. Birkhäuser, Basel.
[4] Ambrosio, L., Stra, F. and Trevisan, D. (2016). A PDE approach to a 2-dimensional matching problem. arXiv:1611.04960.
[5] Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
[6] Blackwell, D. and Dubins, L.E. (1962). Merging of opinions with increasing information. Ann. Math. Statist. 33, 882-886.
[7] Bobkov, S. and Ledoux, M. (2016). One-dimensional empirical measures, order statistics, and Kantorovich transport distances. To appear in: Memoirs of the AMS.
[8] Boissard, P. (2011). Simple bounds for the convergence of empirical and occupation measures in 1-Wasserstein distance. Electron. J. Probab. 16, 2296-2333.
[9] Borovkov, A.A. and Mogulskii, A.A. (2012). Chebyshev-type exponential inequalities. Theory Probab. Appl. 56, 21-43.
[10] Brown, L.D. (1986). Fundamentals of Statistical Exponential Families with Application in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes-Monograph Series, 9. Hayward, California.
[11] Cantelli, F.P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giorn. Ist. Ital. Attuari 4, 421-424.
[12] Caracciolo, S., Lucibello, G., Parisi, G. and Sicuro, G. (2014). Scaling hypothesis for the Euclidean bipartite matching problem. Phys. Rev. E 90, 012118.
[13] Chow, Y.S. and Teicher, H. (1997). Probability Theory. Independence, Interchangeability, Martingales. 3rd ed. Springer, New York.
[14] Cifarelli, D.M., Dolera, E. and Regazzini, E. (2016). Frequentistic approximations to Bayesian prevision of exchangeable random elements. Internat. J. Approx. Reason. 78, 138-152.
[15] Cifarelli, D.M., Dolera, E. and Regazzini, E. (2017). Note on “Frequentistic approximation to Bayesian prevision of exchangeable random elements”. Internat. J. Approx. Reason. 86, 26-27.
[16] Cox, D.R. (1975). Partial likelihood. Biometrika 62, 269-276.
[17] Cui, W. and George, E.I. (2008). Empirical Bayes vs fully Bayes variable selection. J. Statist. Plann. Inference 138, 888-900.
[18] Dereich, S., Scheutzow, M. and Schottstedt, R. (2013). Constructive quantization: approximation by empirical measures. Ann. Inst. Henri Poincaré Probab. Stat. 49, 1183-1203.
[19] Diaconis, P. and Freedman, D. (1986). On consistency of Bayes estimates. Ann. Statist. 14, 1-26.
[20] Dowson, D.C. and Landau, B.V. (1982). The Fréchet distance between multivariate normal distributions. J. Multivariate Anal. 12, 450-455.
[21] Dudley, R.M. (1969). The speed of mean Glivenko-Cantelli convergence. Ann. Math. Statist. 40, 40-50.
[22] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge University Press, Cambridge.
[23] Dudley, R.M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge.
[24] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.
[25] Efron, B. (2010). Large-Scale Inference. Cambridge University Press, Cambridge.
[26] Einmahl, U. (2016). Law of the iterated logarithm type results for random vectors with infinite second moments. Math. Appl. 44, 167-181.
[27] de Finetti, B. (1930). Funzione caratteristica di un fenomeno aleatorio. Atti Reale Accademia Nazionale dei Lincei, Mem. 4, 86-133.
[28] de Finetti, B. (1933). Sull'approssimazione empirica di una legge di probabilità. Giorn. Istit. Ital. Attuari 4, 415-420.
[29] de Finetti, B. (1933). La legge dei grandi numeri nel caso dei numeri aleatori equivalenti. Atti Reale Accademia Nazionale dei Lincei, Rend. 18, 203-207.
[30] de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincaré 7, 1-68.
[31] Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Related Fields 162, 707-738.
[32] Glivenko, V.I. (1933). Sulla determinazione empirica delle leggi di probabilità. Giorn. Istit. Ital. Attuari 4, 92-99.
[33] Gozlan, N. and Léonard, C. (2007). A large deviation approach to some transportation cost inequalities. Probab. Theory Related Fields 139, 235-283.
[34] Gozlan, N. and Léonard, C. (2010). Transport inequalities. A survey. Markov Processes Related Fields 16, 635-736.
[35] Horn, R.A. and Johnson, C.R. (1985). Matrix Analysis. Cambridge University Press, Cambridge.
[36] Horowitz, J. and Karandikar, R.L. (1994). Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math. 55, 261-273.
[37] Kallenberg, O. (2002). Foundations of Modern Probability. 2nd ed. Springer-Verlag, New York.
[38] Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giorn. Istit. Ital. Attuari 4, 83-91.
[39] Maritz, J.S. and Lwin, T. (1989). Empirical Bayes Methods. 2nd ed. Chapman and Hall, London.
[40] Massart, J.E. (1988). About the Prokhorov distance between the uniform distribution over the unit cube in R^d and its empirical measure. Probab. Theory Related Fields 79, 431-450.
[41] Morris, C.N. (1983). Parametric empirical Bayes inference: theory and applications. J. Amer. Statist. Assoc. 78, 47-65.
[42] Olkin, I. and Pukelsheim, F. (1982). The distance between two random vectors with given dispersion matrices. Linear Algebra Appl. 48, 257-263.
[43] Petrone, S., Rizzelli, S., Rousseau, J. and Scricciolo, C. (2014). Empirical Bayes methods in classical and Bayesian inference. Metron 72, 201-215.
[44] Petrone, S., Rousseau, J. and Scricciolo, C. (2014). Bayes and empirical Bayes: do they merge? Biometrika 101, 285-302.
[45] Robbins, H. (1956). An empirical Bayes approach to statistics. Proc. 3rd Berkeley Symp. 1, 157-163.
[46] Rockafellar, R.T. (1970). Convex Analysis. Princeton University Press, Princeton.
[47] Rousseau, J. (2016). On the frequentist properties of Bayesian nonparametric methods. Annual Review of Statistics and its Applications 3, 211-231.
[48] Severini, T.A. (2000). Likelihood Methods in Statistics. Oxford University Press, Oxford.
[49] Shorack, G.R. and Wellner, J.A. (1986). Empirical Processes with Application to Statistics. Wiley, New York.
[50] Shor, P.W. and Yukich, J.E. (1991). Minimax grid matching and empirical measures. Ann. Probab. 19, 1338-1348.
[51] Siegmund, D. (1969). On moments of the maximum of normed partial sums. Ann. Math. Statist. 40, 527-531.
[52] Talagrand, M. (1994). The transportation cost from the uniform measure to the empirical measure in dimension ≥ 3. Ann. Probab. 22, 919-959.
[53] Teicher, H. (1971). Completion of a dominated ergodic theorem. Ann. Math. Stat. 42, 2156-2158.
[54] Trillos, N.G. and Slepcev, D. (2014). On the rate of convergence of empirical measures in ∞-transportation distance. arXiv:1407.1157.
[55] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
[56] Vapnik, V. and Chervonenkis, A. (2004). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264-280.
[57] Villani, C. (2009). Optimal Transport. Old and New. Springer, Berlin.
[58] Yukich, J.E. (1989). Optimal matching and empirical measures. Proc. Amer. Math. Soc. 107, 1051-1059.
[59] Yurinskii, V.V. (1976). Exponential inequalities for sums of random vectors. J. Multivariate Analysis 6, 473-499.
Emanuele Dolera (corresponding author), Dipartimento di Matematica “Felice Casorati”, Università degli Studi di Pavia, Via Ferrata 1, 27100 Pavia, Italy. E-mail:
[email protected]
Eugenio Regazzini, Dipartimento di Matematica “Felice Casorati”, Università degli Studi di Pavia, Via Ferrata 1, 27100 Pavia, Italy. E-mail:
[email protected]