arXiv:1702.01812v1 [math.ST] 6 Feb 2017
Consistent M-Estimation of Curved Exponential-Family Random Graph Models with Local Dependence and Growing Neighborhoods
Michael Schweinberger
Jonathan Stewart
February 8, 2017
Abstract
In general, statistical inference for exponential-family random graph models of dependent random graphs given a single observation of a random graph is problematic. We show that statistical inference for exponential-family random graph models holds promise as long as models are endowed with a suitable form of additional structure. We consider a simple and common form of additional structure called multilevel structure. To demonstrate that exponential-family random graph models with multilevel structure are amenable to statistical inference, we develop the first concentration and consistency results covering M-estimators of a wide range of full and non-full, curved exponential-family random graph models with local dependence and natural parameter vectors of increasing dimension. In addition, we show that multilevel structure facilitates local computing of M-estimators and in doing so reduces computing time. Taken together, these results suggest that exponential-family random graph models with multilevel structure constitute a promising direction of statistical network analysis.
1 Introduction
Models of network data have witnessed a surge of interest in statistics and related areas [e.g., 24]. Such data arise in the study of social networks [e.g., 31], terrorist networks [e.g., 26], the spread of infectious diseases [e.g., 23], and other areas. Since the work of Holland and Leinhardt in the 1970s [e.g., 14], it is well-known that network data exhibit a wide range of dependencies. The most powerful statistical framework for modeling dependencies is the class of discrete exponential-family random graph models pioneered by Frank and Strauss [9], with extensions by Wasserman and Pattison [44], Snijders et al. [41], Hunter and Handcock [18], and others [e.g., 27, 39, 8, 31]. Discrete exponential-family random graph models are popular among network scientists for the same reason the Ising model is popular among physicists: both classes of models enable scientists to model a wide range of interesting dependencies [e.g., 31]. It is worth noting that there are exponential-family random graph models with independence and dyad-independence assumptions, including the classic Bernoulli random graph model
of Erdős and Rényi [7] and the $p_1$-model of Holland and Leinhardt [15] and modern relatives such as the β-model [e.g., 5, 36, 29, 46], but such models cannot capture interesting dependencies and are therefore not widely used in practice, though such models are sometimes useful as null models [15].

Discrete exponential-family random graph models with dependence are challenging, both because some models possess undesirable properties (e.g., model degeneracy) and because estimators of many models have not been shown to possess desirable properties (e.g., consistency). The fact that some models possess undesirable properties, such as model degeneracy, has been known since the groundbreaking work of Strauss [42] in the 1980s: Strauss [42], Jonasson [21], and Handcock [12] showed that some models are near-degenerate in the sense of placing most probability mass on almost empty or almost complete graphs. More generally, Schweinberger [38] showed that a wide range of models is near-degenerate in terms of sufficient statistics in the sense of placing most probability mass on graphs that are close to the boundary of the convex hull of the sufficient statistics vector. Special cases were studied in depth by Jonasson [21] and Chatterjee and Diaconis [3]. The most striking implication is that near-degenerate models cannot represent observed network data, most of which are not extreme in terms of sufficient statistics. As a result, statistical inference for near-degenerate models is pointless: if no member of a family of distributions is able to represent important features of observed network data, then regardless of which member is selected, the goodness-of-fit of the model may be unacceptable. The implications of model degeneracy in terms of statistical inference were first studied by Handcock [12] and Rinaldo et al.
[35]: model degeneracy implies that if data are generated from near-degenerate distributions, the sufficient statistics vector is close to the boundary of the convex hull of the sufficient statistics vector with high probability [38], which implies that the maximum likelihood estimators of natural parameters either do not exist at all or are hard to obtain [12, 35]. Shalizi and Rinaldo [40] and others [5, 45, 36, 33, 29, 46] showed that maximum likelihood estimators of exponential-family random graph models are consistent under strong assumptions, but those assumptions rule out almost all exponential-family random graph models with dependence. Most troubling, there are no consistency results concerning curved exponential-family random graph models with dependence and natural parameter vectors of increasing dimension, which belong to the most popular exponential-family random graph models with dependence [41, 18, 16, 17].

To address these probabilistic and statistical challenges, we consider discrete exponential-family random graph models with additional structure. A simple form of additional structure is multilevel structure, which is observed in multilevel networks and is increasingly popular in the social sciences: see, e.g., the recent monograph on multilevel networks by Lazega and Snijders [30]. Multilevel structure is generated by either sampling designs or stochastic processes governing random graphs. An example of a sampling design generating multilevel structure is a sampling design that generates a sample of schools from a population of schools. An example of a stochastic process generating multilevel structure is a stochastic process that generates a partition of a set of nodes into subsets, called neighborhoods. The nodes are sometimes called level-one units and the neighborhoods level-two units and hence the structure underlying the random graph can be considered to be a two-level structure.
In applications, partitions may correspond to, e.g., classes in schools, departments in companies, and terrorist cells in terrorist networks. It is worth noting that models of multilevel networks are related to, but distinct from, stochastic block models [34]:
first, in multilevel networks the partition of the set of nodes is observed whereas it is unobserved in stochastic block models; and, second, models of multilevel networks focus on modeling dependencies within and between neighborhoods, whereas stochastic block models assume that edges are independent conditional on the neighborhood memberships of nodes.

We show that discrete exponential-family random graph models with multilevel structure address the probabilistic and statistical challenges discussed above and therefore constitute a promising direction of statistical network analysis: by inducing local dependence within neighborhoods, scientists are free to model interesting dependencies. At the same time, as long as the dependence is local and all neighborhoods are small, the overall dependence is weak, which reduces model degeneracy and facilitates concentration and consistency results. To demonstrate that exponential-family random graph models with multilevel structure hold promise, we develop the first concentration and consistency results covering M-estimators of a wide range of full and non-full, curved exponential-family random graph models with local dependence and neighborhood-dependent natural parameter vectors of increasing dimension. While some consistency results have been obtained under independence and dyad-independence assumptions [e.g., 5, 45, 36, 40, 33, 29, 46], there are hardly any results on exponential-family random graph models with dependence [45, 40, 33] and no results on the popular curved exponential-family random graph models with dependence. In contrast, we cover a large class of full and non-full, curved exponential-family random graph models with dependence. In addition, we show that multilevel structure facilitates local computing of M-estimators and in doing so reduces computing time.
Taken together, these results suggest that exponential-family random graph models with multilevel structure constitute a promising direction of statistical network analysis. The paper is structured as follows. Sections 2 and 3 discuss models and assumptions, respectively. Sections 4 and 5 describe the main concentration and consistency results, respectively. Section 6 discusses local computing. Section 7 presents simulation results and Section 8 an application.
2 Discrete exponential-family random graph models with multilevel structure
We consider discrete exponential-family random graph models with multilevel structure. A simple form of multilevel structure is a partition of a set of nodes into non-empty subsets $\mathcal{A}_1, \dots, \mathcal{A}_K$, called neighborhoods. We note that the partition of the set of nodes is observed in multilevel networks, as discussed in Section 1. We consider undirected random graphs with countable sample spaces, which covers binary and non-binary network data, including network count data. Extensions to directed random graphs are straightforward. Let $X = (X_k)_{k=1}^K$ and $Y = (Y_{k,l})_{k<l}^K$ […]

[…] $C_1 > 0$, $C_2 > 0$, and $K_0 \geq 1$ such that, for all $K > K_0$,

[C.3$^{\star\star}$]
\[
\sum_{k=1}^K \sum_{i=1}^{I_k} |\mu_{k,i}| \;\leq\; C_1 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \quad \text{for all } \mu \in M_0(\alpha).
\]

[C.4$^{\star\star}$]
\[
\sum_{k=1}^K \sum_{i=1}^{I_k} |s_{k,i}(x_{k,1}) - s_{k,i}(x_{k,2})| \;\leq\; C_2\, \|\mathcal{A}\|_\infty \sum_{k=1}^K d(x_{k,1}, x_{k,2}) \quad \text{for all } (x_{k,1}, x_{k,2}) \in \mathbb{X}_k \times \mathbb{X}_k,
\]
where $d(x_{k,1}, x_{k,2}) = \sum_{i \in \mathcal{A}_k < j \in \mathcal{A}_k}$ […]

[…] $t > 0$ and let $\mathcal{X} = \{x \in \mathbb{X} : |f(x) - \mathbb{E} f(X)| \geq t\}$. Since within-neighborhood edges do not depend on between-neighborhood edges, $\mathbb{P}(X \in \mathcal{X}, Y \in \mathbb{Y}) = \mathbb{P}(X \in \mathcal{X})$. In the following, we denote by $\mathbb{P}$ a probability measure on $(\mathbb{X}, \mathcal{S})$ with densities of the form (1), where $\mathcal{S}$ is the power set of the countable set $\mathbb{X}$. Keep in mind that $X = (X_k)_{k=1}^K$ denotes the sequence of within-neighborhood edge variables, where $X_k = (X_{i,j})_{i \in \mathcal{A}_k < j \in \mathcal{A}_k}$. In an abuse of notation, we denote the elements of the sequence of edge variables $X$ by $X_1, \dots, X_m$ with sample spaces $\mathbb{X}_1, \dots, \mathbb{X}_m$, respectively, where $m = \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$ is the number of edge variables. Let $X_{i:j} = (X_i, \dots, X_j)$ be a subsequence of edge variables with sample space $\mathbb{X}_{i:j}$, where $i \leq j$. By applying Theorem 1.1 of Kontorovich and Ramanan [25] to $\|f\|_{\mathrm{Lip}}$-Lipschitz functions $f : \mathbb{X} \mapsto \mathbb{R}$ defined on the countable set $\mathbb{X}$,
\[
\mathbb{P}(|f(X) - \mathbb{E} f(X)| \geq t) \;\leq\; 2 \exp\left(- \frac{t^2}{2\, m\, \|\Phi\|_\infty^2\, \|f\|_{\mathrm{Lip}}^2}\right),
\]
where $\Phi$ is the $m \times m$ upper triangular matrix with entries
\[
\phi_{i,j} =
\begin{cases}
\varphi_{i,j} & \text{if } i < j \\
1 & \text{if } i = j \\
0 & \text{if } i > j
\end{cases}
\]
and
\[
\|\Phi\|_\infty = \max_{1 \leq i \leq m} \left| 1 + \sum_{j=i+1}^m \varphi_{i,j} \right|.
\]
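The coefficients $\varphi_{i,j}$ are known as mixing coefficients and are defined by
\[
\varphi_{i,j} \;\equiv\; \sup_{\substack{x_{1:i-1} \in \mathbb{X}_{1:i-1} \\ (x_i, x_i^\star) \in \mathbb{X}_i \times \mathbb{X}_i}} \varphi_{i,j}(x_{1:i-1}, x_i, x_i^\star) \;=\; \sup_{\substack{x_{1:i-1} \in \mathbb{X}_{1:i-1} \\ (x_i, x_i^\star) \in \mathbb{X}_i \times \mathbb{X}_i}} \|\pi_{x_i} - \pi_{x_i^\star}\|_{\mathrm{TV}},
\]
where $\|\pi_{x_i} - \pi_{x_i^\star}\|_{\mathrm{TV}}$ is the total variation distance between the distributions $\pi_{x_i}$ and $\pi_{x_i^\star}$ given by
\[
\pi_{x_i} \equiv \pi(x_{j:m} \mid x_{1:i-1}, x_i) = \mathbb{P}(X_{j:m} = x_{j:m} \mid X_{1:i-1} = x_{1:i-1}, X_i = x_i)
\]
and
\[
\pi_{x_i^\star} \equiv \pi(x_{j:m} \mid x_{1:i-1}, x_i^\star) = \mathbb{P}(X_{j:m} = x_{j:m} \mid X_{1:i-1} = x_{1:i-1}, X_i = x_i^\star).
\]
Since the support of $\pi_{x_i}$ and $\pi_{x_i^\star}$ is countable,
\[
\|\pi_{x_i} - \pi_{x_i^\star}\|_{\mathrm{TV}} = \frac{1}{2} \sum_{x_{j:m} \in \mathbb{X}_{j:m}} |\pi(x_{j:m} \mid x_{1:i-1}, x_i) - \pi(x_{j:m} \mid x_{1:i-1}, x_i^\star)|.
\]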
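The total variation formula above can be illustrated with a minimal Python sketch. The function name `total_variation` and the two small probability mass functions are hypothetical and chosen only for illustration; they are not taken from the paper, and a finite support stands in for the countable one.

```python
def total_variation(p, q):
    """Total variation distance between two pmfs on a common countable space.

    p and q are dicts mapping outcomes to probabilities; an outcome missing
    from a dict is treated as having probability zero.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Hypothetical conditional pmfs of two remaining edge variables,
# one conditional on x_i and one conditional on x_i^*.
pi_x = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.25}
pi_x_star = {(0, 0): 0.25, (1, 1): 0.75}

tv = total_variation(pi_x, pi_x_star)
print(tv)
```

The distance is zero for identical distributions and reaches one only when the two supports are disjoint, which is why a mixing coefficient defined as a supremum of such distances can never exceed one.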
An upper bound on $\|\Phi\|_\infty$ can be obtained by bounding the mixing coefficients $\varphi_{i,j}$ as follows. Consider any pair of edge variables $X_i$ and $X_j$. If $X_i$ and $X_j$ involve nodes in more than one neighborhood, the mixing coefficient $\varphi_{i,j}$ vanishes by the local dependence induced by exponential families of the form (1). If the pair of nodes corresponding to $X_i$ and the pair of nodes corresponding to $X_j$ belong to the same neighborhood, the mixing coefficient $\varphi_{i,j}$ can be bounded as follows:
\[
\begin{aligned}
\varphi_{i,j}(x_{1:i-1}, x_i, x_i^\star) &= \frac{1}{2} \sum_{x_{j:m} \in \mathbb{X}_{j:m}} |\pi(x_{j:m} \mid x_{1:i-1}, x_i) - \pi(x_{j:m} \mid x_{1:i-1}, x_i^\star)| \\
&\leq \frac{1}{2} \sum_{x_{j:m} \in \mathbb{X}_{j:m}} \pi(x_{j:m} \mid x_{1:i-1}, x_i) + \frac{1}{2} \sum_{x_{j:m} \in \mathbb{X}_{j:m}} \pi(x_{j:m} \mid x_{1:i-1}, x_i^\star) = 1,
\end{aligned}
\]
because $\pi_{x_i}$ and $\pi_{x_i^\star}$ are conditional probability mass functions with countable support $\mathbb{X}_{j:m}$. We note that the upper bound is not sharp, but it has the advantage that it covers a wide range of dependencies within neighborhoods. Thus,
\[
\|\Phi\|_\infty = \max_{1 \leq i \leq m} \left| 1 + \sum_{j=i+1}^m \varphi_{i,j} \right| \;\leq\; \binom{\|\mathcal{A}\|_\infty}{2},
\]
because each edge variable $X_i$ can depend on at most $\binom{\|\mathcal{A}\|_\infty}{2} - 1$ other edge variables corresponding to pairs of nodes belonging to the same pair of neighborhoods. Therefore, there exists $C > 0$ such that, for all $K \geq 1$ and all $t > 0$,
\[
\mathbb{P}(|f(X) - \mathbb{E} f(X)| \geq t) \;\leq\; 2 \exp\left(- \frac{t^2}{C \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\, \|\mathcal{A}\|_\infty^4\, \|f\|_{\mathrm{Lip}}^2}\right),
\]
where $\|\mathcal{A}\|_\infty > 0$ because all neighborhoods $\mathcal{A}_k$ are non-empty and $\|f\|_{\mathrm{Lip}} > 0$ by assumption.
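To see how the bound behaves numerically, the following Python sketch evaluates the intermediate form $2 \exp(-t^2 / (2\, m\, \|\Phi\|_\infty^2\, \|f\|_{\mathrm{Lip}}^2))$ with $\|\Phi\|_\infty$ replaced by its upper bound $\binom{\|\mathcal{A}\|_\infty}{2}$. The function names, the neighborhood sizes, and the value of $\epsilon$ are illustrative assumptions, and the result is capped at one because it bounds a probability.

```python
from math import comb, exp


def concentration_bound(sizes, t, lipschitz=1.0):
    """Upper bound on P(|f(X) - E f(X)| >= t) for a Lipschitz function f.

    sizes: neighborhood sizes |A_1|, ..., |A_K|.
    Uses m = sum_k C(|A_k|, 2) and ||Phi||_inf <= C(max_k |A_k|, 2).
    """
    m = sum(comb(a, 2) for a in sizes)
    if m == 0:
        raise ValueError("need at least one neighborhood with two or more nodes")
    phi_norm = comb(max(sizes), 2)
    bound = 2.0 * exp(-t ** 2 / (2.0 * m * phi_norm ** 2 * lipschitz ** 2))
    return min(1.0, bound)  # a probability never exceeds 1


def edge_count_bound(K, size, eps):
    """Bound for the edge count (Lipschitz coefficient 1) with t = eps * m."""
    sizes = [size] * K
    m = sum(comb(a, 2) for a in sizes)
    return concentration_bound(sizes, eps * m)


# With neighborhoods of fixed size, the bound shrinks as K grows.
for K in (50, 200, 800):
    print(K, edge_count_bound(K, size=5, eps=0.5))
```

With the neighborhood size held fixed, the exponent grows linearly in $K$, so the bound is vacuous for small $K$ and decays exponentially once $K$ is large relative to $\|\mathcal{A}\|_\infty^4$.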
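Proof of Corollary 1. Let $f : \mathbb{X} \mapsto \mathbb{R}$ be defined by $f(X) = \sum_{k=1}^K \sum_{i \in \mathcal{A}_k < j \in \mathcal{A}_k} X_{i,j}$, where $\mathbb{X} = \{0, 1\}^{\sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}}$. The Lipschitz coefficient of $f$ with respect to the Hamming metric $d : \mathbb{X} \times \mathbb{X} \mapsto \mathbb{R}_0^+$ is given by $\|f\|_{\mathrm{Lip}} = 1$. An application of Proposition 1 to deviations of size $t = \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$ shows that there exist $C > 0$ and $K_0 \geq 1$ such that, for all $\epsilon > 0$ and all $K > K_0$,
\[
\mathbb{P}\left(|f(X) - \mathbb{E} f(X)| \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right) \;\leq\; 2 \exp\left(- \frac{\epsilon^2\, C\, K}{\|\mathcal{A}\|_\infty^2}\right),
\]
where we used the assumption that all neighborhoods grow at the same rate, which implies that there exists $C_1 > 0$ such that $|\mathcal{A}_k| \geq \|\mathcal{A}\|_\infty / C_1$ ($k = 1, \dots, K$) and thus there exists $C_2 > 0$ such that
\[
t^2 = \epsilon^2 \left(\sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right)^2 \;\geq\; \epsilon^2\, C_2\, K\, \|\mathcal{A}\|_\infty^2 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}.
\]

Lemma 1. Consider a full or non-full, curved exponential family of the form (1) with $\dim(\eta_k) \to \infty$ ($k = 1, \dots, K$) as $K \to \infty$ satisfying condition [C.4]. Then there exist $C > 0$ and $K_0 \geq 1$ such that, for all $\epsilon > 0$, all $\theta \in \Theta_0$, and all $K > K_0$, with at least probability $1 - 2 \exp(-\epsilon^2\, C\, K / \|\mathcal{A}\|_\infty^4)$,
\[
|g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \;<\; \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}.
\]

Proof of Lemma 1. By assumption, $\mathbb{E}\, b(X)$ exists. Observe that, for all $\theta \in \Theta_0$, the deviation of $g(\theta; b(X))$ from $g(\theta; \mathbb{E}\, b(X))$ can be written as
\[
|g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| = |\langle a(\theta),\, b(X) - \mathbb{E}\, b(X) \rangle|.
\]
Let $f : \mathbb{X} \mapsto \mathbb{R}$ be defined by $f(X) = \langle a(\theta), b(X) \rangle$, where $f$ is considered as a function of $X$ for fixed $\theta \in \Theta_0$. We are interested in bounding the probability of deviations of the form $|f(X) - \mathbb{E} f(X)| \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$, where $\epsilon > 0$. By condition [C.4], there exist $A > 0$ and $K_0 \geq 1$ such that, for all $K > K_0$, the Lipschitz coefficient of $f$ with respect to the Hamming metric $d : \mathbb{X} \times \mathbb{X} \mapsto \mathbb{R}_0^+$ is bounded above by $\|f\|_{\mathrm{Lip}} \leq A\, \|\mathcal{A}\|_\infty$. Thus, by applying Proposition 1 to deviations of size $t = \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$, we obtain, for all $\epsilon > 0$ and all $K > K_0$,
\[
\mathbb{P}\left(|g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right) \;\leq\; 2 \exp\left(- \frac{\epsilon^2 \left(\sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right)^2}{A^2 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\, \|\mathcal{A}\|_\infty^4\, \|\mathcal{A}\|_\infty^2}\right).
\]
Therefore, there exists $C > 0$ such that, for all $K > K_0$,
\[
\mathbb{P}\left(|g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right) \;\leq\; 2 \exp\left(- \frac{\epsilon^2\, C\, K}{\|\mathcal{A}\|_\infty^4}\right),
\]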
where we used the assumption that $\dim(\theta) < \infty$ and that all neighborhoods grow at the same rate (see, e.g., the proof of Corollary 1).

Proof of Proposition 2. By assumption, $\mathbb{E}\, b(X)$ exists. We are interested in bounding the probability of sup-deviations of the form $\sup_{\theta \in \Theta_0} |g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))|$. A helpful observation is that we do not need to bound the probability of sup-deviations on the whole set $\mathbb{X}$: it suffices to bound the probability of sup-deviations on high-probability subsets of $\mathbb{X}$. To construct a convenient high-probability subset of $\mathbb{X}$, define
\[
G(\alpha) = \left\{ x \in \mathbb{X} : |g(\theta_0; b(x)) - g(\theta_0; \mathbb{E}\, b(X))| < \alpha \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \right\},
\]
where $\alpha > 0$ is identical to the constant $\alpha > 0$ used in the construction of the set $M_0(\alpha)$. Choose any $\epsilon > 0$ and denote the event of primary interest by
\[
E(\epsilon) = \left\{ x \in \mathbb{X} : \sup_{\theta \in \Theta_0} |g(\theta; b(x)) - g(\theta; \mathbb{E}\, b(X))| \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \right\}.
\]
Since, for all $\alpha > 0$ and all $\epsilon > 0$,
\[
\mathbb{P}(E(\epsilon)) \;\leq\; \mathbb{P}(\operatorname{comp} G(\alpha)) + \mathbb{P}(E(\epsilon) \cap G(\alpha)), \tag{10}
\]
the probability of event $E(\epsilon)$ can be bounded by bounding the probabilities of events $\operatorname{comp} G(\alpha)$ and $E(\epsilon) \cap G(\alpha)$, where $\operatorname{comp} G(\alpha)$ denotes the complement of $G(\alpha)$. The probability of event $\operatorname{comp} G(\alpha)$ can be bounded by using Lemma 1, which shows that there exist $C_1 > 0$ and $K_1 \geq 1$ such that, for all $K > K_1$,
\[
\mathbb{P}(\operatorname{comp} G(\alpha)) \;\leq\; 2 \exp\left(- \frac{\alpha^2\, C_1\, K}{\|\mathcal{A}\|_\infty^4}\right). \tag{11}
\]
To bound the probability of event $E(\epsilon) \cap G(\alpha)$, choose any $\rho > 0$ satisfying $0 < \rho < \epsilon / (6 A_3)$, where $A_3 > 0$ is equal to the constant $A_3 > 0$ in condition [C.3]. We can construct an open cover of $\Theta_0$ of the form $\Theta_0 \subset \bigcup_{\theta \in \Theta_0} B(\theta, \rho)$ by using open balls $B(\theta, \rho)$ centered at $\theta \in \Theta_0$ with radius $\rho > 0$. By the compactness of $\Theta_0$, there exists a finite subcover of $\Theta_0$ of the form $\Theta_0 \subset \bigcup_{1 \leq l \leq L} B(\theta_l, \rho)$. To bound the probability of sup-deviations, observe that, for all $\theta \in \Theta_0$,
\[
|g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| = |\langle a(\theta),\, b(X) - \mathbb{E}\, b(X) \rangle|,
\]
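which, along with the triangle inequality, implies that the probability of event $E(\epsilon) \cap G(\alpha)$ can be bounded as follows:
\[
\begin{aligned}
&\mathbb{P}\left(\sup_{\theta \in \Theta_0} |g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \geq t \,\cap\, G(\alpha)\right) \\
&\quad\leq\; \mathbb{P}\left(\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \geq t \,\cap\, G(\alpha)\right) \\
&\quad\leq\; \mathbb{P}\left(\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |\langle a(\theta) - a(\theta_l),\, b(X) \rangle| \geq \frac{t}{3} \,\cap\, G(\alpha)\right) \\
&\qquad+\; \mathbb{P}\left(\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |\langle a(\theta_l),\, b(X) - \mathbb{E}\, b(X) \rangle| \geq \frac{t}{3} \,\cap\, G(\alpha)\right) \\
&\qquad+\; \mathbb{P}\left(\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |\langle a(\theta_l) - a(\theta),\, \mathbb{E}\, b(X) \rangle| \geq \frac{t}{3} \,\cap\, G(\alpha)\right),
\end{aligned}
\tag{12}
\]
where $t = \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$. We bound the three probabilities on the right-hand side of (12) one by one.

First term on the right-hand side of (12). By applying condition [C.3], there exist $A_3 > 0$ and $K_2 \geq 1$ such that, for all $K > K_2$ and all $x \in G(\alpha)$,
\[
\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |\langle a(\theta) - a(\theta_l),\, b(x) \rangle| \;\leq\; \max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} A_3\, \|\theta - \theta_l\|_2 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}.
\]
By construction, $\rho > 0$ satisfies $0 < \rho < \epsilon / (6 A_3)$, which implies
\[
\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} A_3\, \|\theta - \theta_l\|_2 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \;\leq\; A_3\, 2 \rho \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \;<\; \frac{\epsilon}{3} \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}.
\]
Therefore, for all $K > K_2$, the intersection of the events
\[
\left\{ x \in \mathbb{X} : \max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |\langle a(\theta) - a(\theta_l),\, b(x) \rangle| \geq \frac{\epsilon}{3} \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \right\}
\]
and $G(\alpha)$ is the empty set and thus, for all $K > K_2$, the first probability on the right-hand side of (12) vanishes.

Second term on the right-hand side of (12). Observe that
\[
\begin{aligned}
&\mathbb{P}\left(\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |\langle a(\theta_l),\, b(X) - \mathbb{E}\, b(X) \rangle| \geq \frac{\epsilon}{3} \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, G(\alpha)\right) \\
&\quad=\; \mathbb{P}\left(\max_{1 \leq l \leq L} |\langle a(\theta_l),\, b(X) - \mathbb{E}\, b(X) \rangle| \geq \frac{\epsilon}{3} \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, G(\alpha)\right).
\end{aligned}
\]
Let $f : \mathbb{X} \mapsto \mathbb{R}$ be defined by $f(X) = \langle a(\theta), b(X) \rangle$, where $f$ is considered as a function of $X$ for fixed $\theta \in \Theta_0$. We are interested in bounding the probability of deviations of the form $|f(X) - \mathbb{E} f(X)| \geq (\epsilon / 3) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$. By condition [C.4], there exist $A > 0$ and $K_3 \geq 1$ such that, for all $K > K_3$, the Lipschitz coefficient of $f$ with respect to the Hamming metric $d : \mathbb{X} \times \mathbb{X} \mapsto \mathbb{R}_0^+$ is bounded above by $\|f\|_{\mathrm{Lip}} \leq A\, \|\mathcal{A}\|_\infty$. Thus, by applying Proposition 1 to deviations of size $t = (\epsilon / 3) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$ along with a union bound over the $L < \infty$ balls that make up the finite subcover of $\Theta_0$, we have, for all $\epsilon > 0$ and all $K > K_3$,
\[
\begin{aligned}
&\mathbb{P}\left(\max_{1 \leq l \leq L} |\langle a(\theta_l),\, b(X) - \mathbb{E}\, b(X) \rangle| \geq \frac{\epsilon}{3} \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, G(\alpha)\right) \\
&\quad\leq\; \mathbb{P}\left(\max_{1 \leq l \leq L} |\langle a(\theta_l),\, b(X) - \mathbb{E}\, b(X) \rangle| \geq \frac{\epsilon}{3} \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right) \\
&\quad\leq\; 2 \exp\left(- \frac{\epsilon^2 \left(\sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right)^2}{9\, A^2 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\, \|\mathcal{A}\|_\infty^4\, \|\mathcal{A}\|_\infty^2} + \log L\right),
\end{aligned}
\]
which implies that there exist $C_2 > 0$ and $K_4 \geq 1$ such that, for all $K > K_4$, the probability of interest is bounded above by
\[
2 \exp\left(- \frac{\epsilon^2\, C_2\, K}{\|\mathcal{A}\|_\infty^4}\right),
\]
where we used the assumption that all neighborhoods grow at the same rate (see, e.g., the proof of Corollary 1).

Third term on the right-hand side of (12). The third term can be bounded along the same lines as the first term. Using condition [C.3] and $0 < \rho < \epsilon / (3 A_3)$ shows that there exists $K_5 \geq 1$ such that, for all $K > K_5$,
\[
\max_{1 \leq l \leq L}\, \sup_{\theta \in B(\theta_l, \rho)} |\langle a(\theta_l) - a(\theta),\, \mathbb{E}\, b(X) \rangle| \;<\; \frac{\epsilon}{3} \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2},
\]
implying that, for all $K > K_5$, the third probability on the right-hand side of (12) vanishes.

Conclusion. By combining (10)–(12), there exists $C = \min(C_1, C_2) > 0$ such that, for all $\epsilon > 0$ and all $K > \max(K_1, K_2, K_3, K_4, K_5)$,
\[
\begin{aligned}
\mathbb{P}\left(\sup_{\theta \in \Theta_0} |g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right)
&\leq\; 2 \exp\left(- \frac{\alpha^2\, C_1\, K}{\|\mathcal{A}\|_\infty^4}\right) + 2 \exp\left(- \frac{\epsilon^2\, C_2\, K}{\|\mathcal{A}\|_\infty^4}\right) \\
&\leq\; 4 \exp\left(- \frac{\min(\alpha^2, \epsilon^2)\, C\, K}{\|\mathcal{A}\|_\infty^4}\right).
\end{aligned}
\]

Proof of Proposition 3. Write
\[
\nabla_\theta \log p_{\eta(\theta)}(X)\big|_{\theta = \theta^\star} = \left(\nabla_\theta\, \eta(\theta)\big|_{\theta = \theta^\star}\right)^\top \left(\widehat{\mu(\theta^\star)} - \mu(\theta^\star)\right) = f(X) - \mathbb{E} f(X),
\]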
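where $f(X) = (\nabla_\theta\, \eta(\theta)|_{\theta = \theta^\star})^\top s(X)$ and $\mathbb{E} f(X) = (\nabla_\theta\, \eta(\theta)|_{\theta = \theta^\star})^\top\, \mathbb{E}_{\eta(\theta^\star)}\, s(X)$. Observe that $\mathbb{E}_{\eta(\theta^\star)}\, s(X)$ exists as long as the natural parameter vector $\eta(\theta)$ is in the interior $\operatorname{int}(\mathbb{N})$ of $\mathbb{N}$ [e.g., 1, Theorem 2.2, pp. 34–35], which is ensured by the assumption that $\eta : \Theta \mapsto \Xi$ and $\Xi \subseteq \operatorname{int}(\mathbb{N})$. We have, for all $\epsilon > 0$,
\[
\mathbb{P}\left(\|f(X) - \mathbb{E} f(X)\|_2 \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right) \;\leq\; \mathbb{P}\left(\|f(X) - \mathbb{E} f(X)\|_\infty \geq \frac{\epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}}{\sqrt{\dim(\theta)}}\right).
\]
Let $f_i(X)$ be the $i$-th component of $f(X)$. Then
\[
\mathbb{P}\left(\|f(X) - \mathbb{E} f(X)\|_\infty \geq \frac{\epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}}{\sqrt{\dim(\theta)}}\right) \;\leq\; \mathbb{P}\left(\bigcup_{i=1}^{\dim(\theta)} \left\{ |f_i(X) - \mathbb{E}_{\eta(\theta^\star)} f_i(X)| \geq \frac{\epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}}{\sqrt{\dim(\theta)}} \right\}\right).
\]
By assumption, there exist $A > 0$ and $K_1 \geq 1$ such that, for all $K > K_1$, the Lipschitz coefficient of $f_i$ with respect to the Hamming metric $d : \mathbb{X} \times \mathbb{X} \mapsto \mathbb{R}_0^+$ is bounded above by $\|f_i\|_{\mathrm{Lip}} \leq A\, \|\mathcal{A}\|_\infty$. Thus, by applying Proposition 1 along with a union bound over the $\dim(\theta) < \infty$ components of $f(X)$, we have, for all $\epsilon > 0$ and all $K > K_1$,
\[
\mathbb{P}\left(\bigcup_{i=1}^{\dim(\theta)} \left\{ |f_i(X) - \mathbb{E}_{\eta(\theta^\star)} f_i(X)| \geq \frac{\epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}}{\sqrt{\dim(\theta)}} \right\}\right) \;\leq\; 2 \exp\left(- \frac{\epsilon^2 \left(\sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right)^2}{\dim(\theta)\, A^2 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\, \|\mathcal{A}\|_\infty^4\, \|\mathcal{A}\|_\infty^2} + \log \dim(\theta)\right).
\]
Therefore, there exist $C > 0$ and $K_2 \geq 1$ such that, for all $\epsilon > 0$ and all $K > K_2$,
\[
\mathbb{P}\left(\|f(X) - \mathbb{E} f(X)\|_2 \geq \epsilon \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}\right) \;\leq\; 2 \exp\left(- \frac{\epsilon^2\, C\, K}{\|\mathcal{A}\|_\infty^4}\right),
\]
where we used the assumption that $\dim(\theta) < \infty$ and that all neighborhoods grow at the same rate (see, e.g., the proof of Corollary 1).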
B Proofs: consistency
We prove the main consistency results of Section 5, Theorem 1 along with Corollaries 2 and 3.

Proof of Theorem 1. We proceed in two steps. In the first step, we show that $\hat\theta$ falls into a convex and compact subset $\Theta_0$ of $\Theta$ containing $\theta_0$ with high probability. In the second step, we show that, provided $\hat\theta$ is contained in $\Theta_0$, $\hat\theta$ is arbitrarily close to $\theta_0$ with high probability. These two steps are based on the observation that, for all $\epsilon > 0$, the probability of event $\|\hat\theta - \theta_0\|_2 \geq \epsilon$ can be bounded as follows:
\[
\mathbb{P}(\|\hat\theta - \theta_0\|_2 \geq \epsilon) \;\leq\; \mathbb{P}(\hat\theta \notin \Theta_0) + \mathbb{P}(\|\hat\theta - \theta_0\|_2 \geq \epsilon \,\cap\, \hat\theta \in \Theta_0). \tag{13}
\]

First step: event $\hat\theta \notin \Theta_0$. By identifiability condition (4) of Theorem 1, there exists a convex and compact subset $\Theta_0$ of $\Theta$ containing $\theta_0$ and $C_0 > 0$ and $K_0 \geq 1$ such that, for all $K > K_0$ and all $x \in \mathbb{X}$ such that $b(x) \in M_0(\alpha)$,
\[
\sup_{\theta \in \Theta} g(\theta; b(x)) - \sup_{\theta \in \Theta \setminus \Theta_0} g(\theta; b(x)) \;>\; C_0 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}, \qquad b(x) \in M_0(\alpha).
\]
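Thus, $b(x) \in M_0(\alpha)$ implies that the supremum $\sup_{\theta \in \Theta} g(\theta; b(x))$ is attained on the convex and compact subset $\Theta_0$ of $\Theta$ containing $\theta_0$ and the maximizer $\hat\theta$ of $g(\theta; b(x))$ is unique by condition [C.2]. As a result, the probability of event $\hat\theta \notin \Theta_0$ is bounded above by the probability of event $b(X) \notin M_0(\alpha)$. The probability of event $b(X) \notin M_0(\alpha)$ can be bounded by using Lemma 1, which shows that there exist $C_1 > 0$ and $K_1 \geq 1$ such that, for all $K > K_1$,
\[
\mathbb{P}(\hat\theta \notin \Theta_0) \;\leq\; \mathbb{P}(b(X) \notin M_0(\alpha)) \;\leq\; 2 \exp\left(- \frac{\alpha^2\, C_1\, K}{\|\mathcal{A}\|_\infty^4}\right). \tag{14}
\]

Second step: event $\|\hat\theta - \theta_0\|_2 \geq \epsilon \,\cap\, \hat\theta \in \Theta_0$. By identifiability condition (5) of Theorem 1, for all $\epsilon > 0$, there exists $\delta(\epsilon) > 0$ such that, for all $K > K_1$ and all $\theta \in \Theta_0$ with $\|\theta - \theta_0\|_2 \geq \epsilon$,
\[
g(\theta_0; \mathbb{E}\, b(X)) - g(\theta; \mathbb{E}\, b(X)) \;\geq\; \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}.
\]
Therefore, $\|\theta - \theta_0\|_2 \geq \epsilon$ implies $g(\theta_0; \mathbb{E}\, b(X)) - g(\theta; \mathbb{E}\, b(X)) \geq \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$ and thus
\[
\mathbb{P}(\|\hat\theta - \theta_0\|_2 \geq \epsilon \,\cap\, \hat\theta \in \Theta_0) \;\leq\; \mathbb{P}\left(g(\theta_0; \mathbb{E}\, b(X)) - g(\hat\theta; \mathbb{E}\, b(X)) \geq \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, \hat\theta \in \Theta_0\right).
\]
To bound the probability of event $g(\theta_0; \mathbb{E}\, b(X)) - g(\hat\theta; \mathbb{E}\, b(X)) \geq \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, \hat\theta \in \Theta_0$, observe that
\[
g(\theta_0; \mathbb{E}\, b(X)) - g(\hat\theta; \mathbb{E}\, b(X)) \;\leq\; 2 \sup_{\theta \in \Theta_0} |g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))|.
\]
As a result, the probability of event $g(\theta_0; \mathbb{E}\, b(X)) - g(\hat\theta; \mathbb{E}\, b(X)) \geq \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, \hat\theta \in \Theta_0$ can be bounded above as follows:
\[
\begin{aligned}
&\mathbb{P}\left(g(\theta_0; \mathbb{E}\, b(X)) - g(\hat\theta; \mathbb{E}\, b(X)) \geq \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, \hat\theta \in \Theta_0\right) \\
&\quad\leq\; \mathbb{P}\left(2 \sup_{\theta \in \Theta_0} |g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \geq \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, \hat\theta \in \Theta_0\right).
\end{aligned}
\]
By Proposition 2, there exist $C_2 > 0$ and $K_2 \geq 1$ such that, for all $\epsilon > 0$ and all $K > K_2$,
\[
\mathbb{P}\left(2 \sup_{\theta \in \Theta_0} |g(\theta; b(X)) - g(\theta; \mathbb{E}\, b(X))| \geq \delta(\epsilon) \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2} \,\cap\, \hat\theta \in \Theta_0\right) \;\leq\; 4 \exp\left(- \frac{\min(\alpha^2, \delta(\epsilon)^2)\, C_2\, K}{\|\mathcal{A}\|_\infty^4}\right),
\]
implying
\[
\mathbb{P}(\|\hat\theta - \theta_0\|_2 \geq \epsilon \,\cap\, \hat\theta \in \Theta_0) \;\leq\; 4 \exp\left(- \frac{\min(\alpha^2, \delta(\epsilon)^2)\, C_2\, K}{\|\mathcal{A}\|_\infty^4}\right). \tag{15}
\]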
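Combining (13)–(15) shows that there exists $C = \min(C_1, C_2) > 0$ such that, for all $\epsilon > 0$ and all $K > \max(K_0, K_1, K_2)$,
\[
\begin{aligned}
\mathbb{P}(\|\hat\theta - \theta_0\|_2 \geq \epsilon)
&\leq\; 2 \exp\left(- \frac{\alpha^2\, C_1\, K}{\|\mathcal{A}\|_\infty^4}\right) + 4 \exp\left(- \frac{\min(\alpha^2, \delta(\epsilon)^2)\, C_2\, K}{\|\mathcal{A}\|_\infty^4}\right) \\
&\leq\; 6 \exp\left(- \frac{\min(\alpha^2, \delta(\epsilon)^2)\, C\, K}{\|\mathcal{A}\|_\infty^4}\right).
\end{aligned}
\]

Proof of Corollary 2. The corollary follows from Theorem 1 provided conditions [C.1]–[C.4] are satisfied. To show that conditions [C.1]–[C.4] are satisfied, note that $\eta(\theta) = \theta$ in canonical exponential families. Condition [C.1] is satisfied because $\eta(\theta) = \theta$. Condition [C.2] follows from $\eta(\theta) = \theta$ and the upper semicontinuity and strict concavity of exponential-family loglikelihood functions on $\Theta \subseteq \{\theta \in \mathbb{R}^{\dim(\theta)} : \psi(\eta(\theta)) < \infty\}$ [e.g., 1, Lemma 5.3, p. 146]. Conditions [C.3] and [C.4] need to hold on convex and compact subsets $\Theta^\star$ of $\Theta$ containing $\theta^\star$. Let $\Theta^\star$ be any convex and compact subset of $\Theta$ containing $\theta^\star$. Using
\[
|\langle \theta_1 - \theta_2,\, \mu \rangle| \;\leq\; \|\theta_1 - \theta_2\|_1\, \|\mu\|_\infty \;\leq\; \sqrt{\dim(\theta)}\; \|\theta_1 - \theta_2\|_2\, \|\mu\|_\infty
\]
shows that condition [C.3] is satisfied as long as $\|\mu\|_\infty \leq C_1 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}$ for all $\mu \in M_0(\alpha)$, because $\dim(\theta) < \infty$ and $\Theta^\star$ is a compact subset of $\Theta$. The same argument shows that
\[
|\langle \theta,\, s(x_1) - s(x_2) \rangle| \;\leq\; \sqrt{\dim(\theta)}\; \|\theta\|_2\, \|s(x_1) - s(x_2)\|_\infty,
\]
so that condition [C.4] is satisfied as long as $\|s(x_1) - s(x_2)\|_\infty \leq C_2\, d(x_1, x_2)\, \|\mathcal{A}\|_\infty$ for all $(x_1, x_2) \in \mathbb{X} \times \mathbb{X}$.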
Proof of Corollary 3. The corollary follows from Theorem 1 provided conditions [C.1]–[C.4] hold. To show that conditions [C.1]–[C.4] hold, it is convenient to first consider the special case $\theta_1 = 1$ and then the general case $0 < \theta_1 < \infty$.

Case $\theta_1 = 1$. To ease the presentation, we drop the subscript of $\theta_2$ and write $\theta$ rather than $\theta_2$. The coordinates $\eta_{k,i}(\theta)$ of the neighborhood-dependent natural parameter vectors $\eta_k(\theta)$ can then be written as
\[
\eta_{k,i}(\theta) = \theta \left[ 1 - \left(1 - \frac{1}{\theta}\right)^i \right], \qquad i = 1, \dots, I_k, \quad I_k \geq 2, \quad k = 1, \dots, K. \tag{16}
\]
Observe that exponential families with natural parameters of the form (16) can be reduced to an exponential family with natural parameter vector $\eta(\theta)$ of dimension $\dim(\eta) = \max_{1 \leq k \leq K} I_k$, where $\max_{1 \leq k \leq K} I_k \to \infty$ as $K \to \infty$. We denote the coordinates of the natural parameter vector $\eta(\theta)$ by $\eta_i(\theta)$, which are given by
\[
\eta_i(\theta) = \theta - \theta\, \beta(\theta)^i, \qquad i = 1, \dots, \max_{1 \leq k \leq K} I_k,
\]
where
\[
\beta(\theta) = 1 - \frac{1}{\theta}
\]
and
\[
\Theta = \left\{ \theta \in \mathbb{R} : \frac{1}{2} < \theta < \infty,\; \psi(\eta(\theta)) < \infty \right\}.
\]
The coordinates $\eta_{k,i}(\theta)$ of the neighborhood-dependent natural parameter vectors $\eta_k(\theta)$ are related to the coordinates $\eta_i(\theta)$ of the natural parameter vector $\eta(\theta)$ of the exponential family as follows:
\[
\eta_{k,i}(\theta) = \eta_i(\theta), \qquad i = 1, \dots, I_k, \quad k = 1, \dots, K.
\]
A helpful observation is that the coordinates $\eta_i(\theta)$ of $\eta(\theta)$ are continuously differentiable on $(1/2, \infty)$ with derivatives
\[
\nabla_\theta\, \eta_i(\theta) = 1 - \beta(\theta)^i - \frac{i}{\theta}\, \beta(\theta)^{i-1}, \qquad \theta \in (1/2, \infty).
\]
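The closed-form derivative above is easy to sanity-check numerically. The following Python sketch, in which the function names and the test points are illustrative assumptions, compares the stated derivative with a central finite difference and also evaluates the two special cases $i = 1$ and $i = 2$.

```python
def beta(theta):
    """beta(theta) = 1 - 1/theta."""
    return 1.0 - 1.0 / theta


def eta(theta, i):
    """Coordinate eta_i(theta) = theta - theta * beta(theta)^i."""
    return theta - theta * beta(theta) ** i


def grad_eta(theta, i):
    """Stated derivative: 1 - beta^i - (i / theta) * beta^(i - 1)."""
    b = beta(theta)
    return 1.0 - b ** i - (i / theta) * b ** (i - 1)


def finite_difference(theta, i, h=1e-6):
    """Central finite-difference approximation of the derivative of eta_i."""
    return (eta(theta + h, i) - eta(theta - h, i)) / (2.0 * h)


for theta in (0.75, 1.5, 3.0):
    for i in (1, 2, 5):
        print(theta, i, grad_eta(theta, i), finite_difference(theta, i))
```

For $i = 1$ the coordinate is constant, $\eta_1(\theta) = 1$, and for $i = 2$ the derivative simplifies to $1/\theta^2$.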
We check conditions [C.1]–[C.4] one by one.

Case $\theta_1 = 1$: condition [C.1]. To show that the map $\eta : \Theta \mapsto \Xi$ is one-to-one on $\Theta$ for all $K \geq 1$, we show that at least one coordinate of $\eta(\theta + \delta)$ must deviate from $\eta(\theta)$ for all $\theta \in (1/2, \infty)$, all $\delta > 0$, and all $K \geq 1$. To do so, note that $\eta(\theta)$ has at least two coordinates, denoted by $\eta_1(\theta)$ and $\eta_2(\theta)$, because $I_k \geq 2$ ($k = 1, \dots, K$). The first coordinate $\eta_1(\theta)$ of $\eta(\theta)$ is constant on $(1/2, \infty)$:
\[
\eta_1(\theta) = 1, \qquad \theta \in (1/2, \infty).
\]
The second coordinate $\eta_2(\theta)$ of $\eta(\theta)$ is continuously differentiable on $(1/2, \infty)$ with derivative
\[
\nabla_\theta\, \eta_2(\theta) = 1 - \beta(\theta)^2 - \frac{2}{\theta}\, \beta(\theta) = \frac{1}{\theta^2} > 0, \qquad \theta \in (1/2, \infty).
\]
By the mean-value theorem,
\[
\eta_2(\theta + \delta) - \eta_2(\theta) \;\geq\; \frac{\delta}{(\theta + \delta)^2} \;>\; 0, \qquad \theta \in (1/2, \infty), \quad \delta > 0.
\]
Thus, $\eta_2(\theta)$ is strictly increasing on $(1/2, \infty)$ and at least one coordinate of $\eta(\theta + \delta)$ must deviate from $\eta(\theta)$ for all $\theta \in (1/2, \infty)$, all $\delta > 0$, and all $K \geq 1$. As a result, the map $\eta : \Theta \mapsto \Xi$ is one-to-one and continuous on $\Theta$. Thus condition [C.1] is satisfied.

Case $\theta_1 = 1$: condition [C.2]. Condition [C.2] follows from the continuity of $\eta : \Theta \mapsto \Xi$ and the upper semicontinuity of exponential-family loglikelihood functions [e.g., 1, Lemma 5.3, p. 146] along with the fact that $\eta : \Theta \mapsto \Xi$ is one-to-one and exponential-family loglikelihood functions are strictly concave on $\Xi \subseteq \operatorname{int}(\mathbb{N})$ [e.g., 1, Lemma 5.3, p. 146].

Case $\theta_1 = 1$: condition [C.3]. Condition [C.3] needs to hold on convex and compact subsets of $\Theta$ containing $\theta^\star$. Let $\Theta^\star = [A, B]$ be any such subset of $\Theta$, where $1/2 < A < B < \infty$. Choose any $\theta \in \Theta^\star$ and $\theta' \in \Theta^\star$. By the triangle inequality, we obtain, for all $\theta \in \Theta^\star$ and $\theta' \in \Theta^\star$ and all $\mu \in M_0(\alpha)$,
\[
|\langle \eta(\theta') - \eta(\theta),\, \mu \rangle| = \left| \sum_{k=1}^K \sum_{i=1}^{I_k} [\eta_i(\theta') - \eta_i(\theta)]\, \mu_{k,i} \right| \;\leq\; \sum_{k=1}^K \sum_{i=1}^{I_k} |\eta_i(\theta') - \eta_i(\theta)|\, |\mu_{k,i}|. \tag{17}
\]
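We show that there exists $C > 2$ such that, for all $\theta \in \Theta^\star$ and all $i \in \{1, 2, \dots\}$,
\[
|\nabla_\theta\, \eta_i(\theta)| \;\leq\; \max(3, C), \qquad i \in \{1, 2, \dots\},
\]
which, by the mean-value theorem, implies that
\[
|\eta_i(\theta') - \eta_i(\theta)| \;\leq\; |\theta' - \theta|\, \max(3, C), \qquad i \in \{1, 2, \dots\}.
\]
To show that $|\nabla_\theta\, \eta_i(\theta)|$ is bounded for all $\theta \in \Theta^\star$ and all $i \in \{1, 2, \dots\}$, note that
\[
|\nabla_\theta\, \eta_i(\theta)| = \left| 1 - \beta(\theta)^i - \frac{i}{\theta}\, \beta(\theta)^{i-1} \right| \;\leq\; 2 + 2\, i\, |\beta(\theta)|^{i-1},
\]
where we used the fact that $\theta > 1/2$ and $|\beta(\theta)| < 1$ for all $\theta \in \Theta^\star$. Observe that
\[
|\nabla_\theta\, \eta_i(\theta)| \;\leq\; 2 + 2\, i\, |\beta(\theta)|^{i-1} \;\leq\; 3 \tag{18}
\]
is satisfied as long as
\[
|\beta(\theta)| \;\leq\; \exp\left(- \frac{\log 2 i}{i - 1}\right), \qquad i \in \{2, 3, \dots\}. \tag{19}
\]
We note that (18) does not hold for small $i$; it does hold for all sufficiently large $i$, and although the upper bound 3 in (18) is not tight, it is convenient. To show that (18) is satisfied by all $\theta \in \Theta^\star$ for all sufficiently large $i$, we lower bound the exponential term on the right-hand side of (19) as follows. Choose $\gamma > 0$ so that $0 < \gamma \leq \min((2 A - 1)/2,\, 1/B)$, where $A > 1/2$ and $B > 1/2$ were selected above so that $1/2 < A \leq \theta \leq B < \infty$ for all $\theta \in \Theta^\star$; note that the choice of $\gamma > 0$ implies that $1/(2 - \gamma) \leq A < B \leq 1/\gamma$. Then there exists $I_0 \geq 2$ such that, for all $i > I_0$,
\[
1 - \gamma \;\leq\; \exp\left(- \frac{\log 2 i}{i - 1}\right), \qquad i \in \{I_0 + 1, I_0 + 2, \dots\}, \quad I_0 \geq 2.
\]
In other words, we have $|\nabla_\theta\, \eta_i(\theta)| \leq 3$ as long as $i > I_0$ and $|\beta(\theta)| \leq 1 - \gamma$, i.e., as long as $\theta \in \Theta^\star$ satisfies $1/(2 - \gamma) \leq \theta \leq 1/\gamma$. Since $\gamma > 0$ was chosen above so that $1/(2 - \gamma) \leq A < B \leq 1/\gamma$, all $\theta \in \Theta^\star$ satisfy $1/(2 - \gamma) \leq A \leq \theta \leq B \leq 1/\gamma$ and hence $|\nabla_\theta\, \eta_i(\theta)| \leq 3$ holds for all $\theta \in \Theta^\star$ provided $i > I_0$. In addition, there exists $C > 2$ such that, for all $\theta \in \Theta^\star$ and all $i \in \{1, 2, \dots, I_0\}$,
\[
|\nabla_\theta\, \eta_i(\theta)| \;\leq\; 2 + 2\, i\, |\beta(\theta)|^{i-1} \;\leq\; 2 + 2\, I_0 = C, \qquad i \in \{1, 2, \dots, I_0\},
\]
where we used $|\beta(\theta)| < 1$ for all $\theta \in \Theta^\star$. Therefore, for all $\theta \in \Theta^\star$,
\[
|\nabla_\theta\, \eta_i(\theta)| \;\leq\; 2 + 2\, i\, |\beta(\theta)|^{i-1} \;\leq\; \max(3, C), \qquad i \in \{1, 2, \dots\},
\]
implying
\[
|\eta_i(\theta') - \eta_i(\theta)| \;\leq\; |\theta' - \theta|\, \max(3, C), \qquad i \in \{1, 2, \dots\}. \tag{20}
\]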
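The construction of $\gamma$, $I_0$, and $C$ can be traced numerically. The Python sketch below uses a hypothetical interval $[A, B] = [0.6, 4]$, a finite grid of $\theta$ values, and a finite range of indices $i$, all of which are illustrative assumptions; it is a numerical check under those assumptions, not a substitute for the argument above.

```python
from math import exp, log


def beta(theta):
    return 1.0 - 1.0 / theta


def grad_eta(theta, i):
    b = beta(theta)
    return 1.0 - b ** i - (i / theta) * b ** (i - 1)


def envelope(theta, i):
    """Upper bound 2 + 2 i |beta(theta)|^(i - 1) on |grad eta_i(theta)|."""
    return 2.0 + 2.0 * i * abs(beta(theta)) ** (i - 1)


def smallest_I0(gamma, max_i=10_000):
    """Smallest I0 >= 2 with 1 - gamma <= exp(-log(2 i) / (i - 1)) for I0 < i <= max_i."""
    last_violation = 1
    for i in range(2, max_i + 1):
        if 1.0 - gamma > exp(-log(2.0 * i) / (i - 1)):
            last_violation = i
    return max(2, last_violation)


def check_bounds(A, B, n_grid=50, max_i=200):
    """Check |grad eta_i| <= envelope <= max(3, C) on a grid of [A, B]."""
    gamma = min((2.0 * A - 1.0) / 2.0, 1.0 / B)
    I0 = smallest_I0(gamma)
    C = 2.0 + 2.0 * I0
    tol = 1e-12
    for g in range(n_grid + 1):
        theta = A + (B - A) * g / n_grid
        for i in range(1, max_i + 1):
            cap = C if i <= I0 else 3.0
            if abs(grad_eta(theta, i)) > envelope(theta, i) + tol:
                return False
            if envelope(theta, i) > cap + tol:
                return False
    return True


print(smallest_I0(0.1), check_bounds(0.6, 4.0))
```

The index $I_0$ grows as $\gamma$ shrinks, that is, as the interval $[A, B]$ approaches $1/2$ or becomes large, which is why the constant $C = 2 + 2 I_0$ depends on the compact set $\Theta^\star$.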
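Using (17) and (20) along with condition [C.3$^{\star\star}$] shows that there exist $C_1 > 0$ and $K_1 \geq 1$ such that, for all $K > K_1$,
\[
\begin{aligned}
|\langle \eta(\theta') - \eta(\theta),\, \mu \rangle|
&\leq\; \sum_{k=1}^K \sum_{i=1}^{I_k} |\eta_i(\theta') - \eta_i(\theta)|\, |\mu_{k,i}| \\
&\leq\; |\theta' - \theta|\, \max(3, C) \sum_{k=1}^K \sum_{i=1}^{I_k} |\mu_{k,i}| \;\leq\; C_1\, \|\theta' - \theta\|_2 \sum_{k=1}^K \binom{|\mathcal{A}_k|}{2}.
\end{aligned}
\tag{21}
\]
Thus condition [C.3] is satisfied.

Case $\theta_1 = 1$: condition [C.4]. Condition [C.4] needs to hold on convex and compact subsets of $\Theta$ containing $\theta^\star$. Let $\Theta^\star = [A, B]$ be any such subset of $\Theta$, where $1/2 < A < B < \infty$. Then, using $|\beta(\theta)| < 1$ for all $\theta \in \Theta^\star$,
\[
|\eta_i(\theta)| \;\leq\; |\theta| + |\theta|\, |\beta(\theta)|^i \;\leq\; 2 B, \qquad \theta \in \Theta^\star. \tag{22}
\]
By condition [C.4$^{\star\star}$], there exist $C_2 > 0$ and $K_2 \geq 1$ such that, for all $K > K_2$,
\[
\begin{aligned}
|\langle \eta(\theta),\, s(x_1) - s(x_2) \rangle|
&=\; \left| \sum_{k=1}^K \sum_{i=1}^{I_k} \eta_i(\theta)\, [s_{k,i}(x_{k,1}) - s_{k,i}(x_{k,2})] \right| \\
&\leq\; 2 B \sum_{k=1}^K \sum_{i=1}^{I_k} |s_{k,i}(x_{k,1}) - s_{k,i}(x_{k,2})| \;\leq\; C_2\, \|\mathcal{A}\|_\infty\, d(x_1, x_2).
\end{aligned}
\tag{23}
\]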
Thus condition [C.4] is satisfied.

Case $0 < \theta_1 < \infty$. Along the lines of the proof in the case $\theta_1 = 1$, it can be shown that conditions [C.1] and [C.2] are satisfied. Conditions [C.3] and [C.4] need to hold on convex and compact subsets $\Theta^\star$ of $\Theta$ containing $\theta^\star$. Let $\Theta^\star$ be any convex and compact subset of $\Theta$ containing $\theta^\star$. Since $\Theta^\star$ is a compact subset of $\Theta$, there exist constants $0 < A_1 < B_1 < \infty$ and $1/2 < A_2 < B_2 < \infty$ such that $0 < A_1 \leq \theta_1 \leq B_1 < \infty$ and $1/2 < A_2 \leq \theta_2 \leq B_2 < \infty$ for all $\theta \in \Theta^\star$. Choose any $\theta \in \Theta^\star$ and $\theta' \in \Theta^\star$ and write
\[
\eta_i(\theta) = \theta_1 \left\{ \theta_2 \left[ 1 - \left(1 - \frac{1}{\theta_2}\right)^i \right] \right\} = w_1(\theta_1)\, w_{2,i}(\theta_2),
\]
where $w_1(\theta_1) = \theta_1$ and $w_{2,i}(\theta_2) = \theta_2 - \theta_2 (1 - 1/\theta_2)^i$. Both $w_1(\theta_1)$ and $w_{2,i}(\theta_2)$ are bounded: $w_1(\theta_1)$ is bounded since $|w_1(\theta_1)| = |\theta_1| \leq B_1$, whereas $w_{2,i}(\theta_2)$ is bounded since $|w_{2,i}(\theta_2)| \leq 2 B_2$ by using an argument along the lines of (22). Thus, there exists $C > 0$ such that
\[
\begin{aligned}
|\eta_i(\theta) - \eta_i(\theta')|
&\leq\; |w_1(\theta_1)\, w_{2,i}(\theta_2) - w_1(\theta_1')\, w_{2,i}(\theta_2)| + |w_1(\theta_1')\, w_{2,i}(\theta_2) - w_1(\theta_1')\, w_{2,i}(\theta_2')| \\
&\leq\; 2 B_2\, |w_1(\theta_1) - w_1(\theta_1')| + B_1\, |w_{2,i}(\theta_2) - w_{2,i}(\theta_2')|.
\end{aligned}
\]
Using $|w_1(\theta_1) - w_1(\theta_1')| = |\theta_1 - \theta_1'|$ and $|w_{2,i}(\theta_2) - w_{2,i}(\theta_2')| \leq |\theta_2 - \theta_2'|\, \max(3, C)$ ($C > 0$), which follows from an argument along the lines of (20), shows that there exists $D > 0$ such that
\[
|\eta_i(\theta) - \eta_i(\theta')| \;\leq\; 2 B_2\, |\theta_1 - \theta_1'| + B_1\, \max(3, C)\, |\theta_2 - \theta_2'| \;\leq\; D\, \|\theta - \theta'\|_2.
\]
In addition, the boundedness of $w_1(\theta_1)$ and $w_{2,i}(\theta_2)$ implies that $\eta_i(\theta)$ is bounded:
\[
|\eta_i(\theta)| = |w_1(\theta_1)\, w_{2,i}(\theta_2)| \;\leq\; 2 B_1 B_2.
\]
As a result, by an argument along the lines of (21) and (23), it is straightforward to show that conditions [C.3] and [C.4] are satisfied.