MICHAEL SCHWEINBERGER AND JONATHAN STEWART ...... [10] Crane, H., and Dempsey, W. (2015), âA framework for statistical network modeling,â ...
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS FOR RANDOM GRAPHS WITH COMPLEX TOPOLOGICAL STRUCTURES
arXiv:1702.01812v3 [math.ST] 3 Feb 2018
B Y M ICHAEL S CHWEINBERGER AND J ONATHAN S TEWART Rice University Statistical inference for exponential-family random graphs with complex topological structures is challenging. We stress the importance of additional structure and show that additional structure facilitates statistical inference. A simple and common form of additional structure, which is observed in multilevel networks, is a random graph with neighborhoods and local dependence within neighborhoods. We develop the first concentration results for maximum likelihood and M -estimators of a wide range of canonical and curved exponential-family random graphs with local dependence. All results are non-asymptotic and cover random graphs of fixed and finite size, provided the neighborhoods are small relative to the size of the random graph. We discuss extensions to larger random graphs with more neighborhoods along with concentration results for subgraph-to-graph estimators. As applications, we consider canonical and curved exponential-family random graphs, with local dependence induced by sensible forms of transitivity and parameter vectors whose dimensions depend on the number of nodes.
1. Introduction. Models of network data have witnessed a surge of interest in statistics and related areas [e.g., 25]. Such data arise in the study of social networks, terrorist networks, epidemics, and other areas. Since the work of Holland and Leinhardt in the 1970s [e.g., 16], it is known that network data exhibit a wide range of dependencies induced by transitivity and other complex topological features of networks [e.g., 33]. A large statistical framework for modeling dependencies induced by complex topological features of networks is given by discrete exponential-family random graphs [e.g., 13, 48, 14, 45, 19, 27, 33, 30]. Such models are popular among network scientists for the same reason Ising models are popular among physicists: both classes of models enable scientists to model a wide range of dependencies of scientific interest [e.g., 33]. Despite the appeal of the discrete exponential-family framework and its relationship to other discrete exponential-family models for dependent random variables [e.g., Ising models and Ising graphical models, 7, 36, 2], statistical inference for discrete exponential-family random graphs is challenging. One reason is that ∗
Supported by NSF award DMS-1513644. MSC 2010 subject classifications: curved exponential families, exponential families, exponential-family random graph models, M -estimators, multilevel networks, social networks.
1
2
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
some exponential-family random graphs are ill-behaved [e.g., the so-called triangle model, 46, 22, 14, 39, 1, 8, 5], but it is worth noting that well-behaved alternatives have been developed, among them curved exponential-family random graphs [45, 19]. A second reason is that in most applications of exponential-family random graphs statistical inference is based on a single observation of a dependent random graph. Establishing desirable properties of estimators, such as consistency of estimators, is non-trivial when no more than one observation of a dependent random graph is available. While some consistency results have been obtained under independence assumptions [11, 49, 37, 43, 34, 29, 50] and dependence assumptions [49, 43, 34]—as discussed in Section 7—the existing consistency results do not cover the models of primary interest to network scientists: that is, canonical and curved exponential-family random graphs, with dependence induced by sensible forms of transitivity and other complex topological features of networks [33]. We stress the importance of additional structure and show that additional structure facilitates statistical inference. We consider here a simple and common form of additional structure called multilevel structure. Network data with multilevel structure are popular in network science, as the recent monograph of Lazega and Snijders [31] and a growing number of applications demonstrate [e.g., 47, 52, 32, 44, 17, 18]. A simple form of multilevel structure is given by a partition of a set of nodes into subsets of nodes, called neighborhoods. In applications, neighborhoods may correspond to school classes within schools, departments within companies, and units of armed forces. It is worth noting that in multilevel networks the partition of the set of nodes is observed and models of multilevel networks attempt to capture dependencies within and between neighborhoods [e.g., 47, 52, 32, 44, 17, 18], whereas the well-known class of stochastic block models [35] assumes that the partition is unobserved and that edges are independent conditional on the partition. Multilevel structure offers outstanding opportunities in terms of statistical inference: • First, an exponential-family random graph with multilevel structure can be extended to a larger exponential-family random graph with more neighborhoods [40], whereas many exponential-family random graphs without multilevel structure cannot be extended to larger exponential-family random graphs [43, 10, 30]. • Second, as long as the dependence is local in the sense that it is restricted to neighborhoods and the neighborhoods are small, the overall dependence is weak, which facilitates finite-graph concentration and consistency results for exponential-family random graphs. We take advantage of these opportunities and develop the first statistical theory which shows that statistical inference for many exponential-family random graphs
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
3
with complex topological structures is possible, provided multilevel structure is available. To do so, we develop the first finite-graph concentration and consistency results for a wide range of canonical and curved exponential-family random graphs with local dependence. The finite-graph concentration results cover sparse and dense random graphs with countable support and local dependence and may be of independent interest. The finite-graph consistency results cover maximum likelihood and M -estimators under correct and incorrect model specifications. All results are non-asymptotic and cover random graphs of fixed and finite size, provided the neighborhoods are small relative to the size of the random graph. We discuss extensions to larger random graphs with more neighborhoods along with finite-graph consistency results for subgraph-to-graph estimators. As applications, we consider canonical and curved exponential-family random graphs, with local dependence induced by sensible forms of transitivity and parameter vectors whose dimensions depend on the number of nodes. These finite-graph concentration and consistency results have important implications: • The most important implication is that statistical inference for transitivity and other complex topological features of networks is possible, provided sensible specifications of exponential-family random graphs are used and multilevel structure is available. To date, it has been widely believed that statistical inference for transitivity is challenging [e.g., 43, 8], but multilevel structure facilitates statistical inference for sensible forms of transitivity. • In practice, researchers should take advantage of multilevel structure when it is available and collect more network data with multilevel structure when it is possible to do so. We note that we focus here on sensible specifications of exponential-family random graphs, because statistical inference for models that are known to be ill-posed [e.g., the so-called triangle model, 46, 22, 14, 39, 1, 8, 5] is problematic. The paper is structured as follows. Section 2 introduces models. Section 3 describes finite-graph concentration results for dependent random graphs while Sections 4 and 5 describe finite-graph consistency results for maximum likelihood and M -estimators, respectively. Section 6 discusses extensions to larger random graphs with more neighborhoods along with finite-graph consistency results for subgraphto-graph estimators. A comparison with existing asymptotic consistency results can be found in Section 7. Section 8 presents simulation results. 2. Discrete exponential-family random graphs with multilevel structure. We introduce discrete exponential-family random graphs with multilevel structure. A simple and common form of multilevel structure is a partition of a set of nodes into K ≥ 2 non-empty subsets of nodes A1 , . . . , AK , called neighborhoods.
4
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
We note that in multilevel networks the partition of the set of nodes is observed [e.g., 47, 52, 32, 44, 17, 18] and that some neighborhoods may be larger than others. We consider undirected random graphs with countable sample spaces, which covers binary and non-binary network data, including network count data. Extensions to directed random graphs are straightforward. Let X = (Xk )K k=1 and Y = K (Yk,l )k 0 and a non-decreasing function ρ : N 7→ N such that A1 ρ(K) ≤ |Ak | ≤ A2 ρ(K) (k = 1, . . . , K, K = 1, 2, . . . ), where N = {0, 1, 2, . . . }. In other words, the sizes of neighborhoods may not be identical, but are assumed to be similar. We note that when the neighborhoods are not of the same order of magnitude, the natural parameters of neighborhoods may have to depend on the order of magnitude of neighborhoods [e.g., 28, 29, 6], because there are good reasons to believe that small and large within-neighborhood subgraphs are not governed by the same natural parameters [43, 10, 30]. However, despite recent progress on sizedependent parameterizations of simple models [e.g., 28, 29, 6], more research on size-dependent parameterizations of more complex models is needed, which is an important topic in its own right but is beyond the scope of our paper. The assumption ηk,i (θ) = ηi (θ) (i = 1, . . . , dim(ηk ), k = 1, . . . , K) implies that the exponential families considered here can be reduced to exponential families with natural parameter vectors of the form η(θ) = (η1 (θ), . . . , ηd∞ (θ)) and sufficient statistic vectors of the form s(x) = (s1 (x), . . . , sd∞ (x)) , P where si (x) = K k=1 sk,i (xk ) (i = 1, . . . , d∞ ) and d∞ = max1≤k≤K dim(ηk ). The dimensions dim(ηk ) of the neighborhood-dependent natural parameter vectors ηk (θ) may be functions of the sizes |Ak | of neighborhoods Ak (k = 1, . . . , K), which implies that d∞ = max1≤k≤K dim(ηk ) may be a function of kAk∞ = max1≤k≤K |Ak |. An example is given by curved exponential families with geometrically weighted model terms, as described in Section 4.2. In addition, in both canonical and curved exponential families, we allow the dimension of parameter vector θ to depend on kAk∞ . Notation. To prepare the ground for the finite-graph concentration and consistency results in Sections 3, 4, 5, and 6, we introduce mean-value parameterizations of exponential families along with additional notation. Mean-value parameterizations facilitate finite-graph concentration and consistency results, because concentration inequalities [3] bound probabilities of deviations from means and the mean-value parameters of an exponential family are the means of the sufficient statistics, defined by µ(η(θ)) = Eη(θ) s(X) ∈ rint(M) [4, p. 2 and p. 75]. Here, Eη(θ) s(X) denotes the expectation of s(X) with respect to exponential-family distributions Pη(θ) having densities of the form (1), M denotes the convex hull of the set {s(x) : x ∈ X}, and rint(M) denotes the relative interior of M. We denote
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
7
the data-generating parameter vector by θ ? ∈ int(Θ) and write P ≡ Pη(θ? ) and E ≡ Eη(θ? ) , where int(Θ) denotes the interior of Θ. An open ball in Rdim(θ) with center θ ? ∈ Rdim(θ) and radius > 0 is denoted by B(θ ? , ). We denote by k.k1 , k.k2 , and k.k∞ the `1 -, `2 -, and `∞ -norm of vectors, respectively. Throughout, d : X × X 7→ R+ ∪ {0} denotes the Hamming metric defined by d(x1 , x2 ) =
K X
X
1(x1,i,j 6= x2,i,j ), (x1 , x2 ) ∈ X × X,
k=1 i∈Ak < j∈Ak
where 1(x1,i,j 6= x2,i,j ) is an indicator function, which is 1 if x1,i,j 6= x2,i,j and is 0 otherwise. 3. Finite-graph concentration results for dependent random graphs. To obtain finite-graph concentration and consistency results for sparse and dense exponential-family random graphs with countable support and local dependence, we need finite-graph concentration results for dependent random graphs. Finite-graph concentration results for dependent random graphs are non-trivial for at least three reasons. First, exponential families of the form (1) may induce strong dependence within neighborhoods and the sizes of neighborhoods need not be identical. Second, exponential families of the form (1) can induce a wide range of dependencies within neighborhoods. Therefore, we need general-purpose concentration inequalities that cover a wide range of dependencies. Third, we consider exponential families with countable support and some distributions with countable support do not have nice Subgaussian tails. The following general-purpose concentration inequality addresses the three challenges discussed above. It shows that the dependence induced by exponential families of the form (1) may be strong within neighborhoods but is sufficiently weak overall to obtain concentration results as long as the neighborhoods are not too large. Proposition 1. Consider an exponential family with countable support X and local dependence. Let f : X 7→ R be Lipschitz with respect to the Hamming metric d : X × X 7→ R+ ∪ {0} with Lipschitz coefficient kf kLip > 0 and assume that E f (X) exists. Then there exists C > 0 such that, for all t > 0, ! t2 P(|f (X) − E f (X)| ≥ t) ≤ 2 exp − PK |A | . C k=1 2k kAk4∞ kf k2Lip It is worth noting that there is a large body of literature in probability theory and related areas on concentration results for random graphs [e.g., 9], but almost all of them assume that edge variables are independent and almost surely bounded, i.e.,
8
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
Subgaussian. In contrast, we consider dependent edge variables with countable sample spaces having distributions with either Subgaussian or non-Subgaussian tails. Examples of dependent edge variables having distributions with non-Subgaussian tails (e.g., Poisson-referenced exponential-family random graphs with transitivity) can be found in Krivitsky [27]. The proof of Proposition 1 relies on concentration inequalities for dependent random variables [e.g., 26, 3] and bounds mixing coefficients, which quantify the strength of dependence induced by exponential families with local dependence, in terms of kAk∞ . Proposition 1 paves the way for finite-graph concentration and consistency results for sparse and dense exponential-family random graphs with countable support and local dependence. We present here an example of special interest for maximum likelihood estimation and note that additional concentration results for \? )) = s(X) and note that M -estimation can be found in Section 5. Let µ(η(θ \? )) = s(X) is an estimator of the data-generating mean-value parameter µ(η(θ vector µ(η(θ ? )) = Eη(θ? ) s(X) ∈ rint(M). The following concentration result \? )) is close to µ(η(θ ? )) ∈ rint(M) with high probability as long shows that µ(η(θ as the neighborhoods and the dimension d∞ of µ(η(θ ? )) are small relative to the number of neighborhoods K. Proposition 2. Consider a full or non-full, canonical or curved exponential family with countable support X and local dependence. Assume that there exists A > 0 such that, for all (x1 , x2 ) ∈ X × X, ks(x1 ) − s(x2 )k∞ ≤ A d(x1 , x2 ) kAk∞ .
(2)
Then there exists C > 0 such that, for all deviations of the form t = δ with δ > 0 and 0 ≤ α ≤ 1,
\? )) − µ(η(θ ? ))k∞ ≥ t P kµ(η(θ
\? )) − µ(η(θ ? ))k2 ≥ t P kµ(η(θ
≤ 2 exp −
δ2 C K 4 (2−α)
kAk∞ ≤ 2 exp −
PK
|Ak | α 2 !
+ log d∞
δ2 C K 4 (2−α)
d∞ kAk∞
k=1
! + log d∞
,
where d∞ may be a function of kAk∞ . The smoothness condition (2) of Proposition 2 is satisfied as long as changing a within-neighborhood edge cannot change the within-neighborhood sufficient statistics by more than a multiple of kAk∞ . It is verified in Corollaries 1 and 2 in Section 4. The finite-graph concentration results for estimators of mean-value parameters in Proposition 2 cover both sparse and dense random graphs. We note that the conventional definition of the terms sparse and dense random graphs is
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
9
based on the scaling of the expected number of edges of Bernoulli random graphs with independent edges, i.e., the expectation of the sufficient statistic of Bernoulli random graphs. We use these terms here to refer to the scaling of the expectations of all sufficient statistics of exponential-family random graphs. The power α of |Ak | α controls the level of sparsity: if α = 1, the within-neighborhood subgraphs 2 are said to be dense in the sense that the expectations of within-neighborhood suffi cient statistics are non-negligible fractions of the number of edge variables |A2k | in neighborhood Ak (k = 1, . . . , K), otherwise the within-neighborhood subgraphs are said to be sparse (0 ≤ α < 1). Between-neighborhood subgraphs can be sparse or dense. Remark 1. Sharpness. The concentration results discussed above are not sharp. We are not interested in sharp bounds in special cases, because the main appeal of the exponential-family framework is that it can capture a wide range of dependencies. Therefore, we are interested in concentration results that cover a wide range of exponential families with local dependence. 4. Finite-graph concentration results for maximum likelihood estimators. In exponential families, maximum likelihood estimation of natural parameters is related to maximum likelihood estimation of mean-value parameters and, therefore, the finite-graph concentration results for estimators of mean-value parameters in Proposition 2 pave the way for finite-graph concentration results for estimators of natural parameters and functions of natural parameters. All of the following results are non-asymptotic and cover random graphs of fixed and finite size, provided the neighborhoods are small relative to the size of the random graph. Throughout, a random graph of fixed and finite size is taken to be a random graph with a finite number of neighborhoods and a finite number of nodes. Extensions to larger random graphs with more neighborhoods are discussed in Section 6. To cover a wide range of full and non-full, canonical and curved exponential families under weak assumptions on the map η : Θ 7→ int(N), we consider estimating functions of the form (3)
\? ))) = kµ(η(θ \? )) − µ(η(θ))k2 , θ ∈ Θ, g(θ; µ(η(θ
which are approximations of g(θ; µ(η(θ ? ))) = kµ(η(θ ? )) − µ(η(θ))k2 , θ ∈ Θ. \? ))) is an approximation of g(θ; µ(η(θ ? ))) follows from The fact that g(θ; µ(η(θ the triangle inequality, which shows that \? ))) − g(θ; µ(η(θ ? ))) ≤ kµ(η(θ \? )) − µ(η(θ ? ))k2 , θ ∈ Θ, g(θ; µ(η(θ
10
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
\? )) − µ(η(θ ? ))k2 is along with the fact that, under suitable conditions, kµ(η(θ small with high probability by Proposition 2. Estimating function (3) has at least three advantages. First, estimating function (3) admits finite-graph concentration results under weak assumptions on the map η : Θ 7→ int(N). Indeed, the main results of Section 4 assume that the map η : Θ 7→ int(N) is one-to-one and satisfies a weak identifiability assumption, but the map η : Θ 7→ int(N) need not be differentiable. The weakness of these assumptions implies that the main results of Section 4 cover a wide range of full and non-full exponential families—including, but not limited to curved exponential families—and it is possible to verify these assumptions in some of the most challenging applications, as demonstrated by Corollaries 1 and 2. Second, estimating function (3) is natural, because maximum likelihood estimation of the data-generating natural parameter vector η ? of an exponential family with natural parameter vector η and sufficient statistic vector s(x) can be based on the ? ) − µ(η) provided \ gradient of the loglikelihood function ∇η log pη (x) = µ(η ? ) = s(x) [4, Lemma 5.3, p. 146]. Therefore, maxi\ η ∈ int(N), where µ(η mum likelihood estimation of η ? can be based on estimating functions of the form ? ) − µ(η)k . By the parameterization-invariance of maximum likelihood es\ kµ(η 2 timators, maximum likelihood estimation of functions of η ? , such as θ ? , can be \? )) − µ(η(θ))k2 provided the map η : Θ 7→ int(N) is one-tobased on kµ(η(θ one. We note that estimating function (3) is chosen for mathematical convenience, facilitating finite-graph concentration results for maximum likelihood estimators of a wide range of full and non-full, curved exponential families under weak assumptions on the map η : Θ 7→ int(N), and is not chosen for computational convenience. Last, but not least, the simple form of estimating function (3) helps determine when minimizers of (3) exist and are unique, and how the minimizers are related to each other when there is more than one minimizer. These advantages are most useful in non-full exponential families, in particular curved exponential families. In the following, we assume that the map η : Θ 7→ int(N) is one-to-one. A natural class of estimators is hence given by \? ))) = inf g(θ; \? ))) . ˙ µ(η(θ θb = θ ∈ Θ : g(θ; µ(η(θ ˙ θ∈Θ
If the set θb is non-empty, it may contain one element (e.g., in full exponential families) or more than one element (e.g., in non-full exponential families). If the set θb contains more than one element, then all elements of the set θb map to mean\? )) b that have the same `2 -distance from µ(η(θ value parameter vectors µ(η(θ)) b by construction of estimating function (3). In addition, if the set θ is non-empty,
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
11
\? ))) are approximations of the minimizer of then all minimizers θb of g(θ; µ(η(θ ? g(θ; µ(η(θ ))) under suitable conditions. The minimizer of g(θ; µ(η(θ ? ))) is unique and is given by the data-generating parameter vector θ ? provided µ(η(θ ? ))) ∈ rint(M): ? ? ? ˙ θ = θ ∈ Θ : g(θ; µ(η(θ ))) = inf g(θ; µ(η(θ ))) . ˙ θ∈Θ
The data-generating parameter vector θ ? is the unique minimizer of g(θ; µ(η(θ ? ))) = kµ(η(θ ? )) − µ(η(θ))k2 , because θ ? ∈ int(Θ) and the map η : Θ 7→ int(N) is one-to-one by assumption, while the map µ : int(N) 7→ rint(M) is one-to-one by classic exponential-family theory [4, Theorem 3.6, p. 74]. Therefore, kµ(η(θ ? )) − µ(η(θ))k2 = 0 holds if and only if θ = θ ? . In the remainder of the section, we show that the estimator θb is close to the datagenerating parameter vector θ ? with high probability as long as the neighborhoods are small relative to the size of the random graph. We discuss finite-graph concentration results for full and non-full, curved exponential families in Sections 4.1 and 4.2, respectively. More general finite-graph concentration results for M -estimators can be found in Section 5. Finite-graph concentration results for subgraph-to-graph estimators are discussed in Section 6. 4.1. Finite-graph concentration: full exponential families. The main concentration result for full exponential families is stated in Theorem 1. It covers some of the most interesting exponential-family random graphs used in practice, including exponential-family random graphs with sensible forms of transitivity, as Corollary 1 demonstrates. We note that the term full exponential-family random graph in Theorem 1 refers to the fact that the exponential family is full, i.e., the parameter space of the exponential family is as large as possible, and does not refer to a full graph with all possible edges. Theorem 1. Consider a full exponential-family random graph of fixed and finite size with countable support X and local dependence. Let η(θ) = θ ∈ Θ, where Θ = {θ ∈ Rdim(θ) : ψ(η(θ)) < ∞}. Assume that θ ? ∈ int(Θ) and that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist δ() > 0 and A > 0 such that, for all θ ∈ Θ \ B(θ ? , ), K X |Ak | α ? (4) kµ(η(θ )) − µ(η(θ))k2 ≥ δ() for some 0 ≤ α ≤ 1 2 k=1
and that, for all (x1 , x2 ) ∈ X × X, (5)
ks(x1 ) − s(x2 )k∞ ≤ A d(x1 , x2 ) kAk∞ .
12
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
Then, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ() > 0 and C > 0 such that θb exists, is unique, and ! 2 C K κ() P θb ∈ B(θ ? , ) ⊆ int(Θ) ≥ 1 − 2 exp − + log dim(θ) , 4 (2−α) dim(θ) kAk∞ where dim(θ) may be a function of kAk∞ and K ≥ 2. Theorem 1 is non-asymptotic in the sense that it covers all finite random graphs with K ≥ 2 neighborhoods. It excludes random graphs with K = 1 neighborhood, because random graphs with K = 1 neighborhood are equivalent to random graphs with unrestricted dependence, which can possess undesirable properties [46, 22, 14, 39, 1, 8]. An important implication of Theorem 1 is that θb is close to the datagenerating parameter vector θ ? with high probability when the neighborhoods and the dimension of θ ? are small relative to the number of neighborhoods K. Conditions (4) and (5) of Theorem 1 are verified in Corollary 1. Condition (4) is an identifiability assumption and covers both sparse (0 ≤ α < 1) and dense (α = 1) within-neighborhood subgraphs, as explained in Section 3. Theorem 1 shows that sparsity comes at a cost, because the probability of event θb 6∈ B(θ ? , ) decays slower when the within-neighborhood subgraphs are sparse rather than dense. The fact that sparsity weakens concentration results is well-known in the literature on concentration results for Bernoulli random graphs with independent edges [e.g., 21, 24]. Condition (5) is satisfied as long as changing a within-neighborhood edge cannot change the within-neighborhood sufficient statistics by more than a multiple of kAk∞ . Both conditions are satisfied by exponential-family random graphs capturing sensible forms of transitivity and other complex topological features of networks. Example: full, canonical exponential family. As an example, we consider PK |Ak | exponential-family random graphs with support X = {0, 1} k=1 ( 2 ) and local dependence induced by sensible forms of transitivity. Transitivity refers to the tendency of nodes i and j with shared partners (i.e., xi,h = xj,h = 1 for at least one node h 6= i, j) to be connected (i.e., xi,j = 1) [33]. A sensible form of transitivity is induced by exponential-family random graphs with within-neighborhood edge and transitive edge terms [20], with neighborhood-dependent natural parameters ηk,1 (θ) = θ1 and ηk,2 (θ) = θ2 and sufficient statistics sk,1 (xk ) and sk,2 (xk ) (k = 1, . . . , K) given by X sk,1 (xk ) = xi,j i∈Ak < j∈Ak
sk,2 (xk ) =
X i∈Ak < j∈Ak
xi,j
max
h∈Ak , h6=i,j
xi,h xj,h .
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
13
Here, as elsewhere in the paper, we assume that the neighborhoods are of the same order of magnitude, as defined in Section 2. Corollary 1. Consider an exponential-family random graph of fixed and finite size with within-neighborhood edge and transitive edge terms. Let θ ? ∈ int(Θ), where Θ = {θ ∈ R2 : ψ(η(θ)) < ∞} = R2 . Then conditions (4) and (5) of Theorem 1 are satisfied and hence, for all > 0, there exist κ() > 0 and C > 0 such that θb exists, is unique, and κ()2 C K ? b P θ ∈ B(θ , ) ⊆ int(Θ) ≥ 1 − 4 exp − , kAk4∞ provided |Ak | ≥ 3 (k = 1, . . . , K) and K ≥ 2. Corollary 1 is the first finite-graph concentration result which shows that consistent estimation of exponential-family random graphs with sensible forms of transitivity is possible. To date, it has been widely believed that statistical inference for transitivity is challenging [e.g., 43, 8], but the additional structure in the form of multilevel structure facilitates statistical inference for sensible forms of transitivity. Remark 2. Sensible forms of transitivity. We focus here on sensible forms of transitivity based on the number of within-neighborhood transitive edges rather than triangles. The reason is that triangle terms tend to induce near-degenerate distributions in large graphs and, therefore, large within-neighborhood subgraphs [46, 22, 14, 39, 1, 8]. We note that these issues are serious issues in large random graphs with unrestricted dependence, but are less serious issues in large random graphs with local dependence as long as the neighborhoods are small relative to the size of the random graph. Nonetheless, it is prudent to focus on sensible forms of transitivity. Transitive edge terms are sensible alternatives to triangle terms [20], because transitive edge terms count the first triangle in which a pair of nodes is involved but discard all additional triangles, reducing the incentive of pairs of nodes to be involved in an excessive number of triangles. 4.2. Finite-graph concentration: non-full, curved exponential families. In situations where the dimensions of the natural parameter vectors ηk (θ) of neighborhoods Ak depend on the number of nodes in Ak (k = 1, . . . , K), it is sometimes convenient to use non-full, curved exponential families rather than full exponential families. The reason is that curved exponential families offer more parsimonious parameterizations, because the neighborhood-dependent natural parameter vectors ηk (θ) may be non-affine functions of a lower-dimensional parameter vector θ (k = 1, . . . , K). Such curved exponential families have turned out to be useful in practice [e.g., 45, 19]. The main concentration result for non-full, curved exponential families is stated
14
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
in Theorem 2. It covers the popular class of curved exponential-family random graphs with geometrically weighted model terms, as Corollary 2 demonstrates. Theorem 2. Consider a non-full, curved exponential-family random graph of fixed and finite size with countable support X and local dependence. Let Θ ⊆ {θ ∈ Rdim(θ) : ψ(η(θ)) < ∞}, Θ open. Assume that θ ? ∈ int(Θ). Let η : Θ 7→ int(N) be one-to-one and assume that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()). In addition, assume that there exist δ() > 0 and A > 0 such that, for all η(θ) ∈ N \ B(η(θ ? ), γ()), (6)
kµ(η(θ ? ))
− µ(η(θ))k2 ≥ δ()
K X |Ak | α k=1
2
for some 0 ≤ α ≤ 1
and, for all (x1 , x2 ) ∈ X × X, (7)
ks(x1 ) − s(x2 )k∞ ≤ A d(x1 , x2 ) kAk∞ .
Then, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ() > 0 and C > 0 such that ! 2CK κ() P θb ∈ B(θ ? , ) ⊆ int(Θ) ≥ 1 − 2 exp − + log d∞ , 4 (2−α) d∞ kAk∞ where d∞ may be a function of kAk∞ and K ≥ 2. Theorem 2 does not make strong assumptions about the map η : Θ 7→ int(N): It assumes that the map η : Θ 7→ int(N) is one-to-one and that η(θ) is far from η(θ ? ) when θ is far from θ ? , but the map η : Θ 7→ int(N) need not be differentiable. As a result, Theorem 2 covers a wide range of non-full exponential families, including curved exponential families. It is worth noting that in non-full, curved exponential families there may be more than one minimizer of estimating function (3). A pleasant feature of estimating function (3) is that, with high probability, the minimizers of (3) do not give rise to global minima that are separated by large distances, under the assumptions made. The reason is that, if the set θb contains more than one element, then all elements of b whose `2 -distance from the set θb map to mean-value parameter vectors µ(η(θ)) \? )) is identical and whose `2 -distance from µ(η(θ ? )) is bounded above by µ(η(θ \? ))k2 + kµ(η(θ \? )) − µ(η(θ ? ))k2 b − µ(η(θ ? ))k2 ≤ kµ(η(θ)) b − µ(η(θ kµ(η(θ)) \? )) − µ(η(θ ? ))k2 , ≤ 2 kµ(η(θ
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
15
\? )) is close to as shown in the proof of Theorem 2. By Proposition 2, µ(η(θ ? µ(η(θ )) with high probability provided the neighborhoods and the dimension d∞ of µ(η(θ ? )) are small relative to the size of the random graph. Therefore, all b are close to µ(η(θ ? )) with high probability and hence, elements of the set µ(η(θ)) by the identifiability conditions of Theorem 2, all elements of the set θb are close to θ ? with high probability, under the stated assumptions. Example: non-full, curved exponential family. As an example, we consider one of the most popular curved exponential-family random graphs with support X = PK |Ak | {0, 1} k=1 ( 2 ) : curved exponential-family random graphs with within-neighborhood edge and geometrically weighted edgewise shared partner terms [45, 19]. These models are based on the number of within-neighborhood edges and pairs of nodes with i edgewise shared partners: X sk,1 (xk ) = xa,b a∈Ak < b∈Ak
sk,i+1 (xk ) =
X
xa,b fa,b,i (xk ),
i = 1, . . . , |Ak | − 2,
a∈Ak < b∈Ak
P where fa,b,i (xk ) = 1( c∈Ak , c6=a,b xa,c xb,c = i) is an indicator function, which is 1 if nodes a and b have i shared partners in neighborhood Ak and is 0 otherwise (k = 1, . . . , K). The natural parameters of the within-neighborhood edge and edgewise shared partner terms are of the form ηk,1 (θ)
= θ1
h i ηk,i+1 (θ) = exp(ϑ) 1 − (1 − exp(−ϑ))i ,
i = 1, . . . , |Ak | − 2,
where ϑ > 0 controls the rate of decay of the geometric sequence (1 − exp(−ϑ))i , i = 1, 2, . . . (k = 1, . . . , K). For convenience, we consider here the parameterization θ2 = exp(−ϑ) ∈ (0, 1), so that θ = (θ1 , θ2 ) ∈ R × (0, 1). Such model terms are called geometrically weighted terms, because the natural parameters ηk,i+1 (θ) are based on the geometric sequence (1 − exp(−ϑ))i , i = 1, 2, . . . The dimensions dim(ηk ) = |Ak | − 1 of the neighborhood-dependent natural parameter vectors ηk (θ) are functions of |Ak | (k = 1, . . . , K) and hence the dimension d∞ = max1≤k≤K dim(ηk ) = kAk∞ − 1 of natural parameter vector η(θ) is a function of kAk∞ . The resulting model captures transitive closure in neighborhoods and, in the special case ϑ = 0, reduces to the edge and transitive edge model described in Section 4.1. A full-fledged discussion of these models is beyond the scope of our paper. We therefore refer the interested reader to Snijders et al. [45] and Hunter and Handcock [19] and focus here on finite-graph concentration results.
16
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
The following finite-graph concentration result shows that the estimator θb is close to the data-generating parameter vector θ ? with high probability, provided kAk∞ is small relative to the number of neighborhoods K; note that θ = (θ1 , θ2 ) ∈ R × (0, 1) since θ2 = exp(−ϑ) and ϑ > 0. Corollary 2. Consider a curved exponential-family random graph of fixed and finite size with within-neighborhood edge and geometrically weighted edgewise shared partner terms. Let Θ = R × (0, 1) and assume that θ ? ∈ int(Θ). Then all conditions of Theorem 2 are satisfied and hence, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ() > 0 and C > 0 such that κ()2 C K ? b P θ ∈ B(θ , ) ⊆ int(Θ) ≥ 1 − 2 exp − + log kAk∞ , kAk6∞ provided |Ak | ≥ 4 (k = 1, . . . , K) and K ≥ 2. Corollary 2 is the first concentration result for curved exponential-family random graphs with geometrically weighted model terms and dependence induced by sensible forms of transitivity. Concentration results for other curved exponentialfamily random graphs with geometrically weighted model terms, such as geometrically weighted degree and dyadwise shared partner terms [45, 19], can be established along the same lines. While these results are encouraging, it goes without saying that numerous open questions remain: e.g., it is unknown whether θb is unique, because the mean-value parameter vectors µ(η(θ)) ∈ rint(M) are not available in closed form and it is not straightforward to characterize the set of mean-value parameter vectors induced by Θ. However, if the set θb contains more than one element, then all elements of the set θb are close to θ ? with high probability according to Theorem 2, as we pointed out in the discussion of Theorem 2. 4.3. Extensions to other complex topological features. Theorems 1 and 2 can be used to establish finite-graph concentration results for maximum likelihood estimators of other sensible specifications of exponential-family random graphs capturing complex topological features of networks. In many cases, the conditions of Theorems 1 and 2 can be verified along the lines of Corollaries 1 and 2. 5. Finite-graph concentration results for M -estimators. The finite-graph concentration results for maximum likelihood estimators in Section 4 are special cases of more general results for M -estimators. To demonstrate, we introduce a natural class of M -estimators in Section 5.1, which includes both likelihood- and moment-based estimators, and present finite-graph concentration results in Section 5.2 along with an application to misspecified models with omitted covariate terms.
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
17
5.1. M -estimators. A natural class of M -estimators, which includes both likelihood- and moment-based estimators, can be constructed as follows. Let b : X 7→ Rdim(b) be a vector of statistics. These statistics might be • the sufficient statistics of the data-generating exponential family; • the sufficient statistics of an exponential family other than the data-generating exponential family, which implies that the model is misspecified; • statistics motivated by computational considerations. An example is presented in Corollary 3, where we consider a misspecified exponential family with omitted covariate terms. A natural extension of the class of likelihood-based estimating functions in Section 4 is given by estimating functions of the form (8)
g(θ; b(x)) = kb(x) − β(θ)k2 , θ ∈ Θ,
which are approximations of g(θ; E b(X)) = kE b(X) − β(θ)k2 , θ ∈ Θ, provided E b(X) exists. Here, E b(X) is the expectation of b(X) under the datagenerating exponential-family distribution. The vector β : Θ 7→ Rdim(b) is a vector-valued function of θ ∈ Θ defined by β(θ) = Eθ b(X), where Eθ b(X) is the expectation of b(X) under a distribution parameterized by θ ∈ Θ, provided Eθ b(X) exists. The distribution may not belong to the data-generating exponential family or any other exponential family, i.e., the model may be misspecified. Estimating functions of the form (8) cover both likelihood- and moment-based estimating functions: • Likelihood-based estimating functions: The choice b(x) = s(x) gives g(θ; s(x)) = ks(x) − Eθ s(X)k2 , which is based on the gradient of the loglikelihood function with respect to the natural parameter vector of the data-generating exponential family and is therefore based on moments of the sufficient statistics of the data-generating exponential family. • Moment-based estimating functions: The choice b(x) 6= s(x) gives g(θ; b(x)) = kb(x) − Eθ b(X)k2 , which is based on moments of statistics other than the sufficient statistics of the data-generating exponential family. M -estimators based on estimating functions of the form (8) are defined by b ˙ θ = θ ∈ Θ : g(θ; b(x)) = inf g(θ; b(x)) , ˙ θ∈Θ
which are estimators of θ0 = θ ∈ Θ : g(θ; E b(X)) =
˙ inf g(θ; E b(X)) .
˙ θ∈Θ
18
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
If, given an observation x of a random graph X, the estimating function g(θ; b(x)) is close to g(θ; E b(X)), then we would expect the minimizers θb and θ0 of g(θ; b(x)) and g(θ; E b(X)) to be close under suitable conditions. To show that θb is close to θ0 with high probability, we make the following assumptions. In the following, B denotes the convex hull of the set {b(x) : x ∈ X}. [C.1 ] The expectation E b(X) ∈ rint(B) exists and there exists A > 0 such that, for all (x1 , x2 ) ∈ X × X, kb(x1 ) − b(x2 )k∞ ≤ A d(x1 , x2 ) kAk∞ . [C.2 ] The parameter space Θ is an open subset of Rdim(θ) , the expectation β(θ) = Eθ b(X) exists for all θ ∈ Θ, and there exists a unique element θ0 ∈ Θ such that β(θ0 ) = E b(X) ∈ rint(B). [C.3 ] For all > 0 small enough so that B(θ0 , ) ⊆ Θ, there exists δ() > 0 such that, for all θ ∈ Θ \ B(θ0 , ), kβ(θ0 ) − β(θ)k2 ≥ δ()
K X |Ak | α k=1
2
for some 0 ≤ α ≤ 1.
Conditions [C.1]—[C.3] are verified in Corollary 3. Conditions [C.1] and [C.3] resemble the conditions of Theorems 1 and 2 and are verified in Corollaries 1 and 2 in the special case b(x) = s(x). Condition [C.2] is a moment-matching condition: It assumes that the family of distributions parameterized by θ ∈ Θ, which may not be the data-generating exponential family, is able to match the first moment E b(X) of b(X) under the data-generating exponential-family distribution. The unique parameter vector θ0 ∈ Θ that matches the moment E b(X) is the unique minimizer of g(θ; E b(X)), i.e., θ0 = arg minθ∈Θ g(θ; E b(X)). 5.2. Finite-graph concentration results. We establish finite-graph concentration results for the class of M -estimators introduced in Section 5.1. To do so, we need to show that the probability mass of statistic vector b(X) concentrates around its expectation E b(X) under the data-generating exponentialfamily distribution. Proposition 3. Consider an exponential family with countable support X and local dependence. Let b : X 7→ Rdim(b) and assume that condition [C.1] is satisfied. there exists C > 0 such that, for all deviations of the form t = P Then |Ak | α δ K with δ > 0 and 0 ≤ α ≤ 1, k=1 2 ! δ2 C K P (kb(X) − E b(X)k2 ≥ t) ≤ 2 exp − + log dim(b) , 4 (2−α) dim(b) kAk∞
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
19
where dim(b) may be a function of kAk∞ . Proposition 3 paves the way for finite-graph concentration results for the class of M -estimators introduced in Section 5.1. The following concentration result shows that M -estimators θb = arg minθ∈Θ g(θ; b(X)) based on estimating functions of the form (8) are close to θ0 = arg minθ∈Θ g(θ; E b(X)) with high probability as long as the neighborhoods and dim(b) are small relative to the number of neighborhoods K. Theorem 3. Consider an exponential-family random graph of fixed and finite size with countable support X and local dependence. Assume that conditions [C.1]— [C.3] are satisfied. Then, for all > 0 small enough so that B(θ0 , ) ⊆ Θ, there exist κ() > 0 and C > 0 such that ! 2CK κ() P θb ∈ B(θ0 , ) ⊆ Θ ≥ 1 − 2 exp − + log dim(b) , 4 (2−α) dim(b) kAk∞ where dim(b) may be a function of kAk∞ and K ≥ 2. Conditions [C.1]—[C.3] of Theorem 3 are verified by Corollary 3. It is worth noting that θb may not be unique, but an argument along the lines of Section 4.2 shows that, with high probability, the minimizers of estimating function (8) do not give rise to global minima that are separated by large distances, under the stated assumptions. Example: misspecified exponential family with omitted covariate term. To demonstrate, we consider an extension of the exponential-family random graph with withinneighborhood edge and transitive edge terms, as described in Section 4.1. The extended model includes an additional same-attribute edge term with natural parameters ηk,3 (θ) = θ3 and sufficient statistics sk,3 (xk ), X sk,3 (xk ) = xi,j 1(ci = cj ), i∈Ak < j∈Ak
where 1(ci = cj ) is an indicator function, which is 1 if ci = cj and is 0 otherwise, and ci ∈ {C1 , . . . , CM } (Cm ∈ R, m = 1, . . . , M , M ≥ 2) is a categorical attribute of node i ∈ Ak (k = 1, . . . , K). Same-attribute edge terms are popular in applications and capture excesses in the expected number of edges among nodes with the same attribute [e.g., 19]: e.g., students may prefer to befriend other students of the same region of origin. Suppose that researchers are unaware that the attributes ci of nodes i are important predictors of edges and estimate the edge and transitive edge parameters θ1 and θ2 based on the misspecified exponential family with natural parameter vector θ = (θ1 , θ2 ) and sufficient statistic vector b(x) = (s1 (x), s2 (x)), where
20
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
PK P s1 (x) = K k=1 sk,2 (xk ) are defined in Section 4.1. k=1 sk,1 (xk ) and s2 (x) = In other words, suppose that statistical inference is based on the estimating function g(θ; b(x)) = kb(x) − Eθ b(X)k2 , θ ∈ Θ0 , where Θ0 = R2 is the parameter space of the misspecified exponential family, which is a two-dimensional subspace of the three-dimensional parameter space Θ? = R3 of the data-generating exponential family. The following concentration result shows that the M -estimator θb = arg minθ∈Θ0 g(θ; b(X)) is close to θ0 = arg minθ∈Θ0 g(θ; E b(X)) with high probability as long as the neighborhoods are small relative to the number of neighborhoods K. Corollary 3. Consider an exponential-family random graph of fixed and finite size with within-neighborhood edge, transitive edge, and same-attribute edge terms. Suppose that an observation of the random graph is generated by θ ? ∈ Θ? . Then all conditions of Theorem 3 are satisfied and hence, for all > 0 small enough so that B(θ0 , ) ⊆ Θ0 , there exist κ() > 0 and C > 0 such that θb exists, is unique, and κ()2 C K b P θ ∈ B(θ0 , ) ⊆ Θ0 ≥ 1 − 4 exp − , kAk4∞ provided |Ak | ≥ 3 (k = 1, . . . , K) and K ≥ 2. It is worth noting that θ0 ∈ R2 and θ ? ∈ R3 do not have the same dimension and cannot be identical, but the distribution parameterized by θ0 is as close as possible to the data-generating distribution parameterized by θ ? in terms of Kullback-Leibler divergence: by construction, θ0 minimizes kE b(X)−Eθ b(X)k2 = kE ∇θ log pθ (X)k2 and hence maximizes E log pθ (X), where pθ (x) ∝ exp(hθ, b(x)i) denotes the density of x ∈ X under the misspecified exponential-family distribution with natural parameter vector θ = (θ1 , θ2 ) and sufficient statistic vector b(x) = (s1 (x), s2 (x)). Therefore, θ0 satisfies (9)
θ0 = arg max E log pθ (X) − E log pθ? (X) = arg min KL(Pθ? ; Pθ ), θ ∈ Θ0
θ ∈ Θ0
where KL(Pθ? ; Pθ ) is the Kullback-Leibler divergence from Pθ? to Pθ and the expectations E log pθ (X) and E log pθ? (X) are with respect to Pθ? . Owing to the dependence of within-neighborhood edge variables and sufficient statistics, it is not straightforward to bound kθ ? − θ0 k2 , but—by the properties of maximum likelihood estimation—we are assured that the distribution parameterized by θ0 is as close as possible to the distribution parameterized by the data-generating parameter vector θ ? in terms of Kullback-Leibler divergence, as shown by (9).
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
21
6. Extendability to larger random graphs with more neighborhoods and subgraph-to-graph estimation. A natural question to ask is whether it makes sense to extend a model of a random graph of fixed and finite size to a larger random graph [43, 10, 30]. We discuss the issue of extendability in Section 6.1 and the related issue of subgraph-to-graph estimation in Section 6.2. 6.1. Extendability to larger random graphs with more neighborhoods. While many exponential-family random graphs cannot be extended to larger exponentialfamily random graphs [43, 10, 30], an exponential-family random graph with multilevel structure can be extended to a larger exponential-family random graph with more neighborhoods. To demonstrate, consider a population graph (XL , YL ) with a set of neighborhoods L = {A1 , . . . , AL }, where XL ∈ XL and YL ∈ YL denote the sequences of within- and between-neighborhood edge variables based on the set of neighborhoods L, respectively. As before, assume that XL is governed by an exponential family with countable support XL and local dependence, with neighborhooddependent natural parameters ηA,i (θ) = ηi (θ) and sufficient statistics sA,i (xA ) (i = 1, . . . , dim(ηA ), A ∈ L). Therefore, the exponential family can be reduced to an exponential family with natural parameter vector η(θ) = (η1 (θ), . . . , ηd∞ (θ))
(10)
and sufficient statistic vector s(xL ) = (s1 (xL ), . . . , sd∞ (xL )) , P
where si (xL ) = A∈L sA,i (xA ) (i = 1, . . . , d∞ ) and d∞ = maxA∈L dim(ηA ). Consider a subgraph (XK , YK ) induced by a subset of neighborhoods K ⊆ L. Then the subgraph (XK , YK ) with subset of neighborhoods K is extendable to the population graph (XL , YL ) with set of neighborhoods L ⊃ K as follows. Proposition 4. Consider a full or non-full, curved exponential-family random graph with set of neighborhoods L, countable support XL , and local dependence. Assume that, for all yL ∈ YL , Y P(YL = yL ) = P(YC,D = yC,D ), C∈L, D∈L, C6=D
where YC,D = (Yi,j )i∈C, j∈D . Then, for all θ ∈ Θ ⊆ {θ ∈ Rdim(θ) : ψL (η(θ)) < ∞}, all K ⊆ L, and all xK ∈ XK and yK ∈ YK , Pη(θ) (XK = xK , YK = yK , XL\K ∈ XL\K , YL\K ∈ YL\K ) = Pη(θ) (XK = xK , YK = yK ),
22
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where ψL (η(θ)) =
X
ψA (η(θ)) =
A∈L
X A∈L
log
X
exp (hη(θ), sA (xA )i) νA (xA ).
xA ∈XA
The marginal density of a subgraph xK ∈ XK of xL ∈ XL induced by K ⊆ L is an exponential-family density with support XK and local dependence: X pη(θ) (xL ) = pη(θ) (xK ) = exp (hη(θ), s(xK )i − ψK (η(θ))) νK (xK ), xL\K ∈XL\K
P where . , ηd∞ (θ)), s(xK ) = A∈K ψA (η(θ)), η(θ) = (η1 (θ), . .Q P ψK (η(θ)) = P ( A∈K sA,1 (xA ), . . . , A∈K sA,d∞ (xA )), and νK (xK ) = A∈K νA (xA ). In other words, the marginal probability mass function of a subgraph induced by a subset of neighborhoods K is consistent with the joint probability mass function of the population graph with set of neighborhoods L ⊃ K. Thus, the exponentialfamily random graph induced by subset of neighborhoods K can be extended to the exponential-family random graph with set of neighborhoods L ⊃ K. A more restrictive result was proved by Schweinberger and Handcock [40, Theorem 1]. 6.2. Finite-graph concentration results for subgraph-to-graph estimators. In practice, researchers may wish to extend an exponential-family random graph to a larger exponential-family random graph with more neighborhoods when it is infeasible to observe a population graph with a set of neighborhoods L and therefore a subgraph induced by a subset of neighborhoods K ⊆ L is sampled. We consider here population graphs of finite size, because in practice population graphs are of finite size and the concentration results in Sections 3, 4, and 5 enable concentration results for population graphs of finite size. Let L be the set of neighborhoods of the population graph and assume that xL ∈ XL was generated by an exponential family with countable support XL and local dependence. Suppose that it is infeasible to observe xL ∈ XL , but it is feasible to sample a subset of neighborhoods K ⊆ L and collect data on the subgraphs induced by K ⊆ L. We assume henceforth that the sampling design is ignorable in the sense of Rubin [38] and Handcock and Gile [15], i.e., the probability of observing subgraphs does not depend on the unobserved subgraphs. A simple example is a sampling design that samples neighborhoods at random and collects data on the subgraphs induced by the sampled neighborhoods. By Proposition 4 and the ignorability of the sampling design [38, 15], the observed-data likelihood function based on the observed subgraph xK ∈ XK of xL ∈ XL is proportional to X pη(θ) (xL ) = pη(θ) (xK ), xL ∈ XL , xL\K ∈ XL\K
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
23
where η(θ) is of the form (10). In other words, maximum likelihood estimation can be based on pη(θ) (xK ). Motivated by the same considerations we detailed in Section 4, we therefore consider estimating functions of the form gK (θ; µK\ (η(θ ? ))) = kµK\ (η(θ ? )) − µK (η(θ))k2 , where µK\ (η(θ ? )) = s(xK ), µK (η(θ)) = Eη(θ) s(XK ), and s(xK ) is defined in Proposition 4. The data-generating parameter vector θ ? of the population graph can hence be estimated by the estimator θbK based on the observed subgraph xK ∈ XK of xL ∈ XL : ? ? \ \ b ˙ θK = θ ∈ Θ : gK (θ; µK (η(θ ))) = inf gK (θ; µK (η(θ ))) . ˙ θ∈Θ
The following finite-graph concentration result shows that, with high probability, the estimator θbK based on the observed subgraph induced by K ⊆ L is close to the data-generating parameter vector θ ? of the population graph as long as the neighborhoods and the dimensions of neighborhood-dependent natural parameter vectors are small relative to the number of sampled neighborhoods |K|. Theorem 4. Consider a full or non-full, curved exponential-family random graph with set of neighborhoods L, countable support XL , and local dependence. Let Θ ⊆ {θ ∈ Rdim(θ) : ψL (η(θ)) < ∞}, Θ open. Assume that θ ? ∈ int(Θ) and that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()), where η(θ) is of the form (10). In addition, assume that there exist δ() > 0 and A > 0 such that, for all K ⊆ L and all η(θ) ∈ N \ B(η(θ ? ), γ()), X |A|α ? for some 0 ≤ α ≤ 1 (11) kµK (η(θ )) − µK (η(θ))k2 ≥ δ() 2 A∈K
and, for all K ⊆ L and all (x1 , x2 ) ∈ XK × XK , (12)
ks(x1 ) − s(x2 )k∞ ≤ A d(x1 , x2 ) kKk∞ ,
where kKk∞ = maxA∈K |A|. Then, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ() > 0 and C > 0 such that ! 2 C |K| κ() P θbK ∈ B(θ ? , ) ⊆ int(Θ) ≥ 1 − 2 exp − + log d∞ , 4 (2−α) d∞ kLk∞
24
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where d∞ = maxA∈L dim(ηA ) may be a function of kLk∞ and |K| ≥ 2. If the exponential family is full, then θbK is unique in the event θbK ∈ B(θ ? , ) ⊆ int(Θ). Theorem 4 shows that there are costs associated with observing a subset of neighborhoods K ⊆ L rather than the whole set of neighborhoods L of the population graph: the probability of event θbK 6∈ B(θ ? , ) decays with the number of observed neighborhoods |K| and is hence lowest when the whole set of neighborhoods L of the population graph is observed. To demonstrate, consider curved exponential-family random graphs with withinneighborhood edge and geometrically weighted edgewise shared partner terms, as described in Section 4.2. Corollary 4. Consider a curved exponential-family random graph with set of neighborhoods L, countable support XL , and local dependence induced by withinneighborhood edge and geometrically weighted edgewise shared partner terms. Let Θ = R × (0, 1) and assume that θ ? ∈ int(Θ). Then all conditions of Theorem 4 are satisfied and hence, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ() > 0 and C > 0 such that κ()2 C |K| ? b P θK ∈ B(θ , ) ⊆ int(Θ) ≥ 1 − 2 exp − + logkLk∞ , kLk6∞ provided |A| ≥ 4 (A ∈ L) and |K| ≥ 2. Remark 3. “Bad” subsets of neighborhoods K ⊆ L. Since the neighborhoods need not have the same size, it is natural to ask whether it is possible to sample a “bad” subset of neighborhoods K with too small or too large neighborhoods, which could make it challenging to estimate some of the parameters. However, the assumptions of Theorem 4 rule out “bad” subsets of neighborhoods K, for two reasons. First, while some neighborhoods may be larger than others, Theorem 4 assumes that the neighborhoods are of the same order of magnitude, as defined in Section 2. In other words, the neighborhoods have similar sizes. Second, the conditions of Theorem 4 assume that the model satisfies identifiability and smoothness conditions for all possible subsets of neighborhoods K ⊆ L. Corollary 4 shows that, in the special case of curved exponential-family random graphs with withinneighborhood edge and geometrically weighted edgewise shared partner terms, the identifiability conditions require |A| ≥ 4 for all neighborhoods A ∈ L of the population graph. Thus, no neighborhood can be too small and, at the same time, no neighborhood can be too large, because all neighborhoods are of the same order of magnitude. As a consequence, under the stated assumptions, it is impossible to sample a “bad” subset of neighborhoods K with too small or too large neighborhoods.
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
25
7. Comparison with existing asymptotic consistency results. We focus on asymptotic consistency results for exponential-family random graphs with dependence. We note that there exist asymptotic consistency results for exponentialfamily random graphs with independence assumptions—see, e.g., Diaconis et al. [11], Rinaldo et al. [37], Krivitsky and Kolaczyk [29], Yan et al. [51, 50]—but such independence assumptions may not be satisfied in applications, as discussed in Section 1. Concerning exponential-family random graphs with dependence, Shalizi and Rinaldo [43] showed that maximum likelihood estimators of exponentialfamily random graphs satisfying strong extendability assumptions are consistent, but those strong extendability assumptions rule out dependencies induced by transitivity and many other complex topological features of networks. Xiang and Neville [49] reported asymptotic consistency results under weak dependence assumptions, but did not give any example of an exponential-family random graph with dependence that satisfies those assumptions. Mukherjee [34] showed that consistent estimation of the so-called two-star model is possible, but those results have not been extended to other exponential-family random graphs. In addition, Shalizi and Rinaldo [43], Xiang and Neville [49], and Mukherjee [34] focus on asymptotic consistency results for estimators of natural parameter vectors whose dimensions do not depend on the number of nodes. In contrast, we advance the statistical theory of exponential-family random graphs by providing the first finite-graph concentration and consistency results that cover • a wide range of exponential-family random graphs with dependence induced by sensible forms of transitivity and other complex topological features of networks; • curved exponential-family random graphs with dependence and parameter vectors whose dimensions can depend on the number of nodes; • correct and incorrect specifications of exponential-family random graphs; • M -estimators, including likelihood- and moment-based estimators; • subgraph-to-graph estimators. These results demonstrate the importance of additional structure: It is the additional structure in the form of multilevel structure that facilitates these results. 8. Simulation results. To shed light on the finite-graph properties of maximum likelihood estimators, we generated data from the full and non-full, curved exponential-family random graphs described in Section 4. We used R package hergm [41] to generate 1,000 graphs from each model and estimated the datagenerating parameter vector by Monte Carlo maximum likelihood estimators [19]. Figure 1 shows 1,000 estimates of the exponential-family random graph with within-neighborhood edge and transitive edge terms, where each graph consists of K = 100, 200, and 300 neighborhoods of size 50 with natural parameter vec-
26
MICHAEL SCHWEINBERGER AND JONATHAN STEWART K = 100
K = 200
K = 300
0.525
●
0.500
0.475
0.450
0.550
Transitive edge parameter θ2
0.550
Transitive edge parameter θ2
Transitive edge parameter θ2
0.550
0.525
●
0.500
0.475
0.450 −2.04
−2.00
Edge parameter θ1
−1.96
0.525
●
0.500
0.475
0.450 −2.04
−2.00
Edge parameter θ1
−1.96
−2.04
−2.00
Edge parameter θ1
−1.96
1.0
●
0.9
0.8
−2.7
−2.6
−2.5
−2.4
Edge parameter θ1
−2.3
K = 100 −0.15
−0.20
−0.25
●
−0.30
−0.35
−0.40
−0.45 −0.4
−0.3
−0.2
Edge deviation θ3 (50)
−0.1
Transitive edge deviation θ6 (75)
Transitive edge parameter θ2
K = 100
Transitive edge deviation θ4 (50)
F IG 1. 1,000 estimates of the exponential-family random graph with within-neighborhood edge and transitive edge terms, where each graph consists of K = 100, 200, and 300 neighborhoods of size 50 with natural parameter vectors ηk (θ) = (θ1 , θ2 ). The horizontal and vertical lines indicate the coordinates of the data-generating parameter vector θ ? = (θ1? , θ2? ). K = 100 −0.4
−0.5
●
−0.6
−0.7
−1.0
−0.9
−0.8
−0.7
Edge deviation θ5 (75)
−0.6
F IG 2. 1,000 estimates of the exponential-family random graph with within-neighborhood edge and transitive edge terms, where each graph consists of 33 neighborhoods of size 25 with natural parameter vectors ηk (θ) = (θ1 , θ2 ), 34 neighborhoods of size 50 with natural parameter vectors ηk (θ) = (θ1 + θ3 , θ2 + θ4 ), and 33 neighborhoods of size 75 with natural parameter vectors ηk (θ) = (θ1 + θ5 , θ2 + θ6 ). The horizontal and vertical lines indicate the coordinates of the datagenerating parameter vector θ ? = (θ1? , θ2? , θ3? , θ4? , θ5? , θ6? ).
tors ηk (θ) = (θ1 , θ2 ) (k = 1, . . . , K). The figure suggests that the probability mass of estimators becomes more and more concentrated in a neighborhood of the data-generating parameters as the number of neighborhoods K increases from 100 to 300, demonstrating that the finite-graph concentration results in Section 4 are manifest when K is in the low hundreds and kAk∞ = 50. Figure 2 sheds light on the performance of a simple form of a size-dependent parameterization that allows small and large neighborhoods to have different parameters. We consider exponential-family random graphs with within-neighborhood edge and transitive edge terms, where each graph consists of 33 neighborhoods of size 25 with natural parameter vectors ηk (θ) = (θ1 , θ2 ) (k = 1, . . . , 33), 34 neighborhoods of size 50 with natural parameter vectors ηk (θ) = (θ1 + θ3 , θ2 + θ4 ) (k = 34, . . . , 67), and 33 neighborhoods of size 75 with natural parameter vectors ηk (θ) = (θ1 + θ5 , θ2 + θ6 ) (k = 68, . . . , 100). Figure 2 demonstrates that the
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
27
25
40 20
30 15
20
10
10
5
0 −3.06
−3.03
−3.00
−2.97
Edge parameter θ1
0
0.18
0.20
0.22
Geometric term parameter θ2
F IG 3. 1,000 estimates of the curved exponential-family random graph with within-neighborhood edge and geometrically weighted edgewise shared partner terms, where each graph consists of K = 100 neighborhoods of size 50 with natural parameter vectors ηk (θ) = (θ1 , ηk,1 (θ2 ), . . . , ηk,48 (θ2 )). The vertical lines indicate the coordinates of the data-generating parameter vector θ ? = (θ1? , θ2? ).
estimates of the baseline edge and transitive edge parameters θ1 and θ2 tend to be closer to the data-generating parameters than the deviation parameters θ3 , θ4 , θ5 , and θ6 , because θ1 and θ2 are estimated from all neighborhoods whereas θ3 , θ4 , θ5 , and θ6 are estimated from a subset of neighborhoods. Figure 3 shows 1,000 estimates of the curved exponential-family random graph with within-neighborhood edge and geometrically weighted edgewise shared partner terms, where each graph consists of K = 100 neighborhoods of size 50 with natural parameter vectors ηk (θ) = (θ1 , ηk,1 (θ2 ), . . . , ηk,48 (θ2 )), where θ1 is the natural parameter of the edge term and ηk,1 (θ2 ), . . . , ηk,48 (θ2 ) are the natural parameters of the geometrically weighted edgewise shared partner term (k = 1, . . . , 100). The figure shows that the probability mass of the estimators is concentrated in a small neighborhood of the data-generating parameters. 9. Discussion. We have taken constructive steps to demonstrate that statistical inference for transitivity and other complex topological features of networks is possible, provided sensible specifications of exponential-family random graphs are used and additional structure in the form of multilevel structure is available. The theoretical results reported here demonstrate the importance of additional structure: It is the additional structure that facilitates these results. A pleasant feature of the general setup of the statistical theory developed here, in particular the statistical theory in Sections 4 and 5, is that it can be extended to models with other forms of additional structure, such as neighborhood structures with overlapping neighborhoods. Such extensions are possible as long as the dependence induced by the model is sufficiently weak and the model satisfies smoothness and identifiability assumptions. Last, but not least, it is worth noting that additional structure not only facilitates
28
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
theoretical results, but also facilitates computing: the contributions of neighborhoods to estimating functions—e.g., the expectations of within-neighborhood sufficient statistics—may be computed or approximated by exploiting parallel computing on computing clusters. Acknowledgements. We acknowledge support from the National Science Foundation (NSF award DMS-1513644). Supplementary materials. All results are proved in the supplement [42]. References. [1] Bhamidi, S., Bresler, G., and Sly, A. (2011), “Mixing time of exponential random graphs,” The Annals of Applied Probability, 21, 2146–2170. [2] Bhattacharya, B. B., and Mukherjee, S. (2018), “Inference in Ising models,” Bernoulli, 24, 493–525. [3] Boucheron, S., Lugosi, G., and Massart, P. (2013), Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford: Oxford University Press. [4] Brown, L. (1986), Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory, Hayworth, CA, USA: Institute of Mathematical Statistics. [5] Butts, C. T. (2011), “Bernoulli graph bounds for general random graph models,” Sociological Methodology, 41, 299–345. [6] Butts, C. T., and Almquist, Z. W. (2015), “A Flexible Parameterization for Baseline Mean Degree in Multiple-Network ERGMs,” Journal of Mathematical Sociology, 39, 163–167. [7] Chatterjee, S. (2007), “Estimation in spin glasses: A first step,” The Annals of Statistics, 35, 1931–1946. [8] Chatterjee, S., and Diaconis, P. (2013), “Estimating and understanding exponential random graph models,” The Annals of Statistics, 41, 2428–2461. [9] Chung, F., and Lu, L. (2006), Complex Graphs and Networks, Providence: American Mathematical Society. [10] Crane, H., and Dempsey, W. (2015), “A framework for statistical network modeling,” Available at https://arxiv.org/abs/1509.08185.v4. [11] Diaconis, P., Chatterjee, S., and Sly, A. (2011), “Random graphs with a given degree sequence,” The Annals of Applied Probability, 21, 1400–1435. [12] Efron, B. (1978), “The geometry of exponential families,” The Annals of Statistics, 6, 362–376. [13] Frank, O., and Strauss, D. (1986), “Markov graphs,” Journal of the American Statistical Association, 81, 832–842. [14] Handcock, M. S. (2003), “Statistical Models for Social Networks: Inference and Degeneracy,” in Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers, eds. Breiger, R., Carley, K., and Pattison, P., Washington, D.C.: National Academies Press, pp. 1–12. [15] Handcock, M. S., and Gile, K. (2010), “Modeling Social Networks from Sampled Data,” The Annals of Applied Statistics, 4, 5–25. [16] Holland, P. W., and Leinhardt, S. (1976), “Local Structure in Social Networks,” Sociological Methodology, 1–45. [17] Hollway, J., and Koskinen, J. (2016), “Multilevel embeddedness: The case of the global fisheries governance complex,” Social Networks, 44, 281–294. [18] Hollway, J., Lomi, A., Pallotti, F., and Stadtfeld, C. (2017), “Multilevel social spaces: The network dynamics of organizational fields,” Network Science, 5, 187–212.
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
29
[19] Hunter, D. R., and Handcock, M. S. (2006), “Inference in curved exponential family models for networks,” Journal of Computational and Graphical Statistics, 15, 565–583. [20] Hunter, D. R., Krivitsky, P. N., and Schweinberger, M. (2012), “Computational statistical methods for social network models,” Journal of Computational and Graphical Statistics, 21, 856– 882. [21] Janson, S., and Rucinski, A. (2002), “The infamous upper tail,” Random Structures and Algorithms, 20, 317–342. [22] Jonasson, J. (1999), “The random triangle model,” Journal of Applied Probability, 36, 852–876. [23] Kass, R., and Vos, P. (1997), Geometrical foundations of asymptotic inference, New York: Wiley. [24] Kim, J. H., and Vu, V. H. (2004), “Divide and conquer martingales and the number of triangles in a random graph,” Random Structures & Algorithms, 24, 166–174. [25] Kolaczyk, E. D. (2009), Statistical Analysis of Network Data: Methods and Models, New York: Springer-Verlag. [26] Kontorovich, L., and Ramanan, K. (2008), “Concentration inequalities for dependent random variables via the martingale method,” The Annals of Probability, 36, 2126–2158. [27] Krivitsky, P. N. (2012), “Exponential-family models for valued networks,” Electronic Journal of Statistics, 6, 1100–1128. [28] Krivitsky, P. N., Handcock, M. S., and Morris, M. (2011), “Adjusting for network size and composition effects in exponential-family random graph models,” Statistical Methodology, 8, 319–339. [29] Krivitsky, P. N., and Kolaczyk, E. D. (2015), “On the question of effective sample size in network modeling: An asymptotic inquiry,” Statistical Science, 30, 184–198. [30] Lauritzen, S., Rinaldo, A., and Sadeghi, K. (2017), “Random networks, graphical models, and exchangeability,” Available at https://arxiv:1701.08420v1. [31] Lazega, E., and Snijders, T. A. B. (eds.) (2016), Multilevel Network Analysis for the Social Sciences, Switzerland: Springer-Verlag. [32] Lomi, A., Robins, G., and Tranmer, M. (2016), “Introduction to multilevel social networks,” Social Networks, 266–268. [33] Lusher, D., Koskinen, J., and Robins, G. (2013), Exponential Random Graph Models for Social Networks, Cambridge, UK: Cambridge University Press. [34] Mukherjee, S. (2013), “Consistent estimation in the two star exponential random graph model,” Tech. rep., Department of Statistics, Columbia University, arXiv:1310.4526v1. [35] Nowicki, K., and Snijders, T. A. B. (2001), “Estimation and prediction for stochastic blockstructures,” Journal of the American Statistical Association, 96, 1077–1087. [36] Ravikumar, P., Wainwright, M. J., and Lafferty, J. (2010), “High-dimensional Ising model selection using `1 -regularized logistic regression,” The Annals of Statistics, 38, 1287–1319. [37] Rinaldo, A., Petrovic, S., and Fienberg, S. E. (2013), “Maximum likelihood estimation in network models,” The Annals of Statistics, 41, 1085–1110. [38] Rubin, D. B. (1976), “Inference and missing data,” Biometrika, 63, 581–592. [39] Schweinberger, M. (2011), “Instability, sensitivity, and degeneracy of discrete exponential families,” Journal of the American Statistical Association, 106, 1361–1370. [40] Schweinberger, M., and Handcock, M. S. (2015), “Local dependence in random graph models: characterization, properties and statistical inference,” Journal of the Royal Statistical Society, Series B, 77, 647–676. [41] Schweinberger, M., and Luna, P. (2017), “HERGM: Hierarchical exponential-family random graph models,” Journal of Statistical Software, accepted. [42] Schweinberger, M., and Stewart, J. (2017), “Supplement: Finite-graph concentration and consistency results for random graphs with complex topological structures,” Tech. rep., Department of Statistics, Rice University.
30
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
[43] Shalizi, C. R., and Rinaldo, A. (2013), “Consistency under sampling of exponential random graph models,” The Annals of Statistics, 41, 508–535. [44] Slaughter, A. J., and Koehly, L. M. (2016), “Multilevel models for social networks: hierarchical Bayesian approaches to exponential random graph modeling,” Social Networks, 44, 334–345. [45] Snijders, T. A. B., Pattison, P. E., Robins, G. L., and Handcock, M. S. (2006), “New specifications for exponential random graph models,” Sociological Methodology, 36, 99–153. [46] Strauss, D. (1986), “On a general class of models for interaction,” SIAM Review, 28, 513–527. [47] Wang, P., Robins, G., Pattison, P., and Lazega, E. (2013), “Exponential random graph models for multilevel networks,” Social Networks, 35, 96–115. [48] Wasserman, S., and Pattison, P. (1996), “Logit models and logistic regression for social networks: I. An introduction to Markov graphs and p∗ ,” Psychometrika, 61, 401–425. [49] Xiang, R., and Neville, J. (2011), “Relational learning with one network: an asymptotic analysis,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1–10. [50] Yan, T., Leng, C., and Zhu, J. (2016), “Asymptotics in directed exponential random graph models with an increasing bi-degree sequence,” The Annals of Statistics, 44, 31–57. [51] Yan, T., Wang, H., and Qin, H. (2016), “Asymptotics in undirected random graph models parameterized by the strengths of vertices,” Statistica Sinica, 26, 273–293. [52] Zappa, P., and Lomi, A. (2015), “The analysis of multilevel networks in organizations: models and empirical tests,” Organizational Research Methods, 18, 542–569.
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
1
SUPPLEMENT: FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS FOR RANDOM GRAPHS WITH COMPLEX TOPOLOGICAL STRUCTURES B Y M ICHAEL S CHWEINBERGER A ND J ONATHAN S TEWART Rice University We prove the main concentration results for dependent random graphs in Appendix A and the main concentration results for maximum likelihood estimators, M -estimators, and subgraph-to-graph estimators in Appendices B, C, and D, respectively. Auxiliary lemmas are proved in Appendix E. APPENDIX A: PROOFS: FINITE-GRAPH CONCENTRATION RESULTS FOR DEPENDENT RANDOM GRAPHS We prove the main concentration results of Sections 3 and 5, Propositions 1, 2, and 3. Proof of Proposition 1. By assumption, E f (X) exists. We are interested in deviations of the form |f (X) − E f (X)| ≥ t, where t > 0. Choose any t > 0 and let X = {x ∈ X : |f (x) − E f (X)| ≥ t}. Since within-neighborhood edges do not depend on between-neighborhood edges, P(X ∈ X, Y ∈ Y) = P(X ∈ X). In the following, we denote by P a probability measure on (X, S) with densities of the form (1), where S is the power set of the countable set X. Keep in mind that X = (Xk )K k=1 denotes the sequence of within-neighborhood edge variables, where Xk = (Xi,j )i∈Ak < j∈Ak . In an abuse of notation, we denote the elements of the sequence of edge variables by X 1 , . . . , Xm with sample spaces X1 , . . . , Xm , P X |A k| respectively, where m = K is the number of within-neighborhood edge k=1 2 variables. Let Xi:j = (Xi , . . . , Xj ) be a subsequence of edge variables with sample space Xi:j , where i ≤ j. By applying Theorem 1.1 of Kontorovich and Ramanan [26] to kf kLip -Lipschitz functions f : X 7→ R defined on the countable set X, t2 P(|f (X) − E f (X)| ≥ t) ≤ 2 exp − , 2 m kΦk2∞ kf k2Lip
2
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where Φ is the m × m-upper triangular matrix with entries ϕi,j if i < j φi,j = 1 if i = j 0 if i > j and
m X = max 1 + ϕi,j . 1≤i≤m j=i+1
kΦk∞
The coefficients ϕi,j are known as mixing coefficients and are defined by ϕi,j ≡
sup x1:i−1 ∈X1:i−1 (xi , x?i )∈Xi ×Xi
ϕi,j (x1:i−1 , xi , x?i ) =
sup x1:i−1 ∈X1:i−1 (xi , x?i )∈Xi ×Xi
kπxi − πx?i kTV ,
where kπxi − πx?i kTV is the total variation distance between the distributions πxi and πx?i given by πxi ≡ π(xj:m | x1:i−1 , xi ) = P(Xj:m = xj:m | X1:i−1 = x1:i−1 , Xi = xi ) and πx?i ≡ π(xj:m | x1:i−1 , x?i ) = P(Xj:m = xj:m | X1:i−1 = x1:i−1 , Xi = x?i ). Since the support of πxi and πx?i is countable, kπxi − πx?i kTV =
1 2
X
|π(xj:m | x1:i−1 , xi ) − π(xj:m | x1:i−1 , x?i )|.
xj:m ∈Xj:m
An upper bound on kΦk∞ can be obtained by bounding the mixing coefficients ϕi,j as follows. Consider any pair of edge variables Xi and Xj . If Xi and Xj involve nodes in more than one neighborhood, the mixing coefficient ϕi,j vanishes by the local dependence induced by exponential families of the form (1). If the pair of nodes corresponding to Xi and the pair of nodes corresponding to Xj belong to the same neighborhood, the mixing coefficient ϕi,j can be bounded as follows: ϕi,j (x1:i−1 , xi , x?i ) = ≤
1 2
X xj:m ∈Xj:m
1 2
X
|π(xj:m | x1:i−1 , xi ) − π(xj:m | x1:i−1 , x?i )|
xj:m ∈Xj:m
π(xj:m | x1:i−1 , xi ) +
1 2
X xj:m ∈Xj:m
π(xj:m | x1:i−1 , x?i ) = 1,
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
3
because πxi and πx?i are conditional probability mass functions with countable support Xj:m . We note that the upper bound is not sharp, but it has the advantage that it covers a wide range of dependencies within neighborhoods. Thus, m X kAk∞ kΦk∞ = max 1 + , ϕi,j ≤ 1≤i≤m 2 j=i+1
∞ − 1 other edge varibecause each edge variable Xi can depend on at most kAk 2 ables corresponding to pairs of nodes belonging to the same pair of neighborhoods. Therefore, there exists C > 0 such that, for all t > 0, ! t2 P(|f (X) − E f (X)| ≥ t) ≤ 2 exp − PK |A | , C k=1 2k kAk4∞ kf k2Lip where kAk∞ ≥ 1 because all neighborhoods Ak are non-empty and kf kLip > 0 by assumption. Proof of Proposition 2. By assumption, the neighborhood-dependent natural parameters ηk,i (θ) are of the form ηk,i (θ) = ηi (θ) (i = 1, . . . , dim(ηk ), k = 1, . . . , K). Therefore, the exponential family can be reduced to an exponential family with natural parameter vector η(θ) = (η1 (θ), . . . , ηd∞ (θ)) and sufficient statistic vector s(x) = (s1 (x), . . . , sd∞ (x)) , where d∞ = max1≤k≤K dim(ηk ). Here, the sufficient statistics si (x) are sums of within-neighborhood sufficient statistics sk,i (xk ) (k = 1, . . . , K): si (x) =
K X
sk,i (xk ), i = 1, . . . , d∞ .
k=1
\? )) = s(X) and that µ(η(θ ? )) = Eη(θ? ) s(X) = E s(X) ∈ Observe that µ(η(θ rint(M) exists, because η(θ ? ) ∈ int(N) [4, Theorem 2.2, pp. 34–35]. We bound the probability of deviations of s(X) from E s(X) in terms of the `∞ - and `2 -norm below.
4
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
`∞ -norm. For all δ > 0, \? )) − µ(η(θ ? ))k∞ ≥ δ P kµ(η(θ
K X |Ak | α
!
2 ! K X |Ak | α ≥ δ 2 k=1
= P ks(X) − E s(X)k∞
k=1
≤ P
d ∞ [
|si (X) − E si (X)| ≥ δ
i=1
!! K X |Ak | α 2
k=1
.
By condition (2) of Proposition 2, there exists A > 0 such that the Lipschitz coefficient of si : X 7→ R with respect to the Hamming metric d : X × X 7→ R+ 0 is bounded above by ksi kLip ≤ A kAk∞ (i = 1, . . . , d∞ ). Thus, by a union bound over the d∞ components of s(X) and by applying Proposition 1 to deviations of PK |Ak |α , we obtain, for all δ > 0, size t = δ k=1 2 P
d ∞ [
|si (X) − E si (X)| ≥ δ
i=1
!! K X |Ak | α k=1
≤ 2 exp −
δ2 B
K |Ak | α k=1 2
P
|Ak | k=1 2
PK
2
2
kAk4∞ kAk2∞
+ log d∞ ,
where B > 0. Since all neighborhoods are of the same order of magnitude, there exist C1 > 0 and C2 > 0 such that C1 kAk∞ ≤ |Ak | ≤ C2 kAk∞ . Thus, there P |Ak | α 2 ) in the numerator of the exponent exists C3 > 0 such that the term ( K k=1 2 can be bounded below by !2 K X |Ak | α k=1
2
≥ C3 K 2 kAk4∞α ,
and there exists C4 > 0 such that the term exponent can be bounded above by K X |Ak | k=1
2
PK
k=1
|Ak | 2
in the denominator of the
≤ C4 K kAk2∞ .
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
5
As a result, there exists C > 0 such that, for all δ > 0, \? )) − µ(η(θ ? ))k∞ ≥ δ P kµ(η(θ
! K X |Ak | α k=1
≤ 2 exp −
δ2 C K 4 (2−α)
kAk∞
2
! + log d∞
.
`2 -norm. The same argument used above shows that, for all δ > 0, α ! K X |A | k ? \? )) − µ(η(θ ))k2 ≥ δ P kµ(η(θ 2 k=1 PK |Ak |α ! δ k=1 2 \? )) − µ(η(θ ? ))k∞ ≥ √ ≤ P kµ(η(θ d∞ ! δ2 C K ≤ 2 exp − + log d∞ . 4 (2−α) d∞ kAk∞ Proof of Proposition 3. Condition [C.1] implies that E b(X) exists. Thus, for all δ > 0, ! K X |Ak | α P kb(X) − E b(X)k2 ≥ δ 2 k=1 PK |Ak |α ! δ pk=1 2 ≤ P kb(X) − E b(X)k∞ ≥ dim(b) PK |Ak |α ! dim(b) [ δ . pk=1 2 ≤ P |bi (X) − E bi (X)| ≥ dim(b) i=1 By condition [C.1], there exists A > 0 such that the Lipschitz coefficient of bi : X 7→ R with respect to the Hamming metric d : X × X 7→ R+ 0 is bounded above by kbi kLip ≤ A kAk∞ (i = 1, . . . , dim(b)). Thus, by a union bound over the dim(b) components of b(X) and by applying Proposition 1 to deviations of size PK |Ak |α p t=δ / dim(b), we obtain, for all δ > 0, k=1 2 PK |Ak |α ! dim(b) [ δ pk=1 2 P |bi (X) − E bi (X)| ≥ dim(b) i=1 P 2 K |Ak | α 2 δ k=1 2 ≤ 2 exp − + log dim(b) , PK |Ak | 4 2 B dim(b) k=1 2 kAk∞ kAk∞
6
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where B > 0. We showed 2 that there C1 > 0 α 2of Proposition P in the|Aproof Pexist K |Ak | 2 kAk4 α and k| and C2 > 0 such that ( K ) ≥ C K ≤ 1 ∞ k=1 k=1 2 2 2 C2 K kAk∞ . As a consequence, there exists C > 0 such that, for all δ > 0, ! K X |Ak | α P kb(X) − E b(X)k2 ≥ δ 2 k=1 ! δ2 C K + log dim(b) . ≤ 2 exp − 4 (2−α) dim(b) kAk∞
APPENDIX B: PROOFS: FINITE-GRAPH CONCENTRATION RESULTS FOR MAXIMUM LIKELIHOOD ESTIMATORS We prove the main concentration results of Section 4, Theorems 1 and 2 along with Corollaries 1 and 2. Auxiliary lemmas are proved in Appendix E. Proof of Theorems 1 and 2. Theorem 1 can be considered as a special case of Theorem 2 with η(θ) = θ and d∞ = max1≤k≤K dim(ηk ) = dim(θ). We note that Theorem 1 makes the same assumptions as Theorem 2 and satisfies the additional assumption of Theorem 2, which states that the map η : Θ 7→ int(N) is one-to-one and that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()). We therefore prove Theorem 1 by proving Theorem 2. By assumption, θ ? ∈ int(Θ). Observe that θ ? ∈ int(Θ) implies µ(η(θ ? )) ∈ rint(M), because the maps η : Θ 7→ int(N) and µ : int(N) 7→ rint(M) are oneto-one: the map η : Θ 7→ int(N) is one-to-one by assumption while the map µ : int(N) 7→ rint(M) is one-to-one by classic exponential-family theory [4, Theorem 3.6, p. 74]. We are interested in estimators of the form b ˙ θ = θ ∈ Θ : ks(x) − µ(η(θ))k2 = inf ks(x) − µ(η(θ))k2 , ˙ θ∈Θ
where Θ ⊆ {θ ∈ Rdim(θ) : ψ(η(θ)) < ∞}. The following proof covers both full and non-full exponential families, including curved exponential families. We first discuss the existence of θb in full and non-full, curved exponential families and then bound the probability of event θb ∈ B(θ ? , ).
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
7
b full and non-full, curved exponential families. For any given Existence of θ: ω > 0, let ( ) K X |Ak | α 0 0 ? M(ω) = µ ∈ M : kµ − µ(η(θ ))k2 < ω 2 k=1
be the subset of µ0 ∈ M close to the data-generating mean-value parameter vector P |Ak | α . µ(η(θ ? )) ∈ rint(M) in the sense that kµ0 − µ(η(θ ? ))k2 < ω K k=1 2 Choose ω > 0 small enough so that M(2 ω) ⊆ rint(M) and let ( ) K X |Ak | α ? G(ω) = x ∈ X : ks(x) − µ(η(θ ))k2 < ω 2 k=1
be the subset of x ∈ X such that s(x) ∈ M(ω) ⊆ rint(M). In the following, we show that in the event G(ω) the set θb is non-empty, i.e., θb exists. To see that, let M(Θ) be the set of mean-value parameter vectors induced by Θ, i.e., M(Θ) = {µ0 ∈ rint(M) : there exists θ ∈ Θ such that µ(η(θ)) = µ0 } . If the exponential family is full, then M(Θ) = rint(M), otherwise M(Θ) ⊂ rint(M), because non-full exponential families exclude some natural parameter vectors along with the corresponding mean-value parameter vectors. To show that the set θb is non-empty in the event G(ω), note that, for all x ∈ G(ω) and hence all b ∈ M(Θ) ⊆ s(x) ∈ M(ω) ⊆ rint(M), there exists at least one element µ(η(θ)) rint(M) such that b 2 ≤ ks(x) − µ(η(θ ? ))k2 , ks(x) − µ(η(θ))k because the data-generating mean-value parameter vector µ(η(θ ? )) ∈ M(Θ) ⊆ rint(M) is known to be an element of M(Θ) ⊆ rint(M) and any minimizer of ks(x) − µ(η(θ))k2 cannot be farther from s(x) ∈ M(ω) ⊆ rint(M) than the data-generating mean-value parameter vector µ(η(θ ? )) ∈ M(Θ) ⊆ rint(M). In b ∈ M(Θ) ⊆ rint(M), we have addition, for each element µ(η(θ)) b − µ(η(θ ? ))k2 ≤ ks(x) − µ(η(θ))k b 2 + ks(x) − µ(η(θ ? ))k2 kµ(η(θ)) ≤ 2 ks(x) − µ(η(θ ? ))k2 , b ∈ M(Θ) ⊆ rint(M) is contained in which implies that each element µ(η(θ)) M(2 ω) ⊆ rint(M); note that ω > 0 was chosen small enough so that M(2 ω) ⊆ rint(M). Therefore, for all x ∈ G(ω), the set θb contains at least one element. If there exists a unique element θ ∈ Θ such that µ(η(θ)) = s(x) ∈ M(ω) ⊆
8
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
rint(M), then the set θb contains one and only one element. In particular, in full exponential families with η(θ) = θ, for all x ∈ G(ω), there exists a unique element θ ∈ Θ such that µ(η(θ)) = s(x) ∈ M(ω) ⊆ rint(M) [4, Theorem 3.6, p. 74]. Therefore, in full exponential families θb exists and is unique in the event G(ω), whereas in non-full exponential families θb exists but may not be unique in the event G(ω). Bounding the probability of event θb ∈ B(θ ? , ): full and non-full, curved exponential families. By the identifiability conditions of Theorem 2, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()). In addition, there exists δ() > 0 such that, for all η(θ) ∈ N \ B(η(θ ? ), γ()), K X |Ak | α ? kµ(η(θ)) − µ(η(θ ))k2 ≥ δ() for some 0 ≤ α ≤ 1. 2 k=1
Therefore, kµ(η(θ)) −
µ(η(θ ? ))k2
< δ()
K X |Ak | α k=1
2
implies η(θ) ∈ B(η(θ ? ), γ()), which in turn implies θ ∈ B(θ ? , ). To bound the probability of event θb ∈ B(θ ? , ), we exploit the fact that, for all x ∈ G(ω) and hence all s(x) ∈ M(ω) ⊆ rint(M), the set θb is non-empty and, for b each element of the set θ, b − µ(η(θ ? ))k2 ≤ ks(x) − µ(η(θ))k b 2 + ks(x) − µ(η(θ ? ))k2 kµ(η(θ)) ≤ 2 ks(x) − µ(η(θ ? ))k2 , b ∈ M(2 ω) ⊆ rint(M). Thus, the probability of event which implies that µ(η(θ)) K X |Ak | α ? b kµ(η(θ)) − µ(η(θ ))k2 < δ() ∩ G(ω) 2 k=1
can be bounded from below by bounding the probability of event K X |Ak | α ? 2 ks(x) − µ(η(θ ))k2 < δ() ∩ G(ω), 2 k=1
which, by definition of event G(ω), is equivalent to bounding the probability of event X K δ() |Ak | α ? ks(x) − µ(η(θ ))k2 < min . , ω 2 2 k=1
9
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
These results show that the probability of event θb ∈ B(θ ? , ) can be bounded from below as follows: P θb ∈ B(θ ? , ) ⊆ int(Θ) ≥ P θb ∈ B(θ ? , ) ⊆ int(Θ) ∩ G(ω) ! α K X |A | k ? b − µ(η(θ ))k2 < δ() ≥ P kµ(η(θ)) ∩ G(ω) 2 k=1 X ! K δ() |Ak | α ? ≥ P ks(X) − µ(η(θ ))k2 < min , ω . 2 2 k=1
To bound the probability of the event on the right-hand side of the inequality above, note that the smoothness condition (7) of Theorem 2 ensures that the smoothness condition (2) of Proposition 2 is satisfied. Therefore, Proposition 2 can be invoked to show that there exists C > 0 such that X α ! K |A | δ() k , ω P ks(X) − µ(η(θ ? ))k2 ≥ min 2 2 k=1 ! κ()2 C K ≤ 2 exp − + log d∞ , 4 (2−α) d∞ kAk∞ where κ() = min(δ() / 2, ω) > 0. Collecting results shows that
P θb ∈
B(θ ? , )
κ()2 C K ⊆ int(Θ) ≥ 1 − 2 exp − + log d∞ 4 (2−α) d∞ kAk∞
! .
Proof of Corollary 1. Corollary 1 follows from Theorem 1, because conditions (4) and (5) of Theorem 1 are satisfied. Condition (4) is satisfied with α = 1 by Lemma 3 in Appendix E.1 provided |Ak | ≥ 3 (k = 1, . . . , K). Condition (5) is satisfied, because changing an edge cannot change the number of withinneighborhood edges by more than 1 and the number of within-neighborhood transitive edges by more than 2 (kAk∞ − 2) + 1. Thus, Theorem 1 can be invoked to conclude that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ() > 0, C1 > 0, and C2 > 0 such that ! 2 C K κ() 1 P θb ∈ B(θ ? , ) ⊆ int(Θ) ≥ 1 − 2 exp − + log dim(θ) 4 (2−α) dim(θ) kAk∞ κ()2 C2 K = 1 − 4 exp − , kAk4∞
10
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where we used the fact that dim(θ) = 2 and α = 1. Proof of Corollary 2. Corollary 2 follows from Theorem 2, because all conditions of Theorem 2 are satisfied. First, by Lemma 4 in Appendix E.2, the map η : Θ 7→ int(N) is one-to-one and, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()). Second, condition (6) of Theorem 2 is satisfied with α = 3/4 by Lemma 5 in Appendix E.2 provided |Ak | ≥ 4 (k = 1, . . . , K). Last, but not least, condition (7) of Theorem 2 is satisfied, because changing an edge cannot change the number of within-neighborhood edges by more than 1 and the number of within-neighborhood connected pairs of nodes with i shared partners by more than 2 (kAk∞ − 2) + 1 (i = 1, . . . , |Ak | − 2, k = 1, . . . , K). Thus, by Theorem 2, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ() > 0 and C > 0 such that ! 2CK κ() ? P θb ∈ B(θ , ) ⊆ int(Θ) ≥ 1 − 2 exp − + log d∞ 4 (2−α) d∞ kAk∞ κ()2 C K + log kAk∞ , ≥ 1 − 2 exp − kAk6∞ where we used the fact that d∞ = max1≤k≤K dim(ηk ) = kAk∞ −1 and α = 3/4. APPENDIX C: PROOFS: FINITE-GRAPH CONCENTRATION RESULTS FOR M -ESTIMATORS We prove the main concentration result of Section 5, Theorem 3 along with Corollary 3. Proof of Theorem 3. Theorem 3 can be proved along the same lines as Theorems 1 and 2. By condition [C.1], the expectation E b(X) of b(X) under the data-generating exponential-family distribution exists. Let B be the convex hull of the set {b(x) : x ∈ X} and, given any ω > 0, let ( ) K X |Ak | α 0 0 B(ω) = β ∈ B : kβ − E b(X)k2 < ω 2 k=1
0
be the subset of β ∈ B close to the expectation E b(X) of b(X) under the datagenerating exponential-family distribution in the sense that kβ 0 − E b(X)k2 < P |Ak | α ω K . Choose ω > 0 small enough so that B(2 ω) ⊆ rint(B) and let k=1 2 ( ) K X |Ak | α G(ω) = x ∈ X : kb(x) − E b(X)k2 < ω 2 k=1
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
11
be the subset of x ∈ X such that b(x) ∈ B(ω) ⊆ rint(B). We are interested in estimators of the form b ˙ θ = θ ∈ Θ : kb(x) − β(θ)k2 = inf kb(x) − β(θ)k2 , ˙ θ∈Θ
where Θ is an open subset of Rdim(θ) and the expectation β(θ) = Eθ b(X) exists for all θ ∈ Θ by condition [C.2]. In the following, we do not assume that the family of distributions parameterized by θ ∈ Θ is an exponential family and we do not assume that the set of expectations {Eθ b(X), θ ∈ Θ} covers the whole set rint(B). In other words, the family may not be able to match all possible b(x) ∈ rint(B) in the sense that there exists θ ∈ Θ such that Eθ b(X) = b(x) ∈ rint(B). The critical assumption exploited by the following proof is that the family can match the expectation E b(X) of b(X) under the data-generating exponential-family distribution in the sense that there exists one and only one element θ0 ∈ Θ such that Eθ0 b(X) = E b(X) ∈ rint(B), along with the fact that b(X) falls with high probability into a small ball centered at Eθ0 b(X) = E b(X) ∈ rint(B) under suitable conditions. We first discuss the existence of θb and then bound the probability of event θb ∈ B(θ0 , ). b We show that in the event G(ω) the set θb is non-empty, i.e., θb Existence of θ. exists. To show that the set θb is non-empty in the event G(ω), note that, for all x ∈ G(ω) and hence all b(x) ∈ B(ω) ⊆ rint(B), there exists at least one element θ ∈ Θ such that kb(x) − β(θ)k2 ≤ kb(x) − β(θ0 )k2 = kb(x) − E b(X)k2 , because, by condition [C.2], there exists θ0 ∈ Θ such that β(θ0 ) = E b(X) ∈ B(ω) ⊆ rint(B) and any minimizer of kb(x) − β(θ)k2 cannot be farther from b(x) ∈ B(ω) ⊆ rint(B) than β(θ0 ) = E b(X) ∈ B(ω) ⊆ rint(B). Therefore, the b we set θb is non-empty in the event G(ω). In addition, for each element of the set θ, have b − E b(X)k2 ≤ kb(x) − β(θ)k b 2 + kb(x) − E b(X)k2 kβ(θ) ≤ 2 kb(x) − E b(X)k2 , b ∈ B(2 ω) ⊆ rint(B); note that ω > 0 was chosen small which implies that β(θ) enough so that B(2 ω) ⊆ rint(B). Therefore, for all x ∈ G(ω) and hence all b(x) ∈ B(ω) ⊆ rint(B), the set θb contains at least one element. If the set θb contains b = more than one element, then all elements of the set θb map to expectations β(θ) Eθb b(X) ∈ B(2 ω) ⊆ rint(B) that have the same `2 -distance from b(x) ∈ B(ω) ⊆ rint(B) by construction of estimating function kb(x) − β(θ)k2 .
12
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
Bounding the probability of event θb ∈ B(θ0 , ). To bound the probability of event θb ∈ B(θ0 , ), note that, by condition [C.2], the expectation β(θ) = Eθ b(X) exists for all θ ∈ Θ and, by condition [C.3], for all > 0 small enough so that B(θ0 , ) ⊆ Θ, there exists δ() > 0 such that, for all θ ∈ Θ \ B(θ0 , ), kβ(θ) − β(θ0 )k2 = kβ(θ) − E b(X)k2 ≥ δ()
K X |Ak | α k=1
2
for some 0 ≤ α ≤ 1. Therefore, kβ(θ) − β(θ0 )k2 = kβ(θ) − E b(X)k2 < δ()
K X |Ak | α k=1
2
implies θ ∈ B(θ0 , ). To bound the probability of event θb ∈ B(θ0 , ), we exploit the fact that, for all x ∈ G(ω) and hence all b(x) ∈ B(ω) ⊆ rint(B), the set θb is non-empty and, for b each element of the set θ, b − E b(X)k2 ≤ kb(x) − β(θ)k b 2 + kb(x) − E b(X)k2 kβ(θ) ≤ 2 kb(x) − E b(X)k2 , b ∈ B(2 ω) ⊆ rint(B), as pointed out above. Thus, the which implies that β(θ) probability of event b − E b(X)k2 < δ() kβ(θ)
K X |Ak | α k=1
2
∩ G(ω)
can be bounded from below by bounding the probability of event 2 kb(x) − E b(X)k2 < δ()
K X |Ak | α k=1
2
∩ G(ω),
which, by definition of event G(ω), is equivalent to bounding the probability of event X K |Ak | α δ() ,ω . kb(x) − E b(X)k2 < min 2 2 k=1
These results show that the probability of event θb ∈ B(θ0 , ) can be bounded from
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
13
below as follows: P θb ∈ B(θ0 , ) ⊆ Θ ≥ P θb ∈ B(θ0 , ) ⊆ Θ ∩ G(ω) ! α K X |A | k b − E b(X)k2 < δ() ≥ P kβ(θ) ∩ G(ω) 2 k=1 X ! K δ() |Ak | α ≥ P kb(X) − E b(X)k2 < min ,ω . 2 2 k=1
To bound the probability of the event on the right-hand side of the inequality above, Proposition 3 can be invoked to show that there exists C > 0 such that X ! K δ() |Ak | α P kb(X) − E b(X)k2 ≥ min ,ω 2 2 k=1 ! κ()2 C K ≤ 2 exp − + log dim(b) , 4 (2−α) dim(b) kAk∞ where κ() = min(δ() / 2, ω) > 0. Collecting results shows that
P θb ∈ B(θ0 , ) ⊆ Θ
≥ 1 − 2 exp −
κ()2 C K 4 (2−α)
! + log dim(b) .
dim(b) kAk∞
Proof of Corollary 3. Corollary 3 follows from Theorem 3, because all conditions of Theorem 3 are satisfied. Condition [C.1] is satisfied, because E b(X) ∈ rint(B) follows from θ ? ∈ int(Θ? ) and because changing an edge cannot change the number of within-neighborhood edges by more than 1 and the number of within-neighborhood transitive edges by more than 2 (kAk∞ − 2) + 1. Conditions [C.2] and [C.3] follow from classic exponential-family theory [4], because θ ∈ Θ0 is the natural parameter vector and β(θ) ∈ rint(B) is the mean-value parameter vector of the misspecified exponential family. Condition [C.2] is satisfied, because the parameter space Θ0 = R2 is open, β(θ) = Eθ b(X) exists for all θ ∈ Θ0 = R2 by classic exponential-family theory [4, Theorem 2.2, pp. 34–35], and there exists a unique element θ0 ∈ Θ0 = R2 such that β(θ0 ) = Eθ0 b(X) = E b(X), because the map β : int(Θ0 ) 7→ rint(B) is one-toone by classic exponential-family theory [e.g., Theorem 3.6, 4, p. 74]. Condition [C.3] is satisfied with α = 1 by Lemma 6 in Appendix E.3 provided |Ak | ≥ 3 (k = 1, . . . , K). Since all conditions of Theorem 3 are satisfied, we can invoke Theorem 3 to conclude that, for all > 0 small enough so that B(θ0 , ) ⊆ Θ0 ,
14
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
there exist κ() > 0, C1 > 0, and C2 > 0 such that the M -estimator θb exists, is unique, and is contained in B(θ0 , ) with probability
P θb ∈ B(θ0 , ) ⊆ Θ0
≥ 1 − 2 exp −
κ()2 C1 K 4 (2−α)
! + log dim(b)
dim(b) kAk∞ κ()2 C2 K = 1 − 4 exp − , kAk4∞
where we used the fact that dim(b) = 2 and α = 1. Observe that uniqueness follows from the fact that there is a unique θ ∈ Θ0 = R2 such that Eθ b(X) = b(x) ∈ rint(B) for all possible b(x) ∈ rint(B), because θ ∈ Θ0 = R2 is the natural parameter vector and β(θ) ∈ rint(B) is the mean-value parameter vector of the misspecified exponential family and therefore the map β : int(Θ0 ) 7→ rint(B) is one-to-one [e.g., Theorem 3.6, 4, p. 74]. APPENDIX D: PROOFS: FINITE-GRAPH CONCENTRATION RESULTS FOR SUBGRAPH-TO-GRAPH ESTIMATORS We prove the main concentration result of Section 6, Theorem 4 along with Proposition 4 and Corollary 4. Proof of Proposition 4. By the factorization properties implied by local dependence, we have, for all θ ∈ Θ ⊆ {θ ∈ Rdim(θ) : ψL (η(θ)) < ∞}, all xL ∈ XL , and all K ⊆ L, X pη(θ) (xL ) xL\K ∈XL\K
=
X
exp (hη(θ), s(xL )i − ψL (η(θ))) νL (xL )
xL\K ∈XL\K
(13)
=
X
Y
exp (hη(θ), s(xA )i − ψA (η(θ))) νA (xA )
xL\K ∈XL\K A∈L
=
Y
exp (hη(θ), s(xA )i − ψA (η(θ))) νA (xA )
A∈K
= exp (hη(θ), s(xK )i − ψK (η(θ))) νK (xK ) = pη(θ) (xK ), where we exploited the fact that, by local dependence, νL (xK ) satisfies Y νL (xL ) = νA (xA ), A∈L
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
15
while ψL (η(θ)) satisfies ψL (η(θ)) =
X
ψA (η(θ))
A∈L
and ψA (η(θ)) =
X
exp (hη(θ), s(xA )i) νA (xA ), A ∈ L.
xA ∈XA
We note that the natural parameter vector η(θ) takes the form η(θ) = (η1 (θ), . . . , ηd∞ (θ)), while the sufficient statistic vector takes the form s(xK ) = (s1 (xK ), . . . , sd∞ (xK )), where the sufficient statistics si (xK ) are based on the sufficient statistics sA,i (xA ) of neighborhoods A ∈ K: X si (xK ) = sA,i (xA ), i = 1, . . . , d∞ . A∈K
In addition, note that νK (xK ) =
Y
νA (xA )
A∈K
and that ψK (η(θ)) =
X
ψA (η(θ))
A∈K
P
is finite provided ψL (η(θ)) = A∈L ψA (η(θ)) is finite. To prove that, for all xK ∈ XK and all yK ∈ YK , Pη(θ) (XK = xK , YK = yK , XL\K ∈ XL\K , YL\K ∈ YL\K ) = Pη(θ) (XK = xK , YK = yK ), observe that the independence of within- and between-neighborhood subgraphs implies that (14)
Pη(θ) (XK = xK , YK = yK , XL\K ∈ XL\K , YL\K ∈ YL\K ) = Pη(θ) (XK = xK , XL\K ∈ XL\K ) P(YK = yK , YL\K ∈ YL\K ).
The first term on the right-hand side of (14) can be dealt with by using (13), which implies that Pη(θ) (XK = xK , XL\K ∈ XL\K ) = Pη(θ) (XK = xK ).
16
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
The second term on the right-hand side of (14) can be dealt with by using the assumption that the between-neighborhood subgraphs are independent: P(YK = yK , YL\K ∈ YL\K ) = P(YK = yK ) P(YL\K ∈ YL\K ) = P(YK = yK ). Collecting results gives Pη(θ) (XK = xK , YK = yK , XL\K ∈ XL\K , YL\K ∈ YL\K ) = Pη(θ) (XK = xK ) P(YK = yK ) = Pη(θ) (XK = xK , YK = yK ). Proof of Theorem 4. By Proposition 4, we have, for all θ ∈ Θ ⊆ {θ ∈ : ψL (η(θ)) < ∞}, all K ⊆ L, and all xK ∈ XK and all yK ∈ YK ,
Rdim(θ)
Pη(θ) (XK = xK , YK = yK , XL\K ∈ XL\K , YL\K ∈ YL\K ) = Pη(θ) (XK = xK , YK = yK ) and, for all xL ∈ XL , X
pη(θ) (xL ) = pη(θ) (xK ),
xL\K ∈ XL\K
where pη(θ) (xK ) = exp (hη(θ), s(xK )i − ψK (η(θ))) νK (xK ). Here, η(θ) and s(xK ) have dimension d∞ = maxA∈L dim(ηA ). We note that ψK (η(θ)) satisfies X ψK (η(θ)) = ψA (η(θ)), A∈K
where ψA (η(θ)) =
X
exp (hη(θ), s(xA )i) νA (xA ), A ∈ K.
xA ∈XA
P In addition, note that ψ (η(θ)) = K A∈K ψA (η(θ)) is finite provided ψL (η(θ)) = P A∈L ψA (η(θ)) is finite. In other words, local dependence implies that XK is independent of XL\K and the marginal density of xK ∈ XK induced by K ⊆ L is an exponential-family density with support XK and local dependence, with natural parameter vector η(θ) and sufficient statistic vector s(xK ). As a result, Theorems 1 and 2 can be applied to the exponential family with support XK and local dependence, with natural parameter vector η(θ) and sufficient statistic vector s(xK ). Observe that the conditions of Theorem 4 ensure that all conditions of Theorems 1
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
17
and 2 are satisfied for all K ⊆ L. Therefore, for full exponential families, Theorem 1 shows that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ1 () > 0 and C1 > 0 such that θbK exists, is unique, and is contained in B(θ ? , ) with probability
P θbK ∈ B(θ ? , ) ⊆ int(Θ) ≥ 1 − 2 exp −
κ1 ()2 C1 |K| 4 (2−α)
! + log dim(θ) ,
dim(θ) kKk∞
where kKk∞ = maxA∈K |A|. For non-full, curved exponential families, Theorem 2 shows that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exist κ2 () > 0 and C2 > 0 such that θbK exists and is contained in B(θ ? , ) with probability ! κ2 ()2 C2 |K| ? b P θK ∈ B(θ , ) ⊆ int(Θ) ≥ 1 − 2 exp − + log d∞ . 4 (2−α) d∞ kKk∞ Thus, there exist κ() = min(κ1 (), κ2 ()) > 0 and C = min(C1 , C2 ) > 0 such that ! 2 C |K| κ() + log d∞ P θbK ∈ B(θ ? , ) ⊆ int(Θ) ≥ 1 − 2 exp − 4 (2−α) d∞ kKk∞ ! κ()2 C |K| + log d∞ , ≥ 1 − 2 exp − 4 (2−α) d∞ kLk∞ where we used the fact that kKk∞ = maxA∈K |A| ≤ maxA∈L |A| = kLk∞ and d∞ = maxA∈L dim(ηA ) = dim(θ) in full exponential families with natural parameter vectors of the form η(θ) = θ. Proof of Corollary 4. Corollary 4 follows from Theorem 4, because the conditions of Theorem 4 are satisfied. The proof of Corollary 4 resembles the proof of Corollary 2 and is therefore omitted. APPENDIX E: PROOFS: AUXILIARY LEMMAS We prove auxiliary lemmas that are useful for verifying the conditions of Theorem 1 (Appendix E.1), Theorem 2 (Appendix E.2), and Theorem 3 (Appendix E.3). We start with two lemmas that are useful for bounding expectations of sufficient statistics. Throughout, we write Λ(η) =
min
min
i∈Ak < j∈Ak , k=1,...,K x−i,j ∈ X−i,j
Pη (Xi,j = 1 | X−i,j = x−i,j )
18
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
and Ω(η) =
max
max
i∈Ak < j∈Ak , k=1,...,K x−i,j ∈ X−i,j
Pη (Xi,j = 1 | X−i,j = x−i,j ),
where η is the natural parameter vector of the exponential family under consideration and x−i,j ∈ X−i,j corresponds to x ∈ X excluding xi,j ∈ Xi,j (i ∈ Ak < j ∈ Ak , k = 1, . . . , K). Let f : X 7→ R be a function of the random graph. We denote by EΛ(η) f (X) and EΩ(η) f (X) the expectations of f : X 7→ R under the Bernoulli random graph model with probabilities Λ(θ) and Ω(θ), respectively, provided EΛ(η) f (X) and EΩ(η) f (X) exist. Here, the Bernoulli random graph model refers to the exponential-family random graph model which assumes that the edge variables Xi,j are independent Bernoulli random variables with probabilities Λ(θ) and Ω(θ), respectively. Lemma 1. Consider a full or non-full, canonical or curved exponential family with PK |Ak | support X = {0, 1} k=1 ( 2 ) and local dependence and natural parameter vector η ∈ int(N). Then there exists C(η) ∈ [Λ(η), Ω(η)] such that K K X X X |Ak | , Eη Xi,j = C(η) 2 k=1 i∈Ak < j∈Ak
k=1
where C(η) ∈ [Λ(η), Ω(η)] denotes the probability of an edge under the Bernoulli(C(η)) random graph model. In addition, if f : X 7→ R is a function of the random graph that is non-decreasing in the number of edges, then EΛ(η) f (X) ≤ Eη f (X) ≤ EΩ(η) f (X), where EΛ(η) f (X) and EΩ(η) f (X) are the expectations of f : X 7→ R under the Bernoulli random graph model with probabilities Λ(θ) and Ω(θ), respectively. If f : X 7→ R is a function of the random graph that is non-increasing in the number of edges, then the inequalities are reversed. Proof of Lemma 1. To ease the presentation, we consider K = 1 neighborhood and drop the subscript k from all neighborhood-dependent quantities. The extension to K ≥ 2 neighborhoods is straightforward. Observe that Eη Xi,j = Pη (Xi,j = 1) can be written as X Pη (Xi,j = 1) = Pη (Xi,j = 1 | X−i,j = x−i,j ) Pη (X−i,j = x−i,j ), x−i,j ∈ X−i,j
which implies that Eη Xi,j is bounded below by X Λ(η) Pη (X−i,j = x−i,j ) = Λ(η) x−i,j ∈ X−i,j
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
and bounded above by X
19
Ω(η) Pη (X−i,j = x−i,j ) = Ω(η).
x−i,j ∈ X−i,j
Since the expectation Eη Xi,j is contained in the convex set [Λ(η), Ω(η)], there exists C(η) ∈ [Λ(η), Ω(η)] such that X X |A| Eη Xi,j = Eη Xi,j = C(η) . 2 i∈A < j∈A
i∈A < j∈A
In addition, Butts [5, Corollary 1, p. 306] proved that, if f : X 7→ R is a function of the random graph that is non-decreasing in the number of edges, then EΛ(η) f (X) ≤ Eη f (X) ≤ EΩ(η) f (X), where the inequalities are reversed when f : X 7→ R is a function of the random graph that is non-increasing in the number of edges. The result follows from the recursive factorization of the probability mass function Pη (X = x). Denote the elements of the sequence of within-neighborhood edge variables X by X 1 , . . . , Xm and the corresponding sample spaces by X1 , . . . , Xm , where m = |A| 2 . Then, for all x ∈ X, we have Pη (X = x) =
m Y
Pη (Xi = xi | Xj = xj , j = 1, . . . , i − 1),
i=1
where the conditional probabilities Pη (Xi = 1 | Xj = xj , j = 1, . . . , i − 1) are bounded by Λ(η) =
min
min
1 ≤ i ≤ m x−i ∈ X−i
Pη (Xi = 1 | X−i = x−i )
≤ Pη (Xi = 1 | Xj = xj , j = 1, . . . , i − 1) ≤
max
max Pη (Xi = 1 | X−i = x−i )
1 ≤ i ≤ m x−i ∈ X−i
= Ω(η),
where x−i ∈ X−i corresponds to x ∈ X excluding xi ∈ Xi (i = 1, . . . , m). Therefore, Pη (X = x) can be bounded below and above by the probability mass function of the Bernoulli random graph model with probabilities Λ(η) and Ω(η), respectively. As a result, the cumulative distribution function and expectation of any function f : X 7→ R of the random graph that is non-decreasing in the number of edges can be bounded by the corresponding cumulative distribution functions and expectations of the Bernoulli random graph model with probabilities Λ(η) and Ω(η) [5, Corollary 1, p. 306]: EΛ(η) f (X) ≤ Eη f (X) ≤ EΩ(η) f (X),
20
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where the inequalities are reversed when f : X 7→ R is a function of the random graph that is non-increasing in the number of edges. Lemma 2. Consider a full or non-full, canonical or curved exponential family with PK |Ak | support X = {0, 1} k=1 ( 2 ) and local dependence and natural parameter vector η ∈ int(N). Let f : X 7→ R be the number of within-neighborhood transitive edges defined by f (x) =
K X
X
xi,j
k=1 i∈Ak < j∈Ak
max
h∈Ak , h 6= i,j
xi,h xj,h , x ∈ X,
which may or may not be a sufficient statistic of the exponential family under consideration. If |Ak | ≥ 3 (k = 1, . . . , K), then there exists C(η) > 0 such that K X |Ak | , Eη f (X) = C(η) 2 k=1
where C(η) satisfies, for some λ ∈ (0, 1), 0 < λ Λ(η)3 + (1 − λ) Ω(η)3 ≤ C(η) ≤ λ Λ(η) + (1 − λ) Ω(η) < 1. Proof of Lemma 2. To ease the presentation, we consider K = 1 neighborhood and drop the subscript k from all neighborhood-dependent quantities. The extension to K ≥ 2 neighborhoods is straightforward. Since the number of transitive edges f : X 7→ R is a function of the random graph that is non-decreasing in the number of edges, Lemma 1 implies that the expectation Eη f (X) can be bounded by the expectations EΛ(η) f (X) and EΩ(η) f (X) under the Bernoulli random graph model with edge probabilities Λ(η) and Ω(η), respectively: EΛ(η) f (X) ≤ Eη f (X) ≤ EΩ(η) f (X), where Λ(η) and Ω(η) are defined in the introduction of Appendix E and all expectations exist, because the sample space X is finite. The expectation EΛ(η) f (X) can be written as X EΛ(η) f (X) = EΛ(η) Xi,j max Xi,h Xj,h h∈A, h 6= i,j
i∈A < j∈A
= EΛ(η)
X
Xi,j
1
=
i∈A < j∈A
EΛ(η) Xi,j
Xi,h Xj,h ≥ 1
h∈A, h 6= i,j
i∈A < j∈A
X
X
1
X
h∈A, h 6= i,j
Xi,h Xj,h ≥ 1 ,
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
21
P where 1( h∈A, h 6= i,j xi,h xj,h ≥ 1) is an indicator function, which is 1 if nodes i and j have one or more shared partners in neighborhood A and is 0 otherwise. Here, we exploited the fact that the number of transitive edges in neighborhood A is equal to the number of pairs of nodes with one P or more edgewise shared partners. To evaluate the expectation EΛ(η) [Xi,j 1( h∈A, h 6= i,j Xi,h Xj,h ≥ 1)], we take advantage of the independence of edge variables under the Bernoulli(Λ(η)) random graph model, which implies that X EΛ(η) Xi,j 1 Xi,h Xj,h ≥ 1 h∈A, h 6= i,j
= EΛ(η) (Xi,j ) EΛ(η) 1
X
Xi,h Xj,h ≥ 1
h∈A, h 6= i,j
= Λ(η) PΛ(η)
X
Xi,h Xj,h ≥ 1 .
h∈A, h 6= i,j
In addition, the independence of edge variables under the Bernoulli(Λ(η)) random graph model P implies that, for all i ∈ A and all j ∈ A such that i 6= j, the distribution of h∈A, h 6= i,j Xi,h Xj,h is Binomial(|A| − 2, Λ(η)2 ). Therefore, X PΛ(η) Xi,h Xj,h ≥ 1 = 1 − (1 − Λ(η)2 )|A|−2 , h∈A, h 6= i,j
which implies that EΛ(η) f (X) = Λ(η) [1 − (1 −
Λ(η)2 )|A|−2 ]
|A| . 2
Along the same lines, it can be shown that the expectation EΩ(η) f (X) is given by |A| EΩ(η) f (X) = Ω(η) [1 − (1 − Ω(η)2 )|A|−2 ] . 2 Let C1 (η) = Λ(η) [1−(1−Λ(η)2 )|A|−2 ] and C2 (η) = Ω(η) [1−(1−Ω(η)2 )|A|−2 ]. Observe that Λ(η) ∈ (0, 1) implies that C1 (η) is bounded below by Λ(η)3 and bounded above by Λ(η) for all |A| ≥ 3. Likewise, C2 (η) is bounded below by Ω(η)3 and bounded above by Ω(η) for all |A| ≥ 3. Since the expectation Eη f (X) |A| is contained in the convex set [C1 (η) |A| 2 , C2 (η) 2 ], there exists λ ∈ (0, 1) such that |A| |A| |A| Eη f (X) = λ C1 (η) + (1 − λ) C2 (η) = C(η) , 2 2 2
22
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where C(η) satisfies 0 < λ Λ(η)3 + (1 − λ) Ω(η)3 ≤ C(η) ≤ λ Λ(η) + (1 − λ) Ω(η) < 1. E.1. Auxiliary lemmas: Theorem 1. We prove auxiliary lemmas that are useful for verifying the conditions of Theorem 1. Lemma 3. Consider an exponential-family random graph of fixed and finite size with within-neighborhood edge and transitive edge terms as defined in Section 4.1. Let Θ = R2 and assume that θ ? ∈ int(Θ). Then, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists δ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), kµ(η(θ ? ))
− µ(η(θ))k2
K X |Ak | ≥ δ() , 2 k=1
provided |Ak | ≥ 3 (k = 1, . . . , K). Therefore, identifiability condition (4) of Theorem 1 is satisfied with α = 1 provided |Ak | ≥ 3 (k = 1, . . . , K). Proof of Lemma 3. To ease the presentation, we consider K = 1 neighborhood and drop the subscript k from all neighborhood-dependent quantities. The extension to K ≥ 2 neighborhoods is straightforward. In the following, we write µ(η(θ)) = µ(θ) = Eθ s(X) and ψ(η(θ)) = ψ(θ), because the natural parameter vector of the exponential family is given by η(θ) = θ ∈ Θ = R2 . To verify identifiability condition (4) of Theorem 1, we need to verify that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists δ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), α |A| ? kµ(θ ) − µ(θ)k2 ≥ δ() for some 0 ≤ α ≤ 1. 2 To do so, observe that kµ(θ ? ) − µ(θ)k2 =
p (µ1 (θ ? ) − µ1 (θ))2 + (µ2 (θ ? ) − µ2 (θ))2 .
The deviation |µ1 (θ ? ) − µ1 (θ)| can be dealt with by using Lemma 1, which shows that there exists C1 (θ) ∈ [Λ(θ), Ω(θ)] such that X |A| µ1 (θ) = Eθ Xi,j = C1 (θ) , 2 i∈A < j∈A
where Λ(θ) and Ω(θ) are defined in the introduction of Appendix E. Here, Λ(θ) is given by • Case θ ∈ R × R− : Λ(θ) = [1 + exp(−θ1 − θ2 (2 |A| − 3))]−1 > 0;
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
23
• Case θ ∈ R × R+ ∪ {0}: Λ(θ) = [1 + exp(−θ1 )]−1 > 0; while Ω(θ) satisfies 0 < Λ(θ) ≤ Ω(θ) < 1 for all θ ∈ R2 ; note that |A| is fixed and finite since the size of the random graph is fixed and finite. Therefore, the deviation |µ1 (θ ? ) − µ1 (θ)| satisfies |A| ? ? |µ1 (θ ) − µ1 (θ)| = |C1 (θ ) − C1 (θ)| . 2 To deal with the deviation |µ2 (θ ? )−µ2 (θ)|, observe that, by Lemma 2, there exists C2 (θ) > 0 such that X |A| µ2 (θ) = Eθ Xi,j max Xi,h Xj,h = C2 (θ) , h∈A, h6=i,j 2 i∈A < j∈A
where C2 (θ) satisfies λ Λ(θ)3 + (1 − λ) Ω(θ)3 ≤ C2 (θ) ≤ λ Λ(θ) + (1 − λ) Ω(θ), 0 < λ < 1. As a result, |µ2
(θ ? )
− µ2 (θ)| = |C2
(θ ? )
|A| − C2 (θ)| . 2
We can hence conclude that, for all θ ∈ Θ \ B(θ ? , ), there exists a function C(θ, θ ? ) > 0 such that |A| ? ? kµ(θ ) − µ(θ)k2 = C(θ , θ) > 0, 2 where strict positivity follows from the fact that the map µ : Θ 7→ rint(M) is oneto-one by classic exponential-family theory [4, Theorem 3.6, p. 74], which implies that kµ(θ ? ) − µ(θ)k2 > 0. Last, but not least, we show that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists δ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have kµ(θ ? ) − µ(θ)k2 ≥ δ() |A| > 0. To do so, note that η(θ) = θ is the natural parameter 2 vector of the exponential family. Thus, by classic exponential-family theory, ψ(θ) is a strictly convex function of the natural parameter vector θ on the interior of the natural parameter space {θ ∈ Rdim(θ) : ψ(θ) < ∞} = R2 [4, Theorem 1.13, p. 19]. Therefore, the gradient ∇θ ψ(θ) = µ(θ) is a strictly increasing function of θ and the smallest value of kµ(θ ? ) − µ(θ)k2 , considered as a function of θ for fixed θ ? ∈ int(Θ), is attained on the boundary of ball B(θ ? , ) ⊆ int(Θ).
24
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
As a result, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists δ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), |A| ? kµ(θ ) − µ(θ)k2 ≥ δ() , 2 provided |A| ≥ 3. Therefore, identifiability condition (4) of Theorem 1 is satisfied with α = 1 provided |A| ≥ 3. E.2. Auxiliary lemmas: Theorem 2. We prove auxiliary lemmas that are useful for verifying the conditions of Theorem 2. Lemma 4. Consider a curved exponential-family random graph of fixed and finite size with within-neighborhood edge and geometrically weighted edgewise shared partner terms as defined in Section 4.2. Let Θ = R×(0, 1). Then the map η : Θ 7→ int(N) is one-to-one and, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()) provided |Ak | ≥ 4 (k = 1, . . . , K). Proof of Lemma 4. It is straightforward to show that the map η : Θ 7→ int(N) is one-to-one and that, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), v u d∞ uX ? kη(θ) − η(θ )k2 = t (ηi (θ) − ηi (θ ? ))2 ≥ γ(). i=1
To see that, note that d∞ = kAk∞ − 1 ≥ 3 since |Ak | ≥ 4 (k = 1, . . . , K). Therefore, v v u d∞ u 3 uX uX ? 2 kη(θ) − η(θ ? )k2 = t (ηi (θ) − ηi (θ )) ≥ t (ηi (θ) − ηi (θ ? ))2 , i=1
i=1
where η1 (θ) = θ1 η2 (θ) = 1 η3 (θ) = 2 − θ2 . Thus, for all θ ?
∈ int(Θ) = R×(0, 1) and all δ ∈ R2 such that θ ? +δ ∈ int(Θ) =
R × (0, 1), (η1 (θ ? + δ) − η1 (θ ? ))2 = (θ1? + δ1 − θ1? )2 = δ12 (η2 (θ ? + δ) − η2 (θ ? ))2 = 0 (η3 (θ ? + δ) − η3 (θ ? ))2 = (2 − θ2? − δ2 − 2 + θ2? )2 = δ22 ,
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
25
which implies that v u 3 uX p ? ? kη(θ + δ) − η(θ )k2 ≥ t (ηi (θ ? + δ) − ηi (θ ? ))2 = δ12 + δ22 = kδk2 . i=1
Therefore, the `2 -distance of η(θ ? + δ) from η(θ ? ) ∈ int(N) in the natural parameter space N is a strictly increasing function of the `2 -distance = kδk2 of θ ? + δ from θ ? ∈ int(Θ) in the parameter space Θ. As a result, for all > 0, there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), v u d∞ uX ? kη(θ ) − η(θ)k2 = t (ηi (θ ? ) − ηi (θ))2 ≥ γ(). i=1
Lemma 5. Consider a curved exponential-family random graph of fixed and finite size with within-neighborhood edge and geometrically weighted edgewise shared partner terms as defined in Section 4.2. Let Θ = R × (0, 1) and assume that θ ? ∈ int(Θ). Then, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()). In addition, there exists δ() > 0 such that, for all η(θ) ∈ N \ B(η(θ ? ), γ()), kµ(η(θ ? ))
− µ(η(θ))k2 ≥ δ()
K X |Ak | 3/4 k=1
2
,
provided |Ak | ≥ 4 (k = 1, . . . , K). Therefore, identifiability condition (6) of Theorem 2 is satisfied with α = 3/4 provided |Ak | ≥ 4 (k = 1, . . . , K). Proof of Lemma 5. To ease the presentation, we consider K = 1 neighborhood and drop the subscript k from all neighborhood-dependent quantities. The extension to K ≥ 2 neighborhoods is straightforward. By Lemma 4, for all > 0 small enough so that B(θ ? , ) ⊆ int(Θ), there exists γ() > 0 such that, for all θ ∈ Θ \ B(θ ? , ), we have η(θ) ∈ N \ B(η(θ ? ), γ()). Therefore, it is enough to show that there exists δ() > 0 such that, for all η(θ) ∈ N \ B(η(θ ? ), γ()), α |A| for some 0 ≤ α ≤ 1. kµ(η(θ ? )) − µ(η(θ))k2 ≥ δ() 2 To do so, observe that kµ(η(θ ? )) − µ(η(θ))k2 v u |A|−1 u X = t(µ1 (η(θ ? )) − µ1 (η(θ)))2 + (µi (η(θ ? )) − µi (η(θ)))2 > 0, i=2
26
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
where strict positivity follows from the fact that the maps η : Θ 7→ int(N) and µ : int(N) 7→ rint(M) are one-to-one and hence θ 6= θ ? implies kµ(η(θ ? )) − µ(η(θ))k2 > 0; note that the map η : Θ 7→ int(N) is one-to-one by Lemma 4 while the map µ : int(N) 7→ rint(M) is one-to-one by classic exponential-family theory [4, Theorem 3.6, p. 74]. We distinguish two cases, the case µ1 (η(θ ? )) 6= µ1 (η(θ)) and the case µ1 (η(θ ? )) = µ1 (η(θ)). We note that θ 6= θ ? and µ1 (η(θ ? )) = µ1 (η(θ)) imP|A|−1 ply i=2 (µi (η(θ ? )) − µi (η(θ)))2 > 0, because θ 6= θ ? implies kµ(η(θ ? )) − µ(η(θ))k2 > 0. Case µ1 (η(θ ? )) 6= µ1 (η(θ)). We have kµ(η(θ ? )) − µ(η(θ))k2 ≥ |µ1 (η(θ ? )) − µ1 (η(θ))| > 0. Observe that µ1 (η(θ)) = Eη(θ) s1 (X) is the expected number of edges in neighborhood A. Therefore, Lemma 1 can be invoked to show that there exist C1 (η(θ ? )) > 0 and C1 (η(θ)) > 0 such that |A| ? ? |µ1 (η(θ )) − µ1 (η(θ))| = |C1 (η(θ )) − C1 (η(θ))| , 2 where Λ(η(θ ? )) ≤ C1 (η(θ ? )) ≤ Ω(η(θ ? )) and Λ(η(θ)) ≤ C1 (η(θ)) ≤ Ω(η(θ)) for all θ ∈ int(Θ) = R × (0, 1). We note that Λ(η(θ)) and Ω(η(θ)) are defined in the introduction of Appendix E and that Λ(η(θ)) ≥ [1 + exp(−θ1 )]−1 > 0 while Ω(η(θ)) satisfies 0 < Λ(η(θ)) ≤ Ω(η(θ)) < 1 for all θ ∈ int(Θ) = R × (0, 1). Case µ1 (η(θ ? )) = µ1 (η(θ)). As pointed out above, θ 6= θ ? and µ1 (η(θ ? )) = P|A|−1 ? 2 µ1 (η(θ)) imply > 0, where i=2 (µi (η(θ )) − µi (η(θ))) µi (η(θ)) = Eη(θ) si (X) is the expected number of pairs of nodes with i − 1 edgewise shared partners in neighborhood A (i = 2, . . . , |A| − 1). Bounding the P|A|−1 term i=2 (µi (η(θ ? )) − µi (η(θ)))2 is more challenging than bounding the term (µ1 (η(θ ? )) − µ1 (η(θ)))2 , because the numbers of pairs of nodes with i − 1 edgewise shared partners are neither non-decreasing nor non-increasing functions of the number of edges (i = 2, . . . , |A| − 1). Therefore, Lemma 1 cannot be applied to the expectations µi (η(θ)) = Eη(θ) si (X) (i = 2, . . . , |A| − 1). But it turns out P|A|−1 to be possible to bound i=2 (µi (η(θ ? )) − µi (η(θ)))2 from below in terms of absolute deviations of the expected numbers of transitive edges under η(θ ? ) and η(θ). The advantage of doing so is that the number of transitive edges is a nondecreasing function of the number of edges and hence Lemma 1 can be applied via Lemma 2. To see that the expected numbers of pairs of nodes with one or more edgewise shared partners is related to the expected number of transitive edges, note
27
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
that the expected number of transitive edges in neighborhood A can be written as X Eη(θ) Xa,b max Xa,c Xb,c c∈A, c 6= a,b
a∈A < b∈A
|A|−1
= Eη(θ)
X
X
i=2
a∈A < b∈A
Xa,b
=
i=2
Xa,c Xb,c = i − 1
c∈A, c 6= a,b
|A|−1
X
1
X
X
Eη(θ)
Xa,b
a∈A < b∈A
1
X
Xa,c Xb,c = i − 1
c∈A, c 6= a,b
|A|−1
=
X
µi (η(θ)),
i=2
P|A|−1 which shows that i=2 µi (η(θ)) is equal to the expected number of transiP|A|−1 tive edges under η(θ). To bound i=2 (µi (η(θ ? )) − µi (η(θ)))2 from below in terms of absolute deviations of the expected numbers of transitive edges under η(θ ? ) and η(θ), it is convenient to operate in the natural parameter space N = R|A|−1 of the full exponential family with natural parameter vector η and sufficient statistic vector s(x). Write η ≡ η(θ) and η ? ≡ η(θ ? ). Then, to bound P|A|−1 the term i=2 (µi (η ? ) − µi (η))2 from below in terms of absolute deviations of the expected numbers of transitive edges under η ? and η while avoiding the trivial lower bound of 0, we note that it is possible to choose η˙ ∈ int(N) such that ˙ 2 kµ(η ? ) − µ(η)k2 ≥ kµ(η ? ) − µ(η)k and such that the expected number of transitive edges under η˙ is not identical to the expected number of transitive edges under η ? . To see that, note that |A|−1
kµ(η)k1 =
X i=1
|A|−1
|µi (η)| = µ1 (η) +
X
µi (η),
i=2
P|A|−1 where µ1 (η) is the expected number of edges while i=2 µi (η) is the expected number of transitive edges under η. Therefore, if µ1 (η ? ) = µ1 (η) and kµ(η)k1 = kµ(η ? )k1 , then the expected numbers of transitive edges under η and η ? are identical. Let T = {µ0 ∈ rint(M) : µ01 = µ1 (η ? ), kµ0 k1 = kµ(η ? )k1 } be the set of mean-value parameter vectors for which the expected numbers of edges and the expected numbers of transitive edges are identical. To see that it
28
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
˙ = µ1 (η ? ) and kµ(η)k ˙ 1 6= is possible to choose η˙ ∈ int(N) such that µ1 (η) ? kµ(η )k1 , note that the set T is a subset of the boundary of the `1 -ball with center 0 ∈ R|A|−1 and radius kµ(η ? )k1 . Suppose that the `2 -distance of µ(η) from µ(η ? ) is equal to ρ1 > 0; note that the `2 -distance is strictly positive because η 6= η ? implies µ(η) 6= µ(η ? ). Observe that the `2 -ball B(µ(η ? ), ρ1 ) with center µ(η ? ) ∈ rint(M) and radius ρ1 > 0 need not be contained in rint(M), but, owing to the fact that the set M is convex by construction, it is possible to construct a smaller `2 -ball B(µ(η ? ), ρ2 ) with the same center µ(η ? ) ∈ rint(M) but smaller radius 0 < ρ2 < ρ1 such that the resulting `2 -ball B(µ(η ? ), ρ2 ) is contained in rint(M). The `1 -ball with center 0 ∈ R|A|−1 and radius kµ(η ? )k1 and the `2 -ball B(µ(η ? ), ρ2 ) ⊆ rint(M) with center µ(η ? ) ∈ rint(M) and radius ρ2 intersect, but it is not too hard to see that it is possible to choose µ˙ ∈ rint(M) ˙ 1 6= inside the `2 -ball B(µ(η ? ), ρ2 ) ⊆ rint(M) such that µ˙ 1 = µ1 (η ? ) and kµk ? kµ(η )k1 . In addition, due to the fact that the map µ : int(N) 7→ rint(M) is a homeomorphism [4, Theorem 3.6, p. 74], there exists η˙ ∈ int(N) such that ˙ = µ˙ ∈ B(µ(η ? ), ρ2 ) ⊆ rint(M), which implies that µ1 (η) ˙ = µ1 (η ? ) µ(η) ? ˙ 1 6= kµ(η )k1 . In conclusion, there exists η˙ ∈ int(N) such that and kµ(η)k ˙ = µ1 (η ? ) and kµ(η)k ˙ 1 6= kµ(η ? )k1 and µ1 (η) |A|−1
kµ(η ? )
− µ(η)k2 ≥
kµ(η ? )
˙ 2 = − µ(η)k
X
˙ 2 > 0. (µi (η ? ) − µi (η))
i=2
kµ(η ? )
˙ 2 from below in terms of absoAs a consequence, we can bound − µ(η)k lute deviations of the expected numbers of transitive edges under η ? and η˙ while avoiding the trivial lower bound of 0. To do so, we first use the Cauchy-Schwarz inequality to obtain v u |A|−1 u X t ? ? 2 ˙ ˙ ˙ 2 kµ(η ) − µ(η)k2 = (µ1 (η ) − µ1 (η)) + (µi (η ? ) − µi (η)) i=2
v u|A|−1 |A|−1 uX X 1 ? 2 t ˙ ˙ |µi (η ? ) − µi (η))| ≥ (µi (η ) − µi (η)) ≥ p |A| − 2 i=2 i=2 and then use the triangle inequality to obtain |A|−1 |A|−1 X X ? ? ˙ ˙ |µi (η ) − µi (η)| ≥ (µi (η ) − µi (η)) i=2 i=2 |A|−1 |A|−1 X X ? ˙ . µi (η ) − µi (η) = i=2 i=2
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
29
P|A|−1 P|A|−1 ˙ are equal to the expected number of The sums i=2 µi (η ? ) and i=2 µi (η) ˙ respectively, as shown above. Lemma 2 can hence transitive edges under η ? and η, be applied to show that, for all η ∈ int(N), there exists C2 (η) > 0 such that |A|−1
X
µi (η) = Eη
i=2
X a∈A < b∈A
Xa,b
max
c∈A, c 6= a,b
Xa,c Xb,c
|A| = C2 (η) , 2
where C2 (η) satisfies λ Λ(η)3 + (1 − λ) Ω(η)3 ≤ C2 (θ) ≤ λ Λ(η) + (1 − λ) Ω(η), provided |A| ≥ 3. Collecting results shows that v u |A|−1 u X t ? ? 2 ˙ ˙ 2 ˙ 2 = (µ1 (η ) − µ1 (η)) + (µi (η ? ) − µi (η)) kµ(η ) − µ(η)k i=2
≥
= ≥ =
|A|−1 |A|−1 X X 1 ? ˙ p µi (η(θ )) − µi (η(θ)) |A| − 2 i=2 i=2 |A| 1 p ˙ |C2 (η ? ) − C2 (η)| 2 |A| − 2 1 |A| ? ˙ |C2 (η ) − C2 (η)| 1/4 1/4 2 |A| (|A| − 1) 3/4 1 |A| ˙ |C2 (η ? ) − C2 (η)| , 1/4 2 2
˙ > 0 by the choice of η, ˙ as explained above. where |C2 (η ? ) − C2 (η)| We can therefore conclude that, for all θ ∈ Θ \ B(θ ? , ), there exists a function C(θ ? , θ) > 0 such that kµ(η(θ ? ))
− µ(η(θ))k2 ≥
C(θ ? , θ)
3/4 |A| . 2
Last, but not least, we show that there exists δ() > 0 such that, for all η(θ) ∈ 3/4 N \ B(η(θ ? ), γ()), we have kµ(η(θ ? )) − µ(η(θ))k2 ≥ δ() |A| > 0. To do 2 so, it is convenient to operate in the natural parameter space N = R|A|−1 of the full exponential family with natural parameter vector η and sufficient statistic vector s(x). Write η ≡ η(θ) and η ? ≡ η(θ ? ). By classic exponential-family theory, ψ(η) is a strictly convex function of the natural parameter vector η on the interior int(N) of the natural parameter space N, which is a convex set [4, Theorem 1.13,
30
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
p. 19]. Thus, the gradient ∇η ψ(η) = µ(η) is a strictly increasing function of the natural parameter vector η and kµ(η ? ) − µ(η)k2 , considered as a function of η for fixed η ? ∈ int(N), attains its minimum on the boundary of the ball B(η ? , γ()). Therefore, kµ(η ? ) − µ(η)k2 is bounded away from 0 by a function of > 0 for all η ∈ N \ B(η ? , γ()). Taken together, these results show that there exists δ() > 0 such that, for all η(θ) ∈ N \ B(η(θ ? ), γ()), 3/4 |A| ? kµ(η(θ )) − µ(η(θ))k2 ≥ δ() , 2 provided |A| ≥ 4. Therefore, identifiability condition (6) of Theorem 2 is satisfied with α = 3/4 provided |A| ≥ 4. E.3. Auxiliary lemmas: Theorem 3. We prove an auxiliary lemma that is useful for verifying the conditions of Theorem 3. Lemma 6. Consider an exponential-family random graph of fixed and finite size with within-neighborhood edge, transitive edge, and same-attribute edge terms as defined in Section 5.2. Then identifiability condition [C.3] of Theorem 3 is satisfied with α = 1 provided |Ak | ≥ 3 (k = 1, . . . , K). Proof of Lemma 6. To verify identifiability condition [C.3] of Theorem 3, we need to verify that, for all > 0 small enough so that B(θ0 , ) ⊆ Θ0 , there exist δ() > 0 such that, for all θ ∈ Θ0 \ B(θ0 , ), kEθ? b(X) − Eθ b(X)k2 ≥ δ()
K X |Ak | α k=1
2
for some 0 ≤ α ≤ 1,
where Eθ? b(X) is the expectation of b(X) under the data-generating exponentialfamily distribution with natural parameter vector θ ? ∈ Θ? = R3 and Eθ b(X) is the expectation of b(X) under the misspecified exponential-family distribution with natural parameter vector θ ∈ Θ0 = R2 . Since the expectations Eθ? b(X) and Eθ b(X) are expectations of the number of within-neighborhood edges and transitive edges, an argument along the lines of Lemma 3 shows that, for all > 0 small enough so that B(θ0 , ) ⊆ Θ0 , there exist δ() > 0 such that, for all θ ∈ Θ0 \ B(θ ? , ), kEθ? b(X) − Eθ b(X)k2 ≥ δ()
K X |Ak | k=1
2
,
provided |Ak | ≥ 3 (k = 1, . . . , K). Therefore, identifiability condition [C.3] of Theorem 3 is satisfied with α = 1 provided |Ak | ≥ 3 (k = 1, . . . , K).
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
31
REFERENCES [1] Bhamidi, S., Bresler, G., and Sly, A. (2011), “Mixing time of exponential random graphs,” The Annals of Applied Probability, 21, 2146–2170. [2] Bhattacharya, B. B., and Mukherjee, S. (2018), “Inference in Ising models,” Bernoulli, 24, 493–525. [3] Boucheron, S., Lugosi, G., and Massart, P. (2013), Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford: Oxford University Press. [4] Brown, L. (1986), Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory, Hayworth, CA, USA: Institute of Mathematical Statistics. [5] Butts, C. T. (2011), “Bernoulli graph bounds for general random graph models,” Sociological Methodology, 41, 299–345. [6] Butts, C. T., and Almquist, Z. W. (2015), “A Flexible Parameterization for Baseline Mean Degree in Multiple-Network ERGMs,” Journal of Mathematical Sociology, 39, 163–167. [7] Chatterjee, S. (2007), “Estimation in spin glasses: A first step,” The Annals of Statistics, 35, 1931–1946. [8] Chatterjee, S., and Diaconis, P. (2013), “Estimating and understanding exponential random graph models,” The Annals of Statistics, 41, 2428–2461. [9] Chung, F., and Lu, L. (2006), Complex Graphs and Networks, Providence: American Mathematical Society. [10] Crane, H., and Dempsey, W. (2015), “A framework for statistical network modeling,” Available at https://arxiv.org/abs/1509.08185.v4. [11] Diaconis, P., Chatterjee, S., and Sly, A. (2011), “Random graphs with a given degree sequence,” The Annals of Applied Probability, 21, 1400–1435. [12] Efron, B. (1978), “The geometry of exponential families,” The Annals of Statistics, 6, 362–376. [13] Frank, O., and Strauss, D. (1986), “Markov graphs,” Journal of the American Statistical Association, 81, 832–842. [14] Handcock, M. S. (2003), “Statistical Models for Social Networks: Inference and Degeneracy,” in Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers, eds. Breiger, R., Carley, K., and Pattison, P., Washington, D.C.: National Academies Press, pp. 1–12. [15] Handcock, M. S., and Gile, K. (2010), “Modeling Social Networks from Sampled Data,” The Annals of Applied Statistics, 4, 5–25. [16] Holland, P. W., and Leinhardt, S. (1976), “Local Structure in Social Networks,” Sociological Methodology, 1–45. [17] Hollway, J., and Koskinen, J. (2016), “Multilevel embeddedness: The case of the global fisheries governance complex,” Social Networks, 44, 281–294. [18] Hollway, J., Lomi, A., Pallotti, F., and Stadtfeld, C. (2017), “Multilevel social spaces: The network dynamics of organizational fields,” Network Science, 5, 187–212. [19] Hunter, D. R., and Handcock, M. S. (2006), “Inference in curved exponential family models for networks,” Journal of Computational and Graphical Statistics, 15, 565–583. [20] Hunter, D. R., Krivitsky, P. N., and Schweinberger, M. (2012), “Computational statistical methods for social network models,” Journal of Computational and Graphical Statistics, 21, 856– 882. [21] Janson, S., and Rucinski, A. (2002), “The infamous upper tail,” Random Structures and Algorithms, 20, 317–342. [22] Jonasson, J. (1999), “The random triangle model,” Journal of Applied Probability, 36, 852–876. [23] Kass, R., and Vos, P. (1997), Geometrical foundations of asymptotic inference, New York: Wiley. [24] Kim, J. H., and Vu, V. H. (2004), “Divide and conquer martingales and the number of triangles
32
MICHAEL SCHWEINBERGER AND JONATHAN STEWART
in a random graph,” Random Structures & Algorithms, 24, 166–174. [25] Kolaczyk, E. D. (2009), Statistical Analysis of Network Data: Methods and Models, New York: Springer-Verlag. [26] Kontorovich, L., and Ramanan, K. (2008), “Concentration inequalities for dependent random variables via the martingale method,” The Annals of Probability, 36, 2126–2158. [27] Krivitsky, P. N. (2012), “Exponential-family models for valued networks,” Electronic Journal of Statistics, 6, 1100–1128. [28] Krivitsky, P. N., Handcock, M. S., and Morris, M. (2011), “Adjusting for network size and composition effects in exponential-family random graph models,” Statistical Methodology, 8, 319–339. [29] Krivitsky, P. N., and Kolaczyk, E. D. (2015), “On the question of effective sample size in network modeling: An asymptotic inquiry,” Statistical Science, 30, 184–198. [30] Lauritzen, S., Rinaldo, A., and Sadeghi, K. (2017), “Random networks, graphical models, and exchangeability,” Available at https://arxiv:1701.08420v1. [31] Lazega, E., and Snijders, T. A. B. (eds.) (2016), Multilevel Network Analysis for the Social Sciences, Switzerland: Springer-Verlag. [32] Lomi, A., Robins, G., and Tranmer, M. (2016), “Introduction to multilevel social networks,” Social Networks, 266–268. [33] Lusher, D., Koskinen, J., and Robins, G. (2013), Exponential Random Graph Models for Social Networks, Cambridge, UK: Cambridge University Press. [34] Mukherjee, S. (2013), “Consistent estimation in the two star exponential random graph model,” Tech. rep., Department of Statistics, Columbia University, arXiv:1310.4526v1. [35] Nowicki, K., and Snijders, T. A. B. (2001), “Estimation and prediction for stochastic blockstructures,” Journal of the American Statistical Association, 96, 1077–1087. [36] Ravikumar, P., Wainwright, M. J., and Lafferty, J. (2010), “High-dimensional Ising model selection using `1 -regularized logistic regression,” The Annals of Statistics, 38, 1287–1319. [37] Rinaldo, A., Petrovic, S., and Fienberg, S. E. (2013), “Maximum likelihood estimation in network models,” The Annals of Statistics, 41, 1085–1110. [38] Rubin, D. B. (1976), “Inference and missing data,” Biometrika, 63, 581–592. [39] Schweinberger, M. (2011), “Instability, sensitivity, and degeneracy of discrete exponential families,” Journal of the American Statistical Association, 106, 1361–1370. [40] Schweinberger, M., and Handcock, M. S. (2015), “Local dependence in random graph models: characterization, properties and statistical inference,” Journal of the Royal Statistical Society, Series B, 77, 647–676. [41] Schweinberger, M., and Luna, P. (2017), “HERGM: Hierarchical exponential-family random graph models,” Journal of Statistical Software, accepted. [42] Schweinberger, M., and Stewart, J. (2017), “Supplement: Finite-graph concentration and consistency results for random graphs with complex topological structures,” Tech. rep., Department of Statistics, Rice University. [43] Shalizi, C. R., and Rinaldo, A. (2013), “Consistency under sampling of exponential random graph models,” The Annals of Statistics, 41, 508–535. [44] Slaughter, A. J., and Koehly, L. M. (2016), “Multilevel models for social networks: hierarchical Bayesian approaches to exponential random graph modeling,” Social Networks, 44, 334–345. [45] Snijders, T. A. B., Pattison, P. E., Robins, G. L., and Handcock, M. S. (2006), “New specifications for exponential random graph models,” Sociological Methodology, 36, 99–153. [46] Strauss, D. (1986), “On a general class of models for interaction,” SIAM Review, 28, 513–527. [47] Wang, P., Robins, G., Pattison, P., and Lazega, E. (2013), “Exponential random graph models for multilevel networks,” Social Networks, 35, 96–115. [48] Wasserman, S., and Pattison, P. (1996), “Logit models and logistic regression for social networks: I. An introduction to Markov graphs and p∗ ,” Psychometrika, 61, 401–425.
FINITE-GRAPH CONCENTRATION AND CONSISTENCY RESULTS
33
[49] Xiang, R., and Neville, J. (2011), “Relational learning with one network: an asymptotic analysis,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1–10. [50] Yan, T., Leng, C., and Zhu, J. (2016), “Asymptotics in directed exponential random graph models with an increasing bi-degree sequence,” The Annals of Statistics, 44, 31–57. [51] Yan, T., Wang, H., and Qin, H. (2016), “Asymptotics in undirected random graph models parameterized by the strengths of vertices,” Statistica Sinica, 26, 273–293. [52] Zappa, P., and Lomi, A. (2015), “The analysis of multilevel networks in organizations: models and empirical tests,” Organizational Research Methods, 18, 542–569. M ICHAEL S CHWEINBERGER , J ONATHAN S TEWART D EPARTMENT OF S TATISTICS R ICE U NIVERSITY 6100 M AIN S T, MS-138 H OUSTON , TX 77005-1827 E- MAIL : M . S @ RICE . EDU JONATHAN . STEWART @ RICE . EDU