Information-theoretic limits of graphical model selection in high dimensions

Narayana Santhanam∗ and Martin J. Wainwright∗†
∗EECS and †Statistics Departments, UC Berkeley, Berkeley, CA 94720
Abstract— The problem of graphical model selection is to correctly estimate the graph structure of a Markov random field given samples from the underlying distribution. We analyze the information-theoretic limitations of this problem under high-dimensional scaling, in which the graph size p and the number of edges k (or the maximum degree d) are allowed to increase to infinity as a function of the sample size n. For pairwise binary Markov random fields, we derive both necessary and sufficient conditions on the scaling of the triplet (n, p, k) (or the triplet (n, p, d)) for asymptotically reliable recovery of the graph structure.
I. INTRODUCTION

The problem of graphical model selection is to correctly estimate the graph structure of a Markov random field given samples from the underlying distribution. This problem is central to statistics and machine learning, with consequences for a variety of application domains, e.g., image analysis [3], social network analysis [15], and computational biology. If the underlying model is known to be tree-structured, then the problem can be reformulated as a maximum-weight spanning tree problem [5], and solved in polynomial time [5]. On the other hand, for fully general graphs with cycles, the problem is known to be computationally hard. Nonetheless, a variety of practical methods have been proposed, including constraint-based approaches [12], heuristic search, and ℓ1-based relaxations [10], [14]. In particular, it can be proven that some of the ℓ1-based approaches are consistent for model selection under particular scalings of the graph size, degrees, and number of samples [10], [14]; for the case where each degree is bounded by d, a similar consistency analysis has been performed for a heuristic method in the paper [4]. Other researchers [8], [6] have analyzed model selection criteria based on pseudolikelihood.

Of complementary interest, and the focus of this paper, are the information-theoretic limitations of the graphical model selection problem. More concretely, consider an undirected graph G = (V, E) with p = |V| vertices. In one of our model classes, we allow the number of edges k ∈ {1, . . . , \binom{p}{2}} to be a free parameter; in the other, we allow the node degree d to be a free parameter. Suppose that we have a set of N distributions, each a Markov random field with respect to a graph chosen from
our family. In this paper, we ask the following question: given n i.i.d. samples from a fixed but unknown model from this family, for what scalings of the triple (n, p, k) (or the triple (n, p, d)) does there exist a method that recovers the correct model with probability converging to one? Conversely, for what scalings does any method fail to recover the correct model? Note that from this perspective, the graphical model selection problem is a channel coding problem, in which the messages are graph structures, and each use of the channel corresponds to taking an i.i.d. sample. Although the analysis described here is more generally applicable, so as to bring sharp focus to the issue, we limit ourselves to the case of a pairwise binary Markov random field, also known as the Ising model [2].

The first main contribution of this paper (Theorem 1) is to derive a set of sufficient conditions on (n, p, k) that ensure asymptotically reliable selection is possible. Our proof is constructive in nature, based on a direct analysis of a particular model selection procedure. The second main contribution (Theorem 2) addresses the converse, and provides necessary conditions, i.e., a lower bound on the sample size n as a function of p and k (or p and d) that any selection procedure must satisfy for asymptotically reliable recovery. For the special case of bounded-degree models, these conditions sharpen results obtained in concurrent work [4].

II. PROBLEM STATEMENT AND MAIN RESULTS
We begin by formulating the problem, and stating our main results precisely.

A. Problem formulation

Given an undirected graph G = (V, E), let us associate with each vertex i ∈ V a binary random variable Xi ∈ {−1, +1}. We then consider a distribution over the random vector X = (X1, . . . , Xp) with probability mass function

    P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_{(s,t) \in E} \lambda_{st} x_s x_t \Big),        (1)

where Z(θ) is the normalization constant.
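For intuition, when p is small the distribution (1) can be evaluated by brute-force enumeration over all 2^p configurations. The following Python sketch does so for a homogeneous model; the helper name, the particular graph, and the value of λ are illustrative choices rather than anything specified in the paper.

import itertools
import math

def ising_pmf(p, edges, lam):
    """Probability mass function (1) with homogeneous weight lam, by enumeration."""
    weights = {}
    for x in itertools.product([-1, 1], repeat=p):
        # unnormalized weight exp( sum_{(s,t) in E} lam * x_s * x_t )
        weights[x] = math.exp(sum(lam * x[s] * x[t] for (s, t) in edges))
    Z = sum(weights.values())            # the normalization constant Z(theta)
    return {x: w / Z for x, w in weights.items()}

# Example: a chain on 3 vertices with edges (0,1) and (1,2).
pmf = ising_pmf(3, [(0, 1), (1, 2)], lam=0.5)
print(sum(pmf.values()))                 # 1.0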
The Ising model (1) has its origins in statistical physics [2], where it is used to model physical phenomena such as crystal structure and magnetism; it is also used as a simple model in image processing [3], [7]. In this paper, we consider two different classes of Ising models (1), depending on the condition that we impose on the edge set E. Specifically, we define the following model classes: (a) the collection Gk,p of graphs G with |E| = k edges for some k ≥ 1, and (b) the collection Gd,p of graphs such that each vertex has degree d. For a fixed edge set E, we say that the model is homogeneous if λst = λ ≠ 0 for all (s, t) ∈ E, and λst = 0 otherwise. In this conference paper, we focus on homogeneous models for the case λ > 0, corresponding to attractive interactions. However, the results and their proofs can be adapted to more general model classes (see the full-length version [11] for details).

We describe the model selection problem in the context of the class Gk,p; of course, a similar description applies to the class Gd,p. Note that there are N = \binom{\binom{p}{2}}{k} possible graphs in the class Gk,p. Given some graph from the class Gk,p, our goal is to use data to choose the correct model with high probability. Each model i is indexed by a parameter vector θi ∈ R^k, and our model class Gk,p is indexed by the collection of parameters {θ1, θ2, . . . , θN}. Suppose that we are given a set of n i.i.d. observation vectors (X^(1), . . . , X^(n)), where each vector X^(i) ∈ {−1, 1}^p is drawn according to the distribution Pθi for some unknown i ∈ {1, . . . , N}. We now have three problem parameters (n, p, k), and we are interested in understanding the set of triples for which it is possible/impossible to recover (with high probability) the correct model.

We begin by making some remarks about the model selection problem. It should be noted that correct model selection is not equivalent to simply estimating pairs of vertices (s, t) for which cov{Xs, Xt} ≠ 0; indeed, given the distribution (1), it can be seen that Xs and Xt are often correlated even when there is no direct edge connecting them. The ferromagnetic assumption (λ > 0) ensures that any correlation, whether or not it corresponds to an edge between s and t, is always positive. (This fact follows by applying the FKG inequality [1] to the pair of increasing functions f(X) = Xs and g(X) = Xt.) Moreover, from the outset, it is clear that the edge strengths λst must be bounded: if the edge parameters are unbounded, it is impossible to distinguish certain pairs of models, even given infinite amounts of data.
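The indirect-correlation phenomenon noted above is easy to verify numerically on a small chain: the endpoints of a 3-vertex chain share no edge, yet their correlation equals tanh²(λ) > 0. The following sketch checks this by enumeration (the chain and the value of λ are again illustrative choices).

import itertools, math

def edge_mean(p, edges, lam, s, t):
    """E[X_s X_t] under the homogeneous Ising model (1), by enumeration."""
    num, Z = 0.0, 0.0
    for x in itertools.product([-1, 1], repeat=p):
        w = math.exp(sum(lam * x[a] * x[b] for (a, b) in edges))
        Z += w
        num += w * x[s] * x[t]
    return num / Z

lam = 0.5
chain = [(0, 1), (1, 2)]
print(edge_mean(3, chain, lam, 0, 2))    # positive, although (0, 2) is not an edge
print(math.tanh(lam) ** 2)               # matches the indirect correlation tanh(lam)^2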
Example 1. As a simple example, consider the case k = 2 and p = 3, so that Gk,p corresponds to graphs on 3 vertices with exactly 2 edges; note that there are a total of three such graphs. However, if we set λ = +∞ so that the Ising model (1) enforces a "hardcore" constraint, then all three models collapse to the same distribution, namely the one in which the two configurations (1, 1, 1) and (−1, −1, −1) each have probability 1/2. If λ is finite but very large, the models are not identical, but nonetheless extremely hard to distinguish. □

On the other hand, if λ is non-zero but finite, then given infinite amounts of data, two distinct models θi and θj can always be told apart by looking at the vector µ ∈ R^{\binom{p}{2}} of mean parameters (i.e., µst = E[Xs Xt] for all pairs (s, t)). This fact is a special case of a more general theorem about exponential families, stating that the mapping between exponential parameters and mean parameters is one-to-one [13]. In our particular setting, an elementary proof follows along the lines of the proof of Lemma 4. Of course, given finite amounts of data, the means will be known only approximately, so that the issue is to determine the degree of separation between the mean parameters.

B. Useful definitions

In order to quantify the distinguishability of different models, we begin by defining some useful "distance" measures. Given two parameters θ and θ′, let D(θ||θ′) denote the Kullback-Leibler divergence between the associated distributions. Letting (θi + θj)/2 denote the model obtained by averaging the parameters of θi and θj, one useful measure of distance between models θi and θj is

    D(\theta_i, \theta_j) := D\Big( \frac{\theta_i + \theta_j}{2} \,\Big\|\, \theta_i \Big) + D\Big( \frac{\theta_i + \theta_j}{2} \,\Big\|\, \theta_j \Big).        (2)

This distance measure bounds the Chernoff exponent in the hypothesis test between models θi and θj. In Lemma 4, we bound D(i, j) for any two models, and tighten the bound in Lemma 5 for the case where each model belongs to Gd,p. Another useful measure of distance we consider is the symmetrized KL distance between models θi and θj, given by

    J(\theta_i, \theta_j) := D(\theta_i \| \theta_j) + D(\theta_j \| \theta_i).        (3)
Clearly D(θi, θj) ≥ 0 and J(θi, θj) ≥ 0. For the above-defined distances between models θi and θj, we write D(i, j) and J(i, j) where there is no ambiguity. As indicated above, it is also necessary to measure the edge strengths λst, which we do with the parameter

    B := \max_{s \in V} \sum_{t \in N(s)} \lambda_{st},        (4)
where N(s) denotes the subset of vertices connected to vertex s by an edge. For the d-regular class Gd,p and homogeneous models with λ > 0, we have B = dλ. Our analysis shows that the inherent difficulty of model selection depends critically on the scaling of B.
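Both distances (2) and (3) can be computed exactly for small p by enumerating all configurations; the sketch below does so for two arbitrary 4-vertex models with k = 2 edges each (the models, λ, and the helper names are illustrative choices).

import itertools, math

def ising_dist(p, edges, lam):
    """Distribution (1) with homogeneous weight lam, by enumeration."""
    w = {x: math.exp(sum(lam * x[s] * x[t] for (s, t) in edges))
         for x in itertools.product([-1, 1], repeat=p)}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def kl(P, Q):
    """Kullback-Leibler divergence D(P || Q)."""
    return sum(P[x] * math.log(P[x] / Q[x]) for x in P)

def mixture_pair_distance(p, E_i, E_j, lam):
    """D(theta_i, theta_j) of equation (2): KL from the parameter-averaged model to each."""
    P_i, P_j = ising_dist(p, E_i, lam), ising_dist(p, E_j, lam)
    # Averaging the exponential parameters: shared edges keep weight lam,
    # edges in exactly one model get weight lam / 2.
    w = {x: math.exp(sum(lam * x[s] * x[t] for (s, t) in set(E_i) & set(E_j)) +
                     sum(0.5 * lam * x[s] * x[t] for (s, t) in set(E_i) ^ set(E_j)))
         for x in itertools.product([-1, 1], repeat=p)}
    Z = sum(w.values())
    P_avg = {x: v / Z for x, v in w.items()}
    return kl(P_avg, P_i) + kl(P_avg, P_j)

def symmetrized_kl(p, E_i, E_j, lam):
    """J(theta_i, theta_j) of equation (3)."""
    P_i, P_j = ising_dist(p, E_i, lam), ising_dist(p, E_j, lam)
    return kl(P_i, P_j) + kl(P_j, P_i)

E1, E2 = [(0, 1), (2, 3)], [(0, 2), (1, 3)]
print(mixture_pair_distance(4, E1, E2, 0.5), symmetrized_kl(4, E1, E2, 0.5))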
C. Statement of main results
Formally, a model selection procedure is a mapping φ from samples X^n = {X^(1), . . . , X^(n)} to model indices in Gk,p (or Gd,p). The main results of this paper provide necessary and sufficient conditions on the number of samples n required to achieve asymptotically reliable model selection using optimal selection procedures, or decoders. We refer to this scaling of n as the sample complexity of the graphical model selection problem. Our first main result provides sufficient conditions on the sample size for reliable model selection:
Theorem 1. (a) Sufficiency for Gk,p: For any given δ ∈ (0, 1), if the sample size n satisfies

    n > \frac{3e^{2B} + 1}{\sinh^2(\lambda/4)} \Big[ (k + 3)\log k + \log\frac{2}{\delta} \Big] + \frac{2}{\tanh^2(\lambda)} \Big[ 2\log p + \log\frac{4}{\delta} \Big],        (5)
then there exists a decoder φ : X^n → Gk,p such that P[φ(X^n) ≠ θi | θi] < δ, for all θi, 1 ≤ i ≤ N.

(b) Sufficiency for Gd,p: For any given δ ∈ (0, 1), if the sample size n satisfies

    n > \frac{3e^{2B} + 1}{\sinh^2(\lambda/2)} (4d + 1) \Big[ 3\log p + \log(2d) + \log\frac{1}{\delta} \Big],
then there exists a decoder φ : X^n → Gd,p such that P[φ(X^n) ≠ θi | θi] < δ for all θi, 1 ≤ i ≤ N.
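To get a feel for these sufficient conditions, one can simply plug numbers into the two displayed bounds; the sketch below does this, with function names and parameter values that are arbitrary illustrative choices (note how the e^{2B} factor dominates as B = dλ grows).

import math

def n_sufficient_Gkp(p, k, lam, B, delta):
    """Right-hand side of the sufficient condition (5) of Theorem 1(a)."""
    term1 = (3 * math.exp(2 * B) + 1) / math.sinh(lam / 4) ** 2 \
            * ((k + 3) * math.log(k) + math.log(2 / delta))
    term2 = 2 / math.tanh(lam) ** 2 * (2 * math.log(p) + math.log(4 / delta))
    return term1 + term2

def n_sufficient_Gdp(p, d, lam, delta):
    """Right-hand side of the sufficient condition of Theorem 1(b); here B = d * lam."""
    B = d * lam
    return (3 * math.exp(2 * B) + 1) / math.sinh(lam / 2) ** 2 \
           * (4 * d + 1) * (3 * math.log(p) + math.log(2 * d) + math.log(1 / delta))

print(n_sufficient_Gkp(p=100, k=5, lam=0.2, B=5 * 0.2, delta=0.05))
print(n_sufficient_Gdp(p=100, d=3, lam=0.2, delta=0.05))
print(n_sufficient_Gdp(p=100, d=10, lam=0.2, delta=0.05))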
The following theorem provides necessary conditions, applicable to any model selection procedure, for the edge set to be recovered exactly:

Theorem 2. (a) Necessity for Gk,p: Given any decoder φ : X^n → Gk,p, if B ≥ 1 and the sample size satisfies

    n \le \frac{e^{B/2}\,\log k}{4B e^{5\lambda/2}\sinh(\lambda)} + \frac{\log p}{2\sinh(\lambda)},

then there exists a model θi such that P[φ(X^n) ≠ θi | θi] ≥ 1/2.
(b) Necessity for Gd,p: Given any decoder φ : X^n → Gd,p, if B ≥ 1 and the sample size satisfies

    n \le \frac{e^{B/4}\, d\lambda \big( \log\frac{p}{d} - 1 \big)}{32 e^{5\lambda/2} (e^{\lambda} - e^{-\lambda})} + \frac{1}{4}\, d \log\frac{p}{2d},

then there exists a model θi such that P[φ(X^n) ≠ θi | θi] ≥ 1/2.

III. PROOF SKETCHES
In this section, we provide the basic steps involved in the proofs of our two theorems.
A. Preliminaries

We begin by analyzing the effects of large values of the parameter B, in particular concluding that the sample complexity should grow exponentially in B. Let Km be a fully connected graph on m + 1 vertices. Let θ′ be a model with its graph being Km with the edge (s, t) removed, and every edge parameter being λ. We compute the (indirect) correlation on the missing edge. In the ferromagnetic case, the FKG inequality implies that the resulting correlation is the maximum possible indirect correlation that can be obtained from any graph over these m + 1 nodes and edge weights λ.

Lemma 1. If B ≥ 1, then for the model θ′ defined above,

    E_{\theta'}[X_s X_t] \ge 1 - \frac{2(m + 1)e^{3\lambda/2}}{e^{B/2} + (m + 1)e^{3\lambda/2}}.
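The qualitative content of Lemma 1 (a missing edge can carry an indirect correlation approaching one) can be checked directly by enumeration for small m; the choice m = 4 and the values of λ below are arbitrary.

import itertools, math

def missing_edge_mean(m, lam):
    """E[X_s X_t] under theta': the complete graph on m+1 vertices with the single
    edge (s, t) = (0, 1) removed, homogeneous weight lam (brute-force enumeration)."""
    verts = m + 1
    edges = [(a, b) for a in range(verts) for b in range(a + 1, verts)
             if (a, b) != (0, 1)]
    num, Z = 0.0, 0.0
    for x in itertools.product([-1, 1], repeat=verts):
        w = math.exp(sum(lam * x[a] * x[b] for (a, b) in edges))
        Z += w
        num += w * x[0] * x[1]
    return num / Z

for lam in (0.25, 0.5, 1.0):   # the indirect correlation approaches 1 as lam (hence B) grows
    print(lam, missing_edge_mean(4, lam))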
We now construct pairs of models separated by distances decreasing exponentially in B. For the class Gd,p, consider the following two models: (i) θ1 with its graph being Kd with an edge, say (s, t), removed, and (ii) θ2 with its graph being Kd with a different edge, say (u, v), removed. (Recall that all weights for edges included in the graph are set to λ.) From the FKG inequality, and noting that model θ2 contains the edge (s, t), we have

    \frac{P_2[X_s X_t = 1]}{P_2[X_s X_t = -1]} \le e^{2\lambda}\, \frac{P_1[X_s X_t = 1]}{P_1[X_s X_t = -1]}.

Moreover, it can be shown [11] that

    J(1, 2) \le \frac{2(e^{2\lambda} - 1)}{P_1(X_s X_t = 1)/P_1(X_s X_t = -1)}.        (6)
Hence if B > 1, it follows that

    J(1, 2) \le \frac{2Be^{5\lambda/2}\sinh(\lambda)}{e^{B/2}}.

Similarly, considering K_{\sqrt{2k}} and constructing two models from it as above, we obtain two models in Gk,p separated by a distance that is exponentially small in B. It follows that if B ≥ 1 and any model is to be inferred with high probability, the sample complexity should grow exponentially in the parameter B.
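The shrinkage of J(1, 2) with B can also be observed numerically for small d by enumerating both distributions; in the sketch below, d = 3 and the values of λ are arbitrary, and K_d denotes the complete graph on d + 1 vertices as above.

import itertools, math

def ising_dist(n_vertices, edges, lam):
    w = {x: math.exp(sum(lam * x[a] * x[b] for (a, b) in edges))
         for x in itertools.product([-1, 1], repeat=n_vertices)}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def J(P, Q):
    """Symmetrized KL divergence (3)."""
    return sum(P[x] * math.log(P[x] / Q[x]) + Q[x] * math.log(Q[x] / P[x]) for x in P)

d = 3
verts = d + 1
all_edges = [(a, b) for a in range(verts) for b in range(a + 1, verts)]
for lam in (0.5, 1.0, 2.0):
    P1 = ising_dist(verts, [e for e in all_edges if e != (0, 1)], lam)   # edge (s,t) removed
    P2 = ising_dist(verts, [e for e in all_edges if e != (2, 3)], lam)   # edge (u,v) removed
    print(lam, J(P1, P2))     # shrinks rapidly as lam (hence B) grows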
B. Proof of Theorem 1

The graph selection problem can be viewed as a channel coding problem, in which the goal is to select one of N competing models based on the samples X^n. Two factors contribute to the error probability of an optimal decoder: the pairwise KL divergence between models (the two-way hypothesis testing step), and the number of competing models. The first factor determines how many samples we need if we are to distinguish between two models, and the second determines how to combine the results of the various two-way tests.
Lemma 2 considers the optimal hypothesis test between two models, which is used in conjunction with Lemmas 4 and 5 to obtain the achievable region in Theorem 1.

1) Specification of optimal decoder: Given n i.i.d. samples, consider the sub-problem of testing model 1 versus model j. From classical theory on hypothesis testing [9], it is well known that the optimal test is based on thresholding the likelihood ratio. In particular, to test model j versus model 1, we compute

    L_j(X) := \frac{1}{n} \sum_{i=1}^{n} \log \frac{p(X^{(i)}; \theta_j)}{p(X^{(i)}; \theta_1)},        (8)

where we have used the fact that the samples are i.i.d. Note that each Lj(X) is a random variable; an easy computation yields that E1[Lj(X)] = −D(θ1 ∥ θj), where E1 denotes expectation under Pθ1 and D(θ1 ∥ θj) denotes the KL divergence between Pθ1 and Pθj. Moreover, by the strong law of large numbers, we have Lj(X) → E1[Lj(X)] almost surely.

2) Achievable region: Now suppose that model θ1 is the true model. The optimal decoder fails if and only if Lj(X) > 0 for some j ∈ {2, . . . , N}. Hence, by the union bound, we have

    P(\text{fail} \mid \theta_1 \text{ true}) \le \sum_{j=2}^{N} P[L_j(X) > 0].        (9)

Theorem 1 is based on a stronger bound than the naive union bound described above; nevertheless, estimating the probability P[Lj(X) > 0] will be useful. The following result shows that P[Lj(X) > 0] decreases exponentially in the distance D(1, j):

Lemma 2. Given n i.i.d. samples from Pθ1, we have log P[Lj(X) > 0] ≤ −n D(1, j), where the distance was defined earlier in (2).

In order to make effective use of Lemma 2, we require a bound on the distance D(i, j) between models θi and θj. The following technical lemma is an intermediate step in this direction:

Lemma 3. Consider distinct models θi and θj, and let

    E(X) := \lambda \sum_{(s',t') \in E_i - E_j} X_{s'} X_{t'} \;-\; \lambda \sum_{(s',t') \in E_j - E_i} X_{s'} X_{t'}.

Let edge (s, t) ∈ Ei − Ej. At least one of the following events changes E(X) by at least λ: (i) flipping s, (ii) flipping t, or (iii) flipping both s and t.

We now have the ingredients to bound the distance D(i, j):

Lemma 4. For all distinct θi and θj,

    D(i, j) \ge \frac{\sinh^2(\lambda/4)}{3e^{2B} + 1}.

In many cases, we can exploit independence properties of the Markov random field to obtain sharper conditions. The following result provides a refined statement for the d-regular class Gd,p:

Lemma 5. Given distinct models θi, θj ∈ Gd,p, define their mismatch

    m(i, j) := |(E_i - E_j) \cup (E_j - E_i)|.        (7)

Then

    D(i, j) \ge \frac{m(i, j)}{4d + 1} \cdot \frac{\sinh^2(\lambda/4)}{3e^{2B} + 1}.
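The pairwise statistic (8) is straightforward to evaluate for small models by computing both probability mass functions exactly; the following sketch does so for two arbitrary 4-vertex models, with the sample size and λ chosen only for illustration.

import itertools, math, random

def ising_dist(p, edges, lam):
    w = {x: math.exp(sum(lam * x[a] * x[b] for (a, b) in edges))
         for x in itertools.product([-1, 1], repeat=p)}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def L(samples, P_j, P_1):
    """The statistic L_j(X) of equation (8): positive values favour model j."""
    return sum(math.log(P_j[x] / P_1[x]) for x in samples) / len(samples)

rng = random.Random(0)
p, lam = 4, 0.5
P1 = ising_dist(p, [(0, 1), (2, 3)], lam)        # true model
P2 = ising_dist(p, [(0, 2), (1, 3)], lam)        # competing model
X = rng.choices(list(P1.keys()), weights=list(P1.values()), k=500)
print(L(X, P2, P1))    # negative on average, since E_1[L_2(X)] = -D(theta_1 || theta_2)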
We now have the necessary ingredients to prove Theorem 1. Considering first the class Gk,p, fix some model θi ∈ Gk,p. Consider the following decoder: using the n i.i.d. samples, first remove all vertices s such that, for all t,

    \frac{1}{n} \sum_{i=1}^{n} X_s^{(i)} X_t^{(i)} \;<\; \frac{1}{2}\, \frac{e^{\lambda} - e^{-\lambda}}{e^{\lambda} + e^{-\lambda}} \;=\; \frac{1}{2} \tanh(\lambda).
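This first, vertex-screening stage of the decoder is easy to sketch in code: keep a vertex only if some empirical correlation reaches the threshold tanh(λ)/2 from the display above. The graph, λ, and sample size below are illustrative choices.

import itertools, math, random

def ising_dist(p, edges, lam):
    w = {x: math.exp(sum(lam * x[a] * x[b] for (a, b) in edges))
         for x in itertools.product([-1, 1], repeat=p)}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def screen_vertices(samples, p, lam):
    """Keep vertex s iff some empirical correlation (1/n) sum_i X_s X_t reaches tanh(lam)/2."""
    n = len(samples)
    keep = set()
    for s in range(p):
        for t in range(p):
            if t != s and sum(x[s] * x[t] for x in samples) / n >= 0.5 * math.tanh(lam):
                keep.add(s)
                break
    return keep

rng = random.Random(1)
p, lam = 5, 0.6
P = ising_dist(p, [(0, 1), (1, 2)], lam)          # vertices 3 and 4 are isolated
X = rng.choices(list(P.keys()), weights=list(P.values()), k=2000)
print(screen_vertices(X, p, lam))                 # typically {0, 1, 2}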
Let Vi denote the set of non-isolated vertices of model θi, and let ΘVi be the set of all models θj such that Vj = Vi. Note that for all θj ∉ ΘVi, for every edge (s, t) such that {s, t} ∩ (Vj − Vi) ≠ ∅, we have Eθi[Xs Xt] = 0 while Eθj[Xs Xt] ≥ tanh(λ); and for every edge (s, t) such that {s, t} ∩ (Vi − Vj) ≠ ∅, we have Eθi[Xs Xt] ≥ tanh(λ) while Eθj[Xs Xt] = 0. Hence, setting ε = (1/2) tanh(λ),
    P(\text{mistake } \theta_i \text{ for some } \theta_j \notin \Theta_{V_i} \mid \theta_i \text{ true}) = P(\text{fail to identify } V_i \mid \theta_i \text{ true}) \le 2k \binom{p}{2} \exp(-n\epsilon^2/4).

Note that there are at most \binom{k^2}{m} other models which have m edges different, but share the same set of vertices Vi. We use the optimal likelihood ratio test from Section III-B.2 to distinguish among these models. Hence,

    P(\text{fail} \mid \theta_i \text{ true}) \le 2k \binom{p}{2} \exp(-n\epsilon^2/4) + k \cdot k^{2m^*} \binom{k^2}{m^*} \exp\Big( -m^* n\, \frac{(e^{\lambda/2} - e^{-\lambda/2})^2}{4(3e^{2B} + m^*)} \Big),

for m^* corresponding to the maximum term in the summation.
Instead of estimating m^*, we will simply ensure that no matter where 1 ≤ m^* ≤ k lies, the bound is less than δ. The result for the class Gk,p follows. For the class Gd,p, we use the optimal LLR test for the decoder. Fix θi ∈ Gd,p. There are at most \binom{p^2}{m} models in Gd,p with mismatch m from θi. The result for Gd,p follows using the union bound in conjunction with Lemma 2 and Lemma 5; see the full-length version [11] for details. □

C. Proof of Theorem 2

We conclude by sketching the proof of the necessary conditions in Theorem 2.

Lemma 6. For any set of w models θ1 through θw and all decoders q : X^n → {θ1, . . . , θw}, if the sample size satisfies

    n \le \frac{\log\frac{w}{4}}{\frac{2}{w^2} \sum_{1 \le i < j \le w} J(i, j)},

then max_{1≤i≤w} P(q(X^n) ≠ θi | θi) ≥ 1/2.
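As a rough illustration of how Lemma 6 is used (constants are not tracked, and the particular model family is an assumption consistent with the construction of Section III-A): taking one model per deleted edge of K_{\sqrt{2k}} gives a family of w models, with w on the order of k, whose pairwise divergences all satisfy the bound of Section III-A. Then

    \frac{2}{w^2} \sum_{1 \le i < j \le w} J(i, j) \;\le\; \max_{i \ne j} J(i, j) \;\le\; \frac{2Be^{5\lambda/2}\sinh(\lambda)}{e^{B/2}},

so by Lemma 6 any decoder whose maximal error probability is below 1/2 requires

    n \;>\; \frac{e^{B/2} \log\frac{w}{4}}{2Be^{5\lambda/2}\sinh(\lambda)},

which grows exponentially in B, consistent with the exponential-in-B scaling of the necessary conditions in Theorem 2.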