arXiv:1608.03045v2 [math.ST] 18 Aug 2016

Combinatorial Inference for Graphical Models

Matey Neykov∗†    Junwei Lu∗†    Han Liu∗‡

Abstract

We propose a new family of combinatorial inference problems for graphical models. Unlike classical statistical inference where the main interest is point estimation or parameter testing, combinatorial inference aims at testing the global structure of the underlying graph. Examples include testing the graph connectivity, the presence of a cycle of certain size, or the maximum degree of the graph. To begin with, we develop a unified theory for the fundamental limits of a large family of combinatorial inference problems. We propose new concepts including structural packing and buffer entropies to characterize how the complexity of combinatorial graph structures impacts the corresponding minimax lower bounds. On the other hand, we propose a family of novel and practical structural testing algorithms to match the lower bounds. We provide thorough numerical results on both synthetic graphical models and brain networks to illustrate the usefulness of these proposed methods.

Keywords: Graph structural inference; minimax testing; uncertainty assessment; multiple hypothesis testing; post-regularization inference.

∗ Department of Operations Research and Financial Engineering, Princeton University, Princeton NJ 08544, USA; e-mail: {mneykov,junweil,hanliu}@princeton.edu
† These authors contributed equally to this work.
‡ Research supported by NSF DMS1454377-CAREER; NSF IIS 1546482-BIGDATA; NIH R01MH102339; NSF IIS1408910; NIH R01GM083084.

1 Introduction

Graphical models are a powerful tool for modeling complex relationships among many random variables. A central theme of graphical model research is to infer the structure of the underlying graph based on observational data. Though significant progress has been made,

existing works mainly focus on estimating the graph (Meinshausen and Bühlmann, 2006; Liu et al., 2009; Ravikumar et al., 2011; Cai et al., 2011) or testing the existence of a single edge (Jankova and van de Geer, 2015; Ren et al., 2015; Neykov et al., 2015; Gu et al., 2015). In this paper we consider a new inferential problem: testing the combinatorial structure of the underlying graph. Examples include testing the graph connectivity, cycle presence, or assessing the maximum degree of the graph. Unlike classical inference, which aims at testing a set of Euclidean parameters, combinatorial inference aims to test global structural properties and requires the development of new methodology.

As for methodological development, this paper mainly considers the Gaussian graphical model (though our method is applicable to the more general semiparametric exponential family graphical models and elliptical copula graphical models): Let X = (X1, . . . , Xd)ᵀ ∼ N_d(0, (Θ∗)⁻¹) be a d-dimensional Gaussian random vector with precision matrix Θ∗ = (Θ∗_jk). Let G∗ = G(Θ∗) := (V, E∗) be an undirected graph, where V = {1, . . . , d} and an edge (j, k) ∈ E∗ if and only if Θ∗_jk ≠ 0. It is well known that G∗ has the pairwise Markov property, i.e., (j, k) ∉ E∗ if and only if Xj and Xk are conditionally independent given the remaining variables.

In a combinatorial inference problem, our goal is to test whether G∗ has certain global structural properties based on n random samples X1, . . . , Xn. Specifically, let G be the set of all graphs over the vertex set V and (G0, G1) be a pair of non-overlapping subsets of G. We assume all the graphs in G1 have a property (e.g., connectivity) while the graphs in G0 do not have this property. Such a pair (G0, G1) is called a sub-decomposition of G. Our goal is to test the hypothesis H0 : G∗ ∈ G0 versus H1 : G∗ ∈ G1. We provide several concrete examples below.

Connectivity. A graph is connected if and only if there exists a path connecting each pair of its vertices. To test connectivity, we set G0 = {G ∈ G : G is disconnected} and G1 = {G ∈ G : G is connected}. Under the Gaussian graphical model, this is equivalent to testing whether the variables can be partitioned into at least two independent sets.

Cycle presence. Sometimes it is of interest to test whether the underlying graph is acyclic. In this example we let G0 = {G ∈ G : G is cycle-free} and G1 = {G ∈ G : G contains a cycle}. If a graph is acyclic, it can be easily visualized on a two-dimensional plane.

Maximum degree. Another relevant question is to test whether the maximum degree of the graph is less than or equal to some integer s0 ∈ N versus the alternative that the maximum degree is at least s1 ∈ N, where s0 < s1. Define the sub-decomposition G0 = {G : dmax(G) ≤ s0} and G1 = {G : dmax(G) ≥ s1} respectively.

While our ultimate goal is to test whether G∗ ∈ G0 versus G∗ ∈ G1, our access to G∗ is only through the random samples {Xi}_{i=1}^n. Under Gaussian models, we can translate the original problem of testing graphs to testing the precision matrix:

H0 : Θ∗ ∈ S0  vs  H1 : Θ∗ ∈ S1.    (1.1)

In (1.1), S0, S1 ⊂ M(s) are two sets of precision matrices such that for all Θ ∈ S0, S1 we have G(Θ) ∈ G0, G1 respectively, and M(s) is defined as

M(s) = { Θ ∈ R^{d×d} : Θ = Θᵀ, C⁻¹ ≤ Θ ≤ C, ‖Θ‖₁ ≤ L, max_{j∈[d]} ‖Θ∗j‖₀ ≤ s }    (1.2)

for some constants 1 ≤ C ≤ L. The set M(s) restricts our attention to well-conditioned symmetric matrices Θ whose induced graphs G(Θ) have maximum degree at most s. Given this setup, we aim to characterize necessary conditions on the pair S0, S1 under which the combinatorial inference problem in (1.1) is testable. Specifically, recall that a test is any measurable function ψ : {Xi}_{i=1}^n ↦ {0, 1}. Define the risk of testing S0 against S1 as:

γ(S0, S1) = inf_ψ [ max_{Θ∈S0} PΘ(ψ = 1) + max_{Θ∈S1} PΘ(ψ = 0) ].    (1.3)

If liminf_{n→∞} γ(S0, S1) = 1, we say that the problem (1.1) is untestable, since any test fails to distinguish between S0 and S1 in the asymptotic minimax sense. Due to the close relationship between the sets of precision matrices (S0, S1) and the sub-decomposition (G0, G1) (recall that for all Θ ∈ S0, S1 we have G(Θ) ∈ G0, G1 resp.), we anticipate that the sub-decomposition can capture the intrinsic challenge of the test in (1.1). Indeed, in Sections 2 and 4 we develop a new theory to characterize the impact of the combinatorial structures of G0 and G1 on the lower bound of the risk γ(S0, S1). Such lower bounds provide necessary conditions for any valid test. We then develop practical procedures that match the obtained lower bounds.

To understand how the sub-decomposition (G0, G1) affects the intrinsic difficulty of the problem in (1.1), we consider the three examples given before. Our theory on lower bounds distinguishes between two types of sub-decompositions: in the first type one can find graphs belonging to G0 and G1 differing in only a single edge, while in the second type all graphs belonging to G0 must differ on multiple edge sets from the graphs belonging to G1. One can check that in the first two examples (connectivity and cycle presence testing) there exist graphs belonging to G0 and G1 differing in only a single edge. For instance, when testing connectivity, consider a tree with a single edge removed (thus a forest) versus a connected tree. Extending this intuition, for a fixed graph G0 = (V, E0) ∈ G0, we call the edge set C = {e1, . . . , em} a single-edge null-alternative separator (NAS) if for all edges e ∈ C the graphs (V, E0 ∪ {e}) ∈ G1. Intuitively, the bigger the cardinality of a NAS set, the harder it is to tell the null from the alternatives. In Section 2 we show that γ(S0, S1) is asymptotically 1 when the signal strength of separation between S0 and S1 is low (see (2.2) and (2.3) for a formal definition) and there exists a NAS set with sufficiently large packing number. The packing number, formally defined in Definition 2.3, represents the cardinality of a subset of edges in C which are "far" apart, where the proximity measure of two edges is a predistance (compared to a distance, a predistance does not have to satisfy the triangle inequality) based on the graph G0. Recall that for a graph G and two vertices u and v, the graph geodesic distance is defined by: dG(u, v) := length of the shortest path between u and v within G. Using the notion of geodesic distance, one can define a predistance between two edges by taking the minimum over the geodesic distances of their corresponding nodes.

If the difference between the null and alternative is more than one edge, as in the maximum degree testing G0 = {G : dmax(G) ≤ s0} vs G1 = {G : dmax(G) ≥ s1} for example, the packing number does not always capture the lower bound of the tests. In Section 4 we develop a novel mechanism to handle this more sophisticated case. We introduce a new concept called "buffer entropy" which can produce sharper lower bounds to overcome the disadvantages of the packing number. On the other hand, to match the lower bound, we propose the null-alternative witness technique (NAWT) as a unified approach for combinatorial tests. The NAWT identifies a critical structure and proceeds to test whether this structure indeed belongs to the true graph. We prove the NAWT tests can control both the type I and type II errors asymptotically.

1.1 Contributions

There are three major contributions of this paper. Our first contribution is to develop a novel strategy for obtaining minimax lower bounds on the signal strength required to distinguish combinatorial graph structures which are separable via a single-edge NAS. In particular, our theory bridges the information-theoretic lower bounds to the packing number of the NAS set, which is an intuitive combinatorial quantity. To obtain this result, we relate the chi-square divergence between two probability measures to the number of "closed walks" on their corresponding Markov graphs. Our analysis hinges on several technical tools including Le Cam's lemma, matrix perturbation inequalities and spectral graph theory. The usefulness of the approach is demonstrated by obtaining generic and interpretable lower bounds in numerous examples such as testing connectivity, connected components, self-avoiding paths, and cycles.

Our second contribution is to provide a device for proving lower bounds under settings where the null and alternative graphs differ in multiple edges. In such cases the packing number does not always provide tight lower bounds. Our technical device is based on a new graph combinatorial quantity called buffer entropy. The buffer entropy is a complexity measure of the structural tests and characterizes their lower bounds. We apply this device to derive lower bounds for testing the maximum degree and detecting a sparse clique.

Our third contribution is to propose a null-alternative witness technique (NAWT) for testing (1.1), which matches the lower bounds on the signal strength. The NAWT is a two-step procedure: in the first step it identifies a minimal structure witnessing the alternative hypothesis, and in the second step it attempts to certify the presence of this structure in the graph. NAWT utilizes recent advances in high-dimensional inference and provides honest tests for combinatorial inference problems. It has two advantages compared to the support recovery procedures in Meinshausen and Bühlmann (2006); Ravikumar et al. (2011); Cai et al. (2011): First, it allows us to control the type-I error at any given level; Second, it does not require us to perfectly recover the underlying graph to conduct valid inference.

1.2 Related Work

Graphical model inference is relatively straightforward when d < n, but becomes notoriously challenging when d ≫ n. In high dimensions, estimation procedures were studied by Yuan and Lin (2006); Friedman et al. (2008); Lam and Fan (2009); Cai et al. (2011) among others, while for variable selection procedures see Meinshausen and Bühlmann (2006); Raskutti et al. (2008); Liu et al. (2009); Ravikumar et al. (2011); Cai et al. (2011) and references therein. Recently, motivated by Zhang and Zhang (2014), various inferential methods for high-dimensional graphical models were suggested (Liu, 2013; Jankova and van de Geer, 2015; Chen et al., 2015; Ren et al., 2015; Neykov et al., 2015; Gu et al., 2015, e.g.), most of which focus on testing the presence of a single edge (except Liu (2013), who took the FDR approach (Benjamini and Hochberg, 1995) to conduct multiple tests, and Gu et al. (2015), who developed new procedures for edge testing in Gaussian copula models). None of the aforementioned works address the problem of combinatorial structure testing.

In addition to estimation and model selection procedures, efforts have been made to understand the fundamental limits of these problems. Lower bounds on estimation were obtained by Ren et al. (2015), where the authors show that the parametric estimation rate n^{-1/2} is unattainable unless s log d/√n = o(1). Lower bounds on the minimal sample size required for model selection in Ising models were established by Santhanam and Wainwright (2012), where it is shown that support recovery is unattainable when n ≲ s² log d. In a follow-up work, Wang et al. (2010) studied model selection limits on the sample size in Gaussian graphical models. The latter two works are remotely related to ours, in that both exploit graph properties to obtain information-theoretic lower bounds. However, our problem differs significantly from theirs since we focus on developing lower bounds for testing graph structure, which is a fundamentally different problem from estimating the whole graph.

Our problem is most closely related to those in Addario-Berry et al. (2010); Arias-Castro et al. (2012, 2015a,b), which are inspired by the large body of research on minimax hypothesis testing (Ingster, 1982; Ingster et al., 2010; Arias-Castro et al., 2011a,b, e.g.) among many others. Addario-Berry et al. (2010) quantify the signal strength as the mean parameter of a standard Gaussian distribution, while Arias-Castro et al. (2012, 2015a) impose models on the covariance matrix of a multivariate Gaussian distribution. In our setup the parameter spaces of interest are designed to reflect the graphical model structure, and hence the signal strength is naturally imposed on the precision matrix. Arias-Castro et al. (2015b) address testing on a lattice-based Gaussian Markov random field. For specific problems they establish lower bounds on the signal strength required to test the empty graph versus an alternative hypothesis. This is different from the setting of our problems, where the null hypothesis is usually not the empty graph.

1.3 Notation

The following notation is used throughout the paper. For a vector v = (v1, . . . , vd)ᵀ ∈ R^d, let ‖v‖_q = (Σ_{i=1}^d |v_i|^q)^{1/q} for 1 ≤ q < ∞, and ‖v‖₀ = |supp(v)|, where supp(v) = {j : v_j ≠ 0} and |A| denotes the cardinality of a set A. Furthermore, let ‖v‖_∞ = max_i |v_i| and v^{⊗2} = vvᵀ. For a matrix A, we denote by A_{∗j} and A_{j∗} the j-th column and row of A respectively. For any n ∈ N we use the shorthand notation [n] = {1, . . . , n}. For two integer sets S1, S2 ⊆ [d], we denote by A_{S1S2} the sub-matrix of A with elements {A_{jk}}_{j∈S1,k∈S2}. Moreover, we denote ‖A‖_max = max_{jk} |A_{jk}| and ‖A‖_p = max_{‖v‖_p=1} ‖Av‖_p for p ≥ 1. For a graph G we use V(G), E(G) to refer to the vertex set and edge set of G respectively. We also denote by V(E) the vertex set of the edge set E. We reserve special notations for the complete vertex set V := [d], the complete edge set E := {e ∈ 2^{[d]} : |e| = 2} and the complete graph G := (V, E). For two integers j, k ∈ [d] we use unordered pairs (j, k) = (k, j) to denote undirected edges between vertex k and vertex j. Any symmetric matrix A ∈ R^{d×d} naturally induces an undirected graph G(A), with vertices in the set V and edge set E(G(A)) = {(j, k) : A_{jk} ≠ 0, j ≠ k}. Additionally, if E is an arbitrary edge set (i.e., E ⊆ E), for e := (j, k) ∈ E we use the notations A_e = A_{jk} = A_{kj} interchangeably to denote the element e of the matrix A. Given two sequences {a_n}, {b_n} we write a_n = O(b_n) if there exists a constant C < ∞ such that a_n ≤ C b_n; a_n = o(b_n) if a_n/b_n → 0; and a_n ≍ b_n if there exist positive constants c and C such that c < a_n/b_n < C. Finally, we use the shorthand notations ∧ and ∨ for the min and max of two numbers respectively.

1.4 Organization of the Paper

The paper is structured as follows. Our main result, along with applications to several examples, is presented in Section 2. In Section 3 we outline the NAWT method and illustrate how to apply our procedure to the examples considered in Section 2. In Section 4 we generalize the lower bound strategies from the single-edge NAS setting to the multi-edge NAS setting. Thorough numerical studies and a real data analysis are presented in Section 5. A brief discussion is provided in Section 6. All proofs are deferred to the Appendix.


2 Single-Edge Null-Alternative Separators

In this section we derive a novel and generic lower bound strategy, applicable to null and alternative hypotheses which differ in a single edge: i.e., under the Gaussian model, there exist two matrices Θ0 ∈ S0 and Θ1 ∈ S1 whose induced graphs G0 := G(Θ0) and G1 := G(Θ1) differ in a single edge. We formalize this concept in the definition below.

Definition 2.1 (Single-Edge Null-Alternative Separator). For a sub-decomposition (G0, G1) of G, let G0 = (V, E0) ∈ G0 be a graph under the null. We refer to an edge set C = {e1, . . . , em} as a (single-edge) null-alternative separator (NAS) with null base G0 if for any e ∈ C the graph Ge := (V, E0 ∪ {e}) ∈ G1.

As remarked in the introduction, if a large NAS set exists it is expected that differentiating G0 from an alternative graph Ge ∈ G1 is more challenging. Indeed, our main result of this section confirms this intuition. We proceed to define a predistance for a graph G and two edges e, e′ (which need not belong to E(G)), which plays a key role in our lower bound result.

Definition 2.2 (Edge Geodesic Predistance). Let G = (V, E) and let {e, e′} be a pair of edges (e and e′ may or may not belong to E). We define

d_G(e, e′) := min_{u∈e, v∈e′} d_G(u, v),

where d_G(u, v) denotes the geodesic distance between vertices u and v on the augmented graph G; if no such path exists, d_G(e, e′) = ∞.

d_G(e, e′) is a predistance, i.e., d_G(e, e) = 0 and d_G(e, e′) ≥ 0. Moreover, d_G(e, e′) has the same value regardless of whether e, e′ ∈ E(G). See Figure 1 for an illustration of d_G(e, e′). To sharply characterize combinatorial inference lower bounds, we need to define a notion of structural packing entropy for graphs. The classical packing entropy on metric spaces has been widely used in the statistics community (Yang and Barron, 1999, e.g.) to characterize the information-theoretic lower bound for parametric estimation. We propose the structural packing entropy for graphs as an analog of the classical packing entropy on metric spaces.

Definition 2.3 (Structural Packing Entropy). Let C be a non-empty edge set and G be a graph. For any r ≥ 0 we call the edge set N_r ⊂ C an r-packing of C if for any e, e′ ∈ N_r we have d_G(e, e′) ≥ r. Define the structural r-packing entropy as:

M(C, d_G, r) := log max{ |N_r| : N_r ⊂ C, N_r is an r-packing of C }.    (2.1)
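To make Definitions 2.2 and 2.3 concrete, the following Python sketch (our own illustration, not part of the paper's formal development) computes the edge geodesic predistance by breadth-first search and a greedy r-packing of a candidate NAS set. A greedy packing only lower-bounds the maximal packing, which is all the entropy bound requires; later snippets reuse these helpers.

```python
from collections import deque

def geodesic(adj, u, v):
    """BFS shortest-path distance between vertices u and v; inf if disconnected."""
    if u == v:
        return 0
    dist = {u: 0}
    queue = deque([u])
    while queue:
        w = queue.popleft()
        for x in adj.get(w, ()):
            if x not in dist:
                dist[x] = dist[w] + 1
                if x == v:
                    return dist[x]
                queue.append(x)
    return float("inf")

def edge_predistance(adj, e, f):
    """Definition 2.2: minimum geodesic distance over endpoint pairs of e and f."""
    return min(geodesic(adj, u, v) for u in e for v in f)

def greedy_packing(adj, C, r):
    """Greedily collect edges of C that are pairwise at predistance >= r."""
    packing = []
    for e in C:
        if all(edge_predistance(adj, e, f) >= r for f in packing):
            packing.append(e)
    return packing  # log(len(packing)) lower-bounds M(C, d_G, r)
```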

The packing entropy in Definition 2.3 is an analog of the classical packing entropy on metric spaces in the sense that it is defined over an edge set C equipped with a predistance d_G(e, e′) based on the graph G. Similar to the relation between the packing entropy in metric spaces and the lower bounds for parametric estimation (Yang and Barron, 1999), we will show that the structural packing entropy plays a pivotal role in characterizing the information-theoretic lower bound for graph structural tests. To study minimax lower bounds, we only need to focus on the Gaussian graphical model, whose structural properties are completely characterized by the precision matrices. We now formally define the sets of precision matrices S0 and S1 used in this section. Let:

S0(θ, s) := { Θ ∈ M(s) : G(Θ) ∈ G0, min_{e∈E(G(Θ))} |Θe| ≥ θ }  and    (2.2)
S1(θ, s) := { Θ ∈ M(s) : G(Θ) ∈ G1, min_{e∈E(G(Θ))} |Θe| ≥ θ },    (2.3)

where M(s) is defined in (1.2). The parameter θ in the definitions of S0(θ, s) and S1(θ, s) denotes the signal strength, and as we show below, its magnitude plays an important role in determining whether one can distinguish between graphical models in S0(θ, s) and S1(θ, s).

Theorem 2.1 (Necessary Signal Strength). Let G0 ∈ G0 be a graph whose maximum degree is bounded by an integer D, and let C be a single-edge NAS with null base G0. Suppose that n ≥ (log |C|)/c0 for some c0 > 0, and

θ ≤ κ √( M(C, d_{G0}, log |C|) / n ) ∧ (1 − C⁻¹)/(√2 (D + 2)),    (2.4)

where C is defined in (1.2). Then if M(C, d_{G0}, log |C|) → ∞ as n → ∞, there exists a sufficiently small constant κ in (2.4) (depending on D, C, L, c0) such that:

liminf_{n→∞} γ(S0(θ, s), S1(θ, s)) = 1.

Theorem 2.1 parallels the classical minimax results in Yang and Barron (1999), but unlike their work we quantify the signal strength necessary for testing combinatorial structure rather than estimating Euclidean parameters. The radius log |C| of the packing entropy in (2.4) ensures that the pairs of distinct edges are sufficiently far apart. The constant term (1 − C⁻¹)/(√2 (D + 2)) in (2.4) ensures that precision matrices with signal strength θ indeed belong to M(s).

Proof Sketch. The proof of Theorem 2.1 can roughly be divided into four steps. More details are provided in Appendix A.

Step 1 (Connect the structural parameters to geometric parameters). Given the adjacency matrices of the null and alternative graphs G0 and {Ge}_{e∈C}, we construct the corresponding precision matrices and make sure that they belong to S0(θ, s) and S1(θ, s).

Step 2 (Construct a risk lower bound using Le Cam's method). The second step uses Le Cam's method to lower bound the risk γ(S0(θ, s), S1(θ, s)). This requires us to evaluate the chi-square divergence between a normal and a mixture normal distribution. The chi-square divergence can be expressed via ratios of determinants. In particular, we show that the log chi-square divergence can be equivalently re-expressed via an infinite sum of differences among trace operators of adjacency matrix powers.

Step 3 (Represent the lower bound by the number of shortest closed walks in the graph). In this step we control the deviations of the differences of the trace operators. Since the trace of the power of an adjacency matrix equals the number of closed walks within the corresponding graph, we eliminate the trace powers which are smaller than the shortest closed walks. The traces of the higher powers are handled via matrix perturbation bounds.

Step 4 (Characterize the smallest magnitude of the geometric parameter using the packing entropy). Lastly, we show that condition (2.4) ensures that the closed walks on the packing of the NAS set are sufficiently long, which shows that the chi-square divergence vanishes when the signal strength θ is small.
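For orientation on Step 2, the following display records the standard mixture form of Le Cam's method that the sketch relies on; this is a textbook inequality stated here as a reading aid, not quoted from the paper's appendix. Writing P_{Θ0}^{⊗n} for the null product measure and P̄ for the uniform mixture of the alternative product measures over a packing of the NAS set:

```latex
\gamma(S_0, S_1)
  \;\ge\; 1 - \mathrm{TV}\!\left(P_{\Theta_0}^{\otimes n}, \bar{P}\right)
  \;\ge\; 1 - \tfrac{1}{2}\sqrt{\chi^2\!\left(\bar{P} \,\|\, P_{\Theta_0}^{\otimes n}\right)},
\qquad
\chi^2(Q \,\|\, P) := \int \frac{(\mathrm{d}Q)^2}{\mathrm{d}P} - 1 .
```

Hence it suffices to show that the chi-square divergence of the mixture from the null vanishes whenever θ satisfies (2.4), which is exactly what Steps 3 and 4 accomplish.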

2.1 Some Applications

In this section we give several examples of combinatorial testing which readily fall into the framework developed in Section 2.

Example 2.1 (Connectivity Testing). Consider the sub-decomposition G0 = {G ∈ G : G disconnected} vs G1 = {G ∈ G : G connected}. We construct a base graph G0 := (V, E0) where

E0 := { (j, j + 1)_{j=1}^{⌊d/2⌋−1}, (⌊d/2⌋, 1), (j, j + 1)_{j=⌊d/2⌋+1}^{d−1}, (⌊d/2⌋ + 1, d) },

and let C := { (j, ⌊d/2⌋ + j)_{j=1}^{⌊d/2⌋} } (see Figure 1). Clearly adding any edge from C to G0 connects the graph, so C is a single-edge NAS with null base G0. Furthermore, the maximum degree of G0 equals 2 by construction. To construct a packing set of C, we collect all edges (j, ⌊d/2⌋ + j) such that ⌈log |C|⌉ divides j, except if j > ⌊d/2⌋ − ⌈log |C|⌉. This procedure results in a packing set with radius at least ⌈log |C|⌉ which has cardinality at least ⌊|C|/⌈log |C|⌉⌋ − 1. Therefore

M(C, d_{G0}, log |C|) ≥ log( ⌊|C|/⌈log |C|⌉⌋ − 1 ) ≍ log |C| ≍ log d.

Theorem 2.1 implies that testing connectivity is impossible when θ < κ √(log d/n) ∧ (1 − C⁻¹)/(4√2).
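As a sanity check on this construction, the short sketch below (our own illustration, reusing the geodesic helper from the earlier snippet) builds E0 and C for a given even d and verifies that the null base is disconnected while adding any single NAS edge connects it.

```python
def example_2_1(d):
    """Null base of Example 2.1: two disjoint cycles on {1..d/2} and {d/2+1..d}."""
    h = d // 2
    E0 = ([(j, j + 1) for j in range(1, h)] + [(h, 1)]
          + [(j, j + 1) for j in range(h + 1, d)] + [(h + 1, d)])
    C = [(j, h + j) for j in range(1, h + 1)]  # single-edge NAS
    return E0, C

def to_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def connected(edges, d):
    adj = to_adj(edges)
    return all(geodesic(adj, 1, v) < float("inf") for v in range(2, d + 1))

E0, C = example_2_1(10)
assert not connected(E0, 10)                    # null base is disconnected
assert all(connected(E0 + [e], 10) for e in C)  # adding any NAS edge connects it
```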

Figure 1: The graph G0 with two edges e, e′ ∈ C such that d_{G0}(e, e′) = 2, d = 10.

Example 2.2 (m + 1 vs m Connected Components, m ≥ √d). Let m ≥ √d be an integer. In this example we are interested in testing whether the graph contains m + 1 connected components vs m connected components. The reason to assume m ≥ √d is to make sure there are sufficiently many edges for constructing a single-edge NAS in order to obtain sharp bounds. The case when m < √d is treated in Example 2.7 via a different NAS construction (in fact, the case m < √d requires deleting edges from the alternative rather than adding edges to the null base; see Section 2.2 for more details). Formally we have the sub-decomposition G0 = {G ∈ G : G has ≥ m + 1 connected components} vs G1 = {G ∈ G : G has ≤ m connected components}. Construct the null base graph G0 = (V, E0), where

E0 := { (j, j + 1)_{j=1}^{d−m−1} },

and we let C := { (j, j + 1)_{j=d−m}^{d−1} } (see Figure 2). Adding an edge e ∈ C to G0 converts the base graph G0 into a graph with m connected components, and therefore C is a single-edge NAS with null base G0. Additionally, the maximum degree of G0 is 2 by construction. Note that the distance between any two edges in C is 0 if and only if they share a common vertex, and ∞ in all other cases. This implies that we can construct a packing set by taking every other edge in the set C. We conclude that M(C, d_{G0}, log |C|) ≥ log(|C|/2) ≍ log d. Hence, by Theorem 2.1 the minimax risk goes to 1 when θ < κ √(log d/n) ∧ (1 − C⁻¹)/(4√2).

Figure 2: Null base graph G0 with d − m − 1 edges, NAS C (dashed), d_{G0}(e, e′) = ∞, d = 8.

Example 2.3 (Cycle Detection). Consider testing whether the graph is acyclic vs the graph contains a cycle. Let G0 = {G ∈ G : G is cycle-free} and G1 = {G ∈ G : G contains a cycle}. Define the null base graph G0 = (V, E0),

E0 := { (j, j + 1)_{j=1}^{d−1} },

and let the edge set C := { (j, j + 2)_{j=1}^{d} }, where the addition is taken modulo d (refer to Figure 3 for a visualization). By construction we have G0 ∈ G0 and |C| = d. Adding any edge from C to G0 results in a graph with a cycle, and hence the edge set C is a single-edge NAS with null base G0. The maximum degree of G0 equals 2, and is thus bounded. Moreover, there exists a log |C|-packing set of C of cardinality at least (d − 2)/(⌈log |C|⌉ + 2), which can be produced by collecting the edges (j, j + 2) for j = k(⌈log |C|⌉ + 2) + 1, k = 0, 1, . . . and j ≤ d − 2. The last observation implies that M(C, d_{G0}, log |C|) ≥ log( (d − 2)/(⌈log |C|⌉ + 2) ) ≍ log d. Hence by Theorem 2.1 we conclude that the minimax risk goes to 1 when θ < κ √(log d/n) ∧ (1 − C⁻¹)/(4√2).

Figure 3: The graph G0 with two (dashed) edges e, e′ ∈ C such that d_{G0}(e, e′) = 2, d = 7.

Example 2.4 (Tree vs Connected Graph with Cycles). The construction in Example 2.3 also shows that we have the same signal strength limitation when testing for cycles, even if we restrict to the subclass of connected graphs, i.e., the class of graphs under the null hypothesis is the class of all trees G0 = {G ∈ G : G is a tree}, and the alternative is the class of all connected graphs which contain a cycle, G1 = {G ∈ G : G is connected but is not a tree}.

Example 2.5 (Triangle-Free Graph). Consider testing whether the graph contains a triangle (i.e., a 3-clique). More formally, let the decomposition of G be G0 = {G ∈ G : G is triangle-free} and G1 = {G ∈ G : ∃ 3-clique subgraph of G}. It is clear that in this case we can reuse the set C and its null base G̃0 = (V, E0 ∪ {(1, d)}), where E0 and C are taken as in Example 2.3.

Example 2.6 (Self-Avoiding Path (SAP) of Length m vs m + 1, m < √d). In this example we test the existence of a SAP of length ≤ m vs a SAP of length ≥ m + 1, i.e., G0 = {G ∈ G : all SAPs have length ≤ m} vs G1 = {G ∈ G : ∃ SAP of length m + 1}. We assume that m < √d in order to be able to construct a sufficiently large NAS set which yields tight bounds. The case m ≥ √d is treated in Example 2.8. Without loss of generality we further suppose that m + 2 divides d (at the cost of assigning some isolated vertices in the graph). Construct the following graph in G0: G0 = (V, E0), where

E0 := { (j, j + 1) : neither (m + 2) divides j nor (m + 2) divides (j + 1) },

and hence |E0| = dm/(m + 2) (see Figure 4). Consider the class:

C := { (j(m + 2) − 1, j(m + 2))_{j∈[d/(m+2)]} }.

The set C is a NAS since adding any edge from C to G0 results in a graph with a longer self-avoiding path. Furthermore, the maximum degree of G0 is 2. Thus when θ < (1 − C⁻¹)/(4√2) we can apply Theorem 2.1. Notice that the set C is its own log |C|-packing, since any two edges e, e′ ∈ C satisfy d_{G0}(e, e′) = ∞. Hence M(C, d_{G0}, log |C|) = log |C| ≍ log d. We conclude that if θ ≤ κ √(log d/n) we cannot differentiate the null from the alternative hypothesis.

Figure 4: The null base G0 and the NAS C (dashed) with d_{G0}(e, e′) = ∞, d = 15, m = 3.

2.2 Edge Deletion NAS Extension

Though Theorem 2.1 delivers valid lower bounds, the obtained bounds may not always be tight. For example, recall the construction of Example 2.6. When m = d − 2 the NAS set contains only a single edge, and therefore (2.4) is not sharp. Sometimes tighter bounds can be obtained via constructions which delete edges from the alternative graph instead of adding edges to the graph under the null. In this subsection we extend the NAS definition to allow for deleting edges from the alternative hypothesis, which is sometimes more convenient and gives more informative lower bounds, as we demonstrate below.

Definition 2.4 (Single-Edge Deletion Null-Alternative Separator). Let G1 = (V, E1) be an alternative graph. An edge set C is called a (single-edge) deletion NAS with alternative base G1 if for all e ∈ C the graphs G\e = (V, E1 \ {e}) ∈ G0.

In parallel to Theorem 2.1 we have the following result using the deletion NAS.

Theorem 2.2. Let G1 ∈ G1 be a graph whose maximum degree is bounded by a fixed integer D, and let C be a single-edge deletion NAS with alternative base G1. Suppose that n ≥ (log |C|)/c0 for some c0 > 0 and that:

θ ≤ κ √( M(C, d_{G1}, log |C|) / n ) ∧ (1 − C⁻¹)/(√2 D).    (2.5)

Then, if M(C, d_{G1}, log |C|) → ∞ as n → ∞, for a sufficiently small κ (depending on D, C, L, c0) we have:

liminf_{n→∞} γ(S0(θ, s), S1(θ, s)) = 1.

To illustrate the usefulness of Theorem 2.2 for the edge deletion NAS, we consider the examples below.

Example 2.7 (m + 1 vs m Connected Components, m < √d). In this example we consider a complementary setting to Example 2.2, and test whether the graph contains m + 1 connected components vs m connected components for m < √d. Recall the decomposition G0 = {G ∈ G : G has ≥ m + 1 connected components} vs G1 = {G ∈ G : G has ≤ m connected components}. Construct an alternative base graph G1 = (V, E1):

E1 := { (j, j + 1)_{j=1}^{d−m} },

and set C := E1 = { (j, j + 1)_{j=1}^{d−m} } (see Figure 5 for a visualization). Since removing any edge from C increases the number of connected components by 1, the set C is a single-edge deletion NAS with alternative base graph G1. Moreover, notice that the maximum degree of G1 is 2. The entropy M(C, d_{G1}, log |C|) satisfies M(C, d_{G1}, log |C|) ≥ log( |C|/log |C| ) ≍ log |C| ≍ log d. Hence by Theorem 2.2 we conclude that the test is impossible when θ < κ √(log d/n) ∧ (1 − C⁻¹)/(2√2).

Figure 5: The graph G1 with d − m edges and m − 1 isolated nodes, d = 8.

Example 2.8 (Self-Avoiding Path (SAP) of Length m vs m + 1, m ≥ √d). In parallel to Example 2.6 we now consider testing SAP length ≤ m vs SAP length ≥ m + 1 when m ≥ √d. Define G0 = {G ∈ G : all SAPs have length ≤ m} vs G1 = {G ∈ G : ∃ SAP of length m + 1}. Construct the alternative graph G1 = (V, E1):

E1 := { (j, j + 1)_{j=1}^{m+1} }.

Define the set C := E1 = { (j, j + 1)_{j=1}^{m+1} }, and note that removing any edge from C results in a graph from the null. Hence C is a deletion NAS with base G1. Moreover, the maximum degree of G1 is 2. Similarly to Example 2.7, it follows that M(C, d_{G1}, log |C|) ≍ log |C| ≍ log d. An application of Theorem 2.2 implies that for values of θ < κ √(log d/n) ∧ (1 − C⁻¹)/(2√2) testing the length of the SAP is impossible.

Figure 6: The graph G1 with m + 1 edges and two edges e, e′ ∈ C with d_{G1}(e, e′) = 1.

Remark 2.1. When Θ∗ ∈ M(s), the results in Theorem 2.1 and Theorem 2.2 suggest that a signal strength of order √(log d/n) is necessary for controlling the risk (1.3). In fact, Theorem 7 of Cai et al. (2011) shows that under such a signal strength condition, support recovery of Θ∗ is indeed achievable, which further implies that controlling the risk (1.3) is possible. A naive procedure for matching the lower bound is to first perfectly recover the graph structure, and then construct a test based on examining whether the graph has the desired combinatorial structure. Though such an approach is theoretically feasible, it is not practical. First, it is overly conservative and does not allow us to tightly control the type-I error at a desired level. Second, it crucially depends on having a suitable thresholding parameter to estimate the graph, which is in general not realistic.

In the next section, we present a family of novel test procedures which do not require perfect support recovery of the full graph. Compared to the naive approach described in Remark 2.1, our tests explicitly exploit the combinatorial structure of the targeted hypotheses and can control the type-I error at any desired level.

3 Null-Alternative Witness Technique

We start with a high-level outline of a new combinatorial inference approach for graphical models. For clarity we mainly present the case of Gaussian graphical models, and comment on extensions to other graphical models in Section 3.3. We assume henceforth that the full graph G = (V, E) belongs to G1. For a graph G ∈ G1, define the following class of edge sets

W1(G) = { E : ∀ E′, E ⊆ E′ ⊆ E(G) ⇒ (V, E′) ∈ G1 }.

The set W1(G) collects all edge sets forming graphs in G1 which can be obtained from the graph G by iteratively pruning one edge at a time while maintaining that in each step the

resulting graph still belongs to G1. We use the shorthand notation W̄1 = W1(G), where G is the complete graph. Consider the following parameter sets:

S0(s) := { Θ ∈ M(s) : G(Θ) ∈ G0 },  and    (3.1)
S1(θ, s) := { Θ ∈ M(s) : G(Θ) ∈ G1, max_{E∈W̄1} min_{e∈E} |Θe| ≥ θ }.    (3.2)

The parameter space (3.1) does not impose any assumption on the minimum signal strength, and is thus broader than the one defined in (2.2). In definition (3.2), the signal strength is not imposed on all the edges of the alternative graphs. In fact, we only need to impose the signal strength assumption on a subset of edges which can be obtained by pruning the complete graph. Such a condition is much weaker than the usual condition needed for perfect graph recovery. We note that for sub-decompositions (G0, G1) satisfying W1(G) ⊆ W̄1 for all G ∈ G1, the parameter space (3.2) is strictly larger than the parameter space (2.3).

Given n independent samples Xi ∼ N(0, (Θ∗)⁻¹) and a sub-decomposition (G0, G1), we formulate a procedure for testing H0 : Θ∗ ∈ S0(s) vs H1 : Θ∗ ∈ S1(θ, s). Let Σ̂ := n⁻¹ Σ_{i=1}^n Xi^{⊗2} be the empirical covariance matrix. Let Θ̂ be any estimator of the precision matrix Θ∗ satisfying, for some fixed constant K > 0:

‖Θ̂ − Θ∗‖max ≤ K √(log d/n),    (3.3)
‖Θ̂ − Θ∗‖₁ ≤ Ks √(log d/n),  ‖Σ̂Θ̂ − Id‖max ≤ K √(log d/n),    (3.4)

with probability at least 1 − d⁻¹ uniformly over the parameter space M(s) (recall definition (1.2)). An example of an estimator of Θ∗ with these properties is the CLIME procedure introduced by Cai et al. (2011) (see also (5.2)). An overview of the null-alternative witness technique (NAWT) is sketched below:

i. In the first step NAWT identifies a minimal structure witnessing the alternative;
ii. In the second step NAWT attempts to certify that the minimal structure identified by the first step is indeed present in the graph.
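The paper's theory only requires an estimator satisfying (3.3)-(3.4), such as CLIME. As a hedged illustration (not the authors' implementation), one could substitute scikit-learn's graphical lasso, which under standard conditions also yields a max-norm-consistent precision estimate; the helper below is our own sketch of the data-splitting step.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

def precision_estimates(X, seed=0):
    """Split the sample in half and return (Theta_hat_1, Theta_hat_2).

    A stand-in for the paper's CLIME-based estimates; any estimator
    satisfying the rate conditions (3.3)-(3.4) could be plugged in here.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    halves = (X[perm[: n // 2]], X[perm[n // 2:]])
    return tuple(GraphicalLassoCV().fit(D).precision_ for D in halves)
```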

Split the data D = {Xi}_{i=1}^n into two approximately equal-sized sets D1 = {Xi}_{i=1}^{⌊n/2⌋}, D2 = {Xi}_{i=⌊n/2⌋+1}^n and obtain estimates Θ̂^(1), Θ̂^(2) on D1 and D2 correspondingly. For the first step, we exploit Θ̂^(1) to solve the following max-min combinatorial optimization problem

Ê := argmax_{E∈W̄1} min_{e∈E} |Θ̂_e^(1)|,    (3.5)

where edge sets with the smallest cardinality are preferred, and further ties are broken arbitrarily. Program (3.5) aims at identifying the smallest edge set in W̄1 whose minimal signal is as large as possible. Given a consistent estimator Θ̂^(1) and sufficiently strong signal strength, the solution of (3.5) identifies a minimal substructure of G(Θ∗) belonging to G1. This strategy is motivated by the definition of the alternative parameter set S1(θ, s) in (3.2). We remark that solving program (3.5) could be computationally challenging for some combinatorial tests. However, for all examples considered in this paper, simple and efficient polynomial-time algorithms are available. We refer to the graph (V, Ê) as the minimal structure witnessing the alternative. Although the minimal structure witness is defined in full generality, for ease of presentation we justify its validity on a case-by-case basis.

In the second step, NAWT attempts to certify the witness structure using the estimate Θ̂^(2). Formally, NAWT aims to test the hypothesis

H0 : ∃ e ∈ Ê : Θ∗_e = 0  vs  H1 : ∀ e ∈ Ê : Θ∗_e ≠ 0.    (3.6)

A rejection of the null hypothesis in (3.6) certifies the presence of the alternative witness structure. If the test fails to reject, NAWT cannot reject the null structure hypothesis. In Section 3.1 we give a detailed description of how NAWT works.

3.1 Minimal Structure Certification

In this section we detail an algorithm for testing (3.6). In fact, we present a general test for the following multiple testing problem

H0 : ∃ e ∈ E s.t. Θ∗_e = 0  vs  H1 : ∀ e ∈ E, Θ∗_e ≠ 0,

using the data X = {Xi}_{i=1}^n, Xi ∼ N(0, (Θ∗)⁻¹), where E is a pre-given edge set. Following Neykov et al. (2015), for any j, k ∈ [d] we define the bias-corrected estimate:

Θ̃_jk := Θ̂_jk − ( Θ̂_{∗j}ᵀ (Σ̂ Θ̂_{∗k} − e_k) ) / ( Θ̂_{∗j}ᵀ Σ̂_{∗j} ),

where e_k is the canonical unit vector with 1 in its k-th entry. Under mild regularity conditions, Neykov et al. (2015) show that Θ̃_jk admits the following Bahadur representation:

√n Θ̃_jk = n^{−1/2} Σ_{i=1}^n ( Θ∗_{∗j}ᵀ Xi^{⊗2} Θ∗_{∗k} − Θ∗_jk ) + o_p(1).

This motivates a multiplier bootstrap scheme for approximating the distribution of the statistic √n Θ̃_jk under the null hypothesis:

Ŵ_jk = n^{−1/2} Σ_{i=1}^n ( Θ̂_{∗j}ᵀ Xi^{⊗2} Θ̂_{∗k} − Θ̂_jk ) ζ_i,

where ζ_i ∼ N(0, 1), i ∈ [n], are independent and identically distributed. To approximate the null distribution of the statistic max_{(j,k)∈S} √n |Θ̃_jk| over a subset S ⊆ E, let c_{1−α,S} denote the (1 − α)-quantile of the statistic max_{(j,k)∈S} |Ŵ_jk| (conditioned on the dataset X). Formally, we let

c_{1−α,S} := inf_{t∈R} { t : P( max_{(j,k)∈S} |Ŵ_jk| ≤ t | X ) ≥ 1 − α }.    (3.7)
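A minimal Monte Carlo sketch of this certification machinery follows; it is our own illustration, with the debiasing loop written for clarity rather than speed, and the Monte Carlo size B an assumed tuning input.

```python
import numpy as np

def debiased_stats(X, Theta_hat):
    """Bias-corrected estimates Theta_tilde_{jk} from the display above."""
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    I = np.eye(d)
    Theta_tilde = np.empty((d, d))
    for j in range(d):
        tj = Theta_hat[:, j]
        denom = tj @ Sigma_hat[:, j]
        for k in range(d):
            resid = Sigma_hat @ Theta_hat[:, k] - I[:, k]
            Theta_tilde[j, k] = Theta_hat[j, k] - tj @ resid / denom
    return Theta_tilde

def bootstrap_quantile(X, Theta_hat, S, alpha, B=1000, seed=0):
    """Monte Carlo estimate of the quantile c_{1-alpha,S} in (3.7)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Per-sample summands: Theta_hat_{*j}^T X_i X_i^T Theta_hat_{*k} - Theta_hat_{jk}
    summands = np.stack([(X @ Theta_hat[:, j]) * (X @ Theta_hat[:, k])
                         - Theta_hat[j, k] for (j, k) in S], axis=1)
    maxima = np.empty(B)
    for b in range(B):
        zeta = rng.standard_normal(n)           # Gaussian multipliers
        W = summands.T @ zeta / np.sqrt(n)      # bootstrap statistics W_hat
        maxima[b] = np.max(np.abs(W))
    return np.quantile(maxima, 1 - alpha)
```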

As defined, c_{1−α,S} is a population quantity (conditional on X). In practice an arbitrarily accurate estimate ĉ_{1−α,S} of c_{1−α,S} can be obtained via Monte Carlo simulation. Below we describe a multiple edge testing procedure returning a subset Ê^{nc} ⊆ E of rejected edges (hypotheses). Our procedure is based on the step-down construction of Chernozhukov et al. (2013), which is inspired by the multiple testing method of Romano and Wolf (2005).

Algorithm 1 Multiple Edge Testing
  Initialize Ê_n ← E
  repeat
    Reject R ← { e ∈ Ê_n : √n |Θ̃_e| ≥ c_{1−α,Ê_n} }
    Update Ê_n ← Ê_n \ R
  until R = ∅ or Ê_n = ∅
  return Ê^{nc} ← E \ Ê_n

Decompose the edge set E = E^n ∪ E^{nc}, where E^n ∩ E^{nc} = ∅, E^n is the subset of true null edges and E^{nc} is the set of non-null edges. Define the parameter space

M_{E^n,E^{nc}}(s, κ) := { Θ ∈ M(s) : max_{e∈E^n} |Θe| = 0, min_{e∈E^{nc}} |Θe| ≥ κ √(log d/n) }.    (3.8)
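Algorithm 1 translates almost line for line into code. The sketch below mirrors the step-down loop, reusing the hypothetical helpers from the previous snippet; it is an illustration, not the paper's exact implementation.

```python
import numpy as np

def step_down_test(X, Theta_hat, E, alpha, B=1000):
    """Step-down multiple edge testing (Algorithm 1): returns rejected edges."""
    n = X.shape[0]
    Theta_tilde = debiased_stats(X, Theta_hat)
    stats = {e: np.sqrt(n) * abs(Theta_tilde[e]) for e in E}
    active = set(E)                   # current E_hat_n
    while active:
        c = bootstrap_quantile(X, Theta_hat, sorted(active), alpha, B=B)
        rejected = {e for e in active if stats[e] >= c}
        if not rejected:
            break
        active -= rejected
    return set(E) - active            # E_hat^{nc}: edges certified non-null
```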

For a fixed edge set E, we say that an edge set Ê^r has strong control of the family-wise error rate (FWER) if

limsup_{n→∞} sup_{E^n⊂E} sup_{Θ∗∈M_{E^n,E^{nc}}(s,κ)} P( E^n ∩ Ê^r ≠ ∅ ) ≤ α,    (3.9)

for some pre-specified size α > 0. Our next result shows that Algorithm 1 returns an edge set Ê^{nc} with strong control of the FWER.

Proposition 3.1 (Strong FWER Test). Let Θ∗ ∈ M_{E^n,E^{nc}}(s, κ) and furthermore:

s log(nd) √(log d) log(nd)/√n = o(1),  (log(dn))^6/n = o(1).    (3.10)

Then for any fixed edge set E, the output Ê^{nc} of Algorithm 1 satisfies (3.9). In addition, if the constant κ in (3.8) satisfies κ ≥ C′C⁴ for a fixed absolute constant C′ > 0, we have:

liminf_{n→∞} inf_{E^n⊆E} inf_{Θ∗∈M_{E^n,E^{nc}}(s,κ)} P( Ê^{nc} = E^{nc} ) = 1.

Of note, when κ is sufficiently large, Algorithm 1 achieves exact control of the FWER, i.e., we have equality in (3.9). This happens since all edges in E^{nc} will be rejected with overwhelming probability, while the bootstrap comparison is asymptotically exact for the remaining edges E^n. As a consequence of this result, if the null hypothesis set S0(s) considered in (3.1) exhibits signal strength as in definition (2.2), the NAWT tests are exact.

Definition 3.1. For an edge set E let Ê^{nc} be the output of Algorithm 1 on data X with level 1 − α. Define the following test function:

ψ^B_{α,E}(X) := 1(E = Ê^{nc}).

ψ^B_{α,E}(X) tests whether the set E is comprised only of non-null edges.

3.2 Examples

In this section we describe practical algorithms based on the NAWT for the testing problems outlined in Section 2.1. Our tests can distinguish the null from the alternative hypotheses when the minimum signal strength is sufficiently large. As we shall see, the magnitude of the required signal strength is precisely of order √(log d/n), and therefore in view of Section 2 these tests are minimax optimal. Recall that we observe n i.i.d. samples {Xi}_{i=1}^n from Xi ∼ N_d(0, (Θ∗)⁻¹). We split the data into D1 = {Xi}_{i=1}^{⌊n/2⌋}, D2 = {Xi}_{i=⌊n/2⌋+1}^n and obtain estimates Θ̂^(1), Θ̂^(2) on D1 and D2 correspondingly, for which (3.3) and (3.4) hold.

3.2.1 Connectivity Testing

This section proposes a new procedure for honestly testing whether G(Θ∗) is a connected graph. Accordingly, the sub-decomposition is G0 := {G ∈ G : G disconnected} and G1 := {G ∈ G : G connected}. It is simple to check that definitions (3.1) and (3.2) reduce to:

S0(s) = { Θ ∈ M(s) : G(Θ) ∈ G0 },    (3.11)
S1(θ, s) = { Θ ∈ M(s) : G(Θ) ∈ G1, ∃ E ⊂ E(G(Θ)) such that (V, E) is a tree and min_{e∈E} |Θe| ≥ θ }.    (3.12)

Finding the minimal structure witness (3.5) reduces to finding a maximum spanning tree (MST) T̂ on the full graph with edge weights |Θ̂_e^(1)|. The complexity of finding an MST is O(d² log d), where d is the number of vertices. We summarize the procedure below:

Algorithm 2 Connectivity Test
  Input: D = {Xi}_{i=1}^n, level 0 < α < 1.
  Split the data into D1 = {Xi}_{i=1}^{⌊n/2⌋}, D2 = {Xi}_{i=⌊n/2⌋+1}^n
  Using D1 obtain an estimate Θ̂^(1) satisfying (3.3)
  Find an MST T̂ on the full graph with weights |Θ̂_e^(1)|
  Output: ψ^B_{α,T̂}(D2).
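A sketch of Algorithm 2's witness step, assuming scipy is available: a maximum spanning tree is obtained by negating the weights, and the certification step then calls the step-down test from Section 3.1 on the tree's edges.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_witness(Theta_hat1):
    """Edges of the maximum spanning tree under weights |Theta_hat1_e|."""
    weights = np.abs(Theta_hat1.copy())
    np.fill_diagonal(weights, 0.0)
    # scipy computes a MINIMUM spanning tree, so negate to get the maximum one.
    tree = minimum_spanning_tree(-weights).tocoo()
    return [(j, k) for j, k in zip(tree.row, tree.col)]
```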


The results on connectivity testing are summarized in the following corollary.

Corollary 3.2. Let θ = (2K + C′C⁴) √(log d/⌊n/2⌋) for an absolute constant C′ > 0 and assume that (3.10) holds. Then for any fixed level α, the test ψ^B_{α,T̂}(D2) from the output of Algorithm 2 satisfies:

limsup_{n→∞} sup_{Θ∗∈S0(s)} P(reject H0) ≤ α,  liminf_{n→∞} inf_{Θ∗∈S1(θ,s)} P(reject H0) = 1.

3.2.2 Connected Components Testing

The connected components test is more general than the connectivity test. For an integer m ∈ [d − 1], let H0 : # connected components ≥ m + 1 vs H1 : # connected components ≤ m. Testing connectivity is the special case m = 1. Recall that a sub-graph F of G, i.e., E(F) ⊂ E(G) and V(F) = V(G) = V, is called a spanning forest of G if F contains no cycles, adding any edge e ∈ E(G) \ E(F) to E(F) creates a cycle, and |E(F)| is maximal. This definition extends naturally to graphs with positive weights on their edges. For a graph G let SF(G) denote the class of spanning forests of G. Define the set of precision matrices SF(θ) := {Θ : ∃ F ∈ SF(G(Θ)) such that min_{e∈F} |Θe| ≥ θ}. For a fixed m ∈ [d − 1], definitions (3.1) and (3.2) reduce to:

S0(s) := { Θ ∈ M(s) : G(Θ) ∈ G_{m+1} },
S1(θ, s) := { Θ ∈ SF(θ) ∩ M(s) : G(Θ) ∈ ∪_{j≤m} G_j },

where G_j := {G ∈ G : G has exactly j connected components} for all j ∈ N.

Algorithm 3 Connected Components Test
  Input: D = {Xi}_{i=1}^n, level 0 < α < 1.
  Split the data into D1 = {Xi}_{i=1}^{⌊n/2⌋}, D2 = {Xi}_{i=⌊n/2⌋+1}^n
  Using D1 obtain an estimate Θ̂^(1) satisfying (3.3)
  Find a maximum spanning forest (MSF) F̂ with m connected components under weights |Θ̂_e^(1)|; let Ê = E(F̂)
  Output: ψ^B_{α,Ê}(D2).

We have the following result regarding the asymptotic performance of Algorithm 3.

Corollary 3.3. Let θ = (2K + C′C⁴) √(log d/⌊n/2⌋) for an absolute constant C′ > 0 and assume that (3.10) holds. Then for any fixed level α, the test ψ^B_{α,Ê}(D2) from the output of Algorithm 3 satisfies:

limsup_{n→∞} sup_{Θ∗∈S0(s)} P(reject H0) ≤ α,  liminf_{n→∞} inf_{Θ∗∈S1(θ,s)} P(reject H0) = 1.
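For completeness, the forest step of Algorithm 3 admits a small reduction to the MST sketch above: under the illustrative assumption that all off-diagonal weights are strictly positive, the maximum spanning forest with m components is the maximum spanning tree with its m − 1 lightest edges removed (this follows from running Kruskal's greedy algorithm in decreasing weight order and stopping after d − m accepted edges).

```python
import numpy as np

def msf_witness(Theta_hat1, m):
    """Maximum spanning forest with m components (sketch; reuses mst_witness)."""
    edges = mst_witness(Theta_hat1)  # d - 1 tree edges
    # Drop the m - 1 edges with the smallest |Theta_hat1| weight.
    edges.sort(key=lambda e: abs(Theta_hat1[e]), reverse=True)
    return edges[: len(edges) - (m - 1)]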

3.2.3 Cycle Detection

In this section we outline an algorithm testing whether the graph is acyclic. Recall the sub-decomposition G0 = {G ∈ G : G contains no cycles} and G1 = {G ∈ G : G contains a cycle}. The parameter spaces (3.1) and (3.2) reduce to

S0(s) := { Θ ∈ M(s) : G(Θ) ∈ G0 },
S1(θ, s) := { Θ ∈ M(s) : G(Θ) ∈ G1, ∃ E ⊂ E(G(Θ)) such that E is a cycle and min_{e∈E} |Θe| ≥ θ }.

The cycle presence test is compactly summarized below:

Algorithm 4 Cycle Presence Test
  Input: D = {Xi}_{i=1}^n, level 0 < α < 1.
  Split the data into D1 = {Xi}_{i=1}^{⌊n/2⌋}, D2 = {Xi}_{i=⌊n/2⌋+1}^n
  Using D1 obtain an estimate Θ̂^(1) satisfying (3.3); sort the edges according to the weights |Θ̂_e^(1)|
  Iteratively add edges e from high to low |Θ̂_e^(1)| values until a cycle is present (this requires no more than d iterations)
  Let Ê be the set of edges forming the cycle
  Output: ψ^B_{α,Ê}(D2).
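The edge-adding loop of Algorithm 4 is Kruskal-style, so a union-find structure finds the first cycle efficiently. The sketch below (our own illustration, assuming a numpy weight matrix) also recovers the cycle's edge set by closing the unique tree path between the endpoints of the offending edge.

```python
from collections import deque
from itertools import combinations

def cycle_witness(Theta_hat1):
    """Add edges by decreasing |Theta_hat1_e| until one closes a cycle (Algorithm 4)."""
    d = Theta_hat1.shape[0]
    order = sorted(combinations(range(d), 2),
                   key=lambda e: abs(Theta_hat1[e]), reverse=True)
    parent = list(range(d))
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = {v: set() for v in range(d)}
    for (u, v) in order:
        if find(u) == find(v):        # (u, v) closes the first cycle
            path = tree_path(forest, u, v)
            return list(zip(path, path[1:])) + [(u, v)]
        parent[find(u)] = find(v)
        forest[u].add(v); forest[v].add(u)
    return None                       # unreachable for d >= 3

def tree_path(adj, u, v):
    """Unique u-v path in a forest via BFS parent pointers."""
    prev, queue = {u: None}, deque([u])
    while queue:
        w = queue.popleft()
        for x in adj[w]:
            if x not in prev:
                prev[x] = w
                queue.append(x)
    path, w = [], v
    while w is not None:
        path.append(w)
        w = prev[w]
    return path[::-1]
```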

The following result justifies the validity of Algorithm 4.

Corollary 3.4. Let θ = (2K + C′C⁴) √(log d/⌊n/2⌋) for an absolute constant C′ > 0 and assume that (3.10) holds. Then for any fixed level α, the test ψ^B_{α,Ê}(D2) from the output of Algorithm 4 satisfies:

limsup_{n→∞} sup_{Θ∗∈S0(s)} P(reject H0) ≤ α,  liminf_{n→∞} inf_{Θ∗∈S1(θ,s)} P(reject H0) = 1.

3.2.4 Triangle-Free Graph Testing

In this section we discuss an algorithm for testing whether the graph is triangle-free. The corresponding sub-decomposition is G0 = {G ∈ G : G is triangle-free} and G1 = {G ∈ G : ∃ 3-clique subgraph of G}. The parameter sets of precision matrices specified by (3.1) and (3.2) reduce to

S0(s) := { Θ ∈ M(s) : G(Θ) ∈ G0 },
S1(θ, s) := { Θ ∈ M(s) : G(Θ) ∈ G1, ∃ E ⊂ E(G(Θ)) such that E is a triangle and min_{e∈E} |Θe| ≥ θ }.

We summarize the triangle-free test below:

Algorithm 5 Triangle-Free Test
  Input: D = {Xi}_{i=1}^n, level 0 < α < 1.
  Split the data into D1 = {Xi}_{i=1}^{⌊n/2⌋}, D2 = {Xi}_{i=⌊n/2⌋+1}^n
  Using D1 obtain an estimate Θ̂^(1) satisfying (3.3); sort the edges according to the weights |Θ̂_e^(1)|
  Iteratively add edges e from high to low |Θ̂_e^(1)| values until a triangle emerges
  Let Ê be the set of edges forming the triangle
  Output: ψ^B_{α,Ê}(D2).

The following result characterizes the performance of ψ^B_{α,Ê}(D2).

Corollary 3.5 (Honest Triangle-Free Test). Let θ = (2K + C′C⁴) √(log d/⌊n/2⌋) for an absolute constant C′ > 0 and assume that (3.10) holds. Then for any fixed level α, the test ψ^B_{α,Ê}(D2) from the output of Algorithm 5 satisfies:

limsup_{n→∞} sup_{Θ∗∈S0(s)} P(reject H0) ≤ α,  liminf_{n→∞} inf_{Θ∗∈S1(θ,s)} P(reject H0) = 1.

3.2.5 Self-Avoiding Path Length Testing

In this section we summarize an algorithm for testing whether the longest SAP in the graph has length ≤ m vs the longest SAP in the graph has length ≥ m + 1. The sub-decomposition is G0 = {G ∈ G : all SAPs in G have length ≤ m} and G1 = {G ∈ G : ∃ a SAP of length m + 1}. The parameter sets of precision matrices specified by (3.1) and (3.2) reduce to:

S0(s) := { Θ ∈ M(s) : G(Θ) ∈ G0 },
S1(θ, s) := { Θ ∈ M(s) : G(Θ) ∈ G1, ∃ E ⊂ E(G(Θ)) such that E is a SAP of length m + 1 and min_{e∈E} |Θe| ≥ θ }.

The SAP length test is summarized below:

Algorithm 6 SAP Length Test
  Input: D = {Xi}_{i=1}^n, level 0 < α < 1.
  Split the data into D1 = {Xi}_{i=1}^{⌊n/2⌋}, D2 = {Xi}_{i=⌊n/2⌋+1}^n
  Using D1 obtain an estimate Θ̂^(1) satisfying (3.3); sort the edges according to the weights |Θ̂_e^(1)|
  Iteratively add edges e from high to low |Θ̂_e^(1)| values until a SAP of length m + 1 emerges
  Let Ê be the set of edges forming the SAP of length m + 1
  Output: ψ^B_{α,Ê}(D2).

Below we characterize the performance of the output of Algorithm 6, ψ^B_{α,Ê}(D2).

Corollary 3.6 (Honest SAP Test). Let θ = (2K + C′C⁴) √(log d/⌊n/2⌋) for an absolute constant C′ > 0 and assume that (3.10) holds. Then for any fixed level α, the test ψ^B_{α,Ê}(D2) from the output of Algorithm 6 satisfies:

limsup_{n→∞} sup_{Θ∗∈S0(s)} P(reject H0) ≤ α,  liminf_{n→∞} inf_{Θ∗∈S1(θ,s)} P(reject H0) = 1.

3.3 Extensions to More General Graphical Models

The NAWT procedure can be applied far beyond the class of Gaussian graphical models. Our first extension is to the transelliptical graphical models proposed by Liu et al. (2012). We start with some definitions.

Definition 3.2 (Elliptical Distribution; Fang et al. (1990)). Let μ∗ ∈ R^d and Σ∗ ∈ R^{d×d}. A d-dimensional random vector X has an elliptical distribution, denoted by X ∼ EC_d(μ∗, Σ∗, ξ), if X =_D μ∗ + ξAU, where U ∈ R^q is a random vector uniformly distributed on the unit sphere, ξ ≥ 0 is a scalar random variable independent of U, and A ∈ R^{d×q} is a deterministic matrix such that AAᵀ = Σ∗.

Definition 3.3 (Transelliptical Distribution; Liu et al. (2012)). A continuous random vector X = (X1, . . . , Xd)ᵀ has a transelliptical distribution if there exist monotone univariate functions f1, . . . , fd and a non-negative random variable ξ with P(ξ = 0) = 0, such that

(f1(X1), . . . , fd(Xd))ᵀ ∼ EC_d(0, Σ∗, ξ),

where Σ∗ is symmetric with diag(Σ∗) = 1 and Σ∗ > 0.

A graphical model structure over the family of transelliptical distributions is defined through the notion of the "latent generalized concentration matrix" Θ∗ = (Σ∗)⁻¹. An edge is present between the two variables Xj, Xk if and only if Θ∗_jk ≠ 0. To construct an estimate of Θ∗, one first estimates Σ∗ by a rank-based estimator Σ̂. Next one obtains a consistent estimator of Θ∗ satisfying (3.3) and (3.4) by applying CLIME on Σ̂ (see Liu et al. (2012) for details). A decorrelation method can then be applied to substitute the estimates Θ̂_jk with Θ̃_jk, where the Θ̃_jk are guaranteed to have asymptotically normal distributions (see Neykov et al. (2015), e.g.). The NAWT immediately applies to this setup.

Our second extension is to the semiparametric exponential family graphical model proposed by Yang et al. (2014). Below, by indexing a vector v = (v1, . . . , vd)ᵀ with v\j we mean omitting the j-th element of that vector.

Definition 3.4 (Semiparametric Exponential Family Distribution). A d-dimensional random vector X = (X1, . . . , Xd) ∈ R^d follows a semiparametric exponential family distribution if for any j ∈ [d], the conditional density of Xj | X\j satisfies

p(xj | x\j) = exp[ xj (βj∗ᵀ x\j) + fj(xj) − bj(βj, fj) ],    (3.13)

where fj(·) is a base measure and bj(·, ·) is the log-partition function.

The above model is semiparametric since the base measure functions fj are unknown, and it forms a rich family of models including Gaussian, Poisson and Ising models. It is easy to see that βjk = 0 if and only if Xj and Xk are conditionally independent. Thus the semiparametric exponential family naturally defines a graphical model. A pairwise rank-based procedure has been proposed by Yang et al. (2014) to obtain an estimator β̂j of βj for all j ∈ [d]. Yang et al. (2014) also provide finite sample bounds on ‖β̂j − βj‖₂ and ‖β̂j − βj‖₁ with the optimal rates of convergence. They further propose a decorrelated estimator β̃jk satisfying √n(β̃jk − βjk) ⇝ N(0, σ²_jk) for some limiting variance σ²_jk. Therefore, we can apply the NAWT procedure to obtain valid combinatorial structure tests.
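For the transelliptical extension, the rank-based estimator of Σ∗ in Liu et al. (2012) is built from Kendall's tau via the sine transform; a minimal sketch (the function name and plug-in workflow are our own simplification) is:

```python
import numpy as np
from scipy.stats import kendalltau

def rank_based_sigma(X):
    """Kendall's tau estimate of the latent correlation: sin(pi/2 * tau_jk)."""
    n, d = X.shape
    Sigma_hat = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])
            Sigma_hat[j, k] = Sigma_hat[k, j] = np.sin(np.pi * tau / 2)
    return Sigma_hat  # feed this to CLIME (or a stand-in) to estimate Theta*
```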

4 Multi-Edge NAS

The results of Section 2 have two major limitations. Firstly, the null base G0 is assumed to be of bounded degree. Secondly, our results cover only tests for which there exists a single-edge NAS. In this section we relax both of these conditions. The following examples illustrate relevant testing problems which do not fall into the framework of Section 2.

Maximum Degree Testing. Consider testing whether the maximum degree of the graph dmax satisfies dmax ≤ s0 vs dmax ≥ s1, where s0 < s1 ≤ s are integers which are allowed to scale with n. In this case it is impossible to simultaneously construct a null base graph G0 of bounded degree and a single-edge NAS set C.

Clique Detection. Consider testing whether the graph is empty versus the alternative that the graph is an s-clique. In this test the null base is an empty graph, while the NAS set is an s-clique, and therefore cannot be a single-edge NAS.

To handle multi-edge NAS, we first extend Definitions 2.1 and 2.2 to allow for the above examples.

Definition 4.1 (Null-Alternative Separator Set). Let G0 = (V, E0) ∈ G0 be a fixed graph under the null with adjacency matrix A0. We call a collection of edge sets C a (multi-edge) NAS with null base G0 if for all edge sets S ∈ C we have S ∩ E0 = ∅ and (V, E0 ∪ S) ∈ G1. For any edge set S ∈ C, we denote the adjacency matrix of the graph (V, S) by A_S.

Definition 4.2 (Edge Set Geodesic Predistance). For two edge sets S and S′ and a given graph G, let:

d_G(S, S′) = min_{e∈S, e′∈S′} d_G(e, e′).

We provide two generic strategies to obtain combinatorial inference lower bounds on the signal strength. The first strategy, described in Section 4.1, assumes that all S ∈ C satisfy |S| ≤ U for some fixed constant U. The second strategy, presented in Section 4.2, does not require bounded cardinality of the edge sets S, but requires the null bases and NAS sets to have some special combinatorial properties (more details are provided later).

4.1 Bounded Edge Sets

Below we consider an extension of Theorem 2.1 to multi-edge NAS, where the number of edges in each set S ∈ C satisfies |S| ≤ U for some fixed integer U ∈ N. In contrast to Section 2, here the graph G0 is allowed to have unbounded degree.

Theorem 4.1. Let G0 ∈ G0 be a graph under the null, and let C be a multi-edge NAS with null base G0. Suppose that n ≥ (log |C|)/c0 for some c0 > 0 and a sufficiently small κ > 0, and that:

θ ≤ κ √( M(C, d_{G0}, log |C|) / (nU) ) ∧ κ/( U(‖A0‖₂ + 2U) ) ∧ (1 − C⁻¹)/( 4(‖A0‖₁ + 2U) ).    (4.1)

Assuming M(C, d_{G0}, log |C|) → ∞ as n → ∞, we have:

liminf_{n→∞} γ(S0(θ, s), S1(θ, s)) = 1.

Theorem 4.1 is an extension of Theorem 2.1. Specifically, Theorem 2.1 corresponds to the setting where U = 1 and ‖A0‖₂ ≤ ‖A0‖₁ ≤ D (recall that D is an upper bound on the graph degree). Even though by assumption U = max_{S∈C} |S| is bounded, we explicitly keep the dependency on U in (4.1) to reflect how the bound changes if U is allowed to scale. The first term on the right hand side of (4.1) involves the structural packing entropy, while the remaining two terms ensure the parameter θ is small enough to construct a valid packing set (more details are provided in the proof).

We illustrate the usefulness of Theorem 4.1 with an example similar to the ones in Section 2.1. Consider testing whether the maximum degree of the graph G(Θ∗) is at most s0 vs at least s1, where s0 < s1 ≤ s can increase with n but the null-alternative gap s1 − s0 remains bounded. Therefore we cannot apply Theorem 2.1 but should use Theorem 4.1 instead. Define the sub-decomposition G0 = {G : dmax(G) ≤ s0} and G1 = {G : dmax(G) ≥ s1} respectively.

Example 4.1 (Maximum Degree Test with Bounded Null-Alternative Gap). Let S0(θ, s) and S1(θ, s) be defined in (2.2) and (2.3). Assume that s log d/n = o(1), s √(log d/n) = O(1) and s = O(d^γ) for some γ < 1. Then if κ is small enough and θ < κ √(log d/n), we have

liminf_{n→∞} γ(S0(θ, s), S1(θ, s)) = 1.

In order to show the above result, we start by building a graph G0 consisting of ⌊d/(s1 + 1)⌋ non-intersecting s0-star graphs. Define the star graph centers Cj = (s1 + 1)j + 1 for

In order to show the above result, we start by building a graph G0 by constructing b s1d+1 c non-intersecting s0 -star graphs first. Define the star graph centers Cj = (s1 + 1)j + 1 for 24

j = 0, . . . , b s1d+1 c − 1. Next, define G0 by d

c−1 s +1 0   b s1 [ [ {(Cj , Cj + k)} . G0 := V , j=0

k=1

To define the NAS C, simply connect each of the vertices Cj to the remaining s1 −s0 vertices in the block as follows: C :=

s1 n [ 

(Cj , Cj + k) j = 0, . . . , b

k=s0 +1

6

12

2

5

18

8

11

1

3

o d c−1 . s1 + 1

14

17

7

4

9

13

10

15

16

Figure 7: Test for the maximum degree G0 := {G : dmax (G) ≤ s0 } vs G1 = {G : dmax (G) ≥ s1 } with s0 = 3, s1 = 5 and d = 18. The solid edges represent G0 ∈ G0 with maximum degree s0 = 3. We construct the NAS C = {{(1, 5), (1, 6)}, {(7, 11), (7, 12)}, {(13, 17), (13, 18)}}. Since the predistance between any two different edge sets S, S 0 ∈ C, is dG0 (S, S 0 ) = ∞, the set C itself is a (log |C|)-packing set. The latter implies that M (C, dG0 , log |C|) = log |C|  √ log(d/(s1 + 1))  log d. In addition, one can easily check that kA0 k2 = s0 and kA0 k1 = s0 . By Theoremp4.1 we have that under the required scaling liminf n→∞ γ(S0 (θ, s), S1 (θ, s)) = 1, when θ < κ log d/n for a sufficiently small κ.

4.2

Scaling Edge Sets

Theorem 4.1 requires the cardinality of edge sets in the NAS C to be bounded. In this section, we consider multi-edge NAS C such that S ∈ C is allowed to increase with n. For this case, the previous notion of packing entropy based on geodesic predistence is no longer effective to characterize the lower bound. Instead, we introduce a new mechanism called buffer entropy to quantify the lower bound under scaling multi-edge NAS. We first intuitively explain why the structural entropy in Theorem 4.1 may not be enough for handling NAS with scaling edge sets. Recall that Theorem 4.1 uses the structural entropy M (C, dG0 , log |C|) to characterize the lower bound and the packing entropy is calculated 25

(a)

(b)

(c)

(d)

Figure 8: In all plots S, S 0 and G0 are plotted with dashed, dotted and solid edges respectively (only (c) and (d) have solid edges). For both case (a) and case (b), we have dG0 (S, S 0 ) = 0. However, the sets S and S 0 in case (a) should be “closer” to each other compared to those on case (b) as they have more overlapping vertices, but dG0 fails to tell this difference. In case (c) and (d) for instance, we also have dG0 (S, S 0 ) = 1 in both cases (c) and (d) and hence dG0 cannot differentiate the two cases. However, in case (c), S has more vertices connected to S 0 than incase (d). Therefore, S and S 0 should also be “closer” in case (c) than case (d). based on the edge set geodesic predistance dG0 in Definition 4.2. Intuitively, for any S, S 0 ∈ C, the smaller dG0 (S, S 0 ) is, the easier it is to perform a test. However, under the setting of NAS with scaling edge sets, different edge sets S, S 0 ∈ C may have multiple overlapping vertices and the notion of geodesic predistance is no longer precise enough to reflect the closeness between S and S 0 . For example, in Figure 8, for both case (a) and case (b), we have dG0 (S, S 0 ) = 0. However, the sets S and S 0 in Figure 8 (a) should be “closer” to each other compared to those on Figure 8 (b) as they have more overlapping vertices, but dG0 fails to tell this difference. Similar example holds in the case of non-overlapping edge sets. In Figure 8 (c) and (d) for instance, we also have dG0 (S, S 0 ) = 1 in both cases (c) and (d) and hence dG0 cannot differentiate the two cases. However, in Figure 8 (c), S has more vertices connected to S 0 than in Figure 8 (d). Therefore, S and S 0 should also be “closer” in Figure 8 (c) than Figure 8 (d). From the discussion above, we see that the graph geodesic predistance fails to characterize the closeness between two multi-edge sets. Looking more closely in Fig 8, we can identify critical vertices measuring the “closeness” between the edge sets S and S 0 . Intuitively, the critical vertices should be the vertices which all paths connecting S and S 0 pass through. For instance in case (a) of Fig 8, we have 3 such critical vertices whereas in case (b) we have 2 such vertices. Similarly, in case (c) we have 6 critical vertices while in case (d) we have 4 critical vertices. Below we introduce a new concept called vertex buffer, which helps to measure the closeness between edge sets S and S 0 more precisely than the geodesic predistance. Definition 4.3 (Vertex Buffer). Let G = (V, E) be a graph and V1 , V2 ⊆ V be two vertex sets. A set V ⊆ V1 ∪ V2 is called a buffer between V1 and V2 if any path on G connecting some vertex in V1 to some vertex in V2 must contain at least one vertex v1 ∈ V ∩ V1 and one 26

Figure 9: Visualization of one of the buffers in V(V1 , V2 , G). Here V1 , V2 are plotted with red and blue respectively and G is in solid edges. The vertices in the buffer are marked in the dashed squares. We can see that all paths passing from V1 to V2 must contain at least one of these vertices in the buffer.

vertex v2 ∈ V ∩ V2 . We use V(V1 , V2 , G) to denote the collection of all buffers between V1 and V2 . We visualize an example of buffer in Figure 9. Using the notion of vertex buffer, we can define the new closeness measure between S and S 0 . Recall that V (S) denotes the vertex set of the edge set S. For a NAS C with null base G0 and any two edge sets S, S 0 ∈ C let VS,S 0 ∈ V(V (S), V (S 0 ), G(AS,S 0 )), where we denote AS,S 0 := A0 + AS + AS 0 .

(4.2)

The buffer size |VS,S 0 | is a measure of closeness of the edge sets S and S 0 . For instance, we have |VS,S 0 | = 3 in Fig 8 (a) while |VS,S 0 | = 2 in Fig 8 (b) and we have |VS,S 0 | = 6 in Fig 8 (c) while |VS,S 0 | = 4 in Fig 8 (d). This implies that the larger |VS,S 0 | is, the closer S and S 0 will be. The vertex buffer VS,S 0 is useful for the lower bound since it characterizes the critical vertices between S and S 0 . The paths passing through at least one edge in both S and S 0 must pass through at least one vertex in VS,S 0 . In contrast to the bounded edge sets case, when the edge sets in C are allowed to scale in size, it is not effective to build packing sets based on the predistance, since this strategy limits the number of edge sets we can build. One way to increase the cardinality of C is to consider a larger number of potentially overlapping structures, and use the buffer size as a more precise closeness measure between these structures. Below we introduce a new entropy concept which quantifies this intuition. Definition 4.4 (Buffer Entropy). Let C be a multi-edge NAS set with a base graph G0 and V := {VS,S 0 }S,S 0 ∈C be a collection of buffers. The buffer entropy is defined as:  −1  MB (C, V ) := log max ES 0 |VS,S 0 | , (4.3) S∈C

27

where the expectation ES 0 is taken with respect to uniformly sampling S 0 from C. Under some cases, the buffer entropy MB (C, V ) is asymptotically equivalent to the packing entropy M (C, dG0 , log |C|). For instance, in Example 4.1, we can see from Figure 7 that  VS,S 0 = ∅ if S 6= S 0 and VS,S 0 = V (S) if S = S 0 . Therefore, for all S ∈ C, ES 0 |VS,S 0 |] = (s1 − s0 )/|C| and the buffer entropy MB (C, V ) = log(|C|/(s1 − s0 ))  M (C, dG0 , log |C|) since s1 − s0 ≤ U is a bounded constant. However, for edge sets of scaling cardinalities, the buffer entropy may have a better characterization of the complexity of NAS set than the packing entropy as we argued by the examples in Figure 8. We want the buffer entropy to be as large as possible to achieve sharp lower bounds. Here we provide some guidelines on how to find the buffer sets. In general, we aim to find the buffers VS,S 0 of size as small as possible for any pair S, S 0 ∈ C. It is easy to see that the set {V (E0 ∪ S) ∩ V (S 0 )} ∪ {V (E0 ∪ S 0 ) ∩ V (S)} is always a valid vertex buffer, and we believe it is a good choice for VS,S 0 in general (see Fig 9 and Examples 4.2, 4.3, e.g.) Since larger |VS,S 0 | implies that S and S 0 are closer, we aim to build NAS in which the cardinality |VS,S 0 | is not too large. We have the trivial bound X X |VS,S 0 | ≤ 1(v ∈ VS,S 0 ) + 1(v ∈ VS,S 0 ). v∈V (S 0 )

v∈V (S)

Intuitively, if the random variables {1(v ∈ VS,S 0 )}v∈V (S) are negatively correlated, |VS,S 0 | will not be too large. We formalize this idea in the following definition. Definition 4.5 (Incoherent NAS). The collection of edge sets C is called an incoherent NAS under a collection of buffers V := {VS,S 0 }S,S 0 ∈C if for any fixed S ∈ C, the random variables {1(v ∈ VS,S 0 )}v∈V (S) with respect to a uniformly sampled S 0 from C are negatively associated. In other words, for any pair of disjoint sets I, J ⊆ V (S) and any pair of coordinate-wise nondecreasing functions f, g we have:  Cov f ({1(v ∈ VS,S 0 )}v∈I ), g({1(v ∈ VS,S 0 )}v∈J ) ≤ 0. Negative association is satisfied by a variety of classical discrete distributions such as the multinomial and hypergeometric, and even more generally by the class of permutation distributions (Joag-Dev and Proschan, 1983; Dubhashi and Ranjan, 1996, e.g.). It is a standard assumption that has been exploited in other works (Addario-Berry et al., 2010, e.g.) for obtaining lower bounds. We will also show concrete constructions of incoherent NAS in Examples 4.2 and 4.3. Besides packing entropy, the lower bound in Theorem 4.1 also involves the maximum degree kA0 k1 and the spectral norm kA0 k2 . We define similar quantities for the scaling edge sets case. As the sizes of S, S 0 ∈ C are no longer ignorable, we need to consider the matrix

28

AS,S 0 (4.2) instead. Denote the uniform maximum degree as Γ := maxS,S 0 ∈C kAS,S 0 k1 and uniform spectral norm as Λ := maxS,S 0 ∈C kAS,S 0 k2 . We define |S ∩ S 0 | R := max , S,S 0 ∈C |VS,S 0 |

B := max 0



S,S ∈C

2

4



(Γ |VS,S 0 |) ∧ Λ .

R is an edge-node ratio measuring how dense the edge set S ∩ S 0 is compared to the vertex buffers. For example, in Figure 10, the ratio R is 1 in case (a) compared to 1/2 in case (b). The quantity B is an auxiliary quantity assembling maximum degrees, spectral norms and buffer sizes and helps to obtain a compact lower bound formulation.

(b)

(a)

Figure 10: Illustration of an edge-node ratios. For case (a), we have R = 1 and R = 1/2 for case (b). This implies that edges in case (a) are denser than case (b). Below we connect the structural features we defined above to the lower bound. Recall definitions (2.2) and (2.3) on S0 (θ, s) and S1 (θ, s). We have the following theorem. Theorem 4.2. Let C be an incoherent NAS under the collection of buffers V := {VS,S 0 }S,S 0 ∈C . Then if MB (C, V ) → ∞ and r r MB (C, V ) R 1 − C −1 θ≤ ∧ ∧ √ , (4.4) 4nR B 2 2Γ the minimax risk satisfies liminf γ(S0 (θ, s), S1 (θ, s)) = 1. n→∞

When the sample size n is sufficiently large, the buffer entropy term on the right hand side of (4.4) is the smallest term and drives the bound which bares similarity to Theorem 4.1. As we previously argued, in Example 4.1, the buffer entropy and the structural entropy are of the same order, and moreover it is easy to check that the quantities R and B satisify |S| R = |S|+1 and B ≤ (U + 1)(kA0 k1 + 2U )2 which implies that when maxS∈C |S| ≤ U the terms in bound (4.4) match those from (4.1) up to constants. To better illustrate the usage of Theorem 4.2 we consider two examples. First we focus on the problem of testing whether the maximum degree in the graph is ≤ s0 vs ≥ s1 . In contrast to Example 4.1, here s0 is allowed to be much smaller than s1 and we cannot apply Theorem 4.1. When s0 = 0, this problem is related to the problem of detecting a set of s1 signals in 29

14

16

2

13

6

15

1

3

18

10

17

5

4

7

9

8

11

12

Figure 11: Test for maximum degree G0 = {G : dmax (G) ≤ s0 } and G1 = √ {G : dmax (G)√≥ s1 } with s0 = 3 and s1 = 6. We split the vertices into two parts {1, . . . , b dc} and {b dc + 1, . . . , d}. We use the first part of vertices to construct s0 -star graphs as G0 (visualized with solid edges). To construct the NAS√C, we select any s1 − s0 vertices (e.g., vertices 13, 14, 16) from the second vertices part {b dc + 1, . . . , d} and connect them to any center of the s0 -star graphs in G0 (e.g., vertex 1). This gives us one of the edge sets S ∈ C (e.g., S = {(1, 13), (1, 14), (1, 16)} depicted in dashed edges in the figure). We construct C containing all such edge sets. For any S, S 0 ∈ C, we choose the vertex buffer VS,S 0 = V (S) ∩ V (S 0 ) (e.g., for S = {(1, 13), (1, 14), (1, 16)} and S 0 = {(5, 14), (5, 16), (5, 18)} we have VS,S 0 = {14, 16} and |S ∩ S 0 | = ∅.) the normal means model (Ingster, 1982; Baraud, 2002; Donoho and Jin, 2004; Addario-Berry et al., 2010; Verzelen and Villers, 2010, e.g.). However the two problems are distinct, since we are studying structural testing in the graphical model setting. Given s0 < s1 ≤ s, we let the sub-decomposition be G0 = {G : dmax (G) ≤ s0 } and G1 = {G : dmax (G) ≥ s1 }. We summarize our results in the following Example 4.2 (Maximum Degree Test with Scaling Null-Alternative Gap). Let S0 (θ, s) and p γ S1 (θ, s) be defined in (2.2) and (2.3). Assume that s log d/n = O(1) and p s = O(d ) for some γ < 1/2. Then for a small enough absolute constant κ if θ < κ log d/n we have liminf n→∞ γ(S0 (θ, s), S1 (θ, s)) = 1. Before we show how this example follows from Theorem 4.2 we would like to highlight the difference between Examples 4.1 and 4.2. First, note that Example 4.2 is more flexible compared to Example 4.1 since the number of edges is allowed to scale with n. However Example 4.2 is also more restrictive in that we require s = O(dγ ) for some γ < 1/2, which is not required by Example 4.1. We now first explain the construction of (G then we formalize it. We start by √ 0 , C) and √ splitting the vertices into two parts {1, . . . , b dc} and {b dc + 1, . . . , d}. The graph G0 is constructed based only on the first part of vertices, and consists of non-intersecting s0 -star graphs. To build the NAS C, we select any s1 − s0 vertices from the second set vertices √ {b dc + 1, . . . , d} and connect them to any center of the s0 -star graphs in G0 (e.g., vertex 1). The set C contains all such edge sets (see Figure 11). Formally, to construct a graph G0 30

√ √ d we split the vertex set {1, . . . , b dc} into b s0 +1 c non-intersecting s0 -star graphs. We take √

d c − 1 and connect them to the remaining centers Cj = (s0 + 1)j + 1 for j = 0, . . . , b s0 +1 vertices as follows: d c−1 s +1 0  b s0 [  [ G0 := V , {(Cj , Cj + k)} . j=0

k=1

√ Since s0 < s1 ≤ s = O(dγ ), this creates at least b d/dγ c  d1/2−γ graphs. Denote the set of centers of the s0 -star graphs with bs

J :=

√ d c−1 0 +1

[

{Cj }.

j=0

√ To construct the NAS set C from the remaining d − b dc isolated vertices we choose a set I of s1 − s0 vertices and connect them with any of the vertex C from the center set J . Formally we let C := {SC,I }C∈J ,I∈I , where n √ o SC,I = {(C, i)}i∈I and I = I ∈ {M ∈ 2[d] : |M | = s1 − s0 } min i > b dc . i∈I

Consider the collection of vertex buffers VS,S 0 = V (S) ∩ V (S 0 ). A visualization of the key quantities G0 , C, VS,S 0 for two specific S, S 0 ∈ C is provided on Figure 11. Now we will argue that the set {1(v ∈ VS,S 0 )}v∈V (S) is negatively associated, as required by Assumption 4.5. Note that uniformly selecting a set S 0 ∈ C is equivalent to uniformly selecting a center C ∈ J and a vertex set I ∈ I and construct S 0 by connecting C to every vertex in I. By construction the selection of a center C is independent of the selection of the vertex set I. By a result in Section 3.1 (c) of Joag-Dev and Proschan (1983) the random variables {1(v ∈ VS,S 0 )}v∈V (S) are negatively associated. Next we calculate the remaining quantities √ √ √ from Theorem 4.2. By the triangle inequality kAS,S 0 k2 ≤ s1 + s1 − s0 ≤ 2 s1 and √ kAS,S 0 k1 ≤ 2s1 and therefore Λ ≤ 2 s1 , Γ ≤ 2s1 , and B ≤ Λ4 ≤ 16s21 ≤ 16s2 . To calculate upper and lower bounds of R we only need to consider |S ∩ S 0 | = 6 0 and in this case S and 0 S must share the same center. In this situation by definition we have |S ∩ S 0 | = |VS,S 0 | − 1 and therefore 1/2 ≤ R ≤ 1. Finally we calculate maxS∈C ES 0 |VS,S 0 |. Recall that VS,S 0 = V (S) ∩ V (S 0 ) and that a set S 0 can be constructed by first selecting a center C uniformly from the set J and then selecting a set I uniformly from I. Hence for all S ∈ C ES 0 |VS,S 0 | = E[1(C ∈ V (S))|S] + E[|I ∩ V (S)| | S] =

1 (s1 − s0 )2 + . |J | d − (s0 + 1)|J |

The last equality implies that MB (C, V )  log d. An application of Theorem 4.2, and taking into account the required scaling completes the result. 31

1

10

2

9

8

3

4

7

5

6

Figure 12: Test for sparse clique with s = 5. We construct G0 = (V , ∅) and visualize two intersecting S, S 0 ∈ C, where S is the 5-clique K{1,3,5,7,9} , S 0 is the 5-clique K{1,2,3,9,10} and VS,S 0 = {1, 3, 9}. Our second example further illustrates the usage of Theorem 4.2 with a clique detection problem. Define the null and alternative parameter spaces: S0 := {Id },

S1 (θ, s) := {Id + θ(vvT − Id ) : θ ∈ (0, 1), ∀j ∈ [d] vj ∈ {±1, 0}, kvk22 = s}.

This setup is related to that in Berthet and Rigollet (2013); Johnstone and Lu (2009). Our case is different from previous works because we parametrize the precision matrix rather than the covariance matrix, and the parametrization is distinct. Under our parametrization, the graph in the alternative hypothesis consists of a single s-clique. q 2) Example 4.3 (Sparse Clique Detection). For values of θ < 4√12s ∧ log(d/s we have: 4ns liminf γ(S0 , S1 (θ, s)) = 1. n→∞

 Consider G0 = (V , ∅), and let C = all s cliques with vertices in V . Take two sets S, S 0 ∈ C. Define the vertex buffer as VS,S 0 = V (S) ∩ V (S 0 ). Similarly to Example 4.2, conditioning on S the random variables {1(v ∈ VS,S 0 )}v∈V (S) are negatively associated (see Section 3.1 (c) of Joag-Dev and Proschan (1983)). By the definition of vertex buffer sets we  0 have that |VS,S 0 | ≤ s, |S ∩ S 0 | = VS,S < |VS,S 0 |2 ≤ s|VS,S 0 | and hence s − 1/2 ≤ R ≤ s. 2 Furthermore Γ ≤ 2s, and therefore Λ ≤ 2s. Therefore B ≤ 4s3 . For all v ∈ V (S) we have ES 0 [1(v ∈ VS,S 0 )] = s/d which implies ES 0 [|VS,S 0 |] = s2 /d for all S ∈ C, and therefore MB (C, V ) = log(d/s2 ). An application of Theorem 4.2 shows the statement.

4.3

Upper Bounds

In this section we develop algorithms matching the bounds in Sections 4.1 and 4.2. 32

4.3.1

Maximum Degree Test

We propose a test matching lower bound in Example 4.1. Recall that the sub-decomposition is G0 = {G : dmax (G) ≤ s0 } and G1 = {G : dmax (G) ≥ s1 }. The parameter sets (3.1) and (3.2) reduce to S0 (s) = {Θ ∈ M(s) : G(Θ) ∈ G0 } and S1 (θ, s) = {Θ ∈ M(s) : G(Θ) ∈ G1 , ∃ vertex k ∈ V , and set S ⊂ [d] such that k 6∈ S, |S| = s, minj∈S |Θjk | ≥ θ}. Recall the b (1) and Θ b (2) from Section 3.2. definitions of D1 , D2 , Θ Algorithm 7 Maximum Degree Test Input: D = {Xi }ni=1 , level 0 < α < 1. bn/2c Split the data into D1 = {Xi }i=1 , D2 = {Xi }ni=bn/2c+1 b (1) Add edges e from high to low |Θ e | values until a vertex with (s0 + 1) neighbors occurs. b be the set of s edges connecting the vertex to its (s0 + 1) neighbours. Let E B Output: ψα, b (D2 ). E

B Following the formulation of Algorithm 7 we characterize the performance of ψα, b (D2 ). E

p Corollary 4.3 (Honest Max Degree Test). Let θ = (2K + C 0 C 4 ) log d/bn/2c for an absolute constant C 0 > 0 and assume that (3.10) holds. Then for any fixed level α, the test B ψα, b (D2 ) from the output of Algorithm 7 satisfies: E limsup sup P(reject H0 ) ≤ α,

liminf

inf

n→∞ Θ∗ ∈S1 (θ,s)

n→∞ Θ∗ ∈S0 (s)

P(reject H0 ) = 1.

Remark 4.1. The test of Algorithm 7 can also be used to test the problem described in Example 4.2, since the lower bound on the signal strength is of the same magnitude. 4.3.2

Clique Detection Test

Here we devise a test to match the lower bound of Example 4.3. Unlike previous tests, the b = n−1 Pn X ⊗2 . Define clique detection test is not computationally feasible. Recall that Σ i i=1 bmin = min λd (Σ b CC ), λ |C|=s

b CC is a sub-matrix of Σ b with both column and row indices in C, and λd (A) is the where Σ bmin < ν), where ν is smallest eigenvalue of the matrix A. Consider the test ψ = 1(λ  2 p √ ν := 1 − ( 2 + 1) (s log(ed/s) + log(2α−1 ))/n , for some small constant α ≥ 0. We have 33

Proposition 4.4. Given any α ∈ (0, 1), suppose (s log(ed/s) + log(2α−1 ))/n = o(1). Then for values of θ ∈ (0, 1) satisfying r log(ed/s) θ>κ , sn for an absolute constant κ, we have limsup sup PΘ (reject H0 ) ≤ α n→∞ Θ∈S0

liminf

inf

n→∞ Θ∈S1 (θ,s)

PΘ (reject H0 ) = 1.

The proof of Proposition 4.4 is deferred to Appendix C.2. We remark that the bound of Proposition 4.4 matches the lower bound of Example 4.3 up to a scalar in the regime s = O(dγ ) for γ < 1/2, and s log d/n = o(1).

5

Numerical Studies and Real Data Analysis

In this section we present numerical analysis of Algorithms 2 and 4 for the hypothesis tests on the connectivity and cycles on synthetic datasets. In addition, we implement the graph connectivity test to study the brain networks for the ADHD-200 dataset.

5.1

Connectivity Testing

We present numerical simulations assessing the performance of Algorithm 2 for testing connectivity. Recall G0 = {G ∈ G : G disconnected} and G1 = {G ∈ G : G connected}, and consider testing H0 : G(Θ∗ ) ∈ G0 vs H1 : G(Θ∗ ) ∈ G1 . In order to construct the connected and disconnected graphs, we consider a chain graph of length k with adjacency matrix as follows   0 1 0 1 0 1      . . . . . .  ∈ Rk×k . AChain(k) =  (5.1) 1     . . . . . . 1  0 1 0 We construct the connected graph with adjacency matrix A = AChain(d) and the disconnected graphs with adjacency matrices Am = diag(AChain(m) , AChain(d−m) ), for m ∈ [d − 1]. The corresponding graphs of A and A1 , . . . , Ad−1 are visualized in Figure 13. 34

1

2

3

4

5

6

7

1

2

3

4

5

6

7

Figure 13: G(A4 ) and G(A) for d = 7. To show that our test is uniformly valid over various disconnected graphs under the null hypothesis, we use different precision matrices for different repetitions. For each repetition, we randomly select M ∼ Uniform([d−1]), choose the precision matrix Θ∗M (θ) = Id +θAM and generate i.i.d. samples Xi ∼ N (0, (Θ∗M (θ))−1 ) for i ∈ [n]. Under the alternative hypothesis, we generate n i.i.d. samples X1 , . . . , Xn from N (0, (Θ∗ (θ))−1 ), where Θ∗ (θ) = Id + θA. The sample size is chosen from n = 400 and 600 and the dimension varies in d = 100, 125 and 150. The significance level of the tests is set to α = 0.05. Within each test we generate 3, 000 bootstrap samples to estimate the quantiles (3.7). We estimate the precision matrix by the CLIME estimator (Cai et al., 2011) as follows b λ = argmin kΘk1 , s.t. kΣΘ b − Id kmax ≤ λ. Θ

(5.2)

The tuning parameter is chosen by minimizing a K-fold cross validation risk CV(λ) =

K X

(−k)

b (k) Θ b kΣ λ

− Id kmax ,

k=1

b (k) is the sample covariance matrix estimated on the k-th subset of data and Θ b (−k) where Σ λ is the CLIME estimator without using the k-th subset ofp data. We use the first scenario and implement a 5-fold cross validation. It slelects λ = 1.5 log d/n and we keep using this value of λ throughout all remaining settings. We vary the signal strength θ ∈ [0.25, 0.45] and estimate the risk of our test ψ given by Algorithm 2: PG(Θ∗ )∈G0 (ψ = 1) + PG(Θ∗ )∈G1 (ψ = 0), by averaging the type-I and type-II errors under null and alternative settings through 200 repetitions. The results are visualized in Figure 14, where the Y -axis represents the risk and the X-axis plots the signal strength θ. As predicted by our results, with the increase in θ the risk curves become well controlled about the target value 0.05. Moreover, there is little difference between the three risk curves within each plot. The curves for d = 125, 150 have a slightly worse performance than the curves for smaller dimension d = 100. This effect is expected, as the signal strength required to separate the null and alternative depends on d only logarithmically. Furthermore, smaller signal strength is required to distinguish the null from the alternative when we increase n from 400 to 600. Table 1 reports the type-I errors of ψ. The type-I errors remain within the desired range of 0.05. As we increase the signal 35

1.0 0.8 0.6 0.4

PΘ0(ψ = 1) + PΘ1(ψ = 0)

d = 100 d = 125 d = 150

0.0

0.2

0.4

0.6

0.8

d = 100 d = 125 d = 150

0.2

PΘ0(ψ = 1) + PΘ1(ψ = 0)

1.0

strength θ, the values of the errors increase from 0 to the target level. This phenomenon is expected and is explained below Proposition 3.1.

0.25

0.30

0.35 θ

0.40

0.45

0.25

(a) n = 400

0.30

0.35 θ

0.40

0.45

(b) n = 600

Figure 14: Type-I and type-II errors control for connectivity test with θ ∈ [0.25, 0.45]. The grey horizontal line is the significance level α = 0.05.

Table 1: Sizes for connectivity testing θ ∈ [0.25, 0.45] n

θ

d 100 400 125 150 100 600 125 150

5.2

0.25

0.28

0.32

0.35

0.38

0.42

0.45

0.000 0.000 0.000 0.000 0.000 0.000

0.000 0.000 0.000 0.005 0.000 0.000

0.000 0.000 0.000 0.020 0.025 0.015

0.020 0.010 0.010 0.045 0.050 0.055

0.045 0.045 0.040 0.030 0.040 0.040

0.045 0.050 0.050 0.025 0.045 0.045

0.060 0.060 0.055 0.030 0.040 0.040

Cycle Presence Testing

In this section we present numerical analysis of Algorithm 4 for cycle presence testing. Recall the sub-decomposition G0 = {G ∈ G : G contains no cycles} and G1 = {G ∈ G : G contains a cycle}. We aim to conduct the hypothesis H0 : G(Θ∗ ) ∈ G0 vs H1 : G(Θ∗ ) ∈ G1 . 36

Similarly to the previous section, we also generate data under both the null and alternative hypotheses. We use the chain graph with adjacency matrix AChain(d) as the acyclic graph and construct the loopy graphs by adding edges to the chain graph. More specifically, we construct the graphs with adjacency matrices Am = AChain(d) + Em , for m = 3, 4, . . . , 10, where Em is the adjacency matrix of the graph (V , {(1, m)}). The graphs we construct are illustrated in Figure 15. Under the null hypothesis, we generate n i.i.d. samples X1 , . . . , Xn ∼ N (0, (Θ∗ (θ))−1 ) where Θ∗ (θ) = Id + θAChain(d) and repeat the simulation for N = 200 times. Under the alternative, for each repetition, we randomly select M ∼ Unif({3, 4, . . . , 10}), set the precision matrix as Θ∗M (θ) = Id + θAM and generate i.i.d. samples X1 , . . . , Xn ∼ N (0, (Θ∗M (θ))−1 ). We also repeat this procedure for N = 200 times to calculate the type-II error. 1

2

3

4

5

6

7

1

2

3

4

5

6

7

1.0 0.8 0.6 0.4 0.2

PΘ0(ψ = 1) + PΘ1(ψ = 0)

d = 100 d = 125 d = 150

0.0

0.2

0.4

0.6

0.8

d = 100 d = 125 d = 150

0.0

PΘ0(ψ = 1) + PΘ1(ψ = 0)

1.0

Figure 15: G(A) and G(A4 ) for d = 7.

0.25

0.30

0.35 θ

0.40

0.45

0.25

(a) n = 400

0.30

0.35 θ

0.40

0.45

(b) n = 600

Figure 16: Type-I + type-II error control for cycles presence test with θ ∈ [0.25, 0.45]. The grey horizontal line is the significance level α = 0.05. 37

As before the significance level is set to α = 0.05, and the tuning parameter as λ = p 1.5 log d/n. The sample size varies in n = {400, 600}, the dimension d = 100, 125 and 150 and the number of bootstrap samples is 3, 000. In Figure 16, we plot the risk curves of the test ψ based on Algorithm 4 for θ ∈ [0.25, 0.45]. With the increase of signal strength or sample size, the risk diminishes to its target value which confirms the results of Section 3.2.3. Table 2: Sizes for cycle presence testing θ ∈ [0.25, 0.45] n

θ

d 100 400 125 150 100 600 125 150

0.25

0.28

0.32

0.35

0.38

0.42

0.45

0.010 0.025 0.005 0.045 0.040 0.030

0.045 0.005 0.005 0.040 0.040 0.030

0.015 0.045 0.020 0.060 0.060 0.050

0.045 0.030 0.020 0.065 0.055 0.015

0.040 0.070 0.015 0.045 0.065 0.060

0.040 0.070 0.060 0.060 0.050 0.050

0.055 0.060 0.050 0.040 0.035 0.040

In Table 2 we report the type-I errors of the tests for each of 6 values of the signal strength θ. The type-I error remains well controlled and is closer to its the target level of 0.05 when the signal strength becomes larger as Corollary 3.4 predicts.

5.3

ADHD-200 Data Analysis

In this section we analyze the difference in connectivity levels of the brain networks of subjects with and without attention deficit hyperactivity disorder (ADHD). We use the ADHD200 brain-imaging dataset, which is made publicly available by the ADHD-200 consortium (Milham et al., 2012). The dataset consists of 614 subjects with 491 controls and 195 cases. For each subject a number between 76 and 276 fMRI scans are available. Each of the scans provides a brain imaging vector of 264 voxels, which we use to obtain estimates of the brain network. To simplify the analysis we treat each of the repeated measurements as independent. In addition to the fMRI scans, limited clinical data such as age, gender, handedness etc is also available for each subject. The ages range from 7 to 21 years, with medians 11.75 for the controls and 10.58 for the cases. Using the ADHD-200 dataset, Cai et al. (2015) explored the connectivity of the brain network, and concluded that the brain networks of ADHD cases are in general less connected compared to the brain networks of healthy subjects. Additionally, Gelfand et al. (2003); Bartzokis et al. (2001) studied how brain connectivity scales with age, and demonstrated that brain connections typically increase from junior to adult age. 38

Algorithm 8 Multiple Edge Testing cn ← Tb; Initialize E repeat cn : √n|Θ e e | ≥ √nµ + c Reject R ← {e ∈ E cn } 1−α,E cn ← E cn \ R Update E cn = ∅ until R = ∅ or E nc ← T d b\E cn return E To explore potential changes in brain network connectivity in terms of age from junior to senior, we split the dataset into subjects whose age is less than 11 and greater than 11 years, where 11 is chosen since it is close to the median ages of the cases and controls. Moreover, to further study the brain networks at different levels, we define the graph at level µ ≥ 0 for the precision matrix Θ∗ as  G(Θ∗ , µ) = (V , E(Θ∗ , µ)), where E(Θ∗ , µ) = e | |Θ∗e | ≥ µ . When µ = 0, G(Θ∗ , µ) reduces to the typical graph G(Θ∗ ) defined in Section 1.3. We explore the connectivity of G(Θ∗ , µ) for larger µ’s, because the edges in the brain networks with stronger signal strengths are more informative to reflect the essential topological structure of brains. We therefore consider the hypothesis test H0 : G(Θ∗ , µ) disconnected vs H1 : G(Θ∗ , µ) connected.

(5.3)

We can easily modify Algorithm 2 to conduct the test above. In fact, we only need to change B B (D2 , µ) obtained from Algorithm 8. (D2 ) in Algorithm 2 to ψα, the output ψα, Tb Tb Comparing Algorithm 8 with the original Algorithm 1, the only difference is that we cn : √n|Θ e e| ≥ c cn change the first step in while loop from R ← {e ∈ E cn } to R ← {e ∈ E : 1−α,E √ √ e n|Θe | ≥ nµ + c cn }. Following the same proof as Corollary 3.2 , we can show that 1−α,E

B (D2 , µ) is a valid test for testing (5.3). A detailed proof can be found in Corollary B.6 in ψα, Tb B the Appendix. We apply the test ψα, (D2 , µ) to all 4 groups (junior controls, senior controls, Tb nc d junior cases and senior cases) of subjects results for µ ∈ [0.16, 0.3]. We denote Tb(µ) = E nc is defined in Algorithm 8. If T d b(µ) is connected, since Tb(µ) is a tree, we can see where E

B that ψα, (D2 , µ) = 1 and H0 in (5.3) is rejected. For robustness the data is shuffled 100 Tb times before splitting it into two halves and the number of connected components reported is averaged over these 100 data shuffles. Figure 17 demonstrates the number of connected components for Tb(µ) when µ varies in the range [0.16, 0.3]. We can see that the number of connected components is always larger than 1 and thus we do not reject H0 in (5.3) for all µ ∈ [0.16, 0.3]. Moreover, we observe in Figure 17 that Tb(µ) has more connected components for the controls than the cases. These results confirm the findings of Cai et al. (2015) that the brain

39

Controls, Age > 11 Controls, Age 11 Cases, Age 0. By the triangle inequality for any e, e0 ∈ C we have: max(kA0 k2 , kA0 + Ae k2 , kAe,e0 k2 ) ≤ kA0 k2 + 2, max(kA0 k1 , kA0 + Ae k1 , kAe,e0 k1 ) ≤ kA0 k1 + 2. Next we make sure that the matrices Θ0 and Θe fall into the set M(s) and in addition the matrix Θe,e0 > 0. For the upper bounds, it suffices to choose η satisfying: max(kΘ0 k2 , kΘe k2 ) ≤ 1 + (kA0 k2 + 2)θ ≤ C,

max(kΘ0 k1 , kΘe k1 ) ≤ 1 + (kA0 k1 + 2)θ ≤ L.

Recall that kA0 k2 ≤ kA0 k1 , and C ≤ L hence both inequalities are implied if 1+(kA0 k1 + 2)θ ≤ C. Furthermore, by Weyl’s inequality: λd (Θ0 ), λd (Θe ), λd (Θe,e0 ) ≥ 1 − θ(kA0 k2 + 2) ≥ 1 − θ(kA0 k1 + 2), 43

(A.2)

where λd denotes the smallest eigenvalue of the corresponding matrix. We want to ensure −1 1−C −1 that the last term is ≥ C −1 . Since by assumption θ < √1−C = √2(kA the above 2(D+2) 0 k1 +2) inequalities are satisfied (note that trivially 1 − C −1 ≤ C − 1 since C ≥ 1). We also obtain as a by-product that Θe,e0 ≥ 0.

Step 2(Risk Lower Bound). In this step we obtain a lower bound on the minimax risk driven by Le Cam’s Lemma (Le Cam, 1973). Using a determinant identity we control the chi-divergence by the traces of adjacency matrices’s powers. By assumption G0 ∈ G0 , Ge ∈ G1 for all e ∈ C and hence the induced graphs G(Θ0 ) ∈ G0 and G(Θe ) ∈ G1 for all e ∈ C. Put the uniform prior on C and consider the models generated by N (0, (Θe )−1 ) where e ∈ C. Define: P=

1 X P Θe , |C| e∈C

where PΘe we define the probability measure when the data is i.i.d. Xi ∼ N (0, (Θe )−1 ), and let PΘ0 be the probability measure when the data is i.i.d. Xi ∼ N (0, (Θ0 )−1 ). Next, by Neyman-Pearson’s lemma we have: i h (A.3) γ(S0 , S1 ) ≥ inf PΘ0 (ψ = 1) + P(ψ = 0) = 1 − TV(P, PΘ0 ), ψ

where for two probability measures P, Q  λ on a measurable space (Ω, A), TV stands for total variation distance, and is defined as Z dQ 1 dP (ω) dλ(ω). TV(P, Q) = sup |P (A) − Q(A)| = (ω) − 2 dλ dλ A∈A By Cauchy-Schwartz one has: 1 − TV(Pπ , PΘ0 ) ≥ 1 −

1 2

q Dχ2 (Pπ , PΘ0 ),

(A.4)

where Dχ2 (P, Q) is the chi-square divergence between the measures P, Q and is defined as: 2 2 Z  Z  dP dP (ω) − 1 dQ(ω) = (ω) dQ(ω) − 1, Dχ2 (P, Q) = dQ dQ assuming that P  Q. Observe that Dχ2 (P, PΘ0 ) can be equivalently expressed as: Dχ2 (P, PΘ0 ) = EΘ0 L2Θ0 − 1,

(A.5)

P dPΘe 1 where LΘ0 = |C| e∈C dPΘ0 is the integrated likelihood ratio, and EΘ0 denotes the expectation under Xi ∼ N (0, (Θ0 )−1 ). Hence by (A.3) and (A.4), it suffices to obtain upper bounds on the integrated likelihood ratio in order to lower bound the minimax risk (1.3). Writing 44

out the likelihood ratio comparing the normal distribution with precision matrix Θ0 to the uniform mixture of normal distribution with precision matrix Θe for e ∈ C we get: LΘ0

1 X = |C| e∈C



det(Θe ) det(Θ0 )

n/2 Y n

exp(−XiT θAe Xi /2).

i=1

To calculate the chi-square distance in (A.5), next we square this expression and take its expectation under PΘ0 to obtain: # " n X  det(Θe ) n/2  det(Θe0 ) n/2 Y 1 EΘ0 L2Θ0 = EΘ0 exp(−XiT θ(Ae + Ae0 )Xi /2) |C|2 e,e0 ∈C det(Θ0 ) det(Θ0 ) i=1     n/2 n/2 det(Θe0 ) 1 X det(Θe ) . (A.6) = |C|2 e,e0 ∈C det(Θ0 ) det(Θe,e0 ) Next, we will expand the determinants above. Recall that we have ensured that 1 − θ(kA0 k2 + 2) > 0 (see A.2). This implies θ max(kA0 k2 , kA0 + Ae k2 , kA0 + Ae0 k2 , kAe,e0 k2 ) ≤ 1. For what follows for a symmetric matrix Ad×d we denote its ordered eigenvalues with λ1 (A) ≥ λ2 (A) ≥ . . . ≥ λd (A). Let A ∈ Rd×d be a symmetric matrix such that kAk2 ≤ 1. Then we have: log det(I + A) =

d X

log λj (I + A) =

j=1

d X

log(1 + λj (A)) =

j=1

d X ∞ X

(−1)k+1

j=1 k=1

λkj (A) k

∞ X Tr(Ak ) (−1)k+1 . = k k=1

Using the form det(I + A) = exp(log det(I + A)) and plugging the above equation into (A.6), we conclude that:  X   ∞ 1 X n (−θ)k 2 k k EΘ0 LΘ0 = exp T1 + T2 , where |C|2 e,e0 ∈C 2 k=1 k T1k := Tr[Ak0 − (A0 + Ae )k ]

T2k := Tr[(Ae,e0 )k − (A0 + Ae0 )k ].

Step 3 (Trace Perturbation Inequalities). In this step, we control T1k + T2k in terms of k and link it with the geometric quantities of the graph. We view T1k as the perturbation difference between Tr[Ak0 ] and Tr[(A0 + Ae )k ] and 45

we treat T2k similarly. In the following step, we aim to develop the perturbation inequalities for the trace of matrix powers. First we will argue that T1k + T2k ≥ 0 for all even k ∈ N. To see this recall that the trace operator of an adjacency matrix M satisfies Tr(Mk ) = number of all closed walks of length k. First we consider case e 6= e0 . Notice that all closed walks in G(A0 + Ae ) that do not belong to G(A0 ) have to pass through the edge e at least once. Similarly all closed walks in G(A0 + Ae0 ) that do not belong to G(A0 ) have to pass through the edge e0 at least once. Furthermore, all closed walks of length k passing through either e or e0 belong to G(Ae,e0 ). In addition G(Ae,e0 ) might contain extra closed walks passing through both e and e0 . This shows: T1k + T2k ≥ 0, for all k. Since when k is odd we have (−θ)k (T1k + T2k ) ≤ 0, we only need to focus on k even. Next we prove that for k < 2dG0 (e, e0 )+2, we have T1k +T2k ≡ 0. To see this, first consider the case e 6= e0 . Notice that the graph G(Ae,e0 ) cannot contain paths passing through both e and e0 unless k ≥ 2dG0 (e, e0 ) + 2. To see this, notice that no even length closed walk between e and e0 can exist if the length of this walk is smaller than 2dG0 (e, e0 ) plus the two edges e and e0 . This proves our claim in the case e 6= e0 . In the special case e = e0 , the length of the path trivially needs to be at least of length 2 to pass through both e and e0 . We will now argue that for even k ∈ N we have T1k + T2k ≤ 2(kA0 k2 + 2)k . In fact we will prove that T1k ≤ 0 for all k and T2k ≤ 2(kA0 k2 + 2)k for all even k. To see that T1k ≤ 0, note that G(A0 ) contains less closed walks than G(A0 + Ae ). Recall that for a symmetric matrix Ad×d we denote its ordered eigenvalues with λ1 (A) ≥ λ2 (A) ≥ . . . ≥ λd (A). Using Corollary D.2 for the matrices Ae,e0 , A0 +Ae0 , Ae with constants cσ(j) = sign(λj (Ae,e0 ) − λj (Ae ))|λj (Ae,e0 ) − λj (Ae )|k−1 , we obtain: d X

k

|λj (Ae,e0 ) − λj (Ae )| ≤

j=1

d X

cj λj (A0 + Ae0 ) ≤

j=1

d hX j=1

|cj |

k k−1

d hX i k−1 k

|λj (A0 + Ae0 )|k

i k1

,

j=1

where the last inequality follows by H¨older’s inequality. We conclude that: d X

k

|λj (Ae,e0 ) − λj (Ae )| ≤

j=1

d X j=1

46

|λj (A0 + Ae0 )|k .

(A.7)

Next, observe that for a single edge matrices the matrix −Ae has very simple eigenvalue structure: 1, −1 and d − 2 zeros. Hence we conclude that for even k: T2k = Tr((Ae,e0 )k ) − Tr((A0 + Ae0 )k ) =

d X

|λj (Ae,e0 )|k −

d X

j=1 k

k

|λj (A0 + Ae0 )|k

j=1 k

≤ |λ1 (Ae,e0 )| − |(λ1 (Ae,e0 ) − 1)| + |λd (Ae,e0 )| − |λd (Ae,e0 ) + 1|k

(A.8)

≤ |λ1 (Ae,e0 )|k + |λd (Ae,e0 )|k ≤ 2kAe,e0 kk2 ≤ 2(kA0 k2 + 2)k .

(A.9)

The last shows that indeed T1k + T2k ≤ 2(kA0 k2 + 2)2 as claimed. Putting everything together we obtain ∞ X (−θ)k k=1

k ≤

[T1k

+

T2k ]

∞ X



2|k, k≥2dG0 (e,e0 )+2 ∞ X

2|k, k≥2dG0 (e,e0 )+2

θk k [T + T2k ] k 1 0

2((kA0 k2 + 2)θ)2dG0 (e,e )+2 2((kA0 k2 + 2)θ)k ≤ k (2dG0 (e, e0 ) + 2)(1 − (θ(kA0 k2 + 2))2 ) 0

2((kA0 k2 + 2)θ)2dG0 (e,e )+2 ≤ , dG0 (e, e0 ) + 1 where in the last inequality we used the fact that θ ≤



1 2(kA0 k2√ +2)

requirements on θ. This completes the proof of (A.1) where R =

which follows by the

2(kA0 k2 + 2).

Step 4 (Rate Control). The goal in this final step is to show that if (2.4) holds the risk liminf γ(S0 (θ, s), S1 (θ, s)) = 1. The proof is technical, but the high-level idea is to clip the first log |C| degrees in (A.1) and deal with two separate summations. It turns out that the scaling assumed on θ in (2.4) is precisely enough to control both the summation of all degrees below log |C| and the summation of all degrees above log |C|. Define the following quantities: Kr := |{(e, e0 ) : e, e0 ∈ C, dG0 (e, e0 ) = r}|,  P where (e, e0 ) are unordered edge pairs, and observe that r≥0 Kr = |C| + |C| by definition. 2 q We will in fact, first show that if θ ≤ κ logn|C| for some small κ, and blog |C|c

X

Kr = O(|C|2−γ ),

(A.10)

r=0

for some 0 < γ ≤ 1, then limsup γ(S0 (θ, s), S1 (θ, s)) = 1 provided that |C| → ∞. We will then derive the Theorem as a Corollary to this observation. On an important note, in Remark 47

A.1 we will argue that if condition (A.10), is satisfied then M (C, dG0 , 21 log |C|)  log |C|, and hence we do not lose any generality in drawing the conclusion of the Theorem as a Corollary to this setup. It suffices to control:  2dG (e,e0 )+2  X 2 θ 0 2K0 − |C| 2 + (A.11) /2), exp n exp(nθ |C|2 0 dG0 (e, e0 ) + 1 |C|2 0 (e,e ):dG0 (e,e )≥1 {z } | {z } | I2 I1

where we will write θ for Rθ for brevity. We will show that when θ¯ is small, (A.11) is bounded by 1 asymptotically, which in turn suffices to show that limsup γ(S0 (θ, s), S1 (θ,q s)) = 1. Notice that θ and θ are the same quantity up to the constant R and hence θ < κ q equivalent to θ < κ ¯ logn|C| for some sufficiently small κ ¯ . We will require: p κ ¯ < 2γ ∧ (ec0 )−1/2 . q Observe that since K0 = O(|C|2−γ ), κ ¯ 2 /2 < γ, and θ < κ ¯ logn|C| we have:

log |C| n

is

2 /2

I2 ≤

(2K0 − |C|)|C|κ¯ |C|2

→ 0.

q Next we tackle the term I1 in (A.11). We will show that since θ = κ ¯ logn|C| = o(1) by assumption, this term goes to 1. d−1 2 X 2 X 2K∞ 2 2r 2r+2 Kr |C|κ¯ θ /(r+1) + I1 = /(r + 1)) = Kr exp(nθ 2 2 |C| r≥1 |C| r=1 |C|2 d−1 2r 2K∞ 2 X 2θ /(r+1) + < K |C| , r 2 |C| r=1 |C|2

where the last inequality follows by the fact that κ ¯ 2 /2 < γ < 1. Splitting out the first blog |C|c terms out of this summation yields: blog |C|c 2r 2 X 2 I1 < Kr |C|2θ /(r+1) + 2 2 |C| r=1 |C| | {z } |

d−1 X

Kr |C|2θ

r=blog |C|c+1

{z

I11

I12

The first term is bounded by:  I11

≤ 2

blog |C|c

X r=1



θ

2

|C| = o(1), Kr  |C|2 48

2r

/(r+1)

+

2K∞ . |C|2 }

where we used

P

blog |C|c r=2

 2 Kr = O(|C|2−γ ) and the fact that θ < γ. Next, notice that |C|2θ

2r

/(r+1)

2r

≤ 1 + 6θ (|C| − 1)/(r + 1).

This follows by: 2r

2r

2r

exp(2 log(|C|)θ /(r + 1)) ≤ 1 + 6 log(|C|)θ /(r + 1) ≤ 1 + 6(|C| − 1)θ /(r + 1), 2r

with the first inequality holding when 2 log(|C|)θ /(r + 1) < 1, which is true since θ < √ κ ¯ c0 < exp(−1/2) < √12 , and r ≥ blog |C|c + 1. Hence we have: I12

2 ≤ |C|2

d−1 X

2r

Kr (1 + 6(|C| − 1)θ /(r + 1)) +

r=blog |C|c+1

  12(|C| − 1) O(|C|2−γ ) + ≤ 1− |C|2 |C|2

d−1 X r=blog |C|c+1

2K∞ |C|2

Kr 2r θ r

  2blog |C|c+2 O(|C|2−γ ) 12θ Kr . ≤ 1− + |C|−1 max 2 2 blog |C|c+1≤r r |C| 1−θ Paying closer attention to the second term we have: 12θ

2blog |C|c+2

1−θ

2

Kr 2blog |C|c+2 ≤ 24θ |C|−1 max blog |C|c+1≤r r ≤ 24θ

2blog |C|c+2

|C| 2



+ |C| |C|

|C| = o(1),

with the last equality holds since: √ θ ≤ κ c0 < exp(−1/2), as we required.qThis combined with (A.1) concludes the proof of limsup γ(S0 (θ, s), S1 (θ, s)) =

1 when θ ≤ κ logn|C| . Finally, notice that any subset of a NAS C is trivially a NAS. Hence we can apply what we just showed to the set Nlog |C| ⊂ C — the maximal log |C|-packing of C. Evaluating the constants Kr on Nlog |C| gives: K0 = |Nlog |C| |, Kr = 0 for all r ≤ blog |C|c, and since bNlog |C| c ≤ blog |C|c we conclude that any 0 < γ ≤ 1.

49

PbNlog |C| c r=0

Kr = |Nlog |C| | = O(|Nlog |C| |2−γ ) for

Remark A.1. In this remark we argue that if one can find a NAS C with null base G0 satisfying |C| → ∞ and (A.10), then M (C, dG0 , 12 log |C|) satisfies M (C, dG0 , 21 log |C|)  log |C|, and hence there are no benefits on imposing conditions on Kr as in (A.10) over conditions on the entropy except in terms of constants. We build a set Nlog |C|/2 in a sequential manner, by transferring an edge e from C and removing all edges e0 ∈ C such that dG0 (e, e0 ) ≤ 21 log |C|. Suppose in total there were k such additions so that log k ≤ M (C, dG0 , 21 log |C|). If on the i-th step there were mi edges removed then by the triangle inequality and (A.10) we have: X mi  + mi = O(|C|2−γ ), 2 i∈[k]

and thus

P

i∈[k]

m2i = O(|C|2−γ ). Hence since we have removed all edges from |C| we conclude: |C| ≤ k

P

i∈[k]

k

mi

sP ≤

i∈[k]

k

m2i

 |C|1−γ/2  √ =O , k

which shows that k & |C|γ , and therefore indeed M (C, dG0 , 21 log |C|)  log |C|. Finally we note that a constant in front of the the packing radius log |C| will not affect the rates. This completes the proof. Proof of Theorem 2.2. The proof of this result follows the same line of argument as the proof of Theorem 2.1. Here we just sketch the differences. The first important observation is that the risk γ(S0 , S1 ) is symmetric in the sense that γ(S0 , S1 ) = γ(S1 , S0 ). This implies that if controlling the eigenvalues of precision matrices made in the proof of Theorem 2.1 hold for edge deletion NAS, the proof will continue to hold. If we denote by Ae the adjacency matrix of (V , {e}) for any e ∈ C, then the adjacency matrices of the graphs under the null G\e are given by A1 − Ae . It is now a simple exercise to check that: Tr(Ak1 ) − Tr((A1 − Ae )k ) ≤ 2kA1 kk2 ,

Tr((A1 − Ae0 )k ) − Tr((A1 − Ae − Ae0 )k ) ≤ 0,

and the proof is completed as in Theorem 2.1.

B B.1

Structure Testing Algorithms Proofs Connectivity Testing Proofs

b (1) Lemma B.1. M(s) and p Assume the estimate Θ satisfies (3.3) on the parameter space r−2 ∗ b let θ = rK log d/bn/2c, where r > 2. Then for all e ∈ E(T ), we have |Θe | ≥ r θ when Θ∗ ∈ S1 (θ, s) (see definition in (3.12)) with probability at least 1 − 1/d. Moreover, when (1) ∗ b = argmin Θ∗ ∈ S0 (s) (see definition in (3.11)) for E b = 0 with e∈E(Tb) |Θe | we have ΘE probability at least 1 − 1/d. 50

Proof of Lemma B.1. By (3.3) we obtain the following: r−1 θ, when |Θ∗e | ≥ θ, r b (1) | ≤ θ , when |Θ∗ | = 0, |Θ e e r

b (1) | ≥ |Θ e

(B.1) (B.2)

with probability at least 1−1/d uniformly for all e. First consider the case when Θ∗ ∈ S0 (s). Since the graph is disconnected, we have that there exists at least one edge in e ∈ Tb, such b (1) b (1) that Θ∗e = 0 and hence by (B.2) we have |Θ e | ≤ θ/r. This implies that: |Θ b | ≤ θ/r. E b with Now suppose that Θ∗ 6= 0. Denote the connected component of G(Θ∗ ) containing E b E

C ∗ . Since Θ∗ ∈ S0 (s) there exists a spanning tree TC ∗ of C ∗ such that for all e ∈ E(TC ∗ ) we b (1) b (1) have |Θ∗e | ≥ θ. By (B.1) the latter implies that |Θ e | > |Θ b | for all e ∈ E(TC ∗ ). On the E (1) b other hand, note that the value |Θ | is the largest among the values of all edges connecting b E

b (1) ) which E b connects. However, as we argued there the two connected components of G(Θ b (1) | > |Θ b (1) |. This is a exist an edge ee ∈ TC ∗ connecting these two components satisfying |Θ ee b E ∗ b contradiction with the fact that T is a MST. Hence Θ = 0. b E

b e| ≥ Next let Θ∗ ∈ S1 (θ, s). In this case by (B.1) the tree will cover only edges with |Θ (r − 1)θ/r and hence by assumption (3.3) and the fact that Θ∗ ∈ S1 (θ, s) we have |Θ∗Eb | ≥ (r − 2)θ/r. This completes the proof. Proof of Corollary 3.2. We first verify the size of the test. By Lemma B.1 we conclude that with asymptotic probability 1 we have mine∈E(Tb) |Θ∗e | = 0 when Θ∗ ∈ S0 (s). Conditioning on this event by Proposition 3.1 we conclude that: limsup

sup

n→∞ Θ∗ ∈S0 (θ)

P(ψ = 1) ≤ α.

 For the second part, with asymptotic probability 1 we have that |Θ∗Eb | ≥ θ 2+ ≥ Using the second part of Proposition 3.1 we can guarantee:

liminf

sup

n→∞ Θ∗ ∈S (θ,s) 1



2 K 2+

p log d/n.

P(ψ = 0) = 0.

This completes the proof. b (1) Lemma B.2. Assume (3.3) on the parameter space p the estimate Θ satisfies assumption ∗ M(s) and θ ≥ rK log d/bn/2c for some r > 2. If Θ ∈ S0 (s) then |Θ∗Eb | = 0 where −1 ∗ b = argmin b (1) E b |Θe | with probability at least 1 − d . Furthermore, when Θ ∈ S1 (θ, s), e∈E(F )

for all e ∈ E(Fb) we have |Θ∗e | ≥

r−2 θ r

with probability at least 1 − d−1 .

Proof of Lemma B.2. First observe that we have that Fb with I connected components has bi }i∈[d−I] , where {E bi }i∈[d−1] is an ordering of E(Tb) such that |Θ b (1) | ≥ an edge set E(Fb) = {E b E i

51

(1)

b |Θ bi+1 | for all i ∈ [d − 2]. The remainder of the proof is similar to that of Lemma B.1 so we E omit the details. Proof of Corollary 3.3. Setting r = 2 +

C0C4 , K

we get that p θ = max(rC 0 C 4 /(r − 2), rK) log d/bn/2c.

p Hence θ ≥ rK log d/bn/2c, and a direct application of Lemma B.2 ensures that with high probability the edge set E(Fb) satisfies mine∈E(Fb) |Θ∗e | ≥ (r − 2)θ/r when Θ∗ ∈ S1 (θ, s) and mine∈E(Fb) |Θ∗e | = 0 when Θ∗ ∈ S0 (s). The remaining part of the proposition follows directly by Proposition 3.1.

B.2

Cycle Detection Proofs

b (1) satisfies (3.3) on the parameter space M(s) and LemmapB.3. Assume the estimate Θ b be the cycle obtained by Algorithm 4. We have θ = rK log d/bn/2c for some r ≥ 2. Let E mine∈Eb |Θ∗e | ≥ (r − 2)θ/r when Θ∗ ∈ S1 (θ, s) and mine∈Eb |Θ∗e | = 0 when Θ∗ ∈ 1¸0 (s) with probability at least 1 − 1/d. Proof of Lemma B.3. When Θ∗ ∈ S11 (θ, s), there exists a cycle E ⊂ E(G(Θ∗ )). Hence b e| > by (B.1) when the algorithm terminates we will have covered only edges such that |Θ (r − 1)θ/r. The latter, combined with (B.1) implies that until a cycle is reached, Algorithm 4 will have included only edges of true magnitude |Θ∗e | ≥ (r − 2)θ/r, which completes the proof of the first claim. b such that On the other hand, when Θ∗ ∈ S0 (s), we have that there exists an edge e ∈ E Θ∗e = 0, which completes the proof. Proof of Corollary 3.4. To handle the null hypothesis observe that by Lemma B.3 we have mine∈Eb |Θ∗e | = 0. Invoking Proposition 3.1 gives the desired result. To show that the test is asymptotically powerful we use Proposition 3.1 in conjunction with the other result of Lemma B.3.

B.3

SAP Length Test Proofs

b (1) satisfies (3.3) on the parameter space M(s) and Lemma p B.4. Assume the estimate Θ θ = rK log d/bn/2c for some r ≥ 2. If Θ∗ ∈ S0 (s), we have that mine∈Eb |Θ∗e | = 0. Conversely when Θ∗ ∈ S1 (θ, s) we have mine∈Eb |Θ∗e | ≥ (r − 2)θ/r with probability at least 1 − 1/d. b is a SAP with Proof of Lemma B.4. First, we consider the case when Θ∗ ∈ S0 (s). Since E I + 1 edges and G(Θ∗ ) is free of such SAPs when Θ∗ ∈ S0 (s) we must have mine∈Eb |Θ∗e | = 0. b e | > (r − 1)θ/r. Thus by 3.3 |Θ∗e | ≥ (r − 2)θ/r Next, when Θ∗ ∈ S1 (θ, s), by (B.1) mine∈Eb |Θ with probability at least 1 − 1/d. 52

Proof of Corollary 3.6. To handle the null hypothesis observe that by Lemma B.4 we are guaranteed to have mine∈Eb Θ∗e = 0. Invoking Proposition 3.1 gives the desired result. To show that the test is asymptotically powerful we use Proposition 3.1 in conjunction with Lemma B.4.

B.4

Triangle-Free Tests Proofs

b (1) satisfies (3.3) on the parameter space M(s) and Lemma p B.5. Assume the estimate Θ θ = rK log d/bn/2c for some r ≥ 2. If Θ∗ ∈ S0 (s), we have that mine∈Eb |Θ∗e | = 0. Conversely when Θ∗ ∈ S1 (θ, s) we have mine∈Eb |Θ∗e | ≥ (r − 2)θ/r with probability at least 1 − 1/d. Proof of Lemma B.5. Proof is the same as Lemma B.4, we omit the details. Proof of Corollary 3.5. Proof is the same as Corollary 3.6, we omit the details.

B.5

Results for Thresholded Graphs

In this section, we discuss the hypothesis in (5.3) H0 : G(Θ∗ , µ) disconnected vs H1 : G(Θ∗ , µ) connected B and prove the result on the test ψα, (D2 ) proposed in Section 5.3. Denote Tb

n o S0 (s) := Θ ∈ M(s) : G(Θ∗ , µ) disconnected and, n o ∗ S1 (θ, s) := Θ ∈ M(s) : G(Θ , µ + θ) connected . We have the following result parallel to Corollary 3.2. Corollary B.6. Let θ = (2K + C 0 C 4 )log d/bn/2c for an absolute constant C 0 > 0. Assume B that µ ≥ θ and (3.10) holds.Then for any fixed level α, the test ψα, b (D2 ) from the output of E Algorithm 3 satisfies: limsup sup P(reject H0 ) ≤ α,

liminf

inf

n→∞ Θ∗ ∈S1 (θ,s)

n→∞ Θ∗ ∈S0 (s)

P(reject H0 ) = 1.

Proof. The proof is same as the one of Corollary 3.2. We only need to change the proof of Proposition 3.1 in Appendix E.1 to the thresholded version. In fact, we only need to change (E.10). When |Θ∗jk | < µ, we have √ √ √ p √ √ e jk | ≥ n|κ log d/n| − n max |Θ e jk − Θ∗jk | + nµ ≥ nµ + c1−α,E . n|Θ (j,k)∈E

All the remaining part of the argument is the same. 53

C

Multi-Edge NAS Proofs

Before proceeding with the proofs we remind the reader definition (4.2) of AS,S 0 = A0 + AS + A S 0 . Proof of Theorem 4.1. We denote by L := exp(M (C, dG0 , log |C|)) the cardinality of the (log |C|)-packing. Using Proposition C.1 with the constants from Setting 1, and the fact that we have a (log |C|)-packing it suffices to control:   0 X 2 n|VS 0 ,S ∩ V (S)|kAS k2 kAS 0 k2 (2kAS,S 0 k2 θ)2dG0 (S,S ) θ2 exp L2 2dG0 (S, S 0 ) + 2 0 0 (S,S ):dG0 (S,S )≥log |C|

+

2 L2

X (S,S 0 ):dG0 (S,S 0 )=0

 n|VS 0 ,S ∩ V (S)|kAS k2 kAS 0 k2 (2kAS,S 0 k2 θ)2 θ2  exp n|S ∩ S 0 |θ2 + , 4

When the cardinality of each |S| ≤ U the above expression can further be controlled by:   0   X 2 2nU 3 (2Λ0 θ)2dG0 (S,S ) θ2 2 2 3 2 2 exp + exp nU θ + 2nU (Λ0 θ) θ , L2 2dG0 (S, S 0 ) + 2 {z } |L (S,S 0 ):dG0 (S,S 0 )≥log |C| I2 {z } | I1

where Λ0 := kA0 k2 + 2U , and we used the facts that |S ∩ S 0 |, kAS k2 , kAS 0 k2 ≤ U , |VS 0 ,S ∩ V (S)| ≤ |V (S)| ≤ 2U and that dG0 (S, S 0 ) = 0 is only possible when S ≡ S 0 . We first deal with the term I2 . For values of 1 θ< , 2U Λ0 q L 2 2 for some sufficiently small 0 < κ we we have I2 ≤ L exp(2nU θ ). Hence when θ ≤ κ log nU have I2 = o(1). Next, similarly to the proof of Step 4 in Theorem 2.1 (recall the definition of Kr ) we can show:   1 12(2Λ0 θ)2blog Lc+2 −1 Kr I1 ≤ 1 − + L max . blog Lc+1≤r r L 1 − (2Λ0 θ)2 Paying closer attention to the second term we have: 12(2Λ0 θ)2blog Lc+2 −1 Kr L max ≤ 24(2Λ0 θ)2blog Lc+2 2 blog Lc+1≤r r 1 − (2Λ0 θ)

L 2



+L L

≤ 24(2Λ0 θ)2blog Lc+2 L = o(1), with the last equalities hold for: 2Λ0 θ ≤

1 < exp(−1/2). 2

This completes the proof. 54

Proof of Theorem 4.2. Note that by the definition of B we have:   2 2 max (kAS k2 kAS 0 k2 kAS,S 0 k2 ) ∧ (|VS,S 0 |kAS,S 0 k1 ) ≤ B. 0 S,S ∈C

Therefore using Proposition C.1 it suffices to control:    1 X 2 2 exp |VS,S 0 ∩ V (S)|nθ R + Bθ . Dχ2 := |C|2 S,S 0 ∈C P First note that by |VS,S 0 ∩ V (S)| = v∈V (S) 1(v ∈ VS,S 0 ), we have:  X    1 X 2 2 1(v ∈ VS,S 0 ) . D χ2 ≤ exp nθ R + Bθ |C|2 S,S 0 ∈C v∈V (S)

Denotep by PS 0 the measure induced by drawing S 0 uniformly from C. Under the assumption: θ < R/B, and using the fact that the random variables {1(v ∈ VS,S 0 ) | v ∈ V (S)} are negatively associated for every fixed S ∈ C, we obtain: h X h ii log Dχ2 ≤ max log exp(2Rnθ2 )PS 0 (v ∈ VS,S 0 ) + (1 − PS 0 (v ∈ VS,S 0 )) S∈C

v∈V (S)

h i X ≤ max (exp(2Rnθ2 ) − 1) PS 0 (v ∈ VS,S 0 ) S∈C

v∈V (S) 2

≤ exp(2Rnθ ) max ES 0 |VS,S 0 |, S∈C

where the expectation ES 0 is taken with respect to a uniform draw of S 0 ∈ C. The first inequality above is due to negative association, the second inequality is due to log(1+x) ≤ x. Hence for values of θ v   v   u u u log [max u log [max 1/2 −1 −1 0 0 0 0 0 0 E |V |] [max E |V |] E |V |] S∈C S S,S S∈C S S,S S∈C S S,S t t = , θ≤ 2nR 4nR we have Dχ2 ≤ exp([maxS∈C ES 0 |VS,S 0 |]1/2 ), and therefore: 1q exp([max ES 0 |VS,S 0 |]1/2 ) − 1 = 1, liminf γ(S0 (θ, s), S1 (θ, s)) ≥ 1 − n S∈C 2 where the last equality holds since MB (C, V ) → ∞ implies maxS∈C ES 0 |VS,S 0 | = o(1). Proposition C.1. Let G0 ∈ G0 and C be a NAS set with null base G0 . Then for any collection of vertex buffers V = {VS,S 0 }S,S 0 ∈C and any of the following two choices: (|VS,S 0 ∩ V (S)| ∧ |VS,S 0 ∩ V (S 0 )|)kAS k2 kAS 0 k2 and KS,S 0 := 2kAS,S 0 k2 ; kAS,S 0 k22 |VS,S 0 ∩ V (S)||VS,S 0 ∩ V (S 0 )| := and KS,S 0 := 2kAS,S 0 k1 . kAS,S 0 k21

Setting 1: HS,S 0 := Setting 2: HS,S 0

55

when the signal strength satisfies θ < min 0

S,S ∈C

1 − C −1 √ , 2 2kAS,S 0 k1

(C.1)

we have: v    u 2(dG0 (S,S 0 )∨1+1) 0 (KS,S 0 θ) 1u 1 X H S,S γ(S0 (θ, s), S1 (θ, s)) ≥ 1 − t exp n |S ∩ S 0 |θ2 + − 1, 2 |C|2 S,S 0 ∈C 2(dG0 (S, S 0 ) ∨ 1 + 1) Remark C.1. Proposition C.1 continues to hold for parameter classes M(s) not imposing bounds on the `1 norm of the precision matrix Θ, if we substitute kAS,S 0 k1 with kAS,S 0 k2 in (C.1). In fact tracking the proof, it is easy to see that the theorem also remains valid for parameter spaces such that I + θA0 ∈ S0 (θ, s) and I + θ(A0 + AS ) ∈ S1 (θ, s) for all S ∈ C, −1 √ with (C.1) replaced by θ < minS,S 0 ∈C kA 1−C . 0 k ∨ 2K 0 S,S

2

S,S

Definition C.1. Let C be a NAS with null base G0 ∈ G0 whose adjacency matrix is A0 . We call a set of constants {HS,S 0 , KS,S 0 : S, S 0 ∈ C} admissible with respect to the pair (G0 , C) if for all even integers k ≥ 4 the following holds: k Tr(AkS,S 0 + Ak0 − (A0 + AS )k − (A0 + AS 0 )k ) ≤ HS,S 0 KS,S 0,

(C.2)

for all S, S 0 ∈ C. We will see that any admissible set of constants {HS,S 0 , KS,S 0 : S, S 0 ∈ C} yields the lower bound on the minimax risk claimed by Proposition C.1. Proof of Proposition C.1. We will in fact show a slightly stronger result than presented, involving the following additional combinatorial quantity: ∆S,S 0 := |{triangles in G(AS,S 0 ) with at least one edge in S and another edge in S 0 }|.

Step 1 (Matrix Construction). In this step we construct a set of precision matrices and argue that they fall into the parameter set $\mathcal{M}(s)$. Take $\Theta_0 = I+\theta A_0$, $\Theta_S = I+\theta(A_0+A_S)$ and $\Theta_{S,S'} = I+\theta A_{S,S'}$, for $S,S'\in\mathcal{C}$ and some $\theta>0$. For any $S,S'\in\mathcal{C}$ we have:
$$\max(\|A_0\|_2,\|A_0+A_S\|_2,\|A_{S,S'}\|_2) \le \|A_{S,S'}\|_2 \le \|A_{S,S'}\|_1,$$
$$\max(\|A_0\|_1,\|A_0+A_S\|_1,\|A_{S,S'}\|_1) \le \|A_{S,S'}\|_1,$$
where these inequalities hold since all of the matrices $A_0$, $A_0+A_S$ and $A_{S,S'}$ consist only of nonnegative entries.

Similarly to the proof of Theorem 2.1, we can make sure that the matrices $\Theta_0$ and $\Theta_S$ fall into the set $\mathcal{M}(s)$, and in addition the matrix $\Theta_{S,S'}$ is strictly positive definite, provided $\theta < \frac{1-C^{-1}}{\Gamma}$. Thus by assumption the graphs satisfy $G(\Theta_0)\in\mathcal{G}_0$ and $G(\Theta_S)\in\mathcal{G}_1$ for all $S\in\mathcal{C}$. In addition, this implies that the matrices $\Theta_0$, $\Theta_S$, $\Theta_{S,S'}$ are strictly positive definite for any $S,S'\in\mathcal{C}$.

Step 2 (Risk and Trace Bounds). In this step we lower bound the risk, and further derive some combinatorial bounds on the traces of powers of adjacency matrices. These bounds track more details than the bounds discussed in Theorem 2.1. Similarly to Step 2 of the proof of Theorem 2.1, it suffices to bound:
$$\left[\frac{\det(I+\theta(A_0+A_S))\det(I+\theta(A_0+A_{S'}))}{\det(I+\theta A_0)\det(I+\theta A_{S,S'})}\right]^{n/2} = \exp\left(\frac n2\sum_{k=1}^\infty\frac{(-\theta)^k}{k}\mathrm{Tr}\big(A_{S,S'}^k + A_0^k - (A_0+A_S)^k - (A_0+A_{S'})^k\big)\right).$$
Similarly to Step 3 of Theorem 2.1, it is easy to argue that for any $k\in\mathbb{N}$ we have:
$$\mathrm{Tr}\big(A_{S,S'}^k + A_0^k - (A_0+A_S)^k - (A_0+A_{S'})^k\big) \ge 0.$$
We will consider three cases: (1) $k < 2(d_{G_0}(S,S')+1)$, (2) $k<4$ and $k\ge 2(d_{G_0}(S,S')+1)$, and (3) $k\ge4$ and $k\ge2(d_{G_0}(S,S')+1)$. For $k<2(d_{G_0}(S,S')+1)$, similarly to the argument in Step 3 of the proof of Theorem 2.1, the above is in fact an equality. However, for the case $k<4$ and $k\ge2(d_{G_0}(S,S')+1)$, we will instead show the following two more precise bounds:
$$\mathrm{Tr}\big(A_0^2 + A_{S,S'}^2 - (A_0+A_S)^2 - (A_0+A_{S'})^2\big) \le 4|S\cap S'|, \tag{C.3}$$
$$\mathrm{Tr}\big(A_0^3 + A_{S,S'}^3 - (A_0+A_S)^3 - (A_0+A_{S'})^3\big) \ge 6\Delta_{S,S'}. \tag{C.4}$$
The left hand side of (C.3) involves only edges lying in the intersection $S\cap S'$, since all closed walks containing at least one edge in $G_0$ cancel out; by definition the number of such edges is $|S\cap S'|$. In addition, each closed walk of length 2 in (C.3) has precisely one edge in $S$ and one edge in $S'$. Fixing the first edge to be in $S$ and the second edge to be in $S'$, we notice that each edge appears twice in the total count, once for each of its two vertices. Further multiplying by 2 to adjust for the ordering of the edges, we obtain $4|S\cap S'|$.

Next, notice that expression (C.4) contains only closed walks which are triangles. In addition, similarly to the logic above, only walks containing one edge from $S$ and one edge from $S'$ survive in (C.4). Further, each such triangle is counted at least 6 times: once for each of its three vertices in each of its two orientations. This completes the proof of (C.4). We will check the third case in the next step by verifying the admissibility in Definition C.1.
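The two walk-counting facts behind (C.3) and (C.4), namely that $\mathrm{Tr}(A^2)$ counts every edge twice and $\mathrm{Tr}(A^3)$ counts every triangle six times, are easy to confirm on a toy graph; the following sketch is purely illustrative:

```python
import numpy as np

# Triangle on {0,1,2} plus a pendant edge {2,3}.
A = np.zeros((4, 4))
for (u, v) in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1

# Tr(A^2) = 2 * (#edges): every edge yields one length-2 closed walk
# from each of its two endpoints.
print(np.trace(A @ A))          # 8.0 = 2 * 4 edges

# Tr(A^3) = 6 * (#triangles): each triangle is traversed from each of
# its 3 vertices in each of its 2 orientations.
print(np.trace(A @ A @ A))      # 6.0 = 6 * 1 triangle
```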

Step 3 (Verifying Admissibility). In this step we show that both settings below are admissible for any pair $(G_0,\mathcal{C})$:

S1: $H_{S,S'} := \frac{(|V_{S,S'}\cap V(S)|\wedge|V_{S,S'}\cap V(S')|)\|A_S\|_2\|A_{S'}\|_2}{\|A_{S,S'}\|_2^2}$ and $K_{S,S'} := 2\|A_{S,S'}\|_2$;

S2: $H_{S,S'} := \frac{|V_{S,S'}\cap V(S)||V_{S,S'}\cap V(S')|}{\|A_{S,S'}\|_1^2}$ and $K_{S,S'} := 2\|A_{S,S'}\|_1$.

Before we prove that the constants in Settings 1 and 2 are admissible, we will show a simple and general bound on closed walks over a sequence of graphs. Let $E_1,\ldots,E_j\subset E$ be fixed edge sets and $G_1=(V,E_1), G_2=(V,E_2),\ldots,G_j=(V,E_j)$ be graphs with vertex set $V$ and adjacency matrices $A_1,\ldots,A_j$. Denote by $w_{ii}$ the number of closed walks starting and ending at vertex $i$ whose $\ell$th edge lies in $G_\ell$ for $\ell\in[j]$. Note that $w_{ii}$ equals precisely the $(i,i)$-th entry of the matrix $A = \prod_{\ell\in[j]}A_\ell$, i.e.,
$$w_{ii} = A_{ii} \le \|A\|_2 \le \prod_{\ell\in[j]}\|A_\ell\|_2. \tag{C.5}$$
We conclude that for any fixed vertex $i\in V$, the number of closed walks starting and ending at vertex $i$ walking on the edges of $G_\ell$ for $\ell\in[j]$ is at most $\prod_{\ell\in[j]}\|A_\ell\|_2$.

Following the above argument, we will prove below the admissibility of the constants in Setting 1. More precisely, we will prove the following:
$$\mathrm{Tr}\big(A_0^k + A_{S,S'}^k - (A_0+A_S)^k - (A_0+A_{S'})^k\big) \le \frac{k^3}{4}\Big[\frac{(|V_{S,S'}\cap V(S)|\wedge|V_{S,S'}\cap V(S')|)\|A_S\|_2\|A_{S'}\|_2}{\|A_{S,S'}\|_2^2}\Big]\|A_{S,S'}\|_2^k, \tag{C.6}$$
which implies the admissibility of Setting 1 by the trivial bound $k^3/4\le 2^k$. Recall that $k$ is even. To prove (C.6), we remind the reader that the trace of the $k$th power of an adjacency matrix counts the number of closed walks of length $k$ in the graph. A walk is only counted in the LHS of (C.6) if it contains an edge from the set $S$ and another edge from the set $S'$. In the remainder of this step of the proof, we will bound the number of closed walks containing edges from both $S$ and $S'$.

Denote $\mathcal{C}^{(k)}_{S,S'} = \{\text{closed walks } C \text{ of length } k \text{ on } G(A_{S,S'}), \text{ with edges } e,e'\in C,\ e\ne e',\ e\in S,\ e'\in S'\}$. We will denote closed walks of length $k$ by $C = v_0^C\to v_1^C\to\ldots\to v_{k-1}^C\to v_0^C$, where $v_j^C$ is the $j$th vertex of $C$ and $(v_j^C,v_{j+1}^C)$ is its $j$th edge, and indexation is taken modulo $k$. All indexation below will also be taken modulo $k$ where applicable.

(Of note, one can show that in the special case of buffers $V_{S,S'} = \{V(E_0\cup S)\cap V(S')\}\cup\{V(E_0\cup S')\cap V(S)\}$ a stronger bound holds, with the constant $2\binom k2$ in place of $\frac{k^3}{4}$. This is however irrelevant for our purposes, since we relax this constant.)
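The layered-walk bound (C.5) can be verified numerically by brute-force enumeration; a minimal sketch on an assumed small random instance:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_adjacency(d, p):
    """Symmetric 0/1 adjacency matrix of an Erdos-Renyi graph."""
    U = rng.random((d, d)) < p
    A = np.triu(U, 1).astype(float)
    return A + A.T

d, j = 6, 3
layers = [random_adjacency(d, 0.5) for _ in range(j)]

def count_layered_walks(i):
    """Closed walks of length j from i whose l-th edge lies in the l-th graph."""
    frontier = {(i,): 1}
    for A in layers:
        new = {}
        for path, c in frontier.items():
            u = path[-1]
            for v in range(d):
                if A[u, v]:
                    new[path + (v,)] = new.get(path + (v,), 0) + c
        frontier = new
    return sum(c for path, c in frontier.items() if path[-1] == i)

P = layers[0] @ layers[1] @ layers[2]
bound = np.prod([np.linalg.norm(A, 2) for A in layers])  # product of spectral norms
for i in range(d):
    assert count_layered_walks(i) == P[i, i]   # w_ii equals the diagonal entry
    assert P[i, i] <= bound + 1e-9             # and is at most the norm product
print("(C.5) verified on this instance")
```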

[Figure 19: A closed walk $C$ from the set $\mathcal{C}_k(v,o,i,j)$, marking the vertices $v_o^C$, $v_{o+i}^C$, $v_{o+j}^C$, with the edge $(v_{o-1}^C,v_o^C)\in S$ and the edge $(v_{o+j}^C,v_{o+j+1}^C)\in S'$.]

For any $0\le j<k/2$, $0\le i\le j$, and a vertex $v\in V$, we first count the number of closed walks in the set $\mathcal{C}_k(v,o,i,j) = \big\{C\in\mathcal{C}^{(k)}_{S,S'} : v^C_{i+o}=v,\ (v^C_{o-1},v^C_o)\in S \text{ and } (v^C_{o+j},v^C_{o+j+1})\in S'\big\}$. Following (C.5), we have
$$|\mathcal{C}_k(v,o,i,j)| \le \|A_S\|_2\|A_{S'}\|_2\|A_{S,S'}\|_2^{k-2}. \tag{C.7}$$
Similarly, for $0\le j<k/2$, $0\le i\le j$, and a vertex $v\in V$ we define $\mathcal{C}'_k(v,o,i,j) = \big\{C\in\mathcal{C}^{(k)}_{S,S'} : v^C_{i+o}=v,\ (v^C_{o-1},v^C_o)\in S' \text{ and } (v^C_{o+j},v^C_{o+j+1})\in S\big\}$. We also have
$$|\mathcal{C}'_k(v,o,i,j)| \le \|A_S\|_2\|A_{S'}\|_2\|A_{S,S'}\|_2^{k-2}.$$
Define the sets $\mathcal{C}_k(o,j) = \big\{C\in\mathcal{C}^{(k)}_{S,S'} : (v^C_{o-1},v^C_o)\in S \text{ and } (v^C_{o+j},v^C_{o+j+1})\in S'\big\}$ and $\mathcal{C}'_k(o,j) = \big\{C\in\mathcal{C}^{(k)}_{S,S'} : (v^C_{o-1},v^C_o)\in S' \text{ and } (v^C_{o+j},v^C_{o+j+1})\in S\big\}$, so that we clearly have $\mathcal{C}_k(v,o,i,j)\subseteq\mathcal{C}_k(o,j)$ and $\mathcal{C}'_k(v,o,i,j)\subseteq\mathcal{C}'_k(o,j)$. By the definition of the vertex buffer set $V_{S,S'}$, each closed walk $C$ in $\mathcal{C}_k(o,j)$ has at least one vertex belonging to $V_{S,S'}|_S$ (and one vertex in $V_{S,S'}|_{S'}$). What is more, $C$ has a vertex in $V_{S,S'}|_S$ which belongs to the set $\{v_o,v_{o+1},\ldots,v_{o+j}\}$, since $v_o\in V(S)$ and $v_{o+j}\in V(S')$. This argument shows that:
$$\mathcal{C}_k(o,j) = \bigcup_{v\in V}\bigcup_{0\le i\le j}\mathcal{C}_k(v,o,i,j) \subseteq \bigcup_{v\in V_{S,S'}|_S}\bigcup_{0\le i\le j}\mathcal{C}_k(v,o,i,j).$$

Hence $|\mathcal{C}_k(o,j)| \le (j+1)|V_{S,S'}|_S|\,\|A_S\|_2\|A_{S'}\|_2\|A_{S,S'}\|_2^{k-2}$, and therefore by symmetry:
$$|\mathcal{C}_k(o,j)| \le (j+1)\big(|V_{S,S'}|_S|\wedge|V_{S,S'}|_{S'}|\big)\|A_S\|_2\|A_{S'}\|_2\|A_{S,S'}\|_2^{k-2}. \tag{C.8}$$
Similarly, we have
$$|\mathcal{C}'_k(o,j)| \le (j+1)\big(|V_{S,S'}|_S|\wedge|V_{S,S'}|_{S'}|\big)\|A_S\|_2\|A_{S'}\|_2\|A_{S,S'}\|_2^{k-2}. \tag{C.9}$$

We now observe that:
$$\mathcal{C}^{(k)}_{S,S'} \subseteq \bigcup_{o=0}^{k-1}\bigcup_{j=0}^{k/2-1}\big(\mathcal{C}_k(o,j)\cup\mathcal{C}'_k(o,j)\big).$$
Notice that in the special case when $j=k/2-1$, we have $\bigcup_{o=0}^{k-1}\mathcal{C}_k(o,k/2-1) = \bigcup_{o=0}^{k-1}\mathcal{C}'_k(o,k/2-1)$. Indeed, by the fact that indexation is taken modulo $k$, we trivially have $\mathcal{C}_k(o,k/2-1) = \mathcal{C}'_k(k/2-o,k/2-1)$. Hence:
$$\mathcal{C}^{(k)}_{S,S'} \subseteq \bigcup_{o=0}^{k-1}\Big(\Big[\bigcup_{j=0}^{k/2-2}\big(\mathcal{C}_k(o,j)\cup\mathcal{C}'_k(o,j)\big)\Big]\cup\mathcal{C}_k(o,k/2-1)\Big).$$
Applying (C.8) and (C.9), and summing over $j$, we have:
$$|\mathcal{C}^{(k)}_{S,S'}| \le \big(|V_{S,S'}|_S|\wedge|V_{S,S'}|_{S'}|\big)\|A_S\|_2\|A_{S'}\|_2\|A_{S,S'}\|_2^{k-2}\,k\Big[2\sum_{j=1}^{k/2-1}j + \frac k2\Big] = \frac{k^3}{4}\big(|V_{S,S'}|_S|\wedge|V_{S,S'}|_{S'}|\big)\|A_S\|_2\|A_{S'}\|_2\|A_{S,S'}\|_2^{k-2},$$
which completes the proof of (C.6).

To check the admissibility for Setting 2, just as before, a closed walk of length $k$ counted in the LHS of (C.2) must contain two edges lying in $S$ and $S'$ respectively. Then, by the definition of the vertex buffer set $V_{S,S'}$, such a walk certainly visits two vertices in the set $V_{S,S'}$ (one in $V_{S,S'}\cap V(S)$ and one in $V_{S,S'}\cap V(S')$). Notice that each vertex on the path is of degree at most $\|A_{S,S'}\|_1$, and hence can give rise to at most $\|A_{S,S'}\|_1$ continuations of the path. Therefore, having fixed two vertices from the sets $V_{S,S'}\cap V(S)$ and $V_{S,S'}\cap V(S')$, we are left with at most $\|A_{S,S'}\|_1^{k-2}$ paths. Taking into account that we can position the two vertices on at most $k(k-1)\le k^2$ spots completes the proof.
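The per-step continuation count used for Setting 2 amounts to the entrywise inequality $A^k\mathbf{1} \le \|A\|_1^k\mathbf{1}$ for a nonnegative symmetric adjacency matrix, where $\|A\|_1$ is the maximum column sum, i.e., the maximum degree. A quick numerical check, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
U = rng.random((d, d)) < 0.4
A = np.triu(U, 1).astype(float)
A = A + A.T

max_deg = np.linalg.norm(A, 1)  # induced l1 norm = max column sum = max degree
k = 5
# Number of length-k walks out of each vertex: (A^k 1)_i.
walks_from = np.linalg.matrix_power(A, k) @ np.ones(d)
assert walks_from.max() <= max_deg**k + 1e-9
print(walks_from.max(), max_deg**k)
```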

Step 4 (Proof Completion). In this step we complete the proof by arguing that if we are given any set of admissible constants $\{H_{S,S'},K_{S,S'}\}_{S,S'\in\mathcal{C}}$, the bound on the minimax risk from the statement of Proposition C.1 follows. Indeed,
$$\sum_{k=1}^\infty\frac{(-\theta)^k}{k}\mathrm{Tr}\big(A_{S,S'}^k + A_0^k - (A_0+A_S)^k - (A_0+A_{S'})^k\big) \le 2|S\cap S'|\theta^2 - 2\Delta_{S,S'}\theta^3 + H_{S,S'}\sum_{\substack{2\mid k,\\ k\ge2(d_{G_0}(S,S')\vee1+1)}}\frac{(K_{S,S'}\theta)^k}{k} \le 2|S\cap S'|\theta^2 - 2\Delta_{S,S'}\theta^3 + \frac{2H_{S,S'}(K_{S,S'}\theta)^{2(d_{G_0}(S,S')\vee1+1)}}{2(d_{G_0}(S,S')\vee1+1)},$$
where in the last inequality we used that $(K_{S,S'}\theta)^2\le\frac12$ by (C.1). In Step 3 we verified that the constants $\{H_{S,S'},K_{S,S'}\}$ given in the statement of Proposition C.1 are admissible for $(G_0,\mathcal{C})$ in the sense of Definition C.1. This completes the proof.
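For completeness, the geometric series step behind the last inequality can be spelled out; with $x = K_{S,S'}\theta$ and $m = 2(d_{G_0}(S,S')\vee1+1)$, whenever $x^2\le\frac12$ we have:
$$\sum_{\substack{2\mid k\\ k\ge m}}\frac{x^k}{k} \;\le\; \frac{x^m}{m}\sum_{j\ge0}x^{2j} \;=\; \frac{x^m}{m}\cdot\frac{1}{1-x^2} \;\le\; \frac{2x^m}{m}.$$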

C.1 Star Graph/Maximum Degree Proofs

Lemma C.2. Assume the estimate $\widehat\Theta^{(1)}$ satisfies (3.3) on the parameter space $\mathcal{M}(s)$ and $\theta = rK\sqrt{\log d/\lfloor n/2\rfloor}$ for some $r\ge2$. If $\Theta^*\in\mathcal{S}_0(s)$, we have that $\min_{e\in\widehat E}|\Theta^*_e| = 0$. Conversely, when $\Theta^*\in\mathcal{S}_1(\theta,s)$ we have $\min_{e\in\widehat E}|\Theta^*_e|\ge(r-2)\theta/r$ with probability at least $1-1/d$.

Proof of Lemma C.2. First, we consider the case when $\Theta^*\in\mathcal{S}_0(s)$. Since $\widehat E$ is a set of $s$ edges, we must have $\min_{e\in\widehat E}|\Theta^*_e| = 0$. Next, when $\Theta^*\in\mathcal{S}_1(\theta,s)$, by (B.1) we have $\min_{e\in\widehat E}|\widehat\Theta_e| > (r-1)\theta/r$. Thus by assumption (3.3), $|\Theta^*_e|\ge(r-2)\theta/r$ for all $e\in\widehat E$ with probability at least $1-1/d$.

Proof of Corollary 4.3. To handle the null hypothesis, observe that by Lemma C.2 we are guaranteed to have $\min_{e\in\widehat E}|\Theta^*_e| = 0$. Invoking Proposition 3.1 gives the desired result. To show that the test is asymptotically powerful, we use Proposition 3.1 in conjunction with Lemma C.2.

C.2 Clique Detection Upper Bound Proof

Lemma C.3. Under $H_0$ we have:
$$\widehat\lambda_{\min} \ge 1 - \sqrt{\frac sn} - \sqrt{\frac{2(s\log(ed/s)+\log\alpha^{-1})}{n}} \ge 1 - (\sqrt2+1)\sqrt{\frac{s\log(ed/s)+\log\alpha^{-1}}{n}}, \tag{C.10}$$
with probability no less than $1-2\alpha$.

Proof of Lemma C.3. The proof relies on an implication of Gordon's comparison theorem for Gaussian processes; see, e.g., Corollary 5.35 of Vershynin (2012). By this result we have that:
$$\lambda_{\min}(\widehat\Sigma_{CC}) \ge 1 - \sqrt{\frac sn} - t,$$
with probability at least $1-2\exp(-nt^2/2)$ for any $t\ge0$. Using the union bound in conjunction with the standard inequality $\binom ds\le\big(\frac{ed}{s}\big)^s$, and setting $t = \sqrt{\frac{2(s\log(ed/s)+\log\alpha^{-1})}{n}}$, we can ensure that (C.10) holds.

Lemma C.4. Under $H_1$ we have:
$$\widehat\lambda_{\min} \le \frac{1+\sqrt{\frac sn}+\sqrt{\frac{2\log\alpha^{-1}}{n}}}{\sqrt{1+(s-1)\theta}} \le \frac{1+(\sqrt2+1)\sqrt{\frac{s\log(ed/s)+\log\alpha^{-1}}{n}}}{\sqrt{1+(s-1)\theta}}, \tag{C.11}$$
with probability at least $1-2\alpha$.

Proof of Lemma C.4. Bearing in mind that $\Sigma = \Theta^{-1}$, by a simple calculation one can verify that for the set $C^* = \mathrm{supp}(v)$ we have $\lambda_{\min}(\Sigma_{C^*C^*}) = [\lambda_{\max}(\Theta_{C^*C^*})]^{-1} = \frac{1}{1+(s-1)\theta}$, with corresponding eigenvector $\frac{v_{C^*}}{\sqrt s}$. Again using Corollary 5.35 of Vershynin (2012), and the fact that for two symmetric psd matrices $A$, $B$ we have $\lambda_{\min}(ABA)\le\lambda_{\min}(A)^2\lambda_{\max}(B)$, we have:
$$\lambda_d(\widehat\Sigma_{C^*C^*}) \le \frac{1+\sqrt{\frac sn}+\sqrt{\frac{2\log\alpha^{-1}}{n}}}{\sqrt{1+(s-1)\theta}},$$
with probability at least $1-2\alpha$.

Proof of Proposition 4.4. Combining the result of Lemma C.3 with Lemma C.4 applied with $\alpha = d^{-1}$, it suffices to show that there is a gap between the bounds in (C.10) and (C.11). By simple algebra, when
$$\theta > \kappa\sqrt{\frac{\log(ed/s)}{sn}}$$
for a sufficiently large $\kappa$, the gap between (C.10) and (C.11) follows, which completes the proof.
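To give a feel for the eigenvalue gap that Lemmas C.3 and C.4 exploit, the following hypothetical simulation contrasts the smallest eigenvalue of the sample covariance restricted to a planted clique set $C^*$ under the null and the alternative. It is a sketch only: the actual statistic minimizes over all size-$s$ vertex subsets, which we do not attempt here, and all numerical constants are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, s, theta = 2000, 40, 5, 0.4

# Alternative: an s-clique with edge weight theta in the precision matrix,
# so that lambda_min(Sigma_{C*C*}) = 1/(1 + (s-1)*theta) on the clique C*.
Theta1 = np.eye(d)
Theta1[:s, :s] += theta * (np.ones((s, s)) - np.eye(s))

def min_eig_on_set(Theta, C):
    X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Theta), size=n)
    Sigma_hat = X.T @ X / n
    return np.linalg.eigvalsh(Sigma_hat[np.ix_(C, C)])[0]

C_star = list(range(s))
print("H0 (identity):", min_eig_on_set(np.eye(d), C_star))  # near (1 - sqrt(s/n))^2
print("H1 (clique):  ", min_eig_on_set(Theta1, C_star))     # near 1/(1+(s-1)theta)
```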

D Auxiliary Results

Recall that for a symmetric matrix $A\in\mathbb{R}^{d\times d}$ we denote its eigenvalues in decreasing order by $\lambda_1(A)\ge\lambda_2(A)\ge\ldots\ge\lambda_d(A)$.

Theorem D.1 (Lidskii's Inequality (Helmke and Rosenthal, 1995)). For two symmetric $m\times m$ matrices $A$ and $B$ we have:
$$\sum_{j=1}^k\lambda_{i_j}(A+B) \le \sum_{j=1}^k\lambda_{i_j}(A) + \sum_{j=1}^k\lambda_j(B),$$
for any $1\le i_1<\ldots<i_k\le m$.

We now derive the following corollary of Theorem D.1, which can also be viewed as a more general formulation of the latter result.

Corollary D.2. Under the setting of Theorem D.1, for any constants $c_1\ge c_2\ge\ldots\ge c_m$ and a permutation $\sigma$ on $\{1,\ldots,m\}$ we have:
$$\sum_{j=1}^m c_{\sigma(j)}\lambda_j(A+B) \le \sum_{j=1}^m c_{\sigma(j)}\lambda_j(A) + \sum_{j=1}^m c_j\lambda_j(B).$$

Proof of Corollary D.2. First notice that, since
$$\sum_{j=1}^m\lambda_j(A+B) = \sum_{j=1}^m\lambda_j(A) + \sum_{j=1}^m\lambda_j(B),$$
it suffices to show the bound when $c_j\ge0$ for all $j$. Let $C = \{j : c_j\ne0\}$. We will induct on the cardinality $|C|$, which we denote by $k$. When $k=1$, the inequality immediately follows from Theorem D.1. Suppose the inequality holds for $k=l$. Notice that the inequality can equivalently be expressed as
$$\sum_{j=1}^m c_j\lambda_{\sigma^{-1}(j)}(A+B) \le \sum_{j=1}^m c_j\lambda_{\sigma^{-1}(j)}(A) + \sum_{j=1}^m c_j\lambda_j(B). \tag{D.1}$$
To see why it holds for $k=l+1$, let $c^* = \min_{j\in C}c_j$, and notice that by Theorem D.1 we have:
$$\sum_{j\in C}c^*\lambda_{\sigma^{-1}(j)}(A+B) \le \sum_{j\in C}c^*\lambda_{\sigma^{-1}(j)}(A) + \sum_{j\in C}c^*\lambda_j(B).$$
This implies that inequality (D.1) follows from the inequality in which we subtract $c^*$ from all $c_j$, $j\in C$. This completes the proof by the induction hypothesis.
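Lidskii's inequality is straightforward to spot-check numerically; a minimal sketch on a random draw (an illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)

def eigs_desc(M):
    return np.sort(np.linalg.eigvalsh(M))[::-1]

m = 6
S = rng.standard_normal((m, m)); A = (S + S.T) / 2
T = rng.standard_normal((m, m)); B = (T + T.T) / 2

# Lidskii: for any indices i_1 < ... < i_k,
#   sum_j lambda_{i_j}(A+B) <= sum_j lambda_{i_j}(A) + sum_{j<=k} lambda_j(B).
la, lb, lab = eigs_desc(A), eigs_desc(B), eigs_desc(A + B)
idx = np.array([0, 2, 5])   # i_1 = 1, i_2 = 3, i_3 = 6 in 1-based notation
assert lab[idx].sum() <= la[idx].sum() + lb[:len(idx)].sum() + 1e-9
print("Lidskii inequality holds on this draw")
```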

E Supplementary Material

For a real valued random variable $X$ and a real number $\ell\ge1$, we define the Orlicz norm $\psi_\ell$ as:
$$\|X\|_{\psi_\ell} = \sup_{p\ge1} p^{-1/\ell}(\mathbb{E}|X|^p)^{1/p}. \tag{E.1}$$
We mainly use the $\psi_1$ and $\psi_2$ norms. Recall that random variables with bounded $\psi_1$ and $\psi_2$ norms are called sub-exponential and sub-Gaussian respectively (see, e.g., Vershynin, 2012).
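As a concrete instance of (E.1), the $\psi_2$ norm of a standard normal can be computed from the exact absolute moments $\mathbb{E}|Z|^p = 2^{p/2}\Gamma((p+1)/2)/\sqrt\pi$; a short sketch:

```python
import numpy as np
from scipy.special import gammaln

# psi_2 norm of Z ~ N(0,1) via (E.1), using the exact absolute moments
# E|Z|^p = 2^(p/2) * Gamma((p+1)/2) / sqrt(pi).
p = np.linspace(1, 200, 4000)
log_abs_moment = (p / 2) * np.log(2) + gammaln((p + 1) / 2) - 0.5 * np.log(np.pi)
vals = np.exp(log_abs_moment / p) / np.sqrt(p)   # p^{-1/2} (E|Z|^p)^{1/p}
print(vals.max())   # about 0.8, attained at p = 1; in particular below 1
```

That $\|Z\|_{\psi_2} < 1$ is what makes bounds of the form $\|Z\|_{\psi_2}\sqrt C < C$, as in Lemma E.1 below for $C\ge1$, possible.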

E.1 Minimal Structure Certification

Lemma E.1. The random variable $X\sim N(0,(\Theta^*)^{-1})$ satisfies:
$$\sup_{\|\alpha\|_2=1}\|\alpha^TX\|_{\psi_2} \le \|Z\|_{\psi_2}\sqrt C < C, \tag{E.2}$$
where $Z\sim N(0,1)$.

Proof of Lemma E.1. The first part follows since we are assuming $\|(\Theta^*)^{-1}\|_2\le C$, and the second inequality follows via a direct calculation.

Lemma E.2. We have $\inf_{j,k\in[d]}\Delta_{jk}\ge C^{-2}$.

Proof of Lemma E.2. By Isserlis' theorem we have $\Delta_{jk} = \Theta^*_{jj}\Theta^*_{kk} + \Theta^*_{jk}\Theta^*_{kj} \ge \Theta^*_{jj}\Theta^*_{kk} \ge C^{-2} > 0$.

Lemma E.3 (Covariance Concentration). There exists a universal constant $R>0$ such that:
$$\|\widehat\Sigma-\Sigma^*\|_{\max} \le RC^2\sqrt{\log d/n},$$
with probability at least $1-1/d$ uniformly over the parameter space $\mathcal{M}$.

Proof of Lemma E.3. The proof follows by applying the inequality $\|X_iX_j\|_{\psi_1}\le2\|X_i\|_{\psi_2}\|X_j\|_{\psi_2}\le2C^2$ (see Lemma E.1) in combination with Proposition 5.16 of Vershynin (2012) and the union bound.

Lemma E.4. For large enough $n$ and $d$ we have:
$$\sup_{\Theta^*\in\mathcal{M}(s)}\mathbb{P}\Big(\max_{j,k\in[d]}\sqrt n\big|(\widetilde\Theta_{jk}-\Theta^*_{jk}) + \Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}\big| > \xi_1\Big) < \xi_2,$$
where $\xi_1 = \Xi_1 s\log d/\sqrt n$ and $\xi_2 = 2/d$, and $\Xi_1$ is an absolute constant depending solely on $K$, $L$ and $C$.
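Before turning to the proof of Lemma E.4, we note that the Isserlis identity behind Lemma E.2, namely that a bivariate Gaussian pair with covariance matrix $\Theta^*$ (restricted to coordinates $j,k$) has product variance $\Theta^*_{jj}\Theta^*_{kk}+(\Theta^*_{jk})^2$, is easy to confirm by simulation; the covariance values below are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Isserlis' theorem for a bivariate Gaussian (Y_j, Y_k) with covariance S:
#   Var(Y_j * Y_k) = S_jj * S_kk + S_jk^2.
S = np.array([[2.0, 0.7],
              [0.7, 1.5]])
Y = rng.multivariate_normal(np.zeros(2), S, size=2_000_000)
prod = Y[:, 0] * Y[:, 1]
print(prod.var())                      # Monte Carlo estimate
print(S[0, 0] * S[1, 1] + S[0, 1]**2)  # 3.49 by Isserlis
```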

Proof of Lemma E.4. By elementary algebra we obtain the following representation:
$$(\widetilde\Theta_{jk}-\Theta^*_{jk}) + \Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k} = -\underbrace{\frac{\widehat\Theta^T_{*j}\widehat\Sigma_{*\backslash j}(\widehat\Theta_{\backslash j,k}-\Theta^*_{\backslash j,k}) + (\widehat\Theta_{*j}-\Theta^*_{*j})^T(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}}{\widehat\Theta^T_{*j}\widehat\Sigma\widehat\Theta_{*j}}}_{I_1^{jk}} - \underbrace{\Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}\Big(\frac{1}{\widehat\Theta^T_{*j}\widehat\Sigma\widehat\Theta_{*j}}-1\Big)}_{I_2^{jk}}, \tag{E.3}$$
where indexing with $\backslash j$ means dropping the corresponding column or element from the matrix or vector respectively. We deal with the terms $I_1^{jk}$ first. By the triangle inequality followed by Hölder's inequality we have:
$$\max_{j,k\in[d]}|I_1^{jk}| \le \max_{j,k\in[d]}\Big\{|\widehat\Theta^T_{*j}\widehat\Sigma\widehat\Theta_{*j}|^{-1}\big(\|\widehat\Theta^T_{*j}\widehat\Sigma_{*\backslash j}\|_\infty\|\widehat\Theta_{\backslash j,k}-\Theta^*_{\backslash j,k}\|_1 + \|\widehat\Theta_{*j}-\Theta^*_{*j}\|_1\|\widehat\Sigma-\Sigma^*\|_{\max}\|\Theta^*_{*k}\|_1\big)\Big\}.$$
Since it is assumed that $\log d/n = o(1)$, take $d$ and $n$ large enough so that $\log d/n\le1/(4K^2)$. Let $\mathcal{E}$ be the event on which (3.3) and (3.4) hold. Then on $\mathcal{E}$ we have:
$$\max_{j\in[d]}|\widehat\Theta^T_{*j}\widehat\Sigma\widehat\Theta_{*j}-1| \le K\sqrt{\log d/n}.$$
This implies that on the same event $\sup_{j\in[d]}|\widehat\Theta^T_{*j}\widehat\Sigma\widehat\Theta_{*j}|^{-1}\le2$. Continuing our bounds, on the event $\mathcal{E}$ we have:
$$\|\widehat\Theta^T_{*j}\widehat\Sigma_{*\backslash j}\|_\infty\|\widehat\Theta_{\backslash j,k}-\Theta^*_{\backslash j,k}\|_1 \le K^2 s\log d/n = o(n^{-1/2}).$$
Furthermore, on the intersection of $\mathcal{E}$ and the event from Lemma E.3, which holds with probability no less than $1-2/d$, we have:
$$\|\widehat\Theta_{*j}-\Theta^*_{*j}\|_1\|\widehat\Sigma-\Sigma^*\|_{\max}\|\Theta^*_{*k}\|_1 \le RC^2KLs\log d/n = o(n^{-1/2}).$$
Combining the last four bounds we conclude that $\max_{j,k\in[d]}|I_1^{jk}| = o_P(n^{-1/2})$. Next we handle the terms $|I_2^{jk}|$. We have:
$$\max_{j,k\in[d]}|I_2^{jk}| \le \max_{j,k\in[d]}\Big\{\big|\Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}\big|\Big|\frac{1}{\widehat\Theta^T_{*j}\widehat\Sigma\widehat\Theta_{*j}}-1\Big|\Big\}.$$
Clearly, on the event $\mathcal{E}$, under the assumption $\log d/n\le1/(4K^2)$, we have that:
$$\max_{j\in[d]}\Big|\frac{1}{\widehat\Theta^T_{*j}\widehat\Sigma\widehat\Theta_{*j}}-1\Big| \le 2K\sqrt{\log d/n}. \tag{E.4}$$

Next we consider the random variables $\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}$ for all $j,k\in[d]$. By Lemma E.1 we have:
$$\|\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}\|_{\psi_1} \le 2\|\Theta^{*T}_{*j}X\|_{\psi_2}\|X^T\Theta^*_{*k}\|_{\psi_2} \le 2\sup_{j\in[d]}\|\Theta^*_{*j}\|_2^2\,C^2 \le 2C^4,$$
with the last inequality following from the fact that $\sup_{j\in[d]}\|\Theta^*_{*j}\|_2^2\le\|\Theta^*\|_2^2\le C^2$ since $\Theta^*\in\mathcal{M}(s)$. Clearly then $\|\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk}\|_{\psi_1}\le4C^4$. Finally, by the union bound and Proposition 5.16 of Vershynin (2012), one concludes that there exists an absolute constant $\widetilde C$ such that when $\sqrt{\log d/n}\le4C^4$, we have:
$$\max_{j,k\in[d]}|\Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}| \le 4\widetilde CC^4\sqrt{\log d/n}, \tag{E.5}$$
with probability at least $1-2/d$. Combining this bound with (E.4) and our conditions, we conclude that $\max_{j,k\in[d]}|I_2^{jk}| = o_P(n^{-1/2})$. This completes the proof.

Lemma E.5. Let $\{X_i\}_{i=1}^n$ be identically distributed (not necessarily independent) $d$-dimensional sub-Gaussian vectors with $\max_{i\in[n],j\in[d]}\|X_{ij}\|_{\psi_2} = C$. Then there exists an absolute constant $U>0$ depending solely on $C$ such that:
$$\max_{i\in[n]}\|X_i^{\otimes2}\|_{\max} < U\log(nd),$$
with probability at least $1-1/d$.

Proof of Lemma E.5. The proof is trivial upon noting that $\max_{i\in[n]}\|X_i^{\otimes2}\|_{\max} = \max_{i\in[n]}\|X_i\|_\infty^2$ and combining (5.10) in Vershynin (2012) with the union bound. We omit the details.

Lemma E.6. Assuming that $s\sqrt{\log d/n}\sqrt{\log(nd)} = o(1)$, we have that for any fixed $j,k\in[d]$:
$$\lim_n\inf_{\Theta^*\in\mathcal{M}(s)}\mathbb{P}\big(|\widehat\Delta_{jk}-\Delta_{jk}|\le\tau(n)\big) = 1,$$

where $\tau(n) = o(1)$.

Proof of Lemma E.6. We start by showing that:
$$I := \max_{j,k\in[d]} n^{-1}\sum_{i=1}^n\big(\widehat\Theta^T_{*j}X_i^{\otimes2}\widehat\Theta_{*k}-\widehat\Theta_{jk}-\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}+\Theta^*_{jk}\big)^2$$
is asymptotically negligible. Since $(a-b)^2\le2(a^2+b^2)$, we have that:
$$|I| \le 2\max_{j,k\in[d]}\underbrace{n^{-1}\sum_{i=1}^n\big(\widehat\Theta^T_{*j}X_i^{\otimes2}\widehat\Theta_{*k}-\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}\big)^2}_{I_1} + 2\underbrace{\max_{j,k\in[d]}\big(\widehat\Theta_{jk}-\Theta^*_{jk}\big)^2}_{I_2}.$$

Clearly, by assumption (3.3), $|I_2|\le K^2\log d/n$ with probability at least $1-1/d$. Next we handle $I_1$. By the triangle inequality the following holds:
$$I_1^{1/2} \le \underbrace{\max_{j,k\in[d]}\Big(n^{-1}\sum_{i=1}^n\big(\widehat\Theta^T_{*j}X_i^{\otimes2}(\widehat\Theta_{*k}-\Theta^*_{*k})\big)^2\Big)^{1/2}}_{I_{11}} + \underbrace{\max_{j,k\in[d]}\Big(n^{-1}\sum_{i=1}^n\big((\widehat\Theta_{*j}-\Theta^*_{*j})^TX_i^{\otimes2}\Theta^*_{*k}\big)^2\Big)^{1/2}}_{I_{12}}.$$
We first handle $I_{11}^2$:
$$I_{11}^2 = \max_{j,k\in[d]}(\widehat\Theta_{*k}-\Theta^*_{*k})^T\underbrace{\frac1n\sum_{i=1}^nX_i^{\otimes2}\widehat\Theta_{*j}^{\otimes2}X_i^{\otimes2}}_{M}(\widehat\Theta_{*k}-\Theta^*_{*k}).$$
Using Lemma E.5, we can handle $M$ in the following way:
$$\|M\|_{\max} \le \max_{i\in[n]}\|X_i^{\otimes2}\|_{\max}\max_{j\in[d]}\frac1n\sum_{i=1}^n\widehat\Theta^T_{*j}X_i^{\otimes2}\widehat\Theta_{*j} \le O_P(\log(nd))\max_{j\in[d]}\|\widehat\Theta_{*j}\|_1\|\widehat\Sigma\widehat\Theta_{*j}\|_\infty. \tag{E.6}$$
By the definition of $\widehat\Theta_{*j}$ we have $\max_{j\in[d]}\|\widehat\Sigma\widehat\Theta_{*j}\|_\infty\le1+K\sqrt{\log d/n}$. Hence:
$$\|M\|_{\max} \le O_P(\log(nd))\max_{j\in[d]}\big(\|\Theta^*_{*j}\|_1+\|\widehat\Theta_{*j}-\Theta^*_{*j}\|_1\big)\big(1+K\sqrt{\log d/n}\big) = O_P(\log(nd))L,$$
where we used assumption (3.4). Thus:
$$I_{11}^2 \le \max_{k\in[d]}\|\widehat\Theta_{*k}-\Theta^*_{*k}\|_1^2\,L\,O_P(\log(nd)) = O_P\big(s^2\log d/n\cdot\log(nd)\big) = o_P(1),$$
with probability at least $1-2/d$ uniformly over $\mathcal{M}(s)$. By a similar argument we can show that $I_{12}$ is of the same order. Putting everything together we conclude:
$$I = O_P\big(s^2\log d\log(nd)/n\big) = o_P(1). \tag{E.7}$$
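The $\log(nd)$ scaling from Lemma E.5, which drives the $O_P(\log(nd))$ factor in the bound on $\|M\|_{\max}$ above, can be eyeballed by simulation; a toy sketch with independent Gaussian entries:

```python
import numpy as np

rng = np.random.default_rng(6)

# Lemma E.5: for sub-Gaussian entries, max_i ||X_i||_inf^2 = O(log(nd)) w.h.p.
for n, d in [(100, 50), (1000, 200), (5000, 1000)]:
    X = rng.standard_normal((n, d))
    print(n, d, (np.abs(X).max() ** 2) / np.log(n * d))  # ratio stays bounded (~2)
```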

Next, we argue that for any fixed $j,k$ the difference $n^{-1}\sum_{i=1}^n(\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2 - \mathbb{E}(\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2$ is small. Since, as in Lemma E.4, we have $\|\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}\|_{\psi_1}\le2C^4$, certainly $\mathrm{Var}\big((\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2\big)$ is finite for any fixed $j,k\in[d]$. An application of Chebyshev's inequality gives us that:
$$n^{-1}\sum_{i=1}^n(\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2 - \mathbb{E}(\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2 = o_P(1).$$

Finally, note that by the triangle inequality the following two inequalities hold for any fixed $j,k\in[d]$:
$$\Big(n^{-1}\sum_{i=1}^n\big(\widehat\Theta^T_{*j}X_i^{\otimes2}\widehat\Theta_{*k}-\widehat\Theta_{jk}\big)^2\Big)^{1/2} \le \Big(n^{-1}\sum_{i=1}^n\big(\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk}\big)^2\Big)^{1/2} + I^{1/2},$$
$$\Big(n^{-1}\sum_{i=1}^n\big(\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk}\big)^2\Big)^{1/2} \le \Big(n^{-1}\sum_{i=1}^n\big(\widehat\Theta^T_{*j}X_i^{\otimes2}\widehat\Theta_{*k}-\widehat\Theta_{jk}\big)^2\Big)^{1/2} + I^{1/2}.$$
Observe that $n^{-1}\sum_{i=1}^n(\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2 = \mathbb{E}(\Theta^{*T}_{*j}X^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2 + o_P(1) = O_P(1)$. This completes the proof.

Lemma E.7. Let $X_n(r)$ and $\xi_n(r)$ be two sequences of random variables, depending on a parameter $r\in\mathcal{R}$. Suppose that $\lim_n\sup_{r\in\mathcal{R}}\sup_t|\mathbb{P}(X_n(r)\le t)-F(t)| = 0$ where $F$ is a continuous cdf, and $\lim_n\inf_{r\in\mathcal{R}}\mathbb{P}(|1-\xi_n(r)|\le\tau(n)) = 1$ for $\tau(n) = o(1)$. Assume in addition that $F$ is Lipschitz, i.e., there exists $\kappa>0$ such that for any $t,s\in\mathbb{R}$: $|F(t)-F(s)|\le\kappa|t-s|$. Then we have:
$$\lim_n\sup_{r\in\mathcal{R}}\sup_t\Big|\mathbb{P}\Big(\frac{X_n(r)}{\xi_n(r)}\le t\Big)-F(t)\Big| = 0.$$

Proof. The proof goes through by a direct argument. We omit the details.

Proof of Proposition 3.1. The proof of the strong error rate control is based upon verifying the conditions of Theorem 5.1 of Chernozhukov et al. (2013). The rate $(\log(dn))^6/n = o(1)$ in (3.10) holds due to recent advances in Berry-Esseen bounds by Chernozhukov et al. (2016). Define:
$$\Upsilon_1 := \sqrt n\max_{(j,k)\in E}\big|(\widetilde\Theta_{jk}-\Theta^*_{jk}) + \Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}\big|,$$
$$\Upsilon_2 := \max_{(j,k)\in E} n^{-1}\sum_{i=1}^n\big(\widehat\Theta^T_{*j}X_i^{\otimes2}\widehat\Theta_{*k}-\widehat\Theta_{jk}-\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}+\Theta^*_{jk}\big)^2.$$

By Lemma E.4, we are guaranteed that $\sup_{\Theta^*\in\mathcal{M}_{E_n,E_n^c}}\mathbb{P}\big(\sqrt{\log|E|}\,|\Upsilon_1|\ge\sqrt{\log|E|}\,\xi_1\big)<\xi_2$, where $\xi_1 = \Xi_1 s\log d/\sqrt n$ and $\xi_2 = 2/d$, and since under our assumptions $\sqrt{\log|E|}\,\xi_1 = o(1)$ and $\xi_2 = o(1)$, this satisfies the first requirement in condition (M) (i) of Theorem 5.1 of Chernozhukov et al. (2013). (In fact assumption (M) requires a stronger control, but that is not necessary for our purpose.)

Next we turn to bounding $|\Upsilon_2|$. Using (E.7) from Lemma E.6, we conclude that there exists a constant $C>0$ such that $\sup_{\Theta^*\in\mathcal{M}_{E_n,E_n^c}}\mathbb{P}\big((\log(|E|n))^2\Upsilon_2\ge Cs^2(\log(|E|n))^2\log(nd)\log d/n\big)\le2/d$. Clearly, since $|E|\le\binom d2<d^2$, by our assumption we have $s^2(\log(|E|n))^2\log(nd)\log d/n = o(1)$. Finally, recall that $\mathbb{E}(\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}-\Theta^*_{jk})^2\ge\delta$ and $\|\Theta^{*T}_{*j}X_i^{\otimes2}\Theta^*_{*k}\|_{\psi_1}\le2C^4$ as we saw in the proof of Lemma E.4, and in addition by assumption $(\log(dn))^6/n = o(1)$. This completes the verification of the conditions needed by Theorem 5.1 of Chernozhukov et al. (2013) and shows that (3.9) holds.

To see the second part, it is sufficient to show that on the first iteration of Algorithm 1 all edges will be rejected. We first obtain a high probability bound on the quantile of $\sqrt n\max_{(j,k)\in E}|\widetilde\Theta_{jk}-\Theta^*_{jk}|$. By Lemma E.4 we get that:
$$\sqrt n\max_{(j,k)\in E}|\widetilde\Theta_{jk}-\Theta^*_{jk}| \le \sqrt n\max_{(j,k)\in E}|\Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}| + \Xi_1 s\log d/\sqrt n,$$
with probability at least $1-2/d$, where $\Xi_1$ is an absolute constant. Next, recall that by (E.5), when $\sqrt{\log d/n}\le4C^4$, on the same event as above we have:
$$\max_{j,k\in[d]}|\Theta^{*T}_{*j}(\widehat\Sigma-\Sigma^*)\Theta^*_{*k}| \le 4\widetilde CC^4\sqrt{\log d/n}.$$
Putting the last two inequalities together we conclude:
$$\sqrt n\max_{(j,k)\in E}|\widetilde\Theta_{jk}-\Theta^*_{jk}| \le 4\widetilde CC^4\sqrt{\log d} + \Xi_1 s\log d/\sqrt n, \tag{E.8}$$
with probability no less than $1-2/d$. The latter of course implies that any quantile of $\sqrt n\max_{(j,k)\in E}|\widetilde\Theta_{jk}-\Theta^*_{jk}|$ will be smaller than the value on the RHS with high probability. Next, by Corollary 3.1 of Chernozhukov et al. (2013) (formally we verified the conditions of Theorem 5.1, but in fact they imply that Corollary 3.1 holds) we have:
$$\sup_{\alpha\in(0,1)}\Big|\mathbb{P}\Big(\sqrt n\max_{(j,k)\in E}|\widetilde\Theta_{jk}-\Theta^*_{jk}|\le c_{1-\alpha,E}\Big)-(1-\alpha)\Big| = o(1). \tag{E.9}$$

Combining (E.8) and (E.9) gives us that for large enough d, n, for any fixed α > 0 we have: p √ e 4 log d + Ξ1 s log d/ n. c1−α,E ≤ 4CC Now, clearly √

e jk | ≥ n|Θ



p √ e jk − Θ∗jk | ≥ c1−α,E , n|κ log d/n| − n max |Θ

(E.10)

(j,k)∈E

e 4 , and hence the power goes to 1 uniformly over the parameter provided that |κ| ≥ 9CC space ME n ,E nc .

References Addario-Berry, L., Broutin, N., Devroye, L. and Lugosi, G. (2010). On combinatorial testing problems. The Annals of Statistics 38 pp. 3063–3092. 5


Arias-Castro, E., Bubeck, S. and Lugosi, G. (2012). Detection of correlations. The Annals of Statistics 40 pp. 412-435.
Arias-Castro, E., Bubeck, S. and Lugosi, G. (2015a). Detecting positive correlations in a multivariate sample. Bernoulli 21 pp. 209-241.
Arias-Castro, E., Bubeck, S., Lugosi, G. and Verzelen, N. (2015b). Detecting Markov random fields hidden in white noise. arXiv preprint arXiv:1504.06984.
Arias-Castro, E., Candès, E. J. and Durand, A. (2011a). Detection of an anomalous cluster in a network. The Annals of Statistics 39 pp. 278-304.
Arias-Castro, E., Candès, E. J. and Plan, Y. (2011b). Global testing under sparse alternatives: Anova, multiple comparisons and the higher criticism. The Annals of Statistics 39 pp. 2533-2556.
Baraud, Y. (2002). Non-asymptotic minimax rates of testing in signal detection. Bernoulli 8 pp. 577-606.
Bartzokis, G., Beckson, M., Lu, P. H., Nuechterlein, K. H., Edwards, N. and Mintz, J. (2001). Age-related changes in frontal and temporal lobe volumes in men: a magnetic resonance imaging study. Archives of General Psychiatry 58 pp. 461-465.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57 pp. 289-300.
Berthet, Q. and Rigollet, P. (2013). Optimal detection of sparse principal components in high dimension. The Annals of Statistics 41 pp. 1780-1815.
Cai, T. T., Liu, W. and Luo, X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 pp. 594-607.
Cai, W., Chen, T., Szegletes, L., Supekar, K. and Menon, V. (2015). Aberrant cross-brain network interaction in children with attention-deficit/hyperactivity disorder and its relation to attention deficits: a multi- and cross-site replication study. Biological Psychiatry.
Chen, M., Ren, Z., Zhao, H. and Zhou, H. (2015). Asymptotically normal and efficient estimation of covariate-adjusted Gaussian graphical model. Journal of the American Statistical Association 111 pp. 394-406.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics 41 pp. 2786-2819.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2016). Central limit theorems and bootstrap in high dimensions. Annals of Probability, to appear.
Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics 32 pp. 962-994.

Dubhashi, D. and Ranjan, D. (1996). Balls and bins: A study in negative dependence. BRICS Report Series 3.
Fang, K. T., Kotz, S. and Ng, K. W. (1990). Symmetric multivariate and related distributions, vol. 36 of Monographs on Statistics and Applied Probability. Chapman and Hall, Ltd., London.
Friedman, J. H., Hastie, T. J. and Tibshirani, R. J. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 pp. 432-441.
Gelfand, A. E., Kim, H.-J., Sirmans, C. and Banerjee, S. (2003). Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association 98 pp. 387-396.
Gu, Q., Cao, Y., Ning, Y. and Liu, H. (2015). Local and global inference for high dimensional Gaussian copula graphical models. arXiv preprint arXiv:1502.02347.
Helmke, U. and Rosenthal, J. (1995). Eigenvalue inequalities and Schubert calculus. Mathematische Nachrichten 171 pp. 207-226.
Ingster, Y. I. (1982). On the minimax nonparametric detection of signals in white Gaussian noise. Problemy Peredachi Informatsii 18 pp. 61-73.
Ingster, Y. I., Tsybakov, A. B. and Verzelen, N. (2010). Detection boundary in sparse regression. Electronic Journal of Statistics 4 pp. 1476-1526.
Jankova, J. and van de Geer, S. A. (2015). Confidence intervals for high-dimensional inverse covariance estimation. Electronic Journal of Statistics 9 pp. 1205-1229.
Joag-Dev, K. and Proschan, F. (1983). Negative association of random variables with applications. The Annals of Statistics 11 pp. 286-295.
Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association 104 pp. 682-693.
Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics 37 pp. 4254-4278.
Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals of Statistics 1 pp. 38-53.
Liu, H., Han, F. and Zhang, C.-H. (2012). Transelliptical graphical models. Proc. of NIPS pp. 809-817.
Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10 pp. 2295-2328.
Liu, W. (2013). Gaussian graphical model estimation with false discovery rate control. The Annals of Statistics 41 pp. 2948-2978.
Meinshausen, N. and Bühlmann, P. (2006). High dimensional graphs and variable selection with the lasso. The Annals of Statistics 34 pp. 1436-1462.

Milham, M. P., Fair, D., Mennes, M., Mostofsky, S. H. et al. (2012). The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in Systems Neuroscience 6 62.
Neykov, M., Ning, Y., Liu, J. S. and Liu, H. (2015). A unified theory of confidence regions and testing for high dimensional estimating equations. arXiv preprint arXiv:1510.08986.
Raskutti, G., Yu, B., Wainwright, M. J. and Ravikumar, P. K. (2008). Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In Advances in Neural Information Processing Systems.
Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics 5 pp. 935-980.
Ren, Z., Sun, T., Zhang, C.-H. and Zhou, H. H. (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. The Annals of Statistics 43 pp. 991-1026.
Romano, J. P. and Wolf, M. (2005). Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association 100 pp. 94-108.
Santhanam, N. P. and Wainwright, M. J. (2012). Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory 58 pp. 4117-4134.
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications (Y. C. Eldar and G. Kutyniok, eds.). Cambridge University Press.
Verzelen, N. and Villers, F. (2010). Goodness-of-fit tests for high-dimensional Gaussian linear models. The Annals of Statistics 38 pp. 704-752.
Wang, W., Wainwright, M. J. and Ramchandran, K. (2010). Information-theoretic bounds on model selection for Gaussian Markov random fields. In ISIT. Citeseer.
Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. The Annals of Statistics 27 pp. 1564-1599.
Yang, Z., Ning, Y. and Liu, H. (2014). On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B (Methodological) 68 pp. 49-67.
Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society. Series B (Methodological) 76 pp. 217-242.
