Bayesian structural learning and estimation in Gaussian graphical models
Technical Report no. 545, Department of Statistics, University of Washington
By Alex Lenkoski
Department of Statistics, University of Washington, Seattle, WA 98195, U.S.A.
[email protected]
and Adrian Dobra
Department of Statistics, University of Washington, Seattle, WA 98195, U.S.A.
[email protected]
Abstract
We propose a new stochastic search algorithm for Gaussian graphical models called the mode oriented stochastic search. Our algorithm relies on the existence of a method to accurately and efficiently approximate the marginal likelihood associated with a graphical model when it cannot be computed in closed form. To this end, we develop a new Laplace approximation method for the normalizing constant of a G-Wishart distribution. We show that combining the mode oriented stochastic search with our marginal likelihood estimation method leads to excellent results relative to other techniques discussed in the literature. We also describe how to perform inference through Bayesian model averaging based on the reduced set of graphical models identified. Finally, we give a novel stochastic search technique for multivariate regression models.
Some key words: Bayesian model averaging; Covariance estimation; Covariance selection; Gaussian graphical models; Multivariate regression; Stochastic search.
1 Introduction
We are concerned with a p-dimensional multivariate normal distribution N_p(0, Σ). If the ratio between p and n – the available sample size – becomes large, the sample covariance matrix might not be positive definite, while its eigenvalues might not reflect the eigenstructure of the actual covariance matrix (Yang and Berger, 1994). Dempster (1972) proposed reducing the number of parameters that need to be estimated by setting off-diagonal elements of the precision matrix K = Σ^{-1} to zero. A pattern of zero constraints in K is called a covariance selection model or a Gaussian graphical model, as it represents a pairwise conditional independence structure (Wermuth, 1976).

Among many notable contributions focused on structural learning in Gaussian graphical models we mention the regularization methods of Yuan and Lin (2007), Bickel and Levina (2008) and Meinshausen and Bühlmann (2006). The simultaneous confidence intervals of Drton and Perlman (2004) solve the underlying multiple comparisons problem (Edwards, 2000). Each of these methods produces a single sparse precision matrix whose structure is further used to estimate Σ through the corresponding maximum likelihood estimates (Dempster, 1972). By imposing suitable prior distributions for K or Σ, various Bayesian approaches have also been proposed (Leonard and Hsu, 1992; Yang and Berger, 1994; Daniels and Kass, 1999; Barnard et al., 2000; Smith and Kohn, 2002; Liechty et al., 2004; Rajaratnam et al., 2008). Inference can be done based on the best model, i.e. the model having the highest posterior probability. Alternatively, the uncertainty related to a particular choice of zero constraints in K can be taken into account by Bayesian model averaging (Kass and Raftery, 1995). Estimation of quantities of interest is consequently performed by averaging over all possible models weighted by their posterior probabilities.

Markov chain Monte Carlo techniques are key in this context since they are used to visit the space of 2^{p(p−1)/2} possible models (Giudici and Green, 1999; Dellaportas et al., 2003; Wong et al., 2003). As p increases, Markov chain Monte Carlo methods are likely to be slow to converge due to the sheer size of the search space and might not discover the models with the highest posterior probability. Jones et al. (2005) addressed this issue by proposing the shotgun stochastic search algorithm, which is designed to efficiently move towards regions of high posterior probability in the model space, while Scott and Carvalho (2008) developed the feature inclusion search algorithm. Hans et al. (2007) give a version of the shotgun stochastic search algorithm for performing variable selection in linear regression models. Madigan and Raftery (1994) argue that averaging across all models is not practical for higher-dimensional datasets in which the ratio p/n is large. They introduce the Occam's window model selection strategy, in which the models having a small posterior probability relative to the best model are discarded. Averaging is thereby performed over the smaller set of remaining models.

In this paper we develop the mode oriented stochastic search for Gaussian graphical models. Our approach builds on the ideas behind the shotgun stochastic search while simultaneously employing the Occam's window principle. This combination quickly identifies those models whose posterior probability is above a given fraction of the posterior probability of the best model.
We also give the Laplace approximation to the normalizing constant of a G-Wishart distribution (Diaconis and Ylvisaker, 1979; Roverato, 2002; Atay-Kayis and Massam, 2005) associated with nondecomposable prime graphs. The Laplace approximation works well because the mode of a G-Wishart distribution can be efficiently and accurately determined using the iterative proportional scaling algorithm (Speed and Kiiveri, 1986). In our Bayesian framework, estimation of K and Σ is performed through sampling from a G-Wishart distribution, which is made possible by the block Gibbs sampler algorithm (Asci and Piccioni, 2007). We show both theoretically and empirically that the combination of our new stochastic search algorithm and our method for estimating the marginal likelihood associated with a G-Wishart conjugate prior represents a computationally efficient approach for rapidly exploring regions of high posterior probability of general Gaussian graphical models.

The structure of the paper is as follows. In Section 2 we introduce the G-Wishart distribution associated with a Gaussian graphical model. In Section 3 we describe the iterative proportional scaling algorithm and the block Gibbs sampler algorithm. We develop and study the applicability of the Laplace approximation for the normalizing constant of a G-Wishart distribution in Section 4. In Section 5 we discuss the G-Wishart prior and posterior distributions and the resulting marginal likelihood of a graphical model. In Section 6 we describe stochastic search methods for Gaussian graphical models, including our new algorithm – the mode oriented stochastic search. In Section 7 we show how to perform inference through Bayesian model averaging based on the highest posterior probability models identified. We illustrate the performance of our proposed methodology in Section 8 in two simulated examples. In Section 9 we discuss structural learning in multivariate regressions through Gaussian graphical models with partitioned covariates and exemplify it with two real world datasets. We make concluding remarks in Section 10.
2 Graphical models and the G-Wishart distribution
We use the notation and definitions for Gaussian graphical models presented in Ch. 5 of Lauritzen (1996). Let G = (V, E) with E ⊂ {(i, j) ∈ V × V : i < j} be an undirected graph whose vertices are associated with a p-dimensional vector X that follows a N_p(0, Σ) distribution. The nonzero elements of K = Σ^{-1} are associated with edges in E. A missing edge in E implies K_{ij} = 0 and corresponds to the conditional independence of the univariate elements X_i and X_j of X given the remaining elements. The canonical parameter K is constrained to the cone P_G of positive definite matrices with entries equal to zero for all (i, j) ∉ E. We follow Roverato (2002) and define the set of indices of the free elements of K ∈ P_G:

$$\mathcal{V} = \{(i, j) : i \le j \text{ with } i = j \in V \text{ or } (i, j) \in E\}. \qquad (1)$$
Roverato (2002) generalizes the hyper inverse Wishart distribution of Dawid and Lauritzen (1993) to arbitrary graphs by deriving the Diaconis and Ylvisaker (1979) conjugate prior for K ∈ PG . Letac and Massam (2007) as well as Atay-Kayis and Massam (2005) continue this development and call this distribution the G-Wishart. More specifically, the G-Wishart
distribution W_G(δ, D) has density

$$p(K \mid G) = \frac{1}{I_G(\delta, D)} (\det K)^{(\delta - 2)/2} \exp\left(-\frac{1}{2}\langle K, D\rangle\right), \qquad (2)$$
with respect to the Lebesgue measure on P_G. Here ⟨A, B⟩ = tr(A^T B) denotes the trace inner product. The G-Wishart distribution is a regular exponential family with canonical parameter K ∈ P_G and canonical statistic −D/2. Diaconis and Ylvisaker (1979) prove that the normalizing constant

$$I_G(\delta, D) = \int_{K \in P_G} (\det K)^{(\delta - 2)/2} \exp\left(-\frac{1}{2}\langle K, D\rangle\right) dK \qquad (3)$$

is finite if δ > 2 and D^{-1} ∈ P_G. If the graph G is complete, W_G(δ, D) reduces to the Wishart distribution W_p(δ, D), hence its normalizing constant (3) is

$$I_G(\delta, D) = 2^{(\delta + p - 1)p/2}\, \Gamma_p\{(\delta + p - 1)/2\}\, (\det D)^{-(\delta + p - 1)/2}, \qquad (4)$$

where Γ_p(a) = π^{p(p−1)/4} ∏_{i=0}^{p−1} Γ(a − i/2) for a > (p − 1)/2 (Muirhead, 2005). Roverato (2002) proves that the G-Wishart distribution can be factorized according to the prime components of G and their separators. Let P_1, . . . , P_k be a perfect sequence of prime components of G and let S_2, . . . , S_k be the corresponding separators, where S_l = (∪_{j=1}^{l−1} P_j) ∩ P_l, l = 2, . . . , k – see Dobra and Fienberg (2000) for fast algorithms for producing such a perfect sequence of prime components together with their separators. The normalizing constant (3) is equal to (Roverato, 2002):

$$I_G(\delta, D) = \frac{\prod_{j=1}^{k} I_{G_{P_j}}(\delta, D_{P_j})}{\prod_{j=2}^{k} I_{G_{S_j}}(\delta, D_{S_j})}. \qquad (5)$$

The subgraph G_{S_j} associated with a separator S_j is required to be complete, hence I_{G_{S_j}}(δ, D_{S_j}) is explicitly calculated as in (4). If a subgraph G_{P_j} happens to be complete, the computation of I_{G_{P_j}}(δ, D_{P_j}) can be done in a similar manner. As these are straightforward calculations, the challenge in determining I_G(δ, D) lies in the computation of normalizing constants of the prime components whose subgraphs are not complete. Dellaportas et al. (2003) and Roverato (2002) develop importance sampling algorithms for computing such normalizing constants, while Atay-Kayis and Massam (2005) propose a simple but efficient Monte Carlo method. Their method is derived by performing a change of variables that converts the matrix K into an upper-triangular matrix ψ. After accounting for the Jacobian of this transformation, Atay-Kayis and Massam (2005) show that the integral can be rewritten as a function of a set of independent χ² and normal random variables. A Monte Carlo scheme can then be elicited to estimate I_G(δ, D). While its precision is very good, this method becomes computationally inefficient if D has a complex structure, due to the large number of iterations required to achieve convergence. In Section 4 we propose a Laplace approximation method for computing I_G(δ, D) with the novel approach of using the iterative proportional scaling algorithm described in Section 3.1 to compute the mode of the integrand in (3).
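As a small worked instance of (5), consider the path graph 1 – 2 – 3. Its prime components are the complete sets P_1 = {1, 2} and P_2 = {2, 3} with separator S_2 = {2}, so every factor in (5) is given by the explicit formula (4):

$$I_G(\delta, D) = \frac{I_{G_{P_1}}(\delta, D_{P_1})\, I_{G_{P_2}}(\delta, D_{P_2})}{I_{G_{S_2}}(\delta, D_{S_2})} = \frac{2^{\delta+1}\, \Gamma_2\{(\delta+1)/2\}\, (\det D_{P_1})^{-(\delta+1)/2}\ \cdot\ 2^{\delta+1}\, \Gamma_2\{(\delta+1)/2\}\, (\det D_{P_2})^{-(\delta+1)/2}}{2^{\delta/2}\, \Gamma(\delta/2)\, D_{22}^{-\delta/2}}.$$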
3 The iterative proportional scaling algorithm and the block Gibbs sampler
Let G be an arbitrary undirected graph and let C_1, . . . , C_k be its set of cliques. The iterative proportional scaling algorithm and the block Gibbs sampler are cyclic algorithms that generate matrices with values in the cone of incomplete symmetric matrices X such that the submatrices X_{C_i}, i = 1, . . . , k, are positive definite and such that X can be completed in a way that the inverse of the completion belongs to P_G. For any given clique C of G and a given |C| × |C| positive definite matrix A, we introduce the following operator from P_G into P_G:

$$M_{C,A} K = \begin{pmatrix} A^{-1} + K_{C, V\setminus C}\, (K_{V\setminus C})^{-1}\, K_{V\setminus C, C} & K_{C, V\setminus C} \\ K_{V\setminus C, C} & K_{V\setminus C} \end{pmatrix}, \qquad (6)$$

which is such that [(M_{C,A} K)^{-1}]_C = A (Lauritzen, 1996). Since K_{V\setminus C}, K_{V\setminus C, C} and K_{C, V\setminus C} remain unchanged under this transformation, it follows that M_{C,A} maps P_G into P_G.
3.1 The iterative proportional scaling algorithm
Given a p × p positive definite matrix L, we use the iterative proportional scaling algorithm to find the p × p matrix K such that

$$(K^{-1})_{C_j} = L_{C_j}, \quad \text{for } j = 1, \ldots, k, \quad \text{and } K \in P_G. \qquad (7)$$
Speed and Kiiveri (1986) proposed the iterative proportional scaling algorithm that finds the unique matrix in P_G that satisfies (7). A similar algorithm exists for hierarchical log-linear models – see, for example, Bishop et al. (1975). The algorithm proceeds as follows:

Step 1. Start with K^0 = I_p, the p-dimensional identity matrix.
Step 2. At iteration r = 0, 1, . . . do:
  Step 2A. Set K^{r+(0/k)} = K^r.
  Step 2B. For each j = 1, . . . , k, set K^{r+(j/k)} = M_{C_j, L_{C_j}} K^{r+((j−1)/k)}.
  Step 2C. Set K^{r+1} = K^{r+(k/k)}.

The sequence (K^r)_{r≥0} ⊂ P_G converges to the solution of (7) (Speed and Kiiveri, 1986). In particular, when G is decomposable with cliques {C_1, . . . , C_k} arranged in a perfect sequence and separators {S_2, . . . , S_k}, the iterative proportional scaling algorithm converges after one iteration to the solution of (7), which is readily available through the following formula (Lauritzen, 1996):

$$K = \sum_{j=1}^{k} \left[(L_{C_j})^{-1}\right]^0 - \sum_{j=2}^{k} \left[(L_{S_j})^{-1}\right]^0, \qquad (8)$$

where [A]^0 is the matrix whose C-submatrix coincides with A and has zero entries everywhere else.
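The clique update (6) and the loop in Step 2 translate directly into code. The following is a minimal numpy sketch under the simplifying assumption that the cliques of G are already available; the names `clique_update`, `ips` and the convergence tolerance are our own illustrative choices and not part of the published algorithm.

```python
import numpy as np

def clique_update(K, L, C):
    """Apply the operator M_{C, L_C} of (6): make the C-block of K^{-1} equal to L_C."""
    p = K.shape[0]
    B = np.setdiff1d(np.arange(p), C)              # the complement V \ C
    K_new = K.copy()
    K_CB = K[np.ix_(C, B)]
    K_B_inv = np.linalg.inv(K[np.ix_(B, B)])
    # block of (6): A^{-1} + K_{C,V\C} (K_{V\C})^{-1} K_{V\C,C}, with A = L_C
    K_new[np.ix_(C, C)] = np.linalg.inv(L[np.ix_(C, C)]) + K_CB @ K_B_inv @ K_CB.T
    return K_new

def ips(L, cliques, tol=1e-10, max_iter=1000):
    """Iterative proportional scaling: the K in P_G with (K^{-1})_{C_j} = L_{C_j} for all cliques."""
    p = L.shape[0]
    K = np.eye(p)                                   # Step 1: K^0 = I_p
    for _ in range(max_iter):                       # Step 2
        K_old = K.copy()
        for C in cliques:                           # Step 2B: cycle through the cliques
            K = clique_update(K, L, C)
        if np.max(np.abs(K - K_old)) < tol:         # stop once a full cycle no longer changes K
            break
    return K

# Example: the 4-cycle 1-2-3-4-1 (nondecomposable); its cliques are the four edges
cliques = [np.array(c) for c in ([0, 1], [1, 2], [2, 3], [0, 3])]
L = np.array([[2.0, 0.6, 0.3, 0.5],
              [0.6, 2.0, 0.7, 0.2],
              [0.3, 0.7, 2.0, 0.6],
              [0.5, 0.2, 0.6, 2.0]])
K = ips(L, cliques)
print(np.round(K, 4))                               # zeros at the non-edges (1,3) and (2,4)
print(np.round(np.linalg.inv(K) - L, 4))            # differences vanish on the diagonal and edges
```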
3.2 The block Gibbs sampler
The iterative proportional scaling algorithm can be transformed into a procedure for sampling from the G-Wishart W_G(δ, D) distribution by replacing the deterministic updates associated with each clique of G with random updates drawn from appropriate Wishart distributions. This sampling method is called the block Gibbs sampler and was originally discussed in Piccioni (2000). Asci and Piccioni (2007) developed it explicitly by exploiting the theory of exponential families with cuts. Here a cut is a clique of G. If K ∈ P_G is distributed W_G(δ, D) and C is a clique of G, then from Corollary 2 of Roverato (2002) we know that [(K^{-1})_C]^{-1} has a Wishart distribution W_{|C|}(δ, D_C), also written W_{|C|}(δ + |C| − 1, (D_C)^{-1}) in the notation used in Ch. 3 of Muirhead (2005). This chapter shows how to sample from a Wishart distribution through Bartlett's decomposition. The block Gibbs sampler is obtained by replacing Step 2B of the iterative proportional scaling algorithm described in Section 3.1 with:

For each j = 1, . . . , k, simulate A from W_{|C_j|}(δ, D_{C_j}) and set K^{r+(j/k)} = M_{C_j, A^{-1}} K^{r+((j−1)/k)}.

The other steps remain unchanged. The matrices (K^r)_{r ≥ r_0} generated by the block Gibbs sampler are random samples from W_G(δ, D) after a suitable burn-in time r_0 (Piccioni, 2000; Asci and Piccioni, 2007). Carvalho et al. (2007) propose another sampler that is based on decomposing G into its sequence of prime components.
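A rough sketch of this sampler is given below, replacing the deterministic clique update with a Wishart draw as described above; the function name `block_gibbs` and the burn-in handling are ours, and the draw uses the standard (df, scale) parameterization of `scipy.stats.wishart`, which corresponds to W_{|C|}(δ + |C| − 1, (D_C)^{-1}) in Muirhead's notation.

```python
import numpy as np
from scipy.stats import wishart

def block_gibbs(delta, D, cliques, n_samples, burn_in=500, seed=None):
    """Draw (approximate) samples from W_G(delta, D) by cycling Wishart updates over the cliques."""
    rng = np.random.default_rng(seed)
    p = D.shape[0]
    K = np.eye(p)                                   # I_p lies in P_G for every graph G
    draws = []
    for it in range(burn_in + n_samples):
        for C in cliques:
            B = np.setdiff1d(np.arange(p), C)
            # Schur complement of the clique block: A ~ W_{|C|}(delta + |C| - 1, (D_C)^{-1})
            A = wishart.rvs(df=delta + len(C) - 1,
                            scale=np.linalg.inv(D[np.ix_(C, C)]),
                            random_state=rng)
            A = np.atleast_2d(A)                    # scipy returns a scalar when |C| = 1
            K_CB = K[np.ix_(C, B)]
            K_B_inv = np.linalg.inv(K[np.ix_(B, B)])
            K[np.ix_(C, C)] = A + K_CB @ K_B_inv @ K_CB.T
        if it >= burn_in:
            draws.append(K.copy())
    return draws

# Example: 1000 draws from W_G(3, I_4), where G is the 4-cycle with cliques equal to its edges
cliques = [np.array(c) for c in ([0, 1], [1, 2], [2, 3], [0, 3])]
samples = block_gibbs(delta=3, D=np.eye(4), cliques=cliques, n_samples=1000, seed=0)
print(np.mean([K[0, 1] for K in samples]))
```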
4 The Laplace approximation for I_G(δ, D)
We use the Laplace approximation of Tierney and Kadane (1986) to estimate the normalizing constant I_G(δ, D) in (3). The indices of the free elements of K ∈ P_G are given by the set 𝒱 in (1). We write

$$I_G(\delta, D) = \int_{K \in P_G} \exp\left(h_{\delta, D}(K)\right) \prod_{(i,j) \in \mathcal{V}} dK_{ij},$$

where

$$h_{\delta, D}(K) = -\frac{1}{2}\left[\mathrm{tr}(K^T D) - (\delta - 2)\log(\det K)\right].$$

The Laplace approximation to I_G(δ, D) is

$$\widehat{I_G}(\delta, D) = \exp\left\{h_{\delta, D}(\widehat{K})\right\} (2\pi)^{|\mathcal{V}|/2} \left[\det H_{\delta, D}(\widehat{K})\right]^{-1/2},$$

where K̂ ∈ P_G is the mode of W_G(δ, D) and H_{δ,D} is the |𝒱| × |𝒱| Hessian matrix associated with h_{δ,D}. For (i, j) ∈ 𝒱, the first derivative of h_{δ,D} is (see, for example, Harville (1997))

$$\frac{d\, h_{\delta, D}(K)}{d K_{ij}} = -\frac{1}{2}\,\mathrm{tr}\left\{\left[D - (\delta - 2)K^{-1}\right](1^{ij})^0\right\},$$

where (1^{ij})^0 is a p × p matrix that has a 1 in the (i, j) and (j, i) positions and zeros elsewhere. By setting the first derivatives to zero, i.e. d h_{δ,D}(K)/d K_{ij} = 0 for (i, j) ∈ 𝒱, we obtain K̂ = (δ − 2)K̂*, where K̂* is the solution of the system of equations (7) for L = D. If G is decomposable, K̂* is given by (8) with L = D. If G is nondecomposable, K̂* can be efficiently determined using the iterative proportional scaling algorithm described in Section 3.1. For (i, j), (l, m) ∈ 𝒱, the ((i, j), (l, m)) entry of H_{δ,D} is given by

$$\frac{d^2 h_{\delta, D}(K)}{d K_{ij}\, d K_{lm}} = -\frac{\delta - 2}{2}\,\mathrm{tr}\left\{K^{-1}(1^{ij})^0 K^{-1}(1^{lm})^0\right\}.$$
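A small numpy sketch of this computation on the log scale is given below, assuming the mode K̂ has already been obtained (for example as (δ − 2) times the iterative proportional scaling solution of (7) with L = D); the name `log_laplace_IG` is ours, and we work with the negative Hessian, which is positive definite at the mode, so that the log-determinant is well defined.

```python
import numpy as np

def log_laplace_IG(K_hat, delta, D, free):
    """Log of the Laplace approximation to I_G(delta, D).

    free  -- list of index pairs (i, j), i <= j, of the free elements of K (the set V in (1))
    K_hat -- mode of W_G(delta, D), i.e. (delta - 2) times the solution of (7) with L = D
    """
    p = D.shape[0]
    _, logdetK = np.linalg.slogdet(K_hat)
    h = -0.5 * (np.trace(K_hat @ D) - (delta - 2) * logdetK)      # h_{delta,D} at the mode
    K_inv = np.linalg.inv(K_hat)

    def unit(i, j):                                                # the matrix (1^{ij})^0
        E = np.zeros((p, p))
        E[i, j] = E[j, i] = 1.0
        return E

    m = len(free)
    H = np.zeros((m, m))                                           # negative Hessian of h
    for a, (i, j) in enumerate(free):
        for b, (l, s) in enumerate(free):
            H[a, b] = 0.5 * (delta - 2) * np.trace(K_inv @ unit(i, j) @ K_inv @ unit(l, s))
    _, logdetH = np.linalg.slogdet(H)
    return h + 0.5 * m * np.log(2.0 * np.pi) - 0.5 * logdetH

# Example: complete graph on two vertices, delta = 3, D = I_2 (mode is (delta - 2) D^{-1})
free = [(0, 0), (0, 1), (1, 1)]
print(log_laplace_IG(np.eye(2), 3, np.eye(2), free))
```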
5 G-Wishart prior and posterior distributions
The likelihood function corresponding to a random sample x^{(1:n)} = (x^{(1)}, . . . , x^{(n)}) from N_p(0, Σ) is proportional to

$$(\det K)^{n/2} \exp\left(-\frac{1}{2}\langle K, U\rangle\right), \qquad (9)$$

where U = ∑_{i=1}^{n} x^{(i)} x^{(i)T}. A G-Wishart prior W_G(δ_0, D_0) for K is conjugate to the likelihood (9), where G = (V, E) is an undirected graph, δ_0 > 2 and D_0^{-1} ∈ P_G. The posterior distribution of K is proportional to

$$(\det K)^{(\delta_0 + n - 2)/2} \exp\left(-\frac{1}{2}\langle K, D_0 + U\rangle\right).$$

Let us define D* to be the unique positive definite matrix that verifies the system (Dempster, 1972; Knuiman, 1978; Speed and Kiiveri, 1986):

$$D^*_{ij} = (D_0 + U)_{ij} \text{ if } (i, j) \in \mathcal{V}, \qquad \left[(D^*)^{-1}\right]_{ij} = 0 \text{ if } (i, j) \notin \mathcal{V}. \qquad (10)$$

We note that ⟨K, D_0 + U⟩ = ⟨K, D*⟩ for every K ∈ P_G. Since (D*)^{-1} ∈ P_G, we take the posterior distribution of K to be W_G(δ_0 + n, D*), thereby ensuring that it is proper. The marginal likelihood of the data given G is the ratio of the normalizing constants of the G-Wishart posterior and prior:

$$p(x^{(1:n)} \mid G) = I_G(\delta_0 + n, D^*) / I_G(\delta_0, D_0). \qquad (11)$$
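As a rough sketch of how (10) and (11) fit together computationally: the completion D* can be obtained by applying iterative proportional scaling to L = D_0 + U and inverting the result, since the solution K of (7) has (K^{-1})_{ij} = (D_0 + U)_{ij} on the free positions and K_{ij} = 0 elsewhere. The helper names below (`D_star`, `log_marginal_likelihood`) are ours, and `ips_fn` and `log_IG` stand for an iterative proportional scaling routine (as sketched in Section 3.1) and whichever approximation of log I_G is used; neither is implemented here.

```python
import numpy as np

def D_star(D0, U, cliques, ips_fn):
    """Completion (10): D* agrees with D0 + U on the free positions and (D*)^{-1} lies in P_G."""
    K = ips_fn(D0 + U, cliques)        # K in P_G with (K^{-1})_{C_j} = (D0 + U)_{C_j}
    return np.linalg.inv(K)

def log_marginal_likelihood(x, D0, delta0, cliques, ips_fn, log_IG):
    """Log of (11): log I_G(delta0 + n, D*) - log I_G(delta0, D0), for zero-mean data x (n x p)."""
    n = x.shape[0]
    U = x.T @ x                        # U = sum of x_i x_i^T
    Dst = D_star(D0, U, cliques, ips_fn)
    return log_IG(delta0 + n, Dst) - log_IG(delta0, D0)
```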
The prior W_G(δ_0, D_0) can be interpreted as representing a fictive dataset having a sample size of δ_0 − 2 and sample covariance D_0. Throughout this paper we take δ_0 = 3 and D_0 to be the p-dimensional identity matrix, that is, we assume that the observed variables are independent a priori and that the weight of the prior is equivalent to one observed sample. The posterior W_G(δ_0 + n, D*) can be thought of as being associated with a dataset obtained by augmenting the observed data x^{(1:n)} with the fictive prior data.
The fictive prior data as well as the augmented data are reflected in the marginal likelihood (11), which quantifies the fit of the graphical model G. There are two methods for computing the normalizing constants of the G-Wishart prior and posterior when G is nondecomposable: the Monte Carlo method of Atay-Kayis and Massam (2005) and the Laplace approximation method developed in Section 4. Kass and Raftery (1995) note that the accuracy of the Laplace approximation depends on the peakedness of the density function about its mode, and on the degree to which the density resembles a Gaussian distribution in this area. In order to empirically illustrate our choice of procedure for computing the two normalizing constants in (11), we consider the five-dimensional circle model C5 of Yuan and Lin (2007) – see Section 8.1. We let K_{ii} = 1, K_{i,i−1} = K_{i−1,i} = 0.5, K_{1,5} = K_{5,1} = 0.4 and K_{i,j} = 0 otherwise. We then consider the distribution W_{C5}(δ, (δ − 2)K^{-1}) for δ ∈ {3, 13, 28}, and use the block Gibbs sampler from Section 3.2 to draw ten thousand matrices from each of these three distributions, with the goal of assessing the degree to which the distribution is concentrated about its mode. Figure 1 shows the distribution of the K_{12} element across the sampled matrices for each distribution. This particular form W_{C5}(δ, (δ − 2)K^{-1}) was chosen because the mode of each distribution is K itself (however, it should be noted that this mode is for the entire matrix, and the marginal distribution of K_{12} need not be centered exactly at 0.5, as can be seen when δ = 3 in Figure 1). This study is particularly important in assessing the applicability of the Laplace approximation for the model selection purposes of this paper. Setting δ = 3 is the typical choice in the prior distribution, and thus the first line shows the distribution if the prior were chosen to be W_{C5}(3, K^{-1}). A W_{C5}(13, 11 · K^{-1}) distribution represents the posterior distribution after collecting ten datapoints whose sample covariance matrix was K^{-1}, and equivalently a W_{C5}(28, 26 · K^{-1}) distribution would be the posterior if 25 datapoints were collected with the same sample covariance matrix. The fact that the distribution of K_{12} is much more diffuse when δ = 3 as opposed to 13 or 28 indicates why the approximation tends to work poorly for the prior. However, the fact that the parameter already has a rather peaked distribution once δ = 13 implies that the approximation is suitable for posterior normalizing constants. The G-Wishart posterior distribution therefore becomes more concentrated around its mode as the sample size increases. This implies that the Laplace approximation is likely to perform well for computing the posterior normalizing constant I_G(δ_0 + n, D*). In contrast, the Monte Carlo method of Atay-Kayis and Massam (2005) requires an increased number of iterations to converge for datasets with a larger sample size, hence using it for computing I_G(δ_0 + n, D*) is accurate but computationally demanding. The same Monte Carlo method, however, converges fast for the computation of the prior normalizing constant I_G(δ_0, D_0) with D_0 set to the identity matrix. As such, we evaluate the marginal likelihood (11) of a nondecomposable graph G in an efficient and accurate manner by employing the Laplace approximation for I_G(δ_0 + n, D*) and the Monte Carlo method for I_G(δ_0, D_0).
Figure 1: Marginal distributions of K_{12} based on 10,000 samples from the G-Wishart distribution W_{C5}(δ, (δ − 2)K^{-1}) for δ ∈ {3, 13, 28}.
6 Stochastic search algorithms
We denote by 𝒢 a set of competing graphs. We associate with each graph G ∈ 𝒢 a neighborhood nbd(G) ⊂ 𝒢. The neighborhoods are defined with respect to the class of graphs considered. If 𝒢 represents the 2^r graphs with p vertices, where r = p(p − 1)/2, nbd(G) contains those graphs that are obtained by adding or deleting one edge from G. If 𝒢 is restricted to decomposable graphs, nbd(G) contains only the decomposable graphs that are one edge away from G. The Bayesian approach to structural learning in Gaussian graphical models involves determining the graphs in 𝒢 having the highest posterior probability p(G|x) ∝ p(x|G)p(G), where x is the observed data. Assuming that all the candidate graphs are a priori equally likely implies p(G|x) ∝ p(x|G). Wong et al. (2003) propose a uniform prior on graphs with the same number of edges, while Jones et al. (2005) discuss priors that encourage sparse graphs by penalizing the inclusion of additional edges: p(G) ∝ β^k (1 − β)^{r−k}, where k is the number of edges of G and β ∈ (0, 1) is the probability of the inclusion of an edge in G. A uniform prior for β leads to a prior with desirable multiple testing correction properties (Scott and Berger, 2006; Scott and Carvalho, 2008):

$$p(G) \propto \left[(r + 1)\binom{r}{k}\right]^{-1}. \qquad (12)$$
Giudici and Green (1999) and Wong et al. (2003) developed reversible jump Markov chain Monte Carlo samplers (Green, 1995) for decomposable and arbitrary Gaussian graphical models, respectively. As p gets larger, Markov chain Monte Carlo methods in which the chains are run over the product space of 𝒢 and other model parameters require a considerable number of iterations to achieve convergence due to the high dimensionality of the state space. The Markov chain Monte Carlo model composition algorithm of Madigan and York (1995) addresses this issue by constructing an irreducible chain only on 𝒢. If the chain is currently in state G, a candidate graph G′ is randomly drawn uniformly from nbd(G). The chain moves to G′ with probability

$$\min\left\{1, \frac{p(G' \mid x)/|\mathrm{nbd}(G')|}{p(G \mid x)/|\mathrm{nbd}(G)|}\right\}. \qquad (13)$$

The acceptance probability (13) simplifies to min{1, p(G′|x)/p(G|x)} in the space of arbitrary graphs since |nbd(G)| = r for all G ∈ 𝒢. This is not true for the smaller space of decomposable graphs. Once the model parameters are integrated out, Markov chain Monte Carlo methods are only a device to visit regions with high posterior probability graphs, since there is no substantive need to actually sample from the posterior distribution {p(G|x) : G ∈ 𝒢}. The posterior probability of each graph G is readily available from its marginal likelihood up to the normalizing constant

$$\left[\sum_{G \in \mathcal{G}} p(G \mid x)\right]^{-1},$$
hence there is no need to repeatedly revisit G to determine p(G|x). Jones et al. (2005) propose the shotgun stochastic search algorithm that moves from a current graph G to one of its neighbors, selected by sampling from {T^{(α)}(G′; G) : G′ ∈ nbd(G)}, where

$$T^{(\alpha)}(G'; G) = p(G' \mid x)^{\alpha} \Big/ \sum_{G'' \in \mathrm{nbd}(G)} p(G'' \mid x)^{\alpha}. \qquad (14)$$

Here α is an annealing parameter that can be adjusted to encourage more aggressive moves (α > 1) or to increase the chance of moving to lower posterior probability graphs by smoothing out the proposal (α < 1). Jones et al. (2005) empirically show that the shotgun stochastic search algorithm consistently finds better graphs much faster than the Markov chain Monte Carlo model composition approach of Madigan and York (1995). This is largely true if one does not have to make too many changes to the current graph to reach the highest posterior probability graph G_h = argmax_{G ∈ 𝒢} p(G|x). On the other hand, fully exploring the neighborhoods of all the graphs on a path connecting the current graph with G_h is very likely to be computationally inefficient if this path is long. Another potential shortcoming shared by the Markov chain Monte Carlo model composition algorithm and the shotgun stochastic search algorithm is that the graph whose neighborhood is explored at the next iteration is selected from the neighbors of the current graph. It is sensible to assume that G_h could be in the neighborhood of a graph that has been identified at a previous iteration but was not selected for exploration. Berger and Molina (2005) point out that the graph to be explored at the next iteration should be selected from the list of graphs identified so far, with probabilities proportional to the posterior probability of each graph. We follow the idea of Berger and Molina (2005) and propose a novel stochastic search algorithm for Gaussian graphical models called the mode oriented stochastic search, which also combines the principles behind the Markov chain Monte Carlo model composition algorithm and the shotgun stochastic search algorithm. Our method is designed to identify the highest posterior probability graphs that belong to

$$\mathcal{G}(c) = \{G \in \mathcal{G} : p(G \mid x) \ge c\, p(G_h \mid x)\}, \qquad (15)$$

where c ∈ (0, 1). We follow the "Occam's window" principle of Madigan and Raftery (1994) and discard models with low posterior probability relative to G_h. If the graphs in 𝒢 are a priori equally likely, choosing c in the intervals (0, 0.01], (0.01, 0.1], (0.1, 1/3.2], (1/3.2, 1] means eliminating models having decisive, strong, substantial or "not worth more than a bare mention" evidence against them with respect to G_h (Kass and Raftery, 1995). Since 𝒢(c) could still contain too many graphs for certain values of c, we define our target set of
models to be 𝒢(c, m), which consists of the top m highest posterior probability models in 𝒢(c). We keep a list of graphs S that is updated during the search. We define a subset S(c, m) of S in the same way as we defined the subset 𝒢(c, m) of 𝒢. We choose a constant c0 ∈ (0, c) and allow our stochastic search to escape local optima by visiting graphs in S(c0) \ S(c, m). These lower probability graphs are deleted from S with probability q ∈ (0, 1). A graph is called explored if all its neighbors have been visited. We keep track of the graphs currently in S that have been explored. The mode oriented stochastic search proceeds as follows.

procedure MOSS(c, c0, m, k, q)
• Initialize the starting list of graphs S ⊂ 𝒢. For each graph G ∈ S, calculate and record its posterior probability p(G|x) and mark it as unexplored.
• For l = 1, . . . , k do:
  – Let L be the set of unexplored graphs in S. If L = ∅, STOP.
  – Sample a graph G ∈ L according to probabilities proportional to p(G|x) normalized within L. Mark G as explored.
  – For each G′ ∈ nbd(G) do the following: if G′ ∉ S, include G′ in S, evaluate and record p(G′|x), and mark G′ as unexplored.
  – With probability q, prune the models in S \ S(c, m). Otherwise prune the models in S \ S(c0).
end procedure

A feature of the mode oriented stochastic search is that it can end before completing k iterations if the current list of models does not contain any unexplored graphs. Alternatively, we could choose to mark all the models in S as unexplored and continue the procedure until the specified number of iterations has been reached. Due to the random pruning of the list S, our search algorithm could follow a different search path in the target set of graphs 𝒢. On the other hand, stopping the procedure early could significantly decrease the overall running time. We recommend using a value for c0 as close to zero as possible. The role of the parameter c0 is to limit the number of models that are included in S to a manageable number. The following result holds:

Theorem 1 The mode oriented stochastic search finds G_h provided that it is run for a sufficient number of iterations.

The proof of Theorem 1 is given in the Appendix. We empirically compare the relative efficiency of the Markov chain Monte Carlo model composition algorithm, the shotgun stochastic search and the mode oriented stochastic search in Section 8.2.
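A compact generic version of the procedure is sketched below; graphs are represented as frozensets of edges, the log posterior score is supplied by the caller, and names such as `moss`, `log_score` and `neighbors` are ours. Details like tie handling, the initial list and the handling of explored markers are simplified relative to the description above.

```python
import itertools
import math
import random

def neighbors(graph, p):
    """All graphs obtained from `graph` by adding or deleting a single edge."""
    all_edges = itertools.combinations(range(p), 2)
    return [graph ^ frozenset([e]) for e in all_edges]   # symmetric difference toggles one edge

def moss(log_score, p, start, c=1/3.2, c0=0.01, m=250, k=1000, q=0.1, seed=0):
    rng = random.Random(seed)
    scores = {start: log_score(start)}                    # the list S with recorded log scores
    explored = set()
    for _ in range(k):
        unexplored = [G for G in scores if G not in explored]
        if not unexplored:
            break                                          # every graph in S has been explored
        # sample a graph with probability proportional to its posterior within the unexplored set
        top = max(scores[G] for G in unexplored)
        weights = [math.exp(scores[G] - top) for G in unexplored]
        G = rng.choices(unexplored, weights=weights)[0]
        explored.add(G)
        for Gp in neighbors(G, p):                         # score every neighbor not already in S
            if Gp not in scores:
                scores[Gp] = log_score(Gp)
        best = max(scores.values())
        if rng.random() < q:                               # prune S \ S(c, m)
            keep = {H: s for H, s in scores.items() if s >= best + math.log(c)}
            keep = dict(sorted(keep.items(), key=lambda t: -t[1])[:m])
        else:                                              # prune S \ S(c0)
            keep = {H: s for H, s in scores.items() if s >= best + math.log(c0)}
        scores = keep
        explored &= set(scores)
    return sorted(scores.items(), key=lambda t: -t[1])
```

Here `log_score(G)` would return log p(G | x) up to an additive constant, computed for instance through the marginal likelihood (11) under a uniform prior on graphs.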
7 Bayesian inference
Let S be the set of graphs identified by the mode oriented stochastic search algorithm. The posterior distribution of the precision matrix K given S is obtained by Bayesian model averaging (Kass and Raftery, 1995). It is a mixture of the posterior distributions under each graph in S, with weights given by the posterior probabilities of each graph G normalized within S (Madigan and Raftery, 1994):

$$p(K \mid x^{(1:n)}, S) = \sum_{G \in S} W_G(\delta_0 + n, D^*)\, p(G \mid x^{(1:n)}, S), \qquad (16)$$

where

$$p(G \mid x^{(1:n)}, S) = p(G \mid x^{(1:n)}) \Big/ \left[\sum_{G' \in S} p(G' \mid x^{(1:n)})\right]. \qquad (17)$$
The mixture of G-Wishart distributions in (16) is not itself a G-Wishart distribution. Since all the components of the mixture are proper, its normalizing constant is a linear combination of the normalizing constants of the components, and therefore it is a proper distribution. Yang and Berger (1994) give Bayes estimators for K and Σ with respect to squared, entropy (or Kullback-Leibler) and quadratic loss functions. The estimators for K with respect to the G-Wishart posterior distribution W_G(δ_0 + n, D*) are δ_1(G) = E(K|x^{(1:n)}, G), δ_2(G) = [E(Σ|x^{(1:n)}, G)]^{-1} and vec(δ_3(G)) = [E(Σ ⊗ Σ|x^{(1:n)}, G)]^{-1} vec(E(Σ|x^{(1:n)}, G)), respectively. The corresponding estimators for Σ are δ_4(G) = E(Σ|x^{(1:n)}, G), δ_5(G) = [E(K|x^{(1:n)}, G)]^{-1} and vec(δ_6(G)) = [E(K ⊗ K|x^{(1:n)}, G)]^{-1} vec(E(K|x^{(1:n)}, G)). The posterior means of K and Σ are estimated by

$$\widehat{E}(K \mid x^{(1:n)}, G) = J^{-1}\sum_{j=1}^{J} K_j, \qquad \widehat{E}(\Sigma \mid x^{(1:n)}, G) = J^{-1}\sum_{j=1}^{J} K_j^{-1},$$

where K_1, K_2, . . . , K_J are sampled from W_G(δ_0 + n, D*) using the block Gibbs sampler. The Bayesian model averaging estimators with respect to S are given by

$$\delta_j(S) = \sum_{G \in S} \delta_j(G)\, p(G \mid x^{(1:n)}, S), \quad j = 1, \ldots, 6.$$
Various estimators of other quantities of interest can be obtained from the posterior (16). The posterior inclusion probability of an edge (i, j) is given by the sum of the normalized probabilities (17) of the graphs in S that contain (i, j). We follow the notion of median probability linear model of Barbieri and Berger (2004) and define the median graph with respect to S as the graph containing all the edges having posterior inclusion probabilities greater than 0.5.
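A minimal sketch of these model-averaged summaries is given below, assuming the search has returned graphs (as collections of edges) together with their log posterior scores; the names `edge_inclusion` and `median_graph` are our own.

```python
import math

def edge_inclusion(graphs, log_posteriors):
    """Posterior inclusion probability of each edge, using (17) as the weights."""
    top = max(log_posteriors)
    weights = [math.exp(lp - top) for lp in log_posteriors]
    total = sum(weights)
    probs = {}
    for G, w in zip(graphs, weights):
        for edge in G:
            probs[edge] = probs.get(edge, 0.0) + w / total
    return probs

def median_graph(graphs, log_posteriors):
    """Edges whose posterior inclusion probability exceeds 0.5 (Barbieri and Berger, 2004)."""
    return {e for e, pr in edge_inclusion(graphs, log_posteriors).items() if pr > 0.5}

# Example with three graphs on the vertices {1, 2, 3}
graphs = [{(1, 2), (2, 3)}, {(1, 2)}, {(1, 2), (1, 3)}]
log_posteriors = [-10.0, -11.0, -12.5]
print(edge_inclusion(graphs, log_posteriors))
print(median_graph(graphs, log_posteriors))
```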
8 Examples
We illustrate our proposed methodology through two simulated examples. Unless we specify otherwise, we assume that all the graphs are a priori equally likely.
8.1 First simulated example
We replicate the simulation study from Section 6 of Yuan and Lin (2007), who performed a comprehensive comparison of their own selection and estimation approaches with the method of Meinshausen and Bühlmann (2006) and the SIN method of Drton and Perlman (2004). They simulated 25 samples of dimension p = 5 and p = 10 from eight different models: AR(1), AR(2), AR(3), AR(4), a full graph, a star graph with every vertex connected to the first vertex and a circle graph. We used the mode oriented stochastic search algorithm with c = 0.1, m = ∞, c0 = 0.01 and q = 0.1 and a single random starting graph. We further set the parameter k = ∞ and thereby allowed the algorithm to terminate itself in each instance, in order to examine the relationship between the total number of graph evaluations and model complexity. We performed separate searches in the space of decomposable graphs and in the space of all possible graphs – see Table 1. For each model, sample size and search type, we estimate the precision matrix K using the highest posterior probability graph (B), the median graph with respect to S (M) and Bayesian model averaging with respect to S (A). We employed the posterior mean estimator δ_1 and assessed its performance using the average Kullback-Leibler (KL) loss across the replicates. We also report the number of incorrectly identified edges (false positives, FP) and the number of incorrectly missed edges (false negatives, FN) with respect to the median graph associated with S. In addition, we give the number of graphs in S (|S|) and the number of graphs whose marginal likelihood was evaluated until the completion of the mode oriented stochastic search (E). Table 1 shows that the estimated Kullback-Leibler loss values are smaller for Bayesian model averaging compared to the highest posterior probability graph and the median graph, which exhibit a similar performance. More importantly, our proposed approach performs consistently better than any of the estimation techniques considered in Yuan and Lin (2007). Our algorithm recovers the structure of the graphs well, as the number of false positive and false negative edges is typically as good as the best technique from Yuan and Lin (2007). We remark that, in the case of model 8, which is the only nondecomposable model in the group, the Kullback-Leibler loss is considerably lower when nondecomposable graphs are considered. Finally, we see that the number of models evaluated in a run of the mode oriented stochastic search increases roughly with the complexity of the underlying model. The first four models, which are autoregressive, take fewer model evaluations to achieve termination than the last four models, especially models 6 and 8. Nevertheless, the number of graphs evaluated is considerably smaller than the total number of graphs in the search space: 1024 for p = 5 and 3.52 × 10^{13} for p = 10.
8.2 Second simulated example
Scott and Carvalho (2008) simulated a dataset of size 50 from a graph with 25 vertices whose edges link only the first 10 variables: {(1, 6), (1, 7), (2, 4), (3, 4), (3, 6), (3, 7), (3, 8), (3, 9), (4, 6), (4, 8), (4, 9), (5, 6), (5, 9), (6, 7), (6, 8), (6, 9), (7, 9), (7, 10), (8, 9), (9, 10)}.
Table 1: Results for the eight simulated models of Yuan and Lin (2007). The standard errors across the 100 replicates are shown in parentheses.

                       Decomposable                                         Unrestricted
p   Model   B      M      A      FP     FN     |S|   E            B      M      A      FP     FN     |S|   E
5   1       0.16   0.16   0.14   0.33   0      10    236          0.16   0.16   0.14   0.33   0      10    255
           (0.02) (0.01) (0.01) (0.06) (0)    (1)   (7)          (0.02) (0.01) (0.01) (0.06) (0)    (1)   (17)
    2       0.29   0.29   0.28   0.25   0.03   6     107          0.29   0.29   0.29   0.25   0.03   9     212
           (0.02) (0.02) (0.02) (0.06) (0.02) (0)   (3)          (0.02) (0.02) (0.02) (0.05) (0.02) (0)   (8)
    3       0.54   0.54   0.54   0.1    2.22   12    174          0.54   0.54   0.54   0.16   2.28   14    298
           (0.02) (0.02) (0.02) (0.03) (0.13) (1)   (9)          (0.02) (0.02) (0.02) (0.04) (0.13) (1)   (15)
    4       0.55   0.55   0.53   0.15   5.69   17    263          0.55   0.54   0.53   0.21   5.57   19    409
           (0.02) (0.02) (0.02) (0.04) (0.12) (1)   (16)         (0.02) (0.02) (0.02) (0.04) (0.11) (1)   (22)
    5       0.58   0.59   0.57   0      7.17   16    268          0.58   0.58   0.57   0      7.07   17    368
           (0.02) (0.02) (0.02) (0)    (0.11) (1)   (14)         (0.02) (0.02) (0.02) (0)    (0.12) (1)   (18)
    6       0.84   0.87   0.83   0      4.58   17    274          0.84   0.87   0.83   0      4.53   18    348
           (0.04) (0.03) (0.03) (0)    (0.31) (1)   (22)         (0.04) (0.03) (0.03) (0)    (0.31) (1)   (26)
    7       0.35   0.35   0.32   0.21   2.92   16    306          0.35   0.35   0.32   0.21   2.9    16    389
           (0.01) (0.01) (0.01) (0.05) (0.09) (1)   (14)         (0.01) (0.01) (0.01) (0.05) (0.09) (1)   (17)
    8       0.48   0.48   0.47   1.56   0.12   9     128          0.37   0.37   0.36   0.32   0.04   8     191
           (0.02) (0.03) (0.02) (0.08) (0.03) (0)   (4)          (0.02) (0.02) (0.02) (0.06) (0.02) (0)   (8)
10  1       0.16   0.16   0.13   1.04   0      45    5284         0.16   0.16   0.13   1.04   0      45    4158
           (0.01) (0.01) (0.01) (0.11) (0)    (4)   (253)        (0.01) (0.01) (0.01) (0.11) (0)    (4)   (280)
    2       0.27   0.28   0.26   0.79   0      24    2076         0.3    0.29   0.27   0.79   0      72    6464
           (0.01) (0.01) (0.01) (0.12) (0)    (2)   (63)         (0.01) (0.01) (0.01) (0.11) (0)    (5)   (354)
    3       0.49   0.5    0.51   0.46   1.28   20    1763         0.55   0.56   0.55   0.97   1.58   52    4541
           (0.02) (0.02) (0.02) (0.08) (0.1)  (2)   (56)         (0.02) (0.02) (0.02) (0.11) (0.12) (5)   (360)
    4       0.86   0.88   0.84   0.47   12.91  77    4237         0.86   0.86   0.84   0.85   12.88  188   13047
           (0.02) (0.02) (0.02) (0.07) (0.27) (10)  (345)        (0.02) (0.02) (0.02) (0.11) (0.22) (19)  (1027)
    5       0.77   0.78   0.76   0.17   20.59  72    4357         0.77   0.76   0.76   0.35   20.1   145   10342
           (0.02) (0.02) (0.02) (0.04) (0.21) (6)   (241)        (0.02) (0.02) (0.02) (0.06) (0.2)  (14)  (752)
    6       1.21   1.21   1.18   0      18.04  173   10302        1.71   1.7    1.64   0      39.9   399   24572
           (0.05) (0.05) (0.04) (0)    (1.99) (26)  (1402)       (0.01) (0.01) (0.01) (0)    (0.2)  (30)  (1543)
    7       0.53   0.53   0.52   1.08   4.23   154   10517        0.53   0.53   0.52   1.14   4.23   167   11822
           (0.02) (0.02) (0.02) (0.1)  (0.15) (17)  (761)        (0.02) (0.02) (0.02) (0.1)  (0.15) (18)  (957)
    8       0.52   0.48   0.48   5.75   0      531   16056        0.41   0.39   0.37   2.07   0      420   28062
           (0.02) (0.02) (0.01) (0.24) (0)    (47)  (1063)       (0.02) (0.02) (0.02) (0.2)  (0)    (47)  (2422)
We are interested in using this dataset to empirically verify Theorem 1. More specifically, we want to learn which stochastic search algorithm discovers the highest posterior probability graphs given the same number of marginal likelihood evaluations. We compare the mode oriented stochastic search, the shotgun stochastic search of Jones et al. (2005) and the Markov chain Monte Carlo model composition algorithm of Madigan and York (1995). Each algorithm was started once from the same graph, which was chosen at random. The parameters for the mode oriented stochastic search were c = 1/3.2, m = 250, c0 = 0.1 and q = 0.1. Instead of running the algorithm for a fixed number of iterations we set the parameter k = ∞, and thereby ran the mode oriented stochastic search at these settings until the algorithm terminated itself. We recorded the number of graphs whose marginal likelihood was evaluated by the mode oriented stochastic search algorithm, then ran the other two algorithms until they evaluated the same number of graphs. In this case, a total of 433,801 graphs were evaluated by each of the three algorithms. Figure 2 reports the posterior probabilities of the top 250 graphs identified by each stochastic search algorithm and shows that the mode oriented stochastic search algorithm possesses two desirable characteristics with respect to the other algorithms. The first feature is that higher posterior probability models are returned given the same number of model evaluations. Jones et al. (2005) note that the shotgun stochastic search algorithm finds higher posterior probability models than the Markov chain Monte Carlo model composition algorithm, which is consistent with the results in Figure 2. However, the mode oriented stochastic search algorithm finds models that have approximately 150 times the posterior probability of the models returned by the shotgun stochastic search algorithm. Figure 2 also shows a much smaller spread in the marginal likelihood values of the top 250 models found by either the shotgun stochastic search or the Markov chain Monte Carlo model composition algorithms. This is indicative of the capability of the mode oriented stochastic search to find models in the set 𝒢(c) quicker than the other two algorithms, and not just the model with the largest posterior probability.

Scott and Carvalho (2008) argue that the use of the prior (12) on the model space is necessary to favor sparser graphs by penalizing for increased complexity. We used this prior and discovered that the corresponding top models were much sparser than the true graph. The normalizing constant of the G-Wishart prior becomes larger for denser graphs, hence it effectively penalizes for the inclusion of additional edges. As such, a uniform prior on the model space is needed in our framework to avoid penalizing for model complexity twice.
9 Multivariate regression
We consider the case when the variables X = X_V, V = {1, 2, . . . , p}, are partitioned into a set of response variables X_R, R ⊂ V, and a set of explanatory variables X_{V\R}. If the joint distribution of X is N_p(0, Σ), it follows that the conditional distribution of X_R given X_{V\R} = x_{V\R} is N_{|R|}(Γ_{R|V\R} x_{V\R}, Σ_{R|V\R}), where Γ_{R|V\R} = Σ_{R,V\R}(Σ_{V\R})^{-1} and Σ_{R|V\R} = Σ_R − Σ_{R,V\R}(Σ_{V\R})^{-1} Σ_{V\R,R}. The marginal distribution of X_{V\R} is N_{|V\R|}(0, Σ_{V\R}).
Figure 2: Distribution of the top 250 marginal likelihoods returned by the mode oriented stochastic search, the shotgun stochastic search and the Markov chain Monte Carlo model composition algorithms after evaluating the same number of models and starting at the same randomly generated graph.
We write

$$p(X \mid \Sigma) = p(X_R \mid X_{V\setminus R}, \Gamma_{R|V\setminus R}, \Sigma_{R|V\setminus R})\, p(X_{V\setminus R} \mid \Sigma_{V\setminus R}). \qquad (18)$$
The regression parameters (Γ_{R|V\R}, Σ_{R|V\R}) are independent of Σ_{V\R} – see Muirhead (2005). As such, inference for the conditional p(X_R|X_{V\R}) can be done independently from inference on the marginal p(X_{V\R}). The zero constraints K_{ij} = 0 for the elements of the precision matrix K = Σ^{-1} are classified as (a) conditional independence of two response variables given the rest, i.e. i, j ∈ R; (b) conditional independence of a response variable and an explanatory variable, i.e. i ∈ R and j ∈ V\R; and (c) conditional independence of two explanatory variables, i.e. i, j ∈ V\R. Note that a constraint of type (b) is equivalent to the absence of X_j from the regression of X_i on X_{V\R} – see Proposition 10.5.1, page 323, of Whittaker (1990). The zero constraints of type (a) and (b) are associated with the conditional p(X_R|X_{V\R}), while the constraints of type (c) involve the marginal p(X_{V\R}).

Define the set 𝒢^{[V\R]} of graphs with vertices V such that their V\R subgraph G_{V\R} is complete. A graph G ∈ 𝒢^{[V\R]} embeds only constraints of type (a) and (b) for K, hence it is representative of the conditional independence relationships in p(X_R|X_{V\R}) – see Proposition 10.1.1, page 304, of Whittaker (1990). As before, we assume a G-Wishart prior W_G(δ_0, D_0) for K ∈ P_G. The induced prior for Σ = K^{-1} is the hyper inverse Wishart HIW_G(δ_0, D_0), which is strong hyper Markov – see Corollary 1 of Roverato (2002). From Proposition 5.6 of Dawid and Lauritzen (1993) it follows that the marginal data-distribution of X is Markov. Since G is collapsible onto V\R and (R, V\R) is a decomposition of G, it follows that the marginal likelihood of the regression of X_R on X_{V\R} with constraints induced by G is the ratio of the marginal likelihoods of G and G_{V\R}:

$$p\left(x_R^{(1:n)} \mid x_{V\setminus R}^{(1:n)}, G\right) = p\left(x^{(1:n)} \mid G\right) \Big/ p\left(x_{V\setminus R}^{(1:n)} \mid G_{V\setminus R}\right). \qquad (19)$$

Here x_A^{(1:n)}, A ⊂ V, represents the subset of the data x^{(1:n)} corresponding with X_A. Since G_{V\R} is complete, p(x_{V\R}^{(1:n)} | G_{V\R}) is the ratio of the normalizing constants of the Wishart distributions W_{|V\R|}(δ_0 + n, D*_{V\R}) and W_{|V\R|}(δ_0, (D_0)_{V\R}) – see (4). This means that p(x_{V\R}^{(1:n)} | G_{V\R}) is constant for all G ∈ 𝒢^{[V\R]}. Therefore finding multivariate regressions with high marginal likelihoods is equivalent to identifying graphs in 𝒢^{[V\R]} with high marginal likelihoods. It seems relevant to notice that only a subset of the explanatory variables might be connected with the response variables in any given graph G ∈ 𝒢^{[V\R]}. Let bd_G(R) ⊂ V\R be the boundary of the response variables in G, that is, j ∈ bd_G(R) if there exists an i ∈ R such that (i, j) is an edge in G. If there exists at least one explanatory variable that is not linked with any response variable (i.e., V \ (R ∪ bd_G(R)) ≠ ∅), then, since V\R is a clique in G, X_R is independent of X_{V\(R∪bd_G(R))} given X_{bd_G(R)}, thus p(X_R|X_{V\R}) = p(X_R|X_{bd_G(R)}).
Since the marginal data-distribution of X is Markov, (19) reduces to

$$p\left(x_R^{(1:n)} \mid x_{V\setminus R}^{(1:n)}, G\right) = p\left(x_R^{(1:n)} \mid x_{bd_G(R)}^{(1:n)}, G_{R \cup bd_G(R)}\right) = \frac{p\left(x_{R \cup bd_G(R)}^{(1:n)} \mid G_{R \cup bd_G(R)}\right)}{p\left(x_{bd_G(R)}^{(1:n)} \mid G_{bd_G(R)}\right)}. \qquad (20)$$
This property is especially important if |R| is much smaller than |V\R|, hence it is likely that only a few explanatory variables will be connected with the responses. The mode oriented stochastic search algorithm can be employed to learn the graphs with the highest posterior probability in 𝒢^{[V\R]}. As Whittaker (1990) points out, the size of this search space is 2^{|R|(|R|−1)/2} · 2^{|R|·|V\R|}, which is significantly smaller than 2^{p(p−1)/2}. The admissible set of neighbors of a graph needs to be modified so that no two vertices in V\R become disconnected during the search. We let S^{[V\R]} ⊂ 𝒢^{[V\R]} be the resulting set of graphs. We replace each graph G ∈ S^{[V\R]} with its subgraph G_{R∪bd_G(R)} and denote this transformed set of graphs by {G_1, . . . , G_J}. The conditional posterior distribution of X_R given X_{V\R} = x_{V\R} is

$$p\left(X_R \mid X_{V\setminus R} = x_{V\setminus R}\right) = \sum_{j=1}^{J} p\left(X_R \mid X_{bd_{G_j}(R)} = x_{bd_{G_j}(R)}\right) w_j.$$
The weight w_j of each regression is given by its marginal likelihood (20) normalized within the set of graphs {G_1, . . . , G_J}, such that ∑_{j=1}^{J} w_j = 1. Given a graph G_j ∈ S^{[V\R]}, the joint distribution of X_{R∪bd_{G_j}(R)} is N_{|R|+|bd_{G_j}(R)|}(0, Σ_j), where the posterior distribution of Σ_j is the inverse Wishart IW_{G_j}(δ_0 + n, D*_{R∪bd_{G_j}(R)}). We estimate Σ_j with an estimator Σ̂_j = δ_i(G_j) (i ∈ {4, 5, 6}) as described in Section 7. The conditional posterior distribution of X_R given X_{bd_{G_j}(R)} = x_{bd_{G_j}(R)} is estimated as

$$p\left(X_R \mid X_{bd_{G_j}(R)} = x_{bd_{G_j}(R)}\right) = N_{|R|}\left(\widehat{\Gamma}^{j}_{R|bd_{G_j}(R)}\, x_{bd_{G_j}(R)},\ \widehat{\Sigma}^{j}_{R|bd_{G_j}(R)}\right),$$

where

$$\widehat{\Gamma}^{j}_{R|bd_{G_j}(R)} = \widehat{\Sigma}^{j}_{R,\,bd_{G_j}(R)}\left(\widehat{\Sigma}^{j}_{bd_{G_j}(R)}\right)^{-1} \quad\text{and}\quad \widehat{\Sigma}^{j}_{R|bd_{G_j}(R)} = \widehat{\Sigma}^{j}_{R} - \widehat{\Gamma}^{j}_{R|bd_{G_j}(R)}\,\widehat{\Sigma}^{j}_{bd_{G_j}(R),\,R}.$$

We note that it is sufficient to do estimation in lower-dimensional marginals of the joint distribution of X, which makes our approach suitable for use in high-dimensional datasets. For the purpose of identifying graphs with a more parsimonious structure, we could restrict the search to the subset 𝒢_q^{[V\R]} of 𝒢^{[V\R]} containing graphs whose maximum clique size is at most q, where q is a small integer. Such graphs have a better chance of inducing good predictive models for X_R. The size of the boundary of X_R will be at most q − 1 and hence no more than q − 1 explanatory variables can enter each multivariate regression model. The graphs in 𝒢_q^{[V\R]} are connected not only by the addition and removal of edges, but also by substituting edges and explanatory variables. Let G = (V, E) ∈ 𝒢_q^{[V\R]} and
denote by Ē = {(i, j) ∈ (V × V) \ [(V\R) × (V\R)] : i < j} the set of edges that can be changed in G. These are the edges that connect two response variables or a response variable with an explanatory variable. The neighbors of G are obtained by: (i) including in E an edge from Ē \ E, (ii) deleting an edge from E ∩ Ē, (iii) replacing an edge in E ∩ Ē with any other edge in Ē \ E, and (iv) replacing an explanatory variable currently in bd_G(R) with any other explanatory variable currently in V \ (R ∪ bd_G(R)).
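To illustrate how the mixture of regressions defined above is used for prediction, here is a minimal numpy sketch that averages the per-model predictive means with the weights w_j; the inputs (estimated coefficient matrices and log marginal likelihoods) are assumed to be available from the search and estimation steps, and the name `bma_regression_mean` is ours.

```python
import numpy as np

def bma_regression_mean(x_expl, models, log_marglik):
    """Model-averaged predictive mean of X_R given the explanatory variables x_expl.

    models      -- list of (bd, Gamma) pairs: the boundary indices bd_{G_j}(R) within the
                   explanatory block and the estimated coefficient matrix Gamma^j_{R|bd}
    log_marglik -- log marginal likelihoods (20) of the corresponding regressions
    """
    top = max(log_marglik)
    w = np.exp(np.array(log_marglik) - top)
    w = w / w.sum()                                        # the weights w_j sum to one
    preds = [Gamma @ x_expl[np.asarray(bd)] for bd, Gamma in models]
    return sum(wj * pred for wj, pred in zip(w, preds))

# Toy example: two responses, four explanatory variables, two candidate regressions
x_expl = np.array([0.2, -1.0, 0.5, 1.3])
models = [([0, 2], np.array([[0.8, 0.1], [0.0, 0.4]])),    # regression on explanatory vars 0 and 2
          ([1],    np.array([[0.5], [-0.3]]))]             # regression on explanatory var 1 only
log_marglik = [-100.0, -101.2]
print(bma_regression_mean(x_expl, models, log_marglik))
```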
9.1 First real-world example: the call center data
We analyze a large scale dataset originally described in Shen and Huang (2005) and further studied in Huang et al. (2006), Bickel and Levina (2008) and Rajaratnam et al. (2008). The number of calls n_{ij} for i = 1, . . . , 239 days and j = 1, . . . , 102 ten-minute daily time intervals were recorded in 2002 from the call center of a major financial institution. A transformation x_{ij} = (n_{ij} + 0.25)^{1/2} was subsequently employed to make the normality assumption reasonable. The call center data have been used to predict the volume of calls in the second half of the day given the volume of calls from the first half of the day, based on an estimate of Σ, the covariance of calls in a given day. We write

$$x = \begin{pmatrix} x^{(1:51)} \\ x^{(52:102)} \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad (21)$$

where x_i^{(1:51)} = (x_{i1}, . . . , x_{i51})^T ∼ N_{51}(μ_1, Σ_{11}) and x_i^{(52:102)} = (x_{i52}, . . . , x_{i102})^T ∼ N_{51}(μ_2, Σ_{22}) for i = 1, . . . , 239. The linear regression of x_i^{(52:102)} on x_i^{(1:51)} gives the predictor

$$\widehat{x}_i^{(52:102)} = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}\left(x_i^{(1:51)} - \mu_1\right). \qquad (22)$$
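The following numpy sketch makes (22) and the test-set evaluation concrete; the estimator plugged in for Σ could be the sample covariance matrix or any of the model-averaged estimators of Section 7, the data below are synthetic, and the helper names are ours.

```python
import numpy as np

def forecast_second_half(x_first_half, mu, Sigma, split=51):
    """Best linear predictor (22) of the second half of the day given the first half."""
    mu1, mu2 = mu[:split], mu[split:]
    S11 = Sigma[:split, :split]
    S21 = Sigma[split:, :split]
    return mu2 + S21 @ np.linalg.solve(S11, x_first_half - mu1)

def mean_abs_error(x_day, mu, Sigma, split=51):
    """Average absolute forecast error over the second-half intervals of a single day."""
    pred = forecast_second_half(x_day[:split], mu, Sigma, split)
    return np.mean(np.abs(x_day[split:] - pred))

# Synthetic illustration with p = 102 and an AR(1)-type covariance (not the call center data)
rng = np.random.default_rng(0)
p = 102
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
mu = np.zeros(p)
x_day = rng.multivariate_normal(mu, Sigma)
print(mean_abs_error(x_day, mu, Sigma))
```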
The dataset is then divided into a training set x^{(1:205)} (the first 205 days) and a test set x^{(206:239)} (the remaining 34 days). The training data x^{(1:205)} are used to learn the dependency structure among the 102 variables associated with each time interval and to further estimate the covariance matrix Σ. Previous assessments of these data have shown them to be an interesting case study in the benefits of using sparse models to perform prediction. Using a cross-validation approach, Rajaratnam et al. (2008) determine that amongst autoregressive models, the AR(4) model is most supported by the data. While their model selection procedure suggests this model, Rajaratnam et al. (2008) go on to show that a model with superior predictive performance is actually one in which the "response" variables (time periods 52 to 102) are joined by an AR(1) model, a much sparser structure, while the first 51 variables are joined in a larger structure, essentially an AR(14) model (thus utilizing the tapered models concept outlined in Bickel and Levina (2008)). We further bear out this phenomenon of sparsity by employing the mode oriented stochastic search, conducting a fully Bayesian model selection procedure for these data and subsequently performing posterior inference as described in Section 7.

We start by considering graphical models while ignoring the natural partitioning of the observed covariates into a set of responses (X_{52}, . . . , X_{102}) and a set of explanatory variables
(X_1, . . . , X_{51}). We employed the mode oriented stochastic search algorithm to identify high posterior probability decomposable graphs. We set c = 1/3.2, m = 250, c0 = 0.1, q = 0.1 and k = 1000. We repeated the search five times starting from random decomposable graphs. The median graph associated with the 250 decomposable graphs returned by these runs then served as the starting model for five additional searches in the space of unrestricted graphs, which ultimately produced the 250 nondecomposable graphs retained in the final set S. Each of the 250 models returned by this search shares many edges with an AR(4) model, thereby showing that our model selection strategy agrees with the findings of Rajaratnam et al. (2008). We then produce the estimator Σ̂_{5,S} = δ_5(x^{(1:205)}, S) of Σ from (21) through Bayesian model averaging with respect to the graphs in S – see Section 7. We also estimate Σ using the sample covariance matrix Σ̂_{MLE}. The performance of the corresponding best linear predictors (22) is quantified by the average absolute forecast error in the test data x^{(206:239)} – see Figure 3. The error shown for the graphical model search is roughly analogous to that reported for the AR(4) model in Rajaratnam et al. (2008) using their estimator.

We then consider a dramatically different search that addresses the multivariate regression nature of the problem and the desire for sparsity directly. Using the methods described in Section 9, we conducted a multivariate regression search where R = {1, . . . , 51}. To focus the search on sparsity we then made two further restrictions: first, we considered only models in which no more than two terms from R were used as regressors and, second, we restricted attention to those models G ∈ 𝒢^{[V\R]} for which the G_{V\R} subgraph had no clique of size larger than 3. We employed the mode oriented stochastic search in this restricted space with the settings described above. Since a smaller set of models fit these requirements, 18 models were returned in the set S. We then formed δ_5(x^{(1:205)}, S) from this search, and have plotted the results in Figure 3. The errors from this search are considerably smaller than those from the graphical model search described above. This result shows the benefit of joining stochastic search, a multivariate regression specification and sparse models. Employing the multivariate regression search technique significantly increased the speed with which the space could be searched, while the clique size restrictions placed on G_{V\R} returned a set of sparse models whose predictive performance is better than that of the fuller models returned by standard model selection procedures. For comparison purposes we note that the sum of the forecast errors is 74.73 for the MLE, 60.77 for the graph search and 53.37 for the multivariate regression search.
9.2 Second real-world example: estrogen receptor genes
We make use of a breast cancer dataset that is publicly available as supplemental material in Pittman et al. (2004). Gene expression assays were performed on the Human U95Av2 GeneChip. The MAS5.0 signal measures of expression were transformed on a log2 scale and quantile normalized. After the removal of the 67 control probes and of the genes with small variation or with low expression levels, the resulting dataset comprises 7,027 probe sets and 158 samples.
Figure 3: Forecast error associated with Σ̂_{5,S} under the unrestricted graphical model search (Graph) and the multivariate regression search (Mult Reg), in addition to Σ̂_{MLE}.
We employ the multivariate regression stochastic search to determine genes that are potentially involved in the estrogen receptor (ER) pathway. The response variables are taken to be: estrogen receptor itself (two probes: ESR1 and HG3125-HT3301), TFF1, GATA3, FOXA1, XBP1 and XBP1a. ER is a transcription factor that targets TFF1, while GATA3, FOXA1 and XBP1 are known to have strong associations in expression with ER (Lacroix and Leclercq, 2004). In all, there were seven response variables. We then used the multivariate regression version of the mode oriented stochastic search to find potential regressors and also to fit a graphical model amongst these seven probes. As settings we used c = 1/3.2, c0 = 0.01, q = 0.01, m = 250, k = 1,000 and restricted the space to those G ∈ 𝒢^{[V\R]} that had no more than 5 regressors from V\R connected to R.

In such a large space, proper convergence was a key concern. To assess convergence, four instances of the algorithm were run from different, random starting points. Figure 4 plots the score of the top graph found at a given iteration by each instance of the algorithm. Figure 4 shows that each instance of the algorithm eventually finds the same top model, good evidence that the algorithm has converged to the true mode. It should be noted that none of the four instances ran for the entire 1,000 iterations; instead each terminated after between 500 and 750 iterations. Between 20 million and 30 million models were evaluated by each instance, depending on the number of iterations the algorithm ran. In addition to finding the same top model, each instance of the algorithm returned the same 24 models in the set S. The 24 models each included the same 5 regressors (CA12, UBQLN2, PTP4A3, MYB and TFF3) and only differed in the structure of the graph G_{V\R}, as well as in some links between the 5 regressors and the response variables. Three of these genes (CA12, MYB and TFF3) are known to have expression levels consistently positively correlated with estrogen receptor status in breast cancer – higher expression is associated with ER-positive tumors (van't Veer et al., 2002). Table 2 reports the posterior probability of all nonzero edges, broken into groupings A (edges between two response probes) and B (edges between a regressor and a response probe). Many of the edges receiving higher posterior probabilities have an established biological significance. For example, the edges (ESR1, HG3125-HT3301) and (XBP1, XBP1a) link two probes that correspond to the same gene. Moreover, FOXA1 has a direct role in the transcription of TFF1 (Dobra et al., 2004), and in our graphs the edge (TFF1, FOXA1) has a posterior inclusion probability equal to 1.
10 Conclusions
The implicit claim that sampling from the posterior distribution on the space of graphs should reveal the highest posterior probability graphs in a computationally efficient manner holds only if the cumulative posterior probability of these graphs is not close to zero. For high-dimensional applications this condition no longer holds: the number of possible models is extremely large while the available sample size tends to be small. As such, the uncertainty in selecting graphs is very high, and methods that sample from the posterior distribution on the space of graphs spend most of their time around graphs having low posterior probability because their cumulative posterior probability approaches one.
Figure 4: Convergence plot for four instances of the mode oriented stochastic search in the estrogen receptor example, showing the score of the top graph by iteration for each instance.
Table 2: Edge probabilities between gene probes for the estrogen receptor example, by Group A (edges between two response probes) and Group B (edges between a regressor and a response probe).

Group A
Gene 1           Gene 2           Probability
ESR1             HG3125-HT3301    1
ESR1             GATA3            1
ESR1             XBP1             0.031
ESR1             XBP1a            0.041
HG3125-HT3301    TFF1             0.066
HG3125-HT3301    XBP1             0.035
HG3125-HT3301    XBP1a            0.042
TFF1             GATA3            0.264
TFF1             FOXA1            1
TFF1             XBP1             1
TFF1             XBP1a            0.023
GATA3            FOXA1            0.146
GATA3            XBP1             0.047
GATA3            XBP1a            0.035
FOXA1            XBP1             1
FOXA1            XBP1a            0.03
XBP1             XBP1a            1

Group B
Gene 1    Gene 2           Probability
TFF3      TFF1             1
TFF3      GATA3            0.884
TFF3      FOXA1            1
TFF3      XBP1             0.024
CA12      TFF1             1
CA12      GATA3            1
CA12      FOXA1            1
CA12      XBP1             0.025
UBQLN2    ESR1             0.518
UBQLN2    TFF1             0.933
UBQLN2    GATA3            0.822
UBQLN2    FOXA1            1
UBQLN2    XBP1             1
UBQLN2    XBP1a            1
PTP4A3    ESR1             1
PTP4A3    HG3125-HT3301    1
PTP4A3    TFF1             0.085
PTP4A3    XBP1             1
PTP4A3    XBP1a            0.024
MYB       HG3125-HT3301    1
MYB       TFF1             0.966
MYB       GATA3            1
MYB       FOXA1            0.966
MYB       XBP1             0.023
MYB       XBP1a            1
because their cumulative posterior probability approaches one. The implication is that determining the highest posterior probability graphs and sampling from the corresponding posterior distribution become two separate problems that need distinct solutions.
The mode oriented stochastic search algorithm is designed to move efficiently towards high posterior probability graphs without attempting to sample. We have shown that it performs better than Markov chain Monte Carlo model composition and, by analogy, than any graph determination algorithm that actually samples from the posterior distribution on the graph space. We have also shown empirically that it outperforms the shotgun stochastic search algorithm of Jones et al. (2005). We would like to compare the efficiency of our stochastic search with that of the feature-inclusion algorithm of Scott and Carvalho (2008), whose current implementation appears to be restricted to decomposable graphs and relies on a rather different choice of priors. The simulation study of Section 8.1 showed that, when the data are generated from an underlying nondecomposable graph, estimation suffers if only decomposable graphs are used; this motivated our interest in performing general graphical model selection efficiently. The Laplace approximation has proved crucial to this end. While the Laplace approximation has been widely used for over two decades, the particular approximation derived in this paper is novel in that it incorporates the iterative proportional scaling algorithm to quickly find the mode of the integrand.
The computational challenges involved in this study were also substantial. Once a particular issue had been resolved (for instance, finding a practicable approach to computing the marginal likelihood of a model), a new computational challenge would become apparent. In particular, we found a relative dearth of fast clique decomposition algorithms for sparse graphs; such algorithms are required both to compute the Laplace approximation and to run the block Gibbs sampler. The development of an algorithm that can quickly determine the clique decomposition of a given graph based on the cliques of one of its neighbors would be extremely helpful to statisticians working with general graphical models (a small illustration of this idea is sketched at the end of this section).
Furthermore, this project considered the analysis of data that are assumed to come from a single population. In many cases, there is interest in using information from several populations in order to estimate a common dependence structure among covariates. A logical next step of our analysis is to build in a hierarchical, multilevel component. The exact implementation of such an approach would be challenging, in that the marginal likelihoods of models may be difficult to compute. However, an advantage of the Bayesian approach is that posterior inference in these settings would be straightforward and amenable to the addition of other components, such as multiple imputation.
Finally, we have combined the concepts of graphical models and multivariate regressions into a consistent framework, which we believe dramatically increases the number of potential applications of this methodology. We have shown that, in the call center data, the multivariate regression framework reduced the search burden and allowed for a sparse search that ultimately yielded lower predictive error.
In the case of the estrogen receptor example, it reduced a challenging problem to one that was computationally straightforward and still returned the key probes known to form this interaction network.
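As a small illustration of the local clique update mentioned above, consider the addition of a single edge (u, v): any new maximal clique must contain both u and v and therefore lies inside the common neighborhood of u and v, so only that induced subgraph needs to be re-enumerated. The sketch below is a minimal illustration of this idea, assuming the Python networkx package; it is not the implementation used in this paper, and the deletion of an edge admits a similar, slightly more involved local treatment.

    import networkx as nx

    def add_edge_update_cliques(G, cliques, u, v):
        """Update the maximal cliques of G after adding the edge (u, v).

        `cliques` holds the current maximal cliques of G as frozensets.  New
        maximal cliques must contain both u and v, hence live inside the common
        neighborhood of u and v; only that induced subgraph is re-enumerated.
        """
        common = set(G[u]) & set(G[v])
        G.add_edge(u, v)
        H = G.subgraph(common | {u, v})
        new = [frozenset(c) for c in nx.find_cliques(H) if u in c and v in c]
        # Old maximal cliques remain maximal unless absorbed by a new clique.
        kept = [c for c in cliques if not any(c < n for n in new)]
        return kept + new

    # Toy usage: the 4-cycle 1-2-3-4 has maximal cliques {1,2}, {2,3}, {3,4}, {1,4};
    # adding the chord (1, 3) replaces them by the triangles {1,2,3} and {1,3,4}.
    G = nx.cycle_graph([1, 2, 3, 4])
    cliques = [frozenset(c) for c in nx.find_cliques(G)]
    print(sorted(map(sorted, add_edge_update_cliques(G, cliques, 1, 3))))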
C++ and Matlab code implementing the methods described in this paper is available by request from the authors.
Acknowledgements
The work of Alex Lenkoski and Adrian Dobra was supported in part by a seed grant from the Center for Statistics and the Social Sciences, University of Washington. The authors are grateful to Hélène Massam for helpful discussions on an earlier version of the manuscript. They also thank James Scott and Carlos Carvalho for sharing their simulated 25-node dataset, and Jianhua Huang for sharing the call center data.
Appendix
Proof of Theorem 1. We assume c0 = 0, in which case there is a positive probability that no graphs are deleted from the list L. Let G0 be a graph in the initial set of starting graphs. There exists a path G0, G1, ..., Gk, Gk+1 = Gh such that Gj ∈ nbd(Gj−1) for j = 1, ..., k+1. By an induction argument, proving that the mode oriented stochastic search reaches Gh reduces to proving that Gj ∈ S whenever Gj−1 ∈ S. There is a positive probability that Gj−1 is selected for exploration, and once this happens Gj is included in S with probability one. Therefore the mode oriented stochastic search finds Gh.
References
Asci, C. and Piccioni, M. (2007). “Functionally compatible local characteristics for the local specification of priors in graphical models.” Scand. J. Statist., 34, 829–40.
Atay-Kayis, A. and Massam, H. (2005). “A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models.” Biometrika, 92, 317–35.
Barbieri, M. and Berger, J. O. (2004). “Optimal predictive model selection.” Ann. Statist., 32, 870–897.
Barnard, J., McCulloch, R., and Meng, X. (2000). “Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage.” Statist. Sinica, 10, 1281–311.
Berger, J. O. and Molina, G. (2005). “Posterior model probabilities via path-based pairwise priors.” Statistica Neerlandica, 59, 3–15.
Bickel, P. J. and Levina, E. (2008). “Regularized estimation of large covariance matrices.” Ann. Statist., 36, 199–227.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. M.I.T. Press, Cambridge, MA.
Carvalho, C. M., Massam, H., and West, M. (2007). “Simulation of hyper-inverse Wishart distributions in graphical models.” Biometrika, 94, 647–59.
Daniels, M. and Kass, R. (1999). “Nonconjugate Bayesian estimation of covariance matrices.” J. Am. Statist. Assoc., 94, 1254–63.
Dawid, A. P. and Lauritzen, S. L. (1993). “Hyper Markov laws in the statistical analysis of decomposable graphical models.” Ann. Statist., 21, 1272–317.
Dellaportas, P., Giudici, P., and Roberts, G. (2003). “Bayesian inference for nondecomposable graphical Gaussian models.” Sankhya, 65, 43–55.
Dempster, A. P. (1972). “Covariance selection.” Biometrics, 28, 157–75.
Diaconis, P. and Ylvisaker, D. (1979). “Conjugate priors for exponential families.” Ann. Statist., 7, 269–81.
Dobra, A. and Fienberg, S. E. (2000). “Bounds for cell entries in contingency tables given marginal totals and decomposable graphs.” Proc. Natl. Acad. Sci., 97, 11185–11192.
Dobra, A., Hans, C., Jones, B., Nevins, J. R., Yao, G., and West, M. (2004). “Sparse graphical models for exploring gene expression data.” J. Multivariate Anal., 90, 196–212.
Drton, M. and Perlman, M. D. (2004). “Model selection for Gaussian concentration graphs.” Biometrika, 91, 591–602.
Edwards, D. M. (2000). Introduction to Graphical Modelling. New York: Springer.
Giudici, P. and Green, P. J. (1999). “Decomposable graphical Gaussian model determination.” Biometrika, 86, 785–801.
Green, P. J. (1995). “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination.” Biometrika, 82, 711–32.
Hans, C., Dobra, A., and West, M. (2007). “Shotgun stochastic search for ‘large p’ regression.” J. Am. Statist. Assoc., 102, 507–16.
Harville, D. A. (1997). Matrix Algebra From a Statistician’s Perspective. Springer Science.
Huang, J., Liu, N., Pourahmadi, M., and Liu, L. (2006). “Covariance matrix selection and estimation via penalized normal likelihood.” Biometrika, 93, 85–98.
Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). “Experiments in stochastic computation for high-dimensional graphical models.” Statist. Sci., 20, 388–400.
Kass, R. and Raftery, A. E. (1995). “Bayes factors.” J. Am. Statist. Assoc., 90, 773–95.
Knuiman, M. (1978). “Covariance selection.” Suppl. Adv. Appl. Prob., 10, 123–130.
Lacroix, M. and Leclercq, G. (2004). “About GATA3, HNF3A and XBP1, three genes coexpressed with the oestrogen receptor-α gene (ESR1) in breast cancer.” Molecular and Cellular Endocrinology, 219, 1–76.
Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
Leonard, T. and Hsu, J. S. J. (1992). “Bayesian inference for a covariance matrix.” Ann. Statist., 20, 1669–96.
Letac, G. and Massam, H. (2007). “Wishart distributions for decomposable graphs.” Ann. Statist., 35, 1278–323.
Liechty, J. C., Liechty, M. W., and Müller, P. (2004). “Bayesian correlation estimation.” Biometrika, 91, 1–14.
Madigan, D. and Raftery, A. E. (1994). “Model selection and accounting for model uncertainty in graphical models using Occam’s window.” J. Am. Statist. Assoc., 89, 1535–45.
Madigan, D. and York, J. (1995). “Bayesian graphical models for discrete data.” International Statistical Review, 63, 215–232.
Meinshausen, N. and Bühlmann, P. (2006). “High-dimensional graphs and variable selection with the Lasso.” Ann. Statist., 34, 1436–62.
Muirhead, R. J. (2005). Aspects of Multivariate Statistical Theory. John Wiley & Sons.
Piccioni, M. (2000). “Independence structure of natural conjugate densities to exponential families and the Gibbs sampler.” Scand. J. Statist., 27, 111–27.
Pittman, J., Huang, E., Dressman, H., Horng, C. F., Cheng, S. H., Tsou, M. H., Chen, C. M., Bild, A., Iversen, E. S., Huang, A. T., Nevins, J. R., and West, M. (2004). “Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes.” Proc. Natl. Acad. Sci., 101, 8431–8436.
Rajaratnam, B., Massam, H., and Carvalho, C. M. (2008). “Flexible covariance estimation in graphical Gaussian models.” Ann. Statist. To appear.
Roverato, A. (2002). “Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models.” Scand. J. Statist., 29, 391–411.
Scott, J. G. and Berger, J. O. (2006). “An exploration of aspects of Bayesian multiple testing.” J. Statist. Plan. Infer., 136, 2144–2162.
Scott, J. G. and Carvalho, C. M. (2008). “Feature-inclusion stochastic search for Gaussian graphical models.” J. Comput. Graph. Statist. To appear.
Shen, H. and Huang, J. Z. (2005). “Analysis of call center arrival data using singular value decomposition.” Applied Stochastic Models in Business and Industry, 21, 251–263.
Smith, M. and Kohn, R. (2002). “Bayesian parsimonious covariance matrix estimation for longitudinal data.” J. Am. Statist. Assoc., 97, 1141–53.
Speed, T. P. and Kiiveri, H. T. (1986). “Gaussian Markov distributions over finite graphs.” Ann. Statist., 14, 138–150.
Tierney, L. and Kadane, J. (1986). “Accurate approximations for posterior moments and marginal densities.” J. Am. Statist. Assoc., 81, 82–86.
van’t Veer, L., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., van der Kooy, K., Marton, M., Witteveen, A., Schreiber, G., Kerkhoven, R., Roberts, C., Linsley, P., Bernards, R., and Friend, S. (2002). “Gene expression profiling predicts clinical outcome of breast cancer.” Nature, 415, 484–485.
Wermuth, N. (1976). “Analogies between multiplicative models in contingency tables and covariance selection.” Biometrics, 32, 95–108.
Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley & Sons.
Wong, F., Carter, C. K., and Kohn, R. (2003). “Efficient estimation of covariance selection models.” Biometrika, 90, 809–30.
Yang, R. and Berger, J. O. (1994). “Estimation of a covariance matrix using the reference prior.” Ann. Statist., 22, 1195–211.
Yuan, M. and Lin, Y. (2007). “Model selection and estimation in the Gaussian graphical model.” Biometrika, 94, 19–35.