SPARSE ESTIMATION IN ISING MODEL VIA PENALIZED MONTE CARLO METHODS
BŁAŻEJ MIASOJEDOW AND WOJCIECH REJCHEL

Abstract. We consider a problem of model selection in high-dimensional binary Markov random fields. The usefulness of the Ising model in studying systems of complex interactions has been confirmed in many papers. The main drawback of this model is the intractable norming constant that makes estimation of parameters very challenging. In the paper we propose a Lasso penalized version of the Monte Carlo maximum likelihood method. We prove that our algorithm, under mild regularity conditions, recognizes the true dependence structure of the graph with high probability. The efficiency of the proposed method is also investigated via simulation studies.

Key words and phrases. Ising model, Monte Carlo Markov chain, Markov random field, model selection, Lasso penalty.
1. Introduction

A Markov random field is an undirected graph (V, E), where V = {1, . . . , d} is a set of vertices and E ⊂ V × V is a set of edges. The structure of this graph describes conditional independence among subsets of a random vector Y = (Y(1), . . . , Y(d)), where a random variable Y(s) is associated with a vertex s ∈ V. Finding interactions between random variables is a central element of many branches of science, for example biology, genetics, physics or social network analysis. The goal of the current paper is to recognize the structure of a graph on the basis of a sample consisting of n independent graphs. We consider the high-dimensional setting, i.e. the number of vertices d can be comparable to or larger than the sample size n. This setting is motivated by contemporary applications of Markov random fields in the above-mentioned fields, for instance gene microarray data. The Ising model (Ising, 1925) is an important example of a mathematical model that is often used to explain relations between discrete random variables. In the literature one can find many papers that argue for its effectiveness in recognizing the structure of a graph (Ravikumar et al., 2010; Höfling and Tibshirani, 2009; Guo et al., 2010; Xue et al., 2012; Jalali et al., 2011). This model also plays a key role in our paper. On the other hand, the Ising model is an example of an intractable-constant model, that is, the joint distribution of Y is known only up to a norming constant and this constant cannot be calculated in practice.

Thus, there are two main difficulties in the considered model. The first one is the high-dimensionality of the problem. The second one is the intractable norming constant. To overcome the first obstacle we apply the well-known Lasso method (Tibshirani, 1996). The properties of this method in model selection are deeply studied in many papers that mainly investigate linear models or generalized linear models (Bickel et al., 2009; Bühlmann and van de Geer, 2011; Huang and Zhang, 2012; van de Geer, 2008; Ye and Zhang, 2010; Zhao and Yu, 2006; Zhou, 2009). However, it is not difficult to find papers that describe properties of Lasso estimators in more complex models, for instance Markov random fields (Bühlmann and van de Geer,
2011; Ravikumar et al., 2010; Höfling and Tibshirani, 2009; Guo et al., 2010; Xue et al., 2012) that are considered in this paper.

There are many approaches to overcoming the second obstacle, that is, the intractable norming constant. For instance, Ravikumar et al. (2010) propose to solve d regularized logistic regression problems. This idea is based on the fact that the norming constant cancels if one considers the conditional distribution instead of the joint distribution in the Ising model. This simple fact is at the heart of the pseudolikelihood approach (Besag, 1974), which replaces the likelihood (that contains the norming constant) by the product of conditionals (that do not contain the norming constant). This idea is widely applied in the literature (Höfling and Tibshirani, 2009; Guo et al., 2010; Xue et al., 2012; Jalali et al., 2011) to study model selection properties of high-dimensional Ising models. However, this approach works well only if the pseudolikelihood is a good approximation of the likelihood. In general, this depends on the true structure of the graph. Namely, if the structure of the graph is sufficiently simple (examples of different structures can be found in section 5), then the product of conditionals should be close to the joint distribution. However, in practice this knowledge is unavailable. Therefore, in the current paper we propose another approach to the norming constant problem that relates to Markov chain Monte Carlo (MCMC) methods. Namely, the norming constant is approximated using the importance sampling technique. The advantage of this method over the pseudolikelihood is that it does not depend on the unknown complexity of the estimated graph. Roughly speaking, the size of the sample used in importance sampling should be sufficiently large to obtain a good approximation of the likelihood. The MCMC method is a well-known approach to overcoming the problem with the intractable norming constant in classical (low-dimensional) estimation of graphs. For instance, its properties are investigated in the influential papers of Geyer and Thompson (1992) and Geyer (1994). However, to the best of our knowledge, such methods have not been studied (theoretically or practically) in high-dimensional Markov random fields previously. Thus, the main contribution of the paper is investigating their properties in model selection in the high-dimensional scenario and comparing them with the existing methods mentioned above. Model selection for undirected graphical models means finding the existing edges in a "sparse" graph, that is, a graph having relatively few edges (compared to the total number of possible edges d(d−1)/2 and the sample size n).

The paper is organized as follows: in the next section we describe the Ising model and our approach to the problem, which relates to minimization of the penalized MCMC approximation of the likelihood. The literature concerning this topic is also discussed. In section 3 we state the main theoretical results. Details of practical implementation are given in section 4, while the results of simulation studies are presented in section 5. The conclusions can be found in section 6. Finally, the proofs are postponed to Appendices A and B.
2. Model description and related works

In this section we introduce the Ising model and the proposed method. It also contains a review of the literature relating to this problem.

2.1. Ising model and undirected graphs. Let (V, E) be an undirected graph that consists of a set of vertices V and a set of edges E. The random vector Y = (Y(1), Y(2), . . . , Y(d)), which takes values in Y, is associated with this graph. In the paper we consider a special case
of the Ising model, where Y(s) ∈ {−1, 1} and the joint distribution of Y is given by the formula

(1)    p(y|θ⋆) = \frac{1}{C(θ⋆)} \exp\Big( \sum_{r<s} θ⋆_{rs}\, y(r)y(s) \Big).

Corollary 1. Let T̂ = {(r, s) : |θ̂_rs| > δ}. If θ⋆_min > 2δ and δ ≥ R^m_n, then

P(T̂ = T) ≥ 1 − 4ε.
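To make the role of the norming constant C(θ⋆) in (1) concrete, the following minimal Python sketch (our own illustration, not code from the paper; representing θ as a symmetric matrix is our convention) evaluates the unnormalized density and computes C(θ) by brute force for a very small d. The sum over all 2^d configurations is exactly what becomes intractable for realistic graph sizes.

```python
import itertools
import numpy as np

def unnormalized(y, theta):
    """exp( sum_{r<s} theta[r, s] * y[r] * y[s] ) for a configuration y in {-1, +1}^d."""
    d = len(y)
    s = sum(theta[r, t] * y[r] * y[t] for r in range(d) for t in range(r + 1, d))
    return float(np.exp(s))

def norming_constant(theta):
    """Brute-force C(theta): a sum over all 2^d configurations, feasible only for very small d."""
    d = theta.shape[0]
    return sum(unnormalized(y, theta) for y in itertools.product([-1, 1], repeat=d))

# Example with d = 4 vertices and a single interaction on the edge (0, 1).
theta = np.zeros((4, 4))
theta[0, 1] = theta[1, 0] = 1.0
y = (1, 1, -1, 1)
print(unnormalized(y, theta) / norming_constant(theta))   # p(y | theta)
```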
The main results of the paper describe properties of estimators that are obtained by minimization of the MCMC approximation (6) with the Lasso penalty. Theorem 1 states that the estimation error of the Lasso estimator can be controlled. Roughly speaking, the estimation error is small if the initial sample size and the MCMC sample size are large enough, the model is sparse and the cone invertibility factor F(ξ, T) is not too close to zero. The influence of the model parameters (n, d, d̄_0) as well as the Monte Carlo parameters (m, β_1, β_2, M) on the results is explicitly stated. It is worth emphasizing that our results work in the high-dimensional scenario, i.e. the number of vertices d can be (much) greater than the sample size n, provided that the model is sparse. Indeed, the condition (14) is satisfied even if d ∼ O(exp(n^{c_1})), d̄_0 ∼ O(n^{c_2}) and c_1 + 2c_2 < 1. The condition (15), which relates to the MCMC sample size, is also reasonable. The number β_1 depends on the initial and stationary distributions. In general, its relation to the number of vertices is exponential. However, in (15)
it appears only under the logarithm. Moreover, β_1 can also be reduced using the so-called burn-in time, i.e. the beginning of the Markov chain trajectory is discarded. Next, the number β_2 is related to the spectral gap of the Markov chain. Under mild conditions the inverse of β_2 depends polynomially on d, and under strong regularity conditions it can be reduced to O(d log d) as in Mossel and Sly (2013). Finally, there is also the number M in the condition (15) that relates to the distance between the stationary distribution h(·) and p(·|θ⋆). Stating the explicit relation between M and the model seems to be difficult. However, the algorithm that we propose to calculate θ̂ is designed in such a way as to minimize the impact of M on the results. The detailed implementation of the algorithm is given in section 4.

The estimation error of the Lasso estimator in Theorem 1 is measured in the l∞-norm. Similarly to Huang et al. (2013), it can be extended to the general lq-norm, q ≥ 1. We omit it, because (16) is sufficient to obtain the second main result of the paper (Corollary 1). It states that the thresholded Lasso estimator is model selection consistent if, additionally to (14) and (15), the nonzero parameters are not too small and the threshold is appropriately chosen.

We have already mentioned that there are many approaches to the high-dimensional Ising model. A significant part of them refers to the pseudolikelihood method. Now we compare the conditions that are sufficient to prove model selection consistency in the current paper to those based on the pseudolikelihood approximation in Ravikumar et al. (2010); Guo et al. (2010); Xue et al. (2012); Jalali et al. (2011). If we simplify the regularity conditions in Theorem 1 and Corollary 1 and forget about the Monte Carlo parameters in (15), then we have:
(a) the cone invertibility factor condition is satisfied,
(b) the sample size should be sufficiently large, that is n > d̄²_0 log d,
(c) the nonzero parameters should be sufficiently large, that is θ⋆_min > √(log d / n).
In Ravikumar et al. (2010, Corollary 1) and Guo et al. (2010, Theorem 2) one needs the stronger irrepresentable condition in (a), d̄_0 in the third power in (b) and an additional factor √d̄_0 in (c). To be precise, in Ravikumar et al. (2010) one considers d separate procedures, so their conditions do not depend on d̄_0 but on the maximum neighbourhood size, which is smaller than d̄_0. In Xue et al. (2012, Corollary 3.1 (2)) model selection consistency of Lasso estimators is proved under stronger conditions than ours. Namely, they are similar to Ravikumar et al. (2010) and Guo et al. (2010), but d̄_0 is reduced in the condition (c). Moreover, they also consider the pseudolikelihood approximation with the SCAD penalty and show that the condition (a) seems to be superfluous in this case, see Xue et al. (2012, Corollary 3.1 (1)). However, using the SCAD penalty they minimize a nonconvex function to obtain an estimator, so they have to prove that the computed (local) minimizer is the desired theoretical local solution. Their approach can be viewed as a sequence of weighted Lasso problems, so they need the auxiliary Lasso procedures to behave well. Therefore, the irrepresentable condition is assumed (Xue et al., 2012, Corollary 3.2). The conditions sufficient for model selection consistency that are stated in Jalali et al. (2011, Theorem 2) are comparable to ours but also more restrictive. Instead of the condition (a) they consider a similar requirement called the restricted strong convexity condition. It is complemented by the restricted strong smoothness condition. Moreover, in the lower bound in the condition (c) they need an additional factor √d̄_0 as well as an upper bound for θ⋆_min.

In the proof of Theorem 1 we use methods that are well known in the study of properties of Lasso estimators, as well as some new argumentation. The main novelty (and difficulty) is the use of the Monte Carlo sample that contains dependent vectors. The first part of our argumentation consists of two steps:
- the first step can be viewed as "deterministic". We apply methods that were developed in Ye and Zhang (2010); Huang and Zhang (2012); Huang et al. (2013) and strongly exploit convexity of the considered problem. These auxiliary results are stated in Lemma 1 and Lemma 2 in Appendix A,
- the second step is "stochastic". We state a probabilistic inequality that bounds the l∞-norm of the derivative of the MCMC approximation (6) at θ⋆, that is

(18)    ∇ℓ^m_n(θ⋆) = −\frac{1}{n}\sum_{i=1}^n J(Y_i) + \frac{\sum_{k=1}^m w_k(θ⋆)\, J(Y^k)}{\sum_{k=1}^m w_k(θ⋆)},

where

w_k(θ) = \frac{\exp[θ′J(Y^k)]}{h(Y^k)},   k = 1, . . . , m.

Notice that (18) contains the independent random variables Y_1, . . . , Y_n from the initial sample and the Markov chain Y^1, . . . , Y^m from the MC sample. Therefore, to obtain the exponential inequalities for the l∞-norm of (18), which are given in Lemma 3 and Corollary 2 in Appendix A, we apply the MCMC theory. In particular, we frequently use the following Hoeffding's inequality for Markov chains (Miasojedow, 2014, Theorem 1.1).

Theorem 2. Let Y^1, . . . , Y^m be a Gibbs sampler defined in Subsection 2.3. Moreover, let g : Y → R be a bounded function and μ = E_{Y∼h} g(Y) be the stationary mean value. Then for every t > 0, m ∈ N and an initial distribution q

P\left( \left| \frac{1}{m}\sum_{k=1}^m g(Y^k) − μ \right| > t \right) ≤ 2β_1 \exp\left( −\frac{β_2 m t^2}{|g|^2_∞} \right),

where |g|_∞ = sup_{y∈Y} |g(y)| and β_1, β_2 are defined in (10).
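For illustration, here is a short sketch (ours, not the authors' implementation) of how the importance-sampling gradient (18), evaluated at an arbitrary parameter θ, can be computed. It assumes J(y) is the vector of pairwise products (y(r)y(s))_{r<s} and that log h(Y^k) is available for the instrumental distribution h; the weights are handled on the log scale, and the common factor removed for stability cancels in the self-normalized ratio.

```python
import numpy as np

def suff_stat(y):
    """J(y) = (y(r) y(s))_{r<s} as a flat vector."""
    idx = np.triu_indices(len(y), k=1)
    return np.outer(y, y)[idx]

def grad_mcmc(theta, Y_data, Y_mc, log_h):
    """Importance-sampling estimate of the gradient (18) at a parameter vector theta.

    Y_data : (n, d) array with the observed sample Y_1, ..., Y_n,
    Y_mc   : (m, d) array with the Markov chain sample Y^1, ..., Y^m,
    log_h  : (m,) array with log h(Y^k) for the instrumental distribution h.
    """
    J_data = np.array([suff_stat(y) for y in Y_data])     # (n, d(d-1)/2)
    J_mc = np.array([suff_stat(y) for y in Y_mc])         # (m, d(d-1)/2)
    log_w = J_mc @ theta - log_h                          # log w_k(theta)
    w = np.exp(log_w - log_w.max())                       # stabilized; the factor cancels below
    return -J_data.mean(axis=0) + (w[:, None] * J_mc).sum(axis=0) / w.sum()
```

The second summand is exactly the self-normalized term of (18), which is why any multiplicative constant in the weights (including an unknown norming constant of h) drops out.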
The second part of our argumentation relates to the fact that the Hessian of the MCMC approximation at θ⋆, that is

(19)    ∇²ℓ^m_n(θ⋆) = \frac{\sum_{k=1}^m w_k(θ⋆)\, J(Y^k)^{⊗2}}{\sum_{k=1}^m w_k(θ⋆)} − \left( \frac{\sum_{k=1}^m w_k(θ⋆)\, J(Y^k)}{\sum_{k=1}^m w_k(θ⋆)} \right)^{⊗2},

is random. Similar problems were considered in several papers investigating properties of Lasso estimators in the high-dimensional Ising model (Ravikumar et al., 2010; Guo et al., 2010; Xue et al., 2012) or the Cox model (Huang et al., 2013). We overcome this difficulty by bounding from below the cone invertibility factor F̄(ξ, T) by the nonrandom F(ξ, T). Therefore, we need to prove that the Hessians of (11) and (6) are close. This is obtained, again using the MCMC theory, in Lemma 4 and Corollary 3 in Appendix A. Finally, the proofs of Theorem 1 and Corollary 1 are stated in Appendix B.

4. Details of implementation

In this section we describe in detail the practical implementation of the algorithm analyzed in the previous section. The solution of the problem (8) depends on the choice of λ in the penalty term and the parameter ψ in the instrumental distribution. Finding "optimal" λ and ψ is difficult in practice. To overcome this problem we compute a sequence of minimizers
(θ̂_i)_i such that θ̂_i corresponds to λ = λ_i and the sequence (λ_i)_i is decreasing. We start with the greatest value λ_0 for which the entire vector θ̂ is zero. For each value of λ_i, i > 0, we set ψ = θ̂_{i−1} and use the MCMC approximation (8) with Y^1, . . . , Y^m given by a Gibbs sampler with the stationary distribution p(y|θ̂_{i−1}). This scheme exploits warm starts and leads to a more stable algorithm. The important property of this approach is that the chosen ψ = θ̂_{i−1} is usually close to θ̂_i, because the differences between consecutive λ_i's are small. In our studies the final estimator θ̂ is the element of the sequence (θ̂_i)_i that recognizes the true model in the best way (in general, we would use cross-validation or one of the information criteria). We expect that the final estimator θ̂ = θ̂_i is close to θ⋆; therefore the chosen ψ = θ̂_{i−1} should also be similar to θ⋆, which makes M in (10) close to one. Finally, notice that conditionally on the previous step our algorithm fits into the framework described in subsection 2.1. The function ℓ^m_n(θ) is convex, so we can use proximal gradient algorithms to compute θ̂_i as a solution of (8) for a given λ_i and ψ. Precisely, we use the FISTA algorithm with backtracking from Beck and Teboulle (2009). The whole procedure is summarized in Algorithm 1.

Algorithm 1 MCMC Lasso for Ising model
  Let λ_0 > λ_1 > · · · > λ_K and ψ = 0.
  for i = 1 to K do
    Simulate Y^1, . . . , Y^m using a Gibbs sampler with the stationary distribution p(y|ψ).
    Run the FISTA algorithm to compute θ̂_i as arg min_θ { ℓ^m_n(θ) + λ_i |θ|_1 }.
    Set ψ = θ̂_i.
  end for
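A compact sketch of Algorithm 1 follows, under simplifying assumptions that are ours and not the authors': plain proximal gradient (ISTA) steps with soft-thresholding are used instead of FISTA with backtracking, the step size, chain length and iteration counts are illustrative, and the instrumental density p(·|ψ) is used only up to its norming constant, which cancels in the self-normalized importance weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def suff_stat(y, idx):
    """J(y) = (y(r) y(s))_{r<s} as a flat vector; idx = np.triu_indices(d, k=1)."""
    return np.outer(y, y)[idx]

def gibbs_sample(theta_mat, m, burn_in=500):
    """Single-site Gibbs sampler for the Ising model with symmetric interaction matrix theta_mat."""
    d = theta_mat.shape[0]
    y = rng.choice([-1.0, 1.0], size=d)
    out = np.empty((m, d))
    for it in range(burn_in + m):
        for r in range(d):
            field = theta_mat[r] @ y - theta_mat[r, r] * y[r]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))   # P(Y(r) = +1 | rest)
            y[r] = 1.0 if rng.random() < p_plus else -1.0
        if it >= burn_in:
            out[it - burn_in] = y
    return out

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def mcmc_lasso_path(Y_data, lambdas, m=2000, step=0.1, n_iter=200):
    """Warm-started path of Lasso penalized MCMC likelihood estimators (in the spirit of Algorithm 1)."""
    n, d = Y_data.shape
    idx = np.triu_indices(d, k=1)
    J_data_mean = np.mean([suff_stat(y, idx) for y in Y_data], axis=0)
    psi = np.zeros(idx[0].size)              # previous estimate; the path starts from zero
    path = []
    for lam in lambdas:
        psi_mat = np.zeros((d, d))
        psi_mat[idx] = psi
        psi_mat = psi_mat + psi_mat.T
        Y_mc = gibbs_sample(psi_mat, m)                    # Y^1, ..., Y^m from p(.|psi)
        J_mc = np.array([suff_stat(y, idx) for y in Y_mc])
        log_h = J_mc @ psi       # unnormalized log p(.|psi); C(psi) cancels in the weights
        theta = psi.copy()
        for _ in range(n_iter):                            # ISTA steps on the MCMC approximation
            log_w = J_mc @ theta - log_h
            w = np.exp(log_w - log_w.max())                # stabilized importance weights
            grad = -J_data_mean + (w[:, None] * J_mc).sum(axis=0) / w.sum()
            theta = soft_threshold(theta - step * grad, step * lam)
        psi = theta
        path.append(theta.copy())
    return path
```

The FISTA variant used in the paper adds Nesterov momentum and a backtracking line search; the decreasing sequence of λ's and the warm starts are the same as above.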
The computational cost of a single step of the Gibbs sampler is dominated by computing the probability (9), which is of the order O(r), where r is the maximal degree of the vertices in the graph related to the stationary distribution. In the paper we focus on estimation of sparse graphs, so the proposed λ_i's have to be sufficiently large to make the θ̂_i's sparse. Therefore, the degree r is rather small and the computational cost of generating Y^1, . . . , Y^m is of order O(m). Next, we need to compute ℓ^m_n(θ) and its gradient. For an arbitrary Markov chain the cost of these computations is of the order O(d²m). But when we use single-site updates as in the Gibbs sampler, we can reduce it to O(dm) by remembering which coordinate of Y^k was updated. Indeed, if we know that only the r-th coordinate was updated in the step Y^k → Y^{k+1}, then

θ′J(Y^{k+1}) = θ′J(Y^k) + \sum_{s: s<r} θ_{sr} [Y^{k+1}(s)Y^{k+1}(r) − Y^k(s)Y^k(r)] + \sum_{s: s>r} θ_{rs} [Y^{k+1}(s)Y^{k+1}(r) − Y^k(s)Y^k(r)].
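A minimal illustration of this bookkeeping (our own sketch, not the authors' code): after a single-site update of coordinate r, the inner product θ′J(Y^{k+1}) is refreshed in O(d) time instead of being recomputed from all d(d−1)/2 terms.

```python
def updated_inner_product(theta_mat, y_old, y_new, r, old_value):
    """Refresh theta' J(y) after a Gibbs update of coordinate r.

    theta_mat : symmetric interaction matrix with zero diagonal,
    old_value : theta' J(y_old); only terms containing coordinate r can change.
    """
    delta = 0.0
    for s in range(len(y_old)):
        if s != r:
            # y_new[s] == y_old[s] for s != r, so each term reduces to
            # theta_mat[r, s] * y_old[s] * (y_new[r] - y_old[r])
            delta += theta_mat[r, s] * (y_new[s] * y_new[r] - y_old[s] * y_old[r])
    return old_value + delta
```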
Finally, it is well known that FISTA (Beck and Teboulle, 2009) achieves accuracy ε in O(ε^{−1/2}) steps. So, the total cost of computing the solution for a single λ_i with precision ε is of order O(ε^{−1/2} md). A further reduction of the cost can be obtained by using sparsity of θ̂_{i−1} in computing ℓ^m_n(θ) and its gradient, and by introducing active variables inside the FISTA algorithm.

In section 5 we compare estimators based on the MCMC approximation with the one based on the pseudolikelihood (both with the Lasso penalty). The latter was introduced in Höfling and Tibshirani (2009), and we use the proximal gradient algorithm to compute it. It is worth noticing that the cost of computing the pseudolikelihood approximation and its gradient is of order O(d²n), because the pseudolikelihood is not a function of the sufficient statistic.
               Pseudolikelihood                    MCMC
  d      n    True Model   c1     c2      True Model   c1     c2
  20     50     0.1500    3.5   0.00        0.4475    1.5   0.00
  20    100     0.1425    3.0   0.00        0.5075    2.0   0.00
  20    200     0.1840    1.0   0.16        0.5440    3.0   0.00
  20    500     0.2325    2.0   0.12        0.5550    4.0   0.00
  20   1000     0.2550    2.5   0.12        0.5525    4.0   0.00
  50     50     0.1500    3.5   0.00        0.4625    1.5   0.00
  50    100     0.1400    3.5   0.00        0.5025    2.0   0.00
  50    200     0.1525    3.5   0.00        0.5250    2.5   0.00
  50    500     0.2475    1.5   0.16        0.5475    3.5   0.00
  50   1000     0.2500    2.5   0.08        0.5420    4.0   0.00

               Pseudolikelihood                    MCMC
  d      n    True Model   c1             True Model   c1
  20     50     0.1500    3.5               0.4475    1.5
  20    100     0.1425    3.0               0.5075    2.0
  20    200     0.1440    3.5               0.5440    3.0
  20    500     0.1900    2.5               0.5550    4.0
  20   1000     0.2450    3.0               0.5525    4.0
  50     50     0.1500    3.5               0.4625    1.5
  50    100     0.1400    3.5               0.5025    2.0
  50    200     0.1525    3.5               0.5250    2.5
  50    500     0.1625    2.5               0.5475    3.5
  50   1000     0.2340    3.0               0.5420    4.0

Table 1. The performance of the pseudolikelihood and the MCMC approximation with the Lasso (bottom) and the thresholded Lasso (top) for M1 with ϑ = 1.

5. Simulated data

To illustrate the performance of the proposed method we simulate data sets in three scenarios:

M1 The first 5 vertices are correlated, while the remaining vertices are independent: θ⋆_rs = ±ϑ for r < s and s = 2, 3, 4, 5, other θ⋆_rs = 0. We consider two cases, ϑ = 1 and ϑ = 2, and the signs are chosen randomly. Thus, the model dimension in this problem is 10.
M2 The first 10 vertices have the "chain structure", and the rest are independent: θ⋆_{r−1,r} = ±1 for r ≤ 10. Again the signs are chosen randomly and the model dimension is 10.
M3 There are two independent blocks of correlated vertices (each of length 4): θ⋆_rs = ±2 for r < s, s = 2, 3, 4 and for 5 ≤ r < s, s = 6, 7, 8. The signs are chosen randomly and the model dimension is 12.
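For concreteness, here is a sketch (ours, not the authors' code) of how symmetric parameter matrices θ⋆ matching the three scenarios can be generated; the 0-based vertex labels and the reading of the chain indexing in M2 are our assumptions.

```python
import numpy as np

def make_theta(d, model, vartheta=1.0, seed=0):
    """Symmetric d x d matrix theta* for scenario 'M1', 'M2' or 'M3' (0-based vertex labels)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((d, d))

    def put(r, s, value):
        sign = rng.choice([-1.0, 1.0])                 # random sign, as in the paper
        theta[r, s] = theta[s, r] = sign * value

    if model == "M1":       # first 5 vertices fully connected: 10 edges of size vartheta
        for s in range(1, 5):
            for r in range(s):
                put(r, s, vartheta)
    elif model == "M2":     # chain structure over the first vertices: 10 edges of size 1
        for r in range(10):
            put(r, r + 1, 1.0)
    elif model == "M3":     # two fully connected blocks of 4 vertices: 12 edges of size 2
        for block in ([0, 1, 2, 3], [4, 5, 6, 7]):
            for j, s in enumerate(block):
                for r in block[:j]:
                    put(r, s, 2.0)
    return theta
```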
               Pseudolikelihood                    MCMC
  d      n    True Model   c1     c2      True Model   c1     c2
  20     50     0.5075    3.5   0.00        0.5425    2.0   0.00
  20    100     0.4950    3.5   0.00        0.5475    2.0   0.00
  20    200     0.5200    1.0   0.20        0.5475    2.0   0.00
  20    500     0.5325    3.0   0.04        0.5800    1.0   0.12
  20   1000     0.5700    3.0   0.00        0.6500    1.0   0.16
  50     50     0.5050    3.5   0.00        0.5400    2.0   0.00
  50    100     0.4825    3.5   0.00        0.5450    2.0   0.00
  50    200     0.4625    1.0   0.20        0.5450    2.0   0.00
  50    500     0.5300    2.5   0.04        0.5550    2.5   0.00
  50   1000     0.5263    3.0   0.00        0.6404    1.0   0.20

               Pseudolikelihood                    MCMC
  d      n    True Model   c1             True Model   c1
  20     50     0.5075    3.5               0.5425    2.0
  20    100     0.4950    3.5               0.5475    2.0
  20    200     0.4625    3.5               0.5475    2.0
  20    500     0.5175    3.0               0.5500    2.0
  20   1000     0.5700    3.0               0.5750    2.0
  50     50     0.5050    3.5               0.5400    2.0
  50    100     0.4825    3.5               0.5450    2.0
  50    200     0.4600    3.5               0.5450    2.0
  50    500     0.5125    4.0               0.5550    2.5
  50   1000     0.5263    3.0               0.5526    2.5

Table 2. The performance of the pseudolikelihood and the MCMC approximation with the Lasso (bottom) and the thresholded Lasso (top) for M1 with ϑ = 2.
The model M1 corresponds to the case where the dependence structure is relatively complex and the approximation by the pseudolikelihood can behave badly. The model M2 has a chain structure, so it is known that the pseudolikelihood approximation works well. Finally, the last scenario is a compromise between the first two models. For each model we draw 20 configurations of signs, so we get 20 vectors θ⋆ for each model. We consider the following cases: d = 20, 50 and n = 50, 100, 200, 500, 1000. For each configuration of the model, the number of vertices d and the number of observations n we sample 20 replications of data sets. For sampling a data set we use the final configuration of an independent Gibbs sampler of length 10^6. Based on preliminary runs we set m = 10^5. Our choice of λ and the threshold is related to Theorem 1 and Corollary 1. For the models M1 and M3 we run the algorithm with λ = c_1 √(log(d(d−1))/n), with c_1 ranging from 1 to 4 in steps of 0.25. In the model M2 this penalty was too small, so we increase the range of c_1 to 8. For the thresholded Lasso we use a threshold of the form c_2 √(log(d(d−1))/n), with c_2 ranging from 0 to 0.16 in steps of 0.04. For each c_1 for the Lasso and each pair of parameters (c_1, c_2) for the thresholded Lasso we compute the frequency of finding exactly the true model, and we choose the pair that maximizes it.
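A small sketch (ours) of the evaluation criterion just described: for each constant c_1 (and c_2 for the thresholded Lasso) check whether the thresholded support equals the true one and average over replications; the data layout of the supplied estimates is an assumption made for illustration.

```python
import numpy as np

def exact_recovery(theta_hat, theta_star, delta):
    """1 if {(r, s): |theta_hat_rs| > delta} coincides with the true support of theta_star."""
    return int(np.array_equal(np.abs(theta_hat) > delta, theta_star != 0))

def best_recovery_frequency(estimates, theta_star, d, n, c2_grid=(0.0,)):
    """estimates[c1] is a list over replications of estimated vectors (flattened upper triangles).

    Returns the best frequency of exact support recovery over the grid of (c1, c2)."""
    best = 0.0
    for c1, replications in estimates.items():
        for c2 in c2_grid:
            delta = c2 * np.sqrt(np.log(d * (d - 1)) / n)
            freq = np.mean([exact_recovery(th, theta_star, delta) for th in replications])
            best = max(best, freq)
    return best
```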
               Pseudolikelihood                    MCMC
  d      n    True Model   c1     c2      True Model   c1     c2
  20     50     0.3725    1.0   0.12        0.1825    1.0   0.12
  20    100     0.9050    1.0   0.24        0.7300    1.0   0.24
  20    200     1.0000    2.5   0.20        0.9700    1.0   0.36
  20    500     1.0000    4.0   0.16        1.0000    4.0   0.32
  20   1000     1.0000    3.5   0.24        1.0000    4.0   0.36
  50     50     0.2000    2.5   0.00        0.1175    1.0   0.08
  50    100     0.8300    2.0   0.08        0.6125    1.0   0.20
  50    200     0.9975    1.0   0.28        0.9300    1.0   0.32
  50    500     1.0000    4.0   0.16        0.9950    3.0   0.36
  50   1000     1.0000    4.0   0.20        0.9950    4.5   0.08

               Pseudolikelihood                    MCMC
  d      n    True Model   c1             True Model   c1
  20     50     0.2300    2.5               0.0150    2.0
  20    100     0.7375    4.0               0.0975    2.5
  20    200     0.7800    4.0               0.4350    4.0
  20    500     0.9735    6.0               0.9225    6.0
  20   1000     0.9980    8.0               0.9980    8.0
  50     50     0.2000    2.5               0.0250    1.5
  50    100     0.6975    3.5               0.0700    2.5
  50    200     0.8825    4.0               0.3250    3.5
  50    500     0.9920    6.0               0.7280    5.5
  50   1000     1.0000    7.5               0.9700    7.5

Table 3. The performance of the pseudolikelihood and the MCMC approximation with the Lasso (bottom) and the thresholded Lasso (top) for M2.
Next, we proceed in the same way with the pseudolikelihood instead of the MCMC approximation.

Results for the first model are summarized in Table 1 and Table 2 for ϑ = 1 and ϑ = 2, respectively. We observe that in this model the MCMC approximation outperforms the pseudolikelihood approximation. The dominance of MCMC is especially evident for ϑ = 1, but in the case ϑ = 2 it is also significant. Moreover, looking at Table 1 we notice that thresholding does not improve the quality of the procedures in model selection. However, for ϑ = 2 and large n an improvement is observed. In Table 3 we present results for the second model. The dependence structure is simple, so the pseudolikelihood approximation is accurate. Therefore, model selection of estimators based on the pseudolikelihood is very good. Nevertheless, the MCMC approximation with the Lasso (for large n) and with the thresholded Lasso behaves similarly to its pseudolikelihood analogue. Besides, thresholded Lasso estimators for both methods work better than Lasso estimators (especially for small n). Results for the third model are presented in Table 4. For large sample sizes both methods work well, but the MCMC approximation performs better for small sample sizes. The improvement related to thresholding is also observed.
               Pseudolikelihood                    MCMC
  d      n    True Model   c1     c2      True Model   c1     c2
  20     50     0.6850    3.5   0.00        0.7525    2.0   0.00
  20    100     0.8875    2.0   0.08        0.7525    2.0   0.00
  20    200     0.9975    3.0   0.04        0.9800    1.0   0.12
  20    500     1.0000    3.5   0.00        1.0000    2.0   0.00
  20   1000     1.0000    3.5   0.00        1.0000    2.5   0.00
  50     50     0.7000    3.0   0.00        0.7500    1.5   0.04
  50    100     0.7400    1.5   0.12        0.7525    2.0   0.00
  50    200     0.9875    2.5   0.08        0.8450    1.0   0.12
  50    500     1.0000    4.0   0.00        1.0000    1.0   0.24
  50   1000     1.0000    3.5   0.00        1.0000    2.0   0.16

               Pseudolikelihood                    MCMC
  d      n    True Model   c1             True Model   c1
  20     50     0.6850    3.5               0.7525    2.0
  20    100     0.7400    2.5               0.7525    2.0
  20    200     0.9800    3.0               0.8775    1.5
  20    500     1.0000    3.5               1.0000    2.0
  20   1000     1.0000    3.5               1.0000    2.5
  50     50     0.7000    3.0               0.7425    1.5
  50    100     0.7025    3.5               0.7525    2.0
  50    200     0.9000    3.0               0.7475    2.0
  50    500     1.0000    4.0               0.9725    2.0
  50   1000     1.0000    3.5               0.9880    2.5

Table 4. The performance of the pseudolikelihood and the MCMC approximation with the Lasso (bottom) and the thresholded Lasso (top) for M3.
6. Conclusions

In the paper we consider the problem of structure learning of binary Markov random fields. We base estimation of the model parameters on the Lasso penalized Monte Carlo approximation of the likelihood. In the theoretical part of the paper we show that the proposed procedure recognizes the true dependence structure with high probability. The regularity conditions that we need are not restrictive and are weaker than the assumptions used in the pseudolikelihood approach. Moreover, the theoretical results are complemented by simulation experiments. They confirm that the quality of the pseudolikelihood approximation depends strongly on the complexity of the graph structure, while the MCMC approximation is able to find the true model regardless of it. The results of the current paper can be easily extended to other discrete Markov random fields. There are also some non-trivial issues that are not discussed in the paper, for instance how the parameter λ should be chosen in practice (cross-validation, information criteria, etc.). Evaluating the prediction error of the procedure is also a difficult problem, as is investigating the model (1) with predictors. Clearly, all these problems need detailed study.
Acknowledgements

Błażej Miasojedow and Wojciech Rejchel are supported by Polish National Science Center grants no. 2015/17/D/ST1/01198 and no. 2014/12/S/ST1/00344, respectively.

References

Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9(Mar):485–516, 2008.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, 36:192–236, 1974.
Julian Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64:616–618, 1977.
Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37:1705–1732, 2009.
Guy Bresler, Elchanan Mossel, and Allan Sly. Reconstruction of Markov random fields from samples: Some observations and algorithms. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 343–356. Springer, 2008.
Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Series in Statistics, New York: Springer, 2011.
Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
Charles J Geyer. On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B (Methodological), 56:261–274, 1994.
Charles J Geyer and Elizabeth A Thompson. Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B (Methodological), 54:657–699, 1992.
Jian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Joint structure estimation for categorical Markov networks. Technical report, 2010.
Holger Höfling and Robert Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudolikelihoods. Journal of Machine Learning Research, 10(Apr):883–906, 2009.
Jian Huang and Cun-Hui Zhang. Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. Journal of Machine Learning Research, 13(Jun):1839–1864, 2012.
Jian Huang, Tingni Sun, Zhiliang Ying, Yi Yu, and Cun-Hui Zhang. Oracle inequalities for the lasso in the Cox model. The Annals of Statistics, 41(3):1142–1165, 2013.
Ernst Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31(1):253–258, 1925.
Ali Jalali, Christopher C Johnson, and Pradeep K Ravikumar. On learning discrete graphical models using greedy methods. In Advances in Neural Information Processing Systems, pages 1935–1943, 2011.
Ioannis Kontoyiannis and Sean P Meyn. Geometric ergodicity and the spectral gap of non-reversible Markov chains. Probability Theory and Related Fields, 154(1-2):327–339, 2012.
Su-In Lee, Varun Ganapathi, and Daphne Koller. Efficient structure learning of Markov networks using l1-regularization. In Advances in Neural Information Processing Systems, pages 817–824, 2006.
Błażej Miasojedow. Hoeffding's inequalities for geometrically ergodic Markov chains on general state space. Statistics & Probability Letters, 87:115–120, 2014.
Elchanan Mossel and Allan Sly. Exact thresholds for Ising–Gibbs samplers on general graphs. The Annals of Probability, 41(1):294–328, 2013.
Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
Pradeep Ravikumar, Martin J Wainwright, and John Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
Gareth O Roberts and Jeffrey S Rosenthal. Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2(2):13–25, 1997.
Gareth O Roberts and Jeffrey S Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58:267–288, 1996.
Sara van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36:614–645, 2008.
Hulin Wu and Fred W Huffer. Modelling the distribution of plant species using the autologistic regression model. Environmental and Ecological Statistics, 4(1):31–48, 1997.
Lingzhou Xue, Hui Zou, and Tianxi Cai. Nonconcave penalized composite conditional likelihood estimation of sparse Ising models. The Annals of Statistics, 40:1403–1429, 2012.
Fei Ye and Cun-Hui Zhang. Rate minimaxity of the Lasso and Dantzig selector for the lq loss in lr balls. Journal of Machine Learning Research, 11(Dec):3519–3540, 2010.
Tong Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10(Mar):555–568, 2009.
Peng Zhao and Bin Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.
Shuheng Zhou. Thresholding procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems, pages 2304–2312, 2009.

Appendix A. Auxiliary results

In this section we formulate the lemmas that are needed to prove the main results of the paper. The roles that they play are described in detail at the end of section 3. The first lemma is borrowed from Huang et al. (2013, Lemma 3.1).

Lemma 1. Let θ̃ = θ̂ − θ⋆, z* = |∇ℓ^m_n(θ⋆)|_∞ and

D(θ̂, θ) = (θ̂ − θ)′[∇ℓ^m_n(θ̂) − ∇ℓ^m_n(θ)].
Then

(20)    (λ − z*)|θ̃_{T^c}|_1 ≤ D(θ̂, θ⋆) + (λ − z*)|θ̃_{T^c}|_1 ≤ (λ + z*)|θ̃_T|_1.

Besides, for arbitrary ξ > 1, on the event

(21)    Ω_1 = \Big\{ |∇ℓ^m_n(θ⋆)|_∞ ≤ \frac{ξ − 1}{ξ + 1}\, λ \Big\}

the random vector θ̃ belongs to the cone C(ξ, T).
Proof. The proof is the same as the proof of Huang et al. (2013, Lemma 3.1). It is quoted here to make the paper complete. Convexity of the MCMC approximation ℓ^m_n(θ) easily implies the first inequality in (20). The same property combined with convexity of the Lasso penalty gives us that zero has to belong to the subgradient of (8) at the minimizer θ̂, i.e.

(22)    ∇_{rs}ℓ^m_n(θ̂) = −λ sign(θ̂_rs)  if θ̂_rs ≠ 0,    |∇_{rs}ℓ^m_n(θ̂)| ≤ λ  if θ̂_rs = 0,

where we use ∇ℓ^m_n(θ) = [∇_{rs}ℓ^m_n(θ)]_{r<s} and sign(t) = 1 for t > 0, sign(t) = −1 for t < 0.

Lemma 2. Let ξ > 1. Moreover, let us denote τ = \frac{(ξ + 1)\, d̄_0 λ}{F̄(ξ, T)} and

(23)    Ω_2 = { τ < e^{−1} }.
Then Ω_1 ∩ Ω_2 ⊂ A, where

(24)    A = \Big\{ |θ̂ − θ⋆|_∞ ≤ \frac{2ξ e^{η} λ}{(ξ + 1) F̄(ξ, T)} \Big\},

where η < 1 is the smaller solution of the equation η e^{−η} = τ.
Proof. Suppose we are on the event Ω_1 ∩ Ω_2. Denote again θ̃ = θ̂ − θ⋆ and notice that θ = θ̃/|θ̃|_1 ∈ C(ξ, T) by Lemma 1. Consider the function

g(t) = θ′∇ℓ^m_n(θ⋆ + tθ) − θ′∇ℓ^m_n(θ⋆)

for each t ≥ 0. This function is nondecreasing, because ℓ^m_n(·) is convex. Thus, we obtain g(t) ≤ g(|θ̃|_1) for every t ∈ (0, |θ̃|_1). On the event Ω_1 and from Lemma 1 we have that

(25)    θ′[∇ℓ^m_n(θ⋆ + tθ) − ∇ℓ^m_n(θ⋆)] + \frac{2λ}{ξ + 1}|θ_{T^c}|_1 ≤ \frac{2λξ}{ξ + 1}|θ_T|_1.

In the further argumentation we consider all nonnegative t satisfying (25), that is, an interval [0, t̃] for some t̃ > 0. Proceeding similarly to the proof of Huang et al. (2013, Lemma 3.2) we obtain

(26)    t θ′[∇ℓ^m_n(θ⋆ + tθ) − ∇ℓ^m_n(θ⋆)] ≥ t² exp(−γ_{tθ})\, θ′∇²ℓ^m_n(θ⋆)θ,

where γ_{tθ} = t max_{k,l} |θ′J(Y^k) − θ′J(Y^l)| ≤ 2t, because J(Y^k) = [Y^k(r)Y^k(s)]_{r<s} and |θ|_1 = 1. Therefore, the right-hand side in (26) can be lower bounded by

(27)    t² exp(−2t)\, θ′∇²ℓ^m_n(θ⋆)θ.

Using the definition of F̄(ξ, T), the fact that θ ∈ C(ξ, T), the bound (27) and (25) we obtain

\frac{F̄(ξ, T)|θ_T|²_1}{d̄_0}\, t exp(−2t) ≤ t exp(−2t)\, θ′∇²ℓ^m_n(θ⋆)θ ≤ θ′[∇ℓ^m_n(θ⋆ + tθ) − ∇ℓ^m_n(θ⋆)] ≤ \frac{2λξ}{ξ + 1}|θ_T|_1 − \frac{2λ}{ξ + 1}|θ_{T^c}|_1 ≤ λ(ξ + 1)|θ_T|²_1/2.

So, every t satisfying (25) has to fulfill the inequality 2t exp(−2t) ≤ τ. In particular, 2t̃ exp(−2t̃) ≤ τ. We are on Ω_2, so this implies that 2t̃ ≤ η, where η is the smaller solution of the equation η exp(−η) = τ. We know also that |θ̃|_1 ≤ t̃, so

|θ̃|_1 exp(−η) ≤ t̃ exp(−2t̃) ≤ \frac{t̃ exp(−2t̃)\, θ′∇²ℓ^m_n(θ⋆)θ}{F̄(ξ, T)|θ_T|_1|θ|_∞} ≤ \frac{θ′[∇ℓ^m_n(θ⋆ + t̃θ) − ∇ℓ^m_n(θ⋆)]}{F̄(ξ, T)|θ_T|_1|θ|_∞} ≤ \frac{2λξ}{(ξ + 1)F̄(ξ, T)|θ|_∞},

where we have used the bounds (27) and (25). Using the equality |θ|_∞ = |θ̃|_∞/|θ̃|_1, we finish the proof.
Lemma 3. For every natural n, m and positive t

P(|∇ℓ^m_n(θ⋆)|_∞ ≤ t) ≥ 1 − 2d̄ exp(−nt²/8) − β_1 exp\Big( −\frac{mβ_2}{4M²} \Big) − 2d̄β_1 exp\Big( −\frac{t²mβ_2}{64M²} \Big).
Proof. We can rewrite ∇ℓ^m_n(θ⋆) as

(28)    ∇ℓ^m_n(θ⋆) = −\left[ \frac{1}{n}\sum_{i=1}^n J(Y_i) − \frac{∇C(θ⋆)}{C(θ⋆)} \right] + \frac{\frac{1}{m}\sum_{k=1}^m w_k(θ⋆)\left[ J(Y^k) − \frac{∇C(θ⋆)}{C(θ⋆)} \right]}{\frac{1}{m}\sum_{k=1}^m w_k(θ⋆)}.

Notice that the first term in (28) depends only on the initial sample Y_1, . . . , Y_n and is an average of i.i.d. random variables. The second term depends only on the MCMC sample Y^1, . . . , Y^m. We start the analysis with the former one. Using Hoeffding's inequality we obtain for each natural n, positive t and a pair of indices r < s

P\left( \left| \frac{1}{n}\sum_{i=1}^n J_{rs}(Y_i) − \frac{∇_{rs}C(θ⋆)}{C(θ⋆)} \right| > t/2 \right) ≤ 2 exp(−nt²/8).

Therefore, by the union bound we have

(29)    P\left( \left| \frac{1}{n}\sum_{i=1}^n J(Y_i) − \frac{∇C(θ⋆)}{C(θ⋆)} \right|_∞ > t/2 \right) ≤ 2d̄ exp(−nt²/8).
Next, we investigate the second expression in (28). Its denominator is an average that depends on the Markov chain. To handle it we can use Theorem 2 in section 3. Notice that exp[(θ⋆)′J(y)]/h(y) ≤ M C(θ⋆) for every y and E_{Y∼h} exp[(θ⋆)′J(Y)]/h(Y) = C(θ⋆). Therefore, for every n, m

(30)    P\left( \frac{1}{m}\sum_{k=1}^m w_k(θ⋆) ≥ C(θ⋆)/2 \right) ≥ 1 − β_1 exp\left( −\frac{mβ_2}{4M²} \right).

Finally, we bound the l∞-norm of the numerator of the second term in (28). We fix a pair of indices r < s. It is not difficult to calculate that

E_{Y∼h} \frac{exp[(θ⋆)′J(Y)]}{h(Y)} \left[ J_{rs}(Y) − \frac{∇_{rs}C(θ⋆)}{C(θ⋆)} \right] = 0

and for every y

\left| \frac{exp[(θ⋆)′J(y)]}{h(y)} \left[ J_{rs}(y) − \frac{∇_{rs}C(θ⋆)}{C(θ⋆)} \right] \right| ≤ 2M C(θ⋆).

From Theorem 2 we obtain for every positive t

P\left( \left| \frac{1}{m}\sum_{k=1}^m w_k(θ⋆)\left[ J_{rs}(Y^k) − \frac{∇_{rs}C(θ⋆)}{C(θ⋆)} \right] \right| ≥ tC(θ⋆)/4 \right) ≤ 2β_1 exp\left( −\frac{t²mβ_2}{64M²} \right).

Using the union bound we estimate the numerator of the second expression in (28) with probability at least 1 − 2d̄β_1 exp(−t²mβ_2/(64M²)). This fact and (30) imply that for every positive t and natural n, m with probability at least 1 − β_1 exp(−mβ_2/(4M²)) − 2d̄β_1 exp(−t²mβ_2/(64M²)) we have

(31)    \left| \frac{\frac{1}{m}\sum_{k=1}^m w_k(θ⋆)\left[ J(Y^k) − \frac{∇C(θ⋆)}{C(θ⋆)} \right]}{\frac{1}{m}\sum_{k=1}^m w_k(θ⋆)} \right|_∞ ≤ t/2.

Taking (29) and (31) together we finish the proof.
Corollary 2. Let ε > 0, ξ > 1 and

λ = \frac{ξ + 1}{ξ − 1} \max\left\{ 2\sqrt{\frac{2 log(2d̄/ε)}{n}},\; 8M\sqrt{\frac{log((2d̄ + 1)β_1/ε)}{mβ_2}} \right\}.

Conditions (14) and (15) imply

P(Ω_1) ≥ 1 − 2ε.

Proof. We take t = \frac{ξ − 1}{ξ + 1}λ in Lemma 3.
Lemma 4. For every n, m and positive t

(32)    P\left( \left| ∇²ℓ^m_n(θ⋆) − ∇² log C(θ⋆) \right|_∞ ≤ t \right) ≥ 1 − 2β_1 exp\left( −\frac{mβ_2}{4M²} \right) − 2d̄(d̄ + 1)β_1 exp\left( −\frac{t²mβ_2}{256M²} \right).
Proof. To estimate the l∞-norm of the matrix difference in (32) we bound the l∞-norms of two matrices:

(33)    \frac{\sum_{k=1}^m w_k(θ⋆)J(Y^k)^{⊗2}}{\sum_{k=1}^m w_k(θ⋆)} − \frac{∇²C(θ⋆)}{C(θ⋆)} = \frac{\frac{1}{m}\sum_{k=1}^m w_k(θ⋆)\left[ J(Y^k)^{⊗2} − \frac{∇²C(θ⋆)}{C(θ⋆)} \right]}{\frac{1}{m}\sum_{k=1}^m w_k(θ⋆)}

and

(34)    \left( \frac{\sum_{k=1}^m w_k(θ⋆)J(Y^k)}{\sum_{k=1}^m w_k(θ⋆)} \right)^{⊗2} − \left( \frac{∇C(θ⋆)}{C(θ⋆)} \right)^{⊗2}.

The denominator of the right-hand side of (33) has been estimated in the proof of Lemma 3, so we bound the numerator. We can calculate that

E_{Y∼h} \frac{exp[(θ⋆)′J(Y)]}{h(Y)} \left[ J(Y)^{⊗2} − \frac{∇²C(θ⋆)}{C(θ⋆)} \right] = 0

and for every y and two pairs of indices r < s, r′ < s′ we have

\left| \frac{exp[(θ⋆)′J(y)]}{h(y)} \left[ J_{rs}(y)J_{r′s′}(y) − \frac{∇²_{rs,r′s′}C(θ⋆)}{C(θ⋆)} \right] \right| ≤ 2M C(θ⋆).

Using the union bound and (30) we upper-bound the l∞-norm of the right-hand side of (33) by t/2 with probability at least

1 − β_1 exp\left( −\frac{mβ_2}{4M²} \right) − 2d̄²β_1 exp\left( −\frac{t²mβ_2}{64M²} \right).

The last step of the proof is handling the l∞-norm of (34). This expression can be upper-bounded by

(35)    \left| \frac{\sum_{k=1}^m w_k(θ⋆)J(Y^k)}{\sum_{k=1}^m w_k(θ⋆)} − \frac{∇C(θ⋆)}{C(θ⋆)} \right|_∞ \left( \left| \frac{\sum_{k=1}^m w_k(θ⋆)J(Y^k)}{\sum_{k=1}^m w_k(θ⋆)} \right|_∞ + \left| \frac{∇C(θ⋆)}{C(θ⋆)} \right|_∞ \right).

The first term in (35) has been bounded with high probability in the proof of Lemma 3. The remaining two can be easily estimated by one. Therefore, for every positive t the l∞-norm of (34) is not greater than t/2 with probability at least

1 − β_1 exp\left( −\frac{mβ_2}{4M²} \right) − 2d̄β_1 exp\left( −\frac{t²mβ_2}{256M²} \right).

Putting together this fact and the bound of (33) we finish the proof.

Corollary 3. If (15) holds, then for every ε > 0 the inequality

F̄(ξ, T) ≥ F(ξ, T) − 16 d̄_0 (1 + ξ)² M \sqrt{\frac{log(2d̄(d̄ + 1)β_1/ε)}{mβ_2}}

holds with probability at least 1 − 2ε.
Proof. We take

t = 16M\sqrt{\frac{log(2d̄(d̄ + 1)β_1/ε)}{mβ_2}}

in Lemma 4 and use Huang et al. (2013, Lemma 4.1 (ii)).
Appendix B. Proofs of main results

Proof of Theorem 1. Fix ε > 0, ξ > 1 and denote γ = γ(ξ) = \frac{α(ξ) − 2}{α(ξ)} ∈ (0, 1). First, from Corollary 2 we know that P(Ω_1) ≥ 1 − 2ε. Using the condition (15) we obtain that

F(ξ, T) − 16 d̄_0 (1 + ξ)² M \sqrt{\frac{log(2d̄(d̄ + 1)β_1/ε)}{mβ_2}} ≥ γ F(ξ, T).

Therefore, from Corollary 3 we have that P(F̄(ξ, T) ≥ γF(ξ, T)) ≥ 1 − 2ε. It is not difficult to calculate that

\frac{(1 + ξ)\, d̄_0 λ}{γ F(ξ, T)} ≤ e^{−1},

so we have also P(Ω_2) ≥ 1 − 2ε. To finish the proof we use Lemma 2 (with η = 1 for simplicity) and again bound F̄(ξ, T) from below by γF(ξ, T) on the event A defined in (24).

Proof of Corollary 1. The proof is a simple consequence of the uniform bound (16) obtained in Theorem 1. Indeed, for an arbitrary pair of indices (r, s) ∉ T we obtain

|θ̂_rs| = |θ̂_rs − θ⋆_rs| ≤ R^m_n,

so (r, s) ∉ T̂. Analogously, if we take a pair (r, s) ∈ T, then

|θ̂_rs| ≥ |θ⋆_rs| − |θ̂_rs − θ⋆_rs| > 2δ − R^m_n ≥ δ.

Błażej Miasojedow, Institute of Applied Mathematics and Mechanics, University of Warsaw, ul. Banacha 2, 02-097 Warszawa, Poland
E-mail address: [email protected]

Wojciech Rejchel, Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, ul. Chopina 12/18, 87-100 Toruń, Poland
E-mail address: [email protected]