arXiv:0912.2873v2 [stat.AP] 24 Jul 2010
Variational Bayesian inference and complexity control for stochastic block models

P. Latouche∗, E. Birmelé and C. Ambroise
Laboratoire Statistique et Génome, UMR CNRS 8071, UEVE

Abstract: It is now widely accepted that knowledge can be acquired from networks by clustering their vertices according to connection profiles. Many methods have been proposed and in this paper we concentrate on the Stochastic Block Model (SBM). The clustering of vertices and the estimation of SBM model parameters have been subject to previous work and numerous inference strategies such as variational Expectation Maximization (EM) and classification EM have been proposed. However, SBM still suffers from a lack of criteria to estimate the number of components in the mixture. To our knowledge, only one model based criterion, ICL, has been derived for SBM in the literature. It relies on an asymptotic approximation of the Integrated Complete-data Likelihood and recent studies have shown that it tends to be too conservative in the case of small networks. To tackle this issue, we propose a new criterion that we call ILvb, based on a non asymptotic approximation of the marginal likelihood. We describe how the criterion can be computed through a variational Bayes EM algorithm.

Key words: Random graphs, Stochastic block models, Community detection, Variational EM, Variational Bayes EM, Integrated complete-data likelihood, Integrated observed-data likelihood
1 Introduction
Networks are used in many scientific fields such as biology (Albert and Barabási 2002) and social sciences (Snijders and Nowicki 1997, Nowicki and Snijders 2001). They aim at modelling with edges the way objects of interest are related to each other. Examples of such data sets are friendship networks (Palla et al 2007), protein-protein interaction networks (Barabási and Oltvai 2004), power grids (Watts and Strogatz 1998) and the Internet (Zanghi et al 2008). In this context, a lot of attention has been paid to developing models to learn knowledge from the network topology. It appears that available methods can be grouped into three significant categories.

∗ Address for correspondence: Pierre Latouche, Laboratoire Statistique et Génome, Tour Evry 2, 523 place des terrasses de l’Agora, 91000 Evry, France. E-mail: [email protected]

Some models look for community structure, also called homophily or assortative mixing (Girvan and Newman 2002, Danon et al 2005). Given a network, the vertices are partitioned into classes such that vertices of a class are mostly connected to vertices of the same class. In the model of Handcock et al (2007), which extends Hoff et al (2002), vertices are clustered depending on their positions in a continuous latent space. They proposed a two-stage maximum likelihood approach and a Bayesian algorithm, as well as an asymptotic BIC criterion to estimate the number of latent classes. The two-stage maximum likelihood approach first maps the vertices in the latent space and then uses a mixture model to cluster the resulting positions. In practice, this procedure converges quickly but loses some information by not estimating the positions and the cluster model at the same time. Conversely, the Bayesian algorithm, based on Markov Chain Monte Carlo, estimates both the latent positions and the mixture model parameters simultaneously. It gives better results but is time consuming. Both the maximum likelihood and the Bayesian approach are implemented in the R package “latentnet” (Krivitsky and Handcock 2009).

Other models look for disassortative mixing, in which vertices mostly connect to vertices of different classes (Estrada and Rodriguez-Velazquez 2005). They are particularly suitable for the analysis of bipartite networks, which are used in numerous applications. Examples of data sets having such structures are transcriptional regulatory networks where operons encode transcription factors directly involved in operon regulation. To get some insight into the transcription process, these two types of nodes are often grouped into different classes with high inter connection probabilities. Other examples are citation networks where authors cite or are cited by papers.
For a more detailed description of the differences between community structure and disassortative mixing, see Newman and Leicht (2007).

Finally, a few models can look for both community structure and disassortative mixing. Hofman and Wiggins (2008) proposed a probabilistic framework, as well as an efficient clustering algorithm. Their model, implemented in the software “VBMOD”, is based on two key parameters λ and ε. Given a network, it assumes that vertices connect with probability λ if they belong to the same class and with probability ε otherwise. Moreover, they introduced a non asymptotic Bayesian criterion to estimate the number of classes. It is based on a variational approximation of the marginal likelihood and has shown promising results.

In this paper, we focus on the Stochastic Block Model (SBM) which was originally developed in social sciences (White et al 1976, Fienberg and Wasserman 1981, Frank and Harary 1982, Holland et al 1983, Snijders and Nowicki 1997). Given a network, SBM assumes that each vertex belongs to a hidden class among Q classes, and uses a Q × Q matrix π to describe the intra and inter connection probabilities. Moreover, the class proportions are represented using a Q-dimensional vector α. No assumption is made on the form of the connectivity matrix, so that very different structures can be taken into account. In particular, SBM can characterize the presence of hubs which make networks
locally dense (Daudin et al 2008). Moreover and to some extent, it generalizes many of the existing graph clustering techniques as shown in Newman and Leicht (2007). For instance, the model of Hofman and Wiggins (2008) can be seen as a constrained SBM where the diagonal of π is set to λ and all the other elements to ε.

Many methods have been proposed in the literature to jointly estimate SBM model parameters and cluster the vertices of a network. They all face the same difficulty. Indeed, contrary to many mixture models, the conditional distribution of all the latent variables Z and model parameters, given the observed data X, cannot be factorized due to conditional dependency (for more details, see Daudin et al 2008). Therefore, optimization techniques such as the EM algorithm cannot be used directly. Nowicki and Snijders (2001) proposed a Bayesian probabilistic approach. They introduced some prior Dirichlet distributions for the model parameters and used Gibbs sampling to approximate the posterior distribution over the model parameters and posterior predictive distribution. Their algorithm is implemented in the software BLOCKS, which is part of the package StoCNET (Boer et al 2006). It gives accurate a posteriori estimates but cannot handle networks with more than 200 vertices. Daudin et al (2008) proposed a frequentist variational EM approach for SBM which can handle much larger networks. Online strategies have also been developed (Zanghi et al 2008).

While many inference strategies have been proposed for estimation and clustering purposes, SBM still suffers from a lack of criteria to estimate the number of classes in networks. Indeed, many criteria, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) (Burnham and Anderson 2004) are based on the likelihood p(X | α, π) of the observed data X, which is intractable here.
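To make this intractability concrete, the observed-data likelihood can be written as a sum over all latent configurations. The identity below is the standard marginalization for latent variable models, written with the factors p(X | Z, π) and p(Z | α) of the model defined in Section 2.1; it is given here as an illustration and is not quoted from the text above:

$$p(X \mid \alpha, \pi) = \sum_{Z} p(X, Z \mid \alpha, \pi) = \sum_{Z} p(X \mid Z, \pi)\, p(Z \mid \alpha).$$

The sum runs over the $Q^N$ possible assignments of the $N$ vertices to the $Q$ classes. Because each edge $X_{ij}$ depends on both $Z_i$ and $Z_j$, the sum does not split into a product of per-vertex sums as it would in a standard mixture model. As an order of magnitude, $Q = 3$ and $N = 50$ already give $3^{50} \approx 7.2 \times 10^{23}$ terms.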
To tackle this issue, Mariadassou et al (2010) and Daudin et al (2008) used a criterion called ICL, based on an asymptotic approximation of the integrated complete-data likelihood. This criterion relies on the joint distribution p(X, Z | α, π) rather than p(X | α, π) and can be easily computed, even in the case of SBM. ICL was originally proposed by Biernacki et al (2000) for model selection in Gaussian mixture models, and is known to be particularly suitable from a cluster analysis viewpoint since it favors well separated clusters. However, because it relies on an asymptotic approximation, Biernacki et al (2010) showed, in the case of mixtures of multivariate multinomial distributions, that it may fail to detect interesting structures present in the data for small sample sizes. Mariadassou et al (2010) obtained similar results when analyzing networks generated using SBM. They found that this asymptotic criterion tends to underestimate the number of classes when dealing with small networks. We emphasize that, to our knowledge, ICL is currently the only model based criterion developed for SBM.

Our main concern in this paper is to propose a new criterion for SBM, based on the marginal likelihood p(X), also called integrated observed-data likelihood. The marginal likelihood is known to take a density estimation viewpoint and is expected to provide a consistent estimation of the distribution of the data. For a more detailed overview of the differences between integrated complete-data likelihood and integrated observed-data likelihood, we refer to Biernacki et al
(2010). In the case of SBM, the marginal likelihood is not tractable and we describe in this paper how a non asymptotic approximation can be obtained through a variational Bayes EM algorithm.

In Section 2, we describe SBM and we introduce some non informative conjugate prior distributions for the model parameters. The variational Bayes EM algorithm is then presented in Section 3. We show in Section 4 how it naturally leads to a new model selection criterion that we call ILvb, based on a non asymptotic approximation of the marginal likelihood. Finally, in Section 5, we carry out some experiments using simulated data sets and the metabolic network of Escherichia coli, to assess ILvb. The R package “mixer” implementing this work is available from the following web site: http://cran.r-project.org.
2 A Mixture Model for Graphs
The data we model consist of an $N \times N$ binary matrix $X$, with entries $X_{ij}$ describing the presence or absence of an edge from vertex $i$ to vertex $j$. Both directed and undirected relations can be analyzed but in the following, we focus on undirected relations. Therefore $X$ is symmetric.
2.1 Model and Notations
The Stochastic Block Model (SBM) introduced by Nowicki and Snijders (2001) associates to each vertex of a network a latent variable $Z_i$ drawn from a multinomial distribution, such that $Z_{iq} = 1$ if vertex $i$ belongs to class $q$:

$$Z_i \sim \mathcal{M}\big(1, \alpha = (\alpha_1, \alpha_2, \dots, \alpha_Q)\big).$$

We denote by $\alpha$ the vector of class proportions. The edges are then drawn from a Bernoulli distribution:

$$X_{ij} \mid \{Z_{iq} Z_{jl} = 1\} \sim \mathcal{B}(\pi_{ql}),$$

where $\pi$ is a $Q \times Q$ matrix of connection probabilities. According to this model, the latent variables $Z_1, \dots, Z_N$ are iid and, given this latent structure, all the edges are assumed to be independent. Note that SBM was originally described in a more general setting, allowing any discrete relational data. However, as explained previously, we concentrate in the following on binary edges only.
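The generative process described above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only, not the implementation of the "mixer" package: the function name `simulate_sbm`, its arguments and the example parameter values are assumptions made for this example, and the matrix `pi` is assumed to be symmetric since the graph is undirected.

```python
import numpy as np


def simulate_sbm(n_vertices, alpha, pi, seed=None):
    """Draw (X, Z) from an undirected, binary SBM without self loops."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n_classes = alpha.shape[0]

    # Z_i ~ M(1, alpha): one class label per vertex, stored one-hot in Z.
    labels = rng.choice(n_classes, size=n_vertices, p=alpha)
    Z = np.eye(n_classes, dtype=int)[labels]

    # Edge probability for every pair (i, j): pi[q, l], with q and l the
    # classes of vertices i and j.
    edge_prob = pi[labels[:, None], labels[None, :]]

    # X_ij ~ B(pi_ql), drawn once per pair i < j, then mirrored so that X
    # is symmetric with an empty diagonal.
    upper = np.triu(rng.random((n_vertices, n_vertices)) < edge_prob, k=1)
    X = (upper | upper.T).astype(int)
    return X, Z


# Hypothetical example values: two classes with assortative structure.
X, Z = simulate_sbm(50, alpha=[0.6, 0.4], pi=[[0.8, 0.1], [0.1, 0.7]], seed=0)
```

Only the upper triangle is sampled, so each undirected edge is drawn exactly once. A constrained connectivity matrix of the kind attributed to Hofman and Wiggins (2008) in the introduction could be obtained by filling `pi` with one off-diagonal value and setting its diagonal to another, for example with `np.full` followed by `np.fill_diagonal`.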
Thus, when considering an undirected graph without self loops, this leads to

$$p(Z \mid \alpha) = \prod_{i=1}^{N} \mathcal{M}(Z_i; 1, \alpha) = \prod_{i=1}^{N} \prod_{q=1}^{Q} \alpha_q^{Z_{iq}},$$

and

$$p(X \mid Z, \pi) = \prod_{i<j} p(X_{ij} \mid Z_i, Z_j, \pi)$$