Choosing the number of components in a finite mixture model using an exact Integrated Completed Likelihood criterion

Marco Bertoletti*, Nial Friel† and Riccardo Rastelli†

*Dept. of Statistical Sciences, University of Bologna, Italy
†Insight Centre for Data Analytics and School of Mathematical Sciences, University College Dublin, Ireland

arXiv:1411.4257v1 [stat.CO] 16 Nov 2014
Abstract

The ICL criterion has proven to be a very popular approach for clustering data through automatically choosing the number of components in a mixture model. This approach effectively maximises the complete data likelihood, thereby including the allocation of observations to components in the model selection criterion. However, for practical implementation one needs to introduce an approximation in order to estimate the ICL. Our contribution here is to illustrate that, through the use of conjugate priors, one can derive an exact expression for the ICL, thereby avoiding any approximation. Moreover, we illustrate how one can find both the number of components and the allocation of observations to components in one algorithmic framework. The performance of our algorithm is presented on several simulated and real examples.
1 Introduction
Finite mixture models are a widely used approach for parametric cluster analysis. Choosing the number of components in the mixture model, which is usually viewed as a problem of model choice, is a crucial issue. In a Bayesian framework, model choice can be dealt with in a Markov chain Monte Carlo framework, where the number of components is estimated simultaneously with the model parameters using the Reversible Jump algorithm of Green [10], extended to the context of finite mixtures by Richardson and Green [16]. An alternative approach has been introduced in Nobile and Fearnside [14], where the authors propose to integrate out the model parameters in order to achieve better estimation and more efficient sampling. The resulting algorithm, called the Allocation Sampler, carries out inference on the allocations (cluster labels of the observations) and the number of groups K in one framework. Similar approaches based on collapsing model parameters have been applied to different contexts, such as network models, as shown in Wyse and Friel [18], Wyse, Friel, and Latouche [19], Côme and Latouche [6], and McDaid et al. [11].
Corresponding author: [email protected]
In the frequentist approach, model choice is often performed by means of the Bayesian Information Criterion (BIC). Throughout the paper we will denote the observed data by x and the allocations by z. Now, let K be the number of groups and θ̂_K the estimated parameters under the assumption of K groups; then the log model evidence log f(x|K) is approximated by the BIC, which is defined as:

    \mathrm{BIC}(K) = \log f\big(x \mid K, \hat{z}, \hat{\theta}_K\big) - \tfrac{1}{2}\,\nu_K \log(n),    (1.1)

where ẑ are the MLE allocations corresponding to the estimated parameters, while n is the number of observations and ν_K the number of parameters under the assumption of K groups. The BIC index is an approximation of the log model evidence in that both the parameters and the allocations are replaced with maximum likelihood estimates. Hence, among all the possible choices, the model maximizing the BIC is chosen. However, an interesting variant of the BIC method has been introduced in Biernacki, Celeux, and Govaert [3], which has generated a lot of interest in cluster analysis research. In this work, the authors propose to base model choice on the Integrated Completed Likelihood (ICL) criterion, which has an advantage over BIC in assessing the number of mixture components since it includes the information given by the allocations. In more detail, the ICL index turns out to be equal to the BIC penalised by the estimated mean entropy:

    \mathrm{ICL}(K) := \mathrm{BIC}(K) - \sum_{g=1}^{K}\sum_{i=1}^{n} p\big(z_i \mid x, \hat{\theta}_K, K\big) \log p\big(z_i \mid x, \hat{\theta}_K, K\big).    (1.2)

The ICL index is developed to approximate the log model evidence, given by:

    \log f(x, z \mid K) = \log \int_{\Theta_K} f(x, z \mid \theta_K, K)\, \pi(\theta_K)\, d\theta_K.    (1.3)
ICL has an advantage with respect to BIC in that it makes use of the complete data (x, z) to choose the model. Following this criterion, well-separated clusters are favoured and the usual BIC-based model selection is expected to be outperformed in a general mixture model framework. In this paper, we show that in a mixture model the ICL criterion can be employed using an exact value, rather than the approximate variant given by equation (1.2) following Biernacki, Celeux, and Govaert [3]. This is possible thanks to the use of conjugate prior distributions, which essentially allow all model parameters apart from the allocation variables to be integrated out. Using an approach similar to that of Nobile and Fearnside [14], we show that an exact formula for f(x, z|K) can be obtained. We propose a heuristic greedy algorithm which extends those of Wyse, Friel, and Latouche [19] and Côme and Latouche [6] and is capable of finding the allocations z that maximize f(x, z|K), hence returning the optimal clustering solution according to the ICL criterion. An important advantage of our approach is that it gives a direct answer to the problem of choosing the number of groups K, since this value can be inferred straightforwardly from the optimal allocation vector.
The paper is organized as follows. In Section 2 the modeling assumptions for the mixture models are described. The model is a special case of the one used in Nobile and Fearnside [14], and both the univariate and multivariate situations are explored. Section 3 presents the closed form expression for f(x, z|K), setting up the framework for the optimization routine. In Section 4, the heuristic greedy procedure is described in detail, and an analysis of its performance and drawbacks is given. Section 5 provides some guidelines on how to interpret the hyperparameters, while the following two Sections 6 and 7 describe the results of applying the algorithm. The paper ends with some final remarks outlined in Section 8.
2 Mixture models
Let x = {x_1, ..., x_n} be the matrix of observations. To ease the notation, we describe the case where, ∀ i = 1, 2, ..., n, x_i is a continuous variable of b ≥ 2 dimensions, although a similar framework can be set up for the univariate case. For the moment we assume that the data come from a mixture of K multivariate Gaussians, although our approach has more general applicability. Thus, ∀ i = 1, ..., n, the x_i are iid with density

    f(x_i \mid \lambda, m, R, K) = \sum_{g=1}^{K} \lambda_g\, \phi_b\big(x_i \mid m_g, R_g^{-1}\big),    (2.1)

where φ_b(· | m_g, R_g^{-1}) is the multivariate Gaussian density with centre m_g and covariance matrix R_g^{-1}, for every g, while λ = {λ_1, ..., λ_K} are the mixture weights. Since we are interested in a clustering situation, it is useful to denote the allocation g of an observation i by the variable z_i = g, so that we have the following expression for the so-called complete log-likelihood:

    \log f(x, z \mid m, R, K) = \sum_{g=1}^{K} \sum_{i:\, z_i = g} \log \phi_b\big(x_i \mid m_g, R_g^{-1}\big).    (2.2)
Then, we follow a Bayesian framework as in Nobile and Fearnside [14], setting up a hierarchical structure that defines prior distributions for all the parameters involved in the mixture. We assume that the weights λ are distributed as a K-dimensional Dirichlet variable with hyperparameters α = (α_1, ..., α_K). Throughout this paper we assume that this distribution is symmetric, so that α = α_1 = · · · = α_K. The allocation variables are iid categorical variables such that, for every i,

    z_i = \begin{cases} 1 & \text{with probability } \lambda_1 \\ 2 & \text{with probability } \lambda_2 \\ \vdots & \\ K & \text{with probability } \lambda_K. \end{cases}    (2.3)

The remaining parameters are the mixture centres and precisions, which satisfy:

    R_g \mid \nu, \xi \sim \mathrm{Wishart}(\nu, \xi),    (2.4)
    m_g \mid \mu, \tau, R_g \sim \mathrm{MVN}_b\big(\mu, [\tau R_g]^{-1}\big).    (2.5)
So the set of hyperparameters for the model is (α, τ, µ, ν, ξ), where α and τ are positive real numbers, ξ is a b × b positive definite scale matrix, ν > b − 1 is the number of degrees of freedom, and µ is a b-dimensional vector. The model described differs from that presented in Nobile and Fearnside [14] in that only the symmetric case is considered for the hyperparameters, meaning that these do not depend on the group label g. The reason for this restriction will become clearer in Section 3. The univariate case has the same structure with a Gamma(γ, δ) distribution in place of the Wishart, where γ and δ are positive real numbers, and all the parameters are scaled to the appropriate number of dimensions.
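To make the generative model concrete, the following R sketch simulates a dataset from the hierarchy of equations (2.3)-(2.5). It is only an illustration, not the authors' code: the function name and default argument values are ours, and the Wishart prior is parametrized so that ξ enters as the inverse-scale matrix, consistent with equation (3.2) below.

    # A minimal sketch of the generative model of Section 2 (multivariate case).
    # Assumes the MASS package for multivariate normal draws; all names are illustrative.
    library(MASS)

    simulate_mixture <- function(n, K, b, alpha = 1, tau = 0.01,
                                 mu = rep(0, b), nu = b + 2, xi = diag(b)) {
      # mixture weights: symmetric K-dimensional Dirichlet via normalized Gammas
      w <- rgamma(K, shape = alpha); w <- w / sum(w)
      # group precisions R_g ~ Wishart(nu, xi), with density proportional to
      # |R|^((nu - b - 1)/2) exp(-tr(xi R)/2); R's rWishart takes the scale, hence solve(xi)
      R <- lapply(seq_len(K), function(g) rWishart(1, df = nu, Sigma = solve(xi))[, , 1])
      # centres m_g | R_g ~ MVN_b(mu, (tau R_g)^-1)
      m <- lapply(seq_len(K), function(g) mvrnorm(1, mu = mu, Sigma = solve(tau * R[[g]])))
      # allocations and observations
      z <- sample.int(K, n, replace = TRUE, prob = w)
      x <- t(vapply(z, function(g) mvrnorm(1, mu = m[[g]], Sigma = solve(R[[g]])),
                    numeric(b)))
      list(x = x, z = z)
    }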
3 Exact ICL
The modeling assumptions outlined resemble the very general framework for Gaussian mixture models, with the exception of Equation (2.5), which states that the position of the centre of a cluster is assumed to be distributed according to the covariance matrix of the cluster itself. Therefore, groups with similar shapes will have similar positioning over the space. This modeling assumption is exploited in Nobile and Fearnside [14] to allow all the parameters of the model to be integrated out, with the exception of the allocations and the number of groups. They then propose a Gibbs sampler to sample from the posterior distribution of these quantities, hence obtaining posterior estimates for the allocations and the number of groups. Here, we similarly take advantage of the possibility of integrating out all the parameters apart from the allocation variables and the number of groups, obtaining an exact expression for the model evidence based on the complete data:

    f(x, z \mid \alpha, \phi, K) = \int\!\!\int\!\!\int f(x, z, m, R, \lambda \mid \alpha, \phi, K)\, dm\, dR\, d\lambda
        = \int\!\!\int\!\!\int f(x \mid z, m, R, K)\, \pi(z \mid \lambda, K)\, \pi(\lambda \mid \alpha, K)\, \pi(m, R \mid \phi, K)\, dm\, dR\, d\lambda
        = \Big[ \int\!\!\int f(x \mid z, m, R, K)\, \pi(m, R \mid \phi, K)\, dm\, dR \Big] \Big[ \int \pi(z \mid \lambda, K)\, \pi(\lambda \mid \alpha, K)\, d\lambda \Big]
        = f(x \mid z, \phi, K)\, \pi(z \mid \alpha, K),    (3.1)

where φ = (τ, µ, ν, ξ) are the hyperparameters of the model. Under the assumptions introduced, as shown in Nobile and Fearnside [14], the logs of the two factors on the right hand side of Equation (3.1) can be expressed analytically:

    \log f(x \mid z, \phi, K) = \sum_{g=1}^{K} \Bigg\{ -\frac{b\, n_g}{2}\log(\pi) + \frac{b}{2}\log(\tau) - \frac{b}{2}\log(\tau + n_g)
        + \sum_{s=1}^{b} \left[ \log\Gamma\!\left(\frac{\nu + n_g + 1 - s}{2}\right) - \log\Gamma\!\left(\frac{\nu + 1 - s}{2}\right) \right] + \frac{\nu}{2}\log|\xi|
        - \frac{\nu + n_g}{2}\log\Bigg| \xi + \sum_{i:\, z_i = g} (x_i - \bar{x}_g)(x_i - \bar{x}_g)^t + \frac{\tau n_g}{\tau + n_g}(\bar{x}_g - \mu)(\bar{x}_g - \mu)^t \Bigg| \Bigg\},    (3.2)

    \log \pi(z \mid \alpha, K) = \log\Gamma(K\alpha) - \log\Gamma(K\alpha + n) - K\log\Gamma(\alpha) + \sum_{g=1}^{K} \log\Gamma(\alpha + n_g),    (3.3)

where n_g is the number of observations in group g, | · | is the determinant and x̄_g is the centre of cluster g. The exact value of the ICL will then be defined by:

    \mathrm{exactICL}(z, K) = \log f(x \mid z, \phi, K) + \log \pi(z \mid \alpha, K).    (3.4)
Clearly the exactICL depends on the hyperparameters α and φ. This does not happen when the ICL is evaluated through the BIC approximation, since in that case the parameters are replaced with maximum likelihood estimates, which by definition depend only on the data. Thus every result based on the exactICL will depend on the hyperparameters, which must be chosen carefully and a priori. Once the hyperparameters are defined, the exactICL depends only on the data (which is fixed), on the allocations, and on the number of groups. Therefore, it is possible to search over the space of all possible allocations to find one maximizing the exactICL. Such a configuration can be regarded as an appropriate clustering solution thanks to the theoretical advantages of using the ICL criterion for clustering purposes. Furthermore, the ICL criterion is fully respected in that the corresponding exact value (rather than an approximation) is maximized, hence yielding an obvious advantage in terms of reliability. During the optimization, we also exploit the fact that the allocations z = {z_1, ..., z_n} are categorical variables. Indeed, from every configuration z we can recover exactly the value of K, which is equal to the number of distinct states of z. However, this convenience comes at a price, since all the hyperparameters depend on K as well. Therefore we use the symmetric assumption, which de facto removes this dependence. Once the optimization has been performed, the resulting configuration ẑ maximizes the exactICL and is thus selected as the clustering solution. Although we use a parametric framework, the mixture parameters are not included in the solution, but they can easily be recovered from the data and the best configuration, if needed, at least for large enough datasets. In any case, the algorithm we describe in this paper is meant to serve as a direct tool to obtain a clustering of the data, which can then be used for any purpose.
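As an illustration, equations (3.2)-(3.4) can be evaluated directly in a few lines of R. The sketch below is ours, not the authors' implementation; the function name and argument conventions are assumptions, and ξ is treated exactly as it appears in equation (3.2).

    # A sketch (not the authors' code) of the exact ICL of equation (3.4),
    # combining the Normal-Wishart marginal (3.2) with the Dirichlet term (3.3).
    exact_icl <- function(x, z, alpha, tau, mu, nu, xi) {
      x <- as.matrix(x)
      b <- ncol(x); n <- nrow(x)
      labels <- sort(unique(z)); K <- length(labels)
      ng <- as.numeric(table(factor(z, levels = labels)))
      # log pi(z | alpha, K), equation (3.3)
      log_pz <- lgamma(K * alpha) - lgamma(K * alpha + n) -
        K * lgamma(alpha) + sum(lgamma(alpha + ng))
      # log f(x | z, phi, K), equation (3.2), summed over groups
      log_px <- 0
      for (g in seq_len(K)) {
        xg   <- x[z == labels[g], , drop = FALSE]
        m    <- nrow(xg)
        xbar <- colMeans(xg)
        S    <- crossprod(sweep(xg, 2, xbar))   # sum_i (x_i - xbar_g)(x_i - xbar_g)^t
        D    <- xi + S + (tau * m / (tau + m)) * tcrossprod(xbar - mu)
        log_px <- log_px -
          0.5 * b * m * log(pi) +
          0.5 * b * (log(tau) - log(tau + m)) +
          sum(lgamma((nu + m + 1 - seq_len(b)) / 2) - lgamma((nu + 1 - seq_len(b)) / 2)) +
          0.5 * nu * determinant(xi, logarithm = TRUE)$modulus -
          0.5 * (nu + m) * determinant(D, logarithm = TRUE)$modulus
      }
      as.numeric(log_px + log_pz)
    }

For standardized data, a call such as exact_icl(x, z, alpha = 1, tau = 0.01, mu = rep(0, ncol(x)), nu = ncol(x) + 2, xi = diag(ncol(x))) returns the quantity that the greedy algorithms of Section 4 seek to maximize.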
At this stage, two more questions need to be answered. The first is whether we are always able to find the global optimum ẑ in the space of all possible configurations and, if so, how to find it. The second question concerns how the hyperparameters should be chosen, and how this choice affects the final solution. The following two sections tackle these issues.
4 Optimizing the exactICL
The number of possible solutions in the search space for the optimization routine is of the order of O((K_max)^n), where K_max can be chosen as the maximum number of groups to search over. This space is impossible to explore exhaustively, even for small K_max and n. However, similar combinatorial problems have been explored with interesting results in Newman [13] by means of heuristic greedy routines that resemble the well known Iterated Conditional Modes of Besag [2]. To define an optimization routine for the clustering problem, we first review the greedy algorithms introduced in Côme and Latouche [6] and Wyse, Friel, and Latouche [19]. In these papers the Greedy algorithm is used to maximize the exactICL with respect to the allocations in Stochastic Block Models and Bipartite Block Models for networks, respectively. In both cases the number of groups is inferred from the allocations. We adapt a similar algorithm to our clustering problem. The first step consists of finding a configuration to initialize the algorithm. Random allocations are preferable, where the starting number of groups is K_max, which is supposed to be much larger than the best K. The algorithm then essentially iterates over the observations in a random order and updates each allocation, choosing the value that maximizes the objective function. Once a complete loop over all the observations does not yield any increase in the exactICL, the algorithm is stopped. Pseudocode is outlined in Algorithm 1. In the pseudocode we denote by z_{A→g} the configuration z where the allocations of the observations in the set A are changed to g, and ℓ_{A→g} is the corresponding exactICL.

We now describe how this routine relates to the clustering problem specifically. The algorithm dramatically reduces the search space, and usually reaches convergence in very few steps. Its main feature is that it works well at collapsing groups: when a large number of clusters is present, the routine tends to add one observation at a time to one of the few important groups, eventually leaving the group it came from empty. Conversely, the main drawback of the Greedy algorithm is that it is not capable of merging and splitting groups. As soon as most of the groups have been collapsed and only a few large clusters remain, the algorithm tends to get stuck due to the poor sensitivity of the exactICL with respect to small changes. Moreover, the resulting solution may very well be a local optimum rather than the global one. In other words, assuming that a better solution is obtainable by splitting or merging groups, the algorithm will not be able to find it, because the objective function will not increase when only one point is reallocated (unless one of the groups involved has very few points). Figure 4.1 illustrates this behaviour with a trivial example. To tackle this issue, in Côme and Latouche [6] the authors propose multiple reruns of the routine with different starting configurations and a final merge step with the explicit purpose of avoiding local optima.
Algorithm 1: Greedy algorithm
    Initialize z
    for every iteration do
        Let z, K, ℓ be the current allocations, number of groups and exactICL, respectively
        Let ℓ_stop = ℓ
        Initialize V = {1, 2, ..., n} as the pool of the observation labels
        while V is not empty do
            Pick an observation i at random from V and delete it from V
            ĝ = argmax_{g = 1, 2, ..., K+1} ℓ_{i→g}
            ℓ = ℓ_{i→ĝ}
            z = z_{i→ĝ}
            Reassign the labels of z so that the smallest possible labels are used
            K = max z
        end while
        if ℓ_stop = ℓ then
            break
        end if
    end for
    Return z, K and ℓ
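For concreteness, a single sweep of Algorithm 1 can be sketched in R using the exact_icl() helper sketched in Section 3; the function name, the hyp list of hyperparameters and the omission of the stopping rule are our own simplifications.

    # A hedged sketch of one sweep of Algorithm 1 (single-observation greedy updates).
    # `hyp` is assumed to be a list with elements alpha, tau, mu, nu and xi.
    greedy_sweep <- function(x, z, hyp) {
      z <- as.integer(factor(z))                    # ensure labels are 1, ..., K
      for (i in sample(seq_along(z))) {             # visit observations in random order
        K <- length(unique(z))
        scores <- vapply(1:(K + 1), function(g) {   # existing groups plus one empty group
          zg <- z; zg[i] <- g
          exact_icl(x, as.integer(factor(zg)), hyp$alpha, hyp$tau, hyp$mu, hyp$nu, hyp$xi)
        }, numeric(1))
        z[i] <- which.max(scores)                   # the g maximizing l_{i -> g}
        z    <- as.integer(factor(z))               # relabel with the smallest labels
      }
      z                                             # one sweep; Algorithm 1 repeats until no increase
    }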
However, the inability of the Greedy algorithm to split groups remains an important issue; it is therefore frequent to observe multimodal clusters in the final solution and an underestimation of K. In the clustering framework, the algorithm usually performs adequately when the groups are well separated, but its performance deteriorates when clusters of different shapes overlap. Concerning the computational effort, it is not possible to quantify it exactly, since the number of iterations needed for convergence is random. However, each iteration requires at most O(n²) evaluations of the objective function, and typically fewer than 10 iterations are needed for convergence. In practice the cost of an iteration is O(nK), where K is the number of groups at that particular stage, which is typically much smaller than O(n²). The final merging has a cost of O(K̂³), where K̂ is the number of groups in the final configuration (Côme and Latouche [6]).

Greedy Combined algorithm

In this paper we show that the Greedy algorithm can be modified to distinctly improve its performance at a modest increase in computational cost. As shown, the main drawback of the Greedy algorithm is that the objective function is not sensitive enough to escape a local optimum by changing the allocation of one observation at a time. As a consequence, it gets stuck very easily (due to its greedy behaviour), and once that happens the loop terminates.
Figure 4.1: The exactICL is not sensitive to small changes in the allocations: in this dataset, composed of only 10 observations, at least 5 observations must be reallocated at the same time in order to obtain an increase in the objective function and thus be able to leave the first configuration (top left). The true values are chosen as hyperparameters.

In other words, the steps taken in the search space are not wide enough to allow a proper exploration of the space itself. We therefore propose an algorithm that has the same greedy behaviour but takes larger steps, allowing a better exploration of the space and the ability to leave local optima. We define these wider steps as joint updates of the allocations of multiple observations. Thus, our algorithm is essentially the Greedy algorithm where, instead of updating the allocation of one observation at a time, a set of observations is updated. The observations in this set must belong to the same group, and are chosen on a nearest-neighbour basis. Furthermore, the set is updated as a block, so for each iteration only O(nK) evaluations of the objective function are performed, where K is the number of groups at that particular stage. The number of observations chosen is the realization of a Beta-Binomial random variable, where the Beta hyperparameters are user defined and the number of trials is given by the cardinality of the group. Although the computational effort for each step is the same as for the Greedy algorithm, the number of iterations needed to converge is usually higher. Also, since the updates at each iteration differ due to the random number of neighbours, one might want to let the routine run until a better solution is found. For datasets with fewer than 1000 observations, we advise performing no more than 15 iterations in total; rather than running longer, one should instead try other initial configurations. Pseudocode for the algorithm is presented in Algorithm 2.
Algorithm 2: Greedy Combined algorithm
    Initialize z
    for every iteration do
        Let z, K, ℓ be the current allocations, number of groups and exactICL, respectively
        Initialize V = {1, 2, ..., n} as the pool of the observation labels
        while V is not empty do
            Pick an observation i at random from V and delete it from V
            Let J_i = {j_1, ..., j_{K_i}} be the set of observations allocated to the same group as i, ordered increasingly according to their distance from i
            Sample η from a Beta(β_1, β_2)
            Sample r from a Bin(K_i, η)
            Let J'_i = {j_1, ..., j_r} be the set of the first r observations in J_i
            ĝ = argmax_{g = 1, 2, ..., K+1} ℓ_{J'_i→g}
            if ℓ < ℓ_{J'_i→ĝ} then
                ℓ = ℓ_{J'_i→ĝ}
                z = z_{J'_i→ĝ}
                Reassign the labels of z so that the smallest possible labels are used
                K = max z
            end if
        end while
    end for
    Return z, K and ℓ

It must be pointed out that in the Greedy Combined algorithm the distance between observations is used to create J'_i. Thus, a matrix of dissimilarities must be evaluated at the beginning of the procedure, with a storage and computational cost proportional to O(n²). The distance used is chosen by the user. Usually the Euclidean distance works well; however, appropriate transformations of the data and different distance choices may be advisable when clusters do not have a spherical shape. Indeed, the rule of thumb is that the more round-shaped a cluster is, the closer J_i will be to the perfect updating set.
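The neighbour-selection step of Algorithm 2 is easy to sketch in R; the snippet below is illustrative only, assumes a precomputed distance matrix D, and uses our own function and argument names.

    # A sketch of the Beta-Binomial block selection used in Algorithm 2.
    # D is an n x n distance matrix (e.g. as.matrix(dist(x))); beta1 and beta2 are the
    # user-defined Beta hyperparameters; the block returned is J'_i, assumed here
    # to always contain i itself.
    propose_block <- function(i, z, D, beta1 = 0.1, beta2 = 0.01) {
      J   <- which(z == z[i])                         # observations in the same group as i
      J   <- J[order(D[i, J])]                        # ordered by increasing distance from i
      eta <- rbeta(1, beta1, beta2)                   # eta ~ Beta(beta1, beta2)
      r   <- rbinom(1, size = length(J), prob = eta)  # r ~ Bin(K_i, eta)
      J[seq_len(max(1, r))]                           # first r observations (at least i)
    }

The block returned would then be reallocated jointly, choosing the group (possibly a new, empty one) that maximizes the exactICL, just as the single-observation move does in Algorithm 1.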
5 Choosing the hyperparameters
As defined in Section 2, in the multivariate case the hyperparameters to be set are (α, τ, µ, ν, ξ). As already mentioned, in the original approximate form of the ICL no hyperparameters are involved, so that the maximization problem can be set up in an objective way. In our case, instead, the objective function to be maximized does depend on these hyperparameters, which can consequently have a considerable impact on the solution obtained. In the standard Bayesian approach, these hyperparameters should be chosen in the way that best represents the prior information one has about the model parameters. With this in mind we propose some guidelines for the interpretation of the hyperparameters, which can be quite challenging.

First of all, α regulates the number of observations that form each cluster. Taking 1 as a default value, a value greater than 1 will create more homogeneous groups, which have approximately the same number of observations, whereas a smaller value will tend to create one bigger group and many smaller groups of approximately the same size. Typical values for α are 1, 10 or 100, according to the prior information we have on the group sizes. Assuming that the data have been standardized, or at least centred, the origin can be chosen as µ. Choosing the hyperparameter ξ as the identity matrix allows many different situations to be represented without imposing strong constraints on the cluster volume, in particular after the data have been standardized: no particular shape is a priori favoured or penalised. We propose this setting as a default choice. The last two hyperparameters, ν and τ, are probably the most complicated to choose. τ regulates the precision of the distribution of the distance between the cluster centres and µ. A small value of τ will therefore yield distant and well-separated groups, whereas a large value will collapse the groups towards the origin. When fewer than 10 groups are expected, a reasonable value for τ is 0.01, which covers some very typical situations. However, τ can be increased up to 0.1 to represent the prior belief that groups should overlap more. When more than 10 clusters are expected, τ must be decreased accordingly to support the same degree of separation between the groups. The hyperparameter ν is constrained to be greater than b − 1, and it indicates the shape that the clusters should have. Values of ν close to b will support extremely narrow elliptical shapes, whereas greater values will support round shapes. As a default choice we propose ν = b + 2, which can account for many different settings. However, according to prior beliefs, any value ranging from b to b + 10 can be chosen. The advantage of tuning τ and ν rather than ξ is that these two hyperparameters change the situation represented independently of each other. Figure 5.1 shows two generated datasets corresponding to particular choices of τ and ν, which yield clusters of different shapes and different degrees of separation. For the univariate case, we propose as default hyperparameters γ = δ = 0.5, which can account for different situations and cluster sizes.

A final note concerns the choice of the parameters for the Beta-Binomial distribution. We propose the values β_1 = 0.1 and β_2 = 0.01 as default choices. These values are completely up to the user, and for most applications they do not need careful tuning: the default values will work just fine. However, when groups are very concentrated and overlapping, it may be advisable to tune the distribution of the number of neighbours to speed up the routine and ensure convergence. As a rule of thumb, if 15 is taken as the number of iterations, we usually expect that the last five loops do not increase the objective function. If that tends not to happen, one might want to decrease the average number of neighbours updated.
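The defaults suggested above can be collected in one place; the following R helper is merely a convenience sketch, and the function name and list structure are ours.

    # Default hyperparameter choices suggested in this section (a sketch; names are illustrative).
    default_hyperparameters <- function(b) {
      list(alpha = 1,                 # Dirichlet concentration: 1, 10 or 100 depending on prior belief about group sizes
           mu    = rep(0, b),         # the origin, assuming standardized (or at least centred) data
           xi    = diag(b),           # identity scale matrix: no shape favoured a priori
           tau   = 0.01,              # small tau -> well-separated groups; up to 0.1 for more overlap
           nu    = b + 2,             # degrees of freedom; values in [b, b + 10] are reasonable
           beta1 = 0.1, beta2 = 0.01) # Beta-Binomial parameters for the block size in Algorithm 2
    }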
Figure 5.1: 500 points divided into 5 groups, generated according to the model described in Section 2. Common hyperparameters are α = 100, µ = (0, 0)^t and the identity matrix as ξ. In the left-hand image τ = 0.1 and ν = b = 2, while in the right-hand image τ = 0.01 and ν = b + 10.
6 Simulated datasets
Here simulated datasets have been created following the model described in Section 2, starting from fixed values of the hyperparameters. The data are then standardized and the Greedy Combined algorithm introduced in Section 4 is run repeatedly, with 10 different random starting configurations and 15 iterations each. The best configuration found during the process is retained as the solution maximizing the exactICL. Hyperparameters are set to their default values. For some applications the Model Based Clustering solution, obtained with default parameters, is reported as well. The purpose is not an objective comparison of clustering performance, since the two methods work differently and have different modeling assumptions, not least because our framework has a Bayesian derivation based on hyperparameters. We aim to show instead that our optimization algorithm is efficient in recovering the configuration maximizing the exactICL, and that it is not trivially outperformed in finding the exactICL by other clustering algorithms. For each configuration, misclassified observations are plotted with empty symbols to show how the solutions differ from the true one.

Dataset 1. In the first simulated dataset 200 points are simulated according to the model described in Section 2, with hyperparameters set to the following values: K = 3, α = 10, µ = (0, 0)^t, ν = b, τ = 0.2 and ξ the identity matrix. The standardized data, composed of two bigger clusters of rounded shape and a smaller, narrower, partially overlapping cluster, are represented on the left hand side of Figure 6.1.
Figure 6.1: In the left hand image the raw data are shown. The central image shows the clustering solution obtained with the Model Based Clustering routine with default parameters; in this configuration the number of groups is overestimated. The right hand image shows the Greedy Combined algorithm solution, which recovers the exact number of groups. Misclassified observations are plotted with empty symbols.

The central image in Figure 6.1 represents the Model Based Clustering solution, whereas the one on the right hand side shows the best allocations found with the Greedy Combined algorithm. The hyperparameters chosen are α = 1, τ = 0.1 and ν = b + 2. In this situation, our algorithm recovers the exact number of clusters and their allocations. The value of the objective function for the best solution found is −561.16, while for the true allocations and the Model Based Clustering solution the exactICL values are equal to −584.33 and −577.74 respectively.

Dataset 2. In the second dataset 150 points divided into 5 groups are generated, using as hyperparameters α = 10, µ = (0, 0)^t, ν = b + 2, τ = 0.05 and the identity matrix as ξ. The true allocations of the data points are represented in the top left part of Figure 6.2. In this case all the clusters have approximately rounded shapes of similar size, and two pairs of groups overlap slightly. The Model Based Clustering routine overestimates the number of groups, creating 6 clusters of similar shape and size (top right in Figure 6.2). As for the Greedy Combined algorithm, we offer two possible clustering solutions, corresponding to two different choices of hyperparameters. In the first case (represented in the bottom left image), the hyperparameters used are α = 10, µ = (0, 0)^t, ν = b + 2, τ = 0.05 and ξ equal to the identity matrix. In the second, the hyperparameter ν is changed to 12, so that the prior information gives more support to groups of round shape. The outcome differs in the two cases: in the first, larger clusters of elliptical shape are formed and the number of groups is underestimated; in the second, the best configuration is very close to the real one and the number of groups is estimated correctly. Under the first choice of hyperparameters, the values of the objective function are −350.77, −356.75 and −317.89 for the true allocations, Model Based Clustering and the Greedy Combined algorithm respectively. For the second choice the values (in the same order) are −325.06, −319.48 and −308.21.
Figure 6.2: In the top left image the true allocations of the data points are shown. The Model Based Clustering allocations are shown in the top right image, where the number of groups is again overestimated. In the bottom row, two different optimal configurations obtained with the Greedy Combined algorithm are shown, corresponding to two different choices of hyperparameters. The one on the left favours elliptical shapes and underestimates the number of groups, while the one on the right promotes round shapes and recovers the number of groups correctly. Misclassified observations are plotted with empty symbols.
Figure 7.1: Galaxy dataset. The left image shows a histogram of the data. The centre image gives an expanded representation, where the x axis corresponds to the galaxy label and the galaxy velocity is on the y axis. The right image shows the resulting Greedy Combined algorithm clustering solution.
7 Real data applications
Galaxy data. The dataset considered contains the velocities of 82 distant galaxies diverging from the Milky Way. The data were first analysed from a statistical point of view in Roeder [17], and have drawn interest because of the hypothesis that galaxies should cluster according to their velocity. The data, in the form of a histogram and of an expanded scatterplot, are shown in the left and central images of Figure 7.1. Default hyperparameters are chosen, as described in Section 5 for the univariate case. The algorithm is run from 10 different initial random configurations with 15 iterations each. The resulting best solution is shown in the right part of Figure 7.1. The corresponding best model has 3 clusters, of which the 2 groups on the sides are much smaller than the central one.

Diabetes data. The diabetes dataset was first studied in Reaven and Miller [15]. A mixture modeling approach was first tried in Fraley and Raftery [7], proving efficient compared with many other classifiers due to the narrow elliptical shape of the clusters. The data are divided into 3 groups, corresponding to 3 different clinical stages: "Normal" (76 observations), "Chemical" (36 observations) and "Overt" (33 observations). For each observation, the observed variables are:

• glucose: plasma glucose response to oral glucose
• insulin: plasma insulin response to oral glucose
• sspg: degree of insulin resistance.
The Greedy Combined algorithm was run as before, 10 times with 15 iterations each, and the hyperparameters chosen were the default settings with ν = b + 1. Figure 7.2 represents the final solution obtained with the algorithm, where the empty symbols correspond to the misclassified observations.
Figure 7.2: Diabetes dataset. Scatter plots of the 3 variables are shown, with colours and symbols representing the Greedy Combined algorithm solution obtained with default parameters.

The value of the objective function at the configuration found is −297.13, while the exactICL for the true allocations is −358.19. The number of clusters corresponding to the best solution is correct. However, under the choice of hyperparameters proposed, the best configuration obtained is quite different from the correct classification, in that the size of the "Normal" group is evidently overestimated, mainly at the expense of the "Chemical" group. The "Overt" group, on the other hand, is well classified.
8 Final remarks
In this paper we have shown that in a Gaussian mixture model, under the assumption of conjugate priors, an exact formula for the integrated completed likelihood can be obtained. A greedy algorithm, called the Greedy Combined algorithm, has been proposed to optimize the integrated completed likelihood with respect to the allocations of the observations. The resulting configuration, by definition, maximizes the exactICL, thereby yielding a clustering solution for a selected number of mixture components. Furthermore, in contrast to other existing methods, an exact value, rather than an approximate one, is maximized. The optimization routine proposed stems from the Iterated Conditional Modes of Besag [2] and extends similar approaches presented in Wyse, Friel, and Latouche [19] and Côme and Latouche [6], improving efficiency and adapting them to a different framework. As shown in Wyse, Friel, and Latouche [19], the greedy procedure can be thought of as a deterministic Gibbs sampler that updates each variable with the argmax of the corresponding full conditional rather than sampling from it. In more detail, the main change consists of the introduction of a combined update of groups of allocation variables, which ensures a much higher probability of convergence to the global optimum at a negligible additional computational cost. As far as we know, this technique has not been tried before on similar problems, and the same idea could be applied to other frameworks, such as Stochastic Block Models, to improve efficiency.

The mixture model introduced is a special case of that presented in Nobile and Fearnside [14]. The difference lies in the fact that a symmetric assumption is needed on the hyperparameters, since the space where they are defined depends on K, which is a variable involved in the optimization. However, a possible extension of our algorithm may include hyperparameters defined differently for each K. This would require more work in choosing the hyperparameters, but it would give more options to describe the prior information. One might think that another way to avoid this problem is to introduce further hierarchy levels on the hyperparameters. However, this would introduce an inconsistency in the optimization routine, since for each K the ICL would be a probability density defined on spaces of different dimensions; a comparison of the exactICL values between two configurations representing different numbers of groups would thus be meaningless. Another point to make is that many other statistical models offer the possibility of integrating out all the parameters, so that an exact formula for the model evidence is obtainable. In Nobile and Fearnside [14] different possibilities are proposed for different conjugate pairs. The Greedy Combined algorithm can therefore be straightforwardly extended to other mixture frameworks to maximize the exactICL.

Applications of the routine to clustering problems have been proposed. The algorithm has been run both on simulated datasets and on real data. The framework described appears to be consistent and to give meaningful clustering solutions in all the cases described. In the simulated datasets, as well as in the galaxy data, the clustering configurations proposed give a good interpretation of the data, and in some cases the results are appreciably different from those obtained with Model Based Clustering. Although such a comparison is not particularly relevant, it is important to stress that the global optimum of the exactICL
is never known. As a pragmatic approach, clustering should therefore always be performed with other routines as well, to check whether the Greedy Combined algorithm solution is better than the exactICL values attained by each of these clustering solutions. Indeed, a useful approach could be to use the clustering solutions obtained with different algorithms as starting points for the Greedy Combined algorithm, to see if new parts of the search space can be explored. In the diabetes dataset, the clustering solution proposed has the correct number of groups but differs from the true configuration. However, the solution makes sense from the modeling point of view.

The framework described in this work can be extended to the supervised case. Say we can split the data into a training set x and a test set x′, where for the former the allocations z are known, while for the latter they are not. Our goal is then to classify the observations in the test set, i.e. to find their allocations z′. We simply run the Greedy Combined algorithm on the complete dataset given by the positions y = (x, x′) and the allocations η = (z, z′), taking care to optimize only over the unknown allocations, leaving the others fixed. In other words, we want to find the allocations z′ such that the exactICL of the complete data is maximized. In the situation described, the search space has a size proportional to O((K_max)^{n′}), where n′ is the number of observations in x′. Also, the number of groups for the complete data has a lower bound given by the number of groups in the training set.

All the code has been written in R, making use of the package mclust (Fraley and Raftery [8]) to perform Model Based Clustering. Due to the heavy use of loops, the code turned out to be quite slow; however, such difficulties can be overcome by writing the algorithm in a more efficient programming language, where one would expect substantial improvements in computational time.
Acknowledgements

Marco Bertoletti completed this work while visiting the School of Mathematical Sciences, UCD, as part of an M.Sc. project. The Insight Centre for Data Analytics is supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289. Nial Friel and Riccardo Rastelli's research was also supported by a Science Foundation Ireland grant: 12/IP/1424.
References

[1] J.-P. Baudry, A.E. Raftery, G. Celeux, K. Lo, and R. Gottardo. "Combining mixture components for clustering". In: Journal of Computational and Graphical Statistics 19.2 (2010).
[2] J. Besag. "On the statistical analysis of dirty pictures". In: Journal of the Royal Statistical Society, Series B (Methodological) (1986), pp. 259–302.
[3] C. Biernacki, G. Celeux, and G. Govaert. "Assessing a mixture model for clustering with the integrated completed likelihood". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 22.7 (2000), pp. 719–725.
[4] C. Biernacki, G. Celeux, and G. Govaert. "Exact and Monte Carlo calculations of integrated likelihoods for the latent class model". In: Journal of Statistical Planning and Inference 140.11 (2010), pp. 2991–3002.
[5] V.D. Blondel, J.L. Guillaume, R. Lambiotte, and E. Lefebvre. "Fast unfolding of communities in large networks". In: Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008), P10008.
[6] E. Côme and P. Latouche. "Model selection and clustering in stochastic block models with the exact integrated complete data likelihood". In: arXiv preprint arXiv:1303.2962 (2013).
[7] C. Fraley and A.E. Raftery. "How many clusters? Which clustering method? Answers via model-based cluster analysis". In: The Computer Journal 41.8 (1998), pp. 578–588.
[8] C. Fraley and A.E. Raftery. MCLUST version 3: an R package for normal mixture modeling and model-based clustering. Tech. rep. DTIC Document, 2006.
[9] N. Friel, C. Ryan, and J. Wyse. "Bayesian model selection for the latent position cluster model for Social Networks". In: arXiv preprint arXiv:1308.4871 (2013).
[10] P.J. Green. "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination". In: Biometrika 82.4 (1995), pp. 711–732.
[11] A.F. McDaid, T.B. Murphy, N. Friel, and N.J. Hurley. "Improved Bayesian inference for the stochastic block model with application to large networks". In: Computational Statistics & Data Analysis 60 (2013), pp. 12–31.
[12] A.F. McDaid, T.B. Murphy, N. Friel, and N.J. Hurley. "Model-based clustering in networks with stochastic community finding". In: arXiv preprint arXiv:1205.1997 (2012).
[13] M.E.J. Newman. "Fast algorithm for detecting community structure in networks". In: Physical Review E 69.6 (2004), p. 066133.
[14] A. Nobile and A.T. Fearnside. "Bayesian finite mixtures with an unknown number of components: The allocation sampler". In: Statistics and Computing 17.2 (2007), pp. 147–162.
[15] G.M. Reaven and R.G. Miller. "An attempt to define the nature of chemical diabetes using a multidimensional analysis". In: Diabetologia 16.1 (1979), pp. 17–24.
[16] S. Richardson and P.J. Green. "On Bayesian analysis of mixtures with an unknown number of components (with discussion)". In: Journal of the Royal Statistical Society, Series B 59.4 (1997), pp. 731–792.
[17] K. Roeder. "Density estimation with confidence sets exemplified by superclusters and voids in the galaxies". In: Journal of the American Statistical Association 85.411 (1990), pp. 617–624.
[18] J. Wyse and N. Friel. "Block clustering with collapsed latent block models". In: Statistics and Computing 22.2 (2012), pp. 415–428.
[19] J. Wyse, N. Friel, and P. Latouche. "Inferring structure in bipartite networks using the latent block model and exact ICL". In: arXiv preprint arXiv:1404.2911 (2014).