Identification of MCMC Samples for Clustering

Kenichi Kurihara, Tsuyoshi Murata, and Taisuke Sato

Tokyo Institute of Technology, Tokyo, Japan
[email protected], [email protected], [email protected]
Abstract. For clustering problems, many studies show clustering results using only the MAP assignments rather than the whole set of samples drawn by an MCMC sampler. This is because it is not straightforward to recognize clusters based on whole samples. We therefore propose an identification algorithm which constructs groups of relevant clusters. The identification exploits spectral clustering to group the clusters. Although a naive spectral clustering algorithm is intractable in terms of memory space and computational time, we develop a memory- and time-efficient spectral clustering for the samples of an MCMC sampler. In experiments, we show our algorithm is tractable for real data while the naive algorithm is intractable. For search query log data, we also show representative vocabularies of clusters, which cannot be chosen from MAP assignments alone.
1 Introduction
Clustering is one of the most widely used methods for data analysis. In recent years, non-parametric Bayesian approaches have received a great deal of attention, especially for clustering. Unlike traditional clustering algorithms, non-parametric Bayesian approaches do not require the number of clusters as an input. Thus, they are very useful in practice, especially when we do not have any information about the number of clusters a priori. More specifically, the Dirichlet process mixture models [1, 2] have been used for machine learning and natural language processing [3, 4]. A Markov chain Monte Carlo (MCMC) sampler, e.g. the Gibbs sampler, is typically utilized to estimate a posterior distribution [3-5], while deterministic inference algorithms have also been proposed [6, 7]. Since the MCMC sampler is unbiased, we can estimate the true posterior distribution by collecting many samples from a sampler, and inference on test data is done using the collected samples. However, when the task is clustering, many studies use just the maximum a posteriori (MAP) assignments to show clustering results. This is because it is not straightforward to show clustering results with whole samples instead of just MAP assignments. Figure 1 shows an example. The left figure shows MAP assignments by a Gibbs sampler of a Dirichlet process Gaussian mixture. The middle figure shows all of the samples collected by the sampler. It is easy to see each cluster and the number of clusters, which is four, in the left figure.
Fig. 1. MAP assignments (left), posterior samples from an MCMC sampler (middle) and identified groups of clusters (right)
However, clusters overlap each other in the middle figure, so it is no longer easy to distinguish clusters. The goal of this study is to obtain the right figure from the middle one. This is done by identifying clusters from samples. In the right figure, each color represents one identified group consisting of clusters relevant to each other. This identification is more informative for clustering than MAP assignments. For example, when the task is vocabulary clustering, we can estimate the probability that a vocabulary is assigned to a group based on the clusters in the group. This allows us to rank the vocabularies in the group. Such a ranking is impossible with MAP assignments alone. In this paper, we propose an identification algorithm. To make it tractable, we also develop an efficient spectral clustering algorithm for samples. This paper is organized as follows. In Sections 2 and 3, we briefly review the Dirichlet process mixture models [1, 2] and the Gibbs sampler, which we use as a clustering algorithm. In Section 4, we propose an identification algorithm. Section 5 shows experimental results. We discuss and conclude this study in Sections 6 and 7.
2 The Dirichlet Process Mixture Models
We briefly review the Dirichlet process mixture models [1, 2] here. Finite mixture models consist of finitely many components, and the number of components is constant. In contrast, the generative process of the Dirichlet process mixture models draws a new object from either one of the existing components or a new component. Thus, the number of components may grow as the number of objects grows. Let $Z$ be an $n$ by $K$ assignment matrix, where $n$ is the number of objects and $K$ is the number of clusters. Note $K$ grows as $n$ grows, thus it is not constant. When object $i$ is assigned to cluster $c$, $Z_{i,c} = 1$ and $Z_{i,c'} = 0$ for $c' \neq c$. We denote the $i$-th row and the $j$-th column of $Z$ by $Z_i$ and $\zeta_j$, respectively. An example of assignments with the notations $Z$, $Z_i$ and $\zeta_j$ is presented in Figure 2.
$$
Z = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}
\;\Leftrightarrow\;
Z_1 = (1\;0),\; Z_2 = (1\;0),\; Z_3 = (0\;1)
\;\Leftrightarrow\;
\zeta_1 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix},\;
\zeta_2 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
$$

Fig. 2. Matrix and graphical representations of assignments. Objects 1 and 2 are clustered into one cluster, and 3 is clustered into another.
When $(Z_1, ..., Z_{i-1})$ is given, the $i$-th object would be assigned to the $c$-th component with probability,
$$
p(Z_{i,c} = 1 \mid Z_1, ..., Z_{i-1}) =
\begin{cases}
\dfrac{n_c^{\neg i}}{i-1+\alpha} & \text{if } n_c^{\neg i} > 0 \\[4pt]
\dfrac{\alpha}{i-1+\alpha} & \text{otherwise}
\end{cases}
\tag{1}
$$
where $n_c^{\neg i}$ is the number of objects currently assigned to component $c$, i.e. $n_c^{\neg i} = \sum_{j \neq i} Z_{j,c}$, and $\alpha$ is a hyperparameter. Every object can be assigned to a new component with probability $\frac{\alpha}{i-1+\alpha}$. Given $n$ objects, the probability of $Z$ is,
$$
p(Z) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{c=1}^{K} \alpha\, \Gamma(n_c)
\tag{2}
$$
where $n_c$ is the number of objects assigned to cluster $c$. After $Z_i$ is drawn, when $Z_{i,c} = 1$, data point $x_i$ is sampled as $x_i \sim p(x_i \mid \theta_c)$, where $\theta_c$ is the parameter of component $c$. Each $\theta_c$ is drawn from $G_0$, called the base distribution. Thus, the joint probability of $X = (x_1, ..., x_n)$ and $Z$ is,
$$
p(X, Z) = p(Z) \prod_{c=1}^{K} \int d\theta_c\, G_0(\theta_c) \prod_{i; Z_{i,c}=1} p(x_i \mid \theta_c).
\tag{3}
$$
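As a concrete illustration of the prior (1), the following is a minimal sketch of drawing assignments from its Chinese-restaurant representation (Python with NumPy; the function name and interface are our own, not part of the paper):

import numpy as np

def sample_crp_assignments(n, alpha, seed=None):
    # Draw assignments Z for n objects following Eq. (1): an existing component c
    # is chosen with probability proportional to n_c, a new component with
    # probability proportional to alpha.
    rng = np.random.default_rng(seed)
    z = np.empty(n, dtype=int)
    counts = []                              # counts[c] = objects in component c
    for i in range(n):
        weights = np.array(counts + [alpha], dtype=float)
        c = rng.choice(len(weights), p=weights / weights.sum())
        if c == len(counts):                 # a brand-new component was chosen
            counts.append(0)
        counts[c] += 1
        z[i] = c
    return z

Drawing $\theta_c \sim G_0$ for each resulting component and $x_i \sim p(x_i \mid \theta_c)$ on top of these assignments then generates data from the joint model (3).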
In the next section, we derive a Gibbs sampler for the Dirichlet process mixture models based on (3).
3 Gibbs Sampling for the Dirichlet Process Mixture Models
When a task is clustering, our goal is to infer $p(Z|X) = p(Z, X)/p(X)$. Although $p(X, Z)$ is given in (3), $p(X) = \sum_Z p(X, Z)$ is intractable due to $\sum_Z$, which requires exponential-order computation. Instead of directly computing $p(Z|X)$, people use MCMC samplers, including the Gibbs sampler, or variational algorithms. In the case of MCMC samplers, $Z$ is drawn from $p(Z|X)$ many times. The true posterior, $p(Z|X)$, is approximated by the collection of $Z$. The Gibbs sampler for the Dirichlet process mixture updates one assignment, $Z_i$, in a step. Let $Z^{\neg i}$ be a matrix consisting of the rows of $Z$ except for the $i$-th row,
Algorithm 1 Gibbs sampler for the Dirichlet process mixture models.
1: Initialize Z randomly with K clusters.
2: repeat
3:   Choose i ∈ {1..N} uniformly.
4:   Update Z_i based on the probability given by (4).
5: until convergence
and $X_c^{\neg i}$ be $\{x_j ; Z_{j,c} = 1\}$. Since $p(Z_i \mid X, Z^{\neg i}) \propto p(X, Z)$ holds, $p(Z_i \mid X, Z^{\neg i})$ is derived from (3) as,
$$
p(Z_{i,c} = 1 \mid X, Z^{\neg i}) \propto p(Z_{i,c} = 1 \mid Z^{\neg i})\, p(x_i, X_c^{\neg i})/p(X_c^{\neg i})
= \begin{cases}
n_c^{\neg i}\, p(x_i, X_c^{\neg i})/p(X_c^{\neg i}) & \text{if } n_c^{\neg i} > 0 \\
\alpha\, p(x_i, X_c^{\neg i})/p(X_c^{\neg i}) & \text{otherwise.}
\end{cases}
\tag{4}
$$
Note that $p(x_i, X_c^{\neg i})$ and $p(X_c^{\neg i})$ have closed forms when $G_0$ is a conjugate prior. We summarize this Gibbs sampler in Algorithm 1.
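Concretely, one update of Algorithm 1 could be sketched as follows, where log_predictive stands in for the closed-form conjugate predictive $\log p(x_i, X_c^{\neg i}) - \log p(X_c^{\neg i})$ mentioned above (the interface and names are our own assumptions, not the paper's):

import numpy as np

def gibbs_update(i, z, X, alpha, log_predictive, seed=None):
    # Resample the assignment of object i given all other assignments, Eq. (4).
    # `log_predictive(x, Xc)` must return log p(x, Xc) - log p(Xc) under G0.
    rng = np.random.default_rng(seed)
    others = np.arange(len(z)) != i
    labels = list(np.unique(z[others]))            # clusters that stay non-empty

    log_w = []
    for c in labels:                               # existing clusters: weight n_c^{-i}
        members = others & (z == c)
        log_w.append(np.log(members.sum()) + log_predictive(X[i], X[members]))
    log_w.append(np.log(alpha) + log_predictive(X[i], X[:0]))   # new cluster: weight alpha

    log_w = np.asarray(log_w)
    p = np.exp(log_w - log_w.max())
    c = rng.choice(len(p), p=p / p.sum())
    z[i] = labels[c] if c < len(labels) else z.max() + 1
    return z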
4 Identification of Samples
4.1 Preliminaries
Before we propose an identification algorithm, we introduce notations. We distinguish the following notations: a sample $Z^{(i)}$, the collection of samples $S \equiv \{Z^{(1)}, ..., Z^{(S)}\}$, a cluster $\zeta_i$, the collection of clusters $C(S)$, and a group of clusters $G_j = \{\zeta_{j_1}, \zeta_{j_2}, ...\}$. Figure 3 shows an example of these notations. Note $C(S)$ is a multiset. $S$ is collected by an MCMC sampler, i.e. $Z^{(i)} \sim p(Z|X)$. People typically use only the MAP assignments, $Z^{\mathrm{MAP}}$, to show clustering results, where $Z^{\mathrm{MAP}} = \operatorname{argmax}_{Z \in S} p(Z|X)$. This is because it is very straightforward (see the left figure in Figure 1). Instead, we propose to use the whole set of samples, $S$, by identification (see the middle and right figures in Figure 1).

4.2 Identification by Clustering
Our task is to find groups $\{G_c\}_{c=1}^{k}$ such that $\bigcup_{c=1}^{k} G_c = C(S)$.¹ Each group, $G_c$, consists of relevant clusters. To find such groups, we propose an identification algorithm which groups all clusters in $C(S)$, i.e. clustering on clusters. The right figure in Figure 1 is an example of the identification. Clusters which have the same color are clustered into one group by the identification method. The identification leads to better interpretation of clustering results. For example, when data is discrete, e.g. vocabularies, showing the vocabularies contained in one cluster is a typical way to inspect clustering results. However, when we use only the MAP assignments, $Z^{\mathrm{MAP}}$, we cannot estimate the probability of a vocabulary being assigned to a cluster.
¹ Note $G_c$ and $C(S)$ are multisets. Thus, regard $\cup$ as a union operator for multisets.
Fig. 3. Collection of samples $S$, collection of clusters $C(S)$ and groups of clusters $\{G_c\}_{c=1}^{k}$. Note that $G_c$ is a multiset.
We exploit spectral clustering [8] to find the groups of clusters, $\{G_c\}_{c=1}^{k}$. Many studies have shown that spectral clustering outperforms traditional clustering algorithms. When we apply spectral clustering to $C(S)$, a problem is that the number of clusters in $C(S)$, i.e. $|C(S)|$, can be very large. For example, when the number of samples, $S$, is 200, and the average number of clusters in each sample is 50, then $|C(S)|$ is roughly 10,000. This is too large for spectral clustering, which can handle up to thousands of objects on typical PCs. The computational time and memory space of spectral clustering are of the order of $|C(S)|^2$, because spectral clustering requires solving for the eigenvectors of a $|C(S)|$ by $|C(S)|$ affinity matrix.

4.3 Unique Spectral Clustering
We ease this problem by exploiting the fact that $C(S)$ contains many identical clusters, while still solving the exact problem. For example, in the case of Figure 4, $C(S)$ contains some clusters twice. We denote the unique clusters in $C(S)$ by $\tilde{C}(S)$. Let $n$ be $|C(S)|$ and $\tilde{n}$ be $|\tilde{C}(S)|$. We show that the eigenvalue problem in spectral clustering can be solved in $O(\tilde{n}^2 + n)$ time and memory instead of $O(n^2)$. In our experiments, we actually observe that $\tilde{n} \equiv |\tilde{C}(S)|$ is just 34.0% of $n \equiv |C(S)|$. Algorithm 2 is the pseudo code of spectral clustering with the normalized cut algorithm [8]. In step 3 of Algorithm 2, we need to solve an eigenvalue problem. This requires $O(kn^2)$ computational time and memory space in a naive way, where $k$ is the number of eigenvectors. But we can reduce it to $O(k(n + \tilde{n}^2))$ using the following theorem.

Theorem 1. Finding the $k$ smallest eigenvectors of the normalized graph Laplacian $L$ requires $O(k(n + \tilde{n}^2))$ time and memory space, where $n = |C(S)|$ and $\tilde{n} = |\tilde{C}(S)|$.
Fig. 4. Collection of all clusters $C(S)$ and unique clusters $\tilde{C}(S)$.
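As an illustration of how $C(S)$, $\tilde{C}(S)$ and the sparse map $M$ used in Theorem 1 might be assembled from the samples, here is a sketch (Python with NumPy/SciPy; the data layout, one integer label per object per sample, and all names are our assumptions):

import numpy as np
from scipy import sparse

def clusters_from_sample(z):
    # One sample Z, given as an integer label per object, yields its clusters
    # as frozensets of object indices (one per column zeta of Z).
    clusters = {}
    for obj, label in enumerate(z):
        clusters.setdefault(label, set()).add(obj)
    return [frozenset(members) for members in clusters.values()]

def build_unique_index(samples):
    # Build C(S), the unique clusters C~(S), and the sparse map M with
    # M[u, j] = 1 iff the j-th cluster in C(S) equals unique cluster u.
    all_clusters = []                         # C(S), a multiset of clusters
    for z in samples:
        all_clusters.extend(clusters_from_sample(z))

    unique_ids = {}                           # cluster -> row index in C~(S)
    rows, cols = [], []
    for j, cluster in enumerate(all_clusters):
        u = unique_ids.setdefault(cluster, len(unique_ids))
        rows.append(u)
        cols.append(j)

    n, n_tilde = len(all_clusters), len(unique_ids)
    M = sparse.csr_matrix((np.ones(len(cols)), (rows, cols)), shape=(n_tilde, n))
    unique_clusters = sorted(unique_ids, key=unique_ids.get)
    return all_clusters, unique_clusters, M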
Algorithm 2 Spectral Clustering with the Normalized Cut
1: Construct the affinity matrix $W$, s.t. $W_{i,j}$ is the similarity of objects $i$ and $j$, which takes a value between 0 and 1.
2: Calculate the normalized graph Laplacian, $L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, where $I$ is the identity matrix and $D$ is the degree matrix, which is diagonal with $D_{ii} = \sum_{j=1}^{n} W_{ij}$.
3: Obtain the $k$ smallest eigenvectors, $\{u_1, ..., u_k\}$, of $L$. Let $Y$ be a $k$ by $n$ matrix where $Y_{i,j} = u_{i,j} / (\sum_{i'=1}^{k} u_{i',j}^2)^{\frac{1}{2}}$.
4: Find assignments by k-means on $Y$, treating each column of $Y$ as a mapped data point.
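For reference, a direct NumPy rendering of Algorithm 2 on a dense affinity matrix might look like the following sketch (here the matrix of mapped points is stored as the transpose of the paper's k by n matrix Y, with one row per object):

import numpy as np
from scipy.cluster.vq import kmeans2

def normalized_cut_spectral_clustering(W, k):
    # Algorithm 2: normalized Laplacian, k smallest eigenvectors, normalization,
    # then k-means on the mapped points.
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    _, eigvecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    U = eigvecs[:, :k]                             # n x k, the k smallest eigenvectors
    Y = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    _, labels = kmeans2(Y, k, minit='++')          # each row of Y is a mapped point
    return labels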
Algorithm 3 Proposed Algorithm
1: Collect samples $S$ from an MCMC sampler.
2: Find the unique clusters $\tilde{C}(S)$ of $C(S)$.
3: Construct $M$ and $\tilde{L}$.
4: Apply spectral clustering as if $M^T \tilde{L} M$ were a normalized graph Laplacian to find the identified groups $\{G_c\}_{c=1}^{k}$, where $\bigcup_{c=1}^{k} G_c = C(S)$.
Proof. $L$ is an $n$ by $n$ normalized graph Laplacian. $L$ has an $\tilde{n}$ by $\tilde{n}$ submatrix $\tilde{L}$ such that $L = M^T \tilde{L} M$, where $M$ is a sparse binary matrix s.t. $M_{i,j} = M_{i,j'} = 1$ if and only if $\zeta_j = \zeta_{j'}$, where $\zeta_j, \zeta_{j'} \in C(S)$. Thus, $Lv = M^T \tilde{L} M v$ holds. This calculation requires $O(n + \tilde{n}^2)$. Typical algorithms for the eigenvector problem, e.g. the QR algorithm, simply repeat this calculation. Clearly, the memory space required by the above computation is $O(k(n + \tilde{n}^2))$ as well. ⊓⊔

Directly from Theorem 1, the computational time of spectral clustering is also reduced to $O(k(n + \tilde{n}^2))$. We call this efficient spectral clustering unique spectral clustering. Again, remember that the solution given by unique spectral clustering is the same as the solution given by naive spectral clustering. In our experiments in Section 5, we only use the spectral clustering proposed by [8], but we can reduce the computational complexity of other spectral clustering algorithms [9, 10] as well. Spectral clustering requires the affinity matrix $W$ in step 1 of Algorithm 2. Thus, we need to define a similarity between clusters $\zeta_i$ and $\zeta_j$. Although we could use any similarity, we use cosine in this paper, i.e. $W_{i,j} = \cos(\zeta_i, \zeta_j) = \zeta_i^T \zeta_j / (\|\zeta_i\| \|\zeta_j\|)$. Algorithm 3 summarizes the proposed algorithm.
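The unique spectral clustering itself can be sketched with a matrix-free linear operator, so that the n by n Laplacian is never materialized (SciPy; W_unique is the ñ by ñ cosine affinity among the unique clusters and M is the map built above; all names are our own):

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse.linalg import LinearOperator, eigsh

def unique_spectral_clustering(W_unique, M, k):
    # The full affinity is W = M^T W_unique M, but W is never formed; every
    # matrix-vector product below costs O(n + n~^2), matching Theorem 1.
    n_tilde, n = M.shape
    counts = np.asarray(M.sum(axis=1)).ravel()     # multiplicity of each unique cluster
    d = M.T @ (W_unique @ counts)                  # degrees of the n duplicated clusters
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))

    def matvec(v):
        # N v with N = D^{-1/2} W D^{-1/2} = I - L
        x = d_inv_sqrt * np.ravel(v)
        return d_inv_sqrt * (M.T @ (W_unique @ (M @ x)))

    N = LinearOperator((n, n), matvec=matvec, dtype=float)
    _, U = eigsh(N, k=k, which='LA')               # largest of N = smallest of L
    Y = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    _, labels = kmeans2(Y, k, minit='++')
    return labels                                  # one group id per cluster in C(S)

The returned labels group the elements of C(S) exactly as Algorithm 2 would on the full affinity matrix, since only the representation of the eigenvalue problem changes.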
Fig. 5. A toy example of the proposed algorithm. The left figure shows input data, and the right is identified clusters, where ellipses that have the same color are assigned to one group. The identified groups are consistent with the five original crowds in the left plot.
5 Experiments
In this section, we show how the proposed identification works. We apply the proposed algorithm to two data sets, one synthetic and one real. We use the synthetic data set to evaluate the quality of the proposed algorithm, and the real data set to demonstrate tractability and to show identified groups.
5.1 Quality of Identification
In general, it is not trivial to evaluate clustering results because the data to be clustered typically does not have labels. Evaluating the proposed algorithm is non-trivial for the same reason. In this paper, we only use an ideal synthetic data set for which we can easily evaluate the quality of the identification produced by the proposed algorithm. The data set is shown in the left plot of Figure 5; it consists of five crowds of points. Thus, the ideal grouping of clusters will find groups consistent with the five crowds. We first collect samples with the Gibbs sampler for a Dirichlet process Gaussian mixture [3], then apply the identification algorithm to find groups of the sampler's clusters. In this experiment, we know the ideal number of groups is five. Thus, we use five as the input of the identification algorithm to identify five groups. The ellipses in the right plot of Figure 5 are clusters sampled by the sampler. Ellipses, i.e. clusters, that have the same color are assigned to one group. For example, the sampler drew many clusters, shown as ellipses, for the right-bottom crowd in the right plot of Figure 5. The algorithm grouped these clusters into one group, which is indeed an ideal grouping. As the right plot in Figure 5 shows, the discovered groups are completely consistent with the five crowds. From this experiment, the proposed algorithm appears to work well in an ideal case.
Fig. 6. Computational time and memory space of the spectral clusterings (left and middle, respectively) and the percentage of unique clusters in C(S) (right). The standard deviations are shown in these figures, although they are too small to be visible.
5.2 Tractability of the Identification Algorithm on Real Data
In this section, we apply the identification algorithm to a real data set. First of all, we show that the proposed algorithm can handle many more samples than a naive algorithm on a real data set. Next, we apply the identification algorithm to vocabulary clustering. We estimate the probability of a vocabulary being assigned to an identified cluster, and show representatives of each cluster, which is impossible without identification. The data we use is search query log data. The analysis of query data has become very important. It is clearly useful for improving search quality. Another example is the improvement of advertisement on the web. As the market of advertisement on the web grows, advertisers need to choose the words on which they put advertisements to maximize the benefits from the advertisements. In many cases, each query consists of a couple of words instead of a single word, e.g. "weather Tokyo". Thus, the analysis of pairs of words in a single query is expected to discover informative structure among words. We collected the data from SearchSpy (http://www.dogpile.com/info.dogpl/searchspy) from January 2007 to April 2007. In this period, we observed more than a million vocabularies, but we omit vocabularies that were not used more than 1,600 times in our experiments. Finally, we use 4,058 vocabularies for clustering. We apply relational clustering to the data. We construct a binary matrix, $R$, where $R_{i,j} = 1$ if and only if vocabulary $i$ and vocabulary $j$ are used in the same query. We use the infinite relational model (IRM) [11, 12], which utilizes the Dirichlet process, as the relational clustering algorithm. We run a Gibbs sampler for the IRM, which is described in Appendix A.2.

Tractability. We first compare the unique spectral clustering proposed in Section 4 with a naive spectral clustering, because the tractability of spectral clustering determines the tractability of the identification algorithm. The results are shown in Figure 6. We vary the number of samples, |S|, from 40 to 160 in the left and middle figures, and from 150 to 600 in the right figure. For each |S|, we run the two spectral clustering algorithms 10 times.
Table 1. 10 representative vocabularies in identified clusters, $G_7$ and $G_{39}$. It seems $G_7$ consists of the names of places and something relating to places, and $G_{39}$ includes possible properties of $G_7$.

$G_7$:    ca, atlanta, carolina, alabama, nc, kansas, nj, colorado, fort, pennsylvania, ...
$G_{39}$: owner, lakes, lottery, avenue, zoo, landscape, camping, comfort, flights, hilton, ...
Figure 6 shows the average and the standard deviation over the 10 trials. The left and middle figures in Figure 6 show the computational time and memory space of the two spectral clustering algorithms. When the number of samples is 160, unique spectral clustering is 5.5 times faster than the naive algorithm, and needs less than 10% of the memory space of the naive one. The right figure in Figure 6 depicts the percentage of unique clusters in $C(S)$. When the number of samples is 600, it is 34.0%. At this point, $n$ ($\equiv |C(S)|$) reaches 24,000. Thus, naive spectral clustering needs to solve the eigenvalue problem of a 24,000 by 24,000 matrix, which does not even fit into 4 GB of memory, even when the matrix is constructed in a sparse representation. On the other hand, unique spectral clustering only needs to solve an approximately 8,200 by 8,200 matrix, which is still tractable, i.e. $\tilde{n} \equiv |\tilde{C}(S)| = 8{,}200$. In this paper, we show the tractability only by experiments; unfortunately, it does not seem easy to characterize the reduction from $n$ to $\tilde{n}$ in theory.

Identified Clusters. We applied the proposed algorithm to 300 samples drawn from a Gibbs sampler to identify clusters. The gross number of clusters and the number of unique clusters in the 300 samples are approximately 12,000 and 4,800, respectively. Since the average number of clusters in the 300 samples is 40, we identify 40 groups, i.e. 40 is the input parameter of the spectral clustering. The proposed unique spectral clustering takes two minutes for this data. As an example, we pick out $G_7$ and $G_{39}$. Their representative vocabularies are listed in Table 1. These vocabularies have probability one of being assigned to either $G_7$ or $G_{39}$. The assignment probability can be estimated by the identification over a number of samples from $p(Z|X)$. In other words, the probability cannot be estimated from $Z^{\mathrm{MAP}}$ alone. The inference by the IRM tells us that they are likely to co-occur, i.e. vocabularies in $G_7$ and $G_{39}$ are used in the same search query with relatively high probability. Indeed, one can imagine situations in which these vocabularies are used together as a search query.
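For concreteness, the assignment probability mentioned above could be estimated from the identified groups along the following lines (the cluster_to_group mapping is a hypothetical output of the identification step; names are ours, not the paper's):

import numpy as np

def group_membership_probability(word, samples, cluster_to_group):
    # Estimate p(word belongs to group G): in each sample, take the cluster that
    # contains the word, look up the identified group of that cluster, and
    # average over samples.
    counts = {}
    for z in samples:
        z = np.asarray(z)
        cluster = frozenset(int(j) for j in np.flatnonzero(z == z[word]))
        g = cluster_to_group[cluster]
        counts[g] = counts.get(g, 0) + 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}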
6 Discussion
We have proposed unique spectral clustering, which is efficient for the samples of an MCMC sampler. One may think we could also apply k-means, which has linear computational time and memory space; in this case, $\zeta_i$ would be used as the feature vector of the $i$-th cluster. However, k-means does not necessarily fit $\{\zeta_i\}_{i=1}^{n}$, because $\zeta_i$ is a binary vector, while k-means is derived from a mixture of spherical multivariate Gaussians. We have used cosine as the similarity measure for clusters, but other similarities can be used as well. For example, $\mathrm{similarity}(\zeta_i, \zeta_j) \equiv 1 - \mathrm{HammingDist}(\zeta_i, \zeta_j)/d$ is also available, where $d$ is the length of the vector $\zeta_i$. We leave the study of other similarities as future work.
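For reference, the two similarity choices amount to the following (a sketch assuming binary NumPy vectors; not part of the paper):

import numpy as np

def cosine_similarity(zeta_i, zeta_j):
    # Cosine between binary membership vectors, as used for W in this paper.
    return zeta_i @ zeta_j / (np.linalg.norm(zeta_i) * np.linalg.norm(zeta_j))

def hamming_similarity(zeta_i, zeta_j):
    # The alternative mentioned above: 1 - HammingDist(zeta_i, zeta_j) / d.
    return 1.0 - np.mean(zeta_i != zeta_j)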
7 Conclusion
For clustering problems, we have pointed out that many studies use just maximum a posteriori (MAP) assignments even when they use MCMC samplers. This is because it is not straightforward to recognize clusters from many samples of $p(Z|X)$. Thus, we have proposed an identification algorithm which constructs groups of clusters. Each group is recognized as the distribution of a cluster. We apply spectral clustering to group the clusters. Although a naive spectral clustering algorithm is intractable due to memory space and computational time, we developed a memory- and time-efficient spectral clustering for the samples of an MCMC sampler. In experiments, we first applied the proposed algorithm to an ideal data set and saw that it works well. Next, we showed that the proposed algorithm is tractable for real data, a search query log, while the naive algorithm is actually intractable. For the query log, we also exhibited representative vocabularies of clusters, which cannot be chosen with just MAP assignments.
A Infinite Relational Models
Kemp et al. [11] and Xu et al. [12] independently proposed infinite relational models (IRMs). IRMs are a general model for clustering relational data. For example, when we have relational data consisting of people and the relation "like", the IRM clusters people based on the relation "like". People in one cluster share the same probability of "liking" people in another cluster. In the following sections, we describe the special case of the IRM which we use in our experiments.
A.1 Generative Model of the IRM
The generative model of the IRM is given in (5)-(7). It samples the assignment $Z$ first of all. Next, $\eta$ is sampled from the beta distribution for $1 \le s \le t \le K$. Finally, $R$ is observed from the Bernoulli distribution with parameter $\eta$. $R(i, j) = 1$ means word $i$ and word $j$ are used in the same query, and $R(i, j) = 0$ means they are not used in the same query. This process is summarized as,
$$
Z \mid \gamma \sim \mathrm{DP}(Z; \gamma) \tag{5}
$$
$$
\eta_{s,t} \mid \beta \sim \mathrm{Beta}(\eta_{s,t}; \beta, \beta) \quad \text{for } 1 \le s \le t \le K \tag{6}
$$
$$
R(i, j) \mid z_i, z_j, \eta \sim \mathrm{Bern}(R(i, j); \eta_{z_i, z_j}) \quad \text{for } 1 \le i \le j \le W. \tag{7}
$$
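To illustrate (6) and (7), a minimal forward-sampling sketch given fixed assignments is the following (NumPy; the names and the symmetric treatment of R are our assumptions):

import numpy as np

def sample_irm_relation(z, beta, seed=None):
    # Draw eta_{s,t} ~ Beta(beta, beta) for each cluster pair, then
    # R(i, j) ~ Bernoulli(eta_{z_i, z_j}) for i <= j, mirrored to keep R symmetric.
    rng = np.random.default_rng(seed)
    W = len(z)
    K = int(np.max(z)) + 1
    eta = np.zeros((K, K))
    for s in range(K):
        for t in range(s, K):
            eta[s, t] = eta[t, s] = rng.beta(beta, beta)
    R = np.zeros((W, W), dtype=int)
    for i in range(W):
        for j in range(i, W):
            R[i, j] = R[j, i] = rng.binomial(1, eta[z[i], z[j]])
    return R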
Marginalizing out $\eta$, we obtain the marginal likelihood of the IRM as,
$$
p(R, Z) = \prod_{1 \le s \le t \le K} \frac{B(m_{s,t} + \beta,\; N_s N_t - m_{s,t} + \beta)}{B(\beta, \beta)}\, \mathrm{DP}(Z), \tag{8}
$$
where $m_{s,t} = |\{(i, j) \mid R_{ij} = 1, Z_{i,s} = 1 \,\&\, Z_{j,t} = 1\}|$ and $N_s = \sum_i Z_{i,s}$.
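Equation (8) can be evaluated in log space, for example as in the sketch below (SciPy's betaln and gammaln; the pair count for s = t is treated loosely as N_s N_t, following the text, and the DP(Z) term reuses (2) with concentration γ, so the counting conventions are our assumptions):

import numpy as np
from scipy.special import betaln, gammaln

def irm_log_marginal(R, z, beta, gamma):
    # Log of Eq. (8): sum of log-Beta ratios over cluster pairs s <= t, plus the
    # log DP prior of the assignments. R is a symmetric 0/1 matrix, z gives the
    # cluster index of each word.
    z = np.asarray(z)
    labels = np.unique(z)
    logp = 0.0
    for a, s in enumerate(labels):
        for t in labels[a:]:
            block = R[np.ix_(z == s, z == t)]
            m = int(block.sum())
            n_pairs = int((z == s).sum()) * int((z == t).sum())
            logp += betaln(m + beta, n_pairs - m + beta) - betaln(beta, beta)
    counts = np.array([(z == s).sum() for s in labels])
    logp += (gammaln(gamma) - gammaln(len(z) + gamma)
             + np.sum(np.log(gamma) + gammaln(counts)))
    return logp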
A.2 Inference for the IRM
Our task is clustering. Thus, what we want to do with the IRM is to infer $Z$ given $R$. We use the Gibbs sampler for this inference. To utilize the Gibbs sampler, we need the posterior distribution of $Z$ for each word, $p(Z_{i,u} = 1 \mid Z^{\neg i}, R)$, where $Z^{\neg i}$ denotes the assignments except for word $i$. We notice $p(Z_{i,u} = 1 \mid Z^{\neg i}, R) \propto p(R, Z)$. Thus, we can use (8) for the Gibbs sampler²,
$$
p(Z_{i,u} = 1 \mid Z^{\neg i}, R) \propto
\begin{cases}
\prod_{1 \le s \le t \le K} B(m_{s,t} + \beta,\; N_s N_t - m_{s,t} + \beta)\,(N_u - 1) & \text{(if } u \text{ is an existing cluster)} \\
\prod_{1 \le s \le t \le K} B(m_{s,t} + \beta,\; N_s N_t - m_{s,t} + \beta)\,\gamma & \text{(if } u \text{ is a new cluster).}
\end{cases}
\tag{9}
$$
² For efficiency, it would be better to simplify $p(Z_{i,u} = 1 \mid Z^{\neg i}, R)$ rather than evaluating (9) directly.
References
1. Ferguson, T.: A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1 (1973) 209-230
2. Antoniak, C.: Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2 (1974) 1152-1174
3. Rasmussen, C.E.: The infinite Gaussian mixture model. In Solla, S.A., Leen, T.K., Müller, K.R., eds.: Advances in Neural Information Processing Systems. Volume 12. MIT Press (2000) 554-560
4. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476) (December 2006) 1566-1581
5. Liang, P., Jordan, M.I., Taskar, B.: A permutation-augmented sampler for DP mixture models. In Ghahramani, Z., ed.: Proceedings of the 24th Annual International Conference on Machine Learning. Omnipress (2007) 545-552
6. Blei, D.M., Jordan, M.I.: Variational inference for Dirichlet process mixtures. Bayesian Analysis 1(1) (2005) 121-144
7. Kurihara, K., Welling, M., Vlassis, N.: Accelerated variational Dirichlet process mixtures. In: Advances in Neural Information Processing Systems 19. (2007)
8. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems 14. (2002)
9. Hagen, L., Kahng, A.B.: New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 11(9) (1992) 1074-1085
10. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8) (2000) 888-905
11. Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. In: AAAI. (2006)
12. Xu, Z., Tresp, V., Yu, K., Kriegel, H.P.: Learning infinite hidden relational models. In: Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence. (2006)