Semi-supervised Clustering Using Bayesian Regularization

Zuobing Xu, Ram Akella
University of California, Santa Cruz, CA, USA
zbxu, [email protected]

Mike Ching, Renjie Tang
IBM Almaden Research Center, CA, USA
mching, [email protected]

Abstract

Text clustering is most commonly treated as a fully automated task without user supervision. However, we can improve clustering performance using supervision in the form of pairwise (must-link and cannot-link) constraints. This paper introduces a rigorous Bayesian framework for semi-supervised clustering which incorporates human supervision, in the form of pairwise constraints, in both the expectation step and the maximization step of the EM algorithm. During the expectation step, we model the pairwise constraints as random variables, which enables us to capture the uncertainty in the constraints in a principled manner. During the maximization step, we treat the constraint documents as prior information and, using Bayesian regularization, adjust the probability mass of the model distribution to emphasize words occurring in the constraint documents. Bayesian conjugate prior modeling makes the maximization step more efficient than the gradient search methods used in traditional distance learning. Experimental results on several text datasets demonstrate significant advantages over existing algorithms.

1 Introduction

In many text mining tasks, there is a large supply of unlabeled data but limited labeled data. Labeled data can be expensive to generate, since labeling requires domain expertise. A critical problem in real-life text mining applications, such as web document categorization and call center service log clustering, is to learn efficiently from limited labeled data. Consequently, semi-supervised clustering [1][2][3][7][8][9][10][11], which learns from both labeled and unlabeled data, has become a critically important research area. It allows a human expert to steer the clustering process so that the data can be partitioned into a useful set of clusters with minimum human effort. In this paper, we treat clustering with constraints in a rigorous Bayesian framework and propose a new semi-supervised clustering algorithm based on the Expectation-Maximization (EM) clustering algorithm [4].
Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. Zhong and Ghosh [12] have shown that, in text clustering, the multinomial model performs consistently well and is efficient to implement in comparison to other mixture models. In this paper, we propose a new semi-supervised clustering algorithm based on multinomial mixture models. Our model incorporates supervision in the form of must-link and cannot-link constraints, indicating that a pair of documents should be assigned to the same cluster or to different clusters, respectively [1]. Pairwise constraints are the most natural type of user supervision in clustering: users may be unaware of the data class labels, but they can still specify that pairs of data belong to the same or different clusters. Our new algorithm incorporates pairwise constraints into both the Expectation and Maximization steps of the multinomial mixture model. In the E-step updated with constraints, the proposed algorithm approximates the constraint posterior probability to accommodate a probabilistic model of user inputs on pairwise (must-link and cannot-link) constraints. The user specifies the strength of a constraint, which captures the trade-off between belief in the constraint information and the document distribution: the larger the value of the trade-off parameter, the more the clustering algorithm trusts the user input over the document similarity (based on the word frequencies in documents). In the M-step updated with constraints, we treat the pairwise constraints as prior information. We incorporate constraint information by augmenting the cluster multinomial model with pseudocounts that reflect the word occurrences in the constraint documents; in the Bayesian community, this approach is called Bayesian regularization. Moreover, Bayesian conjugate updating is more efficient in the M-step than the gradient methods used in traditional distance learning approaches. We also show that an appropriate scale parameter, often interpreted as an equivalent sample size or prior strength, leads to a significant improvement in the clustering results.
The remainder of this paper is organized as follows. Section 2 summarizes previous work. Our new algorithm is described in Section 3. In Section 4, we describe our experimental results. Finally, Section 5 contains the conclusions.
2 Previous Work
Existing research on semi-supervised clustering falls into two broad categories: constraint-based approaches and distance-based approaches. In the constraint-based approaches, the objective function of the clustering algorithm is modified to incorporate a penalty for violating the constraints, so that the user-provided constraints bias the search toward an appropriate partition. Wagstaff et al. [10] proposed the COP-KMeans algorithm, which incorporates constraints through a heuristically motivated objective function. The constraint-based approaches have been extended to generative model clustering in recent work [7][8][9]; these models incorporate pairwise constraints into the Expectation-Maximization (EM) algorithm using different approximations of the pairwise constraints. In the distance-based approaches, an existing clustering algorithm that uses a parameterized similarity metric is first trained to satisfy the labels or constraints in the supervised data. Several adaptive distance measures have been used for semi-supervised clustering, including Jensen-Shannon distance trained using gradient descent [3] and Mahalanobis distances trained using a combination of gradient descent and iterative projections [11]. Basu et al. [1] introduced a probabilistic framework based on Hidden Markov Random Fields (HMRFs) for semi-supervised clustering that combines the constraint-based and distance-based approaches in a unified framework; their model employs constraint-sensitive assignment and iterative distance learning to optimize the objective function in the K-Means clustering framework. In contrast, we propose a new model which incorporates constraint information in both the E-step and the M-step within a rigorous Bayesian framework. For constraint learning, our new algorithm assigns documents to clusters probabilistically, which is consistent with the EM framework, and our E-step captures the trade-off between belief in the original document distribution and the pairwise constraint information. In the M-step, we re-estimate the cluster distribution model using Bayesian regularization to incorporate constraint information naturally, whereas the traditional distance-based approaches learn a distorted distance metric to satisfy the constraints. Because the traditional distance-based approaches [1][2] require an extra optimization step, our approach is more computationally efficient.
3 Clustering with Pairwise Constraints
3.1 EM Based on a Mixture of Multinomial Distributions without Constraints
We propose to use EM clustering based on multinomial mixture models as our base clustering algorithm. Here, we briefly explain the EM algorithm for estimating multinomial mixture models, to help understand our semi-supervised clustering algorithm. Suppose we have N documents d = (d_1, \ldots, d_N), and the goal is to partition the N documents into K disjoint clusters (C_k)_{k=1}^{K}. In the E-step without constraints, we first calculate P(d_i \mid C_j; \Theta^{old}), the probability of document i being generated by its representative cluster C_j; we then calculate the posterior probability and assign the document to the cluster with the highest posterior probability. Let y_i be the cluster assignment for document i; then

P(y_i = C_j \mid d_i; \Theta^{old}) = \frac{p(C_j \mid \Theta^{old}) \, P(d_i \mid C_j; \Theta^{old})}{\sum_{k=1}^{K} p(C_k \mid \Theta^{old}) \, P(d_i \mid C_k; \Theta^{old})} \quad (1)

In the M-step without constraints, we re-estimate the distribution parameters and the cluster probabilities:

P(w_t \mid C_j; \Theta^{new}) = \frac{1 + \sum_i P(y_i = C_j \mid d_i; \Theta^{old}) \, N(w_t, d_i)}{V + \sum_{m=1}^{V} \sum_i P(y_i = C_j \mid d_i; \Theta^{old}) \, N(w_m, d_i)} \quad (2)

P(C_j \mid \Theta^{new}) = \frac{1 + \sum_{i=1}^{N} P(y_i = C_j \mid d_i; \Theta^{old})}{K + N} \quad (3)
Here \Theta^{old} denotes the current parameter estimates used to evaluate the expectation, and \Theta^{new} the new parameters estimated in the M-step; p(w_t \mid C_j; \Theta^{new}) is the probability of word w_t being generated by cluster model C_j; p(C_j \mid \Theta^{new}) is the probability of generating cluster j; \Theta defines the set of multinomial distributions and class probabilities; V is the vocabulary size; and N(w_t, d_i) is the frequency of word t in document i. The E-step and M-step iterate until the algorithm converges to a local maximum.
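For concreteness, the unconstrained E- and M-steps of Eqns. (1)-(3) can be sketched as follows. This is a minimal NumPy illustration under our own conventions (dense count matrix, soft posteriors, log-space E-step), not the authors' implementation.

```python
import numpy as np

def em_multinomial_mixture(counts, K, n_iter=50, seed=0):
    """EM for a mixture of multinomials, following Eqns. (1)-(3).

    counts : (N, V) array with N(w_t, d_i), the count of word t in document i
    K      : number of clusters
    """
    rng = np.random.default_rng(seed)
    N, V = counts.shape
    post = rng.dirichlet(np.ones(K), size=N)        # soft init of P(y_i = C_j | d_i)

    for _ in range(n_iter):
        # M-step, Eqn. (2): Laplace-smoothed word probabilities per cluster.
        wc = post.T @ counts                         # expected word counts, shape (K, V)
        word_prob = (1.0 + wc) / (V + wc.sum(axis=1, keepdims=True))
        # M-step, Eqn. (3): smoothed cluster priors.
        cluster_prob = (1.0 + post.sum(axis=0)) / (K + N)

        # E-step, Eqn. (1): posterior over clusters, computed in log space.
        log_post = counts @ np.log(word_prob).T + np.log(cluster_prob)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, word_prob, cluster_prob
```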
3.2 E-step with Constraints
Pairwise constraints influence the E-step, so that cluster label assignment depends not only on the document distribution but also on the constraints. To utilize the pairwise constraints effectively, we infer the joint posterior distribution based on all the must-link and cannot-link constraints. We begin by taking the transitive closure of the must-link constraints to obtain the connected components consisting of points connected by must-links. If there is any cannot-link constraint between two neighboring sets (sets of documents in the must-link transitive closure), we enforce cannot-link constraints on every pair of documents drawn one from each of the two sets. Finally, we organize all the constraint documents in a clique T. We illustrate the procedure with an intuitive example in Figure 1, and sketch the construction below.

Figure 1. Constraint augmentation. (a) Must-link constraints (solid lines) and cannot-link constraints (dashed lines) among 7 documents; solid circles, solid squares, and transparent circles indicate three categories of documents. (b) The must-link and cannot-link constraints augmented from the links in (a).
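A minimal sketch of this clique-construction step follows (our own illustration, using union-find for the must-link transitive closure; `must_links` and `cannot_links` are assumed to be lists of document-index pairs):

```python
from itertools import combinations, product

def augment_constraints(n_docs, must_links, cannot_links):
    """Transitive closure of must-links, then cannot-link propagation (Figure 1)."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in must_links:                  # union the must-linked documents
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n_docs):                  # connected components of must-links
        groups.setdefault(find(i), []).append(i)

    must, cannot = set(), set()
    for members in groups.values():          # must-link every pair inside a component
        must.update(combinations(sorted(members), 2))
    for i, j in cannot_links:                # cannot-link every cross pair of two components
        a, b = groups[find(i)], groups[find(j)]
        cannot.update((min(x, y), max(x, y)) for x, y in product(a, b))

    clique = sorted({doc for pair in must | cannot for doc in pair})
    return must, cannot, clique
```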
In the E-step, the assignment of documents to clusters is updated using the current estimates of the cluster model. In the EM algorithm without constraints, all documents are independent of each other, so we can calculate each document's posterior probability independently, as in Eqn. (1). In contrast, under pairwise constraints the posterior probability of a document in the constraint clique T depends on the other documents in the clique. Let G represent all the pairwise constraints: G_{ij} = 1 indicates a must-link constraint between documents i and j, and G_{ij} = 0 indicates a cannot-link constraint between documents i and j. The EM algorithm on a mixture of multinomial distributions with constraints maximizes the expectation of the posterior distribution:

E_Y\big[\log P(\Theta^{new} \mid d, Y, G) \mid d, \Theta^{old}, G\big] = \sum_Y \log\big(P(\Theta^{new} \mid d, Y)\big) \, P(Y \mid d, \Theta^{old}, G) \quad (4)

In the above equation, Y = (y_i)_{i=1}^{N} denotes the cluster assignments of the documents d. Thus, we calculate the expectation P(Y \mid d, \Theta^{old}, G) in the E-step, and find the parameters \Theta^{new} that maximize Eqn. (4) in the M-step. Shental et al. [9] use the Bayesian framework to calculate this expectation over all the constraint documents as

P(Y \mid d, \Theta^{old}, G) = \frac{P(Y \mid \Theta^{old}) \, P(d \mid Y, \Theta^{old}) \, P(G \mid Y, d, \Theta^{old})}{\sum_Y P(Y \mid \Theta^{old}) \, P(d \mid Y, \Theta^{old}) \, P(G \mid Y, d, \Theta^{old})} \quad (5)
If we use hard constraints, P(G \mid Y, d, \Theta^{old}) = 1 when Y is consistent with G, and P(G \mid Y, d, \Theta^{old}) = 0 when Y is not consistent with G. Applying hard constraints implies that cluster assignments have to comply fully with the constraints. However, satisfying the constraints completely will distort the underlying document distribution and lead to a degradation of clustering performance. Hence, we incorporate human inputs in a way that tempers a purely mechanistic computation based on word frequency occurrence.

We now describe the first key idea of this paper. We model the scenario in which the user can specify the strength level of the constraints by introducing a trade-off parameter \alpha_{ij}, which reflects the enforcement level of the constraints. We approximate P(G \mid Y, d, \Theta^{old}) as a function of both the divergence of the original distributions and the human interaction tuned by the trade-off coefficient \alpha_{ij}, which is different from [8]. Assuming that the pairwise constraints are independent, the probability of constraint G_{ij} conditioned on documents d_i, d_j and cluster assignments y_i, y_j is modeled as

P(G_{ij} \mid d_i, d_j, y_i, y_j, \Theta^{old}) =
\begin{cases}
(1 - \alpha_{ij}) e^{-JS_{ij}} + \alpha_{ij} \mathbf{1}(y_i = y_j) & \text{if } G_{ij} = 1 \\
(1 - \alpha_{ij}) (1 - e^{-JS_{ij}}) + \alpha_{ij} \mathbf{1}(y_i \neq y_j) & \text{if } G_{ij} = 0
\end{cases} \quad (6)

Here JS_{ij} is the Jensen-Shannon divergence, a symmetrized and smoothed version of the Kullback-Leibler (KL) divergence. It is defined by

JS_{ij} = \frac{1}{2} KL\!\left(p(d_i) \,\Big\|\, \frac{p(d_i) + p(d_j)}{2}\right) + \frac{1}{2} KL\!\left(p(d_j) \,\Big\|\, \frac{p(d_i) + p(d_j)}{2}\right) \quad (7)

\mathbf{1}(\cdot) in Eqn. (6) is the indicator function, and \alpha_{ij} is a tuning parameter that reflects the trade-off between human supervision and the data distribution: if \alpha_{ij} = 1, we rely completely on human feedback; if \alpha_{ij} = 0, the constraints have no influence on the cluster assignment. Thus, the probability of observing a constraint can be considered as a mixture of two sources: one is an exponential distribution, and the other is a Bernoulli distribution with probability P(y_i = y_j). With this approximation we obtain the following behavior: if G_{ij} = 1, the must-link constraint probability decreases with the distance between the documents, so two documents with a larger distance are less likely to have a must-link constraint; if G_{ij} = 0, the cannot-link constraint probability increases with the distance, so two documents with a larger distance are more likely to have a cannot-link constraint. This approximation also guarantees that the constraint probability lies between 0 and 1.
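A small sketch of the constraint likelihood in Eqns. (6)-(7); here p and q are the empirical word distributions of the two documents (our reading of p(d_i) in the paper's notation), and the default alpha = 0.9 matches the experiments:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two word distributions (Eqn. 7)."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def constraint_prob(G_ij, y_i, y_j, js_ij, alpha=0.9):
    """P(G_ij | d_i, d_j, y_i, y_j) from Eqn. (6); alpha trades off user
    supervision against the document distribution."""
    if G_ij == 1:   # must-link
        return (1 - alpha) * np.exp(-js_ij) + alpha * float(y_i == y_j)
    else:           # cannot-link
        return (1 - alpha) * (1 - np.exp(-js_ij)) + alpha * float(y_i != y_j)
```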
The posterior probability of a given document d_i in the clique T is calculated by marginalizing the joint posterior distribution over the entire clique. Marginalizing the posterior distribution is computationally prohibitive when the clique size is large, so we apply Gibbs sampling, as proposed in [8], to estimate the posterior probability. In Gibbs sampling, we estimate P(y_i \mid d, \Theta^{old}, G) as the mean of a sequence of conditional samples:

P(y_i = k \mid d, \Theta^{old}, G) = E\big[\delta(y_i, k) \mid d, \Theta^{old}, G\big] \approx \frac{1}{S} \sum_{t=1}^{S} \delta(y_i^t, k)

where the sum is over a sequence of S samples y_i^1, y_i^2, \ldots, y_i^S from P(Y \mid d, \Theta, G) generated by the Gibbs sampler. We start the Gibbs sampler from an initial state given by the cluster assignment resulting from the E-step without constraints. This ensures mutual agreement between the clustering based on the document distribution and the clustering based on the constraints. The t-th sample in the sequence is generated by drawing y_i^t = C_k from the conditional distribution

P(y_i^t = C_k \mid y_1^{t-1}, \ldots, y_{i-1}^{t-1}, y_{i+1}^{t-1}, \ldots, y_F^{t-1}, d, G, \Theta^{old}) = \frac{P(y_1^{t-1}, \ldots, y_i^t = C_k, \ldots, y_F^{t-1} \mid d, G, \Theta^{old})}{\sum_k P(y_1^{t-1}, \ldots, y_i^t = C_k, \ldots, y_F^{t-1} \mid d, G, \Theta^{old})}

where F is the number of documents in the clique T.
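The Gibbs estimate above can be sketched as follows (our own illustration, not the authors' code). Here `log_post_unc[i]` is assumed to hold the K log posteriors of Eqn. (1) for the i-th clique document, `G` maps ordered index pairs in the clique to 1/0 constraints (both orderings present), `js` holds precomputed JS divergences, and `constraint_prob` is the function sketched earlier:

```python
import numpy as np

def gibbs_posterior(log_post_unc, G, js, alpha, init, K, n_samples=200, seed=0):
    """Estimate P(y_i = k | d, Theta, G) for the documents of one constraint
    clique by Gibbs sampling, started from the unconstrained E-step assignment."""
    rng = np.random.default_rng(seed)
    F = len(init)                          # number of documents in the clique
    y = np.array(init)                     # current state of the sampler
    counts = np.zeros((F, K))

    for _ in range(n_samples):
        for i in range(F):
            log_p = log_post_unc[i].astype(float).copy()
            for j in range(F):             # fold in the constraint likelihoods, Eqn. (6)
                if j != i and (i, j) in G:
                    log_p += np.log([constraint_prob(G[(i, j)], k, y[j], js[(i, j)], alpha)
                                     for k in range(K)])
            p = np.exp(log_p - log_p.max())
            y[i] = rng.choice(K, p=p / p.sum())
        counts[np.arange(F), y] += 1       # accumulate delta(y_i^t, k)

    return counts / n_samples              # Monte Carlo average over the S samples
```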
3.3 M-step with Constraints

Our second contribution in this paper is to incorporate the information in the constraint documents in the M-step, by treating the constraint documents as prior knowledge in the models. Previous semi-supervised clustering algorithms based on mixture models [7][8][9] have not incorporated information from constraint documents to enhance model estimation in the M-step.

The M-step maximizes the expectation computed in the E-step. That is, we find

\Theta^{new} = \arg\max_{\Theta} E_Y\big[\log P(\Theta \mid d, Y, G) \mid d, \Theta^{old}, G\big] \quad (8)

We now contrast the traditional M-step without constraints with our updated M-step with constraints. The Dirichlet distribution is a commonly used conjugate prior over the parameters of a multinomial distribution. The form of the Dirichlet distribution is

P(\theta_{C_j}) \propto \prod_{m=1}^{V} (\theta^m_{C_j})^{\beta_m - 1} \quad (9)

where \theta^m_{C_j} is the multinomial parameter for word m in cluster C_j, and \beta_m is the parameter of the Dirichlet distribution. With this prior, we estimate the model parameters by maximizing the posterior probability of the parameters given the class assignments from the E-step. Finding the \Theta that maximizes P(\Theta \mid d, Y) is accomplished by maximizing P(\Theta \mid d, Y) \propto P(d, Y \mid \Theta) P(\Theta). In the M-step without constraints, we set all \beta_m = 2, which corresponds to a prior that favors the uniform distribution. Because the Dirichlet distribution is a conjugate prior for the multinomial distribution, the maximum a posteriori (MAP) estimate of P(w_t \mid C_j, \Theta) in Eqn. (2) retains the multinomial form and is computed simply by augmenting both the numerator and the denominator of the maximum likelihood estimate with pseudocounts (one for each word). The use of this type of prior is sometimes referred to as Laplace smoothing; smoothing is necessary to prevent zero probabilities for infrequently occurring words.

The updated M-step with constraints differs from the traditional M-step in that it changes \beta_m to account for the information in the constraint documents. We use word counts in the constraint documents to define a conjugate Dirichlet prior for the parameters \theta_{C_j}, instead of the uniform prior of the traditional M-step. That is,

P(\theta_{C_j}) \propto \prod_{m=1}^{V} (\theta^m_{C_j})^{PC(w_m, C_j)}

where

PC(w_m, C_j) = 1 + \beta \sum_{k \in T} P(y_k = C_j \mid d_k; \Theta^{old}) \, N(w_m, d_k)

Here \beta is a parameter indicating our confidence in the constraint-document prior; it can be interpreted as an equivalent sample size or prior strength. Intuitively, the larger \beta is, the more constraint-document information p(w_t \mid C_j; \Theta) contains. Since we use a conjugate prior, the MAP estimate of the cluster multinomial model \theta_{C_j} in the updated M-step is computed by adding the additional pseudocounts PC(w_m, C_j). Essentially, this means that the new cluster model stresses the information contained in the constraint documents \beta times as much as the information contained in the other documents in the cluster. The updated MAP estimate for p(w_t \mid C_j; \Theta^{new}) is

p(w_t \mid C_j; \Theta^{new}) = \frac{PC(w_t, C_j) + \sum_i P(y_i = C_j \mid d_i; \Theta^{old}) \, N(w_t, d_i)}{\sum_{m=1}^{V} \big[ PC(w_m, C_j) + \sum_i P(y_i = C_j \mid d_i; \Theta^{old}) \, N(w_m, d_i) \big]}

The class probability p(C_j \mid \Theta) is estimated in the same manner, and it also involves a ratio of counts estimated from the pairwise constraint documents:

P(C_j \mid \Theta^{new}) = \frac{PC(C_j) + \sum_{i=1}^{N} P(y_i = C_j \mid d_i; \Theta^{old})}{\beta |T| + K + N} \quad (10)

where

PC(C_j) = 1 + \beta \sum_{k \in T} P(y_k = C_j \mid d_k; \Theta^{old})

and |T| is the number of documents in the constraint-document clique T. We will show how the parameter \beta influences the clustering performance in the experiments. Using the Dirichlet conjugate prior makes computation of the M-step feasible even for a large number of constraints.
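A sketch of this regularized M-step (our own NumPy rendering of the updates above; `post` is the constrained E-step posterior, `counts` the document-term matrix, and `clique` the indices of the constraint documents in T):

```python
import numpy as np

def m_step_with_constraints(counts, post, clique, beta, K):
    """MAP re-estimation with constraint-document pseudocounts PC(w_m, C_j)."""
    N, V = counts.shape
    # Pseudocounts from the constraint documents: PC(w_m, C_j) and PC(C_j).
    PC_w = 1.0 + beta * (post[clique].T @ counts[clique])      # shape (K, V)
    PC_c = 1.0 + beta * post[clique].sum(axis=0)               # shape (K,)

    weighted = post.T @ counts                                 # expected counts, (K, V)
    word_prob = (PC_w + weighted) / (PC_w + weighted).sum(axis=1, keepdims=True)
    cluster_prob = (PC_c + post.sum(axis=0)) / (beta * len(clique) + K + N)
    return word_prob, cluster_prob
```

Because the constraint documents also appear in the ordinary expected counts, their words effectively receive (beta + 1) times the weight of words from unconstrained documents, which is the "beta times as much" emphasis described above.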
4 Experimental Methodology and Experimental Results
To validate our algorithm, we test it on 3 different datasets drawn from the 20-Newsgroups collection. This collection contains messages collected from 20 different Usenet newsgroups, with around 1000 messages per newsgroup. We create 3 datasets from the collection. News-similar-3 consists of 3 newsgroups on similar topics (comp.graphics, comp.os.ms-windows, comp.windows.x), which contain significantly overlapping documents. News-related-3 contains 3 related newsgroups on political topics (talk.politics.misc, talk.politics.guns and talk.politics.mideast). News-different-3 consists of 3 newsgroups on different topics across religion, sports and science (alt.atheism, rec.sport.baseball and sci.space), and is well separated. All the datasets are pre-processed by stop-word removal and Porter stemming. We extract 400 frequently occurring bigrams from each dataset. From the original single terms and the extracted phrases, we select 4000 features for each dataset by evaluating term variance quality [5].
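A rough sketch of this preprocessing pipeline is given below (our own approximation: stemming is omitted, and a simple count-variance ranking stands in for the term-variance-quality criterion of [5]):

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

def build_features(docs, n_bigrams=400, n_features=4000):
    """Stop-word removal, unigram + bigram extraction, variance-based selection."""
    # Unigrams with English stop words removed.
    uni = CountVectorizer(stop_words='english')
    X_uni = uni.fit_transform(docs)

    # The 400 most frequent bigrams as additional phrase features.
    bi = CountVectorizer(ngram_range=(2, 2), stop_words='english',
                         max_features=n_bigrams)
    X_bi = bi.fit_transform(docs)

    # Rank the combined features by count variance and keep the top 4000.
    X = hstack([X_uni, X_bi]).tocsc()
    var = np.asarray(X.power(2).mean(axis=0) - np.square(X.mean(axis=0))).ravel()
    keep = np.argsort(var)[::-1][:n_features]
    return X[:, keep]
```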
4.1 Clustering Evaluation

We use normalized mutual information (NMI) [6] as our clustering evaluation measure. NMI measures the amount of statistical information shared by two random variables: the true class label and the algorithm's clustering result. In other words, it measures how well the clustering result matches the external true class labels. If Y' is the random variable denoting the cluster assignments of the points and Y is the random variable denoting the underlying class labels, then the NMI measure is defined as

NMI = \frac{I(Y, Y')}{(H(Y) + H(Y'))/2}

where I(Y, Y') = H(Y) - H(Y \mid Y') is the mutual information between the random variables Y and Y', H(Y) is the Shannon entropy of Y, and H(Y \mid Y') is the conditional entropy of Y given Y'. The NMI value is 1 when the clustering result perfectly matches the external category labels, and close to 0 for a random partitioning.
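For reference, a small NumPy implementation of this measure (equivalent, up to the averaging convention, to library routines such as scikit-learn's `normalized_mutual_info_score`):

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """NMI = I(Y, Y') / ((H(Y) + H(Y')) / 2), estimated from the contingency table."""
    _, y = np.unique(labels_true, return_inverse=True)
    _, c = np.unique(labels_pred, return_inverse=True)
    n = len(y)

    # Joint distribution P(Y, Y') from co-occurrence counts.
    joint = np.zeros((y.max() + 1, c.max() + 1))
    np.add.at(joint, (y, c), 1.0)
    joint /= n
    py, pc = joint.sum(axis=1), joint.sum(axis=0)

    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / np.outer(py, pc)[mask]))
    return mi / ((entropy(py) + entropy(pc)) / 2)
```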
4.2 Comparison with the MPCKMeans Algorithm

We compare our algorithm with the MPCKMeans algorithm of [2] on these 3 datasets. In all the experiments, we set the trade-off parameter \alpha_{ij} = 0.9 and evaluate the clustering results for different numbers of constraints. We also compare the results of our algorithm for the parameter settings \beta = 0, \beta = 10 and \beta = 20 to show how \beta influences the clustering performance. In [2], 100 documents are randomly selected from each newsgroup and clustered, whereas we evaluate both algorithms with all the documents (around 1000) in each newsgroup. Because of the strong connection between multinomial-model-based clustering and divisive Kullback-Leibler (KL) divergence clustering, we set the distance measure in the MPCKMeans algorithm to I-divergence [1], a variant of KL divergence belonging to the class of Bregman divergences.

We generate learning curves using 20 runs of 2-fold cross-validation for each dataset to study the effect of constraints on clustering: 50% of the dataset is used for training and the other 50% for testing. We randomly generate pairwise constraints from the training data for each run, and create the transitive closure of the constraints. For a fair comparison, we use the same initial random seeds and the same set of constraints for both algorithms. The clustering algorithms are run on the whole dataset, but NMI is calculated only on the test set. Figure 2 shows the clustering results of our algorithm with \beta = 0, \beta = 10 and \beta = 20 and of the MPCKMeans algorithm on the News-similar-3, News-related-3 and News-different-3 datasets.
Figure 2. Comparison of NMI (vertical axis) versus the number of constraints (horizontal axis) for our algorithm (β = 20, β = 10, β = 0) and MPCKMeans on News-similar-3, News-related-3 and News-different-3.
From the results, we can see that our semi-supervised clustering significantly outperforms the MPCKMeans algorithm on all three datasets, and that our algorithm with Bayesian regularization in the M-step (\beta = 20 and \beta = 10) performs better than without Bayesian regularization (\beta = 0). The advantage of our new algorithm is most evident in its discriminatory power when the dataset contains significant overlap between clusters (News-similar-3, News-related-3). In addition, our algorithm runs significantly faster than the MPCKMeans algorithm, as expected.
5 Conclusions
We have introduced a new semi-supervised clustering algorithm that allows probabilistic pairwise constraints in the E-step and incorporates constraint-document information by augmenting pseudocounts in the M-step. Experimental results on different text datasets show that our algorithm outperforms the existing MPCKMeans algorithm by significant margins. Several research directions may further improve the semi-supervised clustering algorithm: first, actively selecting constraint documents for user evaluation; second, extending the current M-step by excluding general words from the prior knowledge; and third, estimating the equivalent sample size of the pseudocounts.
6 Acknowledgments

We would like to acknowledge support from the IBM Almaden Research Center, Cisco, the University of California's MICRO Program, CITRIS, and UARC. We also appreciate discussions with Sugato Basu.

References

[1] S. Basu, M. Bilenko, and R. Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2004.
[2] M. Bilenko, S. Basu, and R. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML, 2004.
[3] D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell University, 2003.
[4] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 1977.
[5] I. Dhillon, J. Kogan, and C. Nicholas. Feature selection and document clustering. Lecture Notes in Computer Science, 2002.
[6] B. Dom. An information-theoretic external cluster-validity measure. Technical Report RJ10219, IBM, 2001.
[7] M. Law, A. Topchy, and A. Jain. Model-based clustering with probabilistic constraints. In SIAM International Conference on Data Mining, 2005.
[8] Z. Lu and T. Leen. Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2004.
[9] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In NIPS, 2004.
[10] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 577-584, 2001.
[11] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), pages 505-512, 2003.
[12] S. Zhong and J. Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001-1037, 2003.