2011 11th IEEE International Conference on Data Mining Workshops
A Novel Co-clustering Method with Intra-Similarities

Jian-Sheng Wu, Jian-Huang Lai, Chang-Dong Wang
School of Information Science and Technology, Sun Yat-sen University, Guangzhou, P. R. China.
Email: [email protected], [email protected], [email protected]
Abstract—Recently, co-clustering has become a topic of much interest because of its applications to many problems, and it has been shown to be more effective than one-way clustering methods. However, existing co-clustering approaches treat a document simply as a collection of words, disregarding word sequences. They consider only the co-occurrence counts of words and documents, and do not take into account the similarities between words or the similarities between documents, even though this similarity information can help improve the co-clustering. In this paper, we incorporate word similarities and document similarities into the co-clustering process and propose a new co-clustering method. We also provide a theoretical analysis showing that our algorithm converges to a local minimum. The empirical evaluation on publicly available data sets shows that our algorithm is effective.
Keywords: co-clustering; word similarities; document similarities.
I. INTRODUCTION

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects; it is a fundamental, exploratory tool commonly used to discover structure in data [1]. Many clustering methods have been proposed, and most of them are one-way clustering methods, such as K-means and hierarchical clustering. Although some of them are very effective, co-clustering has been shown to be more effective in many applications [2]-[5], because it exploits the duality between rows and columns [6]. For example, a document can contain many different words, while a word can occur in many documents and can occur in one document many times. Hence, a document holds some information about words, and a word also holds some information about documents. Although this paper focuses on document-word co-clustering, the proposed method can be applied to other co-clustering applications. Here, we use "word" to denote the element and "document" to denote the container that consists of elements. To date, co-clustering has been used in many applications, such as document clustering [3], [4], [7], gene expression data clustering [2], [5], and image processing and scene modeling [8]-[10], and many variants have been developed, such as Bayesian co-clustering models [11] and constrained or semi-supervised co-clustering [12], [13].
For a given co-clustering problem, the co-occurrence matrix can be used to model the relationships between samples of the two sets [3], [4], so the co-clustering problem can be treated as a bipartite graph partitioning problem, which is NP-complete [3]. Dhillon [3] first employed spectral graph partitioning and the singular value decomposition (SVD) to partition the bipartite graph in the co-clustering field. Although this is an effective method, it still has drawbacks inherited from spectral graph partitioning and eigenvector-based decomposition. To overcome the drawbacks of SVD or eigenvector-based decomposition, Long et al. [6] proposed a new co-clustering framework, block value decomposition (BVD); compared with SVD or eigenvector-based decomposition, the decomposition produced by BVD has an intuitive interpretation. To overcome the drawbacks of spectral graph partitioning, Rege et al. [7] proposed an isoperimetric co-clustering algorithm that partitions the bipartite graph by minimizing the isoperimetric ratio. Alternatively, the co-occurrence matrix can be viewed as an empirical joint probability distribution of two discrete random variables, and the co-clustering problem can be posed as an optimization problem in information theory [4]. Slonim and Tishby [14] introduced mutual information into the document clustering problem by taking the document set and the word set as the value sets of two random variables: they first find word clusters that preserve most of the mutual information about the document set, and then find document clusters that maximize the mutual information about the word cluster set. Different from [14], Dhillon et al. [4] simultaneously cluster the documents and words in each iteration until the mutual information preserved between the two cluster sets is maximized. A generalized co-clustering framework is presented in [15], in which any Bregman divergence can be used in the objective function and various conditional expectations can be preserved.

Previous work has focused on the inter-relationships between samples belonging to different sets, but has not taken into account the intra-relationships between samples of the same set. For the document-word co-clustering problem, previous approaches treat a document simply as a collection of words, disregarding word sequences. As a result, two documents may share few key words (words selected as features for clustering) yet have similar meanings, or two documents may share some key words yet represent different meanings. For the first case, suppose we have two documents: one contains the sentence "Michael Jackson is an American musician and entertainer", and the other contains the sentence "The Hillbilly Cat is the most popular singer of rock", with key words musician, entertainer, singer, and rock. Although the two documents share few key words, they share a similar topic, entertainment, since all the key words are about entertainment. If we could measure the similarity between the key words musician and singer, entertainer and singer, musician and rock, and entertainer and rock, then we could judge whether the two sentences have similar meanings. The latter case arises because these approaches discard the information that terms and phrases provide; computing the similarity between documents using terms and phrases helps judge whether they should belong to the same cluster. Based on these observations, in this work we focus on incorporating the intra-relationships into the information-theoretic co-clustering algorithm (ITCC) [4]. For the document-word co-clustering problem, we take the similarities between words and the similarities between documents as the intra-relationships, and propose information-theoretic co-clustering incorporating word similarities and document similarities (ITCCWDS).

The rest of the paper is organized as follows. Section II presents the problem formulation. Section III details the proposed ITCCWDS algorithm and provides a theoretical analysis. Experimental results are reported in Section IV. Finally, we conclude the paper in Section V.
II. PROBLEM FORMULATION

Let X = {x_1, x_2, ..., x_m} and Y = {y_1, y_2, ..., y_n} denote the word set and the document set, respectively. Here, we call X the row sample set and Y the column sample set. We can then estimate the joint probability p(x_i, y_j) from the co-occurrence matrix of X and Y. As in the hard co-clustering setting of Dhillon et al. [4], we are interested in simultaneously clustering the row samples into k disjoint clusters and the column samples into l disjoint clusters. Denote the row cluster set and the column cluster set as X̂ = {x̂_1, x̂_2, ..., x̂_k} and Ŷ = {ŷ_1, ŷ_2, ..., ŷ_l}, respectively. For a row cluster x̂ and a column cluster ŷ, we denote μ_x̂ and ν_ŷ as their exemplars, and we write X̂(x) for the cluster label of x and Ŷ(y) for the cluster label of y.

Dhillon et al. propose the ITCC algorithm in [4]. ITCC simultaneously clusters words and documents by minimizing the loss of mutual information, defined as

  I(X;Y) − I(X̂;Ŷ),   (1)

where I(X;Y) is the mutual information between X and Y, defined as

  I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x) p(y|x) log ( p(y|x) / p(y) ).   (2)

Since the distribution p is fixed for a given problem, I(X;Y) is fixed according to (2), so minimizing (1) is equivalent to maximizing the preserved mutual information I(X̂;Ŷ). However, (1) only takes into account the joint occurrences of x_i and y_r; it discards the information that x_i holds about x_j and that y_r holds about y_t, such as similarity information. An operational definition of clustering can be stated as follows: given n representations of n objects, find K groups based on a measure of similarity such that the similarities between samples in the same group are high while the similarities between samples in different groups are low [1]. Accordingly, we can cluster the word set by maximizing WSS and cluster the document set by maximizing DSS, defined as follows [16]:

  WSS = Σ_x̂ Σ_{x∈x̂} γ sim(x, μ_x̂),   (3)

  DSS = Σ_ŷ Σ_{y∈ŷ} θ sim(y, ν_ŷ),   (4)

where γ and θ are the weights of samples x and y, and sim(·, ·) computes the similarity between two vectors. Since a word may occur in many documents and may occur in one document many times, and a document contains many words, we take the marginal probabilities p(x) and p(y) as the weights for word x and document y, respectively. Incorporating the similarity information, we rewrite (1) as

  I(X;Y) − I(X̂;Ŷ) − αWSS − βDSS,   (5)

where α and β are weights that trade off the loss of mutual information against WSS and DSS.

Definition 1. An optimal co-clustering minimizes (5), subject to the constraints on the numbers of row and column clusters.
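To make these quantities concrete, the following is a minimal sketch (our own illustration, not the authors' code) that evaluates the loss (1) and the objective (5) from a joint-distribution matrix, with γ = p(x) and θ = p(y) as stated above; the array names, the generic `sim` callable, and the exemplar arrays `U` and `V` are illustrative assumptions.

```python
import numpy as np

def mutual_information(P, eps=1e-12):
    """I(X;Y) of Eq. (2) for a joint-distribution matrix P
    (rows: words in X, columns: documents in Y)."""
    px = P.sum(axis=1, keepdims=True)          # p(x)
    py = P.sum(axis=0, keepdims=True)          # p(y)
    ratio = P / np.maximum(px @ py, eps)       # p(x,y) / (p(x) p(y))
    return float(np.sum(np.where(P > 0, P * np.log(np.maximum(ratio, eps)), 0.0)))

def objective(P, row_labels, col_labels, k, l, U, V, alpha, beta, sim, X_vecs, Y_vecs):
    """Eq. (5): the loss (1) minus alpha*WSS minus beta*DSS, with the sample
    weights gamma = p(x) and theta = p(y) as in the text above."""
    P_hat = np.zeros((k, l))                   # compressed joint p(x_hat, y_hat)
    for a in range(k):
        for b in range(l):
            P_hat[a, b] = P[np.ix_(row_labels == a, col_labels == b)].sum()
    loss = mutual_information(P) - mutual_information(P_hat)                    # Eq. (1)
    px, py = P.sum(axis=1), P.sum(axis=0)
    wss = sum(px[i] * sim(X_vecs[i], U[row_labels[i]]) for i in range(len(px)))  # Eq. (3)
    dss = sum(py[j] * sim(Y_vecs[j], V[col_labels[j]]) for j in range(len(py)))  # Eq. (4)
    return loss - alpha * wss - beta * dss
```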
III. THE ITCCWDS ALGORITHM

In [4], Dhillon et al. express (1) as the weighted sum of relative entropies between the row distributions p(Y|x) and the row-cluster prototype distributions q(Y|x̂), or as the weighted sum of relative entropies between the column distributions p(X|y) and the column-cluster prototype distributions q(X|ŷ):

  I(X;Y) − I(X̂;Ŷ) = Σ_x̂ Σ_{x∈x̂} p(x) D(p(Y|x) || q(Y|x̂)),   (6)

  I(X;Y) − I(X̂;Ŷ) = Σ_ŷ Σ_{y∈ŷ} p(y) D(p(X|y) || q(X|ŷ)),   (7)

where D(·||·) denotes the Kullback-Leibler (KL) divergence. According to (3), (4), and the definitions of γ and θ, (5) can be expressed in the same way as (1):

  I(X;Y) − I(X̂;Ŷ) − αWSS − βDSS
   = Σ_x̂ Σ_{x∈x̂} p(x) ( D(p(Y|x) || q(Y|x̂)) − α sim(x, μ_x̂) ) − β Σ_ŷ Σ_{y∈ŷ} p(y) sim(y, ν_ŷ)   (8)
   = Σ_ŷ Σ_{y∈ŷ} p(y) ( D(p(X|y) || q(X|ŷ)) − β sim(y, ν_ŷ) ) − α Σ_x̂ Σ_{x∈x̂} p(x) sim(x, μ_x̂).   (9)
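The prototype distributions q(Y|x̂) are computed "as done in [4]". The sketch below reflects our reading of that construction, namely q(y|x̂) = p(y|ŷ) p(ŷ|x̂), and should be treated as an assumption rather than a restatement of [4]; the function and array names are illustrative.

```python
import numpy as np

def prototype_q_y_given_xhat(P, row_labels, col_labels, k, l):
    """Form q(y | x_hat) = p(y | y_hat(y)) * p(y_hat(y) | x_hat) for every
    row cluster, following our reading of the construction in [4].
    P is the m-by-n joint distribution p(X, Y); labels are cluster indices."""
    # p(x_hat, y_hat): aggregate the joint distribution over cluster blocks
    P_hat = np.zeros((k, l))
    for a in range(k):
        for b in range(l):
            P_hat[a, b] = P[np.ix_(row_labels == a, col_labels == b)].sum()
    p_xhat = P_hat.sum(axis=1)                 # p(x_hat)
    p_yhat = P_hat.sum(axis=0)                 # p(y_hat)
    p_y = P.sum(axis=0)                        # marginal p(y)
    # p(y | y_hat(y)) for every column y
    p_y_given_yhat = p_y / np.maximum(p_yhat[col_labels], 1e-12)
    # q(y | x_hat) = p(y | y_hat(y)) * p(y_hat(y) | x_hat)
    q = np.zeros((k, P.shape[1]))
    for a in range(k):
        p_yhat_given_xhat = P_hat[a] / max(p_xhat[a], 1e-12)
        q[a] = p_y_given_yhat * p_yhat_given_xhat[col_labels]
    return q                                   # each row sums to 1 (up to rounding)
```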
Note that the co-clustering problem is NP-hard, and a local minimum does not guarantee a global minimum [4]. Extending the ITCC algorithm, ITCCWDS therefore minimizes the objective function (5) based on an EM-style alternating procedure. It starts with an initial partition X̂^(0) and Ŷ^(0). It then fixes the column clusters Ŷ and minimizes the objective function in the form (8); next, it fixes the row clusters X̂ and minimizes the objective function in the form (9). The process repeats the last two steps until convergence. When Ŷ is fixed, Σ_ŷ Σ_{y∈ŷ} p(y) sim(y, ν_ŷ) is fixed, so minimizing (8) is equivalent to minimizing (10); when X̂ is fixed, Σ_x̂ Σ_{x∈x̂} p(x) sim(x, μ_x̂) is fixed, so minimizing (9) is equivalent to minimizing (11):

  Σ_x̂ Σ_{x∈x̂} p(x) ( D(p(Y|x) || q(Y|x̂)) − α sim(x, μ_x̂) ),   (10)

  Σ_ŷ Σ_{y∈ŷ} p(y) ( D(p(X|y) || q(X|ŷ)) − β sim(y, ν_ŷ) ).   (11)

Thus we can define the distance from row x to row cluster x̂ and the distance from column y to column cluster ŷ as

  d_{x→x̂} = D(p(Y|x) || q(Y|x̂)) − α sim(x, μ_x̂),   (12)

  d_{y→ŷ} = D(p(X|y) || q(X|ŷ)) − β sim(y, ν_ŷ).   (13)
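As a concrete rendering of (12) and (13), the following minimal sketch assumes the row and column distributions and prototypes are available as arrays and that `sim` is any vector-similarity function; the helper names are ours, not the authors'.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for two discrete distributions given as 1-D arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # 0 * log(0/q) is taken as 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def row_distance(p_y_given_x, q_y_given_xhat, x_vec, mu_xhat, alpha, sim):
    """d_{x -> xhat} of Eq. (12): KL term minus the weighted word similarity."""
    return kl_divergence(p_y_given_x, q_y_given_xhat) - alpha * sim(x_vec, mu_xhat)

def col_distance(p_x_given_y, q_x_given_yhat, y_vec, nu_yhat, beta, sim):
    """d_{y -> yhat} of Eq. (13): KL term minus the weighted document similarity."""
    return kl_divergence(p_x_given_y, q_x_given_yhat) - beta * sim(y_vec, nu_yhat)
```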
Algorithm 1 ITCCWDS
1: Input: the joint probability distribution p(X, Y), the word similarity matrix WordSim(X), the document similarity matrix DocSim(Y), the number of row clusters k, the number of column clusters l, and the parameters α and β.
2: Output: the cluster sets X̂ and Ŷ.
3: Initialization: set t = 0; initialize the word and document cluster sets X̂^(0) and Ŷ^(0); compute q^(0)(X̂,Ŷ), q^(0)(X|X̂), q^(0)(Y|Ŷ), and q^(0)(Y|x̂) as done in [4]; obtain U^(0) and V^(0) by maximizing (3) and (4).
4: repeat
5:   E-row step: for each row sample x, find its new cluster as
       X̂^(t+1)(x) = argmin_x̂ ( D(p(Y|x) || q^(t)(Y|x̂)) − α sim(x, μ_x̂^(t)) ).
     Keep the column clusters unchanged, i.e., Ŷ^(t+1) = Ŷ^(t).
6:   M-row step: compute q^(t+1)(X̂,Ŷ), q^(t+1)(X|X̂), q^(t+1)(Y|Ŷ), and q^(t+1)(X|ŷ) as done in [4], and obtain U^(t+1) and V^(t+1) by maximizing (3) and (4).
7:   E-column step: for each column sample y, find its new cluster as
       Ŷ^(t+2)(y) = argmin_ŷ ( D(p(X|y) || q^(t+1)(X|ŷ)) − β sim(y, ν_ŷ^(t+1)) ).
     Keep the row clusters unchanged, i.e., X̂^(t+2) = X̂^(t+1).
8:   M-column step: compute q^(t+2)(X̂,Ŷ), q^(t+2)(X|X̂), q^(t+2)(Y|Ŷ), and q^(t+2)(Y|x̂) as done in [4], and obtain U^(t+2) and V^(t+2) by maximizing (3) and (4).
9:   Compute the objective function value objValue^(t+2) using (5), and the change in objective function value, δ = objValue^(t) − objValue^(t+2).
10:  t = t + 2
11: until δ < 10^−6
In the E-steps, we reassign the cluster label of each sample based on the current cluster sets. First, in the E-row step, we keep the column clusters fixed and, for each row sample x, find the cluster it is closest to using (12). Then, in the E-column step, we keep the row clusters fixed and, for each column sample y, find the cluster it is closest to using (13). In the M-steps, we update the cluster prototype distributions q(Y|x̂) and q(X|ŷ) and the row and column exemplar sets U and V. First, in the M-row step, we update q(Y|x̂) for each row cluster as done in [4] and update U by maximizing (3). Then, in the M-column step, we update q(X|ŷ) for each column cluster as done in [4] and update V by maximizing (4). The details of the algorithm are shown in Algorithm 1.
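For illustration, the E-row step can be vectorized as below. This is a sketch under our notational assumptions (P holds p(X, Y) with words as rows, `Q_y_given_xhat` holds the current prototypes, and `sim_to_exemplar` is a precomputed word-to-exemplar similarity matrix), not the authors' implementation.

```python
import numpy as np

def e_row_step(P, Q_y_given_xhat, sim_to_exemplar, alpha):
    """One E-row pass: assign every word x to the cluster minimizing Eq. (12).
    P:                m-by-n joint distribution p(X, Y)
    Q_y_given_xhat:   k-by-n prototype distributions q(Y | x_hat)
    sim_to_exemplar:  m-by-k matrix of sim(x, mu_xhat) values
    Returns the new row labels (length-m array of cluster indices)."""
    eps = 1e-12
    p_x = P.sum(axis=1, keepdims=True)                 # marginal p(x)
    P_y_given_x = P / np.maximum(p_x, eps)             # rows are p(Y | x)
    # KL(p(Y|x) || q(Y|xhat)) for every (x, xhat) pair; 0*log0 handled by masking
    log_ratio = np.where(P_y_given_x[:, None, :] > 0,
                         np.log(np.maximum(P_y_given_x[:, None, :], eps) /
                                np.maximum(Q_y_given_xhat[None, :, :], eps)),
                         0.0)
    kl = (P_y_given_x[:, None, :] * log_ratio).sum(axis=2)   # m-by-k
    distances = kl - alpha * sim_to_exemplar                 # Eq. (12)
    return distances.argmin(axis=1)
```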
Lemma 1. Algorithm ITCCWDS monotonically decreases the objective function (5) to a local minimum.

Proof: By (8), the value of the objective function at step t is

  Σ_x̂ Σ_{x: X̂^(t)(x)=x̂} p(x) ( D(p(Y|x) || q^(t)(Y|x̂)) − α sim(x, μ_x̂^(t)) ) − β Σ_ŷ Σ_{y: Ŷ^(t)(y)=ŷ} p(y) sim(y, ν_ŷ^(t))

  ≥ Σ_x̂ Σ_{x: X̂^(t)(x)=x̂} p(x) ( D(p(Y|x) || q^(t)(Y|X̂^(t+1)(x))) − α sim(x, μ^(t)_{X̂^(t+1)(x)}) ) − β Σ_ŷ Σ_{y: Ŷ^(t)(y)=ŷ} p(y) sim(y, ν_ŷ^(t))   (14)

  = Σ_x̂ Σ_{x: X̂^(t+1)(x)=x̂} p(x) ( D(p(Y|x) || q^(t)(Y|x̂)) − α sim(x, μ_x̂^(t)) ) − β Σ_ŷ Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) sim(y, ν_ŷ^(t))   (15)

  ≥ Σ_x̂ Σ_{x: X̂^(t+1)(x)=x̂} p(x) ( D(p(Y|x) || q^(t+1)(Y|x̂)) − α sim(x, μ_x̂^(t+1)) ) − β Σ_ŷ Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) sim(y, ν_ŷ^(t+1))   (16)

  = Σ_ŷ Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) ( D(p(X|y) || q^(t+1)(X|ŷ)) − β sim(y, ν_ŷ^(t+1)) ) − α Σ_x̂ Σ_{x: X̂^(t+1)(x)=x̂} p(x) sim(x, μ_x̂^(t+1)).   (17)
In the above proof, (14) follows from the E-row step; (15) follows since the column cluster set is fixed; (16) follows from [4], from maximizing (3), and from the fact that the column cluster set is fixed; and (17) is due to (8) and (9). In the same way we can get (18):

  Σ_ŷ Σ_{y: Ŷ^(t+1)(y)=ŷ} p(y) ( D(p(X|y) || q^(t+1)(X|ŷ)) − β sim(y, ν_ŷ^(t+1)) ) − α Σ_x̂ Σ_{x: X̂^(t+1)(x)=x̂} p(x) sim(x, μ_x̂^(t+1))

  ≥ Σ_x̂ Σ_{x: X̂^(t+2)(x)=x̂} p(x) ( D(p(Y|x) || q^(t+2)(Y|x̂)) − α sim(x, μ_x̂^(t+2)) ) − β Σ_ŷ Σ_{y: Ŷ^(t+2)(y)=ŷ} p(y) sim(y, ν_ŷ^(t+2)).   (18)

By combining (17) and (18), it follows that ITCCWDS monotonically decreases the objective function. Since the Kullback-Leibler divergence is non-negative and (3) and (4) are bounded, the objective function is lower bounded. Therefore, algorithm ITCCWDS converges to a local minimum in a finite number of steps.

Remark 1. The time complexity of Algorithm ITCCWDS is O((nz(k + l) + km² + ln²)τ), where nz is the number of non-zeros in p(X, Y) and τ is the number of iterations.
IV. EXPERIMENTAL RESULTS

A. Data Sets and Parameter Settings

For our performance evaluation, we use various subsets of the 20-Newsgroups data set (NG20) [17], the CLASSIC3 data set [3], and the Yahoo data set [18]. The NG20 data set consists of approximately 20,000 newsgroup articles collected from 20 different Usenet newsgroups. Many of the newsgroups have similar topics, and about 4.5% of the articles are present in more than one group, making the boundaries between some newsgroups rather fuzzy. To make our comparison consistent with existing work, we reconstructed various subsets of NG20: Binary, Binary subject, Multi5, Multi5 subject, Multi10, Multi10 subject, NG10, and NG20, and preprocessed all the subsets as in [4], [19], i.e., we removed stop words, ignored file headers, lower-cased the text, and selected the top 2000 words by mutual information. The CLASSIC3 data set consists of 3893 abstracts from the MEDLINE, CISI, and CRANFIELD collections, and the Yahoo data set consists of 2340 articles from 6 categories. For CLASSIC3 and Yahoo, after ignoring HTML tags (only for Yahoo), removing stop words, and lower-casing the text, we selected the top 2000 words by mutual information as the preprocessing. The details of these data sets are given in Table I.

Table I. Summary of data sets used for experiments

Dataset                   | # of clusters | # of docs | # of words
Binary & Binary subject   | 2             | 500       | 15582 & 15657
Multi5 & Multi5 subject   | 5             | 500       | 14274 & 14397
Multi10 & Multi10 subject | 10            | 500       | 15336 & 15480
NG10                      | 10            | 20,000    | 143714
NG20                      | 20            | 20,000    | 143714
CLASSIC3                  | 3             | 3893      | 20168
Yahoo                     | 6             | 2340      | 37482

B. Word Similarity

Due to lexical ambiguity, Reisinger and Mooney [20], [21] introduce two multi-prototype approaches to vector-space lexical semantics, in which individual words are represented as collections of "prototype" vectors. Here, we use the approach of [21] to compute the pairwise similarities between words, as it can represent the common metaphor structure found in highly polysemous words. We therefore define the similarity sim(x, μ) between a word x and the exemplar μ as

  sim(x, μ) = (1 / (K_x K_μ)) Σ_{i=1}^{K_x} Σ_{j=1}^{K_μ} d(π_i(x), π_j(μ)),   (19)

where K_x and K_μ are the numbers of prototype clusters for x and μ, π_i(x) and π_j(μ) are the cluster centroids, and d(·, ·) is a standard distributional similarity measure; here we use the cosine measure.
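As an illustration of (19), the sketch below averages the pairwise cosine similarities between the prototype vectors of a word and an exemplar word; it assumes the multi-prototype representation of [21] has already been computed (that representation is not reproduced here), and the function names are ours.

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / max(np.linalg.norm(u) * np.linalg.norm(v), eps))

def word_similarity(prototypes_x, prototypes_mu):
    """Eq. (19): average pairwise similarity between the prototype vectors
    (rows of the two arrays) of word x and exemplar word mu."""
    total = sum(cosine(p, q) for p in prototypes_x for q in prototypes_mu)
    return total / (len(prototypes_x) * len(prototypes_mu))
```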
Table II. The parameter settings (α\β) of ITCCWDS for all the subsets of NG20 with different numbers of word clusters

Dataset         | 2       | 4       | 8       | 16      | 32      | 64      | 128
Binary          | 0.5\0.1 | 0.3\0.2 | 0.8\0.2 | 0.2\0.1 | 1\0.1   | 1\0.2   | 0.5\0.2
Binary subject  | 0.1\1   | 0.2\0.3 | 0.9\0.9 | 0.1\0.1 | 0.1\0.9 | 0.8\0.7 | 1\0.4
Multi5          | 0.1\0.6 | 0.2\1   | 1\9     | 1\9     | 7\9     | 6\6     | 0.9\0.1
Multi5 subject  | 2\6     | 0.1\0.3 | 0.1\0.1 | 0.2\0.8 | 0.2\0.4 | 8\8     | 6\4
Multi10         | 1\10    | 3\10    | 0.2\1   | 0.4\0.3 | 0.5\0.5 | 0.8\1   | 0.9\0.8
Multi10 subject | 2\10    | 1\6     | 0.6\1   | 1\9     | 6\7     | 9\4     | 0.2\0.1

[Figure 1. Micro-averaged-precision values with varied number of word clusters on different NG20 data sets: (a) Binary, (b) Multi5, (c) Multi10, (d) Binary subject, (e) Multi5 subject, (f) Multi10 subject; x-axis: number of word clusters (log scale, 2 to 128); y-axis: micro-averaged precision; curves: ITCCWDS, ITCC, ITCCLS.]
C. Document Similarity

In [22], Chim and Deng propose a phrase-based algorithm that computes the pairwise similarities between documents by combining the Suffix Tree Document (STD) model and the Vector Space Document (VSD) model, and it has proved to be effective. They represent each document as a vector

  d_y = {w(1, y), w(2, y), . . . , w(M, y)},   (20)

where M is the number of terms, w(i, y) = (1 + log tf(i, y)) log(1 + N/df(i)), tf(i, y) is the frequency of the i-th term in document y, df(i) denotes the number of documents containing the i-th term, and N is the total number of documents. They then compute the pairwise similarity between documents y_i and y_j using the cosine measure, as is common in the VSD model. So we can get the similarity between document y and the exemplar ν as

  sim(y, ν) = ⟨d_y, d_ν⟩ / (||d_y|| × ||d_ν||),   (21)

where ⟨·, ·⟩ denotes the inner product and || · || denotes the L2 norm.
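A minimal sketch of the vector-space part of this similarity, using the weighting of (20) and the cosine of (21); the phrase (suffix-tree) terms of [22] are not reproduced, and the helper names and toy data are illustrative.

```python
import math
from collections import Counter

def document_vector(term_counts, df, n_docs):
    """Eq. (20): w(i, y) = (1 + log tf(i, y)) * log(1 + N / df(i)) for the
    terms occurring in the document; term_counts maps term -> tf(i, y)."""
    return {t: (1.0 + math.log(tf)) * math.log(1.0 + n_docs / df[t])
            for t, tf in term_counts.items() if tf > 0}

def document_similarity(d_y, d_nu):
    """Eq. (21): cosine similarity between two sparse document vectors."""
    dot = sum(w * d_nu.get(t, 0.0) for t, w in d_y.items())
    norm_y = math.sqrt(sum(w * w for w in d_y.values()))
    norm_nu = math.sqrt(sum(w * w for w in d_nu.values()))
    return dot / (norm_y * norm_nu) if norm_y and norm_nu else 0.0

# Toy usage: two small documents sharing the term "singer"
docs = [Counter(["singer", "rock"]), Counter(["singer", "musician"])]
df = Counter(t for d in docs for t in d)          # document frequencies
vecs = [document_vector(d, df, len(docs)) for d in docs]
print(document_similarity(vecs[0], vecs[1]))
```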
D. Results and Discussion

In this section, we provide empirical evidence of the effectiveness of algorithm ITCCWDS, in comparison with ITCC [4] and ITCC with local search (ITCCLS).

In the experiments, the initial clusters are generated in the same way as for ITCC, with different strategies for initializing the word clusters and the document clusters. For the word clusters, we choose initial word cluster "centroids" that are maximally far apart from each other: first, we take the word farthest from the centroid of the whole data set as the first word cluster "centroid"; then we repeatedly take the word farthest from all the word cluster "centroids" already picked, until all the word cluster "centroids" are picked. For the initialization of the document clusters, we use a random perturbation of the "mean" document. Since there is a random component in the initialization step, all reported results are averages over five trials. To validate the clustering results, micro-averaged precision [4] is used as the evaluation metric. By analyzing the distance from a sample to the cluster prototype and the similarity between the sample and the cluster exemplar, we set α and β in the range from 0.1 to 10.0.
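The following is a small sketch of the farthest-apart selection of initial word-cluster "centroids" described above. Since the text does not specify the farness measure used for initialization, the sketch assumes one minus cosine similarity, and the function name is ours.

```python
import numpy as np

def farthest_first_centroids(word_vectors, k):
    """Pick k word rows that are maximally far apart: start from the word
    farthest from the global centroid, then repeatedly add the word whose
    nearest already-chosen 'centroid' is farthest (assumed cosine farness)."""
    W = word_vectors / np.maximum(
        np.linalg.norm(word_vectors, axis=1, keepdims=True), 1e-12)
    global_centroid = W.mean(axis=0)
    global_centroid /= max(np.linalg.norm(global_centroid), 1e-12)
    chosen = [int(np.argmin(W @ global_centroid))]   # farthest from the mean
    while len(chosen) < k:
        sim_to_chosen = W @ W[chosen].T              # cosine similarities
        farness = 1.0 - sim_to_chosen.max(axis=1)    # distance to nearest chosen
        farness[chosen] = -np.inf                    # do not re-pick
        chosen.append(int(np.argmax(farness)))
    return chosen
```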
[Figure 2. Micro-averaged-precision values on the large data sets NG10, NG20, Yahoo, and CLASSIC3 for ITCCWDS, ITCC, and ITCCLS (y-axis: micro-averaged precision, 0 to 1).]
Figure 1 shows the performance of ITCCWDS, ITCC, and ITCCLS on the subsets of NG20 as the number of word clusters varies. The values of the parameters α and β are shown in Table II; the two numbers in each cell are the values of α and β, respectively. From the results reported in Figure 1, it is clear that ITCCWDS improves the document clustering precision substantially over ITCC and ITCCLS: it obtains on average almost 15% higher precision than its counterparts on all test sets except Binary subject, on which the proposed method gives comparable results. Figure 1 also demonstrates that ITCCWDS is less sensitive to the number of word clusters. Figure 2 reports the precision values of the three algorithms on the large data sets NG10, NG20, Yahoo, and CLASSIC3. For these four data sets, we set the number of word clusters to 128, 128, 64, and 200, respectively, and the parameter pairs (α, β) to (0.1, 0.9), (0.8, 0.1), (10, 2), and (6, 5). The results show that our algorithm is still effective on the large data sets. CLASSIC3 is easy to cluster: all three algorithms extract the original clusters almost correctly, with micro-averaged-precision values above 0.985. However, on the other three more challenging data sets, our algorithm achieves much better clustering performance; on NG10, it obtains almost 7% higher precision than ITCC and 5% higher than ITCCLS. These comparative results demonstrate the improvement of ITCCWDS in clustering performance.
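For reference, micro-averaged precision is computed here in the common way of labeling each resulting cluster with its majority class and counting the fraction of correctly labeled documents; this is our reading of the metric of [4], and the toy labels below are illustrative.

```python
from collections import Counter

def micro_averaged_precision(true_labels, cluster_labels):
    """Fraction of documents whose cluster's majority class matches
    the document's true class."""
    clusters = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        clusters.setdefault(cluster, []).append(truth)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(true_labels)

# Toy example: 6 documents, 2 true classes, 2 found clusters
print(micro_averaged_precision([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))  # 5/6
```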
V. CONCLUSION

In this paper, we have proposed a novel co-clustering method with intra-similarities. Unlike existing co-clustering algorithms, the proposed algorithm incorporates the similarities between samples belonging to the same set. We have also reported empirical evaluations that demonstrate the advantages of our approach in terms of clustering quality on the document-word co-clustering problem.

ACKNOWLEDGMENT

This project was supported by the NSFC-GuangDong (U0835005) and the NSFC (61173084).

REFERENCES

[1] A. K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, pp. 651–666, June 2010.
[2] Y. Cheng and G. M. Church, "Biclustering of expression data," in Proc. of Int. Conf. on Intelligent Systems for Molecular Biology, 2000, pp. 93–103.
[3] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. of the 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001, pp. 269–274.
[4] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proc. of the 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003, pp. 89–98.
[5] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra, "Minimum sum-squared residue co-clustering of gene expression data," in Proc. of the 4th SIAM Int. Conf. on Data Mining, 2004, pp. 114–125.
[6] B. Long, Z. Zhang, and P. S. Yu, "Co-clustering by block value decomposition," in Proc. of the 11th ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining, 2005, pp. 635–640.
[7] M. Rege, M. Dong, and F. Fotouhi, "Co-clustering documents and words using bipartite isoperimetric graph partitioning," in Proc. of the 6th Int. Conf. on Data Mining, 2006, pp. 532–541.
[8] G. Qiu, "Image and feature co-clustering," in Proc. of the 17th Int. Conf. on Pattern Recognition, 2004, pp. 991–994.
[9] J. Liu and M. Shah, "Scene modeling using co-clustering," in Proc. of the 11th IEEE Int. Conf. on Computer Vision, 2007, pp. 1–7.
[10] S. N. Vitaladevuni and R. Basri, "Co-clustering of image segments using convex optimization applied to EM neuronal reconstruction," in Proc. of the 2010 IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2010, pp. 2203–2210.
[11] H. Shan and A. Banerjee, "Bayesian co-clustering," in Proc. of the 8th IEEE Int. Conf. on Data Mining, 2008, pp. 530–539.
[12] R. G. Pensa and J. F. Boulicaut, "Constrained co-clustering of gene expression data," in SDM, 2008, pp. 25–36.
[13] X. Shi, W. Fan, and P. S. Yu, "Efficient semi-supervised spectral co-clustering with constraints," in Proc. of the 10th IEEE Int. Conf. on Data Mining, 2010, pp. 1043–1048.
[14] N. Slonim and N. Tishby, "Document clustering using word clusters via the information bottleneck method," in Proc. of the 23rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000, pp. 208–215.
[15] A. Banerjee, I. Dhillon, and D. S. Modha, "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation," in Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2004, pp. 509–514.
[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705–1749, December 2005.
[17] K. Lang, "Newsweeder: Learning to filter netnews," in Proc. of the 12th Int. Conf. on Machine Learning, 1995, pp. 331–339.
[18] D. Boley, "Hierarchical taxonomies using divisive partitioning," University of Minnesota, Tech. Rep. TR-98-012, 1998.
[19] I. S. Dhillon and Y. Guan, "Information theoretic clustering of sparse co-occurrence data," Dept. of Computer Sciences, University of Texas, Tech. Rep. TR-03-39, September 2003.
[20] J. Reisinger and R. J. Mooney, "Multi-prototype vector-space models of word meaning," in Proc. of the 2010 Annual Conf. of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 109–117.
[21] J. Reisinger and R. Mooney, "A mixture model with sharing for lexical semantics," in Proc. of the 2010 Conf. on Empirical Methods in Natural Language Processing, 2010, pp. 1173–1182.
[22] H. Chim and X. Deng, "Efficient phrase-based document similarity for clustering," IEEE Trans. on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1217–1229, September 2008.