2011 11th IEEE International Conference on Data Mining

Nonnegative Matrix Tri-Factorization Based High-Order Co-Clustering and Its Fast Implementation

Hua Wang, Feiping Nie, Heng Huang, Chris Ding
Department of Computer Science and Engineering
University of Texas at Arlington, Arlington, Texas 76019, USA
{huawangcs, feipingnie}@gmail.com, {heng, chqding}@uta.edu

[Figure 1 diagram. Node types: words, Web pages, queries, users. Edge labels: containing, referencing, viewing, issuing. Inter-type information is shown as a "Webpage-word" co-occurrence matrix; intra-type information is shown as a pairwise affinity matrix with entries w_11, ..., w_{n_k n_k}.]

Figure 1. A Web search system is a typical example of multi-type relational data, involving (1) inter-type relationships (green solid lines): the relations between objects of different data types; and (2) intra-type relationships (blue dashed lines): the relations between different objects within the same data type.

types of data objects has its own attributes. Meanwhile, different types of data are also interrelated in various ways, e.g., Web pages and words are related via co-occurrences. Such data sets are often called Multi-Type Relational data [1], [2]. Unlike traditional homogeneous data of a single type, multi-type relational data contains more information with richer structure. Typically, we consider the following two forms of information in a multi-type relational data set:
∙ Inter-type relationships characterize the relations between data objects of different types, such as the co-occurrences illustrated by the green solid lines in Figure 1.
∙ Intra-type relationships characterize the native relations between objects within one data type, e.g., as illustrated by the blue dashed lines in Figure 1, the hyperlinks among Web pages, the pairwise affinities among users induced from user attributes, etc.
The rich structure of multi-type relational data provides an opportunity to improve clustering accuracy, but it also presents a new challenge: how to effectively use all the information contained in a multi-type relational data set. Similar to co-clustering for two-type relational data, which makes use of the interrelatedness between the two types of data, simultaneous clustering of multi-type relational data aims to exploit both inter-type and intra-type information, and is called high-order co-clustering (HOCC). In this paper, we tackle this new, yet important, problem, and we also take large-scale data into account for practical use. Given the inter-type relationships and the intra-type in-

Keywords-High-Order Co-Clustering, Multi-Type Relational Data, Nonnegative Matrix Tri-Factorization, Cluster Indicator Matrix

1550-4786/11 $26.00 © 2011 IEEE DOI 10.1109/ICDM.2011.109

I. INTRODUCTION

Most traditional clustering algorithms concentrate on homogeneous data, in which all the objects of interest are of a single type. Recently, the rapid progress of modern technologies, especially those related to the Internet, has brought new data that is much richer in structure, involving objects of multiple types that are related to each other. For example, in a Web search system as illustrated in Figure 1, we have four different types of data entities: words, Web pages, search queries and Web users. Each of these four


Abstract—The fast growth of the Internet and modern technologies has brought data involving objects of multiple types that are related to each other, called Multi-Type Relational data. Traditional clustering methods for single-type data rarely work well on such data, which calls for new clustering techniques, called high-order co-clustering (HOCC), that deal with the multiple types of data at the same time. A major challenge in developing HOCC methods is how to make effective use of all the information contained in a multi-type relational data set, including both inter-type and intra-type relationships. Meanwhile, because many real-world data sets are large, clustering methods with computationally efficient solution algorithms are of great practical interest. In this paper, we first present a general HOCC framework, named Orthogonal Nonnegative Matrix Tri-factorization (O-NMTF), for simultaneous clustering of multi-type relational data. The proposed O-NMTF approach employs Nonnegative Matrix Tri-Factorization (NMTF) to simultaneously cluster different types of data using the inter-type relationships, and incorporates intra-type information through manifold regularization, where, different from existing works, we emphasize the importance of the orthogonality of the factor matrices of NMTF. Based on O-NMTF, we further develop a novel Fast Nonnegative Matrix Tri-Factorization (F-NMTF) approach to deal with large-scale data. Instead of constraining the factor matrices of NMTF to be nonnegative as in existing methods, F-NMTF constrains them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the optimization problem of the proposed method can be decoupled into subproblems of much smaller sizes that require far fewer matrix multiplications, so that our new algorithm scales well to large real-world data. Extensive experimental evaluations have demonstrated the effectiveness of our new approaches.


are denoted as $\mathbf{m}_{i\cdot}$ and $\mathbf{m}_{\cdot j}$ respectively. We denote the Frobenius norm and the trace of a matrix as $\|\cdot\|$ and $\mathrm{tr}(\cdot)$ respectively. $M(i, j)$ denotes the $(i, j)$-th entry of the matrix $M$, and $\mathbf{v}(i)$ denotes the $i$-th entry of the vector $\mathbf{v}$. We denote $\mathbb{R}$ as the set of real numbers, $\mathbb{R}_+$ as the set of nonnegative real numbers, and $\Psi$ as the set of cluster indicator matrices. An indicator matrix $G \in \Psi^{n \times c}$ is a special type of nonnegative matrix: all the entries of $\mathbf{g}_{i\cdot}$ ($1 \le i \le n$) are equal to 0 except for one and only one entry equal to 1, indicating the cluster membership of the corresponding data point, i.e., $\mathbf{g}_{i\cdot} \in \{0, 1\}^c$ and $\sum_j \mathbf{g}_{i\cdot}(j) = 1$.
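The definition of a cluster indicator matrix can be illustrated with a minimal NumPy sketch. The helper names `indicator_matrix` and `is_indicator` and the toy labels are hypothetical and are not part of the paper; the sketch only encodes the two stated properties (entries in {0, 1}, exactly one 1 per row).

```python
import numpy as np

def indicator_matrix(labels, c):
    """Build the n x c indicator matrix G for integer labels in [0, c)."""
    labels = np.asarray(labels)
    G = np.zeros((labels.size, c), dtype=int)
    G[np.arange(labels.size), labels] = 1   # one and only one 1 per row
    return G

def is_indicator(G):
    """Check the defining properties: entries in {0, 1} and every row sums to 1."""
    G = np.asarray(G)
    return bool(np.isin(G, (0, 1)).all() and (G.sum(axis=1) == 1).all())

# Hypothetical toy example: 4 data points assigned to 3 clusters.
labels = [0, 2, 1, 2]
G = indicator_matrix(labels, c=3)

# The columns of an indicator matrix are mutually orthogonal:
# G^T G is diagonal and holds the cluster sizes.
cluster_sizes = G.T @ G
```

Because each row has a single nonzero entry, `G.T @ G` is diagonal, which is one way to see that an indicator matrix automatically has orthogonal columns.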

formation, we first present a simple but general HOCC framework, named Orthogonal Nonnegative Matrix Tri-factorization (O-NMTF), for simultaneous clustering of multi-type relational data. In the proposed O-NMTF approach, we use Nonnegative Matrix Tri-Factorization (NMTF) [3] to simultaneously cluster different types of data based on the inter-type relation matrices. Meanwhile, the optional intra-type information for the different types of data, in the form of pairwise affinities, is incorporated into NMTF as manifold regularization, where, different from existing manifold regularized NMTF methods, we emphasize the importance of the orthogonality of the factor matrices. Existing solution algorithms for NMTF problems, as well as ours for the proposed O-NMTF approach, are usually computationally prohibitive because they involve intensive multiplications of large matrices. Therefore, instead of constraining the factor matrices of NMTF to be nonnegative, we further propose a novel Fast Nonnegative Matrix Tri-Factorization (F-NMTF) approach that constrains them to be cluster indicator matrices, a special type of nonnegative matrix. With this new constraint, the optimization problem can be decoupled into a number of subproblems of much smaller sizes, which require far fewer matrix multiplications. Consequently, our new algorithm is computationally efficient, which makes it particularly useful for clustering large-scale multi-type relational data in real-world applications. We summarize our contributions as follows.

II. ORTHOGONAL NONNEGATIVE MATRIX TRI-FACTORIZATION (O-NMTF) FOR HIGH-ORDER CO-CLUSTERING

In this section, we first briefly review co-clustering of two-type relational data using NMTF, from which we gradually develop the proposed O-NMTF approach for high-order co-clustering of multi-type relational data. Our goal is to employ both the inter-type relationships and the intra-type relationships of the input data via a compact, yet effective, framework.

Problem formalization. We are given a data set with $K$ types of data objects $\mathcal{X} = \{\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_K\}$, where $\mathcal{X}_k = \{\mathbf{x}_1^k, \mathbf{x}_2^k, \ldots, \mathbf{x}_{n_k}^k\}$ represents the $n_k$ data objects of the $k$-th type. Suppose that we have a set of inter-type relationship matrices $\{R_{kl} \in \mathbb{R}^{n_k \times n_l}\}$ between different types of data objects and $R_{lk} = R_{kl}^T$, where $R_{kl}(i, j)$ measures how closely $\mathbf{x}_i^k$ is related to $\mathbf{x}_j^l$. Besides, we also have intra-type information for each type of data, e.g., for the $k$-th type of data we have pairwise affinities $W_k \in \mathbb{R}^{n_k \times n_k}$ between the data objects in $\mathcal{X}_k$. Our goal is to learn from $R_{kl}$ ($1 \le k, l \le K$) and $W_k$ ($1 \le k \le K$) a model that is able to simultaneously partition the data objects in $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_K$ into $c_1, c_2, \ldots, c_K$ disjoint clusters respectively. We denote $n = \sum_k n_k$ and $c = \sum_k c_k$.
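The inputs described in the problem formalization can be laid out concretely as follows. This is only an illustrative sketch with random placeholder data: the sizes, the dictionary layout `R[(k, l)]`, the list `W`, and the symmetrization of the affinities are all assumptions made for the example, not choices stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K = 3 data types with n_k objects and c_k clusters each.
n_sizes = [5, 4, 6]
c_sizes = [2, 2, 3]
K = len(n_sizes)

# Inter-type relationships: R[(k, l)] has shape (n_k, n_l), and
# R[(l, k)] is stored as its transpose so that R_lk = R_kl^T.
R = {}
for k in range(K):
    for l in range(k + 1, K):
        R[(k, l)] = rng.random((n_sizes[k], n_sizes[l]))
        R[(l, k)] = R[(k, l)].T

# Intra-type information: W[k] has shape (n_k, n_k).
# Symmetrizing the affinities is an assumption of this sketch.
W = []
for nk in n_sizes:
    A = rng.random((nk, nk))
    W.append((A + A.T) / 2)

# Totals used throughout the formalization.
n = sum(n_sizes)   # n = sum_k n_k
c = sum(c_sizes)   # c = sum_k c_k
```

A clustering model for this input would return one label vector (or one indicator matrix of shape `(n_k, c_k)`) per data type.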

1) We present a simple, yet effective, framework to tackle the complicated problem of high-order co-clustering of multi-type relational data, which aims to better utilize both the inter-type and the intra-type relationships of a multi-type relational data set.
2) Different from existing manifold regularized Nonnegative Matrix Factorization (NMF) methods [2], [4], [5], we emphasize the importance of the orthogonality of the factor matrices, both theoretically and empirically.
3) Instead of enforcing traditional nonnegativity constraints on the factor matrices of NMTF, we constrain them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the optimization problem of the proposed F-NMTF approach can be decoupled into subproblems of much smaller sizes, and the decoupled subproblems involve far fewer matrix multiplications. Therefore, our approach is computationally efficient and scales well to large-scale real-world data.
Different from our earlier publication [6], which deals with asymmetric NMTF on rectangular input matrices, in this paper we deal with a symmetric square input matrix, which is harder to solve due to the fourth-order term of the factor matrices in the objective.
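To give some intuition for why indicator constraints decouple the problem, the sketch below works with the simpler two-type objective ‖R12 − G1 S12 G2ᵀ‖² and is not the authors' F-NMTF algorithm, whose update rules are not part of this excerpt. With S12 and G2 held fixed, the objective is a sum over the rows of R12, and each one-hot row of G1 simply selects the closest row of S12 G2ᵀ, so every row can be solved independently. The function names `assign_rows` and `objective` and the toy data are hypothetical.

```python
import numpy as np

def assign_rows(R12, S12, G2):
    """For fixed S12 and G2, choose the best one-hot row of G1 for each row of R12."""
    P = S12 @ G2.T                                   # (c1, n2): one prototype row per cluster
    # Squared distance from every row of R12 to every prototype row.
    d = ((R12[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)   # (n1, c1)
    labels = d.argmin(axis=1)                        # independent choice per row
    G1 = np.zeros((R12.shape[0], P.shape[0]))
    G1[np.arange(R12.shape[0]), labels] = 1.0
    return G1

def objective(R12, G1, S12, G2):
    """Squared Frobenius norm ||R12 - G1 S12 G2^T||^2."""
    return np.linalg.norm(R12 - G1 @ S12 @ G2.T, "fro") ** 2

# Hypothetical toy data: 6 row objects, 5 column objects, c1 = 3, c2 = 2.
rng = np.random.default_rng(1)
R12 = rng.random((6, 5))
S12 = rng.random((3, 2))
G2 = np.zeros((5, 2))
G2[np.arange(5), [0, 1, 0, 1, 1]] = 1.0

G1 = assign_rows(R12, S12, G2)
best = objective(R12, G1, S12, G2)
```

Each row assignment costs one small distance computation instead of a multiplicative update involving the full matrices, which is the flavor of saving the third contribution refers to.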

A. A Brief Review of Co-Clustering via NMTF

The simplest multi-type relational data involves only two types of objects, and such data appears widely in real-world applications, e.g., words and documents in document analysis, users and items in collaborative filtering, and experimental conditions and genes in microarray data analysis. Instead of being independent, the clustering tasks for the different types of objects are often closely related. As a result, co-clustering methods (also called bi-clustering in some research papers), which simultaneously cluster both types of data by leveraging their interrelatedness, have been proposed [3], [5], [7]–[9]. Among these methods, NMTF based co-clustering methods have attracted increased attention in recent years due to their mathematical elegance and promising empirical results. Motivated by the close connection between NMF and K-means clustering [3], [10], Ding et al. [3] proposed to use NMTF to simultaneously cluster the rows and columns of a

Notations. Throughout this paper, we denote matrices as uppercase characters and vectors as boldface lowercase characters. The 𝑖-th row and 𝑗-th column of the matrix 𝑀


nonnegative input relationship matrix $R_{12}$ by decomposing it into three nonnegative factor matrices, which minimizes:

$$J_1 = \left\| R_{12} - G_1 S_{12} G_2^T \right\|^2, \quad \text{s.t.} \ G_1 \ge 0, \ G_2 \ge 0, \ S_{12} \ge 0, \tag{1}$$

where $G_1 \in \mathbb{R}_+^{n_1 \times c_1}$ and $G_2 \in \mathbb{R}_+^{n_2 \times c_2}$ are continuous and act as the "soft" cluster indications [10] for $\mathcal{X}_1$ and $\mathcal{X}_2$ respectively, and $S_{12} \in \mathbb{R}_+^{c_1 \times c_2}$ absorbs the different scales of $R_{12}$, $G_1$ and $G_2$. The original NMF problem [11] requires $R_{12}$ to be nonnegative. In co-clustering scenarios, however, this constraint (thereby the nonnegativity constraint on $S_{12}$) can be relaxed [12] to achieve additional flexibility, which leads to the semi-NMTF problem that minimizes:

$$J_2 = \left\| R_{12} - G_1 S_{12} G_2^T \right\|^2, \quad \text{s.t.} \ G_1 \ge 0, \ G_2 \ge 0. \tag{2}$$

Proof: Upon the definitions of $R$, $G$ and $S$, we derive:

$$\left\| R - G S G^T \right\|^2 = \left\| \begin{bmatrix} 0 & R_{12} \\ R_{12}^T & 0 \end{bmatrix} - \begin{bmatrix} 0 & G_1 S_{12} G_2^T \\ G_2 S_{12}^T G_1^T & 0 \end{bmatrix} \right\|^2 = 2 \left\| R_{12} - G_1 S_{12} G_2^T \right\|^2,$$

which proves the lemma. Based upon Lemma 1, we have the following theorem.

Theorem 1: It is equivalent to solve Eq. (4) and to solve the following problem: $\min J_5 = \| R - G S G^T \|$, in which
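The tri-factorization objective and the block identity ‖R − G S Gᵀ‖² = 2‖R12 − G1 S12 G2ᵀ‖² used in the proof can be checked numerically. The sketch below is a sanity check on random data, not part of the paper; the block forms of `R`, `G` and `S` are an assumption (symmetric off-diagonal blocks for `R` and `S`, block-diagonal `G`), since their definitions are not included in this excerpt, and the function name `nmtf_objective` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, c1, c2 = 6, 5, 3, 2

# Random nonnegative placeholders for the relation and factor matrices.
R12 = rng.random((n1, n2))
G1 = rng.random((n1, c1))
G2 = rng.random((n2, c2))
S12 = rng.random((c1, c2))

def nmtf_objective(R12, G1, S12, G2):
    """J = ||R12 - G1 S12 G2^T||^2, the squared Frobenius norm of Eqs. (1) and (2)."""
    return np.linalg.norm(R12 - G1 @ S12 @ G2.T, "fro") ** 2

# Assumed block forms (not stated in this excerpt):
#   R = [[0, R12], [R12^T, 0]],  G = diag(G1, G2),  S = [[0, S12], [S12^T, 0]]
R = np.block([[np.zeros((n1, n1)), R12], [R12.T, np.zeros((n2, n2))]])
G = np.block([[G1, np.zeros((n1, c2))], [np.zeros((n2, c1)), G2]])
S = np.block([[np.zeros((c1, c1)), S12], [S12.T, np.zeros((c2, c2))]])

lhs = np.linalg.norm(R - G @ S @ G.T, "fro") ** 2
rhs = 2 * nmtf_objective(R12, G1, S12, G2)
```

Under these block forms, `G @ S @ G.T` has `G1 @ S12 @ G2.T` in its upper-right block and the transpose in its lower-left block, so the residual appears twice and `lhs` matches `rhs` up to floating-point error.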



B. Objective of O-NMTF

A natural generalization of the co-clustering objective in Eq. (2) to the simultaneous clustering of multi-type relational data, called high-order co-clustering [1], [14]–[17], is to solve the following optimization problem [1], [16], [17]: $\min J_3 = \sum \| \cdots \|^2 \ldots$
