I/O-Scalable Bregman Co-clustering and Its Application to the Analysis of Social Media
Technical Report
Department of Computer Science and Engineering, University of Minnesota
4-192 Keller Hall, 200 Union Street SE, Minneapolis, MN 55455-0159 USA

TR 11-024
I/O-Scalable Bregman Co-clustering and Its Application to the Analysis of Social Media
Kuo-Wei Hsu and Jaideep Srivastava
October 10, 2011
Kuo-Wei Hsu
University of Minnesota
Minneapolis, MN USA
[email protected]

Jaideep Srivastava
University of Minnesota
Minneapolis, MN USA
[email protected]
Abstract Adoption of social media has experienced explosive growth in recent years, and this trend appears likely to continue. A natural consequence has been the creation of vast quantities of data by social media applications, and hence increased interest from the database community. This data also provides unique opportunities to understand sociological and psychological aspects of human interaction and media production/consumption, and hence the growth of areas such as user modeling, behavior analysis, and social network analysis, which together are being labeled as the emerging area of Computational Social Science (CSS) [37, 59]. These new types of data analysis are leading to the introduction of new computational techniques, e.g. p* modeling, ERGMs [62], and co-clustering [6]. This paper focuses on a scalable implementation of the Bregman co-clustering algorithm and its application to social media analysis. The Bregman co-clustering algorithm performs two-way clustering and is scalable in theory; we discuss an OLAP based implementation that achieves this scalability in practice. Principally, we demonstrate how the aggregations required by the algorithm map naturally to summary statistics computed by an OLAP engine and stored in data cubes. Our OLAP based implementation of the algorithm is able to handle large-scale datasets, i.e. datasets that are too large for main memory based implementations. Further, we explore the suitability of the relational model for modeling social media data. Specifically, we argue that data cubes and the star schema are well suited to managing social media data. Our research is a step toward the increasing interest the research community has in connecting three research areas, namely database, data mining, and social media analysis.

Keywords OLAP · Data cube · Bregman co-clustering · Social media
1 Introduction

1.1 Overview

Adoption of social media has experienced explosive growth in the past few years, and this trend appears unabated. Following on the heels of this trend is the creation of vast quantities of data by social media applications, and hence the increasing interest from the database community. This data also provides a unique opportunity to understand sociological and psychological aspects of human interaction and media production/consumption, and hence the growth of areas such as user modeling, behavior analysis, and social network analysis, which together are being labeled as the emerging area of Computational Social Science (CSS) [37, 59]. These new types of data analysis are leading to the introduction of new computational techniques, e.g. p* modeling, ERGMs [62], and co-clustering [6]. Database research addresses the management of large quantities of data; data mining is about the analysis of data; and social media analysis, including social network analysis, studies social actors and the relationships among them on social webs (see Fig. 1). In the figure, a line represents a connection and means that there are techniques available to bridge two research areas. The link on the right corresponds to techniques that perform data mining with the help of a database; the link on the left is related to data models and schemas for social media analysis; the link at the bottom represents applications such as collaborative filtering (CF) based recommendation systems. However,
there are limited studies in the literature considering connections between database and social media analysis, while connections between database and data mining remain a continuing research topic. Much of database research has been driven by the need to develop efficient techniques to implement complex data manipulation operations, e.g. the relational join; and CSS offers an exciting new opportunity to do so once again. In this paper we focus on a specific computational technique, called Bregman Co-Clustering [6], which is gaining wide usage in social network analysis (see [7], for example). However, its standard implementation creates significant I/O bottlenecks, and it is thus ripe for exploration of I/O scalable implementations, which is the specific focus of the present paper. Extending a preliminary investigation [33], we introduce an implementation that utilizes a database engine to implement the Bregman co-clustering algorithm, thereby building a connection between database and data mining; we also demonstrate an application of this implementation to social media, thereby presenting connections among these three research areas: database, data mining, and social media, as illustrated in Fig. 1.
Fig. 1. Database, data mining, and social media analysis.

By connecting these three research areas, we find the following benefits. The first benefit is the capability to perform advanced analyses on large-scale data in data mining and social media analysis. Applying data mining techniques to digital investigations brings benefits to investigators and analysts, but large-scale datasets raise challenges, as mentioned by Beebe and Clark in [8]. As another example, Lopez et al. in [40] present an application analyzing a large amount of data from software development communities. The challenges are the same as social media data become larger and richer, especially in the online age. Moreover, the work presented in this paper can be extended to other application domains, as techniques for social media analysis have been deployed elsewhere. For example, Dreyfus and Iyer in [25] apply the concept of the social network to the architectural design of an information system. Keith et al. in [35] propose a coordination network
analysis approach for better understanding the organizational and individual impacts of the service-oriented enterprise structure. In short, our work creates a shortcut to implementing scalable algorithms with database support and therefore has great potential to overcome these challenges. The second benefit is the saving of data pre-processing time. Hirji in [32] indicates that a large percentage of the time needed to run a data mining project is spent pre-processing data. It is common practice to employ SQL scripts to perform data pre-processing (e.g. cleaning, filtering, and transforming data records) before applying data mining algorithms. As discussed by Inmon in [34], data warehouse techniques help data mining algorithms clean, integrate, and summarize data. If we perform data pre-processing for data mining inside a database or a data warehouse, we can save time and further speed up the whole project.
1.2 Social media analysis

Sociology is supported by theories and evaluated by experiments. Sociologists develop a theory based on observations and/or other theories, and use field surveys and statistical analyses for evaluation. Nevertheless, sociologists cannot always design appropriate experiments, owing to the complexity or high cost of large-scale experiments. Therefore, there is a need for simulation, which includes smaller-scale experiments and/or result analyses. A satisfying simulation helps devise and revise a theory, while a satisfying environment makes such a simulation practical. WWW provides a good simulation environment because it reflects the thoughts of users and also records interactions between users. Since the introduction of social media to WWW, more websites provide social network services that help users interact socially with others. A social web is a web based platform that allows users to interact with others directly by exchanging information and/or sharing comments or perspectives. WWW provides us a platform to gather rich social media data because of its convenience. At the same time, WWW brings us new challenges, also because of its convenience: people use WWW frequently and interact easily with others on the web, so the amount of data collected from social webs is enormous. Mislove et al. in [43] give statistics for four popular social webs: Flickr (www.flickr.com), LiveJournal (www.livejournal.com), Orkut (www.orkut.com), and YouTube (www.youtube.com). The authors show that all of these web sites have millions of users and
tens or even hundreds of millions of links representing friendships. Both the large size and the high complexity of social media data make management and analysis even harder today. The relational model is fundamental here because it has been widely studied in academia and deployed in industry. Therefore, there are studies using the relational model as the back-end support for social network analysis, such as [61]. However, they design data schemas following the principles of OLTP (online transaction processing [13]) rather than OLAP (online analytical processing [30]). From the perspective of decision support systems, the goal of OLAP includes the computation and management of summarized data rather than individual data records (e.g. daily transaction records), which are the major focus of OLTP. In other words, an OLAP system has to be able to compute summary statistics along multiple dimensions at various granularities. The star schema is one of the building blocks of OLAP, and it is designed to accomplish this goal.
1.3 Co-clustering

Data mining is a broad research field, so we focus on a subfield named clustering. More specifically, in this paper we concentrate on co-clustering, which performs two-way clustering. Clustering is an unsupervised learning technique used to group similar data samples such that the coherence inside a cluster is higher than that between clusters. Clustering algorithms measure the degree of coherence by optimizing various kinds of objective functions defined to minimize the distance between data samples. As data and clustering tasks become more complicated, it becomes harder to use such algorithms to produce satisfactory results. Co-clustering, on the other hand, has attracted great attention because it simultaneously measures the degree of coherence in data samples (rows) and in features (columns), and so achieves better performance. It has been deployed in various areas: co-clustering has been applied in bioinformatics to simultaneously cluster genes and experiments [16, 18] and in text mining to discover document clusters and word clusters at the same time [22]. For instance, we can apply a co-clustering algorithm to CF based recommendation systems in a natural way. Media is a channel of communication, while social media is such a channel built upon social interactions. In the Internet era, social interactions refer to particular forms of actions that people take on WWW (World Wide Web). Social media analysis studies such interactions. Social network analysis, for example, is a major component of social media analysis and studies relationships between social actors based on their social interactions. Let us use a CF based movie recommendation system as an example. The goal of the system is to find who (a user) likes what (movies). This belongs to social media analysis, and a possible social
network analysis task is the discovery of potential social networks among users based on the movies they like or dislike. Recommendation systems help users find items in which they might be interested. They can be roughly divided into two categories, content-based filtering and collaborative filtering [53]. Content-based filtering makes recommendations based on the content that a user has accessed and/or is accessing; it is an application of personalization techniques. Collaborative filtering is a social media application because it focuses not just on the analysis of content but on the analysis of recommendations from other users. Specifically, people having common interests tend to behave in similar ways (since "people of one kind come together"): they give similar comments on web pages, books, movies, songs, all kinds of products, or even other people. In a CF based recommendation system, "the user is recommended items that people with similar preferences have liked in the past" [58]. Let us use a CF based movie recommendation system as an example to demonstrate that co-clustering is a natural fit for such a system. Fig. 2 shows an example of using co-clustering to recommend movies. In Fig. 2, a bipartite graph consists of people and movies, and each person gives ratings to movies. We can store these ratings in a contingency matrix where colored elements represent degrees of preference. Applying co-clustering to the matrix, we obtain two types of groups: groups of people having similar taste and groups of similar movies. Traditional clustering algorithms face problems if, for example, we treat movies as ordinary features and do not cluster them. In the following, P and M respectively denote a person and a movie, and a subscript is simply an identifier. If both P1 and P2 like M2 while both P1 and Pm like Mn, it is highly probable that these three persons are in the same group. We will be even more confident about this clustering if M2 and Mn are in the same series of movies. It comes as no surprise that a group of similar movies interests the same group of people, and that people showing similar preferences among movies will enjoy the same groups of movies. However, traditional clustering algorithms have difficulty capturing this intuition. By grouping similar features, co-clustering leads to more informative clusters; a minimal illustration of this idea is sketched below.
Fig. 2. An example of using co-clustering in movie recommendation.
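To make the intuition of Fig. 2 concrete, the following is a minimal sketch (our illustration, not an implementation from [6]) that computes block averages on a small, made-up rating matrix with fixed cluster assignments; the names Z, rho, and gamma are ours:

import numpy as np

# A small, made-up person-by-movie rating matrix (rows P1..P4,
# columns M1..M4); 0 marks a missing rating.
Z = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

# Suppose a co-clustering groups rows {P1, P2} vs {P3, P4} and columns
# {M1, M2} vs {M3, M4}; rho[i] / gamma[j] are the cluster labels.
rho = np.array([0, 0, 1, 1])
gamma = np.array([0, 0, 1, 1])

# The average rating of each (row-cluster, column-cluster) block then
# summarizes one user group's taste for one movie group and can serve
# as a crude prediction for the block's missing ratings.
for g in range(2):
    for h in range(2):
        block = Z[np.ix_(rho == g, gamma == h)]
        rated = block[block > 0]          # ignore missing ratings
        print(f"block ({g},{h}): mean rating = {rated.mean():.2f}")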
The Bregman co-clustering algorithm proposed in [6] is one of the more advanced co-clustering algorithms. Other co-clustering techniques include one based on information theory [22] and those based on matrix decomposition [39, 23]. Banerjee et al. in [6] compare four implementations applied to a CF based recommendation system on one of the MovieLens datasets [36, 54, 55]; co-clustering provides comparable accuracy and requires less computational effort. In addition, Agarwal and Merugu in [1] incorporate covariate information and use Bregman divergences to perform co-clustering. The covariate information employed in [1] includes movie content and statistics about users, and the approach groups rows and columns simultaneously such that the resulting clusters provide the most accurate prediction after adjusting for covariates. "Powerful as it is, co-clustering is not practical to apply on large matrices", note Papadimitriou and Sun [52]. Moreover, "it becomes hard to scale automated collaborative filtering systems to large numbers of items while maintaining reasonable prediction performance" [45]. Nevertheless, the goal of this paper is to show that we can achieve scalability and maintain the prediction performance of Bregman co-clustering by utilizing an OLAP engine for its implementation. We address the issue of scalability of the Bregman co-clustering algorithm. Specifically, we consider the general framework proposed in [6] and describe an OLAP based implementation that utilizes a database engine to improve runtime performance. Chen et al. in [15] emphasize the importance of implementing efficient and scalable algorithms that can work on large datasets stored in a database. A key observation is that the Bregman co-clustering algorithm requires computation and re-computation of summary statistics, which are used to build an approximation matrix, as we shall see later in this paper. This computation cost can be significantly reduced by using an OLAP engine, especially when the dataset is large: OLAP is applied to compute the summary statistics for the Bregman co-clustering algorithm, and the mapping of the algorithm's data access operations to OLAP operations turns out to be fairly natural. Since social media analysis examines social actors and studies their social relationships, it is natural to use the relational model and the star schema to manage such relationships. The star schema is at the center of our implementation and, as we shall see later in this paper, there is a natural match between social media analysis and our OLAP based implementation of the Bregman co-clustering algorithm. Banerjee et al. in [6] have been very successful in applying co-clustering to various applications. Thus, we believe that the techniques presented in this paper will go a long way in making co-clustering applicable to large datasets. Furthermore, since we employ standard (i.e. well-developed and widely-deployed) OLAP operations, our OLAP based implementation has a large prospective user base.
1.4 Our contributions

This paper makes a number of contributions, briefly listed below and then discussed in the remainder of this subsection:
- A scalable and portable implementation of Bregman co-clustering (Sections 2 and 3).
- A study of data schemas for social media analysis through a short survey (Section 4).
- A connection between Bregman co-clustering and social media analysis through an experimental study (Section 5).
1.4.1

The first contribution is an OLAP based implementation of the Bregman co-clustering algorithm, which makes the algorithm more applicable, especially in situations where data are large-scale and stored in a database. It connects research on Bregman co-clustering with database research and makes explicit the relationship between the Bregman co-clustering algorithm and the data cube, a fundamental element of OLAP. Such an implementation of scalable Bregman co-clustering benefits data mining research by introducing OLAP support, and it also broadens the application domain of database research. Moreover, as discussed by Chen et al. in [15], it is important for a data mining system to efficiently and effectively extract information from the large amounts of real-world data stored in relational databases, since most real-world data mining projects study large amounts of read-only data. A similar viewpoint can be found in [50], where Palpanas suggests that a data mining system should take full advantage of OLAP to meet the analyst's requirements.
1.4.2

The second contribution is to demonstrate a connection between database and social media analysis by studying data schemas for social media analysis. The relational model is fundamental here because it has been widely studied in academia and deployed in industry. We argue that using the star schema and OLAP helps us achieve greater flexibility, scalability, and efficiency in analyzing social media data.
1.4.3

The third contribution is to demonstrate a connection between the Bregman co-clustering algorithm and social media analysis. Banerjee et al. in [6] discuss some innovative applications of the Bregman co-clustering algorithm, including a CF based movie recommendation system, and the algorithm has been extended into a framework dealing with multi-type relational data for movie recommendation in [7]. Such applications implicitly create a connection between the Bregman co-clustering algorithm and social media analysis.
1.5 Outline

The remainder of this paper is organized as follows. We review the Bregman co-clustering algorithm in Section 2. Next, we give a brief review of database support for data mining and discuss our OLAP based implementation of the algorithm in Section 3. Following that, in Section 4 we introduce social media analysis, connect it to database research with a discussion of the relational data model and the star schema, and additionally discuss its connection to data mining. Afterwards, experimental results along with discussions are presented in Section 5. While related work is mentioned throughout Sections 1 to 5 wherever it makes the content more relevant, we study related work in more detail in Section 6. Finally, conclusions and future work are given in Section 7.
2 Bregman Co-clustering

2.1 Overview

Co-clustering is an emerging research topic relative to traditional clustering, which has been widely used for years. Co-clustering, like traditional clustering, uses features to group data samples; however, the goal of co-clustering is to simultaneously cluster data samples and features. Given a matrix where a row contains the values of the features describing one data sample and a column contains the values of all data samples with respect to a specific feature, a co-clustering task is to generate groups of data samples and groups of features. In other words, co-clustering clusters rows and columns at the same time if we use a contingency matrix where a row represents a data sample and a column represents a feature. Below we introduce the general co-clustering framework proposed in [6]. The Bregman co-clustering algorithm clusters data samples using statistical information gathered from features, and at the same time it clusters features with the help of statistical information collected from data samples [6]. It has shown promising performance, both in the quality of the produced clustering results and in its computational efficiency [6]. The key feature of the Bregman co-clustering algorithm is that it associates a co-clustering task with the discovery of an
optimal approximation matrix. An approximation matrix is obtained by the use of a specific set of smaller matrices (i.e. row, column, row-cluster, column-cluster, and/or co-cluster matrices). That is to say, the Bregman co-clustering algorithm converts the co-clustering task into the search for an optimal approximation matrix, using the minimum Bregman information principle [6]. Similar to information theoretic co-clustering [22], in Bregman co-clustering the objective function is defined so as to minimize the loss in Bregman information between rows (e.g. data samples) and columns (e.g. features) before and after co-clustering. The algorithm utilizes a specific set of matrices to build an approximation matrix first, and then evaluates the approximation error in order to construct an optimal approximation matrix. Not only does the Bregman co-clustering algorithm produce better results than other co-clustering algorithms, it is also much more scalable in theory. A naive implementation of the Bregman co-clustering algorithm is a main memory based implementation. The algorithm is scalable in theory but restricted by the main memory space in practice. When a dataset is large, or does not fit entirely in main memory, a main memory based implementation of the algorithm will spend a significant fraction of its time on context switching or file access. Such an implementation limits the inherent scalability of the Bregman co-clustering algorithm. A significant fraction of the computations performed by the Bregman co-clustering algorithm can easily be mapped to those performed by an OLAP engine, as discussed in [33]. Experiments reported in [33] show that the version working on top of an OLAP engine is much more scalable. This unlocks the power of the Bregman co-clustering algorithm for applications, e.g. social media analysis, which usually come with large datasets.
2.2 Bases in Bregman co-clustering

The Bregman co-clustering algorithm associates co-clustering with matrix approximation, where the quality of the approximation is measured in terms of the approximation error between the original matrix and the approximated one [6]. A better co-clustering leads to a better matrix approximation. More specifically, quality is evaluated by a Bregman divergence function $d_\phi$, and optimality is determined by the minimum Bregman information (MBI) principle. Given a matrix Z, the Bregman information [5] (defined as the expected Bregman divergence to the expected value of Z) corresponding to I-divergence is defined as follows:
$$I_\phi(Z) = E\left[Z \log\frac{Z}{E[Z]}\right]$$
The Bregman information corresponding to squared Euclidean distance is defined as follows:
$$I_\phi(Z) = E\left[(Z - E[Z])^2\right]$$
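As a concrete check of the two definitions above, the following small sketch computes both quantities for a toy vector, assuming uniform weights and, for the I-divergence case, strictly positive entries:

import numpy as np

def bregman_information_sqeuclid(z):
    # I_phi(Z) = E[(Z - E[Z])^2], i.e. the variance of Z.
    return float(np.mean((z - z.mean()) ** 2))

def bregman_information_idiv(z):
    # I_phi(Z) = E[Z log(Z / E[Z])]; requires strictly positive Z.
    return float(np.mean(z * np.log(z / z.mean())))

z = np.array([1.0, 2.0, 3.0, 4.0])
print(bregman_information_sqeuclid(z))  # 1.25
print(bregman_information_idiv(z))      # ~0.266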
The Bregman co-clustering algorithm builds an approximation matrix $\hat{Z}$ by using a Bregman co-clustering basis (i.e. a blueprint for using summary statistics to construct an approximation matrix) that characterizes a set of summary statistics. Notation is as follows:
- $U$ is a set of rows (e.g. data samples, actors, or objects) and is represented as a matrix of size m.
- $V$ is a set of columns (e.g. features or attributes) and is represented as a matrix of size n.
- $\rho$ is a set of row-clusters and is used to represent a function $\rho(U): \{1,\dots,m\} \to \{1,\dots,k\}$ that maps m rows to k row-clusters.
- $\gamma$ is a set of column-clusters and is used to represent a function $\gamma(V): \{1,\dots,n\} \to \{1,\dots,l\}$ that maps n columns to l column-clusters.

$U$ and $V$ can be seen as random variables over $\{1,\dots,m\}$ and $\{1,\dots,n\}$, respectively. $\rho$ is a mapping of row-clusters: it keeps the information indicating which row is assigned to which row-cluster; in other words, it is a function of $U$, written $\rho(U)$. Similarly, $\gamma$ is a mapping of column-clusters: it is a function of $V$, written $\gamma(V)$, that maps a column to the column-cluster to which it belongs. $\{\rho\}$ is a set of summary statistics obtained from row-clusters, and $\{\gamma\}$ is a set of summary statistics obtained from column-clusters. In what follows, we introduce the six Bregman co-clustering bases defined in [6]. Before proceeding, some notation concerning the bases is in order: $E[\cdot]$ represents an expectation (or weighted average) function, the symbol $\wedge$ denotes logical AND, subscript i indicates the i-th row or row-cluster, and subscript j indicates the j-th column or column-cluster. The operations below are for squared Euclidean distance.

2.2.1 Basis C1

Basis C1 is defined by $\{\{\rho\}, \{\gamma\}\}$. The corresponding summary statistics include those obtained from row-clusters and those obtained from column-clusters. Each element of an approximation matrix is calculated according to Eq. 1:

$$\hat{Z}_{i,j} = E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i)\}] + E[\{Z_{i',j'} \mid \gamma(V_{j'}) = \gamma(V_j)\}] - E[Z] \quad \text{(Eq. 1)}$$

The three terms on the right-hand side of the above equation represent, respectively, the average of elements in each row-cluster, the average of elements in each column-cluster, and the average of all elements of Z. The third term is, in other words, a global average.

2.2.2 Basis C2

Basis C2 is defined by $\{\{\rho, \gamma\}\}$ and represents a set of summary statistics obtained from co-clusters, i.e. the k × l blocks formed by the row- and column-clusters. Eq. 2 calculates each element of an approximation matrix:

$$\hat{Z}_{i,j} = E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i) \wedge \gamma(V_{j'}) = \gamma(V_j)\}] \quad \text{(Eq. 2)}$$

It builds an approximation matrix based on the average of the elements of each co-cluster or block; such a block captures the interaction of a row-cluster and a column-cluster. We also call this average a block average.

2.2.3 Basis C3

Basis C3 is defined by $\{\{\rho, \gamma\}, \{U\}\}$. Each element of an approximation matrix is calculated according to Eq. 3:

$$\hat{Z}_{i,j} = E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i) \wedge \gamma(V_{j'}) = \gamma(V_j)\}] + E[\{Z_{i,j'} \mid U_i\}] - E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i)\}] \quad \text{(Eq. 3)}$$

The three terms on the right-hand side of the above equation are, respectively, the block average, the average of all elements of each row, and the average of elements in each row-cluster of the given data matrix.

2.2.4 Basis C4

Basis C4 is defined by $\{\{\rho, \gamma\}, \{V\}\}$. Each element of an approximation matrix is calculated according to Eq. 4, which mirrors Eq. 3, the equation corresponding to C3:

$$\hat{Z}_{i,j} = E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i) \wedge \gamma(V_{j'}) = \gamma(V_j)\}] + E[\{Z_{i',j} \mid V_j\}] - E[\{Z_{i',j'} \mid \gamma(V_{j'}) = \gamma(V_j)\}] \quad \text{(Eq. 4)}$$

The three terms on the right-hand side of the above equation are, respectively, the block average, the average of all elements of each column, and the average of elements in each column-cluster of the given data matrix.

2.2.5 Basis C5

Basis C5 is defined by $\{\{\rho, \gamma\}, \{U\}, \{V\}\}$ and indicates a set of statistics from co-clusters, rows, columns, row-clusters, and column-clusters. Eq. 5 gives the calculation of each element of an approximation matrix:

$$\hat{Z}_{i,j} = E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i) \wedge \gamma(V_{j'}) = \gamma(V_j)\}] + E[\{Z_{i,j'} \mid U_i\}] + E[\{Z_{i',j} \mid V_j\}] - E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i)\}] - E[\{Z_{i',j'} \mid \gamma(V_{j'}) = \gamma(V_j)\}] \quad \text{(Eq. 5)}$$

The first term of the above equation is the block average, i.e. the average of the elements of each co-cluster. The second term represents the average of all elements of each row of the given data matrix; similarly, the third term represents the average of all elements of each column. The fourth term is the average of elements in each row-cluster, while the fifth term is the average of elements in each column-cluster.

2.2.6 Basis C6

Basis C6 is defined by $\{\{U, \gamma\}, \{V, \rho\}\}$, where $\{U, \gamma\}$ is a set of rows crossed with column-clusters while $\{V, \rho\}$ is a set of columns crossed with row-clusters. Eq. 6 calculates each element of an approximation matrix for C6:

$$\hat{Z}_{i,j} = E[\{Z_{i,j'} \mid U_i \wedge \gamma(V_{j'}) = \gamma(V_j)\}] + E[\{Z_{i',j} \mid V_j \wedge \rho(U_{i'}) = \rho(U_i)\}] - E[\{Z_{i',j'} \mid \rho(U_{i'}) = \rho(U_i) \wedge \gamma(V_{j'}) = \gamma(V_j)\}] \quad \text{(Eq. 6)}$$

The first term on the right-hand side of the above equation is the average of the elements of each row within each column-cluster; it is averaged over the columns belonging to that column-cluster. Likewise, the second term is the average of the elements of each column within each row-cluster; it is averaged over the rows belonging to that row-cluster. Finally, the third term is the block average.
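To make the bases concrete, here is a small sketch (a dense, in-memory illustration of Eq. 5 for basis C5 under squared Euclidean distance, not the OLAP implementation described later); it assumes a NumPy matrix Z, cluster-label arrays rho and gamma, and non-empty clusters:

import numpy as np

def approx_matrix_c5(Z, rho, gamma, k, l):
    # Assemble Z-hat for basis C5: block average + row average
    # + column average - row-cluster average - column-cluster average.
    row_avg = Z.mean(axis=1)                              # E[Z | U_i]
    col_avg = Z.mean(axis=0)                              # E[Z | V_j]
    rc_avg = np.array([Z[rho == g].mean() for g in range(k)])
    cc_avg = np.array([Z[:, gamma == h].mean() for h in range(l)])
    block_avg = np.array([[Z[np.ix_(rho == g, gamma == h)].mean()
                           for h in range(l)] for g in range(k)])
    return (block_avg[rho][:, gamma]
            + row_avg[:, None] + col_avg[None, :]
            - rc_avg[rho][:, None] - cc_avg[gamma][None, :])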
2.3 Bregman co-clustering algorithm

This subsection is based on the discussion in [6], and Fig. 3 gives the algorithm. A set of summary statistics is represented as a set of random variables

$$S_A = \{Z' : E[Z' \mid c] = E[Z \mid c], \text{ for all } c \in C\},$$

where $C$ is a Bregman co-clustering basis and $Z'$ preserves the summary statistics. A co-clustering task can then be described as follows: given an $m \times n$ matrix $Z$, a Bregman divergence $d_\phi$, the number of row-clusters $k$, the number of column-clusters $l$, and a Bregman co-clustering basis $C$, the goal is to find an optimal co-clustering $(\rho^*, \gamma^*)$, i.e. the one that minimizes $E[d_\phi(Z, \hat{Z})]$, where $\hat{Z}$ is the matrix that attains the minimum Bregman information over $S_A$: $\hat{Z} \triangleq \arg\min_{Z' \in S_A} I_\phi(Z')$. This is an optimization problem and can be solved by the use of the Lagrange dual. The solution is given by Eq. 7:

$$\nabla\phi(\hat{Z}) \triangleq \nabla\phi(E[Z]) - \sum_{r=1}^{s} \Lambda^*_{c_r} \quad \text{(Eq. 7)}$$

In Eq. 7, $\nabla$ denotes the gradient with respect to $\phi$, and $\Lambda$ is a Lagrange multiplier while $\Lambda^*$ is an optimal Lagrange multiplier. The difference between Eq. 7 and the corresponding equation in [6] is that we omit the weight terms $w$, since we assign equal weights to all data samples. Although it is difficult to find a globally optimal solution, it is possible to use an iterative update approach to find a locally optimal solution [6]. The algorithm is based on an alternating minimization strategy, as shown in Fig. 3; Pan et al. in [51] call this the "Iterative Single Side Clustering" approach.

BregmanCoClustering(Z, C, k, l):
  Initialize rho and gamma with random values
  Repeat
    1. Calculate summary statistics SS based on rho and gamma (see Section 3)
    2. Update row-clusters:
       for i = 1 to m
         rho*(U_i) <- argmin_{g, 1 <= g <= k} E[d_phi(Z, Z-hat)],
           where Z-hat is an approximation matrix based on SS, Z, C,
           rho' (with rho'(U_i) = g), and gamma
       end for
    3. Update column-clusters:
       for j = 1 to n
         gamma*(V_j) <- argmin_{h, 1 <= h <= l} E[d_phi(Z, Z-hat)],
           where Z-hat is an approximation matrix based on SS, Z, C,
           rho, and gamma' (with gamma'(V_j) = h)
       end for
  Until convergence or I iterations have been performed
  Return (rho*, gamma*)

Fig 3. Bregman co-clustering algorithm [6, 33].
Initially, the algorithm randomly assigns each row to some row-cluster and each column to some column-cluster. Following that, in step 1 it calculates summary statistics according to the user-specified co-clustering basis. In step 2 the algorithm assigns each row to the row-cluster that minimizes the error with respect to rows. When handling row-clusters in step 2, the algorithm treats column-clusters as known information and builds an approximation matrix with row-clusters as the unknown information. After obtaining an approximation matrix, the algorithm evaluates the approximation error by computing the Bregman divergence between the original data matrix and the approximated one. Next, the algorithm updates the row-cluster of each row by finding the row-cluster that gives the minimum Bregman information. Similarly, when handling column-clusters in step 3, the algorithm views row-clusters as known information and constructs an approximation matrix with column-clusters. After an approximation matrix is obtained, again, the algorithm computes the Bregman divergence between the original data matrix and the approximated one in order to evaluate the approximation error. Then, it adopts the column-cluster giving the minimum Bregman information to update the column-cluster of each column. Finally, the algorithm terminates either when the convergence criterion is met or when the maximum number of iterations I is reached. The convergence criterion is the difference in approximation errors between iterations t and t-1, and we set the threshold for convergence to 0.001. The step that computes summary statistics is the key to the Bregman co-clustering algorithm. The computation of summary statistics requires the most significant computational effort, regardless of whether the Bregman divergence admits a closed form solution: even when no closed form solution exists, the summary statistics still need to be computed. The time complexity of the Bregman co-clustering algorithm is proportional to the number of elements in the input data matrix, as discussed in [6]. One straightforward implementation is a main memory based implementation (such as one using Matlab); a typical performance chart for such an implementation is illustrated in Fig. 4. Although the algorithm itself is scalable, such an implementation is restricted by the size of main memory, especially when datasets are too large to fit into main memory. A minimal in-memory sketch of the alternating loop follows.
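The following sketch implements the alternating loop for basis C2 under squared Euclidean distance (our illustration, not the paper's C# implementation); empty blocks fall back to the global mean, and the 0.001 convergence threshold matches the text:

import numpy as np

def bregman_cocluster_c2(Z, k, l, max_iter=20, tol=1e-3, seed=0):
    # Alternating minimization for basis C2 under squared Euclidean
    # distance: Z-hat is the co-cluster (block) average.
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    rho = rng.integers(0, k, size=m)      # random initial row-clusters
    gamma = rng.integers(0, l, size=n)    # random initial column-clusters
    prev_err = np.inf
    for _ in range(max_iter):
        # Step 1: summary statistics -- the k x l block averages.
        B = np.zeros((k, l))
        for g in range(k):
            for h in range(l):
                blk = Z[np.ix_(rho == g, gamma == h)]
                B[g, h] = blk.mean() if blk.size else Z.mean()
        # Step 2: move each row to the row-cluster minimizing its error.
        for i in range(m):
            errs = [((Z[i] - B[g, gamma]) ** 2).sum() for g in range(k)]
            rho[i] = int(np.argmin(errs))
        # Step 3: likewise for each column.
        for j in range(n):
            errs = [((Z[:, j] - B[rho, h]) ** 2).sum() for h in range(l)]
            gamma[j] = int(np.argmin(errs))
        # Convergence: change in approximation error below the threshold.
        err = float(((Z - B[rho][:, gamma]) ** 2).mean())
        if prev_err - err < tol:
            break
        prev_err = err
    return rho, gamma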
Fig. 4. Running time (sec) versus the number of non-zero elements in the data matrix (×10³) for a main memory based implementation.

Database techniques could benefit data mining, since most real-world applications collect and store data in a database. However, this is not always reflected in practice. According to a poll on a renowned data mining web site (http://www.kdnuggets.com/polls/2009/data-mining-tools-used.htm), Matlab is used by data miners more than Microsoft SQL Server and Oracle DM for real projects, while results from another poll (http://www.kdnuggets.com/polls/2009/deployment-data-mining-models.htm) indicate that about one quarter of voters (data miners) deploy data mining models using a DBMS and SQL. It is crucial for the Bregman co-clustering algorithm to efficiently process large amounts of data stored in a database. As illustrated in Fig. 5, if the dataset is too large to be loaded into main memory at once, we have to partition the dataset and process one part at a time. We do so in the step that computes summary statistics; we repeat this in order to find the best row-cluster assignment for each row; and we do it again when updating the column-clusters of columns. Therefore, we need to transmit data between main memory and secondary storage three times. Fig. 5(a) illustrates a scenario where a powerful machine is employed to calculate summary statistics and the whole dataset is partitioned and transmitted three times between the database and the machine: the first transmission is to get data for the calculation of summary statistics; the second is to get data (in a stream, for example) to update the assigned row-clusters; the third is to get data to update the assigned column-clusters. Fig. 5(b) displays a scenario where a database engine is utilized to calculate summary statistics, and hence the whole dataset is partitioned and transmitted only twice.
Fig. 5. Scenarios for utilizing a database for data mining.

To provide a more efficient implementation, we need help from secondary storage management systems. For this purpose, a relational database is a natural secondary storage management system, and SQL is a powerful tool for performing set-oriented operations. Furthermore, because an OLAP engine computes and maintains summary statistics inside the database, it is not necessary to read all data into main memory when computing summary statistics. Our implementation details are discussed in the next section.
3 Support from database

In this section we discuss an OLAP based implementation of the Bregman co-clustering algorithm and related work on database support for data mining.
3.1 An OLAP based implementation

Here we show how an OLAP engine provides an interface that effectively fits the requirements of the Bregman co-clustering algorithm. Both OLAP and Bregman co-clustering revolve around aggregations. OLAP is designed to answer aggregation queries over large datasets efficiently, and OLAP queries can be viewed as correlated aggregate queries over data cubes [2]. The Bregman co-clustering algorithm builds an approximation matrix by using summary statistics that are in fact aggregations, and it requires a large number of aggregations along multiple dimensions. Thus, using an OLAP engine to implement the Bregman co-clustering algorithm is fairly intuitive. The first step in using OLAP to implement Bregman co-clustering is to define a schema to store matrices, especially sparse matrices, in a relational database. As mentioned by
Cornacchia et al. in [19], most storage schemes are designed for specific data access patterns in particular applications; examples include Compressed Row/Column Storage as well as Compressed Diagonal Storage. For storing a matrix in a relational database, especially a large yet sparse one, a common approach is to have a relation whose tuples represent the elements of the matrix. Three data structures are commonly used to store sparse matrices: coordinate storage (COO), compressed sparse row (CSR), and compressed sparse column (CSC). From a database perspective, COO is a denormalized table while CSR and CSC are normalized tables. Goharian et al. in [28] compare data structures, including COO, CSR, and CSC, for storing sparse matrices. They conclude that CSR uses much less storage space than COO but that its processing time is not significantly different from that of COO. We adopt the COO format because it avoids extra join operations; that is, querying one denormalized table is faster than querying and joining two or more normalized tables. We store matrices, which are usually sparse in social media analysis, with the following schema:

CREATE TABLE matrix (
  row INT,
  col INT,
  val FLOAT
);

(FLOAT could instead be DECIMAL(18, 2).) The schema is straightforward and simple, and it takes advantage of the sparseness of most (if not all) data considered in social media analysis. An N-dimensional data cube is composed of (N-1)-dimensional cubes and aggregates [30]. The definition of the data cube is shown below:

CREATE TABLE CCube (
  row INT,
  col INT,
  r_clust INT,
  c_clust INT,
  val FLOAT
);

Furthermore, we employ the star schema and perform aggregation on the data cube used to store the summary statistics required by the Bregman co-clustering algorithm. The star schema can be interpreted in the following way: the table storing the data samples of the contingency matrix is a fact table; the table storing (row, row-cluster) pairs is a dimension table that extends the fact table by adding row-clustering information to it; similarly, the table storing (column, column-cluster) pairs is another dimension table that adds column-clustering information to the fact table.
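As an illustration, the following sketch creates the fact table and the two dimension tables in SQLite and loads a tiny sparse matrix; the paper's implementation uses Microsoft SQL Server, and here the dimension tables R and C are given explicit r_clust/c_clust columns for readability:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fact table: COO representation -- one tuple per non-zero element.
cur.execute("CREATE TABLE matrix (row INT, col INT, val FLOAT)")
# Dimension tables: current cluster assignment of each row / column.
cur.execute("CREATE TABLE R (row INT, r_clust INT)")
cur.execute("CREATE TABLE C (col INT, c_clust INT)")

# A tiny sparse matrix and an initial (arbitrary) co-clustering.
cur.executemany("INSERT INTO matrix VALUES (?, ?, ?)",
                [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (2, 2, 5.0)])
cur.executemany("INSERT INTO R VALUES (?, ?)", [(0, 0), (1, 0), (2, 1)])
cur.executemany("INSERT INTO C VALUES (?, ?)", [(0, 0), (1, 0), (2, 1)])
conn.commit()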
The following statement performs the aggregation on the data cube:

INSERT INTO CCube
SELECT Z.row, Z.col, R.col AS r_clust, C.col AS c_clust,
       SUM(Z.val) AS val
FROM (R JOIN Z ON R.row = Z.row) JOIN C ON Z.col = C.row
GROUP BY Z.row, Z.col, R.col, C.col
WITH CUBE;

R is a table holding (row, row-cluster) pairs, and C is a table holding (column, column-cluster) pairs. Notice that the aggregation uses the SUM function rather than the AVG function. We only store non-zero elements in the tables that represent matrices, but the Bregman co-clustering bases require average values over both non-zero and zero elements (of the specified sub-matrices related to each basis), so we need to account for all corresponding elements when calculating average values. This can be achieved by using the SUM function together with information about the sizes of the row-clusters and column-clusters. This simple technique is important: without it, we would have an incorrect implementation (and the link between Bregman co-clustering and social media analysis would break). If we instead stored all (zero and non-zero) elements in the database and overlooked this technique, we would not be taking advantage of the sparseness of the data used in social media analysis (and the link between database and social media analysis would break). A small sketch of this sum-to-average conversion is given below.

Now let us focus on the Bregman co-clustering bases and their mappings to the data cube. A data cube consists of all combinations of hierarchical relationships, and hence it can be represented in lattice form, as used in Figs. 6 through 11; corresponding nodes in Figs. 6 through 11 indicate the same value. The notation used here is consistent with that introduced earlier; one thing worth mentioning is that ρ and γ can be obtained from the data cube, while U and V come from the input data matrix. These figures present six lattices, each of which maps to a Bregman co-clustering basis [6]. A node in gray represents the set of statistics required for a Bregman co-clustering basis. In these figures, a node indicates which dimensions we are interested in, and it gives a smaller data cube (or simply a table) that contains aggregates along all possible combinations of the specified dimensions. For example, all four dimensions (ρ, γ, U, and V) are fixed in node 1 (the top node of these figures), so node 1 represents individual elements. At the other extreme, node 16 fixes nothing and specifies no dimensions, so node 16 represents an aggregate along all four dimensions: the global aggregate. These figures visualize the mapping from the Bregman co-clustering bases to the data cube lattice. We summarize the mapping for each co-clustering basis in the following subsections.
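Continuing the SQLite sketch above (still an illustration rather than the SQL Server implementation): SQLite has no WITH CUBE, so we aggregate explicitly at the co-cluster level and then apply the sum-to-average technique, dividing the stored sums by block sizes derived from the dimension tables:

# The co-cluster level of the cube: sum of the stored (non-zero)
# elements for every (row-cluster, column-cluster) block.
cur.execute("""
    SELECT R.r_clust, C.c_clust, SUM(Z.val)
    FROM matrix AS Z
    JOIN R ON R.row = Z.row
    JOIN C ON C.col = Z.col
    GROUP BY R.r_clust, C.c_clust
""")
block_sums = {(g, h): s for g, h, s in cur.fetchall()}

# Cluster sizes, counting all (zero and non-zero) cells of each block.
rows_in = dict(cur.execute(
    "SELECT r_clust, COUNT(*) FROM R GROUP BY r_clust").fetchall())
cols_in = dict(cur.execute(
    "SELECT c_clust, COUNT(*) FROM C GROUP BY c_clust").fetchall())

# Block average over *all* cells = sum of non-zeros / (height * width);
# blocks with no stored elements simply have average 0 (absent here).
block_avgs = {(g, h): s / (rows_in[g] * cols_in[h])
              for (g, h), s in block_sums.items()}
print(block_avgs)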
3.1.1 Basis C1

The equation corresponding to Basis C1, Eq. 1, is in Section 2.2.1. The value of each element of an approximation matrix for C1 equals the average of elements in its row-cluster, plus the average of elements in its column-cluster, minus the global average of all elements. These three components respectively correspond to nodes 12, 15, and 16 in the data cube lattice in Fig. 6. Node 12 aggregates elements along the row-cluster dimension; it queries the data cube with the condition that the row-cluster is not null. Node 15 aggregates elements along the column-cluster dimension; it queries the data cube with the condition that the column-cluster is not null. Node 16 fixes nothing and returns the global average of all elements of the original data matrix.
Fig. 6. Basis C1 and the corresponding nodes in the data cube lattice.

The following query refers to node 12 (i.e. the row-cluster average, shown in gray in Fig. 6):

SELECT r_clust, val
FROM CCube
WHERE row IS NULL AND col IS NULL
  AND r_clust IS NOT NULL AND c_clust IS NULL;
The following query refers to node 15 (i.e. the column-cluster average, shown in gray in Fig. 6):
SELECT c_clust, val
FROM CCube
WHERE row IS NULL AND col IS NULL
  AND r_clust IS NULL AND c_clust IS NOT NULL;

Note that the global average (node 16) could also be calculated from the row-cluster averages (node 12) or the column-cluster averages (node 15). When we maintain information about the sizes of row-clusters or column-clusters in memory, the calculation of the global average can be performed in memory, which can be faster than issuing a query for node 16.
3.1.2 Basis C2

The equation corresponding to Basis C2, Eq. 2, is in Section 2.2.2. C2 corresponds to aggregating elements along the row-cluster and column-cluster dimensions. Using C2 to build an approximation matrix is equivalent to computing the average of the elements in each co-cluster, which can be obtained from node 9 in the data cube lattice in Fig. 7.

Fig. 7. Basis C2 and the corresponding nodes in the data cube lattice.

Node 9, shown in gray in Fig. 7, represents the average of elements in each interaction of the column-clusters and row-clusters: the block average. In other words, it aggregates elements along the row-cluster and column-cluster dimensions and returns the elements of each interaction (i.e. block) of all combinations of row- and column-clusters. The following query refers to node 9 (the block average):

SELECT r_clust, c_clust, val
FROM CCube
WHERE row IS NULL AND col IS NULL
  AND r_clust IS NOT NULL AND c_clust IS NOT NULL;

To sum up, if our goal is to build an approximation matrix based on basis C2, we need the average of elements in each block (i.e. each interaction of the row- and column-clusters). Thus we need the aggregates along all combinations of row-clusters and column-clusters, which we obtain from node 9.

3.1.3 Basis C3

The equation corresponding to Basis C3, Eq. 3, is in Section 2.2.3. The value of each element of an approximation matrix equals the block average, plus the average of elements of its row, minus the average of elements of its row-cluster. These three components respectively correspond to nodes 9, 12, and 13 in the data cube lattice in Fig. 8, where they are shown in gray.

Fig. 8. Basis C3 and the corresponding nodes in the data cube lattice.

Nodes 9 and 12 are described in Sections 3.1.1 and 3.1.2. Node 13 aggregates elements along the row dimension; it corresponds to querying the data cube with the condition that the row is not null. The following query refers to node 13 (the row average):

SELECT row, val
FROM CCube
WHERE row IS NOT NULL AND col IS NULL
  AND r_clust IS NULL AND c_clust IS NULL;
3.1.4 Basis C4

The equation corresponding to Basis C4, Eq. 4, is in Section 2.2.4. The value of each element of an approximation matrix equals the block average, plus the average of elements of its column, minus the average of elements of its column-cluster. These three components respectively correspond to nodes 9, 14, and 15 in the data cube lattice in Fig. 9. Nodes 9 and 15 are described earlier. Node 14 aggregates elements along the column dimension; it corresponds to querying the data cube with the condition that the column is not null.
3.1.5 Basis C5

The equation corresponding to Basis C5, Eq. 5, is in Section 2.2.5. Summary statistics based on C5 require values from nodes 9, 12, 13, 14, and 15 in the data cube lattice in Fig. 10. To calculate each element of an approximation matrix, we first add together the block average (node 9), the average of elements of the row (node 13), and the average of elements of the column (node 14); we then add together the average of elements of the row-cluster (node 12) and the average of elements of the column-cluster (node 15); finally, we subtract the second sum from the first. We could derive the values in nodes 12 and 15 from node 9, but we can also simply issue two queries for them once the data cube is created. Note that this case differs from that of Section 3.1.1, where we calculate the global average in memory using the row-cluster averages or the column-cluster averages: that calculation involves a single set of averages (node 12 or node 15) and can be done in a simple loop. Deriving nodes 12 and 15 from the block averages (node 9), in contrast, involves two sets of averages and is more complicated, so computing them from node 9 could be slower than issuing queries for them.
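For illustration, assuming the block_sums, rows_in, and cols_in dictionaries from the SQLite sketch earlier in this section, the in-memory derivation of the row-cluster averages (node 12) from the block sums (node 9) looks as follows:

# Row-cluster average over all cells of the cluster's rows:
# total sum across that row-cluster's blocks / (cluster height * n).
n_cols = sum(cols_in.values())                     # total column count
node12 = {g: sum(block_sums.get((g, h), 0.0) for h in cols_in)
             / (rows_in[g] * n_cols)
          for g in rows_in}
print(node12)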
Fig. 9. Basis C4 and the corresponding nodes in the data cube lattice.

The following query refers to node 14 (i.e. the column average, shown in gray in Fig. 9):

SELECT col, val
FROM CCube
WHERE row IS NULL AND col IS NOT NULL
  AND r_clust IS NULL AND c_clust IS NULL;
Fig. 10. Basis C5 and the corresponding nodes in the data cube lattice.
3.1.6 Basis C6

The equation corresponding to Basis C6, Eq. 6, is in Section 2.2.6. The value of each element of an approximation matrix equals the average of the elements of its row within its column-cluster, plus the average of the elements of its column within its row-cluster, minus the block average. These components correspond to nodes 10, 7, and 9, respectively, in the data cube lattice in Fig. 11.
Fig. 11. Basis C6 and the corresponding nodes in the data cube lattice.

Node 7 aggregates elements along the row-cluster and column dimensions; it queries the data cube with the condition that the row-cluster is not null and the column is not null. The following query refers to node 7 (i.e. the row-cluster column average, shown in gray in Fig. 11):

SELECT col, r_clust, val
FROM CCube
WHERE row IS NULL AND col IS NOT NULL
  AND r_clust IS NOT NULL AND c_clust IS NULL;

Node 10 aggregates elements along the column-cluster and row dimensions; it queries the data cube with the condition that the column-cluster is not null and the row is not null. The following query refers to node 10 (i.e. the column-cluster row average, shown in gray in Fig. 11):
SELECT row, c_clust, val
FROM CCube
WHERE row IS NOT NULL AND col IS NULL
  AND r_clust IS NULL AND c_clust IS NOT NULL;

4 Social Media Analysis

Since the concept of social media was introduced, more social network services have become available on WWW; examples are e-communities, blogs, and collaborative filtering and recommendation systems. Consequently, more social media data are generated and collected, and this enormous amount of data brings a significant challenge to the management and analysis of social media data. In this section, we explore the suitability of the relational model for modeling social media data. Specifically, we show that data cubes and the star schema, extensively used in OLAP, are well suited to modeling social media data. Martin and Gutierrez in [41] state that "The social network analysis data workflow can certainly benefit from data management techniques based on an appropriate data model". But what is the appropriate data model for social network analysis, or for social media analysis? Here we argue that the well-developed relational model is a satisfactory model for this purpose. Since a social network, usually embedded in social media data, is composed of a set of actors (entities) and a set of relationships (links), it is straightforward to use the Entity-Relationship (ER) model to describe a social network. This natural mapping indeed helps in understanding and explaining the structure of a social network. However, most social network or media data are generated by computers and stored in a relational database management system, and a conceptually correct ER diagram does not necessarily correspond to a practically efficient design when it is mapped to a relational data schema and implemented in a relational database. Here we use an example from [61] to illustrate the data schemas used today for social network analysis. Carley et al. in [11] describe a set of requirements for a toolkit designed to facilitate social network analysis and dynamic network analysis, and they also demonstrate a collection of tools used to automate the analysis. One of these tools is the NetIntel database proposed in [61], whose goal is to manage relational data and perform complex SQL queries. Fig. 12 presents a simplified version of the schema proposed in [61], while Fig. 13 shows a revision of Fig. 12 using the star schema. To ease the comparison we give the two figures first and then a brief discussion.
Fig. 12. The simplified data schema used in the NetIntel database [61].
Fig. 13. A revision of Fig. 12 using the star schema.

Tsvetovat et al. in [61] use an Edge table and a Node table to store the edges and nodes of a social network, respectively. They also design an EdgeType table for detailed edge information and a set of tables to describe eight different types of nodes comprehensively. The schema proposed in [61] satisfies the requirements described in [11], and it can easily answer the following example queries, which can be found in [11]:
- "Find all social structure data that came from New York Times."
- "Find all data that came from New York Times article from 10/10/2003."
- "What is the network of people who were born in Syria?"

Once social media data are generated from interactions between users, the data are rarely updated and hardly ever deleted; most of the time they are read, extracted, and analyzed. This is exactly the situation for which OLAP was originally designed, and we argue that we should take advantage of OLAP to facilitate social media analysis. In what follows we revisit the example from [61] and revise it using the star schema. The philosophy behind the design proposed in [61] is integration with a database, similar to that described in [24]. However, it is not trivial for such a data schema to answer the following query: "Build a cross-tabulation of strength of communication by Agent and Location (or Event)."

This type of query consists of aggregation over multiple dimensions and is important in some social media analyses, such as intelligence analysis and anti-terrorism. In reality, however, the large amount of data makes answering the above query even harder. The data schema proposed in [61] would have been better had the authors introduced OLAP into the design. OLAP is designed to answer this type of query, as well as more complicated multi-dimensional aggregation queries [30]. By modifying the schema presented in Fig. 12 into the star schema illustrated in Fig. 13, we can respond to queries involving multi-dimensional aggregation more efficiently. Furthermore, Martin and Gutierrez in [41] indicate that "most network operations are performed outside the DBMS, loosing in this way most of the benefits of using a DBMS in the first place." This supports our view that we should leverage the star schema and take advantage of OLAP. With the help of the star schema and data cubes, OLAP can benefit social media analysis by reducing the computational cost. By transforming the schema in Fig. 12 into the star schema in Fig. 13, we can apply our implementation of the Bregman co-clustering algorithm to social network analysis.
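To illustrate how a star schema answers such a cross-tabulation directly, here is a small sketch in SQLite with hypothetical table and column names (not the actual NetIntel schema):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical star schema: a communication fact table with foreign keys
# into Agent and Location dimension tables (illustrative names only).
cur.executescript("""
    CREATE TABLE agent (agent_id INT, name TEXT);
    CREATE TABLE location (location_id INT, name TEXT);
    CREATE TABLE communication (agent_id INT, location_id INT,
                                strength REAL);
""")
cur.executemany("INSERT INTO communication VALUES (?, ?, ?)",
                [(1, 1, 0.9), (1, 2, 0.4), (2, 1, 0.7), (2, 1, 0.3)])

# Cross-tabulation of strength of communication by Agent and Location:
# a single aggregation over the fact table.
for agent, loc, total in cur.execute(
        "SELECT agent_id, location_id, SUM(strength) "
        "FROM communication GROUP BY agent_id, location_id"):
    print(agent, loc, total)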
5 Experimental results

Our implementation is in C# and uses Microsoft SQL Server (Microsoft SQL Server Express, more precisely) as the back-end database. We use OLAP to compute and maintain summary statistics, as mentioned earlier, and we use functions written in C# to compute Bregman divergences and to update cluster assignments in main memory. As is common practice in database programming, we employ buffers to facilitate data transmission between our main program and the back-end database. The size of a data buffer is 10,000 tuples, while the size of a command buffer is 40 SQL statements.
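To illustrate the buffering just described, here is a minimal sketch of a command buffer that accumulates SQL statements and flushes them in batches of 40; the data buffer of 10,000 tuples would be handled analogously. The class and method names are ours, for illustration only.

using System.Collections.Generic;
using System.Data.SqlClient;

// A minimal sketch of the command buffer described in the text:
// statements are accumulated and sent to SQL Server in batches of 40.
class CommandBuffer
{
    private const int Capacity = 40;              // command buffer size from the text
    private readonly List<string> statements = new List<string>();
    private readonly SqlConnection connection;

    public CommandBuffer(SqlConnection connection)
    {
        this.connection = connection;
    }

    public void Add(string sqlStatement)
    {
        statements.Add(sqlStatement);
        if (statements.Count >= Capacity)
            Flush();
    }

    public void Flush()
    {
        if (statements.Count == 0) return;
        // Send all buffered statements as one batch to reduce round trips.
        string batch = string.Join(";\n", statements.ToArray());
        using (var cmd = new SqlCommand(batch, connection))
            cmd.ExecuteNonQuery();
        statements.Clear();
    }
}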
The computational infrastructure for our experiments is based on commodity hardware. We conduct experiments on a general-purpose PC equipped with one 3 GHz single-core CPU and 2 GB of main memory (of which around 1.2 GB is available to programs). The use of commodity hardware should serve to buttress our claims of improved computational efficiency, since any gains demonstrated below can easily be magnified using advanced computational hardware and techniques. Daruru et al. in [20] propose an implementation of co-clustering based on the Dataflow programming model and provide experimental results with a focus on the effect of using multi-core hardware. Papadimitriou and Sun in [52] introduce an implementation of co-clustering based on Map-Reduce [21]. The computational infrastructure employed in [52] is a cluster of 39 nodes, each of which consists of two dual-core processors (running at 2.66 GHz or 3 GHz) and 8 GB RAM. Such special infrastructure provides significant computing power but limits portability.
We use the MovieLens datasets [36, 54, 55], where a row is a user id, a column is a movie id, and a value is a rating between 1 and 5. The datasets used in the experiments are summarized in Table 1. The dataset 0.1M is originally from MovieLens 100K. The dataset 0.4M is selected from MovieLens 10M100K with the condition user id ≤ 10,000 and movie id ≤ 1,000. The dataset 1M is originally from MovieLens 1M. The dataset 1.4M is selected from MovieLens 10M100K with the condition user id ≤ 20,000 and movie id ≤ 2,000.

Table 1. Datasets used in experiments.

Name                          0.1M     0.4M     1M         1.4M
Rows (user ids)               943      10,000   6,040      20,000
Columns (movie ids)           1,682    1,000    3,952      2,000
Non-zero elements (ratings)   100,000  423,607  1,000,209  1,471,838
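For concreteness, the selection conditions in Table 1 can be expressed as a single SQL statement. The sketch below assumes the raw ratings were loaded into a hypothetical Ratings(UserId, MovieId, Rating) table; reading the conditions as upper bounds is our interpretation of the selection described above.

using System.Data.SqlClient;

class DatasetSelection
{
    // Builds the 0.4M dataset from MovieLens 10M100K as described in the text.
    // Table and column names (Ratings, Ratings04M, UserId, MovieId, Rating)
    // are hypothetical placeholders for illustration.
    static void SelectSubset(SqlConnection conn)
    {
        const string sql = @"
            SELECT UserId, MovieId, Rating
            INTO Ratings04M
            FROM Ratings
            WHERE UserId <= 10000 AND MovieId <= 1000;";
        using (var cmd = new SqlCommand(sql, conn))
            cmd.ExecuteNonQuery();
    }
}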
We compare the elements of a final approximation matrix to the elements of the input data matrix. This also demonstrates the application of Bregman co-clustering to a CF based movie recommendation system, since the elements of a final approximation matrix represent the predicted ratings. We also report results in terms of RMSE (root mean squared error) and/or MAE (mean absolute error), even though our implementation has no impact on clustering performance. MAE is a simple statistic that treats all errors (incorrect predictions) equally, while RMSE penalizes errors according to how far they are from the correct predictions and hence is suitable for ranking the results. While our implementation does not in any way affect the performance of the co-clustering algorithm itself, we report clustering accuracy scores nevertheless to confirm the correct implementation of the algorithm and to strengthen the case for using Bregman co-clustering in social media analysis. In the following subsections we report results obtained from 10 runs.
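A minimal sketch of the two error measures, computed over the non-zero elements of the input matrix, with predicted ratings rounded to the nearest integer as described in Sec. 5.3; the class and method names are ours, for illustration only.

using System;

static class ErrorMeasures
{
    // RMSE penalizes large deviations more heavily than MAE,
    // which weights all errors equally, as discussed in the text.
    public static void Compute(double[] actual, double[] predicted,
                               out double rmse, out double mae)
    {
        double squaredSum = 0.0, absoluteSum = 0.0;
        for (int i = 0; i < actual.Length; i++)
        {
            // Round the predicted rating to the nearest integer (see Sec. 5.3).
            double error = actual[i] - Math.Round(predicted[i]);
            squaredSum += error * error;
            absoluteSum += Math.Abs(error);
        }
        rmse = Math.Sqrt(squaredSum / actual.Length);
        mae = absoluteSum / actual.Length;
    }
}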
5.1 Main memory based implementation

Fig. 4 illustrates the runtime performance of a main memory based implementation in Matlab. In this subsection we report the runtime performance obtained by applying a main memory based implementation in C# to the four datasets. As we can see in Fig. 14, especially for Basis C2, running time increases rapidly as the size of the dataset increases. For instance, 1.4M is 14 times larger than 0.1M, but the running time for 1.4M is 20 to 60 times that for 0.1M; the ratios of the running time of 1.4M to that of 0.1M for the six bases are 23.9 (C1), 44.6 (C2), 26.1 (C3), 63.2 (C4), 29.5 (C5), and 24.0 (C6), respectively. This is because the underlying system spends more time doing I/O and memory management (e.g. virtual memory allocation and/or context switching), as discussed in Sec. 2.3 and illustrated in Fig. 5.

Fig. 14. Runtime performance given by a main memory based implementation in C#.

Fig. 15 displays the memory allocated by this implementation. Memory usage also increases rapidly as the size of the dataset increases. Nevertheless, we see in Fig. 15 that 1.4M takes less memory than 1M. This indicates that the underlying system reallocates memory blocks in order to save memory: it spends more time doing memory I/O so that main memory does not have to hold everything, and runtime performance deteriorates sharply due to the additional memory I/O, as shown in Fig. 14.

Fig. 15. Memory allocated by a main memory based implementation in C#.
5.2 The use of database views

The use of data cubes in the implementation of Bregman co-clustering algorithm is described in Section 3.1. It is natural to use a database view to manage a data cube. Fig. 16 displays the difference in total running time (in seconds) between using and not using database views. The dataset 0.1M is used in these experiments, and both the number of row clusters and the number of column clusters are set to 10. Total running time includes all read and write operations as well as computation.
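As an illustration of managing a data cube with a database view, the sketch below defines a view that maintains one of the summary statistics required by the algorithm (here, per-row-cluster averages; see Section 3.1). The table, column, and view names are hypothetical placeholders, not our actual schema.

using System.Data.SqlClient;

class SummaryStatisticsView
{
    // Creates a view maintaining row-cluster averages, one of the summary
    // statistics used by Bregman co-clustering (see Section 3.1).
    // Ratings(UserId, MovieId, Rating) and RowAssignment(UserId, RowCluster)
    // are hypothetical table names used for illustration.
    static void CreateView(SqlConnection conn)
    {
        const string sql = @"CREATE VIEW RowClusterAverages AS
            SELECT ra.RowCluster,
                   AVG(CAST(r.Rating AS FLOAT)) AS AvgRating,
                   COUNT(*) AS NumRatings
            FROM Ratings r
            JOIN RowAssignment ra ON r.UserId = ra.UserId
            GROUP BY ra.RowCluster;";
        using (var cmd = new SqlCommand(sql, conn))
            cmd.ExecuteNonQuery();
    }
}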
Fig. 16. The difference between using and not using database views.
The runtime performance is almost the same for all six bases when database views are not employed. The use of database views has a positive effect on Bases C1 and C2, but a negative effect on Basis C5. As mentioned in Sec. 3.1.5 and illustrated in Fig. 10, the calculation of each element of an approximation matrix involves the block average, the average of the elements of each row, the average of the elements of each column, the average of the elements of each row cluster, and the average of the elements of each column cluster. Moreover, Fig. 16 shows a trade-off between the choice of Bregman co-clustering basis and the use of database views, since different bases usually achieve different clustering performance, as shown later in this section. For consistency, in the following subsections we report results obtained using database views.
5.3 Size of dataset

In this subsection we consider datasets of various sizes. Here the number of row clusters is set to 10, and so is the number of column clusters. We report not only total running time but also the maximum amount of memory allocated (to our main program), RMSE, and MAE.
Fig. 17. Total running time vs. datasets.

As we can see from Fig. 17, compared to Fig. 14, the relationship between total running time and the size of the dataset is closer to linear (whereas that shown in Fig. 14 is closer to polynomial). Fig. 18 illustrates the memory usage for each dataset and each basis. Likewise, memory usage grows more slowly with our OLAP based implementation than with the main memory based implementation (Fig. 15).

Fig. 18. Maximum allocated memory vs. datasets.

Comparing the memory required by datasets of various sizes in Fig. 18, we see the superior scalability of our approach. For instance, the ratio of the size of 1.4M to that of 1M is 1.4, but the ratio of memory usage is less than 1.4. The largest dataset, 1.4M, does not always use more memory than the second largest dataset, 1M. This could be due to the memory management mechanism (e.g. garbage collection) employed by C# and the underlying .NET platform: the memory management system is activated to reclaim usable memory space when the amount of allocated memory approaches the limit. In the following two figures we report the clustering performance. In the given datasets, a rating is between 1 and 5. In the experiments we do not normalize a predicted rating into this range; rather, we simply round it to the closest integer. We can see from Fig. 20 that Bregman co-clustering is still able to provide high-quality results, with MAE ranging from 0.15 to 0.35 (depending on the basis and the dataset).
Fig. 19. RMSE vs. datasets.

Fig. 20. MAE vs. datasets.
We see from the above two figures: 1) Basis C1 provides the worst clustering performance on each of the four datasets, but it performs better (i.e. achieves lower RMSE and MAE) as the given dataset gets larger. 2) Basis C6 outperforms the other bases on each of the four datasets. 3) Bases C2, C3, C4, and C5 provide similar clustering performance on each of the four datasets.

5.4 Performance vs. iterations

Here we consider the dataset 0.1M. Additionally, we set both the number of row clusters and the number of column clusters to 10. Below we report the runtime performance (total running time in seconds) and the clustering performance against the number of iterations; the x-axis in Figs. 21 to 24 represents the maximum number of iterations. Fig. 21 reports the total running time against the number of iterations, while Fig. 22 presents the memory usage against the number of iterations. As mentioned earlier, the algorithm terminates either when the convergence criterion is met or when the maximum number of iterations is reached. Therefore, in Fig. 21 all curves except the one given by Basis C1 level off even as the number of iterations continues to increase. Furthermore, we see in Fig. 22 that all six bases allocate memory in a relatively steady manner.
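The termination behavior just described corresponds to a simple outer loop. The following is a schematic sketch, not our actual code: UpdateClusterAssignments and Objective are placeholders for the row/column update steps and the Bregman objective, and we assume the objective decreases monotonically.

using System;

class CoclusteringDriver
{
    // Schematic outer loop: stop when the change in the objective falls
    // below a tolerance (convergence) or when maxIterations is reached,
    // as described in the text. Placeholder methods stand in for the
    // actual Bregman co-clustering steps.
    static int Run(int maxIterations, double tolerance)
    {
        double previousObjective = double.MaxValue;
        int iteration = 0;
        while (iteration < maxIterations)
        {
            iteration++;
            UpdateClusterAssignments();            // row and column updates
            double objective = Objective();        // Bregman divergence based
            if (Math.Abs(previousObjective - objective) < tolerance)
                break;                             // convergence criterion met
            previousObjective = objective;
        }
        return iteration;
    }

    static void UpdateClusterAssignments() { /* placeholder */ }
    static double Objective() { return 0.0; }      // placeholder
}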
Fig. 21. Total running time against the number of iterations.

Fig. 22. The maximum amount of allocated memory against the number of iterations.

The following two figures display the clustering performance. These curves level off because the error between two consecutive iterations converges: RMSE in Fig. 23 and MAE in Fig. 24 do not change dramatically after a certain number of iterations for each basis (except Basis C1).

Fig. 23. RMSE against the number of iterations.
Fig. 24. MAE against the number of iterations.

Figs. 21 to 24 give us some guidance on selecting a basis and setting the number of iterations. Basis C1 displays a linear relationship between running time and the number of iterations (so we can estimate in advance how long an analysis will take) and does not change dramatically in terms of memory usage, but it gives the highest RMSE and MAE. Compared to Basis C1, Basis C2 is faster (especially when the number of iterations is small), uses less memory, and shows lower RMSE and MAE; compared to Basis C6, however, Basis C2 does not provide comparable clustering performance. Basis C6 runs faster and gives lower RMSE and MAE but requires more memory.

5.5 Performance vs. the numbers of row and column clusters

In this subsection we consider the runtime performance and the clustering performance against the number of row or column clusters. We set the number of row clusters equal to the number of column clusters, and we set the maximum number of iterations to 100. Here the dataset 0.1M is considered. Fig. 25 shows the total running time against the number of row clusters (which is also the number of column clusters, since we set both to the same value). In Fig. 25 the runtime performance does not always increase as the number of row or column clusters increases. Since we use the same dataset, and hence the number of non-zero elements is the same throughout these experiments, only the number of iterations actually performed affects the runtime performance; the actual number of iterations depends primarily on the employed basis. Consequently, we see different patterns for different bases in Fig. 25.
Fig. 25. Total running time against the number of row (or column) clusters.

Fig. 26 reports the memory usage against the number of row clusters (or the number of column clusters). Similar to what is shown in Fig. 22, Fig. 26 shows that all six bases give relatively steady curves; however, Basis C6 requires slightly more memory than the others. Fig. 25 and Fig. 26, along with Fig. 27 and Fig. 28, can assist in parameter tuning, for which both runtime performance and clustering performance should be considered.
Fig. 26. The maximum amount of allocated memory against the number of row (or column) clusters.

The following two figures illustrate the clustering performance, in terms of RMSE and MAE, against the number of row clusters (or the number of column clusters). They show that the clustering performance is better when the number of row or column clusters is larger. Basis C6 achieves the lowest RMSE and MAE, while Basis C1 shows poor clustering performance; the other bases provide similar clustering performance.
Fig. 27. RMSE against the number of row (or column) clusters.

Fig. 28. MAE against the number of row (or column) clusters.
6 Related work

We organize this section according to Fig. 29 presented below, where the central node is this paper and the nodes around it represent related work. References are grouped into three categories, as shown in Fig. 29.
Fig. 29. Related work. The central node is our OLAP based implementation of Bregman co-clustering algorithm. The surrounding nodes fall into three groups: database and data mining (data mining tools based on OLAP, e.g. DBMiner [31]; direct implementations of data mining in SQL, e.g. [9, 46-49]; data mining based on database extensions, such as [29, 42, 44, 56]); co-clustering algorithms and scalable implementations (co-clustering algorithms, e.g. BVD [39] and CRD [51]; scalable co-clustering implementations, e.g. [20, 52]); and social media analysis, database, and co-clustering (social media analysis and database, e.g. [4, 14, 17]; applications of co-clustering to social media analysis, e.g. [3, 26-27, 38, 60]).

The first category concerns the connection between database and data mining. Bentayeb and Darmont in [9] indicated that the integration of data mining and database could utilize the efficiency provided by SQL engines. When data mining algorithms are integrated into database management systems, we are no longer limited by the main memory space but only by the disk space, as mentioned by Bentayeb et al. in [10]. Furthermore, Chaudhuri in [12] argued that one should build data mining systems that are not only scalable but also "SQL-aware". We achieve this by utilizing SQL and OLAP to scale Bregman co-clustering algorithm to large datasets. Our implementation does not extend SQL but follows standard data cube operations. Since our implementation is built upon SQL and OLAP, its applicability could be extended to XML, for example, by referring to the technique developed by Wiwatwattana et al. in [64], where the authors propose a definition of a data cube that can be computed on XML data. Ordonez and his colleagues have published a series of papers on implementing traditional clustering in SQL: Ordonez and Cereghini in [46] discussed schemas used to implement the EM (Expectation Maximization) algorithm in SQL; in [47] Ordonez and Omiecinski indicated that integrating clustering algorithms into the database is more practical; the approach proposed in [48] is purely SQL based and all operations happen on disk; Ordonez in [49] compared implementations of clustering in SQL and C++. In terms of flexibility, our OLAP based implementation outperforms the approach discussed in [46], because the number of dimensions (i.e. features) the approach in [46] can handle is determined by the length of a single SQL statement that the underlying database can handle. That is also to say that the approaches proposed in [46, 47, 48] are not suitable for social media analysis, where
the number of dimensions (e.g. products or people in CF based recommendation systems) is usually high.
The second category concerns new co-clustering algorithms and scalable implementations of co-clustering. Long et al. in [39] proposed Block Value Decomposition (BVD), which models a co-clustering task as an optimization problem with respect to a triple decomposition or factorization of a data matrix. Nevertheless, decomposing or factorizing a matrix is computationally expensive, and it costs even more to perform such an operation on matrices represented in a compact format, e.g. matrices stored in a relational database. Moreover, Pan et al. in [51] proposed Co-clustering based on Column and Row Decomposition (CRD). One feature of CRD is that its complexity is linear in the number of rows and the number of columns. However, "most of the operations in CRD involve only the sampled columns and rows", as mentioned by Pan et al. in [51], so its performance might be affected by sampling bias. Daruru et al. in [20] proposed a solution based on the Dataflow programming model to apply co-clustering to the Netflix dataset (from a CF based movie recommendation system). The solution proposed by Daruru et al. achieves high scalability [20], but it is not "SQL-aware" and cannot utilize useful functions provided by a database, such as the ETL functions used for pre-processing data. Moreover, in [52] Papadimitriou and Sun introduced a Map-Reduce [21] based implementation of co-clustering. Papadimitriou and Sun claimed that the framework proposed in [52] can assist in implementing various co-clustering algorithms; nevertheless, an implementation of Bregman co-clustering is missing from [52].
The third category concerns social media analysis from the perspectives of database and co-clustering. Some new frameworks have been proposed to bring OLAP-like operations to social media analysis, such as [14, 17]. In [17] Chi et al. proposed a framework for OLAP over data on the Web, aimed at unorganized, unstructured, large, and sparse data, and identified four dimensions: people, relation, content, and time. Thanks to these four dimensions, we could employ the relational model and traditional OLAP techniques to represent unorganized data, so it is possible to extend the study presented in this paper to such data. For example, rows could be one group of people, columns could be another group of people, and the values of the elements in a matrix could represent the strength of the relations between people. As another example, a row and a column could respectively be a person and a keyword on a blog, while the value of an element could represent how strongly the person favors the use of the keyword. Furthermore, the graph model focuses on data and relations, while the base data structure of the relational model is the "relation" and the relational model focuses on data and attributes [4]. Accordingly, the relational model is sufficient to describe the relations in a graph or a network. Chen et al. in [14] proposed a framework to perform OLAP on graph data. The goal of the framework proposed in [14] is to provide "OLAP-style functionalities" for networked data, whereas our goal is to utilize OLAP to implement Bregman co-clustering algorithm, which is not designed for graph data but can be used in social network analysis.
Giannakidou et al. in [27] applied co-clustering to a social tagging system; George and Merugu in [26] applied co-clustering to CF based recommendation systems; and Symeonidis et al. in [60] studied CF based recommendation systems and proposed a co-clustering algorithm based on nearest-neighbor information. As discussed by Yu et al. in [65], it is important that a good recommendation system make its recommendations diverse: items recommended to users should be similar to their tastes, but not so similar that users quickly feel bored. Yu et al. in [66] indicate that "over-specialization" in recommendation systems reduces the diversity of the items recommended to users and limits their variety of choices. When using Bregman co-clustering algorithm, we can achieve different levels of recommendation diversification simply by setting different numbers of row and column clusters, or by using a different Bregman co-clustering basis. Liu et al. in [38] discuss an application of co-clustering to the prediction of links between movies and users. Because an approximation matrix generated by Bregman co-clustering algorithm maintains the ratings given by users to movies, such information can be used directly in predicting movie-user links. Amer-Yahia et al. in [3] propose SocialScope to assist in integrating content and social information on web sites. SocialScope consists of three layers: the content management layer, the information discovery layer, and the information presentation layer. As might be expected, our implementation of Bregman co-clustering algorithm could be part of the information discovery layer.
7 Conclusions and future work

In this paper we have demonstrated an integration of database, data mining, and social media analysis by studying a scalable implementation of a powerful data mining algorithm (Bregman co-clustering algorithm) that is supported by existing database techniques and can be applied to social media analysis. First of all, we discussed an OLAP based implementation of Bregman co-clustering algorithm. Bregman co-clustering algorithm has been shown to generate substantially better results than other co-clustering algorithms, and its practical impact could be significant if scalable implementations were available. Addressing the scalability issue, our implementation utilizes an OLAP engine to compute the summary statistics that Bregman co-clustering algorithm requires to build
approximation matrices. In addition, this paper contributes to the understanding of how the six Bregman co-clustering bases map to the data cube. Moreover, our implementation demonstrates that a general database can be used as an effective computation engine to support data mining algorithms. Following that, we discussed data modeling and schema design for social media analysis by referring to a framework proposed by researchers in the social media analysis community. Multi-dimensional co-clustering is important for social media analysis since social media data are generally collected from multiple social webs and heterogeneous data services. Furthermore, we argued that the star schema and OLAP can deal with the high complexity of managing and analyzing social media data; we considered a data schema proposed by other researchers as an example and discussed a revision using the star schema. Finally, we used datasets from a CF based movie recommendation system in our experiments and as a demonstration of an application of co-clustering to social media. Bregman co-clustering algorithm is scalable in theory, while our OLAP based implementation makes it scalable to large datasets in practice. We reported not only the runtime performance but also the clustering performance.
Future work includes the following directions. First, from the database research perspective, exploring index structures for more efficient access to the matrices stored in a relational database; efficient indexing would help us manage the gigantic datasets that real-world applications generally produce, and specialized indices for OLAP, such as the star index, could further improve performance. Second, from the data mining perspective, extending our implementation to a multi-dimensional co-clustering framework, such as the multi-way co-clustering proposed in [7] and the multi-label co-clustering proposed in [57]. Third, from the social media analysis perspective, new data models and schemas for social media and/or social networks are worth further investigation; as long as a new schema has an interface connecting it to SQL, the study presented in this paper could be extended to it and probably applied to new applications over social webs. Fourth, from the infrastructure perspective, extensions with parallel computing techniques, distributed data management frameworks (such as using Map-Reduce [21] as the underlying framework), or advanced multi-core machine architectures.
Acknowledgements

The authors would like to thank Dr. Arindam Banerjee for advice and practical discussions on Bregman co-clustering algorithm, Nisheeth Srivastava for his comments on the organization and presentation of the paper, and Sung-Rim Moon for her help in proof-reading. The authors would also like to thank the anonymous reviewers for their valuable comments.
References
[1] Agarwal, D., Merugu, S. Predictive discrete latent factor models for large-scale dyadic data. In Proceedings of the 13th ACM SIGKDD, 2007, 26-35.
[2] Akinde, M. O., Böhlen, M. H., Johnson, T., Lakshmanan, L. V. S., Srivastava, D. Efficient OLAP query processing in distributed data warehouses. Information Systems, 28 (2003), 111-135.
[3] Amer-Yahia, S., Lakshmanan, L. V. S., and Yu, C. SocialScope: Enabling Information Discovery on Social Content Sites. In Proceedings of the 4th Biennial Conference on Innovative Data Systems Research (CIDR), 2009.
[4] Angles, R. and Gutierrez, C. Survey of Graph Database Models. ACM Computing Surveys, 40, 1 (2008), Article 1.
[5] Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6 (2005), 1705-1749.
[6] Banerjee, A., Dhillon, I. S., Ghosh, J., Merugu, S., Modha, D. S. A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. Journal of Machine Learning Research, 8 (2007), 1919-1986.
[7] Banerjee, A., Basu, S., Merugu, S. Multi-way Clustering on Relation Graphs. In Proceedings of the 7th SIAM International Conference on Data Mining (SDM), 2007.
[8] Beebe, H., Clark, J. G. Dealing with Terabyte Datasets in Digital Investigations. In Advances in Digital Forensics: Proceedings of the IFIP International Conference on Digital Forensics, 2005, 3-16.
[9] Bentayeb, F. and Darmont, J. Decision Tree Modeling with Relational Views. In Proceedings of the 13th International Symposium on Foundations of Intelligent Systems (ISMIS), 2002, 423-431.
[10] Bentayeb, F., Darmont, J., and Udréa, C. Efficient Integration of Data Mining Techniques in Database Management Systems. In Proceedings of the International Database Engineering and Applications Symposium (IDEAS), 2004, 59-67.
[11] Carley, K. M., Diesner, J., Reminga, J., and Tsvetovat, M. Toward an interoperable dynamic network analysis toolkit. Decision Support Systems, 43 (2007), 1324-1347.
[12] Chaudhuri, S. Data Mining and Database Systems: Where is the Intersection? Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (1998).
[13] Chaudhuri, S., Dayal, U., and Ganti, V. Database Technology for Decision Support Systems. Computer, 34, 12 (2001), 48-55.
[14] Chen, C., Yan, X., Zhu, F., Han, J., Yu, P. Graph OLAP: Towards Online Analytical Processing on Graphs. IEEE International Conference on Data Mining (ICDM), 2008.
[15] Chen, M.-S., Han, J., Yu, P. S. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8, 6 (1996), 866-883.
[16] Cheng, Y. and Church, G. M. Biclustering of expression data. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 2000, 93-103.
[17] Chi, Y., Zhu, S., Hino, K., Gong, Y., and Zhang, Y. iOLAP: A Framework for Analyzing the Internet, Social Networks, and Other Networked Data. IEEE Transactions on Multimedia, 11, 3 (2009).
[18] Cho, H., Dhillon, I., Guan, Y., and Sra, S. Minimum sum squared residue co-clustering of gene expression data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM), 2004.
[19] Cornacchia, R., Héman, S., Zukowski, M., de Vries, A. P., and Boncz, P. Flexible and efficient IR using array databases. The VLDB Journal, 17 (2008), 151-168.
[20] Daruru, S., Marín, N., Walker, M., and Ghosh, J. Pervasive Parallelism in Data Mining: Dataflow Solution to Co-clustering Large and Sparse Netflix Data. KDD, 2009.
[21] Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. OSDI, 2004.
[22] Dhillon, I. S., Mallela, S., and Modha, D. S. Information-theoretic co-clustering. KDD, 2003, 89-98.
[23] Ding, C., Li, T., Peng, W., and Park, H. Orthogonal nonnegative matrix tri-factorizations for clustering. KDD, 2006.
[24] Domingos, P. Prospects and challenges for multi-relational data mining. ACM SIGKDD Explorations Newsletter, 5, 1 (2003), 80-83.
[25] Dreyfus, D. and Iyer, B. Enterprise Architecture: A Social Network Perspective. In Proceedings of the 39th Hawaii International Conference on System Sciences (HICSS), 2006.
[26] George, T. and Merugu, S. A Scalable Collaborative Filtering Framework based on Co-clustering. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 2005, 625-628.
[27] Giannakidou, E., Koutsonikola, V., Vakali, A., and Kompatsiaris, I. Co-Clustering Tags and Social Data Sources. In Proceedings of the 9th International Conference on Web-Age Information Management, 2008.
[28] Goharian, N., Jain, A., Sun, Q. Comparative Analysis of Sparse Matrix Algorithms for Information Retrieval. Journal of Systemics, Cybernetics and Informatics (2003).
[29] Graefe, G., Fayyad, U., Chaudhuri, S. On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases. In Proceedings of the 4th ACM SIGKDD, 1998, 204-208.
[30] Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., Pirahesh, H. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery, 1, 1 (1997), 29-53.
[31] Han, J. OLAP Mining: An Integration of OLAP with Data Mining. In Proceedings of the IFIP Conference on Data Semantics (DS-7), 1997, 1-11.
[32] Hirji, K. K. Exploring data mining implementation. Communications of the ACM, 44, 7 (July 2001), 87-93.
[33] Hsu, K.-W., Banerjee, A., Srivastava, J. I/O Scalable Bregman Co-clustering. In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD), 2008, 896-903.
[34] Inmon, W. H. The Data Warehouse and Data Mining. Communications of the ACM, 39, 11 (1996), 49-50.
[35] Keith, M., Demirkan, H., and Goul, M. Coordination Network Analysis: A Research Framework for Studying the Organizational Impacts of Service-Orientation in Business Intelligence. In Proceedings of the 40th Hawaii International Conference on System Sciences (HICSS), 2007.
[36] Konstan, J. A., Riedl, J., Borchers, A., and Herlocker, J. Recommender systems: A GroupLens perspective. In Recommender Systems: Papers from the 1998 Workshop (AAAI Technical Report WS-98-08), AAAI Press, 1998, 60-64.
[37] Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., Van Alstyne, M. Computational Social Science. Science, 323, 5915 (2009), 721-723.
[38] Liu, T., Tian, Y., and Gao, W. A Two-Phase Spectral Bigraph Co-clustering Approach for the "Who Rated What" Task in KDD Cup 2007. KDD Cup, 2007.
[39] Long, B., Zhang, Z., and Yu, P. Co-clustering by Block Value Decomposition. KDD, 2005.
[40] Lopez, F. L., Robles, G., and Gonzalez, B. J. M. Applying social network analysis to the information in CVS repositories. International Workshop on Mining Software Repositories (MSR), 26th International Conference on Software Engineering, 2004.
[41] Martín, M. S. and Gutierrez, C. A Database Perspective of Social Network Analysis Data Processing. Sunbelt XXVI, 2006. (TR_DCC-2006-007)
[42] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., and Euler, T. YALE: Rapid Prototyping for Complex Data Mining Tasks. KDD, 2006.
[43] Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., and Bhattacharjee, B. Measurement and Analysis of Online Social Networks. In Proceedings of the 5th ACM/USENIX Internet Measurement Conference (IMC), 2007.
[44] Netz, A., Chaudhuri, S., Bernhardt, J., and Fayyad, U. Integration of Data Mining and Relational Databases. In Proceedings of the 26th International Conference on Very Large Databases (VLDB), 2000.
[45] O'Connor, M. and Herlocker, J. Clustering Items for Collaborative Filtering. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, 1999.
[46] Ordonez, C., Cereghini, P. SQLEM: Fast Clustering in SQL using the EM Algorithm. In Proceedings of the SIGMOD Conference, 2000, 559-570.
[47] Ordonez, C. and Omiecinski, E. Efficient Disk-Based K-Means Clustering for Relational Databases. IEEE Transactions on Knowledge and Data Engineering, 16, 8 (2004), 909-921.
[48] Ordonez, C. Programming the K-means Clustering Algorithm in SQL. In Proceedings of the 10th ACM SIGKDD, 2004, 823-828.
[49] Ordonez, C. Integrating K-Means Clustering with a Relational DBMS Using SQL. IEEE Transactions on Knowledge and Data Engineering, 18, 2 (2006), 188-201.
[50] Palpanas, T. Knowledge Discovery in Data Warehouses. ACM SIGMOD Record, 29, 3 (September 2000), 88-100.
[51] Pan, F., Zhang, X., and Wang, W. A General Framework for Fast Co-clustering on Large Datasets Using Matrix Decomposition. International Conference on Data Engineering (ICDE), 2008.
[52] Papadimitriou, S. and Sun, J. DisCo: Distributed Co-clustering with Map-Reduce. IEEE International Conference on Data Mining (ICDM), 2008.
[53] Perugini, S., Gonçalves, M. A., and Fox, E. A. Recommender Systems Research: A Connection-Centric Survey. Journal of Intelligent Information Systems, 23, 2 (2004), 107-143.
[54] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, 1994, 175-186.
[55] Riedl, J. and Konstan, J. MovieLens dataset. http://www.grouplens.org/
[56] Sattler, K. and Dunemann, O. SQL Database Primitives for Decision Tree Classifiers. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM), 2001, 379-386.
[57] Shan, H. and Banerjee, A. Bayesian Co-clustering. IEEE International Conference on Data Mining (ICDM), 2008.
[58] Siersdorfer, S. and Sizov, S. Social Recommender Systems for Web 2.0 Folksonomies. In Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, 2009.
[59] Srivastava, J. Computational Social Science and Web Engineering. Keynote address at the International Conference on Web Engineering, San Sebastian, Spain, June 24-26, 2009.
[60] Symeonidis, P., Nanopoulos, A., Papadopoulos, A., and Manolopoulos, Y. Nearest-Biclusters Collaborative Filtering. WEBKDD, 2006.
[61] Tsvetovat, M., Diesner, J., and Carley, K. NetIntel: A Database for Manipulation of Rich Social Network Data. Technical Report CMU-ISRI-04-135, Carnegie Mellon University, School of Computer Science, Institute for Software Research International, 2004.
[62] Wasserman, S., Faust, K. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[63] Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
[64] Wiwatwattana, N., Jagadish, H. V., Lakshmanan, L. V. S., and Srivastava, D. X^3: A Cube Operator for XML OLAP. In Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE), 2007, 916-925.
[65] Yu, C., Lakshmanan, L. V. S., and Amer-Yahia, S. It Takes Variety to Make a World: Diversification in Recommender Systems. EDBT, 2009.
[66] Yu, C., Lakshmanan, L. V. S., and Amer-Yahia, S. Recommendation Diversification Using Explanations. In Proceedings of the IEEE 25th International Conference on Data Engineering (ICDE), 2009.