Database Support for Bregman Co-clustering Kuo-Wei (David) Hsu and Jaideep Srivastava Department of Computer Science and Engineering University of Minnesota
Outline Introduction Bregman Co-clustering Our OLAP Based Implementation Experimental Results Conclusions and Future Work
Introduction
Co-clustering groups rows and columns at the same time
Data are in a matrix, where rows and columns are of different types
Example: movie recommendation
People with similar taste are attracted to similar movies
Certain groups of movies entertain certain groups of people
Bregman co-clustering is scalable and gives better results
Introduction
Bregman co-clustering searches for the optimal clustering by searching for the best approximation matrix, one that is close to the original data matrix under the minimum Bregman information (MBI) principle
Summary statistics are computed and re-computed to find the approximation matrix, i.e., to approach the optimal clustering
A data cube is generally used to manage summary statistics
This study establishes the mapping between the data cube and the Bregman co-clustering algorithm
Bregman Co-clustering
Bregman co-clustering casts co-clustering as the discovery of an optimal approximation matrix
It uses the minimum Bregman information (MBI) principle to retain as much information as possible
It finds the optimal solution iteratively
Building the approximation matrix: 6 bases are defined, each of which is a set of summary statistics and a blueprint for building the approximation matrix
Problem and solution: the algorithm is theoretically scalable, but in practice it is restricted by memory; an OLAP engine can provide the summary statistics naturally
Bregman Co-clustering Algorithm
1. Randomly generate row clusters and column clusters
2. Calculate the summary statistics
3. Calculate the approximation and update row clusters according to the basis
4. Calculate the approximation and update column clusters according to the basis
5. Repeat steps 2-4 until convergence
(a sketch of the row-update in step 3 is given after the notation below)
U is a set of rows; ρ(U): {1,…,m} → {1,…,k}
V is a set of columns; γ(V): {1,…,n} → {1,…,l}
Squared Euclidean distance is used as the distortion measure
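As a concrete illustration of steps 3 and 4, one common way to write the row-update under the squared Euclidean distance is sketched below; this is a simplified, unweighted sketch rather than the exact weighted form in Banerjee et al. (2007), and the column update in step 4 is symmetric.

\[
\rho(u) \;\leftarrow\; \operatorname*{arg\,min}_{g \in \{1,\dots,k\}} \; \sum_{v \in V} \bigl(z_{uv} - \hat{z}^{[g]}_{uv}\bigr)^2 ,
\]

where \(\hat{z}^{[g]}_{uv}\) denotes the approximation of cell \((u,v)\) computed from the current summary statistics under the chosen basis, as if row \(u\) were assigned to row cluster \(g\).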
Building Approximation Matrix
Reconstruct the matrix from summary statistics
Z: observed matrix; Ẑ: estimated (approximation) matrix; E[·]: weighted average
6 bases, i.e. ways to reconstruct Ẑ:
1: E[Z|ρ] + E[Z|γ] − E[Z]
2: E[Z|ρ,γ]
3: E[Z|ρ,γ] + E[Z|U] − E[Z|ρ]
4: E[Z|ρ,γ] + E[Z|V] − E[Z|γ]
5: E[Z|ρ,γ] + E[Z|U] + E[Z|V] − E[Z|ρ] − E[Z|γ]
6: E[Z|U,γ] + E[Z|ρ,V] − E[Z|ρ,γ]
Basis 1
Avg. of cells in each column cluster + Avg. of cells in each row cluster - Avg. of all cells in given data matrix
Banerjee, A., Dhillon, I. S., Ghosh, J., Merugu, S., Modha, D. S.: A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. Journal of Machine Learning Research, Vol. 8, pp. 1919-1986 (2007)
Basis 2
Avg. of cells in each intersection of column and row clusters, i.e. block average
Basis 3
Block average + Avg. of all cells for each row in given data matrix - Avg. of cells in each row cluster
Basis 4
Block average + Avg. of all cells for each column in given data matrix - Avg. of cells in each column cluster
Basis 5
Block average + Avg. of all cells for each row in given data matrix + Avg. of all cells for each column in given data matrix - Avg. of cells in each row cluster - Avg. of cells in each column cluster
Basis 6
Avg. of cells for each row in each column cluster + Avg. of cells for each column in each row cluster - Block average
Our OLAP Based Implementation
Storing matrices in a relational database
We store non-zero elements only
We employ the coordinate storage (COO) format because it avoids the extra join operations required by compression-based approaches (a storage sketch follows this list)
Define the data cube to manage summary statistics
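As a minimal illustration of the COO storage, the sketch below creates the data table using the Matrix schema shown on the next slide; the sample values are hypothetical, and in some engines the column name row may need quoting.

-- Sparse matrix in coordinate (COO) storage: one tuple per non-zero cell
CREATE TABLE Z (row INT, col INT, val FLOAT);

-- Only non-zero elements are stored; any (row, col) pair that is absent is implicitly zero
INSERT INTO Z (row, col, val) VALUES (1, 3, 4.0);
INSERT INTO Z (row, col, val) VALUES (2, 1, 2.5);
INSERT INTO Z (row, col, val) VALUES (3, 2, 1.0);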
Schema and Cube
All three tables share the Matrix schema: (row INT, col INT, val FLOAT)
Z (data points): row, col, val
Rclust (row clusters): row = matrix row index, col = row-cluster id (r_cluster), val = '1'
Cclust (column clusters): row = matrix column index, col = column-cluster id (c_cluster), val = '1'
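A minimal sketch of the two cluster-assignment tables under this schema; the sample assignments are illustrative only.

-- Row-cluster assignments: row = matrix row index, col = row-cluster id, val = '1'
CREATE TABLE Rclust (row INT, col INT, val FLOAT);
INSERT INTO Rclust (row, col, val) VALUES (1, 1, 1);
INSERT INTO Rclust (row, col, val) VALUES (2, 1, 1);
INSERT INTO Rclust (row, col, val) VALUES (3, 2, 1);

-- Column-cluster assignments: row = matrix column index, col = column-cluster id, val = '1'
CREATE TABLE Cclust (row INT, col INT, val FLOAT);
INSERT INTO Cclust (row, col, val) VALUES (1, 1, 1);
INSERT INTO Cclust (row, col, val) VALUES (2, 2, 1);
INSERT INTO Cclust (row, col, val) VALUES (3, 2, 1);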
SQL to create the cube (a 4D cube over row, column, row cluster, and column cluster):

SELECT Z.row,
       Z.col,
       Rclust.col AS r_clust,
       Cclust.col AS c_clust,
       AVG(Z.val) AS val
FROM (Rclust JOIN Z ON Rclust.row = Z.row)
     JOIN Cclust ON Z.col = Cclust.row
GROUP BY Z.row, Z.col, Rclust.col, Cclust.col
WITH CUBE

The 1st join attaches a row-cluster number to each cell; the 2nd join attaches a column-cluster number to each cell.
Note: we cannot use AVG directly, since only non-zero elements are stored; a solution is to use SUM together with the "real" sizes of the clusters.
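Assuming the query above is materialized into a table, say zcube(row, col, r_clust, c_clust, val) (the name is hypothetical), each summary statistic needed by the bases corresponds to one aggregation level of the cube, i.e. one pattern of rolled-up (NULL) grouping columns. A sketch:

-- Global average: every dimension rolled up
SELECT val
FROM zcube
WHERE row IS NULL AND col IS NULL AND r_clust IS NULL AND c_clust IS NULL;

-- Average of each row cluster
SELECT r_clust, val
FROM zcube
WHERE row IS NULL AND col IS NULL AND r_clust IS NOT NULL AND c_clust IS NULL;

-- Average of each column cluster
SELECT c_clust, val
FROM zcube
WHERE row IS NULL AND col IS NULL AND r_clust IS NULL AND c_clust IS NOT NULL;

-- Block average: each (row cluster, column cluster) intersection
SELECT r_clust, c_clust, val
FROM zcube
WHERE row IS NULL AND col IS NULL AND r_clust IS NOT NULL AND c_clust IS NOT NULL;

Since no real row, column, or cluster id is NULL here, the NULL pattern identifies the aggregation level (GROUPING() could be used instead). As noted above, a production version would store SUM(val) and divide by the real cluster sizes rather than rely on AVG over stored non-zeros.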
Mapping Basis 1 to Cube
Avg. of cells in each column cluster + Avg. of cells in each row cluster - Avg. of all cells in given data matrix
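A sketch of how this combination can be read off the hypothetical zcube table introduced earlier: each cell's estimate adds the average of its column cluster and the average of its row cluster, and subtracts the global average.

SELECT z.row, z.col,
       cc.val + rc.val - g.val AS approx
FROM zcube z
JOIN zcube rc                       -- average of the cell's row cluster
  ON rc.r_clust = z.r_clust
 AND rc.row IS NULL AND rc.col IS NULL AND rc.c_clust IS NULL
JOIN zcube cc                       -- average of the cell's column cluster
  ON cc.c_clust = z.c_clust
 AND cc.row IS NULL AND cc.col IS NULL AND cc.r_clust IS NULL
CROSS JOIN (SELECT val FROM zcube   -- global average
            WHERE row IS NULL AND col IS NULL
              AND r_clust IS NULL AND c_clust IS NULL) g
WHERE z.row IS NOT NULL AND z.col IS NOT NULL;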
Mapping Basis 2 to Cube
Avg. of cells in each intersection of column and row clusters, i.e. block average
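Basis 2 is the simplest mapping: the estimate for a cell is just the block-average level of the cube. A sketch, again against the hypothetical zcube table:

-- Each cell is estimated by the average of its (row cluster, column cluster) block
SELECT z.row, z.col, b.val AS approx
FROM zcube z
JOIN zcube b
  ON b.r_clust = z.r_clust AND b.c_clust = z.c_clust
 AND b.row IS NULL AND b.col IS NULL
WHERE z.row IS NOT NULL AND z.col IS NOT NULL;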
Mapping Basis 3 to Cube
Block average + Avg. of all cells for each row in given data matrix - Avg. of cells in each row cluster
Mapping Basis 4 to Cube
Block average + Avg. of all cells for each column in given data matrix - Avg. of cells in each column cluster
Mapping Basis 5 to Cube
Block average + Avg. of all cells for each row in given data matrix + Avg. of all cells for each column in given data matrix - Avg. of cells in each row cluster - Avg. of cells in each column cluster
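A sketch of basis 5, which needs the most cube levels; bases 3 and 4 follow the same pattern with the column-side or row-side terms dropped, respectively. As before, zcube is the hypothetical materialized cube.

SELECT z.row, z.col,
       b.val + r.val + c.val - rc.val - cc.val AS approx
FROM zcube z
JOIN zcube b                        -- block average
  ON b.r_clust = z.r_clust AND b.c_clust = z.c_clust
 AND b.row IS NULL AND b.col IS NULL
JOIN zcube r                        -- average of the cell's row
  ON r.row = z.row
 AND r.col IS NULL AND r.r_clust IS NULL AND r.c_clust IS NULL
JOIN zcube c                        -- average of the cell's column
  ON c.col = z.col
 AND c.row IS NULL AND c.r_clust IS NULL AND c.c_clust IS NULL
JOIN zcube rc                       -- average of the cell's row cluster
  ON rc.r_clust = z.r_clust
 AND rc.row IS NULL AND rc.col IS NULL AND rc.c_clust IS NULL
JOIN zcube cc                       -- average of the cell's column cluster
  ON cc.c_clust = z.c_clust
 AND cc.row IS NULL AND cc.col IS NULL AND cc.r_clust IS NULL
WHERE z.row IS NOT NULL AND z.col IS NOT NULL;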
Mapping Basis 6 to Cube
Avg. of cells for each row in each column cluster + Avg. of cells for each column in each row cluster - Block average
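Basis 6 combines two cross levels of the cube, the average of each row within each column cluster and the average of each column within each row cluster, minus the block average. A sketch against the hypothetical zcube table:

SELECT z.row, z.col,
       rg.val + cg.val - b.val AS approx
FROM zcube z
JOIN zcube rg                       -- average of row z.row within column cluster z.c_clust
  ON rg.row = z.row AND rg.c_clust = z.c_clust
 AND rg.col IS NULL AND rg.r_clust IS NULL
JOIN zcube cg                       -- average of column z.col within row cluster z.r_clust
  ON cg.col = z.col AND cg.r_clust = z.r_clust
 AND cg.row IS NULL AND cg.c_clust IS NULL
JOIN zcube b                        -- block average
  ON b.r_clust = z.r_clust AND b.c_clust = z.c_clust
 AND b.row IS NULL AND b.col IS NULL
WHERE z.row IS NOT NULL AND z.col IS NOT NULL;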
Experimental Results
Commodity hardware: PC with a 3.0 GHz Intel Xeon processor and 2 GB RAM
Runtime performance on datasets from a variety of application domains:
Matrix decomposition
Bioinformatics
Document clustering
Collaborative filtering (CF) based recommendation
Matrix Decomposition
[Figures: runtime in seconds for the 6 bases on the af23560, e40r5000, fidapm11, and memplus matrices]
Bioinformatics
[Figures: runtime in seconds for the 6 bases on the yeast (max 10 iterations and max 100 iterations) and lymphoma datasets]
Document Clustering
[Figures: runtime in seconds for the 6 bases on the enron, kos, and nips corpora]
CF Based Recommendation
BookCrossing: the matrix is 6,838 by 5,642 with 90,613 non-zero elements
[Figures: runtime in seconds and clustering quality, measured as mean absolute error (MAE), for the 6 bases]
Conclusions and Future Work
Bregman co-clustering
It iteratively finds the optimal approximation matrix
6 bases are defined as blueprints for building approximation matrices
Our contribution: making it even more scalable
Mapping the Bregman co-clustering bases to the data cube (OLAP)
The OLAP-based implementation can handle datasets too large (in rows AND columns) to fit in memory
Future work:
From a "data" perspective, moving this to other data storage engines or cloud computing platforms
From a "mining" perspective, extending this to multidimensional co-clustering
Acknowledgements
We would like to thank Dr. Arindam Banerjee for advice and practical discussions on the Bregman co-clustering algorithm.