Subspace MOA: Subspace Stream Clustering Evaluation Using the MOA Framework

Marwan Hassani, Yunsu Kim, and Thomas Seidl

Data Management and Data Exploration Group, RWTH Aachen University, Germany
{hassani,kim,seidl}@cs.rwth-aachen.de

Abstract. Most available static data are becoming more and more high-dimensional. Therefore, subspace clustering, which aims at finding clusters not only within the full dimension but also within subgroups of dimensions, has gained significant importance. Recently, the OpenSubspace framework was proposed to evaluate and explore subspace clustering algorithms in WEKA with a rich body of state-of-the-art subspace clustering algorithms and measures. In parallel, the MOA (Massive Online Analysis) framework was also developed on top of WEKA to provide algorithms and evaluation methods for mining tasks on evolving data streams over the full space only. Similar to static data, most streaming data sources are becoming high-dimensional, and tracking their evolving clusters is also becoming important and challenging. In this demonstrator, we present, to the best of our knowledge, the first subspace clustering evaluation framework over data streams, called Subspace MOA. Our demonstrator follows the online-offline model, which is used in most data stream clustering algorithms. In the online phase, users can select one of the three most famous summarization techniques to form the microclusters. In the offline phase, one of five subspace clustering algorithms can be selected. The framework is supported by a subspace stream generator, a visualization interface to present the evolving clusters over different subspaces, and various subspace clustering evaluation measures.

1 Introduction

Clustering high-dimensional data is becoming more and more important as modern databases tend to be huge. Due to the curse of dimensionality, an excessive number of attributes makes each data point unique, and the distances between points become more and more alike as the dimensionality grows [3] (cf. the toy example in Figure 1(a)). Applying traditional clustering algorithms (called in this context full-space clustering algorithms) to such data objects leads to useless clustering results. In Figure 1(a), the majority of the black objects will each be placed in a single-object cluster (outliers) by a full-space clustering algorithm, since they are all dissimilar, although they are apparently not as dissimilar as the gray objects. This observation has motivated research on subspace and projected clustering over the last decade, which has resulted in an established research area for static data. The OpenSubspace framework [8] was proposed to evaluate and explore subspace clustering algorithms in WEKA with a rich body of state-of-the-art subspace/projected clustering algorithms and measures (cf. Figure 1(b)).

Fig. 1. (a) A Toy Example of a Subspace Clustering Output, (b) A Screen Shot of OpenSubspace Framework

W. Meng et al. (Eds.): DASFAA 2013, Part II, LNCS 7826, pp. 446–449, 2013. © Springer-Verlag Berlin Heidelberg 2013

In this research, these algorithms are applied to the streaming case. Unlike static data, which do not vary over time, stream data arrive at varying rates and with dynamically changing patterns, which makes it challenging to analyze their evolving structure and behavior. In streaming scenarios, we also often face limitations on processing time and storage, since a vast amount of continuous data arrives rapidly. Data stream mining has been an emerging research topic over the previous decade, and a rich body of stream mining algorithms has been created. The MOA (Massive Online Analysis) framework [7] was built on experience with both WEKA and the VFML (Very Fast Machine Learning) toolkit [6] to support research in the stream mining area with generators, visualization methods, and evaluation measures. Similar to static data, evolving data streams are also becoming naturally high-dimensional, as they arise in many applications with many attributes. However, in contrast to subspace clustering over static data, only a few subspace stream clustering algorithms have been developed recently (HPStream [2] and PreDeConStream [5]). Such algorithms are difficult to design, since they have to track all changes of the evolving clusters over the stream (splitting, merging, appearance, decaying, moving, etc.) while considering the fact that these clusters might exist in any possible subspace and not only in the full space.
In Subspace MOA, users can select any of five subspace clustering algorithms as the offline part of a subspace stream clustering algorithm, and one of three summarization methods as the online part.
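The distance-concentration effect cited from [3] at the start of this section can be observed on uniformly random data. The following Python sketch is illustrative only and is not part of Subspace MOA; the function name `relative_contrast`, the sample size, the chosen dimensionalities, and the use of uniformly distributed points are all assumptions made for this example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from a random query
    point to n_points uniformly random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# Hypothetical comparison: a low-dimensional versus a high-dimensional space.
low = relative_contrast(dim=2)
high = relative_contrast(dim=500)
print(f"relative contrast in 2 dimensions:   {low:.2f}")
print(f"relative contrast in 500 dimensions: {high:.2f}")
```

Under these assumptions, the relative contrast in the high-dimensional case is much smaller than in the two-dimensional case, which means that the nearest and the farthest point are almost equally far from the query. This is the reason why full-space distances lose their discriminative power.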

2 The Subspace MOA Framework

1. Under the Setup Tab (cf. Figure 2(a)): For the selection of the data stream input, Subspace MOA offers the possibility of reading external ARFF files, a synthetic random RBF generator, and a synthetic random RBF subspace generator with the possibility of varying the subspace event. The online-offline model is followed by most stream clustering algorithms (cf. [1], [4], [5]). In the online phase, a summarization of the data stream points is performed, and the resulting microclusters are given by sets of cluster features $CF_i = (N, LS_i, SS_i)$, which represent the number of points within the microcluster, their linear sum, and their squared sum, respectively. Subspace MOA offers three algorithms to form these microclusters and continuously maintain them. These are the main ones supported by MOA: ClusterGenerator, CluStream, and DenStream. In the offline phase, the cluster features are used to reconstruct an approximation of the original N points, using Gaussian functions to reconstruct spherical microclusters centered at $c_i = \frac{LS_i}{N}$ with radius $r = \sqrt{\frac{SS}{N} - \left(\frac{LS}{N}\right)^2}$, where $SS = \frac{1}{d}\sum_{i=1}^{d} SS_i$ and $LS = \frac{1}{d}\sum_{i=1}^{d} LS_i$. The generated N points are forwarded to one of the five most famous subspace clustering algorithms supported by OpenSubspace: SubClu, ProClus, P3C, FIRES, and CLIQUE. Up to eight evaluation measures (such as CE, CMM, SubCMM, Entropy, F1, and RNIA) (cf. Figure 2(a)) can be used to reflect the quality of the clustering directly after processing each horizon. These values are printed gradually in the output panel under the Setup tab as the stream evolves.

2. Under the Visualization Tab (cf. Figure 2(b)): The evolution of the final clustering of the selected subspace clustering algorithm, as well as the evolution of the ground-truth stream, is visualized in a two-dimensional representation. Users can select any pair of dimensions to visualize the evolving ground truth as well as the resulting clustering. Unlike MOA, Subspace MOA is able to visualize and compute quality measures for arbitrarily shaped clusters.

Fig. 2. Subspace MOA Screen Shots of (a) The Setup Tab, (b) The Visualization Tab
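The cluster-feature bookkeeping of the online phase and the point reconstruction of the offline phase, as described under the Setup tab, can be sketched as follows. This is a minimal Python illustration under stated assumptions and is not the code of Subspace MOA or MOA: the class `ClusterFeature` and the function `reconstruct_points` are hypothetical names, and drawing each coordinate from a Gaussian with standard deviation r around the center is an assumed reading of how the Gaussian functions are applied.

```python
import math
import random


class ClusterFeature:
    """Hypothetical microcluster summary CF = (N, LS, SS), kept per dimension."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum LS_i for each dimension i
        self.ss = [0.0] * dim  # squared sum SS_i for each dimension i

    def insert(self, point):
        # Online phase: absorb one stream point without storing it.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def center(self):
        # c_i = LS_i / N
        return [s / self.n for s in self.ls]

    def radius(self):
        # r = sqrt(SS/N - (LS/N)^2), where SS and LS are the averages
        # of SS_i and LS_i over the d dimensions.
        d = len(self.ls)
        ss = sum(self.ss) / d
        ls = sum(self.ls) / d
        # max(..., 0.0) only guards against floating-point rounding.
        return math.sqrt(max(ss / self.n - (ls / self.n) ** 2, 0.0))


def reconstruct_points(cf, rng):
    # Offline phase (assumed): sample N points from a spherical Gaussian
    # centered at the microcluster center with standard deviation r.
    c, r = cf.center(), cf.radius()
    return [[rng.gauss(ci, r) for ci in c] for _ in range(cf.n)]


# Usage with three made-up two-dimensional stream points.
cf = ClusterFeature(dim=2)
for p in [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]:
    cf.insert(p)
points = reconstruct_points(cf, random.Random(0))
```

In this sketch, the reconstructed `points` would be the input handed to the selected offline subspace clustering algorithm. Because a cluster feature only stores sums, a microcluster can absorb a new point in time linear in the number of dimensions, which fits the processing-time and storage limits of the streaming setting.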

3 Website, Demo Plan and Conclusion

Subspace MOA can be found at http://dme.rwth-aachen.de/en/subspacemoa. In the demonstrator, we want to explain the main idea of two subspace clustering algorithms as well as the online-offline model, motivating how they are combined into the final subspace stream clustering algorithms. The framework offers researchers the possibility to detect weak and strong points of different subspace clustering algorithms when applied in the streaming scenario, as well as to find a suitable online/offline combination for a certain dataset. All of this is done in a user-friendly interface that is in line with the MOA framework style.

Acknowledgments. This work has been supported by the UMIC Research Centre, RWTH Aachen University, Germany.

References

1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams. In: Proc. of the 29th Int. Conf. on Very Large Data Bases, VLDB 2003, vol. 29, pp. 81–92 (2003)
2. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for projected clustering of high dimensional data streams. In: Proc. of the 30th Int. Conf. on Very Large Data Bases, VLDB 2004, vol. 30, pp. 852–863 (2004)
3. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is "nearest neighbor" meaningful? In: Int. Conf. on Database Theory, pp. 217–235 (1999)
4. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream with noise. In: 2006 SIAM Conference on Data Mining, pp. 328–339 (2006)
5. Hassani, M., Spaus, P., Gaber, M.M., Seidl, T.: Density-based projected clustering of data streams. In: Hüllermeier, E., Link, S., Fober, T., Seeger, B. (eds.) SUM 2012. LNCS, vol. 7520, pp. 311–324. Springer, Heidelberg (2012)
6. Hulten, G., Domingos, P.: VFML – a toolkit for mining high-speed time-changing data streams (2003)
7. Kranen, P., Kremer, H., Jansen, T., Seidl, T., Bifet, A., Holmes, G., Pfahringer, B., Read, J.: Stream data mining using the MOA framework. In: Lee, S.-g., Peng, Z., Zhou, X., Moon, Y.-S., Unland, R., Yoo, J. (eds.) DASFAA 2012, Part II. LNCS, vol. 7239, pp. 309–313. Springer, Heidelberg (2012)
8. Müller, E., Assent, I., Günnemann, S., Jansen, T., Seidl, T.: OpenSubspace: An open source framework for evaluation and exploration of subspace clustering algorithms in WEKA. In: Open Source in Data Mining Workshop at PAKDD, pp. 2–13 (2009)