Intelligent Data Analysis 16 (2012) 69–91 DOI 10.3233/IDA-2011-0511 IOS Press
On clustering large number of data streams

Zaher Al Aghbari a,∗, Ibrahim Kamel b and Thuraya Awad a

a Department of Computer Science, University of Sharjah, Sharjah, United Arab Emirates
b Department of Electrical and Computer Engineering, University of Sharjah, Sharjah, United Arab Emirates
Abstract. Data streams and their applications appear in several fields such as physics, finance, medicine, and environmental science. As sensor technology improves, sensor data rates continue to increase; consequently, analyzing data streams becomes ever more challenging. Fast online response is a must for applications that involve multiple data streams, especially when the number of data streams is large. This paper proposes an efficient clustering technique, called the multi-way grid-based join algorithm (MG-join), to find clusters in multiple data streams. The proposed algorithm uses the Discrete Fourier Transformation (DFT) to reduce the dimensionality of the streams. Each stream is represented by a point in a multi-dimensional grid in the frequency domain, and the MG-join algorithm finds the different clusters in multiple data streams in the frequency domain. Moreover, this paper proposes an incremental update mechanism that avoids recalculating the DFT coefficients when new readings arrive and thus minimizes the processing time. Experiments on synthetic data streams show that the proposed clustering technique is much faster than traditional clustering techniques while its accuracy is as good as theirs. This makes the proposed technique suitable for sensor network environments where computing and power capabilities are limited.

Keywords: Clustering multiple data streams, grid-based clustering, incremental clustering, dimensionality reduction, stream join
1. Introduction
A data stream is a real-time, continuous, ordered (explicitly by timestamp or implicitly by arrival time) sequence of data elements. Data streams have gained importance in recent years mainly because of advances in hardware technology, which have made it easy to store and process numerous transactions and activities in an automated way. Applications where information naturally occurs as streams of data values include network monitoring [40], telecommunication systems [44], financial trades [13,48], medical applications [39], and environmental applications [15]. For example, in the telecommunications arena, data streams can be used to monitor traffic congestion on the Internet or in the telephone communication infrastructure [16]. Data streams can also represent transaction logs and web logs [17,18]. In the financial sector, data streams can be used for detecting trends in financial markets [14]. In environmental applications, data stream processing can be used to track pollution such as oil spills and poisonous smoke [27]. Data streams are unbounded in size, continuous, and have a high arrival rate. That is, data stream processing differs from conventional static or slowly updating database systems. As a result, data streams present new challenges to data processing algorithms and impose the following new requirements on data stream algorithms:
∗ Corresponding author. E-mail: [email protected].
1088-467X/12/$27.50 2012 – IOS Press and the authors. All rights reserved
70
Z.A. Aghbari et al. / On clustering large number of data streams
– Process data in one pass to provide timely results.
– Be incremental to adapt quickly to the high arrival rate of data.
– Require nominal memory to process large volumes of data streams.
– Be scalable in the number of streams to cater for new applications.
Fig. 1. Sensors network. (Colours are visible in the online version of the article; http://dx.doi.org/10.3233/IDA-2011-0511)
To be able to handle an infinite number of values, processing is restricted either to a window of elements, or to those data elements that have arrived in the last window of time. The former is called a count-based sliding window, while the latter is called a time-based or timestamp-based sliding window [6]. Constraining all queries by a sliding window allows continuous queries to be executed over unbounded streams in finite memory.

Under the above constraints, finding the different clusters of data streams, where the data streams of each cluster have similar values, is essential for many applications, such as finding oil spills in environmental applications [27], finding repeated patterns of stock behavior in financial applications [14], finding similar voice patterns in voice recognition applications [3], and finding traffic congestion at certain times in telecommunication applications [16]. Thus, the goal of this paper is to efficiently find clusters in multiple data streams, which usually involves processing a large number of data streams. Two readings are considered similar when the Euclidean distance between them is less than some threshold, which is a user-defined parameter. We propose solutions for centralized environments, where sensors send their readings to one central processing node, called a sink (see Fig. 1).

The straightforward approach to finding the different clusters in multiple data streams is to compute the pair-wise distance between readings within a time window w (see Fig. 2) across all streams. This process is very costly, especially when the number of data streams is large. In this paper, we propose an efficient multi-dimensional grid-based clustering algorithm for data streams. To reduce the dimensionality of the data streams, we use a linear transformation, the Discrete Fourier Transformation (DFT), that packs most of the stream information into a few DFT coefficients.
Thus, each data stream is represented by a point in a multi-dimensional grid in the frequency domain. The proposed technique detects the existence of clusters in real time and updates the clusters incrementally. We summarize the main contributions of this paper as follows:

1. Propose a novel unsupervised grid-based join (MG-join) algorithm to cluster multiple data streams.
2. Use the DFT to concentrate most of the stream information in a few coefficients and thus reduce the high dimensionality of the data.
Fig. 2. Sliding windows of data streams.
3. Update the clustering result in an incremental fashion.
The rest of this paper is organized as follows: related work on mining data streams is presented in Section 2. The proposed clustering technique is presented in Section 3. In Section 4, we provide an extensive experimental evaluation of the proposed technique. Finally, we conclude the paper and outline possible future work in Section 5.

2. Related work
Algorithms for data streams have received increased attention over the past few years. In this section, we present the work most related to the algorithm proposed in this paper. These related works are categorized into three areas: equi-join of data streams, pattern discovery in data streams, and clustering data streams.

2.1. Equi-join of data streams
A large body of research in the data streaming area has focused on the join operation. These works target either centralized or distributed environments (see Fig. 3). For the centralized case, there are two approaches: two-way join and multi-way join. In the two-way join approach, [26] investigated algorithms for evaluating moving window joins over pairs of unbounded streams with different arrival rates. They suggested using a symmetric join to reduce the execution cost and introduced a unit-time-basis cost model for performance evaluation. On the other hand, [24] considered the problem of sharing the window join execution among multiple queries. Research on the multi-way join approach followed one of two sub-approaches: a single join over all streams, or a set of binary joins (two streams at a time). The system in [29] suggested a single multi-way join operator for all streams, called M-join, by generalizing the existing streaming binary join algorithm in [28]. Using a single M-join, an arrival from any input source can be used to generate and propagate results in a single step without having to pass these results through a multi-stage binary execution pipeline. On the other hand, a set of binary joins may be used with, or without, sliding windows. For example, [23] proposed a stream window join operator that copes with unbounded and unsynchronized multi-sensor data streams. Moreover, [20] studied multi-way join processing of continuous queries over data streams; as a result, several algorithms for performing continuous incremental joins were proposed.
Fig. 3. Classification of joins over data streams literature.
However, the authors of [25] argued that the straightforward application of traditional pipelined query processing techniques to sliding window queries can result in inefficient and incorrect behavior. On the other hand, [28] presented a continuously adaptive continuous query (CACQ) implementation without sliding windows. For the distributed environment, [5] addressed the problem of joining a large number of data streams. They presented a variable-arity join operator specially designed for a dynamically configured large-scale sensor network with distributed processing capabilities. [45] also focused on multi-way sliding window join processing over distributed data streams; they proposed a novel join algorithm based on two distributed data stream transfer models. Very recently, [15] investigated the problem of processing join queries within a sensor network in a distributed environment.

2.2. Pattern discovery in data streams
Pattern discovery is the process of searching for patterns in data streams that are similar to a given query stream, or part of a stream. [35] proposed AWSOM (Arbitrary Window Stream mOdeling Method) to find patterns in a single data stream. Their proposed method is based on using wavelets to represent the important features of the data. Later, [36] introduced the SPIRIT (Streaming Pattern dIscoveRy in multIple Time series) algorithm, which captures correlations and trends in collections of semi-infinite co-evolving data streams. The SPIRIT method is an anytime, single-pass algorithm based on Principal Component Analysis. Recently, [40] used the Dynamic Time Warping (DTW) distance to find sub-sequences of a data stream that are similar to a given query sequence. DTW is a popular distance measure, permitting accelerations and decelerations, and it has been studied for finite stored sequences. However, in many applications such as network analysis and sensor monitoring, massive amounts of data arrive continuously, and therefore it is infeasible to save all the historical data. They proposed SPRING, which keeps summaries of the data without significantly sacrificing accuracy while requiring constant space and time per time-tick.
Fig. 4. Classification of clustering data streams literature.

2.3. Clustering data streams
Clustering algorithms arrange a data set into several disjoint clusters such that data streams in the same cluster are similar to each other and dissimilar to data streams in other clusters, according to some similarity metric. Clustering data streams can be divided into two categories: single data stream and multiple data streams (see Fig. 4). For clustering sub-sequences of a single data stream, several algorithms use the k-median approach: [11,12,21,22] proposed algorithms that cluster a data stream in a single pass and use nominal space. Then, the system in [7] used exponential histogram (EH) data structures to improve the algorithm proposed by [21]. An improved incremental k-means algorithm for clustering binary data streams was proposed by [34]. Moreover, [30] proposed a k-means-based clustering algorithm that is highly scalable in the data size. Then, [31] proposed two algorithms, STREAM and LOCALSEARCH, for data stream clustering and showed that the proposed algorithms outperform the commonly used k-means algorithm. CluStream, proposed by [1], addressed the quality of clusters when the data evolves considerably over time; in [2], the same authors proposed HPStream, a projected clustering algorithm for high-dimensional data streams that outperforms CluStream. Other research works used density-based methods to cluster data. Density-based methods, such as DBSCAN [18], do not require prior knowledge of the number of clusters; they locate regions of high density that are separated from one another by regions of low density. Recently, researchers have become interested in using density-based methods for clustering evolving data streams. For example, [10] presented a new approach called DenStream that discovers clusters with arbitrary shape and detects outliers.
Also, [42] proposed a density-based approach that extends conventional kernel density clustering to data streams with spatio-temporal features. The TRAC-STREAMS algorithm [32] presented a different density-based approach for mining noisy and evolving data streams. [37] proposed a statistical grid-based approach to cluster the data elements of a single data stream. The multi-dimensional data space of a data stream is dynamically divided into a set of cells with different
sizes. By maintaining only the distribution statistics of the data in each cell, the algorithm efficiently finds clusters in a single stream. On the other hand, [8] developed an online version of the classical k-means to cluster multiple data streams. [17] introduced a k-means-based on-demand clustering method to cluster multiple data streams. Recently, [47] proposed an EM-based (Expectation Maximization) framework to effectively cluster distributed data streams. The framework handles the existence of noise and incomplete data records. In addition, the framework learns the distribution of the underlying data streams by maximizing the likelihood of the data clusters, and uses a test-and-cluster strategy to reduce the average processing cost. The same authors have also proposed a technique for tracking clusters of data streams [46]. From the above discussion, all the methods proposed for clustering multiple data streams either are based on k-means, which requires prior knowledge of the number of clusters, or find the clusters inefficiently due to the high dimensionality of the data streams. Our objective in this paper is to develop an algorithm that efficiently divides multiple data streams into an unknown number of clusters.
3. Proposed solution
The goal of this study is to find clusters across multiple streams in a sensor network, i.e., streams that show similar readings over a period of time. Since the natural phenomena in our applications, such as oil spills and smoke, can have irregular (non-convex) shapes, algorithms such as k-means cannot discover these non-convex clusters, as such algorithms are known to be more appropriate for circular (convex) clusters. Our proposed algorithm, like DBSCAN, can discover two non-convex clusters even if one cluster partially, or totally, engulfs the other. Thus, we propose an unsupervised clustering technique called the multi-way grid-based join algorithm (MG-join). We present a formal problem definition in Section 3.1, a brief discussion of the DFT in Section 3.2, the MG-join algorithm in Section 3.3, an efficient incremental processing scheme for the proposed algorithm in Section 3.4, and the complexity analysis in Section 3.5.

3.1. Formal definition of the problem
Let N denote the number of data streams. Each stream consists of a sequence of readings with ordered timestamps. Since streams can grow infinitely (they are unbounded in length), we consider readings within a predefined window of length w. The sliding window contains the w most recent readings of a data stream; this type of sliding window is called a time-based or timestamp-based sliding window. Given the window length w and the current time point t, we consider the sub-sequences of data streams from time t − w + 1 to the current time t; readings with timestamps older than t − w are discarded. Streams are assumed to be synchronized in the sense that a new reading arrives at each stream at a constant time interval. Based on the above description, a data stream is represented formally as a w-dimensional vector:

x = ⟨x_0, x_1, ..., x_{w−2}, x_{w−1}⟩
where x_i is a real number that represents a single reading. The data streams are updated each time new blocks of values arrive, as shown in Fig. 5. Our objective is to find clusters across multiple data streams. A cluster is detected when a group of sensors produces similar readings for a period of time. Two readings are considered similar when the Euclidean distance d between the readings is less than some threshold (determining this threshold is discussed in Section 4). The Euclidean distance between stream x and stream y is calculated by Eq. (1):

d(x, y) = \|x - y\| = \left[ \sum_{i=0}^{w-1} (x_i - y_i)^2 \right]^{1/2}    (1)

A cluster g is a set of similar data streams, g = {S_1, S_2, ..., S_{|g|}}, where |g| is the total number of streams in g. The set of all clusters found by the algorithm is denoted G. Before presenting the proposed MG-join algorithm, we give a brief background on the DFT, which is used by the MG-join algorithm. Table 1 lists the symbols used in this section.

Table 1
Symbols used in this section

Symbol   Meaning
x        Data stream vector in the time domain
X        Data stream vector in the frequency domain
N        The number of data streams
w        The size of the sliding window
b        The size of the basic window
t        Current time point
n        The number of blocks in the window
f        The number of DFT coefficients used to represent a stream
d        Distance between neighboring streams
k        Number of neighbors considered for every stream
g        A cluster
C        The number of true clusters in a data set
G        The number of clusters generated by the proposed algorithm

Fig. 5. Sliding windows of N data streams.

3.2. Background on DFT

Data streams have high dimensions. Thus, the first step in the proposed approach is to transform the data streams to reduce their dimensionality without sacrificing accuracy. The Discrete
Fourier Transformation is used to concentrate the stream information in the first few DFT coefficients. A DFT takes as input a finite sequence of real numbers in the time domain and transforms it into a frequency-domain representation. The DFT is widely used in signal processing and related fields to analyze the frequencies contained in signals. In practice, the DFT is computed using a Fast Fourier Transform (FFT) algorithm. The terms FFT and DFT are often used interchangeably, although there is a clear distinction: DFT refers to the mathematical transformation, regardless of how it is computed, while FFT refers to any one of several efficient algorithms for computing the DFT. The Discrete Fourier Transformation (DFT) of a stream x = ⟨x_0, x_1, ..., x_{w−1}⟩ is a sequence of complex numbers [33]:

DFT(x) = X = ⟨X_0, X_1, ..., X_{w−1}⟩

X_f = \sum_{i=0}^{w-1} x_i e^{-j 2\pi f i / w},   f = 0, 1, ..., w − 1,   j = \sqrt{-1}

The DFT preserves the Euclidean distance between two sequences x and y [4]; that is, the distance in the native domain equals the distance in the frequency domain:

d(x, y) = d(X, Y)
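The distance-preserving property (Parseval's theorem) is easy to check numerically. The sketch below is illustrative, not the paper's implementation; it uses the unitary form of the DFT (scaled by 1/√w), under which the equality d(x, y) = d(X, Y) holds exactly, and the sample values are made up:

```python
import cmath, math

def dft(x):
    """Unitary DFT of a real sequence (naive O(w^2) version, for clarity)."""
    w = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * f * i / w) for i in range(w))
            / math.sqrt(w) for f in range(w)]

def dist(a, b):
    """Euclidean distance; works for real or complex vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

x = [1.0, 3.0, 2.0, 5.0, 4.0, 4.5, 3.5, 2.5]
y = [1.2, 2.8, 2.1, 5.3, 3.9, 4.4, 3.6, 2.4]
X, Y = dft(x), dft(y)

# the DFT preserves the Euclidean distance ...
assert abs(dist(x, y) - dist(X, Y)) < 1e-9
# ... and keeping only the first f coefficients never over-estimates it,
# which is what makes the low-dimensional grid representation safe to use
assert dist(X[:2], Y[:2]) <= dist(X, Y) + 1e-9
```

Note that truncating to the first f coefficients can only shrink the distance, so nearby points in the grid remain candidates for the same cluster.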
3.3. The proposed MG-join algorithm
In this paper, the proposed MG-join algorithm finds the clusters in the frequency domain; therefore, it is important that the transformation method preserves the distance between the time domain and the frequency domain.
The straightforward approach of finding similar streams is to compute the pair-wise distance between all the readings within w in all the streams. This process is costly, especially when the number of streams N is large. A more efficient solution is to find similar data streams by means of clustering. However, most of the prior works on clustering data streams use the k-means algorithm, which requires prior knowledge of the number of clusters. The proposed MG-join algorithm is an unsupervised grid-based clustering algorithm that has the following features:

– No prior assumption of the number of clusters.
– Clustering avoids the pair-wise distance computation.
– Clustering is efficient as it is performed in the frequency domain.
– There is no need to maintain the data physically; instead we keep useful statistics of the data streams in each cell of the grid.
– Grid-based clustering is fast for large sets of data.
The MG-join algorithm (Algorithm 1) computes the DFT coefficients for the readings of the data streams in w. Then, for every new block of values received, the contents of w change by adding the new block and removing the oldest block (see Fig. 8). Thus, the DFT needs to be updated in an efficient way without recalculating the DFT coefficients of all the readings in w. Therefore, we propose an incremental update mechanism for the DFT coefficients, presented in Section 3.4. In the data transformation step, the DFT coefficients are computed over w for all the streams. Each stream is represented by f DFT coefficients. Then, each stream is mapped to a point in an f-dimensional grid. For
Fig. 6. Mapping streams to grid structure.
Algorithm 1: MG-join (DFTCoefficients, CellWidth, MinP)
1: Build the grid structure
2: Map the streams to the grid
3: ListOfClusters = Process-Grid() // see Algorithm 2
example, if f = 2, then each stream is represented as a point in a 2-dimensional grid as shown in Fig. 6. If two points are close in the grid, then they represent two similar streams within this window. After mapping the streams, the MG-join (Algorithm 1) takes as input the list of DFT coefficients for all the streams and two other parameters, CellWidth and MinP. CellWidth represents the width of each grid cell and MinP represents the minimum number of streams that can form a cluster. Determining the CellWidth and MinP is presented later in this section.
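Mapping a stream's f DFT coefficients to a grid cell is a simple quantization. The following is a minimal sketch, not the paper's implementation; function names are hypothetical, and for simplicity each stream is assumed to be already summarized as an f-dimensional real point (in practice the real/imaginary parts of the retained coefficients would supply the coordinates):

```python
from collections import defaultdict

def cell_of(point, cell_width):
    """Quantize an f-dimensional point to its integer grid-cell index."""
    return tuple(int(coord // cell_width) for coord in point)

def build_grid(stream_points, cell_width):
    """Map each stream id to a cell, keeping only per-cell occupancy lists."""
    grid = defaultdict(list)  # cell index -> list of stream ids
    for sid, point in stream_points.items():
        grid[cell_of(point, cell_width)].append(sid)
    return grid

# Two streams whose coefficients are close fall into the same cell.
points = {"s1": (0.10, 0.20), "s2": (0.15, 0.30), "s3": (2.40, 1.90)}
grid = build_grid(points, cell_width=0.35)
assert cell_of(points["s1"], 0.35) == cell_of(points["s2"], 0.35)
```

Only the cell occupancies are kept, which reflects the paper's point that the raw data need not be maintained physically.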
The MG-join algorithm starts by building the grid structure in line 1 using the CellWidth parameter. Then, each stream is mapped into a point in the f-dimensional grid (line 2). In line 3, the algorithm builds the list of clusters using the Process-Grid algorithm (Algorithm 2).
Algorithm 2: Process-Grid()
1: P = next unlabeled cell with the largest occupancy
2: CurrentCluster = Cluster-Construction(P) // see Algorithm 3
3: if (size(CurrentCluster) > MinP) then
4:   put CurrentCluster in G // add to the list of clusters
5: end if
6: if (there exists an unlabeled cell) then // terminating condition
7:   Process-Grid() // recursive call
8: end if
9: return G // list of clusters
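Process-Grid, together with the Cluster-Construction routine it calls, amounts to a connected-component search over the occupied cells, seeded from the most populated cell. A compact sketch follows (an illustration, not the paper's code; names are hypothetical and the recursion of the pseudocode is replaced by an explicit stack):

```python
from itertools import product

def neighbors(cell):
    """Yield the 3^f - 1 adjacent cells of an f-dimensional cell index."""
    for off in product((-1, 0, 1), repeat=len(cell)):
        if any(off):
            yield tuple(c + o for c, o in zip(cell, off))

def mg_join_clusters(grid, min_p):
    """grid: cell index -> list of stream ids. Keep clusters with > min_p streams."""
    labeled, clusters = set(), []
    # candidate seeds in decreasing order of occupancy (Process-Grid, line 1)
    for seed in sorted(grid, key=lambda c: len(grid[c]), reverse=True):
        if seed in labeled:
            continue
        stack, members = [seed], []
        while stack:  # Cluster-Construction, done iteratively
            cell = stack.pop()
            if cell in labeled or cell not in grid:
                continue  # empty or already-labeled cells terminate the expansion
            labeled.add(cell)
            members.extend(grid[cell])
            stack.extend(neighbors(cell))
        if len(members) > min_p:  # discard groups too small to be significant
            clusters.append(members)
    return clusters

grid = {(0, 0): ["s1", "s2"], (0, 1): ["s3"], (5, 5): ["s4"]}
clusters = mg_join_clusters(grid, min_p=2)  # isolated cell (5, 5) is dropped as noise
```

The iterative stack avoids deep recursion when a cluster spans many cells, but the traversal order is otherwise the same as in the pseudocode.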
The Process-Grid procedure takes the cell with the largest number of streams as the first candidate cell (line 1). The second candidate cell is the cell with the next largest number of streams, and so on. The current candidate cell must be unlabeled (not a member of any cluster). For each candidate cell, the algorithm finds a new cluster through Cluster-Construction (Algorithm 3) in line 2. This cluster is kept only if the number of streams in its member list is greater than the threshold MinP (lines 3 and 4). The Process-Grid procedure repeats these steps recursively until all cells are processed, and then returns the list of accepted clusters (G) to the main algorithm, MG-join.

Algorithm 3: Cluster-Construction (P)
1: put P in the members of CurrentCluster
2: for (each neighbor i) do
3:   if (i is not Null) then // an empty cell is a terminal cell
4:     if (i is not labeled) then // cell is not a member of any cluster
5:       return Cluster-Construction(i) // recursive call
6:     end if
7:   end if
8: end for

The Cluster-Construction procedure (Algorithm 3) takes a candidate cell as an initial seed. The algorithm starts clustering by adding the neighboring cells to the initial seed. For each neighboring cell, Cluster-Construction makes sure that the cell is not empty (line 3) and not a member of another cluster (line 4), and then adds it to the current cluster. At the same time, the algorithm recursively checks the neighbors of the current neighbor until there is no populated cell around any member of the current cluster (line 5). The algorithm returns the constructed cluster to the Process-Grid procedure.

Fig. 7. Neighbors of a cell in a 2-dimensional grid.

For a cell in a 2-dimensional grid (Fig. 7), the gray cells are the neighbors of cell i; the number of neighbors is 3^2 − 1. In an f-dimensional grid, the number of neighbors is 3^f − 1.

MinP and CellWidth are design parameters that must be determined before executing the MG-join algorithm. MinP is the minimum number of streams in a cluster that is considered significant enough to form a phenomenon; a group of fewer than MinP streams is considered isolated noise and removed. This parameter depends on the application under consideration and is set by the user. CellWidth determines the granularity of the grid and depends on the density of the data; it is calculated empirically using a training data set. We use a simple but effective heuristic called k-distance, which was used in [18], to determine CellWidth: it equals the maximal distance d from a stream x to its kth nearest neighbor. In our case, d is the distance between stream x and its (MinP − 1)th nearest stream; that is, the CellWidth should be large enough to accommodate the minimum number of streams that form a cluster. The k-distance heuristic is based on the observation that, for a stream x, as k increases from 1 to the number of streams in a cluster, d increases until a threshold k after which d does not increase significantly (levels off). Based on our experiments, a threshold k for our data sets is found to be 3 and the CellWidth value is 0.35.

3.4. Incremental processing of the proposed algorithms

Due to the high arrival rate and continuous nature of data streams, algorithms must be able to adapt rapidly to the fast changes in the streams by using an efficient incremental mechanism. Updates in data
streams come in the form of insertions of new data and eliminations of old data. For efficient updating, the sliding window w is divided equally into n blocks. Each block B is of size b; B is called the basic window. Thus w = nb. Figure 8 illustrates an example of sliding windows and basic windows, where w contains n blocks and each block contains 3 readings (b = 3). That means that for every 3 new readings, the algorithm updates the sliding window by eliminating the oldest block (the oldest 3 readings), B_0, and adding the newly arrived block (the newest 3 readings), B_n. Thus the stream x is written as:

x = \underbrace{x_0, x_1, ..., x_{b-1}}_{B_0}, \underbrace{x_b, x_{b+1}, ..., x_{2b-1}}_{B_1}, ..., \underbrace{x_{(n-1)b}, x_{(n-1)b+1}, ..., x_{w-1}}_{B_{n-1}}

Fig. 8. Sliding window and basic windows.

All data streams are updated by receiving new blocks of values. When a new block arrives, the DFT coefficients need to be updated over the sliding window w. For efficient computation of the DFT coefficients, the algorithm keeps a summary for each block. Then, in order to calculate the new DFT coefficients over w, the algorithm needs only to compute the summary of the new block and eliminate the summary of the old block, as follows. Let the expired block of the data stream be

B_0 = ⟨x_0, x_1, ..., x_{b−1}⟩

and the newly arrived block of values be

B_n = ⟨x_w, x_{w+1}, ..., x_{w+b−1}⟩

Therefore, the old window and the new window are:

x_old = ⟨x_0, x_1, ..., x_{w−1}⟩ = B_0 + ⟨x_b, x_{b+1}, ..., x_{w−1}⟩
x_new = x_old − B_0 + B_n = ⟨x_b, x_{b+1}, ..., x_{w−1}, x_w, x_{w+1}, ..., x_{w+b−1}⟩

Let X_m^{old} be the mth DFT coefficient of the current data stream x_old, and X_m^{new} be the mth DFT coefficient of the new data stream. X_m^{new} is computed as follows [48]:

X_m^{new} = e^{j 2\pi m b / w} X_m^{old} + \sum_{i=0}^{b-1} x_{w+i} e^{j 2\pi m (b-i) / w} - \sum_{i=0}^{b-1} x_i e^{j 2\pi m (b-i) / w}    (2)

Note that to update the DFT coefficients incrementally, the following value is stored for each block:

\zeta_m = \sum_{i=0}^{b-1} x_i e^{j 2\pi m (b-i) / w},   m = 1, ..., n    (3)
When a new block of readings arrives, the algorithm calculates the necessary value (ζ_m) for the new block. Then the algorithm updates the DFT coefficients of the sliding window w using Eq. (2). Partitioning the sliding window into smaller basic windows reduces the update cost: for each block, only the value ζ_m needs to be stored instead of all the individual readings, so the memory consumption is also reduced. On the other hand, the results will not be up-to-date at every clock tick. The delay is at most one block, so this disadvantage can be limited by using small blocks [9]. However, small blocks increase the number of blocks and therefore the number of values that need to be stored. In addition, note that over short time intervals the sensor readings change only slightly; changes are therefore more noticeable after a block of readings than after a single reading.
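The incremental update of Eq. (2) can be checked against a full recomputation over the shifted window. The sketch below is illustrative (naive unnormalized DFT, made-up readings, hypothetical function names):

```python
import cmath

def dft(x):
    """Naive (unnormalized) DFT, matching the definition in Section 3.2."""
    w = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * m * i / w) for i in range(w))
            for m in range(w)]

def update_dft(X_old, expired, arrived, w):
    """Eq. (2): rotate the old coefficients, then swap the expired block's
    contribution for the newly arrived block's contribution."""
    b = len(arrived)
    X_new = []
    for m, Xm in enumerate(X_old):
        shift = cmath.exp(2j * cmath.pi * m * b / w)
        delta = sum((arrived[i] - expired[i]) *
                    cmath.exp(2j * cmath.pi * m * (b - i) / w) for i in range(b))
        X_new.append(shift * Xm + delta)
    return X_new

readings = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]  # toy stream
w, b = 8, 2
X_old = dft(readings[:w])                       # coefficients of the old window
X_new = update_dft(X_old, readings[:b], readings[w:w + b], w)
X_ref = dft(readings[b:w + b])                  # full recomputation, new window
assert max(abs(a - c) for a, c in zip(X_new, X_ref)) < 1e-9
```

The incremental path touches only b readings per update instead of all w, which is the source of the savings discussed above.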
3.5. Complexity analysis
The complexity of the proposed MG-join algorithm consists of two parts: the data transformation and the MG-join algorithm itself. The time complexity of the DFT of one stream is known to be O(w log w) [33], where w is the size of the sliding window. Since we have N streams, the overall DFT time complexity is O(Nw log w). To compute the complexity of the MG-join algorithm, we need to inspect its main operations, which are:

1. Mapping the streams to the grid.
2. Constructing the clusters:
   a. Ordering the cells according to the number of streams in each cell.
   b. Selecting the cells with the largest number of streams.
   c. Constructing a cluster.
The first operation is mapping the streams to the grid. A single pass through the N data streams determines the cell of each stream, so the time complexity is O(N). The next operation is constructing the clusters. This operation selects the cell that contains the largest number of data streams as an initial seed. The time complexity of ordering the cells is proportional to the number of cells in the grid. If the size of the grid is L, where L is determined by CellWidth, then the time complexity of determining the initial seed is O(L). The algorithm then goes through the list of cells to select the most populated cell. In the worst case, each stream is mapped to a separate cell in the grid, so the algorithm needs to process the whole list; thus the time complexity of selecting the initial seeds is O(N). Each time the algorithm constructs a single cluster from an initial seed, it needs to go through all the neighbors of the seed. Since the number of dimensions equals the number of DFT coefficients f, the possible number of neighbors of a cell in the grid is 3^f − 1. In the worst case, each stream is mapped into a separate cell, and for each cell the algorithm checks 3^f − 1 neighbors. There are then N occupied cells in the grid (each cell holds one stream), so the time complexity is O(N · (3^f − 1)). In summary, the complexity of the proposed MG-join algorithm is:

O(N) + O(L) + O(N) + O(N · (3^f − 1)) = O(L) + O(N · 3^f)
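As a quick sanity check on the 3^f − 1 term, the neighbor offsets of a cell can be enumerated directly (a small illustrative snippet, not part of the paper's implementation):

```python
from itertools import product

def neighbor_count(f):
    """Number of adjacent cells of an f-dimensional grid cell."""
    return sum(1 for off in product((-1, 0, 1), repeat=f) if any(off))

assert neighbor_count(2) == 3 ** 2 - 1  # 8 neighbors in a 2-d grid
assert neighbor_count(3) == 3 ** 3 - 1  # 26 neighbors in a 3-d grid
```

The exponential growth in f is why only a few DFT coefficients are retained per stream.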
In this paper, we compare the proposed algorithm with the DBSCAN algorithm. The description of the DBSCAN is presented in Section 4.2. The basic time complexity of the DBSCAN algorithm is O(N × points in the Eps-neighborhood), where N is the number of streams. In the worst case, for each stream, the algorithm has to traverse all nodes of the R*-tree; thus the complexity of the DBSCAN algorithm is O(N^2). However, if we assume the Eps-neighborhood is small compared to the whole dataset, a small query region traverses only a limited number of paths in the R*-tree. Therefore the average run-time complexity of a single region query is O(log N), since the height of an R*-tree is O(log N) for a data set of N streams. For each of the N data streams, we have at most one query region; thus the average run-time complexity of DBSCAN in this case is O(N log N). Based on the used data sets, if the number of streams in a cluster is small, it is likely that the whole cluster fits in one R*-tree node and the search costs O(log N). But if the number of streams in a cluster is large, the cluster is likely to span multiple R*-tree nodes, and the worst-case search costs O(N^2). Note that the worst-case complexity of the proposed MG-join is better than that of the DBSCAN.

4. Performance evaluation
We conducted experiments to measure the performance of the algorithms using synthetic data, which is described in Section 4.1. The proposed MG-join algorithm is compared with the DBSCAN algorithm, which is described in Section 4.2. The evaluation of the proposed multi-way grid-based join algorithm is presented in Section 4.3. We use the object-oriented programming language Java (JDK 6) to implement all the proposed algorithms. For the experiments with the DBSCAN algorithm, we use the WEKA 3 toolkit, an open-source data mining and machine learning package written in Java [43]. All the experiments were conducted on a 2 GHz Pentium(R) M with 2 GB RAM, running Windows XP.

4.1. Description of synthetic data sets
We use synthetic data to evaluate the proposed algorithm. In general, an important advantage of synthetic data is that it allows conducting experiments in a controlled way and hence answering questions regarding specific hypotheses related to the proposed algorithm and its behavior. To evaluate the MG-join algorithm, data sets that contain clusters were generated as follows. First, a prototype p is generated for each cluster. This prototype is a stochastic process defined by means of a second-order differential equation [8]:

p(t + Δt) = p(t) + p′(t + Δt)
p′(t + Δt) = p′(t) + u(t)    (4)

where t = 0, Δt, 2Δt, . . . . The u(t) is an independent random variable that is uniformly distributed in an interval [−a, a], where a is a constant; as a becomes smaller, the stochastic process p(·) becomes smoother. The streams that should belong to a cluster are then generated by "distorting" the prototype horizontally. More precisely, a data stream x is defined by

x(t) = p(t + h(t))    (5)
Fig. 9. Example of prototypical data stream & distorted version.
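As an illustration of Eqs. (4) and (5), the generation scheme can be sketched in a few lines (an illustrative Python sketch, not the paper's Java implementation; the function names, the scaling of h, and the clamping of t + h(t) to the window bounds are our assumptions):

```python
import random

def prototype(length, a=0.05, dt=1.0):
    """Second-order stochastic process of Eq. (4): the increment p'
    is itself integrated, with u(t) ~ Uniform[-a, a]; a smaller `a`
    yields a smoother prototype."""
    p, dp, series = 0.0, 0.0, []
    for _ in range(length):
        dp += random.uniform(-a, a)   # p'(t+dt) = p'(t) + u(t)
        p += dp * dt                  # p(t+dt)  = p(t) + p'(t+dt)
        series.append(p)
    return series

def distort(proto, h):
    """Eq. (5): x(t) = p(t + h(t)), a horizontal distortion of the
    prototype by a smooth stochastic process h (clamped to bounds)."""
    n = len(proto)
    return [proto[min(n - 1, max(0, int(round(t + h[t]))))] for t in range(n)]

random.seed(7)
p = prototype(60)                          # cluster prototype, window w = 60
h = [v * 0.5 for v in prototype(60, a=0.01)]
x = distort(p, h)                          # one stream of p's cluster
print(len(x))  # 60
```

Streams generated from the same prototype land close together after the DFT, which is what the grid-based clustering exploits.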
where h are stochastic processes that are generated in the same way as the prototype p [8]. Figure 9 shows a prototype p and a data stream x. In a similar way, the training data set has been collected, and thus it follows the same distribution as that of the test data sets. The size of the training data set is about 20% of the size of the test data.

4.2. Description of the DBSCAN algorithm
The proposed MG-join algorithm is compared with the DBSCAN [18], which is a density-based algorithm. The main idea of the DBSCAN is density reachability. A point q is density reachable from a point p if q is within Eps distance from p and there are at least Minpts points within Eps around p. Thus, p is considered a core point and it is assigned to a cluster. Non-core points are either assigned to the boundary of a cluster or labeled as noise (they do not belong to any cluster). DBSCAN is designed to discover clusters of arbitrary shape in noisy data by locating regions of high density that are separated from one another by regions of low density (see Algorithm 4). The DBSCAN algorithm requires two parameters: (1) Eps, the size of the Eps-neighborhood of points, and (2) Minpts, the minimum number of points in the Eps-neighborhood to form a cluster. DBSCAN uses an R*-tree structure, which holds all the data points. The DBSCAN algorithm starts with an arbitrary starting point p that has not been visited and finds all the neighboring points within distance Eps of the starting point. If the number of neighboring points is greater than or equal to Minpts, the starting point is labeled with the current cluster ID and added to the queue; the current cluster is expanded with the starting point (see Algorithm 5). Then, the neighbors of p are added to the queue. But, if the number of neighbor points of p is less than Minpts, p is considered noise. The algorithm repeats the process for all the neighbors until a cluster is fully discovered. Then, the algorithm proceeds to process the unlabeled (unvisited) points in the R*-tree, if any, to try to discover other clusters.

4.3. Performance evaluation of MG-join
To evaluate the performance of the MG-join algorithm we investigated the following two aspects: the quality in terms of the purity of the produced clusters, and the efficiency in terms of system execution time. We compared the MG-join algorithm with the DBSCAN algorithm in both aspects. The input parameters of both algorithms are summarized in Table 2. For DBSCAN, MinPts is set equal to MinP, and Eps is the maximal distance of the MinPts+1 nearest neighbor. However, the CellWidth is not equal to the Eps because the MG-join works with the transformed data in the frequency domain, and therefore
Algorithm 4: DBSCAN (Input Set, Eps, Minpts)
1: for (each p in the Input Set) do
2:   if (p is not labeled) then
3:     if (p is a core point) then
4:       generate a new ClusterID
5:       label p with ClusterID
6:       ExpandCluster (p, Input Set, Eps, Minpts, ClusterID) // see Algorithm 5
7:     else
8:       label(p, NOISE)
9:     end if
10:   end if
11: end for

Algorithm 5: ExpandCluster (p, Input Set, Eps, Minpts, ClusterID)
1: put p in a seed queue
2: while (the queue is not empty) do
3:   extract c from the queue
4:   retrieve the Eps-neighborhood of c
5:   if (there are at least Minpts neighbors) then
6:     for (each neighbor i) do
7:       if (i is labeled NOISE) then
8:         label i with ClusterID
9:       end if
10:       if (i is not labeled) then
11:         label i with ClusterID
12:         put i in the queue
13:       end if
14:     end for
15:   end if
16: end while
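For illustration, Algorithms 4 and 5 can be rendered as a compact runnable sketch (Python here for brevity; the paper's implementation is in Java and uses an R*-tree for the Eps-neighborhood query, whereas this sketch scans all points):

```python
from collections import deque

NOISE = "NOISE"

def region_query(points, i, eps):
    """Indices of all points within Eps of points[i] (brute force;
    DBSCAN's implementation would issue an R*-tree range query)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, minpts):
    """Algorithm 4: label core points, expand a cluster from each,
    and mark everything unreachable as noise."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(region_query(points, i, eps)) < minpts:
            labels[i] = NOISE             # may become a border point later
            continue
        cluster_id += 1                    # i is a core point: new cluster
        labels[i] = cluster_id
        queue = deque([i])
        while queue:                       # ExpandCluster (Algorithm 5)
            c = queue.popleft()
            hood = region_query(points, c, eps)
            if len(hood) < minpts:
                continue                   # border point: do not expand
            for n in hood:
                if labels[n] == NOISE:
                    labels[n] = cluster_id
                elif labels[n] is None:
                    labels[n] = cluster_id
                    queue.append(n)
    return labels

pts = [(0, 0), (0.5, 0), (1, 0), (10, 10), (10.5, 10), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.0, minpts=3))  # two clusters plus one noise point
```

The brute-force neighborhood scan is the O(N^2) worst case discussed above; the R*-tree brings the average region query down to O(log N).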
the distance computation is done using the transformed data, which considers only f DFT coefficients. On the other hand, Eps for the DBSCAN is computed using the original data in the native domain, which considers all the w components of the streams.

Table 2
Input parameters of the compared algorithms

Algorithm    Parameters
MG-join      MinP, CellWidth
DBSCAN       MinPts, Eps
4.3.1. Purity evaluation
The quality of the MG-join depends on the choice of the data set. Therefore, we test the algorithms on different data sets. The quality of the MG-join is measured in terms of the purity of the produced clusters. Before presenting the results of the quality evaluation, we discuss the definition of purity. In the context of clustering, precision and recall are used for evaluating the clustering results. Let C be the set of the true clusters in the data, and G be the set of the calculated clusters. Each cluster i in G is compared with a true cluster j in C, where j is the dominant cluster label of the members of i. Precision and recall are computed as follows [38]:

prec_ij = n_ij / n_i    (6)

recall_ij = n_ij / n_j    (7)

where n_ij is the number of streams from the true cluster j that appear in the calculated cluster i, n_i is the total number of streams in cluster i, and n_j is the total number of streams in cluster j. Note that the true cluster labels are known for the synthetic data sets. We use the precision, Eq. (6), to measure the purity of each cluster calculated by the MG-join algorithm. Then the average purity is the average of the precision of all the clusters produced in G [2,10,42]:

purity = (1 / |G|) Σ_{i=1}^{|G|} prec_ij    (8)

where |G| is the total number of clusters generated by the MG-join.

Fig. 10. Purity vs. number of clusters (comparison with DBSCAN).

The experiments are conducted by varying the following parameters: the total number of the data streams N, the size of the sliding window w, the number of DFT coefficients f, and the number of clusters in the data set C. The experiment parameters are set as follows:
– The total number of data sets is 250, divided into 5 groups.
– The number of data streams in a data set is 200.
– The size of the window w is fixed to 60 values.
– The maximum number of DFT coefficients used is 2.
– The number of clusters in the data varies from 2 to 10.
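The purity measure of Eqs. (6) and (8) can be computed directly from the cluster memberships; a minimal sketch (the function names are ours):

```python
from collections import Counter

def purity(true_labels, calc_clusters):
    """Average purity over calculated clusters, Eqs. (6) and (8):
    each calculated cluster i is scored by prec_ij = n_ij / n_i,
    where j is the dominant true label among its members."""
    precs = []
    for members in calc_clusters:
        counts = Counter(true_labels[m] for m in members)
        n_ij = counts.most_common(1)[0][1]   # dominant-label count
        precs.append(n_ij / len(members))
    return sum(precs) / len(precs)           # Eq. (8): mean over |G| clusters

# Toy example: 6 streams, true clusters {0,1,2} = A and {3,4,5} = B.
truth = ["A", "A", "A", "B", "B", "B"]
print(purity(truth, [[0, 1, 2], [3, 4, 5]]))  # 1.0 (perfect clustering)
print(purity(truth, [[0, 1, 3], [2, 4, 5]]))  # 2/3 (one stream swapped each way)
```

A purity of 1 means every calculated cluster is homogeneous with respect to the true labels.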
The Y-axis in Fig. 10 represents the purity value, while the X-axis represents the number of clusters. Five different groups of data streams are generated, where each group contains 50 data sets and each data set contains 200 data streams. Each group of data sets contains a different number of clusters. For example, the first group (50 data sets) contains 2 clusters; the second group (50 data sets) contains 4 clusters, and so on. Figure 10 shows that the MG-join algorithm gives exactly the same purity values as the DBSCAN algorithm for all groups of data sets containing varying numbers of clusters (from two to ten). Although both algorithms produce the same quality of clusters, the MG-join algorithm uses only two DFT coefficients to represent each stream. Yet, as we show in Section 4.3.2, MG-join is much faster than DBSCAN. Note that as the number of clusters increases, the purity value decreases for both the DBSCAN and the MG-join. That is because with a greater number of clusters in the data sets, the clusters become closer
to each other. Thus, the algorithm tends to merge some of the clusters, and therefore the purity of the clustering algorithms decreases. This is true for almost all clustering algorithms. We noted that in most cases two DFT coefficients were enough to get exactly the same results as the DBSCAN. In a few cases, where a significant overlap exists between clusters, more than two DFT coefficients are required for the MG-join to achieve the same results as the DBSCAN.

The above experiment is repeated for 400 data sets that are divided into 4 groups, where each group contains 100 data sets. Each group contains a different number of clusters (from 4 to 10), as shown in Fig. 11, and some of these clusters are overlapping. Note that each data set contains 200 data streams. The computation of the clustering purity is repeated for different numbers of DFT coefficients representing the data streams in the group until the same purity as that of the DBSCAN is achieved for the group. In Fig. 11, the X-axis shows the number of clusters. The Y-axis shows the percentage of the group of data sets that achieves the same purity value as DBSCAN using two, three, four, and five DFT coefficients. Figure 12 details the results of the group that contains 4 clusters (left group in Fig. 11). In Fig. 12, the number of DFT coefficients required to achieve the same purity as that of the DBSCAN is as follows: 80% of the data sets required only two DFT coefficients, 8% of the data required three DFT coefficients, 8% required four DFT coefficients, and only 4% of the data required five DFT coefficients. In summary, we showed that for most of the data sets two DFT coefficients are enough to achieve the same clustering quality as DBSCAN. Yet the MG-join is much faster than DBSCAN, as we see in the following section, which presents the efficiency evaluation in terms of system time.

Fig. 11. Purity vs. DFT coefficients & number of clusters.

Fig. 12. Purity vs. DFT coefficients, N = 200, w = 60, C = 4, 100 data sets.
Fig. 13. System time vs. number of data streams (comparison with k-means).
4.3.2. Execution time evaluation of the multi-way grid-based join
This section presents a comparison between the MG-join execution time and the DBSCAN execution time. We also present a comparison between the MG-join execution time and the k-means execution time. The system execution time of the MG-join algorithm is the total time used for the DFT transformation, in addition to the time used for building the grid structure, mapping data streams to the grid, and cluster construction. On the other hand, for the DBSCAN and k-means, the system execution time is the time used for finding the clusters using the data streams in the time domain. Different factors can affect the system time. These factors include the number of data streams N in the data set, the window size w, and the number of DFT coefficients f. Experiments are conducted to evaluate the effect of those factors on the response time of the compared algorithms. The objective of the first set of experiments is to evaluate how the number of streams affects the efficiency, with all other parameters fixed. In this experiment, the number of data streams is varied from 100 to 1000, while the other parameters are fixed as follows: w = 60, f = 2. The test data contains 3 clusters. The Y-axis in Fig. 13 shows the system execution time used by the MG-join algorithm and the k-means algorithm. From this figure, we note that as the number of streams increases, the system execution time of both algorithms increases. However, the time of the MG-join algorithm increases at a very small rate, while the k-means time increases at a high rate. This is because the k-means algorithm needs to compute the distance between each cluster center and each data stream member of that cluster in every iteration until the algorithm converges. The time complexity of the k-means algorithm is O(ICNw), where I is the number of k-means iterations, C is the number of clusters, N is the number of streams, and w is the window size.
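The O(ICNw) bound counts distance computations; a toy helper (ours, not part of the paper's implementation) makes the linear dependence on N explicit:

```python
def kmeans_cost(I, C, N, w):
    """Distance-computation count behind the O(ICNw) bound: every
    iteration compares each of the N streams with each of the C
    centers, and each comparison touches all w window values."""
    return I * C * N * w

# Doubling the number of streams doubles k-means' work, whereas the
# MG-join's grid lookup involves no explicit distance computations.
print(kmeans_cost(I=10, C=3, N=200, w=60))   # 360000
print(kmeans_cost(I=10, C=3, N=400, w=60))   # 720000
```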
Increasing the number of data streams increases the distance-calculation time and thus increases k-means' execution time. On the other hand, there is no direct distance calculation in the proposed MG-join algorithm; the distances are implicitly encoded in the grid structure. To specify k (the number of clusters) for the k-means algorithm, we first execute the MG-join algorithm, and then k is set to the number of clusters generated by the MG-join algorithm. Next, we present the comparison between the proposed MG-join execution time and the DBSCAN execution time. In this experiment, the number of data streams is varied from 10 to 10000, while the other parameters are fixed as follows: w = 60, f = 2. The test data contains 4 clusters. The Y-axis in Figs 14 and 15 shows the system execution time used by the MG-join algorithm and the DBSCAN algorithm. The two figures present the same experiments at different scales: Fig. 14 uses a logarithmic scale on the Y-axis, while Fig. 15 uses a linear scale. From these figures, we note the following: as the number of streams increases, the system execution time of both algorithms
increases. However, the time of the MG-join algorithm increases much more slowly than that of the DBSCAN. Note also that for a small number of streams, such as 10, the DBSCAN algorithm is slightly faster than the MG-join algorithm, while for a large number of streams, the MG-join algorithm outperforms the DBSCAN algorithm. The reason is that for a fixed number of clusters, a small N makes the clusters relatively small, and thus the DBSCAN traverses a limited number of R*-tree paths with time complexity O(N log N); see Section 3.5. But when N increases, the clusters become relatively larger in size, and thus the complexity of the DBSCAN becomes O(N^2). As a result, as the number of streams increases, the execution time of the DBSCAN increases quadratically. In contrast, the complexity of the MG-join is O(N · 3^f) + O(L). In this experiment, the size of the grid is fixed and does not change with the number of streams. As the number of DFT coefficients f is fixed, the value 3^f is fixed. Thus, while the number of streams N increases, the execution time of the MG-join algorithm increases linearly with N; as a result, it increases much more slowly than that of the DBSCAN.

Fig. 14. Semi-logarithmic scale, system time vs. number of data streams.

Fig. 15. System time vs. number of data streams: w = 60, f = 2, C = 4.

The following set of experiments evaluates the effect of the window size on the execution time of the MG-join and DBSCAN algorithms. In this experiment, we varied the window size w from 50 to 250, while the other parameters are fixed as follows: N = 1000, f = 2. The test data contains 4 clusters. In Fig. 16, the Y-axis shows the system time, while the X-axis shows the window size. Figure 16 shows that as the window size increases, the execution time of both algorithms increases. The execution time of the MG-join algorithm increases slightly, because the overhead of the DFT computation increases with the window size (the time complexity of the DFT is O(N w log w)). Then, during the clustering phase, a fixed number of DFT components is used in the distance computation, and thus the window size does not add
overhead on the clustering phase. On the other hand, the DBSCAN uses the whole window in its distance computations during the clustering phase; thus the window size has a severe effect on DBSCAN's execution time, as described in Section 3.5. Although the time increases for both algorithms, the proposed MG-join algorithm greatly outperforms the DBSCAN algorithm.

Fig. 16. System time vs. window size: N = 1000, f = 2, C = 4.

The following experiments evaluate the effect of the number of DFT coefficients on the execution time of the proposed algorithm. Earlier, in Section 4.3.1, we showed how the number of DFT coefficients affects the purity of the MG-join and concluded that, for most of our data sets, two DFT coefficients were enough to produce the same results as the DBSCAN; a few data sets whose clusters overlap require up to 5 DFT coefficients. We conducted experiments to test the effect of increasing the number of DFT coefficients on the system execution time. In this experiment, we varied the number of DFT coefficients f from 2 to 7, while the other parameters are fixed as follows: N = 8000, w = 60. The test data contains 4 clusters. The Y-axis of Fig. 17 shows the logarithm of the execution time, while the X-axis represents the number of DFT coefficients f. Note that the execution time of the DBSCAN is independent of f, and the number of streams is fixed. As the number of DFT coefficients increases, the execution time of the MG-join algorithm increases. Recall that the time complexity of the MG-join is O(N · 3^f) + O(L). Increasing f increases the number of neighbors that need to be checked (3^f − 1), which increases the processing time exponentially. This is what is known in the literature as the curse of dimensionality.

Fig. 17. Semi-logarithmic scale, system time vs. DFTs: N = 8000, w = 60, C = 4.
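The 3^f − 1 neighbor count grows exponentially with the number of DFT coefficients f; tabulating it makes the growth concrete:

```python
# Neighbors inspected per occupied grid cell as a function of the
# number of DFT coefficients f: the 3^f - 1 factor is what makes
# MG-join's cost climb exponentially past the cross-over point.
counts = {f: 3 ** f - 1 for f in range(2, 8)}
for f, c in counts.items():
    print(f, c)   # 2 -> 8, 3 -> 26, ..., 7 -> 2186
```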
As shown in Fig. 17, for a small number of DFT coefficients the MG-join algorithm is much faster than the DBSCAN algorithm. When the number of DFT coefficients increases above 6 (the cross-over point), the execution time of the MG-join algorithm becomes larger than that of the DBSCAN algorithm for the current experimental setup. Note that the cross-over point depends on the number of data streams N and the window size w. For example, when N or w increases, the cross-over point will move above 6 DFT coefficients, since the R*-tree of the DBSCAN will grow bigger and become slower. However, since we do not need a large number of DFT coefficients to achieve the same clustering quality as that of the DBSCAN (Section 4.3.1), we can conclude that the proposed MG-join is more efficient than the DBSCAN when the number of data streams is large.

5. Conclusions
This paper proposed the MG-join algorithm to address the problem of finding clusters in multiple data streams. Due to the nature of data streams, processing them is expensive in terms of memory and time. In addition to the characteristics of data streams in general, we are interested in data streams generated by sensor networks, which present additional challenges such as limited processing capabilities, low memory, and limited power. The main contributions of this paper are:
1. The novel MG-join algorithm, an unsupervised grid-based method to find clusters in multiple data streams.
2. The use of the DFT to concentrate most of the stream information in a few coefficients and thus reduce the high dimensionality of the data.
3. An incremental technique to update the streamed data.
The conducted experiments compared the proposed algorithm with the DBSCAN and k-means algorithms to evaluate its quality and efficiency. The results showed that the proposed MG-join algorithm produces a clustering quality similar to that of the DBSCAN by using two DFT coefficients in most of the cases. In addition, the experiments showed that the proposed MG-join algorithm is much faster than DBSCAN. The results also showed that the MG-join is scalable to a large number of data streams. Although increasing the number of DFT coefficients increases the cost of the MG-join algorithm, most data sets require 2 to 5 DFT coefficients to produce a clustering quality similar to that of the DBSCAN. In the future, we plan to extend the method to find clusters in multiple data streams that are not synchronized, to try other transformation methods to reduce the dimensionality of the data streams, and to examine the calculation of the DFT coefficients at the sensor level, or at the group head, in the distributed MG-join algorithm.

References

[1] C. Aggarwal, J. Han, J. Wang and P. Yu, A Framework for Clustering Evolving Data Streams, Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany, September 2003.
[2] C. Aggarwal, J. Han, J. Wang and P. Yu, A Framework for Projected Clustering of High Dimensional Data Streams, Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada, September 2004.
[3] C.C. Aggarwal, On classification and segmentation of massive audio data streams, Journal of Knowledge and Information Systems 20(2) (Aug 2009).
[4] R. Agrawal, C. Faloutsos and A. Swami, Efficient Similarity Search in Sequence Databases, Proceedings of the 4th FODO International Conference on Foundations of Data Organization and Algorithms, Chicago, Illinois, USA, October 1993.
[5] M. Ali, M. Mokbel, W. Aref and I. Kamel, Detection and Tracking of Discrete Phenomena in Sensor-Network Databases, Proceedings of the 17th International Conference on Scientific and Statistical Database Management, Santa Barbara, California, June 2005.
[6] B. Babcock, M. Datar and R. Motwani, Sampling from a Moving Window over Streaming Data, Proceedings of the 13th SIAM-ACM Symposium on Discrete Algorithms, 2002.
[7] B. Babcock, M. Datar, R. Motwani and L. O'Callaghan, Maintaining Variance and k-Medians over Data Stream Windows, Proceedings of the 22nd Symposium on Principles of Database Systems, San Diego, California, USA, April 2003.
[8] J. Beringer and E. Hullermeier, Online Clustering of Data Streams, Data & Knowledge Engineering 58 (2006).
[9] J. Beringer and E. Hullermeier, Fuzzy Clustering of Parallel Data Streams, Advances in Fuzzy Clustering and its Applications, Wiley, 2007.
[10] F. Cao, M. Ester, W. Qian and A. Zhou, Density-Based Clustering over an Evolving Data Stream with Noise, Proceedings of the SIAM Conference on Data Mining, Bethesda, Maryland, April 2006.
[11] V. Chaoji, M. Al Hasan, S. Salem and M.J. Zaki, SPARCL: an effective and efficient algorithm for mining arbitrary shape-based clusters, Journal of Knowledge and Information Systems 21(2) (Nov 2009).
[12] M. Charikar, L. O'Callaghan and R. Panigrahy, Better Streaming Algorithms for Clustering Problems, Proceedings of the 35th ACM Symposium on Theory of Computing, San Diego, California, USA, June 2003.
[13] J. Chen, D. DeWitt, F. Tian and Y. Wang, NiagaraCQ: A Scalable Continuous Query System for Internet Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000.
[14] R. Cole, D. Shasha and X. Zhao, Fast Window Correlations over Uncooperative Time Series, Proceedings of the 11th SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, August 2005.
[15] A. Coman and M. Nascimento, A Distributed Algorithm for Joins in Sensor Networks, Proceedings of the 19th SSDBM International Conference on Scientific and Statistical Database Management, Banff, Canada, July 2007.
[16] C. Cortes, K. Fisher, D. Pregibon and A. Rogers, Hancock: A Language for Extracting Signatures from Data Streams, Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, August 2000.
[17] B. Dai, J. Huang, M. Yeh and M. Chen, Adaptive Clustering for Multiple Evolving Streams, IEEE Transactions on Knowledge and Data Engineering 18(9) (2006).
[18] M. Ester, H. Kriegel, J. Sander and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, August 1996.
[19] L. Golab and M. Ozsu, Issues in Data Stream Management, Proceedings of the 21st SIGMOD International Conference on Management of Data, San Diego, CA, USA, June 2003.
[20] L. Golab and M. Ozsu, Processing Sliding Window Multi-joins in Continuous Queries over Data Streams, Proceedings of the 29th VLDB International Conference on Very Large Data Bases, Berlin, Germany, September 2003.
[21] S. Guha, N. Mishra, R. Motwani and L. O'Callaghan, Clustering Data Streams, Proceedings of the Annual IEEE Symposium on Foundations of Computer Science, Redondo Beach, CA, November 2000.
[22] S. Guha, A. Meyerson, N. Mishra, R. Motwani and L. O'Callaghan, Clustering Data Streams: Theory and Practice, IEEE Transactions on Knowledge and Data Engineering, special issue on clustering, vol. 15, 2003.
[23] M. Hammad, W. Aref and A. Elmagarmid, Stream Window Join: Tracking Moving Objects in Sensor-Network Databases, Proceedings of the 15th SSDBM International Conference on Scientific and Statistical Database Management, Cambridge, MA, USA, July 2003.
[24] M. Hammad, M. Franklin, W. Aref and A. Elmagarmid, Scheduling for Shared Window Joins over Data Streams, Proceedings of the 29th VLDB International Conference on Very Large Data Bases, Berlin, Germany, September 2003.
[25] M. Hammad, T. Ghanem, W. Aref, A. Elmagarmid and M. Mokbel, Efficient Pipelined Execution of Sliding-Window Queries over Data Streams, Purdue University Department of Computer Sciences Technical Report, June 2004.
[26] J. Kang, J. Naughton and S. Viglas, Evaluating Window Joins over Unbounded Streams, Proceedings of the 19th ICDE International Conference on Data Engineering, Bangalore, India, February 2003.
[27] E. Lamboray, S. Wurmlin and M. Gross, Data Streaming in Telepresence Environments, IEEE Transactions on Visualization and Computer Graphics 11(6) (December 2005).
[28] M. Madden, M. Shah, J. Hellerstein and V. Raman, Continuously Adaptive Continuous Queries over Streams, Proceedings of the ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA, June 2002.
[29] J. Naughton, J. Burger and S. Viglas, Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources, Proceedings of the 29th VLDB International Conference on Very Large Data Bases, Berlin, Germany, September 2003.
[30] S. Nittel, K. Leung and A. Braverman, Scaling Clustering Algorithms for Massive Data Sets using Data Streams, Proceedings of the 20th International Conference on Data Engineering, Boston, USA, April 2004.
[31] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha and R. Motwani, Streaming-Data Algorithms for High-Quality Clustering, Proceedings of the 18th International Conference on Data Engineering, San Jose, California, USA, March 2002.
[32] O. Nasraoui and C. Rojas, Robust Clustering for Tracking Noisy Evolving Data Streams, Proceedings of the SIAM International Conference on Data Mining (SDM 2006).
[33] A. Oppenheim, R. Schafer and J. Buck, Discrete-Time Signal Processing, 2nd edition, Prentice Hall, 1999.
[34] C. Ordonez, Clustering Binary Data Streams with K-means, Proceedings of the 13th ACM Data Mining and Knowledge Discovery, San Diego, California, USA, June 2003.
[35] S. Papadimitriou, A. Brockwell and C. Faloutsos, Adaptive, Hands-Off Stream Mining, Proceedings of the 29th VLDB International Conference on Very Large Data Bases, pages 560–571, Berlin, Germany, September 2003.
[36] S. Papadimitriou, J. Sun and C. Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, Proceedings of the 31st VLDB International Conference on Very Large Data Bases, pages 697–708, Trondheim, Norway, August–September 2005.
[37] N. Park and W. Lee, Statistical Grid-based Clustering over Data Streams, Proceedings of the 22nd SIGMOD International Conference on Management of Data, Toronto, Canada, March 2004.
[38] M. Rosell, V. Kann and J. Litton, Comparing Comparisons: Document Clustering Evaluation Using Two Manual Classifications, Proceedings of the ICON International Conference on Natural Language Processing, Hyderabad, India, December 2004.
[39] Y. Sakurai, S. Papadimitriou and C. Faloutsos, BRAID: Stream Mining through Group Lag Correlations, Proceedings of the 24th ACM SIGMOD International Conference on Management of Data, pages 599–610, Baltimore, Maryland, June 2005.
[40] Y. Sakurai, C. Faloutsos and M. Yamamuro, Stream Monitoring under the Time Warping Distance, Proceedings of the 23rd International Conference on Data Engineering, Istanbul, Turkey, April 2007.
[41] P. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Pearson Education, Inc., 2006.
[42] D. Tasoulis, N. Adams and D. Hand, Unsupervised Clustering in Streaming Data, Proceedings of the 6th International Conference on Data Mining, Hong Kong, China, December 2006.
[43] WEKA machine learning package, University of Waikato, http://www.cs.waikato.ac.nz/ml/weka.
[44] Y. Yao and J. Gehrke, Query Processing in Sensor Networks, Proceedings of the 1st CIDR Biennial Conference on Innovative Data Systems Research, Asilomar, California, USA, January 2003.
[45] D. Zhang, J. Li, K. Kimeli and W. Wang, Sliding Window based Multi-Join Algorithms over Distributed Data Streams, Proceedings of the 22nd ICDE International Conference on Data Engineering, Atlanta, Georgia, USA, April 2006.
[46] A. Zhou, F. Cao, W. Qian and C. Jin, Tracking clusters in evolving data streams over sliding windows, Journal of Knowledge and Information Systems 15(2) (Nov 2008).
[47] A. Zhou, F. Cao, Y. Yan, C. Sha and X. He, Distributed Data Stream Clustering: A Fast EM-Based Approach, Proceedings of the 23rd International Conference on Data Engineering, Istanbul, Turkey, April 2007.
[48] Y. Zhu and D. Shasha, StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 2002.