HUE-Stream: Evolution-Based Clustering Technique for Heterogeneous Data Streams with Uncertainty

Wicha Meesuksabai, Thanapat Kangkachit, and Kitsana Waiyamai

Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900, Thailand
{g5314550164,g51850410,fengknw}@ku.ac.th

Abstract. Evolution-based stream clustering methods support the monitoring and change detection of clustering structures. E-Stream is an evolution-based stream clustering method that supports different types of clustering structure evolution: appearance, disappearance, self-evolution, merge and split. This paper presents HUE-Stream, which extends E-Stream in order to support uncertainty in heterogeneous data. A distance function, cluster representation and histogram management are introduced for the different types of clustering structure evolution. We evaluate the effectiveness of HUE-Stream on the real-world KDD Cup 1999 Network Intrusion Detection dataset. Experimental results show that HUE-Stream gives better cluster quality than UMicro.

Keywords: Uncertain data streams, Heterogeneous data, Clustering, Evolution-based clustering.

1 Introduction

Recently, clustering data streams has become a research topic of growing interest. Data streams are characterized by an infinitely evolving structure and generation at a rapid rate. We call a stream clustering method that supports the monitoring and change detection of clustering structures an evolution-based stream clustering method. Apart from their infinite data volume, data streams also contain erroneous or only partially complete information, called data uncertainty. In this paper, we focus on developing an evolution-based stream clustering method that supports uncertainty in data.

Many techniques have been proposed for clustering data streams. Most research has focused on clustering techniques for numerical data [1, 2, 3, 7]. A few techniques have been proposed to deal with heterogeneous data streams, handling numerical and categorical attributes simultaneously [8, 9, 10]. However, very few have been proposed to monitor and detect change of the clustering structures. [7] proposed an evolution-based clustering method for numerical data streams, and [8] extends it to integrate both numerical and categorical data streams. However, neither method supports uncertainty in data streams.

J. Tang et al. (Eds.): ADMA 2011, Part II, LNAI 7121, pp. 27–40, 2011. © Springer-Verlag Berlin Heidelberg 2011


To support uncertainty in data streams, Aggarwal et al. [4] introduced the uncertain clustering feature for cluster representation and proposed a technique named UMicro. Later, they studied the problem of high-dimensional projected clustering of uncertain data streams [1]. The LuMicro technique [5] has been proposed to improve clustering quality; it supports uncertainty in numerical attributes but not in categorical attributes. However, none of these works addresses evolution-based stream clustering.

In this paper, we present an evolution-based stream clustering technique called HUE-Stream that supports uncertainty in both numerical and categorical attributes. HUE-Stream extends the E-Stream technique [7], an evolution-based stream clustering technique that handles numerical data streams but supports neither categorical data nor uncertainty. Distance functions, a cluster representation and histogram management are introduced for the different types of clustering structure evolution. A distance function over the probability distributions of two objects is introduced to support uncertainty in categorical attributes. To detect change in the clustering structure, the proposed distance function is used to merge clusters and to find the closest cluster for a given incoming data point, and the proposed histogram management is used to split clusters on categorical data. Experimental results show that HUE-Stream gives better cluster quality than UMicro.

The remainder of the paper is organized as follows. Section 2 introduces basic concepts and definitions. Section 3 presents our stream clustering algorithm, called HUE-Stream. Section 4 compares the performance of HUE-Stream and UMicro on a real-world dataset. Conclusions are given in Section 5.

2 Basic Concepts of Evolution-Based Stream Clustering with Uncertainty

In the following, some notations and definitions of evolution-based stream clustering with uncertainty are defined. First, we assume a data stream consisting of a set of d-dimensional tuples X1, ..., Xi, ... arriving at time stamps T1, ..., Ti, .... Each data point Xi contains n numerical attributes and c categorical attributes, denoted Xi = (xi,1, ..., xi,n, xi,n+1, ..., xi,n+c). The number of valid values for the k-th categorical attribute xn+k is |Vk|, where 1 ≤ k ≤ c, and the j-th valid value of xn+k is Vj,k, where 1 ≤ j ≤ |Vk|.

2.1 Tuple-Level and Dimension-Level Uncertainty

Several categories of assumption exist for modeling uncertainty in data streams [5]. In this paper, our assumption focuses on discrete probability distributions, which have been widely used and are easy to apply in practice. For an uncertain tuple Xi, the |xi,j| possible values of its j-th dimension are described by a probability distribution vector P(xi,j) = (p(xi,j,1), p(xi,j,2), ..., p(xi,j,|xi,j|)), with Σl p(xi,j,l) = 1.

The dimension-level uncertainty of the j-th dimension of a tuple Xi, denoted U(xi,j), is defined as the normalized entropy of its probability distribution:

U(xi,j) = 0, if |xi,j| = 1;
U(xi,j) = −(1 / log|xi,j|) Σl p(xi,j,l) log p(xi,j,l), otherwise.
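These uncertainty measures can be sketched as follows; the function names and the dict-based representation of a probability distribution are our own illustration, not part of the paper:

```python
import math

def dimension_uncertainty(probs):
    """Normalized entropy of one dimension's probability distribution.

    `probs` maps each possible value to its probability (summing to 1).
    A certain value (a single possible value) has uncertainty 0.
    """
    if len(probs) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return entropy / math.log(len(probs))

def tuple_uncertainty(dimensions):
    """Tuple-level uncertainty: average of the dimension-level uncertainties."""
    return sum(dimension_uncertainty(d) for d in dimensions) / len(dimensions)

# A certain dimension contributes 0; a uniform two-valued one contributes 1.
x = [{"a": 1.0}, {"yes": 0.5, "no": 0.5}]
print(tuple_uncertainty(x))  # -> 0.5
```

The normalization by log|xi,j| keeps every dimension's uncertainty in [0, 1], so dimensions with many possible values do not dominate the tuple-level average.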

Let Xi be a tuple; the tuple-level uncertainty of Xi, denoted U(Xi), is the average of its dimension-level uncertainties, defined as U(Xi) = (1/(n+c)) Σj U(xi,j). The uncertainty of all k tuples of a data stream can then be calculated as the average of their tuple-level uncertainties.

2.2 Cluster Representation Using Fading Cluster Structure with Histogram

A Fading Cluster Structure with Histogram (FCH) has been introduced in [9]. In this paper, FCH is extended to support uncertainty in both numerical and categorical data. The uncertain cluster feature for a cluster C containing a set of tuples X1, ..., Xk arriving at time stamps T1, ..., Tk is defined as FCH = (FC1(t), FC2(t), W(t), U(t), HN(t), HC(t)), as described below.

FC1(t) is a vector of weighted sums of tuple expectations for each numerical dimension at time t, i.e., the value of the j-th entry is FC1j(t) = Σi f(t − Ti) · E[xi,j].

FC2(t) is a vector of weighted sums of squared tuple expectations for each numerical dimension at time t, i.e., the value of the j-th entry is FC2j(t) = Σi f(t − Ti) · E[xi,j]².

W(t) is the sum of all weights of the data points in cluster C at time t, i.e., W(t) = Σi f(t − Ti).

U(t) is the weighted sum of the tuple uncertainties in cluster C at time t, i.e., U(t) = Σi f(t − Ti) · U(Xi).

HN(t) is an α-bin histogram of numerical data values with α equal-width intervals, i.e., the l-th bin of the j-th numerical dimension of HN(t) at time t is

HNj,l(t) = Σi f(t − Ti) · δi,j,l, where δi,j,l = 1 if E[xi,j] falls in the l-th interval and 0 otherwise.

HC(t) is a β-bin histogram of categorical data values that keeps, for each categorical dimension at time t, the β valid categorical values with the top accumulated probabilities, together with their weighted frequencies. The histogram of the j-th categorical dimension of C is HCj(t) = {(Vj,1, Probj,1, Fj,1), ..., (Vj,β, Probj,β, Fj,β)}, where

Vj,a is the a-th valid categorical value in the j-th categorical dimension,


Probj,a is the accumulated probability of Vj,a, which can be formulated as

Probj,a = Σi f(t − Ti) · p(xi,n+j = Vj,a),

and Fj,a is the accumulated weighted frequency of Vj,a, which can be formulated as

Fj,a = Σi f(t − Ti) · δi,j,a, where δi,j,a = 1 if p(xi,n+j = Vj,a) > 0 and 0 otherwise.
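A sketch of how the categorical histogram entries might be maintained, assuming the fading function f(t) = 2^(−λt) of Section 2.4; the class and method names are our own:

```python
class CategoricalHistogram:
    """Top-beta histogram: value -> (accumulated probability, weighted frequency)."""

    def __init__(self, beta, decay=0.5):
        self.beta = beta
        self.decay = decay  # lambda in f(t) = 2^(-lambda * t)
        self.prob = {}      # value -> accumulated probability
        self.freq = {}      # value -> accumulated weighted frequency

    def fade(self, elapsed):
        """Decay all entries by f(elapsed) = 2^(-lambda * elapsed)."""
        factor = 2.0 ** (-self.decay * elapsed)
        for v in self.prob:
            self.prob[v] *= factor
            self.freq[v] *= factor

    def add(self, distribution):
        """Absorb one uncertain categorical value (a value -> probability dict)."""
        for value, p in distribution.items():
            if p > 0:
                self.prob[value] = self.prob.get(value, 0.0) + p
                self.freq[value] = self.freq.get(value, 0.0) + 1.0
        # Keep only the top-beta values by accumulated probability.
        top = sorted(self.prob, key=self.prob.get, reverse=True)[:self.beta]
        self.prob = {v: self.prob[v] for v in top}
        self.freq = {v: self.freq[v] for v in top}

h = CategoricalHistogram(beta=2)
h.add({"tcp": 0.7, "udp": 0.3})
h.add({"tcp": 1.0})
print(h.prob)  # 'tcp' accumulates 1.7, 'udp' 0.3
```

Trimming to the top-β values after each insertion keeps the per-dimension state bounded regardless of how many distinct categorical values appear in the stream.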

Note that Probj,a and Fj,a are used in calculating the distance between two sets of categorical data, as described in Section 2.3.

2.3 Distance Functions

A distance function plays an important role in data clustering tasks. To deal with uncertainty in both categorical and numerical data, we propose new distance functions that take the uncertainty into account, as described below.

Cluster-point distance can be formulated as:

dist(C, X) = w · distnum(C, X) + (1 − w) · distcat(C, X),   (1)

where w weights the numerical component against the categorical one,

distnum(C, X) is the distance between the expected center of the numerical attributes of cluster C and the expected values of the numerical attributes of data point X:

distnum(C, X) = (1/n) Σj |Centerj(C) − E[Xj]| / (4 · SDj(C)),   (2)

where E[Xj] = Σl xj,l · p(xj,l), and Centerj(C) and SDj(C) can be derived from the mean and standard deviation of the expected values E[Xj] of all data points in cluster C;

distcat(C, X) is the distance derived from the categorical attributes:

distcat(C, X) = (1/c) Σj distcat(C, X, j),   (3)

distcat(C, X, j) = (1/2) Σa |P(C, j, a) − p(xn+j = Vj,a)|,   (4)

P(C, j, a) = Probj,a / W(t),   (5)

where P(C, j, a) is the cluster's probability of the a-th valid value of the j-th categorical dimension, taken as zero for values outside the top-β bins of HCj(t).

Cluster-cluster distance can be formulated as follows. distnum(C1, C2) is the distance between the expected centers of the numerical attributes in clusters C1 and C2, which can be defined as:

distnum(C1, C2) = (1/n) Σj |Centerj(C1) − Centerj(C2)|.   (6)

distcat(C1, C2) is the distance derived from the categorical attributes, which can be defined as:

distcat(C1, C2) = (1/c) Σj distcat(C1, C2, j),   (7)

distcat(C1, C2, j) = (1/2) Σa |P(C1, j, a) − P(C2, j, a)|,   (8)

P(Ck, j, a) = Probj,a(Ck) / Wk(t),   (9)

where P(Ck, j, a) is cluster Ck's probability of the a-th valid value of the j-th categorical dimension, taken as zero for values outside the top-β bins of its histogram.
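The cluster-point distance can be sketched as follows. This assumes a total-variation form for the categorical component and the weight w = n/(n+c); treat it as an illustration under those assumptions rather than the paper's exact formulation:

```python
def dist_num(center, sd, expected, scale=4.0):
    """Per-dimension center distance, normalized by scale * standard deviation."""
    n = len(center)
    return sum(abs(c - e) / (scale * s) for c, e, s in zip(center, expected, sd)) / n

def dist_cat(cluster_dists, point_dists):
    """Average total-variation distance between per-dimension distributions."""
    c = len(cluster_dists)
    total = 0.0
    for cd, pd in zip(cluster_dists, point_dists):
        values = set(cd) | set(pd)
        total += 0.5 * sum(abs(cd.get(v, 0.0) - pd.get(v, 0.0)) for v in values)
    return total / c

def cluster_point_distance(center, sd, cluster_dists, expected, point_dists):
    n, c = len(center), len(cluster_dists)
    w = n / (n + c)  # weight of the numerical vs. categorical component
    return (w * dist_num(center, sd, expected)
            + (1 - w) * dist_cat(cluster_dists, point_dists))

# One numerical dimension (center 5.0, sd 1.0) and one categorical dimension.
d = cluster_point_distance([5.0], [1.0], [{"tcp": 0.9, "udp": 0.1}],
                           [5.0], [{"tcp": 0.9, "udp": 0.1}])
print(d)  # identical point -> 0.0
```

Both components are normalized to comparable ranges (the numerical one by 4 standard deviations, the categorical one to [0, 1]), so the weighted sum is meaningful.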

2.4 Evolution-Based Stream Clustering

An evolution-based stream clustering method supports the monitoring and change detection of clustering structures. A cluster is a collection of data that has been memorized for processing in the system. Each cluster can be classified into one of three types: isolated data point, inactive cluster, or active cluster. Here we give a description of each type:

• An isolated data point is a data point that cannot be assigned to any cluster. The system keeps it for further consideration: it may form a new cluster as the density in its region increases, or its weight may fade until it disappears.
• An inactive cluster is an isolated data point or a cluster that has a low weight. It is caused either by an aggregation of isolated data points or by fading, which decreases the weight of an active cluster.
• An active cluster is a cluster that is considered part of the model of the system.

Fading Function
In order to capture the clustering characteristics of data streams, which evolve over time, we need to focus on new data more than on old data. A fading function is used to decrease the weight of old data over time. Let λ be the decay rate and t the elapsed time; the fading function is defined as:

f(t) = 2^(−λt).   (10)

The weight of a cluster reflects the number of data elements in the cluster and is determined according to the fading function. Initially, each data element has a weight of 1. A cluster can increase its weight by assembling incoming data points or by merging with other clusters.
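The fading function of Equation (10) can be sketched as follows; the half-life reading in the comment is our own interpretation of the decay rate:

```python
def fading(elapsed, decay=0.1):
    """f(t) = 2^(-lambda * t): weight of a data element after `elapsed` time units."""
    return 2.0 ** (-decay * elapsed)

# With decay = 0.1, a data element loses half its weight every 10 time units.
print(fading(0))   # -> 1.0
print(fading(10))  # -> 0.5
print(fading(20))  # -> 0.25
```

Because the function is exponential, a cluster's total weight can be faded lazily: multiplying the stored weight by f(Δt) at each update is equivalent to re-summing f(t − Ti) over all member tuples.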


Evolution of Stream Clustering Structure
To capture the characteristics of data streams, which have an evolving nature, an evolution-based stream clustering method is composed of several steps. In the beginning, incoming data points are considered isolated clusters. A cluster is then formed when a sufficiently dense region appears, and an inactive cluster is changed to an active cluster when its weight reaches a given threshold. Once a set of clusters has been identified, each incoming data point is assigned to its closest cluster based on a similarity score. To detect change in the clustering structure, the following clustering evolutions are checked and handled: appearance, disappearance, self-evolution, merge and split.

Appearance: A new cluster appears when a group of dense data points is located close together, as determined by the point-point distance and the cluster-point distance of Section 2.3, respectively.

Disappearance: Existing clusters disappear when their data points have the least recent time stamps and their weights fall below the remove_threshold.

Self-Evolution: The characteristics of each active cluster can evolve over time, because the weight of old data points fades, so that new data points can affect the characteristics of the cluster within a short time. To this end, we use the fading function of Equation (10) and the cluster-point distance to find the closest cluster of a new data point, which then changes that cluster's characteristics.

Merge: Two overlapping clusters, denoted C1 and C2, can be merged into a new cluster by considering the distance between the two clusters, i.e., the cluster-cluster distance. Since the characteristics of both overlapping clusters should survive in the merged cluster, the FCH of the merged cluster is computed from FCH(C1) and FCH(C2) as follows:

• The values of the entries FC1(t), FC2(t), W(t) and U(t) in the merged FCH are the sums of the corresponding entries in FCH(C1) and FCH(C2).
• To obtain the merged histogram of numerical data values, HN(t), we first find the minimum and maximum value in each numerical dimension of the pair. Then we divide this range into α intervals of equal length. Finally, we compute the frequency of each merged interval from the histograms of the pair.
• To compute the merged histogram of categorical data values, HC(t), we take the union of the two sets of top-β categorical values in each dimension of the pair. Then we order the union by frequency in descending order. Finally, we store only the top-β categorical values in the merged HC(t).
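The numerical-histogram merge step can be sketched as follows; reassigning each source bin's mass by its midpoint is our own approximation of the frequency recomputation, and the names are ours:

```python
def merge_histograms(h1, h2, alpha):
    """Merge two equal-width histograms into one with `alpha` bins.

    Each histogram is (lo, hi, counts), where `counts` holds equal-width
    bins over [lo, hi]. The merged range spans both inputs, and each source
    bin's count is assigned to the merged bin containing its midpoint.
    """
    lo = min(h1[0], h2[0])
    hi = max(h1[1], h2[1])
    width = (hi - lo) / alpha
    merged = [0.0] * alpha
    for h_lo, h_hi, counts in (h1, h2):
        bin_w = (h_hi - h_lo) / len(counts)
        for i, count in enumerate(counts):
            mid = h_lo + (i + 0.5) * bin_w  # midpoint of the source bin
            target = min(int((mid - lo) / width), alpha - 1)
            merged[target] += count
    return lo, hi, merged

# Two 2-bin histograms over [0, 4] and [4, 8], merged into 4 bins over [0, 8].
print(merge_histograms((0.0, 4.0, [3.0, 1.0]), (4.0, 8.0, [2.0, 2.0]), 4))
# -> (0.0, 8.0, [3.0, 1.0, 2.0, 2.0])
```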

Split: A cluster can be split into two smaller clusters when the behaviour inside it is clearly separated. All attributes are examined to find the split position. If a split position is found in a numerical or categorical attribute, the weight is recalculated based on the histogram of the splitting attribute; the cluster representations of the new clusters are then determined from the recalculated weights. The histogram is used to decide on splitting the behaviour of each cluster, and a split can be found in both numerical and categorical attributes.

• For numerical attributes, a valley that lies between two peaks of the histogram is considered as the splitting criterion. If more than one splitting valley is found, the best splitting valley is the one with the minimum value. The splitting valley must be statistically significantly lower than the lower of the two peaks. When the cluster splits, we split the histogram in that dimension, and the other dimensions are weighted based on the split dimension.
• For categorical attributes, the splitting attribute is an attribute whose value frequencies differ significantly from the others within the same cluster. The split position is the position between the pair of adjacent values whose frequencies differ the most. If the split position occurs between the first and second bars, we do not split the cluster into two smaller clusters, because the first value is then the only outstanding member in the top-β of the splitting attribute.

Fig. 1. Histogram management in the split dimension and the other dimensions for numerical attributes

Fig. 2. Histogram management in the split dimension and the other dimensions for categorical attributes

3 The Algorithm

In this section, we first discuss heterogeneous data streams with uncertainty and their evolution over time. We then present a new algorithm, called HUE-Stream, which extends the E-Stream algorithm [7].

3.1 Overview of HUE-Stream Algorithm

This section describes HUE-Stream, an extension of E-Stream [7] that supports uncertainty in heterogeneous data streams. Table 1 contains all the notations used in the HUE-Stream algorithm. Pseudocode of HUE-Stream is given in Fig. 3.

Table 1. List of notations used in HUE-Stream pseudo-code

Notation                      Definition
|FCH|                         Current number of clusters
FCHi                          i-th cluster
FCHi.W                        Weight of the i-th cluster
FCHi.sd                       Standard deviation of the i-th cluster
#isolate                      Current number of isolated data points
isolatei                      i-th isolated data point
minDistnum(FCHa, FCHb)        Nearest pair of clusters measured with numerical attributes
minDistcat(FCHa, FCHb)        Nearest pair of clusters measured with categorical attributes
FCHi.U                        Uncertainty of the i-th cluster

The HUE-Stream main algorithm is given in Fig. 3. HUE-Stream supports the monitoring and change detection of clustering structures that can evolve over time. Five types of clustering structure evolution are supported: appearance, disappearance, self-evolution, merge and split. In line 1, the algorithm starts by retrieving a new data point. In lines 2 and 3, it fades all clusters and deletes any cluster with insufficient weight. In line 4, it splits a cluster when the behaviour inside the cluster is clearly separated. In line 5, it merges pairs of clusters whose characteristics are very similar. In line 6, it checks the number of clusters and merges the closest pairs if the number of clusters exceeds the limit. In line 7, it scans the clusters in the system to flag as active those that have reached sufficient weight. In lines 8 to 13, it finds the closest cluster for the incoming data point with respect to the distance to and uncertainty of each cluster; an isolated data point is created when no cluster can contain the new data point. The flow of control then returns to the top of the algorithm to wait for a new data point.

Algorithm HUE-Stream
1  retrieve new data Xi
2  FadingAll
3  DeleteCluster
4  CheckSplit
5  MergeOverlapCluster
6  LimitMaximumCluster
7  FlagActiveCluster
8  (Uncertainty[], index[]) ← FindCandidateClosestCluster
9  if sizeof(index[]) > 0
10   index ← FindClosestCluster
11   add Xi to FCHindex
12 else
13   create new FCH from Xi
14 waiting for new data

Fig. 3. HUE-Stream algorithm


Procedure MergeOverlapCluster
1  for i ← 1 to |FCH|
2    for j ← i + 1 to |FCH|
3      overlap[i,j] ← dist(FCHi, FCHj)
4      m ← merge_threshold
5      if overlap[i,j] > m*(FCHi.sd + FCHj.sd)
6        if distcat[i,j] < m*minDistcat(FCHa, FCHb)
7          if (i, j) not in S
8            merge(FCHi, FCHj)

Procedure CheckSplit
1  for i ← 1 to |FCH|
2    for j ← 1 to number of numerical attributes
3      find valley and peak
4      if chi-square test(valley, peak) > significance
5        split using numerical attribute
6    for j ← 1 to number of categorical attributes
7      find maximum difference of bink, bink+1
8      if chi-square test(bink, bink+1) > significance
9        split using categorical attribute
10   if split using only numerical or categorical
11     split FCHi
12     S ← S ∪ {(i, |FCH|)}
13   else if numerical and categorical
14     (n1,n2) ← split using numerical
15     (c1,c2) ← split using categorical
16     if max(c1,c2) > max(n1,n2)
17       split FCHi using categorical
18     if max(c1,c2)