
A Supervised Clustering and Classification Algorithm for Mining Data With Mixed Variables

Xiangyang Li, Member, IEEE, and Nong Ye, Senior Member, IEEE

Abstract—This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms.

Index Terms—Classification, clustering, computer intrusion detection, dissimilarity measures.

Manuscript received April 27, 2004; revised July 31, 2004. This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant F49620-99-1-001. This paper was recommended by Associate Editor J. Miller.

X. Li is with the Department of Industrial and Manufacturing Systems Engineering, University of Michigan, Dearborn, MI 48128 USA (e-mail: [email protected]).

N. Ye is with the Department of Industrial Engineering, Arizona State University, Tempe, AZ 85287 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSMCA.2005.853501

I. INTRODUCTION

ADVANCED sensing and computing technologies have enabled the collection of large amounts of complex data in many fields. Data mining techniques can be used to discover useful patterns that in turn can be used for classifying new data or other purposes [1], [2]. Data mining algorithms for processing large amounts of data must be scalable. Data mining algorithms for processing data with changing patterns must be capable of incrementally learning and updating data patterns as new data become available. For data with mixed variables, including numerical, ordinal, and nominal variables, data mining algorithms must be capable of handling mixed data types. Although data mining algorithms such as decision trees exist that support the incremental learning of data with mixed data types, we are not satisfied with the scalability of these algorithms in handling large amounts of data, e.g., computer audit data for computer intrusion detection in our study [3]. Hence, we develop a scalable data mining algorithm, based

on supervised clustering with additional steps, to enhance the robustness to the presentation order of training data points and to noise in the training data. This algorithm supports incremental learning and mixed data types. Although the algorithm was initially developed for computer intrusion detection, its use is not limited to that application.

In [4] and [5], we present our data mining algorithm, the clustering and classification algorithm—supervised (CCAS), which deals with numeric variables only. This paper presents an extended version of CCAS (ECCAS) that enables the handling of mixed data types. The application of ECCAS to computer intrusion detection, using network traffic data with mixed data types, is presented in this paper. The applications of ECCAS to a medical diagnosis problem and a salary classification problem are also presented.

This paper is organized as follows. Section II reviews the steps of CCAS for handling only numeric attribute variables, in comparison with some other data mining algorithms. Section III presents ECCAS for handling mixed data types. The applications of ECCAS to computer intrusion detection, medical diagnosis, and salary classification are described in Section IV. Section V presents and discusses the testing performance of ECCAS for these applications. Section VI concludes the paper.

II. REVIEW OF CCAS

Cluster analysis uses attribute values of data points to partition data points into clusters. Well-known heuristic algorithms for clustering include hierarchical clustering algorithms and partitioning clustering algorithms, such as the K-means method [6], [7]. However, both the hierarchical and partitioning clustering algorithms require a complete matrix of pairwise distances between all data points before the clustering can proceed. This creates difficulty for incremental learning, which must update clusters when new data points become available. Moreover, neither class of algorithms is scalable to large amounts of data.

A straightforward method of incrementally clustering data points is to process data points one by one and group the next data point into the existing cluster closest to it, or create a new cluster. However, this method of incremental clustering has a problem of "local bias of input order." When a new data point is processed, the existing clusters reflect only the current data distribution and the cluster structure of the data points processed so far. Thus, different presentation orders of training data points may produce quite different cluster structures.
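To make the order-sensitivity problem concrete, the following is a minimal sketch (ours, not part of CCAS) of the straightforward one-pass clustering just described. The distance-threshold rule for opening a new cluster is an assumption for illustration, since the text above does not specify one.

```python
import numpy as np

def naive_incremental_clustering(points, new_cluster_threshold=1.0):
    """One-pass clustering: put each point into the nearest existing cluster,
    or start a new cluster.  The threshold rule is an illustrative assumption."""
    centroids, counts = [], []
    for x in points:
        x = np.asarray(x, dtype=float)
        if centroids:
            dists = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(dists))
            if dists[j] <= new_cluster_threshold:
                # incremental centroid update
                centroids[j] = (counts[j] * centroids[j] + x) / (counts[j] + 1)
                counts[j] += 1
                continue
        centroids.append(x)
        counts.append(1)
    return centroids, counts

# Two presentation orders of the same data can produce different cluster structures,
# which is the "local bias of input order" problem discussed above.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
clusters_a, _ = naive_incremental_clustering(data)
clusters_b, _ = naive_incremental_clustering(data[::-1])
print(len(clusters_a), len(clusters_b))
```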



The methods in [8]–[11] overcome this problem while enabling incremental learning. In the grid-based clustering method, the data space is partitioned into nonoverlapping regions or grids. Only the data points in the same grid cell can be clustered together. The density-based clustering method requires that a cluster contain a certain number of points within some radius of each point in the cluster. Hence, the density-based clustering method considers clusters as regions of data points with high density, separated by regions of data with low density or noise. The subspace clustering method is a bottom-up grid-based method that finds dense units in lower dimensional subspaces and merges them into dense clusters in higher dimensional subspaces.

There are other algorithms that handle large data sets with complex distributions. In [12], a hierarchical algorithm partitions and clusters a random sample of a database to produce partial clusters; these clusters are then clustered again in a second pass. For detecting outliers without clustering, the authors of [13] use distance-based algorithms to search the neighborhood of each data point and label data points without enough neighbors as outliers.

CCAS is built on several concepts in traditional and scalable cluster analysis, along with several innovative concepts that we develop. One important difference is that CCAS is a supervised clustering algorithm for classification problems. In this section, we review the steps of CCAS for handling numeric attribute variables only. Section III presents ECCAS, which extends CCAS to handle mixed data types.

A. Cluster Representation and Distance Measures

CCAS performs a supervised clustering of data points based on the distance between data points, as well as the target class of each data point. Hence, CCAS differs from traditional clustering techniques, such as hierarchical clustering and K-means clustering [5], which use only the distance between data points for clustering. We consider a data record as a data point in a data space. A data point D is a (p + 1)-dimensional vector with p attribute variables X_1, X_2, ..., X_p and one target variable Y indicating the target class of the data point. In CCAS, a cluster L is represented by the centroid of all the data points in the cluster, with coordinates X_L, the number of data points in this cluster N_L, and the class label Y_L. X_L is calculated as follows:

$$X_L = \frac{\sum_{j=1}^{N_L} X_j}{N_L} \tag{1}$$

where X_j is the coordinates of the jth point in this cluster. The distance for a data point D of only numeric attributes to a cluster L can be calculated using different distance measures. The comparison of three different distance measures is presented in [4] and [5]. In this study, we use a weighted Euclidean distance

$$d_n(D, L) = \sqrt{\sum_{i=1}^{p} (X_i - X_{Li})^2 \, r_{iY}^2} \tag{2}$$


where X_i and X_{Li} are the coordinates of the data point D and the cluster L's centroid on the ith dimension, and r_{iY} is the correlation coefficient between the attribute variable X_i and the target variable Y

$$r_{iY}^2(n) = \frac{S_{iY}^2(n)}{S_{ii}(n)\, S_{YY}(n)}. \tag{3}$$
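As a concrete illustration of (2) and (3), the sketch below computes the squared correlation weights from a numeric training matrix and uses them in the weighted Euclidean distance. The function and variable names are ours; this is not the authors' code.

```python
import numpy as np

def squared_correlation_weights(X, y):
    """r_iY^2 of (3) for each numeric attribute: S_iY^2 / (S_ii * S_YY)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    s_iy = Xc.T @ yc                 # cross-products S_iY
    s_ii = (Xc ** 2).sum(axis=0)     # sums of squares S_ii
    s_yy = (yc ** 2).sum()           # sum of squares S_YY
    return (s_iy ** 2) / (s_ii * s_yy + 1e-12)

def weighted_euclidean(u, v, weights):
    """Weighted Euclidean distance of (2)/(4): sqrt(sum_i w_i (u_i - v_i)^2)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum(weights * diff ** 2)))
```

For a binary 0/1 target, `y` is simply the vector of class labels; the multi-class case, discussed later in Section II-E, would average these weights over binary indicator targets.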

To compute the distance between two clusters, we use the same distance measure, except that the coordinates of a data point and a cluster are replaced by the centroid coordinates of the two clusters. For clusters j and k, the weighted Euclidean distance is

$$d_n(j, k) = \sqrt{\sum_{i=1}^{p} (X_{Lji} - X_{Lki})^2 \, r_{iY}^2} \tag{4}$$

where X_{Lji} and X_{Lki} are the ith coordinates of the centroids of clusters j and k.

B. Incremental Grid-Based Supervised Clustering in CCAS

The core procedure of CCAS is a grid-based supervised clustering of data points. This procedure first divides the p-dimensional space of data points into grid cells. There are many different ways to perform the division of the data space, as presented in [4] and [5]. The dimension of each attribute variable is divided into a set of intervals within the range defined by the minimum and maximum values of data points on this dimension. Then, the p-dimensional data space is separated into grid cells determined by the end points of these intervals on the p dimensions; the whole space is thus partitioned into "cubic" cells indexed by these intervals. After this step, each data point or cluster falls into one grid cell, and each cluster carries an index referring to its grid cell.

Introducing the number of grid intervals in each dimension as a parameter inevitably brings in a subjective choice, and this choice affects the performance of the algorithm. Different tasks may have different data points and distributions, and thus may require different settings of this parameter. The number of intervals can be different for each dimension, and unequal intervals could be used within one dimension. In our study and experimentation, we simplify our work by using the same number of equal grid intervals for all dimensions.

In clustering, for each data point, we search only for the nearest cluster with the same target class in the same grid cell as this data point. If there is no such cluster in this grid cell, we create a new cluster containing this data point. The incremental grid-based supervised clustering includes the following steps.

1) For a new data point D, calculate its grid index.
2) Among the existing clusters with this grid index, search for the nearest cluster L to the data point based on a distance measure d(D, L), as defined in (2).
3) If the nearest cluster L is found and L has the same target class as that of D, add D into L, and update the centroid coordinates of L and the number of data points N_L in this cluster as follows:

$$X_{Li} = \frac{N_L X_{Li} + X_i}{N_L + 1}, \quad i = 1, 2, \ldots, p, \qquad N_L = N_L + 1. \tag{5}$$

4) Otherwise, create a new cluster with this data point as the centroid, set the number of data points in the new cluster to 1, and set the target class of the new cluster to the target class of the data point.
5) Repeat 1)–4) until no data point in the data set is left.

In the above procedure, the target class is used to supervise the clustering of the data points. Each cluster is assigned a target class, and only data points that are closest in distance and have the same target class are grouped into the same cluster. Hence, this procedure is called supervised clustering. This data mining procedure differs from hierarchical clustering and K-means clustering, which use only the distance between data points when grouping them into clusters. The supervised clustering procedure is similar to decision tree algorithms in that both the attribute variables and the target variable take part in grouping or dividing data points into clusters. Therefore, this procedure can be used not only for knowledge discovery, but also for data classification, as the cluster structure with class information can be used as a classification function.

The division of the data space into grid cells and the formation of clusters within grid cells are the methods that we use to overcome the robustness problem caused by the presentation order of data points. For example, a number of consecutive data points of the same class may appear first in the data set. Without the above methods, these data points would be grouped into one cluster, even though they are not close to each other at all. Consequently, the cluster structure would not be robust to the presentation order of data points: a different presentation order of the same data points could generate a different cluster structure. With the data space divided into grid cells and clusters formed only within grid cells, we prevent these data points from joining into one single cluster. Instead, if the data points are spread out in the data space, we would see clusters of data points distributed in the data space.

C. Postprocessing of the Cluster Structure for More Robustness

CCAS includes additional steps after the incremental grid-based supervised clustering procedure, in order to further reduce the impact of data presentation order and noise on the resulting cluster structure. These postprocessing steps include data redistribution, supervised grouping of clusters, and removal of outliers.

1) Data Redistribution: In the incremental procedure of clustering, each data point interacts with the nearest cluster,

based on the current cluster structure. As more data points are processed, the cluster structure reflects more accurately the distribution of data points in the data space. However, the clusters formed at the beginning cannot reflect the impact of data points incorporated later, which causes the robustness problem with respect to the presentation order of data points. Hence, we use a data redistribution method to further reduce the impact of the presentation order of data points. Using the redistribution method, all data points in the training data set are clustered again using the existing clusters as seed clusters. The clustering procedure is the same as the incremental clustering, except that when a seed cluster is found to be the nearest cluster to a data point, the seed cluster is replaced by a new cluster with the data point as the centroid and only this data point in the cluster. In the redistribution, we allow new clusters to emerge, and thus allow the cluster structure to adjust and capture the data distribution more accurately.

We still use grid cells in redistribution, although they may not be that important if the seed-cluster structure is already a good enough representative of the underlying data distribution. We choose to use grid cells mainly because we want to ensure that the clustering and redistribution can distinguish the clusters as much as possible. We do not require changing the presentation order of data points in redistribution, although this might improve the redistribution efficiency. We want to rely solely on the redistribution, rather than on a method very specific to the data distribution; also, in practice, the data points come in a certain order, and changing the input order would impose an additional management burden. This redistribution process can be repeated many times, with the classification performance expected to improve at additional computation cost. Our experiments show that usually one round of redistribution is sufficient for improving robustness, and more rounds of iteration do not usually produce further changes in the cluster structure.

Fig. 1. An illustration of the supervised grouping algorithm in one dimension, where points of different shapes represent clusters of different classes.

2) Supervised Grouping of Clusters: It is possible that a natural large cluster may be divided into several small clusters by grid cells, if the large cluster covers the area of several grid cells. Much like a traditional hierarchical clustering algorithm, we use a supervised grouping procedure to check whether any two clusters nearest to each other have the same target class and thus can be grouped into one cluster (as illustrated in Fig. 1). In this grouping procedure, a single linkage method [6] is used, where the distance between two larger clusters is defined as the distance between their nearest "points," the original clusters in our case. The weighted Euclidean distance is used for this calculation. We apply the hierarchical clustering algorithm to the clusters in the cluster structure after the data redistribution procedure to obtain a hierarchical tree with clusters as the leaves of the tree. Starting from the leaves of the tree, we check whether


each pair of neighbor clusters has the same target class. If a pair of neighbor clusters has the same target class, they are merged into one cluster. The new cluster and its neighbor can then form a pair and are checked to see whether they have the same target class. This process of merging clusters continues from the leaves of the tree to the root of the tree until no neighbor clusters can be merged. The supervised grouping of clusters reduces not only the side effect of grid cells, but also the number of clusters in the resulting cluster structure, for better scalability.

3) Removal of Outliers: To further refine the cluster structure from the supervised grouping of clusters, we can remove data outliers by checking the number of data points in each cluster. Clusters that have very few data points may represent noise in the training data samples, and thus outliers. We can remove those clusters whose number of data points is less than or equal to a preset threshold. The threshold on the minimum number of data points that a cluster must have can be chosen based on the average number of data points in the produced clusters. However, this threshold is closely dependent on the specific training data. For example, there may be very few instances of a certain class, and this threshold value can differ for clusters with different target classes. In this study's application to intrusion detection data, we set this threshold value to 1, based on a set of experiments.

4) The Working Flow: Five phases are used in our application. In phase 1, we calculate the correlation coefficients and information entropies used in distance measurement; this phase also records the value range of each predictor variable. The incremental grid-based clustering is done on the training data points in phase 2. In phase 3, we use the output from phase 2 as the seed clusters to redistribute the training data points. In phase 4, we first apply the supervised grouping; after this, outliers become more obvious and are removed. We apply the supervised grouping again in phase 5 for a more compact and accurate cluster structure. Besides the incremental learning ability, CCAS brings flexibility in data mining management. It supports the local adjustment of the cluster structure, owing to the cluster representation and working procedures. Thus, we can implement and maintain cluster models in a parallel-computing environment. Each approach or step can function independently, linked by the input and output clusters.

D. Classification

There are two methods for classifying a new data point using the clusters labeled with target classes. If a categorical value needs to be assigned to the data point D, we assign D to the target class that is dominant among the target classes of the k nearest clusters to this data point, using the weighted Euclidean distance in this study, where k is a parameter set by the user and usually takes an odd number such as 3. If we have only two target classes that can be represented by 0 and 1, we can use a numeric value between 0 and 1 to indicate the closeness of the data point D to these target classes. This method gives us another parameter to manipulate, i.e., the threshold determining the target class. It is useful in comparing different decision thresholds, as shown in the result analysis


in the application to computer intrusion detection. The numeric target value can be obtained by computing the distance-weighted average of the target values of the k nearest clusters

$$W_j = \frac{1}{d^2(D, L_j)}, \qquad Y = \frac{\sum_{j=1}^{k} Y_{Lj} W_j}{\sum_{j=1}^{k} W_j} \tag{6}$$

where L_j is the jth nearest cluster, and W_j is the weight for the cluster L_j based on the squared distance to D; the target class values of this cluster and of D are Y_{Lj} and Y. Here, we do not take the number of data points in a cluster into consideration. For example, in intrusion detection, normal data points typically outnumber intrusive data points by a large margin in the training data. Using this number in classifying a new data point would bias the classification toward the normal class. However, inspecting the impact of cluster "mass" could be an open issue for future research.

In instance-based learning, usually all representatives are used in calculating the target value of a new data point using the above method. In our algorithm, however, classification differs across the phases. Using all the produced clusters is not appropriate when we consider the grid-based clustering method, in which grid cells limit the formation of clusters. Therefore, in phases 2 and 3, we use all the clusters in the grid cell of a data point as the nearest neighbors to calculate the target value of the data point. After the supervised grouping of clusters in phases 4 and 5, the effect of the grid cells on the cluster structure is reduced. With this global picture of the cluster structure, we apply the calculation to all the clusters in the entire data space to determine the target value of the new data point.

E. Handling of Multiple Target Classes

CCAS can handle a target variable with more than two target classes by making the following changes.

1) Transform the target variable into multiple binary target variables, with one variable for each target class. For example, a target variable Y with three target classes (i.e., class 1, class 2, and class 3) can be transformed into Y1, Y2, and Y3. If the target class of a data record is class 1, Y1 has the value of 1, Y2 has the value of 0, and Y3 has the value of 0. The correlation coefficient between an attribute variable and each of the transformed target variables can be calculated. These correlation coefficients are then averaged to obtain the final correlation coefficient between the attribute variable and the original target variable.
2) In classification, the target class of a new data point is the dominant class among the target classes of the k-nearest neighbor clusters of this data point.
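Pulling the pieces of this section together, the following is a compact sketch of the phase-2 clustering loop (steps 1)–5) with the centroid update of (5)) and of the distance-weighted classification of (6). It is an illustration under our own assumptions, not the authors' implementation: the class and variable names are invented, the redistribution, supervised grouping, and outlier-removal phases are omitted, and for simplicity the classifier searches all clusters rather than only those in the data point's grid cell.

```python
import numpy as np
from collections import defaultdict

class SupervisedGridClustering:
    """Minimal sketch of CCAS phase 2 and of the classification rule in (6)."""

    def __init__(self, n_intervals, mins, maxs, weights):
        self.n_intervals = n_intervals              # grid intervals per dimension
        self.mins = np.asarray(mins, dtype=float)
        self.maxs = np.asarray(maxs, dtype=float)
        self.w = np.asarray(weights, dtype=float)   # r_iY^2 weights, cf. (2)
        self.cells = defaultdict(list)              # grid index -> [centroid, count, label]

    def _grid_index(self, x):
        # step 1): map each coordinate onto one of n_intervals equal intervals
        span = np.maximum(self.maxs - self.mins, 1e-12)
        idx = ((x - self.mins) / span * self.n_intervals).astype(int)
        return tuple(np.clip(idx, 0, self.n_intervals - 1))

    def _dist(self, x, c):
        return float(np.sqrt(np.sum(self.w * (x - c) ** 2)))

    def add_point(self, x, y):
        x = np.asarray(x, dtype=float)
        cell = self.cells[self._grid_index(x)]
        same_class = [cl for cl in cell if cl[2] == y]
        if same_class:
            # steps 2)-3): nearest same-class cluster in the cell, update per (5)
            cl = min(same_class, key=lambda cl: self._dist(x, cl[0]))
            cl[0] = (cl[1] * cl[0] + x) / (cl[1] + 1)
            cl[1] += 1
        else:
            # step 4): open a new cluster with this point as centroid
            cell.append([x.copy(), 1, y])

    def predict(self, x, k=3):
        # distance-weighted average of the k nearest clusters' target values, cf. (6)
        x = np.asarray(x, dtype=float)
        clusters = [cl for cell in self.cells.values() for cl in cell]
        nearest = sorted(clusters, key=lambda cl: self._dist(x, cl[0]))[:k]
        wts = np.array([1.0 / (self._dist(x, cl[0]) ** 2 + 1e-12) for cl in nearest])
        labels = np.array([cl[2] for cl in nearest], dtype=float)
        return float(np.sum(wts * labels) / np.sum(wts))
```

For a binary 0/1 target, `predict` returns the numeric value in [0, 1] described above; for multiple classes, the dominant class among the k nearest clusters would be taken instead.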



III. ECCAS

When there are nominal variables, we express a data point as a tuple {X, A, Y}, where X = (X_1, X_2, ..., X_p) represents the numeric attributes, A = (A_1, A_2, ..., A_m) represents the nominal attributes, and Y is the target variable. Each nominal variable A_i, i ∈ {1, ..., m}, takes a categorical value from the domain of this nominal attribute, DOM(A_i). DOM(A_i) is nominal if it is finite and unordered, i.e., for any items a, b ∈ DOM(A_i), we have either a = b or a ≠ b.

A distance measure is used to describe the dissimilarity between two objects. Two of the basic requirements for a proper distance measure are the following.

1) When two objects have identical or similar attribute values, the distance measure has the minimum value or a small value.
2) When two objects are opposite in concept or quite different, the distance measure has the maximum value or a large value.

If there are nominal variables in a data set, we need a compatible way of measuring the dissimilarity on both nominal and numeric variables. Nominal variables are discrete and have no order for ranking; numerical operations such as addition, subtraction, multiplication, and division cannot be applied to them. There are three general methods for this purpose.

1) Nominal to numeric. A multicategorical nominal variable can be converted into multiple binary variables, using 0 or 1 to represent a categorical value being absent or present in a data point [7]. These binary variables can then be handled as numeric variables. A shortcoming of this method is that when there are many possible categorical values for a nominal variable, the method must deal with a large number of binary variables that have high dependence among them. The values of the original numeric variables should be scaled into [0, 1] to make them compatible with the new binary variables.
2) Combined distance. In [14], the K-prototype algorithm calculates the dissimilarity d_n on numeric variables using the squared Euclidean distance, and d_c on nominal attribute variables as the number of mismatches on categorical values. The overall dissimilarity measure is defined as d_n + γd_c, where γ is a weight factor that balances the two parts. This algorithm represents the mean of a nominal variable for a cluster as the categorical value occurring most frequently among the data points in the cluster.
3) Numeric to nominal. The K-modes algorithm in [15] is a simplified version of the above K-prototype algorithm. Using certain methods to convert numeric variables to nominal variables, the K-modes algorithm handles only nominal variables.

There are problems with the K-modes and K-prototype algorithms in their representation of a cluster and their distance measures. For example, if a cluster has two categorical values of one attribute variable with similar or the same occurrence frequencies, it is not appropriate to choose either of these categorical values to represent the mean of the attribute variable

for data points in this cluster. Moreover, it is difficult to choose an appropriate γ. We develop two methods, called method A and method B, to extend CCAS into ECCAS for handling data with a mixture of nominal and numeric attribute variables.

A. Method A Based on a Combination of Two Distance Measures

Similar to the combined distance discussed above, we use one distance measure for numeric variables and another for nominal variables, and then combine the two to obtain an overall distance measure. We refer to this method as ECCAS(A).

1) Distance Between Two Clusters: A nominal attribute A_i can take one among n_i categorical values. We count the frequencies of the n_i categories of this nominal attribute for a cluster j with a number of data points, and represent these frequencies as the following integer vector

$$a_i^j = \left(a_{i,1}^j, a_{i,2}^j, \ldots, a_{i,n_i}^j\right). \tag{7}$$

For another cluster k we obtain

$$a_i^k = \left(a_{i,1}^k, a_{i,2}^k, \ldots, a_{i,n_i}^k\right). \tag{8}$$

Thus, we calculate the correlation coefficient of the corresponding integer vectors for these two clusters j and k, denoted as correl(a_i^j, a_i^k). The value of the correlation coefficient falls in [−1, 1]. Since the value of 1 corresponds to the minimum distance and −1 corresponds to the maximum distance, we use (1 − correl(a_i^j, a_i^k))/2 as the distance measure to keep the value in [0, 1]. This distance measure satisfies the two requirements for a distance measure.

When combining distance values from all the attribute variables, the relative importance of each attribute variable with regard to target-value classification is taken into account by a weight factor. This weight factor is calculated as the squared correlation coefficient between the numeric attribute variable and the target variable, and falls in the range of [0, 1]. We use information entropy to determine the weight factor for nominal attributes. The information entropy for nominal attribute A_i, with regard to the target variable Y, is defined as

$$\mathrm{entropy}(A_i, Y) = -\sum_{u=1}^{n_i} p_u \sum_{v=1}^{C} p_{uv} \log_2 p_{uv} \tag{9}$$

where n_i is the number of different categorical values of A_i, C is the number of different target classes of Y, p_u is the proportion of data points in the whole data set that have categorical value u for attribute A_i, and p_{uv} is the proportion of data points with target class v for Y in the subset of data points that have categorical value u for A_i. Information entropy characterizes the impurity of information in a nominal attribute variable with regard to the target variable. Its value falls into the range of [0, 1]. It is small when


the two variables are dependent. Thus, 1 − entropy(A_i, Y) is used as the weight factor for the nominal variable A_i, to reward higher dependence. Therefore, the distance measure of two clusters j and k obtained from all the nominal attribute variables is defined as follows:

$$d_c(j, k) = \sum_{i=1}^{m} \frac{1}{2}\left(1 - \mathrm{correl}\left(a_i^j, a_i^k\right)\right)\left(1 - \mathrm{entropy}(A_i, Y)\right). \tag{10}$$
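As an illustration of the weight in (9) and the per-attribute nominal distance in (10), the sketch below computes 1 − entropy(A_i, Y) from training data and the weighted correlation-based distance between two clusters' category-frequency vectors. The names are ours, and degenerate cases (e.g., a constant frequency vector, for which the correlation is undefined) are not handled.

```python
import numpy as np

def entropy_weight(attr_values, labels):
    """1 - entropy(A_i, Y) per (9) for one nominal attribute."""
    attr_values, labels = np.asarray(attr_values), np.asarray(labels)
    n = len(attr_values)
    ent = 0.0
    for u in np.unique(attr_values):
        mask = attr_values == u
        p_u = mask.sum() / n                              # proportion with category u
        _, counts = np.unique(labels[mask], return_counts=True)
        p_uv = counts / counts.sum()                      # class proportions within category u
        ent -= p_u * np.sum(p_uv * np.log2(p_uv))
    return 1.0 - ent

def nominal_distance(freqs_j, freqs_k, entropy_weights):
    """d_c of (10): sum over nominal attributes of
    (1 - correl(a_i^j, a_i^k)) / 2 * (1 - entropy(A_i, Y))."""
    d = 0.0
    for a_j, a_k, w in zip(freqs_j, freqs_k, entropy_weights):
        corr = np.corrcoef(a_j, a_k)[0, 1]
        d += 0.5 * (1.0 - corr) * w
    return d
```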

In generating the overall distance, we do not need to find another weight factor, since we already have such a weight for each individual attribute. We do, however, scale each numeric variable so that its values fall in [0, 1], which puts the individual distance measurements into the same value range and makes a simple addition operation reasonable. The overall distance of clusters j and k obtained from all the attribute variables is then

$$d(j, k) = d_n(j, k) + d_c(j, k) \tag{11}$$

where d_n(j, k) is the distance measure obtained from all the numeric attribute variables, calculated as the weighted Euclidean distance on the numeric variables, and d_c(j, k) is the distance measure for the nominal attributes based on the calculation of the correlation coefficient and information entropy.

2) Distance of a Data Point to a Cluster: To compute the distance from a data point to a cluster, as used in the grid-based supervised clustering of data points, we consider the data point as a "virtual" cluster containing only one data point. Hence, the distance between a data point and a cluster is calculated using the same method as for computing the distance between two clusters. To compute the occurrence frequency vector of each nominal attribute variable for the "virtual" cluster with only one data point, the frequency of the categorical value appearing in that data point takes the value of 1, and the frequencies of all other categorical values take the value of 0.

3) Workflow of ECCAS(A): ECCAS(A) has the same workflow as previously described, except for the following changes.

1) The scaling of values for the numeric attribute variables is added in phase 1. The statistics of the nominal variables are also collected, and the information entropy of each nominal variable with regard to the target variable is computed.
2) Each cluster is represented not only by a centroid on the numeric attribute variables, but also by an occurrence frequency vector for each nominal attribute variable. The update of the occurrence frequency vector for a nominal attribute variable is straightforward. For clusters j and k, the occurrence frequency vector of the new cluster formed by joining these two clusters is

$$a_i^{j+k} = \left(a_{i,1}^j + a_{i,1}^k,\; a_{i,2}^j + a_{i,2}^k,\; \ldots,\; a_{i,n_i}^j + a_{i,n_i}^k\right). \tag{12}$$
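Continuing the sketch above, the helpers below show the "virtual" single-point cluster, the frequency-vector update of (12), and the overall mixed distance of (11). They reuse weighted_euclidean and nominal_distance from the earlier sketches; all names and data layouts are our own assumptions, not the authors' implementation.

```python
import numpy as np

def point_as_virtual_cluster(nominal_values, categories_per_attribute):
    """Frequency vectors for a single data point: the observed category of each
    nominal attribute has frequency 1 and all other categories have frequency 0."""
    freqs = []
    for value, categories in zip(nominal_values, categories_per_attribute):
        v = np.zeros(len(categories))
        v[categories.index(value)] = 1.0
        freqs.append(v)
    return freqs

def merge_frequency_vectors(freqs_j, freqs_k):
    """Frequency vectors of the cluster formed by joining clusters j and k, per (12)."""
    return [a_j + a_k for a_j, a_k in zip(freqs_j, freqs_k)]

def mixed_distance(num_j, num_k, freqs_j, freqs_k, num_weights, entropy_weights):
    """Overall distance of (11): d_n on the scaled numeric part plus d_c of (10)."""
    return (weighted_euclidean(num_j, num_k, num_weights)
            + nominal_distance(freqs_j, freqs_k, entropy_weights))
```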


B. Method B Based on Conversion of Nominal Variables to Binary Variables

In Method B, each categorical value of a nominal attribute is represented by a binary variable. If a data point takes this categorical value for the nominal attribute, the corresponding binary variable takes the value of 1; otherwise, the binary variable takes the value of 0. These binary variables are then handled as numeric attribute variables. We refer to the ECCAS algorithm using this method as ECCAS(B).

C. Computation Complexity of ECCAS

Let M be the number of produced clusters after each phase of ECCAS, m be the number of nominal attribute variables, p be the number of numeric attribute variables, and N be the number of data points in a data set. For phase 1, scanning and preprocessing the data points in a data set, the computation complexity is O((p + m)N). For phases 2 and 3, the upper bound on the computation complexity is O((p + m)NM), if we search the produced cluster structure sequentially. If we apply a more efficient storage structure for the produced clusters, using techniques such as the one shown in [16], the computation complexity can be reduced to O((p + m)N). For the supervised grouping of clusters in phases 4 and 5, the computation complexity has a constant upper bound that depends on the number of initial clusters. The complexity of computing the pairwise distances of clusters is O(M_I(M_I − 1)/2), where M_I is the number of initial clusters. In our implementation, the computation takes much less time: many distances are not needed and can be ignored if the associated clusters (original or aggregated) have different classes, so the inspection of pairwise distances terminates quickly as more and more distances are discarded. The computation complexity of removing outliers in phase 4 is O(M). The computation complexity of classifying a data point is O((p + m)M). Again, this computation complexity can be reduced to O(p + m) if we use a more efficient technique to store and search the cluster structure.

IV. APPLICATIONS OF ECCAS

In [4] and [5], we have shown on data sets with only numeric attribute variables that CCAS is a scalable and robust algorithm. In this study, we test ECCAS on three data sets, for computer intrusion detection, medical diagnosis, and salary classification, respectively, each with mixed data types, to evaluate the classification accuracy and reliability of ECCAS.

A. Intrusion Detection

Intrusion detection is an important part of protecting computer and network systems against cyber attacks [5]. Computer intrusion detection can be considered as a classification problem in which the data of computer and network activities are monitored and classified into one of two target classes: normal and intrusive. More classes could be used if a finer classification is needed, e.g., different attack types. Two types of system activity data are usually used for computer intrusion detection: network traffic data and computer log or audit data.



TABLE I TARGET CLASSES IN THE DATA SETS

In this study, we apply ECCAS to intrusion detection using the Knowledge Discovery and Data Mining (KDD) Cup 1999 data. This data set was used for the third International Knowledge Discovery and Data Mining Tools Competition (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). Each data record contains features extracted from network traffic connections for the 1998 Defense Advanced Research Projects Agency (DARPA) Intrusion Detection Evaluation Program (http://ideval.ll.mit.edu/). The training data of both normal and intrusive activities includes about 5 000 000 connection records from seven weeks of network traffic. The testing data includes about 300 000 connection records. A connection contains a sequence of TCP packets within a defined time period for data flows from a source IP address to a destination IP address using an application protocol. For each connection, features are extracted from the raw connection records to help distinguish normal connections from intrusive connections. Hence, for each connection there is a data record consisting of the extracted features for this connection. For example, "serror_rate" is a feature that represents the percentage of connections showing "SYN" errors. Among these features, there are 34 numeric attribute variables and seven nominal attribute variables. Each connection record in the training data has a known target value of either normal or intrusive, along with a specific attack type. Attacks fall into four main categories, as shown in Table I.

In one application of ECCAS to these data sets, each data point in the training data set has a target value of 0 or 1, where the value of 0 represents normal and the value of 1 represents intrusive. A target value in [0, 1] is assigned to each data point in the testing data set, according to the weighted average of the target values of the k nearest neighbor clusters to this data point. The larger the target value of the data point, the more likely the data point is intrusive. In another application, each data point in the training data set takes a target value among five categories: 0 for normal activities, 1 for probe attacks, 2 for denial of service attacks, 3 for attacks to gain unauthorized access from a remote host, and 4 for attacks to gain unauthorized access to local root privileges. A target value in the set {0, 1, 2, 3, 4} is assigned to each data point in the testing data set according to the dominant target class among the target classes of the k nearest neighbor clusters to this data point. We set k = 1, that is, we use only the nearest cluster to classify a new data point.

In the applications of ECCAS to intrusion detection, the number of grid intervals on each dimension is set to 3 after several experiments. In [5], we have investigated, for computer intrusion detection, the prediction accuracy of CCAS and its robustness to the presentation order of training data and the number of grid intervals. In this study, we evaluate only the prediction accuracy of ECCAS on intrusion detection data with mixed data types.

TABLE II FOUR PRESENTATION ORDERS OF TRAINING DATA FOR MEDICAL DIAGNOSIS APPLICATION

B. Medical Diagnosis

A set of medical diagnosis data is used in an empirical comparison of decision tree, statistical, and neural network classifiers [17]. The data set is used to predict whether a patient is hyperthyroid. The data have three target classes: normal, hyper, and subnormal functioning. The data have six numeric attributes and 15 binary attributes that are treated as nominal variables in this study. The training data set has 3772 records. The testing data set consists of 3428 records.

In this application, we investigate ECCAS with respect to its prediction accuracy, its robustness to the presentation order of training data, and the impact of the number of grid intervals. We run ECCAS with different numbers of grid intervals. In addition to the original presentation order, we generate three other training data sets with different presentation orders of data points. The four presentation orders are summarized in Table II. Due to the relatively small size of this medical diagnosis data set, we can easily set up these additional experiments to test the robustness to the data presentation order and the impact of grid intervals, in addition to testing the prediction accuracy.

During training, we do not perform the removal of outliers in phases 4 and 5. In this training data set, which is relatively small in size, classes 1 and 2 have far fewer representatives (93 and 191, respectively) than class 3 (3488). After the supervised grouping of clusters, there are quite a few clusters for classes 1 and 2, many of them containing only one data point. Thus, the removal of outliers can be damaging. In fact, the testing results show that ECCAS produces satisfactory prediction accuracy without the removal of outliers and phase 5. We set k = 1, that is, we use only the nearest cluster to classify a new data point.

C. Salary Classification

The salary data set is used to predict whether a person's salary is greater than or less than $50 000/year. The detailed data description can be found in [18]. This data set can be retrieved from the University of California, Irvine (UCI) KDD repository (http://kdd.ics.uci.edu/) or the Delve data repository (http://www.cs.toronto.edu/delve/). The data are extracted from the 1994 census database of the US Census Bureau. There are six numeric and eight nominal attributes about people, such as education, marital-status, occupation, race, sex, and so on. There are over 30 000 records in the training data set and over 15 000 in the testing data set.

LI AND YE: A SUPERVISED CLUSTERING AND CLASSIFICATION ALGORITHM FOR MINING DATA WITH MIXED VARIABLES

Fig. 2. ROC curves for ECCAS(A) on intrusion detection data.

Since the training data set has a large volume of data, we apply all five phases of ECCAS. We use two grid intervals for each numeric dimension. The minimum number of data points in each cluster for the removal of outliers is 1. Again we set k = 1, that is, we use only the nearest cluster to classify a new data point.

V. RESULTS AND DISCUSSION

A. Intrusion Detection

For the application of ECCAS using two target classes, we perform a receiver operating characteristic (ROC) analysis on the testing results that are obtained using the cluster structure from ECCAS (see Figs. 2 and 3). To perform the ROC analysis, we first set a signal threshold value in [0, 1]. If the assigned target value of a data record in the testing data set is greater than this signal threshold, the data record is signaled as intrusive; otherwise, the data record is considered normal. Hence, using the signal threshold, a target class is assigned to each data record in the testing data. By comparing the assigned target class with the true target class of each data record in the testing data, we know whether we have a hit, a false alarm, a miss, or a correct classification on the data record. A hit is a signal on a data record whose true target class is intrusive. A false alarm is a signal on a data record whose true target class is normal. A miss occurs on a data record whose true target class is intrusive but whose assigned target class is normal. A correct classification occurs on a data record whose true target class is normal and whose assigned target class is also normal.

We obtain a hit rate and a false alarm rate for a given signal threshold. The hit rate is the ratio of the number of signals on the truly intrusive data records to the total number of truly intrusive data records. The false alarm rate is the ratio of the number of signals on the truly normal data records to the total number of truly normal data records. A different pair of hit rate and false alarm rate is obtained for each value of the signal threshold. Thus, an ROC curve plots pairs of hit rate and false alarm rate for various values of the signal threshold in the range [0, 1]. The closer an ROC curve is to the top-left corner of the chart, which represents a 100% hit rate and a 0% false-alarm rate (see Figs. 2 and 3), the better the classification performance.

Fig. 3. ROC curves for ECCAS(B) on intrusion detection data.

Figs. 2 and 3 show the ROC curves of the testing results for ECCAS(A) and ECCAS(B), respectively. For example,

ECCAS(A) can produce a 90% hit rate when the false-alarm rate is kept near 0%.

In the application of ECCAS using five target classes, the cluster structure is used to assign the class of each data point in the testing data. A confusion matrix is used to evaluate the classification performance of ECCAS. The confusion matrix is the method used by the KDD Cup 1999 contest to evaluate the performance of a participating algorithm. A confusion matrix contains information about the true classes and the classification results assigned by an algorithm. Columns indicate the assigned target classes, rows indicate the true target classes, and the value in each cell gives the number of data records for that situation. For each row, the percentage of correctly classified data points among all the data records with the true target class of that row is calculated. For each column, the percentage of correctly classified data points among all the data points with the assigned target class of that column is calculated. The false-alarm rate and the hit rate are also computed from the confusion matrix. The KDD Cup 1999 also applies cost weights to the cells of a confusion matrix to obtain a weighted average cost of an algorithm, represented by the cost per test example, as shown in Table III [19]. A data record in the testing data set is considered a test example. The lower the average cost, the better the classification performance. The winning technique of the KDD Cup 1999 is an algorithm using decision trees. This algorithm produces an average cost of 0.2331. The false-alarm rate and the hit rate of this algorithm are 0.5% and 91.8%, respectively. The best 17 algorithms in the KDD Cup 1999 have average costs in the range from 0.2331 to 0.2684 [19].

Table III shows the confusion matrices for ECCAS(A). From the confusion matrix obtained from the testing results after phase 2, it appears that the cluster structure produces testing results with a high hit rate of 95.1%, but also a high false-alarm rate of 21.5%. However, the average cost is 0.2312, which is even better than the value of 0.2331 from the winning algorithm of the KDD Cup 1999. Thus, the average cost may not be an appropriate evaluation measure, although it is used in the KDD Cup 1999 contest to score the algorithms. Hence, in addition to the average cost, we also examine the hit rate and the false alarm rate to evaluate the classification performance of ECCAS. After five phases of ECCAS(A), we obtain a hit rate of 91.2% and a false alarm rate of 2.3%.
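A minimal sketch of the threshold sweep behind the ROC curves in Figs. 2 and 3 is given below, under our own naming; `scores` are the numeric target values in [0, 1] assigned to the testing records, and `truth` is 1 for intrusive records and 0 for normal ones.

```python
import numpy as np

def roc_curve_points(scores, truth, n_thresholds=101):
    """(false-alarm rate, hit rate) pairs for signal thresholds spanning [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=int)
    n_intrusive = max(int(np.sum(truth == 1)), 1)
    n_normal = max(int(np.sum(truth == 0)), 1)
    points = []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        signal = scores > t                       # signal as intrusive above the threshold
        hit_rate = np.sum(signal & (truth == 1)) / n_intrusive
        false_alarm_rate = np.sum(signal & (truth == 0)) / n_normal
        points.append((false_alarm_rate, hit_rate))
    return points
```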



TABLE III CONFUSION MATRICES FOR ECCAS(A) ON INTRUSION DETECTION DATA

TABLE IV CONFUSION MATRICES FOR ECCAS(B) ON INTRUSION DETECTION DATA

Table IV shows the confusion matrices for ECCAS(B). In comparison with the performance of ECCAS(A), the false-alarm rate of ECCAS(B) is lower, at a level of about 0.9%, while the hit rate is slightly lower, at a level of about 91%. Overall, it appears that ECCAS(A) and ECCAS(B) have comparable performance. ECCAS(B) appears slightly better if we want lower false-alarm rates in an intrusion detection system.

B. Medical Diagnosis

We test only ECCAS(A) on the data set for medical diagnosis, for several reasons. First of all, the performance of ECCAS(A) is comparable to that of ECCAS(B) on the data set for computer intrusion detection. Secondly, ECCAS(B) requires the processing of more data after transforming a nominal variable into binary form. Moreover, the concept of handling nominal variables in ECCAS(A) is novel and has not been tested in previous work. Lastly, ECCAS is designed to handle large data sets; the performance on the computer intrusion detection data is more convincing for comparing the two distance measures than that on a small-sized data set. We refer to ECCAS(A) as ECCAS in this section.

Fig. 4. Error rates with different grid intervals for presentation order 1 on medical diagnosis data.

Fig. 4 shows the error rates for different numbers of grid intervals when we use the training data with presentation order 1. We observe that the number of grid intervals has an impact on the prediction accuracy. Using one grid interval for each numeric dimension even gives the best overall classification error rate among the different numbers (1–6) of grid intervals; in that case, the grid is effectively not used in the incremental clustering at all. For this specific data set, presentation order 1 keeps the input data well mixed during presentation, similar to what the authors observed in [5], and using just the redistribution and supervised cluster grouping achieves the best performance. We could also conclude that using a finer grid does not necessarily improve the performance; it can even make the cluster structure worse by forcing natural clusters to be split. As the number of grid intervals increases from 1 to 6, the error rate increases from 0.060 to 0.110. Overall, for the medical diagnosis data set, the performance of ECCAS is comparable to those of the other classification algorithms reported in [17].


Fig. 5. Error rates with different grid intervals for presentation order 4 on medical diagnosis data.

On the other hand, after examining the confusion matrices for one and two grid intervals, we see that with two grid intervals, more data points of classes 1 and 2 are classified correctly in the testing data set. Compared with class 3, these two classes are poorly represented, with very few data points in the training data set. We also see, for the other presentation orders, that using a grid improves performance. Therefore, we are reluctant to say that not using a grid is better, even in this special case; instead, these results give us more insight into the effect of grids on the cluster structure.

Presentation order 4 of the training data produces the worst prediction accuracy among the four presentation orders. Fig. 5 gives the error rates versus the number of grid intervals for this presentation order. Because the training data points for this presentation order are organized in the order of classes 3, 2, and 1, data records of the same class tend to group together, especially for class 3. This situation further deteriorates because class 3 has the majority of data points in the training data set. The number of clusters produced in phase 2 for this presentation order is much smaller than that for the other presentation orders. The smallest number of grid intervals (2) produces the worst error rate, 0.186. The error rate then becomes smaller, with the best value of 0.126 for four grid intervals; after that, the error rate increases again.

For four grid intervals, we plot the error rate versus the presentation order in Fig. 6. The best error rate is 0.088, for presentation orders 2 and 3. We observe that the presentation order has an impact on the prediction accuracy of ECCAS. However, after data redistribution and the supervised grouping of clusters, the impact of the presentation order is greatly alleviated.

C. Salary Classification

We test only ECCAS(A) on the data set for salary classification, for the same reasons as discussed before. The results of other algorithms on this data set, including decision trees, naïve Bayes, and nearest neighbors (using the machine learning library in C++ (MLC++) from Silicon Graphics, Inc. (SGI) (http://www.sgi.com/tech/mlc/) with default settings), are reported in [18]. The error rates for decision trees and naïve Bayes are around 0.15. The error rate of the 1-nearest-neighbor algorithm is over 0.21, and that of the 3-nearest-neighbor algorithm is over 0.20.


Fig. 6. Error rates with different presentation orders with four grid intervals on medical diagnosis data.

Fig. 7. Error rates versus phases for salary data using two grid intervals, where stage 4.1 groups clusters, and then stage 4.2 removes outliers.

Fig. 7 shows the error rates of ECCAS on this data set. We observe that the prediction accuracy improves from phase 1 to phase 5. The error rate after phase 5 is less than 0.20, and the best error rate is 0.19, after the removal of outliers. This result is better than those of the nearest-neighbor algorithms using one and three neighbors, even though the number of representatives in the cluster structure is much smaller than the number of original data points.

VI. CONCLUSION

In this study, we extend a scalable, incremental, and supervised clustering and classification algorithm, CCAS, into ECCAS, which has the capacity of handling data with both numeric and nominal variables. Two different methods of handling mixed data types are developed. The two methods of ECCAS are tested and compared on a data set with mixed variable types for intrusion detection. Both methods produce performance comparable to that of the winning algorithm in a data mining contest on the same data set. ECCAS(A) is also tested on two other data sets, for medical diagnosis and salary-prediction applications, with performance comparable to those of other data mining algorithms applied to these data sets. The performance on different data sets shows the reliability of ECCAS. The testing results for one data set also show that the five phases of ECCAS reduce the impact of the data presentation order on the prediction accuracy. The number of grid intervals also has an impact on the prediction accuracy of ECCAS. In this study, we tested different numbers of grid



intervals empirically. The ECCAS algorithm and the distance measure could be used in common data mining applications. We are developing methods to adaptively and dynamically adjust the parameters during training, including the grid-interval configuration and the threshold controlling outlier removal.

REFERENCES

[1] V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory and Methods. New York: Wiley, 1998.
[2] T. Mitchell, Machine Learning. Boston, MA: WCB/McGraw-Hill, 1997.
[3] N. Ye, X. Li, Q. Chen, S. M. Emran, and M. Xu, "Probabilistic techniques for intrusion detection based on computer audit data," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 31, no. 4, pp. 266–274, Jul. 2001.
[4] N. Ye and X. Li, "A supervised, incremental learning algorithm for classification problems," Comput. Ind. Eng. J., vol. 43, no. 4, pp. 677–692, 2002.
[5] X. Li and N. Ye, "Grid- and dummy-cluster-based learning of normal and intrusive clusters for computer intrusion detection," Qual. Reliab. Eng. Int., vol. 18, no. 3, pp. 231–242, 2002.
[6] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[7] H. Ralambondrainy, "A conceptual version of the K-means algorithm," Pattern Recognit. Lett., vol. 16, no. 11, pp. 1147–1157, 1995.
[8] T. Zhang, "Data clustering for very large datasets plus applications," Ph.D. dissertation, Dept. Comput. Sci., Univ. Wisconsin, Madison, 1997.
[9] M. Ester, H. P. Kriegel, J. Sander, M. Wimmer, and X. Xu, "Incremental clustering for mining in a data warehousing environment," in Proc. 24th Very Large Data Bases (VLDB) Conf., New York, 1998, pp. 323–333.
[10] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A multi-resolution clustering approach for very large spatial databases," in Proc. 24th Very Large Data Bases (VLDB) Conf., New York, 1998, pp. 428–439.
[11] S. G. Harsha and A. Choudhary, "Parallel subspace clustering for very large data sets," Northwestern Univ., Evanston, IL, Rep. CPDC-TR-9906-010, 1999.
[12] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in Proc. ACM SIGMOD Int. Conf. Management Data, Seattle, WA, 1998, pp. 73–84.
[13] E. Knorr and R. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proc. 24th Very Large Data Bases (VLDB) Conf., New York, 1998, pp. 392–403.
[14] Z. Huang, "Clustering large data sets with mixed numeric and categorical values," in Proc. 1st Pacific-Asia Conf. Knowledge Discovery and Data Mining, Singapore, 1997, pp. 21–34.
[15] ——, "A fast clustering algorithm to cluster very large categorical data sets in data mining," presented at the SIGMOD Workshop Research Issues Data Mining and Knowledge Discovery, Tucson, AZ, 1997.

[16] C. Huang, Q. Bi, R. Stiles, and R. Harris, "Fast search equivalent encoding algorithms for image compression using vector quantization," IEEE Trans. Image Process., vol. 1, no. 3, pp. 413–416, Jul. 1992.
[17] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, "A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms," Mach. Learn. J., vol. 40, no. 3, pp. 203–228, 2000.
[18] R. Kohavi, "Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid," in Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 202–207.
[19] C. Elkan. (2004, Oct.). Result of the KDD'99 Classifier Learning Contest. [Online]. Available: http://www-cse.ucsd.edu/users/elkan/clresults.html

Xiangyang Li (S'00–A'01–M'03) received the B.S. degree in automatic control from Northeastern University, Shenyang, China, in 1993, the M.S. degree in systems simulation from the Chinese Academy of Aerospace Administration, Beijing, China, in 1996, and the Ph.D. degree in information security from Arizona State University, Tempe, in 2001. He is currently an Assistant Professor with the Department of Industrial and Manufacturing Systems Engineering, University of Michigan, Dearborn. He has published more than 20 papers in peer-reviewed journals and conferences. His research interests include quality and security of information systems, intelligent systems in human systems studies, knowledge discovery and management, and system modeling and simulation. Dr. Li is a Member of the Association for Computing Machinery, the Association for Information Systems, and the Chinese Association for Systems Simulation.

Nong Ye (M’92–SM’01) received the B.S. degree in computer science from Peking University, Beijing, China, the M.S. degree in computer science from the Chinese Academy of Sciences, Beijing, China, and the Ph.D. degree in industrial engineering from Purdue University, West Lafayette, IN. She is a Professor of Industrial Engineering and an Affiliated Professor of Computer Science and Engineering at Arizona State University, Tempe. Her research interests are in QoS and in dependability and security of computer and network systems. Dr. Ye is an Associate Editor of the IEEE TRANSACTIONS ON RELIABILITY and IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS. She is a Senior Member of the Institute of Industrial Engineers.