IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 2, February 2006
An Integrated Framework for Visualized and Exploratory Pattern Discovery in Mixed Data

Chung-Chian Hsu and Sheng-Hsuan Wang

Abstract—Data mining uncovers hidden, previously unknown, and potentially useful information from large amounts of data. Compared to the traditional statistical and machine learning techniques for data analysis, data mining emphasizes providing a convenient and complete environment for the analysis. In this paper, we propose an integrated framework for visualized, exploratory data clustering and pattern extraction from mixed data. We further discuss its implementation techniques: a generalized self-organizing map (GSOM) and an extended attribute-oriented induction (EAOI), which not only overcome the drawbacks of their original algorithms but also provide additional analysis capabilities. Specifically, the GSOM facilitates the direct handling of mixed data, including categorical and numeric values. The EAOI enables exploration for major values hidden in the data and, in addition, offers an alternative for processing numeric attributes instead of generalizing them. A prototype was developed for experiments with synthetic and real data sets and for comparison with the traditional approaches. The results confirmed the feasibility of the framework and the superiority of the extended techniques.

Index Terms—Attribute-oriented induction, clustering, data mining, pattern discovery, self-organizing map.
1 INTRODUCTION
Data mining is a data analysis technique which aims to discover hidden, previously unknown, and potentially useful patterns from a huge amount of data for decision support [1], [2], [3]. In contrast to the conventional tools for data analysis from statistics or machine learning [4], data mining systems are more integrated: they bring together techniques from fields such as statistics, machine learning, databases, artificial intelligence, visualization, and user interfaces [3], offering a more convenient and complete analysis environment. Clustering data and then extracting patterns from the clusters is a popular practice in real-world data analysis. Clustering groups a set of physical or abstract objects into classes of similar objects. A cluster of data objects can be treated collectively as one group for further processing, alleviating the handling of large amounts of heterogeneous data. Pattern extraction, on the other hand, especially class description, generates descriptions for characterizing the data, providing a concise and succinct summary. The integration of clustering and pattern extraction is natural and necessary for a complete and convenient data analysis environment. Among unsupervised clustering techniques, a lot of attention has been paid to the self-organizing map (SOM), which projects high-dimensional data onto low-dimensional grids, mostly two-dimensional, without losing their topological order [5]. Regarding pattern extraction techniques, attribute-oriented induction (AOI) is a popular and effective approach [6].
The authors are with the Department of Information Management, National Yunlin University of Science and Technology, Douliu, Yunlin, 640 Taiwan. E-mail: {hsucc, g9223722}@yuntech.edu.tw. Manuscript received 14 Mar. 2004; revised 3 May 2005; accepted 15 Aug. 2005; published online 19 Dec. 2005.
A successful integration relies on appropriate individual techniques. However, the traditional SOM and AOI have some drawbacks. The SOM is incapable of directly handling categorical data. Accordingly, categorical values are usually transformed to binary values before training, resulting in the loss of semantic information and possibly in reduced clustering quality. The AOI may fail to preserve major values of an attribute, leading to over generalization. Moreover, it also suffers from the problems of constructing numeric concept hierarchies subjectively and of improperly generalizing boundary values near discretization points. In this paper, we propose an integrated framework for visualized, exploratory data clustering and pattern extraction in mixed data. We discuss the techniques for implementing the proposed framework, including a generalized SOM (GSOM) and an extended AOI (EAOI). The GSOM can directly handle mixed data, numeric and categorical, and the EAOI can explore for major values and overcome the problems with discretizing numeric attributes. The integrated analysis framework works as follows: train the GSOM using preprocessed data, perform visual, exploratory data clustering on the trained map, and then extract the characteristics of individual clusters using the EAOI. In addition to the completeness and convenience of the integrated framework, the adopted techniques offer several attractive properties. First, the framework is visualized and interactive, making it an excellent tool for exploratory data analysis. Users can visually and interactively explore for better clusterings if the data are not obviously separated on the trained map. Second, the visualized GSOM alleviates the difficult problem of determining the cluster number that is commonly encountered in other popular clustering techniques, such as k-means [7], ART [8], and hierarchical clustering [7]. Moreover, the GSOM and EAOI generalize their original methods. As a result, not only can the
extended methods analyze data as their traditional counterparts do, but they also offer additional capabilities for analyzing the same data from different perspectives. This is again an important feature for exploratory data analysis.
1.1 Contributions of This Paper

The integration of clustering and pattern extraction is natural toward a complete data analysis environment. We propose two data mining techniques for realizing the integration. These two techniques significantly improve their traditional counterparts, making the proposed system more practical as an exploratory data analysis tool. The GSOM is able to directly handle categorical data, so the system not only faithfully reflects the topological structure of mixed data but also becomes more user-friendly (no transformation of categorical values is required). The EAOI offers the additional capability of preserving major values in the data and an alternative for processing numeric attributes (which overcomes the problems with discretizing numeric values), enhancing the capability to explore the data. The remainder of this paper is organized as follows: Section 2 briefly reviews the SOM and AOI and discusses their shortcomings. Section 3 presents the generalized SOM and the extended AOI. Section 4 proposes the integrated framework. Section 5 gives the experimental results on one synthetic and two real data sets. Finally, Section 6 provides concluding remarks.
2 SELF-ORGANIZING MAP AND ATTRIBUTE-ORIENTED INDUCTION
2.1 Self-Organizing Map

The self-organizing map (SOM) is an unsupervised neural network which projects high-dimensional data onto a low-dimensional grid, usually two-dimensional, and preserves the topological relationships of the original data [5]. In other words, similar data tend to gather together on a trained map. Training an SOM essentially involves two steps: an identifying step and an adjusting step. In the identifying step, each training pattern is compared with all the units of the map to identify the best matching unit (BMU), the unit most similar to the training pattern. Then, in the adjusting step, the BMU and its neighbors are updated to resemble the training pattern. These two steps are repeated for each pattern in the training data set until the map converges. The identifying and adjusting processes can be expressed by the following formulas:

$\|x - m_b\| = \min_c \{\|x - m_c\|\}$,    (1)

$m_c(t+1) = m_c(t) + \alpha(t) h_{bc}(t) [x(t) - m_c(t)]$,    (2)
where x is an input vector, m_c and m_b denote the model vectors of unit c and of the BMU, respectively, and α(t) and h_bc(t) are the learning rate and the neighborhood function, both decreasing gradually as the training step t increases. Since it was originally proposed, the SOM has had many applications. In its early stages, the applications mainly focused on engineering [9], including texture recognition [10], process monitoring and control [11], [12], speech recognition [13], and flaw detection in machinery [14]. Recently, applications to data mining and other areas have emerged, including document maps [15], medical diagnosis [16], [17], as well as financial forecasting and management [18], [19], [20]. Due to its capability of topology preservation, the SOM is an excellent tool in the exploratory phase of data mining and has recently been integrated with conventional clustering algorithms to aid in cluster analysis [21], [22], [23]. It has been found that the integrated approach reduces computation time and performs well in comparison with other direct clustering approaches [21].

Fig. 1. The conventional approach transforms the categorical attribute Favorite_Drink to three binary attributes with domain {0, 1} prior to training an SOM.
2.2 Problem with the SOM

The conventional SOM cannot directly handle categorical attributes. In (1), identifying the BMU of a training pattern typically resorts to computing the Euclidean distance and is thus only suitable for numeric data. For categorical data, binary transformation, which converts each categorical attribute to a set of binary attributes (e.g., Fig. 1), is usually performed before the training. However, the binary transformation approach has at least four disadvantages:

1. Similarity information among categorical values is not conveyed; for example, being a carbonated drink, Coke is more similar to Pepsi than to Coffee.
2. When the domain of a categorical attribute is large, the transformation increases the dimensionality of the transformed relation, wasting storage space and increasing the training time.
3. Maintenance is difficult; when the attribute domain changes, the new relation schema needs to change as well. For instance, if "Juice" is added to the Favorite_Drink attribute, an additional attribute, Juice, needs to be inserted into the numeric relation.
4. The names of the binary attributes fail to preserve the semantics of the original categorical attribute; for example, the three new attributes do not reflect the meaning of "Favorite Drink."

Another common approach for handling categorical values in clustering algorithms is simple matching, in which a comparison of two identical values results in a difference of 0, whereas two distinct values result in a difference of 1 [24], [25], [26]. However, this approach does not take the similarity between categorical values into consideration and, hence, may fail to faithfully disclose the structure of mixed data.
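To make the first and last points concrete, the following minimal sketch (Python, with a hypothetical Favorite_Drink domain of Coke, Pepsi, and Coffee as in Fig. 1) computes the distances produced by one-hot binary transformation and by simple matching; both assign the same distance to every pair of distinct values, so the intuition that Coke is closer to Pepsi than to Coffee is lost.

import numpy as np

# One-hot (binary) encoding of the hypothetical Favorite_Drink attribute.
onehot = {"Coke":   np.array([1, 0, 0]),
          "Pepsi":  np.array([0, 1, 0]),
          "Coffee": np.array([0, 0, 1])}

def euclidean(a, b):
    # Distance after binary transformation, as used when training a conventional SOM.
    return float(np.linalg.norm(onehot[a] - onehot[b]))

def simple_matching(a, b):
    # Simple matching: 0 for identical values, 1 for distinct values.
    return 0.0 if a == b else 1.0

for pair in [("Coke", "Pepsi"), ("Coke", "Coffee"), ("Pepsi", "Coffee")]:
    print(pair, round(euclidean(*pair), 3), simple_matching(*pair))
# Every distinct pair gets the same distance (sqrt(2) = 1.414 and 1.0, respectively).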
2.3 Attribute-Oriented Induction

Attribute-oriented induction extracts patterns from a large amount of data and produces a set of concise rules that represent the general patterns hidden in the data.
Fig. 2. (a) A relation and (b) the generalized relation produced by the attribute-oriented induction algorithm.
The induction method mainly includes two steps, attribute removal and attribute generalization [6]. Each attribute participating in the generalization process is associated with a concept hierarchy. An attribute without a proper concept hierarchy, or one that cannot be generalized further, is removed. In the generalization step, for each remaining attribute, the original attribute values, which are more specific, are replaced by values closer to the root of its concept hierarchy, which are more general.
2.4 Problems with Handling Major Values and Numeric Attributes

The traditional AOI is incapable of revealing major values and suffers from problems in discretizing numeric attributes. Major values, which take up a major portion of an attribute, represent important information worth exploring during data analysis. The AOI may fail to disclose major values due to over generalization. The generalization in AOI does not take value distribution into account and depends solely on the number of distinct attribute values. Specifically, as long as the number of distinct values exceeds the specified generalization threshold, generalization takes place. Consequently, all the attribute values, probably including major values, are replaced by higher-level concepts no matter how the values are distributed in the attribute. Misconceptions can, therefore, occur due to the loss of distribution information after generalization. For example, in the relation of Fig. 2a, there is only one tuple for each of the Taoyuan, Hsinchu, and Miaoli cities, but 92 tuples for Taipei. In other words, Taipei is a major value of the attribute. Assume that the City concept hierarchy gives the general-specific relationships {Taipei, Taoyuan, Hsinchu, Miaoli} → North_Taiwan and {Taichung, Yunlin} → Central_Taiwan. When generalizing the City attribute to higher-level concepts, the AOI substitutes the city values with the values North_Taiwan and Central_Taiwan, as shown in Fig. 2b. Consequently, the first tuple in the generalized relation may be taken as a common characteristic of every city in northern Taiwan, including Taoyuan, Hsinchu, and Miaoli, if a reader is unaware of how the city values were actually distributed before the generalization. In fact, the generalized pattern mainly represents the characteristics of Taipei because northern Taiwan other than Taipei takes up only a small portion, i.e., 3/95.
Fig. 3. (a) A distance hierarchy with each weight set to 1. (b) Two-level distance hierarchy for the simple matching approach. (c) Degenerated distance hierarchy with w = (max - min) for a numeric attribute.
Regarding the construction of concept hierarchies for numeric attributes, there are two problems: 1) the subjectivity of the construction and 2) the generalization of boundary values. As for subjective construction, some numeric concept hierarchies have no objective construction rules and thus usually depend on personal viewpoints. For example, in a Salary concept hierarchy, some would define a medium income as ranging from 40K to 80K, while others might think that the range from 50K to 100K is more appropriate. As for the generalization of boundary values, two values with a small difference may be generalized to very different concepts. For example, assume that a medium income is defined as being in the range of 50K to 100K. Then, 50K is generalized to a medium income, whereas 49.9K is generalized to a low income, although they differ by only 100. Several approaches to constructing numeric concept hierarchies have been proposed. A simple one is the binning method, which partitions numbers into equal ranges or equal frequencies [3]. The method does not account for value distribution and might result in unnatural partitions. Other solutions include histogram analysis, numeric clustering and the 3-4-5 rule [3], recursive binary discretization [27], minimum description length [28], entropy-based discretization [29], and the chi-square test [30], [31]. These methods in essence analyze value distribution to locate better cutting points for appropriate partitioning, which helps reduce the degree of subjectivity. However, they still cannot solve the problem of generalizing boundary values.
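As a tiny, hypothetical illustration of the boundary-value problem (assuming a medium income is defined as 50K to 100K), the following sketch discretizes two salaries that differ by only 100 into different concepts:

def salary_concept(v):
    # Hypothetical discretization: low below 50K, medium from 50K to 100K, high above 100K.
    if v < 50_000:
        return "low"
    return "medium" if v <= 100_000 else "high"

print(salary_concept(49_900), salary_concept(50_000))   # low medium, despite a gap of only 100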
3 GENERALIZING SOM AND EXTENDING AOI
The SOM fails to properly handle categorical attributes, and the AOI fails to disclose major values as well as to generalize boundary values properly. In this section, we present solutions, a generalized SOM and an extended AOI.
3.1 The Generalized SOM

3.1.1 Distance Hierarchy

To alleviate the drawbacks resulting from binary transformation, we propose the distance hierarchy, a concept hierarchy extended with weights (cf. Fig. 3), as a mechanism to represent and measure the distance between categorical values. A concept hierarchy is composed of concept nodes and links, where higher-level nodes represent more general concepts [3], [32]. We extend the structure with weights: each link has a weight
representing a distance. The distance between two concepts at leaf nodes is defined as the total weight along the path between those two nodes. Weights are assigned by domain experts. For simplicity, unless stated explicitly, each link weight is set to 1 in this article. A point X in a distance hierarchy consists of two parts, an anchor and a real-value offset, denoted X = (N_X, d_X), where the anchor is a leaf node and the offset represents the distance from the root of the hierarchy to X. The functions anchor and offset return these two elements, respectively, i.e., anchor(X) = N_X and offset(X) = d_X. For example, in Fig. 3a, Y = (Pepsi, 0.3) indicates that Y is on the path from the root Any to Pepsi and 0.3 away from the root; moreover, anchor(Y) = Pepsi and offset(Y) = 0.3. A point Y is an ancestor of X if Y is on the path from X to the root. The least common ancestor of two points X and Y, denoted LCA(X, Y), is defined as the deepest tree node that is an ancestor of both X and Y. In Fig. 3a, the tree node Drink is the least common ancestor of X and Z, i.e., LCA(X, Z) = Drink. The least common point of two points X and Y, denoted LCP(X, Y), is defined by one of three cases: 1) either X or Y if they are at the same position (i.e., equivalent), 2) Y if Y is an ancestor of X, or otherwise 3) LCA(X, Y). In Fig. 3a, LCP(M, Y) is either M or Y since they are equivalent, LCP(X, Y) = LCP(Y, Z) = Y, and LCP(X, Z) = Drink. The distance between two points in a distance hierarchy is the total weight between them. Let X = (N_X, d_X) and Y = (N_Y, d_Y) be the two points. The distance between X and Y is defined as

$|X - Y| = d_X + d_Y - 2 d_{LCP(X,Y)}$,    (3)

where d_LCP(X,Y) is the distance between the root and the least common point of X and Y. A numeric distance hierarchy is a degenerated hierarchy which consists of only two nodes, a root MIN and a leaf MAX, and one weight, as shown in Fig. 3c. The distance hierarchy is a general model which can express several popular methods of distance computation, including the Hamming distance, the subtraction method, and the binary transformation. The Hamming distance, a distance measure for categorical data, counts the number of attributes on which two patterns have different values; in other words, two distinct values give a difference of 1 while two identical values give a difference of 0. The Hamming distance can be modeled by a two-level distance hierarchy with each weight set to 0.5, as shown in Fig. 3b, in which two distinct values have a distance of 1. The Hamming distance is also referred to as the simple matching approach. The subtraction method is the standard approach for numeric data, which computes the distance by subtracting numeric values. It can be modeled as follows: associate each numeric attribute with a numeric distance hierarchy and set the only weight to the attribute's range, i.e., max - min. As a result, the attribute's minimum is mapped to the root MIN and the maximum to the leaf MAX. Subtracting two values can be considered as computing the distance between their two mapping points in the hierarchy.
The binary transformation approach can be modeled by associating each newly created binary attribute with a binary numeric distance hierarchy with the weight set to 1, the min to 0, and the max to 1.
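The following sketch (a minimal Python rendering, not the authors' implementation) encodes a distance hierarchy as parent links with link weights, represents a point as an (anchor, offset) pair, and computes the point distance of (3); the drink hierarchy and its node names are hypothetical stand-ins for Fig. 3a, with every link weight set to 1.

class DistanceHierarchy:
    """Concept hierarchy with weighted links; a point is an (anchor_leaf, offset_from_root) pair."""

    def __init__(self, parent, weight):
        self.parent = parent    # node -> parent node; the root ("Any") has no entry
        self.weight = weight    # node -> weight of the link to its parent

    def depth(self, node):
        # Total weight on the path from the root down to `node`.
        d = 0.0
        while node in self.parent:
            d += self.weight[node]
            node = self.parent[node]
        return d

    def root_path(self, node):
        path = [node]
        while path[-1] in self.parent:
            path.append(self.parent[path[-1]])
        return path

    def lcp_depth(self, X, Y):
        # Depth of the least common point: one of the two points if it lies on the
        # other's root path, otherwise the least common ancestor of the two anchors.
        (nx, dx), (ny, dy) = X, Y
        ancestors_x = set(self.root_path(nx))
        lca = next(n for n in self.root_path(ny) if n in ancestors_x)
        d_lca = self.depth(lca)
        if nx == ny or min(dx, dy) <= d_lca:
            return min(dx, dy)
        return d_lca

    def distance(self, X, Y):
        # Eq. (3): |X - Y| = d_X + d_Y - 2 * d_LCP(X,Y)
        return X[1] + Y[1] - 2.0 * self.lcp_depth(X, Y)


# Hypothetical drink hierarchy standing in for Fig. 3a; every link weight is 1.
parent = {"Carbonated": "Any", "NonCarbonated": "Any",
          "Coke": "Carbonated", "Pepsi": "Carbonated", "Coffee": "NonCarbonated"}
weight = {node: 1.0 for node in parent}
dh = DistanceHierarchy(parent, weight)

print(dh.distance(("Coke", 2.0), ("Pepsi", 2.0)))    # 2.0: the leaves share the Carbonated subtree
print(dh.distance(("Coke", 2.0), ("Coffee", 2.0)))   # 4.0: the leaves only share the root
print(dh.distance(("Coke", 2.0), ("Pepsi", 0.3)))    # 1.7: (Pepsi, 0.3) is an ancestor of (Coke, 2)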
3.1.2 The Generalized SOM

Given an n-dimensional data set, a generalized SOM model includes a generalized SOM and a set of n distance hierarchies. The generalized SOM consists of a set of n-dimensional map units. Each component of a map unit m is extended to a two-part component (details are given below). An attribute x_i of a training pattern x corresponds to the component m_i of m, and both are associated with the distance hierarchy dh_i. Training a GSOM is the same as training a traditional SOM; however, the distance calculation between x and m, as well as the adjustment of map units, needs to adapt to the generalized model. The following two sections describe the adaptations.

3.1.3 Distance between a Pattern and a Map Unit

The distance between a pattern and a unit is measured by mapping them to their distance hierarchies and then aggregating the distances of the paired mapping points. For an n-dimensional training data set, each attribute is associated with a distance hierarchy, and each attribute value can be mapped to a point in its distance hierarchy. Formally, let x_i be the ith attribute with the domain Dom(x_i) and dh_i be the associated distance hierarchy with the leaf set Leaf(dh_i). x_i can be categorical or numeric. For the categorical case, Dom(x_i) is equivalent to Leaf(dh_i) and each value of Dom(x_i) corresponds to a leaf of dh_i. For simplicity, we assume that a domain value P corresponds to a leaf with the same label P. For a training pattern with x_i = P, x_i is therefore mapped to a point X = (P, d_P) in dh_i, which is positioned at the leaf labeled P, and X's offset d_P is the distance from P to the root; this is denoted dh_i(x_i) = X = (P, d_P). For the numeric case, x_i is associated with a numeric distance hierarchy dh_i (cf. Fig. 3c) with the only weight set to the domain range of x_i. For a pattern with x_i = v, x_i is mapped to the point X = (MAX, v - min), where min is the minimum of Dom(x_i) and v - min is the distance from X to the root MIN. For example, assume a two-dimensional pattern x = (Coke, 9), Dom(x_2) = [5, 20], and the distance hierarchies dh_1 and dh_2 shown in Fig. 3a and Fig. 3c. Then x_1 = Coke is mapped to X = (Coke, 2) in dh_1, and x_2 = 9 is mapped to X = (MAX, 4) in dh_2. For a GSOM associated with the training data set, each component of a map unit is associated with the same distance hierarchy as its corresponding attribute of the data. Similarly, each component value of a unit can be mapped to a point in the hierarchy. Formally, assume a unit m consists of n components, m = [m_1, m_2, ..., m_n]. A unit component m_i corresponds to the attribute x_i and is associated with the hierarchy dh_i. Each m_i, which can be categorical or numeric, is composed of two parts: (N, d). For the categorical case, the N part of m_i shares the same domain as that of x_i, i.e., N ∈ Dom(x_i), and d is a real value. A component is mapped to the point with the same value in its hierarchy.
Fig. 4. The adjustment of a GSOM mapping point M toward a data mapping point X.
That is, m_i = (N, d) is mapped to a point M with the value (N, d), denoted dh_i(m_i) = M = (N, d), indicating that the anchor of the mapping point M is N and the offset from the root is d. For the numeric case, a numeric m_i is associated with the numeric distance hierarchy dh_i (cf. Fig. 3c). Therefore, m_i always has a value of the form (MAX, d) and is mapped to a point M = (MAX, d - min) in dh_i, where M has the anchor MAX and is away from the root by (d - min). For example, suppose m = [(Coke, 0.3), (MAX, 11)] and the hierarchies dh_1 and dh_2 are as shown in Fig. 3a and Fig. 3c. Then m_1 = (Coke, 0.3) is mapped to the point M = (Coke, 0.3) in dh_1, and m_2 is mapped to the point M = (MAX, 6) in dh_2. The distance between a training pattern and a GSOM unit, therefore, is measured by mapping each component to its hierarchy and aggregating the distances of the paired mapping points in the individual hierarchies. Specifically, suppose x, m, and dh represent a training pattern, a map unit, and a set of distance hierarchies, respectively. Then, the distance between x and m is defined as

$d(x, m) = \left( \sum_{i=1}^{n} |dh_i(x_i) - dh_i(m_i)|^2 \right)^{1/2} = \left( \sum_{i=1}^{n} |X_i - M_i|^2 \right)^{1/2}$,    (4)

where X_i and M_i are the mapping points of x_i and m_i, respectively, in dh_i. For example, the differences between the paired mapping points of x and m above are |(Coke, 2) - (Coke, 0.3)| = 1.7 and |(MAX, 4) - (MAX, 6)| = 2, respectively, making the distance between x and m (1.7^2 + 2^2)^{1/2} = 2.62.
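Building on the same representation, the next sketch evaluates the mixed-attribute distance of (4) for the worked example x = (Coke, 9) and m = [(Coke, 0.3), (MAX, 11)]; the hierarchy encoding and the leaf offset of 2 are assumptions tied to the hypothetical weight-1 drink hierarchy used above, and the numeric term uses the fact that on a degenerate hierarchy the point distance reduces to plain subtraction.

import math

# Hypothetical weight-1 drink hierarchy standing in for Fig. 3a (leaves sit at offset 2).
PARENT = {"Carbonated": "Any", "NonCarbonated": "Any",
          "Coke": "Carbonated", "Pepsi": "Carbonated", "Coffee": "NonCarbonated"}

def root_path(node):
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def point_distance(X, M):
    # Eq. (3) for two points (anchor, offset), assuming every link weight is 1.
    (nx, dx), (nm, dm) = X, M
    lca = next(n for n in root_path(nm) if n in set(root_path(nx)))
    d_lca = float(len(root_path(lca)) - 1)                 # depth of the anchors' LCA
    d_lcp = min(dx, dm) if (nx == nm or min(dx, dm) <= d_lca) else d_lca
    return dx + dm - 2.0 * d_lcp

def gsom_distance(x, m, kinds):
    # Eq. (4): aggregate the per-attribute distances of the paired mapping points.
    total = 0.0
    for xi, mi, kind in zip(x, m, kinds):
        if kind == "cat":            # xi is a leaf value, mi is a point (anchor, offset)
            total += point_distance((xi, 2.0), mi) ** 2
        else:                        # numeric: (MAX, v - min) vs (MAX, d - min); the min
            total += (xi - mi) ** 2  # cancels, so the distance reduces to |v - d|
    return math.sqrt(total)

# x = (Coke, 9), m = [(Coke, 0.3), 11], Dom(x2) = [5, 20]  ->  (1.7^2 + 2^2)^(1/2) = 2.62
print(round(gsom_distance(("Coke", 9.0), [("Coke", 0.3), 11.0], ["cat", "num"]), 2))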
3.1.4 Adaptation of a Unit Component

During the adjusting phase, the attribute x_i of a training pattern is the adjusting aim of the component m_i of a unit. In terms of points in their distance hierarchy, the mapping points X and M of x_i and m_i form an adjustment path, and the point M moves toward X along this path. Let X = (P, d_X), M = (Q, d_M), let δ be the adjusting amount, and let N_LCA be the least common ancestor of the anchors P and Q (cf. Fig. 4). For the categorical case, recall that a categorical value is mapped to the leaf that is labeled by the same value; therefore, X is always at a leaf. Depending on the positions of M and N_LCA, there are four adjustment cases. Fig. 4a refers to the first two cases, whereas Fig. 4b refers to the other two. In this illustration, the N_LCA of P and Q is U.

Case 1. If M is an ancestor of N_LCA and does not cross N_LCA after the downward adjustment, then the new M is (Q, d_M + δ).

Case 2. If M is an ancestor of N_LCA and crosses N_LCA after the downward adjustment, then the new M is (P, d_M + δ), as indicated by the new M' in Fig. 4a.

Case 3. If N_LCA is an ancestor of M and M does not cross N_LCA after the upward adjustment, then the new M is (Q, d_M - δ).

Case 4. If N_LCA is an ancestor of M and M crosses N_LCA after the adjustment, then the new M is (P, 2d_{N_LCA} - d_M + δ), as indicated by the new M' in Fig. 4b.

Note that as long as M crosses N_LCA in the adjustment, the new M's anchor changes to be the same as that of X. For a numeric component, the adjusting process is simpler due to its degenerated hierarchy (cf. Fig. 4c). Let X = (MAX, d_X), M = (MAX, d_M), and let δ be the adjusting amount. If d_M > d_X, the new M is (MAX, d_M - δ); otherwise, it is (MAX, d_M + δ). Note that the new model extends the traditional numeric SOM by providing the additional capability of directly handling categorical data. To use a GSOM in the traditional approach for mixed data, i.e., the binary transformation approach, we only need to transform the categorical attributes and then associate those attributes and unit components with numeric distance hierarchies. In other words, the GSOM generalizes the SOM and offers the capability of exploring the data from different perspectives, which is a valuable feature for a tool in exploratory data mining.
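A compact sketch of the four categorical adjustment cases and the numeric case is given below; it assumes weight-1 links so that ancestry can be decided by comparing offsets against the depth of N_LCA, and the example anchors and the adjusting amount δ (delta) are hypothetical.

def adjust_categorical(M, X, delta, lca_depth):
    # Move the unit point M = (Q, d_M) toward the data point X = (P, d_X) by the
    # adjusting amount delta; lca_depth is the depth of N_LCA, the least common
    # ancestor of the anchors P and Q (ancestry is decided by comparing offsets).
    (P, d_X), (Q, d_M) = X, M
    if d_M <= lca_depth:                          # M is an ancestor of N_LCA: move downward
        if d_M + delta <= lca_depth:
            return (Q, d_M + delta)               # Case 1: does not cross N_LCA
        return (P, d_M + delta)                   # Case 2: crosses N_LCA, anchor becomes P
    if d_M - delta >= lca_depth:                  # N_LCA is an ancestor of M: move upward
        return (Q, d_M - delta)                   # Case 3: does not cross N_LCA
    return (P, 2.0 * lca_depth - d_M + delta)     # Case 4: crosses N_LCA, then heads toward P

def adjust_numeric(M, X, delta):
    # Degenerate hierarchy: simply move the offset toward the data value.
    (_, d_X), (_, d_M) = X, M
    return ("MAX", d_M - delta if d_M > d_X else d_M + delta)

# Hypothetical example: the unit anchor Pepsi and the data anchor Coke share the LCA
# Carbonated at depth 1 in a weight-1 hierarchy; the data point is the leaf (Coke, 2).
print(adjust_categorical(("Pepsi", 1.6), ("Coke", 2.0), 0.3, lca_depth=1.0))   # Case 3 -> ('Pepsi', 1.3)
print(adjust_categorical(("Pepsi", 1.2), ("Coke", 2.0), 0.5, lca_depth=1.0))   # Case 4 -> ('Coke', 1.3)
print(adjust_numeric(("MAX", 6.0), ("MAX", 4.0), 0.5))                         # ('MAX', 5.5)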
3.2 Extended Attribute-Oriented Induction

To overcome the problems of major values and numeric attributes, we recently proposed an extension [33] to the conventional AOI. The extension offers the capability of exploring for major values and an alternative for processing numeric attributes. For the exploration of major values, we introduced a majority threshold parameter. If some values (i.e., major values) take up a major portion of an attribute, exceeding the majority threshold, the extended AOI (EAOI) preserves those major values and generalizes the other, nonmajor values. If no major values exist in an attribute, the EAOI proceeds like the AOI, generating the same results as the conventional approach. Furthermore, if the majority threshold is set to 1, the EAOI degenerates to the AOI. To address the problems of constructing numeric concept hierarchies subjectively and of generalizing boundary values, we proposed an alternative for processing numeric attributes: users can choose to compute the average and deviation of the aggregated numeric values instead of generalizing those values to discrete concepts. Under this alternative, only categorical attributes are generalized. The average and deviation of the numeric attributes of the merged tuples are calculated and then replace the original numeric values. The computed deviation reveals the dispersion of the numeric values: the smaller the deviation, the more concentrated the values are; otherwise, the more diversified they are.
Fig. 5. (a) A working relation. (b) Results of the EAOI with the generalization threshold set to 3 and the majority threshold set to 0.8. (c) Results of the AOI with the generalization threshold set to 3, which are the same as those of the EAOI with the generalization threshold set to 3, the majority threshold set to 1, and generalizing numeric attributes chosen.
The EAOI algorithm [33] is outlined as follows:

Algorithm: An extended attribute-oriented induction algorithm for major values and alternative processing of numeric attributes.
Input: A relation W with an attribute set A, a set of concept hierarchies, a generalization threshold, and a majority threshold.
Output: A generalized relation P.
Method:
1. Determine whether to generalize numeric attributes.
2. For each attribute A_i to be generalized in W,
   2.1 Determine whether A_i should be removed, and if not, determine its minimum desired generalization level L_i in its concept hierarchy.
   2.2 Construct its major-value set M_i according to the generalization and majority thresholds.
   2.3 For v ∈ Dom(A_i), if v ∉ M_i, construct the mapping pair as (v, v_{L_i} - M_{L_i}); otherwise, as (v, v).
3. Derive the generalized relation P by replacing each value v by its mapping value and computing other aggregate values.

In Step 1, if numeric attributes are not to be generalized, their averages and deviations will be computed in Step 3. Step 2 prepares the mapping pairs of attribute values for generalization. First, in Step 2.1, an attribute is removed either because there is no concept hierarchy defined for it or because its higher-level concepts are expressed in terms of other attributes [3]. In Step 2.2, the attribute's major-value set M_i is constructed; it consists of the count-leading values, fewer in number than the generalization threshold, if they take up a portion of the attribute at least equal to the majority threshold, where the generalization threshold sets the maximum number of distinct values allowed in the generalized attribute. In Step 2.3, if v is one of the major values, its mapping value remains the same, i.e., major values will not be generalized to higher-level concepts. Otherwise, v will be generalized to its concept at level L_i, excluding the values contained in both the major-value set and the leaf set of the v_{L_i} subtree (i.e., v_{L_i} - M_{L_i}, where M_{L_i} = Leaf(v_{L_i}) ∩ M_i). Note that, if there are no major values in A_i, M_i and M_{L_i} will be empty and the EAOI will behave like the AOI. In Step 3, aggregate values are computed, including the accumulated count of merged tuples, which have identical values after the generalization, and the averages and deviations of the numeric attributes of the merged tuples if numeric attributes are not to be generalized.
For the previous example in Fig. 2, we set the generalization threshold to 3 and the majority threshold to 0.8, and the computation of mean and deviation for numeric attributes is chosen. For the City attribute, the count-leading value Taipei satisfies the majority condition (92/100 ≥ 0.8) and, therefore, the major-value set M_City is {Taipei}. During the generalization, the major value Taipei remains unchanged, and the other values are generalized to their higher-level concepts North_Taiwan-{Taipei} and Central_Taiwan, respectively. Note that, if the majority threshold is set to 1, it is not possible to satisfy the majority condition with fewer than 3 (i.e., the generalization threshold) leading values. The major-value set will therefore be empty, and all the attribute values will be generalized to their higher-level concepts. In other words, if generalizing numeric attributes is chosen, the generalization process proceeds the same as the conventional AOI and generates the same results (Fig. 5c) as the AOI.
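The following sketch illustrates Steps 2.2 and 2.3 on a toy version of the City example of Fig. 2 (92 Taipei tuples plus a handful of other cities whose exact counts are made up here); major_value_set takes the smallest prefix of count-leading values, fewer than the generalization threshold, that reaches the majority threshold, and generalize keeps major values while mapping the rest to their higher-level concepts with the major values excluded.

from collections import Counter

def major_value_set(values, gen_threshold, majority_threshold):
    # Step 2.2: take the smallest prefix of count-leading values (fewer than the
    # generalization threshold) whose coverage reaches the majority threshold.
    counts = Counter(values)
    covered, major = 0.0, []
    for v, c in counts.most_common(gen_threshold - 1):
        major.append(v)
        covered += c / len(values)
        if covered >= majority_threshold:
            return set(major)
    return set()                                  # no major values: EAOI behaves like the AOI

def generalize(values, concept_of, major):
    # Step 2.3/3: keep major values; map the rest to their higher-level concept,
    # excluding major values from the same subtree (e.g., North_Taiwan-{Taipei}).
    result = []
    for v in values:
        if v in major:
            result.append(v)
            continue
        concept = concept_of[v]
        excluded = sorted(m for m in major if concept_of[m] == concept)
        result.append(f"{concept}-{{{','.join(excluded)}}}" if excluded else concept)
    return result

# Toy version of the City column of Fig. 2a: 92 Taipei tuples, one tuple each for
# Taoyuan, Hsinchu, and Miaoli, and made-up counts for two central cities.
city = ["Taipei"] * 92 + ["Taoyuan", "Hsinchu", "Miaoli"] + ["Taichung"] * 3 + ["Yunlin"] * 2
concept_of = {"Taipei": "North_Taiwan", "Taoyuan": "North_Taiwan", "Hsinchu": "North_Taiwan",
              "Miaoli": "North_Taiwan", "Taichung": "Central_Taiwan", "Yunlin": "Central_Taiwan"}

major = major_value_set(city, gen_threshold=3, majority_threshold=0.8)    # {'Taipei'}
print(sorted(set(generalize(city, concept_of, major))))
# ['Central_Taiwan', 'North_Taiwan-{Taipei}', 'Taipei']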
4 EXPLORATORY CLUSTERING AND PATTERN EXTRACTION
Integrating the GSOM and the EAOI can form a convenient and powerful analysis tool for mixed data. We therefore propose an interactive, visualized analysis framework for data clustering and pattern extraction, as shown in Fig. 6. This integrated architecture offers more analysis capabilities than stand-alone techniques: the GSOM alone is incapable of extracting clusters' characteristics, whereas the EAOI alone will result in over generalization if the data are diversified and not clustered before generalization. The advantages of the proposed architecture are visualization and exploration, making it an excellent tool in exploratory data mining. The visualized SOM helps alleviate the difficult problem of determining the appropriate cluster number. Moreover, the GSOM and the EAOI allow users to explore the data from different perspectives in addition to their traditional analyses.
Fig. 6. Framework of exploratory, visualized clustering, and pattern extraction.
Specifically, the GSOM can handle categorical data directly, and the EAOI can probe for major values and offer an alternative for processing numeric attributes. After preprocessing and training, the data are projected onto a two-dimensional GSOM; the visualized data clustering is then performed automatically or semiautomatically on the trained GSOM; and, finally, the EAOI is employed to extract the characteristics of the individual clusters. Three kinds of patterns can be analyzed: cluster characteristics, discriminant rules, and characteristic rules.
TABLE 1 The Synthetic Data Set
4.1 Cluster Characteristics

Extracted by the EAOI from each cluster C_i, cluster characteristics can be expressed as

$C_i: \{[\{A_{j,k} = v \mid A_{j,k} = (\mu, \sigma)\};\ s_{i,j}]\}$,    (5)

where A_{j,k} is the kth component of the jth pattern extracted from C_i. The value of A_{j,k} is either categorical, represented by a concept, or numeric, represented by a pair of mean and deviation (μ, σ). s_{i,j} is the support of the jth pattern in C_i, i.e., the ratio of the data in C_i that satisfy the jth pattern. For example, C_1: {[(City = Taipei, Salary = (51,000, 0)); 0.97], [(City = North_Taiwan-{Taipei}, Salary = (51,000, 1000)); 0.03]} represents two patterns, with supports of 97 percent and 3 percent, extracted from C_1.
4.2 Discriminant Rules

For a data set with a class attribute, the integrated tool can analyze the relationship between the feature attributes and the class attribute. The discriminant rules are expressed in the format of (6), whose left-hand side shows each categorical or numeric pattern A_{j,l} with support s_{i,j} extracted from cluster C_i, and whose right-hand side shows each class Class_{i,k} with its confidence c_{i,k} in C_i:

IF $C_i: \{[\{A_{j,l} = v \mid A_{j,l} = (\mu, \sigma)\};\ s_{i,j}]\} \rightarrow \{Class_{i,k}(c_{i,k})\}$.    (6)

The supports and the confidences are computed as s_{i,j} = N_{i,j}/N_i and c_{i,k} = N_{i,k}/N_i, respectively, where N_i is the number of data in C_i, N_{i,j} is the number of data satisfying the jth pattern in C_i, and N_{i,k} is the number of data of the kth class in C_i. For instance, C_1: {[(City = Taipei, Salary = (51,000, 0)); 0.97], [(City = North_Taiwan-{Taipei}, Salary = (51,000, 1000)); 0.03]} → {A(0.7), B(0.3)} indicates that C_1 has two patterns, with supports of 97 percent and 3 percent, respectively, and that these patterns imply Class A with 70 percent confidence or Class B with 30 percent confidence.
4.3 Characteristic Rules

Another pattern to extract is the characteristics of a class, say Class_k, and their distribution among the clusters. The pattern can be expressed in the following format:

IF $Class_k \rightarrow \{(\{A_{j,k} = v \mid A_{j,k} = (\mu, \sigma)\},\ (C_i, c_{i,k}))\}$.    (7)

The confidence is computed as c_{i,k} = N_{i,k}/N_k, where N_k is the number of data of the kth class and N_{i,k} is the number of data of the kth class in C_i.
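A small sketch of the bookkeeping behind (6) and (7) follows; the row layout, the pattern/class accessor functions, and the toy counts are hypothetical, and the support is computed as the within-cluster ratio N_ij / N_i, matching the prose definition in Section 4.1.

from collections import Counter

def discriminant_stats(cluster_rows, pattern_of, class_of):
    # Supports s_ij = N_ij / N_i and confidences c_ik = N_ik / N_i for one cluster C_i.
    n_i = len(cluster_rows)
    supports = {p: c / n_i for p, c in Counter(pattern_of(r) for r in cluster_rows).items()}
    confidences = {k: c / n_i for k, c in Counter(class_of(r) for r in cluster_rows).items()}
    return supports, confidences

def characteristic_confidences(clusters, class_of, target_class):
    # Eq. (7): c_ik = N_ik / N_k, how the target class is distributed over the clusters.
    n_k = sum(1 for rows in clusters.values() for r in rows if class_of(r) == target_class)
    return {ci: sum(1 for r in rows if class_of(r) == target_class) / n_k
            for ci, rows in clusters.items()}

# Toy rows of the form (generalized pattern, class label).
c1 = [(("Taipei", "51K"), "A")] * 7 + [(("Taipei", "51K"), "B")] * 3
supports, confidences = discriminant_stats(c1, lambda r: r[0], lambda r: r[1])
print(supports)                                                      # {('Taipei', '51K'): 1.0}
print(confidences)                                                   # {'A': 0.7, 'B': 0.3}
print(characteristic_confidences({"C1": c1}, lambda r: r[1], "A"))   # {'C1': 1.0}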
5 EXPERIMENTAL RESULTS
A prototype was developed using Borland C++ Builder 6 for experiments to verify the integrated framework. This paper reports three experiments on one synthetic and two real data sets.
5.1 Synthetic Data

The integrated tool can analyze data using the conventional approaches as well as the proposed approaches. This experiment compares the results of the conventional SOM and AOI with those of the GSOM and EAOI on a synthetic, mixed data set. When training a traditional SOM, the data set is preprocessed to transform the categorical data to binary data. We designed a data set of 400 tuples, which has four attributes plus one class attribute, as shown in Table 1. Status and Area are categorical, while Age and Amount are numeric. The categorical values are generated according to the specified ratios, the Age values are generated according to normal distributions, and the Amount values are generated according to a uniform distribution. Each class has 100 tuples, and the class label is assigned according to the Status attribute. Major values may occur in the attributes; for instance, the Status values MS and MA together take up 80 percent of Class G. The hierarchies for the attributes are shown in Fig. 7; the hierarchies of Age and Amount are for the traditional AOI. First, a GSOM is trained using the data set and an SOM is trained using the transformed data set; then the data are clustered visually on the trained maps; and, finally, the discriminant patterns are induced from the individual clusters using the EAOI and AOI. The map size is 64 units. The training parameter setting follows that of SOM_PAK [34]. The learning rate is a linear function α(t) = α(0)(1.0 - t/T) with the initial value α(0) = 0.9, the neighborhood function is a Gaussian function, and the neighborhood radius function is r(t) = 1.0 + (r(0) - 1)(1.0 - t/T), where the initial value r(0) is set to the side length of the map. The training time T is at least 10 times the map size. Fig. 8 shows the training results after 12,000 training steps. The spot size is proportional to the number of data projected to the unit.
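For reference, the training schedules used here can be written as the following small sketch; the map side length and training time match the synthetic-data setting, while the sampled steps are arbitrary.

def learning_rate(t, T, alpha0=0.9):
    # alpha(t) = alpha(0) * (1.0 - t / T), the linear decay used in the experiments
    return alpha0 * (1.0 - t / T)

def radius(t, T, r0):
    # r(t) = 1.0 + (r(0) - 1) * (1.0 - t / T), with r(0) set to the side length of the map
    return 1.0 + (r0 - 1.0) * (1.0 - t / T)

T, r0 = 12000, 8      # 8 x 8 map (64 units); T is at least 10 times the map size
for t in (0, 6000, 11999):
    print(t, round(learning_rate(t, T), 4), round(radius(t, T, r0), 4))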
TABLE 2 The Discriminant Rules Extracted by EAOI from the Four Clusters in Fig. 8a
Fig. 7. Distance hierarchies for the synthetic data set.
The results of the GSOM, shown in Fig. 8a, clearly show four groups; we therefore manually cluster the results into four groups. In contrast, the results of the SOM, shown in Fig. 8b, are more widespread and, consequently, are not easy to cluster into four groups. This is because the similarity between categorical values is not taken into account in the SOM method. We show the class labels of the patterns contained in each unit beside the unit itself. The results in Fig. 8a indicate that the patterns in each unit are all from the same class, and the units with the same class label tend to be in nearby locations, naturally forming a group. On the contrary, Fig. 8b shows that patterns from Classes H and J may mix in a single unit. Moreover, the units to which the H and J patterns project are not well separated, as shown in the lower right of the map. We further use the EAOI and AOI to extract discriminant rules for the four groups formed on the GSOM. The parameters are set as follows: the attribute generalization threshold is 3 and the majority threshold is 0.75. The concept hierarchies are as shown in Fig. 7. Table 2 and Table 3 show a portion of the results of the EAOI and the AOI, respectively. The EAOI extracts the major values in both the Status and Area attributes in Cluster 1 (C1).
Fig. 8. Training results of the synthetic data set by using (a) the GSOM method and (b) the SOM method.
In contrast, the AOI generalizes all the values to higher-level concepts. For a numeric attribute like Age, the results of the EAOI show that the means are all around 25 in C1, whereas the results of the AOI are Young and Mid, which are more general, ranging from 20 to 29 and from 30 to 40. The patterns extracted from C2 by the two methods are similar to those they extract from C1. Regarding C3, the major values in Area are preserved by the EAOI, whereas all values are generalized to the single value WestAsia by the AOI. Since the number of distinct values in Status is 3 (equal to the generalization threshold), Status is not generalized by either method. The results for C4 are similar to those for C3. Note that, when the majority threshold is set to 1 and the option of generalizing numeric attributes is chosen, the EAOI outputs the same results as the AOI in Table 3.
TABLE 3 The Discriminant Rules Extracted by AOI from the Four Clusters in Fig. 8a
Fig. 10. Training and clustering results of the Adult data set by using (a) the GSOM and (b) the SOM methods.
Fig. 9. Distance hierarchies for the attributes of the Adult data set.
5.2 UCI Adult Data Set

For real data, we use the Adult data set from the UCI repository [35]. The data set has 15 attributes, including eight categorical attributes, six numeric attributes, and one class attribute, Salary, indicating whether the salary is over 50K (> 50K) or at most 50K (≤ 50K). We randomly selected 10,000 of the 48,842 tuples for the experiment. Seventy-six percent of the selected tuples have the value ≤ 50K, which is the same distribution as that of the original data set. For the attribute selection, we use the technique of relevance analysis based on information gain [3]. The relevance threshold was set to 0.1, and we obtained seven qualified attributes: Marital-status, Relationship, Education, Capital_gain, Capital_loss, Age, and Hours_per_week. The first three are categorical, and the others are numeric. Distance hierarchies are constructed as shown in Fig. 9. The map size is 400 units. The training parameters are set the same as in the previous experiment. Fig. 10 shows the results after 60,000 training steps. The results show that the training data are projected onto nearby units more compactly by the GSOM method than by the SOM method. Again, this is because the GSOM takes the similarity between categorical values into account via distance hierarchies. Clustering of a trained map can be done automatically [21], [22], [23] or semiautomatically. We use three criteria to cluster the training results; the first two can be applied automatically and the third one semiautomatically. For the first two criteria, the neighboring spots with a grid distance of d ≤ 1.414 or d ≤ 2.828 are grouped together to form a cluster.
For the third criterion, we further take the shape of clusters into consideration. The neighboring spots with a distance d ≤ 3 may or may not be merged, depending on the shape of the to-be-merged clusters and the spatial relationship with other neighboring groups, with the intention of avoiding the chaining effect [36]. The number of clusters under the different distance criteria is shown in Table 4. From our experience, the semiautomatic method (i.e., d ≤ 3 & Adj.) produces reasonable clustering results, which are shown in Fig. 10. The first two criteria (i.e., d ≤ 1.414 and d ≤ 2.828) are in essence similar to the single-link approach of hierarchical clustering [7], which suffers from the chaining effect [36]. For instance, the second criterion (d ≤ 2.828) merges Clusters 4 and 7 of the GSOM in Fig. 10a and merges Clusters 1, 2, 5, 6, 10, 12, and 13 of the SOM in Fig. 10b. To measure how the clustering improves the likelihood of similar values falling in the same cluster, the average categorical utility (ACU) of the clusters can be used. The categorical utility function [37] attempts to maximize both the probability that two data objects in the same cluster have attribute values in common and the probability that data objects from different clusters have different values. The higher the categorical utility, the better the clustering fares [38]. The average categorical utility of a set of clusters is calculated as follows:

$ACU = \frac{1}{K} \sum_k \frac{|C_k|}{|D|} \left( \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right] \right)$,    (8)

where P(A_i = V_{ij} | C_k) is the conditional probability that the attribute A_i has the value V_{ij} given the cluster C_k, and P(A_i = V_{ij}) is the overall probability of A_i having V_{ij} in the entire data set.
TABLE 4 Number of Resultant Clusters of the Trained GSOM and SOM Using Different Distance Criteria
TABLE 5 The Increased Rate of the Average CU and the Expected Entropy of the Class Attribute Salary
We compute the ACU of the categorical values of the clusters formed by the three clustering criteria at the leaf level and at Level 1 of the distance hierarchies, together with the increased rate, as shown in Table 5. The ACU at Level 1 is computed by generalizing categorical values to their values at Level 1 and then applying (8). The larger increased rates of the GSOM approach indicate that the GSOM influences the clustering by helping group similar categorical values together, where the similarity is defined via distance hierarchies. In addition, compared to those of the SOM, the spots of the GSOM spread less widely, which also indicates the effect of taking the similarity between categorical values into consideration during training. The expected entropy of an attribute C in a set of clusters can be used to measure how the class values are distributed in the clusters; the smaller the value, the better the clustering quality in terms of the external class attribute. The expected entropy is computed as follows: first, calculate the entropy of C in each cluster; then, sum up all the entropies, each weighted by its cluster size. The formula is as follows [38]:

$\hat{E}(C) = \sum_k \frac{|C_k|}{|D|} \left( -\sum_j P(C = V_j) \log P(C = V_j) \right)$,    (9)

where V_j denotes one of the possible values that C can take, |C_k| is the size of Cluster k, and |D| is the data set size. The expected entropies of the class attribute Salary under the different clusterings are shown in Table 5. The results of the GSOM approach with d ≤ 1.414 produce the smallest expected entropy.

TABLE 6 The Salary Distribution and the Number of Extracted Patterns with support ≥ 0.05
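A minimal sketch of the two clustering-quality measures, (8) and (9), is given below; the toy clusters and class labels are hypothetical, and the entropy uses the natural logarithm since no base is fixed in the text.

import math
from collections import Counter

def average_categorical_utility(clusters, n_attrs):
    # Eq. (8); clusters is a list of clusters, each a list of categorical tuples.
    data = [row for cluster in clusters for row in cluster]
    overall = [Counter(row[i] for row in data) for i in range(n_attrs)]
    total, K = len(data), len(clusters)
    acu = 0.0
    for cluster in clusters:
        size = len(cluster)
        within = [Counter(row[i] for row in cluster) for i in range(n_attrs)]
        term = sum((within[i][v] / size) ** 2 - (overall[i][v] / total) ** 2
                   for i in range(n_attrs) for v in overall[i])
        acu += (size / total) * term
    return acu / K

def expected_entropy(clusters, class_of):
    # Eq. (9): size-weighted entropy of the class attribute over the clusters.
    total = sum(len(c) for c in clusters)
    ee = 0.0
    for cluster in clusters:
        counts = Counter(class_of(row) for row in cluster)
        entropy = -sum(n / len(cluster) * math.log(n / len(cluster)) for n in counts.values())
        ee += len(cluster) / total * entropy
    return ee

clusters = [[("MS", "Asia")] * 9 + [("MA", "Asia")], [("BA", "Europe")] * 10]
print(round(average_categorical_utility(clusters, n_attrs=2), 3))
print(round(expected_entropy([[("x", ">50K")] * 7 + [("x", "<=50K")] * 3], lambda r: r[1]), 3))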
TABLE 7 Cluster Characteristics Patterns with support ≥ 0.05 Extracted from the Clusters Using the EAOI
Notation: AVD (Advanced), BA (bachelors), CO (college), HB (husband), HS (HighSchool), JU (junior), MCS (married-civ-spouse), MS (master), NE (never-married), NIF (Not_In_Family), SC (some-college), SI (single), OC (own-child), WI (wife).
Furthermore, the chaining effect results in a reduced cluster number and an increased expected entropy (see the rows for d ≤ 2.828 in Table 5). To illustrate further analysis of pattern extraction from the resultant clusters, we use the results in Fig. 10a, which have a modest number of clusters, comprising 9,843 tuples in clusters and 157 outliers. The Salary class distributions in the clusters are shown in Table 6, where Clusters 4 and 1 have the largest ratios of > 50K, and Clusters 5, 3, and 7 have much lower ratios of > 50K than the data set. We use the EAOI and AOI to extract cluster patterns. The parameters were set as follows: the attribute generalization threshold is 4 and the majority threshold is 0.75. The number of patterns with support ≥ 0.05 extracted from each cluster is shown in Table 6. Table 7 and Table 8 show a portion of the patterns from Clusters 4, 2, and 7 by both methods. The results of the EAOI indicate that all the clusters except Clusters 1 and 6 have major values in Education. For example, AVD (advanced), CO (college), and JU (junior) are the major values of Education in C4. On the contrary, the AOI may miss those major values and overly generalize an attribute to the most general concept Any, as shown for Education of C4 in Table 8. Two other cases of over generalization appear in C2 and in the numeric attributes. First, the AOI outputs only one pattern over 5 percent from C2, whereas the EAOI outputs seven patterns and preserves the major values of all the categorical attributes. Second, many numeric attributes are generalized to Any by the AOI due to too many distinct values, whereas the EAOI offers more insights by outputting the means and deviations.

TABLE 8 Cluster Characteristics Patterns with support ≥ 0.05 Extracted from the Clusters Using the AOI
TABLE 10 Number of Data, Distinct Products, and Patterns in the Clusters
Fig. 11. Training and clustering results of the Sales data set by using (a) the GSOM and (b) the SOM methods.
Finally, we analyze the patterns extracted by the EAOI from the GSOM's clusters, as shown in Table 7. Generally, the data in the clusters with larger ratios of > 50K (e.g., C4) have higher education levels, stable marriages and relationships, older ages, longer work hours, and larger capital gains, compared to the data in the clusters with lower ratios of > 50K (e.g., C7).
5.3 Sales Data

In another experiment, we used a subset of the sales records of a store at a university from 4/12/1999 to 7/17/2000. The data set is of the Grocery category and includes 6,863 transactions. Four attributes participated in the experiment: two categorical (Date and Product) and two numeric (Quantity and Amount). The Product distance hierarchy is constructed according to the product classification of the store; the others are constructed according to common sense. The training parameters are set the same as those of the previous experiment. Fig. 11 shows the results after 20,000 training steps. Table 9 shows the clustering results of the single-link method with different distance criteria; that is, two clusters are merged if their cluster distance is less than the criterion. It is worth mentioning that this clustering is done automatically, without the user's intervention. The results of the SOM are more widespread, which can be confirmed by comparing the numbers of clusters on the GSOM and SOM. More specifically, the cluster number of the SOM (115) is larger than that of the GSOM (86) when d = 0. However, all the spots of the SOM are merged when d ≤ 3, whereas the GSOM forms four clusters.

TABLE 9 Number of Clusters by Different Distance Criteria and the Average CU at Different Levels
It can be noticed in Table 9 that the ACU of the GSOM at the higher levels (Levels 3 and 2) is higher than that of the SOM, indicating that the GSOM model successfully groups similar categorical values nearby, where the similarity is defined via their distance hierarchies. (Note that the height of the Date and Product distance hierarchies is 4.) As expected, over generalization occurs in the AOI when many distinct values exist in the data. As shown in Table 10 and Table 11, the Product attribute of Cluster 3 becomes Any under the AOI, while under the EAOI it shows the extracted major values, Appliance and Sanitation.

TABLE 11 Characteristics Patterns (support ≥ 0.05) of Cluster 3 Extracted by the EAOI and AOI
6 CONCLUDING REMARKS
The attributes participating in the training of the GSOM have a significant impact on the results due to the distance metric used in the training algorithm. If a class attribute is involved in the data, relevance analysis between the class attribute and the others (or feature selection) [39], [40], [41] should be performed before training to ensure the quality of the cluster analysis. Moreover, most variants of the SOM use Euclidean-based distance metrics, and so does our GSOM; it would be interesting to investigate other possible metrics, such as the Manhattan distance or cosine correlation, in the future. In addition, the outcome of the GSOM clustering is currently user-dependent, as it is done semi-interactively. While this can be regarded as a feature of the interactive system, a sensitivity analysis with respect to the main parameters as well as the incorporation of
additional tools to visually support the parameter decisions would be desirable to make the results more stable and user-independent.
ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their valuable suggestions. They would also like to thank Yan-Cheng Lin and Yu-Wei Su for their involvement in the early stage of this research.
REFERENCES

[1] U. Fayyad and R. Uthurusammy, "Data Mining and Knowledge Discovery in Databases," Comm. ACM, vol. 39, pp. 24-26, 1996.
[2] G. Groth, Data Mining: A Hands-On Approach for Business Professionals. Prentice Hall, 1998.
[3] J. Han and M. Kamber, Data Mining Concepts and Techniques. Morgan Kaufmann, 2001.
[4] T.M. Mitchell, Machine Learning. McGraw Hill, 1997.
[5] T. Kohonen, Self-Organizing Maps. Springer-Verlag, 1997.
[6] J. Han, Y. Cai, and N. Cercone, "Data-Driven Discovery of Quantitative Rules in Relational Databases," IEEE Trans. Knowledge and Data Eng., vol. 5, pp. 29-40, 1993.
[7] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.
[8] G.A. Carpenter and S. Grossberg, "A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine," Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[9] T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas, "Engineering Applications of the Self-Organizing Map," Proc. IEEE, vol. 84, no. 10, pp. 1358-1384, Oct. 1996.
[10] A. Visa, "A Texture Classifier Based on Neural Network Principles," Proc. Int'l Joint Conf. Neural Networks, pp. 491-496, 1990.
[11] M. Kasslin, J. Kangas, and O. Simula, "Process State Monitoring Using Self-Organizing Maps," Artificial Neural Networks, pp. 1532-1534, 1992.
[12] O. Simula and J. Kangas, "Process Monitoring and Visualization Using Self-Organizing Maps," Neural Networks for Chemical Eng., 1995.
[13] J. Mantysalo, K. Torkkola, and T. Kohonen, "Mapping Context Dependent Acoustic Information into Context Independent Form by LVQ," Speech Comm., vol. 14, no. 2, pp. 119-130, 1994.
[14] M. Vapola, O. Simula, T. Kohonen, and P. Merilainen, "Representation and Identification of Fault Conditions of an Anaesthesia System by Means of the Self-Organizing Map," Proc. Int'l Conf. Artificial Neural Networks (ICANN '94), vol. 1, pp. 246-249, 1994.
[15] T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela, "Self-Organization of a Massive Document Collection," IEEE Trans. Neural Networks, vol. 11, no. 3, pp. 574-585, 2000.
[16] D.R. Chen, R.F. Chang, and Y.L. Huang, "Breast Cancer Diagnosis Using Self-Organizing Map for Sonography," Ultrasound in Medicine and Biology, vol. 1, no. 26, pp. 405-411, 2000.
[17] A.A. Kramer, D. Lee, and R.C. Axelrod, "Use of a Kohonen Neural Network to Characterize Respiratory Patients for Medical Intervention," Proc. Conf. Artificial Neural Networks in Medicine and Biology, pp. 192-196, 2000.
[18] N. Kasabov, D. Deng, L. Erzegovezi, M. Fedrizzi, and A. Beber, "On-Line Decision Making and Prediction of Financial and Macroeconomic Parameters on the Case Study of the European Monetary Union," Proc. ICSC Symp. Neural Computation, 2000.
[19] G.J. Deboeck, "Modeling Non-Linear Market Dynamics for Intra-Day Trading," Neural-Network-World, vol. 1, no. 10, pp. 3-27, 2000.
[20] S. Kaski and T. Kohonen, "Exploratory Data Analysis by the Self-Organizing Map: Structures of Welfare and Poverty in the World," Neural-Networks in Financial Eng., pp. 498-507, 1996.
[21] J. Vesanto and E. Alhoniemi, "Clustering of the Self-Organizing Map," IEEE Trans. Neural Networks, vol. 11, no. 3, pp. 586-600, May 2000.
[22] M.Y. Kiang, U.R. Kulkarni, and K.Y. Tam, "Self-Organizing Map Network as an Interactive Clustering Tool—An Application to Group Technology," Decision Support Systems, pp. 351-374, 1995.
[23] M.Y. Kiang, "Extending the Kohonen Self-Organizing Map Networks for Clustering Analysis," Computational Statistics and Data Analysis, vol. 38, pp. 161-180, 2001.
[24] Z. Huang, "Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, no. 3, Sept. 1998.
[25] Z. Huang and M.K. Ng, "A Fuzzy k-Modes Algorithm for Clustering Categorical Data," IEEE Trans. Fuzzy Systems, vol. 7, no. 4, pp. 446-452, 1999.
[26] M.K. Ng and J.C. Wong, "Clustering Categorical Data Sets Using Tabu Search Techniques," Pattern Recognition, vol. 35, pp. 2783-2790, 2002.
[27] J. Catlett, "Megainduction: Machine Learning on Very Large Databases," PhD dissertation, Univ. of Sydney, 1991.
[28] U.M. Fayyad and K.B. Irani, "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," Proc. 13th Int'l Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.
[29] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[30] R. Kerber, "Chimerge: Discretization of Numeric Attributes," Proc. Ninth Nat'l Conf. Artificial Intelligence, pp. 123-128, 1992.
[31] H. Liu and R. Setiono, "Chi2: Feature Selection and Discretization of Numeric Attributes," Proc. Seventh IEEE Int'l Conf. Tools with Artificial Intelligence, pp. 388-391, 1995.
[32] J. Han and Y. Fu, "Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases," Proc. AAAI '94 Workshop Knowledge Discovery in Databases (KDD '94), pp. 157-168, 1994.
[33] C.-C. Hsu, "Extending Attribute-Oriented Induction Algorithm for Major Attribute Values and Numeric Values," Expert Systems with Applications, vol. 27, no. 2, pp. 187-202, 2004.
[34] T. Kohonen, J. Hynninen, J. Kangas, and J. Laaksonen, "SOM_PAK: The Self-Organizing Map Program Package," Technical Report A31, Laboratory of Computer and Information Science, Helsinki Univ. of Technology, Espoo, Finland, 1996.
[35] P.M. Murphy and D.W. Aha, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1992.
[36] A.K. Jain, M.N. Murty, and P.J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, Sept. 1999.
[37] M. Gluck and J. Corter, "Information, Uncertainty, and the Utility of Categories," Proc. Seventh Ann. Conf. Cognitive Soc., pp. 283-287, 1985.
[38] D. Barbara, J. Couto, and Y. Li, "COOLCAT: An Entropy-Based Algorithm for Categorical Clustering," Proc. 11th Int'l Conf. Information and Knowledge Management, pp. 582-589, 2002.
[39] M. Dash and H. Liu, "Feature Selection Methods for Classification," Intelligent Data Analysis: An Int'l J., vol. 1, 1997.
[40] R. Kohavi and G.H. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, pp. 273-324, 1997.
[41] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic, 1998.
Chung-Chian Hsu received the MS and PhD degrees in computer science from Northwestern University in 1988 and 1992, respectively. He joined the Department of Information Management at National Yunlin University of Science and Technology, Taiwan, in 1993, was the chairman of the department from 2000 to 2003, and is currently a professor in the department. His research interests include data mining, machine learning, and decision support systems. Since 2002, he has also been the director of the information systems division at the Testing Center for Technological and Vocational Education, Taiwan.
Sheng-Hsuan Wang received the MBA degree in information management from National Yunlin University of Science and Technology, Taiwan, in 2005. He has been a member of the Intelligent Database Systems Laboratory (IDSL), National Yunlin University of Science and Technology since 2003. His main research interests are related to self-organizing maps and their applications to data mining.