Quality assessment and Uncertainty Handling in Data Mining Process


Maria Halkidi, PhD Student, Dept. of Informatics, Athens University of Economics & Business, Patision 76, 10434 Athens, Greece (Hellas). Email: [email protected]

Abstract

The KDD process aims at the discovery and extraction of “useful” knowledge (such as interesting patterns, classifications, rules, etc.) from large data repositories. A widely recognized requirement is that the discovered patterns must be valid and ultimately comprehensible (i.e., easily understood by analysts). Another requirement, under-addressed in the KDD process, is the revealing and handling of uncertainty in the main data mining tasks of clustering, classification and association rule extraction.

1. Research Problem

Data Mining is the step of the KDD process mainly concerned with methodologies for extracting knowledge from large data repositories. Many data mining algorithms accomplish a limited set of tasks and produce a particular enumeration of patterns over data sets. According to well-established data mining methods [FU96], the main tasks are: the definition/extraction of clusters that provide a classification scheme, the classification of database values into the categories defined, and the extraction of association rules or other knowledge artifacts.

In the vast majority of data mining approaches, the initial categories resulting from a clustering process are crisp, so that each data value is classified into exactly one of a set of categories. Moreover, clustering is a mostly unsupervised procedure: there are no predefined classes, and most clustering algorithms depend on assumptions and initial guesses in order to define the subgroups present in a data set. They also assume that the number of clusters produced by an algorithm best fits the specific data set. As a consequence, in most applications the final clusters require some sort of evaluation.

Given the approaches described above, we address the following issues and their implications for the data mining process:

i. The clusters are not overlapping.
The limits of clusters are crisp, and each database value may be classified into at most one cluster. As a result, in some cases “interesting” tuples fall outside the cluster limits and are not classified at all. This is unlike everyday experience, where a value may be partially classified into one or more categories. For instance, in Central Europe a male person 182 cm tall is considered both of “medium” height and “tall” to some degree.

ii. The data values are treated equally in the classification process.

In traditional data mining systems, database values are classified into the available categories in a crisp manner, i.e., a value either belongs to a category or not. The person in the above example is considered tall, and another person 199 cm tall is also considered tall. Clearly, the second person satisfies the criterion of being “tall” to a higher degree than the first. This piece of knowledge (the difference in the degree of belief that A is tall versus that B is tall) cannot be acquired using the existing classification schemes.

iii. Little attention is paid to cluster quality issues.

The majority of clustering algorithms partition a data set into a number of clusters based on parameters such as the desired number of clusters, the minimum number of objects in a cluster, or the diameter of a cluster. They search for the best clusters according to well-defined criteria, assuming that the resulting structure of the extracted clusters is optimal [Dave96]. As a consequence, if the clustering algorithm's parameters have not been assigned proper values, the clustering method may result in a partitioning that is not optimal for the

specific data set. The problem of deciding the number of clusters, as well as the evaluation of clustering results, has been the subject of several research efforts [Dave96, GG89, RLR98, ThK99]. Although several validity indices have been introduced, in practice most clustering methods do not use any of them. Furthermore, formal methods for finding the best partitioning of a data set in database and data mining applications are very few [Smyth96].

iv. The resulting rules may “hide” knowledge.

A rule is an implication of the form A -> B, where A and B are groups of attributes or groups of categories (each attribute includes more than one category). All the sets of values belonging to the categories denoted by A and B then contribute equally to the strength of the rule. Yet each tuple should make its own contribution to the rule, since each value set bears a different classification degree of belief (d.o.b.). Taking into account the two previous observations, the detected rules again fail to capture the difference in the strength of the association on a per-tuple basis.

It is clear from the preceding discussion that there is interesting knowledge in the partial classification of values which is not extracted during the data mining process. This is because uncertainty is not considered, and because the number of clusters is fixed.

2. State of Existing Solutions

Several approaches dealing with uncertainty representation in data mining have been proposed in the literature [Bezd84, Chiu97, CZ98] (e.g., fuzzy decision trees, Fuzzy C-Means). According to these approaches, each data value can be assigned to one or more categories with an attached degree of belief. However, they do not propose ways to handle classification information and exploit it for decision-making. The quality assessment of clusters is another issue of interest on which some research efforts have concentrated.
More specifically, cluster validity indices have been described in the literature for both crisp and fuzzy clustering [Dave96, GG89, RLR98, ThK99]. However, the evaluation of the proposed indices, and the analysis of their reliability in identifying the best clustering scheme produced by a specific clustering algorithm, remain limited. In general terms, every validity index may fail in some cases, since all of them depend on parameters that may influence index values and lead to unreliable results. Moreover, there is very little study of cluster quality in the area of databases and data mining, or of how the adoption of a clustering scheme evaluation procedure can affect the data mining process.

3. Proposed Approach

The importance of the requirements described above, i.e., revealing and using uncertainty and evaluating data mining results, led us to define a new data mining framework that supports uncertainty. We adopt fuzzy logic as the main tool for representing and handling uncertainty in this context. The basic steps of our approach are as follows:

1. Clustering scheme extraction. In this step we define/extract clusters that correspond to the initial categories for a particular data set. Any of the well-established clustering methods can be used.

2. Evaluation of the clustering scheme. Clustering methods find a partition of a data set assuming an a priori specified number of clusters. Our purpose is to determine a number of clusters that is “optimal” for the specific data set. In our approach, we apply a clustering algorithm for different parameter values and then use a well-defined evaluation criterion to choose the best of the resulting clustering schemes: the best partitioning of the data set (i.e., clustering scheme) is the one that best satisfies the pre-specified criterion. This criterion is represented by an index, called the clustering scheme quality index, whose definition is based on the two fundamental criteria of clustering quality (i.e., compactness and separation of clusters).
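This scan-and-score procedure can be sketched as follows. The code below is illustrative only: it uses a plain Lloyd's k-means with deterministic farthest-point initialization, and a simple compactness/separation ratio as a stand-in for the quality index, not the paper's actual SD formula.

```python
import numpy as np

def init_centers(X, k):
    """Deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means; returns labels and centroids."""
    centers = init_centers(X, k)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def validity(X, labels, centers):
    """Illustrative compactness/separation ratio (smaller is better) --
    a stand-in for an SD-style quality index, not the paper's formula."""
    compact = np.mean([((X[labels == j] - c) ** 2).sum(-1).mean()
                       for j, c in enumerate(centers) if np.any(labels == j)])
    sep = min(((a - b) ** 2).sum() for i, a in enumerate(centers)
              for b in centers[i + 1:])
    return float(compact / sep)

# Toy data with three well-separated blobs; scan k and keep the best score.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in ([0, 0], [5, 5], [10, 0])])
scores = {k: validity(X, *kmeans(X, k)) for k in range(2, 7)}
best_k = min(scores, key=scores.get)
```

On data with a clear three-cluster structure, the index attains its minimum at k = 3, which is the sense in which the scheme “best satisfies” the criterion.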

3. Definition of membership functions. Fuzzy clustering algorithms define clusters and compute the grade of membership of each data value in the clusters. However, most clustering methods are crisp, i.e., all values in a cluster belong to it totally. As mentioned in the previous section, we aim at adding uncertainty features in this case. This is achieved by assigning appropriate membership functions to the clusters.

4. Fuzzy classification. The values of the non-categorical attributes Ai of a data set are classified according to a set of categories L = {li} (where li is a category) and a set of classification functions defined in the preceding clustering process. The result of this procedure is a set of degrees of belief (d.o.b.s) M = {li(tk.Ai)}. Each member of this set represents the confidence that the specific value tk.Ai (where tk is the tuple identifier) belongs to the set denoted by the category li.
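As a minimal sketch of this step, the snippet below computes the d.o.b. set M = {li(tk.Ai)} for a single “height” attribute, using simple triangular membership functions; the category names, centres and widths are hypothetical, chosen only to reproduce the 182 cm / 199 cm example from Section 1.

```python
# Hypothetical categories for a "height" attribute: (centre, width) in cm.
categories = {"short": (155, 20), "medium": (175, 15), "tall": (195, 20)}

def grade(x, centre, width):
    """Triangular membership grade of x around `centre`."""
    return max(0.0, 1.0 - abs(x - centre) / width)

def classify(value):
    """Degrees of belief M = {li(value)} for one attribute value."""
    return {li: round(grade(value, c, w), 3) for li, (c, w) in categories.items()}

# tk -> tk.Ai values for two tuples.
dobs = {tk: classify(v) for tk, v in {"t1": 182, "t2": 199}.items()}
```

Here the 199 cm person belongs to “tall” with a higher degree of belief than the 182 cm person, which is exactly the distinction a crisp scheme loses.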

5. Classification Value Space (CVS) construction. Following the classification framework proposed in [Vaz98, VH00], we transform the data set into classification beliefs and store them in a structure called the CVS. The CVS is represented by a cube whose cells store the degrees of belief for the classification of the attributes' values.

6. Handling of the information included in the CVS. The CVS includes significant knowledge about our data set. We can exploit this knowledge for decision-making, based on well-established information measures that we define in our approach. These measures are based on fuzzy logic concepts and reflect the quantity of information in our data set with respect to each category (cluster), each attribute, and/or the whole classification scheme. We then exploit the results of these measures in order to i) make decisions for our application, and ii) assess how well the current classification model fits our data set. When the classification model is used to classify new instances, the evaluation procedure can help us understand whether it is necessary to redefine the initial clustering scheme.
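The cube structure of the CVS can be sketched as a three-dimensional array of degrees of belief. All labels and belief values below are hypothetical; the queries are only examples of how the stored beliefs might be aggregated, not the paper's specific information measures.

```python
import numpy as np

# A toy CVS: a cube of degrees of belief indexed as [tuple, attribute, category].
tuples = ["t1", "t2", "t3"]
attrs = ["height", "weight"]
cats = ["low", "mid", "high"]
cvs = np.array([
    [[0.0, 0.8, 0.3], [0.1, 0.9, 0.0]],   # beliefs for tuple t1
    [[0.0, 0.0, 1.0], [0.0, 0.2, 0.7]],   # beliefs for tuple t2
    [[0.9, 0.2, 0.0], [0.8, 0.1, 0.0]],   # beliefs for tuple t3
])

# Example queries over the cube: total belief mass per category, and the
# dominant category for every (tuple, attribute) cell.
mass_per_category = cvs.sum(axis=(0, 1))   # one value per category
dominant = cvs.argmax(axis=2)              # index into cats per cell
```

Slicing the cube along different axes yields exactly the per-category, per-attribute, or whole-scheme views on which such measures operate.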

7. Association rule extraction between attributes, based on the above classification scheme. Mining association rules between these classifications makes it possible to express high-level natural-language queries and rules. Again, we can compare the validity of relationships in different sets and thus respond to decision-makers' requirements.
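The per-tuple contribution discussed in issue iv can be sketched as follows: each tuple's weight in a rule A -> B is derived from its degrees of belief rather than a crisp 0/1 count. The belief values and the choice of min as the t-norm are illustrative assumptions, not the paper's fixed definition.

```python
# Hypothetical degrees of belief for the antecedent and consequent categories.
dob_A = {"t1": 0.8, "t2": 1.0, "t3": 0.2}   # belief that tk satisfies A
dob_B = {"t1": 0.9, "t2": 0.7, "t3": 0.1}   # belief that tk satisfies B

def fuzzy_support(a, b):
    """Average joint belief over all tuples (min as the t-norm)."""
    return sum(min(a[t], b[t]) for t in a) / len(a)

def fuzzy_confidence(a, b):
    """Joint belief relative to the total belief in the antecedent."""
    return sum(min(a[t], b[t]) for t in a) / sum(a.values())

support = fuzzy_support(dob_A, dob_B)       # each tuple contributes its own weight
confidence = fuzzy_confidence(dob_A, dob_B)
```

Under this weighting, t2 (belief 1.0 in A) strengthens the rule more than t3 (belief 0.2), which a crisp count cannot express.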

4. Preliminary Results

Based on the above approach, we have designed the main steps of a new data mining system that handles uncertainty in the main data mining tasks. Our research effort has already resulted in a clustering and classification scheme that supports uncertainty. More specifically, the results of our current research are:

• Definition and evaluation of clustering scheme quality indices. We defined two quality indices, CD and SD, for clustering scheme evaluation, based on the fundamental quality criteria for clustering (i.e., compactness and separation). We evaluated the reliability of the defined indices both theoretically and experimentally. More specifically, we described the pros and cons of each index based on the fundamental concepts of its definition. We also carried out an experimental study on real data sets in order to evaluate the behavior of these indices for different values of the clustering parameters (i.e., number of clusters, maximum number of clusters). The results of this study are summarized as follows:
- The index CD has a strong decreasing tendency as the number of clusters increases. Thus, the best clustering scheme is always the one produced by the maximum number of clusters.
- The index SD, on the contrary, exhibits a local minimum in the range [cmin, cmax] of the number of clusters that we considered. This happens because SD takes both clustering quality criteria into account with similar weight, while CD appears to be more influenced by the variation of clusters (compactness). The value at which SD attains its minimum, together with a significant local change in its value, is an indication of the “best” number of clusters for a specific data set, i.e., the clustering scheme that fits the data set. Moreover, significant changes in the quality index values, appearing as a “knee” in the graph of SD versus the number of clusters, may indicate a clustering tendency in the data set (i.e., that the data set possesses a clustering structure).
- SD proposes an optimal number of clusters almost irrespective of the maximum number of clusters.
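The local-minimum behaviour of SD suggests a simple selection rule: prefer a dip below both neighbours over the global trend (which would favour the maximum number of clusters, as CD does). The index values below are purely illustrative.

```python
# Hypothetical SD-style index values over the range of cluster counts scanned.
sd = {2: 0.92, 3: 0.31, 4: 0.45, 5: 0.44, 6: 0.50}

def local_minima(curve):
    """Cluster counts where the index dips below both neighbours."""
    ks = sorted(curve)
    return [ks[i] for i in range(1, len(ks) - 1)
            if curve[ks[i]] < curve[ks[i - 1]] and curve[ks[i]] < curve[ks[i + 1]]]

# Prefer the deepest local minimum; fall back to the global one if none exists.
best = min(local_minima(sd) or sd, key=sd.get)
```

Note that a monotonically decreasing curve (the CD behaviour described above) has no interior local minimum, so the fallback would degenerate to the largest cluster count, illustrating why SD is the more informative index here.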



• Uncertainty handling based on membership functions. To transform crisp clustering schemes into fuzzy ones that support uncertainty, we defined a scheme for assigning appropriate membership functions to the clusters. The functions are based on the hypertrapezoidal membership functions common in fuzzy systems.
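The general idea of deriving overlapping membership grades from crisp cluster centres can be sketched as below. This is a minimal FCM-style inverse-distance sketch for one-dimensional centres, not the paper's hypertrapezoidal construction; the centres are hypothetical.

```python
# Hypothetical 1-D cluster centres (cm) obtained from a crisp clustering.
centers = [150.0, 175.0, 200.0]

def memberships(x, m=2.0):
    """Grades sum to 1; a value at a centre gets grade ~1 there (m is the
    usual FCM fuzzifier)."""
    d = [abs(x - c) or 1e-9 for c in centers]        # avoid division by zero
    inv = [dd ** (-2.0 / (m - 1)) for dd in d]
    total = sum(inv)
    return [v / total for v in inv]
```

A value between two centres thus belongs to both clusters with intermediate grades, which is precisely the uncertainty a crisp scheme discards.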



• Definition of a classification scheme that supports uncertainty. We defined a scheme for the classification of database values that places emphasis on uncertainty handling and classification quality measures. The classification scheme handles uncertainty through a framework based on fuzzy logic. Moreover, we showed how the proposed classification scheme can be used for multidimensional classification, so as to support decision-making that combines more than one classification criterion. For this purpose, we proposed two approaches [VH00]: i) classification based on multidimensional clusters, and ii) classification based on one-dimensional clusters, and we described an experimental study performed to evaluate them. The overall result of this study is that the approach based on multidimensional clusters produces better classification schemes.



• Information measures for the classification scheme, based on the energy metric function, which reflects the quantity of information included in a fuzzy set. Based on these measures, we compare different sets as to the degree to which they fit the classification scheme, or compare different data sets under a specific criterion. We also extract “useful” knowledge for reasoning and decision-making based on the information measures.
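An energy-style measure of a fuzzy set is commonly a (normalized) sum of f(mu) for an increasing function f; the sketch below uses f(mu) = mu**2, which is one common choice and not necessarily the paper's exact metric. The belief values are illustrative.

```python
def energy(grades):
    """Normalized energy of a fuzzy set: mean of squared membership grades."""
    return sum(g * g for g in grades) / len(grades)

set_a = [0.9, 0.8, 1.0, 0.7]   # d.o.b.s of data set A under some category
set_b = [0.3, 0.2, 0.5, 0.1]   # d.o.b.s of data set B under the same category

# The set with higher energy fits the category more sharply.
fits_better = "A" if energy(set_a) > energy(set_b) else "B"
```

Such a scalar lets two data sets be ranked under a single classification criterion, which is the comparison described in the bullet above.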

To summarize, the overall objective of this research effort is the development of a framework that supports uncertainty in the data mining process. To this end, we have proposed a methodology for handling uncertainty in the clustering and classification processes based on fuzzy logic concepts. Moreover, we propose a clustering evaluation procedure for extracting optimal clustering schemes based on well-established quality criteria, as well as a procedure for handling classification information and exploiting it for decision-making. Further work will concentrate on improving the current approach and on connecting the current system with a reasoning process and association rule extraction, so as to obtain a full-fledged data mining system. Moreover, we aim to study other well-known approaches to uncertainty handling (e.g., novel techniques from statistics) and to compare the different approaches. This study will give us a complete picture of the approaches proposed in one of the most important research areas of the data mining process: the revealing and handling of uncertainty.

References

[Bezd84] J. C. Bezdek, R. Ehrlich, W. Full. "FCM: The Fuzzy C-Means Clustering Algorithm". Computers and Geosciences, 1984.
[Chiu97] S. Chiu. "Extracting Fuzzy Rules from Data for Function Approximation and Pattern Classification". In Fuzzy Information Engineering: A Guided Tour of Applications (Eds.: D. Dubois, H. Prade, R. Yager), 1997.
[CZ98] Cezary Z. Janikow. "Fuzzy Decision Trees: Issues and Methods". IEEE Transactions on Systems, Man, and Cybernetics, Vol. 28, Issue 1, pp. 1-14, 1998.
[Dave96] R. Dave. "Validating fuzzy partitions obtained through c-shells clustering". Pattern Recognition Letters, Vol. 17, pp. 613-623, 1996.
[FU96] U. Fayyad, R. Uthurusamy. "Data Mining and Knowledge Discovery in Databases". Communications of the ACM, Vol. 39, No. 11, November 1996.
[GG89] I. Gath, B. Geva. "Unsupervised Optimal Fuzzy Clustering". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, July 1989.
[RLR98] R. Rezaee, B. Lelieveldt, J. Reiber. "A new cluster validity index for the fuzzy c-means". Pattern Recognition Letters, 19, pp. 237-246, 1998.
[Smyth96] Padhraic Smyth. "Clustering using Monte Carlo Cross-Validation". KDD 1996, pp. 126-133.
[ThK99] S. Theodoridis, K. Koutroumbas. Pattern Recognition, Academic Press, 1999.
[Vaz98] M. Vazirgiannis. "A classification and relationship extraction scheme for relational databases based on fuzzy logic". In Proceedings of the PAKDD '98 Conference, Melbourne, Australia.
[VH00] M. Vazirgiannis, M. Halkidi. "Uncertainty handling in the data mining process with fuzzy logic". To appear in Proceedings of the IEEE-FUZZ Conference, San Antonio, Texas, May 2000.
[XB91] X. Xie, G. Beni. "A Validity Measure for Fuzzy Clustering". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 4, August 1991.
