Design Considerations for a CBR-based Intelligent Data Mining Assistant Michel Charest, Sylvain Delisle and Ofelia Cervantes1 Département de mathématiques et d’informatique Université du Québec à Trois-Rivières Québec, Canada, G9A 5H7 {michel.charest, sylvain.delisle}@uqtr.ca,
[email protected]
Abstract The realization of an intelligent data mining assistant for the non-specialist data miner is a challenging and complex endeavour. By its very nature, the effective application of a data mining process involves many difficult technical decisions. Consequently, the realization of a CBR-based data mining assistant involves a host of important design considerations (i.e. case representation, similarity measures, indexing and feature weights, reuse and revision strategies, seed cases and case maintenance issues). In this paper, we present the key design considerations that were addressed during our implementation of a CBR-based intelligent data mining assistant, capable of empowering the non-specialist data miner throughout the key phases of the CRISP-DM data mining process.
1. Introduction Although data mining promises to uncover valuable, useful and implicit knowledge from one’s data repositories, its effective application still faces some very serious challenges:
· Data mining research focuses almost exclusively on specialized techniques, whereas research on strategic, methodological and epistemological aspects of data mining (DM) is rare.
· Current DM processes make very little use of existing corporate knowledge. Consequently, DM is more tedious than necessary and tends to produce already known information.
This paper continues previous work, which established a strong case for exploiting the synergistic combination of a DM ontology and the case-based reasoning (CBR) paradigm to achieve efficient data mining assistance [1]. Section 2 presents some key challenges and Section 3 summarizes related work in the field of intelligent DM assistance. Section 4 provides a system overview of our DM assistant and Section 5 elaborates on key design considerations. Section 6 presents a discussion, while Section 7 presents conclusions and future work.
2. The Challenges
2.1 Support the Non-Expert Data Miner Most commercial data mining products either do not offer any intelligent assistance (i.e. decision support) or tend to do so in the form of rudimentary “wizard-like” interfaces. These wizard-like interfaces make hard assumptions about the level of background knowledge required by a user (e.g. Oracle Data Miner, SAS Enterprise Miner, etc.). The following is a concise list of some important decisions that plague the novice data miner:
· How to effectively perform data quality verification (e.g. outliers)?
· How to efficiently perform the data preparation phase (e.g. normalization, discretization)?
· Which statistical or machine learning algorithm is most appropriate (e.g. decision tree or neural net)?
· Which training parameters are most suitable?
· How to evaluate the data mining effort (e.g. cross-validation, p-value, ROC curves)?
The fields of statistics and machine learning have produced a myriad of models/algorithms that can be readily exploited by data miners. This profusion of algorithms has, in turn, burdened the data miner with more difficult decisions that must be addressed in order to apply DM effectively and produce useful and meaningful results.
2.2 Fostering Knowledge Reuse With respect to the overall data mining process, most enterprises do not directly manage tacit knowledge (i.e. answers to above-mentioned queries) in a form that can be effectively stored, refined and reused. Most products simply archive DM activities, but leave it up to the user to intelligently manage this knowledge. An intelligent DM assistant should possess characteristics that allow it to learn from past experience and empower the user of the system to avoid the repetition of mistakes (i.e. CBR paradigm).
2.3 Beyond Model Selection Support The selection of an appropriate algorithm for a given data mining task may be considered necessary, but is definitely not sufficient for ensuring the successful outcome of a DM project. Our appeal to an intelligent DM assistant implies the realization of a system that is capable of aiding a user throughout the key phases of the data mining process (i.e. data understanding, data preparation and data modeling).
1 Escuela de Ingeniería y Ciencias, Universidad de las Américas, Puebla, México.
2.4 Inherent Complexity of Data Mining Over the past several decades, the field of data mining has witnessed tremendous growth by profiting from the advancements of diverse domains (machine learning, statistics, information theory, data warehousing, etc.) and specialized sub-fields (e.g. data visualization, neural networks, probabilistic methods, ensemble learning, etc.). Nowadays, not only must novice data miners contend with the complexities of their respective fields of application (e.g. data interpretation), they must also manage the inherent complexities associated with effectively using the available “arsenal” of data mining tools, methods and algorithms.
3. Related Work Previous research efforts into intelligent DM assistants have primarily focused on providing a user with model selection support ([2], [3], [4], [5]). As shall be elaborated upon later, our DM assistant takes a more “holistic” approach to the DM support problem by providing assistance for the key phases of the CRISP-DM methodology. Other research efforts have also demonstrated the potential benefits of implementing data mining assistants that provide support beyond feature or model selection. For instance, Bartlmae [6] has provided a framework, based on the CRISP-DM methodology [7] and the CBR and Experience Factory paradigms, for capturing coarse-grained knowledge (e.g. lessons, guidelines, documentation, etc.) during the mining process. While our proposed approach is similar, our focus has been to provide more fine-grained, detailed knowledge (as mentioned in Section 2.4 above). In addition, Bernstein et al. have proposed an intelligent data mining assistant based on the use of an ontology [8]. Their ontology contains constraints and performance knowledge that is searched in order to produce a ranking of possible satisfactory DM processes. Our aim is to provide intelligent data mining assistance via the synergistic combination of both the CBR and ontology knowledge representation paradigms.
4. System Overview As illustrated in Figure 1, our hybrid DM assistant consists of six main components: a DM Case Base, a DM Ontology, a Case Reasoner, a Rule Reasoner, a DL Reasoner (Description Logic), and a DM Assistant Interface. The CBR and DM ontology subsystems have well defined knowledge representation roles. The DM Ontology and Rule Base defines high-level concepts (e.g. tasks, activity types, algorithms, etc.) and case adaptation knowledge, while the DM Case Base holds detailed case information (e.g. data quality verification, data preparation steps, model parameters, etc.). Operation of the intelligent DM assistant is initiated by the user specifying a query DM problem (i.e. problem characteristics). Subsequently, the Case Reasoner provides a subset of similar, previously resolved DM cases to the user. Once a user has chosen a basis case, the adaptation cycle is carried forward (assisted) by the inference capabilities provided by the Rule Reasoner and the detailed domain knowledge within our DM ontology. The DM ontology (by formally capturing concepts, relationships, constraints and rules) is capable of complementing the CBR system and addressing this need for more detailed domain knowledge. For the moment, the DL Reasoner is strictly used for the purposes of ensuring the consistent evolution of the DM ontology.
Figure 1. DM Assistant System Architecture [diagram not reproduced; labeled elements: DM Case Base, DM Ontology and Rule Base, Case Reasoner, Rule Reasoner, DL Reasoner, DM Assistant Interface, Knowledge Expert, Decision Maker (Data Miner), DM or Statistics Tool Kit, Data Warehouse, Model Deployment Decision Support]
5. Design Considerations This section addresses the key design issues that arose in implementing the DM Case Base and Case Reasoner components of our intelligent DM assistant. Our choice of the CBR paradigm as the basis for our intelligent DM system was motivated by the fact that no explicit formal knowledge (from first principles) is available for DM. As such, it seemed intuitive for DM activities to be captured as “experience” in the form of cases.
Figure 2. A Data Mining Meta-Learning Problem [diagram not reproduced; a meta-learner (CBR) maps DM Problems in the DM Problem Space to DM Cases in the DM Solution Space]
In addition, since DM is a domain where few documented cases are available, the use of a CBR-based approach (which imposes a less formal and less structured knowledge representation) provided the first step towards fostering knowledge reuse of DM activities.
5.1 A DM Case Representation
5.1.1 Grounded in Meta-Learning. The process of providing DM assistance can be viewed as a meta-learning problem. One of the key objectives of meta-learning has been to build meta-classifiers that are capable of effectively mapping datasets to models [9]. As demonstrated by [2], [3] and [4], labeled meta-data (paired data characteristics and model performance results) can be used to induce a meta-classifier for successfully providing model selection assistance. Starting from this fundamental meta-learning principle, our focus has been to extend the meta-learning problem so as to encompass larger “meta-problem” and “meta-solution” spaces. For instance, we have added several problem description attributes (in addition to data characteristic measures, to be discussed shortly) such as Business Area, DM Activity Type, and the presence of outlier values. Essentially, the decision to introduce these attributes was motivated by an interest in providing additional discriminating meta-data for “learning” a more elaborate target concept (i.e. DM case solution). For instance, the solution part can provide additional assistance (aside from a ‘chosen model’) for managing data preparation activities (e.g. how to handle duplicate, outlier or inconsistent values). Hence, as illustrated in Figure 2, we are interested in mapping DM Problems to entire DM Cases (or DM process instances).
5.1.2 CRISP-DM Driven. Considering that our objective is to support the user beyond model selection, it seemed quite natural for us to use a DM process as the basis for the case structure and vocabulary. Hence, we chose the CRISP-DM data mining process as a basis for eliciting a set of representative attributes (features) for our case representation. CRISP-DM efficiently captures “knowledge” (in the form of a series of well defined and generalized phases, tasks and activities) of the entire data mining effort.
From this, we were able to define a case representation consisting of 54 features.
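As an illustration of the case structure described above, the following is a minimal sketch of how a DM case might pair a problem description (index features) with a multi-part solution organized by CRISP-DM phase. The class name, field names and example values are all hypothetical; the paper does not enumerate its 54 features, so only a handful of plausible ones are shown.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# None marks an index that does not apply to the dataset at hand.
IndexValue = Union[bool, str, float, None]


@dataclass
class DMCase:
    """Hypothetical DM case: problem indexes plus a per-phase solution."""
    name: str
    # Problem part: index features used for retrieval.
    indexes: Dict[str, IndexValue] = field(default_factory=dict)
    # Solution part: advice recorded for each CRISP-DM phase.
    solution: Dict[str, Dict[str, str]] = field(default_factory=dict)


# Illustrative values only; none of them come from the paper's case base.
example_case = DMCase(
    name="example case",
    indexes={
        "business_area": "finance",
        "dm_activity_type": "classification",
        "has_outliers": True,
        "ratio_missing_values": 0.05,
        "skewness": None,  # not applicable, e.g. for purely nominal data
    },
    solution={
        "data_preparation": {"outliers": "review and cap extreme values"},
        "data_modeling": {"algorithm": "decision tree"},
    },
)
```

Keeping the problem and solution parts in separate mappings mirrors the idea of mapping a DM problem to an entire DM case rather than to a single chosen model.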
5.2 Similarity and Utility-Oriented Retrieval The following elaborates on specific design issues that were considered during the implementation of the case retrieval mechanism for our DM assistant, and particularly our proposal for a new approach for improving retrieval accuracy based on utility theory.
5.2.1 Eliciting Feature Indexes. Motivated by the previously discussed DM meta-learning problem, we elicited a total of 15 indexes from six different “discriminating” areas or types. These discriminating areas are illustrated in Figure 3 (along with their respective index counts). For instance, the Business area index is used to discriminate amongst a finite set of application areas within a given organization (e.g. engineering, finance, marketing, etc.). In addition, we implemented a “core” subset of feature indexes commonly available from the area of meta-learning (indicated by General, Statistical and Info-Theory in Figure 3), based on the works of [10] and [4] (e.g. skewness, mutual information, etc.). Moreover, we introduced four additional indexes (i.e. ratio duplicates, has outliers, ratio missing values, has inconsistent values), since these provide discriminating power for representing DM cases from which valuable knowledge concerning data quality and data preparation can later be solicited by the system.
5.2.2 The Similarity Measures. The case-based reasoning component of our DM assistant was initially implemented using a nearest neighbor classifier (i.e. k-NN) [11] and the following global similarity measure (GSM):
$$ GSM(q, c) = \frac{\sum_{i=1}^{N} \delta_i \, w_i \, sim_i(q, c)}{\sum_{i=1}^{N} \delta_i} \qquad (1) $$
The discriminating parameter (explained later on), feature weight and local similarity measure are respectively indicated by δ, w and sim(q,c) in Equation (1). Our choice for using the nearest neighbor was mainly motivated by its simplicity, flexibility (low maintenance for adding new cases), reasonable performance, and by the fact that we initially had few recorded DM cases available. The aforementioned index features were defined as
binary, nominal and numerical types. Hence, the local similarity measures for the binary and nominal indexes were defined using exact matching, and the numerically bounded indexes (e.g. ratio of symbolic attributes) were specified using linear normalized functions (with the exception of the statistical and information-theoretic measures, which were represented by non-linear distance functions) [12].
5.2.3 Handling Missing Indexes. Depending on whether a given DM problem’s dataset contains only numerical or only nominal values, the statistical or information-theoretic index values may or may not contribute to the problem characterization. Hence, the δ parameter is used (by providing a “non-applicability” condition) to handle potentially missing index values and to prevent them from biasing the global similarity measure. In addition, the δ parameter was also used to maintain a good proximity measure for asymmetric Boolean indexing features (has outliers and has inconsistent values) [13].
5.2.4 Setting Feature Weights. In order to improve retrieval accuracy for k-NN based approaches, it is customary to assign a weight (w) to every feature index (to reduce the influence of redundant and irrelevant features). Under an ideal setting, where ample historic DM cases are available, an effective approach for establishing weights is via the use of machine learning techniques [14]. Unfortunately, due to limited historical DM case information, an expert committee (to be addressed shortly) was consulted in order to provide preliminary feature index weights. In brief, the strongest weights were assigned to the data characteristic measures, Business Area and DM activity type indexes.
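The global similarity measure of Equation (1), together with the δ-based handling of missing and asymmetric indexes, can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the function names, the index names and the weights are hypothetical, δ is assumed to be 0 when an index is missing on either side or when an asymmetric Boolean index is False in both the query and the case (and 1 otherwise), and only the exact-match and linear local similarities are shown (the non-linear ones are omitted).

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Indexes = Dict[str, Optional[object]]      # None marks a non-applicable index
LocalSim = Callable[[object, object], float]


def exact_match(q: object, c: object) -> float:
    """Local similarity for binary and nominal indexes."""
    return 1.0 if q == c else 0.0


def linear_sim(lo: float, hi: float) -> LocalSim:
    """Linear normalized local similarity for an index bounded in [lo, hi]."""
    span = hi - lo
    return lambda q, c: 1.0 - abs(q - c) / span


def gsm(query: Indexes, case: Indexes, weights: Dict[str, float],
        local_sims: Dict[str, LocalSim],
        asymmetric: FrozenSet[str] = frozenset()) -> float:
    """Equation (1): sum(delta_i * w_i * sim_i) / sum(delta_i)."""
    num = den = 0.0
    for name, w in weights.items():
        q, c = query.get(name), case.get(name)
        if q is None or c is None:
            continue  # delta_i = 0: index not applicable on one side
        if name in asymmetric and q is False and c is False:
            continue  # delta_i = 0: joint absence carries no information
        num += w * local_sims[name](q, c)
        den += 1.0
    return num / den if den else 0.0


def retrieve(query: Indexes, case_base: Dict[str, Indexes],
             weights: Dict[str, float], local_sims: Dict[str, LocalSim],
             asymmetric: FrozenSet[str] = frozenset(),
             k: int = 5) -> List[Tuple[float, str]]:
    """Return the k most similar cases as (GSM, case name) pairs."""
    scored = [(gsm(query, idx, weights, local_sims, asymmetric), name)
              for name, idx in case_base.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical weights, local similarities and index values.
weights = {"business_area": 1.0, "ratio_symbolic": 0.5, "has_outliers": 0.5}
sims = {"business_area": exact_match,
        "ratio_symbolic": linear_sim(0.0, 1.0),
        "has_outliers": exact_match}
query = {"business_area": "finance", "ratio_symbolic": 0.5, "has_outliers": False}
stored = {"business_area": "finance", "ratio_symbolic": 0.25, "has_outliers": False}
score = gsm(query, stored, weights, sims, frozenset({"has_outliers"}))
```

In this example only two indexes contribute (the shared False value of `has_outliers` is skipped), so the score is (1.0 · 1.0 + 0.5 · 0.75) / 2. Because the denominator counts applicable indexes rather than summing weights, a case is not penalized for indexes that do not apply to it.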
Figure 3. Distribution of Index Feature Types [chart not reproduced; index counts per type: General (5), Data Quality (4), DM (1), Info-Theory (2), Business (2), Statistical (1)]
5.2.5 Utility-Oriented Retrieval. From a theoretical perspective, Bergmann et al. were the first to propose the possibility of extending traditional CBR with a utility-oriented approach [15]. Inspired by this idea, during early experimentation we quickly ascertained that, although the k-NN classifier provided the user with a subset of similar DM cases to work with, the final decision to select the most “appropriate” case was not obvious when based strictly on the global similarity measure. For example, since the solution part of a DM case is multi-dimensional (i.e. advice for data quality, data preparation, modeling parameters, model evaluation, etc.), a user may prefer a previous DM case that offers more data preparation support (over data modeling information). In addition, different DM cases hold various levels of “solution quality” depending on how each user has approached the problem. This kind of tacit knowledge embedded within the case is hard to assess merely by the use of a global similarity measure. Hence, in order to improve retrieval accuracy, we opted for defining a global utility measure (GUM) for each retained DM case. Even though this utility measure is somewhat subjective (it is based on the user’s level of satisfaction at each phase of the DM process for a previously resolved case), it can provide a significant refinement (for improving overall case retrieval accuracy) over a purely similarity-based retrieval approach. For instance, when presented with candidate cases, the user can evaluate the trade-off between the “similarity” of a case and its associated level of “usefulness”. Hence, a more informed decision can be made concerning which case is the most appropriate DM basis case (upon which adaptations can be effected to resolve a target problem). A utility function maps a “state” onto a real number (typically in the range [0,1]), which describes the associated degree of “happiness” or usefulness for achieving the state. Specifically, our implementation of a global utility measure (GUM), given by Equation (2), is the weighted sum of three local utility measures, one each for the data understanding, data preparation, and data modeling CRISP-DM phases. These local utility measures were weighted (i.e. more strongly for the data preparation and modeling phases) in a similar fashion to the approach used for the GSM.
$$ GUM(X) = \sum_{i=1}^{N} k_i \, lum_i(x_i) \qquad (2) $$
The state attributes (user satisfaction levels for a given DM phase), utility weights and local utility measures are respectively represented by the vector X, k and lum(x) in the above equation. The following example clarifies how the combined similarity and utility measures can help a user select the most appropriate case for subsequent revision. Table 1 illustrates the case retrieval results obtained from our DM assistant. The retrieved cases are ranked according to similarity (GSM); the GUM column provides additional utility information. A novice data miner would do best to choose a basis case with both a relatively strong similarity (though not necessarily the highest, as in case #1) and a strong utility measure (as shown for case #2).
Though this choice is not guaranteed to be the best, it provides some measure of the potential usefulness (i.e. adaptation knowledge) that will be available for the subsequent case adaptation process.
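The weighted sum of Equation (2) and the similarity/utility trade-off illustrated by Table 1 can be sketched as follows. Several things here are assumptions made only for illustration: the utility weights, the use of the raw satisfaction level as the local utility measure, and the `suggest_basis_case` helper with its `gsm_margin` parameter. In the system described above the final choice of basis case is left to the user, so the helper should be read as one possible heuristic and not as the authors' selection rule.

```python
from typing import Dict, List, Tuple

# Hypothetical utility weights k_i: data preparation and data modeling are
# weighted more strongly, as stated in the text, but the values are invented.
UTILITY_WEIGHTS: Dict[str, float] = {
    "data_understanding": 0.2,
    "data_preparation": 0.4,
    "data_modeling": 0.4,
}


def gum(satisfaction: Dict[str, float],
        weights: Dict[str, float] = UTILITY_WEIGHTS) -> float:
    """Equation (2), assuming lum_i(x_i) = x_i with x_i in [0, 1]."""
    return sum(k * satisfaction[phase] for phase, k in weights.items())


# (case name, GSM, GUM) rows taken from Table 1.
RETRIEVED: List[Tuple[str, float, float]] = [
    ("Student vs. GPA classifier", 0.921, 0.727),
    ("Seniority vs. Pubs classifier", 0.882, 0.922),
    ("Demography Regression", 0.827, 0.582),
]


def suggest_basis_case(retrieved: List[Tuple[str, float, float]],
                       gsm_margin: float = 0.05) -> Tuple[str, float, float]:
    """Among cases whose GSM is within gsm_margin of the best, pick max GUM."""
    best_gsm = max(g for _, g, _ in retrieved)
    shortlist = [row for row in retrieved if row[1] >= best_gsm - gsm_margin]
    return max(shortlist, key=lambda row: row[2])


suggestion = suggest_basis_case(RETRIEVED)
example_gum = gum({"data_understanding": 0.5,
                   "data_preparation": 1.0,
                   "data_modeling": 0.75})
```

With the Table 1 values and a margin of 0.05, cases #1 and #2 are shortlisted and case #2 is suggested because of its higher GUM, which matches the reasoning given in the text. Setting `gsm_margin` to 0 reduces the helper to purely similarity-based selection.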
5.3 Reuse and Revision Strategies DM methodologies such as CRISP-DM adequately specify the phases, tasks and activities that need to be carried out during a DM project, but provide very little “detailed knowledge” for the novice miner on how to actually carry out a given step. Hence, for complex application domains such as DM assistance, where detailed domain knowledge may be required to decide on the appropriate choice for a given case attribute (and its potential impact on other attributes), it is fair to say that achieving semi-automated case adaptation support invariably requires a complementary knowledge source (i.e. a knowledge-intensive approach). For example, the proper application of a simple linear regression model often requires that the user possess detailed knowledge for effectively carrying out the model evaluation phase for a given DM task (i.e. significance testing, residual normality and model variance verification). As such, in previous work [16] we have proposed the use of a DL-based ontology (combining both declarative and procedural knowledge in the form of SWRL rules [17]) and an accompanying rule-based reasoner in order to support the CBR reuse and revision phases (by providing recommendations and attribute transformations when possible). A screen capture of the DM Assistant Interface and accompanying recommendations is provided in Figure 4.
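To make the idea of rule-driven recommendations concrete, the following plain-Python sketch stands in for the rule-based reasoner. It is not SWRL and does not reflect the actual ontology or rule base of the system; the rule table, the `recommend` function and the attribute names are hypothetical, and the only domain knowledge encoded is the linear regression example given in the text.

```python
from typing import Callable, Dict, List, Tuple

Solution = Dict[str, str]
# A rule pairs a condition on the case solution with a recommendation.
Rule = Tuple[Callable[[Solution], bool], str]


def uses_linear_regression(solution: Solution) -> bool:
    """Condition: the chosen model is a simple linear regression."""
    return solution.get("algorithm") == "simple linear regression"


# Hypothetical rule table; the three checks come from the example in the text.
RULES: List[Rule] = [
    (uses_linear_regression, "Carry out significance testing."),
    (uses_linear_regression, "Verify residual normality."),
    (uses_linear_regression, "Verify the model variance."),
]


def recommend(solution: Solution) -> List[str]:
    """Return the recommendations of every rule whose condition holds."""
    return [advice for condition, advice in RULES if condition(solution)]


advice = recommend({"algorithm": "simple linear regression"})
```

A basis case whose solution names a different algorithm simply triggers none of these rules and yields an empty recommendation list.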
5.4 Seed Case Generation Our primary concern during the authoring of initial “seed cases” was to ensure reasonably good case competence (i.e. coverage of the target problem space) over a constrained problem space. Although
novel approaches have been proposed for authoring cases to ensure sufficient competence and performance, most of these approaches make strong simplifying assumptions (e.g. large quantities of available cases, low problem description index feature dimension, representativeness assumption, etc.) [18]. Opting for a simpler approach, we assembled a committee of experts (i.e. a statistician, a domain analyst and an expert data miner) to analyze data-related issues, DM case particularities and case maintenance strategies. Essentially, a preliminary statistical analysis (i.e. means and standard deviations of most index features) was performed on target datasets, and seed DM activities were carried out in the vicinity of the established feature means (i.e. a constrained area of the problem space). Subsequently, the DM cases were individually reviewed by the expert committee against simple criteria (i.e. relevance and solution quality).
Table 1. Utility-Oriented Retrieval Example
#   CASE NAME                        GSM     GUM
1   Student vs. GPA classifier       0.921   0.727
2   Seniority vs. Pubs classifier    0.882   0.922
3   Demography Regression            0.827   0.582
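The seed-case procedure of Section 5.4 (compute means and standard deviations of the index features over the target datasets, then author seed cases near those means) can be sketched as follows. The function names, the index name and the one-standard-deviation threshold `z` are assumptions; the text does not state how “vicinity” was quantified.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Stats = Dict[str, Tuple[float, float]]


def feature_stats(datasets: List[Dict[str, float]]) -> Stats:
    """Mean and sample standard deviation of each numeric index feature."""
    names = datasets[0].keys()
    return {n: (mean([d[n] for d in datasets]), stdev([d[n] for d in datasets]))
            for n in names}


def in_vicinity(candidate: Dict[str, float], stats: Stats, z: float = 1.0) -> bool:
    """True if every feature lies within z standard deviations of its mean."""
    return all(abs(candidate[n] - m) <= z * s for n, (m, s) in stats.items())


# Hypothetical index values measured on three target datasets.
target_datasets = [{"ratio_missing_values": 0.2},
                   {"ratio_missing_values": 0.4},
                   {"ratio_missing_values": 0.6}]
stats = feature_stats(target_datasets)
near = in_vicinity({"ratio_missing_values": 0.5}, stats)
far = in_vicinity({"ratio_missing_values": 0.9}, stats)
```

A candidate seed dataset that fails the check lies outside the constrained area of the problem space and would, under this reading, be a poor choice for an initial seed case.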
5.5 Case Maintenance Issues Smyth and McKenna [19] define three types of CBR performance metrics: efficiency, competence and solution quality. For the moment, since our case-base is of a modest size, efficiency (problem solving time) shall not be a concern. Although approaches for defining case-base competence models and performance metrics are available ([20], [21]), we again faced difficulties with formalizing the notions of competence and solution quality (see Section 2.4). Our approach mainly consisted of periodically consulting the committee of experts to evaluate possible outlier cases, duplicate cases and corrective actions to ensure “qualitative” case competence and solution quality.
Figure 4. Data Mining Assistant Showing Case Details (Left Panel) and Recommendations (Right Panel)
6. Discussion It is foreseeable that future research could allow us to assign a probabilistic “confidence measure” to the local utilities specified for each stored case. In other words, a quality measure could be assigned to a utility based on a data miner’s past record of resolving “high quality” cases. Furthermore, the primary motivation for introducing a utility measure to refine case retrieval was our inability to use traditional information retrieval metrics (i.e. recall and precision) for evaluating our system. For instance, recall was a moot point since we are using a fixed 5-NN retrieval method. In addition, the notion of retrieval precision was intractable since our case representation is complex and multi-classed (classifying DM case solutions is a non-trivial task), unlike a simple classification or regression problem where the class label is discernible.
During early research on the most appropriate structure for our DM case (i.e. flat or hierarchical), we recognized that splitting our case representation into distinct sub-cases (i.e. the 5 distinct phases of the CRISP-DM process) could pose serious problems later on for retrieving and reconstructing a coherent and similar DM case for the user.
7. Conclusions and Future Work Although the field of case-based reasoning is relatively mature, one of the key objectives of this paper has been to emphasize the design considerations that were essential for the successful implementation of a CBR-based intelligent DM assistant. This has reinforced the pivotal role that the CBR component has played in addressing some of the previously mentioned challenges and goals (i.e. fostering knowledge reuse, beyond model selection support). In addition, at the core of our CBR implementation lies our proposal for extending the classical meta-learning problem from mapping datasets to models, to mapping DM problems to DM cases. Last, but not least, we have also proposed a novel utility-oriented measure that complements the similarity measure, allowing a user to make a more informed decision with respect to the selection of a basis case during the CBR retrieval phase. The next steps will lie in conducting more DM activities in order to increase the size of our DM case base, and eventually to provide a more critical evaluation of the system’s overall performance.
References
[1] M. Charest, S. Delisle, O. Cervantes, and Y. Shen, Intelligent Data Mining Assistance via CBR and Ontologies, DEXA-PMKD Workshop. Poland, 2006
[2] A. Kalousis and M. Theoharis, NOEMON: Design, Implementation and Performance Results of an Intelligent Assistant for Classification Selection. Intelligent Data Analysis, Elsevier, 1999
[3] MetaL, A Meta-Learning Assistant for Providing User Support in Machine Learning and Data Mining, in www.metal-kdd.org, 2002
[4] G. Lindner and R. Studer, AST: Support for Algorithm Selection with a CBR Approach. Recent Advances in Meta-Learning and Future Work, 1999, 38-47
[5] A. Suyama, N. Negishi, and T. Yamaguchi, CAMLET: A Platform for Automatic Composition of Inductive Applications Using Ontologies. Progress in Discovery Science, 2002
[6] K. Bartlmae, Optimizing Data-Mining Processes: A CBR Based Experience Factory for Data Mining, ICSC, Springer, 1999
[7] CRISP-DM 1.0, A Step-by-step Data Mining Guide, www.crisp-dm.org, 2000
[8] A. Bernstein, F. Provost, and S. Hill, Intelligent Assistance for the Data Mining Process: An Ontology-based Approach, IEEE Transactions on Knowledge and Data Engineering, 2005
[9] R. Vilalta, C. Giraud-Carrier, P. Brazdil, and C. Soares, Using Meta-Learning to Support Data Mining. Journal of Computer Science Applications, 2004, 31-45
[10] C. Castiello, G. Castellano, and A. Fanelli, Meta-Data: Characterization of Input Features for Meta-Learning. Modeling Decisions for Artificial Intelligence, LNAI, 2005, 457-468
[11] D. Aha, D. Kibler, and M. Albert, Instance-Based Learning Algorithms. Kluwer Academic Publishers, 1991. 6, 37-66
[12] T. Liao, Z. Zhang, and C. Mount, Similarity Measures for Retrieval in Case-Based Reasoning Systems. Applied Artificial Intelligence, 1998. 12, 267-288
[13] P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining (Boston, MA: Addison-Wesley, 2005).
[14] D. Wettschereck and D. Aha, Weighting Features, International Conference on Case-Based Reasoning, 1995
[15] R. Bergmann, M. Richter, S. Schmitt, A. Stahl, and I. Vollrath, Utility-Oriented Matching: A New Research Direction for Case-Based Reasoning, GWCBR-2001 Proceedings of the 9th German Workshop on Case-Based Reasoning. Baden-Baden, Germany, 2001
[16] M. Charest and S. Delisle, Ontology-Guided Intelligent Data Mining Assistance: Combining Declarative and Procedural Knowledge, 10th Int. Conference on Artificial Intelligence and Soft Computing. Mallorca, Spain, 2006
[17] SWRL, Semantic Web Rule Language, http://www.w3.org/Submission/SWRL/
[18] D. McSherry, Intelligent Case-Authoring Support in CaseMaker-2. Computational Intelligence, 2001. 17(2)
[19] B. Smyth and E. McKenna, Building Compact Competent Case-Bases, 3rd International Conference in CBR. Berlin, 1999
[20] B. Smyth and E. McKenna, Competence Models and the Maintenance Problem. Computational Intelligence, 2001. 17(2)
[21] D. Leake and D. Wilson, Remembering Why to Remember: Performance-Guided Case Base Maintenance, Proceedings of EWCBR-2K, 2000