Evaluating the Success Level of Data Mining Projects ... - IEEE Xplore

Evaluating the Success Level of Data Mining Projects Based on CRISP-DM Methodology by a Fuzzy Expert System

Ahmad Nadali
Department of Information Technology Management, Science and Research Branch, Islamic Azad University, Tehran, Iran E-mail: [email protected]

Elham Naghizadeh Kakhky
Department of Information Technology Management, Shahid Beheshti University, Tehran, Iran E-mail: [email protected]

Hamid Eslami Nosratabadi* Member of Young Researchers Club, Science and Research Branch, Islamic Azad University, Tehran, Iran E-mail: [email protected] *Corresponding Author

Abstract—One of the critical issues in the data mining process, especially for organizations, is evaluating the success level of completed data mining projects. The purpose of this research is to design a fuzzy expert system that evaluates the success level of data mining projects based on the quality of the phases of CRISP-DM, one of the best-known data mining methodologies. The CRISP-DM phases are specified as the inputs of a Fuzzy Inference System (FIS) model, and the output is the success level of the data mining project. The system was designed in MATLAB and applied to a data mining project in an Iranian bank as an empirical study.

Keywords- Data Mining Projects, CRISP-DM Methodology, Fuzzy Expert System.

I. INTRODUCTION

Since its emergence in the early 1990s, data mining (DM), defined as the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data [1], has been a main concern of many researchers and has been widely used in various applications. Research in data mining has addressed a broad range of applications such as sales and customer relationship management [2–4], financial forecasting [5], fraud detection [6], gene mapping [7], sky survey cataloging [8], mining the datasets of meteorological offices [9], and mining of health care data [10–11]. Although DM methods are gaining enormous recognition and popularity because they support performance improvement and cost reduction in many industrial and business applications, their success is still somewhat limited [12]. Among the immense number of DM projects being developed, not all project results are in use [13–15], nor do all projects end successfully [16–17]. The failure rate is actually as high as 60% [18]. In contrast to the large number of papers on algorithms and the modeling phase of data mining, little attention has been paid to the analysis of the data mining project itself, and there is therefore little research on the success of data mining projects. The study in [19] identified four criteria, namely data quality, human resources, financial budget, and executive support, as the most important factors affecting the success of a DM project. The study in [20] emphasized the role of effective project management, and especially systematic documentation, as the key factor for successful Knowledge Discovery in Databases (KDD) projects.

Despite all the efforts made to introduce methods for managing data mining projects, [21] argues that several common pitfalls in DM projects can be summarized as a lack of methodology for project development, and [20] states that in practice KDD projects are still approached in an unstructured, ad hoc manner. CRISP-DM [22–23] is an attempt to provide an industrial standard for the practice of DM. It comprises six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment. Since CRISP-DM is reported to be the most frequently used methodology, adopted by 42% of the companies interviewed and followed by companies using their own methodology (28%), it seems natural that CRISP-DM has been used in numerous studies of DM applications in different areas [24–25]. CRISP-DM phases have therefore been the main focus of studies that discuss the success of DM projects from a methodological point of view, although this literature has concentrated heavily on the first two steps of the DM process as the major indicators of the success or failure of data mining projects [26–28].

Since the quality and correctness with which each CRISP-DM phase is performed seems crucial to the success of the whole project, we decided to study the role of each CRISP-DM phase in the success of DM projects. The influence of the quality level of each CRISP-DM phase on the success level of DM projects has therefore been investigated by consulting data miners. Because the data miners express this relation verbally, we have used a fuzzy expert system to evaluate the success level of DM projects based on the main DM phases.

The remainder of this paper is structured as follows: in the next section the data mining process and the CRISP-DM standard are reviewed; in section 3 the concept of fuzzy expert systems is outlined; in section 4 the empirical study of this research is presented as a case study, together with an example of a data mining project in an Iranian bank; and finally, in section 5, conclusions are drawn.

II. DATA MINING PROCESS AND CRISP-DM METHODOLOGY

Various data mining methodologies [2], [22], [29–33] have been proposed in the literature to provide explicit guidance towards the process of implementing data mining

___________________________________ 978-1-4244 -8679-3/11/$26.00 ©2011 IEEE


projects. These methodologies describe a data mining project as a sequence of phases and highlight the particular tasks and corresponding activities to be performed during each phase. The KDD process, as presented in [32], comprises the following five stages for extracting knowledge from a database: (1) Selection: creating a target data set, or focusing on a subset of variables or data samples, on which discovery is to be performed; (2) Pre-processing: cleaning and pre-processing the target data in order to obtain consistent data; (3) Transformation: transforming the data using dimensionality reduction or transformation methods; (4) Data Mining: searching for patterns of interest in a particular representational form, depending on the DM objective (usually prediction); (5) Interpretation/Evaluation: interpreting and evaluating the mined patterns [34].

SEMMA (Sample, Explore, Modify, Model and Assess), on the other hand, is a methodology oriented towards selecting, exploring and modeling large amounts of data in order to discover business patterns in the data [33]. The process begins with the extraction of a data sample to which the analysis will be applied. Once the sample is selected, the methodology proposes exploring the data in order to simplify the model. The third phase involves modifying the data for input to the DM tool. The fourth phase involves running the DM tool on the selected data. The last phase consists of evaluating the results by comparing the model against statistical models or a new sample [33].

The cross-industry standard process for data mining (CRISP-DM) is another well-known process model for developing data mining projects. It was proposed by a consortium of companies including Teradata, SPSS (ISL), Daimler-Chrysler and OHRA. CRISP-DM defines the processes and tasks required to develop a successful data mining project [35]. As mentioned before, since its introduction in 1996 CRISP-DM has been the most favored methodology in the data mining domain [22]. Therefore, we have chosen it as our reference model. The six phases of CRISP-DM are shown in Fig. 1 and briefly described as follows.

Fig 1. Phases of the CRISP-DM Process Model [22]

1) Business understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

2) Data understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

3) Data preparation: The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modeling tools.

4) Modeling: In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.

5) Evaluation: At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

6) Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. It often involves applying “live” models within an organization’s decision making processes, for example in real-time personalization of Web pages or repeated scoring of marketing databases. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. In many cases it is the customer, not the data analyst, who carries out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions need to be carried out in order to actually make use of the created models [22].

III. FUZZY EXPERT SYSTEM

Fuzzy expert systems use fuzzy data, fuzzy rules and fuzzy inference, in addition to the standard ones implemented in ordinary expert systems [36]. Fuzzy Inference Systems (FIS) are very good tools because they have the property of nonlinear universal approximation. They are suitable for handling experimental data as well as a priori knowledge about the unknown solution, which is expressed by inferential linguistic rules of the form IF-THEN whose antecedents and consequents use fuzzy sets instead of crisp numbers. In


other words, a Fuzzy Inference System incorporates fuzzy inference and rule-based expert systems. Fuzzy inference in this system refers to the use of computer programs to carry out inference work resembling what humans do daily. The input consists of ambiguous linguistic terms or unclear concepts describing a specific event [37]. Fuzzy inference systems can express human expert knowledge and experience through fuzzy inference rules represented as “if-then” statements. The fuzzy inference process has five steps: fuzzify the inputs, apply the fuzzy operator, apply the implication method, aggregate all outputs, and defuzzify. To obtain a good FIS, the researchers must possess domain knowledge, and this knowledge has to be represented in symbolic form and be complete, correct and consistent [37]. Following the fuzzy inference mechanism, the output can be a fuzzy set or a precise set of certain features. Fuzzy inference infers the results from the existing knowledge base.

1) Fuzzy concept base: This contains the terminology and relevant predicates of a linguistic expression. Each term lies in the domain of a fuzzy set and has pre-defined membership values denoted by predicates.

2) Fuzzy proposition base: Membership functions are attached to the fuzzy propositions, which are derived from the fuzzy concept base. There are numerous types of membership functions, such as S-shape, Z-shape, and Π-shape, all easily definable with equations and parameters [38].

Several types of fuzzy systems have been introduced. Mamdani fuzzy systems and TSK fuzzy systems are the two types most commonly used in the literature, and they differ in how they represent knowledge. The TSK (Takagi-Sugeno-Kang) fuzzy system was proposed in an effort to develop a systematic approach to generating fuzzy rules from a given input–output data set [2].
The numeric analysis approach to fuzzy systems was first presented by Takagi and Sugeno, and many studies have followed [38]. A basic Takagi–Sugeno fuzzy inference system is an inference scheme in which the conclusion of a fuzzy rule is a weighted linear combination of the crisp inputs rather than a fuzzy set. The rules have the following structure:

If x is A1 and y is B1, then z1 = p1x + q1y + r1 (1)

where p1, q1, and r1 are linear parameters. A TSK fuzzy controller usually needs a smaller number of rules, because its output is already a linear function of the inputs rather than a constant fuzzy set [39].

The Mamdani fuzzy system was proposed as the first attempt to control a steam engine and boiler combination by a set of linguistic control rules obtained from experienced human operators. Rules in Mamdani fuzzy systems have the following form:

If x1 is A1 AND/OR x2 is A2 Then y is B1 (2)

where A1, A2 and B1 are fuzzy sets. The fuzzy set obtained by aggregating the results of the rules is defuzzified using a defuzzification method such as centroid (center of gravity), max membership, mean-max, or weighted average. The centroid method is very popular; in it, the “center of mass” of the result provides the crisp value. In this method, the defuzzified value of fuzzy set A, d(A), is calculated by formula (3):

$$ d(A) = \frac{\int_A x \, \mu_A(x) \, dx}{\int_A \mu_A(x) \, dx} \qquad (3) $$

where μ_A(x) is the membership function of fuzzy set A. In our problem, the various possible conditions of the parameters are stated in the form of fuzzy sets. Mamdani fuzzy systems are therefore used, because the fuzzy rules that represent expert knowledge in Mamdani systems use fuzzy sets in their consequents, whereas in TSK fuzzy systems the consequents are expressed as a crisp function [40].

IV. EMPIRICAL STUDY

As mentioned in the first section, CRISP-DM comprises six phases. In this section we evaluate the success of data mining projects by considering the quality level of each of these phases. Since data miners and data analysts give their assessments of the success level of data mining projects and CRISP-DM phases verbally, these judgments are often imprecise and vague. To address this problem we have designed a fuzzy expert system based on the Mamdani Fuzzy Inference System whose inputs are the success levels of the Business understanding (Bu), Data understanding (Du), Data preparation (Dp), Modeling (Mo), Evaluation (Ev) and Deployment (De) phases, and whose output is the success level (SL) of the data mining project. The system is built on a set of rules obtained from data mining experts at the Iranian Saman Bank regarding the relation between the input variables and the output. After the input and output variables were determined, a membership function was defined for each of them. These membership functions are illustrated in Figures 2 to 8.
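The Mamdani scheme described above (fuzzify the six phase scores, combine them per rule, clip the consequent, aggregate, and take the centroid) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' MATLAB system: the membership-function centers and widths, the use of min for AND and for implication, max for aggregation, and the reading of the labels as VL/L/M/H/VH = very low to very high are all assumptions made here, and plain Gaussians stand in for the Gaussian2 shapes. The ten rules are the sample rules listed in Table 1, and the example inputs are the phase scores reported in the empirical study, so the printed value should not be expected to reproduce the paper's 0.73.

```python
import math

def gbell(x, a, b, c):
    """Generalized bell membership: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def gauss(x, sigma, c):
    """Gaussian membership centered at c with width sigma."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

# Hypothetical term centers on a [0, 1] scale (not taken from the paper).
THREE = {"L": 0.0, "M": 0.5, "H": 1.0}
FIVE = {"VL": 0.0, "L": 0.25, "M": 0.5, "H": 0.75, "VH": 1.0}

def three_gbell(x, term):
    return gbell(x, 0.25, 2, THREE[term])

def three_gauss(x, term):
    return gauss(x, 0.2, THREE[term])

def five_gauss(x, term):
    return gauss(x, 0.1, FIVE[term])

ORDER = ("Bu", "Du", "Dp", "Mo", "Ev", "De")
INPUT_MF = {"Bu": three_gbell, "Du": three_gbell, "Dp": five_gauss,
            "Mo": three_gauss, "Ev": three_gauss, "De": three_gauss}

# Sample rules from Table 1: (Bu, Du, Dp, Mo, Ev, De) -> SL.
RULES = [
    (("H", "H", "H", "M", "H", "H"), "VH"),
    (("M", "M", "M", "H", "H", "M"), "M"),
    (("L", "M", "VL", "M", "M", "L"), "VL"),
    (("H", "L", "M", "L", "L", "L"), "L"),
    (("M", "H", "VH", "H", "M", "M"), "H"),
    (("L", "M", "L", "M", "L", "L"), "VL"),
    (("M", "L", "VL", "H", "M", "M"), "L"),
    (("H", "M", "H", "M", "M", "M"), "M"),
    (("M", "L", "VH", "H", "L", "H"), "M"),
    (("H", "M", "VH", "M", "M", "M"), "H"),
]

def success_level(inputs, steps=1001):
    """Mamdani inference: min for AND and implication, max aggregation,
    centroid defuzzification approximated on a uniform grid over [0, 1]."""
    fired = []
    for antecedent, consequent in RULES:
        strength = min(INPUT_MF[name](inputs[name], term)
                       for name, term in zip(ORDER, antecedent))
        fired.append((strength, consequent))
    num = den = 0.0
    for i in range(steps):
        y = i / (steps - 1)
        mu = max(min(w, five_gauss(y, term)) for w, term in fired)
        num += y * mu  # discrete counterpart of the numerator of formula (3)
        den += mu      # discrete counterpart of the denominator
    return num / den if den > 0 else 0.5

case = {"Bu": 0.80, "Du": 0.70, "Dp": 0.85, "Mo": 0.75, "Ev": 0.90, "De": 0.65}
print(round(success_level(case), 2))
```

Because the grid is uniform, the step width cancels in the ratio, so the two sums approximate the two integrals of formula (3) directly. Replacing the placeholder parameters with the membership functions shown in Figures 2 to 8 and adding the remaining expert rules would be needed before the output could be compared with the authors' result.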

Fig 2. Three Gbell Membership functions for Business Understanding

Fig 3. Three Gbell Membership functions for Data Understanding

Fig 4. Five Gaussian2 Membership functions for Data Preparation

Fig 5. Three Gaussian2 Membership functions for Modeling

Fig 6. Three Gaussian Membership functions for Evaluation

Fig 7. Three Gaussian Membership functions for Deployment

Fig 8. Five Gaussian Membership functions for Success Level

In the next step, the fuzzy expert system was designed in MATLAB according to the rules obtained from the data mining experts. A sample of these rules is shown in Table 1.

Table 1. The rules of the designed fuzzy expert system

Rule  Bu  Du  Dp  Mo  Ev  De  SL
1     H   H   H   M   H   H   VH
2     M   M   M   H   H   M   M
3     L   M   VL  M   M   L   VL
4     H   L   M   L   L   L   L
5     M   H   VH  H   M   M   H
6     L   M   L   M   L   L   VL
7     M   L   VL  H   M   M   L
8     H   M   H   M   M   M   M
9     M   L   VH  H   L   H   M
10    H   M   VH  M   M   M   H

This system is capable of predicting the success level of data mining projects based on the quality level of the CRISP-DM phases. Using the proposed fuzzy expert system and the opinions of the bank's data mining experts, we evaluated a data mining project that had previously been carried out at Saman Bank according to the CRISP-DM methodology [41]. The assessed phase scores were as follows: Business understanding: 0.80; Data understanding: 0.70; Data preparation: 0.85; Modeling: 0.75; Evaluation: 0.90; Deployment: 0.65. The resulting Success Level is 0.73.

V. CONCLUSION

Our main goal was to evaluate the success of data mining projects based on the quality level of each of the six phases of the CRISP-DM methodology. To reach this aim, a fuzzy expert system was designed in which the six main phases of the data mining process serve as the inputs and the success level of the data mining project is the output. Membership functions were defined for each of these variables, and a Mamdani fuzzy expert system was then designed based on rules obtained from data miners, who were treated as data mining experts. As an evaluation system, it is able to determine the success level of data mining projects from the quality level of the main phases of the data mining process. Its advantages include predicting the success of data mining projects in various application areas and determining the effect of quality variations in each phase on the success of the whole project.

ACKNOWLEDGEMENT

We thank the data mining experts of the Iranian Saman Bank, who shared their knowledge with us as researchers.

REFERENCES


[1] W. Frawley, G. Piatetsky-Shapiro, C. Matheus, “Knowledge discovery in databases: An overview”, AI Magazine, 1992, pp. 213–228.
[2] M. Berry, G. Linoff, “Data mining techniques for marketing, sales and customer support”, John Wiley and Sons, 1997.
[3] M. Berry, G. Linoff, “Mastering data mining: The art and relationship of customer relationship management”, John Wiley and Sons, 2000.
[4] S.-Y. Hung, D. C. Yen, H.-Y. Wang, “Applying data mining to telecom churn Management”, Expert Systems with Applications, 2006, 31(3), pp. 515–524.
[5] S.-H. Chun, Y.-J. Park, “A new hybrid data mining technique using a regression case based reasoning: Application to financial forecasting”, Expert Systems with Applications, 2006, 31(2), pp. 329–336.
[6] T. Fawcett, F. Provost, “Adaptive fraud detection”, Data Mining and Knowledge Discovery, 1997, 1(3), pp. 291–316.
[7] M. Kantardzic, J. Zurada, “Next generation of data-mining applications”, Wiley–IEEE Press, 2005.
[8] J. Sim, “Critical success factors in data mining projects”, PhD thesis, University of North Texas, 2003.
[9] A. Bartok, et al., “Data Mining and Integration for Predicting Significant Meteorological Phenomena”, International Conference on Computational Science, ICCS, 2010.
[10] F. Alonso, et al., “Combining expert knowledge and data mining in medical diagnosis domain”, Expert Systems with Applications, 2002, 23(4), pp. 367–375.
[11] G. Phillips-Wren, P. Sharkey, S. M. Dy, “Mining lung cancer patient data to assess healthcare resource utilization”, Expert Systems with Applications, 2008, 35(4), pp. 1611–1619.
[12] K. Lawrence, S. Kudyba, R. Klimberg, “Data Mining Methods and Applications”, Auerbach Publications, 1st edition, 2007, pp. 651–669.
[13] B. Eisenfeld, E. Kolsky, T. Topolinski, “42 percent of CRM software goes unused”, www.gartner.com, February 2003.
[14] B. Eisenfeld, et al., “Unused CRM software increases TCO and decreases ROI”, www.gartner.com, February 2003.
[15] A. Zornes, “The top 5 global 3000 data mining trends for 2003/04”, META Group Research-Delta Summary, 2061, March 2003.
[16] H. A. Edelstein, H. C. Edelstein, “Building, Using, and Managing the Data Warehouse”, Data Warehousing Institute, first ed., Prentice-Hall PTR, Englewood Cliffs, NJ, 1997.
[17] M. Strand, “The Business Value of Data Warehouses—Opportunities, Pitfalls and Future Directions”, Ph.D. Thesis, Department of Computer Science, University of Skovde, December 2000.
[18] J. E. Gondar, “Metodología Del Data Mining”, No. 84-96272-21-4, Data Mining Institute, S.L., 2005.
[19] G. Nie, et al., “Decision analysis of data mining project based on Bayesian risk”, Expert Systems with Applications, 36, 2009, pp. 4589–4594.
[20] K. Becker, C. Ghedini, “A documentation infrastructure for the management of data mining projects”, Information and Software Technology, 47, 2005, pp. 95–111.
[21] L. T. Moss, S. Atre, “Business Intelligence Roadmap. The Complete Project Lifecycle for Decision-Support Applications”, Addison-Wesley Information Technology Series, 2004.
[22] E. Chapman, et al., “CRISP-DM 1.0 Step-by-Step Data Mining Guide, SPSS”, http://www.crispdm.org/CRISPWP-0800.pdf, 2000.
[23] C. Shearer, “The CRISP-DM model: the new blueprint for data mining”, Journal of Data Warehousing, 2000, 5(4), pp. 13–22.
[24] J. Hipp, G. Lindner, “Analyzing warranty claims of automobiles: an application description following the CRISP-DM data mining process”, Lecture Notes in Computer Science, 1749, 1999, pp. 31–40.
[25] R. Wirth, J. Hipp, “CRISP-DM: towards a standard process model for data mining”, Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining, 2000, pp. 29–39.
[26] Z. Bošnjak, O. Grljevic, S. Bošnjak, “CRISP-DM as a Framework for Discovering Knowledge in Small and Medium Sized Enterprises’ Data”, 5th International Symposium on Applied Computational Intelligence and Informatics, 2009.
[27] A. Feelders, H. Daniels, M. Holsheimer, “Methodological and practical aspects of data mining”, Information & Management, 37, 2000, pp. 271–281.
[28] X. Hu, “DB-HReduction: A Data Preprocessing Algorithm for Data Mining Applications”, Applied Mathematics Letters, 16, 2003, pp. 889–895.
[29] S. Anand, A. Buchner, “Decision support using data mining”, London: Financial Times Pitman Publishers, 1998.
[30] P. Cabena, et al., “Discovering data mining: From concepts to implementation”, Prentice Hall, 1998.
[31] K. Cios, L. Kurgan, “Trends in data mining and knowledge discovery”, Advanced techniques in knowledge discovery and data mining, Springer, 2005, pp. 1–26.
[32] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “From data mining to knowledge discovery: An overview”, Advances in knowledge discovery and data mining, AAAI Press, 1996b, pp. 1–34.
[33] SAS Enterprise Miner: SEMMA, http://www.sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html, January 2011.
[34] A. Azevedo, F. M. Santos, “KDD, SEMMA and CRISP-DM: A Parallel Overview”, ISBN: 978-972-8924-63-8, IADIS, 2008.
[35] O. Marban, E. Menasalvas, C. Fernandez-Baizan, “A cost model to estimate the effort of datamining projects (DMCoMo)”, Information Systems, 33, 2008, pp. 133–150.
[36] S. Pourdarab, H. Eslami Nosratabadi, M. Abbasian, “Design a Fuzzy Expert System to Evaluate Science and Technology parks”, 4th international conference of Fuzzy Information and Technology, 2010.
[37] H. Iyatomi, M. Hagiwara, “Adaptive fuzzy inference neural network”, Pattern Recognition, 2004, 37(10), pp. 2049–2057.
[38] Y. S. Juang, S. S. Lin, H. P. Kao, “Design and implementation of a fuzzy inference system for supporting customer requirements”, Expert Systems with Applications, 2007, 32(3), pp. 868–878.
[39] T. Takagi, M. Sugeno, “Fuzzy identification of systems and its applications of modeling and control”, IEEE Transactions of Systems Man and Cybernetics, USA, 1985.
[40] A. Haji, M. Assadi, “Fuzzy expert systems and challenge of new product pricing”, Computers & Industrial Engineering, 2009, 56(2), pp. 616–630.
[41] A. Nadali, S. Pourdarab, H. Eslami Nosratabadi, “Labeling the class of Bank Credit's customers by a fuzzy Expert System For Credit Scoring with Data Mining Approach”, ICKD 2011, IEEE.
