Comparison of Data Mining techniques and tools for data classification

Luís C. Borges
Viriato M. Marques
Jorge Bernardino
Polytechnic Institute of Coimbra, ISEC - Institute of Engineering, Rua Pedro Nunes, Coimbra, Portugal. Tel. +351 239 790 200
[email protected]
[email protected]
[email protected]
ABSTRACT
Data Mining is a knowledge field that intersects domains from computer science and statistics, attempting to discover knowledge from databases in order to facilitate the decision-making process. Classification is a Data Mining task that learns from a collection of cases in order to accurately predict the target class for new cases. Several machine learning techniques can be used to perform classification. Free and open source Data Mining software tools available on the Internet offer the capability of performing classification through different techniques. This study compares four such tools: KNIME, Orange, RapidMiner and Weka. Our objective is to reveal the most accurate tool and technique for the classification task, so that analysts may use the results to achieve good results rapidly. Our experimental results show that no single tool or technique always achieves the best result, but some achieve better results more often than others.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data Mining. I.5.2 [Pattern Recognition]: Design Methodology – Classifier design and evaluation.

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Free and open source Data Mining tools, data classification, knowledge discovery, Data Mining techniques.
1. INTRODUCTION
Today's organizations have too much data for decision-making analysis to be carried out manually. Data Mining allows the exploration of data to discover knowledge that is used mostly in strategic management decisions.
Data Mining is the extraction of interesting knowledge (nontrivial, implicit, previously unknown and potentially useful) by algorithmically detecting specific patterns and trends in the data, as well as rule mechanisms (associations between seemingly unconnected data). Data Mining is a multidisciplinary area that includes methods and techniques (including algorithms) from Statistics and Machine Learning, but also Artificial Intelligence, Pattern Recognition, Databases and Data Visualization [1][2].

Data Mining tasks are defined according to the objectives of the analysis. The purpose of the classification task is to build models capable of predicting the class of new cases. Data Mining models are mathematical representations aimed at understanding and studying the data. Data Mining tasks are sets of processes involved in producing models, implemented using techniques (algorithms). Data Mining software tools combine fundamentals, theories, methods and algorithms. These applications base their operation on algorithms that look for patterns of knowledge, combining a set of tools for the interrogation and exploration of data with tools for the visualization of results and reporting.

This work presents a comparative study of four free and open source Data Mining software tools (KNIME, Orange, RapidMiner and Weka). To evaluate the performance of the tools we used the accuracy metric. The study aims at providing analysts with the tool and technique they may use to achieve fast and good results.

The remainder of this paper is organized as follows: Section 2 describes the proposed methodology, the software tools, datasets and algorithms tested. Section 3 reports and evaluates the results obtained from different perspectives. Section 4 presents our conclusions and suggests future work.
2. METHODOLOGY
The methodology of our study consists of three preparatory steps: (1) the selection of the Data Mining tools to test, (2) the selection of the datasets to be used and (3) the selection of the classification algorithms to evaluate. The study tests all possible combinations of datasets, techniques, partitioning modes and algorithms, and evaluates each classification using the accuracy metric.
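Concretely, the grid of tests is the Cartesian product of these choices. A small sketch of the enumeration follows (dataset and technique names as used in this paper; the code itself is only illustrative, not part of the study):

```python
# Enumerate the experimental grid: every combination of dataset,
# technique and partitioning mode (the algorithm then varies per tool).
from itertools import product

datasets = ["Adult", "Breast-cancer", "Car Evaluation", "Credit-approval",
            "Iris", "Lung-cancer", "Wine", "Zoo"]
techniques = ["Decision Trees", "Rules", "Clustering",
              "Artificial Neural Networks", "Bayesian Classifiers", "SVM"]
partitioning = ["% Split 70:30", "X-Validation k=5"]

cells = list(product(datasets, techniques, partitioning))
print(len(cells))  # 8 x 6 x 2 = 96 cells; over 4 tools, 384 recorded tests
```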
2.1 Tools
The selection of the tools to test was based on a list of the five best free and open source Data Mining tools [3]. The choice criterion was user-friendliness: all five have a Graphical User Interface (GUI), but only four can be used without scripting; using jHepWork requires competence in the Jython programming language. We therefore selected the four that an analyst (not a programmer) is able to use: (1) KNIME 2.6.0 [4], (2) Orange 2.6 [5], (3) RapidMiner 4.6 [6] and (4) Weka 3.6 [7].
2.2 Datasets
The chosen datasets were downloaded from the UCI repository (University of California, Irvine) [8]. Table I shows their characterization: the name by which they are known in the literature, the variable data types (with nominal and ordinal grouped into categorical), the number of instances, the number of attributes and the number of possible values for the target class.
The datasets have univariate and multivariate data types and belong to the classification task, since this is the task that this study focuses on. These datasets were selected because they provide a wide range of possibilities. They have discrete and continuous quantitative variables as well as nominal and ordinal qualitative variables. The number of instances varies between a minimum of 32 and a maximum of 32561. The number of attributes varies between a minimum of 4 and a maximum of 56. The target class ranges from binary to 7-ary classification. Some datasets have few instances but many attributes, others the opposite. These characteristics show that each dataset is unique and that, combined, the datasets ensure a good mix of samples for testing.
Table I. Characterization of Datasets

Dataset name     Variable types              #Instances  #Attributes  #Class values
Adult            Categorical, Integer        32561       14           2
Breast-cancer    Categorical                 286         9            2
Car Evaluation   Categorical                 1728        6            4
Credit-approval  Categorical, Integer, Real  690         15           2
Iris             Real                        150         4            3
Lung-cancer      Integer                     32          56           3
Wine             Integer, Real               178         13           3
Zoo              Categorical, Integer        101         16           7
2.3 Classification
Data Classification is a two-step process: (1) the training (or learning) phase, and (2) the test (or evaluation) phase, in which the actual class of each instance is compared with the predicted class. If the hit rate is acceptable to the analyst, the classifier is accepted as being capable of classifying future instances of unknown class.
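As an illustration of the two phases, the following sketch trains and evaluates a classifier with scikit-learn on its bundled Iris data; scikit-learn was not one of the tools tested, so this is only a minimal stand-in for what each tool does internally:

```python
# Minimal sketch of the two-step classification process.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# (1) Training phase: learn a classifier from a collection of labeled cases.
clf = DecisionTreeClassifier().fit(X_train, y_train)

# (2) Test phase: compare the actual class with the predicted class.
predicted = clf.predict(X_test)
hit_rate = (predicted == y_test).mean()
print(f"hit rate: {hit_rate:.2%}")
```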
2.3.1 Classification algorithms
Classification is typically obtained by supervised learning, but it can also be performed by unsupervised learning, where the class is not used or is unknown, as in the Clustering technique. For the tests we use algorithms of the following techniques: (1) Decision Tree, (2) Rule Induction, (3) Clustering, (4) Artificial Neural Network (ANN), (5) Bayesian classifier and (6) Support Vector Machine (SVM).
2.3.2 Performance evaluation
The performance of the classifiers is assessed with the accuracy metric, calculated by dividing the number of correctly classified instances by the total number of instances. A correctly classified instance is one for which the classifier predicts the correct class of the test instance.
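In symbols, with $n_c$ the number of correctly classified test instances and $n$ the total number of test instances:

$$\text{accuracy} = \frac{n_c}{n}$$

For example, a classifier that correctly predicts 45 of 50 test instances has an accuracy of 45/50 = 90%.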
During the tests, the algorithms ran with the parameters predefined by the tools, except when it was possible to select that missing values be ignored. Better performance can be obtained by changing the parameters of the algorithms, but the difference is not significant for the scope of this test, because all operators could be improved in this way.

The datasets used in the tests were saved in Weka's standardized format (ARFF), which all tools are able to read natively.

Orange was programmed by means of the widgets available for classification techniques (Learners); the ‘File’ widget and all ‘Learners’ were linked to a ‘Test Learners’ widget, parameterized in both partitioning modes. No preprocessing widget was used.

KNIME was programmed with the nodes available in the ‘Mining’ group. For the ‘Decision Tree Learner’ and the ‘Naive Bayes Learner’, the data was not subjected to preprocessing, as these nodes accept all kinds of attributes. For the other Learners, categorical attributes were converted into numeric values of type ‘DoubleValue’ and normalized between 0.0 and 1.0. For the ‘RProp MLP Learner’, the real class and the predicted class were converted to type ‘IntValue’ before evaluation. For the ‘k Nearest Neighbor’ and ‘SVM Learner’, the real class of the training set and of the test set was converted to ‘StringValue’ before the construction of the models.
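Outside the GUI tools, the same preprocessing can be reproduced in a few lines; the sketch below assumes pandas, SciPy and scikit-learn, and both the file name and the class-attribute name are hypothetical:

```python
# Illustrative equivalent of the preprocessing described above (not the
# actual KNIME workflow): read an ARFF dataset, convert categorical
# attributes to numeric values and normalize them between 0.0 and 1.0.
import pandas as pd
from scipy.io import arff                    # reader for Weka's ARFF format
from sklearn.preprocessing import MinMaxScaler

data, meta = arff.loadarff("credit-approval.arff")  # hypothetical file name
df = pd.DataFrame(data)

y = df.pop("class")                  # assumes the target attribute is 'class'
X = pd.get_dummies(df)               # one-hot encode categorical attributes
X = MinMaxScaler().fit_transform(X)  # rescale every attribute to [0.0, 1.0]
```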
3. EXPERIMENTAL RESULTS
In this section we describe the experimental setup of the tests and evaluate the results. Our tests consist in evaluating the tools KNIME, Orange, RapidMiner and Weka on the classification techniques Decision Trees, Classification Rules, Artificial Neural Networks, Bayesian Classifiers, Clustering and Support Vector Machines, using the datasets described in Table I.

3.1 Experimental setup
Weka and RapidMiner have several algorithms for each technique. Tests were exhaustive, i.e. all the algorithms were tested, and the results show the one that obtained the best accuracy or, in the case of equal values, the operator that ran first. What matters in this test is determining the most accurate combination of tool and technique, not the best algorithm. Data preprocessing was not performed in Weka, although this tool does it automatically for specific algorithms, e.g. discretization.
The tests were carried out by random sampling in two partitioning modes: Percentage Split with ‘70:30’ and Cross-Validation with ‘k=5’, a small k chosen because some of the datasets contain few instances (Zoo has 101 and Lung-cancer only 32).
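The two partitioning modes correspond to the following evaluation schemes, sketched here with scikit-learn (again only a stand-in for the tools' own operators):

```python
# Percentage Split '70:30' versus Cross-Validation 'k=5'.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
clf = GaussianNB()

# One random 70% train / 30% test partition.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
split_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Five folds; every instance is used for testing exactly once.
cv_acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"70:30 split: {split_acc:.2%}, 5-fold CV: {cv_acc:.2%}")
```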
RapidMiner has some operators (e.g. ‘NeuralNetImproved’ and ‘LibSVMLearner’) that only work with numeric attributes; in these cases, the categorical data was transformed into numeric values with the ‘Nominal2Numerical’ operator and normalized between 0.0 and 1.0 with the ‘Normalization’ operator.
3.2 Experimental results evaluation
The analysis of the results can be performed by dataset, tool, technique, partitioning mode, algorithm (classifier) and overall total. The aggregated totals were calculated as the simple arithmetic average of the tests. Running all the tests, we obtained an overall average classification accuracy of 81.13%. It was not possible to carry out some tests: Orange is not able to run Artificial Neural Networks, and for the CN2 algorithm of the Rules technique the Adult dataset resulted in an out-of-memory error.

Table II. Total by Dataset

Dataset          Average accuracy
Iris             93.86%
Zoo              93.10%
Wine             92.76%
Car Evaluation   90.14%
Credit Approval  83.38%
Adult            80.05%
Breast-cancer    71.88%
Lung-cancer      43.81%
Considering the characteristics of the datasets tested (Table I), there is no clear correlation between the classification result and the type of variables, the number of instances, the number of attributes or the number of values for the target class. Table II shows the average accuracy of all tests by dataset. The dataset with the best accuracy (Iris) is the one with the fewest attributes (4), and the dataset with the worst accuracy (Lung-cancer) is the one with the largest number of attributes (56); however, this pattern does not hold for the remaining datasets. Only the Lung-cancer dataset has an average deviation higher than the standard deviation. The remaining datasets obtained accuracies between a minimum of 71.88% for Breast-cancer and a maximum of 93.86% for Iris.

Table III. Total by Tool

Tool        Average accuracy
Weka        84.28%
KNIME       81.40%
Orange      81.19%
RapidMiner  77.67%
From the tools perspective, as shown in Table III, the best accuracy was obtained by Weka with 84.28%, in agreement with the study of Wahbeh et al. [9], and the worst accuracy was obtained by RapidMiner with 77.67%.

Table IV. Total by Technique

Technique                   Average accuracy
Decision Trees              83.10%
Artificial Neural Networks  82.85%
Rules                       82.76%
Clustering                  82.07%
Bayesian Classifiers        81.76%
SVM                         74.72%
The analysis by technique is presented in Table IV. The best result was achieved by models based on Decision Trees, with 83.10%. The worst result was obtained by Support Vector Machines, with 74.72%, explained by the poor results obtained on the Lung-cancer and Adult datasets, especially in RapidMiner, with 25.95% and 24.14%, respectively. These datasets are the biggest in terms of number of attributes (Lung-cancer) and number of instances (Adult), which suggests that accuracy worsens as the complexity of the dataset grows.

Table V. Total by Partitioning Mode

Partitioning mode  Average accuracy
X-Validation k=5   81.41%
% Split 70:30      80.85%
Table V shows the accuracy by partitioning mode. The best accuracy was obtained with Cross-Validation ‘k=5’ (81.41%), but only marginally better than Percentage Split ‘70:30’ (80.85%). This gain must be weighed against the increase in processing time, which may or may not be significant depending on the size of the dataset.

Table VI. Total by Algorithm

Algorithm                    Average accuracy  #Tests
AODEsr                       100.00%           1
NBTree                       100.00%           1
FT                           98.26%            3
HNB                          94.79%            3
PART                         94.59%            1
ID3                          94.36%            3
SimpleCart                   91.98%            2
DTNB                         91.66%            4
BFTree                       91.00%            2
Jrip                         90.30%            2
LWL                          88.78%            3
LMT                          87.33%            2
BayesNet                     87.24%            9
Ridor                        86.87%            2
J48graft                     86.24%            1
RuleLearner                  86.02%            11
DecisionStump                85.51%            2
BasicRuleLearner             85.17%            3
NaiveBayes                   85.02%            6
MultilayerPerceptron         84.38%            14
Rprop MLP Learner            84.38%            16
NNge                         83.86%            3
KernelNaiveBayes             83.79%            11
k Nearest Neighbor           83.29%            16
Decision Tree Learner        83.06%            16
IB1                          82.62%            6
IBk                          82.59%            4
SMO                          82.40%            16
Naive Bayes                  82.05%            16
k Nearest Neighbours         81.43%            16
ADTree                       81.34%            2
Classification Tree          81.21%            16
SVM Learner                  81.06%            16
NeuralNetImproved            80.93%            16
CN2                          80.67%            16
SVM                          80.53%            16
NearestNeighbors             80.49%            16
CHAID                        80.46%            11
Fuzzy Rule Learner           80.17%            16
KStar                        79.00%            3
DecisionTable                77.50%            3
Naive Bayes Learner          76.41%            16
VotedPerceptron              75.14%            2
LBR                          74.48%            1
MultiCriterionDecisionStump  73.26%            1
BestRuleInduction            73.26%            1
EvoSVM                       70.25%            1
OneR                         70.00%            1
LibSVMLearner                55.96%            14
J48                          53.13%            1
AODE                         48.13%            2
LADTree                      40.00%            1
JMySVMLearner                24.19%            1

Table VI shows the totals by algorithm, and we can conclude that the best classifiers are from Weka and are based on the Decision Tree technique. The best classifiers reached perfect accuracy, but only in a single test. In cases where only one classifier per technique was available in the tool, as is the case with Orange and KNIME, the maximum number of tests for that classifier is 16. It should be stressed that Weka's ‘BayesNet’ algorithm obtained the best result in 9 of the 16 possible tests, with an average accuracy of 87.24%. Although some algorithms obtained low accuracy rates, each value shown was nevertheless the best result obtained for a particular test.

To determine the tool and technique that best suit the classification task, tool accuracy was compared with regard to technique and vice versa; the results are shown in Table VII. The best tool of the experiment was Weka, since it took 1st place on four techniques and 2nd place on two. The best technique was Decision Trees, which took 1st place on Weka and 3rd place on the other tools.
Table VII. Total by Tool and Technique

Tool        Decision Trees  Artificial Neural Networks  Rules   Clustering  Bayesian Classifier  SVM
Weka        85.96%          83.23%                      85.89%  83.09%      85.11%               82.40%
KNIME       83.06%          84.38%                      80.17%  83.29%      76.41%               81.06%
Orange      81.21%          n/a                         80.67%  81.43%      82.05%               80.53%
RapidMiner  82.15%          80.93%                      84.06%  80.49%      83.48%               54.87%
4. CONCLUSIONS AND FUTURE WORK
Our experiments allowed comparing the accuracy obtained by several algorithms over 8 datasets selected for their different characteristics, 4 free and open source software tools considered among the best on the market, 6 machine learning techniques for classification and 2 partitioning modes. The overall conclusion of the study is that there is no single tool or technique that is better than all the others for every classification task. Individually, the best result was achieved with Weka and the Decision Trees technique, with 85.96%.

The accuracy value should be judged against what is expected from a classifier in a real application scenario: while one scenario may require high accuracy, another may accept lower accuracy. Since there is no universal threshold across the possible fields of application of classification, the most appropriate method would be to resort to an expert in the field to set these values, as suggested by Saitta [10]. This way we could conclude whether the classifier obtained has practical application.

A total of 384 tests were recorded, accounting only for those that obtained the best result. Due to the size of the study, many test alternatives were left out and are suggested as future work: tuning the parameters of the algorithms, testing more tools, testing other partitioning modes, testing more datasets, extending the study to other Data Mining tasks and including the techniques of Regression, Ensembles and Genetic Algorithms. The study results may also be used to develop a software tool that implements the algorithms that achieved the best results.
5. REFERENCES
[1] Santos, M., Ramos, I. 2009. Business Intelligence - Tecnologias da Informação na Gestão de Conhecimento. FCA - Editora de Informática, Lda., 2nd edition.
[2] Santos, M.F., Azevedo, C. 2005. Data Mining - Descoberta de conhecimento em bases de dados. FCA - Editora de Informática, Lda., 1st edition.
[3] Auza, J. 2010. 5 of the Best Free and Open Source Data Mining Software. [Accessed online March 2013] http://www.junauza.com/2010/11/free-data-mining-software.html
[4] KNIME; KNIME.com AG, Germany; [Accessed online August 27, 2012] http://www.knime.org/
[5] Orange; Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Slovenia; [Accessed online August 27, 2012] http://orange.biolab.si/
[6] RapidMiner; Rapid-i GmbH, Germany; [Accessed online August 27, 2012] http://rapid-i.com
[7] Weka; Machine Learning Group, University of Waikato, New Zealand; [Accessed online August 27, 2012] http://www.cs.waikato.ac.nz/ml/weka/
[8] UCI Machine Learning Repository; University of California, Irvine; [Accessed online August 27, 2012] http://archive.ics.uci.edu/ml/
[9] Wahbeh, A.H., Al-Radaideh, Q.A., Al-Kabi, M.N. and Al-Shawakfa, E.M. 2010. A comparison study between Data Mining tools over some classification methods. IJACSA, Special Issue on Artificial Intelligence, SAI Publisher, 2(8), pp. 18-26.
[10] Saitta, S. 2010. What is a good classification accuracy in Data Mining? [Accessed online March 2013] http://www.dataminingblog.com/what-is-a-good-classification-accuracy-in-data-mining/