International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 5, May 2016
An Empirical Evaluation of Data Mining Classification Algorithms

Prof. Hetal Bhavsar, Assistant Professor, Department of Computer Science and Engineering, The M. S. University of Baroda, Vadodara, Gujarat, India
Dr. Amit Ganatra, Dean, Faculty of Technology and Engineering, CHARUSAT, Changa, Gujarat, India
Abstract: Data Mining is the process of extracting interesting knowledge from large datasets by combining methods from statistics and artificial intelligence with database management. Classification, one of the main functionalities of data mining, is a form of data analysis used to extract models describing important data classes. The well known classification methods are decision tree classification, neural network classification, Naïve Bayes classification, k-nearest neighbour classification and Support Vector Machine (SVM) classification. In this paper, we present a comparison of five classification algorithms: J48, which is based on C4.5 decision tree learning; Multilayer Perceptron (MLP), which uses the multilayer feed-forward neural network approach; Instance Based k-nearest neighbour (IBK); Naïve Bayes (NB); and Sequential Minimal Optimization (SMO), a support vector machine implementation. The performance of these classification algorithms is compared with respect to classifier accuracy, error rates, time to build the classifier and other statistical measures using the WEKA tool. The results show that there is no universal classification algorithm that works best for all datasets.

Keywords: Classification, supervised learning, decision tree, naïve Bayes, support vector machine

I. Introduction: The tremendous amount of information stored in databases and data repositories cannot simply be analyzed manually for valuable decision making. Humans therefore need assistance in their analysis capacity [2]. This requirement has generated an urgent need for automated tools that can assist in transforming vast amounts of data into useful information and knowledge. Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. It integrates multiple fields, including statistical models, mathematical algorithms, information retrieval, databases, pattern recognition and machine learning, and can be carried out with a large number of algorithms and techniques, including classification, clustering, regression, association mining, artificial intelligence, neural networks, genetic algorithms, etc.

Classification, one of the main functionalities of data mining, can be described as a supervised learning task, as it assigns class labels to data objects based on the relationship between the data items and a predefined class label. Classification techniques are used to learn a model from a set of training data and to classify test data into one of the classes [1]. WEKA (Waikato Environment for Knowledge Analysis) [3] is an open source data mining tool which includes implementations of various classification algorithms such as decision trees, Naïve Bayes, lazy learning, neural networks, etc. To observe the performance of the different classification algorithms, this research conducts a comparison study of the J48, MLP, NB, IBK, and SMO algorithms using seven datasets available in the UCI dataset repository [4]: Breast Cancer, Diabetes, Vote, Car Evaluation, Spambase, Audiology, and Nursery.

The rest of the paper is organized as follows: Section 2 covers the related work in this area. Section 3 describes the classification method and its phases. Experimental results and evaluations are presented in Section 4. Finally, Section 5 gives the conclusion of the research.
II. Related Work: The top 10 data mining algorithms (C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART) are described in [5], including their impact and new research issues. A study of a large number of techniques based on artificial intelligence, perceptron-based techniques and statistics showed that, with a better understanding of the strengths and weaknesses of each method, it is possible to integrate two or more algorithms to solve a problem [6]. Despite their advantages, such ensemble methods have weaknesses: increased storage, increased computation, and decreased comprehensibility. In [7],
the comparison of different classification algorithms for the hierarchical prediction of protein function, based on the predictive accuracy of the classifier, is given. It was found that classification accuracy increases when different classifiers are used at different nodes in the classifier tree. Performance comparisons of classification algorithms on breast cancer datasets for patient diagnosis are presented in [8] [9] [10]. A comparison study of different data mining tools on several datasets with various classification methods is presented in [11], which concluded that the WEKA toolkit achieved the highest applicability and the highest improvement in classification performance when moving from the percentage split test mode to the cross validation test mode, compared to other tools. Comparisons of various classification algorithms using WEKA on different datasets are given in [12] [13] [14]. The error rates of various classification algorithms were compared in [15] to identify the best and most effective algorithm for social network data.
III. Methodology:
The methodology of the study consisted of collecting datasets with different characteristics and selecting a set of classification algorithms whose performance was then tested with the WEKA tool. For testing the accuracy of a classifier, the k-fold cross validation and percentage split (also called holdout) modes are used. Figure 1 shows the overall methodology followed for fulfilling the goal of this research.

[Fig. 1 Study of Methodology: Collect dataset -> Build Classification Model (J48, NB, MLP, IBK, SMO) -> Testing Accuracy (k-fold CV / Holdout) -> Result Evaluation]

A. Dataset Description
For this research, several datasets have been downloaded from the UCI repository [4]; details are shown in Table 1.

Table 1: Dataset Description

| Dataset Name | Attribute Type | Number of Instances | Number of Attributes | Number of Classes |
|---|---|---|---|---|
| Breast Cancer Wisconsin | Integer | 699 | 11 | 2 |
| Diabetes | Integer | 768 | 9 | 2 |
| Vote | Categorical | 435 | 17 | 2 |
| Car Evaluation | Categorical | 1728 | 7 | 4 |
| Spambase | Integer, Real | 4601 | 58 | 2 |
| Audiology | Categorical | 226 | 70 | 24 |
| Nursery | Categorical | 12960 | 9 | 5 |

The datasets were chosen to have different characteristics and to address different areas: the number of instances varies from about 200 to 13000, the number of attributes ranges from 7 to 70, and the attribute types differ, with some datasets containing one type while others contain two.

B. Building Classification Model
Classification is a two step process. 1. Learning step: the classification algorithm builds the classifier by learning from a training set made up of database tuples and their associated class labels. 2. Testing step: the model is applied to testing data; the predictive accuracy of the classifier is estimated using a test set different from the training set. The accuracy of the classification model is determined by comparing the true class labels in the testing set with those assigned by the model [1]. A sketch of both steps using the WEKA API follows.
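The two steps above can be reproduced with a few lines of the WEKA Java API. The following is a minimal sketch, not the authors' actual script; it assumes WEKA 3.x on the classpath, and the file name "breast-cancer.arff" is a placeholder for any of the seven UCI datasets in ARFF format.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        // Learning step input: load the dataset and mark the class attribute.
        Instances data = DataSource.read("breast-cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Testing step: 10-fold cross validation; WEKA trains on 9 folds and
        // tests on the held-out fold, repeating this 10 times.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        // The measures reported in the tables of Section IV.
        System.out.printf("Accuracy (%%): %.4f%n", eval.pctCorrect());
        System.out.printf("Kappa:        %.4f%n", eval.kappa());
        System.out.printf("MAE:          %.4f%n", eval.meanAbsoluteError());
        System.out.printf("RMSE:         %.4f%n", eval.rootMeanSquaredError());
    }
}
```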
C. Classification Algorithm
1. J48
J48 is the implementation of C4.5 in WEKA. C4.5 is one of the best known decision tree induction algorithms. The decision tree method is a supervised machine learning technique that builds a decision tree from a set of class-labelled training samples. Each internal node in a decision tree represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label [1]. C4.5 uses the gain ratio to select the attribute to split on. It improves over ID3 in that it deals with both nominal and numerical attributes and is able to handle missing and noisy data [16].
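For reference, the gain ratio criterion can be stated as follows (standard textbook definitions consistent with [1][16], not reproduced from this paper). For an attribute $A$ that splits a dataset $D$ into partitions $D_1, \dots, D_v$:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|},$$

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)},$$

where $\mathrm{Info}(D) = -\sum_i p_i \log_2 p_i$ is the entropy of the class distribution in $D$. C4.5 splits on the attribute with the highest gain ratio, which corrects the bias of plain information gain toward attributes with many values.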
2. Multilayer Perceptron (MLP)
MLP is a feed-forward network that builds a model mapping input data to output data. A neural network with the back propagation algorithm performs learning on a multilayer feed-forward neural network [1]. Such a network consists of a large number of units (neurons) joined together in a pattern of connections, organized into an input layer, one or more hidden layers and an output layer. The input layer receives the information to be processed, the output layer shows the result of the processing, and the hidden layers allow the signals to travel one way only, from input to output. The network learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual target value [6] [16]; one pass through the training set is called an epoch. For the experimental results, MLP is run with two epoch settings, 100 and 500.

3. Naïve Bayes Classification (NB)
The Naïve Bayes classifier is a simple statistical Bayesian classifier, which predicts class membership probabilities: the probability that a given sample belongs to a particular class. It is called naïve because it assumes that all variables contribute toward classification and are mutually independent. This assumption is called class conditional independence [16]. It is unrealistic for most datasets; however, it leads to a simple prediction framework that gives surprisingly good results in many practical cases. The Naïve Bayes classifier is based on Bayes' theorem [1].
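In its standard form (not specific to this paper), Bayes' theorem together with the naïve independence assumption gives, for a sample $\mathbf{x} = (x_1, \dots, x_n)$ and class $C_k$:

$$P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\,P(C_k)}{P(\mathbf{x})} \propto P(C_k)\prod_{i=1}^{n} P(x_i \mid C_k),$$

and the classifier predicts the class $C_k$ that maximizes this product.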
4. Instance Based Classifier (IBK)
IBK is the implementation of the k-nearest neighbour classification algorithm in WEKA. Instance based classifiers are also called lazy learners, as they store all of the training samples and do not build a classifier until a new, unlabeled sample needs to be classified [1][6]. The k-nearest neighbours algorithm is among the simplest of all machine learning algorithms. It is based on the principle that similar samples lie in close proximity. Given an unlabeled sample, the k-nearest neighbour classifier searches the pattern space for the k objects closest to it and assigns the most frequent class label among them. If k = 1, the sample is assigned the class of the training sample closest to it in the pattern space [16]. For the experimental results, k is taken to be 5.

5. Sequential Minimal Optimization (SMO)
SMO is the WEKA implementation of the support vector machine (SVM). SVMs are based on statistical learning theory and the structural risk minimization principle, and aim to determine the location of decision boundaries, also known as hyperplanes, that produce the optimal separation of classes [1][6]. The Support Vector Machine revolves around the notion of a "margin" on either side of a hyperplane that separates two data classes. Maximizing the margin, and thereby creating the largest possible distance between the separating hyperplane and the instances on either side of it, has been proven to reduce an upper bound on the expected generalisation error [17]. Further, applying an SVM yields the global solution for a classification problem, and SVM based classification is attractive because its efficiency does not directly depend on the dimension of the classified entities. The sketch below shows how these five classifiers could be instantiated with the settings used in this study.
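A minimal sketch, assuming WEKA 3.x; only the settings stated in the text are changed (k = 5 for IBk, 100 or 500 training epochs for the MLP), everything else is left at WEKA's defaults.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;

public class ClassifierFactory {
    /** The five classifiers compared in this study. */
    public static Classifier[] classifiers() {
        IBk knn = new IBk();
        knn.setKNN(5);                      // k-nearest neighbours with k = 5

        MultilayerPerceptron mlp100 = new MultilayerPerceptron();
        mlp100.setTrainingTime(100);        // number of training epochs

        MultilayerPerceptron mlp500 = new MultilayerPerceptron();
        mlp500.setTrainingTime(500);

        // J48, NB and SMO are used with WEKA's default options.
        return new Classifier[] {
            new J48(), new NaiveBayes(), knn, mlp100, mlp500, new SMO()
        };
    }
}
```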
D. Measures for Performance Evaluation
The performance metrics used for comparing the different classification algorithms on the various datasets are prediction accuracy, correctly versus incorrectly classified instances, time to build the model, Kappa Statistic (KS), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

a. Classification accuracy: Two well known techniques were used for assessing classifier accuracy: k-fold cross validation and percentage split [1]. In k-fold cross validation, the data is split into k disjoint subsets (folds), and training and testing are performed k times: in each of the k experiments, k-1 folds are used for training and the remaining one for testing. The error rate of the classifier is the average of the error rates of the k experiments. In the percentage split method, two-thirds of the data are selected for training and one-third for testing. The objective of using these two techniques is to check whether there is an improvement in the accuracy measure when moving from one test mode to the other.

b. Kappa Statistic (KS): A chance-corrected measure of agreement between the classification and the true classes. It is calculated by subtracting the agreement expected by chance from the observed agreement and dividing by the maximum possible agreement. The possible values range from +1 (perfect agreement) through 0 (no agreement above that expected by chance) to -1 (complete disagreement).

c. Mean Absolute Error (MAE): A quantity used to measure how close predictions are to the eventual outcomes. It is the average of a loss function over the test dataset, where the loss function measures the error between the actual and predicted values.

d. Root Mean Squared Error (RMSE): The square root of the mean squared error; it measures the average magnitude of the error. The presence of outliers is exaggerated by the mean squared error, not by the MAE.
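In their usual form (standard definitions, not reproduced from the paper), for a test set of $n$ instances with actual values $y_i$ and predictions $\hat{y}_i$, and with observed agreement $p_o$ and chance agreement $p_e$:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$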
IV. Experimental results and evaluation
This research presents a comparative study of various data mining classification algorithms based on several essential parameters, including the type of dataset used, the number and types of attributes supported, time to build the classifier, the number of correctly versus incorrectly classified instances, accuracy, and other statistical measures. The datasets used for the experimental evaluation are in ARFF format. The simulation results were obtained by running the WEKA tool on an Intel Core i5-2430M CPU @ 2.4 GHz machine with 4 GB of RAM.

1. Breast Cancer Dataset
The Breast Cancer dataset classifies a patient's breast cancer as benign or malignant. The results for the breast cancer dataset using WEKA in 10-fold cross validation and percentage split mode are shown in Table 2 and Table 3. From Table 2 it is clearly seen that the highest accuracy is 96.99% for SMO and the lowest is 94.56% for J48. This shows that, for the breast cancer dataset, the SMO algorithm performs best, followed by MLP with 100 epochs and KNN. KNN required the least time to build the model, followed by NB and SMO. Table 2 also shows that increasing the number of epochs from 100 to 500 for MLP increases the time for building the model as well as reducing the prediction accuracy.

Table 2. Result for Breast Cancer dataset using 10_fold_cv

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 94.5637 | 5.4363 | 0.06 | 0.8799 | 0.0694 | 0.2229 |
| NB | 95.9943 | 4.0057 | 0.03 | 0.9127 | 0.0408 | 0.1994 |
| KNN | 96.7096 | 3.2904 | 0 | 0.9275 | 0.0457 | 0.1579 |
| MLP (epoch=100) | 96.7096 | 3.2904 | 0.17 | 0.9274 | 0.0548 | 0.1709 |
| MLP (epoch=500) | 95.279 | 4.721 | 0.81 | 0.8958 | 0.0501 | 0.197 |
| SMO | 96.9957 | 3.0043 | 0.08 | 0.9337 | 0.03 | 0.1733 |

The Kappa Statistic for SMO is much closer to 1 (0.9337), which indicates that SMO provides the closest to perfect agreement in classifying the data items. SMO also has lower MAE and RMSE, as it provides more accurate predictions with less variance.

Table 3. Result for Breast Cancer dataset using Percentage split

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 95.3782 | 4.6218 | 0.08 | 0.9006 | 0.0671 | 0.2124 |
| NB | 94.958 | 5.042 | 0.03 | 0.8913 | 0.048 | 0.2141 |
| KNN | 95.3782 | 4.6218 | 0 | 0.8996 | 0.052 | 0.1896 |
| MLP (epoch=100) | 94.5378 | 5.4622 | 0.17 | 0.8814 | 0.0576 | 0.1856 |
| MLP (epoch=500) | 95.3782 | 4.6218 | 1.42 | 0.9001 | 0.0524 | 0.1956 |
| SMO | 95.3782 | 4.6218 | 0.09 | 0.8996 | 0.0462 | 0.215 |

Since the breast cancer dataset has 699 instances, 461 instances are used to build the model and 238 instances are used for testing in percentage split mode. Table 3 shows that all the algorithms perform well, with an average accuracy of 95%. The Kappa Statistic for J48 is closest to 1, indicating that J48 provides the best agreement in classifying the data items. From Table 2 and Table 3 it is also seen that, for the breast cancer dataset, moving from 10_fold_cv mode to percentage split mode reduces the accuracy of the model.
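The percentage split mode used for Table 3 (and for the corresponding tables below) can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact procedure: WEKA 3.x, a local "breast-cancer.arff", an explicit shuffle before splitting, and the two-thirds ratio described under the performance measures (699 instances then yield the 461/238 split cited above).

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldoutDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breast-cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));               // shuffle before splitting

        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize,
                                       data.numInstances() - trainSize);

        SMO smo = new SMO();
        smo.buildClassifier(train);                  // learning step
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(smo, test);               // testing step
        System.out.printf("Accuracy (%%): %.4f%n", eval.pctCorrect());
    }
}
```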
2. Diabetes Dataset
The results for the diabetes dataset are shown in Table 4 and Table 5. For the diabetes dataset too, SMO outperforms the other classification algorithms, with 594 correctly classified instances and an accuracy of 77.34%. Though the time required to build the model is lowest for KNN, it suffers from the lowest prediction accuracy. SMO has the lowest mean absolute error but the highest root mean squared error.

Table 4. Result for Diabetes dataset using 10_fold_cv

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 73.8281 | 26.1719 | 0.13 | 0.4164 | 0.3158 | 0.4463 |
| NB | 76.3021 | 23.6979 | 0.03 | 0.4664 | 0.2841 | 0.4168 |
| KNN | 73.1771 | 26.8229 | 0 | 0.3874 | 0.3165 | 0.4318 |
| MLP (epoch=100) | 76.3021 | 23.6979 | 0.17 | 0.4674 | 0.3034 | 0.4061 |
| MLP (epoch=500) | 75.3906 | 24.6094 | 0.87 | 0.4484 | 0.2955 | 0.4215 |
| SMO | 77.3438 | 22.6563 | 0.11 | 0.4682 | 0.2266 | 0.476 |

Table 5. Result for Diabetes dataset using Percentage Split

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 76.2452 | 23.754 | 0.05 | 0.434 | 0.312 | 0.4059 |
| NB | 77.0115 | 22.988 | 0 | 0.463 | 0.266 | 0.3822 |
| KNN | 75.0958 | 24.904 | 0 | 0.400 | 0.310 | 0.4211 |
| MLP (epoch=100) | 79.3103 | 20.689 | 0.16 | 0.523 | 0.311 | 0.3887 |
| MLP (epoch=500) | 74.3295 | 25.670 | 0.78 | 0.431 | 0.318 | 0.4445 |
| SMO | 79.3103 | 20.689 | 0.03 | 0.490 | 0.206 | 0.4549 |

Since the diabetes dataset has 768 instances, 507 are used for training and 261 for testing in percentage split mode. Table 5 shows that SMO and MLP (epoch=100) have the highest prediction accuracy, followed by NB and J48; KNN has the lowest. The Kappa Statistic is higher for MLP than for SMO, meaning MLP (epoch=100) provides better agreement in its classifications than SMO.

The results in Table 4 and Table 5 show that moving from the 10_fold_cv to the percentage split method improves the accuracy of the classification algorithms on the diabetes dataset.
3. Vote Dataset
Table 6 shows that for the vote dataset J48 and SMO provide promising classification results with an accuracy of 96%, followed by MLP with an accuracy of 94%. Though the time required to build the model for KNN and NB is less than for the other algorithms, they suffer from lower prediction accuracy. J48 has the highest Kappa Statistic, closest to perfect agreement, and a lower root mean squared error.

Table 6. Result for Vote dataset using 10_fold_cv

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 96.3218 | 3.6782 | 0.06 | 0.9224 | 0.0611 | 0.1748 |
| NB | 90.1149 | 9.8851 | 0.02 | 0.7949 | 0.0995 | 0.2977 |
| KNN | 92.6437 | 7.3563 | 0 | 0.8475 | 0.0841 | 0.2259 |
| MLP (epoch=100) | 94.9425 | 5.0575 | 0.25 | 0.8933 | 0.0561 | 0.1087 |
| MLP (epoch=500) | 94.7126 | 5.2874 | 1.01 | 0.8888 | 0.0528 | 0.2078 |
| SMO | 96.092 | 3.908 | 0.09 | 0.9178 | 0.0391 | 0.1977 |

Table 7. Result for Vote dataset using Percentage Split

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 97.2973 | 2.7027 | 0.03 | 0.9447 | 0.0608 | 0.1539 |
| NB | 91.2162 | 8.7838 | 0.02 | 0.8232 | 0.0912 | 0.2858 |
| KNN | 92.5676 | 7.4324 | 0 | 0.8497 | 0.0845 | 0.2173 |
| MLP (epoch=100) | 97.973 | 2.027 | 0.2 | 0.9585 | 0.0292 | 0.1242 |
| MLP (epoch=500) | 98.6486 | 1.3514 | 0.87 | 0.9724 | 0.0222 | 0.1134 |
| SMO | 96.6216 | 3.3784 | 0.03 | 0.9311 | 0.0338 | 0.1838 |

Since the vote dataset has 435 instances, 287 are used for training and 148 for testing in percentage split mode. Table 7 shows that MLP has the highest prediction accuracy, followed by J48 and SMO; NB has the lowest. Though MLP (epoch=500) provides the highest prediction accuracy, its model building time is the largest of all the algorithms, while MLP (epoch=100) provides slightly lower accuracy with a model building time of 0.2 seconds. MLP also has lower MAE and RMSE, as it provides more accurate predictions with less variance.

The results in Table 6 and Table 7 show that moving from the 10_fold_cv to the percentage split method improves the accuracy of the classification algorithms on the vote dataset.

4. Car Dataset
Table 8 shows that for the car dataset MLP provides promising classification results with an accuracy of 99%, followed by SMO and KNN. Though the time required to build the model with NB is very small compared to MLP, it suffers from the lowest prediction accuracy. MLP (epoch=100) provides slightly lower accuracy than MLP (epoch=500), with a model building time of 1.64 seconds. MLP provides more accurate predictions with less variance.
Table 8. Result for Car dataset using 10_fold_cv

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 92.3611 | 7.6389 | 0.05 | 0.8343 | 0.0421 | 0.1718 |
| NB | 85.5324 | 14.4676 | 0.02 | 0.6665 | 0.1137 | 0.2262 |
| KNN | 93.5185 | 6.4815 | 0 | 0.853 | 0.1122 | 0.1953 |
| MLP (epoch=100) | 99.3634 | 0.6366 | 1.64 | 0.9861 | 0.0115 | 0.0548 |
| MLP (epoch=500) | 99.537 | 0.463 | 7.6 | 0.9899 | 0.0062 | 0.0456 |
| SMO | 93.75 | 6.25 | 0.52 | 0.8649 | 0.2559 | 0.3202 |

Table 9. Result for Car dataset using Percentage Split

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 90.9864 | 9.0136 | 0.05 | 0.8088 | 0.0509 | 0.1883 |
| NB | 87.585 | 12.415 | 0 | 0.719 | 0.1145 | 0.2248 |
| KNN | 90.6463 | 9.3537 | 0 | 0.7821 | 0.1152 | 0.2041 |
| MLP (epoch=100) | 98.9796 | 1.0204 | 1.23 | 0.9776 | 0.0156 | 0.0639 |
| MLP (epoch=500) | 99.1497 | 0.8503 | 6.16 | 0.9814 | 0.0091 | 0.0582 |
| SMO | 93.3673 | 6.6327 | 0.23 | 0.8573 | 0.2561 | 0.321 |

Since the car dataset has 1728 instances, 1140 are used for training and 588 for testing in percentage split mode. Table 9 shows results similar to those of 10_fold_cv: MLP has the highest prediction accuracy, followed by SMO. The results in Table 8 and Table 9 show that moving from the 10_fold_cv to the percentage split method reduces the accuracy of the classification algorithms on the car dataset.
5. Spam Dataset
Table 10 shows that for the spam dataset J48 provides the highest prediction accuracy, followed by MLP (epoch=500), SMO and KNN. KNN required the least time to build the model, followed by NB and SMO. Table 10 also shows that increasing the number of epochs from 100 to 500 increases the model building time about nine times, with only a 2% increase in accuracy. J48 has lower mean absolute error and root mean squared error, as it provides more accurate predictions with less variance.

Table 10. Result for Spam dataset using 10_fold_cv

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 92.9798 | 7.0202 | 1.46 | 0.8528 | 0.0892 | 0.2562 |
| NB | 79.2871 | 20.7129 | 0.17 | 0.5965 | 0.2066 | 0.4527 |
| KNN | 90.4151 | 9.5849 | 0 | 0.7983 | 0.1355 | 0.2778 |
| MLP (epoch=100) | 89.4371 | 10.5629 | 18.1 | 0.7787 | 0.137 | 0.2846 |
| MLP (epoch=500) | 91.4366 | 8.5634 | 156.8 | 0.8205 | 0.108 | 0.2631 |
| SMO | 90.4151 | 9.5849 | 0.62 | 0.7959 | 0.0958 | 0.3096 |

Table 11. Result for Spam dataset using Percentage Split

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 92.1995 | 7.8005 | 0.62 | 0.835 | 0.102 | 0.2686 |
| NB | 78.0051 | 21.994 | 0.11 | 0.572 | 0.220 | 0.4676 |
| KNN | 89.2583 | 10.741 | 0 | 0.774 | 0.155 | 0.2869 |
| MLP (epoch=100) | 87.8517 | 12.148 | 13.63 | 0.735 | 0.145 | 0.3019 |
| MLP (epoch=500) | 87.5959 | 12.404 | 68.34 | 0.728 | 0.139 | 0.3011 |
| SMO | 90.5371 | 9.4629 | 0.22 | 0.797 | 0.094 | 0.3076 |

For the spam dataset, 3037 instances are used for training and 1564 for testing in percentage split mode. Table 11 shows that J48 has the highest prediction accuracy, followed by SMO and KNN; NB has the lowest. The Kappa Statistic for J48 is the highest, indicating the best agreement in classification.

The results in Table 10 and Table 11 show that moving from the 10_fold_cv to the percentage split method reduces the accuracy of most of the classification algorithms on the spam dataset.
6. Audiology Dataset
Table 12 shows that for the audiology dataset MLP (epoch=500) provides the highest prediction accuracy, followed by SMO and MLP (epoch=100). MLP provides more accurate predictions with less variance. KNN and NB required the least time to build the model, but they suffer from the lowest classification accuracy. Table 12 also shows that increasing the number of epochs from 100 to 500 increases the model building time to about six times that of epoch=100, with only a 3% increase in accuracy.

Table 12. Result for Audiology dataset using 10_fold_cv

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 77.8761 | 22.1239 | 0.02 | 0.7418 | 0.022 | 0.1201 |
| NB | 73.4513 | 26.5487 | 0 | 0.6821 | 0.0263 | 0.1362 |
| KNN | 62.8319 | 37.1681 | 0 | 0.5539 | 0.038 | 0.1441 |
| MLP (epoch=100) | 80.0885 | 19.9115 | 4.74 | 0.7661 | 0.0252 | 0.1094 |
| MLP (epoch=500) | 83.1858 | 16.8142 | 27.38 | 0.8028 | 0.0177 | 0.1026 |
| SMO | 81.8584 | 18.1416 | 1.3 | 0.7872 | 0.0767 | 0.1934 |
Table 13. Result for Audiology dataset using Percentage Split

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 83.1169 | 16.8831 | 0.05 | 0.8033 | 0.0212 | 0.1155 |
| NB | 71.4286 | 28.5714 | 0.02 | 0.6572 | 0.0284 | 0.1399 |
| KNN | 58.4416 | 41.5584 | 0 | 0.4935 | 0.0398 | 0.1408 |
| MLP (epoch=100) | 83.1169 | 16.8831 | 2.62 | 0.8012 | 0.0261 | 0.1029 |
| MLP (epoch=500) | 84.4156 | 15.5844 | 13.43 | 0.8167 | 0.0178 | 0.0994 |
| SMO | NA | NA | NA | NA | NA | NA |

Since the audiology dataset has 226 instances, 149 are used for training and 77 for testing in percentage split mode. Table 13 shows results similar to those of 10_fold_cv: MLP has the highest prediction accuracy, followed by J48. The SMO algorithm did not run in percentage split mode, as it terminated with a shortage of memory.

The results in Table 12 and Table 13 show that moving from the 10_fold_cv to the percentage split method improves the accuracy of J48 and MLP and reduces the accuracy of the NB and KNN classification algorithms on the audiology dataset.

7. Nursery Dataset
Table 14 shows that for the nursery dataset MLP provides the highest prediction accuracy, followed by KNN and J48. KNN and NB required the least time to build the model, followed by J48 and MLP (epoch=100). MLP has lower mean absolute error and root mean squared error, as it provides more accurate predictions with less variance.

Table 14. Result for Nursery dataset using 10_fold_cv

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 97.0525 | 2.9475 | 0.22 | 0.9568 | 0.0153 | 0.0951 |
| NB | 90.3241 | 9.6759 | 0.02 | 0.8567 | 0.0765 | 0.1767 |
| KNN | 98.3796 | 1.6204 | 0.02 | 0.9761 | 0.0859 | 0.1466 |
| MLP (epoch=100) | 99.7377 | 0.2623 | 15.91 | 0.9962 | 0.0024 | 0.0194 |
| MLP (epoch=500) | 99.7299 | 0.2701 | 97.58 | 0.996 | 0.0014 | 0.0183 |
| SMO | 93.0787 | 6.9213 | 16.27 | 0.8985 | 0.2428 | 0.3202 |

Since the nursery dataset has 12960 instances, 8554 are used for training and 4406 for testing in percentage split mode.

Table 15. Result for Nursery dataset using Percentage Split

| Algorithm | Accuracy (%) | Error rate (%) | Time (sec) | KS | MAE | RMSE |
|---|---|---|---|---|---|---|
| J48 | 96.4821 | 3.5179 | 0.14 | 0.9483 | 0.0186 | 0.1055 |
| NB | 90.6718 | 9.3282 | 0.03 | 0.8618 | 0.077 | 0.1766 |
| KNN | 97.5261 | 2.4739 | 0 | 0.9636 | 0.0854 | 0.1512 |
| MLP (epoch=100) | 97.4353 | 2.5647 | 15.88 | 0.962 | 0.007 | 0.0521 |
| MLP (epoch=500) | 97.4353 | 2.5647 | 79.84 | 0.962 | 0.006 | 0.0514 |
| SMO | 92.828 | 7.172 | 16.8 | 0.8947 | 0.2429 | 0.3207 |

Table 15 shows that KNN has the highest prediction accuracy, followed by MLP and J48; NB and SMO have the lowest. The time to build the model is least for KNN and greatest for MLP. KNN has the highest Kappa Statistic, indicating the best agreement in classification, and shows little variance in its predictions.

The results in Table 14 and Table 15 show that moving from the 10_fold_cv to the percentage split method reduces the accuracy of the classification algorithms on the nursery dataset.
V. Observations
Figure 2 and Figure 3 show the accuracy of the different classification algorithms on the various datasets with the 10_fold_cv method and the percentage split method, respectively. The following are the observations:

- It has been observed that k_fold_cv gives better accuracy performance for binary classification, while percentage split mode gives better accuracy performance for multi-class classification.
- For the breast cancer, diabetes, vote and spambase datasets, which have two classes, SMO has the highest number of correctly classified instances compared to the other classification algorithms.
- For the car, audiology and nursery datasets, where the number of classes is more than two, MLP provides the highest prediction accuracy compared to SMO and J48.
- MLP classifiers require more time to build the required model.
- The performance of KNN is worst for the audiology dataset, as it has 70 attributes and 24 classes.
- KNN required the least time to build the model for every dataset, but it suffers from lower prediction accuracy.
- The performance of NB is lowest for the vote, car, spambase and nursery datasets, where the data are either categorical or real.
[Figure 2. Accuracy of classification algorithms on various datasets using 10-fold-cv (bar chart; series: J48, NB, KNN, MLP (epoch=100), MLP (epoch=500), SMO)]

[Figure 3. Accuracy of classification algorithms on various datasets using percentage split (bar chart; series: J48, NB, KNN, MLP (epoch=100), MLP (epoch=500), SMO)]
VI. Conclusion
This research conducted a performance comparison of five classification algorithms (decision tree, Naïve Bayes, instance based nearest neighbour, multilayer perceptron and support vector machine) on seven datasets with different characteristics in WEKA. The overall assessment showed that there is no single classification algorithm that can provide the best predictive model for all datasets. The accuracy of a predictive model is affected by the selection of attributes, the type of dataset, and the number of classes, attributes and instances. From this we conclude that different classification algorithms are designed to perform better on different datasets.

References:
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2011.
[2] M. Goebel, L. Gruenwald, "A survey of data mining and knowledge discovery software tools", ACM SIGKDD Explorations Newsletter, Vol. 1, No. 1, pp. 20-33, June 1999 [doi:10.1145/846170.846172].
[3] "WEKA – Data Mining Machine Learning Software", http://www.cs.waikato.ac.nz/ml/
[4] UCI Machine Learning Repository, available at: http://archieve.ics.uci.edu/ml/
[5] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg, "Top 10 algorithms in data mining", Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
[6] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica, vol. 31, pp. 249-268, 2007.
[7] Secker, Andrew, et al., "An experimental comparison of classification algorithms for hierarchical prediction of protein function", Expert Update (Magazine of the British Computer Society's Specialist Group on AI), 9.3 (2007): 17-22.
[8] Ryan Potter, "Comparison of Classification Algorithms Applied to Breast Cancer Diagnosis and Prognosis", Wiley Expert Systems, 24(1), 17-31, 2007.
[9] Varun Kumar, Luxmi Verma, "Binary Classifiers for Health Care Databases: A Comparative Study of Data Mining Classification Algorithms in the Diagnosis of Breast Cancer", IJSCT, Vol. 1, Issue 2, December 2010.
[10] Mohd Fauzi bin Othman, Thomas Moh Shan Yau, "Comparison of Different Classification Techniques Using WEKA for Breast Cancer", IFMBE Proceedings 15, pp. 520-523, 2007, Springer-Verlag Berlin Heidelberg.
[11] A. H. Wahbeh, Q. A. Al-Rasaideh, M. N. Al-Kabi, and E. M. Al-Shawakfa, "A Comparison Study between Data Mining Tools over some Classification Methods", International Journal of Advanced Computer Science and Applications, Special Issue on Artificial Intelligence.
[12] Bharat Deshmukh, Ajay Patil, B. V. Pawar, "Comparison of Classification Algorithms using WEKA on various Datasets", International Journal of Computer Science and Information Technology, Vol. 4, No. 2, December 2011, pp. 85-90.
[13] Peiman Mamani Barnaghi, Vahid Alizadeh Sahzabi, Azuraliza Abu Bakar, "A Comparative Study for Various Methods of Classification", International Conference on Information and Computer Networks (ICICN 2012), Vol. 27, IACSIT Press, Singapore.
[14] Rohit Arora, Suman, "Comparative Analysis of Classification Algorithms on Different Datasets using WEKA", International Journal of Computer Applications (0975-8887), Vol. 54, No. 13, September 2012.
[15] P. Nancy, R. Geetha Ramani, "A Comparison on Performance of Data Mining Algorithms in Classification of Social Network Data", International Journal of Computer Applications (0975-8887), Vol. 32, No. 8, October 2011.
[16] Hetal Bhavsar, Amit Ganatra, "A Comparative Study of Training Algorithms for Supervised Machine Learning", International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 2, Issue 4, September 2012.
[17] Corinna Cortes and Vladimir Vapnik, "Support-Vector Networks", Machine Learning, vol. 20, pp. 273-297, 1995.