OPSEARCH, VOL. 41, No. 3, 2004
0030-3887/00 $ 5.00+0.00 © Operational Research Society of India
Rules for Comparing Predictive Data Mining Algorithms by Error Rate

D. Koliastasis and D.K. Despotis
University of Piraeus, Greece
Abstract

In knowledge discovery from databases and warehouses, a growing number of algorithms and practices can be used for the very same application, for example to predict the value of a specific field. These algorithms are trained on a portion of the original data set and tested on the remaining data for accuracy. Furthermore, the estimated value of the target field is often tested against a second data set for evaluation purposes. This paper examines the factors affecting the performance, as defined by the produced error rate, of some popular predictive data mining algorithms, such as decision trees, neural nets and regression, on many data sets from different sources. These factors may be the number of attributes, the type of each field, the number of missing values, etc. Finally, it is tested whether it is possible to gauge a priori which algorithm(s) will produce the lowest error rate for each specific data set. As a result, some heuristic rules are listed to facilitate the decision maker in selecting the best possible technique.

Keywords: Dataset characteristics, Predictive algorithms, Decrease in error.

1. Introduction

In knowledge discovery from databases, the most frequent task for the analyst is to predict the value of a specific field on the basis of the information extracted from the raw data of the database. The established practice is to collect the most relevant data and to transform them in a way that a suitable algorithm can be applied. The algorithm is trained on a portion of the original data set and tested on the remaining data for its accuracy. Furthermore, the estimated value of the target field is often tested against a second data set for validation purposes.

There is an increasing number of algorithms and practices that can be used for the very same application. Extensive research has been performed to develop appropriate machine learning techniques for different data mining tasks, which has led to a proliferation of learning algorithms. However, previous work has shown that no learner is generally better than another: if a learner performs better than another on some learning situations, then the first learner usually performs worse than the second on other situations [6]. In other words, no single learning algorithm can uniformly outperform other algorithms over all data mining tasks. This has been confirmed by the "no free lunch" theorems [7].
The major reasons are that a learning algorithm performs differently on different datasets and that different learning algorithms are implemented with different search heuristics, which results in a variety of 'inductive biases' [4]. In real-world applications, the user needs to select an appropriate learning algorithm according to the mining task that is to be performed [2], [5]. An inappropriate selection will result in slow convergence or may even lead to a sub-optimal local minimum.

Meta-learning has been proposed to deal with the issue of algorithm selection [3]. One of its aims is to assist the user in determining the most suitable learning algorithm(s) for the problem at hand. The task of meta-learning is to find functions that map datasets to predicted data mining performance (e.g., predictive accuracy, execution time, etc.). To this end, meta-learning uses a set of attributes, called meta-attributes, to represent the characteristics of data mining tasks, and searches for correlations between these attributes and the performance of learning algorithms. Instead of executing all learning algorithms to obtain the optimal one, meta-learning is performed on the meta-data characterizing the data mining tasks. The effectiveness of meta-learning is therefore largely dependent on the description of tasks (i.e., the meta-attributes).

The remainder of the paper is organized as follows. The characteristics of the data sets are described in detail in section 2, along with the examined algorithms. Experiments illustrating the factors affecting the performance of each algorithm are described in section 3. Section 4 concludes the paper by stating some heuristic rules and points out interesting possibilities for future work.

2. Predictive algorithms and dataset characteristics

In general there are two families of algorithms: the statistical ones, which are best implemented by an experienced analyst since they require a lot of technical skill and specific assumptions, and the data mining tools, which do not require much model specification but offer few diagnostic tools. Each family has reliable and well-tested algorithms that can be used for prediction. In the classification task, the most frequently encountered algorithms are logistic regression (LR), decision trees and decision rules, neural networks (NN) and discriminant analysis (DA). In the regression task, multiple linear regression (MLR), classification & regression trees (C&RT) and neural networks have been used extensively.

In the classification task the error rate is defined straightforwardly as the percentage of misclassified cases in the observed-versus-predicted contingency table. When NNs are used to predict a scalar quantity, the square of the correlation between the predicted outcome and the target response is analogous to the r-square measure of MLR. The error rate in the prediction task can therefore be defined as:

Error rate = 1 - correlation^2(observed, predicted).

In both tasks, the error rate varies from zero to one, with one indicating bad performance of the model and zero the best possible performance.
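Both definitions are simple to compute; the sketch below illustrates them (the function names and the toy data are ours, not part of the paper):

```python
import numpy as np

def classification_error_rate(observed, predicted):
    """Fraction of misclassified cases in the observed-vs-predicted table."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return float(np.mean(observed != predicted))

def regression_error_rate(observed, predicted):
    """1 - squared Pearson correlation between observed and predicted values."""
    r = np.corrcoef(observed, predicted)[0, 1]
    return float(1.0 - r ** 2)

# Both measures lie in [0, 1]; 0 indicates a perfect model.
y_true = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
y_hat = np.array([2.2, 3.9, 6.1, 7.9, 10.1])
print(regression_error_rate(y_true, y_hat))                    # close to 0
print(classification_error_rate([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.25
```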
The dataset characteristics are related to the type of problem. In the classification task, the number of classes, the entropy of the classes and the percentage of the mode category of the class can serve as useful indicators. The relevant ones for the regression task might be the mean value of the dependent variable, the median, the mode, the standard deviation, the skewness and the kurtosis. Some database-level measures include the number of records, the percentage of the original dataset used for training and for testing, the number of missing values and the percentage of incomplete records. Useful information also lies in the total number of variables.

For the categorical variables of the database, the number of dimensions in homogeneity analysis and the average gain of the first and second eigenvalues of homogeneity analysis, as well as the average attribute entropy, are the corresponding statistics. For the continuous variables, the average mean value, the average 5% trimmed mean, the median, the variance, the standard deviation, the range, the inter-quartile range, the skewness, the kurtosis and Huber's M-estimator are some of the useful statistics that can be applied to capture the information in the data set. The determinant of the correlation matrix is an indicator of the interdependency of the attributes in the data set. The average correlation, as captured by Cronbach's α reliability coefficient, may also be an important statistic. By applying principal component analysis to the numerical variables of the data set, the first and second largest eigenvalues can be observed.

If the data set for a classification task has categorical explanatory variables, then the average information gain and the noise-to-signal ratio are two useful information measures, while the average Goodman and Kruskal tau and the average chi-square significance value are two statistical indicators. In the case of continuous explanatory variables, Wilks' lambda and the canonical correlation of the first discriminant function may serve as measures of the discriminating power within the data set. By relating a numeric with a nominal variable, as in a one-way analysis of variance, two important statistics are produced to indicate the degree of their relation, namely eta squared and the significance of the F-test.

The characteristics of the dataset can also be captured from the output of relevant algorithms applied to it. When applying a NN, the response of the network output to variation in the input values is measured by the sensitivities of the input attributes; three interesting measures are the average sensitivity value and the maximum and second largest sensitivity values for the produced topology. Similarly, when running a decision tree on the data set, the number of terminal branches and the depth of the decision tree (C5.0) are useful statistics measuring the complexity of the data set.
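Most of these meta-attributes can be computed directly from a raw table. The following sketch (assuming a pandas DataFrame; all helper names are ours) illustrates a handful of the measures listed above:

```python
import numpy as np
import pandas as pd

def cronbach_alpha(num: pd.DataFrame) -> float:
    """Average-correlation indicator: k/(k-1) * (1 - sum(item var)/var(total)).
    Assumes at least two numeric columns."""
    k = num.shape[1]
    return k / (k - 1) * (1 - num.var(ddof=1).sum() / num.sum(axis=1).var(ddof=1))

def attribute_entropy(col: pd.Series) -> float:
    """Shannon entropy (in bits) of a categorical attribute."""
    p = col.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def meta_attributes(df: pd.DataFrame) -> dict:
    """A handful of the dataset characteristics discussed in section 2."""
    num = df.select_dtypes(include="number")
    cat = df.select_dtypes(exclude="number")
    return {
        "n_records": len(df),
        "n_attributes": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "pct_incomplete": float(df.isna().any(axis=1).mean()),
        "avg_skewness": float(num.skew().mean()),
        "avg_kurtosis": float(num.kurt().mean()),
        "det_correlation": float(np.linalg.det(num.corr().to_numpy())),
        "cronbach_alpha": float(cronbach_alpha(num.dropna())),
        "avg_attr_entropy": float(np.mean([attribute_entropy(cat[c]) for c in cat.columns]))
                            if len(cat.columns) else float("nan"),
    }
```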
3. Factors affecting the error rate

In this section we experimentally evaluate some hypotheses concerning the relation between the performance of an algorithm and the data characteristics. In section 3.1 we describe our experimental set-up, in sections 3.2 and 3.3 we work on hypotheses for the classification and regression tasks, and in section 3.4 we study the relation between the algorithms and the performed task.

3.1. Experimental set-up

In order to test any hypothesis concerning the interaction of the preferred algorithm and the metrics used, appropriate data sets have to be constructed. Most of the data sets used have been extracted from the UCI repository [1]. Synthetic data sets of various sample sizes, from 10000 to 100000 records, have been constructed in a semi-random way, using as input the values of a multivariate normal distribution; in a second step, a specific amount of noise has been inserted at random (a sketch of this scheme is given below). The 15 data sets for the classification task are all from the UCI repository. For the prediction task, five datasets are from UCI and seven are synthetic.

A central part of the experiment is to test the algorithms on the basis of their default settings. It is well understood that in the field of data analysis it is the responsibility of the analyst to tune the implemented algorithm and improve its performance. For the decision trees and rules, simple C5.0 decision trees have been used with a minimum of two records per child branch, as many data sets have small samples. In the NN model the multi-layer perceptron (MLP) methodology, also called a back-propagation network, has been used with one hidden layer, stopping when the network appears to have reached its optimal trained state. In LR and DA the default settings have been applied, while in some cases the categorical fields have been transformed into a set of flag indicator fields and entered into the set of explanatory fields.
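The synthetic generation scheme mentioned above can be sketched as follows; the covariance structure and the noise model are our assumptions, since the paper does not specify them:

```python
import numpy as np

rng = np.random.default_rng(42)

def synthetic_dataset(n_records: int, n_attributes: int, noise_frac: float = 0.1):
    """Draw records from a multivariate normal, then corrupt a random
    fraction of cells with noise -- a rough analogue of the paper's scheme."""
    # Random positive-definite covariance matrix.
    a = rng.standard_normal((n_attributes, n_attributes))
    cov = a @ a.T + n_attributes * np.eye(n_attributes)
    x = rng.multivariate_normal(np.zeros(n_attributes), cov, size=n_records)

    # Inject noise into a randomly chosen fraction of the cells.
    mask = rng.random(x.shape) < noise_frac
    x[mask] += rng.normal(0.0, x.std(), size=mask.sum())
    return x

data = synthetic_dataset(10000, 8)   # sample sizes in the paper go up to 100000
print(data.shape)                    # (10000, 8)
```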
3.2. The classification task

One of the more interesting hypotheses that can be tested on the classification task is that the error rates are not highly correlated across all algorithms (see Table 1 below).

Table 1: Correlations across the tested algorithms: Classification task

              LR     C5.0 tree  C5.0 rules  NN     DA
  LR          1.00
  C5.0 tree   0.48   1.00
  C5.0 rules  0.53   1.00       1.00
  NN          0.48   0.95       0.96        1.00
  DA          0.38   0.01       0.04        0.15   1.00
The number of dimensions of the data set, as produced by homogeneity analysis, tends to increase the predictive ability of DA, while it simultaneously increases the error rate of the competing algorithms. The same situation holds for some other characteristics, while for many others all techniques behave consistently. On the other hand, logistic regression seems to be more sensitive to the entropy of the classes of the dependent variable. Table 2 below provides all the correlation statistics.

Table 2: Correlations of data characteristics vs. prediction error across different algorithms: Classification task
  Data characteristic                                          LR     C5.0 tree  C5.0 rules  NN     DA
  Cox & Snell R square                                        -1.00   -0.66      -0.68      -0.65  -0.38
  Canonical correlation of the 1st discriminant function      -0.89   -0.48      -0.55      -0.58  -0.89
  Average information gain                                    -0.10    0.12       0.10      -0.04  -0.67
  Average Skewness value                                       0.19   -0.48      -0.49      -0.59  -0.51
  Number of dimensions in homogeneity analysis                 0.37    0.41       0.41       0.37  -0.56
  Percent of incomplete records                               -0.18    0.32       0.31       0.39  -0.53
  Number of "nominal" attributes                               0.07    0.59       0.60       0.58  -0.52
  Average Eta squared                                         -0.49   -0.38      -0.47      -0.33  -0.50
  Number of attributes                                         0.03    0.60       0.59       0.58  -0.48
  Average Kurtosis value                                       0.25   -0.39      -0.39      -0.47  -0.36
  Average gain of the 2nd eigenvalue of homogeneity analysis  -0.31   -0.38      -0.37      -0.46  -0.36
  Average Goodman and Kruskal tau                              0.36    0.65       0.64       0.51  -0.44
  Average gain of the 1st eigenvalue of homogeneity analysis  -0.35   -0.38      -0.38      -0.43  -0.22
  Average chi-square sig. value                               -0.43   -0.23      -0.21      -0.16   0.27
  Number of missing values                                    -0.43   -0.22      -0.23      -0.12  -0.27
  Number of records                                           -0.41   -0.34      -0.36      -0.27  -0.36
  Average attribute entropy                                   -0.40   -0.34      -0.33      -0.19   0.51
  Maximum sensitivity                                          0.11   -0.27      -0.29      -0.36   0.20
  2nd largest sensitivity                                     -0.05   -0.24      -0.25      -0.34   0.32
  Percent of the mode class                                   -0.27   -0.31      -0.32      -0.33  -0.04
  Number of classes                                            0.44   -0.09      -0.09      -0.04  -0.31
  Average Minimum value                                       -0.29    0.16       0.25       0.34   0.24
  Cronbach's α                                                -0.10   -0.25      -0.29      -0.19  -0.22
  2nd eigenvalue of the factor analysis                       -0.27   -0.22      -0.24      -0.21  -0.15
  1st eigenvalue of the factor analysis                       -0.26   -0.22      -0.23      -0.24  -0.19
  Number of "numeric" attributes                              -0.07   -0.18      -0.19      -0.18   0.21
  Entropy of the classes                                       0.46    0.15       0.15       0.21  -0.17
  Number of terminal branches                                 -0.17    0.26       0.25       0.34   0.10
  Average sensitivity                                          0.16   -0.05      -0.07      -0.14   0.54
  Average Std. Deviation value                                -0.06    0.23       0.24      -0.04  -0.13
  Noise to signal ratio                                        0.40   -0.11      -0.06       0.10   0.94
  Average 5% Trimmed Mean value                               -0.08    0.21       0.24       0.09  -0.03
  Average Huber's M-estimator value                           -0.08    0.15       0.17       0.05  -0.06
  Average Mean value                                          -0.07    0.26       0.28       0.09  -0.04
  Average Median value                                        -0.06    0.25       0.29       0.18   0.05
  Average Variance value                                      -0.02    0.26       0.28       0.01  -0.06
  Average Range value                                         -0.02    0.25       0.26       0.00  -0.05
  Average Maximum value                                       -0.02    0.25       0.26       0.00  -0.05
  Average Inter-quartile Range value                           0.05    0.28       0.30       0.10  -0.01
  Average Sig. F-test                                          0.15    0.11       0.17       0.27   0.25
  Depth of the C5.0 tree                                       0.25    0.34       0.35       0.46   0.53
  Determinant of the correlation matrix                        0.26    0.25       0.33       0.38   0.43
  Wilks' lambda                                                0.74    0.65       0.71       0.62   0.80
In order to reach conclusions about the effect of an independent characteristic on the relative performance of each algorithm, repeated-measures analysis is the appropriate technique; the "tests of between-subjects effects" examine this hypothesis. Apart from some characteristics that are used implicitly in each algorithm, such as the canonical correlation in the DA case, the number of classes, the entropy of the classes, the depth of the decision tree and the average eta squared are correlated differently with the predictive algorithms. For example, the depth of the C5.0 decision tree has a correlation of 0.53 with the error of DA, 0.46 with the NN error, 0.34 with the decision tree error and only 0.25 with the LR error.
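Given a table with one row per dataset, correlations such as those in Table 2 reduce to a single cross-correlation. A minimal sketch (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical layout: one row per dataset, with meta-attributes and the
# error rate of each algorithm on that dataset.
meta_cols = ["n_classes", "class_entropy", "tree_depth", "avg_eta_squared"]
error_cols = ["err_LR", "err_C50_tree", "err_C50_rules", "err_NN", "err_DA"]

def characteristic_error_correlations(results: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of each data characteristic with each
    algorithm's error rate, i.e. a Table 2-style matrix."""
    return results[meta_cols + error_cols].corr().loc[meta_cols, error_cols]

# usage: table2 = characteristic_error_correlations(results_df)
```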
3.3. The regression task

What is evident, and rather unsurprising, is that MLR, NN and C&RT produce highly correlated results (see Table 3 below).

Table 3: Correlations across the algorithms: Regression task

          MLR    NN     C&RT
  MLR     1.00
  NN      0.98   1.00
  C&RT    0.99   0.99   1.00
In the regression task, the average correlation of the numeric attributes (Cronbach's α) is related more strongly to the error of the NN than to that of MLR. Table 4 below displays the correlations of the characteristics with the error of each algorithm. Average eta squared, the first eigenvalue of the factor analysis, the average skewness and the number of missing values are the four characteristics most negatively related to the prediction error in the case of MLR. The ranking is almost equivalent for the C&RT trees (average eta squared, first eigenvalue of the factor analysis, average skewness and second largest sensitivity of the default NN) and for the NN (average eta squared, first eigenvalue of the factor analysis, second largest sensitivity of the default NN and average skewness). On the contrary, the number of records, the determinant of the correlation matrix and the average significance of the F-test are the characteristics most positively related to the prediction error.

An increase in the number of nominal attributes (which can be transformed into a set of flag attributes), the standard deviation, the mode and the mean of the dependent variable may lead to a better performance of MLR relative to the other two competing algorithms. Similar remarks hold for the average minimum value, the average Huber's M-estimator value and the average median value in the case of the C&RT trees, and for the second largest sensitivity and the number of continuous variables in the case of the NN.

3.4. Hypotheses independent of the task

Univariate analysis of variance can be used to test hypotheses concerning the interaction of some specific factors on the error rate of each algorithm. Although NNs have on average different error rates when applied to the regression and the classification task, things remain stable whether the training or the test set is analysed; furthermore, there is no significant interaction between these two factors. Although the error rate of each algorithm cannot be assumed equal, the effect of the size of the data set tends to be important, even though this experiment does not have the necessary observed power to detect such a difference. The interaction of algorithm and data set size is less significant, given that practically all algorithms respond in a similar way to an increase in sample size. The correlation coefficient of the "percent of the original set used for testing" with the NN error is almost zero, indicating no significant relation between the training size and the performance of the algorithm.
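A univariate ANOVA of this kind can be expressed compactly with statsmodels; the sketch below assumes a long-format table of experimental runs with hypothetical column names:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical layout: one row per (dataset, algorithm) run, with the data
# set size (possibly binned) and the observed error rate.
# runs = pd.DataFrame({"algorithm": ..., "size": ..., "error": ...})

def anova_algorithm_by_size(runs: pd.DataFrame) -> pd.DataFrame:
    """Two-way ANOVA of error on algorithm, data set size and their
    interaction -- the kind of test described in section 3.4."""
    model = ols("error ~ C(algorithm) * C(size)", data=runs).fit()
    return sm.stats.anova_lm(model, typ=2)
```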
Table 4: Correlations of data characteristics vs. prediction error across different algorithms: Regression task

  Data characteristic                               MLR     C&RT    NN
  Average Eta squared                               -0.96   -0.99   -0.98
  1st eigenvalue of the factor analysis             -0.84   -0.84   -0.86
  2nd largest sensitivity of the neural network     -0.66   -0.68   -0.74
  Average Skewness value                            -0.69   -0.70   -0.70
  Number of missing values                          -0.68   -0.64   -0.61
  Average Huber's M-estimator value                 -0.64   -0.66   -0.62
  Average Median value                              -0.64   -0.65   -0.61
  Average 5% Trimmed Mean value                     -0.61   -0.62   -0.58
  Average Mean value                                -0.59   -0.60   -0.57
  Skewness of the dependent attribute               -0.59   -0.59   -0.58
  2nd eigenvalue of the factor analysis             -0.56   -0.54   -0.48
  Average Minimum value                             -0.52   -0.54   -0.49
  Average Kurtosis value                            -0.52   -0.52   -0.53
  Number of "numeric" attributes                    -0.45   -0.44   -0.51
  Average Inter-quartile Range value                -0.49   -0.49   -0.47
  Average Maximum value                             -0.47   -0.47   -0.44
  Number of attributes                              -0.47   -0.44   -0.45
  Average Std. Deviation value                      -0.46   -0.46   -0.44
  Mode value of the dependent attribute             -0.45   -0.42   -0.39
  Average Range value                               -0.45   -0.45   -0.43
  Percent of incomplete records                     -0.45   -0.42   -0.39
  Standard deviation of the dependent attribute     -0.44   -0.40   -0.37
  Kurtosis of the dependent attribute               -0.44   -0.43   -0.41
  Mean value of the dependent attribute             -0.43   -0.40   -0.37
  Median value of the dependent attribute           -0.43   -0.39   -0.36
  Average Variance value                            -0.41   -0.41   -0.39
  Number of "nominal" attributes                    -0.26   -0.22   -0.18
  Average sensitivity of the neural network         -0.08   -0.10   -0.12
  Average attribute entropy                         -0.04   -0.02   -0.03
  Cronbach's α                                       0.29    0.31    0.40
  Maximum sensitivity of the neural network          0.40    0.40    0.37
  Number of records                                  0.51    0.53    0.53
  Determinant of the correlation matrix              0.73    0.75    0.76
  Average Sig. F-test                                0.86    0.87    0.88
4. Conclusion

The main scope of this paper was to examine the correlation of some commonly used dataset characteristics with the error rate of various common predictive algorithms. In the classification task, the data characteristics tend to discriminate better between the algorithms, whereas in the regression task the algorithms behave more similarly with respect to the error rate. On the basis of the analysis of the experimental results, some conclusions can be drawn for the classification task, as follows:

- The deeper the C5.0 tree, the higher the error of the NN relative to the error of C5.0.
- The lower the number of nominal variables, the higher the error of DA relative to the error of the C5.0 tree.
- The higher the number of classes, the higher the error of LR relative to the error of the C5.0 tree.

For the regression task, the following practical rule can be derived: the lower the total number of variables, the higher the error of MLR relative to the error of the NN. Such rules can be encoded directly as a simple algorithm-screening step, as sketched below.

Some of the above rules, apart from the more prominent ones, are not statistically significant; for this reason there is a need to enrich the dataset base with data from other sources, as well as with synthetic data of varying sizes and information content. Moreover, the validity and reliability of the calculated coefficients is a topic that must be further investigated, since a single attribute may have an excessive influence on the statistics.
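To make the flavour of such heuristic rules concrete, the sketch below encodes them as a screening step; the numeric thresholds are invented for illustration, since the paper only establishes the direction of each relationship:

```python
def discouraged_algorithms(meta: dict) -> list[str]:
    """Flag algorithms the heuristic rules warn against for a dataset
    described by its meta-attributes (thresholds are illustrative only)."""
    flags = []
    if meta.get("c50_tree_depth", 0) > 10:        # deep C5.0 tree -> NN suffers
        flags.append("NN")
    if meta.get("n_nominal_attributes", 99) < 3:  # few nominal vars -> DA suffers
        flags.append("DA")
    if meta.get("n_classes", 2) > 5:              # many classes -> LR suffers
        flags.append("LR")
    if meta.get("n_attributes", 99) < 5:          # few variables -> MLR suffers
        flags.append("MLR")
    return flags

print(discouraged_algorithms({"c50_tree_depth": 14, "n_classes": 8}))
# ['NN', 'LR']
```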
5. References

[1] Blake, C., Keogh, E. and Merz, C. (1998), UCI Repository of Machine Learning Databases, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html

[2] Brodley, C.E. (1995), Recursive automatic bias selection for classifier construction, Machine Learning, 20, 63-94.

[3] Kalousis, A. and Hilario, M. (2000), Model Selection via Meta-learning: a Comparative Study, Proceedings of the 12th International IEEE Conference on Tools with AI, Canada, 214-220.

[4] Mitchell, T. (1997), Machine Learning, McGraw Hill.

[5] Schaffer, C. (1993), Selecting a Classification Method by Cross Validation, Machine Learning, 13, 135-143.

[6] Schaffer, C. (1994), Cross-validation, stacking and bi-level stacking: Meta-methods for classification learning, in Cheeseman, P. and Oldford, R.W. (eds), Selecting Models from Data: Artificial Intelligence and Statistics IV, 51-59.

[7] Wolpert, D. (1996), The Lack of a Priori Distinctions between Learning Algorithms, Neural Computation, 8, 1341-1420.