Artif Intell Rev (2011) 36:249–266 DOI 10.1007/s10462-011-9211-4
An incremental ensemble of classifiers S. B. Kotsiantis
Published online: 11 March 2011
© Springer Science+Business Media B.V. 2011

S. B. Kotsiantis, Educational Software Development Laboratory, Department of Mathematics, University of Patras, P.O. Box 1399, 26 500 Rio, Greece; e-mail: [email protected]
Abstract  With the continuing growth of data and information, incremental learning ability is becoming more and more important for machine learning approaches. In contrast to classic batch learning algorithms, which synthesize all available information at once, online algorithms process examples as they arrive and avoid retaining information that is no longer needed. Combining classifiers has been proposed as a promising route to improving classification accuracy; however, most ensemble algorithms operate in batch mode. For this reason, we propose an incremental ensemble that combines, by voting, five classifiers that can operate incrementally: Naive Bayes, Averaged One-Dependence Estimators (AODE), 3-Nearest Neighbors, Non-Nested Generalised Exemplars (NNGE) and K-star. We performed a large-scale comparison of the proposed ensemble with other state-of-the-art algorithms on several datasets, and the proposed method produced better accuracy in most cases.

Keywords  Incremental classifier · Online learning · Classification
1 Introduction

Supervised learning algorithms use instances which have already been pre-classified in some way. That is, each instance has a label identifying the class to which it belongs. Supervised machine learning explores algorithms that reason from the externally supplied instances to produce general hypotheses, which can then make predictions about future instances. To induce a hypothesis from a given set of data, a learning system needs to make assumptions about the hypothesis to be learned. These assumptions are called biases. A learning system without any assumptions cannot generate a useful hypothesis, since the number of hypotheses consistent with the set of data is usually colossal. Since every learning algorithm uses
some biases, it behaves well in domains where its biases are suitable and performs poorly in others (Rokach 2010). For this reason, combining classifiers has been proposed as a new direction for improving classification accuracy (Ulaş et al. 2009). Dietterich (2000) discusses three fundamental reasons why combining classifiers can achieve better performance: statistical, representational and computational. The statistical argument starts with the observation that any learning algorithm tries to find a hypothesis with high accuracy on the training dataset. When the training set is small, there may be many different hypotheses that all provide the same accuracy on the training set; constructing an ensemble out of all these accurate classifiers reduces the risk of choosing a wrong hypothesis. Secondly, the representational argument follows from the fact that a learning algorithm may not be capable of representing the true function, either because it lies outside its hypothesis space or because the algorithm does not have enough training data to explore its whole hypothesis space to find it (e.g. the classifier stops searching once it finds a hypothesis that fits the training set). By combining several different hypotheses (e.g. using a weighted sum), it may be possible to expand the space of representable functions. Finally, the computational argument is that many learning algorithms perform some kind of local search in the hypothesis space that may get stuck in local optima; examples include gradient-based search in neural networks and greedy search in decision trees. An ensemble constructed by running the local search from different starting points may provide a better approximation to the true hypothesis.

However, most ensemble algorithms operate in batch mode, i.e., they repeatedly examine and process the entire training dataset. Typically, they require at least one pass through the training set for every base model added to the ensemble, and the base model learning algorithms themselves may need several passes through the training data to create each base model. In situations where data is being generated continuously, storing data for batch learning is not practical, which makes these ensemble learning algorithms unusable. These algorithms are also unsuitable when the training set is so large that examining and processing it many times would be prohibitively expensive.

Incremental learning ability is vital for machine learning approaches designed to solve real-world problems, for two reasons. Firstly, it is almost impossible to collect all useful training instances before the trained system is put into use; when new instances arrive, the learning approach should therefore be able to revise the trained system so that the knowledge encoded in the new examples can be incorporated. Secondly, modifying a trained system may be cheaper in time than building a new system from scratch, which is valuable mainly in real-time applications.

We propose an incremental ensemble that combines, by voting, five classifiers that can operate incrementally: the Naive Bayes, AODE, 3NN, NNGE and K-star algorithms. We performed a large-scale comparison of the proposed ensemble with other state-of-the-art algorithms on several datasets and obtained better accuracy in most cases.
Section 2 introduces some basic concepts of online learning, while Sect. 3 describes the proposed ensemble method. Experimental results and comparisons of the proposed method with other learning algorithms on several datasets are presented in Sect. 4. Finally, we conclude in Sect. 5 with a summary and additional research topics.
2 Online learning

One of the biggest challenges in data mining is handling huge datasets, which are common in many real application domains such as e-commerce and financial markets. Furthermore, recent advances in miniaturization and sensor technology have led to sensor networks that gather spatio-temporal data about the environment, revealing a new area that must deal with huge datasets. In such domains, thousands of measurements are collected every day, so the amount of information to be stored in databases is massive and continuously growing. The knowledge extracted from these databases therefore needs to be continuously updated; otherwise it may become outdated. The main problem with using traditional (non-incremental) learning algorithms to extract knowledge from such huge and continuously growing datasets is the high computational effort required.

When comparing online and batch algorithms, it is worth keeping in mind the different settings in which they can be applied. In a batch setting, an algorithm has a fixed collection of examples in hand and uses them to construct a hypothesis, which is then used for classification without further adaptation. In an online setting, the algorithm continually modifies its hypothesis as it is being used; it repeatedly receives a new pattern, predicts its value and possibly updates its hypothesis accordingly.

The online learning task is to acquire a set of concept descriptions from labeled training data distributed over time. This type of learning is important for many applications, such as intelligent user interfaces, computer security and market-basket analysis. For example, customer preferences change as new products and services become available. Algorithms for coping with concept drift must converge rapidly and accurately to new target concepts, while being efficient in space and time. Desirable characteristics for incremental learning systems (illustrated by the interface sketch after this list) are that they should:

• require small constant time per record
• be able to build a model using at most one scan of the dataset
• use only a fixed amount of main memory
• make a usable model available at any point in time
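To make these requirements concrete, the following minimal sketch (ours, not the paper's) shows the kind of interface such an incremental learner might expose in Python; the class and method names are illustrative assumptions.

```python
from abc import ABC, abstractmethod
from collections import Counter

class IncrementalLearner(ABC):
    """Hypothetical interface reflecting the four requirements listed above."""

    @abstractmethod
    def learn_one(self, x, y):
        """Update the model with a single labeled record (small, constant time)."""

    @abstractmethod
    def predict_one(self, x):
        """Return a prediction; a usable model must exist at any point in time."""

class MajorityClassLearner(IncrementalLearner):
    """Trivial example: keeps only per-class counts, so memory stays bounded and
    the dataset is scanned at most once."""

    def __init__(self):
        self.class_counts = Counter()

    def learn_one(self, x, y):
        self.class_counts[y] += 1            # O(1) work per record

    def predict_one(self, x):
        if not self.class_counts:
            return None                      # nothing seen yet
        return self.class_counts.most_common(1)[0][0]

# The model is usable after every single update.
learner = MajorityClassLearner()
for x, y in [({"age": 30}, "yes"), ({"age": 41}, "no"), ({"age": 35}, "yes")]:
    learner.learn_one(x, y)
print(learner.predict_one({"age": 50}))      # -> 'yes'
```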
Online learning algorithms process each training instance once "on arrival", without storage and reprocessing, and maintain a current hypothesis that reflects all the training examples seen so far. Such algorithms are also valuable for very large datasets, for which the multiple passes required by most batch algorithms are prohibitively costly. Some researchers have developed online versions of traditional machine learning models such as decision trees (Utgoff et al. 1997). Given an existing decision tree and a new instance, this algorithm adds the instance to the example sets at the appropriate non-terminal and leaf nodes and then verifies that the features at the non-terminal nodes and the class at the leaf node are still the best. Batch neural network learning is usually performed by making multiple passes (known in the literature as epochs) through the data, with each training example processed one at a time. A neural network can therefore be trained online by simply making a single pass through the data, although some loss is obviously associated with making only one pass (Saad 1998).
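As a concrete illustration of single-pass online training, here is a small sketch (ours, not taken from the cited works) of a perceptron, the simplest neural unit, updated once per example; the learning rate and the ±1 label encoding are illustrative assumptions.

```python
def train_perceptron_one_pass(stream, n_features, lr=0.1):
    """One online pass over the data: mistake-driven weight updates only."""
    w = [0.0] * n_features
    b = 0.0
    for x, y in stream:                          # y is +1 or -1
        activation = sum(wi * xi for wi, xi in zip(w, x)) + b
        y_hat = 1 if activation >= 0 else -1
        if y_hat != y:                           # update only on a mistake
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b

# Each example is processed exactly once, so time and memory per example stay constant.
weights, bias = train_perceptron_one_pass([([1.0, 0.0], 1), ([0.0, 1.0], -1)], n_features=2)
```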
A known disadvantage of all these algorithms is that it is very difficult to learn from several instances at once. To address this problem, some algorithms rely on windowing techniques (Widmer and Kubat 1996), which consist of storing the last k examples and performing a learning task each time a new instance is encountered.

The Weighted Majority (WM) algorithm (Littlestone and Warmuth 1994) is the basis of many online algorithms. WM maintains a weight vector over a set of classifiers and predicts the outcome using a weighted majority vote among them. WM learns this weight vector online by "punishing" incorrect classifiers. A number of similar algorithms have been developed, such as that of Auer and Warmuth (1998). It must be mentioned that WM is a binary algorithm and cannot straightforwardly handle multi-class problems.

The Voted-perceptron (Freund and Schapire 1999) stores more information during training and then uses this information to generate better predictions on test instances. The information it maintains during training is the list of all prediction vectors generated after each mistake. For each such vector, the algorithm counts the number of iterations the vector "survives" until the next error is made; this count is referred to as the "weight" of the prediction vector. To compute a prediction, the algorithm computes the binary prediction of each of the prediction vectors and combines these predictions by a weighted majority vote, using the survival times as weights. This makes intuitive sense, as "good" prediction vectors are likely to survive for a long time and thus have larger weight in the majority vote. The Voted-perceptron is also a binary algorithm and cannot straightforwardly handle multi-class problems.
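To make the Weighted Majority update described above concrete, here is a minimal sketch of the binary scheme (ours, not the pseudo-code of Littlestone and Warmuth); the penalty factor beta, the fixed "experts" and the function names are illustrative assumptions.

```python
def wm_predict(weights, expert_predictions):
    """Weighted vote of binary experts: return the label (0 or 1) with more total weight."""
    totals = {0: 0.0, 1: 0.0}
    for w, pred in zip(weights, expert_predictions):
        totals[pred] += w
    return 1 if totals[1] >= totals[0] else 0

def wm_update(weights, expert_predictions, true_label, beta=0.5):
    """'Punish' every expert that was wrong by multiplying its weight by beta."""
    return [w * beta if pred != true_label else w
            for w, pred in zip(weights, expert_predictions)]

# Toy online run with three fixed experts.
experts = [lambda x: 0, lambda x: 1, lambda x: 1]
weights = [1.0] * len(experts)
for x, y in [("a", 1), ("b", 1), ("c", 0)]:      # stream of (instance, true label)
    preds = [e(x) for e in experts]
    y_hat = wm_predict(weights, preds)           # predict before seeing the label
    weights = wm_update(weights, preds, y)       # then punish the wrong experts
```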
The concept of combining classifiers has been proposed as a new route to improving classifier performance (Gangardiwala and Polikar 2005; Kuncheva 2004). Each algorithm has a different inductive bias, that is, it makes different assumptions about the data and makes errors on different examples; by appropriate combination, the overall error can be reduced. A classifier can be accurate for some classes but not all, and it is best to use it only for the classes it is good at. Each classifier has a certain inductive bias, which may hold for some classes but not for all. Take the case of a linear discriminant: in a multi-class problem, one of the classes may be linearly separable from the others while another class is not. The output of the linear classifier may consequently be included in the ensemble of discriminants for the former class, whereas for the latter a more complex discriminant should be included. Some classes may be handled by a single discriminant, i.e., the output of one classifier suffices, while more discriminants are combined for more difficult classes.

Numerous methods have been proposed for constructing ensembles of classifiers (Dietterich 2000). Although, or perhaps because, many methods of ensemble creation have been proposed, there is as yet no clear picture of which method is best (Kotsiantis et al. 2006). Mechanisms used to build ensembles of classifiers include: (i) using different subsets of the training set with a single learning technique, (ii) using different training parameters with a single learning technique (e.g., different initial weights for every neural network in the ensemble) and (iii) using different learning techniques.

Unfortunately, in an online environment, it is less clear how to apply ensemble methods directly. For instance, with bagging, when a new misclassified example arrives, it is too inefficient to resample the available data and learn new classifiers. One solution is to rely on the user to specify the number of examples from the input stream for each base learner (Fern and Givan 2000; Oza and Russell 2001), but this approach assumes one knows a great deal about the structure of the data stream. There are also online boosting algorithms that reweight classifiers (Fan et al. 1999; Oza and Russell 2001), but these presume a fixed number of classifiers. Additionally, online boosting is likely to suffer a large loss initially, when the base models have been trained with very few instances, and the algorithm may never be able to recover (Chu and Zaniolo 2004). In the following section, we propose an incremental ensemble of classifiers.
3 Proposed incremental ensemble

It is a well-known fact that the selection of an optimal set of classifiers is a vital part of multiple classifier systems, and the independence of classifier outputs is usually considered an advantage for obtaining better multiple classifier systems (Rokach 2009; Menahem et al. 2009). In terms of classifier combination, voting methods demand no prerequisites from the learners. When multiple classifiers are combined by voting, one expects high-quality results, based on the belief that the majority of classifiers are more likely to be correct in their decision when they agree in their estimation. The proposed ensemble uses five online learning algorithms:

• Naive Bayes (NB) (Domingos and Pazzani 1997) is the simplest form of Bayesian network, since it captures the assumption that every attribute is independent of all other attributes given the state of the class feature. The assumption of independence is obviously almost always wrong. However, the simple naive Bayes method remains competitive, even though it supplies very poor estimates of the true underlying probabilities (Domingos and Pazzani 1997). The naive Bayes algorithm is traditionally used in "batch mode", meaning that it does not perform the majority of its computations after observing each training instance, but rather accumulates certain counts over all of the training instances and then performs the final computations on the entire group or "batch" of instances. However, nothing inherent in the algorithm prevents it from learning incrementally. As an example, consider how an incremental naive Bayes algorithm can work in a single pass through the training set (a minimal sketch is given at the end of this section). In step 1, it initializes all of the counts and totals to 0 and then goes through the training instances one at a time; for each training instance, given the feature vector x and the value of its label, it increments the corresponding counts. In step 2, these counts and totals are converted to probabilities by dividing each count by the number of training instances in the corresponding class. The final step (3) computes the prior probabilities p(k) as the fraction of all training instances that are in class k.

• AODE (Averaged One-Dependence Estimators) (Webb et al. 2005) is considered an improvement on NB. Sahami (1996) introduced the notion of k-dependence estimators, in which the probability of each feature value is conditioned by the class and at most k other features. To maintain efficiency, AODE is restricted to using exclusively 1-dependence estimators (ODEs). Specifically, AODE makes use of SPODEs (SuperParent-One-Dependence Estimators), in which every feature depends on the class and one other shared feature, designated as the superparent. AODE weakens the feature independence assumption by averaging all models from this restricted class of one-dependence learners.

• The 3-nearest neighbors algorithm (3-NN) classifies instances based on the closest training examples in the feature space (Wu et al. 2002). 3-NN is a well-known instance-based, or lazy, learner: the function is only approximated locally and all computation is deferred until classification. An instance is classified by a majority vote of its neighbors, being assigned to the class most common amongst its 3 nearest neighbors. It is common to use the Euclidean distance, although other distance measures, such as the Manhattan distance, can also be used.
Fig. 1 The proposed ensemble. In the learning phase, the training set is given to the five incremental learners (NB, AODE, 3-NN, NNGE, KSTAR), producing the hypotheses h1, h2, h3, h4 and h5; in the application phase, an incoming instance (x, ?) is labeled using h* = average of probabilities (h1, h2, h3, h4, h5), yielding the final prediction (x, y*).
• K-star is an instance-based learner in which the class of a test instance is based upon the classes of the training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function (Cleary and Trigg 1995).

• Non-Nested Generalised Exemplars (NNGE) (Roy 2002) extends nearest neighbour by introducing generalised exemplars. Generalised exemplars are bounded groups of examples that share the same concept and are close in proximity within an n-dimensional problem space, where n is the number of features in each example. The bounding of groups is implemented by axis-parallel n-dimensional rectangles, or hyperrectangles. A hyperrectangle represents each generalisation by an exemplar in which each feature value is replaced by either a range of values, for a continuous-valued domain, or a list of possible values, for a discrete-valued domain (Roy 2002). This enables a hyperrectangle to represent a rule more broadly than many single examples could.

The proposed ensemble begins by creating a set of five classifiers (NB, AODE, 3-NN, K-star, NNGE). When a new instance arrives, the algorithm passes it to each learner and receives a prediction from each. In the online setting, the ensemble continually modifies its hypothesis as it is being used: it repeatedly receives a pattern, predicts its value based on a majority vote of the learners' predictions and possibly updates its hypothesis accordingly. The proposed ensemble is presented schematically in Fig. 1, where h_i is the hypothesis produced by each classifier, x the example to be classified and y* the final prediction of the proposed online ensemble.

The number of model or runtime parameters to be tuned by the user is an indication of an algorithm's ease of use. For a non-specialist in data mining, the proposed ensemble, which has no user-tuned parameters, is certainly more attractive. It must also be mentioned that the proposed ensemble can easily be parallelized, using one learning algorithm per machine. Parallel and distributed computing is of great importance for machine learning (ML) practitioners because, by taking advantage of a parallel or distributed execution, an ML system can (i) increase its speed and (ii) increase the range of domains where it can be applied (because it can process more instances, for example).
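The incremental naive Bayes update described in the first bullet above can be sketched as follows. This is our illustrative code, not the implementation used in the experiments; the add-one smoothing in the prediction step and the restriction to categorical features are extra assumptions.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Counts-based naive Bayes for categorical features; each update is O(#features)."""

    def __init__(self):
        self.n_seen = 0
        self.class_counts = defaultdict(int)      # N(c)
        self.value_counts = defaultdict(int)      # N(c, feature j, value)
        self.feature_values = defaultdict(set)    # values observed per feature

    def learn_one(self, x, y):
        # Step 1: increment only the counts touched by this single instance.
        self.n_seen += 1
        self.class_counts[y] += 1
        for j, value in enumerate(x):
            self.value_counts[(y, j, value)] += 1
            self.feature_values[j].add(value)

    def predict_one(self, x):
        # Steps 2-3: turn counts into priors and conditional probabilities on demand.
        best_class, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            score = n_c / self.n_seen                       # prior p(c)
            for j, value in enumerate(x):
                n_values = len(self.feature_values[j]) or 1
                # add-one (Laplace) smoothing is an assumption, not specified in the paper
                score *= (self.value_counts[(c, j, value)] + 1) / (n_c + n_values)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

The overall online loop of the proposed ensemble can then be sketched as below. Majority voting follows the description in the text (Fig. 1 shows an average-of-probabilities variant), and the stub learner only stands in for the remaining base classifiers, whose real implementations are assumed to expose the same learn_one/predict_one interface.

```python
from collections import Counter

class VotingIncrementalEnsemble:
    """Combine incremental learners by unweighted majority vote; update all of them online."""

    def __init__(self, learners):
        self.learners = learners                 # e.g. NB, AODE, 3-NN, NNGE, K-star

    def predict_one(self, x):
        votes = Counter(l.predict_one(x) for l in self.learners)
        return votes.most_common(1)[0][0]

    def learn_one(self, x, y):
        for l in self.learners:
            l.learn_one(x, y)

class MostFrequentClass:
    """Stand-in base learner used only to keep this sketch self-contained."""
    def __init__(self):
        self.counts = Counter()
    def learn_one(self, x, y):
        self.counts[y] += 1
    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

# Predict first, then learn from the revealed label (prequential use).
ensemble = VotingIncrementalEnsemble(
    [IncrementalNaiveBayes(), MostFrequentClass(), MostFrequentClass()])
correct = 0
for x, y in [((0, 1), "a"), ((1, 1), "b"), ((0, 0), "a")]:
    correct += int(ensemble.predict_one(x) == y)
    ensemble.learn_one(x, y)
```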
Table 1 Datasets

Dataset          Instances   Categ. features   Numer. features   Classes
Audiology              226                69                 0        24
Autos                  205                10                15         6
Badge                  294                 4                 7         2
Balance                625                 0                 4         3
Breast-cancer          286                 9                 0         2
Breast-w               699                 0                 9         2
Colic                  368                15                 7         2
Credit-rating          690                 9                 6         2
Credit-g             1,000                13                 7         2
Diabetes               768                 0                 8         2
Glass                  214                 0                 9         6
Haberman               306                 0                 3         2
Heart-c                303                 7                 6         5
Heart-h                294                 7                 6         5
Heart-statlog          270                 0                13         2
Hepatitis              155                13                 6         2
Ionosphere             351                34                 0         2
Iris                   150                 0                 4         3
Labor                   57                 8                 8         2
Lymphography           148                15                 3         4
Monk1                  124                 6                 0         2
Monk2                  169                 6                 0         2
Monk3                  122                 6                 0         2
Primary-tumor          339                17                 0        21
Sonar                  208                 0                60         2
Student                344                11                 0         2
Titanic              2,201                 3                 0         2
Vehicle                846                 0                18         4
Vote                   435                16                 0         2
Wine                   178                 0                13         3
Zoo                    101                16                 2         7
4 Comparisons and results

For the purpose of our study, we chose datasets from real problems with varying characteristics. The datasets come from many domains of the UCI repository (Frank and Asuncion 2010) and cover areas such as pattern recognition (vote, iris, zoo), image recognition (sonar, ionosphere), medical diagnosis (breast-cancer, breast-w, diabetes, colic, heart-c, heart-h, hepatitis, heart-statlog, haberman, lymphography) and commodity trading (credit-g, credit-rating). Table 1 gives a short description of these datasets, presenting the number of instances, the types of the attributes and the number of output classes.
The datasets used are batch datasets, i.e., there is no natural order in the data. The most common way to convert the online ensemble into a batch algorithm is to repeatedly cycle through a dataset, processing the instances one at a time until the end of the data. In order to estimate the classifiers' accuracy, the whole training set was divided into ten mutually exclusive and equal-sized subsets, and for each subset the learner was trained on the union of all of the other subsets. This cross-validation was then run 10 times for each algorithm and the mean value of the 10 cross-validations was calculated.

The AODE algorithm can process categorical data only, whereas the datasets used involve both symbolic and numerical features. It was therefore necessary to discretize the numerical (continuous) attributes. The entropy discretization method was used (Janssens et al. 2006): it recursively chooses the cut-points minimizing entropy until a stopping criterion based on the Minimum Description Length criterion ends the recursion.

In the first experiment, each incremental learning algorithm (Naive Bayes, 3NN, KSTAR, AODE, NNge) is compared with the proposed ensemble. We used the freely available source code of these algorithms by Witten and Frank (2005) in our experiments. We minimized the effect of any expert bias by not attempting to tune any of the algorithms to the specific datasets; wherever feasible, default values of the learning parameters were used. This approach may result in lower estimates of the true error rate, but it is a bias that affects all the learning algorithms equally.

The aggregated results appear in the last rows of Table 2. In Table 2, "v" indicates that the proposed ensemble loses to the specific algorithm, that is, the algorithm performed statistically better than the proposed ensemble according to a t-test with p < 0.05, whereas "*" indicates that the proposed ensemble performed statistically better than the specific algorithm according to a t-test with p < 0.05. In all other cases, there is no significant statistical difference between the results (draws) (Salzberg 1997). The aggregated results have the form a/b/c: "a" means that the proposed ensemble is significantly less accurate than the compared algorithm in a out of 31 datasets, "c" means that the proposed ensemble is significantly more accurate than the compared algorithm in c out of 31 datasets, and in the remaining b cases there is no significant statistical difference between the results.

To sum up, the proposed ensemble is significantly more accurate than the NB algorithm in 10 out of the 31 datasets, whilst it has significantly higher error rates in one dataset. In addition, the proposed algorithm is significantly more accurate than the 3NN algorithm in 7 out of the 31 datasets, whereas it has significantly higher error rates in one dataset. Moreover, the proposed ensemble is significantly more accurate than the KSTAR algorithm in 14 out of the 31 datasets, whilst it has significantly higher error rates in one dataset. Furthermore, the proposed algorithm is significantly more accurate than the AODE algorithm in 5 out of the 31 datasets and has significantly higher error rates in none. Finally, the proposed ensemble is significantly more accurate than the NNGE algorithm in 18 out of the 31 datasets and has significantly higher error rates in none. Subsequently, we provide graphs showing how the accuracy increases as more instances are added.
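As a schematic illustration of the evaluation protocol described above (ten-fold cross-validation repeated ten times), the sketch below shows the accuracy estimation loop; the unstratified fold construction and the make_learner factory are our simplifying assumptions, and the base learners are assumed to expose the learn_one/predict_one interface sketched in Sect. 3.

```python
import random

def repeated_cross_validation(dataset, make_learner, repeats=10, folds=10, seed=0):
    """Mean accuracy over `repeats` runs of `folds`-fold cross-validation."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        data = dataset[:]
        rng.shuffle(data)
        fold_size = len(data) // folds
        for k in range(folds):
            test = data[k * fold_size:(k + 1) * fold_size]
            train = data[:k * fold_size] + data[(k + 1) * fold_size:]
            learner = make_learner()
            for x, y in train:                   # online learners need only one pass
                learner.learn_one(x, y)
            hits = sum(learner.predict_one(x) == y for x, y in test)
            accuracies.append(hits / max(len(test), 1))
    return sum(accuracies) / len(accuracies)
```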
Table 2 Comparing the proposed voting incremental ensemble with the base incremental classifiers

Dataset           Voting ensemble   NB               3NN              KSTAR            AODE             NNge
Audiology         78.19 (6.12)      72.64 (6.10)*    72.73 (6.04)*    80.32 (7.11)     67.97 (7.73)*    73.00 (8.57)*
Autos             79.79 (9.53)      57.41 (10.77)*   67.23 (11.07)*   72.01 (9.65)*    74.76 (11.56)*   74.26 (9.52)*
Badges            100.00 (0.00)     99.66 (1.03)     100.00 (0.00)    90.27 (4.74)*    100.00 (0.00)    100.00 (0.00)
Balance-scale     86.49 (2.99)      90.53 (1.67)v    86.74 (2.72)     88.72 (2.27)v    69.96 (4.62)*    80.46 (3.89)*
Breast-cancer     73.82 (6.21)      72.70 (7.74)     73.13 (5.54)     73.73 (6.79)     73.05 (6.93)     67.80 (7.08)*
Breast-w          97.20 (1.86)      96.07 (2.18)*    96.60 (1.97)     95.35 (2.44)*    97.05 (1.90)     96.18 (2.30)*
Colic             83.16 (5.79)      78.70 (6.20)*    80.95 (5.53)     75.71 (6.72)*    82.45 (5.54)     79.02 (6.48)
Credit-g          76.00 (3.58)      75.16 (3.48)     72.21 (3.25)*    70.17 (3.89)*    75.83 (3.57)     69.24 (4.47)*
Credit-rating     85.25 (4.06)      77.86 (4.18)*    84.96 (4.44)     79.10 (4.16)*    86.67 (3.77)     82.83 (4.70)*
Diabetes          75.74 (3.98)      75.75 (5.32)     73.86 (4.55)     70.19 (4.77)*    75.70 (4.72)     72.84 (4.63)*
Glass             76.17 (8.65)      49.45 (9.50)*    74.53 (7.95)     75.31 (9.05)     70.02 (7.94)*    67.98 (9.31)*
Haberman          71.84 (5.59)      75.06 (5.42)     69.77 (5.72)     70.27 (5.80)     71.57 (3.95)     66.80 (7.05)*
Heart-c           82.88 (6.65)      83.34 (7.20)     81.82 (6.55)     75.18 (7.21)*    82.87 (6.65)     77.78 (7.74)*
Heart-h           83.30 (6.16)      83.95 (6.27)     82.33 (6.23)     77.83 (6.57)*    84.33 (6.20)     79.60 (6.80)*
Heart-statlog     82.41 (6.19)      83.59 (5.98)     79.11 (6.77)     76.44 (7.42)*    82.70 (6.56)     77.30 (8.12)*
Hepatitis         83.75 (9.30)      83.81 (9.70)     80.85 (9.20)     80.17 (8.12)     85.36 (9.23)     81.88 (8.13)
Ionosphere        92.40 (4.22)      82.17 (6.14)*    86.02 (4.31)*    84.64 (4.79)*    91.09 (4.75)     90.60 (4.62)
Iris              95.40 (4.80)      95.53 (5.02)     95.20 (5.11)     94.67 (5.53)     93.07 (5.76)     96.00 (4.54)
Labor             93.00 (10.55)     93.57 (10.27)    92.83 (9.86)     92.03 (10.86)    88.43 (13.69)    86.23 (15.17)
Lymphography      85.50 (8.92)      83.13 (8.89)     86.86 (7.55)     85.08 (7.87)     83.76 (8.51)     77.14 (10.12)*
Monk1             85.97 (10.47)     73.38 (12.49)*   78.97 (11.89)*   80.27 (11.34)    82.32 (11.31)    86.73 (11.34)
Monk2             56.94 (11.86)     56.83 (8.71)     54.74 (9.20)     58.35 (10.47)    59.62 (8.23)     53.98 (12.79)
Monk3             92.63 (6.63)      93.45 (6.57)     86.72 (9.99)     86.22 (9.44)*    93.21 (6.79)     89.08 (7.71)
Primary-tumor     45.13 (6.22)      49.71 (6.46)v    49.77 (6.02)v    38.02 (6.84)*    44.98 (6.43)     39.09 (7.17)*
Sonar             81.81 (8.80)      67.71 (8.66)*    77.05 (9.56)     85.11 (7.65)     81.74 (8.45)*    71.12 (9.22)*
Students          84.14 (5.62)      85.70 (5.97)     82.29 (5.97)     80.85 (6.62)*    86.08 (5.77)     81.01 (6.25)
Titanic           78.11 (2.00)      77.85 (2.40)     78.90 (1.80)     77.56 (1.81)     78.21 (2.17)     70.69 (9.64)*
Vehicle           72.00 (3.36)      44.68 (4.59)*    70.21 (3.93)     70.22 (3.48)     70.32 (3.62)     62.26 (5.67)*
Vote              93.95 (3.61)      90.02 (3.91)*    93.08 (3.70)     93.22 (3.48)     94.28 (3.42)     95.10 (3.14)
Wine              98.59 (2.58)      97.46 (3.70)     95.85 (4.19)*    98.72 (2.61)     98.21 (3.08)     95.93 (4.56)
Zoo               96.35 (4.96)      94.97 (5.86)     92.61 (7.33)*    96.03 (5.66)     94.66 (6.38)     94.09 (6.38)
Average accuracy  82.84             78.77            80.58            79.73            81.30            78.58
W-D-L (p < 0.05)                    1/20/10          1/23/7           1/16/14          0/26/5           0/13/18
In the second experiment, a representative of each family of sophisticated batch supervised learning algorithms was compared with the proposed ensemble. We used batch algorithms as a reference for the achievable accuracy, since most incremental versions of batch algorithms are not lossless (Saad 1998; Utgoff et al. 1997; Widmer and Kubat 1996). An online lossless learning algorithm is an algorithm that returns
a hypothesis identical to the one its corresponding batch algorithm would return given the same training data. The C4.5 algorithm (Quinlan 1993) was the representative of the decision trees in our study, the SMO algorithm was the representative of the support vector machines (Katagiri and Abe 2006), and RIPPER (Cohen 1995) was the representative of the rule learners. The meaning of the symbols "v" and "*" in Table 3 is the same as in Table 2. As one can see in Table 3, the proposed ensemble is significantly more accurate than SMO in 6 out of the 31 datasets, whilst it has significantly higher error rates in none.

Table 3 Comparing the proposed voting online ensemble with well-known batch classifiers

Dataset           Voting ensemble   RIPPER            C4.5              SMO
Audiology         78.19 (6.12)      73.10 (6.80)*     77.26 (7.47)      80.77 (7.04)
Autos             79.79 (9.53)      73.62 (9.87)      81.77 (8.78)      71.34 (10.12)*
Badges            100.00 (0.00)     100.00 (0.00)     100.00 (0.00)     100.00 (0.00)
Balance-scale     86.49 (2.99)      80.30 (3.66)*     77.82 (3.42)*     87.57 (2.49)
Breast-cancer     73.82 (6.21)      71.45 (6.44)      74.28 (6.05)      69.52 (7.50)*
Breast-w          97.20 (1.86)      95.61 (2.22)*     95.01 (2.73)*     96.75 (2.01)
Colic             83.16 (5.79)      85.10 (6.11)      85.16 (5.91)      82.66 (5.41)
Credit-g          76.00 (3.58)      72.21 (3.96)*     71.25 (3.17)*     75.09 (3.42)
Credit-rating     85.25 (4.06)      85.16 (3.94)      85.57 (3.96)      84.88 (3.86)
Diabetes          75.74 (3.98)      75.18 (4.54)      74.49 (5.27)      76.80 (4.54)
Glass             76.17 (8.65)      66.78 (9.65)*     67.63 (9.31)*     57.36 (8.77)*
Haberman          71.84 (5.59)      72.72 (5.90)      71.05 (5.20)      73.40 (1.06)
Heart-c           82.88 (6.65)      79.95 (6.77)      76.94 (6.59)*     83.86 (6.21)
Heart-h           83.30 (6.16)      79.57 (6.64)*     80.22 (7.95)      82.74 (6.44)
Heart-statlog     82.41 (6.19)      78.70 (6.81)      78.15 (7.42)      83.89 (6.24)
Hepatitis         83.75 (9.30)      78.13 (9.04)      79.22 (9.57)      85.77 (9.04)
Ionosphere        92.40 (4.22)      89.16 (4.64)      89.74 (4.38)      88.07 (5.32)*
Iris              95.40 (4.80)      93.93 (7.28)      94.73 (5.30)      96.27 (4.58)
Labor             93.00 (10.55)     83.70 (15.09)     78.60 (16.58)*    92.97 (9.75)
Lymphography      85.50 (8.92)      76.31 (11.37)*    75.84 (11.05)*    86.48 (7.68)
Monk1             85.97 (10.47)     83.87 (16.34)     80.61 (11.34)     79.58 (11.99)*
Monk2             56.94 (11.86)     56.21 (8.89)      57.75 (11.18)     58.70 (5.80)
Monk3             92.63 (6.63)      84.80 (9.27)*     92.95 (6.68)      93.45 (6.57)
Primary-tumor     45.13 (6.22)      38.74 (5.57)*     41.39 (6.94)      47.09 (6.59)
Sonar             81.81 (8.80)      73.40 (9.91)*     73.61 (9.34)*     76.60 (8.27)*
Students          84.14 (5.62)      86.44 (5.74)      86.75 (5.69)      86.72 (5.19)
Titanic           78.11 (2.00)      78.01 (2.04)      78.55 (2.10)      77.60 (2.39)
Vehicle           72.00 (3.36)      68.32 (4.37)*     72.28 (4.32)      74.08 (3.82)
Vote              93.95 (3.61)      95.75 (2.74)      96.57 (2.56)v     95.77 (2.90)
Wine              98.59 (2.58)      93.14 (6.94)*     93.20 (5.90)*     98.76 (2.73)
Zoo               96.35 (4.96)      86.62 (6.98)*     92.61 (7.33)      96.05 (5.60)
Average accuracy  82.84             79.23             80.03             81.95
W-D-L (p < 0.05)                    0/18/13           1/21/9            0/25/6
The proposed algorithm is also significantly more accurate than the RIPPER algorithm in 13 out of the 31 datasets, while it has significantly higher error rates in none. Finally, the proposed algorithm has significantly lower error rates than the C4.5 algorithm in 9 out of the 31 datasets and is significantly less accurate in one dataset. All the experiments indicate that the proposed ensemble performed, on average, better than all the tested algorithms.

Obviously, incremental updating is much faster than rerunning a batch algorithm on all the data seen so far, and it may even be the only possibility if all the data seen so far cannot be stored, or if one needs to perform online prediction and updating in real time or, at least, very rapidly. We are particularly interested in minimizing the required training time because, as already mentioned, a major research area is the investigation of accurate techniques that can be applied to problems with thousands of features, millions of training instances and hundreds of classes. It is attractive to have machine-learning algorithms that can analyze large datasets in just a few hours of computer time.
5 Conclusion

The recent and rapid development of areas such as databases, e-commerce, electronic sensors and ubiquitous computing has generated new motivation for research on incremental learning algorithms. These technologies allow dynamic systems to be designed and used in real-world applications. Such dynamic systems continuously receive new data to be stored in huge databases. Hence, the knowledge present in the databases is continuously evolving and the learning process may need to continue almost indefinitely, so a non-incremental learning algorithm may become ineffective. Online learning is the area of machine learning concerned with learning from each training example once (perhaps as it arrives) without ever examining it again. Online learning is essential when data arrives continuously, so that it may be infeasible to store the data for batch learning, or when the dataset is large enough that multiple passes through it would be too time-consuming.

Ideally, we would like to be able to identify or design the single best learning algorithm to be used for all problems. However, both experimental results and theoretical work indicate that this is not possible (Witten and Frank 2005). Recently, combining classifiers has been proposed as a new way to improve classification accuracy. However, most ensemble algorithms operate in batch mode. Several approaches could be explored when designing an online ensemble algorithm. A naive approach is to maintain a dataset of all observed instances and to invoke an offline algorithm to produce an ensemble from scratch whenever a new example arrives. This approach is often impractical for online settings with resource constraints, both in terms of update time and of space. To help alleviate the space problem, one could limit the size of the dataset by only storing and using the most recent or most important examples; however, the resulting update time is still frequently impractical. For this reason, we proposed an incremental ensemble that combines, by voting, five classifiers that can operate incrementally: the Naive Bayes, AODE, 3NN, NNGE and K-star algorithms. We performed a large-scale comparison with other state-of-the-art algorithms and ensembles on 31 standard benchmark datasets and obtained better accuracy in most cases. In spite of these results, however, no general method will always work best.

We have mostly used our online ensemble algorithm to learn static datasets, i.e., those which do not have any temporal ordering among the training instances. Much data mining
research is concerned with finding methods applicable to the increasing variety of types of data available, such as time series, multimedia, spatial data and worldwide web logs. Applying online learning algorithms to these different types of data is an important area of future work. Moreover, the combination strategy used here is based on voting. In future work, apart from voting, it might be worth trying other combination rules in order to investigate the relationship between the combination strategy, the individual learners and the datasets.
References

Auer P, Warmuth M (1998) Tracking the best disjunction. Mach Learn 32:127–150
Chu F, Zaniolo C (2004) Fast and light boosting for adaptive mining of data streams. In: Advances in knowledge discovery and data mining. Lecture notes in computer science, vol 3056, pp 282–292
Cleary JG, Trigg LE (1995) K*: an instance-based learner using an entropic distance measure. In: Proceedings of the 12th international conference on machine learning, pp 108–114
Cohen W (1995) Fast effective rule induction. In: Proceedings of the international conference on machine learning (ML-95), pp 115–123
Dietterich T (2000) Ensemble methods in machine learning. Lecture notes in computer science, vol 1857, pp 1–15
Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29:103–130
Fan W, Stolfo S, Zhang J (1999) The application of AdaBoost for distributed, scalable and on-line learning. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 362–366
Fern A, Givan R (2000) Online ensemble learning: an empirical study. In: Proceedings of the seventeenth international conference on machine learning. Morgan Kaufmann, pp 279–286
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml. University of California, School of Information and Computer Science, Irvine, CA
Freund Y, Schapire R (1999) Large margin classification using the perceptron algorithm. Mach Learn 37:277–296
Gangardiwala A, Polikar R (2005) Dynamically weighted majority voting for incremental learning and comparison of three boosting based approaches. In: 2005 IEEE international joint conference on neural networks (IJCNN '05), vol 2, pp 1131–1136
Janssens D, Brijs T, Vanhoof K, Wets G (2006) Evaluating the performance of cost-based discretization versus entropy- and error-based discretization. Comput Oper Res 33(11):3107–3123
Katagiri S, Abe S (2006) Incremental training of support vector machines using hyperspheres. Pattern Recognit Lett 27(13):1495–1507
Kotsiantis S, Zaharakis I, Pintelas P (2006) Machine learning: a review of classification and combining techniques. Artif Intell Rev 26(3):159–190
Kuncheva LI (2004) Classifier ensembles for changing environments. In: Multiple classifier systems (MCS 2004). Lecture notes in computer science, vol 3077, pp 1–15
Littlestone N, Warmuth M (1994) The weighted majority algorithm. Inf Comput 108:212–261
Menahem E, Rokach L, Elovici Y (2009) Troika—an improved stacking schema for classification tasks. Inf Sci. doi:10.1016/j.ins.2009.08.025
Oza NC, Russell S (2001) Online bagging and boosting. In: Richardson T, Jaakkola T (eds) Artificial intelligence and statistics, pp 105–112
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
Rokach L (2009) Taxonomy for characterizing ensemble methods in classification tasks: a review and annotated bibliography. Comput Stat Data Anal 53(12):4046–4072
Rokach L (2010) Ensemble-based classifiers. Artif Intell Rev 33(1–2):1–39
Roy S (2002) Nearest neighbor with generalization. Christchurch, New Zealand
Saad D (1998) Online learning in neural networks. Cambridge University Press, London
Sahami M (1996) Learning limited dependence Bayesian classifiers. In: Proceedings of the 2nd international conference on knowledge discovery in databases, pp 335–338
Salzberg S (1997) On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min Knowl Discov 1:317–328
Ulaş A, Semerci M, Yıldız OT, Alpaydın E (2009) Incremental construction of classifier and discriminant ensembles. Inf Sci 179(9):1298–1318
Utgoff P, Berkman N, Clouse J (1997) Decision tree induction based on efficient tree restructuring. Mach Learn 29:5–44
Webb GI, Boughton JR, Wang Z (2005) Not so naive Bayes: aggregating one-dependence estimators. Mach Learn 58:5–24
Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23:69–101
Witten I, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
Wu Y, Ianakiev K, Govindaraju V (2002) Improved k-nearest neighbor classification. Pattern Recognit 35(10):2311–2318