A Fast Ensemble of Classifiers

S. B. Kotsiantis                P. E. Pintelas
[email protected]              [email protected]

Educational Software Development Laboratory
Department of Mathematics
University of Patras, Hellas

Abstract

Recently, in the area of Machine Learning, the concept of combining classifiers has been proposed as a new direction for improving classification accuracy. However, a major research area is to explore learning techniques that scale up to large problems, and it is well known that ensembles need increased computation. In this work, we try to bridge the gap by using very fast algorithms to build a rapid voting ensemble. We have implemented a learning tool that combines the Naive Bayes, VFI and Decision Stump algorithms using the sum voting methodology. We performed a large-scale comparison with other state-of-the-art algorithms and fast ensembles on several datasets, and the proposed ensemble achieved better accuracy in most cases while also requiring less training time.

Keywords: machine learning, classification, data mining.

1  Introduction

Just like statistics, machine learning aims at building models for two reasons: (i) to understand and interpret a set of observations (data analysis and exploration); (ii) to be able to predict properties of unseen instances. The main difference between machine learning methods and classical statistics is that machine learning does not assume any parametric form for the model to use. Supervised learning algorithms are presented with instances that have already been preclassified in some way. That is, each instance has a label identifying the class to which it belongs, and so the set of instances is subdivided into classes. Supervised machine learning explores algorithms that reason from the externally supplied instances to produce general hypotheses, which then make predictions about future instances. To induce a hypothesis from a given dataset, a learning system needs to make assumptions about the hypothesis to be learned. These assumptions are called biases. A learning system without any assumptions cannot generate a useful hypothesis, since the number of hypotheses consistent with the dataset is usually huge. Because every learning algorithm uses some bias, it behaves well in domains where its bias is appropriate while performing poorly in others [16]. For this reason, combining classifiers has been proposed as a new direction for improving classification accuracy [2]. However, ensembles need increased computation, and an active research area explores learning techniques for scaling up to large datasets. In this work, we try to bridge the gap by using fast weak algorithms to build a rapid ensemble. The problem we tackle is how to develop an ensemble of classifiers with both good generalization performance and efficiency in space and time in a supervised learning setting. The motivation to use weak classifiers instead of well-trained classifiers is that they are computationally cheap and easy to obtain. If combinations of weak classifiers could also achieve good generalization performance, they would provide a feasible way of achieving good performance together with space and time efficiency. We have implemented an ensemble that combines the Naive Bayes, VFI and Decision Stump algorithms using the sum voting methodology. The proposed ensemble was tested systematically on real datasets and is at least

slightly more accurate than state-of-the-art learning algorithms and fast ensembles. Section 2 introduces some basic machine learning issues, while Section 3 discusses the proposed ensemble method. Experimental results and comparisons of the proposed ensemble with other learning algorithms and fast ensembles on several datasets are presented in Section 4. We briefly present the implemented learning tool in Section 5. Finally, we conclude in Section 6 with a summary and further research topics.

2  Machine learning issues

Supervised classification is one of the tasks most frequently carried out by so-called Intelligent Systems. Thus, a large number of techniques have been developed based on Artificial Intelligence (logic-based techniques, perceptron-based techniques) and Statistics (Bayesian networks, instance-based techniques) [12]. The concept of combining classifiers has been proposed as a new direction for improving classifier performance [2]. The size of the training set constitutes a significant factor in the overall performance of the trained classifier. In general, the quality of the computed model improves as the size of the training set increases. At the same time, however, the size of the training set is limited by main-memory constraints and the time complexity of the learning algorithm. A major research area has explored techniques for scaling up learning algorithms so that they can be applied to problems with millions of training instances, thousands of features, and hundreds of classes. Large machine learning problems are beginning to arise in database-mining applications, where there can be millions of transactions every day and where it is desirable to have machine learning algorithms that can analyze such large datasets in just a few hours of computer time. Another area where large learning problems arise is information retrieval from full-text databases and the World Wide Web. In information retrieval, each word in a document can be treated as an input feature, so each training instance can be described by thousands of features. Finally, large problems arise in applications such as speech recognition, object recognition, and character recognition, in which hundreds or thousands of classes must be discriminated.

Thus, despite their obvious performance advantages, ensembles have at least two weaknesses: (1) increased storage and (2) increased computation. The first weakness, increased storage, is a direct consequence of the requirement that all component classifiers, instead of a single classifier, must be stored after training. The total storage depends on the size of each component classifier itself and on the size of the ensemble (the number of classifiers in the ensemble). The second weakness is increased computation: to classify an input query, all component classifiers (instead of a single classifier) must be processed. For this reason, fast and scalable ensembles are needed. An ensemble is scalable if it performs as well on large datasets as on small and medium-sized datasets. In the following section, we propose a fast ensemble of classifiers.

3  Proposed ensemble

Using an ensemble of weak classifiers has some benefits. Firstly, the training time is often lower for generating multiple weak classifiers than for training one strong classifier. This is because strong classifiers spend the majority of their training time fine-tuning the desired decision boundary, whereas weak classifiers completely skip the fine-tuning stage, since they only generate a rough approximation of the decision boundary. Secondly, weak classifiers are also less likely to suffer from overfitting, since they avoid learning outliers or a possibly noisy decision boundary. A set of weak classifiers should satisfy the following two conditions: a) each weak classifier should do better than random guessing, and b) the set of classifiers should have enough computational power to learn the problem. The first condition ensures that each weak classifier possesses a minimum computational power. The second condition suggests that individual weak classifiers should learn different parts of a problem so that a collection of weak classifiers can learn the entire problem. If all the weak classifiers in a collection were to learn the same part of a problem, their combination would not do better than the individual classifiers. When weak classifiers are combined using a voting methodology, we expect to obtain good results based on the belief that the majority of

classifiers are more likely to be correct in their decision when they agree in their opinion. Voters can express the degree of their preference using a confidence score, i.e. the probabilities of the classifiers' predictions. In the proposed ensemble the sum rule is used: each voter gives the probability of its prediction for each candidate; all confidence values are then added for each candidate, and the candidate with the highest sum wins the election. It must be mentioned that the sum rule is one of the best voting methods for classifier combination according to [9]. As far as the learning algorithms of the proposed ensemble are concerned, three fast algorithms are used:

• Decision Stump (DS) [11]. Decision stumps are one-level decision trees that classify instances by sorting them based on feature values. Each node in a decision stump represents a feature in an instance to be classified, and each branch represents a value that the node can take. Instances are classified starting at the root node and sorted based on their feature values.

• VFI (Voting Feature Intervals) [7]. From the training instances, the VFI algorithm constructs intervals for each feature. An interval is either a range interval or a point interval. For point intervals, only a single value is used to define the interval. For range intervals, on the other hand, it suffices to maintain only the lower bound of the range of values, since all range intervals on a feature dimension are linearly ordered. For each interval, a single value and the votes of each class in that interval are maintained. Thus, an interval may represent several classes by storing the vote for each class. The classification of a new instance is based on a vote among the classifications made by the value of each feature separately.

• Naive Bayes [8]. The Naive Bayes classifier is the simplest form of Bayesian network, since it encodes the assumption that every feature is independent of the rest of the features, given the class feature. The independence assumption is clearly almost always wrong. However, the simple Naive Bayes method remains competitive, even though it provides very poor estimates of the true underlying probabilities [8].
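To make the first base learner concrete, a minimal decision stump for numeric features can be sketched as follows. This is an illustrative pure-Python sketch, not the authors' implementation; all names (`train_stump`, `stump_predict`) are our own, and the error-count split criterion is one simple choice among several.

```python
# Illustrative one-level decision tree (decision stump): pick the single
# (feature, threshold) split that minimises training misclassifications,
# and let each branch predict its majority class.
from collections import Counter

def train_stump(X, y):
    """X: list of numeric feature vectors, y: list of class labels."""
    best = None  # (errors, feature, threshold, left label, right label)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            # Majority class on each side of the split (fallback if empty).
            l_lab = Counter(left).most_common(1)[0][0] if left else y[0]
            r_lab = Counter(right).most_common(1)[0][0] if right else y[0]
            errors = sum(c != l_lab for c in left) + sum(c != r_lab for c in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def stump_predict(model, x):
    f, t, l_lab, r_lab = model
    return l_lab if x[f] <= t else r_lab
```

Because the stump considers only one feature and one threshold, training is a single cheap pass over candidate splits, which is exactly why it qualifies as a fast weak learner.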

In detail, the proposed ensemble (VoteDVM) is schematically presented in Figure 1. Each classifier (NB, VFI, DS) generates a hypothesis h1, h2 and h3, respectively. The a-posteriori probabilities generated by the individual classifiers are correspondingly denoted p1(i), p2(i) and p3(i) for each output class i. Next, the class with the maximum sum of a-posteriori probabilities is taken as the voting hypothesis (h*). The predictive class is computed by the rule:

Predictive Class = arg max_{i = 1, ..., c} Σ_{j=1..3} p_j(i),   where c is the number of classes
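The sum rule above can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the three probability vectors stand in for the outputs of NB, VFI and DS, and the values used are made up for the example.

```python
# Sum rule: each classifier contributes a probability distribution over
# the classes (same class order for all); the class with the largest
# summed probability wins the vote.
def sum_rule(prob_vectors):
    n_classes = len(prob_vectors[0])
    totals = [sum(p[i] for p in prob_vectors) for i in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)

# Illustrative outputs p1(i), p2(i), p3(i) for three classes:
p_nb, p_vfi, p_ds = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.3, 0.2]
winner = sum_rule([p_nb, p_vfi, p_ds])  # class 0: 0.6 + 0.2 + 0.5 = 1.3
```

Note that, unlike plain majority voting, the sum rule lets a classifier that is very confident outweigh two mildly confident dissenters.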

What is more, the number of model or runtime parameters to be tuned by the user is an indicator of an algorithm's ease of use. For a non-specialist in data mining, the proposed ensemble, which has no user-tuned parameters, will certainly be more appealing.

[Figure 1. The proposed ensemble. Learning phase: NB, VFI and DS are trained on the training set, producing hypotheses h1, h2 and h3, combined as h* = SumRule(h1, h2, h3). Application phase: an unlabeled instance (x, ?) is classified by h*, yielding (x, y*).]

It must also be mentioned that the proposed ensemble can easily be parallelized by using one learning algorithm per machine. Parallel and distributed computing is of great importance for ML practitioners because, by taking advantage of parallel or distributed execution, an ML system may: i) increase its speed; ii) increase the range of applications where it can be used (because it can process more data, for example).
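Since the three base learners are independent, this parallelization is straightforward to sketch with the standard library. The learner callables below are hypothetical placeholders; the paper does not specify an implementation, and for CPU-bound training on one host a process pool (or one machine per learner, as the text suggests) would replace the thread pool shown here.

```python
# Sketch: train each base learner (NB, VFI, DS) concurrently and collect
# the hypotheses h1, h2, h3. Learners are passed in as callables that
# take the training set and return a fitted model.
from concurrent.futures import ThreadPoolExecutor

def train_all(train_set, learners):
    with ThreadPoolExecutor(max_workers=len(learners)) as pool:
        futures = [pool.submit(fit, train_set) for fit in learners]
        return [f.result() for f in futures]  # preserves learner order
```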

4  Comparisons and results

For the purpose of our study, we used 22 well-known datasets from many domains, taken from the UCI repository [3]. These datasets were hand-selected so as to come from real-world problems and to vary in characteristics. Among these are

datasets from: pattern recognition (iris, mushroom, zoo), image recognition (ionosphere), medical diagnosis (breast-cancer, breast-w, colic, diabetes, heart-c, heart-h, heart-statlog, hepatitis, haberman, lymphotherapy, sick), commodity trading (credit-g) and various applications (waveform). Table 1 gives a brief description of these datasets: the number of output classes, the number of features and the number of instances.

Table 1. Description of datasets

Dataset          Instances  Features  Classes
breast-cancer          286         9        2
breast-w               699         9        2
colic                  368        22        2
credit-g              1000        20        2
diabetes               768         8        2
grub-damage            155         8        4
haberman               306         3        2
heart-c                303        13        5
heart-h                294        13        5
heart-statlog          270        13        2
hepatitis              155        19        2
hypothyroid           3772        29        4
ionosphere             351        34        2
iris                   150         4        3
labor                   57        16        2
lymphotherapy          148        18        4
mushroom              8124        22        2
primary-tumor          339        17       21
sick                  3772        29        2
soybean                683        35       19
waveform              5000        40        3
zoo                    101        17        7

In order to calculate the classifiers' accuracy, the whole training set was divided into ten mutually exclusive and equal-sized subsets, and for each subset the classifier was trained on the union of all the other subsets (10-fold cross-validation). Then, cross-validation was run 10 times for each algorithm and the median value of the cross-validations was calculated. During the first experiment, a representative algorithm for each of the other sophisticated machine learning techniques was compared with the proposed ensemble. It must be mentioned that we used the freely available source code for these algorithms from [18]. The C4.5 algorithm [14] was the representative of the decision trees in our study. The most well-known learning algorithm for estimating the values of the weights of a neural network - the Back Propagation (BP) algorithm [12] - was the representative of the neural nets. The Sequential Minimal Optimization (SMO) algorithm was the representative of the Support Vector Machines [13]. In our study, we also used the 3-NN algorithm, which combines robustness to noise with less classification time than using a larger k for kNN [1]. Finally, RIPPER [6] was the representative of the rule learners in our study. We have tried to minimize the effect of any expert bias by not attempting to tune any of the algorithms to the specific datasets. Wherever possible, default values of the learning parameters were used. This approach may result in lower estimates of the true error rate, but it is a bias that affects all the learning algorithms equally. The last rows of Table 2 contain the aggregated results. In Table 2, we denote with "vv" that the proposed ensemble loses to the specific algorithm, that is, that the specific algorithm performed statistically better than the proposed ensemble according to a paired t-test with p
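The evaluation protocol described above (10-fold cross-validation repeated 10 times, reporting the median over runs) can be sketched as follows. This is an illustrative harness, not the authors' experimental code; `evaluate` is a hypothetical callable that trains on the given train indices and returns accuracy on the test indices.

```python
# Sketch: repeated k-fold cross-validation. Instances are shuffled each
# run, partitioned into k disjoint folds, and each fold serves once as
# the test set; the median of the per-run mean accuracies is returned.
import random
from statistics import median

def cross_val_runs(n_instances, evaluate, n_runs=10, n_folds=10, seed=0):
    rng = random.Random(seed)
    run_scores = []
    for _ in range(n_runs):
        idx = list(range(n_instances))
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]  # disjoint folds
        accs = []
        for f in range(n_folds):
            test = folds[f]
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            accs.append(evaluate(train, test))
        run_scores.append(sum(accs) / n_folds)
    return median(run_scores)
```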
