Comparison of Genetic Algorithm and Sequential Search Methods for Classifier Subset Selection

Hongwei Hao a,*, Cheng-Lin Liu b, Hiroshi Sako b

a University of Science and Technology Beijing, Beijing, P.R. China
b Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
* The work of Hongwei Hao was done while he was working at the Central Research Laboratory, Hitachi, Ltd.

Abstract

Classifier subset selection (CSS) from a large ensemble is an effective way to design multiple classifier systems (MCSs). Given a validation dataset and a selection criterion, the task of CSS reduces to searching the space of classifier subsets for the optimal subset. This study investigates the search efficiency of the genetic algorithm (GA) and of sequential search methods for CSS. In experiments on handwritten digit recognition, we select a subset from 32 candidate classifiers with the aim of achieving high combination accuracy. The results show that, with respect to optimality, no method outperforms the others in all cases. All the methods are very fast except the generalized plus-l take-away-r (GPTA) method.

1. Introduction

Multiple classifier systems (MCSs) have been studied intensively for many years. It has been shown that combining multiple diverse and accurate classifiers can achieve higher performance than the best individual one [1]. The performance of an MCS essentially depends on the complementarity (diversity) of its constituent classifiers. Complementarity can be obtained by diversifying the feature representation, the classifier structure, and the training data of the constituent classifiers. One strategy is to over-generate a large ensemble of candidate classifiers and then select a subset with good complementarity [2-5]. Classifier subset selection (CSS) from a large ensemble is significant in two respects. First, the limited computing resources of an MCS demand that only a small number of classifiers be combined. Second, in many cases the combination of a subset of classifiers may give higher accuracy than combining all the classifiers at hand. In pattern recognition applications, we can easily generate a large number of classifiers by varying pre-processing, feature extraction, learning/classification algorithms, and training data. Running and combining all these classifiers is obviously not a good choice, and selecting a subset should give a better performance-to-cost ratio.

Given a set of candidate classifiers, a validation dataset, and an appropriate selection criterion, the task of CSS reduces to searching the space of classifier subsets for a subset that optimizes the criterion on the validation dataset. The validation data, the selection criterion, and the search algorithm all influence the combination performance of the MCS. For selection from a large ensemble, efficient search algorithms are needed to overcome the combinatorial explosion of the search space. Recently, a few works have contributed to MCS design using classifier selection. Giacinto and Roli clustered the candidate classifiers according to interdependency and selected one classifier from each cluster [2,3]. Roli et al. also used heuristic search for classifier selection. Ruta and Gabrys [4], and Sirlantzis and Fairhurst [5], used evolutionary algorithms for classifier selection. With regard to the selection criterion, a number of classifier diversity measures have been studied [3,6,7,8]. Intuitively, the combination accuracy on validation data can be used directly as a criterion [4,5]. It is noteworthy that these previous works all utilized only the abstract output (crisp class) of the classifiers.

This study aims to compare the search efficiency of various search algorithms for CSS, including the genetic algorithm (GA) and a number of sequential search methods. These algorithms have not been systematically compared in MCS design, though they have been frequently used for feature selection (FS) in statistical pattern recognition [9,10]. In our experiments, all the candidate classifiers give measurement outputs, so we test measurement level combination as well as abstract level combination, unlike the previous works, which considered abstract level combination (majority voting) only. The search algorithms used for CSS are those frequently used in feature selection: the GA and sequential search methods.

The sequential methods include sequential forward search (SFS), sequential backward search (SBS), plus-l take-away-r (PTA(l,r)), generalized PTA(l,r) (GPTA(l,r)), sequential floating forward search (SFFS), and sequential floating backward search (SBFS) [11]. SFS has been tried by Roli et al. [3], and the GA has been used in [4,5]. We experiment with classifier selection and combination in handwritten digit recognition, where we aim to select an optimal or sub-optimal subset of classifiers from 32 candidates. The performance of the selected classifiers is measured by the combination accuracy using the plurality of votes and confidence-based measurement level combination. The results show that classifier selection largely improves the combination performance compared to combining all the classifiers. Most search methods run very fast and select good combinations of classifiers, but none guarantees global optimality.

2. Classifier Selection Methods

The goal of classifier selection is to select a subset of k classifiers from a given set of K (K > k) candidates so as to achieve the best combination performance. Given a selection criterion (here, the classification accuracy of the combination on validation data), classifier selection reduces to a combinatorial search problem. The search algorithms are briefly described as follows.
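To convey the scale of this search problem (a worked count, not from the paper's text): with K candidates there are $2^K - 1$ non-empty subsets, so for the K = 32 ensemble used in our experiments,

$$2^{32} - 1 = 4{,}294{,}967{,}295 \approx 4.3 \times 10^9, \qquad \binom{32}{8} = 10{,}518{,}300,$$

the latter being the number of subsets an exhaustive search restricted to k = 8 classifiers must evaluate (cf. Section 4).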

2.1. Sequential search methods

Sequential search methods start from an empty set or from the set of all candidates as the initial selected subset. Classifiers are iteratively added to or deleted from the selected subset with the aim of improving the criterion. At each step, only one or a small number (say, 2 or 3) of classifiers are added or deleted, so the complexity of the search is not high. Even though the sequential methods do not guarantee the global optimality of the selected subset, they are efficient even for large-scale problems. We test five sequential search algorithms in our experiments: SFS, SBS, PTA(l,r), GPTA(l,r), and SFFS. The methods are outlined below; a code sketch of SFS follows at the end of this subsection.

Like other forward search methods, SFS starts with the empty set. Classifiers are then added to the selected subset one by one. At each step, the added classifier is chosen from the remaining set such that the already selected subset plus the added one gives the best combination performance in terms of the criterion. This procedure produces a series of subsets and criterion values, from which we eventually choose the subset with the highest criterion value.

SBS works in the reverse direction of SFS. It starts from the set of all classifiers, and constituent classifiers are iteratively deleted. At each step, a classifier is deleted such that the remaining subset gives the best combination performance.

PTA(l,r) combines SFS and SBS in such a way that at each step, classifiers are added to the selected subset l times and then deleted r times. When l > r, the selected subset grows (forward search); otherwise it shrinks (backward search). We test forward PTA(l,r) in our experiments.

GPTA(l,r) differs from PTA(l,r) in that at each step, the l (r) classifiers are not added (deleted) sequentially, but are selected from all combinations of l or r classifiers in the remaining (selected) subset. This helps improve the optimality of the selected subset because the correlation between classifiers is better accounted for. The cost is that the exhaustive comparison of size-l (size-r) subsets complicates the computation.

SFFS and SBFS combine SFS and SBS more flexibly than PTA(l,r) does, in that the number of successive add or delete operations is not fixed, but varies depending on the criterion value of the selected subset. In SFFS, after a classifier is added to the selected subset by one SFS step, classifiers are sequentially deleted from the selected subset as long as the deletion improves the criterion of the combination.
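The following is a minimal Python sketch of SFS as outlined above. It assumes classifiers are referred to by integer indices and that a user-supplied criterion function returns the combination accuracy of a subset on the validation data; the function and variable names are ours, not from the paper.

```python
def sfs(candidates, criterion):
    """Sequential forward search: greedily add the classifier that most
    improves the combination criterion, and return the best subset seen
    anywhere along the forward trajectory."""
    selected, remaining = [], list(candidates)
    best_subset, best_score = (), float("-inf")
    while remaining:
        # Evaluate each remaining classifier joined to the current subset.
        score, chosen = max(
            (criterion(tuple(selected) + (c,)), c) for c in remaining)
        selected.append(chosen)
        remaining.remove(chosen)
        if score > best_score:                 # keep the overall best subset
            best_score, best_subset = score, tuple(selected)
    return best_subset, best_score
```

SBS is the mirror image (start from all candidates and greedily delete one classifier per step), and PTA(l,r) and SFFS interleave such add and delete steps.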

2.2. Genetic algorithm

In a GA, a selected subset of classifiers is represented by a binary string called a chromosome, with a 1/0 at position i denoting the presence/absence of classifier i. A set of chromosomes, called a population, evolves from generation to generation by selection, crossover, and mutation, with the hope that the criterion values (called fitness values) of the chromosomes improve. The selection of chromosomes to survive into the next generation is based on the fitness values, such that chromosomes with higher fitness have a greater chance of surviving. The crossover and mutation operations enlarge the variation of the population so as to increase the chance of escaping from local optima. After a number of generations, the chromosome with the highest fitness over the whole run gives the solution of classifier selection. An introduction to GAs can be found in [12].

In our implementation of the GA for CSS, we use a population of 30 strings of length 32 (corresponding to the 32 candidate classifiers). The population evolves for at most 100 generations. On termination, the best string among all generations is selected. For string selection, the fitness values are re-scaled such that the highest fitness is 1.6 times the average value of 1. First, offspring are assigned based on the integer part of the expected number (the re-scaled fitness); then the fractional parts are assigned by roulette wheel selection. The selected strings are randomly paired and two-point crossover is performed. The crossover rate and mutation rate are set to 0.9 and 0.01, respectively. These implementation parameters were set by trial and error.
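Below is a simplified Python sketch of this GA, using the parameters stated above (population 30, at most 100 generations, two-point crossover with rate 0.9, mutation rate 0.01). For brevity it uses plain roulette-wheel selection instead of the paper's rescaled-fitness remainder scheme, and all names are our own; fitness is a user-supplied criterion function.

```python
import random

K, POP, GENS = 32, 30, 100      # string length, population size, generations
P_CROSS, P_MUT = 0.9, 0.01      # crossover and mutation rates from the paper

def ga_css(fitness, seed=0):
    """GA for classifier subset selection (simplified sketch).
    fitness maps a 0/1 tuple of length K to a score, e.g. the
    combination accuracy of the encoded subset on validation data."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(K)) for _ in range(POP)]
    best = max(pop, key=fitness)
    for _ in range(GENS):
        scores = [fitness(c) for c in pop]
        total = sum(scores)
        # Roulette-wheel selection: higher fitness -> higher survival chance.
        parents = [pop[_spin(scores, total, rng)] for _ in range(POP)]
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):   # random pairing
            if rng.random() < P_CROSS:
                a, b = _two_point(a, b, rng)
            nxt.extend((_mutate(a, rng), _mutate(b, rng)))
        pop = nxt
        best = max(pop + [best], key=fitness)   # best over all generations
    return best

def _spin(scores, total, rng):
    r, acc = rng.uniform(0, total), 0.0
    for i, s in enumerate(scores):
        acc += s
        if acc >= r:
            return i
    return len(scores) - 1

def _two_point(a, b, rng):
    i, j = sorted(rng.sample(range(1, K), 2))   # two cut points in 1..K-1
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def _mutate(c, rng):
    return tuple(bit ^ 1 if rng.random() < P_MUT else bit for bit in c)
```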

3. Experimental Setup

We experiment with CSS in handwritten digit recognition using 32 candidate classifiers with various pre-processing procedures, feature representations, and classifier structures. Two datasets, one collected in Japan (Hitachi) and one in the US (NIST SD-19), were used to train 24 and 8 classifiers, respectively. The pre-processing procedure varies in linear/nonlinear/moment normalization, deslanting or not, and the aspect ratio mapping function. The feature types include the chaincode feature, the gradient feature, NCFE (normalization-cooperated feature extraction), etc. The classifier structures include the nearest mean classifier, single-layer perceptron (SLP), multilayer perceptron (MLP), radial basis function (RBF) classifier, polynomial classifier (PC), learning vector quantization (LVQ) classifier, learning quadratic discriminant function (LQDF), etc. The pre-processing, feature extraction, and classification methods are reported in [13-15].

We use a dataset of 81,544 digit samples collected by Hitachi as the validation data for CSS. For testing the combination performance of the selected classifiers, we use two test datasets. Test-1 contains 9,725 samples collected in an environment similar to that of the validation dataset. Test-2 contains 36,473 samples that were rejected or misrecognized by an old Hitachi recognizer. The samples of Test-2 are difficult due to excessive shape deformation or image degradation.

Experiments are performed in four settings using two combination rules and two validation data sizes. A subset of 10,000 samples as well as the whole validation dataset is used in CSS. The combination rules are the plurality of votes and the sum-rule on class confidences. The classification accuracy of the combined classifiers is used as the selection criterion.

For classifier combination at the measurement level, transforming the classifier outputs into confidence measures is important, especially when combining different types of classifiers. The confidence measures should represent the probability of correctness of each class. Based on the confidence measures, a simple un-weighted combination rule like the sum-rule can give high accuracy. For confidence evaluation, we adopt the approach of Schuermann [16]. Assuming that each output (the measurement of a class) follows one of two Gaussian distributions with equal variance, one for the target class and one for the other classes, the posterior probability of the target class can be calculated in the sigmoid form:

$$f(x_i) = \frac{1}{1 + \exp\{-\alpha\,[x_i - (\beta + \gamma/\alpha)]\}}$$

where $x_i$ denotes the output for class i, and $f(x_i)$ is the transformed confidence. The parameters $\{\alpha, \beta, \gamma\}$ are calculated from the means and the variance of the one-dimensional Gaussians, which are in turn estimated on the validation dataset (a code sketch of this estimation is given after Table 1).

For evaluating the performance of classifier selection, the highest accuracies of individual classifiers and the accuracies of combining all 32 classifiers are listed in Table 1. We can see that both plurality and confidence-based combination (sum-rule) give higher accuracy than the best individual classifier, and that confidence-based combination is superior to plurality.

Table 1. Accuracies (%) of individual classifiers and of combining all 32 classifiers

Classifier   Valid   Test-1   Test-2
Individual   99.40   99.77    89.93
Plurality    99.50   99.79    91.57
Sum-rule     99.50   99.83    92.61
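The paper does not spell out the estimator, so the sketch below shows one consistent way to obtain the parameters under the stated equal-variance Gaussian assumption: working out the Bayes posterior for two Gaussians N(m1, var) and N(m0, var) yields exactly the sigmoid above with alpha = (m1 - m0)/var, beta = (m1 + m0)/2, and gamma the log prior odds, here estimated from sample counts. Function and variable names are ours.

```python
import math

def fit_confidence_params(target_outputs, other_outputs):
    """Estimate (alpha, beta, gamma) for the sigmoid confidence transform,
    assuming target-class and other-class outputs each follow a
    one-dimensional Gaussian with a shared (pooled) variance."""
    n1, n0 = len(target_outputs), len(other_outputs)
    m1 = sum(target_outputs) / n1            # mean output for the target class
    m0 = sum(other_outputs) / n0             # mean output for the other classes
    var = (sum((x - m1) ** 2 for x in target_outputs) +
           sum((x - m0) ** 2 for x in other_outputs)) / (n1 + n0)
    alpha = (m1 - m0) / var                  # sigmoid slope
    beta = (m1 + m0) / 2.0                   # midpoint between the two means
    gamma = math.log(n1 / n0)                # log prior odds of the target class
    return alpha, beta, gamma

def confidence(x, alpha, beta, gamma):
    """f(x) = 1 / (1 + exp(-alpha * (x - (beta + gamma / alpha))))."""
    return 1.0 / (1.0 + math.exp(-alpha * (x - (beta + gamma / alpha))))
```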

4. Experimental Results

The combination results of CSS with validation size 81,544 are shown in Table 2 and Table 3 for plurality and sum-rule combination, respectively. The results of CSS with validation size 10,000 are shown in Table 4 and Table 5. The second column of each table shows the CPU time (on a 1.9 GHz Pentium 4) of the search, the third column shows the number of selected classifiers, and the right three columns give the classification accuracies. In each table, we give the results of CSS using 10 search strategies. For validation size 10,000, we also give the results of exhaustive search selecting 8 classifiers. Since the computation of exhaustive search is very heavy, we did not try to search beyond 8 classifiers exhaustively.

As a baseline, the method "Accuracy" orders the candidate classifiers in decreasing order of accuracy on the validation data and simply selects the leading classifiers that give the highest combination accuracy on the validation data. The PTA(l,r) and GPTA(l,r) methods have two variations each: (l=2, r=1) and (l=3, r=2). The sequential search methods may find multiple subsets with the same highest criterion value; in this case, the subset with the smallest number of classifiers is taken. The GA also has two variations, without and with a penalty on the number of selected classifiers. In the selection criterion (fitness function), GA uses just the accuracy of combination on the validation data, while GA-P subtracts a penalty from the accuracy, so that the criterion is the accuracy minus 0.0001 times the number of selected classifiers. GA-P is expected to select a small number of classifiers that give high combination performance.
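For concreteness, here is a minimal Python sketch of the two combination rules and of the selection criterion they induce (the accuracy of the combined subset on validation data). The data layout and all names are our own assumptions, not the paper's.

```python
from collections import Counter

def plurality(votes):
    """Plurality of votes: the class with the most crisp votes wins."""
    return Counter(votes).most_common(1)[0][0]

def sum_rule(confidence_vectors):
    """Sum-rule: add per-class confidences over classifiers and take
    the class with the largest total."""
    totals = Counter()
    for vec in confidence_vectors:           # vec maps class -> confidence
        totals.update(vec)
    return totals.most_common(1)[0][0]

def criterion(subset, outputs, labels, rule):
    """Combination accuracy of `subset` on validation data.
    outputs[c][n] is classifier c's output for sample n: a crisp label
    for plurality, or a class->confidence dict for the sum-rule."""
    hits = sum(rule([outputs[c][n] for c in subset]) == labels[n]
               for n in range(len(labels)))
    return hits / len(labels)
```

Partially applying this function, e.g. functools.partial(criterion, outputs=outputs, labels=labels, rule=sum_rule), yields the single-argument criterion assumed by the search sketches above; GA-P's fitness would then subtract 0.0001 times the subset size from it.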


Table 2. Combination results (%) of plurality by selection with validation size 81,544

Method     CPU(s)   #sel   Valid    Test-1   Test-2
Accuracy     2.44      8   99.511   99.81    90.47
SFS          21.8     16   99.573   99.84    91.67
SBS          28.5      9   99.587   99.85    91.77
(2,1)        70.8     13   99.579   99.80    91.47
(3,2)       117.6      8   99.593   99.83    91.51
G(2,1)        238      7   99.587   99.83    91.30
G(3,2)      1,840      8   99.593   99.83    91.51
SFFS         51.9     13   99.583   99.83    91.25
GA            134      9   99.589   99.83    91.71
GA-P          109      8   99.584   99.83    91.61

Table 3. Combination results (%) of sum-rule by selection with validation size 81,544

Method     CPU(s)   #sel   Valid    Test-1   Test-2
Accuracy    13.95      8   99.525   99.83    90.72
SFS         131.9      8   99.621   99.88    93.02
SBS         209.6      9   99.567   99.81    93.07
(2,1)       473.7      8   99.621   99.88    93.02
(3,2)         811      7   99.619   99.86    92.76
G(2,1)      1,241      8   99.621   99.88    93.02
G(3,2)     10,874      8   99.621   99.88    93.02
SFFS        331.7      8   99.621   99.88    93.02
GA            859     11   99.603   99.86    93.14
GA-P          594      7   99.615   99.86    92.64

Comparing the accuracies in Table 2 and Table 3, it is evident that confidence-based measurement level combination (sum-rule) gives higher accuracy than the plurality of abstract votes. With validation size 10,000, the sum-rule also gives higher accuracy on the test datasets, even though the accuracy of plurality is higher on the validation data.

With respect to search efficiency, we can see that except for GPTA(l,r), all the sequential search methods and the GA finish in very short CPU time, from tens to hundreds of seconds for validation size 81,544. Including GPTA(l,r), all the heuristic search methods are much faster than exhaustive search. For validation size 10,000, the exhaustive search of 8 classifiers from 32 candidates consumes as long as 61.65 hours. As the number of candidate classifiers increases, the computation time grows polynomially and soon becomes infeasible on general-purpose computers.

Now let us examine the optimality on validation data and the generalization performance on the test datasets. First, it is evident that selection by ordering classifiers by accuracy does not yield good combination performance, because the correlation between classifiers is totally ignored. In Table 2, the highest accuracy on validation data is given by PTA(3,2) and GPTA(3,2), though the validation accuracies of SBS, GPTA(2,1), SFFS, and GA are also competitively high. Because of the multiple local optima in the combinatorial space, it is not easy to find the global optimum using a heuristic search algorithm. It is therefore reasonable that different search methods find different solutions with comparable criterion values.

Table 4. Combination results (%) of plurality by selection with validation size 10,000

Method     CPU(s)   #sel   Valid   Test-1   Test-2
Accuracy     0.31     14   98.98   99.81    90.26
SFS          2.56     12   99.23   99.80    92.02
SBS          3.23     12   99.21   99.83    92.00
(2,1)        8.27      6   99.23   99.81    91.38
(3,2)        13.7      8   99.22   99.86    91.78
G(2,1)       27.9     10   99.25   99.85    91.37
G(3,2)        218     10   99.23   99.85    91.32
SFFS         5.97      6   99.23   99.81    91.38
GA           16.9     15   99.22   99.83    91.79
GA-P        14.05      8   99.22   99.86    91.78
Exhaust    45,041      8   99.26   99.81    91.09

Table 5. Combination results (%) of sum-rule by selection with validation size 10,000

Method     CPU(s)   #sel   Valid   Test-1   Test-2
Accuracy     1.64      4   99.02   99.83    90.49
SFS          15.3     17   99.18   99.84    93.14
SBS          23.5     12   99.17   99.87    92.67
(2,1)        53.5     16   99.19   99.84    93.05
(3,2)        88.5     15   99.19   99.84    93.05
G(2,1)        161      9   99.19   99.89    93.26
G(3,2)      1,234      9   99.20   99.83    92.83
SFFS         39.9     15   99.19   99.84    93.11
GA            112     15   99.19   99.88    92.79
GA-P         80.5      9   99.20   99.83    92.83
Exhaust   61.65 h      8   99.20   99.88    92.29

In Table 3 (sum-rule with validation size 81,544), the highest validation accuracy was given by five search strategies: SFS, PTA(2,1), GPTA(2,1), GPTA(3,2), and SFFS. The classifiers they selected also give the highest combination accuracy on the Test-1 dataset and competitively high accuracy on the Test-2 dataset. The GA with penalty (GA-P) is effective in selecting fewer classifiers with performance competitive with the GA without penalty, particularly in Tables 4 and 5.

The results of CSS with validation size 10,000 show a similar tendency to those with validation size 81,544. For both plurality and sum-rule combination, the exhaustive search finds the solution with the highest validation accuracy. In sum-rule combination, GPTA(3,2) and GA-P find different classifier subsets that give the same high validation accuracy.


The classifier subset with high validation accuracy, however, does not necessarily give high accuracy on the test datasets, as shown in Table 4 (Exhaust) and Table 5 (GPTA(3,2) and GA-P). The generalization performance of CSS relies not on the search algorithm, but mainly on a large, representative validation dataset. A better selection criterion is also important, just as confidence-based combination outperforms plurality. The design of a highly relevant diversity measure for MCSs remains an open problem [7,8]. Nevertheless, the combination accuracy on validation data serves as a reasonably good selection criterion, as shown by our experimental results.

Among all the results, the highest accuracies on the test datasets appear in Table 5, given by the 9 classifiers selected by GPTA(2,1). These accuracies, 99.89% and 93.26%, are much higher than those of combining all 32 classifiers (99.83% and 92.61%) and those of the best individual classifier (99.77% and 89.93%). Of course, the whole set of 32 classifiers includes some weak ones, which deteriorate the combination performance. However, the number of strong classifiers remains large, and combining a large number of strong classifiers does not necessarily outperform combining a subset, because of the interdependency between them.


5. Conclusion

This paper compares the performance of the GA and sequential search methods for classifier subset selection. The experimental results show that most of them are very fast in finding a good selection. With respect to optimality, some methods such as PTA(l,r), GPTA(l,r), SFFS, and the GA generally perform well, but none consistently outperforms the others. The problem that classifiers selected for high accuracy on validation data do not necessarily give high accuracy on test data should be addressed with a more relevant selection criterion and large, representative validation data.

References

[1] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Analysis and Machine Intelligence, 20(3): 226-239, 1998.
[2] G. Giacinto, F. Roli, An approach to the automatic design of multiple classifier systems, Pattern Recognition Letters, 22(1): 25-33, 2001.
[3] F. Roli, G. Giacinto, G. Vernazza, Methods for designing multiple classifier systems, Multiple Classifier Systems, J. Kittler and F. Roli (eds.), LNCS 2096, Springer, 2001, pp.78-87.
[4] D. Ruta, B. Gabrys, Application of the evolutionary algorithms for classifier selection in multiple classifier systems with majority voting, Multiple Classifier Systems, J. Kittler and F. Roli (eds.), LNCS 2096, Springer, 2001, pp.399-408.
[5] K. Sirlantzis, M. Fairhurst, Investigation of a novel self-configurable multiple classifier system for character recognition, Proc. 6th ICDAR, Seattle, 2001, pp.1002-1006.
[6] H.J. Kang, S.-W. Lee, Evaluation on selection criteria of multiple numeral recognizers with the fixed number of recognizers, Proc. 16th ICPR, Quebec, Canada, 2002, Vol.3, pp.403-406.
[7] C.A. Shipp, L.I. Kuncheva, Relationships between combination methods and measures of diversity in combining classifiers, Information Fusion, 3(2): 135-148, 2002.
[8] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Machine Learning, 51(2): 181-207, 2003.
[9] A. Jain, D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. Pattern Analysis and Machine Intelligence, 19(2): 153-158, 1997.
[10] M. Kudo, J. Sklansky, Comparison of algorithms that select features for pattern classifiers, Pattern Recognition, 33(1): 25-41, 2000.
[11] P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature selection, Pattern Recognition Letters, 15: 1119-1125, 1994.
[12] M. Srinivas, L.M. Patnaik, Genetic algorithms: a survey, Computer, June 1994, pp.17-26.
[13] C.-L. Liu, H. Sako, H. Fujisawa, Performance evaluation of pattern classifiers for handwritten character recognition, Int. J. Document Analysis and Recognition, 4(3): 191-204, 2002.
[14] C.-L. Liu, H. Sako, H. Fujisawa, Learning quadratic discriminant function for handwritten character recognition, Proc. 16th ICPR, Quebec, Canada, 2002, Vol.4, pp.44-47.
[15] C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: investigation of normalization and feature extraction techniques, submitted to Pattern Recognition, 2003.
[16] J. Schuermann, Pattern Classification: A Unified View of Statistical and Neural Approaches, Wiley-Interscience, 1996.
