Designing an Optimal Classifier Ensemble for Online Character Recognition Using Genetic Algorithms

Jitendra Kumar
Department of Electrical Engineering, IIT, Chennai, India.

V.S. Chakravarthy*
Department of Biotechnology, IIT, Chennai, India.

Abstract

We formulate the problem of creating an optimal classifier ensemble as an optimization problem and apply genetic algorithms to it. A pool of 25 individual classifiers is created by training SVM-based classifiers on various features and by varying SVM kernel parameters. A subset of classifiers, selected from this pool by the proposed optimization technique, constitutes the final optimized classifier ensemble. The ensembles designed by the proposed method are applied to the problem of stroke recognition for two Indic scripts: Devnagari and Tamil. Ensemble performance always exceeded the performance of the best individual classifier and is comparable to some of the best reported online character recognition results for these scripts.

Keywords: Classifier selection, classifier ensemble, genetic algorithms, Indic scripts.

1. Introduction

It is well appreciated in the pattern recognition literature that combining multiple classifiers improves the generalization performance of the ensemble relative to the performance of the individual classifiers [1]. In an ideal classifier ensemble, individual classifiers should have high performance, and error patterns among the individual classifiers should be as different as possible (the so-called "decorrelation" condition) [3]. The choice of classifier combination technique also determines ensemble performance [4]. A good number of classifier combination techniques have been proposed with sound theoretical justification. Some of the most popular include majority voting [5], weighted averaging [4], the Bayes approach [5], the Dempster-Shafer theory [2], and combining neural networks [6]. Classifier ensembles have been applied to handwritten character recognition in many cases [5]. However, a large number of approaches to ensemble design focus on the method of combination, though some approaches also study the design of individual classifiers as a part of ensemble design. An interesting practical situation in ensemble design arises when a large number of individual classifiers for a given task already exist and a selection must be made from them to construct an optimal classifier ensemble. A few studies have considered the problem of selecting individual classifiers from a pool of classifiers [7]. In [8], Genetic Algorithms (GA) were used for the classifier selection problem, and the resulting ensemble was applied to offline handwritten digit recognition. In [13, 14], GA is applied to feature subset selection for pattern recognition problems. GA was found to provide a reasonable compromise between search complexity and the quality of the solution found.
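
As an illustration of combining a selected subset of a classifier pool by majority voting [5], the following is a minimal Python sketch. It is not the authors' implementation: the function name majority_vote, the 0/1 selection mask, and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter


def majority_vote(predictions, mask):
    """Combine the selected classifiers by majority voting.

    predictions: list of N label lists, one per classifier in the pool,
                 all of the same length (one label per sample).
    mask:        list of N 0/1 flags; 1 keeps the classifier (hypothetical
                 encoding chosen for this sketch).
    Returns one voted label per sample.
    """
    selected = [p for p, keep in zip(predictions, mask) if keep]
    if not selected:
        raise ValueError("mask selects no classifiers")
    voted = []
    for sample_votes in zip(*selected):
        # On a tie, Counter.most_common returns the label seen first
        # (insertion order, Python 3.7+).
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Toy pool of four classifiers labelling three samples (made-up data).
pool_predictions = [
    ["ka", "kha", "ga"],
    ["ka", "ga", "ga"],
    ["kha", "kha", "ga"],
    ["ka", "kha", "kha"],
]
mask = [1, 1, 1, 0]  # keep the first three classifiers only
ensemble_labels = majority_vote(pool_predictions, mask)
```

Changing the mask changes which classifiers vote, which is the quantity a selection procedure would search over; the voting rule itself stays fixed.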

In this paper, we present a technique based on Genetic Algorithms for the problem of classifier selection for ensemble design. The classifier ensemble is applied to online handwritten character recognition for two Indic scripts: Devnagari and Tamil. Our formulation of classifier selection as a constrained optimization problem is described in Section 2. The methodology of constructing a large pool of classifiers to choose from is presented in Section 3. The results obtained are described in Section 4, which is followed by a discussion of the work.

2. Ensemble design as an optimization problem

Our present aim is to select a set of K classifiers out of N classifiers such that a classifier ensemble built from the K classifiers yields the highest performance. A well-known guiding principle for classifier ensemble design is to ensure that individual classifiers have high performance, while the error distance between pairs of classifiers remains high. These two requirements are mutually conflicting: in the extreme case, if two classifiers both have 100% accuracy, the error distance between them is 0. The error vector of a classifier is an M-dimensional vector (where M is the number of classes) such that the i-th component of the vector denotes the percentage accuracy of the classifier on class i. The error distance between a pair of classifiers is the Euclidean distance between the error vectors of the two classifiers. In [17], various measures of diversity in classifier ensembles are proposed, such as the correlation measure, the disagreement measure and the double-fault measure. The measure of diversity used in this paper is the Euclidean distance. To build an optimal ensemble, we need to select the best K classifiers, so that the recognition accuracy of the ensemble is better than that of any other ensemble evaluated with the same combination technique. Therefore, the design of an optimal classifier ensemble can be formulated as the following optimization problem. Let V = {v1, v2, ..., vN} be a binary (0/1) vector where each element denotes the presence or absence of a classifier in the ensemble; let P = {p1, p2, ..., pN} (0