International Journal of Knowledge-based and Intelligent Engineering Systems 16 (2012) 279–288 DOI 10.3233/KES-2012-00249 IOS Press

From static to dynamic ensemble of classifiers selection: Application to Arabic handwritten recognition

Nabiha Azizi∗ and Nadir Farah

Labged Laboratory: Laboratoire de Gestion Electronique de Documents, Computer Science Department, Badji Mokhtar University Annaba, Annaba, Algeria

Abstract. Arabic handwritten word recognition is a challenging problem due to Arabic's connected letter forms, consonantal diacritics and rich morphology. One way to improve recognition rates is to improve the accuracy of the individual classifiers; another is to apply ensemble of classifiers methods. To select the best classifier set from a pool of classifiers, classifier diversity is considered one of the most important properties in static classifier selection. The advantage of dynamic ensemble selection over static classifier selection, however, is that the classifier set used depends critically on the test pattern. In this paper, we propose two approaches for Arabic handwriting recognition (AHR) based on static and dynamic ensemble of classifiers selection. The first statically selects the best set of classifiers from a pool of already designed classifiers, based on diversity measures. The second is a new algorithm for Dynamic Ensemble of Classifiers Selection using a Local Reliability measure (DECS-LR). It dynamically chooses the most confident ensemble of classifiers to label each test sample. The level of confidence is measured by calculating the proposed local reliability measure using confusion matrices constructed during the training phase. We show experimentally that both approaches provide encouraging results, with the second leading to a better recognition rate for the AHR system using the IFN/ENIT database.

Keywords: Arabic handwritten recognition, static classifier selection, dynamic classifier selection, local accuracy estimation, fusion methods

1. Introduction

For almost any real-life pattern recognition application, a number of approaches and procedures may be used as a solution. After more than 20 years of continuous and intensive effort devoted to solving the challenges of handwriting recognition, progress in recent years has been very promising [11]. Many applications require off-line HWR capabilities, such as bank processing, mail sorting, document archiving, commercial form-reading, office automation, etc. So far, off-line HWR remains an open problem, in spite of a dramatic boost of research [8,9,29,30]. Studies in Arabic handwriting recognition, although not as advanced as those devoted to other scripts (e.g. Latin), have recently shown renewed interest [18,35]. We point out that the techniques developed for Latin HWR are not all appropriate for Arabic handwriting, because Arabic script is based on an alphabet and rules distinct from those of Latin (cf. Section 2). Classical approaches to pattern recognition require both the selection of an appropriate set of features to represent input samples and the use of a powerful single classifier. In recent years, however, in order to improve the recognition accuracy in complex application domains, there has been growing research activity in the study of efficient methods for combining the results of many different classifiers [21,28,31].

∗ Corresponding author: Nabiha Azizi, Labged Laboratory: Laboratoire de Gestion Electronique de Documents, Computer Science Department, Badji Mokhtar University Annaba, BP n◦ 12, Annaba, 23000, Algeria. E-mail: [email protected].

ISSN 1327-2314/12/$27.50 © 2012 – IOS Press and the authors. All rights reserved

N. Azizi and N. Farah / From static to dynamic ensemble of classifiers selection

Ensembles of classifiers (EoCs) exploit the idea that different classifiers can offer complementary information about the patterns to be classified. It is desirable to take advantage of individual classifiers' strengths and to avoid their weaknesses, resulting in improved classification accuracy. Both theoretical and empirical research has demonstrated that a good ensemble can not only improve generalization ability significantly, but also strengthen the robustness of the classification system [23,28,38]. EoCs have become a hotspot in machine learning and pattern recognition and have been successfully applied in various fields, including handwriting recognition, speaker identification and face recognition [7,12,14,26,32]. Diversity should be regarded, in a more general context, as a way to find the best classifier combination for a specific combination method. Optimally, it should produce member classifier sets that differ from each other in a way that benefits classifier combination, regardless of the combination method used [23,34]. Classifier selection techniques may be divided into two categories: static and dynamic. In Static Classifier Selection (SCS), the selection of the best classifier is specified during a training phase, before classifying test instances. In Dynamic Classifier Selection (DCS), the choice of a classifier is made during the classification or test phase. We call it "dynamic" because the classifier used depends critically on the test instance itself [10,13]. In our previous works, we dealt with the recognition of Arabic handwritten words using Algerian and Tunisian town names databases. Our study used single classifiers taking as input several kinds of features [1]. We later focused on multiple classifier approaches (at the classification and fusion levels).

We tried several combination schemes for the same application [2,4]. While studying the role of diversity in improving multiple classifier systems (MCS), and in light of the weak correlation between diversity and performance, we argued that diversity alone may not be sufficient as the unique criterion to select classifiers. We demonstrated through experimentation that using diversity jointly with performance to guide selection can avoid overfitting during the search. In this paper, we give an overview of our approach based on static classifier selection, which uses diversity measures and individual classifier accuracies to choose the best set of classifiers [3,4,6]. So, we first present a brief explanation of static ensemble of classifiers selection based on the overproduce-and-select strategy, to investigate the effect of several diversity measures on an Arabic handwritten recognition system. The obtained results are analysed and compared with those of the second approach, based on DECS. This static classifier selection strategy is also called the overproduce-and-choose strategy. The overproduction phase involves the generation of an initial large pool of candidate classifiers, while the selection phase is intended to select the best-performing classifier subset, which is then used to classify the whole test set. This strategy suffers from a main problem: a fixed subset of classifiers defined using a training/optimization data set may not be well adapted for the classification of the whole test set [36]. This problem is similar to searching for a universal best individual classifier, i.e., due to differences among samples, there is no unique classifier set perfectly adapted to every test set. In dynamic classifier selection, on the other hand, the competence of each classifier in the ensemble is calculated during the classification phase and then the most competent classifier is selected [19,33,39]. Recently, dynamic ensemble of classifiers selection (DES) methods have been developed to overcome the latter problem [7,13]. In these methods, a subset of classifiers is first dynamically selected from the ensemble and then the selected classifiers are combined by majority voting. To test the effectiveness of DES, we propose in this paper a new dynamic ensemble of classifiers selection approach based on local accuracy estimation. The proposed algorithm extracts the best EoC for each test sample using a new measure computed for each class over all classifiers. That measure, named the Local-Reliability measure, is calculated from information extracted from confusion matrices.

When an ensemble of classifiers (EoC) is selected based on our algorithm and the L-Reliability measure, two fusion methods, voting and weighted voting, are applied to generate the final class label with the appropriate confidence. The remainder of this paper is organized as follows: the next section describes the main properties of Arabic handwriting. Section 3 illustrates the SCS paradigm and presents our proposed approach based on the overproduce-and-select strategy. Section 4 details the DCS paradigm and the main idea of our proposed Dynamic Ensemble of Classifiers Selection methodology based on a Local Reliability measure (DECS-LR), with the proposed algorithm. The main results are illustrated in Section 5.


Fig. 1. Main properties of Arabic handwriting. (1) Written from right to left. (2) One Arabic word comprising three cursive sub-words. (3) A word consisting of six characters. (4) Some characters are not connectable from the left side with the succeeding character. (5) The same character with different shapes depending on its position in the word. (6) Different characters with different sizes. (7) Different characters with different numbers of dots. (8) Different characters with the same number of dots but different dot positions.

2. Arabic handwriting properties

Arabic is written by more than 100 million people, in over 20 different countries. The Arabic script evolved from a type of Aramaic, with the earliest known document dating from 512 AD. The Aramaic language has fewer consonants than Arabic [14,24]. Arabic characters are used in the writing of several languages such as Arabic, Persian and Urdu. A summary of the features of Arabic writing appears below [8,25,32].

1. Arabic text, both handwritten and printed, is cursive. The letters are joined together along a writing line (Fig. 1). This is similar to Latin 'joined up' handwriting, which is also cursive, but in which the characters are easier to separate.
2. In contrast to Latin text, Arabic is written right to left, rather than left to right. This is perhaps more significant for a human reader than for a computer, since the computer can simply flip the images.
3. More importantly from the point of view of automated recognition, Arabic contains dots and other small marks that can change the meaning of a word, and these need to be taken into account by any computerised recognition system. Often the diacritic marks representing vowels are left out, and the word must be identified from its context.
4. The shapes of the letters differ depending on where in the word they are found. The same letter at the beginning and end of a word can have a completely different appearance. Along with the dots and other marks representing vowels, this makes the effective size of the alphabet about 160 characters.
5. Arabic writers often make use of elongation (to beautify the script).

3. Static ensemble selection versus dynamic classifier selection

Classifier selection techniques fall into two general methodologies. According to the first type, called static classifier selection (SCS), the optimal selection found for the validation set is fixed and used for the classification of unseen patterns. The whole analytical effort is thus focused on the extraction of the best combination for the labeled validation set, which can also be used for evaluation. The selection of classifiers in SCS is fully based on the average performances obtained on the labeled validation set, and thus complies with the redundant-ensembles type of combination.

3.1. Overproduce and select strategy in SCS

Figure 2 shows the basic idea: produce an initial large set of "candidate" classifier ensembles, and then select the sub-ensemble of classifiers that can be combined to achieve optimal accuracy. Constraints and heuristic criteria are used in order to limit the computational complexity of the "choice phase" (e.g., the performances of a limited number of

Fig. 2. Overproduce and select strategy steps.

Fig. 3. Dynamic classifier selection components.

candidate ensembles are evaluated by a simple combination function such as the majority voting rule [6]). We can distinguish two phases:

– The ensemble overproduction phase uses techniques like Bagging and Boosting [15,20]. Different classifiers can also be designed by using different initializations of the respective learning parameters, different classifier types and different classifier architectures.
– The ensemble choice phase aims to select the subset of classifiers that can be combined to achieve optimal accuracy.

In a more aggressive approach, called dynamic classifier selection (DCS), the selection is done online, during classification, based on training performances and also using various parameters of the actual unlabelled pattern to be classified. In [39], Woods et al. proposed to select the single classifier that shows the best performance in the closest neighborhood, of arbitrarily set size. In comparison with SCS, where the "best" classifier is selected before the test phase, DCS dynamically selects the "best" classifier for each test instance. Among the different DCS schemes, the most representative one is Dynamic Classifier Selection by Local Accuracy (DCS-LA) [37], which explores a local region for each test instance to evaluate the base classifiers, where the local region is characterized as the "k" Nearest Neighbors of the test instance in the evaluation set EV.

3.2. Dynamic classifier selection strategy

Dynamic classifier selection methods are divided into three levels, as illustrated in Fig. 3. First, the classifier generation level uses a training data set to obtain a pool of classifiers; secondly, the region of competence generation level uses an independent evaluation data set (EV) to produce competence regions (Rj); and thirdly, dynamic selection chooses a winning partition or the winning classifier (Ci∗), over the samples contained in Rj, to assign the label M to the sample I from the test data set. The selection is called dynamic because levels 2 and 3 are performed during the test phase. However, several methods reported in the literature as DCS methods pre-estimate the regions of competence during the training phase [13] and perform only the third level during the test phase. The main difference between the various DCS methods is the strategy employed to generate the regions of competence.

3.2.1. Local class accuracy (LCA)
The local accuracy is estimated with respect to the output classes [16]. In other words, we consider the percentage of the local training samples assigned to a class ωi by this classifier that have been correctly labeled.
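As a rough illustration only (the function and variable names below are ours, not from the paper), the LCA of one classifier for a test pattern's neighborhood could be sketched as:

```python
import numpy as np

def local_class_accuracy(clf_preds, true_labels, neighbor_idx, assigned_class):
    """LCA sketch: among the local evaluation samples that the classifier
    assigned to `assigned_class`, the fraction whose true label matches."""
    preds = clf_preds[neighbor_idx]        # classifier outputs on the k neighbors
    labels = true_labels[neighbor_idx]     # ground-truth labels of the neighbors
    mask = preds == assigned_class         # neighbors the classifier put in that class
    if not mask.any():
        return 0.0                         # no local evidence for this class
    return float((labels[mask] == assigned_class).mean())

# toy usage: 5 evaluation samples, classifier predictions vs. truth
preds = np.array([0, 1, 1, 0, 1])
truth = np.array([0, 1, 0, 0, 1])
print(local_class_accuracy(preds, truth, np.arange(5), 1))  # 2 of 3 correct
```

In a full system, `neighbor_idx` would come from a kNN search over the evaluation set EV rather than covering all samples as in this toy call.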

3.2.2. A priori selection method (A priori)
The classifier accuracy can be weighted by the distances between the training samples in the local region and the test sample. Consider the sample xj ∈ ωk as one of the K nearest neighbors of the test pattern X. The posterior P(ωk | xj, Ci) provided by classifier Ci can be regarded as a measure of the classifier's accuracy for the test pattern X based on its neighbor xj. If we suppose that we have N training samples in the neighborhood, then the best classifier C∗ for classifying the sample X can be selected by [17]:

$$C^{*} = \arg\max_{i} \frac{\sum_{j=1}^{N} P(\omega_k \mid x_j \in \omega_k, c_i)\, W_j}{\sum_{j=1}^{N} W_j} \tag{1}$$

where Wj = 1/dj is the weight and dj is the Euclidean distance between the test pattern X and its neighbor sample xj.
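One plausible reading of Eq. (1) can be sketched as follows (a sketch under our own naming, restricting the numerator to the neighbors xj that belong to ωk, as the condition xj ∈ ωk suggests):

```python
import numpy as np

def a_priori_select(posteriors, neighbor_labels, distances, target_class):
    """Pick the classifier maximizing Eq. (1): a distance-weighted average of
    P(omega_k | x_j, c_i) over the neighbors x_j belonging to class omega_k.

    posteriors: array (L, N) -- P(target_class | x_j, c_i) for each of the L
                classifiers and each of the N nearest neighbors x_j
    neighbor_labels: array (N,) -- true classes of the neighbors
    distances: array (N,) -- Euclidean distances d_j to the test pattern
    """
    w = 1.0 / np.maximum(distances, 1e-12)       # W_j = 1 / d_j (guarded)
    in_class = neighbor_labels == target_class   # numerator restricted to omega_k
    scores = (posteriors[:, in_class] * w[in_class]).sum(axis=1) / w.sum()
    return int(np.argmax(scores)), scores

# toy usage: 2 classifiers, 3 neighbors; neighbors 0 and 2 belong to class 1
post = np.array([[0.9, 0.2, 0.8],
                 [0.4, 0.9, 0.5]])
best, _ = a_priori_select(post, np.array([1, 0, 1]), np.array([1.0, 2.0, 1.0]), 1)
print(best)  # classifier 0 wins on the in-class neighbors
```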

3.2.3. A posteriori selection method (A posteriori)
If the class assigned by the classifier Ci is known, then we can use the classifier accuracy with respect to the known class. Suppose that we have N training samples in the neighborhood, and consider the sample xj ∈ ωk as one of the K nearest neighbors of the test pattern X. Then the best classifier C∗(ωk) with output class ωk for classifying the sample X can be selected by [37]:

$$C^{*}(\omega_k) = \arg\max_{i} \frac{\sum_{x_j \in \omega_k} P(\omega_k \mid x_j, c_i)\, W_j}{\sum_{j=1}^{N} P(\omega_k \mid x_j, c_i)\, W_j} \tag{2}$$

4. Proposed approach based on static selection by overproduce and select strategy

Different works have been done in the field of AOCR (Arabic Optical Character Recognition) [1–6,8,28]. It is not an easy task to obtain a robust MCS unifying a set of classifiers already built and tested. Indeed, the major goal of combination is to try to maximise the benefits of the complementarity of the different models and to compensate for the weaknesses of each classifier. Selecting the set of classifiers with the best individual performances does not necessarily imply a better recognition rate for the global system; this is explained by the nature of the classifiers. Diversity has been quantified in several ways for classification fusion, and different measures have been proposed in the literature. In this section, we present the adopted diversity measures, followed by the steps of the proposed approach.

4.1. Measure of diversity between classifiers

The diversity measure is calculated in terms of the output values of all classifiers. In this work, we used six well-known diversity measures to construct the best classifier subset.

4.1.1. Correlation between the errors
Since independence between the committed errors is beneficial for an MCS, the correlation between the classifiers' errors is a natural choice for comparing classifier subsets [23]:

$$\rho_{a,b} = \frac{\mathrm{Cov}[v_{e_a}, v_{e_b}]}{\sqrt{\mathrm{Var}[v_{e_a}]\,\mathrm{Var}[v_{e_b}]}} \tag{3}$$

where $v_{e_a}$ and $v_{e_b}$ are the binary error vectors of classifiers a and b. The best set is selected by choosing the one having the minimum average of these pairwise measures.

4.1.2. Q average
The Q average, or Q statistic, aims to calculate the similarity between two classifiers [3]. It is defined for two classifiers a, b as:

$$Q_{a,b} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}} \tag{4}$$

where N11 is the number of times both classifiers are correct, N00 the number of times both classifiers are incorrect, and N01 and N10 the number of times only the first or only the second is correct.

4.1.3. Disagreement measure
This measure is the ratio between the number of observations on which one classifier is correct and the other is incorrect and the total number of observations [23]:

$$D_{a,b} = \frac{N^{10} + N^{01}}{N} \tag{5}$$
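The three pairwise measures above can be computed from binary correctness vectors. The sketch below uses our own convention (1 = correct, 0 = wrong); note that correlating correctness vectors gives the same ρ as correlating the error vectors of Eq. (3):

```python
import numpy as np

def pairwise_diversity(out_a, out_b):
    """Pairwise diversity between two classifiers from their binary
    correctness vectors (1 = correct, 0 = wrong), following Eqs. (3)-(5)."""
    n11 = np.sum((out_a == 1) & (out_b == 1))  # both correct
    n00 = np.sum((out_a == 0) & (out_b == 0))  # both wrong
    n10 = np.sum((out_a == 1) & (out_b == 0))  # only the first correct
    n01 = np.sum((out_a == 0) & (out_b == 1))  # only the second correct
    n = out_a.size
    q = (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)  # Q statistic, Eq. (4)
    rho = np.corrcoef(out_a, out_b)[0, 1]                  # correlation, Eq. (3)
    disagreement = (n10 + n01) / n                         # Eq. (5)
    return q, rho, disagreement

a = np.array([1, 1, 0, 1, 0, 1, 1, 0])
b = np.array([1, 0, 0, 1, 1, 1, 0, 0])
q, rho, dis = pairwise_diversity(a, b)
print(round(dis, 3))  # 3 of 8 samples in disagreement -> 0.375
```

For a subset of more than two classifiers, these pairwise values are typically averaged over all pairs, as the text describes for Eq. (3).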

4.2. Steps of the SCS approach

Our previous approach, published in [9], focused on the use of a unique criterion (diversity measures). We noticed some limits, which can be summarized in the following points:

– High cost in time and memory during the calculation of the diversity measures for all possible combinations of the m classifiers.
– Like the majority of diversity-based classifier subset selection approaches, it neglects individual classifier accuracy as an additional criterion, although the latter is a very important factor for improving the performance of multiple classifier systems.
– A selected subset may not include the most powerful classifier (based on individual accuracy during the test phase) or, even worse, may contain only the "M" weakest classifiers if they happen to be the most diversified, which inevitably degrades the recognition rate of the overall system.

Our idea is therefore to combine the two criteria, classifier individual ACCURACY and DIVERSITY, for classifier subset selection. Our proposed approach chooses a fixed "m" out of all "L" base classifiers:

1. It starts with a set containing one classifier, the best classifier (based on accuracy) during the test phase.
2. At each iteration, it chooses among all remaining classifiers the one that best improves the global system performance when added to the current ensemble. The performance is calculated using the evaluation criterion (the three diversity measures defined in the previous paragraph).

Once the set of classifiers is selected, it is impossible to use combination methods such as the weighted average or the sum of the outputs, because the classifier outputs are heterogeneous. Methods based on output class labels, such as voting and weighted voting, will therefore be used in our study.

5. Local reliability measure for classifier output classes in DCS

Dynamic Classifier Selection by Local Accuracy explores a local region for each test instance to evaluate the base classifiers, where the local region is characterized as the k Nearest Neighbors (kNN) of the test instance in the evaluation set EV. The intuitive assumption behind DCS-LA is quite straightforward: given a test instance I, we find its neighborhood δI in EV (using the Euclidean distance), and the base classifier that has the highest accuracy in classifying the instances in δI should also have the highest confidence in classifying I. Let Cj (j = 1, ..., L) be a classifier and I an unknown test instance. We first label I with all individual classifiers (Cj, j = 1, ..., L) and acquire L class labels C1(I), ..., CL(I). If the individual classifiers disagree, the local accuracy is estimated for each classifier. Given EV, the local accuracy of classifier Cj on the instances δI, LocCj(δI), is determined by the number of local evaluation instances that have the same class label as the classifier's output, over the total number of instances considered. The final decision for I is that of the base classifier with the maximum local accuracy [13]. All dynamic selection methods are designed to find the classifier with the greatest possibility of being correct for a sample in a pre-defined neighborhood. We, however, propose another approach: instead of finding the most suitable classifier, we select the most suitable ensemble for each test sample. The proposed algorithm builds on the DCS-LA method to select, for each test pattern, the ensemble of classifiers with the highest probability of making a correct classification on that pattern. However, the quality of the combined system based on the selected classifiers relies mostly on the goodness of the selection criterion, which is used to evaluate various joint properties of the classifiers, in particular those relating to, or deciding directly about, the combined performance of the selected set. This selection condition is based on local accuracy estimation; to reach this objective, we propose a new measure, named the Local-Reliability or L-reliability measure, computed in a k-nearest neighborhood of the input pattern I (neighborhood(I)), defined with respect to the evaluation set. To calculate the L-reliability measure, we construct "L" confusion matrices, one for each classifier, during the training phase. The confusion matrix used can be defined as a square matrix with N rows (the computed label class) and N columns (the predicted label class). Each cell (d, f) holds the number of training samples classified into label class "d" whose predicted label class is "f". With the above definition, we observe that:

– The cell (j, j), denoted ajj, represents the number of correct predictions for each class j (j = 1, ..., N). In fact, we can calculate the classifier accuracy Ac(Ci) by Eq. (6):

$$Ac(C_i) = \frac{\sum_{j=1}^{N} a_{jj}}{N} \qquad (i = 1, \ldots, L) \tag{6}$$

After training, the local reliability measure, which represents the confidence of each label class j for classifier Ci, can be estimated by the proposed Eq. (7):

$$L\text{-}reliability(C_{i,j}) = \frac{a_{jj} \times Ac(C_i)}{\sum_{d=1,\, d \neq j}^{N} a_{d,j}} \tag{7}$$
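Equations (6) and (7) can be sketched as follows for a single classifier. The names are ours: we read the denominator of Eq. (6) as the total sample count in the matrix, and we guard the case where a column of the confusion matrix contains no off-diagonal errors (a situation Eq. (7) leaves undefined):

```python
import numpy as np

def l_reliability(cm, acc=None):
    """Per-class Local-Reliability for one classifier from its confusion
    matrix `cm` (cm[d, f]: samples put in class d whose predicted label
    class is f), following Eq. (7)."""
    cm = np.asarray(cm, dtype=float)
    if acc is None:
        acc = np.trace(cm) / cm.sum()       # Eq. (6), diagonal over total count
    diag = np.diag(cm)
    off_col = cm.sum(axis=0) - diag          # sum over d != j of a_{d,j}
    off_col = np.where(off_col == 0, 1.0, off_col)  # guard: no confusion at all
    return diag * acc / off_col

# toy 2-class confusion matrix: 17 of 20 samples on the diagonal
cm = np.array([[8, 1],
               [2, 9]])
print(np.round(l_reliability(cm), 3))  # per-class reliabilities
```

Classes that attract few misclassified samples (small off-diagonal column sums) thus receive larger reliability scores, scaled by the classifier's overall accuracy.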

The different steps of the proposed DECS-LR approach can be summarized by the following algorithm:

Algorithm 1: DECS using L-reliability
1: Design a pool of classifiers C.
2: Compute the competence level using the confusion matrices.
3: For each test pattern I do:
4: If all classifiers Ci (i = 1, ..., L) agree on the same class j for pattern I, then assign class j to I. Else:
5: Find the k nearest neighbors of I in the evaluation set, using the Euclidean distance.
6: Calculate the accuracy of each neighbor m (m = 1, ..., k) with all base classifiers Ci: AC(Ci)(m).
7: Combine the accuracy results AC(Ci)(m) with the local reliability associated with the actual label, using Eq. (7), for each neighbor m, to obtain the combined accuracy.
8: Delete all pairs (neighbor, classifier Ci) for which the global accuracy value

6. Experimental results

6.1. Used database

Fig. 4. Example of a dataset entry.

6.2. Used classifiers

For the pool of classifiers used to validate the two approaches, we used the same ensemble of classifiers as in our previous work based on static classifier selection, in order to compare both sets of results. We used the following classification algorithms:

– Two (02) SVM (Support Vector Machine) classifiers, with the "one against all" strategy. These classifiers were built with the "LIBSVM" library, version 2.7. The inputs of these SVM systems are the structural features. We used polynomial and Gaussian kernel functions.
– Three (03) kNN ("k" Nearest Neighbor) classifiers, with "k" equal to 2, 3 and 5 respectively. The training examples are vectors of structural and statistical features, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label that is most frequent among the k training samples nearest to that query point.
– Three (03) ANN (Artificial Neural Network) classifiers with different numbers of hidden-layer neurons. In order to train a neural network to perform AHR, we must adjust the weights of each unit in such a way that the error between the desired output and the actual output is reduced. This process requires that the neural network compute the error derivative with respect to the weights (EW); in other words, it must calculate how the error changes as each weight is increased or decreased slightly. The back-propagation algorithm is the method used for determining the EW in our study.
– Two HMM (Hidden Markov Model) classifiers: the Hidden Markov Model (HMM) is a powerful statistical tool for modeling generative sequences
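Putting the pieces together, a minimal sketch of the Algorithm 1 selection loop might look as follows. All names here are ours; `keep` is an assumed cut-off standing in for the deletion threshold whose exact definition is truncated in the source, and `Stub` is only a toy classifier for the demonstration:

```python
import numpy as np

def decs_lr_predict(x, classifiers, eval_X, eval_y, reliabilities, k=5, keep=0.5):
    """Sketch of the DECS-LR loop. Each classifier exposes .predict;
    reliabilities[i][c] is the L-reliability of class c for classifier i
    (Eq. (7)). `keep` is an assumed threshold replacing the truncated
    deletion step of the source."""
    votes = [int(c.predict([x])[0]) for c in classifiers]
    if len(set(votes)) == 1:                 # all classifiers agree: done
        return votes[0]
    # k nearest neighbors of x in the evaluation set (Euclidean distance)
    nn = np.argsort(np.linalg.norm(eval_X - x, axis=1))[:k]
    scores = []
    for i, c in enumerate(classifiers):
        local_acc = float(np.mean(c.predict(eval_X[nn]) == eval_y[nn]))
        scores.append(local_acc * reliabilities[i].get(votes[i], 0.0))
    scores = np.asarray(scores)
    selected = scores >= keep * scores.max() # keep the most confident ensemble
    kept = [v for v, s in zip(votes, selected) if s]
    return max(set(kept), key=kept.count)    # majority vote over the selected EoC

class Stub:
    """Toy classifier that always outputs one label (demonstration only)."""
    def __init__(self, label): self.label = label
    def predict(self, X): return np.full(len(X), self.label)

clfs = [Stub(0), Stub(1), Stub(1)]
X_ev, y_ev = np.zeros((4, 2)), np.array([1, 1, 1, 0])
rel = [{0: 0.5}, {1: 0.5}, {1: 0.5}]
print(decs_lr_predict(np.zeros(2), clfs, X_ev, y_ev, rel))  # 1
```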
