Fourth International Conference on Networked Computing and Advanced Information Management

A Scalable Method for Improving the Performance of Classifiers in Multiclass Applications by Pairwise Classifiers and GA

Hamid Parvin, Hosein Alizadeh, Behrouz Minaei-Bidgoli and Morteza Analoui
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
{h_parvin, ho_alizadeh}@comp.iust.ac.ir, {b_minaei, analoui}@iust.ac.ir

Abstract

In this paper, a new combinational method for improving the recognition rate of multiclass classifiers is proposed. The main idea behind this method is to use pairwise classifiers to enhance the ensemble: because they are more accurate, they can decrease the error rate in error-prone regions of the feature space. First, a multiclass classifier is trained. Then, using the confusion matrix and the evaluation data, the pair-classes with the most errors are derived. After that, pairwise classifiers are trained and added to the ensemble of classifiers. Finally, a weighted majority vote is applied to combine the primary results. In this paper, the Multi Layer Perceptron is used as the base classifier, and a GA determines the optimized weights in the final classifier. The method is evaluated on a Farsi handwritten digit dataset. Using the proposed method, the recognition rate of the simple multiclass classifier is improved from 97.83 to 98.89, a considerable improvement.

1. Introduction

Nowadays, recognition systems are used in many applications. These applications arise in different fields with different natures, so one cannot expect to reach the desired results in all of them with one single classifier; the optimal classifier in each case depends strongly on the nature of the problem. In practice, there are applications that no single classifier can solve with acceptable accuracy [21]. In such situations it is better to use a classifier ensemble than a single classifier in order to reach an optimal accuracy [1]. Combining diverse classifiers usually results in a more accurate classification system [1, 2, 23], and it has been shown that the performance of a combinational classifier is usually higher than that of a single classifier [3]. In this paper, a framework for the development of combinational classifiers is proposed. In this structure, a multiclass classifier together with a few pairwise classifiers forms a classifier ensemble. To combine their results, we employ the weighted majority vote rule, with the weights determined by a genetic algorithm.

1.1. Neural Network

A first wave of interest in neural networks (also known as 'connectionist models' or 'parallel distributed processing') emerged after the introduction of simplified neurons by McCulloch and Pitts in 1943. These neurons were presented as models of biological neurons and as conceptual components for circuits that could perform computational tasks. The elements of an artificial neural network are input vectors, output vectors, target vectors, weights, transfer functions and biases. Neurons can be connected in different ways, yielding different network topologies [4]. In this paper the Multi Layer Perceptron (MLP) is used. Each unit of a neural network performs a relatively simple job: it receives input from neighbors or external sources and uses this to compute an output signal that is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time. Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i), which receive data from outside the neural network; output units (indicated by an index o), which send data out of the neural network; and hidden units (indicated by an index h), whose input and output signals remain within the neural network. During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with


asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages. In most cases we assume that each unit provides an additive contribution to the input of the unit with which it is connected. The total input to unit k is simply the weighted sum of the separate outputs from each of the connected units plus a bias or offset term θ_k:

s_k(t) = Σ_j w_jk(t) y_j(t) + θ_k(t)    (1)

The contribution for positive w_jk is considered as excitation and for negative w_jk as inhibition. In some cases more complex rules for combining inputs are used, in which a distinction is made between excitatory and inhibitory inputs. We call units with a propagation rule of the above form sigma units. Generally, some sort of threshold function is used: a hard limiting threshold function (a sign function), a linear or semi-linear function, or a smoothly limiting threshold. A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule; for example, the weights may be updated according to the gradient of an error function. For further study you can refer to neural network books, like [5].
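As an illustration of Eq. (1), the following minimal sketch (our example, not code from the paper; all names are ours) computes the total input of a layer of sigma units and applies a smoothly limiting threshold:

import numpy as np

def sigma_units(y, W, theta):
    """Total input s_k = sum_j w_jk * y_j + theta_k for each unit k (Eq. 1)."""
    return y @ W + theta

def sigmoid(s):
    """A smoothly limiting threshold function."""
    return 1.0 / (1.0 + np.exp(-s))

# Toy example: 3 connected units feeding 2 sigma units.
y = np.array([0.5, -1.0, 0.25])      # outputs y_j of the connected units
W = np.random.randn(3, 2) * 0.1      # w_jk: weight from unit j to unit k
theta = np.zeros(2)                  # bias terms theta_k

activation = sigmoid(sigma_units(y, W, theta))
print(activation)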

1.2. Genetic Algorithm

Genetic algorithms are optimization and machine learning algorithms based loosely on processes of biological evolution. John Holland created the genetic algorithm method, which is introduced in [6]. Interest in genetic algorithms has increased recently in conjunction with an increase in interest in other algorithms based on natural processes, including simulated annealing and neural networks. A GA can be considered as a composition of three essential elements: first, a set of potential solutions, called individuals or chromosomes, that evolve during a number of iterations (generations); this set of solutions is also called the population. Second, an evaluation mechanism (fitness function) that assesses the quality, or fitness, of each individual of the population, yielding a real value for each individual; the best individuals survive and are allowed to produce new individuals. Third, an evolution procedure based on "genetic" operators such as selection, crossover and mutation: crossover takes two individuals and produces two new individuals, while mutation randomly modifies a gene of an individual. A stop condition determines the end of the algorithm; three well-known stop conditions are a predefined number of generations or evaluations, a predefined value to be reached by the fitness function, and a number of generations without improvement. For further source material on genetic algorithms you can refer to [6-8].

An important aspect of GAs in learning is their usage as an optimization tool in pattern recognition [1, 22, 13-17]. GAs are applied in statistical pattern recognition [1] in two ways: directly as classifiers, and as an optimization tool for determining the parameters of classifiers. In [14], a GA is used to find decision boundaries in N-dimensional feature space. Further applications of GAs concern the optimization of parameters in the classification process, and many researchers use GAs for feature subset selection [13, 16-19]. Combination of classifiers is another field where GAs serve as an optimization tool; GAs have been used for feature selection in classifier ensembles in [10, 24]. A minimal sketch of a generic GA loop follows.
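The sketch below is our own minimal, self-contained illustration of such a GA loop (it is not the configuration used later in the paper): tournament selection, one-point crossover and gene-wise mutation, stopping after a fixed number of generations.

import random

def genetic_algorithm(fitness, n_genes, pop_size=50, generations=100,
                      p_crossover=0.8, p_mutation=0.05):
    # Initial population: random real-valued chromosomes.
    pop = [[random.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):              # stop condition: fixed generations
        ranked = sorted(pop, key=fitness, reverse=True)
        pop = [ranked[0][:], ranked[1][:]]    # elitism: the two best survive
        while len(pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(random.sample(ranked, 3), key=fitness)
            p2 = max(random.sample(ranked, 3), key=fitness)
            if random.random() < p_crossover: # one-point crossover
                cut = random.randrange(1, n_genes)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):            # mutation: replace random genes
                for g in range(n_genes):
                    if random.random() < p_mutation:
                        child[g] = random.random()
                pop.append(child)
        pop = pop[:pop_size]
    return max(pop, key=fitness)

# Toy usage: maximize the sum of ten genes.
best = genetic_algorithm(fitness=sum, n_genes=10)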

The rest of this paper is organized as follows. Section 2 explains the combination of classifiers. Section 3 describes pairwise classification. Section 4 explains the proposed method. Section 5 comparatively reports the results of the proposed method and of traditional methods on a Farsi handwritten digits dataset. Finally, we conclude in section 6.

2. Combining classifiers

In general, the creation of combinational classifiers can take place at four levels, depicted in Figure 1. At level four, we create different subsets of the data in order to train independent classifiers; bagging and boosting are examples of this approach [3, 9], in which different subsets of the data, rather than all of it, are used for training. At level three, we use subsets of the features to obtain diversity in the ensemble; each classifier is trained on a different subset of the features [10-12]. At level two, we can use different kinds of classifiers to create the ensemble [12]. Finally, at


level one, the method of combining (fusion) is considered.

Figure 1: Different levels of creation of a classifier ensemble (level 1: combining classifiers; level 2: classifiers 1..L; level 3: features; level 4: data sets).

In the combining of classifiers, we intend to increase the performance of a single classifier. There are several ways to combine classifiers. The simplest way is to find the best classifier and use it as the main classifier; this is offline CMC. Another method, named online CMC, uses all classifiers in the ensemble, for example through voting. In this paper we use weighted voting, and we show that this combining method can improve the results of classification.

3. Pairwise classification

Classifiers can be divided into two categories with regard to the number of classes they separate: pairwise classifiers and multiclass classifiers. The aim of a multiclass classifier is to separate one class from all the other classes, whereas the aim of a pairwise classifier is to separate one class from another single class. Because pairwise classifiers are trained only to discriminate between two classes, their decision boundaries are simpler and more effective than those of multiclass classifiers. Figure 2 depicts this point.

Figure 2: Simplicity of boundaries in pairwise classification vs. multiclass classification.

Pairwise classification is a combining method that uses all possible pairwise classifiers instead of one single multiclass classifier. Suppose the number of classes is c; then we need c×(c-1)/2 pairwise classifiers. This method works well for problems with a few classes. If we have a large number of classes, we will need a very large number of pairwise classifiers, since their number is of order c². Hence this method has a high computational cost and is not suitable for problems with many classes.
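As a small illustration (ours, not the paper's), the quadratic growth is easy to see by enumerating the pairs:

from itertools import combinations

c = 10  # number of classes, e.g. the ten digits
pairs = list(combinations(range(c), 2))
print(len(pairs))  # c*(c-1)/2 = 45 pairwise classifiers for full pairwise classification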


4. Proposed method

The main idea behind the proposed method is to use a number of pairwise classifiers to decrease the error in error-prone regions of the feature space. First, a multiclass classifier is used to classify the data; in this step we use a Multi Layer Perceptron as the base multiclass classifier. Then, the confusion matrix is obtained from its results, and we detect the classes that are confused and error-prone. After that, we employ a number of pairwise classifiers to improve the accuracy in those error-prone regions. Finally, we use a GA to determine the weight of each classifier's vote. For each class we use a distinct classifier ensemble, so we run the GA as many times as there are classes. The dataset used in this paper is a Farsi handwritten digit set, which is discussed in section 5.


4.1. Determining erroneous pair-classes

At this step, a multiclass classifier is trained on the training data. Then, using the results of this classifier on the evaluation data, we obtain the confusion matrix. This matrix contains important information about the functionality of the classifier: close and error-prone classes can be detected from it. Indeed, the confusion matrix determines the error distribution over the different classes. Item aij of this matrix determines how many instances of class cj have been misclassified as class ci. Table 1 shows the confusion matrix obtained from the base multiclass classifier. As can be seen, digit 5 (or equivalently class 6) was recognized incorrectly 15


times as digit 0 (or equivalently class 1), while digit 0 was recognized incorrectly 14 times as digit 5. This means that 29 misclassifications happened between these two digits (classes) in total. The most erroneous pair-classes in this matrix are (2, 3), (0, 5), (3, 4), (1, 4) and (6, 9).

Table 1: Confusion matrix pertaining to the Farsi handwritten OCR

      0    1    2    3    4    5    6    7    8    9
0   969    0    4    1    2   14    0    0    1    0
1     4  992    1    0    2    4    1    1    1   15
2     1    1  974   18    9    1    4    4    0    1
3     0    0   13  957   12    0    3    2    0    1
4     5    0    3   17  973    3    2    2    0    3
5    15    0    0    0    0  977    1    0    0    0
6     2    6    2    1    3    0  974    5    1    3
7     3    0    3    1    0    1    1  986    0    0
8     0    1    0    1    0    0    2    0  995    0
9     1    0    4    1    0    0   10    0    3  976
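The selection of erroneous pair-classes from such a matrix can be sketched as follows (our illustration; function names are ours). Errors in both directions between two classes are summed, and the pairs with the largest totals are returned:

import numpy as np

def most_erroneous_pairs(conf, k):
    """Return the k class pairs with the largest total confusion,
    counting errors in both directions: conf[i, j] + conf[j, i]."""
    c = conf.shape[0]
    scored = [(conf[i, j] + conf[j, i], (i, j))
              for i in range(c) for j in range(i + 1, c)]
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:k]]

# Example with Table 1 loaded as a 10x10 array `conf`:
# most_erroneous_pairs(conf, 5) -> pairs such as (2, 3), (0, 5), (3, 4), ...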


4.2. Training of pairwise classifiers

After determining the most erroneous pair-classes, we train a binary classifier for each of them. At this step, each pairwise classifier is trained only on the instances of its two corresponding classes. For example, assume (0, 5) is an erroneous pair-class; we then train a classifier for it using those training data points that have label 0 or 5. As mentioned, this method is flexible, so we can add an arbitrary number of pairwise classifiers to the primary base classifier; it is expected that the accuracy of the resulting system is higher than that of the primary base classifier alone. Define the error number of a pair-class as the total number of misclassifications occurring between its two classes. For choosing the most erroneous pair-classes, it is sufficient to sort the error numbers of the pair-classes and select an arbitrary number of them; this number can be determined by trial and error.
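A minimal sketch of this step (ours; the paper's exact MLP topology and training settings are not specified here, so scikit-learn's MLPClassifier with arbitrary parameters stands in for them):

import numpy as np
from sklearn.neural_network import MLPClassifier

def train_pairwise_classifiers(X, y, pairs):
    """Train one binary MLP per erroneous pair-class, each on the
    instances of its two classes only."""
    classifiers = {}
    for a, b in pairs:
        mask = (y == a) | (y == b)           # keep only the two classes
        clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)
        clf.fit(X[mask], y[mask])
        classifiers[(a, b)] = clf
    return classifiers

# Usage: pairwise = train_pairwise_classifiers(X_train, y_train,
#                                              [(2, 3), (0, 5), (3, 4)])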


4.3. Fusion of pairwise classifiers

The last step of the proposed approach is to combine the results of the primary base classifier with those of the pairwise classifiers. The results of these classifiers are used as the inputs of the combiner, and the output of the combiner is the final joint output. There are many ways to combine these results; here, we perform this work using a weighted majority vote. The question is how to determine these weights optimally. Here, a GA is employed to find them: because of the capability of GAs to escape local optima, the accuracy of this method is expected to be better than that of a simple ANN or an unweighted ensemble. Figure 3 depicts the structure of our ensemble system; the number of GAs employed in our system is 10, equal to the number of digits (classes).

Figure 3: Structure of the proposed method (input layer; a neural network with multiple outputs as the base classifier; neural networks with two outputs as the pairwise classifiers).

In fact, each GA creates an ensemble for detecting one digit (class) by the weighted majority vote method. Each GA-based ensemble gives its vote according to the results of the base classifier and the pairwise classifiers. Finally, the most certain class is selected; this is simply done with a max function.
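The decision rule can be sketched as below (our illustration under the stated assumptions: each class's certainty is a weighted sum of the relevant classifier outputs, and the max function picks the winner; the paper's exact combination rule may differ in detail):

import numpy as np

def weighted_vote_score(outputs, weights):
    """Certainty of one class ensemble: weighted sum of its classifier outputs."""
    return float(np.dot(weights, outputs))

def classify(per_class_outputs, per_class_weights):
    """per_class_outputs[c]: vector of classifier outputs relevant to class c;
    per_class_weights[c]: GA-optimized weights for that class's ensemble.
    The most certain class wins (the max function in the text)."""
    scores = [weighted_vote_score(o, w)
              for o, w in zip(per_class_outputs, per_class_weights)]
    return int(np.argmax(scores))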



5. Experimental Results

This section evaluates the results of applying our method to the Farsi handwritten digit dataset named Hoda [20]. This dataset contains 102,364 instances of the digits 0-9. We divide the data into three parts: training, evaluation and test sets. The training set contains 60,000 instances, and the evaluation and test sets each contain 10,000 instances. From each instance, the 106 features described in [20] have been extracted. Some instances of this dataset are depicted in Figure 4.

Figure 4: Some instances of the Farsi OCR dataset, with different qualities.

In this work, an MLP is used as the primary base classifier, and the confusion matrix is created from its results. The most erroneous pair-classes obtained from the confusion matrix are, in order: (2, 3), (0, 5), (3, 4), (1, 4), (6, 9), ... Ten binary classifiers are trained for the pair-classes that have a considerable error number. The results of the primary base classifier and those of the pairwise classifiers are given to the GAs as inputs. Therefore, each chromosome contains 30 genes, equal to the number of outputs (10*2 genes for the outputs of the pairwise classifiers and 10 genes for the outputs of the primary base classifier). Gaussian and scattered operators are used for mutation and recombination, respectively. The population size is 500, and the termination condition is the passing of 200 generations. The fitness function of each GA is the accuracy of the final recognition system on the evaluation dataset; a sketch of this fitness evaluation is given at the end of this section. The output of each GA is the certainty of selecting its corresponding class, and a max function selects the most certain decision as the final joint decision.

Table 2 shows the results of our proposed method and those of the previous methods comparatively. The recognition rate of the primary base classifier is 97.83%. Using the proposed method, the recognition ratio is improved from 97.83% to 98.89%, a considerable improvement. If an MLP is used instead of the GAs in the combiner, 98.06% is obtained.

Table 2: Comparison of the results

Method                            Recognition ratio (%)
A simple multiclass classifier    97.83
Final combining without GA        98.06
Proposed method                   98.89
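To make the 30-gene chromosome concrete, here is a minimal sketch of the fitness evaluation for one class's GA (our reconstruction under stated assumptions: the 30 outputs are combined linearly by the chromosome weights and thresholded for this one class; names and the threshold are hypothetical):

import numpy as np

def fitness(chromosome, outputs, labels, target_class):
    """Accuracy-oriented fitness of one 30-gene chromosome.
    outputs: (n_samples, 30) array holding, per evaluation sample, the
    10 base-classifier outputs and the 2x10 pairwise-classifier outputs.
    A larger weighted sum means more certainty for target_class."""
    certainty = outputs @ np.asarray(chromosome)   # weighted majority vote
    predicted = certainty > 0.5                    # simplistic per-class decision
    return float(np.mean(predicted == (labels == target_class)))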

6. Conclusion

In this paper, a new method for improving the performance of multiclass classification systems is proposed. In this method, an arbitrary number of binary classifiers is added to the main classifier to increase its accuracy, and the results of all these classifiers are given to a set of GAs. The use of the confusion matrix makes the proposed method flexible. The number of all possible pairwise classifiers is c*(c-1)/2, which is O(c^2); using this method, without giving up considerable accuracy, we decrease this order to O(1). This property makes the proposed method useful for problems with a large number of classes. The experiments show the effectiveness of the method.

7. Acknowledgments

This work is partially supported by the Data Mining Research group at the Computer Research Center of Islamic Sciences (CRCIS), NOOR Co., P.O. Box 37185-3857, Qom, Iran.

8. References

[1] B. Minaei-Bidgoli and W. F. Punch, "Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System", GECCO 2003.

[2] S. Gunter and H. Bunke, "Creation of classifier ensembles for handwritten word recognition using feature selection algorithms", IWFHR 2002, January 15, 2002.

[3] L. Breiman, "Bagging Predictors", Machine Learning, Vol. 24, No. 2, pp. 123-140, 1996.

[4] A. Sanchez, R. Alvarez, J. C. Moctezuma and S. Sanchez, "Clustering and Artificial Neural Networks as a Tool to Generate Membership Functions", Proc. 16th IEEE International Conference on Electronics, Communications and Computers (CONIELECOMP 2006).


[5] S. Haykin, "Neural Networks, a Comprehensive Foundation", second edition, Prentice Hall International, Inc., ISBN 0-13-908385-5, 1999.

[6] J. H. Holland, "Adaptation in Natural and Artificial Systems", MIT Press, Cambridge, MA, 1992; 1st edition: 1975, The University of Michigan Press, Ann Arbor.

[7] D. E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning", Addison-Wesley, New York, NY, 1989.

[8] L. Davis (Ed.), "Handbook of Genetic Algorithms", Van Nostrand Reinhold, New York, 1991.

[9] S. Dudoit and J. Fridlyand, "Bagging to improve the accuracy of a clustering procedure", Bioinformatics, 19(9), pp. 1090-1099, 2003.

[10] L. I. Kuncheva and L. C. Jain, "Designing Classifier Fusion Systems by Genetic Algorithms", IEEE Transactions on Evolutionary Computation, Vol. 33, pp. 351-373, 2000.

[11] A. Strehl and J. Ghosh, "Cluster ensembles - a knowledge reuse framework for combining multiple partitions", Journal of Machine Learning Research, pp. 583-617, 2002.

[12] F. Roli, G. Giacinto and G. Vernazza, "Methods for designing multiple classifier systems", Proc. 2nd International Workshop on Multiple Classifier Systems, Vol. 2096 of Lecture Notes in Computer Science, Cambridge, UK, Springer-Verlag, pp. 78-87, 2001.

[13] J. Bala, K. De Jong, J. Huang, H. Vafaie and H. Wechsler, "Using learning to facilitate the evolution of features for recognizing visual concepts", Evolutionary Computation 4(3) - Special Issue on Evolution, Learning, and Instinct: 100 years of the Baldwin Effect, 1997.

[14] S. Bandyopadhyay and C. A. Murthy, "Pattern Classification Using Genetic Algorithms", Pattern Recognition Letters, Vol. 16, pp. 801-808, 1995.

[15] E. Falkenauer, "Genetic Algorithms and Grouping Problems", John Wiley & Sons, 1998.

[16] C. Guerra-Salcedo and D. Whitley, "Feature Selection mechanisms for ensemble creation: a genetic search perspective", in A. A. Freitas (Ed.), Data Mining with Evolutionary Algorithms: Research Directions - Papers from the AAAI Workshop, pp. 13-17, Technical Report WS-99-06, AAAI Press, 1999.

[17] M. J. Martin-Bautista and M. A. Vila, "A survey of genetic feature selection in mining issues", Proc. Congress on Evolutionary Computation (CEC-99), pp. 1314-1321, Washington D.C., July 1999.

[18] W. F. Punch, M. Pei, L. Chia-Shun, E. D. Goodman, P. Hovland and R. Enbody, "Further research on Feature Selection and Classification Using Genetic Algorithms", 5th International Conference on Genetic Algorithms, Champaign, IL, pp. 557-564, 1993.

[19] H. Vafaie and K. De Jong, "Robust feature Selection algorithms", Proc. 1993 IEEE Int. Conf. on Tools with AI, pp. 356-363, Boston, MA, USA, Nov. 1993.

[20] H. Khosravi and E. Kabir, "Introducing a very large dataset of handwritten Farsi digits and a study on the variety of handwriting styles", Pattern Recognition Letters, 28(10), pp. 1133-1141, 2007.

[21] H. Parvin, H. Alizadeh, M. Fathi and B. Minaei-Bidgoli, "Improved Face Detection Using Spatial Histogram Features", 2008 Int. Conf. on Image Processing, Computer Vision, and Pattern Recognition (IPCV'08), Las Vegas, Nevada, USA, July 14-17, 2008 (in press).

[22] B. Minaei-Bidgoli, G. Kortemeyer and W. F. Punch, "Optimizing Classification Ensembles via a Genetic Algorithm for a Web-based Educational System", SSPR/SPR 2004, Lecture Notes in Computer Science, Vol. 3138, Springer-Verlag, ISBN 3-540-22570-6, pp. 397-406, 2004.

[23] A. Saberi, M. Vahidi and B. Minaei-Bidgoli, "Learn to Detect Phishing Scams Using Learning and Ensemble Methods", IEEE/WIC/ACM International Conference on Intelligent Agent Technology, Workshops (IAT 07), pp. 311-314, Silicon Valley, USA, November 2-5, 2007.

[24] B. Minaei-Bidgoli, G. Kortemeyer and W. F. Punch, "Mining Feature Importance: Applying Evolutionary Algorithms within a Web-Based Educational System", Proc. Int. Conf. on Cybernetics and Information Technologies, Systems and Applications (CITSA 2004), 2004.
