IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS, VOL. 32, NO. 6, NOVEMBER 2002
Correspondence

A Cloning Approach to Classifier Training

Mohamad Adnan Al-Alaoui, Rodolphe Mouci, Mohammad M. Mansour, and Rony Ferzli

Abstract—The Al-Alaoui algorithm is a weighted mean-square error (MSE) approach to pattern recognition. It clones the erroneously classified samples to increase the population of their corresponding classes. The algorithm was originally developed for linear classifiers. In this paper, the algorithm is extended to multilayer neural networks, which may be used as nonlinear classifiers. It is also shown that applying the Al-Alaoui algorithm to multilayer neural networks speeds up the convergence of the back-propagation algorithm.

Index Terms—Al-Alaoui algorithm, back-propagation algorithm, Bayes classifier, character recognition, Levenberg–Marquardt algorithm, neural networks, pattern classification.

Manuscript received June 6, 2000; revised October 3, 2002. This work was supported in part by the University Research Board of the American University of Beirut, Beirut, Lebanon. This paper was recommended by Associate Editor A. Bouzerdoum. M. A. Al-Alaoui and R. Ferzli are with the Department of Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon (e-mail: [email protected]; [email protected]). R. Mouci is with the Central Bank of Lebanon, Beirut, Lebanon (e-mail: [email protected]). M. M. Mansour is with the Coordinated Science Laboratory/Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TSMCA.2002.807035

I. INTRODUCTION

The development of the Al-Alaoui algorithm for pattern classification [1]–[6] using a single-layer neural network was motivated by the proofs of Patterson and Womack [7] and of Wee [8] that the mean-square error (MSE) solution of the pattern classification problem gives a weighted minimum MSE approximation to the optimal Bayes discriminant function, weighted by the probability density of the sample. The algorithm exploits this link to the Bayes classifier by introducing a weighted MSE approach that offsets the weighting by the probability density of the samples.

The approach in [1] and [2] results in a batch-mode algorithm, in which corrections are carried out after all the patterns have been presented, as well as several equivalent forms of single-pattern adaptation algorithms. The resulting iterative procedure was shown to be a weighted MSE solution. Starting with the MSE solution, the erroneously classified samples are cloned and the clones are appended to the training set. This increases their probability of occurrence, thus offsetting the weighting introduced by the MSE approach. The procedure can be terminated either after the error has been reduced to a specified value or after a specified number of iterations, keeping the solution that attains the fewest errors.

It should be noted that, for single-layer neural networks, i.e., networks with no hidden layers, the algorithm is similar to the Perceptron [9] and Winnow [10] algorithms in that all three are mistake-driven: the weights are updated only if an error occurs during training. Additionally, all three algorithms converge to a separating solution if the data are linearly separable. However, while the Al-Alaoui algorithm always converges for linearly nonseparable data, the Perceptron and Winnow algorithms may not converge. A comparison of the Al-Alaoui and Winnow algorithms will be taken up in future work.

This paper proposes an adaptation of the batch-mode Al-Alaoui algorithm to multilayer feed-forward neural networks, targeted at, among other applications, nonlinear classification problems. The remainder of the paper is organized as follows. Section II presents a mathematical formulation of the pattern classification problem. Section III develops a modified back-propagation training algorithm for multilayer neural networks using the Al-Alaoui algorithm. In Section IV, the current standard back-propagation algorithm (designated BP), which includes momentum and an adaptive learning rate, is compared with the modified standard back-propagation algorithm (designated ALBP); the comparison consists of applications to three different data sets. In Section V, three experiments are carried out comparing the Al-Alaoui algorithm with the standard BP and the Levenberg–Marquardt (LM) algorithms. Finally, Section VI concludes the paper.

II. PROBLEM FORMULATION

The problem is formulated using the same methodology as that of Wee [8], as elaborated on by Duda and Hart [11]. The formulation constructs a matrix A of augmented training samples, with each row corresponding to a training sample augmented by adding a component of value equal to one. Thus, if the dimension of the training sample is p, then the dimension of the augmented sample is p + 1. In addition, a target matrix B of ones and zeros, as well as a weight matrix W whose columns are the weight vectors for the classes, are used. The problem hence takes the following form: given A and B, find the MSE solution W of
AW = B.    (1)
It is well known that the MSE solution of (1) is given by
W = A^# B    (2)

where A^# designates the pseudo- or generalized inverse of A. If A has full column rank, then the solution takes the form

W = (A^T A)^{-1} A^T B    (3)

where the superscript T designates the transpose.
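To make (2) and (3) concrete, the following minimal NumPy sketch computes the MSE weight matrix for a small synthetic problem; the data, dimensions, and variable names are illustrative assumptions for this example and are not taken from the paper.

    import numpy as np

    # Toy data: N augmented samples (a trailing column of ones) from L classes.
    # A is N x (p + 1); B is N x L with a one in the column of the true class.
    rng = np.random.default_rng(0)
    N, p, L = 12, 2, 3
    X = rng.normal(size=(N, p))
    A = np.hstack([X, np.ones((N, 1))])      # augmented sample matrix
    labels = rng.integers(0, L, size=N)
    B = np.eye(L)[labels]                    # target matrix of ones and zeros

    W = np.linalg.pinv(A) @ B                # (2): pseudo-inverse solution

    # (3): normal-equations form, valid when A has full column rank.
    W_alt = np.linalg.solve(A.T @ A, A.T @ B)

    # Decision rule: assign a sample to the class with the largest discriminant.
    predicted = np.argmax(A @ W, axis=1)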
The batch mode consists of cloning the erroneously classified patterns and appending the clones, after each epoch, to the training set. In single-pattern adaptations, changes are carried out only if the pattern is erroneously classified. Three forms were developed in [1] and [2], and the approach was shown to be equivalent to a weighted MSE. Form 1 simply appends the clone of a pattern that is in error to the training set, and thus both A and B are modified. Form 2 multiplies the row of the sample that is in error in matrix A, and the corresponding row in matrix B, by n + 1, where n is the number of times that sample was found to be erroneously classified during the iteration process. Form 3 modifies only the matrix B, changing the row in B corresponding to the sample that is in error by an amount proportional to the error and in the direction of the gradient of ‖AW − B‖ with respect to the corresponding row of B.
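As an illustration of the batch mode, the sketch below implements Form 1 for a single-layer (linear) classifier: solve the MSE problem, clone the misclassified original samples, append the clones, and re-solve, keeping the weights with the fewest errors. The function name, stopping parameters, and data layout are assumptions made for this example, not definitions from the paper.

    import numpy as np

    def al_alaoui_batch(A, B, labels, max_iters=50):
        """Illustrative Form 1 sketch: clone misclassified samples, re-solve MSE."""
        A_train, B_train = A.copy(), B.copy()
        best_W, best_errors = None, np.inf
        for _ in range(max_iters):
            W = np.linalg.pinv(A_train) @ B_train       # MSE solution, as in (2)
            predicted = np.argmax(A @ W, axis=1)        # classify the original set
            wrong = np.flatnonzero(predicted != labels)
            if len(wrong) < best_errors:
                best_W, best_errors = W, len(wrong)
            if len(wrong) == 0:
                break
            # Append clones of the misclassified originals, raising their
            # relative frequency in the next MSE solution.
            A_train = np.vstack([A_train, A[wrong]])
            B_train = np.vstack([B_train, B[wrong]])
        return best_W, best_errors

With the toy A, B, and labels from the previous sketch, al_alaoui_batch(A, B, labels) returns the weight matrix with the fewest errors on the original training set together with that error count.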
The mathematical formulation of the pattern classification problem can be summarized as follows. The training set consists of N samples drawn from L classes, with N_i denoting the number of training samples from class i and Σ_{i=1}^{L} N_i = N. Let x represent the augmented sample vector to be classified and ω_i denote class i. Define O_i(x; w) to be the output of the ith neuron in the output layer, where w is the weight vector. Let the desired output d_i(x) be 1 when x ∈ ω_i and 0 otherwise. The decision rule is then: decide that x is from class ω_i if d_i(x) > d_j(x) for all j ≠ i.

In order to overcome the limitation of single-layer networks, which are capable of performing correct classification for linearly separable classes only, multilayer feed-forward networks have emerged as universal approximators with a nonlinear mapping capability. Multilayer networks have three main features: smooth nonlinear neuron activation functions, one or more hidden layers of neurons, and a high degree of connectivity [12], [13]. Multilayer networks are usually trained with the highly popular error back-propagation algorithm, which consists of two passes, a forward pass and a backward pass. In the forward pass, the outputs of the neurons are calculated while keeping the weights fixed; in the backward pass, the weights are updated based on a gradient descent in the weight space [14]–[16]. In [17], it is shown that networks with one hidden layer have performance similar to that of networks with more than one hidden layer.

Further improvements to the back-propagation method may be introduced by allowing the learning rate to be adaptive. A small learning rate leads to a smooth trajectory on the error surface at the cost of slow learning; a large value may cause instability but speeds up learning [12], [13]. Back-propagation learning may also be sped up, while avoiding instability, by adding a momentum term to the weight update [12], [13], [18]. One of the variants of the back-propagation algorithm is the Levenberg–Marquardt training method [19], which accelerates the back-propagation convergence rate by using the Hessian matrix.

The Bayes optimal discriminant functions are given by
g_i(x) = P(ω_i | x),    i = 1, 2, ..., L    (4)
where P(ω_i | x) is the probability that x ∈ ω_i. The decision rule for assigning x to a class is: decide that x is from class ω_i if g_i(x) > g_j(x) for all j ≠ i. The Bayes discriminant functions are optimal in the sense that their use minimizes the probability of error. White [20] and Ruck et al. [21] showed, in a manner similar to the proof of Patterson and Womack for the MSE solution of a single-layer neural network, that the back-propagation algorithm minimizes the following criterion:
"2 (w ) =
L i=1 x
O(x; w ) 0 gi (x)
2
p(x) dx x:
(5)
The novelty in the Al-Alaoui algorithm is that it reintroduces the samples that are erroneously classified by cloning them and adding the clones to the training set, in order to offset the weighting by p(x) in (5); the cloning increases the probability of occurrence of these samples and hence their weight in (5).
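A small numerical check of this weighting argument, using the same illustrative NumPy setup as above: appending one clone of a sample is equivalent to solving a weighted least-squares problem in which that sample's squared residual counts twice, which can be verified by scaling the corresponding rows by the square root of the weights. The indices and data are made up for the demonstration.

    import numpy as np

    rng = np.random.default_rng(1)
    A = np.hstack([rng.normal(size=(10, 2)), np.ones((10, 1))])
    B = np.eye(3)[rng.integers(0, 3, size=10)]

    clones = np.array([4, 7])                 # samples assumed misclassified
    A_cloned = np.vstack([A, A[clones]])
    B_cloned = np.vstack([B, B[clones]])

    # Equivalent weighted MSE: weight 2 on the cloned rows, 1 elsewhere.
    w = np.ones(len(A))
    w[clones] = 2.0
    s = np.sqrt(w)[:, None]
    W_weighted, *_ = np.linalg.lstsq(s * A, s * B, rcond=None)
    W_cloned, *_ = np.linalg.lstsq(A_cloned, B_cloned, rcond=None)

    print(np.allclose(W_weighted, W_cloned))  # True: cloning reweights the MSE

A sample cloned n times therefore enters the criterion with weight n + 1.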
Fig. 1. Al-Alaoui algorithm for batch-mode neural networks.
III. AL-ALAOUI ALGORITHM FOR NEURAL NETWORKS

This section presents the Al-Alaoui algorithm for multilayer feed-forward networks. A flowchart of the algorithm is shown in Fig. 1. For single-layer neural networks, the batch-mode algorithm was applied to MSE problems where the classification error was large, and it reduced the number of erroneously classified samples. For multilayer neural networks trained with the back-propagation algorithm, on the other hand, convergence is usually obtained for many classification problems, and thus no improvement in the classification is needed. The algorithm, however, can be adapted to speed up the convergence of the back-propagation algorithm by reintroducing the cloned erroneously classified samples [22].

The algorithm reduces the number of misclassifications by cloning the erroneously classified samples and appending them to the training set. This approach gives more weight to samples close to the classification boundaries by cloning them upon being erroneously classified. For multilayer networks, the back-propagation method, irrespective of learning-rate adaptability and the use of momentum, is applied for a selected number of epochs, each consisting of the whole input set of training samples. The input set is then tested, and the clones of the samples in the original training set that are misclassified are added to the training set to yield a new input set for the next group of epochs. The process continues until all samples are correctly classified, until the number of misclassifications has dropped to some desired level, or until a prespecified number of iterations has been reached, keeping the best obtained solution (a sketch of this loop is given below).

Two fast-convergence variants of the BP algorithm provided by MATLAB, traingdx and trainlm, were employed [23]. traingdx is the current standard back-propagation algorithm, which includes momentum and an adaptive learning rate, and will subsequently be referred to as the BP algorithm. trainlm implements the Levenberg–Marquardt algorithm and will subsequently be referred to as the LM algorithm. It will be demonstrated that the application of the Al-Alaoui algorithm to these fast variants of the BP algorithm speeds up their convergence rate.
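The paper's experiments use MATLAB's traingdx and trainlm routines; purely to illustrate the outer cloning loop, the following sketch substitutes scikit-learn's MLPClassifier, trained incrementally with partial_fit, for the MATLAB network. All class names, parameters, and defaults below are assumptions of this example rather than the authors' setup.

    import copy
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def albp_train(X, y, hidden=8, epochs_per_group=50, max_groups=20, seed=0):
        """Sketch of the ALBP outer loop: BP-style training plus cloning of errors."""
        classes = np.unique(y)
        # Stand-in for the paper's MATLAB network (assumed architecture/parameters).
        net = MLPClassifier(hidden_layer_sizes=(hidden,), solver="sgd",
                            momentum=0.9, random_state=seed)
        X_train, y_train = X.copy(), y.copy()
        best_net, best_errors = None, np.inf
        for _ in range(max_groups):
            for _ in range(epochs_per_group):            # one group of BP epochs
                net.partial_fit(X_train, y_train, classes=classes)
            wrong = np.flatnonzero(net.predict(X) != y)  # test the original set
            if len(wrong) < best_errors:
                best_net, best_errors = copy.deepcopy(net), len(wrong)
            if len(wrong) == 0:
                break
            # Clone the misclassified original samples for the next group of epochs.
            X_train = np.vstack([X_train, X[wrong]])
            y_train = np.concatenate([y_train, y[wrong]])
        return best_net, best_errors

The essential point is that the clones come from the misclassified samples of the original training set, so a persistently misclassified sample accumulates clones across groups of epochs.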
IV. EXPERIMENTAL WORK ON IMPROVING THE CONVERGENCE OF THE BP ALGORITHM

In this section, the standard BP algorithm is compared with the modified standard BP algorithm, designated ALBP. The comparison consists of three experiments involving three different data sets. The first data set is a special character set, the second is the iris data set [24], and the third is a double-spiral data set. In all three experiments, it is demonstrated that the training time of the ALBP algorithm is less than that of the BP algorithm, while the resulting errors on the corresponding testing sets are comparable for both algorithms.

A. Experiment 1: Special Character Set

A special character set, consisting of an alphabet of 48 Latin symbols, shown in Fig. 2, was used for training.
Fig. 2. Special character set of an alphabet of 48 Latin symbols.

Fig. 3. Bitmap of the 1653 characters used for testing.
These symbols are of fixed size and font, defined on a grid of five rows by three columns. Each symbol is therefore represented by a feature vector of size 15, corresponding to the 15 cells of the grid, each of which may have an intensity value of 0 or 1 to represent the white or black gray levels, respectively. Thus, the input alphabet is a 15 × 48 matrix (48 symbols, each represented by a vector of size 15 corresponding to the 5 × 3 grid on which the symbol is displayed), and the output is a 48 × 48 matrix whose diagonal contains ones and whose remaining entries are zeros. The character recognition problem that is addressed is a template matching task [25], [26], which is conveniently handled by neural networks. The architecture of the neural network used is a 15-8-48 network.

The training, or development, time is the time taken by MATLAB to train the neural network until it correctly classifies the input symbols of the training set. The training set is the alphabet in "clean" format, as shown in Fig. 2. The training time for the BP algorithm was 5.219 s, while the training time for the ALBP algorithm was 1.422 s. Note that BP refers to what is now the standard back-propagation algorithm, consisting of the original back-propagation algorithm plus momentum and an adaptive learning rate. Experiments were also conducted for the Al-Alaoui algorithm using the original back-propagation algorithm (without momentum and an adaptive learning rate). The training time for the Al-Alaoui algorithm using the original back-propagation algorithm was 4.937 s for a learning rate of 0.01, which is less than the training time for the standard BP alone. It should be noted that the training time using the original BP alone was 41.984 s.

The network was tested with the graphical representation, or bitmap, of a text consisting of 1653 characters extracted at random from a book. The corresponding bitmap is shown in Fig. 3. The input fed for recognition to the trained network is a 15 × 1653 matrix, with a 48 × 1653 output target matrix. The testing data set consisted of the original clean bitmap and 40 other versions distorted with additive uniform or normal noise with standard deviations as follows: • ten copies with low uniform noise (