Generalization Improvement of a Fuzzy Classifier with Ellipsoidal Regions

Shigeo Abe and Keita Sakaguchi
Graduate School of Science and Technology, Kobe University, Kobe, Japan
E-mail: [email protected]

Abstract—In a fuzzy classifier with ellipsoidal regions, each cluster is approximated by a center and a covariance matrix, and the membership function is calculated using the inverse of the covariance matrix. Thus, when the number of training data is small, the covariance matrix becomes singular and the generalization ability is degraded. In this paper, during the symmetric Cholesky factorization of the covariance matrix, if the input of the square root is smaller than a prescribed positive value, we replace the input with that value. Further, we tune the slopes of the membership functions so that the margins are maximized. We show the validity of our method by computer simulations.

I. Introduction

Fuzzy classifiers that extract fuzzy rules from data are classified, according to the shape of the approximated class regions, into fuzzy classifiers with hyperbox regions, fuzzy classifiers with polyhedral regions, and fuzzy classifiers with ellipsoidal regions [1], [2], [3], [4]. Among fuzzy classifiers with ellipsoidal regions, those discussed in [2], [3], [4] are shown to have generalization ability comparable to or better than that of multilayer neural network classifiers when the input variables are continuous. But since we need to calculate the inverse of the covariance matrix determined by the training data, if the number of training data is small, the generalization ability is degraded. When the covariance matrix is singular, usually we calculate the pseudo-inverse [5], [6] and control singular values [4, p. 92]. But the improvement is still not sufficient [4].

Thus, to realize high generalization ability, in this paper we use the symmetric Cholesky factorization in calculating the inverse of the covariance matrix, and when the input of the square root is smaller than a prescribed value we replace the input with the prescribed value. Furthermore, at the initial stage of tuning the slopes of the membership functions, we maximize the slope margins to improve the generalization ability.

In the following, Section II briefly explains the fuzzy classifier with ellipsoidal regions according to [2], [3], Section III discusses the generalization enhancement by the symmetric Cholesky factorization, Section IV discusses how to maximize margins by tuning the membership functions, and Section V evaluates the validity of the proposed method for hiragana data.

II. A Fuzzy Classifier with Ellipsoidal Regions

In this section, we overview the fuzzy classifier with ellipsoidal regions based on [2], [3], [4].

Consider classifying an m-dimensional input vector x into one of n classes using fuzzy rules with ellipsoidal regions. Initially, we assume that each class consists of one cluster and define a fuzzy rule for each class. If the recognition rate is not sufficient, we generate fuzzy rules dynamically. Assuming that several clusters are defined for class i (i = 1, ..., n), we call the jth cluster for class i cluster ij (j = 1, ...). We define the following fuzzy rule for cluster ij:

Rij: If x is cij then x is class i,   (1)

where cij is the center of cluster ij and is calculated by the training data included in cluster ij:

$$c_{ij} = \frac{1}{|X_{ij}|} \sum_{x \in X_{ij}} x, \qquad (2)$$

where Xij is the set of training data included in cluster ij, and |Xij| is the number of data included in Xij. For the center cij, we define the membership function mij(x), which defines the degree to which x belongs to cij:

$$m_{ij}(x) = \exp(-h_{ij}^2(x)), \qquad (3)$$

$$h_{ij}^2(x) = \frac{d_{ij}^2(x)}{\alpha_{ij}}, \qquad (4)$$

$$d_{ij}^2(x) = (x - c_{ij})^t \, Q_{ij}^{-1} (x - c_{ij}), \qquad (5)$$

where hij(x) is a tuned distance, dij(x) is a weighted distance between x and cij, αij is a tuning parameter for cluster ij, and Qij is the m × m covariance matrix for cluster ij. Here, Qij^{-1} denotes the inverse of the covariance matrix Qij and the superscript t denotes the transpose of a matrix. We calculate the covariance matrix Qij using the data belonging to cluster ij as follows:

$$Q_{ij} = \frac{1}{|X_{ij}|} \sum_{x \in X_{ij}} (x - c_{ij})(x - c_{ij})^t. \qquad (6)$$

For an input vector x, we calculate the degrees of membership for all the clusters. If mkl(x) is the maximum, we classify the input vector into class k.

Recognition performance is improved by tuning the fuzzy rules, i.e., by tuning the αij one at a time. When αij is increased, the slope of mij(x) is decreased and the degree of membership is increased. Then, misclassified data may become correctly classified while correctly classified data may become misclassified. Based on this, we calculate the net increase of the correctly classified data. Likewise, by decreasing αij, i.e., increasing the slope, we calculate the net increase of the correctly classified data. Then, allowing new misclassifications, we tune the slope so that the recognition rate is maximized. In this way we tune the fuzzy rules successively until the recognition rate of the training data is no longer improved [4, pp. 121–129]. If the recognition rate is not sufficient even after tuning, we define a new cluster that includes the misclassified data belonging to the same class. For the newly added clusters we calculate the centers and the covariance matrices, and tune the fuzzy rules.
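To make the flow of (1)–(6) and the classification rule concrete, the following is a minimal NumPy sketch of the classifier with one cluster per class and all αij fixed to 1; the class name, method names, and the use of a plain matrix inverse are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

class FuzzyEllipsoidalClassifier:
    """Sketch of the fuzzy classifier with ellipsoidal regions
    (one cluster per class, alpha fixed to 1)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centers_, self.inv_covs_ = [], []
        for c in self.classes_:
            Xc = X[y == c]
            center = Xc.mean(axis=0)                      # center c_ij, eq. (2)
            diff = Xc - center
            cov = diff.T @ diff / len(Xc)                 # covariance Q_ij, eq. (6)
            self.centers_.append(center)
            self.inv_covs_.append(np.linalg.inv(cov))     # fails if Q_ij is singular (Section III)
        self.alphas_ = np.ones(len(self.classes_))        # tuning parameters alpha_ij
        return self

    def memberships(self, x):
        degrees = []
        for center, inv_cov, alpha in zip(self.centers_, self.inv_covs_, self.alphas_):
            d2 = (x - center) @ inv_cov @ (x - center)    # weighted distance, eq. (5)
            h2 = d2 / alpha                               # tuned distance, eq. (4)
            degrees.append(np.exp(-h2))                   # membership degree, eq. (3)
        return np.array(degrees)

    def predict(self, x):
        # classify x into the class whose cluster has the maximum membership degree
        return self.classes_[np.argmax(self.memberships(x))]
```

With all αij equal to 1 and one cluster per class, this reduces to classification by the Mahalanobis distance, which is also the starting point for the margin tuning in Section IV.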

III. Generalization Improvement by the Symmetric Cholesky Factorization

The covariance matrix Qij is guaranteed to be positive semi-definite, and for it to be positive definite, the number of training data belonging to cluster ij needs to be larger than the number of input variables [7]. Namely,

$$|X_{ij}| \ge m + 1. \qquad (7)$$

When the covariance matrix Qij is singular, usually Qij is decomposed into singular values [5]. But this slows down training. Therefore, to speed up training even when Qij is singular, we use the symmetric Cholesky factorization. When Qij is positive definite, each diagonal element is positive and its value is the maximum among the elements in its column. Namely,

$$q_{ii} > |q_{ij}| \quad \text{for } i \ne j, \; i, j = 1, 2, \ldots, m. \qquad (8)$$

Thus, Qij can be decomposed into two triangular matrices by the symmetric Cholesky factorization without pivot exchanges as follows [5]:

$$Q_{ij} = L_{ij} L_{ij}^t, \qquad (9)$$

where Lij is a real-valued regular lower triangular matrix whose elements are given by

$$l_{op} = \frac{q_{op} - \sum_{n=1}^{p-1} l_{pn} l_{on}}{l_{pp}} \quad \text{for } o = 1, \ldots, m, \; p = 1, \ldots, o - 1, \qquad (10)$$

$$l_{aa} = \sqrt{q_{aa} - \sum_{n=1}^{a-1} l_{an}^2} \quad \text{for } a = 1, 2, \ldots, m. \qquad (11)$$

Using Lij, (5) is written as follows:

$$d_{ij}^2(x) = \left(L_{ij}^{-1}(x - c_{ij})\right)^t L_{ij}^{-1}(x - c_{ij}). \qquad (12)$$

Now we define the vector yij (= (yij1, ..., yijm)^t) by

$$y_{ij} = L_{ij}^{-1}(x - c_{ij}). \qquad (13)$$

Then, solving the following equation for yij,

$$L_{ij} y_{ij} = x - c_{ij}, \qquad (14)$$

we calculate d²ij(x) without calculating the inverse of Qij:

$$d_{ij}^2(x) = y_{ij}^t y_{ij}. \qquad (15)$$

When the number of training data is small, the value in the square root of (11) may be non-positive. To avoid this, if

$$q_{aa} - \sum_{n=1}^{a-1} l_{an}^2 \le \eta, \qquad (16)$$

where η (> 0) is a prescribed value, we set

$$l_{aa} = \sqrt{\eta}. \qquad (17)$$

This increases the variances of the variables.
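As a rough illustration of (9)–(17), the sketch below performs the symmetric Cholesky factorization with the diagonal floored at √η and then evaluates d²ij(x) by forward substitution on (14) instead of inverting Qij. The function names, the default value of η, and the use of SciPy's triangular solver are assumptions for illustration only.

```python
import numpy as np
from scipy.linalg import solve_triangular

def cholesky_with_floor(Q, eta=1e-2):
    """Symmetric Cholesky factorization Q = L L^t, eqs. (9)-(11).
    If the value inside the square root of (11) is <= eta, the
    diagonal element is set to sqrt(eta), eqs. (16) and (17)."""
    m = Q.shape[0]
    L = np.zeros_like(Q, dtype=float)
    for a in range(m):
        s = Q[a, a] - L[a, :a] @ L[a, :a]               # argument of the square root in (11)
        L[a, a] = np.sqrt(s) if s > eta else np.sqrt(eta)
        for o in range(a + 1, m):                       # off-diagonal elements, eq. (10)
            L[o, a] = (Q[o, a] - L[o, :a] @ L[a, :a]) / L[a, a]
    return L

def weighted_distance_sq(x, center, L):
    """d^2(x) via forward substitution, eqs. (12)-(15); no explicit inverse of Q."""
    y = solve_triangular(L, x - center, lower=True)     # solve L y = x - c, eq. (14)
    return float(y @ y)                                 # eq. (15)
```

The default η = 10^-2 here simply mirrors the value that performed best in Section V; in general η is a parameter to be chosen.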

IV. Maximizing Margins of αij

A. Concept

In the fuzzy classifier with ellipsoidal regions, if there are overlaps between classes, the overlaps are resolved by tuning the membership functions. But if there is no overlap, the membership functions are not tuned. When the number of training data is small, the overlaps are usually scarce. Thus, the generalization ability is degraded.

To tune the membership functions even when the overlaps are scarce, we use the idea used in training support vector machines. In training a support vector machine for a two-class problem, the separating margin between the two classes is maximized to improve the generalization ability [4, pp. 47–61]. In the fuzzy classifier with ellipsoidal regions, we tune the slopes of the membership functions so that the slope margins are maximized. Initially, we set the values of αij to 1. Namely, the fuzzy classifier with ellipsoidal regions is equivalent to the classifier based on the Mahalanobis distance if each class consists of one cluster. Then, we tune αij so that the slope margins are maximized. Here, we maximize the margins without causing new misclassification. When the recognition rate of the training data is not 100% after tuning, we tune αij so that the recognition rate is maximized, as discussed in [4].

Here we discuss how to maximize the slope margins. Unlike support vector machines, tuning of αij is not restricted to two classes, but for ease of illustration we explain the concept of tuning using two classes with one cluster for each class. In Fig. 1, the filled rectangle and circle show the training data, belonging to classes i and o, that are nearest to classes o and i, respectively. The class boundary of the two classes is somewhere between the two curves shown in the figure. We assume that the generalization ability is maximized when it is in the middle of the two.

Fig. 1. Concept of maximizing margins.

In Fig. 2 (a), if the datum belongs to class i, it is correctly classified since the degree of membership for class i is larger. This datum remains correctly classified until the degree of membership for class i is decreased as shown by the dotted curve. Similarly, in Fig. 2 (b), if the datum belongs to class o, it is correctly classified since the degree of membership for class o is larger. This datum remains correctly classified until the degree of membership for class i is increased as shown by the dotted curve. Thus, for each αij, there is an interval of αij that keeps correctly classified data correctly classified. Therefore, if we change the value of αij so that it is in the middle of the interval, the slope margins are maximized. In the following we discuss how to tune αij.

B. Upper and Lower Bounds of αij

Let X be the set of training data that are correctly classified for the initial αij. Let x (∈ X) belong to class i. If the degree of membership for cluster ij, mij(x), is not the largest, or if mij(x) is the largest but mik(x) (k ≠ j) is the second largest, the change of αij does not cause misclassification of x. Thus we do nothing for these data. If mij(x) is the largest and mik(x) (k ≠ j) is not the second largest, there is a lower bound, Lij(x), of αij that keeps x correctly classified:

$$L_{ij}(x) = \frac{d_{ij}^2(x)}{\min_{o \ne i} h_{op}^2(x)}. \qquad (18)$$

Then the lower bound Lij(1) that does not cause new misclassification is given by

$$L_{ij}(1) = \max_{x \in X} L_{ij}(x). \qquad (19)$$

Similarly, for x (∈ X) belonging to a class other than class i, we can calculate the upper bound of αij. Let moq(x) (o ≠ i) be the largest. Then the upper bound Uij(x) of αij that does not cause misclassification of x is given by

$$U_{ij}(x) = \frac{d_{ij}^2(x)}{\min_{q} h_{oq}^2(x)}. \qquad (20)$$

The upper bound Uij(1) that does not cause new misclassification is given by

$$U_{ij}(1) = \min_{x \in X} U_{ij}(x). \qquad (21)$$

In [4, p. 122], Lij(l) and Uij(l) are defined as the lower and the upper bounds for which l − 1 correctly classified data are misclassified, respectively. Thus, Lij(1) and Uij(1) are special cases of Lij(l) and Uij(l).

Fig. 2. Range of αij that does not cause misclassification: (a) upper bound of αij; (b) lower bound of αij.

C. Tuning Procedure

The correctly classified data remain correctly classified even when αij is set to some value in the interval (Lij(1), Uij(1)). The tuning procedure of αij is as follows. For each αij we calculate Lij(1) and Uij(1) and set αij to the middle point of Lij(1) and Uij(1):

$$\alpha_{ij} = \frac{1}{2}\left(L_{ij}(1) + U_{ij}(1)\right). \qquad (22)$$

We successively tune one αij after another. The tuning results depend on the order in which the αij are tuned, but in the following simulation study we tune from the first cluster of class 1 to the last cluster of class n.
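A minimal sketch of the midpoint update (18)–(22) for a single αij is given below, assuming that the weighted and tuned distances of Section II have already been evaluated for every correctly classified datum; the function name and the two precomputed input lists are hypothetical conveniences, not quantities defined above.

```python
def tune_alpha_ij(class_i_constraints, other_class_constraints, alpha_current=1.0):
    """Set alpha_ij to the middle of the interval (L_ij(1), U_ij(1)), eqs. (18)-(22).

    class_i_constraints    : pairs (d2_ij(x), min over o != i of h2_op(x)) for the
                             correctly classified class-i data whose largest membership
                             is m_ij(x) and whose runner-up belongs to another class
    other_class_constraints: pairs (d2_ij(x), min over q of h2_oq(x)) for the correctly
                             classified data of the other classes o != i
    """
    lowers = [d2 / h2 for d2, h2 in class_i_constraints]      # L_ij(x), eq. (18)
    uppers = [d2 / h2 for d2, h2 in other_class_constraints]  # U_ij(x), eq. (20)
    L1 = max(lowers, default=0.0)                             # L_ij(1), eq. (19)
    U1 = min(uppers, default=float("inf"))                    # U_ij(1), eq. (21)
    if U1 == float("inf"):
        return alpha_current       # nothing bounds alpha_ij from above; leave it unchanged
    return 0.5 * (L1 + U1)         # midpoint update, eq. (22)
```

Data for which mij(x) is not the largest, or for which the runner-up belongs to the same class, impose no constraint and are simply excluded from the two input lists.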

V. Performance Evaluation

Using hiragana data, we evaluate the improvement in recognition rate for the test data obtained by controlling the value of η in the symmetric Cholesky factorization and by maximizing the slope margins. The hiragana data were collected to classify numerals and hiragana characters on Japanese vehicle license plates [4]. The hiragana characters gathered from the license plates were transformed into 5 × 10 (= 50) gray-scale grid data. The number of classes was 39 and the numbers of training and test data were both 4610. For the fuzzy classifier with ellipsoidal regions, we set the parameter δ = 0.1 [2] and the maximum number of allowable misclassifications lM [2] to 10, and we assumed that each class consisted of one cluster. We used a Sun UltraSPARC-IIi workstation (335 MHz). To compare the results with other methods, we used the evaluation results of the singular value decomposition and the support vector machine given in [4].

Table I shows the recognition rates of the test (training) data when the covariance matrices were factorized and when the value of η in (16) was changed. The column “Initial” shows the recognition rates for αij = 1 and the column “Final” shows the recognition rates when αij were tuned to resolve misclassification. The final recognition rate of the test data increased as η was increased and when η = 10−2 , the recognition rate was the highest.

TABLE I
Recognition Rates for Hiragana Data

η        Initial (%)      Final (%)       Time (s)
10^-5    77.51 (100)      —               35
10^-4    81.39 (100)      —               35
10^-3    95.53 (99.98)    95.64 (100)     43
10^-2    98.79 (99.65)    98.85 (100)     101
10^-1    94.43 (96.14)    94.69 (97.57)   310

Table II shows the results when η was changed and the αij were tuned to maximize the margins. The column "Initial" shows the recognition rates when the αij were tuned to maximize the margins but were not tuned to resolve misclassification. Comparing Tables I and II, the maximum recognition rates of the test data were the same, but for η = 10^-5 and 10^-4, the recognition rates obtained by maximizing the slope margins were much better.

TABLE II
Recognition Rates for Hiragana Data by Maximizing Margins

η        Initial (%)      Final (%)       Time (s)
10^-5    90.50 (100)      —               431
10^-4    92.17 (100)      —               458
10^-3    96.25 (100)      —               575
10^-2    98.83 (99.87)    98.85 (100)     488
10^-1    94.64 (96.59)    94.84 (97.51)   996

Table III shows the best performance obtained with the proposed methods, the singular value decomposition (SVD), and the support vector machine (SVM). In the table, CF means the Cholesky factorization and CF + MM means that maximizing the slope margins was combined with the Cholesky factorization. Except for the SVD, there was little difference in recognition rates among the remaining three methods.

TABLE III
Best Performance for Hiragana Data

Method     Rates (%)       Time (s)
CF         98.85 (100)     101
CF + MM    98.85 (100)     488
SVD        94.51 (99.89)   946
SVM        99.07 (100)     7144

VI. Conclusions

To improve the generalization ability of the fuzzy classifier with ellipsoidal regions when the number of training data is small, we proposed controlling the values of the diagonal elements during the symmetric Cholesky factorization of the covariance matrix and maximizing the slope margins. For the hiragana data, we showed that by controlling the values of the diagonal elements during the Cholesky factorization the generalization ability of the classifier was improved, and that by combining this method with maximizing the slope margins, the generalization ability became more robust.

References

[1] S. Abe, Neural Networks and Fuzzy Systems: Theory and Applications, Kluwer Academic Publishers, Boston, 1996.
[2] S. Abe and R. Thawonmas, "A Fuzzy Classifier with Ellipsoidal Regions," IEEE Trans. Fuzzy Systems, vol. 5, no. 3, pp. 358–368, 1998.
[3] S. Abe, "Dynamic Cluster Generation for a Fuzzy Classifier with Ellipsoidal Regions," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 28, no. 6, pp. 869–876, 1998.
[4] S. Abe, Pattern Classification: Neuro-Fuzzy Methods and Their Comparison, Springer-Verlag, 2001.
[5] G. H. Golub and C. F. Van Loan, Matrix Computations, Third Edition, Johns Hopkins University Press, 1996.
[6] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, 1996.
[7] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, pp. 67–68, 1973.
