A Multi-class Classification Algorithm based on Ordinal Regression Machine*

Zhixia Yang
1. College of Science, China Agricultural University
2. College of Mathematics and System Science, Xinjiang University
[email protected]

Naiyang Deng†
College of Science, China Agricultural University
[email protected]

Yingjie Tian
Research Center on Data Technology & Knowledge Economy, Chinese Academy of Sciences
[email protected]

Abstract

Multi-class classification is an important and ongoing research subject in machine learning. In this paper, we propose a multi-class classification algorithm based on ordinal regression, using 3-class classification as the building block. The algorithm is similar to Algorithm K-SVCR and Algorithm ν-K-SVCR, but it involves fewer parameters. Another advantage of our algorithm is that, for the K-class classification problem, it can be extended to use p-class classification with 2 ≤ p ≤ K. Numerical experiments on artificial data sets and benchmark data sets show that the algorithm is reasonable and effective.

1. Introduction

Multi-class classification refers to constructing a decision function f(x) from an input space X = R^n onto an unordered set of classes Y = {Θ1, Θ2, ..., ΘK}, based on an independently and identically distributed training set

   T = {(x1, y1), ..., (xl, yl)} ∈ (X × Y)^l.    (1)

Support vector machines (SVMs) are useful tools for classification. Currently, the solution of binary classification problems using support vector classification (SVC) is well developed. However, the multi-class classification problem is more general, so it is of interest to find effective algorithms for it. At present, there are roughly two types of SVM-based approaches to the multi-class classification problem. One is the "decomposition-reconstruction" architecture, which makes direct use of binary SVC to tackle multi-class classification tasks, for example the "one versus rest" method (see [1][2]), the "one versus one" method (see [3]-[4]) and the "error-correcting output code" method (see [7]-[8]). The other is the "all together" approach (see [9]-[13]), which solves the multi-class classification problem through a single optimization formulation. It should be pointed out that there is a new variant (see [5]-[6]) within the decomposition-reconstruction approach. Instead of binary SVC, it uses 3-class classification, where the decision function of each 3-class classifier is obtained by a combination of SVC and SVR (Support Vector Regression). This new variant is not only more robust but also more efficient.

In this article, following the idea of [5] and [6], we propose a new algorithm. It also uses 3-class classification but, differently from [5] and [6], the decision function of the 3-class classifier is obtained by the ordinal regression machine proposed in [14]. One advantage is that our algorithm involves fewer parameters. In fact, in Algorithm K-SVCR [5] and Algorithm ν-K-SVCR [6], the accuracy parameters δ (cf. (7)-(8) in [5]) and ε (cf. (13)-(18) in [6]) have to be chosen a priori, whereas no corresponding parameter appears in our algorithm. Another advantage is that, for the general K-class classification problem (1), our algorithm is easily extended to a more flexible version: instead of being restricted to 3-class classification, p-class classification with 2 ≤ p ≤ K can be used. Therefore, a large class of algorithms can be constructed.

The rest of this article is organized as follows. In the next section, we introduce ordinal regression. In Section 3, we give our multi-class classification algorithm. Numerical results are shown in Section 4. Section 5 concludes the article.

* This work is supported by the National Natural Science Foundation of China (No. 10371131).
† The corresponding author.


Fig.1 The fixed margin policy of ordinal regression

2. The Ordinal Regression learning machine

Ordinal regression is studied in [14]-[15]. In this paper we need its kernel formulation, which is mentioned there but not given explicitly, so we present it in this section. Consider a 3-class classification problem with a training set

   T = {(x1, y1), ..., (xl, yl)}
     = {(x1, 1), ..., (x_{l1}, 1), (x_{l1+1}, 2), ..., (x_{l1+l2}, 2), (x_{l1+l2+1}, 3), ..., (x_{l1+l2+l3}, 3)} ∈ (X × Y)^l,    (2)

where the inputs xi ∈ X = R^n and the outputs yi ∈ Y = {1, 2, 3}, i = 1, ..., l.

Now let us turn to ordinal regression. Here we only consider the fixed margin strategy. Its basic idea is: for the linearly separable case (see Fig.1), the margin to be maximized is the one associated with the two closest neighboring classes. As in the conventional SVM, this margin equals 2/||w||, so maximizing the margin is achieved by minimizing ||w||. Through a rather tedious derivation, this leads to the following algorithm.

Algorithm 1: Kernel Ordinal Regression for the 3-class problem

1. Given a training set (2);

2. Select C > 0 and a kernel function K(x, x');

3. Define the symmetric matrix

   H = [ H1  H2
         H3  H4 ],    (3)

   where

   H1 = [ (K(xi, xj))_{l1×l1}  (K(xi, xj))_{l1×l2}
          (K(xi, xj))_{l2×l1}  (K(xi, xj))_{l2×l2} ],

   H2 = H3^T = [ -(K(xi, xj))_{l1×l2}  -(K(xi, xj))_{l1×l3}
                 -(K(xi, xj))_{l2×l2}  -(K(xi, xj))_{l2×l3} ],

   H4 = [ (K(xi, xj))_{l2×l2}  (K(xi, xj))_{l2×l3}
          (K(xi, xj))_{l3×l2}  (K(xi, xj))_{l3×l3} ],

   and (K(xi, xj))_{ln×lm} denotes an ln × lm matrix with xi belonging to the n-th class and xj belonging to the m-th class (n, m = 1, 2, 3). Solve the following optimization problem:

   min_μ   (1/2) μ^T H μ − Σ_{i=1}^{N} μi,    (4)
   s.t.    0 ≤ μi ≤ C,  i = 1, ..., N,    (5)
           Σ_{i=1}^{l1} μi = Σ_{i=l1+l2+1}^{l1+2l2} μi,    (6)
           Σ_{i=l1+1}^{l1+l2} μi = Σ_{i=l1+2l2+1}^{N} μi,    (7)

   where N = 2l − l1 − l3, and get its solution μ = (μ1, ..., μN)^T;

4. Construct the function

   g(x) = − Σ_{i=1}^{l1+l2} μi K(xi, x) + Σ_{i=l1+l2+1}^{N} μi K(xi, x)    (8)

   (here, for i = l1+l2+1, ..., N, xi denotes the inputs x_{l1+1}, ..., xl of classes 2 and 3, in accordance with the block structure of H);

5. Solve the following linear programming problem:

   min_{b1, b2, ξ, ξ*}   Σ_{i=1}^{l1+l2} ξi + Σ_{i=1}^{l2+l3} ξi*,    (9)
   s.t.   g(xi) − b1 ≤ −1 + ξi,        i = 1, ..., l1,    (10)
          g(xi) − b2 ≤ −1 + ξi,        i = l1+1, ..., l1+l2,    (11)
          g(xi) − b1 ≥ 1 − ξ*_{i−l1},  i = l1+1, ..., l1+l2,    (12)
          g(xi) − b2 ≥ 1 − ξ*_{i−l1},  i = l1+l2+1, ..., l,    (13)
          ξi ≥ 0,   i = 1, ..., l1+l2,    (14)
          ξi* ≥ 0,  i = 1, ..., l2+l3,    (15)

   where ξ = (ξ1, ..., ξ_{l1+l2})^T and ξ* = (ξ1*, ..., ξ*_{l2+l3})^T, and get its solution (b1, b2, ξ, ξ*)^T;

6. Construct the decision function

   f(x) = 1, if g(x) ≤ b1;
          2, if b1 < g(x) ≤ b2;
          3, if g(x) > b2,    (16)

where g(x) is given by (8).
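To make the steps above concrete, here is a minimal sketch of Algorithm 1 in Python under stated assumptions: the QP (4)-(7) is handed to cvxopt and the threshold LP (9)-(15) to scipy's linprog, and all names (poly_kernel, fit_ordinal_3class, etc.) are ours for illustration, not from the paper.

```python
# A minimal sketch of Algorithm 1 (kernel ordinal regression for 3 classes).
# The helper names are illustrative; the QP (4)-(7) is solved with cvxopt and
# the LP (9)-(15) with scipy.
import numpy as np
from cvxopt import matrix, solvers
from scipy.optimize import linprog

solvers.options['show_progress'] = False          # silence cvxopt output

def poly_kernel(A, B, d=2):
    """Polynomial kernel K(x, x') = ((x . x') + 1)^d, evaluated blockwise."""
    return (A @ B.T + 1.0) ** d

def fit_ordinal_3class(X1, X2, X3, C=1000.0, kernel=poly_kernel):
    """X1, X2, X3: arrays whose rows are the inputs of ranks 1, 2, 3."""
    l1, l2, l3 = len(X1), len(X2), len(X3)
    lo, up = np.vstack([X1, X2]), np.vstack([X2, X3])   # "lower" / "upper" groups
    N = l1 + 2 * l2 + l3                                # N = 2l - l1 - l3
    # Block matrix H of (3): H1 = K(lo, lo), H4 = K(up, up), H2 = H3^T = -K(lo, up).
    H = np.block([[kernel(lo, lo), -kernel(lo, up)],
                  [-kernel(up, lo), kernel(up, up)]])
    # QP (4)-(7): min 1/2 mu^T H mu - sum(mu), 0 <= mu <= C, two equality constraints.
    P = matrix(H + 1e-8 * np.eye(N))                    # small ridge for numerics
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([np.eye(N), -np.eye(N)]))
    h = matrix(np.hstack([C * np.ones(N), np.zeros(N)]))
    A = np.zeros((2, N))
    A[0, :l1] = 1.0
    A[0, l1 + l2:l1 + 2 * l2] = -1.0                    # constraint (6)
    A[1, l1:l1 + l2] = 1.0
    A[1, l1 + 2 * l2:] = -1.0                           # constraint (7)
    sol = solvers.qp(P, q, G, h, matrix(A), matrix(np.zeros(2)))
    mu = np.array(sol['x']).ravel()

    def g(X):                                           # the function of (8)
        return -kernel(X, lo) @ mu[:l1 + l2] + kernel(X, up) @ mu[l1 + l2:]

    # LP (9)-(15) for the thresholds; variables z = [b1, b2, xi (l1+l2), xi* (l2+l3)].
    g1, g2, g3 = g(X1), g(X2), g(X3)
    n_xi, n_xs = l1 + l2, l2 + l3
    nv = 2 + n_xi + n_xs
    c = np.hstack([0.0, 0.0, np.ones(n_xi + n_xs)])
    rows, rhs = [], []
    def add_row(b_idx, b_coef, s_idx, val):             # one inequality row "<= val"
        r = np.zeros(nv); r[b_idx] = b_coef; r[s_idx] = -1.0
        rows.append(r); rhs.append(val)
    for i in range(l1):                                 # (10): g(x_i) - b1 <= -1 + xi_i
        add_row(0, -1.0, 2 + i, -1.0 - g1[i])
    for i in range(l2):                                 # (11): g(x_i) - b2 <= -1 + xi_{l1+i}
        add_row(1, -1.0, 2 + l1 + i, -1.0 - g2[i])
    for i in range(l2):                                 # (12): g(x_i) - b1 >= 1 - xi*_i
        add_row(0, 1.0, 2 + n_xi + i, g2[i] - 1.0)
    for i in range(l3):                                 # (13): g(x_i) - b2 >= 1 - xi*_{l2+i}
        add_row(1, 1.0, 2 + n_xi + l2 + i, g3[i] - 1.0)
    bounds = [(None, None), (None, None)] + [(0, None)] * (n_xi + n_xs)
    lp = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                 bounds=bounds, method='highs')
    b1, b2 = lp.x[0], lp.x[1]

    def predict(X):                                     # decision function (16)
        gx = g(np.atleast_2d(X))
        return np.where(gx <= b1, 1, np.where(gx <= b2, 2, 3))
    return predict
```

Given the three groups of training inputs, a call such as predict = fit_ordinal_3class(X1, X2, X3) returns a function applying the rule (16) to new inputs.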

3. The multi-class classification algorithm based on ordinal regression

In this section, we propose our multi-class classification algorithm based on the ordinal regression machine described in Section 2. The question is how to use 3-class classifiers to solve a multi-class classification problem. There exist many reconstruction schemes. Recall that, when 2-class classifiers are used, there are the "one versus one", "one versus rest" and other architectures. When 3-class classifiers are used, it is obvious that even more architectures are possible. In order to compare our algorithm with the ones in [5] and [6], here we also apply the "one versus one versus rest" architecture.

Let the training set T be given by (1). For an arbitrary pair (Θj, Θk) ∈ (Y × Y) of classes with j < k, Algorithm 1 is applied to separate the two classes Θj and Θk as well as the remaining classes. To do so, we consider the classes Θj, Θk and Y \ {Θj, Θk} as the classes 1, 3 and 2 of ordinal regression respectively, i.e., the class Θj corresponds to the inputs xi (i = 1, ..., l1), the class Θk corresponds to the inputs xi (i = l1+l2+1, ..., l), and the remaining classes correspond to the inputs xi (i = l1+1, ..., l1+l2). Thus, in the K-class problem, for each pair (Θj, Θk) we obtain by Algorithm 1 a classifier fΘj,k(x) that separates these two classes as well as the remaining ones, so K(K−1)/2 classifiers are obtained in total. Hence, for a new input x̄, K(K−1)/2 outputs are obtained. We adopt a voting scheme, i.e., all classifier outputs are given the same weight and the 'winner-takes-all' rule is applied. In other words, x̄ is assigned to the class that gets the most votes.

Now we give the details of our multi-class classification algorithm:

Algorithm 2: The multi-class classification algorithm based on ordinal regression

1. Given the training set (1), construct a set P with K(K−1)/2 elements:

   P = {(Θj, Θk) | (Θj, Θk) ∈ (Y × Y), j < k},    (17)

   where Y = {Θ1, ..., ΘK}. Set m = 1;

2. If m = 1, take the first pair in the set P; otherwise take its next pair. Denote the pair just taken by (Θj, Θk);

3. Transform the training set (1) into the following form:

   T̃ = {(x̃1, 1), ..., (x̃_{l1}, 1), (x̃_{l1+1}, 2), ..., (x̃_{l1+l2}, 2), (x̃_{l1+l2+1}, 3), ..., (x̃_l, 3)},    (18)

   where

   {x̃1, ..., x̃_{l1}} = {xi | yi = Θj},
   {x̃_{l1+1}, ..., x̃_{l1+l2}} = {xi | yi ∈ Y \ {Θj, Θk}},
   {x̃_{l1+l2+1}, ..., x̃_l} = {xi | yi = Θk};

4. Apply Algorithm 1 to the training set (18). Denote the resulting decision function by fΘj,k(x);

5. Vote: for an input x̄, when fΘj,k(x̄) = 1, a positive vote is added on Θj and no votes are added on the other classes; when fΘj,k(x̄) = 3, a positive vote is added on Θk and no votes are added on the other classes; when fΘj,k(x̄) = 2, a negative vote is added on both Θj and Θk and no votes are added on the other classes;

6. If m = K(K−1)/2, go to the next step; otherwise set m = m + 1 and return to step 2;

7. Calculate the total votes of each class by adding the positive and negative votes on that class. The input x̄ is assigned to the class that gets the most votes.
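As a rough illustration, the sketch below wires Algorithm 1 into the "one versus one versus rest" scheme just described. It reuses the hypothetical fit_ordinal_3class from the Section 2 sketch; the vote bookkeeping follows our reading of steps 5-7 (one positive vote for a predicted end class, a negative vote on both Θj and Θk when the middle class is predicted).

```python
# Sketch of Algorithm 2 ("one versus one versus rest" with voting).
# Assumes the hypothetical fit_ordinal_3class from the Section 2 sketch.
from itertools import combinations
import numpy as np

def fit_multiclass(X, y, classes, C, kernel):
    """Train one 3-class ordinal regression machine per pair (Theta_j, Theta_k), j < k."""
    machines = {}
    for tj, tk in combinations(classes, 2):
        X1 = X[y == tj]                        # class Theta_j      -> rank 1
        X2 = X[(y != tj) & (y != tk)]          # remaining classes  -> rank 2 ("rest")
        X3 = X[y == tk]                        # class Theta_k      -> rank 3
        machines[(tj, tk)] = fit_ordinal_3class(X1, X2, X3, C=C, kernel=kernel)
    return machines

def predict_multiclass(machines, classes, x_new):
    """Steps 5-7: collect the votes of all K(K-1)/2 classifiers, winner takes all."""
    votes = {c: 0.0 for c in classes}
    for (tj, tk), predict in machines.items():
        rank = int(predict(x_new)[0])
        if rank == 1:
            votes[tj] += 1.0                   # positive vote for Theta_j
        elif rank == 3:
            votes[tk] += 1.0                   # positive vote for Theta_k
        else:                                  # middle class predicted:
            votes[tj] -= 1.0                   # negative vote on Theta_j
            votes[tk] -= 1.0                   # and on Theta_k
    return max(votes, key=votes.get)
```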

4. Numerical experiments

In this section, we present two types of experiments. All test problems are taken from [5] and [6].

4.1. Example in the plane

According to Algorithm 2, for a K-class classification problem, Algorithm 1 is used K(K−1)/2 times and K(K−1)/2 classifiers are constructed. In order to examine the validity of these classifiers, we study an artificial example in the plane, which can be visualized. Note that this validity is not clear at first glance. For example, consider a 3-class classification problem. The "one versus one versus rest" architecture implies that there are three training sets:

   T1−2−3 = {Class 1 (one), Class 3 (rest), Class 2 (one)},    (19)
   T1−3−2 = {Class 1 (one), Class 2 (rest), Class 3 (one)},    (20)
   T2−3−1 = {Class 2 (one), Class 1 (rest), Class 3 (one)},    (21)


where Classes 3, 2 and 1, respectively, are in the "middle" for ordinal regression. Each of them is expected to be separated nicely by two parallel hyperplanes in feature space. However, consider the simplest case, in which the input space is one-dimensional (see Fig.2).

Fig.2 The one-dimensional case. Class 1: ‘◦’; Class 2: ‘’; Class 3: ‘♦’

It is easy to see that it is impossible to separate all three training sets properly if the kernel in Algorithm 1 is linear. It can be imagined that a similar difficulty also arises in other low-dimensional cases. But, as pointed out in [5], these separations become practical when a nonlinear kernel is introduced. There, the parallel separating hyperplanes in feature space are constructed by a combination of SVC and SVR with a parameter δ, and the success of this construction is shown by numerical experiments when δ is chosen carefully. In the following, we show that the same separation is also achieved by Algorithm 1 (kernel ordinal regression). Note that there is no parameter corresponding to δ in our algorithm.

In our experiment, the training set T is generated from a Gaussian distribution on R², with 150 training points equally distributed among the three classes; the classes are marked by the following signs: Class 1 with ‘+’, Class 2 with ‘’ and Class 3 with ‘♦’. The parameter C and the kernel K(·,·) in Algorithm 1 are selected as follows: C = 1000 and the polynomial kernel K(x, x') = ((x · x') + 1)^d with degree d = 2, 3, 4. For each group of parameters we obtain three decision functions corresponding to the training sets T1−2−3, T1−3−2 and T2−3−1 in (19)-(21); these training sets are marked 1–2–3, 1–3–2 and 2–3–1. The numerical results are shown in Fig.3. From Fig.3, it can be observed that the behaviour of the kernel ordinal regression algorithm is reasonable: the training data are separated explicitly in an arbitrary order of classes, and we obtain valid classifiers for different kernel parameters. The results are also similar to those in [5] and [6] when their parameters δ and ε are chosen properly; note that we do not need to make such a choice.
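For orientation, a possible way to generate such an artificial data set and to form one of the orderings (19)-(21) is sketched below; the class centres, unit covariance and seed are our own choice, since the paper only states that the 150 points are Gaussian with 50 points per class.

```python
# Sketch of the planar example in 4.1; the class centres, unit covariance and
# seed are assumptions (the paper only states 150 Gaussian points, 50 per class).
import numpy as np

rng = np.random.default_rng(0)
centres = [(-2.0, 0.0), (0.0, 2.0), (2.0, 0.0)]           # assumed class centres
X = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centres])
y = np.repeat([1, 2, 3], 50)                               # 150 points, 3 equal classes

# The training set T_{1-3-2} of (20): class 1 as rank 1, class 2 in the middle
# rank, class 3 as rank 3; it can then be passed to the Algorithm 1 sketch,
# e.g. with the degree-2 polynomial kernel used in the paper:
X1, X2, X3 = X[y == 1], X[y == 2], X[y == 3]
# clf_132 = fit_ordinal_3class(X1, X2, X3, C=1000.0,
#                              kernel=lambda A, B: (A @ B.T + 1.0) ** 2)
```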

4.2. Experiments on Benchmark Data Sets

In this subsection, we test our multi-class classification algorithm on a collection of three benchmark problems, 'Iris', 'Wine' and 'Glass', from the UCI machine learning repository [16]. Each data set is first split randomly into ten subsets. Then one of these subsets is reserved as a test set and the others are combined into the training set; this process is repeated ten times. Finally, we compute the average error over the ten runs, where the error of each run is the percentage of examples assigned to the wrong classes.

Fig.3 Different kernel parameters d on the artificial data set with C = 1000. (a) 1–3–2, d = 2; (b) 1–2–3, d = 2; (c) 2–3–1, d = 2; (d) 1–3–2, d = 3; (e) 1–2–3, d = 3; (f) 2–3–1, d = 3; (g) 1–3–2, d = 4; (h) 1–2–3, d = 4; (i) 2–3–1, d = 4.

For the data sets 'Iris' and 'Wine', the parameter C = 10 and the polynomial kernels K(x, x') = ((x · x') + 1)^d with degree d = 2 and d = 3, respectively, are employed in our algorithm. For the data set 'Glass', the parameter C = 1000 and the Gaussian kernel K(x, x') = exp(−||x − x'||²/(2σ²)) with σ = 2 are employed. The average errors are listed in Table 1. In order to compare our algorithm with Algorithm K-SVCR in [5], Algorithm ν-K-SVCR in [6], Algorithm '1-v-r' in [1,2] and Algorithm '1-v-1' in [3,4], their corresponding errors are also listed in Table 1. The table also displays the number of training data, attributes and classes for each data set. It can be observed that the performance of our algorithm is generally comparable to that of the other algorithms; in particular, on the 'Glass' set our algorithm outperforms all of the others.
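The evaluation protocol of this subsection (ten random subsets, each held out once, average error over the ten runs) might be organized as follows; the Gaussian kernel matches the formula above, while the data loading (X_glass, y_glass) and the fit_multiclass / predict_multiclass helpers are assumptions carried over from the Section 3 sketch.

```python
# Sketch of the evaluation protocol in 4.2; the loader and the multi-class
# helpers from the Section 3 sketch are assumptions.
import numpy as np

def gaussian_kernel(A, B, sigma=2.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), evaluated blockwise."""
    sq = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-sq / (2.0 * sigma ** 2))

def average_error(X, y, C, kernel, n_splits=10, seed=0):
    """Split into ten random subsets, hold each out once, return the mean error (%)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    classes = np.unique(y)
    errors = []
    for k in range(n_splits):
        test = folds[k]
        train = np.hstack(folds[:k] + folds[k + 1:])
        machines = fit_multiclass(X[train], y[train], classes, C=C, kernel=kernel)
        preds = np.array([predict_multiclass(machines, classes, x) for x in X[test]])
        errors.append(100.0 * np.mean(preds != y[test]))
    return np.mean(errors)

# e.g. for 'Glass' with the parameters reported above:
# err = average_error(X_glass, y_glass, C=1000.0, kernel=gaussian_kernel)
```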

5. Conclusion

In this paper, we have proposed a new algorithm for multi-class classification based on the ordinal regression machine. Our algorithm is closely related to Algorithm K-SVCR [5] and Algorithm ν-K-SVCR [6]. Comparing our algorithm with these, we draw the following conclusions: (a) owing to their similar structure, all of them share the same nice properties, e.g. good robustness; (b) in terms of numerical performance, all of them are comparable, as shown by our numerical experiments;


Table 1. The percentage of error

           #pts   #atr   #class    1-v-r   1-v-1   K-SVCR   ν-K-SVCR   Our algorithm
                                    err%    err%     err%       err%            err%
  Iris      150      4        3     1.33    1.33     1.93       1.33            2
  Wine      178     13        3     5.6     5.6      2.29       3.3             2.81
  Glass     214      9        6    35.2    36.4     30.47      32.47           26.17

#pts: the number of training data, #atr: the number of example attributes, #class: the number of classes, err%: the percentage of error, 1-v-r: "one versus rest", 1-v-1: "one versus one"

(c) our algorithm involves fewer parameters and is therefore easier to implement; (d) for a K-class classification problem (1), many variant algorithms can be obtained by extending our algorithm, for example by using p-class classification (2 ≤ p ≤ K) instead of 3-class classification. These algorithms are under investigation.

References

[1] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Müller, E. Säckinger, P. Simard and V. Vapnik (1994). Comparison of classifier methods: a case study in handwriting digit recognition. In: IAPR (Ed.), Proceedings of the International Conference on Pattern Recognition, pp. 77–83. IEEE Computer Society Press.
[2] V. Vapnik (1998). Statistical Learning Theory. John Wiley & Sons.
[3] T. J. Hastie and R. J. Tibshirani (1998). Classification by pairwise coupling. In: M. I. Jordan, M. J. Kearns and S. A. Solla (Eds.), Advances in Neural Information Processing Systems, 10, 507–513. MIT Press, Cambridge, MA.
[4] U. Kressel (1999). Pairwise classification and support vector machines. In: B. Schölkopf, C. J. C. Burges and A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, pp. 255–268. MIT Press, Cambridge, MA.
[5] C. Angulo, X. Parra and A. Català (2003). K-SVCR. A support vector machine for multi-class classification. Neurocomputing, 55, 57–77.
[6] P. Zhong and M. Fukushima (2004). A new multi-class support vector algorithm. Optimization Methods and Software, to appear.
[7] E. L. Allwein, R. E. Schapire and Y. Singer (2001). Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
[8] T. G. Dietterich and G. Bakiri (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
[9] K. P. Bennett (1999). Combining support vector and mathematical programming methods for classification. In: B. Schölkopf, C. J. C. Burges and A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, pp. 307–326. MIT Press, Cambridge, MA.
[10] K. Crammer and Y. Singer (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
[11] Y. Lee, Y. Lin and G. Wahba (2001). Multicategory support vector machines. Computing Science and Statistics, 33, 498–512.
[12] J. Weston and C. Watkins (1998). Multi-class support vector machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London, Egham, UK.
[13] T. Graepel et al. (1999). Classification on proximity data with LP-machines. In: Ninth International Conference on Artificial Neural Networks, IEEE, London: Conference Publications 470, pp. 304–309.
[14] A. Shashua and A. Levin (2002). Taxonomy of large margin principle algorithms for ordinal regression problems. Technical Report 2002-39, Leibniz Center for Research, School of Computer Science and Engineering, The Hebrew University of Jerusalem.
[15] R. Herbrich, T. Graepel and K. Obermayer (2000). Large margin rank boundaries for ordinal regression. In: Advances in Large Margin Classifiers, pp. 115–132.
[16] C. L. Blake and C. J. Merz (1998). UCI Repository of Machine Learning Databases, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html
