International Journal of Innovative Computing, Information and Control
Volume 6, Number 6, June 2010
© 2010 ICIC International, ISSN 1349-4198, pp. 1-18

MULTI-KERNEL SUPPORT VECTOR CLUSTERING FOR MULTI-CLASS CLASSIFICATION

Chi-Yuan Yeh, Chi-Wei Huang and Shie-Jue Lee
Department of Electrical Engineering
National Sun Yat-Sen University
Kaohsiung 804, Taiwan
Corresponding author: Shie-Jue Lee ([email protected])
Received November 2008; revised March 2009

Abstract. Applying support vector clustering (SVC) to multi-class classification problems has difficulty in determining the hyperparameters of the kernel functions. Multi-kernel learning has been proposed to overcome this difficulty, by which kernel matrix weights and Lagrange multipliers can be derived simultaneously through semidefinite programming. However, the amount of time and space required is very demanding. We develop a two-stage multi-kernel learning algorithm which conducts sequential minimal optimization and gradient projection iteratively. One multi-kernel SVC is constructed for the patterns of each class. The outputs obtained by all the multi-kernel SVCs are integrated, and a discriminant function is applied to make the final multi-class decision. Experimental results on data sets taken from UCI and Statlog show that the proposed approach performs better than other methods.

Keywords: Multi-class classification, Support vector clustering, Multi-kernel learning, SMO, Gradient projection
1. Introduction. Support vector machines have been shown to be an effective tool for solving classification problems [1-7], regression problems [8-11], and clustering problems [12-14]. Support vector machines [1] were originally designed for binary classification. However, real-world applications usually involve multiple classes, and extending binary support vector machines to multi-class classification effectively is still an ongoing research issue. Several solutions have been proposed for multi-class classification problems [1, 15-17]. Some decompose a multi-class problem into a set of two-class sub-problems and then use a discriminant function to make the final decision. These include methods such as One-Versus-All (OVA) [15], One-Against-One (OAO) [1, 16], and Directed Acyclic Graph (DAG) [17]. Other methods have also been proposed, e.g., support vector clustering (SVC) [12], support vector data description (SVDD) [13], and one-class support vector machine (OCSVM) [14], for multi-class problems [18, 19]. In these methods, one SVC (SVDD, OCSVM) is derived from the training patterns of each class. The size of the kernel matrix and the training time are smaller than those of OVA, OAO, and DAG. However, discriminant functions for making decisions based on the competing SVC (SVDD, OCSVM) classifiers are required.

Selecting proper kernel functions and hyperparameters is an important issue in kernel-based applications [20-24]. Unsuitable kernel functions or hyperparameters can lead to relatively poor performance. This problem is usually handled by a trial-and-error approach. Furthermore, a typical SVC (SVDD, OCSVM) application usually uses the same hyperparameter settings for every class. This may not be a good idea when pattern distributions differ significantly among classes. For example, one class may contain patterns with a dense distribution, where a kernel with a small variance
is appropriate, while another class may contain sparsely distributed patterns that are better handled by a kernel with a large variance. In this study, multiple kernels are applied, and a linear combination of kernel matrices is adopted to address this problem of hyperparameter selection. The simplest way of combining multiple kernels is to average them. However, giving every kernel the same weight may not serve the decision process well. Intuitively, each kernel may deserve a different weight for different classes in different applications. Determining optimal weights for the participating kernels therefore becomes a major issue in combining multiple kernels.

Lanckriet et al. [25, 26] proposed an approach to finding optimal weights in multi-kernel combination. They transformed the optimization problem into a semidefinite programming (SDP) problem, in which a global optimum exists, and the SDP can then be solved with the interior-point method. However, they focused on the transduction setting, where a kernel matrix is created by using both the training patterns and the testing patterns. Therefore, the amount of time and space required is quite demanding and the size of the problem that can be solved is limited. Crammer et al. [27] and Bennett et al. [28] used boosting methods to combine heterogeneous kernel matrices. Ong et al. [29] introduced the method of hyperkernels with SDP, while Tsang and Kwok [30] reformulated the problem as a second-order cone programming problem, which can be solved more efficiently than SDP. Other multi-kernel learning algorithms have also been proposed, including those of Sonnenburg et al. [31] and Rakotomamonjy et al. [32]. These approaches solve the problem by iteratively using the SMO algorithm [33] to update Lagrange multipliers and kernel weights in turn. However, they are likely to suffer from local minimum traps.

In this paper, we propose a classifier which integrates multi-kernel learning and support vector clustering (SVC) to solve the multi-class classification problem. For a training data set of k classes, we build k multi-kernel SVC machines. A two-stage multi-kernel learning algorithm is developed to optimally combine the kernel matrices for each SVC machine. This learning algorithm applies sequential minimal optimization (SMO) [33] and gradient projection iteratively to obtain Lagrange multipliers and optimal kernel weights. A discriminant function is then applied to make the classification decision based on the outputs of all the SVC machines. Experimental results, obtained on datasets generated synthetically and taken from the UCI Repository of machine learning databases [34] and the Statlog collection [35], show that our method performs better than other methods.

The rest of this paper is organized as follows. Section 2 presents basic concepts about support vector clustering and discriminant functions for multi-class classification. Section 3 describes the algorithm of multi-kernel support vector clustering for multi-class classification and the two-stage optimization approach for multi-kernel learning. A brief example is given in Section 4 to illustrate how our proposed approach works. Experimental results are presented in Section 5. Finally, a conclusion is given in Section 6.

2. Background. To construct a classifier based on a set of training patterns, one can train a support vector clustering (SVC) machine for each class and then apply a discriminant function to make the final classification decision by integrating the outputs of all the SVC machines.
Basic concepts about SVC and discriminant functions are briefly described below. 2.1. Support vector clustering (SVC). SVC is a kernel method which computes the smallest sphere in the feature space enclosing the image of the input data. To make the method more robust, the distance from an image to the center does not need to be strictly equal to or smaller than the radius of the sphere. Instead, excessive distances are penalized [12]. Slack variables ξi are introduced to account for such excessive distances.
The objective function and constraints for SVC are

\min_{R,\,a,\,\xi} \; R^2 + C \sum_{i=1}^{l} \xi_i
\quad \text{s.t.} \quad \|\phi(x_i) - a\|^2 \le R^2 + \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, 2, \dots, l \qquad (1)
where R is the radius of the enclosing sphere, l is the number of training patterns, and C is a parameter that gives a tradeoff between the radius of the sphere and the excessive distances to the center. Note that φ : X → F is a possibly nonlinear mapping from the input space to a feature space F, and a is the center of the sphere in the feature space. To solve Eq. (1), the Lagrangian is introduced. By taking partial derivatives with respect to the primal variables and setting the resulting derivatives to zero, Eq. (1) can be converted to the following Wolfe dual form:

\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i K(x_i, x_i) - \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1, 2, \dots, l, \qquad \sum_{i=1}^{l} \alpha_i = 1 \qquad (2)
where α_i ≥ 0, i = 1, 2, ..., l, are Lagrange multipliers, and K(x_i, x_j) is a kernel function calculating the inner product between the images of the two input vectors x_i and x_j in the feature space, i.e., K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. Throughout this paper, the RBF (radial basis function) kernel is adopted. That is,

K(x_i, x_j) = \exp\left( -\gamma \|x_i - x_j\|^2 \right) \qquad (3)

where γ is the width parameter of the RBF kernel. Now, Eq. (2) can be solved by SMO [33]. For a data point x_i, if the associated ξ_i is greater than 0, φ(x_i) lies outside the sphere and x_i is called a bounded support vector (BSV). If ξ_i = 0 and 0 < α_i ≤ C, φ(x_i) lies on the surface of the sphere and x_i is referred to as a support vector (SV). If ξ_i = 0 and α_i = 0, φ(x_i) lies inside the sphere.

2.2. Discriminant functions. To determine which class a testing pattern x_p belongs to, discriminant functions are applied. Sachs et al. [18] proposed two discriminant functions for this purpose. The first one, called Nearest-Center (NC), assigns a testing pattern x_p to the class m with center a_m that is closest to x_p. The discriminant function is defined as follows:

f(x_p, a) = \arg\min_{m=1,\dots,k} \|\phi_m(x_p) - a_m\|^2
          = \arg\min_{m=1,\dots,k} \Big( K_m(x_p, x_p) - 2 \sum_{x_{j,m} \in C_m} \alpha_{j,m} K_m(x_p, x_{j,m}) + \sum_{x_{i,m} \in C_m} \sum_{x_{j,m} \in C_m} \alpha_{i,m} \alpha_{j,m} K_m(x_{i,m}, x_{j,m}) \Big) \qquad (4)

where a = {a_1, ..., a_k}, K_m(x_i, x_j) = ⟨φ_m(x_i), φ_m(x_j)⟩, the α_{i,m}'s are the Lagrange multipliers of the SVC for class m, C_m is the set of training patterns belonging to class m, and x_{i,m} is the ith training pattern in C_m.
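To make the NC rule of Eq. (4) concrete, the following sketch (my own illustration, not code from the paper) evaluates the kernel expansion of the squared distance to each class center and returns the label of the nearest one. It assumes a single RBF kernel per class, already-trained Lagrange multipliers alpha_m, and hypothetical argument names.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel of Eq. (3): entries exp(-gamma * ||a_i - b_j||^2)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def nc_distance(x_p, X_m, alpha_m, gamma_m):
    """Kernel expansion of ||phi_m(x_p) - a_m||^2 in Eq. (4) for one class m."""
    xp = x_p[None, :]
    k_pp = rbf_kernel(xp, xp, gamma_m)[0, 0]     # K_m(x_p, x_p)
    k_pj = rbf_kernel(xp, X_m, gamma_m)[0]       # K_m(x_p, x_{j,m})
    K_mm = rbf_kernel(X_m, X_m, gamma_m)         # K_m(x_{i,m}, x_{j,m})
    return k_pp - 2.0 * alpha_m @ k_pj + alpha_m @ K_mm @ alpha_m

def nearest_center(x_p, classes):
    """NC rule: classes is a list of (X_m, alpha_m, gamma_m), one entry per class."""
    return 1 + int(np.argmin([nc_distance(x_p, X, a, g) for X, a, g in classes]))
```

The NSV rule of Eq. (5) below can be written analogously, taking the minimum of the pairwise kernel distances to the support vectors of each class instead of the α-weighted distance to the center.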
The second discriminant function, called Nearest-Support-Vector (NSV), assigns a testing pattern x_p to the class m that has the shortest distance between x_p and its support vector set SV_m. The discriminant function is defined as follows:

f(x_p, SV) = \arg\min_{m=1,\dots,k} \min_{x_{j,m} \in SV_m} \|\phi_m(x_p) - \phi_m(x_{j,m})\|^2
           = \arg\min_{m=1,\dots,k} \min_{x_{j,m} \in SV_m} \big( K_m(x_p, x_p) - 2 K_m(x_p, x_{j,m}) + K_m(x_{j,m}, x_{j,m}) \big) \qquad (5)
where SV = {SV_1, ..., SV_k}, and SV_m contains the support vectors of the SVC for class m.

3. Proposed Approach. Suppose we are given a set of t training patterns (x_1, y_1), (x_2, y_2), ..., (x_t, y_t), where x_i ∈ R^n and y_i ∈ {1, ..., k} are the input vector and desired class, respectively, of pattern i, i = 1, ..., t. To construct a classifier for distinguishing the patterns of one class from the patterns of the other classes, we first separate the training patterns into k groups C_1, C_2, ..., C_k, where C_i contains the patterns of class i. For each group C_i, we construct a multi-kernel SVC machine SVC_i which, after training, outputs three sets of values: a set of Lagrange multipliers, the center of the enclosing sphere, and the set of support vectors, as shown in Figure 1. In this figure, solid lines indicate
the processing flow of training, while dotted lines indicate the processing flow of testing. In testing, a discriminant function is applied to make the final classification decision by integrating the outputs obtained from all the SVC machines.

Figure 1. Architecture of the proposed approach (each SVC_i outputs its Lagrange multipliers α_i, sphere center a_i, and support vector set SV_i; the discriminant function combines them to produce the predicted class y_p).

3.1. Multi-kernel SVC. Early SVM-based methods, as described in Section 2.1, used a single kernel function to calculate the inner product between two images in the feature space F. An entry in the resulting kernel matrix measures the similarity of any two patterns in the feature space. If a dataset has varying local distributions, multiple kernels
may be applied to cope with this varying distribution of patterns. A simple direct-sum fusion can be defined as K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩, where Φ is a new feature mapping defined as Φ(x) = [φ_1(x), ..., φ_M(x)]^T. The kernel matrix can then easily be written as K = K_1 + ... + K_M, with K_s(x_i, x_j) = ⟨φ_s(x_i), φ_s(x_j)⟩, s = 1, 2, ..., M. We can generalize this simple fusion to a weighted combination of kernel matrices as follows:
K = \sum_{s=1}^{M} \mu_s K_s \qquad (6)
where M is the total number of kernel matrices, μ_s ≥ 0, and \sum_{s=1}^{M} μ_s = 1. A simple calculation shows that a non-negative linear combination of kernels yields another valid kernel, i.e., a positive semi-definite matrix [36]. Setting the sum of the weights equal to 1 controls the size of the search space to avoid overfitting. Naturally, we would like to optimize these kernel weights μ_s, s = 1, 2, ..., M, for this combination. As mentioned, one multi-kernel SVC is constructed for the training patterns of each class. Let C_m be the set of training patterns of class m. We would like to construct SVC_m. Referring to Eq. (1), the objective function and constraints for SVC_m become
\min_{R_m,\,a_m,\,\xi} \; R_m^2 + C \sum_{i=1}^{|C_m|} \xi_{i,m}
\quad \text{s.t.} \quad \|\Phi_m(x_{i,m}) - a_m\|^2 \le R_m^2 + \xi_{i,m}, \;\; \xi_{i,m} \ge 0, \;\; i = 1, 2, \dots, |C_m| \qquad (7)
where x_{i,m} is the ith training pattern in C_m, |C_m| is the number of training patterns in C_m, and the feature mapping function is Φ_m = [\sqrt{μ_{1,m}}\, φ_{1,m}, ..., \sqrt{μ_{M,m}}\, φ_{M,m}]^T. To solve the constrained optimization problem, the Lagrangian is introduced as follows:
L(R_m, a_m, \xi_m, \alpha_m, \beta_m) = R_m^2 + C \sum_{i=1}^{|C_m|} \xi_{i,m} - \sum_{i=1}^{|C_m|} \beta_{i,m} \xi_{i,m} - \sum_{i=1}^{|C_m|} \alpha_{i,m} \left[ R_m^2 + \xi_{i,m} - \|\Phi_m(x_{i,m}) - a_m\|^2 \right] \qquad (8)
where αi,m ≥ 0 and βi,m ≥ 0 are the Lagrange multipliers. L has to be minimized with respect to Rm , am , and ξm , given αm and β m , and then maximized with respect to αm and β m , given Rm , am , and ξm .
By taking the partial derivatives of L and setting them to 0, we have

\frac{\partial L}{\partial R_m} = 2 R_m - 2 R_m \sum_{i=1}^{|C_m|} \alpha_{i,m} = 0 \;\Rightarrow\; \sum_{i=1}^{|C_m|} \alpha_{i,m} = 1 \qquad (9)

\frac{\partial L}{\partial \xi_{i,m}} = C - \alpha_{i,m} - \beta_{i,m} = 0 \;\Rightarrow\; \alpha_{i,m} + \beta_{i,m} = C \qquad (10)

\frac{\partial L}{\partial a_m} = -2 \sum_{i=1}^{|C_m|} \alpha_{i,m} \left( \Phi_m(x_{i,m}) - a_m \right) = 0 \;\Rightarrow\; a_m = \sum_{i=1}^{|C_m|} \alpha_{i,m} \Phi_m(x_{i,m}) \qquad (11)
From Eqs. (9), (10), and (11), we can convert Eq. (8) to the following Wolfe dual form:

\min_{\mu_m} \max_{\alpha_m} \; \sum_{i=1}^{|C_m|} \alpha_{i,m} \langle \Phi_m(x_{i,m}), \Phi_m(x_{i,m}) \rangle - \sum_{i=1}^{|C_m|} \sum_{j=1}^{|C_m|} \alpha_{i,m} \alpha_{j,m} \langle \Phi_m(x_{i,m}), \Phi_m(x_{j,m}) \rangle
\quad \text{s.t.} \quad 0 \le \alpha_{i,m} \le C, \; i = 1, \dots, |C_m|, \qquad \sum_{i=1}^{|C_m|} \alpha_{i,m} = 1. \qquad (12)
By Eq. (6), Eq. (12) can be formulated as follows:

\min_{\mu_m} \max_{\alpha_m} \; \sum_{i=1}^{|C_m|} \alpha_{i,m} K_m(x_{i,m}, x_{i,m}) - \sum_{i=1}^{|C_m|} \sum_{j=1}^{|C_m|} \alpha_{i,m} \alpha_{j,m} K_m(x_{i,m}, x_{j,m})
\quad \text{s.t.} \quad 0 \le \alpha_{i,m} \le C, \; i = 1, \dots, |C_m|, \qquad \sum_{i=1}^{|C_m|} \alpha_{i,m} = 1,
\qquad \mu_{s,m} \ge 0, \; s = 1, \dots, M, \qquad \sum_{s=1}^{M} \mu_{s,m} = 1 \qquad (13)

where K_m(x_i, x_j) = \sum_{s=1}^{M} μ_{s,m} K_{s,m}(x_i, x_j) and K_{s,m}(x_i, x_j) = ⟨φ_{s,m}(x_i), φ_{s,m}(x_j)⟩.
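For illustration, the weighted combination K_m = Σ_s μ_{s,m} K_{s,m} of Eqs. (6) and (13) can be formed as in the sketch below. This is not code from the paper; the RBF widths, weights, and data are placeholder assumptions. The eigenvalue check at the end reflects the remark above that a non-negative combination of kernels is again positive semi-definite [36].

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K_s with entries exp(-gamma * ||x_i - x_j||^2), as in Eq. (3)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def combined_kernel(X, gammas, mu):
    """K_m = sum_s mu_{s,m} * K_{s,m} with mu on the simplex, Eq. (6)/(13)."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0.0) and np.isclose(mu.sum(), 1.0)
    return sum(w * rbf_kernel_matrix(X, g) for w, g in zip(mu, gammas))

rng = np.random.default_rng(0)
X_m = rng.normal(size=(20, 2))        # hypothetical training patterns of one class
gammas = [1.0, 3.0, 5.0]              # e.g., the three RBF widths used in Experiment I
mu_m = [1/3, 1/3, 1/3]                # equal initial weights
K_m = combined_kernel(X_m, gammas, mu_m)
print(np.linalg.eigvalsh(K_m).min())  # non-negative up to rounding: K_m stays PSD
```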
3.2. Two-stage multi-kernel learning. To build our classifier shown in Figure 1, the Lagrange multipliers and kernel weights in Eq.(13) have to be solved. The center of the enclosing sphere can then be derived by Eq.(11). We develop a two-stage optimization algorithm, shown in Figure 2, for this purpose. In the first stage, the weight vector µm is
kept fixed, and Eq. (13) can be expressed as follows:

\max_{\alpha_m} \; \sum_{i=1}^{|C_m|} \alpha_{i,m} K_m(x_{i,m}, x_{i,m}) - \sum_{i=1}^{|C_m|} \sum_{j=1}^{|C_m|} \alpha_{i,m} \alpha_{j,m} K_m(x_{i,m}, x_{j,m})
\quad \text{s.t.} \quad 0 \le \alpha_{i,m} \le C, \; i = 1, \dots, |C_m|, \qquad \sum_{i=1}^{|C_m|} \alpha_{i,m} = 1. \qquad (14)

Figure 2. Two-stage multi-kernel learning algorithm (initialize the weights μ_{s,m}^0, s = 1, ..., M, with random values; first stage: train the SVC by SMO with K_m^t = Σ_{s=1}^{M} μ_{s,m}^t K_{s,m}; second stage: update μ_m^{t+1} by the gradient projection method; repeat until the stopping criterion is met, then output a_m, SV_m, and α_m).
The Lagrange multipliers α_{i,m}'s are solved by the SMO algorithm. In the second stage, the Lagrange multipliers are kept fixed, and the weight vector μ_m is updated by the gradient projection method. SMO is a standard algorithm for solving the Wolfe dual form; a detailed description of it can be found in [33]. In the following, we describe how gradient projection is applied to obtain the optimal μ_m. When the Lagrange multipliers are kept fixed, Eq. (13) can be expressed as follows:

\min_{\mu_m} \; J(\mu_m)
\quad \text{s.t.} \quad \mu_{s,m} \ge 0, \; s = 1, \dots, M, \qquad \sum_{s=1}^{M} \mu_{s,m} = 1 \qquad (15)

where J(\mu_m) = \sum_{i=1}^{|C_m|} \alpha_{i,m} K_m(x_{i,m}, x_{i,m}) - \sum_{i=1}^{|C_m|} \sum_{j=1}^{|C_m|} \alpha_{i,m} \alpha_{j,m} K_m(x_{i,m}, x_{j,m}). Note that, with the Lagrange multipliers fixed, J(μ_m) depends only on μ_m. By gradient projection [37], we have

\mu_m^{k+1} = \mu_m^k + \eta^k \left( \bar{\mu}_m^k - \mu_m^k \right) \qquad (16)
where μ_m^k is the weight vector at the kth iteration, 0 < η^k ≤ 1 is the step size, and \bar{μ}_m^k is a feasible descent direction. Let z_m = μ_m^k − s^k ∇J(μ_m^k). Then \bar{μ}_m^k can be defined as

\bar{\mu}_m^k = \begin{cases} z_m & \text{if } z_m \text{ belongs to the feasible region,} \\ [z_m]^+ & \text{otherwise} \end{cases} \qquad (17)

where [·]^+ denotes the projection onto the feasible region, s^k is a positive scalar, and ∇J(μ_m^k) is the gradient below, whose sth component is

\frac{\partial J}{\partial \mu_{s,m}^k} = \sum_{i=1}^{|C_m|} \alpha_{i,m} K_{s,m}(x_{i,m}, x_{i,m}) - \sum_{i=1}^{|C_m|} \sum_{j=1}^{|C_m|} \alpha_{i,m} \alpha_{j,m} K_{s,m}(x_{i,m}, x_{j,m}). \qquad (18)

When \bar{μ}_m^k belongs to the feasible region, the iteration reduces to an unconstrained steepest-descent step. The calculation of [z_m]^+ can be achieved by

\min_{[z_m]^+} \; \|z_m - [z_m]^+\|^2 \quad \text{s.t.} \quad e^T [z_m]^+ = 1 \qquad (19)

which can be expressed in the following quadratic programming form:

\min_{[z_m]^+} \; \frac{1}{2} ([z_m]^+)^T H [z_m]^+ - z_m^T [z_m]^+ \quad \text{s.t.} \quad e^T [z_m]^+ = 1 \qquad (20)

where H is an identity matrix of rank M and e is an M-vector of ones. The step size η^k is determined by using the Armijo rule along the feasible direction. Here, by choosing β and σ subject to 0 < β < 1 and 0 < σ < 1, we set η^k = β^{p_k}, where p_k is the first nonnegative integer p for which

J(\mu_m^k) - J(\mu_m^k + \beta^p (\bar{\mu}_m^k - \mu_m^k)) \ge -\sigma \beta^p \nabla J(\mu_m^k)^T (\bar{\mu}_m^k - \mu_m^k). \qquad (21)

Besides, the stopping criterion is that \|\mu_m^{k+1} - \mu_m^k\|^2 < 10^{-6} or that the maximal number of iterations, set to 500, has been reached. The detailed procedure of the gradient projection algorithm is depicted in Figure 3.
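The gradient projection step of Eqs. (16)-(21) can be sketched as below. This is my own illustration rather than the authors' code: the Euclidean projection onto the simplex {μ ≥ 0, e^T μ = 1} is computed with the standard sort-based routine instead of solving the QP (20) with a generic solver, and the scalar s^k, β, and σ are placeholder choices. On the example of Section 4, the projection routine reproduces [z_1]^+ = [0.7548, 0.2452, 0]^T.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}, i.e., the [.]^+ of Eq. (17)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def grad_J(alpha, K_list):
    """Eq. (18): one partial derivative per base kernel K_{s,m}."""
    return np.array([alpha @ np.diag(Ks) - alpha @ Ks @ alpha for Ks in K_list])

def gp_step(mu, alpha, K_list, s=1.0, beta=0.5, sigma=0.1):
    """One update of Eq. (16) with the direction of Eq. (17) and the Armijo rule (21)."""
    def J(m):
        K = sum(w * Ks for w, Ks in zip(m, K_list))
        return alpha @ np.diag(K) - alpha @ K @ alpha
    g = grad_J(alpha, K_list)
    z = mu - s * g
    feasible = np.all(z >= 0.0) and np.isclose(z.sum(), 1.0)
    mu_bar = z if feasible else project_simplex(z)
    d = mu_bar - mu
    eta = 1.0                                              # eta^k = beta^p, starting at p = 0
    while J(mu) - J(mu + eta * d) < -sigma * eta * (g @ d) and eta > 1e-8:
        eta *= beta
    return mu + eta * d
```

In the two-stage procedure of Figure 2, this update alternates with an SMO pass that re-solves Eq. (14) for α under the updated weights.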
3.3. Additional discriminant functions. We develop two more discriminant functions for our system. The first one, called Nearest-Average-Support-Vector (NASV), assigns a testing pattern x_p to the class m that has the shortest average distance between x_p and the support vectors of SVC_m. The discriminant function is defined as follows:

f(x_p, SV) = \arg\min_{m=1,\dots,k} \frac{\sum_{x_{j,m} \in SV_m} \|\phi_m(x_p) - \phi_m(x_{j,m})\|^2}{|SV_m|}
           = \arg\min_{m=1,\dots,k} \frac{\sum_{x_{j,m} \in SV_m} \big( K_m(x_p, x_p) - 2 K_m(x_p, x_{j,m}) + K_m(x_{j,m}, x_{j,m}) \big)}{|SV_m|} \qquad (22)
where SVm is the support vector set of SVCm and |SVm | is the cardinality of SVm . The second discriminant function, called k-Nearest-Support-Vector (KNSV), assigns a testing pattern xp to the class m which has a majority vote in the k nearest support vectors of xp , where k is a positive integer, typically small. If k = 1, then the pattern
is simply assigned to the class of its nearest support vector. Clearly, KNSV is a generalized version of NSV.

Figure 3. Gradient projection method (initialize μ_m^k; compute ∇J(μ_m^k) by Eq. (18); if z_m is infeasible, compute [z_m]^+ by Eq. (20); obtain \bar{μ}_m^k by Eq. (17); determine the step size by the Armijo rule (21); update μ_m^{k+1} by Eq. (16); repeat until the stopping criterion is met and output the optimal weights).

4. An Illustration. We give an example here to illustrate how our proposed approach works. Consider a dataset of three classes, class 1, class 2, and class 3, each of which contains four patterns as follows:

C_1 = { x_{1,1} = [1, 1]^T, x_{2,1} = [1, -1]^T, x_{3,1} = [-1, -1]^T, x_{4,1} = [-1, 1]^T },
C_2 = { x_{1,2} = [2, 2]^T, x_{2,2} = [2, -2]^T, x_{3,2} = [-2, -2]^T, x_{4,2} = [-2, 2]^T },
C_3 = { x_{1,3} = [3, 3]^T, x_{2,3} = [3, -3]^T, x_{3,3} = [-3, -3]^T, x_{4,3} = [-3, 3]^T }.

We need to construct three multi-kernel SVC machines, SVC_1, SVC_2, and SVC_3, for C_1, C_2, and C_3, respectively. Consider three-kernel learning, i.e., K_m = μ_{1,m} K_{1,m} + μ_{2,m} K_{2,m} + μ_{3,m} K_{3,m} for class C_m, 1 ≤ m ≤ 3. Assume that K_{1,m}, K_{2,m}, and K_{3,m} are RBF kernels with γ being 0.1, 0.5, and 1.0, respectively. Let us show the construction of SVC_1. We follow each stage in Figure 2 to obtain optimal kernel weights and Lagrange multipliers for SVC_1. To begin with, we initialize the kernel weights to random values, e.g., μ_{1,1} = 0.5558, μ_{2,1} = 0.3848, and μ_{3,1} = 0.0594.
First stage. In this stage, we keep the kernel weights fixed and have

K_1 = μ_{1,1} K_{1,1} + μ_{2,1} K_{2,1} + μ_{3,1} K_{3,1}
    = 0.5558 × [ 1      0.6703 0.4493 0.6703
                 0.6703 1      0.6703 0.4493
                 0.4493 0.6703 1      0.6703
                 0.6703 0.4493 0.6703 1      ]
    + 0.3848 × [ 1      0.1353 0.0183 0.1353
                 0.1353 1      0.1353 0.0183
                 0.0183 0.1353 1      0.1353
                 0.1353 0.0183 0.1353 1      ]
    + 0.0594 × [ 1      0.0183 0.0003 0.0183
                 0.0183 1      0.0183 0.0003
                 0.0003 0.0183 1      0.0183
                 0.0183 0.0003 0.0183 1      ]
    = [ 1      0.4257 0.2568 0.4257
        0.4257 1      0.4257 0.2568
        0.2568 0.4257 1      0.4257
        0.4257 0.2568 0.4257 1      ].

Then we apply SMO and obtain the values of the Lagrange multipliers as follows:

α_1 = [ 0.2502 0.2498 0.2506 0.2494 ]^T.

Second stage. In this stage, the Lagrange multipliers are kept fixed and we follow each step in Figure 3 to obtain the kernel weights. Let k = 0.

Step 1. Calculate the gradient by Eq. (18): ∇J(μ_1^k) = [ 0.3025 0.6778 0.7408 ]^T.
Step 2. Calculate z_1 = μ_1^k − s^k ∇J(μ_1^k). We have z_1 = [ 0.2828 −0.2269 −0.6092 ]^T.
Step 3. Check whether z_1 belongs to the feasible region. In this case, it does not, so we go to Step 4.
Step 4. Calculate [z_1]^+ by Eq. (20): [z_1]^+ = [ 0.7548 0.2452 0.0000 ]^T.
Step 5. Calculate \bar{μ}_1^k by Eq. (17): \bar{μ}_1^k = [z_1]^+ = [ 0.7548 0.2452 0.0000 ]^T.
Step 6. Determine the learning rate η^k by the Armijo rule (21).
Step 7. Calculate μ_1^{k+1} by Eq. (16) and obtain μ_1^{k+1} = [ 0.7349 0.2591 0.0059 ]^T.
Step 8. Check whether the stopping criterion is met. If yes, we are done. Otherwise, set k = k + 1 and go to Step 1.
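As a check on the first-stage arithmetic above, the following small sketch (not part of the paper) recomputes K_{1,1}, K_{2,1}, K_{3,1} for C_1 with γ = 0.1, 0.5, 1.0 and their weighted sum K_1; printing the rounded matrices reproduces entries such as 0.6703, 0.1353, 0.0183 and the combined values 0.4257 and 0.2568.

```python
import numpy as np

C1 = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
gammas = [0.1, 0.5, 1.0]
mu = np.array([0.5558, 0.3848, 0.0594])       # initial weights of the illustration

def rbf(X, gamma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

K_parts = [rbf(C1, g) for g in gammas]        # K_{1,1}, K_{2,1}, K_{3,1}
K1 = sum(w * K for w, K in zip(mu, K_parts))  # K_1 = sum_s mu_{s,1} K_{s,1}
for K in K_parts:
    print(np.round(K, 4))
print(np.round(K1, 4))                        # combined entries 0.4257 and 0.2568
```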
Stage 1 and Stage 2 proceed iteratively. When the stopping criterion is met, we obtain the optimal kernel weights and Lagrange multipliers for SVC_1 as

μ_1 = [ 1 0 0 ]^T,  α_1 = [ 0.2521 0.2479 0.2514 0.2486 ]^T.

Similarly, we construct SVC_2 and SVC_3 for C_2 and C_3, and obtain the following optimal kernel weights and Lagrange multipliers:

μ_2 = [ 1 0 0 ]^T,  α_2 = [ 0.2504 0.2496 0.2501 0.2499 ]^T,
μ_3 = [ 1 0 0 ]^T,  α_3 = [ 0.2502 0.2498 0.2500 0.2500 ]^T.

Having built the SVCs for the three classes, we can use any discriminant function presented in Section 2.2 or Section 3.3 to determine the class of a testing pattern x_p. For instance, let a testing pattern be x_p = [ 1.1 1.3 ]^T and let Nearest-Center (NC) be the discriminant function. By applying Eq. (4), we have

\|\Phi_1(x_p) - a_1\|^2 = \sum_{s=1}^{3} \mu_{s,1} K_{s,1}(x_p, x_p) - 2 \sum_{j=1}^{4} \alpha_{j,1} \sum_{s=1}^{3} \mu_{s,1} K_{s,1}(x_p, x_{j,1}) + \sum_{i=1}^{4} \sum_{j=1}^{4} \alpha_{i,1} \alpha_{j,1} \sum_{s=1}^{3} \mu_{s,1} K_{s,1}(x_{i,1}, x_{j,1}) = 0.3989,

\|\Phi_2(x_p) - a_2\|^2 = 0.5200,  \|\Phi_3(x_p) - a_3\|^2 = 0.8634.

Obviously, x_p is closest to a_1. Therefore, it is assigned to class 1.

5. Experimental Results. In this section, we show experimental results to demonstrate the effectiveness of our proposed approach. For convenience, we abbreviate the single-kernel support vector clustering classifier as SKSVCC, and the multi-kernel support vector clustering classifier as MKSVCC. Experiments with eight different datasets, synthetic or real, are conducted.

5.1. Experiment I. In this experiment, two synthetic datasets, DS-I and DS-II, are considered. The two datasets are generated by using the libraries and toolboxes provided in Matlab [38]. Each dataset contains 150 training patterns and 75 testing patterns, belonging to one of three classes; each class contains 50 training patterns and 25 testing patterns. The data patterns are drawn from 2-variate normal distributions with different mean vectors and covariance matrices, as specified in Table 1.

Table 1. Characteristics of datasets DS-I and DS-II.

dataset  parameter        class 1          class 2          class 3
DS-I     mean (μ)         [0 0]^T          [3 8]^T          [8 0]^T
         covariance (Σ)   [1 0.1; 0.1 2]   [5 0.5; 0.5 3]   [3 0.1; 0.1 2]
DS-II    mean (μ)         [0 0]^T          [3 4]^T          [4 0]^T
         covariance (Σ)   [1 0.1; 0.1 2]   [5 0.5; 0.5 3]   [3 0.1; 0.1 2]
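The synthetic data of Table 1 can be regenerated along the lines of the sketch below (an illustration in Python/numpy rather than the Matlab toolboxes cited as [38]; the seed and function name are my own choices, so the exact samples will differ from those used in the paper):

```python
import numpy as np

def make_dataset(means, covs, n_train=50, n_test=25, seed=0):
    """Draw the Gaussian classes of Table 1: n_train + n_test patterns per class."""
    rng = np.random.default_rng(seed)
    Xtr, ytr, Xte, yte = [], [], [], []
    for label, (m, S) in enumerate(zip(means, covs), start=1):
        pts = rng.multivariate_normal(m, S, size=n_train + n_test)
        Xtr.append(pts[:n_train]); ytr += [label] * n_train
        Xte.append(pts[n_train:]); yte += [label] * n_test
    return (np.vstack(Xtr), np.array(ytr)), (np.vstack(Xte), np.array(yte))

covs = [[[1, 0.1], [0.1, 2]], [[5, 0.5], [0.5, 3]], [[3, 0.1], [0.1, 2]]]
ds1_train, ds1_test = make_dataset([[0, 0], [3, 8], [8, 0]], covs)   # DS-I
ds2_train, ds2_test = make_dataset([[0, 0], [3, 4], [4, 0]], covs)   # DS-II
```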
Note that DS-II has more overlapped patterns than DS-I and is more difficult to separate into three classes. The results obtained by SKSVCC with different values of γ, e.g., γ = 1, 3, 5, and different discriminant functions are shown in Table 2. In this table, 'NC' means the nearest-center measure, 'NSV' the nearest-support-vector measure, 'NASV' the nearest-average-support-vector measure, and 'KNSV' the k-nearest-support-vector measure. Also, 'training (%)' means the training accuracy rate and 'testing (%)' the testing accuracy rate.

Table 2. Performance of SKSVCC with different hyperparameters and different discriminant functions.

γ = 1
discriminant   DS-I                        DS-II
function       training (%)  testing (%)   training (%)  testing (%)
NC             100.00        96.00         82.67         80.00
NSV            100.00        100.00        93.33         74.67
NASV           100.00        100.00        86.67         82.67
KNSV           100.00        100.00        84.00         85.33

γ = 3
discriminant   DS-I                        DS-II
function       training (%)  testing (%)   training (%)  testing (%)
NC             100.00        89.33         93.33         76.00
NSV            100.00        100.00        98.67         74.67
NASV           100.00        100.00        90.00         84.00
KNSV           100.00        100.00        83.33         85.33

γ = 5
discriminant   DS-I                        DS-II
function       training (%)  testing (%)   training (%)  testing (%)
NC             100.00        86.67         96.00         77.33
NSV            100.00        100.00        93.33         74.67
NASV           100.00        100.00        94.67         82.67
KNSV           100.00        100.00        82.67         85.33

From Table 2, we can see that different hyperparameters can affect the performance of the classifier. Consider DS-I. The classifiers with discriminant functions NSV, NASV, and KNSV work equally well for γ being 1, 3, or 5; both training and testing accuracy rates are 100% in each case. However, with NC the testing accuracy rate obtained for γ = 1, 96.00%, is significantly higher than those for γ = 3 and γ = 5, 89.33% and 86.67%, respectively. For DS-II, the classifiers with γ being 1, 3, and 5 all work equally well when the discriminant functions NSV and KNSV are used. The classifier with γ = 1 works best, with a testing accuracy rate of 80.00%, when NC is used. However, the classifier with γ = 3 works best, with a testing accuracy rate of 84.00%, when NASV is used. Besides, we also try 37 different settings of the hyperparameter γ: from 0.01 to 0.09 with a step of 0.01, from 0.1 to 0.9 with a step of 0.1, from 1 to 10 with a step of 1, and from 10 to 100 with a step of 10. The testing accuracy obtained by SKSVCC with different discriminant functions on DS-II is shown in Figure 4. From this figure, we can see again that different hyperparameters can affect the performance of the classifier.

Now, we run MKSVCC with different discriminant functions on DS-I and DS-II. Three-kernel learning is adopted, i.e., K_m = μ_{1,m} K_{1,m} + μ_{2,m} K_{2,m} + μ_{3,m} K_{3,m} for class m, where K_{1,m}, K_{2,m}, and K_{3,m} are RBF kernels with γ_1 = 1, γ_2 = 3, and γ_3 = 5, respectively. The results are shown in Table 3. From Table 2 and Table 3, we can see that MKSVCC either works equally well or better in all cases. For DS-I, MKSVCC works equally well; for DS-II, MKSVCC works better.
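For reference, the 37 γ settings swept in Figure 4 can be generated as follows (a small sketch; the paper does not give this construction, and the last segment starts at 20 so that γ = 10 is not counted twice):

```python
import numpy as np

# 9 + 9 + 10 + 9 = 37 values of gamma, as described in Section 5.1
gammas = np.concatenate([
    np.arange(1, 10) * 0.01,   # 0.01, 0.02, ..., 0.09
    np.arange(1, 10) * 0.1,    # 0.1, 0.2, ..., 0.9
    np.arange(1, 11) * 1.0,    # 1, 2, ..., 10
    np.arange(2, 11) * 10.0,   # 20, 30, ..., 100
])
assert gammas.size == 37
```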
Figure 4. Testing accuracy obtained by SKSVCC with different hyperparameters and different discriminant functions on DS-II (one curve each for NC, NSV, NASV, and KNSV; accuracy versus γ).

Table 3. Performance of MKSVCC with different discriminant functions on DS-I and DS-II.

γ ∈ {1, 3, 5}
discriminant   DS-I                        DS-II
function       training (%)  testing (%)   training (%)  testing (%)
NC             100.00        96.00         91.33         81.33
NSV            100.00        100.00        98.67         74.67
NASV           100.00        100.00        87.33         85.33
KNSV           100.00        100.00        84.00         88.00

For example, SKSVCC gets a highest testing accuracy rate of 80.00% with NC, whereas MKSVCC gets 81.33%. SKSVCC gets a testing accuracy rate of 85.33% with KNSV, but MKSVCC gets 88.00%. These examples show the advantage that can be derived from the combination of multiple kernels. Through multi-kernel learning, our system can automatically derive optimal weights for the participating kernels. Figure 5 and Figure 6 show the decision boundaries obtained by MKSVCC with the discriminant function KNSV. Obviously, the decision boundaries for DS-II are more complex than those for DS-I, which explains why it is more difficult to separate DS-II into three classes.

5.2. Experiment II. Next, we compare SKSVCC and MKSVCC on six real datasets: 'iris', 'glass', and 'vowel' from the UCI Repository [34], and 'segment', 'satimage', and 'letter' from the Statlog collection [35]. A brief description of these datasets is given in Table 4. To eliminate possible biases, 10-fold cross validations are independently conducted 10 times for each dataset. Table 5, Table 6, and Table 7 present the testing accuracy rates obtained by SKSVCC. Note that the optimal results are obtained with different hyperparameters for different datasets. This shows the critical importance of selecting optimal hyperparameters in conventional kernel-based methods. However, using multiple kernels with different hyperparameters helps solve this problem automatically. We run MKSVCC with different discriminant functions on these six real datasets. Three-kernel learning is adopted, i.e., K_m = μ_{1,m} K_{1,m} + μ_{2,m} K_{2,m} + μ_{3,m} K_{3,m} for class m, where K_{1,m}, K_{2,m}, and K_{3,m} are RBF kernels with γ_1 = 10, γ_2 = 18, and γ_3 = 25, respectively. Initial kernel weights are set as mean values, i.e., μ_{1,m} = μ_{2,m} = μ_{3,m} = 1/3. The results are shown in Table 8.
Figure 5. Decision boundaries obtained by MKSVCC for DS-I (training and testing patterns of the three classes with the KNSV decision boundaries).
Figure 6. Decision boundaries obtained by MKSVCC for DS-II (training and testing patterns of the three classes with the KNSV decision boundaries).

Table 4. Characteristics of six real datasets.

dataset    No. of patterns  No. of classes  No. of dimensions
iris       150              3               4
glass      214              6               9
vowel      990              11              10
segment    2310             7               19
satimage   6435             6               36
letter     15000            26              16

Table 5. Testing accuracy obtained by SKSVCC with γ = 10 for real datasets.

discriminant            dataset
function     iris   glass  vowel  segment  satimage  letter
NC           94.20  52.72  97.66  93.92    73.90     95.02
NSV          95.13  64.46  98.98  96.13    87.54     94.54
NASV         95.33  44.91  98.30  95.63    88.31     94.87
KNSV         93.53  61.60  96.82  94.10    88.17     94.16
Table 6. Testing accuracy obtained by SKSVCC with γ = 18 for real datasets.

discriminant            dataset
function     iris   glass  vowel  segment  satimage  letter
NC           94.40  56.10  93.53  92.71    52.64     89.51
NSV          95.47  64.30  98.96  96.84    89.89     94.95
NASV         95.67  46.93  98.57  96.40    89.81     95.61
KNSV         96.13  64.57  96.67  95.20    89.43     94.52

Table 7. Testing accuracy obtained by SKSVCC with γ = 25 for real datasets.

discriminant            dataset
function     iris   glass  vowel  segment  satimage  letter
NC           93.40  60.23  88.84  90.83    41.59     82.70
NSV          95.47  64.07  98.95  96.94    89.83     95.04
NASV         95.53  50.52  98.69  96.52    89.59     95.71
KNSV         96.00  64.75  96.59  94.10    89.15     94.53
Obviously, MKSVCC either works equally well or better in all cases. This shows the power of multi-kernel learning: through it, our system can automatically derive optimal weights for the participating kernels.

Table 8. Testing accuracy obtained by MKSVCC for real datasets, with γ ∈ {10, 18, 25} and with kernel weights initialized as mean values.

discriminant            dataset
function     iris   glass  vowel  segment  satimage  letter
NC           94.47  60.82  97.66  93.93    73.90     95.02
NSV          95.93  64.92  98.98  96.99    90.19     95.06
NASV         95.87  52.36  98.70  96.58    90.79     95.81
KNSV         96.40  65.60  96.84  95.42    90.48     94.69
6. Conclusion. We have proposed a multi-kernel SVC approach for multi-class classification problems. Due to varying pattern distributions in different classes, a single-kernel SVC may not provide good solutions to multi-class classification problems, and optimal hyperparameters must otherwise be obtained through trial and error, case by case. The proposed approach handles this hyperparameter selection problem automatically. For a training data set of k classes, we build k multi-kernel SVC machines. A two-stage multi-kernel learning algorithm is developed to optimally combine the kernel matrices for each SVC machine. This learning algorithm applies sequential minimal optimization and gradient projection iteratively to obtain Lagrange multipliers and optimal kernel weights. A discriminant function is then applied to make the classification decision based on the outputs obtained from all the SVC machines. Experimental results, obtained on datasets generated synthetically or taken from the UCI Repository of machine learning databases [34] and the Statlog collection [35], show that our method performs better than other methods.

For the experiments in Section 5, the kernel weights were initialized as mean values in our approach; for example, we set μ_{1,m} = μ_{2,m} = μ_{3,m} = 1/3 in Experiment II.
However, the initial conditions are not critical to the final outputs of our approach. To show this, we perform Experiment II five times. In each run, the kernel weights are initialized with a different set of random values. The results obtained with the discriminant functions NASV and KNSV are shown in Table 9. Note that the standard deviation in this table is small in each case, which means that different initial settings lead to essentially the same outputs. Comparing with Table 8, we can hardly tell any difference in testing accuracies among different initial settings of the kernel weights.

Table 9. Testing accuracy obtained by MKSVCC for real datasets, with γ ∈ {10, 18, 25} and with kernel weights initialized as random values.

discriminant  number                       dataset
function      of runs    iris    glass   vowel   segment  satimage
NASV          1          96.20   52.22   98.71   96.65    89.76
              2          96.00   52.73   98.71   96.63    89.76
              3          95.87   52.30   98.72   96.65    89.75
              4          96.07   53.17   98.72   96.65    89.77
              5          96.00   52.49   98.71   96.65    89.77
              Average    96.03   52.58   98.71   96.65    89.76
              Std. dev.  0.120   0.383   0.005   0.009    0.008
KNSV          1          96.40   65.53   97.06   95.62    90.43
              2          96.27   65.65   97.18   95.58    90.43
              3          96.27   65.51   97.18   95.58    90.43
              4          96.47   66.00   97.18   95.59    90.43
              5          96.40   65.69   97.22   95.59    90.45
              Average    96.36   65.68   97.16   95.59    90.43
              Std. dev.  0.089   0.197   0.061   0.016    0.009
Acknowledgment. This work was supported by the National Science Council under grant NSC 95-2221-E-110-055-MY2. A preliminary version of this paper was presented at the International Conference on Innovative Computing, Information and Control, June 18-20, 2008, Dalian, China.

REFERENCES

[1] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[2] M. Pontil and A. Verri, Support vector machines for 3D object recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 6, pp. 637-646, 1998.
[3] H. Drucker, D. Wu and V. Vapnik, Support vector machines for spam categorization, IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
[4] G. Guo, S. Z. Li and K. Chan, Face recognition by support vector machines, Proc. of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 196-201, 2000.
[5] K. I. Kim, K. Jung, S. H. Park and H. J. Kim, Support vector machines for texture classification, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 11, pp. 1542-1550, 2002.
[6] J. J. Ward, L. J. McGuffin, B. F. Buxton and D. T. Jones, Secondary structure prediction with support vector machines, Bioinformatics, vol. 19, no. 13, pp. 1650-1655, 2003.
[7] R.-C. Chen and S.-P. Chen, Intrusion detection using a hybrid support vector machine based on entropy and TF-IDF, International Journal of Innovative Computing, Information and Control, vol. 4, no. 2, pp. 413-424, 2008.
[8] O. L. Mangasarian and D. R. Musicant, Robust linear and support vector regression, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 9, pp. 950-955, 2000.
[9] M. Kobayashi, Y. Konishi and H. Ishigaki, A lazy learning control method using support vector regression, International Journal of Innovative Computing, Information and Control, vol. 3, no. 6, pp. 1511-1523, 2007.
[10] K. S. Ni and T. Q. Nguyen, Image superresolution using support vector regression, IEEE Trans. on Image Processing, vol. 16, no. 6, pp. 1596-1610, 2007.
[11] P. Zhong and L. Wang, Support vector regression with input data uncertainty, International Journal of Innovative Computing, Information and Control, vol. 4, no. 9, pp. 2325-2332, 2008.
[12] A. Ben-Hur, D. Horn, H. T. Siegelmann and V. Vapnik, Support vector clustering, Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.
[13] D. M. J. Tax and R. P. W. Duin, Support vector domain description, Pattern Recognition Letters, vol. 20, no. 11-13, pp. 1191-1199, 1999.
[14] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. on Neural Networks, vol. 12, no. 2, pp. 181-201, 2001.
[15] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. Müller, E. Sackinger, P. Simard and V. Vapnik, Comparison of classifier methods: a case study in handwritten digit recognition, Proc. of the International Conference on Pattern Recognition, pp. 77-87, 1994.
[16] J. Friedman, Another approach to polychotomous classification, Technical Report, Department of Statistics, Stanford University, Stanford, CA, 1996.
[17] J. C. Platt, N. Cristianini and J. Shawe-Taylor, Large margin DAGs for multiclass classification, Proc. of the Neural Information Processing Systems Conference, vol. 12, pp. 547-553, 2000.
[18] A. Sachs, C. Thiel and F. Schwenker, One-class support-vector machines for the classification of bioacoustic time series, ICGST International Journal on Artificial Intelligence and Machine Learning, vol. 6, no. 4, pp. 29-34, 2006.
[19] B. Y. Sun and D. S. Huang, Support vector clustering for multi-class classification problems, Proc. of the Congress on Evolutionary Computation, pp. 1480-1485, 2003.
[20] O. Chapelle, V. Vapnik, O. Bousquet and S. Mukherjee, Choosing multiple parameters for support vector machines, Machine Learning, vol. 46, no. 1-3, pp. 131-159, 2002.
[21] K. Duan, S. Keerthi and A. N. Poo, Evaluation of simple performance measures for tuning SVM hyperparameters, Neurocomputing, vol. 51, no. 4, pp. 41-59, 2003.
[22] F. Friedrichs and C. Igel, Evolutionary tuning of multiple SVM parameters, Neurocomputing, vol. 64, pp. 107-117, 2005.
[23] J. T.-Y. Kwok, The evidence framework applied to support vector machines, IEEE Trans. on Neural Networks, vol. 11, no. 5, pp. 1162-1173, 2000.
[24] Q. She, H. Su, L. Dong and J. Chu, Support vector machine with adaptive parameters in image coding, International Journal of Innovative Computing, Information and Control, vol. 4, no. 2, pp. 359-367, 2008.
[25] G. Lanckriet, T. D. Bie, N. Cristianini, M. Jordan and W. Noble, A statistical framework for genomic data fusion, Bioinformatics, vol. 20, no. 16, pp. 2626-2635, 2004.
[26] G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui and M. I. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research, vol. 5, pp. 27-72, 2004.
[27] K. Crammer, J. Keshet and Y. Singer, Kernel design using boosting, in Advances in Neural Information Processing Systems, S. Becker, S. Thrun and K. Obermayer (eds.), MIT Press, Cambridge, MA, USA, vol. 15, pp. 537-544, 2003.
[28] K. P. Bennett, M. Momma and M. J. Embrechts, MARK: a boosting algorithm for heterogeneous kernel models, Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 24-31, 2002.
[29] C. S. Ong, A. J. Smola and R. C. Williamson, Learning the kernel with hyperkernels, Journal of Machine Learning Research, vol. 6, pp. 1043-1071, 2005.
[30] I. W.-H. Tsang and J. T.-Y. Kwok, Efficient hyperkernel learning using second-order cone programming, IEEE Trans. on Neural Networks, vol. 17, no. 1, pp. 48-58, 2006.
[31] S. Sonnenburg, G. Rätsch, C. Schäfer and B. Schölkopf, Large scale multiple kernel learning, Journal of Machine Learning Research, vol. 7, pp. 1531-1565, 2006.
[32] A. Rakotomamonjy, F. Bach, S. Canu and Y. Grandvalet, More efficiency in multiple kernel learning, Proc. of the 24th International Conference on Machine Learning, pp. 775-782, 2007.
[33] J. C. Platt, Fast training of support vector machines using sequential minimal optimization, in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. J. C. Burges and A. J. Smola (eds.), MIT Press, Cambridge, MA, USA, pp. 185-208, 1999.
[34] D. J. Newman, S. Hettich, C. L. Blake and C. J. Merz, UCI repository of machine learning databases, 1998.
[35] D. Michie, D. J. Spiegelhalter and C. C. Taylor, Machine Learning, Neural and Statistical Classification, Prentice Hall, Englewood Cliffs, NJ, 1994.
[36] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
[37] D. P. Bertsekas, Nonlinear Programming, 2nd Edition, Athena Scientific, Massachusetts, 1999.
[38] The MathWorks, Inc., Statistics Toolbox for Use with MATLAB: User's Guide, 1998.