Support Vector Machine Classification Based on Fuzzy Clustering for Large Data Sets

Jair Cervantes¹, Xiaoou Li¹, and Wen Yu²

¹ Sección de Computación, Departamento de Ingeniería Eléctrica, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
² Departamento de Control Automático, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
[email protected]

Abstract. Support vector machine (SVM) has been successfully applied to solve a large number of classification problems. Despite its good theoretical foundations and generalization capability, training an SVM on a large data set is a challenging task due to the training complexity, high memory requirements and slow convergence. In this paper we present a new method, SVM classification based on fuzzy clustering. Before applying SVM we use fuzzy clustering; in this stage the optimal number of clusters is not needed, which keeps the computational cost low: we only need a rough partition of the training data set. SVM classification is then carried out on the cluster centers. Next, de-clustering and SVM classification on the reduced data are applied. The proposed approach is scalable to large data sets with high classification accuracy and fast convergence speed. Empirical studies show that the proposed approach achieves good performance for large data sets.

1 Introduction

The digital revolution has made data capture easy and its storage practically free. As a result, enormous quantities of high-dimensional data are continuously stored in databases, and semi-automatic methods for classification from databases become necessary. Support vector machine (SVM) is a powerful technique for classification and regression. Training an SVM is usually posed as a quadratic programming (QP) problem to find a separating hyperplane, which involves a dense n × n matrix, where n is the number of points in the data set. This requires huge amounts of computational time and memory for large data sets, so the training complexity of SVM is highly dependent on the size of the data set [1][19]. Many efforts have been made on classification for large data sets. Sequential Minimal Optimization (SMO) [12] transforms the large QP problem into a series of small QP problems, each involving only two variables [4][6][15]. In [11], boosting was applied to Platt's SMO algorithm, and the resulting Boost-SMO method was used for speeding and scaling up SVM training. [18] discusses large-scale approximations for Bayesian inference with LS-SVM.

The results of [7] demonstrate that a fair computational advantage can be obtained by using a recursive strategy for large data sets, such as those involved in data mining and text categorization applications. Vector quantization is applied in [8] to reduce a large data set by replacing examples with prototypes, greatly reducing the training time needed for choosing optimal parameters. [9] proposes an approach based on an incremental learning technique and a multiple proximal support vector machine classifier. Random selection [2][14][16] selects data so that learning is maximized. However, it may over-simplify the training data set and lose the benefits of SVM, especially if the probability distributions of the training data and the testing data differ. On the other hand, unsupervised classification, called clustering, is the classification of similar objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. The goal of clustering is to separate a finite unlabeled data set into a finite and discrete set of "natural," hidden data structures [13]. Some results [1][5][19] show that clustering techniques can help to decrease the complexity of SVM training, but they need additional computations to build the hierarchical structure.

In this paper we propose a new approach for the classification of large data sets, named SVM classification based on fuzzy clustering. To the best of our knowledge, SVM classification based on fuzzy clustering has not yet been established in the literature. In the partitioning stage, the number of clusters is pre-defined to avoid the computational cost of determining the optimal number of clusters; we only section the training data set and exclude the clusters with low probability of containing support vectors. From the obtained clusters, which are classified into mixed and uniform categories, we extract support vectors by SVM and form reduced clusters. Then we apply de-clustering to the reduced clusters and obtain a subset of the original data set. Finally, we use SVM again and finish the classification. An experiment is given to show the effectiveness of the new approach.

The structure of the paper is organized as follows: after the introduction of SVM and fuzzy clustering in Section 2, we introduce SVM classification based on fuzzy clustering in Section 3. Section 4 demonstrates experimental results on artificial and real data sets. We conclude our study in Section 5.

2 Support Vector Machine for Classification and Fuzzy Clustering

Assume that a training set X is given as

(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)   (1)

i.e., X = \{x_i, y_i\}_{i=1}^{n}, where x_i \in R^d and y_i \in \{+1, -1\}. Training an SVM amounts to solving the following quadratic programming problem:

\max_{\alpha} \; -\frac{1}{2} \sum_{i,j=1}^{l} \alpha_i y_i \alpha_j y_j K(x_i \cdot x_j) + \sum_{i=1}^{l} \alpha_i

subject to:

\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad C \ge \alpha_i \ge 0, \; i = 1, 2, \ldots, l

where C > 0 and \alpha = [\alpha_1, \alpha_2, \ldots, \alpha_l]^T, \alpha_i \ge 0, i = 1, 2, \ldots, l, are the coefficients corresponding to the x_i. An x_i with nonzero \alpha_i is called a support vector (SV). The function K is the kernel, which must satisfy the Mercer condition [17]. Let S be the index set of the SVs; then the optimal hyperplane is

\sum_{i \in S} (\alpha_i y_i) K(x_i, x) + b = 0

and the optimal decision function is defined as

f(x) = \operatorname{sign}\left( \sum_{i \in S} (\alpha_i y_i) K(x_i, x) + b \right)   (2)

where x is the input data point to be classified, the αi are Lagrange multipliers and the yi are the class labels. A new object x can be classified using (2). The vectors xi appear only through inner products (via the kernel). There is a Lagrange multiplier αi for each training point. When the maximum-margin hyperplane is found, only the points closest to the hyperplane have αi > 0; these points are called support vectors (SV), and the remaining points have αi = 0.

Clustering essentially deals with the task of splitting a set of patterns into a number of more-or-less homogeneous classes (clusters) with respect to a suitable similarity measure, such that the patterns belonging to the same cluster are similar and the patterns of different clusters are as dissimilar as possible. Let us formulate the fuzzy clustering problem as follows: consider a finite set of elements X = {x1, x2, ..., xn} of dimension d in the Euclidean space R^d, i.e., xj ∈ R^d, j = 1, 2, ..., n. The problem is to perform a partitioning of these data into k fuzzy sets with respect to a given criterion, which is usually the optimization of an objective function. The result of the fuzzy clustering can be expressed by a partitioning matrix U = [uij], i = 1, ..., k, j = 1, ..., n, where uij is a numeric value in [0, 1]. There are two constraints on the value of uij: first, the total membership of an element xj ∈ X over all classes equals 1; second, every constructed cluster is non-empty and different from the entire set, i.e.,

\sum_{i=1}^{k} u_{ij} = 1, \; j = 1, 2, \ldots, n, \qquad 0 < \sum_{j=1}^{n} u_{ij} < n, \; i = 1, 2, \ldots, k   (3)

The objective function to be minimized is

J(U, v) = \sum_{i=1}^{k} \sum_{j=1}^{n} (u_{ij})^m \, \|x_j - v_i\|^2   (4)

where m is called the exponential weight, which influences the degree of fuzziness of the membership function. To solve the minimization problem, we differentiate the objective function in (4) with respect to vi (for fixed uij, i = 1, ..., k, j = 1, ..., n) and with respect to uij (for fixed vi, i = 1, ..., k), and apply the conditions of (3):

v_i = \frac{\sum_{j=1}^{n} (u_{ij})^m \, x_j}{\sum_{j=1}^{n} (u_{ij})^m}, \quad i = 1, \ldots, k   (5)

u_{ij} = \frac{\left( 1 / \|x_j - v_i\|^2 \right)^{1/(m-1)}}{\sum_{l=1}^{k} \left( 1 / \|x_j - v_l\|^2 \right)^{1/(m-1)}}   (6)

where i = 1, ..., k and j = 1, ..., n. The system described by (5) and (6) cannot be solved analytically. However, the following fuzzy clustering algorithm provides an iterative approach:

Step 1: Select a number of clusters k (2 ≤ k ≤ n) and an exponential weight m (1 < m < ∞). Choose an initial partition matrix U^(0) and a termination criterion ε.
Step 2: Calculate the fuzzy cluster centers { v_i^(l) | i = 1, 2, ..., k } using U^(l) and (5).
Step 3: Calculate the new partition matrix U^(l+1) using { v_i^(l) | i = 1, 2, ..., k } and (6).
Step 4: Calculate Δ = ||U^(l+1) − U^(l)|| = max_{i,j} |u_ij^(l+1) − u_ij^(l)|. If Δ > ε, set l = l + 1 and go to Step 2. If Δ ≤ ε, stop.

The iterative procedure described above minimizes the objective function in (4) and converges to one of its local minima.
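As a concrete illustration of Steps 1-4, the following is a minimal NumPy sketch of the iterative procedure defined by (4)-(6); the function name, the random initialization of U^(0) and the default values of m, ε and the iteration cap are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Iterative fuzzy clustering of X (n x d) into k clusters, following (5)-(6)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    # Step 1: random initial partition matrix U (k x n), columns sum to 1
    U = rng.random((k, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: cluster centers, eq. (5)
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Step 3: new partition matrix, eq. (6)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # squared distances, k x n
        d2 = np.fmax(d2, 1e-12)                                   # avoid division by zero
        U_new = (1.0 / d2) ** (1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)
        # Step 4: stop when the largest change in membership is below eps
        converged = np.max(np.abs(U_new - U)) <= eps
        U = U_new
        if converged:
            break
    return V, U
```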

3 SVM Classification Based on Fuzzy Clustering

Assume that we have an input-output data set. The task here is to find the set of support vectors from the input-output data which maximizes the margin between the classes. From the above discussion, we need to

– eliminate the input data subsets which are far away from the decision hyperplane;
– find the support vectors from the reduced data set.

Our approach can be organized into the following steps.

3.1 Fuzzy Clustering: Sectioning the Input Data Space into Clusters

According to (1), let X = {x1, x2, ..., xn} be the set of n input data, where each datum xi can be considered as a point represented by a vector of dimension d, xi = (wi1, wi2, ..., wid), 1 ≤ i ≤ n, where wij denotes the weight of the j-th attribute in xi, with 0 < wij < 1 and 1 ≤ j ≤ d. Assume that we divide the original input data set into k clusters, with k > 2. The number of clusters must be strictly greater than 2 because we want to reduce the original input data set by eliminating the clusters far from the decision hyperplane; if k = 2, the input data set would simply be split into the existing classes and no data could be eliminated. Fuzzy clustering obtains the k cluster centers and the membership of each datum in each cluster by minimizing the objective function (4), where (uij)^m denotes the membership grade of xj in the cluster Ai, vi denotes the center of Ai, and ||xj − vi|| denotes the distance from xj to the center vi. Both xj and vi are vectors of dimension d, and m influences the degree of fuzziness of the membership function. Note that the total membership of an element xj ∈ X over all clusters equals 1, i.e., \sum_{i=1}^{k} u_{ij} = 1 for 1 ≤ j ≤ n. The membership grade of xj in Ai is calculated as in (6), and the cluster center vi of Ai is calculated as in (5). The complete fuzzy clustering algorithm follows Steps 1-4 of Section 2.

3.2 Classification of Cluster Centers Using SVM

Let (X, Y) be the training pattern set, where X = {x1, ..., xj, ..., xn}, Y = {y1, ..., yj, ..., yn} with yj ∈ {−1, +1}, and xj = (xj1, xj2, ..., xjd)^T ∈ R^d, each component xji being an attribute (dimension or variable). The fuzzy clustering process finds k partitions of X, C = {C1, ..., Ck} (k < n), such that (a) \cup_{i=1}^{k} C_i = X and (b) C_i ≠ ∅, i = 1, ..., k, where

C_i = \{ \cup \, x_j \mid y_j \in \{-1, +1\} \}, \quad i = 1, \ldots, k   (7)

Fig. 1. Uniform clusters and mixed clusters

That is, although each element of a cluster has its own membership grade, every element keeps its original class label, as expressed by (7). A cluster may contain elements of a single class (as can be seen in Figure 1(b), where clusters 1, 2, 3, 4, 7 and 8 contain elements of only one class); such a cluster is defined as a uniform cluster, Cu, whose elements all share the same label (yj = −1 for every xj ∈ Cu, or yj = +1 for every xj ∈ Cu). Otherwise the cluster contains elements of both classes (see Figure 1(b), where clusters 5 and 6 contain elements of mixed class); such a cluster is defined as a mixed cluster, Cm, which contains elements with yj = −1 as well as elements with yj = +1.
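To make the distinction concrete, the sketch below assigns each training point to its highest-membership cluster and then separates uniform from mixed clusters; the hard assignment by argmax and the helper name split_clusters are our own assumptions for illustration.

```python
import numpy as np

def split_clusters(U, y):
    """Partition cluster indices into uniform (single-class) and mixed (both-class) sets.

    U : (k, n) fuzzy partition matrix, y : (n,) labels in {-1, +1}.
    """
    assign = np.argmax(U, axis=0)          # hard assignment of each point to a cluster
    uniform, mixed = [], []
    for i in range(U.shape[0]):
        labels = np.unique(y[assign == i])
        if len(labels) == 1:
            uniform.append(i)              # all members share one label
        elif len(labels) > 1:
            mixed.append(i)                # cluster contains both classes
        # empty clusters are ignored
    return uniform, mixed, assign
```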

Fig. 2. Reduction of the input data set. (a) clusters with mixed elements, (b) clusters close to the separation hyperplane.

We identify the clusters with elements of mixed class and the clusters with elements of a single (uniform) class. The mixed clusters are set aside for later evaluation (see Figure 2(a)), because they contain the elements most likely to be support vectors. Each of the remaining clusters has a center and a uniform class label (see Figure 2(b)); with these cluster centers and their labels we use SVM to find a decision hyperplane, which is defined by its support vectors. The cluster centers found to be support vectors are then separated from the rest, as shown in Figure 3(b).

Fig. 3. Reduction of input data using partitioning clustering. a) Clusters of original data set. b) Set of reduced clusters.
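The step just described, training an SVM on the centers of the uniform clusters and keeping only the clusters whose centers become support vectors, can be sketched as follows. We use scikit-learn's SVC as a stand-in SVM solver and reuse the hypothetical helpers above, so this is illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def select_sv_clusters(V, U, y, uniform, kernel="linear"):
    """Train an SVM on uniform-cluster centers and return the clusters whose
    centers are support vectors (the clusters near the decision hyperplane)."""
    assign = np.argmax(U, axis=0)
    center_X = np.array([V[i] for i in uniform])
    center_y = np.array([y[assign == i][0] for i in uniform])  # label shared by the members
    svm = SVC(kernel=kernel, C=1.0).fit(center_X, center_y)
    return [uniform[j] for j in svm.support_]                   # cluster indices kept
```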

3.3 De-clustering: Getting a Data Subset

The data subset obtained from the original data set given in (1) is formed by

\bigcup_{i=1}^{l} C_{m_i}, \; l \le k, \qquad \text{and} \qquad \bigcup_{i=1}^{p} C_{u_i} \mid C_{u_i} \in svc, \; p \le k

where svc is defined as the set of uniform clusters whose centers are support vectors, and Cm and Cu denote the clusters with mixed and uniform elements, respectively. In Figure 3(b) the set of uniform clusters whose centers are support vectors (svc) is represented by clusters 3, 4 and 5, while the set of clusters with mixed elements is represented by clusters 1 and 2. At this point, the original data set has been reduced by keeping the clusters close to the optimal decision hyperplane and eliminating the clusters far away from it. However, the reduced set still consists of clusters, so we need to de-cluster it and recover the elements of those clusters. The data subset obtained, (X', Y'), has the following characteristics:

1) X' \subset X
2) X' \cap X = X', \quad X' \cup X = X   (8)

where these characteristics hold likewise for Y' and Y; i.e., the obtained data set is a subset of the original data set.
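A minimal sketch of this de-clustering step, using the hypothetical names introduced above: the reduced subset (X', Y') gathers the original points of every mixed cluster plus those of the uniform clusters whose centers were selected as support vectors.

```python
import numpy as np

def decluster(X, y, assign, mixed, sv_clusters):
    """Recover the original points belonging to the retained clusters, giving (X', Y') in (8)."""
    keep = np.isin(assign, np.array(mixed + sv_clusters))
    return X[keep], y[keep]
```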

3.4 Classification of Reduced Data Using SVM

In this step we apply SVM to the data subset obtained in Sections 3.1, 3.2 and 3.3. Since the original data set was reduced significantly in the previous steps by eliminating the clusters whose centers are far from the optimal decision hyperplane, the training time in this stage is much smaller than the training time on the original data set.
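Putting Sections 3.1-3.4 together, the whole procedure reads roughly as below. It chains the sketches from the previous subsections and again uses scikit-learn's SVC, so it is an outline of the approach under those assumptions rather than the authors' code.

```python
from sklearn.svm import SVC

def fcm_svm_classify(X, y, k, kernel="linear"):
    """SVM classification based on fuzzy clustering: cluster, reduce, then train SVM."""
    # 3.1 fuzzy clustering of the input space (k > 2)
    V, U = fuzzy_c_means(X, k)
    uniform, mixed, assign = split_clusters(U, y)
    # 3.2 SVM on the uniform-cluster centers; keep clusters near the hyperplane
    sv_clusters = select_sv_clusters(V, U, y, uniform, kernel)
    # 3.3 de-clustering: recover the original points of the retained clusters
    X_red, y_red = decluster(X, y, assign, mixed, sv_clusters)
    # 3.4 final SVM on the reduced data set
    return SVC(kernel=kernel, C=1.0).fit(X_red, y_red)
```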

4 Experimental Results

To test and demonstrate the proposed technique for large data set classification, several experiments were designed and performed; the results are reported and discussed in this section. First, let us consider a simple classification case. We generated a set of 40000 random data without specifying the ranges of the random generation. The data set has two dimensions and is denoted by X. A record ri is labeled "+" if wx + b > th and "-" if wx + b < th, where th is a threshold, w is the weight associated with the input data and b is the bias. In this way, the data set is linearly separable. Figure 4 shows the input data set with 10000 (see (a)), 1000 (see (b)) and 100 data (see (c)). Now we use fuzzy clustering to reduce the data set. After the first three steps of the proposed algorithm (Sections 3.1, 3.2 and 3.3), the input data sets are reduced to 337, 322 and 47 data, respectively; see (d), (e) and (f) in Figure 4.

Fig. 4. Original data sets and reduced data sets with 10^4 data
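The linearly separable data of this first experiment can be reproduced along the following lines; since the paper leaves the generation ranges unspecified, the sampling interval, weight vector, bias and threshold below are arbitrary placeholders.

```python
import numpy as np

def make_linear_data(n=40000, w=(1.0, -1.0), b=0.0, th=0.0, seed=0):
    """Generate n random 2-D points labeled +1 if w.x + b > th and -1 otherwise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))   # placeholder range
    y = np.where(X @ np.asarray(w) + b > th, 1, -1)
    return X, y
```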

We use a normal SVM with linear kernel. The performance comparison between the SVM based on fuzzy clustering and the normal SVM is shown in Table 1. Here "%" represents the percentage of the original data set, "#" the number of data, "Vs" the number of support vectors, "t" the total experimental time in seconds, "k" the number of clusters and "Acc" the accuracy. We can see that the data reduction is carried out according to the distance between the cluster centers and the optimal decision hyperplane, which causes some support vectors to be eliminated. In spite of this, the SVM based on fuzzy clustering achieves comparable accuracy, and its training time is small compared with that of the normal SVM.

Table 1. Comparison between the SVM based on fuzzy clustering and normal SVM with 10^4 data

                       normal SVM                 SVM based on fuzzy clustering
  %       #        Vs     t         Acc           Vs     t         Acc       k
  0.125   500      5      96.17     99.2%         4      3.359     99.12%    10
  0.25%   1000     5      1220      99.8%         4      26.657    99.72%    20
  25%     10000    -      -         -             3      114.78    99.85%    50
  50%     20000    -      -         -             3      235.341   99.88%    100
  75%     30000    -      -         -             3      536.593   99.99%    150
  100%    40000    -      -         -             3      985.735   99.99%    200

Second, we generated a set of 100,000 random data, with the range and radius of the random generation specified as in [19]. Figure 5 shows the input data set with 1000 data (see (a)), 10000 data (see (b)) and 50000 data (see (c)). After the fuzzy clustering, the input data sets are reduced to 257, 385 and 652 data, respectively; see (d), (e) and (f) in Figure 5. For the testing sets, we generated 56000 and 109000 data for the first example (Figure 4) and the second example (Figure 5), respectively. The performance comparison between the SVM based on fuzzy clustering and the normal SVM is shown in Table 2.

Fig. 5. Classification results with 10^5 data


Table 2. Comparison between the SVM based on fuzzy clustering and normal SVM with 10^5 data

                       normal SVM                 SVM based on fuzzy clustering
  %       #        Vs     t          Acc          Vs     t          Acc       k
  0.25%   2725     7      3554       99.80%       6      36.235     99.74%    20
  25%     27250    -      -          -            4      221.297    99.88%    50
  50%     54500    -      -          -            4      922.172    99.99%    100
  75%     109000   -      -          -            4      2436.372   99.99%    150

Third, we use the Waveform data set [3]. The data set has 5000 waves and three classes; the classes are generated from a combination of two or three basic waves. Each record has 22 attributes with continuous values between 0 and 6. In this example we use a normal SVM with RBF kernel. Table 3 shows the training time for different numbers of data. The performance of the SVM based on fuzzy clustering is good: the margin of the optimal decision is almost the same as that of the normal SVM, while the training time is much smaller.

Table 3. Comparison between the SVM based on fuzzy clustering and normal SVM for the Wave data set

                   normal SVM                 SVM based on fuzzy clustering
  %      #      Vs     t         Acc          Vs     t         Acc       k
  8      400    168    35.04     88.12%       162    22.68     87.3%     40
  12     600    259    149.68    88.12%       244    47.96     87.6%     60
  16     800    332    444.45    88.12%       329    107.26    88.12%    80
  20     1000   433    1019.2    88.12%       426    194.70    88.12%    100
  24     1200   537    2033.2    -            530    265.91    88.12%    120
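A run like those summarized in Table 3 could be set up as in the sketch below, which assumes the waveform records are already loaded into arrays X and y and uses scikit-learn's RBF-kernel SVC in place of the authors' solver.

```python
from sklearn.svm import SVC

# X: (5000, 22) waveform attributes, y: (5000,) class labels -- assumed already loaded
n = 800                                        # e.g. the 16% row of Table 3
normal_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:n], y[:n])
print("support vectors:", normal_svm.support_vectors_.shape[0])
# the fuzzy-clustering variant would instead call the pipeline sketched in Section 3.4:
# model = fcm_svm_classify(X[:n], y[:n], k=80, kernel="rbf")
```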

5 Conclusion

In this paper, we developed a new classification method for large data sets that takes advantage of both fuzzy clustering and SVM. The proposed algorithm follows an idea similar to that of sequential minimal optimization (SMO): in order to work with large data sets, we partition the original data set into several clusters and thereby reduce the size of the QP problems. The experimental results show that the number of support vectors obtained using SVM classification based on fuzzy clustering is similar to that of the normal SVM approach, while the training time is significantly smaller.

References

1. Awad, M., Khan, L., Bastani, F. and Yen, I. L., "An Effective Support Vector Machines (SVMs) Performance Using Hierarchical Clustering," Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), November 15-17, 2004, pp. 663-667.


2. Balcazar, J. L., Dai, Y. and Watanabe, O., "Provably Fast Training Algorithms for Support Vector Machines," in Proc. of the 1st IEEE Int. Conf. on Data Mining, IEEE Computer Society, 2001, pp. 43-50.
3. Chang, C.-C. and Lin, C.-J., LIBSVM: a library for support vector machines, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
4. Collobert, R. and Bengio, S., "SVMTorch: Support Vector Machines for Large-Scale Regression Problems," Journal of Machine Learning Research, 1:143-160, 2001.
5. Boley, D. and Cao, D., "Training Support Vector Machines Using Adaptive Clustering," in Proc. of the SIAM Int. Conf. on Data Mining 2004, Lake Buena Vista, FL, USA.
6. Joachims, T., "Making Large-Scale Support Vector Machine Learning Practical," in B. Scholkopf, C. J. C. Burges and A. J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
7. Kim, S. W. and Oommen, B. J., "Enhancing Prototype Reduction Schemes with Recursion: A Method Applicable for 'Large' Data Sets," IEEE Transactions on Systems, Man and Cybernetics, Part B, Volume 34, Issue 3, pp. 1384-1397, 2004.
8. Lebrun, G., Charrier, C. and Cardot, H., "SVM Training Time Reduction Using Vector Quantization," Proceedings of the 17th International Conference on Pattern Recognition, Volume 1, pp. 160-163, 2004.
9. Li, K. and Huang, H. K., "Incremental Learning Proximal Support Vector Machine Classifiers," Proceedings of the 2002 International Conference on Machine Learning and Cybernetics, Volume 3, pp. 1635-1637, 2002.
10. Luo, F., Khan, L., Bastani, F., Yen, I. and Zhou, J., "A Dynamical Growing Self-Organizing Tree (DGSOT) for Hierarchical Clustering of Gene Expression Profiles," Bioinformatics, Volume 20, Issue 16, pp. 2605-2617, 2004.
11. Pavlov, D., Mao, J. and Dom, B., "Scaling-Up Support Vector Machines Using Boosting Algorithm," Proceedings of the 15th International Conference on Pattern Recognition, Volume 2, pp. 219-222, 2000.
12. Platt, J., "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," in B. Scholkopf, C. J. C. Burges and A. J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
13. Xu, R. and Wunsch, D., II, "Survey of Clustering Algorithms," IEEE Transactions on Neural Networks, Volume 16, Issue 3, May 2005, pp. 645-678.
14. Schohn, G. and Cohn, D., "Less is More: Active Learning with Support Vector Machines," in Proc. 17th Int. Conf. on Machine Learning, Stanford, CA, 2000.
15. Shih, L., Rennie, D. M., Chang, Y. and Karger, D. R., "Text Bundling: Statistics-Based Data Reduction," in Proc. of the Twentieth Int. Conf. on Machine Learning (ICML-2003), Washington, DC, 2003.
16. Tong, S. and Koller, D., "Support Vector Machine Active Learning with Applications to Text Classification," in Proc. 17th Int. Conf. on Machine Learning, Stanford, CA, 2000.
17. Vapnik, V., The Nature of Statistical Learning Theory, Springer, N.Y., 1995.
18. Van Gestel, T., Suykens, J. A. K., De Moor, B. and Vandewalle, J., "Bayesian Inference for LS-SVMs on Large Data Sets Using the Nyström Method," Proceedings of the 2002 International Joint Conference on Neural Networks, Volume 3, pp. 2779-2784, 2002.
19. Yu, H., Yang, J. and Han, J., "Classifying Large Data Sets Using SVMs with Hierarchical Clusters," in Proc. of the 9th ACM SIGKDD 2003, August 24-27, 2003, Washington, DC, USA.
20. Xu, R. and Wunsch, D., "Survey of Clustering Algorithms," IEEE Trans. Neural Networks, Vol. 16, pp. 645-678, 2005.