Automated Parameter Selection for Support Vector Machine Decision Tree

Gyunghyun Choi and Suk Joo Bae

Department of Industrial Engineering, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul, Korea
[email protected]

Abstract. A support vector machine (SVM) provides an optimal separating hyperplane between two classes to be separated. However, the SVM, like a neural network, gives only recognition results in a black-box structure. As an alternative, the support vector machine decision tree (SVDT) provides useful information on key attributes while retaining a number of advantages of the SVM. We propose an automated parameter selection scheme for the SVDT to improve efficiency and accuracy in classification problems. Two practical applications confirm that the proposed method has the potential to improve generalization and reduce classification error in the SVDT.

1 Introduction

Pattern recognition has applications in various fields of practice, such as automatic analysis of medical images, quality inspection in automated manufacturing systems, and prediction of geological changes. The support vector machine (SVM), first proposed by Vapnik [4], is based on a solid theoretical structure and has provided excellent pattern-recognition performance in a number of real applications. In classification problems, an exemplary area of pattern recognition, the SVM provides a separating hyperplane between the two classes to be separated. Since the separating hyperplane can be applied to various problems, e.g., nonlinear pattern recognition, function regression, HCI, data mining, web mining, computer vision, artificial intelligence, and medical diagnosis, research on the SVM has become increasingly active. However, the SVM, like a neural network, gives only recognition results in a black-box structure; it hardly provides useful information about which attributes affect the results. Accordingly, the support vector machine decision tree (SVDT) was suggested in order to provide information on key attributes while retaining a number of advantages of the SVM [2]. The SVDT establishes a mathematical model for each decision node and forms a separating hyperplane by solving the model. Determining appropriate parameter values is a key issue in the modeling, because the separating hyperplane changes according to the parameter values at each decision node, which in turn affects global SVDT performance.


I. King et al. (Eds.): ICONIP 2006, Part II, LNCS 4233, pp. 746–753, 2006. © Springer-Verlag Berlin Heidelberg 2006


Bennett [2] searched for proper parameter values by sequentially changing the parameters and testing them with a validation set set aside from part of the training data. However, this approach takes much time and effort to conduct the test and analyze the results at each decision node. In this paper, we propose an automated scheme for parameter selection in the SVDT to resolve such problems. The paper is organized as follows. In Section 2, we briefly review the mathematical models for the SVM and the SVDT. In Section 3, an automated scheme is proposed to select the parameter in the SVDT. In Section 4, we provide two examples to illustrate the procedure. Some concluding remarks are presented in Section 5.

2 Support Vector Machine Decision Tree

2.1 Support Vector Machine

As a tool of pattern recognition, the support vector machine (SVM) presents high performance in recognizing a variety of patterns. Like a radial-basis function network, the SVM nonlinearly projects patterns in the input space into a high-dimensional feature space and then analyzes them linearly in that feature space. In the feature space, the SVM produces an optimal separating hyperplane to resolve classification problems. For a given dataset {(x_i, t_i), i = 1, . . . , m}, where x_i is the ith training datum belonging to one of two classes and t_i ∈ {−1, 1} is an indicator of the corresponding class, the SVM searches for the optimal separating hyperplane that maximizes the margin, i.e., the distance from the hyperplane to the closest support vectors of each class. For highly overlapped patterns that are inseparable by a linear hyperplane, an optimal linear separating hyperplane can be obtained by solving the following optimization problem:

\[
\min_{\omega,\,b,\,\eta}\; \frac{1}{2}\,\omega^{T}\omega + \lambda \sum_{i=1}^{m}\eta_i
\qquad \text{subject to} \quad t_i(\omega^{T}x_i + b) \ge 1 - \eta_i, \quad i = 1,\dots,m, \tag{1}
\]

where ω denotes the normal vector of the separating hyperplane, η_i (≥ 0) denotes the ith slack variable, and λ denotes a penalty parameter, for i = 1, . . . , m. Note that all patterns are perfectly separable when η_i = 0 for every i. Eq. (1) can be solved easily via its Lagrangian dual. However, it is impossible to discriminate all patterns with only a linear separating hyperplane, so a nonlinear separating hyperplane is needed for classifying linearly inseparable patterns. To separate such patterns, the SVM nonlinearly projects them from the input space into a high-dimensional feature space and interprets them linearly in the feature space. Using a kernel function K(x_i, x_j) = φ(x_i) · φ(x_j) for a mapping φ(·), we can solve the classification problem for nonlinear patterns. See [5] for details.
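As a small, self-contained illustration of this kernel idea (my own sketch using scikit-learn, which is not the software used in the paper; the toy dataset, kernel, and parameter values are illustrative assumptions), the following compares a linear-kernel SVM and an RBF-kernel SVM on data that no linear hyperplane in the input space can separate:

```python
# Illustrative sketch (not the authors' code): the soft-margin SVM of eq. (1)
# and the kernel trick, shown on a toy nonlinear problem with scikit-learn.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: not separable by any linear hyperplane in input space.
X, t = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear kernel: works directly in input space; C plays the role of the
# penalty parameter on the slack variables eta_i in eq. (1).
linear_svm = SVC(kernel="linear", C=1.0).fit(X, t)

# RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2): implicitly maps the
# data into a high-dimensional feature space and separates it linearly there.
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, t)

print("linear-kernel training accuracy:", linear_svm.score(X, t))
print("RBF-kernel training accuracy:   ", rbf_svm.score(X, t))
```

On this toy data the linear kernel should stay near chance level, while the RBF kernel separates the two rings almost perfectly, which is exactly the behaviour the kernel trick is meant to provide.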

2.2 L1-Norm Support Vector Machine

Given training data {(x_i, t_i), i = 1, . . . , m} with x_i ∈ R^n, the robust linear programming (RLP) model [1] is defined as

\[
\min_{\omega,\,b,\,\eta}\; \sum_{i=1}^{m}\delta_i\eta_i
\qquad \text{subject to} \quad t_i(\omega^{T}x_i + b) \ge 1 - \eta_i, \tag{2}
\]

for η_i ≥ 0, i = 1, . . . , m. Here, m is the number of training data and δ_i (> 0), representing a misclassification cost for x_i, is defined by

\[
\delta_i =
\begin{cases}
1/|c_1|, & \text{if } x_i \in c_1,\\
1/|c_2|, & \text{if } x_i \in c_2.
\end{cases}
\tag{3}
\]

This form of δ_i guarantees the existence of nontrivial solutions [1]. The objective function \(\sum_{i=1}^{m}\delta_i\eta_i\) measures the degree to which each x_i is permitted to lie closer to the optimal separating hyperplane than a support vector, or to lie on the wrong side of its half-space. Minimizing this objective generates a separating hyperplane that minimizes the classification error. To introduce the concept of structural risk minimization into the RLP objective, we add the L1-norm ‖ω‖₁, which gives the L1-norm SVM:

\[
\min_{\omega,\,b,\,\eta}\; \lambda\|\omega\|_{1} + (1-\lambda)\sum_{i=1}^{m}\delta_i\eta_i
\qquad \text{subject to} \quad t_i(\omega^{T}x_i + b) \ge 1 - \eta_i, \tag{4}
\]

where λ is a parameter that trades off the margin against the classification error and satisfies 0 ≤ λ ≤ 1 [3]. By using the L1-norm ‖ω‖₁ instead of the L2-norm ‖ω‖₂ as in the standard SVM, the L1-norm SVM has two advantages:

1. The L1-norm reduces the data dimension more effectively by producing more zero components in ω than the L2-norm. The fewer the attributes, the higher the interpretability.
2. The L1-norm SVM can be solved by linear programming instead of quadratic programming. Widely used linear programming packages such as LINDO and CPLEX are more efficient and more stable than quadratic programming solvers for large-scale problems, in particular when the training data are sparse.
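To make the second advantage concrete, here is a hedged sketch (an assumed formulation, not the authors' AMPL/CPLEX model) of eq. (4) written as a linear program and solved with SciPy's linprog; the auxiliary variables s_j bound |ω_j| so that their sum equals ‖ω‖₁ at the optimum.

```python
# Sketch of the L1-norm SVM (eq. (4)) as a linear program (assumed formulation).
# Decision variables are stacked as z = [w (n), b (1), eta (m), s (n)].
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, t, delta, lam=0.5):
    """X: (m, n) array, t: (m,) array of +/-1, delta: (m,) misclassification costs."""
    X, t, delta = np.asarray(X, float), np.asarray(t, float), np.asarray(delta, float)
    m, n = X.shape
    n_var = 2 * n + m + 1

    # Objective: lam * sum_j s_j + (1 - lam) * sum_i delta_i * eta_i
    c = np.zeros(n_var)
    c[n + 1:n + 1 + m] = (1.0 - lam) * delta
    c[n + 1 + m:] = lam

    rows, rhs = [], []
    # t_i (w.x_i + b) >= 1 - eta_i   <=>   -t_i x_i.w - t_i b - eta_i <= -1
    for i in range(m):
        row = np.zeros(n_var)
        row[:n] = -t[i] * X[i]
        row[n] = -t[i]
        row[n + 1 + i] = -1.0
        rows.append(row)
        rhs.append(-1.0)
    # |w_j| <= s_j   <=>   w_j - s_j <= 0  and  -w_j - s_j <= 0
    for j in range(n):
        up = np.zeros(n_var); up[j] = 1.0;  up[n + 1 + m + j] = -1.0
        lo = np.zeros(n_var); lo[j] = -1.0; lo[n + 1 + m + j] = -1.0
        rows += [up, lo]
        rhs += [0.0, 0.0]

    # w and b are free; eta and s are nonnegative.
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n]          # weight vector w and bias b
```

The costs δ_i can be set as in eq. (3), e.g. `delta = np.where(t == 1, 1.0 / np.sum(t == 1), 1.0 / np.sum(t == -1))`; zero components of the returned w then indicate attributes that the L1-norm has pruned.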

2.3 Support Vector Decision Tree

A nonlinear separating hyperplane is mandatory for solving a variety of pattern classification problems. However, the SVM provides only classification results in a black-box structure and fails to provide information about key attributes. As an alternative, the support vector decision tree (SVDT) creates more interpretable rules with fewer attributes, so it has the potential to save data-collection costs by ignoring irrelevant attributes in later analysis. To generate the SVDT, the L1-norm SVM is first applied to produce a linear separating hyperplane that creates two half-spaces. The L1-norm SVM is then applied repeatedly to each half-space, generating two sub-half-spaces. These steps are repeated until stopping criteria are met, finally yielding a decision tree whose overall decision boundary is a nonlinear separating hyperplane. This procedure is called the "support vector decision tree (SVDT)" [2]; a rough sketch is given below. Unlike classification and regression trees (CART) and C4.5, which are univariate decision trees in which a dataset is divided into meaningful clusters by one attribute at a time, the SVDT is a multivariate decision tree in which a dataset is divided into meaningful clusters by more than one attribute. Potentially, the SVDT achieves better dimension reduction and generates low-depth trees on large-scale datasets; thus it can reduce the chance of overfitting and provide more interpretable rules with fewer attributes.
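The following is one possible reading of this growing procedure (a rough sketch, not the authors' implementation): scikit-learn's LinearSVC with an L1 penalty stands in for the L1-norm SVM at each node, and the stopping thresholds (max_depth, min_node, purity) are illustrative assumptions.

```python
# Sketch of the recursive SVDT growing procedure (assumed implementation).
import numpy as np
from sklearn.svm import LinearSVC

def grow_svdt(X, t, depth=0, max_depth=3, min_node=20, purity=0.95):
    """X: (m, n) array, t: (m,) array of +/-1. Returns a nested dict of nodes."""
    node = {"n": len(t), "pos": int(np.sum(t == 1))}
    majority = 1 if node["pos"] >= len(t) / 2 else -1
    frac = max(node["pos"], len(t) - node["pos"]) / len(t)
    # Stopping criteria: node pure enough, too small, or tree deep enough.
    if frac >= purity or len(t) < min_node or depth >= max_depth:
        node.update(leaf=True, label=majority)
        return node

    # Fit an L1-regularized linear SVM at this node (stand-in for eq. (4)).
    clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000).fit(X, t)
    side = clf.decision_function(X) >= 0       # half-space of each training point
    if side.all() or (~side).all():            # degenerate split -> make a leaf
        node.update(leaf=True, label=majority)
        return node

    # Recurse on the two half-spaces created by the separating hyperplane.
    node.update(leaf=False, model=clf,
                right=grow_svdt(X[side], t[side], depth + 1, max_depth, min_node, purity),
                left=grow_svdt(X[~side], t[~side], depth + 1, max_depth, min_node, purity))
    return node
```

Because each split is a full hyperplane rather than a single-attribute threshold, the resulting tree is multivariate in the sense described above.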

3 Automated Scheme for Parameter Selection in SVDT

Selecting the value of the parameter λ in eq. (4) is crucial when executing a decision in the SVDT, since the value affects the global SVDT model. An appropriate value of λ must be selected while considering the trade-off between model generalization and classification accuracy. Bennett [2] used about 1/7 of the training data as a validation set and determined the value of λ by testing the generated separating hyperplane on that validation set. Introducing a penalty parameter c in place of λ, eq. (4) is transformed into

\[
\min_{\omega,\,b,\,\eta}\; \|\omega\|_{1} + c\sum_{i=1}^{m}\delta_i\eta_i
\qquad \text{subject to} \quad t_i(\omega^{T}x_i + b) \ge 1 - \eta_i, \tag{5}
\]

and the optimization problem (5) can also be dualized as follows. First, we define an equivalent problem:

\[
\begin{aligned}
\min_{\omega,\,b,\,\eta,\,s}\quad & \sum_{j=1}^{n}s_j + c\sum_{i=1}^{m}\delta_i\eta_i\\
\text{subject to}\quad & t_i(\omega^{T}x_i + b) \ge 1 - \eta_i \quad \forall i,\\
& \eta_i \ge 0 \quad \forall i,\\
& -s_j \le \omega_j \le s_j \quad \forall j.
\end{aligned}
\]

Then the Lagrangian function of the equivalent problem can be formed, and its first-order optimality conditions lead to the following dual problem:


\[
\max_{\alpha}\; Q(\alpha)=\sum_{i=1}^{m}\alpha_i
\quad\text{subject to}\quad
-e \le \sum_{i=1}^{m}\alpha_i t_i x_i \le e,\qquad
\sum_{i=1}^{m}t_i\alpha_i = 0,\qquad
0 \le \alpha_i \le c, \tag{6}
\]

m where e is (n × 1) vector in which all of its components are 1. Here, i=1 δi ηi is the degree of permission for xi to approach a separating hyperplane or to be located in the other side of half-space by passing over the separating hyperplane. In automated selection for value of the penalty c decreases  parameter c, as n in eq. (5), which means decrease of penalty on m i=1 δi ηi , ω1 (≡ j=1 ωj ) in objective function tends to decreases even if any xi is permitted to approach a separating hyperplane or to be included in the other side of half-space (that is, nηi >20). As a result, 2/ω2 , margin of separation, increases since ω2 (≡ improves. j=1 ωj ) increases, and model generalization m On the contrary, as c increases, i=1 δi ηi in objective function tends to minimize and xi of training data is more likely not to overpass  the separating hyperplane. Note that as ω1 (related to the margin) and m i=1 δi ηi (related to classification error) decreases, the margin broadens and classification error minimizes, thus appropriate value of c must be selected by simultaneously considering both margin and classification error. m n To deal with in the same scale, i=1 ωi and i=1 δi ηi are normalized with corresponding standard deviations as m n δi ηi i=1 ωi , β = i=1 , α= σ n ωi σ m δi ηi i=1

i=1

and we search for the c value to minimize α and β simultaneously, equivalently minimize α + β.
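A minimal sketch of this selection rule (an assumed reading of the procedure, not the authors' code) is given below: for each candidate c, eq. (5) is solved, ‖ω‖₁ and Σδ_iη_i are recorded, each quantity is normalized by its standard deviation across the candidates, and the c minimizing α + β is kept. The helper solve_eq5 is a hypothetical callback, e.g. a wrapper around the linear-programming sketch of Section 2.2 with the objective of eq. (5).

```python
# Sketch of the automated selection of the penalty parameter c (assumed reading).
import numpy as np

def select_c(candidates, solve_eq5):
    """candidates: iterable of c values (use several, so the standard deviations
    below are nonzero); solve_eq5(c) -> (w, eta, delta) solving eq. (5) for that c."""
    records = []
    for c in candidates:
        w, eta, delta = solve_eq5(c)
        records.append((c, np.sum(np.abs(w)), np.sum(delta * eta)))

    cs, margin_terms, error_terms = map(np.array, zip(*records))
    alpha = margin_terms / margin_terms.std()   # normalized ||w||_1 (margin side)
    beta = error_terms / error_terms.std()      # normalized sum(delta*eta) (error side)
    return cs[np.argmin(alpha + beta)]          # c minimizing alpha + beta
```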

4 Practical Applications

We applied the automated SVDT procedure to the Credit Screening Database and the Census Income Database and investigated the classification results obtained with the automated parameter-selection scheme. We used a SUN Ultra 10 workstation (333 MHz CPU, 512 MB memory) as hardware and AMPL/CPLEX as software to execute the procedure.

4.1 Credit Screening Database

The credit screening database, built by a Japanese credit card company, records 653 customers' information, including the credit approval result (granted (+1) or not granted (−1)). The customer records consist of 15 attributes: 5 continuous and 9 nominal.

[Figure 1. Applicative results of the automated SVDT for the credit screening example. (a) Training data: the root node D0 splits into leaf L1 (209 instances; +1: 9, −1: 200; response rate 4.30%, target class 4.45%, total population 46.44%) and leaf L2 (241 instances; +1: 193, −1: 48; response rate 80.08%, target class 95.54%, total population 53.55%). (b) Test data: the corresponding leaves contain 95 instances (+1: 9, −1: 86; response rate 9.47%, target class 9.57%, total population 46.79%) and 108 instances (+1: 85, −1: 23; response rate 78.70%, target class 90.42%, total population 53.20%).]

The database is sourced from the UCI Machine Learning Repository. We divided the total dataset into 450 customer records (+1: 202, −1: 248) as training data and 203 records (+1: 94, −1: 109) as test data. For a nominal attribute with n categories, we assigned the integer values 1, . . . , n to the categories. Every attribute was normalized by its standard deviation. The objective is to compare the classification results from the automated SVDT with the real credit approval records. The results of applying the automated SVDT to the credit screening database are shown in Figure 1-(a). Here the response rate, defined as the ratio of the number of class +1 instances at a node to the total number of instances at that node, is 9/209 = 4.30% for L1. The target class is defined as the ratio of the number of class +1 instances at a node to the total number of class +1 instances in the population; its value is 9/202 = 4.45% for L1. The total population is defined as the number of instances at the node divided by the number of instances in the population. Finally, the value of c obtained from the automated procedure was 1.0. As shown in Figure 1-(a), a separating hyperplane that classifies the training data well is obtained with only one branching-off. All attributes except the 8th attribute (weight ω_8 = −0.998582 with bias b = 3.00001) have zero weights, which implies that the credit screening data are separable with only one attribute. To confirm the generalization performance of the automated SVDT, the result on the test data is given in Figure 1-(b). The proposed method classifies the test data well, indicating good generalization capability.
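To make these three node statistics concrete, the short sketch below (my own helper, not from the paper) computes them for a single node and reproduces, up to rounding, the numbers quoted above for leaf L1 of Figure 1-(a).

```python
# Node statistics used to read Figures 1 and 2 (illustrative helper, not the
# authors' code). All three quantities are returned as percentages.
def node_stats(pos_at_node, total_at_node, pos_in_population, total_population):
    response_rate = 100.0 * pos_at_node / total_at_node        # +1 share within the node
    target_class = 100.0 * pos_at_node / pos_in_population     # share of all +1 captured by the node
    total_pop = 100.0 * total_at_node / total_population       # share of all data falling in the node
    return response_rate, target_class, total_pop

# Leaf L1 in Figure 1-(a): 9 of its 209 instances are +1; the training set has
# 202 instances of class +1 and 450 instances in total.
print(node_stats(9, 209, 202, 450))   # ~ (4.31, 4.46, 46.44), matching the figure up to rounding
```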

4.2 Census Income Database

The census income database, established by the U.S. Census Bureau, includes 45,222 demographic records, e.g., age, sex, job, income level, etc. The income level is classified into two groups: at most $50,000 (−1) or more than $50,000 (+1).

[Figure 2. Branching-off results in the automated SVDT: census income training data. (a) First branching-off result, (b) branching-off result from D1, (c) branching-off result from D4; in the figure each decision node is annotated with the number of attributes (one to three) used at its split. Leaf statistics recovered from the figure: L1 (class −1): 11,883 instances (+1: 657, −1: 11,226), response rate 5.52%, target class 8.75%, total population 39.39%; L2 (class −1): 2,005 instances (+1: 85, −1: 1,923), response rate 4.08%, target class 1.09%, total population 6.64%; L3 (class −1): 1,221 instances (+1: 286, −1: 935), response rate 23.2%, target class 3.80%, total population 4.04%; L4 (class +1): 5,315 instances (+1: 3,577, −1: 1,738), response rate 67.3%, target class 47.64%, total population 17.62%; L5 (class −1): 1,286 instances (+1: 147, −1: 1,139), response rate 11.43%, target class 1.95%, total population 4.26%; L6 (class −1): 5,209 instances (+1: 1,412, −1: 3,797), response rate 27.1%, target class 18%, total population 17.27%; L7 (class +1): 3,243 instances (+1: 1,347, −1: 1,896), response rate 41.53%, target class 17.94%, total population 10.75%.]

The dataset consists of 13 attributes: 6 continuous and 7 nominal. The database is also sourced from the UCI Machine Learning Repository. We divided the total dataset into 32,561 records (+1: 7,508, −1: 22,654) as training data and 15,060 records (+1: 3,700, −1: 11,360) as test data. For a nominal attribute with n categories, we assigned the integer values 1, . . . , n to the categories. Every attribute was normalized by its standard deviation. Similarly, the objective is to compare the classification results from the automated SVDT with the real census income data. The results of applying the automated SVDT to the census income database are shown in Figure 2. Figure 2-(a) represents the result of the first branching-off. The value of c obtained from the automated procedure was 1.5. Only one attribute (ω_6 = −0.780588 with bias b = 1.66667) has a non-zero weight; the other weights are zero. The results show that the automated SVDT classifies about 40% of the total data as instances of class −1.


The result of branching-off from the D1 node is given in Figure 2-(b), and the branching-off result from the D4 node is shown in Figure 2-(c).

5 Summary and Conclusions

The support vector machine decision tree establishes a mathematical model for each decision node and forms a separating hyperplane by solving the model. Determining appropriate parameter values is essential in the SVDT. The existing method requires considerable time and effort to determine the parameter, hence we propose an automated scheme for parameter selection in the SVDT. We showed with two illustrative examples that the proposed method provides efficient classification results. If a value smaller than the one produced by the automated scheme is selected, the SVDT tends to grow high-depth trees and is likely to overfit. On the contrary, if a larger value is selected, overfitting is also likely because the model concentrates on improving the classification rate on the training data. In conclusion, the parameter value selected by the automated scheme has the potential to provide more accurate results in classification problems.

References

1. Bennett, K. P., and Mangasarian, O. L. (1992), "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software, Vol. 1, 23–34.
2. Bennett, K. P., Wu, D. H., and Auslender, L. (1998), "On Support Vector Decision Trees for Database Marketing", Rensselaer Polytechnic Institute Math Report No. 98-100, Troy, New York.
3. Bradley, P. S., and Mangasarian, O. L. (1998), "Feature Selection via Concave Minimization and Support Vector Machines", Mathematical Programming Technical Report 98-03.
4. Vapnik, V. N. (1998), Statistical Learning Theory, Wiley, New York.
5. Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999), Advances in Kernel Methods: Support Vector Learning, The MIT Press, Cambridge, MA.
