Multiclass Support Vector Machines Using Adaptive Directed Acyclic Graph

Boonserm Kijsirikul and Nitiwut Ussivakul
Department of Computer Engineering, Chulalongkorn University
Phayathai Road, Pathumwan, Bangkok, 10330, Thailand

E-mail: [email protected] and [email protected]

Abstract: This paper presents a method of extending Support Vector Machines (SVMs) for dealing with multiclass problems. Motivated by the Decision Directed Acyclic Graph (DDAG), we propose the Adaptive DAG (ADAG): a modified structure of the DDAG that has a lower number of decision levels and reduces the dependency on the sequence of nodes. Thus, the ADAG improves the accuracy of the DDAG while maintaining its low computational requirements.

I. INTRODUCTION

A Support Vector Machine (SVM) [2,6,8] is gaining increasing attention from researchers for its outstanding performance in real-world applications. Nevertheless, extending SVMs, which are binary classifiers, to solve multiclass problems is still an open research area. Some methods for constructing multiclass classifiers from binary SVMs, such as One-against-One (1-v-1) [4] and One-against-the-Rest (1-v-R) [8], have been proposed. The Max Wins algorithm [3], which is one of the 1-v-1 methods, offers faster training time compared to the 1-v-R method. Using a new learning architecture, the DDAG, Platt et al. [5] proposed the DAGSVM algorithm, which reduces training and evaluation time while maintaining accuracy comparable to Max Wins.

In this paper we point out some limitations of the DDAG caused by the dependency on the sequence of its nodes. This dependency leads to high variance in classification results and, hence, reduces the reliability of the algorithm. Moreover, the DDAG structure requires an unnecessarily high number of node evaluations for the correct class, causing high cumulative error. Our modified version of the DDAG improves reliability by reducing the dependency on the sequence of nodes and increases accuracy by minimizing the number of node evaluations for the correct class. These advantages are due to a tournament-based architecture that yields a uniform distribution of output probability and is structurally flatter than the DDAG. We prove that the expected accuracy of our method is higher than that of the DDAG, and also empirically evaluate our method by comparing it with the DDAG on three data sets, i.e. the Thai tone recognition, Thai vowel recognition and UCI Letter data sets.

In the next section we briefly describe SVM concepts and multiclass SVMs. In Section III, we present a summary of the DAGSVM algorithm, which is based on placing 1-v-1 SVMs into the nodes of a DDAG. We then point out its limitations caused by the dependency on the sequence of nodes of the DDAG and by the long evaluation path for a correct class. In Section IV, we introduce modifications to the architecture of the DDAG, giving more reliability and accuracy, especially in the case of data sets with a large number of classes. The analysis of both methods is presented in Section V. The experiments illustrating the improvement are presented in Section VI, and finally the conclusion is given in Section VII.

II. SUPPORT VECTOR MACHINES

This section introduces the basic idea of SVMs and techniques for constructing multiclass SVMs.

A. Linear Support Vector Machines

Suppose we have a data set D of l samples in an n-dimensional space belonging to two different classes (+1, -1):

D = \{ (x_k, y_k) \mid k \in \{1, \ldots, l\},\ x_k \in \mathbb{R}^n,\ y_k \in \{+1, -1\} \}    (1)

A hyperplane in the n-dimensional space is determined by the pair (w, b), where w is an n-dimensional vector orthogonal to the hyperplane and b is the offset constant. The hyperplane (w \cdot x) + b = 0 separates the data if and only if

(w \cdot x_i) + b > 0 \quad \text{if } y_i = +1
(w \cdot x_i) + b < 0 \quad \text{if } y_i = -1    (2)

If we additionally require that w and b be such that the point closest to the hyperplane has a distance of 1/|w|, then we have

(w \cdot x_i) + b \ge +1 \quad \text{if } y_i = +1
(w \cdot x_i) + b \le -1 \quad \text{if } y_i = -1    (3)

which is equivalent to

y_i [(w \cdot x_i) + b] \ge 1, \quad \forall i    (4)

The optimal separating hyperplane is the hyperplane that maximizes the minimum distance between the hyperplane and any sample of training data. The distance between two closest samples from different classes is

d(w, b) = \min_{\{x_i \mid y_i = +1\}} \frac{(w \cdot x_i) + b}{|w|} - \max_{\{x_i \mid y_i = -1\}} \frac{(w \cdot x_i) + b}{|w|}    (5)

From (3), we can see that the appropriate minimum and maximum values are \pm 1. Therefore, we need to maximize

d(w, b) = \frac{1}{|w|} - \frac{-1}{|w|} = \frac{2}{|w|}    (6)

Thus, the problem is equivalent to:

  minimize  |w|^2 / 2
  subject to the constraint  y_i [(w \cdot x_i) + b] \ge 1, \quad \forall i
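As an illustration of the optimization problem above, the following is a minimal sketch (not from the paper) that solves the hard-margin primal with a general-purpose solver; the toy data and the choice of SciPy's SLSQP method are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable 2-D data (assumed for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def objective(params):
    w = params[:-1]              # params = [w_1, ..., w_n, b]
    return 0.5 * np.dot(w, w)    # minimize |w|^2 / 2

constraints = [
    # y_i [(w . x_i) + b] - 1 >= 0 for every training sample
    {"type": "ineq",
     "fun": (lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1.0)}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b)
```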

For the non-separable case, the training data cannot be separated by a hyperplane without error. The previous constraints must then be modified. A penalty term consisting of the sum of deviations \xi_i from the boundary is added to the minimization problem. Now, the problem is to:

  minimize  \frac{|w|^2}{2} + C \sum_{i=1}^{l} \xi_i
  subject to the constraints
    (1)  y_i [(w \cdot x_i) + b] \ge 1 - \xi_i, \quad \forall i
    (2)  \xi_i \ge 0, \quad \forall i

The penalty term for misclassifying training samples is weighted by a constant C. Selecting a large value of C puts a high price on deviations and increases computation by effecting a more exhaustive search for ways to minimize the number of misclassified samples. By forming the Lagrangian and solving the dual problem, this problem can be translated into:

  maximize  L(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)    (7)
  subject to the constraints:
    (1)  0 \le \alpha_i \le C, \quad \forall i
    (2)  \sum_{i=1}^{l} \alpha_i y_i = 0

where the \alpha_i are called Lagrange multipliers. There is one Lagrange multiplier for each training sample. In the solution, those samples for which \alpha_i > 0 are called support vectors, and are the ones for which the equality in (4) holds. All other training samples, having \alpha_i = 0, could be removed from the training set without affecting the final hyperplane. Let \alpha^0, an l-dimensional vector, denote the solution of (7). If \alpha_i^0 > 0 then x_i is a support vector. The optimal separating hyperplane (w_0, b_0) can be written in terms of \alpha^0 and the training data, specifically in terms of the support vectors:

  w_0 = \sum_{i=1}^{l} \alpha_i^0 y_i x_i = \sum_{\text{support vectors}} \alpha_i^0 y_i x_i    (8)

  b_0 = 1 - w_0 \cdot x_i \quad \text{for } x_i \text{ with } y_i = 1 \text{ and } 0 < \alpha_i^0 < C    (9)

The optimal separating hyperplane classifies points according to the sign of f(x),

  f(x) = \text{sign}\big( (w_0 \cdot x) + b_0 \big) = \text{sign}\Big( \sum_{\text{support vectors}} \alpha_i^0 y_i (x_i \cdot x) + b_0 \Big)    (10)
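To make equations (8)-(10) concrete, here is a minimal sketch (our own illustration, not the authors' code) that evaluates the decision function f(x) from a given set of support vectors, their labels, and their Lagrange multipliers; the numerical values are made up.

```python
import numpy as np

# Assumed example values: support vectors, labels y_i, and multipliers alpha_i^0.
support_vectors = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]])
labels = np.array([1, 1, -1])
alphas = np.array([0.5, 0.3, 0.8])          # satisfies sum(alpha_i * y_i) = 0

# Equation (8): w_0 as a weighted sum over the support vectors.
w0 = np.sum((alphas * labels)[:, None] * support_vectors, axis=0)

# Equation (9): b_0 from a positive support vector (0 < alpha < C assumed).
b0 = 1.0 - np.dot(w0, support_vectors[0])

def f(x):
    """Equation (10): classify x by the sign of the decision function."""
    return np.sign(np.dot(w0, x) + b0)

print(f(np.array([2.0, 2.0])), f(np.array([-2.0, -2.0])))
```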

B. Non-Linear Support Vector Machines

The above algorithm is limited to linear separating hyperplanes. SVMs get around this limitation by mapping the sample points into a higher dimensional space using a non-linear mapping chosen in advance. That is, we choose a map \Phi: \mathbb{R}^n \rightarrow H, where the dimensionality of H is greater than n. We then seek a separating hyperplane in the higher dimensional space; this is equivalent to a non-linear separating surface in \mathbb{R}^n.

The data only ever appear in the training problem (7) in the form of dot products, so in the higher dimensional space we only deal with the data in the form \Phi(x_i) \cdot \Phi(x_j). If the dimensionality of H is very large, then this could be difficult or very computationally expensive. However, if we have a kernel function such that k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j), then we can use it in place of x_i \cdot x_j everywhere in the optimization problem, and never need to know explicitly what \Phi is. Some widely used kernels are:

  Polynomial of degree d:   k(x, y) = (x \cdot y + 1)^d    (11)
  Radial basis function:    k(x, y) = \exp(-|x - y|^2 / c)    (12)

C. Multiclass Classifiers

Now, we discuss solving the multiclass (N-class) problem by considering it as a collection of binary classification problems.

1) The One-Against-the-Rest Approach: This approach works by constructing a set of N binary classifiers. The ith classifier is trained with all of the examples in the ith class as positive examples, and all other examples as negative examples. The final output is the class that corresponds to the classifier with the highest output value. This approach is referred to as 1-v-R.

2) The One-Against-One Approach: This approach simply constructs all possible binary classifiers from a training set of N classes. Each classifier is trained on only two out of the N classes; thus, there are N(N-1)/2 classifiers. This approach is referred to as 1-v-1. In the Max Wins algorithm, which is one of the 1-v-1 methods, a test example is classified by all of the classifiers. Each classifier provides one vote for its preferred class, and the class with the majority of votes is taken as the final output. If more than one class receives the highest score, however, one of them is randomly selected as the final output.
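The following is a small sketch of the Max Wins voting scheme just described; the pairwise decision functions are passed in as plain callables, and the random tie-break follows the text (this is our own illustration, not the original implementation).

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple

def max_wins(x,
             classes: Sequence[int],
             pairwise: Dict[Tuple[int, int], Callable]) -> int:
    """Classify x with 1-v-1 Max Wins voting.

    pairwise[(i, j)](x) is assumed to return i or j, the class preferred
    by the binary classifier trained on classes i and j.
    """
    votes = defaultdict(int)
    for pair, clf in pairwise.items():
        votes[clf(x)] += 1                  # one vote per binary classifier
    best = max(votes.values())
    winners = [c for c in classes if votes[c] == best]
    return random.choice(winners)           # random tie-break, as in the text
```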

III. DAGSVM

A disadvantage of 1-v-1 SVMs is their inefficiency in classifying data, as the number of SVMs grows superlinearly with the number of classes. Platt et al. introduced a novel algorithm, DAGSVM, to remedy this disadvantage [5].

A. Decision DAGs

A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. Platt et al. used a rooted binary DAG, a DAG that has a unique node with no arcs pointing into it and whose other nodes have either 0 or 2 arcs leaving them, as a class of functions for classification tasks.

In a problem with N classes, a rooted binary DAG has N leaves labeled by the classes, and each of the N(N-1)/2 internal nodes is labeled with an element of a Boolean function. The nodes are arranged in a triangular shape with the single root node at the top, two nodes in the second layer, and so on until the final layer of N leaves. The ith node in layer j < N is connected to the ith and (i+1)st nodes in the (j+1)st layer.

To evaluate a DDAG, starting at the root node, the binary function at a node is evaluated. The node is then exited via the left edge if the binary function is -1, or via the right edge if the binary function is 1, and the next node's binary function is evaluated. The value of the decision function is the value associated with the final leaf node (see Figure 1). Only N-1 decision nodes are evaluated in order to derive an answer.

FIGURE 1: THE DDAG FINDING THE BEST CLASS OUT OF FOUR CLASSES.

The DDAG can be implemented using a list, where each node eliminates one class from the list. The implementation list is initialized with a list of all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. The current state of the list is the total state of the system.

B. The DAGSVM Algorithm

The DAGSVM algorithm creates a DDAG whose nodes are maximum-margin classifiers over a kernel-induced feature space. Such a DDAG is obtained by training each i-j node only on the subset of training points labeled by i or j. The final class decision is derived by using the DDAG architecture described in Section III(A). The DAGSVM separates the individual classes with a large margin. It is safe to discard the losing class at each 1-v-1 decision because all of the examples of the losing class are far away from the decision surface. For the DAGSVM, the choice of the class order in the list (or DDAG) is arbitrary. A sketch of the list-based evaluation is given below.
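Below is a minimal sketch of the list-based DDAG evaluation described above (our own illustration; the pairwise decision functions are assumed to return the preferred class).

```python
from typing import Callable, Dict, List, Tuple

def ddag_classify(x,
                  classes: List[int],
                  pairwise: Dict[Tuple[int, int], Callable]) -> int:
    """Evaluate a DDAG on x using the implementation list.

    pairwise[(i, j)](x) is assumed to return i or j, the class preferred
    by the binary classifier trained on classes i and j.
    """
    remaining = list(classes)               # the implementation list
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        key = (first, last) if (first, last) in pairwise else (last, first)
        preferred = pairwise[key](x)
        if preferred == first:
            remaining.pop()                 # eliminate the last element
        else:
            remaining.pop(0)                # eliminate the first element
    return remaining[0]
```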



C. Issues on the DDAG

Although the DAGSVM outperforms the standard algorithms in terms of speed, its dependency on the sequence of nodes in the DDAG and the large number of node evaluations for the correct class may adversely affect its accuracy and reliability. In particular, we found that the output of the DDAG depends on the sequence of binary classifiers in its nodes, which affects the reliability of the algorithm. In addition, a correct class placed in a node near the root is clearly at a disadvantage compared with a correct class near the leaf nodes, since it is exposed to a higher risk of being incorrectly rejected.

FIGURE 2: THE OUTPUT FROM THE DDAG DEPENDS ON ITS STRUCTURE.

Let us illustrate the first issue using a pictorial example of a 4-class classification problem in two-dimensional space, as shown in Figure 2. In the figure, '1', '2', '3' and '4' are class labels. As depicted in Figure 2, data points in the shaded area are classified differently depending on the sequence of nodes in the DDAG (or the sequence of elements in the implementation list). For example, consider a data point 'x' located in the shaded region. In the case of a list 1-2-3-4, first, the hyperplane 1vs4 eliminates class '4' because the data point 'x' is not on the side of class '4'. The list then becomes 1-2-3. Next, 1vs3 removes class '3' from the list. Finally, with the list 1-2, the hyperplane 1vs2 gets rid of class '1'. As a result, the data point 'x' is classified as class '2'. In another case of a list 2-1-3-4, first, the hyperplane 2vs4 eliminates class '2'. The list then becomes 1-3-4. Next, 1vs4 removes class '4' from the list. Finally, with the list 1-3, the hyperplane 1vs3 gets rid of class '3'. As a result, the data point 'x' is classified as class '1', which differs from the first case. This example clearly illustrates our first concern, the dependency of the output on the sequence of nodes of the DDAG.

The other issue is that the number of node evaluations for the correct class is unnecessarily high. This results in high cumulative error and, hence, lower accuracy. The depth of the DDAG is N-1, which means that the number of times the correct class has to be tested against other classes, on average, scales linearly with N. Consider a 20-class problem. If the correct class is evaluated at the root node, it is tested against other classes 19 times before it is correctly classified as the output. Despite the large margin, there is some probability of misclassification, say 1%, and this causes a cumulative error of 1 - 0.99^19 = 17.38% in this situation. This shortcoming becomes more severe as the number of classes increases. The two issues raised here motivate us to modify the DDAG.

IV. NEW APPROACH

In this section we introduce a new approach to alleviate the problems of the DDAG structure. The new structure, the Adaptive DAG, reduces the dependency on the sequence of nodes and lowers the depth of the DAG, and consequently the number of node evaluations for a correct class.

A. Adaptive DAGs

An Adaptive DAG (ADAG) is a DAG with a reversed triangular structure. In an N-class problem, the system comprises N(N-1)/2 binary classifiers. The ADAG has N-1 internal nodes, each of which is labeled with an element of a Boolean function. The nodes are arranged in a reversed triangle with N/2 nodes (rounded up) at the top, N/2^2 nodes in the second layer, and so on until the lowest layer of a single final node, as shown in Figure 3.

FIGURE 3: THE STRUCTURE OF AN ADAPTIVE DAG FOR AN 8-CLASS PROBLEM.

To classify using the ADAG, starting at the top layer, the binary function at each node is evaluated. The node is then exited via the outgoing edge with a message of the preferred class. In each round, the number of candidate classes is reduced by half. Based on the preferred classes from its parent nodes, the binary function of the next-level node is chosen. The reduction process continues until reaching the final node at the lowest layer. The value of the decision function is the value associated with the message from the final node (see Figure 3). Like the DDAG, the ADAG requires only N-1 decision nodes to be evaluated in order to derive an answer. Note, however, that the correct class is evaluated against other classes at most \lceil \log_2 N \rceil times, considerably fewer evaluations than required by the DDAG, which scales linearly with N.

B. Implementation

An ADAG can be implemented using a list, where each node eliminates one class from the list (see Figure 4). The implementation list is initialized with a list of all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, that class is kept in the left element's position while the other class is eliminated from the list. The ADAG then proceeds to test the second element and the element before the last of the list. The testing process of each round ends when either one or no class remains untested in the list. After each round, the list is reduced to N/2 elements (rounded up). The process then repeats until only one class remains in the list. A sketch of this procedure is given below.

FIGURE 4: IMPLEMENTATION THROUGH THE LIST.
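Here is a minimal sketch of the list-based ADAG evaluation described above (our own illustration, with the same assumed form of pairwise decision functions as before).

```python
from typing import Callable, Dict, List, Tuple

def adag_classify(x,
                  classes: List[int],
                  pairwise: Dict[Tuple[int, int], Callable]) -> int:
    """Evaluate an ADAG on x: in each round, pair the first and last
    elements of the list and keep only the preferred classes, so the list
    shrinks to about half its size per round (ceil(log2 N) rounds)."""
    remaining = list(classes)
    while len(remaining) > 1:
        next_round = []
        i, j = 0, len(remaining) - 1
        while i < j:
            a, b = remaining[i], remaining[j]
            key = (a, b) if (a, b) in pairwise else (b, a)
            next_round.append(pairwise[key](x))   # keep the preferred class
            i += 1
            j -= 1
        if i == j:                                # odd count: the middle
            next_round.append(remaining[i])       # class advances unopposed
        remaining = next_round
    return remaining[0]
```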

V. ANALYSIS OF DDAG AND ADAG

In this section, we analyze the expected accuracy of the DDAG and ADAG. In the following analysis, we assume a uniform distribution of the probability that the correct class is in any position in the list. We also assume that the probability of the correct class being eliminated from the list, when it is tested against another class, is p, and that the probability of one of any two classes other than the correct class being eliminated from the list is 0.5 when they are tested against each other.

Given the above assumptions, we first illustrate the expected accuracy of the DDAG by the following example. Consider a four-class problem. Figure 5 shows all patterns in which the correct class will be correctly classified by the DDAG. The correct class will be correctly classified if it is not eliminated from the list. This means that when it is at the edge (the first or the last position) of the list, all other classes have to be excluded from the list. Under a uniform distribution, the probability that the correct class is in any given position of the list is 1/4. In the case that the correct class (indicated by 'x' in the figure) is at the edge of the current list, it will be correctly classified if all other classes are eliminated from the list. The probability of this is (1-p)^{N-1}, where N is the number of elements in the list.


FIGURE 5: AN EXAMPLE OF A FOUR-CLASS PROBLEM.

In the case that the correct class is not at the edge, we have two possible choices, i.e. to remove the first element or to remove the last element from the list. This reduces the number of elements one by one. From the above example, the probability that the correct class is correctly classified is:

  (1/4)(1-p)^3 + (1/4)(1/2)^1(1-p)^2 + (1/4)(1/2)^2(1-p)^1 + (1/4)(1/2)^2(1-p)^1 + (1/4)(1/2)^2(1-p)^1 + (1/4)(1/2)^2(1-p)^1 + (1/4)(1/2)^1(1-p)^2 + (1/4)(1-p)^3
  = (1/4)[(1-p)/p + (1-p)^3 - (1-p)^4/p]

Theorem 1. Let p be the probability that the correct class will be eliminated from the implementation list when it is tested against another class, and let the probability of one of any two classes, except for the correct class, being eliminated from the list be 0.5. Then, under a uniform distribution of the position of the true class in the list, the expected accuracy of the DDAG is

  (1/N)[(1-p)/p + (1-p)^{N-1} - (1-p)^N/p],

where N is the number of classes.

Proof. The proof is omitted due to the space available, but it can be generalized from the above example.


Theorem 2. Let p be the probability that the correct class will be eliminated from the implementation list when it is tested against another class, and let the probability of one of any two classes, except for the correct class, being eliminated from the list be 0.5. Then the accuracy of the ADAG is at worst (1-p)^{\lceil \log_2 N \rceil}, where N is the number of classes, and \lceil x \rceil is the least integer greater than or equal to x.

Proof. Given N classes of examples, the height (the number of layers) of the ADAG is obviously at most \lceil \log_2 N \rceil. Therefore, \lceil \log_2 N \rceil is an upper bound on the number of times that the correct class is compared with other classes. Thus the accuracy of the ADAG is at worst (1-p)^{\lceil \log_2 N \rceil}.
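As a quick numerical check of Theorems 1 and 2 (our own sketch, mirroring the comparison plotted in Figure 6), the following computes both expressions for a range of class counts with p = 0.001.

```python
import math

def ddag_expected_accuracy(N: int, p: float) -> float:
    """Theorem 1: (1/N)[(1-p)/p + (1-p)^(N-1) - (1-p)^N / p]."""
    q = 1.0 - p
    return (q / p + q ** (N - 1) - q ** N / p) / N

def adag_worst_case_accuracy(N: int, p: float) -> float:
    """Theorem 2: (1-p)^ceil(log2 N)."""
    return (1.0 - p) ** math.ceil(math.log2(N))

p = 0.001
for N in (4, 8, 16, 32, 64, 128):
    print(N,
          round(ddag_expected_accuracy(N, p), 4),
          round(adag_worst_case_accuracy(N, p), 4))
```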

Figure 6 shows the relationship between the number of classes and the accuracy of the DDAG and ADAG computed from the theorems with p = 0.001. As shown in the figure, the difference in accuracy becomes clearer as the number of classes increases.

FIGURE 6: THE COMPARISON OF DDAG AND ADAG.

VI. EXPERIMENTS

In this section, we evaluate the performance of our modification on three different data sets: the Thai tone data set [7], the Thai vowel data set [7], and the UCI Letter data set [1]. For each data set, several different sequences of nodes are chosen randomly and used for running a number of experiments evaluating both the DDAG and the ADAG. Each time, the accuracy of a sequence is recorded, and the average accuracy (Avg.) and standard deviation (S.D.) are taken as the results for each DAG. For each DAG, we vary the parameter of the kernel function and use the one giving the best results. The numbers of experiments for the Thai tone data set, the Thai vowel data set, and the UCI Letter data set are 120, 20,000 and 50,000, respectively.

A) The Thai Tone Data Set

The Thai tone data set consists of five different lexical tones (mid, low, falling, high, and rising). The 6-dimensional features are extracted from the fundamental frequencies of a scaled input syllable. This data set comprises data from two tests, i.e., the inside and outside tests. The inside test includes 12,384 training examples, while the outside test includes 6,192. Both tests have 3,096 test examples.

TABLE 1: RESULTS OF THE THAI TONE DATA SET.

                        Polynomial                  RBF
                    d     Avg.    S.D.        c     Avg.    S.D.
  The inside test
    DDAG            10    95.81   0.02        0.1   96.09   0.00
    ADAG            10    95.81   0.02        0.1   96.09   0.00
  The outside test
    DDAG            4     90.23   0.18        0.5   90.39   0.08
    ADAG            4     90.24   0.18        0.5   90.39   0.07

Here d and c are the parameters in the Polynomial kernel ((x \cdot y + 1)/6)^d and the RBF kernel \exp(-|x-y|^2/6c), respectively.

In this data set of a 5-class problem, the results shown in Table 1 are similar for both algorithms. Although the ADAG seems to be slightly better than the DDAG in the outside test, the two algorithms are not statistically different.

B) The Thai Vowel Data Set


The Thai vowel data set consists of 12 classes (12 long vowels) whose 72-dimensional features are obtained by separating each vowel segment into three regions and, for each region, computing 12-order RASTA features and their time derivatives [7]. This data set also comprises data from two tests, i.e., the inside and outside tests. The inside test includes 12,384 training examples, while the outside test includes 6,192. Both tests have 3,096 test examples.

TABLE 2: RESULTS OF THE THAI VOWEL DATA SET.

                        Polynomial                  RBF
                    d     Avg.    S.D.        c     Avg.    S.D.
  The inside test
    DDAG            7     94.44   0.08        0.3   96.65   0.05
    ADAG            7     94.46   0.06        0.3   96.66   0.05
  The outside test
    DDAG            6     86.09   0.11        0.4   86.75   0.08
    ADAG            6     86.12   0.10        0.4   86.78   0.08

Here d and c are the parameters in the Polynomial kernel ((x \cdot y + 1)/72)^d and the RBF kernel \exp(-|x-y|^2/72c), respectively.

When the number of classes increases to 12 in this data set, the average accuracy of the ADAG is higher, as shown in Table 2. Moreover, the standard deviation of its accuracy is lower than that of the DDAG; only for the RBF kernel does the standard deviation not differ.

C) The UCI Letter Data Set

The UCI Letter data set comprises 26 classes, the letters A to Z. The 15-dimensional features are statistics measured from printed font glyphs. The training set consists of the first 16,000 examples, and the test set consists of the rest (4,000). All features of the UCI Letter data set were scaled to lie in [-1,1]. We used the RBF kernel \exp(-|x-y|^2/c) with c = 0.5. The results are shown in Table 3 and Figure 7.
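For readers who want to reproduce this kind of setup, the following is a sketch using scikit-learn (not the authors' implementation). Note that scikit-learn's RBF kernel is written exp(-gamma |x-y|^2), so gamma = 1/c here; the data arrays are placeholders and the value of C is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder arrays standing in for the 16,000 training and 4,000 test
# examples of the UCI Letter data; load your own copies here.
X_train = np.random.rand(100, 15)
y_train = np.random.randint(0, 26, 100)
X_test = np.random.rand(10, 15)

# Scale every feature to [-1, 1], as described in the text.
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# RBF kernel exp(-|x-y|^2 / c) with c = 0.5 corresponds to gamma = 1/c = 2.0.
# SVC trains all N(N-1)/2 pairwise classifiers internally (a 1-v-1 scheme).
clf = SVC(kernel="rbf", gamma=2.0, C=10.0)   # C chosen arbitrarily here
clf.fit(X_train, y_train)
print(clf.predict(X_test))
```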

TABLE 3: RESULTS OF THE UCI LETTER DATA SET.

            c      Average Accuracy    Standard Deviation
  DDAG      0.5    96.39               0.08
  ADAG      0.5    96.48               0.06


FIGURE 7: DISTRIBUTION OF ACCURACY

In this test of a more difficult problem (26 classes), the ADAG has higher accuracy and lower standard deviation than the DDAG at the 0.01 significance level (99% confidence interval). From the above experiments, we may conclude qualitatively that: (i) the optimal kernels and parameters for the DDAG and the ADAG are quite similar; (ii) in general, the RBF kernel outperforms the polynomial kernel in both the recognition rate and the standard deviation of the accuracy; (iii) in a problem with a small number of classes, the improvement is insignificant, but as the number of classes rises, the ADAG is at an advantage; and (iv) the improvement comes in the form of higher accuracy with higher confidence of achieving that accuracy. The new approach is thus empirically shown to increase accuracy and confidence, especially in problems with a high number of classes.

VII. CONCLUSION

In this paper we have pointed out some sources of impediment in the DDAG caused by its structure. We have presented a modified structure, the Adaptive DAG, that alleviates the problem by reducing the dependency on the sequence of nodes in the structure as well as lowering the number of node evaluations for the correct class. Our experiments are evidence that the new approach yields higher accuracy and reliability of classification, especially when the number of classes is relatively large.

Since the DAGSVM is one of the fastest SVM algorithms for multiclass classification, this modification of the DDAG helps improve its accuracy and reliability even further. However, there is another concern: although the new approach is less dependent on the order of the binary classes, there are still differences in accuracy between sequences. We leave this issue for future research.

Acknowledgment

The work was supported in part by the National Electronics and Computer Technology Center (NECTEC) under project number NT-B-06-4F-13-311.

References

[1] C. Blake, E. Keogh, and C. Merz, "UCI Repository of Machine Learning Databases", Dept. of Information and Computer Science, University of California, Irvine, 1998.
[2] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] J. H. Friedman, "Another Approach to Polychotomous Classification", Technical report, Department of Statistics, Stanford University, 1996.
[4] S. Knerr, L. Personnaz, and G. Dreyfus, "Single-Layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network", in Fogelman-Soulie and Herault, editors, Neurocomputing: Algorithms, Architectures and Applications, NATO ASI Series, Springer, 1990.
[5] J. Platt, N. Cristianini, and J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification", Advances in Neural Information Processing Systems, 12, pp. 547-553, 1999.
[6] B. Schölkopf, "Support Vector Learning", Ph.D. Thesis, R. Oldenbourg Verlag, Munich, Germany, 1997.
[7] N. Thubthong and B. Kijsirikul, "Support Vector Machines for Thai Phoneme Recognition", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002 (to appear).
[8] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
