Hulls (RCH) to a Nearest Point Algorithm (NPA) leads to an elegant and efficient solution to the SVM classification task, with encouraging practical results on real ...
The mapping Φ need not be known analytically; only its kernel is required, i.e., the value of the inner products of the mappings of all the samples (k(x1, x2) = ⟨Φ(x1), Φ(x2)⟩ for all x1, x2 ∈ X) [C1]. Through the “kernel trick”, it is possible to transform a nonlinear classification problem into a linear one, but in a higher (maybe infinite) dimensional space H.¹ Once the patterns are mapped into the feature space, provided that the problem for the given model (kernel) is separable, the target of the classification task is to find the maximal margin hyperplane. This classification task, expressed in its dual form, is equivalent to finding the closest points between the convex hulls generated by the (mapped) patterns of each class in the feature space [C15], i.e., it is a Nearest Point Problem (NPP). Finally, in case the classification task deals with non-separable datasets, i.e., the convex hulls of the (mapped) patterns in the feature space overlap, the problem is still solvable, provided that the corresponding hulls are reduced so that they become non-overlapping [C16], [C38]. This is illustrated in Fig. 1. Therefore, the need to resort to the notion of the reduced convex hull becomes apparent. However, in order to work in this RCH geometric framework, one has to extend the available palette of tools by a set of new RCH-related mathematical properties.

¹ In the rest of this work, to keep the notation clearer and simpler, the quantities x will be used instead of Φ(x), since in the final results the patterns enter only through inner products and not individually, therefore making the use of kernels readily applicable.
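As a concrete illustration of this point, the short Python sketch below (our own; it assumes the RBF kernel, which is also the kernel model used in the experiments of Section V) computes the kernel (Gram) matrix of a dataset, i.e., the only information about the mapped patterns that the algorithms discussed in the sequel ever need:

import numpy as np

def rbf_kernel_matrix(X1, X2, sigma):
    # K[i, j] = k(x1_i, x2_j) = exp(-||x1_i - x2_j||^2 / (2 sigma^2)):
    # inner products of the mapped patterns, without ever forming Phi(x).
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

# Example usage on a toy dataset of 10 two-dimensional patterns.
X = np.random.randn(10, 2)
K = rbf_kernel_matrix(X, X, sigma=1.0)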
In this way, the initially overlapping convex hulls, with a suitable selection of the bound μ, can be reduced so that they become separable. Once they are separable, the theory and tools developed for the separable case can be readily applied. The algebraic proof is found in [C16] and [C15] and the geometric one in [C12]. The bound μ, for a given set of original points, plays the role of a reduction factor, since it controls the size of the generated RCH; the effect of the value of the bound μ on the size of the RCH is shown in Fig. 2.
Fig. 1. The initial convex hulls (light gray), generated by the two training datasets (of disks and diamonds respectively), are overlapping; the RCHs with μ = 0.4 (darker gray) are still overlapping; however, the RCHs with μ = 0.1 (darkest gray) are disjoint and, therefore, separable. The nearest points of the RCHs, found through the application of the NPA presented here, are shown as circles and the corresponding separating hyperplane as the bold line.

Fig. 2. The RCH R(X, 2/5), generated by 5 points (stars) and having μ = 2/5, together with the points that are candidates to be extreme [C38], marked by small squares. Each point of the RCH is labeled so as to indicate the original points from which it has been constructed; the last label is the one with the lowest coefficient.
III. REDUCED CONVEX HULLS (RCH)

Definition 1: The set of all convex combinations of points in some set X, with the additional constraint that each coefficient a_i is upper-bounded by a non-negative number μ < 1, is called the reduced convex hull of X and it is denoted as R(X, μ):

  R(X, μ) = { w : w = ∑_{i=1}^{k} a_i x_i, x_i ∈ X, ∑_{i=1}^{k} a_i = 1, 0 ≤ a_i ≤ μ }.
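A minimal numerical illustration of Definition 1 (our own sketch, with hypothetical helper names) checks a coefficient vector against the two constraints and generates an arbitrary point of R(X, μ):

import numpy as np

def is_rch_coefficient(a, mu, tol=1e-10):
    # Valid reduced-convex-combination coefficients: sum(a) = 1 and 0 <= a_i <= mu.
    a = np.asarray(a, dtype=float)
    return abs(a.sum() - 1.0) < tol and np.all(a >= -tol) and np.all(a <= mu + tol)

def random_rch_coefficients(k, mu, rng=None):
    # Random valid coefficients for a set of k points.
    assert mu * k >= 1.0                     # otherwise R(X, mu) is empty
    rng = np.random.default_rng() if rng is None else rng
    a = rng.dirichlet(np.ones(k))            # ordinary convex-combination weights
    excess = np.clip(a - mu, 0.0, None).sum()
    a = np.minimum(a, mu)                    # enforce the upper bound mu
    if excess > 0.0:
        slack = mu - a                       # room left under the bound
        a += excess * slack / slack.sum()    # redistribute the clipped mass
    return a

def rch_point(X, a):
    # The point of R(X, mu) generated by the coefficient vector a (Definition 1).
    return a @ X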
Although, at a first glance, this is a nice result that paves the way to a geometric solution, i.e., finding the nearest points between the RCHs, it turns out not to be such a straightforward task: the search for the nearest points between the two (one for each class) convex hulls depends directly on their extreme points [C28], which, for the separable case, are some of the points of the original dataset. However, in the non-separable case, each extreme point of the RCH turns out to be a reduced convex combination of the original points. This fact makes the direct application of a geometric NPA impractical, since an intermediate step of combinatorial complexity has been introduced. As is shown in [C30], the (candidates to be) extreme points of an RCH are linear combinations of a specific number (depending on ⌈1/μ⌉, where ⌈x⌉ denotes the ceiling of x) of the original points, weighted by a specific set of coefficients. Verbatim, Theorem 1 of [C30] states the following: “The extreme points of a RCH

  R(X, μ) = { w : w = ∑_{i=1}^{k} a_i x_i, x_i ∈ X, ∑_{i=1}^{k} a_i = 1, 0 ≤ a_i ≤ μ }

have coefficients a_i that belong to the set S = { 0, 1 − ⌊1/μ⌋ μ, μ }, where ⌊x⌋ denotes the floor of x.” A direct consequence is that only ⌈1/μ⌉ distinct points of
the original set contribute to an extreme point of an RCH. Fig. 2 shows the original convex hull (pentagon) and the RCH R(X, 2/5), i.e., for μ = 2/5 and thus ⌈1/μ⌉ = 3. The numbers associated with each point of the RCH indicate the respective points of the original set X whose linear combination (with specific coefficients) gives birth to that point. Observe that the extreme points of the RCH are combinations of the original extreme points.

Although the previously stated Theorem greatly reduces the amount of calculations, a direct approach is still of combinatorial complexity. A second basic theorem, also shown in [C30], suggests a way to overcome the combinatorial nature of the direct formation of RCH points, thus reducing the complexity to a quadratic one (i.e., as in the standard SVM formulation). According to this theorem, the extreme RCH point that has the minimum projection onto a specific direction is obtained in the following three steps (see the Appendix, and the code sketch right after this discussion):
1. Obtain the projections of the original points onto the given direction.
2. Sort these projections in descending order and choose ⌈1/μ⌉ of them.
3. Take the weighted average of the selected ordered projections, with the weights analytically computed. This is the projection of the extreme RCH point, required in the algorithmic computations.
Note, also, that, knowing the projection, the respective extreme point can easily be derived, if it is required. This setting fits nicely with Gilbert’s NPA, which does not depend directly on the actual extreme points of the RCH, but on the projection of the RCH onto a specific direction.
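A minimal Python sketch of this three-step computation follows (our own illustration, under the reading that the ⌈1/μ⌉ smallest projections are the ones combined, the first ⌊1/μ⌋ of them weighted by μ and the remaining one by 1 − ⌊1/μ⌋μ, i.e., with weights taken from the set S of Theorem 1):

import math
import numpy as np

def rch_min_projection(X, w, mu):
    # Minimum of <z, w/||w||> over z in R(X, mu), together with the coefficient
    # vector of the extreme RCH point achieving it (Theorem 2 of [C30]).
    k = X.shape[0]
    n_full = math.floor(1.0 / mu + 1e-12)   # points that receive the weight mu
    n_used = math.ceil(1.0 / mu - 1e-12)    # total points entering the combination
    # Step 1: projections of the original points onto the direction of w.
    p = X @ (w / np.linalg.norm(w))
    # Step 2: order the projections and keep the ceil(1/mu) smallest ones.
    chosen = np.argsort(p)[:n_used]
    # Step 3: weighted average with the analytically known weights from S.
    a = np.zeros(k)
    a[chosen[:n_full]] = mu
    if n_used > n_full:                      # 1/mu is not an integer
        a[chosen[n_full]] = 1.0 - n_full * mu
    return float(p @ a), a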
In order to extend the framework introduced in [C30], so as to allow the use of the popular Gilbert’s NPA for the RCH problem, a number of Propositions have to be shown. In particular, since the specific NPA transforms the NPP to a Minimum Norm Problem (MNP) and works on Minkowski difference sets, it is necessary to prove that the Minkowski difference of two RCHs is itself an RCH, having a direct (and explicit) relation to the original sets. Besides, since Gilbert’s algorithm finds, at each iteration step, the point of a line segment with minimum norm and uses it in the next iteration step, it is also necessary to prove that such a point belongs to the RCH.

In the sequel, the notion of the Minkowski sum (and difference) is defined and incorporated into the RCH toolbox, through a set of new Propositions presented in this work. The results derived below allow Gilbert’s NPA to be used in the RCH framework.

Definition 2: The set X1 + X2 ≡ { x = x1 + x2 : x1 ∈ X1, x2 ∈ X2 }, created from two sets X1 and X2, is called the Minkowski sum (or direct sum, or simply sum) of the two sets and is often denoted as X1 ⊕ X2 [C28].

Definition 3: The set −X ≡ { z = −x : x ∈ X } is called the opposite of X.

Proposition 1: The set −R(X, μ) is still an RCH; actually, it is R(−X, μ).
Proof: Straightforward, by taking a point z ∈ −R(X, μ) and using Definition 1 and Definition 3.

Proposition 2: Scaling is an RCH-preserving operation, i.e., cR(X, μ) = R(cX, μ), c ∈ ℝ − {0}.
Proof: Let z ∈ cR(X, μ). Then z = c ∑_{i=1}^{k} a_i x_i, where x_i ∈ X, ∑_{i=1}^{k} a_i = 1 and 0 ≤ a_i ≤ μ. Therefore z = ∑_{i=1}^{k} a_i (c x_i) with (c x_i) ∈ cX.

Proposition 3: The Minkowski sum of two RCHs is also an RCH; actually, it is the RCH produced by the Minkowski sum of the original sets, with coefficients’ bound the product of the original coefficients’ bounds, i.e., R(X1, μ1) + R(X2, μ2) = R(X1 + X2, μ1 μ2).
Proof: Let u ∈ R(X1, μ1) and v ∈ R(X2, μ2). Any z ∈ R(X1, μ1) + R(X2, μ2) can be written as

  z = u + v = ∑_{i∈I1} a_i x_i + ∑_{j∈I2} a_j x_j.

But since ∑_{i∈I1} a_i = 1 with 0 ≤ a_i ≤ μ1 and ∑_{j∈I2} a_j = 1 with 0 ≤ a_j ≤ μ2, the previous relation can be written as

  z = ( ∑_{j∈I2} a_j ) ( ∑_{i∈I1} a_i x_i ) + ( ∑_{i∈I1} a_i ) ( ∑_{j∈I2} a_j x_j )
    = ∑_{i∈I1} ∑_{j∈I2} a_j a_i x_i + ∑_{j∈I2} ∑_{i∈I1} a_i a_j x_j
    = ∑_{i∈I1} ∑_{j∈I2} a_i a_j ( x_i + x_j ).

However, ∑_{i∈I1} ∑_{j∈I2} a_i a_j = ( ∑_{i∈I1} a_i ) ( ∑_{j∈I2} a_j ) = 1 and 0 ≤ a_i a_j ≤ μ1 μ2, with ( x_i + x_j ) ∈ X1 + X2.

Corollary 1: The line segment [w1, w2] joining two points w1, w2 ∈ R(X, μ) of an RCH belongs to the RCH, i.e., any RCH is itself a convex set.
Proof: Let w ∈ [w1, w2]. Then w can be written as w = λ w1 + (1 − λ) w2 for some λ ∈ [0, 1]. From Definition 1, w1 and w2 can be written as w1 = ∑_{i=1}^{k} a_i x_i and w2 = ∑_{i=1}^{k} b_i x_i respectively, where x_i ∈ X, ∑_{i=1}^{k} a_i = 1, ∑_{i=1}^{k} b_i = 1, 0 ≤ a_i ≤ μ and 0 ≤ b_i ≤ μ. Therefore

  w = λ ∑_{i=1}^{k} a_i x_i + (1 − λ) ∑_{i=1}^{k} b_i x_i = ∑_{i=1}^{k} [ λ a_i + (1 − λ) b_i ] x_i = ∑_{i=1}^{k} c_i x_i,

where c_i ≡ λ a_i + (1 − λ) b_i. But, obviously, ∑_{i=1}^{k} c_i = 1 and the same bounds hold for each c_i, i.e., 0 ≤ c_i ≤ μ.

Corollary 2: The Minkowski difference of two RCHs, Z = { z : z = x − y, x ∈ R(X1, μ1), y ∈ R(X2, μ2) }, is the RCH R(X1 + (−X2), μ1 μ2), as a direct consequence of Propositions 1 and 3.

In this framework, the (non-separable) classification task amounts to finding the couple of closest points² of the two RCHs, i.e., to minimizing

  ‖ x2 − x1 ‖,   (1)

where x1 ∈ R(X1, μ1), x2 ∈ R(X2, μ2), under the assumption that the parameters (μ1, μ2) have been selected so as to guarantee that R(X1, μ1) ∩ R(X2, μ2) = ∅. This is clearly a NPP and is equivalent to the following MNP ([C17]): find z*, such that

  z* = arg min_{z∈Z} ‖z‖,   (2)

where Z is the Minkowski difference of the two RCHs and, under the above assumption, z* ≠ 0 (the null vector does not belong to Z).

² There may be more than one such couple of points, in case of degeneracy, but the distance between the points of each couple will be the same for all couples.
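Corollary 2 is what keeps the difference set Z computationally harmless: a linear functional attains its minimum over Z as the sum of its minima over the two class-wise RCHs, so Z is never formed explicitly. A small Python sketch of this decomposition (our own, building on the rch_min_projection routine sketched above and written for explicit feature vectors; in the kernel case the dot products are replaced by kernel evaluations, as in the Appendix) is:

def z_min_projection(X1, X2, w, mu1, mu2):
    # Minimum projection of Z = {x - y : x in R(X1, mu1), y in R(X2, mu2)}
    # onto the direction of w, computed class-wise, never forming Z itself.
    p1, a1 = rch_min_projection(X1, w, mu1)
    p2, a2 = rch_min_projection(-X2, w, mu2)   # Proposition 1: -R(X2, mu2) = R(-X2, mu2)
    z = a1 @ X1 - a2 @ X2                      # the point of Z attaining the minimum
    return p1 + p2, z, (a1, a2)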
Fig. 3: Elements involved in Gilbert’s algorithm: the set Z, the points w, zr and wnew, and the interpolation factors q and 1 − q.
A brief description of the standard Gilbert’s algorithm (provided that Z is a convex set, of which we need to find the minimum norm member z*) is given below:
Step 1: Choose w ∈ Z.
Step 2: Find the point z ∈ Z with the minimum projection onto the direction of w. If w ≅ z, then z* = w; stop.
Step 3: Find the point wnew of the line segment [w, z] with minimum norm (closest to the origin). Set w ← wnew and go to Step 2.
The idea behind the algorithm is very simple and intuitive, and the elements involved in the above steps are illustrated in Fig. 3. In order to apply the above algorithm effectively to the (non-separable) SVM classification task, the following requirements must be satisfied (a schematic code rendering of the resulting scheme follows this list):
1. The set Z, which in this case is the Minkowski difference of the RCHs generated by the original pattern classes, must be a convex set. Actually, it is an RCH, as ensured by Corollary 2, and therefore a convex set (Corollary 1).
2. The minimum norm point wnew, computed at each step, must be a feasible solution, i.e., it must be the vector joining two points, each in the RCH of the corresponding class. This is ensured by Corollary 4.
3. An efficient way of calculating the point z (with minimum projection onto the direction of w) must be found. Recall that z is the difference of two vectors, each of which is a reduced convex combination of the original points of the corresponding class. Therefore, the problem seems to be of combinatorial complexity. However, since the projections, and not the points themselves, play the central role in the algorithm, Theorem 2 of [C30] provides an efficient (non-combinatorial) way of performing the calculations.
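The loop itself is short; the following schematic Python sketch (our own, with min_projection_point standing for any routine that returns the minimum projection of Z along a direction together with a point of Z attaining it, e.g., a thin wrapper around the class-wise decomposition sketched earlier) mirrors Steps 1-3, with the analytic form of Step 3 derived right below:

import numpy as np

def gilbert_min_norm(min_projection_point, z0, eps=1e-3, max_iter=10000):
    # Schematic Gilbert iteration for the minimum-norm member of a convex set Z.
    w = np.asarray(z0, dtype=float)            # Step 1: any starting point of Z
    for _ in range(max_iter):
        w_norm = np.linalg.norm(w)
        if w_norm == 0.0:
            break
        p, z = min_projection_point(w)          # Step 2: least projection along w
        if p >= (1.0 - eps) * w_norm:           # least projection ~ ||w||: w ~ z*
            break
        d = z - w                               # Step 3: min-norm point of [w, z]
        q = np.clip(-(w @ d) / (d @ d), 0.0, 1.0)
        w = (1.0 - q) * w + q * z
    return w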
Let w ∈ Z and zr ∈ Z such that

  zr = arg min_{z∈Z} ( ⟨z, w⟩ / ‖w‖ ),   (3)

i.e., the point of the RCH Z that exhibits the least projection onto the direction of w. According to Gilbert’s algorithm, we have to find the point wnew of the line segment [w, zr] with the least norm and use it, instead of w, in the next iteration of the algorithm. It is assumed that w and zr do not coincide because, otherwise, w would be the least norm point of the convex set and the algorithm would have stopped (stopping condition in Step 2); therefore ‖zr − w‖ > 0. Let w⊥ be the projection of the origin onto the line defined by zr − w, and s a scalar such that

  w⊥ = (1 − s) w + s zr.   (4)

Taking the inner product of both sides with zr − w, it is

  ⟨w⊥, zr − w⟩ = ⟨w, zr − w⟩ + s ‖zr − w‖².   (5)

But since w⊥ ⊥ (zr − w) (or ⟨w⊥, zr − w⟩ = 0, from the Projection Theorem [C21]), (5) becomes

  s = ⟨w, w − zr⟩ / ‖zr − w‖²   (6)

or

  s ‖zr − w‖² = ‖w‖² − ⟨w, zr⟩.   (7)

In order for the new minimizer wnew to belong to the RCH Z, it should be restricted to lie inside the line segment [w, zr], which means that

  wnew = (1 − q) w + q zr,  q ∈ [0, 1].   (8)

From (4), it is s ≠ 0 because, otherwise, the algorithm would have stopped at the previous iteration (w⊥ = w). Besides, a negative value for s, i.e., s < 0, is impossible, since this would imply from (7) that ‖w‖² < ⟨w, zr⟩, or ‖w‖ < ⟨w, zr⟩/‖w‖; however, this contradicts (3) (zr has the minimum projection, onto the direction of w, of all the members of Z, even compared with w itself, whose projection is ‖w‖). In case s ∈ ]0, 1[, then wnew = w⊥ and q = s. If, on the other hand, s ≥ 1, which is equivalent to ‖zr − w‖² ≤ ‖w‖² − ⟨w, zr⟩, we set q = 1, i.e., wnew = zr.
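Since everything above involves only the inner products ⟨w, w⟩, ⟨w, zr⟩ and ⟨zr, zr⟩, relations (6)-(8) translate directly into a few lines of code (our own sketch, working purely with those scalars, as is required in the kernel setting):

def segment_step(ww, wz, zz):
    # ww = <w, w>, wz = <w, zr>, zz = <zr, zr>.
    # Returns q in [0, 1] such that w_new = (1 - q) w + q zr is the minimum-norm
    # point of the segment [w, zr] (eqs. (6)-(8)), and the new squared norm.
    denom = ww - 2.0 * wz + zz          # ||zr - w||^2
    if denom <= 0.0:                    # w and zr (numerically) coincide
        return 0.0, ww
    s = (ww - wz) / denom               # eq. (6)
    q = min(max(s, 0.0), 1.0)           # restriction to the segment, eq. (8)
    wnew_sq = (1 - q)**2 * ww + 2 * q * (1 - q) * wz + q**2 * zz
    return q, wnew_sq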
For the calculation of q (and hence of wnew), the quantities ⟨w, w⟩, ⟨w, zr⟩ and ⟨zr, zr⟩ need to be calculated. It is more convenient to work with the original patterns and not with their differences. Therefore,

  ⟨w, w⟩ = ⟨w1 − w2, w1 − w2⟩ ⇔ ⟨w, w⟩ = ⟨w1, w1⟩ + ⟨w2, w2⟩ − 2 ⟨w1, w2⟩.   (9)

However, since the solution has to be expressed in terms of the input space (where the actual training data live) and not in terms of the feature space (where w, zr and the mapped patterns reside), it is necessary to express the solution w through the Lagrange multipliers a_i. Recall that, because of the KKT conditions and the Representer Theorem [C1], w can always be written as

  w = ∑_{i∈I1∪I2} a_i y_i x_i   (10)

and enters into the decision function not directly, but only through the kernel function. The detailed calculations of the quantities involved are presented in the Appendix. Observe that the calculations are distinct from those of the standard form of Gilbert’s algorithm, since RCHs are now involved.

Concerning the stopping condition, one should ensure that the norms of the new closest point of the RCH Z and the previous one are very close, i.e., w ≅ wnew or, more precisely, ‖wnew‖/‖w‖ > 1 − ε ⇔ ‖w‖ − ‖wnew‖ < ε ‖w‖, where 0 < ε ≪ 1.

Finally, since the optimal hyperplane is defined by

  ⟨w*, x⟩ + b* = 0,   (11)

where w* = w1* − w2*, setting x = (1/2)(w1* + w2*), the threshold is

  b* = (1/2) ( ⟨w2*, w2*⟩ − ⟨w1*, w1*⟩ ) = (1/2) (B − A).   (12)

To transform the above relations so that they represent the canonical hyperplane (wc, bc), we have to rescale (w*, b*) by a scaling factor λ > 0, such that the margin between the two convex hulls (which is given by ‖w*‖ = ‖w1* − w2*‖) becomes equal to 2/‖wc‖, i.e., we have to set

  wc = λ w*   (13)

and demand

  ‖wc‖ = 2/‖w*‖.   (14)

From (13), (14) becomes λ ‖w*‖ = 2/‖w*‖, hence

  λ = 2/‖w*‖²   (15)

and therefore (from (13))

  wc = 2 w* / ‖w*‖² ⇔ wc = 2 (w1* − w2*) / ‖w1* − w2*‖².   (16)

In order for (11) to still be valid, we set

  bc = λ b*,   (17)

so that ⟨λ w*, x⟩ + λ b* = 0 ⇔

  ⟨wc, x⟩ + bc = 0.   (18)

Therefore, from (17), it is

  bc = ( ⟨w2*, w2*⟩ − ⟨w1*, w1*⟩ ) / ‖w1* − w2*‖² = (B − A) / (A + B − 2C).   (19)

The value bc of (19) represents the offset of the hyperplane bisecting the line segment joining the couple of closest points of the two RCHs, each constructed from the corresponding dataset. However, this offset usually does not coincide with the offset obtained through the standard (algebraic) SVM formulation, in which the slack variables are given by

  ξ_i = 1 − y_i ( ⟨wc, x_i⟩ + bc ), if 1 − y_i ( ⟨wc, x_i⟩ + bc ) > 0, and ξ_i = 0, if 1 − y_i ( ⟨wc, x_i⟩ + bc ) ≤ 0.   (21)

A very brief but clear and intuitive explanation of this difference is that, for the same value of the bound μ, the convex hulls are not reduced by the same factor; their reduction factor actually depends not only on μ, but also on the number and the distribution of the samples.

Fig. 4: Checkers board dataset with the line separating the two classes, produced using the RCH-G algorithm. Dashed lines correspond to the values −1 and +1 respectively.

Accordingly, from (15) and (16), and using the relations w1* = ∑_{i∈I1} a_i* x_i and w2* = ∑_{i∈I2} a_i* x_i, we obtain

  a_i^c = λ a_i*.   (22)
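Putting (15), (19) and (22) together, and writing the decision rule through the kernel as implied by (10), the final classifier can be assembled as in the following illustrative Python fragment (our own sketch; A, B and C denote the Appendix quantities ⟨w1*, w1*⟩, ⟨w2*, w2*⟩ and ⟨w1*, w2*⟩, and a_star holds the multipliers of the two nearest points):

import numpy as np

def canonical_solution(A, B, C, a_star):
    # lambda = 2 / ||w*||^2 (15), b_c = (B - A) / (A + B - 2C) (19), a_c = lambda a* (22).
    w_star_sq = A + B - 2.0 * C              # ||w1* - w2*||^2
    lam = 2.0 / w_star_sq
    b_c = (B - A) / w_star_sq
    return lam * np.asarray(a_star), b_c

def decision_function(x, X, y, a_c, b_c, kernel):
    # f(x) = sum_i a_c[i] y[i] k(x_i, x) + b_c, with w expanded as in (10).
    return sum(a_c[i] * y[i] * kernel(X[i], x) for i in range(len(y))) + b_c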
V. EXPERIMENTS

In order to extensively investigate the performance (both in speed and in accuracy) of the new algorithm presented here, a number of publicly available test datasets, with known success rates under specific SVM training models (reported in the literature), and two artificial 2-dimensional datasets, used for a visual representation of the classification results, have been employed. Four different SVM classification algorithms were implemented, tested and compared: the SMO algorithm, as presented by Platt in [C13] (denoted as SMO), the two modified SMO algorithms presented by Keerthi et al. in [C33] (denoted as SMO-K1 and SMO-K2 respectively) and, finally, the algorithm presented here (denoted as RCH-G). Each algorithm was trained and tested on each dataset, under the same model (kernel with the corresponding parameters), in order to achieve the accuracy reported in the literature ([C34], [C35]). The accuracies of all classifiers, achieved for each specific dataset, were calculated under the same validation scheme, i.e., the same validation method and the same data realizations (100 publicly available realizations for the standard datasets and the Leave-k-Out (LKO) method, with k = 5, for the artificial 2-dimensional datasets). The standard datasets used are: 1) Diabetis, 2) Banana, 3) German, 4) Waveform, 5) Flare-solar, 6) Heart, and the artificial ones are 7) Two-spirals [C36] (Fig. 5) and 8) Checkers board [C37] (Fig. 4). The details of the datasets, as well as the success rates reported in the literature, are presented in Table 1, the parameters used in Table 2, and the results of the runs are summarized in Table 3.
Fig. 5: Two-spirals dataset with the line separating the two classes, produced using RCH-G algorithm.
Dataset          Data Dim.  Train Patterns  Test Patterns  Validation  Success Reported
Diabetis             8           468             300         100 r³        76.5±1.7
Banana               2           400            4900         100 r         88.5±0.7
German              20           700             300         100 r         76.4±2.1
Waveform            21           400            4600         100 r         90.1±0.4
Flare-Solar          9           666             400         100 r         67.6±1.8
Heart               13           170             100         100 r         84.0±3.3
Two-Spirals          2          1200             ---          LKO           ---
Checkers Board       2          1000             ---          LKO           ---

³ r: number of realizations (specific partitions of the train and test datasets).

Table 1: Details of the datasets (along with the validation method) used for testing, and the success rate reported in the literature.
Dataset          σ      C, tolerance    μ, accuracy
Diabetis         20     100, 10⁻³       0.0075, 10⁻³
Banana           1      316.2, 10⁻³     0.014, 5×10⁻⁴
German           55     3162, 10⁻³      0.0052, 10⁻³
Waveform         20     1000, 10⁻³      0.02, 5×10⁻⁴
Flare-Solar      30     1000, 10⁻³      0.0039, 10⁻³
Heart            120    1000, 10⁻³      0.017, 10⁻³
Two-Spirals      0.1    500, 10⁻³       0.05, 10⁻³
Checkers Board   0.4    500, 10⁻³       0.2, 10⁻³

Table 2: Parameters of the kernel model used (an RBF kernel has been used everywhere) and tuning parameters of the specific algorithms used to achieve the reported success rate.
Dataset          Method    Success (%)   Kernel Eval.   Time (s)
Diabetis         SMO       76.7±1.8      2.01×10⁷       63.8
                 SMO-K1    76.7±1.8      1.45×10⁶       8.4
                 SMO-K2    76.7±1.8      4.10×10⁶       9.9
                 RCH-G     76.3±1.8      6.99×10⁵       6.5
Banana           SMO       80.0±6.4      1.91×10⁷       414
                 SMO-K1    89.1±0.6      6.60×10⁶       111
                 SMO-K2    89.0±0.7      1.54×10⁷       95
                 RCH-G     88.8±0.9      1.75×10⁶       22
German           SMO       76.0±2.2      1.37×10⁷       2840
                 SMO-K1    76.1±2.2      9.00×10⁶       161
                 SMO-K2    76.1±2.2      4.62×10⁷       194
                 RCH-G     76.3±0.4      2.71×10⁶       15
Waveform         SMO       88.8±0.6      9.43×10⁷       2005
                 SMO-K1    89.2±0.5      2.20×10⁶       65
                 SMO-K2    88.9±1.4      6.20×10⁶       63.3
                 RCH-G     88.5±0.7      1.46×10⁶       23.2
Flare-Solar      SMO       67.6±1.8      2.27×10⁷       479
                 SMO-K1    69.1±9.6      1.20×10⁷       156
                 SMO-K2    69.0±9.6      7.60×10⁷       107
                 RCH-G                   3.40×10⁶       13
Heart            SMO       83.9±3.3      2.05×10⁶       51.22
                 SMO-K1    83.2±4.2      2.62×10⁵       1.54
                 SMO-K2    83.4±4.2      2.55×10⁵       1.51
                 RCH-G     84.8±3.3      4.70×10⁴       0.81
Two-Spirals      SMO       100
                 SMO-K1    100
                 SMO-K2    100
                 RCH-G
Checkers Board   SMO       96.2
                 SMO-K1    97.1
                 SMO-K2    96.9
                 RCH-G

Table 3: Results achieved by each algorithm (success rate, number of kernel evaluations and run time in seconds).
The results of the new geometric algorithm presented here, compared to the most popular and fast algebraic ones, are very encouraging: its advantage with respect to the number of kernel evaluations, as well as the overall speedup, for the same level of accuracy, is readily noticeable. The enhanced performance is justified by the fact that, although the algebraic algorithms (especially SMO-K2) make clever use of the cache where kernel values are stored, they cannot avoid repetitive searches in order to find the best couple of points to be used in the next step of the optimization process. Furthermore, the enhanced performance of the new geometric algorithm against its algebraic competitors can be explained by the fact that it is a straightforward optimization algorithm, with a clear optimization target at each iteration step, always aiming at the global minimum, and at the same time being independent of obscure and sometimes inefficient heuristics.
VI. CONCLUSION

The SVM approach to machine learning has both theoretical and practical advantages over similar ones. Among these advantages are: a) the sound mathematical background of SVMs (supporting their generalization bounds and their guaranteed convergence to the unique, globally optimal solution), b) their specific structure, which allows an elegant generalization to the non-linear case, and c) the intuition they display. The geometric intuition is intrinsic to the structure of SVMs and has found application in solving both the separable and the non-separable problems. Gilbert’s iterative geometric algorithm for the NPP, modified here to accommodate the RCH notion for the non-separable task, resulted in a very promising method for solving the SVM training task. The obtained results (in terms of execution time and number of kernel evaluations) showed a clear advantage when compared with the best representatives of the algebraic (decomposition) methods, i.e., SMO and its improved variants. The algorithm presented here is based on clear and sound geometric concepts, does not use any heuristics, and provides a clear understanding of the convergence process and of the role of the parameters used.

APPENDIX

Let:
  ⟨w1, w1⟩ = ⟨ ∑_{i∈I1} a_i x_i , ∑_{j∈I1} a_j x_j ⟩ = ∑_{i∈I1} ∑_{j∈I1} a_i a_j ⟨x_i, x_j⟩ ≡ A   (23)
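In code, the quantity A of (23), together with the analogous B = ⟨w2, w2⟩ and C = ⟨w1, w2⟩ used in (9), (12) and (19), reduces to quadratic forms of the kernel matrix; a brief Python sketch (our own) is:

import numpy as np

def appendix_quantities(K, a, I1, I2):
    # A = <w1, w1>, B = <w2, w2>, C = <w1, w2>, with w1 = sum_{i in I1} a_i x_i and
    # w2 = sum_{j in I2} a_j x_j, computed through the kernel matrix K (eq. (23)).
    a1 = np.asarray(a)[I1]
    a2 = np.asarray(a)[I2]
    A = a1 @ K[np.ix_(I1, I1)] @ a1
    B = a2 @ K[np.ix_(I2, I2)] @ a2
    C = a1 @ K[np.ix_(I1, I2)] @ a2
    return A, B, C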
[C7] T. Joachims, "Text categorization with support vector machines: learning with many relevant features", Proceedings of the European Conference on Machine Learning (ECML), Chemnitz, Germany, 1998.
[C8] E. Osuna, R. Freund, F. Girosi, "Training support vector machines: An application to face detection", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR'97, Puerto Rico, 1997, pp. 130–136.
[C9] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares Jr., D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines", Proc. Nat. Acad. Sci. 97, 2000, pp. 262–267.
[C10] A. Navia-Vazquez, F. Perez-Cruz, A. Artes-Rodriguez, "Weighted least squares training of support vector classifiers leading to compact and adaptive schemes", IEEE Transactions on Neural Networks, vol. 12(5), 2001, pp. 1047–1059.
[C11] D. J. Sebald, J. A. Bucklew, "Support vector machine techniques for nonlinear equalization", IEEE Transactions on Signal Processing, vol. 48(11), 2000, pp. 3217–3227.
[C12] D. Zhou, B. Xiao, H. Zhou, R. Dai, "Global Geometry of SVM Classifiers", Technical Report, AI Lab, Institute of Automation, Chinese Academy of Sciences; submitted to NIPS 2002.
[C13] J. Platt, "Fast training of support vector machines using sequential minimal optimization", in B. Schölkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pp. 185–208, MIT Press, 1999.
[C14] K. P. Bennett, E. J. Bredensteiner, "Geometry in Learning", in C. Gorini, E. Hart, W. Meyer and T. Phillips, editors, Geometry at Work, Mathematical Association of America, 1998.
[C15] K. P. Bennett, E. J. Bredensteiner, "Duality and Geometry in SVM classifiers", in P. Langley, editor, Proc. 17th International Conference on Machine Learning, Morgan Kaufmann, 2000, pp. 57–64.
[C16] D. J. Crisp, C. J. C. Burges, "A geometric interpretation of ν-SVM classifiers", NIPS 12, 2000, pp. 244–250.
[C17] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy, "A fast iterative nearest point algorithm for support vector machine classifier design", Technical Report TR-ISL-99-03, Department of CSA, IISc, Bangalore, India, 1999.
[C18] V. Franc, V. Hlaváč, "An iterative algorithm learning the maximal margin classifier", Pattern Recognition 36, 2003, pp. 1985–1996.
[C19] T. T. Friess, R. Harrison, "Support vector neural networks: the kernel adatron with bias and soft margin", Technical Report ACSE-TR-752, University of Sheffield, Department of ACSE, 1998.
[C20] D. P. Bertsekas, A. Nedić, A. E. Ozdaglar, Convex Analysis and Optimization, Athena Scientific, 2003.
[C21] D. G. Luenberger, Optimization by Vector Space Methods, John Wiley & Sons, Inc., 1969.
[C22] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, New Jersey, 1970.
[C23] S. Boyd, L. Vandenberghe, Convex Optimization, to be published by Cambridge University Press. Available at http://www.stanford.edu/class/ee392o
[C24] S. G. Nash, A. Sofer, Linear and Nonlinear Programming, McGraw-Hill, 1994.
[C25] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2, pp. 121–167, 1998.
[C26] S. Lang, Real and Functional Analysis, 3rd edition, Springer-Verlag, 1993.
[C27] G. Bachman, L. Narici, Functional Analysis, Dover Publications, Inc., 2000.
[C28] J.-B. Hiriart-Urruty, C. Lemaréchal, Convex Analysis and Minimization Algorithms I, Springer-Verlag, 1991.
[C29] W. Sierpiński, General Topology, translated and revised by C. C. Krieger, Dover Publications, Inc., 2000.
[C30] M. Mavroforakis, S. Theodoridis, "A geometric approach to Support Vector Machine (SVM) classification", to appear in IEEE Transactions on Neural Networks; also available from ……
[C31] E. Gilbert, "Minimizing the quadratic form on a convex set", SIAM J. Control, vol. 4, 1966, pp. 61–79.
[C32] E. Gilbert, D. Johnson, S. Keerthi, "A fast procedure for computing the distance between complex objects in three-dimensional space", IEEE J. Robot. Automat., vol. 4, 1988, pp. 193–203.
[C33] S. Keerthi, S. Shevade, C. Bhattacharyya, K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design", Technical Report, Dept. of CSA, IISc, Bangalore, India, 1999.
[C34] G. Rätsch, T. Onoda, K.-R. Müller, "Soft Margins for AdaBoost", Machine Learning, vol. 42(3), pp. 287–320, 2001.
[C35] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, "Fisher discriminant analysis with kernels", in Y.-H. Hu, J. Larsen, E. Wilson, S. Douglas, editors, Neural Networks for Signal Processing IX, pp. 41–48, IEEE, 1999.
[C36] K. Lang, M. Witbrock, "Learning to tell two spirals apart", in Proceedings of the Connectionist Summer School, 1988.
[C37] L. Kaufman, "Solving the Quadratic Programming Problem Arising in Support Vector Classification", in B. Schölkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pp. 148–167, MIT Press, 1999.
[C38] M. Mavroforakis, S. Theodoridis, "Support Vector Machine (SVM) Classification through Geometry", Proceedings EUSIPCO 2005, Antalya, Turkey; also available from ……