Maximum Margin Classification on Convex Euclidean Metric Spaces
André Stuhlsatz1,2, Hans-Günter Meier2, and Andreas Wendemuth1
1 Otto-von-Guericke University Magdeburg, Germany, Cognitive Systems Group, Department of Electrical Engineering and Information Technology
2 University of Applied Sciences Düsseldorf, Germany, Department of Electrical Engineering
[email protected]
Summary. In this paper, we present a new implementable learning algorithm for the general nonlinear binary classification problem. The suggested algorithm abides by the maximum margin philosophy and learns a decision function from the set of all finite linear combinations of continuously differentiable basis functions. This enables the use of a much more flexible function class than the one usually employed by Mercer-restricted kernel machines. Experiments on 2-dimensional randomly generated data are given to compare the algorithm to a Support Vector Machine. While the performances are comparable in the case of Gaussian basis functions and static feature vectors, the algorithm opens a novel way to hitherto intractable problems. These include, in particular, the classification of feature vector streams, or of features with dynamically varying dimensions such as those in DNA analysis, natural speech or motion image recognition.
1 Overcoming Mercer conditions
In the late seventies, Vapnik and Chervonenkis [1] introduced the concept of a maximum margin separating hyperplane. It has been shown that maximizing the margin between the data points and the hyperplane minimizes an upper bound on the actual risk [2]. This strategy often leads to better generalization performance than purely empirical approaches. For example, the Support Vector Machine (SVM) [2] has been shown to provide very good classification performance in many different applications in comparison to other learning paradigms [3]. Although the separating hyperplane realizes a linear classification rule, it is also possible to obtain nonlinear ones using decision functions consisting of nonlinear basis functions called kernels. Unfortunately, kernel functions have to satisfy some non-trivial mathematical conditions (Mercer conditions [4]). The Mercer conditions have been proved only for a handful of commonly used kernels, see e.g. [5]. The resulting lack of variety of feasible kernels can be a drawback in applications where the geometry of the data space is not well modeled by the given kernels or where we have
streams of features of variable length instead of static feature vectors. This is the case, for example, in speech recognition, motion image recognition or DNA sequence analysis. The aim of this paper is to provide a novel approach to these cases. We present a practicable method of maximum margin classification in Banach spaces based on the theoretical work of [6]. Our algorithm learns on arbitrary compact and convex Euclidean metric spaces, and its solution is a decision function from the set of all finite linear combinations of continuously differentiable basis functions.
2 Lipschitz classifier
Usually, nonlinear classifiers implement decision functions³ of a function set
$$\mathcal{F} := \Big\{ f : \mathcal{X} \to \mathbb{R} \;:\; f(x) := \sum_{n=1}^{M} c_n \Phi_n(x) + b,\quad c_n, b \in \mathbb{R},\ M \in \mathbb{N} \Big\} \tag{1}$$
with linearly weighted basis functions $\Phi_n : \mathcal{X} \to \mathbb{R}$. For example, in the case of the SVM, $\Phi_n(x) := k(x_n, x)$, where $k$ is some kernel function which has to satisfy the Mercer conditions [4] and $\mathcal{X}$ is a vector space equipped with some scalar product (Hilbert space). Popular kernels are the Gaussian kernel function $k(x_n, x) := \exp(-\gamma \|x_n - x\|_2^2)$ or the polynomial kernel $k(x_n, x) := \langle x_n, x \rangle_2^d$. The kernel function $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}}$ represents a scalar product of an appropriate space $\mathcal{H}$ after embedding $\mathcal{X}$ into $\mathcal{H}$ by some feature map $\varphi$. In particular, the SVM seeks an $f \in \mathcal{F}$ so that the empirical error is minimized and the margin $\rho = 1/\|f\|_{\mathcal{H}}$ between the data points and a separating hyperplane in $\mathcal{H}$ is maximized.

The main concept [6] of our algorithm is based upon the idea of embedding a bounded metric space $(\mathcal{X}, d)$, i.e. a space whose diameter $\mathrm{diam}(\mathcal{X}) := \sup_{x,y \in \mathcal{X}} d(x, y)$ is finite for some metric $d$, isometrically into a special Banach space $\mathcal{B}$ (Arens-Eells space). The corresponding function space $\mathrm{Lip}(\mathcal{X})$ of bounded Lipschitz continuous functions is isometrically mapped to the dual space $\mathcal{B}'$. Recall that a function $f$ is Lipschitz if
$$L(f) := \sup_{x,y \in \mathcal{X},\, x \neq y} \frac{|f(x) - f(y)|}{d(x,y)} < \infty$$
exists; $L(f)$ is defined as the smallest Lipschitz constant of $f$. In analogy to the SVM embedding, it is possible to assign to $f \in \mathrm{Lip}(\mathcal{X})$ a linear functional $T_f(m_x) = f(x)$ using appropriate representations $m_x \in \mathcal{B}$ of the points $x \in \mathcal{X}$, such that $\|T_f\| = L(f)$. A canonical hyperplane $H := \{m_x \in \mathcal{B} \mid T_f(m_x) = 0\}$ in $\mathcal{B}$ allows us to bound the margin $\rho$ between the mapped data $m_{x_n}$ and $H$ from below by $\rho \geq 1/\|T_f\| = 1/L(f)$. Note that, contrary to the SVM approach, minimizing $\|T_f\| = L(f)$ maximizes a lower bound on the margin $\rho$. As a consequence, a large margin algorithm can be constructed on a metric space by picking a function with small Lipschitz constant and minimum empirical error from a larger set of decision functions $\mathcal{F} \cap \mathrm{Lip}(\mathcal{X})$ than in the case of the SVM. Using the standard methodology to handle non-separable data as well [2], the Lipschitz classifier yields a constrained optimization problem of the form
$$\inf_{(f,\xi) \in \mathcal{F} \cap \mathrm{Lip}(\mathcal{X}) \times [0,\infty)^N} L(f) + C \sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad y_n f(x_n) \geq 1 - \xi_n \ \ \forall n \in I \tag{2}$$
with index set $I := \{1, \dots, N\}$, slack variable $\xi \in [0, \infty)^N$, training data $(x_n, y_n) \in \mathcal{X} \times \{\pm 1\}$ and parameter $C > 0$ free but fixed in advance.

³ Note that we associate the term decision function with $f \in \mathcal{F}$ instead of $x \mapsto \mathrm{sign}(f(x))$.
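To make the objective of problem (2) concrete, the following minimal sketch evaluates it for one fixed decision function of the form (1) with Gaussian basis functions. It is an illustration under stated assumptions, not the authors' implementation: the helper names, the toy data, the coefficients and the value passed as `lipschitz_const` are all hypothetical, and the Lipschitz constant is supplied by hand rather than computed.

```python
import math

def gaussian_basis(center, gamma):
    """Return Phi(x) = exp(-gamma * ||center - x||_2^2) for a fixed center."""
    def phi(x):
        sq_dist = sum((ci - xi) ** 2 for ci, xi in zip(center, x))
        return math.exp(-gamma * sq_dist)
    return phi

def decision_function(coeffs, basis, bias):
    """Build f(x) = sum_n c_n * Phi_n(x) + b, as in equation (1)."""
    def f(x):
        return sum(c * phi(x) for c, phi in zip(coeffs, basis)) + bias
    return f

def soft_margin_objective(f, lipschitz_const, data, labels, C):
    """Evaluate L(f) + C * sum(xi) from (2) using the smallest feasible slacks."""
    # For fixed f, the smallest xi_n >= 0 with y_n * f(x_n) >= 1 - xi_n
    # is the hinge term max(0, 1 - y_n * f(x_n)).
    slacks = [max(0.0, 1.0 - y * f(x)) for x, y in zip(data, labels)]
    return lipschitz_const + C * sum(slacks), slacks

# Toy data and parameters (all values are illustrative assumptions).
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 1, -1]
basis = [gaussian_basis(x, gamma=1.0) for x in data]
f = decision_function([1.0, 1.0, -2.0], basis, bias=0.0)
objective, slacks = soft_margin_objective(
    f, lipschitz_const=2.0, data=data, labels=labels, C=10.0
)
```

The sketch only scores a given f; the optimization over f and ξ is the subject of the following sections.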
3 Learning over Compact and Convex Metric Spaces
As long as no additional assumptions restrict the decision function space $\mathcal{F}$ and the underlying metric space $\mathcal{X}$, it has been shown in [6] how to solve the Lipschitz classification problem explicitly. The problem was solved as a Lipschitz minimal norm interpolation problem over the whole Lipschitz function space $\mathrm{Lip}(\mathcal{X})$ using the notion of Lipschitz extension. A solution found by this procedure may usually be expected to be overfitted because of the high complexity of the considered function space. An approximation is obtainable if the function class is restricted to linear combinations of distances and the distance is additionally assumed to be the metric on $\mathcal{X}$ [7]. Instead of an approximation or the Lipschitz extension interpolant, the following theorem makes it possible to determine exact Lipschitz constants of real-valued, continuously differentiable decision functions defined on a compact and convex Euclidean space (for a proof see the appendix).

Theorem 1. Suppose that $V := (\mathbb{R}^m, \|\cdot\|_2)$ is a linear space, $D \subseteq V$ is open and $\mathcal{X} \subset D$ is compact and convex. If $f : D \to \mathbb{R}$ is continuously differentiable, then
$$L(f) = \max_{x \in \mathcal{X}} \|\nabla f(x)\|_2. \tag{3}$$
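Theorem 1 can be checked numerically on a simple case. The sketch below assumes a single Gaussian bump on the square X = [-1, 1]^2, for which the maximum gradient norm has the closed form sqrt(2γ/e); it then compares this value with a grid estimate of the maximum gradient norm and with difference quotients over random pairs. The function, the value of `GAMMA`, the grid size and the sampling scheme are illustrative choices, not taken from the paper.

```python
import math
import random

GAMMA = 1.0  # illustrative width parameter, not a value from the paper

def f(x):
    """Single Gaussian bump f(x) = exp(-GAMMA * ||x||_2^2) on X = [-1, 1]^2."""
    return math.exp(-GAMMA * (x[0] ** 2 + x[1] ** 2))

def grad_norm(x):
    """||grad f(x)||_2 = 2 * GAMMA * ||x||_2 * f(x) for the bump above."""
    return 2.0 * GAMMA * math.hypot(x[0], x[1]) * f(x)

def max_grad_norm_on_grid(steps=201):
    """Approximate max_{x in X} ||grad f(x)||_2 on a regular grid over X."""
    axis = [-1.0 + 2.0 * i / (steps - 1) for i in range(steps)]
    return max(grad_norm((u, v)) for u in axis for v in axis)

def max_difference_quotient(n_pairs=20000, seed=0):
    """Largest |f(x1) - f(x2)| / ||x1 - x2||_2 over random pairs in X."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        x1 = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        x2 = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        dist = math.hypot(x1[0] - x2[0], x1[1] - x2[1])
        if dist > 0.0:
            best = max(best, abs(f(x1) - f(x2)) / dist)
    return best

# Closed-form maximum of the gradient norm, attained on the circle
# ||x||_2 = 1 / sqrt(2 * GAMMA), which lies inside X for GAMMA = 1.
exact_lipschitz = math.sqrt(2.0 * GAMMA / math.e)
grid_estimate = max_grad_norm_on_grid()
sampled_quotient = max_difference_quotient()
```

In line with the theorem, no sampled difference quotient should exceed `exact_lipschitz`, while the grid estimate of the gradient norm should approach it from below.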
Hence, equation (3) provides an opportunity to reformulate problem (2) by substituting $L(f)$ with $\max_{x \in \mathcal{X}} \|\nabla f(x)\|_2$, where $\mathcal{X} \subset \mathbb{R}^m$ is compact and convex, and minimizing over $\mathcal{F} \cap C^{(1)}(\mathcal{X}, \mathbb{R}) \subset \mathrm{Lip}(\mathcal{X})$. In the following, we consider a function space consisting of finite linear combinations of continuously differentiable basis functions.

Lemma 1. Suppose $D \subseteq \mathbb{R}^m$ is open, $\mathcal{X} \subset D$ is compact and convex, $f \in \mathcal{F}$ as defined in (1), and every $\Phi_n : D \to \mathbb{R}$ is a continuously differentiable basis function. Then $\mathcal{F} \subset \mathrm{Lip}(\mathcal{X})$ and the Lipschitz classification problem (2) can be stated as
$$\min_{(c,\xi,b) \in Z} \ \max_{x \in \mathcal{X}} \ \frac{1}{2} c^T K(x) c + C\, 1_N^T \xi, \qquad Z := \big\{ (c, \xi, b) \in \mathbb{R}^{M+N+1} : \xi \geq 0,\ Y G c + y b - 1_N + \xi \geq 0 \big\}, \tag{4}$$
where $K(x) \in \mathbb{R}^{M \times M}$ is a symmetric positive semi-definite matrix with $K(x)_{m,n} := \langle \nabla\Phi_m(x), \nabla\Phi_n(x) \rangle_2$, $G \in \mathbb{R}^{N \times M}$ is the data-dependent design matrix with $G_{n,m} := \Phi_m(x_n)$, $Y := \mathrm{diag}(y) \in \mathbb{R}^{N \times N}$ is a diagonal matrix of a given target vector $y \in \{-1, 1\}^N$ with components $y_n$, $\xi \in \mathbb{R}^N$ is the slack variable of the soft margin formulation with components $\xi_n$, and $1_N := (1, \dots, 1)^T \in \mathbb{R}^N$ is the vector of ones.

Optimization problem (4) is equivalent to (2) in the sense that a solution $(c^*, \xi^*, b^*, x^*)$ of (4) yields a solution $(f^*, \xi^*)$ of (2) with $f^*(x) := \sum_{n=1}^{M} c_n^* \Phi_n(x) + b^*$. It can easily be verified that a maximizer of $\max_{x \in \mathcal{X}} \|\nabla f(x)\|_2$ is also a maximizer of $\max_{x \in \mathcal{X}} \|\nabla f(x)\|_2^2 / 2$. The matrix $K(x)$ is positive semi-definite for all $x \in \mathcal{X}$, because
$$\|\nabla f(x)\|_2^2 = \sum_{m=1}^{M} \sum_{n=1}^{M} c_m c_n \langle \nabla\Phi_m(x), \nabla\Phi_n(x) \rangle_2 = c^T K(x) c \tag{5}$$
implies $c^T K(x) c \geq 0$ for all $c \in \mathbb{R}^M$. As the pointwise maximum of a family of convex functions, $\max_{x \in \mathcal{X}} c^T K(x) c$ is convex in $c \in \mathbb{R}^M$ as well.

3.1 Lipschitz Classifier Reformulation as dual SIP
In the previous section we obtained a constrained minimax problem (4) which is not easily solvable directly, because of the complex feasible set and because the inner global maximization depends on the outer minimization and vice versa. It is known, however, that this minimax problem can be transformed into an equivalent problem called a semi-infinite program (SIP). Semi-infinite programming deals with problems with infinitely many constraints in finitely many variables. To convey the idea briefly, we state the following well-known proposition from the theory of convex SIP without proof:

Proposition 1. Suppose $\mathcal{X} \subseteq \mathbb{R}^m$ is compact, $\mathcal{Y} \subseteq \mathbb{R}^n$ is closed and convex, and $g : \mathcal{Y} \times \mathcal{X} \to \mathbb{R}$ is continuous with convex functions $g(\cdot, x)$. Then each solution
$$(c_0^*, c^*) := \operatorname*{argmin}_{(c_0, c) \in Z_{SIP}} c_0 \tag{6}$$
of the convex semi-infinite program with feasible set $Z_{SIP} := \{(c_0, c) \in \mathbb{R} \times \mathcal{Y} : g(c, x) - c_0 \leq 0 \ \ \forall x \in \mathcal{X}\}$ admits a solution $(c^*, x^*) \in \mathcal{Y} \times \mathcal{X}$ of the problem
$$c_0^* = g(c^*, x^*) = \min_{c \in \mathcal{Y}} \max_{x \in \mathcal{X}} g(c, x). \tag{7}$$
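The quantities entering Lemma 1 and Proposition 1 can be sketched in a few lines. The code below assumes Gaussian basis functions, builds the matrix K(x) of gradient inner products, and evaluates g(c, x) = ½ cᵀK(x)c, the x-dependent part of (4), on a finite grid that stands in for the compact set X. For a fixed c, the grid maximum c0 then satisfies the discretized constraints g(c, x) − c0 ≤ 0 of the feasible set in Proposition 1. The centers, coefficients, `gamma` and the grid are hypothetical illustration values, and replacing X by a grid is a simplification, not the method of the paper.

```python
import math

def grad_gaussian(center, gamma, x):
    """Gradient w.r.t. x of Phi(x) = exp(-gamma * ||center - x||_2^2)."""
    phi = math.exp(-gamma * sum((c - xi) ** 2 for c, xi in zip(center, x)))
    return [2.0 * gamma * (c - xi) * phi for c, xi in zip(center, x)]

def gram_of_gradients(centers, gamma, x):
    """K(x)[m][n] = <grad Phi_m(x), grad Phi_n(x)>_2."""
    grads = [grad_gaussian(c, gamma, x) for c in centers]
    return [[sum(a * b for a, b in zip(gm, gn)) for gn in grads] for gm in grads]

def quad_form(c, K):
    """Evaluate c^T K c for a square matrix K given as a list of rows."""
    size = len(c)
    return sum(c[m] * K[m][n] * c[n] for m in range(size) for n in range(size))

def g(c, centers, gamma, x):
    """g(c, x) = 0.5 * c^T K(x) c, the x-dependent part of problem (4)."""
    return 0.5 * quad_form(c, gram_of_gradients(centers, gamma, x))

def grad_f_sq_norm(c, centers, gamma, x):
    """||grad f(x)||_2^2 computed directly from f = sum_n c_n * Phi_n + b."""
    grads = [grad_gaussian(ctr, gamma, x) for ctr in centers]
    total = [sum(c[n] * grads[n][k] for n in range(len(c))) for k in range(len(x))]
    return sum(t * t for t in total)

# Illustrative setup: three centers and a 21 x 21 grid over X = [-1, 1]^2.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
coeffs = [1.0, 1.0, -2.0]
gamma = 1.0
axis = [-1.0 + 0.1 * i for i in range(21)]
grid = [(u, v) for u in axis for v in axis]

# Smallest c0 such that (c0, coeffs) satisfies all discretized SIP constraints.
c0 = max(g(coeffs, centers, gamma, x) for x in grid)
```

Comparing `quad_form` with `grad_f_sq_norm` at any point reproduces identity (5) up to rounding.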
For further details we refer to [8] and the references therein. Applying the above SIP reformulation together with duality results for convex SIP [9] yields the semi-infinite dual of the minimax problem (4) of the Lipschitz classifier:
$$(D) \qquad \max_{K \in \mathrm{conv}(\mathcal{K})} q(K) \tag{8}$$
where $\mathcal{K} := \{K(x) \succeq 0 : x \in \mathcal{X}\}$ collects the positive semi-definite matrices on $\mathcal{X}$, and
$$q(K) := \max_{(\alpha, c) \in Z_q(K)} Q(\alpha, c, K), \qquad Q(\alpha, c, K) := \alpha^T 1_N - \frac{1}{2} c^T K c,$$
$$Z_q(K) := \big\{ (\alpha, c) \in \mathbb{R}^{N} \times \mathbb{R}^{M} : 0 \leq \alpha \leq C 1_N,\ \alpha^T y = 0,\ K c = G^T Y \alpha \big\}. \tag{9}$$
The closure of the convex hull of $\mathcal{K}$ is denoted by $\mathrm{conv}(\mathcal{K})$, where the convex hull is the intersection of all convex sets containing $\mathcal{K}$. Because of the difficulties in computing the derivatives of $K(x)$, we applied a two-stage greedy strategy giving rise to Algorithm 1.

Algorithm 1 Lipschitz classifier
Input: $K^{(1)} := K(x_0) \in \mathcal{K}$, $T$
Set $t = 0$.
repeat
  1. Set $t = t + 1$.
  2. Calculate $(\alpha^{(t)}, c^{(t)}) := \operatorname*{argmax}_{(\alpha, c) \in Z_q(K^{(t)})} Q(\alpha, c, K^{(t)})$.
  3. Search for $x \in \mathcal{X}$ such that $K(x) \in \mathcal{K}$ and $Q(\alpha^{(t)}, c^{(t)}, K(x)) > Q(\alpha^{(t)}, c^{(t)}, K^{(t)})$; if no such $x$ exists, terminate.
  4. Calculate $\lambda^* := \operatorname*{argmax}_{\lambda \in [0,1]} q\big(\lambda K(x) + (1 - \lambda) K^{(t)}\big)$.
  5. Set $K^{(t+1)} = \lambda^* K(x) + (1 - \lambda^*) K^{(t)}$.
until $t = T$

In each iteration step, the algorithm finds a solution of (8) on a steadily growing convex subset of the convex hull $\mathrm{conv}(\mathcal{K})$, which is spanned by a sequence of matrices $K^{(t)} \in \mathcal{K}$ such that $q(K^{(1)}) < q(K^{(2)}) < \dots < q(K^{(T)})$. Therefore, one has to search for an $x \in \mathcal{X}$ so that $Q(\alpha^{(t)}, c^{(t)}, K(x)) > Q(\alpha^{(t)}, c^{(t)}, K^{(t)}) = q(K^{(t)})$. Using this new candidate $K(x) \in \mathcal{K}$ we conclude $q(K(x)) > q(K^{(t)})$. The algorithm thus calculates the new optimal convex combination of the given matrices $K(x)$ and $K^{(t)}$ in the sense that it improves the next iterate:
$$q(K^{(t+1)}) = \max_{\lambda \in [0,1]} q\big(\lambda K(x) + (1 - \lambda) K^{(t)}\big) \geq q(K(x)) > q(K^{(t)}). \tag{10}$$
On the other hand, if no $x \in \mathcal{X}$ can be found to improve the next iterate, or if the maximum number of steps is reached, the algorithm terminates. After obtaining a solution $(\alpha^*, K^*)$, one gets the associated primal solution $(c^*, b^*)$ as a solution of the equations $K^* c^* = G^T Y \alpha^*$ and $b^* = -\frac{1}{|S|} \sum_{i \in S} c_i^* \Phi(x_i)$. The index set $S$ contains all indices $i$ with $0 < \alpha_i < C$. These equations are easily obtained from the derivation of the dual problem (8).
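Steps 4 and 5 of Algorithm 1 amount to a one-dimensional search over convex combinations of two matrices. The sketch below performs this search on a uniform grid over [0, 1]; the paper does not state how the maximization over λ is carried out, so the grid search is a swapped-in simplification. The callable `q` is treated as a black box, and `q_toy` is a hypothetical stand-in used only to exercise the code: the actual q(K) requires solving the constrained problem in (9).

```python
def convex_combination(lam, K_new, K_old):
    """Return lam * K_new + (1 - lam) * K_old for matrices given as lists of rows."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_new, row_old)]
        for row_new, row_old in zip(K_new, K_old)
    ]

def line_search_step(q, K_new, K_old, num_points=101):
    """Grid search for lambda in [0, 1] maximizing q(lam * K_new + (1 - lam) * K_old).

    Both endpoints are on the grid, so the returned value is never below
    q(K_old) or q(K_new), mirroring the chain of inequalities in (10).
    """
    best_lam, best_val = 0.0, q(K_old)
    for i in range(1, num_points):
        lam = i / (num_points - 1)
        val = q(convex_combination(lam, K_new, K_old))
        if val > best_val:
            best_lam, best_val = lam, val
    return best_lam, convex_combination(best_lam, K_new, K_old), best_val

def q_toy(K):
    """Hypothetical stand-in for q(K); it is NOT the dual value defined in (9)."""
    trace = sum(K[i][i] for i in range(len(K)))
    return -(trace - 2.0) ** 2

# Illustrative matrices playing the roles of K^(t) and the candidate K(x).
K_old = [[1.0, 0.0], [0.0, 0.0]]
K_new = [[2.0, 0.0], [0.0, 2.0]]
lam_star, K_next, q_next = line_search_step(q_toy, K_new, K_old)
```

A finer grid or a scalar optimizer could replace the uniform grid; the only property the sketch relies on is that λ = 0 and λ = 1 are always evaluated.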
4 Experiments
We investigated an artificial, 2-dimensional and randomly generated classification problem for evaluation, training and testing. In total, one evaluation set, 10 training sets and 10 test sets were generated from a mixture of Gaussians. All sets were pairwise disjoint. First, for the positively labeled class, we simulated 10 means $\mu_k$ from a bivariate Gaussian distribution $N((1, 0)^T, I)$. Similarly, for the negatively labeled class, 10 further means were drawn from $N((0, 1)^T, I)$. Then, for each class, 50 vectors for evaluation, 100 vectors for each of the training sets and 200 vectors for each of the test sets were generated as follows: we picked a $\mu_k$ at random with probability 1/10, and then generated each observation from a further Gaussian $N(\mu_k, I/5)$. This procedure for generating the data is taken from [10].

For comparison, a Support Vector Machine and the Lipschitz classifier were investigated on the same data. In both classifiers, we employed Gaussian kernels $\Phi_n(x) := \exp(-\gamma \|x_n - x\|_2^2)$ as basis functions to obtain comparable and more easily interpretable results. Before starting the test experiments, we had to adjust all parameters of the learning machines, e.g. the algorithm parameter $C$, the kernel parameter $\gamma$ and the tuning parameters of the simulated annealing method [11] that we used in step 3 of the algorithm. Evaluation was done independently of the test sets, using one training set in conjunction with the evaluation set. Next, on every training set a Lipschitz classifier and an SVM were trained with fixed parameter settings from the evaluation (Lipschitz classifier: $C = 100$, $\gamma = 0.095$; SVM: $C = 80$, $\gamma = 0.35$), resulting in 10 different classifiers for each learning machine. For testing the Lipschitz classifier as well as the SVM, we performed a recognition run on every test set using one of the trained classifiers. This means, for example, that the first trained SVM was tested on the first generated test set. Similarly, the second trained SVM was tested on the second generated test set, and so on. From this procedure, 10 test results each for the Lipschitz classifier and the SVM were obtained. These results are summarized in Table 1.
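The data generation procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the script used for the experiments: the function names and the random seed are hypothetical, and pairwise disjointness of the sets is not enforced explicitly, since independent draws from a continuous distribution coincide with probability zero.

```python
import random

def draw_means(center, rng, count=10):
    """Draw `count` class means from the bivariate Gaussian N(center, I)."""
    return [(rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)) for _ in range(count)]

def sample_class(means, n, rng):
    """Draw n observations: pick a mean uniformly, then sample from N(mean, I/5)."""
    std = (1.0 / 5.0) ** 0.5  # covariance I/5 means variance 1/5 per coordinate
    points = []
    for _ in range(n):
        mu = rng.choice(means)  # each of the 10 means has probability 1/10
        points.append((rng.gauss(mu[0], std), rng.gauss(mu[1], std)))
    return points

def make_dataset(pos_means, neg_means, n_per_class, rng):
    """Return (points, labels) with n_per_class observations for each class."""
    points = sample_class(pos_means, n_per_class, rng) + sample_class(neg_means, n_per_class, rng)
    labels = [1] * n_per_class + [-1] * n_per_class
    return points, labels

rng = random.Random(0)  # fixed seed chosen only for reproducibility of the sketch
pos_means = draw_means((1.0, 0.0), rng)
neg_means = draw_means((0.0, 1.0), rng)

evaluation_set = make_dataset(pos_means, neg_means, 50, rng)
training_sets = [make_dataset(pos_means, neg_means, 100, rng) for _ in range(10)]
test_sets = [make_dataset(pos_means, neg_means, 200, rng) for _ in range(10)]
```

With 200 vectors per class, each test set contains 400 points, which matches the size stated in the caption of Table 1.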
Table 1. Classification accuracies for the Lipschitz classifier and Support Vector Machine on ten 2-dimensional randomly generated test sets. Each test set contains 400 data points.

#Test   SVM [%]         Lipschitz classifier [%]
1       87.25           85.50
2       87.50           86.00
3       90.75           91.25
4       85.75           84.25
5       88.50           88.75
6       86.00           86.25
7       91.00           89.00
8       83.50           83.00
9       88.00           87.75
10      86.50           85.50
mean    87.48 ± 2.27    86.73 ± 2.46

5 Conclusion and Future Work
As shown in Table 1, the freedom to use arbitrary continuously differentiable basis functions, provided by our novel implementation of a Lipschitz classifier, comes at a relative loss in mean accuracy of only 0.86% compared to the SVM. This suggests that the algorithm can be used in any application where a kernel-based classifier like the SVM is not directly applicable. Using the theory of a Banach space embedding, we justified that our novel algorithm is indeed a maximum margin classifier; thus it has the potential to generalize well, as the results of our experiments indicate, too. The given results will focus our future work on using basis functions other than Mercer kernels and on applying our algorithm to applications demanding streams of feature vectors of variable length. Such experiments using data from DNA analysis or motion image recognition will enable us to investigate the performance of the Lipschitz algorithm on real-world data.
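The summary row of Table 1 and the relative loss quoted in the conclusion can be re-derived from the ten per-set accuracies. The short sketch below does this with the Python standard library; that the "±" values are sample standard deviations is an assumption, made because it reproduces the tabulated numbers.

```python
from statistics import mean, stdev

# Per-test-set accuracies in percent, copied from Table 1.
svm = [87.25, 87.50, 90.75, 85.75, 88.50, 86.00, 91.00, 83.50, 88.00, 86.50]
lipschitz = [85.50, 86.00, 91.25, 84.25, 88.75, 86.25, 89.00, 83.00, 87.75, 85.50]

svm_mean, svm_std = mean(svm), stdev(svm)              # stdev uses the n - 1 denominator
lip_mean, lip_std = mean(lipschitz), stdev(lipschitz)

# Relative loss of the Lipschitz classifier with respect to the SVM, in percent.
relative_loss = (svm_mean - lip_mean) / svm_mean * 100.0
```

The absolute gap between the two means is 0.75 percentage points; dividing by the SVM mean gives the relative figure of about 0.86%.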
6 Appendix
Proof of Theorem 1. If $x_1, x_2 \in \mathcal{X}$, $x_1 \neq x_2$ and $S := \{\lambda x_1 + (1 - \lambda) x_2 : \lambda \in (0, 1)\} \subseteq \mathcal{X}$, then by the mean value theorem there exists $\hat{x} \in S$ so that
$$|f(x_1) - f(x_2)| = \big|\nabla f(\hat{x})^T (x_1 - x_2)\big| \leq \|\nabla f(\hat{x})\|_2 \, \|x_1 - x_2\|_2$$
$$\Leftrightarrow \quad \frac{|f(x_1) - f(x_2)|}{\|x_1 - x_2\|_2} \leq \|\nabla f(\hat{x})\|_2 \leq \max_{x \in \mathcal{X}} \|\nabla f(x)\|_2$$
$$\Rightarrow \quad L(f) = \max_{x_1, x_2 \in \mathcal{X},\, x_1 \neq x_2} \frac{|f(x_1) - f(x_2)|}{\|x_1 - x_2\|_2} \leq \max_{x \in \mathcal{X}} \|\nabla f(x)\|_2.$$
For $x \in \mathcal{X}$ and some $v \in \mathbb{R}^m$ with $\|v\|_2 = 1$, the directional derivative reads
$$\lim_{t \to 0} \frac{f(x + tv) - f(x)}{t} = \partial_v f(x) = \nabla f(x)^T v.$$
Next, we conclude for $\|\nabla f(x)\|_2 \neq 0$ and $v = \nabla f(x) / \|\nabla f(x)\|_2$ that
$$\|\nabla f(x)\|_2 = \lim_{t \to 0} \frac{|f(x + tv) - f(x)|}{|t|} \leq L(f).$$
The lower bound
$$\max_{x \in \mathcal{X}} \|\nabla f(x)\|_2 \leq L(f)$$
follows.
References
1. Vapnik V, Chervonenkis A (1979) Estimation of dependences based on empirical data. [in Russian]
2. Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin Heidelberg New York
3. Stuhlsatz A, Katz M, Krüger SE, Meier HG, Wendemuth A (2006) Support vector machines for postprocessing of speech recognition hypotheses. Proceedings of International Conference on Telecommunications and Multimedia TEMU
4. Courant R, Hilbert D (1953) Methods of mathematical physics. Interscience Publishers, Inc.
5. Schölkopf B, Smola AJ (2002) Learning with kernels. The MIT Press
6. von Luxburg U, Bousquet O (2004) Distance based classification with Lipschitz functions. Journal of Machine Learning Research, 5: 669-695
7. Graepel T, Herbrich R, Schölkopf B, Smola AJ, Bartlett P, Müller KR, Obermayer K, Williams R (1999) Classification of proximity data with LP machines. Proceedings of 9th International Conference on Artificial Neural Networks: 304-309
8. Hettich R, Kortanek KO (1993) Semi-infinite programming: Theory, methods and applications. SIAM Review, 35: 380-429
9. Shapiro A (2005) On the duality theory of convex semi-infinite programming. Optimization, 54: 535-543
10. Hastie T, Tibshirani R, Friedman J (2003) The elements of statistical learning. Springer, Berlin Heidelberg New York
11. Liu JS (2001) Monte Carlo strategies in scientific computing. Springer, Berlin Heidelberg New York