
Online Learning of Approximate Maximum Margin Classifiers with Biases Kosuke Ishibashi, Kohei Hatano and Masayuki Takeda


Abstract We consider online learning of linear classifiers which approximately maximize the 2-norm margin. Given a linearly separable sequence of instances, typical online learning algorithms such as Perceptron and its variants map them into an augmented space with an extra dimension, so that the instances are separated by a linear classifier without a constant bias term. However, this mapping might decrease the margin over the instances. In this paper, we propose a modified version of Li and Long's ROMMA that avoids such a mapping, and we show that our modified algorithm achieves higher margin than previous online learning algorithms.

Keywords online learning · maximum margin classification · Perceptron · Support Vector Machine · bias

1 Introduction

Support Vector Machine (SVM) (Boser et al 1992) is one of the most powerful tools in Machine Learning and related areas. It computes the maximum margin hyperplane over linearly separable data, and it can deal with linearly inseparable data as well by using non-linear mappings induced by kernels (Cristianini and Shawe-Taylor 2000) or soft margin techniques (Cortes and Vapnik 1995). Further, margin-based generalization bounds (Schapire et al 1998; Shawe-Taylor et al 1998; Cristianini and Shawe-Taylor 2000) guarantee that maximizing the margin is a robust way to learn. SVM has been shown to be quite effective in various classification and regression tasks in practice. The main computation of SVM involves a quadratic programming problem, which can be solved by using standard optimization solvers.

Department of Informatics, Kyushu University. E-mail: {k-ishi, hatano, takeda}@i.kyushu-u.ac.jp

On the other hand, a disadvantage of SVM is that it is time-consuming to run, especially over large data. There has been much research on making SVM more scalable. A major approach to scaling up SVM is to decompose the original quadratic program into smaller problems which are easier to solve (Osuna et al 1997; Platt 1999; Joachims 1999; Chang and Lin 2001; Joachims 2006). In recent years, as an alternative approach, several researchers have tried to apply online learning algorithms in order to obtain maximum margin hypotheses. Online learning algorithms such as Perceptron (Rosenblatt 1959; Novikoff 1962; Minsky and Papert 1969) and its variants (Anlauf and Biehl 1989; Freund and Schapire 1999; Li and Long 2002; Gentile 2001) incrementally update their hypotheses per given instance, so they can be very fast and work efficiently when instances arrive in streams. Several Perceptron-like algorithms that find large margin classifiers have been proposed: Kernel Adatron (Friess et al 1998), Max Margin Perceptron (Kowalczyk 2000), Voted Perceptron (Freund and Schapire 1999), ROMMA (Li and Long 2002), ALMA (Gentile 2001), NORMA (Kivinen et al 2004), MICRA (Tsampouka and Shawe-Taylor 2007), and so on. These Perceptron-like algorithms are much faster than typical implementations of SVM and their predictive performances are close to that of SVM.

However, current Perceptron-like algorithms do not seem to fully exploit the linear separability of data. Technically speaking, they are designed to learn homogeneous hyperplanes (hyperplanes that pass through the origin), i.e., linear classifiers without constant biases. So the following mapping is used to learn non-homogeneous hyperplanes indirectly (Cristianini and Shawe-Taylor 2000): given an instance x ∈ X ⊆ ℝⁿ, the mapping φ is defined as φ : x ↦ (x, −R), where R = sup_{x∈X} ‖x‖.
By using this mapping φ, any non-homogeneous hyperplane u · x + b in the original space can be viewed as a homogeneous hyperplane ũ · x̃ in the augmented space, where ũ = (u, −b/R) and x̃ = (x, −R). But this mapping weakens the guarantee of the margin: Suppose that for a given sequence S of labeled instances (x_1, y_1), ..., (x_T, y_T) (x_t ∈ ℝⁿ and y_t ∈ {−1, +1} for t = 1, ..., T), there exists a non-homogeneous hyperplane (u, b) that separates S and satisfies ‖u‖ = 1. Then the margin γ of the hyperplane (u, b) over S in the original space is defined as γ = min_{t=1,...,T} y_t (u · x_t + b). On the other hand, the margin γ̃ of the corresponding homogeneous hyperplane ũ in the augmented space is

    γ̃ = min_t y_t (ũ · x̃_t)/‖ũ‖ ≥ min_t y_t (u · x_t + b)/√2 = γ/√2 ≈ 0.7γ,

since ‖ũ‖² = ‖u‖² + b²/R² ≤ 2 (note that the bias b cannot exceed R)¹. Even though the guaranteed margin is worse only by a constant factor, this might cause a significant difference in prediction performance in practice.

¹ Further, not only the guarantee of the margin, but also that of the generalization error becomes slightly weaker. Let R̃ = sup_{x̃∈φ(X)} ‖x̃‖. As in (Shawe-Taylor et al 1998; Cristianini and Shawe-Taylor 2000), generalization error bounds also depend on R̃, which is bounded by √2 R.
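The 1/√2 factor above is easy to check numerically. The following is a small sketch; the 2-dimensional data and the separator (u, b) are made up for illustration, and only the mapping φ and the √2 bound come from the text:

```python
import numpy as np

R = 1.0
X = np.array([[0.5, 0.0], [-0.9, 0.0]])   # instances with ||x|| <= R
y = np.array([+1, -1])
u, b = np.array([1.0, 0.0]), 0.5          # unit-norm separator (u, b)

# margin in the original space: min_t y_t (u . x_t + b)
gamma = np.min(y * (X @ u + b))

# augmented space: x_tilde = (x, -R), u_tilde = (u, -b/R)
X_t = np.hstack([X, -R * np.ones((len(X), 1))])
u_t = np.append(u, -b / R)
gamma_t = np.min(y * (X_t @ u_t)) / np.linalg.norm(u_t)

# the augmented margin can shrink, but by at most a factor of sqrt(2)
print(gamma, gamma_t)
```

Here the original margin is 0.4, while the augmented margin is about 0.358, which is still above 0.4/√2.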


In this paper, we propose a new online learning algorithm which avoids such a mapping. Our algorithm ROMMAb (Relaxed Online Maximum Margin Algorithm with Bias) is a modification of ROMMA (Relaxed Online Maximum Margin Algorithm) (Li and Long 2002). ROMMAb directly learns a non-homogeneous hyperplane that separates the given data, provided that the data is linearly separable. We show that ROMMAb, given a parameter δ (0 < δ ≤ 1), finds a linear classifier which has margin at least (1 − δ)γ in O(R²/(δ²γ²)) updates, when there exists a hyperplane with margin γ that separates the given sequence of data. As a result, unlike those Perceptron-like online algorithms which use the mapping, ROMMAb finds approximately the same linear classifiers as those obtained by SVM. Among related work, Kernel Adatron (Friess et al 1998) and Max Margin Perceptron (Kowalczyk 2000) also seem to find biases directly. However, Kernel Adatron is not suitable for the online setting, since it needs to store past examples to compute the bias. Max Margin Perceptron finds the same solution as ROMMAb, but its upper bound on the number of updates is O(R² log(R/γ)/(δ²γ²)), which is a factor log(R/γ) worse than that of ROMMAb. ROMMAb, as well as SVM and other existing Perceptron-like online algorithms, can deal with linearly inseparable data by using kernels and the 2-norm soft margin formulation. Our preliminary experimental results show that ROMMAb achieves higher margin than previous Perceptron-like algorithms over artificial and real datasets.

2 Preliminaries

Let X ⊆ ℝⁿ be the instance space for a fixed n ∈ ℕ. A pair (x, y) of an instance x ∈ X and a label y ∈ {−1, +1} is called an example. In the standard setting for online learning of binary classifiers (Littlestone 1988), learning proceeds in trials. At each trial t, the learner receives an instance x_t ∈ ℝⁿ and predicts a label ŷ_t ∈ {−1, +1}. Then the learner receives the true label y_t ∈ {−1, +1} and possibly updates its current hypothesis depending on the received label. In particular, if y_t ≠ ŷ_t, we say that the learner makes a mistake. A typical goal of online learning is to make the number of mistakes as small as possible. Most known online algorithms are mistake-driven, that is, they update their hypotheses when they make a mistake.

Online learning algorithms can also be used in the batch setting, where a bunch of examples is given to the learner at once. There are several ways to convert an online learning algorithm into a batch one (Littlestone 1989; Helmbold and Warmuth 1995; Freund and Schapire 1999; Cesa-Bianchi et al 2004; Cesa-Bianchi and Gentile 2006). Simply put, these conversion strategies ensure that an online learning algorithm with a small mistake bound can be used to produce a hypothesis which generalizes well.

In this paper we restrict ourselves to learning linear classifiers, i.e., those that can be determined by a hyperplane over X. More precisely, a linear classifier f : X → {−1, +1} is written as f(x) = sign(w · x + b) for some weight vector

w ∈ ℝⁿ and bias b ∈ ℝ, where sign(a) = +1 if a ≥ 0 and sign(a) = −1 otherwise. The (2-norm geometric) margin of a hyperplane (w, b) over an example (x, y) is defined as y(w · x + b)/‖w‖, where ‖·‖ denotes the 2-norm. Note that a linear classifier f(x) = sign(w · x + b) correctly classifies an example (x, y) if and only if the margin of the associated hyperplane is positive. For any sequence of examples S = ((x_1, y_1), ..., (x_T, y_T)) (T ≥ 1), the margin of a hyperplane (w, b) over S is defined as min_{t=1,...,T} y_t (w · x_t + b)/‖w‖.

The algorithms we consider update their hypotheses not only when they make a mistake, but also when their hypotheses have insufficient (unnormalized) margin. Now our goal is, given a sequence of examples which is linearly separable with margin γ and a parameter δ (0 < δ ≤ 1), to minimize the number of updates needed to obtain a linear classifier with margin at least (1 − δ)γ.
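As a sketch, the margin of (w, b) over a sequence S can be computed directly from this definition; the toy data below is hypothetical:

```python
import numpy as np

def margin(w, b, X, y):
    """2-norm geometric margin of the hyperplane (w, b) over examples
    (X, y): min_t y_t (w . x_t + b) / ||w||."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# toy sequence of examples, made up for illustration
X = np.array([[2.0, 0.0], [-1.0, 1.0]])
y = np.array([+1, -1])
print(margin(np.array([1.0, 0.0]), 0.0, X, y))   # -> 1.0
```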

3 ROMMA and Our Modification

In this section, we briefly review Li and Long's ROMMA, and then we propose a modified version of ROMMA.

3.1 ROMMA

In order to explain ROMMA, we consider an "online" variant of SVM (online SVM for short). Given a sequence of examples S = ((x_1, y_1), ..., (x_{t−1}, y_{t−1})) and an instance x_t, the online SVM predicts ŷ_t = sign(w_t · x_t + b_t), where w_t and b_t are given as follows:

    (w_t, b_t) = argmin_{w,b} (1/2)‖w‖²,
    subject to y_j (w · x_j + b) ≥ 1   (j = 1, ..., t − 1).    (1)

ROMMA can be viewed as a relaxed version of the online SVM. ROMMA predicts ŷ_t = sign(w_t · x_t) such that

    w_t = argmin_w (1/2)‖w‖²,
    subject to y_{t−1}(w · x_{t−1}) ≥ 1 and w · w_{t−1} ≥ ‖w_{t−1}‖².    (2)

ROMMA² is different from the online SVM in that (i) ROMMA only learns a hyperplane without bias, (ii) ROMMA updates its current hypothesis w_t only if y_t (w_t · x_t) < 1 − δ, where δ (0 < δ ≤ 1) is a pre-specified parameter, and (iii) there are only two constraints: one for the last misclassified example and one for the last hypothesis. It can be shown that, under the assumption

² In (Li and Long 2002), this version of ROMMA is called "aggressive ROMMA". In this paper, we just call it ROMMA for short.


ROMMAb(δ)
begin
1. (Initialization) Get examples (x_1^pos, +1) and (x_1^neg, −1). Let w_0 = (0, ..., 0) ∈ ℝⁿ.
2. For t = 1 to T,
   (a) Receive an instance x_t.
   (b) Let
           (w_t, b_t) = argmin_{w∈ℝⁿ, b∈ℝ} (1/2)‖w‖²,
           subject to: w · x_t^pos + b ≥ 1,
                       w · x_t^neg + b ≤ −1,
                       w · w_{t−1} ≥ ‖w_{t−1}‖².
   (c) Predict ŷ_t = sign(w_t · x_t + b_t).
   (d) Receive the label y_t. If y_t (w_t · x_t + b_t) < 1 − δ, update
           (x_{t+1}^pos, x_{t+1}^neg) = (x_t, x_t^neg)   if y_t = +1,
           (x_{t+1}^pos, x_{t+1}^neg) = (x_t^pos, x_t)   if y_t = −1.
       Otherwise, let (x_{t+1}^pos, x_{t+1}^neg) = (x_t^pos, x_t^neg).
end.

Fig. 1 The description of ROMMAb.
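A minimal Python sketch of the procedure in Figure 1 follows. It uses the closed-form solution of the per-trial quadratic program derived in Section 3.2; the epoch-based stopping rule and the function names are our own additions for illustration:

```python
import numpy as np

def romma_b_update(v, x_pos, x_neg):
    """Solve one ROMMAb quadratic program in closed form.
    v is the previous weight vector w_{t-1}."""
    z = x_pos - x_neg
    # Case (i): try the solution that ignores the constraint w.v >= ||v||^2
    alpha = 2.0 / (z @ z)
    w = alpha * z
    if w @ v < v @ v - 1e-12:
        # Case (ii): the constraint w.v >= ||v||^2 is active
        vv, zz, vz = v @ v, z @ z, v @ z
        denom = vv * zz - vz ** 2
        alpha = vv * (2.0 - vz) / denom
        beta = (vv * zz - 2.0 * vz) / denom
        w = alpha * z + beta * v
    b = -(w @ x_pos + w @ x_neg) / 2.0
    return w, b

def romma_b(X, y, delta=0.1, epochs=100):
    """Sketch of ROMMAb over a batch of examples; assumes X contains at
    least one positive and one negative example."""
    x_pos = X[y == +1][0].astype(float)
    x_neg = X[y == -1][0].astype(float)
    w, b = romma_b_update(np.zeros(X.shape[1]), x_pos, x_neg)
    for _ in range(epochs):
        updated = False
        for xt, yt in zip(X, y):
            if yt * (w @ xt + b) < 1.0 - delta:   # insufficient margin
                if yt == +1:
                    x_pos = xt.astype(float)
                else:
                    x_neg = xt.astype(float)
                w, b = romma_b_update(w, x_pos, x_neg)
                updated = True
        if not updated:
            break
    return w, b
```

For a linearly separable toy set such as X = [[2,0], [3,1], [0,0], [−1,1]] with y = [+1, +1, −1, −1], the returned (w, b) satisfies y_t (w · x_t + b) ≥ 1 − δ on every example.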

that the bias b = 0, the constraints of (2) are weaker than those of (1) (Li and Long 2002). In fact, the second constraint in (2) corresponds to the halfspace that contains the polyhedron representing the constraints y_j (w · x_j) ≥ 1 (j = 1, ..., t − 2).

3.2 ROMMAb

Now we describe our new algorithm, which we call ROMMAb (Relaxed Online Maximum Margin Algorithm with Bias). Our modification is quite simple:

    (w_t, b_t) = argmin_{w,b} (1/2)‖w‖²,
    subject to w · x_t^pos + b ≥ 1, w · x_t^neg + b ≤ −1, and
               w · w_{t−1} ≥ ‖w_{t−1}‖².    (3)

ROMMAb just keeps (x_t^pos, +1) and (x_t^neg, −1), the last positive and negative examples which incurred updates over the past t − 1 trials. This allows us to find a bias directly, without mapping to the augmented space with an extra dimension. The details of ROMMAb are given in Figure 1. In each trial t, ROMMAb solves the quadratic programming problem described in (3). Now we derive the update by solving the problem analytically.


For simplicity of notation, we omit the subscript t, and let v = w_{t−1}. Let L be the Lagrangian function defined as

    L(w, b, α, β) = (1/2)‖w‖² + Σ_{ℓ∈{pos,neg}} α^ℓ {1 − y^ℓ (w · x^ℓ + b)} + β(‖v‖² − w · v),

where y^pos = +1, y^neg = −1, α^ℓ ≥ 0 (ℓ ∈ {pos, neg}) and β ≥ 0. The partial derivatives of L w.r.t. w and b are given respectively as

    ∂L/∂w_i = w_i − Σ_{ℓ∈{pos,neg}} α^ℓ y^ℓ x_i^ℓ − βv_i   and   ∂L/∂b = −Σ_{ℓ∈{pos,neg}} α^ℓ y^ℓ.    (4)

Let (w*, b*) be the optimal solution of the problem. Then the solution (w*, b*) must make the derivatives (4) zero. That is,

    w_i* = Σ_{ℓ∈{pos,neg}} α^ℓ y^ℓ x_i^ℓ + βv_i   and   y^pos α^pos + y^neg α^neg = 0.

Since y^pos = +1 and y^neg = −1, we have α^pos = α^neg, so let α = α^pos = α^neg. By the KKT conditions (see e.g., (Cristianini and Shawe-Taylor 2000)), it holds that

    α^ℓ {1 − y^ℓ (w* · x^ℓ + b*)} = 0   (ℓ ∈ {pos, neg}),    (5)
    α^ℓ ≥ 0,  y^ℓ (w* · x^ℓ + b*) ≥ 1   (ℓ ∈ {pos, neg}),    (6)
    β(‖v‖² − w* · v) = 0,    (7)
    β ≥ 0,  w* · v ≥ ‖v‖².    (8)

Now assume that α = 0. Then conditions (7) and (8) imply that β = 1, i.e., w* = v, which cannot happen since, by definition, v does not satisfy all of the constraints. So we conclude that α > 0. For β, we consider two cases: (i) β = 0 and (ii) β > 0.

(i) Suppose that β = 0. Then condition (8) implies w* · v ≥ ‖v‖². Further, by using condition (5), w* = αz, where

    α = 2/‖z‖²   and   z = x^pos − x^neg.    (9)

(ii) Otherwise, β > 0. By condition (7), we obtain w* · v = ‖v‖². Solving the system of equations, we have

    w* = αz + βv,    (10)

where

    α = ‖v‖²(2 − v · z) / (‖v‖²‖z‖² − (v · z)²)   and   β = (‖v‖²‖z‖² − 2(v · z)) / (‖v‖²‖z‖² − (v · z)²).    (11)

In either case (i) or (ii), the bias b* is given as

    b* = −(w* · x^pos + w* · x^neg)/2.    (12)

In order to obtain the solution w*, first check whether the value (9) satisfies v · w* ≥ ‖v‖². If it does, the solution is (9); otherwise, the solution is given by (10) and (11).
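As a quick sanity check of (10)–(12), the following sketch verifies numerically that the case (ii) solution satisfies the active constraints w* · x^pos + b* = 1, w* · x^neg + b* = −1, and w* · v = ‖v‖². The random vectors are arbitrary; we only check the linear system, not the sign conditions on α and β:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)                  # previous weight vector w_{t-1}
x_pos, x_neg = rng.normal(size=5), rng.normal(size=5)
z = x_pos - x_neg

# closed form (10)-(11)
vv, zz, vz = v @ v, z @ z, v @ z
denom = vv * zz - vz ** 2
alpha = vv * (2.0 - vz) / denom
beta = (vv * zz - 2.0 * vz) / denom
w = alpha * z + beta * v
b = -(w @ x_pos + w @ x_neg) / 2.0      # bias (12)

# active constraints of (3) hold with equality
assert abs((w @ x_pos + b) - 1.0) < 1e-9
assert abs((w @ x_neg + b) + 1.0) < 1e-9
assert abs(w @ v - vv) < 1e-9
```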

3.3 Convergence proof

In the following, we prove that ROMMAb learns the maximum margin hyperplane approximately in a finite number of updates. Before proving the main theorem, we need some lemmas. First of all, by the derivation of the update, it is clear that the constraints for the two last stored examples (x_t^pos, +1) and (x_t^neg, −1) are always active in each trial, i.e., the constraints hold with equality.

Lemma 1 For t ≥ 1, it holds that w_t · x_t^pos + b_t = 1 and w_t · x_t^neg + b_t = −1.

Then we prove some technical lemmas.

Lemma 2 Let (u, b) ∈ ℝⁿ × ℝ be a hyperplane such that y_j (u · x_j + b) ≥ 1 for j = 1, ..., t. Then it holds that u · w_t ≥ ‖w_t‖² and ‖u‖ ≥ ‖w_t‖.

Proof. The proof of the first inequality is by induction on t. Without loss of generality, we assume that an update is made at each trial t ≥ 1. For t = 1, the vector is written as w_1 = α(x_1^pos − x_1^neg) for some α ≥ 0. By the definition of u and b, it holds that u · x_1^pos + b ≥ 1 and u · x_1^neg + b ≤ −1, respectively. So we obtain

    u · w_1 = α(u · x_1^pos − u · x_1^neg) ≥ α(1 − b + 1 + b) = 2α.

On the other hand, by Lemma 1, we have

    ‖w_1‖² = α w_1 · (x_1^pos − x_1^neg) = α(1 − b_1 + 1 + b_1) = 2α,

which shows u · w_1 ≥ ‖w_1‖². Suppose that for t < t′ the statement is true. Then there are two cases: (i) w_{t′} · w_{t′−1} = ‖w_{t′−1}‖² and w_{t′} = α(x_{t′}^pos − x_{t′}^neg) + βw_{t′−1} for α and β given in (11), or (ii) w_{t′} · w_{t′−1} > ‖w_{t′−1}‖² and w_{t′} = α(x_{t′}^pos − x_{t′}^neg). For case (ii), the proof follows the same argument as for t = 1, so we only consider case (i). By the inductive assumption, we have

    u · w_{t′} = α(u · x_{t′}^pos − u · x_{t′}^neg) + βu · w_{t′−1} ≥ 2α + β‖w_{t′−1}‖².

By Lemma 1,

    ‖w_{t′}‖² = w_{t′} · (α(x_{t′}^pos − x_{t′}^neg) + βw_{t′−1}) = 2α + βw_{t′} · w_{t′−1} = 2α + β‖w_{t′−1}‖².

So we get u · w_{t′} ≥ ‖w_{t′}‖², and thus we prove the first inequality. The second inequality holds immediately, since both (u, b) and (w_t, b_t) satisfy the same constraints in (3) and (w_t, b_t) minimizes the norm by definition.

Lemma 3 For each trial t ≥ 1 in which an update is incurred,

    ‖w_{t+1} − w_t‖ ≥ δ/(2R),

where R = max_{j=1,...,t} ‖x_j‖.

Proof. Suppose that w_t makes an update after classifying x_t with y_t = +1, so x_{t+1}^pos = x_t and x_{t+1}^neg = x_t^neg. Then

    w_t · x_t + b_t < 1 − δ   and   w_{t+1} · x_t + b_{t+1} = 1.    (13)

For the negative instance x_t^neg,

    w_t · x_t^neg + b_t = −1   and   w_{t+1} · x_t^neg + b_{t+1} = −1.    (14)

By subtracting (14) from (13), we obtain

    w_t · (x_t − x_t^neg) < 2 − δ   and   w_{t+1} · (x_t − x_t^neg) = 2.    (15)

Now, by the Cauchy–Schwarz inequality,

    ‖w_{t+1} − w_t‖ ‖x_t − x_t^neg‖ ≥ (w_{t+1} − w_t) · (x_t − x_t^neg) ≥ δ.

So we get

    ‖w_{t+1} − w_t‖ ≥ δ/‖x_t − x_t^neg‖ ≥ δ/(2R).

For y_t = −1 the proof is analogous, and thus we complete the proof.

Now we are ready to prove our main result.

Theorem 1 Suppose that for a sequence S = ((x_1, y_1), ..., (x_T, y_T)) there exists a hyperplane (u, b) ∈ ℝⁿ × ℝ such that y_t (u · x_t + b) ≥ 1 for t = 1, ..., T, and the hyperplane (u, b) has margin γ over S. In addition, let R = max_{t=1,...,T} ‖x_t‖. Then ROMMAb(δ) satisfies the following properties.

1. The number of updates made by ROMMAb(δ) is at most 4R²‖u‖²/δ².
2. After at most 4R²‖u‖²/δ² updates, ROMMAb(δ) finds a hypothesis with margin at least (1 − δ)γ.

Proof. We begin by proving the first statement of the theorem. First, we show that ‖w_{t+1}‖² ≥ ‖w_t‖² + δ²/(4R²). By Lemma 3 and the constraint w_{t+1} · w_t ≥ ‖w_t‖², we have, for each trial t in which an update is made,

    ‖w_{t+1}‖² = ‖(w_{t+1} − w_t) + w_t‖² = ‖w_{t+1} − w_t‖² + 2(w_{t+1} − w_t) · w_t + ‖w_t‖² ≥ δ²/(4R²) + ‖w_t‖².

On the other hand, by the second inequality in Lemma 2, ‖w_t‖² is bounded as ‖u‖² ≥ ‖w_t‖². Therefore the number of updates is at most 4R²‖u‖²/δ², which proves the first statement.

Now we prove the second statement. By Lemma 2, we have ‖w_t‖ ≤ ‖u‖ for t ≥ 1. Once no more updates are made, every example has unnormalized margin at least 1 − δ, so the achieved margin is at least

    (1 − δ)/‖w‖ ≥ (1 − δ)/‖u‖ = (1 − δ)γ,

which completes the proof.

Note that our analysis in Theorem 1 regarding the margin also holds for ROMMA, for which Li and Long proved only asymptotic convergence to the maximum margin classifier.

4 Kernels and Soft Margin Extensions

In this section, we discuss two extensions, using kernel functions and a soft margin formulation.

4.1 Kernel Extension

Like SVM, ROMMA, and other Perceptron-like online algorithms, our algorithm ROMMAb can use kernel functions. Note that, at trial t, the weight vector w_t can be written as

    w_t = Σ_{j=1}^{t−1} ( Π_{n=j+1}^{t−1} β_n ) α_j (x_j^pos − x_j^neg),

so an inner product w_t · x_t is given as a weighted sum of inner products x_j · x_{j′} between instances. Therefore, we can apply kernel methods by replacing each inner product x_j · x_{j′} with K(x_j, x_{j′}) for some kernel K. More practically, we can compute the inner product between w_t and a mapped instance using the recurrence relation w_t = α_t (x_t^pos − x_t^neg) + β_t w_{t−1}. For more details of an efficient implementation, see (Li and Long 2002).

4.2 Soft Margin Extension

In order to apply ROMMAb to linearly inseparable data, as in (Kowalczyk 2000; Li and Long 2002), we introduce the 2-norm soft margin formulation (Cortes and Vapnik 1995; Cristianini and Shawe-Taylor 2000), which is given as follows:

    min_{w,b,ξ} (1/2)‖w‖² + (C/2) Σ_{j=1}^{t} ξ_j²,
    subject to y_j (w · x_j + b) ≥ 1 − ξ_j   (j = 1, ..., t),
               ξ_j ≥ 0   (j = 1, ..., t),

where the constant C > 0 is given as a parameter. It is well known that this formulation is equivalent to the 2-norm minimization problem over linearly separable examples in an augmented space:

    min_{w̃,b} (1/2)‖w̃‖²,
    subject to y_j (w̃ · x̃_j + b) ≥ 1   (j = 1, ..., t),

where w̃ = (w, √C ξ), x̃_j = (x_j, (y_j/√C) e_j), and each e_j is a unit vector in ℝᵗ whose j-th element is 1 and whose other elements are 0. To use a kernel function K with this soft margin formulation, we just modify K as follows:

    K̃(x_i, x_j) = K(x_i, x_j) + δ_ij/C,    (16)

where δ_ij = 1 if i = j and δ_ij = 0 otherwise.
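In practice, (16) amounts to adding 1/C to the diagonal of the Gram matrix. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def soft_margin_gram(G, C):
    """Apply (16) to a precomputed Gram matrix G:
    K~(x_i, x_j) = K(x_i, x_j) + delta_ij / C."""
    return G + np.eye(G.shape[0]) / C

G = np.array([[1.0, 0.2],
              [0.2, 1.0]])
print(soft_margin_gram(G, C=2.0))   # adds 1/C = 0.5 on the diagonal
```

The larger C is, the smaller the diagonal correction, i.e., the harder the margin.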

5 Experiments

To see how our algorithm works in practice, we test it on artificial and real datasets. In our experiments, we use four online learning algorithms: Perceptron, ALMA³, ROMMA, and ROMMAb.

³ More precisely, the algorithm is called ALMA_p in the original paper (Gentile 2001). ALMA_p is designed to maximize the p-norm margin. In this paper, we refer to ALMA_2 as ALMA for short.


5.1 Experiments over artificial datasets

Each artificial dataset consists of 100-dimensional {−1, +1}-valued vectors which are labeled with an r-of-k threshold function, where an r-of-k threshold function f is represented as f(x) = sign(x_{i_1} + ... + x_{i_k} + k − 2r + 1) for some i_1, ..., i_k s.t. 1 ≤ i_1 ≤ i_2 ≤ ... ≤ i_k ≤ n; it outputs +1 if at least r of the k relevant features have value +1, and outputs −1 otherwise. For k = 16 and r = 2, 4, 8 (equivalently, the bias is 13, 9, 1, respectively), we generate 900 random training examples and 100 test examples labeled by the r-of-k threshold function, so that positive and negative examples are equally likely. In order for Perceptron, ALMA, and ROMMA to learn linear classifiers with biases, we add an extra dimension with value −R to each instance when we run them, where R = max ‖x‖ = √100 = 10. For ALMA, ROMMA, and ROMMAb, the algorithms which provably achieve an approximate maximum margin, we set parameters so that the guaranteed margin after sufficiently many updates is at least 0.99 and 0.5 times the maximum margin, respectively. To do this, we set α = 0.01 and 0.5 for ALMA (note that the parameter α is defined differently in (Gentile 2001)), and δ = 0.01 and 0.5 for ROMMA and ROMMAb, respectively. We train the algorithms for 1000 epochs to obtain 0.99 and 0.5 times the maximum margin, respectively, where in one epoch each algorithm goes through the whole training data. At the end of each epoch, for each algorithm, we record (i) the number of updates incurred during the epoch, (ii) the classification error rate of its hypothesis over the test data, and (iii) the margin of the hypothesis over the training data. Note that we measure the margin of each hypothesis in the original space. We repeat this procedure 10 times, generating training and test data randomly each time, and average the results over the 10 runs. Figure 2 shows the total number of updates of each algorithm for each epoch.
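The r-of-k labeling above can be sketched as follows. The generator is our own illustration; the equivalence between the counting form and the sign form holds because the sum of the k relevant ±1 features equals 2p − k when p of them are +1:

```python
import numpy as np

def r_of_k_label(x, relevant, r):
    """+1 iff at least r of the k relevant features equal +1;
    equivalently sign(x_{i_1} + ... + x_{i_k} + k - 2r + 1)."""
    k = len(relevant)
    return 1 if x[relevant].sum() + k - 2 * r + 1 >= 0 else -1

rng = np.random.default_rng(0)
relevant = np.arange(16)                    # k = 16 relevant features
X = rng.choice([-1, 1], size=(900, 100))    # 100-dim {-1,+1} instances
y = np.array([r_of_k_label(x, relevant, r=4) for x in X])
```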
Perceptron is the fastest to converge among the algorithms, but, as we will show later, Perceptron does not generalize well. Although ROMMAb converges more slowly than Perceptron, it converges faster than the other two algorithms, with fewer updates, especially when the guaranteed margin is 0.99 times the maximum and the bias of the true function is large. Figure 3 then shows the test error rate of each algorithm for each epoch. As can be seen, both ROMMA and ROMMAb achieve a low test error rate faster than the other algorithms; the two perform similarly and are comparable in most cases. In Figure 4, we summarize the margin of the hypothesis obtained by each algorithm at the end of each epoch. ROMMAb achieves the highest margin for all values of r when the guaranteed margin is 0.99 times the maximum margin. However, ROMMA gains higher margin than ROMMAb when the guaranteed margin is 0.5 times the maximum.

In the experiments over artificial datasets, both ROMMA and ROMMAb generalize the best among the four algorithms. ROMMAb achieves higher margin

Fig. 2 Total number of updates of the four algorithms over the training data labeled with r-of-k threshold functions for k = 16 and r = 2, 4, 8 (bias = 13, 9, 1). The results on the left (right) side are obtained when the guaranteed margin is at least 0.99 (0.5) times the maximum margin.

with fewer updates compared to ROMMA. The reason why ROMMAb converges faster than ROMMA might be that ROMMAb can change the bias drastically, because its bias is not regularized, while ROMMA tends to change the bias gradually, as its bias is regularized. We should note that with δ = 0.5, ROMMAb gains lower margin than ROMMA. It seems that ROMMAb is sensitive to the approximation parameter δ.

Fig. 3 Test error rates of the four algorithms for training data labeled with r-of-k threshold functions for k = 16 and r = 2, 4, 8 (bias = 13, 9, 1). The results on the left (right) side are obtained when the guaranteed margin is at least 0.99 (0.5) times the maximum margin. Note that on the left side the x-axes are on a logarithmic scale, so that the difference between ROMMAb and ROMMA can be seen clearly. ALMA's test error is much worse than that of the other algorithms and is omitted in the results on the left side.


Fig. 4 Margins obtained by the four algorithms over the training data labeled with r-of-k threshold functions for k = 16 and r = 2, 4, 8 (bias = 13, 9, 1). The results on the left (right) side are obtained when the guaranteed margin is at least 0.99 (0.5) times the maximum margin. ALMA's margin is much worse than that of the other algorithms and is omitted in the results on the left side, since we emphasize the difference between ROMMA and ROMMAb.


                 total # of updates    margin     test error (%)
  Perceptron           246.6           0.0005         0.822
  ALMA               41484.6          -0.0071         0.583
  ROMMA               5104             0.0397         0.597
  ROMMAb              5241             0.0442         0.597
  SVM                    -             0.0479         0.600
  SVM w/o bias           -             0.0422         0.594

Table 1 Total number of updates, test error rate (%), and margin for subsets of UCI Letter after 200 epochs. Results are averaged over 26 labels and 5 different subsets.

5.2 Experiments over the UCI Letter dataset

The real dataset we use is the Letter dataset in the UCI Machine Learning Repository (Asuncion and Newman 2007). The Letter dataset contains 20,000 examples. Each example has 16 integer-valued features and an {A, ..., Z}-valued label. In our experiments, in order to observe the behavior of the algorithms over long epochs, we choose 4,000 training examples and 4,000 test examples randomly from the whole dataset. We use the Gaussian kernel

    K(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ²))

with σ = 4, where this choice of σ is also used in (Gentile 2001). For the parameters regarding margin (the parameter δ for ROMMA and ROMMAb and α for ALMA), we use α = 0.1 and δ = 0.1 in order to achieve at least 0.9 times the maximum margin. As in (Gentile 2001), for each algorithm and each label ℓ ∈ {A, ..., Z}, we train the algorithm and obtain a binary classifier that classifies whether an instance is labeled with ℓ (+1) or not (−1). For each label, we run each algorithm over the training set for 200 epochs. At the end of each epoch, we measure the test error of the current classifier of each algorithm and the margin of the obtained hypothesis over the training set. We repeat this procedure 5 times, each time choosing 4,000 examples randomly as a training set. Finally, we also run SVMs with and without bias terms to compare the online algorithms against them.

The results are summarized in Figure 5 and Table 1. As Figure 5 and Table 1 show, ROMMAb obtains the highest margin among the online learning algorithms and, compared to the SVMs, ROMMAb seems to approximate the margin obtained by SVM with bias. The total number of updates made by ROMMAb is slightly worse than that of ROMMA, but is 8 times better than that of ALMA. On the other hand, ALMA generalizes slightly better than the other online algorithms, including ROMMAb. This might be because ALMA is slow to maximize the minimum margin, but it might enlarge the margins of examples "on average". Note that, according to margin-based generalization bounds (Shawe-Taylor et al 1998; Cristianini and Shawe-Taylor 2000), a linear classifier which enlarges the margins over most of the examples generalizes well.

Next, in order to see more carefully how ROMMAb behaves, we compare the online algorithms over the 26 individual binary classification problems. In particular,

Fig. 5 Number of updates, margins over the training data, and test error rates for subsets of the UCI Letter dataset.

we would like to see how ROMMAb performs when SVM with bias performs better than SVM without bias. To do this, for each binary classification problem, we compare the difference between the test errors of ROMMAb and the best of the other online algorithms, as well as that between SVM with and without bias. Figure 6 summarizes the result, where each point corresponds to a binary classification problem over the UCI Letter dataset. Note that, for each problem, a positive x-value of the corresponding point means that SVM with bias wins against SVM without bias, and a positive y-value means that ROMMAb wins against the other online algorithms. As expected from the theoretical property that the solution of ROMMAb converges to that of SVM with bias, ROMMAb tends to perform better than the other online algorithms when SVM with bias beats SVM without bias. So far, we do not have a clear answer as to whether it is better to use a bias or not, since this depends on the kernels and data used in general.

6 Summary and Future Work

In this paper, we propose ROMMAb, a modification of ROMMA, which obtains the maximum 2-norm margin classifier approximately by determining the bias directly. ROMMAb achieves higher margin than other online learning

Fig. 6 Difference between the test errors of SVM with and without bias, and difference between the test errors of ROMMAb and the best of the other online algorithms.

algorithms over both artificial and real datasets. In particular, experiments over artificial datasets show that ROMMAb converges faster than ROMMA when the bias of the true function is large. As future research, we will conduct more detailed experiments over various larger datasets. Moreover, we will compare ROMMAb with popular implementations of SVM such as SVMlight, based on (Joachims 1999) and available from http://www.svmlight.joachims.org/, and LIBSVM (Chang and Lin 2001). Also, we will investigate applying our technique of determining the bias directly to the problem of finding maximum margin classifiers where the distance is defined in terms of other norms, say, the 1-norm or the p-norm.

Acknowledgements We would like to thank the anonymous referees for their valuable comments.

References

Anlauf JK, Biehl M (1989) The AdaTron: an adaptive perceptron algorithm. Europhysics Letters 10:687–692
Asuncion A, Newman DJ (2007) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, http://mlearn.ics.uci.edu/MLRepository.html
Boser BE, Guyon I, Vapnik V (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp 144–152



Cesa-Bianchi N, Gentile C (2006) Improved risk tail bounds for on-line algorithms. In: Advances in Neural Information Processing Systems 18, pp 195–202
Cesa-Bianchi N, Conconi A, Gentile C (2004) On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory 50(9):2050–2057
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Cortes C, Vapnik V (1995) Support vector networks. Machine Learning 20:273–297
Cristianini N, Shawe-Taylor J (2000) An Introduction to Support Vector Machines. Cambridge University Press
Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Machine Learning 37(3):277–299
Friess T, Cristianini N, Campbell C (1998) The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In: Proceedings of the 15th International Conference on Machine Learning
Gentile C (2001) A new approximate maximal margin classification algorithm. Journal of Machine Learning Research 2:213–242
Helmbold D, Warmuth MK (1995) On weak learning. Journal of Computer and System Sciences 50:551–573
Joachims T (1999) Making large-scale support vector machine learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel Methods - Support Vector Learning, MIT Press, pp 169–184
Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD)
Kivinen J, Smola AJ, Williamson RC (2004) Online learning with kernels. IEEE Transactions on Signal Processing 52(8):2165–2176
Kowalczyk A (2000) Maximum margin perceptron. In: Smola A, Bartlett P, Schölkopf B, Schuurmans D (eds) Advances in Large Margin Classifiers, MIT Press, pp 75–114
Li Y, Long PM (2002) The relaxed online maximum margin algorithm. Machine Learning 46(1-3):361–387
Littlestone N (1988) Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2(4):285–318
Littlestone N (1989) From on-line to batch learning. In: Proceedings of the Second Annual Workshop on Computational Learning Theory, pp 269–284
Minsky ML, Papert SA (1969) Perceptrons. MIT Press
Novikoff AB (1962) On convergence proofs on perceptrons. In: Symposium on the Mathematical Theory of Automata, Polytechnic Institute of Brooklyn, vol 12, pp 615–622
Osuna E, Freund R, Girosi F (1997) Improved training algorithm for support vector machines. In: Proceedings of IEEE NNSP'97
Platt J (1999) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel Methods - Support Vector Learning, MIT Press, pp 185–208
Rosenblatt F (1959) The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65:386–408
Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5):1651–1686
Shawe-Taylor J, Bartlett PL, Williamson RC, Anthony M (1998) Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory 44(5):1926–1940
Tsampouka P, Shawe-Taylor J (2007) Approximate maximum margin algorithms with rules controlled by the number of mistakes. In: Proceedings of the 24th International Conference on Machine Learning

