
Appendix S1: Primal Optimization Problem and Dualization

The function $s(f, x, y)$ provides a comparative measure between the scoring function $f_y(x)$ and $f_{c \neq y}(x)$ as input to the loss function, and is commonly defined in the literature as
\[
s(f, x, y) = f_y(x) - \max_{c \neq y} f_c(x). \tag{1}
\]
Because of the reference to the maximum of the scoring functions $f_{c \neq y}(x)$, a large number of constraints is introduced into the optimization problem. Much research has been aimed at solving Eq. (1) more efficiently, e.g. based on decomposition into smaller subproblems [1], interleaving column generation [2], or bundle methods [3], to name a few. But the limiting factor that the max represents still remains. As a remedy to this issue, we propose as a different and novel requirement that a hypothesis should score better than an average hypothesis, that is,
\[
s(f, x, y) = f_y(x) - \frac{1}{C} \sum_{c=1}^{C} f_c(x).
\]
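As a quick illustration of the difference between the two score definitions, the following sketch compares the max-based score of Eq. (1) with the average-based score on synthetic class scores; the array names and toy numbers are ours and purely illustrative, not part of the manuscript.

```python
import numpy as np

# Toy class scores f_c(x) for a single example x with C = 4 classes
# (hypothetical numbers, purely for illustration).
f = np.array([1.3, 0.2, -0.5, 0.9])
y = 0  # true class label of x

# Max-based score of Eq. (1): compare f_y(x) against the best competitor.
s_max = f[y] - np.max(np.delete(f, y))

# Average-based score: compare f_y(x) against the mean over all classes.
s_avg = f[y] - f.mean()

print(f"max-based score: {s_max:.3f}")   # 1.3 - 0.900 = 0.400
print(f"avg-based score: {s_avg:.3f}")   # 1.3 - 0.475 = 0.825
```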

For the upcoming derivations, we focus on affine-linear models of the form
\[
f_c(x) = w_c^\top \psi(x) + b_c. \tag{2}
\]
As discussed earlier, the bias parameter $b_c$ may be removed in the derivations, which is a mild restriction for the high-dimensional space $\mathcal{H}$ we consider. For the time being including the bias, the average hypothesis thus becomes $\bar f(x) = \bar w^\top \psi(x) + \bar b$ and
\[
s(f, x, y) = (w_y - \bar w)^\top \psi(x) + b_y - \bar b, \tag{3}
\]
where $\bar w = \frac{1}{C} \sum_{c=1}^{C} w_c$ and $\bar b = \frac{1}{C} \sum_{c=1}^{C} b_c$. Each hyperplane $w_c - \bar w$, $c = 1, \ldots, C$, is associated with a margin $\rho$. The following quadratic regularizer aims to penalize the norms of these hyperplanes while at the same time maximizing the margins:
\[
\Omega(f) = \frac{1}{2} \sum_c \|w_c - \bar w\|^2 - C\rho. \tag{4}
\]
The regularized risk thus becomes
\[
\frac{1}{2} \sum_{c=1}^{C} \|w_c - \bar w\|^2 - C\rho + \mu \sum_i l\big( (w_{y_i} - \bar w)^\top \psi(x_i) + b_{y_i} - \bar b \big).
\]
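As a minimal sketch of how the regularizer in Eq. (4) is evaluated, assume the weight vectors are stacked row-wise in a matrix; the names `W` and `rho` and the numbers are hypothetical and only serve to make the structure of $\Omega(f)$ concrete.

```python
import numpy as np

# Hypothetical weight vectors, one row per class (C = 3 classes, d = 2 features),
# and a margin variable rho; names and numbers are illustrative only.
W = np.array([[ 1.0,  0.5],
              [-0.3,  0.8],
              [ 0.2, -1.1]])
C, d = W.shape
rho = 0.7

w_bar = W.mean(axis=0)                              # average hypothesis \bar{w}
omega = 0.5 * np.sum((W - w_bar) ** 2) - C * rho    # Eq. (4)

print(omega)
```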

Expanding the loss terms into slack variables leads to the primal optimization problem
\[
\begin{aligned}
\min_{w_c, \bar w, b, \rho, t} \quad & \frac{1}{2} \sum_c \|w_c - \bar w\|^2 - C\rho + \mu \sum_i l(t_i) \\
\text{s.t.} \quad & \langle w_{y_i} - \bar w, \psi(x_i) \rangle + b_{y_i} \ge \rho - t_i, \quad \forall i \\
& \bar w = \frac{1}{C} \sum_{c=1}^{C} w_c \\
& \bar b = \frac{1}{C} \sum_{c=1}^{C} b_c = 0.
\end{aligned} \tag{5}
\]
The condition $\bar b = 0$ is necessary in order to obtain the primal of the binary $\mu$-SVM as a special case of Eq. (5) and to avoid the trivial solution $w_c = \bar w = 0$ with $b_c = \rho \to \infty$.

The optimization problem Eq. (5) has an interesting interpretation. The constraint may be written as
\[
w_{y_i}^\top \psi(x_i) + b_{y_i} \ge \bar w^\top \psi(x_i) + \rho - t_i.
\]
Hence, we are learning for each class label a hyperplane function $f_{y_i}(x) = w_{y_i}^\top \psi(x) + b_{y_i}$ that allows for scores better than the average by a margin. This is illustrated in Fig. 5 in the main manuscript.
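The primal problem (5) can be prototyped directly with an off-the-shelf convex solver. The sketch below assumes a hinge loss $l(t) = \max(0, t)$, precomputed explicit feature vectors `psi` (i.e., $\psi(x_i)$ available as arrays), and our own function and variable names; it is a minimal illustration under these assumptions, not the solver used for the manuscript.

```python
import numpy as np
import cvxpy as cp

def train_primal(psi, y, C, mu=1.0):
    """Minimal sketch of primal problem (5), assuming a hinge loss l(t) = max(0, t).

    psi : (n, d) array of precomputed feature vectors psi(x_i)
    y   : (n,) array of class labels in {0, ..., C-1}
    """
    n, d = psi.shape
    W = cp.Variable((C, d))      # one weight vector w_c per class
    b = cp.Variable(C)           # per-class biases b_c
    rho = cp.Variable()          # shared margin
    t = cp.Variable(n)           # slack variables t_i

    w_bar = cp.sum(W, axis=0) / C            # \bar{w} = (1/C) sum_c w_c

    # Objective: 1/2 sum_c ||w_c - w_bar||^2 - C*rho + mu * sum_i l(t_i)
    reg = 0.5 * sum(cp.sum_squares(W[c] - w_bar) for c in range(C))
    obj = cp.Minimize(reg - C * rho + mu * cp.sum(cp.pos(t)))

    cons = [cp.sum(b) == 0]                  # enforces \bar{b} = 0
    for i in range(n):
        score = cp.sum(cp.multiply(W[y[i]] - w_bar, psi[i])) + b[y[i]]
        cons.append(score >= rho - t[i])     # <w_{y_i} - w_bar, psi(x_i)> + b_{y_i} >= rho - t_i

    cp.Problem(obj, cons).solve()
    return W.value, b.value, rho.value
```

Whether the resulting problem is bounded depends on the choice of $\mu$; the sketch leaves that choice to the caller.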


Dualization

Optimization is often considerably easier in the dual space. As it will turn out, we can derive the dual problem of Eq. (5) without knowing the loss function $l$; instead, it is sufficient to work with the Fenchel-Legendre dual $l^*(x) = \sup_t \, xt - l(t)$ (e.g. cf. [4, 5]). Applying Lagrange's theorem incorporates the constraints into the objective by introducing non-negative Lagrangian multipliers $\alpha$, $\beta$, $\gamma$,
\[
\begin{aligned}
L = {} & \frac{1}{2} \sum_c \|w_c - \bar w\|^2 - C\rho + \mu \sum_i l(t_i) \\
& - \sum_i \alpha_i \left( \langle w_{y_i} - \bar w, \psi(x_i) \rangle + b_{y_i} - \rho - t_i \right) \\
& + \left\langle \beta, \frac{1}{C} \sum_c w_c - \bar w \right\rangle + \gamma \sum_c b_c.
\end{aligned}
\]

Setting the partial derivatives of $L$ with respect to the primal variables to zero gives the following optimality conditions:
\[
\forall c: \quad w_c - \bar w = \sum_{i: y_i = c} \alpha_i \psi(x_i) - \frac{1}{C} \beta; \qquad \mathbf{1}^\top \alpha_c = 1,
\]
where $\alpha = (\alpha_1^\top, \ldots, \alpha_C^\top)^\top \in \mathbb{R}^n$ is equipped with a block structure and where $\alpha_c^\top \mathbf{1} = \sum_{i: y_i = c} \alpha_i$. Note that the total number of variables in $\alpha$ is thus $n$. Resubstituting into the Lagrangian and subsequently solving for $\frac{\partial L}{\partial \beta} = 0$ leads to
\[
\forall c: \quad w_c = \sum_{i: y_i = c} \alpha_i \psi(x_i) \tag{6}
\]
\[
\bar w = \frac{1}{C} \sum_i \alpha_i \psi(x_i); \qquad \beta = \sum_i \alpha_i \psi(x_i).
\]
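Given dual variables $\alpha$, the primal parameters can be recovered via Eq. (6). The snippet below is a small numpy sketch under the assumption that $\psi(x_i)$ is available explicitly; the function name `recover_primal` and the argument names are ours.

```python
import numpy as np

def recover_primal(alpha, psi, y, C):
    """Recover w_c, \\bar{w} and beta from dual variables via Eq. (6).

    alpha : (n,) dual variables, block-structured by the class of x_i
    psi   : (n, d) explicit feature vectors psi(x_i)
    y     : (n,) class labels in {0, ..., C-1}
    """
    n, d = psi.shape
    W = np.zeros((C, d))
    for c in range(C):
        mask = (y == c)
        W[c] = alpha[mask] @ psi[mask]   # w_c = sum_{i: y_i=c} alpha_i psi(x_i)
    w_bar = (alpha @ psi) / C            # \bar{w} = (1/C) sum_i alpha_i psi(x_i)
    beta = alpha @ psi                   # beta = sum_i alpha_i psi(x_i)
    return W, w_bar, beta
```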

By the latter equations the Lagrangian saddle point problem can be expressed as
\[
\sup_{\alpha} \inf_{t} \; -\frac{1}{2} \alpha^\top K \alpha + \mu \sum_i \left( l(t_i) + \mu^{-1} \alpha_i t_i \right), \tag{7}
\]
\[
\alpha: \; \alpha^\top \mathbf{1} = C, \quad \alpha_c^\top \mathbf{1} = 1, \; c = 1, \ldots, C \; \text{(if bias)},
\]
where $K$ is given by Eq. (7) in the main manuscript. Inserting the notion of a Fenchel-Legendre dual function, we can completely remove the dependency on the primal variables in Eq. (7), and obtain the generalized dual problem
\[
\sup_{\alpha} \; -\frac{1}{2} \alpha^\top K \alpha - \mu \sum_i l^*(-\mu^{-1} \alpha_i), \tag{8}
\]
\[
\alpha: \; \alpha^\top \mathbf{1} = C, \quad \alpha_c^\top \mathbf{1} = 1, \; c = 1, \ldots, C \; \text{(if bias)},
\]
where $l^*$ is the Fenchel-Legendre conjugate function, which we subsequently denote as the dual loss of $l$.
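To make the notion of a dual loss concrete, the following sketch approximates $l^*(x) = \sup_t (xt - l(t))$ on a grid for the hinge loss $l(t) = \max(0, t)$; the choice of the hinge and the name `dual_loss` are ours and purely illustrative, not prescribed by the manuscript. The conjugate of this hinge is $0$ for $x \in [0, 1]$ and $+\infty$ otherwise, which the grid approximation reflects by returning values that grow with the grid range outside that interval.

```python
import numpy as np

def dual_loss(x, loss, t_grid):
    """Grid approximation of the Fenchel-Legendre conjugate l*(x) = sup_t (x*t - l(t))."""
    return np.max(x * t_grid - loss(t_grid))

hinge = lambda t: np.maximum(0.0, t)          # illustrative choice of l
t_grid = np.linspace(-50.0, 50.0, 200001)     # finite grid standing in for sup over t

for x in (-0.5, 0.0, 0.5, 1.0, 1.5):
    print(f"l*({x:+.1f}) ~ {dual_loss(x, hinge, t_grid):.2f}")
# l*(x) is 0 on [0, 1]; outside this interval the grid maximum grows with the
# grid range, reflecting the true value of +infinity.
```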

References

1. Crammer K, Singer Y (2001) On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research 2: 265-292.
2. Joachims T, Finley T, Yu CN (2009) Cutting-Plane Training of Structural SVMs. Machine Learning 77: 27-59.
3. Teo CH, Vishwanathan S, Smola A, Le Q (2010) Bundle Methods for Regularized Risk Minimization. Journal of Machine Learning Research 11: 311-365.
4. Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge, UK: Cambridge University Press.
5. Smola AJ, Vishwanathan SVN, Le Q (2008) Bundle Methods for Machine Learning. In: Advances in Neural Information Processing Systems 20.