Generalization Guarantees for a Binary Classification Framework for Two-Stage Multiple Kernel Learning

Purushottam Kar
Department of Computer Science and Engineering, IIT Kanpur

[email protected]

February 2, 2013

Abstract

We present generalization bounds for the TS-MKL framework for two-stage multiple kernel learning. We also present bounds for sparse kernel learning formulations within the TS-MKL framework.

1 Introduction

Recently, Kumar et al. [6] proposed a framework for two-stage multiple kernel learning that combines the idea of target kernel alignment with the notion of a good Mercer kernel proposed in [1] in order to learn a good kernel. More specifically, given a finite set of base kernels $K_1, \ldots, K_p$ over some common domain $X$, we wish to find some combination of these base kernels that is well suited to the learning task at hand. The paper considers learning a positive linear combination of the kernels $K_\mu = \sum_{i=1}^p \mu_i K_i$ for some $\mu \in \mathbb{R}^p$, $\mu \geq 0$. It is assumed that the kernels are uniformly bounded, i.e., for all $x_1, x_2 \in X$ and $i = 1, \ldots, p$, we have $K_i(x_1, x_2) \leq \kappa_i^2$ for some $\kappa_i > 0$. Let $\kappa = \left(\kappa_1^2, \ldots, \kappa_p^2\right) \in \mathbb{R}^p$ and note that $\kappa \geq 0$. Also note that for any $\mu \geq 0$ and any $x_1, x_2 \in X$, we have $K_\mu(x_1, x_2) \leq \langle \mu, \kappa \rangle$.
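To make the setting concrete, here is a minimal Python sketch (our illustration, not part of [6]; the Gaussian and polynomial base kernels, their bounds and all names are assumptions made for the example) that forms the Gram matrix of a nonnegative combination $K_\mu = \sum_{i=1}^p \mu_i K_i$ and checks the uniform bound $K_\mu(x_1, x_2) \leq \langle \mu, \kappa \rangle$.

```python
import numpy as np

# Illustrative base kernels (not prescribed by [6]); each is bounded on the unit sphere.
def rbf_kernel(X1, X2, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) <= 1, so kappa_i^2 = 1
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def poly_kernel(X1, X2, degree=2):
    # For ||x|| <= 1, K(x, x') = (1 + <x, x'>)^d <= 2^d, so kappa_i^2 = 2^d
    return (1.0 + X1 @ X2.T) ** degree

base_kernels = [lambda A, B: rbf_kernel(A, B, sigma=0.5),
                lambda A, B: rbf_kernel(A, B, sigma=2.0),
                lambda A, B: poly_kernel(A, B, degree=2)]
kappa = np.array([1.0, 1.0, 4.0])        # per-kernel bounds kappa_i^2

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # keep inputs on the unit sphere

mu = np.array([0.3, 0.5, 0.1])           # a nonnegative combination vector
K_mu = sum(m * K(X, X) for m, K in zip(mu, base_kernels))

# Uniform bound from the text: K_mu(x1, x2) <= <mu, kappa> for mu >= 0.
assert K_mu.max() <= mu @ kappa + 1e-9
```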

The notion of suitability used in [6] is that of kernel-goodness, first proposed in [1] for classification tasks. For the sake of simplicity, we shall henceforth consider only binary classification tasks, the extension to multi-class classification tasks being straightforward. We present below the notion of goodness used in [6].

For any binary classification task over a domain $X$ characterized by a distribution $\mathcal{D}$ over $X \times \{\pm 1\}$, a Mercer kernel $K : X \times X \to \mathbb{R}$ with associated Reproducing Kernel Hilbert Space $\mathcal{H}_K$ and feature map $\Phi_K : X \to \mathcal{H}_K$ is said to be $(\epsilon, \gamma)$-kernel good if there exists a unit norm vector $w \in \mathcal{H}_K$, i.e., $\|w\|_{\mathcal{H}_K} = 1$, such that the following holds:

$$\mathop{\mathbb{E}}_{(x,y) \sim \mathcal{D}} \left[ 1 - \frac{y \left\langle w, \Phi_K(x) \right\rangle}{\gamma} \right]_+ \leq \epsilon$$
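For intuition, the goodness criterion can be estimated on a finite sample by representing a candidate witness as $w = \sum_j \alpha_j \Phi_K(s_j)$ for some anchor points $s_j$, so that $\langle w, \Phi_K(x) \rangle = \sum_j \alpha_j K(s_j, x)$ and $\|w\|_{\mathcal{H}_K}^2 = \alpha^\top K_{ss} \alpha$. The Python sketch below is a heuristic illustration only; the toy data generator, the choice of anchors and the least-squares fit for $\alpha$ are our assumptions, not constructions from [1] or [6].

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy binary task: labels given by the sign of the first coordinate.
X = rng.normal(size=(500, 3))
y = np.sign(X[:, 0])

# Represent w = sum_j alpha_j Phi_K(s_j) over a few anchor points s_j.
S = X[:50]
K_ss = rbf(S, S)
K_sx = rbf(S, X)

# Heuristic witness: least-squares fit of the labels, then normalize in the RKHS norm.
alpha = np.linalg.solve(K_ss + 1e-3 * np.eye(len(S)), y[:50])
alpha /= np.sqrt(alpha @ K_ss @ alpha)          # now ||w||_{H_K} = 1

margins = y * (alpha @ K_sx)                    # y * <w, Phi_K(x)> for each sample
gamma = 0.1
eps_hat = np.mean(np.maximum(0.0, 1.0 - margins / gamma))
print(f"empirical estimate of E[1 - y<w,Phi(x)>/gamma]_+ : {eps_hat:.3f}")
```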


2 Learning a Good Kernel

The key idea behind [6] is to try and learn a positive linear combination of kernels that is good according to the notion presented above. We define the risk functional $R(\cdot) : \mathbb{R}^p \to \mathbb{R}_+$ as follows:

$$R(\mu) := \mathop{\mathbb{E}}_{(x,y),(x',y') \sim \mathcal{D} \times \mathcal{D}} \left[ 1 - y y' K_\mu(x, x') \right]_+$$

A combination $\mu$ will be said to be $\epsilon$-combination good if $R(\mu) \leq \epsilon$. The quantity $R(\mu)$ is of interest since an application of Jensen's inequality (see [6, Lemma 3.2]) shows us that for any $\mu \geq 0$ that is $\epsilon$-combination good, the kernel $K_\mu$ is $\left(\epsilon, \frac{1}{\langle \mu, \kappa \rangle}\right)$-kernel good. Furthermore, one can show, using standard results on the capacity of linear function classes (see for example [2, Theorem 21]), that an $(\epsilon, \gamma)$-good kernel can be used to learn, with confidence $1 - \delta$, a classifier with expected misclassification rate at most $\epsilon + \epsilon_1$ by using at most $O\!\left( \frac{\kappa^4}{\epsilon_1^2 \gamma^2} \log \frac{1}{\delta} \right)$ labeled samples.
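As an illustration of the risk functional (again a sketch under assumed toy data and base kernels, not the construction of [6]), $R(\mu)$ can be approximated by Monte Carlo sampling of independent pairs $(x, y), (x', y') \sim \mathcal{D} \times \mathcal{D}$:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_D(m):
    # Assumed toy distribution D over X x {+-1}: labels follow the first coordinate.
    X = rng.normal(size=(m, 3))
    return X, np.sign(X[:, 0])

sigmas = [0.5, 1.0, 2.0]   # three illustrative RBF base kernels

def k_mu_pairs(mu, X1, X2):
    # K_mu(x, x') evaluated row-wise on matched pairs of points.
    d2 = ((X1 - X2) ** 2).sum(-1)
    return sum(m * np.exp(-d2 / (2 * s ** 2)) for m, s in zip(mu, sigmas))

def estimate_R(mu, m=50000):
    # Monte Carlo: draw m independent pairs ((x,y),(x',y')) ~ D x D and
    # average the hinge term [1 - y y' K_mu(x, x')]_+.
    X1, y1 = sample_D(m)
    X2, y2 = sample_D(m)
    return np.mean(np.maximum(0.0, 1.0 - y1 * y2 * k_mu_pairs(mu, X1, X2)))

mu = np.array([0.4, 0.4, 0.2])          # a nonnegative combination
print(f"Monte Carlo estimate of R(mu): {estimate_R(mu):.3f}")
```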

In order to cast this learning problem more cleanly, [6] proposes the construction of a K-space using the following feature map:

$$z : (x, x') \mapsto \left( K_1(x, x'), \ldots, K_p(x, x') \right) \in \mathbb{R}^p$$

This allows us to write, for any $\mu \in \mathbb{R}^p$, $K_\mu(x, x') = \langle \mu, z(x, x') \rangle$. Given $n$ labeled training points $(x_1, y_1), \ldots, (x_n, y_n)$, define the empirical risk functional $\hat{R}(\cdot) : \mathbb{R}^p \to \mathbb{R}_+$ as follows¹:

$$\hat{R}(\mu) := \frac{2}{n(n-1)} \sum_{1 \leq i < j \leq n} \left[ 1 - y_i y_j \left\langle \mu, z(x_i, x_j) \right\rangle \right]_+$$

The combination is then learned by minimizing a regularized version of this empirical risk: for some $\lambda > 0$, one sets $\hat{\mu} := \arg\min_{\mu \geq 0} \left\{ \frac{\lambda}{2} \|\mu\|_2^2 + \hat{R}(\mu) \right\}$. We shall show that, with very high probability, the learned combination $\hat{\mu}$ will give us a kernel $K_{\hat{\mu}}$ that is $\left(\hat{\epsilon} + \epsilon_1, \frac{1}{\langle \hat{\mu}, \kappa \rangle}\right)$-kernel good, where $\hat{\epsilon} = \hat{R}(\hat{\mu})$ and $\epsilon_1$ is a quantity that can be made arbitrarily small.

¹We note that [6] includes the terms $[1 - \langle \mu, z(x_i, x_i) \rangle]_+$ in the empirical risk as well. This does not change the asymptotics of our analysis except for causing a bit of notational annoyance. In order to account for this term, the true risk functional will have to include an additional term $R_{\mathrm{add}}(\mu) := \mathop{\mathbb{E}}_{(x,y) \sim \mathcal{D}} [1 - K_\mu(x, x)]_+$. This will add a negligible term to the uniform convergence bound because we will have to consider the convergence of the term $\hat{R}_{\mathrm{add}}(\mu) := \frac{2}{n(n+1)} \sum_{1 \leq i \leq n} [1 - \langle \mu, z(x_i, x_i) \rangle]_+$ to $R_{\mathrm{add}}$. However, from thereon, the analysis will remain unaffected since $R_{\mathrm{add}}(\mu) \geq 0$, so a combination $\mu$ having true risk $R(\mu) + R_{\mathrm{add}}(\mu) \leq \epsilon$ will still give a kernel $K_\mu$ that is $\left(\epsilon, \frac{1}{\langle \mu, \kappa \rangle}\right)$-kernel good.
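The following sketch puts the two ingredients just defined into code: the K-space map $z(x, x')$ over all training pairs and the empirical risk $\hat{R}(\mu)$, followed by a simple projected subgradient loop for the regularized objective $\frac{\lambda}{2}\|\mu\|_2^2 + \hat{R}(\mu)$ over $\mu \geq 0$. The base kernels, the toy data and the particular solver are our assumptions; [6] does not prescribe this optimizer.

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(X1, X2, sigma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

sigmas = [0.5, 1.0, 2.0]                        # illustrative base kernels

# Training sample (toy, assumed): n points with labels from the first coordinate.
n = 100
X = rng.normal(size=(n, 3))
y = np.sign(X[:, 0])

# K-space features z(x_i, x_j) = (K_1(x_i,x_j), ..., K_p(x_i,x_j)) for all pairs i < j.
iu, ju = np.triu_indices(n, k=1)
Z = np.stack([rbf(X, X, s)[iu, ju] for s in sigmas], axis=1)   # shape: (n(n-1)/2, p)
yy = y[iu] * y[ju]                                             # pair labels y_i * y_j

def emp_risk(mu):
    # \hat R(mu) = 2/(n(n-1)) * sum_{i<j} [1 - y_i y_j <mu, z(x_i, x_j)>]_+
    return np.mean(np.maximum(0.0, 1.0 - yy * (Z @ mu)))

def solve(lmbda=1.0, steps=500, eta=0.1):
    # Projected subgradient descent on  lambda/2 ||mu||^2 + \hat R(mu)  subject to mu >= 0.
    mu = np.zeros(Z.shape[1])
    for _ in range(steps):
        viol = (1.0 - yy * (Z @ mu)) > 0                       # pairs with a positive hinge term
        grad_risk = -(yy[viol, None] * Z[viol]).sum(axis=0) / len(yy)
        mu = np.maximum(0.0, mu - eta * (lmbda * mu + grad_risk))   # project onto mu >= 0
    return mu

mu_hat = solve()
print("learned combination:", np.round(mu_hat, 3), " empirical risk:", round(emp_risk(mu_hat), 3))
```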


2. We shall then prove that, given that there exists a good combination of kernels in the K-space, with very high probability $\hat{\epsilon}$ will be very small. This we will prove by showing a converse of the inequality proved in the first step. This will allow us to give oracle inequalities for the kernel goodness of the learned combination.

3.1 Step 1

In this step, we prove a uniform convergence guarantee for the learning problem at hand. Using standard proof techniques, we shall reduce the problem of uniform convergence to that of estimating the capacity of a certain function class. The notion of capacity we shall use is the Rademacher complexity, which we shall bound using the heavy hammer of strong convexity based bounds from [5]. We note that the proof progression used in this step is fairly routine within the empirical process community and has been used to give generalization proofs for other problems as well (see for example [3, 4]).
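To give a feel for the capacity quantity involved (a numerical illustration only; the bound actually used in this step comes from the strong convexity machinery of [5]), one can compare a Monte Carlo estimate of the empirical Rademacher complexity of the linear class $\{ v \mapsto \langle \mu, v \rangle : \|\mu\|_2 \leq r \}$ over a ball of radius $r$ (the relevant radius $r_\lambda = \sqrt{2/\lambda}$ is derived just below) against the standard closed-form bound $\frac{r}{m} \sqrt{\sum_k \|v_k\|_2^2}$. The random stand-in vectors are an assumption made purely for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(4)

# K-space vectors v_k = z(x_i, x_j) for a toy sample; here just random nonnegative
# vectors standing in for kernel evaluations (an assumption purely for illustration).
m, p = 500, 5
V = rng.uniform(0.0, 1.0, size=(m, p))

lmbda = 1.0
r_lambda = np.sqrt(2.0 / lmbda)

# Monte Carlo estimate of the empirical Rademacher complexity of
# { v -> <mu, v> : ||mu||_2 <= r_lambda }:  E_sigma sup_mu (1/m) sum_k sigma_k <mu, v_k>.
# The supremum over the L2 ball has the closed form  r_lambda * || (1/m) sum_k sigma_k v_k ||_2.
trials = 2000
sigma = rng.choice([-1.0, 1.0], size=(trials, m))
rad_hat = np.mean(r_lambda * np.linalg.norm((sigma @ V) / m, axis=1))

# Standard upper bound for linear classes over an L2 ball.
rad_bound = r_lambda * np.sqrt((V ** 2).sum()) / m

print(f"MC estimate: {rad_hat:.4f}   closed-form bound: {rad_bound:.4f}")
```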

First of all, we note that due to the optimization process we have

$$\frac{\lambda}{2} \|\hat{\mu}\|_2^2 \;\leq\; \frac{\lambda}{2} \|\hat{\mu}\|_2^2 + \hat{R}(\hat{\mu}) \;\leq\; \frac{\lambda}{2} \|0\|_2^2 + \hat{R}(0) = 1,$$

which implies that we need only concern ourselves with combination vectors inside the $L_2$ ball of radius $r_\lambda = \sqrt{2/\lambda}$:

$$B_2(r_\lambda) := \left\{ \mu \in \mathbb{R}^p : \|\mu\|_2 \leq r_\lambda \right\}$$

For notational simplicity, we denote $z = (x, y)$ as a training sample. For any training set $z_1, \ldots, z_n$, where $z_i = (x_i, y_i)$, and for any $\mu \in \mathbb{R}^p$, we write $\ell(\mu, z_i, z_j) := [1 - y_i y_j \langle \mu, z(x_i, x_j) \rangle]_+$. We assume, yet again for the sake of notational simplicity, that we obtain at all times an even number of training samples, i.e.,

$n$ is even. For a ghost sample $\tilde{z}_1, \ldots, \tilde{z}_n$ drawn i.i.d. from $\mathcal{D}$ independently of the training sample, we have

$$R(\mu) = \mathbb{E} \left[ \frac{2}{n(n-1)} \sum_{1 \leq i < j \leq n} \ell(\mu, \tilde{z}_i, \tilde{z}_j) \right]$$
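As a numerical sanity check of the fact that the pairwise empirical risk is an unbiased estimate of $R(\mu)$, the sketch below (reusing the toy distribution and illustrative base kernels assumed in the earlier snippets) averages $\hat{R}(\mu)$ over many independently drawn samples and compares the result with a direct Monte Carlo estimate of $R(\mu)$.

```python
import numpy as np

rng = np.random.default_rng(5)
sigmas = [0.5, 1.0, 2.0]                 # same illustrative RBF base kernels as before
mu = np.array([0.4, 0.4, 0.2])           # a fixed nonnegative combination

def sample_D(m):
    # Assumed toy distribution D over X x {+-1}.
    X = rng.normal(size=(m, 3))
    return X, np.sign(X[:, 0])

def k_mu_pairs(X1, X2):
    # K_mu evaluated row-wise on matched pairs of points.
    d2 = ((X1 - X2) ** 2).sum(-1)
    return sum(m * np.exp(-d2 / (2 * s ** 2)) for m, s in zip(mu, sigmas))

def emp_risk(X, y):
    # \hat R(mu): average of [1 - y_i y_j K_mu(x_i, x_j)]_+ over all pairs i < j.
    iu, ju = np.triu_indices(len(y), k=1)
    vals = 1.0 - y[iu] * y[ju] * k_mu_pairs(X[iu], X[ju])
    return np.maximum(0.0, vals).mean()

# Average \hat R(mu) over many independently drawn ("ghost") samples of size n ...
n, reps = 40, 200
ghost_avg = np.mean([emp_risk(*sample_D(n)) for _ in range(reps)])

# ... and compare with a direct Monte Carlo estimate of R(mu) over independent pairs.
X1, y1 = sample_D(20000)
X2, y2 = sample_D(20000)
mc = np.maximum(0.0, 1.0 - y1 * y2 * k_mu_pairs(X1, X2)).mean()

print(f"average of R_hat over fresh samples: {ghost_avg:.3f}   direct estimate of R(mu): {mc:.3f}")
```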