Generalization Guarantees for a Binary Classification Framework for Two-Stage Multiple Kernel Learning

Purushottam Kar
Department of Computer Science and Engineering, IIT Kanpur
[email protected]

February 2, 2013
Abstract

We present generalization bounds for the TS-MKL framework for two-stage multiple kernel learning. We also present bounds for sparse kernel learning formulations within the TS-MKL framework.
1 Introduction
Recently, Kumar et al. [6] proposed a framework for two-stage multiple kernel learning that combines the idea of target kernel alignment and the notion of a good kernel proposed in [1] in order to learn a good Mercer kernel. More specifically, given a finite set of base kernels $K_1, \ldots, K_p$ over some common domain $X$, we wish to find some combination of these base kernels that is well suited to the learning task at hand. The paper considers learning a positive linear combination of the kernels $K_\mu = \sum_{i=1}^{p} \mu_i K_i$ for some $\mu \in \mathbb{R}^p$, $\mu \ge 0$. It is assumed that the kernels are uniformly bounded, i.e., for all $x_1, x_2 \in X$ and $i = 1, \ldots, p$, we have $K_i(x_1, x_2) \le \kappa_i^2$ for some $\kappa_i > 0$. Let $\kappa = \left(\kappa_1^2, \ldots, \kappa_p^2\right) \in \mathbb{R}^p$. Note that $\kappa \ge 0$. Also note that for any $\mu$ and any $x_1, x_2 \in X$, we have $K_\mu(x_1, x_2) \le \langle \mu, \kappa \rangle$.
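For completeness, the last inequality simply spells out as follows (using $\mu \ge 0$ and the uniform bound on each base kernel):
\[
K_\mu(x_1, x_2) \;=\; \sum_{i=1}^{p} \mu_i K_i(x_1, x_2) \;\le\; \sum_{i=1}^{p} \mu_i \kappa_i^2 \;=\; \langle \mu, \kappa \rangle .
\]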
The notion of suitability used in [6] is that of kernel-goodness, first proposed in [1] for classification tasks. For the sake of simplicity, we shall henceforth consider only binary classification tasks, the extension to multi-class classification tasks being straightforward. We present below the notion of goodness used in [6]. For any binary classification task over a domain $X$ characterized by a distribution $\mathcal{D}$ over $X \times \{\pm 1\}$, a Mercer kernel $K : X \times X \to \mathbb{R}$ with associated Reproducing Kernel Hilbert Space $\mathcal{H}_K$ and feature map $\Phi_K : X \to \mathcal{H}_K$ is said to be $(\epsilon, \gamma)$-kernel good if there exists a unit norm vector $w \in \mathcal{H}_K$, i.e., $\|w\|_{\mathcal{H}_K} = 1$, such that the following holds:
\[
\underset{(x,y) \sim \mathcal{D}}{\mathbb{E}} \left[ 1 - \frac{y \langle w, \Phi_K(x) \rangle}{\gamma} \right]_+ \le \epsilon .
\]
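To make the definition concrete, the following sketch estimates the $\epsilon$ in $(\epsilon, \gamma)$-kernel goodness for a candidate predictor of the form $w = \sum_i \alpha_i \Phi_K(x_i)$ by replacing the expectation with a sample average. It is purely illustrative: the helper name, the kernel interface K(A, B) returning a Gram matrix, and the choice of predictor are our own assumptions and not part of [6] or of the analysis here.

import numpy as np

def estimate_goodness(K, alpha, X_anchor, X_test, y_test, gamma):
    # Normalize alpha so that w = sum_i alpha_i Phi_K(x_i) has unit RKHS norm:
    # ||w||_{H_K}^2 = alpha^T K(X_anchor, X_anchor) alpha.
    G = K(X_anchor, X_anchor)
    alpha = alpha / np.sqrt(alpha @ G @ alpha)
    # <w, Phi_K(x)> = sum_i alpha_i K(x_i, x) for every test point x.
    margins = y_test * (K(X_test, X_anchor) @ alpha)
    # Empirical version of E[(1 - y <w, Phi_K(x)> / gamma)_+].
    return np.mean(np.maximum(0.0, 1.0 - margins / gamma))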
2 Learning a Good Kernel
The key idea behind [6] is to try and learn a positive linear combination of kernels that is good according to the notion presented above. We define the risk functional $R(\cdot) : \mathbb{R}^p \to \mathbb{R}_+$ as follows:
\[
R(\mu) := \underset{(x,y),(x',y') \sim \mathcal{D} \times \mathcal{D}}{\mathbb{E}} \left[ 1 - y y' K_\mu(x, x') \right]_+ .
\]
A combination $\mu$ will be said to be $\epsilon$-combination good if $R(\mu) \le \epsilon$. The quantity $R(\mu)$ is of interest since an application of Jensen's inequality (see [6, Lemma 3.2]) shows us that for any $\mu \ge 0$ that is $\epsilon$-combination good, the kernel $K_\mu$ is $\left(\epsilon, \frac{1}{\langle \mu, \kappa \rangle}\right)$-kernel good. Furthermore, one can show, using standard results on the capacity of linear function classes (see for example [2, Theorem 21]), that an $(\epsilon, \gamma)$-good kernel can be used to learn, with confidence $1 - \delta$, a classifier with expected misclassification rate at most $\epsilon + \epsilon_1$ by using at most $O\left(\frac{\kappa^4}{\epsilon_1^2 \gamma^2} \log \frac{1}{\delta}\right)$ labeled samples.
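As a concrete illustration of the quantity $R(\mu)$ (not part of the analysis; the helper name and the kernel interface below are our own assumptions), one can estimate it by averaging the pairwise hinge loss over two independently drawn labeled samples:

import numpy as np

def estimate_combination_risk(mu, kernels, X, y, X2, y2):
    # K_mu(x, x') = sum_i mu_i K_i(x, x'), evaluated for all cross pairs
    # between the two independent samples (X, y) and (X2, y2).
    K_mu = sum(m * K(X, X2) for m, K in zip(mu, kernels))
    # The average of [1 - y y' K_mu(x, x')]_+ over all pairs approximates R(mu).
    return np.mean(np.maximum(0.0, 1.0 - np.outer(y, y2) * K_mu))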
In order to cast this learning problem more cleanly, [6] proposes the construction of a K-space using the following feature map:
\[
z : (x, x') \mapsto \left( K_1(x, x'), \ldots, K_p(x, x') \right) \in \mathbb{R}^p .
\]
This allows us to write, for any $\mu \in \mathbb{R}^p$, $K_\mu(x, x') = \langle \mu, z(x, x') \rangle$. Given $n$ labeled training points $(x_1, y_1), \ldots, (x_n, y_n)$, define the empirical risk functional $\hat{R}(\cdot) : \mathbb{R}^p \to \mathbb{R}_+$ as follows¹:
\[
\hat{R}(\mu) := \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \left[ 1 - y_i y_j \langle \mu, z(x_i, x_j) \rangle \right]_+ .
\]
The combination $\hat{\mu}$ is then learned by minimizing the regularized empirical risk $\frac{\lambda}{2} \|\mu\|_2^2 + \hat{R}(\mu)$ for some regularization parameter $\lambda > 0$. We shall show that, with high probability, the learned combination $\hat{\mu}$ will give us a kernel $K_{\hat{\mu}}$ that is $\left(\hat{\epsilon} + \epsilon_1, \frac{1}{\langle \hat{\mu}, \kappa \rangle}\right)$-kernel good, where $\hat{\epsilon} := \hat{R}(\hat{\mu})$ and $\epsilon_1$ is a quantity that can be made arbitrarily small.
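To illustrate the two-stage procedure built on these definitions, the sketch below constructs the K-space features $z(x_i, x_j)$, evaluates $\hat{R}(\mu)$, and runs a simple projected subgradient method on the objective $\frac{\lambda}{2}\|\mu\|_2^2 + \hat{R}(\mu)$ with the constraint $\mu \ge 0$. This is a minimal sketch under our reading of the formulation; the actual solver used in [6] may well differ, and the helper names are our own.

import numpy as np

def kspace_features(kernels, X):
    # z(x_i, x_j) = (K_1(x_i, x_j), ..., K_p(x_i, x_j)) stacked for all pairs i < j.
    grams = [K(X, X) for K in kernels]            # each of shape (n, n)
    iu = np.triu_indices(X.shape[0], k=1)         # index pairs with i < j
    Z = np.stack([G[iu] for G in grams], axis=1)  # shape (n(n-1)/2, p)
    return Z, iu

def empirical_risk(mu, Z, yy):
    # R_hat(mu) = 2/(n(n-1)) sum_{i<j} [1 - y_i y_j <mu, z(x_i, x_j)>]_+ .
    return np.mean(np.maximum(0.0, 1.0 - yy * (Z @ mu)))

def learn_combination(kernels, X, y, lam=1.0, steps=1000, eta=0.1):
    # Projected subgradient descent on (lam/2)||mu||^2 + R_hat(mu) over mu >= 0.
    Z, iu = kspace_features(kernels, X)
    yy = y[iu[0]] * y[iu[1]]                      # label products y_i y_j for i < j
    mu = np.zeros(len(kernels))
    for t in range(steps):
        active = yy * (Z @ mu) < 1.0              # pairs with a nonzero hinge term
        grad = lam * mu - (yy[active, None] * Z[active]).sum(axis=0) / len(yy)
        mu = np.maximum(0.0, mu - eta / np.sqrt(t + 1) * grad)
    return mu

The returned vector plays the role of $\hat{\mu}$ above, and empirical_risk evaluated at it plays the role of $\hat{\epsilon}$.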
¹ We note that [6] includes the terms $[1 - \langle \mu, z(x_i, x_i) \rangle]_+$ in the empirical risk as well. This does not change the asymptotics of our analysis except for causing a bit of notational annoyance. In order to account for this term, the true risk functional will have to include an additional term $R_{\mathrm{add}}(\mu) := \underset{(x,y) \sim \mathcal{D}}{\mathbb{E}} \left[ 1 - K_\mu(x, x) \right]_+$. This will add a negligible term to the uniform convergence bound because we will have to consider the convergence of the term $\hat{R}_{\mathrm{add}}(\mu) := \frac{2}{n(n+1)} \sum_{1 \le i \le n} \left[ 1 - \langle \mu, z(x_i, x_i) \rangle \right]_+$ to $R_{\mathrm{add}}$. However, from thereon, the analysis will remain unaffected since $R_{\mathrm{add}}(\mu) \ge 0$, so a combination $\mu$ having true risk $R(\mu) + R_{\mathrm{add}}(\mu) \le \epsilon$ will still give a kernel $K_\mu$ that is $\left(\epsilon, \frac{1}{\langle \mu, \kappa \rangle}\right)$-kernel good.
2. We shall then prove that, given that there exists a good combination of kernels in the K-space, with very high probability $\hat{\epsilon}$ will be very small. This we will prove by showing a converse of the inequality proved in the first step. This will allow us to give oracle inequalities for the kernel goodness of the learned combination.
3.1 Step 1
In this step, we prove a uniform convergence guarantee for the learning problem at hand. Using standard proof techniques, we shall reduce the problem of uniform convergence to that of estimating the capacity of a certain function class.
The notion of capacity we shall use is the Rademacher complexity, which we shall bound using the heavy hammer of strong-convexity-based bounds from [5]. We note that the proof progression used in this step is fairly routine within the empirical process community and has been used to give generalization proofs for other problems as well (see for example [3, 4]).

First of all, we note that due to the optimization process we have
\[
\frac{\lambda}{2} \|\hat{\mu}\|_2^2 \;\le\; \frac{\lambda}{2} \|\hat{\mu}\|_2^2 + \hat{R}(\hat{\mu}) \;\le\; \frac{\lambda}{2} \|0\|_2^2 + \hat{R}(0) \;=\; 1,
\]
which implies that we need only concern ourselves with combination vectors inside the $L_2$ ball of radius $r_\lambda = \sqrt{\frac{2}{\lambda}}$, i.e.,
\[
B_2(r_\lambda) := \left\{ \mu \in \mathbb{R}^p : \|\mu\|_2 \le r_\lambda \right\}.
\]
Note that $\hat{R}(0) = 1$ since every hinge term in $\hat{R}(0)$ equals $[1 - 0]_+ = 1$.

For notational simplicity, we denote by $z = (x, y)$ a training sample. For any training set $z_1, \ldots, z_n$ where $z_i = (x_i, y_i)$ and for any $\mu \in \mathbb{R}^p$, we write $\ell(\mu, z_i, z_j) := \left[ 1 - y_i y_j \langle \mu, z(x_i, x_j) \rangle \right]_+$. We assume, yet again for the sake of notational simplicity, that we obtain at all times an even number of training samples, i.e., that $n$ is even. For a ghost sample