Reading Group ML / AI

Prof. Christian Bauckhage

outline: (kernel) LDA

linear discriminant analysis (LDA)

kernel linear discriminant analysis (kLDA)

summary

observe

last time, we studied the kernel trick
$$x_i^T x_j \;\to\; \phi(x_i)^T \phi(x_j) = k(x_i, x_j) \qquad \text{(training data)}$$
$$x^T x_i \;\to\; \phi(x)^T \phi(x_i) = k(x, x_i) \qquad \text{(test data)}$$

shortly after the kernel trick had been fully understood in the mid 1990s, it was applied to “everything”
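as a concrete illustration (a toy sketch of my own, not from the slides): for the degree-2 polynomial kernel, the inner product of an explicit feature map $\phi$ agrees with evaluating $k(x, z) = (1 + x^T z)^2$ directly, without ever forming $\phi$

import numpy as np

# explicit degree-2 polynomial feature map for x in R^2 (illustrative helper)
def phi(x):
    x1, x2 = x
    return np.array([1., np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, x2**2, np.sqrt(2)*x1*x2])

# the corresponding kernel function, evaluated in the input space
def k(x, z):
    return (1. + np.dot(x, z))**2

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.dot(phi(x), phi(z)), k(x, z))   # prints the same value twice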

today, we look at how to kernelize, i.e. rewrite, well-known DM / PR / ML / AI algorithms

linear discriminant analysis (LDA)

setting

assume a representative, labeled sample of training data

$$\bigl\{\, (x_i, y_i) \,\bigr\}_{i=1}^{n}$$

where the data vectors $x_i \in \mathbb{R}^m$ are from two classes $\Omega_1$ and $\Omega_2$ and the labels $y_i \in \{+1, -1\}$ indicate class membership, i.e.
$$y_i = \begin{cases} +1, & \text{if } x_i \in \Omega_1 \\ -1, & \text{if } x_i \in \Omega_2 \end{cases}$$

goal

determine a classifier, that is, a function $y : \mathbb{R}^m \to \{-1, +1\}$ such that $y(x)$ predicts the correct class for new, previously unobserved data $x$

popular idea

obtain an estimate of $y$ by projecting $x$ onto a suitable line determined by a projection vector $w$

in other words, compute
$$y(x) = \begin{cases} +1, & \text{if } w^T x > \theta \\ -1, & \text{otherwise} \end{cases}$$
where $\theta$ is an appropriately chosen threshold

such a function y(x) is called a linear classifier
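a minimal numpy sketch of such a linear classifier (the values of $w$ and $\theta$ below are made up for illustration; how to learn them is the topic of the following slides)

import numpy as np

def linear_classifier(x, w, theta):
    # project x onto w and compare against the threshold theta
    return 1 if np.dot(w, x) > theta else -1

w, theta = np.array([1.0, -0.5]), 0.2                       # hypothetical parameters
print(linear_classifier(np.array([0.8, 0.1]), w, theta))    # -> 1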

illustration

[figure: a two-dimensional data set $x$, a projection direction $w$, and a threshold $\theta$ along that direction]

what distinguishes good from bad?

[figure: two choices of projection direction $w$ and threshold $\theta$ for the same data; one separates the projected classes well (good), the other does not (bad)]

problem

learn a projection vector w from the given training data ⇔ among the infinitely many possibilities, choose the one that maximally separates the projected training data

this requires a measure of separation between projections

approach

consider projected class means normalized by within-class variance (R. A. Fisher, 1936)

Sir R.A. Fisher (∗1890, †1962)

some definitions

where $k \in \{1, 2\}$, $x \in \mathbb{R}^m$, $y \in \{-1, +1\}$

class mean
$$\mu_k = \frac{1}{n_k} \sum_{x \in \Omega_k} x$$

projected class mean
$$\tilde{\mu}_k = \frac{1}{n_k} \sum_{x \in \Omega_k} w^T x = w^T \Biggl( \frac{1}{n_k} \sum_{x \in \Omega_k} x \Biggr) = w^T \mu_k$$

projected class scatter
$$s_k^2 = \sum_{x \in \Omega_k} \bigl( w^T x - \tilde{\mu}_k \bigr)^2$$

Fisher’s linear discriminant

Fisher’s linear discriminant is defined as the linear function $w^T x$ that maximizes the objective
$$J(w) = \frac{\bigl( \tilde{\mu}_1 - \tilde{\mu}_2 \bigr)^2}{s_1^2 + s_2^2} \tag{1}$$
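to make these quantities concrete, here is a small numpy sketch (the toy data and the candidate direction $w$ are my own choices, not from the slides) that evaluates the class means, the projected class means, the projected scatters, and the objective (1)

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0., 0.], 0.5, size=(20, 2)).T   # class Omega_1, one sample per column
X2 = rng.normal([2., 1.], 0.5, size=(30, 2)).T   # class Omega_2
w  = np.array([1., 1.])                          # some candidate projection direction

mu1, mu2   = X1.mean(axis=1), X2.mean(axis=1)    # class means
pmu1, pmu2 = w @ mu1, w @ mu2                    # projected class means
s1 = np.sum((w @ X1 - pmu1)**2)                  # projected class scatter s_1^2
s2 = np.sum((w @ X2 - pmu2)**2)                  # projected class scatter s_2^2
print((pmu1 - pmu2)**2 / (s1 + s2))              # Fisher's objective J(w), eq. (1)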

interpretation

maximizing
$$J(w) = \frac{\bigl( \tilde{\mu}_1 - \tilde{\mu}_2 \bigr)^2}{s_1^2 + s_2^2}$$
w.r.t. the training data will identify a projection direction $w$ such that

the projected samples within each class lie close to each other

the projected class means are as far apart as possible

all that is left is to express $J(w)$ explicitly as a function of $w$ in order to find an optimal $w^*$

approach

consider the class scatter matrices
$$S_k = \sum_{x \in \Omega_k} \bigl( x - \mu_k \bigr) \bigl( x - \mu_k \bigr)^T$$
and the within class scatter matrix
$$S_W = S_1 + S_2$$

observe

we have
$$s_k^2 = \sum_{x \in \Omega_k} \bigl( w^T x - w^T \mu_k \bigr)^2 = \sum_{x \in \Omega_k} w^T \bigl( x - \mu_k \bigr) \bigl( x - \mu_k \bigr)^T w = w^T S_k w$$
and therefore
$$s_1^2 + s_2^2 = w^T S_1 w + w^T S_2 w = w^T S_W w$$

observe

similarly
$$\bigl( \tilde{\mu}_1 - \tilde{\mu}_2 \bigr)^2 = \bigl( w^T \mu_1 - w^T \mu_2 \bigr)^2 = w^T \bigl( \mu_1 - \mu_2 \bigr) \bigl( \mu_1 - \mu_2 \bigr)^T w \equiv w^T S_B w$$
where $S_B$ is called the between class scatter matrix
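continuing the toy sketch from above (again my own example, not from the slides), the two identities $s_1^2 + s_2^2 = w^T S_W w$ and $(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = w^T S_B w$ can be checked numerically

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0., 0.], 0.5, size=(20, 2)).T
X2 = rng.normal([2., 1.], 0.5, size=(30, 2)).T
w  = np.array([1., 1.])
mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)

S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T    # class scatter matrix S_1
S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T    # class scatter matrix S_2
SW = S1 + S2                                        # within class scatter matrix
SB = np.outer(mu1 - mu2, mu1 - mu2)                 # between class scatter matrix

s1 = np.sum((w @ X1 - w @ mu1)**2)
s2 = np.sum((w @ X2 - w @ mu2)**2)
print(np.isclose(s1 + s2, w @ SW @ w))                   # True
print(np.isclose((w @ mu1 - w @ mu2)**2, w @ SB @ w))    # True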

observe

substituting these into (1) yields the Rayleigh quotient
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
to maximize it w.r.t. $w$, we set its gradient to zero, which requires the numerator of the gradient to vanish
$$\bigl( w^T S_W w \bigr)\, 2 S_B w - \bigl( w^T S_B w \bigr)\, 2 S_W w \overset{!}{=} 0$$

solution (part 1)

dividing by $2\, w^T S_W w$ yields
$$\frac{w^T S_W w}{w^T S_W w}\, S_B w - \frac{w^T S_B w}{w^T S_W w}\, S_W w = 0$$
in other words
$$S_B w - J(w)\, S_W w = 0 \qquad \text{or} \qquad S_W^{-1} S_B w = J(w)\, w \tag{2}$$

solution (part 2)

(2) is a generalized eigenvector / eigenvalue problem; since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\, w$ always points in the direction of $\mu_1 - \mu_2$ and the scale of $w$ does not affect $J(w)$, it is “fairly straightforward” to see (we do not show the details here) that its solution is given by
$$w^* = \operatorname*{argmax}_{w}\; \frac{w^T S_B w}{w^T S_W w} = S_W^{-1} \bigl( \mu_1 - \mu_2 \bigr) \tag{3}$$

solution (part 3)

once $w^*$ is available, construct a classifier, i.e.
$$y(x) = \begin{cases} +1, & \text{if } x^T w^* > \theta \\ -1, & \text{otherwise} \end{cases}$$

question
how to determine $\theta$?

answer
if the $x_i$ used for training follow a known distribution, proceed analytically; in practice, evaluate different choices of $\theta$ on an independent, labeled validation set
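putting the pieces together, a minimal numpy sketch of LDA training and classification (the toy data and the simple midpoint threshold are my own choices; the slides leave the choice of $\theta$ to a validation set)

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0., 0.], 0.5, size=(20, 2)).T   # class Omega_1 (columns = samples)
X2 = rng.normal([2., 1.], 0.5, size=(30, 2)).T   # class Omega_2

mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T
S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T
SW = S1 + S2

w_star = np.linalg.solve(SW, mu1 - mu2)          # equation (3)
theta  = 0.5 * (w_star @ mu1 + w_star @ mu2)     # midpoint of the projected class means

def classify(x):
    return 1 if x @ w_star > theta else -1

print(classify(np.array([0.1, -0.2])))           # expected: +1 (close to Omega_1)
print(classify(np.array([2.1,  0.9])))           # expected: -1 (close to Omega_2)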

kernel linear discriminant analysis (kLDA)

observe

in the following, let
$$X_k = \bigl[ x_1, \ldots, x_{n_k} \bigr] \qquad \text{and} \qquad y_k = \Bigl[ \tfrac{1}{n_k}, \ldots, \tfrac{1}{n_k} \Bigr]^T$$
so that $\mu_k = X_k y_k$; moreover, let
$$X = \bigl[ X_1, X_2 \bigr] \qquad \text{and} \qquad y = \bigl[ y_1, -y_2 \bigr]$$

we then have
$$w^* = S_W^{-1} \bigl( \mu_1 - \mu_2 \bigr) = S_W^{-1} \bigl( X_1 y_1 - X_2 y_2 \bigr) = S_W^{-1} X y$$

⇒ the optimal choice for $w$ is a linear combination of the data vectors in $X$
⇒ next, we substitute $w = X a$
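a quick numerical sanity check of the identity $\mu_k = X_k y_k$ (toy data, my own sketch)

import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(size=(2, 5))                    # 5 samples of class Omega_1, columns = samples
y1 = np.full(5, 1./5)                           # y_1 = (1/n_1, ..., 1/n_1)
print(np.allclose(X1 @ y1, X1.mean(axis=1)))    # mu_1 = X_1 y_1  -> True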



observe

we can write the class scatter matrices as
$$
\begin{aligned}
S_k &= \sum_{x_i \in \Omega_k} \bigl( x_i - \mu_k \bigr) \bigl( x_i - \mu_k \bigr)^T \\
    &= \sum_{x_i \in \Omega_k} x_i x_i^T - n_k\, \mu_k \mu_k^T \\
    &= X_k X_k^T - n_k\, X_k y_k y_k^T X_k^T \\
    &= X_k X_k^T - X_k N_k X_k^T
\end{aligned}
$$
where $N_k = \frac{1}{n_k} \mathbb{1}$ and $\mathbb{1}$ denotes the $n_k \times n_k$ matrix of all ones
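the matrix form of $S_k$ can be verified numerically as well (toy data, my own sketch)

import numpy as np

rng = np.random.default_rng(1)
n1  = 5
X1  = rng.normal(size=(2, n1))
mu1 = X1.mean(axis=1)
N1  = np.ones((n1, n1)) / n1                                # N_1 = (1/n_1) * ones
S1_sum    = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T     # sum-over-samples form
S1_matrix = X1 @ X1.T - X1 @ N1 @ X1.T                      # X_1 X_1^T - X_1 N_1 X_1^T
print(np.allclose(S1_sum, S1_matrix))                       # True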

observe

but this is to say that
$$
\begin{aligned}
w^T S_B w &= a^T X^T \bigl[ X_1 y_1 - X_2 y_2 \bigr] \bigl[ X_1 y_1 - X_2 y_2 \bigr]^T X a \\
&= a^T \bigl[ X^T X_1 y_1 - X^T X_2 y_2 \bigr] \bigl[ X^T X_1 y_1 - X^T X_2 y_2 \bigr]^T a \\
&= a^T \bigl[ K_1 y_1 - K_2 y_2 \bigr] \bigl[ K_1 y_1 - K_2 y_2 \bigr]^T a \\
&= a^T \bigl[ m_1 - m_2 \bigr] \bigl[ m_1 - m_2 \bigr]^T a \\
&= a^T M a
\end{aligned}
$$
$$
\begin{aligned}
w^T S_W w &= a^T X^T \bigl[ X_1 X_1^T - X_1 N_1 X_1^T + X_2 X_2^T - X_2 N_2 X_2^T \bigr] X a \\
&= a^T X^T \bigl[ X X^T - X_1 N_1 X_1^T - X_2 N_2 X_2^T \bigr] X a \\
&= a^T \bigl[ X^T X X^T X - X^T X_1 N_1 X_1^T X - X^T X_2 N_2 X_2^T X \bigr] a \\
&= a^T \bigl[ K^2 - K_1 N_1 K_1^T - K_2 N_2 K_2^T \bigr] a \\
&= a^T N a
\end{aligned}
$$
where $K = X^T X$, $K_k = X^T X_k$, and $m_k = K_k y_k$
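once more a small numerical check (my own sketch, using the linear kernel $K = X^T X$): for $w = X a$, the quadratic forms $a^T M a$ and $a^T N a$ agree with $w^T S_B w$ and $w^T S_W w$

import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 4, 6
X1, X2 = rng.normal(size=(2, n1)), rng.normal(size=(2, n2))
X = np.hstack((X1, X2))
y1, y2 = np.full(n1, 1./n1), np.full(n2, 1./n2)
a = rng.normal(size=n1 + n2)                   # arbitrary coefficient vector
w = X @ a                                      # w as a linear combination of the data

# quantities in input space
mu1, mu2 = X1 @ y1, X2 @ y2
SB = np.outer(mu1 - mu2, mu1 - mu2)
S1 = (X1 - mu1[:, None]) @ (X1 - mu1[:, None]).T
S2 = (X2 - mu2[:, None]) @ (X2 - mu2[:, None]).T
SW = S1 + S2

# kernelized quantities (linear kernel)
K, K1, K2 = X.T @ X, X.T @ X1, X.T @ X2
m1, m2 = K1 @ y1, K2 @ y2
M = np.outer(m1 - m2, m1 - m2)
N1, N2 = np.ones((n1, n1)) / n1, np.ones((n2, n2)) / n2
N = K @ K - K1 @ N1 @ K1.T - K2 @ N2 @ K2.T

print(np.isclose(w @ SB @ w, a @ M @ a))       # True
print(np.isclose(w @ SW @ w, a @ N @ a))       # True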

kernel linear discriminant analysis (kLDA)

training kernel LDA amounts to computing
$$a^* = \operatorname*{argmax}_{a}\; \frac{a^T M a}{a^T N a} = N^{-1} \bigl( m_1 - m_2 \bigr)$$
where it may be necessary to regularize $N$, i.e.
$$N \leftarrow N + \epsilon I$$
for some small $\epsilon > 0$

kernel linear discriminant analysis (kLDA)

applying kernel LDA amounts to computing
$$y(x) = \begin{cases} +1, & \text{if } x^T X a^* > \theta \\ -1, & \text{otherwise} \end{cases}
\;=\; \operatorname{sign} \bigl( x^T X a^* - \theta \bigr) \;=\; \operatorname{sign} \bigl( k(x)^T a^* - \theta \bigr)$$
where $k(x) = \bigl[ k(x, x_1), \ldots, k(x, x_n) \bigr]^T$ collects the kernel values between $x$ and the training data and the threshold $\theta$ is determined as in usual LDA

example

polynomial kernel, $d \in \{1, 4, 6\}$

[figure: three panels showing kernel LDA results on a two-dimensional toy data set, one panel per degree $d$]

python code (for polynomial kernel)

### training
X1, X2 = ...   # training data matrices
y1, y2 = ...   # training data labels (+1 for class 1, -1 for class 2)
a = trainKLDAPolyKernel(X1, X2, y1, y2, d=3)

### testing
X = ...        # test data matrix
y = applyKLDAPolyKernel(X, X1, X2, a, d=3)

python code (for polynomial kernel)

import numpy as np
import numpy.linalg as la

def trainKLDAPolyKernel(X1, X2, y1, y2, d):
    n1, n2 = X1.shape[1], X2.shape[1]

    # y1 contains +1 labels, y2 contains -1 labels; rescaling turns both
    # into the vectors y_k = (1/n_k, ..., 1/n_k) used in the derivation
    y1 *= 1./n1
    y2 *= -1./n2

    X  = np.hstack((X1, X2))
    K  = (1. + np.dot(X.T, X ))**d     # K   = kernel matrix of X with X
    K1 = (1. + np.dot(X.T, X1))**d     # K_1 = kernel matrix of X with X_1
    K2 = (1. + np.dot(X.T, X2))**d     # K_2 = kernel matrix of X with X_2
    M1 = np.dot(K1, y1)                # m_1
    M2 = np.dot(K2, y2)                # m_2
    N1 = np.ones((n1, n1)) * 1./n1
    N2 = np.ones((n2, n2)) * 1./n2
    N  = np.dot(K, K) \
         - np.dot(np.dot(K1, N1), K1.T) \
         - np.dot(np.dot(K2, N2), K2.T)

    return np.dot(la.inv(N), M1 - M2)  # a* = N^{-1} (m_1 - m_2)
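the usage snippet above also calls applyKLDAPolyKernel, which the slides do not show; below is a minimal sketch of what it could look like (the function name is from the slides, but its signature and the default threshold of 0 are my assumptions, and the testing snippet above was written to match it)

import numpy as np

def applyKLDAPolyKernel(X, X1, X2, a, d, theta=0.0):
    # X: test data matrix (columns = samples); X1, X2: training data matrices
    # a: coefficients returned by trainKLDAPolyKernel; d: polynomial degree
    Xtrain = np.hstack((X1, X2))
    Kx = (1. + np.dot(Xtrain.T, X))**d        # kernel values k(x, x_i) for every test sample
    return np.sign(np.dot(Kx.T, a) - theta)   # sign(k(x)^T a* - theta) for each column of X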

summary

we now know about

Fisher’s linear discriminant analysis (LDA) and how to kernelize it (kLDA)