(kernel) LDA
Reading Group ML / AI
Prof. Christian Bauckhage

outline
linear discriminant analysis (LDA)
kernel linear discriminant analysis (kLDA)
summary
observe
last time, we studied the kernel trick
$$x_i^T x_j \;\to\; \phi(x_i)^T \phi(x_j) = k(x_i, x_j) \qquad \text{(training data)}$$
$$x^T x_i \;\to\; \phi(x)^T \phi(x_i) = k(x, x_i) \qquad \text{(test data)}$$
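As a concrete reminder, here is a minimal NumPy sketch of evaluating such a kernel; the inhomogeneous polynomial kernel and the one-sample-per-column data layout are taken from the Python code at the end of this deck, and the function name poly_kernel is only illustrative:

import numpy as np

def poly_kernel(A, B, d=3):
    # k(a, b) = (1 + a^T b)^d for every pair of columns a of A and b of B
    return (1. + np.dot(A.T, B))**d

# training data: Gram matrix K with K[i, j] = k(x_i, x_j)
# K = poly_kernel(X, X, d)
# test point x: vector k(x) with entries k(x, x_i)
# kx = poly_kernel(X, x.reshape(-1, 1), d).ravel()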
shortly after it had been fully understood in the mid-1990s, the kernel trick was applied to "everything"
today, we look at how to kernelize, i.e. rewrite, well-known DM / PR / ML / AI algorithms
linear discriminant analysis (LDA)
setting
assume a representative, labeled sample of training data
$$\bigl\{ (x_i, y_i) \bigr\}_{i=1}^{n}$$
where the data vectors $x_i \in \mathbb{R}^m$ are from two classes $\Omega_1$ and $\Omega_2$ and the labels $y_i \in \{+1, -1\}$ indicate class membership, i.e.
$$y_i = \begin{cases} +1, & \text{if } x_i \in \Omega_1 \\ -1, & \text{if } x_i \in \Omega_2 \end{cases}$$
goal
determine a classifier, that is, a function $y : \mathbb{R}^m \to \{-1, +1\}$ such that $y(x)$ predicts the correct class for new, previously unobserved data $x$
popular idea
obtain an estimate of y by projecting x onto a suitable line determined by a projection vector w
in other words, compute
$$y(x) = \begin{cases} +1, & \text{if } w^T x > \theta \\ -1, & \text{otherwise} \end{cases}$$
where θ is an appropriately chosen threshold
such a function y(x) is called a linear classifier
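A linear classifier of this form is a one-liner in NumPy; the following sketch only assumes the column-wise data layout used later in the deck, with w and θ as placeholders whose choice is the topic of the rest of this section:

import numpy as np

def linear_classifier(X, w, theta):
    # label +1 if w^T x > theta, else -1, for every column x of X
    return np.where(np.dot(w, X) > theta, 1, -1)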
illustration
[figure, built up over several slides: data points x, a projection vector w, and a threshold θ along the projection direction]
what distinguishes good from bad?
[figure: two choices of projection vector w and threshold θ, one labeled "good" and one labeled "bad"]
problem
learn a projection vector w from the given training data ⇔ among the infinitely many possibilities, choose the one that maximally separates the projected training data
this requires a measure of separation between projections
approach
consider projected class means normalized by within-class variance (R.A. Fisher, 1936)
Sir R.A. Fisher (∗1890, †1962)
some definitions
where $k \in \{1, 2\}$, $x \in \mathbb{R}^m$, $y \in \{-1, +1\}$

class mean
$$\mu_k = \frac{1}{n_k} \sum_{x \in \Omega_k} x$$

projected class mean
$$\tilde{\mu}_k = \frac{1}{n_k} \sum_{x \in \Omega_k} w^T x = w^T \frac{1}{n_k} \sum_{x \in \Omega_k} x = w^T \mu_k$$

projected class scatter
$$s_k^2 = \sum_{x \in \Omega_k} \bigl( w^T x - \tilde{\mu}_k \bigr)^2$$
Fisher’s linear discriminant
Fisher's linear discriminant is defined as the linear function $w^T x$ that maximizes the following objective
$$J(w) = \frac{\bigl( \tilde{\mu}_1 - \tilde{\mu}_2 \bigr)^2}{s_1^2 + s_2^2} \tag{1}$$
interpretation
maximizing
$$J(w) = \frac{\bigl( \tilde{\mu}_1 - \tilde{\mu}_2 \bigr)^2}{s_1^2 + s_2^2}$$
w.r.t. the training data will identify a projection direction w such that
projected samples from either class are close to each other
the projected class means are as far apart as possible

all that is left is to express J(w) explicitly as a function of w to find an optimal $w^*$
approach
consider the class scatter matrices
$$S_k = \sum_{x \in \Omega_k} \bigl( x - \mu_k \bigr) \bigl( x - \mu_k \bigr)^T$$
and the within-class scatter matrix
$$S_W = S_1 + S_2$$
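These quantities translate directly into NumPy; a minimal sketch, assuming (as in the code at the end of the deck) that the samples of class k are the columns of a matrix Xk:

import numpy as np

def within_class_scatter(X1, X2):
    # class means as column vectors
    mu1 = X1.mean(axis=1, keepdims=True)
    mu2 = X2.mean(axis=1, keepdims=True)
    # class scatter matrices S_k = sum over (x - mu_k)(x - mu_k)^T
    S1 = np.dot(X1 - mu1, (X1 - mu1).T)
    S2 = np.dot(X2 - mu2, (X2 - mu2).T)
    # within-class scatter matrix S_W = S_1 + S_2
    return mu1, mu2, S1 + S2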
observe
we have
$$s_k^2 = \sum_{x \in \Omega_k} \bigl( w^T x - w^T \mu_k \bigr)^2 = \sum_{x \in \Omega_k} w^T \bigl( x - \mu_k \bigr) \bigl( x - \mu_k \bigr)^T w = w^T S_k w$$
and therefore
$$s_1^2 + s_2^2 = w^T S_1 w + w^T S_2 w = w^T S_W w$$
observe
similarly
$$\bigl( \tilde{\mu}_1 - \tilde{\mu}_2 \bigr)^2 = \bigl( w^T \mu_1 - w^T \mu_2 \bigr)^2 = w^T \bigl( \mu_1 - \mu_2 \bigr) \bigl( \mu_1 - \mu_2 \bigr)^T w \equiv w^T S_B w$$
where $S_B$ is called the between-class scatter matrix
observe
substituting these into (1) yields the Rayleigh quotient
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
to maximize it w.r.t. w, we consider
$$\frac{\partial}{\partial w} J(w) \overset{!}{=} 0 \quad\Leftrightarrow\quad \bigl( w^T S_W w \bigr)\, 2\, S_B w - \bigl( w^T S_B w \bigr)\, 2\, S_W w = 0$$
solution (part 1)
dividing by $w^T S_W w$ yields
$$\frac{w^T S_W w}{w^T S_W w}\, S_B w - \frac{w^T S_B w}{w^T S_W w}\, S_W w = 0$$
in other words
$$S_B w - J(w)\, S_W w = 0 \qquad \text{or} \qquad S_W^{-1} S_B w = J(w)\, w \tag{2}$$
solution (part 2)
(2) is a generalized eigenvector/eigenvalue problem; it is "fairly straightforward" to see (we don't show it here) that its solution is given by
$$w^* = \operatorname*{argmax}_{w} \frac{w^T S_B w}{w^T S_W w} = S_W^{-1} \bigl( \mu_1 - \mu_2 \bigr) \tag{3}$$
solution (part 3)
once $w^*$ is available, construct a classifier, i.e.
$$y(x) = \begin{cases} +1, & \text{if } x^T w^* > \theta \\ -1, & \text{otherwise} \end{cases}$$

question: how to determine θ?

answer: if the $x_i$ used for training are distributed according to a known distribution, proceed analytically; in practice, evaluate different choices of θ on an independent, labeled validation set
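Putting the closed form (3) and the practical advice on θ together, here is a minimal sketch of ordinary LDA training; it reuses the within_class_scatter helper sketched above, and the grid of candidate thresholds is an illustrative choice, not part of the original slides:

import numpy as np
import numpy.linalg as la

def train_lda(X1, X2):
    # w* = S_W^{-1} (mu1 - mu2), cf. equation (3)
    mu1, mu2, SW = within_class_scatter(X1, X2)
    return np.dot(la.inv(SW), mu1 - mu2).ravel()

def choose_threshold(w, Xval, yval):
    # evaluate candidate thresholds on an independent, labeled validation set
    scores = np.dot(w, Xval)
    candidates = np.sort(scores)
    accuracies = [np.mean(np.where(scores > t, 1, -1) == yval) for t in candidates]
    return candidates[int(np.argmax(accuracies))]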
kernel linear discriminant analysis (kLDA)
observe
in the following, let
$$X_k = \bigl[ x_1, \ldots, x_{n_k} \bigr] \qquad y_k = \Bigl[ \tfrac{1}{n_k}, \ldots, \tfrac{1}{n_k} \Bigr]^T$$
so that $\mu_k = X_k y_k$; moreover, let
$$X = \bigl[ X_1, X_2 \bigr] \qquad y = \bigl[ y_1, -y_2 \bigr]$$
we then have
$$w^* = S_W^{-1} \bigl( \mu_1 - \mu_2 \bigr) = S_W^{-1} \bigl( X_1 y_1 - X_2 y_2 \bigr) = S_W^{-1} X y$$
⇒ the optimal choice for w is a linear combination of the data vectors in X
⇒ next, we substitute w = X a
observe
we can write the class scatter matrices as
$$S_k = \sum_{x_i \in \Omega_k} \bigl( x_i - \mu_k \bigr) \bigl( x_i - \mu_k \bigr)^T = \sum_{x_i \in \Omega_k} x_i x_i^T - n_k \mu_k \mu_k^T = X_k X_k^T - n_k X_k y_k y_k^T X_k^T = X_k X_k^T - X_k N_k X_k^T$$
where $N_k = \frac{1}{n_k} \mathbb{1}$ and $\mathbb{1}$ is the matrix of all ones
observe
but this is to say that
$$\begin{aligned} w^T S_B w &= a^T X^T \bigl( X_1 y_1 - X_2 y_2 \bigr) \bigl( X_1 y_1 - X_2 y_2 \bigr)^T X \, a \\ &= a^T \bigl( X^T X_1 y_1 - X^T X_2 y_2 \bigr) \bigl( X^T X_1 y_1 - X^T X_2 y_2 \bigr)^T a \\ &= a^T \bigl( K_1 y_1 - K_2 y_2 \bigr) \bigl( K_1 y_1 - K_2 y_2 \bigr)^T a \\ &= a^T \bigl( m_1 - m_2 \bigr) \bigl( m_1 - m_2 \bigr)^T a = a^T M \, a \end{aligned}$$
and
$$\begin{aligned} w^T S_W w &= a^T X^T \bigl( X_1 X_1^T - X_1 N_1 X_1^T + X_2 X_2^T - X_2 N_2 X_2^T \bigr) X \, a \\ &= a^T X^T \bigl( X X^T - X_1 N_1 X_1^T - X_2 N_2 X_2^T \bigr) X \, a \\ &= a^T \bigl( X^T X X^T X - X^T X_1 N_1 X_1^T X - X^T X_2 N_2 X_2^T X \bigr) a \\ &= a^T \bigl( K^2 - K_1 N_1 K_1^T - K_2 N_2 K_2^T \bigr) a = a^T N \, a \end{aligned}$$
where $K = X^T X$, $K_k = X^T X_k$, and $m_k = K_k y_k$
kernel linear discriminant analysis (kLDA)
training kernel LDA is to compute
$$a^* = \operatorname*{argmax}_{a} \frac{a^T M \, a}{a^T N \, a} = N^{-1} \bigl( m_1 - m_2 \bigr)$$
where it may be necessary to regularize N
$$N \leftarrow N + \epsilon I$$
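For a generic, precomputed kernel, training therefore boils down to a single (regularized) linear solve. A minimal sketch, assuming the matrices K, K1, K2, N1, N2 and the scaled label vectors y1, y2 are built exactly as defined on the previous slides, and that eps is a small, user-chosen regularization constant:

import numpy as np
import numpy.linalg as la

def train_klda(K, K1, K2, y1, y2, N1, N2, eps=1e-6):
    # projected class means m_k = K_k y_k
    m1 = np.dot(K1, y1)
    m2 = np.dot(K2, y2)
    # N = K^2 - K1 N1 K1^T - K2 N2 K2^T, regularized by N <- N + eps*I
    N = np.dot(K, K) \
        - np.dot(np.dot(K1, N1), K1.T) \
        - np.dot(np.dot(K2, N2), K2.T) \
        + eps * np.eye(K.shape[0])
    # a* = N^{-1} (m1 - m2)
    return la.solve(N, m1 - m2)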
kernel linear discriminant analysis (kLDA)
applying kernel LDA is to compute
$$y(x) = \begin{cases} +1, & \text{if } x^T X \, a^* > \theta \\ -1, & \text{otherwise} \end{cases} = \operatorname{sign} \bigl( x^T X \, a^* - \theta \bigr) = \operatorname{sign} \bigl( k(x)^T a^* - \theta \bigr)$$
where the threshold θ is determined as in usual LDA
example
polynomial kernel, d ∈ {1, 4, 6}
[figure: three panels showing kernel LDA results for a polynomial kernel with d = 1, 4, and 6; axes range from −3 to 3 horizontally and −1.5 to 1.5 vertically]
python code (for polynomial kernel)
### training
X1, X2 = ...   # training data matrices (one sample per column)
y1, y2 = ...   # training data labels (+1 / -1)
a = trainKLDAPolyKernel(X1, X2, y1, y2, d=3)

### testing
X = ...        # test data matrix
theta = ...    # threshold, determined as in usual LDA
y = np.sign(applyKLDAPolyKernel(X, X1, X2, a, d=3) - theta)   # signature assumed; see the sketch below
python code (for polynomial kernel)
import numpy as np
import numpy.linalg as la

def trainKLDAPolyKernel(X1, X2, y1, y2, d):
    n1, n2 = X1.shape[1], X2.shape[1]

    # scale the (float) label vectors so that M1, M2 become the projected class means m1, m2
    y1 *=  1./n1
    y2 *= -1./n2

    X  = np.hstack((X1, X2))            # all training data
    K  = (1. + np.dot(X.T, X ))**d      # kernel matrix K  = X^T X  (kernelized)
    K1 = (1. + np.dot(X.T, X1))**d      # kernel matrix K1 = X^T X1 (kernelized)
    K2 = (1. + np.dot(X.T, X2))**d      # kernel matrix K2 = X^T X2 (kernelized)

    M1 = np.dot(K1, y1)                 # m1
    M2 = np.dot(K2, y2)                 # m2 (the -1 labels cancel the minus sign in the scaling above)
    N1 = np.ones((n1,n1)) * 1./n1
    N2 = np.ones((n2,n2)) * 1./n2
    N  = np.dot(K, K) \
         - np.dot(np.dot(K1, N1), K1.T) \
         - np.dot(np.dot(K2, N2), K2.T)
    # if necessary, regularize: N += eps * np.eye(n1 + n2)

    return np.dot(la.inv(N), M1 - M2)   # a* = N^{-1} (m1 - m2)
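The testing snippet above calls applyKLDAPolyKernel, which the slide does not define. Here is a minimal sketch of what it could look like under the signature assumed above; it returns the projections k(x)^T a of the test data, to which the sign and the threshold θ are then applied:

def applyKLDAPolyKernel(X, X1, X2, a, d):
    # hypothetical helper, not part of the original slide
    Xtrain = np.hstack((X1, X2))            # all training data, one sample per column
    K = (1. + np.dot(Xtrain.T, X))**d       # K[i, j] = k(x_i, x_j) for training x_i, test x_j
    return np.dot(K.T, a)                   # projections k(x)^T a of the test data

# labels: y = np.sign(applyKLDAPolyKernel(X, X1, X2, a, d=3) - theta)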
summary
we now know about
Fisher’s linear discriminant analysis (LDA) and how to kernelize it (kLDA)