Pattern Recognition Prof. Christian Bauckhage
outline (lecture 20)
the kernel trick
kernel engineering
kernel algorithms: kernel LSQ, kernel SVM, kernel LDA, kernel PCA, kernel k-means
what’s really cool about kernels
summary
high dimensionality can also be a blessing
example
ϕ(x) = (x, ‖x‖²) = (x_1, x_2, x_1² + x_2²)
x ∈ R² → ϕ(x) ∈ R³
[figure: the map ϕ lifts the 2-d data into R³]
example
ϕ(x) = (x, ‖(x^T α) α − x‖²)  with  α = (1/√2) (1, 1)^T
x ∈ R² → ϕ(x) ∈ R³
[figure: the map ϕ lifts the 2-d data into R³]
observe
the examples suggest that non-linear transformations ϕ : R^m → R^M with m ≤ M can make data linearly separable ⇒ even for non-linear problems, we may resort to efficient linear techniques (recall lecture 08)
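a minimal numpy sketch of the first example with made-up toy data: points inside and outside a circle are not linearly separable in R², but after the lift ϕ(x) = (x_1, x_2, x_1² + x_2²) a single plane separates them

import numpy as np

# made-up toy data: 200 points inside the unit circle, 200 points on an outer ring
rng = np.random.default_rng(0)
radii = np.concatenate([rng.uniform(0.0, 0.8, 200), rng.uniform(1.2, 2.0, 200)])
theta = rng.uniform(0.0, 2*np.pi, 400)
X = np.stack([radii*np.cos(theta), radii*np.sin(theta)])   # 2 x 400 data matrix
y = np.concatenate([-np.ones(200), np.ones(200)])          # inner = -1, outer = +1

# the lift phi(x) = (x1, x2, x1^2 + x2^2) from the first example
Phi = np.vstack([X, np.sum(X**2, axis=0)])                 # 3 x 400

# in R^3 the plane "third coordinate = 1" separates the two classes perfectly,
# although no line in R^2 does
yhat = np.sign(Phi[2] - 1.0)
print(np.mean(yhat == y))                                  # 1.0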
recall
in regression, clustering, and classification, linear approaches are a dime a dozen, for instance
linear classifiers (least squares, LDA, SVM, ...) with y(x) = +1 if w^T x ≥ 0 and y(x) = −1 if w^T x < 0
nearest neighbor classifiers, k-means clustering, ... which only involve distances ‖x − q‖² = x^T x − 2 x^T q + q^T q, i.e. inner products
the kernel trick
idea
in order to adapt linear functions f(x) = x^T w to non-linear settings, we may
map x ∈ R^m to ϕ(x) ∈ R^M
learn ω ∈ R^M from data
and then consider h(x) = ϕ(x)^T ω
problems
it is not generally clear what transformation ϕ(·) to choose
depending on the dimension of the target space R^M, it may be expensive to compute ϕ(x) for x ∈ R^m
depending on the dimension of the target space R^M, it may be expensive to compute inner products ϕ(x)^T ϕ(y)
Mercer’s theorem to the rescue
definitions
a Mercer kernel is a symmetric, positive semidefinite function k : R^m × R^m → R such that k(x, y) = ϕ(x)^T ϕ(y) = k(y, x)
a function k : R^m × R^m → R is positive semidefinite if, for every function g with ∫ g²(x) dx < ∞, it holds that ∬ g(x) k(x, y) g(y) dx dy ≥ 0
note
the notion of positive semidefinite functions generalizes the notion of positive semidefinite matrices, where g^T K g = Σ_{i,j} g_i K_ij g_j ≥ 0
positive semidefinite matrices K have orthogonal eigenvectors u_i and non-negative eigenvalues λ_i ≥ 0 so that K u_i = λ_i u_i
note
in analogy, positive semidefinite functions have orthogonal eigenfunctions ψ_i(x) and eigenvalues µ_i ≥ 0, i = 1, ..., ∞, such that
∫ k(x, y) ψ_i(y) dy = µ_i ψ_i(x)
where
∫ ψ_i(x) ψ_j(x) dx = δ_ij
note
in analogy to the spectral representation of a matrix
K = Σ_{i=1}^{m} λ_i u_i u_i^T
we have
k(x, y) = Σ_{i=1}^{∞} µ_i ψ_i(x) ψ_i(y)
Theorem (Mercer, 1909)
for every positive semidefinite function k(x, y), there exists a vector-valued function ϕ(x) such that k(x, y) = ϕ(x)^T ϕ(y)
Proof
k(x, y) = Σ_i µ_i ψ_i(x) ψ_i(y)
        = Σ_i √µ_i ψ_i(x) · √µ_i ψ_i(y)
        = Σ_i ϕ_i(x) ϕ_i(y)
        = ϕ(x)^T ϕ(y)
implications
⇒ instead of computing ϕ(x)^T ϕ(y) on vectors in R^M, we may evaluate a kernel function k(x, y) on vectors in R^m
⇒ we may not have to compute ϕ(x) and ϕ(x)^T ϕ(y) explicitly, but may directly evaluate a kernel function k(x, y)
⇒ instead of worrying about how to choose ϕ(·), we may worry about choosing a suitable kernel function k(·, ·)
the kernel trick
the kernel trick is, first, to rewrite an algorithm for data analysis or classification in such a way that input data x only appears in the form of inner products and, second, to replace any occurrence of such inner products by kernel evaluations; this way, we can use linear approaches to solve non-linear problems!
kernel engineering
the linear kernel (⇔ proof that kernel functions exist)
the identity mapping ϕ(x) = x yields a valid Mercer kernel
k(x, y) = ϕ(x)^T ϕ(y) = x^T y
because x^T y = y^T x (symmetric) and x^T x ≥ 0 (psd)
observe
in the following, we assume x, y ∈ R^m and
b > 0 ∈ R,  c > 0 ∈ R,  d > 0 ∈ R,  g : R^m → R
k_i(x, y) = ϕ(x)^T ϕ(y) is a valid kernel for some ϕ
kernel functions
k(x, y) = c · k_1(x, y)
k(x, y) = g(x) k_1(x, y) g(y)
k(x, y) = x^T y + b
k(x, y) = k_1(x, y) + k_2(x, y)
k(x, y) = (x^T y + b)^d
k(x, y) = k_1(x, y)^d
k(x, y) = ‖ϕ(x) − ϕ(y)‖²
k(x, y) = exp(−‖x − y‖² / (2σ²))
note
again, kernel functions k(x, y) implicitly compute inner products ϕ(x)^T ϕ(y) not in R^m but in R^M

                       k(x, y)                         M
inhom. linear kernel   x^T y + b                       m + 1
polynomial kernel      (x^T y + b)^d                   (m+d choose d)
Gaussian kernel        exp(−‖x − y‖² / (2σ²))          ∞
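a small numpy sketch of these kernel functions (the parameter values b, d, σ and the function names are arbitrary choices of mine), together with a numerical sanity check that their Gram matrices on random data are positive semidefinite

import numpy as np

def linearKernel(X, Y, b=1.0):
    # inhomogeneous linear kernel x^T y + b
    return np.dot(X.T, Y) + b

def polyKernel(X, Y, b=1.0, d=3):
    # polynomial kernel (x^T y + b)^d
    return (np.dot(X.T, Y) + b)**d

def gaussKernel(X, Y, sigma=1.0):
    # Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))
    D2 = np.sum(X**2, 0)[:, None] - 2*np.dot(X.T, Y) + np.sum(Y**2, 0)[None, :]
    return np.exp(-D2 / (2*sigma**2))

# sanity check: Gram matrices of Mercer kernels have no negative eigenvalues
X = np.random.randn(2, 50)                    # 2 x 50 data matrix, columns are points
for kernel in (linearKernel, polyKernel, gaussKernel):
    K = kernel(X, X)
    print(kernel.__name__, np.min(np.linalg.eigvalsh(K)) >= -1e-8)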
assignment
to understand and make sense of all of the above, read
C. Bauckhage, “Lecture Notes on the Kernel Trick (I)”, dx.doi.org/10.13140/2.1.4524.8806
C. Bauckhage, “Lecture Notes on the Kernel Trick (III)”, dx.doi.org/10.13140/RG.2.1.2471.6322
kernel algorithms
recall: least squares
given labeled training data {(x_i, y_i)}_{i=1}^{n}, we seek w such that y(x) = x^T w
for convenience, we let
x ← (1, x),  w ← (w_0, w)
and then solve
w* = argmin_w ‖X^T w − y‖²     (1)
note
in today’s lecture, we work with m × n data matrices
X = [x_1  x_2  ···  x_n]
observe
the least squares problem in (1) has two solutions
primal:  w* = (X X^T)^{-1} X y     (involves an m × m matrix)
dual:    w* = X (X^T X)^{-1} y     (involves an n × n matrix)
where X^T X is a Gram matrix since (X^T X)_ij = x_i^T x_j
kernel LSQ
while working with the dual may be costly, it allows for invoking the kernel trick, because in
y(x) = x^T w* = x^T X (X^T X)^{-1} y
all x, x_i occur within inner products and we may rewrite
y(x) = k(x)^T K^{-1} y
where K_ij = k(x_i, x_j) and k_i(x) = k(x_i, x)
assignment
to see how to implement this idea in numpy / scipy, read C. Bauckhage, “NumPy / SciPy Recipes for Data Science: Kernel Least Squares Optimization (1)”, dx.doi.org/10.13140/RG.2.1.4299.9842
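independent of the recipe referenced above, a minimal numpy sketch of kernel least squares with a Gaussian kernel; the function names are my own and the small ridge eps·I is a numerical safeguard of mine, not part of the slides

import numpy as np

def gaussKernel(X, Y, sigma=1.0):
    # Gaussian kernel matrix between the columns of X and the columns of Y
    D2 = np.sum(X**2, 0)[:, None] - 2*np.dot(X.T, Y) + np.sum(Y**2, 0)[None, :]
    return np.exp(-D2 / (2*sigma**2))

def trainKernelLSQ(X, y, sigma=1.0, eps=1e-8):
    # dual coefficients a = K^{-1} y
    K = gaussKernel(X, X, sigma)
    return np.linalg.solve(K + eps*np.eye(len(y)), y)

def applyKernelLSQ(Xq, X, a, sigma=1.0):
    # y(x) = k(x)^T K^{-1} y, evaluated for every column of Xq
    return np.dot(gaussKernel(Xq, X, sigma), a)

# usage on a made-up 1-d regression problem
X = np.linspace(-5, 7, 40).reshape(1, -1)     # 1 x 40 data matrix
y = 0.5*X[0]**2 + np.random.randn(40)         # noisy quadratic targets
a = trainKernelLSQ(X, y, sigma=2.5)
print(applyKernelLSQ(np.array([[0.0, 2.0]]), X, a, sigma=2.5))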
example: regression
[figure: kernel least squares fits of the same 1-d data with a linear kernel x^T y + 1, a polynomial kernel (x^T y + 1)³, and a Gaussian kernel exp(−‖x − y‖² / (2 · 2.5²))]
example: classification
[figure: decision regions for a polynomial kernel with d ∈ {1, 3, 6} and a Gaussian kernel with σ ∈ {0.5, 1.0, 5.0}]
recall: support vector machines
dual problem of L2 SVM training
µ* = argmax_µ − µ^T ( G + y y^T + (1/C) I ) µ
s.t. 1^T µ = 1,  µ ≥ 0
where G_ij = y_i y_j x_i^T x_j
observe
upon solving for µ, we have for the L2 SVM
w   = Σ_{s∈S} µ_s y_s x_s
w_0 = Σ_{s∈S} µ_s y_s
and the resulting classifier is
y(x) = sign(x^T w + w_0) = sign( Σ_{s∈S} µ_s y_s x^T x_s + w_0 )
kernel SVM
during training and application of an SVM, all the occurrences of x, x_i are in the form of inner products
we may therefore substitute
G_ij = y_i y_j k(x_i, x_j)
and
x^T w = Σ_{s∈S} µ_s y_s k(x, x_s)
example
[figure: kernel SVM decision regions for a polynomial kernel with d ∈ {3, 5, 7}, C = 2, and a Gaussian kernel with σ ∈ {0.25, 0.50, 0.75}, C = 1000]
L2 SVM with polynomial kernel

import numpy as np

### training
X = ...                               # training data matrix
y = ...                               # training label vector
m = trainL2SVMPolyKernel(X, y, d=3, C=2., T=1000)
s = np.where(m > 0)[0]                # indices of the support vectors
XS = X[:,s]
ys = y[s]
ms = m[s]
w0 = np.dot(ys, ms)

### testing
X = ...                               # test data matrix
y = applyL2SVMPolyKernel(X.T, XS, ys, ms, w0, d=3)
y = np.sign(y)
L2 SVM with polynomial kernel
def trainL2SVMPolyKernel(X, y, d, b=1., C=1., T=1000):
    m, n = X.shape
    I = np.eye(n)
    Y = np.outer(y, y)
    K = (b + np.dot(X.T, X))**d        # polynomial kernel matrix
    M = Y * K + Y + 1./C * I           # G + y y^T + (1/C) I with G = Y * K
    # Frank-Wolfe iterations over the simplex {mu >= 0, 1^T mu = 1}
    mu = np.ones(n) / n
    for t in range(T):
        eta = 2. / (t + 2)
        grd = 2 * np.dot(M, mu)
        mu += eta * (I[np.argmin(grd)] - mu)
    return mu
L2 SVM with polynomial kernel
def applyL2SVMPolyKernel(x, XS, ys, ms, w0, d, b=1.):
    # x : query point(s), one point per column; XS : support vectors
    if x.ndim == 1:
        x = x.reshape(len(x), 1)
    k = (b + np.dot(x.T, XS))**d       # kernel values k(x, x_s)
    return np.sum(k * ys * ms, axis=1) + w0
example
training on Xtrain ∈ R^{2×2000} and testing on Xtest ∈ R^{2×134500} took less than a second
[figure]
recall: linear discriminant analysis
training an LDA classifier is to compute
w* = argmax_w (w^T S_B w) / (w^T S_W w) = S_W^{-1} (µ_1 − µ_2)
where
S_B = (µ_1 − µ_2)(µ_1 − µ_2)^T
S_W = S_1 + S_2
µ_j = (1/n_j) Σ_{x_i ∈ Ω_j} x_i
S_j = Σ_{x_i ∈ Ω_j} (x_i − µ_j)(x_i − µ_j)^T
observe
in the following, let
X_j = [x_1, ..., x_{n_j}],  y_j = (1/n_j, ..., 1/n_j)^T
so that µ_j = X_j y_j
moreover, let
X = [X_1, X_2],  y = (y_1, −y_2)
we then have
w* = S_W^{-1} (µ_1 − µ_2) = S_W^{-1} (X_1 y_1 − X_2 y_2) = S_W^{-1} X y
⇒ the optimal choice for w is a linear combination of the data vectors in X
⇔ we may substitute w = X λ
observe
we have
S_j = Σ_{x_i ∈ Ω_j} (x_i − µ_j)(x_i − µ_j)^T
    = Σ_{x_i ∈ Ω_j} x_i x_i^T − n_j µ_j µ_j^T
    = X_j X_j^T − n_j X_j y_j y_j^T X_j^T
    = X_j X_j^T − X_j N_j X_j^T
where N_j = (1/n_j) 1 1^T
observe
w^T S_B w = λ^T X^T (X_1 y_1 − X_2 y_2)(X_1 y_1 − X_2 y_2)^T X λ
          = λ^T (X^T X_1 y_1 − X^T X_2 y_2)(X^T X_1 y_1 − X^T X_2 y_2)^T λ
          = λ^T (K_1 y_1 − K_2 y_2)(K_1 y_1 − K_2 y_2)^T λ
          = λ^T (M_1 − M_2)(M_1 − M_2)^T λ
          = λ^T M λ

w^T S_W w = λ^T X^T (X_1 X_1^T − X_1 N_1 X_1^T + X_2 X_2^T − X_2 N_2 X_2^T) X λ
          = λ^T X^T (X X^T − X_1 N_1 X_1^T − X_2 N_2 X_2^T) X λ
          = λ^T (X^T X X^T X − X^T X_1 N_1 X_1^T X − X^T X_2 N_2 X_2^T X) λ
          = λ^T (K² − K_1 N_1 K_1^T − K_2 N_2 K_2^T) λ
          = λ^T N λ
kernel LDA
training kernel LDA is to compute
λ* = argmax_λ (λ^T M λ) / (λ^T N λ) = N^{-1} (M_1 − M_2)
where it may be necessary to regularize N
N ← N + I
applying kernel LDA is to compute
y(x) = sign(x^T X λ* − θ) = sign(k(x)^T λ* − θ)
where θ is determined as in usual LDA
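a minimal numpy sketch of this training and application step, assuming the kernel matrix K over the training data has already been computed; the function names and the regularization scaling eps are my own choices

import numpy as np

def trainKernelLDA(K, inClass1, eps=1e-3):
    # K        : n x n kernel matrix K_ij = k(x_i, x_j) over all training points
    # inClass1 : boolean mask, True for points in Omega_1, False for Omega_2
    # eps      : scaling of the regularizer N <- N + eps*I (the slides add the
    #            plain identity; the scaling here is my choice)
    K1, K2 = K[:, inClass1], K[:, ~inClass1]
    n1, n2 = K1.shape[1], K2.shape[1]
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)             # M_j = K_j y_j, y_j = (1/n_j) 1
    # N = K^2 - K_1 N_1 K_1^T - K_2 N_2 K_2^T, using K_j N_j K_j^T = n_j M_j M_j^T
    N = np.dot(K, K) - n1*np.outer(M1, M1) - n2*np.outer(M2, M2)
    lam = np.linalg.solve(N + eps*np.eye(len(K)), M1 - M2)  # lambda* = N^{-1}(M_1 - M_2)
    return lam

def applyKernelLDA(kx, lam, theta=0.0):
    # kx : kernel values k(x_i, x) of a query x against all training points
    return np.sign(np.dot(kx, lam) - theta)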
example
[figure: kernel LDA decision regions for a polynomial kernel with d ∈ {1, 4, 6}]
recall: principal component analysis
given a (zero mean) data matrix X = [x_1, ..., x_n] ∈ R^{m×n}, we compute the sample covariance matrix
C = (1/n) X X^T ∈ R^{m×m}
then solve the eigenvector/eigenvalue problem
C u = λ u
and use the resulting eigenvectors for various purposes
observe
we have
C u = λ u  ⇔  (1/(nλ)) X X^T u = u  ⇔  X α = u
⇒ each eigenvector u of C is a linear combination of the column vectors x_i of X
and we emphasize that u ∈ R^m whereas α ∈ R^n
observe
we have
C u = λ u
⇔ (1/n) X X^T X α = λ X α
⇔ (1/n) X^T X X^T X α = λ X^T X α
⇔ K² α = λ̃ K α
⇔ K α = λ̃ α
where λ̃ = n λ
moreover, u^T u = 1
⇒ α^T K α = λ̃ α^T α = 1
⇒ ‖α‖ = 1/√λ̃
centering the kernel
k_c(x_i, x_j) = ( ϕ(x_i) − (1/n) Σ_{k=1}^{n} ϕ(x_k) )^T ( ϕ(x_j) − (1/n) Σ_{l=1}^{n} ϕ(x_l) )
             = ϕ(x_i)^T ϕ(x_j) − (1/n) Σ_l ϕ(x_i)^T ϕ(x_l) − (1/n) Σ_k ϕ(x_k)^T ϕ(x_j) + (1/n²) Σ_{k,l} ϕ(x_k)^T ϕ(x_l)
             = k(x_i, x_j) − (1/n) Σ_l k(x_i, x_l) − (1/n) Σ_k k(x_k, x_j) + (1/n²) Σ_{k,l} k(x_k, x_l)
kernel PCA
solve K α = λ̃ α where K_ij = k_c(x_i, x_j)
normalize α ← α / √λ̃
compute u^T x = α^T X^T x = Σ_i α_i k(x_i, x)
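a minimal numpy sketch of this procedure with a Gaussian kernel and the centering from the previous slide; the function names and parameter values are my own choices

import numpy as np

def gaussKernel(X, Y, sigma=1.0):
    D2 = np.sum(X**2, 0)[:, None] - 2*np.dot(X.T, Y) + np.sum(Y**2, 0)[None, :]
    return np.exp(-D2 / (2*sigma**2))

def kernelPCA(X, sigma=1.0, p=2):
    # X : m x n data matrix, columns are points; returns the top p coefficient
    # vectors alpha (as columns) and the corresponding eigenvalues of K_c
    n = X.shape[1]
    K = gaussKernel(X, X, sigma)
    J = np.ones((n, n)) / n
    Kc = K - np.dot(J, K) - np.dot(K, J) + np.dot(np.dot(J, K), J)   # centered kernel
    lam, A = np.linalg.eigh(Kc)                   # eigenvalues in ascending order
    lam, A = lam[::-1][:p], A[:, ::-1][:, :p]     # keep the p largest
    return A / np.sqrt(lam), lam                  # normalize so that alpha^T K_c alpha = 1

def projectKernelPCA(Xq, X, A, sigma=1.0):
    # f(x) = alpha^T X^T x generalizes to sum_i alpha_i k(x_i, x)
    return np.dot(gaussKernel(Xq, X, sigma), A)

# usage on made-up data
X = np.random.randn(2, 100)
A, lam = kernelPCA(X, sigma=1.0, p=2)
print(projectKernelPCA(X, X, A).shape)            # (100, 2)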
note
on the following slides, we compare standard PCA against kernel PCA
we compute them from m = 2 dimensional data points x_1, ..., x_n, where n ≫ m
standard PCA will produce m eigenvectors u_i ∈ R^m
kernel PCA will produce n eigenvectors α_j ∈ R^n
R^m is the data space, R^n is the feature space
since the feature space dimension exceeds 2, we cannot plot the kernel PCA results directly
what we do instead is to consider points x ∈ R^m and visualize standard and kernel PCA in terms of a function f(x) where either f(x) = u_i^T x or f(x) = α_j^T X^T x
examples: standard PCA
[figure: f(x) = u_i^T x for i = 1, 2, visualized over the data plane]
examples: (Gaussian) kernel PCA
[figure: f(x) = α_j^T X^T x for j = 1, ..., 6, visualized over the data plane]
examples: (Gaussian) kernel PCA
[figure: f(x) = α_j^T X^T x for j = 1, ..., 6, visualized over the data plane]
kernel k-means
you take it from here . . .
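in case you want something to compare your own solution against: a minimal sketch, assuming a precomputed kernel matrix K; it uses the kernel expansion of the distance to a cluster mean µ_c = (1/|C_c|) Σ_{i∈C_c} ϕ(x_i), namely ‖ϕ(x) − µ_c‖² = k(x, x) − (2/|C_c|) Σ_{i∈C_c} k(x, x_i) + (1/|C_c|²) Σ_{i,j∈C_c} k(x_i, x_j)

import numpy as np

def kernelKMeans(K, k=2, T=100, seed=0):
    # K : n x n kernel matrix; returns a cluster assignment for every point
    n = len(K)
    z = np.random.default_rng(seed).integers(k, size=n)   # random initial assignment
    for t in range(T):
        D = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.where(z == c)[0]
            if len(idx) == 0:
                continue
            # ||phi(x_i) - mu_c||^2 up to the constant k(x_i, x_i), which does
            # not affect the argmin over clusters
            D[:, c] = -2*K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        znew = np.argmin(D, axis=1)
        if np.all(znew == z):
            break
        z = znew
    return z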
what’s really cool about kernels
example
non-numeric data S=
’homer simpson’, ’marge simpson’, ’bart simpson’, ’lisa simpson’, ’maggie simpson’, ’apu nahasapeemapetilon’, ’selma bouvier’, ’patty bouvier’, ’nelson muntz’, ’ralph wiggum’, ’seymour skinner’, ’disco stu’, ’kent brockman’, ’carl carlson’, ’lenny leonard’, ’comic book guy’, ’ned flanders’, ’prof. john frink’, ’barney gumbel’, ’dr. julius hibbert’, ’edna krabappel’, ’krusty the clown’, ’reverend lovejoy’, ’otto mann’, ’martin prince’, ’moe syslak’, ’waylon smithers’, ’charles montgomery burns’, ’milhouse van houten’, ’clancy wiggum’, ’gary chalmers’, ’fat tony’, ’rod flanders’, ’todd flanders’, ’hans moleman’, ’mayor quimby’, ’dr. nick riveria’, ’sideshow bob’, ’snake jailbird’, ’groundskeeper willie’
bi-grams (e.g. for ’homer simpson’)
B = { ’ho’, ’om’, ’me’, ’er’, ’r ’, ’ s’, ’si’, ’im’, ’mp’, ’ps’, ’so’, ’on’ }
a possible string similarity
s(s_1, s_2) = 2 |B_1 ∩ B_2| / (|B_1| + |B_2|)
a possible string distance
d(s_1, s_2) = 1 − s(s_1, s_2)
a possible string kernel
k(s_1, s_2) = exp(− d²(s_1, s_2) / (2σ²))
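a minimal Python sketch of these three definitions (the helper names are my own)

import numpy as np

def bigrams(s):
    # set of character bi-grams of a string, e.g. 'homer' -> {'ho', 'om', 'me', 'er'}
    return {s[i:i+2] for i in range(len(s)-1)}

def similarity(s1, s2):
    # s(s1, s2) = 2 |B1 & B2| / (|B1| + |B2|)
    B1, B2 = bigrams(s1), bigrams(s2)
    return 2.0 * len(B1 & B2) / (len(B1) + len(B2))

def stringKernel(s1, s2, sigma=1.0):
    # k(s1, s2) = exp(-d^2(s1, s2) / (2 sigma^2)) with d(s1, s2) = 1 - s(s1, s2)
    d = 1.0 - similarity(s1, s2)
    return np.exp(-d**2 / (2*sigma**2))

print(similarity('homer simpson', 'lisa simpson'))    # large overlap via ' simpson'
print(similarity('homer simpson', 'moe syslak'))      # little bi-gram overlap
print(stringKernel('homer simpson', 'lisa simpson'))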
word2vec via kernel PCA
barney gumbel ralph wiggum lenny leonard comicedna book guy krabappel prince krusty the apuclown nahasapeemapetilon snakemartin jailbird clancy wiggum charles montgomery burns hans moleman patty bouvier selma bouvier kent brockman milhouse van houten nelson muntz otto mann carl carlson fat tony quimby dr. julius hibbertmayor gary chalmers dr. nick riveria prof. john frink disco stu sideshow groundskeeper willie bob seymour skinner moe syslak reverend lovejoy
waylon smithers
marge simpson bartsimpson simpson maggie lisa simpson homer simpson
ned flanders todd flanders rod flanders
word2vec via kernel PCA works for oov words !!!
[figure: the same embedding with out-of-vocabulary names added; ’kirk van houten’, ’abe simpson’, and ’maude flanders’ are placed next to their relatives]
assignment
for details, read E. Brito, R. Sifa, and C. Bauckhage, “KPCA Embeddings: An Unsupervised Approach to Learn Vector Representations of Finite Domain Sequences”, Proc. LWDA-KDML, 2017
summary
we now know about
Mercer kernels and the kernel trick
rewrite an algorithm for analysis or classification such that input data x only enters it in the form of inner products
replace all inner products by kernel evaluations
kernelized versions of algorithms we studied earlier