the kernel trick

Reading Group ML / AI
Prof. Christian Bauckhage

outline

blessings of dimensionality
Mercer's theorem
the kernel trick and kernel engineering
kernel least squares
summary
high dimensionality can also be a blessing
example 1

ϕ(x) = (x, ‖x‖²) = (x₁, x₂, x₁² + x₂²)

[figure: data x ∈ R² mapped via ϕ to ϕ(x) ∈ R³]
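To make example 1 concrete, here is a minimal numpy sketch (my own illustration, not part of the slides): two concentric rings in R² are not linearly separable, but after applying ϕ(x) = (x₁, x₂, x₁² + x₂²) the third coordinate alone separates them.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n=100):
    # sample n points on a circle of the given radius in R^2
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

X_inner = ring(1.0)                    # class -1
X_outer = ring(3.0)                    # class +1

def phi(X):
    # lift x in R^2 to phi(x) = (x1, x2, x1^2 + x2^2) in R^3
    return np.column_stack([X, np.sum(X**2, axis=1)])

# in R^3, the plane z = 5 linearly separates the two classes, since the
# third coordinate is ~1 for the inner ring and ~9 for the outer ring
print(phi(X_inner)[:, 2].max(), phi(X_outer)[:, 2].min())
```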
example 2

ϕ(x) = (x, ‖(xᵀα)α − x‖²)   where   α = (1/√2) (1, 1)ᵀ

[figure: data x ∈ R² mapped via ϕ to ϕ(x) ∈ R³]
observe
these examples suggest that non-linear transformations ϕ : Rᵐ → Rᴹ where m ⩽ M can make data linearly separable

⇒ even for non-linear problems, we may resort to efficient linear techniques
recall
in regression, clustering, and classification, linear approaches are a dime a dozen, for instance

linear classifiers (least squares, LDA, SVM, …)

y(x) = +1 if wᵀx > 0,  −1 if wᵀx < 0

nearest neighbor classifiers, k-means clustering, …, which rely on (squared) distances

‖x − q‖² = xᵀx − 2 xᵀq + qᵀq
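The last identity matters because it shows that nearest neighbor and k-means computations reduce to inner products. A small numpy sketch (my own illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))       # 5 training points in R^3
Q = rng.normal(size=(2, 3))       # 2 query points in R^3

# squared distances from inner products only:
# ||x - q||^2 = x^T x - 2 x^T q + q^T q
D = (np.sum(X * X, axis=1)[:, None]
     - 2.0 * X @ Q.T
     + np.sum(Q * Q, axis=1)[None, :])

# nearest training point for each query
print(np.argmin(D, axis=0))

# sanity check against the direct computation
assert np.allclose(D, np.sum((X[:, None, :] - Q[None, :, :])**2, axis=2))
```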
idea
in order to adapt linear functions f(x) = xᵀw to non-linear settings, we may

map x ∈ Rᵐ to ϕ(x) ∈ Rᴹ
learn ω ∈ Rᴹ from data

and then consider h(x) = ϕ(x)ᵀω
problems
it is not generally clear what transformation ϕ(·) to choose

depending on the dimension of the target space Rᴹ, it may be expensive to compute ϕ(x) for x ∈ Rᵐ

depending on the dimension of the target space Rᴹ, it may be expensive to compute inner products ϕ(x)ᵀϕ(y)
Mercer’s theorem to the rescue
Mercer kernel
⇔ a symmetric, positive semidefinite function k : Rᵐ × Rᵐ → R such that

k(x, y) = ϕ(x)ᵀϕ(y) = k(y, x)
positive semidefinite function
⇔ a function k : Rᵐ × Rᵐ → R where

∫ g²(x) dx > 0

implies that

∫∫ g(x) k(x, y) g(y) dx dy ⩾ 0
note
p.s.d. functions generalize the notion of p.s.d. matrices where

gᵀK g = ∑ᵢⱼ gᵢ Kᵢⱼ gⱼ ⩾ 0

p.s.d. matrices K have orthogonal eigenvectors uᵢ and non-negative eigenvalues λᵢ

K uᵢ = λᵢ uᵢ
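A quick numerical sanity check of this connection (my own sketch, not part of the slides): sample points, build the Gram matrix Kᵢⱼ = k(xᵢ, xⱼ) of, say, a Gaussian kernel, and confirm that its eigenvalues are non-negative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))                    # 50 points in R^4

# Gaussian kernel Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sigma = 1.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
K = np.exp(-sq_dists / (2.0 * sigma**2))

# K is symmetric, and its eigenvalues are >= 0 (up to numerical precision),
# i.e. g^T K g >= 0 for every g
print(np.linalg.eigvalsh(K).min())
```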
note
in analogy, positive semidefinite functions have orthogonal eigenfunctions ψᵢ(x) and eigenvalues µᵢ ⩾ 0, i = 1, …, ∞

∫ k(x, y) ψᵢ(y) dy = µᵢ ψᵢ(x)

where

∫ ψᵢ(x) ψⱼ(x) dx = δᵢⱼ
note
in analogy to the spectral representation of a matrix

K = ∑ᵢ λᵢ uᵢ uᵢᵀ   (sum over i = 1, …, m)

we have

k(x, y) = ∑ᵢ µᵢ ψᵢ(x) ψᵢ(y)   (sum over i = 1, …, ∞)
Theorem (Mercer, 1909)

for every positive semidefinite function k(x, y), there exists a vector-valued function ϕ(x) such that

k(x, y) = ϕ(x)ᵀϕ(y)
Proof

k(x, y) = ∑ᵢ µᵢ ψᵢ(x) ψᵢ(y)
        = ∑ᵢ √µᵢ ψᵢ(x) · √µᵢ ψᵢ(y)
        = ∑ᵢ ϕᵢ(x) ϕᵢ(y)
        = ϕ(x)ᵀϕ(y)

where we have defined ϕᵢ(x) = √µᵢ ψᵢ(x)
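The finite-dimensional analogue of this construction can be checked numerically (my own sketch, not from the slides, under the assumption that we only look at a sampled Gram matrix): eigendecompose K, scale the eigenvectors by √λᵢ, and the resulting feature vectors reproduce all kernel values as plain inner products.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))

# Gram matrix of the polynomial kernel k(x, y) = (x^T y + 1)^2
K = (X @ X.T + 1.0)**2

# spectral decomposition K = U diag(lam) U^T
lam, U = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)          # clip tiny negative round-off values

# finite-sample analogue of phi_i = sqrt(mu_i) psi_i:
# row j of Phi is a feature vector for data point x_j
Phi = U * np.sqrt(lam)

# inner products of these features reproduce the kernel values: Phi Phi^T = K
assert np.allclose(Phi @ Phi.T, K)
```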
implications
⇒ instead of computing ϕ(x)ᵀϕ(y) on vectors in Rᴹ, we may evaluate a kernel function k(x, y) on vectors in Rᵐ

⇒ we may not have to compute ϕ(x) and ϕ(x)ᵀϕ(y) explicitly, but may directly evaluate a kernel function k(x, y)
⇒ instead of worrying about how to choose ϕ(·), we may worry about choosing a suitable kernel function k(·, ·)
the kernel trick and kernel engineering
the kernel trick
the kernel trick is, first, to rewrite an algorithm for data analysis or classification in such a way that input data x only appear in the form of inner products and, second, to replace any occurrence of such inner products by kernel evaluations

this way, we can use linear approaches to solve non-linear problems!
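As a concrete illustration of the trick (my own example, not from the slides): on R², the kernel k(x, y) = (xᵀy + 1)² computes the inner product of 6-dimensional feature vectors without ever forming them.

```python
import numpy as np

def phi(x):
    # explicit feature map of the kernel k(x, y) = (x^T y + 1)^2 on R^2
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2.0) * x1,
                     np.sqrt(2.0) * x2,
                     x1 * x1,
                     x2 * x2,
                     np.sqrt(2.0) * x1 * x2])

def k(x, y):
    # the same value via the kernel trick, no 6-d vectors required
    return (x @ y + 1.0)**2

rng = np.random.default_rng(4)
x, y = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(phi(x) @ phi(y), k(x, y))
```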
the linear kernel (⇔ proof that kernel functions exist)
the identity mapping ϕ(x) = x yields a valid Mercer kernel

k(x, y) = ϕ(x)ᵀϕ(y) = xᵀy

because

xᵀy = yᵀx   (symmetric)

and

xᵀx ⩾ 0   (p.s.d.)
observe

in the following, we assume

x, y ∈ Rᵐ
b > 0 ∈ R
c > 0 ∈ R
d ∈ N
g : Rᵐ → R
kᵢ(x, y) = ϕ(x)ᵀϕ(y) is a valid kernel for some ϕ

each of the following is then a valid kernel function

k(x, y) = c · k₁(x, y)
k(x, y) = xᵀy + b
k(x, y) = (xᵀy + b)ᵈ
k(x, y) = k₁(x, y)ᵈ
k(x, y) = ‖ϕ(x) − ϕ(y)‖²
k(x, y) = g(x) k₁(x, y) g(y)
k(x, y) = k₁(x, y) + k₂(x, y)
k(x, y) = exp(−‖x − y‖² / (2σ²))
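These rules translate directly into code. The sketch below (my own, with arbitrary parameter choices) builds the Gram matrix of the linear kernel and combines it according to a few of the rules; checking symmetry and (numerical) non-negativity of the eigenvalues is a finite-sample sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))                 # 20 points in R^3

# Gram matrix of the linear kernel k1(x, y) = x^T y
K1 = X @ X.T

# kernels engineered from k1 via the rules above
K_scaled = 2.5 * K1                          # c * k1
K_inhom  = K1 + 1.0                          # x^T y + b
K_poly   = (K1 + 1.0)**3                     # (x^T y + b)^d
K_sum    = K_scaled + K_poly                 # k1 + k2

# Gaussian kernel, with ||x - y||^2 expanded into inner products
sq_norms = np.sum(X**2, axis=1)
sq_dists = sq_norms[:, None] - 2.0 * K1 + sq_norms[None, :]
K_gauss  = np.exp(-sq_dists / (2.0 * 2.0**2))    # sigma = 2

# every Gram matrix is symmetric with (numerically) non-negative eigenvalues
for K in (K_scaled, K_inhom, K_poly, K_sum, K_gauss):
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() > -1e-8
```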
note

again, kernel functions k(x, y) implicitly compute inner products ϕ(x)ᵀϕ(y) not in Rᵐ but in Rᴹ

                        k(x, y)                        M
inhom. linear kernel    xᵀy + b                        m + 1
polynomial kernel       (xᵀy + b)ᵈ                     (m + d choose d)
Gaussian kernel         exp(−‖x − y‖² / (2σ²))         ∞
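The column M is worth computing: even for modest m and d, the implicit feature space of the polynomial kernel is far too large to construct explicitly, which is exactly why the kernel trick pays off (a quick illustration, not from the slides):

```python
from math import comb

# M = (m + d choose d), the dimension of the feature space implicitly
# spanned by the polynomial kernel (x^T y + b)^d
for m, d in [(2, 2), (10, 3), (100, 3), (1000, 4)]:
    print(f"m = {m:4d}, d = {d}:  M = {comb(m + d, d)}")
```

For m = 100 and d = 3 this already gives 176,851 dimensions, while the kernel evaluation itself only needs a 100-dimensional inner product.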
assignment
to understand and make sense of all of the above, read

C. Bauckhage, "Lecture Notes on the Kernel Trick (I)", dx.doi.org/10.13140/2.1.4524.8806
C. Bauckhage, "Lecture Notes on the Kernel Trick (III)", dx.doi.org/10.13140/RG.2.1.2471.6322
kernel least squares
least squares

given labeled training data (xᵢ, yᵢ), i = 1, …, n, we seek w such that y(x) = xᵀw

for convenience, we let

x ← (1, x)ᵀ
w ← (w₀, w)ᵀ

and then solve

argmin_w ‖Xᵀw − y‖²    (1)
note
n × m design matrix

X = [x₁ᵀ; x₂ᵀ; …; xₙᵀ]   (one row per data point)

vs. m × n data matrix

X = [x₁, x₂, …, xₙ]   (one column per data point)

here, we use the m × n data matrix, which is why (1) reads ‖Xᵀw − y‖²
note
the least squares problem in (1) has two solutions

primal:  w = (X Xᵀ)⁻¹ X y    (requires inverting the m × m matrix X Xᵀ)

dual:    w = X (XᵀX)⁻¹ y     (requires inverting the n × n matrix XᵀX)

where XᵀX is a Gram matrix since (XᵀX)ᵢⱼ = xᵢᵀxⱼ
note
while working with the dual may be costly, it allows us to invoke the kernel trick, because in

y(x) = xᵀw = xᵀX (XᵀX)⁻¹ y

all x, xᵢ occur within inner products and we may rewrite

y(x) = k(x)ᵀ K⁻¹ y

where

Kᵢⱼ = k(xᵢ, xⱼ)
kᵢ(x) = k(xᵢ, x)
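A minimal numpy sketch of this kernelized least squares predictor (my own function names, not those of the referenced recipe below; a least-squares solve stands in for K⁻¹ in case K is singular, e.g. for the linear kernel):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all rows of A and B
    sq = (np.sum(A**2, axis=1)[:, None]
          - 2.0 * A @ B.T
          + np.sum(B**2, axis=1)[None, :])
    return np.exp(-sq / (2.0 * sigma**2))

def fit(X, y, kernel):
    # compute the coefficient vector K^{-1} y (via lstsq to tolerate singular K)
    K = kernel(X, X)
    coef, *_ = np.linalg.lstsq(K, y, rcond=None)
    return coef

def predict(X_train, coef, X_new, kernel):
    # y(x) = k(x)^T K^{-1} y, evaluated for every row of X_new
    return kernel(X_new, X_train) @ coef

# toy 1-d regression data
rng = np.random.default_rng(6)
X = rng.uniform(-4.0, 8.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

coef = fit(X, y, gaussian_kernel)
print(np.mean((y - predict(X, coef, X, gaussian_kernel))**2))   # small training error
```

In practice, one often adds a small ridge term, solving (K + λI)α = y, to stabilize the inversion; here we stay with the plain least squares formulation from the slides.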
assignment
to see how to implement this idea in numpy / scipy, read C. Bauckhage, “NumPy / SciPy Recipes for Data Science: Kernel Least Squares Optimization (1)”, dx.doi.org/10.13140/RG.2.1.4299.9842
example: regression
[figure: kernel least squares regression fits using the linear kernel xᵀy + 1, the polynomial kernel (xᵀy + 1)³, and the Gaussian kernel exp(−‖x − y‖² / (2 · 2.5²))]
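Using the fit/predict sketch from the kernel least squares section above, the three kernels of this figure can be compared directly (the toy data X, y below is the assumed data from that sketch; the slides' actual data is not reproduced here):

```python
# the three kernels from the regression figure (reusing fit, predict,
# gaussian_kernel and the toy data X, y defined in the sketch above)
linear = lambda A, B: A @ B.T + 1.0                      # x^T y + 1
poly   = lambda A, B: (A @ B.T + 1.0)**3                 # (x^T y + 1)^3
gauss  = lambda A, B: gaussian_kernel(A, B, sigma=2.5)   # exp(-||x-y||^2 / (2 * 2.5^2))

for name, kernel in [("linear", linear), ("polynomial", poly), ("Gaussian", gauss)]:
    coef = fit(X, y, kernel)
    mse = np.mean((y - predict(X, coef, X, kernel))**2)
    print(name, mse)
```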
example: classification
[figure: classification results with a polynomial kernel, d ∈ {1, 3, 6}]

[figure: classification results with a Gaussian kernel, σ ∈ {0.5, 1.0, 5.0}]
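Classification works the same way: train on labels y ∈ {−1, +1} and take the sign of the kernel least squares prediction. A self-contained sketch (the Gaussian-blob toy data and σ = 1 are my assumptions; they are not the data shown in the figures):

```python
import numpy as np

rng = np.random.default_rng(7)

# two-class toy data in R^2 (a stand-in for the data in the figures)
X = np.vstack([rng.normal(loc=[+1.0, +1.0], scale=0.6, size=(50, 2)),
               rng.normal(loc=[-1.0, -1.0], scale=0.6, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

def gauss(A, B, sigma=1.0):
    # Gaussian kernel matrix for all rows of A and B
    sq = (np.sum(A**2, axis=1)[:, None]
          - 2.0 * A @ B.T
          + np.sum(B**2, axis=1)[None, :])
    return np.exp(-sq / (2.0 * sigma**2))

K = gauss(X, X)
coef, *_ = np.linalg.lstsq(K, y, rcond=None)   # K^{-1} y
y_hat = np.sign(gauss(X, X) @ coef)            # sign of y(x) = k(x)^T K^{-1} y
print(np.mean(y_hat == y))                     # training accuracy
```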
summary
we now know about
benefits of high dimensionality
Mercer kernels and the kernel trick
what valid kernel functions may look like
how to kernelize least squares computations