Reading Group ML / AI
Prof. Christian Bauckhage

outline

the kernel trick
  blessings of dimensionality
  Mercer's theorem
  the kernel trick and kernel engineering
  kernel least squares
  summary

high dimensionality can also be a blessing

example 1

$\varphi(x) = \bigl[\, x,\ \|x\|^2 \,\bigr] = \bigl[\, x_1,\ x_2,\ x_1^2 + x_2^2 \,\bigr]$

[figure: points x ∈ R² and their images ϕ(x) ∈ R³]

example 2

$\varphi(x) = \Bigl[\, x,\ \bigl\|(x^T \alpha)\,\alpha - x\bigr\|^2 \,\Bigr], \qquad \alpha = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}$

[figure: points x ∈ R² and their images ϕ(x) ∈ R³]

observe

these examples suggest that non-linear transformations ϕ : Rm → RM where m ⩽ M can make data linearly separable

⇒ even for non-linear problems, we may resort to efficient linear techniques
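to make this concrete, here is a small numpy sketch (an illustration, not part of the original slides; the two-ring data set and the threshold of 4 are arbitrary choices) of the map from example 1: two concentric rings are not linearly separable in R², but after appending ‖x‖² they are separated by a plane in R³:

```python
import numpy as np

rng = np.random.default_rng(1)

def ring(radius, n=200):
    """sample n noisy points close to a circle of the given radius"""
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius + 0.1 * rng.normal(size=n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

inner, outer = ring(1.0), ring(3.0)             # two classes in R^2

def phi(X):
    """feature map of example 1: (x1, x2) -> (x1, x2, x1^2 + x2^2)"""
    return np.column_stack([X, np.sum(X ** 2, axis=1)])

# in R^3, the plane z = 4 separates the images of the two rings
z_inner, z_outer = phi(inner)[:, 2], phi(outer)[:, 2]
print(np.all(z_inner < 4.0), np.all(z_outer > 4.0))   # True True
```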

recall

in regression, clustering, and classification, linear approaches are a dime a dozen, for instance

linear classifiers (least squares, LDA, SVM, ...)

$y(x) = \begin{cases} +1 & \text{if } w^T x \geq 0 \\ -1 & \text{if } w^T x < 0 \end{cases}$

nearest neighbor classifiers, k-means clustering, ..., which rely on squared distances

$\|x - q\|^2 = x^T x - 2\, x^T q + q^T q$
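as a quick sanity check that the squared distance really is a combination of inner products, the following tiny numpy snippet (not part of the original slides) verifies the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
x, q = rng.normal(size=3), rng.normal(size=3)   # two random points in R^3

lhs = np.sum((x - q) ** 2)                      # ||x - q||^2
rhs = x @ x - 2 * x @ q + q @ q                 # x^T x - 2 x^T q + q^T q
print(np.isclose(lhs, rhs))                     # True
```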

idea

in order to adapt linear functions f(x) = xᵀw to non-linear settings, we may

map x ∈ Rm to ϕ(x) ∈ RM
learn ω ∈ RM from data

and then consider h(x) = ϕ(x)ᵀω

problems

it is not generally clear what transformation ϕ(·) to choose

depending on the dimension of the target space RM, it may be expensive to compute ϕ(x) for x ∈ Rm

depending on the dimension of the target space RM, it may be expensive to compute inner products ϕ(x)ᵀϕ(y)

Mercer’s theorem to the rescue

Mercer kernel

⇔ a symmetric, positive semidefinite function k : Rm × Rm → R such that k(x, y) = ϕ(x)ᵀϕ(y) = k(y, x)

positive semidefinite function

⇔ a function k : Rm × Rm → R such that, for every g with $\int g^2(x)\, dx < \infty$,

$\iint g(x)\, k(x, y)\, g(y)\, dx\, dy \geq 0$

note

p.s.d. functions generalize the notion of p.s.d. matrices K for which

$g^T K g = \sum_{i,j} g_i\, K_{ij}\, g_j \geq 0$

p.s.d. matrices K have orthogonal eigenvectors $u_i$ and non-negative eigenvalues $\lambda_i$

$K\, u_i = \lambda_i\, u_i$

note

in analogy, positive semidefinite functions have orthogonal eigenfunctions $\psi_i(x)$ and eigenvalues $\mu_i \geq 0$, $i = 1, \ldots, \infty$

$\int k(x, y)\, \psi_i(y)\, dy = \mu_i\, \psi_i(x)$

where

$\int \psi_i(x)\, \psi_j(x)\, dx = \delta_{ij}$

note

in analogy to the spectral representation of a matrix

$K = \sum_{i=1}^{m} \lambda_i\, u_i u_i^T$

we have

$k(x, y) = \sum_{i=1}^{\infty} \mu_i\, \psi_i(x)\, \psi_i(y)$
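the matrix side of this analogy is easy to check numerically; the following numpy sketch (an illustration, not part of the slides; the Gaussian kernel and the random data are arbitrary choices) builds a kernel matrix, confirms its eigenvalues are non-negative, and reconstructs it from its spectral representation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                      # 50 points in R^3

# Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * sigma ** 2))

lam, U = np.linalg.eigh(K)                        # eigenvalues and orthonormal eigenvectors
print(lam.min() >= -1e-10)                        # True: K is p.s.d. (up to round-off)

K_rebuilt = (U * lam) @ U.T                       # sum_i lambda_i u_i u_i^T
print(np.allclose(K, K_rebuilt))                  # True
```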

Theorem (Mercer, 1909) for every positive semidefinite function k(x, y), there exists a vector-valued function ϕ(x) such that k(x, y) = ϕ(x)ᵀϕ(y)


Proof

$k(x, y) = \sum_i \mu_i\, \psi_i(x)\, \psi_i(y) = \sum_i \sqrt{\mu_i}\, \psi_i(x)\, \sqrt{\mu_i}\, \psi_i(y) = \sum_i \varphi_i(x)\, \varphi_i(y) = \varphi(x)^T \varphi(y)$

where we set $\varphi_i(x) = \sqrt{\mu_i}\, \psi_i(x)$

implications

⇒ instead of computing ϕ(x)ᵀϕ(y) on vectors in RM, we may evaluate a kernel function k(x, y) on vectors in Rm

⇒ we may not have to compute ϕ(x) and ϕ(x)ᵀϕ(y) explicitly, but may directly evaluate a kernel function k(x, y)

⇒ instead of worrying about how to choose ϕ(·), we may worry about choosing a suitable kernel function k(·, ·)

the kernel trick and kernel engineering

the kernel trick

the kernel trick is, first, to rewrite an algorithm for data analysis or classification in such a way that the input data x only appear in the form of inner products and, second, to replace any occurrence of such inner products by kernel evaluations

this way, we can use linear approaches to solve non-linear problems!

the linear kernel (⇔ proof that kernel functions exist)

the identity mapping ϕ(x) = x yields a valid Mercer kernel

$k(x, y) = \varphi(x)^T \varphi(y) = x^T y$

because

$x^T y = y^T x$  (symmetric)

and

$x^T x \geq 0$  (p.s.d.)

observe

in the following, we assume

x, y ∈ Rm
b > 0 ∈ R
c > 0 ∈ R
d > 0 ∈ R
g : Rm → R
ki(x, y) = ϕ(x)ᵀϕ(y) is a valid kernel for some ϕ

kernel function

$k(x, y) = c \cdot k_1(x, y)$
$k(x, y) = k_1(x, y) + k_2(x, y)$
$k(x, y) = g(x)\, k_1(x, y)\, g(y)$
$k(x, y) = x^T y + b$
$k(x, y) = \bigl(x^T y + b\bigr)^d$
$k(x, y) = k_1(x, y)^d$
$k(x, y) = \bigl\|\varphi(x) - \varphi(y)\bigr\|^2$
$k(x, y) = \exp\Bigl[ -\tfrac{1}{2\sigma^2} \|x - y\|^2 \Bigr]$
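as a quick numerical illustration (not from the slides), one can combine a few of the constructions above and check that the resulting Gram matrices on random data remain positive semidefinite; the choice g(x) = cos(x1) and the test points are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))                      # 40 random points in R^2

def gram(kernel, X):
    """Gram matrix K_ij = kernel(x_i, x_j)"""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-9):
    return np.linalg.eigvalsh(K).min() >= -tol

k1 = lambda x, y: x @ y                           # linear kernel
k2 = lambda x, y: (x @ y + 1.0) ** 3              # polynomial kernel
k3 = lambda x, y: np.exp(-np.sum((x - y) ** 2))   # Gaussian kernel, 2 sigma^2 = 1

# combinations covered by the rules above
combos = [
    lambda x, y: 2.5 * k1(x, y),                          # c * k1
    lambda x, y: k1(x, y) + k2(x, y),                     # k1 + k2
    lambda x, y: np.cos(x[0]) * k3(x, y) * np.cos(y[0]),  # g(x) k1(x,y) g(y)
]
print([is_psd(gram(k, X)) for k in combos])       # [True, True, True]
```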

note

again, kernel functions k(x, y) implicitly compute inner products ϕ(x)ᵀϕ(y) not in Rm but in RM

kernel                  k(x, y)                       M
inhom. linear kernel    xᵀy + b                       m + 1
polynomial kernel       (xᵀy + b)^d                   (m + d choose d)
Gaussian kernel         exp(−‖x − y‖²/(2σ²))          ∞

assignment

to understand and make sense of all of the above, read

C. Bauckhage, "Lecture Notes on the Kernel Trick (I)", dx.doi.org/10.13140/2.1.4524.8806
C. Bauckhage, "Lecture Notes on the Kernel Trick (III)", dx.doi.org/10.13140/RG.2.1.2471.6322
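to see the implicit inner product from the table above at work, here is a small numpy sketch (an illustration, not from the slides; m = 2, b = 1, d = 2 and the standard monomial feature map are assumptions): the explicit feature map has M = (2+2 choose 2) = 6 components, and its inner product coincides with the kernel value computed in R²:

```python
import numpy as np

def phi(x, b=1.0):
    """explicit feature map of the polynomial kernel (x^T y + b)^2 for x in R^2"""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * b) * x1,
                     np.sqrt(2 * b) * x2,
                     b])                      # M = 6 = (2+2 choose 2) components

rng = np.random.default_rng(4)
x, y = rng.normal(size=2), rng.normal(size=2)

explicit = phi(x) @ phi(y)                    # inner product in R^6
implicit = (x @ y + 1.0) ** 2                 # kernel evaluation in R^2
print(np.isclose(explicit, implicit))         # True
```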

kernel least squares

least squares

given labeled training data $\{(x_i, y_i)\}_{i=1}^{n}$, we seek w such that y(x) = xᵀw

for convenience, we let

$x \leftarrow \begin{bmatrix} 1 \\ x \end{bmatrix}, \qquad w \leftarrow \begin{bmatrix} w_0 \\ w \end{bmatrix}$

and then solve

$\operatorname*{argmin}_{w}\ \bigl\| X^T w - y \bigr\|^2 \qquad (1)$

note

n × m design matrix

$X = \begin{bmatrix} \cdots & x_1^T & \cdots \\ \cdots & x_2^T & \cdots \\ & \vdots & \\ \cdots & x_n^T & \cdots \end{bmatrix}$

vs. m × n data matrix

$X = \begin{bmatrix} \vdots & \vdots & & \vdots \\ x_1 & x_2 & \cdots & x_n \\ \vdots & \vdots & & \vdots \end{bmatrix}$

note

the least squares problem in (1) has two solutions

primal
$w = \underbrace{\bigl(X X^T\bigr)^{-1}}_{m \times m} X\, y$

dual
$w = X \underbrace{\bigl(X^T X\bigr)^{-1}}_{n \times n} y$

where XᵀX is a Gram matrix since $\bigl(X^T X\bigr)_{ij} = x_i^T x_j$
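a quick numpy check (an illustration, not from the slides; np.linalg.pinv is used so that both expressions are defined even when XXᵀ or XᵀX is rank-deficient, and it coincides with the inverse whenever that exists) that the primal and dual expressions yield the same weights:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 20
X = rng.normal(size=(m, n))                     # m x n data matrix (columns are samples)
y = rng.normal(size=n)                          # training labels

# primal: (X X^T)^{-1} X y,  dual: X (X^T X)^{-1} y
w_primal = np.linalg.pinv(X @ X.T) @ X @ y
w_dual   = X @ np.linalg.pinv(X.T @ X) @ y

print(np.allclose(w_primal, w_dual))            # True
```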


note

while working with the dual may be costly, it allows us to invoke the kernel trick, because in

$y(x) = x^T w = x^T X \bigl(X^T X\bigr)^{-1} y$

all x, xi occur within inner products and we may rewrite

$y(x) = k(x)^T K^{-1} y$

where

$K_{ij} = k(x_i, x_j), \qquad k_i(x) = k(x_i, x)$
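a minimal numpy sketch of this kernelized prediction, assuming a Gaussian kernel with σ = 2.5, toy 1D data, and a small ridge term added to K for numerical stability; this is only an illustration, not the implementation from the recipe referenced below:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B"""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

# toy 1D regression data
rng = np.random.default_rng(6)
x_train = np.linspace(-4.0, 8.0, 30)[:, None]
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=30)

# fit: y(x) = k(x)^T K^{-1} y  (a tiny ridge term keeps K well conditioned)
K = gaussian_kernel(x_train, x_train, sigma=2.5)
coef = np.linalg.solve(K + 1e-8 * np.eye(len(K)), y_train)

# predict on new inputs
x_test = np.linspace(-4.0, 8.0, 200)[:, None]
y_test = gaussian_kernel(x_test, x_train, sigma=2.5) @ coef
print(y_test.shape)   # (200,)
```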

assignment

to see how to implement this idea in numpy / scipy, read

C. Bauckhage, "NumPy / SciPy Recipes for Data Science: Kernel Least Squares Optimization (1)", dx.doi.org/10.13140/RG.2.1.4299.9842

example: regression

[figure: kernel least squares regression with a linear kernel xᵀy + 1, a polynomial kernel (xᵀy + 1)³, and a Gaussian kernel exp(−‖x − y‖²/(2 · 2.5²))]

example: classification

[figure: kernel least squares classification with a polynomial kernel, d ∈ {1, 3, 6}]

[figure: kernel least squares classification with a Gaussian kernel, σ ∈ {0.5, 1.0, 5.0}]

summary

we now know about

benefits of high dimensionality
Mercer kernels and the kernel trick
what valid kernel functions may look like
how to kernelize least squares computations