Reading Group ML / AI
Prof. Christian Bauckhage

Outline

kernel PCA

application: outlier detection

summary

kernel PCA

recall: standard PCA

(figure: example data shown with the standard basis vectors e1, e2 and the principal directions u1, u2)

recall: standard PCA procedure

given a (zero mean) data matrix
X = [x1, …, xn] ∈ R^(m×n)

compute the sample covariance matrix
C = (1/n) X Xᵀ ∈ R^(m×m)

then solve the eigenvector/eigenvalue problem
C u = λ u
and use the resulting eigenvectors for various purposes
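A minimal NumPy sketch of this procedure (function and variable names are illustrative, not from the slides):

import numpy as np

def standard_pca(X):
    # standard PCA for a zero-mean data matrix X of shape (m, n), points as columns
    m, n = X.shape
    C = X @ X.T / n                    # sample covariance matrix C = (1/n) X X^T
    lam, U = np.linalg.eigh(C)         # eigenvalues ascending, eigenvectors as columns
    order = np.argsort(lam)[::-1]      # reorder by decreasing eigenvalue
    return lam[order], U[:, order]

# usage (assuming X holds the data as columns):
# lam, U = standard_pca(X - X.mean(axis=1, keepdims=True))
# f = U[:, 0] @ X                      # projections onto the first principal direction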

observe

we have
C u = λ u
⇔ (1/(nλ)) X Xᵀ u = u
⇔ X α = u  with  α = (1/(nλ)) Xᵀ u ∈ R^n

⇒ each eigenvector u of C is a linear combination of the column vectors xi of X; we emphasize that u ∈ R^m while α ∈ R^n

observe

we have
C u = λ u
⇔ (1/n) X Xᵀ X α = λ X α
⇔ (1/n) Xᵀ X Xᵀ X α = λ Xᵀ X α
⇔ K² α = λ̃ K α
⇔ K α = λ̃ α

where λ̃ = n λ

moreover, uᵀ u = 1
⇒ αᵀ K α = λ̃ αᵀ α = 1
⇒ ‖α‖ = 1 / √λ̃
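A quick numerical check of this correspondence, sketched in NumPy with small random data (all names and the toy dimensions are illustrative):

import numpy as np

# toy check of the duality between C = (1/n) X X^T and K = X^T X
rng = np.random.default_rng(0)
m, n = 3, 10
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)            # make the data zero mean

C = X @ X.T / n                               # covariance matrix, (m, m)
K = X.T @ X                                   # Gram matrix, (n, n)

lam, _ = np.linalg.eigh(C)
lam_t, A = np.linalg.eigh(K)

print(np.allclose(lam_t[-m:], n * lam))       # nonzero eigenvalues satisfy lambda_tilde = n * lambda
alpha = A[:, -1] / np.sqrt(lam_t[-1])         # normalize so that ||alpha|| = 1 / sqrt(lambda_tilde)
u = X @ alpha                                 # u = X alpha is a unit-norm eigenvector of C
print(np.isclose(np.linalg.norm(u), 1.0))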

note

K is an n × n matrix where Kij = xiᵀ xj

⇒ PCA allows for invoking the kernel trick

⇔ we may replace Kij = xiᵀ xj by k(xi, xj) = ϕ(xi)ᵀ ϕ(xj)
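The later example slides use a Gaussian kernel; a possible implementation (the bandwidth parameter sigma is an assumed name, not fixed on the slides):

import numpy as np

def gaussian_kernel(X, Y=None, sigma=1.0):
    # Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2));
    # X and Y hold data points as columns, matching the slides' convention
    if Y is None:
        Y = X
    sq_dists = (np.sum(X**2, axis=0)[:, None]
                + np.sum(Y**2, axis=0)[None, :]
                - 2.0 * X.T @ Y)
    return np.exp(-sq_dists / (2.0 * sigma**2))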

note

when doing standard PCA, we insisted on zero mean data

when doing kernel PCA in feature space, we do not know if the ϕ(xi) are of zero mean

what we need are zero mean or centered feature vectors
ϕc(xk) = ϕ(xk) − ϕ̄
where
ϕ̄ = (1/n) Σ_{k=1}^{n} ϕ(xk)

note

the kernel trick is all about not having to compute the ϕ(xk )

⇔ what we really need is a centered kernel function
kc(xi, xj) = ϕc(xi)ᵀ ϕc(xj)

next, we shall see that this is actually easy to obtain

centering the kernel

kc(xi, xj) = ( ϕ(xi) − (1/n) Σ_k ϕ(xk) )ᵀ ( ϕ(xj) − (1/n) Σ_l ϕ(xl) )

           = ϕ(xi)ᵀ ϕ(xj) − ϕ(xi)ᵀ (1/n) Σ_l ϕ(xl) − (1/n) Σ_k ϕ(xk)ᵀ ϕ(xj) + (1/n²) Σ_{k,l} ϕ(xk)ᵀ ϕ(xl)

           = k(xi, xj) − (1/n) Σ_l k(xi, xl) − (1/n) Σ_k k(xk, xj) + (1/n²) Σ_{k,l} k(xk, xl)
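In matrix form this reads Kc = K − 1n K − K 1n + 1n K 1n, where 1n is the n × n matrix whose entries all equal 1/n; a small sketch (the function name is illustrative):

import numpy as np

def center_kernel(K):
    # centered kernel matrix Kc = K - 1n K - K 1n + 1n K 1n,
    # where 1n is the n x n matrix with all entries equal to 1/n
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n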

kernel PCA

solve
K α = λ̃ α  where  Kij = kc(xi, xj)

normalize
α ← α λ̃^(−1/2)

compute projections
uᵀ x = αᵀ Xᵀ x = Σ_i αi kc(xi, x)

f(x) = ujᵀ x
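Putting the steps together, a minimal NumPy sketch of kernel PCA (it assumes center_kernel from the previous sketch; the function name and the number of components kept are illustrative):

import numpy as np

def kernel_pca(K, num_components=2):
    # kernel PCA for an (uncentered) n x n kernel matrix K:
    # 1. center the kernel, 2. solve Kc alpha = lambda_tilde alpha,
    # 3. normalize alpha <- alpha * lambda_tilde**(-1/2)
    Kc = center_kernel(K)                            # from the previous sketch
    lam_t, A = np.linalg.eigh(Kc)                    # eigenvalues in ascending order
    order = np.argsort(lam_t)[::-1][:num_components]
    lam_t, A = lam_t[order], A[:, order]
    A = A / np.sqrt(lam_t)                           # assumes the leading eigenvalues are positive
    return A, lam_t

# projection of a point x onto component j:
# f_j(x) = sum_i A[i, j] * kc(x_i, x)
# (for a new point x, the kernel values k(x_i, x) must be centered consistently)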

examples: standard PCA

(figures: contour plots of the projections f(x) = u1ᵀ x and f(x) = u2ᵀ x onto the first two principal directions)

examples: (Gaussian) kernel PCA

(figures: contour plots of f(x) = αjᵀ Xᵀ x for the first six kernel principal components, j = 1, …, 6)

examples: (Gaussian) kernel PCA

(figures: contour plots of f(x) = αjᵀ Xᵀ x for j = 1, …, 6)

examples: (Gaussian) kernel PCA

(figures: contour plots of f(x) = αjᵀ Xᵀ x for j = 1, …, 6)

observe

kPCA can be understood as a clustering algorithm

⇔ the values of the projections
αjᵀ Xᵀ x = Σ_i αji kc(xi, x)
reveal structures within the given data
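An illustrative end-to-end sketch of this reading (it assumes gaussian_kernel, center_kernel, and kernel_pca from the earlier sketches; the synthetic data, the bandwidth, and the interpretation of the projection signs are assumptions, not taken from the slides):

import numpy as np

# two well separated blobs, points as columns
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(-2.0, 0.3, (2, 20)),
               rng.normal( 2.0, 0.3, (2, 20))])

K = gaussian_kernel(X, sigma=1.0)              # earlier sketch
A, lam_t = kernel_pca(K, num_components=1)     # earlier sketch
F = center_kernel(K) @ A                       # projections of the training points
print(F[:20].mean(), F[20:].mean())            # the two blobs typically land on opposite sides of zero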

application: outlier detection

example

f(x) = Σ_{j=1}^{8} αjᵀ Xᵀ x

example

f(x) = Σ_{j=1}^{32} αjᵀ Xᵀ x
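A hedged sketch of how such a score could be computed and used for outlier detection (it assumes kernel_pca and center_kernel from the earlier sketches; the number of components follows the first example, but the flagging rule at the end is purely an assumption, not from the slides):

import numpy as np

def kpca_score(K, num_components=8):
    # f(x_k) = sum of the first num_components projections of x_k,
    # evaluated at the training points, following the slides' f(x) = sum_j alpha_j^T X^T x
    A, _ = kernel_pca(K, num_components=num_components)   # earlier sketch
    F = center_kernel(K) @ A                               # per-component projections
    return F.sum(axis=1)

# one possible (assumed) rule: flag points whose score deviates most from the median
# scores = kpca_score(K, num_components=8)
# dev = np.abs(scores - np.median(scores))
# outliers = dev > np.percentile(dev, 95)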

summary

we now know about

kernel PCA and an interesting application