Kernel spectral clustering: model representations, sparsity and out-of-sample extensions

Johan Suykens and Carlos Alzate
K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
Email: {johan.suykens,carlos.alzate}@esat.kuleuven.be
http://www.esat.kuleuven.be/scd/

4th Int. Conf. on Computational Harmonic Analysis, May 2011, Hong Kong


Overview

• Introduction
• Kernel PCA: primal and dual model representations
• Spectral clustering
• Kernel spectral clustering
• Model selection
• Sparsity
• Incorporating prior knowledge

Introduction

Application studies related to spectral clustering:
• complex networks: clustering time series in power grid networks
• clustering of scientific journals
• clustering time series in the CoRoT exoplanet database
• prognostic tools for predicting maintenance of machines
• image segmentation
[Alzate et al., 2007, 2009; Varon et al., 2011]


Power grid networks

[Figure: normalized load over one week (hours 1-168), contrasting a winter profile (around 0.9) with a summer profile (around 0.3-0.5)]

Network: 245 substations in the Belgian grid
Data: hourly load, with seasonal, weekly and intra-daily patterns
Aim: short-term load forecasting, important for power generation decisions
Clustering: identifying customer profiles from time series (over 5 years)
[Espinoza et al., IEEE CSM 2007; Alzate et al., 2009]

Kernel-based models and learning theory

• Classification and regression: the use of kernel-based models is well established
  aim: a good predictive model (achieving good generalization)
  model selection (kernel parameters, regularization constants): cross-validation (CV)

• Clustering: underlying kernel-based models? out-of-sample extensions?
  learning and generalization? training, validation, test data?
  tuning parameters: how to determine them?

Overview

• Introduction
• Kernel PCA: primal and dual model representations
• Spectral clustering
• Kernel spectral clustering
• Model selection
• Sparsity
• Incorporating prior knowledge

Kernel principal component analysis

[Figure: 2-D toy data in the (x(1), x(2)) plane, comparing linear PCA with kernel PCA (RBF kernel)]

Kernel PCA [Schölkopf et al., 1998]: eigenvalue decomposition of the kernel matrix

$$\begin{bmatrix} K(x_1, x_1) & \dots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \dots & K(x_N, x_N) \end{bmatrix}$$

Kernel PCA: primal and dual problem

• Underlying primal problem [Suykens et al., 2003]:

$$\min_{w,b,e} \; -\frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$

• (Lagrange) dual problem = kernel PCA:

$$\Omega_c \alpha = \lambda \alpha \quad \text{with } \lambda = 1/\gamma$$

with $\Omega_{c,ij} = (\varphi(x_i) - \hat{\mu}_\varphi)^T (\varphi(x_j) - \hat{\mu}_\varphi)$ the centered kernel matrix.

• Interpretation:
1. pool of candidate components (objective function equals zero)
2. select relevant components
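To make the dual route concrete, here is a minimal numpy sketch (an illustration, not the authors' code) that builds the centered kernel matrix $\Omega_c$ for a Gaussian RBF kernel and extracts its leading eigenvectors; the helper names `rbf_kernel` and `kernel_pca` and the parameters `sigma2`, `n_comp` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, sigma2):
    # Gaussian RBF kernel: K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def kernel_pca(X, sigma2, n_comp):
    # dual kernel PCA: eigendecomposition of the centered kernel matrix
    N = X.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    Omega_c = C @ rbf_kernel(X, X, sigma2) @ C   # Omega_c,ij = (phi_i - mu)^T (phi_j - mu)
    lam, alpha = np.linalg.eigh(Omega_c)         # eigenvalues in ascending order
    return lam[::-1][:n_comp], alpha[:, ::-1][:, :n_comp]
```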

Kernel PCA: model representations

Primal and dual model representations of the same model M:

$$(P): \quad \hat{e}_* = \sum_j w_j \varphi_j(x_*) + b = w^T \varphi(x_*) + b$$

$$(D): \quad \hat{e}_* = \sum_i \alpha_i K(x_*, x_i) + b$$

which can be evaluated at any point $x_* \in \mathbb{R}^d$, where $K(x_*, x_i) = \varphi(x_*)^T \varphi(x_i)$, with $K(\cdot, \cdot)$ a positive definite kernel and feature map $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^{n_h}$.
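Evaluating the dual representation at new points is then a one-liner on top of the sketch above (again illustrative, reusing the hypothetical `rbf_kernel` helper):

```python
def project(Xstar, X, alpha, b, sigma2):
    # dual model representation: e_hat(x*) = sum_i alpha_i K(x*, x_i) + b
    return rbf_kernel(Xstar, X, sigma2) @ alpha + b
```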

Sparse and robust versions

• Iteratively weighted L2 loss (to reduce the influence of outliers):

$$\min_{w,b,e} \; -\frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} v_i e_i^2 \quad \text{subject to} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$

• Other loss functions, e.g. the Huber loss for robustness or the ε-insensitive loss for sparsity:

$$\min_{w,b,e} \; -\frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} L(e_i) \quad \text{subject to} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$

[Alzate & Suykens, IEEE-TNN, 2008]

Robustness: Kernel Component Analysis

[Figure: original image, corrupted image, KPCA reconstruction and KCA reconstruction]

Weighted LS-SVM: robustness and sparsity [Alzate & Suykens, IEEE-TNN 2008]

Overview

• Introduction
• Kernel PCA: primal and dual model representations
• Spectral clustering
• Kernel spectral clustering
• Model selection
• Sparsity
• Incorporating prior knowledge

Spectral graph clustering

Minimal cut: given the graph G = (V, E), find clusters A1, A2 from

$$\min_{q_i \in \{-1,+1\}} \; \frac{1}{2} \sum_{i,j} w_{ij} (q_i - q_j)^2$$

with cluster membership indicator $q_i$ ($q_i = 1$ if $i \in A_1$, $q_i = -1$ if $i \in A_2$) and $W = [w_{ij}]$ the weighted adjacency matrix.

[Figure: 6-node example graph, comparing a cut of size 2 with the minimal cut of size 1]

Spectral graph clustering

• Min-cut spectral clustering problem:

$$\min_{\tilde{q}^T \tilde{q} = 1} \; \tilde{q}^T L \tilde{q}$$

with $L = D - W$ the unnormalized graph Laplacian and degree matrix $D = \text{diag}(d_1, \dots, d_N)$, $d_i = \sum_j w_{ij}$, giving the eigenvalue problem $L \tilde{q} = \lambda \tilde{q}$. Cluster membership indicators: $\hat{q}_i = \text{sign}(\tilde{q}_i - \theta)$ with threshold $\theta$.

• Normalized cut: $L \tilde{q} = \lambda D \tilde{q}$
[Fiedler, 1973; Shi & Malik, 2000; Ng et al., 2002; Chung, 1997; von Luxburg, 2007]

• Discrete version to continuous problem (Laplace operator)
[Belkin & Niyogi, 2003; von Luxburg et al., 2008; Smale & Zhou, 2007]
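The relaxed min-cut above amounts to an eigenvalue problem on the Laplacian; a minimal numpy sketch of the two-cluster case (illustrative only; `spectral_bipartition` is a hypothetical name):

```python
import numpy as np

def spectral_bipartition(W, normalized=False, theta=0.0):
    # W: symmetric weighted adjacency matrix
    d = W.sum(axis=1)
    L = np.diag(d) - W                        # unnormalized Laplacian L = D - W
    if normalized:
        # solve L q = lambda D q via the symmetric form D^{-1/2} L D^{-1/2};
        # D^{-1/2} has a positive diagonal, so the signs are unchanged
        Dm = np.diag(1.0 / np.sqrt(d))
        L = Dm @ L @ Dm
    lam, Q = np.linalg.eigh(L)
    q = Q[:, 1]                               # Fiedler vector (2nd smallest eigenvalue)
    return np.where(q > theta, 1, -1)         # cluster indicators q_hat
```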

Spectral clustering + K-means

[Figure: spectral clustering combined with K-means on the eigenvector coordinates]

Overview

• Introduction
• Kernel PCA: primal and dual model representations
• Spectral clustering
• Kernel spectral clustering
• Model selection
• Sparsity
• Incorporating prior knowledge

Kernel spectral clustering: case of two clusters

• Underlying model (primal representation): $\hat{e}_* = w^T \varphi(x_*) + b$, with $\hat{q}_* = \text{sign}[\hat{e}_*]$ the estimated cluster indicator at any $x_* \in \mathbb{R}^d$.

• Primal problem: training on given data $\{x_i\}_{i=1}^{N}$:

$$\min_{w,b,e} \; -\frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} v_i e_i^2 \quad \text{subject to} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$

with positive weights $v_i$ (related below to the inverse degree matrix).
[Alzate & Suykens, IEEE-PAMI, 2010]

Lagrangian and conditions for optimality

• Lagrangian:

$$\mathcal{L}(w, b, e; \alpha) = -\frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} v_i e_i^2 - \sum_{i=1}^{N} \alpha_i (e_i - w^T \varphi(x_i) - b)$$

• Conditions for optimality:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i \varphi(x_i), \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i = 0,$$

$$\frac{\partial \mathcal{L}}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \gamma v_i e_i, \qquad \frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \;\Rightarrow\; e_i = w^T \varphi(x_i) + b, \quad i = 1, \dots, N$$

• Eliminate $w, b, e$ and write the solution in terms of $\alpha$.

Kernel-based model representation

• Dual problem:

$$V M_V \Omega \alpha = \lambda \alpha \quad \text{with } \lambda = 1/\gamma,$$

where $M_V = I_N - \frac{1}{1_N^T V 1_N} 1_N 1_N^T V$ is a weighted centering matrix and $\Omega = [\Omega_{ij}]$ the kernel matrix with $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$.

• Dual model representation:

$$\hat{e}_* = \sum_{i=1}^{N} \alpha_i K(x_i, x_*) + b$$

with $K(x_i, x_*) = \varphi(x_i)^T \varphi(x_*)$.

Choice of weights v_i

• Take $V = D^{-1}$, where $D = \text{diag}\{d_1, \dots, d_N\}$ and $d_i = \sum_{j=1}^{N} \Omega_{ij}$.

• This gives the generalized eigenvalue problem

$$M_D \Omega \alpha = \lambda D \alpha \quad \text{with} \quad M_D = I_N - \frac{1}{1_N^T D^{-1} 1_N} 1_N 1_N^T D^{-1},$$

a modified version of random-walk spectral clustering.

• Note that sign[e_i] = sign[α_i] on the training data, but sign[e_*] applies beyond the training data.
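Solving this generalized eigenvalue problem is the whole training step; a minimal scipy sketch (illustrative, not the authors' code; `ksc_train` is a hypothetical name):

```python
import numpy as np
from scipy.linalg import eig

def ksc_train(Omega, k):
    # solve M_D Omega alpha = lambda D alpha for the top k-1 eigenvectors
    N = Omega.shape[0]
    d = Omega.sum(axis=1)                    # degrees d_i = sum_j Omega_ij
    Dinv = 1.0 / d
    # M_D = I - 1_N 1_N^T D^{-1} / (1_N^T D^{-1} 1_N)
    MD = np.eye(N) - np.outer(np.ones(N), Dinv) / Dinv.sum()
    lam, A = eig(MD @ Omega, np.diag(d))     # generalized eigenproblem
    order = np.argsort(-lam.real)            # sort by decreasing eigenvalue
    return A[:, order[:k - 1]].real          # alpha^(l), l = 1, ..., k-1
```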

Kernel spectral clustering: more clusters

• Case of k clusters: additional sets of constraints

$$\min_{w^{(l)}, e^{(l)}, b_l} \; -\frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} + \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l \, e^{(l)T} D^{-1} e^{(l)}$$

$$\text{subject to} \quad e^{(l)} = \Phi_{N \times n_h} w^{(l)} + b_l 1_N, \quad l = 1, \dots, k-1,$$

where $e^{(l)} = [e_1^{(l)}; \dots; e_N^{(l)}]$ and $\Phi_{N \times n_h} = [\varphi(x_1)^T; \dots; \varphi(x_N)^T] \in \mathbb{R}^{N \times n_h}$.

• Dual problem: $M_D \Omega \alpha^{(l)} = \lambda D \alpha^{(l)}$, $l = 1, \dots, k-1$.
[Alzate & Suykens, IEEE-PAMI, 2010]

Primal and dual model representations

k clusters, k − 1 sets of constraints (index l = 1, ..., k − 1):

$$(P): \quad \text{sign}[\hat{e}^{(l)}_*] = \text{sign}[w^{(l)T} \varphi(x_*) + b_l]$$

$$(D): \quad \text{sign}[\hat{e}^{(l)}_*] = \text{sign}\Big[\sum_j \alpha_j^{(l)} K(x_*, x_j) + b_l\Big]$$

Note: additional sets of constraints also appear in multi-class and vector-valued output LS-SVMs [Suykens et al., 1999].
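Out-of-sample cluster assignment then reduces to computing the sign pattern of the k − 1 score variables and decoding it against a codebook (e.g. the k most frequent training codewords); a sketch under those assumptions, with hypothetical names:

```python
import numpy as np

def ksc_assign(Ktest, alphas, b, codebook):
    # Ktest: (N_test, N) kernel evaluations against the training points
    # alphas: (N, k-1) eigenvectors, b: (k-1,) biases, codebook: (k, k-1) sign codewords
    E = Ktest @ alphas + b                   # e^(l)(x*) = sum_j alpha_j^(l) K(x*, x_j) + b_l
    codes = np.sign(E)                       # one +/-1 codeword per test point
    ham = (codes[:, None, :] != codebook[None, :, :]).sum(-1)
    return ham.argmin(axis=1)                # nearest codeword by Hamming distance
```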

Out-of-sample extension and coding

[Figure: two scatter plots in the (x(1), x(2)) plane, showing the training clusters and the out-of-sample cluster assignment of new points]

Overview

• Introduction
• Kernel PCA: primal and dual model representations
• Spectral clustering
• Kernel spectral clustering
• Model selection
• Sparsity
• Incorporating prior knowledge

Piecewise constant eigenvectors and extension

• Definition [Meila & Shi, 2001]. A vector $\alpha$ is called piecewise constant relative to a partition $(A_1, \dots, A_k)$ iff $\alpha_i = \alpha_j \;\forall x_i, x_j \in A_p$, $p = 1, \dots, k$.

• Proposition [Alzate & Suykens, 2010]. Assume
(i) a training set $D = \{x_i\}_{i=1}^{N}$ and a validation set $D^v = \{x_m^v\}_{m=1}^{N_v}$, i.i.d. sampled from the same underlying distribution;
(ii) a set of k clusters $\{A_1, \dots, A_k\}$ with $k > 2$;
(iii) an isotropic kernel function such that $K(x, z) = 0$ when $x$ and $z$ belong to different clusters;
(iv) eigenvectors $\alpha^{(l)}$, $l = 1, \dots, k-1$, that are piecewise constant.

Then validation set points belonging to the same cluster are collinear in the $(k-1)$-dimensional subspace spanned by the columns of $E^v \in \mathbb{R}^{N_v \times (k-1)}$, where

$$E^v_{ml} = e_m^{(l)} = \sum_{i=1}^{N} \alpha_i^{(l)} K(x_i, x_m^v) + b_l.$$

Piecewise constant eigenvectors and extension

• Key aspect of the proof: for $x_*$ in cluster $A_p$, one has

$$e_*^{(l)} = \sum_{i=1}^{N} \alpha_i^{(l)} K(x_i, x_*) + b_l = c_p^{(l)} \sum_{i \in A_p} K(x_i, x_*) + \sum_{i \notin A_p} \alpha_i^{(l)} K(x_i, x_*) + b_l = c_p^{(l)} \sum_{i \in A_p} K(x_i, x_*) + b_l,$$

since $K(x_i, x_*) = 0$ for $i \notin A_p$ by assumption (iii).

• Model selection to determine the kernel parameters and k: look for line structures in the space $(e_i^{(1)}, e_i^{(2)}, \dots, e_i^{(k-1)})$ evaluated on validation data (aiming for good generalization).

• Choice of kernel: Gaussian RBF kernel; $\chi^2$-kernel for images.
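The slides do not spell out the BLF (balanced line fit) criterion itself; as a crude, purely illustrative stand-in, one can score how collinear each tentative validation cluster is in the (k − 1)-dimensional score space (assumes k > 2 and clusters with several points; `line_structure` is a hypothetical name):

```python
import numpy as np

def line_structure(E, labels):
    # E: (N_v, k-1) validation score matrix, labels: tentative cluster indices
    ratios = []
    for c in np.unique(labels):
        P = E[labels == c]
        lam = np.linalg.eigvalsh(np.cov(P.T))   # cluster covariance spectrum
        ratios.append(lam[-1] / lam.sum())      # -> 1 when the cluster lies on a line
    return float(np.mean(ratios))
```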

Model selection (looking for lines): toy problem 1

[Figure: validation scores (e_i,val^(1), e_i,val^(2)) next to the data in the (x(1), x(2)) plane, for σ² = 0.5 (BLF = 0.56) and σ² = 0.16 (BLF = 1.0, clear line structure); left: validation set, right: train + validation + test data]

Model selection (looking for lines): toy problem 2

[Figure: validation scores (e_i,val^(1), e_i,val^(2)) next to 3-D data (x(1), x(2), x(3)), for σ² = 1.20 (BLF = 0.49) and σ² = 0.003 (BLF = 1.0, clear line structure); left: validation set, right: train + validation + test data]

Example: image segmentation (looking for lines)

[Figure: validation scores (e_i,val^(1), e_i,val^(2), e_i,val^(3)) exhibiting line structures]

[Figure: segmentation results on Berkeley images (IDs 145086, 42049, 167062, 147091, 196073, 62096, 119082, 3096, 295087, 37073): original image, proposed method, Nyström method and human segmentation]

Example: power grid networks - identifying customer profiles

Power load: 245 substations, hourly data (5 years), d = 43,824
Periodic AR modelling: dimensionality reduction 43,824 → 24
k-means clustering applied after dimensionality reduction

[Figure: normalized daily load profiles (hours 1-24) of the clusters obtained with k-means after dimensionality reduction]

Clustering time-series: kernel spectral clustering

Application of kernel spectral clustering, directly on d = 43,824
Model selection on the kernel parameter and the number of clusters
[Alzate, Espinoza, De Moor, Suykens, 2009]

[Figure: normalized daily load profiles of the detected clusters]

Electricity load: 245 substations in the Belgian grid (1/2 training, 1/2 validation)
x_i ∈ R^43,824: spectral clustering on high-dimensional data (5 years)
3 of the 7 detected clusters:
- 1: residential profile: morning and evening peaks
- 2: business profile: peaked around noon
- 3: industrial profile: increasing morning, oscillating afternoon and evening

"Out-of-sample eigenvectors"

• From the conditions for optimality, an eigenvector $\alpha$ satisfies $\alpha = \gamma D^{-1} e$ and $1_N^T \alpha = 0$.

• By defining $\deg(x) = \sum_{j=1}^{N} K(x, x_j)$, the notion of eigenvector is extended to a validation set as follows:

$$\alpha^{\text{val}} = \frac{[I - \frac{1}{N_v} 1_{N_v} 1_{N_v}^T] \, \gamma D_{\text{val}}^{-1} e^{\text{val}}}{\big\| [I - \frac{1}{N_v} 1_{N_v} 1_{N_v}^T] \, \gamma D_{\text{val}}^{-1} e^{\text{val}} \big\|_2},$$

satisfying $\|\alpha^{\text{val}}\|_2 = 1$ and $1_{N_v}^T \alpha^{\text{val}} = 0$, where $N_v$ denotes the validation set size.
[Alzate & Suykens, IJCNN 2011]
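This formula transcribes directly into code; a minimal sketch (illustrative; `out_of_sample_alpha` and `Kval` are hypothetical names, with `Kval` holding kernel evaluations of validation points against the training set):

```python
import numpy as np

def out_of_sample_alpha(Kval, e_val, gamma=1.0):
    # alpha_val: centered, normalized gamma * D_val^{-1} e_val
    d_val = Kval.sum(axis=1)              # deg(x) = sum_j K(x, x_j)
    v = gamma * e_val / d_val             # gamma * D_val^{-1} e_val
    v = v - v.mean()                      # apply [I - (1/N_v) 1 1^T]
    return v / np.linalg.norm(v)          # enforce ||alpha_val||_2 = 1
```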

Model selection (looking for dots): toy problem 2

[Figure: out-of-sample eigenvector coordinates (α_i,val^(1), α_i,val^(2)) next to the 3-D data; badly tuned panel annotated σ = 0.003, Fisher = 1.0; well tuned panel annotated σ = 1.20, Fisher = 0.12]

Example: image segmentation (looking for dots)

[Figure: Fisher criterion versus number of clusters k (2 to 14), and the corresponding out-of-sample eigenvector coordinates (α_i,val^(1), α_i,val^(2))]

Overview

• Introduction
• Kernel PCA: primal and dual model representations
• Spectral clustering
• Kernel spectral clustering
• Model selection
• Sparsity
• Incorporating prior knowledge

Kernel spectral clustering: sparse kernel models

[Figure: original image, binary clustering and sparse kernel model]

Incomplete Cholesky decomposition: $\|\Omega - G G^T\|_2 \leq \eta$ with $G \in \mathbb{R}^{N \times R}$ and $R \ll N$

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV

$$e_*^{(l)} = \sum_{i \in S_{SV}} \alpha_i^{(l)} K(x_i, x_*) + b_l$$
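A standard greedy pivoted incomplete Cholesky routine of the kind referred to here can be sketched as follows (not the authors' implementation; it stops when the trace of the residual, which upper-bounds the spectral norm of a positive semidefinite matrix, drops below η):

```python
import numpy as np

def incomplete_cholesky(kfun, X, eta):
    # approximate the kernel matrix: Omega ~ G G^T with few columns
    N = len(X)
    diag = np.array([kfun(X[i], X[i]) for i in range(N)])   # residual diagonal
    cols, pivots = [], []
    while diag.sum() > eta:
        j = int(np.argmax(diag))                 # pivot = largest residual entry
        pivots.append(j)
        col = np.array([kfun(X[i], X[j]) for i in range(N)])
        for g in cols:                           # subtract already-built columns
            col = col - g[j] * g
        g_new = col / np.sqrt(diag[j])
        cols.append(g_new)
        diag = diag - g_new ** 2
    return np.column_stack(cols), pivots         # G in R^{N x R}, R << N
```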

Highly sparse kernel models on images

• Application on images: $x_i \in \mathbb{R}^3$ (r, g, b values per pixel), $i = 1, \dots, N$, pre-processed into $z_i \in \mathbb{R}^8$ (quantization to 8 colors); a $\chi^2$-kernel compares two local color histograms (5 × 5 pixel windows)

• N > 100,000: select a subset of size M ≪ N based on quadratic Renyi entropy, as in the fixed-size method [Suykens et al., 2002] (see the sketch below)

• Highly sparse representations: #SV = 3k

• Completion of the cluster indicators based on the out-of-sample extension

$$\text{sign}[\hat{e}_*^{(l)}] = \text{sign}\Big[\sum_{j \in S_{SV}} \alpha_j^{(l)} K(x_*, x_j) + b_l\Big],$$

applied to the full image. [Alzate & Suykens, Neurocomputing 2011]
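A minimal sketch of entropy-based subset selection in the spirit of the fixed-size method (the random-swap loop, the iteration count and the reuse of the earlier hypothetical `rbf_kernel` helper are illustrative choices, not the published procedure):

```python
import numpy as np

def renyi_subset(X, M, sigma2, n_swaps=2000, seed=0):
    # keep a working set of M points; accept random swaps that increase the
    # quadratic Renyi entropy estimate H = -log( mean_ij K(x_i, x_j) )
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)

    def entropy(ix):
        return -np.log(rbf_kernel(X[ix], X[ix], sigma2).mean())

    h = entropy(idx)
    for _ in range(n_swaps):
        cand = idx.copy()
        cand[rng.integers(M)] = rng.integers(len(X))
        if len(np.unique(cand)) < M:
            continue                             # skip swaps that duplicate a point
        h_new = entropy(cand)
        if h_new > h:
            idx, h = cand, h_new
    return idx
```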

Highly sparse kernel models: toy example 1

[Figure: three clusters in the (x(1), x(2)) plane and their images in the (e^(1), e^(2)) score space; only 3k = 9 support vectors]

Highly sparse kernel models: toy example 2

[Figure: four clusters in the (x(1), x(2)) plane and their images in the (ê^(1), ê^(2), ê^(3)) score space; only 3k = 12 support vectors]

Highly sparse kernel models: image segmentation

[Figure: pixels mapped into the (e^(1), e^(2), e^(3)) score space, with the support vectors marked by asterisks; only 3k = 12 support vectors]

Overview

• Introduction
• Kernel PCA: primal and dual model representations
• Spectral clustering
• Kernel spectral clustering
• Model selection
• Sparsity
• Incorporating prior knowledge

Kernel spectral clustering: adding prior knowledge

• Pair of points $x_\dagger, x_\ddagger$: $c = 1$ encodes a must-link constraint, $c = -1$ a cannot-link constraint.

• Primal problem [Alzate & Suykens, IJCNN 2009]:

$$\min_{w^{(l)}, e^{(l)}, b_l} \; -\frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} + \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l \, e^{(l)T} D^{-1} e^{(l)}$$

$$\text{subject to} \quad e^{(l)} = \Phi_{N \times n_h} w^{(l)} + b_l 1_N, \quad w^{(l)T} \varphi(x_\dagger) = c \, w^{(l)T} \varphi(x_\ddagger), \quad l = 1, \dots, k-1$$

• Dual problem: yields a rank-one downdate of the kernel matrix

Kernel spectral clustering: example

[Figure: original image segmented without constraints and with must-link/cannot-link constraints]

Conclusions

• Spectral clustering within a kernel-based learning framework
• Training problem: characterization in terms of a primal and a dual problem
• Out-of-sample extensions: primal and dual model representations
• The desirable piecewise constant property extends to the validation level
• New model selection criteria (learning and generalization aspects)
• (Highly) sparse kernel models
• Suitable for adding prior knowledge through constraints

References

Downloadable from www.esat.kuleuven.be/sista/lssvmlab/

Alzate C., Suykens J.A.K., "Multiway Spectral Clustering with Out-of-Sample Extensions through Weighted Kernel PCA", IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(2), 335-347, 2010.

Alzate C., Suykens J.A.K., "Sparse Kernel Spectral Clustering Models for Large-Scale Data Analysis", Neurocomputing, 74(9), 1382-1390, 2011.

Alzate C., Suykens J.A.K., "A Regularized Formulation for Spectral Clustering with Pairwise Constraints", International Joint Conference on Neural Networks (IJCNN 2009), Atlanta, USA, 2009, 141-148.

Alzate C., Suykens J.A.K., "Out-of-Sample Eigenvectors in Kernel Spectral Clustering", to appear in International Joint Conference on Neural Networks (IJCNN 2011).

Suykens J.A.K., "Data Visualization and Dimensionality Reduction using Kernel Maps with a Reference Point", IEEE Transactions on Neural Networks, 19(9), 1501-1517, 2008.

Suykens J.A.K., Alzate C., Pelckmans K., "Primal and dual model representations in kernel-based learning", Statistics Surveys, 4, 148-183, 2010.