Convex Approaches to Model Selection

K. PELCKMANS, J.A.K. SUYKENS
NIPS Workshop on Multi-level Inference, December 2006

K.U.Leuven - Department of Electrical Engineering - SCD/SISTA
Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
[email protected]
Overview

• Ridge Regression, Smoothing Splines, Regularization Networks, LS-SVMs: the solution u* satisfies

  (\Omega + \gamma I_d)\, u = v, \qquad \Omega \in \mathbb{R}^{d \times d}, \quad \gamma > 0

• Solution path when varying γ
• Convex hull of the solution path
• Tuning = learning the optimal element in the solution path

Why convex?

• (Practice) Well-developed algorithms
• (Convexity) Reproducibility and analyzability
• (Complexity) Complexity of the convex hull
• (Extensions) Learning more tuning parameters
• (Approach to the global minimum) Projecting back onto the original path

Pelckmans et al., A Convex Approach to Validation-based Learning of the Regularization Constant.
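As a concrete illustration of the path described above, the sketch below (a minimal numpy example; the matrix Ω, the vector v, and all sizes are illustrative assumptions, not from the slides) traces points on the one-dimensional solution path by solving (Ω + γI)u = v over a grid of γ values:

```python
# Minimal sketch: trace the ridge solution path on a grid of gamma values.
# Omega, v, and the sizes are illustrative assumptions, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T                     # symmetric PSD, playing the role of K or X^T X
v = rng.standard_normal(d)

# Each gamma gives one point u_gamma on the (non-convex) path S(gamma, u | Omega, v).
gammas = np.logspace(-3, 3, 25)
path = np.array([np.linalg.solve(Omega + g * np.eye(d), v) for g in gammas])
print(path.shape)                   # (25, 5): 25 points on a curve in R^5
```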
Ridge solution set

• Solution set ('regularization path'):

  S(\gamma, u \mid \Omega, v) = \left\{ u_\gamma \in \mathbb{R}^N \;\middle|\; \exists\, 0 < \gamma < +\infty \text{ s.t. } (\Omega + \gamma I_N)\, u_\gamma = v \right\}

• Let U \Sigma U^T = \Omega denote the SVD of the matrix \Omega, with U U^T = U^T U = I_N and \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_N) containing all ordered positive eigenvalues, such that \sigma_1 \geq \dots \geq \sigma_N.
• Rewrite the KKT conditions as

  \left( U \Sigma U^T + \gamma I_N \right) \alpha = Y \quad \Longleftrightarrow \quad U_i^T \alpha = \frac{1}{\sigma_i + \gamma}\, U_i^T Y \quad \forall i.
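The eigen-decomposition view can be verified numerically; a minimal sketch (random data, purely illustrative) checking that each eigen-coordinate of the solution is scaled independently by 1/(σ_i + γ):

```python
# Sketch: with Omega = U diag(sigma) U^T, the path decouples per eigen-direction:
#   U_i^T u_gamma = U_i^T v / (sigma_i + gamma).
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T
v = rng.standard_normal(d)

sigma, U = np.linalg.eigh(Omega)    # note: eigh returns eigenvalues in ascending order
gamma = 0.7
u_direct = np.linalg.solve(Omega + gamma * np.eye(d), v)
u_eig = U @ ((U.T @ v) / (sigma + gamma))
print(np.allclose(u_direct, u_eig)) # True: both parametrizations agree
```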
Ridge solution set (Ct'd)

[Figure: the ordered eigenvalue spectrum σ, the shifted spectrum σ + γ, and the inverse spectrum 1/(σ + γ).]

  \forall \gamma : \quad \sigma \mapsto \frac{1}{\sigma + \gamma} \text{ is monotonically decreasing in } \sigma,

so the ordering of the eigenvalues fixes the ordering of the coefficients 1/(σ_i + γ).
Ridge solution set (Ct'd)

Convex relaxation:

  S'(\Lambda, u \mid \Omega, v) :
  \begin{cases}
  U_i^T u = \lambda_i \, U_i^T v & \forall i = 1, \dots, N \\
  0 < \lambda_i < \dfrac{1}{\sigma'_i} & \forall i = 1, \dots, N \\
  \dfrac{\sigma'_k}{\sigma'_i} \, \lambda_k \leq \lambda_i < \lambda_k & \forall \sigma'_i > \sigma'_k \\
  \lambda_k = \lambda_i & \forall \sigma'_k = \sigma'_i
  \end{cases}

→ Searching in a convex set!
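To make the relaxed set concrete, here is a sketch of its constraints, assuming the cvxpy library and random illustrative data (neither is prescribed by the slides); the strict inequalities are approximated by a small margin and by non-strict counterparts:

```python
# Sketch of S'(Lambda, u | Omega, v): the shared scalar 1/(sigma_i + gamma)
# becomes a vector variable lam, tied together by box, ordering, and ratio
# constraints. Strict inequalities are relaxed (eps margin / non-strict <=).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T
v = rng.standard_normal(d)
sigma, U = np.linalg.eigh(Omega)

lam = cp.Variable(d)
eps = 1e-8
cons = [lam >= eps, lam <= 1.0 / sigma]         # 0 < lam_i < 1/sigma_i
for i in range(d):
    for k in range(d):
        if sigma[i] > sigma[k]:                 # for sigma_i > sigma_k:
            cons += [lam[i] <= lam[k],          #   lam_i < lam_k (relaxed to <=)
                     lam[i] >= (sigma[k] / sigma[i]) * lam[k]]  # ratio lower bound
# u = U diag(lam) U^T v is affine in lam, so any convex loss of u can be
# minimized over this polyhedral set, e.g. the (arbitrary) objective ||u||^2:
u = U @ cp.multiply(lam, U.T @ v)
prob = cp.Problem(cp.Minimize(cp.sum_squares(u)), cons)
prob.solve()
print(prob.status)
```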
Ridge solution set (Ct'd)

Main result:

Proposition 1. [Maximal distance of relaxation] The maximal distance from an element in S'(\Gamma, u \mid \Omega, v) to its closest counterpart in the non-convex S(\gamma, u \mid \Omega, v) can be bounded in terms of the maximal range of the inverse eigenvalue spectrum:

  \forall \Gamma \in \mathbb{R}^N : \quad \min_{\hat\gamma} \left\| U \Gamma^{-1} U^T v - (\Omega + \hat\gamma I_N)^{-1} v \right\|_2 \;\leq\; \|v\|_2 \, \max_{i > k} \left| \frac{1}{\sigma'_i} - \frac{1}{\sigma'_k} \right|

Proposition 2. [Smoothness of the Ridge Solution Set] The solution set S(\gamma, u \mid \Omega, v) is Lipschitz smooth when \sigma_N > 0.
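The bound can be probed numerically; the sketch below is an empirical check of the reconstruction above, not a proof. A convex combination of two points on the original path is feasible for the relaxation, since all of its constraints are linear in λ:

```python
# Sketch: compare the projection gap of a relaxed point onto the original path
# with the bound ||v||_2 * max_{i>k} |1/sigma_i - 1/sigma_k| of Proposition 1.
import numpy as np

rng = np.random.default_rng(3)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T
v = rng.standard_normal(d)
sigma, U = np.linalg.eigh(Omega)

lam = 0.5 / (sigma + 0.1) + 0.5 / (sigma + 10.0)   # midpoint of two path points
u_relaxed = U @ (lam * (U.T @ v))

gap = min(np.linalg.norm(u_relaxed - np.linalg.solve(Omega + g * np.eye(d), v))
          for g in np.logspace(-3, 3, 400))        # distance to closest path point
bound = np.linalg.norm(v) * (1.0 / sigma.min() - 1.0 / sigma.max())
print(gap <= bound)                                # expected: True
```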
Ridge solution set (Ct'd)

Corollary 1. [Modified Ridge Regression yielding a Convex Solution Path] The convex relaxation constitutes the solution path of the modified ridge regression problem

  \hat w = \arg\min_w J_\Gamma(w) = \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \frac{1}{2}\, w^T U \Gamma U^T w

where \Gamma = \mathrm{diag}(\gamma_1, \dots, \gamma_D) and \gamma_d satisfies \gamma_d = \frac{1}{\lambda_d} - \sigma_d for all d = 1, \dots, D, and the following inequalities hold:

  \begin{cases}
  \gamma_d > 0 & \forall d = 1, \dots, D \\
  \sigma_d + \gamma_d < \sigma_g + \gamma_g \leq \dfrac{\sigma_g}{\sigma_d} \left( \sigma_d + \gamma_d \right) & \forall \sigma_g > \sigma_d \\
  \gamma_d = \gamma_g & \forall \sigma_d = \sigma_g
  \end{cases}
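A tiny numeric check of the correspondence γ_d = 1/λ_d − σ_d (the eigenvalues below are illustrative): on the original path, every direction maps back to the single shared γ:

```python
# Sketch: recover the per-direction ridge Gamma = diag(gamma_1, ..., gamma_D)
# from a feasible Lambda via gamma_d = 1/lambda_d - sigma_d (Corollary 1).
import numpy as np

sigma = np.array([4.0, 2.0, 1.0, 0.5])   # illustrative eigenvalues
gamma = 0.7                              # a point on the original path ...
lam = 1.0 / (sigma + gamma)              # ... is feasible for the relaxation
gamma_d = 1.0 / lam - sigma              # per-direction regularization constants
print(np.allclose(gamma_d, gamma))       # True: all gamma_d collapse to gamma
```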
Tuning γ in Ridge Regression

Data \{(x_i, y_i)\}_{i=1}^{n} and \{(x^v_j, y^v_j)\}_{j=1}^{n_v} i.i.d. from F_{XY}. Model f(x) = w^T x, training problem:

  \hat w = \arg\min_w J_\gamma(w) = \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \frac{\gamma}{2}\, w^T w

Normal equations:

  \mathrm{KKT}(w \mid \gamma, \mathcal{D}) : \quad \left( X^T X + \gamma I_D \right) w = X^T Y

Tuning the regularization constant:

  (\hat w, \hat\gamma) = \arg\min_{w, \gamma > 0} \sum_{j=1}^{n_v} \ell\left( w^T x^v_j - y^v_j \right) \quad \text{s.t.} \quad \mathrm{KKT}(w \mid \gamma, \mathcal{D}) \leftrightarrow S(\gamma, w \mid \mathcal{D})

Convex relaxation:

  (\hat w, \hat\Lambda) = \arg\min_{w, \Lambda} \sum_{j=1}^{n_v} \ell\left( w^T x^v_j - y^v_j \right) \quad \text{s.t.} \quad S'(\Lambda, w \mid \mathcal{D})

Pelckmans et al., Additive Regularization Trade-off: Fusion of Training and Validation Levels in Kernel Methods.
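The fused problem can be written out as one convex program. A sketch assuming cvxpy, a squared loss for ℓ, and synthetic data (the slides leave the loss ℓ and the data generic):

```python
# Sketch: minimize the validation loss over (w, Lambda) subject to the relaxed
# set S'(Lambda, w | D); training and tuning are fused into one convex program.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, nv, D = 50, 20, 5
X, Xv = rng.standard_normal((n, D)), rng.standard_normal((nv, D))
w_true = rng.standard_normal(D)
Y = X @ w_true + 0.1 * rng.standard_normal(n)
Yv = Xv @ w_true + 0.1 * rng.standard_normal(nv)

sigma, U = np.linalg.eigh(X.T @ X)               # spectrum of X^T X
lam = cp.Variable(D)
w = U @ cp.multiply(lam, U.T @ (X.T @ Y))        # U_i^T w = lam_i U_i^T X^T Y
eps = 1e-8
cons = [lam >= eps, lam <= 1.0 / sigma]
for i in range(D):
    for k in range(D):
        if sigma[i] > sigma[k]:                  # ordering + ratio constraints
            cons += [lam[i] <= lam[k],
                     lam[i] >= (sigma[k] / sigma[i]) * lam[k]]
prob = cp.Problem(cp.Minimize(cp.sum_squares(Xv @ w - Yv)), cons)
prob.solve()
print(prob.status)                               # 'optimal' on this toy problem
```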
Tuning γ > 0 in Least Squares Support Vector Machines

• Predictive model f(x) = w^T \varphi(x) with feature map \varphi : \mathbb{R}^d \to \mathbb{R}^D
• Training problem:

  \min_{w, e} J_\gamma(w, e) = \sum_{i=1}^{n} e_i^2 + \frac{\gamma}{2}\, w^T w \quad \text{s.t.} \quad w^T \varphi(x_i) + e_i = y_i, \; \forall i

• Normal equations:

  \mathrm{KKT}(\alpha \mid \gamma, \mathcal{D}) : \quad (K + \gamma I_n)\, \alpha = Y

• Optimal prediction model \hat f(x) = \hat w^T \varphi(x) = \sum_{i=1}^{n} \hat\alpha_i K(x_i, x)
• Tuning the regularization constant:

  (\hat\alpha, \hat\gamma) = \arg\min_{\alpha, \gamma > 0} \sum_{j=1}^{n_v} \ell\left( \sum_{i=1}^{n} \alpha_i K(x_i, x^v_j) - y^v_j \right) \quad \text{s.t.} \quad S(\gamma, \alpha \mid \mathcal{D})
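For a fixed γ, LS-SVM training reduces to the linear system above. A minimal sketch with an RBF kernel (the kernel, its bandwidth, and the data are illustrative assumptions):

```python
# Sketch: LS-SVM training via the normal equations (K + gamma*I) alpha = Y,
# followed by prediction with f(x) = sum_i alpha_i K(x_i, x).
import numpy as np

def rbf_kernel(A, B, bw=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 bw^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, (40, 1))
Y = np.sinc(X[:, 0]) + 0.05 * rng.standard_normal(40)

gamma = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + gamma * np.eye(len(X)), Y)   # KKT(alpha | gamma, D)

X_test = np.linspace(-3, 3, 7)[:, None]
f_hat = rbf_kernel(X_test, X) @ alpha                    # predictions on a test grid
print(f_hat.round(3))
```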
Tuning the ridge w.r.t. 10-fold CV

L-fold CV → each fold has its own KKT condition:

  \mathrm{KKT}\left( w_{(1)} \mid \gamma, \mathcal{D}_{(1)} \right)
    \vdots
  \mathrm{KKT}\left( w_{(10)} \mid \gamma, \mathcal{D}_{(10)} \right)

...but γ is coupled over the folds! Moreover, each fold comes with its own validation set, thus

  \left( \hat w_{(l)}, \hat\gamma \right) = \arg\min_{w_{(l)}, \gamma > 0} \sum_{l=1}^{L} \frac{1}{n - n_{(l)}} \sum_{(x_i, y_i) \in \mathcal{D}^v_{(l)}} \ell\left( w_{(l)}^T x_i - y_i \right)
  \quad \text{s.t.} \quad \mathrm{KKT}\left( w_{(l)} \mid \gamma, \mathcal{D}_{(l)} \right), \; \forall l = 1, \dots, L
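The coupling is easiest to see in the non-convex baseline: each fold solves its own KKT system, yet a single γ is shared by all of them. A sketch of plain grid-search L-fold CV under these definitions (data, loss, and fold count are illustrative):

```python
# Sketch: every fold l solves KKT(w_(l) | gamma, D_(l)), with one gamma shared
# across folds; a grid search over gamma is the non-convex reference approach.
import numpy as np

rng = np.random.default_rng(6)
n, D, L = 50, 5, 10
X = rng.standard_normal((n, D))
Y = X @ rng.standard_normal(D) + 0.1 * rng.standard_normal(n)
folds = np.array_split(np.arange(n), L)          # validation indices per fold

def cv_loss(gamma):
    total = 0.0
    for val in folds:
        trn = np.setdiff1d(np.arange(n), val)
        w_l = np.linalg.solve(X[trn].T @ X[trn] + gamma * np.eye(D),
                              X[trn].T @ Y[trn])  # fold-wise KKT system
        total += ((X[val] @ w_l - Y[val]) ** 2).sum() / len(val)
    return total / L

gammas = np.logspace(-3, 3, 13)
print(min(gammas, key=cv_loss))                  # best shared gamma on the grid
```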
Tuning the ridge w.r.t. CV

Proposition 3. [Coupling over different folds] Let U_l \Sigma_l U_l^T be the SVD of \Omega_l for all l = 1, \dots, 10. (...) Then the following coupled relaxation is proposed:

  S'_L\left( \Lambda, w_{(l)} \mid \mathcal{D}_{(1)}, \dots, \mathcal{D}_{(L)} \right) :
  \begin{cases}
  U_i^{(l)T} w_{(l)} = \lambda_k \, U_i^{(l)T} X_{(l)}^T Y_{(l)} & \forall k \leftrightarrow (l, i) \\
  0 < \lambda_k < \dfrac{1}{\sigma'_k} & \forall k = 1, \dots \\
  \dfrac{\sigma'_k}{\sigma'_l} \, \lambda_k < \lambda_l < \lambda_k & \forall \sigma'_l > \sigma'_k \\
  \lambda_k = \lambda_l & \forall \sigma'_k = \sigma'_l
  \end{cases}

Thus the convex relaxation to tuning the ridge with respect to an L-fold CV criterion amounts to solving

  \min_{w_{(l)}, \Lambda} \sum_{l=1}^{L} \frac{1}{n - n_{(l)}} \sum_{(x^v_i, y^v_i) \in \mathcal{D}^v_{(l)}} \ell\left( w_{(l)}^T x^v_i - y^v_i \right)
  \quad \text{s.t.} \quad S'_L\left( \Lambda, w_{(l)} \mid \mathcal{D}_{(1)}, \dots, \mathcal{D}_{(L)} \right).
Examples

[Figure: results of a comparison between OLS and ridge regression (RR) with D = 10, tuned by CV (steepest descent), by GCV (steepest descent), and by the proposed method fusing training and tuning of the ridge in one convex optimization problem. Panel (a) shows the evolution of performance when the condition number is varied with n = 50 fixed. Panel (b) displays the evolution of performance when the number of examples is varied with Γ(X^T X) = 10^3 fixed. In both cases the proposed convex relaxation performs on par with the steepest-descent-based counterparts, while it significantly outperforms OLS for small n or a high enough condition number Γ(X^T X).]
Conclusions

Message:

• Hyper-parameter tuning as a convex optimization problem
• Setting the stage ...
• Convex hull of the solution path
• Bounded maximal distance between the solution path and its relaxation if σ_N > 0 or γ > 0
• Efficient tuning w.r.t. validation and CV

Outlook:

• Input selection for additive models
• Comparison with gradient descent / Bayesian inference
• Convex relaxation of the SVM solution path