Convex Approaches to Model Selection

K. PELCKMANS, J.A.K. SUYKENS
NIPS Workshop on Multi-level Inference, December 2006

K.U.Leuven - Department of Electrical Engineering - SCD/SISTA
Kasteelpark Arenberg 10, 3001 Heverlee (Leuven), Belgium
[email protected]
Overview

• Ridge Regression, Smoothing Splines, Regularization Networks, LS-SVMs: the solution u* satisfies

  (\Omega + \gamma I_d)\, u = v, \qquad \Omega \in \mathbb{R}^{d \times d}, \quad \gamma > 0

• Solution path when varying γ
• Convex hull of the solution path
• Tuning = learning the optimal element in the solution path

Why convex?

• (Practice) Well-developed algorithms
• (Convexity) Reproducibility and analyzability
• (Complexity) Complexity of the convex hull
• (Extensions) Learning more tuning parameters
• (Approach to the global minimum) Projecting back onto the original path

Pelckmans et al., A Convex Approach to Validation-based Learning of the Regularization Constant.
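As a concrete illustration of the path described above, the sketch below (a minimal numpy example; the matrix Ω, the vector v, and all sizes are illustrative assumptions, not from the slides) traces points on the one-dimensional solution path by solving (Ω + γI)u = v over a grid of γ values:

```python
# Minimal sketch: trace the ridge solution path on a grid of gamma values.
# Omega, v, and the sizes are illustrative assumptions, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T                     # symmetric PSD, playing the role of K or X^T X
v = rng.standard_normal(d)

# Each gamma gives one point u_gamma on the (non-convex) path S(gamma, u | Omega, v).
gammas = np.logspace(-3, 3, 25)
path = np.array([np.linalg.solve(Omega + g * np.eye(d), v) for g in gammas])
print(path.shape)                   # (25, 5): 25 points on a curve in R^5
```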
Ridge solution set

• Solution set ('regularization path'):

  S(\gamma, u \mid \Omega, v) = \left\{ u_\gamma \in \mathbb{R}^N \;\middle|\; \exists\, 0 < \gamma < +\infty \text{ s.t. } (\Omega + \gamma I_N)\, u_\gamma = v \right\}

• Let U \Sigma U^T = \Omega denote the SVD of the matrix \Omega, with U U^T = U^T U = I_N and \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_N) containing all ordered positive eigenvalues, such that \sigma_1 \geq \dots \geq \sigma_N.
• Rewrite the KKT conditions as

  \left( U \Sigma U^T + \gamma I_N \right) \alpha = Y \quad \Longleftrightarrow \quad U_i^T \alpha = \frac{1}{\sigma_i + \gamma}\, U_i^T Y \quad \forall i.
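The eigen-decomposition view can be verified numerically; a minimal sketch (random data, purely illustrative) checking that each eigen-coordinate of the solution is scaled independently by 1/(σ_i + γ):

```python
# Sketch: with Omega = U diag(sigma) U^T, the path decouples per eigen-direction:
#   U_i^T u_gamma = U_i^T v / (sigma_i + gamma).
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T
v = rng.standard_normal(d)

sigma, U = np.linalg.eigh(Omega)    # note: eigh returns eigenvalues in ascending order
gamma = 0.7
u_direct = np.linalg.solve(Omega + gamma * np.eye(d), v)
u_eig = U @ ((U.T @ v) / (sigma + gamma))
print(np.allclose(u_direct, u_eig)) # True: both parametrizations agree
```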
Ridge solution set (Ct'd)

[Figure: the ordered eigenvalue spectrum σ, the shifted spectrum σ + γ, and the inverse spectrum 1/(σ + γ).]

  \forall \gamma : \quad \sigma \mapsto \frac{1}{\sigma + \gamma} \text{ is monotonically decreasing in } \sigma,

so the ordering of the eigenvalues fixes the ordering of the coefficients 1/(σ_i + γ).
Ridge solution set (Ct'd)

Convex relaxation:

  S'(\Lambda, u \mid \Omega, v) :
  \begin{cases}
  U_i^T u = \lambda_i \, U_i^T v & \forall i = 1, \dots, N \\
  0 < \lambda_i < \dfrac{1}{\sigma'_i} & \forall i = 1, \dots, N \\
  \dfrac{\sigma'_k}{\sigma'_i} \, \lambda_k \leq \lambda_i < \lambda_k & \forall \sigma'_i > \sigma'_k \\
  \lambda_k = \lambda_i & \forall \sigma'_k = \sigma'_i
  \end{cases}

→ Searching in a convex set!
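To make the relaxed set concrete, here is a sketch of its constraints, assuming the cvxpy library and random illustrative data (neither is prescribed by the slides); the strict inequalities are approximated by a small margin and by non-strict counterparts:

```python
# Sketch of S'(Lambda, u | Omega, v): the shared scalar 1/(sigma_i + gamma)
# becomes a vector variable lam, tied together by box, ordering, and ratio
# constraints. Strict inequalities are relaxed (eps margin / non-strict <=).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T
v = rng.standard_normal(d)
sigma, U = np.linalg.eigh(Omega)

lam = cp.Variable(d)
eps = 1e-8
cons = [lam >= eps, lam <= 1.0 / sigma]         # 0 < lam_i < 1/sigma_i
for i in range(d):
    for k in range(d):
        if sigma[i] > sigma[k]:                 # for sigma_i > sigma_k:
            cons += [lam[i] <= lam[k],          #   lam_i < lam_k (relaxed to <=)
                     lam[i] >= (sigma[k] / sigma[i]) * lam[k]]  # ratio lower bound
# u = U diag(lam) U^T v is affine in lam, so any convex loss of u can be
# minimized over this polyhedral set, e.g. the (arbitrary) objective ||u||^2:
u = U @ cp.multiply(lam, U.T @ v)
prob = cp.Problem(cp.Minimize(cp.sum_squares(u)), cons)
prob.solve()
print(prob.status)
```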
Ridge solution set (Ct'd)

Main result:

Proposition 1. [Maximal distance of relaxation] The maximal distance from an element in S'(\Gamma, u \mid \Omega, v) to its closest counterpart in the non-convex S(\gamma, u \mid \Omega, v) can be bounded in terms of the maximal range of the inverse eigenvalue spectrum:

  \forall \Gamma \in \mathbb{R}^N : \quad \min_{\hat\gamma} \left\| U \Gamma^{-1} U^T v - (\Omega + \hat\gamma I_N)^{-1} v \right\|_2 \;\leq\; \|v\|_2 \, \max_{i > k} \left| \frac{1}{\sigma'_i} - \frac{1}{\sigma'_k} \right|

Proposition 2. [Smoothness of the Ridge Solution Set] The solution set S(\gamma, u \mid \Omega, v) is Lipschitz smooth when \sigma_N > 0.
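The bound can be probed numerically; the sketch below is an empirical check of the reconstruction above, not a proof. A convex combination of two points on the original path is feasible for the relaxation, since all of its constraints are linear in λ:

```python
# Sketch: compare the projection gap of a relaxed point onto the original path
# with the bound ||v||_2 * max_{i>k} |1/sigma_i - 1/sigma_k| of Proposition 1.
import numpy as np

rng = np.random.default_rng(3)
d = 5
A = rng.standard_normal((d, d))
Omega = A @ A.T
v = rng.standard_normal(d)
sigma, U = np.linalg.eigh(Omega)

lam = 0.5 / (sigma + 0.1) + 0.5 / (sigma + 10.0)   # midpoint of two path points
u_relaxed = U @ (lam * (U.T @ v))

gap = min(np.linalg.norm(u_relaxed - np.linalg.solve(Omega + g * np.eye(d), v))
          for g in np.logspace(-3, 3, 400))        # distance to closest path point
bound = np.linalg.norm(v) * (1.0 / sigma.min() - 1.0 / sigma.max())
print(gap <= bound)                                # expected: True
```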
Ridge solution set (Ct'd)

Corollary 1. [Modified Ridge Regression yielding a Convex Solution Path] The convex relaxation constitutes the solution path of the modified ridge regression problem

  \hat w = \arg\min_w J_\Gamma(w) = \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \frac{1}{2}\, w^T U \Gamma U^T w

where \Gamma = \mathrm{diag}(\gamma_1, \dots, \gamma_D) and \gamma_d satisfies \gamma_d = \frac{1}{\lambda_d} - \sigma_d for all d = 1, \dots, D, and the following inequalities hold:

  \begin{cases}
  \gamma_d > 0 & \forall d = 1, \dots, D \\
  \sigma_d + \gamma_d < \sigma_g + \gamma_g \leq \dfrac{\sigma_g}{\sigma_d} \left( \sigma_d + \gamma_d \right) & \forall \sigma_g > \sigma_d \\
  \gamma_d = \gamma_g & \forall \sigma_d = \sigma_g
  \end{cases}
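A tiny numeric check of the correspondence γ_d = 1/λ_d − σ_d (the eigenvalues below are illustrative): on the original path, every direction maps back to the single shared γ:

```python
# Sketch: recover the per-direction ridge Gamma = diag(gamma_1, ..., gamma_D)
# from a feasible Lambda via gamma_d = 1/lambda_d - sigma_d (Corollary 1).
import numpy as np

sigma = np.array([4.0, 2.0, 1.0, 0.5])   # illustrative eigenvalues
gamma = 0.7                              # a point on the original path ...
lam = 1.0 / (sigma + gamma)              # ... is feasible for the relaxation
gamma_d = 1.0 / lam - sigma              # per-direction regularization constants
print(np.allclose(gamma_d, gamma))       # True: all gamma_d collapse to gamma
```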
Tuning γ in Ridge Regression

Data \{(x_i, y_i)\}_{i=1}^{n} and \{(x^v_j, y^v_j)\}_{j=1}^{n_v} i.i.d. from F_{XY}. Model f(x) = w^T x, training problem:

  \hat w = \arg\min_w J_\gamma(w) = \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \frac{\gamma}{2}\, w^T w

Normal equations:

  \mathrm{KKT}(w \mid \gamma, \mathcal{D}) : \quad \left( X^T X + \gamma I_D \right) w = X^T Y

Tuning the regularization constant:

  (\hat w, \hat\gamma) = \arg\min_{w, \gamma > 0} \sum_{j=1}^{n_v} \ell\left( w^T x^v_j - y^v_j \right) \quad \text{s.t.} \quad \mathrm{KKT}(w \mid \gamma, \mathcal{D}) \leftrightarrow S(\gamma, w \mid \mathcal{D})

Convex relaxation:

  (\hat w, \hat\Lambda) = \arg\min_{w, \Lambda} \sum_{j=1}^{n_v} \ell\left( w^T x^v_j - y^v_j \right) \quad \text{s.t.} \quad S'(\Lambda, w \mid \mathcal{D})

Pelckmans et al., Additive Regularization Trade-off: Fusion of Training and Validation Levels in Kernel Methods.
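The fused problem can be written out as one convex program. A sketch assuming cvxpy, a squared loss for ℓ, and synthetic data (the slides leave the loss ℓ and the data generic):

```python
# Sketch: minimize the validation loss over (w, Lambda) subject to the relaxed
# set S'(Lambda, w | D); training and tuning are fused into one convex program.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, nv, D = 50, 20, 5
X, Xv = rng.standard_normal((n, D)), rng.standard_normal((nv, D))
w_true = rng.standard_normal(D)
Y = X @ w_true + 0.1 * rng.standard_normal(n)
Yv = Xv @ w_true + 0.1 * rng.standard_normal(nv)

sigma, U = np.linalg.eigh(X.T @ X)               # spectrum of X^T X
lam = cp.Variable(D)
w = U @ cp.multiply(lam, U.T @ (X.T @ Y))        # U_i^T w = lam_i U_i^T X^T Y
eps = 1e-8
cons = [lam >= eps, lam <= 1.0 / sigma]
for i in range(D):
    for k in range(D):
        if sigma[i] > sigma[k]:                  # ordering + ratio constraints
            cons += [lam[i] <= lam[k],
                     lam[i] >= (sigma[k] / sigma[i]) * lam[k]]
prob = cp.Problem(cp.Minimize(cp.sum_squares(Xv @ w - Yv)), cons)
prob.solve()
print(prob.status)                               # 'optimal' on this toy problem
```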
Tuning γ > 0 in Least Squares Support Vector Machines

• Predictive model f(x) = w^T \varphi(x) with feature map \varphi : \mathbb{R}^d \to \mathbb{R}^D
• Training problem:

  \min_{w, e} J_\gamma(w, e) = \sum_{i=1}^{n} e_i^2 + \frac{\gamma}{2}\, w^T w \quad \text{s.t.} \quad w^T \varphi(x_i) + e_i = y_i, \; \forall i

• Normal equations:

  \mathrm{KKT}(\alpha \mid \gamma, \mathcal{D}) : \quad (K + \gamma I_n)\, \alpha = Y

• Optimal prediction model \hat f(x) = \hat w^T \varphi(x) = \sum_{i=1}^{n} \hat\alpha_i K(x_i, x)
• Tuning the regularization constant:

  (\hat\alpha, \hat\gamma) = \arg\min_{\alpha, \gamma > 0} \sum_{j=1}^{n_v} \ell\left( \sum_{i=1}^{n} \alpha_i K(x_i, x^v_j) - y^v_j \right) \quad \text{s.t.} \quad S(\gamma, \alpha \mid \mathcal{D})
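For a fixed γ, LS-SVM training reduces to the linear system above. A minimal sketch with an RBF kernel (the kernel, its bandwidth, and the data are illustrative assumptions):

```python
# Sketch: LS-SVM training via the normal equations (K + gamma*I) alpha = Y,
# followed by prediction with f(x) = sum_i alpha_i K(x_i, x).
import numpy as np

def rbf_kernel(A, B, bw=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 bw^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, (40, 1))
Y = np.sinc(X[:, 0]) + 0.05 * rng.standard_normal(40)

gamma = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + gamma * np.eye(len(X)), Y)   # KKT(alpha | gamma, D)

X_test = np.linspace(-3, 3, 7)[:, None]
f_hat = rbf_kernel(X_test, X) @ alpha                    # predictions on a test grid
print(f_hat.round(3))
```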
Tuning the ridge w.r.t. 10-fold CV

L-fold CV → each fold has its own KKT condition:

  \mathrm{KKT}\left( w_{(1)} \mid \gamma, \mathcal{D}_{(1)} \right)
    \vdots
  \mathrm{KKT}\left( w_{(10)} \mid \gamma, \mathcal{D}_{(10)} \right)

...but γ is coupled over the folds! Moreover, each fold comes with its own validation set, thus

  \left( \hat w_{(l)}, \hat\gamma \right) = \arg\min_{w_{(l)}, \gamma > 0} \sum_{l=1}^{L} \frac{1}{n - n_{(l)}} \sum_{(x_i, y_i) \in \mathcal{D}^v_{(l)}} \ell\left( w_{(l)}^T x_i - y_i \right)
  \quad \text{s.t.} \quad \mathrm{KKT}\left( w_{(l)} \mid \gamma, \mathcal{D}_{(l)} \right), \; \forall l = 1, \dots, L
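The coupling is easiest to see in the non-convex baseline: each fold solves its own KKT system, yet a single γ is shared by all of them. A sketch of plain grid-search L-fold CV under these definitions (data, loss, and fold count are illustrative):

```python
# Sketch: every fold l solves KKT(w_(l) | gamma, D_(l)), with one gamma shared
# across folds; a grid search over gamma is the non-convex reference approach.
import numpy as np

rng = np.random.default_rng(6)
n, D, L = 50, 5, 10
X = rng.standard_normal((n, D))
Y = X @ rng.standard_normal(D) + 0.1 * rng.standard_normal(n)
folds = np.array_split(np.arange(n), L)          # validation indices per fold

def cv_loss(gamma):
    total = 0.0
    for val in folds:
        trn = np.setdiff1d(np.arange(n), val)
        w_l = np.linalg.solve(X[trn].T @ X[trn] + gamma * np.eye(D),
                              X[trn].T @ Y[trn])  # fold-wise KKT system
        total += ((X[val] @ w_l - Y[val]) ** 2).sum() / len(val)
    return total / L

gammas = np.logspace(-3, 3, 13)
print(min(gammas, key=cv_loss))                  # best shared gamma on the grid
```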
Tuning the ridge w.r.t. CV

Proposition 3. [Coupling over different folds] Let U_l \Sigma_l U_l^T be the SVD of \Omega_l for all l = 1, \dots, 10. (...) Then the following coupled relaxation is proposed:

  S'_L\left( \Lambda, w_{(l)} \mid \mathcal{D}_{(1)}, \dots, \mathcal{D}_{(L)} \right) :
  \begin{cases}
  U_i^{(l)T} w_{(l)} = \lambda_k \, U_i^{(l)T} X_{(l)}^T Y_{(l)} & \forall k \leftrightarrow (l, i) \\
  0 < \lambda_k < \dfrac{1}{\sigma'_k} & \forall k = 1, \dots \\
  \dfrac{\sigma'_k}{\sigma'_l} \, \lambda_k < \lambda_l < \lambda_k & \forall \sigma'_l > \sigma'_k \\
  \lambda_k = \lambda_l & \forall \sigma'_k = \sigma'_l
  \end{cases}

Thus the convex relaxation to tuning the ridge with respect to an L-fold CV criterion amounts to solving

  \min_{w_{(l)}, \Lambda} \sum_{l=1}^{L} \frac{1}{n - n_{(l)}} \sum_{(x^v_i, y^v_i) \in \mathcal{D}^v_{(l)}} \ell\left( w_{(l)}^T x^v_i - y^v_i \right)
  \quad \text{s.t.} \quad S'_L\left( \Lambda, w_{(l)} \mid \mathcal{D}_{(1)}, \dots, \mathcal{D}_{(L)} \right).
Examples

[Figure: results of a comparison between OLS and ridge regression (RR) with D = 10, tuned by CV (steepest descent), by GCV (steepest descent), and by the proposed method fusing training and tuning of the ridge in one convex optimization problem. Panel (a) shows the evolution of performance when the condition number is varied with n = 50 fixed. Panel (b) displays the evolution of performance when the number of examples is varied with Γ(X^T X) = 10^3 fixed. In both cases the proposed convex relaxation performs on par with the steepest-descent-based counterparts, while it significantly outperforms OLS for small n or a high enough condition number Γ(X^T X).]
Conclusions

Message:

• Hyper-parameter tuning as a convex optimization problem
• Setting the stage ...
• Convex hull of the solution path
• Bounded maximal distance between the solution path and its relaxation if σ_N > 0 or γ > 0
• Efficient tuning w.r.t. validation and CV

Outlook:

• Input selection for additive models
• Comparison with gradient descent / Bayesian inference
• Convex relaxation of the SVM solution path