Convexity, Loss functions and Gradient

Abhishek Kumar
Dept. of Computer Science, University of Maryland, College Park

Oct 4, 2011


Linear Classifiers

Decision boundaries are linear:
    ŷ = Σ_{i=1..d} w_i x_i + b = w · x + b

Prediction: sign(ŷ)

{x : w · x + b = 0} : the decision boundary (a hyperplane; it passes through the origin when b = 0)
{x : w · x + b > 0} ⇒ ?
{x : w · x + b < 0} ⇒ ?

Margin: y ŷ
Prediction Error: if Margin < 0
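A minimal numerical sketch of these quantities (the weight vector, bias, point, and label below are made-up values):

```python
import numpy as np

w = np.array([2.0, -1.0])   # made-up weight vector
b = 0.5                     # made-up bias
x = np.array([1.0, 3.0])    # one input point
y = -1                      # its true label in {-1, +1}

y_hat = np.dot(w, x) + b    # raw score w . x + b = -0.5
pred = np.sign(y_hat)       # predicted label = -1
margin = y * y_hat          # margin y * y_hat = 0.5

print(pred, margin, margin < 0)   # prediction error only if the margin is negative
```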

Optimization Problem (for learning), using the 0-1 loss:
    w = arg min_{w,b} (1/N) Σ_n I(y_n (w · x_n + b) < 0)

– I(·): 1 if its argument is true, else 0
– NP-hard to optimize (exactly, and even approximately)
– A small change in x may cause a large change in the loss
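As a sketch of what this objective computes (the toy data and candidate weights are made up; this only evaluates the loss, it does not solve the NP-hard minimization):

```python
import numpy as np

# Toy data: N = 4 points in 2-D with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

def zero_one_loss(w, b):
    """(1/N) * sum_n I(y_n (w . x_n + b) < 0)."""
    margins = y * (X @ w + b)
    return np.mean(margins < 0)

print(zero_one_loss(np.array([1.0, 0.0]), 0.0))    # 0.0: this w separates the toy data
print(zero_one_loss(np.array([-1.0, 0.0]), 0.0))   # 1.0: every point misclassified
```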


Regularization

We care about test error (not training error). Minimizing training error alone can overfit the training data.

Regularized Learning (SRM): Tikhonov, Ivanov, Morozov
    w = arg min_{w,b} (1/N) Σ_n I(y_n (w · x_n + b) < 0) + λ R(w)

Balances the empirical loss against the complexity of the classifier – we prefer simple!
R(w) is a regularizer for linear hyperplanes.
For computational reasons, we want both the loss function and the regularizer to be convex.
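Continuing the sketch above, the regularized objective just adds λ R(w) to the empirical loss; here R(w) = ||w||² is used as one common choice, and all values are made up:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
lam = 0.1

def regularized_objective(w, b):
    empirical = np.mean(y * (X @ w + b) < 0)   # 0-1 training loss
    return empirical + lam * np.dot(w, w)      # plus lambda * R(w), with R(w) = ||w||^2

print(regularized_objective(np.array([1.0, 0.0]), 0.0))    # 0.1: zero loss, small complexity
print(regularized_objective(np.array([10.0, 0.0]), 0.0))   # 10.0: same loss, much higher complexity
```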


Convexity

Convex Sets: A set S is convex if ∀ x, y ∈ S, αx + (1 − α)y ∈ S for all 0 ≤ α ≤ 1
(the line segment joining x and y is contained in S).

Examples – convex or not? (spot-checked numerically below)
    S = {x : x² > 2}   ⇒ ?
    S = {x : x′x < 1}  ⇒ ?
    S = {U : U′U = I}  ⇒ ?

Operations that preserve convexity of sets:
    Intersection: S1, S2 convex ⇒ S1 ∩ S2 convex
    Affine function (f(x) = Ax + b): S convex ⇒ f(S) and f⁻¹(S) convex
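One way to build intuition for the examples above (not a proof) is to sample pairs of points from a candidate set and test the definition on random convex combinations; the sampling range and trial count below are arbitrary choices, and the orthogonal-matrix set is skipped because uniform sampling never hits it:

```python
import numpy as np

rng = np.random.default_rng(0)

def looks_convex(in_set, dim, trials=10000, scale=3.0):
    """Randomly test the definition: x, y in S  =>  alpha*x + (1 - alpha)*y in S."""
    for _ in range(trials):
        x, y = rng.uniform(-scale, scale, size=(2, dim))
        if in_set(x) and in_set(y):
            alpha = rng.uniform()
            if not in_set(alpha * x + (1 - alpha) * y):
                return False   # found a violating pair: the set is not convex
    return True                # no violation found (evidence only, not a proof)

print(looks_convex(lambda x: x[0] ** 2 > 2, dim=1))   # {x : x^2 > 2}  -> False
print(looks_convex(lambda x: x @ x < 1, dim=2))       # {x : x'x < 1}  -> True
```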


Convexity

Convex Function: A function f defined on a convex set is convex if
    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)    (0 < α < 1)

Equivalent definition: f is convex if its epigraph
    E = {(x, µ) : µ ∈ R, f(x) ≤ µ}
is a convex set (the region that lies above the graph of the function f).

f convex ⇒ −f concave


Convexity

How to check convexity? The domain should be a convex set, plus any one of the following three:

Use the definition (the chord lies above the function):
    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)    (0 < α < 1)

For differentiable functions: the function lies above all its tangents
    f(y) ≥ f(x) + f′(x)(y − x)    (in general, f(y) ≥ f(x) + ∇f(x) · (y − x))

For twice differentiable functions: the second derivative is non-negative
    f″(x) ≥ 0    (in general, ∇²f(x) ⪰ 0, i.e. the Hessian is positive semidefinite)

(If the above three hold with strict inequality, then f is strictly convex.)
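A numerical sanity check of the three conditions, using f(x) = eˣ (which is convex, with f′ = f″ = eˣ); the grid and tolerance are arbitrary choices and this is evidence, not a proof:

```python
import numpy as np

f = df = d2f = np.exp                   # f(x) = e^x and its first two derivatives

xs = np.linspace(-2.0, 2.0, 41)
alphas = np.linspace(0.01, 0.99, 25)

# 1) Chord condition: f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)
chord_ok = all(f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
               for x in xs for y in xs for a in alphas)

# 2) Tangent condition: f(y) >= f(x) + f'(x) * (y - x)
tangent_ok = all(f(y) >= f(x) + df(x) * (y - x) - 1e-12 for x in xs for y in xs)

# 3) Second-derivative condition: f''(x) >= 0
second_ok = all(d2f(x) >= 0 for x in xs)

print(chord_ok, tangent_ok, second_ok)   # all True for e^x
```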


Convexity

Operations that preserve convexity of functions:
1. Positive scaling and addition: f, g convex ⇒ w1 f + w2 g convex (w1, w2 ≥ 0)
2. Affine function composition: f convex ⇒ f(Ax + b) convex
3. Pointwise maximum: f, g convex ⇒ h(x) = max(f(x), g(x)) convex
4. f convex, g convex and non-decreasing ⇒ h(x) = g(f(x)) convex
5. f concave, g convex and non-increasing ⇒ h(x) = g(f(x)) convex

Examples (x ∈ R): eˣ, x, x², x⁴, etc.
Any vector or matrix norm
h(x) = max(|x|, x²), x ∈ R (using property 3 above)
exp(f(x)), f convex (using property 4 above)
1/log(x), x > 1 (using property 5 above)
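The same chord check can be reused to spot-check functions built with these rules, e.g. max(|x|, x²) (property 3) and exp(x²) (exp is convex and non-decreasing, x² is convex, so property 4 applies); grid ranges are arbitrary and this is only evidence:

```python
import numpy as np

def chord_holds(f, lo=-2.0, hi=2.0, n=61):
    """Check f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) on a grid of points and mixing weights."""
    xs = np.linspace(lo, hi, n)
    alphas = np.linspace(0.05, 0.95, 19)
    return all(f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
               for x in xs for y in xs for a in alphas)

print(chord_holds(lambda x: max(abs(x), x ** 2)))   # pointwise max of two convex functions
print(chord_holds(lambda x: np.exp(x ** 2)))        # convex non-decreasing g composed with convex f
```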


Convex Loss Functions

All of these are convex upper bounds on the 0-1 loss:
    Hinge loss: L(y, ŷ) = max{0, 1 − y ŷ}
    Exponential loss: L(y, ŷ) = exp(−y ŷ)
    Logistic loss: L(y, ŷ) = log₂(1 + exp(−y ŷ))
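A quick numerical check that each of these sits above the 0-1 loss as a function of the margin y ŷ (the grid of margin values is arbitrary):

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)           # candidate values of y * y_hat

zero_one = (margins < 0).astype(float)        # 0-1 loss
hinge = np.maximum(0.0, 1.0 - margins)        # max{0, 1 - y*y_hat}
exponential = np.exp(-margins)                # exp(-y*y_hat)
logistic = np.log2(1.0 + np.exp(-margins))    # log2(1 + exp(-y*y_hat))

# Every convex surrogate is >= the 0-1 loss at each margin value
print(np.all(hinge >= zero_one), np.all(exponential >= zero_one), np.all(logistic >= zero_one))
```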


Weight Regularization

Why do we need the weights to be small? An ε change in the input causes a (w · ε) change in ŷ. The prediction function should be smooth with respect to the input x (it should change slowly).

Regularization is related to generalization: regularized learning is statistically stable, and stability directly bounds the generalization (test) error.

We prefer convex regularizers for computational reasons.
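A tiny illustration of the first point (all values made up): the same input perturbation ε moves ŷ by w · ε, so larger weights make the prediction far more sensitive.

```python
import numpy as np

x = np.array([1.0, 2.0])
eps = np.array([0.01, -0.02])                                 # a small input perturbation

for w in (np.array([0.5, -0.5]), np.array([50.0, -50.0])):    # small vs large weights
    change = np.dot(w, x + eps) - np.dot(w, x)                # equals w . eps (the bias cancels)
    print(np.linalg.norm(w), change)                          # 0.707 -> 0.015,  70.7 -> 1.5
```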


Norm based Regularizers

ℓp norm: ||w||_p = (Σ_i |w_i|^p)^(1/p),    p ∈ [1, ∞]

||w||_p for p < 1 is not a norm. Why?

Commonly used norm based regularizers in classification: ℓ1, ℓ2
Unit ball in ℓp: the set S = {w : ||w||_p ≤ 1}.

Figure: ℓ2, ℓ1, and ℓp (p < 1) balls in two dimensions

What will the ℓ∞ ball look like? What will the ℓ0 ball look like?
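A short sketch of the ℓp norm, plus a concrete reason p < 1 fails: the triangle inequality breaks (which is also why the p < 1 "ball" in the figure is non-convex). The test vectors are arbitrary:

```python
import numpy as np

def lp(w, p):
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([3.0, -4.0])
for p in (1, 2, 3):
    print(p, lp(w, p))                               # 7.0, 5.0, ~4.5 (non-increasing in p)

# Triangle-inequality counterexample for p = 1/2:
u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(lp(u + v, 0.5), lp(u, 0.5) + lp(v, 0.5))       # 4.0 > 2.0, so ||.||_{1/2} is not a norm
```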


Properties of different regularizers

The solution often lies on a singularity (corner) of the ball when the constraint is tight enough. Why?
Sparsity inducing norms: obtained from ℓp regularizers with p ≤ 1. Why?
Rotational invariance: the ℓ2 norm is invariant to rotations.
When to use ℓ1: few samples, many irrelevant features (feature selection).
When to use ℓ2: otherwise; it leads to good generalization.
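A tiny illustration of the sparsity point (not from the slides, values made up): minimize a one-dimensional quadratic fit term plus either penalty by grid search; the ℓ1 solution snaps exactly to zero while the ℓ2 solution only shrinks.

```python
import numpy as np

a, lam = 0.3, 1.0                        # made-up target value and regularization strength
ws = np.linspace(-1.0, 1.0, 200001)      # fine grid of candidate weights

l1_obj = (ws - a) ** 2 + lam * np.abs(ws)
l2_obj = (ws - a) ** 2 + lam * ws ** 2

print(ws[np.argmin(l1_obj)])             # 0.0   (exactly zero: sparse)
print(ws[np.argmin(l2_obj)])             # 0.15  (= a / (1 + lam): shrunk but nonzero)
```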


Look Back

Now we have convex loss functions and convex regularizers. Final objective:
    L(w, b) = (1/N) Σ_n ℓ(ŷ_n, y_n) + (λ/2) ||w||_2^2

How do we minimize this and find the solution w? Gradient descent.


Gradient

The gradient of a function f is denoted ∇f:
    ∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n)

The gradient of a scalar function is a vector with the dimension of the function's domain.

Directional Derivative: the rate of change of the function along a direction v:
    D_v(f) = ∇f · v

The function changes (increases) the most along the direction of the gradient. Why?
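A small numerical illustration with a made-up function f(x) = x_1² + 3 x_2²: the directional derivative ∇f · v is largest (over unit directions v) along the gradient itself, where it equals ||∇f||.

```python
import numpy as np

def grad_f(x):                  # gradient of f(x) = x1^2 + 3*x2^2
    return np.array([2 * x[0], 6 * x[1]])

x0 = np.array([1.0, -1.0])
g = grad_f(x0)                  # (2, -6)

for v in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), g / np.linalg.norm(g)):
    print(np.dot(g, v))         # 2.0, -6.0, 6.32...

print(np.linalg.norm(g))        # 6.32... = the largest directional derivative over unit directions
```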


Gradient - geometric interpretation

Level Sets: For a function f(x), the level set containing a point a is the set of points S = {x : f(x) = f(a)}.
The function takes the same value at all points of the level set S.
The gradient at a point y is perpendicular to the level set containing y.

Example: f(x, y) = √(x² + y²)

Figure: Level sets of a cone

How will the level sets of the function f(x, y, z) = x² + y² + z² look?
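A quick numerical check of the perpendicularity claim on the cone example (the test point and step size are arbitrary):

```python
import numpy as np

def f(p):
    return np.sqrt(p[0] ** 2 + p[1] ** 2)

def grad_f(p):
    return p / np.linalg.norm(p)          # gradient of sqrt(x^2 + y^2), away from the origin

a = np.array([3.0, 4.0])                  # its level set is the circle of radius 5
tangent = np.array([-4.0, 3.0]) / 5.0     # unit direction tangent to that circle at a

print(np.dot(grad_f(a), tangent))         # 0.0: the gradient is perpendicular to the level set
print(f(a + 1e-3 * tangent) - f(a))       # ~1e-7: moving along the level set barely changes f
```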


Gradient Descent

Move in the negative direction of the gradient to minimize a function.

Steepest gradient descent: To minimize a function f, start with some initial guess x^(0). Update rule:
    x^(t) = x^(t−1) − η^(t) ∇f(x^(t−1))

Example: Optimizing the logistic loss with an ℓ2 regularizer:
    L(w, b) = (1/N) Σ_n log₂[1 + exp(−y_n (w · x_n + b))] + (λ/2) ||w||_2^2
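A minimal sketch of steepest descent on this objective. The synthetic data, step size η, iteration count, and λ are all made-up choices; the per-sample gradient uses d/dz log₂(1 + e^(−z)) = −(1/ln 2) · e^(−z)/(1 + e^(−z)) evaluated at z = y_n (w · x_n + b).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: labels come from a hidden linear rule, just to have something to fit
N, lam = 200, 0.1
X = rng.normal(size=(N, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.5)

def objective(w, b):
    m = y * (X @ w + b)                               # margins y_n (w . x_n + b)
    return np.mean(np.log2(1 + np.exp(-m))) + 0.5 * lam * np.dot(w, w)

def gradients(w, b):
    m = y * (X @ w + b)
    s = 1.0 / (1.0 + np.exp(m))                       # = exp(-m) / (1 + exp(-m))
    coeff = -(s * y) / (N * np.log(2))                # derivative of each log2(1 + e^{-m_n}) term
    return X.T @ coeff + lam * w, np.sum(coeff)

w, b, eta = np.zeros(2), 0.0, 0.5                     # initial guess x^(0) and a fixed step size
for t in range(500):
    gw, gb = gradients(w, b)
    w, b = w - eta * gw, b - eta * gb                 # x^(t) = x^(t-1) - eta * grad

print(objective(np.zeros(2), 0.0), objective(w, b))   # the objective drops from 1.0 toward its minimum
print(w, b)
```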
