Chapter 2: A Brief Introduction to Support Vector Machine (SVM)

January 25, 2011

Overview

• A new, powerful method for 2-class classification
  – Original idea: Vapnik, 1965, for linear classifiers
  – SVM: Cortes and Vapnik, 1995
  – Became very hot since 2001

• Better generalization (less overfitting)
• Can do linearly non-separable classification, with a globally optimal solution
• Key ideas (an end-to-end sketch follows below)
  – Use a kernel function to transform low-dimensional training samples into a higher-dimensional space (for the linear separability problem)
  – Use quadratic programming (QP) to find the best classifier boundary hyperplane (for the global optimum)
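To make these key ideas concrete, here is a minimal end-to-end sketch (not part of the original slides) using scikit-learn's SVC, which is built on LIBSVM (mentioned at the end of this chapter). The toy dataset and parameter values are made up for illustration.

```python
# Minimal sketch (assumes scikit-learn): a kernelized SVM on a toy
# 2-class problem that is not linearly separable in the original space.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: class +1 lies inside a ring of radius 1, class -1 outside.
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

# The RBF kernel implicitly maps samples to a higher-dimensional space;
# fit() solves the underlying quadratic program (via LIBSVM).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```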

Linear Classifiers

f(x, w, b) = sign(w · x - b)

[Figure: 2-D training data, one marker denoting +1 and another denoting -1, with several candidate linear decision boundaries drawn through it; a diagram maps input x through f(x, w, b) to the estimate y.]

How would you classify this data? Any of these boundaries would be fine… but which is best?
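The decision rule f(x, w, b) = sign(w · x - b) is straightforward to code. Below is a small NumPy sketch (not from the original slides) with made-up values for w and b.

```python
import numpy as np

def linear_classify(X, w, b):
    """Classify each row of X as +1 or -1 using f(x, w, b) = sign(w . x - b)."""
    scores = X @ w - b
    return np.where(scores >= 0, 1, -1)

# Example with made-up parameters: boundary x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = 1.0
X = np.array([[2.0, 2.0], [0.0, 0.0], [0.5, 0.4]])
print(linear_classify(X, w, b))   # -> [ 1 -1 -1]
```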

Classifier Margin

f(x, w, b) = sign(w · x - b)

[Figure: the same data with one linear boundary; the margin is drawn as a band around the boundary.]

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin

f(x, w, b) = sign(w · x - b)

[Figure: the maximum margin boundary, with the datapoints lying on the margin highlighted.]

The maximum margin linear classifier is the linear classifier with the maximum margin.

Support vectors are those datapoints that the margin pushes up against.

This is the simplest kind of SVM, the linear SVM (called an LSVM).

Why Maximum Margin?

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There's some theory that this is a, um, good thing.
5. Empirically it works very, very well.

Computing the margin width

M = margin width

How do we compute M in terms of w and b?
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• M = 2 / sqrt(w · w)

Copyright © 2001, 2003, Andrew W. Moore

Learning the Maximum Margin Classifier

[Figure: points x+ and x- on the plus-plane and minus-plane; M = margin width = 2 / sqrt(w · w).]

Given a guess of w and b we can:
• Compute whether all data points lie in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing?
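A small sketch (not from the original slides) of the two computations described above, for a made-up guess of (w, b) and a tiny made-up dataset: check that every point is on the correct side of its plane, and compute M = 2 / sqrt(w · w).

```python
import numpy as np

def margin_width(w):
    """M = 2 / sqrt(w . w) for the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / np.sqrt(w @ w)

def all_in_correct_half_planes(X, y, w, b):
    """True if every +1 point satisfies w.x + b >= 1 and every -1 point w.x + b <= -1."""
    scores = X @ w + b
    return bool(np.all(y * scores >= 1.0))

# Made-up guess of w, b and a tiny dataset.
w = np.array([2.0, 0.0])
b = -1.0
X = np.array([[1.5, 0.0], [2.0, 1.0], [-0.5, 0.0], [0.0, 2.0]])
y = np.array([1, 1, -1, -1])

print("all points correct:", all_in_correct_half_planes(X, y, w, b))
print("margin width M:", margin_width(w))
```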

Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms for maximizing a quadratic function of some real-valued variables subject to linear constraints.
• Minimize both w · w (to maximize M) and the misclassification error.

Quadratic Programming

Find

  arg max_u  c + d^T u + (u^T R u) / 2        (quadratic criterion)

Subject to

  a11 u1 + a12 u2 + ... + a1m um <= b1
  a21 u1 + a22 u2 + ... + a2m um <= b2
  :
  an1 u1 + an2 u2 + ... + anm um <= bn        (n additional linear inequality constraints)

And subject to

  a(n+1)1 u1 + a(n+1)2 u2 + ... + a(n+1)m um = b(n+1)
  a(n+2)1 u1 + a(n+2)2 u2 + ... + a(n+2)m um = b(n+2)
  :
  a(n+e)1 u1 + a(n+e)2 u2 + ... + a(n+e)m um = b(n+e)    (e additional linear equality constraints)
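As an illustration (not from the original slides), here is a minimal sketch of solving a QP of this form with the third-party cvxopt package, assuming it is installed; the numbers are made up. cvxopt's solvers.qp minimizes (1/2) u^T P u + q^T u subject to G u <= h (and optionally A u = b), so the maximization above is converted by taking P = -R and q = -d; the constant c does not affect the optimizer.

```python
# Sketch of a small QP in the form on this slide, solved with cvxopt
# (assumed third-party package).
import numpy as np
from cvxopt import matrix, solvers

# Made-up example: maximize u1 + 2*u2 - (u1^2 + u2^2)/2
# subject to u1 + u2 <= 1, u1 >= 0, u2 >= 0.
R = np.array([[-1.0, 0.0], [0.0, -1.0]])
d = np.array([1.0, 2.0])

P = matrix(-R)                       # quadratic term of the minimization
q = matrix(-d)                       # linear term
G = matrix(np.array([[1.0, 1.0],     #  u1 + u2 <= 1
                     [-1.0, 0.0],    # -u1      <= 0
                     [0.0, -1.0]]))  # -u2      <= 0
h = matrix(np.array([1.0, 0.0, 0.0]))

sol = solvers.qp(P, q, G, h)
print("optimal u:", np.array(sol["x"]).ravel())
```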

Learning Maximum Margin with Noise

[Figure: data with a few points on the wrong side of the margin; the distances of these error points to their correct zones are labelled ε1, ε2, ε7; M = 2 / sqrt(w · w).]

Given a guess of w and b we can:
• Compute the sum of distances of points to their correct zones
• Compute the width of the margin

Assume R datapoints, each (xk, yk) where yk = +/- 1.

What should our quadratic optimization criterion be?

  Minimize  (1/2) w · w + C Σ_{k=1..R} εk

where εk is the distance of error point k to its correct place.

How many constraints will we have? 2R. What should they be?

  w · xk + b >= +1 - εk   if yk = +1
  w · xk + b <= -1 + εk   if yk = -1
  εk >= 0                 for all k
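As an illustration (made-up w, b, C and data, not from the original slides), the slack εk and the objective above can be computed directly as εk = max(0, 1 - yk (w · xk + b)).

```python
import numpy as np

def soft_margin_objective(X, y, w, b, C):
    """Return (1/2) w.w + C * sum of slacks, where each slack
    eps_k = max(0, 1 - y_k (w . x_k + b)) measures how far point k
    is from its correct zone."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * slacks.sum(), slacks

# Made-up guess and a tiny noisy dataset.
w = np.array([1.0, 0.0])
b = 0.0
C = 10.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-1.5, 0.0], [0.2, -1.0]])
y = np.array([1, 1, -1, -1])

obj, slacks = soft_margin_objective(X, y, w, b, C)
print("slacks:", slacks)     # nonzero entries are margin violations
print("objective:", obj)
```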

From LSVM to general SVM

Suppose we are in 1 dimension. What would SVMs do with this data?

[Figure: 1-D datapoints on a line, with x = 0 marked.]

Not a big surprise

[Figure: the SVM places the boundary midway between the classes, with the positive "plane" and negative "plane" marking the margin on either side; x = 0 marked.]

Harder 1-dimensional dataset

Points are not linearly separable. What can we do now?

[Figure: 1-D datapoints around x = 0 that no single threshold can separate.]

Harder 1-dimensional dataset

Transform the data points from 1-dim to 2-dim by some nonlinear basis function (called a kernel function). The transformed points are sometimes called feature vectors.

  zk = (xk, xk²)

[Figure: the same data plotted in the (x, x²) plane, with x = 0 marked.]

These points are linearly separable now! The boundary can be found by QP.

Common SVM basis functions

• zk = (polynomial terms of xk of degree 1 to q)
• zk = (radial basis functions of xk), where (see the sketch below)

    zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )

• zk = (sigmoid functions of xk)
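A sketch of the radial-basis expansion above (not from the original slides), using a Gaussian as the KernelFn; the centres cj and kernel width KW are made-up values.

```python
import numpy as np

def rbf_features(X, centres, kw):
    """z_k[j] = KernelFn(|x_k - c_j| / KW), with a Gaussian kernel function."""
    # Pairwise distances between samples and centres.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(dists / kw) ** 2)

# Made-up 2-D samples, centres and kernel width.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
centres = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(rbf_features(X, centres, kw=1.0))   # one row of features per sample
```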

Doing multi-class classification

• SVMs can only handle two-class outputs (i.e., a categorical output variable with two values).
• What can be done?
• Answer: with N classes, learn N SVMs (a sketch follows below)
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
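A sketch of this one-vs-rest scheme (made-up data; assumes scikit-learn; not from the original slides): train one binary SVM per class, then predict the class whose SVM pushes the new input furthest into the positive region.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes = 3
# Three made-up Gaussian blobs, one per class.
X = np.vstack([rng.normal(loc=c * 3.0, scale=0.5, size=(30, 2))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), 30)

# SVM c learns "class == c" vs "class != c".
svms = []
for c in range(n_classes):
    clf = SVC(kernel="rbf", C=1.0).fit(X, np.where(y == c, 1, -1))
    svms.append(clf)

x_new = np.array([[3.1, 2.9]])                       # should look like class 1
scores = [clf.decision_function(x_new)[0] for clf in svms]
print("per-class scores:", scores)
print("predicted class:", int(np.argmax(scores)))
```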

Compare SVM with NN

• Similarity
  − SVM + sigmoid kernel ~ two-layer feedforward NN
  − SVM + Gaussian kernel ~ RBF network
  − For most problems, SVM and NN have similar performance

• Advantages
  − Based on sound mathematical theory
  − Learning result is more robust
  − Over-fitting is not common
  − Not trapped in local minima (because of QP)
  − Fewer parameters to consider (kernel, error cost C)
  − Works well with fewer training samples (the number of support vectors does not matter much)

• Disadvantages
  − Problem needs to be formulated as 2-class classification
  − Learning takes a long time (QP optimization)

Advantages and Disadvantages of SVM

• Advantages
  • Prediction accuracy is generally high
  • Robust: works when training examples contain errors
  • Fast evaluation of the learned target function

• Criticism
  • Long training time
  • Difficult to understand the learned function (weights)
  • Not easy to incorporate domain knowledge

SVM Related Links

• Representative implementations (a short usage sketch follows below)
  • LIBSVM: an efficient implementation of SVM; supports multi-class classification, nu-SVM, and one-class SVM, and includes interfaces for Java, Python, etc.
  • SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  • SVM-torch: another implementation, also written in C
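As a brief usage illustration (not from the original slides), scikit-learn's svm module exposes LIBSVM-backed classes for C-SVM, nu-SVM and one-class SVM; the data and parameter values below are made up.

```python
import numpy as np
from sklearn.svm import SVC, NuSVC, OneClassSVM   # all backed by LIBSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C-SVM and nu-SVM classifiers (made-up parameter values).
print("C-SVM accuracy: ", SVC(kernel="rbf", C=1.0).fit(X, y).score(X, y))
print("nu-SVM accuracy:", NuSVC(kernel="rbf", nu=0.1).fit(X, y).score(X, y))

# One-class SVM: fit on a single class and flag outliers.
oc = OneClassSVM(kernel="rbf", nu=0.05).fit(X[y == 1])
print("inlier labels:", oc.predict(X[y == 1])[:5])   # +1 = inlier, -1 = outlier
```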
