Chapter 2: A Brief Introduction to Support Vector Machine (SVM)

January 25, 2011

Overview

• A new, powerful method for 2-class classification
  – Original idea: Vapnik, 1965, for linear classifiers
  – SVM: Cortes and Vapnik, 1995
  – Became very hot since 2001

• Better generalization (less overfitting)
• Can do linearly non-separable classification, with a globally optimal solution
• Key ideas (an end-to-end sketch follows below)
  – Use a kernel function to transform low-dimensional training samples into a higher-dimensional space (for the linear separability problem)
  – Use quadratic programming (QP) to find the best classifier boundary hyperplane (for the global optimum)
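To make these key ideas concrete, here is a minimal end-to-end sketch (not part of the original slides) using scikit-learn's SVC, which is built on LIBSVM (mentioned at the end of this chapter). The toy dataset and parameter values are made up for illustration.

```python
# Minimal sketch (assumes scikit-learn): a kernelized SVM on a toy
# 2-class problem that is not linearly separable in the original space.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: class +1 lies inside a ring of radius 1, class -1 outside.
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

# The RBF kernel implicitly maps samples to a higher-dimensional space;
# fit() solves the underlying quadratic program (via LIBSVM).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```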

Linear Classifiers

f(x, w, b) = sign(w · x - b)

[Figure: 2-D training data, one marker denoting +1 and another denoting -1, with several candidate linear decision boundaries drawn through it; a diagram maps input x through f(x, w, b) to the estimate y.]

How would you classify this data? Any of these boundaries would be fine… but which is best?
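The decision rule f(x, w, b) = sign(w · x - b) is straightforward to code. Below is a small NumPy sketch (not from the original slides) with made-up values for w and b.

```python
import numpy as np

def linear_classify(X, w, b):
    """Classify each row of X as +1 or -1 using f(x, w, b) = sign(w . x - b)."""
    scores = X @ w - b
    return np.where(scores >= 0, 1, -1)

# Example with made-up parameters: boundary x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = 1.0
X = np.array([[2.0, 2.0], [0.0, 0.0], [0.5, 0.4]])
print(linear_classify(X, w, b))   # -> [ 1 -1 -1]
```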

Classifier Margin

f(x, w, b) = sign(w · x - b)

[Figure: the same data with one linear boundary; the margin is drawn as a band around the boundary.]

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin

f(x, w, b) = sign(w · x - b)

[Figure: the maximum margin boundary, with the datapoints lying on the margin highlighted.]

The maximum margin linear classifier is the linear classifier with the maximum margin.

Support vectors are those datapoints that the margin pushes up against.

This is the simplest kind of SVM, the linear SVM (called an LSVM).

Why Maximum Margin?

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There's some theory that this is a, um, good thing.
5. Empirically it works very, very well.

Computing the margin width

M = margin width

How do we compute M in terms of w and b?
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• M = 2 / sqrt(w · w)

Copyright © 2001, 2003, Andrew W. Moore

Learning the Maximum Margin Classifier

[Figure: points x+ and x- on the plus-plane and minus-plane; M = margin width = 2 / sqrt(w · w).]

Given a guess of w and b we can:
• Compute whether all data points lie in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing?
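A small sketch (not from the original slides) of the two computations described above, for a made-up guess of (w, b) and a tiny made-up dataset: check that every point is on the correct side of its plane, and compute M = 2 / sqrt(w · w).

```python
import numpy as np

def margin_width(w):
    """M = 2 / sqrt(w . w) for the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / np.sqrt(w @ w)

def all_in_correct_half_planes(X, y, w, b):
    """True if every +1 point satisfies w.x + b >= 1 and every -1 point w.x + b <= -1."""
    scores = X @ w + b
    return bool(np.all(y * scores >= 1.0))

# Made-up guess of w, b and a tiny dataset.
w = np.array([2.0, 0.0])
b = -1.0
X = np.array([[1.5, 0.0], [2.0, 1.0], [-0.5, 0.0], [0.0, 2.0]])
y = np.array([1, 1, -1, -1])

print("all points correct:", all_in_correct_half_planes(X, y, w, b))
print("margin width M:", margin_width(w))
```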

Learning via Quadratic Programming

• QP is a well-studied class of optimization algorithms for maximizing a quadratic function of some real-valued variables subject to linear constraints.
• Minimize both w · w (to maximize M) and the misclassification error.

Quadratic Programming

Find

  arg max_u  c + d^T u + (u^T R u) / 2        (quadratic criterion)

Subject to

  a11 u1 + a12 u2 + ... + a1m um <= b1
  a21 u1 + a22 u2 + ... + a2m um <= b2
  :
  an1 u1 + an2 u2 + ... + anm um <= bn        (n additional linear inequality constraints)

And subject to

  a(n+1)1 u1 + a(n+1)2 u2 + ... + a(n+1)m um = b(n+1)
  a(n+2)1 u1 + a(n+2)2 u2 + ... + a(n+2)m um = b(n+2)
  :
  a(n+e)1 u1 + a(n+e)2 u2 + ... + a(n+e)m um = b(n+e)    (e additional linear equality constraints)
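As an illustration (not from the original slides), here is a minimal sketch of solving a QP of this form with the third-party cvxopt package, assuming it is installed; the numbers are made up. cvxopt's solvers.qp minimizes (1/2) u^T P u + q^T u subject to G u <= h (and optionally A u = b), so the maximization above is converted by taking P = -R and q = -d; the constant c does not affect the optimizer.

```python
# Sketch of a small QP in the form on this slide, solved with cvxopt
# (assumed third-party package).
import numpy as np
from cvxopt import matrix, solvers

# Made-up example: maximize u1 + 2*u2 - (u1^2 + u2^2)/2
# subject to u1 + u2 <= 1, u1 >= 0, u2 >= 0.
R = np.array([[-1.0, 0.0], [0.0, -1.0]])
d = np.array([1.0, 2.0])

P = matrix(-R)                       # quadratic term of the minimization
q = matrix(-d)                       # linear term
G = matrix(np.array([[1.0, 1.0],     #  u1 + u2 <= 1
                     [-1.0, 0.0],    # -u1      <= 0
                     [0.0, -1.0]]))  # -u2      <= 0
h = matrix(np.array([1.0, 0.0, 0.0]))

sol = solvers.qp(P, q, G, h)
print("optimal u:", np.array(sol["x"]).ravel())
```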

Learning Maximum Margin with Noise

[Figure: data with a few points on the wrong side of the margin; the distances of these error points to their correct zones are labelled ε1, ε2, ε7; M = 2 / sqrt(w · w).]

Given a guess of w and b we can:
• Compute the sum of distances of points to their correct zones
• Compute the width of the margin

Assume R datapoints, each (xk, yk) where yk = +/- 1.

What should our quadratic optimization criterion be?

  Minimize  (1/2) w · w + C Σ_{k=1..R} εk

where εk is the distance of error point k to its correct place.

How many constraints will we have? 2R. What should they be?

  w · xk + b >= +1 - εk   if yk = +1
  w · xk + b <= -1 + εk   if yk = -1
  εk >= 0                 for all k
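As an illustration (made-up w, b, C and data, not from the original slides), the slack εk and the objective above can be computed directly as εk = max(0, 1 - yk (w · xk + b)).

```python
import numpy as np

def soft_margin_objective(X, y, w, b, C):
    """Return (1/2) w.w + C * sum of slacks, where each slack
    eps_k = max(0, 1 - y_k (w . x_k + b)) measures how far point k
    is from its correct zone."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * slacks.sum(), slacks

# Made-up guess and a tiny noisy dataset.
w = np.array([1.0, 0.0])
b = 0.0
C = 10.0
X = np.array([[2.0, 0.0], [0.5, 1.0], [-1.5, 0.0], [0.2, -1.0]])
y = np.array([1, 1, -1, -1])

obj, slacks = soft_margin_objective(X, y, w, b, C)
print("slacks:", slacks)     # nonzero entries are margin violations
print("objective:", obj)
```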

From LSVM to general SVM

Suppose we are in 1 dimension. What would SVMs do with this data?

[Figure: 1-D datapoints on a line, with x = 0 marked.]

Not a big surprise

[Figure: the SVM places the boundary midway between the classes, with the positive "plane" and negative "plane" marking the margin on either side; x = 0 marked.]

Harder 1-dimensional dataset

Points are not linearly separable. What can we do now?

[Figure: 1-D datapoints around x = 0 that no single threshold can separate.]

Harder 1-dimensional dataset

Transform the data points from 1-dim to 2-dim by some nonlinear basis function (called a kernel function). The transformed points are sometimes called feature vectors.

  zk = (xk, xk²)

[Figure: the same data plotted in the (x, x²) plane, with x = 0 marked.]

These points are linearly separable now! The boundary can be found by QP.

Common SVM basis functions

• zk = (polynomial terms of xk of degree 1 to q)
• zk = (radial basis functions of xk), where (see the sketch below)

    zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )

• zk = (sigmoid functions of xk)
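A sketch of the radial-basis expansion above (not from the original slides), using a Gaussian as the KernelFn; the centres cj and kernel width KW are made-up values.

```python
import numpy as np

def rbf_features(X, centres, kw):
    """z_k[j] = KernelFn(|x_k - c_j| / KW), with a Gaussian kernel function."""
    # Pairwise distances between samples and centres.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(dists / kw) ** 2)

# Made-up 2-D samples, centres and kernel width.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
centres = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(rbf_features(X, centres, kw=1.0))   # one row of features per sample
```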

Doing multi-class classification

• SVMs can only handle two-class outputs (i.e., a categorical output variable with two values).
• What can be done?
• Answer: with N classes, learn N SVMs (a sketch follows below)
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
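A sketch of this one-vs-rest scheme (made-up data; assumes scikit-learn; not from the original slides): train one binary SVM per class, then predict the class whose SVM pushes the new input furthest into the positive region.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes = 3
# Three made-up Gaussian blobs, one per class.
X = np.vstack([rng.normal(loc=c * 3.0, scale=0.5, size=(30, 2))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), 30)

# SVM c learns "class == c" vs "class != c".
svms = []
for c in range(n_classes):
    clf = SVC(kernel="rbf", C=1.0).fit(X, np.where(y == c, 1, -1))
    svms.append(clf)

x_new = np.array([[3.1, 2.9]])                       # should look like class 1
scores = [clf.decision_function(x_new)[0] for clf in svms]
print("per-class scores:", scores)
print("predicted class:", int(np.argmax(scores)))
```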

Compare SVM with NN

• Similarity
  − SVM + sigmoid kernel ~ two-layer feedforward NN
  − SVM + Gaussian kernel ~ RBF network
  − For most problems, SVM and NN have similar performance

• Advantages
  − Based on sound mathematical theory
  − Learning result is more robust
  − Over-fitting is not common
  − Not trapped in local minima (because of QP)
  − Fewer parameters to consider (kernel, error cost C)
  − Works well with fewer training samples (the number of support vectors does not matter much)

• Disadvantages
  − Problem needs to be formulated as 2-class classification
  − Learning takes a long time (QP optimization)

Advantages and Disadvantages of SVM

• Advantages
  • Prediction accuracy is generally high
  • Robust: works when training examples contain errors
  • Fast evaluation of the learned target function

• Criticism
  • Long training time
  • Difficult to understand the learned function (weights)
  • Not easy to incorporate domain knowledge

SVM Related Links

• Representative implementations (a short usage sketch follows below)
  • LIBSVM: an efficient implementation of SVM; supports multi-class classification, nu-SVM, and one-class SVM, and includes interfaces for Java, Python, etc.
  • SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  • SVM-torch: another implementation, also written in C
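As a brief usage illustration (not from the original slides), scikit-learn's svm module exposes LIBSVM-backed classes for C-SVM, nu-SVM and one-class SVM; the data and parameter values below are made up.

```python
import numpy as np
from sklearn.svm import SVC, NuSVC, OneClassSVM   # all backed by LIBSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C-SVM and nu-SVM classifiers (made-up parameter values).
print("C-SVM accuracy: ", SVC(kernel="rbf", C=1.0).fit(X, y).score(X, y))
print("nu-SVM accuracy:", NuSVC(kernel="rbf", nu=0.1).fit(X, y).score(X, y))

# One-class SVM: fit on a single class and flag outliers.
oc = OneClassSVM(kernel="rbf", nu=0.05).fit(X[y == 1])
print("inlier labels:", oc.predict(X[y == 1])[:5])   # +1 = inlier, -1 = outlier
```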
