Chapter 2: A Brief Introduction to Support Vector Machines (SVM), January 25, 2011
Overview
• A new, powerful method for 2-class classification
  – Original idea: Vapnik, 1965, for linear classifiers
  – SVM: Cortes and Vapnik, 1995
  – Became very popular since 2001
• Better generalization (less overfitting)
• Can handle linearly non-separable classification, with a global optimum
• Key ideas
  – Use a kernel function to transform low-dimensional training samples into a higher-dimensional space (to address the linear-separability problem)
  – Use quadratic programming (QP) to find the best boundary hyperplane (for a global optimum)
Linear Classifiers
[Figure: 2-D training data, where one marker denotes +1 and the other denotes -1; an input x passes through the classifier f with parameters (w, b) to produce an estimate yest, and one candidate linear boundary is drawn.]
f(x,w,b) = sign(w·x - b)
How would you classify this data?
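For concreteness, here is a minimal sketch of that decision rule in Python/NumPy; the particular w and b below are hand-picked for illustration, not learned from any data:

```python
import numpy as np

def linear_classify(x, w, b):
    """Linear decision rule f(x, w, b) = sign(w . x - b); returns +1 or -1."""
    return 1 if np.dot(w, x) - b >= 0 else -1

# Hand-picked parameters, purely to illustrate the rule.
w = np.array([1.0, 2.0])
b = 0.5
print(linear_classify(np.array([2.0, 1.0]), w, b))    # +1
print(linear_classify(np.array([-1.0, -1.0]), w, b))  # -1
```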
[Slides 4-6 repeat the same figure with different candidate linear boundaries drawn through the data, each asking: how would you classify this data?]
Linear Classifiers
[Figure: the same data with several candidate linear boundaries, all of which separate the two classes.]
f(x,w,b) = sign(w·x - b)
Any of these would be fine... but which is best?
Classifier Margin
[Figure: the data with a linear boundary and the band around it showing its margin.]
f(x,w,b) = sign(w·x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
[Figure: the data with the widest-margin boundary; the datapoints lying on the margin are highlighted.]
f(x,w,b) = sign(w·x - b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM, for Linear SVM).
Support Vectors are those datapoints that the margin pushes up against.
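As a hedged aside (scikit-learn is my addition, not part of the slides), a fitted linear SVM exposes w, b, and the support vectors directly; note that scikit-learn uses the sign(w·x + b) convention, and the toy data below is invented for illustration. A very large C approximates the hard-margin LSVM case:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],         # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C ~ hard margin

w = clf.coef_[0]        # learned weight vector
b = clf.intercept_[0]   # learned bias
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)  # points the margin pushes against
```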
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There's some theory that this is a good thing.
5. Empirically it works very, very well.
Computing the Margin Width
How do we compute the margin width M in terms of w and b?
• Plus-plane  = { x : w·x + b = +1 }
• Minus-plane = { x : w·x + b = -1 }
• M = 2 / √(w·w)
Copyright © 2001, 2003, Andrew W. Moore
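A one-line check of that formula, assuming we already have a weight vector w (the value below is arbitrary):

```python
import numpy as np

w = np.array([3.0, 4.0])          # arbitrary example weight vector
M = 2.0 / np.sqrt(np.dot(w, w))   # margin width M = 2 / sqrt(w . w)
print(M)                          # 2 / 5 = 0.4
```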
Learning the Maximum Margin Classifier
[Figure: a point x+ on the plus-plane and a point x- on the minus-plane; M = margin width = 2 / √(w·w).]
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated annealing?
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms for maximizing a quadratic function of some real-valued variables subject to linear constraints.
• Minimize both w·w (to maximize M) and the misclassification error.
Quadratic Programming
Fi d Find
T u Ru T arg max c + d u + 2 u
a11u1 + a12u2 + ... + a1mum ≤ b1 a21u1 + a22u2 + ... + a2 mum ≤ b2
Subject to
: an1u1 + an 2u2 + ... + anmum ≤ bn
n additional linear inequality constraints
a( n +1)1u1 + a( n +1) 2u2 + ... + a( n +1) mum = b( n +1)
a( n + 2 )1u1 + a( n + 2 ) 2u2 + ... + a( n + 2 ) mum = b( n + 2) : a( n + e )1u1 + a( n + e ) 2u2 + ... + a( n + e ) mum = b( n + e )
e addittional lineear equality constraints
And subject to
Quadratic criterion
Support Vector Machines: Slide 15
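To make the correspondence concrete, here is a hedged sketch using the third-party cvxopt solver (not referenced in the slides). cvxopt minimizes (1/2)uᵀPu + qᵀu subject to Gu ≤ h and Au = b, so the slide's maximization of dᵀu + uᵀRu/2 maps to P = -R and q = -d whenever R is negative semidefinite; the tiny two-variable problem below is invented for illustration:

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

# Slide's form: maximize d'u + u'Ru/2.  cvxopt minimizes (1/2)u'Pu + q'u,
# so set P = -R and q = -d (valid when R is negative semidefinite).
R = np.array([[-2.0, 0.0],
              [ 0.0, -2.0]])
d = np.array([[1.0], [1.0]])

P = matrix(-R)
q = matrix(-d)
# One linear inequality constraint: u1 + u2 <= 1  (i.e. G u <= h).
G = matrix(np.array([[1.0, 1.0]]))
h = matrix(np.array([[1.0]]))

sol = solvers.qp(P, q, G, h)
print(sol["x"])   # approximately [0.50, 0.50]
```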
Learning Maximum Margin with Noise
[Figure: the margin with a few misclassified or margin-violating points; slack values ε1, ε2, ε7 measure how far each error point sits from its correct zone.]
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the width of the margin M = 2 / √(w·w)
Assume R datapoints, each (xk, yk), where yk = +/- 1.

What should our quadratic optimization criterion be?
  Minimize  (1/2) w·w + C Σk εk
where εk is the distance of each error point to its correct place and C is the error cost.

How many constraints will we have? 2R. What should they be?
  w·xk + b ≥ +1 - εk   if yk = +1
  w·xk + b ≤ -1 + εk   if yk = -1
  εk ≥ 0               for all k
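A hedged sketch of this trade-off using scikit-learn's SVC (the library and the noisy toy data are my additions, not part of the slides): its parameter C plays the role of the error cost above, so a small C tolerates more slack (wider margin, more support vectors) while a large C penalizes violations heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping blobs (made up for illustration).
X = np.vstack([rng.normal(loc=[2, 2], scale=1.2, size=(50, 2)),
               rng.normal(loc=[-2, -2], scale=1.2, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.sqrt(np.dot(w, w))   # M = 2 / sqrt(w . w)
    print(f"C={C:>7}: margin width {margin:.3f}, "
          f"{clf.support_vectors_.shape[0]} support vectors")
```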
From LSVM to General SVM
Suppose we are in 1 dimension. What would SVMs do with this data?
[Figure: points on a line around x = 0.]
Not a big surprise
[Figure: the 1-D data split at a threshold near x = 0; the positive "plane" and negative "plane" are simply two points on the line.]
Harder 1-dimensional dataset
Points are not linearly separable. What can we do now?
[Figure: positive and negative points interleaved around x = 0.]
Harder 1-dimensional dataset
Transform the data points from 1-dim to 2-dim by some nonlinear basis function (called a kernel function). The transformed points are sometimes called feature vectors.
  zk = (xk, xk²)
Harder 1-dimensional dataset
These points are linearly separable now! The boundary can be found by QP.
  zk = (xk, xk²)
[Figure: the transformed points in the (x, x²) plane, separated by a line.]
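A small sketch of this explicit transform; the interleaved 1-D dataset and the use of scikit-learn's linear SVC are my additions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D points that are NOT linearly separable: negatives sit between positives.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Map each xk to the 2-D feature vector zk = (xk, xk^2).
Z = np.column_stack([x, x ** 2])

# In the (x, x^2) plane the classes are linearly separable, so a linear SVM works.
clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print(clf.predict(Z))   # recovers y exactly
```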
Common SVM basis functions
• zk = (polynomial terms of xk of degree 1 to q)
• zk = (radial basis functions of xk):
    zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )
• zk = (sigmoid functions of xk)
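To make the three families concrete, here is a hedged sketch of typical choices written out directly; the degree q, the centres cj, the width KW, the Gaussian form for KernelFn, and the sigmoid parameters are all free choices of mine, not values from the slides:

```python
import numpy as np

def polynomial_kernel(x, z, q=3):
    """(1 + x . z)^q implicitly contains all polynomial terms of degree 1..q."""
    return (1.0 + np.dot(x, z)) ** q

def rbf_feature(x, c_j, kw=1.0):
    """One common radial basis feature: phi_j(x) = exp(-|x - c_j|^2 / (2 * kw^2))."""
    return np.exp(-np.sum((x - c_j) ** 2) / (2.0 * kw ** 2))

def sigmoid_kernel(x, z, kappa=0.01, delta=-1.0):
    """Sigmoid (tanh) kernel: tanh(kappa * x . z + delta)."""
    return np.tanh(kappa * np.dot(x, z) + delta)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_feature(x, np.zeros(2)), sigmoid_kernel(x, z))
```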
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e., a categorical output variable with arity 2).
• What can be done?
• Answer: with N classes, learn N SVMs, one per class (see the sketch after this slide):
  – SVM 1 learns "Output == 1" vs "Output != 1"
  – SVM 2 learns "Output == 2" vs "Output != 2"
  – ...
  – SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
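A hedged sketch of this one-vs-rest scheme; scikit-learn and the three made-up 2-D classes are my additions, and each binary SVM's decision_function value stands in for "how far into the positive region" the point falls:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Three made-up classes in 2-D.
centers = np.array([[0, 4], [-4, -2], [4, -2]])
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(30, 2)) for c in centers])
labels = np.repeat([1, 2, 3], 30)

# Train N = 3 binary SVMs: class k vs the rest.
svms = {k: SVC(kernel="linear").fit(X, np.where(labels == k, 1, -1))
        for k in (1, 2, 3)}

def predict_one_vs_rest(x_new):
    # Pick the class whose SVM puts x_new furthest into its positive region.
    scores = {k: clf.decision_function(x_new.reshape(1, -1))[0]
              for k, clf in svms.items()}
    return max(scores, key=scores.get)

print(predict_one_vs_rest(np.array([0.0, 4.0])))   # expected: class 1
```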
Compare SVM with NN
• Similarity
  – SVM + sigmoid kernel ~ two-layer feedforward NN
  – SVM + Gaussian kernel ~ RBF network
  – For most problems, SVM and NN have similar performance
• Advantages
  – Based on sound mathematical theory
  – Learning result is more robust
  – Over-fitting is not common
  – Not trapped in local minima (because of QP)
  – Fewer parameters to consider (kernel, error cost C)
  – Works well with fewer training samples (the number of support vectors does not matter much)
• Disadvantages
  – Problem needs to be formulated as 2-class classification
  – Learning takes a long time (QP optimization)
Advantages and Disadvantages of SVM
• Advantages
  – Prediction accuracy is generally high
  – Robust: works when training examples contain errors
  – Fast evaluation of the learned target function
• Criticism
  – Long training time
  – Difficult to understand the learned function (weights)
  – Not easy to incorporate domain knowledge
SVM Related Links
• Representative implementations
  – LIBSVM: an efficient implementation of SVM, including multi-class classification, nu-SVM, and one-class SVM, with various interfaces for Java, Python, etc.
  – SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  – SVM-torch: another implementation, also written in C
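For a quick start without installing LIBSVM directly, scikit-learn's SVC is built on top of LIBSVM, so a minimal hedged usage example (the built-in iris dataset is chosen only for illustration) looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC   # implemented on top of LIBSVM

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF-kernel SVM; the 3-class problem is handled internally (pairwise SVMs).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```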