A Gentle Introduction to Support Vector Machines in Biomedicine Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†, Constantin F. Aliferis* (Materials about SVM Clustering were contributed by Nikita Lytkin*) *New

York University, #Vanderbilt University, †ClopiNet

Part I • Introduction • Necessary mathematical concepts • Support vector machines for binary classification: classical formulation • Basic principles of statistical machine learning

2

Introduction

3

About this tutorial Main goal: Fully understand support vector machines (and important extensions) with a modicum of mathematics knowledge. • This tutorial is both modest (it does not invent anything new) and ambitious (support vector machines are generally considered mathematically quite difficult to grasp). • Tutorial approach: learning problem main idea of the SVM solution geometrical interpretation math/theory basic algorithms extensions case studies. 4

Data-analysis problems of interest 1. Build computational classification models (or “classifiers”) that assign patients/samples into two or more classes. - Classifiers can be used for diagnosis, outcome prediction, and other classification tasks. - E.g., build a decision-support system to diagnose primary and metastatic cancers from gene expression profiles of the patients: Primary Cancer

Classifier model Patient

Biopsy

Metastatic Cancer

Gene expression profile 5

Data-analysis problems of interest 2. Build computational regression models to predict values of some continuous response variable or outcome. - Regression models can be used to predict survival, length of stay in the hospital, laboratory test values, etc. - E.g., build a decision-support system to predict optimal dosage of the drug to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data: 1 2.2 3 423 2 3 92 2 1 8

Patient

Biomarkers, clinical and demographics data

Regression model

Optimal dosage is 5 IU/Kg/week

6

Data-analysis problems of interest 3. Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for the most accurate prediction (classification or regression) of some variable of interest (e.g., phenotypic response variable). - E.g., find the most compact panel of breast cancer biomarkers from microarray gene expression data for 20,000 genes: Breast cancer tissues Normal tissues

7

Data-analysis problems of interest 4. Build a computational model to identify novel or outlier patients/samples. - Such models can be used to discover deviations in sample handling protocol when doing quality control of assays, etc. - E.g., build a decision-support system to identify aliens.

8

Data-analysis problems of interest 5. Group patients/samples into several clusters based on their similarity. - These methods can be used to discovery disease sub-types and for other tasks. - E.g., consider clustering of brain tumor patients into 4 clusters based on their gene expression profiles. All patients have the same pathological sub-type of the disease, and clustering discovers new disease subtypes that happen to have different characteristics in terms of patient survival and time to recurrence after treatment.

Cluster #1 Cluster #2

Cluster #3

Cluster #4

9

Basic principles of classification

• Want to classify objects as boats and houses. 10

Basic principles of classification

• All objects before the coast line are boats and all objects after the coast line are houses. • Coast line serves as a decision surface that separates two classes.

11

Basic principles of classification These boats will be misclassified as houses

12

Basic principles of classification Longitude

Boat

Latitude

House

• The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example. • First all objects are represented geometrically.

13

Basic principles of classification Longitude

Boat

Latitude

Then the algorithm seeks to find a decision surface that separates classes of objects

House

14

Basic principles of classification Longitude These objects are classified as houses

?

?

? ?

?

?

These objects are classified as boats

Latitude

Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if the fall above it

15

The Support Vector Machine (SVM) approach • • •

Support vector machines (SVMs) is a binary classification algorithm that offers a solution to problem #1. Extensions of the basic SVM algorithm can be applied to solve problems #1-#5. SVMs are important because of (a) theoretical reasons: - Robust to very large number of variables and small samples - Can learn both simple and highly complex classification models - Employ sophisticated mathematical principles to avoid overfitting

and (b) superior empirical results.

16

Main ideas of SVMs Gene Y

Normal patients

Cancer patients

Gene X

• Consider example dataset described by 2 genes, gene X and gene Y • Represent patients geometrically (by “vectors”)

17

Main ideas of SVMs Gene Y

Normal patients

Cancer patients

Gene X

• Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”); 18

Main ideas of SVMs Gene Y Cancer

Cancer

Decision surface

kernel Normal Normal Gene X

• If such linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found; • The feature space is constructed via very clever mathematical projection (“kernel trick”). 19

History of SVMs and usage in the literature • Support vector machine classifiers have a long history of development starting from the 1960’s. • The most important milestone for development of modern SVMs is the 1992 paper by Boser, Guyon, and Vapnik (“A training algorithm for optimal margin classifiers”) 10000 9000

Use of Support Vector Machines in the Literature 8,180

General sciences Biomedicine

8000 7000

30000

Use of Linear Regression in the Literature

8,860

General sciences Biomedicine

25000

6,660

19,200

20000

24,100 22,200 18,700

20,100 20,000 19,500

19,100 17,700

6000 4,950

5000 4000

3,530

10000

3000

0

14,900

15,500

14,900

16,000

13,500 9,770

10,800

12,000

2,330

2000 1000

15000

19,600 18,300

17,700

1,430 359

906

621 4

12

46

99

1998

1999

2000

2001

201

351

521

726

917

1,190

5000

0

2002

2003

2004

2005

2006

2007

1998

1999

2000

2001

2002

2003

2004

2005

2006

20

2007

Necessary mathematical concepts

21

How to represent samples geometrically? Vectors in n-dimensional space (Rn) • Assume that a sample/patient is described by n characteristics (“features” or “variables”) • Representation: Every sample/patient is a vector in Rn with tail at point with 0 coordinates and arrow-head at point with the feature values. • Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a vector in R2: Age (110, 29)

(0, 0)

Systolic BP

22

How to represent samples geometrically? Vectors in n-dimensional space (Rn) Patient 3

Patient 4

Age (years)

60

Patient 1

40

Patient 2

20

0 200 150

300

100

200 50

100 0

Patient id 1 2 3 4

Cholesterol (mg/dl) 150 250 140 300

Systolic BP (mmHg) 110 120 160 180

0

Age (years) 35 30 65 45

Tail of the vector (0,0,0) (0,0,0) (0,0,0) (0,0,0)

Arrow-head of the vector (150, 110, 35) (250, 120, 30) (140, 160, 65) (300, 180, 45)

23

How to represent samples geometrically? Vectors in n-dimensional space (Rn) Patient 3

Patient 4

Age (years)

60

Patient 1

40

Patient 2

20

0 200 150

300

100

200 50

100 0

0

Since we assume that the tail of each vector is at point with 0 coordinates, we will also depict vectors as points (where the arrow-head is pointing). 24

Purpose of vector representation • Having represented each sample/patient as a vector allows now to geometrically represent the decision surface that separates two groups of samples/patients. A decision surface in 7

A decision surface in R3

R2 7 6

6

5

5

4 4

3 3

2

2

1

1

0 10

0 0

1

2

3

4

5

6

7

5 0

0

1

2

3

4

5

6

7

• In order to define the decision surface, we need to introduce some basic math elements… 25

Basic operation on vectors in Rn 1. Multiplication by a scalar Consider a vector a = (a1 , a2 ,..., an ) and a scalar c Define: ca = (ca1 , ca2 ,..., can ) When you multiply a vector by a scalar, you “stretch” it in the same or opposite direction depending on whether the scalar is positive or negative. a = (1,2) c=2 c a = (2,4)

ca

4 3 2 1

4

a = (1,2) c = −1 c a = (−1,−2)

a

3 2 1

-2

-1

0

a 1

2

3

4

-1 0

1

2

3

4

ca

-2

26

Basic operation on vectors in Rn 2. Addition Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) Define: a + b = (a1 + b1 , a2 + b2 ,..., an + bn )

a = (1,2) b = (3,0) a + b = (4,2)

4 3 2 1

0

a 1

Recall addition of forces in classical mechanics.

a +b 2

b

3

4

27

Basic operation on vectors in Rn 3. Subtraction Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) Define: a − b = (a1 − b1 , a2 − b2 ,..., an − bn ) a = (1,2) b = (3,0) a − b = (−2,2)

4 3 2

-3

a −b

1

-2

0

-1

a 1

2

b

3

4

What vector dowe need to add to b to get a ? I.e., similar to subtraction of real numbers.

28

Basic operation on vectors in Rn 4. Euclidian length or L2-norm Consider a vector a = (a1 , a2 ,..., an ) 2 2 2 ... a = a + a + + a Define the L2-norm: 1 2 n 2 We often denote the L2-norm without subscript, i.e. a a = (1,2) a 2 = 5 ≈ 2.24

3 2 1

0

a 1

Length of this vector is ≈ 2.24 2

3

4

L2-norm is a typical way to measure length of a vector; other methods to measure length also exist.

29

Basic operation on vectors in Rn 5. Dot product

Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) n Define dot product: a ⋅ b = a1b1 + a2b2 + ... + anbn = ∑ ai bi i =1

The law of cosines says that a⋅ b =|| a ||2 || b ||2 cos θ where θ is the angle between a and b. Therefore, when the vectors are perpendicular a ⋅b = 0. a = (1,2) b = (3,0) a ⋅b = 3

a = (0,2) b = (3,0) a ⋅b = 0

4 3 2 1

0

a

θ 1

2

b

3

4

4 3 2

a

1

θ 0

1

2

b

3

4

30

Basic operation on vectors in Rn 5. Dot product (continued) n a ⋅ b = a1b1 + a2b2 + ... + anbn = ∑ ai bi i =1

2 • Property: a ⋅ a = a1a1 + a2 a2 + ... + an an = a 2 • In the classical regression equation y = w ⋅ x + b the response variable y is just a dot product of the vector representing patient characteristics (x ) and the regression weights vector (w ) which is common across all patients plus an offset b. 31

Hyperplanes as decision surfaces • A hyperplane is a linear decision surface that splits the space into two parts; • It is obvious that a hyperplane is a binary classifier. A hyperplane in R2 is a line

A hyperplane in R3 is a plane 7

7

6 6

5 5

4 4

3

3

2 1

2

0 10

1

5 0 0

1

2

3

4

5

6

7

0

0

1

2

3

A hyperplane in Rn is an n-1 dimensional subspace

4

5

6

7

32

Equation of a hyperplane First we show with show the definition of hyperplane by an interactive demonstration.

Click here for demo to begin

or go to http://www.dsl-lab.org/svm_tutorial/planedemo.html

Source: http://www.math.umn.edu/~nykamp/ 33

Equation of a hyperplane Consider the case of R3:

w

P

x − x0

x

An equation of a hyperplane is defined by a point (P0) and a perpendicular vector to the plane (w) at that point.

P0

x0 O

Define vectors: x0 = OP0 and x = OP , where P is an arbitrary point on a hyperplane.

A condition for P to be on the plane is that the vector x − x0 is perpendicular to w : w ⋅ ( x − x0 ) = 0 or define b = − w ⋅ x0 w ⋅ x − w ⋅ x0 = 0 w⋅ x + b = 0 The above equations also hold for Rn when n>3.

34

Equation of a hyperplane Example

w = (4,−1,6) P0 = (0,1,−7) b = − w ⋅ x0 = −(0 − 1 − 42) = 43 ⇒ w ⋅ x + 43 = 0 ⇒ (4,−1,6) ⋅ x + 43 = 0 ⇒ (4,−1,6) ⋅ ( x(1) , x( 2 ) , x(3) ) + 43 = 0 ⇒ 4 x(1) − x( 2 ) + 6 x(3) + 43 = 0

+ direction

w P0

w ⋅ x + 50 = 0 w ⋅ x + 43 = 0 w ⋅ x + 10 = 0

- direction

What happens if the b coefficient changes? The hyperplane moves along the direction of w . We obtain “parallel hyperplanes”.

Distance between two parallel hyperplanes w ⋅ x + b1 = 0 and w ⋅ x + b2 = 0 is equal to D = b1 − b2 / w .

35

(Derivation of the distance between two parallel hyperplanes) tw w x2

w ⋅ x + b2 = 0 w ⋅ x + b1 = 0

x1

x2 = x1 + tw D = tw = t w w ⋅ x2 + b2 = 0 w ⋅ ( x1 + tw) + b2 = 0 2 w ⋅ x1 + t w + b2 = 0 2 ( w ⋅ x1 + b1 ) − b1 + t w + b2 = 0 2 − b1 + t w + b2 = 0 2 t = (b1 − b2 ) / w ⇒ D = t w = b1 − b2 / w 36

Recap We know… • How to represent patients (as “vectors”) • How to define a linear decision surface (“hyperplane”) We need to know… • How to efficiently compute the hyperplane that separates two classes with the largest “gap”? Gene Y

Need to introduce basics of relevant optimization theory Normal patients

Cancer patients

Gene X 37

Basics of optimization: Convex functions • A function is called convex if the function lies below the straight line segment connecting two points, for any two points in the interval. • Property: Any local minimum is a global minimum!

Local minimum Global minimum

Convex function

Global minimum

Non-convex function 38

Basics of optimization: Quadratic programming (QP) • Quadratic programming (QP) is a special optimization problem: the function to optimize (“objective”) is quadratic, subject to linear constraints. • Convex QP problems have convex objective functions. • These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).

39

Basics of optimization: Example QP problem Consider x = ( x1 , x 2 ) 1 2 || x ||2 subject to x1 +x 2 −1 ≥ 0 Minimize 2 quadratic objective

linear constraints

This is QP problem, and it is a convex QP as we will see later We can rewrite it as: Minimize

1 2 ( x1 + x22 ) subject to x1 +x 2 −1 ≥ 0 2 quadratic objective

linear constraints 40

Basics of optimization: Example QP problem f(x1,x2)

x1 +x 2 −1 ≥ 0 x1 +x 2 −1

1 2 ( x1 + x22 ) 2

x1 +x 2 −1 ≤ 0 x2 x1 The solution is x1=1/2 and x2=1/2.

41

Congratulations! You have mastered all math elements needed to understand support vector machines. Now, let us strengthen your knowledge by a quiz 42

Quiz 1) Consider a hyperplane shown with white. It is defined by equation: w ⋅ x + 10 = 0 Which of the three other hyperplanes can be defined by equation: w ⋅ x + 3 = 0 ?

w P0

- Orange - Green - Yellow 4

a

3

2) What is the dot product between vectors a = (3,3) and b = (1,−1) ?

2 1

-2

-1

0 -1

1

2

3

4

b

-2

43

Quiz 4

a

3

3) What is the dot product between vectors a = (3,3) and b = (1,0) ?

2 1

-2

-1

0

b

1

2

3

4

3

4

-1 -2

4 3

4) What is the length of a vector a = (2,0) and what is the length of all other red vectors in the figure?

2

a

1

-2

-1

0

1

2

-1 -2

44

Quiz 5) Which of the four functions is/are convex?

1

2

3

4 45

Support vector machines for binary classification: classical formulation

46

Case 1: Linearly separable data; “Hard-margin” linear SVM Given training data:

Negative instances (y=-1)

x1 , x2 ,..., x N ∈ R n

y1 , y2 ,..., y N ∈ {−1,+1}

Positive instances (y=+1)

• Want to find a classifier (hyperplane) to separate negative instances from the positive ones. • An infinite number of such hyperplanes exist. • SVMs finds the hyperplane that maximizes the gap between data points on the boundaries (so-called “support vectors”). • If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well. 47

Statement of linear SVM classifier w⋅ x + b = 0

w ⋅ x + b = −1

w ⋅ x + b = +1

The gap is distance between parallel hyperplanes:

w ⋅ x + b = −1 and w ⋅ x + b = +1

Or equivalently:

w ⋅ x + (b + 1) = 0 w ⋅ x + (b − 1) = 0

We know that Negative instances (y=-1)

Positive instances (y=+1)

D = b1 − b2 / w

Therefore:

D = 2/ w

Since we want to maximize the gap,

we need to minimize w

or equivalently minimize

1 2

2 w

(

1 2

is convenient for taking derivative later on) 48

Statement of linear SVM classifier w ⋅ x + b ≤ −1

w⋅ x + b = 0 w ⋅ x + b ≥ +1

In addition we need to impose constraints that all instances are correctly classified. In our case:

w ⋅ xi + b ≤ −1 if yi = −1 w ⋅ xi + b ≥ +1 if yi = +1 Equivalently:

yi ( w ⋅ xi + b) ≥ 1

Negative instances (y=-1)

Positive instances (y=+1)

In summary: 2 1 w Want to minimize 2 subject to yi ( w ⋅ xi + b) ≥ 1 for i = 1,…,N Then given a new instance x, the classifier is f ( x ) = sign( w ⋅ x + b) 49

SVM optimization problem: Primal formulation n

Minimize

1 2

2 w ∑ i i =1

subject to

Objective function

yi ( w ⋅ xi + b) − 1 ≥ 0

for i = 1,…,N

Constraints

• This is called “primal formulation of linear SVMs”. • It is a convex quadratic programming (QP) optimization problem with n variables (wi, i = 1,…,n), where n is the number of features in the dataset.

50

SVM optimization problem: Dual formulation • The previous problem can be recast in the so-called “dual form” giving rise to “dual formulation of linear SVMs”. • It is also a convex quadratic programming problem but with N variables (αi ,i = 1,…,N), where N is the number of samples. N Maximize ∑ α i − ∑ α iα j yi y j xi ⋅ x j subject to α i ≥ 0 and ∑ α i yi = 0 . N

i =1

N

1 2

i , j =1

i =1

Objective function

Constraints

N Then the w-vector is defined in terms of αi: w = ∑ α i yi xi i =1

And the solution becomes: f ( x ) = sign(∑ α i yi xi ⋅ x + b) N

i =1

51

SVM optimization problem: Benefits of using dual formulation 1) No need to access original data, need to access only dot products. Objective function: ∑ α i − ∑ α iα j yi y j xi ⋅ x j N

i =1

N

1 2

i , j =1

Solution: f ( x ) = sign(∑ α i yi xi ⋅ x + b) N

i =1

2) Number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems). E.g., if a microarray dataset contains 20,000 genes and 100 patients, then need to find only up to 100 parameters! 52

(Derivation of dual formulation) n

Minimize

1 2

2 w subject to y ( w ⋅ xi + b) − 1 ≥ 0 for i = 1,…,N ∑ i i i =1

Constraints

Objective function

Apply the method of Lagrange multipliers.

( Λ w Define Lagrangian P , b, α ) =

( w − α y ( w ∑ ∑ i i ⋅ xi + b) − 1) n

1 2

i =1

N

2 i

i =1

a vector with n elements a vector with N elements

We need to minimize this Lagrangian with respect to w, b and simultaneously require that the derivative with respect to α vanishes , all subject to the constraints that α i ≥ 0.

53

(Derivation of dual formulation) If we set the derivatives with respect to w, b to 0, we obtain: N ∂Λ P (w, b, α ) = 0 ⇒ ∑ α i yi = 0 ∂b i =1 N ∂Λ P (w, b, α ) α 0 w y x = ⇒ = ∑ i i i ∂w i =1

We substitute the above into the equation for Λ P (w,b, α ) and obtain “dual formulation of linear SVMs”:

Λ D (α ) = ∑ α i − ∑ α iα j yi y j xi ⋅ x j N

i =1

N

1 2

i , j =1

We seek to maximize the above Lagrangian with respect to α , subject to the N

constraints that α i ≥ 0 and ∑ α i yi = 0 . i =1

54

Case 2: Not linearly separable data; “Soft-margin” linear SVM What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear.

0 0 0

0

0 0

0 0

0 0

0 0 0

0

0

Want to handle this case without changing the family of decision functions. Approach: Assign a “slack variable” to each instance ξ i ≥ 0 , which can be thought of distance from the separating hyperplane if an instance is misclassified and 0 otherwise. N 2 1 Want to minimize 2 w + C ∑ ξ i subject to yi ( w ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N i =1 f ( x ) = sign ( w ⋅ x + b) Then given a new instance x, the classifier is

55

Two formulations of soft-margin linear SVM Primal formulation: n

Minimize

1 2

N

w + C ξ y w ( ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N subject to ∑ ∑ i i i =1

2 i

i =1

Objective function

Constraints

Dual formulation: N Minimize ∑ α i − 12 ∑ α iα j yi y j xi ⋅ x j subject to 0 ≤ α i ≤ C and ∑ α i yi = 0 n

N

i =1

i , j =1

for i = 1,…,N.

Objective function

i =1

Constraints

56

Parameter C in soft-margin SVM Minimize

1 2

N 2 w + C ∑ ξ i subject to yi ( w ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N i =1

C=100

C=1

C=0.15

C=0.1

• When C is very large, the softmargin SVM is equivalent to hard-margin SVM; • When C is very small, we admit misclassifications in the training data at the expense of having w-vector with small norm; • C has to be selected for the distribution at hand as it will be discussed later in this tutorial. 57

Case 3: Not linearly separable data; Kernel trick Gene 2

Tumor

Tumor

?

kernel

Φ

?

Normal

Normal Gene 1

Data is not linearly separable in the input space

Data is linearly separable in the feature space obtained by a kernel

Φ : RN → H

58

Kernel trick Original data x (in input space) f ( x) = sign( w ⋅ x + b)

Data in a higher dimensional feature space Φ (x ) f ( x) = sign( w ⋅ Φ ( x ) + b)

N w = ∑ α i yi xi

N w = ∑ α i yi Φ ( xi )

i =1

i =1

f ( x) = sign(∑ α i yi Φ ( xi ) ⋅ Φ ( x ) + b) N

i =1

f ( x) = sign(∑ α i yi K ( xi ,x ) + b) N

i =1

Therefore, we do not need to know Ф explicitly, we just need to define function K(·, ·): RN × RN R. Not every function RN × RN R can be a valid kernel; it has to satisfy so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable. 59

Popular kernels A kernel is a dot product in some feature space:

K ( xi , x j ) = Φ ( xi ) ⋅ Φ ( x j )

Examples:

K ( xi , x j ) = xi ⋅ x j 2 K ( xi , x j ) = exp(−γ xi − x j ) K ( xi , x j ) = exp(−γ xi − x j ) q K ( xi , x j ) = ( p + xi ⋅ x j ) 2 K ( xi , x j ) = ( p + xi ⋅ x j ) q exp(−γ xi − x j ) K ( xi , x j ) = tanh(kxi ⋅ x j − δ )

Linear kernel Gaussian kernel Exponential kernel Polynomial kernel Hybrid kernel Sigmoidal 60

Understanding the Gaussian kernel 2 Consider Gaussian kernel: K ( x , x j ) = exp(−γ x − x j ) Geometrically, this is a “bump” or “cavity” centered at the training data point x j : "bump” “cavity”

The resulting mapping function is a combination of bumps and cavities.

61

Understanding the Gaussian kernel Several more views of the data is mapped to the feature space by Gaussian kernel

62

Understanding the Gaussian kernel Linear hyperplane that separates two classes

63

Understanding the polynomial kernel 3 Consider polynomial kernel: K ( xi , x j ) = (1 + xi ⋅ x j ) Assume that we are dealing with 2-dimensional data (i.e., in R2). Where will this kernel map the data? 2-dimensional space

x(1)

x( 2 )

kernel 10-dimensional space

1 x(1)

x( 2 )

x(21)

x(22 )

x(1) x( 2 )

x(31)

x(32 )

x(1) x(22 )

x(21) x( 2 ) 64

Example of benefits of using a kernel x( 2 ) x1 x3

x4 x2

• Data is not linearly separable in the input space (R2). • Apply kernel K ( x , z ) = ( x ⋅ z ) 2 to map data to a higher x(1) dimensional space (3dimensional) where it is linearly separable. 2

2 2 x(1) z(1) = x(1) z(1) + x( 2 ) z( 2 ) = K ( x, z ) = ( x ⋅ z ) = ⋅ x z ( 2 ) ( 2 ) x(21) z(21) 2 2 2 2 = x(1) z(1) + 2 x(1) z(1) x( 2 ) z( 2 ) + x( 2 ) z( 2 ) = 2 x(1) x( 2 ) ⋅ 2 z(1) z( 2 ) = Φ ( x ) ⋅ Φ ( z ) 2 2 x z ( 2) ( 2) 65

[

]

Example of benefits of using a kernel x(21) Therefore, the explicit mapping is Φ ( x ) = 2 x(1) x( 2 ) 2 x ( 2) x( 2 )

x(22 )

x1 x3

x4

x2

x1 , x2

kernel x(1)

Φ

x3 , x4

x(21)

2 x(1) x( 2 )

66

Comparison with methods from classical statistics & regression • Need ≥ 5 samples for each parameter of the regression model to be estimated: Number of variables

Polynomial degree

Number of parameters

Required sample

2

3

10

50

10

3

286

1,430

10

5

3,003

15,015

100

3

176,851

884,255

100

5

96,560,646

482,803,230

• SVMs do not have such requirement & often require much less sample than the number of variables, even when a high-degree polynomial kernel is used. 67

Basic principles of statistical machine learning

68

Generalization and overfitting • Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples. • Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples. • Overfitting Poor generalization.

69

Example of overfitting and generalization There is a linear relationship between predictor and outcome (plus some Gaussian noise). Algorithm 2 Outcome of Interest Y

Outcome of Interest Y

Algorithm 1 Training Data Test Data

Predictor X

Predictor X

• Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization. • Algorithm 2 learned general characteristics of the function that produced the data. Thus, it generalizes. 70

“Loss + penalty” paradigm for learning to avoid overfitting and ensure generalization • Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:

Minimize (Loss + λ Penalty) • Loss measures error of fitting the data • Penalty penalizes complexity of the learned function • λ is regularization parameter that balances Loss and Penalty

71

SVMs in “loss + penalty” form SVMs build the following classifiers: f ( x ) = sign( w ⋅ x + b)

Consider soft-margin linear SVM formulation: Find w and b that N 2 1 w + C ξ Minimize 2 ∑ i subject to yi (w ⋅ xi + b) ≥ 1 − ξi for i = 1,…,N i =1

This can also be stated as: Find w and b that N 2 Minimize ∑ [1 − yi f ( xi )]+ +λ w 2 i =1

Loss (“hinge loss”)

Penalty

(in fact, one can show that λ = 1/(2C)). 72

Meaning of SVM loss function [ 1 ( y f x − Consider loss function: ∑ i i )]+ N

i =1

• Recall that […]+ indicates the positive part • For a given sample/patient i, the loss is non-zero if 1 − yi f ( xi ) > 0 • In other words, yi f ( xi ) < 1 • Since yi = {−1,+1}, this means that the loss is non-zero if f ( xi ) < 1 for yi = +1 f ( xi ) > −1 for yi= -1

• In other words, the loss is non-zero if w ⋅ xi + b < 1 for yi = +1 w ⋅ xi + b > −1 for yi= -1

73

Meaning of SVM loss function • If the instance is negative, it is penalized only in regions 2,3,4 • If the instance is positive, it is penalized only in regions 1,2,3

w ⋅ x + b = −1

w⋅ x + b = 0 w ⋅ x + b = +1

1 2 3 4

Negative instances (y=-1)

Positive instances (y=+1)

74

Flexibility of “loss + penalty” framework Minimize (Loss + λ Penalty) Loss function

Penalty function

[ 1 − y f ( x ∑ i i )]+ N

Hinge loss:

i =1

Mean squared error:

∑(y i =1

i

Mean squared error: N

∑(y i =1

i

∑ ( yi − f ( xi )) 2 N

2

SVMs

Ridge regression

λ w1

− f ( xi )) 2

Mean squared error:

2

λw2

− f ( xi )) 2

N

λw2

Resulting algorithm

Lasso

λ1 w 1 + λ2 w 2 2

Elastic net

i =1

Hinge loss:

[ 1 − y f ( x ∑ i i )]+ N

i =1

λ w1

1-norm SVM 75

Part 2 • Model selection for SVMs • Extensions to the basic SVM model: 1. 2. 3. 4. 5. 6.

SVMs for multicategory data Support vector regression Novelty detection with SVM-based methods Support vector clustering SVM-based variable selection Computing posterior class probabilities for SVM classifiers 76

Model selection for SVMs

77

Need for model selection for SVMs Gene 2

Gene 2 Tumor

Tumor

Normal Gene 1

• It is impossible to find a linear SVM classifier that separates tumors from normals! • Need a non-linear SVM classifier, e.g. SVM with polynomial kernel of degree 2 solves this problem without errors.

Normal

Gene 1

• We should not apply a non-linear SVM classifier while we can perfectly solve this problem using a linear SVM classifier! 78

A data-driven approach for model selection for SVMs • Do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset? • Need to examine various combinations of parameters, e.g. consider searching the following grid: Polynomial degree d

Parameter C

(0.1, 1)

(1, 1)

(10, 1)

(100, 1)

(1000, 1)

(0.1, 2)

(1, 2)

(10, 2)

(100, 2)

(1000, 2)

(0.1, 3)

(1, 3)

(10, 3)

(100, 3)

(1000, 3)

(0.1, 4)

(1, 4)

(10, 4)

(100, 4)

(1000, 4)

(0.1, 5)

(1, 5)

(10, 5)

(100, 5)

(1000, 5)

• How to search this grid while producing an unbiased estimate 79 of classification performance?

Nested cross-validation Recall the main idea of cross-validation: test data

train

train test train

What combination of SVM parameters to apply on training data?

test

valid

train

train

valid

train

Perform “grid search” using another nested loop of cross-validation. 80

Example of nested cross-validation

data

Consider that we use 3-fold cross-validation and we want to optimize parameter C that takes values “1” and “2”.

{

Outer Loop P1

Training Testing Average C Accuracy set set Accuracy P1, P2 P3 1 89% 83% P1,P3 P2 2 84% P2, P3 P1 1 76%

P2 …

P3

…

Inner Loop Training Validation set set

P1 P2 P1 P2

P2 P1 P2 P1

C 1 2

Accuracy

86% 84% 70% 90%

Average Accuracy

85% 80%

}

choose C=1 81

On use of cross-validation • Empirically we found that cross-validation works well for model selection for SVMs in many problem domains; • Many other approaches that can be used for model selection for SVMs exist, e.g.:

Generalized cross-validation Bayesian information criterion (BIC) Minimum description length (MDL) Vapnik-Chernovenkis (VC) dimension Bootstrap

82

SVMs for multicategory data

83

One-versus-rest multicategory SVM method Gene 2

? Tumor I

**** * * * * * * * ** * * * *

?

Tumor II

Tumor III Gene 1

84

One-versus-one multicategory SVM method Gene 2

Tumor I

* * * * * Tumor II * * * * * * * * * * * * ? ?

Tumor III Gene 1

85

DAGSVM multicategory SVM method AML vs.vs. ALL T-cell AML ALL T-cell Not AML

Not ALL T-cell

ALL ALLB-cell B-cellvs. vs.ALL ALLT-cell T-cell Not ALL B-cell

ALL T-cell

Not ALL T-cell

AML vs. ALL B-cell Not AML

ALLALL B-cell B-cell

Not ALL B-cell

AML

86

SVM multicategory methods by Weston and Watkins and by Crammer and Singer Gene 2

? ?

* * * * * * * * * * * * * * * * *

Gene 1

87

Support vector regression

88

ε-Support vector regression (ε-SVR) Given training data: y

*

** ** * * * * *** * ** * *

x1 , x2 ,..., x N ∈ R n y1 , y2 ,..., y N ∈ R +ε -ε

x

Main idea: Find a function f ( x ) = w ⋅ x + b that approximates y1,…,yN : • it has at most ε derivation from the true values yi • it is as “flat” as possible (to avoid overfitting)

E.g., build a model to predict survival of cancer patients that can admit a one month error (= ε ).

89

Formulation of “hard-margin” ε-SVR y

* * *

* * * * * * ** * * * *

+ε w ⋅ x + b = 0 -ε f ( x ) = w⋅ x + b Find

2 w subject

by minimizing to constraints: yi − ( w ⋅ x + b ) ≤ ε yi − ( w ⋅ x + b) ≥ −ε 1 2

for i = 1,…,N. x

I.e., difference between yi and the fitted function should be smaller than ε and larger than -ε all points yi should be in the “ε-ribbon” around the fitted function. 90

Formulation of “soft-margin” ε-SVR y

*

* * * * * * ** * * * * * * * *

+ε -ε

x

If we have points like this (e.g., outliers or noise) we can either: a) increase ε to ensure that these points are within the new ε-ribbon, or b) assign a penalty (“slack” variable) to each of this points (as was done for “soft-margin” SVMs)

91

Formulation of “soft-margin” ε-SVR y

*

ξi

* * * * * * ** * * ξi* * * * * * *

+ε -ε

f ( x ) = w⋅ x + b Find N 2 by minimizing 12 w + C ∑ (ξ i + ξ i* ) i =1 subject to constraints: yi − ( w ⋅ x + b ) ≤ ε + ξ i yi − ( w ⋅ x + b) ≥ −ε − ξ i*

ξ i , ξ i* ≥ 0 for i = 1,…,N. x

Notice that only points outside ε-ribbon are penalized! 92

Nonlinear ε-SVR y

* * *** * * ** ** *

+ε * * * -ε

y

** *

** ** * * * * *** *

x +Φ(ε) -Φ(ε)

Φ(x)

Cannot approximate well this function with small ε!

kernel

Φ

Φ

−1

y

* * *** * * ** ** *

* * * +ε -ε

x 93

ε-Support vector regression in “loss + penalty” form

Build decision function of the form: f ( x ) = w ⋅ x + b Find w and b that

2 Minimize ∑ max(0, | yi − f ( xi ) | −ε ) +λ w 2 N

i =1

Loss (“linear ε-insensitive loss”)

Penalty

Loss function value

Error in approximation

94

Comparing ε-SVR with popular regression methods Loss function Linear ε-insensitive loss: y − f x max( 0 , | ( ∑ i i ) | −ε ) N

Penalty function

2

2

2

2

λw2

Resulting algorithm ε-SVR

i =1

Quadratic ε-insensitive loss: 2 max( 0 , ( ( y − f x ∑ i i )) − ε ) N

λw2

Another variant of ε-SVR

i =1

Mean squared error: 2 − y f x ( ( ∑ i i )) N

λw2

Ridge regression

i =1

Mean linear error: N

| y − f ( x ∑ i i)| i =1

λw2

Another variant of ridge regression 95

Comparing loss functions of regression methods Linear ε-insensitive loss Loss function value

-ε

ε Error in approximation

Mean squared error Loss function value

-ε

ε Error in approximation

Quadratic ε-insensitive loss Loss function value

-ε

ε Error in approximation

Mean linear error Loss function value

-ε

ε Error in approximation 96

Applying ε-SVR to real data In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid-search): • parameter C • parameter ε • kernel parameters (e.g., degree of polynomial) Notice that parameter ε depends on the ranges of variables in the dataset; therefore it is recommended to normalize/re-scale data prior to applying ε-SVR. 97

Novelty detection with SVM-based methods

98

What is it about? Decision function = +1 Predictor Y

• Find the simplest and most compact region in the space of predictors where the majority of data samples “live” (i.e., with the highest density of samples). • Build a decision function that takes value +1 in this region and -1 elsewhere. • Once we have such a decision function, we can identify novel or outlier samples/patients in the data.

******* ******** ***** ******** ** * * * **************** * * ************************** * * Decision function = -1

* *

Predictor X

99

Key assumptions •

•

We do not know classes/labels of samples (positive or negative) in the data available for learning this is not a classification problem All positive samples are similar but each negative sample can be different in its own way

Thus, do not need to collect data for negative samples!

100

Sample applications “Normal” “Novel”

“Novel” Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt

“Novel” 101

Sample applications Discover deviations in sample handling protocol when doing quality control of assays. Protein Y

* * *

Samples with low quality of processing from the lab of Dr. Smith

* * Samples with low quality of processing from ICU patients

********* ** ************* * *********** * *** * * ** ** * * **

Samples with high-quality of processing

Samples with low quality of processing from infants

Protein X

Samples with low quality of processing from patients with lung cancer

102

Sample applications Identify websites that discuss benefits of proven cancer treatments.

** ** *

Weighted frequency of word Y Websites that discuss cancer prevention methods

** * * * ** ******* * *********** * * * * * * * ******** ***** * ** ********** * * * * * ** *** *

Websites that discuss benefits of proven cancer treatments

Websites that discuss side-effects of proven cancer treatments

* ** *

*** * Blogs of cancer patients ** *

Weighted frequency of word X

Websites that discuss unproven cancer treatments 103

One-class SVM Main idea: Find the maximal gap hyperplane that separates data from the origin (i.e., the only member of the second class is the origin).

w⋅ x + b = 0

ξi Origin is the only member of the second class

ξj Use “slack variables” as in soft-margin SVMs to penalize these instances

104

Formulation of one-class SVM: linear case

Given training data: x1 , x2 ,..., x N ∈ R n

Find f ( x ) = sign( w ⋅ x + b) by minimizing

w⋅ x + b = 0

ξi

ξj

i.e., the decision function should be positive in all training samples except for small deviations

1 2

w

2

1 + νN

N

∑ξ i =1

i

+b

subject to constraints: w ⋅ x + b ≥ −ξ i ξi ≥ 0 upper bound on for i = 1,…,N.

the fraction of outliers (i.e., points outside decision surface) allowed in the data 105

Formulation of one-class SVM: linear and non-linear cases Linear case

Non-linear case (use “kernel trick”)

Find f ( x ) = sign( w ⋅ x + b) by minimizing

1 2

2 1 w + νN

subject to constraints: w ⋅ x + b ≥ −ξ i ξi ≥ 0 for i = 1,…,N.

Find f ( x ) = sign( w ⋅ Φ ( x ) + b)

N

∑ξ i =1

i

+b

by minimizing

1 2

2 1 w + νN

N

∑ξ i =1

i

subject to constraints: w ⋅ Φ ( x ) + b ≥ −ξ i ξi ≥ 0 for i = 1,…,N.

106

+b

More about one-class SVM • One-class SVMs inherit most of properties of SVMs for binary classification (e.g., “kernel trick”, sample efficiency, ease of finding of a solution by efficient optimization method, etc.); • The choice of other parameter ν significantly affects the resulting decision surface. • The choice of origin is arbitrary and also significantly affects the decision surface returned by the algorithm.

107

Support vector clustering Contributed by Nikita Lytkin

108

Goal of clustering (aka class discovery)

Given a heterogeneous set of data points x1 , x2 ,..., xN ∈ R n Assign labels y1 , y2 ,..., y N ∈ {1,2,..., K } such that points with the same label are highly “similar“ to each other and are distinctly different from the rest

Clustering process

109

Support vector domain description • Support Vector Domain Description (SVDD) of the data is a set of vectors lying on the surface of the smallest hyper-sphere enclosing all data points in a feature space – These surface points are called Support Vectors

kernel

Φ

R

R R

110

SVDD optimization criterion Formulation with hard constraints: Minimize

subject to

R2

Squared radius of the sphere

|| Φ ( xi ) − a ||2 ≤ R 2

for i = 1,…,N

Constraints

R

R

a' R'

a

R

111

Main idea behind Support Vector Clustering • Cluster boundaries in the input space are formed by the set of points that when mapped from the input space to the feature space fall exactly on the surface of the minimal enclosing hyper-sphere – SVs identified by SVDD are a subset of the cluster boundary points

Φ

R

−1

R

a

R

112

Cluster assignment in SVC • Two points xi and x j belong to the same cluster (i.e., have the same label) if every point of the line segment ( xi , x j ) projected to the feature space lies within the hypersphere Some points lie outside the hyper-sphere

Φ

R

R

a

Every point is within the hypersphere in the feature space

R

113

Cluster assignment in SVC (continued) • Point-wise adjacency matrix is constructed by testing the line segments between every pair of points • Connected components are extracted • Points belonging to the same connected component are assigned the same label

C A

E

B D

A B C D E

C A

A

B

C

D

E

1

1

1

0

0

1

1

0

0

1

0

0

1

1 1

E

B D

114

Effects of noise and cluster overlap • In practice, data often contains noise, outlier points and overlapping clusters, which would prevent contour separation and result in all points being assigned to the Noise same cluster Ideal data

Typical data

Outliers

Overlap 115

SVDD with soft constraints • SVC can be used on noisy data by allowing a fraction of points, called Bounded SVs (BSV), to lie outside the hyper-sphere – BSVs are not considered as cluster boundary points and are not assigned to clusters by SVC Noise

Noise

Typical data

kernel

Φ Outliers

Outliers

R

R

a

R

Overlap Overlap

116

Soft SVDD optimization criterion Primal formulation with soft constraints: Minimize

R 2 subject to

Squared radius of the sphere

Introduction of slack variables ξ i mitigates the influence of noise and overlap on the clustering process

|| Φ ( xi ) − a ||2 ≤ R 2 + ξ i ξ i ≥ 0 for i = 1,…,N Soft constraints Outliers Noise

R

R + ξi

Overlap

R

a

117

Dual formulation of soft SVDD Minimize W = ∑ β i K ( xi , xi ) − ∑ β i β j K ( xi , x j ) i

i, j

subject to 0 ≤ β i ≤ C for i = 1,…,N Constraints

• As before, K ( xi , x j ) = Φ ( xi ) ⋅ Φ ( x j ) denotes a kernel function • Parameter 0 < C ≤ 1 gives a trade-off between volume of the sphere and the number of errors (C=1 corresponds to hard constraints) 2 • Gaussian kernel K ( xi , x j ) = exp(−γ xi − x j ) tends to yield tighter contour representations of clusters than the polynomial kernel • The Gaussian kernel width parameter γ > 0 influences tightness of cluster boundaries, number of SVs and the number of clusters • Increasing γ causes an increase in the number of clusters 118

SVM-based variable selection

119

Understanding the weight vector w Recall standard SVM formulation:

w⋅ x + b = 0

Find w and b that minimize 2 1 subject to yi ( w ⋅ xi + b) ≥ 1 2 w

for i = 1,…,N.

Use classifier: f ( x ) = sign( w ⋅ x + b)

Negative instances (y=-1)

Positive instances (y=+1)

• The weight vector w contains as many elements as there are input n variables in the dataset, i.e. w ∈ R . • The magnitude of each element denotes importance of the corresponding variable for classification task.

120

Understanding the weight vector w x2

w = (1,1) 1x1 + 1x2 + b = 0

w = (1,0) 1x1 + 0 x2 + b = 0

x2

x1

X1 and X2 are equally important x2

x1

X1 is important, X2 is not

w = (0,1) 0 x1 + 1x2 + b = 0

x3

w = (1,1,0) 1x1 + 1x2 + 0 x3 + b = 0 x2

x1

X2 is important, X1 is not

x1

X1 and X2 are equally important, X3 is not 121

Understanding the weight vector w Gene X2

Melanoma Nevi

SVM decision surface

w = (1,1) 1x1 + 1x2 + b = 0

Decision surface of another classifier w = (1,0) 1x1 + 0 x2 + b = 0

True model X1 Gene X1

• In the true model, X1 is causal and X2 is redundant X2 Phenotype •SVM decision surface implies that X1 and X2 are equally important; thus it is locally causally inconsistent •There exists a causally consistent decision surface for this example •Causal discovery algorithms can identify that X1 is causal and X2 is redundant 122

Simple SVM-based variable selection algorithm Algorithm: 1. Train SVM classifier using data for all variables to estimate vector w 2. Rank each variable based on the magnitude of the corresponding element in vector w 3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.

123

Simple SVM-based variable selection algorithm Consider that we have 7 variables: X1, X2, X3, X4, X5, X6, X7 The vector w is: (0.1, 0.3, 0.4, 0.01, 0.9, -0.99, 0.2) The ranking of variables is: X6, X5, X3, X2, X7, X1, X4 Classification accuracy

Subset of variables X6

X5

X3

X2

X7

X1

X6

X5

X3

X2

X7

X1

X6

X5

X3

X2

X7

X6

X5

X3

X2

X6

X5

X3

X6

X5

X6

X4

0.920 0.920 0.919 0.852

Best classification accuracy Classification accuracy that is statistically indistinguishable from the best one

0.843 0.832 0.821

Select the following variable subset: X6, X5, X3, X2 , X7

124

Simple SVM-based variable selection algorithm • SVM weights are not locally causally consistent we may end up with a variable subset that is not causal and not necessarily the most compact one. • The magnitude of a variable in vector w estimates the effect of removing that variable on the objective function of SVM (e.g., function that we want to minimize). However, this algorithm becomes suboptimal when considering effect of removing several variables at a time… This pitfall is addressed in the SVM-RFE algorithm that is presented next. 125

SVM-RFE variable selection algorithm Algorithm: 1. Initialize V to all variables in the data 2. Repeat 3. Train SVM classifier using data for variables in V to estimate vector w 4. Estimate prediction accuracy of variables in V using the above SVM classifier (e.g., by cross-validation) 5. Remove from V a variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector w 6. Until there are no variables in V 7. Select the smallest subset of variables with the best prediction accuracy 126

SVM-RFE variable selection algorithm 10,000 genes

SVM model Prediction accuracy

5,000 genes

SVM model Prediction accuracy

Important for classification

5,000 genes

Discarded

Not important for classification

2,500 genes

…

Important for classification

2,500 genes

Discarded

Not important for classification

• Unlike simple SVM-based variable selection algorithm, SVMRFE estimates vector w many times to establish ranking of the variables. • Notice that the prediction accuracy should be estimated at each step in an unbiased fashion, e.g. by cross-validation. 127

SVM variable selection in feature space The real power of SVMs comes with application of the kernel trick that maps data to a much higher dimensional space (“feature space”) where the data is linearly separable. Gene 2

Tumor

Tumor

?

kernel

Φ

?

Normal

Normal Gene 1

input space

feature space 128

SVM variable selection in feature space • We have data for 100 SNPs (X1,…,X100) and some phenotype. • We allow up to 3rd order interactions, e.g. we consider: • X1,…,X100 • X12,X1X2, X1X3,…,X1X100 ,…,X1002 • X13,X1X2X3, X1X2X4,…,X1X99X100 ,…,X1003

• Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy of the phenotype. • Challenge: If we have limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as it is done in classical statistics. 129

SVM variable selection in feature space Heuristic solution: Apply algorithm SVM-FSMB that: 1. Uses SVMs with polynomial kernel of degree 3 and selects M features (not necessarily input variables!) that have largest weights in the feature space. E.g., the algorithm can select features like: X10, (X1X2), (X9X2X22), (X72X98), and so on. 2. Apply HITON-MB Markov blanket algorithm to find the Markov blanket of the phenotype using M features from step 1. 130

Computing posterior class probabilities for SVM classifiers

131

Output of SVM classifier 1. SVMs output a class label (positive or negative) for each sample: sign( w ⋅ x + b) 2. One can also compute distance from the hyperplane that separates classes, e.g. w ⋅ x + b. These distances can be used to compute performance metrics like area under ROC curve.

w⋅ x + b = 0

Negative samples (y=-1)

Positive samples (y=+1)

Question: How can one use SVMs to estimate posterior class probabilities, i.e., P(class positive | sample x)? 132

Simple binning method 1. Train SVM classifier in the Training set. 2. Apply it to the Validation set and compute distances from the hyperplane to each sample. Sample # 1

2

3

4

5

Distance

-1

8

3

4

2

...

98 99

100

-2

0.8

0.3

Training set

Validation set

3. Create a histogram with Q (e.g., say 10) bins using the above distances. Each bin has an upper and lower value in terms of distance.

Testing set

Number of samples in validation set

25 20 15 10 5 0 -15

-10

-5

0 Distance

5

10

15

133

Simple binning method 4.

Given a new sample from the Testing set, place it in the corresponding bin. E.g., sample #382 has distance to hyperplane = 1, so it is placed in the bin [0, 2.5]

Training set

Number of samples in validation set

25 20

10

Testing set

5 0 -15

5.

Validation set

15

-10

-5

0 Distance

5

10

15

Compute probability P(positive class | sample #382) as a fraction of true positives in this bin. E.g., this bin has 22 samples (from the Validation set), out of which 17 are true positive ones , so we compute P(positive class | sample #382) = 17/22 = 0.77

134

Platt’s method Convert distances output by SVM to probabilities by passing them through the sigmoid filter:

1 P ( positive class | sample) = 1 + exp( Ad + B) where d is the distance from hyperplane and A and B are parameters. 1 0.9

P(positive class|sample)

0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 -10

-8

-6

-4

-2

0 Distance

2

4

6

8

10

135

Platt’s method 1. Train SVM classifier in the Training set. 2. Apply it to the Validation set and compute distances from the hyperplane to each sample. Sample # 1

2

3

4

5

Distance

-1

8

3

4

3.

2

...

98 99

100

-2

0.8

0.3

Determine parameters A and B of the sigmoid function by minimizing the negative log likelihood of the data from the Validation set.

Training set

Validation set Testing set

4. Given a new sample from the Testing set, compute its posterior probability using sigmoid function.

136

Part 3 • Case studies (taken from our research) 1. 2. 3. 4. 5. 6.

Classification of cancer gene expression microarray data Text categorization in biomedicine Prediction of clinical laboratory values Modeling clinical judgment Using SVMs for feature selection Outlier detection in ovarian cancer proteomics data

• Software • Conclusions • Bibliography

137

1. Classification of cancer gene expression microarray data

138

Comprehensive evaluation of algorithms for classification of cancer microarray data Main goals: • Find the best performing decision support algorithms for cancer diagnosis from microarray gene expression data; • Investigate benefits of using gene selection and ensemble classification methods.

139

Classifiers • • • • • • • • • • •

K-Nearest Neighbors (KNN) Backpropagation Neural Networks (NN) Probabilistic Neural Networks (PNN) Multi-Class SVM: One-Versus-Rest (OVR) Multi-Class SVM: One-Versus-One (OVO) Multi-Class SVM: DAGSVM Multi-Class SVM by Weston & Watkins (WW) Multi-Class SVM by Crammer & Singer (CS) Weighted Voting: One-Versus-Rest Weighted Voting: One-Versus-One Decision Trees: CART

instance-based neural networks

kernel-based

voting decision trees 140

Ensemble classifiers dataset

Classifier 1

Classifier 2

…

Classifier N

Prediction 1

Prediction 2

…

Prediction N

dataset

Ensemble Classifier Final Prediction 141

Gene selection methods Highly discriminatory genes Uninformative genes

1.

Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;

2.

Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;

3.

Kruskal-Wallis nonparametric one-way ANOVA (KW);

4.

Ratio of genes betweencategories to within-category sum of squares (BW).

genes

142

Performance metrics and statistical comparison 1. Accuracy + can compare to previous studies + easy to interpret & simplifies statistical comparison

2. Relative classifier information (RCI) + easy to interpret & simplifies statistical comparison + not sensitive to distribution of classes + accounts for difficulty of a decision problem

• Randomized permutation testing to compare accuracies of the classifiers (α=0.05) 143

Microarray datasets Number of Dataset name Sam- Variables Cateples (genes) gories

Reference

11_Tumors

174

12533

11

Su, 2001

14_Tumors

308

15009

26

Ramaswamy, 2001

9_Tumors

60

5726

9

Staunton, 2001

Brain_Tumor1

90

5920

5

Pomeroy, 2002

Brain_Tumor2

50

10367

4

Nutt, 2003

Leukemia1

72

5327

3

Golub, 1999

Leukemia2

72

11225

3

Armstrong, 2002

Lung_Cancer

203

12600

5

Bhattacherjee, 2001

SRBCT

83

2308

4

Khan, 2001

10509

2

Singh, 2002

5469

2

Shipp, 2002

Prostate_Tumor 102 DLBCL

77

Total: •~1300 samples •74 diagnostic categories •41 cancer types and 12 normal tissue types

144

Summary of methods and datasets Cross-Validation Designs (2)

Gene Selection Methods (4)

Performance Metrics (2)

Statistical Comparison

10-Fold CV

S2N One-Versus-Rest

Accuracy

LOOCV

S2N One-Versus-One

RCI

Randomized permutation testing

Non-param. ANOVA

Gene Expression Datasets (11) 11_Tumors

One-Versus-One

14_Tumors

Ensemble Classifiers (7) Majority Voting

KNN Backprop. NN Prob. NN Decision Trees One-Versus-Rest One-Versus-One

Based on MCSVM outputs

Method by WW

MC-SVM OVR MC-SVM OVO MC-SVM DAGSVM Decision Trees

Brain Tumor1 Brain_Tumor2 Leukemia1

Lung_Cancer SRBCT

Majority Voting Decision Trees

9_Tumors

Leukemia2

Binary Dx

DAGSVM

Multicategory Dx

One-Versus-Rest

Method by CS

WV

BW ratio

Based on outputs of all classifiers

MC-SVM

Classifiers (11)

Prostate_Tumors DLBCL 145

em

ia1

ia2 Lu ng _C an ce r SR Pr BC os T tat e_ Tu mo DL r BC L

Le uk

or s

or 1

or 2

or s

em

_T um

_T um

_T um

_T um

Tu mo rs

Le uk

11

Br ain

Br ain

14

9_ Accuracy, % 60

40 OVR OVO DAGSVM WW CS KNN NN PNN

20

0

146

MC-SVM

Results without gene selection

100

80

Results with gene selection Improvement of diagnostic performance by gene selection (averages for the four datasets)

Diagnostic performance before and after gene selection 100

70

9_Tumors

14_Tumors

60

60 40

50

Accuracy, %

Improvement in accuracy, %

80

40 30

20

100

Brain_Tumor1

Brain_Tumor2

80

20 60

10

40 20

OVR

OVO DAGSVM SVM

WW

CS

KNN

NN non-SVM

PNN

SVM

non-SVM

SVM

non-SVM

Average reduction of genes is 10-30 times 147

Comparison with previously published results Multiclass SVMs (this study)

100

Accuracy, %

80

60

Multiple specialized classification methods (original primary studies)

40

20

ia2 Lu ng _C an ce r SR Pr BC os T tat e_ Tu mo DL r BC L

em

ia1

Le uk

em

or s

Le uk

_T um

or 1 11

_T um

or 2 Br ain

_T um

or s Br ain

_T um

14

9_

Tu mo rs

0

148

Summary of results • Multi-class SVMs are the best family among the tested algorithms outperforming KNN, NN, PNN, DT, and WV. • Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms; • Ensemble classification does not improve performance of SVM and other classifiers; • Results obtained by SVMs favorably compare with the literature. 149

Random Forest (RF) classifiers • Appealing properties – – – – – –

Work when # of predictors > # of samples Embedded gene selection Incorporate interactions Based on theory of ensemble learning Can work with binary & multiclass tasks Does not require much fine-tuning of parameters

• Strong theoretical claims • Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods 150

Key principles of RF classifiers

Testing data Training data

4) Apply to testing data & combine predictions

1) Generate bootstrap samples

2) Random gene selection

3) Fit unpruned decision trees

151

Results without gene selection

• SVMs nominally outperform RFs is 15 datasets, RFs outperform SVMs in 4 datasets, algorithms are exactly the same in 3 datasets. • In 7 datasets SVMs outperform RFs statistically significantly. • On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI. 152

Results with gene selection

• SVMs nominally outperform RFs is 17 datasets, RFs outperform SVMs in 3 datasets, algorithms are exactly the same in 2 datasets. • In 1 dataset SVMs outperform RFs statistically significantly. • On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI. 153

2. Text categorization in biomedicine

154

Models to categorize content and quality: Main idea

1. Utilize existing (or easy to build) training corpora 1

2

3 4

2. Simple document representations (i.e., typically stemmed and weighted words in title and abstract, Mesh terms if available; occasionally addition of Metamap CUIs, author info) as “bag-of-words” 155

Models to categorize content and quality: Main idea Unseen Examples

3. Train SVM models that capture implicit categories of meaning or quality criteria

Labeled Examples

Labeled

4. Evaluate models’ performances - with nested cross-validation or other appropriate error estimators - use primarily AUC as well as other metrics (sensitivity, specificity, PPV, Precision/Recall curves, HIT curves, etc.) 5. Evaluate performance prospectively & compare to prior cross-validation estimates

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Txmt

Diag

Estimated Performance

Prog

Etio

2005 Performance

156

Models to categorize content and quality: Some notable results Category

Average AUC

Range over n folds

Treatment

0.97*

0.96 - 0.98

Etiology

0.94*

0.89 – 0.95

Prognosis

0.95*

0.92 – 0.97

Diagnosis

0.95*

0.93 - 0.98

1. SVM models have excellent ability to identify high-quality PubMed documents according to ACPJ gold standard

Method

Treatment AUC

Etiology AUC

Prognosis AUC

Diagnosis AUC

Google Pagerank

0.54

0.54

0.43

0.46

Yahoo Webranks

0.56

0.49

0.52

0.52

Impact Factor 2005

0.67

0.62

0.51

0.52

Web page hit count

0.63

0.63

0.58

0.57

Bibliometric Citation Count

0.76

0.69

0.67

0.60

Machine Learning Models

0.96

0.95

0.95

0.95

2. SVM models have better classification performance than PageRank, Yahoo ranks, Impact Factor, Web Page hit counts, and bibliometric citation counts on the Web according to ACPJ gold standard 157

Models to categorize content and quality: Some notable results Treatment - Fixed Sensitivity

Gold standard: SSOAB

Area under the ROC curve*

SSOAB-specific filters

0.893

Citation Count

0.791

ACPJ Txmt-specific filters 0.548 Impact Factor (2001)

0.549

Impact Factor (2005)

0.558

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.98

0.98

Treatment - Fixed Specificity

0.89 0.71

Fixed Sens Query Filters

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Spec

Sens

Learning Models

0.98

0.75 0.44

Query Filters

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

3. SVM models have better classification performance than PageRank, Impact Factor and Citation count in Medline for SSOAB gold standard

0.96

Query Filters

0.91

0.91

Sens Query Filters

Fixed Spec Learning Models

Diagnosis - Fixed Specificity

0.88 0.68

Fixed Sens

0.94

Spec

0.96

Learning Models

0.68

Learning Models

Diagnosis - Fixed Sensitivity 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.91

Etiology - Fixed Specificity

0.98

Fixed Sens

0.91

Fixed Spec

Query Filters

Etiology - Fixed Sensitivity 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.95 0.8

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Spec

0.97

Sens

Learning Models

Fixed Spec

Query Filters

Prognosis - Fixed Sensitivity

0.97

0.82 0.65

Learning Models

Prognosis - Fixed Specificity 1

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.8

0.8

Fixed Sens Query Filters

0.87 0.71

Spec Learning Models

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.8

Sens Query Filters

0.77

0.77

Fixed Spec Learning Models

4. SVM models have better sensitivity/specificity in PubMed than CQFs at 158 comparable thresholds according to ACPJ gold standard

Other applications of SVMs to text categorization Model

Area Under the Curve

Machine Learning Models 0.93 Quackometer*

0.67

Google

0.63

1. Identifying Web Pages with misleading treatment information according to special purpose gold standard (Quack Watch). SVM models outperform Quackometer and Google ranks in the tested domain of cancer treatment. 2. Prediction of future paper citation counts (work of L. Fu and C.F. Aliferis, AMIA 2008) 159

3. Prediction of clinical laboratory values

160

Dataset generation and experimental design • StarPanel database contains ~8·106 lab measurements of ~100,000 inpatients from Vanderbilt University Medical Center. • Lab measurements were taken between 01/1998 and 10/2002. For each combination of lab test and normal range, we generated the following datasets.

01/1998-05/2001

Training

06/2001-10/2002

Validation (25% of Training)

Testing

161

Query-based approach for prediction of clinical cab values Training data

Data model

database

Testing sample

Train SVM classifier Validation data SVM classifier Performance These steps are performed for every data model

Testing data

These steps are performed for every testing sample

Optimal data model Prediction 162

Classification results Excluding cases with K=0 (i.e. samples with no prior lab measurements)

Area under ROC curve (without feature selection)

Area under ROC curve (without feature selection)

BUN

>1 75.9%

Range of normal values 2.5 1 digital image available Diagnoses made in 2004

Features Lesion location

Family history of melanoma

Irregular Border

Streaks (radial streaming, pseudopods)

Max-diameter

Fitzpatrick’s Photo-type

Number of colors

Slate-blue veil

Min-diameter

Sunburn

Atypical pigmented network

Whitish veil

Evolution

Ephelis

Abrupt network cut-off

Globular elements

Age

Lentigos

Regression-Erythema

Comedo-like openings, milia-like cysts

Gender

Asymmetry

Hypo-pigmentation

Telangiectasia

Method to explain physician-specific SVM models

FS

Build DT

Build SVM SVM ”Black Box” Regular Learning

Apply SVM

Meta-Learning

Results: Predicting physicians’ judgments

All

RFE

(features)

HITON_PC (features)

HITON_MB (features)

(features)

Expert 1

0.94 (24)

0.92 (4)

0.92 (5)

0.95 (14)

Expert 2

0.92 (24)

0.89 (7)

0.90 (7)

0.90 (12)

Expert 3

0.98 (24)

0.95 (4)

0.95 (4)

0.97 (19)

NonExpert 1

0.92 (24)

0.89 (5)

0.89 (6)

0.90 (22)

NonExpert 2

1.00 (24)

0.99 (6)

0.99 (6)

0.98 (11)

NonExpert 3

0.89 (24)

0.89 (4)

0.89 (6)

0.87 (10)

Physicians

Results: Physician-specific models

Results: Explaining physician agreement Patient 001

Expert 1 AUC=0.92 R2=99%

Blue veil

irregular border

streaks

yes

no

yes

Expert 3 AUC=0.95 R2=99%

Results: Explain physician disagreement Patient 002

Expert 1 AUC=0.92 R2=99%

Blue veil

irregular border

streaks

number of colors

evolution

no

no

yes

3

no

Expert 3 AUC=0.95 R2=99%

Results: Guideline compliance Physician

Reported guidelines

Compliance

Experts1,2,3, non-expert 1

Pattern analysis

Non-compliant: they ignore the majority of features (17 to 20) recommended by pattern analysis.

Non expert 2

ABCDE rule

Non compliant: asymmetry, irregular border and evolution are ignored.

Non expert 3

Non-standard. Reports using 7 features

Non compliant: 2 out of 7 reported features are ignored while some nonreported ones are not

On the contrary: In all guidelines, the more predictors present, the higher the likelihood of melanoma. All physicians were compliant with this principle.

5. Using SVMs for feature selection

176

Feature selection methods Feature selection methods (non-causal) • SVM-RFE This is an SVM-based • Univariate + wrapper feature selection method • Random forest-based • LARS-Elastic Net • RELIEF + wrapper • L0-norm • Forward stepwise feature selection • No feature selection Causal feature selection methods • HITON-PC This method outputs a • HITON-MB Markov blanket of the • IAMB response variable • BLCD (under assumptions) • K2MB … 177

13 real datasets were used to evaluate feature selection methods Dataset name

Domain

Number of variables

Number of samples

Infant_Mortality

Clinical

86

5,337

Died within the first year

discrete

Ohsumed

Text

14,373

5,000

Relevant to nenonatal diseases

continuous

ACPJ_Etiology

Text

28,228

15,779

Relevant to eitology

continuous

Lymphoma

Gene expression

7,399

227

3-year survival: dead vs. alive

continuous

Gisette

Digit recognition

5,000

7,000

Separate 4 from 9

continuous

Dexter

Text

19,999

600

Relevant to corporate acquisitions

continuous

Sylva

Ecology

216

14,394

Ponderosa pine vs. everything else

continuous & discrete

Proteomics

2,190

216

Cancer vs. normals

continuous

Thrombin

Drug discovery

139,351

2,543

Binding to thromin

discrete (binary)

Breast_Cancer

Gene expression

17,816

286

Estrogen-receptor positive (ER+) vs. ER-

continuous

Hiva

Drug discovery

1,617

4,229

Activity to AIDS HIV infection

discrete (binary)

Nova

Text

16,969

1,929

Separate politics from religion topics

discrete (binary)

Financial

147

7,063

Personal bankruptcy

continuous & discrete

Ovarian_Cancer

Bankruptcy

Target

Data type

178

Classification performance vs. proportion of selected features Magnified

Original 1 0.9 Classification performance (AUC)

Classification performance (AUC)

0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 2

HITON-PC with G test RFE

0.55 0.5

0

0.5 Proportion of selected features

1

0.89 0.88 0.87 0.86 0.85

HITON-PC with G2 test RFE 0.05 0.1 0.15 0.2 Proportion of selected features 179

Statistical comparison of predictivity and reduction of features Predicitivity

SVM-RFE (4 variants)

Reduction

P-value

Nominal winner

P-value

Nominal winner

0.9754

SVM-RFE

0.0046

HITON-PC

0.8030

SVM-RFE

0.0042

HITON-PC

0.1312

HITON-PC

0.3634

HITON-PC

0.1008

HITON-PC

0.6816

SVM-RFE

• Null hypothesis: SVM-RFE and HITON-PC perform the same; • Use permutation-based statistical test with alpha = 0.05. 180

Simulated datasets with known causal structure used to compare algorithms

181

Comparison of SVM-RFE and HITON-PC

182

Comparison of all methods in terms of causal graph distance SVM-RFE

HITON-PC based causal methods

183

Summary results HITON-PC based causal methods

HITON-PCFDR methods

SVM-RFE

184

Statistical comparison of graph distance

Comparison average HITON-PC-FDR with G2 test vs. average SVM-RFE

Sample size = 200 Nominal P-value winner

York University, #Vanderbilt University, †ClopiNet

Part I • Introduction • Necessary mathematical concepts • Support vector machines for binary classification: classical formulation • Basic principles of statistical machine learning

2

Introduction

3

About this tutorial Main goal: Fully understand support vector machines (and important extensions) with a modicum of mathematics knowledge. • This tutorial is both modest (it does not invent anything new) and ambitious (support vector machines are generally considered mathematically quite difficult to grasp). • Tutorial approach: learning problem main idea of the SVM solution geometrical interpretation math/theory basic algorithms extensions case studies. 4

Data-analysis problems of interest 1. Build computational classification models (or “classifiers”) that assign patients/samples into two or more classes. - Classifiers can be used for diagnosis, outcome prediction, and other classification tasks. - E.g., build a decision-support system to diagnose primary and metastatic cancers from gene expression profiles of the patients: Primary Cancer

Classifier model Patient

Biopsy

Metastatic Cancer

Gene expression profile 5

Data-analysis problems of interest 2. Build computational regression models to predict values of some continuous response variable or outcome. - Regression models can be used to predict survival, length of stay in the hospital, laboratory test values, etc. - E.g., build a decision-support system to predict optimal dosage of the drug to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data: 1 2.2 3 423 2 3 92 2 1 8

Patient

Biomarkers, clinical and demographics data

Regression model

Optimal dosage is 5 IU/Kg/week

6

Data-analysis problems of interest 3. Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for the most accurate prediction (classification or regression) of some variable of interest (e.g., phenotypic response variable). - E.g., find the most compact panel of breast cancer biomarkers from microarray gene expression data for 20,000 genes: Breast cancer tissues Normal tissues

7

Data-analysis problems of interest 4. Build a computational model to identify novel or outlier patients/samples. - Such models can be used to discover deviations in sample handling protocol when doing quality control of assays, etc. - E.g., build a decision-support system to identify aliens.

8

Data-analysis problems of interest 5. Group patients/samples into several clusters based on their similarity. - These methods can be used to discovery disease sub-types and for other tasks. - E.g., consider clustering of brain tumor patients into 4 clusters based on their gene expression profiles. All patients have the same pathological sub-type of the disease, and clustering discovers new disease subtypes that happen to have different characteristics in terms of patient survival and time to recurrence after treatment.

Cluster #1 Cluster #2

Cluster #3

Cluster #4

9

Basic principles of classification

• Want to classify objects as boats and houses. 10

Basic principles of classification

• All objects before the coast line are boats and all objects after the coast line are houses. • Coast line serves as a decision surface that separates two classes.

11

Basic principles of classification These boats will be misclassified as houses

12

Basic principles of classification Longitude

Boat

Latitude

House

• The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example. • First all objects are represented geometrically.

13

Basic principles of classification Longitude

Boat

Latitude

Then the algorithm seeks to find a decision surface that separates classes of objects

House

14

Basic principles of classification Longitude These objects are classified as houses

?

?

? ?

?

?

These objects are classified as boats

Latitude

Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if the fall above it

15

The Support Vector Machine (SVM) approach • • •

Support vector machines (SVMs) is a binary classification algorithm that offers a solution to problem #1. Extensions of the basic SVM algorithm can be applied to solve problems #1-#5. SVMs are important because of (a) theoretical reasons: - Robust to very large number of variables and small samples - Can learn both simple and highly complex classification models - Employ sophisticated mathematical principles to avoid overfitting

and (b) superior empirical results.

16

Main ideas of SVMs Gene Y

Normal patients

Cancer patients

Gene X

• Consider example dataset described by 2 genes, gene X and gene Y • Represent patients geometrically (by “vectors”)

17

Main ideas of SVMs Gene Y

Normal patients

Cancer patients

Gene X

• Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”); 18

Main ideas of SVMs Gene Y Cancer

Cancer

Decision surface

kernel Normal Normal Gene X

• If such linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found; • The feature space is constructed via very clever mathematical projection (“kernel trick”). 19

History of SVMs and usage in the literature • Support vector machine classifiers have a long history of development starting from the 1960’s. • The most important milestone for development of modern SVMs is the 1992 paper by Boser, Guyon, and Vapnik (“A training algorithm for optimal margin classifiers”) 10000 9000

Use of Support Vector Machines in the Literature 8,180

General sciences Biomedicine

8000 7000

30000

Use of Linear Regression in the Literature

8,860

General sciences Biomedicine

25000

6,660

19,200

20000

24,100 22,200 18,700

20,100 20,000 19,500

19,100 17,700

6000 4,950

5000 4000

3,530

10000

3000

0

14,900

15,500

14,900

16,000

13,500 9,770

10,800

12,000

2,330

2000 1000

15000

19,600 18,300

17,700

1,430 359

906

621 4

12

46

99

1998

1999

2000

2001

201

351

521

726

917

1,190

5000

0

2002

2003

2004

2005

2006

2007

1998

1999

2000

2001

2002

2003

2004

2005

2006

20

2007

Necessary mathematical concepts

21

How to represent samples geometrically? Vectors in n-dimensional space (Rn) • Assume that a sample/patient is described by n characteristics (“features” or “variables”) • Representation: Every sample/patient is a vector in Rn with tail at point with 0 coordinates and arrow-head at point with the feature values. • Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a vector in R2: Age (110, 29)

(0, 0)

Systolic BP

22

How to represent samples geometrically? Vectors in n-dimensional space (Rn) Patient 3

Patient 4

Age (years)

60

Patient 1

40

Patient 2

20

0 200 150

300

100

200 50

100 0

Patient id 1 2 3 4

Cholesterol (mg/dl) 150 250 140 300

Systolic BP (mmHg) 110 120 160 180

0

Age (years) 35 30 65 45

Tail of the vector (0,0,0) (0,0,0) (0,0,0) (0,0,0)

Arrow-head of the vector (150, 110, 35) (250, 120, 30) (140, 160, 65) (300, 180, 45)

23

How to represent samples geometrically? Vectors in n-dimensional space (Rn) Patient 3

Patient 4

Age (years)

60

Patient 1

40

Patient 2

20

0 200 150

300

100

200 50

100 0

0

Since we assume that the tail of each vector is at point with 0 coordinates, we will also depict vectors as points (where the arrow-head is pointing). 24

Purpose of vector representation • Having represented each sample/patient as a vector allows now to geometrically represent the decision surface that separates two groups of samples/patients. A decision surface in 7

A decision surface in R3

R2 7 6

6

5

5

4 4

3 3

2

2

1

1

0 10

0 0

1

2

3

4

5

6

7

5 0

0

1

2

3

4

5

6

7

• In order to define the decision surface, we need to introduce some basic math elements… 25

Basic operation on vectors in Rn 1. Multiplication by a scalar Consider a vector a = (a1 , a2 ,..., an ) and a scalar c Define: ca = (ca1 , ca2 ,..., can ) When you multiply a vector by a scalar, you “stretch” it in the same or opposite direction depending on whether the scalar is positive or negative. a = (1,2) c=2 c a = (2,4)

ca

4 3 2 1

4

a = (1,2) c = −1 c a = (−1,−2)

a

3 2 1

-2

-1

0

a 1

2

3

4

-1 0

1

2

3

4

ca

-2

26

Basic operation on vectors in Rn 2. Addition Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) Define: a + b = (a1 + b1 , a2 + b2 ,..., an + bn )

a = (1,2) b = (3,0) a + b = (4,2)

4 3 2 1

0

a 1

Recall addition of forces in classical mechanics.

a +b 2

b

3

4

27

Basic operation on vectors in Rn 3. Subtraction Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) Define: a − b = (a1 − b1 , a2 − b2 ,..., an − bn ) a = (1,2) b = (3,0) a − b = (−2,2)

4 3 2

-3

a −b

1

-2

0

-1

a 1

2

b

3

4

What vector dowe need to add to b to get a ? I.e., similar to subtraction of real numbers.

28

Basic operation on vectors in Rn 4. Euclidian length or L2-norm Consider a vector a = (a1 , a2 ,..., an ) 2 2 2 ... a = a + a + + a Define the L2-norm: 1 2 n 2 We often denote the L2-norm without subscript, i.e. a a = (1,2) a 2 = 5 ≈ 2.24

3 2 1

0

a 1

Length of this vector is ≈ 2.24 2

3

4

L2-norm is a typical way to measure length of a vector; other methods to measure length also exist.

29

Basic operation on vectors in Rn 5. Dot product

Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) n Define dot product: a ⋅ b = a1b1 + a2b2 + ... + anbn = ∑ ai bi i =1

The law of cosines says that a⋅ b =|| a ||2 || b ||2 cos θ where θ is the angle between a and b. Therefore, when the vectors are perpendicular a ⋅b = 0. a = (1,2) b = (3,0) a ⋅b = 3

a = (0,2) b = (3,0) a ⋅b = 0

4 3 2 1

0

a

θ 1

2

b

3

4

4 3 2

a

1

θ 0

1

2

b

3

4

30

Basic operation on vectors in Rn 5. Dot product (continued) n a ⋅ b = a1b1 + a2b2 + ... + anbn = ∑ ai bi i =1

2 • Property: a ⋅ a = a1a1 + a2 a2 + ... + an an = a 2 • In the classical regression equation y = w ⋅ x + b the response variable y is just a dot product of the vector representing patient characteristics (x ) and the regression weights vector (w ) which is common across all patients plus an offset b. 31

Hyperplanes as decision surfaces • A hyperplane is a linear decision surface that splits the space into two parts; • It is obvious that a hyperplane is a binary classifier. A hyperplane in R2 is a line

A hyperplane in R3 is a plane 7

7

6 6

5 5

4 4

3

3

2 1

2

0 10

1

5 0 0

1

2

3

4

5

6

7

0

0

1

2

3

A hyperplane in Rn is an n-1 dimensional subspace

4

5

6

7

32

Equation of a hyperplane First we show with show the definition of hyperplane by an interactive demonstration.

Click here for demo to begin

or go to http://www.dsl-lab.org/svm_tutorial/planedemo.html

Source: http://www.math.umn.edu/~nykamp/ 33

Equation of a hyperplane Consider the case of R3:

w

P

x − x0

x

An equation of a hyperplane is defined by a point (P0) and a perpendicular vector to the plane (w) at that point.

P0

x0 O

Define vectors: x0 = OP0 and x = OP , where P is an arbitrary point on a hyperplane.

A condition for P to be on the plane is that the vector x − x0 is perpendicular to w : w ⋅ ( x − x0 ) = 0 or define b = − w ⋅ x0 w ⋅ x − w ⋅ x0 = 0 w⋅ x + b = 0 The above equations also hold for Rn when n>3.

34

Equation of a hyperplane Example

w = (4,−1,6) P0 = (0,1,−7) b = − w ⋅ x0 = −(0 − 1 − 42) = 43 ⇒ w ⋅ x + 43 = 0 ⇒ (4,−1,6) ⋅ x + 43 = 0 ⇒ (4,−1,6) ⋅ ( x(1) , x( 2 ) , x(3) ) + 43 = 0 ⇒ 4 x(1) − x( 2 ) + 6 x(3) + 43 = 0

+ direction

w P0

w ⋅ x + 50 = 0 w ⋅ x + 43 = 0 w ⋅ x + 10 = 0

- direction

What happens if the b coefficient changes? The hyperplane moves along the direction of w . We obtain “parallel hyperplanes”.

Distance between two parallel hyperplanes w ⋅ x + b1 = 0 and w ⋅ x + b2 = 0 is equal to D = b1 − b2 / w .

35

(Derivation of the distance between two parallel hyperplanes) tw w x2

w ⋅ x + b2 = 0 w ⋅ x + b1 = 0

x1

x2 = x1 + tw D = tw = t w w ⋅ x2 + b2 = 0 w ⋅ ( x1 + tw) + b2 = 0 2 w ⋅ x1 + t w + b2 = 0 2 ( w ⋅ x1 + b1 ) − b1 + t w + b2 = 0 2 − b1 + t w + b2 = 0 2 t = (b1 − b2 ) / w ⇒ D = t w = b1 − b2 / w 36

Recap We know… • How to represent patients (as “vectors”) • How to define a linear decision surface (“hyperplane”) We need to know… • How to efficiently compute the hyperplane that separates two classes with the largest “gap”? Gene Y

Need to introduce basics of relevant optimization theory Normal patients

Cancer patients

Gene X 37

Basics of optimization: Convex functions • A function is called convex if the function lies below the straight line segment connecting two points, for any two points in the interval. • Property: Any local minimum is a global minimum!

Local minimum Global minimum

Convex function

Global minimum

Non-convex function 38

Basics of optimization: Quadratic programming (QP) • Quadratic programming (QP) is a special optimization problem: the function to optimize (“objective”) is quadratic, subject to linear constraints. • Convex QP problems have convex objective functions. • These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).

39

Basics of optimization: Example QP problem Consider x = ( x1 , x 2 ) 1 2 || x ||2 subject to x1 +x 2 −1 ≥ 0 Minimize 2 quadratic objective

linear constraints

This is QP problem, and it is a convex QP as we will see later We can rewrite it as: Minimize

1 2 ( x1 + x22 ) subject to x1 +x 2 −1 ≥ 0 2 quadratic objective

linear constraints 40

Basics of optimization: Example QP problem f(x1,x2)

x1 +x 2 −1 ≥ 0 x1 +x 2 −1

1 2 ( x1 + x22 ) 2

x1 +x 2 −1 ≤ 0 x2 x1 The solution is x1=1/2 and x2=1/2.

41

Congratulations! You have mastered all math elements needed to understand support vector machines. Now, let us strengthen your knowledge by a quiz 42

Quiz 1) Consider a hyperplane shown with white. It is defined by equation: w ⋅ x + 10 = 0 Which of the three other hyperplanes can be defined by equation: w ⋅ x + 3 = 0 ?

w P0

- Orange - Green - Yellow 4

a

3

2) What is the dot product between vectors a = (3,3) and b = (1,−1) ?

2 1

-2

-1

0 -1

1

2

3

4

b

-2

43

Quiz 4

a

3

3) What is the dot product between vectors a = (3,3) and b = (1,0) ?

2 1

-2

-1

0

b

1

2

3

4

3

4

-1 -2

4 3

4) What is the length of a vector a = (2,0) and what is the length of all other red vectors in the figure?

2

a

1

-2

-1

0

1

2

-1 -2

44

Quiz 5) Which of the four functions is/are convex?

1

2

3

4 45

Support vector machines for binary classification: classical formulation

46

Case 1: Linearly separable data; “Hard-margin” linear SVM Given training data:

Negative instances (y=-1)

x1 , x2 ,..., x N ∈ R n

y1 , y2 ,..., y N ∈ {−1,+1}

Positive instances (y=+1)

• Want to find a classifier (hyperplane) to separate negative instances from the positive ones. • An infinite number of such hyperplanes exist. • SVMs finds the hyperplane that maximizes the gap between data points on the boundaries (so-called “support vectors”). • If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well. 47

Statement of linear SVM classifier w⋅ x + b = 0

w ⋅ x + b = −1

w ⋅ x + b = +1

The gap is distance between parallel hyperplanes:

w ⋅ x + b = −1 and w ⋅ x + b = +1

Or equivalently:

w ⋅ x + (b + 1) = 0 w ⋅ x + (b − 1) = 0

We know that Negative instances (y=-1)

Positive instances (y=+1)

D = b1 − b2 / w

Therefore:

D = 2/ w

Since we want to maximize the gap,

we need to minimize w

or equivalently minimize

1 2

2 w

(

1 2

is convenient for taking derivative later on) 48

Statement of linear SVM classifier w ⋅ x + b ≤ −1

w⋅ x + b = 0 w ⋅ x + b ≥ +1

In addition we need to impose constraints that all instances are correctly classified. In our case:

w ⋅ xi + b ≤ −1 if yi = −1 w ⋅ xi + b ≥ +1 if yi = +1 Equivalently:

yi ( w ⋅ xi + b) ≥ 1

Negative instances (y=-1)

Positive instances (y=+1)

In summary: 2 1 w Want to minimize 2 subject to yi ( w ⋅ xi + b) ≥ 1 for i = 1,…,N Then given a new instance x, the classifier is f ( x ) = sign( w ⋅ x + b) 49

SVM optimization problem: Primal formulation n

Minimize

1 2

2 w ∑ i i =1

subject to

Objective function

yi ( w ⋅ xi + b) − 1 ≥ 0

for i = 1,…,N

Constraints

• This is called “primal formulation of linear SVMs”. • It is a convex quadratic programming (QP) optimization problem with n variables (wi, i = 1,…,n), where n is the number of features in the dataset.

50

SVM optimization problem: Dual formulation • The previous problem can be recast in the so-called “dual form” giving rise to “dual formulation of linear SVMs”. • It is also a convex quadratic programming problem but with N variables (αi ,i = 1,…,N), where N is the number of samples. N Maximize ∑ α i − ∑ α iα j yi y j xi ⋅ x j subject to α i ≥ 0 and ∑ α i yi = 0 . N

i =1

N

1 2

i , j =1

i =1

Objective function

Constraints

N Then the w-vector is defined in terms of αi: w = ∑ α i yi xi i =1

And the solution becomes: f ( x ) = sign(∑ α i yi xi ⋅ x + b) N

i =1

51

SVM optimization problem: Benefits of using dual formulation 1) No need to access original data, need to access only dot products. Objective function: ∑ α i − ∑ α iα j yi y j xi ⋅ x j N

i =1

N

1 2

i , j =1

Solution: f ( x ) = sign(∑ α i yi xi ⋅ x + b) N

i =1

2) Number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems). E.g., if a microarray dataset contains 20,000 genes and 100 patients, then need to find only up to 100 parameters! 52

(Derivation of dual formulation) n

Minimize

1 2

2 w subject to y ( w ⋅ xi + b) − 1 ≥ 0 for i = 1,…,N ∑ i i i =1

Constraints

Objective function

Apply the method of Lagrange multipliers.

( Λ w Define Lagrangian P , b, α ) =

( w − α y ( w ∑ ∑ i i ⋅ xi + b) − 1) n

1 2

i =1

N

2 i

i =1

a vector with n elements a vector with N elements

We need to minimize this Lagrangian with respect to w, b and simultaneously require that the derivative with respect to α vanishes , all subject to the constraints that α i ≥ 0.

53

(Derivation of dual formulation) If we set the derivatives with respect to w, b to 0, we obtain: N ∂Λ P (w, b, α ) = 0 ⇒ ∑ α i yi = 0 ∂b i =1 N ∂Λ P (w, b, α ) α 0 w y x = ⇒ = ∑ i i i ∂w i =1

We substitute the above into the equation for Λ P (w,b, α ) and obtain “dual formulation of linear SVMs”:

Λ D (α ) = ∑ α i − ∑ α iα j yi y j xi ⋅ x j N

i =1

N

1 2

i , j =1

We seek to maximize the above Lagrangian with respect to α , subject to the N

constraints that α i ≥ 0 and ∑ α i yi = 0 . i =1

54

Case 2: Not linearly separable data; “Soft-margin” linear SVM What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear.

0 0 0

0

0 0

0 0

0 0

0 0 0

0

0

Want to handle this case without changing the family of decision functions. Approach: Assign a “slack variable” to each instance ξ i ≥ 0 , which can be thought of distance from the separating hyperplane if an instance is misclassified and 0 otherwise. N 2 1 Want to minimize 2 w + C ∑ ξ i subject to yi ( w ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N i =1 f ( x ) = sign ( w ⋅ x + b) Then given a new instance x, the classifier is

55

Two formulations of soft-margin linear SVM Primal formulation: n

Minimize

1 2

N

w + C ξ y w ( ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N subject to ∑ ∑ i i i =1

2 i

i =1

Objective function

Constraints

Dual formulation: N Minimize ∑ α i − 12 ∑ α iα j yi y j xi ⋅ x j subject to 0 ≤ α i ≤ C and ∑ α i yi = 0 n

N

i =1

i , j =1

for i = 1,…,N.

Objective function

i =1

Constraints

56

Parameter C in soft-margin SVM Minimize

1 2

N 2 w + C ∑ ξ i subject to yi ( w ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N i =1

C=100

C=1

C=0.15

C=0.1

• When C is very large, the softmargin SVM is equivalent to hard-margin SVM; • When C is very small, we admit misclassifications in the training data at the expense of having w-vector with small norm; • C has to be selected for the distribution at hand as it will be discussed later in this tutorial. 57

Case 3: Not linearly separable data; Kernel trick Gene 2

Tumor

Tumor

?

kernel

Φ

?

Normal

Normal Gene 1

Data is not linearly separable in the input space

Data is linearly separable in the feature space obtained by a kernel

Φ : RN → H

58

Kernel trick Original data x (in input space) f ( x) = sign( w ⋅ x + b)

Data in a higher dimensional feature space Φ (x ) f ( x) = sign( w ⋅ Φ ( x ) + b)

N w = ∑ α i yi xi

N w = ∑ α i yi Φ ( xi )

i =1

i =1

f ( x) = sign(∑ α i yi Φ ( xi ) ⋅ Φ ( x ) + b) N

i =1

f ( x) = sign(∑ α i yi K ( xi ,x ) + b) N

i =1

Therefore, we do not need to know Ф explicitly, we just need to define function K(·, ·): RN × RN R. Not every function RN × RN R can be a valid kernel; it has to satisfy so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable. 59

Popular kernels A kernel is a dot product in some feature space:

K ( xi , x j ) = Φ ( xi ) ⋅ Φ ( x j )

Examples:

K ( xi , x j ) = xi ⋅ x j 2 K ( xi , x j ) = exp(−γ xi − x j ) K ( xi , x j ) = exp(−γ xi − x j ) q K ( xi , x j ) = ( p + xi ⋅ x j ) 2 K ( xi , x j ) = ( p + xi ⋅ x j ) q exp(−γ xi − x j ) K ( xi , x j ) = tanh(kxi ⋅ x j − δ )

Linear kernel Gaussian kernel Exponential kernel Polynomial kernel Hybrid kernel Sigmoidal 60

Understanding the Gaussian kernel 2 Consider Gaussian kernel: K ( x , x j ) = exp(−γ x − x j ) Geometrically, this is a “bump” or “cavity” centered at the training data point x j : "bump” “cavity”

The resulting mapping function is a combination of bumps and cavities.

61

Understanding the Gaussian kernel Several more views of the data is mapped to the feature space by Gaussian kernel

62

Understanding the Gaussian kernel Linear hyperplane that separates two classes

63

Understanding the polynomial kernel 3 Consider polynomial kernel: K ( xi , x j ) = (1 + xi ⋅ x j ) Assume that we are dealing with 2-dimensional data (i.e., in R2). Where will this kernel map the data? 2-dimensional space

x(1)

x( 2 )

kernel 10-dimensional space

1 x(1)

x( 2 )

x(21)

x(22 )

x(1) x( 2 )

x(31)

x(32 )

x(1) x(22 )

x(21) x( 2 ) 64

Example of benefits of using a kernel x( 2 ) x1 x3

x4 x2

• Data is not linearly separable in the input space (R2). • Apply kernel K ( x , z ) = ( x ⋅ z ) 2 to map data to a higher x(1) dimensional space (3dimensional) where it is linearly separable. 2

2 2 x(1) z(1) = x(1) z(1) + x( 2 ) z( 2 ) = K ( x, z ) = ( x ⋅ z ) = ⋅ x z ( 2 ) ( 2 ) x(21) z(21) 2 2 2 2 = x(1) z(1) + 2 x(1) z(1) x( 2 ) z( 2 ) + x( 2 ) z( 2 ) = 2 x(1) x( 2 ) ⋅ 2 z(1) z( 2 ) = Φ ( x ) ⋅ Φ ( z ) 2 2 x z ( 2) ( 2) 65

[

]

Example of benefits of using a kernel x(21) Therefore, the explicit mapping is Φ ( x ) = 2 x(1) x( 2 ) 2 x ( 2) x( 2 )

x(22 )

x1 x3

x4

x2

x1 , x2

kernel x(1)

Φ

x3 , x4

x(21)

2 x(1) x( 2 )

66

Comparison with methods from classical statistics & regression • Need ≥ 5 samples for each parameter of the regression model to be estimated: Number of variables

Polynomial degree

Number of parameters

Required sample

2

3

10

50

10

3

286

1,430

10

5

3,003

15,015

100

3

176,851

884,255

100

5

96,560,646

482,803,230

• SVMs do not have such requirement & often require much less sample than the number of variables, even when a high-degree polynomial kernel is used. 67

Basic principles of statistical machine learning

68

Generalization and overfitting • Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples. • Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples. • Overfitting Poor generalization.

69

Example of overfitting and generalization There is a linear relationship between predictor and outcome (plus some Gaussian noise). Algorithm 2 Outcome of Interest Y

Outcome of Interest Y

Algorithm 1 Training Data Test Data

Predictor X

Predictor X

• Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization. • Algorithm 2 learned general characteristics of the function that produced the data. Thus, it generalizes. 70

“Loss + penalty” paradigm for learning to avoid overfitting and ensure generalization • Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:

Minimize (Loss + λ Penalty) • Loss measures error of fitting the data • Penalty penalizes complexity of the learned function • λ is regularization parameter that balances Loss and Penalty

71

SVMs in “loss + penalty” form SVMs build the following classifiers: f ( x ) = sign( w ⋅ x + b)

Consider soft-margin linear SVM formulation: Find w and b that N 2 1 w + C ξ Minimize 2 ∑ i subject to yi (w ⋅ xi + b) ≥ 1 − ξi for i = 1,…,N i =1

This can also be stated as: Find w and b that N 2 Minimize ∑ [1 − yi f ( xi )]+ +λ w 2 i =1

Loss (“hinge loss”)

Penalty

(in fact, one can show that λ = 1/(2C)). 72

Meaning of SVM loss function [ 1 ( y f x − Consider loss function: ∑ i i )]+ N

i =1

• Recall that […]+ indicates the positive part • For a given sample/patient i, the loss is non-zero if 1 − yi f ( xi ) > 0 • In other words, yi f ( xi ) < 1 • Since yi = {−1,+1}, this means that the loss is non-zero if f ( xi ) < 1 for yi = +1 f ( xi ) > −1 for yi= -1

• In other words, the loss is non-zero if w ⋅ xi + b < 1 for yi = +1 w ⋅ xi + b > −1 for yi= -1

73

Meaning of SVM loss function • If the instance is negative, it is penalized only in regions 2,3,4 • If the instance is positive, it is penalized only in regions 1,2,3

w ⋅ x + b = −1

w⋅ x + b = 0 w ⋅ x + b = +1

1 2 3 4

Negative instances (y=-1)

Positive instances (y=+1)

74

Flexibility of “loss + penalty” framework Minimize (Loss + λ Penalty) Loss function

Penalty function

[ 1 − y f ( x ∑ i i )]+ N

Hinge loss:

i =1

Mean squared error:

∑(y i =1

i

Mean squared error: N

∑(y i =1

i

∑ ( yi − f ( xi )) 2 N

2

SVMs

Ridge regression

λ w1

− f ( xi )) 2

Mean squared error:

2

λw2

− f ( xi )) 2

N

λw2

Resulting algorithm

Lasso

λ1 w 1 + λ2 w 2 2

Elastic net

i =1

Hinge loss:

[ 1 − y f ( x ∑ i i )]+ N

i =1

λ w1

1-norm SVM 75

Part 2 • Model selection for SVMs • Extensions to the basic SVM model: 1. 2. 3. 4. 5. 6.

SVMs for multicategory data Support vector regression Novelty detection with SVM-based methods Support vector clustering SVM-based variable selection Computing posterior class probabilities for SVM classifiers 76

Model selection for SVMs

77

Need for model selection for SVMs Gene 2

Gene 2 Tumor

Tumor

Normal Gene 1

• It is impossible to find a linear SVM classifier that separates tumors from normals! • Need a non-linear SVM classifier, e.g. SVM with polynomial kernel of degree 2 solves this problem without errors.

Normal

Gene 1

• We should not apply a non-linear SVM classifier while we can perfectly solve this problem using a linear SVM classifier! 78

A data-driven approach for model selection for SVMs • Do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset? • Need to examine various combinations of parameters, e.g. consider searching the following grid: Polynomial degree d

Parameter C

(0.1, 1)

(1, 1)

(10, 1)

(100, 1)

(1000, 1)

(0.1, 2)

(1, 2)

(10, 2)

(100, 2)

(1000, 2)

(0.1, 3)

(1, 3)

(10, 3)

(100, 3)

(1000, 3)

(0.1, 4)

(1, 4)

(10, 4)

(100, 4)

(1000, 4)

(0.1, 5)

(1, 5)

(10, 5)

(100, 5)

(1000, 5)

• How to search this grid while producing an unbiased estimate 79 of classification performance?

Nested cross-validation Recall the main idea of cross-validation: test data

train

train test train

What combination of SVM parameters to apply on training data?

test

valid

train

train

valid

train

Perform “grid search” using another nested loop of cross-validation. 80

Example of nested cross-validation

data

Consider that we use 3-fold cross-validation and we want to optimize parameter C that takes values “1” and “2”.

{

Outer Loop P1

Training Testing Average C Accuracy set set Accuracy P1, P2 P3 1 89% 83% P1,P3 P2 2 84% P2, P3 P1 1 76%

P2 …

P3

…

Inner Loop Training Validation set set

P1 P2 P1 P2

P2 P1 P2 P1

C 1 2

Accuracy

86% 84% 70% 90%

Average Accuracy

85% 80%

}

choose C=1 81

On use of cross-validation • Empirically we found that cross-validation works well for model selection for SVMs in many problem domains; • Many other approaches that can be used for model selection for SVMs exist, e.g.:

Generalized cross-validation Bayesian information criterion (BIC) Minimum description length (MDL) Vapnik-Chernovenkis (VC) dimension Bootstrap

82

SVMs for multicategory data

83

One-versus-rest multicategory SVM method Gene 2

? Tumor I

**** * * * * * * * ** * * * *

?

Tumor II

Tumor III Gene 1

84

One-versus-one multicategory SVM method Gene 2

Tumor I

* * * * * Tumor II * * * * * * * * * * * * ? ?

Tumor III Gene 1

85

DAGSVM multicategory SVM method AML vs.vs. ALL T-cell AML ALL T-cell Not AML

Not ALL T-cell

ALL ALLB-cell B-cellvs. vs.ALL ALLT-cell T-cell Not ALL B-cell

ALL T-cell

Not ALL T-cell

AML vs. ALL B-cell Not AML

ALLALL B-cell B-cell

Not ALL B-cell

AML

86

SVM multicategory methods by Weston and Watkins and by Crammer and Singer Gene 2

? ?

* * * * * * * * * * * * * * * * *

Gene 1

87

Support vector regression

88

ε-Support vector regression (ε-SVR) Given training data: y

*

** ** * * * * *** * ** * *

x1 , x2 ,..., x N ∈ R n y1 , y2 ,..., y N ∈ R +ε -ε

x

Main idea: Find a function f ( x ) = w ⋅ x + b that approximates y1,…,yN : • it has at most ε derivation from the true values yi • it is as “flat” as possible (to avoid overfitting)

E.g., build a model to predict survival of cancer patients that can admit a one month error (= ε ).

89

Formulation of “hard-margin” ε-SVR y

* * *

* * * * * * ** * * * *

+ε w ⋅ x + b = 0 -ε f ( x ) = w⋅ x + b Find

2 w subject

by minimizing to constraints: yi − ( w ⋅ x + b ) ≤ ε yi − ( w ⋅ x + b) ≥ −ε 1 2

for i = 1,…,N. x

I.e., difference between yi and the fitted function should be smaller than ε and larger than -ε all points yi should be in the “ε-ribbon” around the fitted function. 90

Formulation of “soft-margin” ε-SVR y

*

* * * * * * ** * * * * * * * *

+ε -ε

x

If we have points like this (e.g., outliers or noise) we can either: a) increase ε to ensure that these points are within the new ε-ribbon, or b) assign a penalty (“slack” variable) to each of this points (as was done for “soft-margin” SVMs)

91

Formulation of “soft-margin” ε-SVR y

*

ξi

* * * * * * ** * * ξi* * * * * * *

+ε -ε

f ( x ) = w⋅ x + b Find N 2 by minimizing 12 w + C ∑ (ξ i + ξ i* ) i =1 subject to constraints: yi − ( w ⋅ x + b ) ≤ ε + ξ i yi − ( w ⋅ x + b) ≥ −ε − ξ i*

ξ i , ξ i* ≥ 0 for i = 1,…,N. x

Notice that only points outside ε-ribbon are penalized! 92

Nonlinear ε-SVR y

* * *** * * ** ** *

+ε * * * -ε

y

** *

** ** * * * * *** *

x +Φ(ε) -Φ(ε)

Φ(x)

Cannot approximate well this function with small ε!

kernel

Φ

Φ

−1

y

* * *** * * ** ** *

* * * +ε -ε

x 93

ε-Support vector regression in “loss + penalty” form

Build decision function of the form: f ( x ) = w ⋅ x + b Find w and b that

2 Minimize ∑ max(0, | yi − f ( xi ) | −ε ) +λ w 2 N

i =1

Loss (“linear ε-insensitive loss”)

Penalty

Loss function value

Error in approximation

94

Comparing ε-SVR with popular regression methods Loss function Linear ε-insensitive loss: y − f x max( 0 , | ( ∑ i i ) | −ε ) N

Penalty function

2

2

2

2

λw2

Resulting algorithm ε-SVR

i =1

Quadratic ε-insensitive loss: 2 max( 0 , ( ( y − f x ∑ i i )) − ε ) N

λw2

Another variant of ε-SVR

i =1

Mean squared error: 2 − y f x ( ( ∑ i i )) N

λw2

Ridge regression

i =1

Mean linear error: N

| y − f ( x ∑ i i)| i =1

λw2

Another variant of ridge regression 95

Comparing loss functions of regression methods Linear ε-insensitive loss Loss function value

-ε

ε Error in approximation

Mean squared error Loss function value

-ε

ε Error in approximation

Quadratic ε-insensitive loss Loss function value

-ε

ε Error in approximation

Mean linear error Loss function value

-ε

ε Error in approximation 96

Applying ε-SVR to real data In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid-search): • parameter C • parameter ε • kernel parameters (e.g., degree of polynomial) Notice that parameter ε depends on the ranges of variables in the dataset; therefore it is recommended to normalize/re-scale data prior to applying ε-SVR. 97

Novelty detection with SVM-based methods

98

What is it about? Decision function = +1 Predictor Y

• Find the simplest and most compact region in the space of predictors where the majority of data samples “live” (i.e., with the highest density of samples). • Build a decision function that takes value +1 in this region and -1 elsewhere. • Once we have such a decision function, we can identify novel or outlier samples/patients in the data.

******* ******** ***** ******** ** * * * **************** * * ************************** * * Decision function = -1

* *

Predictor X

99

Key assumptions •

•

We do not know classes/labels of samples (positive or negative) in the data available for learning this is not a classification problem All positive samples are similar but each negative sample can be different in its own way

Thus, do not need to collect data for negative samples!

100

Sample applications “Normal” “Novel”

“Novel” Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt

“Novel” 101

Sample applications Discover deviations in sample handling protocol when doing quality control of assays. Protein Y

* * *

Samples with low quality of processing from the lab of Dr. Smith

* * Samples with low quality of processing from ICU patients

********* ** ************* * *********** * *** * * ** ** * * **

Samples with high-quality of processing

Samples with low quality of processing from infants

Protein X

Samples with low quality of processing from patients with lung cancer

102

Sample applications Identify websites that discuss benefits of proven cancer treatments.

** ** *

Weighted frequency of word Y Websites that discuss cancer prevention methods

** * * * ** ******* * *********** * * * * * * * ******** ***** * ** ********** * * * * * ** *** *

Websites that discuss benefits of proven cancer treatments

Websites that discuss side-effects of proven cancer treatments

* ** *

*** * Blogs of cancer patients ** *

Weighted frequency of word X

Websites that discuss unproven cancer treatments 103

One-class SVM Main idea: Find the maximal gap hyperplane that separates data from the origin (i.e., the only member of the second class is the origin).

w⋅ x + b = 0

ξi Origin is the only member of the second class

ξj Use “slack variables” as in soft-margin SVMs to penalize these instances

104

Formulation of one-class SVM: linear case

Given training data: x1 , x2 ,..., x N ∈ R n

Find f ( x ) = sign( w ⋅ x + b) by minimizing

w⋅ x + b = 0

ξi

ξj

i.e., the decision function should be positive in all training samples except for small deviations

1 2

w

2

1 + νN

N

∑ξ i =1

i

+b

subject to constraints: w ⋅ x + b ≥ −ξ i ξi ≥ 0 upper bound on for i = 1,…,N.

the fraction of outliers (i.e., points outside decision surface) allowed in the data 105

Formulation of one-class SVM: linear and non-linear cases Linear case

Non-linear case (use “kernel trick”)

Find f ( x ) = sign( w ⋅ x + b) by minimizing

1 2

2 1 w + νN

subject to constraints: w ⋅ x + b ≥ −ξ i ξi ≥ 0 for i = 1,…,N.

Find f ( x ) = sign( w ⋅ Φ ( x ) + b)

N

∑ξ i =1

i

+b

by minimizing

1 2

2 1 w + νN

N

∑ξ i =1

i

subject to constraints: w ⋅ Φ ( x ) + b ≥ −ξ i ξi ≥ 0 for i = 1,…,N.

106

+b

More about one-class SVM • One-class SVMs inherit most of properties of SVMs for binary classification (e.g., “kernel trick”, sample efficiency, ease of finding of a solution by efficient optimization method, etc.); • The choice of other parameter ν significantly affects the resulting decision surface. • The choice of origin is arbitrary and also significantly affects the decision surface returned by the algorithm.

107

Support vector clustering Contributed by Nikita Lytkin

108

Goal of clustering (aka class discovery)

Given a heterogeneous set of data points x1 , x2 ,..., xN ∈ R n Assign labels y1 , y2 ,..., y N ∈ {1,2,..., K } such that points with the same label are highly “similar“ to each other and are distinctly different from the rest

Clustering process

109

Support vector domain description • Support Vector Domain Description (SVDD) of the data is a set of vectors lying on the surface of the smallest hyper-sphere enclosing all data points in a feature space – These surface points are called Support Vectors

kernel

Φ

R

R R

110

SVDD optimization criterion Formulation with hard constraints: Minimize

subject to

R2

Squared radius of the sphere

|| Φ ( xi ) − a ||2 ≤ R 2

for i = 1,…,N

Constraints

R

R

a' R'

a

R

111

Main idea behind Support Vector Clustering • Cluster boundaries in the input space are formed by the set of points that when mapped from the input space to the feature space fall exactly on the surface of the minimal enclosing hyper-sphere – SVs identified by SVDD are a subset of the cluster boundary points

Φ

R

−1

R

a

R

112

Cluster assignment in SVC • Two points xi and x j belong to the same cluster (i.e., have the same label) if every point of the line segment ( xi , x j ) projected to the feature space lies within the hypersphere Some points lie outside the hyper-sphere

Φ

R

R

a

Every point is within the hypersphere in the feature space

R

113

Cluster assignment in SVC (continued) • Point-wise adjacency matrix is constructed by testing the line segments between every pair of points • Connected components are extracted • Points belonging to the same connected component are assigned the same label

C A

E

B D

A B C D E

C A

A

B

C

D

E

1

1

1

0

0

1

1

0

0

1

0

0

1

1 1

E

B D

114

Effects of noise and cluster overlap • In practice, data often contains noise, outlier points and overlapping clusters, which would prevent contour separation and result in all points being assigned to the Noise same cluster Ideal data

Typical data

Outliers

Overlap 115

SVDD with soft constraints • SVC can be used on noisy data by allowing a fraction of points, called Bounded SVs (BSV), to lie outside the hyper-sphere – BSVs are not considered as cluster boundary points and are not assigned to clusters by SVC Noise

Noise

Typical data

kernel

Φ Outliers

Outliers

R

R

a

R

Overlap Overlap

116

Soft SVDD optimization criterion Primal formulation with soft constraints: Minimize

R 2 subject to

Squared radius of the sphere

Introduction of slack variables ξ i mitigates the influence of noise and overlap on the clustering process

|| Φ ( xi ) − a ||2 ≤ R 2 + ξ i ξ i ≥ 0 for i = 1,…,N Soft constraints Outliers Noise

R

R + ξi

Overlap

R

a

117

Dual formulation of soft SVDD Minimize W = ∑ β i K ( xi , xi ) − ∑ β i β j K ( xi , x j ) i

i, j

subject to 0 ≤ β i ≤ C for i = 1,…,N Constraints

• As before, K ( xi , x j ) = Φ ( xi ) ⋅ Φ ( x j ) denotes a kernel function • Parameter 0 < C ≤ 1 gives a trade-off between volume of the sphere and the number of errors (C=1 corresponds to hard constraints) 2 • Gaussian kernel K ( xi , x j ) = exp(−γ xi − x j ) tends to yield tighter contour representations of clusters than the polynomial kernel • The Gaussian kernel width parameter γ > 0 influences tightness of cluster boundaries, number of SVs and the number of clusters • Increasing γ causes an increase in the number of clusters 118

SVM-based variable selection

119

Understanding the weight vector w Recall standard SVM formulation:

w⋅ x + b = 0

Find w and b that minimize 2 1 subject to yi ( w ⋅ xi + b) ≥ 1 2 w

for i = 1,…,N.

Use classifier: f ( x ) = sign( w ⋅ x + b)

Negative instances (y=-1)

Positive instances (y=+1)

• The weight vector w contains as many elements as there are input n variables in the dataset, i.e. w ∈ R . • The magnitude of each element denotes importance of the corresponding variable for classification task.

120

Understanding the weight vector w x2

w = (1,1) 1x1 + 1x2 + b = 0

w = (1,0) 1x1 + 0 x2 + b = 0

x2

x1

X1 and X2 are equally important x2

x1

X1 is important, X2 is not

w = (0,1) 0 x1 + 1x2 + b = 0

x3

w = (1,1,0) 1x1 + 1x2 + 0 x3 + b = 0 x2

x1

X2 is important, X1 is not

x1

X1 and X2 are equally important, X3 is not 121

Understanding the weight vector w Gene X2

Melanoma Nevi

SVM decision surface

w = (1,1) 1x1 + 1x2 + b = 0

Decision surface of another classifier w = (1,0) 1x1 + 0 x2 + b = 0

True model X1 Gene X1

• In the true model, X1 is causal and X2 is redundant X2 Phenotype •SVM decision surface implies that X1 and X2 are equally important; thus it is locally causally inconsistent •There exists a causally consistent decision surface for this example •Causal discovery algorithms can identify that X1 is causal and X2 is redundant 122

Simple SVM-based variable selection algorithm Algorithm: 1. Train SVM classifier using data for all variables to estimate vector w 2. Rank each variable based on the magnitude of the corresponding element in vector w 3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.

123

Simple SVM-based variable selection algorithm Consider that we have 7 variables: X1, X2, X3, X4, X5, X6, X7 The vector w is: (0.1, 0.3, 0.4, 0.01, 0.9, -0.99, 0.2) The ranking of variables is: X6, X5, X3, X2, X7, X1, X4 Classification accuracy

Subset of variables X6

X5

X3

X2

X7

X1

X6

X5

X3

X2

X7

X1

X6

X5

X3

X2

X7

X6

X5

X3

X2

X6

X5

X3

X6

X5

X6

X4

0.920 0.920 0.919 0.852

Best classification accuracy Classification accuracy that is statistically indistinguishable from the best one

0.843 0.832 0.821

Select the following variable subset: X6, X5, X3, X2 , X7

124

Simple SVM-based variable selection algorithm • SVM weights are not locally causally consistent we may end up with a variable subset that is not causal and not necessarily the most compact one. • The magnitude of a variable in vector w estimates the effect of removing that variable on the objective function of SVM (e.g., function that we want to minimize). However, this algorithm becomes suboptimal when considering effect of removing several variables at a time… This pitfall is addressed in the SVM-RFE algorithm that is presented next. 125

SVM-RFE variable selection algorithm Algorithm: 1. Initialize V to all variables in the data 2. Repeat 3. Train SVM classifier using data for variables in V to estimate vector w 4. Estimate prediction accuracy of variables in V using the above SVM classifier (e.g., by cross-validation) 5. Remove from V a variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector w 6. Until there are no variables in V 7. Select the smallest subset of variables with the best prediction accuracy 126

SVM-RFE variable selection algorithm 10,000 genes

SVM model Prediction accuracy

5,000 genes

SVM model Prediction accuracy

Important for classification

5,000 genes

Discarded

Not important for classification

2,500 genes

…

Important for classification

2,500 genes

Discarded

Not important for classification

• Unlike simple SVM-based variable selection algorithm, SVMRFE estimates vector w many times to establish ranking of the variables. • Notice that the prediction accuracy should be estimated at each step in an unbiased fashion, e.g. by cross-validation. 127

SVM variable selection in feature space The real power of SVMs comes with application of the kernel trick that maps data to a much higher dimensional space (“feature space”) where the data is linearly separable. Gene 2

Tumor

Tumor

?

kernel

Φ

?

Normal

Normal Gene 1

input space

feature space 128

SVM variable selection in feature space • We have data for 100 SNPs (X1,…,X100) and some phenotype. • We allow up to 3rd order interactions, e.g. we consider: • X1,…,X100 • X12,X1X2, X1X3,…,X1X100 ,…,X1002 • X13,X1X2X3, X1X2X4,…,X1X99X100 ,…,X1003

• Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy of the phenotype. • Challenge: If we have limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as it is done in classical statistics. 129

SVM variable selection in feature space Heuristic solution: Apply algorithm SVM-FSMB that: 1. Uses SVMs with polynomial kernel of degree 3 and selects M features (not necessarily input variables!) that have largest weights in the feature space. E.g., the algorithm can select features like: X10, (X1X2), (X9X2X22), (X72X98), and so on. 2. Apply HITON-MB Markov blanket algorithm to find the Markov blanket of the phenotype using M features from step 1. 130

Computing posterior class probabilities for SVM classifiers

131

Output of SVM classifier 1. SVMs output a class label (positive or negative) for each sample: sign( w ⋅ x + b) 2. One can also compute distance from the hyperplane that separates classes, e.g. w ⋅ x + b. These distances can be used to compute performance metrics like area under ROC curve.

w⋅ x + b = 0

Negative samples (y=-1)

Positive samples (y=+1)

Question: How can one use SVMs to estimate posterior class probabilities, i.e., P(class positive | sample x)? 132

Simple binning method 1. Train SVM classifier in the Training set. 2. Apply it to the Validation set and compute distances from the hyperplane to each sample. Sample # 1

2

3

4

5

Distance

-1

8

3

4

2

...

98 99

100

-2

0.8

0.3

Training set

Validation set

3. Create a histogram with Q (e.g., say 10) bins using the above distances. Each bin has an upper and lower value in terms of distance.

Testing set

Number of samples in validation set

25 20 15 10 5 0 -15

-10

-5

0 Distance

5

10

15

133

Simple binning method 4.

Given a new sample from the Testing set, place it in the corresponding bin. E.g., sample #382 has distance to hyperplane = 1, so it is placed in the bin [0, 2.5]

Training set

Number of samples in validation set

25 20

10

Testing set

5 0 -15

5.

Validation set

15

-10

-5

0 Distance

5

10

15

Compute probability P(positive class | sample #382) as a fraction of true positives in this bin. E.g., this bin has 22 samples (from the Validation set), out of which 17 are true positive ones , so we compute P(positive class | sample #382) = 17/22 = 0.77

134

Platt’s method Convert distances output by SVM to probabilities by passing them through the sigmoid filter:

1 P ( positive class | sample) = 1 + exp( Ad + B) where d is the distance from hyperplane and A and B are parameters. 1 0.9

P(positive class|sample)

0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 -10

-8

-6

-4

-2

0 Distance

2

4

6

8

10

135

Platt’s method 1. Train SVM classifier in the Training set. 2. Apply it to the Validation set and compute distances from the hyperplane to each sample. Sample # 1

2

3

4

5

Distance

-1

8

3

4

3.

2

...

98 99

100

-2

0.8

0.3

Determine parameters A and B of the sigmoid function by minimizing the negative log likelihood of the data from the Validation set.

Training set

Validation set Testing set

4. Given a new sample from the Testing set, compute its posterior probability using sigmoid function.

136

Part 3 • Case studies (taken from our research) 1. 2. 3. 4. 5. 6.

Classification of cancer gene expression microarray data Text categorization in biomedicine Prediction of clinical laboratory values Modeling clinical judgment Using SVMs for feature selection Outlier detection in ovarian cancer proteomics data

• Software • Conclusions • Bibliography

137

1. Classification of cancer gene expression microarray data

138

Comprehensive evaluation of algorithms for classification of cancer microarray data Main goals: • Find the best performing decision support algorithms for cancer diagnosis from microarray gene expression data; • Investigate benefits of using gene selection and ensemble classification methods.

139

Classifiers • • • • • • • • • • •

K-Nearest Neighbors (KNN) Backpropagation Neural Networks (NN) Probabilistic Neural Networks (PNN) Multi-Class SVM: One-Versus-Rest (OVR) Multi-Class SVM: One-Versus-One (OVO) Multi-Class SVM: DAGSVM Multi-Class SVM by Weston & Watkins (WW) Multi-Class SVM by Crammer & Singer (CS) Weighted Voting: One-Versus-Rest Weighted Voting: One-Versus-One Decision Trees: CART

instance-based neural networks

kernel-based

voting decision trees 140

Ensemble classifiers dataset

Classifier 1

Classifier 2

…

Classifier N

Prediction 1

Prediction 2

…

Prediction N

dataset

Ensemble Classifier Final Prediction 141

Gene selection methods Highly discriminatory genes Uninformative genes

1.

Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;

2.

Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;

3.

Kruskal-Wallis nonparametric one-way ANOVA (KW);

4.

Ratio of genes betweencategories to within-category sum of squares (BW).

genes

142

Performance metrics and statistical comparison 1. Accuracy + can compare to previous studies + easy to interpret & simplifies statistical comparison

2. Relative classifier information (RCI) + easy to interpret & simplifies statistical comparison + not sensitive to distribution of classes + accounts for difficulty of a decision problem

• Randomized permutation testing to compare accuracies of the classifiers (α=0.05) 143

Microarray datasets Number of Dataset name Sam- Variables Cateples (genes) gories

Reference

11_Tumors

174

12533

11

Su, 2001

14_Tumors

308

15009

26

Ramaswamy, 2001

9_Tumors

60

5726

9

Staunton, 2001

Brain_Tumor1

90

5920

5

Pomeroy, 2002

Brain_Tumor2

50

10367

4

Nutt, 2003

Leukemia1

72

5327

3

Golub, 1999

Leukemia2

72

11225

3

Armstrong, 2002

Lung_Cancer

203

12600

5

Bhattacherjee, 2001

SRBCT

83

2308

4

Khan, 2001

10509

2

Singh, 2002

5469

2

Shipp, 2002

Prostate_Tumor 102 DLBCL

77

Total: •~1300 samples •74 diagnostic categories •41 cancer types and 12 normal tissue types

144

Summary of methods and datasets Cross-Validation Designs (2)

Gene Selection Methods (4)

Performance Metrics (2)

Statistical Comparison

10-Fold CV

S2N One-Versus-Rest

Accuracy

LOOCV

S2N One-Versus-One

RCI

Randomized permutation testing

Non-param. ANOVA

Gene Expression Datasets (11) 11_Tumors

One-Versus-One

14_Tumors

Ensemble Classifiers (7) Majority Voting

KNN Backprop. NN Prob. NN Decision Trees One-Versus-Rest One-Versus-One

Based on MCSVM outputs

Method by WW

MC-SVM OVR MC-SVM OVO MC-SVM DAGSVM Decision Trees

Brain Tumor1 Brain_Tumor2 Leukemia1

Lung_Cancer SRBCT

Majority Voting Decision Trees

9_Tumors

Leukemia2

Binary Dx

DAGSVM

Multicategory Dx

One-Versus-Rest

Method by CS

WV

BW ratio

Based on outputs of all classifiers

MC-SVM

Classifiers (11)

Prostate_Tumors DLBCL 145

em

ia1

ia2 Lu ng _C an ce r SR Pr BC os T tat e_ Tu mo DL r BC L

Le uk

or s

or 1

or 2

or s

em

_T um

_T um

_T um

_T um

Tu mo rs

Le uk

11

Br ain

Br ain

14

9_ Accuracy, % 60

40 OVR OVO DAGSVM WW CS KNN NN PNN

20

0

146

MC-SVM

Results without gene selection

100

80

Results with gene selection Improvement of diagnostic performance by gene selection (averages for the four datasets)

Diagnostic performance before and after gene selection 100

70

9_Tumors

14_Tumors

60

60 40

50

Accuracy, %

Improvement in accuracy, %

80

40 30

20

100

Brain_Tumor1

Brain_Tumor2

80

20 60

10

40 20

OVR

OVO DAGSVM SVM

WW

CS

KNN

NN non-SVM

PNN

SVM

non-SVM

SVM

non-SVM

Average reduction of genes is 10-30 times 147

Comparison with previously published results Multiclass SVMs (this study)

100

Accuracy, %

80

60

Multiple specialized classification methods (original primary studies)

40

20

ia2 Lu ng _C an ce r SR Pr BC os T tat e_ Tu mo DL r BC L

em

ia1

Le uk

em

or s

Le uk

_T um

or 1 11

_T um

or 2 Br ain

_T um

or s Br ain

_T um

14

9_

Tu mo rs

0

148

Summary of results • Multi-class SVMs are the best family among the tested algorithms outperforming KNN, NN, PNN, DT, and WV. • Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms; • Ensemble classification does not improve performance of SVM and other classifiers; • Results obtained by SVMs favorably compare with the literature. 149

Random Forest (RF) classifiers • Appealing properties – – – – – –

Work when # of predictors > # of samples Embedded gene selection Incorporate interactions Based on theory of ensemble learning Can work with binary & multiclass tasks Does not require much fine-tuning of parameters

• Strong theoretical claims • Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods 150

Key principles of RF classifiers

Testing data Training data

4) Apply to testing data & combine predictions

1) Generate bootstrap samples

2) Random gene selection

3) Fit unpruned decision trees

151

Results without gene selection

• SVMs nominally outperform RFs is 15 datasets, RFs outperform SVMs in 4 datasets, algorithms are exactly the same in 3 datasets. • In 7 datasets SVMs outperform RFs statistically significantly. • On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI. 152

Results with gene selection

• SVMs nominally outperform RFs is 17 datasets, RFs outperform SVMs in 3 datasets, algorithms are exactly the same in 2 datasets. • In 1 dataset SVMs outperform RFs statistically significantly. • On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI. 153

2. Text categorization in biomedicine

154

Models to categorize content and quality: Main idea

1. Utilize existing (or easy to build) training corpora 1

2

3 4

2. Simple document representations (i.e., typically stemmed and weighted words in title and abstract, Mesh terms if available; occasionally addition of Metamap CUIs, author info) as “bag-of-words” 155

Models to categorize content and quality: Main idea Unseen Examples

3. Train SVM models that capture implicit categories of meaning or quality criteria

Labeled Examples

Labeled

4. Evaluate models’ performances - with nested cross-validation or other appropriate error estimators - use primarily AUC as well as other metrics (sensitivity, specificity, PPV, Precision/Recall curves, HIT curves, etc.) 5. Evaluate performance prospectively & compare to prior cross-validation estimates

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Txmt

Diag

Estimated Performance

Prog

Etio

2005 Performance

156

Models to categorize content and quality: Some notable results Category

Average AUC

Range over n folds

Treatment

0.97*

0.96 - 0.98

Etiology

0.94*

0.89 – 0.95

Prognosis

0.95*

0.92 – 0.97

Diagnosis

0.95*

0.93 - 0.98

1. SVM models have excellent ability to identify high-quality PubMed documents according to ACPJ gold standard

Method

Treatment AUC

Etiology AUC

Prognosis AUC

Diagnosis AUC

Google Pagerank

0.54

0.54

0.43

0.46

Yahoo Webranks

0.56

0.49

0.52

0.52

Impact Factor 2005

0.67

0.62

0.51

0.52

Web page hit count

0.63

0.63

0.58

0.57

Bibliometric Citation Count

0.76

0.69

0.67

0.60

Machine Learning Models

0.96

0.95

0.95

0.95

2. SVM models have better classification performance than PageRank, Yahoo ranks, Impact Factor, Web Page hit counts, and bibliometric citation counts on the Web according to ACPJ gold standard 157

Models to categorize content and quality: Some notable results Treatment - Fixed Sensitivity

Gold standard: SSOAB

Area under the ROC curve*

SSOAB-specific filters

0.893

Citation Count

0.791

ACPJ Txmt-specific filters 0.548 Impact Factor (2001)

0.549

Impact Factor (2005)

0.558

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.98

0.98

Treatment - Fixed Specificity

0.89 0.71

Fixed Sens Query Filters

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Spec

Sens

Learning Models

0.98

0.75 0.44

Query Filters

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

3. SVM models have better classification performance than PageRank, Impact Factor and Citation count in Medline for SSOAB gold standard

0.96

Query Filters

0.91

0.91

Sens Query Filters

Fixed Spec Learning Models

Diagnosis - Fixed Specificity

0.88 0.68

Fixed Sens

0.94

Spec

0.96

Learning Models

0.68

Learning Models

Diagnosis - Fixed Sensitivity 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.91

Etiology - Fixed Specificity

0.98

Fixed Sens

0.91

Fixed Spec

Query Filters

Etiology - Fixed Sensitivity 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.95 0.8

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Spec

0.97

Sens

Learning Models

Fixed Spec

Query Filters

Prognosis - Fixed Sensitivity

0.97

0.82 0.65

Learning Models

Prognosis - Fixed Specificity 1

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.8

0.8

Fixed Sens Query Filters

0.87 0.71

Spec Learning Models

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

0.8

Sens Query Filters

0.77

0.77

Fixed Spec Learning Models

4. SVM models have better sensitivity/specificity in PubMed than CQFs at 158 comparable thresholds according to ACPJ gold standard

Other applications of SVMs to text categorization Model

Area Under the Curve

Machine Learning Models 0.93 Quackometer*

0.67

0.63

1. Identifying Web Pages with misleading treatment information according to special purpose gold standard (Quack Watch). SVM models outperform Quackometer and Google ranks in the tested domain of cancer treatment. 2. Prediction of future paper citation counts (work of L. Fu and C.F. Aliferis, AMIA 2008) 159

3. Prediction of clinical laboratory values

160

Dataset generation and experimental design • StarPanel database contains ~8·106 lab measurements of ~100,000 inpatients from Vanderbilt University Medical Center. • Lab measurements were taken between 01/1998 and 10/2002. For each combination of lab test and normal range, we generated the following datasets.

01/1998-05/2001

Training

06/2001-10/2002

Validation (25% of Training)

Testing

161

Query-based approach for prediction of clinical cab values Training data

Data model

database

Testing sample

Train SVM classifier Validation data SVM classifier Performance These steps are performed for every data model

Testing data

These steps are performed for every testing sample

Optimal data model Prediction 162

Classification results Excluding cases with K=0 (i.e. samples with no prior lab measurements)

Area under ROC curve (without feature selection)

Area under ROC curve (without feature selection)

BUN

>1 75.9%

Range of normal values 2.5 1 digital image available Diagnoses made in 2004

Features Lesion location

Family history of melanoma

Irregular Border

Streaks (radial streaming, pseudopods)

Max-diameter

Fitzpatrick’s Photo-type

Number of colors

Slate-blue veil

Min-diameter

Sunburn

Atypical pigmented network

Whitish veil

Evolution

Ephelis

Abrupt network cut-off

Globular elements

Age

Lentigos

Regression-Erythema

Comedo-like openings, milia-like cysts

Gender

Asymmetry

Hypo-pigmentation

Telangiectasia

Method to explain physician-specific SVM models

FS

Build DT

Build SVM SVM ”Black Box” Regular Learning

Apply SVM

Meta-Learning

Results: Predicting physicians’ judgments

All

RFE

(features)

HITON_PC (features)

HITON_MB (features)

(features)

Expert 1

0.94 (24)

0.92 (4)

0.92 (5)

0.95 (14)

Expert 2

0.92 (24)

0.89 (7)

0.90 (7)

0.90 (12)

Expert 3

0.98 (24)

0.95 (4)

0.95 (4)

0.97 (19)

NonExpert 1

0.92 (24)

0.89 (5)

0.89 (6)

0.90 (22)

NonExpert 2

1.00 (24)

0.99 (6)

0.99 (6)

0.98 (11)

NonExpert 3

0.89 (24)

0.89 (4)

0.89 (6)

0.87 (10)

Physicians

Results: Physician-specific models

Results: Explaining physician agreement Patient 001

Expert 1 AUC=0.92 R2=99%

Blue veil

irregular border

streaks

yes

no

yes

Expert 3 AUC=0.95 R2=99%

Results: Explain physician disagreement Patient 002

Expert 1 AUC=0.92 R2=99%

Blue veil

irregular border

streaks

number of colors

evolution

no

no

yes

3

no

Expert 3 AUC=0.95 R2=99%

Results: Guideline compliance Physician

Reported guidelines

Compliance

Experts1,2,3, non-expert 1

Pattern analysis

Non-compliant: they ignore the majority of features (17 to 20) recommended by pattern analysis.

Non expert 2

ABCDE rule

Non compliant: asymmetry, irregular border and evolution are ignored.

Non expert 3

Non-standard. Reports using 7 features

Non compliant: 2 out of 7 reported features are ignored while some nonreported ones are not

On the contrary: In all guidelines, the more predictors present, the higher the likelihood of melanoma. All physicians were compliant with this principle.

5. Using SVMs for feature selection

176

Feature selection methods Feature selection methods (non-causal) • SVM-RFE This is an SVM-based • Univariate + wrapper feature selection method • Random forest-based • LARS-Elastic Net • RELIEF + wrapper • L0-norm • Forward stepwise feature selection • No feature selection Causal feature selection methods • HITON-PC This method outputs a • HITON-MB Markov blanket of the • IAMB response variable • BLCD (under assumptions) • K2MB … 177

13 real datasets were used to evaluate feature selection methods Dataset name

Domain

Number of variables

Number of samples

Infant_Mortality

Clinical

86

5,337

Died within the first year

discrete

Ohsumed

Text

14,373

5,000

Relevant to nenonatal diseases

continuous

ACPJ_Etiology

Text

28,228

15,779

Relevant to eitology

continuous

Lymphoma

Gene expression

7,399

227

3-year survival: dead vs. alive

continuous

Gisette

Digit recognition

5,000

7,000

Separate 4 from 9

continuous

Dexter

Text

19,999

600

Relevant to corporate acquisitions

continuous

Sylva

Ecology

216

14,394

Ponderosa pine vs. everything else

continuous & discrete

Proteomics

2,190

216

Cancer vs. normals

continuous

Thrombin

Drug discovery

139,351

2,543

Binding to thromin

discrete (binary)

Breast_Cancer

Gene expression

17,816

286

Estrogen-receptor positive (ER+) vs. ER-

continuous

Hiva

Drug discovery

1,617

4,229

Activity to AIDS HIV infection

discrete (binary)

Nova

Text

16,969

1,929

Separate politics from religion topics

discrete (binary)

Financial

147

7,063

Personal bankruptcy

continuous & discrete

Ovarian_Cancer

Bankruptcy

Target

Data type

178

Classification performance vs. proportion of selected features Magnified

Original 1 0.9 Classification performance (AUC)

Classification performance (AUC)

0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 2

HITON-PC with G test RFE

0.55 0.5

0

0.5 Proportion of selected features

1

0.89 0.88 0.87 0.86 0.85

HITON-PC with G2 test RFE 0.05 0.1 0.15 0.2 Proportion of selected features 179

Statistical comparison of predictivity and reduction of features Predicitivity

SVM-RFE (4 variants)

Reduction

P-value

Nominal winner

P-value

Nominal winner

0.9754

SVM-RFE

0.0046

HITON-PC

0.8030

SVM-RFE

0.0042

HITON-PC

0.1312

HITON-PC

0.3634

HITON-PC

0.1008

HITON-PC

0.6816

SVM-RFE

• Null hypothesis: SVM-RFE and HITON-PC perform the same; • Use permutation-based statistical test with alpha = 0.05. 180

Simulated datasets with known causal structure used to compare algorithms

181

Comparison of SVM-RFE and HITON-PC

182

Comparison of all methods in terms of causal graph distance SVM-RFE

HITON-PC based causal methods

183

Summary results HITON-PC based causal methods

HITON-PCFDR methods

SVM-RFE

184

Statistical comparison of graph distance

Comparison average HITON-PC-FDR with G2 test vs. average SVM-RFE

Sample size = 200 Nominal P-value winner