CS434a/541a: Pattern Recognition
Prof. Olga Veksler

Lecture 8

Today

Continue with Dimensionality Reduction. Last lecture: PCA. This lecture: Fisher Linear Discriminant.

Data Representation vs. Data Classification

PCA finds the most accurate data representation in a lower-dimensional space: project the data in the directions of maximum variance.

However, the directions of maximum variance may be useless for classification. [Figure: data that is separable in 2D becomes not separable after applying PCA to each class.]

Fisher Linear Discriminant: project to a line whose direction is useful for data classification.

Fisher Linear Discriminant

Main idea: find a projection to a line such that samples from different classes are well separated. Example in 2D:

bad line to project to, classes are mixed up

good line to project to, classes are well separated

Fisher Linear Discriminant

Suppose we have 2 classes and d-dimensional samples x_1, …, x_n, where n_1 samples come from the first class and n_2 samples come from the second class. Consider a projection onto a line, and let the line direction be given by the unit vector v.

The scalar v^t x_i is the distance of the projection of x_i from the origin. Thus v^t x_i is the projection of x_i onto a one-dimensional subspace.

Fisher Linear Discriminant

Thus the projection of sample x_i onto a line in direction v is given by v^t x_i. How do we measure separation between the projections of different classes? Let µ~1 and µ~2 be the means of the projections of classes 1 and 2, and let µ1 and µ2 be the means of classes 1 and 2. Then |µ~1 − µ~2| seems like a good measure.

\tilde{\mu}_1 = \frac{1}{n_1} \sum_{x_i \in C_1} v^t x_i = v^t \left( \frac{1}{n_1} \sum_{x_i \in C_1} x_i \right) = v^t \mu_1

Similarly, \tilde{\mu}_2 = v^t \mu_2.
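As a quick numerical check, here is a minimal numpy sketch (my own, not from the slides; the synthetic data and the direction v are arbitrary choices) showing that the mean of the projected samples equals the projection of the class mean.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[1.0, 3.0], scale=0.5, size=(50, 2))   # hypothetical class-1 samples

v = np.array([1.0, 2.0])
v = v / np.linalg.norm(v)        # unit vector giving the line direction

y1 = X1 @ v                      # projected samples y_i = v^t x_i
print(y1.mean())                 # mean of the projections, mu~_1
print(v @ X1.mean(axis=0))       # projection of the class mean, v^t mu_1 -- same value
```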

Fisher Linear Discriminant

How good is |µ~1 − µ~2| as a measure of separation?

The larger |µ~1 − µ~2| is, the better the expected separation.

[Figure: the two classes projected onto the horizontal and the vertical axis, with the projected means marked on each axis.]

The vertical axis is a better line to project to than the horizontal axis for class separability; however, the distance between the projected means is larger along the horizontal axis.

Fisher Linear Discriminant

The problem with |µ~1 − µ~2| is that it does not consider the variance of the classes.

[Figure: along the direction where the projected means are far apart the classes have large variance and overlap; along the other direction the variance is small and the classes separate.]

Fisher Linear Discriminant

We need to normalize |µ~1 − µ~2| by a factor which is proportional to the variance.

Given samples z_1, …, z_n, the sample mean is

\mu_z = \frac{1}{n} \sum_{i=1}^{n} z_i

Define their scatter as

s = \sum_{i=1}^{n} (z_i - \mu_z)^2

Thus scatter is just the sample variance multiplied by n: it measures the same thing as variance, namely the spread of the data around the mean, just on a different scale.
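A tiny sketch (assuming plain numpy; the data values are made up) illustrating that scatter is n times the biased sample variance:

```python
import numpy as np

z = np.array([1.0, 2.0, 2.5, 4.0, 5.5])        # made-up 1-D samples
scatter = np.sum((z - z.mean()) ** 2)          # s = sum_i (z_i - mu_z)^2
print(scatter)                                 # 12.5
print(len(z) * z.var())                        # n * (biased) sample variance -- identical
```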

Fisher Linear Discriminant

Fisher's solution: normalize |µ~1 − µ~2| by scatter.

Let y_i = v^t x_i, i.e. the y_i are the projected samples. The scatter for the projected samples of class 1 is

\tilde{s}_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde{\mu}_1)^2

and the scatter for the projected samples of class 2 is

\tilde{s}_2^2 = \sum_{y_i \in \text{Class 2}} (y_i - \tilde{\mu}_2)^2

Fisher Linear Discriminant

We need to normalize by both the scatter of class 1 and the scatter of class 2. Thus the Fisher linear discriminant projects onto the line in the direction v which maximizes

J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

We want the projected means to be far from each other (large numerator), and we want the scatter in each class to be as small as possible, i.e. the samples of class 1 cluster around the projected mean µ~1 and the samples of class 2 cluster around the projected mean µ~2 (small denominator). A small numerical sketch of evaluating J(v) follows.
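Below is a minimal sketch (not from the lecture; the function name fisher_criterion and the synthetic data are my own) that evaluates J(v) directly from the projected samples, so different candidate directions can be compared.

```python
import numpy as np

def fisher_criterion(X1, X2, v):
    """J(v) = (mu~1 - mu~2)^2 / (s~1^2 + s~2^2) for the 1-D projections y = v^t x."""
    y1, y2 = X1 @ v, X2 @ v                       # project both classes onto v
    between = (y1.mean() - y2.mean()) ** 2        # squared distance of projected means
    within = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    return between / within

# Compare two candidate directions on synthetic data (arbitrary example values)
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [2.0, 0.3], size=(100, 2))
X2 = rng.normal([0.0, 1.5], [2.0, 0.3], size=(100, 2))
print(fisher_criterion(X1, X2, np.array([1.0, 0.0])))   # large-variance direction: tiny J
print(fisher_criterion(X1, X2, np.array([0.0, 1.0])))   # separating direction: much larger J
```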

Fisher Linear Discriminant

J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

If we find a v which makes J(v) large, we are guaranteed that the classes are well separated: the projected means are far from each other, a small s~1 implies that the projected samples of class 1 are clustered around their projected mean µ~1, and a small s~2 implies that the projected samples of class 2 are clustered around their projected mean µ~2.

Fisher Linear Discriminant Derivation

J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

All we need to do now is express J explicitly as a function of v and maximize it. This is straightforward, but it needs some linear algebra and calculus.

Define the separate class scatter matrices S_1 and S_2 for classes 1 and 2. These measure the scatter of the original samples x_i (before projection):

S_1 = \sum_{x_i \in \text{Class 1}} (x_i - \mu_1)(x_i - \mu_1)^t

S_2 = \sum_{x_i \in \text{Class 2}} (x_i - \mu_2)(x_i - \mu_2)^t

Fisher Linear Discriminant Derivation

Now define the within-class scatter matrix

S_W = S_1 + S_2

Recall that

\tilde{s}_1^2 = \sum_{y_i \in \text{Class 1}} (y_i - \tilde{\mu}_1)^2

Using y_i = v^t x_i and \tilde{\mu}_1 = v^t \mu_1:

\tilde{s}_1^2 = \sum_{x_i \in \text{Class 1}} (v^t x_i - v^t \mu_1)^2
  = \sum_{x_i \in \text{Class 1}} \left(v^t (x_i - \mu_1)\right)\left(v^t (x_i - \mu_1)\right)
  = \sum_{x_i \in \text{Class 1}} v^t (x_i - \mu_1)(x_i - \mu_1)^t v
  = v^t \left( \sum_{x_i \in \text{Class 1}} (x_i - \mu_1)(x_i - \mu_1)^t \right) v
  = v^t S_1 v

Fisher Linear Discriminant Derivation

Similarly, \tilde{s}_2^2 = v^t S_2 v. Therefore

\tilde{s}_1^2 + \tilde{s}_2^2 = v^t S_1 v + v^t S_2 v = v^t S_W v

Define the between-class scatter matrix

S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t

S_B measures the separation between the means of the two classes (before projection). Let's rewrite the separation of the projected means:

(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (v^t \mu_1 - v^t \mu_2)^2 = v^t (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t v = v^t S_B v
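As a sanity check, here is a short numpy sketch (my own; the data is random and the direction v is arbitrary) confirming that the projected scatters and the squared distance of the projected means match v^t S_W v and v^t S_B v.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))   # made-up class-1 samples
X2 = rng.normal([3.0, 1.0], 1.0, size=(30, 2))   # made-up class-2 samples
v = rng.normal(size=2)                           # arbitrary direction (need not be unit length)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
SW, SB = S1 + S2, np.outer(mu1 - mu2, mu1 - mu2)

y1, y2 = X1 @ v, X2 @ v
print(np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2), v @ SW @ v)   # equal
print((y1.mean() - y2.mean())**2, v @ SB @ v)                                  # equal
```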

Fisher Linear Discriminant Derivation

Thus our objective function can be written

J(v) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{v^t S_B v}{v^t S_W v}

Maximize J(v) by taking the derivative with respect to v and setting it to 0:

\frac{d}{dv} J(v) = \frac{\left(\frac{d}{dv} v^t S_B v\right) v^t S_W v - v^t S_B v \left(\frac{d}{dv} v^t S_W v\right)}{(v^t S_W v)^2} = \frac{(2 S_B v)\, v^t S_W v - (2 S_W v)\, v^t S_B v}{(v^t S_W v)^2} = 0

Fisher Linear Discriminant Derivation

We need to solve

v^t S_W v\,(S_B v) - v^t S_B v\,(S_W v) = 0

Dividing by v^t S_W v:

\frac{v^t S_W v\,(S_B v)}{v^t S_W v} - \frac{v^t S_B v\,(S_W v)}{v^t S_W v} = 0

S_B v - \frac{v^t S_B v}{v^t S_W v}\,(S_W v) = 0

Letting \lambda = \frac{v^t S_B v}{v^t S_W v}, this becomes

S_B v = \lambda S_W v

which is a generalized eigenvalue problem.

Fisher Linear Discriminant Derivation

S_B v = \lambda S_W v

If S_W has full rank (its inverse exists), we can convert this to a standard eigenvalue problem:

S_W^{-1} S_B v = \lambda v

But S_B x, for any vector x, points in the same direction as µ1 − µ2:

S_B x = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^t x = (\mu_1 - \mu_2)\left((\mu_1 - \mu_2)^t x\right) = \alpha (\mu_1 - \mu_2)

Thus we can solve the eigenvalue problem immediately:

v = S_W^{-1} (\mu_1 - \mu_2)

Check:

S_W^{-1} S_B \left[ S_W^{-1} (\mu_1 - \mu_2) \right] = S_W^{-1} \left[ \alpha (\mu_1 - \mu_2) \right] = \alpha \left[ S_W^{-1} (\mu_1 - \mu_2) \right]

so v = S_W^{-1}(µ1 − µ2) is indeed an eigenvector, with eigenvalue λ = α.
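A minimal implementation sketch (assuming numpy; the function name fisher_direction is my own) of the closed-form solution above:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction v = S_W^{-1} (mu_1 - mu_2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # class-1 scatter matrix
    S2 = (X2 - mu2).T @ (X2 - mu2)        # class-2 scatter matrix
    # solve S_W v = (mu_1 - mu_2) instead of forming the inverse explicitly
    return np.linalg.solve(S1 + S2, mu1 - mu2)
```

Solving the linear system is numerically preferable to computing S_W^{-1} explicitly; only the direction of v matters, so it can be normalized to unit length afterwards if desired.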

Fisher Linear Discriminant Example

Data: Class 1 has 5 samples, c1 = [(1,2),(2,3),(3,3),(4,5),(5,5)]. Class 2 has 6 samples, c2 = [(1,0),(2,1),(3,1),(3,2),(5,3),(6,5)].

Arrange the data in 2 separate matrices (one sample per row):

c1 = [1 2; 2 3; 3 3; 4 5; 5 5]        c2 = [1 0; 2 1; 3 1; 3 2; 5 3; 6 5]

Notice that PCA performs very poorly on this data because the direction of largest variance is not helpful for classification.

Fisher Linear Discriminant Example

First compute the mean for each class:

µ1 = mean(c1) = [3  3.6]        µ2 = mean(c2) = [3.3  2]

Compute the scatter matrices S1 and S2 for each class:

S1 = 4 * cov(c1) = [10 8; 8 7.2]        S2 = 5 * cov(c2) = [17.3 16; 16 16]

Within-class scatter:

SW = S1 + S2 = [27.3 24; 24 23.2]

It has full rank, so we don't have to solve for eigenvalues. The inverse of SW is

SW^-1 = inv(SW) = [0.39 -0.41; -0.41 0.47]

Finally, the optimal line direction is

v = SW^-1 (µ1 − µ2) = [-0.79; 0.89]

Fisher Linear Discriminant Example

Notice that as long as the line has the right direction, its exact position does not matter. The last step is to compute the actual 1D projections y; let's do it separately for each class, using the direction normalized to unit length, v ≈ [-0.65  0.73]:

Y1 = v' * c1' = [-0.65 0.73] * [1 2 3 4 5; 2 3 3 5 5] = [0.81 ... 0.4]

Y2 = v' * c2' = [-0.65 0.73] * [1 2 3 3 5 6; 0 1 1 2 3 5] = [-0.65 ... -0.25]

The sketch below reproduces the full projected vectors.
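The following numpy sketch (my own; the numbers should match the example above up to rounding) reproduces the whole computation, and also projects the pooled data onto its first principal component to illustrate the earlier remark that PCA does poorly on this data.

```python
import numpy as np

c1 = np.array([(1, 2), (2, 3), (3, 3), (4, 5), (5, 5)], dtype=float)
c2 = np.array([(1, 0), (2, 1), (3, 1), (3, 2), (5, 3), (6, 5)], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)       # [3, 3.6] and [3.33, 2]
S1 = (len(c1) - 1) * np.cov(c1, rowvar=False)     # 4 * cov(c1)
S2 = (len(c2) - 1) * np.cov(c2, rowvar=False)     # 5 * cov(c2)
SW = S1 + S2                                      # within-class scatter

v = np.linalg.solve(SW, mu1 - mu2)                # approx [-0.79, 0.89]
v = v / np.linalg.norm(v)                         # unit direction, approx [-0.65, 0.73]

print("Y1 =", c1 @ v)   # projected class-1 samples, all on one side of 0
print("Y2 =", c2 @ v)   # projected class-2 samples, all on the other side

# For comparison: project onto the first principal component of the pooled data
X = np.vstack([c1, c2])
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, -1]                              # direction of largest variance
print("PCA:", np.sort(c1 @ pc1), np.sort(c2 @ pc1))   # the two ranges overlap heavily
```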

Multiple Discriminant Analysis (MDA)

Can generalize FLD to multiple classes. In the case of c classes, we can reduce dimensionality to 1, 2, 3, …, c-1 dimensions. Project sample x_i to a linear subspace: y_i = V^t x_i, where V is called the projection matrix.

Multiple Discriminant Analysis (MDA)

Let n_i be the number of samples of class i, µ_i be the sample mean of class i, and µ be the total mean of all samples:

\mu_i = \frac{1}{n_i} \sum_{x \in \text{class } i} x, \qquad \mu = \frac{1}{n} \sum_{i} x_i

Objective function:

J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}

The within-class scatter matrix S_W is

S_W = \sum_{i=1}^{c} S_i = \sum_{i=1}^{c} \sum_{x_k \in \text{class } i} (x_k - \mu_i)(x_k - \mu_i)^t

The between-class scatter matrix S_B is

S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^t

Its maximum rank is c - 1.

Multiple Discriminant Analysis (MDA)

J(V) = \frac{\det(V^t S_B V)}{\det(V^t S_W V)}

First solve the generalized eigenvalue problem S_B v = \lambda S_W v. There are at most c-1 distinct solution eigenvalues; let v_1, v_2, …, v_{c-1} be the corresponding eigenvectors. The optimal projection matrix V onto a subspace of dimension k is given by the eigenvectors corresponding to the largest k eigenvalues. Thus we can project to a subspace of dimension at most c-1.
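A compact sketch of this recipe (my own, assuming numpy/scipy are available and S_W is nonsingular), using scipy.linalg.eigh to solve the symmetric generalized eigenvalue problem S_B v = λ S_W v:

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection_matrix(X, labels, k):
    """Return the d x k projection matrix V built from the top-k generalized eigenvectors."""
    d = X.shape[1]
    mu = X.mean(axis=0)                                  # total mean
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        SB += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
    # eigh solves S_B v = lambda S_W v; eigenvalues come back in ascending order
    _, eigvecs = eigh(SB, SW)
    return eigvecs[:, -k:]                               # eigenvectors of the k largest eigenvalues

# Usage: Y = X @ mda_projection_matrix(X, labels, k) gives y_i = V^t x_i for each sample
```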

FDA and MDA Drawbacks

Reduces dimension only to k = c-1 (unlike PCA). For complex data, projection to even the best line may result in projected samples that are not separable.

Will fail:

1. If J(v) is always 0: this happens if µ1 = µ2. [Figures: one example where PCA performs reasonably well here, and one where PCA also fails.]

2. If J(v) is always small: the classes have large overlap when projected onto any line (PCA will also fail).