Linear vs. Quadratic Discriminant Classifier

Alaa Tharwat

Tharwat, Alaa. "Linear vs. quadratic discriminant analysis classifier: a tutorial." International Journal of Applied Pattern Recognition 3.2 (2016): 145-180.

Email: [email protected]

December 13, 2017


Agenda

Introduction
  Building a classification model
  Discriminant functions
  Decision boundaries
  Normal density and Bayes' rule
  Discriminant functions for the normal density
Special cases of the discriminant analysis classifier
  Special Case 1: Equal Variance (Σi = σ²I)
  Special Case 2: Equal Variance (Σi = Σ)
  Special Case 3: Different Covariance Matrices (Σi is arbitrary)
Numerical Examples
  Example for Case 1
    How to classify an unknown sample or pattern
    What is the influence of changing the prior probability?
    What is the distance from the samples of two different classes to the decision boundary that separates them?
  Example for Case 2
  Example for Case 3
Singularity problem
Summary



Introduction

A pattern or sample is represented by a vector of m features, i.e. one point in the m-dimensional pattern space (R^m). For example, in character recognition the features may include histograms that count the number of black pixels along the vertical and horizontal directions, stroke detection, and the number of internal holes. In computer vision, the features may include edges, shape, area, etc. In speech recognition, the features can be the power of the sound signal, noise ratios, or the length of the signal.

The goal of the pattern classification process is to train a model on the labelled patterns so that it can assign a class label to an unknown pattern. The classifier is represented by c decision or discriminant functions ({f1, f2, ..., fc}), i.e. one discriminant function per class. The discriminant functions are used to determine the decision boundaries between the classes and the region or area of each class.


Introduction

Given two classes ω1 (blue) and ω2 (red), where each pattern or sample is represented by only two features, the decision boundary between the two classes is nonlinear.

Figure: An example of classification with two classes separated by a nonlinear decision boundary.



Building a Classifier Model

Discriminant functions

Discriminant functions are used to build the decision boundaries that separate the different classes into different regions (ωi, i = 1, 2, ..., c). Assume there are two classes, ω1 and ω2; then there are two discriminant functions, f1 and f2. The discriminant functions determine the decision boundary between the two classes. In other words, they determine the class label of an unknown pattern x by comparing the c discriminant values (in our example, c = 2):

x ∈ ωi  if  fi(x) > fj(x),  i, j = 1, 2, ..., c,  i ≠ j    (1)


Building a Classifier Model

Decision boundaries

After calculating the discriminant functions, the decision region or class label of an unknown pattern x is calculated as follows:

sgn(S12(x)) = sgn(f1(x) − f2(x)):
  Class 1    if S12(x) > 0
  Undefined  if S12(x) = 0    (2)
  Class 2    if S12(x) < 0


Building a Classifier Model

Decision boundaries

Figure: Illustrative example of how the discriminant functions create the decision boundary: f1 > f2 in the region of class ω1 (Class 1), f2 > f1 in the region of class ω2 (Class 2), and f1 = f2 on the decision boundary.


Building a Classifier Model

Decision boundaries

The classification problem with only two classes (binary classification) is relatively simple. In practice there are often many classes (c). A discriminant function fi, i = 1, 2, ..., c, is estimated for each class. Given an unknown sample x, the value of each discriminant function is calculated, and x is assigned to the class whose discriminant function has the maximum value.


Building a Classifier Model

Decision boundaries

Figure: The structure of the classifier: the N input samples (xi ∈ R^m) of the c classes feed the c discriminant functions f1(x), f2(x), ..., fc(x), and a maximum selector assigns the class label.



Building a Classifier Model

Normal density and Bayes' rule

Let ω1, ω2, ..., ωc be the set of c classes, and let P(x|ωi) denote the likelihood function. P(ωi) denotes the prior probability of each class, which reflects the prior knowledge about that class; it is simply the ratio between the number of samples in that class (ni) and the total number of samples in all classes (N), i.e. P(ωi) = ni/N. Bayes' formula calculates the posterior probability from the prior and the likelihood as follows:

P(ω = ωi|x) = P(x|ω = ωi) P(ωi) / P(x) = (likelihood × prior) / evidence    (3)

where P(ω = ωi|x) is the posterior probability and P(x) is the evidence, calculated as P(x) = Σ_{i=1}^{c} P(x|ω = ωi) P(ωi). P(x) only scales the expressions in Equation (3) so that the posterior probabilities sum to one (Σ_{i=1}^{c} P(ωi|x) = 1). Hence, P(ωi|x) is essentially determined by the likelihood P(x|ωi) and the prior probability P(ωi).

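As a small illustration of Equation (3), the sketch below (an assumed NumPy example, not part of the original tutorial; the likelihood values are hypothetical) computes the posterior probabilities of three classes for one sample:

import numpy as np

# Hypothetical likelihoods P(x|w_i) for one sample x and three classes,
# and priors P(w_i) = n_i / N.
likelihood = np.array([0.30, 0.05, 0.12])
prior = np.array([4, 4, 4]) / 12.0

evidence = np.sum(likelihood * prior)        # P(x) = sum_i P(x|w_i) P(w_i)
posterior = likelihood * prior / evidence    # Bayes' rule, Equation (3)

print(posterior, posterior.sum())            # the posteriors sum to 1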

Building a Classifier Model

Normal density and Bayes' rule

Assume that P(x|ωi) is normally distributed, P(x|ωi) ~ N(µi, Σi):

P(x|ωi) = N(µi, Σi) = (1 / sqrt((2π)^m |Σi|)) exp(−(1/2)(x − µi)^T Σi⁻¹ (x − µi))    (4)

where:

µi is the mean of the ith class, µi = (1/ni) Σ_{x∈ωi} x, for i = 1, 2, ..., c;

Σi is the covariance matrix of the ith class, and |Σi| and Σi⁻¹ are the determinant and the inverse of the covariance matrix, respectively:

Σi = (1/ni) Σ_{x∈ωi} (x − µi)(x − µi)^T,  for i = 1, 2, ..., c    (5)

Σi is the m × m matrix whose diagonal entries are the variances of the individual features and whose off-diagonal entries are the covariances between pairs of features:

      | var(x1)      cov(x1, x2)  ...  cov(x1, xm) |
Σi =  | cov(x2, x1)  var(x2)      ...  cov(x2, xm) |    (6)
      | ...          ...          ...  ...         |
      | cov(xm, x1)  cov(xm, x2)  ...  var(xm)     |

m is the number of features, i.e. the number of variables of the sample x.

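The class statistics in Equations (4) and (5) can be estimated with a few lines. The sketch below is an assumed NumPy implementation, not code from the original tutorial; storing the samples as rows and the random data are assumptions for illustration:

import numpy as np

def class_statistics(X):
    # X: (n_i x m) samples of one class, one sample per row.
    mu = X.mean(axis=0)              # class mean, Equation (4)
    D = X - mu                       # mean-centred data
    sigma = D.T @ D / X.shape[0]     # covariance matrix, Equation (5)
    return mu, sigma

# Hypothetical data: 20 samples with 3 features
rng = np.random.default_rng(0)
mu_i, sigma_i = class_statistics(rng.normal(size=(20, 3)))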


Building a Classifier Model

Discriminant Functions for the Normal Density

Given two classes ω1 and ω2, each with one discriminant function (fi, i = 1, 2), and an unknown pattern x: if P(ω1|x) > P(ω2|x), the unknown pattern belongs to the first class (ω1); similarly, if P(ω2|x) > P(ω1|x), then x belongs to ω2. Since the evidence P(x) is the same for all classes, the discriminant function of each class can be calculated as follows:

fi(x) = ln(P(x|ω = ωi) P(ωi))    (7)
      = ln(P(x|ω = ωi)) + ln(P(ωi)),  i = 1, 2    (8)
      = ln[ (1 / sqrt((2π)^m |Σi|)) exp(−(1/2)(x − µi)^T Σi⁻¹ (x − µi)) ] + ln(P(ωi))    (9)
      = −(1/2)(x − µi)^T Σi⁻¹ (x − µi) − (m/2) ln(2π) − (1/2) ln|Σi| + ln(P(ωi))    (10)
      = −(1/2)(x^T Σi⁻¹ x + µi^T Σi⁻¹ µi − 2µi^T Σi⁻¹ x) − (m/2) ln(2π) − (1/2) ln|Σi| + ln(P(ωi))    (11)

Building a Classifier Model

Discriminant Functions for the Normal Density

The decision boundary between the two classes ω1 and ω2 is represented by the difference between the two discriminant functions:

S12 = f1 − f2    (12)
    = ln P(ω = ω1|x) − ln P(ω = ω2|x)    (13)
    = ln [ P(x|ω = ω1) P(ω1) / (P(x|ω = ω2) P(ω2)) ]    (14)
    = ln [ P(x|ω = ω1) / P(x|ω = ω2) ] + ln [ P(ω1) / P(ω2) ]    (15)
    = ln P(x|ω = ω1) + ln P(ω1) − ln P(x|ω = ω2) − ln P(ω2)    (16)

Note:
1. ln(X/Y) = ln X − ln Y.
2. ln(XY) = ln X + ln Y.


Building a Classifier Model

Discriminant Functions for the Normal Density

Substituting Equation (11) into Equation (12) gives:

S12(x) = −(1/2)[ x^T Σ1⁻¹ x − 2µ1^T Σ1⁻¹ x + µ1^T Σ1⁻¹ µ1
               − (x^T Σ2⁻¹ x − 2µ2^T Σ2⁻¹ x + µ2^T Σ2⁻¹ µ2)
               + ln|Σ1| − ln|Σ2| ] + ln(P(ω1)/P(ω2))    (17)

       = −(1/2) x^T (Σ1⁻¹ − Σ2⁻¹) x                                              [Quadratic Term]
         + (µ1^T Σ1⁻¹ − µ2^T Σ2⁻¹) x                                             [Linear Term]
         − 0.5(µ1^T Σ1⁻¹ µ1 − µ2^T Σ2⁻¹ µ2 + ln|Σ1| − ln|Σ2|) + ln(P(ω1)/P(ω2))  [Bias]    (18)

       = x^T W x + w^T x + W0    (19)

Building a Classifier Model

Discriminant Functions for the Normal Density

The decision boundary consists of three parts:

W, the coefficient of the quadratic term x^T W x; when W ≠ 0 the decision boundary is a quadratic function or curve, and the classifier is called a Quadratic Discriminant Classifier (QDC):

W = −(1/2)(Σ1⁻¹ − Σ2⁻¹)    (20)

w, which represents the slope (the linear term); since Σi⁻¹ is symmetric, it can be written as a column vector:

w = Σ1⁻¹µ1 − Σ2⁻¹µ2    (21)

W0, the threshold or bias:

W0 = −0.5(µ1^T Σ1⁻¹ µ1 − µ2^T Σ2⁻¹ µ2 + ln|Σ1| − ln|Σ2|) + ln(P(ω1)/P(ω2))    (22)

The sign of S12(x) determines the class label of x:

sgn(S12(x)) = +ve  if x^T W x + w^T x + W0 > 0 → x ∈ ω1
              0    if x^T W x + w^T x + W0 = 0 → x lies on the boundary
              −ve  if x^T W x + w^T x + W0 < 0 → x ∈ ω2    (23)
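To make Equations (20)-(23) concrete, here is a minimal sketch (an assumed NumPy implementation, not code from the tutorial) that builds W, w, and W0 for two classes and evaluates the sign of S12(x) for an unknown sample:

import numpy as np

def boundary_terms(mu1, sigma1, p1, mu2, sigma2, p2):
    # Returns W, w, W0 of the decision boundary S12(x) = x^T W x + w^T x + W0.
    inv1, inv2 = np.linalg.inv(sigma1), np.linalg.inv(sigma2)
    W = -0.5 * (inv1 - inv2)                                  # Eq. (20), quadratic term
    w = inv1 @ mu1 - inv2 @ mu2                               # Eq. (21), linear term
    W0 = (-0.5 * (mu1 @ inv1 @ mu1 - mu2 @ inv2 @ mu2
                  + np.log(np.linalg.det(sigma1))
                  - np.log(np.linalg.det(sigma2)))
          + np.log(p1 / p2))                                  # Eq. (22), bias
    return W, w, W0

def classify(x, W, w, W0):
    s12 = x @ W @ x + w @ x + W0                              # Eq. (19)
    return 1 if s12 > 0 else 2                                # Eq. (23); s12 == 0 means x is on the boundary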

Building a Classifier Model

Discriminant Functions for the Normal Density

Algorithm 1: Discriminant Analysis Classifier (Building the Model)
1. Input: data matrix X, which consists of N samples [xi], i = 1, ..., N, each represented as a column of length m, where xi is the ith sample.
2. Compute the mean of each class, µi (m × 1).
3. Compute the prior probability of each class, P(ωi) = ni/N.
4. Compute the covariance matrix of each class, Σi.
5. for all classes ωi, i = 1, 2, ..., c do
6.    Calculate the discriminant function fi as in Equation (7).
7. end for


Building a Classifier Model

Discriminant Functions for the Normal Density

Algorithm 2: Discriminant Analysis Classifier (Classifying an Unknown Sample)
1. Input: an unknown sample T (m × 1).
2. Output: class label ωi.
3. for all discriminant functions fi calculated when building the model do
4.    Substitute the value of the unknown sample T into the discriminant function fi.
5. end for
6. Assign the class label ωmax to the unknown sample T, where ωmax is the class with the maximum discriminant value.

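Algorithms 1 and 2 can be combined in a short sketch; the implementation below is assumed (NumPy, samples as rows, class labels in a vector y), not the author's code, and it drops the constant −(m/2)ln(2π) term, which does not change the maximum:

import numpy as np

def fit(X, y):
    # Algorithm 1: estimate mu_i, Sigma_i and P(w_i) for every class.
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                    # class mean
        D = Xc - mu
        sigma = D.T @ D / len(Xc)               # class covariance matrix
        prior = len(Xc) / len(X)                # P(w_i) = n_i / N
        model[c] = (mu, sigma, prior)
    return model

def predict(model, x):
    # Algorithm 2: evaluate every discriminant function (Eq. (10)) and take the maximum.
    scores = {}
    for c, (mu, sigma, prior) in model.items():
        d = x - mu
        scores[c] = (-0.5 * d @ np.linalg.inv(sigma) @ d
                     - 0.5 * np.log(np.linalg.det(sigma))
                     + np.log(prior))
    return max(scores, key=scores.get)

Here fit corresponds to Algorithm 1 and predict to Algorithm 2; scikit-learn's QuadraticDiscriminantAnalysis implements the same idea.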

Building a Classifier Model

Discriminant Functions for the Normal Density

Figure: The overall pipeline for building the classifier: the data matrix X is split into the three classes, the mean µi of each class and the covariance matrices Σi = Di Di^T are computed, the discriminant functions fi are formed from µi, Σi, and P(ωi), and the decision boundaries S12 = f1 − f2, S13 = f1 − f3, and S23 = f2 − f3 separate the class regions.

Special cases of discriminant analysis classifier

Case 1: Equal Variance (Σi = σ²I)

When the covariance matrices of the two classes are equal and proportional to the identity matrix, Σ1 = Σ2 = Σ = σ²I, then Σ1⁻¹ = Σ2⁻¹ = Σ⁻¹, so the quadratic term vanishes (W = 0) and ln|Σ1| − ln|Σ2| = 0. The decision boundary therefore becomes linear, with

w = Σ⁻¹(µ1 − µ2)    (24)

W0 = −0.5(µ1^T Σ⁻¹ µ1 − µ2^T Σ⁻¹ µ2) + ln(P(ω1)/P(ω2))

Special cases of discriminant analysis classifier

Case 1: Equal Variance (Σi = σ²I)

In binary classification, the decision boundary is the point, line, or plane where S12 = 0:

S12 = 0 → (µ1 − µ2)^T Σ⁻¹ x − 0.5(µ1^T Σ⁻¹ µ1 − µ2^T Σ⁻¹ µ2) + ln(P(ω1)/P(ω2)) = 0    (25)

In the one-dimensional case (Σ = σ²), the decision boundary is the point

xDB = (µ1 + µ2)/2 + (σ²/(µ2 − µ1)) ln(P(ω1)/P(ω2))    (26)

If the two classes are equiprobable, i.e. P(ω1) = P(ω2) → ln(P(ω1)/P(ω2)) = 0, the second term vanishes and the decision boundary is the point in the middle of the class centers, (µ1 + µ2)/2. Otherwise, the decision boundary lies closer to the class that has the lower prior probability: for example, if P(ωi) > P(ωj), then |µj − xDB| < |µi − xDB|.

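A small numeric sketch of Equation (26) (hypothetical one-dimensional means and variance, assumed NumPy code): the boundary sits midway between the class means when the priors are equal and shifts toward the class with the lower prior otherwise.

import numpy as np

mu1, mu2, var = 2.0, 6.0, 1.0     # hypothetical class means and shared variance (1-D case)

def boundary(p1, p2):
    # Eq. (26): x_DB = (mu1 + mu2)/2 + sigma^2/(mu2 - mu1) * ln(P(w1)/P(w2))
    return (mu1 + mu2) / 2 + var / (mu2 - mu1) * np.log(p1 / p2)

print(boundary(0.5, 0.5))   # 4.0: midway between the class centers
print(boundary(0.8, 0.2))   # about 4.35: closer to mu2, the class with the lower prior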


Special cases of discriminant analysis classifier

Case 2: Equal Variance (Σi = Σ)

In this case, the covariance matrices of all classes are equal but arbitrary, i.e. the variances of the individual features need not be equal. The geometric interpretation is that the distribution of each class is elliptical in the m-dimensional space, with the same shape and orientation for all classes.


Special cases of discriminant analysis classifier

Case 2: Equal Variance (Σi = Σ)

As in the first case, the covariance matrices of all classes are equal (Σ1 = Σ2 = Σ); hence the quadratic term W vanishes, the term ln|Σ1| − ln|Σ2| cancels, and W0 becomes easier to calculate. The decision boundary is therefore linear:

S12 = x^T W x + w^T x + W0 = w^T x + W0,   since W = 0

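A brief sketch of this case (an assumed NumPy implementation, not from the tutorial): a single pooled covariance matrix is shared by both classes, so W in Equation (20) is exactly zero and only the linear terms w and W0 remain.

import numpy as np

def linear_boundary(X1, X2, p1, p2):
    # Shared (pooled) covariance matrix -> linear decision boundary w^T x + W0.
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    D = np.vstack([X1 - mu1, X2 - mu2])
    sigma = D.T @ D / len(D)                 # pooled covariance, Sigma_1 = Sigma_2 = Sigma
    inv = np.linalg.inv(sigma)
    # W = -0.5 * (inv - inv) = 0 and ln|Sigma_1| - ln|Sigma_2| = 0
    w = inv @ (mu1 - mu2)
    W0 = -0.5 * (mu1 @ inv @ mu1 - mu2 @ inv @ mu2) + np.log(p1 / p2)
    return w, W0                             # classify x by the sign of w @ x + W0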


Special cases of discriminant analysis classifier

Case 3: Different Covariance Matrices (Σi is arbitrary)

In this case, the covariance matrices of the classes are different; this can be considered the common, practical case. The distributions of the classes differ, i.e. they have different shapes. The equation of the decision boundary is

S12(x) = x^T W x + w^T x + W0    (27)

and the decision boundary is nonlinear (quadratic).



Numerical Examples

Example 1: Equal Variance (Σi = σ²I)

In this example, the features are statistically independent, i.e. all off-diagonal elements of the covariance matrices are zero, which means that the features are uncorrelated, and all features have the same variance (σ²). Thus:

1. The covariance matrices are diagonal, and their diagonal elements are σ².
2. Geometrically, each class is centered around its mean and spreads equally in all directions around it.
3. The distributions of all classes are spherical in the m-dimensional space.


Numerical Examples

Example 1: Equal Variance (Σi = σ²I)

Given three classes denoted ω1, ω2, and ω3, each with four samples, the prior probabilities are equal: P(ω1) = P(ω2) = P(ω3) = 4/12 = 1/3. Writing each class as a matrix with one sample (x1, x2) per row:

ω1 = [3 4; 3 5; 4 4; 4 5],  ω2 = [3 2; 3 3; 4 2; 4 3],  ω3 = [6 2; 6 3; 7 2; 7 3]    (28)

The mean of each class is:

µ1 = [3.50, 4.50],  µ2 = [3.50, 2.50],  µ3 = [6.50, 2.50]    (29)

Numerical Examples

Example 1: Equal Variance (Σi = σ²I)

Subtracting the mean of each class from each sample in that class gives:

D1 = D2 = D3 = [−0.50 −0.50; −0.50 0.50; 0.50 −0.50; 0.50 0.50]    (30)

The covariance matrix of each class (computed here as Σi = Di^T Di) is:

Σ1 = Σ2 = Σ3 = [1.00 0.00; 0.00 1.00]    (31)

and hence

Σ1⁻¹ = Σ2⁻¹ = Σ3⁻¹ = [1.00 0.00; 0.00 1.00]    (32)

Numerical Examples

Example 1: Equal Variance (Σi = σ²I)

The discriminant function of each class is:

fi(x) = −(1/2)(x^T Σi⁻¹ x + µi^T Σi⁻¹ µi − 2µi^T Σi⁻¹ x) − (m/2) ln(2π) − (1/2) ln|Σi| + ln(P(ωi))    (33)

Substituting Σi = I (so |Σi| = 1) and P(ωi) = 1/3, and dropping the term −(m/2) ln(2π), which is the same for all classes:

f1 = −0.5x1² − 0.5x2² + 3.50x1 + 4.50x2 − 17.35
f2 = −0.5x1² − 0.5x2² + 3.50x1 + 2.50x2 − 10.35    (34)
f3 = −0.5x1² − 0.5x2² + 6.50x1 + 2.50x2 − 25.35

The decision boundaries between each pair of classes are:

S12 = f1 − f2 → x2 = 3.50
S13 = f1 − f3 → x2 = 1.5x1 − 4.00    (35)
S23 = f2 − f3 → x1 = 5.00

The decision boundary S12 depends only on x2: S12 is positive, and a sample belongs to class ω1, when x2 is greater than 3.5.

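The numbers in Equations (34) and (35) can be checked numerically; the sketch below (assumed NumPy code, not from the tutorial) evaluates the simplified discriminant functions and classifies one sample:

import numpy as np

w1 = np.array([[3, 4], [3, 5], [4, 4], [4, 5]], dtype=float)
w2 = np.array([[3, 2], [3, 3], [4, 2], [4, 3]], dtype=float)
w3 = np.array([[6, 2], [6, 3], [7, 2], [7, 3]], dtype=float)
prior = 1.0 / 3.0

def f(x, X):
    mu = X.mean(axis=0)
    # Sigma_i = I in this example, so Eq. (33) (minus the common -(m/2)ln(2*pi) term) reduces to:
    return -0.5 * x @ x + mu @ x - 0.5 * mu @ mu + np.log(prior)

x = np.array([3.0, 6.0])                 # x2 > 3.5, so this sample should fall in class w1
scores = [f(x, w1), f(x, w2), f(x, w3)]
print(np.argmax(scores) + 1)             # -> 1
print(f(x, w1) - f(x, w2))               # S12 = 2*x2 - 7.00 = 5.00 > 0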

Numerical Examples

Example 1: Equal Variance (Σi = σ²I)

Figure: The samples of the three classes, the class means µ1, µ2, and µ3, and the linear decision boundaries S12 (x2 = 3.5), S13 (x2 = 1.5x1 − 4), and S23 (x1 = 5) that separate the class regions.

Singularity Problem

Introduction

In this example, the number of features is greater than the number of samples in each class, i.e. m > ni, i = 1, 2, 3. Assume the prior probabilities of the three classes are equal (P(ω1) = P(ω2) = P(ω3) = 1/3).

Table: The feature values, class means, and mean-centered data of all samples of this example.

Pattern No.   x1  x2  x3  x4   Class
1             3   4   3   5    ω1
2             3   5   6   4    ω1
3             4   4   5   7    ω1
4             3   2   5   2    ω2
5             3   3   5   3    ω2
6             4   2   3   5    ω2
7             6   2   5   6    ω3
8             6   3   6   7    ω3
9             7   2   5   7    ω3

Class means:
µ1 = [3.33, 4.33, 4.67, 5.33]
µ2 = [3.33, 2.33, 4.33, 3.33]
µ3 = [6.33, 2.33, 5.33, 6.67]

Mean-centered data D (one row per pattern, in the same order as the table):
ω1: (−0.33, −0.33, −1.67, −0.33), (−0.33, 0.67, 1.33, −1.33), (0.67, −0.33, 0.33, 1.67)
ω2: (−0.33, −0.33, 0.67, −1.33), (−0.33, 0.67, 0.67, −0.33), (0.67, −0.33, −1.33, 1.67)
ω3: (−0.33, −0.33, −0.33, −0.67), (−0.33, 0.67, 0.67, 0.33), (0.67, −0.33, −0.33, 0.33)

Singularity Problem

Introduction

The covariance matrices of the three classes are:

Σ1 = [  0.67  −0.33   0.33   1.67
       −0.33   0.67   1.33  −1.33
        0.33   1.33   4.67  −0.67
        1.67  −1.33  −0.67   4.67 ]

Σ2 = [  0.67  −0.33  −1.33   1.67
       −0.33   0.67   0.67  −0.33
       −1.33   0.67   2.67  −3.33
        1.67  −0.33  −3.33   4.67 ]

Σ3 = [  0.67  −0.33  −0.33   0.33
       −0.33   0.67   0.67   0.33
       −0.33   0.67   0.67   0.33
        0.33   0.33   0.33   0.67 ]    (55)

The rank of each covariance matrix is two, i.e. rank = ni − 1. The covariance matrices are therefore singular, and the discriminant functions cannot be calculated. The singularity problem can be solved by methods such as Regularized Linear Discriminant Analysis (RLDA) and subspace methods.

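The rank deficiency can be verified directly with the nine samples above; the sketch below is an assumed NumPy check, not code from the tutorial:

import numpy as np

classes = {
    "w1": np.array([[3, 4, 3, 5], [3, 5, 6, 4], [4, 4, 5, 7]], dtype=float),
    "w2": np.array([[3, 2, 5, 2], [3, 3, 5, 3], [4, 2, 3, 5]], dtype=float),
    "w3": np.array([[6, 2, 5, 6], [6, 3, 6, 7], [7, 2, 5, 7]], dtype=float),
}

for name, X in classes.items():
    D = X - X.mean(axis=0)
    sigma = D.T @ D                              # 4x4 covariance matrix, as in Eq. (55)
    print(name, np.linalg.matrix_rank(sigma))    # rank = n_i - 1 = 2 < m = 4 -> singular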

Singularity Problem

Regularized Linear Discriminant Analysis (RLDA) Method

Regularized Linear Discriminant Analysis (RLDA) Method

In this method, the identity matrix is scaled by a regularization parameter η (0 < η < 1) and added to the covariance matrix to make it non-singular. The diagonal elements of the covariance matrix are thus biased as follows: Σ̂ = Σ + ηI.

However, choosing the value of the regularization parameter requires tuning, and a poor choice can degrade the performance of the method. Another drawback is that η is added only to make the inverse of Σ computable and has no clear mathematical interpretation.

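A minimal sketch of the RLDA fix (assumed NumPy code): add ηI to a singular covariance matrix so that it becomes invertible; the value of η is a tuning parameter.

import numpy as np

def regularized_inverse(sigma, eta=0.05):
    # RLDA: bias the diagonal so that Sigma + eta*I is non-singular, then invert.
    sigma_hat = sigma + eta * np.eye(sigma.shape[0])
    return np.linalg.inv(sigma_hat)     # used in place of Sigma^-1 in the discriminant functions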

Singularity Problem

Regularized Linear Discriminant Analysis (RLDA) Method

Assume η = 0.05. The regularized covariance matrices, Σ̂i = Σi + ηI, and their inverses are:

Σ̂1 = [  0.72  −0.33   0.33   1.67
        −0.33   0.72   1.33  −1.33
         0.33   1.33   4.72  −0.67
         1.67  −1.33  −0.67   4.72 ]

Σ̂2 = [  0.72  −0.33  −1.33   1.67
        −0.33   0.72   0.67  −0.33
        −1.33   0.67   2.72  −3.33
         1.67  −0.33  −3.33   4.72 ]

Σ̂3 = [  0.72  −0.33  −0.33   0.33
        −0.33   0.72   0.67   0.33
        −0.33   0.67   0.72   0.33
         0.33   0.33   0.33   0.72 ]

Σ̂1⁻¹ = [ 17.38   0.96  −2.38  −6.21
           0.96  17.85  −4.54   4.07
          −2.38  −4.54   1.63  −0.21
          −6.21   4.07  −0.21   3.52 ]

Σ̂2⁻¹ = [ 17.93   3.00   4.14  −3.20
           3.00   6.11  −6.01  −4.88
           4.14  −6.01  11.73   6.40
          −3.20  −4.88   6.40   5.52 ]

Σ̂3⁻¹ = [  8.53   3.88   3.88  −7.58
           3.88  12.23  −7.77  −3.88
           3.88  −7.77  12.23  −3.88
          −7.58  −3.88  −3.88   8.53 ]    (56)

Singularity Problem

Subspace Method

Subspace Method

In this method, a non-singular intermediate space is obtained by reducing the dimension of the original data to the rank of the covariance matrix, so that Σi becomes full rank and can be inverted (a matrix A is full rank if all of its rows and columns are linearly independent, i.e. rank(A) = number of rows = number of columns). In other words, a dimensionality reduction method is used to remove the null space of the covariance matrices. Principal Component Analysis (PCA) is one of the most common dimensionality reduction methods.

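A sketch of the subspace approach (an assumed NumPy implementation; the projected values match the tables below only up to the arbitrary sign of each eigenvector): project the data of each class onto the eigenvectors of its covariance matrix that have non-zero eigenvalues, so the covariance matrix in the reduced space is full rank and invertible.

import numpy as np

def pca_subspace(X):
    # Project the samples of one class onto the eigenvectors with non-zero eigenvalues.
    D = X - X.mean(axis=0)
    sigma = D.T @ D                          # possibly singular when m > n_i
    eigval, eigvec = np.linalg.eigh(sigma)   # eigenvalues in ascending order
    keep = eigval > 1e-10                    # keep only the non-null-space directions
    V = eigvec[:, keep][:, ::-1]             # eigenvectors as columns, largest eigenvalue first
    return X @ V, V                          # projected samples and the projection matrix

Z1, V1 = pca_subspace(np.array([[3, 4, 3, 5], [3, 5, 6, 4], [4, 4, 5, 7]], dtype=float))
# The covariance of the projected class data is diagonal and full rank, so it can now be inverted.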

Singularity Problem

Subspace Method

Subspace Method

In this example, the dimension of the data of each class is reduced with PCA to the rank of its covariance matrix. The main idea of PCA is to compute the eigenvalues and eigenvectors of the covariance matrix and to neglect the eigenvectors that have the smallest (here zero) eigenvalues. The eigenvalues (λi) and eigenvectors (Vi) of the three classes are as follows (the rows of each Vi are the eigenvectors, ordered by decreasing eigenvalue):

λ1 = [6.22, 4.45, 0.00, 0.00],  λ2 = [7.86, 0.81, 0.00, 0.00],  λ3 = [1.67, 1.00, 0.00, 0.00]

V1 = [  0.21  −0.32  −0.55   0.74
        0.30   0.06   0.79   0.53
        0.43  −0.81   0.15  −0.37
       −0.82  −0.49   0.22   0.18 ]

V2 = [  0.29  −0.10  −0.57   0.76
       −0.15   0.85   0.30   0.40
       −0.60  −0.48   0.42   0.48
       −0.73   0.18  −0.64  −0.18 ]

V3 = [ −0.32   0.63   0.63   0.32
        0.71   0.00   0.00   0.71
       −0.19   0.58  −0.77   0.19
       −0.61  −0.51  −0.09   0.60 ]    (57)

Only the first two eigenvectors of each class, which correspond to the non-zero eigenvalues, are kept; the data of each class are projected onto them.

Singularity Problem

Subspace Method

Subspace Method

Table: The feature values, class means, mean-centered data, covariance matrices, and inverse covariance matrices of all classes after projecting the data of each class onto its PCA space (two dimensions).

Pattern No.    x1      x2     Class
1              1.38    6.18   ω1
2             −1.34    8.08   ω1
3              1.96    9.13   ω1
4             −0.69    3.56   ω2
5             −0.04    4.81   ω2
6              3.02    4.00   ω2
7              4.43    8.49   ω3
8              6.01    9.19   ω3
9              4.43    9.90   ω3

Class means (projected):
µ1 = [0.67, 7.80],  µ2 = [0.76, 4.13],  µ3 = [4.95, 9.19]

Mean-centered data D (projected):
ω1: (0.71, −1.61), (−2.01, 0.23), (1.30, 1.33)
ω2: (−1.46, −0.57), (−0.80, 0.69), (2.26, −0.12)
ω3: (−0.53, −0.71), (1.05, 0.00), (−0.53, 0.71)

Covariance matrices and their inverses (projected space):
Σ1 = [6.22 0.00; 0.00 4.45],  Σ1⁻¹ = [0.16 0.00; 0.00 0.23]
Σ2 = [7.86 0.00; 0.00 0.81],  Σ2⁻¹ = [0.13 0.00; 0.00 1.24]
Σ3 = [1.67 0.00; 0.00 1.00],  Σ3⁻¹ = [0.60 0.00; 0.00 1.00]

The projected covariance matrices are full rank, so they can be inverted and the discriminant functions can be calculated.


Summary

How to construct linear and quadratic decision boundaries.
What is the influence of equal or different covariance matrices?
How to classify an unknown sample?
What is the influence of changing the prior probability?
What are the problems of building discriminant analysis classifiers?
This classifier is a first step toward understanding well-known classifiers such as Support Vector Machine (SVM) or Neural Network (NN) classifiers.
How is the singularity problem solved?
For more details, see: Tharwat, Alaa. "Linear vs. quadratic discriminant analysis classifier: a tutorial." International Journal of Applied Pattern Recognition 3.2 (2016): 145-180.

