
British Journal of Mathematics & Computer Science 17(5): 1-9, 2016, Article no.BJMCS.27435 ISSN: 2231-0851

SCIENCEDOMAIN international www.sciencedomain.org

On Designing Invertible Pseudo Covariance Matrix for Undersampled Cases in Classification

Rashid Mahmood1, Khalid Mahmood Aamir1, Marija Milojević Jevrić2*, Stojan Radenović3,4 and Tehseen Zia5

1 University of Sargodha, Sargodha, 40100, Pakistan.
2 Mathematical Institute SANU, Kneza Mihaila 36, 11000 Belgrade, Serbia.
3 Faculty of Mechanical Engineering, University of Belgrade, Kraljice Marije 16, 11120 Belgrade, Serbia.
4 Department of Mathematics, University of Novi Pazar, Novi Pazar, Serbia.
5 COMSATS Institute of Information Technology, Islamabad, 44000, Pakistan.

Authors’ contributions

This work was carried out in collaboration between all authors. Author SR designed the study, wrote the introduction and related work, and supervised the work. Authors RM and KMA carried out all the methodology work and performed the threshold settings. Author MMJ managed the analyses of the study and obtained the results. Author SR wrote the first draft of the manuscript. Author TZ managed the literature searches and edited the manuscript. All authors read and approved the final manuscript.

Article Information

DOI: 10.9734/BJMCS/2016/27435
Editor(s):
(1) Morteza Seddighin, Indiana University East Richmond, USA.
Reviewers:
(1) R. Praveen Sam, Jawaharlal Nehru Technological University, Anantapur, India.
(2) Ahmed Nasser Zaky Sayed, Cairo University (ISSR), Cairo, Egypt.
Complete Peer review History: http://sciencedomain.org/review-history/15335

Original Research Article

Received: 1st June 2016 Accepted: 2nd July 2016 Published: 9th July 2016

_______________________________________________________________________________

Abstract

In linear discriminant analysis, the determinant and the inverse of the covariance matrix must be computed. If the number of features is greater than the number of available examples, the covariance matrix is no longer invertible. A common approach is to reduce the dimensionality, due to which some features of interest may be lost. When dimensionality reduction is not desired, one approach is to take the pseudoinverse of the covariance matrix, which is not always possible. We propose, in such cases, to project the covariance matrix onto a highly correlated space in order to compute a pseudoinverse of the matrix. The proposed solution has been tested for the classification of microarray gene expression data of colon tumors.

_____________________________________

*Corresponding author: E-mail: [email protected];


Keywords: Covariance matrix; dimensionality reduction; pseudoinverse.

1 Introduction

Let x_1, x_2, …, x_n be data points, or a collection of observations, such that x_i ∈ ℝ^d for 1 ≤ i ≤ n and d ∈ ℕ, where ℝ is the set of real numbers and n is the number of data points. Therefore, ℝ^{d×n} is our sample space. In classification, a pattern is a pair of variables (x, ω), where ω is the concept behind the observation. The task is to discriminate examples from different classes with respect to ω: examples from the same class should have similar ω, while examples from different classes have different ω. Minimum-error-rate classification can be achieved by using the following discriminant functions:

\[ g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i) \tag{1} \]

If p(x | ω_i) ~ N(μ_i, Σ_i), we have

\[ g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i). \tag{2} \]

The covariance matrix Σ_i is always symmetric and positive semi-definite. In the following, we discuss the general case when Σ_i is arbitrary. In the general multivariate normal case, the covariance matrices are different for each category. The only term that can be dropped from Eq. 2 is the (d/2) ln 2π term, and the resulting discriminant functions are inherently quadratic:

\[ g_i(x) = x^{T} W_i\, x + w_i^{T} x + w_{i0} \tag{3} \]

where

\[ W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i \]

and

\[ w_{i0} = -\tfrac{1}{2}\mu_i^{T}\Sigma_i^{-1}\mu_i - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i). \]
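For concreteness, the following minimal NumPy sketch evaluates the quadratic discriminant of Eq. 3 for one class; the variable names and the toy data are illustrative only and not from the paper.

```python
# Minimal sketch of the quadratic discriminant of Eq. (3).
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """Evaluate g_i(x) = x^T W_i x + w_i^T x + w_i0 for one class."""
    Sigma_inv = np.linalg.inv(Sigma)            # requires an invertible Sigma (Case 1 below, n >= d)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    _, logdet = np.linalg.slogdet(Sigma)
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * logdet + np.log(prior)
    return x @ W @ x + w @ x + w0               # the constant (d/2) ln 2pi term is dropped

# toy usage with two well-sampled classes
rng = np.random.default_rng(0)
mu0, mu1 = np.zeros(3), np.ones(3)
S0, S1 = np.eye(3), 2.0 * np.eye(3)
x = rng.normal(size=3)
label = int(quadratic_discriminant(x, mu1, S1, 0.5) > quadratic_discriminant(x, mu0, S0, 0.5))
```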

This shows that computing the inverse of the covariance matrix (i.e., Σ^{-1}) is essential in discriminant analysis. In this situation, two types of problems arise when the relationship between n and d is considered:

• Case 1 (n ≥ d): When the number of available samples (n) is equal to or greater than the number of dimensions (d), the computation of Σ^{-1} is trivial.
• Case 2 (n < d): In most practical problems (e.g., classification of cardiac signals or detection of tumors from microarray gene expression data), d is large, such that d ≫ n. Such problems are known as undersampled problems ([1]).

There are three common approaches to solve these problems when n is smaller than d:

• Dimensionality reduction is applied such that d′ < n, where d′ is the number of dimensions in the reduced space. However, some useful features may be lost.
• The eigenvalues are translated such that none of them is zero. This reshapes the system matrix.
• The pseudoinverse of Σ is computed, which is not always possible.

In this paper, we propose a solution to the problem when the pseudoinverse of Σ does not exist and both dimensionality reduction and reshaping of the system matrix are to be avoided. The related work is presented in Section 2. The proposed methodology is described in Section 3. Results and discussion are presented in Section 4. Finally, the paper is concluded in Section 5.

2 Related Work

Many algorithms have been developed so far to tackle undersampled problems ([2-8]). Most of the methods are proposed in the context of the scatter matrix in classical linear discriminant analysis (LDA), because the scatter matrix must be non-singular in order to perform discriminant analysis. Depending on the nature of the preprocessing of the scatter matrix, the proposed methods can broadly be divided into two categories: dimensionality reduction (DR) and matrix transformation (MT).

The DR techniques are applied to reduce the dimensionality such that the feature vectors have dimension smaller than the number of samples. For instance, principal component analysis (PCA) is used in ([8-10]) and orthogonalization is applied in ([11,12]) to reduce the dimensionality; afterwards, LDA is applied for classification. DR is useful when we have to classify over the dominant features. However, in cases like classification of ventricle diseases using electrocardiogram waveforms or classification of cancerous genes using microarray data, classification may not be carried out using the dominant features. Therefore, DR may not be helpful. In this situation, MT techniques can be used to make the scatter matrix invertible. The most notable examples of these techniques include regularized LDA (RLDA) ([13,14]), penalized LDA (PLDA) ([6]), pseudo-inverse LDA ([5]) and shrinkage LDA (SLDA) ([15]).

In RLDA, all the eigenvalues of the scatter matrix are given a scalar, constant and equal shift. This replaces zero eigenvalues with some positive real value and makes the scatter matrix invertible. This regularization minimizes within-class variance at the cost of bias. Therefore, we have to tune the parameter(s) for an optimal tradeoff between variance and bias.

In penalized LDA, all values of the within-class scatter matrix are given a positive shift, i.e., we use Σ + Ω, where Σ is the within-class scatter matrix and Ω the shift matrix. If Ω = μI for some constant μ and identity matrix I, penalized LDA reduces to regularized LDA. Therefore, penalized LDA is more general than regularized LDA. A problem with this technique is that there is no single way of choosing Ω for an arbitrary set of applications ([6]).

When the scatter matrix is singular, its pseudo-inverse is sometimes available. In that case we consider the pseudo-inverse of the scatter matrix in place of its inverse; the technique is referred to as pseudo-inverse LDA. However, the pseudo-inverse is not possible when numerical values become very small during the computations. The pseudo-inverse is an estimate of the inverse of the matrix using least squares estimation ([16], [5]). The effect of the sample size and dimension on the classification error was studied in ([1]). The performance of this technique is comparable to PCA + LDA and RLDA ([17]).

3 Methodology

Let Σ_I ∈ ℝ^{d×d} be an ideal (highly correlated) matrix and define the pseudo covariance matrix

\[ \Sigma_P = \Sigma\,\Sigma_I. \tag{4} \]

The matrix Σ_P represents a situation (space) between the ideal space and the real space and possesses all the properties of Σ, such as symmetry, the Hermitian property, etc. The matrix Σ_I is defined such that the elements along its main diagonal are α and all other elements are β, where α and β are real numbers. Some properties of Σ_I are given below.

3.1 Inverse

Let Σ_I be a matrix of size d × d. As it is a symmetric Hermitian matrix, its inverse can be computed as follows. Consider that α, β ∈ ℝ and d ∈ ℕ, where d is the dimensionality. Matrix inversion is usually a computationally heavy task; we have derived simple expressions for its computation. Let

\[ a = -\beta\,(\alpha-\beta)^{d-2} \tag{5} \]


\[ b = (\alpha-\beta)^{d-1} - (d-1)\,a \tag{6} \]

\[ \gamma = (\alpha-\beta)^{d-1}\bigl(\alpha + (d-1)\beta\bigr) \tag{7} \]

This γ is the determinant of Σ_I.

\[ \Sigma_I^{-1} = \frac{1}{\gamma}\Bigl[\,a\,K + (b-a)\,I\,\Bigr] \tag{8} \]

where K is defined as a matrix of size d × d with all elements equal to 1, and I is an identity matrix of the same size.
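The closed-form inverse of Eqs. 5-8 can be checked numerically. The sketch below uses arbitrary values of α, β and d (they are not taken from the paper) and compares the result against numpy.linalg.inv.

```python
# Numerical check of the closed-form inverse of Sigma_I (Eqs. 5-8).
import numpy as np

d, alpha, beta = 6, 2.0, 0.5                 # arbitrary test values
K = np.ones((d, d))                          # all-ones matrix
I = np.eye(d)
Sigma_I = (alpha - beta) * I + beta * K      # diagonal alpha, off-diagonal beta

a = -beta * (alpha - beta) ** (d - 2)                          # Eq. (5)
b = (alpha - beta) ** (d - 1) - (d - 1) * a                    # Eq. (6)
gamma = (alpha - beta) ** (d - 1) * (alpha + (d - 1) * beta)   # Eq. (7), det(Sigma_I)

Sigma_I_inv = (a * K + (b - a) * I) / gamma                    # Eq. (8)

assert np.allclose(Sigma_I_inv, np.linalg.inv(Sigma_I))
assert np.isclose(gamma, np.linalg.det(Sigma_I))
```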

3.2 Eigenvalues of Σ_I

We have found that the largest eigenvalue of Σ_I is

\[ \lambda_{\max} = \alpha + (d-1)\beta. \tag{9} \]

The other eigenvalues are

\[ \lambda_i = \alpha - \beta \tag{10} \]

for i = 2, 3, …, d.

3.3 Eigenvalues of Σ_I^{-1}

The eigenvalues of Σ_I^{-1} are the reciprocals of the eigenvalues of Σ_I. Therefore, the minimum eigenvalue is

\[ \lambda'_{\min} = \frac{1}{\alpha + (d-1)\beta} \tag{11} \]

and the other d − 1 eigenvalues are

\[ \lambda'_i = \frac{1}{\alpha-\beta} \tag{12} \]

for i = 2, 3, …, d.

3.4 Eigen Value Decomposition of Σ_I

Consider that

\[ \Sigma_I = U_I \Lambda_I V_I^{T}. \tag{13} \]

Λ_I is given by

\[ \Lambda_I = \begin{pmatrix} \alpha + (d-1)\beta & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \alpha-\beta \end{pmatrix}. \tag{14} \]

We have found U_I as:


\[
U_I = \begin{pmatrix}
-\dfrac{1}{\Delta_1} & \dfrac{1}{\Delta_2} & 0 & \cdots & 0 \\[6pt]
-\dfrac{1}{\Delta_1} & -\dfrac{1}{(d-1)\Delta_2} & \dfrac{1}{\Delta_3} & \cdots & 0 \\[6pt]
\vdots & \vdots & \vdots & \ddots & \vdots \\[6pt]
-\dfrac{1}{\Delta_1} & -\dfrac{1}{(d-1)\Delta_2} & -\dfrac{1}{(d-2)\Delta_3} & \cdots & \dfrac{1}{\Delta_d} \\[6pt]
-\dfrac{1}{\Delta_1} & -\dfrac{1}{(d-1)\Delta_2} & -\dfrac{1}{(d-2)\Delta_3} & \cdots & -\dfrac{1}{\Delta_d}
\end{pmatrix} \tag{15}
\]

where

\[ \Delta_1 = \sqrt{d} \qquad\text{and}\qquad \Delta_i = \sqrt{1 + \frac{1}{d-i+1}} \tag{16} \]

for i = 2, 3, …, d.

The first column of U_I is an eigenvector corresponding to the largest eigenvalue of Σ_I. The matrix V_I is computed as

\[ V_I = U_I, \tag{17} \]

so that

\[ \Sigma_I^{-1} = \bigl(U_I \Lambda_I V_I^{T}\bigr)^{-1}. \tag{18} \]

3.5 Eigen value decomposition of Σ_I^{-1}

\[ \Sigma_I^{-1} = U_I \Lambda_I^{-1} V_I^{T} \tag{19} \]

The matrix Λ_I^{-1} is a diagonal matrix with the eigenvalues of Σ_I^{-1} on its diagonal.
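The column pattern of Eq. 15 can likewise be verified numerically. The sketch below constructs U_I with that pattern (our reading of the reconstructed matrix), uses Eq. 17 (V_I = U_I), and checks Eqs. 13 and 19 for arbitrary values of α, β and d.

```python
# Build U_I column by column and verify Sigma_I = U_I Lambda_I U_I^T.
import numpy as np

def build_U(d):
    U = np.zeros((d, d))
    U[:, 0] = -1.0 / np.sqrt(d)                     # first column, Delta_1 = sqrt(d)
    for i in range(2, d + 1):                       # columns i = 2, ..., d
        delta = np.sqrt(1.0 + 1.0 / (d - i + 1))    # Eq. (16)
        U[i - 2, i - 1] = 1.0 / delta
        U[i - 1:, i - 1] = -1.0 / ((d - i + 1) * delta)
    return U

d, alpha, beta = 5, 3.0, 1.0                        # arbitrary test values
Sigma_I = (alpha - beta) * np.eye(d) + beta * np.ones((d, d))
U = build_U(d)
Lam = np.diag([alpha + (d - 1) * beta] + [alpha - beta] * (d - 1))

assert np.allclose(U @ U.T, np.eye(d))              # U_I is orthonormal
assert np.allclose(U @ Lam @ U.T, Sigma_I)          # Eq. (13) with V_I = U_I
assert np.allclose(np.linalg.inv(Sigma_I), U @ np.linalg.inv(Lam) @ U.T)  # Eq. (19)
```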

3.6 Pseudo inverse of Σ_P

Let us consider that

\[ \Sigma = U \Lambda V^{T} \tag{20} \]

and

\[ \Sigma_I = U_I \Lambda_I V_I^{T}. \tag{21} \]

Then

\[ \Sigma_P = \Sigma\,\Sigma_I = U \Lambda V^{T} U_I \Lambda_I V_I^{T} \tag{22} \]

\[ \phantom{\Sigma_P} = U \tilde{\Lambda}\, V_I^{T}, \tag{23} \]

where

\[ \tilde{\Lambda} = \Lambda V^{T} U_I \Lambda_I. \tag{24} \]

We have found that, under the assumption U = U_I and V = V_I,

\[ V^{T} U_I = I \tag{25} \]


and therefore

\[ \tilde{\Lambda} = \Lambda\,\Lambda_I \tag{26} \]

\[ \tilde{\Lambda}^{-1} = \Lambda_I^{-1}\,\Lambda^{-1}. \tag{27} \]

Consequently,

\[ \Sigma_P^{+} = U_I\,\tilde{\Lambda}^{-1}\,V_I^{T}, \tag{28} \]

where Σ_P^{+} represents the pseudoinverse of Σ_P.
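The sketch below illustrates Eqs. 20-28. Since Eq. 28 relies on the assumption U = U_I and V = V_I, the sketch obtains U_I from an eigendecomposition of Σ_I rather than from Eq. 15, reads the eigenvalues of Σ as the diagonal of U_I^T Σ U_I under that assumption, and inverts only the nonzero entries of Λ̃ when Σ is rank-deficient; these choices are ours, not prescribed by the paper.

```python
# Projected pseudo-covariance and its pseudoinverse (Eqs. 20-28), as a sketch.
import numpy as np

def pseudo_inverse_projected(Sigma, alpha, beta, tol=1e-12):
    """Sigma_P^+ = U_I pinv(Lambda Lambda_I) U_I^T under the U = U_I, V = V_I assumption."""
    d = Sigma.shape[0]
    Sigma_I = (alpha - beta) * np.eye(d) + beta * np.ones((d, d))
    lam_I, U_I = np.linalg.eigh(Sigma_I)           # eigenpairs of the ideal matrix
    lam = np.diag(U_I.T @ Sigma @ U_I)             # eigenvalues of Sigma if U = U_I held exactly
    lam_tilde = lam * lam_I                        # Eq. (26)
    inv_lam = np.zeros_like(lam_tilde)
    nz = np.abs(lam_tilde) > tol
    inv_lam[nz] = 1.0 / lam_tilde[nz]              # invert only the nonzero part
    return U_I @ np.diag(inv_lam) @ U_I.T          # Eq. (28)

# undersampled toy example: d = 10 features, n = 4 samples
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
Xc = X - X.mean(axis=1, keepdims=True)
Sigma = Xc @ Xc.T / X.shape[1]                     # singular 10 x 10 covariance (rank <= 3)
Sigma_P_pinv = pseudo_inverse_projected(Sigma, alpha=1.0, beta=0.9)
```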

3.7 Density function in the projected space

The general expression for the multivariate normal density in d dimensions, in the projected space, is as follows:

\[ p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_P|^{1/2}} \exp\Bigl[-\tfrac{1}{2}(x-\mu)^{T}\,\Sigma_P^{+}\,(x-\mu)\Bigr], \tag{29} \]

or simply x ~ N(μ, Σ_P). The discriminant function corresponding to Eq. 29 is

\[ g_i(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_P^{+}(x-\mu_i) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_P| + \ln P(\omega_i), \tag{30} \]

where Σ_i is arbitrary. We assume that Σ_P is also arbitrary. Therefore,

\[ g_i(x) = x^{T} W_i x + w_i^{T} x + \tilde{w}_{i0}, \tag{31} \]

where

\[ W_i = -\tfrac{1}{2}\Sigma_P^{+}, \tag{32} \]

\[ w_i = \Sigma_P^{+}\mu_i \tag{33} \]

and

\[ \tilde{w}_{i0} = -\tfrac{1}{2}\mu_i^{T}\Sigma_P^{+}\mu_i - \tfrac{1}{2}\ln|\Sigma_P| + \ln P(\omega_i). \tag{34} \]

3.8 Classification

Let x_1, x_2, …, x_m ~ N(μ_i, Σ_i) be data points from class i, and let d be the length of x_j for 1 ≤ j ≤ m. We compute the mean of the class:

\[ \mu_i = \frac{1}{m}\sum_{j=1}^{m} x_j \tag{35} \]

Transforming N(μ_i, Σ_P) to N(0, Σ_P):

\[ x_j \leftarrow x_j - \frac{1}{d}\sum_{k=1}^{d} x_j[k], \qquad \forall\, j = 1, 2, \ldots, m, \]


where x_j[k] means the k-th value of the j-th sample in the i-th class. Let

\[ X_i = [\,x_1\ \ x_2\ \ \ldots\ \ x_m\,]. \]

The scatter matrix is

\[ \Sigma_i = \frac{1}{m} X_i X_i^{T} \tag{36} \]

and

\[ \tilde{\Lambda} = \Lambda_i\,\Lambda_I, \tag{37} \]

where Λ_i is the diagonal matrix of eigenvalues of Σ_i (cf. Eq. 26). The pseudoinverse is computed as

\[ \Sigma_P^{+} = U_I\,\tilde{\Lambda}^{-1}\,U_I^{T}. \tag{38} \]

The classification algorithm produces μ_i and Σ_P^{+}.
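A sketch of this per-class training step follows, reusing pseudo_inverse_projected from the earlier sketch; the per-sample centering reflects our reading of the transform above, and α, β are user-chosen parameters of the ideal matrix.

```python
# Per-class training of Section 3.8: mean, centering, scatter matrix, projected pseudoinverse.
import numpy as np

def train_class(X, alpha, beta):
    """X has shape (d, m): m training samples of dimension d from one class."""
    d, m = X.shape
    mu = X.mean(axis=1)                                # Eq. (35)
    Xc = X - X.mean(axis=0, keepdims=True)             # each sample minus the mean of its own entries
    Sigma = Xc @ Xc.T / m                              # Eq. (36), rank-deficient when m < d
    Sigma_I = (alpha - beta) * np.eye(d) + beta * np.ones((d, d))
    Sigma_P = Sigma @ Sigma_I                          # Eq. (4)
    Sigma_P_pinv = pseudo_inverse_projected(Sigma, alpha, beta)   # Eqs. (37)-(38)
    return mu, Sigma_P, Sigma_P_pinv
```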

3.9 Threshold setting

Eq. 31 is quadratic in d variables, where d is large. Therefore, it is very hard to compute a closed-form solution or any general expression for the threshold. We suggest that the training samples x_1, x_2, … be substituted into g_i(x) given by Eq. 31 for each class i = 1, 2, …, c, and that human or machine intelligence be used to decide threshold values that are as close to optimal as possible.

3.10 Testing

Let y be a vector from an unknown class. We transform its distribution from N(μ, Σ) to N(0, Σ) using

\[ y \leftarrow y - \frac{1}{d}\sum_{j=1}^{d} y[j], \]

where y[j] is the j-th value of the vector y. We then compute g_i(y) for i = 1, 2, …, c, where c is the number of classes. Considering the thresholds and g_i(y), we decide the class to which y most probably belongs.

4 Results and Discussion

To analyze the effectiveness of the proposed approach, we have taken a cancer gene expression dataset, namely "Colon Tumor", collected from the Kent Ridge Bio-medical Dataset Repository ([18]). The dataset contains 62 instances collected from colon-cancer patients: 40 are biopsies from tumors, and the remaining instances are normal biopsies taken from healthy parts of the colons of the same patients. Every sample has expressions of a large number of genes and a class label (normal or cancer). The data has a higher number of dimensions than samples, making this a small-sample-size problem, i.e., d ≫ n, where d is the dimension and n is the number of samples in the set. In this case, d = 256, with n = 40 and n = 22 for biopsies from diseased and normal cells, respectively. As microarray experiments are very expensive, it seems impossible to have d ≤ n. This work does not include preprocessing of the raw dataset; every sample, in vector form, is the input to the method.


In order to test the effectiveness of our classifier, standard performance metrics are used in this research. Given a test set with N samples, let NP and NN be the number of positive samples ("normal") and the number of negative samples ("diseased") within the set, respectively (N = NP + NN). After classification, let TP be the number of positives detected as positive and FN the number of positives classified as negative (NP = TP + FN). Similarly, let TN be the number of negatives detected as negative and FP the number of negatives classified as positive (NN = TN + FP). For this paper, we have considered the following metrics to analyze the performance (a short computational sketch follows the list):

1. Sensitivity: Sn = TP/(TP+FN)
2. Specificity: Sp = TN/(TN+FP)
3. Positive predictive value: PPV = TP/(TP+FP)
4. Negative predictive value: NPV = TN/(TN+FN)
5. Overall accuracy: Acc = (TP+TN)/N
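The five measures can be computed directly from the confusion counts, as in the sketch below; the counts in the usage line are arbitrary illustrative numbers, not the paper's results.

```python
# Performance metrics from confusion counts.
def metrics(TP, FN, TN, FP):
    N = TP + FN + TN + FP
    return {
        "Sn":  TP / (TP + FN),        # sensitivity
        "Sp":  TN / (TN + FP),        # specificity
        "PPV": TP / (TP + FP),        # positive predictive value
        "NPV": TN / (TN + FN),        # negative predictive value
        "Acc": (TP + TN) / N,         # overall accuracy
    }

print(metrics(TP=30, FN=5, TN=20, FP=7))
```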

The computed results are shown in Table 1.

Table 1. Performance analysis of the classifier

Metric    Measure (%)
Sn        66.67
Sp        83.33
PPV       75.00
NPV       76.92
Acc       76.19

5 Conclusion

We have studied the classification case in which the dimensionality is larger than the number of samples. This paper discussed the situation where the number of dimensions is not reduced and the pseudoinverse is not possible. We have proposed that, in such cases, projecting the scatter matrix onto an ideal scatter matrix makes the computation of a pseudoinverse possible. However, due to the computation of large powers (Equations 5 to 8), the accuracy of the method can be affected. Moreover, the computational and space complexity of the proposed technique is of the order of O(d^3), where d is the dimension and d is very large. Therefore, it is a computationally heavy technique and requires large space, but the same is true for the existing techniques. The proposed classifier is also noise sensitive, like the existing techniques. However, with a larger dataset and preprocessing, the results may be further improved.

Acknowledgement The authors are very grateful to the referees for carefully reading the paper and for their comments and suggestions which have improved the paper. The work of the author MMJ was supported by the Serbian Ministry of Education and Science (project III44006).

Competing Interests Authors have declared that no competing interests exist.

References

[1] Krzanowski WJ, et al. Discriminant analysis with singular covariance matrices: Methods and applications to spectroscopic data. Appl. Stat.-J. Roy. Stat. Soc. C. 1995;44:101-115.

[2] Fatma AA, Rhoades BE. Characterizations of HJ matrices. Filomat. 2016;30(3):675-679.

[3] Yu H, Yang J. A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recogn. 2001;34(10):2067-2070.

[4] Yang J, Yang J. Why can LDA be performed in PCA transformed space. Pattern Recogn. 2003;36(2):563-566.

[5] Raudys S, Duin RP. Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recogn. Lett. 1998;19(5):385-392.

[6] Hastie T, et al. Penalized discriminant analysis. Ann. Stat. 1995;23:73-102.

[7] Howland P, Park H. Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 2004;26(8):995-1006.

[8] Belhumeur PN, Hespanha JP, Kriegman DJ. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 1997;19(7):711-720.

[9] Swets DL, Weng JJ. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 1996;18(8):831-836.

[10] Zhao W, Chellappa R, Phillips PJ. Subspace linear discriminant analysis for face recognition. Technical Report, Computer Vision Laboratory, Center for Automation Research, University of Maryland; 1999.

[11] Zheng W, Zhao L, Zou C. An efficient algorithm to solve the small sample size problem for LDA. Pattern Recogn. 2004;37(5):1077-1079.

[12] Zheng W, Zou C, Zhao L. Real-time face recognition using Gram-Schmidt orthogonalization for LDA. In: Proceedings of the 17th IEEE International Conference on Pattern Recognition. 2004;2:403-406.

[13] O'Sullivan F. A statistical perspective on ill-posed inverse problems. Stat. Sci. 1986;1(4):502-518.

[14] Titterington DM. Common structure of smoothing techniques in statistics. Int. Stat. Rev. 1981;141-170.

[15] Cornfield J. Discriminant functions. Review of the International Statistical Institute. 1967;35:142-153.

[16] Fukunaga K. Introduction to Statistical Pattern Recognition. Academic Press; 1990.

[17] Skurichina M, Duin RP. Stabilizing classifiers for very small sample sizes. In: Proceedings of the 13th IEEE International Conference on Pattern Recognition. 1996;2:891-896.

[18] Kent Ridge Bio-medical Dataset Repository; 2015.

_______________________________________________________________________________________

© 2016 Mahmood et al.; This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Peer-review history: The peer review history for this paper can be accessed here (Please copy paste the total link in your browser address bar) http://sciencedomain.org/review-history/15335
