
AUTOMATIC MODEL-ORDER SELECTION FOR PCA

Michel Sarkis (1), Zaher Dawy (2), Florian Obermeier (1), and Klaus Diepold (1)

(1) Munich University of Technology, Institute for Data Processing (LDV), Arcisstr. 21, 80290 Munich, Germany, email: {michel, f.obermeier, kldi}@tum.de
(2) American University of Beirut, Department of Electrical and Computer Engineering, Beirut 1107 2020, Lebanon, email: [email protected]

ABSTRACT

Determining the model order of a given data set is an important task in signal analysis. Principal Component Analysis (PCA) can be used for this purpose if there is a criterion upon which the correct order can be chosen. In this work, we propose a new and simple technique to automatically determine the rank of a PCA model. Tested with simulated data, the algorithm is able to determine the correct model order efficiently. Applied to video sequences, the method is able to estimate the subspaces necessary to capture the motion and illuminance changes within the different frames. This helps in reducing the storage requirements of video sequences and improves the efficiency of context-based search and retrieval techniques.

Index Terms: Data Compression, Information Retrieval, Image Coding, Video Signal Processing

1. INTRODUCTION

Finding the correct model order is a fundamental step in data analysis. This topic has been explored since the late 1950s and research is still ongoing [1, 2, 3, 4]. The main objective is to perform dimension reduction by finding the lowest model order that best describes the data. Dimension reduction is required by many applications in signal processing: it can be applied to estimate the number of signals impinging on an antenna array [5, 6] and to determine the number of signals that have to be estimated by Independent Component Analysis (ICA) [7, 8].

Principal Component Analysis (PCA) has been used to subdivide a signal into subspaces called eigenspaces. The problem, however, resides in the dimensionality of the subspaces that should be chosen in order to best describe the data with a lower dimension. Usually, this is done by trial and error or by employing standard model-order selection techniques like Minimum Description Length (MDL) or Laplace PCA [2, 3].

In this work, we present a new algorithm that automatically computes the number of Singular Values (SVs) to be retained by PCA, based on the distribution of residual correlations of the difference between the covariance matrix of the data set and its lower-rank approximation; the proposed technique is thus denoted the Residual Correlation Technique (RCT). The obtained results show that RCT can determine the correct model order of different data sets with high probability; nevertheless, the algorithm is relatively sensitive to higher noise levels. In addition, RCT is shown to be able to determine the subspaces necessary to capture the variation among the frames of a real video sequence.

Section 2 presents the statistical background required in this work. In Section 3, the proposed residual correlation technique is explained. Section 4 summarizes existing techniques to be compared with the proposed technique. Section 5 presents results obtained on synthesized signals and video sequences. Finally, conclusions are drawn in Section 6.



2. NOTION OF STATISTICS

Given two zero-mean random variables X and Y with variances \sigma_x^2 and \sigma_y^2 respectively, their bivariate normal distribution is written as:

P_{XY}(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left( -\frac{z}{2(1 - \rho^2)} \right),   (1)

where \rho defines the correlation between the two variables and z is a quadratic term in x, y, and \rho [9, 10]. The true correlation term \rho follows a special distribution which is beyond the scope of this work; however, we are interested in the variance \sigma_\rho^2 of this distribution, which is given by:

\sigma_\rho^2 = \frac{(1 - \rho^2)^2}{N - 1} \left( 1 + \frac{11\rho^2}{2(N - 1)} + \ldots \right),   (2)

where N is the number of samples [9, 10]. When \rho tends to zero, (2) reduces to the simple form \sigma_0^2 = (N - 1)^{-1}. This quantity \sigma_0^2 is called the variance of the distribution with zero correlation, and it indicates that the two variables X and Y are statistically uncorrelated.
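As a quick numerical illustration (an addition for this write-up, not one of the paper's experiments), the zero-correlation variance \sigma_0^2 = 1/(N - 1) can be verified by Monte Carlo simulation; the snippet below is a minimal sketch in Python, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20           # samples per trial, as in the paper's experiments
trials = 100000  # number of Monte Carlo repetitions

# Sample pairs of uncorrelated zero-mean Gaussian variables and collect
# the empirical correlation coefficient of each trial.
corrs = np.empty(trials)
for t in range(trials):
    x = rng.standard_normal(N)
    y = rng.standard_normal(N)
    corrs[t] = np.corrcoef(x, y)[0, 1]

# The variance of the correlation estimates should be close to 1/(N-1).
print(np.var(corrs), 1.0 / (N - 1))  # ~0.0526 vs. 0.0526
```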


3. RESIDUAL CORRELATION TECHNIQUE

When PCA is applied to a data set, dimension reduction is performed by discarding the SVs with low significance to the total variance. The difficulty, however, resides in how to define a criterion by which an SV is considered significant or not. In this work, we propose a new method for the rank selection of a PCA model using residual statistical fit. The main idea is to obtain a quantity, the residual correlations, that comprises the difference between the given observed data and its approximation after dimension reduction. Statistical fit has been used in other applications, including regression tests and factor analysis [9, 11]. In this paper, it will be employed to estimate the number of SVs to retain.

In what follows, the notation X_{N×M} will be used to indicate that the dimension of a matrix X is N × M. Given a data matrix X_{N×M}, where N is the number of samples and M is the number of variables, the Singular Value Decomposition (SVD) of its covariance matrix C_{M×M}, which is at the same time its eigenvalue decomposition, is:

C = U \Sigma U^T,   (3)

where U_{M×M} is the matrix of eigenvectors and \Sigma_{M×M} is the diagonal matrix of the eigenvalues. The recovered covariance matrix C̃_{M×M} is defined to be the new covariance matrix of the data where only m < M significant eigenvalues are retained. Hence, (3) can now be rewritten as:

\tilde{C} = \tilde{U} \tilde{\Sigma} \tilde{U}^T = \left( \tilde{U} \tilde{\Sigma}^{1/2} \right) \left( \tilde{U} \tilde{\Sigma}^{1/2} \right)^T,   (4)

where Ũ_{M×m} is the low-rank model matrix of eigenvectors and Σ̃_{m×m} is that of the eigenvalues. Each off-diagonal term of C represents the correlation between the variables in X, while each one in C̃ corresponds to the recovered correlation of the variables defined by the product in (4).

The goal of this algorithm is to find a recovered covariance matrix such that the chosen variables are statistically uncorrelated; this is usually what is needed in most signal processing applications. Here, it will be achieved by using the variance of the distribution with zero correlation derived in Section 2 as a criterion. This variance is only defined for two variables; moreover, there is no compact form of the correlation distribution for more than two variables [9]. One way to overcome this problem is to compute the correlations between the available variables pairwise and then calculate the overall mean variance. Since \sigma_0^2 depends only on the number of samples, it is equal to (N - 1)^{-1} for any two variables. Hence, the overall mean is also equal to the same quantity (N - 1)^{-1}, which will be used as the criterion to compute the low-rank model.

Define now the residual correlation matrix E_{M×M} = C - C̃. The recovered covariance matrix C̃ is considered a good approximation of C if the variance \sigma_{RCT}^2(m) of the distribution of the entries e_{ij} of E, where i ≠ j, is less than or equal to that of the distribution with zero correlation:

\sigma_{RCT}^2(m) \leq \sigma_0^2 = (N - 1)^{-1}.   (5)

As a result, the target lower dimension is the smallest m that satisfies (5); in other words, it is the smallest value of m that provides a variance \sigma_{RCT}^2 smaller than or equal to \sigma_0^2.

The use of the covariance matrix is justified since it represents the measure of the strength of the linear relationships (correlations) among the variables, while the correlation coefficient matrix is its normalized version, which has the same variance \sigma_0^2 for no correlation. Consequently, the diagonal entries of E are omitted in this computation since they represent the residual variances and not the residual correlations between the variables.

RCT, like any other model-order selection technique, takes the number of available observations N into account along with the structure of the data; the resulting subspace therefore depends on these two entities. One advantage is that no prior knowledge is needed in order to process the data. The only required assumption is that the variables are normally distributed, which is usually the case. In addition, the noise is assumed to be white.
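To make the procedure concrete, the following is a minimal sketch of RCT in Python with NumPy; it is not the authors' code, and the function name rct and its interface are assumptions. It follows the steps above: eigendecompose the covariance matrix, rebuild C̃ with the m leading eigenvalues, and return the smallest m for which the off-diagonal residuals satisfy (5).

```python
import numpy as np

def rct(X):
    """Residual Correlation Technique (sketch): return the smallest PCA
    rank m such that the variance of the off-diagonal entries of
    E = C - C_tilde drops to sigma_0^2 = 1/(N-1)."""
    N, M = X.shape                       # N samples, M variables
    X = X - X.mean(axis=0)               # zero-mean the data
    C = (X.T @ X) / (N - 1)              # sample covariance matrix (M x M)

    # Eigendecomposition of the symmetric covariance matrix, eq. (3);
    # sort eigenvalues by decreasing significance.
    w, U = np.linalg.eigh(C)
    idx = np.argsort(w)[::-1]
    w, U = w[idx], U[:, idx]

    sigma0_sq = 1.0 / (N - 1)            # zero-correlation variance, Sec. 2
    off = ~np.eye(M, dtype=bool)         # mask for the off-diagonal entries

    for m in range(1, M + 1):
        # Recovered covariance with only m leading eigenvalues, eq. (4).
        C_tilde = (U[:, :m] * w[:m]) @ U[:, :m].T
        E = C - C_tilde                  # residual correlation matrix
        if np.var(E[off]) <= sigma0_sq:  # criterion (5)
            return m
    return M
```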

4. STANDARD TECHNIQUES

In this part, two techniques for order selection will be presented: Laplace PCA and MDL. The first technique is chosen since it was developed specifically for PCA and its performance is superior when compared to many other techniques such as 5-fold cross validation, the Bayesian information criterion, and automatic relevance determination [3]. The second technique is chosen since it is well known in model-order selection and has not previously been compared to Laplace PCA [2]. These techniques will be compared to RCT in Section 5.

4.1. Laplace PCA

Bayesian model selection estimates the dimensionality of a data set by assigning a score (probability) to each candidate model of lower dimension; the higher the score, the more likely the model fits the data. Each probability is computed by integrating over the parameter values in the new model as:

p(X | \tilde{X}) = \int_{\theta} p(X | \theta) \, p(\theta | \tilde{X}) \, d\theta,   (6)

where X_{N×M} is the original data, X̃_{N×m} is the low-rank approximation of X, and \theta is the integration variable. Thus, the lower dimension is selected as the one that maximizes (6). Laplace's method is a technique to approximate the above integral; the main idea is to choose a good parametrization of the integration variable \theta [3]. In summary, this technique is a Bayesian approach.
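As a practical pointer beyond the original paper: Minka's automatic dimensionality choice [3] is available off the shelf in scikit-learn, which implements his maximum-likelihood variant through the 'mle' option. A usage sketch, assuming a recent scikit-learn installation and synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

# X: data matrix of shape (n_samples, n_features); the 'mle' option
# requires n_samples >= n_features.
X = np.random.default_rng(0).standard_normal((300, 20))

# 'mle' invokes Minka's automatic dimensionality selection [3].
pca = PCA(n_components='mle', svd_solver='full')
pca.fit(X)
print(pca.n_components_)  # estimated model order
```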


4.2. Minimum Description Length

The MDL principle is an information-theoretic approach based on the idea of capturing the features in a data set so as to obtain a model that allows the shortest description of both the data and the data model simultaneously. In our context, the MDL principle amounts to minimizing the following equation:

\mathrm{MDL}(m) = \frac{N}{2} \ln\left( \sigma_{MDL}^2(m) \right) + \frac{m + 1}{2} \ln(N),   (7)

where m is varied from 1 to M, N is the number of samples, and \sigma_{MDL}^2(m) is the variance of the error between the original model and the model of dimension m. Hence, the lower-rank model is the value of m at which MDL(m) is minimal [2].
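A minimal sketch of this criterion in Python follows (an illustration, not the authors' code). It assumes, as one common choice, that \sigma_{MDL}^2(m) is estimated as the mean of the M - m discarded eigenvalues of the covariance matrix; the paper does not specify this estimator, so treat it as an assumption.

```python
import numpy as np

def mdl_order(X):
    """Pick the model order minimizing MDL(m), eq. (7).  Sketch under the
    assumption that sigma_MDL^2(m) is the mean of the discarded
    eigenvalues (which must be positive, i.e. N > M)."""
    N, M = X.shape
    X = X - X.mean(axis=0)
    C = (X.T @ X) / (N - 1)
    w = np.sort(np.linalg.eigvalsh(C))[::-1]  # eigenvalues, descending

    scores = []
    for m in range(1, M):                     # keep m, discard M - m
        sigma_sq = w[m:].mean()               # assumed residual variance
        scores.append(0.5 * N * np.log(sigma_sq)
                      + 0.5 * (m + 1) * np.log(N))
    return int(np.argmin(scores)) + 1         # m is 1-indexed
```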

5. EXPERIMENTAL ANALYSIS

This section is divided into two parts. The first shows the results obtained by applying RCT to simulated data sets. The sets are generated from a zero-mean multi-dimensional Gaussian distribution. Unless otherwise specified, each set contains three major signals, considered as the true dimension of the data, in addition to some noise signals. The variances of the three signals are 10, 8, and 5 respectively, while the number and the variance of the noise signals may differ depending on the test case. The results are then compared to the Laplace PCA and MDL techniques. In the tabulated results, σn² refers to the variance of the noise and ln to the number of noise signals; the column for dimension three indicates the number of times each method was successful. This test setup is commonly used for similar problems and was also used in [3].

The second part tests the algorithms on real video sequences. The main issue is to determine the number of eigen-images necessary to capture the temporal variation, i.e. motion and illuminance changes, in order to reconstruct the video sequence correctly. The data used is the well-known Akiyo sequence, which consists of 300 frames in CIF resolution [12]. The importance of this part resides in determining the ability of model-order selection techniques to minimize the size of video sequences, which can be used to reduce storage space and enhance the usage of context-based search and retrieval methods.

5.1. Testing with Simulated Data Sequences

In the first set, five noise signals, each of unit variance, were added to the three original signals. The number of samples is 20. This setup imitates the case when the number of samples is large enough compared to the total dimension of the data. The experiment was repeated ten thousand times to ensure the credibility of the outcomes. In Table 1, each column indicates a candidate dimension, while each row reflects the frequency of the outputs of each method; the sum of the frequencies in each row is equal to ten thousand. As can be noticed, all the methods give the correct dimension most of the time. However, RCT has the highest probability (0.8581) of choosing the right dimension.

Table 1. Results of dimension estimation by RCT, Laplace PCA, and MDL for N = 20, σn² = 1, and ln = 5.

Dimension       1     2     3     4    5    6    7
Laplace PCA   208  1994  7544   219   24    8    3
MDL           286  1658  7428   468   86   45   29
RCT            13  1282  8581   124    0    0    0
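For reference, the first test set can be reproduced along the following lines; this is a sketch that reuses the hypothetical rct function from Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20                                   # number of samples
signal_vars = [10.0, 8.0, 5.0]           # three major signals
noise_vars = [1.0] * 5                   # five unit-variance noise signals
variances = np.array(signal_vars + noise_vars)

# Zero-mean multi-dimensional Gaussian data, one column per variable.
X = rng.standard_normal((N, variances.size)) * np.sqrt(variances)

print(rct(X))  # ideally 3, the true model order
```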

In the second set, the number of noise signals is increased to 20 while keeping the other parameters intact. This is done to examine the behavior of the techniques when the number of samples is not large enough compared to the data dimension. Table 2 shows that RCT has a probability of 0.6819 of finding the right dimension, while Laplace PCA has a probability of 0.4843 and MDL of 0.17. Such low probabilities are expected since the number of samples is relatively low with respect to the total dimension. Another observation is that both RCT and Laplace PCA tend to under-estimate the dimension to 2, with probabilities of 0.3094 and 0.4581 respectively, while MDL over-estimates the dimension to 18 in 37.75% of the cases and under-estimates it to 2 in 30.79% of the cases. Hence, MDL is the most sensitive to the number of samples.

Table 2. Results of dimension estimation by RCT, Laplace PCA, and MDL for N = 20, σn² = 1, and ln = 20.

Dimension       1     2     3     4    5    6    18
Laplace PCA   543  4581  4843    33    0    0     0
MDL          1439  3079  1700     7    0    0  3775
RCT            27  3094  6819    60    0    0     0

In the third set, the behavior of the algorithms at a low Signal-to-Noise Ratio (SNR) is examined. This can be arranged by either lowering the signal power or increasing that of the noise; here, the noise variance was increased to 1.5. The results in Table 3 show that the performance of Laplace PCA and MDL decreases considerably compared to their behavior in Table 1: their probabilities of choosing dimension three decreased from 0.7544 and 0.7428 to 0.3869 and 0.3826, respectively. The probability of RCT has also decreased; however, RCT determined the correct result 59.7% of the time, which is 21 percentage points higher than Laplace PCA and makes it more robust against noise variation.

Table 3. Results of dimension estimation by RCT, Laplace PCA, and MDL for N = 20, σn² = 1.5, and ln = 5.

Dimension       1     2     3     4    5    6    7
Laplace PCA  1927  4031  3869   141   21    7    4
MDL          2377  3422  3826   270   65   21   19
RCT             2   215  5970  3413  398    2    0

The previous results show that the proposed algorithm works well in normal cases, i.e. when the noise level is low compared to the signals. However, when the noise level gets higher, the performance of RCT starts to degrade. Examples of such setups are the tests used in the comparisons in [3]. In the first of these data sets, five signals of variances 10, 8, 6, 4, and 2 respectively and five noise signals, each of unit variance, are assumed. RCT tended to under-estimate the true dimension of 5 to 4 most of the time, because it was not able to distinguish the signal of variance 2 from the noise of unit variance. Another example is the third test case, where the noise dimension was extremely high (95 noise signals); there, RCT under-estimated the dimension to 1 or 2 instead of 5 most of the time.


These disadvantages are due to the fact that the methodology RCT follows is simple (correlations) and does not involve advanced estimation techniques that increase robustness against noise, as in Laplace PCA or MDL. Thus, RCT works well for low noise levels, but its performance deteriorates when these levels increase tremendously. This disadvantage, however, is not crucial when dealing with real video sequences, as shown in the next section.

5.2. Application to Real Video Sequences

Modeling video sequences aims to obtain a new representation of data that undergo temporal variation. This makes it possible to capture the variations within the sequence and to obtain a minimal description of the data by finding the minimal number of eigen-images needed to reconstruct the complete video sequence. In addition, it helps enhance the speed of computer vision applications such as content-based search and retrieval, and object recognition [13]. In what follows, the temporal variation will first be analyzed in the intensity domain, where the input data represent the intensity values of the pixels. Then, it will be analyzed in the motion compensated intensity domain, where the input data are the motion compensated pixel values [14]. The motion vectors used for this compensation were obtained using the Horn and Schunck algorithm [15].

To judge the obtained results, the weighted SNR (WSNR) will be used. WSNR is an image quality assessment measure which reflects the quality of a reconstructed image with respect to the human visual system [16, 17]. It is obtained by constructing a weighting function in the frequency domain and multiplying it by the difference of the Fourier transforms of the original and the reconstructed image. The result, the weighted power, is then transformed to the spatial domain. Finally, WSNR is computed by taking the logarithm of the ratio of the weighted power to that of the original image.

In order to apply the model-order selection techniques, each frame is vectorized into a column vector and the set of all vectors is concatenated to form the input data matrix. Consequently, the number of samples N is the total number of frames in the video sequence and the number of variables M is equal to the number of pixels in a frame. A sketch of this construction is given below.
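The following minimal sketch in Python (with synthetic stand-in frames; not the authors' code) shows how the data matrix can be assembled and how the retained eigen-images reconstruct the frames:

```python
import numpy as np

# Suppose `frames` holds the video with shape (num_frames, height, width).
# A synthetic stand-in is used here; in practice these would be the
# decoded Akiyo frames (300 frames in CIF resolution, 352 x 288).
frames = np.random.default_rng(2).standard_normal((10, 288, 352))

N = frames.shape[0]                      # number of samples = frames
X = frames.reshape(N, -1)                # each frame vectorized; M = pixels

# PCA via the SVD of the mean-removed data; the right singular vectors,
# reshaped back to frame size, are the eigen-images.
mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

m = 3                                    # e.g. the order selected by RCT
eigen_images = Vt[:m].reshape(m, 288, 352)

# Rank-m reconstruction of all frames from the retained eigen-images.
X_hat = mean + (U[:, :m] * s[:m]) @ Vt[:m]
```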
Figure 1 presents the logarithm of the magnitude of the SVs of the first 10 frames of Akiyo in the intensity and motion compensated intensity domains. RCT found the number of SVs, and hence of eigen-images, to be retained in both domains to be 3, which is the same result obtained by trial and error in [14]. These images represent the static background of the sequence, the motion of the eyes, and the motion of the lips, respectively. Laplace PCA resulted in 4 images in the intensity domain and 3 in the motion compensated one, while MDL resulted in 9 and 8, respectively. Figure 2 shows the plot of the mean WSNR (mWSNR) in dB for the first 10 frames, where each value is computed by retaining the corresponding number of eigen-images. From this figure, it can be deduced that RCT and Laplace PCA perform much better than MDL in such an analysis, since the latter required three times as many images just to represent the first ten frames of the sequence. In the intensity domain, however, Laplace PCA resulted in one additional dimension, which brings a 1 dB gain in mWSNR. This gain is low considering that an additional frame, i.e. 33% more information, has to be kept.

This becomes more visible in Figure 3, where the mWSNR for all 300 frames is considered. Note that the steep increase in mWSNR at the beginning of this plot is due to the fact that more eigen-images are needed to represent the whole video sequence. As can be seen in the figure, RCT resulted in 124 eigen-images while Laplace PCA resulted in 150, for only a 0.5 dB gain this time. This means that 26 more frames (21%) have to be kept to obtain a gain of 0.5 dB. Such a gain is negligible and will not affect much the quality of the reconstructed images. This can also be seen in Figure 2, since Laplace PCA determined the right dimension only after motion compensation. Consequently, it can be deduced that RCT is the most sensitive to temporal variations, which makes it more capable of estimating the number of required eigen-images.

As a direct application, the obtained eigen-images can be used to perform efficient content-based search and retrieval: only the obtained eigen-images need to be searched, since they contain most of the information in the video sequence under study. For example, taking the first 10 frames of the Akiyo sequence and using RCT, only 3 eigen-images need to be searched instead of 10, which is 70% less.
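For completeness, the sketch below implements one common formulation of WSNR consistent with the description above; it is an assumption-laden illustration, not the exact measure of [16, 17], whose contrast sensitivity weighting is not reproduced here (the `weight` array is left as an input).

```python
import numpy as np

def wsnr(original, reconstructed, weight):
    """Weighted SNR sketch: apply the contrast-sensitivity weighting
    `weight` (same shape as the image spectrum) to the Fourier transforms
    of the image and of the error, then express the ratio of the weighted
    powers in dB.  One common formulation, assumed here."""
    F_orig = np.fft.fft2(original)
    F_err = np.fft.fft2(original - reconstructed)
    p_signal = np.sum(np.abs(weight * F_orig) ** 2)
    p_noise = np.sum(np.abs(weight * F_err) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)
```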

Fig. 1. Singular values of the first ten frames of Akiyo (log magnitude of the singular values vs. singular value number; intensity and motion compensated intensity domains).

Fig. 2. mWSNR (dB) of the reconstructed first ten frames of Akiyo vs. the number of eigen-images (intensity and motion compensated intensity domains; the operating points of RCT, Laplace PCA, and MDL are marked).

Fig. 3. mWSNR (dB) of the reconstructed 300 frames of Akiyo for the intensity domain vs. the number of eigen-images (RCT: 124, Laplace PCA: 150, MDL: 298).


6. CONCLUSION

A new method that automatically computes the dimension for PCA was presented. The essence of the algorithm is to exploit the residual correlations of the difference between the covariance matrix of the data and its lower-dimension approximation. This leads to the criterion needed to obtain the rank of a PCA model. RCT takes into account two important factors, the number of available samples and the structure of the data as represented by the correlations, and it is a simple technique to implement.

7. REFERENCES

[1] H. F. Kaiser, "The varimax criterion for analytic rotation in factor analysis," Psychometrika, vol. 23, pp. 187-200, 1958.
[2] J. Rissanen, "Modeling by shortest data description," Automatica, pp. 461-464, 1978.
[3] T. P. Minka, "Automatic choice of dimensionality for PCA," Neural Info. Proc. Systems (NIPS), vol. 13, pp. 598-604, 2000.
[4] P. Stoica and Y. Selen, "A review on information criterion rules," IEEE Sig. Proc. Mag., pp. 36-47, July 2004.
[5] M. Wax and T. Kailath, "Detection of signals by information theoretic criteria," IEEE Trans. Ac. Sp. Sig. Proc. (ASSP), vol. 33, no. 2, pp. 387-392, 1985.
[6] K. M. Wong, Q. T. Zhang, J. P. Reilly, and P. C. Yip, "On information theoretic criteria for the determination of the number of signals in high resolution array processing," IEEE Trans. Ac. Sp. Sig. Proc. (ASSP), vol. 38, no. 11, pp. 1959-1971, 1990.
[7] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley, New York, USA, 2001.
[8] Z. Dawy, M. Sarkis, J. Hagenauer, and J. C. Mueller, "A novel gene mapping algorithm based on independent component analysis," in ICASSP, March 2005.
[9] C. Rose and M. D. Smith, Mathematical Statistics with MATHEMATICA, Springer, 1st edition, March 2002.
[10] E. W. Weisstein, "Correlation Coefficient–Bivariate Normal Distribution," from MathWorld–A Wolfram Web Resource, http://mathworld.wolfram.com/CorrelationCoefficientBivariateNormalDistribution.html.
[11] A. M. C. Machado, J. C. Gee, and M. F. M. Campos, "Visual data mining for modeling prior distributions in morphometry," IEEE Sig. Proc. Mag., pp. 20-27, May 2004.
[12] T. Alpert et al., "Subjective evaluation of MPEG-4 video codec proposals: methodological approach and test procedures," Sig. Proc. Image Comm., vol. 19, no. 4, pp. 305-325, May 1997.
[13] L. Guan, S. Y. Kung, and J. Larsen, Multimedia Image and Video Processing, CRC Press LLC, September 2001.
[14] K. Diepold, F. Obermeier, and T. Zeitler, "Multi-domain subspace representation of image sequences," in Workshop on Image Analysis for Multimedia Interactive Services, April 2005.
[15] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[16] T. Mitsa and K. Varkur, "Evaluation of contrast sensitivity functions for the formulation of quality measures incorporated in halftoning algorithms," in ICASSP, April 1993.
[17] V. Monga, N. D. Venkata, H. Rehman, and B. L. Evans, "Halftoning Toolbox for MATLAB," July 2005, available: http://www.ece.utexas.edu/~bevans/projects/halftoning/toolbox/.
