2012 11th International Conference on Machine Learning and Applications

Sparse Representation based Discriminative Canonical Correlation Analysis for Face Recognition

Naiyang Guan, Xiang Zhang, Zhigang Luo, Long Lan
School of Computer Science, National University of Defense Technology, Changsha, China
[email protected], zhangxiang [email protected], [email protected]

Abstract—Canonical correlation analysis (CCA) has been widely used in pattern recognition and machine learning. However, both CCA and its extensions sometimes cannot give satisfactory results. In this paper, we propose a new CCA-type method termed sparse representation based discriminative CCA (SPDCCA), which incorporates sparse representation and discriminative information simultaneously into traditional CCA. In particular, SPDCCA not only preserves the sparse reconstruction relationship within the data based on sparse representation, but also preserves maximum-margin discriminative information, and thus further enhances classification performance. Experimental results on the Yale, Extended Yale B, and ORL datasets show that SPDCCA outperforms both CCA and its extensions, including KCCA, LPCCA and LDCCA, in face recognition.

Keywords-discriminative dimension reduction; canonical correlation analysis; sparse representation

I. INTRODUCTION

Discriminative dimension reduction methods learn a low-dimensional subspace for subsequent processing. Because they preserve discriminative information, they have strong power to separate different classes. Due to this effectiveness, discriminative dimension reduction has been widely used in fields such as bioinformatics [1], image retrieval [2], data visualization [3] and signal processing [4]. Fisher's linear discriminant analysis (FLDA) [5] is usually considered the first discriminative dimension reduction method. It learns a discriminative subspace by simultaneously maximizing the between-class scatter and minimizing the within-class scatter. FLDA is effective and useful in pattern recognition and machine learning. However, its underlying Gaussian assumption is often violated, which results in poor performance, especially when the data are insufficient. To overcome this deficiency, Yan et al. [6] proposed marginal Fisher analysis (MFA), which maximizes the margins between classes. Guan et al. [7] further introduced MFA into non-negative matrix factorization (NMF). In contrast to these supervised methods, unsupervised discriminative dimension reduction methods, e.g., sparsity preserving projections (SPP) [8], have been proposed recently.

Such methods compute, for each sample, the sparse coefficients of a linear combination over a dictionary, and expect the sample to be represented with as few dictionary items as possible. According to [8], the sparse reconstruction weights have natural discriminative power even in the absence of label information. Therefore, SPP outperforms other unsupervised methods such as principal component analysis (PCA) [9] and neighborhood preserving embedding (NPE) [10] in face recognition.

In practice, a dataset may contain different modalities. For example, an image can be described by pixels, histograms, or gradient-based features such as SIFT. However, traditional dimension reduction methods such as FLDA and MFA cannot handle multi-modal data. To this end, Hotelling [11] proposed canonical correlation analysis (CCA) to process data with two modalities. CCA inspects the linear relationship between two sets of random variables by maximizing their correlation. To deepen the understanding of CCA, Bach et al. [12] proposed a probabilistic CCA which uses graphical models to find the latent variables hidden between different modalities. Although CCA is closely related to PCA, ICA [13] and FLDA, it is quite different from them because it considers multiple modalities. CCA improves subsequent classification performance and has therefore been widely applied in many fields, such as facial expression recognition [14], image processing [15], image retrieval [16], image texture analysis [17], and context-based text mining [18].

Recently, many CCA extensions have appeared. To overcome the over-fitting problem, Sun et al. proposed sparse CCA for multi-label classification [19] and other tasks [20], [21]. To fit CCA to nonlinearly distributed data, W. W. Hsieh proposed a neural network based CCA [22] to deal with nonlinear correlation, but it suffers from a heavy computational cost. To overcome this deficiency, Bach et al. [23], [24] proposed kernel CCA (KCCA), which maps samples into a high-dimensional space and then performs CCA on the linearly separable samples in that space. Using label information and the kernel trick, Matthew et al. [25] proposed a semi-supervised Laplacian regularized KCCA for brain graph analysis. Since it is difficult to choose a suitable kernel, KCCA is not easy to use in practice.



To overcome this deficiency of KCCA, Sun et al. [26] proposed the locality preserving CCA method (LPCCA), which incorporates local geometric structure to fit nonlinearly distributed data. Peng et al. [27] proposed local discrimination CCA (LDCCA) to further incorporate discriminative information. As a discriminative multi-modal dimension reduction method, however, LDCCA still cannot sufficiently utilize the local geometric structure of the data.

In this paper, we propose a novel discriminative multi-modal dimension reduction method termed sparse representation based discriminative canonical correlation analysis (SPDCCA) to overcome the aforementioned problems. SPDCCA not only maintains discriminative information but also preserves the sparse reconstruction relationship of face images for face recognition. Benefiting from the discriminative information, SPDCCA separates samples in the reduced-dimensional space by maximizing intra-class correlations and minimizing inter-class correlations between two modalities. Moreover, SPDCCA exploits the natural discriminative information carried by the sparse representation of samples and thus further improves classification performance. Experimental results on popular face image datasets including Yale, Extended Yale B and ORL show that SPDCCA outperforms CCA, KCCA, LPCCA and LDCCA.

The remainder of this paper is organized as follows: Section 2 outlines CCA and sparse representation and presents SPDCCA, Section 3 describes the experimental results, and Section 4 concludes this paper.


II. RELATED WORK ON CCA

This section introduces CCA and its closely related extensions, including KCCA, LPCCA and LDCCA.

A. CCA

CCA aims at finding a pair of linear mappings for two sets of random variables by maximizing their correlation, namely the canonical correlation between the features of two datasets. Given a pair-wise dataset $\{(x_i, y_i)\}_{i=1}^{n} \in \mathbb{R}^p \times \mathbb{R}^q$, where $n$ is the number of samples and the datasets $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_i\}_{i=1}^{n}$ have been centralized for convenience, i.e., $\frac{1}{n}\sum_{i=1}^{n} x_i = 0$, the objective function of CCA is described as follows:

$$(\omega_x, \omega_y) = \arg\max_{\omega_x, \omega_y} \frac{\omega_x^T C_{xy} \omega_y}{\sqrt{\omega_x^T C_{xx} \omega_x \cdot \omega_y^T C_{yy} \omega_y}} \qquad (1)$$

where $C_{xy}$ is the covariance between $X$ and $Y$, $C_{xx}$ and $C_{yy}$ are the variances of $X$ and $Y$, respectively, and $\omega_x^T$ stands for the transpose of the vector $\omega_x$. From (1), we know that the larger the correlation between the two datasets is, the closer their respective transformed features are. Eq. (1) can be written in the following form:

$$\begin{bmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \end{bmatrix} = \lambda \begin{bmatrix} C_{xx} & 0 \\ 0 & C_{yy} \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \end{bmatrix} \qquad (2)$$

where $\lambda$ is the eigenvalue, and the problem can be solved by generalized eigenvalue decomposition. To avoid the singularity problem, $C_{xx}$ and $C_{yy}$ can be substituted by $C_{xx} + \gamma I_x$ and $C_{yy} + \gamma I_y$, respectively, where $\gamma$ is a regularization parameter between 0 and 1, and $I_x$ and $I_y$ are identity matrices. This process intrinsically follows regularization theory, i.e.,

$$\begin{bmatrix} 0 & C_{xy} \\ C_{yx} & 0 \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \end{bmatrix} = \lambda \begin{bmatrix} C_{xx} + \gamma I_x & 0 \\ 0 & C_{yy} + \gamma I_y \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \end{bmatrix} \qquad (3)$$

In some cases, the Tikhonov regularization term can be replaced by manifold regularization [28].
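For concreteness, the following is a minimal sketch, not taken from the paper, of solving the regularized CCA problem (3) as a generalized eigenvalue problem with NumPy/SciPy; the function name cca_fit and the default value of gamma are illustrative assumptions. Because the left-hand matrix is symmetric and the regularized right-hand matrix is positive definite, the symmetric solver scipy.linalg.eigh applies directly.

```python
import numpy as np
from scipy.linalg import eigh

def cca_fit(X, Y, gamma=0.1):
    """Regularized CCA via the generalized eigenproblem (3).

    X: n x p data matrix, Y: n x q data matrix (rows are samples).
    Returns the leading pair of canonical directions (w_x, w_y).
    """
    n = X.shape[0]
    # Center both views, as assumed in Section II-A.
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx, Cyy, Cxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    p, q = Cxx.shape[0], Cyy.shape[0]
    # Left-hand side: off-diagonal cross-covariance blocks.
    A = np.block([[np.zeros((p, p)), Cxy],
                  [Cxy.T, np.zeros((q, q))]])
    # Right-hand side: Tikhonov-regularized within-view covariances.
    B = np.block([[Cxx + gamma * np.eye(p), np.zeros((p, q))],
                  [np.zeros((q, p)), Cyy + gamma * np.eye(q)]])
    # Symmetric-definite generalized eigenproblem A w = lambda B w.
    vals, vecs = eigh(A, B)
    w = vecs[:, np.argmax(vals)]        # eigenvector of the largest eigenvalue
    return w[:p], w[p:]
```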

B. KCCA

Kernel CCA (KCCA) [23], [24] maps samples into a high-dimensional space and discovers the nonlinear correlation hidden in the original samples. This is done by the kernel trick, which constructs a kernel matrix instead of performing an explicit mapping. The RBF kernel [29] is one of the most widely used kernels, and it is the one used in this paper. Given a pair-wise dataset $\{(x_i, y_i)\}_{i=1}^{n} \in \mathbb{R}^p \times \mathbb{R}^q$, assume nonlinear mappings $\phi: x \rightarrow \phi(x)$ and $\varphi: y \rightarrow \varphi(y)$ that map the original samples into another feature space. The kernel functions can then be expressed as inner products of the mapped samples, i.e., $(K_x)_{ij} = K_x(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ and $(K_y)_{ij} = K_y(y_i, y_j) = \varphi(y_i)^T \varphi(y_j)$, where $K_x(\cdot,\cdot)$ and $K_y(\cdot,\cdot)$ are kernel functions that do not require the explicit nonlinear mappings $\phi$ and $\varphi$. The solution vectors of KCCA can be expressed as linear combinations of $\{\phi(x_i)\}_{i=1}^{n}$ and $\{\varphi(y_i)\}_{i=1}^{n}$, i.e., $w_\phi = \phi(X)\alpha$ and $w_\varphi = \varphi(Y)\beta$. Thus, the objective function of KCCA becomes:

$$(\omega_\phi, \omega_\varphi) = \arg\max_{\omega_\phi, \omega_\varphi} \frac{\omega_\phi^T \phi(X) \varphi(Y)^T \omega_\varphi}{\sqrt{\omega_\phi^T \phi(X)\phi(X)^T \omega_\phi \cdot \omega_\varphi^T \varphi(Y)\varphi(Y)^T \omega_\varphi}} = \arg\max_{\alpha, \beta} \frac{\alpha^T K_x K_y \beta}{\sqrt{\alpha^T K_x K_x \alpha \cdot \beta^T K_y K_y \beta}} \qquad (4)$$




where $K_x = \phi(X)^T \phi(X)$ and $K_y = \varphi(Y)^T \varphi(Y)$. Similar to CCA, the optimization problem of KCCA can avoid the singularity problem by adding a regularization term. Consequently, the final objective function of KCCA becomes:

$$\begin{bmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} K_x K_x + \gamma K_x & 0 \\ 0 & K_y K_y + \gamma K_y \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \qquad (5)$$


Although KCCA is effective in some applications, it cannot guarantee that nonlinearly related input data are mapped into a linear relationship in the chosen feature space. Moreover, an overly strong kernel function undermines the generalization ability on unseen data. It is therefore difficult to choose an appropriate kernel function in practice.
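As a rough illustration, the sketch below forms RBF kernel matrices for the two views and solves the regularized generalized eigenproblem (5); the helper names rbf_kernel_matrix and kcca_fit, the bandwidth sigma, and the numerical jitter are assumptions for illustration, not part of the original method description.

```python
import numpy as np
from scipy.linalg import eig

def rbf_kernel_matrix(Z, sigma=1.0):
    """RBF kernel matrix: K_ij = exp(-||z_i - z_j||^2 / (2 * sigma^2))."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kcca_fit(X, Y, sigma=1.0, gamma=0.1):
    """Regularized KCCA via the generalized eigenproblem (5).

    X: n x p view one, Y: n x q view two (rows are samples).
    Returns the expansion coefficients (alpha, beta) of the leading pair.
    """
    n = X.shape[0]
    Kx = rbf_kernel_matrix(X, sigma)
    Ky = rbf_kernel_matrix(Y, sigma)
    A = np.block([[np.zeros((n, n)), Kx @ Ky],
                  [Ky @ Kx, np.zeros((n, n))]])
    B = np.block([[Kx @ Kx + gamma * Kx, np.zeros((n, n))],
                  [np.zeros((n, n)), Ky @ Ky + gamma * Ky]])
    B += 1e-8 * np.eye(2 * n)            # small jitter for numerical stability
    vals, vecs = eig(A, B)               # generalized eigenproblem A w = lambda B w
    w = vecs[:, np.argmax(vals.real)].real
    return w[:n], w[n:]
```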


C. LPCCA

Another variant of CCA is LPCCA [26], which incorporates the local geometric structure of the dataset into CCA. LPCCA constructs a Laplacian-style weight matrix $S$ to encode the relationships between samples, where a larger entry of $S$ indicates a stronger correlation between the corresponding samples. The objective function of LPCCA can be written in matrix form as follows:

$$\begin{bmatrix} 0 & X S_{xy} Y^T \\ Y S_{yx} X^T & 0 \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \end{bmatrix} = \lambda \begin{bmatrix} X S_{xx} X^T + \gamma I_x & 0 \\ 0 & Y S_{yy} Y^T + \gamma I_y \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \end{bmatrix} \qquad (6)$$

where $S_{xx} = D_{xx} - S_x \circ S_x$, the $i$-th diagonal entry of $D_{xx}$ is the sum of the entries in the $i$-th row (or the $i$-th column, due to symmetry) of the matrix $S_x \circ S_x$, the symbol $\circ$ denotes the element-wise product, and $S_x = [S^x_{ij}]$ is defined as follows:

$$S^x_{ij} = \begin{cases} \exp\left(\dfrac{-\|x_i - x_j\|^2}{t}\right) & x_i \in LN(x_j) \ \text{or} \ x_j \in LN(x_i) \\ 0 & \text{otherwise} \end{cases}$$

where $LN(\cdot)$ denotes the local neighborhood of a sample and $t$ is the heat-kernel parameter.
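As one possible reading of the definition above, the sketch below builds the weight matrix $S_x$ using a k-nearest-neighbor rule for the local neighborhood $LN(\cdot)$ and heat-kernel weights; the neighborhood size k, the width t, and the function name locality_weights are illustrative assumptions rather than the paper's prescription.

```python
import numpy as np

def locality_weights(X, k=5, t=1.0):
    """Locality weight matrix S_x: heat-kernel weights for sample pairs that
    are k-nearest neighbors of each other (in either direction), zero elsewhere."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)   # squared distances
    order = np.argsort(d2 + np.diag(np.full(n, np.inf)), axis=1)[:, :k]  # exclude self
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), order.ravel()] = True
    mask |= mask.T                        # x_i in LN(x_j) or x_j in LN(x_i)
    return np.where(mask, np.exp(-d2 / t), 0.0)
```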

III. BACKGROUND KNOWLEDGE

A. Sparse Representation

Sparse representation (SR) was initially presented as an extension of traditional signal representations such as Fourier and wavelet representations. It has been utilized in various fields such as image super-resolution [30], image classification [31], image denoising [32], and transfer learning [33]. Given a vector (an image in vector form) $x \in \mathbb{R}^m$, and a matrix $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{m \times n}$ containing the elements of an overcomplete dictionary in its columns, SR aims at representing $x$ using as few elements of $X$ as possible, leading to the formulation:

$$\min_{s_i} \|s_i\|_0 \quad \text{s.t.} \quad x = X s_i \ \ \text{or} \ \ \|x - X s_i\|_2 \leq \varepsilon \qquad (10)$$
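Since the $\ell_0$ problem in (10) is NP-hard, it is usually approximated greedily or via an $\ell_1$ relaxation. The snippet below is a minimal sketch that uses orthogonal matching pursuit from scikit-learn as one such approximation; the solver choice and the parameter n_nonzero_coefs are illustrative assumptions, not the procedure used in the paper.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def sparse_code(x, D, n_nonzero_coefs=10):
    """Approximate the l0 problem in Eq. (10): represent the signal x with
    as few columns (atoms) of the dictionary D as possible.

    x: signal of length m; D: m x n dictionary whose columns are atoms.
    Returns the length-n sparse coefficient vector s.
    """
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero_coefs,
                                    fit_intercept=False)
    omp.fit(D, x)          # greedily selects at most n_nonzero_coefs atoms
    return omp.coef_

# Toy usage: a random overcomplete dictionary and a signal built from 3 atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
s_true = np.zeros(256)
s_true[[5, 42, 100]] = [1.0, -0.5, 2.0]
x = D @ s_true
s = sparse_code(x, D, n_nonzero_coefs=3)
print(np.nonzero(s)[0])   # indices of the selected atoms
```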
