Matrix-Based Kernel Subspace Methods

S. Kevin Zhou
Integrated Data Systems Department
Siemens Corporate Research
755 College Road East, Princeton, NJ 08540
Email: {kzhou}@scr.siemens.com

Abstract

It is common practice that a matrix, the de facto image representation, is first converted into a vector before being fed into subspace analysis or a kernel method; however, the conversion ruins the spatial structure of the pixels that defines the image. In this paper, we propose two kernel subspace methods that are directly based on the matrix representation, namely matrix-based kernel principal component analysis (matrix KPCA) and matrix-based kernel linear discriminant analysis (matrix KLDA). We show that, through an extended Gram matrix, the two proposed matrix-based kernel subspace methods generalize their vector-based counterparts and contain richer information. Our experiments also confirm the advantages of the matrix-based kernel subspace methods over the vector-based ones.

Index Terms: Kernel methods, matrix KPCA, matrix KLDA, reproducing kernel Hilbert space (RKHS), face recognition.

I. INTRODUCTION

A. Subspace methods in vector space

Subspace analysis is often used in pattern recognition, signal processing, and computer vision as an efficient tool both for dimension reduction and for finding projection directions with certain properties. The basic framework of subspace analysis is as follows. Suppose we have a d-dimensional random vector x, where the dimensionality d is usually

very large. We attempt to find m basis vectors (m < d) forming a projection matrix U_{d×m}, such that the new representation z defined below satisfies certain properties:

z_{m×1} = U^T_{d×m} x_{d×1}.

Different properties give rise to different kinds of analysis. Two popular methods are principal component analysis (PCA) and linear discriminant analysis (LDA); a short illustrative code sketch follows the list below.

• PCA [1], an unsupervised method, decomposes the available data into uncorrelated directions along which the variation is maximal.
• LDA [2], a supervised method, exploits the class label information and attempts to maximize the between-class scatter while minimizing the within-class scatter.
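As a small, self-contained illustration of this framework (a sketch added for this write-up, not code from the paper; the data and dimensions are arbitrary), the following NumPy snippet estimates a PCA projection matrix U from data and forms the new representation z = U^T x.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 64))        # 500 samples of a d = 64 dimensional vector
    Xc = X - X.mean(axis=0)                   # center the data

    # PCA: eigenvectors of the sample covariance give the projection matrix U (d x m).
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :8]               # keep the top m = 8 directions

    z = U.T @ Xc[0]                           # new representation z = U^T x, size m x 1
    print(z.shape)                            # (8,)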

The regular PCA and LDA are linear methods in the sense that they only utilize first- and second-order statistics. Therefore, their modeling capability is limited when confronted with highly nonlinear data structures. To enhance their modeling capability, kernel PCA (KPCA) [3] and kernel LDA (KLDA) [4], [5] have been proposed in the literature. These kernel methods enhance the modeling capability by nonlinearly mapping the data from the original space to a very high-dimensional feature space, the so-called reproducing kernel Hilbert space (RKHS). The nonlinear mapping enables implicit characterization of high-order statistics. The key idea of kernel methods is to avoid explicit knowledge of the mapping function by evaluating the dot product in the feature space using a kernel function.

B. From vector to matrix

One common practice in the literature is that the reproducing kernel takes vectors as input. However, images are often represented as matrices. Typically, images (or matrices) are simply vectorized before being fed into kernel methods. Such vectorization ruins the spatial structure of the pixels that defines the 2D image. We aim to bring back the matrix structure and propose to perform matrix-based KPCA and KLDA. In the present paper, we refer to them as matrix KPCA and matrix KLDA; their principles are described in Sections III and IV, respectively. Correspondingly, we call the vector-based kernel principal component analysis vector KPCA and the vector-based kernel linear discriminant analysis vector KLDA. We further show in Section V that the matrix KPCA and matrix KLDA generalize the vector KPCA [3] and vector KLDA [4], [5]. In particular, the Gram matrix used in the vector KPCA

and vector KLDA can be derived from that used in the matrix KPCA and matrix KLDA. Therefore, the matrix-based methods provide richer representations than their vector-based counterparts. Our experiments in Section VI also demonstrate the advantages of the matrix-based methods. Another advantage of the matrix-based methods is that they enable us to study the spatial statistics within the matrix. We proceed with a brief review of vectors, matrices, reproducing kernels, and their notations.

II. VECTOR, MATRIX, AND KERNEL

A. Vector and matrix

We denote a scalar by a lowercase letter a, a vector by a boldface lowercase letter a, and a matrix by an uppercase letter A. We use A^T to denote the matrix transpose and I_m to denote the m × m identity matrix.

To save space, we also define two concatenation notations, ⇒ and ⇓, which denote horizontal and vertical concatenation, respectively. For example, we can represent a p × 1 vector a by a_{p×1} = [a_1, a_2, ..., a_p]^T = [⇓_{i=1}^p a_i] and its transpose by a^T = [a_1, a_2, ..., a_p] = [⇒_{i=1}^p a_i].

In addition, we can combine ⇒ and ⇓ to achieve a concise notation. Rather than representing a matrix A_{p×q} as [a_{ij}], we represent it as A_{p×q} = [⇓_{i=1}^p [⇒_{j=1}^q a_{ij}]] = [⇒_{j=1}^q [⇓_{i=1}^p a_{ij}]]. We can also use ⇒ and ⇓ to concatenate matrices to form a block matrix, denoted by a calligraphic uppercase letter such as 𝒜. For instance, given a collection of matrices {A_1, A_2, ..., A_n} of size p × q, we construct a block matrix of size p × nq as 𝒜_{p×nq} = [⇒_{i=1}^n A_i]. Another example used in this paper is as follows: given a collection of matrices {A_{11}, A_{12}, ..., A_{1n}, ..., A_{mn}} of size p × q, we construct a block matrix of size mp × nq as 𝒜_{mp×nq} = [⇓_{i=1}^m [⇒_{j=1}^n A_{ij}]].

Also, we denote the matrix Kronecker (tensor) product by ⊗. The tensor product is defined as A_{p×q} ⊗ B_{m×n} = [⇓_{i=1}^p [⇒_{j=1}^q a_{ij} B]]_{pm×qn}. Using the above notation, a matrix can be written as X_{p×q} = [⇒_{i=1}^q x_i], where the x_i's are the column vectors of the matrix X. The dot product matrix (or Gram matrix) of two matrices X_{p×q} and Y_{p×q}

is given as

X^T_{p×q} Y_{p×q} = [⇓_{i=1}^q [⇒_{j=1}^q x_i^T y_j]].

We also introduce a vectorization operator vec(·) that converts a p × q matrix into a pq × 1 vector by arranging all the elements of the matrix in a fixed order, say lexicographic order. In other words,

vec(X)_{pq×1} = [⇓_{i=1}^p [⇓_{j=1}^q x_{ij}]].
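For readers who prefer code, the following NumPy sketch (illustrative only; the variable names are ours) demonstrates the horizontal and vertical concatenations, the Kronecker product, and the vec(·) operator on a small example; the row-major flattening used here is one concrete choice of the fixed ordering mentioned above.

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])           # a small p x q = 2 x 3 matrix

    horiz = np.hstack([A, A])                 # horizontal concatenation, size 2 x 6
    vert = np.vstack([A, A])                  # vertical concatenation, size 4 x 3
    kron = np.kron(A, np.eye(2))              # Kronecker product of A with the 2 x 2 identity, size 4 x 6

    vec_A = A.reshape(-1, 1)                  # vec(A): row-by-row flattening, size 6 x 1

    print(horiz.shape, vert.shape, kron.shape, vec_A.shape)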

B. Reproducing kernel

At the core of kernel methods lies a kernel function. Let E be a set. A two-variable function k(α, β) on E × E is a reproducing kernel if, for any finite point set {α_1, α_2, ..., α_n} and any corresponding real numbers {c_1, c_2, ..., c_n}, the following condition holds:

∑_{i=1}^n ∑_{j=1}^n c_i c_j k(α_i, α_j) ≥ 0.

The most widely used kernel functions in the literature are defined on a vector space, i.e., E = R^p. For example, two popular kernel functions based on vector inputs are the radial basis function (RBF) kernel and the polynomial kernel. Their definitions are given, for all x, y ∈ R^p, as

k(x, y) = exp{−θ^{-1} ||x − y||²},   k(x, y) = (x^T y + θ)^d.

The kernel function k can be interpreted as a dot product between two vectors in a very high-dimensional space, the so-called reproducing kernel Hilbert space (RKHS) H_k. In other words, there exists a nonlinear mapping function φ : R^p → H_k = R^f, where f > p and f may even be infinite, such that k(x, y) = φ(x)^T φ(y). This is the so-called 'kernel trick'.
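As a concrete illustration (a sketch, not code from the paper), the two kernels above can be evaluated as follows; the parameter values mirror those used later in the paper (θ = 32 for the RBF kernel on image columns, θ = 1 and d = 3 for the polynomial kernel), but any positive θ and integer d ≥ 1 work.

    import numpy as np

    def rbf_kernel(x, y, theta=32.0):
        # k(x, y) = exp{-theta^{-1} ||x - y||^2}
        d = x - y
        return np.exp(-np.dot(d, d) / theta)

    def poly_kernel(x, y, theta=1.0, d=3):
        # k(x, y) = (x^T y + theta)^d
        return (np.dot(x, y) + theta) ** d

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(16), rng.standard_normal(16)   # two vectors in R^16
    print(rbf_kernel(x, y), poly_kernel(x, y))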

C. Kernelized matrix

For a matrix X_{p×q} = [⇒_{i=1}^q x_i], we define its kernelized matrix φ(X) as

φ(X)_{f×q} = [⇒_{i=1}^q φ(x_i)],

which maps each column of the matrix X into the nonlinear feature space. Note that the kernelized matrix is a hypothetical quantity introduced for analysis: there is no need to compute it in practice. The dot product matrix K_{q×q} of two kernelized matrices φ(X) and φ(Y) is given as

K(X, Y)_{q×q} = φ(X)^T φ(Y) = [⇓_{i=1}^q [⇒_{j=1}^q k(x_i, y_j)]].

Similarly, for the block matrix 𝒳_{p×nq} = [⇒_{i=1}^n X_i] and its kernelized matrix φ(𝒳)_{f×nq} = [⇒_{i=1}^n φ(X_i)], we can compute the dot product matrix 𝒦_{nq×nq} of two kernelized block matrices φ(𝒳) and φ(𝒴) as

𝒦(𝒳, 𝒴) = φ(𝒳)^T φ(𝒴) = [⇓_{i=1}^n [⇒_{j=1}^n K(X_i, Y_j)]].

The matrix 𝒦_{nq×nq} is also a block matrix.
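The quantities just defined are directly computable from the kernel function. The sketch below (illustrative, with helper names of our choosing) builds K(X, Y)_{q×q} column by column and assembles the block Gram matrix 𝒦(𝒳, 𝒳)_{nq×nq} for a training set of n matrices.

    import numpy as np

    def rbf(x, y, theta=32.0):
        return np.exp(-np.sum((x - y) ** 2) / theta)

    def gram_matrix(X, Y, kernel):
        # K(X, Y)[i, j] = k(x_i, y_j) for the columns x_i of X and y_j of Y
        q = X.shape[1]
        return np.array([[kernel(X[:, i], Y[:, j]) for j in range(q)]
                         for i in range(q)])

    def block_gram_matrix(mats, kernel):
        # The (i, j)-th q x q block is K(X_i, X_j); the overall size is nq x nq.
        return np.block([[gram_matrix(Xi, Xj, kernel) for Xj in mats]
                         for Xi in mats])

    rng = np.random.default_rng(0)
    mats = [rng.standard_normal((16, 16)) for _ in range(5)]   # n = 5 "images"
    K_block = block_gram_matrix(mats, rbf)
    print(K_block.shape)                                       # (80, 80)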

III. MATRIX KPCA

Given a matrix X_{p×q} and its kernelized version φ(X)_{f×q}, the matrix KPCA attempts to find the projection matrix U_{f×r}, i.e.,

Z_{r×q} = U^T_{f×r} φ(X)_{f×q},    (1)

such that the variation of the output matrix Z_{r×q} is maximized. In the above definition, we note the following:

• We assume that the mean of the data (i.e., of φ(X)) has been removed; otherwise, it contributes only a constant matrix to the output matrix Z in Eq. (1).
• Because each column is lifted from R^p to R^f, we call the above matrix KPCA the column-lifted matrix KPCA. To obtain a row-lifted matrix KPCA, we simply replace X with X^T:

Z_{p×s} = {V^T_{f×s} ψ(X^T)_{f×p}}^T = ψ(X^T)^T V.

• For one projection direction, the column-lifted matrix KPCA outputs a principal component vector whose dimensionality equals the number of columns of the input matrix. In comparison, the vector KPCA [3] outputs a single principal component value per projection direction.

We start by finding the first projection direction u_{f×1} such that the variation of the output vector z_{1×q} = u^T φ(X) is maximized. Mathematically, we have the following optimization problem:

max_{||u||=1} E[||z − E[z]||²] = u^T Σ u,

where Σ_{f×f} = E[{φ(X) − E[φ(X)]}{φ(X) − E[φ(X)]}^T] is the total scatter matrix. It is easy to show that the maximizing u is the leading eigenvector of Σ, i.e., the eigenvector corresponding to the largest eigenvalue.

In practice, we learn the projection vector u from a training set 𝒳 = [⇒_{i=1}^n X_i] that is kernelized to φ(𝒳)_{f×nq} = [⇒_{i=1}^n φ(X_i)]. The mean E[φ(X)] is estimated as

φ̄(X)_{f×q} = (1/n) ∑_{i=1}^n φ(X_i) = φ(𝒳) (1/n)(e ⊗ I_q),

where e_{n×1} = [1, 1, ..., 1]^T is a column vector of ones, and the total scatter matrix Σ is estimated as

Σ̂ = (1/n) ∑_{i=1}^n {φ(X_i) − φ̄(X)}{φ(X_i) − φ̄(X)}^T = φ(𝒳) 𝒥 𝒥^T φ(𝒳)^T,    (2)

where 𝒥_{nq×nq} plays the role of data centering:

𝒥_{nq×nq} = (1/√n) J ⊗ I_q,   J_{n×n} = I_n − (1/n) e e^T.

It remains to show how to compute the eigenvectors of the matrix Σ̂. Usually, the matrix Σ̂ is rank deficient because the dimensionality of the nonlinear feature space is extremely large; in the case of the RBF kernel, f is infinite. We follow the standard method in [6] to calculate the leading eigenvectors of a rank-deficient matrix.

Define the matrix K̄_{nq×nq} as

K̄ = 𝒥^T φ(𝒳)^T φ(𝒳) 𝒥 = 𝒥^T 𝒦(𝒳, 𝒳) 𝒥,

where 𝒦(𝒳, 𝒳) is the block Gram matrix, which is computable in practice. Suppose that (λ, v) is an eigenpair of the matrix K̄, i.e., v_{nq×1} is a unit eigenvector with the corresponding eigenvalue λ, or K̄ v = λ v. Because

φ(𝒳) 𝒥 K̄ v = Σ̂ φ(𝒳) 𝒥 v = λ φ(𝒳) 𝒥 v,

the vector u_{f×1} = φ(𝒳) 𝒥 v is an eigenvector of the matrix Σ̂ with the corresponding eigenvalue λ. To obtain a normalized vector u, we note that

u^T u = v^T 𝒥^T φ(𝒳)^T φ(𝒳) 𝒥 v = v^T K̄ v = λ v^T v = λ.

Therefore,

(λ, u = (1/√λ) φ(𝒳) 𝒥 v)    (3)

is an eigenpair of the matrix Σ̂.

For a given matrix Y_{p×q} = [⇒_{i=1}^q y_i], its principal component vector z_{1×q} = u^T φ(Y) is computed as

z = (1/√λ) v^T 𝒥^T φ(𝒳)^T φ(Y) = (1/√λ) v^T 𝒥^T [⇓_{i=1}^n K(X_i, Y)].    (4)

Similarly, we first obtain the top r eigenpairs {(λ_i, v_i); i = 1, ..., r} of the matrix K̄ and then convert them into eigenpairs of the matrix Σ̂. In matrix form, the eigenvector matrix is given as

U_{f×r} = [⇒_{i=1}^r u_i] = φ(𝒳) 𝒥 V Λ^{-1/2},

where V_{nq×r} = [⇒_{i=1}^r v_i] encodes the top r eigenvectors v_i of the matrix K̄ and Λ is a diagonal matrix whose diagonal elements are {λ_i; i = 1, ..., r}. The matrix collecting all principal component vectors is

Z_{r×q} = [⇓_{i=1}^r z_i] = Λ^{-1/2} V^T 𝒥^T [⇓_{j=1}^n K(X_j, Y)].

The column-lifted matrix KPCA is summarized in Fig. 1. It should be noted that when the kernel function is k(x, y) = x^T y, the matrix KPCA reduces to matrix PCA [7]. However, the 'kernel trick' substantially enhances the modeling capacity of matrix PCA, which is linear in nature.

[Training]
• Compute the centering matrix 𝒥, the block Gram matrix 𝒦(𝒳, 𝒳), and the matrix K̄.
• Find the leading r eigenpairs {(λ_i, v_i); i = 1, 2, ..., r} of the matrix K̄.
• The leading r eigenpairs of the matrix Σ̂ are {(λ_i, (1/√λ_i) φ(𝒳) 𝒥 v_i); i = 1, 2, ..., r}.

[Projection]
• For an arbitrary input matrix Y, find its i-th principal component vector using Eq. (4), that is, z_i = (1/√λ_i) v_i^T 𝒥^T [⇓_{j=1}^n K(X_j, Y)].

Fig. 1. Summary of matrix KPCA.
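To make Fig. 1 concrete, here is a minimal NumPy sketch of the training and projection stages of the column-lifted matrix KPCA, written directly from the formulas above; the function and variable names are ours, and no attempt is made at the efficiency a real implementation would need for large nq.

    import numpy as np

    def rbf(x, y, theta=32.0):
        return np.exp(-np.sum((x - y) ** 2) / theta)

    def gram(X, Y, kernel):
        # q x q matrix of kernel values between the columns of X and Y
        q = X.shape[1]
        return np.array([[kernel(X[:, a], Y[:, b]) for b in range(q)]
                         for a in range(q)])

    def matrix_kpca_fit(mats, kernel, r):
        # Training stage of Fig. 1: leading eigenpairs of Kbar and the centering matrix.
        n, q = len(mats), mats[0].shape[1]
        K = np.block([[gram(Xi, Xj, kernel) for Xj in mats] for Xi in mats])
        J = (np.eye(n) - np.ones((n, n)) / n) / np.sqrt(n)     # (1/sqrt(n)) (I_n - e e^T / n)
        Jq = np.kron(J, np.eye(q))                             # centering matrix, nq x nq
        Kbar = Jq.T @ K @ Jq
        lam, V = np.linalg.eigh(Kbar)                          # ascending order
        return lam[::-1][:r], V[:, ::-1][:, :r], Jq

    def matrix_kpca_project(Y, mats, kernel, lam, V, Jq):
        # Projection stage, Eq. (4): stack K(X_j, Y) vertically and map to the r x q output.
        KY = np.vstack([gram(Xj, Y, kernel) for Xj in mats])   # nq x q
        return (V / np.sqrt(lam)).T @ Jq.T @ KY                # r x q

    rng = np.random.default_rng(0)
    mats = [rng.standard_normal((16, 16)) for _ in range(20)]
    lam, V, Jq = matrix_kpca_fit(mats, rbf, r=5)
    Z = matrix_kpca_project(mats[0], mats, rbf, lam, V, Jq)
    print(Z.shape)                                             # (5, 16)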

IV. MATRIX KLDA

The matrix KLDA attempts to find the projection matrix U_{f×r}, i.e., Z_{r×q} = U^T φ(X), such that the within-class variation of the matrix Z is minimized while the between-class variation of the matrix Z is maximized. Again, this is a column-lifted definition. We only show how to find the first projection direction u_{f×1}; the rest follows analogously. Mathematically, we have the following optimization problem:

max_{||u||=1} (u^T Σ_B u) / (u^T Σ_W u),

where Σ_W is the within-class scatter matrix and Σ_B the between-class scatter matrix. The maximizing solution is the leading eigenvector of the generalized eigenvalue problem

Σ_B u = λ Σ_W u.    (5)

It is easy to show [2] that the total scatter matrix Σ can be written as Σ = Σ_W + Σ_B. Therefore, Eq. (5) is equivalent to

Σ_B u = λ Σ u,    (6)

except for a difference in the eigenvalue.

We assume that the training set is given as 𝒳 = [⇒_{i=1}^n X_i] and that each data point X_i has a class label h(X_i) taking a value in the class label set {1, 2, ..., C}. In Section III, we showed that the matrix Σ can be estimated as Σ̂ = φ(𝒳) 𝒥 𝒥^T φ(𝒳)^T; see Eq. (2). We now derive the estimate of the matrix Σ_B, denoted by Σ̂_B.

Without loss of generality, we further assume that the data points are already ordered by class label. The number of data points belonging to class c is given as

n_c = ∑_{i=1}^n I[h(X_i) = c],

where I[·] is an indicator function. The mean for class c is estimated as

φ̄_c(X) = (∑_{i=1}^n φ(X_i) I[h(X_i) = c]) / (∑_{i=1}^n I[h(X_i) = c]) = φ(𝒳) (1/n_c)(e_c ⊗ I_q),

where e_c is a column vector of size n × 1 given as e_c = [⇓_{i=1}^n I[h(X_i) = c]]. The between-class scatter matrix Σ_B is estimated as

Σ̂_B = ∑_{c=1}^C (n_c/n) {φ̄_c(X) − φ̄(X)}{φ̄_c(X) − φ̄(X)}^T = φ(𝒳) 𝒲 φ(𝒳)^T,

where 𝒲_{nq×nq} = W_{n×n} ⊗ I_q and

W_{n×n} = ∑_{c=1}^C (n_c/n) (e_c/n_c − e/n)(e_c/n_c − e/n)^T.

It is easy to show that W_{n×n} = J_{n×n} B_{n×n} J_{n×n} (up to a constant factor, which only rescales the eigenvalues), where B_{n×n} is the block-diagonal matrix

B_{n×n} = diag( (1/n_1) 1_{n_1×n_1}, ..., (1/n_C) 1_{n_C×n_C} ),

with 1_{m×m} denoting the m × m matrix of all ones. Therefore, 𝒲 = (JBJ) ⊗ I_q = 𝒥_{nq×nq} ℬ_{nq×nq} 𝒥_{nq×nq}, where 𝒥_{nq×nq} = J_{n×n} ⊗ I_q and ℬ_{nq×nq} = B_{n×n} ⊗ I_q. Putting the above derivations together, we have

Σ̂_B = φ(𝒳) 𝒥 ℬ 𝒥^T φ(𝒳)^T.    (7)

Substituting the ensemble estimators of the scatter matrices, i.e., Eqs. (2) and (7), into Eq. (6) yields

φ(𝒳) 𝒥 ℬ 𝒥^T φ(𝒳)^T u = λ φ(𝒳) 𝒥 𝒥^T φ(𝒳)^T u.    (8)

The eigenvector u_{f×1} of Eq. (8) is of the form u = φ(𝒳) 𝒥 v, where v_{nq×1} is the leading eigenvector satisfying

K̄(𝒳, 𝒳) ℬ K̄(𝒳, 𝒳) v = λ K̄(𝒳, 𝒳) K̄(𝒳, 𝒳) v.

To normalize the eigenvector u, we set

u = (1/√ξ) φ(𝒳) 𝒥 v,   ξ = v^T K̄(𝒳, 𝒳) v = v^T 𝒥^T 𝒦(𝒳, 𝒳) 𝒥 v.

For an arbitrary matrix Y_{p×q}, its discriminative vector is computed as

z_{1×q} = u^T φ(Y) = (1/√ξ) v^T 𝒥^T [⇓_{i=1}^n K(X_i, Y)].
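A minimal sketch of the generalized eigenvalue step above, assuming the centered block Gram matrix K̄ has already been formed as in the matrix KPCA sketch. The function names are ours; scipy.linalg.eigh is used for the generalized symmetric problem, and a small ridge term is added to K̄K̄ because it is rank deficient (a regularization choice of ours, not discussed in the paper).

    import numpy as np
    from scipy.linalg import eigh

    def matrix_klda_directions(Kbar, labels, q, r, ridge=1e-8):
        # Leading r solutions v of  Kbar B Kbar v = lambda Kbar Kbar v.
        n = len(labels)
        B = np.zeros((n, n))
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            B[np.ix_(idx, idx)] = 1.0 / len(idx)        # (1/n_c) on the all-ones block of class c
        Bq = np.kron(B, np.eye(q))                      # B kron I_q, size nq x nq
        A = Kbar @ Bq @ Kbar
        M = Kbar @ Kbar + ridge * np.eye(Kbar.shape[0]) # ridge keeps the right-hand side invertible
        lam, V = eigh(A, M)                             # generalized symmetric eigenproblem
        return lam[::-1][:r], V[:, ::-1][:, :r]

    # Toy usage with a random positive definite stand-in for Kbar.
    rng = np.random.default_rng(0)
    n, q, r = 12, 4, 3
    R = rng.standard_normal((n * q, n * q))
    Kbar = R @ R.T
    labels = np.repeat(np.arange(3), n // 3)            # 3 classes, 4 samples each
    lam, V = matrix_klda_directions(Kbar, labels, q, r)
    print(lam.shape, V.shape)                           # (3,) (48, 3)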

V. DISCUSSION

It is insightful to compare the (column-lifted) matrix KPCA with the vector KPCA. This comparison is especially important because current practice usually vectorizes a matrix into a vector. The vector KPCA [3] is based on the Gram matrix K(𝒳, 𝒳)_{n×n} whose ij-th element is k(vec(X_i), vec(X_j)), i.e.,

K(𝒳, 𝒳)_{n×n} = [⇓_{i=1}^n [⇒_{j=1}^n k(vec(X_i), vec(X_j))]],

whereas the block Gram matrix 𝒦(𝒳, 𝒳)_{nq×nq} = [⇓_{i=1}^n [⇒_{j=1}^n K(X_i, X_j)]] is used in the matrix KPCA.

First, the two Gram matrices have different sizes: the block Gram matrix 𝒦 is much bigger than the Gram matrix K. As a consequence, the vector KPCA yields at most (n − 1) eigenpairs, whereas the matrix KPCA yields at most (n − 1)q eigenpairs. Also, the vector KPCA outputs a vector of maximum size (n − 1) × 1, while the matrix KPCA outputs a matrix of maximum size (n − 1)q × q. Table I summarizes these comparisons.

TABLE I
COMPARISON BETWEEN VECTOR KPCA AND MATRIX KPCA.

                                          Vector KPCA           Column-lifted matrix KPCA
Input                                     pq × 1 vector         p × q matrix
Number of training samples                n                     n
Gram matrix size                          n × n                 nq × nq
Maximum number of principal components    (n − 1)               (n − 1)q
Output of maximum size                    (n − 1) × 1 vector    (n − 1)q × q matrix

Second, the two Gram matrices are closely related. We compare the ij-th element of K(𝒳, 𝒳), namely k(vec(X_i), vec(X_j)), with the ij-th block of 𝒦(𝒳, 𝒳), namely K(X_i, X_j). For convenience, we let X_i = A = [⇒_{i=1}^q a_i] and X_j = B = [⇒_{j=1}^q b_j]. Using the RBF kernel k(a, b) = exp{−θ^{-1} ||a − b||²}, the quantities k(vec(A), vec(B)) (or k(A, B) in short) and K(A, B)_{q×q} are computed as follows:

k(A, B) = ∏_{i=1}^q k(a_i, b_i),   K(A, B) = [⇓_{i=1}^q [⇒_{j=1}^q k(a_i, b_j)]].
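A quick numerical check of this relation (an illustrative snippet, not from the paper): with the RBF kernel, the vector-kernel value k(vec(A), vec(B)) equals the product of the diagonal entries of K(A, B).

    import numpy as np

    theta, q = 32.0, 8
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((16, q)), rng.standard_normal((16, q))

    rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2) / theta)

    k_vec = rbf(A.reshape(-1), B.reshape(-1))           # k(vec(A), vec(B))
    K_AB = np.array([[rbf(A[:, i], B[:, j]) for j in range(q)]
                     for i in range(q)])                # K(A, B), size q x q

    print(np.isclose(k_vec, np.prod(np.diag(K_AB))))    # True: product of the diagonal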

The scalar k(vec(A), vec(B)) is just the product of the diagonal elements of the matrix K(A, B)_{q×q}. Hence, the Gram matrix K(𝒳, 𝒳)_{n×n} can be completely derived from the block Gram matrix 𝒦(𝒳, 𝒳)_{nq×nq}, and this relationship between the two Gram matrices holds in general. Because the mapping from 𝒦(𝒳, 𝒳) to K(𝒳, 𝒳) is many-to-one, the block matrix 𝒦(𝒳, 𝒳) contains more information than the matrix K(𝒳, 𝒳). Thus, the matrix KPCA provides a richer representation than the vector KPCA.

More importantly, K(A, B) can be viewed as encoding column-wise statistics, because its ij-th element k(a_i, b_j) compares two columns; spatial statistics of higher order between columns are thus somewhat captured. However, this property is typically lost in k(vec(A), vec(B)). For example, with the RBF kernel, there is no comparison between the two columns a_i and b_j when i ≠ j.

Another advantage of the matrix KPCA is that we can interchange the roles of rows and columns. This can be done by simply replacing a matrix X with its transpose X^T. Alternatively, we can perform the following recursive procedure:

X_{p×q} ↦_{(1)} Y_{r×q} = U^T_{f×r} φ(X)_{f×q} ↦_{(2)} Z_{r×s} = {V^T_{f×s} ψ(Y^T)_{f×r}}^T = ψ(φ(X)^T U)^T V,    (9)

where '↦_{(1)}' denotes the column-lifted matrix KPCA and '↦_{(2)}' denotes the row-lifted matrix KPCA. Through the above procedure, which is similar to bilinear analysis [8], we essentially perform the matrix KPCA first along the columns and then along the rows, capturing both column-wise and row-wise statistics of higher order. Similar observations can be made when comparing the matrix KLDA with the vector KLDA.

VI. EXPERIMENTAL RESULTS

In this section, we show two applications of the kernel methods developed above.

A. Visualization

We applied the matrix KPCA to 200 images of the digit 'three'. Figure 2(a) shows ten example images of size 16 × 16. The digit 'three' manifests a spatially nonlinear pattern; to capture it, methods able to handle both nonlinearity and spatial statistics are desired. The column-lifted matrix KPCA is first applied along the column direction using the kernel function k(x, y) = exp{−32^{-1} ||x − y||²}. We kept only the top 16 principal components, which are displayed in Figure 2(b). Note that each output matrix is again of size 16 × 16. Next, we further

applied the row-lifted matrix KPCA, with a different kernel function k(x, y) = exp{−||x − y||²}, along the row direction (of the outputs in Figure 2(b)). Again, we kept only the top 16 principal components; the outputs are shown in Figure 2(c). It is interesting to observe that a distinctive white line characterizes all of these images.

We adopted locally linear embedding (LLE) [9] to visualize the data in the different representations. Fig. 2(d) visualizes the original images, where a triangular structure appears. Fig. 2(e) displays the vector KPCA representation, obtained using the kernel function k(x, y) = exp{−128^{-1} ||x − y||²}. Note that the structure in Fig. 2(e) is only slightly tighter than that in Fig. 2(d), meaning that the vector KPCA does not change the distribution of the data much. Fig. 2(f) visualizes the data after the column-lifted matrix KPCA with the kernel function k(x, y) = exp{−||x − y||²} is applied along the column direction; the data become strongly clustered. The row-lifted matrix KPCA with the same kernel function is further applied along the row direction, and its output is visualized in Fig. 2(g). The data become even more clustered once high-order row and column statistics are somewhat captured. However, these statistics are not fully captured, because the data still form a tight nonlinear manifold. It should be noted that changing the kernel function does not affect the structure presented in the figure.

B. Face recognition

One main application area of subspace analysis is face recognition [6], [10], [7], [11]. We also tested the matrix KPCA and matrix KLDA on this task. In particular, we focused on variations in pose and illumination.

The AT&T database has 40 subjects with 10 images per subject. Figure 3(a) shows one subject at 10 poses. While pose variation ruins the pixel correspondence required by vector-based methods, matrix-based methods somewhat preserve such correspondence. We randomly divided the 10 images of each subject into two sets, M images for training and N images for testing (M + N = 10); such a random division was repeated for every subject. Table II(a) shows the average recognition rate over 10 simulations. Here 'Matrix (q = n)' means the following: when q = 2, for example, the image is converted into a matrix with 2 columns, the first column for the left part of the image and the second column for the right part. Hence, in this case we can characterize the statistical dependence between the left and right parts of the face image.

We repeated the same experiment for the PIE database [12], which consists of 68 subjects.

Fig. 2. (a) Ten images of digit 'three'. (b) The matrix KPCA along the column direction. (c) The matrix KPCA along the column and row directions. LLE visualization: (d) The original data. (e) The vector KPCA. (f) The matrix KPCA along the column direction. (g) The matrix KPCA along the column and row directions.

Fig. 3. (a) Images at different poses in the AT&T database. (b) Images under different illumination conditions in the PIE database. All images are downsampled to 12 × 12.

Here we focused only on the illumination variation and selected 12 lighting conditions. Figure 3(b) shows one subject under these 12 illumination conditions. As PCA is known to have difficulty handling illumination variation, we only considered LDA [13]. For comparison, we implemented the illumination subspace approach [14], which is physical-model-based and hence effective for modeling illumination variation; in [14], a separate linear subspace is learned for each subject. Table II(b) presents the recognition rates for the PIE database.

From Table II, we make the following observations:
1) The matrix-based analysis almost always outperforms the vector-based analysis, highlighting the virtue of the matrix KPCA and matrix KLDA.
2) The matrix KLDA consistently outperforms the matrix KPCA because the former is a supervised approach.
3) Using more samples in the training set increases the recognition performance.
4) Kernel methods almost always outperform their non-kernel counterparts.
5) When dealing with illumination variation, the learning-based matrix KLDA even outperforms the physical-model-based illumination subspace approach, especially in small-sample cases.

TABLE II
FACE RECOGNITION RATES FOR (A) THE AT&T DATABASE AND (B) THE PIE DATABASE.

(a) AT&T database, polynomial kernel (θ = 1, d = 3)
                        M = 2, N = 8           M = 3, N = 7           M = 4, N = 6           M = 5, N = 5
                        KPCA       KLDA        KPCA       KLDA        KPCA       KLDA        KPCA       KLDA
Matrix (q = 4)          73.4±2.6   73.6±2.5    75.7±2.0   82.1±3.0    78.1±1.4   87.4±3.0    80.6±2.6   89.8±3.4
Matrix (q = 2)          71.2±3.1   76.0±2.5    72.4±1.9   83.9±3.7    75.9±2.4   88.1±2.6    78.3±2.9   92.0±2.9
Vector (q = 1)          65.9±3.1   76.0±3.0    68.5±2.5   82.7±2.8    75.3±2.4   87.5±3.2    78.3±1.8   90.5±2.4

(b) PIE database, polynomial kernel (θ = 1, d = 3)
                        M = 2, N = 10   M = 4, N = 8   M = 6, N = 6   M = 8, N = 4   M = 10, N = 2
                        KLDA            KLDA           KLDA           KLDA           KLDA
Matrix (q = 2)          55.4±2.1        82.6±2.4       93.0±1.5       96.5±1.5       98.2±1.1
Vector (q = 1)          64.3±3.7        81.3±2.5       81.2±2.6       80.2±2.2       71.0±4.3
Non-kernel (q = 1)      49.2±3.7        66.1±3.5       74.8±1.9       80.0±3.3       83.6±2.7
Illumination Subspace   53.1±2.5        85.0±2.2       92.3±1.5       96.1±1.3       98.5±1.1

VII. CONCLUSION

We presented two matrix-based kernel subspace methods: matrix KPCA and matrix KLDA. The proposed matrix-based kernel subspace methods generalize their vector-based counterparts, i.e., vector KPCA and vector KLDA, in the sense that the matrix KPCA and matrix KLDA provide richer representations and capture spatial statistics to some extent. This comes from the fact that the Gram matrix used in the vector-based kernel methods can be derived from that used in the proposed matrix-based kernel methods. The experiments also demonstrated the effectiveness of the matrix KPCA and matrix KLDA.

ACKNOWLEDGEMENT

The author thanks Prof. Rama Chellappa at the University of Maryland, College Park for his helpful discussions and encouraging remarks.

REFERENCES

[1] I. T. Jolliffe, Principal Component Analysis. New York: Springer-Verlag, 2002.
[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley-Interscience, 2001.
[3] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299–1319, 1998.
[4] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, pp. 2385–2404, 2000.
[5] S. Mika et al., "Fisher discriminant analysis with kernels," Neural Networks for Signal Processing IX, pp. 41–48, 1999.
[6] M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, pp. 72–86, 1991.
[7] J. Yang et al., "Two-dimensional PCA: A new approach to appearance-based face representation and recognition," IEEE Trans. PAMI, vol. 26, no. 1, 2004.
[8] W. T. Freeman and J. B. Tenenbaum, "Learning bilinear models for two-factor problems in vision," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1997.
[9] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[10] M.-H. Yang, "Kernel eigenfaces vs. kernel fisherfaces: Face recognition using kernel methods," Intl. Conf. on Automatic Face and Gesture Recognition, 2002.
[11] W. Zhao, R. Chellappa, A. Rosenfeld, and P. Phillips, "Face recognition: A literature survey," ACM Comp. Surveys, 2003.
[12] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," Proc. Intl. Conf. on Face and Gesture Recognition, 2002.
[13] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection," IEEE Trans. PAMI, vol. 19, 1997.
[14] R. Basri and D. Jacobs, "Lambertian reflectance and linear subspaces," IEEE Trans. PAMI, vol. 25, pp. 218–233, 2003.