LEARNING THE SPARSE REPRESENTATION FOR CLASSIFICATION

Jianchao Yang, Jiangping Wang, and Thomas Huang
Beckman Institute, University of Illinois at Urbana-Champaign, USA
{jyang29, jwang63, huang}@ifp.illinois.edu

ABSTRACT

In this work, we propose a novel supervised matrix factorization method used directly as a multi-class classifier. The coefficient matrix of the factorization is enforced to be sparse by ℓ1-norm regularization. The basis matrix is composed of atom dictionaries from different classes, which are trained jointly in a supervised manner by penalizing inhomogeneous representations of the labeled data samples. The learned basis matrix models the data of interest as a union of discriminative linear subspaces via sparse projection. The proposed model is based on the observation that many high-dimensional natural signals lie in much lower dimensional subspaces or unions of subspaces. Experiments on several datasets show the effectiveness of such a representation model for classification, which also suggests that a tight reconstructive representation model can be very useful for discriminant analysis.

Index Terms— Sparse representation, sparse coding, matrix factorization, dictionary training, face recognition, digit recognition.

1. INTRODUCTION

In machine learning and computer vision, data analysis often seeks a data representation suited for subsequent clustering or classification tasks. Principal Component Analysis (PCA) [1] aims to find a low-rank orthonormal basis that best reconstructs the original data in the ℓ2-norm sense. As a data processing step, PCA is widely applied in many classification tasks, e.g., Eigenfaces [2]. Motivated by the psychological and physiological evidence for parts-based representations of mental objects in the human brain [3], nonnegative matrix factorization (NMF), which captures the notion of non-subtractive parts, was proposed. The nonnegativity property of NMF and its extensions turns out to be useful for learning meaningful parts-based representations in various applications, such as face recognition, text mining, and collaborative filtering.

Reconstructive representations, e.g., PCA and NMF, can achieve compact representations that are most effective for signal processing tasks such as compression and denoising, but not necessarily for classification. For classification purposes, such reconstructive data representations have to work with subsequent discriminant models.
However, such a two-step approach does not ensure the optimality of the chosen discriminative model. On the other hand, discriminative representations, such as linear discriminant analysis (LDA) [4], are usually better suited for classification tasks. The difference between reconstruction and discrimination has been widely investigated in the pattern recognition literature. While both kinds of methods have been broadly applied to classification, discriminative methods have often outperformed reconstructive ones [5][6]. Many previous works on subspace learning [7] and metric learning [8] seek the partial data representations that are best for recognition. However, discriminative models are usually sensitive to noise, missing data, and outliers [9], and are prone to overfitting. Furthermore, these discriminative models almost surely reduce the entropy of the original data, and thus the derived representations can only increase the Bayes error.

Recently, there has been growing interest in sparse modeling of natural high-dimensional signals. Sparse representations account for all or most of the information of a signal with a linear combination of a small number of elementary signals called atoms, which are usually chosen from an over-complete dictionary. The over-complete dictionary allows more flexibility and efficiency in the representation. Unlike conventional orthonormal bases, such as wavelets and PCA, the sparse representation of a signal over an over-complete dictionary is found by optimizing an objective function with two terms: one measuring the reconstruction error and the other measuring the sparsity of the representation. The sparsity is usually measured by the ℓ0-norm, which counts the number of nonzero coefficients. However, ℓ0-norm minimization is NP-hard, and much recent progress has centered around ℓ1/ℓ0 equivalence in high-dimensional spaces [10]. In the past few years, sparse representation by ℓ1 minimization has been applied to many tasks, including image denoising [11], image super-resolution [12], graph learning [13], face recognition [14], and image classification [15][16]. The success of sparse representation in these areas is largely owed to the fact that natural high-dimensional signals are sparse signals lying in a much lower dimensional space.
Based on the observation that many real data belonging to the same class, e.g., faces or digits, actually live in the same subspace of a much lower dimension, a new test sample can be well represented by the training samples of the same class, which naturally leads to a sparse representation over the whole training set. Such a sparse representation has been successfully applied to face recognition by Wright et al. [14], which suggests that a proper reconstructive model can imply great discriminative power for classification. In this work, we directly seek such a reconstructive model based on the sparsity assumption in a supervised manner, where the data is modeled as a union of low-dimensional linear subspaces, and each class is modeled as a union of a small subset of these linear subspaces. Accordingly, the classification model is simply based on the sparse projection errors of the test sample onto the nearest low-dimensional linear subspaces. Different from previous supervised sparse models [17], [18] and [19], our model is purely reconstructive and explores the possibility of classification based solely on the data representation.

The remainder of this paper is organized as follows. Section 2 discusses related work, including matrix factorization and sparse coding. Section 3 presents our supervised dictionary training, or regularized matrix factorization, algorithm. Section 4 presents experimental results on various datasets. Finally, Section 5 concludes the paper with discussion.
2. RELATED WORK

2.1. Matrix Factorization and Sparse Coding

For matrix factorization, we aim to find a basis matrix D ∈ R^{d×K} and a coefficient matrix Z ∈ R^{K×N} that represent the original data matrix X ∈ R^{d×N} well, subject to constraints on the basis matrix D, the coefficient matrix Z, or both, in order to obtain desired properties for the representation. For example, Nonnegative Matrix Factorization (NMF) enforces nonnegativity on both D and Z, and the optimization can be solved iteratively in a multiplicative way,

    arg min_{D,Z} ∥X − DZ∥_F    s.t. D ≽ 0, Z ≽ 0.                    (1)

Principal Component Analysis (PCA), on the other hand, finds a low-rank approximation of the original data matrix in an orthonormal system, while Independent Component Analysis (ICA) finds a linear representation of non-Gaussian data whose components are statistically independent. Many of these matrix factorization algorithms assume some compactness property on the basis matrix D (i.e., K < d), implying dimensionality reduction in the representation. For many real signals that are sparse, we can go in the opposite direction: enlarge the basis matrix to be over-complete and enforce sparsity on the representation in order to better model the data,

    arg min_{D,Z} ∥X − DZ∥_F    s.t. ∥z_i∥_0 ≤ k,                     (2)

where z_i denotes the i-th column of Z, and ∥·∥_0 counts the number of nonzero elements. Eqn. (2) aims to learn an over-complete dictionary D such that the representations of X in terms of D are k-sparse. Many algorithms can be applied to obtain approximate solutions for this problem, e.g., K-SVD [20]. Such an over-complete dictionary has led to state-of-the-art performance on many image processing tasks, including denoising and inpainting [11]. An alternative formulation of Eqn. (2) replaces the ℓ0-norm with the ℓ1-norm and uses it as a regularization to promote sparsity,

    arg min_{D,Z} ∥X − DZ∥_F + λ∥Z∥_{ℓ1}    s.t. ∥D_i∥_2 ≤ 1,          (3)

which is usually referred to as sparse coding [21][22]. Different from conventional matrix factorization techniques, sparse coding is nonlinear and models the data as a union of low-dimensional subspaces that are adapted to the data space.

2.2. Classification based on Sparse Representation

Casting the recognition problem as one of finding a sparse representation of the test image in terms of the training set as a whole, up to some sparse errors due to occlusion, Wright et al. [14] showed that such a simple sparse representation based approach is robust to partial occlusions and achieves promising recognition accuracy on public face datasets. Formally, given a set of training samples for the c-th class D_c = [d_{c,1}, d_{c,2}, ···], a test sample x from class c can be well represented by D_c with coefficients z_c. As the label of x is unknown, it is assumed that z_c can be recovered from the sparse representation of x in terms of the dictionary constructed from the training samples of all I classes by

    min_z ∥z∥_1    s.t. ∥Dz − x∥_2^2 ≤ ϵ,                            (4)

where D = [D_1, D_2, ···, D_I] and z = [z_1^T, z_2^T, ···, z_I^T]^T. The class label is then determined by the minimum residual error after projecting the test sample onto each class:

    ĉ = arg min_c ∥D δ_c(z) − x∥_2^2 = arg min_c ∥D_c z_c − x∥_2^2,    (5)

where δ_c(·) is a vector indicator function that keeps the elements corresponding to the c-th class and sets all others to zero. In [14], Wright et al. further extended the basic formulation of (4) by appending a trivial basis (identity matrix) to D to capture sparse errors.
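As a concrete illustration of Eqns. (4)–(5), the sketch below codes a test sample over the concatenated class dictionaries and assigns the label with the smallest class-wise reconstruction residual. It is a minimal numpy sketch, not the authors' code: the constrained problem in Eqn. (4) is replaced by its unconstrained Lagrangian (lasso) form and solved with a basic ISTA loop, the trivial basis for occlusion handling is omitted, and the function names and the value of λ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # elementwise shrinkage operator used by ISTA
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(D, x, lam, n_iter=300):
    """Solve min_z ||x - D z||_2^2 + lam * ||z||_1 with ISTA
    (an unconstrained stand-in for the constrained problem in Eqn. (4))."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the smooth term
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - 2.0 * D.T @ (D @ z - x) / L, lam / L)
    return z

def src_classify(x, class_dicts, lam=0.1):
    """Eqns. (4)-(5): sparse-code x over D = [D_1, ..., D_I], then pick the class
    whose own coefficients give the smallest reconstruction residual."""
    D = np.hstack(class_dicts)
    z = ista_lasso(D, x, lam)
    residuals, start = [], 0
    for Dc in class_dicts:                        # delta_c(z): keep only class-c coefficients
        zc = z[start:start + Dc.shape[1]]
        residuals.append(np.linalg.norm(Dc @ zc - x) ** 2)
        start += Dc.shape[1]
    return int(np.argmin(residuals))
```

For example, with class_dicts = [D_1, D_2, D_3] holding ℓ2-normalized training columns of three classes, src_classify(x, class_dicts) returns the index of the class with the lowest residual in Eqn. (5).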
In the general case, if high-dimensional data samples belonging to the same class live in the same union of a small set of low-dimensional subspaces, then a new test sample can be represented by only a few basis atoms from the same class, i.e., the new test sample has a sparse representation in terms of all the basis atoms, which can be recovered by ℓ1-norm minimization. In [14], these basis atoms are the raw training samples themselves, which is not scalable to large-scale applications with tens of thousands of training samples. How to find a compact basis matrix that can well model and differentiate different classes of high-dimensional sparse data is the focus of this work.

3. SUPERVISED SPARSE MATRIX FACTORIZATION

In this work, we are interested in classification of high-dimensional data, where we assume each class can be well approximated by a union of low-dimensional linear subspaces. We require our basis matrix to be over-complete so that the unions of subspaces from different classes do not overlap. A trivial way of finding such a basis matrix is to learn an atom dictionary for each class independently, which, however, is suboptimal for classification. Instead, we propose to train these atom dictionaries jointly such that a new test sample is sparsely projected only onto the linear subspace of its own class.

3.1. Formulation
Suppose we are given N training samples {x_i, y_i}_{i=1}^N falling into I classes, where x_i ∈ R^{d×1} is the i-th training sample and y_i ∈ {1, 2, ..., I} is its label. We wish to find a composite basis matrix D = [D_1, D_2, ..., D_I], where D_c ∈ R^{d×K} is the atom dictionary that can (sparsely) represent the c-th class well but not the others. We also require the coefficient matrix Z to be sparse column-wise, i.e., each data sample x_i has a sparse representation with respect to the basis matrix D. Suppose the signal of interest is k-sparse, i.e., it has a sparse representation in some domain with at most k nonzero coefficients. We aim to learn the basis matrix D and representation matrix Z such that

    arg min_{D,Z} ∥X − DZ∥_F    s.t. ∥z_i∥_0 < k, ∥d_j∥_2 ≤ 1,         (6)

where z_i is the i-th column of Z, and d_j is the j-th column of the composite basis matrix D. Since we are interested in classification, we hope that the solution Z of Eqn. (6) has the following property:

    P_{y_i} z_i = 0,                                                  (7)

where P_{y_i} selects the inhomogeneous representation coefficients of x_i, i.e., the coefficients corresponding to atom dictionaries other than D_{y_i}. Eqn. (7) means that the sparse representation of x_i in terms of D concentrates only on the atom dictionary D_{y_i}. The ideal basis matrix D should be composed of dictionaries where each D_c can sparsely represent data samples from the c-th class but not from the others. Therefore, if we find the sparse representation of a given test sample x_t in terms of the trained dictionary D, we can identify the label of x_t based on which atom dictionary accounts for the nonzero coefficients.

3.2. Approximate Optimization

Finding a basis matrix D satisfying Eqn. (6) and Eqn. (7) is no easy task. In this paper, we propose a simple and efficient approximate optimization that integrates Eqn. (6) with Eqn. (7). We replace the ℓ0-norm by an ℓ1-norm regularization to enforce sparsity of the representation, and add an ℓ2-norm penalty on P_{y_i} z_i,

    arg min_{D,Z} Σ_{i=1}^{N} ( ∥x_i − D z_i∥_2^2 + λ∥z_i∥_1 + γ∥P_{y_i} z_i∥_2^2 )    s.t. ∥d_j∥_2 ≤ 1.    (8)

Eqn. (8) demands that the learned dictionary sparsely represent each data sample while requiring its inhomogeneous representation coefficients to be small in the ℓ2-norm sense. Expanding the ℓ2 reconstruction error term, the problem is equivalent to

    arg min_{D,Z} Σ_{i=1}^{N} ( z_i^T A z_i + b^T z_i + λ∥z_i∥_1 )    s.t. ∥d_j∥_2 ≤ 1,    (9)

where

    A = D^T D + γ P_{y_i}^T P_{y_i},                                  (10)
    b = −2 D^T x_i,                                                   (11)

which is readily solved by many existing sparse coding packages by alternately optimizing over the sparse representations Z and the dictionary D [21].
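To make the alternating scheme concrete, here is a minimal numpy sketch of one way to optimize Eqn. (8): with D fixed, each column z_i is obtained by solving the quadratic-plus-ℓ1 problem of Eqns. (9)–(11) with ISTA; with Z fixed, D is updated by projected gradient descent onto the unit-norm ball. The paper defers to existing sparse coding packages [21] for these subproblems, so the solvers, iteration counts, and parameter values below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_step(D, x, inhomo_mask, lam, gamma, n_iter=200):
    """Eqn. (9): min_z z'Az + b'z + lam*||z||_1, with A = D'D + gamma*P'P and
    b = -2 D'x; P selects the coefficients outside the sample's own class
    (inhomo_mask is its 0/1 diagonal)."""
    A = D.T @ D + gamma * np.diag(inhomo_mask.astype(float))
    b = -2.0 * D.T @ x
    L = 2.0 * np.linalg.norm(A, 2) + 1e-12        # Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - (2.0 * A @ z + b) / L, lam / L)
    return z

def dictionary_step(D, X, Z, n_iter=20):
    """With Z fixed, a projected-gradient update of D for min ||X - D Z||_F^2
    s.t. ||d_j||_2 <= 1 (a simple stand-in for the solver in [21])."""
    L = 2.0 * np.linalg.norm(Z @ Z.T, 2) + 1e-12
    for _ in range(n_iter):
        D = D - 2.0 * (D @ Z - X) @ Z.T / L
        norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
        D = D / norms                              # project each column onto the unit ball
    return D

def train_supervised_dictionary(X, y, n_classes, atoms_per_class,
                                lam=0.1, gamma=1.0, n_outer=10, seed=0):
    """Alternating optimization of Eqn. (8), starting from a Gaussian random D.
    X is d x N; y holds 0-indexed class labels."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    D = rng.standard_normal((d, n_classes * atoms_per_class))
    D /= np.linalg.norm(D, axis=0)
    atom_class = np.repeat(np.arange(n_classes), atoms_per_class)
    for _ in range(n_outer):
        Z = np.column_stack([
            sparse_code_step(D, X[:, i], atom_class != y[i], lam, gamma)
            for i in range(N)])
        D = dictionary_step(D, X, Z)
    return D, atom_class
```

A reasonable refinement is to warm-start each z_i from its value in the previous outer iteration, which typically speeds up convergence; the sketch simply restarts from zero for clarity.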
3.3. Classification by Regularized Sparse Projection

Given the learned composite dictionary D = [D_1, D_2, ..., D_I] and a new test sample x_t, we want to find its most efficient sparse projection in order to identify its label. To be consistent with the training phase, we need to add an ℓ2 regularization term on the inhomogeneous representation coefficients, which, however, are unknown since the label is unknown. The training process in Eqn. (8) suggests that x_t should have the lowest sparse projection error on the corresponding atom dictionary when the correct label is identified. Therefore, we first find the sparse representation of x_t for each hypothesized class c,

    z_t^c = arg min_z ∥x_t − D z∥_2^2 + λ∥z∥_1 + γ∥P_c z∥_2^2,         (12)

and then assign the class label by finding the lowest sparse projection residual error,

    y_t = arg min_c ∥D δ_c(z_t^c) − x_t∥_2^2.                          (13)
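One convenient way to solve Eqn. (12), sketched below under assumed function names and parameter values, is to note that γ∥P_c z∥_2^2 is the squared residual of an extra least-squares block: stacking √γ P_c under D and zeros under x_t turns Eqn. (12) into a standard lasso, after which the residual rule of Eqn. (13) picks the label.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(D, x, lam, n_iter=300):
    """min_z ||x - D z||_2^2 + lam * ||z||_1 via ISTA."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - 2.0 * D.T @ (D @ z - x) / L, lam / L)
    return z

def classify_regularized_projection(x_t, D, atom_class, n_classes, lam=0.1, gamma=1.0):
    """Eqns. (12)-(13): for each hypothesized class c, solve the gamma-regularized
    sparse projection by rewriting gamma*||P_c z||^2 as an extra least-squares block
    (stack sqrt(gamma)*P_c under D and zeros under x_t), then keep the class with
    the smallest residual ||D delta_c(z) - x_t||^2."""
    residuals = []
    for c in range(n_classes):
        P_c = np.diag((atom_class != c).astype(float))    # selects inhomogeneous coefficients
        D_aug = np.vstack([D, np.sqrt(gamma) * P_c])
        x_aug = np.concatenate([x_t, np.zeros(D.shape[1])])
        z = lasso_ista(D_aug, x_aug, lam)
        z_c = np.where(atom_class == c, z, 0.0)           # delta_c(z)
        residuals.append(np.linalg.norm(D @ z_c - x_t) ** 2)
    return int(np.argmin(residuals))
```

Since the per-class problems share the same D, the Gram matrices of the augmented designs differ only on the diagonal, which a more careful implementation could exploit; the sketch simply re-solves each class from scratch.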
In practice, we find that simply using the sparse representation based classification (SRC) algorithm discussed in Section 2.2, i.e., neglecting the ℓ2 regularization terms, performs on par with the above classification method. Therefore, we use the SRC algorithm for classification in all our experiments, which is fast due to the compactness of our learned dictionary.

4. EXPERIMENTAL ANALYSIS

In this section, we present our experimental analysis on three datasets: Brodatz textures, Extended Yale-B [23], and MNIST [24]. For texture analysis, we are interested in learning dictionaries for visual textures with different statistics, and we show that supervised training substantially improves classification. For face recognition and digit recognition, we show how our supervised learning process helps learn discriminative representation dictionaries for each class. The learned composite dictionary is compact and thus scalable to large datasets in real applications. For experimental evaluation, we mainly compare with two algorithms:

1. "Random": the atom dictionary for each class is generated by randomly sampling the raw data samples from that class (a minimal sketch of this baseline follows below). If we use all the data, the algorithm is essentially the SRC algorithm in [14].

2. "Unsupervised": the atom dictionary for each class is trained independently using sparse coding.

For all the experiments, we initialize our dictionary with a Gaussian random matrix. As we will see, our algorithm correctly finds and groups the atom dictionary for each class.
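For reference, here is a minimal sketch of how the "Random" baseline dictionary for one class might be built, assuming the class's training samples are stacked as columns of X_c; the column count K, the seed, and the ℓ2 normalization (mirroring the unit-norm constraint on the learned dictionaries) are assumptions for illustration. The "Unsupervised" baseline would instead train each D_c independently with standard sparse coding, as in Eqn. (3).

```python
import numpy as np

def random_atom_dictionary(X_c, K, seed=0):
    """'Random' baseline: atom dictionary for one class, built by sampling K raw
    training columns of that class and normalizing them to unit l2 norm."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X_c.shape[1], size=K, replace=False)
    D_c = X_c[:, idx].astype(float)
    return D_c / np.maximum(np.linalg.norm(D_c, axis=0), 1e-12)
```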
Fig. 1. Top row: three textures selected from the Brodatz dataset for analysis; bottom row: trained dictionaries for each texture.

Table 1. Texture classification accuracy (%) using random, unsupervised, and supervised atom dictionaries, for atom dictionary sizes K = 500 and K = 1000.

Algorithm       K = 500    K = 1000
Random           88.25       90.49
Unsupervised     91.74       92.86
Supervised       95.00       95.90
4.1. Texture Analysis

We select three textures from the Brodatz texture dataset for training the composite basis matrix. The top row of Figure 1 shows the three example texture images, which are quite different in appearance. From each texture image, we randomly sample 10,000 image patches of size 15 × 15, half of which are used for training and the rest for testing. We use two atom dictionary sizes, K = 500 and K = 1000; therefore, we have 15,000 image patches for training a basis matrix of 1,500 atoms and one of 3,000 atoms. With the Gaussian random matrix as initialization, our supervised optimization process gradually learns the three atom dictionaries for the three textures. As shown in the bottom row of Figure 1, the learned atom dictionaries nicely capture the statistics of each texture. For example, texture 1 is composed of large brick structures and fine textures on the brick surfaces. The learned atom dictionary captures the large structures with translated vertical and horizontal edges and the fine textures with DCT-like basis atoms. To evaluate the quality of the trained composite basis matrix, we compare with "Random" and "Unsupervised". Table 1 lists the recognition accuracies for each method; our supervised training outperforms the other two significantly for both K = 500 and K = 1000.

4.2. Face Recognition
For face recognition, we use the Extended Yale Face Database B for evaluation. The database consists of 38 individuals, each with 64 near-frontal face images. The cropped face images were captured under various laboratory-controlled lighting conditions and are simply normalized to 32 × 32 in all our experiments. We randomly sample 50 face images from each individual for training, and the rest are used for testing. For training the composite basis matrix, we try three atom dictionary sizes, K = 10, 20, 30, for both supervised and unsupervised training. Table 2 lists the classification results on this dataset. Our supervised training outperforms both random sampling and unsupervised training notably. For random sampling, the best result is achieved by using all 50 training images for each individual. Our supervised scheme with K = 10 already matches the performance of using all the raw images directly. The accuracy improves to 98.44% for K = 20, as the more flexible model captures more face image variations, but drops back to 98.05% for K = 30, probably due to overfitting on the small training set. It should also be noted that our supervised training outperforms unsupervised training significantly when K = 10, because in this constrained setting supervised training better identifies the discriminant subspace for each class.

Table 2. Face recognition accuracy (%) on the Extended Yale Face Database B. We compare with atom dictionaries generated by random sampling from the training data and by unsupervised training.

K      Random    Unsuper.    Supervised
10      86.96      93.39        98.05
20      94.94      96.88        98.44
30      97.28      97.85        98.05
40      97.47        -            -
50      98.05        -            -
Figure 2(a) shows the trained atom dictionaries for K = 10. Due to space limitations, we only show the dictionaries for 10 of the 38 subjects, where each row corresponds to one subject. As theoretically shown by Basri and Jacobs [25], only nine (properly chosen) basis illuminations are sufficient to generate basis images that span, to a good approximation, the images of an object under all possible lighting conditions. Our trained dictionary aligns with this theoretical finding to a large extent: each atom dictionary is composed of one or two mean-like faces for the subject, while the remaining part-based basis atoms compensate for the lighting variations, since lighting accounts for most of the variation in this face database.

4.3. Digit Recognition

For handwritten digit recognition, we apply our algorithm to the large-scale MNIST dataset [24]. The database consists of 70,000 handwritten digits, of which 60,000 are used for training and 10,000 for testing. The digits have been size-normalized and centered in a fixed-size image. We choose three easily confused digits, 4, 7, and 9, for our experiments. For dictionary training, we again try different atom dictionary sizes, and the results are listed in Table 3. In all cases compared, supervised dictionary training outperforms unsupervised dictionary training, and significantly outperforms random sampling at the same dictionary sizes.
Fig. 2. (a) The trained dictionaries for the Extended Yale Face Database B; each row is one atom dictionary for one subject. (b) The trained dictionaries for digits 4, 7, and 9 in MNIST. Starting from a Gaussian random matrix, the algorithm correctly learns the atom dictionaries for each digit.

Table 3. Digit classification accuracy (%) for different atom dictionary sizes. We compare with random sampling and unsupervised dictionary training.

Size K    Random    Unsuper.    Supervised
100        94.50      98.01        98.24
200        95.86      98.24        98.54
300        97.02      98.57        98.71
2000       97.95        -            -
All        98.54        -            -
Even though it is much more compact and efficient, our algorithm can perform better than sparse representation based classification using the whole training set. In Figure 2(b), we show the discriminatively learned atom dictionaries for digits 4, 7, and 9. Again, starting from a Gaussian random matrix, our learning algorithm correctly finds the atom dictionaries for each digit.

5. CONCLUSION AND FUTURE WORK

In this paper, we propose a supervised sparse matrix factorization model, where the basis matrix is composed of atom dictionaries from each class and the coefficient (representation) matrix is sparse, indicating the identity of the data samples. The learning process enforces each data sample to be represented by a low-dimensional subspace of its own class, which is found by solving an ℓ1-regularized problem. Our approach is based on the observation that many natural high-dimensional signals belonging to the same class tend to lie in the same low-dimensional subspace. Although the obtained sparse representations can be combined with subsequent discriminant models, such as SVM, we show that these data representations can be used directly for classification. Experiments on three datasets, covering texture images, face images, and handwritten digits, show promising results for the supervised model with a simple sparsity assumption. For future work, we will investigate structured sparse models, e.g., group sparsity, in our reconstructive representation model for better discriminative analysis.
6. REFERENCES

[1] I. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.

[2] M. Turk and A. Pentland, "Face recognition using eigenfaces," in IEEE Conference on Computer Vision and Pattern Recognition, 1991, pp. 586–591.

[3] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, 1999.

[4] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley-Interscience, 2nd edition, 2000.

[5] A. Martinez and A. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.

[6] P. Belhumeur, J. Hespanha, and D. Kriegman, "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.

[7] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: a general framework for dimensionality reduction," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 40–51, 2007.

[8] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning a Mahalanobis metric from equivalence constraints," Journal of Machine Learning Research, vol. 6, pp. 937–965, 2005.

[9] Ke Huang and Selin Aviyente, "Sparse representation for signal classification," in Advances in Neural Information Processing Systems (NIPS), 2006.

[10] D. L. Donoho, "For most large underdetermined systems of linear equations, the minimal ℓ1-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006.

[11] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Transactions on Image Processing, vol. 15, pp. 3736–3745, 2006.

[12] Jianchao Yang, John Wright, Thomas Huang, and Yi Ma, "Image super-resolution as sparse representation of raw image patches," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.

[13] Bin Cheng, Jianchao Yang, Shuicheng Yan, Yun Fu, and Thomas Huang, "Learning with ℓ1-graph for image analysis," IEEE Transactions on Image Processing, vol. 19, no. 4, pp. 858–866, 2010.

[14] John Wright, Allen Yang, Arvind Ganesh, Shankar Sastry, and Yi Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, February 2009.

[15] David M. Bradley and J. Andrew Bagnell, "Differential sparse coding," in NIPS, 2008.

[16] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang, "Linear spatial pyramid matching using sparse coding for image classification," in CVPR, 2009.

[17] Jianchao Yang, Kai Yu, and Thomas Huang, "Supervised translation-invariant sparse coding," in CVPR, 2010.

[18] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro, "Supervised dictionary learning," in NIPS, 2008.

[19] Fernando Rodriguez and Guillermo Sapiro, "Learning discriminative and reconstructive non-parametric dictionaries," 2007.

[20] Michal Aharon, Michael Elad, and Alfred Bruckstein, "K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[21] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems (NIPS), 2007.

[22] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro, "Online learning for matrix factorization and sparse coding," Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.

[23] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.

[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[25] R. Basri and D. Jacobs, "Lambertian reflectance and linear subspaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.