2012 11th International Conference on Machine Learning and Applications

Supervised Dictionary Learning via Non-Negative Matrix Factorization for Classification

Yifeng Li and Alioune Ngom
School of Computer Science, University of Windsor, Windsor, Ontario, Canada
Email: {li11112c,angom}@uwindsor.ca

Abstract—Sparse representation (SR) has been applied as a state-of-the-art machine learning approach. The sparse representation classification approach (SRC1) based on 𝑙1-norm regularization and the non-negative least squares (NNLS) classification approach based on non-negativity have been shown to be powerful and robust. However, these approaches are extremely slow when the number of training samples is very large, because both of them use the whole training set as the dictionary. In this paper, we briefly survey the existing SR techniques for classification, and then propose a fast approach which uses non-negative matrix factorization as the supervised dictionary learning method and NNLS as the non-negative sparse coding method. Experiments show that our approach obtains accuracy comparable with the benchmark approaches and dramatically speeds up the computation, particularly in the case of large sample size and many classes.

Keywords: sparse representation; supervised dictionary learning; sparse coding; classification; non-negative matrix factorization; non-negative least squares

I. INTRODUCTION

Sparse representation (SR) is the principle that a signal can be approximated by a sparse linear combination of dictionary atoms [1]. This can be formulated as the multi-objective optimization task

    min_𝒙 ∥𝒙∥_0  and  min_𝒙 ∥𝒃 − 𝑨𝒙∥_2,    (1)

where 𝑨 ∈ ℝ^{m×k} is called a dictionary, each column of which is called an atom, 𝒃 ∈ ℝ^m is a new signal, and 𝒙 ∈ ℝ^k is the coefficient vector. The 𝑙0 norm ∥𝒙∥_0 is defined as the number of non-zeros in the vector 𝒙; it is often replaced by the 𝑙1 norm to obtain a convex optimization. If 𝑨 is overcomplete, that is m ≪ k, then (1) can be rewritten as the constrained task

    min_𝒙 ∥𝒙∥_1   s.t.  𝒃 = 𝑨𝒙.    (2)

On the other hand, if 𝑨 is undercomplete (m ≫ k), (1) can be reformulated as the unconstrained task

    min_𝒙 (1/2)∥𝒃 − 𝑨𝒙∥_2^2 + λ∥𝒙∥_1,    (3)

where λ is a parameter controlling the trade-off between reconstruction error (regression precision) and sparsity.

Generally speaking, SR comprises two tasks: sparse coding and dictionary learning. The above procedure of representing a signal by a sparse linear combination of dictionary atoms is termed sparse coding. The dictionary 𝑨 is crucial for high-quality coding, and the method of constructing 𝑨 from training data is called dictionary learning; it is usually a matrix decomposition. A dictionary can certainly be constructed in other ways, for example designed as a redundant dictionary by an expert; in this paper, however, we concentrate on learning it from data. SR has been studied for applications in signal decoding, de-noising [2], compressed sensing [3], and machine learning [4]. In this paper, we focus on its application to classification. Learning a dictionary and performing sparse coding in order to improve accuracy and speed up computation are the very purposes of many studies. There are two main reasons why SR is used for classification: i) it provides a compressed representation and better interpretation; and ii) as a regularized model, it has the advantages of de-noising and avoiding overfitting [5].

There are at least three strategies for applying SR to classification; they differ in the aspects of dictionary learning, sparse coding, and/or predicting class labels from the sparse codes. The first category uses all training samples as dictionary atoms, so there is no dictionary learning step. The first method of this category is SRC1 [4]. Using all of the low-dimensional training samples as dictionary atoms, SRC1 solves (2) to pursue the sparse code of a new sample 𝒃; in order to predict the class label, a nearest-subspace (NS) rule (detailed in Section II) is employed. SRC2 [6] was proposed to classify undercomplete data through solving (3).
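To make the sparse coding step concrete, the following minimal sketch (not part of the paper; it assumes a Python/NumPy environment, and A, b, and lam are placeholder inputs) solves the unconstrained 𝑙1-regularized problem (3) by iterative soft-thresholding (ISTA), a generic solver chosen here only for illustration.

import numpy as np

def ista_sparse_code(A, b, lam, max_iter=500, tol=1e-8):
    """Minimize 0.5*||b - A x||_2^2 + lam*||x||_1 by iterative soft-thresholding (ISTA)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient: largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)           # gradient of the smooth (least-squares) term
        z = x - grad / L                   # gradient step
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding (proximal step)
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

# toy usage with a random undercomplete dictionary (m >> k)
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)
x = ista_sparse_code(A, b, lam=0.1)

The SRC2 and MSRC baselines discussed below rely on dedicated 𝑙1 solvers such as the interior-point method of [3]; ISTA is shown here only to illustrate what problem (3) asks for.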

In the study of non-negative matrix factorization (NMF) [7], non-negativity has been shown to be useful for finding patterns and sparse coefficients. The original NMF decomposes a non-negative matrix into two non-negative factors, 𝑫_+ ≈ 𝑨_+𝒀_+, where 𝑫 ∈ ℝ_+^{m×n} is an available dataset, 𝑨 ∈ ℝ_+^{m×k} is called the basis matrix, and 𝒀 ∈ ℝ_+^{k×n} is the coefficient matrix. Each column of 𝑫 is approximated by a non-negative and usually sparse linear combination of the columns of 𝑨; therefore NMF can be viewed as a dictionary learning method. Semi-NMF [8] drops the non-negativity constraints on 𝑫 and 𝑨 and therefore widens the applicability of NMF. As an extreme case of semi-NMF, the NNLS classification approach [5] uses all training samples 𝑫 as the dictionary (that is, 𝑨 = 𝑫, where each column is a sample). Then, for a set of new samples arranged column-wise in 𝑩, sparse coding amounts to solving the following NNLS optimization:

    min_𝑿 (1/2)∥𝑩 − 𝑨𝑿∥_F^2   s.t.  𝑿 ≥ 0.    (4)
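As an illustrative sketch (not the authors' implementation), problem (4) decouples over the columns of 𝑩 and can be solved one column at a time with any active-set NNLS solver; here SciPy's Lawson-Hanson routine stands in for the fast algorithm of [12] cited later.

import numpy as np
from scipy.optimize import nnls

def nnls_code(A, B):
    """Solve min_X 0.5*||B - A X||_F^2 s.t. X >= 0, one column of B at a time."""
    k, p = A.shape[1], B.shape[1]
    X = np.zeros((k, p))
    for j in range(p):
        X[:, j], _ = nnls(A, B[:, j])      # active-set NNLS for the j-th new sample
    return X

The column-wise view also explains why a concise learned dictionary pays off: each per-column problem is a QP whose size equals the number of dictionary atoms.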

It has been shown that the NNLS classifier can obtain accuracy comparable with state-of-the-art methods. It has the following merits: i) it is a non-parametric method; ii) it can be kernelized easily; and iii) it is robust to noise and missing values. However, methods in this category are lazy learning and therefore have long prediction times, which hinders their applicability in real-time prediction. Moreover, they face a computational challenge when the size of the training set is very large. Usually, SR models can be formulated as (alternating) constrained quadratic programming (QP) problems: dictionary learning can be solved by alternating QP problems, and sparse coding is itself a QP problem. If an SR approach does not involve dictionary learning but instead pools all training samples in the dictionary, then it is costly lazy learning, because the sparse coding of a new sample is a large-scale QP problem. However, if a concise dictionary is learned from the training data, then the sparse coding of a new sample is a very small QP problem which can be solved efficiently; SR involving dictionary learning is no longer lazy.

The second category is composed of SR-based feature extraction methods. They usually have three steps. First, the dictionary is learned by a matrix decomposition of the training data 𝑫, that is 𝑫 ≈ 𝑨𝒀, where the columns of 𝒀 are the sparse representations of the training samples in the feature space spanned by the columns of 𝑨. Then, any other classifier can be trained on 𝒀. Next, the new samples 𝑩 are projected into the feature space; this is formulated as 𝑩 ≈ 𝑨𝑿, where 𝑨 is fixed and the columns of 𝑿 are the sparse representations of the new samples in the feature space. Finally, the classifier learned in the second step is used to predict the class labels from 𝑿. A typical example of this category is the NMF-based feature extraction method [9]. The weakness is that the sparsity may not be exploited by the classifier. Moreover, the dictionary learning might be computationally unaffordable in the case of huge sample size.

The main idea of the third category is that a sub-dictionary is learned for each class, and these sub-dictionaries are then concatenated into a holistic dictionary. The meta-sample based sparse representation classification approach (MSRC) [10] falls into this category. MSRC employs SVD to learn the sub-dictionaries. The sparse coding of a new sample is obtained through solving (3); after that, the NS rule is used to predict the class label. Sound performance of MSRC was reported over microarray data. NMF was also tried for learning the sub-dictionaries, but it was claimed that its performance is worse than that of using SVD. We find that this category has the following advantages: i) it is eager learning, hence applicable to real-time prediction; ii) dictionary learning over data of large sample size and many classes can be solved through a divide-and-conquer scheme; iii) the learning on each class is independent of the others, so the classifier is easy to update when new training samples arrive for some of the classes; and iv) it is actually supervised learning, because sub-dictionaries are learned for all classes separately. Additionally, methods in the first and third categories are naturally multi-class approaches and therefore have an advantage over linear classifiers in the case of many classes. However, it has been reported in [5] that MSRC has poor performance over mass spectrometry data.

In this paper we propose a meta-sample based NNLS classification approach which falls into the third category above. We shall show that our approach generally has better classification performance than MSRC, and that it is also faster than MSRC on data of large sample size.

The rest of this paper is organized as follows. Our approach is proposed in Section II. Experimental comparisons over various data are reported in Section III. Finally, we draw some conclusions and mention some interesting future work.

II. METHOD

We note that NMF has been tried as a dictionary learning method in MSRC; however, it is unclear whether semi-NMF is used by MSRC in the case of mixed-sign data. Moreover, if any NMF is used as the dictionary learning approach, it is not suitable to perform the sparse coding via 𝑙1 regularization: a new sample is supposed to be a non-negative linear combination of the dictionary atoms obtained by NMF, but the coefficient vector obtained by 𝑙1 regularization may contain negative numbers. It therefore makes more sense to use NNLS optimization to obtain a non-negative sparse code after NMF-based dictionary learning. Hence the main idea of our approach is to use NNLS-based sparse coding to match the NMF (or semi-NMF) based sub-dictionary learning.

Our method, named the meta-sample based NNLS (MNNLS) classification approach, is detailed in Algorithm 1. A meta-sample is defined as a column of the basis matrices obtained via (sub-)dictionary learning. In the algorithm, the meta-samples obtained by NMF from the 𝑖-th class are labeled as belonging to this class. NMF can be solved by the alternating NNLS algorithm [11]. Inspired by the NNLS optimization and the multiplicative-update rules for semi-NMF, we propose an NNLS-based optimization for semi-NMF, illustrated in Algorithm 2. It alternately calls the Moore-Penrose pseudoinverse (to update the basis matrix) and an NNLS solver (to update the coefficient matrix).


Algorithm 1 MNNLS Classification Approach
Input: 𝑫_{m×n}: training data; 𝒄: class labels of the n training samples; 𝑩_{m×p}: new data; k: size of each sub-dictionary
Output: 𝒑: the predicted class labels of the p new samples

training step:
1: normalize each training sample to have unit 𝑙2 norm.
2: for each class i, learn the sub-dictionary through NMF (𝑫_{i+} ≈ 𝑨_{i+}𝒀_{i+}, where 𝑫_i contains the training samples of the i-th class in the input space and 𝒀_{i+} is their representation in the feature space) if the data are non-negative, or semi-NMF (𝑫_i ≈ 𝑨_i𝒀_{i+}) if the data are of mixed signs.
3: concatenate the sub-dictionaries into a holistic dictionary via 𝑨 = [𝑨_1, ..., 𝑨_C].

prediction step:
1: normalize each new sample to have unit 𝑙2 norm.
2: solve the NNLS problem min_𝑿 (1/2)∥𝑩 − 𝑨𝑿∥_F^2 s.t. 𝑿 ≥ 0.
3: for each column of 𝑿, apply the nearest-subspace rule to predict the corresponding class label.
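For illustration only, a minimal Python/NumPy sketch of the training step of Algorithm 1 follows, under the simplifying assumptions that the data are non-negative (so plain NMF suffices, and the semi-NMF branch is omitted), that scikit-learn's NMF stands in for the alternating-NNLS NMF of [11], and that k is no larger than the number of samples in each class; all names are illustrative.

import numpy as np
from sklearn.decomposition import NMF

def mnnls_train(D, c, k):
    """Training step of Algorithm 1: learn a k-atom sub-dictionary per class by NMF on
    non-negative data and concatenate the sub-dictionaries into one labeled dictionary."""
    D = D / np.linalg.norm(D, axis=0, keepdims=True)          # unit l2 norm for each training sample (column)
    sub_dicts, atom_labels = [], []
    for label in np.unique(c):
        D_i = D[:, np.asarray(c) == label]                    # training samples of class i (columns)
        A_i = NMF(n_components=k, init='nndsvda', max_iter=500).fit_transform(D_i)  # D_i ~= A_i Y_i
        sub_dicts.append(A_i)
        atom_labels += [label] * k                            # meta-samples keep their class label
    return np.hstack(sub_dicts), np.array(atom_labels)

The returned atom_labels array records which class each meta-sample (column of the holistic dictionary) was learned from; it is what the nearest-subspace rule in the prediction step relies on.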

Algorithm 2 NNLS-Based Semi-NMF Optimization
Input: 𝑩_{m×n}: matrix; k: number of basis vectors
Output: 𝑨_{m×k}: basis matrix; 𝑿_{k×n}: coefficient matrix, where 𝑨 and 𝑿 are a solution to min_{𝑨,𝑿} (1/2)∥𝑩 − 𝑨𝑿∥_F^2 s.t. 𝑿 ≥ 0

𝑹 = 𝑩^T𝑩;
initialize 𝑿 by, for example, random numbers;
r_prev = Inf;
for i = 1 : maxIter do
    update 𝑨 via 𝑨 = 𝑩𝑿^†;
    update 𝑿 via solving min_𝑿 (1/2)∥𝑩 − 𝑨𝑿∥_F^2 s.t. 𝑿 ≥ 0 using the fast NNLS algorithm in [12];
    if i == maxIter or i mod l == 0 then   % check the termination conditions every l iterations
        r_cur = (1/2)∥𝑩 − 𝑨𝑿∥_F^2;
        if r_prev − r_cur ≤ ε or r_cur ≤ ε then
            break;
        end if
        r_prev = r_cur;
    end if
end for
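A compact sketch of Algorithm 2 under similar assumptions (SciPy's column-wise NNLS routine in place of the fast combinatorial solver of [12], and the termination test checked every iteration rather than every l iterations) might look as follows; the names are illustrative only.

import numpy as np
from scipy.optimize import nnls

def seminmf_nnls(B, k, max_iter=200, tol=1e-6, seed=0):
    """Sketch of Algorithm 2: alternate a pseudoinverse update of A with an NNLS update of X
    to approximately solve min_{A,X} 0.5*||B - A X||_F^2 s.t. X >= 0."""
    rng = np.random.default_rng(seed)
    m, n = B.shape
    X = rng.random((k, n))                                    # non-negative random initialization
    prev = np.inf
    for _ in range(max_iter):
        A = B @ np.linalg.pinv(X)                             # A = B X^dagger (unconstrained least squares)
        for j in range(n):                                    # column-wise NNLS stands in for the fast
            X[:, j], _ = nnls(A, B[:, j])                     # combinatorial solver of [12]
        cur = 0.5 * np.linalg.norm(B - A @ X, 'fro') ** 2     # current reconstruction error
        if prev - cur <= tol or cur <= tol:                   # termination test from Algorithm 2
            break
        prev = cur
    return A, X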

The advantage of our algorithm over the original multiplicative-update rules is that it converges quickly to a stationary point. The objective and the feasible set of the NNLS problem are convex, therefore the global solution can be found by a fast active-set algorithm [12]. The strength of using this algorithm is that a batch of new samples can share their computation in parallel during sparse coding in our approach.

The nearest-subspace (NS) rule in Algorithm 1 interprets the sparse code in order to predict the corresponding class label. Suppose there are C classes with labels 𝑙_1, ..., 𝑙_C. Given a new sample 𝒃 and its sparse coefficient vector 𝒙, its regression residual corresponding to the i-th class is computed as

    r_i(𝒃) = (1/2)∥𝒃 − 𝑨𝜹_i(𝒙)∥_2^2,

where 𝜹_i(𝒙) : ℝ^N → ℝ^N returns the coefficients for class 𝑙_i; its j-th element is defined as

    (𝜹_i(𝒙))_j = 𝑥_j if meta-sample 𝒂_j belongs to class 𝑙_i, and 0 otherwise.

Finally, the class label 𝑙_k is assigned to 𝒃, where k = arg min_{1≤i≤C} r_i(𝒃). Please note that the normalization in Algorithm 1 does not change the performance dramatically, while it is crucial in MSRC because it affects the computation of the SVD and the parameter selection of the sparse coding.
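The nearest-subspace rule described above can be sketched as follows (illustrative names only); atom_labels records the class of each meta-sample, and x is the non-negative code of b obtained in step 2 of the prediction phase, e.g. by an NNLS solver.

import numpy as np

def nearest_subspace_label(A, atom_labels, b, x):
    """Nearest-subspace rule: keep only the coefficients of one class at a time (delta_i(x)) and
    assign b to the class whose meta-samples reconstruct it with the smallest residual r_i(b)."""
    best_label, best_resid = None, np.inf
    for label in np.unique(atom_labels):
        mask = (atom_labels == label)                         # meta-samples (columns of A) of this class
        r = 0.5 * np.linalg.norm(b - A[:, mask] @ x[mask]) ** 2   # r_i(b) = 0.5*||b - A delta_i(x)||_2^2
        if r < best_resid:
            best_label, best_resid = label, r
    return best_label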

Table I
DATASETS

Data                      #Classes   #Features   #Samples   Sign
ALL [13]                  6          12625       248        ±
MLL [14]                  3          12582       72         ±
SRBCT [15]                4          2308        63         +
OvarianCD_PostQAQC [16]   2          15000       216        +
Prostate7-3-02 [17]       2          15154       322        ±
ExYaleB [18]              38         32256       2432       +

III. EXPERIMENTS

In order to test the performance of our MNNLS method in terms of accuracy and time complexity, we ran it on data of various types, including three microarray gene expression datasets (ALL, MLL, and SRBCT), two mass spectrometry datasets (OvarianCD_PostQAQC and Prostate7-3-02), and one facial dataset (ExYaleB). A concise description of them is given in Table I, where ± means the dataset is of mixed sign and + that it is non-negative.

We compared our method with the existing SR-based approaches, including SRC1 [4], SRC2 [6], NNLS [5], and MSRC [10]; kernel approaches, including SVM [19] and LSSVM [20]; and other instance-based approaches, including the linear-regression classification approach (LRC) [21] and 1-nearest neighbor (1-NN) [22]. The implementation of SRC1 does not work on high-dimensional, small-sample-size data [3], thus we only used it on ExYaleB, via resampling with rate 1/16 before sparse coding [4]. The parameters of all approaches were selected by cross-validation (CV) over the training set. 4-fold CV was conducted to split each of the microarray and mass spectrometry datasets into training and test sets; CV was run 20 times and the averaged performance was recorded for a fair comparison. 3-fold CV was used for the facial data.

The accuracies of the methods over these data are shown in Figs. 1, 2, and 3. First, we can see that the accuracy of MNNLS is comparable with that of MSRC-SVD on microarray data, while MNNLS outperforms MSRC-SVD on the mass spectrometry data and the ExYaleB data. Therefore we can conclude that MNNLS generally works better than MSRC-SVD.



Second, MSRC-NMF has poor performance over all the datasets except SRBCT, which is easy to classify; this convinces us that a sparse coding method should match the dictionary learning method well. Third, it can be observed that our approach obtained performance comparable with the other SR approaches on all datasets, which indicates that our NMF-based sub-dictionary learning method performs well. Fourth, as shown in Figs. 1 and 2, our approach achieved results similar to those of the state-of-the-art kernel approaches; furthermore, SVM and LSSVM obtained worse accuracy than MNNLS on the facial data. Finally, comparable results were also obtained by LRC, whose idea is to compare the linear-regression residuals of a new sample with respect to each class; compared with 1-NN, our approach performs significantly better.

We also investigated the efficiency of our approach on ExYaleB, which has a large sample size and a large number of classes. The logarithmic computing time of 3-fold CV was recorded and is compared in Fig. 4. First of all, we can see that, using the whole training set as the dictionary, SRC2 is much slower than the other approaches; this corroborates our claim that using the whole training set as the dictionary can be extremely slow in the case of large sample size. From the comparison of NNLS and SRC2, it can also be concluded that the NNLS solver runs much faster than that of SRC2. Secondly, MNNLS consumed much less time than the MSRC approaches, which shows that our approach has a computational advantage over MSRC for large-sample-sized data. Thirdly, the computing time of MNNLS is around half of that of NNLS and slightly better than that of SRC1, which employed dimension reduction; this further convinces us that MNNLS is suitable for classifying large-scale data. Finally, although SVM, LSSVM, and 1-NN run faster than MNNLS, their accuracies are lower than that of MNNLS on this data. As the number of classes and the sample size increase further, the computing time of the binary SVM and LSSVM will exceed that of our approach.

Figure 1. Classification Accuracy on Microarray Gene Expression Data

Figure 2. Classification Accuracy on Mass Spectrometry Data

Figure 3. Classification Accuracy on ExYaleB Data

Figure 4. Computing Time on ExYaleB Data

IV. CONCLUSION AND FUTURE WORK

SR has been attracting attention in the machine learning community, and it has many merits that other approaches do not have. However, one of its bottlenecks is the computational challenge posed by large-scale data. In this paper, we proposed a simple but powerful approach which combines NMF-based sub-dictionary learning with NNLS-based sparse coding. Experiments on various data show that our approach can obtain accuracy comparable with the benchmark methods.


Furthermore, our approach achieves a significant computational improvement over the existing SR approaches on data of large sample size and many classes. Since our approach makes it easy to update the dictionary for new training samples and efficient to compute sparse codes for batches of unknown samples, it can be applied in scenarios where i) the data are frequently updated; ii) many unknown samples must be classified simultaneously; and iii) prediction must be done in real time. The implementation of our approach is available in our Sparse Representation Toolbox in MATLAB [23].

There are many interesting studies to be done in this direction. First, the numbers of atoms in the sub-dictionaries are assumed to be the same in our current study; if this model selection could be solved by a quick optimization, better accuracy might be obtained. Second, since our method is a successful dictionary learning method in the field of machine learning, its applicability in other domains needs to be investigated. Third, online dictionary learning is also an important option for learning large dictionaries. Fourth, it would be very interesting to devise kernel dictionary learning and kernel sparse coding methods. Fifth, the performance and applicability of hybrid dictionary learning (e.g., combining non-negativity and 𝑙1-norm regularization) need to be fully investigated. Furthermore, there is much room to devise faster NNLS optimization algorithms. Finally, sample selection or stochastic methods are also a promising direction when part of the training samples are used directly as dictionary atoms.

ACKNOWLEDGMENT

This research has been supported by IEEE CIS Summer Research Grant 2010, OGS Scholarship 2011-2013, and NSERC Grant #RGPIN228117-2011.

REFERENCES

[1] A.M. Bruckstein, D.L. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," SIAM Review, vol. 51, no. 1, pp. 34-81, 2009.

[2] M. Elad and M. Aharon, "Image denoising via learned dictionaries and sparse representation," in Proc. CVPR, New York, 2006, pp. 895-900.

[3] S.J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, "An interior-point method for large-scale l1-regularized least squares," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 606-617, 2007.

[4] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," TPAMI, vol. 31, no. 2, pp. 210-227, 2009.

[5] Y. Li and A. Ngom, "Classification approach based on non-negative least squares," Technical Report no. 12-010, School of Computer Science, University of Windsor, March 2012.

[6] X. Hang and F.-X. Wu, "Sparse representation for classification of tumors using gene expression data," Journal of Biomedicine and Biotechnology, vol. 2009, doi:10.1155/2009/403689, 2009.

[7] D.D. Lee and H.S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788-791, 1999.

[8] C. Ding, T. Li, and M.I. Jordan, "Convex and semi-nonnegative matrix factorizations," TPAMI, vol. 32, no. 1, pp. 45-55, 2010.

[9] Y. Li and A. Ngom, "Non-negative matrix and tensor factorization based classification of clinical microarray gene expression data," in Proc. BIBM, Hong Kong, 2010, pp. 438-443.

[10] C.-H. Zheng, L. Zhang, T.-Y. Ng, S.C.K. Shiu, and D.S. Huang, "Metasample-based sparse representation for tumor classification," TCBB, vol. 8, no. 5, pp. 1273-1282, 2011.

[11] H. Kim and H. Park, "Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 2, pp. 713-730, 2008.

[12] M.H. Van Benthem and M.R. Keenan, "Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems," Journal of Chemometrics, vol. 18, pp. 441-450, 2004.

[13] E.J. Yeoh et al., "Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling," Cancer Cell, vol. 1, pp. 133-143, 2002.

[14] S.A. Armstrong et al., "MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia," Nature Genetics, vol. 30, pp. 41-47, 2002.

[15] J. Khan et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673-679, 2001.

[16] T.P. Conrads et al., "Continuous representations of time series gene expression data," Endocrine-Related Cancer, vol. 11, pp. 163-178, 2004.

[17] E.F. Petricoin III et al., "Serum proteomic patterns for detection of prostate cancer," Journal of the National Cancer Institute, vol. 94, no. 20, pp. 1576-1578, 2002.

[18] K.C. Lee, J. Ho, and D. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," TPAMI, vol. 27, no. 5, pp. 684-698, 2005.

[19] V. Vapnik, Statistical Learning Theory. Wiley-IEEE Press, New York, 1998.

[20] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. World Scientific, Singapore, 2002.

[21] I. Naseem, R. Togneri, and M. Bennamoun, "Linear regression for face recognition," TPAMI, vol. 32, pp. 2106-2112, 2010.

[22] T. Mitchell, Machine Learning. McGraw Hill, Ohio, 1997.

[23] Y. Li, The Sparse Representation Toolbox in MATLAB. Available at: http://cs.uwindsor.ca/~li11112c/sr
