NONNEGATIVE MATRIX FACTORIZATION FOR PATTERN RECOGNITION

Oleg Okun
Machine Vision Group
Department of Electrical and Information Engineering
University of Oulu, FINLAND
email: [email protected]

Helen Priisalu
Tallinn University of Technology, ESTONIA
ABSTRACT
In this paper, linear and unsupervised dimensionality reduction via matrix factorization with nonnegativity constraints is studied when applied for feature extraction followed by pattern recognition. Since matrix factorization is typically done iteratively, convergence can be slow. To alleviate this problem, a significantly (more than 11 times) faster algorithm is proposed that does not cause severe degradation in classification accuracy when dimensionality reduction is followed by classification. These results are due to two modifications of the previous algorithms: feature scaling (normalization) prior to the start of iterations, and an initialization of iterations that combines two techniques for mapping unseen data.
KEY WORDS
pattern recognition, nonnegative matrix factorization

1 Introduction

Dimensionality reduction is typically used to cure, or at least to alleviate, the effect known as the curse of dimensionality, which negatively affects classification accuracy. The simplest way to reduce dimensionality is to linearly transform the original data. Given the original, high-dimensional data gathered in an n × m matrix V, a transformed or reduced matrix H, composed of m r-dimensional vectors (r < n and often r ≪ n), is obtained from V according to the linear transformation V ≈ WH, where W is an n × r (basis) matrix. Nonnegative matrix factorization (NMF) is an example of such a linear transformation, but unlike other transformations it imposes nonnegativity constraints on all matrices involved. Thanks to this fact, it can generate a part-based representation, since no subtractions are allowed. Lee and Seung [1] proposed a simple iterative algorithm for NMF and proved its convergence. The factorized matrices are initialized with positive random numbers before the matrix updates start. It is well known that initialization is important for any iterative algorithm: properly initialized, an algorithm converges faster. However, this issue has not yet been investigated in the case of NMF. In this paper, our contribution is two modifications accelerating algorithm convergence: 1) feature scaling prior to NMF and 2) a combination of two techniques for mapping unseen data, with a theoretical proof of why this combination leads to faster convergence. Because of its straightforward implementation, NMF has already been applied to object and pattern classification [2, 3, 4, 5, 6, 7, 8, 9]. Here we extend the application of NMF to bioinformatics: NMF coupled with a classifier is applied to protein fold recognition. Statistical analysis (by means of one-way analysis of variance (ANOVA) and multiple comparison tests) of the obtained results (error rates) testifies to the absence of significant accuracy degradation when our modifications are introduced, compared to the standard NMF algorithm.

2 Nonnegative Matrix Factorization

Given nonnegative matrices V, W and H whose sizes are n × m, n × r and r × m, respectively, we seek a factorization V ≈ WH. The value of r is selected according to the rule r < nm/(n+m) in order to obtain dimensionality reduction. Each column of W is a basis vector, while each column of H is a reduced representation of the corresponding column of V. In other words, W can be seen as a basis that is optimized for linear approximation of the data in V. NMF provides the following simple learning rules, which guarantee monotonic convergence to a local maximum without the need for setting any adjustable parameters [1]:

W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu},    (1)
W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}},    (2)

H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}.    (3)

The matrices W and H are initialized with positive random values. Eqs. (1)-(3) are iterated until convergence to a local maximum of the following objective function:

F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left( V_{i\mu} \log(WH)_{i\mu} - (WH)_{i\mu} \right).    (4)
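To make the update rules concrete, the following is a minimal NumPy sketch of the multiplicative updates (1)-(3) with the objective (4); it is not the authors' MATLAB implementation, the function name nmf_divergence is ours, and the tol-based stopping rule anticipates the stopping criterion described below.

```python
import numpy as np

def nmf_divergence(V, r, tol=0.05, max_iter=5000, rng=None):
    """Sketch of the Lee-Seung multiplicative updates, Eqs. (1)-(3),
    increasing the objective F of Eq. (4)."""
    rng = np.random.default_rng(rng)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-4           # positive random initialization
    H = rng.random((r, m)) + 1e-4
    eps = 1e-12                             # guards against division by zero

    def objective(W, H):
        WH = W @ H + eps
        return np.sum(V * np.log(WH) - WH)  # Eq. (4)

    F_old = objective(W, H)
    for _ in range(max_iter):
        WH = W @ H + eps
        W *= (V / WH) @ H.T                 # Eq. (1)
        W /= W.sum(axis=0, keepdims=True)   # Eq. (2): column normalization
        WH = W @ H + eps
        H *= W.T @ (V / WH)                 # Eq. (3)
        F_new = objective(W, H)
        if F_new - F_old < tol:             # stop when the increase of F is small
            break
        F_old = F_new
    return W, H
```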
In its original form, NMF can be slow to converge to a local maximum for large matrices and/or high data dimensionality. On the other hand, stopping after a predefined number of iterations, as is sometimes done, can be premature for obtaining a good approximation. Introducing a parameter tol (0 < tol ≪ 1) to decide when to stop iterations significantly speeds up convergence without negatively affecting the mean square error (MSE), which measures the approximation quality (we observed in numerous experiments that the MSE decreases quickly during the first iterations, after which its rate of decrease dramatically slows down). That is, iterations stop when F_new − F_old < tol. After the NMF basis functions, i.e. the matrix W, have been learned, new (previously unseen) data are mapped to the r-dimensional space by fixing W and using one of the following techniques:

1. randomly initializing H_new as described above and iterating Eq. (3) until convergence [3], or

2. taking the least squares solution of V_new = W H_new, i.e. H_new = (W^T W)^{-1} W^T V_new, e.g. as in [5], where V_new contains the new (unseen or test) data.

We will call the first technique iterative and the second direct, because the latter provides a straightforward noniterative solution; a sketch of both appears below.
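A minimal sketch of the two mapping techniques, again in NumPy with our own function names (map_iterative, map_direct); it assumes W has already been learned as above.

```python
import numpy as np

def map_iterative(W, V_new, tol=0.05, max_iter=5000, H_init=None, rng=None):
    """Iterative technique: W is fixed, only Eq. (3) is iterated on H_new."""
    rng = np.random.default_rng(rng)
    r, m = W.shape[1], V_new.shape[1]
    eps = 1e-12
    H = rng.random((r, m)) + 1e-4 if H_init is None else H_init.copy()
    F_old = -np.inf
    for _ in range(max_iter):
        WH = W @ H + eps
        H *= W.T @ (V_new / WH)                      # Eq. (3), W kept fixed
        WH = W @ H + eps
        F_new = np.sum(V_new * np.log(WH) - WH)      # Eq. (4) on the new data
        if F_new - F_old < tol:
            break
        F_old = F_new
    return H

def map_direct(W, V_new):
    """Direct technique: least squares solution H_new = (W^T W)^{-1} W^T V_new."""
    return np.linalg.solve(W.T @ W, W.T @ V_new)
```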
3 Our Contribution

We propose two modifications in order to accelerate convergence of the iterative NMF algorithm.

The first modification concerns feature scaling (normalization) linked to the initialization of the factorized matrices. Typically, these matrices are initialized with positive random numbers, say uniformly distributed between 0 and 1, in order to satisfy the nonnegativity constraints. Hence, the elements of V (the matrix of the original data) also need to lie within the same range. Given that V_j is an n-dimensional feature vector, where j = 1, ..., m, its components V_{ij} are normalized as V_{ij}/V_{kj}, where k = arg max_l V_{lj}. In other words, the components of each feature vector are divided by the maximal value among them. As a result, feature vectors are composed of components whose nonnegative values do not exceed 1. Since all three matrices (V, W, H) now have entries between 0 and 1, the matrix factorization V ≈ WH takes much less time than if V had its original (unnormalized) values, because the entries of the factorized matrices do not have to grow much in order to satisfy the stopping criterion for the objective function F in Eq. (4). As an additional benefit, the MSE becomes much smaller, too, because the difference between the original values V_{ij} and the approximated values (WH)_{ij} becomes smaller, given that mn is fixed. Though this modification is simple, it brings a significant speed-up of convergence, as will be shown below, and to the best of our knowledge it has never been explicitly mentioned in the literature; a sketch of the scaling step follows this paragraph.
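A minimal sketch of this per-vector scaling (the helper name scale_columns is ours; it assumes each column of V is one feature vector, as in the text):

```python
import numpy as np

def scale_columns(V):
    """Divide each column (feature vector) of nonnegative V by its maximal
    component, so that all entries fall into [0, 1]."""
    col_max = V.max(axis=0, keepdims=True)
    col_max[col_max == 0] = 1.0          # leave all-zero columns unchanged
    return V / col_max
```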
The second modification concerns the initialization of NMF iterations when mapping unseen data (known as generalization), i.e. after the basis matrix W has been learned. It is generally acknowledged that proper initialization significantly speeds up the convergence of an iterative algorithm. Since such a mapping in NMF involves only the matrix H (W is kept fixed), only the initialization of H needs to be specified. We propose to initially set H_new to (W^T W)^{-1} W^T V_new, i.e. to the solution provided by the direct mapping technique in Section 2, because 1) it provides a better initial approximation for H_new than a random guess, and 2) it moves the start of iterations closer to the final point, since the objective function F in Eq. (4) is increasing [1] and the inequality F^{direct} > F^{iter} always holds at initialization (the theorem below proves this fact), where F^{direct} and F^{iter} stand for the values of F when using the direct and iterative techniques, respectively.

Theorem 1. Let F^{direct} and F^{iter} be the values of the objective function in Eq. (4) when mapping unseen data with the direct and iterative techniques, respectively. Then F^{direct} − F^{iter} > 0 always holds at the start of iterations when using Eq. (3).

Proof. By definition,

F^{iter} = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( V_{ij} \log(WH)_{ij} - (WH)_{ij} \right),

F^{direct} = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( V_{ij} \log V_{ij} - V_{ij} \right).

The difference F^{direct} − F^{iter} is equal to

\sum_{i=1}^{n} \sum_{j=1}^{m} \left( V_{ij} \log V_{ij} - V_{ij} - V_{ij} \log(WH)_{ij} + (WH)_{ij} \right)
= \sum_{i=1}^{n} \sum_{j=1}^{m} \left( V_{ij} \left( \log \frac{V_{ij}}{(WH)_{ij}} - 1 \right) + (WH)_{ij} \right)
= \sum_{i=1}^{n} \sum_{j=1}^{m} \left( V_{ij} \log \frac{V_{ij}}{10 (WH)_{ij}} + (WH)_{ij} \right).

In the last expression we assumed the logarithm to the base 10 (so that 1 = \log_{10} 10 can be moved inside the logarithm), though the natural logarithm or the logarithm to the base 2 could be used as well without affecting the proof. Given that all three matrices involved are nonnegative, the last expression is positive if either of the following conditions is satisfied:

1. \log \frac{V_{ij}}{10 (WH)_{ij}} > 0, or

2. (WH)_{ij} > V_{ij} \log \frac{V_{ij}}{10 (WH)_{ij}}.

Let us introduce a new variable t = \frac{V_{ij}}{(WH)_{ij}}. Then the above conditions can be written as
1. \log \frac{t}{10} > 0, i.e. \log t > 1,

2. 1 > t \log \frac{t}{10}.

The first condition is satisfied if t > 10, whereas the second is satisfied if t < t_0 (t_0 ≈ 12). Therefore either t > 10 or t < 12 should hold for F^{direct} > F^{iter}. Since the union of these two conditions covers the whole interval [0, +∞[, F^{direct} > F^{iter} holds independently of t, i.e. of whether V_{ij} > (WH)_{ij} or not. The proof remains valid for the base 2 and natural logarithms, too, because 2 or 2.71, replacing 10 above, is still smaller than 12, so the whole interval [0, +∞[ is again covered. Q.E.D.

Because our approach combines both the direct and the iterative techniques for mapping unseen data, we will call it iterative2: the whole procedure remains essentially iterative, only the initialization is modified.
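As a quick numerical illustration of Theorem 1, one can compare the two objective values at the moment generalization starts; in this sketch random nonnegative matrices stand in for a learned basis W and scaled unseen data V_new, and the helper name objective is ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 30, 20, 5
W = rng.random((n, r)) + 1e-3                   # stands in for a learned basis
V_new = rng.random((n, m)) + 1e-3               # nonnegative "unseen" data
eps = 1e-12

def objective(V, W, H):
    WH = W @ H + eps
    return np.sum(V * np.log(WH) - WH)          # Eq. (4)

H_direct = np.linalg.solve(W.T @ W, W.T @ V_new)
H_direct = np.clip(H_direct, eps, None)         # negative entries set to (near) zero
H_random = rng.random((r, m)) + 1e-3            # the iterative technique's random start

F_direct = objective(V_new, W, H_direct)
F_iter = objective(V_new, W, H_random)
# F_direct is typically the larger of the two, in line with Theorem 1
print(F_direct, F_iter)
```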
4 Summary of Our Algorithm

Suppose that the whole data set is divided into a training and a test (unseen) set. Our algorithm is summarized as follows:

1. Scale both the training and the test data and randomly initialize the factorized matrices as described in Section 3. Set the parameters tol and r.

2. Iterate Eqs. (1)-(3) until convergence to obtain the NMF basis matrix W and to map the training data to the NMF (reduced) space.

3. Given W, map the test data by using the direct technique. Set negative values in the resulting matrix H_new^direct to zero.

4. Fix the basis matrix and iterate Eq. (3) until convergence, using H_new^direct as the initialization of the iterations. The resulting matrix H_new^iterative2 provides the reduced representations of the test data in the NMF space.

Thus, the main distinction of our NMF algorithm from others is that it combines both techniques for mapping unseen data instead of applying only one of them; a sketch of the whole pipeline follows.
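A self-contained sketch of the four steps above (Python/NumPy; all function names are ours, and the update loops mirror the earlier sketches rather than the authors' MATLAB implementation):

```python
import numpy as np

EPS = 1e-12

def scale_columns(V):
    """Step 1: divide each column (feature vector) by its maximal component."""
    col_max = V.max(axis=0, keepdims=True)
    col_max[col_max == 0] = 1.0
    return V / col_max

def objective(V, W, H):
    WH = W @ H + EPS
    return np.sum(V * np.log(WH) - WH)                 # Eq. (4)

def nmf_train(V, r, tol=0.05, max_iter=5000, rng=None):
    """Step 2: learn W and map the training data via Eqs. (1)-(3)."""
    rng = np.random.default_rng(rng)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-4
    H = rng.random((r, m)) + 1e-4
    F_old = objective(V, W, H)
    for _ in range(max_iter):
        W *= (V / (W @ H + EPS)) @ H.T                 # Eq. (1)
        W /= W.sum(axis=0, keepdims=True)              # Eq. (2)
        H *= W.T @ (V / (W @ H + EPS))                 # Eq. (3)
        F_new = objective(V, W, H)
        if F_new - F_old < tol:
            break
        F_old = F_new
    return W, H

def map_iterative2(W, V_new, tol=0.05, max_iter=5000):
    """Steps 3-4: direct solution, negatives clipped, then Eq. (3) iterations."""
    H = np.linalg.solve(W.T @ W, W.T @ V_new)          # direct technique
    H = np.clip(H, EPS, None)                          # negative values set to (near) zero
    F_old = objective(V_new, W, H)
    for _ in range(max_iter):
        H *= W.T @ (V_new / (W @ H + EPS))             # Eq. (3), W fixed
        F_new = objective(V_new, W, H)
        if F_new - F_old < tol:
            break
        F_old = F_new
    return H

# Usage on random stand-in data with the dimensions used in the experiments
# (125 features, 313 training and 385 test vectors):
V_train = scale_columns(np.random.rand(125, 313))
V_test = scale_columns(np.random.rand(125, 385))
W, H_train = nmf_train(V_train, r=50)
H_test = map_iterative2(W, V_test)
```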
5 Practical Application

As a challenging task, we selected protein fold recognition from bioinformatics. A protein is an amino acid sequence. One of the current trends in bioinformatics is to understand evolutionary relationships in terms of protein function. Two common approaches to identifying protein function are sequence analysis and structure analysis. Sequence analysis is based on a comparison between unknown sequences and those whose function is already known; it is argued to work well at high levels of sequence identity, but below 50% identity it becomes less reliable. Protein fold recognition is structure analysis that does not rely on sequence similarity. Proteins are said to have a common fold if they share the same major secondary structures (regions of local regularity within a fold) in the same arrangement and with the same topological connections.

6 Experiments

The experiments with NMF involve estimating the error rate when combining NMF and a classifier. Three techniques for generalization (mapping of the test data) are used: direct, iterative, and iterative2. The training data are mapped into the reduced space according to Eqs. (1)-(3) with simultaneous learning of the matrix W. We experimented with the real-world data set created by Ding and Dubchak [10], briefly described below. Tests were repeated 10 times to collect the statistics necessary for comparing the three generalization techniques. Each time, a different random initialization was used for learning the basis matrix W, but the same learned basis matrix was then utilized in each run for all generalization techniques, in order to make the comparison of the three generalization techniques as fair as possible. The values of r (the dimensionality of the reduced space) were set to 25, 50, 75, and 88 (88 is the largest possible value of r that still yields dimensionality reduction with NMF for the 125x313 training matrix used in the experiments), which constitutes 20%, 40%, 60%, and 71.2% of the original dimensionality, respectively, while tol was equal to 0.05. In all reported statistical tests α = 0.05. All algorithms were implemented in MATLAB running on a Pentium 4 (3 GHz CPU, 1 GB RAM).

6.1 Data

A challenging data set derived from the SCOP (Structural Classification of Proteins) database was used in the experiments. It is available online at http://crd.lbl.gov/~cding/protein/ and its detailed description can be found in [10]. The data set contains the 27 most populated folds, each represented by seven or more proteins. Ding and Dubchak already split it into training and test sets, which we use as other authors did. The feature vectors characterizing protein folds are 125-dimensional. The training set consists of 313 protein folds having (for each pair of proteins) no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set of 385 folds is composed of protein sequences having less than 40% identity with each other and less than 35% identity with the proteins of the first set. In fact, 90% of the proteins in the test set have less than 25% sequence identity with the proteins of the training set. This, as well as the multiple classes, many of which are sparsely represented, renders the task extremely difficult.
6.2 Classification Results

The K-Local Hyperplane Distance Nearest Neighbor (HKNN) classifier [11] was selected, since it demonstrated performance competitive with the support vector machine (SVM) when both methods were applied to classify the abovementioned protein data set in the original space, i.e. without dimensionality reduction [12]. In addition, when applied to other data sets, the combination of NMF and HKNN showed very good results [13], thus rendering HKNN a natural choice for NMF. HKNN computes the distance of each test point x to L local hyperplanes, where L is the number of different classes. The l-th hyperplane is built from the K nearest neighbors of x in the training set belonging to the l-th class. A test point x is assigned to the class whose hyperplane is closest to x. HKNN has two parameters to predefine, K and λ (regularization). Normalized features were used, since feature normalization to zero mean and unit variance prior to HKNN dramatically increases the classification accuracy (on average by 6 percent). The optimal values of K and λ, determined via cross-validation, are 7 and 8, respectively.

Table 1 shows the error rates obtained with HKNN, SVM, and various neural networks when classifying protein folds in the original, 125-dimensional space. A brief survey of the methods used for classification of this data set can be found in [12]. From Table 1 one can see that HKNN outperformed all but one of its competitors, even though some of them employed ensembles of classifiers (in [14], an ensemble of four-layer Discretized Interpretable Multi-Layer Perceptrons (DIMLP) was used, together with bagging or arcing for combining the individual classifiers; with a single DIMLP, the error rate was 61.8% as reported in [14]).
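Since HKNN is central to the experiments, the following is a minimal outline of the classifier as we understand it from [11]; it is our own sketch (with a ridge-regularized least squares solve for the local hyperplane coordinates), and the parameter names n_neighbors and lam are ours.

```python
import numpy as np

def hknn_predict(X_train, y_train, X_test, n_neighbors=7, lam=8.0):
    """Sketch of K-Local Hyperplane Distance Nearest Neighbor (HKNN).
    For each class, a local hyperplane is built from the K nearest training
    points of that class, and the test point is assigned to the class with
    the smallest (regularized) distance to its hyperplane."""
    classes = np.unique(y_train)
    predictions = []
    for x in X_test:
        best_class, best_dist = None, np.inf
        for c in classes:
            Xc = X_train[y_train == c]
            k = min(n_neighbors, len(Xc))
            # K nearest neighbors of x within class c
            idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:k]
            N = Xc[idx]
            centroid = N.mean(axis=0)
            V = (N - centroid).T                  # directions spanning the local hyperplane
            # ridge-regularized coordinates of x on the local hyperplane
            alpha = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ (x - centroid))
            residual = x - centroid - V @ alpha
            dist = residual @ residual + lam * (alpha @ alpha)
            if dist < best_dist:
                best_class, best_dist = c, dist
        predictions.append(best_class)
    return np.array(predictions)
```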
Table 1. Error rates (%) when classifying protein folds in the original space with different methods.

Source   Classifier   Error rate
[14]     DIMLP-B      38.8
[14]     DIMLP-A      40.9
[15]     MLP          51.2
[15]     GRNN         55.8
[15]     RBFN         50.6
[15]     SVM          48.6
[16]     RBFN         48.8
[10]     SVM          46.1
[12]     HKNN         42.7

Let us now turn to the error rates in the reduced space. As mentioned above, we tried four different values of r and three different generalization techniques, with and without scaling prior to NMF. For each value of r, NMF followed by HKNN was repeated 10 times, and the results are assembled in a single column vector

[e_1^{88}, ..., e_{10}^{88}, e_1^{75}, ..., e_{10}^{75}, e_1^{50}, ..., e_{10}^{50}, e_1^{25}, ..., e_{10}^{25}]^T,

where the first ten error rates correspond to r = 88, the next ten to r = 75, etc. As a result, a 40x6 matrix containing the error rates was generated (40 rows = 10 runs times 4 values of r; 6 columns = 3 generalization techniques times 2 scaling options for NMF). This matrix was then subjected to the one-way analysis of variance (ANOVA) and multiple comparison tests in order to draw statistically grounded conclusions about all generalization techniques. Table 2 shows the identifiers used below for the generalization techniques.

Table 2. Generalization techniques and their identifiers.

Identifier   Technique    Scaling prior to NMF
1            Direct       No
2            Iterative    No
3            Iterative2   No
4            Direct       Yes
5            Iterative    Yes
6            Iterative2   Yes

The one-way ANOVA test is first used to find out whether the mean error rates of all six techniques are the same (null hypothesis H_0: µ_1 = µ_2 = ... = µ_6) or not. If the returned p-value is smaller than α = 0.05 (which happened in our case), the null hypothesis is rejected, which implies that the mean error rates are not all the same. The next step is to determine, by means of the multiple comparison test, which pairs of means are significantly different and which are not. Such a sequence of tests constitutes the standard statistical protocol when two or more means are to be compared. Table 3 contains the results of the multiple comparison test; these results confirm that the direct technique stands apart from both iterative techniques. The main conclusions from Table 3 are µ_1 = µ_4 and µ_2 = µ_3 = µ_5 = µ_6, i.e. there are two groups of techniques, and the mean of the first group is larger than that of the second group according to Table 4. This implies that the direct technique tends to lead to a higher error than either iterative technique when classifying protein folds. On the other hand, the multiple comparison test did not detect any significant difference between the mean error rates of the two iterative techniques, though the standard iterative technique, while being slower than the iterative2 technique (see the next subsection), appears to be slightly more accurate on the data set used.

The last column in Table 4 points to a very interesting result: independently of feature normalization prior to NMF, the standard deviation of the error rate is lowest for our technique. This implies that our modifications of NMF led to a visible reduction in the deviation of the classification error. This reduction is caused by shrinking the search space of possible factorizations: our initialization eliminates some potentially erroneous solutions before the iterations even start and thus leads to a more stable classification error. However, when to stop the iterations in order to achieve the highest accuracy remains an open issue. One can object that the error rates in the reduced space are larger than the error rate (42.7%) achieved in the original space. Nevertheless, we observed that the error in the reduced space can sometimes be lower than 42.7%: for example, the minimal error when applying the iterative technique with no scaling before NMF and r = 50 is 39.22%, while the minimal error when using the iterative2 technique under the same conditions is 42.34%. The varying errors can be attributed to the fact that the NMF factorization is not unique.
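A sketch of this statistical protocol (the paper ran it in MATLAB; here we outline an equivalent analysis with SciPy/statsmodels, using Tukey's HSD as the multiple comparison test and random numbers in place of the real 40x6 matrix of error rates):

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# error_rates: 40 x 6 matrix, one column per technique identifier (Table 2);
# random numbers stand in for the real measurements here.
error_rates = 45 + 5 * np.random.rand(40, 6)

# One-way ANOVA: H0 says all six mean error rates are equal.
f_stat, p_value = f_oneway(*[error_rates[:, i] for i in range(6)])
print(f"ANOVA p-value: {p_value:.4f}")        # reject H0 if p < 0.05

# Multiple comparison to see which pairs of means differ significantly.
values = error_rates.ravel(order="F")         # stack columns
groups = np.repeat(np.arange(1, 7), 40)       # technique identifiers 1..6
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```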
Table 3. Results of the multiple comparison test. The numbers in the first two columns are technique identifiers. The third and fifth columns contain the endpoints of a confidence interval for the estimated difference in means (fourth column) of the two techniques with the given identifiers. If the confidence interval contains 0 (zero), the difference in means is not significant, i.e. H_0 is not rejected.

1  2    4.38    5.99    7.61   Reject H0: µ1 = µ2
1  3    3.34    4.96    6.57   Reject H0: µ1 = µ3
1  4   -2.07   -0.46    1.16   Accept H0: µ1 = µ4
1  5    4.79    6.41    8.03   Reject H0: µ1 = µ5
1  6    3.00    4.61    6.23   Reject H0: µ1 = µ6
2  3   -2.66   -1.04    0.58   Accept H0: µ2 = µ3
2  4   -8.07   -6.45   -4.83   Reject H0: µ2 = µ4
2  5   -1.20    0.42    2.03   Accept H0: µ2 = µ5
2  6   -3.00   -1.38    0.24   Accept H0: µ2 = µ6
3  4   -7.03   -5.41   -3.79   Reject H0: µ3 = µ4
3  5   -0.16    1.46    3.07   Accept H0: µ3 = µ5
3  6   -1.97   -0.34    1.28   Accept H0: µ3 = µ6
4  5    5.25    6.87    8.48   Reject H0: µ4 = µ5
4  6    3.45    5.07    6.69   Reject H0: µ4 = µ6
5  6   -3.42   -1.80   -0.18   Reject H0: µ5 = µ6
Table 4. Mean error rates (%) for all six generalization techniques (the standard error for all techniques is 0.4).

Identifier   Mean error rate   Std. deviation
1            50.93             3.08
2            44.94             2.52
3            45.98             2.11
4            51.38             3.31
5            44.52             2.13
6            46.32             1.72

The overall conclusion about the three generalization techniques is that, for the given data set, the direct technique, despite being the fastest of the three, lagged far behind either iterative technique in terms of classification accuracy when NMF is followed by HKNN. Among the iterative techniques, the conventional technique provides slightly better accuracy (though the multiple comparison test did not detect any statistically significant difference) than the iterative2 technique, but at slower convergence, as will be demonstrated in the next subsection.

6.3 Time

Since the iterative techniques showed better performance than the direct technique, we compare them in terms of time. Table 5 lists the gains resulting from our modifications for different dimensionalities of the reduced space. R1 stands for the ratio of the average times spent on learning the NMF basis and mapping the training data to the NMF space without and with scaling prior to NMF. R2 is the ratio of the average times spent on generalization by means of the iterative technique without and with scaling prior to NMF. R3 is the ratio of the average times spent on generalization by means of the iterative2 technique without and with scaling prior to NMF. R4 is the ratio of the average times spent on learning and generalization together by means of the iterative technique without and with scaling prior to NMF. R5 is the ratio of the average times spent on learning and generalization together by means of the iterative2 technique without and with scaling prior to NMF. R6 is the gain in time obtained by applying the iterative2 technique for generalization instead of the iterative one when no scaling before NMF is done, whereas R7 is the same gain when scaling before NMF is done. Overall, the average gain in time obtained with our modifications is more than 11 times.
Table 5. Gains in time resulting from our modifications of the conventional NMF algorithm.

r         R1     R2     R3     R4     R5     R6    R7
88        11.9   11.4   10.4   11.7   11.5   1.5   1.4
75        13.8   12.9   13.1   13.6   13.7   1.6   1.6
50        13.2   11.1   12.5   12.6   13.1   1.6   1.7
25        9.5    6.4    8.8    8.7    9.5    1.6   2.2
Average   12.1   10.4   11.2   11.6   11.9   1.6   1.8
7 Conclusion

The main contribution of this work is two modifications of the basic NMF algorithm and their practical application to the challenging real-world task of protein fold recognition. The first modification concerns feature scaling before NMF, while the second combines two known generalization techniques, which we called direct and iterative; the former is used as a starting point for the updates of the latter, leading to a new generalization technique called iterative2. We proved, both theoretically and experimentally, that at the start of generalization the objective function in Eq. (4) of the new technique is always larger than that of the conventional iterative technique. Since this function is increasing, the generalization procedure of the iterative2 technique converges faster than that of the ordinary iterative technique.

When modifying NMF, we were also interested in the error rate obtained when combining NMF and a classifier (HKNN in this work). To this end, we tested all three generalization techniques. Statistical analysis of the results indicates that the mean error rate associated with the direct technique is higher than that of either iterative technique, while the two iterative techniques lead to statistically similar error rates. Since the iterative2 technique provides a faster mapping of unseen data, it is advantageous to apply it instead of the ordinary iterative one, especially in combination with feature scaling before NMF, because of the significant gain in time resulting from both modifications.
References

[1] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.

[2] I. Buciu and I. Pitas. Application of non-negative and local non-negative matrix factorization to facial expression recognition. In Proc. 17th Int. Conf. on Pattern Recognition, Cambridge, UK, pages 288-291, 2004.

[3] D. Guillamet and J. Vitrià. Discriminant basis for object classification. In Proc. 11th Int. Conf. on Image Analysis and Processing, Palermo, Italy, pages 256-261, 2001.

[4] X. Chen, L. Gu, S.Z. Li, and H.-J. Zhang. Learning representative local features for face detection. In Proc. 2001 IEEE Conf. on Computer Vision and Pattern Recognition, Kauai, HI, pages 1126-1131, 2001.

[5] T. Feng, S.Z. Li, H.-Y. Shum, and H.-J. Zhang. Local non-negative matrix factorization as a visual representation. In Proc. 2nd Int. Conf. on Development and Learning, Cambridge, MA, pages 178-183, 2002.

[6] D. Guillamet and J. Vitrià. Evaluation of distance metrics for recognition based on non-negative matrix factorization. Pattern Recognition Letters, 24(9-10):1599-1605, 2003.

[7] D. Guillamet, J. Vitrià, and B. Schiele. Introducing a weighted non-negative matrix factorization for image classification. Pattern Recognition Letters, 24(14):2447-2454, 2003.

[8] Y. Wang, Y. Jia, C. Hu, and M. Turk. Fisher non-negative matrix factorization for learning local features. In Proc. 6th Asian Conf. on Computer Vision, Jeju Island, Korea, 2004.

[9] M. Rajapakse and L. Wyse. NMF vs ICA for face recognition. In Proc. 3rd Int. Symp. on Image and Signal Processing and Analysis, Rome, Italy, pages 605-611, 2003.

[10] C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349-358, 2001.

[11] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 985-992. MIT Press, Cambridge, MA, 2002.

[12] O. Okun. Protein fold recognition with K-local hyperplane distance nearest neighbor algorithm. In Proc. 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, Pisa, Italy, pages 47-53, 2004.

[13] O. Okun. Non-negative matrix factorization and classifiers: experimental study. In Proc. 4th IASTED Int. Conf. on Visualization, Imaging, and Image Processing, Marbella, Spain, pages 550-555, 2004.

[14] G. Bologna and R.D. Appel. A comparison study on protein fold recognition. In Proc. 9th Int. Conf. on Neural Information Processing, Singapore, pages 2492-2496, 2002.

[15] I.-F. Chung, C.-D. Huang, Y.-H. Shen, and C.-T. Lin. Recognition of structure classification of protein folding by NN and SVM hierarchical learning architecture. In O. Kaynak, E. Alpaydin, E. Oja, and L. Xu, editors, Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003, Istanbul, Turkey, Lecture Notes in Computer Science, volume 2714, pages 1159-1167. Springer-Verlag, Berlin, 2003.

[16] N.R. Pal and D. Chakraborty. Some new features for protein fold recognition. In O. Kaynak, E. Alpaydin, E. Oja, and L. Xu, editors, Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003, Istanbul, Turkey, Lecture Notes in Computer Science, volume 2714, pages 1176-1183. Springer-Verlag, Berlin, 2003.