IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 3, MARCH 2015


Speeding Up Graph Regularized Sparse Coding by Dual Gradient Ascent

Rui Jiang, Hong Qiao, Senior Member, IEEE, and Bo Zhang, Member, IEEE

Abstract—Graph regularized Sparse Coding (GSC) considers data relationships during Sparse Coding (SC) and thus has better performance in certain image analysis tasks. However, it is very time consuming. This letter aims at speeding up GSC. The alternating optimization framework for GSC involves repeatedly solving a variant of $\ell_1$-minimization referred to as GSRsub in this letter. Traditional ways to deal with GSRsub generalize optimization strategies for $\ell_1$-minimization to solve its primal problem, which is strongly convex but non-differentiable, and thus converge slowly. We propose that GSC can be accelerated by solving a new dual problem of GSRsub called D-GSRsub. Compared with the primal form and the existing dual form of GSRsub, D-GSRsub has a strongly convex and smooth objective function with fewer variables. Based on these properties, four dual gradient ascent strategies with lower computational complexities are developed. Experimental results on real-world datasets demonstrate that these strategies can dramatically and stably speed up GSC without affecting its performance in the corresponding image analysis tasks.

Index Terms—Graph regularized sparse coding, image classification, image clustering.

I. INTRODUCTION

Sparse Coding (SC) has been a popular coding technique for generating representations in image analysis. Recent studies found that SC does not consider data relationships during the coding process, thus leading to information loss. To address this issue, Graph regularized SC (GSC) based methods have been proposed. ScSPM [1] is a representation learning method for image classification and uses spatial-pyramid max pooling of SIFT sparse codes. Gao et al. [2], [3] pointed out that SC cannot encode similar SIFT features as similar sparse codes. To reduce the loss of locality information, Laplacian Sparse Coding (LSc) was proposed to consider the similarity between SIFT features in ScSPM. As a holistic representation learning method for image classification and clustering, SC fails to consider the geometrical structure of images satisfying the manifold assumption. GraphSC [4] was presented to preserve the geometrical information during SC. GSC indeed improves the performance of SC. However, the great execution time of GSC has been a bottleneck that restricts its practical use [5]. This letter focuses on speeding up GSC.

Manuscript received July 09, 2014; revised September 11, 2014; accepted September 14, 2014. Date of publication September 17, 2014; date of current version September 29, 2014. This work was supported in part by NNSFC Grants 61210009, 61033011, 61379093 and 11131006. The associate editor coordinating the review for this manuscript and approving it for publication was Prof. Waheed U. Bajwa. R. Jiang and H. Qiao are with the State Key Lab of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]). B. Zhang is with LSEC & Institute of Applied Mathematics, AMSS, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2014.2358853

To solve the non-convex GSC problem, an alternating optimization framework is often employed. This framework treats GSC as two problems, the Graph regularized Sparse Representation (GSR) problem and the Dictionary Update (DU) problem, and then alternately solves these two problems until convergence. GSR can be separated into several subproblems with a common form (GSRsub) that is a variant of $\ell_1$-minimization, and these subproblems are solved sequentially, one after another. Traditional ways of solving GSRsub focus on its primal form, whose objective is strongly convex but non-differentiable. Typical methods for $\ell_1$-minimization, such as the Feature-sign search (Feature-sign) algorithm [6], the Iterative Shrinkage-Thresholding Algorithm (ISTA) [7] and its improved version FISTA [8], can be directly extended to the primal problem of GSRsub. Note that Feature-sign was first proposed in [6] for SC and then modified in [2]–[4] for GSRsub. GSRsub can also be treated as an $\ell_1$-regularized QP ($\ell_1$-QP) problem (see, e.g., [9], [10]) and solved by the Cyclical Coordinate Descent (CCD) method. Further, a dual form of $\ell_1$-QP (D-$\ell_1$-QP) was also developed in [10]. In this letter, we propose another dual problem of GSRsub (D-GSRsub) which has two advantages over the primal problem and D-$\ell_1$-QP: (1) the number of variables of D-GSRsub is smaller than that of both D-$\ell_1$-QP and the primal problem, and (2) the objective function of D-GSRsub is smooth and strongly convex. These advantages make it possible for fast gradient ascent strategies to be applied to D-GSRsub with lower computational complexities. In this letter, four such strategies are applied to D-GSRsub, and their performances are verified in image classification and clustering experiments on real-world datasets.

II. GRAPH REGULARIZED SPARSE CODING

A. Model

Data and their relationships can be modeled by a weighted graph whose nodes are data points and whose edge weight matrix $W$ represents pairwise relationships between data. Generally, $W$ is symmetric and nonnegative. The idea of GSC is to augment the objective of the SC model with a graph regularization term $\frac{1}{2}\sum_{i,j} W_{ij}\,\|\mathbf{s}_i - \mathbf{s}_j\|_2^2 = \mathrm{Tr}(S L S^{\top})$, where $\mathbf{s}_i$ and $\mathbf{s}_j$ are the sparse codes of data points $\mathbf{x}_i$ and $\mathbf{x}_j$ respectively, $W_{ij}$ describes their pairwise relationship, $S = [\mathbf{s}_1,\dots,\mathbf{s}_n]$, and $L$ is the graph Laplacian of $W$. Thus the GSC model is


$$\min_{D,\,S}\ \|X - DS\|_F^2 + \alpha\,\mathrm{Tr}(S L S^{\top}) + \beta\sum_{i=1}^{n}\|\mathbf{s}_i\|_1 \quad \text{s.t.}\ \|\mathbf{d}_k\|_2 \le 1,\ k = 1,\dots,K, \tag{1}$$


where $X = [\mathbf{x}_1,\dots,\mathbf{x}_n] \in \mathbb{R}^{m\times n}$ are the data, $D = [\mathbf{d}_1,\dots,\mathbf{d}_K] \in \mathbb{R}^{m\times K}$ is an overcomplete dictionary with $K > m$, $S = [\mathbf{s}_1,\dots,\mathbf{s}_n] \in \mathbb{R}^{K\times n}$ are the sparse codes, $\alpha$ is the graph regularization parameter and $\beta$ is the sparsity regularization parameter.

B. Optimization

The GSC model (1) is non-convex, so finding a global solution is unrealistic. However, it is feasible to attain a local minimum by alternately solving two convex problems:

(i) the Graph regularized Sparse Representation (GSR) problem: fixing $D$, we have

$$\min_{S}\ \|X - DS\|_F^2 + \alpha\,\mathrm{Tr}(S L S^{\top}) + \beta\sum_{i=1}^{n}\|\mathbf{s}_i\|_1; \tag{2}$$

(ii) the Dictionary Update (DU) problem: once $S$ is updated, fix $S$ and update $D$ by solving

$$\min_{D}\ \|X - DS\|_F^2 \quad \text{s.t.}\ \|\mathbf{d}_k\|_2 \le 1,\ k = 1,\dots,K. \tag{3}$$

This letter mainly considers the optimization of GSR. The GSR problem (2) can be split into $n$ subproblems

$$\min_{\mathbf{s}_i}\ \|\mathbf{x}_i - D\mathbf{s}_i\|_2^2 + \alpha L_{ii}\,\mathbf{s}_i^{\top}\mathbf{s}_i + 2\alpha\,\mathbf{s}_i^{\top}\mathbf{h}_i + \beta\|\mathbf{s}_i\|_1, \qquad \mathbf{h}_i = \sum_{j\neq i} L_{ij}\,\mathbf{s}_j,$$

that is, updating each sparse code individually while holding the other sparse codes fixed. Dropping the superscripts and subscripts, we have the GSR subproblem (GSRsub)

$$\min_{\mathbf{s}}\ \|\mathbf{x} - D\mathbf{s}\|_2^2 + \alpha L_{ii}\,\mathbf{s}^{\top}\mathbf{s} + 2\alpha\,\mathbf{s}^{\top}\mathbf{h} + \beta\|\mathbf{s}\|_1, \tag{4}$$

where the scalar $L_{ii} > 0$ and the vector $\mathbf{h}$ come from the graph regularization term and are fixed within each subproblem.
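As an illustration of this alternating scheme and of the per-code updates just described, the following Python sketch outlines the GSR/DU loop. It assumes the standard GraphSC-style formulation of (2)–(4); the callables `solve_gsrsub` and `update_dictionary` are placeholders, not the authors' implementation.

```python
import numpy as np

def gsr_step(X, D, S, Lap, alpha, beta, solve_gsrsub):
    """One pass of the GSR step (2): update each code while the others are fixed."""
    n = X.shape[1]
    for i in range(n):
        # graph coupling from the other (fixed) codes: h_i = sum_{j != i} L_ij s_j
        h = S @ Lap[:, i] - Lap[i, i] * S[:, i]
        S[:, i] = solve_gsrsub(X[:, i], D, Lap[i, i], h, alpha, beta)  # GSRsub (4)
    return S

def alternate_gsc(X, Lap, K, alpha, beta, solve_gsrsub, update_dictionary, n_iter=20):
    """Alternate between the GSR problem (2) and the DU problem (3)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((m, K))
    D /= np.linalg.norm(D, axis=0)        # unit l2-norm columns, as required by (3)
    S = np.zeros((K, n))
    for _ in range(n_iter):
        S = gsr_step(X, D, S, Lap, alpha, beta, solve_gsrsub)  # codes with D fixed
        D = update_dictionary(X, S)                            # dictionary with S fixed, e.g. LagDual [6]
    return D, S
```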

Traditional ways of dealing with GSRsub generalize optimization strategies for $\ell_1$-minimization to solve the primal problem (4), whose objective function is strongly convex but non-differentiable. Typical strategies are the Feature-sign search (Feature-sign) algorithm [2]–[4], the ISTA algorithm [7] and its improvement FISTA [8]. Feature-sign solves GSRsub by first iteratively searching and refining the signs of the elements of $\mathbf{s}$ and then solving a least squares problem based on the search result. ISTA and FISTA consider the minimization of a composite objective function

$$\min_{\mathbf{s}}\ F(\mathbf{s}) = f(\mathbf{s}) + g(\mathbf{s}), \tag{5}$$

where $f$ and $g$ satisfy the following three assumptions [8]: (i) $g$ is a continuous convex function; (ii) $f$ is convex with Lipschitz continuous first-order derivatives, i.e., $\|\nabla f(\mathbf{s}_1) - \nabla f(\mathbf{s}_2)\|_2 \le L_f\|\mathbf{s}_1 - \mathbf{s}_2\|_2$, where $L_f$ is the Lipschitz constant of $\nabla f$; (iii) problem (5) is solvable. It is obvious that GSRsub is a special case of problem (5) with $f(\mathbf{s}) = \|\mathbf{x} - D\mathbf{s}\|_2^2 + \alpha L_{ii}\,\mathbf{s}^{\top}\mathbf{s} + 2\alpha\,\mathbf{s}^{\top}\mathbf{h}$ and $g(\mathbf{s}) = \beta\|\mathbf{s}\|_1$, and $L_f = 2(\|D\|_2^2 + \alpha L_{ii})$, where $\|D\|_2$ is the spectral norm of $D$. ISTA is a first-order method with a sublinear global convergence rate $O(1/k)$, and FISTA¹ is an improved ISTA with a better global convergence rate $O(1/k^2)$ [8] (a minimal sketch of this strategy applied to (4) is given at the end of this section). Here, $k$ is the iteration counter. Besides these two strategies, and inspired by other lines of work on sparse representation, such as the Graphical LASSO [9], [10], we can also treat the primal problem of GSRsub as an $\ell_1$-regularized QP ($\ell_1$-QP) problem

$$\min_{\mathbf{s}}\ \tfrac{1}{2}\,\mathbf{s}^{\top} Q\,\mathbf{s} + \mathbf{q}^{\top}\mathbf{s} + \beta\|\mathbf{s}\|_1,$$

with a positive definite matrix $Q$ and a vector $\mathbf{q}$ determined by $D$, $\mathbf{x}$, $\alpha$, $L_{ii}$ and $\mathbf{h}$, which can be solved by the CCD algorithm [10] with an $O(1/k)$ convergence rate [11]. Further, a dual problem of GSRsub can be derived as a dual problem of $\ell_1$-QP (D-$\ell_1$-QP); see [10]. Its constraint set is described by the indicator function of the $\ell_\infty$ ball of radius $\beta$. Note that the number of variables in D-$\ell_1$-QP is the same as in the primal problem of GSRsub and that the objective function of D-$\ell_1$-QP is still non-differentiable. In the next section, we propose a novel dual form of GSRsub whose objective function is smooth and strongly convex with fewer variables.

¹To prevent confusion, we hereafter call the strategy using FISTA on the primal problem of GSRsub Primal FISTA (PFISTA).
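For concreteness, here is a minimal Python sketch of PFISTA, i.e., FISTA applied to the primal problem (4). The objective splitting and the Lipschitz constant follow the GSRsub form assumed above and are illustrative only.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pfista_gsrsub(x, D, Lii, h, alpha, beta, n_iter=1000):
    K = D.shape[1]
    Lf = 2.0 * (np.linalg.norm(D, 2) ** 2 + alpha * Lii)  # Lipschitz constant of the smooth part
    s, y, t = np.zeros(K), np.zeros(K), 1.0
    for _ in range(n_iter):
        grad = 2.0 * (D.T @ (D @ y - x) + alpha * Lii * y + alpha * h)
        s_new = soft_threshold(y - grad / Lf, beta / Lf)   # proximal (ISTA) step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0   # FISTA momentum schedule
        y = s_new + ((t - 1.0) / t_new) * (s_new - s)      # extrapolation
        s, t = s_new, t_new
    return s
```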

III. DUAL GRADIENT ASCENT STRATEGIES FOR GSRSUB

A. Novel Dual Problem of GSRsub (D-GSRsub)

Let $f_1$ denote the data-fidelity term of (4) as a function of $D\mathbf{s}$, and let $f_2$ collect the remaining graph and sparsity terms as a function of $\mathbf{s}$. We recast (4) as an equality constrained problem by introducing the auxiliary variable $\mathbf{z} = D\mathbf{s}$:

$$\min_{\mathbf{s},\,\mathbf{z}}\ f_1(\mathbf{z}) + f_2(\mathbf{s}) \quad \text{s.t.}\ \mathbf{z} = D\mathbf{s}.$$

Its associated Lagrangian is

$$\mathcal{L}(\mathbf{s},\mathbf{z},\boldsymbol{\lambda}) = f_1(\mathbf{z}) + f_2(\mathbf{s}) + \boldsymbol{\lambda}^{\top}(D\mathbf{s} - \mathbf{z}),$$

where $\boldsymbol{\lambda} \in \mathbb{R}^{m}$ is the vector of dual variables. Note that both $f_1$ and $f_2$ are strictly convex, so their conjugate functions $f_1^*$ and $f_2^*$ are well defined. Thus the dual objective function, denoted by $g(\boldsymbol{\lambda})$, can be written as

$$g(\boldsymbol{\lambda}) = \min_{\mathbf{s},\,\mathbf{z}}\ \mathcal{L}(\mathbf{s},\mathbf{z},\boldsymbol{\lambda}) = -f_1^*(\boldsymbol{\lambda}) - f_2^*(-D^{\top}\boldsymbol{\lambda}).$$

This leads to a dual problem of (4): $\max_{\boldsymbol{\lambda}} g(\boldsymbol{\lambda})$. We refer to this problem as D-GSRsub and recast it as

$$\min_{\boldsymbol{\lambda}}\ \bar{g}(\boldsymbol{\lambda}) := f_1^*(\boldsymbol{\lambda}) + f_2^*(-D^{\top}\boldsymbol{\lambda}), \tag{6}$$

where $\bar{g} = -g$. The detailed derivation of $f_1^*$, $f_2^*$ and $\nabla\bar{g}$ is given in Appendix A. Before proceeding further, we recall the following properties of strongly convex functions, the proofs of which can be found in [12], [13].

Proposition 3.1: $f$ is strongly convex with parameter $\sigma > 0$ if and only if $f - \frac{\sigma}{2}\|\cdot\|_2^2$ is convex.

Proposition 3.2: Suppose $f$ is closed and strongly convex with parameter $\sigma$. Then its conjugate $f^*$ is differentiable, and $\nabla f^*$ is Lipschitz continuous with constant $1/\sigma$, i.e., $f^*$ is smooth.

Thanks to the graph regularization term, we have $\alpha L_{ii} > 0$. By this fact and the convexity of the sparsity term, it follows from Proposition 3.1 that $f_2$ is strongly convex. This, together with Proposition 3.2, implies that $f_2^*$ is differentiable and $\nabla f_2^*$ is Lipschitz continuous, i.e., $f_2^*$ is smooth.


Note that $\nabla\bar{g}$ can be written in closed form in terms of the soft-thresholding function [14]; the derivation is given in Appendix A. We conclude this subsection by summarizing the favorable properties of D-GSRsub: (i) the number of variables in D-GSRsub is $m$, which is less than $K$, the number of the primal variables and of the variables in D-$\ell_1$-QP, in the overcomplete dictionary setting; (ii) $\bar{g}$ is smooth, with a Lipschitz constant of $\nabla\bar{g}$ determined by $\|D\|_2$ and $\alpha L_{ii}$; moreover, by Proposition 3.1, $\bar{g}$ is strongly convex.

B. Dual Gradient Ascent (DGA) Strategies for GSRsub

The favorable properties of $\bar{g}$ lead to efficient gradient ascent strategies for D-GSRsub, i.e., gradient descent strategies for problem (6). In this letter, we only consider four such strategies.

1) DGA with Fixed Step Size: Due to the strong convexity and smoothness of $\bar{g}$, gradient descent with a fixed step size achieves a linear convergence rate [15].
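The following Python sketch shows a fixed-step DGA iteration. It relies on an assumed splitting $f_1(\mathbf{z}) = \|\mathbf{x}-\mathbf{z}\|_2^2$ and $f_2(\mathbf{s}) = \alpha L_{ii}\|\mathbf{s}\|_2^2 + 2\alpha\mathbf{h}^{\top}\mathbf{s} + \beta\|\mathbf{s}\|_1$ under the constraint $\mathbf{z} = D\mathbf{s}$, so the closed-form gradient and the smoothness constant below are a hedged reconstruction rather than the authors' exact expressions.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def neg_dual_grad(lam, x, D, Lii, h, alpha, beta):
    """Gradient of the (negated) dual objective in problem (6) under the assumed splitting."""
    # grad f1*(lam) = x + lam/2 ;  grad f2*(v) = soft(v - 2*alpha*h, beta) / (2*alpha*Lii)
    v = -D.T @ lam
    s = soft_threshold(v - 2.0 * alpha * h, beta) / (2.0 * alpha * Lii)
    return (x + lam / 2.0) - D @ s, s      # s is the primal code recovered from lam

def dga_fixed(x, D, Lii, h, alpha, beta, n_iter=1000):
    m = x.shape[0]
    # smoothness constant of the negated dual under the assumed splitting
    L = 0.5 + np.linalg.norm(D, 2) ** 2 / (2.0 * alpha * Lii)
    lam = np.zeros(m)
    for _ in range(n_iter):
        g, s = neg_dual_grad(lam, x, D, Lii, h, alpha, beta)
        lam -= g / L                        # fixed-step descent on the negated dual
    return s
```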

2) DGA with Variable Step Size: Inspired by a kind of accelerated ISTA called SpaRSA [16], we employ a progressive quadratic approximation of $\bar{g}$ and in each iteration use the Barzilai-Borwein (BB) method [17], with a subsequent update procedure, to choose the step size (a small sketch of the BB step appears after the DFISTA updates below). As shown in [18], for strongly convex objectives the convergence rate of SpaRSA is R-linear.

3) Dual FISTA (DFISTA): Recall the assumptions for ISTA and FISTA mentioned in Section II-B. Problem (6) is also a special case of problem (5), with the smooth part $f_1^*(\boldsymbol{\lambda}) + f_2^*(-D^{\top}\boldsymbol{\lambda})$ and a zero non-smooth part. We use a quadratic upper bound approximation of $\bar{g}$ at the extrapolated point in the $k$th iteration. Following the rationale of FISTA [8], we obtain an accelerated dual ascent strategy referred to as Dual FISTA (DFISTA). Let $t_1 = 1$ and $\mathbf{y}_1 = \boldsymbol{\lambda}_0$. Then the updates are as follows:

$$\boldsymbol{\lambda}_k = \mathbf{y}_k - \tfrac{1}{L_{\bar g}}\nabla\bar{g}(\mathbf{y}_k),\qquad t_{k+1} = \tfrac{1+\sqrt{1+4t_k^2}}{2},\qquad \mathbf{y}_{k+1} = \boldsymbol{\lambda}_k + \tfrac{t_k-1}{t_{k+1}}(\boldsymbol{\lambda}_k - \boldsymbol{\lambda}_{k-1}),$$
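To illustrate the variable-step choice in strategy 2), a small helper for the BB spectral step is sketched below; the safeguard bounds are hypothetical and the subsequent update procedure of SpaRSA [16] is omitted.

```python
import numpy as np

def bb_step(lam, lam_prev, grad, grad_prev, lo=1e-8, hi=1e8):
    """BB spectral step length computed from successive dual iterates and gradients."""
    d_lam, d_grad = lam - lam_prev, grad - grad_prev
    denom = d_lam @ d_grad
    if abs(denom) < 1e-12:          # degenerate curvature estimate: fall back to the upper safeguard
        return hi
    return float(np.clip((d_lam @ d_lam) / denom, lo, hi))
```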


where $L_{\bar g}$ is the Lipschitz constant of $\nabla\bar{g}$. The convergence rate of DFISTA is $O(1/k^2)$ [8].

4) Nesterov's Accelerated DGA (NADGA): For problem (6), we can employ Nesterov's Accelerated Gradient Descent [15] for strongly convex and smooth unconstrained problems. This strategy starts at an initial dual point and, in each iteration, takes a gradient step of length $1/L_{\bar g}$ at an extrapolated point and then extrapolates with a constant momentum coefficient determined by the condition number of $\bar{g}$ (a sketch follows at the end of this section). The convergence rate of NADGA is linear [15].

In summary, the discussed optimization strategies for GSRsub, except for Feature-sign whose convergence rate is unknown, arranged in increasing order of convergence rate are CCD, PFISTA, DFISTA, the two DGA strategies, and NADGA. Moreover, each dual iteration works with $m$ variables, whereas each primal iteration of CCD and PFISTA works with $K$ variables. Since $m < K$ in the overcomplete dictionary setting, the per-iteration complexity of DFISTA, the two DGA strategies, and NADGA is lower than that of CCD and PFISTA. This indicates that solving D-GSRsub by DFISTA, the two DGA strategies, and NADGA is faster than solving its primal problem by CCD and PFISTA.
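A minimal sketch of NADGA under the same assumed splitting is given below; it reuses the hypothetical `neg_dual_grad` from the fixed-step sketch above, and the strong-convexity and smoothness constants are again assumptions.

```python
import numpy as np

def nadga(x, D, Lii, h, alpha, beta, n_iter=1000):
    m = x.shape[0]
    L = 0.5 + np.linalg.norm(D, 2) ** 2 / (2.0 * alpha * Lii)  # assumed smoothness constant
    mu = 0.5                                                   # assumed strong-convexity parameter
    q = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))  # constant momentum coefficient
    lam = y = np.zeros(m)
    for _ in range(n_iter):
        g, s = neg_dual_grad(y, x, D, Lii, h, alpha, beta)
        lam_new = y - g / L                    # gradient step at the extrapolated point
        y = lam_new + q * (lam_new - lam)      # constant-momentum extrapolation
        lam = lam_new
    return s                                   # primal code recovered from the last dual gradient call
```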

IV. EXPERIMENTS

We present two experiments on real-world datasets to compare the performance of seven optimization strategies for GSRsub in GSC based methods: Feature-sign [2]–[4], CCD [10], PFISTA [8], the fixed-step DGA, the variable-step DGA² [16], DFISTA [8], and NADGA [15]. For fair comparison, we only employ the Lagrange dual method (LagDual) [6] for solving the DU problem (3). All experiments are performed in Matlab R2009a on a Lenovo Windows 7 PC with an Intel Core i3-2120 CPU (3.30 GHz) and 3.5 GB RAM. We set the maximum number of iterations to 5000 for Feature-sign and to 10000 for both CCD and PFISTA, and use the relative change of the primal objective as the stopping criterion for CCD and PFISTA. For the four dual strategies, we set the maximum number of iterations to 100000; one of the DGA strategies uses the step size of the dual variables as its stopping criterion, and the dual gap is used as the stopping criterion for the other three.

The first experiment is learning representations for image clustering using GraphSC [4]. We use the CMU PIE database, with 68 subjects and 42 images per subject, and construct a heat kernel weighted $k$-nearest-neighbor graph. PCA is first performed on the images for dimensionality reduction.

²The key SpaRSA parameters are set to the same values in both experiments; see [16] for more details.
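A minimal sketch of a heat-kernel weighted $k$-nearest-neighbor graph construction of this kind is given below; the values of $k$ and the kernel width, as well as the symmetrization rule, are illustrative assumptions.

```python
import numpy as np

def heat_kernel_knn_graph(X, k=5, sigma=1.0):
    """X: m x n data matrix. Returns the weight matrix W and the Laplacian L = diag(W1) - W."""
    n = X.shape[1]
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                        # k nearest neighbors, excluding i
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))   # heat-kernel weights
    W = np.maximum(W, W.T)                                       # symmetrize
    return W, np.diag(W.sum(axis=1)) - W
```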


TABLE I
CLUSTERING RESULTS ON THE CMU PIE DATASET (THE NUMBER OF CLUSTERS IS 60 AND THE NUMBER OF ITERATIONS IS 20. THE BEST RESULTS ARE MARKED IN BOLDFACE.)

The dimensionality is reduced to 100 and the dictionary size is set to 200. The graph and sparsity regularization parameters $\alpha$ and $\beta$ are chosen by the cross validation described in [4]. Threshold values are set for the stopping criteria of CCD, PFISTA and the step-size-based DGA criterion, and a threshold value is set for the dual gap. Clustering is carried out with 60 clusters, and 10 tests are conducted. The average performances after 20 iterations, evaluated by the accuracy, the normalized mutual information (NMI) [4] and the CPU time, are reported in Table I. The results support our analysis in Section III-B: compared with solving the primal problem by Feature-sign, CCD and PFISTA, solving D-GSRsub by the four given strategies dramatically reduces the execution time of GSC. More precisely, the CPU time used by three of the dual strategies is almost the same, and the fourth takes only a little more CPU time. The accuracy shows that the four dual gradient ascent strategies even improve the clustering performance. The objective values in the third column suggest that solving D-GSRsub in our parameter setting may lead to a different local minimum of GSC.

The second experiment is learning representations for image classification using LScSPM [2], [3]. We select the first 20 classes, with 2548 images in total, from the Caltech 256 dataset and follow the parameter setting in [2] to extract SIFT features. For the feasibility of the comparison, we randomly sample 10000 SIFT features for training a dictionary by GSC, and employ a histogram intersection [2] weighted $k$-nearest-neighbor graph to model the similarity between SIFT features. Following the setting in [2], the regularization parameters $\alpha$ and $\beta$ are fixed, and threshold values are again set for the stopping criteria and the dual gap. In inference, we use the learned dictionary to encode each SIFT feature by solving a GSRsub (4). Finally, we generate a representation for each image by spatial-pyramid max pooling of the corresponding SIFT sparse codes (a minimal sketch is given below). In classification, we randomly select 60 images from each class and use their representations to train a linear SVM classifier, and then carry out classification on the remaining images. Ten tests are conducted, and the average performances after 50 iterations are reported in Table II. The performances of the seven strategies are evaluated by the accuracy, the CPU time for training the dictionary by GSC and the average CPU time for inferring a representation for one image. The objective values of the GSC model are also reported.
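A minimal sketch of spatial-pyramid max pooling of SIFT sparse codes into a single image representation; the pyramid levels and the cell-indexing scheme are assumptions for illustration.

```python
import numpy as np

def spm_max_pool(codes, xy, levels=(1, 2, 4)):
    """codes: K x T sparse codes of T local SIFT descriptors; xy: T x 2 positions in [0, 1)^2."""
    pooled = []
    for l in levels:
        cell = np.minimum((xy * l).astype(int), l - 1)           # grid cell index of each descriptor
        for cx in range(l):
            for cy in range(l):
                mask = (cell[:, 0] == cx) & (cell[:, 1] == cy)
                pooled.append(np.abs(codes[:, mask]).max(axis=1) if mask.any()
                              else np.zeros(codes.shape[0]))     # max over codes falling in the cell
    return np.concatenate(pooled)                                # final image representation
```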

TABLE II
CLASSIFICATION RESULTS ON 20 CLASSES OF CALTECH 256 (THE NUMBER OF ITERATIONS IS 50. THE LAST COLUMN SHOWS AVERAGE CPU TIME FOR INFERRING A REPRESENTATION FOR ONE IMAGE. THE BEST RESULTS ARE MARKED IN BOLDFACE.)

Table II shows that the accuracy of the seven strategies is almost the same, but the four strategies which solve D-GSRsub consume less time in both training and inference. Further, among the four, one is the fastest in training, and NADGA is the fastest in inference.

V. CONCLUSION

In this letter, we proposed to speed up GSC by solving a novel dual problem of GSRsub (D-GSRsub) which is smooth and strongly convex with fewer variables, compared with its primal problem and the existing dual form (D-$\ell_1$-QP). Based on this dual form, four efficient gradient ascent strategies for D-GSRsub have been introduced to speed up GSC. Image classification and clustering experiments on two real-world datasets were conducted to illustrate the fast and stable performance of the four strategies.

APPENDIX A
DERIVATION OF $f_1^*$, $f_2^*$ AND $\nabla\bar{g}$

It is easy to verify that the conjugate $f_1^*$ of the quadratic data-fidelity term $f_1$ is itself a (strongly convex) quadratic function of $\boldsymbol{\lambda}$. By Proposition 3.2 and the strong convexity of $f_2$, it follows that

$$f_2^*(\mathbf{v}) = \max_{\mathbf{s}}\ \mathbf{v}^{\top}\mathbf{s} - f_2(\mathbf{s})$$

is attained at a unique point $\mathbf{s}^*$ and that $\nabla f_2^*(\mathbf{v}) = \mathbf{s}^*$. The necessary and sufficient condition for $\mathbf{s}^*$ to be the optimal point of the above optimization problem is that the subgradient of its objective function contains zero, that is,

$$\mathbf{0} \in \mathbf{v} - 2\alpha L_{ii}\,\mathbf{s}^* - 2\alpha\,\mathbf{h} - \beta\,\partial\|\mathbf{s}^*\|_1. \tag{7}$$

Thus, for each coordinate $k$, either $s_k^* = 0$ or $s_k^*$ has the same sign as $v_k - 2\alpha h_k$. If $|v_k - 2\alpha h_k| \le \beta$, then $s_k^* = 0$; if $|v_k - 2\alpha h_k| > \beta$, then $s_k^* = \operatorname{sign}(v_k - 2\alpha h_k)\,(|v_k - 2\alpha h_k| - \beta)/(2\alpha L_{ii})$. Conversely, if $s_k^*$ takes these values, then (7) holds. This implies that $\mathbf{s}^* = \mathcal{S}_{\beta}(\mathbf{v} - 2\alpha\mathbf{h})/(2\alpha L_{ii})$, where $\mathcal{S}_{\beta}$ is the soft-thresholding operator [14]. Substituting $\mathbf{s}^*$ into $\mathbf{v}^{\top}\mathbf{s} - f_2(\mathbf{s})$ gives the desired expression of $f_2^*$, and $\nabla\bar{g}$ follows by the chain rule from $\nabla f_1^*$ and $\nabla f_2^*$.
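The coordinate-wise argument above can be checked numerically. The sketch below compares the soft-thresholding formula for $\mathbf{s}^*$ against a brute-force one-dimensional search; the constants playing the roles of $\alpha L_{ii}$ and $\mathbf{v}-2\alpha\mathbf{h}$ are chosen arbitrarily.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
c, beta = 0.7, 0.3                                # c plays the role of alpha * L_ii (assumed)
w = rng.standard_normal(5)                        # w plays the role of v - 2*alpha*h (assumed)
closed_form = soft_threshold(w, beta) / (2.0 * c)
grid = np.linspace(-5, 5, 200001)
# brute-force minimization of c*s^2 - w*s + beta*|s| for each coordinate
brute = np.array([grid[np.argmin(c * grid**2 - wi * grid + beta * np.abs(grid))] for wi in w])
assert np.allclose(closed_form, brute, atol=1e-3)
```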

ACKNOWLEDGMENT

The authors thank the referees for their comments.


REFERENCES

[1] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in IEEE Conf. Computer Vision and Pattern Recognition, 2009, pp. 1794–1801.
[2] S. Gao, I. W. Tsang, L.-T. Chia, and P. Zhao, "Local features are not lonely–Laplacian sparse coding for image classification," in IEEE Conf. Computer Vision and Pattern Recognition, 2010, pp. 3555–3561.
[3] S. Gao, I. W. Tsang, and L.-T. Chia, "Laplacian sparse coding, hypergraph Laplacian sparse coding, and applications," IEEE Trans. Patt. Anal. Mach. Intell., vol. 35, no. 1, pp. 92–104, 2013.
[4] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai, "Graph regularized sparse coding for image representation," IEEE Trans. Image Process., vol. 20, no. 5, pp. 1327–1336, 2011.
[5] A. Shaban, H. Rabiee, M. Farajtabar, and M. Ghazvininejad, "From local similarity to global coding: An application to image classification," in IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 2794–2801.
[6] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 801–808.
[7] I. Daubechies, M. Defrise, and C. D. Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, 2004.


[8] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[9] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[10] R. Mazumder and T. Hastie, "The graphical lasso: New insights and alternatives," Electron. J. Statist., vol. 6, pp. 2125–2149, 2012.
[11] A. Saha and A. Tewari, "On the non-asymptotic convergence of cyclic coordinate descent methods," SIAM J. Optim., vol. 23, no. 1, pp. 576–601, 2013.
[12] Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., vol. 103, no. 1, pp. 127–152, 2005.
[13] S. Shalev-Shwartz, "Online learning: Theory, algorithms, and applications," Ph.D. dissertation, The Hebrew University of Jerusalem, 2007 [Online]. Available: http://www.cs.huji.ac.il/~shais
[14] D. Donoho, "De-noising by soft-thresholding," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613–627, 1995.
[15] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Norwell, MA, USA: Kluwer, 2004.
[16] S. Wright, R. Nowak, and M. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2479–2493, 2009.
[17] J. Barzilai and J. Borwein, "Two point step size gradient methods," IMA J. Numer. Anal., vol. 8, no. 1, pp. 141–148, 1988.
[18] W. Hager, D. Phan, and H. Zhang, "Gradient-based methods for sparse recovery," SIAM J. Imag. Sci., vol. 4, no. 1, pp. 146–165, 2011.