Support-Driven Sparse Coding for Face Hallucination

Junjun Jiang1, Ruimin Hu1, Zhongyuan Wang1, Zixiang Xiong2 and Zhen Han1

1 National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China
2 Dept. of Electrical and Computer Engineering, Texas A&M University, College Station, USA

Abstract—By incorporating the prior of positions, position patch based face hallucination methods can produce high-quality results and save computation time. Given a low-resolution face image, the key issue of these methods is how to encode the input low-resolution patch. However, due to stability and accuracy issues, the coding approaches proposed so far are not satisfactory. In this paper, we present a novel sparse coding method that exploits support information on the coding coefficients. In particular, the support information is characterized by the locality of the image patch manifold, which has been shown to be critical in data representation and analysis. According to the distances between the input patch and the bases in the dictionary, we first assign different weights to the coding coefficients and then obtain the coding coefficients by solving a weighted sparse problem. Our proposed method exploits the non-linear manifold structure of patch samples and the sparse property of the redundant data, leading to stable and accurate representation. Experiments on commonly used databases demonstrate that our method outperforms the state of the art.

I. INTRODUCTION

Face super-resolution (SR), or face hallucination, refers to the technique of generating a high-resolution (HR) face image from a low-resolution (LR) one with the help of a set of training examples. It aims at transcending the limitations of electronic imaging systems. Applications of face hallucination include video surveillance, in which the individual of interest is often far away from the cameras. A captured face image is therefore usually LR and lacks detailed facial features, which are of vital importance to face image analysis and recognition. In their pioneering work on face hallucination [1], Baker and Kanade employed a Bayesian approach to infer the missing high-frequency components of an input LR image from a parent structure with HR/LR training samples, achieving a large magnification factor with relatively good results. Liu et al. [2] proposed a two-step statistical modeling approach that integrates global structure reconstruction with local detail adjustment. Following [1, 2], progress has been made in estimating an HR face image from a single LR face image with a training set of HR and LR image pairs [3-11]. The main idea is to estimate the HR face image by hallucinating the input LR face image globally [3] via face space parameter estimation, locally [4-9] via patch image restoration, or both [10, 11]. Among these three approaches, local patch image restoration has attracted more attention since it gives better hallucinated faces.

The research in this paper uses the CAS-PEAL-R1 face database collected under the sponsorship of the Chinese National Hi-Tech Program and ISVISION Tech. Co. Ltd. The research was supported by the major national science and technology special projects (2010ZX03004-003-03, 2010ZX03004-001-03), the National Basic Research Program of China (973 Program) (2009CB320906), the National Natural Science Foundation of China (61231015, 61172173, 60970160, 61070080, 61003184, 61170023), the Fundamental Research Funds for the Central Universities (201121102020009), and the Science and Technology Program of Wuhan (201271031366).


Fig. 1. Block diagram of position-patch based face hallucination.

In particular, Chang et al. [4] pointed out that the HR and LR patch manifolds share a similar local geometric structure. Following this assumption, they proposed a neighbor embedding (NE) based SR method that estimates an HR patch through a K nearest neighbor strategy with a fixed number of neighbors for reconstruction, which may lead to blurred edges due to over- or under-fitting. To alleviate this problem, Yang et al. [5] employed a sparse coding method to adaptively choose the most relevant neighbors for image SR. For a class of highly structured objects, such as human faces, the prior of face positions can be incorporated into face hallucination. Recently, Ma et al. [6, 7] took advantage of this prior and introduced a position patch based method (block diagram shown in Fig. 1) that estimates (or hallucinates) an HR image patch using the patches at the same position of all training face images. Specifically, the coding coefficients estimated via constrained least squares (CLS) in each face region are used to generate the HR patch at the corresponding position. However, when the number of training patches is much larger than the dimension of the patch, the CLS problem is under-determined and the solution is not unique. To overcome this problem, Jung et al. [8] imposed a sparsity constraint on the coding coefficients and obtained a more stable solution and better hallucinated results. However, this sparse coding (SC) based method [8] may select very distinct patches to favor sparsity and thus cannot reveal the manifold geometric structure that is important for image representation and analysis. Most recently, we proposed a locality-constrained representation method for patch representation in [9], but it cannot really achieve sparsity.

In this paper we introduce a new class of norms and a novel coding method called support-driven sparse coding (SDSC) that yields stable and accurate representations thanks to the spatial locality of the patch manifold and the sparsity of natural image patches. Our work is inspired by recent developments on reweighted $\ell_1$ minimization-inducing norms that are capable of exploiting support information on the coding coefficients. Although the traditional $\ell_1$ norm can only promote sparsity and does not incorporate any prior information about the support of the coding coefficients, we can assign different weights to the bases in the patch manifold and thus obtain weighted sparsity using the so-called weighted $\ell_1$ minimization.


To enable more stable and accurate coding, it is important to use weighted sparsity-inducing norms. In this paper we enforce a manifold locality constraint to induce a weighted sparsity norm on the coding coefficients, which makes the coding much more stable and accurate. As shown in Fig. 2, SC [8] selects very distinct bases for representation, whereas SDSC uses the support information to select neighboring bases. Although the recovery of compressively sampled signals using prior support information has been studied in the literature [13, 14], we are the first to introduce it to the SR problem. Compared with existing sparse coding approaches, our method combines the merits of manifold learning and sparse coding. Experiments on commonly used databases confirm the superiority of our proposed SDSC method.

Fig. 2. Comparison between SC (left) and SDSC (right). Note that the selected bases are highlighted in blue.

II. FORMULATIONS

Given an LR face image X, our goal is to obtain its HR version Y. Following the patch-based approach, we assume that the LR and HR patches share the same underlying representation. As shown in Fig. 1, an LR face image X is given as input; each patch x of X is approximated by a linear combination of the LR training patches at the same position using a coding method, yielding a set of coding coefficients $[w_1, w_2, \ldots, w_N]^T$ on the LR patch dictionary $D_l$ (following [6-8], we simply regard the training position-patch set as an over-complete dictionary). Keeping the coefficients and replacing the LR dictionary $D_l$ with the corresponding HR dictionary $D_h$, a hallucinated HR patch y can be synthesized. Collecting all the HR patches at their corresponding positions, we obtain an estimate of the HR face Y. Therefore, for patch-based methods, how to obtain the optimal coding coefficients $w = [w_1, w_2, \ldots, w_N]^T$ of an input face patch is the core issue. In the following, we review the existing sparse coding scheme and then introduce our proposed SDSC strategy and the face hallucination method based on SDSC.
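As a minimal illustration of this pipeline (a sketch, not the authors' original code), the snippet below hallucinates one HR patch from one LR patch, assuming the dictionaries $D_l$ and $D_h$ are NumPy arrays whose columns are vectorized training patches at the same position; the coding step `code_fn` is a hypothetical placeholder made concrete in Section III.

```python
import numpy as np

def hallucinate_patch(x, D_l, D_h, code_fn):
    """Hallucinate one HR patch from one LR patch.

    x       : (m,)   vectorized LR input patch
    D_l     : (m, N) LR dictionary (columns = LR training patches at this position)
    D_h     : (M, N) HR dictionary (columns = the corresponding HR training patches)
    code_fn : callable returning coding coefficients w of shape (N,) for x over D_l
    """
    w = code_fn(x, D_l)   # w = [w_1, ..., w_N]^T from some coding scheme
    y = D_h @ w           # reuse the same coefficients on the HR dictionary
    return y
```

The coding function is deliberately abstract here; Sections III.A and III.B specify the sparse coding choices.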

III. FACE IMAGE SUPER-RESOLUTION WITH SUPPORT-DRIVEN SPARSE CODING

A. Sparse Coding

Motivated by recent results in sparse signal representation, Yang et al. [5, 8] proposed a SC based image SR method. Let $D_l \in \mathbb{R}^{m \times N}$ be an over-complete LR dictionary of N prototype signal-atoms and $D_h \in \mathbb{R}^{M \times N}$ be the corresponding over-complete HR dictionary, where m and M are the dimensions of an LR and an HR image patch, respectively. In the following, we show how a patch is coded by the traditional sparse coding methods [5, 8]. For an LR patch $x \in \mathbb{R}^m$ from the LR image X, the standard sparse coding problem is

$\hat{w} = \arg\min_{w} \|w\|_1, \quad \mathrm{s.t.} \ \|D_l w - x\|_2^2 \le \epsilon$   (1)

Lagrange multipliers offer an equivalent formulation

$\hat{w} = \arg\min_{w} \|D_l w - x\|_2^2 + \lambda \|w\|_1$   (2)

where $\lambda > 0$ is a scalar regularization parameter for the $\ell_1$ norm penalty, balancing the trade-off between reconstruction error and sparsity. It is worth noting that the role of the $\ell_1$ regularization is twofold. First, the dictionary is usually over-complete, i.e., $m \ll N$, and hence the sparsity regularization makes the least squares solution stable; second, the sparsity prior allows the learned representation to capture salient patterns and thus achieve a much smaller reconstruction error. Although sparse regularization has many advantages, the standard sparse coding problem (2) does not incorporate any extra prior knowledge about the support of the coding coefficients w. However, in many applications such as image representation, it is possible to draw an estimate of the support of the signal. This leads to our proposed SDSC method.

B. Support-Driven Sparse Coding (SDSC)

In many signal processing applications, data such as images and videos lying in a high-dimensional space are likely to concentrate in the vicinity of a non-linear manifold of much lower dimensionality that is embedded in the high-dimensional input space [15]. However, sparse coding favors the sparsity of the solution without any consideration of locality, which is very important for successfully exploring the underlying structures of manifolds. Local Coordinate Coding (LCC) [12] states that sparse coding does not ensure good locality and thus fails to facilitate nonlinear function learning. Inspired by this observation, we propose the SDSC method, which incorporates prior support information induced by locality into the minimization in (2) via the weighted minimization

$\hat{w} = \arg\min_{w} \|w\|_{1,a}, \quad \mathrm{s.t.} \ \|D_l w - x\|_2^2 \le \epsilon$   (3)

where $\|w\|_{1,a}$ is the weighted $\ell_1$ norm (see [13, 14]), defined as

$\|w\|_{1,a} = \sum_{i=1}^{N} a_i |w_i|, \quad \text{with} \ a_i = \begin{cases} 1, & i \in T \\ \infty, & \text{otherwise} \end{cases}$   (4)

Here, a is the $N \times 1$ weight vector and T is the support of the coding w. Thus, the optimization of Eq. (3) can be seen as standard sparse coding on the sub-dictionary consisting of the selected bases. The question now becomes how to estimate the support T of the coding coefficients w. It is demonstrated in [12] that sparse coding is helpful for learning only when the codes are local. By enforcing a locality constraint in the data representation, we can take advantage of the manifold geometric structure to learn a nonlinear function in high dimension. Therefore, in order to obtain optimal coding coefficients while revealing the nonlinear manifold of the image patch space, we introduce the locality constraint to generate the support. In particular, the neighbor information is used to guide the estimation of the support, which we define as

$T = \mathrm{supp}(\mathrm{dist}|_k)$   (5)

where $\mathrm{dist} \in \mathbb{R}^N$ measures the distances between x and the bases in the dictionary $D_l$. Specifically,

$\mathrm{dist}_i = \|x - d_{l_i}\|_2, \quad 1 \le i \le N$   (6)

where $d_{l_i}$ is the i-th base of the LR dictionary $D_l$, $\mathrm{dist}|_k$ refers to the k smallest entries of dist, and $\mathrm{supp}(\mathrm{dist}|_k)$ is the set of indexes of these k smallest entries. As shown in Fig. 2, minimizing (3) tends to encode the input using its nearby bases, while the resulting w still satisfies the sparsity and locality constraints simultaneously, thus achieving stability and accuracy. There are, of course, other support estimation strategies, and we discuss the effect of different strategies in Section IV.B. From the objective function (3) of SDSC, we can see that standard SC is a special case of our method: when no prior information on the coding coefficients is available, $T = \{1, 2, \ldots, N\}$, i.e., $a_i = 1$ for $i = 1, \ldots, N$, and $\|w\|_{1,a} = \sum_{i=1}^{N} |w_i|$, so the proposed SDSC reduces to the standard SC approach [8].

Face SR based on SDSC. Having acquired the coding coefficients $\hat{w}$, we reconstruct the HR patch y using the same coding coefficients:

$y = D_h \hat{w}$   (7)

Concatenating and integrating all the hallucinated HR patches according to their corresponding positions, we form the target HR face image Y.
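The paper does not spell out how overlapping hallucinated patches are fused; a common choice, used here purely as an assumption, is to average the overlapping pixels. The sketch below assembles vectorized HR patches back into the face image under that assumption.

```python
import numpy as np

def assemble_patches(patches, positions, img_shape, patch_size):
    """Integrate hallucinated HR patches into the HR face image Y.

    patches    : list of (patch_size*patch_size,) vectorized HR patches y = D_h w
    positions  : list of (row, col) top-left corners of each patch in the HR image
    img_shape  : (H, W) of the HR face image
    patch_size : side length of the square HR patch (e.g. 12)

    Overlapping pixels are averaged; this fusion rule is an assumption, since the
    paper does not specify one.
    """
    acc = np.zeros(img_shape)
    cnt = np.zeros(img_shape)
    for y_vec, (r, c) in zip(patches, positions):
        acc[r:r + patch_size, c:c + patch_size] += y_vec.reshape(patch_size, patch_size)
        cnt[r:r + patch_size, c:c + patch_size] += 1.0
    return acc / np.maximum(cnt, 1.0)
```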

IV. RESULTS AND ANALYSIS

In our experiments, we first test the effectiveness of the support information induced by the locality of the manifold structure, and then compare the subjective and objective quality results with several state-of-the-art baselines: Bicubic interpolation, Wang's method [3], neighbor embedding (NE) [4], constrained least square (CLS) [7], and sparse coding (SC) [8].

A. Experimental configurations

Databases: All experiments were performed on the CAS-PEAL-R1 face database [16]. All 1040 frontal face images with neutral expression are used, without any further selection. The images are aligned by five manually selected feature points and cropped to 112×100 pixels. The LR face images (28×25 pixels) are obtained by smoothing the original HR face images (with an averaging filter H of size 4×4) and downsampling them by a factor of 4. We randomly select 1000 images for training and leave the remaining 40 for testing, so none of the test images appear in the training set.

Parameter Setting: For SDSC, four parameters must be determined: the patch size, the overlap between patches, the number of neighbors k, and the error tolerance ε. As in [9], the HR patch size is set to 12×12 with an overlap of 4 pixels between patches, and the corresponding LR patch size is 3×3 with an overlap of 1 pixel, for the remaining experiments. The number of neighbors k is set to 100; the performance for various k is examined in Section IV.B. The error tolerance ε is set to 8, and we use a primal-dual algorithm [18] for the sparse coding problem as in [8]. For Wang's method [3], 350 bases are chosen. The number of neighbors in NE [4] is set to 50. We set the error tolerance in SC [8] to 1.
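For reference, the degradation described above (4×4 averaging filter followed by ×4 downsampling, 112×100 to 28×25) can be roughly reproduced as below; the boundary handling and filter alignment are assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def degrade(hr_img, blur_size=4, factor=4):
    """Smooth an HR face image with a blur_size x blur_size averaging filter and
    downsample by `factor` (112x100 -> 28x25), approximating the setup above.
    Boundary handling here is an assumption, not taken from the paper."""
    smoothed = uniform_filter(hr_img.astype(np.float64), size=blur_size)
    return smoothed[::factor, ::factor]
```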

B. The effect of support

In this section, we test how the support impacts the final face hallucination performance. In particular, we generate the support by three strategies: selecting the k nearest bases (the proposed SDSC method), k random bases, or k farthest bases. We plot the PSNR (dB) and SSIM [17] performance of these three support estimation strategies for different values of k in Fig. 3 and draw at least two conclusions: i) The performance of SDSC is always better than that of the comparison strategies, regardless of the value of k. When k = 1000, all three strategies reduce to the SC method [8] and achieve the same performance. That is to say, by choosing suitable nearest bases as the support, the coding coefficients of SDSC become more accurate; moreover, because SDSC enforces the coding on the k nearest bases, it achieves stability. We attribute these properties to the locality-induced sparse coding of SDSC. ii) SDSC achieves the best performance when k is around 100 and starts to decline as k grows larger, while the performance of the other two strategies goes up as k increases. This means, first, that locality provides accurate prior information about the support of the coding coefficients, and incorporating this prior leads to a stable and accurate recovery; second, that incorporating other, possibly inaccurate, priors decreases the performance. This negative impact weakens as k increases, primarily because a large k (selecting almost all the bases as the support) implies that the support has less influence on the recovery.

Fig. 3. PSNR (dB) and SSIM of three support estimation strategies (k nearest, k random, and k farthest bases) as a function of k.
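The three support-generation strategies compared in Fig. 3 can be written compactly as below; this is only a sketch of the ablation, and the random generator `rng` is an assumed detail.

```python
import numpy as np

def make_support(x, D_l, k, strategy="nearest", rng=None):
    """Build the support T by one of the three strategies compared in Fig. 3."""
    dist = np.linalg.norm(D_l - x[:, None], axis=0)
    order = np.argsort(dist)
    if strategy == "nearest":            # the proposed SDSC choice
        return order[:k]
    if strategy == "farthest":
        return order[-k:]
    if strategy == "random":
        rng = np.random.default_rng() if rng is None else rng
        return rng.choice(D_l.shape[1], size=k, replace=False)
    raise ValueError("unknown strategy: %s" % strategy)
```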

C. Comparison of Subjective and Objective Quality Results

The hallucinated results are compared in Fig. 4. Due to limited space, only five test face images are shown. Column (a) shows the input LR face images and column (h) the original HR face images; columns (b)-(g) show the hallucinated results of the different methods. We can see that: i) Bicubic interpolation (column (b)), which is based on generic smoothness priors, blurs the edges and produces overly smooth hallucinated faces; ii) the results of Wang's eigentransformation based global face SR method [3] (column (c)) have clearer edges than Bicubic interpolation, but there are obvious ghosting effects on the face contours, a common problem of such global algorithms, especially when the training faces are misaligned or the training set is small; iii) the hallucinated results of the local patch based methods (NE [4], CLS [7], SC [8] and SDSC) are closer to the ground truth images in column (h). This suggests that hallucination through local patch modeling has the advantage of recovering HR facial details; the success of local patch based methods arises in part from the patch representation, which greatly enhances the expressive ability of the training set. The NE based method [4] does not incorporate the position patch constraint; consequently its results in column (d) expose some unsmoothed noisy artifacts and compare poorly against the results in columns (e)-(g), which use the position patch prior; iv) SDSC (column (g)) gives the best visual quality. It not only removes the blurring effects and unsmoothed noisy artifacts, but also reconstructs more facial details and sharper image edges than the other methods (see the cheeks in rows 1, 2 and 4, the chin in row 5, and the eyes in row 3). We attribute the improved facial details and edge preservation to the incorporation of support information, which leads to stable and accurate coding. In summary, the hallucinated faces of our proposed SDSC approach are superior to those of the other state-of-the-art baselines.

Fig. 4. Simulation results of different methods: (a) input LR faces; (b) Bicubic interpolation; (c) Wang's method [3]; (d) NE [4]; (e) CLS [7]; (f) SC [8]; (g) our proposed SDSC method; (h) original HR faces. (The differences are more pronounced when the figure is enlarged in the electronic version.)

In order to quantitatively illustrate the superiority of the proposed SDSC method, we further calculate the PSNR (dB) and SSIM values for the 40 test images of each comparison method and report the results as a box-plot in Fig. 5. The higher the SSIM value, the better the face hallucination quality; the maximum value of SSIM is 1, which corresponds to a perfect reconstruction. We can observe that Bicubic interpolation always gives the lowest performance; the next lowest is Wang's method [3], a global method whose PCA bases have limited expressive ability. NE [4], CLS [7], and SC [8] achieve better results than Bicubic interpolation, but their performance is clearly inferior to that of the proposed SDSC method. The average PSNR and SSIM improvements of SDSC over the second best method, i.e., SC [8], are 0.69 dB and 0.0125 respectively. This benefit comes essentially from incorporating the support information on the coding coefficients. In addition, we investigated the time costs of SC and SDSC, which are 85 seconds vs. 68 seconds. Although SDSC must execute a k-NN search for every patch, it reduces the burden of the inefficient and time-consuming $\ell_1$ norm minimization used in SC. With existing acceleration methods such as approximate nearest neighbor (ANN) search, one may expect the speed of SDSC to exceed that of SC by an even larger margin.

Fig. 5. PSNR (dB) and SSIM comparison of different methods: Bicubic (PSNR = 24.50 dB, SSIM = 0.8163), Wang's method [3] (PSNR = 26.62 dB, SSIM = 0.8254), NE [4] (PSNR = 27.98 dB, SSIM = 0.8906), CLS [7] (PSNR = 28.16 dB, SSIM = 0.8974), SC [8] (PSNR = 28.25 dB, SSIM = 0.8968), and our method (PSNR = 28.94 dB, SSIM = 0.9093).
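The objective scores in Fig. 5 are averages over the 40 test images; a hedged sketch of that evaluation loop is shown below, assuming 8-bit grayscale images and scikit-image's PSNR/SSIM implementations (an assumption; the paper only cites [17] for SSIM and does not name an implementation).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def average_scores(results, references):
    """Average PSNR (dB) and SSIM over pairs of hallucinated / ground-truth
    images, assuming 8-bit grayscale arrays (data_range=255)."""
    psnrs, ssims = [], []
    for y_hat, y in zip(results, references):
        psnrs.append(peak_signal_noise_ratio(y, y_hat, data_range=255))
        ssims.append(structural_similarity(y, y_hat, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```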


V. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a novel position patch based face hallucination method based on SDSC. In SDSC, we use the bases nearest to the input LR patch to estimate the support of the coding coefficients and apply sparse coding on that support to obtain accurate and stable coding coefficients. Keeping the coefficients and replacing the LR bases with the corresponding HR bases, a new HR patch at the same position is hallucinated. Concatenating and integrating all the hallucinated HR patches, we form the target HR face image. Experimental results on CAS-PEAL-R1 demonstrate the effectiveness of SDSC. In this work, we directly used the whole set of training image patches as the dictionary; learning a compact and representative dictionary will be our future work.

REFERENCES

[1] S. Baker and T. Kanade, "Hallucinating faces," FG, pp. 83–88, 2000.
[2] C. Liu, H. Shum, and W. Freeman, "Face hallucination: Theory and practice," Int. J. Comput. Vis., vol. 7, no. 1, pp. 115–134, 2007.
[3] X. Wang and X. Tang, "Hallucinating face by eigen-transformation," IEEE Trans. Systems, Man, and Cybernetics, Part C, vol. 35, no. 3, pp. 425–434, 2005.
[4] H. Chang, D. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," CVPR, pp. 275–282, 2004.
[5] J. Yang, H. Tang, Y. Ma, and T. Huang, "Face hallucination via sparse coding," ICIP, pp. 1264–1267, 2008.
[6] X. Ma, J. Zhang, and C. Qi, "Position-based face hallucination method," ICME, pp. 290–293, 2009.
[7] X. Ma, J. Zhang, and C. Qi, "Hallucinating face by position-patch," Pattern Recognition, vol. 43, no. 6, pp. 3178–3194, 2010.
[8] C. Jung, L. Jiao, B. Liu, and M. Gong, "Position-patch based face hallucination using convex optimization," IEEE Signal Process. Lett., vol. 18, no. 6, pp. 367–370, 2011.
[9] J. Jiang, R. Hu, Z. Han, T. Lu, and K. Huang, "Position-patch based face hallucination via locality-constrained representation," ICME, pp. 212–217, 2012.
[10] H. Huang, H. He, X. Fan, and J. Zhang, "Super-resolution of human face image using canonical correlation analysis," Pattern Recognition, vol. 43, no. 7, pp. 2532–2543, 2010.
[11] J. Jiang, R. Hu, Z. Han, T. Lu, and K. Huang, "Surveillance face hallucination via variable selection and manifold learning," ISCAS, pp. 2681–2684, 2012.
[12] K. Yu, T. Zhang, and Y. Gong, "Nonlinear learning using local coordinate coding," in Proc. Advances in Neural Information Processing Systems, pp. 2223–2231, 2009.
[13] R. von Borries, C. J. Miosso, and C. Potes, "Compressed sensing using prior information," CAMSAP, pp. 121–124, 2007.
[14] M. P. Friedlander, H. Mansour, R. Saab, and O. Yılmaz, "Recovering compressively sampled signals using partial support information," IEEE Trans. Inf. Theory, vol. 58, no. 2, pp. 1122–1134, 2012.
[15] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[16] W. Gao, B. Cao, S. G. Shan, X. L. Chen, D. L. Zhou, X. H. Zhang, and D. B. Zhao, "The CAS-PEAL large-scale Chinese face database and baseline evaluations," IEEE TSMC-A, vol. 38, no. 1, pp. 149–161, 2008.
[17] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[18] E. Candes and J. Romberg, ℓ1-Magic: Recovery of Sparse Signals via Convex Programming, 2005. [Online]. Available: http://www.acm.caltech.edu/l1magic/