An Adaptive Learning Method for Face Hallucination Using Locality Preserving Projections Xuesong Zhang and Silong Peng The Institute of Automation, Chinese Academy of Sciences Beijing, 100190, PRC
Jing Jiang North China Institute of Science and Technology Beijing, 101601, PRC
{xuesong.zhang, silong.peng}@ia.ac.cn
[email protected]
Abstract
The size of the training set, as well as how it is used, is an important issue in learning-based super-resolution. In this paper, we present an adaptive learning method for face hallucination using Locality Preserving Projections (LPP). By virtue of its ability to reveal the non-linear structure hidden in the high-dimensional image space, LPP is an efficient manifold learning method for analyzing the local intrinsic features on the manifold of local facial areas. By searching out patches online in the LPP sub-space, which tailors the resultant training set to the testing patch, our algorithm performs adaptive sample selection and then effectively restores the lost high-frequency components of the low-resolution face image by patch-based eigentransformation over the dynamic training set. Experiments demonstrate that the proposed method achieves good super-resolution reconstruction performance using a relatively small sample set.
1. Introduction
Face hallucination is a learning-based super-resolution (SR) method [1] that infers the lost high-frequency components from a properly trained and organized face image database. Recent works have shown the superiority of learning-based methods over reconstruction-constrained methods, especially for large magnification factors such as 4-16 [1-9]. The basic idea of learning-based SR is to construct a co-occurrence model of high-resolution (HR) images and their low-resolution (LR) counterparts by learning from example patterns; approaches differ in which learning method is utilized and how the learnt knowledge is incorporated into the SR process. Baker and Kanade [1, 2] developed a face hallucination method that learns a prior on the spatial distribution of the image gradient for frontal face images and then infers the high-frequency components
978-1-4244-2154-1/08/$25.00 ©2008 IEEE
under a MAP framework. Liu et al. [4] proposed a two-step approach that integrates a global parametric model in the eigenface sub-space with a local nonparametric model based on a Markov network. Wang and Tang [5] developed an efficient face hallucination algorithm using eigentransformation.
An important issue of learning-based SR that has not been considered in previous works is the relationship between learning efficiency and the size of the training set. One may naturally ask: will more samples always lead to better reconstruction results? In other words, is it possible to get a good reconstruction from a small training set, and how? In this paper, we address this problem by proposing an adaptive learning method for face hallucination using a recent manifold learning method, Locality Preserving Projections (LPP) [10]. The main contributions of this paper are: (1) we propose a scenario in which the face hallucination algorithm is endowed with a smart learning ability, i.e., when a low-resolution face image is fed into our system, it can dynamically and efficiently pick out the samples that are most similar to the input image, and then use the tailored training set to perform the SR reconstruction; and (2) our experiments demonstrate that the learning efficiency of patch-based reconstruction is higher than that of the global reconstruction method.
The rest of this paper is organized as follows. We describe the framework of our adaptive learning method for face hallucination in Section 2. In Section 3, we first introduce the LPP algorithm briefly, then give the procedure of adaptive sample selection by LPP, and finally present a patch-based eigentransformation algorithm for face hallucination based on the tailored training set. Experimental results and discussions are reported in Section 4. We give our conclusions in Section 5.
2. The framework of adaptive learning method
We separate the proposed face SR algorithm into two parts, the training part and the testing part, as divided in Fig. 1 by a red dashed line. In the training part, each HR canonical face image is first blurred, downsampled and contaminated by Gaussian noise, and then interpolated and low-pass filtered to retrieve the low-frequency components of the original training set. The corresponding high-frequency residual faces are generated simply by subtracting the low-frequency faces from the HR versions. The low-frequency faces and residual faces then undergo similar processes in which each image is split into N patches; the difference lies in the subsequent procedure: all patches sharing the same position in the low-frequency faces are analyzed by LPP, and the resultant bases and projection coefficients are stored for adaptive sample selection in the testing process.

Figure 1: Framework of the proposed approach.

In the testing part, the input LR image g is first interpolated to the HR image grid by an upsampling factor of M, then filtered by a low-pass filter L to restore the low-frequency part of the super-resolved face image. The magnified LR image is also split into patches, and each of them is projected into the corresponding LPP sub-space of the training part. After a classification operation on the LPP coefficients of the input patch and the stored ones, we use a patch-based eigentransformation method, inspired by Wang and Tang's work [5] (which directly infers the SR image from the LR image using a global linear approximation), to construct a bridge between the input low-frequency patch and its high-frequency counterpart. The hallucinated high-frequency patches
are further smoothed by an arithmetic average operation on the overlapped regions and rearranged into the residual face image, which is finally added to the magnified low-frequency face to create the output SR face image.
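The training-part pipeline described above (blur, decimate, add noise, interpolate back, low-pass, subtract) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the box blur standing in for both the point spread function B and the low-pass filter L, the nearest-neighbor upsampling, and the noise level are all assumptions made for the sketch.

```python
import numpy as np

def box_blur(img, k=5):
    """Separable box blur standing in for the PSF B and the low-pass L (assumption)."""
    pad = k // 2
    p = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dr in range(k):
        for dc in range(k):
            out += p[dr:dr + img.shape[0], dc:dc + img.shape[1]]
    return out / (k * k)

def make_training_pair(f, M=4, noise_std=0.5, rng=None):
    """Degrade an HR face f to an LR observation g, then recover the
    low-frequency face f_L and the high-frequency residual f_H = f - f_L."""
    if rng is None:
        rng = np.random.default_rng(0)
    Bf = f_blurred = box_blur(f)                              # blur by B
    g = Bf[::M, ::M] + rng.normal(0.0, noise_std,
                                  (f.shape[0] // M, f.shape[1] // M))
    up = np.repeat(np.repeat(g, M, axis=0), M, axis=1)        # interpolate (up M)
    f_L = box_blur(up)                                        # low-pass filter L
    f_H = f - f_L                                             # residual face
    return g, f_L, f_H
```

By construction f_L + f_H reproduces the HR face exactly, so the residual carries all the detail the learning stage must restore.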
3. Algorithms and implementation
3.1. The LPP algorithm
As a manifold learning method, LPP is obtained by finding the optimal discrete linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the manifold [10, 11]. LPP seeks a projective matrix that preserves the intrinsic geometry and local structure of the data. In brief, the algorithm is:
1) Suppose M is a manifold embedded in a high-dimensional space R^d, and the zero-mean data set x_1, x_2, ..., x_n ∈ M is organized into the data matrix X = [x_1, x_2, ..., x_n].
2) The similarity matrix W is obtained by calculating the weights between any two local neighbors according to:
    W_ij = exp(−||x_i − x_j||^2 / t)  if ||x_i − x_j||^2 < ε,  and 0 otherwise,    (1)
where ε defines the radius of the neighborhood and t is a constant.
3) Solve the generalized minimum eigenvalue problem:
    X L X^T a = λ X D X^T a,    (2)
where D is a diagonal matrix with diagonal entries D_ii = Σ_j W_ij, and L = D − W is the Laplacian matrix.
4) Let the l eigenvectors with the minimum eigenvalues be a_1, a_2, ..., a_l. The projective matrix A = [a_1, a_2, ..., a_l] is of size d × l, and the projected data are:
    y_i = A^T x_i,  y_i ∈ R^l,  l ≪ d,  i = 1, ..., n.    (3)
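The steps above can be sketched in NumPy as follows. The median-distance neighborhood radius and the small regularization added to X D X^T are assumptions made here for numerical stability; they are not part of the algorithm in [10].

```python
import numpy as np

def lpp(X, n_components=2, t=1.0, eps=None):
    """Locality Preserving Projections sketch.
    X: (d, n) zero-mean data matrix, one sample per column."""
    d, n = X.shape
    # pairwise squared distances between samples
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    if eps is None:
        eps = np.median(sq)                          # heuristic radius (assumption)
    W = np.where(sq < eps, np.exp(-sq / t), 0.0)     # Eq. (1)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                        # graph Laplacian
    A_mat = X @ L @ X.T
    B_mat = X @ D @ X.T + 1e-8 * np.eye(d)           # regularized (assumption)
    # reduce X L X^T a = lam X D X^T a (Eq. 2) to a standard symmetric problem
    C = np.linalg.cholesky(B_mat)
    Cinv = np.linalg.inv(C)
    vals, U = np.linalg.eigh(Cinv @ A_mat @ Cinv.T)  # ascending eigenvalues
    A = Cinv.T @ U[:, :n_components]                 # smallest-eigenvalue directions
    return A, A.T @ X                                # projections of Eq. (3)
```

The Cholesky reduction is a standard way to solve the generalized eigenproblem with plain NumPy; a generalized symmetric solver would do equally well.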
3.2. Adaptive sample selection by LPP
Sample selection is in essence a classification problem. A number of recent efforts have shown that not only do face images reside on a low-dimensional sub-manifold embedded in a high-dimensional ambient space [11-14], but local areas of face images can also be analyzed by manifold learning methods [15]. In order to perform the classification in a low-dimensional space, allowing robust and fast computation, we need to apply a dimensionality reduction technique. Traditional linear dimensionality reduction methods, such as PCA and LDA, fail to discover the underlying non-Euclidean structures, since the face images lie on a nonlinear manifold hidden in the image space. LPP, however, is found to have more discriminating power than PCA for classification purposes, especially when nearest-neighbor-like classifiers are used. The steps of the adaptive sample selection algorithm using LPP are as follows:
1) Given the n pairs of HR and LR images {h_1 : p_1, h_2 : p_2, ..., h_n : p_n}, each patch p_k^i sharing the common position i in the LR training images is extracted for k = 1, ..., n, and the patches are reshaped in lexicographic order to form P_i = [p_1^i, p_2^i, ..., p_n^i], where 1 ≤ i ≤ N and N is the number of patches in an image.
2) Compute the LPP projective matrix A_i from P_i and the mapped data matrix Y_i = A_i^T P_i.
3) For the input patch q_i, calculate its low-dimensional feature y_in^i = A_i^T q_i.
4) Select from Y_i a low-dimensional feature set {y_j1^i, y_j2^i, ..., y_js^i} in which each selected feature is similar to y_in^i under the Euclidean distance metric, where j_k ∈ {1, ..., n} and 1 ≤ k ≤ s. The corresponding patch pairs, denoted {h_j1^i : p_j1^i, h_j2^i : p_j2^i, ..., h_js^i : p_js^i}, will be used as a training set to restore the lost high-frequency components of q_i (by means of eigentransformation; see Section 3.3 for more details). The number of selected samples, s, is determined dynamically by the classification algorithm.
There are two questions about the proposed adaptive sample selection method: (1) Are the samples selected in the low-dimensional feature space consistent with those selected in the original high-dimensional image space? (2) Does the distribution of samples on the high-resolution manifold conform to that on the low-resolution manifold? Affirmative answers to these two questions lay the foundation of the proposed face hallucination approach.

Figure 2: LPP projections of the marked low- and high-resolution patches into a two-dimensional subspace. (a) LR mean face. (b) LPP of the marked LR patches. (c) HR mean face. (d) LPP of the marked HR patches.
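Step 4 amounts to a nearest-neighbor search in the LPP sub-space. A minimal sketch follows; the distance-ratio cutoff used here to set s dynamically is an illustrative assumption, since the paper does not spell out the classification rule.

```python
import numpy as np

def select_samples(A_i, P_i, q_i, s_max=100, ratio=2.0):
    """Adaptive sample selection sketch: project the training patches and
    the input patch into the LPP sub-space and keep the nearest neighbors.
    A_i: (d, l) LPP projection matrix; P_i: (d, n) training patches (columns);
    q_i: (d,) input patch. Returns the selected training indices j_1..j_s."""
    Y = A_i.T @ P_i                               # (l, n) mapped training set
    y_in = A_i.T @ q_i                            # (l,) feature of the input patch
    dists = np.linalg.norm(Y - y_in[:, None], axis=0)
    order = np.argsort(dists)
    # keep neighbors within `ratio` times the nearest distance (assumed rule)
    cutoff = ratio * max(dists[order[0]], 1e-12)
    return order[dists[order] <= cutoff][:s_max]
```

Because the comparison happens on length-l LPP coefficients rather than raw patches, the per-patch search cost is small, consistent with the timing remark in Section 4.2.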
On the one hand, the locality preserving ability of LPP guarantees the consistency of the samples selected in the low- and high-dimensional spaces, while the computational cost of the comparisons is reduced significantly; on the other hand, we can demonstrate through experiments that both the global geometry and the local topology of the LR and HR training pairs are very similar to each other, even though a theoretical proof is not yet available. Such an illustration is presented in Fig. 2, in which an HR patch set and its LR counterpart are each projected into a 2D LPP sub-space.
The positions of the extracted 12 × 12 patches are marked by a white rectangle in the high-resolution mean face as well as in the degraded one. Although we fixed the dimension to 2 after LPP for display purposes, the actual LPP sub-space is of dimension 6; one can still see that the distributions of the projected training pairs are very similar to each other.
In order to check the ability of LPP to pick out consistent sample sets for the HR and LR testing patches, all selected training sets were recorded using the aforementioned adaptive sample selection algorithm. The matching ratio is defined as the number of sample indices belonging to the intersection of the selected HR and LR sample index sets, divided by the total number of selected samples. In this experiment, 94 samples (see Table 1) were selected for each testing patch extracted from 46 images, and an additional 166 face images were used to construct the training ensemble. The average matching ratio is 86%, while the probability of reaching this ratio with randomly selected samples is nearly zero (94 × 86% ≈ 81, P = C_96^81 · C_{166−96}^{94−81} / C_{166}^{94} ≈ 10^{−16}). Such an example is presented in Fig. 3.

Figure 3: An example of adaptive sample selection. (a) and (e) are an HR and LR test image pair. (b) and (f) are an extracted HR and LR patch pair. (c) and (g) are the first five selected HR and LR patches. (d) and (h) are the last five selected HR and LR patches.

3.3. Patch-based eigentransformation
The remaining task of super-resolution is to restore the high-frequency components from the input degraded patches and the selected training sets. Unfortunately, unlike PCA, LPP cannot generate an inverse mapping from the low-dimensional projection space back to the original high-dimensional space. To tackle this problem in our adaptive sample selection scenario, we propose a patch-based eigentransformation method. The main improvement over Wang and Tang's work [5] is that the interference from the high-frequency components is shielded from the learning process of the low-frequency components by decomposing the training images. We now show that both the low-frequency input patch and its high-frequency counterpart can be optimally approximated by a linear combination of the corresponding training patches using the same set of coefficients.
The image degradation model can be described as:
    g = (↓M) B f + n,    (4)
where the vectors g, f and n represent, respectively, the LR observation, the original image and the noise, B is the point spread function, and (↓M) represents decimation by a factor of M. By first interpolating g by a factor of M, and then filtering the data with a low-pass filter L to reduce frequency aliasing, we get from (4):
    L(↑M)g = L(↑M)(↓M) B f + L(↑M) n.    (5)
According to multirate digital signal processing theory [16], L(↑M)(↓M) forms a multirate system, and
L(↑M)(↓M) B f = B f if: (1) the frequency spectrum of B f falls into (−π/M, π/M), and (2) L picks out one spectral period of (↓M) B f, i.e., the support of L in the frequency domain is (−π/M, π/M). Of course, these two assumptions hold only approximately; however, the errors can be compensated to some extent through the subsequent learning process on the low- and high-frequency training pairs. After further assuming L n = 0, we simplify (5) to:
    L(↑M)g ≈ B f,    (6)
where B f is the low-frequency component of f after blurring by B. Thus f can be further decomposed into the sum of its low- and high-frequency components:
    f = B f + (I − B) f = f_L + f_H ≈ L(↑M)g + (I − B) f,    (7)
where I is the identity matrix. According to (7), the HR training images t_1, t_2, ..., t_n can be decomposed into the low-frequency images l_1, l_2, ..., l_n and
the high-frequency images h_1, h_2, ..., h_n. We define T = [t_1', t_2', ..., t_n'] = [t_1 − m, t_2 − m, ..., t_n − m], where m = (1/n) Σ_{i=1}^n t_i is the HR mean face. Applying PCA to T, we get:
    T T^T V = V Λ,    (8)
where V is the eigenvector matrix and Λ is the eigenvalue matrix. The projection of f onto V is:
    ω = V^T (f − m),    (9)
and f can be reconstructed from V as follows:
    f̂ = V ω + m = T T^T V Λ^{−1} ω + m = T c + m = Σ_{i=1}^n c_i t_i' + m,    (10)
where c = T^T V Λ^{−1} ω = [c_1, c_2, ..., c_n]^T.
Note that B m = m_L and B t_i' = l_i' since B t_i = l_i, where m_L = (1/n) Σ_{i=1}^n l_i and l_i' = l_i − m_L. From (7) and (10), the reconstructed low- and high-frequency images can be expressed as:
    f̂_L = B f̂ = Σ_{i=1}^n c_i B t_i' + B m = Σ_{i=1}^n c_i l_i' + m_L,
    f̂_H = (I − B) f̂ = Σ_{i=1}^n c_i (I − B) t_i' + (I − B) m = Σ_{i=1}^n c_i h_i' + m_H,    (11)
where h_i' = h_i − m_H and m_H = (1/n) Σ_{i=1}^n h_i. Equation (11) means that we can reconstruct the high-frequency components f̂_H using the same set of reconstruction coefficients c as the low-frequency components f̂_L. In practice, we compute c by applying PCA to L(↑M)g and l_1, l_2, ..., l_n, since f is unknown.

3.4. Merging patches
Applying the previous approach to each patch set, we can construct the high-frequency components for each low-frequency input patch and easily merge them into a whole HR face image, as shown in Fig. 4. The high-frequency patches in the residual face are arranged in the following scheme: each patch is of size n × n, where n is an even number, and overlaps with its eight adjacent neighbors. In order to smooth the hallucinated high-frequency patches, an arithmetic averaging operation is applied to every quarter region of a patch, each of which is covered by four neighbouring patches. This operation plays a similar role to the compatibility matrix of the Markov network in [3]. Fig. 5 shows the enhancing effect (n = 12; the remaining pixels near the image margin are organized into one patch).

Figure 4: The merging scheme of overlapped patches: the overlapping patches p(i, j), p(i, j+1), p(i+1, j) and p(i+1, j+1) are averaged into the residual face f̂_H, which is added to the low-frequency face f̂_L to produce the HR face f̂.

Figure 5: Enhancement resulting from the averaging operation. (a) LR image. (b) Reconstruction from non-overlapping patches. (c) The proposed method. (d) Original HR image.
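Assuming the selected low- and high-frequency training patches are stored as matrix columns (the names Lp and Hp below are illustrative), Eqs. (8)-(11) can be sketched as follows. Computing c = T^T V Λ^{−1} ω reduces algebraically to a pseudo-inverse solve on the centered low-frequency matrix, which is what the sketch uses.

```python
import numpy as np

def eigentransform_patch(Lp, Hp, x):
    """Patch-based eigentransformation sketch (Eqs. 8-11).
    Lp: (d, s) selected low-frequency training patches (columns);
    Hp: (d, s) corresponding high-frequency patches; x: (d,) input
    low-frequency patch L(up M)g. Returns the inferred HF patch."""
    m_L = Lp.mean(axis=1)
    m_H = Hp.mean(axis=1)
    Lc = Lp - m_L[:, None]                   # centered low-frequency patches
    # c = Lc^T V Lam^{-1} w collapses to the pseudo-inverse solution
    c = np.linalg.pinv(Lc) @ (x - m_L)
    x_hat = Lc @ c + m_L                     # low-frequency reconstruction, Eq. (10)
    h_hat = (Hp - m_H[:, None]) @ c + m_H    # same coefficients reused, Eq. (11)
    return h_hat, x_hat, c
```

When x lies in the span of the centered training patches, x_hat reproduces it exactly; otherwise the coefficients give the least-squares projection onto that span.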
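The overlap-averaging of Section 3.4 can be sketched as an accumulate-and-divide pass over the residual face; the explicit position list and coverage bookkeeping are illustrative assumptions about how the patches are indexed.

```python
import numpy as np

def merge_patches(patches, positions, shape, p):
    """Merge overlapping high-frequency patches into a residual face by
    arithmetic averaging over the overlapped regions (Section 3.4 sketch).
    patches: list of (p, p) arrays; positions: list of (row, col) offsets."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for patch, (r, c) in zip(patches, positions):
        acc[r:r + p, c:c + p] += patch
        cnt[r:r + p, c:c + p] += 1
    cnt[cnt == 0] = 1            # uncovered margin pixels stay zero
    return acc / cnt
```

Each pixel is the mean of every patch covering it, which smooths seams between neighboring patches much like the compatibility term of a Markov network.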
4. Experiments and discussions
Our experiments were conducted on a data set containing 212 frontal face images selected from the AR [17], IMM [18] and GTAV [19] databases. Among these images, 46 were used for testing and the rest for training. All of the face images have neutral expressions and were aligned using 21 manually selected feature points. The images were further cropped into 96 × 128 canonical images. Finally, the images were blurred by a Gaussian low-pass filter and downsampled to 24 × 32 low-resolution images, with Gaussian noise added.
Figure 6: Global vs. local reconstruction. (a) Input low-resolution images. (b) Bicubic interpolation of (a). (c) Global reconstruction in [5]. (d) The proposed method. (e) Original high-resolution face image.
4.1. Local reconstruction vs. global reconstruction
In order to verify that the learning efficiency of patch-based reconstruction is higher than that of global reconstruction, we compared the straightforward global reconstruction method of [5] with our patch-based approach. As shown in Fig. 6, noisy disturbances and ghosting effects appear in the globally reconstructed images when the training set is not large enough relative to the hallucinated image size, while our method produces much more visually pleasing results. To provide a quantitative comparison of the reconstruction deviations from the ground truth, we also computed the root mean square error (RMSE) for each
Figure 7: RMSE (RMS error per pixel, in gray levels) of the global and local reconstruction methods over the test images.
Figure 8: The impact of training set on hallucination performance: (a) Input LR images. Super-resolved images using: (b) Bicubic interpolation, (c) randomly selected samples, (d) adaptively selected samples and (e) the ensemble. (f) High-resolution images.
testing image; the results are shown in Fig. 7. The advantage of the local reconstruction method stems from its higher learning efficiency, which results from two aspects: (1) its stronger ability to exploit the diversity of the samples, and (2) the separation of the high-frequency components from the low-frequency components throughout the learning process.
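For reference, the per-pixel RMSE in gray levels plotted in Fig. 7 can be computed as:

```python
import numpy as np

def rmse(a, b):
    """RMS error per pixel in gray levels between two images."""
    return float(np.sqrt(np.mean((a.astype(float) - b.astype(float)) ** 2)))
```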
4.2. Adaptive sample selection vs. random samples and the ensemble
In this experiment, we examined the impact of the training set on the hallucination performance. We first took the average number of samples selected by the proposed method, then randomly selected the same number of samples to perform reconstruction, and finally used the ensemble of all samples. The results are displayed in Fig. 8. Some statistics collected from the three sample-usage strategies are listed in Table 1, including the average dimension of the PCA sub-space per patch of size 12 × 12 (the algorithms were implemented in MATLAB on a PC with a 1.8 GHz CPU and 256 MB of RAM). The proposed adaptive sample selection clearly outperformed random sample selection and obtained nearly the same SR reconstruction performance as the ensemble, while the ensemble required almost 55% more computation time. Furthermore, two points should be noted: (1) the search for the nearest neighbors takes only a very small percentage of the SR computation cost, since the average dimension of the LPP coefficients in the adaptive sample selection method is
just 6.7, and (2) adaptive sample selection will fully reveal its superiority for large training sets, say beyond 5000 samples, since we can always select a training set of tractable size for reconstruction; otherwise one has to resort to an iterative kernel PCA method like [20] to estimate the kernel principal components by off-line computing, in which case the benefit of adaptive sample selection is lost.

Table 1: Comparisons of running time and RMSE.
                     Mean num. of samples   Mean running time   Mean RMSE   Mean PCA dimensions
Adaptive selection           94                   10.7             7.1              59
Random selection             94                    9.2             7.4              50
The ensemble                166                   16.4             7.0              58
The last examples in Fig. 6 and Fig. 8 are SR reconstructions of two LR faces wearing glasses. Although it is hard to discern the glasses in the input images, our method is still able to synthesize these details so that they closely resemble the originals, and no glasses are mistakenly added to the LR faces without glasses. In contrast, this "glasses effect" is apparent in the global reconstruction and random sample selection scenarios. Unlike the glasses-supplement experiments in [5] and [6], where glasses are artificially added to a face without glasses by using a training set with glasses, our method restores the high-frequency details more faithfully. The proposed method thus again exhibits its discriminating power when performing sample selection.
5. Conclusions
The high-frequency information of facial images is usually contained in local subtle variations, which as a whole are at the root of facial diversity. To capture the essence of these variations, we combined two unsupervised learning methods, LPP and PCA, to analyze the local structures of facial images on their sub-manifolds. The proposed method uses an adaptively selected training set to restore the local high-frequency components efficiently, and the experimental results demonstrate the feasibility of hallucinating faces using just a small training set.
6. References [1] S. Baker and T. Kanade. Hallucinating faces. In Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pages 83–88, 2000.
[2] S. Baker and T. Kanade. Limits on super-resolution and how to break them. IEEE TPAMI, 24(9):1167–1183, Sept. 2002.
[3] W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE Comput. Graph. Appl., 22(2):56–65, Apr. 2002.
[4] C. Liu, H. Y. Shum and W. T. Freeman. Face hallucination: theory and practice. IJCV, 75(1):115–134, Oct. 2007.
[5] X. G. Wang and X. O. Tang. Hallucinating face by eigentransformation. IEEE Trans. on Systems, Man, and Cybernetics, Part C, 35(3):425–433, 2005.
[6] Y. Zhuang, J. Zhang and F. Wu. Hallucinating faces: LPH super-resolution and neighbor reconstruction for residue compensation. Pattern Recognition, 40(11):3178–3194, 2007.
[7] A. Chakrabarti, A. N. Rajagopalan, and R. Chellappa. Super-resolution of face images using kernel PCA-based prior. IEEE Trans. on Multimedia, 9(4):888–892, 2007.
[8] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. CVPR 2004, I-275–I-282.
[9] G. Dedeoğlu, T. Kanade, and J. August. High-zoom video hallucination by exploiting spatio-temporal regularities. CVPR 2004, II-151–II-158.
[10] X. He and P. Niyogi. Locality Preserving Projections. In Proc. Conf. Advances in Neural Information Processing Systems, pages 327–334, 2004.
[11] M. Belkin and P. Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In Proc. Conf. Advances in Neural Information Processing Systems, pages 585–591, 2001.
[12] Y. Chang, C. Hu, and M. Turk. Manifold of Facial Expression. In Proc. IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pages 28–35, 2003.
[13] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
[14] X. He. Face Recognition Using Laplacianfaces. IEEE TPAMI, 27(3):328–340, 2005.
[15] L. K. Saul and S. T. Roweis. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Journal of Machine Learning Research, 4(6):119–155, 2003.
[16] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[17] A. Martinez and R. Benavente. The AR face database. CVC Technical Report No. 24, 1998.
[18] M. B. Stegmann. Analysis and segmentation of face images using point annotations and linear subspace techniques. Technical Report, DTU, 2002.
[19] F. Tarrés and A. Rama. GTAV Face Database, http://gps-tsc.upc.es/GTAV/
[20] K. I. Kim, M. O. Franz and B. Schölkopf. Iterative kernel principal component analysis for image modeling. IEEE TPAMI, 27(9):1351–1366, 2005.