IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 1, JANUARY 2016
Robust Face Sketch Style Synthesis

Shengchuan Zhang, Xinbo Gao, Senior Member, IEEE, Nannan Wang, Member, IEEE, and Jie Li

Abstract— Heterogeneous image conversion is a critical issue in many computer vision tasks, among which example-based face sketch style synthesis provides a convenient way to apply artistic effects to photos. However, existing face sketch style synthesis methods generate stylistic sketches depending on many photo-sketch pairs. This requirement limits the generalization ability of these methods to produce arbitrarily stylistic sketches. To overcome this drawback, we propose a robust face sketch style synthesis method, which can convert photos to arbitrarily stylistic sketches based on only one corresponding template sketch. In the proposed method, a sparse representation-based greedy search strategy is first applied to estimate an initial sketch. Then, multi-scale features and Euclidean distance are employed to select candidate image patches from the initial estimated sketch and the template sketch. To further refine the obtained candidate image patches, a multi-feature-based optimization model is introduced. Finally, by assembling the refined candidate image patches, the complete face sketch is obtained. To further enhance the quality of synthesized sketches, a cascaded regression strategy is adopted. Experimental results on several commonly used face sketch databases and on celebrity photos demonstrate the effectiveness of the proposed method compared with state-of-the-art face sketch synthesis methods.

Index Terms— Heterogeneous image conversion, face sketch synthesis, example-based stylization, sparse representation, multi-scale feature, cascaded regression.
I. INTRODUCTION
HETEROGENEOUS image conversion, transforming images from one modality to another, covers many research directions, such as image super-resolution [1], face hallucination [2], and various non-photorealistic rendering techniques [3]. Some previous methods transform input images depending on a single or a pair of reference images [4]–[7]. Other existing methods generate images based on mapping relations learned from a large training set of paired images [8]–[12]. Our proposed method focuses on face sketch synthesis based on just a single template sketch.

Manuscript received May 3, 2015; revised September 23, 2015; accepted November 12, 2015. Date of publication November 18, 2015; date of current version December 3, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61125204, Grant 61172146, Grant 61432014, and Grant 61501339, in part by the Fundamental Research Funds for the Central Universities under Grant JB149901 and Grant XJS15049, in part by the Program for Changjiang Scholars and Innovative Research Team in University of China under Grant IRT13088, in part by the China Post-Doctoral Science Foundation under Grant 2015M580818, and in part by the Shaanxi Innovative Research Team for Key Science and Technology under Grant 2012KCT-02. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Nilanjan Ray. S. Zhang and X. Gao are with the State Key Laboratory of Integrated Services Networks, School of Electronic Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]). N. Wang is with the State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]). J. Li is with the Video and Image Processing System Laboratory, School of Electronic Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2015.2501755

Since face sketch synthesis can not only assist law enforcement but also benefit digital entertainment, many researchers have devoted themselves to this field in recent years and achieved great progress. Some representative face sketch synthesis methods are summarized as follows. Tang and Wang [13] assumed that the transformation between photos and sketches can be approximated as a linear mapping. An input photo is first reconstructed from training photos by principal component analysis (PCA) [14], and a synthesized sketch is then obtained as a linear combination of the corresponding training sketches weighted by the same reconstruction coefficients. However, the linear assumption between a whole sketch and the corresponding photo is somewhat unreasonable, since human faces have complex nonlinear structures. To relax this limitation, Liu et al. [15] proposed a face sketch generation method employing the idea of locally linear embedding (LLE) [16]. They preserved the local linear geometry of the manifolds of photo patches and sketch patches: each sketch patch is synthesized independently, and the whole sketch image is obtained by averaging the overlapping areas between neighboring sketch patches. However, this leads to blurring effects and ignores the neighboring constraints between overlapping sketch patches. Gao et al. [17] and Wang et al. [18] extended the work of Liu et al. [15] by introducing sparse neighbor selection (SNS) to find closely related neighbors adaptively, and by adopting sparse representation based enhancement (SRE) to compensate for lost details. Moreover, Song et al. [19] applied the idea of image denoising to improve the method proposed in [15] and accelerated it with a GPU.
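The LLE-style patch combination of [15] can be sketched as follows. This is a minimal illustration, not the authors' implementation; the patch dimensionality, neighbor count, and regularizer are toy assumptions.

```python
import numpy as np

def lle_weights(patch, neighbors, reg=1e-6):
    """Solve min_w ||patch - sum_k w_k * neighbor_k||^2 s.t. sum_k w_k = 1
    via the standard constrained least-squares solution."""
    Z = neighbors - patch                        # shift neighbors by the query (K x d)
    G = Z @ Z.T + reg * np.eye(len(neighbors))   # regularized local Gram matrix
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()                           # enforce sum-to-one

def synthesize_patch(photo_patch, photo_neighbors, sketch_neighbors):
    """Transfer the LLE weights of the photo patch to the paired sketch patches."""
    w = lle_weights(photo_patch, photo_neighbors)
    return w @ sketch_neighbors
```

When the query patch coincides with one of its photo neighbors, the weights concentrate on that neighbor and the output approaches the paired sketch patch, which matches the intuition behind the method.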
Wang and Tang [20] presented a Markov random fields (MRF) model to tackle the problem of preserving compatible structure between overlapping sketch patches. Their method generates a face sketch from the "best" candidate sketch patches, which are obtained by optimizing the MRF model. However, the MRF model cannot produce new image patches, and the optimization problem involved in solving it is NP-hard. To handle these drawbacks, Zhou et al. [21] introduced a Markov weight fields (MWF) model that is capable of synthesizing new target patches not present in the training set. All the above methods conduct face sketch synthesis from the inductive learning perspective, which minimizes the empirical loss on training samples [22]. In contrast, Wang et al. [23] proposed a transductive face sketch synthesis method that minimizes the expected loss on test samples by incorporating the test samples into the learning process.
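To illustrate the compatibility idea behind MRF-based patch selection, a 1-D chain version can be solved exactly by dynamic programming. The actual models in [20], [21] operate on a 2-D patch grid with approximate inference, so the function below is only an illustrative sketch with hypothetical cost inputs.

```python
import numpy as np

def select_patches_chain(unary, pairwise):
    """Viterbi-style selection of one candidate per position on a 1-D chain.

    unary[i][k]      : cost of candidate k at position i (distance to the photo patch)
    pairwise[i][k][j]: overlap-incompatibility cost between candidate k at
                       position i and candidate j at position i+1
    Returns the minimum-cost candidate index sequence."""
    n = len(unary)
    cost = np.array(unary[0], dtype=float)
    back = []
    for i in range(1, n):
        # total[k, j] = best cost ending with (k at i-1, j at i)
        total = cost[:, None] + np.asarray(pairwise[i - 1]) + np.asarray(unary[i])[None, :]
        back.append(np.argmin(total, axis=0))
        cost = np.min(total, axis=0)
    labels = [int(np.argmin(cost))]
    for bp in reversed(back):       # trace the optimal path backwards
        labels.append(int(bp[labels[-1]]))
    return labels[::-1]
```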
1057-7149 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
ZHANG et al.: ROBUST FACE SKETCH STYLE SYNTHESIS
Fig. 1. Framework of the proposed method.
Existing face sketch synthesis methods need a large training set of paired images. Although Gao et al. [24] introduced an embedded hidden Markov model (E-HMM) to describe the nonlinear relationship within a photo-sketch pair, their method adopts a selective ensemble strategy to synthesize a finer face sketch, which still requires many paired images as the training set. In general, the need for a large training set restricts the generalization and applications of face sketch synthesis technology. Moreover, to synthesize sketches of different styles with existing methods, face sketch databases of those styles must be acquired, which is expensive. Bae et al. [4] proposed an approach to tone management for photographs based on a two-scale non-linear decomposition of an image. Hertzmann et al. [5] introduced a framework for processing images by example, called "image analogies," based on a simple multi-scale autoregression. Liu et al. [6] presented a technique, called image-based surface detail transfer, to transfer geometric details from one surface to another. Zhang et al. [7] proposed a style transfer algorithm via a component analysis approach. These methods are non-photorealistic rendering techniques, which are unfit for real face sketches. To this end, we propose a novel framework for converting photos to stylistic sketches trained on only one template sketch. In this paper, we mainly focus on, but are not limited to, face images. The proposed approach is composed of three steps. Firstly, by adopting the idea of sparse representation based greedy search (SR-GS) [25], an initial sketch is generated. Subsequently, a multi-feature-based optimization model is constructed to synthesize a sketch with finer details. Finally, through cascaded image regression (CIR), we further enhance the quality of the synthesized sketches.
The proposed method can handle not only face images in frontal pose with normal lighting and neutral expression but also images captured in the wild, with variations in the number of faces per image, face pose, and image size. In this paper, we take face sketch synthesis as an example to introduce our method, as shown in Fig. 1.
Fig. 2. Illustration of the constructed graph model.
The contributions of this paper are threefold. (1) We propose a novel framework to synthesize sketches trained on only one template sketch. (2) We present a multi-feature-based optimization model to select candidate image patches. (3) Benefiting from SR-GS, which is employed to generate the initial sketch, our method is robust to non-facial factors. Since we search for candidate image patches within the whole image, the proposed method can reduce the effects resulting from variations in the number of faces, face poses, image sizes, etc. The rest of the paper is organized as follows: Section II introduces the proposed face sketch synthesis framework and implementation details. Experimental results and analyses are presented in Section III. Section IV concludes the paper.

II. FACE SKETCH SYNTHESIS FROM A GIVEN PHOTO AND A TEMPLATE SKETCH

Given two images as input, a test photo p and a template sketch t, our task is to generate the pseudo-sketch s of the test photo p in the same style as t. We construct a graphical model to denote the relationship between pseudo-sketch patches, as shown in Fig. 2. We first synthesize an initial sketch of the test photo p via the SR-GS method [25]. Then, for each test photo patch, we select 3K nearest neighbors as candidate sketch patches through three matching routes. In particular, we select K candidate patches each from test photo to test photo patch matching, test photo to template sketch patch matching, and pseudo-sketch to template sketch patch matching (see Fig. 1). Finally, a multi-feature-based optimization model is built to refine the obtained candidate sketch patches, and CIR is applied to further improve the quality of the synthesized sketches.

A. Sparse Representation Based Greedy Search

Since we only have a template sketch as the training set, we have to utilize an inter-modality distance between test photo patches and training sketch patches to select candidate image patches.
We choose to use the sparse coefficient values and the selection orders of dictionary atoms that are extracted from sparse representations. Our strategy is to assign each patch a sparse representation, and iteratively apply two criteria between sparse representations of photo patches and
sketch patches to measure the inter-modality distance. This strategy is called sparse representation based greedy search [25]. Although sparse representation has been widely used for face sketch synthesis, existing methods apply the sparse coefficient values to reconstruct new sketch patches and neglect the effect of the selection orders of dictionary atoms. In contrast, our method employs both the sparse coefficient values and the selection orders of dictionary atoms to select existing sketch patches from the whole template sketch. To synthesize an initial sketch from a test photo trained on a template sketch by the SR-GS strategy, we perform the following steps.
(1) Construct a training set from t. Image patches extracted from the image pyramid of a face image can serve as useful compensation. For the template sketch t, we build a Gaussian image pyramid to obtain L images at different scales, denoted t_1, · · · , t_L. To reduce overfitting, the sketch patches obtained from the Gaussian image pyramid are randomly partitioned into two non-overlapping sets, applied separately for synthesis and for dictionary learning. For convenience, and without loss of generality, image patches in odd layers are used for dictionary learning while those in even layers are used for synthesis in our implementation. The two non-overlapping image patch sets can be represented as {t_1, · · · , t_o, · · · , t_O} and {t_1, · · · , t_e, · · · , t_E}, respectively.
(2) Learn a feature dictionary D_s from the image patch set {t_1, · · · , t_o, · · · , t_O}. Given this set, we formulate the following optimization problem to learn a feature dictionary D_s for sketch patches:

    min_{D_s, C} ||T_o − D_s C||_2^2 + λ||C||_1
    s.t. ||D_s^i||_2^2 ≤ 1, ∀ i = 1, · · · , n,    (1)
where To = [t1 , · · · , to , · · · , t O ], Ds ∈ Rd×n , d is the dimensionality of each atom in Ds and n is the number of atoms in Ds . λ is experimentally set to 0.15 in our implementation. (3) Obtain the sparse representation set {c}s of image patch set {t1 , · · · , te , · · · , t E }. For each image patch te in {t1 , · · · , te , · · · , t E }, resolve the following optimization problem: ce = arg min te − Ds ce 22 + λce 1 . (2) We consider both the sparse coefficient value and the selection order of dictionary atoms. Then each sparse representation ce here includes the sparse coefficient value ve and the dictionary atom selection order oe . As a result, there are two sets corresponding to {c}s = {c1 , · · · , ce , · · · , c E }, which are denoted as {v}s = {v1 , · · · , ve , · · · , v E } and {o}s = {o1 , · · · , oe , · · · , o E } respectively; (4) Construct a test photo p. Given an input test photo p, we divide it into resulting in the set of overlapping patches, photo patches p1 , · · · , pn , · · · , p N ; (5) Obtain the sparse representation cn of image patch pn . For each photo patch pn , we can obtain its sparse
representation cn by optimizing the formula (2), where cn contains the sparse coefficient value vn and the dictionary atom selection order on ; (6) Select candidate sketch patches from {t1 , · · · , te , · · · , t E } for each test photo patch pn by SR-GS strategy. (6a) Set i = 1, θ = 2K ; (6b) We can find these sparse representations corresponding to {o}s whose i th element in oe is equal to the i th element io in on and set them as {c}io s . The cardinality T of set {c}s is discussed as follows: if 0 < T ≤ θ , replace {c}s with {c}io s and go to step (6e); if T = 0, keep {c}s and go to step (6e); if T > θ , replace {c}s with {c}io s and continue; (6c) 1/9 of these sparse representations corresponding to {v}s are found based on the following criterion: Di s = vin − vie 2 (3) where vin represents the i th element in vn and vie denotes the i th element in ve . Set selected sparse representations iv as {c}iv s and discuss the cardinality T of set {c}s as follows: iv if 0 < T ≤ θ , replace {c}s with {c}s and go to step (6e); if T = 0, keep {c}s and go to step (6e); if T > θ , replace {c}s with {c}iv s and continue; (6d) Set i = i + 1 and return to step (6b); (6e) Image patches {t}n = {tn1 , · · · , tnt , · · · , tnT } are picked out from {t1 , · · · , te , · · · , t E } corresponding to these sparse representations among {c}s = {cn1 , · · · , cnt , · · · , cnT } where cnt is the tth sparse representation which is similar to cn . If T < K , we apply replication operation to increase the number of image patches {t}n from T to K . If T > K , we employ the Euclidean distance about the high-frequency information between image patches in {t}n and pn to decrease the number of image patches {t}n from T to K . At the end, for photo patch pn , we can select K candidate sketch patches denoted as {t}n = {tn1 , · · · , tnk , · · · , tn K }; (7) Synthesize an initial sketch. 
After obtaining the candidate sketch patches for each test photo patch, we obtain the initial sketch of the test photo by applying the Markov random fields (MRF) model.

B. Patch Matching

Since the visual quality of the initial sketch s is not good enough, we use the initial estimate s as a starting point to estimate a refined sketch. The template sketch t carries the fidelity information, while the initial estimate s carries the reconstruction information. To fully utilize this information, we apply a patch matching operation: we find the sketch patches in both the template sketch t and the initial estimate s that best match each test photo patch. We first divide the template sketch t, the test photo p, and its initial estimate s into patches of even size with the same overlap, i.e., {t_1, · · · , t_m, · · · , t_M}, {p_1, · · · , p_n, · · · , p_N}, and {s_1, · · · , s_n, · · · , s_N}, where t_m stands for the m-th sketch patch of the template sketch t, p_n represents the n-th photo patch of the test photo p, and s_n denotes the n-th sketch patch of the initial sketch s.
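Before moving on, the candidate selection of Section II-A can be made concrete. The sketch below codes a patch with a greedy coder and then applies the narrowing of steps (6a)–(6e). Note the assumptions: the paper solves the ℓ1 problem of Eq. (2), while orthogonal matching pursuit is used here purely because it yields an explicit atom selection order; the function names and the literal 1/9 fraction of step (6c) are our reading, not the authors' code.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy sparse coding that records the atom selection order.
    Returns the coefficient vector and the list of atoms in selection order."""
    residual, order = x.astype(float).copy(), []
    coef = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if j not in order:
            order.append(j)
        c, *_ = np.linalg.lstsq(D[:, order], x, rcond=None)
        coef[:] = 0.0
        coef[order] = c
        residual = x - D[:, order] @ c
    return coef, order

def srgs_select(order_n, values_n, orders, values, K):
    """Steps (6a)-(6e): narrow candidate indices by atom-order agreement (6b)
    and coefficient proximity (6c) until at most theta = 2K remain."""
    cand = list(range(len(orders)))
    theta = 2 * K
    for i in range(len(order_n)):
        hit = [c for c in cand if orders[c][i] == order_n[i]]   # (6b)
        if len(hit) == 0:
            return cand
        if len(hit) <= theta:
            return hit
        cand = hit
        d = [abs(values[c][i] - values_n[i]) for c in cand]     # (6c), Eq. (3)
        keep = max(1, len(cand) // 9)                           # the 1/9 fraction
        hit = [c for _, c in sorted(zip(d, cand))][:keep]
        if len(hit) <= theta:
            return hit
        cand = hit
    return cand
```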
Since s is the pseudo-sketch corresponding to p, it is reasonable to assume that some candidate sketch patches of p_n can be obtained through s_n. We propose to match a test photo patch p_n with both test photo patches and template sketch patches, and additionally to match the initial sketch patch s_n with template sketch patches. As a result, for each test photo patch p_n there are three sources of candidate sketch patches: test photo to test photo patch matching, test photo to template sketch patch matching, and pseudo-sketch to template sketch patch matching. For brevity, we use p2p, p2s, and s2s to denote these three patch matching procedures. For p2p, we use the Euclidean distance between p_n and the other photo patches in {p_1, · · · , p_n, · · · , p_N} to search for K candidate photo patches; the corresponding K initial sketch patches are then used as candidate sketch patches of p_n. For p2s, since photos and sketches are in different modalities, we cannot match them directly. Instead, we build Gaussian image pyramids for photos and sketches and apply a multi-scale feature to search for K candidate sketch patches. In particular, given an image I, we construct its Gaussian image pyramid G_1(I), · · · , G_L(I) and then a pyramid of features F(I) = (G_1(I), · · · , G_L(I)). For each image patch of I, we obtain a feature vector that concatenates the features from the corresponding patches of each image in the pyramid. The feature vector is normalized and then projected onto the eigenvectors learned in the training stage; we call the projected coefficients the multi-scale feature. After obtaining the multi-scale features of the test photo patch p_n and of the sketch patches in {t_1, · · · , t_m, · · · , t_M}, we calculate the Euclidean distance between the multi-scale features to obtain the K candidate sketch patches in {t_1, · · · , t_m, · · · , t_M}.
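The three matching routes reduce to nearest-neighbor lookups once features are computed. The sketch below abstracts feature extraction away; all array names and shapes are hypothetical.

```python
import numpy as np

def knn_candidates(query_feat, pool_feats, K):
    """Indices of the K pool features closest to the query in Euclidean
    distance (used alike for p2p, p2s, and s2s matching)."""
    d = np.linalg.norm(pool_feats - query_feat, axis=1)
    return np.argsort(d)[:K]

def gather_candidates(pn_feat, photo_feats, init_sketch_patches,
                      pn_msf, sn_msf, template_msf, template_patches, K):
    """Collect 3K candidate sketch patches for one test photo patch:
    p2p in intensity space, p2s and s2s in multi-scale feature space."""
    p2p = init_sketch_patches[knn_candidates(pn_feat, photo_feats, K)]
    p2s = template_patches[knn_candidates(pn_msf, template_msf, K)]
    s2s = template_patches[knn_candidates(sn_msf, template_msf, K)]
    return np.concatenate([p2p, p2s, s2s], axis=0)
```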
For s2s, although the initial sketch and the template sketch are in the same modality, the quality of the initial sketch differs considerably from that of the template sketch, so we also apply the multi-scale feature to improve the patch matching precision. Here we again select the K candidate sketch patches in {t_1, · · · , t_m, · · · , t_M} most similar to the sketch patch s_n; the selected patches are good estimates of the final sketch patch corresponding to the test photo patch p_n. Through the above patch matching operations, for each test photo patch p_n we obtain a total of 3K candidate sketch patches via p2p, p2s, and s2s matching. The choice of the parameter K is discussed in the experiments.

C. Multi-Feature-Based Optimization Model

After patch matching, in order to discard noisy patches¹ and save computational cost, we reduce the number of candidate sketch patches from 3K to K_n for each test photo patch p_n. Sketch patches and photo patches are heterogeneous and differ significantly in their modality, both in geometry and in texture; a single feature cannot yield

¹We refer to the candidate sketch patches discarded by the multi-feature-based optimization model as noisy patches.
very good matching results. As we know, different features have different influences on patch matching. Hence, in order to exploit the complementarity among different features and to take adaptive weighting into consideration, we design the following energy function to refine the obtained candidate sketch patches:

    min_{u} Σ_{i=1}^{3K} [ Σ_{l=1}^{L} u_{il} ||f_l(p_n) − f_l(t_i)|| + λ_i ||u_i||^2 ]
    s.t. Σ_{l=1}^{L} u_{il} = 1, 0 ≤ u_{il} ≤ 1, i = 1, · · · , 3K, l = 1, · · · , L,    (4)

where f_l(·) is the l-th feature extraction operation, L is the total number of features, t_i is the i-th candidate sketch patch, and u_{il} is the weight of the l-th feature of t_i used to judge the similarity between t_i and p_n. After minimizing the energy function, we select the top K_n candidate sketch patches with the smallest energies

    Σ_{l=1}^{L} u_{il} ||f_l(p_n) − f_l(t_i)||, i = 1, · · · , 3K.    (5)
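Each candidate's inner problem in the energy function above, minimizing a weighted distance plus a squared regularizer over the simplex constraint, has a closed form via Euclidean projection onto the simplex. The numpy sketch below is illustrative: λ, the distance matrix, and the function names are toy assumptions, not the paper's solver.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {u : sum(u) = 1, u >= 0}
    (the standard sorting-based algorithm)."""
    s = np.sort(v)[::-1]
    css = np.cumsum(s) - 1.0
    rho = np.nonzero(s - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def candidate_energies(feat_dists, lam=0.1):
    """feat_dists[i, l] = ||f_l(p_n) - f_l(t_i)||.  For each candidate i,
    min_u u.d + lam*||u||^2 on the simplex is solved in closed form as
    u* = project_simplex(-d / (2*lam)); return the weighted energies of Eq. (5)."""
    energies = []
    for d in feat_dists:
        u = project_simplex(-d / (2.0 * lam))
        energies.append(u @ d)
    return np.array(energies)

def refine(feat_dists, K_n, lam=0.1):
    """Keep the K_n candidates with the smallest energy."""
    return np.argsort(candidate_energies(feat_dists, lam))[:K_n]
```

A candidate whose feature distances are uniformly smaller receives a lower energy and survives the refinement, which is the intended behavior of Eq. (5).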
In our implementation, we employ eight different kinds of features: normalized intensity, the SURF feature [26], the multi-scale feature of the Gaussian pyramid, the multi-scale feature of the Laplacian pyramid, the multi-scale features of the horizontal and vertical first derivatives of the Gaussian pyramid, and the multi-scale features of the horizontal and vertical second derivatives of the Gaussian pyramid. The multi-scale feature is described in Section II-E.

D. Cascaded Image Synthesis

We apply an iterative strategy to improve the visual quality of the synthesized result. In particular, as shown in Fig. 1, we use the output sketch to replace the initial sketch and perform the strategies of Sections II-B and II-C sequentially. Once the quality of the output sketch is better than that of the initial sketch, the patch matching of Section II-B becomes more accurate, especially for s2s; as a result, the visual quality of the new output sketch can be much better. We call this iterative strategy cascaded image synthesis, abbreviated as CIS.

E. Implementation Details

In the training stage, in order to obtain the eigenvectors that are applied to compute multi-scale features in the test stage, we first construct the Gaussian image pyramid and the Laplacian image pyramid of the template sketch t. Then, for each image in the Gaussian pyramid, we calculate its first and second derivatives in the horizontal and vertical directions. Hence, for the template sketch t, we have six image pyramids, as shown in Fig. 3: the Gaussian image pyramid, the Laplacian image pyramid, the horizontal first derivative pyramid, the vertical first derivative pyramid, the horizontal second derivative pyramid, and the vertical second derivative pyramid. For each pyramid, we extract M parent structures as follows:
    F_P^m = [P_1^m; · · · ; P_l^m; · · · ; P_L^m], m = 1, · · · , M,
Algorithm 1 Face Sketch Style Synthesis
Fig. 3. Six different image pyramids generated from a template sketch.
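A numpy sketch of the parent-structure and eigenvector computation described in this subsection. The 2 × 2 box-filter downsampling, the patch size, and the function names are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def downsample(img):
    """2x2 box-filter downsampling (a stand-in for Gaussian smoothing)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    x = img[:h, :w]
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def gaussian_pyramid(img, levels):
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr

def parent_structures(pyr, patch=4):
    """Stack co-located patches across all pyramid levels into one vector
    F_P^m = [P_1^m; ...; P_L^m] for each coarsest-level patch position m."""
    gh, gw = pyr[-1].shape[0] // patch, pyr[-1].shape[1] // patch
    feats = []
    for gy in range(gh):
        for gx in range(gw):
            parts = []
            for lvl, im in enumerate(pyr):
                s = patch * 2 ** (len(pyr) - 1 - lvl)   # patch size at this level
                parts.append(im[gy * s:(gy + 1) * s, gx * s:(gx + 1) * s].ravel())
            feats.append(np.concatenate(parts))
    return np.array(feats)

def learn_eigenvectors(F, n_keep):
    """Leading eigenvectors of C = sum_m (F_m - mean)(F_m - mean)^T."""
    Fc = F - F.mean(axis=0)
    w, V = np.linalg.eigh(Fc.T @ Fc)
    return V[:, ::-1][:, :n_keep]
```

Projecting a new parent structure onto these eigenvectors yields the multi-scale feature used for p2s and s2s matching.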
where F_P^m is the m-th parent structure of pyramid P, which represents one of the six image pyramids, P_l^m is the m-th image patch in the l-th image of pyramid P, and M is the total number of image patches in each image of pyramid P. For each image pyramid P, a set of corresponding eigenvectors is computed from the following covariance matrix:

    C = Σ_{m=1}^{M} (F_P^m − m_P)(F_P^m − m_P)^T,
where m_P is the mean of the M parent structures. We thus obtain six different sets of eigenvectors, one per image pyramid P. In the test stage, we need to obtain the multi-scale features of the template sketch t, the test photo p, and its initial estimate s. Since the process of calculating the multi-scale feature is the same for all three images, we take the test photo p as an example. Given the test photo p, we first build its six image pyramids as in the training stage. Then, for each test photo patch p_n, we obtain its six parent structures from the six image pyramids. Each parent structure F_P^n is projected onto the corresponding subspace spanned by the eigenvectors learned in the training stage; the projected coefficients form the multi-scale feature for the image pyramid P. The entire face sketch style synthesis process is summarized in Algorithm 1. In the patch matching stage described in Section II-B, we employ the high-frequency information of each test photo patch p_n to determine which patch matching procedure to use. We argue that when the test patch p_n is flat, p2p alone is sufficient. When the test patch p_n is not flat, the above three patch
Fig. 4. Template sketches.
matching procedures are utilized simultaneously to maintain accuracy.

III. EXPERIMENTAL RESULTS AND ANALYSES

We conduct experiments on the CUHK face sketch database and on several celebrity face photos collected from the web. The CUHK face sketch database consists of 606 photo-sketch pairs, including 188 pairs from the CUHK student database [20], 123 pairs from the AR database [14], and 295 pairs from the XM2VTS database [16]. Each photo-sketch pair corresponds to the same subject. Most of the photos were taken in a frontal pose with a neutral expression under normal lighting; the corresponding sketches were drawn by an artist while viewing the photos. In all experiments, one template sketch from the CUHK face sketch database and three other stylistic sketches downloaded from the websites² are used for training (i.e., they are the template sketches; see Fig. 4). The remaining photos and the celebrity face photos are taken as test images.

A. Face Sketch Synthesis With Baseline Template

We first investigate the performance of the proposed face sketch synthesis method trained on the baseline template, which can be any sketch in the CUHK face sketch database.

²http://www.huabao.me/p/70471/, http://dinaan-jie.blog.163.com/blog/static/834500502011529631063/, http://www.qqzhi.com/touxiang/417898/
Fig. 5. Synthesized sketch examples by the proposed method.

Fig. 8. Effect of the level number of the image pyramid.
Fig. 6. Effect of the patch size. The first number is the patch size; the second number is the size of the overlapping area, which is sixty percent of the patch size; the third number is the level number of the image pyramid, which is fixed at 3.
Fig. 7. Effect of the neighborhood size K_n after optimization.
Some synthesized sketches are shown in Fig. 5. It can be seen that the proposed method is robust to background variations. There are many parameter settings in our experiments. In Section II-A, we set the patch size to 10 × 10 with half-region overlap for the test photo, while we set the patch size to 10 × 10 with 9 pixels of overlap for the
Fig. 9. Effect of the number of iterations in cascaded image regression.
template sketch to learn the dictionary and synthesize the initial sketch. In the other sections, we experimentally set the patch size to 5 × 5 with sixty percent (3 pixels) overlap. The effect of the patch size with a fixed number of pyramid levels is shown in Fig. 6. The effects of noise become more serious as the patch size increases, so we set the patch size to 5 × 5, which avoids an obvious increase in computation. In Section II-B, the number of nearest neighbors is set to 5 for each patch matching process (p2p, p2s, and s2s), so in Section II-C we have a total of 15 candidate sketch patches to refine for each test photo patch. As shown in Fig. 7, if the number of candidate sketch patches kept after optimization is small, the noise is serious; if this number is large, in other words, if the influence of the optimization is small, blocking and blurring effects increase. Hence, the number of nearest neighbors after optimization is set to 7 in our experiments. After the patch size and the neighborhood size are determined, we consider the effect of the number of pyramid levels. As we can see in Fig. 8, the level
Fig. 10. Comparison between the proposed method, LLE method [15], MRF-based method [20], MWF-based method [21], Transductive-based method [23] and SSD method [19].
number of the image pyramid is set to 2, which achieves comparatively good results. In the cascaded image regression stage, we iterate only 3 times, because the visual quality of the synthesized sketches does not improve obviously with further iterations, as can be seen in Fig. 9, while the running time grows heavily as the number of iterations increases. In Fig. 10, we compare the synthesized results of the proposed method with the LLE method [15], the MRF method [20], the MWF method [21], the transductive-based method [23], and the SSD method [19]. Existing methods directly apply image intensity to find the nearest neighbors and impose a constraint on the search region, so they cannot deal well with the backgrounds of photos from the XM2VTS database, as can be seen in Fig. 10. Since the search region is restricted, existing methods cannot produce glasses when the training data excludes glasses. In contrast, the proposed method overcomes these drawbacks and obtains satisfactory results. It should be noted that we performed face sketch synthesis for
existing methods with the training and testing set partition suggested in [23], since the synthesized results are poor when existing methods are trained on a single photo-sketch pair. Fig. 12 shows some synthesized results generated by the MWF method trained on a single photo-sketch pair. It can be seen that existing methods need a large number of training photo-sketch pairs to guarantee robustness and generalization.

B. Face Sketch Synthesis With Different Stylistic Sketches

Although some state-of-the-art methods perform well on the CUHK face sketch database, their performance depends heavily on a large set of photo-sketch pairs. Hence, they cannot flexibly synthesize different kinds of stylistic sketches, because it is difficult to collect many photo-sketch pairs of different styles. Our method needs only a template sketch and performs consistently well when trained on different stylistic template sketches. With different stylistic sketches as the training data, the parameter settings are kept unchanged from the descriptions
Fig. 11. Some synthesized sketches from different databases with different stylistic sketches.
in Section III-A. Some synthesized sketches trained on different stylistic sketches are shown in Fig. 11. From Fig. 11, it can be seen that the proposed method achieves stable performance as the training template sketch changes.

C. Celebrity Face Photos From the Web

The generalization ability of our method is further tested on some Chinese celebrity face photos collected from the web. These photos have different backgrounds with uncontrolled lighting and pose variations; some images even include multiple faces. Some synthesized results are shown in Fig. 13.

D. Objective Quality Assessment

We conduct objective quality assessment on the CUHK face sketch database to validate the effectiveness of the proposed method. We use the structural similarity (SSIM) index [27] and the feature similarity (FSIM) index [28] to assess the visual quality of the synthesized sketches. SSIM captures the loss of structure in an image, based on the hypothesis that the human visual system (HVS) is highly adapted to extract
Fig. 12. Some synthesized results by the MWF method with only one photo-sketch pair as the training set.
the structural information from the visual scene. Since the HVS perceives an image mainly through its low-level features, such as edges and zero-crossings, FSIM is devised by comparing the low-level feature sets between the reference image and the distorted image.
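For illustration, the core of the SSIM comparison between a synthesized sketch and its artist-drawn reference can be sketched as follows. This is a simplified global variant computed over the whole image (the published index [27] averages the same formula over local sliding windows), and the array shapes are chosen for illustration only:

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM computed over the whole image.

    The published SSIM [27] averages this formula over local sliding
    windows; computing it globally only illustrates its structure.
    """
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from [27]
    c2 = (0.03 * data_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# A sketch compared with itself scores exactly 1.0.
sketch = np.random.default_rng(0).random((64, 64)) * 255
print(global_ssim(sketch, sketch))  # 1.0
```

FSIM follows the same full-reference pattern but compares phase congruency and gradient magnitude maps instead of raw intensity statistics.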
Fig. 13. Some synthesized results of celebrity face photos obtained from the web.
TABLE I
SSIM Values of Different Methods on Different Databases
Since the above two objective quality assessment metrics are full-reference, we use the sketches drawn by artists as the reference images. Tables I and II³ show the average quality scores of the synthesized sketches generated by the methods in [15], [19]–[21], and [23] and by the proposed method. Although the proposed method uses only a template sketch as the training set, its performance is promising and comparable with the state-of-the-art methods on the three databases in terms of both the SSIM and FSIM metrics.

³Tables I and II are based only on the synthesized sketches of the same style provided by the CUHK face sketch database.

TABLE II
FSIM Values of Different Methods on Different Databases

The photos from the XM2VTS database are the most complex compared with those from the other two databases, since the XM2VTS database contains 295 subjects, nearly half of the CUHK face sketch database, and these subjects vary in age, race, hair color, etc. What is more, the backgrounds tend to be confused with the hair when face photos from the XM2VTS database are converted to grayscale images. So the proposed method is effective and robust for different
Fig. 14. Statistical curves of (a) FSIM scores and (b) SSIM scores.
Fig. 15. Comparison of cumulative match scores between our method and five other typical methods (LLE-based, MRF-based, MWF-based, transductive learning-based, and SSD) on different databases. (a) The CUHK student database. (b) The AR database. (c) The XM2VTS database.
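The cumulative match score curves of Fig. 15 can be obtained from a distance matrix between synthesized sketches (probes) and artist-drawn sketches (gallery). A minimal sketch of this computation, under the illustrative convention that gallery index i is the true identity of probe i:

```python
import numpy as np

def cumulative_match_scores(dist, max_rank):
    """dist[i, j]: distance from synthesized probe i to gallery sketch j.

    Assumes gallery index i is the true identity of probe i. Returns the
    fraction of probes whose true match appears within rank 1..max_rank.
    """
    order = np.argsort(dist, axis=1)           # gallery indices, best first
    truth = np.arange(dist.shape[0])[:, None]
    ranks = np.argmax(order == truth, axis=1)  # 0-based rank of true match
    return np.array([(ranks < r).mean() for r in range(1, max_rank + 1)])

# Toy example: the true match is always the closest gallery entry,
# so the score is 1.0 at every rank.
d = np.random.default_rng(1).random((5, 5)) + 1.0
np.fill_diagonal(d, 0.0)
print(cumulative_match_scores(d, 3))  # [1. 1. 1.]
```

By construction the curve is non-decreasing in rank, which matches the shape of the curves plotted in Fig. 15.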
test photos. Fig. 14 compares the statistical results of the proposed method with those of five other methods [15], [19]–[21], [23]. From curve (a) in Fig. 14, we can see that although the proposed method cannot synthesize as many high-quality sketches as the state-of-the-art methods do, it can still generate sketches of ordinary quality, which demonstrates its robustness and generalization ability.

E. Face Sketch Recognition

We conduct face sketch recognition on the three aforementioned databases, which can reflect the visual quality of the synthesized sketches to some extent [29]. We use the sketches drawn by artists to match the synthesized sketches, and apply the eigenface method [30] for simple face sketch recognition. There are 518 synthesized sketches in total from the three databases, while the remaining 88 sketches from the CUHK student database are employed as the training data to learn the eigenface model. From Fig. 15, it can be found that the proposed method achieves higher sketch recognition rates than the MRF method on the XM2VTS database, which is the most challenging of the three databases.

F. Limitations and Discussions

When the test photo is taken under dark side lighting, the visual quality of the sketches synthesized by the proposed method remains acceptable, as shown in Fig. 16. The proposed method can still achieve promising results because the multi-feature-based optimization model is applied, where SURF
Fig. 16. Synthesized sketches of photos under dark side lighting. First row: test photos. Second row: results by the method in [29]. Third row: results by the method in [20] with luminance remapping [5]. Last row: results by the proposed method.
is robust to illumination variations. On the other hand, the proposed method extends the local search area to the whole face region; hence, the synthesized sketches omit some facial textures. Besides, the proposed method is time-consuming due to the adopted optimization strategy and the cascaded image regression procedure. Since we only have a template sketch at hand,
Fig. 17. Some failed results by the proposed method.
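The eigenface recognition protocol of Section III-E (fit PCA on held-out sketches, then match each synthesized sketch to the artist-drawn gallery in the eigenspace) can be sketched as follows. PCA is taken here via an SVD, and all array shapes are illustrative assumptions rather than the dimensions used in our experiments:

```python
import numpy as np

def fit_eigenfaces(train, k):
    """train: (n_train, d) flattened sketches; returns (mean, k components)."""
    mean = train.mean(axis=0)
    # Principal directions from the SVD of the centered training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]

def rank1_rate(mean, comps, gallery, probes):
    """Nearest-neighbor matching in the eigenface subspace.

    gallery: artist-drawn sketches; probes: synthesized sketches;
    probe i's true identity is assumed to be gallery i.
    """
    g = (gallery - mean) @ comps.T
    p = (probes - mean) @ comps.T
    dists = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=2)
    return (dists.argmin(axis=1) == np.arange(len(probes))).mean()

# Sanity check on toy data: matching the gallery against itself is perfect.
rng = np.random.default_rng(0)
mean, comps = fit_eigenfaces(rng.random((20, 64)), 5)
gallery = rng.random((10, 64))
print(rank1_rate(mean, comps, gallery, gallery))  # 1.0
```

In the actual experiment, the 88 held-out CUHK student sketches play the role of the training set, the 518 artist-drawn sketches the gallery, and the 518 synthesized sketches the probes.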
Fig. 18. Some synthesized results without the initial sketch. The first and second columns: results of photos from the CUHK student database. The third and fourth columns: results of photos from the AR database. The fifth and last columns: results of photos from the XM2VTS database.
in order to mine the knowledge contained in the template sketch, we repeatedly utilize the template sketch via various operations, which leads to a heavy computational cost. The time complexity of our method is about O(MdL) + O(tMp²M), while that of existing methods is O(cp²MN). Here M is the total number of image patches contained in each image, d is the number of dimensions examined to reach leaf cells [25], L is the number of buckets at each dimension, t is the number of iterations, p is the patch size, c is the number of candidates in the search region around one patch, and N is the number of photo-sketch pairs for training. In our experiments, cN ≈ M. Hence, the time complexity of the proposed method is larger than that of existing methods. Fig. 17 shows some failed synthesized results. We will take these defects into consideration for further improvement.

The initial sketch, which is generated by the SR-GS strategy, is necessary. Since the initial sketch includes reconstruction information, which is robust to non-facial factors such as image backgrounds, we need the initial sketch to produce a high-quality sketch. In particular, if the initial sketch does not exist, patch matching in Section II-B consists only of test-photo-to-template-sketch patch matching (p2s). Fig. 18 shows the synthesized results without the initial sketch. From Fig. 18, we can see that the synthesized results from the XM2VTS database are not robust to image backgrounds, which further demonstrates that the initial sketch is essential in the proposed method.

Since photo and sketch patches have different visual appearances, it is difficult to match them using image intensities directly. However, there exists some similarity in terms of image gradients between a photo and its corresponding sketch. The multi-scale feature considers the long-range dependency among local patches by introducing an image pyramid. That is
Fig. 19. Some synthesized sketches. The first row: test photos. The second row: photo to sketch patch matching by image intensities. The third row: photo to sketch patch matching by multi-scale features. The fourth row: photo to sketch patch matching by sparse representations.
Fig. 20. Some synthesized sketches from different databases using the template sketch with large shape exaggeration.
to say, it applies both image intensities and image gradients at multiple scales to improve the accuracy of patch matching. Thus, to some extent, the multi-scale feature can be employed to directly measure the inter-modality distance between test photo patches and training sketch patches, which makes multi-scale features effective. Sparse representation is applied to search for the K nearest neighbor patches of a test patch. Many methods apply image intensities or image gradients as measurement features, which cannot deal well with image backgrounds and face skin colors. We therefore introduce sparse representation as the patch feature for nearest neighbor search, which is robust to image backgrounds and face skin colors without much additional time cost. We compare the synthesized sketches generated from candidate sketch patches selected by image intensities, multi-scale features, and sparse representations, respectively, in Fig. 19. From Fig. 19, it can be seen that the results generated by multi-scale features and sparse representations each have their own advantages, which the proposed method takes into consideration simultaneously.

In our experiments, we use normal sketches without large shape exaggeration as the template sketches. Intuitively, if the template sketches have large shape exaggeration, our method cannot synthesize sketches with similar shape exaggeration, because we did not take this factor into consideration when designing the proposed method. From Fig. 20, we can
see that the synthesized sketches do not possess any shape exaggeration. Yet, the synthesized sketches are still reasonable images.

IV. CONCLUSION

In practice, one sometimes encounters an appealing stylistic sketch whose corresponding photo is missing, and desires to transform given photos into sketches of the same style as the one at hand. However, existing face sketch synthesis methods based on photo-sketch pairs cannot meet this requirement. In this paper, we present a face sketch synthesis method trained on a single template stylistic sketch to handle such situations. First, we introduce the idea of sparse representation-based greedy search to generate an initial sketch. Second, we construct a multi-feature-based optimization model to improve the visual quality of the synthesized sketch. Finally, we further enhance the visual quality of the synthesized sketches through a cascaded image regression strategy. The proposed approach is designed to handle face photos in frontal pose with normal lighting and neutral expression; nevertheless, it can also be applied to images in the wild, with variations in the number of faces per image, face pose, and image size. Experimental results validate the effectiveness and robustness of the proposed method in comparison with state-of-the-art methods.

REFERENCES

[1] J. Tian and K.-K. Ma, “A survey on super-resolution imaging,” Signal, Image Video Process., vol. 5, no. 3, pp. 329–342, 2011. [2] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “A comprehensive survey to face hallucination,” Int. J. Comput. Vis., vol. 106, no. 1, pp. 9–30, 2014. [3] T. Strothotte and S. Schlechtweg, Non-Photorealistic Computer Graphics: Modeling, Rendering and Animation. San Francisco, CA, USA: Morgan Kaufmann, 2002. [4] S. Bae, S. Paris, and F. Durand, “Two-scale tone management for photographic look,” ACM Trans. Graph., vol. 25, no. 3, pp. 637–645, 2006. [5] A. Hertzmann, C. E. Jacobs, N. Oliver, B.
Curless, and D. H. Salesin, “Image analogies,” in Proc. 28th SIGGRAPH, 2001, pp. 327–340. [6] Z. Liu, Z. Zhang, and Y. Shan, “Image-based surface detail transfer,” IEEE Comput. Graph. Appl., vol. 24, no. 3, pp. 30–35, May/Jun. 2004. [7] W. Zhang, C. Cao, S. Chen, J. Liu, and X. Tang, “Style transfer via image component analysis,” IEEE Trans. Multimedia, vol. 15, no. 7, pp. 1520–1601, Nov. 2013. [8] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” Int. J. Comput. Vis., vol. 40, no. 1, pp. 25–47, 2000. [9] S. Baker and T. Kanade, “Limits on super-resolution and how to break them,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1167–1183, Sep. 2002. [10] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun./Jul. 2004, p. I. [11] C. Liu, H.-Y. Shum, and W. T. Freeman, “Face hallucination: Theory and practice,” Int. J. Comput. Vis., vol. 75, no. 1, pp. 115–134, 2007. [12] X. Tang and X. Wang, “Face sketch recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 50–57, Jan. 2004. [13] X. Tang and X. Wang, “Face photo recognition using sketch,” in Proc. IEEE Int. Conf. Image Process., Sep. 2002, pp. I-257–I-260. [14] A. Martínez and R. Benavente, “The AR face database,” The Ohio State Univ., Columbus, OH, USA, Tech. Rep. #24, 1998. [15] Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma, “A nonlinear approach for face sketch synthesis and recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 1005–1010. [16] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The extended M2VTS database,” in Proc. 2nd Int. Conf. Audio Video Biometric Person Authentication, 1999, pp. 72–77. [17] X. Gao, N. Wang, D. Tao, and X. Li, “Face sketch–photo synthesis and retrieval using sparse representation,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 8, pp. 1213–1226, Aug. 2012.
[18] N. Wang, J. Li, D. Tao, X. Li, and X. Gao, “Heterogeneous image transformation,” Pattern Recognit. Lett., vol. 34, no. 1, pp. 77–84, 2013. [19] Y. Song, L. Bao, Q. Yang, and M.-H. Yang, “Real-time exemplar-based face sketch synthesis,” in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 800–813. [20] X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, Nov. 2009. [21] H. Zhou, Z. Kuang, and K.-Y. K. Wong, “Markov weight fields for face sketch synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1091–1097. [22] A. Schwaighofer and V. Tresp, “Transductive and inductive methods for approximate Gaussian process regression,” in Proc. Adv. Neural Inf. Process. Syst., 2003, pp. 953–960. [23] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “Transductive face sketch-photo synthesis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 9, pp. 1364–1376, Sep. 2013. [24] X. Gao, J. Zhong, J. Li, and C. Tian, “Face sketch synthesis algorithm based on E-HMM and selective ensemble,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 4, pp. 487–496, Apr. 2008. [25] S. Zhang, X. Gao, N. Wang, J. Li, and M. Zhang, “Face sketch synthesis via sparse representation-based greedy search,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2466–2477, Aug. 2015. [26] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Understand., vol. 110, no. 3, pp. 346–359, 2008. [27] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004. [28] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011. [29] W. Zhang, X. Wang, and X.
Tang, “Lighting and pose robust face sketch synthesis,” in Proc. 11th Eur. Conf. Comput. Vis., 2010, pp. 420–433. [30] K. Delac, M. Grgic, and S. Grgic, “Independent comparative study of PCA, ICA, and LDA on the FERET data set,” Int. J. Imag. Syst. Technol., vol. 15, no. 5, pp. 252–260, 2005. Shengchuan Zhang received the B.Eng. degree in electronic information engineering from Southwest University, Chongqing, China, in 2011. He is currently pursuing the Ph.D. degree in intelligent information processing with the VIPS Laboratory, School of Electronic Engineering, Xidian University. His current research interests include computer vision and pattern recognition.
Xinbo Gao (M’02–SM’07) received the B.Eng., M.Sc., and Ph.D. degrees from Xidian University, Xi’an, China, in 1994, 1997, and 1999, respectively, all in signal and information processing. From 1997 to 1998, he was a Research Fellow with the Department of Computer Science, Shizuoka University, Shizuoka, Japan. From 2000 to 2001, he was a Post-Doctoral Research Fellow with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. Since 2001, he has been with the School of Electronic Engineering, Xidian University. He is currently a Cheung Kong Professor of the Ministry of Education, a Professor of Pattern Recognition and Intelligent Systems, and the Director of the State Key Laboratory of Integrated Services Networks, Xi’an, China. He has published five books and around 200 technical articles in refereed journals and proceedings in the above areas, including the IEEE Transactions on Image Processing, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Systems, Man, and Cybernetics, the International Journal of Computer Vision, and Pattern Recognition. His current research interests include multimedia analysis, computer vision, pattern recognition, machine learning, and wireless communications. He is on the editorial boards of several journals, including Signal Processing (Elsevier) and Neurocomputing (Elsevier). He is currently a fellow of the Institution of Engineering and Technology. He has served as the General Chair/Co-Chair, Program Committee Chair/Co-Chair, or PC Member for around 30 major international conferences.
Nannan Wang (M’15) received the B.Sc. degree in information and computation science from the Xi’an University of Posts and Telecommunications in 2009, and the Ph.D. degree in information and telecommunications engineering from Xidian University in 2015. He is currently with the State Key Laboratory of Integrated Services Networks, Xidian University. From 2011 to 2013, he was a visiting Ph.D. student with the University of Technology, Sydney, NSW, Australia. His current research interests include computer vision, pattern recognition, and machine learning. He has published more than ten papers in refereed journals and proceedings, including the International Journal of Computer Vision (IJCV) and the IEEE T-NNLS, T-IP, and T-CSVT.
Jie Li received the B.Sc. degree in electronic engineering, the M.Sc. degree in signal and information processing, and the Ph.D. degree in circuits and systems from Xidian University, Xi’an, China, in 1995, 1998, and 2004, respectively. She is currently a Professor with the School of Electronic Engineering, Xidian University. In these areas, she has authored around 50 technical articles in refereed journals and proceedings, including the IEEE Transactions on Image Processing, the IEEE Transactions on Circuits and Systems for Video Technology, and Information Sciences. Her research interests include image processing and machine learning.