Video Super-Resolution via Dynamic Texture Synthesis

Chih-Chung Hsu1, Li-Wei Kang2, and Chia-Wen Lin1,*

1 Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan
2 Graduate School of Engineering Science and Technology-Doctoral Program, and Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Yunlin, Taiwan
Email: [email protected], [email protected], [email protected]
Abstract—This paper addresses the problem of hallucinating the missing high-resolution (HR) details of a low-resolution (LR) video while maintaining the temporal coherence of the hallucinated HR details by using dynamic texture synthesis (DTS). Most existing multi-frame-based video super-resolution (SR) methods suffer from limited reconstruction quality due to inaccurate sub-pixel motion estimation between the frames of a LR video. To achieve high-quality reconstruction of the HR details of a LR video, we propose a texture-synthesis-based video SR method, in which a novel DTS scheme renders the reconstructed HR details in a temporally coherent way, thereby effectively addressing the temporal incoherence that arises when traditional texture-synthesis-based image SR methods are applied frame by frame. To further reduce complexity, our method performs DTS-based SR only on a selected set of key-frames, while the HR details of the remaining non-key-frames are predicted using bi-directional overlapped block motion compensation. Experimental results demonstrate that the proposed method achieves significant subjective and objective quality improvements over state-of-the-art video SR methods.

Keywords—video super-resolution; video hallucination; dynamic texture synthesis; video upscaling; motion-compensated interpolation.
I. INTRODUCTION
With the rapid development of multimedia and network technologies, delivering and sharing multimedia content through the Internet and heterogeneous devices has become increasingly popular. However, limited by channel bandwidth and storage capacity, videos distributed over the Internet often exist only in low-quality versions degraded from their sources. The most common video degradations are downscaling and compression. In this paper, we focus on video super-resolution (SR) for a video degraded by downscaling. Related applications include resolution enhancement of videos captured by resource-limited mobile devices or low-cost surveillance devices. Enhancing video resolution also benefits further applications, such as face, action, or object recognition,
behavior analysis, and video retrieval.

A. Image Super-Resolution

Most SR methods in the literature were designed for image SR. The goal of image SR is to recover a high-resolution (HR) image from one or multiple low-resolution (LR) input images, which is essentially an ill-posed inverse problem [1]. There are two main categories of image SR approaches: (i) traditional approaches and (ii) exemplar/learning-based approaches. Among the traditional approaches, one sub-category is reconstruction-based methods, where a set of LR images of the same scene are aligned with sub-pixel accuracy to generate an HR image [2]. Such approaches are usually time-consuming and somewhat impractical because they require multiple input LR images with accurate image registration. The other sub-category of traditional approaches is frame interpolation [3], which has been shown to generate overly smooth images with ringing and jagged artifacts. It has also been shown that the limit on the magnification factor achievable by traditional approaches is not easy to break [4].

The exemplar/learning-based methods [5]−[8] hallucinate the high-frequency details of a LR image based on the co-occurrence prior between LR and HR image patches in a training set, and have been shown to provide much finer details than traditional approaches. More specifically, for a LR input, exemplar-based methods search for similar image patches in a pre-collected training LR image dataset, or in the same image itself based on self-examples, and use their corresponding HR versions to produce the final SR output. Nevertheless, the HR details hallucinated by such approaches are not guaranteed to be the true HR details. Hence, the performance of this approach relies heavily on the similarity between the training set and the test set, or on the self-similarity within the image itself. Moreover, learning-based SR approaches [9]−[11] focus on modeling the relationship between different resolutions of images. For example, Yang et al. [9] proposed to apply sparse coding techniques to learn a compact representation of HR/LR patch pairs for SR based on pre-collected HR/LR image pairs (SC-SR). Yang et al. [10] later extended [9] with a coupled dictionary training approach for SR based on patch-wise sparse recovery, where the learned coupled dictionaries relate the HR/LR image patch spaces via sparse representations.
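To make the patch-wise sparse-coding idea behind SC-SR [9] concrete, the following minimal NumPy sketch codes a LR patch over a LR dictionary with a simple orthogonal matching pursuit and reconstructs the HR patch with the coupled HR dictionary. The random dictionaries, patch sizes, and sparsity level are illustrative assumptions for this sketch, not the trained model of [9], which instead obtains the sparse code by ℓ1-regularized minimization over jointly learned dictionaries.

```python
import numpy as np

def omp(D, y, n_nonzero=5):
    """Greedy orthogonal matching pursuit: sparse code of signal y over dictionary D."""
    residual, support = y.astype(float).copy(), []
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))    # most correlated atom
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)  # refit on the support
        residual = y - D[:, support] @ coef
    code = np.zeros(D.shape[1])
    code[support] = coef
    return code

def sc_sr_patch(lr_patch, D_lr, D_hr, n_nonzero=5):
    """Sparse-coding SR for one patch: code w.r.t. the LR dictionary, decode with the HR one."""
    alpha = omp(D_lr, lr_patch.ravel(), n_nonzero)
    return D_hr @ alpha                 # shared sparse code reconstructs the HR patch

# Toy usage with random, hypothetical coupled dictionaries (3x3 LR patches -> 9x9 HR patches).
rng = np.random.default_rng(0)
D_lr = rng.standard_normal((9, 256));  D_lr /= np.linalg.norm(D_lr, axis=0)
D_hr = rng.standard_normal((81, 256)); D_hr /= np.linalg.norm(D_hr, axis=0)
hr_patch = sc_sr_patch(rng.standard_normal((3, 3)), D_lr, D_hr).reshape(9, 9)
```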
Fig. 1. Block diagram of the proposed video super-resolution framework. The m LR key-frames are sampled from the original LR video at a fixed interval and super-resolved using the TS-SR technique. BOBMC is then used to motion-compensate the n bicubically interpolated LR non-key-frames from the super-resolved key-frames. Finally, the motion-compensated SR frames are rendered using the proposed DTS-SR to obtain the final SR video.
In [11], a sparse-representation-based framework was proposed for image deblurring and SR based on adaptive sparse domain selection and adaptive regularization, where two adaptive regularization terms are introduced.

B. Video Super-Resolution

Image SR techniques can be extended to video SR by incorporating temporal information. Most video SR methods rely on motion estimation to interpolate the LR frames between two key-frames (usually assumed to be of HR quality) in a video [12]−[15]. In [12], a video SR method based on key-frames and motion estimation was proposed, while in [13], fast block-matching motion estimation algorithms for video SR were analyzed. In [14], an energy-based algorithm for motion-compensated video SR was proposed for up-scaling standard-definition video to high-definition video via optical flow. In addition, a video SR algorithm was proposed in [15] to interpolate an arbitrary frame in a LR video from sparsely existing HR key-frames, which are assumed to be always available for the LR video input.

On the other hand, exemplar/learning-based techniques have also been employed for video SR. In [15], when the motion-compensated error is large, an input LR patch is spatially super-resolved using a dictionary learned from the LR/HR key-frame pair. In [16], adaptive regularization and learning-based SR were integrated for web video SR by learning a set of LR/HR patch pairs. An example-based video SR method based on codebooks derived from key-frames was also proposed in [17]. Moreover, the nonlocal-means property was adopted for video SR in [18], which up-scales each input LR patch by linearly fusing multiple similar LR patches based on self-similarity with no explicit motion estimation.

C. Texture Image/Video Super-Resolution

A main challenge in video SR is super-resolving dynamic texture information [19]−[21], which is rarely investigated in the literature. In [22] and [23], texture synthesis (TS) techniques were proposed for image/video synthesis, but not for SR. For image SR with texture synthesis, an image upscaling method via texture hallucination was proposed in [24]. It interprets a LR image as a tiling of distinct textures, each of which is matched to an example in a database of relevant
textures, extended from the example-based approach in [5]. Moreover, a database-free TS technique was proposed for image SR in [25]. Nevertheless, these TS-based approaches are not easy to apply directly to video SR, since maintaining temporal consistency while performing TS for video SR remains challenging.

D. Overview and Contribution of Proposed Method

In this paper, a video SR framework via dynamic texture synthesis (DTS) is proposed to effectively enhance the resolution of a LR video while maintaining the temporal consistency of the reconstructed HR video. The proposed method separates the input LR video frames into key-frames and non-key-frames. We first apply a texture-synthesis-based SR method to hallucinate the HR details of each key-frame, followed by a low-complexity bi-directional overlapped block motion compensation (BOBMC) method to interpolate the HR details of the non-key-frames between two neighboring key-frames. To solve the problem of temporal inconsistency, we propose an example-based DTS method to render the HR details of the super-resolved video based on the temporal dynamics of the input LR video. The main contribution of this paper is two-fold: (i) unlike most existing video SR frameworks (e.g., [15]), the proposed method does not assume that key-frames are of HR quality; and (ii) the proposed DTS-based SR (DTS-SR) method can well address the temporal inconsistency problem of the hallucinated HR video.

The rest of this paper is organized as follows. Sec. II presents the proposed DTS-based video SR framework. Sec. III demonstrates experimental results. Finally, Sec. IV concludes this paper.
II. PROPOSED VIDEO SUPER-RESOLUTION FRAMEWORK VIA DYNAMIC TEXTURE SYNTHESIS
As illustrated in Fig. 1, each input LR video is first decomposed into key-frames and non-key-frames. Each LR key-frame is upscaled using texture-synthesis-based SR (TS-SR) [24], and each non-key-frame is then interpolated using BOBMC [15]. After all frames are super-resolved, the proposed DTS-SR is applied to resolve the temporal inconsistency in the HR video. Similar to [24], we collect a set of example textures in advance to form a multi-scale texture database for texture synthesis.
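The following Python sketch summarizes the pipeline of Fig. 1. The functions ts_sr, bobmc, and dts_sr are placeholders for the three stages described in Secs. II-A to II-C, the key-frame sampling simply takes every (k+1)-th frame, and trailing frames after the last key-frame are ignored for brevity; none of these names come from the authors' implementation.

```python
def video_sr(lr_frames, k, ts_sr, bobmc, dts_sr):
    """Sketch of the Fig. 1 pipeline: TS-SR on key-frames, BOBMC in between, DTS-SR at the end."""
    n = len(lr_frames)
    key_idx = list(range(0, n, k + 1))                     # every (k+1)-th frame is a key-frame
    sr = [None] * n
    for i in key_idx:
        sr[i] = ts_sr(lr_frames[i])                        # texture-synthesis SR (Sec. II-A)
    for a, b in zip(key_idx, key_idx[1:]):
        for t in range(a + 1, b):                          # non-key-frames between two key-frames
            sr[t] = bobmc(lr_frames[t], sr[a], sr[b])      # bi-directional OBMC (Sec. II-B)
    return dts_sr(sr, lr_frames)                           # temporal-consistency refinement (Sec. II-C)
```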
A. Super-Resolution for Key-Frames

Following [24], each input LR key-frame is first segmented into different segments according to texture descriptors, and each segment is then classified into an example texture from the pre-collected multi-scale texture database. For each segment $s$ in a LR key-frame $F_t$ ($t$ denotes the frame index), the best matched patch $\hat{\mathbf{p}}_{t,j}$ for each patch $\mathbf{p}_{t,j}$ in $s$ ($j$ denotes the patch index) is searched from the example texture $T_s$ in the database by

$$\hat{\mathbf{p}}_{t,j} = \arg\min_{\mathbf{q} \in T_s} d\!\left(\mathbf{p}_{t,j}, \mathbf{q}\right), \tag{1}$$
where $d(\cdot,\cdot)$ denotes the distance between two texture patches defined in [24]. Then, the SR version $\mathbf{P}_{t,j}$ of $\mathbf{p}_{t,j}$ can be recovered from the HR version $\hat{\mathbf{P}}_{t,j}$ of $\hat{\mathbf{p}}_{t,j}$ in the database by the texture synthesis of [24]:

$$\mathbf{P}_{t,j} = \sigma\!\left(\mathbf{p}_{t,j}\right) M\!\left(\hat{\mathbf{P}}_{t,j}\right) + \mu\!\left(\mathbf{p}_{t,j}\right), \tag{2}$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ respectively denote the average and the standard deviation of a patch, and $M(\cdot)$ denotes the middle-frequency component of $\hat{\mathbf{P}}_{t,j}$. Based on (2), the high-frequency components of the reconstructed HR patch can be synthesized directly from the textured image database, as exemplified in Fig. 2. It can be observed from Fig. 2 that TS-SR [24] provides significantly finer details than the other SR methods. Nevertheless, individually applying TS-SR to each LR frame is expected to induce temporal inconsistency in the reconstructed SR video. Hence, in the proposed method, we apply TS-SR only to the LR key-frames. We then apply BOBMC to each non-key-frame, followed by the proposed DTS-SR to refine the HR non-key-frames while maintaining the temporal consistency of the resulting HR video, as described in Sec. II-B and Sec. II-C, respectively.
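As an illustration of (1)-(2), the sketch below matches a LR patch to its nearest example under a plain SSD distance (whereas [24] defines its own texture distance d) and composes the HR patch from the example's detail band rescaled to the LR patch's mean and standard deviation. The Gaussian band split and the normalization constant are assumptions made for this sketch, not the exact synthesis of [24].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def match_example(lr_patch, examples):
    """Eq. (1): pick the (LR, HR) example pair whose LR patch is closest to the input patch."""
    dists = [np.sum((lr_patch - ex_lr) ** 2) for ex_lr, ex_hr in examples]   # SSD stands in for d
    return examples[int(np.argmin(dists))][1]                                # matched HR example

def synthesize_hr_patch(lr_patch, hr_example, sigma_blur=1.5):
    """Eq. (2): rescale the example's detail band to the LR patch's mean and standard deviation."""
    detail = hr_example - gaussian_filter(hr_example, sigma_blur)   # middle/high-frequency band
    detail = detail / (np.std(detail) + 1e-8)                       # unit detail energy
    return np.std(lr_patch) * detail + np.mean(lr_patch)            # match LR statistics
```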
B. Super-Resolution for Non-Key-Frames

After super-resolving all key-frames of an input LR video via TS-SR, we apply BOBMC [15] to super-resolve all non-key-frames between each pair of neighboring key-frames. Consider a pair of super-resolved key-frames $\hat{F}_{K}$ and $\hat{F}_{K+k+1}$, where $k$ denotes the number of non-key-frames between two neighboring key-frames. Each LR non-key-frame between them is first up-scaled to the same size as the key-frames via bicubic interpolation (denoted by $\tilde{F}_{t}$). To perform forward motion estimation (ME) for each overlapped patch $\mathbf{p}_{j}$ in $\tilde{F}_{t}$ with respect to $\hat{F}_{K}$, the patch together with its surrounding pixels is extracted for overlapped block matching to calculate the motion vector (MV) $\mathbf{v} = (v_x, v_y)$ by

$$\mathbf{v}^{*} = \arg\min_{\mathbf{v} \in \Omega} \mathrm{SAD}\!\left(\mathbf{p}_{j}, \mathbf{v}\right), \tag{3}$$
where $\mathbf{v}^{*}$ denotes the estimated MV for this patch, $\Omega$ denotes the search window, and the SAD (sum of absolute differences) is defined as

$$\mathrm{SAD}\!\left(\mathbf{p}_{j}, \mathbf{v}\right) = \sum_{(x,y)} \left| \tilde{F}_{t}\!\left(x_{j}+x,\, y_{j}+y\right) - \hat{F}_{K}\!\left(x_{j}+x+v_{x},\, y_{j}+y+v_{y}\right) \right|, \tag{4}$$
where $(x_{j}, y_{j})$ denotes the location of the current patch. Similarly, backward ME between $\tilde{F}_{t}$ and $\hat{F}_{K+k+1}$ can be performed with the same technique. After performing the bi-directional ME for $\tilde{F}_{t}$ with respect to $\hat{F}_{K}$ and $\hat{F}_{K+k+1}$, each patch in $\tilde{F}_{t}$ can be upscaled using the motion-compensated blocks obtained with the estimated MVs (forward or backward MVs, selected according to their SADs) and the pre-specified weight matrices as

$$\hat{\mathbf{p}}^{\mathrm{BOBMC}}_{t,j}(x,y) = \mathbf{W}_{C}(x,y)\,\mathbf{p}_{C}(x,y) + \mathbf{W}_{T}(x,y)\,\mathbf{p}_{T}(x,y) + \mathbf{W}_{B}(x,y)\,\mathbf{p}_{B}(x,y) + \mathbf{W}_{L}(x,y)\,\mathbf{p}_{L}(x,y) + \mathbf{W}_{R}(x,y)\,\mathbf{p}_{R}(x,y), \tag{5}$$
where $\hat{\mathbf{p}}^{\mathrm{BOBMC}}_{t,j}$ denotes the $j$-th super-resolved patch via BOBMC in $\hat{F}_{t}$; $\mathbf{p}_{C}$, $\mathbf{p}_{T}$, $\mathbf{p}_{B}$, $\mathbf{p}_{L}$, and $\mathbf{p}_{R}$ are the pixel values at $(x,y)$ in the motion-compensated patches corresponding to the current MV and the MVs of the top, bottom, left, and right neighbor patches, respectively; and the weight matrices $\mathbf{W}_{C}$, $\mathbf{W}_{T}$, $\mathbf{W}_{B}$, $\mathbf{W}_{L}$, and $\mathbf{W}_{R}$ are defined in [15]. After applying TS-SR and BOBMC to super-resolve an input LR video, the resulting HR video still suffers from temporal inconsistency. We then apply the proposed DTS-SR to solve this problem, as explained below.
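A simplified sketch of (3)-(5) is given below: full-search block matching with the SAD of (4), followed by an overlapped blend of motion-compensated patches drawn with the current MV and the four neighbors' MVs. The weight matrices are left to the caller; in [15] they are pre-specified raised windows, so this is a schematic illustration rather than the exact BOBMC of [15], and the block size and search range are arbitrary assumptions.

```python
import numpy as np

def block_me(cur, ref, x0, y0, bs=16, search=8):
    """Eqs. (3)-(4): full-search motion estimation minimizing the SAD of one overlapped patch."""
    patch = cur[y0:y0 + bs, x0:x0 + bs].astype(float)
    best_mv, best_sad = (0, 0), np.inf
    for vy in range(-search, search + 1):
        for vx in range(-search, search + 1):
            yy, xx = y0 + vy, x0 + vx
            if 0 <= yy and yy + bs <= ref.shape[0] and 0 <= xx and xx + bs <= ref.shape[1]:
                sad = np.abs(patch - ref[yy:yy + bs, xx:xx + bs]).sum()
                if sad < best_sad:
                    best_mv, best_sad = (vx, vy), sad
    return best_mv, best_sad

def obmc_patch(ref, x0, y0, mvs, weights, bs=16):
    """Eq. (5): blend the motion-compensated patches (current and four neighbor MVs) by weights."""
    out = np.zeros((bs, bs))
    for (vx, vy), w in zip(mvs, weights):                   # the weights should sum to 1 per pixel
        out += w * ref[y0 + vy:y0 + vy + bs, x0 + vx:x0 + vx + bs]
    return out
```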
Fig. 2. Texture synthesis results obtained by: (a) Bicubic; (b) NLM-based SR [18]; (c) SC-SR [9]; and (d) TS-SR [24] (employed in our method).
C. Proposed Dynamic Texture Synthesis-Based SR

Assume that a textural patch varies along the time axis and that the transition between the textures can be modeled as a linear system via

$$\mathbf{p}_{t+1,j} = \mathbf{A}_{j}\,\mathbf{p}_{t,j}, \tag{6}$$
where $\mathbf{p}_{t,j}$ and $\mathbf{p}_{t+1,j}$ denote two patches at the same position $j$ of two successive frames with indices $t$ and $t+1$, respectively, and $\mathbf{A}_{j}$ denotes the transition matrix of the dynamic texture for $\mathbf{p}_{t,j}$. Based on [21], dynamic textures can be formulated as a linear autoregressive (AR) system as
$$\mathbf{y}_{t+1,j} = \mathbf{A}_{j}\,\mathbf{y}_{t,j} + \mathbf{n}_{t}, \qquad \mathbf{n}_{t} \sim \mathcal{N}\!\left(\mathbf{0}, \Sigma\right), \tag{7}$$
where $\mathbf{y}_{t,j} = \mathbf{W}^{\mathsf{T}}\mathbf{p}_{t,j}$, $\mathbf{W}$ is an orthogonal projection matrix, and the zero-mean, normally distributed additive noise $\mathbf{n}_{t} \sim \mathcal{N}(\mathbf{0}, \Sigma)$ captures the uncertainty with covariance matrix $\Sigma$. In our implementation, the orthogonal projection matrix is obtained via principal component analysis (PCA), and the dimensionality of the projected data is kept smaller than or equal to the number of frames in order to avoid an underdetermined problem. As a result, the transition matrix can be solved by least-squares approximation [20] as
$$\mathbf{A}_{j} \approx \mathbf{Y}_{2:\tau}\,\mathbf{Y}_{1:\tau-1}^{\dagger}, \tag{8}$$
where the matrix $\mathbf{Y}_{1:\tau} = [\,\mathbf{y}_{1,j} \cdots \mathbf{y}_{\tau,j}\,]$ collects the projected data from the first to the $\tau$-th frame, and $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudoinverse. Based on our experiments, the transition matrix characterizing the dynamic textures of a LR video is similar to that of its HR ground-truth video. As a result, we can solve $\mathbf{A}_{j}$ via (8) from the input LR video and use it to resolve the temporal inconsistency in the HR video reconstructed by TS-SR or BOBMC. The proposed dynamic texture synthesis-based SR (DTS-SR) method can be expressed as
$$\hat{\mathbf{y}}_{t+1,j} = \mathbf{A}_{j}\,\tilde{\mathbf{y}}_{t,j} + \mathbf{n}_{t}, \qquad \mathbf{n}_{t} \sim \mathcal{N}\!\left(\mathbf{0}, \Sigma\right), \tag{9}$$
where $\tilde{\mathbf{y}}_{t,j} = \mathbf{W}^{\mathsf{T}}\tilde{\mathbf{p}}_{t,j}$, $\tilde{\mathbf{p}}_{t,j}$ is a patch upscaled via TS-SR or BOBMC, $\mathbf{W}$ is an orthogonal projection matrix, and $\mathcal{N}(\mathbf{0}, \Sigma)$ is defined as in (7). Finally, the SR patch $\hat{\mathbf{p}}_{t+1,j}$ is recovered from its subspace representation using the projection matrix as $\hat{\mathbf{p}}_{t+1,j} = \mathbf{W}\hat{\mathbf{y}}_{t+1,j}$. After each patch is processed by the proposed DTS-SR, the temporal inconsistency problem is resolved.

To analyze the performance of the proposed DTS-SR, we visualize the trajectory of the dynamics of each patch by projecting the patches into their subspaces using PCA. As exemplified in Fig. 3, the four trajectories show the dynamics of the ground-truth HR patches, the input LR patches, the SR patches obtained via TS-SR with BOBMC, and the SR patches obtained via the proposed method, respectively. It can be observed from Fig. 3 that our assumption on the similarity of dynamics between the LR patches and the corresponding ground-truth HR patches is valid. Moreover, the dynamics of the SR patches obtained by our method are much closer to the ground truth than those of the SR patches based only on TS-SR with BOBMC. As a result, the proposed method can well address the temporal inconsistency problem in video SR.
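A compact NumPy sketch of the DTS-SR step (6)-(9), with the noise term dropped for clarity: $\mathbf{W}$ and $\mathbf{A}_j$ are estimated from a patch trajectory via PCA and the least-squares solution (8), and the TS-SR/BOBMC-upscaled patches are then propagated through $\mathbf{A}_j$ in a subspace of the same dimensionality before being projected back. The function and variable names, the shared subspace dimension for LR and SR patches, and the separate PCA basis for the SR patches are assumptions of this sketch, not the authors' exact implementation.

```python
import numpy as np

def fit_dynamic_texture(patches, dim=8):
    """Eqs. (7)-(8): PCA projection W, least-squares transition matrix A, and trajectory mean."""
    X = np.stack([p.ravel() for p in patches], axis=1)      # columns = vectorized patches p_1..p_tau
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    W = U[:, :dim]                                          # orthogonal projection (dim <= #frames)
    Y = W.T @ (X - mean)                                    # projected trajectory y_1..y_tau
    A = Y[:, 1:] @ np.linalg.pinv(Y[:, :-1])                # eq. (8): A ~= Y_{2:tau} Y_{1:tau-1}^+
    return W, A, mean

def dts_sr_patches(sr_patches, W_sr, A, mean_sr):
    """Eq. (9), noise dropped: propagate each upscaled patch through A in the subspace, project back."""
    out = [sr_patches[0]]
    for t in range(len(sr_patches) - 1):
        y_t = W_sr.T @ (sr_patches[t].ravel()[:, None] - mean_sr)   # subspace code of upscaled patch
        y_next = A @ y_t                                            # predicted next code
        out.append((W_sr @ y_next + mean_sr).reshape(sr_patches[0].shape))
    return out

# Usage per patch position j (the two subspace dimensions must match for A to transfer):
#   _, A, _          = fit_dynamic_texture(lr_patches, dim=8)   # dynamics estimated from the LR video
#   W_sr, _, mean_sr = fit_dynamic_texture(sr_patches, dim=8)   # subspace of the TS-SR/BOBMC patches
#   refined          = dts_sr_patches(sr_patches, W_sr, A, mean_sr)
```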
III. EXPERIMENTAL RESULTS
To evaluate the performance of the proposed video SR method, we collected a dataset of 52 HR training textured images. Four videos with dynamic textures were used in our experiments, as exemplified in Fig. 4. The parameter settings are as follows. For each test video, the number of non-key-frames between two successive key-frames is empirically set to k = 8. The upscaling factor is set to 3 in both the horizontal and vertical dimensions. The patch sizes used for TS-SR and BOBMC are both set to 16×16, while the patch size used for DTS-SR is set to 60×60. The frame size of the input LR video is 240×160 and that of the upscaled HR video is 720×480. The motion estimation in BOBMC is performed with pixel accuracy. We compare the performance of the proposed method with the following existing SR approaches: (i) bicubic interpolation (denoted by Bicubic) [3]; (ii) SR via sparse coding (denoted by SC-SR) [9]; (iii) SR via non-local iterative back-projection (denoted by NLIBP-SR) [6]; (iv) SR via adaptive sparse domain selection and adaptive regularization (denoted by ASDS-SR) [11]; (v) SR via texture hallucination (denoted by TS-SR) [24]; and (vi) video SR via BOBMC [15].
Fig. 3. Examples of the trajectories (from frame 14 to 20) for the variation of the dynamic background (i.e. dynamic textures) of the ground-truth HR patches, input LR patches, SR patches via TS-SR with BOBMC, and SR patches via the proposed method, respectively.
Moreover, to quantitatively evaluate the performance of the proposed SR method, we use the motion-based video integrity evaluation (MOVIE) metric [26] for video quality assessment. The MOVIE metric utilizes a general, spatio-spectrally localized multi-scale framework for evaluating dynamic video fidelity that integrates spatial, temporal, and spatio-temporal aspects of distortion assessment, and it has been shown to correlate closely with human subjective judgment. The smaller the MOVIE index of an evaluated video, the higher its visual quality. Based on our experiments, the MOVIE metric is much more suitable for evaluating the fidelity of dynamic textures synthesized for video SR than quality assessment metrics that do not consider temporal information [e.g., the peak signal-to-noise ratio (PSNR)]. Fig. 5 shows subjective SR results cropped from a set of five successive upscaled frames of the input LR video obtained using the proposed, SC-SR [9], ASDS-SR [11], NLIBP-SR [6], and Bicubic [3] methods (with their respective MOVIE indices). It can be observed from Fig. 5 that the visual quality of the SR results obtained by the proposed method is significantly better than that of the four methods used for comparison. More specifically, more high-frequency components (i.e., textures) in the reconstructed frames are recovered by the proposed method, while the other SR methods produce over-smooth results. Moreover, TABLE I shows the objective SR results in terms of the MOVIE index [26] for the four test videos using the Bicubic [3], SC-SR [9], TS-SR [24], BOBMC [15], and
proposed methods. To compare our method with BOBMC [15], we applied TS-SR to super-resolve each LR key-frame, followed by BOBMC to super-resolve each non-key-frame, because in [15] each HR key-frame was assumed to be always available. It can be observed from TABLE I that the proposed method outperforms the compared methods in terms of the MOVIE index, owing to the fact that our method better maintains the temporal consistency of the SR videos. More experimental results are available from our project website [27].
TABLE I
OBJECTIVE EVALUATION BY MOVIE FOR THE RECONSTRUCTED SR VIDEOS OBTAINED USING THE BICUBIC, SC-SR, TS-SR, BOBMC, AND THE PROPOSED METHODS (A SMALLER MOVIE VALUE INDICATES HIGHER VISUAL QUALITY).

Compared Methods   Bicubic [3]   SC-SR [9]   TS-SR [24]   BOBMC [15]   Proposed
Video 1            1.12          0.36        0.22         0.24         0.18
Video 2            0.93          0.50        0.43         0.44         0.20
Video 3            0.98          0.41        0.29         0.32         0.20
Video 4            1.51          0.51        0.17         0.21         0.11
Fig. 4. Example frames of the four test videos: (a) Video 1; (b) Video 2; (c) Video 3; and (d) Video 4.
IV. CONCLUSION
In this paper, we have proposed a video SR framework via dynamic texture synthesis to effectively enhance the resolution of a LR video while maintaining the temporal consistency of the reconstructed HR video. The proposed method separates the input LR video frames into key-frames and non-key-frames. We first apply the texture-synthesis-based SR method to super-resolve each key-frame, followed by a low-complexity bi-directional overlapped block motion compensation method to interpolate the HR details of each non-key-frame between two neighboring key-frames. To solve the problem of temporal inconsistency, the proposed example-based DTS is used to render the super-resolved video based on the temporal dynamics of the input LR video. Our experimental results have demonstrated that the proposed method significantly outperforms traditional and state-of-the-art learning-based SR methods in terms of both subjective and objective visual quality of the reconstructed video.
REFERENCES
[1] S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: a technical overview," IEEE Signal Process. Mag., vol. 20, no. 3, pp. 21−36, May 2003.
[2] S. Farsiu, M. Robinson, M. Elad, and P. Milanfar, "Fast and robust multiframe super resolution," IEEE Trans. Image Process., vol. 13, no. 10, pp. 1327−1344, Oct. 2004.
[3] H. S. Hou and H. C. Andrews, "Cubic splines for image interpolation and digital filtering," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 6, 1978.
[4] S. Baker and T. Kanade, "Limits on super-resolution and how to break them," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, 2002.
[5] W. T. Freeman, T. R. Jones, and E. C. Pasztor, "Example-based super-resolution," IEEE Comput. Graph. Appl., vol. 22, no. 2, Mar./Apr. 2002.
[6] W. Dong, L. Zhang, G. Shi, and X. Wu, "Nonlocal back-projection for adaptive image enlargement," in Proc. IEEE ICIP, 2009, pp. 349−352.
[7] D. Glasner, S. Bagon, and M. Irani, "Super-resolution from a single image," in Proc. IEEE Int. Conf. Comput. Vis., Sept. 2009, pp. 349−356.
[8] G. Freedman and R. Fattal, "Image and video upscaling from local self-examples," ACM Trans. Graph., vol. 30, no. 2, Article 12, 2011.
[9] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, Nov. 2010.
[10] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. S. Huang, "Coupled dictionary training for image super-resolution," IEEE Trans. Image Process., vol. 21, no. 8, pp. 3467−3478, Aug. 2012.
[11] W. Dong, L. Zhang, G. Shi, and X. Wu, "Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization," IEEE Trans. Image Process., vol. 20, no. 7, pp. 1838−1857, July 2011.
[12] F. Brandi, R. L. de Queiroz, and D. Mukherjee, "Super-resolution of video using key frames and motion estimation," in Proc. IEEE Int. Conf. Image Process., 2008, pp. 321−324.
[13] G. M. Callicó, S. López, O. Sosa, J. F. López, and R. Sarmiento, "Analysis of fast block matching motion estimation algorithms for video super-resolution systems," IEEE Trans. Consumer Electronics, vol. 54, no. 3, pp. 1430−1438, 2008.
[14] S. H. Keller, F. Lauze, and M. Nielsen, "Video super-resolution using simultaneous motion and intensity calculations," IEEE Trans. Image Process., vol. 20, no. 7, pp. 1870−1884, July 2011.
[15] B. C. Song, S. C. Jeong, and Y. Choi, "Video super-resolution algorithm using bi-directional overlapped block motion compensation and on-the-fly dictionary training," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 3, pp. 274−285, Mar. 2011.
[16] Z. Xiong, X. Sun, and F. Wu, "Robust web image/video super-resolution," IEEE Trans. Image Process., vol. 19, no. 8, Aug. 2010.
[17] E. M. Hung, R. L. de Queiroz, F. Brandi, K. F. de Oliveira, and D. Mukherjee, "Video super-resolution using codebooks derived from key-frames," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 9, 2012.
[18] M. Protter, M. Elad, H. Takeda, and P. Milanfar, "Generalizing the nonlocal-means to super-resolution reconstruction," IEEE Trans. Image Process., vol. 18, no. 1, pp. 36−51, Jan. 2009.
[19] A. Schödl, R. Szeliski, D. Salesin, and I. A. Essa, "Video textures," in Proc. SIGGRAPH, 2000, pp. 489−498.
[20] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, "Dynamic textures," Int. J. Comput. Vis., vol. 51, no. 2, pp. 91−109, 2003.
[21] M. Hyndman, A. D. Jepson, and D. J. Fleet, "Higher-order autoregressive models for dynamic textures," in Proc. British Machine Vis. Conf., Warwick, Sept. 2007, pp. 1−10.
[22] V. Kwatra, A. Schödl, I. A. Essa, G. Turk, and A. F. Bobick, "Graphcut textures: image and video synthesis using graph cuts," ACM Trans. Graph., vol. 22, no. 3, pp. 277−286, 2003.
[23] V. Kwatra, I. Essa, A. Bobick, and N. Kwatra, "Texture optimization for example-based synthesis," ACM Trans. Graph., vol. 24, 2005.
[24] Y. HaCohen, R. Fattal, and D. Lischinski, "Image upsampling via texture hallucination," in Proc. IEEE Int. Conf. Comput. Photography, Cambridge, MA, USA, Mar. 2010, pp. 20−30.
[25] Y.-C. Lin, Y.-N. Liu, and S.-Y. Chien, "Super resolution via database-free texture synthesis," in Proc. IEEE Int. Conf. Consumer Electronics, Sept. 2011, pp. 154−155.
[26] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Trans. Image Process., vol. 19, no. 2, pp. 335−350, Feb. 2010.
[27] NTHU video super-resolution project [Online]. Available: http://www.ee.nthu.edu.tw/cwlin/videoSR/index.html
Fig. 5. Video SR results: (a) the original HR video; and the SR results for five successive frames obtained by: (b) the proposed method (MOVIE = 0.23); (c) SC-SR [9] (MOVIE = 0.36); (d) ASDS-SR [11] (MOVIE = 0.24); (e) NLIBP-SR [6] (MOVIE = 0.29); and (f) Bicubic [3] (MOVIE = 0.95).