Video Super-Resolution Using High Quality Photographs

Cosmin Ancuti, Codruta Orniana Ancuti and Philippe Bekaert
Hasselt University - tUL - IBBT, Expertise Center for Digital Media, Belgium
{firstname.lastname}@uhasselt.be

ABSTRACT

This paper introduces a technique that increases the spatial resolution of a given video. The method builds on the fundamentals of super-resolution techniques, which aim to reconstruct high-resolution frames from a low-resolution input sequence. Unlike classical super-resolution methods, besides using the information of adjacent frames we take advantage of several reference high-quality (high-resolution) photographs of the same scene. The method is purely image-based and does not require depth estimation. The additional information extracted from the reference photographs is used to construct several high-resolution seed frames, inserted at a constant step into the initial video sequence. Both the seed frames and the adjacent low-resolution frames then provide important information for defining the priors considered in the probabilistic interpretation of the generative model. The estimated solution is obtained with a standard maximum a posteriori (MAP) approach. Objective tests on real and synthetic video sequences demonstrate the utility and the benefits of the proposed technique over related methods.

Index Terms— super-resolution, MAP, matching

1. INTRODUCTION

Most applications require digital images and videos with high spatial resolution. Images and video frames at such resolutions comprise a higher density of pixels, which allows better rendering of the finest details, generally embedded in subtle transitions of the luminance channel. Reconstructing a high-quality (high-resolution) image is essentially an ill-posed problem, since an infinite number of solutions would, when downsampled, yield the same low-resolution version.
Among the existing strategies, multiframe super-resolution techniques are commonly used to restore the missing high frequencies. Super-resolution is a general term for the task of overcoming the optical limitations of digital camera sensors through image processing, a less expensive alternative to hardware solutions. The main idea is motivated by sampling theory, which states that given enough uniform or non-uniform samples, signals (images) can be reconstructed

978-1-4244-4296-6/10/$25.00 ©2010 IEEE


from these samples. In practice, a group of images of the same scene is fused to yield images with higher spatial resolution, or with more visible details. The resolution of the original imaging device is improved by exploiting the relative sub-pixel motion between the scene and the imaging plane: the pixels of the initial image are interleaved with pixels from the other images, producing a super-resolved image with enhanced details. For a comprehensive overview the reader is referred to the studies of Borman and Stevenson [1] and Park et al. [2].

In this paper we present a new method built on the fundamentals of multiframe super-resolution techniques. Besides using the information of adjacent frames, to yield more accurate results the MAP estimation incorporates prior information from several high-quality (high-resolution) photographs of the same scene. There are several advantages to using photographs to enhance videos. When capturing still photographs we can use longer exposure times than in video, where exposure is limited by the frame rate. Additionally, processing or editing a single image is less computationally expensive than treating every frame separately. Still photographs can also use the flash feature of digital cameras, which significantly diminishes distortions such as blur and noise in poorly illuminated scenes. Moreover, due to the higher amount of information and to sensor limitations, many commercial digital cameras yield videos with higher compression rates or low-resolution frames.

The learning-based or example-based methods [3, 4] are also related to our work. These methods aim to solve a more complex problem by learning the high-frequency information from a set of generic training images that are not necessarily taken from the same scene. Besides being more computationally expensive, these techniques depend critically on the training strategy and the examples used in the training stage.
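As an illustration of the interleaving principle described above, the following toy sketch (our own simplified example, not the paper's implementation) fuses four decimated samplings of an image, each taken at a known sub-pixel offset on the fine grid, back onto the high-resolution grid:

```python
import numpy as np

# Toy multiframe fusion: four 2x-decimated samplings of a high-resolution
# image, each taken at a different sub-pixel offset, are interleaved to
# recover the original grid exactly (no blur or noise in this idealized case).
H = np.arange(64, dtype=float).reshape(8, 8)  # "ground truth" high-res image

# Each low-res frame samples H at a distinct (dy, dx) offset.
offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]
frames = [H[dy::2, dx::2] for dy, dx in offsets]  # four 4x4 frames

# Fusion: place each frame's pixels back at their offsets on the fine grid.
R = np.zeros_like(H)
for (dy, dx), f in zip(offsets, frames):
    R[dy::2, dx::2] = f

assert np.array_equal(R, H)  # perfect reconstruction in this idealized setting
```

Real sequences of course lack such exact offsets, which is why registration, blur modeling and priors are needed, as the rest of the paper develops.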
The idea of using stills of the same scene to enhance the resolution of a low-quality video sequence was introduced only recently [5, 6]. Bhat et al. [5] follow a different concept that employs an extremely expensive 3D reconstruction approach, while the method of Ancuti et al. [6] is a patch-based technique that may introduce flickering artifacts, since the information of adjacent frames and other temporal coherence constraints are completely ignored.

ICASSP 2010

2. MODELING THE GENERATIVE PROCESS

Since we deal with an inverse problem whose main objective is to estimate a high-resolution sequence from the observed low-resolution frames, the first step is to define a generative model that reliably emulates the forward process. In practice, the generative model has to account for the physical features of the imaging system and the scene parameters that degrade the high-resolution projection into the resulting low-resolution frames.

Consider the high-resolution image H (N pixels) and a set of K input low-resolution images L_k (M pixels each). The entire process can be expressed with matrix operations, since the basic operations in the image formation model (blur convolution, subsampling and warping) are linear in the image intensities:

L_k = \gamma_k D_k B_k W_k H + N_k    (1)

where D_k represents the decimation (subsampling) matrix, B_k is the blurring matrix, and W_k represents the geometric affine warping matrix. The model also includes the noise N_k of L_k, assumed to be uncorrelated zero-mean Gaussian with variance \sigma_k^2, and a global photometric correction represented by the multiplicative scalar \gamma_k. Based on these notations, the goal is to recover H by minimizing the sum of the per-frame residuals \epsilon_k = L_k - \gamma_k D_k B_k W_k H:

\sum_{k=1}^{K} \|\epsilon_k\|_2^2 = \sum_{k=1}^{K} \left\| L_k - \gamma_k D_k B_k W_k H \right\|_2^2    (2)

The simplest probabilistic approach would be to minimize the relevant terms of the negative log likelihood. However, since maximum likelihood (ML) is an ill-conditioned problem whose solution is prone to corruption by very strong high-frequency oscillations, as will be shown later, to recover the finest details we rely on a maximum a posteriori (MAP) estimate.

Fig. 1. Overview of the algorithm: the HR reference photographs and the low-resolution sequence are combined into HR seed frames; the input video plus the seed frames feed the MAP estimate, yielding the enhanced video sequence.

3. HIGH RESOLUTION SEED FRAMES
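The forward model of equation (1) can be sketched numerically as follows (an illustration only: the box-blur kernel, 2x decimation factor and noise level are arbitrary assumptions of ours, and the warp W_k is taken as identity for brevity):

```python
import numpy as np

def blur(img, k=3):
    """Box blur via neighborhood averaging (a stand-in for B_k)."""
    out = np.zeros_like(img, dtype=float)
    pad = np.pad(img, k // 2, mode="edge")
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def generate_lr(H, gamma=1.0, factor=2, sigma=0.01, rng=None):
    """Forward model L_k = gamma_k D_k B_k W_k H + N_k, with W_k = identity here."""
    rng = rng or np.random.default_rng(0)
    warped = H                                   # W_k: identity in this sketch
    blurred = blur(warped)                       # B_k: box blur
    decimated = blurred[::factor, ::factor]      # D_k: subsampling
    noise = rng.normal(0.0, sigma, decimated.shape)  # N_k: Gaussian noise
    return gamma * decimated + noise

H = np.random.default_rng(1).random((16, 16))
L = generate_lr(H)
print(L.shape)  # (8, 8)
```

Each observed frame is thus a blurred, decimated, photometrically scaled and noisy view of the unknown H, which is exactly what the residual in equation (2) measures.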


An important step of our approach (see Figure 1) is to integrate the additional information taken from the high-resolution reference photographs. Unlike the previous methods [5, 6], and since it is not practical to consider every frame, our technique builds only several high-resolution frames (with a step n) by employing a feature-based matching approach.

The problem we would like to solve is motivated by several scenarios that arise easily in practice. Such cases appear when, due to memory card limitations, the user can capture only low-resolution videos (e.g. 320x240) plus several high-resolution photographs of a static scene. Furthermore, for well-known sites (e.g. famous cathedrals, historical monuments and buildings), users can exploit with a simple search the enormous internet image databases that provide the necessary reference information. Another plausible scenario stems from the latest generation of commercial cameras (e.g. Canon EOS 500D), which allow recording video sequences simultaneously with still images.

In all cases, however, the high-quality images need to be aligned with several degraded frames of the considered input sequence. It is assumed that the images are related by a dominant plane, and therefore an important part of each image pair can be related by a homography [7]. To match images we employ the recent method of Ancuti et al. [8], which considerably improves the matching results of state-of-the-art operators for wide-baseline matching problems. Subsequently, the estimated homography matrix relating the images is refined with the RANSAC [7] algorithm. After the images have been aligned, the high-resolution seed frames are identified either by interlacing one of the high-resolution images (e.g. when using the simultaneous-capture feature of the Canon EOS 500D) or, in the general case, by iteratively computing the difference between the selected frame and the existing high-resolution photographs of the same scene.
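The geometric core of this alignment step is mapping coordinates through an estimated 3x3 homography. The sketch below only illustrates that mapping (the matrix is a made-up translation example; the actual matching method of [8] and the RANSAC refinement are not reproduced here):

```python
import numpy as np

def apply_homography(Hm, pts):
    """Map Nx2 points through a 3x3 homography using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    ones = np.ones((pts.shape[0], 1))
    homog = np.hstack([pts, ones]) @ Hm.T   # lift to homogeneous, transform
    return homog[:, :2] / homog[:, 2:3]     # divide out the projective scale

# Example: a pure translation by (5, -3) expressed as a homography.
Hm = np.array([[1.0, 0.0,  5.0],
               [0.0, 1.0, -3.0],
               [0.0, 0.0,  1.0]])
print(apply_homography(Hm, [[0, 0], [2, 2]]))  # [[5., -3.], [7., -1.]]
```

In the general projective case the last row of the matrix is non-trivial and the division by the third homogeneous coordinate is what makes the mapping non-linear in pixel coordinates.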
Since the traditional distances (e.g. NCC or SSD) are highly sensitive to illumination variation and to similar textures, we searched for a distribution-based measure, known to be more robust. We opted for the well-known SIFT descriptor [9] to perform a per-pixel difference between the aligned high- and low-resolution images. SIFT is calculated from the image gradients, being in practice a histogram of gradient locations and orientations. In some special cases, due to lack of information and to occlusions, the regions that remain uncovered are filled by employing an effective inpainting scheme [10].
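To make the robustness argument concrete, here is a deliberately simplified gradient-orientation histogram comparison (our own toy sketch, not Lowe's SIFT: no scale space, no spatial binning). A global illumination change inflates the SSD but leaves the normalized orientation histogram essentially unchanged:

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    """Histogram of gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)  # orientations in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

def descriptor_distance(p1, p2):
    return np.linalg.norm(orientation_histogram(p1) - orientation_histogram(p2))

# A patch and a brightened copy: SSD is large, while the normalized
# orientation-histogram distance is near zero (illumination-robust).
rng = np.random.default_rng(0)
p = rng.random((8, 8))
p_bright = p * 2.0 + 0.3                 # global illumination change
ssd = np.sum((p - p_bright) ** 2)
d = descriptor_distance(p, p_bright)
print(ssd > 1.0, d < 1e-6)
```

Scaling and offsetting the intensities scales the gradient magnitudes uniformly and leaves the orientations untouched, so the normalized histogram is invariant; this is the intuition behind preferring a distribution-based measure over SSD here.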

4. OPTIMIZED SUPER-RESOLUTION

This section details how the additional prior information is integrated to optimize the super-resolution problem. Considering the generative model expressed by equation (1), the main goal is to estimate the high-resolution frames accurately. Based on Bayes' theorem, p(x|D) = p(D|x)p(x)/p(D), which relates the observed data D and the unknown x, the problem may be expressed as:

p(H \mid \{L_k, \rho_k, \gamma_k\}) = \frac{p(H)\, p(\{L_k\} \mid H, \{\rho_k, \gamma_k\})}{p(\{L_k\} \mid \{\rho_k, \gamma_k\})}    (3)

where the vector \rho_k parameterizes the geometric transformation (up to affine). The marginal probability, the denominator in equation (3), is in general neglected, since it is a normalization constant. Commonly, the problem is split into several steps that separately solve the registration (estimation of the relative motion) and the reconstruction of the high-resolution frames using prior information. Since these problems are closely interrelated, we adopt a strategy similar to that of [11], where all the unknowns are optimized simultaneously.

A simple probabilistic solution is given by the maximum likelihood (ML) estimate, obtained by maximizing the probability of the observed dataset p(\{L_k\} \mid H, \rho_k, \gamma_k). Despite being relatively straightforward to implement (ML is directly related to iterative least-squares techniques), in the absence of additional constraints ML fails to solve the problem properly, being highly sensitive to high-frequency oscillations. An alternative that has been shown [12] to yield better results is to include prior information in the probabilistic approach. Maximum a posteriori (MAP) extends the ML solution by adding priors over the unknown H. The priors represent constraints that subjectively match the human visual system's interpretation, with the final goal of steering the search toward the true solution. Regarding equation (3), the MAP solution maximizes the numerator with respect to H:

H_{MAP} = \arg\max_H \left[ p(H)\, p(\{L_k\} \mid H, \{\rho_k, \gamma_k\}) \right]    (4)

Derived from equation (2), the likelihood of the entire low-resolution set \{L_k\} is expressed as follows:

p(\{L_k\} \mid H, \{\rho_k, \gamma_k\}) = \left(\frac{1}{2\pi\sigma_k^2}\right)^{KM/2} \exp\left\{ -\sum_{k=1}^{K} \frac{\|\epsilon_k\|_2^2}{2\sigma_k^2} \right\}    (5)

For our problem, two priors are blended into the reconstruction step. The first prior constrains the inference based on the influence of the high-resolution seed frames constructed in the previous step. The second is a general prior adopted by the majority of existing MAP super-resolution techniques [2] and refers to the additional information contained in the adjacent video frames.

Fig. 2. Comparative results between bicubic upsampling and our result (panels: initial frame, bicubic upsampling, our result).

Assuming that the geometric relation between the low-resolution frames and the seed frames can be expressed by a homography, the first prior, which in practice transfers the information of the high-quality photographs, is expressed as a difference of gradients:

p_{seed}(H) = \exp\left\{ -\|\nabla H - \nabla H_{seed}\|^2 \right\}    (6)
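In pixel terms, the negative log of this seed prior is simply a sum of squared gradient differences; a minimal sketch of ours, assuming finite-difference gradients:

```python
import numpy as np

def neg_log_seed_prior(H, H_seed):
    """-log p_seed(H) up to a constant: squared gradient-difference penalty."""
    gy, gx = np.gradient(H.astype(float))
    sy, sx = np.gradient(H_seed.astype(float))
    return np.sum((gx - sx) ** 2 + (gy - sy) ** 2)

seed = np.outer(np.arange(6.0), np.ones(6))         # smooth ramp "seed" frame
assert neg_log_seed_prior(seed, seed) == 0.0        # identical gradients: no penalty
assert neg_log_seed_prior(seed + 5.0, seed) == 0.0  # gradient prior ignores offsets
assert neg_log_seed_prior(seed * 2.0, seed) > 0.0   # different structure penalized
```

Note that, being a gradient-domain penalty, this prior transfers edge structure from the seed frame while remaining insensitive to global intensity offsets.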

where H_{seed} represents the nearest high-resolution seed frame. The second prior is inspired by the total variation criterion used previously in the work of Farsiu et al. [12]. The Bilateral Total Variation (BTV) is an effective regularizer that preserves edges. Basically, a penalty is built from the differences between the high-resolution image and versions of itself shifted in several directions by an integer number of pixels and correspondingly weighted:

p_{BTV}(x) = \exp\left\{ -\sum_{l,m=-P}^{P} \alpha^{|m|+|l|} \left\| x - S_i^l S_j^m x \right\|_1 \right\}    (7)

where P represents the size of the filter and S_i^l, S_j^m are matrices that shift the image x by l pixels in the horizontal direction and m pixels in the vertical direction, respectively. In addition to penalizing high spatial frequencies of the input, this prior improves the temporal coherence between adjacent frames.

The MAP estimator (equation (4)) is converted into a minimization problem by applying a negative logarithm. Therefore, considering the last three expressions, the total energy function to be minimized becomes:

E_{total} = \lambda \sum_{k=1}^{K} \frac{1}{2\sigma_k^2} \|\epsilon_k\|_2^2 - \beta \log\left[ p_{seed}(H)\, p_{BTV}(H) \right]    (8)

where \lambda and \beta are weighting parameters. Inspired by the strategy described in [11], to estimate the MAP solution we apply a scaled conjugate gradients algorithm that converges relatively quickly, in 40-50 iterations. To improve the optimization, the process is initialized with the ML estimate.
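To show how the data term and a prior combine in such a minimization, here is a toy gradient-descent sketch of a simplified version of E_total (our own 1-D illustration: identity warp, known blur and decimation, seed prior only; the BTV term and the paper's scaled conjugate gradients solver are omitted for brevity):

```python
import numpy as np

# Toy 1-D setup: H (length 16) observed through blur + 2x decimation.
rng = np.random.default_rng(0)
H_true = np.sin(np.linspace(0, 3 * np.pi, 16))

def forward(H):
    Hb = np.convolve(H, [0.25, 0.5, 0.25], mode="same")  # B: small blur
    return Hb[::2]                                       # D: 2x decimation

L_obs = forward(H_true)                        # observed low-res frame (noise-free toy)
H_seed = H_true + 0.05 * rng.normal(size=16)   # imperfect seed frame

def energy(H, lam=1.0, beta=0.1):
    data = lam * np.sum((forward(H) - L_obs) ** 2)          # data term
    seed = beta * np.sum((np.diff(H) - np.diff(H_seed)) ** 2)  # seed-gradient prior
    return data + seed

# Plain gradient descent with numerical gradients (a solver sketch only).
H = np.interp(np.arange(16), np.arange(0, 16, 2), L_obs)  # upsampled initialization
eps, step = 1e-5, 0.05
for _ in range(300):
    grad = np.zeros_like(H)
    for i in range(len(H)):
        Hp = H.copy(); Hp[i] += eps
        grad[i] = (energy(Hp) - energy(H)) / eps            # forward differences
    H -= step * grad

print(energy(H))
```

The descent drives the energy well below that of the naive upsampled initialization; the conjugate-gradient scheme cited in the paper plays the same role far more efficiently on full-size frames.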

5. RESULTS AND DISCUSSION

The new method has been extensively tested on synthetic videos as well as on real videos taken with a commercial hand-held digital camera. For all experiments we assume only a slight camera shake, inducing planar-projective motion. Furthermore, for the examples shown in this paper the additional photographs were taken with the same digital camera that recorded the video sequences (we did not use the camera feature that records stills simultaneously with video). The distance between high-resolution seed frames is set to 20 frames, while the influence of the adjacent frames is limited to a maximum of 10 frames.

Figure 2 shows comparative results for a frame of an input video sequence of approximately 150 frames, using 6 reference photographs (not shown). As can be observed, compared with standard upsampling techniques (e.g. bicubic) the proposed method enhances the original frame more pleasingly by restoring some of the lost details. A second example is presented in Figure 3. When the original low-resolution frame is upsampled with an edge-preserving method [13], even though some of the important edges gain a sharper appearance, important fine transitions are completely destroyed. On the other hand, by using the additional information taken from several reference images, we are able to recover the finest details better than a reconstruction-based super-resolution strategy.

Fig. 3. a) An initial frame of the Palma de Mallorca cathedral video. b) The edge-preserving method of [13] ruins the high frequencies. c) The SR technique of [12]. d) Our method restores the finest details of the input video more accurately by effectively using the prior information of the adjacent low-resolution frames as well as of the high-quality reference photographs.

6. CONCLUSION

In this paper we present a super-resolution technique that increases the spatial resolution of a video sequence by borrowing information from several photographs taken of the same static scene. After several high-resolution seed frames are constructed using the additional information of the high-resolution reference photographs, the entire set of low-resolution frames is accurately restored within a maximum a posteriori (MAP) framework. Our results demonstrate that the enhanced videos are of superior quality compared to standard upsampling and to reconstruction-based super-resolution methods. For future work we would like to extend our approach to the more challenging case of dynamic scenes.

7. REFERENCES

[1] S. Borman and R. Stevenson, "Spatial resolution enhancement of low-resolution image sequences - a comprehensive review," Technical Report, University of Notre Dame, 1998.
[2] S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: A technical overview," IEEE Signal Processing Magazine, 2003.
[3] W. T. Freeman, T. R. Jones, and E. C. Pasztor, "Example-based super-resolution," IEEE Computer Graphics and Applications, 2002.
[4] V. Cheung, B. J. Frey, and N. Jojic, "Video epitomes," IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[5] P. Bhat, C. L. Zitnick, N. Snavely, A. Agarwala, M. Agrawala, B. Curless, M. Cohen, and S. B. Kang, "Using photographs to enhance videos of a static scene," Eurographics Symposium on Rendering, 2007.
[6] C. Ancuti, T. Haber, T. Mertens, and P. Bekaert, "Video enhancement using reference photographs," The Visual Computer, 2008.
[7] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003.
[8] C. Ancuti, C. O. Ancuti, and P. Bekaert, "An efficient two steps algorithm for wide baseline image matching," The Visual Computer, 2009.
[9] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 2004.
[10] A. Criminisi, P. Perez, and K. Toyama, "Region filling and object removal by exemplar-based image inpainting," IEEE Transactions on Image Processing, 2004.
[11] L. C. Pickup, S. J. Roberts, and A. Zisserman, "Optimizing and learning for super-resolution," British Machine Vision Conference, 2006.
[12] S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, "Fast and robust multi-frame super-resolution," IEEE Transactions on Image Processing, 2004.
[13] Z. Farbman, R. Fattal, D. Lischinski, and R. Szeliski, "Edge-preserving decompositions for multi-scale tone and detail manipulation," ACM SIGGRAPH, 2008.