Extraction of High-Resolution Frames from Video Sequences
Richard R. Schultz and Robert L. Stevenson
Laboratory for Image and Signal Analysis
Department of Electrical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, USA

ABSTRACT: This paper addresses how to utilize both the spatial and temporal information present in a short video sequence to create a high-resolution still. A novel observation model based on motion-compensated subsampling is proposed for video data. Since the reconstruction problem is ill-posed, Bayesian restoration with an edge-preserving prior image model is used to extract a high-resolution frame given a low-resolution sequence. Estimates generated by the multiframe video extraction algorithm show dramatic improvements over single frame interpolation techniques. Simulation results are reported for an image sequence containing a subpixel camera pan and an interlaced video sequence.
1 Introduction
Single frame interpolation techniques [1, 2] have been researched quite extensively, with the zero-order hold, bilinear, and various cubic spline interpolation methods providing progressively more accurate solutions. However, these methods are limited by the number of constraints available within the data. For this reason, multiframe methods have been proposed [3-5] which use the additional data present in a sequence of adjacent video frames to improve resolution. In this paper, the multiframe interpolation problem is placed into a Bayesian framework featuring a novel observation model for video sequence data. Independent object motion in the image sequence will be assumed, with hierarchical block matching [6] used to estimate subpixel motion vectors between frames. An edge-preserving image prior is used to regularize the interpolation problem, resulting in an estimate of the high-resolution data containing distinct edges. Increasing the number of low-resolution frames in the video model improves the quality of the enhanced frame, up to a practical limit. Applications of this research include high-resolution video hardcopy and preprocessing for image/video analysis.
2 Video Sequence Observation Model
A novel observation model is proposed for a video sequence. A subsampling matrix is defined to map the high-resolution data pixels into a low-resolution frame via spatial averaging. Motion-compensated subsampling matrices incorporate object motion between frames into the model.
2.1 Problem Statement

Assume that each frame in a low-resolution video sequence contains N_1 \times N_2 square pixels. Consider a low-resolution video subsequence \{y^{(l)}\} for l = k - \frac{M-1}{2}, \ldots, k, \ldots, k + \frac{M-1}{2}, where M represents an odd number of frames. A single high-resolution frame z^{(k)}, coincident with the center frame y^{(k)}, is to be estimated from the low-resolution subsequence. This unknown high-resolution data consists of qN_1 \times qN_2 square pixels, where q is an integer-valued interpolation factor in both the horizontal and vertical directions.
2.2 Subsampling Model for Center Frame

Subsampling for the center frame is accomplished by averaging a block of high-resolution pixels,

y_{i,j}^{(k)} = \frac{1}{q^2} \sum_{r=qi-q+1}^{qi} \sum_{s=qj-q+1}^{qj} z_{r,s}^{(k)},   (1)

for i = 1, \ldots, N_1 and j = 1, \ldots, N_2. This models the spatial integration of light intensity over a square surface region performed by CCD image acquisition sensors [2]. The subsampling model for the center frame can be expressed in matrix notation as

y^{(k)} = A^{(k,k)} z^{(k)},   (2)

where A^{(k,k)} \in \mathbb{R}^{N_1 N_2 \times q^2 N_1 N_2} is referred to as the subsampling matrix.
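The block averaging of Eq. (1) can be sketched directly in array form. The following is a minimal illustration of the subsampling operator, not the paper's implementation; the sparse matrix form A^{(k,k)} of Eq. (2) is left implicit, and the function name is ours:

```python
import numpy as np

def subsample(z, q):
    """Average each non-overlapping q x q block of the high-resolution
    frame z (Eq. (1)), modeling the spatial integration of light
    intensity performed by CCD image acquisition sensors."""
    N1, N2 = z.shape[0] // q, z.shape[1] // q
    # Reshape into (N1, q, N2, q) and average over each q x q block.
    return z.reshape(N1, q, N2, q).mean(axis=(1, 3))
```

For q = 4, this maps a 4N_1 \times 4N_2 frame z^{(k)} to the N_1 \times N_2 observation y^{(k)}, matching the block averaging used to generate the test data in Section 4.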
2.3 Motion Compensated Subsampling

The idea is to extract knowledge about the high-resolution center frame z^{(k)} from neighboring low-resolution frames y^{(l)}. This will be modeled as y^{(l)} = A^{(l,k)} z^{(k)} + u^{(l,k)} for l \neq k. A^{(l,k)} is the motion-compensated subsampling matrix, which models the subsampling of the high-resolution frame and accounts for object motion occurring between frames y^{(l)} and y^{(k)}. For pixels in z^{(k)} which are not observable in y^{(l)}, A^{(l,k)} contains a column of zeros. Object motion will also cause pixels to be present in y^{(l)} which are not in z^{(k)}. The vector u^{(l,k)} accommodates these pixels with nonzero elements. Rows of A^{(l,k)} containing useful information are those for which elements of y^{(l)} are observed entirely from motion-compensated elements of z^{(k)}. Write these useful rows as the reduced set of equations y_0^{(l)} = A_0^{(l,k)} z^{(k)}. In practice, the motion-compensated subsampling matrix must be estimated initially from y^{(l)} and y^{(k)}. Therefore, the relationship between y^{(l)} and z^{(k)} will be rewritten as

y_0^{(l)} = \hat{A}_0^{(l,k)} z^{(k)} + n^{(l,k)},   (3)
where n^{(l,k)} is an additive noise term representing the error in estimating \hat{A}_0^{(l,k)}. The additive noise is assumed to be independent and identically distributed (i.i.d.) Gaussian. To construct A^{(l,k)}, consider the motion of the (i,j)th pixel between frames l and k:

y_{i,j}^{(l)} = \frac{1}{q^2} \sum_{r=qi-q+1}^{qi} \sum_{s=qj-q+1}^{qj} z_{r-qv_i,\, s-qv_j}^{(k)}.   (4)
Vertical and horizontal displacements, represented as v_i and v_j, respectively, have fractional resolution to account for subpixel motion in the low-resolution data. In the motion-compensated subsampling model, A^{(l,k)} is similar in form to A^{(k,k)}, but with the summation over a shifted set of pixels. Pixels in y^{(l)} which are not completely observable in z^{(k)} must be detected, so that the corresponding rows of A^{(l,k)} can be deleted in the construction of A_0^{(l,k)} [7].
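The observation of Eq. (4) can be illustrated with a heavily simplified sketch. This is not the paper's construction of A^{(l,k)}: it assumes a single global displacement (v_i, v_j) such that q v_i and q v_j are whole high-resolution pixels, and it flags unobservable low-resolution pixels with NaN rather than deleting rows of the matrix; the function name and this convention are ours:

```python
import numpy as np

def mc_subsample(z, q, vi, vj):
    """Motion-compensated subsampling (Eq. (4)) for one global
    displacement (vi, vj) in low-resolution pixel units.  Assumes
    q*vi and q*vj are integers; low-resolution pixels whose shifted
    q x q support falls outside z are marked NaN (these correspond
    to the deleted rows when forming A_0^(l,k))."""
    di, dj = int(round(q * vi)), int(round(q * vj))
    H, W = z.shape
    N1, N2 = H // q, W // q
    y = np.full((N1, N2), np.nan)
    for i in range(N1):
        for j in range(N2):
            # Shifted block of high-resolution pixels (0-indexed).
            r0, c0 = i * q - di, j * q - dj
            if 0 <= r0 and r0 + q <= H and 0 <= c0 and c0 + q <= W:
                y[i, j] = z[r0:r0 + q, c0:c0 + q].mean()
    return y
```

With (v_i, v_j) = (0, 0) this reduces to the center-frame subsampling of Eq. (1).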
3 Video Frame Extraction Algorithm
By assuming a Huber-Markov random field model [2] for the image data and an i.i.d. Gaussian density to represent error in estimating the observation model [7], the maximum a posteriori (MAP) estimate of the high-resolution data given the low-resolution subsequence becomes

\hat{z}^{(k)} = \arg\min_{z^{(k)} \in Z} \left\{ \sum_{m,n} \sum_{r=1}^{4} \rho_\alpha\!\left( d_{m,n,r}^{t} z^{(k)} \right) + \sum_{\substack{l = k - \frac{M-1}{2} \\ l \neq k}}^{k + \frac{M-1}{2}} \lambda^{(l,k)} \left\| y_0^{(l)} - \hat{A}_0^{(l,k)} z^{(k)} \right\|^2 \right\}.   (5)

In this expression, the set of constraints is defined as

Z = \left\{ z^{(k)} : y^{(k)} = A^{(k,k)} z^{(k)} \right\}.   (6)

The first term in (5) pertains to the image model. The quantity d_{m,n,r}^{t} z^{(k)} is a spatial activity measure, with a small value in smooth image regions and a large value at edges. Four spatial activity measures are computed at each pixel in the high-resolution data, implemented as second-order finite differences [2]. The Huber edge penalty function is used to preserve edges, defined as
\rho_\alpha(x) = \begin{cases} x^2, & |x| \leq \alpha, \\ 2\alpha|x| - \alpha^2, & |x| > \alpha, \end{cases}   (7)

where \alpha is a threshold parameter controlling the size of discontinuities modeled by the prior. The remaining terms in (5) correspond to the i.i.d. Gaussian density used for the observation model error. Each frame has an associated confidence parameter \lambda^{(l,k)}, representing the confidence in \hat{A}_0^{(l,k)}. The gradient projection algorithm is used to compute \hat{z}^{(k)}.
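The Huber penalty of Eq. (7) is straightforward to evaluate; a minimal sketch (the function name and test values are illustrative only):

```python
import numpy as np

def huber(x, alpha):
    """Huber edge penalty rho_alpha(x) of Eq. (7): quadratic for
    |x| <= alpha, linear beyond, so large spatial activity values
    (edges) are penalized less severely than under a purely
    quadratic Gauss-Markov prior."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= alpha,
                    x ** 2,
                    2.0 * alpha * np.abs(x) - alpha ** 2)
```

Note that the two pieces meet continuously at |x| = \alpha (both equal \alpha^2, with common derivative 2\alpha), which is what makes the objective in (5) convex and differentiable.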
4 Simulations
The Airport test sequence consists of seven progressively-scanned frames, synthetically generated by extracting subimages from a digitized image of an airport [7]. Each successive frame was shifted seven horizontal pixels to the right and seven vertical pixels down, which simulates a diagonal panning of the scene acquired by a video camera mounted on an airplane. Each low-resolution frame y^{(l)} was generated by averaging 4 \times 4 pixel blocks within each high-resolution frame z^{(l)} and then subsampling by a factor of q = 4. The center low-resolution frame y^{(k)} was expanded using the single frame techniques of bilinear interpolation, cubic B-spline interpolation [8], Bayesian MAP estimation assuming a Gauss-Markov random field (GMRF) image model, and Bayesian MAP estimation assuming a Huber-Markov random field (HMRF) image model with \alpha = 1 [2]. Both Bayesian estimates were computed to compare linear (GMRF) and nonlinear (HMRF) estimates. Next, the video frame extraction algorithm was used to estimate the center frame, given the low-resolution sequence. Table I provides a quantitative comparison of the estimates by showing the improved signal-to-noise ratio, \Delta SNR, with a zero-order hold of the center frame selected as the reference. Improved SNR versus the number of frames used in the video model is plotted in Figure 1. The image model used in the various reconstructions was either the GMRF or the HMRF. Motion was estimated by block matching, with either the full motion vector field ("Motion Vectors," with \lambda^{(l,k)} = 10/|l-k|) or the average of the motion vector field ("Camera Pan," with \lambda^{(l,k)} = 1000/|l-k|) used in the simulations. Substantial improvement over the single frame interpolation methods can be obtained by using M = 3 to M = 5 frames, with diminishing improvements available for a larger number of frames.

A short video was generated of a landmark on the University of Notre Dame campus to show the effect of applying the multiframe technique to interlaced image sequences. A method often employed in the generation of interlaced video hardcopy involves the integration of two fields by placing them together in the same frame. An image produced by this method is shown in Figure 2. Note the motion artifacts between the even and odd fields. The high-resolution deinterlaced frame generated by the video frame extraction algorithm is also depicted in Figure 2. Since the Bayesian multiframe algorithm uses motion compensation between frames, a significantly improved reconstruction results.
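The improved SNR reported in Table I is, presumably, the gain in reconstruction error relative to the zero-order hold reference; under that assumption (the formula and function name below are ours, not stated in the text) it can be computed as:

```python
import numpy as np

def delta_snr_db(z_true, z_est, z_ref):
    """Improved SNR in dB of an estimate over a reference
    reconstruction (here, the zero-order hold of the center frame):
    10 * log10( ||z - z_ref||^2 / ||z - z_est||^2 )."""
    return 10.0 * np.log10(np.sum((z_true - z_ref) ** 2) /
                           np.sum((z_true - z_est) ** 2))
```

A positive value means the estimate is closer to the true high-resolution frame than the zero-order hold; halving the residual energy yields about 3 dB.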
5 Conclusion
Single frame interpolation methods are inherently limited by the number of constraints available within a given image. Additional linearly independent constraints are available within a video sequence. A novel observation model was proposed for low-resolution video frames, which models the subsampling of the unknown high-resolution data and accounts for independent object motion occurring between frames. Provided that the object motion has subpixel resolution, estimates computed by the video frame extraction algorithm may be substantially improved over single frame interpolations. A number of issues will be explored in future research. One of the most critical aspects of the video resolution enhancement algorithm is the accurate estimation of motion. Regularization techniques which are robust to both noise and discontinuities within the image sequence can be applied to the ill-posed inverse problem of motion estimation. Further research into the accurate modeling of video camera sensors will be conducted as well.
6 Acknowledgments
Effort sponsored by Rome Laboratory, Air Force Materiel Command, USAF, under grant number F30602-94-1-0017.
References
[1] N. B. Karayiannis and A. N. Venetsanopoulos, "Image interpolation based on variational principles," Signal Processing, vol. 25, no. 3, pp. 259-288, 1991.
[2] R. R. Schultz and R. L. Stevenson, "A Bayesian approach to image expansion for improved definition," IEEE Trans. Image Processing, vol. 3, no. 3, pp. 233-242, 1994.
[3] S. P. Kim, N. K. Bose, and H. M. Valenzuela, "Recursive reconstruction of high resolution image from noisy undersampled multiframes," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, no. 6, pp. 1013-1027, 1990.
[4] A. J. Patti, M. I. Sezan, and A. M. Tekalp, "High-resolution image reconstruction from a low-resolution image sequence in the presence of time-varying motion blur," in Proc. IEEE Int. Conf. Image Processing, (Austin, TX), pp. I-343 to I-347, 1994.
[5] R. Y. Tsai and T. S. Huang, "Multiframe image restoration and registration," in Advances in Computer Vision and Image Processing (R. Y. Tsai and T. S. Huang, eds.), vol. 1, pp. 317-339, JAI Press Inc., 1984.
[6] M. Bierling, "Displacement estimation by hierarchical blockmatching," in Proc. SPIE Conf. Visual Commun. Image Processing '88 (T. R. Hsing, ed.), vol. 1001, pp. 942-951, 1988.
[7] R. R. Schultz and R. L. Stevenson, "Extraction of high-resolution frames from video sequences." Submitted to IEEE Trans. Image Processing, Special Issue on Nonlinear Image Processing.
[8] H. S. Hou and H. C. Andrews, "Cubic splines for image interpolation and digital filtering," IEEE Trans. Acoust., Speech, Signal Processing, vol. 26, no. 6, pp. 508-517, 1978.
TABLE I: Comparison of Interpolation Methods on the Airport Sequence

Technique                                                                              ΔSNR (dB)
Single Frame Bilinear Interpolation                                                       0.57
Single Frame Cubic B-Spline Interpolation                                                 1.25
Single Frame MAP Estimation (GMRF)                                                        1.43
Single Frame MAP Estimation (HMRF, α = 1)                                                 1.51
Video Frame Extraction with Motion Estimates, M = 7, GMRF, λ^(l,k) = 10/|l-k|             3.47
Video Frame Extraction with Motion Estimates, M = 7, HMRF (α = 1), λ^(l,k) = 10/|l-k|     5.48
Video Frame Extraction with Panning, M = 7, GMRF, λ^(l,k) = 1000/|l-k|                    6.72
Video Frame Extraction with Panning, M = 7, HMRF (α = 1), λ^(l,k) = 1000/|l-k|            7.00
[Figure 1: Improved SNR (Δ SNR, dB) versus the number of frames M (M = 1, 3, 5, 7) used in the video observation model, for the Airport video sequence. Curves, from highest to lowest Δ SNR: HMRF/Camera Pan, GMRF/Camera Pan, HMRF/Motion Vectors, GMRF/Motion Vectors.]

Figure 1: Improved SNR versus number of frames used in the video observation model.
Figure 2: Frame extraction from an interlaced video sequence. The image on the left was generated by combining a single even and a single odd field, followed by zero-order hold up-sampling (q = 2). The reconstruction on the right was computed using the video frame extraction algorithm, with q = 2, M = 7 (3 even fields, 4 odd fields), α = 1, and λ^(l,k) = 1/|l-k|.