Video Synthesis with High Spatio-Temporal Resolution Using Motion Compensation and Image Fusion in Wavelet Domain

Kiyotaka Watanabe¹, Yoshio Iwai¹, Hajime Nagahara¹, Masahiko Yachida¹, and Toshiya Suzuki²

¹ Osaka University, Graduate School of Engineering Science, 1–3 Machikaneyama-cho, Toyonaka, Osaka 560–8531, Japan
{watanabe, iwai, nagahara, yachida}@yachi-lab.sys.es.osaka-u.ac.jp
² Eizoh Co., LTD, 2–1–10 Nanko-Kita, Suminoe, Osaka 559–0034, Japan
[email protected]
Abstract. This paper presents a novel algorithm for obtaining a high spatio-temporal resolution video from two video sequences: one with high resolution and a low frame rate, and one with low resolution and a high frame rate. To this end, we introduce a dual sensor camera that can capture these two sequences simultaneously with the same field of view. The proposed method estimates motion information from the high frame rate video, and conducts both motion compensation for the high resolution sequence and image fusion in the wavelet domain. We confirmed that the proposed method improves the resolution and frame rate of the synthesized video.
1 Introduction
In recent years, charge coupled devices (CCD) and CMOS image sensors have been widely used to capture digital images. With the development of sensor manufacturing techniques, the spatial resolution of these sensors has increased. As the resolution increases, however, the frame rate generally decreases because the sweep time is limited. Hence, high resolution is incompatible with a high frame rate. Some high resolution cameras are available for special uses such as digital cinema, but these are very expensive and thus unsuitable for general-purpose use.

Various methods have been proposed to obtain high resolution images from low resolution images by means of image processing. One approach to enhancing spatial resolution is known as super resolution, which has been actively studied for a long time. Conventional techniques for obtaining super resolution images from still images are summarized in the literature [1], and several methods for obtaining a high resolution image from a video sequence have also been proposed [2, 3].
Frame rate conversion algorithms have also been investigated in order to convert the frame rate of videos or to increase the number of video frames. Frame repetition and temporal linear interpolation are straightforward solutions, but they produce jerkiness and blurring at moving object boundaries, respectively. It has been shown that frame rate conversion with motion compensation provides the best solution in temporal up-sampling applications [4, 5].

The conventional techniques mentioned above enhance either spatial or temporal resolution. We adopt a novel strategy to synthesize a high spatio-temporal resolution video (i.e., a high spatial resolution video with a high frame rate) from two video sequences: one with high resolution and a low frame rate, and one with low resolution and a high frame rate. To this end, we introduce a dual sensor camera [6, 7] that can capture two such video sequences simultaneously with the same field of view. The dual sensor camera consists of conventional image sensors, allowing for inexpensive construction. A further advantage of this approach is that the amount of video data obtained from the dual sensor camera can be small.

Several works related to our approach have been conducted. Shechtman et al. have proposed a method [8, 9] for increasing the resolution both in time and in space. Matsunobu et al. have proposed a method that addresses the same problem as this paper using image morphing [10]. In their methods, all processing is conducted in the image domain. However, it is difficult to extract and track feature points in that approach, so there are cases where resolution enhancement of the video cannot be achieved. Another algorithm, proposed by Watanabe et al. [11], conducts motion compensation for the high resolution images in the image domain and fuses the spectrum of the compensated image with that of the temporally corresponding low resolution image in the DCT domain. That is, motion compensation and spectral fusion are done in distinct domains, which complicates the algorithm.

To solve these issues, this paper presents a novel method for synthesizing a high spatio-temporal resolution video using the wavelet transform. The proposed method conducts both motion compensation and fusion of the two video sequences in the wavelet domain, and is thus uncomplicated. In our method, we use the redundant discrete wavelet transform (RDWT) [12], which is an extension of the traditional discrete wavelet transform (DWT).
Fig. 1. Concept of dual sensor camera: a beam splitter divides the incident light from the scene between a high resolution, low frame rate camera and a low resolution, high frame rate camera, both synchronized by a pulse generator.
The shift-invariant property of RDWT allows motion compensation in the wavelet domain. In addition, our method conducts motion compensation for all pixels in the image, so it avoids difficult problems such as feature extraction and tracking.

The rest of the paper is organized as follows. We present the dual sensor camera in the following section. Next, we describe DWT and its properties, and then introduce RDWT in Sect. 3. Section 4 presents the proposed algorithm for synthesizing a high spatio-temporal resolution video. Section 5 shows experimental results. Finally, we conclude the paper in Sect. 6.
2 Dual Sensor Camera
The concept of the dual sensor camera used in our method is shown in Fig. 1. The camera has a beam splitter and two CCD sensors. The beam splitter divides an incident ray between the two CCDs, so the camera can capture two video sequences simultaneously: one CCD captures high resolution video at a low frame rate, and the other captures low resolution video at a high frame rate. Synchronized frames between the low resolution and high resolution sequences are obtained by means of a synchronization pulse. We call these synchronized frames "key frames" in this paper. We define the resolution ratio between the high resolution and low resolution sensors as $2^\alpha : 1$ ($\alpha \in \mathbb{N}$) and the frame rate ratio as $1 : \rho$ ($\rho \in \mathbb{N}$). Moreover, we let $\sigma = 2^\alpha$. The two video sequences captured with the dual sensor camera [6, 7] satisfy $\alpha = 2$ (i.e., $\sigma = 4$) and $\rho = 7$.
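As a concrete illustration of these parameters, the snippet below spells out the ratios and the key frame timing; the values are those of the prototype described above, and the 30-frame horizon is arbitrary.

```python
alpha = 2
sigma = 2 ** alpha   # spatial resolution ratio per axis: 4 : 1
rho = 7              # frame rate ratio: 1 : 7

# Both sensors fire together every rho-th frame; those are the key frames.
key_frames = [t for t in range(30) if t % rho == 0]
print(key_frames)    # [0, 7, 14, 21, 28]
```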
3 Discrete Wavelet Transform
DWT is a frequency transform. In contrast to frequency transforms such as the DFT and DCT, DWT preserves spatial information in the frequency domain. We exploit this property in order to conduct both motion compensation and image fusion in the wavelet domain.

It is known that the traditional DWT is shift-variant [13]. Hence, the DWT coefficients of an image $I(x, y)$ are generally very different from those of the one-pixel shifted image $I_s(x, y) = I(x - 1, y)$, so motion compensation of the wavelet coefficients causes large errors. The shift-variance of DWT arises from the downsampling operation in the DWT decomposition. To overcome this, we introduce RDWT [12]. RDWT removes the downsampling operation from the traditional DWT to ensure shift-invariance at the cost of a redundant representation.

Because of the shift-invariant property of RDWT, a shift in the image domain is exactly a shift in the wavelet domain. Therefore, motion compensation for each subband in the RDWT domain can be performed in essentially the same manner as in the image domain. The proposed method conducts both motion compensation and image fusion in the wavelet domain, and is thus simplified.
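To make the contrast concrete, here is a minimal NumPy sketch of one decomposition level of the Haar wavelet in one dimension, with periodic boundary handling; it is an illustration only, not the 2-D transform used in the paper.

```python
import numpy as np

def haar_dwt(x):
    """One level of the decimated Haar DWT (input length must be even)."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # low band, downsampled by 2
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # high band, downsampled by 2
    return lo, hi

def haar_rdwt(x):
    """One level of the redundant Haar DWT: same filters, no downsampling."""
    xn = np.roll(x, -1)                     # right neighbor, periodic extension
    return (x + xn) / np.sqrt(2), (x - xn) / np.sqrt(2)

rng = np.random.default_rng(0)
x = rng.random(16)
xs = np.roll(x, 1)                          # one-pixel shift of the signal

# Decimated DWT is shift-variant: the high band of the shifted signal is not
# a shifted copy of the original high band.
print(np.allclose(np.roll(haar_dwt(x)[1], 1), haar_dwt(xs)[1]))   # False

# RDWT is shift-invariant: shifting the signal simply shifts the coefficients.
lo, hi = haar_rdwt(x)
lo_s, hi_s = haar_rdwt(xs)
print(np.allclose(np.roll(lo, 1), lo_s), np.allclose(np.roll(hi, 1), hi_s))  # True True
```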
Fig. 2. Block diagram of the proposed algorithm: the high resolution video input is decomposed by RDWT; its high frequency components are motion-compensated using motion vectors estimated from the low resolution video input, downsampled, fused with the low resolution frames, and reconstructed by IDWT into the synthesized video. RDWT and motion estimation run on the server side; the remaining blocks run on the client side.

Fig. 3. Terminology of ME: the motion vector v(x, y) points from the anchor frame at time t to the target frame at time t + Δt.
Let $L_f^{(j)}$ and $H_f^{(j)}$ be the $j$-th level low-band and high-band DWT coefficients of $f$, respectively. Moreover, we denote by $\tilde{L}_f^{(j)}$ and $\tilde{H}_f^{(j)}$ the $j$-th level low-band and high-band RDWT coefficients of $f$, where the tilde indicates RDWT coefficients. The DWT coefficients are related to the RDWT coefficients by the following formulae:

$$L_f^{(j)} = \tilde{L}_f^{(j)} \downarrow 2^j, \qquad H_f^{(j)} = \tilde{H}_f^{(j)} \downarrow 2^j, \quad (1)$$

where $\downarrow \alpha$ denotes downsampling by a factor of $\alpha$; that is, if $y(n) = x(n) \downarrow \alpha$, then $y(n) = x(\alpha n)$. Therefore, RDWT reconstruction can be performed as DWT reconstruction (inverse DWT) by using these formulae.
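Relation (1) simply keeps every $2^j$-th RDWT coefficient. A one-line sketch for a 1-D coefficient array:

```python
import numpy as np

def rdwt_to_dwt(coeff_rdwt: np.ndarray, j: int) -> np.ndarray:
    """Recover level-j DWT coefficients from RDWT coefficients, as in (1):
    if y(n) = x(n) downsampled by 2**j, then y(n) = x(2**j * n)."""
    return coeff_rdwt[:: 2 ** j]
```

For a 2-D subband, the same subsampling is applied along both axes, i.e. `coeff_rdwt[::2**j, ::2**j]`.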
4 Video Synthesis in Wavelet Domain
Figure 2 shows the outline of the proposed algorithm for synthesizing high resolution images. The method estimates motion information in the low resolution video with high frame rate. Each frame of the high resolution video is decomposed using RDWT. The high frequency components of the RDWT coefficients are motion-compensated based on the estimated motion information, and these components are then downsampled. The downsampled coefficients are fused with the low resolution video frames. Finally, by applying the inverse DWT, we obtain a high resolution video with a high frame rate. When the video is streamed, RDWT and motion estimation are processed on the server side, while the other processes are executed on the client side.
4.1 Terminology of Motion Estimation
The computation of motion information is referred to as motion estimation (ME). As shown in Fig. 3, if the motion from the frame at time t to the frame at time t + Δt is estimated, the frames at t and t + Δt are called the "anchor frame" and the "target frame," respectively.
Fig. 4. Initial settings of the proposed method: the time axis runs through frames 0, 1, 2, ..., ρ−2, ρ−1, ρ of the low resolution video I; frames 0 and ρ also exist in the high resolution video Ĩ as key frames, and the indices bk and fw start at 1 and ρ−1, respectively.
We call the estimation process "forward motion estimation" if Δt > 0 and "backward motion estimation" if Δt < 0. In this case, a motion vector is assigned to every pixel of the frame at time t.
4.2 Process for Synthesizing High Resolution Video
We assume, as shown in Fig. 4, that there exist two pairs of key frames, $(I_0, \tilde{I}_0)$ and $(I_\rho, \tilde{I}_\rho)$. We also suppose that the intermediate low resolution frames $I_1, I_2, \dots, I_{\rho-1}$ are available. Under these assumptions, the proposed algorithm synthesizes the high resolution images $\tilde{I}_1, \dots, \tilde{I}_{\rho-1}$ by the following process; the frames are synthesized in the order $\tilde{I}_1, \tilde{I}_{\rho-1}, \tilde{I}_2, \tilde{I}_{\rho-2}, \dots$ (code sketches illustrating Steps 4 and 5 and the overall control flow follow this procedure).

Step 1. (Initial Settings) Set $bk = 1$ and $fw = \rho - 1$, as shown in Fig. 4.

Step 2. Set $s = bk$, $r = bk - 1$, and $c = fw + 1$.

Step 3. (RDWT Decomposition) Calculate the high frequency components

$$\widetilde{LH}_{\tilde{I}_r}^{(j)},\ \widetilde{HL}_{\tilde{I}_r}^{(j)},\ \widetilde{HH}_{\tilde{I}_r}^{(j)} \quad (j = 1, 2, \dots, \alpha) \quad (2)$$

by the RDWT decomposition of the high resolution image $\tilde{I}_r$, which has already been obtained. If the standard RDWT is used, the low frequency component $\widetilde{LL}_{\tilde{I}_r}^{(\alpha)}$ can also be obtained; however, it is not used in our method, so we do not need to calculate it.

Step 4. (Motion Estimation) Estimate the motion vector for each pixel of the low resolution image $I_s$ by the phase correlation method [14], with the low resolution images $I_s$ and $I_r$ as the anchor frame and target frame, respectively. The motion vector is measured to an accuracy of $1/\sigma$ pixel.

Step 5. (Motion Compensation) Estimate each subband of $\tilde{I}_s$ by motion compensation of the corresponding subband of $\tilde{I}_r$:

$$\widetilde{LH}_{\tilde{I}_s}^{(j)}(x_s, y_s) = \widetilde{LH}_{\tilde{I}_r}^{(j)}(x_r, y_r), \quad (3)$$
$$\widetilde{HL}_{\tilde{I}_s}^{(j)}(x_s, y_s) = \widetilde{HL}_{\tilde{I}_r}^{(j)}(x_r, y_r), \quad (4)$$
$$\widetilde{HH}_{\tilde{I}_s}^{(j)}(x_s, y_s) = \widetilde{HH}_{\tilde{I}_r}^{(j)}(x_r, y_r), \quad (5)$$
where

$$x_s = 2^\alpha x + \Delta_x, \qquad y_s = 2^\alpha y + \Delta_y,$$
$$x_r = 2^\alpha (x + v^x_{(x,y)}) + \Delta_x, \qquad y_r = 2^\alpha (y + v^y_{(x,y)}) + \Delta_y,$$

for $0 \le \Delta_x, \Delta_y \le 2^\alpha - 1$, and $(v^x_{(x,y)}, v^y_{(x,y)})$ is the motion vector assigned to the low resolution pixel $(x, y)$.

Motion estimation fails at a pixel when no candidate motion vector attains an MSE below a certain threshold in the phase correlation method. At positions where no motion vector could be assigned, motion estimation is repeated with $I_s$ and $I_c$ as the anchor frame and target frame, respectively, and each subband of $\tilde{I}_s$ is estimated by motion compensation of the respective subband of $\tilde{I}_c$. If both procedures fail to assign a motion vector to the coordinate $(x, y)$, zeros are substituted into each subband:

$$\widetilde{LH}_{\tilde{I}_s}^{(j)}(x_s, y_s) = \widetilde{HL}_{\tilde{I}_s}^{(j)}(x_s, y_s) = \widetilde{HH}_{\tilde{I}_s}^{(j)}(x_s, y_s) = 0 \quad (j = 1, 2, \dots, \alpha), \quad (6)$$

where

$$x_s = 2^\alpha x + \Delta_x, \qquad y_s = 2^\alpha y + \Delta_y, \quad \text{for } 0 \le \Delta_x, \Delta_y \le 2^\alpha - 1. \quad (7)$$
Step 6. (Downsampling) Downsample each subband of $\tilde{I}_s$ in accordance with (1), i.e., calculate the following DWT coefficients from the RDWT coefficients:

$$LH_{\tilde{I}_s}^{(j)},\ HL_{\tilde{I}_s}^{(j)},\ HH_{\tilde{I}_s}^{(j)} \quad (j = 1, 2, \dots, \alpha).$$

Step 7. (Image Fusion) Replace the low frequency component of $\tilde{I}_s$, $LL_{\tilde{I}_s}^{(\alpha)}$, with the temporally corresponding low resolution image $I_s$; that is, let $LL_{\tilde{I}_s}^{(\alpha)} = I_s$.

Step 8. (IDWT) Apply the inverse DWT to the subbands of $\tilde{I}_s$. As a result, we obtain the high resolution image $\tilde{I}_s$.

Step 9. Add 1 to $bk$.

Step 10. If every intermediate frame has been synthesized, terminate the algorithm; otherwise proceed to Step 11.

Step 11. Set $s = fw$, $r = fw + 1$, and $c = bk - 1$, and then execute Steps 3 to 8.

Step 12. Subtract 1 from $fw$.

Step 13. Repeat Steps 2 to 12 until all intermediate frames are synthesized.
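The following is a minimal sketch of phase correlation (Step 4) for a single global, integer-pixel shift between two frames. It is an illustration only: the estimator actually used assigns a vector to every pixel at $1/\sigma$-pixel accuracy, e.g. by evaluating candidate peaks block-wise and interpolating around them, and those parts are omitted here.

```python
import numpy as np

def phase_correlation(anchor: np.ndarray, target: np.ndarray):
    """Estimate the dominant integer-pixel shift from anchor to target."""
    cross = np.conj(np.fft.fft2(anchor)) * np.fft.fft2(target)
    cross /= np.abs(cross) + 1e-12            # normalized cross-power spectrum
    corr = np.fft.ifft2(cross).real           # peak location encodes the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    H, W = corr.shape
    if dy > H // 2:
        dy -= H                               # wrap to signed displacements
    if dx > W // 2:
        dx -= W
    return dx, dy
```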
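The coordinate mapping of Step 5 and the zero substitution of (6)-(7) can be sketched for one subband as below. This is a naive sketch under two assumptions: motion vectors are given per low resolution pixel with NaN marking estimation failures, and $\sigma v$ is rounded to the nearest integer (the method stores vectors at $1/\sigma$-pixel accuracy).

```python
import numpy as np

def motion_compensate_subband(sub_r: np.ndarray, mv: np.ndarray, alpha: int) -> np.ndarray:
    """Step 5 for one RDWT high-frequency subband (sketch).

    sub_r : subband of the reference frame, full high resolution size.
    mv    : motion vectors per low-res pixel, shape (H, W, 2) as (vx, vy) in
            low-res pixel units; NaN entries mark estimation failures, whose
            output positions stay zero as in (6)-(7).
    """
    sigma = 2 ** alpha
    H, W = mv.shape[:2]
    sub_s = np.zeros_like(sub_r)              # zeros cover the failure case (6)
    for y in range(H):
        for x in range(W):
            vx, vy = mv[y, x]
            if np.isnan(vx) or np.isnan(vy):
                continue                      # no vector assigned: leave zeros
            for dy in range(sigma):           # Delta_y = 0, ..., 2**alpha - 1
                for dx in range(sigma):       # Delta_x = 0, ..., 2**alpha - 1
                    ys, xs = sigma * y + dy, sigma * x + dx
                    yr = int(round(sigma * (y + vy))) + dy
                    xr = int(round(sigma * (x + vx))) + dx
                    if 0 <= yr < sub_r.shape[0] and 0 <= xr < sub_r.shape[1]:
                        sub_s[ys, xs] = sub_r[yr, xr]
    return sub_s
```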
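The control flow of Steps 1, 2, and 9-13 interleaves synthesis from both ends of the key frame interval. The sketch below enumerates the $(s, r, c)$ triples in the order the steps visit them; it reads the fallback reference $c$ in each pass as the most recently synthesized frame on the opposite side, and termination as occurring once all intermediate frames are done, which is our reading of the procedure's intent.

```python
def synthesis_schedule(rho):
    """Enumerate (s, r, c) triples in synthesis order: s is the frame to
    synthesize, r the primary reference, c the fallback reference."""
    bk, fw = 1, rho - 1                        # Step 1
    schedule = []
    while bk <= fw:                            # Step 13: repeat until done
        schedule.append((bk, bk - 1, fw + 1))  # Step 2, then Steps 3-8
        bk += 1                                # Step 9
        if bk > fw:                            # Step 10: all frames done
            break
        schedule.append((fw, fw + 1, bk - 1))  # Step 11, then Steps 3-8
        fw -= 1                                # Step 12
    return schedule

print(synthesis_schedule(7))
# [(1, 0, 7), (6, 7, 1), (2, 1, 6), (5, 6, 2), (3, 2, 5), (4, 5, 3)]
```

For $\rho = 7$ this reproduces the stated order $\tilde{I}_1, \tilde{I}_6, \tilde{I}_2, \tilde{I}_5, \tilde{I}_3, \tilde{I}_4$.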
5 Experiments

5.1 Simulation Experiments
We conducted simulation experiments to confirm that the proposed method synthesizes high resolution video.
Table 1. Description of test sequences

Sequence Name   Spatial Resolution   Frames
Coast guard     352 × 288            1–295
Football        352 × 240            1–120
Foreman         352 × 288            1–295
Hall monitor    352 × 288            1–295
Fig. 5. Test sequence "Foreman", 46th frame: (a) original frame, (b) close-up of (a), (c) nearest neighbor, (d) close-up of (c), (e) DCT fusion [11], (f) close-up of (e), (g) Haar, (h) close-up of (g), (i) integer 2/6, (j) close-up of (i), (k) Daubechies 4-tap, (l) close-up of (k).
The simulated input image sequences emulate the dual sensor camera and were made from MPEG test sequences as described below. The low resolution image sequence ($M/4 \times N/4$ [pixels], 30 [fps]) was obtained by scaling the original MPEG sequence ($M \times N$ [pixels], 30 [fps]) down to 25%, i.e., $\sigma = 4$. The high resolution image sequence ($M \times N$ [pixels], 4.29 [fps]) was obtained by picking every seventh frame of the original sequence, i.e., $\rho = 7$. The proposed method synthesized an $M \times N$ [pixels], 30 [fps] video as the high resolution video with a high frame rate. Table 1 lists the original MPEG sequences used in these experiments.

In these experiments, we used three wavelet functions for the image fusion: the Haar wavelet, the Daubechies 4-tap filter [15], and the integer 2/6 wavelet [16]. We investigated the difference in the quality of the synthesized images between these functions.
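The simulated inputs can be reproduced with a short sketch. Block averaging stands in here for the unspecified 25% downscaling filter, so that choice is an assumption; frames are taken as a (T, M, N) grayscale array.

```python
import numpy as np

def simulate_dual_sensor(frames: np.ndarray, sigma: int = 4, rho: int = 7):
    """Build the two simulated camera inputs from an original (T, M, N) clip.

    low  : every frame, spatially reduced by sigma per axis (30 fps, M/4 x N/4)
    high : every rho-th frame at full resolution (30/7 = 4.29 fps, M x N)
    """
    T, M, N = frames.shape
    low = frames.reshape(T, M // sigma, sigma, N // sigma, sigma).mean(axis=(2, 4))
    high = frames[::rho]
    return low, high
```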
Table 2. PSNR results [dB]

Sequence Name   Haar Wavelet   Integer 2/6 Wavelet   Daubechies 4-tap filter   DCT spectral fusion [11]   Nearest Neighbor
Coast guard     23.59          23.65                 22.47                     23.38                      21.28
Foreman         26.08          26.27                 24.29                     25.88                      23.67
Football        20.06          20.52                 19.78                     20.15                      19.88
Hall monitor    31.90          32.08                 25.44                     30.81                      21.78
Figure 5 shows the original frame (a)(b) and an upsampled low resolution frame produced by the nearest neighbor method (c)(d). Figure 5(e)(f) shows a frame synthesized by the DCT spectral fusion method [11]. In Fig. 5, (g)(h), (i)(j), and (k)(l) are frames synthesized using the Haar wavelet, the integer 2/6 wavelet, and the Daubechies 4-tap filter, respectively. Blocking artifacts appear in the image synthesized with the Haar wavelet (Fig. 5(g)(h)), e.g., near the brim of the helmet. In these areas the high frequency RDWT coefficients are replaced with zeros because no motion vector could be estimated; the Haar wavelet is discontinuous, so the interpolation in these areas is rough, which causes the blocking artifacts. These artifacts are reduced in the images synthesized with the integer 2/6 wavelet and the Daubechies 4-tap filter (see Fig. 5(i)-(l)), because these wavelet functions are smoother than the Haar wavelet and therefore interpolate smoothly.

Table 2 shows the simulation results for each test sequence. We compared the peak signal to noise ratio (PSNR) between the synthesized and original images. The results using the integer 2/6 wavelet are better in PSNR than those obtained by up-sampling the low resolution video (nearest neighbor) and by the DCT spectral fusion method for all four test sequences. This shows that the proposed method improved the resolution and frame rate. The PSNR results for the "Football" and "Foreman" sequences are relatively poor. These sequences contain large dynamic regions, so there are regions where the motion information cannot be estimated, which lowers the PSNR. On the other hand, static regions and pure translation dominate the "Coast guard" and "Hall monitor" sequences, so accurate motion estimation and compensation could be achieved and better PSNR results were obtained.
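For reference, the PSNR figures in Table 2 compare each synthesized frame against the original. A minimal sketch for 8-bit grayscale frames (peak value 255):

```python
import numpy as np

def psnr(original: np.ndarray, synthesized: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    diff = original.astype(np.float64) - synthesized.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")       # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```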
5.2 Synthesis from Real Video Sequences
By calibrating two video sequences captured with the prototype dual sensor camera [6, 7], the following two sequences were made:
– Size: 4000 × 2600 [pixels], frame rate: 4.29 [fps]
– Size: 1000 × 650 [pixels], frame rate: 30 [fps]
A high resolution (4000 × 2600 [pixels]) video with a high frame rate (30 [fps]) was synthesized from these two sequences using our algorithm. Figure 6(a) shows an example of the synthesized frames; an enlarged image of Fig. 6(a) is shown in Fig. 6(b).
Fig. 6. Synthesized high resolution image from real images. (a) Synthesized frame, (b) close-up of (a), and (c) corresponding low resolution image.
Figure 6(c) shows the low resolution image that temporally corresponds to Fig. 6(a) and (b). We can observe sharper edges in Fig. 6(b), while the edges in Fig. 6(c) are blurred. This result shows that our method can also synthesize a high resolution video with a high frame rate from video sequences captured with the dual sensor camera.
6 Conclusion
In this paper we have proposed a novel algorithm for obtaining a high resolution video with a high frame rate from two video sequences with different spatio-temporal resolutions. The proposed algorithm synthesizes a high spatio-temporal resolution video using motion compensation and image fusion in the wavelet domain. We confirmed through experiments that the proposed method improves the resolution and frame rate of video sequences.
Acknowledgments
Part of this research was supported by the "Key Technology Research Promotion Program" of the National Institute of Information and Communications Technology.
References

1. Park, S.C., Park, M.K., Kang, M.G.: Super-resolution image reconstruction: A technical overview. IEEE Signal Processing Mag. 20 (2003) 21–36
2. Shekarforoush, H., Chellappa, R.: Data-driven multi-channel super-resolution with application to video sequences. J. Opt. Soc. Am. A 16 (1999) 481–492
3. Tom, B.C., Katsaggelos, A.K.: Resolution enhancement of monochrome and color video using motion compensation. IEEE Trans. Image Processing 10 (2001) 278–287
4. Choi, B.T., Lee, S.H., Ko, S.J.: New frame rate up-conversion using bi-directional motion estimation. IEEE Trans. Consumer Electron. 46 (2000) 603–609
5. Ha, T., Lee, S., Kim, J.: Motion compensated frame interpolation by new block-based motion estimation algorithm. IEEE Trans. Consumer Electron. 50 (2004) 752–759
6. Hoshikawa, A., Shigemoto, T., Nagahara, H., Iwai, Y., Yachida, M., Tanaka, H.: Dual sensor camera system with different spatio-temporal resolution. In: Proc. SICE Annual Conf. (2005)
7. Nagahara, H., Hoshikawa, A., Shigemoto, T., Iwai, Y., Yachida, M., Tanaka, H.: Dual-sensor camera for acquiring image sequences with different spatio-temporal resolution. In: Proc. IEEE Int. Conf. Advanced Video and Signal Based Surveillance (2005)
8. Shechtman, E., Caspi, Y., Irani, M.: Increasing space-time resolution in video. In: Proc. European Conf. Computer Vision (2002) 753–768
9. Shechtman, E., Caspi, Y., Irani, M.: Space-time super-resolution. IEEE Trans. Pattern Analysis and Machine Intelligence 27 (2005) 531–545
10. Matsunobu, T., Nagahara, H., Iwai, Y., Yachida, M., Tanaka, H.: Generation of high resolution video using morphing. In: Proc. SICE Annual Conf. (2005)
11. Watanabe, K., Iwai, Y., Nagahara, H., Yachida, M., Tanaka, H.: Video synthesis with high spatio-temporal resolution using motion compensation and spectral fusion. In: Proc. SICE Annual Conf. (2005)
12. Shensa, M.J.: The discrete wavelet transform: Wedding the à trous and Mallat algorithms. IEEE Trans. Signal Processing 40 (1992) 2464–2482
13. Park, H.W., Kim, H.S.: Motion estimation using low-band-shift method for wavelet-based moving-picture coding. IEEE Trans. Image Processing 9 (2000) 577–587
14. Girod, B.: Motion-compensating prediction with fractional-pel accuracy. IEEE Trans. Communications 41 (1993) 604–612
15. Daubechies, I.: Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics (1992)
16. Zandi, A., Allen, J.D., Schwartz, E.L., Boliek, M.: CREW: Compression with reversible embedded wavelets. In: Proc. IEEE Data Compression Conf. (1995) 212–221