A Fast Algorithm of Video Super-Resolution Using Dimensionality Reduction by DCT and Example Selection

Kiyotaka Watanabe†,‡   Yoshio Iwai‡   Tetsuji Haga†   Masahiko Yachida§
† Mitsubishi Electric Corporation, Amagasaki, Japan
‡ Osaka University, Toyonaka, Japan
§ Osaka Institute of Technology, Hirakata, Japan
[email protected]
Abstract

In this paper, we propose a novel learning-based video super-resolution algorithm with reduced memory requirements and computational cost. To this end, we adopt discrete cosine transform (DCT) coefficients as feature vector components. Moreover, we design an example selection procedure to construct a compact database. We conducted evaluative experiments using MPEG test sequences to synthesize high resolution video. Experimental results show that our method improves the efficiency of the super-resolution algorithm while preserving the quality of the synthesized images.
1. Introduction

Super-resolution (SR) refers to the problem of synthesizing a high resolution (HR) image from one or more low resolution (LR) images. SR methods are roughly classified into reconstruction-based methods and learning-based methods. While the former approach has been studied extensively for a long time, we focus on the latter approach in this paper. Learning-based SR algorithms extract a relationship between HR images and their corresponding LR ones, which are used as training data. The algorithms construct a large database of examples from the training data set. The resolution of input LR images can then be enhanced by adding high frequency detail found via a nearest neighbor search in the database. Baker and Kanade [2] gave fundamental limits on reconstruction-based SR, and demonstrated that facial images and text images can be super-resolved by using a large number of examples. Freeman et al. [5] showed that the idea of learning-based SR can be applied to general images.
Figure 1. Concept of dual sensor camera: a beam splitter divides the incoming scene between a high resolution, low frame rate camera and a low resolution, high frame rate camera, which are synchronized by a pulse generator.
In addition, several learning-based SR algorithms have been proposed for still images [10, 4] and for video sequences [3, 6].

We have proposed a dual sensor camera [8], shown in Fig. 1, which can simultaneously capture two video sequences with the same field of view: an HR sequence with low frame rate and an LR sequence with high frame rate. We assume that synchronized frames of the HR and LR sequences, which we call "key frames," can be obtained. We have also developed algorithms to synthesize an HR video with high frame rate from these two video sequences [7, 11]. In this paper, we propose a novel algorithm based on the learning-based SR approach to solve this problem.

Learning-based SR algorithms face two major problems: they require a large amount of memory to store examples, and finding nearest neighbors in the database is computationally expensive. In order to alleviate these problems, it is helpful to adopt the following strategies.

1. Adopt a fast search algorithm.
2. Reduce the dimensionality of the examples stored in the database.
3. Reduce the number of examples stored in the database.
Regarding strategy 1, Freeman et al. [5] used a fast search algorithm [9] to find multi-dimensional nearest neighbors. We use the approximate nearest neighbor (ANN) search algorithm [1], which finds nearest neighbors quickly in exchange for allowing a small approximation error. Strategies 2 and 3 are effective because both the time and space complexity of a learning-based SR algorithm depend on the dimensionality and the number of examples stored in the database. Regarding strategy 2, Bishop et al. [3] used principal component analysis (PCA) to reduce the dimension of the feature vector from 147 to 20. However, PCA itself is computationally expensive and thus unsuitable for video SR, which must process a large amount of data. In our method, the discrete cosine transform (DCT) is applied to each block, and the low frequency DCT coefficients are extracted to compose a feature vector. Strategy 3 has not been considered in previous studies. We propose an algorithm that selects the examples most effective for synthesizing a higher quality video under the condition that the number of examples stored in the database is limited. We introduce the proposed algorithm in the following section. Experimental results using MPEG test sequences are shown in Section 3. We conclude this paper in Section 4.
2. Proposed algorithm

We extend learning-based SR methods for still images [10, 4] to video sequences, and adopt DCT coefficients as feature vector components. Our method constructs an example database using key frames as training data, and conducts SR for the LR frames other than key frames. In our method, the SR procedure is conducted on the luminance component only, since human vision is relatively insensitive to spatial variations in color. The chrominance components of synthesized frames are interpolated from the LR frames by bi-cubic spline interpolation in order to reduce computational cost.
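To make the feature representation concrete before detailing the algorithm, the following is a minimal Python sketch of composing a d-dimensional feature vector from the low frequency AC coefficients of a block DCT in zig-zag scan order. The function names are our own, and scipy's DCT stands in for whichever implementation the authors used.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(K):
    """Return the (row, col) positions of a K x K block in zig-zag scan order."""
    return sorted(((r, c) for r in range(K) for c in range(K)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def block_dct(block):
    """2-D DCT-II of a square image block."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def dct_feature(block, d):
    """Compose a d-dimensional feature vector from the first d AC coefficients
    (zig-zag order) of the block's 2-D DCT; the DC coefficient is skipped."""
    coeffs = block_dct(block)
    zz = zigzag_indices(block.shape[0])
    return np.array([coeffs[r, c] for r, c in zz[1:d + 1]])
```

With the parameters reported in Section 3 (K = 8, d = 20), this keeps 20 of the 63 AC coefficients of each 8 × 8 block.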
2.1. Construction of database using key frames

If we had a sufficiently large memory space, all the examples extracted from the training images could be stored. Here we assume that the memory space is not large enough to store all the examples, and we propose a procedure for selecting examples sequentially. We consider two sub-databases D1 and D2. Our method first constructs D1, which is composed of examples randomly selected from the key frame. Next, it constructs D2, which is composed of examples that have a large distance to their nearest neighbors in D1. Finally, the example database D = D1 ∪ D2 is constructed by integrating D1 and D2.

Let n be the upper limit of the number of examples that can be stored in D, and let |D| denote the number of examples stored in D. When a pair of an HR image IH and an LR image IL (a key frame) is given, the example database D is constructed according to the following procedure (a sketch of Steps 3 and 4 is given after the procedure).

1. Enlarge IL by bi-cubic spline interpolation to generate an image ĨH with the same size as IH.

2. Set D1 = D2 = ∅, and fix n1 and n2, the upper limits of the numbers of examples that can be stored in D1 and D2 respectively, where n = n1 + n2. Prepare a priority queue Q of length n2 and initialize Q = ∅.

3. (Random sampling of examples) Repeat the following steps until |D1| = n1.

(a) Extract a block B̃ of size K × K from ĨH randomly and without duplication.
(b) Calculate the contrast c = E[|x − μ|], where μ = E[x] is the average of the pixels x ∈ B̃.
(c) If c is smaller than a threshold θc, the following steps are not conducted for this block. Otherwise proceed to the next step.
(d) Conduct DCT on B̃ to obtain DCT coefficients C[B̃], and extract d AC coefficients of C[B̃] in zig-zag scan order to compose a d-dimensional vector v.
(e) Extract the corresponding block Bh of size K × K from IH and conduct DCT to obtain DCT coefficients C[Bh].
(f) Add the example (v, C[Bh]) to D1.

Here we define B1 as the set of blocks B̃ with c ≥ θc that have been added to D1.

4. (Example selection by nearest neighbor search) Conduct the following steps for the blocks of ĨH in raster-scan order.

(a) Extract a block B̃ of size K × K from ĨH, skipping the blocks in B1, i.e. B̃ ∉ B1.
(b) Conduct the same processes as Steps 3(b) to 3(e) to obtain an example (v, C[Bh]).
(c) Find the nearest neighbor of v in D1, denoted vNN, and measure the Euclidean distance dist(v, vNN).
(d) Add (v, C[Bh]) to Q, where an example with a larger distance is given higher priority; only the n2 highest-priority examples remain in Q.

5. Construct D2 from the examples stored in Q.

6. Construct D = D1 ∪ D2 by integrating the examples in D1 and D2.

D1, which is constructed by random sampling, is a data set that approximates the distribution of the examples in the feature space. Since we assume that key frames (used as training data) and super-resolved frames are temporally close, they are strongly correlated; therefore, at the synthesis step of HR frames, nearest neighbors can usually be found in D1 within a small distance. However, nearest neighbors may not be found within a small distance where there is motion in the scene. We introduce D2 in order to add examples that are spread out over the feature space, which ultimately enhances the generalization ability of the resultant database.
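As an illustration of Steps 3 and 4, here is a minimal sketch of the example selection, reusing block_dct and dct_feature from the sketch above. It simplifies the procedure in a few places: blocks are drawn from pre-extracted co-located block pairs, the remaining blocks are processed in arbitrary rather than raster order, and the nearest neighbor search is brute force instead of the ANN search [1] used in the paper.

```python
import heapq
import random
import numpy as np

def build_database(lr_blocks, hr_blocks, n1, n2, d, theta_c):
    """Construct D = D1 (random sampling) + D2 (largest NN distance).
    lr_blocks / hr_blocks are co-located K x K block pairs taken from the
    bi-cubic enlarged LR key frame and the HR key frame, respectively."""
    pairs = list(zip(lr_blocks, hr_blocks))
    random.shuffle(pairs)                      # random sampling without duplication

    D1, rest = [], []
    for lr, hr in pairs:
        contrast = np.mean(np.abs(lr - lr.mean()))   # c = E[|x - mu|]
        if contrast < theta_c:                 # skip low-contrast blocks
            continue
        example = (dct_feature(lr, d), block_dct(hr))
        if len(D1) < n1:                       # Step 3: fill D1 first
            D1.append(example)
        else:
            rest.append(example)

    # Step 4: keep the n2 examples farthest from their nearest neighbor in D1.
    heap = []                                  # min-heap of (distance, index)
    for idx, (v, _) in enumerate(rest):
        dist = min(np.linalg.norm(v - v1) for v1, _ in D1)
        heapq.heappush(heap, (dist, idx))
        if len(heap) > n2:                     # evict the smallest distance
            heapq.heappop(heap)
    D2 = [rest[idx] for _, idx in heap]
    return D1 + D2
```

The min-heap of size n2 plays the role of the priority queue Q: whenever it overflows, the example closest to D1 is evicted, so the survivors are exactly those spread farthest from D1 in the feature space.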
2.2. Synthesis of HR frames
An LR frame IL that is not a key frame is super-resolved according to the following procedure.

1. Enlarge IL by bi-cubic spline interpolation to generate an image ĨH with the same size as the target HR frame.

2. Extract blocks Bq of size K × K from ĨH, where each extracted block overlaps its neighboring blocks with overlap width w, and conduct the following steps for each block (a sketch of steps (c)-(f) is given after this procedure).

(a) Calculate the contrast cq of the block Bq. If cq < θc, SR is not conducted for this block and the values of ĨH are used for the target HR frame. Otherwise proceed to the next step.
(b) Conduct DCT on Bq to obtain DCT coefficients C[Bq]. Extract d AC coefficients of C[Bq] in zig-zag scan order to compose a d-dimensional vector vq.
(c) (k-nearest neighbor search) Find the k nearest neighbors v1, v2, ..., vk of vq in D.
(d) (Local neighbor embedding) Define the Gram matrix Gq as follows:

Gq = (vq 1^T − L)^T (vq 1^T − L),   (1)

where 1 is a vector of ones and L is a d × k matrix whose columns are v1, v2, ..., vk. Solve the linear system of equations Gq wq = 1 for wq and then normalize the weights so that Σ_{i=1}^{k} wqi = 1.
(e) Let C[Bih] (i = 1, 2, ..., k) be the DCT coefficients of the HR blocks that correspond to vi. The DC coefficients of all k DCT blocks C[Bih] are replaced with the DC coefficient of C[Bq] to correct for illumination change. We denote the resulting DCT blocks C̃[Bih].
(f) Calculate the linear combination of the C̃[Bih] weighted by wq:

C[Bth] = Σ_{i=1}^{k} wqi C̃[Bih].   (2)

(g) Conduct the inverse DCT on C[Bth] to transform the block back to the image space.

3. Construct the target HR frame by arranging the blocks obtained in the above step with overlap width w. The pixel values in overlapped regions are averaged to obtain a smooth image.
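The per-block synthesis of Steps 2(c)-(f) reduces to a k-NN query followed by a small linear solve. Below is a minimal numpy sketch under stated assumptions: scipy's cKDTree (whose eps parameter permits approximate queries) stands in for the ANN library [1], and the small ridge added to the Gram matrix is a common regularization in neighbor embedding, not a detail from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def synthesize_block_coeffs(v_q, dc_q, features, hr_coeff_blocks, k=2, eps=0.1):
    """Reconstruct the DCT coefficients of one HR block from its k nearest
    examples. features: (n, d) array of database feature vectors;
    hr_coeff_blocks: (n, K, K) array of the corresponding HR DCT blocks."""
    tree = cKDTree(features)                # in practice, build once per database
    _, idx = tree.query(v_q, k=k, eps=eps)  # approximate k-NN search (eps > 0)
    L = features[idx].T                     # d x k matrix of neighbor vectors

    # Local neighbor embedding: Gq = (vq 1^T - L)^T (vq 1^T - L), eq. (1)
    diff = v_q[:, None] - L                 # d x k
    G = diff.T @ diff
    G += 1e-8 * np.trace(G) * np.eye(k)     # small ridge (our regularization)
    w = np.linalg.solve(G, np.ones(k))      # solve Gq wq = 1
    w /= w.sum()                            # normalize so sum_i wqi = 1

    # Replace each neighbor's DC coefficient with the query block's DC value
    # to correct illumination change, then blend with the weights, eq. (2).
    blocks = hr_coeff_blocks[idx].copy()
    blocks[:, 0, 0] = dc_q
    return np.tensordot(w, blocks, axes=1)  # sum_i wqi * C~[Bih]
```

With the parameters used in Section 3 (k = 2), the embedding step solves only a 2 × 2 linear system per block, which keeps the synthesis fast.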
3. Experiments

We conducted evaluative experiments using MPEG test sequences. Table 1 lists the original MPEG test sequences used in the following experiments.

Table 1. Description of test sequences.
Sequence Name   Spatial Reso.   Frame No.
Coast guard     352 × 288       0 - 270
Football        352 × 240       1 - 121
Foreman         352 × 288       0 - 270
Hall monitor    352 × 288       0 - 270

Two input video sequences were made from each original sequence, as described below. An LR video sequence (M/4 × N/4 [pixels], 30 [fps]) was obtained by scaling the original sequence (M × N [pixels], 30 [fps]) down to 25%. An HR video with low frame rate (M × N [pixels], 30/7 [fps]) was obtained by picking every seventh frame of the original sequence. An HR video with high frame rate (M × N [pixels], 30 [fps]) was then synthesized by the proposed method and several conventional methods. Our method was implemented in Visual C++ 2005 and run on a Windows XP PC (CPU: Intel Pentium 4, 3.0 [GHz]; RAM: 512 [MB]). We set the parameters θc = 8.0, w = 4, k = 2, and K = 8.
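Although the authors' implementation is in Visual C++, the preparation of the two input sequences can be sketched in a few lines of Python; the choice of OpenCV's INTER_AREA filter for the 25% downscale is our assumption, since the paper does not specify the downsampling method.

```python
import cv2

def make_input_sequences(original_frames, scale=0.25, key_interval=7):
    """Derive the two inputs from an original M x N, 30 fps sequence:
    an LR high frame rate video (M/4 x N/4, 30 fps) and HR key frames
    (M x N, 30/7 fps) taken every seventh frame."""
    lr_video = [cv2.resize(f, None, fx=scale, fy=scale,
                           interpolation=cv2.INTER_AREA)  # filter: our choice
                for f in original_frames]
    hr_key_frames = original_frames[::key_interval]
    return lr_video, hr_key_frames
```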
Table 2. PSNR results [dB].
Sequence Name   Bi-cubic Spline   Proposed Method   DCT Fusion [11]
Coast guard     22.05             24.72             24.59
Football        20.77             22.23             20.68
Foreman         26.26             29.93             27.41
Hall monitor    23.06             29.57             30.54

Figure 2. Test sequence "Foreman," 45th frame: (a) original frame, (b) bi-cubic spline, (c) DCT fusion [11], (d) proposed method; (e)-(h) close-ups of (a)-(d).
Figure 2 shows an original frame (a)(e), an LR frame enlarged by bi-cubic spline interpolation (b)(f), and a frame synthesized by the DCT spectral fusion method [11] (c)(g). Figure 2 (d)(h) shows a frame synthesized by the proposed method, where the example selection process (Step 4 in Section 2.1) was not conducted and all the examples were stored in D; here we set d = 20. We can see that the proposed method synthesizes images of higher quality. Table 2 shows the simulation results for each test sequence, comparing the peak signal to noise ratio (PSNR) between the synthesized and original images. The proposed method gives the best PSNR for all test sequences except "Hall monitor." "Hall monitor" was captured by a static camera, whereas the other sequences contain
Table 3. Effects of database allocation (n = 10000, d = 20; PSNR [dB]).
n1      n2      Foreman   Hall monitor
2000    8000    29.56     28.03
4000    6000    29.59     28.21
6000    4000    29.57     28.24
8000    2000    29.54     28.07
10000   0       29.41     27.69
many dynamic regions. Since the DCT spectral fusion method [11] synthesizes an HR video using motion estimation, it is hard for it to obtain good results on sequences with dynamic regions because of motion estimation errors. The proposed method, in contrast, synthesizes an HR video without using motion information, so it gives the best results for these sequences.

We examined the effects of database allocation in order to confirm the effectiveness of the example selection algorithm. We set d = 20 and n = 10000, and measured PSNR while varying n1 and n2. Table 3 shows the PSNR results for various database allocations. Compared with the simple strategy in which all examples stored in the database are randomly selected (n1 = 10000, n2 = 0), our example selection method gives better results. In particular, it is preferable to set n1 and n2 nearly equal.

We then evaluated the performance on the test sequence "Foreman" while varying d and n1, n2, setting n1 = n2 based on the result above. We compared PSNR, the average time to synthesize one HR frame, and the memory required to store the examples in the database. We assume that storing one double precision real number requires 8 bytes; since an example pair (v, C[Bh]) is composed of d + K² real numbers, the total memory required to store n examples is n(d + K²) × 8/1024 [kB]. Table 4 shows the comparative results for different dimensions and database sizes, where the row "All" denotes the case in which all examples are stored (here we take n = 100000). As d and/or n1 = n2 decrease, PSNR also decreases, but the processing time and memory requirements are reduced as well.
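As a quick sanity check of this memory formula, the following hypothetical helper reproduces the memory column of Table 4:

```python
def example_db_memory_kb(n, d, K=8):
    """Memory [kB] for n examples of d AC coefficients plus a K x K HR
    DCT block, at 8 bytes per double precision value."""
    return n * (d + K * K) * 8 / 1024

print(example_db_memory_kb(100000, 10))  # 57812.5 -> reported as 57813 (d = 10, "All")
print(example_db_memory_kb(10000, 5))    # 5390.625 -> 5391 (d = 5, n1 = n2 = 5000)
```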
Table 4. Performance comparison with different dimension of vector and database size.

                   d = 5                     d = 10                    d = 15
n1 = n2   PSNR   Time    Memory    PSNR   Time    Memory    PSNR   Time    Memory
          [dB]   [msec]  [kB]      [dB]   [msec]  [kB]      [dB]   [msec]  [kB]
1000      28.45  39      1078      28.67  43      1156      28.70  46      1234
2000      28.75  42      2156      29.04  49      2312      29.08  52      2469
5000      29.12  46      5391      29.54  61      5781      29.58  66      6172
10000     29.35  49      10781     29.82  73      11563     29.85  82      12344
All       29.40  56      53906     29.88  97      57813     29.92  114     61719
Figure 3. Synthesized images using dimensionality reduction and example selection: (a) d = 5, n1 = n2 = 1000; (b) d = 10, n1 = n2 = 5000; with close-ups of (a) and (b).
That is, we can control the quality of the synthesized images, the computational cost, and the memory requirements by selecting d and n1, n2. Figure 3 shows images synthesized using dimensionality reduction and example selection. Compared with the image shown in Fig. 2(d), which was synthesized without dimensionality reduction or example selection, the results shown in Fig. 3 exhibit only slight degradation. Therefore, under the condition of limited memory space, the proposed algorithm can synthesize an HR video quickly without substantially deteriorating the quality of the synthesized images.
4. Conclusion

In this paper we have proposed a novel method to synthesize an HR video sequence based on learning-based SR. Our algorithm introduces dimensionality reduction by DCT and an example selection procedure in order to reduce the cost of the search procedure and the memory requirements.

Many of the latest digital cameras and cellular phones can capture both still images and video sequences by switching shooting modes. In future work we plan to apply our method to these consumer devices by modifying the problem formulation considered in this paper.
References

[1] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891-923, 1998.
[2] S. Baker and T. Kanade. Limits on super-resolution and how to break them. IEEE Trans. PAMI, 24(9):1167-1183, 2002.
[3] C. M. Bishop, A. Blake, and B. Marthi. Super-resolution enhancement of video. In Proc. Ninth Intl. Conf. Artificial Intelligence & Statistics, 2003.
[4] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In Proc. CVPR, 2004.
[5] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. IJCV, 40(1):25-47, 2000.
[6] D. Kong, M. Han, W. Xu, H. Tao, and Y. Gong. A conditional random field model for video super-resolution. In Proc. ICPR, 2006.
[7] T. Matsunobu, H. Nagahara, Y. Iwai, M. Yachida, and H. Tanaka. Generation of high resolution video using morphing. In Proc. SICE Annual Conf., 2005.
[8] H. Nagahara, A. Hoshikawa, T. Shigemoto, Y. Iwai, M. Yachida, and H. Tanaka. Dual-sensor camera for acquiring image sequences with different spatio-temporal resolution. In Proc. IEEE Int. Conf. Advanced Video and Signal Based Surveillance, Sep. 2005.
[9] S. A. Nene and S. K. Nayar. A simple algorithm for nearest neighbor search in high dimensions. IEEE Trans. PAMI, 19(9):989-1003, 1997.
[10] J. Sun, N.-N. Zheng, H. Tao, and H.-Y. Shum. Image hallucination with primal sketch priors. In Proc. CVPR, 2003.
[11] K. Watanabe, Y. Iwai, H. Nagahara, M. Yachida, and T. Suzuki. Video synthesis with high spatio-temporal resolution using motion compensation and spectral fusion. IEICE Trans. Inf. & Syst., E89-D(7):2186-2196, 2006.