2010 International Conference on Pattern Recognition
A Dual Pass Video Stabilization System Using Iterative Motion Estimation and Adaptive Motion Smoothing

Pan Pan1, Akihiro Minagawa2, Jun Sun1, Yoshinobu Hotta2, Satoshi Naoi1
1 Fujitsu R&D Center Co., Ltd., Beijing, China
2 Fujitsu Laboratories Ltd., Kawasaki, Japan
{ppan, sunjun, naoi}@cn.fujitsu.com; {minagawa.a, y.hotta}@jp.fujitsu.com
Abstract—In this paper, we propose a novel dual pass video stabilization system using iterative motion estimation and adaptive motion smoothing. In the first pass, the transformation matrix that stabilizes each frame is computed. The global motion estimation is carried out by a novel iterative method, and the intentional motion is estimated using adaptive window smoothing. Before the second pass begins, we obtain the optimal trim size for a specific video based on the statistics of the transformation parameters. In the second pass, the stabilized video is composed according to the optimal trim size. Experimental results show the superior performance of the proposed method in comparison with existing methods.

Keywords: video stabilization; dual pass; motion estimation; motion smoothing

I. INTRODUCTION

In the past few years, video stabilization, which aims to eliminate unwanted camera motion, has become increasingly popular in the consumer market. A video stabilization system usually contains three components: global motion estimation, intentional motion estimation, and image composition [1].

Global motion estimation between adjacent frames is an essential step of video stabilization. Besides registration-based approaches [1], point-matching approaches have been used in the video stabilization framework. Block matching [2], optical flow [3], SIFT point tracking [4], and the KLT tracker [5] have been applied to obtain a set of matching pairs. Least-squares estimation [4], RANSAC, particle filters [6], and motion vector filtering [7] have been utilized to estimate the transformation matrix from those matching pairs. Poor image quality, crowded scenes, and objects moving differently from the camera can all make global motion estimation difficult.

The intentional motion is the movement caused by human intention, e.g. camera panning, and is usually estimated from the global motion. Kalman filtering [1], adaptive Kalman filtering [5], and motion vector integration [4] have been proposed to estimate the intentional motion.

Since correcting an image frame leaves undefined black regions that degrade visual quality [1], the black regions have to be eliminated. This is usually solved either by trimming and expanding the remaining image portion, or by constructing image mosaics using information from neighboring frames [1]. For fast processing and robustness, trimming and expanding are usually used.

In the conventional approach, motion estimation and image composition are processed in the same single pass. A tradeoff then exists between the amount of undefined regions and the degree of motion smoothness. A common way to deal with this is to reduce the degree of smoothness so that the undefined regions stay within a bound [3].

In this paper, we propose a novel dual pass video stabilization system (Section II). In the first pass, the transformation matrix that stabilizes each frame is computed. Before the second pass begins, we obtain the optimal trim size for the specific video based on the statistics of the transformation parameters. In the second pass, the stabilized video is composed according to the optimal trim size. In Section III, we further propose a novel iterative global motion estimation method. Section IV explains how the intentional motion is estimated using adaptive window smoothing driven by local motion changes. Experimental results in Section V show that our method outperforms existing methods.
II. THE DUAL PASS VIDEO STABILIZATION SYSTEM

Let us use $x_n$ to denote a pixel location in frame $n$, where $x = (x, y, 1)^T$. $H_n$ is the $3 \times 3$ transformation matrix from frame $n-1$ to frame $n$, i.e. $x_n = H_n x_{n-1}$. $H^{in}_n$ is the intentional motion from frame $n-1$ to $n$. We further denote the cumulative global motion and cumulative intentional motion as $CH_n$ and $CH^{in}_n$, where $CH_n = \prod_{k=1}^{n} H_k$ and $CH^{in}_n = \prod_{k=1}^{n} H^{in}_k$. Given the pixel location $x_n$ in the original frame, video stabilization is to obtain the stabilized location $\bar{x}_n$ without unwanted camera motion. Therefore, we have

$$x_n = CH_n x_1, \qquad (1)$$

$$\bar{x}_n = CH^{in}_n x_1. \qquad (2)$$

Combining the above two equations, we obtain

$$\bar{x}_n = CH^{in}_n [CH_n]^{-1} x_n = \bar{H}_n x_n, \qquad (3)$$

where $\bar{H}_n$ is the transformation matrix that converts the original frame location $x_n$ to the desired $\bar{x}_n$.
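To make (1)-(3) concrete, here is a minimal NumPy sketch that accumulates the interframe matrices and returns the per-frame correction matrices of (3). It assumes the interframe global and intentional matrices are already available as lists of 3x3 arrays; the function and variable names are ours, not from the paper.

```python
import numpy as np

def stabilizing_transforms(H_list, Hin_list):
    """Return the correction matrices H_bar_n = CHin_n [CH_n]^{-1} of Eq. (3).

    H_list   : interframe global motion matrices H_n (3x3 each)
    Hin_list : interframe intentional motion matrices Hin_n (3x3 each)
    """
    CH = np.eye(3)      # cumulative global motion CH_n
    CHin = np.eye(3)    # cumulative intentional motion CHin_n
    H_bar = []
    for H, Hin in zip(H_list, Hin_list):
        CH = H @ CH         # left-multiply: CH_n = H_n CH_{n-1}
        CHin = Hin @ CHin   # CHin_n = Hin_n CHin_{n-1}
        H_bar.append(CHin @ np.linalg.inv(CH))
    return H_bar
```

In the second pass, each frame $n$ would be warped by `H_bar[n]` before trimming and expanding.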
Figure 1. The flow chart of (a) the first pass; (b) optimal trim size determination; (c) the second pass.

In the conventional video stabilization system, motion estimation and image composition are processed in the same single round. In this case, however, a tradeoff exists between the amount of undefined regions and the degree of motion smoothness [3]. A common approach to deal with the tradeoff is to sacrifice the degree of smoothness in order to guarantee that the undefined regions stay below a predefined threshold. For example, in [3], the smoothing parameter is reduced whenever the stabilized frames have larger black regions than the predefined threshold, until the black regions fall below the bound.

In the proposed dual pass video stabilization system, we consider the degree of smoothness and the optimal trim size separately in two passes. The flow chart of the proposed system is shown in Fig. 1. We treat the degree of smoothness as the first priority: in the first pass, the transformation matrix that stabilizes each frame is computed (Fig. 1(a)). Before the second pass begins, we obtain the optimal trim size for the specific video based on the statistics of the transformation parameters (Fig. 1(b)). In the second pass, after each image is transformed to its stabilized version, we trim the warped image according to the optimal trim size and then expand the remaining portion to the original resolution (Fig. 1(c)).

The proposed dual pass system does not confine the motion models, nor the algorithms of each module. In this paper, we consider translational jitter to be the major cause of quality loss. Therefore the $H$ matrix has two variables, $H = [1, 0, \delta_x; 0, 1, \delta_y; 0, 0, 1]$, where $\delta_x$ and $\delta_y$ are the shift values in the horizontal and vertical directions respectively.

Given the transformation matrices of all frames, we want to find the optimal trim size before the second pass begins. We consider two criteria. First, black regions should appear for at most a very limited period of time. Second, the output frames should contain as much of the original information as possible. We therefore obtain the optimal trim size from the statistics of the parameter frequencies. For each of $\delta_x$ and $\delta_y$, we build a one-dimensional histogram, where the x-axis is the bin index determined by the value of the parameter and the y-axis is the number of times that value appears in the video. The $m$-bin histogram $h = \{h_1, h_2, \cdots, h_m\}$ is normalized to the range $[0, 1]$, with $\sum_{i=1}^{m} h_i = 1$. The optimal value $\delta_{opt}$ is chosen such that

$$\sum_{i=1}^{\chi(\delta_{opt})} h_i > 1 - \gamma, \qquad (4)$$

where $\chi(\delta_{opt})$ is the bin index and $\delta_{opt}$ is the midpoint of the shifts that correspond to bin $\chi(\delta_{opt})$. $\gamma$ is a very small value, e.g. $\gamma = 0.01$, which means the parameter is chosen such that at least 99% of the output video has no black regions. After finding $\delta_{x,opt}$ and $\delta_{y,opt}$, these values are adjusted so that the remaining image portion has the same aspect ratio as the original frame.
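As a sketch of how (4) could be implemented, the following assumes each of $\delta_x$ and $\delta_y$ is summarized by the magnitudes of the per-frame shifts; the bin count and all names are illustrative choices, not from the paper.

```python
import numpy as np

def optimal_trim(shifts, num_bins=50, gamma=0.01):
    """Pick the smallest trim size covering at least (1 - gamma) of the
    per-frame shift magnitudes, following Eq. (4).

    shifts : 1-D array of per-frame stabilizing shifts (delta_x or delta_y)
    """
    mags = np.abs(np.asarray(shifts, dtype=float))
    counts, edges = np.histogram(mags, bins=num_bins)
    h = counts / counts.sum()          # normalized histogram, sums to 1
    cum = np.cumsum(h)
    # smallest bin index k with cum[k] > 1 - gamma
    k = np.searchsorted(cum, 1.0 - gamma, side="right")
    return 0.5 * (edges[k] + edges[k + 1])   # midpoint of bin chi(delta_opt)
```

The trims found for $\delta_x$ and $\delta_y$ are then adjusted jointly so that the retained region keeps the original aspect ratio, as described above.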
III. ITERATIVE GLOBAL MOTION ESTIMATION

The accuracy and robustness of the global motion estimation between adjacent frames are critical, since a single incorrect estimate yields a visible jitter in the output video. To overcome the difficulties that make global motion estimation hard, e.g. objects moving differently from the camera, we propose an iterative motion estimation method.

The input images are first deinterlaced to remove interlacing artifacts. After ignoring the boundary regions, we uniformly select blocks and estimate their block motion. The block size should be neither too large nor too small, so as to describe the motion of a local region precisely while avoiding noise. One way to estimate the motion of each block is block matching based on the sum of squared differences criterion.

We now have a set of $n$ motion vectors $M_i = (x_i, y_i)^T$, $i = 1, \cdots, n$, and want to estimate the global motion between adjacent frames. The problem becomes how to fuse these motion vectors into an accurate global motion. We assign a weight $w_i$ to each block vector, and the global motion $M_g$ is estimated as the weighted average of all local motion vectors, i.e. $M_g = \sum_{i=1}^{n} \tilde{w}_i M_i$, where $\tilde{w}_i$ is the normalized weight. The weight of each block vector is given by a Gaussian function of the difference between its value and the true global motion vector,

$$w_i \propto \exp\left(-\frac{(M_i - M_g)^T (M_i - M_g)}{\sigma^2}\right) = \exp\left(-\frac{(x_i - x_g)^2 + (y_i - y_g)^2}{\sigma^2}\right). \qquad (5)$$

Since the true global motion in (5) is unknown, we use an iterative method: initially, the average motion of all block motion vectors serves as a guess of the global motion, and the estimate is then refined iteratively. The detailed algorithm is given in Table I.
Table I
ITERATIVE GLOBAL MOTION ESTIMATION

- Given $n$ block motion vectors $M_i$
- $M_g^{(0)} = \frac{1}{n} \sum_{i=1}^{n} M_i$
- For $r = 1 : Iter$
  - $w_i^{(r)} \propto \exp\left(-\frac{(M_i - M_g^{(r-1)})^T (M_i - M_g^{(r-1)})}{\sigma^2}\right)$
  - $\tilde{w}_i^{(r)} = w_i^{(r)} / \sum_{i=1}^{n} w_i^{(r)}$
  - $M_g^{(r)} = \sum_{i=1}^{n} \tilde{w}_i^{(r)} M_i$
- End for $r$
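The Table I loop is only a few lines in NumPy. The sketch below assumes the block motion vectors are stacked in an (n, 2) array and that $\sigma$ and the iteration count are tuning parameters; names are ours.

```python
import numpy as np

def iterative_global_motion(M, sigma=4.0, iters=10):
    """Iteratively re-weighted estimate of the global motion (Table I).

    M : (n, 2) array of block motion vectors M_i = (x_i, y_i)
    """
    Mg = M.mean(axis=0)                     # M_g^(0): plain average
    for _ in range(iters):
        d2 = np.sum((M - Mg) ** 2, axis=1)  # squared distance to current M_g
        w = np.exp(-d2 / sigma ** 2)        # Gaussian weights, Eq. (5)
        w /= w.sum()                        # normalized weights w~_i
        Mg = w @ M                          # M_g^(r): weighted average
    return Mg
```

Blocks whose motion disagrees with the consensus, e.g. on independently moving objects, receive exponentially smaller weights at each pass.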
If we view each block motion vector as a sample of the global motion vector, the proposed method becomes a mean-shift algorithm. The proposed iterative method is also similar in spirit to the local directional smoothing in [8], with the following differences. First, the algorithm in [8] smooths the motions of one region and its neighboring regions to estimate that region's motion. Second, the weight function is different.

IV. ADAPTIVE MOTION SMOOTHING

At the motion smoothing step, the intentional motion is estimated from the global motion. The most commonly used method is to estimate the intentional motion with a Kalman filter [1], where the global interframe camera motion is treated as a noisy observation of the intentional interframe camera motion. Under the framework of our dual pass system, finding the transformation parameters that remove camera jitter happens in a different pass from the image composition step, so some latency in obtaining the transformation parameters is acceptable. We estimate the intentional motion as the window-smoothed output of the global motion. The window size reflects the degree of smoothness: a larger window produces a smoother motion trajectory, but may drift away from the real human intention and cause large undefined regions. Moreover, window smoothing flattens large peaks in the motion trajectory. Since the number of direction changes of the interframe motion reflects the degree of camera shake, we propose an adaptive window smoothing algorithm in which the window size is adjusted automatically according to the number of direction changes of the interframe motion. In particular, we want to reduce the window size when there are very few direction changes within the default window. Therefore,

$$C\delta^{in}_{b,t} = \sum_{k=t-s}^{t+s} u_k\, C\delta_{b,k}, \qquad s = \begin{cases} P_1 & R < T \\ P & \text{else} \end{cases} \qquad (6)$$

where $C\delta^{in}_b$ and $C\delta_b$ are the cumulative intentional shift and the cumulative global shift respectively, and $b = x, y$. The window size is $2s + 1$, and $s$ is preset to $P$. $R$ is the number of direction changes of the interframe motion within the default window of size $2P + 1$. If $R$ is less than a threshold $T$, e.g. $T = 0.2P$, we reduce $s$ to $P_1$, where $P_1 < P$. $u_k$ is the weight of $C\delta_{b,k}$; usually we set $u_k = 1/(2s + 1)$, i.e. a simple moving average. At the beginning and the end of the video, where there may not be $2s + 1$ frames available, the above equation is modified accordingly so that all frames remain valid.
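A sketch of (6) for one cumulative shift trajectory follows; the boundary handling and the values of P, P1, and T are illustrative assumptions, and the names are ours.

```python
import numpy as np

def adaptive_smooth(C, P=15, P1=5, T_frac=0.2):
    """Adaptive window smoothing of a cumulative shift trajectory (Eq. 6).

    C : 1-D array of cumulative global shifts C_delta_{b,t}, b = x or y
    """
    C = np.asarray(C, dtype=float)
    n = len(C)
    out = np.empty(n)
    d = np.diff(C)                               # interframe motion
    for t in range(n):
        lo, hi = max(t - P, 0), min(t + P, n - 1)
        # count direction changes of interframe motion in the default window
        seg = np.sign(d[lo:hi])
        seg = seg[seg != 0]
        R = np.count_nonzero(seg[1:] != seg[:-1])
        s = P1 if R < T_frac * P else P          # shrink window if shake is low
        lo, hi = max(t - s, 0), min(t + s, n - 1)
        out[t] = C[lo:hi + 1].mean()             # u_k = 1/(2s+1): simple average
    return out
```

The smoothed trajectories for $b = x, y$ give the cumulative intentional shifts, from which the intentional interframe matrices used in Section II can be recovered by differencing.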
V. EXPERIMENTAL RESULTS

To demonstrate the performance of the proposed system, we conduct experiments on several videos captured by a hand-held digital camcorder. The resolution of all videos is 720x480. We first compare our iterative motion estimation algorithm with least-squares motion estimation [4] given the same set of input motion vectors. In the video Dance, the dancers move differently from the camera, e.g. waving their hands and turning around, which makes global motion estimation difficult. Figure 2 shows the results of compensating the global motion only, without considering intentional motion. It can be seen that the proposed iterative motion estimation is more accurate than the least-squares method.

We further compare our system with the single pass system that sacrifices the degree of smoothness to guarantee that the undefined regions stay within a preset bound [3], on two different videos, Sing and Beach. Sing contains a great deal of unwanted camera shake, while Beach has very little jitter. Figure 3 shows the cumulative motion trajectories of the original videos and of the videos stabilized by the single pass system and by the proposed system. The final trim sizes of the two methods are shown in Fig. 4. For Sing, our method produces a smoother motion trajectory than the single pass system and has a smaller trim size. For Beach, the degree of smoothness of the two methods is about the same, while our method preserves more of the original frame. Compared with the single pass system, our method maintains the degree of smoothness while determining the optimal trim size.

Compared with the single pass system, our dual pass system is an off-line algorithm that spends extra CPU time on trim size determination and on decoding in the second pass. However, this extra time is only 3.95 ms/frame, a small fraction of the 54.12 ms/frame required for the other operations also needed in the single pass system. Times are measured on a 2.4 GHz Intel Core Duo computer with 720x480 MPEG-2 input, without code optimization.

VI. CONCLUSION

In this paper, a dual pass video stabilization system using iterative motion estimation and adaptive motion smoothing is proposed. The degree of smoothness is maintained, and the optimal trim size is chosen for image composition. The global motion between adjacent frames is estimated using an iterative method. Moreover, an adaptive window smoothing algorithm is applied to estimate the intentional motion. Experimental results show the better performance of our approach compared with other methods. The proposed system provides not only accurate global motion estimation, but also a smooth output video with an optimal trim size. As future work, we plan to employ more sophisticated motion models and more advanced features in the dual pass system.
Figure 2. Results of compensating the global motion only on the Dance sequence: least-squares estimation (first row), proposed iterative estimation (second row).
Figure 3. Cumulative motion of Sing (left) and Beach (right) sequences.
Figure 4. Trim size on Sing (left) and Beach (right) sequences. Cyan stands for our dual pass system (expansion ratio 1.163 (left), 1.034 (right)). Red represents the single pass system (expansion ratio 1.089 (both left and right)).

REFERENCES

[1] A. Litvin, J. Konrad, and W. Karl, "Probabilistic video stabilization using Kalman filtering and mosaicking," in IS&T/SPIE Symposium on Electronic Imaging, Image and Video Communication and Processing, 2003.
[2] F. Vella, A. Castorina, M. Mancuso, and G. Messina, "Digital image stabilization by adaptive block motion vectors filtering," IEEE Transactions on Consumer Electronics, vol. 48, no. 3, 2002.
[3] H.-C. Chang, S.-H. Lai, and K.-R. Lu, "A robust real-time video stabilization algorithm," Journal of Visual Communication and Image Representation, vol. 17, no. 3, 2006.
[4] S. Battiato, G. Gallo, G. Puglisi, and S. Scellato, "SIFT features tracking for video stabilization," in International Conference on Image Analysis and Processing, 2007.
[5] C. Wang, J.-H. Kim, K.-Y. Byun, J. Ni, and S.-J. Ko, "Robust digital image stabilization using the Kalman filter," IEEE Transactions on Consumer Electronics, vol. 55, no. 1, 2009.
[6] J. Yang, D. Schonfeld, and M. Mohamed, "Robust video stabilization based on particle filtering tracking of projected camera motion," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 7, 2009.
[7] S. Battiato, G. Puglisi, and A. R. Bruna, "A robust video stabilization system by adaptive motion vectors filtering," in IEEE International Conference on Multimedia and Expo, 2008.
[8] A. Calway, S. Kruger, and D. Tweed, "Motion estimation using adaptive correlation and local directional smoothing," in International Conference on Image Processing, 1998.