Proceedings of the 4th International Conference on Robotics and Mechatronics October 26-28, 2016, Tehran, Iran
Real-time Video Stabilization and Mosaicking for Monitoring and Surveillance

Ali Jahani Amiri1, Hadi Moradi1,2
1 School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
2 Intelligent Systems Research Institute, SKKU, South Korea
Abstract— Video stabilization and mosaicking are important tasks when a video stream of a large area, such as the video stream from a UAV or a blimp, is received. In this paper we propose a novel approach for video stabilization and mosaicking from shaky video streams. We present a feature-based real-time video mosaicking pipeline that performs image alignment by combining a feature point detector, a descriptor, a robust statistical selection method (RANSAC), and other filters. By using this method, we can eliminate the mechanical stabilizer. The stabilization is done using the SURF feature detector, the BRISK descriptor, and an affine transformation matrix. To reach real-time performance, we used the C++ OpenCV libraries for the implementation and compared most feature detectors and descriptors by their time consumption to find out which is best suited for a real-time approach. Finally, the approach has been tested on real video streams and showed good performance.

Keywords— Real-time video mosaic, video stabilization, UAV video mosaic, real-time photomosaic from video, video mosaicking.
Figure 1: A mechanical camera gimbal made by GoPro.
I. INTRODUCTION
Video stabilization and mosaicking are crucial for applications such as monitoring and surveillance in which a view of the full area is needed. Imagine a UAV flying over a disaster zone and collecting images as it flies by: the captured images can be put together to create a wide view of the area to be analyzed by the ground team. In the case of a camera facing down from a balloon or a blimp, the wind causes the camera to swing and capture a large area even though the camera does not translate much. Thus, the images need to be correctly transformed and stitched together to create a usable image for the ground team.
Consider a downward-facing camera hanging from a balloon or blimp used for monitoring and surveillance. The video output of such a camera can be very shaky, and most UAV operators use mechanical stabilizers with motors and sensors, such as the gimbal shown in Fig. 1 [7, 8, 9]. Another approach is to crop the video in order to stabilize it [5, 6]. Although the price of gimbals is dropping, they are still heavy for platforms with limited payload. Furthermore, gimbals need power to operate, which is limited on UAVs.
Image stitching and video mosaicking are usually done with one of two methods: (i) pixel-based, where images are aligned directly on pixel intensities, and (ii) feature-based, where frames are aligned using features extracted from each frame. Although it has been suggested [12] that feature-based methods are best suited to stitching widely separated images, most recent work on video mosaicking is feature-based [3, 4].
Another issue with using a gimbal is that it does not work well when the camera moves widely, since it tries to keep the image focused on the given Field Of View (FOV) of the camera. Thus, in such cases, in which the camera swings a lot, many captured images would not be useful unless they are stitched [14] or mosaicked [3, 4, 11, 12, 13] together. In other words, video mosaicking or stitching can be used to put these images together and provide a wide-angle view for the operator to use whenever it is needed. Although image stitching and video mosaicking look similar, there is a small difference. In image stitching, there are small overlaps between the images upon which the images are put together. In contrast, image mosaicking involves large overlaps and many images to be put together, which makes it harder for mosaicking to reach real-time performance. The large overlap arises because the camera movement between two consecutive video frames is very small, and hence the baseline is very small.
It should be mentioned that video mosaicking is a harder problem than image stitching because of two significant issues: (i) the accumulated error associated with long-term sequential image stitching [13, 15, 16], and (ii) the frame management required because of the dense input.
In this paper, we present a novel method for real-time video mosaicking and stabilization. In other words, we try to eliminate the narrow FOV of a gimbal-based camera. Furthermore, we propose a real-time feature-based approach for mosaicking that improves on previously proposed offline methods. The video mosaic provides a wide-angle view and helps us achieve a higher-resolution output. The video pipeline is designed such that it can handle missing frames [3, 4]. Our work can also be combined with earlier work on UAVs, such as hotline tracking [1, 2]. Furthermore, in contrast to recent prior work [3, 4, 5, 6, 13, 15], we present a real-time video mosaicking algorithm with higher fps output, and we employ the GPU to obtain a better frames-per-second rate for our mosaicking.
Finally, we propose using filtering approaches to lower the accumulated error, so that all frames can be used in our video mosaic without resorting to bundle adjustment, which is a very time-consuming algorithm. The proposed approach also eliminates the typical keyframe approach [3, 4, 13, 15], in which a key frame is selected from a group of frames and the other frames are discarded when constructing a map from an aerial video. In that approach, when the camera movement reaches a threshold, a new key frame is set. This prevents data redundancy and leaves more time for bundle adjustment, and it works well for the map-construction uses of video mosaicking. In our case, however, we cannot ignore frames and reduce a scene to a key frame, since ours is a monitoring application rather than a mapping one.

II. PRIOR WORK

Video stabilization for UAVs has a long history. Most recent video stabilization is done in one of two ways. (i) Homography matrices are found between two or more sequential frames, the new frame is warped, and the input is cropped so that the output is stabilized [5, 6]. In such an approach the field of view is limited and data is lost; that is why some drones, e.g. Parrot drones, use fish-eye cameras to provide a wide field of view, part of which is cropped away for stabilization. (ii) Mechanical stabilization, i.e. a gimbal, in which a motor with a controller and sensors is used [7, 8, 9]. This hardware increases the weight of the system, and the motor requires more battery power and wears out over time.

On the other hand, related work on panoramic imaging and panoramic stitching is well established [12, 19, 20]. Image alignment can be performed with two different methods: (i) direct pixel-based methods [1, 2] and (ii) feature-based matching methods, which are the most common in recent work [3, 4, 14]. A subset of these studies uses global registration via bundle adjustment [3, 4, 21] and blending [12, 19]. In our work, we do not use bundle adjustment because it is time consuming; instead, we obtain more accurate match points by taking advantage of multiple filters. There is a lot of prior work on real-time mosaicking. The works by Sawhney et al. [23], Robinson [11], and Szeliski et al. [22] are based on direct matching, which suffers from accumulated error. Super-resolution video mosaicking was done by Zisserman and Capel [24] using a feature-based method, which is similar to the work of Steedly, Pal and Szeliski [18] on efficiently registering video into panoramic mosaics. There are several works [11, 25, 26] on real-time video mosaicking using sequential image matching, but these methods are prone to the accumulation of alignment errors. Several methods address the accumulation error either by global registration [3, 4, 27] or by loop-closure detection [28]. Recently, Lovegrove and Davison [13] and Civera et al. [15] used a Simultaneous Localization And Mapping (SLAM) approach. Civera et al. [15] used an extended Kalman filter (EKF) to estimate the features in the frames and the sensor state, but because only a small number of extracted frame features is used, the result is a lower-quality output. Lovegrove and Davison [13], Civera et al. [15], and Breszcz et al. [3, 4] presented whole-image alignment with a key-frame selection method; this is well suited to planar-scene or map-construction applications, but it is not a suitable approach for our application. Lovegrove and Davison [13] use whole-image alignment to overcome the quality limitations of [15] by taking advantage of all frame features. However, this method is not as fast as our real-time approach and is not scale-invariant, meaning that when the camera zooms, or when the balloon rises, errors appear in the mosaic. Brown et al. [14] presented good image stitching using SIFT features, but it is not a real-time process and is not suitable for video mosaicking, since the rate of overlap in our case is very high.

Unlike prior work on video stabilization of suspended cameras and on video mosaicking, we present a feature-based method built on local pairwise alignment, and we use filters instead of global alignment and without the key-frame approach, in order to reach a higher frames-per-second output. We benefit from a GPU pipeline and use SURF features, which are robust to translation, zooming, and rotation. Unlike prior work, we put more emphasis on practical implementation and timing. It should be noted that, to the best of our knowledge, there is no prior work addressing the problem of a downward-facing camera hanging from a blimp/balloon with high rotational swing, which makes the images harder to put together.

III. REAL-TIME VIDEO MOSAICKING

A. Outline

Our video mosaicking approach is based on Speeded Up Robust Features (SURF) [29] detected in consecutive video frames, described using Binary Robust Invariant Scalable Keypoints (BRISK) descriptors [10]. Subsequently, RANdom SAmple Consensus (RANSAC) [30] is applied to discard outliers, and a further filter, described in step 8, selects the best match pairs. Based on these good match pairs, pairwise alignment is used to robustly estimate the relative frame-to-frame image transformation, and the new frame is then warped into our panoramic scene. In cases in which the consecutive frame-to-frame matching fails or there are missing frames, the algorithm waits for another frame that has overlap with the most recent frame.
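To make the outline concrete, the following is a minimal skeleton of the processing loop under the assumptions above (SURF detection, BRISK description, brute-force matching, and a relative affine estimate per frame pair). It is an illustrative sketch written against the OpenCV 2.4 API, not a listing of our implementation; the input path, thresholds, and the minimum-match count are assumed values, and the extra filters and warping of Section III are only indicated by comments.

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>   // SURF lives in the nonfree module in OpenCV 2.4
#include <opencv2/video/tracking.hpp>       // estimateRigidTransform
#include <vector>

int main()
{
    cv::VideoCapture cap("input.avi");       // illustrative input path
    cv::SurfFeatureDetector detector(400);   // assumed Hessian threshold
    cv::BRISK extractor;
    cv::BFMatcher matcher(cv::NORM_HAMMING); // BRISK descriptors are binary

    std::vector<cv::KeyPoint> prevKp;
    cv::Mat prevDesc;

    for (cv::Mat frame; cap.read(frame); ) {
        cv::Mat gray, desc;
        std::vector<cv::KeyPoint> kp;
        cv::cvtColor(frame, gray, CV_BGR2GRAY);
        detector.detect(gray, kp);
        extractor.compute(gray, kp, desc);

        if (!prevDesc.empty()) {
            std::vector<cv::DMatch> matches;
            matcher.match(desc, prevDesc, matches);
            if (matches.size() < 10)          // matching failed: wait for a frame with overlap
                continue;

            std::vector<cv::Point2f> cur, prev;
            for (size_t i = 0; i < matches.size(); ++i) {
                cur.push_back(kp[matches[i].queryIdx].pt);
                prev.push_back(prevKp[matches[i].trainIdx].pt);
            }
            // relative affine transform between consecutive frames; RANSAC-style outlier
            // rejection and the additional filters are applied in the full pipeline
            cv::Mat A = cv::estimateRigidTransform(cur, prev, true);
            if (A.empty())
                continue;
            // ... accumulate A and warp the frame into the panorama (see Section III) ...
        }
        prevKp = kp;                          // keep the most recent successfully processed frame
        prevDesc = desc;
    }
    return 0;
}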
Fig. 2: Real-time video mosaicking pipeline
Fig. 2 shows the proposed pipeline of our real-time video mosaicking. Each step is explained in the following, along with its implementation in OpenCV 2.4.11.

1) The first frame is received and converted to grayscale. Then its features and descriptors are computed.

2) The first frame is placed in the middle of our field of view (FOV); it is our starting frame.

3) Each new frame is taken, converted to grayscale, and prepared for further processing.

4) The size of the frames is adjusted to provide adequate resolution while keeping the frames small enough for fast processing.

5) Feature detection is performed at this step. Empirically, about 500 features are enough for our process; we do not use more features because time is an important factor. There are several feature detectors already implemented in OpenCV that are faster than SURF but not good enough for our application. Table 1 shows the comparison between these detectors for a given image. According to Fig. 3, the fastest detector is FAST, but empirically SURF shows more robustness than FAST, which is why we use the SURF feature detector as our default detector.
Table 1: A comparison between four feature detectors based on the processing time.

Detector              SURF    SIFT    ORB     FAST
Number of features    105     101     118     178
Time (s)              0.022   0.172   0.007   0.001

Fig. 3: A comparison between feature detectors (processing time in seconds).
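As an illustration of steps 3-5, the following is a minimal sketch of grayscaling, resizing, and SURF detection with the OpenCV 2.4 API. The Hessian threshold (400) and the working width (640 px) are assumed values, not figures reported in this paper.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>  // SURF (nonfree module in OpenCV 2.4)
#include <vector>

// Grayscale, resize, and detect roughly 500 SURF keypoints in a frame.
std::vector<cv::KeyPoint> detectFeatures(const cv::Mat& frame, cv::Mat& gray)
{
    cv::cvtColor(frame, gray, CV_BGR2GRAY);              // step 3: grayscale

    // step 4: shrink the frame if it is wider than an assumed working width of 640 px
    if (gray.cols > 640)
        cv::resize(gray, gray, cv::Size(640, gray.rows * 640 / gray.cols));

    // step 5: SURF detection; the Hessian threshold (400) is an assumed value
    cv::SurfFeatureDetector detector(400);
    std::vector<cv::KeyPoint> keypoints;
    detector.detect(gray, keypoints);

    // keep only the strongest ~500 keypoints, as suggested in the text
    cv::KeyPointsFilter::retainBest(keypoints, 500);
    return keypoints;
}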
6) In this step, a descriptor is chosen to match the features. Table 2 shows several descriptors already defined in the OpenCV library and the processing time needed to compute them for a given set of features.

Table 2: Descriptors and the time comparison for a given task.

Descriptor   SURF     SIFT    ORB      BRISK    FREAK    BRIEF
Time (s)     0.0229   0.633   0.0079   0.0017   0.0019   0.0108

Fig. 4 shows the descriptor comparison in chart form. Because of the better performance of BRISK compared to the rest, we have chosen BRISK in our approach.
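As a concrete illustration of step 6, the following sketch computes BRISK descriptors on the previously detected SURF keypoints using the OpenCV 2.4 API; it is a minimal example, and the helper name is ours.

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <vector>

// Compute BRISK descriptors for previously detected (e.g. SURF) keypoints.
cv::Mat describeKeypoints(const cv::Mat& gray, std::vector<cv::KeyPoint>& keypoints)
{
    cv::BRISK extractor;          // BRISK implements the DescriptorExtractor interface
    cv::Mat descriptors;          // one binary descriptor row per keypoint
    extractor.compute(gray, keypoints, descriptors);
    return descriptors;
}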
7) The matching is performed in this step. Similar to the previous steps, we evaluated the performance of the available approaches and selected the one with the lowest time consumption. Table 3 shows the comparison between the two matchers available in OpenCV. Based on these results, we employed the BruteForce matcher in our implementation.

8) In this step, the good matches are selected from the set of all matches, which may contain false matches. We use the RANdom SAmple Consensus (RANSAC) method [30] to discard the outliers. Then we pass these inliers into a further filter to select the best ones.
Fig. 4: The comparison between five descriptors (processing time in seconds).
Table 3: A comparison between the two feature matchers available in OpenCV.

Matcher    Brute Force matcher    FLANN-based matcher
Time (s)   0.00383                0.15002

Fig. 5: A mask sample
This filter is implemented as follows: we find the minimum of the distances between descriptors over all match pairs and then select the top 200 matches whose distance is less than three times this minimum distance. This keeps the best matches, discards the rest, and adds robustness.
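A minimal sketch of this filter is given below, assuming the matches come from the BruteForce matcher of step 7 (e.g. cv::BFMatcher with the Hamming norm for binary BRISK descriptors) after RANSAC has removed the outliers; the sorting step and the function name are our own choices.

#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <algorithm>
#include <vector>

// Keep at most 200 matches whose descriptor distance is below 3 * (minimum distance).
std::vector<cv::DMatch> filterGoodMatches(const std::vector<cv::DMatch>& matches)
{
    std::vector<cv::DMatch> good;
    if (matches.empty())
        return good;

    double minDist = matches[0].distance;
    for (size_t i = 1; i < matches.size(); ++i)
        minDist = std::min(minDist, (double)matches[i].distance);

    for (size_t i = 0; i < matches.size(); ++i)
        if (matches[i].distance < 3.0 * minDist)
            good.push_back(matches[i]);

    // keep only the 200 best (smallest-distance) matches
    std::sort(good.begin(), good.end());          // cv::DMatch sorts by distance
    if (good.size() > 200)
        good.resize(200);
    return good;
}

The surviving match pairs are the ones used to estimate the frame-to-frame transformation in step 9.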
9) Finding the homography matrix is performed in this step. The main part is to find a 3*3 perspective projection matrix or a 2*3 affine transformation matrix between frame (n-1) and frame (n) using the best match points found in step 8. We use the affine transformation in this paper. For aligning the frames, we need the transformation between frame n and frame 1, which is obtained by consecutively multiplying the relative transformation matrices of all frame pairs down to frame 1.

We also need an offset that moves the image to the center of our FOV; it is applied by pre-multiplying the offset matrix with the transformation matrix:

\begin{bmatrix} 1 & 0 & \mathrm{offset}_x \\ 0 & 1 & \mathrm{offset}_y \\ 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)
This offset matrix only needs to be pre-multiplied once, for the first pair of frames. So, for the nth frame we can calculate the homography matrix using the following steps:

\hat{H}_1^2 = H_{\mathrm{offset}} \times H_1^2 \qquad (2)

H_1^3 = H_2^3 \times \hat{H}_1^2 \qquad (3)

H_1^n = H_{n-1}^n \times H_{n-2}^{n-1} \times \cdots \times \hat{H}_1^2 \qquad (4)
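A sketch of how equations (1)-(4) can be accumulated in code is shown below, under the assumption that each relative matrix maps points of the new frame into the coordinates of the previous frame (the 2*3 affine estimate is promoted to a 3*3 homogeneous matrix before multiplication); the helper names are illustrative, and the multiplication order depends on this direction convention.

#include <opencv2/core/core.hpp>

// Promote a 2x3 affine matrix to a 3x3 homogeneous matrix.
cv::Mat toHomogeneous(const cv::Mat& affine2x3)
{
    cv::Mat H = cv::Mat::eye(3, 3, CV_64F);
    cv::Mat a;
    affine2x3.convertTo(a, CV_64F);
    a.copyTo(H(cv::Rect(0, 0, 3, 2)));   // top two rows hold the affine part
    return H;
}

// Global transform of frame 1 into the panorama: just the offset (eq. 1).
cv::Mat makeOffset(double offsetX, double offsetY)
{
    cv::Mat H = cv::Mat::eye(3, 3, CV_64F);
    H.at<double>(0, 2) = offsetX;
    H.at<double>(1, 2) = offsetY;
    return H;
}

// Accumulate the transform of the newest frame (eqs. 2-4):
// H_acc starts as the offset matrix and is updated once per frame.
void accumulate(cv::Mat& H_acc, const cv::Mat& H_rel_2x3)
{
    H_acc = H_acc * toHomogeneous(H_rel_2x3);
}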
Now we have the homography between frame n and frame 1, which is our reference frame. The sequential multiplication of these homography matrices accumulates error, which is one of the disadvantages of this algorithm. To limit this accumulation, the process can be run for a limited period of time and then restarted.

10) Checking for bad homographies is done in this step. This is needed since our algorithm does not use bundle adjustment, so we have to ignore frames that are not suitable for processing and have high errors. The check is based on the fact that our data is a video stream, so the homography matrix should not change suddenly; a sudden change indicates an error and can cause the output to fail. We therefore compute the 2-norm of the difference between the new homography matrix and the previous one, and maintain the average of these norms over the frames up to frame n-1. If the norm at frame n is larger than a threshold, the frame is considered erroneous and skipped (empirically, the threshold is set to 10 times the average norm over frames n-500 to n-1). If the norm is below the threshold, the frame is accepted and the average is updated with the new norm. A second filter checks the correlation between the two most recent homographies: if the absolute value of the correlation is below a threshold the frame is accepted, otherwise it is ignored. Empirically, this threshold is set to 0.8.
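A minimal sketch of the norm-based part of this check is given below, assuming the comparison uses the Frobenius (L2) norm of the difference between consecutive global homographies; the correlation-based filter is omitted, the window (500 frames) and factor (10x) follow the values in the text, and the function signature is ours.

#include <opencv2/core/core.hpp>
#include <deque>
#include <numeric>

// Decide whether the newest homography is plausible, given the recent history.
// Returns false if the change from the previous homography is abnormally large.
bool homographyLooksValid(const cv::Mat& H_new, const cv::Mat& H_prev,
                          std::deque<double>& recentNorms)
{
    cv::Mat diff = H_new - H_prev;
    double n = cv::norm(diff, cv::NORM_L2);

    if (!recentNorms.empty()) {
        double avg = std::accumulate(recentNorms.begin(), recentNorms.end(), 0.0)
                     / recentNorms.size();
        if (n > 10.0 * avg)           // empirical threshold from the text
            return false;             // treat the frame as an error and skip it
    }

    recentNorms.push_back(n);         // accept: update the running statistics
    if (recentNorms.size() > 500)     // keep roughly the last 500 frames
        recentNorms.pop_front();
    return true;
}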
11) Mask preparation is the process of determining the region of interest in the final image into which the warped image is copied. The most common way of making the mask is to apply the same calculated homography matrix to a white-filled rectangle of the same size as the initial frame. It should be noted that when the FOV has a high resolution, which it usually does, this takes too long on a CPU; for instance, it took about 0.5335 seconds for a 3840*1200 image in our tests. To handle this issue, we instead create a black image of the same size as the final panorama, compute the corners of the warped frame using the perspectiveTransform function, and fill the resulting shape with white using the fillConvexPoly function. Building the mask manually in this way took 0.00178886 seconds for a 3840*1200 image, which is even faster than using the GPU; warping the mask on the GPU takes 0.00604753 seconds. Fig. 5 shows a mask sample, and Table 4 shows the time comparison between the mask preparation methods.

Table 4: Mask preparation time (in seconds) for a 3840*1200 image using three different methods.

Method                           CPU        GPU
warpPerspective, INTER_LINEAR    0.533502   0.008436
warpPerspective, INTER_NEAREST   0.493054   0.006047
Manual method                    0.001788   -
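A sketch of the manual mask construction described in step 11 is shown below, using the perspectiveTransform and fillConvexPoly functions named in the text; the helper signature and variable names are ours.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <vector>

// Build a binary mask of the warped frame's footprint inside the panorama.
cv::Mat makeWarpMask(const cv::Size& frameSize, const cv::Size& panoSize, const cv::Mat& H)
{
    // corners of the original frame
    std::vector<cv::Point2f> corners(4), warped(4);
    corners[0] = cv::Point2f(0.f, 0.f);
    corners[1] = cv::Point2f((float)frameSize.width, 0.f);
    corners[2] = cv::Point2f((float)frameSize.width, (float)frameSize.height);
    corners[3] = cv::Point2f(0.f, (float)frameSize.height);

    // project the corners with the same 3x3 homography used for warping
    cv::perspectiveTransform(corners, warped, H);

    // fill the resulting quadrilateral with white on a black panorama-sized image
    std::vector<cv::Point> poly(4);
    for (int i = 0; i < 4; ++i)
        poly[i] = cv::Point(cvRound(warped[i].x), cvRound(warped[i].y));

    cv::Mat mask = cv::Mat::zeros(panoSize, CV_8UC1);
    cv::fillConvexPoly(mask, &poly[0], 4, cv::Scalar(255));
    return mask;
}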
12) Warping the frame into the final panoramic image is done in this step. It involves applying the homography matrix to the frame and then copying the result, using the mask, into our FOV. The processing time needed on the CPU depends on the output resolution: it takes 0.483326 seconds for a 3840*1200 resolution frame, so we used the GPU for this part, which takes 0.00767731 seconds for the same operation (Fig. 6).

Fig. 6: Warping time comparison (warpPerspective: 0.48332 s on the CPU versus 0.00768 s on the GPU).
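A sketch of the GPU warp in step 12 is given below, assuming an OpenCV 2.4 build with CUDA so that the gpu module is available; the interpolation flag and the CPU-side masked copy are our own choices rather than details reported in the paper.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

// Warp the new frame on the GPU and paste it into the panorama using the mask.
void warpIntoPanorama(const cv::Mat& frame, const cv::Mat& H,
                      cv::Mat& panorama, const cv::Mat& mask)
{
    cv::gpu::GpuMat d_frame(frame), d_warped;

    // warp into panorama coordinates on the GPU
    cv::gpu::warpPerspective(d_frame, d_warped, H, panorama.size(), cv::INTER_NEAREST);

    // download and copy only the masked region into the panorama
    cv::Mat warped;
    d_warped.download(warped);
    warped.copyTo(panorama, mask);
}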
13) The loop preparation involves copying the current frame, its features, descriptors, and bookkeeping information into the previous-frame buffers so that they do not need to be recalculated. In one of our experiments, this step took 0.064 milliseconds.

IV. RESULTS

Below we report the fps output of the presented algorithm for a sample video.
Fig. 7: The consecutive images mosaicked together. The red rectangle shows the current image.
Fig. 7 shows consecutive images captured with a hand-held camera and mosaicked together. Fig. 8 shows images captured by a camera facing down from a balloon at about 80 meters. Our pipeline (Fig. 2) achieves an average rate of 15-20 fps or more for a 640*480 input video and a 2560*1920 output resolution. More demos can be found at the following link: https://www.youtube.com/watch?v=on_sG_X79oQ. The processing was done on a computer with a Core i7 2.00 GHz CPU, an Nvidia 540M GPU, Windows 7 64-bit, and the C++ OpenCV 2.4.11 library; the resulting performance is summarized in Table 5.

Table 5: The performance of the proposed system for a sample video with 640*480 input and 2560*1920 output.
                                                         Frames   Time (s)   FPS
With consideration of webcam/hard-disk read delays       664      32.78      20.2
Without consideration of webcam/hard-disk read delays    664      29.69      22.3
Fig. 8: Image stabilization and mosaicking performed on a video stream from a camera facing down from a balloon stationed at about 80 meters. The red rectangle shows the current image.
V. CONCLUSIONS

In this paper, we have presented a practical feature-based real-time video mosaicking pipeline for monitoring applications, aimed at eliminating mechanical video stabilizers for cameras hanging from UAVs and for hand-held use in shaky environments. We built on current mosaicking approaches to deal with zoom, rotation, and translation for such applications. We reached at least 15-20 fps for a 640*480 video input and a 2560*1920 video output on an ordinary laptop. Furthermore, we investigated the time consumption of the different feature detectors and descriptors available in OpenCV 2.4. Future work includes real-time 3D modeling of the scene using this video mosaic and reducing the accumulated error without losing real-time performance.

References

[1] S. Valipour and H. Moradi, "A vision-based automatic hotline tracking approach using unscented Kalman filter", Proc. 2nd RSI/ISM Int. Conf. on Robotics and Mechatronics (ICRoM), 2014, pp. 607-612
[2] S. Valipour, H. Khandani, and H. Moradi, "The design and implementation of a hotline tracking UAV", Proc. 3rd RSI Int. Conf. on Robotics and Mechatronics (ICROM), 2015, pp. 252-258
[3] M. Breszcz and T.P. Breckon, "Real-time construction and visualization of drift-free video mosaics from unconstrained camera motion", IET J. Engineering, 2015, (16), pp. 1-12
[4] M. Breszcz, T.P. Breckon, and I. Cowling, "Real-time mosaicing from unconstrained video imagery for UAV applications", Proc. 26th Int. Conf. on Unmanned Air Vehicle Systems, 2011, pp. 32.1-32.8
[5] B. Pinto and P.R. Anurenjan, "Video stabilization using speeded up robust features", Proc. IEEE Int. Conf. on Communications and Signal Processing, 2011, pp. 527-531
[6] A. Hafiane, K. Palaniappan, and G. Seetharaman, "UAV-video registration using block-based features", Proc. IEEE Int. Geoscience and Remote Sensing Symp., 2008, vol. II, pp. 1104-1107
[7] Trofin, "Motion picture stabilizing achieved by mechanical engineering: shooting video using three axis camera gimbals", Academic Journal of Manufacturing Engineering, 2015, 13, (3), pp. 70-81
[8] O.C. Jakobsen and E.N. Johnson, "Control architecture for a UAV-mounted pan/tilt/roll camera gimbal", Infotech@Aerospace, Arlington, Virginia, September 2005
[9] K. Tiimus and M. Tamre, "Camera gimbal control system for unmanned platforms", Proc. 7th Int. DAAAM Baltic Conf. on Industrial Engineering, Tallinn, Estonia, April 2010
[10] S. Leutenegger, M. Chli, and R. Siegwart, "BRISK: binary robust invariant scalable keypoints", Proc. IEEE Int. Conf. on Computer Vision (ICCV), 2011, pp. 2548-2555
[11] J. Robinson, "Collaborative vision and interactive mosaicking", Proc. Vision, Video and Graphics, 2003
[12] R. Szeliski, "Image alignment and stitching: a tutorial", Found. Trends Comput. Graph. Vis., 2006, 2, (1), pp. 1-104
[13] S. Lovegrove and A. Davison, "Real-time spherical mosaicking using whole image alignment", 2010, pp. 73-86
[14] M. Brown and D. Lowe, "Automatic panoramic image stitching using invariant features", Int. J. Comput. Vis., 2007, 74, (1), pp. 59-73
[15] J. Civera, A.J. Davison, J. Magallón, et al., "Drift-free real-time sequential mosaicking", Int. J. Comput. Vis., 2008, 81, (2), pp. 128-137
[16] D. Wagner, A. Mulloni, T. Langlotz, et al., "Real-time panoramic mapping and tracking on mobile phones", Proc. Virtual Reality Conf., 2010, pp. 211-218
[17] R. Hartley and A. Zisserman, "Multiple view geometry in computer vision", Cambridge University Press, Cambridge, UK, 2003
[18] D. Steedly, C. Pal, and R. Szeliski, "Efficiently registering video into panoramic mosaics", Proc. 10th Int. Conf. on Computer Vision, 2005, pp. 1300-1307
[19] R. Benosman and S. Kang (Eds.), "Panoramic vision", Springer-Verlag, London, UK, 2001
[20] D. Gledhill, G.Y. Tian, D. Taylor, et al., "Panoramic imaging - a review", Comput. Graph., 2003, 27, (3), pp. 435-445
[21] R.I. Hartley and A. Zisserman, "Multiple view geometry in computer vision", 2nd edn., Cambridge University Press, Cambridge, UK, 2004, ISBN 0521540518
[22] R. Szeliski and H. Shum, "Creating full view panoramic image mosaics and environment maps", Proc. 24th Annual Conf. on Computer Graphics and Interactive Techniques, 1997, pp. 251-258
[23] H. Sawhney, S. Hsu, and R. Kumar, "Robust video mosaicking through topology inference and local to global alignment", Proc. European Conf. on Computer Vision, 1998, pp. 103-119
[24] D. Capel and A. Zisserman, "Automated mosaicking with super-resolution zoom", Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 1998, pp. 885-891
[25] R. Marks, S. Rock, and M. Lee, "Real-time video mosaicking of the ocean floor", IEEE J. Oceanic Eng., 1995, 20, (3), pp. 229-241
[26] C. Morimoto and R. Chellappa, "Fast 3D stabilization and mosaic construction", Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 1997, pp. 660-665
[27] D. Kim and K. Hong, "Real-time mosaic using sequential graph", J. Electron. Imaging, 2006, 15, (2), pp. 47-63
[28] Z. Zhu, G. Xu, E. Riseman, et al., "Fast generation of dynamic and multi-resolution 360 panorama from video sequences", Proc. IEEE Int. Conf. on Multimedia Computing and Systems, 1999, vol. 1, pp. 400-406
[29] H. Bay, A. Ess, T. Tuytelaars, et al., "Speeded-up robust features (SURF)", Comput. Vis. Image Underst., 2008, 110, (3), pp. 346-359
[30] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography", Commun. ACM, 1981, 24, (6), pp. 381-395
[31] D. Lowe, "Distinctive image features from scale-invariant keypoints", Int. J. Comput. Vis., 2004, 60, (2), pp. 91-110