Mobile Video Processing for Visual Saliency Map Determination

Shilin Xu^a, Weisi Lin^c, and C.-C. Jay Kuo^b

^a Huazhong University of Science and Technology, Wuhan, P. R. China
^b University of Southern California, Los Angeles, USA
^c Nanyang Technological University, Singapore

Further author information: send correspondence to Shilin Xu, Email: [email protected]
ABSTRACT

The visual saliency map represents the most attractive regions in video. Automatic saliency map determination is important in mobile video applications such as autofocusing in video capturing. It is well known that motion plays a critical role in visual attention modeling. Motion in video consists of the camera's motion and the foreground target's motion. In determining the visual saliency map, we are concerned with the foreground target's motion. To achieve this, we evaluate the camera/global motion and then identify the moving target from the background. Specifically, we propose a three-step procedure for visual saliency map computation: 1) motion vector (MV) field filtering, 2) background extraction and 3) contrast map computation. In the first step, the mean value of the MV field is treated as the camera's motion. As a result, the MVs of the background can be detected and eliminated, and the saliency map can be roughly determined. In the second step, we further remove noisy image blocks in the background and provide a refined description of the saliency map. In the third step, a contrast map is computed and integrated with the result of foreground extraction. The computations required by our proposed algorithm are of low complexity so that it can be used in mobile devices. The accuracy and robustness of the proposed algorithm are supported by experimental results.

Keywords: video saliency map, motion vector filtering, contrast map, mobile video
1. INTRODUCTION

The characteristics of the human visual system (HVS) have been a research topic for many decades. One of the HVS properties is visual attention, which refers to a small region of an image sequence that attracts human eyes the most. The locations of visual fixation in video define the visual saliency map. Finding the region of interest (ROI) in video automatically is an important yet challenging problem. Since the visual saliency map represents the most attractive region to human eyes, there is a close relationship between ROI identification and visual saliency map determination. Applications of the visual saliency map include video registration, ROI-based video coding, transcoding, resizing, adaptation, error resilience, etc. In this work, we propose a low-complexity algorithm to extract the visual saliency map from a scene with mobile devices. It can be used for effective mobile video capturing such as camera auto-focusing.

Generally speaking, the observing behavior of human eyes consists of two types: top-down and bottom-up [1]. The top-down style is consciously controlled by the human brain to guide visual attention to accomplish a target task, while the bottom-up style is unconsciously triggered by a particular object or event that grabs our attention. A computational bottom-up model for still images was proposed by Itti et al. [2]. In that model, the original image is down-sampled into nine spatial scales using a dyadic Gaussian pyramid: a sequence of low-pass filters downsamples the input image by factors ranging from 1:1 (i.e., scale zero) to 256:1 (i.e., scale eight). Then, a center-surround differencing scheme and a normalization process are used to yield a conspicuity map. Finally, the winner-take-all (WTA) strategy is adopted to determine the most salient point. Itti's model can also be used in image adaptation, which optimizes the user's experience by displaying the most attentive region if the image size exceeds the screen resolution [3]. Since a hard cut for saliency region extraction is not accurate, fuzzy growing is
incorporated into saliency map modeling [4]. Although these methods achieve good performance for still images, they cannot be applied to video directly.

Motion in video can attract human visual attention and trigger subsequent eye movement efficiently. Statistically, most objects in the visual world are still with respect to the background. Thus, an object with significant motion against the background has a great impact on the HVS response [5]. For the visual perception of video, motion plays an important role, since moving objects (or regions) provide us with more useful information than the still background. Motion detection/extraction has been intensively studied by researchers and many solutions have been proposed for a wide range of applications. Preliminary work on motion-based saliency map extraction from video has been reported in the literature. Ma and Zhang [6] proposed a temporal and spatial motion attention model based on the entropy distribution for sports video skimming. To extract moving objects from the background, the consistency of motion vectors (MVs) in the spatial and temporal domains is checked for each MV bin based on an entropy calculation. More recently, Tang [7] integrated motion attention, visual sensitivity and visual masking models to form a spatio-temporal perceptual distortion measure. Abdollahian and Delp [8] used camera motion to provide important cues for visual saliency map modeling, where visual attention was determined by the camera motion as well as a contrast map.

An accurate and robust saliency map can be used in many applications. One application is video coding, where the saliency map can guide a video encoder to compress different regions of a frame at different bit rates based on their visual importance. For example, a visual foveation model was employed by Wang et al. [9] to improve the performance of scalable video coding. Similarly, Agrafiotis et al. [10] developed an efficient method to encode sign language video. Another application is video indexing and summarization. Ma et al. [11] combined visual and aural attention models for the annotation of key shots in video. The saliency map can also be used to measure the similarity of key shots from different video sources [12].

In this work, to find an efficient way to extract the visual saliency map from general video, we propose a method to compute the video saliency map using foreground object extraction and a contrast map technique. Motion vector (MV) filtering is utilized to compensate for camera motion. When neither global nor local motion is detected, an image-based saliency map computation is used to handle this situation. Specifically, we propose a three-step procedure for visual saliency map computation: 1) motion vector (MV) field filtering, 2) background extraction and 3) contrast map computation. In the first step, the mean value of the MV field is treated as the camera's motion. As a result, the MVs of the background can be detected and eliminated, and the saliency map can be roughly determined. In the second step, we further remove noisy image blocks in the background and provide a refined description of the saliency map. In the third step, a contrast map is computed and integrated with the result of foreground extraction. The computations required by our proposed algorithm are of low complexity so that it can be used in mobile devices. The accuracy and robustness of the proposed algorithm are supported by experimental results. The rest of this paper is organized as follows.
Our research motivation and an overview of the proposed computational system are given in Section 2. The MV field filtering, background extraction and the contrast map computation are examined in Sections 3, 4 and 5, respectively. Experimental results are given in Section 6. Finally, concluding remarks and future research topics are mentioned in Section 7.
2. RESEARCH MOTIVATION AND SYSTEM OVERVIEW

2.1 Research Motivation

Digital video is typically stored and/or transmitted in compressed form. It is therefore desirable to develop a method that extracts the region of interest (ROI) directly from coded video. Furthermore, the MV field in a coded video bit stream provides useful motion information, and the computational load can be significantly reduced by reusing it. An MV refers to the spatial displacement between the target block in the current frame and its best-matched block in the reference frame. It can be expressed as

$$\mathrm{MV} = \arg\min_{p} E(p_0, p_0 + p),$$
where $p_0$ is the position of the target block, $p$ is the displacement between the target block and the candidate reference block, and $E(\cdot, \cdot)$ is an error measurement function, which is often chosen to be the sum of absolute differences (SAD) between corresponding pixels. An MV is a two-dimensional vector, denoted by $(p_x, p_y)^T$. Fast motion estimation techniques for video encoding have been extensively studied, e.g., [13].

Figure 1. The original frame and the computed MV field produced by an H.264/AVC encoder: (a) the 15th frame of the CONTAINER sequence and (b) the 72nd frame of the FOOTBALL sequence.

It is worthwhile to point out that the MV field obtained by a video encoder may not characterize the real motion of objects and/or the background [14]. To illustrate this point, we give two examples in Figure 1. For the CONTAINER sequence in Figure 1(a), the foreground moving objects consist of two ships, the sea and the sky form the background, and the camera is still. However, when this video clip is encoded by the H.264/AVC coder, there exist non-zero MVs in the still background. This is due to small water waves in the sea and noisy pixels in the sky. For the FOOTBALL sequence in Figure 1(b), the football players are the foreground and the court is the background. There are some MVs in the court area since the camera is moving fast to track the players. To conclude, quite a few factors, including the vibration of background objects, camera acquisition noise, the camera's motion and inaccurate motion estimation, will result in a noisy MV field in the background region.
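For concreteness, a minimal full-search sketch of the block-matching definition above is given below (Python/NumPy). The block size, search range, and function names are illustrative assumptions; in the proposed system the MV field is simply reused from the H.264/AVC encoder rather than recomputed.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum()

def estimate_mv(cur, ref, x0, y0, block=16, search=8):
    """Full-search motion estimation for one block (illustrative sketch only).

    Returns the displacement (dx, dy) minimizing the SAD between the target
    block in the current frame and candidate blocks in the reference frame.
    """
    h, w = cur.shape
    target = cur[y0:y0 + block, x0:x0 + block]
    best_err, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yr, xr = y0 + dy, x0 + dx
            if yr < 0 or xr < 0 or yr + block > h or xr + block > w:
                continue  # candidate block falls outside the reference frame
            err = sad(target, ref[yr:yr + block, xr:xr + block])
            if err < best_err:
                best_err, best_mv = err, (dx, dy)
    return best_mv
```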
2.2 System Overview

Human eyes can distinguish the foreground from the background effectively. For the bottom-up approach, motion provides an important cue for saliency map determination since motion grabs human attention more efficiently than other factors such as pixel color, intensity and orientation in video. For a video clip with a still background, image blocks with non-zero MVs are often in the foreground region and, thus, should be included in the saliency map. However, for a video clip with camera movement, a moving block can be in either the background or the foreground. As a result, the camera's motion should be evaluated before foreground motion extraction. To address the problem of a noisy MV field and eliminate the camera's motion (global motion) from the MV field, we propose a three-step computational procedure as described below.

1. Motion vector (MV) field filtering. We develop an MV filtering technique to roughly locate foreground objects.
2. Background extraction. To detect noisy MVs in the background, a background extraction technique is developed based on block search.
3. Contrast map computation. After the extraction of foreground objects, a contrast map is computed and integrated with the intermediate results from the first two steps to generate the final saliency map.

Some global motion estimation methods, e.g., [15], use pixel-by-pixel matching to assess the camera's movement. Although this provides fairly accurate results, it is computationally expensive for real-time applications. An MV is the displacement between the target block and its best match, which can be obtained as a by-product of video encoding. Its computational cost has been reduced significantly due to the development of fast motion search algorithms, e.g., [13]. H.264/AVC adopts several techniques for more effective motion estimation, including sub-pixel motion compensation and variable block-size motion estimation. Thus, we use the MV field generated by H.264/AVC to conduct foreground extraction in this research.
3. MOTION VECTOR FIELD FILTERING

When an object is moving at a certain speed, a photographer may use the camera to track its motion. The camera movement makes all pixels in a frame move at a global speed. Let the global motion be denoted by $p^* = (p^*_x, p^*_y)^T$. Since the MV of a background block should be close to $p^*$ in most cases, we can use the mean MV of the MV field as a rough estimator of $p^*$; namely,

$$p^*_i \approx \bar{p}_i = \frac{\sum_{n=0}^{N-1}\sum_{m=0}^{M-1} p_i(m, n)}{M \times N}, \qquad (1)$$

where $p_i(m, n)$ is the MV of the block at position $(m, n)$ in the $i$-th frame, which has $M \times N$ blocks. A foreground object has its own motion, and its MVs differ from the global motion. We can use the following criterion to detect the object motion:

$$M_i(m, n) = \begin{cases} 1, & \text{if } |p_i(m, n) - \bar{p}_i| > T_i, \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$

where $M_i(m, n) = 1$ and $0$ indicate that the current block is in the foreground and the background, respectively, and $T_i$ is a threshold. To estimate $T_i$, we determine the standard deviation of the MV field as

$$\sigma_i = \sqrt{\frac{\sum_{n=0}^{N-1}\sum_{m=0}^{M-1} |p_i(m, n) - \bar{p}_i|^2}{M \times N}}, \qquad (3)$$

and choose $T_i = \alpha \sigma_i$, where $\alpha$ is a parameter chosen by the user. In our experiment, we set $\alpha = 1$.
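A compact sketch of this filtering step is given below, assuming the MV field of frame $i$ is available as an (N, M, 2) NumPy array of per-block vectors; the array layout and function name are our own conventions, not part of the paper.

```python
import numpy as np

def filter_mv_field(mv, alpha=1.0):
    """Flag candidate foreground blocks by removing the global (camera) motion.

    mv    : array of shape (N, M, 2) holding the motion vector of each block.
    alpha : threshold scale factor (alpha = 1 in the paper's experiments).
    Returns a boolean (N, M) mask M_i with True for candidate foreground blocks.
    """
    p_bar = mv.reshape(-1, 2).mean(axis=0)        # Eq. (1): mean MV approximates the camera motion
    dist = np.linalg.norm(mv - p_bar, axis=-1)    # |p_i(m, n) - p_bar_i|
    sigma = np.sqrt((dist ** 2).mean())           # Eq. (3): standard deviation of the MV field
    return dist > alpha * sigma                   # Eq. (2): threshold T_i = alpha * sigma_i
```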
4. BACKGROUND EXTRACTION

As shown in Figure 1(a), there exist non-zero noisy MVs in the background region although the camera is still. This occurs because of the following reasons.

• For regions with simple textures, such as the sky in Figure 1(a), the motion-compensated MV is not identical to the real motion, since quite a few blocks may match the target block well in a flat region.
• Noise may arise in the process of video acquisition with a CCD or CMOS sensor. Consequently, the pixel intensity may vary from frame to frame.

To eliminate noisy MVs in the background region, we propose a background extraction method based on a block-by-block matching strategy. A mixture of Gaussian functions was used to model the background scene in surveillance applications in [16-18]. However, the complexity of this background modeling approach is so high that it is not suitable for real-time applications. Furthermore, this approach was primarily developed for the still-camera case, as shown in [19, 20], and it has to be modified for the camera-motion case. Due to these considerations, we develop a new algorithm in this work, which has low complexity and can be applied to the case with camera motion.
Figure 2. Results of foreground and background separation for sequence HALL: (a) the 50th frame of the video clip, (b) the MV field after filtering, and (c) the foreground after background extraction.
Figure 3. Results of foreground and background separation for sequence FOOTBALL: (a) the 72nd frame of the video clip, (b) the MV field after filtering, and (c) the foreground after background extraction.
A block in the still background should be similar to a spatially neighboring block in the previous frame. Let the target $4 \times 4$ block be $B(i, m, n)$, and its corresponding block in the previous frame be $B(i-1, m, n)$. The similarity of these two blocks is determined by

$$\Gamma_{\alpha,\beta} = \frac{\sum_{k=0}^{3}\sum_{l=0}^{3} |I(i, 4m+k, 4n+l) - I(i-1, 4m+k+\alpha, 4n+l+\beta)|}{\sum_{k=0}^{3}\sum_{l=0}^{3} I(i, 4m+k, 4n+l)}, \qquad (4)$$

where $I$ is the pixel intensity and $(\alpha, \beta)$ is the spatial displacement of the target block. If $\Gamma_{\alpha,\beta} < \epsilon$, we claim that these two blocks match, and the target block is in the background region. In the experiment reported in Sec. 6, we set $\epsilon = 0.08$ and $\alpha, \beta \in \{-2, -1, 0, 1, 2\}$. Combining (2) and (4), we can determine the foreground objects using the following criterion:

$$F(i, m, n) = \begin{cases} M_i(m, n), & \text{if } \Gamma_{\alpha,\beta} \ge \epsilon \text{ for all candidate displacements } (\alpha, \beta), \\ 0, & \text{otherwise}. \end{cases} \qquad (5)$$
If $F(i, m, n) = 1$, the current block is part of the foreground objects; otherwise, it is in the background region. Figures 2 and 3 show some results of foreground and background separation. For sequence HALL, the camera is still but there are many noisy MVs in the background. The intermediate results after MV field filtering and background extraction are shown in Figures 2(b) and (c); we see that all noisy MVs are removed and the foreground object is separated from the background. For sequence FOOTBALL, the camera moves to track the ball and there exist non-zero MVs in the background. Again, the intermediate results after MV field filtering and background extraction are shown in Figures 3(b) and (c). We see that the players are successfully separated from the background after these processing steps.
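The block-by-block matching of Eqs. (4) and (5) could be sketched as follows. The 4 x 4 block size, the threshold epsilon = 0.08 and the +/-2 search window follow the text, while the function name, array conventions and the small constant guarding against a zero denominator are our assumptions.

```python
import numpy as np

def extract_background(mask, cur, prev, eps=0.08, search=2, bs=4):
    """Refine the MV-filtering mask by dropping blocks that still match the previous frame.

    mask      : boolean (N, M) output of MV field filtering, Eq. (2).
    cur, prev : luminance frames of shape (N*bs, M*bs).
    Returns F(i, m, n): True only for blocks kept as foreground, Eq. (5).
    """
    H, W = cur.shape
    out = np.zeros_like(mask)
    for n in range(mask.shape[0]):
        for m in range(mask.shape[1]):
            if not mask[n, m]:
                continue                        # already classified as background by Eq. (2)
            y, x = n * bs, m * bs
            target = cur[y:y + bs, x:x + bs].astype(np.float64)
            denom = target.sum() + 1e-8         # denominator of Eq. (4), guarded against zero
            matched = False
            for beta in range(-search, search + 1):
                for alpha in range(-search, search + 1):
                    yy, xx = y + beta, x + alpha
                    if yy < 0 or xx < 0 or yy + bs > H or xx + bs > W:
                        continue
                    cand = prev[yy:yy + bs, xx:xx + bs].astype(np.float64)
                    gamma = np.abs(target - cand).sum() / denom   # Eq. (4)
                    if gamma < eps:             # a good match: the block belongs to the background
                        matched = True
                        break
                if matched:
                    break
            out[n, m] = not matched             # Eq. (5): keep the block only if no match is found
    return out
```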
5. CONTRAST MAP COMPUTATION

Attention computation using the bottom-up approach can also be driven by other factors such as color, intensity and orientation. For example, Itti et al. [2] downsample an image into multiple scales and then conduct subtraction across different scales to yield the contrast map, where the subtraction operation is conducted in an image feature space.

Figure 4. Results of contrast map computation: (a) the 2nd frame of sequence COASTGUARD, and (b) the computed contrast map.

Figure 5. Results of contrast map computation: (a) the 1st frame of sequence WATERFALL, and (b) the computed contrast map.
In fact, contrast is a key ingredient of the image-based saliency map computation. Significant contrast of any low-level image feature makes a local region stand out from its neighborhood. Consequently, it is more noticeable to human eyes. Contrast in a small region is more obvious than that in a large region. Thus, contrast is a function of the local difference of a selected image feature and the size of the neighborhood. Here, we compute the contrast map based on the idea in [4]. The luminance component of the LUV space is used to yield the contrast map since the LUV space can simulate the HVS more efficiently. The visual attention at each pixel is quantified by

$$C(i, x, y) = \frac{\sum_{l=-H}^{H}\sum_{k=-W}^{W} \omega(k, l)\,|I(i, x, y) - I(i, x+k, y+l)|}{\sum_{l=-H}^{H}\sum_{k=-W}^{W} \omega(k, l)}, \qquad (6)$$

where $I(i, x, y)$ is the luminance intensity of the pixel at position $(x, y)$ in the $i$-th frame, the neighborhood is a window of size $(2W+1) \times (2H+1)$, and $\omega(k, l)$ is the weight of each pixel pair. One simple way to choose the weight is

$$\omega(k, l) = \frac{1}{\sqrt{k^2 + l^2}}. \qquad (7)$$

In the above, each pixel is weighted inversely with respect to its distance to the window center. Thus, a pixel at a larger distance has a smaller impact on the final visual attention. The results of contrast map computation for the two sequences COASTGUARD and WATERFALL are shown in Figures 4 and 5, respectively. We see that the most attentive regions in these sequences can be obtained easily using the contrast map computation.

The final saliency map is obtained by integrating the results from MV field filtering, background extraction and contrast map computation. Let $S(i, x, y)$ denote the attention index of each pixel in the saliency map. It is computed via

$$S(i, x, y) = F(i, m, n) \cdot C(i, x, y), \qquad (8)$$

where $m = x/4$ and $n = y/4$.
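A sketch of the contrast map of Eqs. (6)-(7) and the integration of Eq. (8) is given below. The window size, the exclusion of the singular center weight (which the paper does not specify), the wrap-around border handling and the function names are our assumptions.

```python
import numpy as np

def contrast_map(luma, W=2, H=2):
    """Per-pixel contrast of Eq. (6) with the inverse-distance weights of Eq. (7)."""
    I = luma.astype(np.float64)
    # Precompute the weights; the center term (k = l = 0) is skipped to avoid a
    # division by zero -- an assumption, since the paper leaves this case open.
    weights = {}
    for l in range(-H, H + 1):
        for k in range(-W, W + 1):
            if k == 0 and l == 0:
                continue
            weights[(k, l)] = 1.0 / np.sqrt(k * k + l * l)
    num = np.zeros_like(I)
    den = sum(weights.values())
    for (k, l), w in weights.items():
        # Neighbor I(i, x+k, y+l); borders wrap around for simplicity in this sketch.
        shifted = np.roll(np.roll(I, -l, axis=0), -k, axis=1)
        num += w * np.abs(I - shifted)
    return num / den

def integrate(F, C, bs=4):
    """Final saliency of Eq. (8): S(i, x, y) = F(i, m, n) * C(i, x, y), m = x/4, n = y/4."""
    # Expand the block-level foreground mask to pixel resolution and gate the contrast map;
    # this assumes the frame dimensions are multiples of the block size.
    F_pix = np.kron(F.astype(np.float64), np.ones((bs, bs)))
    return F_pix[:C.shape[0], :C.shape[1]] * C
```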
Figure 6. Results of computed saliency maps: (a) basketball, (b) hockey, (c), (d) soccer, and (e), (f) tennis.
6. EXPERIMENTAL RESULTS

Several test video clips with distinct visual saliency were selected in our experiment. Sports video is an ideal test case since the players in the video are undoubtedly the center of visual saliency. In a sports game, the camera keeps moving to capture important shots such as scoring, passing and breaks. Since the camera moves with the ball and the players, the background keeps changing. Players, displayed at different resolutions and sizes, also move in diverse directions and at diverse velocities. A robust saliency map computation is expected to eliminate the background and extract the key players from the input video. Furthermore, spectators in the background exhibit complex texture and small motions, and it is challenging to eliminate them from the saliency map. In particular, the saliency map computation for sports video must cope with camera motion as well as complex and varying background removal.

We select a set of sports video clips, including basketball in Figure 6(a), hockey in Figure 6(b), soccer in Figures 6(c) and (d), and tennis in Figures 6(e) and (f), to test the performance of the proposed algorithm for visual saliency computation. For basketball, hockey and soccer, the camera keeps panning and/or zooming and shoots the game alternately in wide-angle and close-up views. This is especially obvious when a fast attack occurs. For tennis, the background is complex in color, texture, intensity and object sizes. Thus, it is difficult to obtain the saliency map using an image-based model to locate foreground objects. In contrast, as shown in Figures 6(e) and (f), the player in each video clip is extracted and the background, including the spectators and the court, is removed successfully.
Although intensive camera movement and a complex background pose challenges in obtaining the saliency map, the proposed algorithm is effective for the following reasons. First, H.264/AVC provides a sufficiently accurate MV field so that the global motion can be well estimated. Second, MV filtering and background extraction can successfully separate the foreground from the background. Third, the contrast map can be used to assist foreground extraction and locate the saliency center.
7. CONCLUSION AND FUTURE WORK

A simple yet effective computational procedure for visual saliency map determination from video clips was proposed, which can be used in mobile applications. MV field filtering and background extraction techniques were developed to separate foreground objects from the background. Furthermore, we integrated the contrast map with the separated foreground/background to form the final saliency map. We used several sports video sequences to demonstrate the effectiveness of the proposed approach. In the future, we would like to examine more reliable camera motion estimation techniques and use their results as cues for attentive region detection. The interaction of multiple visual modalities of human beings should also be investigated.
ACKNOWLEDGEMENT

This work is supported by the China Scholarship Council. The authors are thankful for useful discussions and feedback from Zhiwei Yan and Jiaying Liu.
REFERENCES

[1] James, W., [The Principles of Psychology], Harvard University Press, Cambridge, MA, USA (1890).
[2] Itti, L., Koch, C., and Niebur, E., "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1254–1259 (1998).
[3] Hu, Y., Chia, L., and Rajan, D., "Region-of-interest based image resolution adaptation for MPEG-21 digital item," ACM Conference on Multimedia, 340–343 (2004).
[4] Ma, Y. and Zhang, H., "Contrast-based image attention analysis by using fuzzy growing," ACM Conference on Multimedia, 374–381 (2003).
[5] Wang, Z. and Li, Q., "Video quality assessment using a statistical model of human visual speed perception," Journal of the Optical Society of America A 24, B61–B69 (2007).
[6] Ma, Y. F. and Zhang, H. J., "A model of motion attention for video skimming," IEEE International Conference on Image Processing 1, 129–132 (2002).
[7] Tang, C. W., "Spatiotemporal visual considerations for video coding," IEEE Transactions on Multimedia 9, 231–238 (2007).
[8] Abdollahian, G. and Delp, E. J., "Finding regions of interest in home videos based on camera motion," IEEE International Conference on Image Processing 1, 545–548 (2007).
[9] Wang, Z., Lu, L., and Bovik, A. C., "Foveation scalable video coding with automatic fixation selection," IEEE Transactions on Image Processing 12, 1703–1705 (2003).
[10] Agrafiotis, D., Canagarajah, N., Bull, D., and Dye, M., "Perceptually optimized sign language video coding based on eye tracking analysis," IEE Electronics Letters 39, 243–253 (2003).
[11] Ma, Y., Hua, X., Lu, L., and Zhang, H., "A generic framework of user attention model and its application in video summarization," IEEE Transactions on Multimedia 7, 907–919 (2005).
[12] Li, S. and Li, M., "Efficient spatiotemporal-attention-driven shot matching," ACM Conference on Multimedia, 178–187 (2007).
[13] Chalidabhongse, J. and Kuo, C.-C. J., "Fast motion vector estimation using multiresolution-spatio-temporal correlations," IEEE Transactions on Circuits and Systems for Video Technology 7, 477–488 (1997).
[14] Wiegand, T., Sullivan, G. J., Bjontegaard, G., and Luthra, A., "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology 13, 560–576 (2003).
[15] Dufaux, F. and Konrad, J., "Efficient, robust, and fast global motion estimation for video coding," IEEE Transactions on Image Processing 9, 497–501 (2000).
[16] Stauffer, C. and Grimson, W., "Adaptive background mixture models for real-time tracking," IEEE Conference on Computer Vision and Pattern Recognition 1, 246–252 (1999).
[17] Elgammal, A., Duraiswami, R., Harwood, D., and Davis, L. S., "Background and foreground modeling using nonparametric kernel density estimation for visual surveillance," Proceedings of the IEEE 90, 1151–1161 (2002).
[18] Elgammal, A., Harwood, D., and Davis, L., "Non-parametric model for background subtraction," Proceedings of the 6th European Conference on Computer Vision, Part II, 1843, 751–767 (2000).
[19] Gutchess, D., Trajkovic, M., Cohen-Solal, E., Lyons, D., and Jain, A. K., "A background model initialization algorithm for video surveillance," IEEE International Conference on Computer Vision 1, 733–740 (2001).
[20] Monnet, A., Mittal, A., Paragios, N., and Ramesh, V., "Background modeling and subtraction of dynamic scenes," IEEE International Conference on Computer Vision 2, 1305–1312 (2003).