2017 IEEE/ACS 14th International Conference on Computer Systems and Applications
Detecting video saliency via local motion estimation

Rahma Kalboussi, ENISO (1), NOCCS (2) Laboratory, Sousse, 4054, Tunisia
[email protected]
Aymen Azaza, ENIM (3), NOCCS Laboratory, Monastir, 5000, Tunisia
[email protected]
Mehrez Abdellaoui, ENISO, NOCCS Laboratory, Sousse, 4054, Tunisia
[email protected]
Ali Douik, ENISO, NOCCS Laboratory, Sousse, 4054, Tunisia
[email protected]

(1) Ecole Nationale d'Ingénieurs de Sousse
(2) Networked Objects Control and Communication Systems Laboratory
(3) Ecole Nationale d'Ingénieurs de Monastir

Abstract—Saliency detection has been extensively studied over the last decades. The number of computational models that detect salient regions in still images keeps growing, whereas detecting salient regions in videos is still in its early stages. In this paper we propose a video saliency detection method based on local motion estimation. The input frame is decomposed into patches, and saliency detection is modeled as a region that grows from the patches carrying the highest motion information towards the background. Local saliency is measured by combining local motion estimation with local surrounding contrast, which leads to the construction of foreground and background patches. Experiments show that the proposed method outperforms state-of-the-art methods on two benchmark datasets.

Keywords—Video Saliency; Local Motion; Object Detection

I. INTRODUCTION

The human vision system is one of the most complicated systems of the human body. Researchers have studied the behaviour of this complex system in depth, which has led to various computational approaches that can be grouped into two families: top-down and bottom-up. Top-down mechanisms, also called descendant, depend not only on the task and its stimulus but also on the observer's own experience and idiosyncrasy. Bottom-up processes, also called ascendant, are instead driven by the stimulus of the visual scene, independently of the observer's will. In this paper we are interested in bottom-up saliency. In this context, saliency detection aims to highlight the object of interest that attracts human gaze.

Saliency detection has been widely studied and a large number of computational methods have been developed. Itti et al. proposed one of the earliest saliency models, based on the Feature Integration Theory of visual attention of Treisman and Gelade, which assumes that the input scene can be decomposed into different feature maps [27]. They assumed that local features such as color, intensity and orientation can be processed by center-surround operations to produce feature maps, which are then fused into one final saliency map. Itti's model has been the basis for many other models. Gao et al. [8], [7] used the mechanism proposed by Itti to produce saliency maps for images and videos, while Walther and Koch [28] developed a whole toolbox dedicated to saliency estimation, called SaliencyToolBox (STB). Hou and Zhang [12] developed a saliency detection model based on the spectral residual of the image's log spectrum. Guo et al. [10] assumed that locating the salient object depends essentially on the phase spectrum of the Fourier transform. Later, a frequency-tuned approach was proposed by Achanta et al. [1], based on center-surround contrast and luminance features. Harel et al. [11] proposed a graph-based saliency detection model using a dissimilarity measure. Goferman et al. [9] used local and global contrast to build a saliency detection model. Based on fusing local center-surround differences and feature channels, a saliency detection framework was presented by Klein and Frintrop [16]. For spatial-domain saliency, several visual cues (contrast, color, texture, etc.) have been adopted. Cheng et al. [3] proposed a contrast-based salient object detection model that jointly evaluates global contrast differences and spatially weighted coherence scores to compute saliency maps. Kim et al. [15] separate the background from the foreground to highlight the salient object.

In this paper we propose a new video saliency detection method that fuses a motion cue with a surrounding contrast cue. The first step decomposes the input frame into a set of patches; then, for each patch, a local motion estimation based on optical flow is computed. The local motion estimation is fused with the surrounding contrast cue to produce a spatio-temporal local estimator, which indicates whether a patch belongs to the foreground or to the background. Finally, given the foreground and background patches, saliency is derived from a probability ratio based on the foreground/background likelihoods. Section II reviews related work. Section III provides more details about saliency map generation. Section IV explains our experiments and discusses our results. Section V is dedicated to conclusions and future work.
II. RELATED WORK

While there has been an intensive focus on image saliency detection, detecting saliency in videos is still in its early stages, and only a few methods address video saliency detection (for a broader review of image saliency see [2]). While cues such as color, texture and contrast are used for detecting salient objects in still images, moving objects attract the observer's attention, which makes motion the main cue in video saliency detection. Most existing methods combine an image saliency model with a motion cue. The model proposed by Gao et al. [7] is an extension of the image saliency model [8] with an added motion channel. Mahadevan and Vasconcelos [21] also used the saliency model in [8] to produce a spatiotemporal model based on dynamic textures. In [25], Seo and Milanfar used local kernel regression on a given video frame to measure the likelihood between pixels and then extract a feature vector that includes temporal and spatial information. Rahtu et al. [24] proposed a saliency model for both natural images and videos which combines a saliency measure, formulated from local features and a statistical framework, with a conditional random field model. The region saliency model proposed by Fu et al. [5] computes the saliency of a cluster using spatial, contrast and global correspondence cues. Zhong et al. [31] developed a video saliency model by fusing a spatial saliency map inherited from a classical bottom-up spatial saliency model with a temporal saliency map resulting from a new optical flow model. Based on bottom-up saliency, Wang et al. [29] developed a video salient object segmentation model using the geodesic distance, with spatial edges and temporal motion boundaries used as foreground indicators. Later, Mauthner et al. [23] used the Gestalt principle of figure-ground segregation for appearance and motion cues to predict video saliency. Singh et al. [26] incorporated color dissimilarity, motion difference, an objectness measure and a boundary score of superpixels into a video saliency framework. Besides motion, Fang et al. [4] used color, luminance and texture to produce a saliency model in the compressed domain. Lee et al. [17] combine a set of spatial saliency features, including rarity, compactness and center prior, with temporal features of motion intensity and motion contrast in an SVM regressor to detect the salient object of each video frame. Kim et al. [14] developed an approach based on random walk with restart to detect salient regions: a temporal saliency distribution is first found using motion distinctiveness and then used as the restarting distribution of the random walker, while spatial features are used to design the transition probability matrix that yields the final spatiotemporal saliency distribution.
III. METHODOLOGY

Video saliency detection differs from image saliency detection. While spatial cues such as color, texture and contrast are very effective for saliency prediction in still images, they are not sufficient for video saliency in complex scenarios such as highly textured backgrounds or low color distinctiveness between foreground and background. In such cases motion information contributes, as a dynamic object that changes in the flow field attracts attention. However, spatial cues alone cannot perform well in video saliency, and motion by itself is not a good saliency indicator either, especially for dynamic backgrounds or soft movement. Therefore, we propose a video saliency detection method that combines the spatial information of the input frame with the temporal characteristics of the dynamic object. The first step of our method decomposes the input frame into a set of patches, which have to be labeled as foreground or background patches. To do so, we define a local motion estimation measure (LM). This saliency indicator is fused with another important saliency cue, the surrounding contrast, to produce a spatio-temporal estimator. The spatio-temporal estimator is used to label foreground and background patches, and saliency scores are derived from the foreground/background likelihoods.

A. Local motion estimation

When an observer watches a video, he does not have enough time to examine the whole scene, so his gaze is directed to a specific region which contains the dynamic target. In this section we present how motion is estimated between each pair of frames and how it is turned into a saliency estimate. First, we compute the optical flow V_f of a given frame f using [20], which provides the orientation and magnitude of each flow vector. Then we need to highlight the exact boundaries of the dynamic object, which appear as abrupt changes in the optical flow. Usually, motion boundaries correspond to the boundaries of the physical salient object. We define the motion boundary strength at patch P_i as follows:

M(P_i) = \frac{1}{X} \sum_{x=1}^{X} \left( 1 - \exp(-\lambda \|\nabla V_f(x)\|) \right)    (1)

where X is the total number of pixels in the patch P_i and λ is a controlling parameter, set to 0.5 in the experiments. M(P_i) measures the motion boundaries at every pixel x of the patch P_i and is close to 1 when the pixels of the patch change position rapidly, which is a good saliency indicator. In the case of low movement, however, the optical flow computation can be inaccurate and the motion boundary measure becomes unreliable, so it is necessary to estimate the motion while also considering neighbouring pixels.
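As an illustration, the following Python sketch computes the motion-boundary measure of Eq. (1) on a regular patch grid. It assumes OpenCV and NumPy, uses Farnebäck dense optical flow as a stand-in for the flow estimator of [20], and the 16-pixel patch size is a hypothetical choice since the paper does not report it; λ = 0.5 follows the text.

import cv2
import numpy as np

def motion_boundary_strength(prev_gray, gray, patch=16, lam=0.5):
    # Dense optical flow V_f between two consecutive grayscale frames
    # (Farneback flow is used here in place of the estimator of [20]).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # ||grad V_f(x)||: magnitude of the spatial gradient of both flow channels.
    gx = cv2.Sobel(flow[..., 0], cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(flow[..., 0], cv2.CV_64F, 0, 1, ksize=3)
    hx = cv2.Sobel(flow[..., 1], cv2.CV_64F, 1, 0, ksize=3)
    hy = cv2.Sobel(flow[..., 1], cv2.CV_64F, 0, 1, ksize=3)
    grad = np.sqrt(gx**2 + gy**2 + hx**2 + hy**2)
    # Eq. (1): per-patch mean of 1 - exp(-lambda * ||grad V_f||).
    h, w = grad.shape
    H, W = h // patch, w // patch
    m = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            block = grad[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            m[i, j] = np.mean(1.0 - np.exp(-lam * block))
    return m, flow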
To this end, we introduce a second estimator O(P_i), based on the direction difference between a given pixel and its neighbours N within the patch P_i:

O(P_i) = 1 - \exp(-\beta \, \theta(P_i))    (2)

where β is a controlling parameter and θ(P_i) is the maximum orientation angle between a given pixel and all its neighbours in the patch P_i. The idea comes from the assumption that if a pixel moves with a high or low velocity and has a different direction than the background, then it belongs essentially to the object boundaries.

It is also useful to expose the motion strength of each patch. To do so, we use the magnitude measure directly; but since camera motion can produce wrong measures, we smooth the magnitude, which gives

T(P_i) = \frac{\sum_{x=1}^{X} V_m(P_i(x))}{X \times (2S + 1)}    (3)

where T(P_i) is the total motion strength at patch P_i, V_m is the optical flow magnitude, X the total number of pixels in patch P_i and S a smoothing parameter (after several tests, we set S = 4 in our experiments).
Given the motion boundary strength and the motion strength, the local motion estimation measure is computed as

LM(P_i) = \begin{cases} T(P_i) \times M(P_i) & \text{if } M(P_i) > \gamma \\ T(P_i) \times M(P_i) \times O(P_i) & \text{if } M(P_i) \leq \gamma \end{cases}    (4)

Local motion estimation depends on the velocity: in the case of modest velocity, the motion boundary strength stays around 0.5 and does not provide a good measure, so it becomes necessary to add the orientation-based boundary estimator; we therefore set γ to 0.5. Figure 1 shows how important the proposed motion estimator is for saliency detection.
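For clarity, here is a sketch of how Eqs. (2)-(4) can be combined per patch, given the flow field and the M map from the previous sketch. The value of β and the use of the per-patch orientation spread as a proxy for θ(P_i) are assumptions; γ = 0.5 and S = 4 follow the text.

import numpy as np

def local_motion_estimation(flow, m, patch=16, beta=1.0, gamma=0.5, S=4):
    # Eqs. (2)-(4): fuse motion strength T, boundary strength M and the
    # orientation estimator O into the local motion estimator LM.
    mag = np.sqrt(flow[..., 0]**2 + flow[..., 1]**2)   # V_m, flow magnitude
    ang = np.arctan2(flow[..., 1], flow[..., 0])       # flow orientation
    H, W = m.shape
    lm = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            mb = mag[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            ab = ang[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            X = mb.size
            T = mb.sum() / (X * (2 * S + 1))           # Eq. (3)
            # theta(P_i): orientation spread inside the patch, used here as
            # a simple proxy for the maximum pixel/neighbour angle difference.
            theta = np.ptp(ab)
            O = 1.0 - np.exp(-beta * theta)            # Eq. (2)
            # Eq. (4): the orientation term is used only when M is weak.
            lm[i, j] = T * m[i, j] if m[i, j] > gamma else T * m[i, j] * O
    return lm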
B. Surrounding contrast

Usually a salient region is distinctive from the rest of the scene. While motion is the main cue for detecting saliency in videos, in the case of low movement extra cues should be considered. In our method we use a local contrast measure. A contrast detector measures the distinctiveness of a patch with respect to the rest of the scene; local contrast is usually defined as an abrupt change of color, independently of the spatial distance between patches. Since salient patches are generally spatially grouped, the spatial distance has to be considered for a good local contrast representation. In this context, we adopt the surrounding contrast cue of [9], which assumes that not only color distinctiveness is necessary for saliency detection but also the characteristics of the surrounding patches. The local contrast distinctiveness of each patch P_i is defined as follows:

LC(P_i) = \sum_{P_j, \forall j} \frac{D_c(P_i, P_j)}{1 + \alpha \cdot D_p(P_i, P_j)}    (5)

where α is a variable that controls the weight of the color distance relative to the spatial distance, D_c(P_i, P_j) is the Euclidean distance between P_i and P_j in the CIE L*a*b* color space and D_p(P_i, P_j) is the Euclidean distance between the positions of P_i and P_j.
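Below is a minimal sketch of the surrounding-contrast cue of Eq. (5), computed over the mean CIE L*a*b* colour and centre position of each patch; the value of α is an assumption, as the paper does not report it.

import cv2
import numpy as np

def surrounding_contrast(frame_bgr, patch=16, alpha=3.0):
    # Eq. (5): for every patch, sum of colour distance over
    # (1 + alpha * spatial distance) to all other patches.
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    h, w = lab.shape[:2]
    H, W = h // patch, w // patch
    colors = np.zeros((H * W, 3))    # mean L*a*b* colour of each patch
    centers = np.zeros((H * W, 2))   # centre position of each patch
    for i in range(H):
        for j in range(W):
            block = lab[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            colors[i * W + j] = block.reshape(-1, 3).mean(axis=0)
            centers[i * W + j] = ((i + 0.5) * patch, (j + 0.5) * patch)
    dc = np.linalg.norm(colors[:, None] - colors[None, :], axis=2)    # D_c
    dp = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)  # D_p
    lc = (dc / (1.0 + alpha * dp)).sum(axis=1)
    return lc.reshape(H, W), centers.reshape(H, W, 2)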
C. Saliency map generation

Generally, patches with higher motion attract attention. In the previous sections we defined the local motion estimator and the surrounding contrast cue. The local motion estimation is now fused with the surrounding contrast to produce a spatio-temporal local estimator, defined as

ST_e(P_i) = LC(P_i) \times LM(P_i)    (6)

The spatio-temporal local estimator is used to select foreground and background patches, similarly to [30]. The ST_e values are first sorted and the patches ranked accordingly: patches with the highest spatio-temporal degree are marked as foreground patches and patches with the lowest ST_e degree as background patches. More precisely, P_F is the set of the top δ_f % patches and P_B the set of the bottom δ_b % patches. We then define a probability ratio R, given by

R = \frac{Pr(P \mid P_B)}{Pr(P \mid P_F)}    (7)

where Pr(P | P_B) and Pr(P | P_F) measure respectively the background and foreground probability of a patch and are defined, similarly to [30], as

Pr(P \mid P_B) = \sum_{X \in P_B} \frac{1 - ST_e(P)}{|P_B|} \exp\left(-\frac{D_p(P, X)}{\sigma_p}\right)    (8)

Pr(P \mid P_F) = \sum_{Y \in P_F} \frac{ST_e(P)}{|P_F|} \exp\left(-\frac{D_p(P, Y)}{\sigma_p}\right)    (9)

The foreground and background probabilities of a given patch thus depend on its spatial distance to the other patches of the frame and on the spatio-temporal local estimator. The probability ratio serves to compute the saliency score of each patch, defined as

S(P_i) = \frac{1}{1 + R}    (10)
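A sketch of the saliency-map generation of Eqs. (6)-(10) follows; the fractions δ_f and δ_b, the value of σ_p and the normalisation of ST_e to [0, 1] are assumptions, as the paper does not report them.

import numpy as np

def saliency_scores(lm, lc, centers, delta_f=0.2, delta_b=0.3, sigma_p=50.0):
    # Eq. (6): spatio-temporal local estimator, normalised to [0, 1]
    # (the normalisation is an assumption, not stated in the paper).
    ste = (lc * lm).ravel()
    ste = (ste - ste.min()) / (np.ptp(ste) + 1e-12)
    pos = centers.reshape(-1, 2)
    order = np.argsort(ste)                       # ascending ST_e
    n = ste.size
    bg = order[:max(1, int(delta_b * n))]         # lowest ST_e -> background P_B
    fg = order[-max(1, int(delta_f * n)):]        # highest ST_e -> foreground P_F
    sal = np.zeros(n)
    for p in range(n):
        d_bg = np.linalg.norm(pos[bg] - pos[p], axis=1)
        d_fg = np.linalg.norm(pos[fg] - pos[p], axis=1)
        pr_b = np.sum((1.0 - ste[p]) / len(bg) * np.exp(-d_bg / sigma_p))  # Eq. (8)
        pr_f = np.sum(ste[p] / len(fg) * np.exp(-d_fg / sigma_p))          # Eq. (9)
        r = pr_b / (pr_f + 1e-12)                 # Eq. (7)
        sal[p] = 1.0 / (1.0 + r)                  # Eq. (10)
    return sal.reshape(lm.shape)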
IV. EXPERIMENTS

In this section we evaluate the performance of our method on two benchmark video saliency datasets and compare our results to five state-of-the-art video saliency methods in terms of Precision-Recall curves and F-score.
A. Datasets
Our results are evaluated on two benchmark datasets which are used by most video saliency detection methods. The SegTrack v2 dataset [18] is a video segmentation and tracking dataset. It contains 14 videos with 976 frames, where some videos contain one dynamic object and others contain several. Each video object has specific characteristics, such as slow motion, motion blur, appearance change, complex deformation, occlusion and interacting objects. In addition to the video frames, a binarized ground truth is provided for each frame. The Fukuchi dataset [6] is a video saliency dataset which contains 10 video sequences with a total of 936 frames and a segmented ground truth.

B. Evaluation

We evaluate the performance using the precision-recall (PR) curve and the F-measure. The PR curve plots the precision against the recall. To compute it, each saliency map is binarized using a fixed set of thresholds varying from 0 to 255. The precision and the recall are then computed by comparing the binarized map S to the ground truth G, see Eq. (11) and Eq. (12):

precision = \frac{\sum_{x,y} S(x,y) G(x,y)}{\sum_{x,y} S(x,y)}    (11)

recall = \frac{\sum_{x,y} S(x,y) G(x,y)}{\sum_{x,y} G(x,y)}    (12)

where S is the binarized estimated saliency map and G is the binary ground truth. The PR curve is obtained by varying the threshold used to binarize S(x,y). A saliency map cannot be judged from the precision or the recall alone, so we also compute a measure that combines them harmonically. For this we use F_\beta to generate the F-score:

F_\beta = \frac{(1 + \beta^2) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}    (13)

We set β² = 0.3 following [19] and [1]; precision is weighted more than recall because it is considered more important (see [2]). For each frame we generate a single F_β and the final F-score is the average F_β over the whole dataset.
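A minimal sketch of this evaluation protocol, assuming NumPy: saliency maps are rescaled to [0, 255] before thresholding, and the per-frame F-score is taken at a fixed mid-range threshold, which is an assumption since the paper does not state which threshold is used for the reported F_β.

import numpy as np

def pr_curve_and_fscore(sal_maps, gt_masks, beta2=0.3):
    # Eqs. (11)-(13): precision/recall over thresholds 0..255 and the
    # dataset-level F-score (beta^2 = 0.3, as in the paper).
    thresholds = np.arange(256)
    precisions, recalls, fscores = [], [], []
    for sal, gt in zip(sal_maps, gt_masks):
        s = 255.0 * (sal - sal.min()) / (np.ptp(sal) + 1e-12)
        g = gt.astype(bool)
        p_t, r_t = [], []
        for t in thresholds:
            b = s >= t
            inter = np.logical_and(b, g).sum()
            p_t.append(inter / max(b.sum(), 1))   # Eq. (11)
            r_t.append(inter / max(g.sum(), 1))   # Eq. (12)
        precisions.append(p_t)
        recalls.append(r_t)
        # One F_beta per frame; the threshold used here (128) is an assumption.
        p, r = p_t[128], r_t[128]
        fscores.append((1 + beta2) * p * r / (beta2 * p + r + 1e-12))  # Eq. (13)
    return (np.mean(precisions, axis=0), np.mean(recalls, axis=0),
            float(np.mean(fscores)))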
C. Results

Our method is compared to five state-of-the-art video saliency methods in terms of Precision-Recall and F-score. On the SegTrack v2 and Fukuchi datasets we outperform all other approaches by a large margin in terms of F-score. The video sequences of the SegTrack v2 dataset are quite diverse: some videos contain more than one dynamic object, so the motion cue alone is not enough to predict saliency, which obliges us to add a spatial cue (color). The spatial cue fused with the temporal cue helps to separate the salient object from the background and to highlight it. On the Fukuchi dataset, all videos contain only one dynamic object, which explains the good F-score value and therefore the finer precision of the object shape. GVS [29] computes spatial and temporal edges of each object in the video frame. This method is very efficient when the video frame has one moving object, which is why its results decrease slightly when applied to SegTrack v2, whose video frames have varied conditions (as explained in the previous paragraph); on the Fukuchi dataset, where all videos contain one dynamic object, its results are better. The PR curve and F-score values of GB [11] (see Table I) are not good because this method does not include motion as a saliency cue; even if it is used here for video saliency detection, it is more suited to image saliency. The video saliency method of Mancas et al., RR [22], exploits the optical flow to select motion in a crowd. For videos with a stable camera, such as video surveillance applications, this method is very effective, but it does not produce good results on the video frames of the SegTrack v2 and Fukuchi datasets, where camera motion causes noise. ITTI [13] detects surprising events in videos by fusing different saliency maps, including motion, color, intensity, orientation and flicker features, into a final saliency map; a surprising location can be triggered by any of these features. The main reason why the last two methods do not perform very well on the SegTrack v2 and Fukuchi datasets is that both use spatial features besides motion features, so a static pixel which belongs to the background can be considered salient. Precision-Recall curves on the SegTrack v2 and Fukuchi datasets are reported in Fig. 3 and Fig. 2, where our proposed method outperforms the other methods. The recall values of RR [22] and GVS [29] become very small when the threshold approaches 255 and can even go down to 0 for ITTI [13], RT [24] and GB [11]. On the Fukuchi dataset we obtain the best precision rate, which shows that our proposed method is very efficient and provides very precise salient objects. On the SegTrack v2 dataset we obtain competitive results compared to GVS [29], which also indicates that our saliency maps are informative of the region of interest. Moreover, on the SegTrack v2 dataset the minimum recall value does not go down to zero, which means that even in its worst cases our method detects the region of interest with good response values. Fig. 4 presents a visual comparison of the saliency maps produced by our approach against state-of-the-art methods.

V. CONCLUSION

In this paper we proposed a video saliency detection method that fuses temporal cues with the spatial information of a local patch. Our contribution consists in measuring the local motion estimation of a patch from a video frame. The experiments have demonstrated that our method outperforms state-of-the-art methods in terms of Precision-Recall and F-score metrics on two benchmark datasets. In future work, we will try to extend our method into a real-time application which can be installed on a mobile device and used for various purposes.
Figure 1: Impact of our motion feature on the saliency map. From left to right: input frame, ground truth, saliency map with local motion estimation, saliency map without local motion estimation.

Table I: F-score values on the benchmark datasets
Method   Fukuchi   Segtrackv2
Ours     0.8639    0.6380
GVS      0.7243    0.633
RR       0.5514    0.5701
RT       0.3673    0.5514
ITTI     0.5667    0.4295
GB       0.5393    0.4807
REFERENCES
[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, Frequency-tuned salient region detection, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 1597–1604.
Figure 2: Precision-Recall curve on Fukuchi dataset
[2] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, Salient object detection: A benchmark, IEEE Transactions on Image Processing, 24 (2015), pp. 5706–5722.
[3] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, Global contrast based salient region detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (2015), pp. 569–582.
[4] Y. Fang, W. Lin, Z. Chen, C.-M. Tsai, and C.-W. Lin, A video saliency detection model in compressed domain, IEEE Transactions on Circuits and Systems for Video Technology, 24 (2014), pp. 27–38.
Figure 3: Precision-Recall curve on Segtrackv2 dataset
[5] H. Fu, X. Cao, and Z. Tu, Cluster-based co-saliency detection, IEEE Transactions on Image Processing, 22 (2013), pp. 3766–3778.

[6] K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, Saliency-based video segmentation with graph cuts and sequentially updated priors, in 2009 IEEE International Conference on Multimedia and Expo, IEEE, 2009, pp. 638–641.

[7] D. Gao, V. Mahadevan, and N. Vasconcelos, The discriminant center-surround hypothesis for bottom-up saliency, in Advances in Neural Information Processing Systems, 2008, pp. 497–504.
Figure 4: Visual comparison of saliency maps generated by 6 different methods, including our method, GVS [29], GB [11], RR [22], RT [24] and ITTI [13]. Columns, from left to right: input image, ground truth (GT), Ours, GVS, GB, RR, RT, ITTI.
[8] D. Gao and N. Vasconcelos, Bottom-up saliency is a discriminant process, in IEEE 11th International Conference on Computer Vision (ICCV), IEEE, 2007, pp. 1–6.

[9] S. Goferman, L. Zelnik-Manor, and A. Tal, Context-aware saliency detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 34 (2012), pp. 1915–1926.

[10] C. Guo, Q. Ma, and L. Zhang, Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2008, pp. 1–8.

[11] J. Harel, C. Koch, P. Perona, et al., Graph-based visual saliency, in NIPS, vol. 1, 2006, p. 5.

[12] X. Hou and L. Zhang, Saliency detection: A spectral residual approach, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2007, pp. 1–8.

[13] L. Itti and P. Baldi, A principled approach to detecting surprising events in video, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, IEEE, 2005, pp. 631–637.
[14] H. Kim, Y. Kim, J.-Y. Sim, and C.-S. Kim, Spatiotemporal saliency detection for video sequences based on random walk with restart, IEEE Transactions on Image Processing, 24 (2015), pp. 2552–2564.

[15] J. Kim, D. Han, Y.-W. Tai, and J. Kim, Salient region detection via high-dimensional color transform, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 883–890.

[16] D. A. Klein and S. Frintrop, Center-surround divergence of feature statistics for salient object detection, in IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 2214–2219.

[17] S.-H. Lee, J.-H. Kim, K. P. Choi, J.-Y. Sim, and C.-S. Kim, Video saliency detection based on spatiotemporal feature learning, in IEEE International Conference on Image Processing (ICIP), IEEE, 2014, pp. 1120–1124.

[18] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, Video segmentation by tracking many figure-ground segments, in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2192–2199.

[19] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, The secrets of salient object segmentation, in IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 280–287.

[20] B. D. Lucas, T. Kanade, et al., An iterative image registration technique with an application to stereo vision, in IJCAI, vol. 81, 1981, pp. 674–679.

[21] V. Mahadevan and N. Vasconcelos, Spatiotemporal saliency in dynamic scenes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32 (2010), pp. 171–177.

[22] M. Mancas, N. Riche, J. Leroy, and B. Gosselin, Abnormal motion selection in crowds using bottom-up saliency, in 18th IEEE International Conference on Image Processing (ICIP), IEEE, 2011, pp. 229–232.

[23] T. Mauthner, H. Possegger, G. Waltner, and H. Bischof, Encoding based saliency detection for videos and images, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2494–2502.

[24] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, Segmenting salient objects from images and videos, Computer Vision–ECCV 2010, (2010), pp. 366–379.

[25] H. J. Seo and P. Milanfar, Static and space-time visual saliency detection by self-resemblance, Journal of Vision, 9 (2009), pp. 15–15.

[26] A. Singh, C.-H. H. Chu, and M. Pratt, Learning to predict video saliency using temporal superpixels, in Pattern Recognition Applications and Methods, 4th International Conference on, 2015, pp. 201–209.

[27] A. M. Treisman and G. Gelade, A feature-integration theory of attention, Cognitive Psychology, 12 (1980), pp. 97–136.

[28] D. Walther and C. Koch, Modeling attention to salient proto-objects, Neural Networks, 19 (2006), pp. 1395–1407.

[29] W. Wang, J. Shen, and F. Porikli, Saliency-aware geodesic video object segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3395–3402.

[30] H.-H. Yeh, K.-H. Liu, and C.-S. Chen, Salient object detection via local saliency estimation and global homogeneity refinement, Pattern Recognition, 47 (2014), pp. 1740–1750.

[31] S.-H. Zhong, Y. Liu, F. Ren, J. Zhang, and T. Ren, Video saliency detection via dynamic consistent spatiotemporal attention modelling, in AAAI, 2013, pp. 1063–1069.