Shot Boundary Detection by a Hierarchical Supervised Approach

G. Cámara-Chávez∗†, F. Precioso∗, M. Cord∗, S. Philipp-Foliguet∗, A. de A. Araújo†

∗ Equipe Traitement des Images et du Signal, ENSEA / CNRS UMR 8051, 6 avenue du Ponceau, 95014 Cergy-Pontoise, France
† Federal University of Minas Gerais - DCC, Av. Antônio Carlos 6627, 31270-010, MG, Brazil
Abstract – Video shot boundary detection plays an important role in video processing. It is the first step toward video-content analysis and content-based video retrieval. We develop a hierarchical approach for shot boundary detection based on the assumption that a hierarchy helps in taking decisions by reducing the number of indeterminate transitions. Our method first detects abrupt transitions using a learning-based approach; the remaining frames are then split into gradual transitions and normal frames. We describe in this paper a machine learning system for shot boundary detection whose core is a kernel-based SVM classifier. We present results obtained on the TRECVID 2006 shot boundary detection task.

Keywords: shot boundary detection, cut, gradual transition, dissolve, fade.
1. INTRODUCTION

The first step for video-content analysis, content-based video browsing and retrieval is the partitioning of a video sequence into shots. A shot is defined as an image sequence presenting continuous action, captured by a single operation of a single camera. The junction of two shots can be of two types: abrupt transitions (ATs) and gradual transitions (GTs). According to the editing process, 99% of all edits fall into one of the following three categories: cuts, fades, or dissolves [1]. An AT is an instantaneous change from one shot to another. In a fade-in, a shot gradually appears from a constant image, while in a fade-out a shot gradually disappears into a constant image. In a dissolve, the current shot fades out while the next shot fades in. A common approach to detect GTs is to compute the difference between two adjacent frames (with color, motion, edge and/or texture features) and to compare this difference to a preset threshold [1], [2]. The main drawback of these approaches lies in selecting an appropriate threshold for different kinds of videos. Another way is to see this problem as a categorization task. Most of the works based on this approach consider a low number of features because of computational and classifier limitations; to compensate for this lack of information, they need pre-processing steps, like motion compensation [3] or illumination normalization. In this work we limit our attention to ATs, fades and dissolves. We propose to use a supervised kernel-based SVM (Support Vector Machine) technique with a hierarchical classification procedure composed of two stages. In the first stage, we extract features computed on frame differences in order to detect ATs. The features are classified by a kernel-based SVM classifier, because of its well-known performance in statistical learning and information retrieval [4]. Kernel functions can efficiently deal with a large number of features. With many features it is possible to better describe the frame information and to better handle illumination changes and fast motion; pre-processing steps are thus no longer necessary. Once the video sequence is segmented into AT-free segments, we can execute the second stage: we extract new features in each segment in order to detect possible GTs, without using any sliding window as most authors do [3], and we classify the candidate frames with the same kernel-based SVM classifier. We present satisfying results obtained at the TRECVID 2006 shot boundary detection task. The paper is organized as follows: Section 2 describes the features and the learning scheme for cut detection, Section 3 presents gradual transition detection, Section 4 reports our TRECVID 2006 experiments, and Section 5 concludes.
2. ABRUPT TRANSITION SEGMENTATION

Cuts generally correspond to an abrupt change between two consecutive frames of the sequence. Automatic detection is based on information extracted from the video (brightness, color distribution, motion, edges, etc.). Cut detection between shots with little motion and constant illumination is usually done by looking for sharp brightness changes. However, brightness changes cannot easily be related to a transition between two shots in the presence of continuous object motion, camera movement, or changes of illumination. Thus, we need to combine different and more complex visual features to avoid such problems. We extract, for every frame in the video stream, one feature vector concatenating the following visual information: several color histograms (RGB, HSV and opponent color), Zernike moments (Zm), Fourier-Mellin moments (Fm), projection histograms (horizontal and vertical) and phase correlation [5]. Then a pairwise similarity measure is calculated. We test different distance metrics: L1 norm, cosine dissimilarity, histogram intersection and χ2 distance. Finally, the dissimilarity feature vector (one distance per type of feature: color histograms, moments and projection histograms) of each pair of consecutive frames is used as input to the classifier [5].
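To make this stage concrete, the sketch below computes a per-frame color histogram and the four dissimilarity measures between consecutive frames. It is a minimal sketch under simplifying assumptions: only an RGB histogram stands in for the full feature set (the actual system also concatenates HSV and opponent-color histograms, Zernike and Fourier-Mellin moments, projection histograms and phase correlation), and the function names are illustrative, not taken from our implementation.

```python
import numpy as np

def rgb_histogram(frame, bins=16):
    """L1-normalized per-channel RGB histogram of an HxWx3 uint8 frame."""
    hist = np.concatenate([
        np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(np.float64)
    return hist / hist.sum()

def dissimilarities(h1, h2, eps=1e-10):
    """The four pairwise measures tested: L1 norm, cosine dissimilarity,
    histogram intersection (in dissimilarity form) and chi2 distance."""
    l1 = np.abs(h1 - h2).sum()
    cosine = 1.0 - h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + eps)
    intersection = 1.0 - np.minimum(h1, h2).sum()
    chi2 = 0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()
    return np.array([l1, cosine, intersection, chi2])

def pairwise_features(frames):
    """Dissimilarity feature vector for each pair of consecutive frames,
    used as input to the kernel-based SVM classifier."""
    hists = [rgb_histogram(f) for f in frames]
    return np.stack([dissimilarities(hists[i], hists[i + 1])
                     for i in range(len(hists) - 1)])
```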
3. GRADUAL TRANSITION SEGMENTATION

The visual patterns of many gradual transitions are not as clearly or uniquely defined as those of abrupt transitions.
Among the numerous types of gradual transitions, the dissolve is considered the most common one, but also the most difficult to detect, particularly in real videos.

3.1 Dissolve detection

Once the video sequence is segmented into cut-free segments, we apply a three-step dissolve detection: a coarse gradual transition pattern detection based on curve matching, a refinement level based on an improved gradual transition modeling error, and a learning level for separating gradual transitions from non-gradual transitions. Since gradual transitions have highly variable lengths, from 6 to almost 100 frames, the number of frames to consider for the detection is difficult to specify, as in [3]. Our machine learning approach allows us to avoid using a sliding window. Furthermore, gradual transitions are not as clearly, or uniquely, defined as ATs. The dissolve modeling error, proposed by Won et al. [6], is the difference between an ideal dissolve shape and the shape of the temporal variation curve of the illumination variance. For the first step, we extend the dissolve modeling error to the Effective Average Gradient (EAG) [7]: in the presence of a dissolve, both the illumination variance curve and the EAG curve can be matched with a parabola. However, not all video segments whose curves exhibit a parabola shape are dissolves. Thus, in the second step, we run a verification over all the video segments validated by the dissolve modeling error step, using a modification of the Double Chromatic Difference (DCD) test.
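To make the first, coarse step concrete, the sketch below fits a parabola to the temporal variance (or EAG) curve of a candidate segment and checks the shape and the fitting error. The function names and the max_error threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def parabola_match(curve, max_error=0.5):
    """Least-squares parabola fit to a temporal curve (illumination variance
    or EAG of a candidate segment). A dissolve produces a valley-shaped
    ("downward-parabolic") curve, i.e. a positive fitted quadratic
    coefficient, with a small normalized modeling error."""
    curve = np.asarray(curve, dtype=np.float64)
    t = np.arange(len(curve))
    coeffs = np.polyfit(t, curve, 2)                 # a*t^2 + b*t + c
    residual = np.abs(np.polyval(coeffs, t) - curve).mean()
    error = residual / (curve.std() + 1e-10)         # scale-free modeling error
    return coeffs[0] > 0 and error < max_error

def dissolve_candidate(variance_curve, eag_curve):
    """Coarse first-step test: both curves should match a parabola."""
    return parabola_match(variance_curve) and parabola_match(eag_curve)
```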
Proposed by Yu et al. [7], the DCD feature can differentiate a dissolve from zooms, pans and wipes:

DCD(t) = \sum_{x,y} F\left( \left| \frac{f(x,y,S) + f(x,y,E)}{2} - f(x,y,t) \right| \right)    (1)

where t represents time, S ≤ t ≤ E, f(x,y,S) is the starting frame and f(x,y,E) the ending frame of a possible dissolve interval, and F(·) is a fixed threshold function. Let C = argmin_t DCD(t), t ∈ [S, E], be the position of the minimum value of the downward-parabolic shape (normally its center). We propose a modification of this well-known descriptor that strongly reduces the complexity of its computation: we use projection histograms (1D) instead of the frame content (2D). Projection histograms allow us not only to reduce the amount of data involved in the DCD test but also to preserve color and spatial information. For our modified DCD, the formulation of Eq. (1) remains the same if f(x,y,t) is replaced by the projection histogram of frame t. Furthermore, projection histograms are already extracted for each frame in the first step of our shot boundary detector. Then, in the last step, we classify the remaining video intervals with our machine learning core, using the same kernel-based SVM classifier and the following features:

1) Ddata: information extracted from the dissolve region:
a) 2 correlation values, one between S and C, the other between C and E of the candidate interval;
b) 2 color histogram differences, one between S and C, the other between C and E of the candidate interval;
c) correlation by blocks of interest in the sequence: this feature is computed only on the target intervals and uses the dissolve descriptor [?];
2) DCD features: a) the quadratic coefficient of the parabola best approximating the DCD curve [8]; b) the "depth" of this parabola, defined as the height difference between C and S (or E) [9];
3) SCD features: the modified DCD; we use the same features as for the DCD (previous item);
4) VarProj: difference of the projection histograms extracted in the first step (cut detection);
5) Motion: motion vectors are also extracted in the first step, when the phase correlation is computed; for each block we compute the magnitude of the motion vector.

We concatenate them into one feature vector given as input to our kernel-based SVM classifier in order to label video segments as "dissolve" or "non-dissolve".
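The sketch below illustrates the modified DCD of Eq. (1) on projection histograms, together with the two DCD-derived features. It is a minimal sketch assuming frames are available as numpy arrays; the fixed threshold value and the helper names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def projection_histograms(frame):
    """Horizontal and vertical projection histograms of a frame (2D -> 1D)."""
    gray = frame.mean(axis=2) if frame.ndim == 3 else frame
    return np.concatenate([gray.sum(axis=0), gray.sum(axis=1)])

def modified_dcd(frames, s, e, thresh=10.0):
    """Eq. (1) evaluated on projection histograms instead of full frames.
    F(.) is realized here as an elementwise fixed-threshold count
    (thresh is an illustrative value)."""
    p_s = projection_histograms(frames[s])
    p_e = projection_histograms(frames[e])
    curve = np.array([
        (np.abs(0.5 * (p_s + p_e) - projection_histograms(frames[t])) > thresh).sum()
        for t in range(s, e + 1)
    ], dtype=np.float64)
    c = int(curve.argmin())                              # C = argmin_t DCD(t)
    a = np.polyfit(np.arange(len(curve)), curve, 2)[0]   # quadratic coefficient
    depth = 0.5 * (curve[0] + curve[-1]) - curve[c]      # "depth" of the parabola
    return curve, c, a, depth
```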
3.2 Fade detection

Fade-in and fade-out often occur together as a fade group: a fade group starts with a shot fading out to a color C, which is followed by a sequence of monochrome frames of the same color, and ends with a shot fading in from color C. Fade groups formed in this way are often referred to as a single fade. Alattar detects fades by recording all negative spikes in the second derivative of the frame luminance variance curve [10]; the drawback of this approach is that motion also causes such spikes. Lienhart proposes detecting fades by fitting a regression line on the frame standard deviation curve [8]. Truong et al. observe the mean difference curve, examining the constancy of its sign within a potential fade region [11]. We present further extensions of these techniques. Since a fade is a special case of a dissolve, we can exploit some of the features used for dissolve detection. The salient features of our two-step fade detection algorithm are the following:
1) The existence of monochrome frames is a very good clue for detecting all potential fades, and these are used in our algorithm. In a quick fade, the monochrome sequence may consist of a single frame, while in a slower fade it can last up to 100 frames [11]. Therefore, detecting monochrome frames (the candidate region) is the first step of our algorithm.
2) In the second step, we use a descriptor that characterizes a dissolve: our improved DCD. The variance curves of fade-out and fade-in frame sequences have a half-parabolic shape, independent of C. Therefore, if we compute the DCD feature in the region where the fade-out occurs we obtain a parabola shape; the same principle applies to detect the fade-in.
3) We also constrain the variance of the starting frame of a fade-out and of the ending frame of a fade-in to be above a threshold, to eliminate false positives caused by dark scenes, thus preventing them from being considered as monochrome frames.

Some of the techniques used for detecting fades are not tolerant to fast motion, which produces effects similar to a fade. The DCD feature is more tolerant to motion and other editing effects or combinations thereof. Our modified DCD feature preserves all the characteristics of the feature presented in [7], with the advantage of reducing the dimension of the features from 2D to 1D.
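The first step above reduces to a simple variance test; the sketch below detects monochrome frames and groups them into candidate fade regions, each of which is then checked for the two downward parabolas of step 2. The variance threshold of 200 is the value used in our experiments (Section 4); the function names are illustrative.

```python
import numpy as np

def is_monochrome(frame, var_threshold=200.0):
    """Step 1: a frame whose intensity variance falls below the threshold
    is a monochrome-frame candidate (threshold 200 in our experiments)."""
    gray = frame.mean(axis=2) if frame.ndim == 3 else frame
    return gray.var() < var_threshold

def fade_candidate_regions(frames, var_threshold=200.0):
    """Group consecutive monochrome frames into candidate fade regions;
    each region is then checked for the fade-out and fade-in parabolas."""
    regions, start = [], None
    for i, frame in enumerate(frames):
        if is_monochrome(frame, var_threshold):
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(frames) - 1))
    return regions
```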
4. EXPERIMENTS

We evaluated our shot extraction module by participating in the Shot Boundary Detection (SBD) task of TRECVID 2006. Our training set for AT detection consisted of a single video of 9078 frames (5 min 2 s). This video was captured from a Brazilian TV station and is composed of a segment of commercials; we labeled it manually ourselves. Our training set for GT detection consisted of the TRECVID 2002 video corpus. We used an SVM classifier trained with a Gaussian kernel based on the χ2 distance. We ran our algorithm on the TRECVID 2006 test set and report the mean precision and mean recall obtained; a good detector should have both high precision and high recall. The test collection comprises about 7.5 hours of video: 13 videos for a total size of about 4.64 GB, 597,043 frames and 3785 transitions. The collection contains 1844 abrupt transitions (48.7% of all transitions) and 1509 dissolves (39.9%). The reference data was created by a student at the National Institute of Standards and Technology (NIST) whose task was to identify all transitions. For fade detection we chose a threshold of 200 for the variance of each frame: if the variance is lower than this value, the frame is considered monochrome and a possible fade; we then check whether the interval presents two downward parabolas, one for the fade-in and one for the fade-out.
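The sketch below shows one common way of combining the χ2 distance with a Gaussian-type kernel as a precomputed Gram matrix; the gamma hyperparameter and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def chi2_gaussian_gram(X, Y, gamma=1.0, eps=1e-10):
    """Gram matrix K[i, j] = exp(-gamma * chi2(X[i], Y[j])): a Gaussian-type
    kernel built on the chi2 distance between feature vectors. gamma is an
    illustrative hyperparameter, to be tuned (e.g., by cross-validation)."""
    K = np.empty((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            chi2 = 0.5 * (((x - y) ** 2) / (x + y + eps)).sum()
            K[i, j] = np.exp(-gamma * chi2)
    return K
```

Such a precomputed Gram matrix can be fed to an off-the-shelf SVM implementation, for instance scikit-learn's SVC(kernel='precomputed') in a modern setting.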
Figure 1: Participants of TRECVID 2006 Evaluation
In Table 1 we present the visual feature vectors for dissolve detection used for the 10 runs of our experiment; the runs for AT detection are the same as presented in [5].

Table 1: 10 best combinations of visual features for dissolve detection

Run      Features
Etis1    Ddata, VarProj
Etis2    Ddata, Motion
Etis3    Ddata, DCD
Etis4    Ddata, DCD, SCD
Etis5    Ddata, DCD, VarProj
Etis6    Ddata, DCD, Motion
Etis7    Ddata, SCD
Etis8    Ddata, SCD, VarProj
Etis9    Ddata, SCD, Motion
Etis10   Ddata, SCD, VarProj, Motion
Figure 2: Abrupt transition extraction, TRECVID 2006 (zoom)

Fig. 1 lists the teams that participated in TRECVID 2006 and the markers that represent them in Figs. 2, 3 and 4. Fig. 2 presents the abrupt transition results for 10 runs with different combinations of features, different dissimilarity measures, and various kernels for the SVM classification. Fig. 3 presents the gradual transition results for 10 other runs. The results show that our system performs better than other learning-based systems that use pre-processing steps like motion compensation [12] or post-processing steps like flashlight filtering [13]. In Table 2 we show the performance of two runs of our system for GT detection, measured in recall and precision, and in frame-recall and frame-precision (measures evaluating how well reported gradual transitions overlap with the reference transitions). Here we compare the accuracy of the DCD method of Yu et al. [7] (Etis3) and of our modified DCD method (Etis7): we reduce the computational complexity from a 2D descriptor to a 1D descriptor while preserving the performance of the DCD method.
Table 2: Results for DCD features and modified DCD

Run     Recall   Precision   Frame-Recall   Frame-Precision
Etis3   0.632    0.825       0.775          0.849
Etis7   0.612    0.849       0.775          0.850
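Recall and precision are computed on transition counts, while frame-recall and frame-precision measure how well the frames of a reported gradual transition overlap with those of the matched reference transition. The sketch below is a simplified illustration of these frame-level measures, assuming transitions are given as inclusive [start, end] frame intervals; the official TRECVID matching protocol differs in its details.

```python
def overlap(a, b):
    """Number of frames shared by two inclusive [start, end] intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def frame_measures(detected, reference):
    """Simplified frame-recall / frame-precision over matched transitions:
    each reference transition is paired with its best-overlapping detection,
    and overlapping, detected and reference frame counts are accumulated."""
    ov = det_frames = ref_frames = 0
    for r in reference:
        best = max(detected, key=lambda d: overlap(d, r), default=None)
        if best is None or overlap(best, r) == 0:
            continue                      # missed transition: no overlap
        ov += overlap(best, r)
        det_frames += best[1] - best[0] + 1
        ref_frames += r[1] - r[0] + 1
    frame_recall = ov / ref_frames if ref_frames else 0.0
    frame_precision = ov / det_frames if det_frames else 0.0
    return frame_recall, frame_precision
```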
Figure 3: Gradual transition extraction, TRECVID 2006 (zoom)
Our learning approach proved to be quite robust, since the test database involves videos from different cameras, with different compression formats and codings, and from different countries, situations, qualities and lengths. The selected features remained relevant and stable for detecting shot transitions in different contexts and environments. Fig. 4 shows frame-recall versus frame-precision. All markers corresponding to our system's runs lie almost at the same place: our best GT detector is also very accurate on the transitions it finds, and all our runs have roughly the same accuracy. Our results are among the best ones. The density of all our run markers, regardless of the changes made in the system, demonstrates the generalization capacity of our kernel-based SVM classifier.

Figure 4: Gradual transition frame-precision / frame-recall, TRECVID 2006
5. CONCLUSION

In this paper we presented our learning-based system for shot boundary detection, with a hierarchical classification procedure composed of two stages. The first stage is dedicated to AT detection based on a machine learning approach; we then seek GTs inside the shots delimited by the ATs and fades detected in the first stage. Our hierarchical learning system reduces GT detection to a two-class problem: fast motion or dissolve. Even though TRECVID 2002 (our GT training set) is of poor quality [14], the performance of our system is good and we are among the best teams. We improve existing algorithms and detect shot boundaries automatically, without setting any threshold or parameter. Our method merges many different and very complex types of video features in an efficient way, avoiding any tuning or pre-processing, and provides results that are robust and stable with respect to the training data set.

Acknowledgement

We thank NIST and the TRECVID organisation for allowing us to present our algorithm evaluation here. The authors are grateful to the MUSCLE Network of Excellence, CNPq and CAPES for the financial support of this work.

REFERENCES

[1] R. Lienhart, C. Kuhmunch, and W. Effelsberg, "On the detection and recognition of television commercials," IEEE Int. Conf. on Multimedia Computing and Systems (ICMCS'97), pp. 509–516, 1997.
[2] R. Jesus, J. Magalhães, A. Yavlinsky, and S. Rüger, "Imperial College at TRECVID," in TREC Video Retrieval Evaluation Online Proceedings, 2005.
[3] R. Ewerth, C. Beringer, T. Kopp, M. Niebergall, T. Stadelmann, and B. Freisleben, "University of Marburg at TRECVID 2005: Shot boundary detection and camera motion estimation results," in TREC Video Retrieval Evaluation Online Proceedings, 2005.
[4] P.-H. Gosselin and M. Cord, "Precision-oriented active selection for interactive image retrieval," in International Conference on Image Processing (ICIP'06), October 2006, pp. 3127–3200.
[5] G. Cámara-Chávez, M. Cord, F. Precioso, S. Philipp-Foliguet, and A. de A. Araújo, "Video segmentation by supervised learning," in 19th Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI'06), Oct. 2006, pp. 365–372.
[6] J.-U. Won, Y.-S. Chung, I.-S. Kim, J.-G. Choi, and K.-H. Park, "Correlation based video-dissolve detection," in International Conference on Information Technology: Research and Education, 2003, pp. 104–107.
[7] H. Yu, G. Bozdagi, and S. Harrington, "Feature-based hierarchical video segmentation," in International Conference on Image Processing (ICIP'97), 1997, vol. 2, pp. 498–501.
[8] R. Lienhart, "Reliable transition detection in videos: A survey and practitioner's guide," IJIG, vol. 1, no. 3, pp. 469–486, 2001.
[9] A. Hanjalic, "Shot boundary detection: Unraveled and resolved?," IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 2, pp. 90–105, 2002.
[10] A. M. Alattar, "Detecting and compressing dissolve regions in video sequences with a DVI multimedia image compression algorithm," IEEE International Symposium on Circuits and Systems (ISCAS), vol. 1, pp. 13–16, 1993.
[11] B. T. Truong, C. Dorai, and S. Venkatesh, "New enhancements to cut, fade, and dissolve detection processes in video segmentation," in MULTIMEDIA '00: Proceedings of the Eighth ACM International Conference on Multimedia, 2000, pp. 219–227.
[12] R. Ewerth, M. Mühling, T. Stadelmann, E. Qeli, B. Agel, D. Seiler, and B. Freisleben, "University of Marburg at TRECVID 2006: Shot boundary detection and rushes task results," in TREC Video Retrieval Evaluation Online Proceedings, 2006.
[13] N. Manickam, N. Sawant, A. Parnami, S. Lingamneni, and S. Chandran, "Indian Institute of Technology, Bombay at TRECVID 2006," in TREC Video Retrieval Evaluation Online Proceedings, 2006.
[14] M. Cooper, "Video segmentation combining similarity analysis and classification," in Proc. of the 12th Annual ACM International Conference on Multimedia (MULTIMEDIA '04), 2004, pp. 252–255.