Joint Key-frame Extraction and Object-based Video Segmentation*

Xiaomu Song
School of Electrical and Computer Engineering
Oklahoma State University
Stillwater, OK 74078, USA
[email protected]

Guoliang Fan
School of Electrical and Computer Engineering
Oklahoma State University
Stillwater, OK 74078, USA
[email protected]

* This work was supported by the National Science Foundation (NSF) under Grant IIS-0347613 (CAREER).

Abstract

In this paper, we propose a coherent framework for joint key-frame extraction and object-based video segmentation. Conventional key-frame extraction and object segmentation are usually implemented independently and separately because they operate at different semantic levels, which ignores the inherent relationship between key-frames and objects. The proposed method extracts a small number of key-frames within a shot so that the divergence between video objects in a feature space is maximized, supporting robust and efficient object segmentation. This method combines the advantages of temporal and object-based video segmentation, and helps build a unified framework for content-based analysis and structured video representation. Theoretical analysis and simulation results on both synthetic and real video sequences demonstrate the efficiency and robustness of the proposed method.
1. Introduction

Video segmentation is a fundamental step towards structured video representation, which supports the interpretability and manipulability of visual data. Depending on the semantic level, video segmentation generally falls into two categories: temporal segmentation and object-based segmentation. A video sequence comprises a group of video shots, and a video shot is an unbroken sequence of frames captured from one perspective. Temporal video segmentation partitions a video sequence into a set of shots, and some key-frames are extracted to represent each shot. In this work, we only consider key-frame extraction within one shot, which can be carried out by a clustering process based on similarity measurements [22, 11] or by statistical modeling processes [10].
Extracted key-frames can provide a compact representation for video indexing and browsing, but they cannot support content-based video analysis at a higher semantic level [5]. Object-based video segmentation extracts objects for content-based analysis and provides structured representations for many object-oriented video applications. Current object-based video segmentation methods can be classified into three types: segmentation with spatial priority, segmentation with temporal priority, and joint spatial and temporal segmentation [17]. More recent interest lies in joint spatial and temporal video segmentation [3, 8, 9, 21, 6], because human vision recognizes salient video structures jointly in the spatial and temporal domains [7]. Hence, both spatial and temporal pixel-wise features are extracted to construct a multi-dimensional feature space for object segmentation. Compared with key-frame extraction methods using frame-wise features, e.g., color histograms, these approaches are usually more computationally expensive.

Because they operate at different semantic levels, key-frame extraction and object segmentation are usually implemented independently and separately. The work in [5] presents a universal framework where key-frame extraction and object segmentation independently support content-based video analysis at different semantic levels, and their results can only be unified via a high-level description. To make content analysis and representation more efficient, comprehensive, and flexible, it is helpful to exploit the inherent relationship between key-frame extraction and object segmentation. In addition, the MPEG-7 standard provides a generic segment-based representation model for video data [16], and both key-frame extraction and object segmentation could be grouped into a unified paradigm, where video key-frames are extracted to support efficient and robust object segmentation and to facilitate the construction of the universal description scheme suggested in [5]. Recently, we proposed a combined key-frame extraction and object-based video segmentation method in [15], where the extracted key-frames are used to estimate statistical models for model-based object segmentation, and the object segmentation results are used to further refine the initially extracted key-frames. This approach significantly reduces the model estimation time compared with [8, 9], and provides more representative key-frames. However, the relationship between key-frame extraction and object segmentation is not yet explicit in this approach: it is not shown how key-frame extraction affects object segmentation. In addition, some predefined and data-dependent thresholds are needed, and they influence the final results. In this work, we attempt to exploit an explicit relationship between key-frame extraction and object segmentation, and propose a coherent framework for joint key-frame extraction and object segmentation. The key point is to treat key-frame extraction as a feature selection process, where the maximum average interclass Kullback-Leibler distance (MAIKLD) criterion is used with an efficient key-frame extraction method. Compared with [15], the proposed method provides an explicit relationship between key-frame extraction and object segmentation.
2. Unified Feature Space

Video key-frame extraction and object segmentation are usually based on different feature subsets. A unified feature subset is necessary for joint key-frame extraction and object-based video segmentation. This feature subset should contain both spatial and temporal features that are easy to extract. In this work we use the pixel-wise 7-D feature vector suggested in [15], including the YUV color features, the x-y spatial location, the time T, and the intensity change over time, which provides additional motion information.

The original idea comes from feature selection in pattern recognition. Given a candidate feature set X = {x_i | i = 1, 2, ..., n}, where i is the feature index, feature selection aims at selecting a subset \tilde{X} = {x_i | i = 1, 2, ..., m}, m < n, from X so that an objective function F(\tilde{X}) related to classification performance is optimized:

    \tilde{X} = \arg\max_{Z \subseteq X} F(Z).    (1)
Generally, the goal of feature selection is to reduce the feature dimension. In this work, we apply feature selection to extract video key-frames rather than to reduce the feature dimension. According to [1], the video frames within a shot represent a spatially and temporally continuous action, and they share common visual and often semantically related characteristics, resulting in tremendous redundancy. Since a video shot should be characterized both spatially and temporally, a set of key-frames could be enough to model the object behavior in the shot. Moreover, by extracting a set of representative key-frames that supports a salient and condensed object representation in the feature space, we can obtain a compact video representation and efficient object segmentation simultaneously. Thus, the issue is how to find a set of key-frames that can facilitate object segmentation.
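As a concrete illustration of the unified feature space, the pixel-wise 7-D feature vector described above (YUV color, x-y location, time, and intensity change) might be assembled as in the following sketch. This is illustrative Python under our own assumptions; the names build_feature_space and frames_yuv are hypothetical, and the intensity change is approximated by the frame-to-frame luminance difference.

    import numpy as np

    def build_feature_space(frames_yuv):
        # frames_yuv: (T, H, W, 3) array of YUV frames for one shot.
        # Returns (T*H*W, 7) pixel-wise features: Y, U, V, x, y, t, dI,
        # where dI is the luminance change from the previous frame (assumption).
        T, H, W, _ = frames_yuv.shape
        ys, xs = np.mgrid[0:H, 0:W].astype(float)
        feats = []
        for t in range(T):
            frame = frames_yuv[t].astype(float)
            prev = frames_yuv[t - 1, ..., 0].astype(float) if t > 0 else frame[..., 0]
            d_int = np.abs(frame[..., 0] - prev)          # motion cue from luminance change
            t_col = np.full((H, W), float(t))
            f = np.stack([frame[..., 0], frame[..., 1], frame[..., 2],
                          xs, ys, t_col, d_int], axis=-1)
            feats.append(f.reshape(-1, 7))
        return np.concatenate(feats, axis=0)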
Figure 1. Unified feature space.

For example, in Fig. 1, a video shot of N frames contains three objects. Outliers, including noise and insignificant objects that may appear randomly, usually cause feature space overlap among the major objects. Therefore, key-frame extraction can be treated as a feature selection process in which key-frames are extracted by minimizing the feature space overlap among the three objects. A frequently used feature selection criterion is to maximize the cluster divergence in the feature space; we discuss such a criterion and its application to joint key-frame extraction and object-based video segmentation below.
3. Proposed Method

3.1. Maximum Average Interclass Kullback-Leibler Distance
The Kullback-Leibler distance (KLD) measures the distance between two probability density functions [14]. In this section, we discuss how to apply a KLD-based feature selection method to jointly extract key-frames and objects. A frequently used criterion is to minimize the KLD between the true density and the density estimated from feature subsets. Nevertheless, this approach aims at minimizing the approximation error rather than extracting the most discriminative feature subsets. Although it is often hoped that this criterion also leads to good discrimination among classes, this assumption is not always valid [18]. For the purpose of robust classification, a divergence-based feature selection criterion is preferable [18]. Given two probability densities f_i(x) and f_j(x), the KLD between them is defined as

    KL(f_i, f_j) = \int f_i(x) \ln \frac{f_i(x)}{f_j(x)} \, dx.    (2)

The KLD is not a symmetric distance measurement; it is symmetrized by averaging KL(f_i, f_j) and KL(f_j, f_i):

    D(f_i, f_j) = \frac{KL(f_i, f_j) + KL(f_j, f_i)}{2}.    (3)
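For intuition, when each object cluster is modeled by a Gaussian (as with the GMM components used in Section 3.2), the KLD in (2) has a well-known closed form, and the symmetrized distance in (3) follows directly. The sketch below is our own illustration, not code from the paper:

    import numpy as np

    def kld_gaussian(mu0, cov0, mu1, cov1):
        # Closed-form KL(N(mu0, cov0) || N(mu1, cov1)) for multivariate Gaussians.
        d = mu0.shape[0]
        cov1_inv = np.linalg.inv(cov1)
        diff = mu1 - mu0
        return 0.5 * (np.trace(cov1_inv @ cov0)
                      + diff @ cov1_inv @ diff
                      - d
                      + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

    def symmetric_kld(mu0, cov0, mu1, cov1):
        # Symmetrized distance D(f_i, f_j) of Eq. (3).
        return 0.5 * (kld_gaussian(mu0, cov0, mu1, cov1)
                      + kld_gaussian(mu1, cov1, mu0, cov0))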
The KLD is often used as a divergence measurement between different clusters in the feature space. Ideally, the larger the KLD, the better the separability between clusters. If there are M clusters, the average interclass KLD (AIKLD) is defined as

    \bar{D} = C \sum_{i=1}^{M} \sum_{j>i}^{M} D(f_i, f_j),    (4)

where C = \frac{2}{M(M-1)}. Conventional approaches that reduce the feature dimension based on the maximum AIKLD (MAIKLD) usually have \bar{D}' \le \bar{D}, where \bar{D}' is the AIKLD of the clusters in the reduced feature space. As mentioned before, key-frame extraction is formulated as a feature selection process, and we want to extract a set of key-frames for which the average pairwise cluster divergence is maximized. Let X be the original video shot with N frames and M objects, represented as a set of frames X = {x_i, 1 \le i \le N} with cardinality |X| = N. Let Z = {x^*_i, 1 \le i \le N^*} be any subset of X with cardinality |Z| = N^* \le N. The objective function is defined as

    \tilde{X} = \arg\max_{Z \subseteq X, |Z| \le N} \bar{D}_Z,    (5)

where \tilde{X} is a subset of X that is optimal in the sense of MAIKLD, and \bar{D}_Z is the AIKLD of the M objects within Z in the 7-D feature space. We might have \bar{D}_{\tilde{X}} \ge \bar{D}_X because some frames might contain the aforementioned outliers that deteriorate the cluster separability, decreasing \bar{D}_X. Hence removing those "noisy" frames might mitigate the cluster overlapping problem. According to [2], MAIKLD is optimal in the sense of minimum Bayes error. If we assign a zero-one cost to the classification, this leads to maximum a posteriori (MAP) estimation. Therefore, an optimal solution to (5) leads to an optimal subset of key-frames that minimizes the error probability of video object segmentation. Nevertheless, it is not easy to find an optimal solution, especially when N is large, and a suboptimal but computationally efficient solution might be preferred.
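Given the pairwise distance above, the AIKLD of Eq. (4) is simply the average over all unordered cluster pairs, and Eq. (5) asks for the frame subset that maximizes it. A minimal sketch, assuming each object cluster is summarized by a Gaussian mean and covariance and reusing the symmetric_kld helper from the previous sketch:

    from itertools import combinations

    def aikld(clusters):
        # clusters: list of (mean, covariance) pairs, one per object cluster.
        # Returns the average interclass KLD of Eq. (4).
        M = len(clusters)
        if M < 2:
            return 0.0
        pairs = list(combinations(range(M), 2))
        total = sum(symmetric_kld(clusters[i][0], clusters[i][1],
                                  clusters[j][0], clusters[j][1])
                    for i, j in pairs)
        return total / len(pairs)   # C = 2 / (M * (M - 1))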
3.2. Key-Frame Extraction

Feature selection methods have been well studied, and some very good reviews can be found in [12, 13]. It is well known that exhaustive search can guarantee the optimality of the feature subset with respect to the objective function. Nevertheless, exhaustive search is computationally expensive and impractical for large feature sets. For example, if a video shot X has N frames, the exhaustive search needs to examine 2^N − 1 possible frame subsets. Various suboptimal approaches have been suggested, and among them a deterministic feature selection method called Sequential Forward Floating Selection (SFFS) shows good performance [19]. When N is not very large, the SFFS method can even provide optimal solutions for feature selection. For simplicity, we do not begin with all N frames in X but apply the method in [22, 15] to extract N′ ≤ N initial key-frames, which are usually redundant. In the following, we call these initially extracted key-frames key-frame candidates.

Based on the initial N′ key-frame candidates, a Gaussian mixture model (GMM) is used to model the video objects coherently in the unified feature space. The iterative expectation-maximization (EM) algorithm [4] is applied with the minimum description length (MDL) model selection criterion [20]. After the model estimation, the objects in all key-frame candidates are segmented out using the maximum likelihood (ML) criterion. The proposed key-frame extraction algorithm is then performed as follows, where SFFS is initialized using sequential forward selection (SFS); a code sketch of this floating search is given at the end of this subsection:

(1) Start with an empty set \tilde{X} (no key-frames); n is the cardinality of \tilde{X}, i.e., n = |\tilde{X}|, and initially n = 0.
(2) Based on the MAIKLD criterion, first use SFS to generate a combination that comprises two key-frame candidates, so that |\tilde{X}| = 2.
(3) Search for the key-frame candidate that maximizes the AIKLD when |\tilde{X}| = n + 1, add it to \tilde{X}, and let n = n + 1.
(4) If n > 2, remove one key-frame candidate from \tilde{X}, compute the AIKLD based on the remaining key-frame candidates in \tilde{X}, and go to (5); otherwise go to (3).
(5) Determine whether the AIKLD increases after removing the selected key-frame candidate. If yes, let n = n − 1 and go to (4); otherwise go to (3).

The algorithm is stopped when n reaches a certain number or the iteration count reaches a given limit (e.g., 20).

The proposed segmentation method has several significant advantages: (1) Since model estimation is based on a small number of key-frames, the proposed segmentation method is computationally efficient compared with those using all frames [8]. (2) An optimal or near-optimal set of key-frames that maximizes the AIKLD can be extracted for robust object segmentation; these key-frames are more representative than those extracted by our previous method [15]. (3) The algorithm is flexible and has no significant data-dependent thresholds. This work develops a unified framework for key-frame extraction and object segmentation, which will support more coherent content-based analysis and structured video representation.
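To make the floating search concrete, the following sketch shows one way the SFS initialization and the conditional exclusion steps above might be organized. It is our own illustrative code, not the authors' implementation: shot_aikld, frame_feats, and frame_labels are hypothetical names, the per-pixel labels are assumed to come from the GMM/ML segmentation of the key-frame candidates (e.g., via sklearn.mixture.GaussianMixture), and each object cluster is summarized by a single Gaussian for simplicity.

    import numpy as np

    def shot_aikld(selected, frame_feats, frame_labels):
        # AIKLD of the object clusters formed by the pixels of the selected frames.
        # frame_feats[k]:  (num_pixels, 7) features of candidate frame k
        # frame_labels[k]: per-pixel object labels of candidate frame k
        X = np.concatenate([frame_feats[k] for k in selected])
        y = np.concatenate([frame_labels[k] for k in selected])
        clusters = []
        for obj in np.unique(y):
            pts = X[y == obj]
            if len(pts) > X.shape[1]:          # enough samples for a covariance
                clusters.append((pts.mean(axis=0), np.cov(pts, rowvar=False)))
        return aikld(clusters)                 # helper from the Section 3.1 sketch

    def sffs_key_frames(frame_feats, frame_labels, max_frames, max_iter=20):
        # Floating search: forward additions plus conditional backward removals.
        selected = []
        for _ in range(max_iter):
            remaining = [k for k in range(len(frame_feats)) if k not in selected]
            if not remaining or len(selected) >= max_frames:
                break
            # Forward step: add the candidate that maximizes AIKLD.
            best = max(remaining,
                       key=lambda k: shot_aikld(selected + [k], frame_feats, frame_labels))
            selected.append(best)
            # Backward steps: drop a frame (other than the one just added)
            # as long as doing so increases AIKLD.
            while len(selected) > 2:
                current = shot_aikld(selected, frame_feats, frame_labels)
                drop = max((k for k in selected if k != best),
                           key=lambda k: shot_aikld([s for s in selected if s != k],
                                                    frame_feats, frame_labels))
                if shot_aikld([s for s in selected if s != drop],
                              frame_feats, frame_labels) > current:
                    selected.remove(drop)
                else:
                    break
        return selected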
4. Simulations and Discussions

The proposed method is tested on both synthetic and real video sequences. The purpose of using synthetic video is to numerically evaluate the video object segmentation performance, where we calculate segmentation accuracy, precision, and recall with respect to all moving objects. To show the validity of MAIKLD, we also compare the proposed method with our previous one in [15] on these videos. The frame size of all video sequences is 176 × 144. For convenience, we denote the method in [15] as Method-I and the proposed method as Method-II.

Figure 2. Synthetic videos: Video-A (first row), Video-B (second row).

Figure 3. Real video sequences: (a) Car, (b) People, (c) Face.

Table 1. Key-frame numbers
Video sequence         Key-frame candidates (Method-I)   Extracted key-frames (Method-II)
Video-A (36 frames)    18                                9
Video-B (36 frames)    19                                9
Car (39 frames)        10                                3
People (150 frames)     6                                3
Face (150 frames)      16                                8

Figure 4. Segmented moving objects of Videos-A and -B: (a) Method-I, (b) Method-II.

Methods-I and -II are first tested on the two synthetic video sequences, each comprising 36 frames, illustrated in Fig. 2. The first row of Fig. 2 shows three frames of Video-A, in which a circular object moves sigmoidally. There are two moving objects in Video-B, as shown in the second row of Fig. 2: one is an elliptic object that moves diagonally from the top-left to the bottom-right corner while changing size, and the other is a rectangular object moving horizontally from right to left. Some additive white Gaussian noise (AWGN) is deliberately added to the synthetic videos. The key-frame extraction is stopped after 20 SFFS iterations or when n > N′/2. The numerical results are shown in Fig. 5. As we can see, both methods have similar segmentation performance on the moving object of Video-A, while Method-II uses fewer key-frames, as listed in Table 1. In particular, both methods detect the moving object with 100% recall. Method-II outperforms Method-I on Video-B even though it uses fewer key-frames for object segmentation.
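For reference, the per-frame accuracy, precision, and recall reported in Fig. 5 can be computed from binary ground-truth and predicted masks of the moving objects as follows; this is a generic sketch, not the authors' evaluation code:

    import numpy as np

    def segmentation_metrics(pred_mask, gt_mask):
        # pred_mask, gt_mask: boolean arrays marking pixels assigned to the
        # moving object(s) in one frame.
        tp = np.logical_and(pred_mask, gt_mask).sum()
        fp = np.logical_and(pred_mask, ~gt_mask).sum()
        fn = np.logical_and(~pred_mask, gt_mask).sum()
        tn = np.logical_and(~pred_mask, ~gt_mask).sum()
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return accuracy, precision, recall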
Figure 5. Numerical results (segmentation accuracy, precision, and recall versus frame index) for (a) Video-A and (b) Video-B. Dash and solid lines indicate results of Methods-I and -II, respectively.
From Fig. 4(a) we can see that the moving rectangle cannot be discriminated from a static background object (the dark square) by Method-I. Moreover, the moving rectangle is misclassified into two separate objects in the latter part of Video-B. This indicates that Method-II can extract key-frames that are more representative and salient with respect to the video objects than Method-I.

We also compare the two methods on three real video sequences, as shown in Fig. 3. The numbers of initial key-frame candidates and finally extracted key-frames are listed in Table 1. To demonstrate the effectiveness of Method-II, we change the initial threshold for key-frame extraction in Method-I so that object segmentation is based on the same number of key-frames as in Method-II. It can be seen in Fig. 6 that, with the same number of key-frames, Method-II performs better than Method-I. In particular, if we stop the key-frame extraction for the Car video using the same criterion as that used for Videos-A and -B, both methods provide similar segmentation results. However, if we deliberately stop the key-frame extraction process when the key-frame number n > N′/3, Method-II provides much more representative key-frames for object segmentation than Method-I, as shown in Fig. 6(a) and (b). In Method-I, the key-frames are extracted using the frame-wise color histogram without local spatial information, and they might not be representative enough for video object segmentation. In Method-II, however, the key-frames are extracted by considering both spatial and temporal information in the unified feature space. Consequently, the extracted key-frames represent the dynamics of the video objects more accurately.
4.1. Conclusions

This paper presents a coherent framework for joint key-frame extraction and object-based segmentation within a video shot, where key-frames are extracted by maximizing the AIKLD of the major video objects in the unified feature space. The suggested framework provides an integrated platform in which the inherent and explicit relationship between key-frames and video objects is revealed. Simulation results on both synthetic and real video sequences show that, compared with our previous work, the proposed approach provides robust and accurate object segmentation with a more compact temporal representation of a video shot using key-frames. This work also opens a new avenue to support content-based video analysis.
References

[1] G. Davenport, T. A. Smith, and N. Pincever. Cinematic primitives for multimedia. IEEE Computer Graphics and Applications, 11(4):67–74, July 1991.
[2] H. P. Decell and J. A. Quirein. An iterative approach to the feature selection problem. In Proc. of Purdue Univ. Conf. on Machine Processing of Remotely Sensed Data, volume 1, pages 3B1–3B12, 1972.
[3] D. DeMenthon and R. Megret. Spatio-temporal segmentation of video by hierarchical mean shift analysis. Technical Report LAMP-TR-090/CAR-TR-978/CS-TR-4388/UMIACS-TR-2002-68, 2002.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., 39:1–38, 1977.
[5] A. M. Ferman, A. M. Tekalp, and R. Mehrotra. Effective content representation for video. In Proc. IEEE Int'l Conference on Image Processing, Chicago, IL, 1998.
[6] C. Fowlkes, S. Belongie, and J. Malik. Efficient spatiotemporal grouping using the Nystrom method. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 231–238, 2001.
[7] S. Gepshtein and M. Kubovy. The emergence of visual objects in space-time. In Proc. of the National Academy of Science, volume 97, pages 8186–8191, USA, 2000.
[8] H. Greenspan, J. Goldberger, and A. Mayer. A probabilistic framework for spatio-temporal video representation and indexing. In Proc. European Conf. on Computer Vision, volume 4, pages 461–475, Berlin, Germany, 2002.
[9] H. Greenspan, J. Goldberger, and A. Mayer. Probabilistic space-time video modeling via piecewise GMM. IEEE Trans. Pattern Analysis and Machine Intelligence, (3):384–396, March 2004.
[10] R. Hammoud and R. Mohr. A probabilistic framework of selecting effective key frames for video browsing and indexing. In International Workshop on Real-Time Image Sequence Analysis, 2000.
[11] A. Hanjalic and H. J. Zhang. An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis. IEEE Trans. on CSVT, 9(8):1280–1289, 1999.
[12] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(1), January 2000.
[13] A. K. Jain and D. Zongker. Feature selection: Evaluation, application, and small sample performance. IEEE Trans. Pattern Analysis and Machine Intelligence, (2):153–158, Feb. 1997.
[14] S. Kullback. Information Theory and Statistics. Dover, New York, 1968.
[15] L. Liu and G. Fan. Combined key-frame extraction and object-based video segmentation. IEEE Trans. Circuits and Systems for Video Technology, 2005, to appear.
[16] J. M. Martinez. MPEG-7 overview (ver. 8). ISO/IEC JTC1/SC29/WG11 N4980, July 2002.
[17] R. Megret and D. DeMenthon. A survey of spatio-temporal grouping techniques. Technical report, University of Maryland, College Park, March 2002. http://www.umiacs.umd.edu/lamp/pubs/TechReports/.
[18] J. Novovicova, P. Pudil, and J. Kittler. Divergence based feature selection for multimodal class densities. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(2):218–223, 1996.
[19] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, pages 1119–1125, Nov. 1994.
[20] J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11(2):417–431, 1983.
[21] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. In Proc. of Int. Conf. on Computer Vision, pages 1151–1160, 1998.
[22] Y. Zhuang, Y. Rui, T. S. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised clustering. In Proc. of IEEE Int. Conf. on Image Processing, pages 866–870, Chicago, IL, 1998.
Figure 6. Segmentation results of the real video sequences using the same number of key-frames: (a), (c), (e) Method-I; (b), (d), (f) Method-II.