Eigen-Image Based Video Segmentation and Indexing

Keesook J. Han and Ahmed H. Tewfik
Department of Electrical and Computer Engineering, University of Minnesota, 200 Union St. S.E., Minneapolis, MN 55455, USA
Abstract
We present a new approach for automatic video scene segmentation and content-based indexing. Our approach detects video shots and builds a collection of key frames and representative frames. Scene segmentation and video indexing are based on a temporally windowed principal component analysis of a subsampled version of the video sequence. Two discriminants are derived from the principal components on a frame-by-frame basis. The discriminants are used for scene change detection and for key frame extraction and classification into relevant clips. The system creates an adjacency matrix to build the scene transition graph. The scene transition graph allows easy access to video image-based information.
1 Introduction
In recent years there has been a growing interest in automatic segmentation of digital video and in video indexing [1, 2, 3, 5]. Most of the existing work in the area of video segmentation can be divided into two major types: pair-wise comparison of pixels or comparison of histograms of pixel values, and comparison of blocks using likelihood ratios [4, 7]. The existing video segmentation algorithms have problems in detecting some difficult transitions associated with special effects. They can also misclassify other special effects as scene transitions. The eigen-image based video segmentation model proposed in this paper provides an effective solution to the problem. We also present experimental results that demonstrate the effectiveness of our approaches.
2 Eigen-Image Based Video Analysis
The video library application usually involves computation on large data sets. It is desirable to perform the video processing on a reduced data set instead of the original full-frame video sequence. The reduced data set must of course contain enough information for effective video segmentation and classification. In particular, we must be able to extract from that reduced data set effective discriminating features that can be reliably used in the segmentation and indexing tasks. To efficiently measure similarity in appearance within an object class, we must first determine which features are most effective at describing the image of the object. A standard linear method for feature extraction is principal component analysis, a mathematical technique that analyzes correlated random variables to reduce the dimensionality of a data set. The reduction is achieved by retaining the first few principal components, which capture the features most relevant for classifying the objects to be recognized. To reduce the complexity of the approach, we begin by subsampling each frame by a factor of $q$ along each spatial dimension (a factor of $q^2$ overall). The subsampling proceeds as follows. The $j$th subsampled image is expressed as
$$I_{sj}(x, y) = I_j(qx, qy) \qquad (1)$$

where $1 \le x \le X$ and $1 \le y \le Y$. Denote by $f_j$ the $N$-by-$1$ (i.e., $N = XY$) column vector representation of the subsampled $j$th frame. The entries of $f_j$ are produced using the scanning pattern shown in Fig. 1. Next, we group $M$ of the vectors $f_j$ in a window. Typical window sizes $M$ that we have used varied from 90 to 150 frames. The first window starts at the beginning of the video. Subsequent windows start with the last scene change detected in the preceding window. Let $\mu$ be the mean vector of the video sequence data $f_1, f_2, \ldots, f_M$. With the input image $\Phi_j = f_j - \mu$, the empirical covariance matrix of the window is computed as
$$S = \frac{1}{M} \sum_{j=1}^{M} \Phi_j \Phi_j^{T}. \qquad (2)$$
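For illustration, the windowed analysis of Eqs. (1)-(2) can be sketched in a few lines of NumPy. The sketch below is ours, not the authors' implementation: it assumes grayscale frames of equal size, and for brevity it uses a plain raster scan in place of the serpentine pattern of Fig. 1.

```python
import numpy as np

def subsample(frame, q=8):
    """Spatial subsampling of Eq. (1): keep every q-th pixel along each axis."""
    return frame[::q, ::q]

def window_covariance(frames, q=8):
    """Mean vector and empirical covariance of a window of M frames, Eq. (2).

    frames : list of 2-D grayscale arrays of identical size.
    Returns (mu, F, S) where F is the N x M matrix of vectorized frames.
    """
    # Vectorize each subsampled frame into an N-by-1 column f_j
    # (raster scan here; the paper scans in a serpentine pattern).
    F = np.stack([subsample(f, q).ravel() for f in frames], axis=1)
    mu = F.mean(axis=1, keepdims=True)   # mean vector of the window
    Phi = F - mu                         # Phi_j = f_j - mu
    S = (Phi @ Phi.T) / Phi.shape[1]     # S = (1/M) sum_j Phi_j Phi_j^T
    return mu, F, S
```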
Figure 1: Data compression of the video database by spatial and temporal subsampling. (The figure traces a 320x240 original frame through 1/64 spatial subsampling to a 1200-by-1 vector $f_j$, scanned in a serpentine pattern, and through the eigen-image projection to the 5-by-1 principal component vector $p_j$ and the key frames.)

We compute the unique set of $M$ orthonormal eigenvectors of $S$, $q_1, q_2, \ldots, q_M$, and their associated eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_M$. Linear combinations of the first $K$ eigenvectors $Q = [q_1\; q_2\; \cdots\; q_K]$ span the space of input images at a coarse resolution that captures most of the relevant information in each subsampled frame. Each frame in the window is then projected onto the $K$ eigenvectors corresponding to the $K$ largest eigenvalues and transformed into its eigen-image representation by the following operation
$$p_j = Q^{T} \Phi_j = [p_{j1}\; p_{j2}\; \cdots\; p_{jK}]^{T} \qquad (3)$$

where $1 \le j \le M$. The elements of the vector $p_j$ are called the principal components. The vector $p_j$ contains compact image information for the $j$th frame. This $K$-by-$1$ vector is an encoding of the $j$th input image in terms of the eigen-image basis and is selected as a feature vector. We have found that accurate segmentation and classification results can be achieved with $K = 5$ for the video sequences that we have analyzed.
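Continuing the earlier sketch, the projection of Eq. (3) might look as follows; `mu`, `F`, and `S` come from `window_covariance` above, and the function name is ours.

```python
def eigen_features(S, mu, F, K=5):
    """Eigen-image projection of Eq. (3): p_j = Q^T (f_j - mu)."""
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending order for symmetric S
    Q = eigvecs[:, ::-1][:, :K]            # K eigenvectors, largest eigenvalues first
    return Q.T @ (F - mu)                  # K x M matrix; column j is p_j
```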
3 Key Frame Extraction
Video segmentation is based on a discriminant function derived from the retained $K$ principal components. The first discriminant function is designed to separate a long video sequence into shots, which can be detected from the temporal variations of the angle and distance of the principal components. This process is generally called scene change detection. Specifically, scene detection is performed as follows. Let

$$\delta_j = \left( \sum_{k=1}^{K} \sqrt{(p_{jk} - p_{j-1,k})^2 + (p_{j,k+1} - p_{j-1,k+1})^2} \right) \Big/ \delta_{\max} \qquad (4)$$

and

$$\theta_j = \left( \sum_{k=1}^{K} \tan^{-1}\!\left( \frac{p_{j,k+1} - p_{j-1,k+1}}{p_{jk} - p_{j-1,k}} \right) \right) \Big/ \theta_{\max} \qquad (5)$$

where $0 \le \tan^{-1}\!\left( \frac{p_{j,k+1} - p_{j-1,k+1}}{p_{jk} - p_{j-1,k}} \right) \le \frac{\pi}{2}$. We declare a shot boundary to be present if $\delta_j \theta_j > t$, where $t$ is a threshold determined from training data. In earlier works, e.g. [5], the first frame of a shot was usually selected as the key frame. This key frame selection is not suitable in some instances, e.g., with gradual shot transitions. Instead, we use here the mean eigen-image $\bar{p}_i$ of the shot as the $i$th key frame. Therefore, at the end of this step, we have a representation of the video in terms of a sequence of key frames $\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_m$. The next step is to cluster these frames into relevant clips represented by Rframes.
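A minimal sketch of this detection rule (ours, not the paper's code): it assumes the $K \times M$ matrix `P` of principal components from Eq. (3), pairs consecutive components for $k = 1, \ldots, K-1$ in 0-based form, and takes the normalizers `delta_max` and `theta_max` and the threshold `t` as given from training data.

```python
def shot_boundaries(P, t, delta_max, theta_max):
    """Flag shot boundaries using the discriminants of Eqs. (4)-(5)."""
    K, M = P.shape
    found = []
    for j in range(1, M):
        d = P[:, j] - P[:, j - 1]                      # p_j - p_{j-1}
        # Distance and angle over consecutive component pairs (k, k+1).
        dist = sum(np.hypot(d[k], d[k + 1]) for k in range(K - 1))
        ang = sum(np.arctan2(abs(d[k + 1]), abs(d[k]))  # in [0, pi/2]
                  for k in range(K - 1))
        delta_j = dist / delta_max
        theta_j = ang / theta_max
        if delta_j * theta_j > t:                      # declare a shot boundary
            found.append(j)
    return found
```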
4 Key Frame Clustering
We use a second discriminant function to group similar key frames and produce the Rframe sets instead of searching the complete set of key frames [6]. Hence, each group can be treated as a unit. This second discriminant function $D_g(i, j)$ takes the value 1 if key frames $i$ and $j$ belong to the same group. It is defined by

$$D_g(i, j) = \begin{cases} 1 & \text{if } a_{ij} < t_a \text{ and } b_{ij} < t_b \text{ (same group)} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

where

$$a_{ij} = \sqrt{(p_{i1} - p_{j1})^2 + (p_{i2} - p_{j2})^2} \qquad (7)$$

and

$$b_{ij} = \sqrt{(p_{i2} - p_{j2})^2 + (p_{i3} - p_{j3})^2}. \qquad (8)$$

Once clustering is completed using the above function, the system produces the scene change graph [6]. Two distributions of principal components and the Rframe sets are shown in Fig. 2. Fig. 3 shows the scene transition graph that was built by video indexing of a movie sequence.
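The discriminant of Eqs. (6)-(8) translates directly into code; the greedy grouping scheme below is one possible reading, since the paper does not specify the clustering procedure beyond $D_g$, and both function names are ours.

```python
def same_group(p_i, p_j, t_a, t_b):
    """Discriminant D_g of Eqs. (6)-(8) on two key-frame feature vectors."""
    a_ij = np.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])  # first two components
    b_ij = np.hypot(p_i[1] - p_j[1], p_i[2] - p_j[2])  # second and third
    return a_ij < t_a and b_ij < t_b

def cluster_key_frames(key_frames, t_a, t_b):
    """Greedy grouping of key frames into Rframe sets."""
    groups = []
    for p in key_frames:
        for g in groups:
            if same_group(p, g[0], t_a, t_b):  # compare with group representative
                g.append(p)
                break
        else:
            groups.append([p])                 # start a new group
    return groups
```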
5 Enhanced Video Segmentation
The limitations of eigen-image based video segmentation are that an appropriate window size $M$ and video sequence vector size $N$ must be selected. For content-based video indexing, 90 to 150 frames and 1200-by-1 video sequence vectors provide accurate results. For strict video segmentation purposes, the window size can be increased and the dimension of the video sequence vector can be decreased if we modify the video segmentation techniques (i.e., 550 frames and 120-by-1 video sequence vectors were selected for the experiment). In this section, we propose various robust approaches to overcome the limitations of eigen-image based video segmentation.
Reduction of the Computational Complexity
For this approach, the video sequence data $\tilde{f}_j$ is obtained from the subsampled image $I_s$ (i.e., $X = 160$, $Y = 120$ and $q = 2$ are selected), and the entries of $\tilde{f}_j$ are evaluated as

$$\tilde{f}_{jy} = \sum_{x=1}^{X} I_{sj}(x, y) \qquad (9)$$

where $1 \le y \le Y$. A potential problem with subsampled pixel data is its sensitivity to camera or object movement. This effect may be reduced by using Eq. 9. Note that Eq. 9 describes a smoothing filter. It makes scene change detection more robust. Eq. 9 also reduces the computational time needed to find the eigenvectors.
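In code, Eq. (9) is a one-liner over the subsampled frame; this reuses the `subsample` helper from the earlier sketch.

```python
def column_projection(frame, q=2):
    """Column sums of Eq. (9): a Y-by-1 vector, one entry per image column y."""
    return subsample(frame, q).sum(axis=0)   # sum over x for each y
```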
Local Enhancement of the Image Data
The image data is locally enhanced to increase the contrast of the columns of data for scene change detection purposes. The normalization technique is applied to each column. The vector $\tilde{f}_j$ is simply normalized as

$$f_{jy} = \frac{\tilde{f}_{jy} - \tilde{f}_{j,\min}}{\tilde{f}_{j,\max} - \tilde{f}_{j,\min}}. \qquad (10)$$

Fig. 6 indicates that the above normalization process is quite useful for reducing flashing light effects. With the normalized vector $f_j$, Eq. 2 and Eq. 3 are used to compute the principal components $p_j$. The distance $\delta_j$ and the angle $\theta_j$ are computed by using Eq. 4 and Eq. 5 (i.e., $K = 14$ is selected).
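A sketch of the min-max normalization of Eq. (10), applied to the column-sum vector of Eq. (9); the function name is ours.

```python
def normalize_columns(f_tilde):
    """Min-max normalization of Eq. (10): rescale the vector to [0, 1]."""
    lo, hi = f_tilde.min(), f_tilde.max()
    return (f_tilde - lo) / (hi - lo)   # assumes a non-constant frame (hi > lo)
```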
Dynamic Range Thresholding
We propose a dynamic range thresholding technique to implement an effective and efficient video segmentation system. This method is very simple and effectively reduces the potential for false scene change detections caused by, e.g., flashing lights, camera movement, and large object movement. We apply this method to both variables $\delta_j$ and $\theta_j$ defined in Eq. 4 and Eq. 5. Let $\gamma_j$ denote $\delta_j$ or $\theta_j$. We begin by computing $|\Delta\gamma_j| = |\gamma_{j+1} - \gamma_j|$. Next, we threshold $\gamma_j$ using the following equation

$$\gamma_j = \gamma_{j+1} = 0 \quad \text{if } |\Delta\gamma_j| < t \qquad (11)$$

where $1 \le j \le M - 1$. We will denote by $\hat{\delta}_j$ and $\hat{\theta}_j$ the thresholded values of $\delta_j$ and $\theta_j$, respectively. Fig. 4 shows the effects of dynamic range thresholding of $\delta_j$ and $\theta_j$. The discriminant function for the enhanced video segmentation is developed as follows:

$$D_s(j) = \begin{cases} 1 & \text{if } \hat{\delta}_j \hat{\theta}_j > t_a \text{ and } \max\{\hat{\delta}_j, \hat{\theta}_j\} > t_b \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$
The enhanced eigen-image based video segmentation provides a broad range of acceptable thresholds, which makes it easier to choose suitable values.
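A sketch of Eqs. (11)-(12) (our reading, with the frame-to-frame differences taken on the original sequences before any zeroing; all names and the per-variable thresholds `t_delta`, `t_theta` are illustrative):

```python
def dynamic_range_threshold(gamma, t):
    """Eq. (11): zero gamma_j and gamma_{j+1} wherever |gamma_{j+1} - gamma_j| < t."""
    g = np.asarray(gamma, dtype=float).copy()
    small = np.abs(np.diff(gamma)) < t   # |Delta gamma_j| < t, j = 1..M-1
    g[:-1][small] = 0.0                  # suppress gamma_j ...
    g[1:][small] = 0.0                   # ... and gamma_{j+1}
    return g

def enhanced_boundaries(delta, theta, t_delta, t_theta, t_a, t_b):
    """Enhanced discriminant D_s of Eq. (12) on the thresholded sequences."""
    d_hat = dynamic_range_threshold(delta, t_delta)
    th_hat = dynamic_range_threshold(theta, t_theta)
    return [j for j in range(len(d_hat))
            if d_hat[j] * th_hat[j] > t_a and max(d_hat[j], th_hat[j]) > t_b]
```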
6 Experimental Results
The algorithms described in this paper have been tested on a number of video sequences (e.g., two shots of similar intensity, flashing lights, panning, fade-in, fade-out, dissolve, and other gradual transitions). The histogram comparison, likelihood ratio, pixel intensity difference, and enhanced eigen-image based segmentation results for the movies Dave, Firm, Broadcast, and Sting are shown in Fig. 5. The results indicate that the enhanced eigen-image based video segmentation technique provides a reliable means for scene transition detection. The eigen-image based video indexing has the capability of producing the distribution of principal components and the scene transition graph automatically (see Fig. 2 and Fig. 3). Our proposed video indexing technique is a useful tool for automatic identification of index entries and content-based video retrieval.
References
[1] A. Hampapur, R. Jain, and T. E. Weymouth, "Production Model Based Digital Video Segmentation", Multimedia Tools and Applications, Borko Furht (ed.), Kluwer Academic Publishers, 1996.

[2] R. Hjelsvold, R. Midtstraum, and O. Sandsta, "Searching and Browsing a Shared Video Database", Multimedia Database Systems: Design and Implementation Strategies, Borko Furht (ed.), Kluwer Academic Publishers, 1996.

[3] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-Based Manipulation of Image Databases", Multimedia Tools and Applications, Borko Furht (ed.), Kluwer Academic Publishers, 1996.

[4] R. Kasturi and R. Jain, "Dynamic Vision", Computer Vision: Principles, R. Kasturi and R. Jain (eds.), IEEE Computer Society Press, Washington, 1991.

[5] M. Yeung, B. Yeo, W. Wolf, and B. Liu, "Video Browsing using Clustering and Scene Transitions on Compressed Sequences", Proceedings, Multimedia Computing and Networking, Feb. 1995.

[6] M. Yeung and B. Liu, "Efficient Matching and Clustering of Video Shots", ICIP, Oct. 1995.

[7] H. Zhang, A. Kankanhalli, and W. Smoliar, "Automatic Partitioning of Full-motion Video", A Guided Tour of Multimedia Systems and Applications, IEEE Computer Society Press, 1995.
Figure 2: Distribution of principal components: FIRM

Figure 3: Scene transition graph: FIRM

Figure 4: Scene change detection (angle and distance discriminants, shown before and after thresholding)

Figure 5: Scene change detection (histogram, likelihood ratio, pixel intensity difference, and enhanced video segmentation)

Figure 6: Image data from the four movie frames (DAVE, FIRM, BROADCAST, STING)