
Eigen-Image Based Video Segmentation and Indexing

Keesook J. Han and Ahmed H. Tewfik
Department of Electrical and Computer Engineering, University of Minnesota, 200 Union St. S.E., Minneapolis, MN 55455, USA

Abstract

We present a new approach for automatic video scene segmentation and content-based indexing. Our approach detects video shots and builds a collection of key frames and representative frames. Scene segmentation and video indexing are based on a temporally windowed principal component analysis of a subsampled version of the video sequence. Two discriminants are derived from the principal components on a frame-by-frame basis. The discriminants are used for scene change detection and for key frame extraction and classification into relevant clips. The system creates an adjacency matrix to build the scene transition graph. The scene transition graph allows easy access to video image-based information.

1 Introduction

In recent years there has been a growing interest in automatic segmentation of digital video and in video indexing [1, 2, 3, 5]. Most of the existing work in the area of video segmentation falls into two major types: pair-wise comparison of pixels or comparison of histograms of pixel values, and comparison of blocks with likelihood ratios [4, 7]. The existing video segmentation algorithms have problems detecting some difficult transitions associated with special effects. They can also misclassify other special effects as scene transitions. The eigen-image based video segmentation model proposed in this paper provides an effective solution to this problem. We also present experimental results that demonstrate the effectiveness of our approach.

2 Eigen-Image Based Video Analysis

The video library application usually involves computation on large data sets. It is desirable to perform the video processing on a reduced data set instead of the original full-frame video sequence. The reduced data set must of course contain enough information for effective video segmentation and classification. In particular, we must be able to extract effective discriminating features from that reduced data set that can be reliably used in the segmentation and indexing tasks. To efficiently measure similarity in appearance within an object class, we must first determine which features are most effective at describing the image of the object. A standard linear method for data feature extraction is principal component analysis, a mathematical technique that analyzes correlated random variables to reduce the dimensionality of a data set. This reduction is achieved by selecting the first few principal components. These components capture the most relevant features to use in classifying a group of objects to be recognized. To reduce the complexity of the approach, we begin by subsampling each frame by a factor of $q$ along each spatial dimension (a reduction of $q^2$ overall). The subsampling proceeds as follows. The $j$th subsampled image is expressed as

$$I_{sj}(x, y) = I_j(qx, qy), \qquad (1)$$

where $1 \le x \le X$ and $1 \le y \le Y$. Denote by $f_j$ the $N \times 1$ (i.e., $N = XY$) column vector representation of the subsampled $j$th frame. The entries of $f_j$ are produced using the scanning pattern shown in Fig. 1. Next, we group $M$ of the vectors $f_j$ in a window. Typical window sizes $M$ that we have used varied from 90 to 150 frames. The first window starts at the beginning of the video. Subsequent windows start with the last scene change detected in the preceding window. Let $\mu$ be the mean vector of the video sequence data $f_1, f_2, \ldots, f_M$. With the centered input image $\Phi_j = f_j - \mu$, the empirical covariance matrix of the window is computed as

$$S = \frac{1}{M} \sum_{j=1}^{M} \Phi_j \Phi_j^T. \qquad (2)$$

Figure 1: Data compression of the video database by spatial and temporal subsampling. (The figure shows the raster scanning pattern that maps each 320x240 frame to a 1200-by-1 vector, with a spatial reduction of 1/64 and a temporal reduction of 1/240, i.e., an overall reduction of 1/(64x240) = 1/15360.)

We compute the unique set of $M$ orthonormal eigenvectors of $S$, $q_1, q_2, \ldots, q_M$, and their associated eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_M$. Linear combinations of the first $K$ eigenvectors $Q = [q_1\ q_2\ \cdots\ q_K]$ span the space of input images at a coarse resolution that captures most of the relevant information in each subsampled frame. Each frame in the window is then projected onto the $K$ eigenvectors corresponding to the $K$ largest eigenvalues and transformed into its eigen-image representation by the following operation:

$$p_j = Q^T \Phi_j = [p_{j1}\ p_{j2}\ \cdots\ p_{jK}]^T, \qquad (3)$$

where $1 \le j \le M$. The elements of the vector $p_j$ are called the principal components. The vector $p_j$ contains compact image information for the $j$th frame. This $K \times 1$ vector is an encoding of the $j$th input image in terms of the eigen-image basis and is selected as the feature vector. We have found that accurate segmentation and classification results can be achieved with $K = 5$ for the video sequences that we have analyzed.
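As a concrete illustration of Eqs. 1-3, the following Python sketch computes the eigen-image features of one window. It is a minimal sketch that assumes grayscale frames stored as 2-D numpy arrays and obtains the eigenvectors of $S$ through an SVD of the centered data matrix (an equivalent but numerically convenient route); the function and variable names are ours, not the paper's.

```python
# Sketch of windowed eigen-image feature extraction (Eqs. 1-3).
# Assumptions: frames is a list of M 2-D grayscale numpy arrays,
# q subsamples each spatial dimension, K principal components kept.
import numpy as np

def eigen_image_features(frames, q=8, K=5):
    # Eq. 1: spatial subsampling, then raster-scan into N-by-1 vectors
    # (a 240x320 frame with q = 8 yields N = 30*40 = 1200).
    F = np.stack([f[::q, ::q].ravel() for f in frames]).astype(float)  # M x N
    mu = F.mean(axis=0)          # mean vector of the window
    Phi = F - mu                 # centered inputs, Phi_j = f_j - mu
    # Eq. 2: the eigenvectors of S = (1/M) sum_j Phi_j Phi_j^T are the
    # right singular vectors of the centered data matrix Phi.
    _, _, Vt = np.linalg.svd(Phi, full_matrices=False)
    Q = Vt[:K].T                 # N x K basis of top-K eigen-images
    # Eq. 3: p_j = Q^T Phi_j for every frame in the window.
    return Phi @ Q               # M x K matrix of feature vectors
```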

3 Key Frame Extraction

Video segmentation is based on a discriminant function derived from the retained $K$ principal components. The first discriminant function is designed to separate a long video sequence into shots, which can be detected through the temporal variations of the angle and distance of the principal components. This process is generally called scene change detection. Specifically, scene detection is performed as follows. Let

$$\delta_j = \left( \sum_{k=1}^{K} \sqrt{(p_{jk} - p_{j-1,k})^2 + (p_{j,k+1} - p_{j-1,k+1})^2} \right) \Big/ \delta_{\max} \qquad (4)$$

and

$$\theta_j = \left( \sum_{k=1}^{K} \tan^{-1}\!\left( \frac{p_{j,k+1} - p_{j-1,k+1}}{p_{jk} - p_{j-1,k}} \right) \right) \Big/ \theta_{\max}, \qquad (5)$$

where $0 \le \tan^{-1}\!\left( \frac{p_{j,k+1} - p_{j-1,k+1}}{p_{jk} - p_{j-1,k}} \right) \le \pi/2$.

We declare a shot boundary to be present if $\delta_j \theta_j > t$, where $t$ is a threshold determined from training data. In earlier works, e.g., [5], the first frame of a shot was usually selected as the key frame. This key frame selection is not suitable in some instances, e.g., with gradual shot transitions. Instead, we use the mean eigen-image $\bar{p}_i$ of the shot as the $i$th key frame. Therefore, at the end of this step, we have a representation of the video in terms of a sequence of key frames $\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_m$. The next step is to cluster these frames into relevant clips represented by Rframes.
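A possible implementation of the discriminants of Eqs. 4-5 is sketched below. It sums over adjacent component pairs $(k, k+1)$, which is one reading of the summation limits, and treats $\delta_{\max}$, $\theta_{\max}$, and the threshold $t$ as tuning constants supplied from training data.

```python
# Hedged sketch of shot-boundary detection from Eqs. 4-5.
# P is the M x K principal-component matrix from the previous sketch.
import numpy as np

def shot_boundaries(P, delta_max, theta_max, t):
    dP = np.diff(P, axis=0)      # p_j - p_{j-1}, shape (M-1, K)
    # Eq. 4: distance of each (p_k, p_{k+1}) pair change, summed over k.
    dist = np.sqrt(dP[:, :-1]**2 + dP[:, 1:]**2).sum(axis=1) / delta_max
    # Eq. 5: angle of each pair change, kept in [0, pi/2] via abs values.
    ang = np.arctan2(np.abs(dP[:, 1:]), np.abs(dP[:, :-1])).sum(axis=1) / theta_max
    # A shot boundary is declared where delta_j * theta_j exceeds t.
    return np.flatnonzero(dist * ang > t) + 1   # indices of boundary frames
```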

4 Key Frame Clustering

We use a second discriminant function to group similar key frames and produce the Rframe sets instead of searching the complete set of key frames [6]. Hence, each group can be treated as a unit. This second discriminant function $D_g(i, j)$ takes the value 1 if key frames $i$ and $j$ belong to the same group. It is defined by

$$D_g(i, j) = \begin{cases} 1 & \text{if } a_{ij} < t_a \text{ and } b_{ij} < t_b \text{ (same group)} \\ 0 & \text{otherwise,} \end{cases} \qquad (6)$$

where

$$a_{ij} = \sqrt{(p_{i1} - p_{j1})^2 + (p_{i2} - p_{j2})^2} \qquad (7)$$

and

$$b_{ij} = \sqrt{(p_{i2} - p_{j2})^2 + (p_{i3} - p_{j3})^2}. \qquad (8)$$

Once clustering is completed using the above function, the system produces the scene transition graph [6]. Two distributions of principal components and the Rframe sets are shown in Fig. 2. Fig. 3 shows the scene transition graph that was built by video indexing of a movie sequence.
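In code, the grouping test of Eqs. 6-8 reduces to two planar distances per key-frame pair. The sketch below assumes the key frames are rows of an $m \times K$ array and that the thresholds $t_a$ and $t_b$ come from training data; the names are illustrative.

```python
# Sketch of the second discriminant D_g(i, j) of Eqs. 6-8.
import numpy as np

def same_group(keys, i, j, t_a, t_b):
    # Eq. 7: distance between key frames i and j in the (p1, p2) plane.
    a_ij = np.hypot(keys[i, 0] - keys[j, 0], keys[i, 1] - keys[j, 1])
    # Eq. 8: distance between key frames i and j in the (p2, p3) plane.
    b_ij = np.hypot(keys[i, 1] - keys[j, 1], keys[i, 2] - keys[j, 2])
    # Eq. 6: same group only if both distances are below threshold.
    return a_ij < t_a and b_ij < t_b
```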

5 Enhanced Video Segmentation

The limitations of eigen-image based video segmentation are that an appropriate window size $M$ and video sequence vector size $N$ must be selected. For content-based video indexing, windows of 90 to 150 frames and $1200 \times 1$ video sequence vectors provide accurate results. For strict video segmentation purposes, the window size can be increased and the dimension of the video sequence vector decreased if we modify the video segmentation techniques (e.g., 550 frames and $120 \times 1$ video sequence vectors were selected for the experiment). In this section, we propose several robust approaches to overcome the limitations of eigen-image based video segmentation.

• Reduction of the Computational Complexity

For this approach, the video sequence data $\tilde{f}_j$ is obtained from the subsampled image $I_s$ (i.e., $X = 160$, $Y = 120$, and $q = 2$ are selected), and the entries of $\tilde{f}_j$ are evaluated as

$$\tilde{f}_{jy} = \sum_{x=1}^{X} I_{sj}(x, y), \qquad (9)$$

where $1 \le y \le Y$. A potential problem with subsampled pixel data is its sensitivity to camera or object movement. This effect may be reduced by using Eq. 9. Note that Eq. 9 describes a smoothing filter. It makes scene change detection more robust. Eq. 9 also reduces the computational time needed to find the eigenvectors.
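Eq. 9 is a column-sum projection; the sketch below assumes a 240x320 grayscale frame stored with $y$ along rows and $x$ along columns, so that with $q = 2$ the result is a 120-element vector.

```python
# Sketch of the reduced video-sequence vector of Eq. 9.
import numpy as np

def column_sum_vector(frame, q=2):
    Is = frame[::q, ::q].astype(float)   # subsampled image (120 x 160)
    # Eq. 9: f~_{jy} = sum over x of I_s(x, y), i.e., one sum per row y.
    return Is.sum(axis=1)                # Y-by-1 (here 120-by-1) vector
```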

• Local Enhancement of the Image Data

The image data is locally enhanced to increase the contrast of the columns of data for scene change detection purposes. The normalization technique is applied to each column. The vector $\tilde{f}_j$ is simply normalized as

$$f_{jy} = (\tilde{f}_{jy} - \tilde{f}_{j\min}) / (\tilde{f}_{j\max} - \tilde{f}_{j\min}). \qquad (10)$$

Fig. 6 indicates that the above normalization process is quite useful for reducing flashing light effects. With the normalized vector $f_j$, Eq. 2 and Eq. 3 are used to compute the principal components $p_j$. The distance $\delta_j$ and the angle $\theta_j$ are computed by using Eq. 4 and Eq. 5 (i.e., $K = 14$ is selected).
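Eq. 10 amounts to per-vector min-max normalization; a sketch follows (the small eps guard against a constant vector is our addition).

```python
# Sketch of the local-contrast normalization of Eq. 10.
import numpy as np

def normalize(f_tilde, eps=1e-12):
    lo, hi = f_tilde.min(), f_tilde.max()
    # Eq. 10: map the vector into [0, 1] using its own min and max.
    return (f_tilde - lo) / (hi - lo + eps)
```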

• Dynamic Range Thresholding

We propose a dynamic range thresholding technique to implement an effective and efficient video segmentation system. This method is very simple and effectively reduces the potential for false scene change detection caused by, e.g., flashing lights, camera movement, and large object movement. We apply this method to both variables $\delta_j$ and $\theta_j$ defined in Eq. 4 and Eq. 5. Let $\epsilon_j$ denote $\delta_j$ or $\theta_j$. We begin by computing $|\Delta\epsilon_j| = |\epsilon_{j+1} - \epsilon_j|$. Next, we threshold $\epsilon_j$ using the following equation:

$$\epsilon_j = \epsilon_{j+1} = 0 \quad \text{if } |\Delta\epsilon_j| < t, \qquad (11)$$

where $1 \le j \le M - 1$. We denote by $\hat{\delta}_j$ and $\hat{\theta}_j$ the thresholded values of $\delta_j$ and $\theta_j$, respectively. Fig. 4 shows the effects of dynamic range thresholding of $\delta_j$ and $\theta_j$. The discriminant function for the enhanced video segmentation is developed as follows:

$$D_s(j) = \begin{cases} 1 & \text{if } \hat{\delta}_j \hat{\theta}_j > t_a \text{ and } \max\{\hat{\delta}_j, \hat{\theta}_j\} > t_b \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$

The enhanced eigen-image based video segmentation provides a broad range of thresholds, which makes it easier to choose suitable values.
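The two steps of Eqs. 11-12 could be sketched as follows; $t$, $t_a$, and $t_b$ are assumed to come from training data, and the input sequences delta and theta are those of Eqs. 4-5.

```python
# Sketch of dynamic range thresholding (Eq. 11) and the enhanced
# segmentation discriminant (Eq. 12).
import numpy as np

def dynamic_range_threshold(eps_seq, t):
    # Eq. 11: zero out eps_j and eps_{j+1} wherever the jump
    # |eps_{j+1} - eps_j| stays below t (slow drifts, not cuts).
    out = eps_seq.astype(float)
    small = np.abs(np.diff(eps_seq)) < t
    out[:-1][small] = 0.0
    out[1:][small] = 0.0
    return out

def enhanced_segmentation(delta, theta, t, t_a, t_b):
    d_hat = dynamic_range_threshold(delta, t)
    th_hat = dynamic_range_threshold(theta, t)
    # Eq. 12: D_s(j) = 1 iff the product and the larger of the two
    # thresholded discriminants both exceed their thresholds.
    return ((d_hat * th_hat > t_a) & (np.maximum(d_hat, th_hat) > t_b)).astype(int)
```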

6 Experimental Results

The algorithms described in this paper have been tested on a number of video sequences (e.g., two shots of similar intensity, flashing lights, panning, fade-in, fade-out, dissolve, and other gradual transitions). The histogram comparison, likelihood ratio, pixel intensity difference, and enhanced eigen-image based video segmentation results for the movies Dave, Firm, Broadcast, and Sting are shown in Fig. 5. The results indicate that the enhanced eigen-image based video segmentation technique provides a reliable means for scene transition detection. The eigen-image based video indexing has the capability of producing the distribution of principal components and the scene transition graph automatically (see Fig. 2 and Fig. 3). Our proposed video indexing technique is a useful tool for automatic identification of index entries and content-based video retrieval.

References

[1] A. Hampapur, R. Jain, and T. E. Weymouth, "Production Model Based Digital Video Segmentation", in Multimedia Tools and Applications, Borko Furht (ed.), Kluwer Academic Publishers, 1996.
[2] R. Hjelsvold, R. Midtstraum, and O. Sandsta, "Searching and Browsing a Shared Video Database", in Multimedia Database Systems: Design and Implementation Strategies, Borko Furht (ed.), Kluwer Academic Publishers, 1996.
[3] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-Based Manipulation of Image Databases", in Multimedia Tools and Applications, Borko Furht (ed.), Kluwer Academic Publishers, 1996.
[4] R. Kasturi and R. Jain, "Dynamic Vision", in Computer Vision: Principles, R. Kasturi and R. Jain (eds.), IEEE Computer Society Press, Washington, 1991.
[5] M. Yeung, B. Yeo, W. Wolf, and B. Liu, "Video Browsing using Clustering and Scene Transitions on Compressed Sequences", Proceedings, Multimedia Computing and Networking, Feb. 1995.
[6] M. Yeung and B. Liu, "Efficient Matching and Clustering of Video Shots", ICIP, Oct. 1995.
[7] H. Zhang, A. Kankanhalli, and W. Smoliar, "Automatic Partitioning of Full-motion Video", in A Guided Tour of Multimedia Systems and Applications, IEEE Computer Society Press, 1995.

Figure 2: Distribution of principal components (axes P1, P2, P3) for the movie FIRM, with key-frame clusters labeled A through H.

Figure 3: Scene transition graph for the movie FIRM, with nodes A through H grouping the shot numbers of each cluster.

Figure 4: Scene change detection: angle, distance, angle threshold, and distance threshold traces over a 500-frame segment.

Figure 5: Scene change detection: histogram comparison, likelihood ratio, pixel intensity difference, and enhanced video segmentation traces over a 500-frame segment.

Figure 6: Image data and enhanced image data from frames of the four movies DAVE, FIRM, BROADCAST, and STING, illustrating flashing lights, similar histograms of two frames, panning/zooming, and motion of small and large objects.