SEMANTIC VIDEO CONTENT ABSTRACTION BASED ON MULTIPLE CUES

Ying Li, Wei Ming and C.-C. Jay Kuo
Integrated Media Systems Center and Department of Electrical Engineering-Systems
University of Southern California, Los Angeles, CA 90089-2564
E-mail: {yingli, ming, cckuo}@sipi.usc.edu

ABSTRACT

This research addresses the problem of automatically extracting a video's semantic structure and summarizing it in a hierarchical manner. Multiple media cues are employed in this procedure, including visual, audio and text information. The generated hierarchy provides a compact yet meaningful abstraction of the video data, similar to a conventional table of contents, which facilitates users' access to multimedia content, including browsing and retrieval. Preliminary experiments on integrating different media for hierarchically representing video semantics have yielded encouraging results.

1. INTRODUCTION

The amount of multimedia information generated in today's society is growing exponentially, which poses a serious technological challenge in terms of how the information can be integrated, processed, organized, summarized and indexed in a semantically meaningful manner. Of all existing media types, video is the most challenging one due to its rich content. Recently, research on efficient indexing, browsing and retrieval of video databases has attracted a lot of attention, yet more promising results are still to come. When the amount of data is small, a user can retrieve the desired content in a linear fashion by simply browsing the data sequentially. However, with a huge amount of data, a linear search is no longer feasible. What is needed is the capability to automatically abstract the essential content of multimedia data and then form an extremely compact yet meaningful representation of the data as a road map for effective information browsing and retrieval. To accomplish this goal, a hierarchy with multiple layers of abstraction is desired. In this work, we adopt a bottom-up approach, that is, we group semantically related units into larger meaningful clusters. The levels of abstraction that we intend to achieve are shown in Figure 1.

Fig. 1. A hierarchical representation framework of video. [Figure: the video bitstream, audio bitstream and closed caption/audio transcript feed shot detection; shots yield keyframes through keyframe extraction; shots are grouped into scenes with representative frames, and scenes are further grouped into acts.]

Previous work on video abstraction was centered around shot-based video representation. That is, a video program is represented in the form of cascaded shots, which are segmented based on low-level features, and keyframes are then extracted from each shot [1], [2]. However, low-level shot structures do not correspond well to the underlying semantic structure of video data. Furthermore, there are often too many shots generated for a given sequence, resulting in a representation that is too fine-grained for many applications. More recently, there has been some work that considers video segmentation in terms of groups or scenes. Along this direction, a video sequence is first segmented into shots, and then semantically related and temporally adjoining shots are grouped into scenes [3], [4]. However, in spite of their effort to bridge the gap between semantic content and low-level features, most previous approaches consider either purely visual or purely audio information, or rely on application-specific models [5], [6]. More generic algorithms that integrate multiple media cues for more general applications are still needed.

Our current work follows a path similar to that of Yeung [3] and Rui [4] but with a different emphasis. Based on segmented video shots, we employ all available media cues and a set of rules to group the shots into scenes. Here, by integrated media cues we mean that the audio and visual information interact with each other, instead of being applied individually at the initial stage and only combined at the last stage, as in most other multimedia work.

The rest of the paper is organized as follows. Section 2 briefly reviews techniques for audio and visual content analysis. The proposed scene-level video representation scheme is elaborated in Section 3. Preliminary experimental results are reported and discussed in Section 4. Finally, concluding remarks and future work are given in Section 5.

2. AUDIO AND VISUAL PROCESSING

The first two steps in most content-based video analysis work are shot detection and keyframe extraction. In this work, we readily use the algorithms proposed in our previous work [7], where video shots are detected based on extracted color and motion information.


In the audio domain, we aim to classify each audio clip into predefined audio types for better semantic understanding. In particular, the following two processing tasks are involved.

1. Audio feature extraction. Six types of audio features are computed: the short-time energy function, the short-time average zero-crossing rate, the short-time fundamental frequency, the energy band ratio, the silence ratio, and a set of 20-dimensional Mel-frequency cepstral coefficients (MFCC) computed over each 20 ms window. The first five features are mainly extracted for classification purposes, and the last one is used to compare the acoustic similarity between two speech segments (a sketch of the short-time feature computation is given at the end of this section).

2. Audio classification. Based on the features obtained above, the audio clip accompanying each shot is classified into one of the following five types: silence, speech, music, speech with music background, and environmental sound. Note that if the audio clip contains a song, it is classified into the music class. For more details on the audio analysis described above, we refer to [8].

All our test video sequences are digitized movies and real TV programs, which may contain commercials. Since this non-story material is of no interest to us, we filter it out before proceeding to the next step, using the approach proposed in [9]. For the rest of this work, it is assumed that all inserted commercial breaks have been correctly detected and removed from the test video programs.
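To make the feature computation above concrete, the following is a minimal NumPy sketch of two of the listed clip-level features, the short-time energy and the zero-crossing rate, over 20 ms windows, together with a silence ratio derived from the energy. The sampling rate, the non-overlapping windowing and the silence floor are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

def short_time_features(signal, sr=16000, win_ms=20):
    """Short-time energy and zero-crossing rate over non-overlapping 20 ms windows."""
    win = int(sr * win_ms / 1000)
    n_frames = len(signal) // win
    frames = signal[: n_frames * win].reshape(n_frames, win)
    energy = np.mean(frames ** 2, axis=1)                  # short-time energy per window
    sign_changes = np.abs(np.diff(np.sign(frames), axis=1)) > 0
    zcr = np.mean(sign_changes, axis=1)                    # zero-crossing rate per window
    return energy, zcr

def silence_ratio(energy, floor=1e-4):
    """Fraction of windows whose energy falls below a (hypothetical) silence floor."""
    return float(np.mean(energy < floor))
```

In the paper, these and the remaining features (fundamental frequency, energy band ratio, MFCC) feed the five-way audio classifier described in [8]; the classifier itself is not reproduced here.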

3. VIDEO SCENE CONSTRUCTION

In this work, we concentrate on the scene construction of video programs that have a certain story line, such as movies and TV sitcoms. As for other types of video, such as sports and news, there is more structure in the content, so it is relatively easier to obtain the semantic abstraction. A scene is defined as a set of semantically related shots, which may or may not be physically close to each other. Thus, we no longer benefit from the traditional causal processing approach, where shots are processed sequentially and only knowledge of the past is available. Instead, we gather all necessary and available information first, and then extract what we want. The three primary steps of the proposed scene construction scheme are as follows.

1. Shot sink generation. A shot sink contains a pool of shots that are visually similar to each other but largely different from those in other sinks. The shot sinks are computed with a window-based sweep algorithm, which is detailed later in this section.

2. Coarse-level scene extraction. A coarse-level scene extraction is carried out in this step using the shot sinks obtained above. Basically, all shots whose sinks overlap are considered to belong to the same scene.

3. Rule-based scene construction. A set of rules is used in this stage to obtain a refined scene construction result. Here, the audio information is primarily used to group semantically related scenes together, aided by some necessary visual information.

Each of the above steps is elaborated below.

3.1. Window-based Sweep Algorithm

Given the shot detection and keyframe extraction results, a new algorithm called the window-based sweep algorithm is used to find all shots that are visually similar to a reference shot and to pool them into that shot's sink. Since any scene has a certain temporal locality, we naturally restrict the search to a temporal window of length N, as shown in Figure 2(a), where the current window contains N shots. Given shot i, we choose its keyframes to be its first and last frames and denote them as b_i and e_i, as shown in the same figure. The similarity between shots i and j is then defined as

$$D(i, j) = w_1\, d(b_i, b_j) + w_2\, d(b_i, e_j) + w_3\, d(e_i, b_j) + w_4\, d(e_i, e_j) \qquad (1)$$
where d(x, y) is the standard Euclidean distance between two keyframes x and y in terms of their color histograms, and w_1, w_2, w_3 and w_4 are four weighting coefficients computed from the lengths (in frames) of shots i and j.
From Equation (1), we see that the absolute time separation between shots i and j, i.e., the temporal distance between e_i and b_j, is not taken into account. The reason is that, since we want to find all similar shots, we should not weaken the shots' similarity because of their physical separation as long as they lie in the same temporal window. However, we do consider the relative distance between keyframes by introducing the shot lengths as parameters. It is quite intuitive that, if shots i and j are similar, frame e_i should be more similar to b_j than b_i is, considering the motion continuity. Hence, D(i, j) is a measure of time-adaptive distance. Note that, for simplicity, we can also let D(i, j) be the minimum of the four distances in Equation (1), which actually produced surprisingly good results in our experiments.

Now, if D(i, j) is less than a predefined threshold, we consider the two shots to be similar and put shot j into shot i's sink. As shown in Figure 2(b), all shots similar to shot i are linked together in their temporal order. One thing worth mentioning here is that if shot i's sink is not empty, we have to compute the distances between shot j and all other shots already in the sink (for example, shots j and k in the figure), and shot j qualifies for the sink only when the maximum of all these distances is less than the threshold. We repeat this window-based sweep for every shot to obtain its respective shot sink. However, if a shot has already been included in another shot's sink, we skip it and proceed to the next one. Another important issue of this algorithm is the proper selection of the threshold, which could either be determined empirically from experimental results or derived automatically from the statistics of shot differences. We adopt the latter approach in this work.

Fig. 2. (a) Shots contained in a window of length N, and (b) the sink of shot i. [Figure: the window covers shots i, i+1, ..., n; keyframes b_i, e_i and b_j, e_j mark the first and last frames of shots i and j; in (b), shots j and k are linked into the sink of shot i.]
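The following is a minimal sketch of the window-based sweep described in this subsection, using the simplified variant in which D(i, j) is the minimum of the four keyframe distances. Each shot is represented by the color histograms of its first and last keyframes; the window length of 15 shots and the mean-plus-one-standard-deviation recipe for deriving the threshold from the distance statistics are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def shot_distance(shot_a, shot_b):
    """Simplified D(i, j): minimum Euclidean distance among the four keyframe pairs
    (b_i, b_j), (b_i, e_j), (e_i, b_j), (e_i, e_j); each keyframe is a color histogram."""
    (b_i, e_i), (b_j, e_j) = shot_a, shot_b
    return min(np.linalg.norm(x - y) for x in (b_i, e_i) for y in (b_j, e_j))

def build_shot_sinks(shots, window=15, threshold=None):
    """Window-based sweep: pool shots that are visually similar to a reference shot.
    `shots` is a list of (begin_histogram, end_histogram) pairs in temporal order."""
    if threshold is None:  # derive the threshold from distance statistics (assumed recipe)
        dists = [shot_distance(shots[i], shots[j])
                 for i in range(len(shots))
                 for j in range(i + 1, min(i + window, len(shots)))]
        threshold = float(np.mean(dists) + np.std(dists))

    sinks, assigned = {}, set()
    for i in range(len(shots)):
        if i in assigned:           # already pooled into an earlier shot's sink: skip it
            continue
        sinks[i] = []
        for j in range(i + 1, min(i + window, len(shots))):
            if j in assigned:
                continue
            # candidate must be close to the reference AND to every shot already in the sink
            members = [i] + sinks[i]
            if max(shot_distance(shots[j], shots[k]) for k in members) < threshold:
                sinks[i].append(j)
                assigned.add(j)
    return sinks, threshold
```

The sinks produced this way feed the coarse-level scene extraction of the next subsection.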
3.2. Scene Extraction and Refinement

After obtaining the shot sinks, we proceed to extract the scene structure from them.

The basic idea is that whenever two sinks overlap, we claim that the two corresponding shots are from the same scene, and all of their keyframes are linked to the list of the scene's representative frames. The underlying rationale is that, since parts of the shots are visually similar, they will likely share the same basic semantics. Here is an example. Suppose that shot 1's sink contains shots 3, 5, 7 and 10, and shot 2's sink contains shots 4, 6, 8 and 11. It is then natural to infer that something is going on among these two sets of shots, given their alternating structure. This type of repetition pattern is frequently found in TV series and movies. A shot is declared an isolated scene if it has an empty sink and is also not covered by any other shot's sink. These are usually the so-called "progressive scenes", which movie directors often use to establish the situation for the next scene.
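One straightforward reading of this overlap rule is to take the temporal extent of each sink (reference shot plus pooled shots) and merge extents that overlap, so that the alternating example above collapses into a single scene. A minimal sketch follows; treating the merge as interval merging is our interpretation, not necessarily the authors' exact implementation.

```python
def coarse_scenes(sinks):
    """Coarse-level scene extraction: sinks whose temporal extents overlap are merged
    into one scene (simple interval merging over the [first, last] shot of each sink)."""
    intervals = []
    for ref, pooled in sinks.items():
        members = [ref] + list(pooled)           # reference shot plus its pooled shots
        intervals.append([min(members), max(members)])
    intervals.sort()

    scenes = []
    for start, end in intervals:
        if scenes and start <= scenes[-1][1]:    # extent overlaps the current scene span
            scenes[-1][1] = max(scenes[-1][1], end)
        else:
            scenes.append([start, end])          # otherwise open a new scene
    # spans containing a single shot correspond to the "isolated scenes" discussed above
    return scenes
```

For the example above, the two sink extents are [1, 10] and [2, 11]; they overlap, so shots 1 through 11 fall into one scene, while a shot with an empty sink that no extent covers remains an isolated, single-shot scene.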

Two major errors are observed in the coarse-level scene detection: (a) false negatives, or missed scenes, and (b) false positives, or false alarms. When the timing window is too narrow, we may get false alarms, since the shot sinks do not cover all similar shots and a new scene has to be initialized to contain them; moreover, not all video shots within a scene exhibit a repetition pattern. On the contrary, when the timing window is too wide, we may introduce false negatives, where the next scene is falsely included in the current one. This situation usually occurs when two neighboring scenes share a similar background and the same actors or actresses but have different themes.

During the refinement stage, we focus on correcting false positives, since false negatives can be reduced by shortening the timing window. The audio information is primarily used here so that, combined with the visual information, it can better capture the underlying semantics. The following rules, which have proved to be very effective, are employed in the refinement procedure.

(1) For an isolated scene, if both of its neighboring shots, as well as the scene itself, contain music or speech with music background, all three shots should belong to the same scene. The rationale is as follows. Music is usually mixed in by the music director during movie or TV post-production to convey the inner feelings of a key figure of the story or to reflect the atmosphere under certain circumstances. In other words, music is deliberately added to enhance the viewer's experience of an unbreakable unit of the story, which may of course consist of several shots.

(2) For an isolated scene that contains pure speech, if its previous shot (or next shot, or both neighboring shots) also contains speech, then very possibly they belong to the same scene. The decision process is as follows. First, we calculate the 20-dimensional MFCC mean and variance vectors for each shot (note that the variance vector is also 20-dimensional, since a diagonal covariance matrix is used). Then, the distance between these two sets of parameters is computed to give the degree of acoustic similarity between the two shots, and the two scenes are merged if the distance is less than a certain threshold; a sketch is given after this list. Otherwise, we proceed to the next rule.

(3) If the isolated scene contains silence, environmental sound, or any other audio type, our decision is based on visual information. Specifically, we compute the minimum distance between the current shot's keyframes and the R-frames (representative frames) of its previous and next scenes, respectively. If the distance is less than a preset threshold, the shot is merged into the corresponding scene; otherwise, it remains an isolated scene. This rule can also be applied to every detected scene, in which case the distance between two sets of R-frames needs to be computed.
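Rule (2)'s acoustic check can be sketched as follows: per-shot 20-dimensional MFCC mean and variance vectors are computed and compared, and the shots are merged when the distance is small. The Euclidean distance over the concatenated mean/variance differences and the threshold value are assumptions made for illustration; the paper does not spell out the exact distance used.

```python
import numpy as np

def mfcc_stats(mfcc):
    """mfcc: array of shape (n_windows, 20), MFCCs over the 20 ms windows of one shot.
    Returns the 20-dim mean and 20-dim variance (diagonal covariance) vectors."""
    return mfcc.mean(axis=0), mfcc.var(axis=0)

def speech_similar(mfcc_a, mfcc_b, threshold=4.0):
    """Rule (2) check: are two speech shots acoustically similar enough to merge?
    The distance and the threshold value are illustrative assumptions."""
    mu_a, var_a = mfcc_stats(mfcc_a)
    mu_b, var_b = mfcc_stats(mfcc_b)
    dist = np.linalg.norm(np.concatenate([mu_a - mu_b, var_a - var_b]))
    return dist < threshold
```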

3.3. Keywords Extraction

To further extract the semantic meaning of the underlying scenes, we proceed with one more step, namely keyword extraction. Our text source is the closed captioning digitized simultaneously with the test video sequences. A keyword parsing algorithm is designed based on two popularly used text features: the term frequency and the inverse document frequency. One thing worth mentioning is that all detected cast names are extracted as keywords even though their frequency of appearance may be very low, because they are either the names of speakers in the scene or the subjects of the scene's dialogue, which can certainly help viewers better understand the scene.
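A small sketch of the keyword parsing step based on the two text features named above; each scene's closed-caption text is treated as one document. The stop-word list, the tf·idf weighting variant, the top-k cut-off and the way cast names are forced in are all illustrative choices.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "you", "i"}  # illustrative

def scene_keywords(scene_texts, cast_names, top_k=7):
    """scene_texts: list of closed-caption strings, one per scene.
    Returns one keyword list per scene (tf-idf ranked, cast names always kept)."""
    docs = [[w for w in re.findall(r"[a-z']+", t.lower()) if w not in STOPWORDS]
            for t in scene_texts]
    n_docs = len(docs)
    df = Counter(w for d in docs for w in set(d))          # document frequency per term

    keywords = []
    for d in docs:
        tf = Counter(d)
        scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        # cast names are always kept as keywords, even if their frequency is low
        names = [n.lower() for n in cast_names if n.lower() in tf]
        keywords.append(sorted(set(top) | set(names)))
    return keywords
```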

4. EXPERIMENTAL RESULTS

All test video clips were digitized from real TV programs at 29.97 frames/second with a length of about 20-30 minutes. Our test sets included TV sitcoms and TV movies only. An example of the constructed video hierarchy, including scenes, scene keywords, shots and the scenes' representative frames (R-frames), is given in Figure 3. This test video sequence was digitized from the popular TV sitcom "Friends".

Fig. 3. An example of the constructed video hierarchy. [Figure: Scene 1 — keywords: Ross, save, life, bulletin, sandwich, Chalender, Joe; shots 0-28; representative frames. Scene 2 — keywords: Ross, near-death experience, car backfire, shot, survived, Emory, Rachel; shots 29-45; representative frames.]
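The resulting hierarchy can be held in a simple scene/shot/R-frame structure mirroring Figures 1 and 3; the field names below are illustrative, and the sample values are taken from Scene 1 of Figure 3.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    index: int
    keyframes: List[str] = field(default_factory=list)   # paths/ids of keyframe images

@dataclass
class Scene:
    keywords: List[str]
    shots: List[Shot]
    r_frames: List[str] = field(default_factory=list)    # scene representative frames

# Scene 1 of the "Friends" example in Fig. 3
scene1 = Scene(
    keywords=["Ross", "save", "life", "bulletin", "sandwich", "Chalender", "Joe"],
    shots=[Shot(index=i) for i in range(0, 29)],          # shots 0 ... 28
)
```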


Table 1 gives the scene construction results for four test sequences: two TV sitcoms and two TV movies, where the first movie is a romance and the second an action movie. Precision and recall rates are computed to evaluate the performance of the proposed scene construction scheme. The ground-truth scene boundaries were obtained subjectively.

Table 1. Scene construction results.

Video     Missed Scenes   False Alarms   Correctly Detected   Prec.   Rec.
Sitcom1   0               0              12                   100%    100%
Sitcom2   0               1              13                   93%     100%
Movie1    1               2              19                   90%     95%
Movie2    2               6              34                   85%     94%
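For reference, the precision and recall figures in Table 1 are consistent with the usual boundary-detection definitions; for example, the Movie2 row gives

$$\text{Prec.} = \frac{\text{correct}}{\text{correct} + \text{false alarms}} = \frac{34}{34 + 6} = 85\%, \qquad \text{Rec.} = \frac{\text{correct}}{\text{correct} + \text{missed}} = \frac{34}{34 + 2} \approx 94\%.$$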

We have the following observations from the above table. The recall rates of all test sequences are quite satisfactory. However, our scheme tends to detect more scenes than necessary. This phenomenon is expected for most automated video analysis approaches and was also encountered by other researchers, as reported in [3], [4]. In addition, TV sitcoms tend to yield better performance than movies due to their relatively simpler content structure.

Our approach performs slightly better on "slow" movies than on "fast" movies for two reasons. First, some speech/music shots are not correctly detected due to the ever-present loud background noise and sound effects in action movies, which obviously has a negative effect on our scene detection. Second, in "fast" movies the visual content is normally more complex and thus more difficult to capture. In fact, we obtained different subjective scene segmentations from the people who were invited to watch the movies and give their scene interpretations.

5. CONCLUSION AND FUTURE WORK

A new framework for extracting video content by integrating multiple media cues from video, audio and text information was presented in this work. A three-layer video representation was constructed to include scenes, shots and R-frames. Different media have been utilized wherever they are suitable for achieving this goal. Preliminary experiments have yielded encouraging results. We will further improve the way the three modalities, text, audio and video, are fused, and perform extensive experiments on more test sequences.

6. REFERENCES

[1] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," Multimedia Systems, vol. 1, no. 1, pp. 10-28, 1993.

[2] Boon-Lock Yeo and Bede Liu, "Rapid scene analysis on compressed video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 6, pp. 533-544, December 1995.

[3] Minerva Yeung, Boon-Lock Yeo, and Bede Liu, "Extracting story units from long programs for video browsing and navigation," IEEE Proceedings of Multimedia, pp. 296-305, 1996.

[4] Yong Rui, Thomas S. Huang, and Sharad Mehrotra, "Constructing table-of-content for video," ACM Journal of Multimedia Systems, 1998.

[5] Q. Huang, Z. Liu, and A. Rosenberg, "Automated semantic structure reconstruction and representation generation for broadcast news," Proc. of SPIE, vol. 3656, pp. 50-62, January 1999.

[6] H. J. Zhang, S. Y. Tan, S. W. Smoliar, and G. Y. Hong, "Automatic parsing and indexing of news video," Multimedia Systems, vol. 2, no. 6, pp. 256-266, 1995.

[7] Ying Li and C.-C. Jay Kuo, "Real-time segmentation and annotation of MPEG video based on multimodal content analysis I & II," Technical Report, University of Southern California, 2000.

[8] T. Zhang and C.-C. Jay Kuo, "Audio-guided audiovisual data segmentation, indexing and retrieval," Proc. of SPIE, vol. 3656, pp. 316-327, 1999.

[9] Ying Li and C.-C. Jay Kuo, "Detecting commercial breaks in real TV programs based on audiovisual information," Proc. of SPIE, vol. 4210, Boston, November 2000.
