Automatic Video Summarizing Tool using MPEG-7 Descriptors for Personal Video Recorder

Jae-Ho Lee, Member, IEEE, Gwang-Gook Lee, and Whoi-Yul Kim, Member, IEEE
Abstract — In this paper, we introduce an Automatic Video Summarizing Tool (AVST) for a personal video recorder. The tool utilizes MPEG-7 visual descriptors to generate a video index for summarization. The resulting index not only provides a preview of a movie but also allows nonlinear access with thumbnails. In addition, the index supports searching for shots similar to a desired one within the saved video sequences. Moreover, simple shot-based video editing can be achieved readily using the generated index.

Index Terms — MPEG-7 application, non-linear editing, PVR, shot-based video editing, video retrieval, video summarization

This work was supported in part by the Institute of Information and Technology Assessment, Korea, under the Development of MPEG-7 Browsing System, and in part by the Ministry of Education, Seoul, Korea, under the BK21 project.
Jae-Ho Lee is with the Image Engineering Laboratory of Hanyang University, Seoul, Korea ([email protected]).
Gwang-Gook Lee is with the Image Engineering Laboratory of Hanyang University, Seoul, Korea ([email protected]).
Whoi-Yul Kim is with the Division of Electrical and Computer Engineering, Hanyang University, Seoul, Korea ([email protected]).
I. INTRODUCTION
Digital broadcasting services are expanding more than ever, and new equipment is being introduced in the market every day. The new digital systems differ from conventional analog home electronics. For example, an analog video player performs only simple roles: fast-forwarding, rewinding, and recording. New digital electronics, however, allow us to carry out new functions for managing the growing amount of video data, including non-linear access and even video editing. To add another such functionality, we propose an Automatic Video Summarizing Tool (AVST) embedded in a personal video recorder (PVR).

The MPEG-7 standard has been established to manage the ever increasing amount of multimedia data based on audio-visual information. With the MPEG-7 standard, in contrast to conventional annotation-based systems, multimedia content can be described with some degree of coherency in interpreting or accessing the data. However, since the elements of the MPEG-7 standard are somewhat different from those of MPEG-1, -2, and -4, only a few systems have been introduced as prototypes so far. In this paper, we introduce a system that can be useful for home entertainment electronics using MPEG-7 techniques.

The popularization of digital broadcasting forces us to come into contact with a large amount of video data. The sheer amount of data is becoming increasingly difficult to handle on conventional home electronics. Utilization of MPEG-7 is a reasonable approach to describe and manage multimedia data [1]. To this end, there has been some research on the use of MPEG-7 in broadcasting content applications. A. Yamada et al. built a visual program navigation system that uses the MPEG-7 color layout descriptor [2]. N. Fatemi and O. A. Khaled designed a retrieval application using a rich news description model based on the MPEG-7 standard [3], and T. Walker proposed a system for content-based navigation of television programs based on MPEG-7 [4]. The system Walker presented uses standard MPEG-7 description schemes (DS) to describe television programs, but is limited to news programs. T. Sikora applied MPEG-7 descriptors to the management of multimedia databases [5]. Also, A. Divakaran et al. presented a video summarization technique using cumulative motion activity based on compressed-domain features extracted from motion vectors [6].

In this paper, we introduce an AVST that generates summarized video in real time. The summary information generated by the presented tool provides users with an overview of the video content and guides them visually so that they can move quickly to a desired position in a video. The tool also makes it possible to find shots similar to a queried one, which is useful for editing different video streams into a new one. Only MPEG-7 visual descriptors are used to segment a video into shots, to summarize a video, and to retrieve a scene of interest.

The organization of the paper is as follows: Section II describes the MPEG-7 visual descriptors utilized in the system. The detailed implementation of the AVST and the functions provided are presented in Section III. The analysis of the developed system is presented with the results in Section IV. Finally, Section V concludes with the contributions of this research.

II. MPEG-7 VISUAL DESCRIPTORS

The MPEG-7 standard is composed of seven main parts [7]. TABLE I describes the structure and contents of MPEG-7. The visual component is categorized into basic structures such as color, texture, shape, motion, localization, and face recognition. Not all visual descriptors were utilized for the AVST; in the color category, only the Dominant Color Descriptor (DCD), the Color Layout Descriptor (CLD), and the Color Structure Descriptor (CSD) were employed. A detailed explanation of all visual descriptors can be found in the standard document [8]. A brief explanation of the utilized color descriptors follows:
TABLE I
STRUCTURE OF MPEG-7

Part-1: Systems
    Specifies the tools needed to prepare MPEG-7 descriptions for efficient transport and storage, to allow synchronization between content and descriptions, and the tools related to managing and protecting intellectual property.
Part-2: Description Definition Language (DDL)
    Specifies the language for defining new description schemes.
Part-3: Visual
    Specifies the descriptors and description schemes dealing exclusively with visual information.
Part-4: Audio
    Specifies the descriptors and description schemes dealing exclusively with audio information.
Part-5: Generic Entities and Multimedia Description Schemes (MDS)
    Specifies the descriptors and description schemes dealing with generic and multimedia features.
Part-6: Reference Software
    Includes software corresponding to the tools included in the standard.
Part-7: Conformance Testing
    Defines guidelines and procedures for testing the conformance of MPEG-7 descriptions and terminals.
• Dominant Color Descriptor
This color descriptor is suitable for representing local (object or image region) features where a small number of colors is sufficient to characterize the color information in the region of interest. The binary syntax of the DCD specifies three bits to represent the number of dominant colors and five bits for each of the percentage values.

• Color Layout Descriptor
Its objective is to represent the spatial distribution of color in a very compact form so that image or video matching can be achieved with high retrieval efficiency at a very small computational cost. The actual values of the coefficients are represented by the arrays Ycoeff, CbCoeff, and CrCoeff. In this system, the numbers of coefficients were selected as 6, 3, and 3, respectively, and each coefficient is six bits long.

• Color Structure Descriptor
Similar to a color histogram, this descriptor captures both the color content and the structure of that content. The difference lies in expressing the local color structure in an image by means of a structuring element composed of several image samples. This descriptor can distinguish images that have the same color distribution but different spatial arrangements of colors. In this system, 256 bins are used to represent an image, and each bin has 8 bits.

III. IMPLEMENTATION

The functions of the developed AVST consist mainly of three parts:

• Generation of an overview of the video content, to find a desired position in the video data.
• Query by example, to find shots similar to a queried one in a large amount of video data.
• Nonlinear editing, to support simple editing based on the summarized video.

Figure 1 shows the overall schematic of the AVST. Each image in the summary index is a representative frame of a cluster, which is composed of a number of shot segments. Users can find shots similar to their favorite shot by querying with an example from the index. Also, a new video stream can easily be generated by moving shot segments from the summary index. All of these operations have been designed to be simple, so they can be executed with a home electronics interface.
Fig. 1. Schematic of AVST: preview, non-linear access, editing, and query by example.
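To make the structure of this index concrete, the sketch below shows one possible per-shot record combining the key frame position with the MPEG-7 features described in Section II; the field names and layout are illustrative assumptions, not the actual data format of the AVST.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ShotIndexEntry:
    """Hypothetical per-shot entry of the AVST summary index."""
    start_frame: int                                    # shot boundary; also used as the key frame
    end_frame: int
    cld: Dict[str, List[int]] = field(default_factory=dict)   # {'y': 6, 'cb': 3, 'cr': 3} coefficients, 6 bits each
    csd: List[int] = field(default_factory=list)               # 256-bin color structure histogram, 8 bits per bin
    dcd: List[Tuple[Tuple[int, int, int], int]] = field(default_factory=list)  # (color, percentage) pairs; up to 8 dominant colors
    cluster_id: int = -1                                # filled in during clustering
    story_id: int = -1                                  # filled in when story units are formed

Grouping such entries by cluster_id and story_id would then yield the hierarchical summary index sketched in Fig. 1.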
The implemented functions in the developed tool are described in this section.

A. Video Summarization

With our tool, a video summary is generated in two ways: semantic or content-based summarization. Semantic summarization is for videos that have stories, such as dramas or sitcoms, while content-based summarization is for videos that do not contain stories, such as sports videos. In both cases, a video stream is segmented into a set of shots as a preprocessing step by detecting scene changes. Adaptive thresholding and gradual transition detection methods are applied in this step [9][10]. Figure 2 illustrates the detection steps depending on the type of scene change (abrupt or gradual) and the corresponding algorithms. In this stage, three MPEG-7 color descriptors (color layout, dominant color, and color structure) are computed and saved as the features to be utilized in the summarization and retrieval process. In Figure 2, the far left column shows a video sequence as the input to the system. The image in the center is the current frame being processed for feature extraction. The features of the current frame are then compared to those of other frames to detect scene changes. Abrupt scene change detection is done simply by computing the distance between the two sets of features extracted from adjacent frames. Since the number of detected abrupt scene changes is highly dependent upon the threshold value, an adaptive threshold derived from the average and deviation of the distances in a local window is used [9].
Fig. 2. Diagram of scene change detection
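As an illustration of the abrupt-cut stage, a minimal sketch of adaptive thresholding in the spirit of [9] is given below; it assumes the CLD distances between adjacent frames have already been computed, and the window size and multiplier are illustrative values rather than the settings used in the paper.

import statistics

def detect_abrupt_cuts(distances, window=30, alpha=3.0, min_history=5):
    """Flag frame i as a cut when distances[i] (the CLD distance between
    frames i-1 and i) exceeds a threshold adapted to the local statistics."""
    cuts = []
    for i in range(min_history, len(distances)):
        local = distances[max(0, i - window):i]
        mean = statistics.fmean(local)
        dev = statistics.pstdev(local)
        if distances[i] > mean + alpha * dev:
            cuts.append(i)
    return cuts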
To detect gradual changes, the distance between the current frame and the frame k frames ahead is computed following Bescos's plateau method [10]. As illustrated in Fig. 2, when a gradual scene transition occurs, the distance plot shows a plateau. To locate the plateau exactly, metrics such as symmetry, slope failure (the distance decreasing during the rising phase, or vice versa), the maximum distance, and the distance difference at the scene change are used. Values of 10, 20, 30, and 40 were chosen for k to cover the various durations of gradual changes. The Color Layout Descriptor (CLD) was used as the feature for the distances in both abrupt and gradual scene change detection. The CLD captures color and spatial information with DCT coefficients; the coefficients are computed from a reduced 8 by 8 image in the YCbCr color domain, and the numbers of coefficients are 6, 3, and 3 for Y, Cb, and Cr, respectively. The distance between two descriptors is defined by the following equation [8]:
D(i, j) = \sqrt{\sum_{k=0}^{5} \lambda_{Y,k} \left( Ycoeff(i,k) - Ycoeff(j,k) \right)^{2}} + \sqrt{\sum_{k=0}^{2} \lambda_{Cb,k} \left( Cbcoeff(i,k) - Cbcoeff(j,k) \right)^{2}} + \sqrt{\sum_{k=0}^{2} \lambda_{Cr,k} \left( Crcoeff(i,k) - Crcoeff(j,k) \right)^{2}}

where i and j indicate the frame numbers and k is the coefficient index. The weighting coefficients λ are defined in the following table [8]:

k             0  1  2  3  4  5
λ (Ycoeff)    2  2  2  1  1  1
λ (Cbcoeff)   2  1  1
λ (Crcoeff)   2  1  1
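A minimal sketch of this matching function, assuming the descriptor is stored as three plain coefficient lists keyed by channel (an implementation choice, not part of the standard), might look as follows:

import math

# Weights from the table above: k = 0..5 for Y, k = 0..2 for Cb and Cr.
LAMBDA_Y = [2, 2, 2, 1, 1, 1]
LAMBDA_CB = [2, 1, 1]
LAMBDA_CR = [2, 1, 1]

def cld_distance(a, b):
    """Distance between two Color Layout Descriptors a and b, each a dict
    with keys 'y' (6 coefficients), 'cb' (3), and 'cr' (3)."""
    def channel(weights, x, y):
        return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, x, y)))
    return (channel(LAMBDA_Y, a['y'], b['y'])
            + channel(LAMBDA_CB, a['cb'], b['cb'])
            + channel(LAMBDA_CR, a['cr'], b['cr']))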
For key-frame extraction from the segmented video shots, several approaches have been proposed recently. H. Chang, S. Sull, and S. Lee presented a method to measure the performance of key frames [11]. Although this method is not infallible, it is applicable to the results presented here and would allow a comparison with other techniques. A. Hanjalic and H. Zhang provided a much more thorough review of key-frame extraction techniques, as well as another way to assess the performance of a key-frame extraction scheme using cluster-validity analysis [12]. In this paper, the oldest approach to automatic key-frame extraction was adopted [13]: it chooses as the key frame the frame appearing immediately after each detected shot boundary, which is appropriate for a PVR that receives video data through a broadcasting system.

After the scene change detection step, a video summarization process follows. To perform semantic summarization, the segmented shots are clustered to compose story units; the algorithm proposed by Yeung is used in this step [14]. To obtain a content-based summarization, clustering is applied while considering the duration of each shot segment. In both cases, the distance between shots is measured using the MPEG-7 color layout descriptor and color structure descriptor. The block diagram of the semantic summarization is shown in Figure 3.

Fig. 3. Summarization process of the AVST
Shot clustering is then performed by comparing the key frames obtained from the scene change detection step with a modified time-constrained method. Time-constrained clustering is based on the observation that similar shot segments that are far apart in time are likely to belong to different story units. As a result, with a time-windowing method, remote shots are not placed in the same story unit even if they have similar features. A hierarchical clustering method merges shot segments that have similar features and are neighbors in the time domain into the same cluster. The time window for comparing regions is fixed at 3000 seconds.

Yeung proposed a Scene Transition Graph (STG) to generate story units from clustering results [14]. Each node of the STG is a cluster, and links between clusters are created when they contain temporally adjacent shot segments. To separate story units, the following observations were used: (1) shots in the same story unit interact with each other, and (2) there is no interaction with shots in other story units except for a single transition between units. Based on these observations, the cut edge, a single directional path between units, is detected. In our system, we selected a simple numbering method to detect the transition points. The pseudo code of the method is shown below:

Story_0 <- Cluster_0, lastCID = 0, j = 0
for i = 0 to (number of shots - 1)
    if (CID of shot_i > lastCID) {
        j++
        Story_j <- CID of shot_i
        lastCID = CID of shot_i
    }
    else if (CID of shot_i < CID of shot_{i-1}) {
        merge Story_{k+1}, Story_{k+2}, ..., Story_j,
        where Story_k contains the cluster to which shot_i belongs
    }
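For reference, a runnable transcription of this pseudo code is sketched below; the list shot_cids (cluster IDs of the shots in temporal order, assigned in order of first appearance) and the set-based story representation are assumptions about the data layout rather than the paper's actual implementation.

def group_story_units(shot_cids):
    """Group shots into story units from their cluster IDs (temporal order).
    Returns a list of story units, each a set of cluster IDs."""
    stories = [{shot_cids[0]}]              # Story_0 <- first cluster
    last_cid = shot_cids[0]
    for i in range(1, len(shot_cids)):
        cid = shot_cids[i]
        if cid > last_cid:                  # a previously unseen cluster starts a new story
            stories.append({cid})
            last_cid = cid
        elif cid < shot_cids[i - 1]:        # a revisited cluster signals an interaction
            k = next(idx for idx, s in enumerate(stories) if cid in s)
            stories = stories[:k] + [set().union(*stories[k:])]   # merge Story_k .. Story_j
    return stories

For example, the cluster ID sequence [0, 1, 0, 2, 3, 2, 4] yields the story units {0, 1}, {2, 3}, and {4}.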
If a new cluster ID appears while checking the cluster IDs of consecutive shots, the shot is regarded as the start of a new story unit. However, when an interaction is detected, all related story units are merged.

Semantic summarization therefore has a hierarchical structure, driven by the temporal locality and continuity of content in a video. In this system, we assume that the story is the top layer of the structure and that each story is organized into clusters; these clusters in turn hold the key frames of the scenes. Figure 4 describes the hierarchical structure of the semantic summarization.

Fig. 4. Hierarchical structure of semantic summary

The bottom images in the figure represent the key frame of each shot segment. All shot segments are grouped, by comparing their features, into the next level of the hierarchy, called the cluster, and the story units hold one or more clusters as the highest level of the structure.

Content-based summarization focuses on the coincidence of content without temporal information. This method is especially applicable to sports videos; for example, in a soccer video, player scenes, goal scenes, and audience scenes can all be classified separately. Figure 5 shows an example of a content-based summarization result.
Fig. 5. Content-based video summary
In the content-based summary, there is no hierarchical structure. The far left column depicts the video sequence arriving in the system, and the content-based summary results are displayed on the right side. Each cluster holds key frames with similar features, even if they are located far apart in time.
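One plausible reading of this step, ignoring all temporal constraints, is a simple greedy clustering of key frames by descriptor distance; the sketch below is such an approximation, reusing the cld_distance helper from the earlier sketch, with an arbitrary threshold that is not taken from the paper.

def cluster_without_time(keyframe_features, threshold=25.0):
    """Greedy content-based clustering: assign each key frame to the first
    cluster whose representative is within `threshold`, else start a new cluster."""
    representatives, clusters = [], []
    for i, feat in enumerate(keyframe_features):
        for c, rep in enumerate(representatives):
            if cld_distance(feat, rep) < threshold:
                clusters[c].append(i)
                break
        else:
            representatives.append(feat)   # first member becomes the representative
            clusters.append([i])
    return clusters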
The GUI of the AVST is presented in Figure 6.

Fig. 6. GUI of the developed AVST: (a) main window of the system, (b) hierarchical viewer and option buttons.
On the upper left is the current frame, and on the right side are the summarized results. The buttons on the top right are associated with each layer, such as cut, cluster, and story. For example, if the story button is selected, the story frames of the video are displayed on the right side; if the user then selects a story image, the clusters of the selected story are displayed below it. There is also an option for content-based summarization. The detect cut button activates the scene change detection process, and the summarize button starts the summarization process. The summarized result is saved for editing or retrieval with the save index button. With this tool, users can easily get an overview of the video from the key frames and directly access the scene of their choice.

B. Video Segment Editing

Video editing based on the summarized index is also supported by the developed system. In the editing procedure, the segmented shots can be removed or merged into a new video stream by the user, so users can generate a video stream consisting of their favorite shots. Each of these functions is designed to be operated easily with a remote control. Figure 7 shows the operation of the editing tool.
Fig. 7. Video editing based on a summarized index (the figure shows the story units of "Summary of video 1" and "Summary of video 2").
The user chooses a shot segment, cluster, or story in the video indexes using a remote control; editing then requires only pushing the insert and move buttons on the remote control. With this simple process, users can easily put together their own favorite scenes.

C. Video Retrieval

Queries for similar scenes can also be performed with the developed system. The MPEG-7 descriptors are used to find similar scenes with a query-by-example method. This function plays an important role in user convenience by providing quick searching of favorite scenes for editing or direct access. An example of retrieval is shown in Figure 8.

Fig. 8. Video segment retrieval with key frames in the AVST.

If a user invokes the retrieval function by clicking a query image in one index, similar key frames are retrieved from all of the saved summary indexes.
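A minimal sketch of such a query-by-example search over the saved indexes is given below; it reuses cld_distance and the hypothetical ShotIndexEntry records from the earlier sketches, and the CSD distance and the weighting of the two descriptors are assumptions, since the paper does not state the exact combination rule.

def csd_distance(a, b):
    """L1 distance between two 256-bin Color Structure histograms."""
    return sum(abs(p - q) for p, q in zip(a, b))

def query_by_example(query, indexes, top_k=10, w_cld=1.0, w_csd=0.01):
    """Return the top_k (video_id, shot) pairs most similar to the query shot.
    `indexes` maps a video id to its list of ShotIndexEntry records."""
    scored = []
    for video_id, shots in indexes.items():
        for shot in shots:
            d = (w_cld * cld_distance(query.cld, shot.cld)
                 + w_csd * csd_distance(query.csd, shot.csd))
            scored.append((d, video_id, shot))
    scored.sort(key=lambda item: item[0])
    return [(video_id, shot) for _, video_id, shot in scored[:top_k]]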
IV. EXPERIMENTAL RESULTS

The MPEG-7 visual descriptors are utilized in the developed system with a view to the further extension of home entertainment systems with internet connectivity. The detailed algorithms for extracting the visual features, and the similarity measures between features, follow the MPEG-7 eXperimentation Model (XM) document and software [15][16]. To analyze the performance of scene change detection with the visual descriptors, two trailers and two music videos were selected as test video data. Since the detection of abrupt scene changes is usually easier than that of gradual transitions, this test data was chosen because each clip includes many gradual scene changes. TABLE 2 lists the number of scene changes in each video.
TABLE 2
INFORMATION OF VIDEO IN EXPERIMENTS

No. | No. of frames | No. of abrupt changes | No. of gradual transitions
 1  |     4005      |          43           |            29
 2  |     3616      |          55           |            22
 3  |     5552      |           0           |            34
 4  |     6858      |          78           |            37
The results of detection are shown in TABLE 3. According to the results, the MPEG-7 color descriptors can be utilized as a feature for scene change detection.
TABLE 3
THE ACCURACY RATE OF SCENE CHANGE DETECTION

    |       Abrupt change       |    Gradual transition
No. | Recall (%) | Precision (%) | Recall (%) | Precision (%)
 1  |     93     |      98       |     81     |      88
 2  |     94     |      98       |     88     |      65
 3  |     --     |      --       |     87     |      77
 4  |     95     |      97       |     81     |      89

(Video 3 contains no abrupt scene changes; see TABLE 2.)
An example of the scene change detection and key frame extraction for a movie is displayed in Figure 9. The key frame is selected as the first frame of each shot.

Fig. 9. An example result of an AVST summary.
From the experimental results, the number of clusters decreased on average to about 30 percent of the total number of segmented shots. After the clustering procedure, all the clusters are gathered and classified into story units; the total number of story units is usually about 15 percent of the number of clusters. The generated shots, clusters, and story units form a hierarchical structure that can assist the consumer in viewing and accessing the video content easily. To analyze summarization efficiency, six types of video data were utilized. The summarized results are shown in TABLE 4.
TABLE 4
INFORMATION OF VIDEO IN EXPERIMENTS

Video       | Duration | No. of shots | No. of clusters | No. of stories
Comedy show | 45m 15s  |     757      |       279       |       34
Drama       | 19m 46s  |     174      |        47       |        6
Animation   | 27m 27s  |     630      |       267       |       14
Music show  | 83m 13s  |    1141      |       428       |       43
News        | 60m 36s  |    1066      |       712       |       66
Movie       | 39m 21s  |     472      |       167       |       12
The number of stories depends upon the characteristics of the video data. For video data with short-duration shots, such as news, music shows, and comedy shows, there are many shots, clusters, and stories in the summarized index. On the other hand, the drama, animation, and movie data have a small number of stories because they consist of long shots with similar video content.

The retrieval result for a queried key frame is presented in Figure 10. The color descriptors are utilized here as well; compounding multiple features can produce more accurate retrieval results. A detailed performance analysis of the compound color descriptors is given in [17].

Fig. 10. The retrieval results in key frames
The image on the top left is the query image, and the similar frames retrieved are displayed in the result window. A user can directly access and edit the video with this result. Figure 11 shows an example of an edited scene assembled from three individual dramas. The generated scene includes only one actress, according to the user's choice. The new scene can also be saved for later playback.
Fig. 11. The resulting new video segment generated from three different video clips.
On the top left-hand side is the list of opened indexes, and on the right-hand side is the summarized index of the selected file. The two bottom rows are the edited scenes and the generated scenes selected by the user, and the result of the newly generated scene is displayed in the middle.
V. CONCLUSION
In order to manage the ever increasing amount of multimedia data based on audio-visual information, the MPEG-7 standard has been set up as an alternative to content-based image retrieval techniques. With the MPEG-7 standard, multimedia content can be described in a way that supports some degree of coherency in interpreting or accessing the data. The objective of this research was to develop a summarizing system using only the visual descriptors in Part-3, without any human intervention or manual annotation.

In this paper, we have introduced our video summarizing tool. The resulting tool enables users to access a video easily through a generated summarization index. In addition, the summarization index supports other operations that can be helpful and interesting for users: querying for a scene and editing a video stream. The MPEG-7 descriptors are used both to obtain the video summarization and to retrieve a queried scene. The proposed tool was devised to operate inexpensively with a simple interface; therefore, it can be embedded in a PVR. Furthermore, the system can be extended to a video search engine for internet-connected PVRs using MPEG-7 techniques. The developed system presents a prototype for the increasing number of applications using the MPEG-7 standard, and also provides basic techniques for future home entertainment systems.
REFERENCES

[1] MPEG-7 Group, "MPEG-7 Applications Document," ISO/IEC JTC1/SC29/WG11/N2462, Atlantic City, Oct. 1998.
[2] A. Yamada, E. Kasutani, M. Ohta, K. Ochiai, and H. Matoba, "Visual Program Navigation System based on Spatial Distribution of Color," IEEE Proc. of International Conference on Consumer Electronics, 13-5, pp. 280-281, 2000.
[3] N. Fatemi and O. A. Khaled, "Indexing and retrieval of TV news programs based on MPEG-7," IEEE Proc. of International Conference on Consumer Electronics, 20-6, pp. 360-361, 2001.
[4] T. Walker, "Content-based navigation of television programs using MPEG-7 description schemes," IEEE Proc. of International Conference on Consumer Electronics, 13-1, pp. 272-273, 2000.
[5] T. Sikora, "Visualization and Navigation in Image Database Applications based on MPEG-7 Descriptors," IEEE Proc. of International Conference on Image Processing, vol. 3, p. 583, Oct. 2001.
[6] A. Divakaran, R. Regunathan, and K. A. Peker, "Video Summarization Using Descriptors of Motion Activity: A Motion Activity Based Approach to Key-Frame Extraction from Video Shots," Journal of Electronic Imaging, vol. 10, no. 4, pp. 909-916, Oct. 2001.
[7] B. S. Manjunath et al., Introduction to MPEG-7, John Wiley & Sons Ltd., West Sussex, England, 2002.
[8] ISO/IEC 15938-3, "Multimedia Content Description Interface - Part 3: Visual," version 1, 2001.
[9] Y. Yusoff, W. Christmas, and J. Kittler, "Video shot cut detection using adaptive thresholding," British Machine Vision Conference, Sep. 2000.
[10] J. Bescos, J. M. Menendez, G. Cisneros, J. Cabrera, and J. M. Martinez, "A unified approach to gradual shot transition detection," IEEE Proc. of International Conference on Image Processing, vol. 3, pp. 949-952, Aug. 2000.
[11] H. Chang, S. Sull, and S. Lee, "Efficient Video Indexing Scheme for Content-Based Retrieval," IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, pp. 1269-1279, Dec. 1999.
[12] A. Hanjalic and H. Zhang, "An Integrated Scheme for Video Abstraction based on Unsupervised Cluster-Validity Analysis," IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, pp. 1280-1289, Dec. 1999.
[13] B. Shahraray and D. Gibbon, "Automatic generation of pictorial transcripts of video programs," SPIE Proc. of Multimedia Computing and Networking, pp. 512-518, Feb. 1995.
[14] M. Yeung and B. L. Yeo, "Segmentation of video by clustering and graph analysis," Computer Vision and Image Understanding Journal, vol. 71, no. 1, pp. 97-109, Jul. 1998.
[15] MPEG-7 Visual part of eXperimentation Model Version 10.0, ISO/IEC JTC1/SC29/WG11/N4063, Singapore, Mar. 2001.
[16] MPEG-7 Experimental Model Software, http://www.lis.e-technik.tu-muenchen.de/research/bv/topics/mmdb/e_mpeg7.html.
[17] J. H. Lee, H. J. Kim, and W. Y. Kim, "Video/Image Retrieval System (VIRS) based on MPEG-7," IEEE Proc. of International Conference on Information Technology: Research and Education, submitted for publication, Aug. 2003.

Jae-Ho Lee received his B.S. and M.S. degrees from Hanyang University, Seoul, Korea, in 1999 and 2001. He is currently working towards the Ph.D. degree in the Division of Electrical and Computer Engineering at Hanyang University. His research interests include motion detection/segmentation, face recognition, and MPEG-7. He has experience in system development with various computer vision techniques, such as surveillance systems, retrieval systems, face recognition systems, and personal video recorders. Recently, his work has focused on the face recognition descriptor in the MPEG-7 visual part. He is a student member of IEEK and IEEE.

Gwang-Gook Lee received his B.S. degree in electrical and computer engineering from Hanyang University, Seoul, Korea, in 2002. He is currently working towards the M.S. degree at Hanyang University. His research interests are in the areas of video processing, video browsing, and retrieval. Recently, his work has focused on image/video understanding.
Whoi-Yul Kim received his B.S. degree in Electronic Engineering from Hanyang University, Seoul, Korea, in 1980. He received his M.S. from Pennsylvania State University, University Park, in 1983 and his Ph.D. from Purdue University, West Lafayette, in 1989, both in Electrical Engineering. From 1989 to 1994, he was with the Erik Jonsson School of Engineering and Computer Science at the University of Texas at Dallas. Since 1994, he has been on the faculty of Electronic Engineering at Hanyang University, Seoul, Korea. He has been involved with the research and development of various range sensors and their use in robot vision systems. Recently, his work has focused on content-based image retrieval systems. Two of his proposals for MPEG-7 visual descriptors were selected for the international standard. He is a member of IEEE.