Video Keyframe Extraction and Filtering: A Keyframe is not a Keyframe to Everyone

Nevenka Dimitrova, Thomas McGee, Herman Elenbaas
Philips Research, 345 Scarborough Rd., Briarcliff Manor, NY 10510
{nvd, tfm, hje}@philabs.research.philips.com
Phone: (914) 945-6059, 6271, 6584; Fax: (914) 945-6511

Abstract

In this paper, we describe the keyframe extraction and filtering process within the video content indexing system called Vitamin. The video content filtering system analyzes the source video and presents to the user a visual table of contents using thumbnail images. The filtering process eliminates keyframes which do not contribute to the overall comprehension of the video contents. The user should be able to access particular points on a VHS tape or MPEG file using this visual table of contents. We have analyzed over ten hours of video content from different movies, home videos, serials, and sitcoms. Our experiments show that the number of keyframes is reduced to a manageable size, thus enabling only important visual information to be presented to the user.

Keywords: video content filtering, video segmentation, video cut detection, video content indexing, content-based video retrieval

1. Introduction

Currently, consumers store video content on VHS tapes. In the next few years, with the introduction of DVD and CD-RW, consumers are going to build their own digital video libraries. There is a converging market for traditional audio/video appliances and multimedia PCs. In response, several companies have announced the introduction of PC-TV devices. Video content analysis and filtering will be a value-added video content management application for the converged PC-TV devices. We assume that in the most general case, a video clip will be taken from a TV program or a home video. There is a large variety of content that consumers would want to store on a VHS tape, Video CD or DVD. The programs that people record range from episodes of their favorite TV serials, movies, and concerts to sport events. Our goal is to analyze informative and entertainment programs such as news, serials, movies, weather or sports. In order to achieve this goal, we need visual tools that empower users to find important points of programs without delay. Specifically, these tools must intuitively provide multi-point access to the linear, time-based medium of video.

Video content analysis involves automatic and semi-automatic methods to extract information that best describes the content of the recorded material [1-6]. Extracting information can be as simple as detecting video scene changes and selecting the first frame of a scene as a representative frame. However, to deliver a meaningful representation to the user, the segments should be further analyzed to extract information that truly represents the content of the video. This process may include recognizing the cast, transcribing text from film titles, analyzing the audio track, semantic interpretation of activities, and assigning meaning to particular subsegments of the video. Furthermore, video clips can be divided into story units which are independent. We can search the story units by the individual meaningful features that are assigned to them. As an example, a user may ask for the peak scenes of the movie "Shine" which contain only the main movie actor. Ultimately, it would be desirable to customize the choice of keyframes to the user's interest. Solving these issues requires long-term research in video content analysis. In this paper we focus on issues related to video processing and keyframe filtering in real time.

In the Vitamin system there are two phases in the video content indexing process: archival and retrieval. During the archival process, video content is analyzed and an index is created. This video analysis process includes automatic video cut detection and keyframe selection. Video cut detection (also called video segmentation or video parsing) is a process of identifying scene changes, i.e. "cuts". For each scene, a particular representative frame, called a keyframe, is extracted. The keyframe filtering and selection process is then applied to create the visual index from selectively chosen keyframes. In the case of video tape indexing, this index is stored on the video tape. In the case of indexing an MPEG file from a Video CD or DVD, the index is stored on a hard disk. The retrieval process displays the visual index to the user in a form most suitable from a UI design point of view. Currently, we display thumbnail images to the user at different levels of temporal granularity. The user can browse and navigate through the visual index and fast forward to a certain point on the video tape or the MPEG file.

The organization of the paper is as follows: Section 2 describes technical issues in automatic keyframe extraction and filtering. Section 3 describes the current video indexing demonstration prototype. Section 4 focuses on research issues and future work in video analysis. Section 5 gives concluding remarks.



2. Automatic Keyframe Extraction and Filtering Process

The purpose of video content filtering and analysis is to extract structure and meaning from visual, auditory and textual cues in the video. The video filtering system analyzes the content of the source video and presents to the user a visual table of contents using thumbnail images. As mentioned earlier, video content indexing can be an involved process if we consider the human factors and try to fully match the human power of abstraction with the current limited processing power. In the beginning of our research we focused on issues related to video processing at the rate of 30 frames per second and preliminary filtering of keyframes. In this section, we describe the initial process of automatic video content indexing. We defer the discussion of longer-term research issues to Section 4.


The process of automatic video content indexing and archival consists of several stages (see Figure 1). The first stage is video cut detection [2,7]. In this stage, scene changes are detected, and the video clip is segmented into scenes. From each scene we extract one or more keyframes that represent the scene. The methods for video cut detection can be performed on uncompressed video or previously compressed video [2]. The video cut detection and keyframe extraction process will be explained in Section 2.1. After cut detection, the keyframes are analyzed and filtered based on their content or based on similarity with previously selected keyframes. The keyframe filtering process (second stage) will be discussed in Section 2.2. After filtering, the final reduced list of keyframes is archived on disk or tape.
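The stages above can be read as a simple sequential pipeline. The following Python sketch only illustrates that data flow; the function names and thresholds are hypothetical placeholders, not the Vitamin implementation, and the frame comparisons here use raw pixel values rather than the DCT-based method described below.

```python
from typing import List, Sequence
import numpy as np

def detect_cuts(frames: Sequence[np.ndarray], threshold: float = 30.0) -> List[List[np.ndarray]]:
    """Stage 1 (placeholder): split the frame sequence into scenes at large frame-to-frame differences."""
    scenes, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if np.mean(np.abs(cur.astype(float) - prev.astype(float))) > threshold:
            scenes.append(current)
            current = []
        current.append(cur)
    scenes.append(current)
    return scenes

def extract_keyframes(scene: List[np.ndarray]) -> List[np.ndarray]:
    """Pick a representative frame; here simply the first frame of the scene."""
    return [scene[0]]

def filter_keyframes(keyframes: List[np.ndarray], similarity_threshold: float = 10.0) -> List[np.ndarray]:
    """Stage 2 (placeholder): drop a keyframe that is too similar to the previously kept one."""
    kept: List[np.ndarray] = []
    for kf in keyframes:
        if not kept or np.mean(np.abs(kf.astype(float) - kept[-1].astype(float))) > similarity_threshold:
            kept.append(kf)
    return kept

def build_visual_index(frames: Sequence[np.ndarray]) -> List[np.ndarray]:
    """Run cut detection, keyframe extraction and filtering; the result would be archived."""
    keyframes = [kf for scene in detect_cuts(frames) for kf in extract_keyframes(scene)]
    return filter_keyframes(keyframes)
```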


2.1 Video Cut Detection

In this stage, consecutive frames are compared and scene changes are detected (see Figure 1). From each scene we extract one or more keyframes that represent the scene. Most of the current systems in the literature use the first frame in a scene. However, in many cases, the main subject or event in a scene appears after a camera zoom or a pan. Our method is based on comparing DCT (Discrete Cosine Transform) coefficients and therefore represents a midway approach. If the video application uses MPEG compressed video, then the video cut detection can be performed after the Huffman decoding and dequantization steps and before the inverse DCT transform (see Figure 2). In this approach, all the processing can be performed during decoding in real time. If the video application runs on a platform which delivers uncompressed video, and the processing power on the host machine is limited, then by applying a fast DCT transform, the video frames can be transformed into blocks of DCT coefficients (see Figure 3). In this case, the video cut detection method compares a weighted selection of these coefficients.
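For the uncompressed-video path just described, each frame would first be turned into 8x8 blocks of DCT coefficients. The sketch below is a minimal illustration, assuming a grayscale frame held in a NumPy array and using SciPy's DCT routines rather than the paper's own fast-DCT implementation.

```python
import numpy as np
from scipy.fft import dctn  # type-II DCT, the transform used in JPEG/MPEG intra coding

def block_dct_coefficients(frame: np.ndarray, block: int = 8) -> np.ndarray:
    """Return an array of shape (rows, cols, block, block): one DCT coefficient
    block per 8x8 pixel block of the (grayscale) frame."""
    h, w = frame.shape
    h, w = h - h % block, w - w % block                      # crop to a multiple of the block size
    blocks = frame[:h, :w].astype(np.float64)
    blocks = blocks.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return dctn(blocks, axes=(-2, -1), norm="ortho")

# The DC term of each block, coefficients[..., 0, 0], is one natural choice of
# "weighted selection" to compare between consecutive frames.
```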


Figure 1. Video keyframe filtering process

Figure 2. Frame decoding process for Motion JPEG or MPEG (intracoded frames).

Figure 3. Frame encoding process in Motion JPEG or MPEG (intracoded frames).

We experimented with several algorithms. This family of algorithms requires a DCT-transformed video frame, similar to the first step of JPEG or MPEG encoding. The methods above divide the video frame information into blocks of DCT data. However, there are different types of data stored in different blocks: some are related to color, others to light intensity. The human eye is more sensitive to intensity than it is to color. The CCIR 601 broadcast standard (4:2:0 YUV video) subsamples the color information by two, thereby reducing the amount of information that needs to be sent. This means that for every color block sent, two intensity blocks are sent. The broadcast standard is to send four light intensity blocks (luminance, Y) and two color blocks (chrominance, Cr and Cb). These six blocks together form a macroblock (see Figure 4). A video frame is thus divided into macroblocks. The cut detection algorithms are based on one of the following formulae:

Figure 4. Macroblock structure in MPEG or JPEG.

1) $\mathrm{SUM} = \sum_{k} \sum_{i} \left| DCT1_{k,i} - DCT2_{k,i} \right|$, for $1 \le k \le (\text{Frame\_width}/16) \times (\text{Frame\_height}/16)$, $1 \le i \le 6$, where $DCT1_{k,i}$ and $DCT2_{k,i}$ are DCT coefficients of block $i$ of macroblock $k$ that belong to the previous and the current video frame, respectively.

2) $\mathrm{SUM}[i] = \sum_{k} \left| DCT1_{k}[i] - DCT2_{k}[i] \right|$, for $1 \le k \le (\text{Frame\_width}/16) \times (\text{Frame\_height}/16)$, $1 \le i \le 6$, where $DCT1_{k}[i]$ and $DCT2_{k}[i]$ are DCT coefficients that belong to the previous and the current video frame, respectively.

3) $\mathrm{SUM} = \sum_{k} \sum_{i} \left( DCT1_{k,i} - DCT2_{k,i} \right)^{2} / \left| DCT2_{k,i} \right|$, for $1 \le k \le (\text{Frame\_width}/16) \times (\text{Frame\_height}/16)$, $1 \le i \le 6$.

4) $\mathrm{SUM}[i] = \sum_{k} \left( DCT1_{k}[i] - DCT2_{k}[i] \right)^{2} / \left| DCT2_{k}[i] \right|$, for $1 \le k \le (\text{Frame\_width}/16) \times (\text{Frame\_height}/16)$, $1 \le i \le 6$.

The above equations were implemented as described in the following.
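To make the formulas concrete, the following Python sketch evaluates formulas (1) and (3), assuming that one representative coefficient (e.g., the DC value) per block has already been collected into an array of shape (num_macroblocks, 6) for each frame; the threshold value is hypothetical and not taken from the paper.

```python
import numpy as np

def abs_difference(dct_prev: np.ndarray, dct_cur: np.ndarray) -> float:
    """Formula (1): sum of absolute coefficient differences over all macroblocks
    and all six blocks (4 luminance + Cr + Cb) of each macroblock."""
    return float(np.abs(dct_prev - dct_cur).sum())

def normalized_difference(dct_prev: np.ndarray, dct_cur: np.ndarray, eps: float = 1e-6) -> float:
    """Formula (3): squared differences normalized by the current frame's coefficients."""
    return float(((dct_prev - dct_cur) ** 2 / (np.abs(dct_cur) + eps)).sum())

def is_cut(dct_prev: np.ndarray, dct_cur: np.ndarray, threshold: float = 5000.0) -> bool:
    """Declare a scene cut when the frame difference exceeds a (hypothetical) threshold."""
    return abs_difference(dct_prev, dct_cur) > threshold
```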

Figure 6. A snapshot of the master video index.

3.2 Single Video Index

The challenging presentation task is to provide comfortable modes of keyframe display at various levels of granularity. For this purpose, in the Vitamin prototype the contents of an entire video are represented using a couch potato mode and a videophile mode. In both modes, the user can have immediate access to a particular point on the video by selecting its corresponding image (in either static or slide-show mode).

...button. The user can also insert annotation for each section.

Videophile mode: In this mode, the user is presented with a summary of the complete video at a higher level of granularity (see Figure 9). The user can select a logical video segment (in the figure, it is segment 2) and then obtain the keyframes for that segment and their respective slide-shows. In the current implementation, the user is presented with approximately 200 images in this mode.

Figure 8. Slide-show associated with the first thumbnail image.

Figure 9. Sample screen of the visual index in videophile mode.


Video                         Static scenes      Static scenes      Scene cuts         Scene cuts
                              before filtering   after filtering    before filtering   after filtering
Philips Company Video 1996    274                160                84                 80
Home video 1                  158                90                 17                 14
Home video 2                  227                128                11                 9

Table I. Results of scene detection with and without keyframe filtering.

3.4 Experimental Results

We used the video cut detection method implemented with equation 5 (see Section 2.1). Video cut detection and static scene detection were used with keyframe filtering on several 30-minute segments. These results are shown in Table I. The video clips consisted of two home videos and a Philips Company Video. All three of these videos had many more static scenes than scene breaks. These static scenes represented segments such as people opening presents on birthdays or a keynote speaker. This clearly showed the significance of static scene detection. As can be seen from Table I, the keyframe filtering currently reduces the number of keyframes by slightly more than 40% for the static scenes, but produces only a slight decrease for the scene cuts. Upon analysis of the data, there were no repetitive frames found for the scene cuts; the reduction was solely due to detection of unicolor frames. Research in keyframe filtering is still in its initial stages, and further refinements are expected to improve this performance.
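The unicolor-frame detection mentioned above can be approximated by checking how little a keyframe's pixel (or DC coefficient) values vary. The sketch below is only an illustration of that idea; the variance threshold is a hypothetical value, not the one used in Vitamin.

```python
import numpy as np

def is_unicolor(frame: np.ndarray, std_threshold: float = 4.0) -> bool:
    """Treat a keyframe as (near-)unicolor if its values vary very little."""
    return float(np.asarray(frame, dtype=np.float64).std()) < std_threshold

def drop_unicolor_keyframes(keyframes: list) -> list:
    """Remove unicolor keyframes, as done in the filtering stage discussed above."""
    return [kf for kf in keyframes if not is_unicolor(kf)]
```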

4. Research Issues

If multiple programs are recorded on the same tape (analog) or in the same file (digital), then the main problem is to separate the coherent pieces of the recording. To a human viewer, the borders between programs are recognizable because programs normally have their own distinguishing characteristics: same cast, time period setting, stage sets, and theme. Even if there are two recordings on the same video tape, say ER and Deep Space Nine, and even if the credits do not appear on either of them, humans can clearly recognize the end of the first recording and the start of the second recording.

To detect these boundaries automatically and to extract the logical structure of the program, we have to further explore ways of detecting representative features of the programs, such as:

- Program cast (face detection and recognition). To determine the cast of a certain program, we can search for the close-up representative frames. Then, using face-recognition software, we can try to find the common faces in subsequent scenes. The best results, of course, would be obtained if there is a learning stage in which the main actors of each program are introduced to the video analysis system. There is an abundance of algorithms for face detection in the computer vision literature [10]. Most of the algorithms work only with heavy restrictions on lighting, size and orientation of the face. For the purpose of cast recognition, we need a face recognition algorithm that is robust and does not have the aforementioned restrictions. (A minimal detection sketch appears after this list.)

- Background (recognition of stage sets). In sitcoms and serials there are certain stage sets that repeat throughout the program. For example, in Seinfeld we see Jerry's apartment, the coffee shop, Elaine's office, etc. We can use scene analysis techniques to identify the location of the current scene.

- Beginning and tail credits. Frames that contain text can be more meaningful than randomly selected frames. These frames can indicate the start of a movie or story topic, or summarize a discussion. A method developed by the University of Mannheim uses a split-and-merge algorithm to identify regions of the image with textual content [11]. The candidate regions of appropriate size are then subjected to an OCR algorithm.

- Change of audio content such as music, speech and noise [12].

- Program transcript. Closed captioning (or a speech transcript) can be used to aid in the overall reconstruction of the program structure.

- Overall structure of a program. We need information from multiple sources, such as the type of scenes, the number of people in each scene, and the type of background, to derive a representation based on understanding of the video content. Film theory can provide some general information about the intended structure of the program. If the intended structure is known, then we can use video and image analysis techniques to parse and recognize the structure of the program [13]. However, it is very challenging to translate the knowledge in film theory and design into algorithmic methods that will automatically analyze video content.
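As a starting point for the cast-recognition step in the first bullet, an off-the-shelf face detector can flag close-up keyframes before any recognition takes place. The sketch below uses OpenCV's bundled Haar cascade purely as a stand-in; it is not the robust detector the authors call for, and the size threshold is a hypothetical parameter.

```python
import cv2

# Bundled frontal-face Haar cascade; a more robust detector would replace this step.
_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def closeup_faces(keyframe_bgr, min_relative_size: float = 0.2):
    """Return face bounding boxes that are large relative to the frame,
    i.e. candidate 'close-up' keyframes for cast recognition."""
    gray = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2GRAY)
    faces = _CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    h, w = gray.shape
    return [(x, y, fw, fh) for (x, y, fw, fh) in faces
            if fw >= min_relative_size * w or fh >= min_relative_size * h]
```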

5. Conclusions

The goal of our research is to develop a real-time video analysis system to extract a representation of visual content. This is a very computationally demanding task, since we analyze video content "on the fly" with limited memory and processor resources. In this paper we have described the process of automatic keyframe extraction and filtering in our Vitamin system. The initial results show that a filtered set of keyframes is less cumbersome and more powerful for representing the video content than the huge number of keyframes that results from cut detection. However, in order to extract a more meaningful representation, we need algorithms which try to match the human understanding of "important" frames and at the same time do not require heavy computations. We are augmenting the development of algorithms for video content analysis with findings from user interface research.


Acknowledgments

We would like to thank Jim McCartney for his help in video digitization and content acquisition. We would also like to thank reviewers Aninda Dasgupta and Eric Cohen-Solal for their insightful comments.

References

[1] Nevenka Dimitrova and Mohamed Abdel-Mottaleb, "Content-based Video Retrieval by Example Video Clip," Proceedings of IS&T/SPIE Symposium on Electronic Imaging Science and Technology, Conference on Storage and Retrieval for Image and Video Databases V, Volume 3022, San Jose, pp. 59-70, 1997.
[2] Hong-Jiang Zhang et al., "Video Parsing and Browsing Using Compressed Data," Multimedia Tools and Applications, Vol. 1, No. 1, pp. 89-111, 1995.
[3] A. Nagasaka and Y. Tanaka, "Automatic Video Indexing and Full-Video Search for Object Appearances," Visual Database Systems II, E. Knuth and L.M. Wegner, editors, North-Holland, pp. 119-133, 1991.
[4] H.J. Zhang, A. Kankanhalli, and S.W. Smoliar, "Automatic Partitioning of Full-motion Video," Multimedia Systems 1(1), pp. 10-28, 1993.
[5] R. Zabih, J. Miller and K. Mai, "A Feature-Based Algorithm for Detecting and Classifying Scene Breaks," Proc. ACM Multimedia 95, San Francisco, CA, November 1995, pp. 189-200.
[6] F. Arman, A. Hsu and M.-Y. Chiu, "Image Processing on Encoded Video Sequences," Multimedia Systems (1994) Vol. 1, No. 5, pp. 211-219.
[7] Alan Hanjalic, Marco Ceccarelli, Reginald L. Lagendijk, and Jan Biemond, "Automation of Systems Enabling Search on Stored Video Data," Proceedings of IS&T/SPIE Symposium on Electronic Imaging Science and Technology, Conference on Storage and Retrieval for Image and Video Databases V, Volume 3022, San Jose, pp. 427-438, 1997.
[8] Amarnath Gupta and Ramesh Jain, "Visual Information Retrieval," Communications of the ACM, May 1997, Vol. 40, No. 5, pp. 70-79.
[9] Nevenka Dimitrova and Forouzan Golshani, "Motion Recovery for Video Content Classification," ACM Transactions on Office Information Systems, Vol. 13, No. 4, October 1995, pp. 408-439.
[10] Jie Yang and Alex Waibel, "A Real Time Face Tracker," Proceedings of the Third IEEE Workshop on Applications of Computer Vision, Sarasota, Florida, pp. 142-147, 1996.
[11] Rainer Lienhart, "Automatic Text Recognition for Video Indexing," Proceedings of ACM Multimedia '96, Boston, November 1996.
[12] N.V. Patel and I.K. Sethi, "Video Classification using Speaker Identification," Proceedings of IS&T/SPIE Symposium on Electronic Imaging Science and Technology, Conference on Storage and Retrieval for Image and Video Databases V, Volume 3022, San Jose, pp. 218-225, 1997.
[13] M.A. Smith and E.H. Christel, "Automating the Creation of a Digital Video Library," Proceedings of ACM Multimedia 95, San Francisco, November 1995.
[14] Rafael C. Gonzalez and Richard E. Woods, "Digital Image Processing," Addison-Wesley Publishing Company, 1992.
