An Evaluation of Motion and MPEG Based Methods for Temporal Segmentation of Video
R. Kasturi, S. H. Strayer, U. Gargi, S. Antani
Department of Computer Science and Engineering
Technical Report CSE-98-014
October 8, 1998
Contents

List of Tables
List of Figures
Abstract
1 Introduction
2 Objectives
3 Evaluation Criteria
4 Test Data Set
5 Indexing of MPEG compressed video streams
  5.1 Algorithms for Indexing MPEG compressed video
      5.1.1 Video Parsing using Compressed Data
      5.1.2 A Statistical approach to Scene Change Detection
      5.1.3 A Fast Algorithm for Video Parsing Using MPEG Compressed Sequences
      5.1.4 Scene Change Detection in an MPEG Compressed Video Sequence
      5.1.5 Scene Decomposition of MPEG Compressed Video
      5.1.6 Rapid Scene Analysis on Compressed Video
      5.1.7 Other MPEG based methods
  5.2 MPEG based algorithms: Results
  5.3 The Sethi-Patel Algorithm: Results
6 Motion-based Indexing
  6.1 Motion Based Algorithms
      6.1.1 Video Indexing using Motion Vectors
      6.1.2 Scene Change Detection and Content-Based Sampling of Video Sequences
      6.1.3 Text, Speech and Vision for Video Segmentation - The Informedia Project
      6.1.4 Optical Flow-Based Model for Scene Cut Detection
      6.1.5 Temporal Segmentation of Videos: A New Approach
  6.2 Motion based methods on uncompressed video: Results
7 Observations and Conclusions

Appendices
A Video Event Typology
  A.1 Introduction
  A.2 Terminology
  A.3 Video Events
      A.3.1 Editing events
  A.4 Dynamic events
      A.4.1 Structural or Meta Events
  A.5 Ground Truth File Format
  A.6 Notes
      A.6.1 Some conventions
B The MPEG Video Compression Algorithm
  B.1 Introduction
  B.2 MPEG Video Layers
  B.3 I-Frames
  B.4 P-Frames
  B.5 B-Frames
C VADIS - User Manual
  C.1 Introduction
  C.2 System Requirements
  C.3 Functionality
  C.4 File Menu
  C.5 Movie Menu
  C.6 Subsequence Menu
  C.7 Live Feed Menu
List of Tables
1 Description of the original sequences used to generate the dataset
2 MPEG Algorithms: Cut Detection Performance
3 MPEG Algorithms: Gradual Transition Detection Performance
4 MPEG Algorithms: Camera Flash Detection Performance
5 The Sethi-Patel Algorithm: Cut Detection Performance Study
6 Test results for cut detection
List of Figures
1 MPEG Video Layer Hierarchy
2 Zigzag placement of DCT values
3 Intraframe (or intrapicture) coding
4 P-frame coding and B-frame coding
5 VADIS Main Window
6 VADIS File Menu
7 VADIS Open Index/Movie File Dialog Box
8 VADIS Movie Menu
9 VADIS SGI Movie indexing parameters - Using color histograms
10 VADIS MPEG Movie indexing parameters
11 VADIS Subsequence Menu
12 VADIS Subsequence Play Commands
13 VADIS Live Feed Menu
14 VADIS Live indexing parameters
Abstract
This technical report discusses the results from the final phases of evaluating the state of the art in video segmentation algorithms. The performance of various methods proposed in the literature that use motion information from uncompressed and MPEG compressed video streams is described. Selected algorithms from those published in the literature were implemented and evaluated on a common ground-truthed data set developed during the first phase of the project. The MPEG compressed data set was subjected to another ground-truthing process to detect gradual transitions and camera flashes. The algorithms that detected gradual transitions and camera flashes were classified and evaluated separately. The report concludes with our observations on the performance of the motion based methods. The report has three appendices that define the video typology developed by us, give an overview of the MPEG-1 compression algorithm, and provide a user manual for the Video Analysis, Display and Indexing System (VADIS) prototype. VADIS is an indexing tool that demonstrates the results of some of the important algorithms.
1 Introduction

We have evaluated a number of algorithms for indexing video sequences based on color histograms [1] and on frame motion information from uncompressed video and from the MPEG-compressed video bitstream. This report details our evaluation of the frame motion based indexing methods.

In the first phase of this project, we evaluated color histogram based techniques. These methods employed histogram differences to temporally segment video sequences. The input data was converted to one of a number of different color space representations such as RGB, HSV, YIQ, L*a*b*, L*u*v* and Munsell, and a histogram in one or more dimensions was computed from the resulting data. Difference measures were applied to a uniformly subsampled sequence of the computed frame histograms to measure the corresponding changes. Large frame differences indicated a possible shot change. The video sequence was then temporally segmented into subsequences based on "sufficiently" large difference values. The shot changes generated by the algorithm were compared against those marked by humans, which we treated as the ground truth. The comparison was based on two quantities: Missed Detections (MDs), which are shot changes that were marked by humans but missed by the algorithms, and False Alarms (FAs), which are shot changes detected by the algorithm that do not correspond to the ground truth. The reader is referred to [1] for a complete picture of that evaluation.

In this part of the evaluation, motion-based indexing methods and compressed video based methods were evaluated. The evaluation of the two classes of algorithms was combined into one extended evaluation, as the distinction between them is not dichotomous. The motion based methods were found to be computationally very intensive. Much of this computation time was expended in computing the motion vectors, which are readily available in MPEG video bitstreams, and many MPEG based methods use this precomputed motion vector information.

There are two main justifications for segmenting MPEG compressed video. The primary one is the speed gain derived from not having to decompress the video information; the decoding of the compressed bitstream to obtain the Discrete Cosine Transform (DCT) terms, which is a necessary step, forms only a small part of the total time spent in segmenting the MPEG bitstream [2]. Secondly, as described in our technical report on the evaluation of color histogram based methods [1], the increasing importance of video data demands not only the development of robust indexing algorithms, but also their evaluation on a generalized data set. The MPEG-2 standard, which was developed for satellite broadcast TV, has already been put to extensive use. Thus, the need for MPEG-based video indexing algorithms is further endorsed.

Before proceeding with our evaluation, we define some relevant terminology. A video frame is a single two dimensional (2D) image. A video sequence is a set of continuous video frames that makes up an episode on a particular topic. In any episode, the human vision system can identify and interpret the various shot changes. A set of frames between two shot changes is called a subsequence. Sometimes these changes are gradual and consist of special effects like fades, blends or cross-dissolves. All of these are classified as forms of gradual transitions, which should be recognized as distinct video events. They are marked by two numbers, a start frame number and an end frame number.
Another kind of video event that is detected is the presence of camera flashes in the scene. Some of the MPEG based algorithms that are able to detect these flashes are included in our evaluation. We have also developed a general form of video typology which includes many other graphics effects and a description of the syntactic content of the frame. This can be used to record the occurrence of various video events in a ground-truth file or an index file for search in a video database. This typology has been included in Appendix A.

Other terminology used in this report is related to the MPEG compression standard. Appendix B describes the salient features of the MPEG-1 compression standard. In Appendix C we include a user manual explaining the functionality of the Video Analysis, Display and Indexing System (VADIS). VADIS was a prototype developed to demonstrate some of the algorithms which we evaluated. The user can get a feel for the methods by varying the parameters for each of the algorithms.
2 Objectives

The following were the objectives of the phases described in this technical report:
1. Implementation of selected motion-based video indexing methods.
2. Implementation of selected compressed video based indexing methods.
3. Correlating the performance of color histogram-, motion- and compressed video based methods.

The primary goals of objectives 1 and 2 above were:
1. Evaluation of the indexing methods for detecting cuts.
2. Evaluation of the indexing methods for detecting other events such as gradual transitions and camera flashes.
3 Evaluation Criteria

As described in our earlier report [1], we continue to use Missed Detections and False Alarms as criteria for evaluating the performance of the algorithms. To give a better intuitive feel for their behavior we have also calculated the Recall and Precision of the algorithms, defined as follows:

    Recall = Detections / (Detections + Missed Detections)        (1)

    Precision = Detections / (Detections + False Alarms)          (2)

In the above equations, Detections is the total number of ground-truth events the method correctly identifies, Missed Detections the number it fails to identify, and False Alarms the number of spurious detections. A method with high recall has very few Missed Detections; conversely, an algorithm with low precision has a very high number of False Alarms. An ideal algorithm would therefore have Recall and Precision values close to unity.

As mentioned earlier, many of the motion based algorithms can also detect video events other than shot changes. Some methods claim to detect gradual transitions, while one method also detects the presence of camera flashes. Since more than one type of event could be detected,
it became necessary to define a video typology that would define the events better. The video typology defined by us has been included in Appendix A. This typology contains definitions of events such as cuts, fades, blends, appearance and disappearance of text, graphics, etc. For the evaluation we used only three types of events from the typology:
1. Shot Changes: marked by a single frame number.
2. Gradual Transitions: These include video events such as fades, blends, cross-dissolves, etc. These are marked by a start and an end frame number.
3. Camera Flashes: These are marked by the frame number at the peak brightness produced by the flash.

Since the events are so diverse, it would be unfair to penalize the algorithms that do not detect all events. Therefore, the algorithms that claimed to detect a particular event were evaluated for that event separately.
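As a concrete illustration of equations (1) and (2), the following sketch scores a list of detected cut frames against ground-truth cut frames. The greedy matching strategy, the function name and the default matching window (3 frames, as used for cuts in Section 5.2) are our own illustrative choices, not part of any evaluated method.

```python
def score_cuts(detected, ground_truth, window=3):
    """Compute Detections, Missed Detections, False Alarms, Recall, Precision.

    detected, ground_truth: lists of frame numbers; a detection matches a
    ground-truth event if it lies within `window` frames of it (greedy match).
    """
    unmatched = set(ground_truth)
    detections = 0
    for frame in detected:
        match = next((g for g in unmatched if abs(g - frame) <= window), None)
        if match is not None:
            unmatched.remove(match)
            detections += 1
    missed = len(unmatched)
    false_alarms = len(detected) - detections
    recall = detections / (detections + missed) if (detections + missed) else 0.0
    precision = detections / (detections + false_alarms) if detected else 0.0
    return detections, missed, false_alarms, recall, precision
```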
4 Test Data Set

The methods were evaluated on NTSC video captured from VHS tape at 30 frames per second through a JPEG compression board. The video data is in the form of ten video sequences, ranging from one minute to five minutes in length. This data set, though not as exhaustive as a typical evaluation may demand, is fairly varied and representative of typical broadcast video. The ten sequences used in the evaluation are listed in Table 1.

Code     Sequence Name      Length (s)  Description
a        Savannah-Birdcage  70          movie preview
b        Seinfeld-jacket    70          sitcom
c        Deep Space Nine    140         drama
d        Robit-Rategate     50          advertisement
e        CNN-company        161         news clip
f        CNN-space          120         news clip
g        Predators          120         outdoor wildlife
h        MTV News-a         107         news show
i        MTV News-b         90          news clip
flashes  Flashes            70          press conference

Table 1: Description of the original sequences used to generate the dataset

For methods using the MPEG compressed data, these ten sequences were compressed using the SGI MPEG encoder at a rate of 4.15 Mbits/sec. The sequences have a cumulative total of 30403 frames. There are a total of 172 ground truth shot changes, 38 gradual transitions and 7 camera flashes.
5 Indexing of MPEG compressed video streams

This section lists the various MPEG-based algorithms that were studied for the evaluation. Not all of the algorithms listed below were implemented. Some algorithms were not implemented because:
1. The published work did not describe the algorithm very well and the authors could not be reached for an explanation of their work.
2. In the case of similar approaches to the indexing problem, the superior algorithm was implemented.

Appendix B contains an overview of the MPEG-1 video compression algorithm. To better understand the heuristics applied in the algorithms, it is strongly suggested that the reader have background knowledge of the compression technique. Briefly, with very little computation one can obtain from MPEG encoded sequences the motion vector information and the quantized DCT values, from which coarse color and luminance histograms can be built. The following subsections describe the salient features of the evaluated algorithms.
5.1 Algorithms for Indexing MPEG compressed video

5.1.1 Video Parsing using Compressed Data
Arman et al. [3] devised a shot change detection method using the inner product of JPEG (Joint Photographic Experts Group) coefficient vectors, which was modified for MPEG by Zhang et al. [4]. The image is divided into 8x8 pixel blocks. A subset of these blocks is chosen a priori. From each of the selected blocks a subset of the 64 DCT (Discrete Cosine Transform) coefficients is chosen. A vector is formed from these selected coefficients of each chosen block. The inner product of two vectors from different frames is used as the interframe similarity measure. In MPEG encoded sequences, the I-frames are intracoded and closely resemble JPEG compressed images; these are chosen for determining frame similarity. If two I-frames are significantly dissimilar then a shot change is said to have been detected. This method yields coarse shot boundaries: the "actual" change may have occurred anywhere within a GOP (Group of Pictures, the frames between two consecutive I-frames) size interval.

Zhang et al. modified the algorithm further to make the location of the detected cut more accurate. Cuts in B-frames are detected by comparing the minimum of the counts of forward and backward non-zero motion vectors against a preset threshold. If this number is greater than the threshold, a cut is declared. This heuristic is based on the motion compensation done by MPEG encoders. The encoder tries to find a block in the frame that best matches a reference block. If it does find such a block, the block can be represented by a vector pointing to the reference block plus small error values correcting minor differences. On a shot change, there would (ideally) be an increase in the number of such vectors due to the "moving" (changing) blocks. This number would be expected to drop back below the threshold once references are made to an I-frame from the new shot.
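A minimal sketch of the I-frame similarity test follows. It assumes the selected DCT coefficient vectors have already been extracted from the bitstream; the normalization and the threshold value are illustrative choices of ours, not taken from [3] or [4].

```python
import numpy as np

def inner_product_similarity(v1, v2):
    # normalized inner product of the DCT coefficient vectors of two I-frames;
    # values near 1 indicate similar frames
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)

def detect_cuts(iframe_vectors, threshold=0.9):
    # declare a shot change between consecutive I-frames whose similarity
    # falls below the threshold (the boundary is coarse: somewhere in the GOP)
    return [k for k in range(1, len(iframe_vectors))
            if inner_product_similarity(iframe_vectors[k - 1],
                                        iframe_vectors[k]) < threshold]
```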
5.1.2 A Statistical approach to Scene Change Detection
This approach by Sethi and Patel uses the DC coefficients [5, 6, 7, 8]. The method computes histograms of average block intensities, which are approximated by the DC terms of the DCT coefficients of the MPEG I-frames. Standard statistical measures are then applied to determine the shot changes. The following tests are applied:
1. Yakimovsky Test: the ratio of the variance of the combined histograms to the product of the individual histogram variances.
2. Chi-square Test: a normalized histogram difference measure.
3. Kolmogorov-Smirnov Test: the maximum cumulative bin difference.
In the original work by the authors, the statistical tests are applied to I-frames only. The results of this evaluation are shown in Table 2; all three tests were applied simultaneously and the best result was selected. The authors also published an extension to their earlier work in [8], in which they apply the Chi-square test to three histograms for each frame - the global, the row and the column histograms - and combine the results of the three to detect a shot change.

Our modifications: The above methods were slightly modified to use all MPEG frame types. As described above, the original papers use only I-frames. We tested both the older and the newer versions of the methods with I-frames only, with I- and P-frames, and with all three frame types (I, P and B). In general we found that the results improved as we provided more information to the algorithm. This modification was possible (and reasonable) for this method since it uses coarse intensity histograms, which are readily available as the DC values of the 8x8 blocks.
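Two of the three statistical tests are easy to state in code. The sketch below gives one common form of each, assuming the DC-coefficient intensity histograms are already available; the exact normalizations used by Sethi and Patel may differ.

```python
import numpy as np

def chi_square_difference(h1, h2, eps=1e-9):
    # one common form of the normalized chi-square histogram difference
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def kolmogorov_smirnov(h1, h2):
    # maximum absolute difference between the cumulative (normalized) histograms
    c1 = np.cumsum(h1) / max(np.sum(h1), 1)
    c2 = np.cumsum(h2) / max(np.sum(h2), 1)
    return float(np.max(np.abs(c1 - c2)))
```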
5.1.3 A Fast Algorithm for Video Parsing Using MPEG Compressed Sequences
This method by Shen and Delp [9] constructs one-dimensional coarse color histograms from the DC coefficients. It then applies bin-to-bin differencing of these histograms to detect scene changes. The authors apply the method only to I-frames. The reader is referred to [1] for details on the bin-to-bin histogram differencing measure.
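For reference, the bin-to-bin difference is simply the summed absolute difference between corresponding bins. In the sketch below the bin count and value range are assumptions of ours, not the authors' exact settings.

```python
import numpy as np

def dc_histogram(dc_values, bins=64, value_range=(0, 2048)):
    # coarse 1-D histogram over the DC coefficients of an I-frame,
    # normalized so that frames of different sizes are comparable
    hist, _ = np.histogram(dc_values, bins=bins, range=value_range)
    return hist / max(len(dc_values), 1)

def bin_to_bin_difference(h1, h2):
    # summed absolute difference between corresponding histogram bins
    return float(np.sum(np.abs(np.asarray(h1) - np.asarray(h2))))
```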
5.1.4 Scene Change Detection in an MPEG Compressed Video Sequence
This method developed by Meng and Chang [10] uses information from all three frame types: I-, P- and B-frames. The following heuristics are used to detect scene changes.

P-frames: A high ratio of intra-coded macroblocks to inter-coded macroblocks indicates a scene change.

B-frames: The ratio of backward prediction motion vectors to forward prediction motion vectors is thresholded adaptively. The threshold value is set as the average of this ratio within a 2-4 GOP size interval.

I-frames: The detection of cuts is more involved. A variance-of-intensity plot is generated from the DCT DC terms of I- and P-frames. A cut is indicated by a peak in I-frame variance immediately following a B-frame with a high motion vector ratio: at a shot change, the B-frame preceding the I-frame belonging to the new shot would have a high backward prediction motion vector count, and the I-frame and the immediately subsequent P-frames would have a change in intensity, resulting in a peak in the variance-of-intensity plot.

Dissolves: The variance-of-intensity plot is also used to detect gradual transitions, for which the ideal curve is parabolic: a dip in the curve marks the center of the dissolve. Such a curve is expected because the variance of frame intensity gradually decreases toward the center of the transition, stabilizes, and then gradually increases into the new shot. The variance is computed over the DC coefficients of I- and P-frames. For I-frames the DC terms are directly available. For P-frames, DC terms are reconstructed from the prediction vectors and reference blocks using the method described in [11]: the DC value of the block to which a motion vector points is calculated as a weighted average of the DC values of the overlapped neighboring blocks, plus the residual error DCT coefficient.
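A sketch of the P- and B-frame heuristics follows, assuming per-frame macroblock and motion-vector counts have been parsed from the bitstream; the threshold value and the use of a running list of recent ratios are illustrative.

```python
import numpy as np

def pframe_is_cut(n_intra, n_inter, ratio_threshold=1.0):
    # P-frame heuristic: many intra-coded macroblocks relative to
    # inter-coded ones means prediction failed, suggesting a scene change
    return n_intra > ratio_threshold * max(n_inter, 1)

def bframe_is_cut(n_backward, n_forward, recent_ratios):
    # B-frame heuristic: threshold the backward/forward motion-vector
    # ratio adaptively against its average over the last few GOPs
    ratio = n_backward / max(n_forward, 1)
    return bool(recent_ratios) and ratio > np.mean(recent_ratios)
```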
5.1.5 Scene Decomposition of MPEG Compressed Video
This method by Liu and Zick [12, 13] makes use of error power and video shots. The authors define the error power of a frame as the sum of the squares of the DCT values of each macroblock. Since P-frames contain only forward-predicted macroblocks, the error power of a P-frame at a scene change will be higher than that of a B-frame: because B-frames may contain backward-predicted, forward-predicted and bidirectionally-predicted macroblocks, the encoder can use whichever of these types keeps the error power low. The authors define correlation factors between P-frames, between B-frames, and between P- and B-frames, which are then used to determine shot changes. Their method does not detect scene cuts at I-frames.
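The error power itself is a one-liner; the comparison against neighboring B-frames below is our own illustrative reading of the heuristic, with an arbitrary factor.

```python
import numpy as np

def error_power(dct_blocks):
    # sum of squared DCT coefficients over all macroblocks of a frame
    return sum(float(np.sum(np.asarray(b, float) ** 2)) for b in dct_blocks)

def pframe_candidate(p_power, neighbor_b_powers, factor=2.0):
    # a P-frame whose error power stands well above that of the
    # surrounding B-frames is a scene-change candidate
    return p_power > factor * np.mean(neighbor_b_powers)
```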
5.1.6 Rapid Scene Analysis on Compressed Video
This algorithm, by Yeo and Liu, detects scene cuts, gradual transitions and the appearance of camera flashes. The basis of the method is the heuristic use of DC-coefficient difference sequences. DC coefficient values are extracted for I-, P- and B-frames; for P- and B-frames they are reconstructed from the MPEG compressed stream as described in [11]. The method constructs DC image sequences; that is, it reconstructs the video sequence using only the DC values of the macroblocks or 8x8 blocks. In this way it is possible to process every frame in the video at relatively low computational cost. The following heuristics are used to detect the video events.

Detecting Shot Changes: The method uses differences in luminance and coarse color histograms with adaptive thresholding to detect abrupt shot changes. The difference values of the luminance and color histograms are combined for a more accurate detection.

Detecting Gradual Transitions: Gradual transitions include the special effects of dissolves, fade-ins and fade-outs. The differences are now accumulated over a series of frames; a plateau in the difference plot over time represents a gradual transition.

Detecting Camera Flashes: The presence of a camera flash in a scene appears in the difference plot as two sharp peaks very close to each other. The major criteria used to detect these are:
1. the maxima are very close to each other;
2. the difference between the maxima is very small;
3. the difference between the average of the surrounding values and these maxima is fairly large.
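One plausible realization of the adaptive thresholding on the frame-difference sequence is a sliding-window test: declare a cut where the difference is the window maximum and dominates the second-largest value in the window. The window size and ratio below are illustrative, not Yeo and Liu's published settings.

```python
def sliding_window_cuts(diffs, window=7, ratio=3.0):
    # diffs[k] = combined histogram difference between DC images k-1 and k
    cuts, m = [], window // 2
    for k in range(m, len(diffs) - m):
        w = sorted(diffs[k - m : k + m + 1])
        largest, second = w[-1], w[-2]
        if diffs[k] == largest and largest >= ratio * max(second, 1e-9):
            cuts.append(k)
    return cuts
```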
5.1.7 Other MPEG based methods
A method by Panchanathan et al. [14] was not implemented since it is quite similar to [10] and is not as well presented. A method by Doermann [15] was not implemented since it was not as well developed at the time as some of the other algorithms.
5.2 MPEG based algorithms: Results
We present here the performance results of the MPEG based algorithms. The cut detection performance of the various algorithms is given in Table 2, and their performance in detecting gradual transitions in Table 3. Only the method by Boon-Lock Yeo and Bede Liu detects camera flashes; this is detailed in Table 4. These results were obtained by applying the methods to the 10 MPEG compressed sequences. The search window for matching the ground truth frame numbers to the detected frame numbers was 3 frames for cuts and 10 frames for gradual transitions.

Method         Detects  Missed Detects  False Alarms  Recall (%)  Precision (%)
Zhang-Smoliar  164      8               4283          95          4
Zick-Liu       52       120             73            30          42
Meng-Chang     113      59              583           66          16
Sethi-Patel    78       94              472           45          14
Shen-Delp      117      55              119           68          50
Yeo-Liu        119      53              8             69          94

Table 2: MPEG Algorithms: Cut Detection Performance

Method         Detects  Missed Detects  False Alarms  Recall (%)  Precision (%)
Meng-Chang     9        29              148           24          6
Sethi-Patel    19       19              208           50          8
Shen-Delp      11       27              21            29          34
Yeo-Liu        7        31              137           18          5

Table 3: MPEG Algorithms: Gradual Transition Detection Performance

Method             Detects  Missed Detects  False Alarms  Recall (%)  Precision (%)
Yeo-Liu            3        4               0             42          100
Yeo-Liu (Special)  7        11              0             39          100

Table 4: MPEG Algorithms: Camera Flash Detection Performance

It is apparent from the recall and precision values that the results of most of the methods are rather mediocre. In fact, the results are comparable to those of the motion based methods applied to uncompressed video data, which are described in Table 6. The advantage is that the execution times of the MPEG based methods are very low. Boon-Lock Yeo and Bede Liu's method performs the best for both cut detection and camera flash detection. The Yeo-Liu (Special) row of the camera flash table refers to a sequence containing 18 flashes. The camera flash detection actually performs better than it appears: the method detected very closely spaced flashes (less than 5 frames apart) as a single event. With the exception of such closely spaced flashes, it detects all camera flash events accurately.
5.3 The Sethi-Patel Algorithm: Results
The Sethi-Patel algorithm uses three statistical measures, as explained above, all applied to I-frames only. In a modified form of the algorithm [8], only the Chi-square test is applied, to the row, column and frame histograms of the I-frames. We tested the effect of increasing the information processed by this otherwise mediocre algorithm: both versions were applied first to I-frames only, then to I- and P-frames, and finally to I-, P- and B-frames. The authors smooth the histograms by applying Gaussian averaging; the newer method was also evaluated using histograms that were not averaged. The cut detection results of these tests are given in Table 5. They show that the method generally behaves better when given more information; in this case, the method performs best when it processes all three frame types.
6 Motion-based Indexing

In this section of the report, we focus on those motion-based methods which use uncompressed data. The objective was to identify techniques that use motion features computed from uncompressed video data; the computed features are then evaluated for their ability to identify shot changes in video sequences. We identified several methods which fall into this category and rated them on the perceived significance or originality of the technique as well as its potential implementability. If insufficient detail was provided in the article to successfully implement the method, the technique was given a lower rank. All identified methods are listed below in order of their (somewhat subjective) ranking, and the status (implemented or not) of each is given. Due to time constraints, we implemented only a subset of these algorithms.
Method              Comment                Frame Types  Detects  Missed Detects  False Alarms  Recall (%)  Precision (%)
Sethi-Patel (1995)  Original               I            78       94              472           45          14
Sethi-Patel (1995)  Modified               I, P         170      2               1268          99          12
Sethi-Patel (1995)  Modified               I, P, B      172      0               1854          100         8
Sethi-Patel (1997)  Original               I            47       125             112           27          30
Sethi-Patel (1997)  Modified               I, P         98       74              38            57          72
Sethi-Patel (1997)  Modified               I, P, B      74       98              47            43          61
Sethi-Patel (1997)  No Gaussian Averaging  I            44       128             138           26          24
Sethi-Patel (1997)  No Gaussian Averaging  I, P         113      59              84            66          57
Sethi-Patel (1997)  No Gaussian Averaging  I, P, B      97       75              110           56          47

Table 5: The Sethi-Patel Algorithm: Cut Detection Performance Study
6.1 Motion Based Algorithms
6.1.1 Video Indexing using Motion Vectors
The algorithm developed by Akutsu, Tonomura, Hashimoto and Ohba [16] determines shot changes using a motion smoothness measure. The computation is performed on every kth frame, where k is a user-defined parameter. Each chosen frame is divided into 8x8 blocks, and each block is matched to a block in the next chosen frame within a 30x30 pixel neighborhood. The closest matching neighboring block indicates the motion in that direction, and the corresponding motion vector is computed. The value of the correlation coefficient for the best matching block (actually computed as an inverse correlation coefficient and identified by the lowest value) is also recorded. The average of these correlations represents an inter-frame similarity measure. The number of blocks which have moved significantly and the amount these blocks have moved are then computed; motion smoothness is the ratio of these two quantities. Shot changes in the video sequence are indicated by local maxima in the correlation and motion smoothness ratio values.

Camera operations may also be detected by applying motion analysis to the vectors using the Hough transform. A line in image space is represented as a point in Hough space; conversely, a sinusoid in Hough space represents a group of lines in image space that intersect at a single point - in our case, the convergence/divergence point of the motion vectors. Thus, if the Hough transform of the motion vectors can be least-squares fit to a sinusoid, the point of divergence/convergence of the vectors indicates the type of camera motion. The paper describes 7 possible camera operations using the convergence/divergence point and whether the vector magnitudes are constant, changing or zero. Due to the lack of detail in the description of the method for detecting camera operations, only shot change detection was implemented.
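A sketch of the exhaustive block-matching step follows. The paper uses an inverse correlation coefficient as the match criterion; the sum-of-squared-differences criterion here is a simplification of ours. A displacement range of +/-11 pixels around an 8x8 block corresponds to the 30x30 pixel neighborhood.

```python
import numpy as np

def match_block(prev, cur, by, bx, bs=8, search=11):
    # exhaustively search `cur` for the block that best matches the bs x bs
    # block of `prev` at (by, bx); returns (dy, dx, match_error)
    ref = prev[by:by + bs, bx:bx + bs].astype(np.float64)
    h, w = cur.shape
    best = (0, 0, np.inf)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and y + bs <= h and 0 <= x and x + bs <= w:
                err = float(np.sum((ref - cur[y:y + bs, x:x + bs]) ** 2))
                if err < best[2]:
                    best = (dy, dx, err)
    return best
```

This exhaustive search over every block of every sampled frame pair is what makes the uncompressed-domain methods so expensive, as noted in Section 6.2.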
6.1.2 Scene Change Detection and Content-Based Sampling of Video Sequences
This algorithm developed by Shahraray [17] also uses a form of block matching and motion estimation to detect shot changes and gradual transitions and to perform camera-operation characterization. It works as follows. As with the method by Akutsu et al., the computation is performed on every kth frame, where k is a user-defined parameter. Each chosen frame is divided into 12 non-overlapping blocks. Each block is matched to a block in the next chosen frame within a 30x30 pixel neighborhood. The corresponding motion vector and best correlation value are computed as before. The correlation values are sorted into ascending order, and a similarity measure is computed by taking the average of the first s values from the sorted list, where s is the number of blocks scaled by a user-specified matching percentage. Shot changes are indicated by local maxima in the similarity measure. Gradual increases in frame difference are used to detect gradual transitions. Motion may also cause gradual frame differences, but it is claimed that the corresponding match value is lower than that for operations such as pans, so that false detections can be avoided. The algorithm also performs camera motion estimation and a relatively simple analysis of the signal to detect pans. As with the method by Akutsu et al., only shot change detection was implemented; significantly more information would have been needed to implement the detection of gradual transitions and camera motion.
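The order-statistic similarity measure is simple to state; a minimal sketch, with naming of our own choosing:

```python
def ordered_match_similarity(match_values, match_fraction=0.5):
    # average of the best (smallest) fraction of per-block match values;
    # sorting first suppresses outlier blocks caused by local object motion
    vals = sorted(match_values)
    s = max(1, int(len(vals) * match_fraction))
    return sum(vals[:s]) / s
```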
6.1.3 Text, Speech and Vision for Video Segmentation - The Informedia Project
This method, developed by Smith et al. [18, 19], has been implemented in CMU's Informedia project. The algorithm uses audio track information as well as color histograms and motion information. Specifically, the mean length, mean angle and mean angle-variance of optical flow vectors are computed using the Lucas-Kanade technique; these are then used to detect static scenes, pans and zooms. This method was judged one of the important methods but was abandoned due to the extremely long computation time involved. In addition, the MPEG methods, which were implemented concurrently during the evaluation, were producing more promising detections and were not as computationally intensive.
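For illustration, the three flow statistics named above reduce to a few lines, assuming the optical flow field has already been computed (for example, by a Lucas-Kanade implementation):

```python
import numpy as np

def flow_statistics(flow):
    # flow: array of shape (N, 2) holding optical-flow vectors (dx, dy);
    # note: a careful version would use circular statistics for the angle
    # mean and variance to handle wrap-around at +/- pi
    lengths = np.hypot(flow[:, 0], flow[:, 1])
    angles = np.arctan2(flow[:, 1], flow[:, 0])
    return lengths.mean(), angles.mean(), angles.var()
```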
6.1.4 Optical Flow-Based Model for Scene Cut Detection
In this method by Fatemi, Zhang and Panchanathan [20], the video sequence is divided into (overlapping) subsequences, each defined as 3 consecutive frames plus a fourth predicted frame. Each frame is divided into non-overlapping blocks of 4x4 pixels. Blocks are predicted from Frame 1 to 2 and from 2 to 3; a set of 3 matching blocks in Frames 1, 2 and 3 is then used to predict a block in Frame 4. Using this information, a decision is made as to whether a cut occurred. Gradual transition effects are also mentioned in the paper. The algorithm was not implemented since the paper lacked the thresholding or final decision methods needed to select shot changes or gradual transitions.
11
6.1.5 Temporal Segmentation of Videos: A New Approach
The algorithm described by Cherfaoui and Bertin performs global motion estimation and camera-operation based segmentation [21, 22]. The method estimates global motion parameters under an affine transformation model. The value of the scale factor in the transformation is used to detect zooms and horizontal and vertical displacements. Taking two points on the X-axis on either side of the origin, the new horizontal positions of the two points are plotted at each new frame. The curves are approximated by straight lines to eliminate camera shake. Simple analysis of the pairs of lines indicates a fixed shot, a pan, or a convergent/divergent zoom. This method was not implemented due to low interest in algorithms operating on uncompressed video.
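Estimating the affine model from block motion vectors is a linear least-squares problem. The sketch below is a generic fit, not the authors' specific estimator; the zoom test on the recovered scale factor uses an arbitrary tolerance.

```python
import numpy as np

def fit_affine(src, dst):
    # least-squares fit of dst ~= A @ src + t from matched point pairs;
    # src, dst: arrays of shape (N, 2) with N >= 3
    n = len(src)
    M = np.zeros((2 * n, 6))
    b = np.zeros(2 * n)
    for i, ((x, y), (u, v)) in enumerate(zip(src, dst)):
        M[2 * i] = [x, y, 1, 0, 0, 0]
        M[2 * i + 1] = [0, 0, 0, x, y, 1]
        b[2 * i], b[2 * i + 1] = u, v
    p, *_ = np.linalg.lstsq(M, b, rcond=None)
    A = np.array([[p[0], p[1]], [p[3], p[4]]])
    t = np.array([p[2], p[5]])
    return A, t

def is_zoom(A, tol=0.02):
    # a scale factor clearly different from 1 indicates a zoom
    scale = np.sqrt(abs(np.linalg.det(A)))
    return abs(scale - 1.0) > tol
```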
6.2 Motion based methods on uncompressed video: Results
Table 6 presents the results of processing the video sequences using the Akutsu and Shahraray algorithms. A simple statistical analysis was used to determine threshold values for detecting a shot change: the threshold was set to a user-specified number of standard deviations of the frame difference values above their mean. The results below are for a cumulative test of the segmentation methods on nine video sequences.

Method     Std. Dev.  Detects  Missed Detects  False Alarms  Recall (%)  Precision (%)
Akutsu     0.5        153      19              349           88.9        30.4
Akutsu     1.0        143      29              177           83.1        44.7
Akutsu     1.5        128      44              109           74.4        54.0
Akutsu     2.0        98       74              73            56.9        57.3
Shahraray  0.5        153      19              552           88.9        21.7
Shahraray  1.0        145      27              286           84.3        33.6
Shahraray  1.5        134      38              136           77.9        49.6
Shahraray  2.0        116      56              68            67.4        63.0

Table 6: Test results for cut detection

Overall, the table shows that the Shahraray algorithm has a somewhat higher recall rate (more detections at the same thresholding level) but produces significantly more false alarms in most cases. For both algorithms, the computation time required to perform the block matching and compute the motion vectors was extremely high: even for the shortest sequences, the overall time to run the algorithm exceeded 96 hours. Even at the expense of such long execution times, the results were not significantly better than those of the MPEG-based techniques, which take much less computation time. This is because the MPEG encoding has already computed the motion vectors, which is the most expensive computation in the methods discussed above.
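The thresholding rule used for both algorithms is a one-liner; a minimal sketch:

```python
import numpy as np

def mean_std_threshold(frame_diffs, num_std=1.5):
    # threshold = mean + num_std * standard deviation of the
    # frame difference values over the whole sequence
    d = np.asarray(frame_diffs, dtype=np.float64)
    return d.mean() + num_std * d.std()
```

Frames whose difference value exceeds this threshold are declared shot changes; the Std. Dev. column of Table 6 is the `num_std` setting.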
7 Observations and Conclusions

This section concludes the evaluation of current video indexing algorithms. From the results obtained, the histogram based indexing approach and the MPEG based cut detection methods appear the most promising. It is interesting to note that the histogram based approach also does fairly well within the compressed video indexing algorithms. Some of the observations made during the evaluation are:
The observed behaviors of the methods differ considerably from the published (expected) behaviors.
This is probably because the results presented in the literature are tuned to specific test sequences, whereas the algorithms we implemented were tested on a generalized data set. There could also be minor variations between the authors' implementations and our understanding of the methods published by them.

The overall gradual transition detection performance is rather mediocre.
This is because the methods rely on the shapes of the frame difference plots to detect a gradual transition, and in reality the plots do not adhere to the ideal shapes (such as plateaus). Also, large object motion or quick camera pans are often detected as gradual transitions, and a gradual transition false alarm in turn affects cut detection performance. In addition, matching the start and end points of a detected gradual transition to the ground truth is inaccurate, leading to Missed Detections.

The algorithms that use input data from all three frame types perform better.
This is evident from our study of the algorithm by Sethi and Patel. The algorithm by itself was not spectacular in performance, yet the increase in input information significantly improved its performance. This is an important point to note in the development of new algorithms.

The methods have an extremely high number of input parameters, which makes them difficult to tune.
Many methods had 5 to 8 parameters, and some even had up to 12 parameters that needed to be tuned for the method to perform well. A large number of parameters makes tuning difficult and induces a degree of uncertainty in the quality of the results obtained from an indexing operation.

A high number of input parameters also means that the methods cannot be easily generalized to handle different types of video information.
Not only is tuning the parameters difficult with respect to one video sequence, it is even more difficult to obtain a stable tuning of values across different types of movies. This was often the case when the methods were applied to our data set. Both of the above items point to a need for fewer and more intuitive parameters in an algorithm (the sensitivity parameter in VADIS is an example).
In general, the cut detection results are also not very exciting. This does not imply that the existing algorithms are not good; however, there is room for improvement, and from the results obtained in the two evaluations, the use of histograms and compressed video data seems to be the direction in which further research should be directed. This evaluation also points to the need for many such evaluations, conducted at regular intervals, to bring out the salient features of different methods - which can possibly lead to a combined implementation resulting in a "good" algorithm.
A Video Event Typology

A.1 Introduction

This appendix defines a set of video events: syntactic events in video sequences that often convey directorial or editorial semantics. The list is almost certainly incomplete but serves as a good starting point. A Ground Truth (GT) data file format (gtf) is also defined.
A.2 Terminology
Shot A sequence of frames that was shot continuously from the same camera at one time (or shot at multiple times, but with the multiple shoots invisible to the viewer). A shot may encompass pans, tilts, or zooms.

Scene A collection of one or more shots that describe action taking place at the same time and/or in the same place, or that present visual information that is so related. Not as clearly defined as a shot: a film of a basketball match may contain shots showing the action from many camera angles while still showing the same scene, and a news broadcast may switch from one anchor-person to another or even to on-site footage from reporters.
A.3 Video Events
A.3.1 Editing events
These events are created by the compositor or editor.
Scene Cut or Break An abrupt change or discontinuity in the viewed frame. A cut is marked wherever it is apparent that the film has not been shot continuously, even if the change is not that great (e.g., within the same scene).

Cut to Black A cut marking the end of, or a pause in, a sequence, with the next frame being black. This event is marked to avoid having black representative frames. It also often indicates a pause, to show the passage of time or the separation between scenes, as in advertisements.

Gradual Transition Can be one of:
Fade In - Fade from black.
Fade Out - Fade to black.
Fade Out-In - A fade out followed by a fade in, with possibly some black frames in between.
Cross dissolve - From one frame into another.

Blends Two different video sources are blended into one frame (by chroma or luma keying, for example) such that different portions of the viewport belong to different sources.

Graphic Effect Can be one of:
Wipes - One edge (or the diagonal) of the current viewport is translated across the screen, wiping out the current frame data and replacing it with new data.
Pins - One corner of the current viewport is used as the dynamic point of a graphical transformation peeling back the current viewport, revealing a new frame beneath.

Text The presence of text on the screen, as in movie credits, anchor-person identifier tags, or general text in the scene.
A.4 Dynamic events
These are intrinsic to the video, caused by motion of some kind, perhaps created by the director.

Pan - A rotational movement of the camera about the vertical axis.
Tilt - A rotational movement of the camera about the horizontal axis.
Zenith - A rotational movement of the camera about the optical axis.
Dolly - Forward/backward motion of the camera along the optical axis.
Boom - Upward/downward motion of the camera along the vertical axis.
Track - Right/left motion of the camera along the axis perpendicular to the vertical and optical axes [16].
Zoom - Convergent or divergent.

Background Motion The dominant motion in the scene is of a moving background while the foreground object(s) remain relatively static, e.g. a scene looking out of a moving car window.

Foreground/Object Motion Can be one of:
Intra-frame - The object moves entirely within the current viewport.
Into frame - An object enters the viewport.
Out of frame - An object exits the viewport. This event lasts from the instant part of the object leaves until the entire object is invisible.

Object Tracking This is very similar to the background motion event but involves a (usually smaller) object whose motion is tracked or followed by the camera, usually in an attempt to keep it in focus or in the center of the viewport.
A.4.1 Structural or Meta Events
These are higher level events involving structure either within the frame or between frames, providing useful indexing information when available.
Scene An interval of frames composed of at least one shot (but usually more than one) dealing with the same action, place and time. A useful description for a video sequence if it can be detected, since it allows very compact indexing.

Back & Forth Scene A common interview structure: the camera switches between two or more persons or relatively static viewpoints.

Newsanchor A person's head and shoulders fill the foreground, with an optional story display either to the left or right of the person's head. This format appears to be standard all over the world.

Flash A commonly appearing event in sequences covering conventions or speeches. A flash bulb going off often causes a false scene change indication.
A.5 Ground Truth File Format
Uppercase letters are used to denote literals, lowercase for symbols.
datafile        := { comment | eventset }
comment         := ! string <newline>
eventset        := startframe event [subevent] [# string #] <newline>
event           := CUT | CUTTOBLACK
                 | ( text | FLASH | GRADTRANS | BLEND | GRAPHFX | OBJTRACK
                     | NEWSANCHOR | SCENE ) = endframe
                 | ( cameraoperation | bnfmotion | backmotion | objectmotion )
subevent        := FADEIN | FADEOUT | CROSSDISSOLVE | WIPE | PIN | OBJINTRA
                 | OBJINTO | OBJOUTOF | VERTICAL | HORIZONTAL | FLIP
startframe      := value
text            := TEXT = endframe NATURE = ( CAPTION | SCROLLING | SCENE )
                   STRING = <textstring>
cameraoperation := ( pan | tilt | zoom | zenith | boom | dolly | track )
pan             := PAN = endframe ANGLE = panangle DIRECTION = direction
                   SPEED = panspeed UNIFORM = boolean
tilt            := TILT = endframe DIRECTION = direction
                   ( FINANGLE = value | ANGLE = value ) UNIFORMITY = boolean
zoom            := ZOOM = endframe SENSE = sense SPEED = value UNIFORM = boolean
boom            := BOOM = endframe DIRECTION = direction
track           := TRACK = endframe DIRECTION = direction
dolly           := DOLLY = endframe SENSE = sense
bnfmotion       := BNFMOTION = endframe BNFSHOTS = value
objectmotion    := OBJMOTION = endframe OBJSIZE = value OBJSPEED = value
backmotion      := BACKMOTION = endframe DIRECTION = direction
zenith          := ZENITH = endframe ( ZENITHFINANGLE = value | ZENITHANGLE = value )
panangle        := value | "RIGHT" (= 0) | "LEFT" | "UP" (= 90) | "DOWN"
panspeed        := ( #pixels "moved out" / #pixels per frame ) * 100
sense           := CONVERGE | DIVERGE
direction       := RIGHT | LEFT | UP | DOWN | UPLEFT | UPRIGHT | DOWNLEFT | DOWNRIGHT
boolean         := TRUE | FALSE
endframe        := <endingframenumber>
value           := ( + | - ) ( 0 .. 9 )
string          := ( A .. Z | a .. z | 0 .. 9 )
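As an illustration only, a hypothetical ground-truth fragment consistent with the grammar above might read as follows (the frame numbers and comments are invented):

```
! sample ground truth for sequence a
130 CUT
245 CUT # cut to black follows #
512 GRADTRANS = 540 FADEIN
890 FLASH = 891
```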
A.6 Notes
Scrolling text is caption text that scrolls either horizontally or vertically. Scene text is text that is present in the 3D scene being viewed.
A.6.1 Some conventions
To create ground truth files using the above grammar, the following conventions are used.

Startframe The starting frame is the first one where the event actually occurs. E.g., a gradual transition fade-in (from black) is considered to start at the first non-black frame.

Endframe The end frame is marked as follows for the various listed cases.
Text - The end frame for text is the last frame with the same legible text as at the beginning of the event, which is the first frame where the text is entirely and clearly legible. Each distinct appearance of text is a separate event.
Gradual Transition - The end frame for a gradual transition is the last frame with some remnants of the previous sequence.
All other cases - The end frame is marked at the second of two frames not having any motion, or at the last frame of a sequence in the case of the occurrence of a significant change.
B The MPEG Video Compression Algorithm - An Overview

B.1 Introduction

This section gives an overview of the MPEG-1 video standard, also known as the international standard ISO/IEC 11172-2 for video compression [23]. MPEG is an acronym for Moving Picture Experts Group. The focus of MPEG-1 is the coding of digital audio and video streams and their synchronization. Besides MPEG-1, there are two other MPEG standards: MPEG-2 and MPEG-4. MPEG-1 was designed for data rates of the order of 1.5 Mb/sec; MPEG-2 has been developed for data rates of 10 Mb/sec or more; and MPEG-4, which is primarily meant for telecommunication use, targets data rates of the order of 64 Kb/sec. The main profile of MPEG-2 is a fairly straightforward extension of MPEG-1. This introduction covers MPEG-1 video.

A video sequence with its accompanying audio track can occupy a formidable amount of storage space. Uncompressed video at a typical 352x288 pixel resolution (the Common Intermediate Format, CIF) with 3 color components at 8-bit precision requires approximately 72 Mbits/sec, plus roughly 1.4 Mbits/sec of audio sampled at 44 kHz with 16-bit precision. Each video image is called a frame or a picture; a video sequence is a series of frames taken at closely spaced intervals. Video data exhibits a high amount of temporal redundancy; that is, except at a scene change, very little of the content of a video frame changes across time. The MPEG video coding scheme uses interframe compression techniques to take advantage of this temporal redundancy, or predictability, to compress the data. At scene changes in the video content, interframe predictions are not possible; here MPEG uses the similarity of adjacent regions within a frame to compress the data. This technique is called intraframe coding. These inter- and intra-frame coding techniques form the heart of the MPEG compression scheme.
B.2 MPEG Video Layers

MPEG is a hierarchical data format. As shown in Figure 1, the outermost layer of the MPEG bitstream is the video sequence. The video sequence is divided into Groups of Pictures (GOPs). Each GOP is composed of three different frame types: I-, P- and B-frames. I-frames are intracoded frames and are coded independently, without reference to other frames. P- and B-frames are compressed with reference to previous I- or P-frames; they are intercoded and hence exploit their similarity to other frames. P-frames are predictive-coded frames which obtain their predictions from the temporally preceding I- or P-frames in the sequence. B-frames are bidirectionally-predictive coded frames which obtain their predictions from nearby preceding or succeeding I- or P-frames. P- and B-frames use motion compensation and block matching techniques (discussed later) to compress video data. If a P-frame or a B-frame is unable to find a closely matching region in the reference frame, then intracoding techniques are used
for that region. Generally the P- and B-frames are predicted from other frames within the GOP (in an open GOP format, the prediction may be from frames outside the GOP). Each frame is subdivided into a 16x16 sample array, or macroblock, of luminance samples together with one 8x8 block of samples for each of two chrominance components, redness-greenness and blueness-yellowness (called U and V in the YUV color space). A raster row of macroblocks is called a slice. The slice allows for greater flexibility in signaling changes in some of the coding parameters.

Figure 1: MPEG Video Layer Hierarchy

Figure 2: Zigzag placement of DCT values
Figure 3: Intraframe (or intrapicture) coding
B.3 I-Frames

I-frames are intracoded frames. They are coded without any reference to other frames in the video sequence. Every GOP has one I-frame; since the video sequence is composed of a series of GOPs, the I-frames provide points of random access into the MPEG compressed bitstream. As shown in Figure 3, an I-frame is compressed by applying the discrete cosine transform (DCT) to each of the 8x8 luminance blocks and to the horizontally and vertically averaged chrominance blocks. These values are then quantized. The DC term is placed at the upper-left corner of each block, and the AC terms are then placed in a zigzag fashion in order of decreasing frequency, as shown in Figure 2. The MPEG algorithm then entropy encodes the macroblocks as Run Length Encoded (RLE) values or Differential Pulse Code Modulated (DPCM) values, which are further Huffman or arithmetic encoded for compression.
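The zigzag ordering of Figure 2 can be generated programmatically; a sketch (a real decoder would use a precomputed table):

```python
def zigzag_order(n=8):
    # traverse an n x n block diagonal by diagonal, alternating direction,
    # which orders DCT coefficients roughly by increasing spatial frequency
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

# zigzag_order()[:6] -> [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```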
B.4 P-Frames

P-frames, or forward predictive-coded frames, belong to the intercoded frame types. Previous I- or P-frames within the GOP are used to predict the movement of a particular block. Block matching and motion compensation techniques are used to find the motion vector. A motion vector is a set of 4 values indicating the (X,Y) location of a block and the row and column offset to the closest matching block. Block matching is done for each 8x8 block in the reference frame which differs from the corresponding block in the current frame: a search is performed in a small neighborhood for the closest matching block. The differences in the (X,Y) coordinates of these blocks form the row and column values of the motion vector, and the differences that the block has from the reference block form the error values. This process of finding the closest matching block and recording the differences is called motion compensation. The MPEG encoder inserts an intracoded block when it is unable to find a close match among the neighboring blocks.

Figure 4: P-frame coding and B-frame coding
B.5 B-Frames

B-frames, or bidirectionally-predictive coded frames, are also intercoded frames. They reference both preceding and succeeding I- or P-frames. As a result they may carry forward, backward and bidirectional motion vectors, and they provide the maximum compression. Since no frames are predicted from B-frames, they do not propagate errors; in fact, wherever bidirectional motion vectors are used, B-frames average out the errors of their two references.
C VADIS - User Manual

C.1 Introduction

VADIS, an acronym for Video Analysis, Display and Indexing System, is a tool that indexes video data streams. The prototype system allows the user to index video data using various video segmentation algorithms and also to browse pre-indexed video sequences. VADIS can be used to index live video data and video data saved in SGI movie files using the color histogram indexing methods. Compressed video data in MPEG files can be indexed using either of the two best algorithms from our evaluation. VADIS has been implemented for the SGI IRIX operating system and has been tested on version 5.3. Other system requirements are listed in the System Requirements section below. The remainder of this appendix describes the functionality of VADIS.
Figure 5: VADIS Main Window
C.2 System Requirements
Operating System: IRIX 5.3 (at least), 6.2, 6.3
Display Capability: 24 bit
Windowing System: X/Motif
Other Software: Perl 5.001, SGI movieplayer
Miscellany: Paths to be set as described in the README file.

A Perl 5.001 executable has been provided with the package. Appropriate paths need to be set in the files having the extension .pl. These and other necessary changes are described in the README file included with the package.
C.3 Functionality

This section describes the functionality offered by VADIS. Each menu item is described and its functionality detailed. VADIS is executed by typing vadis at the prompt. Figure 5 shows the vadis application when started.
C.4 File Menu
The File menu (as shown in Figure 6) contains the following menu items:

Open Movie File
Open Index File
Save Index File
Exit
Figure 6: VADIS File Menu

The Open Movie File menu item opens a file selection dialog box and allows the user to browse the directories and select a movie file. The movie may be an SGI movie or an MPEG-1 compressed movie. The MPEG movie file extension must be .mp* or .MP*, where * is any alphanumeric string. This restriction is necessary because the MPEG libraries do not provide a way to detect the file type from its header information. The SGI libraries, however, provide a way to detect whether a selected file is indeed a movie file, and hence no restriction is placed on the naming convention.

The Open Index File menu item also opens a file selection dialog box, as shown in Figure 7. Index files are files created by VADIS that record the cut frames for a particular movie. The file header contains the path to the movie, and upon selection of an index file the appropriate movie is also opened and represented in the VADIS window. Typically the index files are given a .vid extension, though no naming restriction is placed on the user. After an indexing operation, the index file is named moviefilename.vid. If the user wishes to save it under some other name, the Save Index File menu item should be selected. If the user repeats the indexing operation with different parameters, the index file will be overwritten with the results of the new operation; to compare the results of two operations, the index file should first be saved as another file.

The Exit menu item quits the application.

Figure 7: VADIS Open Index/Movie File Dialog Box
Figure 8: VADIS Movie Menu
C.5 Movie Menu VADIS performs two kinds of operations on video data. Play the video or its indexed subsequence and index the video using one of the several indexing methods built into the system. The Movie Menu, as shown in Figure 8, deals with some these functions. The Play Movie menu item plays the loaded movie using the SGI movieplayer. It is 25
Figure 9: VADIS SGI Movie indexing parameters - Using color histograms

It is assumed that the system has this software loaded and appropriate paths set to point to this application. Information on the SGI movieplayer can be obtained through the man pages. The SGI movieplayer supports both the SGI and the MPEG movie file formats.

The Index Movie menu item opens one of the two dialog boxes shown in Figure 9 and Figure 10; the choice of dialog box depends on the type of movie loaded. Figure 9 shows the dialog box used to index SGI movie files, which are indexed using the color histogram based methods. The parameters have the following meanings (a sketch of how they interact follows the list):

Frame Interval: The distance (in frames) between the two frames being compared. Default value is 10.

Starting Frame: The frame number at which the indexing operation starts. Default value is 0, the beginning of the movie.

Ending Frame: The frame number at which the indexing operation stops. Default value is the number of frames in the movie. This value must be no less than the Starting Frame number and no greater than the number of frames in the movie.

Threshold Window (frames): The length of the shortest indexed subsequence that is detected.

Color Space: One of the 7 listed color spaces may be used to index the movie.

Sensitivity: The sensitivity to a scene change. The greater the value, the more sensitive the cut detection.
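To make the interaction of these parameters concrete, here is a minimal sketch of a generic frame-interval histogram comparison loop. It is not the VADIS implementation: the frame_histogram() stub and the mapping from Sensitivity to a difference threshold are assumptions for illustration.

/* Sketch (not the VADIS source) of how the dialog parameters could
 * drive a histogram comparison loop.  frame_histogram() is a dummy
 * stand-in for real frame decoding.                                */
#include <stdio.h>
#include <string.h>
#include <math.h>

#define NBINS 64                      /* bins per histogram (assumed) */

/* Placeholder: a real version would decode frame n and fill hist[]
 * with its normalized color histogram in the chosen color space.   */
static void frame_histogram(int n, double hist[NBINS])
{
    int i;
    for (i = 0; i < NBINS; i++)
        hist[i] = (i == n % NBINS) ? 1.0 : 0.0;       /* dummy data */
}

/* Bin-wise L1 distance between two normalized histograms. */
static double hist_diff(const double a[NBINS], const double b[NBINS])
{
    double d = 0.0;
    int i;
    for (i = 0; i < NBINS; i++)
        d += fabs(a[i] - b[i]);
    return d;
}

int main(void)
{
    int start = 0, end = 300;         /* Starting/Ending Frame       */
    int interval = 10;                /* Frame Interval              */
    int window = 30;                  /* Threshold Window (frames)   */
    double sensitivity = 5.0;         /* Sensitivity                 */
    double threshold = 2.0 / sensitivity;  /* assumed mapping: higher
                                              sensitivity, lower bar */
    double prev[NBINS], cur[NBINS];
    int n, last_cut = start;

    frame_histogram(start, prev);
    for (n = start + interval; n <= end; n += interval) {
        frame_histogram(n, cur);
        /* Declare a cut when the histograms differ enough and the
         * previous cut is at least one threshold window behind.    */
        if (hist_diff(prev, cur) > threshold && n - last_cut >= window) {
            printf("cut near frame %d\n", n);
            last_cut = n;
        }
        memcpy(prev, cur, sizeof prev);
    }
    return 0;
}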
Figure 10 shows the choices for indexing MPEG compressed movie files. The two best algorithms from our evaluation have been selected here to demonstrate the indexing results. These indexing methods use the perl binary, so Perl 5.001 must be installed and in the user's path. The algorithms included are (i) the algorithm by Boon-Lock Yeo and Bede Liu (the Yeo-Liu algorithm) and (ii) the algorithm by Shen and Delp (the Shen-Delp algorithm).

Figure 10: VADIS MPEG Movie indexing parameters

The Yeo-Liu algorithm has a sensitivity parameter with a default value of 2.50; higher values mean greater sensitivity to changes. The Shen-Delp algorithm has a sensitivity parameter with a default value of 0.4.
C.6 Subsequence Menu

Once the movie file has been indexed, it is represented on the VADIS main window. The complete blue bar at the top represents the movie as a whole. The broken bars beneath it represent the points along the movie at which cuts were detected. Beneath these bars, the subsampled frames at which cuts were detected are displayed, with the frame number shown below each image. The Subsequence Menu completes the functions of VADIS by allowing the user to select movie subsequences, individual cut frames, or the entire movie, and play them. The menu items are shown in Figure 11.
Figure 11: VADIS Subsequence Menu

The images and each of the bars are selectable items. Once an item is selected, either of the two listed operations can be invoked. The Representative Frame menu item shows the selected cut frame at its original size. For the Play menu item, different actions are taken depending on the type of the movie. If the movie is an SGI movie file, the dialog box shown in Figure 12 presents the various playback commands.
Figure 12: VADIS Subsequence Play Commands

MPEG playback creates an mpegmoviefilename.index file, which helps improve random access times to frames within the MPEG file. It is recommended that these files not be removed until all processing on a particular MPEG file is complete; thereafter they may be deleted. Deleting these files does not affect the functionality of VADIS.
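The contents of the .index file are not documented here, but a common way to speed up random access in an MPEG-1 stream is to record the byte offset of every picture start code (0x00000100) in a single pass. The sketch below illustrates that general technique; it is not the VADIS code, and pictures are counted in coded (bitstream) order, not display order.

/* Sketch: build a frame-offset table for an MPEG-1 file by scanning
 * for picture start codes (00 00 01 00).  This is one plausible
 * content for such an index file, shown for illustration only.     */
#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *in;
    unsigned long code = 0xFFFFFFFFUL;  /* last four bytes seen      */
    unsigned long offset = 0;           /* current byte position     */
    int c, picture = 0;                 /* pictures in coded order   */

    if (argc != 2 || (in = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: mkindex movie.mpg\n");
        return 1;
    }
    while ((c = getc(in)) != EOF) {
        code = ((code << 8) | (unsigned long)c) & 0xFFFFFFFFUL;
        if (code == 0x00000100UL)       /* picture start code found  */
            printf("%d %lu\n", picture++, offset - 3);
        offset++;
    }
    fclose(in);
    return 0;
}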
C.7 Live Feed Menu

VADIS includes a utility to detect cuts in incoming video. The menu is shown in Figure 13. The Index Live Feed menu item opens the dialog box shown in Figure 14.
Figure 13: VADIS Live Feed Menu
Figure 14: VADIS Live indexing parameters

Once the appropriate parameters are specified, VADIS starts two applications: vidcuts, which is included in the distribution, and videoin, a tool available on SGI systems for viewing incoming video off the specified port. The parameters are set as follows (a sketch of the frame comparison modes follows the list below):

Frame Interval: The interval between the frames which are compared. Default value: 8.

Output Type: This can have the values 1 and 2. Value 1 saves the cut images as individual JPEG images. Value 2, the default, saves a page of thumbnail images in each file; this can be used to create a storyboard from the video.

Frame Comparison: This option can have the values 1, 2, or 3, controlling which frames are compared during segmentation (see the sketch after this list). Value 1 compares the current frame with the immediately previous frame, sampling only every Frame Interval frames; this option is insensitive to motion. Value 2, the default compromise, compares the current frame with the frame Frame Interval frames in the past. Value 3 compares the current frame with the last cut frame; this approximates a cumulative difference and also reacts to object motion.

Device Number: The device number of the port on which incoming video is sensed. Default is 0, which generally indicates the Vino video board on the SGI Indy. The vcp tool (Video Control Panel) should be used to set the source.

Frame Size: This has 3 options. Value 1 sets the cut frame size to 320x240, value 2 (the default) to 160x120, and value 3 to 80x60.
Frame Display: Expressed as #x#, this denotes the thumbnail geometry on the live feed screen.

Save as video file: If the button is checked and a file name is specified, the live video (with the audio track) is saved to the specified file while it is being indexed in real time.

Threshold: The cut detection threshold.

Specifying a video save file greatly enhances the features of VADIS. If the save option is unselected, the indexing is purely visual. Specifying the file adds the following features: live video indexing creates a specifiedfilename.vid index file, which enables VADIS to (re)open the saved video for post-indexing browsing and analysis; and for subsequences selected for playback from the VADIS panel, the audio track is played along with the video subsequence.

Note: For live video indexing and file save (capture) to work on an SGI Indy, the live video feed must be connected to the input socket of the Cosmo Compressor Board, and the output of the Cosmo board must be connected to the input of the Vino Video Board.

To end live feed indexing, select the End Live Feed menu item. This kills both applications and restores VADIS to its original state.
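The three Frame Comparison options above differ only in which reference frame the current frame is measured against. The following sketch shows that dispatch; the frame_t type and frames_differ() predicate are hypothetical placeholders, and only the choice of reference frame follows the manual's description.

/* Sketch of the three live-feed Frame Comparison modes.  Only the
 * selection of the reference frame reflects the manual; everything
 * else is dummy scaffolding.                                       */
#include <stdio.h>

typedef struct { int number; /* pixel data would live here */ } frame_t;

/* Placeholder for the real frame difference test against the
 * Threshold parameter; here, a dummy criterion on frame numbers.   */
static int frames_differ(const frame_t *ref, const frame_t *cur)
{
    return cur->number - ref->number > 40;
}

int main(void)
{
    int mode = 3;                       /* Frame Comparison option   */
    int interval = 8;                   /* Frame Interval default    */
    frame_t cur, prev, prev_sampled = {0}, last_cut = {0};
    const frame_t *ref;
    int n;

    for (n = interval; n < 200; n += interval) {
        cur.number = n;
        prev.number = n - 1;            /* frame just before n       */
        switch (mode) {
        case 1:  ref = &prev;         break; /* adjacent frame       */
        case 2:  ref = &prev_sampled; break; /* interval frames back */
        default: ref = &last_cut;     break; /* last detected cut    */
        }
        if (frames_differ(ref, &cur)) {
            printf("cut near frame %d\n", n);
            last_cut = cur;
        }
        prev_sampled = cur;
    }
    return 0;
}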
References

[1] U. Gargi, R. Kasturi, S. Strayer, and S. Antani. An Evaluation of Color Histogram Based Methods in Video Indexing. Technical Report CSE-96-053, Department of Computer Science and Engineering, Penn State University, 1996.

[2] K. Patel, B.C. Smith, and L.A. Rowe. Performance of a Software MPEG Video Decoder. In ACM International Conference on Multimedia, pages 75-82, August 1993.

[3] F. Arman, A. Hsu, and M.-Y. Chiu. Feature Management for Large Video Databases. In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases I, Vol. SPIE 1908, pages 2-12, 1993.

[4] H.J. Zhang et al. Video Parsing using Compressed Data. In SPIE Symposium on Electronic Imaging Science and Technology: Image and Video Processing II, pages 142-149, 1994.

[5] SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies, volume 2419, 1995.

[6] I.K. Sethi and N. Patel. A Statistical Approach to Scene Change Detection. In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases III, Vol. SPIE 2420, 1995.

[7] N.V. Patel and I.K. Sethi. Compressed Video Processing for Video Segmentation. In IEE Proceedings: Vision, Image and Signal Processing, 1996.

[8] N.V. Patel and I.K. Sethi. Video Shot Detection and Characterization for Video Databases. Pattern Recognition, Special Issue on Multimedia, volume 30, pages 583-592, 1997.

[9] K. Shen and E.J. Delp. A Fast Algorithm for Video Parsing Using MPEG Compressed Sequences. In IEEE International Conference on Image Processing, pages 252-255, October 1995.

[10] J. Meng, Y. Juan, and S.F. Chang. Scene Change Detection in a MPEG Compressed Video Sequence. In SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies [5].

[11] S.F. Chang and D.G. Messerschmitt. Manipulation and Compositing of MC-DCT Compressed Video. IEEE Journal on Selected Areas in Communications: Special Issue on Intelligent Signal Processing, 13(1):1-11, 1995.

[12] H.C. Liu and G.L. Zick. Scene Decomposition of MPEG Compressed Video. In SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies [5].

[13] H.C. Liu and G.L. Zick. Automated Determination of Scene Changes in MPEG Compressed Video. In ISCAS - IEEE International Symposium on Circuits and Systems, 1995.

[14] K. Tse, J. Wei, and S. Panchanathan. A Scene Change Detection Algorithm for MPEG Compressed Video Sequences. In Canadian Conference on Electrical and Computer Engineering (CCECE '95), volume 2, pages 827-830, 1995.

[15] V. Kobla, D.S. Doermann, K-I. Lin, and C. Faloutsos. Compressed Domain Video Indexing Techniques Using DCT and Motion Vector Information in MPEG Video. In SPIE/IS&T Conference on Storage and Retrieval for Image and Video Databases V, volume 3022, pages 200-211, 1997.

[16] A. Akutsu et al. Video Indexing using Motion Vectors. In Proceedings of SPIE Visual Communications and Image Processing, volume 1818, pages 1522-1530. SPIE, 1992.

[17] B. Shahraray. Scene Change Detection and Content-based Sampling of Video Sequences. In SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies [5].

[18] A. Hauptmann and M. Smith. Text, Speech, and Vision for Video Segmentation: The Informedia Project. In AAAI Fall 1995 Symposium on Computational Models for Integrating Language and Vision, 1995.

[19] M.A. Smith and T. Kanade. Video Skimming for Quick Browsing based on Audio and Image Characterization. Technical Report CMU-CS-95-186, Carnegie Mellon University, 1995.

[20] O. Fatemi, S. Zhang, and S. Panchanathan. Optical Flow Based Model for Scene Cut Detection. In Canadian Conference on Electrical and Computer Engineering, 1996.

[21] M. Cherfaoui and C. Bertin. Temporal Segmentation of Videos: A New Approach. In SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies [5].

[22] M. Cherfaoui and C. Bertin. Two-stage Strategy for Indexing and Presenting Video. In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases II, Vol. SPIE 2185, 1994.

[23] Joan L. Mitchell, William B. Pennebaker, Chad E. Fogg, and Didier J. LeGall. MPEG Video Compression Standard. Digital Multimedia Standards Series. Chapman and Hall, 1997.