Performance Characterization and Comparison of Video Indexing Algorithms

U. Gargi    R. Kasturi    S. Antani
Department of Computer Science & Engineering
The Pennsylvania State University
University Park, PA 16802
{gargi, kasturi, [email protected]}

Abstract

Temporal segmentation of video is a necessary first step to indexing digital video for browsing and retrieval. A number of different video temporal segmentation algorithms have been published in the literature. There has been little effort to evaluate and characterize their performance so as to deliver a single algorithm (or set of algorithms) that may be used by other researchers for indexing video databases. We present results of evaluating a number of these algorithms and characterizing their performance, specifically with respect to robustness to encoder and bitrate changes. The lessons learnt have relevance to algorithm development and evaluation in general.

1 Introduction

A number of initiatives have been undertaken which aim to include digital video content in a digital library accessible over a data network. The interface to this content is meant to allow browsing and searching of the video. Given the temporally linear and data-intensive nature of digital video, coupled with the bandwidth constraints of the network, temporal indexing, or segmentation, of any video sequence stored in the video database has generally been accepted to be a necessary first step in the creation of the interface. Since the stored digital video is likely to be compressed, a number of different algorithms have been proposed that perform this task on transform-coded MPEG compressed video [1]. Future versions of the MPEG standard are likely to use higher (object) level "semantic" compression, thus easing the task of indexing the video. Until they are accepted, however, the transform-based MPEG standards are likely to dominate.

We define a shot to be a sequence of frames that was continuously shot from the same camera at one time. Ideally, a shot can encompass pans, tilts, or zooms; in reality, algorithms for cut detection will also react to camera and significant object motion. We define a cut to be the separating event between shots. The first step in indexing video is to detect the points at which cuts occur. Detection of these points allows fast forwarding to be done at a higher semantic level than merely speeding up the frame delivery rate. Clustering of shots [2] may then be used to lead to a further speed-up in the browsing/searching user interface.

Shot changes may occur in a variety of ways: straight cuts, or gradual transitions such as cross-dissolves, fade-ins, fade-outs, and various graphical editing effects (wipes, pins), which may also be accorded varying semantic significance (e.g., a fade-out to black followed by a fade-in is often used by film directors or editors to indicate the passage of time or a change of location). Thus, it is also important to be able to detect these events distinctly. In addition, there are other video events such as camera flashes (due to a still camera flash operating in the scene being recorded) which, if detected, offer additional information.

1.1 Previous Work

Ahanger et al. [3] surveyed the field of video indexing. Boreczky et al. [4] compared five different indexing algorithms (global grayscale histograms, region grayscale histograms, global histograms using twin-comparison thresholding, motion-compensated pixel difference, and DCT coefficient difference) with respect to cut detection. Their data set consisted of motion JPEG video. Gargi et al. [5] compared color histogram based segmentation algorithms.

2 Algorithms Evaluated

In this section we briefly describe the algorithms we implemented. The choice of algorithms for our study was based on the level of implementation detail in the paper, and the description and motivation given by the author(s). Some algorithms were not chosen because they were clearly superseded by one that was chosen, or because an implementation was not possible from the information in the paper alone and details were not forthcoming from the author(s). Even for the methods chosen, not all details needed for implementation were available from the published paper. Where possible, we obtained clarifications from the author(s). At times we had to make intelligent choices. The following is the list of algorithms that we evaluated. Familiarity with the MPEG syntax [1] is assumed. Readers are referred to the original papers for full details.

Algorithm A: Inner product of JPEG coefficient vectors [6] and prediction vector count statistics [7]. This algorithm is defined for the I frames of MPEG sequences [7]. Some (a priori) subset of 8x8 blocks in the frame is chosen, and some (a priori) subset of the 64 DCT coefficients for each block is chosen. A vector is formed from these coefficients of each chosen block. The inner product of the vectors of two consecutive I frames gives the frame similarity. Breaks in B frames are detected as follows: if N_forw and N_back are the numbers of non-zero forward and backward prediction vectors used in the prediction of a particular B frame, then a cut is declared if min(N_forw, N_back) < T, i.e., if the bidirectional frame prediction favors a particular direction. A twin-threshold comparison is used to detect gradual transitions. The total number of parameters needed to specify this algorithm is 8.

Algorithm B: Variance & prediction statistics [8]. For cut detection, this uses statistics on the numbers and types of prediction vectors used to encode P and B frames. For P frames, if N_intra is the number of intra-coded macroblocks and N_inter is the number of inter-coded macroblocks, then a high value of N_intra / N_inter indicates a shot change. For B frames, the ratio N_back / N_forw is used. I frame cuts are detected by finding peaks in the difference of frame intensity variance. The variance is computed from the DCT DC coefficients in I and P frames. Linear dissolves are also recognized by using the difference of frame variances; for dissolves, the ideal variance curve is parabolic. This algorithm detects cuts and gradual transitions, uses I and P frames fully, and uses prediction vector statistics from B frames. The total number of parameters needed to implement this algorithm is 7.

Algorithm C: Motion prediction statistics [9]. After suggesting a number of measures, this approach uses the ratios of forward, backward, and bidirectional motion prediction vector numbers for B frames. It defines the following frame difference measure:

    1 / min( (N_forw + N_bidir) / N_total , (N_back + N_bidir) / N_total )

where N_bidir is the number of bidirectionally predicted blocks in a B frame [1] and N_total is the total number of macroblocks. A median-filter based algorithm is used for thresholding. This algorithm detects only cuts, uses only B frames, and requires 2 parameters.

Algorithm D: Use of DC coefficient difference sequences [10]. DC DCT coefficient values for I, P, and B frames are extracted (DC coefficient values for P and B frames are reconstructed as in [11]). These values are used to construct a DC-frame sequence, where a DC-frame is a frame consisting of the average intensities of the 8x8 blocks of the original frame. Differences between these DC-frames are then computed. Results are presented for two metrics: the sum of the absolute DC-frame pixel-to-pixel differences, and the bin-to-bin difference between histograms of the DC-frame pixel luminances. Automatic thresholding is achieved by a sliding window technique: a peak is declared if it is greater than the second largest difference within the window by some factor. Dissolves are detected by looking for a gradually increasing multi-frame difference followed by a plateau followed by a decreasing frame difference. This algorithm detects cuts, gradual transitions, and camera flashes, and uses I, P, and B frames. There are 11 distinct parameters that need to be specified for using this algorithm.

Algorithm E: Statistical approach using DC coefficients [12]. This computes the histograms of DC-frames, applies the Chi-square statistical test, and thresholds using a fixed value. An improved version of this algorithm [13] computes row and column histograms in addition to the overall frame histogram, and uses decision logic to combine the three event decisions into one. These algorithms operate only on I frames. Our implementation allowed the choice of using only I frames, I & P frames, or I, P, & B frames. The latter two were found to give better results, especially in terms of event localization (frame number). This algorithm detects cuts and gradual transitions and uses 4 parameters. We present results for this algorithm modified to run on all frame types.
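To make the sliding-window thresholding used by Algorithm D concrete, the following minimal Python sketch declares a cut wherever a DC-frame difference is the maximum of its window and exceeds the second-largest difference in that window by a factor. The function names, the boundary handling at the start and end of the sequence, and the toy values are our own assumptions, not the authors' implementation.

```python
def dc_frame_difference(a, b):
    """Sum of absolute differences between two DC-frames, given as
    flat lists of average 8x8-block intensities."""
    return sum(abs(x - y) for x, y in zip(a, b))

def sliding_window_cuts(diffs, window=5, factor=3.0):
    """Declare a cut at frame i if diffs[i] is the maximum of its
    window and exceeds the second-largest value in the window by
    `factor` (the sliding-window peak test of Algorithm D)."""
    cuts = []
    half = window // 2
    for i in range(len(diffs)):
        lo, hi = max(0, i - half), min(len(diffs), i + half + 1)
        win = diffs[lo:hi]
        if diffs[i] == max(win):
            ranked = sorted(win, reverse=True)
            second = ranked[1] if len(ranked) > 1 else 0
            if diffs[i] >= factor * max(second, 1e-9):
                cuts.append(i)
    return cuts
```

For example, a difference sequence with one sharp peak, such as [2, 3, 2, 40, 3, 2, 4], yields a single cut at the peak frame.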

Algorithm F: DC coefficient histograms [14]. 1-D histograms of macroblock DC DCT coefficients are used. This method uses the color information present in the bitstream. Histograms of the luminance Y and chrominance Cb and Cr DC components of blocks are computed and bin-to-bin differencing applied. Both static and locally adaptive thresholds are used for peak finding. Median-filtered differences are used to detect gradual transitions by looking for a series of medium-high difference values, a majority of which need to be above a soft threshold. This algorithm detects cuts and gradual transitions and uses I, P, and B frames. It needs 7 parameters to be specified.
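A rough sketch of the bin-to-bin histogram differencing over the Y, Cb, and Cr DC components, in the spirit of Algorithm F, is shown below. The bin count, value range, and the simple summation of the three channel differences are illustrative assumptions on our part.

```python
def histogram(values, bins=64, vmax=256):
    """Quantize DC values in [0, vmax) into a fixed-bin histogram."""
    h = [0] * bins
    for v in values:
        h[min(v * bins // vmax, bins - 1)] += 1
    return h

def bin_to_bin_diff(h1, h2):
    """Sum of absolute per-bin count differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def frame_difference(frame1, frame2, bins=64):
    """frame1/frame2: dicts mapping 'Y', 'Cb', 'Cr' to lists of
    block DC values; returns the combined bin-to-bin difference."""
    return sum(
        bin_to_bin_diff(histogram(frame1[c], bins), histogram(frame2[c], bins))
        for c in ("Y", "Cb", "Cr"))
```

The resulting difference sequence would then be thresholded (statically or adaptively) for peak finding, as described above.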

3 Evaluation Methodology

We describe our methodology of evaluation in this section.

3.1 Test Database

We obtained NTSC video captured from VHS tape at 30 frames per second through an MJPEG compression board. Ten sequences with a total of 30403 frames (over 16 minutes of video) were used as our evaluation database. The contents of the sequences were as follows: movie & TV show previews, a sitcom, a Star Trek TV show, two commercials, a news discussion show & news segment dealing with 2 top stories and 1 business report, a news segment with multiple archive clips, nature video showing feline predators in action, two MTV news clips, and finally a news conference with many photographic camera flashes in evidence. The first 9 sequences, totaling 28253 frames, were used for detection evaluation, while the last was used only for evaluating the camera flash detection of Algorithm D. These sequences were converted to MPEG using a software encoder. It must be remarked that these sequences are challenging ones for the MPEG encoding format.

3.2 Evaluation Protocol

The MPEG sequences were ground-truthed by frame-by-frame inspection by humans. Cuts, gradual transitions, and camera flashes are examples of video events, a term which also encompasses the appearance of text captions, and pan, zoom, and other camera motions. The sequences were ground-truthed with respect to these events, and the relevant subset of the ground truth was used for evaluation. A total of 172 cuts and 38 gradual transitions was present in these sequences. Our evaluation is based on the number of Missed Detections (MDs) and False Alarms (FAs) and computation of the associated recall and precision. In [12] it is suggested that false alarm errors be ignored entirely. In our view, this is not desirable: under such an evaluation scheme, an algorithm that reported cuts at every single frame would outperform a more conservative one. Defining G to be the set of ground-truth events or features in the data, and D to be the events or features marked by a detection algorithm, the evaluation process can be defined as the process of computing a mapping f_e : G -> D. This mapping process will also (in general) be parameterized. The parameter for our evaluation consisted of the mapping range R_M, which is the temporal interval within which a ground-truth event and a detected event are matched.
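The evaluation mapping f_e described above can be sketched as follows: each detected event is matched to the nearest unmatched ground-truth event within R_M frames; unmatched ground-truth events become missed detections and unmatched detections become false alarms. The greedy nearest-first matching order is an assumption of ours, not necessarily the matching used in the paper.

```python
def evaluate(ground_truth, detections, r_m=3):
    """Match detections to ground-truth events within r_m frames.
    Returns (detects, missed_detections, false_alarms)."""
    gt = sorted(ground_truth)
    matched = set()
    false_alarms = 0
    for d in sorted(detections):
        # nearest unmatched ground-truth event within the mapping range
        candidates = [g for g in gt if g not in matched and abs(g - d) <= r_m]
        if candidates:
            matched.add(min(candidates, key=lambda g: abs(g - d)))
        else:
            false_alarms += 1
    missed = len(gt) - len(matched)
    return len(matched), missed, false_alarms
```

For instance, with ground truth at frames 10, 50, and 90 and detections at 11, 52, and 200, the mapping yields two detects, one miss, and one false alarm.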

3.3 Parameter Optimization

The values of parameters needed for implementation of the algorithms were not always specified by the authors, or only a range was specified. Further, in some cases, even those that were specified resulted in poor performance on our dataset. A complete simultaneous optimization with respect to all the parameters of an algorithm is a complex task. We therefore used the given values, or reasonable values, for those parameters that were thought to have a range of equally valid values (e.g., histogram resolution); for the parameters to which the algorithm(s) proved more sensitive, we chose optimal values by empirical optimization, maximizing a figure of merit. The figure of merit used was the sum of recall and precision, where

    Recall = detects / (detects + missed detects)

    Precision = detects / (detects + false alarms)
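The empirical optimization above amounts to sweeping a sensitive parameter and keeping the value that maximizes recall + precision. A hedged sketch, where `run_algorithm` is a stand-in for any of the detectors (not a function from the paper):

```python
def recall_precision(detects, missed, false_alarms):
    """Recall and precision from detection counts, guarding
    against empty denominators."""
    recall = detects / (detects + missed) if detects + missed else 0.0
    precision = detects / (detects + false_alarms) if detects + false_alarms else 0.0
    return recall, precision

def optimize_threshold(run_algorithm, thresholds):
    """run_algorithm(t) -> (detects, missed, false_alarms).
    Returns the threshold maximizing the figure of merit
    recall + precision, together with that figure of merit."""
    best_t, best_fom = None, -1.0
    for t in thresholds:
        r, p = recall_precision(*run_algorithm(t))
        if r + p > best_fom:
            best_t, best_fom = t, r + p
    return best_t, best_fom
```

A very permissive threshold gives high recall and poor precision, a very strict one the reverse; the figure of merit selects a point between the two.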

4 Performance Comparison

Some of the algorithms we implemented detected only cuts, some detected gradual transitions as well, and one detected camera flashes in addition to these. We therefore present results for these three types of video events separately. Discussion of the results is contained in Section 5.

4.1 Cut Detection

Table 1 presents the performances of the algorithms on the 9 sequences. R_M was set to 3 frames for this evaluation. From these results, algorithm D appears to have the best performance, with both high recall and precision (unlike Algorithm A, which merely trades off recall for precision), followed by method F. These results are also comparable to or better than the performance of the block-motion-matching algorithms [15, 16] that we evaluated.

4.2 Gradual Transition Detection

Table 2 presents the performances of those algorithms that detected gradual transitions on the 9 sequences. Algorithm F has the best performance, though none of them does particularly well. Algorithm A could not detect any gradual transitions because it only uses I frames, which in our sequences were 12 frames apart.

    Method  Detects  MDs  FAs   Recall  Precision
    A       164      8    4283  95%     4%
    B       113      59   583   66%     16%
    C       52       120  73    30%     42%
    D       119      53   8     69%     94%
    E       78       94   472   45%     14%
    F       117      55   119   68%     50%

    Table 1: Cut detection performance.

    Method  Detects  MDs  FAs   Recall  Precision
    A       0        38   0     0%      -
    B       9        29   148   24%     6%
    D       7        31   137   18%     5%
    E       19       19   208   50%     8%
    F       11       27   21    29%     34%

    Table 2: Gradual transition detection performance.

4.3 Camera Flash Detection

The one algorithm that detected camera flashes (algorithm D) had 3 detects, 4 missed detections, and no false alarms on the same dataset as above. On a special test sequence of 2150 frames containing 18 camera flash events, it had 7 detects, 11 missed detects, and 0 false alarms. The overall recall and precision are 40% and 100%. The actual performance is better than it appears because the algorithm often detected one flash where there were 2 or 3 closely spaced ones. Thus, it is seen to perform relatively well on camera flash detection.

5 Performance Analysis

A discussion of the comparison results follows. Apart from the objective measures of performance presented above, certain other issues such as ease of implementation are also discussed. One interesting direction for further experimental work would be to compare the performance of the algorithms with that of humans.

5.1 Cut Detection Performance

Algorithm D is clearly the best in this respect, with high recall and precision. The other algorithms have markedly lower precision.

5.2 Gradual Transition Detection Performance

The reason for the poor gradual transition detection performance of all algorithms is that the algorithms expect some sort of ideal curve (a plateau or a parabola) for a gradual transition, but the actual frame differences are noisy and don't follow this ideal pattern, or don't follow it smoothly for the entire dissolve. This causes the localization of the transition to be incorrect (beyond the mapping range R_M of our evaluation program), as a single transition is broken into multiple transition detections. The measured performance of these algorithms also depends on the parameter R_M of the evaluation mapping f_e: large mapping ranges lead to better measured performance. A mapping range of R_M = 10 was used. Smaller values caused performance to drop drastically, illustrating the relatively poor localization of gradual transition begin and end points. The sequences used had some complex gradual transitions, which varied in length from a few frames to as many as 56 frames, making detection a difficult task.

5.3 Full Data Use

Some of the algorithms (A, E, C) do not process all I, P, and B frames present in the input stream. An interesting question is whether this significantly decreases their performance. From Table 1, the algorithms that used more data did better. The modification to algorithm E to process all frame types improved its performance significantly over the original. Also, from Table 2 and as mentioned earlier, algorithm A is unable to detect any gradual transitions at all because it uses only I frames. In addition, the algorithms that processed all frames localized the event locations better. Thus, use of all frame types does improve performance significantly.

5.4 Algorithm Characterization

Different applications weight false alarm and missed detection errors differently. For example, in a security monitoring application, false alarms might not matter relative to missed detects, whereas in a video browsing application, false alarms would significantly decrease browsing speed. All these algorithms had a number of different parameters which could be tuned to trade off their recall against their precision. For example, Figure 1 plots the operating characteristic curve of algorithms D and F: the variation of recall with precision, measured on 9 sequences encoded by the SGI encoder at 4.15 Mbps. The tuning parameter in each case was a thresholding ratio parameter. Such an operating characteristic allows the best parameterization to be chosen for a particular application.

Figure 1: Operating curve for algorithms D and F as the threshold ratio parameter is varied.

Figure 3: Dissolve performance of algorithms D & F with varying bitrate.

5.5 Source Effects

The performance of a video indexing algorithm operating on an MPEG stream should, ideally, be independent of the encoder used and the encoding bitrate. We investigated the dependence of the two best performing algorithms (D & F) on variations in bitrate. The variation of performance with bitrate using the SGI encoder is shown in Figure 2 and Figure 3. The algorithms appear robust to bitrate changes except at very low bitrates, which is an important consideration especially for low bitrate coding applications.

We also investigated the dependence of algorithm D on two different software encoder implementations. One was the SGI software encoder. The second was the Berkeley mpeg_encode software encoder (UCB). Both used the same original MJPEG data and, as far as possible, the same encoding parameters: IPB pattern, motion-prediction vector resolution, and search window for motion prediction. The IPB pattern was IBBPBBPBBPBB, the vector resolution was half-pel, and the search window was 48 in all directions. The quantization scale factors (IQSCALE, PQSCALE, BQSCALE) were varied to achieve the specified bitrate.

Figure 2: Cut performance of algorithms D & F with varying bitrate.

Figure 4 shows the variation of cut detection performance of Algorithm D with bitrate for the two encoders, for bitrates of 100 Kbps, 500 Kbps, 1.5 Mbps, 3 Mbps, and 4.15 Mbps. These data points were obtained using the same algorithm with the parameters independently optimized for the two encoders at 4.15 Mbps. Figure 5 shows similar data for gradual transition detection performance. As can be seen, there is a significant difference in the performance of the algorithm on the data from the two encoders. Moreover, this difference is consistent across bitrates and was not attributable to a simple thresholding parameter change.

Figure 4: Effect of encoder on cut detection performance.

Figure 5: Effect of encoder on dissolve detection performance.

The reason for the difference is the differing characteristics of the encoders. For B frames, the Berkeley encoder appears to use intra-coding very sparingly and uses forward prediction much more than backward or bidirectional prediction. This imbalance causes the bursty nature of B frame sizes compared to the SGI encoder (Figure 6). The effect is to delay the coding of frame changes in B frames until the next reference frame, leading to a larger eventual frame difference value (Figure 7, for algorithm D). For the same sequence, the frame difference mean and variance for B frames were 24446 and 27925 for the SGI encoder, and 33107 and 32773 for the UCB encoder. Since B frames constitute 67% of all frames in our sequences (a typical fraction), this has the observed effect on performance. For algorithms that use the statistics of the predicted frames (algorithms A, B, and C), the effect is likely to be even greater.

Figure 6: Sizes of B frames for one sequence.

Figure 7: Frame difference values at B frames computed by algorithm D.

5.6 Ease of Implementation

There was a wide variation in the ease of implementation of the algorithms. Some papers explicitly gave the values (or ranges of values) of the various parameters used by the algorithm. In others, there was no mention of the fact that a threshold parameter might even be required. Algorithm D was the best in this respect, with implementation details, specification of parameter values, and some performance analysis already provided by the authors. Algorithm B also scored well in this respect. We feel that the paper peer review process could be improved by taking into account the ease of implementation and algorithm characterization aspects of a proposed algorithm.

6 Summary and Conclusions

We have implemented, evaluated, and characterized a number of MPEG compressed video indexing algorithms. The results of the evaluation show the relative strengths and weaknesses of the algorithms and highlight the need for more development in this area. The sensitivity to encoder changes emphasizes the need for more robust algorithms. This research also emphasizes the importance of consistent data sets and evaluation criteria in comparing algorithms. Implementation of an algorithm in a portable executable form, such as Java, might be one way for researchers to test their algorithms against others on the same data, and would thereby allow a smaller sized "toolbox" to be made available to the community. Another possibility is for researchers to provide a Web interface to their algorithms, whereby other researchers could submit their data over the Web and retrieve the output of the algorithm.

References

[1] Joan L. Mitchell, William B. Pennebaker, Chad E. Fogg, and Didier J. LeGall. MPEG Video Compression Standard. Digital Multimedia Standards Series. Chapman and Hall, 1997.

[2] M.M. Yeung, B.-L. Yeo, W. Wolf, and B. Liu. Video Browsing using Clustering and Scene Transitions on Compressed Sequences. In IS&T/SPIE Multimedia Computing and Networking, 1995.

[3] G. Ahanger and T.D.C. Little. A Survey of Technologies for Parsing and Indexing Digital Video. Journal of Visual Communication and Image Representation, special issue on Digital Libraries, 7(1):28-43, 1996.

[4] J.S. Boreczky and L.A. Rowe. Comparison of Video Shot Boundary Detection Techniques. In I.K. Sethi and R.C. Jain, editors, Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases IV, Vol. SPIE 2670, pages 170-179, 1996.

[5] U. Gargi and R. Kasturi. An Evaluation of Color Histogram Based Methods in Video Indexing. In First International Workshop on Image Databases and Multi-Media Search, pages 75-82, 1996.

[6] F. Arman, A. Hsu, and M.-Y. Chiu. Feature Management for Large Video Databases. In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases I, Vol. SPIE 1908, pages 2-12, 1993.

[7] H.J. Zhang et al. Video Parsing using Compressed Data. In SPIE Symposium on Electronic Imaging Science and Technology: Image and Video Processing II, pages 142-149, 1994.

[8] J. Meng, Y. Juan, and S.F. Chang. Scene Change Detection in a MPEG Compressed Video Sequence. In SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies [17].

[9] H.C. Liu and G.L. Zick. Automatic Determination of Scene Changes in MPEG Compressed Video. In ISCAS - IEEE International Symposium on Circuits and Systems, 1995.

[10] B.-L. Yeo and B. Liu. A Unified Approach to Temporal Segmentation of Motion JPEG and MPEG Compressed Video. In Second International Conference on Multimedia Computing and Systems, 1995.

[11] S.F. Chang and D.G. Messerschmitt. Manipulation and Compositing of MC-DCT Compressed Video. IEEE Journal on Selected Areas in Communications: Special Issue on Intelligent Signal Processing, 13(1):1-11, 1995.

[12] I.K. Sethi and N. Patel. A Statistical Approach to Scene Change Detection. In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases III, Vol. SPIE 2420, 1995.

[13] N.V. Patel and I.K. Sethi. Video Shot Detection and Characterization for Video Databases. Pattern Recognition, Special Issue on Multimedia, 1997. To appear.

[14] K. Shen and E.J. Delp. A Fast Algorithm for Video Parsing Using MPEG Compressed Sequences. In IEEE International Conference on Image Processing, pages 252-255, October 1995.

[15] A. Akutsu et al. Video Indexing using Motion Vectors. In Proceedings of SPIE Visual Communications and Image Processing, volume 1818, pages 1522-1530. SPIE, 1992.

[16] B. Shahraray. Scene Change Detection and Content-based Sampling of Video Sequences. In SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies [17].

[17] SPIE/IS&T Symposium on Electronic Imaging Science and Technology: Digital Video Compression: Algorithms and Technologies, volume 2419, 1995.