
Gradual Transition Detection Using Average Frame Similarity

Timo Volkmer, S. M. M. Tahaghoghi, Hugh E. Williams
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne, Australia 3001
{tvolkmer,saied,hugh}@cs.rmit.edu.au

Abstract

Segmenting digital video into its constituent basic semantic entities, or shots, is an important step for effective management and retrieval of video data. Recent automated techniques for detecting transitions between shots are highly effective on abrupt transitions. However, automated detection of gradual transitions, and the precise determination of the corresponding start and end frames, remains problematic. In this paper, we present a gradual transition detection approach based on average frame similarity and adaptive thresholds. We report good detection results on the trec video track collections, particularly for dissolves and fades, and very high accuracy in identifying transition boundaries. Our technique is a valuable new tool for transition detection.

1 Introduction

The volume of video content produced daily is extremely large, and is likely to increase with the ever-growing popularity of digital video consumer products. For this content to be usable, it must be easily accessible. An important first step is to identify and annotate sections of interest.

Historically, identification and annotation of video have been performed by human annotators [30, 31]. This is tedious, expensive, and susceptible to error. Moreover, it relies on the judgement of the human observer, which is inherently subjective and often inconsistent. Automatic indexing methods have the potential to avoid these problems.

Part of the analysis process is to identify and determine the boundaries of the basic semantic elements, the shots [5]. The transition between adjacent shots can be abrupt (a cut) or gradual. The former category describes a shot change where two consecutive frames belong to different shots. The latter involves a progressive changeover between two shots using video editing techniques such as dissolves, fades, and wipes [6].

Gradual transitions are less frequent than cuts, but are more complex. Lienhart [10] reports that together, cuts, fades, and dissolves account for approximately 99% of all transitions in all types of video. In the video collections we use, approximately 70% of the annotated transitions are cuts, while 26.5% are fades or dissolves. These collections are discussed in detail later. It is likely that the proportion of rarer transition effects will increase as powerful video editing tools enter mainstream use. Nevertheless, fades and dissolves remain the most common forms of gradual transition, and their accurate identification is important for effective video retrieval.

Automatic cut detection approaches have been shown to be highly effective [1, 18, 27]; indeed, their results are comparable to those obtained by human observers [2]. However, gradual transitions are more difficult to detect using automated systems [14, 22]. The often subtle changes between frames are hard to discriminate from changes caused by normal scene activity; in particular, camera motion and zoom operations often confuse detection algorithms.

In this paper, we present our novel approach to gradual transition detection in video. Our moving query window technique [26, 27] caters for the fact that gradual transitions usually extend over several frames by evaluating the average inter-frame distance over a set of frames, rather than examining only individual frames. Moreover, we compute thresholds dynamically to increase effectiveness across different types of video content. Our results are promising across different test collections of the Text REtrieval Conference (trec) VIDeo Retrieval Evaluation (trecvid).¹ We conclude that our approach constitutes a good basis for effective and efficient video indexing.

¹ http://www-nlpir.nist.gov/projects/trecvid


Figure 1. An equal number of frames on each side of the current frame (the pre-frames and the post-frames) constitute the moving query window. We can optionally specify a Demilitarised Zone (DMZ); frames falling within the DMZ for a particular current frame are omitted from the comparisons for that frame. The DMZ is explained in more detail in Section 5.


2 Background

Popular shot boundary detection approaches rely on the property that adjacent frames within one shot are usually similar. By evaluating inter-frame differences and searching for significant dissimilarities, transitions between shots can be detected.

Digitised video is commonly stored compressed in one of the mpeg² formats. Many automatic techniques for determining shot boundaries use aspects of the compressed data directly; Koprinska et al. [9] give an overview of such methods. In this paper, we focus on techniques that are applied to uncompressed footage.

² Moving Picture Experts Group: http://www.chiariglione.org/mpeg/

The majority of approaches to shot boundary detection compute inter-frame distances from the decompressed video. Shot transitions can be detected by monitoring this distance for significant changes. In direct image comparison, changes between adjacent frames are determined on a pixel-by-pixel basis. While this approach shows generally good results [3], it is computationally intensive, and also sensitive to camera motion, camera zoom, and noise. Additional filtering may be used to address some of these problems [18].

An alternative, and more common, approach is to use histograms of frame feature data. Approaches using global histograms [15, 28, 31] represent each frame as a single vector, while those using localised histograms [24] generate separate histograms for subsections of each frame. Inter-frame distances are often calculated using simple vector-distance measures to compare corresponding histograms [31]. Localised histograms, used in conjunction with additional features such as edge detection, perform well when applied in the trecvid environment [1, 8]. A sketch of the basic histogram comparison appears below.
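To make the histogram approach concrete, the following minimal sketch computes a one-dimensional hsv histogram with 32 bins per colour component (the representation we settle on in Section 5) and compares two frames using the L1 vector distance. The use of OpenCV and the choice of the L1 measure are our own illustrative assumptions, and the function names are hypothetical.

import cv2
import numpy as np

def hsv_histogram(frame_bgr, bins=32):
    # One-dimensional HSV histogram: a separate 32-bin histogram per
    # colour component, concatenated into a single 96-element vector.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    ranges = [(0, 180), (0, 256), (0, 256)]  # OpenCV stores hue as 0..179
    parts = [cv2.calcHist([hsv], [c], None, [bins], list(r)).ravel()
             for c, r in enumerate(ranges)]
    hist = np.concatenate(parts)
    return hist / hist.sum()  # normalise so frame size does not matter

def frame_distance(hist_a, hist_b):
    # L1 (Manhattan) distance: one of the simple vector-distance
    # measures commonly used to compare frame histograms.
    return float(np.abs(hist_a - hist_b).sum())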

The twin-comparison algorithm first proposed by Zhang et al. [31] is the basis of several approaches for the detection of gradual transitions [8, 25, 30]. Here, a low threshold is applied to detect groups of frames that may belong to a gradual transition, and the accumulated inter-frame distance is calculated for these frames. A gradual transition is reported if this accumulated distance exceeds a second, higher threshold.

Several approaches have been proposed that are based on the video production model. These employ internal transition models based on the operation of video editing systems; one or more features of the video are monitored for patterns very similar to those predicted by the internal models [11, 12, 13, 16]. These approaches show promising results, but we are not aware of any large-scale evaluation. Some video segmentation systems also consider features such as audio information or captions [7, 17]; these are usually designed for a particular task on specific types of content, for example the detection of commercial breaks in television footage.

We have previously proposed a technique for effective cut detection. The moving query window technique [26] performs comparisons on a set of frames to detect abrupt transitions. As we proceed through the video, we take each frame in turn as a pivot, and consider a fixed-size window of frames encompassing each pivot or current frame. This moving window comprises two equal-sized sets of frames preceding and following the current frame, as illustrated in Figure 1. All frames in the moving window are ranked by their histogram similarity to the current frame; the most similar frame is ranked highest. The number of frames from the preceding half-window that are ranked in the top half is monitored while advancing through the video. A cut is reported when this number exceeds an upper threshold and then falls below a lower threshold within four consecutive frames. A sketch of this procedure appears below.
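The sketch below illustrates this ranking step, reusing the hypothetical frame_distance and per-frame histograms from the previous sketch. The half-window size and threshold values are placeholders rather than trained parameters, and suppression of duplicate detections on adjacent frames is omitted.

def pre_frames_ranked_high(hists, i, hws):
    # Rank all 2*hws window frames by distance to the current frame i
    # (most similar first) and count the pre-frames in the top half.
    window = list(range(i - hws, i)) + list(range(i + 1, i + hws + 1))
    ranked = sorted(window, key=lambda j: frame_distance(hists[j], hists[i]))
    return sum(1 for j in ranked[:hws] if j < i)

def detect_cuts(hists, hws=8, upper=7, lower=1):
    # Report a cut where the count exceeds the upper threshold and then
    # falls below the lower threshold within four consecutive frames.
    counts = {i: pre_frames_ranked_high(hists, i, hws)
              for i in range(hws, len(hists) - hws)}
    return [i for i, c in counts.items()
            if c >= upper and any(counts.get(i + d, hws) <= lower
                                  for d in range(1, 5))]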



Figure 2. An example of a dissolve. Before the transition, the PrePostRatio is minimal. It rises to a maximum as we proceed through the transition, before falling again afterwards.

The effectiveness of this approach for cut detection has been demonstrated on the collections of the trec Video Retrieval Evaluation [26, 27, 29]. However, without modification, the scheme is less effective on gradual transitions. For example, after training on the trec-10 [23] test collection, we obtain cut detection quality index values of 91% for blind runs on both the trec-11 [22] and trec-12 [21] collections, but the corresponding values for detection of gradual transitions are only 40% and 35% respectively. Aiming for a simpler and more effective detection scheme for gradual transitions, we have developed an alternative technique for use within our moving query window.

3 Gradual Transition Detection with the Moving Query Window

In this section, we propose a novel extension of our moving query window approach that permits effective detection of gradual transitions. Our method of ranking frames in the query window works well for abrupt transitions because these usually show significant inter-frame distances within a few consecutive frames. Our observations show that this is not usually the case for gradual transitions, where inter-frame distances are typically smaller; as a result, the ranking approach is far less effective at detecting them.

To address this problem, we propose that the frames in the moving window not be examined individually. Instead, we define two sets of frames, one on either side of the current frame; we refer to the frames of these two sets as pre-frames and post-frames respectively. For each of the two sets, we determine the distance between each frame in that set and the current frame, and average these distances to obtain the average distance between that set and the current frame. This computation results in two values, one each for the pre- and post-frame sets, and we use the ratio of these values, which we call the PrePostRatio, to detect gradual transitions. A code sketch of this computation appears below.

Consider an example: Figure 2 shows a dissolve between the neighbouring shots A and B. We assume that the dissolve starts at frame 12 and ends with frame 22. In the top row, frame 11 is the current frame; it belongs to shot A and is the last frame before the transition starts. Frames 1 to 10 form the pre-frames and are also from shot A. They are similar to frame 11, so their inter-frame distance to the current frame is relatively low; for this example, let us assume the average inter-frame distance of the pre-frames to the current frame is 2. Frames 12 to 21, the post-frames in Figure 2, are mostly dissolve frames, and therefore relatively dissimilar to the current frame. Hence, the average inter-frame distance for the post-frames is comparatively high; let us assume it is 10. The PrePostRatio of the top row in Figure 2 is then 2/10 = 0.2.

As the current frame moves further into the dissolve, the ratio rises, as illustrated in rows two and three of Figure 2. In the fourth row, frame 22 is the current frame and also the last frame of the transition. This frame is likely to be very similar to frames 23 to 32, which belong to shot B, producing a low average inter-frame distance; we again take this value to be 2. The pre-frames, frames 12 to 21, are the frames of the dissolve; as we established earlier, their average inter-frame distance is high, and we again assume a value of 10. The PrePostRatio for row four is therefore 10/2 = 5.

Once the window exits the transition completely, the ratio usually reverts to a relatively low value. We have observed that this behaviour is common for dissolves and fades.
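The following sketch shows this computation, again using the hypothetical frame_distance from the sketch in Section 2. The DMZ handling follows Figure 1; the small guard against a zero denominator is our own addition.

import numpy as np

def pre_post_ratio(hists, i, hws, dmz=0):
    # Average distance from the pre-frames to the current frame i,
    # divided by the corresponding average for the post-frames.
    # Frames inside the DMZ next to the current frame are skipped.
    pre = [frame_distance(hists[j], hists[i])
           for j in range(i - hws, i - dmz)]
    post = [frame_distance(hists[j], hists[i])
            for j in range(i + 1 + dmz, i + 1 + hws)]
    return np.mean(pre) / max(np.mean(post), 1e-9)

For the worked example above, the top row gives pre- and post-frame averages of 2 and 10, so pre_post_ratio returns 0.2; at the end of the transition the two sets swap roles and the ratio rises to 5.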



Figure 3. Plot of PrePostRatio over a 200-frame interval of video. The dynamic threshold is calculated from a moving average of the PrePostRatio.

By monitoring the PrePostRatio as we advance through a video clip, we can detect the minima and maxima that accompany the start and end of such transitions. Other effects, such as wipes and page translations, are more complex and often include intense motion; these transitions can also be detected using our approach, but with reduced effectiveness.

We maintain a history of PrePostRatio values, and calculate a moving average and standard deviation from which we compute a threshold. Detailed analysis of the PrePostRatio curves indicates that this threshold works well, and caters for varying levels of the computed ratio across different types of footage. However, it is sometimes necessary to adjust the level of this threshold: poor-quality, noisy footage, for example, produces many small peaks in the PrePostRatio curve. To reduce false detections caused by these peaks, we multiply the calculated threshold by a factor we call the Upper Threshold Factor (utf).

Figure 3 shows the PrePostRatio curve for a 200-frame segment of a video, along with the corresponding moving average and threshold. A possible gradual transition is indicated when the PrePostRatio crosses the threshold. In this case, we determine the position of the local minimum within the preceding frames; if this value is sufficiently small, a gradual transition is reported over the interval between these two points. A sketch of this thresholding and detection appears below.

The most important algorithm parameter influencing the results is the number of pre- and post-frames on either side of the current frame, which we refer to as the Half-Window Size (hws). The number of frames in the entire query window is then 2 × hws, as shown in Figure 1; the current frame itself is not part of the query window.
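The text does not specify exactly how the moving average and standard deviation combine into the threshold, so in the sketch below the mean-plus-one-standard-deviation form, the history length, and the look-back span for the local minimum are all assumptions; boundary refinement and the minimum-depth test are omitted, and consecutive crossings are reported individually rather than merged.

from collections import deque
import numpy as np

def detect_gradual(ratios, history_len=50, utf=1.0):
    # Scan PrePostRatio values with an adaptive threshold computed
    # from a moving average and standard deviation of recent history.
    history = deque(maxlen=history_len)
    transitions = []
    for i, r in enumerate(ratios):
        if len(history) == history_len:
            threshold = utf * (np.mean(history) + np.std(history))
            if r > threshold:
                # Threshold crossing: look back for the preceding
                # local minimum and report the interval as a candidate.
                lo = max(0, i - history_len)
                start = lo + int(np.argmin(ratios[lo:i]))
                transitions.append((start, i))
        history.append(r)
    return transitions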

We discuss the effect of these parameters on system performance, together with detailed results, in Section 5. In the next section, we describe the environment used to train and test our algorithm.

4 Evaluation Environment

We apply the common Information Retrieval measures of recall and precision to evaluate effectiveness [20]. Recall is the fraction of all known transitions that are correctly detected, while precision is the fraction of reported transitions that match the known transitions recorded in the reference data. Additional effectiveness measures designed specifically for gradual transitions are Frame Recall (fr) and Frame Precision (fp) [22]. These are defined as follows:

FR = (frames correctly reported in a detected transition) / (frames in the reference data for that transition)

FP = (frames correctly reported in a detected transition) / (frames reported in that transition)

We also calculate a quality index (Q) that penalises false negatives more heavily than false positives. False detections are regarded as less problematic, since they can be filtered out in later processing steps [19]:

Q = (N_C − N_I / 3) / N_T

where:

N_C = number of correctly reported transitions
N_I = number of false detections
N_T = number of transitions in the reference data
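These measures translate directly into code. In the sketch below, a transition is represented as the set of frame indices it covers; the function names are ours.

def frame_recall(reported, reference):
    # reported / reference: sets of frame indices for one detected
    # transition and its matching reference annotation.
    return len(reported & reference) / len(reference)

def frame_precision(reported, reference):
    return len(reported & reference) / len(reported)

def quality_index(n_correct, n_insertions, n_reference):
    # Q = (N_C - N_I / 3) / N_T: a false detection costs one third
    # of a missed transition.
    return (n_correct - n_insertions / 3.0) / n_reference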

We developed our algorithm using the shot boundary detection task subset of the trec-10 video collection. Detailed results of blind runs on trec-12, including a comparison with other approaches, appear elsewhere [29].


Collection   Clips   Frames    Abrupt   Gradual
trec-10      18      594 179   2 066    1 037
trec-11      18      545 068   1 466      591
trec-12      13      596 054   2 364    1 012

Table 1. Details of our test collections.

We have since improved our technique through further training on the trec-11 and trec-12 test sets. In this paper, we discuss results obtained with the current approach. Details of all the test sets we use are shown in Table 1.

The trec-10 and trec-11 collections contain a variety of documentary and educational cinema and television footage, some of it more than fifty years old. The trec-11 collection also includes amateur video. The collection used in trec-12 comprises more recent footage, mostly television news and entertainment programming from the period 1998–2002. All three collections contain a large number of annotated abrupt transitions, which we do not consider in this paper but have explored in detail elsewhere [27].

The reference data for the collections categorises gradual transitions into three classes:

Dissolve: One shot is replaced with another by gradually dimming the first shot and gradually increasing the brightness of the second.

Fade-out or fade-in: A fade-in or fade-out can be considered a special case of a dissolve, with the first or second shot consisting of frames of only one colour, usually black. A common transition is a fade-out of one shot, followed by a fade-in of the next.

Other: This category comprises all other transition effects that stretch over more than two frames, such as wipes and pixelation effects, as well as artifacts of imperfect splicing of the original cine film.

Many gradual transitions extend over only a handful of frames, and are effectively perceived as cuts by human viewers at the standard replay speed of 24 or 30 frames per second. The two known dissolves marked in Figure 3 are examples of such short transitions, each with an effective length of only three frames. In accordance with the trecvid guideline that a cut may stretch over up to six consecutive frames [23], we treat such short transitions as abrupt rather than gradual.

5 Results

We have experimented with one-dimensional and three-dimensional histograms using the rgb and hsv colour spaces [24], and also with a feature based on the Daubechies wavelet coefficients of the transformed frame data [4]. We have found that gradual transitions are best detected with one-dimensional hsv colour histograms using 32 bins per colour component; all results reported in this paper use this feature representation.

Table 2 shows results for detecting gradual transitions on each of the three trec collections, using the algorithm parameters that produce the best performance. As with most applications, it is generally possible to trade precision for higher recall. These results show that the technique performs best on the trec-10 test collection.

We observe that for the two newer trec collections, and especially for trec-12, recall drops considerably. This is caused in large part by transitions appearing in rapid succession, which our algorithm tends to report as a single transition. The trec-11 collection also contains a large proportion of older, low-quality footage, and is the most challenging for our system: the high number of false positives has a negative effect on precision and quality. We cater for this by raising the threshold level (utf) by 20%, and by applying a dmz of one frame on either side of the current frame to reduce the effects of low video quality, camera motion, and compression artifacts. The demilitarised zone excludes the frames immediately adjacent to the current frame from the pre- and post-frame sets, reducing sensitivity on lower-quality footage. This produces the best compromise between recall and precision for this collection. Although recall on the trec-12 collection is rather low, precision remains reasonable, and at 31.9%, the rate of false positives is not unacceptable.

More detailed results are provided in Table 3. Since our system does not yet distinguish between transition types, we cannot calculate individual insertion rates. As expected, our approach performs better for dissolves and fades than for other, less common, gradual transition types.

Table 4 shows the frame recall and frame precision obtained for each collection. We observe very good results for all types of gradual transitions. Frame recall for fades in the trec-10 and trec-11 collections is relatively low: in these collections, the average length of a fade is 80 and 89 frames respectively, while the corresponding average in the trec-12 collection is 29 frames, and our implementation is currently limited to detecting gradual transitions spanning fewer than 60 frames.

The values used for the algorithm parameters play an important part in determining effectiveness.


Collection   HWS   DMZ   UTF   Recall   Precision   Quality   Deletions   Insertions
trec-10      18    1     1.0   83.5%    75.0%       72.4%     16.4%        33.4%
trec-10      18    1     1.2   76.4%    80.7%       69.4%     23.5%        21.1%
trec-11      18    1     1.0   81.7%    56.8%       33.8%     18.2%       144.0%
trec-11      18    1     1.2   64.5%    77.0%       53.4%     22.9%        70.8%
trec-12      14    0     1.0   65.9%    76.4%       56.0%     34.0%        29.6%
trec-12      14    0     1.2   58.9%    82.5%       53.7%     41.0%        15.8%

Table 2. Results of the best runs for gradual transitions for the TREC video collections and the parameters used in these runs.

             Reference transitions        Recall                        Deletions                     Insertions
Collection   Dissolve   Fade   Other      Dissolve   Fade    Other      Dissolve   Fade    Other      All
trec-10      942        54     41         86.2%      68.5%   70.7%      13.8%      42.6%   24.4%      35.8%
trec-11      510        63     18         78.6%      76.2%   61.1%      21.6%      31.8%   38.9%      67.2%
trec-12      684        116    212        76.9%      47.4%   44.8%      25.1%      52.6%   55.7%      31.9%

Table 3. Results grouped by transition type for the best run on each test collection. We observe much better performance for dissolves and fades than for other types of gradual transition.

             Dissolve                   Fade                       Other
Collection   F-Recall   F-Precision    F-Recall   F-Precision    F-Recall   F-Precision
trec-10      94.5%      81.1%          44.5%      83.1%          63.2%      78.9%
trec-11      94.6%      83.2%          49.0%      88.9%          57.1%      73.1%
trec-12      96.7%      76.6%          88.0%      83.8%          81.5%      87.6%

Table 4. Frame recall and frame precision grouped by transition type for the best runs on each test collection.

Figure 4 illustrates the effect of varying the half-window size (hws) on recall, precision, and quality for the trec-12 collection. With a larger hws, precision increases, but recall drops considerably for half-window sizes above 16 frames. For this collection, optimum quality is achieved with a half-window size of 14. The upper threshold factor (utf) also affects the trade-off between recall and precision, with quality peaking at utf=1: Figure 5 shows that while precision improves with a larger utf, there is an associated drop in recall. The best parameter values over all three collections are hws=14, dmz=0, and utf=1.

Figure 4. Variation of recall, precision and quality with the HWS for the TREC-12 test collection. The best recall/precision trade-off (maximum quality) is seen for HWS=14.

Figure 5. Variation of recall, precision and quality with different values of UTF for the TREC-12 test collection. The highest quality index is observed at UTF=1.

Our algorithm performs well relative to comparable systems. In the trec-12 shot boundary detection task, an earlier implementation was among the better-performing systems, and obtained the highest precision of all participants for gradual transitions [29]. It achieved average recall, above-average frame precision, and the best results for frame recall. The results presented here reflect performance after the trec-12 data was included in the training set, and indicate performance in the top four systems of trec-12.

6 Conclusion

Gradual transitions comprise a significant proportion of all shot transitions. The relatively small inter-frame differences during gradual transitions are often indistinguishable from normal levels of inter-frame distance, making gradual transitions much harder to detect than cuts. However, effective identification of gradual transitions is important for complete video indexing and retrieval.

In this paper, we have proposed a novel approach to gradual transition detection, based on our moving query window technique. It monitors the average inter-frame distance between sets of frames and the current frame to detect gradual shot changes. We have shown that it is effective on large video collections, with recall and precision of approximately 83% and 75% respectively. A particular strength of our approach is the accurate detection of the start and end of gradual transitions.

We plan to address the high false detection rate through the use of localised histograms and an edge-tracking feature. We also intend to explore automatic parameter selection, allowing the system to adapt automatically to different types of footage.

Despite its relative simplicity, our technique shows good results when tested on a large video collection comprising a variety of content, and has the potential to be the basis for more effective video segmentation tools.

References

[1] B. Adams, A. Amir, C. Dorai, S. Ghosal, G. Iyengar, A. Jaimes, C. Lang, C.-Y. Lin, A. Natsev, M. Naphade, C. Neti, H. J. Nock, H. H. Permuter, R. Singh, J. R. Smith, S. Srinivasan, B. L. Tseng, T. V. Ashwin, and D. Zhang. IBM Research TREC-2002 video retrieval system. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-251: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), pages 289–298, Gaithersburg, MD, USA, 19–22 November 2002.





[2] P. Aigrain, H. J. Zhang, and D. Petkovic. Content-based representation and retrieval of visual media: A state-of-the-art review. Multimedia Tools and Applications, 3(3):179–202, September 1996.

[3] J. S. Boreczky and L. A. Rowe. Comparison of video shot boundary detection techniques. Journal of Electronic Imaging, 5(2):122–128, April 1996.

[4] I. Daubechies. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.

[5] A. Del Bimbo. Visual Information Retrieval. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.

[6] A. Hampapur, R. Jain, and T. Weymouth. Digital video segmentation. In Proceedings of the ACM International Conference on Multimedia, pages 357–364, San Francisco, CA, USA, 15–20 October 1994.

[7] A. G. Hauptmann and M. J. Witbrock. Story segmentation and detection of commercials in broadcast news video. In Proceedings of the IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL'98), pages 168–179, Santa Barbara, CA, USA, 22–24 April 1998.

[8] D. Heesch, M. J. Pickering, S. Rüger, and A. Yavlinsky. Video retrieval within a browsing framework using key frames. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-252: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD, USA, 18–21 November 2003. To appear.

[9] I. Koprinska and S. Carrato. Temporal video segmentation: A survey. Journal of Signal Processing: Image Communication, 16(5):477–500, 2001.

[10] R. W. Lienhart. Comparison of automatic shot boundary detection algorithms. Proceedings of the SPIE; Storage and Retrieval for Image and Video Databases VII, 3656:290–301, December 1998.

[11] R. W. Lienhart. Reliable dissolve detection. Proceedings of the SPIE; Storage and Retrieval for Media Databases, 4315:545–552, December 2001.

[12] R. W. Lienhart. Reliable transition detection in videos: A survey and practitioner's guide. International Journal of Image and Graphics (IJIG), 1(3):469–486, July 2001.

[13] X. Liu and T. Chen. Shot boundary detection using temporal statistics modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 3389–3392, Orlando, FL, USA, 13–17 May 2002.

[14] S. Marchand-Maillet. Content-based video retrieval: An overview. Technical Report 00.06, CUI, University of Geneva, Geneva, Switzerland, 2000.

[15] A. Nagasaka and Y. Tanaka. Automatic video indexing and full-video search for object appearances. Visual Database Systems, 2:113–127, 1992.

[16] J. Nam and A. H. Tewfik. Dissolve transition detection using B-Splines interpolation. In A. Del Bimbo, editor, IEEE International Conference on Multimedia and Expo (ICME), volume 3, pages 1349–1352, New York, NY, USA, 30 July – 2 August 2000.




[17] S. Pfeiffer, R. W. Lienhart, and W. Effelsberg. Scene determination based on video and audio features. Technical Report TR-98-020, University of Mannheim, Germany, January 1998.

[18] G. M. Quénot, D. Moraru, and L. Besacier. CLIPS at TRECVID: Shot boundary detection and feature detection. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-252: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD, USA, 18–21 November 2003. To appear.

[19] G. M. Quénot and P. Mulhem. Two systems for temporal video segmentation. In Proceedings of the European Workshop on Content Based Multimedia Indexing (CBMI'99), pages 187–194, Toulouse, France, 25–27 October 1999.

[20] R. Ruiloba, P. Joly, S. Marchand-Maillet, and G. M. Quénot. Towards a standard protocol for the evaluation of video-to-shots segmentation algorithms. In Proceedings of the European Workshop on Content Based Multimedia Indexing (CBMI'99), pages 41–48, Toulouse, France, 25–27 October 1999.

[21] A. F. Smeaton, W. Kraaij, and P. Over. TRECVID-2003 – An introduction. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-252: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD, USA, 18–21 November 2003. To appear.

[22] A. F. Smeaton and P. Over. The TREC-2002 video track report. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-251: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), pages 69–85, Gaithersburg, MD, USA, 19–22 November 2002.

[23] A. F. Smeaton, P. Over, and R. Taban. The TREC-2001 video track report. In E. M. Voorhees and D. K. Harman, editors, NIST Special Publication 500-250: Proceedings of the Tenth Text REtrieval Conference (TREC 2001), pages 52–60, Gaithersburg, MD, USA, 13–16 November 2001.

[24] J. R. Smith. Content-based access of image and video libraries. Encyclopedia of Library and Information Science, 1:40–61, 2001.

[25] J. Sun, S. Cui, X. Xu, and Y. Luo. Automatic video shot detection and characterization for content-based video retrieval. Proceedings of the SPIE; Visualization and Optimisation Techniques, 4553:313–320, September 2001.

[26] S. M. M. Tahaghoghi, J. A. Thom, and H. E. Williams. Shot boundary detection using the moving query window. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-251: Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), pages 529–538, Gaithersburg, MD, USA, 19–22 November 2002.

[27] S. M. M. Tahaghoghi, J. A. Thom, H. E. Williams, and T. Volkmer. Video cut detection using frame windows. In submission.

[28] B. T. Truong, C. Dorai, and S. Venkatesh. New enhancements to cut, fade, and dissolve detection processes in video segmentation. In R. Price, editor, Proceedings of the ACM International Conference on Multimedia 2000, pages 219–227, Los Angeles, CA, USA, 30 October – 4 November 2000.

[29] T. Volkmer, S. M. M. Tahaghoghi, H. E. Williams, and J. A. Thom. The moving query window for shot boundary detection at TREC-12. In E. M. Voorhees and L. P. Buckland, editors, NIST Special Publication 500-252: Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), Gaithersburg, MD, USA, 18–21 November 2003. To appear.

[30] J. Yu and M. D. Srinath. An efficient method for scene cut detection. Pattern Recognition Letters, 22(13):1379–1391, January 2001.

[31] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems Journal, 1(1):10–28, June 1993.

