ISIS technical report series, Vol. 2001-03, April 2001
Evaluation measurement for Logical Story Unit segmentation in video sequences Jeroen Vendrig and Marcel Worring
Intelligent Sensory Information Systems Department of Computer Science University of Amsterdam The Netherlands
Although various Logical Story Unit (LSU) segmentation methods based on visual content have been presented, systematic evaluation of their mutual dependencies and their performance has not been performed. The unifying framework presented in this paper shows that LSU segmentation methods can be classi ed into 4 essentially dierent types. LSUs are subjective and cannot be de ned with full certainty. We therefore present de nitions based on lm theory to limit subjectivity. We then introduce an evaluation method measuring the quality of a segmentation method and its economic impact rather than the amount of errors. Furthermore, the inherent complexity of the segmentation problem given a visual feature is measured. Finally, we show to what extent LSU segmentation depends on the quality of shot boundary segmentation. We present results of an evaluation of the four segmentation method types under similar circumstances using an unprecedented amount of 20 hours of 17 complete videos in dierent genres. Tools and ground truths are available for interactive use via Internet.
submitted to: Special issue of IEEE Transac-
tions on Multimedia on Multimedia databases
Evaluation measurement for Logical Story Unit segmentation in video sequences
Contents 1 Introduction 2 Logical Story Unit Segmentation
2.1 De nitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Unifying framework for existing methods . . . . . . . . . . . . . . . . 2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 Evaluation 3.1 3.2 3.3 3.4
Feature evaluation . . . . . . . . . . . . . . . . . . . . . . . Evaluation method . . . . . . . . . . . . . . . . . . . . . . . Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . LSU segmentation on incorrect shot boundary segmentation
. . . .
. . . .
4 Experiments 5 A B C D E
. . . .
. . . .
. . . .
1 2 2 3 5
5 5 6 8 8
9
4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9 9 10
Conclusions LSU evaluation in literature Description of video collection Feature evaluation Over segmentation and under segmentation Shot segmentation results
11 14 14 15 15 16
Intelligent Sensory Information Systems
Department of Computer Science University of Amsterdam Kruislaan 403 1098 SJ Amsterdam The Netherlands tel: +31 20 525 7463 fax: +31 20 525 7490
http://www.science.uva.nl/research/isis
Corresponding author: Jeroen Vendrig tel: +31(20)525 7463
[email protected] http://www.science.uva.nl/~vendrig
Section 1 Introduction
1
1 Introduction Video structure is important for abstracting, visualizing, searching and navigating through videos. Brunelli [6] de nes video structure as the decomposition of a stream into shots (\contiguously recorded sequences") and scenes. In video processing literature, the term scene is already occupied and renamed to Logical Story Unit (LSU) [8] or simply story unit [4, 19]. In cinematography an LSU is de ned as a series of shots that communicate a uni ed action with a common locale and time [3]. Viewers perceive the meaning of a video at the level of LSUs [4, 14]. Therefore, next to methods for accurate shot detection there is even a greater need to have methods for automatic segmentation of a video into LSUs [8], [19], [14], [10], [11], [2], [7], [9], [16], [13], [12], [15], [18]. Large-scale evaluations on shot segmentation results have been performed already, e.g. [6], showing reasonable performance. Such a major eort has not been done before for LSU segmentation results. To our knowledge, the largest data set described in literature is 4 hours for just 2 movies [8]. Other authors use more videos, but not full-length [14, 16]. See appendix A for details. For good evaluation of LSU segmentation, however, both complete videos and various genres should be evaluated. Methods presented in literature are evaluated by their creators, resulting in three major problems. Firstly, there is no common data set for evaluation. Secondly, the de nition used for an LSU is not precise. Therefore de nitions of a ground truth depend on individual viewers instructed by the creator of a method. The ground truth may be biased to the speci c method's underlying assumptions and approach. Thirdly, for each method speci c features are used to compare video content. Hence, it is possible that the quality of features rather than methods is evaluated. An independent evaluation is required. Evaluation measurement of segmentation results is either left to the reader [19] or based on evaluation criteria for shot boundary segmentation. The latter boils down to counting false negatives and false positives as done in [8] and [14]. This approach requires an exact and unbiased ground truth and does not communicate error magnitude. For shots this is possible, because start and end are well de ned. For LSU ground truths this cannot be expected. In the context of this paper, we restrict ourselves to visual features. Segmentation methods that use other modalities, such as audio, still depend on visual information. Evaluation of visual methods is necessary in nding the best starting point. This paper is organized as follows. In section 2, we present a new, more precise de nition for LSU. Assumptions and techniques underlying LSU segmentation methods are made explicit, resulting in a unifying framework. In section 3 user centered measures for visual features and segmentation methods are de ned. In section 4, results of the performance evaluation are described. Conclusions are given in section 5.
2
Vendrig and Worring
2 Logical Story Unit Segmentation 2.1 De nitions In literature the use of ground truths for LSU segmentation is often insinuated, but never well de ned. This is no surprise since the commonly used de nition of an LSU is vague. The \common time and locale" in the de nition can be interpreted widely. In the extreme case, the common locale could be earth. Although judgment by humans is unavoidable, we provide clear rules for 4 important editorial techniques, viz. elliptical editing, montage, establishing shot, and parallel cutting [3] [5], to limit subjectivity to the largest extent. In the case of elliptical editing, \an action is presented in such a way that it consumes less time on the screen than it does in the story" [5]. Viewers perceive the shots as being continuous in time. Therefore, we consider the shots to be part of one LSU. Montage is juxtaposition of shots based on a common theme. When the shots cannot be perceived as having a common locale or time, we do not consider the shots to be part of one LSU. An establishing shot is \a beginning shot of a new scene that shows an overall view of the new setting" [3]. Since often the establishing shot shows the outside of the building where a LSU takes place, the \common locale" part of the LSU de nition is interpreted broadly. We consider an establishing shot to be part of the LSU for which it determines the setting. Parallel cutting is the alternation shots at dierent locales to create the impression that several events take place at the same time [3]. The de nition of an LSU as given in the introduction does not accommodate parallel cutting. We let the part of the de nition that an LSU is a series of shots prevail. Hence, events shown in parallel cutting are considered one LSU. Let V be a video represented at shot level by V and at LSU level by V . Other representations used in literature, viz. shot-lets [16] and frames [15], can be aggregated into the 2 representations. A shot in V is represented by x, where x is the shot index number. In a similar way t represents an LSU in V . t consists of a continuous series of shots x : : : x+n. The # operator counts the amount of shots a representation of V or an LSU contains. For later use, we introduce the rst shot and last shot operators FS and LS as de ned in [19], which return the index of a shot in a series. Since humans perceive LSUs by way of changes in content [4], an LSU can be de ned best by its boundaries. This allows us to reformulate the problem into nding discontinuities in place or time. To accommodate parallel cutting a maximal gap gmax between discontinuities is de ned. Similar to [2], we have found gmax = 3 shots to be a representative value for the entire video collection.
De nition 1 An LSU is the series of shots between an LSU boundary and the succeeding LSU boundary.
De nition 2 x is an LSU boundary, i: c(x?1 ; x) = false ^ 8y;z : c(y ; z ) = false ^ (0 y < x < z < #(V )) ^ (z ? y ? 1 gmax )
Section 2 Logical Story Unit Segmentation
3
where c(x ; y ) denotes a continuity operator returning true if x and y share time and locale and x and y are both part of V , and returning false otherwise. When a discontinuity is a small interruption, i.e. the story later continues in the same locale and time, this is attributed to parallel cutting. The subjective question whether an instance of parallel cutting occurs or dierent LSUs are present, cannot be avoided completely. By setting the variable gmax , however, the choice is made explicit for the entire video collection. Given the de nition and representation of an LSU, the process of creation and evaluation of LSUs can be described.
2.2 Unifying framework for existing methods Video is composed of the content of the individual shots and the structure imposed through edits. Therefore, for the segmentation of a video into LSUs, a major distinction can be made between methods based on content and methods based on edits. An edits-based analysis requires an explicit model of the video program or a set of generic style rules used in editing. Editing rules have been applied in news broadcast segmentation [20] and for speci c classes of feature lms [2]. They are domain speci c and not suited for appliance to general movies and TV shows. In this paper we aim at domain independent analysis. Consequently, we focus on the content of the shots. We restrict ourselves to the broad class of methods based on visual similarity [8], [19], [14], [10], [11], [2], [7], [9], [16], [13], [12], [15], [18]. De nition 2 is based on the semantic notion of common locale and time. There is no one-to-one mapping between these semantic concepts and the data-driven visual similarity. In practice, however, most LSU boundaries coincide with a change of locale, causing a change in the visual content of the shots. Furthermore, usually the scenery in which an LSU takes place does not change signi cantly, or foreground objects will appear in several shots, e.g. talking heads in the case of a dialogue. Therefore, visual similarity provides a proper base for common locale. There are two complicating factors regarding visual similarity. Firstly, not all shots in an LSU need to be visually similar. For example, one can have a sudden close-up of a glass of wine in the middle of a dinner conversation showing talking heads. This problem is addressed by the overlapping links approach [8] which assigns visually dissimilar shots to an LSU based on temporal constraints. Finally, at a later point in the video, time and locale from one LSU can be repeated in another, not immediate succeeding LSU. The above observations lead to the following assumptions every visual similarity LSU segmentation method depends on for succesful segmentation:
Assumption 1 The visual content in an LSU is dissimilar from the visual content in a succeeding LSU.
Assumption 2 Within an LSU, shots with similar visual content are repeated. Assumption 3 If two shots are visually similar and assigned to the same LSU, then all shots between are part of this LSU.
4
Vendrig and Worring
Given the assumptions, LSU segmentation methods using visual similarity can be characterized by two important components, viz. the shot distance measurement and the comparison method. The former determines the (dis)similarity mentioned in assumptions 1 and 2. The latter component determines which shots are compared in nding LSU boundaries. Both components are described in more detail. Shot distance measurement. The shot distance represents the dissimilarity between two shots and is measured by combining (typically multiplying) measurements for the visual distance v and the temporal distance t . The two distances will now be explained in detail. Visual distance measurement consists of dissimilarity function fv for a visual feature f measuring the distance between two shots. Usually a threshold fv is used to determine whether two shots are close or not. fv and fv have to be chosen such that the distance between shots in an LSU is small (assumption 2), while the distance between shots in dierent LSUs is large (assumption 1). Segmentation methods in literature do not depend on speci c features or dissimilarity functions, i.e. the features and dissimilarity functions are interchangeable amongst methods. Temporal distance measurement consists of temporal distance function t . As observed before, shots from not immediate succeeding LSUs can have similar content. Therefore it is necessary to de ne a temporal threshold t , known as time window. t determines what shots in a video are open for comparison. t, expressed in shots or frames, has to be chosen such that it resembles the length of an LSU. In practice, t has to be estimated since LSUs vary in length. t is either binary or continuous. A binary t results in 1 if two shots are less than t shots or frames apart and 1 otherwise [19]. A continuous t re ects the distance between two shots more precisely. In [14], t ranges from 0 to 1. As a consequence, the further two shots are apart in time, the closer the visual distance has to be the shots to be assigned to the same LSU. t is still used to mark the point after which shots are considered dissimilar. Shot distance is then set to 1 regardless of the visual distance. The second important component of LSU segmentation methods is comparison method. There are two approaches to compare shots and assign them to LSUs. In sequential iteration the distance between a shot and other shots is measured pairwise. In clustering, shots are compared group-wise. Note that in the sequential approach still many comparisons can be made, but always of one pair of shots at the time. Methods from literature can now be classi ed according to the framework. The visual distance function is not discriminatory, since it is interchangeable amongst methods. Therefore, the two discriminating dimensions for classi cation of methods are temporal distance function and comparison method. Their names in literature and references to methods are given in table 1. Note that in [9] 2 approaches are presented.
Section 3 Evaluation
5
2.3 Discussion As any binary function can be expressed as the limit case of some continuous function, methods using binary temporal distance can be considered special cases of the methods using continuous temporal distance. The use of a binary function impacts the result, as it is more sensitive to the choice for t than a continuous function. A larger value for t allows for more shot comparisons resulting in less oversegmentation. The disadvantage for a binary function is that more shot comparisons also increase the number of shot pairs that are determined visually similar, resulting in undersegmentation, especially in combination with assumption 3. Making v more strict will not solve this problem. It results in oversegmentation since assumption 2 is then easily violated. A continuous function does not suer as much from the threshold setting dilemma, since shots with a higher temporal distance have to compensate with a lower visual distance. The comparison method has a similar eect since it in uences time window t as well. Sequential comparison is sensitive to violation of assumption 2 because it makes only shot-wise comparisons. t and v have to be ne-tuned for each video in order to achieve good results. Cluster comparison suers less from the parameter setting problem, since shots are compared group-wise. This allows for a larger value of t , because similarity of one pair of shots alone will not result in a new LSU boundary. Furthermore, it allows for a more strict value of v . Similarity is measured in a group of shots and therefore less sensitive to outliers. Classi cation of LSU segmentation methods. Comparison Temporal distance function method binary continuous sequential overlapping links [8], [10], continuous video coherence [11], [2], [7] [9], [16], [13] clustering time constrained clustering time adaptive grouping [14], [19], [12], [15] [9], [18]
Table 1:
3 Evaluation In this section we present evaluation methods for features and segmentation methods. Evaluation is done from a video librarian point of view, re ecting the practical and economical eort required to restore errors.
3.1 Feature evaluation Humans and automated segmentation methods have dierent ways to nd LSU boundaries based on discontinuities in time and locale. Humans try to relate changes in time and locale to discontinuities in meaning [4]. Automated methods depend on visual dissimilarity in the video content, as expressed in assumptions 1 and 2. The
6
Vendrig and Worring
semantic gap between human de ned LSUs and so-called \computable scenes" [16] makes it impossible for automated segmentation methods to achieve fully correct segmentation based on visual features only. In this section, we present two criteria to measure to what degree automatic segmentation can approach human de ned LSUs using a visual feature. Measurement of the potential of a visual feature in segmenting a video into LSUs requires to measure the extent to which the assumptions 1 and 2 hold. Measurements have to be taken in the context of assumption 3 which allows shots to be assigned to LSUs regardless of the values for the visual feature. Coverage C measures to what extent assumption 2 is met in the ground truth, i.e. what part of the ground truth LSU t could theoretically be found given the feature and visual threshold v . To be precise, C is the fraction of shots in t that can be merged into a newly generated LSU x (see gure 1.a). In the best case C = 1, i.e. t = 0 . Otherwise, there are several LSUs 0 : : : n, in which case the longest j is taken to measure C . =0::n #(j ) (1) C (t ) = maxj#( t ) Over ow O measures to what extent assumption 1 is met. It quanti es the overlap of a given t with its two surrounding LSUs (t?1 and t+1 ). O is the fraction of shots in the surrounding LSUs that would be merged with t into a newly generated LSU (see gure 1.b). In the best case, j = t hence O = 0. In the worst case, all three LSUs are merged into one 0 and O = 1. Pnj=0 #(j n t) min(1; #(j \ t) (2) O(t ) = #(t?1 ) + #(t+1 ) The measured values can be aggregated into values for an entire video or collection of videos as follows: #(V )?1 X C( ) #(t) C (V ) = (3) t #(V ) t=0 O(V ) is de ned in a similar way. There are three important applications for the measurements. First of all, they are useful to compare the performance of individual features. Secondly, the measurements show to what extent segmentation of a video sequence is theoretically possible, i.e. under ideal circumstances. The ideal feature/threshold combination has C = 1 and O = 0. The dierence between the actual measurements and the ideal is the inherent complexity of the segmentation problem given the visual feature. Thirdly, when coverage and over ow are plotted against one another, an appropriate threshold can be selected depending on the user's preferences for amount of over ow (undersegmentation) and coverage (oversegmentation).
3.2 Evaluation method An evaluation criterion for the quality of an automatic LSU segmentation result should re ect the perception of users. In the case of LSUs, users have doubts about
Section 3 Evaluation
7
Visualization of a) coverage C and b) over ow O. The gray shaded areas contribute to the value measured.
Figure 1:
the exact start and the exact end of an LSU, see for example the problem with establishing shots mentioned in section 2.1. An LSU evaluation criterion should therefore not measure if a boundary is incorrect but how incorrect it is. Although it is impossible to completely solve the problem of biased ground truths, such criterion will at least cope with the uncertainty in truths rather than ignore it. Similar to the video string edit distance proposed in [1] to measure similarity between video sequences, we propose to measure the cost of transforming result V to the ground truth V . This is done by counting the number of shot comparisons c(x; y ) necessary to correct LSU boundaries, as this is proportional to the practical eort to be delivered by video librarians. To that end, we introduce a simpli ed user model based on two assumptions:
the user has a constant V in mind; the user is able to carry out continuity operator c consistently, correctly and immediately.
The procedure the user follows to convert V into V is modeled as in gure 2. Basically, the user iterates over the found LSUs t 2 V and corrects them one at the time. In the end, each t = t and hence V = V . In the model, the user makes a number of assessments. He assesses whether the boundaries in V are correct (lines 3 and 6). Note that this type of assessment is made even in the case of perfect segmentation V = V . Then, if necessary, he makes assessments to nd the true boundaries from V in V (lines 4 and 7). The amount of assessments to nd the true boundaries depends on the search strategy of the user. A trivial strategy is a linear search, where the user simply iterates back or forward shot by shot. This is not realistic for an expert user, such as a video librarian. We use a more advanced version of the linear strategy. It is modeled as follows. The user rst takes big steps forward or backward. We use 10 shots which corresponds to half the length of an average LSU. When he realizes he's gone too far, he switches to small steps, we use 1 shot, and iterates in the opposite direction. Based on the user model, it is possible to de ne evaluation criteria that can be measured consistently for dierent automatic segmentation methods.
8
Vendrig and Worring
1. while V 6= V 2. Let t be the rst uncorrected LSU in V 3. if c(FS (t ) ; LS (t ) ) = false 4. Divide t into t and 0t+1 , by searching x: x = FS ( 0t+1 ) = LS (t )+1 5. Update V , replacing t with t and 0t+1 6. 7.
else if c(LS(t ) ; FS(t+1)) = true Search LS (t ) 2 x and divide t : : : x into t and 0x =
LS(t)+1 : : : LS(x)
8. 9. Figure 2:
Merge FS ( t : : : LS (t ) 2 x into t , 0t+1 = FS ( x : : : LS ( x ) Update similar to step 5 User model procedure
3.3 Evaluation criteria The quality of segmentation results can be measured by applying the user model. Let AV be the total number of times an assessment c was necessary to perform the transformation from automatic segmentation V togroundtruthV . AV expresses the amount of labor invested by a user. The gain G can now be computed for using an automatic segmentation method compared to the situation in which a video librarian has to segment a video fully manually. Let A0V be AV measured for the hypothetical segmentation where V0 consists of 1 LSU covering the entire movie. Then, gain G is de ned as follows: 0
G (V ) = AV A?0 AV 100% V
(4)
In a similar way, G can be computed for a collection of videos by summing all individual measurements for AV and A0V . G is a powerful criterion. It allows comparison of methods based on one single value and gives a direct measure of economic impact. The criterion can be applied to the large video collection in section 4.
3.4 LSU segmentation on incorrect shot boundary segmentation In this section, we evaluate the necessity of the requirement made in most LSU segmentation methods that ground truth shot boundaries are known. Automatic shot boundary segmentation does not reach 100% correctness [6]. Results should be either adjusted manually before performing LSU segmentation, or the errors in the results should be known not to aect the LSU segmentation signi cantly. To
Section 4 Experiments
9
verify this, we have manually corrected results from an automatic adaptive color histogram based shot segmentation method. Results show 37% false positives and 10% false negatives on average. For more details see E. Even in the best case, the adjustments are very labor-intensive. Therefore, the option of manual correction is not viable, unless perfect shot segmentation is required for other applications as well. Hence, evaluation of the eects of incorrect shot segmentation results on LSU segmentation is necessary. To avoid confusion, we rst introduce notation for the dierent types of segmentation results involved. The ground truth segmentations V for LSUs and V for shots have been described before. The results of automatic LSU segmentation based on V are represented by V . Let the results of automatic shot segmentation be V^ . The results of automatic LSU segmentation based on automatic shot segmentation V^ are then denoted by V^. The question is whether V^ is suciently similar to V , or more general whether the distances from either segmentation to ground truth V are comparable in magnitude. If so, the complete process from raw video data to LSU segmentation can be automated. For determining the distance between V^ and V it is necessary that the underlying shot boundaries correspond. This is non trivial, since an LSU boundary could be detected for a shot boundary in V^ that is not present in V . To make comparison of the results possible, we adjust V^ such that the following requirement is ful lled: each boundary in automatic LSU segmentation V^ corresponds with the closest ground truth shot boundary in V . Given the requirements for LSU boundary correspondence, the impact of errors in shot segmentation on LSU segmentation can be evaluated.
4 Experiments 4.1 Setup We use the following implementations for the 4 types of segmentation methods: [8] for overlapping links, [9] (section 3+4) for continuous video coherence, [19] for time constrained clustering, and [14] for time adaptive grouping. These methods are most often refered to in literature, and can be seen as rst adapters of the particular type. The parameters suggested in the references were used for evaluation. We de ned LSU ground truths1 for 17 popular movies and TV series, in total 20 hours and 926 scenes. The video collection is split into collection I and II of 10 hours each. For collection I, shot boundaries were manually corrected from automatic results. Collection I is used to evaluate the impact of shot segmentation on LSU segmentation. For collection II, only automatic shot segmentation results are available. Characteristics of the videos are given in table 2.
4.2 Features LSU segmentation depends on computation of visual shot similarity, which in turn is based on visual features. Although implementation details dier, there is consensus 1 Tools and ground truths are available via http://www.science.uva.nl/~vendrig/evaluation/
10
Vendrig and Worring 1 Hue-Saturation Intensity
0.8
Overflow
0.6
0.4
0.2
0 0
0.1
0.2
0.3
0.4
0.5 Coverage
0.6
0.7
0.8
0.9
1
Figure 3: Coverage C and over ow O plotted for 2 features, measured for 40 thresholds v , median values for 10 hours of video.
Figure 4: Example of ground truth and segmentation results for the rst 20 minutes of \A View To A Kill" in time lines. Each vertical line represents an LSU boundary. From top to bottom the ground truth, and the results for overlapping links, continuous video coherence, time constrained clustering, and time adaptive grouping.
in literature on the use of color histograms. In the context of this paper we focus on evaluating segmentation methods rather than evaluating the quality of features. Therefore, we restrict to evaluation of the Hue-Saturation-Intensity color space. We use 2D Hue-Saturation histograms [2] [14] for the chromatic part of the color space, and Intensity histograms for the achromatic part. The similarity function used is the intersection distance between histograms [17].
4.3 Results The coverage C and over ow O are plotted against one another for various thresholds
in gure 3. Apart from outlier video \Fawlty Towers", for all videos a similar trend is visible. \Fawlty Towers" is atypical in the sense that LSUs are long and take place in several settings, while the same settings occur in most LSUs. Since the vast majority of videos exhibit similar results, in gure 3 the median values for the video collection are presented for each threshold. The Hue-Saturation histogram feature results in the same coverage as the Intensity histogram feature, but for signi cantly lower over ow. Therefore, the Hue-Saturation histogram feature is used for further LSU segmentation experiments.
Section 5 Conclusions
11
Table 3.a shows the outcome of the evaluation of the segmentation results against the ground truth for collection I. A detailed example of the various results for a small movie segment is given in gure 4. The overall results in table 3.b for LSU segmentation V^ , based on automatic shot boundary segmentation V^ , show that the performance of methods does not decrease because of incorrect shot segmentation. Overlapping links, and to a lesser extent time constrained clustering, is aected by false negatives (undersegmentation) in V^ . False positives (oversegmentation), on the other hand, causes performance to increase, particularly in the case of overlapping links. In 4 the results for collection II and the overall results for the 20 hours are given. Again, the results indicate there is no necessity for the assumption of a perfect shot boundary segmentation for succesful LSU boundary segmentation. Overview of video collections I and II. Name video length #shots #LSUs (minutes) Collection I A View To A Kill 125 2171 68 Rain Man 108 1166 80 Witness 108 1049 63 Life Of Brian 90 883 37 Fawlty Towers 1-1 30 299 8 Fawlty Towers 2-4 32 420 16 Seinfeld 2-9 22 316 9 Seinfeld 8-20 22 310 25 Simpsons 11-18 22 352 26 Friends 4-17 21 363 16 Friends 4-18 21 328 14 Collection II Gladiator 143 3237 91 Forrest Gump 136 1430 154 Conspiracy Theory 127 1974 62 Under Siege II 96 2561 132 Airplane! 84 952 112 Friends 4-19 21 406 13 Overall I+II 1208 18217 926
Table 2:
5 Conclusions LSU segmentation methods are characterized by two dimensions, viz. temporal distance and shot comparison, resulting in four classes. We have de ned evaluation criteria for features and segmentation results of the four classes from the perspective of video librarians. For visual features, the evaluation criteria help users in nding thresholds suited for the segmentation process. They also yield insight in
12
Vendrig and Worring
Table 3: Gain G for 4 LSU segmentation methods performed on collection I, based on a. ground truth shot segmentation V , and b. automatic shot segmentation V^ . a. b. Name video Segmentation methods Name video Segmentation methods [8] [19] [14] [9] [8] [19] [14] [9] A View To A Kill 45% 46% 70% 62% A View To A Kill 88% 37% 74% 62% Rain Man 85% 40% 61% 53% Rain Man 63% 52% 68% 58% Witness 52% 59% 71% 58% Witness 79% 65% 70% 75% Life Of Brian 82% 45% 55% 52% Life Of Brian 77% 41% 73% 66% Fawlty Towers 1-1 49% 37% 52% 47% Fawlty Towers 1-1 47% 37% 59% 62% Fawlty Towers 2-4 76% 35% 4% 44% Fawlty Towers 2-4 69% 41% 8% 41% Seinfeld 2-9 74% 63% 67% 47% Seinfeld 2-9 58% 50% 40% 52% Seinfeld 8-20 29% 47% 51% 53% Seinfeld 8-20 64% 49% 58% 52% Simpsons 11-18 12% 45% 55% 48% Simpsons 11-18 64% 39% 59% 51% Friends 4-17 29% 38% 65% 54% Friends 4-17 27% 19% 61% 47% Friends 4-18 23% 47% 55% 57% Friends 4-18 58% 28% 53% 77% Overall I 58% 46% 64% 56% Overall I 67% 46% 66% 59%
Gain G for 4 LSU segmentation methods performed on collection II, based on automatic shot segmentation V^ . Name video Segmentation methods [8] [19] [14] [9] Gladiator 59% 34% 62% 67% Forrest Gump 90% 43% 53% 39% Conspiracy Theory 58% 43% 70% 61% Under Siege II 93% 16% 61% 55% Airplane! 61% 26% 35% 38% Friends 4-19 33% 31% 39% 42% Overall II 86% 59% 75% 73% Overall I+II 72% 37% 61% 56%
Table 4:
the inherent complexity of the segmentation problem. For evaluation of automatic segmentation results, a method is introduced measuring the eort a video librarian should make to convert found segmentation results into a ground truth segmentation. Given the inherent complexity of segmentation by visual similarity, results are quite good for all methods. The overlapping links method's results are unpredictable. This could be corrected by adapting thresholds for speci c movies, but this is undesirable as it results in loss of general applicability. Time adaptive grouping shows both good and consistent results. It is the best method to segment a collection of videos. The use of automatic shot segmentation does not result in signi cantly worse performance of the LSU segmentation methods as shown in table 3.b. Hence, the
Section 5 Conclusions
13
labor-intensive creation of shot segmentation ground truths before performing LSU segmentation is not necessary. In general, current automatic segmentation methods based on visual features show sound results. Improvement of automatic segmentation is not only sought in development of new visual features, but also in extension with features from other modalities, viz. audio and text. Systematic user centered evaluation should be applied to such multi-modal approaches as well and show to what extent they result in an increase of gain G .
14
Vendrig and Worring
Appendices A LSU evaluation in literature In table 5 an overview is given of the evaluation eorts done in the referred papers on LSU segmentation. The papers are ranked according to the total length of video material used. Overview of evaluation of Logical Story Unit segmentation as reported in literature. Reference # videos total length remarks [16] 5 5 hrs 4 genres, including foreign language movies. [8] 2 244 min. From personal conversation with author: \4 Weddings and a Funeral" and \Jurassic Park". [12] 2 237 min. 2 genres (Forrest Gump and Groundhog Day). [18] 2 3 hrs 2 genres. [9] 3 2 hrs 2 genres: part of a movie and 2 sitcoms. [14] 7 97 min. 5 genres. [19] 5 81 min. 4 genres. [10] 1 1 hr 1 genre (half of Terminator II). [7] 1 1hr Seminar talk. [13] 1 52 min. 1 genre (France98 World Cup, 40 matches). [15] 2 41 min. 2 genres. [2] No information about data set presented.
Table 5:
B Description of video collection In table 6 the content of videos used for experiments is described. Names of TV series contain numbers describing the season and episode number. Genres given are provided by the Internet Movie Database2 . The descriptions of the video content concern speci cs that are relevant for LSU segmentation. For validation of the assumption that LSU boundaries are mostly caused by a change in locale, 10 hours of video are described in table 7. An estimation of the number of locales in each video is presented, as well as the number of times the locale does not change over an LSU boundary. A locale is constant when the locale of the end of LSU t equals the locale of the start of t+1 . Since the locale does not change, the boundary is caused by a change in time that is not attributed to elliptical editing as de ned in section 2. Table 7 shows that LSU boundaries without locale change are rare, as expected. Therefore, LSU segmentation based on change in locales is very well feasible. 2 http://www.imdb.com/
Section C Feature evaluation
15
C Feature evaluation In this section, the results from gure 3 for coverage C and over ow O are split into more detail. The results for each movie are now given, for feature Intensity ( gure 5) and feature Hue-Saturation ( gure 6). With the exception of outliers \Fawlty Towers 1-1" for feature Hue-Saturation and \Seinfeld 2-9" for both features, the results for all movies show a similar trend. For low C , O is low as well. Increase of C results in exponential growth of O. When C =1, O 1 in all cases. Note that the points in the graphs for dierent movies may be close to one another for dierent threshold values. For example, the 4 points in gure 6 at coordinates C 0:8 and O 0:5 have completely dierent threshold values. For \Friends 4-17" and \Friends 4-18" v = 0:175, but for \Rain Man" and \Witness" v = 0:275. Therefore, selection of a threshold value does not result in the same coverage/over ow trade-o for all videos. 1 A View To A Kill Rain Man Witness Life of Brian Fawlty Towers 1-1 Fawlty Towers 2-4 Seinfeld 2-9 Seinfeld 8-20 Simpsons 11-18 Friends 4-17 Friends 4-18
0.8
Overflow
0.6
0.4
0.2
0 0
0.1
0.2
0.3
0.4
0.5 Coverage
0.6
0.7
0.8
0.9
1
Coverage C and over ow O plotted for feature Intensity, measured for 40 thresholds v on 11 videos.
Figure 5:
D Over segmentation and under segmentation In addition to the gain G presented in section 3.3, in this appendix the reasons for segmentation errors are described in more detail. A segmentation error is either caused by over segmentation (false positive) or by under segmentation (false negative). Let #oV be the number of times a correction for an over segmentation had to be performed. #uV is de ned in a similar way for under segmentation. The values for the two types of errors in presented in tables 8, 9, and 10 for video collection I, for which both ground truth shot segmenations and automatic shot segmentations are available, and video collection II, for which just automatic shot segmentations are available. Note that #oV and #uV do not necessarily equal the number of false positives or
16
Vendrig and Worring 1 A View To A Kill Rain Man Witness Life of Brian Fawlty Towers 1-1 Fawlty Towers 2-4 Seinfeld 2-9 Seinfeld 8-20 Simpsons 11-18 Friends 4-17 Friends 4-18
0.8
Overflow
0.6
0.4
0.2
0 0
0.1
0.2
0.3
0.4
0.5 Coverage
0.6
0.7
0.8
0.9
1
Coverage C and over ow O plotted for feature Hue-Saturation, measured for 40 thresholds v on 11 videos.
Figure 6:
false negatives. The user model introduced in section 3.2 to evaluate segmentation results allows one correction step for several succeeding errors. Methods using a continuous temporal distance function [14] [9] suer less from under segmentation than methods using a xed temporal distance [8] [19]. This can be attributed to the less strict value for the visual threshold in the case of a continuous temporal distance function. Since under segmentations are generally harder to correct than over segmentations, they will have a large impact on gain G .
E Shot segmentation results In table 11 the results of comparison between ground truth shot segmentation V and automatic shot segmentation V^ on video collection I is presented. The term block is used which covers both shots and edit eects. The number of False segmentations in the last column is split into false positives (oversegmentation) and false negatives (undersegmentation). Percentages are computed relative to the amount of blocks in V . The overall results are similar to the results found in [6] for methods that prefer a relatively low percentage of false positives at the cost of a relatively high percentage of false negatives.
Section E Shot segmentation results
17
Description of video collections I and II. Name video genre description Collection I A View To A Kill action James Bond movie. Rain Man drama Road movie. Witness drama, Large part of the movie is set in Amish village romance, which is not very colorful. thriller Life of Brian comedy Monty Python movie. Contains some animations. Contains some extremely long LSUs. Fawlty Towers 1-1 + 2-4 comedy English situation comedy. Many settings in one LSU because John Cleese is running around. The various settings reoccur in all LSUs. Seinfeld 2-9 + 8-20 comedy American situation comedy. Several storylines at various settings, all coming together during the show. Simpsons 11-18 animation, Cartoon series. Comparable to American sitcomedy uation comedies. Friends 4-17 + 4-18 comedy American situation comedy. Generally focuses on 3 settings in all episodes, each episode contains some less frequently shown settings in addition. Collection II Gladiator drama, Epic movie. Large color dierences between action parts of the movie. Sometimes visions. Forrest Gump adventure, Most of the movie is a frame story. Many drama, times TV footage and movie scenes are comedy mixed. Conspiracy Theory action, Typical Hollywood movie. Sometimes very mystery, short visions/ ashbacks. romance, thriller Under Siege 2 action, Movie mostly plays in 2 settings: hijacked thriller train and satellite command centre. There are several settings within the train. Many shots of computer screen, which can be seen at both settings. Airplane! comedy Movie mostly plays in an airplane, several subsettings. Sometimes short LSUs just to make a joke. Friends 4-19 comedy See Friends 4-17. Relatively few settings in this episode.
Table 6:
18
Vendrig and Worring
Description of locales in video collection I. Name video # locale (esti- # locale conmated) stancy between LSUs A View To A Kill 51 2 Rain Man 58 11 Witness 35 2 Life of Brian 18 2 Fawlty Towers 1-1 7 0 Fawlty Towers 2-4 13 0 Seinfeld 2-9 5 1 Seinfeld 8-20 12 1 Simpsons 11-18 22 0 Friends 4-17 7 1 Friends 4-18 7 0
Table 7:
Amount of corrections for over segmentation and under segmentation for 4 LSU segmentation methods performed on collection I, based on ground truth shot segmentation V . Name video Segmentation methods [8] [19] [14] [9] #oV #uV #oV #uV #oV #uV #oV #uV A View To A Kill 26 38 23 31 41 16 43 21 Rain Man 69 1 24 43 37 25 45 28 Witness 19 40 22 20 29 16 40 21 Life Of Brian 28 2 21 13 21 9 20 11 Fawlty Towers 1-1 7 0 5 1 6 1 8 0 Fawlty Towers 2-4 10 1 9 7 2 13 9 5 Seinfeld 2-9 5 1 3 3 4 2 6 1 Seinfeld 8-20 2 17 5 10 6 9 13 4 Simpsons 11-18 0 23 7 9 10 6 17 5 Friends 4-17 2 11 1 9 10 2 10 2 Friends 4-18 1 10 2 6 7 3 9 2 Overall I 169 144 122 152 173 102 220 100
Table 8:
Section E Shot segmentation results
19
Amount of corrections for over segmentation and under segmentation for 4 LSU segmentation methods performed on collection I, based on automatic shot segmentation V^ . Name video Segmentation methods [8] [19] [14] [9] #oV #uV #oV #uV #oV #uV #oV #uV A View To A Kill 65 1 20 35 45 13 46 16 Rain Man 51 17 29 34 47 18 45 23 Witness 54 5 32 15 43 9 43 12 Life Of Brian 27 7 18 15 31 3 25 6 Fawlty Towers 1-1 8 0 5 1 8 0 6 0 Fawlty Towers 2-4 9 3 10 5 2 13 11 4 Seinfeld 2-9 8 1 6 3 6 3 5 1 Seinfeld 8-20 12 7 5 9 5 7 10 5 Simpsons 11-18 25 0 9 10 18 3 19 2 Friends 4-17 2 12 2 11 10 3 10 3 Friends 4-18 8 3 0 8 8 3 7 1 Overall I 438 200 258 298 396 177 447 173
Table 9:
Amount of corrections for over segmentation and under segmentation for 4 LSU segmentation methods performed on collection II, based on automatic shot segmentation V^ . Name video Segmentation methods [8] [19] [14] [9] #oV #uV #oV #uV #oV #uV #oV #uV Gladiator 51 31 21 53 45 29 63 24 Forrest Gump 110 7 50 70 28 87 59 85 Conspiracy Theory 33 24 22 31 33 17 42 17 Under Siege II 119 1 11 106 75 42 78 49 Airplane! 61 31 30 69 23 80 42 62 Friends 4-19 2 8 6 5 7 5 9 3 Overall II 376 102 140 334 211 260 293 240 Overall I+II 814 302 398 632 607 437 740 413
Table 10:
20
Vendrig and Worring
Overview of automatic shot segmentation results for video collection I. Name video # blocks V # blocks V^ # Correct (%) #False (%) A View To A Kill 2180 2782 2106 (96%) 676+74 (34%) Rain Man 1183 1557 863 (72%) 694+320 (85%) Witness 1058 1676 1010 (95%) 666+48 (67%) Life Of Brian 904 1871 781 (86%) 1090+123 (134%) Fawlty Towers 1-1 308 333 304 (98%) 29+4 (10%) Fawlty Towers 2-4 427 456 427 (100%) 29+0 (6%) Seinfeld 2-9 330 326 319 (96%) 7+11 (5%) Seinfeld 8-20 323 348 313 (96%) 35+10 (13%) Simpsons 11-18 355 612 349 (98%) 263+6 (75%) Friends 4-17 384 386 322 (83%) 64+62 (32%) Friends 4-18 352 368 331 (94%) 37+21 (16%) Overall 7804 10715 7125 (91%) 3590+679 (54%)
Table 11:
REFERENCES
21
References [1] D.A. Adjeroh, I. King, and M.C. Lee. A distance measure for video sequences. Computer Vision and Image Understanding, 75(1):25{45, 1999. [2] Ph. Aigrain, Ph. Joly, and V. Longueville. Medium Knowledge-Based MacroSegmentation of Video Into Sequences, chapter 8, pages 159{173. AAAI Press, 1997. [3] J.M. Boggs and D.W. Petrie. The art of watching lms. May eld Publishing Company, Mountain View, CA, 5th edition, 2000. [4] R.M. Bolle, B.-L. Yeo, and M. Yeung. Video query: Research directions. IBM Journal of Research and Development, 42(2):233{252, 1998. [5] D. Bordwell and K. Thompson. Film Art: An Introduction. McGraw-Hill, New York, 5th edition, 1997. [6] R. Brunelli, O. Mich, and C. M. Modena. A survey on the automatic indexing of video data. Journal of Visual Communication and Image Representation, 10(2):78{112, 1999. [7] P. Chiu, Girgensohn, W. Polak, E. Rieel, and L. Wilcox. A genetic algorithm for video segmentation and summarization. In IEEE International Conference on Multimedia and Expo, volume 3, pages 1329{1332, 2000. [8] A. Hanjalic, R.L. Lagendijk, and J. Biemond. Automated high-level movie segmentation for advanced video retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology, 9(4):580{588, 1999. [9] J.R. Kender and B.L. Yeo. Video scene segmentation via continuous video coherence. In CVPR'98, Santa Barbara, CA. IEEE, IEEE, June 1998. [10] Y.-M. Kwon, C.-J. Song, and I.-J. Kim. A new approach for high level video structuring. In IEEE International Conference on Multimedia and Expo, volume 2, pages 773{776, 2000. [11] S.-Y. Lee, S.-T. Lee, and D.-Y. Chen. Automatic Video Summary and Description, volume 1929 of Lecture Notes in Computer Science, pages 37{48. Springer-Verlag, Berlin, 2000. [12] R. Lienhart, S. Pfeier, and W. Eelsberg. Scene determination based on video and audio features. In Proc. of the 6th IEEE Int. Conf. on Multimedia Systems, volume 1, pages 685{690, 1999. [13] T. Lin and H.-J. Zhang. Automatic video scene extraction by shot grouping. In Proceedings of ICPR '00, Barcelona, Spain, 2000. [14] Y. Rui, T.S. Huang, and S. Mehrotra. Constructing table-of-content for videos. Multimedia Systems, Special section on Video Libraries, 7(5):359{368, 1999.
22
REFERENCES
[15] E. Sahouria and A. Zakhor. Content analysis of video using principal components. IEEE Transactions on Circuits and Systems for Video Technology, 9(8):1290{1298, 1999. [16] H. Sundaram and S.-F. Chang. Determining computable scenes in lms and their structures using audio visual memory models. In Proceedings of the 8th ACM Multimedia Conference, Los Angeles, CA, 2000. [17] M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11{32, 1991. [18] E. Veneau, R. Ronfard, and P. Bouthemy. From video shot clustering to sequence segmentation. In Proceedings of ICPR '00, volume 4, pages 254{257, Barcelona, Spain, 2000. [19] M. Yeung, B.-L. Yeo, and B. Liu. Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding, 71(1):94{109, 1998. [20] H.J. Zhang, S.Y. Tan, S.W. Smoliar, and G. Yihong. Automatic parsing and indexing of news video. Multimedia Systems, 2(6):256{266, 1995.
Acknowledgements This project is funded by NWO/SION under grant 612-61-005 and supported by the Multimedia Information Analysis project.
REFERENCES
25
ISIS reports This report is in the series of ISIS technical reports. The series editor is Rein van den Boomgaard (
[email protected]). Within this series the following titles are available:
References
[1] J.M. Geusebroek, F. Cornelissen, A.W.M. Smeulders, and H. Geerts. Robust autofocusing in microscopy. Technical Report 17, Intelligent Sensory Information Systems Group, University of Amsterdam, 2000. [2] J.M. Geusebroek, R. van den Boomgaard, A.W.M. Smeulders, and A. Dev. Color and scale: The spatial structure of color images. Technical Report 18, Intelligent Sensory Information Systems Group, University of Amsterdam, 2000. [3] J.M. Geusebroek. A physical basis for color constancy. Technical Report 19, Intelligent Sensory Information Systems Group, University of Amsterdam, 2000. [4] J.M. Geusebroek, A.W.M. Smeulders, and R. van den Boomgaard. Color invariance. Technical Report 20, Intelligent Sensory Information Systems Group, University of Amsterdam, 2000. [5] L. Todoran and M. Worring. Segmentation of color document images. Technical Report 21, Intelligent Sensory Information Systems
Group, University of Amsterdam, 2000. [6] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content based image retrieval at the end of the early years. Technical Report 22, Intelligent Sensory Information Systems Group, University of Amsterdam, 2000. [7] J.M. Geusebroek and A.W.M. Smeulders. A minimum cost approach for segmenting networks of lines. Technical Report 2001-01, Intelligent Sensory Information Systems Group, University of Amsterdam, 2001. [8] A.W.M. Smeulders and D. Koelma. The design of vision solutions. Technical Report 2001-02, Intelligent Sensory Information Systems Group, University of Amsterdam, 2001. [9] J. Vendrig and M. Worring. Evaluation measurement for logical story unit segmentation in video sequences. Technical Report 2001-03, Intelligent Sensory Information Systems Group, University of Amsterdam, 2001.
You may order copies of the ISIS technical reports from the corresponding author or the series editor. Most of the reports can also be found on the web pages of the ISIS group (http://www.science.uva.nl/research/isis).