Motion Insensitive Detection of Cuts and Gradual Transitions in Digital Videos

Scott Lawrence, Djemel Ziou, Marie-Flavie Auclair-Fortier and Shengrui Wang
Département de Mathématiques et d'Informatique
Université de Sherbrooke
Sherbrooke (Qc), Canada, J1K 2R1

{lawrence, ziou, auclair, wang}@[...] - Technical Report N. 266 -

Abstract

This paper concerns shot boundary detection in digital video. Automatic video segmentation is a first step towards video indexing. An efficient yet robust algorithm is needed to separate video sequences into smaller segments called shots. Although many researchers have tackled the problem of finding shot boundaries, few have been able to accurately detect gradual transition boundaries such as fades and dissolves. Because shot boundaries correspond to discontinuities in the time domain, we present an algorithm based on the video's first-order partial derivatives. By only considering areas with low apparent motion, our method is insensitive to object or camera motion. Our algorithm can accurately detect cuts as well as fades, dissolves, wipes and other gradual transitions. Experimental results show that our method outperforms other popular methods on our sample videos.

Keywords: Video shot boundary detection, Cut detection, Gradual transition detection, Partial derivatives, Optical flow


1 Introduction

In the last decade, the number of digital video documents has grown to the point where they are now a dominant component of available multimedia data. We must therefore address the indexing problem emerging from the need to manage these video documents. Classical text-based indexing methods are insufficient to provide an adequate description, so a new form of indexing is needed for video sequences. Many authors [5, 7, 11, 17, 19] believe that shot boundary detection in video sequences is one of the necessary first steps in an efficient video management system. Segmentation of digital video into smaller units is also important in other domains, such as MPEG compression.

We define the digital video segmentation problem as the automatic detection of shot boundaries. Just as a text can be divided into sentences, a video sequence can be hierarchically divided into smaller components such as scenes and shots. A shot is defined as a sequence of images taken from a single operation of the camera that depicts a continuous action in time and space. A scene is a group of related shots. Figure 1 shows an example of the relations between video, scenes and shots.

In the editing process, filmmakers combine shots in a variety of manners to produce the final video. Shot boundaries differ depending on the editing techniques used to combine two shots. The simplest type of shot boundary is known as a cut, and results from a simple juxtaposition of two shots. Other shot boundary types, such as fades, dissolves and wipes, result from adding editing frames between shots to produce effects like a smoother transition or a highlight between two shots. A fade involves slowly changing the shot intensity to give the impression that a shot is appearing or disappearing. A dissolve combines a fade-out and a fade-in: as one shot vanishes, another shot appears, and for a short period of time both shots are superimposed. A typical wipe involves one shot replacing the other as a line crosses the screen. As one side of the line shows the old shot, the other shows the new shot; once the line has crossed the screen, the old shot has disappeared. In a wipe, shots are not superimposed but occupy different areas of the screen. Examples of some common shot boundaries are illustrated in Figure 2.

While cuts represent the vast majority of shot boundaries used in today's videos, we believe that detecting gradual transitions like fades, dissolves and wipes is also important. For instance, in a sample of video test data composed of television commercials, 25% of the shot boundaries were gradual transitions [3]. The detection of cuts has been well established in previous work, but identifying gradual transitions remains a challenge.

Figure 1: Digital video hierarchy model: a video sequence divides into scenes, scenes into shots, and shots into frames.


Figure 2: Shot boundaries: the first line shows a cut, the second a dissolve and the third a wipe.

This is first because gradual transitions occur over many frames and the frame-to-frame differences are comparatively small; consequently, the detection of gradual transitions must be carried out over many frames. Second, motion causes the same type of changes as gradual transitions, so the two phenomena can easily be confused; to overcome this problem, the detection should not take into account pixels with high motion. In order to circumvent these two problems, we propose to find these discontinuities by identifying local maxima of a difference measure based on temporal partial derivatives computed over many frames. We then threshold these maxima to eliminate multiple detections, those due to noise and some due to motion. Since this thresholding step does not handle all motion effects, our method is further improved by removing contributions from pixels in motion areas. To do so, we examine the optical flow equation, from which we derive a threshold to eliminate pixels with high motion from the difference measure computation. To increase computational performance, sampling is used, with little effect on results.

In this paper, we present a new method aimed at detecting all shot boundaries, including gradual transitions. In the next section, we give an overview of related work. In Section 3, we present our new method based on detecting the temporal discontinuities caused by shot boundaries. We then give, in Section 4, experimental results from two test sequences comparing our method with two other popular methods.

2 Related Work

Many methods for automatic video segmentation have been proposed in recent years. Most of them define a difference measure between consecutive frames and use it to identify shot boundaries. This measure has taken many forms, sometimes local, sometimes global.


These methods are developed for two types of videos: compressed and non-compressed. Since the approach presented in this paper deals with non-compressed videos, we present a short review of related work on automatic segmentation of such videos. Detailed surveys can be found in [3, 5]. Four characteristics distinguish methods for non-compressed video segmentation: 1) the data and preprocessing, 2) the disparity measure used to estimate the change between two images, 3) the thresholding and 4) the types of boundary considered by the method. Table 1 classifies current methods according to these characteristics.

Data and preprocessing Segmentation algorithms for non-compressed videos directly use pixel information. The first ones proposed use only gray-level intensities [12, 14]. Later, other methods were proposed to exploit color information, which is more complete [7, 11, 13, 15, 19]. Aigrain and Joly [1] propose to apply their method to each band separately. Some authors [12, 19] suggest reducing the color space to obtain a limited number of different colors. Finally, Lee and Ip [11] use the HSI space, keeping the H and S bands because they represent color independently of intensity. Zabih et al. [17] use contour information to identify cut boundaries; this kind of approach is rarely used because it is time consuming. To reduce the computational cost, and because there is some redundancy in spatial and temporal information, many authors [1, 2, 15] propose to sample spatially and/or temporally. This sampling also aims to reduce motion effects [1]. Cheong [4] uses a measure based on the violation of the brightness constraint of the optical flow, meaning that the optical flow must be calculated as a preprocessing step. The algorithm presented in this paper does not perform any preprocessing, which is an advantage in terms of computation time.

Disparity measure Disparity measures for boundary detection are based on pixel intensity differences, histogram comparison, edge comparison, cut models, optical flow or neural networks. Early research in automatic video segmentation was mainly focused on simple intensity difference measures between consecutive frames in a video sequence [13, 14, 15]. In [12, 19] the authors propose a number of methods to perform video segmentation. The simplest measure is the difference in pixel intensity averages over whole frames. Another measure is the sum of pixel-to-pixel absolute intensity differences across consecutive frames. Aigrain and Joly [1] propose a method modeling the pixel intensity differences in successive frames as a combination of three factors: noise, motion and editing. Another class of methods is centered on histogram differences between consecutive frames. The number of possible intensities is first reduced and a histogram is then taken for each frame. Histograms from adjacent frames are then compared and a sum of bin-to-bin differences is calculated [13]. A number of variations of this method have been proposed. For instance, Nagasaka and Tanaka [12] determined that using a χ² test on color histograms results in a measure that more strongly reflects the differences between two frames. They also propose to divide the image frames into regions and use local histograms. Lee and Ip [11] propose the use of local HSI histograms to provide a better measure when lighting changes occur. Hampapur, Jain and Weymouth [7] use a model-driven approach to video segmentation. They separate shot boundaries into classes according to their types and formulate mathematical models for each class based on video production techniques.
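As an illustration of the two simplest disparity measures mentioned above, the following minimal sketch (ours, not taken from the cited works; the function names and the 64-bin quantization are arbitrary choices) computes the pixel-to-pixel absolute difference and the bin-to-bin histogram difference between two gray-level frames:

```python
import numpy as np

def pixel_difference(f1, f2):
    """Sum of pixel-to-pixel absolute intensity differences
    between two gray-level frames (2-D arrays of equal shape)."""
    return np.abs(f1.astype(float) - f2.astype(float)).sum()

def histogram_difference(f1, f2, bins=64):
    """Sum of bin-to-bin absolute differences between the reduced
    intensity histograms of two frames."""
    h1, _ = np.histogram(f1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(f2, bins=bins, range=(0, 256))
    return np.abs(h1 - h2).sum()
```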


| Methods | Image | Preprocessing | Method | Thresholding | Gradual transitions |
|---|---|---|---|---|---|
| Aigrain and Joly [1] | Graylevel or color | - | Intensity differences | Complex with 7 given thresholds | Yes |
| Ardizzone et al. [2] | Graylevel | Resolution reduction | Neural network | - | No |
| Cheong [4] | Graylevel | Computation of optical flow (o.f.) | Departure of o.f. from smoothness constraint | 4 given thresholds | Yes |
| Hampapur et al. [7, 6, 8] | Graylevel or color | - | Cut models | 2 computed thresholds | Yes |
| Lee and Ip [11] | Color | RGB to HSI conversion | HSI histogram comparison | Given threshold | No |
| Nagasaka and Tanaka [12] | Graylevel or color | - | Histogram comparison | Given threshold | No |
| Naphade et al. [13] | Color | Histogram equalization and resolution reduction | Intensity differences with histogram comparison | 2 given thresholds | No |
| Shahraray [14] | Graylevel | - | Intensity differences | Given threshold + local maxima | No |
| Xiong et al. [15, 16] | Graylevel or color | Resolution reduction | Intensity differences | Given threshold | No |
| Zabih et al. [17, 18] | Graylevel | Edge detection | Edge comparison | Given threshold | Yes |
| Zhang et al. [19] | Graylevel or color | - | Histogram comparison | 2 given thresholds | Yes |
| Our [10] | Graylevel or color | - | Partial derivatives | 2 computed thresholds + motion analysis | Yes |

Table 1: Comparative table of methods using non-compressed video for detection of shot boundaries.


These models are used to design a series of feature detectors capable of identifying cuts as well as gradual transitions. Video segmentation is then formulated as a feature-based classification problem.

Zabih, Miller and Mai [17] present a feature-based method using edge detection. Their method compares edge pixel locations in successive frames. A registration technique is used to compensate for object or camera motion. The authors start by finding edges in two consecutive frames using the thresholded gradient and non-maximum suppression, resulting in two binary images E1 and E2. They define two values, $p_{in}$ and $p_{out}$, corresponding to the fractions of entering and exiting edge pixels. The first value, $p_{in}$, is the fraction of edge pixels in E2 which are at a distance greater than a fixed value $r$ from the closest edge pixel in E1. Similarly, $p_{out}$ is the fraction of edge pixels in E1 more than $r$ pixels from the closest edge pixel in E2. Their measure of dissimilarity is

$$d_z(t) = \max(p_{in}, p_{out}).$$
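A minimal sketch of this edge-based dissimilarity, assuming two boolean edge maps E1 and E2 already produced by some edge detector; the use of scipy's Euclidean distance transform is our implementation choice, not part of [17]:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def edge_dissimilarity(E1, E2, r):
    """d_z(t) = max(p_in, p_out) for two boolean edge maps.
    p_in: fraction of edge pixels in E2 farther than r from any edge in E1;
    p_out: fraction of edge pixels in E1 farther than r from any edge in E2."""
    # distance_transform_edt measures distance to the nearest zero element,
    # so invert the maps to get the distance to the nearest edge pixel
    d_to_E1 = distance_transform_edt(~E1)
    d_to_E2 = distance_transform_edt(~E2)
    p_in = (d_to_E1[E2] > r).mean() if E2.any() else 0.0
    p_out = (d_to_E2[E1] > r).mean() if E1.any() else 0.0
    return max(p_in, p_out)
```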

Cheong [4] uses a ratio-like measure based on the following hypothesis: a moving pixel should not violate the smoothness constraint, but a pixel in a gradual transition could. Hence, he computes the ratio of the number of pixels with high optical flow that violate the smoothness constraint to the number of pixels with high optical flow.

Ardizzone, Gioiello, La Cascia and Molinelli [2] use a neural-network approach for shot boundary detection. Their system first resizes each frame to obtain a 16 × 16 pixel gray-level image. Pixel-to-pixel differences of successive frames are sent to a three-layer Multi-Layer Perceptron. The authors claim satisfactory results while processing several thousand frames per second.

Our method is a mix of the intensity difference and optical flow methods. Our disparity measure is based on the temporal derivative, which is a measure of intensity difference over many frames. Our algorithm is also inspired by the optical flow equation, although we do not explicitly compute the flow.

Thresholding Once a disparity measure has been computed, most algorithms apply a thresholding step to distinguish shot boundaries from motion and noise. Frames whose difference values are greater than the threshold are labeled as shot boundaries [11, 13, 15, 17]. Nagasaka and Tanaka [12] claim to reduce motion effects by rejecting the regions with the highest differences. In [1], the authors claim that cuts and wipes can be identified by looking for differences in the [128, 255] range, while fades and dissolves can be identified by looking for differences in the [7, 40] range.

Zhang, Kankanhalli and Smoliar [19] propose a twin-comparison technique to identify gradual boundaries. Instead of only one, two thresholds are used: a low threshold identifies possible frames at the start of a gradual transition and a high threshold identifies shot boundaries. If a difference value is greater than the high threshold, a shot boundary is immediately declared. If a difference value is greater than the low threshold but below the high threshold, the corresponding frame is marked as possibly starting a gradual transition. Subsequent frames are then compared to this reference frame until the difference between consecutive frames falls below the low threshold. This second comparison provides a measure of the cumulative difference between the current frame and the frame marked as the possible beginning of a gradual transition. If the cumulative difference reaches the high threshold while the difference between consecutive frames steadily remains greater than the low threshold, a gradual transition is declared.
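The twin-comparison logic just described can be sketched as follows (our illustration of the technique in [19]; the names diffs, t_low and t_high are hypothetical):

```python
def twin_comparison(diffs, t_low, t_high):
    """Twin-comparison sketch: diffs[t] is the difference between frames
    t and t+1. Returns ('cut', t) and ('gradual', start, end) tuples."""
    boundaries, t, n = [], 0, len(diffs)
    while t < n:
        if diffs[t] >= t_high:                  # sharp change: a cut
            boundaries.append(('cut', t))
        elif diffs[t] >= t_low:                 # possible start of a gradual transition
            start, cumulative = t, 0.0
            while t < n and diffs[t] >= t_low:  # accumulate while the change persists
                cumulative += diffs[t]
                t += 1
            if cumulative >= t_high:            # accumulated change as large as a cut
                boundaries.append(('gradual', start, t))
            continue
        t += 1
    return boundaries
```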

6

DMI, Universit´e de Sherbrooke

Technical Report N. 266

Cheong [4] uses four thresholds: one distinguishes pixels with high optical flow, one determines which pixels violate the smoothness constraint, a third one thresholds shot changes, and a last one is used to remove false alarms.

Types of boundary Many authors [2, 11, 12, 13, 14, 15, 16] focus on cuts only and ignore gradual transitions. Others [7, 17] propose methods specifically aimed at identifying all types of boundary. For example, because the authors of [17] analyze the spatial and temporal distributions of entering and exiting edge pixels, they can also classify shot boundaries according to their types (cut, fade, dissolve or wipe).

3 Proposed Approach

Our main objective is to develop a method capable of accurately detecting both cuts and gradual transition boundaries. Let us consider a video as a function of three discrete variables, $V(x, y, t)$. Our approach to the detection of shot boundaries is based on the first-order partial derivatives of $V(x, y, t)$. Our initial hypothesis is that shot boundaries correspond to discontinuities in the time domain. To illustrate this, Figure 3 shows two spatio-temporal images. A spatio-temporal image represents the video with the spatial dimension reduced from 2 to 1: one line is taken from each frame and the lines are placed one after the other. One column in a spatio-temporal image corresponds to a given pixel's temporal profile, showing how its intensity varies with time. Figures 4(a), 5(a) and 6(a) show temporal profiles for various shot boundary types, confirming our hypothesis.

In order to develop a measure capable of detecting all the different types of shot boundaries, let us first draw attention to a few facts. For a pixel position $(x_i, y_j)$, a cut or a wipe produces a step edge in $V(x_i, y_j, t)$, and accordingly a maximum in $|V_t(x_i, y_j, t)|$. During a fade or a dissolve, a pixel's intensity changes slowly, producing a blurred step edge in $V(x_i, y_j, t)$ spanning many frames. If the derivative is calculated over a wide enough interval, this edge also produces a maximum in $|V_t(x_i, y_j, t)|$.

3.1 Transition Detection

From the previous discussion, it is clear that our measure should be a function of $|V_t(x_i, y_j, t)|$. It should also be global and take a large number of pixels into consideration in order to reduce noise and motion effects. Various derivative-based measures have been tried, such as summing the gradient magnitude or averaging $|V_t(x, y, t)|$, but experimental evidence has shown that the following measure is the best adapted. We propose to use a difference measure $D(t)$ defined as follows:

$$D(t) = \sum_{(x,y)} |V_t(x, y, t)|. \qquad (1)$$

The sign of the derivative should not be a factor in our measure because it only indicates the relative pixel intensity values (dark to bright or bright to dark), hence the use of the absolute value.

Figure 3: Two spatio-temporal images. Transitions are identified in the margins (f.i. = fade-in, f.o. = fade-out, diss. = dissolve).

Figure 4: a) Typical temporal profiles for 5 pixels in a cut. b) Associated D(t) response.

The temporal derivative $V_t(x, y, t)$ is calculated by convolving $V(x, y, t)$ with the temporal derivative of a Gaussian mask, defined by:

$$G_t(x, y, t) = -\frac{t}{(2\pi)^{3/2}\sigma^5} \, e^{-\frac{x^2 + y^2 + t^2}{2\sigma^2}}, \qquad (2)$$

where $\sigma$ is the scale. Since the scale controls the size of the Gaussian filter, it allows the derivative to take more than two frames into account; consequently, it is better suited to identifying gradual transitions taking place over many frames. During a cut, all pixels change at the same moment in time, producing a sharp peak in $D(t)$. During fades and dissolves, pixel intensities change slowly but steadily, producing smoother peaks in $D(t)$. During a wipe, pixel intensities change rapidly but at different times, also resulting in a smooth peak in $D(t)$. Figures 4(b), 5(b) and 6(b) show the associated $D(t)$ responses for a few common shot boundary types.
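Because the 3-D Gaussian derivative is separable, $D(t)$ of Equation (1) can be sketched as below; the default σ = 3 is only a placeholder, and gaussian_filter1d is scipy's 1-D Gaussian (derivative) filter:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def difference_measure(video, sigma=3.0):
    """D(t) of Equation (1): sum over (x, y) of |V_t(x, y, t)|, where V_t
    is computed by a separable 3-D Gaussian derivative (Gaussian smoothing
    along y and x, first-order Gaussian derivative along t)."""
    v = video.astype(float)                            # (T, H, W) gray-level video
    v = gaussian_filter1d(v, sigma, axis=1)            # smooth along y
    v = gaussian_filter1d(v, sigma, axis=2)            # smooth along x
    vt = gaussian_filter1d(v, sigma, axis=0, order=1)  # derivative along t
    return np.abs(vt).sum(axis=(1, 2))                 # one value per frame
```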


Figure 5: a) Typical temporal profiles for 5 pixels in a dissolve. b) Associated D(t) response.

Figure 6: a) Typical temporal profiles for 5 pixels in a wipe. b) Associated D(t) response.

We propose to locate shot boundaries in the video by identifying local maxima in our difference measure $D(t)$. However, not all local maxima in $D(t)$ are due to shot boundaries (see Figure 7), so unwanted maxima must be removed. First, to avoid multiple hits during gradual transition boundaries (e.g. labels c, d and e), we only label frames as shot boundaries when they correspond to local maxima over a fixed window size (e.g. only label d is kept). Second, two thresholds are applied. The first one ($\tau_1$) aims to remove small peaks caused by noise (e.g. labels a and b); $\tau_1$ is expressed as a percentage of the global maximum value of $D(t)$. The second one ($\tau_2$) aims to eliminate local maxima corresponding to peaks of low relative amplitude (e.g. labels f, g and h), which are caused not by shot boundaries but by motion. The relative amplitude is defined as the average of the differences between the peak and its fore and aft minima; $\tau_2$ is expressed as a percentage of the value of the peak. In spite of these thresholding rules, some maxima caused by motion remain. Figure 8 shows (dashed line) an example where the thresholding fails to remove peaks (labeled a and b) which are due to motion. The next section presents a new approach to reduce these motion effects.
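A sketch of this peak selection, assuming a precomputed array D; the fore and aft minima are approximated by minima within the local-maximum window, and the defaults for window, τ1 and τ2 (expressed as fractions) are placeholders:

```python
import numpy as np

def select_boundaries(D, window=7, tau1=0.1, tau2=0.5):
    """Keep local maxima of D over a +/- window, then apply the two
    thresholds of Section 3.1: tau1 removes small peaks relative to the
    global maximum; tau2 removes peaks whose relative amplitude (average
    drop to the fore and aft minima) is a small fraction of the peak."""
    D = np.asarray(D, dtype=float)
    kept = []
    for t in range(len(D)):
        lo, hi = max(0, t - window), min(len(D), t + window + 1)
        if D[t] < D[lo:hi].max():          # not a windowed local maximum
            continue
        if D[t] < tau1 * D.max():          # small noise peak (threshold 1)
            continue
        fore_min = D[lo:t + 1].min()       # minimum before the peak
        aft_min = D[t:hi].min()            # minimum after the peak
        relative = D[t] - 0.5 * (fore_min + aft_min)
        if relative < tau2 * D[t]:         # low relative amplitude (threshold 2)
            continue
        kept.append(t)
    return kept
```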

Figure 7: Illustration of thresholding situations (labels in the figure: bruit = noise, coupe franche = cut, fondu = fade, mouvement = motion).

3.2 Reducing motion effects

Because motion causes the same type of changes as those encountered during gradual transitions, we have seen that our measure $D(t)$ is not sufficient to handle motion effects, even with the added thresholding rules. One option would be to use registration techniques to compute global motion and then compensate for it. We have chosen not to use this approach because of the high computational cost involved. Instead, we use the first-order partial derivatives to identify areas where motion can cause problems. By identifying these motion areas, we can estimate $D(t)$ only in regions that do not contain optical flow. Motion affects shot boundary detection in areas where a pixel's intensity changes due to motion rather than editing (or both). In such areas, the spatial derivatives $V_x(x, y, t)$ and $V_y(x, y, t)$ can be used to warn about possible motion effects. The optical flow equation [9] shows how the first-order partial derivatives can be used to reduce motion effects:

$$I_x u + I_y v + I_t = 0, \qquad (3)$$

where $(u, v)$ is the motion vector. From this equation, it is clear that if motion exists, at least one of the spatial partial derivatives must be non-zero. This brings us to propose evaluating both spatial derivatives $V_x(x, y, t)$ and $V_y(x, y, t)$ and eliminating from the sum in Equation 1 the pixels at which at least one of the spatial derivatives is large. This ensures that high values of $|V_t(x, y, t)|$ resulting from motion are not taken into account in $D(t)$. While this also eliminates pixels lying on spatial edges, we can tolerate this because of the large number of pixels used in the estimation of $D(t)$. By eliminating contributions to $D(t)$ from pixels whose spatial partial derivatives are large, the motion effects are reduced. Those pixels are found by thresholding the angle between the vector $(V_x, V_y, V_t)$ and the vector $(0, 0, V_t)$, where $(x, y, t)$ has been omitted for clarity. Equation 1 then becomes:

$$D(t) = \sum_{(x,y)} \begin{cases} |V_t| & \text{if } \tan\theta < \tan\theta_{mvt} \\ 0 & \text{elsewhere,} \end{cases} \qquad (4)$$

where $\theta_{mvt}$ is the maximum angle allowed between the two vectors and

$$\tan\theta = \frac{\sqrt{V_x^2 + V_y^2}}{|V_t|},$$

which can easily be verified. The influence of this motion-reduction scheme on $D(t)$, during both motion and gradual transitions, is shown in Figure 8. Notice how the motion areas have been attenuated compared to the other areas: two local maxima (a and b) created in these motion areas have been almost totally removed.
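Equation (4) can be sketched as follows; the defaults for σ and θ_mvt are again placeholders, and the separable Gaussian-derivative implementation mirrors the earlier sketch:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def gaussian_derivative(v, sigma, axis):
    """Separable 3-D Gaussian derivative: first-order derivative along
    `axis`, plain Gaussian smoothing along the two other axes."""
    for ax in range(3):
        v = gaussian_filter1d(v, sigma, axis=ax, order=1 if ax == axis else 0)
    return v

def masked_difference_measure(video, sigma=3.0, theta_mvt_deg=60.0):
    """D(t) of Equation (4): |V_t| contributes only where
    tan(theta) = sqrt(V_x^2 + V_y^2) / |V_t| is below tan(theta_mvt),
    i.e. where temporal change dominates the spatial gradient."""
    v = video.astype(float)                     # (T, H, W) gray-level video
    vt = gaussian_derivative(v, sigma, axis=0)  # temporal derivative V_t
    vy = gaussian_derivative(v, sigma, axis=1)  # spatial derivative V_y
    vx = gaussian_derivative(v, sigma, axis=2)  # spatial derivative V_x
    tan_theta = np.sqrt(vx**2 + vy**2) / (np.abs(vt) + 1e-12)
    keep = tan_theta < np.tan(np.radians(theta_mvt_deg))
    return np.where(keep, np.abs(vt), 0.0).sum(axis=(1, 2))
```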

Figure 8: Reducing motion effects using spatial partial derivatives: D(t) before and after, for a sequence containing fades, cuts and motion.

3.3 Summary of the algorithm

The algorithm is straightforward. The first step is the convolution of each frame of the video with the first-order partial derivatives of the Gaussian (Equation 2) and the estimation of $D(t)$ (Equation 4). The second step is the localization of the local maxima of $D(t)$ and the application of the thresholding rules described in Section 3.1. This process is time consuming, so spatial sampling is added to reduce the processing: the number of pixels at which the derivatives are evaluated is reduced by sampling both spatial dimensions at regular intervals. Using a 1:x ratio means that every xth pixel in both spatial directions is taken. Increasing the ratio decreases accuracy in videos with motion and gradual transitions; with a careful choice of sampling ratio, the impact on the accuracy of the algorithm stays within a reasonable limit while the computational gain is considerable.
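Putting the pieces together, an end-to-end sketch of the algorithm, reusing masked_difference_measure and select_boundaries from the earlier sketches (all parameter defaults are placeholders, not the paper's prescribed values):

```python
def detect_shot_boundaries(video, sigma=3.0, ratio=10, window=7,
                           tau1=0.1, tau2=0.5, theta_mvt_deg=60.0):
    """Sketch of Section 3: spatially sample the video with a 1:ratio
    step, compute the motion-masked measure D(t) (Equation 4), then keep
    thresholded local maxima (Section 3.1)."""
    sampled = video[:, ::ratio, ::ratio]   # every ratio-th pixel in x and y
    D = masked_difference_measure(sampled, sigma=sigma,
                                  theta_mvt_deg=theta_mvt_deg)
    return select_boundaries(D, window=window, tau1=tau1, tau2=tau2)
```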

4 Performance Evaluation

4.1 Evaluation Method

The performance of our algorithm compared to two others (region histograms [11] and edge tracking [17, 18], summarized in Section 2) will be presented using two separate video sequences. While our choice of methods for comparison is arbitrary, both methods performed well when tested in survey papers [3, 5]. For each implemented algorithm, the number of shot boundaries correctly detected, the number of missed shot boundaries and the number of false positives are computed. If any frame in a gradual transition is marked, the boundary is considered correctly detected; but if multiple frames from the same gradual transition are marked, only one is considered correct and the others are labeled as false positives. Comparing video segmentation methods is not easy: depending on the application, finding an extra shot boundary at the expense of accepting many more false positives might or might not be acceptable. We use recall versus precision graphs to compare the methods. Recall is the proportion of desired items that are retrieved and Precision is the proportion of retrieved items that are desired:

$$Recall = \frac{\#Correct}{\#Correct + \#Missed}, \qquad Precision = \frac{\#Correct}{\#Correct + \#False}.$$
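For instance, with hypothetical counts:

```python
def recall_precision(correct, missed, false_pos):
    """Recall and Precision as defined above."""
    recall = correct / (correct + missed)
    precision = correct / (correct + false_pos)
    return recall, precision

# e.g. 70 correct, 3 missed, 5 false positives:
# recall_precision(70, 3, 5) -> (0.958..., 0.933...)
```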


Each algorithm was tested using many threshold parameter combinations; the other parameters were fixed to values giving good results. As the thresholds vary, many precision values are generated for a given recall value. The plots show the best precision value for a given recall value.

We used two different video sequences to compare the algorithms. The first sequence is from a video trailer for the movie Les Misérables. It was digitized at 320 × 240 pixels per frame and at 15 frames per second. It lasts 1715 frames and has 73 shot boundaries, of which 27 are gradual transitions and the rest are cuts. The second sequence is from a music video. It was digitized at 160 × 120 pixels per frame and at 15 frames per second. It consists of 2700 frames and contains 87 cuts and no gradual transitions. Both sequences have large amounts of both object and camera motion.

For the region histogram method [11], we divide each frame into 16 regions in a 4 × 4 pattern and use 64-bin color histograms. The 8 highest difference values are dropped to compensate for motion. We use the twin-comparison algorithm to help identify gradual transitions. The thresholds varied from 1000 to 40000. Concerning the edge tracking method [17, 18], the edge detector scale was set to 1.5, the gradient threshold to 80 and the distance to r = 15 pixels. Unfortunately, the proposed motion registration technique is not implemented in our system. The threshold on max(p_in, p_out) varied from 0.1 to 0.6.

For our method, a Gaussian scale σ = 3 was used to compute the spatial and temporal derivatives, which produces good results for both sequences. To show how sampling affects accuracy, we use two different sampling ratios on both sequences. A low-resolution test uses a 1:20 ratio for the first sequence and a 1:10 ratio for the second; a high-resolution test respectively uses a 1:10 ratio and a 1:5 ratio for the first and second sequences. The local maximum window size is a compromise between reducing false positives during gradual transitions and the shortest possible shot length. Since one of the sequences contains a shot consisting of only 10 frames, we set the window size at 7. The threshold used to eliminate motion effects (θ_mvt) is set to 60 degrees. The threshold parameters are varied from 1% to 70% for the first threshold (τ_1) and from 10% to 80% for the second threshold (τ_2).

4.2 Experimental Results

Figure 9 shows a plot of recall versus precision values for Les Misérables. This sequence contains many gradual transitions. As can be seen from the plot, at both resolutions our algorithm generates fewer false positives than the other methods, resulting in higher precision values for a given recall value. This is because our method is less sensitive to motion, enabling it to better locate gradual transitions. At high resolution, our algorithm generates no false positives at all for most recall values (Recall < 86%). While the results are not quite as good at low resolution, precision is still higher than with the other two methods. At high recall, our algorithm outperforms the others and does not decline as fast when Recall approaches 100%.

Figure 9: Best precision value per Recall for Les Misérables.

Figure 10 shows another plot of Recall versus Precision values, this time for the music video sequence. This sequence only contains cuts, but it has a large amount of rapid motion. Again, our algorithm outperforms the other methods. The plot lines for the low and high resolution tests are superimposed for most recall values, suggesting little impact from the sampling ratio. At the highest recall value of 100% (all shot boundaries found), only 2 false positives are generated in the high-resolution test and 3 false positives in the low-resolution test. At its highest recall value, the color histogram method generated 6 false positives but missed one cut. The edge-tracking algorithm did not do as well, but we did not implement the motion registration technique discussed in the article. For most recall values (Recall < 98%), our algorithm does not produce any false positives.

4.3 Speed performance

Both plots show that our method is accurate and robust: it outperformed both of the other methods on both our test sequences. Figure 11 gives approximate running times for the algorithms. The local color histogram method was the fastest, segmenting the first sequence in 60 seconds and the second in 450 seconds. The edge tracking algorithm was among the slowest even without the motion compensation efforts: it took 2000 seconds to analyze the first sequence and 12000 seconds for the second. For the high-resolution test, our algorithm took a greedy 3200 seconds to segment the first sequence and 4600 seconds for the second. As expected, it was faster in the low-resolution test, taking respectively 700 and 1300 seconds to segment the two sequences. The respective speeds in frames per second (fps) are also given in Figure 11.


Figure 10: Best precision value per Recall for the music video sequence.

| | Les Misérables | Music Video |
|---|---|---|
| Color Histogram | 60 sec (28.6 fps) | 450 sec (6 fps) |
| Edge Tracking | 2000 sec (0.86 fps) | 12000 sec (0.23 fps) |
| 3D Derivatives (high res) | 3200 sec (0.54 fps) | 4600 sec (0.59 fps) |
| 3D Derivatives (low res) | 700 sec (2.45 fps) | 1300 sec (2.08 fps) |

Figure 11: Running times for the three algorithms.


Figure 12: Example screen from our Windows based application.

5 Conclusion

Video indexing has been an important research area for many years, and it requires an efficient yet robust method for digital video segmentation. In this paper, we have presented a new algorithm designed to accurately identify cuts as well as gradual transitions. Using first-order partial derivatives, our algorithm recognizes shot boundaries with minimal influence from motion, and sampling considerably reduces execution time with a reasonable impact on accuracy. Experimental results confirm that our method works well with both cuts and gradual transitions; they also show that our method withstands large amounts of motion. Further research aims at a better understanding of the influence of the parameters on the algorithm's performance. We have developed a Windows-based application called Video Shot Detector (Figure 12), and we plan to incorporate this video segmentation method into a future video indexing application.

References

[1] P. Aigrain and P. Joly. The Automatic Real-Time Analysis of Film Editing and Transition Effects and its Applications. Computers & Graphics, 18(1):93–103, 1994.

[2] E. Ardizzone, G. A. M. Gioiello, M. La Cascia, and D. Molinelli. A Real-time Neural Approach to Scene Cut Detection. In IS&T/SPIE Symposium on Electronic Imaging, volume 2670, San Jose, 1996.


[3] J. S. Boreczky and L. A. Rowe. Comparison of Video Shot Boundary Detection Techniques. In IS&T/SPIE Symposium on Electronic Imaging, volume 2670, pages 170–179, San Jose, 1996.

[4] L.-F. Cheong. Scene-Based Shot Change Detection and Comparative Evaluation. Computer Vision and Image Understanding, 79:224–235, 2000.

[5] A. Dailianas, R. B. Allen, and P. England. Comparison of Automatic Video Segmentation Algorithms. In Proceedings of SPIE Photonics East '95, volume 2615, pages 2–16, Philadelphia, 1995.

[6] A. Hampapur, R. Jain, and T. Weymouth. Digital Video Indexing in Multimedia Systems. In Workshop on Indexing and Reuse in Multimedia Systems, Seattle, 1994.

[7] A. Hampapur, R. Jain, and T. Weymouth. Digital Video Segmentation. In Proceedings of the ACM International Conference on Multimedia, pages 357–364, San Francisco, 1994.

[8] A. Hampapur, R. Jain, and T. Weymouth. Production Model Based Digital Video Segmentation. Journal of Multimedia Tools and Applications, 1(1):9–46, 1995.

[9] B. K. P. Horn. Robot Vision. MIT Press. McGraw-Hill Book Company, 1986.

[10] S. Lawrence, D. Ziou, and S. Wang. Motion Insensitive Detection of Cut and Gradual Transitions in Digital Video. In International Conference on Multimedia Modeling, Ottawa, 1999.

[11] J. C.-M. Lee and D. M.-C. Ip. A Robust Approach for Camera Break Detection in Color Video Sequence. In IAPR Workshop on Machine Vision Application, pages 502–505, Kawasaki, 1994.

[12] A. Nagasaka and Y. Tanaka. Automatic Video Indexing and Full-Video Search for Object Appearances. In IFIP TC2/WG 2.6 Second Working Conference on Visual Database Systems, pages 113–127, Budapest, 1991.

[13] M. R. Naphade, R. Mehrotra, A. M. Ferman, J. Warnick, T. S. Huang, and A. M. Tekalp. A High-performance Shot Boundary Detection Algorithm Using Multiple Cues. In Proceedings of the IEEE International Conference on Image Processing, volume 1, pages 884–887, Chicago, 1998.

[14] B. Shahraray. Scene Change Detection and Content-based Sampling of Video Sequences. In IS&T/SPIE Symposium on Electronic Imaging, volume 2419, pages 2–13, San Jose, 1995.

[15] W. Xiong, J. C.-M. Lee, and M.-C. Ip. Net Comparison: A Versatile and Efficient Method for Scene Change Detection. In IS&T/SPIE Symposium on Electronic Imaging: Storage and Retrieval for Image and Video Databases III, volume 2420, 1995.

[16] W. Xiong, J. C.-M. Lee, and R. Ma. Automatic Video Data Structuring Through Shot Partitioning and Key-Frame Computing. Machine Vision and Applications: Special Issue on Storage and Retrieval for Still Image and Video Databases, 10(2):51–65, 1997.

[17] R. Zabih, J. Miller, and K. Mai. A Feature-based Algorithm for Detecting and Classifying Scene Breaks. In Proceedings of the ACM International Conference on Multimedia, pages 189–200, San Francisco, 1995.


[18] R. Zabih, J. Miller, and K. Mai. A Feature-based Algorithm for Detecting and Classifying Scene Breaks. Technical Report CS-TR-95-1530, Computer Science Department, Cornell University, 1995.

[19] H. Zhang, A. Kankanhalli, and S. W. Smoliar. Automatic Partitioning of Full-motion Video. Multimedia Systems, 1(1):10–28, 1993.
