Integrated video shot segmentation algorithm

Wei-Kuang Li and Shang-Hong Lai*
Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan

ABSTRACT

Video segmentation is fundamental to a number of applications related to video retrieval and analysis. Shot change detection is the initial step of video segmentation and indexing. There are two basic types of shot changes: the abrupt change, or cut, and the gradual shot transition. The smooth variations of video feature values in a gradual transition produced by editing effects are often confused with those caused by camera or object motions. To overcome this difficulty, it is reasonable to estimate the motions and suppress the disturbance they cause. In this paper, we explore the possibility of exploiting motion and illumination estimation in a video sequence to detect both abrupt and gradual shot changes. A generalized optical flow constraint that includes an illumination parameter to model local illumination changes is employed in the motion and illumination estimation, and an iterative process refines the generalized optical flow constraints step by step. A robust measure, the likelihood ratio of corresponding motion-compensated blocks in consecutive frames, is used for detecting abrupt changes. For the detection of gradual shot transitions, we compute the average monotony of intensity variations on the stable pixels of the images in a twin-comparison framework. We test the proposed algorithm on a number of video sequences from TREC 2001 and compare the detection results with the best results reported in the TREC 2001 benchmark. The comparisons indicate that the proposed shot change detection algorithm is competitive against the best existing algorithms.

Keywords: shot change detection, gradual shot transition, video retrieval, video analysis, optical flow computation, motion estimation, illumination estimation, intensity compensation.

1. INTRODUCTION

In recent years, the rapid growth of multimedia information and the advance of internet communication have made multimedia information indexing and retrieval more and more important. Multimedia information contains audio and visual data in addition to text. Although many research efforts have been devoted to multimedia retrieval based on audio or visual information, many issues remain unsolved. First of all, efficiently managing the vast amount of audio and visual data is already a difficult problem. In addition, retrieving information from audio or visual content is very challenging, since it requires extracting high-level semantic information from low-level audio or visual data. The Moving Picture Experts Group (MPEG) has established standards for multimedia information compression and transmission. The current main objective of MPEG is to establish MPEG-7, the "Multimedia Content Description Interface" standard for efficient multimedia information retrieval.

The basic requirement of video segmentation and indexing is to partition a video into shots. A shot is an unbroken sequence of frames and a basic meaningful unit in a video; it usually represents a continuous action or a single camera operation. Generally, the purpose of a shot segmentation algorithm is to find all the shot boundaries accurately. There are two kinds of shot boundaries (shot changes) 1-2: abrupt changes and gradual transitions. Abrupt changes usually result from camera breaks, while gradual transitions are produced with artificial editing effects such as fades, dissolves, and wipes.

A number of approaches to video shot transition detection have been proposed in the past decade 1-3. Most of them provide metrics that quantify the difference between successive frames in order to locate shot boundaries. We roughly classify the video shot detection methods into two categories: those that use the original video data and those that use only information from the compressed stream. Researchers have proposed many measures, including comparisons of pixel values, edge counts, histograms, and compression coefficients, to quantify the variation of consecutive video frames 1-3.

* Author for correspondence: Dr. Shang-Hong Lai, Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan 300. Phone: +886-3-574-2958; Fax: +886-3-572-3694; Email: [email protected]; URL: http://www.cs.nthu.edu.tw/~lai

Most of these methods use only simple information from the video to determine the various types of shot changes. However, the smooth variations of video feature values in a gradual transition produced by editing effects are often confused with those caused by camera or object motions. A feasible way to overcome this problem is to remove the factors caused by camera and object motions. Motion estimation has been widely used in video processing applications, since it provides the most essential information about an image sequence. In this paper, we explore the possibility of exploiting accurate motion and illumination estimation in the video sequence to detect the various types of shot changes.

Optical flow is the motion vector computed at each pixel of an image sequence from the changes of intensities. Traditionally, optical flow algorithms are derived from the brightness constancy assumption, which leads to a linear optical flow constraint at each pixel through a first-order Taylor series approximation. The accuracy of this conventional approach degrades significantly under noise, shadows, occlusions, rapid motions, and illumination variations. Fortunately, generalized optical flow constraints have recently been proposed to account for non-uniform illumination changes. In this paper, we employ a generalized optical flow constraint that includes an illumination parameter to model local illumination changes, and we iteratively estimate the optical flow as well as the illumination variation parameter in each block. In the iterative estimation process, we refine the generalized optical flow constraint in each step by updating it with the currently estimated flow vector, which significantly reduces the Taylor approximation error. A robust measure, the likelihood ratio of corresponding motion-compensated blocks in consecutive frames, is used for detecting abrupt changes. For the detection of gradual shot transitions, we compute the average monotony of intensity variations on the stable pixels of the images in a twin-comparison framework.

The rest of this paper is organized as follows. An iterative optical flow and illumination change estimation algorithm is presented in Section 2. Section 3 describes the proposed shot change detection algorithm for detecting abrupt changes as well as gradual transitions in video sequences. Experimental results on different types of video shot changes are presented in Section 4. Finally, we conclude the paper in Section 5.

2. ACCURATE MOTION ESTIMATION

Optical flow provides the most fundamental information in video analysis systems, including some shot change detection systems. Most motion-based shot change detection approaches use the displacement vectors encoded in compressed videos. However, accurate motion vectors are usually not directly available from video in the compressed domain, because the accuracy of the encoded motion estimation depends on the performance of the video codec. Instead of using the encoded vectors, we compute the optical flow vectors from the video data in the uncompressed domain.

For the sake of efficiency, we first partition an image into smaller n-by-n blocks, which can be either overlapping or non-overlapping, and compute the optical flow vectors only at the centers of these blocks. If the block size is not too large, the motion vector at the center is a sufficient approximation of the motion of all pixels in the block. To account for brightness variations between frames, we use the generalized optical flow constraint under non-uniform illumination changes proposed by Zhang et al. 4-5, which can be written as

$$\frac{\partial I_0(x,y)}{\partial x}\,u + \frac{\partial I_0(x,y)}{\partial y}\,v + I_0(x,y)\cdot w + I_0(x,y) - I_1(x,y) = 0 \qquad (1)$$

where w is a constant that compensates for the intensity variation between two corresponding points in consecutive frames.
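Equation (1) can be read as the first-order linearization of a locally scaled brightness model. The following short derivation is our reconstruction from the form of (1) and (3), not quoted from references 4-5. Assume the model

$$I_1(x,y) = (1+w)\,I_0(x+u,\,y+v).$$

Expanding $I_0$ to first order,

$$I_0(x+u,\,y+v) \approx I_0(x,y) + \frac{\partial I_0(x,y)}{\partial x}\,u + \frac{\partial I_0(x,y)}{\partial y}\,v,$$

and dropping the second-order term $w\left(\frac{\partial I_0}{\partial x}u + \frac{\partial I_0}{\partial y}v\right)$ recovers Eq. (1).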

Following Lucas and Kanade's optical flow computation approach 6, we assume the three unknowns u, v, and w to be constant in a local window. We then combine all the generalized optical flow constraints in this window and estimate the three unknowns by the linear least-squares method, minimizing the energy function

$$E(u,v,w) = \sum_{(x,y)\in W_{i,j}} \left[\frac{\partial I_0(x,y)}{\partial x}\,u + \frac{\partial I_0(x,y)}{\partial y}\,v + I_0(x,y)\cdot w + I_0(x,y) - I_1(x,y)\right]^2 \qquad (2)$$

where Wi,j is the local neighborhood window centered at location (i, j). The minimization of this quadratic energy function leads to the least-squares solution for the optical flow vector (u, v) and the illumination factor w, and it is accomplished by solving the linear system given below.
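In our reconstruction, the linear system in question is the set of normal equations obtained by setting the gradient of (2) to zero. Writing $g_x = \partial I_0/\partial x$, $g_y = \partial I_0/\partial y$, and $\Delta I = I_1 - I_0$, with all sums taken over $(x,y) \in W_{i,j}$:

$$
\begin{pmatrix}
\sum g_x^2 & \sum g_x g_y & \sum g_x I_0 \\
\sum g_x g_y & \sum g_y^2 & \sum g_y I_0 \\
\sum g_x I_0 & \sum g_y I_0 & \sum I_0^2
\end{pmatrix}
\begin{pmatrix} u \\ v \\ w \end{pmatrix}
=
\begin{pmatrix}
\sum g_x\,\Delta I \\
\sum g_y\,\Delta I \\
\sum I_0\,\Delta I
\end{pmatrix}
$$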

To alleviate the Taylor approximation error in the above optical flow constraint equation, we have developed an iterative least-squares optical flow estimation algorithm that refines the generalized optical flow constraints step by step. The main idea of this iterative refinement is to move the block in the image by the newly estimated motion vector and to recompute the flow constraints recursively. The updated optical flow constraint is

$$\frac{\partial I_0(x+u^{(k)},\,y+v^{(k)})}{\partial x}\,\Delta u^{(k)} + \frac{\partial I_0(x+u^{(k)},\,y+v^{(k)})}{\partial y}\,\Delta v^{(k)} + \left(1+w^{(k)}\right) I_0(x+u^{(k)},\,y+v^{(k)}) - I_1(x,y) = 0 \qquad (3)$$

Thus, the residual optical flow vector (Δu^(k), Δv^(k)) and the illumination factor w^(k) are computed by the least-squares estimation procedure above, applied to the updated flow constraint. For the details, please refer to our previous paper 7.
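As a concrete illustration, the following minimal NumPy sketch implements one reading of the iterative estimation above for a single block. The function name, block size, iteration count, and the integer-pixel warp are our choices, not taken from the paper; a production implementation would use bilinear interpolation and a convergence test.

```python
import numpy as np

def estimate_block_flow(I0, I1, cx, cy, n=16, iters=5):
    """Iteratively estimate motion (u, v) and illumination factor w for
    the n-by-n block of I1 centered at (cx, cy), following Eq. (3).
    I0, I1: float grayscale frames; the block is assumed to be
    well-conditioned (textured) and away from the image border."""
    Iy, Ix = np.gradient(I0)                      # spatial gradients of I0
    half = n // 2
    ys, xs = np.mgrid[cy - half:cy + half, cx - half:cx + half]
    i1 = I1[ys, xs].ravel()                       # fixed block in frame I1
    u = v = w = 0.0
    for _ in range(iters):
        # Warp the block of I0 by the current motion estimate (rounded
        # to integer pixels in this sketch).
        yk = np.clip(np.rint(ys + v).astype(int), 0, I0.shape[0] - 1)
        xk = np.clip(np.rint(xs + u).astype(int), 0, I0.shape[1] - 1)
        gx = Ix[yk, xk].ravel()
        gy = Iy[yk, xk].ravel()
        i0 = I0[yk, xk].ravel()
        # Eq. (3) rearranged: gx*du + gy*dv + i0*w = i1 - i0,
        # solved in the least-squares sense over the block.
        A = np.stack([gx, gy, i0], axis=1)
        b = i1 - i0
        (du, dv, w), *_ = np.linalg.lstsq(A, b, rcond=None)
        u, v = u + du, v + dv
    return u, v, w
```

Note that, as in Eq. (3), only the flow increments are accumulated across iterations, while the illumination factor w is re-estimated in full at each step.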

3. VIDEO SHOT CHANGE DETECTION

Because our shot change detection algorithm is based on motion estimation, a major advantage of this approach is the elimination of the factors due to camera motion (such as translation, rotation, and zooming) as well as object motion. The use of motion estimation makes it possible to calculate the variation between an object (a block or a pixel) in one image and its corresponding object in the previous or following frame. Our video segmentation system detects both abrupt shot changes and gradual scene transitions, including fades and dissolves. Because these two types of shot boundaries have different characteristics, we pick a specific measure for each of them. For abrupt changes, we use the proportion of unmatched blocks in the image. For gradual transitions, we estimate the accumulated frequency of monotonically increasing or decreasing intensity variation on low-motion pixels. The next two subsections describe the two measures separately.

3.1 ABRUPT SHOT CHANGE DETECTION

When two shots are separated by an abrupt change, no object in the previous shot persists into the following shot, either spatially or temporally. When optical flow is computed on two frames across an abrupt shot change, the resulting flow vectors merely associate the most similar blocks in the neighborhood and do not represent object motions. Therefore, the similarity of most pairs of corresponding blocks is much lower than that of corresponding blocks within the same shot. General video, especially movies, usually contains complicated multiple motions. For this reason, as well as the inherent limitations of optical flow computation, a block can hardly be matched perfectly in a video sequence. Although we suppress the effect of camera and object motions with the motion estimation procedure described in Section 2, it is still not appropriate to evaluate the similarity between two blocks with simple pixel differencing. That is to say, we need a more flexible and reliable measure of the difference between two blocks. In this paper, we adopt the likelihood ratio test 8 to compare two corresponding blocks k1 and k2 at frames i and i+1, respectively, with the likelihood ratio

$$\lambda_k = \frac{\left[\dfrac{\sigma_{k,i}+\sigma_{k,i+1}}{2} + \left(\dfrac{\mu_{k,i}-\mu_{k,i+1}}{2}\right)^2\right]^2}{\sigma_{k,i}\cdot\sigma_{k,i+1}}$$

where μk,i and μk,i+1 are the mean intensities, and σk,i and σk,i+1 are the intensity variances, of the two corresponding blocks k1 and k2 in the consecutive frames i and i+1. The likelihood ratio λk is a measure of statistical difference: its value is close to 1 if the two blocks have similar intensity distributions, and it is insensitive to slight shifts and rotations of the pixels in a block. With this measure, we employ conventional block-based comparison 1 to quantify the difference between two frames and detect abrupt shot changes. Our detection approach is described as follows (a sketch of the core comparison appears after the list):

1. Read two consecutive frames Ii-1 and Ii.
2. Down-sample and smooth the two images.
3. Divide the images into non-overlapping n-by-n blocks, where n is a predefined block size.
4. For each block in Ii, first check whether it is sufficiently textured; namely, whether the sum of absolute gradient magnitudes in the block is greater than a threshold Tg. Blocks that fail this condition are termed ill-conditioned.
5. For the well-conditioned blocks, apply the iterative optical flow computation described in Section 2 to obtain their motion vectors, and then register every pair of corresponding blocks in the two consecutive frames.
6. For each registered block k in Ii, compute the likelihood ratio λk between this block and its corresponding block in frame Ii-1.
7. Measure the difference between Ii-1 and Ii by

$$D(i-1,i) = \frac{1}{N}\sum_{k=1}^{N} DP(i-1,i,k), \qquad DP(i-1,i,k) = \begin{cases} 1 & \text{if } \lambda_k > T_\lambda \\ 0 & \text{otherwise} \end{cases}$$

where N is the number of well-conditioned blocks determined in step 4.
8. Declare an abrupt shot change between Ii-1 and Ii if the difference D(i-1, i) is larger than a threshold TD.
9. Set i = i + 1 and go to step 1 until all frames are scanned.
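The per-block comparison of steps 6-7 can be sketched as follows. This is an illustrative reading rather than the authors' code; the threshold values and the `flows` data structure (mapping block corners to the motion vectors from Section 2) are our assumptions.

```python
import numpy as np

def likelihood_ratio(block_a, block_b):
    """Likelihood ratio between two blocks; close to 1 when the blocks
    have similar intensity statistics (means and variances)."""
    mu_a, mu_b = block_a.mean(), block_b.mean()
    var_a, var_b = block_a.var(), block_b.var()
    num = ((var_a + var_b) / 2.0 + ((mu_a - mu_b) / 2.0) ** 2) ** 2
    return num / max(var_a * var_b, 1e-8)        # guard for flat blocks

def frame_difference(I_prev, I_curr, flows, n=16, T_lambda=5.0):
    """Steps 6-7: fraction D(i-1, i) of well-conditioned blocks whose
    motion-compensated likelihood ratio exceeds T_lambda.  `flows` maps
    the top-left corner (y, x) of each well-conditioned block of I_curr
    to its estimated motion (u, v) toward I_prev."""
    unmatched = total = 0
    for (y, x), (u, v) in flows.items():
        b_curr = I_curr[y:y + n, x:x + n]
        yp, xp = int(round(y + v)), int(round(x + u))
        if min(yp, xp) < 0:                      # shifted outside the frame
            continue
        b_prev = I_prev[yp:yp + n, xp:xp + n]    # registered block
        if b_prev.shape != b_curr.shape:
            continue
        total += 1
        if likelihood_ratio(b_curr, b_prev) > T_lambda:
            unmatched += 1
    return unmatched / max(total, 1)
```

An abrupt change would then be declared whenever the returned value exceeds the global threshold TD (step 8); the particular threshold values here are illustrative only.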

3.2 GRADUAL SHOT TRANSITION DETECTION

Based on many previous works and our own experiments, we found that inter-frame differences such as histogram deviation and block-based comparison are not sufficient to distinguish the slight variation in a gradual shot transition from the general inter-frame differences caused by noise, motion, and newly appearing or occluded objects. Our motion compensation algorithm can suppress the disturbance of motions but cannot overcome the problems due to the other factors. If there is a dissolve transition between scene A and scene B, an object OA in scene A gradually and "continuously" transforms into another object OB in scene B during the transition. If OA has intensity value IA, OB has intensity value IB, and IA ≠ IB, the intensity value at the location of OA (OB) monotonically increases or decreases from IA to IB. A dissolve transition can therefore be detected when most objects (pixels) in a video sequence monotonically increase or decrease in intensity over a period of time. A fade-in or fade-out transition can be treated as a special case in which one of the two corresponding objects is entirely black.

To make the above approach work well, we need to know the locations of the two corresponding objects OA and OB. However, with the random variation of intensities during a dissolve or fading transition, it is very difficult to estimate these locations accurately, even with sophisticated optical flow computation. As long as we cannot completely remove the unstable factors, a good policy is to exclude unstable objects (pixels) when determining the monotony of intensity variation. Fortunately, the optical flow vectors provide not only motion information but also an indication of the stability of a block: if the magnitude of an optical flow vector is nearly zero, the associated block is very likely static during the gradual transition, which makes it simple to estimate the intensity variation of its pixels. In addition, blocks that are homogeneous in intensity, and are therefore skipped by our optical flow computation, can also be useful for detecting gradual shot changes. Since such "flat" blocks usually belong to the background or to large uniform regions of the image sequence, they also tend to be stable with respect to undesirable motion effects during dissolve transitions. We choose these two kinds of blocks, flat and static, as the candidates for estimating the degree of intensity-variation monotony.

A Stability-Map F(x, y), a two-dimensional Boolean array of the same size as the image, is maintained to indicate whether the pixel at location (x, y) is static or flat. The flat pixels are determined from the overlapping ill-conditioned blocks in consecutive frames, and the Stability-Map is updated right after the optical flow computation is completed. An additional Continuity-Map C(x, y), a two-dimensional integer array, records the cumulative frequency of intensity increase or decrease between consecutive frames. After the optical flow computation of two consecutive frames I0 and I1 is accomplished, the stable region is marked on the Stability-Map. For every pixel (x, y) in the stable region, if I1(x, y) > I0(x, y), C(x, y) is increased by 1; if I1(x, y) < I0(x, y), C(x, y) is decreased by 1. When a dissolve or fading transition occurs, the average of the absolute values of C over the stable region increases monotonically. Finally, a twin-comparison algorithm is applied to the consecutive differences of the average absolute Continuity-Map to determine a dissolve or fading transition in the sequence. The gradual transition detection process is given as follows (a sketch of the map update appears after the list):

1. The preprocessing is the same as steps 1-5 of our abrupt shot change approach.
2. For each ill-conditioned (homogeneous) block, if the likelihood ratio λ between it and its corresponding block (at the same position in the consecutive frame) is smaller than a threshold Tλ, assign TRUE to the Stability-Map F for all pixels in the block; otherwise assign FALSE.
3. For each well-conditioned block, assign TRUE to F for all pixels in the block if the length of the motion vector satisfies $\sqrt{u^2+v^2} \approx 0$; otherwise assign FALSE to F for all pixels in the block and in the corresponding block associated with (u, v).
4. For each point (x, y) in the stable region, increase the Continuity-Map C(x, y) by 1 if Ii(x, y) > Ii-1(x, y), and decrease C(x, y) by 1 if Ii(x, y) < Ii-1(x, y). When Ii(x, y) = Ii-1(x, y), C(x, y) is increased by 1 if C(x, y) < 0 and decreased by 1 otherwise.
5. Compute the variation continuity of frame i as

$$Cont_i = \frac{1}{N}\sum_{(x,y):\,F(x,y)=\mathrm{TRUE}} \left|C(x,y)\right|$$

where N is the number of stable pixels.
6. Set i = i + 1 and go to step 1 until all frames are scanned.
7. For $i \in [\mathrm{first}+1,\,\mathrm{last}]$, apply twin-comparison to the values Cont_i − Cont_{i−1}; the periods that pass the twin-comparison are marked as candidates.
8. For each candidate period, apply our abrupt change detection (described in the previous subsection) to the pair of images at the two boundaries of the period. If the two boundary frames form an abrupt change, a gradual transition is declared during the period.
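A minimal sketch of the map update in steps 4-5 follows, assuming frame-sized NumPy arrays. The absolute-value average is our reading of the "average absolute Continuity-Map" phrase above; the decay-toward-zero rule on unchanged pixels follows the text of step 4.

```python
import numpy as np

def update_continuity(C, stable, I_prev, I_curr):
    """Steps 4-5: update the Continuity-Map C in place over the stable
    region and return Cont_i, the variation continuity of the frame.
    C: int array; stable: Boolean Stability-Map; all frame-sized."""
    inc = stable & (I_curr > I_prev)
    dec = stable & (I_curr < I_prev)
    eq = stable & (I_curr == I_prev)
    toward_zero_up = eq & (C < 0)       # computed before C is modified
    C[inc] += 1
    C[dec] -= 1
    C[toward_zero_up] += 1              # unchanged pixel, C < 0: step up
    C[eq & ~toward_zero_up] -= 1        # unchanged pixel, C >= 0: step down
    n_stable = int(stable.sum())
    return np.abs(C[stable]).sum() / max(n_stable, 1)
```

During a dissolve, successive calls yield a steadily growing Cont_i, so the differences Cont_i − Cont_{i−1} stay positive over the transition; the twin-comparison of step 7 is applied to that difference sequence.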

4. EXPERIMENTAL RESULTS

For shot change detection, two metrics are commonly used for performance assessment: the recall and precision rates, defined as

$$\mathrm{Recall} = \frac{N_c}{N_c + N_m}, \qquad \mathrm{Precision} = \frac{N_c}{N_c + N_f}$$

where Nc denotes the total number of correct detections, Nm the total number of misses, and Nf the total number of false detections. Thus, Nc + Nm is the total number of real shot changes, and Nc + Nf is the total number of detected shot changes. To evaluate the performance of a shot change detection algorithm, the two rates need to be considered simultaneously.

For objectivity, we tested our system on the benchmark adopted by the tenth Text REtrieval Conference (TREC), held at the National Institute of Standards and Technology (NIST) in the United States, Nov. 13-16, 2001 9. The task was performed on a 5-hour video dataset containing MPEG-1 video sequences from "NIST Digital Video Collection, Volume 1" (http://www.nist.gov/srd/nistsd26.htm), from the Open Video Project (http://www.open-video.org/), and from the BBC. The shot detection results submitted by the participating systems were evaluated by automatic comparison against a set of reference shot boundaries created manually by NIST. In this paper, we adopt 9 undamaged video sequences from the Open Video Project, which is a public data set. The total number of shot boundaries that we counted differs slightly from that provided by NIST.

Tables 1 and 2 show the recall and precision for each of the 9 test sequences. Our system runs on a 1000 MHz x86 CPU with 256 MB of RAM. Excluding decoding time, the processing time is 0.048 seconds per frame on average, which means that scanning a sequence entirely for both abrupt and gradual changes takes about 1.43 times its playback duration.

Table 1. Accuracy of our system for detecting abrupt shot changes.

Filename          Recall    Precision
Anni005.mpg       1.0000    0.7358
Anni009.mpg       1.0000    0.6552
Bor03.mpg         0.9610    0.9487
Bor08.mpg         0.9730    0.9010
Nad31.mpg         0.9005    0.9451
Nad33.mpg         0.9842    0.9639
Nad53.mpg         0.9518    0.9405
Nad57.mpg         0.9583    0.9388
Senses111.mpg     0.9966    0.9966
Total             0.9678    0.9297

The accuracy of our system for detecting abrupt changes is listed in Table 1, and the performance of our gradual transition detection is summarized in Table 2. According to the TREC 2001 test results reported on the NIST web page at http://www-nlpir.nist.gov/projects/trecvid/results.html, the systems of Fudan University and the IBM Almaden Research Center produced the best overall performance for abrupt change detection when both precision and recall rates are considered. For gradual boundary detection, the three best-performing systems are from the IBM Almaden Research Center, Microsoft Research China, and the University of Amsterdam with TNO; they achieve an average precision/recall rate of roughly 70%. Comparing the experimental results on this subset of video sequences, the proposed algorithm is very competitive with the best existing video shot detection algorithms.

Table 2. Accuracy of our system for detecting gradual transitions.

Filename          Recall    Precision
Anni005.mpg       0.7857    0.8800
Anni009.mpg       0.8475    0.9259
Bor03.mpg         0.9286    0.3824
Bor08.mpg         0.7162    0.9298
Nad31.mpg         0.7077    0.6866
Nad33.mpg         0.9429    0.8049
Nad53.mpg         0.8256    0.9467
Nad57.mpg         0.8519    0.8846
Senses111.mpg     0.4118    0.3500
Total             0.7745    0.8136

5. CONCLUSION

In this paper, we introduced a new shot change detection approach based on accurate motion estimation. We employed a generalized optical flow constraint that tolerates small illumination variations for accurate image motion computation. The iterative refinement process, together with an extended range of initial guesses, makes the Lucas-Kanade method more robust to rapid motion. Our motion compensation process addresses a hard problem for conventional shot change detection methods: camera and object motion. When the disturbance of camera and object motion is suppressed, the variation of a specific object in a video sequence can easily be computed. Thus, an abrupt change is detected when most blocks in a frame cannot be tracked, and a gradual transition is found by analyzing the variation continuity of low-motion pixels.

There are two advantages in our shot change verification scheme. First, the two measures, one for abrupt changes and the other for gradual transitions, do not interfere with each other; this independence makes data analysis much simpler. Second, as the hard-to-control factors are suppressed, the magnitude of the normalized difference measure is very small within a common scene but very large at a shot boundary, which makes it much easier to select a robust global threshold for verification. These two characteristics raise both the recall and the precision of our approach.

ACKNOWLEDGMENTS This research was jointly supported by the Program for Promoting Academic Excellence of Universities (89-E-FA04-1-4), and the Chinese Multimedia Information Extraction project funded by Academia Sinica, Taiwan.

REFERENCES

1. Q. Tian and H. J. Zhang, "Video shot detection and analysis: content-based approaches," in C. W. Chen and Y. Q. Zhang (Eds.), Visual Information Representation, Communication, and Image Processing, pp. 227-253.
2. I. Koprinska and S. Carrato, "Temporal video segmentation: a survey," Signal Processing: Image Communication, 16, pp. 477-500, 2001.
3. U. Gargi, R. Kasturi, and S. H. Strayer, "Performance characterization of video-shot-change detection methods," IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 1, February 2000.
4. L. Zhang, T. Sakurai, and H. Miike, "Detection of motion fields under spatio-temporal non-uniform illumination," Image and Vision Computing, 17, pp. 309-320, 1999.
5. A. Nomura, "Spatio-temporal optimization method for determining motion vector fields under non-stationary illumination," Image and Vision Computing, 18, pp. 939-950, 2000.
6. B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," Proc. IJCAI 1981, pp. 674-679, 1981.
7. S.-H. Lai and W.-K. Li, "New video shot change detection algorithm based on accurate motion and illumination estimation," Proc. SPIE: Storage and Retrieval for Media Databases, vol. 4676, pp. 148-157, 2002.
8. R. Kasturi and R. Jain, "Dynamic vision," in R. Kasturi and R. Jain (Eds.), Computer Vision: Principles, IEEE Computer Society Press, Washington, DC, 1991, pp. 469-480.
9. E. M. Voorhees and D. Harman, "Overview of TREC 2001," Proceedings of the Tenth Text REtrieval Conference, 2001.
