Dissolve Detection in MPEG Compressed Video

Lifang Gu, Ken Tsui and David Keightley
CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, Australia

Abstract - A dissolve is a gradual transition from one shot to another in a video, resulting from gradually scaling the intensity values of the two shots. Up to now, there have been very few algorithms for detecting dissolves. By analysing the characteristics exhibited by a typical dissolve, we propose two methods for detecting dissolves directly in MPEG compressed video. The first method is based on the characteristic that the average intensity of each frame changes linearly with time during a dissolve. Dissolves are detected by finding rectangular shapes on the difference curve of the average frame intensity values, which can be obtained by averaging the DC values of the luminance blocks of each frame in an MPEG stream. The second method is based on the characteristic that each pixel changes its intensity value linearly and gradually during a dissolve. As a result, most of the pixel difference values between two consecutive frames will be moderate and fall within a range. This range is determined by the duration of a typical dissolve and the maximal intensity difference allowed. If the number of pixels with difference values falling within such a range is large over a period of time, a dissolve can be declared. For an MPEG stream, we define a measure D as the percentage of blocks with DC difference values falling within a specified range. Dissolves are thus detected by finding periods with consistently large values of D. Both methods have been tested on a range of MPEG video sequences and the results show that most dissolves can be correctly detected. Since the proposed methods operate directly on compressed data, they are much more efficient than some pixel domain methods.

I. INTRODUCTION

Recently the demand for digital video has increased dramatically due to the emergence of new services such as video on demand (VOD) and digital libraries. Such services require not only efficient transmission and storage of video data, but also representation of the content in a way that facilitates efficient searching, browsing and navigation. Digital video segmentation, the process of decomposing a video into meaningful and coherent component units according to some criteria, is the first step towards realising such a representation. A video can be organised hierarchically into segments, scenes and shots, just as a text document can be organised into paragraphs, sentences, phrases and words. A shot is an image sequence taken by a single camera, which may be fixed or undergoing continuous motion such as panning, zooming, tilting and tracking. It is the smallest unit in this hierarchical structure. Video segmentation in this paper therefore refers to the process of segmenting a video into temporal "shots", each representing an event or continuous sequence of actions. Further scene

analysis and interpretation can then be performed on each of these shots to extract key information for indexing the shot.

Shots are connected together by the editing process. There are usually two types of shot transitions in a video: abrupt and gradual. Abrupt transitions occur when two individual shots are simply pasted together. Gradual transitions are the result of applying special editing techniques, such as fade, dissolve and wipe, to connect two shots smoothly. At an abrupt shot transition, the difference in gray level as well as color information between two consecutive frames is usually large due to the content dissimilarity of the two shots. Therefore, many of the early methods use algorithms based on difference metrics, such as pixel intensity value difference and histogram difference [2,5,8], to detect abrupt transitions. One problem with these difference based methods is that they are sensitive to busy scenes, in which intensities change substantially from frame to frame due to camera/object motion. Although some of these false positives can be removed by detecting certain types of camera motion, these methods still have difficulty with scenes containing substantial object motion. These methods also operate on raw pixel data, which must be recovered from the compressed data by time-consuming decompression.

Since most digital videos are stored in a compressed format such as MPEG, several algorithms for detecting abrupt shot boundaries directly in the compressed domain have recently emerged [1,4,7,11]. These methods use information directly available from an MPEG stream, such as DCT coefficients, motion vectors and bit-rates, as dissimilarity measures. Since decompression is not necessary, these compressed domain methods are computationally more efficient. In addition, some of these measures are more reliable because they capture content dissimilarity better than simple difference measures.
However, none of the above methods is suitable for detecting a gradual transition, where the change between two consecutive frames is small. Different editing techniques result in different types of gradual transitions, which, in turn, have different characteristics [8]. A fade is a gradual transition between a shot and a constant image (fade out) or between a constant image and a shot (fade in). During a fade, images in a shot have their intensities multiplied by some value α. During a fade in, α increases from 0 to 1, while during a fade out α decreases from 1 to 0. A dissolve is a gradual transition from one shot to another, in which the first shot fades out and the second shot fades in. Typically, the fade in and fade out begin at the same time.

Although dissolves are the most common gradual transitions in conventional movies and TV programs, very little work has been reported on how to detect them. Zhang et al. [10] use a dual threshold on the intensity histogram difference to detect gradual transitions. Since only accumulated frame differences are considered, their method produces false positives on busy scenes. Moreover, histogram calculation is computationally expensive. Hampapur et al. [5] propose a method based on explicit models of the video production process. They calculate a chromatic image from a pair of consecutive images; its value at each pixel is the change in intensity between the two images divided by the intensity in the later image. Ideally, the chromatic image should be uniform and non-zero during a fade. However, such a chromatic image is extremely sensitive to noise and to the intensity changes caused by camera motion and object motion in a scene. In addition, the models used are not suitable for dissolves. Zabih et al. [9] describe a feature-based algorithm for detecting and classifying shot transitions. Their algorithm first detects intensity edges in each image and then calculates a dissimilarity measure based on the edge change fraction between two consecutive images. A shot transition is declared if a high value of this measure is detected. However, edge detection is quite an expensive operation and this method is thus too slow for real applications. All of the above three methods use raw pixel data and can only be applied to a compressed video after computationally expensive decompression. Meng et al. [7] propose an algorithm for detecting dissolves directly in MPEG compressed video. Their algorithm detects dissolves by finding parabolic shapes on the intensity variance curve. This method fails to detect dissolves when one shot has a much larger variance than the other.
In summary, most existing algorithms for dissolve detection are either very slow or perform well only on ideal video data, and are therefore not suitable for processing large collections of video footage. To overcome some of these problems, in this paper we propose two algorithms for detecting dissolves directly in the MPEG compressed domain. In the following, we first give the mathematical definition of a dissolve and a brief introduction to the MPEG compression standard. We then describe implementation details of the two algorithms based on dissolve characteristics. Some experimental results are finally given.

II. DISSOLVE DEFINITION

As pointed out by Hampapur et al. [5], the problem of video segmentation can be posed either as shot boundary detection, i.e. locating the points where each shot starts and ends, or edit detection, i.e. locating the points where the editing period starts and ends. We take the edit detection approach, which requires dissolve models.

As mentioned in the introduction, a dissolve is a gradual transition from one shot to another, which results from the simultaneous application of a fade out to the first shot and a fade in to the second shot. Frames created by the editing process are here referred to as editing frames. Let G(x, y, t) represent the intensity function of the editing frames created by dissolving two shots S1 and S2, and let g1(x, y, t) and g2(x, y, t) be the intensity functions of the two shots respectively. From the above dissolve definition, we have the following equation:

G(x, y, t) = g1(x, y, t)[1 − α(t)] + g2(x, y, t)α(t)    (1)

where α(t) = (t − ts)/(te − ts) increases from 0 to 1 during the dissolve, and ts and te are the start and end times of the dissolve respectively. It can be seen from (1) that a fade can be regarded as a special dissolve, with g1(x, y, t) constant for a fade in and g2(x, y, t) constant for a fade out.

III. MPEG COMPRESSION STANDARD

Before discussing the actual dissolve detection algorithms, we briefly describe the MPEG compression standard [6]. MPEG (Moving Picture Experts Group) is an international standard (ISO/IEC 11172) for digital video applications. In order to achieve a high compression ratio, the MPEG compression algorithm uses a suite of techniques to reduce both spatial and temporal redundancy in a video sequence. There are three basic picture types in an MPEG stream: I-, P- and B-pictures. An I-picture is completely intra-coded, i.e. only information from the current picture is used. Each picture is divided into 16 × 16 macroblocks, and each macroblock is composed of four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks. Each block is DCT transformed and the coefficients are then quantised and entropy encoded. A P-picture is predictively coded with motion compensation from a past I- or P-picture. Each macroblock in the current picture is matched to the most similar macroblock in the reference picture, and the displacement between the two macroblocks is stored as a motion vector. The residual pixel difference after motion compensation is then DCT transformed, quantised and entropy encoded just like an I-picture block. If no match is found for a macroblock within the specified search area, it is intra-coded. A B-picture is similar to a P-picture except that it can use a future reference picture as well as a past reference picture.

IV. DISSOLVE DETECTION FROM GLOBAL INTENSITY INFORMATION

Let the average intensity values of the two shots be g1 and g2, which should be quite stable for most shots.
From (1), the average intensity g(t) of an editing frame in the dissolve period can then be written as

g(t) = g1 + (g2 − g1)α(t)    (2)

As a result, the average intensity during a dissolve changes linearly with time. If shot S2 is brighter than shot S1, i.e. g2 > g1, the average intensity g(t) increases linearly from the average intensity of shot S1, g1, to that of shot S2, g2; otherwise it decreases linearly from g1 to g2 (see Fig. 1). The change rate dg/dt of the average intensity during a dissolve can be written as

dg/dt = (g2 − g1)/(te − ts)    (3)

This rate thus depends on the dissolve duration, te − ts, and the average intensity difference of the two shots, g2 − g1, and remains constant. In other words, the average intensity difference between two consecutive editing frames is constant during a dissolve. Fig. 2 shows the graph of the average intensity difference for some shot transitions in an ideal case. An abrupt shot transition corresponds to a strong pulse on the difference curve, while a dissolve appears as a rectangular shape. The width of the rectangular shape is the dissolve duration, while its height is determined by the change rate dg/dt. Consequently, dissolve periods can be detected by finding rectangular shapes on the average intensity difference curve.

[Fig. 1 Temporal change of the average intensity values during a dissolve: g(t) moves linearly from g1 to g2 between ts and te]

Since the DCT DC value of a luminance block represents the average intensity of the block, the frame average intensity value can be easily obtained by averaging the DC values of all the blocks in a frame. For an I-picture, DC values of luminance blocks are directly available from the MPEG compressed data. For P- and B-pictures, not every macroblock is intra-coded and therefore DC values of some blocks are not available. Fortunately, we can approximately reconstruct the DCT DC value of each block in a P-picture by using the method suggested in [7]. This method calculates the DC value of a predictively coded block from its residue DC value and the area-weighted average of the DC values of the reference blocks pointed to by its motion vector. B-pictures are not used because of the extra time needed for reconstructing DC values. The number of I- and P-pictures is usually sufficient for dissolve detection, since two consecutive I-/P-pictures are 1 to 3 frames apart in a typical MPEG stream and a dissolve lasts for 10 to 60 frames in a typical video.

After the DC values of all the luminance blocks in I- and P-pictures become available, the average DC value of each frame, and then the difference between consecutive I-/P-frames, are calculated. If the difference values are above a preset threshold and consistently positive or negative over a period that lies within the range of typical dissolve durations, a dissolve is declared. This latter constraint ensures that only frame sequences with monotonic changes in average intensity over a certain period are detected as dissolves. Intensity changes resulting from object motion or other random sources within a shot will not be detected, since they tend to be irregular.

[Fig. 2 Average intensity difference values: an abrupt shot transition appears as a strong pulse, a dissolve as a rectangular shape]

V. DISSOLVE DETECTION FROM LOCAL INTENSITY INFORMATION

Although the above method can detect most dissolve periods, it has difficulty when the two shots being connected have similar average intensity values. In this case, the difference value between two editing frames in the dissolve period will be small and difficult to differentiate from noise. However, if we look at the local information, individual pixels of the two shots differ because the two shots have different content. In fact, every pixel changes its intensity linearly and gradually from its brightness in shot S1 to its brightness in shot S2 during a dissolve. From (1), the absolute pixel intensity difference ∆g(x, y) between any two editing frames during a dissolve can be written as follows:

∆g(x, y) = |g2(x, y) − g1(x, y)| ⋅ ∆t/(te − ts)    (4)

where g1(x, y) and g2(x, y) are the pixel intensity values of the two shots at location (x, y) respectively, and ∆t is the frame interval. Since the intensity difference between two pixels has an upper limit, usually 255, and a typical dissolve lasts for 10 to 60 frames, i.e. te − ts ∈ [10, 60], the difference value ∆g(x, y) theoretically falls within the range [0, 25.5 × ∆t] and remains constant during the whole dissolve period. In other words, most pixels will have difference values within this range during a dissolve.
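As a sanity check, the behaviour predicted by (2)–(4) can be reproduced on a synthetic dissolve built directly from model (1). The following NumPy sketch uses invented frame sizes, durations and intensity values purely for illustration:

```python
import numpy as np

def dissolve(g1, g2, n_frames):
    """Editing frames per model (1): G = g1*(1 - alpha) + g2*alpha."""
    alphas = np.linspace(0.0, 1.0, n_frames)
    return [g1 * (1.0 - a) + g2 * a for a in alphas]

rng = np.random.default_rng(0)
g1 = rng.uniform(0, 255, size=(32, 32))   # last frame of shot S1
g2 = rng.uniform(0, 255, size=(32, 32))   # first frame of shot S2
frames = dissolve(g1, g2, n_frames=30)

# (2)-(3): the average intensity changes linearly with time, so the
# consecutive frame-average differences are constant during the dissolve.
means = np.array([f.mean() for f in frames])
diffs = np.diff(means)
assert np.allclose(diffs, diffs[0])

# (4): each pixel difference equals |g2 - g1| * dt/(te - ts), i.e. it is
# constant per pixel and bounded by 255/(n_frames - 1).
pix = np.abs(frames[1] - frames[0])
assert np.allclose(pix, np.abs(g2 - g1) / (30 - 1))
assert pix.max() <= 255.0 / (30 - 1) + 1e-9
```

Both assertions hold for any pair of source frames, which is exactly why the rectangular difference curve and the bounded per-pixel differences are usable as detection signatures.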

However, pixel intensity values are quite noisy in the images of a real video, and thus their difference values between two images have a very low signal-to-noise ratio (SNR). Applying a smoothing filter to an intensity image will certainly improve its SNR. Since the DCT DC value of an 8 × 8 luminance block in an MPEG stream represents the average intensity of the block, it is one such smoothed representation of the intensity values. Therefore, if we replace the pixel intensity values with the DC values of the luminance blocks, more robust results can be expected. More importantly, since DCT DC values are directly available from compressed data, computationally expensive decompression is not necessary. This substantially improves the efficiency of the algorithm.

For an MPEG compressed video, we first reconstruct the DC values of each luminance block in P-pictures as before, so that they are available for all I- and P-pictures. We then define and calculate the measure D_{t,t+∆t}, the percentage of the blocks with DC difference values falling within the specified range [∆g_min, ∆g_max], as follows:

D_{t,t+∆t} = (100/N) ⋅ Σ_{x=1}^{X} Σ_{y=1}^{Y} T_{t,t+∆t}(x, y)    (5)

where N = X ⋅ Y is the total number of blocks in a frame and T_{t,t+∆t}(x, y) is defined as below:

T_{t,t+∆t}(x, y) = 1 if ∆g_min ≤ ∆g(x, y) ≤ ∆g_max, and 0 otherwise    (6)

To reduce noise, the lower limit ∆g_min of the specified range is usually set to a small value above zero. From the dissolve definition, it can be concluded that the above percentage value D will be constant and large during a dissolve period, and small during normal scene activity or other shot transitions. As a result, dissolves can be detected by finding periods with block percentage values consistently larger than a given threshold.

VI. EXPERIMENTAL RESULTS

The above two algorithms have been implemented on top of a general MPEG decoding library. B-pictures are not used and are therefore skipped during the parsing phase of an MPEG stream. For each P-picture, the DC value of each non-intra block is reconstructed using the approach mentioned above. Both algorithms have been tested on several MPEG video sequences.

Fig. 3 shows the average frame DC values and the corresponding difference values between two consecutive I-/P-frames for an MPEG video sequence containing three dissolves. It can be seen that the average DC values change linearly with time during the dissolve periods, as expected. The three peak periods on the difference curve correspond to the three dissolves, although the magnitude of the second peak is small since the difference of the average DC values of the two shots is small. Nevertheless, the algorithm using block DC differences can detect this dissolve more easily, since the local content dissimilarity between the two shots is picked up by the block DC difference values. Fig. 4 shows the curve of the block percentage value D for the same video sequence. The percentage values of the three dissolve periods are clearly high and remain relatively constant.

[Fig. 3 Average frame DC values and their differences for a real video sequence (video sequence 'spacewalk'; average DC value and average DC difference vs frame number, frames 0-300)]

[Fig. 4 Block percentage values of a real video containing three dissolves (percentage of blocks with moderate DC difference vs frame number, video sequence 'spacewalk')]
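For concreteness, measure (5)–(6) and the threshold test over a period can be sketched as below. This is a simplified Python/NumPy illustration on synthetic block-DC frames; the range limits, threshold and minimum duration are placeholder values, not the settings used in our experiments:

```python
import numpy as np

def block_percentage(dc_prev, dc_curr, dg_min=2.0, dg_max=26.0):
    """Measure D in (5)-(6): percentage of luminance blocks whose
    absolute DC difference falls within [dg_min, dg_max]."""
    diff = np.abs(dc_curr - dc_prev)
    within = (diff >= dg_min) & (diff <= dg_max)
    return 100.0 * within.sum() / diff.size

def detect_dissolves(dc_frames, d_thresh=40.0, min_len=5):
    """Report index ranges where D stays above d_thresh for at least
    min_len consecutive I-/P-frame pairs."""
    d = [block_percentage(a, b) for a, b in zip(dc_frames, dc_frames[1:])]
    periods, start = [], None
    for i, v in enumerate(d):
        if v > d_thresh and start is None:
            start = i
        elif v <= d_thresh and start is not None:
            if i - start >= min_len:
                periods.append((start, i))
            start = None
    if start is not None and len(d) - start >= min_len:
        periods.append((start, len(d)))
    return periods

# Synthetic block-DC frames: two static shots joined by a 10-step dissolve.
rng = np.random.default_rng(1)
s1 = rng.uniform(0, 255, size=(18, 22))   # 18 x 22 blocks per frame
s2 = rng.uniform(0, 255, size=(18, 22))
alphas = np.linspace(0.0, 1.0, 11)
dcs = [s1] * 10 + [s1 * (1 - a) + s2 * a for a in alphas] + [s2] * 10
print(detect_dissolves(dcs))   # reports the frame-pair range of the dissolve
```

Within the static runs D is zero (the lower limit ∆g_min filters out zero differences), while during the dissolve every step produces the same moderate per-block difference, so D jumps to a high, constant value for exactly the dissolve duration.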

Since fades are special cases of dissolves, they have the same characteristics as dissolves and can also be detected by the above two algorithms. Zabih et al. [9] compare the results of several pixel domain dissolve detection algorithms on a video sequence which contains two dissolves and has scenes with considerable object motion, even during the dissolve periods. They conclude that only their feature-based method can reliably detect the two dissolves. Fig. 5 shows our block percentage curve for this video sequence. The two periods with high block percentage values in Fig. 5 clearly correspond to the two dissolve periods. This demonstrates that our algorithm performs reliably and is able to detect dissolves containing substantial object motion. Since our method operates directly on the compressed data, it is much more efficient than the feature-based method. The performance and robustness of our algorithm stem from the fact that it is based on dissolve characteristics and uses the more reliable block DC values instead of individual pixel intensity values.
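The efficiency of both methods rests on obtaining block DC values without full decompression, which requires the approximate DC reconstruction for predictively coded blocks described earlier (the area-weighted scheme suggested in [7]). The sketch below illustrates that step in isolation; the block size, motion vector and DC values are invented numbers for illustration, not data from our test sequences:

```python
def reconstruct_dc(residue_dc, ref_dc, mv_x, mv_y, block=8):
    """Approximate DC of a predictively coded block: its residue DC plus
    the area-weighted average of the DC values of the (up to four)
    reference blocks that its motion-compensated area overlaps.
    ref_dc is a dict {(block_row, block_col): dc_value}."""
    bx, by = mv_x // block, mv_y // block      # top-left overlapped block
    ox, oy = mv_x % block, mv_y % block        # pixel offset inside it
    total, weighted = 0, 0.0
    for dr, wy in ((0, block - oy), (1, oy)):
        for dc_col, wx in ((0, block - ox), (1, ox)):
            w = wx * wy                        # overlap area in pixels
            if w:
                weighted += w * ref_dc[(by + dr, bx + dc_col)]
                total += w
    return residue_dc + weighted / total

# A motion vector of (4, 4) overlaps four reference blocks equally,
# so the prediction is simply their plain average.
ref = {(0, 0): 100.0, (0, 1): 120.0, (1, 0): 140.0, (1, 1): 160.0}
print(reconstruct_dc(residue_dc=3.0, ref_dc=ref, mv_x=4, mv_y=4))  # 133.0
```

When the motion vector is block-aligned, the weighted sum collapses to the single reference block's DC, so intra-like cases incur no extra cost.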

[Fig. 5 Block percentage values of a video sequence containing considerable object motion (percentage of blocks with moderate DC difference vs frame number, video sequence 'Clapton', frames 0-160)]

VII. CONCLUSIONS

Two algorithms for detecting dissolves directly in MPEG compressed video sequences have been proposed in this paper. Compared with existing methods, our algorithms are much more efficient since they operate directly on compressed data. The proposed algorithms are also able to reliably detect dissolves with considerable object motion because they are based on dissolve characteristics. Experimental results show that the proposed methods are effective in detecting dissolves as well as fades directly from MPEG streams.

VIII. REFERENCES

[1] F. Arman, A. Hsu and M.Y. Chiu, “Image processing on compressed data for large video databases”, in Proceedings of the 2nd ACM International Conference on Multimedia, Anaheim, California, August 1993, pp. 267-272.

[2] J.S. Boreczky and L.A. Rowe, “Comparison of video shot boundary detection techniques”, in IS&T/SPIE Symposium Proceedings, Storage and Retrieval for Image and Video Databases IV, vol. 2670, San Jose, California, January 1996.

[3] S.F. Chang, “Compressed-domain techniques for image/video indexing and manipulation”, in Proceedings of the IEEE International Conference on Image Processing, Special Session on Digital Library and Video on Demand, Washington DC, October 1995.

[4] J. Feng, K.-T. Lo and H. Mehrpour, “Scene change detection algorithm for MPEG video sequence”, in Proceedings of the IEEE International Conference on Image Processing, September 1996.

[5] A. Hampapur, R. Jain and T. Weymouth, “Digital video segmentation”, in Proceedings of the 3rd ACM International Conference on Multimedia, San Francisco, CA, October 1994, pp. 357-364.

[6] D. LeGall, “A video compression standard for multimedia applications”, Commun. ACM, vol. 34, no. 4, 1991, pp. 46-58.

[7] J. Meng, Y. Juan and S.F. Chang, “Scene change detection in an MPEG compressed video sequence”, in IS&T/SPIE Symposium Proceedings, vol. 2419, San Jose, California, February 1995.

[8] T.A. Ohanian, Digital Nonlinear Editing: New Approaches to Editing Film and Video, Focal Press, 1993.

[9] R. Zabih, J. Miller and K. Mai, “A feature-based algorithm for detecting and classifying scene breaks”, in Proceedings of the 4th ACM International Conference on Multimedia, San Francisco, CA, November 1995, pp. 189-200.

[10] H. Zhang, A. Kankanhalli and S.W. Smoliar, “Automatic partitioning of full-motion video”, Multimedia Systems, vol. 1, no. 1, 1993, pp. 10-28.

[11] H. Zhang, C.Y. Low and S.W. Smoliar, “Video parsing and browsing using compressed data”, Multimedia Tools and Applications, vol. 1, no. 1, 1995, pp. 89-111.