TEMPORAL VIDEO SEGMENTATION USING CROSS CORRELATION

Naveen Aggarwal*, Prof. Nupur Prakash**, Prof. Sanjeev Sofat***, Ajay Mittal***

* CSE Dept., University Institute of Engineering & Technology, Panjab University, Chandigarh, [email protected]
** School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi
*** CSE Dept., Punjab Engineering College, Chandigarh

Keywords: Video shot detection, Temporal Video Segmentation, Area Correlation, Scene change detection, Video Databases

Abstract

Various algorithms for the temporal segmentation of digital video have been proposed in the last decade. A comparison of the work of the different research groups active in this area can be seen in TRECVID 2005. It is clear from that comparison that no single method is suitable for detecting all types of gradual transitions in all types of video. Furthermore, many methods are either computationally intensive or produce so many false detections that their efficiency suffers. In this work, a method based on region matching using area correlation is proposed. It can be applied before any other method to reduce that method's false detections. Each frame of the video is divided into regions based on statistical similarity, and for each pair of corresponding regions in two consecutive frames an area correlation function is calculated. This area correlation energy function is used to measure the similarity between the regions. The weighted sum of the correlation function over all mismatched regions, together with the locations of those regions, is used to classify the edits of a video into types such as hard cuts, fades and wipes. The method has been tested on different videos and achieves reasonable performance. It can also be used as a pre-classifier for other methods to reduce their false detections.

1. Introduction

With the emergence of fast computing resources and high-capacity storage devices, processing large volumes of multimedia data in real time has become a reality. The rapid growth of the internet and other communication technologies over the last decade has further fuelled the widespread use of multimedia. These advances have proven to be a double-edged sword, however, because the availability of immense amounts of data is not matched by advanced browsing methods: the available browsing tools are still primitive. Moreover, although good compression standards have reduced communication time, their representation semantics widen the gap between how the data is stored and how users pose their queries. The main hindrance to the development of good multimedia management tools is precisely this semantic difference between the representation of the data and the richness of users' queries. Effective use of the data is therefore possible only when tools can retrieve information from multimedia data based on its content.

It is easy to say that a picture is worth a thousand words, but finding which words actually describe a picture is a daunting task. We need to focus on developing cost- and computation-effective mechanisms that represent the data in such a way that users can query it directly using first-order logic. For this, the system must represent the data along several dimensions, each capturing specific visual or semantic features. A feature may capture high-level detail (as perceived by users) or low-level detail (specific to the hardware implementation). Such a system can be termed a multimedia mining system, because segmenting video along these dimensions, using both low-level and high-level features, helps to find interesting patterns in the video; these patterns bridge the semantic gap between user queries (expressed in terms of high-level features) and the data representation (low-level features). A single shot of video, rather than a single frame, is taken as the basic elementary unit along each dimension, because shots provide the foundation for nearly all existing video abstraction and high-level video segmentation algorithms [10]. During video production, each shot transition reflects the content and context of the video. A single shot is defined as a sequence of frames captured continuously from the same camera at one time. Shot changes may occur as straight cuts or as gradual transitions, and a gradual transition can be further classified as a cross-dissolve, fade-in, fade-out, wipe, etc., depending on how the second shot replaces the first. It is therefore important to detect gradual transitions effectively. However, events such as sudden illumination changes (due to flashes or fire), camera movement and zooming are increasingly used in production, and they make shot detection much more difficult. Early work in this area concentrated mainly on hard cuts, but recently different methods for detecting gradual transitions have been proposed at TRECVID 2005. Rainer Lienhart [10] and Gargi et al. [21] compared some of these methods; it is clear from their studies that most algorithms assess their performance on edit detection in a limited domain, not their ability to classify the types of edits and their temporal extent. In this paper, a method to detect and classify edits is proposed, based on techniques used for scene matching with area correlation.

The remainder of the paper is organized as follows. Section 2 summarizes related work in this area. Section 3 describes the proposed algorithm and the methods used to classify edits. Section 4 explains our implementation and analyzes the experimental results. Finally, Section 5 gives concluding remarks and future enhancements.

2. Related Work

In the last decade, considerable work has been done on detecting scene changes based on entire images. Although edit detection is fundamental to any kind of video analysis, very few comparative studies of early shot boundary detection algorithms have been published. Temporal segmentation methods for digital video can be categorized into those that operate on compressed video and those that operate on uncompressed video. Techniques in the compressed domain normally use the encoded features of the video; in MPEG videos, for example, information about motion vectors, macroblocks, index frames and DCT coefficients can be used. Algorithms in the compressed domain are more efficient because they do not need to fully decode the video and they reuse features that have already been computed. However, most video is still produced in uncompressed form.

The number of possible edits is quite large: well-known video editing programs provide more than a hundred types. Rainer Lienhart [10] classified them into four classes: (1) Identity class: neither of the two shots involved is modified and no additional edit frames are added, as in hard cuts. (2) Spatial class: spatial transformations are applied to the two shots; examples are wipe, page turn, slide and iris effects. (3) Chromatic class: color space transformations are applied to the two shots; examples are fade and dissolve effects. (4) Spatio-chromatic class: both spatial and color space transformations are applied; all morphing effects fall into this category.

Hard cuts can easily be detected by pixel-based comparison of frames, but such comparisons are very sensitive to camera motion. Zhang, Kankanhalli, and Smoliar [22] implemented this method with an averaging filter to reduce the effects of camera motion and noise, but the filter makes the method very slow. Shahraray [18] used a method similar to the one used to find motion vectors in MPEG videos: the image is divided into regions and the best match around each region is searched for; the weighted sum of the sorted region differences provides the image difference measure, and gradual transitions are detected when this difference exceeds a manually provided threshold. Hampapur, Jain, and Weymouth [3] computed the gray-level change of each pixel across consecutive frames; during dissolves and fades this change is almost constant, and wipes can be detected similarly, but the technique does not cope with camera motion, noise or object motion. Statistical methods resemble pixel-based methods but divide the image into regions according to statistical values such as the mean and standard deviation; these methods are very slow because the statistical computations are intensive.

Histogram methods, based on either gray levels or color, are very popular for detecting shot boundaries: if the bin-wise difference between two histograms exceeds a threshold, a shot boundary is assumed. B.V. Funt and G.D. Finlayson [1] used color histograms, while J. Boreczky and L.A. Rowe [4] used color histograms together with other statistics. If the histogram difference in some regions is very large compared with the other regions of the image, those regions can be discarded as effects of object motion and noise. Rainer Lienhart [13] found histogram-based methods to be more accurate and faster than pixel- or statistics-based methods, and concluded that it is better to use two threshold values to properly detect gradual transitions such as wipes and dissolves: the first threshold marks the start of a gradual transition and the second marks its peak. The method is made faster by comparing only every other frame; consecutive frames are compared only when a gradual transition is detected. Zabih, Miller, and Mai [15] proposed a method based on edge detection: they calculate the edge change ratio between consecutive frames, and the number of entering and exiting edges, together with their relative positions, is used to detect hard cuts, fades and dissolves. They found this method far less sensitive to camera motion than histogram-based methods.

Rainer Lienhart [10] suggested that shot detection algorithms can be compared on the basis of how effectively they detect each type of transition, and used the recall and precision metrics to quantify the false detections produced by a particular algorithm. The TRECVID 2005 experience shows that although many research groups provide efficient and effective shot detection methods for limited classes of video, there is still no method that detects all classes of transitions irrespective of the type of video. In this work, a method based on area correlation is proposed in an attempt to reduce the false detections of other algorithms; it can be used alongside them for this purpose.

3. Proposed Algorithm

The proposed algorithm is fundamentally a region-based area correlation matching technique. It can be used to classify scene breaks into major categories, and this categorization is very helpful when used as a pre-filter for other efficient methods, such as that of Rainer Lienhart [11]: prior filtering of the video helps to reduce false detections and can improve the precision and recall of any method. The first step of the algorithm divides the frames into regions based on their statistical properties, as explained in subsection 3.1. The proposed correlation function, and its use for finding the correlation between consecutive frames, is explained in subsection 3.2. Finally, the use of the correlation function to categorize shots into different classes is outlined in subsection 3.3.

3.1 Region Segmentation

Region segmentation is based on the popular technique of region splitting. The main difference is that statistical properties of the frame, rather than raw intensity values, are used to split it into segments. Let I represent the original frame and m the mean intensity of the frame. The deviation (variance) from the frame mean is calculated for each pixel, and the result is represented by R. The reason for using the variance is to divide the frame into regions based on contrast rather than intensity. To split the frame, a suitable predicate P specifying the satisfying condition is selected; the condition can be based on a minimum threshold value of the variance. One approach to segmenting R is to subdivide it into four quadrant regions such that P(Ri) = true for every region Ri. If P(Ri) is false for any quadrant, that quadrant is further subdivided into four equal regions, and the process is repeated until the predicate is satisfied for all segmented regions. This splitting technique has a convenient representation in the form of a quad-tree. All the resulting rectangular regions satisfy the homogeneity property, and each exhibits a contrast distinct from the other regions.
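As an illustration, the following is a minimal sketch of this splitting step, assuming a grayscale frame stored as a NumPy array; the variance threshold and minimum block size are illustrative parameters, not values from the paper.

```python
import numpy as np

def split_regions(frame, var_thresh=100.0, min_size=16):
    """Recursively split a grayscale frame into quadrants until each
    region's intensity variance falls below var_thresh (the predicate P).
    Returns (top, left, height, width) tuples for the leaf regions."""
    regions = []

    def split(top, left, h, w):
        block = frame[top:top + h, left:left + w]
        # Predicate P(R_i): the region is homogeneous enough in contrast.
        if block.var() <= var_thresh or h <= min_size or w <= min_size:
            regions.append((top, left, h, w))
            return
        h2, w2 = h // 2, w // 2
        # Otherwise subdivide into four quadrants (quad-tree) and recurse.
        split(top, left, h2, w2)
        split(top, left + w2, h2, w - w2)
        split(top + h2, left, h - h2, w2)
        split(top + h2, left + w2, h - h2, w - w2)

    split(0, 0, frame.shape[0], frame.shape[1])
    return regions
```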

3.2 Cross Correlation Function

Shot changes in a video can be detected by searching for regions of mismatch between consecutive frames. Let u(m, n) represent a region of a particular frame, where m, n index the pixels of the region, and let v(m, n) represent the corresponding region in the next frame. The mismatch energy between the two regions for a displacement (p, q) is defined as

 2  p, q    vm, n   u m  p, n  q 2 n

 vm,n  um,n  2vm,num p,n  q 2

m n

2

m n

m n

For the mismatch energy to reach its minimum, it is sufficient to maximize the cross correlation function over different values of p and q:

$$C_{vu}(p,q) = \sum_m \sum_n v(m,n)\,u(m-p,\,n-q), \qquad \forall (p,q).$$

From the Cauchy-Schwarz inequality, we have

$$C_{vu}(p,q) \le \left(\sum_m \sum_n v(m,n)^2\right)^{1/2} \left(\sum_m \sum_n u(m,n)^2\right)^{1/2},$$
where equality (the maximum value) is attained only when the region in one frame matches the displaced region in the other frame, i.e. v(m, n) = α u(m−p, n−q), where α is an arbitrary constant. If the cross correlation exceeds a threshold for some particular value of (p, q), the difference between the frames is attributed to camera motion. The effects of camera rotation and zooming can be modelled similarly as v(m, n) = α u((m−p)/s1, (n−q)/s2, θ), where θ is the rotation factor, s1 and s2 are scaling factors, and p, q are displacement factors. With all these factors, computing the cross correlation function becomes very complex, so estimated values of the factors over a particular area are used; hence the cross correlation function is also called the area correlation function. If the correlation function stays below the maximum threshold value for all estimated values of p, q, s1, s2 and θ, the two frames are considered strongly dissimilar, and the weighted average of the correlation function is calculated.

3.3 Shot Categorization

The correlation energy function is calculated for all corresponding regions of two consecutive frames. The weighted average of the correlation function for a frame containing n regions is calculated as

$$\text{Weighted Average} = \frac{\sum_{i=1}^{n} m_i\, C_i(u,v)}{\sum_{i=1}^{n} m_i},$$

where m_i is the size of region i and C_i(u, v) is the area correlation of the i-th region pair. The advantage of the weighted average is that larger regions are given more weight, because a mismatch in a small region may simply be caused by object motion. The following observations, based on the number of mismatched regions and the actual value of the mismatch correlation energy, are used to classify an edit (a sketch of these rules follows the list):

- If all regions differ between consecutive frames, a hard cut is assumed, since only a hard cut can change every region.
- If the number of mismatched regions exceeds T1 and the weighted average exceeds T2, a fade transition is assumed. T1 is the minimum number of regions that must differ for a cut to qualify as a fade; T2 is set to suppress the effect of camera flashes, during which the weighted correlation does not attain very high values. Both T1 and T2 are estimated from observation of the videos used in our experiments.
- If the mismatched regions are located on one side of the frame, the transition is either a wipe or camera motion. During wipes, the correlation function attains higher values than during camera motion, so wipes can be detected by setting the area-correlation threshold to a suitable value.
- Analogously to wipes, the location of the mismatched regions can be used to detect other editing effects such as slides and doors.
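A hedged sketch of these categorization rules is given below; the threshold values t1, t2 and wipe_thresh, and the per-region data layout, are assumptions for illustration, since the paper estimates its thresholds empirically per video.

```python
def one_sided(mismatched, frame_width):
    """True if every mismatched region lies in one horizontal half of
    the frame, as happens during wipes and lateral camera motion."""
    if not mismatched:
        return False
    centers = [r['left'] + r['width'] / 2.0 for r in mismatched]
    half = frame_width / 2.0
    return max(centers) < half or min(centers) >= half

def classify_transition(regions, frame_width, t1=4, t2=0.5, wipe_thresh=0.8):
    """Classify the edit between two consecutive frames from per-region
    area-correlation results. Each region is a dict with 'size', 'corr',
    'mismatch', 'left' and 'width'; the thresholds are illustrative."""
    n = len(regions)
    mismatched = [r for r in regions if r['mismatch']]
    # Weighted average of the correlation energy (Section 3.3): larger
    # regions carry more weight, since a mismatch in a small region may
    # be caused by object motion rather than an edit.
    total = sum(r['size'] for r in regions)
    weighted_avg = sum(r['size'] * r['corr'] for r in regions) / total
    if len(mismatched) == n:
        return 'hard cut'   # only a hard cut changes every region
    if len(mismatched) > t1 and weighted_avg > t2:
        return 'fade'       # t2 screens out camera flashes
    if one_sided(mismatched, frame_width):
        # Wipes yield higher correlation energy than camera motion.
        return 'wipe' if weighted_avg > wipe_thresh else 'camera motion'
    return 'no transition'
```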

The overall process is depicted in Figure 1.

[Figure 1 (Shot Detection Mechanism): original frames → region segmentation → area correlation over sub-regions → weighted average → shot detection]

Two consecutive frames are passed to region segmentation, which splits them into regions. For each region, the corresponding area correlation energy function is calculated, and the weighted average over all regions is finally taken for the shot detection step, as explained above. The process is then repeated for the next pair of consecutive frames: if it starts with frames f(t−1) and f(t), it is repeated for f(t) and f(t+1), and so on. Each frame is segmented only once in the whole process; previous segmentations are carried forward to the next iteration. To reduce the number of iterations, alternate frames can be selected for segmentation and matching rather than every individual frame.
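Putting the pieces together, a minimal sketch of this detection loop might look as follows; it builds on the split_regions, area_correlation and classify_transition sketches above, and the 0.8 region-match cut-off is an assumed value, not one from the paper.

```python
def detect_shots(frames, match_thresh=0.8):
    """Run the Figure 1 pipeline over consecutive frame pairs. Each
    frame is segmented once and its segmentation carried forward; the
    regions of the earlier frame of each pair drive the comparison."""
    labels = []
    prev = frames[0]
    regions = split_regions(prev)
    for cur in frames[1:]:
        scored = []
        for (top, left, h, w) in regions:
            u = prev[top:top + h, left:left + w]
            v = cur[top:top + h, left:left + w]
            c = area_correlation(u, v)
            scored.append({'size': h * w, 'corr': c, 'left': left,
                           'width': w, 'mismatch': c < match_thresh})
        labels.append(classify_transition(scored, prev.shape[1]))
        prev, regions = cur, split_regions(cur)  # segment each frame once
    return labels
```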

4. Results

The proposed algorithm was tested on several arbitrary-content sequences obtained by recording satellite broadcasts of news, sports and other television programmes. Experiments were performed in MATLAB 7.0. For uniformity, each video was converted to YUV format at a resolution of 800 × 600 using standard MATLAB functions, and only the luminance component of the color model was used, to reduce the overall complexity of the algorithm. The first step of the algorithm splits each frame of the video into regions; for each region, the area correlation between two consecutive frames is then calculated. Figure 2 shows the results computed for a recorded MTV video. This particular video contains different types of production edits, such as hard cuts, fades and wipes, along with substantial camera motion at different angles and a camera flash instant.

[Figure 2: MTV video - normalized area correlation vs. frame number (frames 1-463)]

From Figure 2, it is clear that the video is very dynamic. Area correlation values above 0.8 are due to simple camera motion, values below 0.3 correspond to hard cuts, and intermediate values account for gradual transitions. The algorithm gives false detections only under strong illumination changes or large camera movements; small camera or object movements do not affect the values, but large camera and object movements have a significant impact on the results. A further set of experiments was conducted on a variety of videos to measure the performance of the algorithm, in terms of precision and recall, in classifying the types of edits. Figure 3 shows precision and recall for three video data sets, where Recall = number of correct detections / number of existing events, and Precision = number of correct detections / total number of detections.

[Figure 3: Recall-precision graph for the BBC News, cricket and MTV videos]

The first video is a BBC News broadcast, the second a cricket video and the third an MTV songs video. Our algorithm gives very high recall and precision values for the first two videos but relatively low values for the highly dynamic MTV video. To improve performance, the threshold values used for classification need to be adjusted precisely for each type of video. Morphological effects are not considered in these results, as they do not occur in the example videos.
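For concreteness, the two metrics above can be computed directly from detection counts; the numbers in the usage line are purely illustrative, not results from the paper.

```python
def recall_precision(correct, existing, detected):
    """Recall = correct detections / existing events;
    Precision = correct detections / total detections reported."""
    return correct / existing, correct / detected

# e.g. 45 correct detections, 50 true events, 52 reported detections
r, p = recall_precision(45, 50, 52)  # illustrative numbers only
```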

5. Conclusion

Temporal segmentation of video would not be a difficult task if the only edit type were the hard cut, but nowadays more than a hundred different types of edits are used in production. The proposed algorithm very effectively detects hard cuts, along with gradual transitions such as fades, wipes, doors and slides. It also handles small horizontal and rotational camera motion and optical zoom. As shown in the results, transitions that act in both space and time, such as morphological effects, are not detected effectively. Furthermore, the efficiency of the method depends on the manual estimation of several threshold values, and a wrong estimate may lead to a large number of false detections. In future work, the computational efficiency of the method needs to be evaluated on videos with a large number of cuts.

References

[1] B.T. Truong, C. Dorai, S. Venkatesh, "Improved fade and dissolve detection for reliable video segmentation", in Proceedings of the International Conference on Image Processing, Vol. 3, 10-13 Sept. (2000).
[2] Chung-Lin Huang and Bing-Yao Liao, "A Robust Scene Change Detection Method for Video Segmentation", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 12, December (2001).
[3] A. Hampapur, R. Jain, and T. Weymouth, "Digital Video Segmentation", Proc. ACM Multimedia 94, San Francisco, CA, October (1994).
[4] J. Boreczky, L.A. Rowe, "Comparison of video shot boundary detection techniques", in Proceedings IS&T/SPIE International Symposium on Electronic Imaging, San Jose (1996).
[5] J.R. Kender et al., "IBM Research TRECVID 2004 Video Retrieval System", TREC Video Retrieval Evaluation Online Proceedings, TRECVID (2004).
[6] Jordi Mas, Gabriel Fernandez, "Video Shot Boundary Detection based on Color Histogram", Notebook Papers TRECVID 2003, Gaithersburg, Maryland, NIST, October (2003).
[7] Lekha Chaisorn, Tat-Seng Chua, Chun-Keat Koh, Yunlong Zhao, Huaxin Xu, Huamin Feng, Qi Tian, "A Two-Level Multi-Modal Approach for Story Segmentation of Large News Video Corpus", Notebook Papers TRECVID 2003, Gaithersburg, Maryland, NIST, October (2003).
[8] Masaru Sugano, Keiichiro Hoashi, Kazunori Matsumoto, Fumiaki Sugaya, Yasuyuki Nakajima, "Shot Boundary Determination on MPEG Compressed Domain and Story Segmentation Experiments for TRECVID 2003", Notebook Papers TRECVID 2003, Gaithersburg, Maryland, NIST, October (2003).
[9] Mihai Datcu and Klaus Seidel, "An Innovative Concept For Image Information Mining", in Proceedings of the International Workshop on Multimedia Data Mining with ACM SIGKDD (2002).
[10] Rainer Lienhart, "Reliable Transition Detection in Videos: A Survey and Practitioner's Guide", International Journal of Image and Graphics (IJIG), Vol. 1, No. 3, pp. 469-486 (2001).
[11] Rainer Lienhart and Andre Zaccarin, "A System for Reliable Dissolve Detection in Videos", ICIP 2001, Thessaloniki, Greece, October (2001).
[12] R. Lienhart, C. Kuhmünch and W. Effelsberg, "On the Detection and Recognition of Television Commercials", in Proceedings of the International Conference on Multimedia Computing and Systems, Ottawa, Ontario, Canada, pp. 509-516, June (1997).
[13] R. Lienhart, "Comparison of Automatic Shot Boundary Detection Algorithms", in Image and Video Processing VII (1999).
[14] Rajeev Kumar, Vijay Devatha, "A Statistical Approach to Robust Video Temporal Segmentation", in Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, ISRO, Ahmedabad, India, December (2002).
[15] Ramin Zabih, Justin Miller and Kevin Mai, "A Feature-Based Algorithm for Detecting and Classifying Production Effects", ACM Journal of Multimedia Systems, Vol. 7, No. 2, pp. 119-128 (1999).
[16] R. Zabih, J. Miller, and K. Mai, "A Feature-Based Algorithm for Detecting and Classifying Scene Breaks", Proc. ACM Multimedia 95, San Francisco, CA, pp. 189-200, November (1995).
[17] Sangkeun Lee, Monson H. Hayes, "Scene Change Detection using Adaptive Threshold and Sub-Macroblock Images in Compressed Sequences", IEEE International Conference on Multimedia and Expo (2001).
[18] B. Shahraray, "Scene Change Detection and Content-Based Sampling of Video Sequences", in Digital Video Compression: Algorithms and Technologies, Proc. SPIE 2419, February (1995).
[19] Silvia Pfeiffer, Rainer Lienhart and Wolfgang Effelsberg, "Scene Determination Based on Video and Audio Features", in Proceedings of the IEEE Conference on Multimedia Computing and Systems, Florence, Italy, 7-11 June (1999).
[20] Timo Volkmer, S.M.M. Tahaghoghi, Hugh E. Williams, James A. Thom, "The Moving Query Window for Shot Boundary Detection at TREC-12", Notebook Papers TRECVID 2003, Gaithersburg, Maryland, NIST, October (2003).
[21] U. Gargi, R. Kasturi, S. Antani, "Performance Characterization and Comparison of Video Indexing Algorithms", in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (1998).
[22] H.J. Zhang, A. Kankanhalli, and S.W. Smoliar, "Automatic Partitioning of Full-Motion Video", Multimedia Systems, Vol. 1, No. 1 (1993).
