Video Shot Boundary Detection Using Support Vector Machine

Biswanath Chakraborty1, Siddhartha Bhattacharyya2, Susanta Chakraborty3
1 Department of CA, RCC Institute of Information Technology, Kolkata, India, email: [email protected]
2 Department of IT, RCC Institute of Information Technology, Kolkata, India, email: [email protected]
3 Department of CST, Indian Institute of Engineering Science and Technology, Shibpur, India, email: [email protected]

Abstract: Video shot boundary detection (SBD) is a challenging problem in video analysis applications. Of late, video information retrieval has mainly centered on segmenting a video at the level of shots, scenes and stories. Several threshold-based techniques are in vogue for this purpose; however, none of these methods is robust by design. In this article, a shot boundary detection technique using the classification capabilities of support vector machines is presented. First, video features, namely the pixel intensity variance between successive image frames and the Shannon entropy of the individual frames of the videos under consideration, are extracted. These are then used as input vectors to the support vector machine (SVM), which classifies the frames based on the extracted features. The classified frames reveal the detected hard cuts. Experimental results on two test videos are encouraging.

Keywords: shot boundary detection; pixel intensity variance; Shannon entropy; support vector machine

1. Introduction
Coupled with the increasing demands on multimedia compression technology, there has been an upsurge in research on the detection of shot boundaries and discontinuities in video sequences. Video shot boundary detection (SBD) is a challenging problem in video analysis applications. Of late, video information retrieval has mainly centered on segmenting a video at the level of shots, scenes and stories. Typical application areas include content-based video retrieval, video mining, video indexing, and video summarization, to name a few [1] [2]. The problem of video analysis has been dealt with in various ways, ranging from classical probabilistic thresholding techniques to intelligent classification techniques. Structural analysis of video is a prerequisite step for automatic analysis of video content. It essentially centers on the shot-level organization of the video sequence under consideration [3] [4].

A shot comprises continuous image frames captured by the camera. Depending on the transition between shots, shot boundaries can be categorized into two types: cuts (CUTs) and gradual transitions (GTs). GTs can be further classified into dissolve, wipe, fade out/in (FOI), etc. [5]. SBD aims at identifying the demarcations between adjacent shots.

A plethora of SBD methods are in vogue. Initially, SBD techniques had to resort to artificially recorded videos, as no benchmark datasets existed. However, in 2001, the National Institute of Standards and Technology (NIST) came up with a collection of benchmark videos, TRECVID [6], on which SBD techniques can be tested. TRECVID is a large evaluation database, and its inception has rapidly advanced research in SBD. As of now, researchers have been able to detect CUTs to an appreciable extent, but the detection of GTs remains a difficult problem [6].

2. Survey of existing approaches
A plethora of surveys on SBD are available in the literature [7-10]. In this section, we present three main categories of related work in SBD: (i) methods based on visual content representation, (ii) methods of classification and (iii) methods of gradual transition detection.

2.1 Methods based on Visual Content Representation
Visual content information is of paramount importance in research on SBD. Several representations, such as pixel-based [11], histogram [8], and even the mean and standard deviation of intensities [10], have been introduced. Unlike other surveys, Lefèvre et al. [12] also examined the computational complexity of various SBD approaches. It has been empirically observed that the simple histogram feature often yields satisfactory results compared to more complicated features such as edges [8][10].

The pixel-based method maps an image frame to itself with a mapping φ, so that frames are compared directly on their pixel values. This method has been found to be sensitive to local/global movement of pixels. To tackle this problem, several variants of the method have been proposed. Zhang et al. proposed applying a 3×3 smoothing filter to the images prior to comparing the pixels [13]. The color histogram is an alternative to the pixel-based methods. It is more invariant to local/global movements. However, it is not capable of distinguishing shots within the same scene. The block-matching method [14] is an alternative to both the pixel-based and global color histogram methods. Here each frame is first divided into several non-overlapping blocks. Then the histogram or other features of each block are extracted individually to characterize the image frames.
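The block-based representation described above can be illustrated with a minimal sketch. It assumes grayscale frames stored as lists of rows with intensities in 0..255; the function names, the 2×2 block grid, the 4-bin histogram and the absolute-difference distance are illustrative choices, not details taken from [14].

```python
def block_histograms(frame, blocks_per_side=2, bins=4, max_level=256):
    """Split a grayscale frame into non-overlapping blocks and return one
    intensity histogram per block (leftover border pixels are ignored)."""
    height, width = len(frame), len(frame[0])
    bh, bw = height // blocks_per_side, width // blocks_per_side
    histograms = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            hist = [0] * bins
            for y in range(by * bh, (by + 1) * bh):
                for x in range(bx * bw, (bx + 1) * bw):
                    hist[frame[y][x] * bins // max_level] += 1
            histograms.append(hist)
    return histograms


def block_difference(frame_a, frame_b, **kwargs):
    """Sum of bin-wise absolute histogram differences over all blocks."""
    total = 0
    for ha, hb in zip(block_histograms(frame_a, **kwargs),
                      block_histograms(frame_b, **kwargs)):
        total += sum(abs(a - b) for a, b in zip(ha, hb))
    return total


# Hypothetical 4x4 frames: one uniformly dark, one uniformly bright.
dark = [[10] * 4 for _ in range(4)]
bright = [[200] * 4 for _ in range(4)]
print(block_difference(dark, dark))    # 0
print(block_difference(dark, bright))  # 32
```

Because each block keeps its own histogram, a change confined to one region of the frame still shows up in the distance, which a single global histogram could average away.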

2.2 Methods of Classification
Statistical Machine Learning: SBD can often be treated as a pattern recognition and machine learning problem. Apart from models based on generative classifiers, various discriminative machine learning approaches, including K-means [15], KNN [17-19], and support vector machines (SVMs) [22-25], have been used for SBD. The model parameters are selected using cross-validation, while the shape of the classification hyperplane is generated during the training phase. The problems associated with these models are the construction of the classifier features and the selection of a training set with a sufficient number of positive and negative examples. For the former, Cooper [16] and Yuan et al. [17] used the continuity signals in a temporal interval as the features for KNN and SVM classification. To address the selection of the training set, Lienhart [20] resorted to a dissolve synthesizer to create sufficient examples of dissolve sequences, and produced the non-dissolve pattern set by means of a bootstrap method. While thresholding methods rely on the amplitude of the video content signal and on user-chosen thresholds, machine learning methods decide by recognizing the pattern characteristics of the shot transitions; they may be unsupervised (e.g., K-means) or supervised (e.g., KNN and SVM).

2.3 Methods of Gradual Transition Detection
In this section, a brief overview of the existing methods for detecting the various types of gradual transitions is presented.

1) Fade Out/In: During a FOI, two adjacent shots are well separated, both spatially and temporally, by some monochrome frames. Thus, the key to detecting a FOI is to recognize the monochrome frames. The mean and the standard deviation of pixel intensities are very useful for this purpose.
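As a rough illustration of the monochrome-frame test, the sketch below flags a frame as monochrome when the standard deviation of its pixel intensities is close to zero, and groups consecutive monochrome frames into runs. The threshold value and the function names are assumptions made for this example; the mean intensity of a flagged frame would then indicate which gray level the fade passes through.

```python
from statistics import pstdev


def is_monochrome(frame, std_threshold=5.0):
    """A frame is treated as monochrome if its intensities barely vary."""
    pixels = [p for row in frame for p in row]
    return pstdev(pixels) < std_threshold


def monochrome_runs(frames, std_threshold=5.0):
    """Return (start, end) index pairs of consecutive monochrome frames."""
    runs, start = [], None
    for i, frame in enumerate(frames):
        if is_monochrome(frame, std_threshold):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(frames) - 1))
    return runs


# Hypothetical 2x2 frames: textured content separated by black frames.
textured = [[0, 255], [255, 0]]
black = [[0, 0], [0, 0]]
frames = [textured, textured, black, black, textured]
print(monochrome_runs(frames))  # [(2, 3)]
```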

2) Wipe: In a wipe, the adjacent shots are well separated only spatially at any given time. Spatio-temporal slice analysis is one of the popular methods for wipe detection. There are different patterns on the spatio-temporal slices corresponding to different styles of wipes.

3) Dissolve: In a dissolve, two adjacent shots are intermingled temporally as well as spatially. A popular method of dissolve detection uses the change in the variance of pixel intensities, also referred to as the downwards-parabolic pattern, proposed by Alattar [21]. Detection of a dissolve in this method relies on recognizing the parabolic pattern on the variance curve of the pixel intensities.

3. SBD System based on Support Vector Machine
The thresholding scheme for SBD has several limitations. Apart from being user dependent, it depends on the genre of the video. As such, no single global threshold can reliably distinguish boundaries from non-boundaries. In this section, the basics of machine learning methods for the SBD process are discussed.

1) Construction of Features: In the intensity distribution signal, each CUT corresponds to a local minimum on the curve. However, not every local minimum is a CUT boundary. Some of the minima are due to GTs, and others arise from external disturbances such as changes in ambient lighting conditions, speckle noise, channel noise, etc. Hence, it would not be wise to rely only on the magnitudes of the local minima. However, the valleys of the curve that correspond to CUT boundaries are usually regular and somewhat symmetric in shape, while valleys due to other disturbances are not. The valleys can therefore be classified according to their shapes. The shape of the valley centered at $c_t$ can be characterized by the feature vector

$$V_t = (c_{t-r}, \ldots, c_t, \ldots, c_{t+r}) \qquad (1)$$

where r is the neighborhood of the content under consideration, since a CUT boundary between image frames $I_t$ and $I_{t+1}$ affects the signal values at most from $c_{t-r}$ before $c_t$ to $c_{t+r}$ after $c_t$.
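The construction of the valley feature vector in (1) can be sketched as follows. The signal values are made up for illustration, r is assumed to be at least 1, and the `asymmetry` helper is a hypothetical measure added here only to show how the shape of a valley might be inspected; it is not part of the method described above.

```python
def valley_features(signal, r):
    """Return {t: V_t} for every strict local minimum c_t that has a full
    window of r samples on each side, with V_t = (c_{t-r}, ..., c_{t+r})."""
    features = {}
    for t in range(r, len(signal) - r):
        if signal[t] < signal[t - 1] and signal[t] < signal[t + 1]:
            features[t] = signal[t - r:t + r + 1]
    return features


def asymmetry(v):
    """Sum of |c_{t-k} - c_{t+k}| over the window; 0 for a symmetric valley."""
    r = len(v) // 2
    return sum(abs(v[r - k] - v[r + k]) for k in range(1, r + 1))


# Hypothetical signal with a symmetric valley at t = 3 and a skewed one at t = 8.
signal = [0.9, 0.9, 0.5, 0.1, 0.5, 0.9, 0.9, 0.8, 0.3, 0.85, 0.9]
features = valley_features(signal, r=2)
for t, v in features.items():
    print(t, v, asymmetry(v))
```

In the setting of this section, vectors such as these would be the inputs handed to the classifier, which then learns which valley shapes correspond to CUTs.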

2) Support Vector Machine: The SVM has a solid theoretical foundation and is therefore selected as the classifier. It is typically a binary classifier that relies on a training set of labeled examples and is hence supervised in nature. The associated learning algorithms analyze data and recognize patterns, and can be used for classification and regression analysis. Given a set of training examples, each belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. With the examples represented as points in space, an SVM aims to maximize the margin separating the two categories. New examples are mapped into the same space and assigned a category according to the side of the separating boundary on which they fall. Maximizing the margin also helps reduce the risk of over-fitting. SVMs can also perform non-linear classification efficiently using the kernel trick, which implicitly maps the inputs into high-dimensional feature spaces. To be precise, an SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space for classification or regression.

4. Feature Selection and Extraction
This section elaborates the features of the video sequences used for classification by the SVM.

4.1 Intensity Variance of Individual Frames
Variance measures how far a set of numbers is spread out. A variance of zero indicates that all the values are identical. Variance is always non-negative: a small variance indicates that the data points tend to be very close to the mean (expected value) and hence to each other, while a high variance indicates that the data points are widely spread around the mean and from each other. This concept can be applied to the pixel intensities of the individual frames of the video sequences as well.

4.2 Shannon Entropy
In information theory, Shannon entropy is the average value of the information content. It relates to the amount of uncertainty in the information: it increases with the randomness of the information content and decreases when the content is less random. In technical terms, it is the expected value of the negative logarithm of the probability of each event. The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose average (also termed expected value) is the average amount of information, a.k.a. entropy, generated by this distribution. It is thus defined as [26]

$$H(X) = -\sum_{i} P(x_i) \log_b P(x_i) \qquad (2)$$

where $P(x_i)$ is the probability of the event $x_i$ occurring and b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the unit of entropy is the shannon for b = 2, the nat for b = e, and the hartley for b = 10. When b = 2, the units of entropy are also commonly referred to as bits.

5. Proposed Methodology
The proposed methodology of video shot boundary detection is carried out using the following algorithm.

1. Find the intensity variance and Shannon entropy of each image frame.

2. Find the average intensity of each image frame using the following equation.

$$\mathrm{Average}_N = \frac{1}{M} \sum_{i=1}^{M} F_i \qquad (3)$$

where $F_i$ is the intensity of the i-th of the M pixels of image frame N.

3. If $\mathrm{Average}_N - \mathrm{Average}_{N-1} > T_{\mathrm{Average}}$ AND $\mathrm{entropy}_N - \mathrm{entropy}_{N-1} > T_{\mathrm{entropy}}$, then DETECT = TRUE, else DETECT = FALSE.

4. Apply the SVM on the Average and entropy features to classify the video frames into cuts, gradual transitions and fade in/fade out.
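Steps 1 to 3 above can be sketched in a few lines. This is only an illustration under stated assumptions: frames are grayscale lists of rows, the variance is the population variance, the entropy uses base 2 over the empirical intensity distribution of a frame, and the threshold values passed to the hypothetical `detect` function are chosen for the toy frames rather than taken from the experiments. The comparison is signed, exactly as step 3 is written.

```python
import math
from collections import Counter


def _pixels(frame):
    return [p for row in frame for p in row]


def average_intensity(frame):
    """Mean pixel intensity of one frame (step 2)."""
    pixels = _pixels(frame)
    return sum(pixels) / len(pixels)


def intensity_variance(frame):
    """Population variance of the pixel intensities (step 1)."""
    pixels = _pixels(frame)
    mu = sum(pixels) / len(pixels)
    return sum((p - mu) ** 2 for p in pixels) / len(pixels)


def shannon_entropy(frame, base=2):
    """Entropy of the empirical intensity distribution of one frame (step 1)."""
    pixels = _pixels(frame)
    n = len(pixels)
    counts = Counter(pixels)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def detect(frames, t_average, t_entropy):
    """Step 3: one DETECT flag per frame N >= 1, compared with frame N-1."""
    avg = [average_intensity(f) for f in frames]
    ent = [shannon_entropy(f) for f in frames]
    return [avg[n] - avg[n - 1] > t_average and ent[n] - ent[n - 1] > t_entropy
            for n in range(1, len(frames))]


# Hypothetical 2x2 frames: two flat frames followed by two textured frames.
flat = [[100, 100], [100, 100]]
busy = [[0, 255], [128, 200]]
print(detect([flat, flat, busy, busy], t_average=20, t_entropy=0.9))
# [False, True, False]
```

Each frame is visited once to compute its features and once more for the comparison with its predecessor, so the cost of this sketch grows linearly with the number of frames.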

6. Experimental Results
The proposed methodology has been demonstrated on two videos: (V1) Waka Waka (World Cup 2010), with a total of 631 frames, and (V2) Beautiful Nature Video & Relaxing Music - Echoes of the Forest (HD), with 599 frames. After an extensive set of experiments, the best values of T_Average and T_entropy were set based on 60% of the change in the feature values from one frame to the next [27]. These were found to be 500 and 0.9, respectively. The results of classification with the SVM trained on the intensity variance and entropy features, after applying the proposed methodology, are listed in Table I. The proposed methodology detected 117 of the 146 cuts in V1, and 5 fade in/fade out transitions and 1 gradual transition in V2. For V2, there were 6 correct hits and no missed hits. The precision (P) and recall (R) are also shown in the table. The proposed methodology was compared with Ling et al.'s method [28], and the comparative results reported in Table I indicate that the proposed methodology outperforms it. It may also be noted that [28] used a threshold value whose basis is not clearly defined in their work.
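The kind of SVM training referred to in this section can be sketched with a small linear soft-margin SVM trained by batch sub-gradient descent on the regularized hinge loss. This is a simplified stand-in and not the authors' implementation: the kernel, the solver, the hyper-parameters and the toy feature values below are all assumptions, and the sketch is binary (boundary versus non-boundary), whereas the methodology distinguishes several transition types.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.05, epochs=2000):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by sub-gradient descent.
    Labels must be +1 or -1. Returns the weight vector w and the bias b."""
    dim, n = len(samples[0]), len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # inside the margin or on the wrong side
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (boundary) or -1 (non-boundary) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, pre-scaled features per frame pair:
# [change in average intensity, change in entropy].
samples = [[2.0, 1.5], [1.8, 2.0], [2.5, 1.7],              # boundaries
           [0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.0, 0.0]]  # non-boundaries
labels = [1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
print([predict(w, b, x) for x in samples])
print(predict(w, b, [2.2, 1.8]), predict(w, b, [0.1, 0.1]))
```

In practice the two features live on very different scales, so they would normally be standardized before training; the toy values above are already on a comparable scale.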

TABLE I. RESULTS

Columns: Video (V1, V2); Method (Proposed Method, Ling et al.’s method [28]); Cut (P, R); Gradual Transition (P, R); Fade In/Fade Out (P, R).

Cell values in source order: 0.834286, 0.924051, --, --, 0.812354, 0.895671, --, 0.857143, 1.000, 0.998345, 0.854912, 0.985423, 0.865426, --, --, --, --, 0.974652, 1.000, 0.932198, --, 1.000, --, 0.994352
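For reference, precision and recall follow directly from the counts of correct detections, false alarms and missed transitions. The sketch below uses the 6 correct hits and 0 misses reported for V2; the single false alarm is an assumption made for illustration, and the function name is hypothetical.

```python
def precision_recall(correct, false_alarms, missed):
    """Precision = correct / detected, recall = correct / actual."""
    detected = correct + false_alarms
    actual = correct + missed
    precision = correct / detected if detected else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall


# 6 correct hits and no misses as reported for V2; 1 false alarm is assumed.
p, r = precision_recall(correct=6, false_alarms=1, missed=0)
print(round(p, 6), round(r, 3))  # 0.857143 1.0
```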

Moreover, the proposed method is more efficient than Ling et al.’s method [28] in terms of time complexity: the time complexity of the proposed method is O(n), where n is the number of frames, whereas that of Ling et al.’s method is O(n^2). Hence, the proposed method is more applicable in real-time situations.

7. Discussions and Conclusion
An intelligent technique based on a support vector machine for the classification of hard cuts, gradual transitions and fade in/fade out in videos is reported in this article. The method is found to successfully detect the different types of transitions in different types of videos. The proposed method is based on the selection of the intensity variance and entropy of the individual image frames of the video sequences, followed by a thresholding procedure and SVM-based training on the chosen features. The comparative results reported support the efficacy of the proposed methodology. However, how to extend the proposed method to detect dissolves in video sequences remains to be explored. The authors are currently working in this direction.

Acknowledgment
The authors would like to thank the World Bank funded project TEQIP – II for the infrastructural support in carrying out this work.

References:
[1] N. Dimitrova, H. J. Zhang, B. Shahraray, I. Sezan, T. Huang, and A. Zakhor. Applications of video content analysis and retrieval, IEEE Multimedia, 9(3) (2002) 42–55.
[2] L. A. Rowe and R. Jain. ACM SIGMM retreat report on future directions in multimedia research, ACM Trans. Multimedia Comput. Commun. Appl., 1(1) (2005) 3–13.
[3] S. W. Smoliar and H.-J. Zhang. Content-based video indexing and retrieval, IEEE Multimedia, 1(2) (1994) 62–72.
[4] R. Lienhart, S. Pfeiffer, and W. Effelsberg. Video abstracting, Commun. ACM, 40(12) (1997) 55–62.
[5] V. Kobla, D. DeMenthon, and D. Doermann. Special effect edit detection using videotrails: a comparison with existing techniques, in Proc. SPIE Conf. Storage Retrieval Image Video Databases VII, Jan. 1999, 302–313.
[6] NIST, Homepage of Trecvid Evaluation. [Online]. Available: http://www-nlpir.nist.gov/projects/trecvid/
[7] R. Lienhart. Reliable transition detection in videos: a survey and practitioner’s guide, Int. J. Image Graph., 1(3) (2001) 469–486.
[8] U. Gargi, R. Kasturi, and S. H. Strayer. Performance characterization of video-shot-change detection methods, IEEE Trans. Circuits Syst. Video Technol., 10(1) (2000) 1–13.
[9] J. S. Boreczky and L. A. Rowe. Comparison of video shot boundary detection techniques, in Proc. SPIE Storage Retrieval Image Video Databases IV, Jan. 1996, 2664, 170–179.
[10] R. Lienhart. Comparison of automatic shot boundary detection algorithms, in Proc. SPIE Image Video Process. VII, Jan. 1999, 3656, 290–301.
[11] S. K. Choubey and V. V. Raghavan. Generic and fully automatic content-based image retrieval using color, Pattern Recog. Lett., 18(11–13) (1997) 1233–1240.
[12] S. Lefèvre, J. Holler, and N. Vincent. A review of real-time segmentation of uncompressed video sequences for content-based search and retrieval, Real-Time Imag., 9(1) (2003) 73–98.
[13] H.-J. Zhang, C. Y. Low, and S. W. Smoliar. Video parsing and browsing using compressed data, Multimedia Tools Appl., 1(1) (1995) 89–111.
[14] M. Ahmed, A. Karmouch, and S. Abu-Hakima. Key frame extraction and indexing for multimedia databases, in Proc. Vis. Interface Conf., 1999, 506–511.
[15] M. R. Naphade, R. Mehrotra, A. Ferman, J. Warnick, T. S. Huang, and A. M. Tekalp. A high-performance shot boundary detection algorithm using multiple cues, in IEEE Int. Conf. Image Process., 1998, 884–887.
[16] M. Cooper. Video segmentation combining similarity analysis and classification, in Proc. ACM Multimedia 2004, Oct. 2004, 252–255.
[17] J. Yuan, J. Li, F. Lin, and B. Zhang. A unified shot boundary detection framework based on graph partition model, in Proc. ACM Multimedia 2005, Nov. 2005, 539–542.
[18] C.-W. Ngo. A robust dissolve detector by support vector machine, in Proc. ACM Multimedia, 2003, 283–286.
[19] T.-S. Chua, H. Feng, and A. Chandrashekhara. An unified framework for shot boundary detection via active learning, in Proc. ICASSP, Hong Kong, Apr. 2003, 2, 845–848.
[20] R. Lienhart. Reliable dissolve detection, in Proc. SPIE Storage Retrieval Media Database, Jan. 2001, 4315, 219–230.
[21] A. M. Alattar. Detecting and compressing dissolve regions in video sequences with a DVI multimedia image compression algorithm, in Proc. IEEE ISCAS, May 1993, 1, 13–16.
[22] H. Zhang, A. Kankanhalli, and S. W. Smoliar. Automatic partitioning of full-motion video, Multimedia Syst., 1(1) (1993) 10–28.
[23] Y. Lin, M. S. Kankanhalli, and T.-S. Chua. Temporal multiresolution analysis for video segmentation, in Proc. SPIE Conf. Storage Retrieval Media Database VIII, 2000, 3972, 494–505.
[24] C. J. C. Burges. A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Discov., 2(2) (1998) 121–167.
[25] G. Schohn and D. Cohn. Less is more: active learning with support vector machines, in Proc. 17th Int. Conf. Mach. Learning, 2000, 839–846.
[26] C. E. Shannon. A Mathematical Theory of Communication, Bell System Technical Journal, 27(3) (1948) 379–423.
[27] J. Baber, N. Afzulpurkar, M. N. Dailey, and M. Bakhtyar. Shot boundary detection from videos using entropy and local descriptor, in Proc. 2011 17th International Conference on Digital Signal Processing (DSP), 1–6.
[28] X. Ling, L. Chao, L. Huan, and X. Zhang. A general method for shot boundary detection, in Proc. 2008 International Conference on Multimedia and Ubiquitous Engineering, 394–397.