A Novel Video Shot Detection Technique Using Color Anglogram and Latent Semantic Indexing

Rong Zhao
Department of Computer Science, State University of New York, Stony Brook, NY 11794-4400, USA
[email protected]
William I. Grosky
Department of Computer & Information Science, University of Michigan-Dearborn, Dearborn, MI 48128-1491, USA
[email protected]
Abstract
In this paper, we present the results of our work that seeks to negotiate the gap between low-level features and high-level concepts in the domain of video shot detection. We introduce a novel technique for spatial color indexing, color anglogram, which is invariant to rotation, scaling, and translation. This work also concerns a dimension reduction technique, latent semantic indexing (LSI), which has been used for textual information retrieval for many years. We examine the use of these techniques for video shot detection, hoping to uncover the semantic correlation between video frames. Experimental results show that LSI, together with color anglogram, is able to extract the underlying semantic structure of video contents, thus helping to improve shot detection performance.

1. Introduction

Video data is available and has been used in many different application domains such as security, digital libraries, distance learning, advertising, electronic publishing, broadcasting, interactive TV, and video-on-demand entertainment. As the old saying goes, "a picture is worth a thousand words." If each video document is considered a set of still images, and the number of images in such a set may run into the hundreds or even thousands, it is not hard to imagine how difficult it can be to find certain information in a set of video documents. The sheer volume of video data available nowadays presents a daunting challenge for researchers. How can we organize and make use of these video documents both effectively and efficiently? How can we represent and locate meaningful information and extract knowledge from video documents? Needless to say, there is an urgent need for tools that can help us index, annotate, browse, and search video documents.

Video retrieval is based on the availability of a representation scheme for video contents, and how to define such a scheme mostly depends on the indexing mechanism that we apply to the data. Clearly, it is impractical to index video documents manually, since doing so is far too time consuming. However, the state of the art in computer science is not yet mature enough to provide a method that can cope with these problems both automatically and with intelligence comparable to that of human beings. Therefore, existing retrieval and management techniques for video collections and their users are typically at cross-purposes. While these techniques normally retrieve video documents based on low-level visual features, users usually have a more abstract and conceptual notion of what they are looking for. Using low-level features to correspond to high-level abstractions is one aspect of the semantic gap [14] between content-based analysis and retrieval methods and concept-based users [19, 24, 25, 26]. In this paper, we attempt to find a solution for negotiating this semantic gap in content-based video retrieval, with a special focus on video shot detection. We introduce a novel technique for spatial color indexing, color anglogram, which is invariant to rotation, scaling, and translation. This paper also concerns a dimension reduction technique, latent semantic indexing (LSI), which has been used for textual information retrieval for many years. We examine the use of this technique for video shot detection, hoping to uncover the semantic correlation between video frames. Experimental results show that LSI, together with color anglogram, is able to extract the underlying semantic structure of video contents, thus helping to improve shot detection performance significantly.

The remainder of this paper is organized as follows. In Section 2, related work on visual feature indexing and its application to video shot detection is briefly reviewed. Section 3 describes the color anglogram technique. Section 4 introduces the theoretical background of latent semantic indexing. Comparison and evaluation of various experimental results are presented in Section 5. Section 6 contains the conclusions, along with proposed future work.
2. Related Work

The development of video shot detection techniques has become one of the most important research areas in content-based video analysis and retrieval. The detection of boundaries between video shots provides a basis for almost all of the existing video segmentation and abstraction methods. However, it is quite difficult to give a precise definition of a video shot transition, since many factors, such as camera motion, may change the video content significantly. Usually, a shot is defined to be a sequence of frames that was continuously captured by the same camera [11]. Ideally, a shot may encompass camera motions such as pans, tilts, or zooms, and video editing effects such as fades, dissolves, wipes, and mattes [7, 15]. Basically, video shot transitions can be categorized into two types: abrupt/sharp shot transitions, also called cuts, where a frame from one shot is followed by a frame from a different shot, and gradual shot transitions, such as cross dissolves, fade-ins and fade-outs, and other editing effects. Usually, a similarity measure between successive frames is defined based on various visual features. When one frame and the following frame are sufficiently dissimilar, an abrupt transition (cut) may be declared. Gradual transitions are found by using measures of cumulative differences and more sophisticated thresholding mechanisms. Methods to cope with these two types of shot transitions fall into one of two domains, uncompressed or compressed, depending on whether they are applied to a raw video stream or to compressed video data. In general, methods working on uncompressed video are more reliable but also more costly with regard to storage and computational resources [7]. In this paper, our discussion will focus only on methods in the uncompressed domain. Discussions of compressed-domain shot detection algorithms can be found in [7, 11, 12, 15, 17].
Pair-wise pixel comparison, also called template matching, was introduced to evaluate the differences in intensity or color values of corresponding pixels in two successive frames [23]. The main drawback of this method is that both the feature representation and the similarity comparison are closely tied to pixel position. It is therefore very sensitive to object and camera movement and to noise. In contrast to pair-wise comparison, which is based on global visual features, block-based approaches use local characteristics to increase robustness to object and camera movement. Each frame is divided into a number of blocks that are compared against their counterparts in the successive frame. Typically, the similarity or dissimilarity between two frames can be measured by
using a likelihood ratio, as proposed in [16, 23]. Obviously, this approach provides better tolerance to slow and small motions between frames. To further reduce the sensitivity to object and camera movement and thus provide a more robust shot detection technique, histogram comparison was introduced to measure the similarity between successive frames. Beyond the basic histogram comparison algorithm, several researchers have proposed various approaches to improve its performance, such as histogram equalization [1], histogram intersection [21], and the normalized χ² test [18]. However, experimental results show that approaches which enhance the difference between two frames across a cut may also magnify the difference due to object and camera movements [23]. For instance, the overall performance of the χ² test is not necessarily better than that of linear histogram comparison due to such a trade-off, even though it is more time consuming. Another interesting issue is which color space to use when considering color-based techniques. In [13], the performance of several color-histogram-based methods using different color spaces, including RGB, HSV, YIQ, etc., was evaluated. Experimental results showed that HSV performs quite well with regard to classification accuracy, and it is among the least expensive in terms of the computational cost of conversion from the RGB color space. Besides, HSV reflects human perception of color patterns quite well. Therefore, HSV is used in our experimental study. The reasoning behind these approaches is that two images (or frames) with unchanging background and unchanging (although moving) objects will have minor differences in their histograms [17]. In addition, histograms can minimize the sensitivity to camera movements such as panning and zooming, and they are not noticeably affected by histogram dimensionality [11]. Fairly good precision performance was reported in [4, 5].
Moreover, histogram comparison does not require very intensive computation. However, despite these attractive characteristics, histogram-based similarity measures may, in theory, lead to incorrect classifications, since the whole process depends on the feature distribution and does not take spatial properties into account. The overall distribution of features, and thus its histogram, may remain mostly unchanged even if pixel positions have changed significantly.
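To make the histogram-based measures above concrete, the following is a minimal NumPy sketch of the joint hue-saturation histogram and two of the frame-difference measures discussed (linear bin-wise difference and the χ² test). The function names and the assumption that hue and saturation are pre-normalized to [0, 1) are ours, not from the cited systems.

```python
import numpy as np

def hs_histogram(hue, sat, bins=10):
    """Quantize per-pixel hue/saturation (each normalized to [0, 1))
    into a joint bins x bins histogram, flattened and normalized."""
    h_idx = np.minimum((hue * bins).astype(int), bins - 1)
    s_idx = np.minimum((sat * bins).astype(int), bins - 1)
    hist = np.zeros((bins, bins))
    np.add.at(hist, (h_idx.ravel(), s_idx.ravel()), 1)
    return hist.ravel() / hist.sum()

def linear_diff(h1, h2):
    """Linear (bin-wise L1) histogram difference, in [0, 1]."""
    return 0.5 * np.abs(h1 - h2).sum()

def chi_square(h1, h2, eps=1e-12):
    """Normalized chi-square distance between two histograms."""
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

A cut would then be flagged when the chosen distance between successive frames exceeds a threshold; as noted above, χ² amplifies differences, for better or worse.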
3. Color Anglogram

Let P = {p1, p2, ..., pn} be a set of points in the two-dimensional Euclidean plane, called the sites. Partition the plane by labeling each point in the plane with its nearest site. All points labeled pi form the Voronoi region V(pi). V(pi) consists of all points x at least as close to pi as to any other site:
V(pi) = { x : |pi − x| ≤ |pj − x|, ∀ j ≠ i }
Some points x do not have a unique nearest site. The set of all points that have more than one nearest site forms the Voronoi diagram V(P) for the set of sites. Construct the dual graph G for a Voronoi diagram V(P): Delaunay proved that when the dual graph is drawn with straight lines, it produces a planar triangulation of the Voronoi sites P, called the Delaunay triangulation D(P). Each face of D(P) is a triangle, called a Delaunay triangle. The proofs of Delaunay's theorems and properties can be found in [20]. Among the various algorithms for constructing the Delaunay triangulation of a set of N points, we note that there are O(N log N) algorithms [9, 10]. The spatial layout of a set of points can be coded through an anglogram, which is computed by discretizing and counting the angles produced by the Delaunay triangulation of a set of unique feature points. Anglograms encode the spatial correlation of these feature points and are invariant to translation, scale, and rotation. The anglogram corresponding to the Delaunay triangulation of a set of N points can be computed by an O(max(N, #bins)) algorithm. To construct color anglograms, each image is decomposed into a number of non-overlapping blocks. Each individual block is abstracted as a unique feature point labeled with its spatial location and feature values. The feature values can be the dominant or average hue and saturation values in the corresponding block. For each set of feature points labeled with a particular value, a Delaunay triangulation is constructed and the feature point histogram is computed by discretizing and counting either the two largest or the two smallest angles in the Delaunay triangles. Finally, the image is indexed by its color anglogram, i.e., the concatenated feature point histograms for each feature value. For example, a color image is shown in Figure 1(a).
By dividing the image into 256 blocks, Figure 1(b) and Figure 1(c) show the dominant hue and saturation values of each block. Figure 1(d) is the Delaunay triangulation of the set of feature points labeled with saturation value 5, and Figure 1(e) shows the corresponding anglogram obtained by counting the two largest angles in each triangle.
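The angle-histogram step described above can be sketched as follows, assuming the Delaunay triangulation itself has already been computed (e.g., with scipy.spatial.Delaunay); each triangle is given as a triple of 2-D vertex coordinates. The helper names are ours.

```python
import numpy as np

def triangle_angles(a, b, c):
    """Interior angles (in degrees) of triangle abc, via the law of cosines."""
    la = np.linalg.norm(b - c)   # side opposite vertex a
    lb = np.linalg.norm(a - c)   # side opposite vertex b
    lc = np.linalg.norm(a - b)   # side opposite vertex c
    def angle(opp, s1, s2):
        cos = (s1 ** 2 + s2 ** 2 - opp ** 2) / (2 * s1 * s2)
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return [angle(la, lb, lc), angle(lb, la, lc), angle(lc, la, lb)]

def anglogram(triangles, bin_width=5.0, n_bins=36):
    """Histogram of the two largest angles of each Delaunay triangle,
    discretized into 5-degree bins (36 bins cover 0-180 degrees)."""
    hist = np.zeros(n_bins)
    for a, b, c in triangles:
        for ang in sorted(triangle_angles(a, b, c))[-2:]:
            hist[min(int(ang // bin_width), n_bins - 1)] += 1
    return hist
```

Concatenating one such histogram per quantized hue and saturation value yields the color anglogram feature vector used later in Section 5.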
4. Latent Semantic Indexing

Latent Semantic Indexing (LSI) was introduced to overcome a fundamental problem that plagues existing textual information retrieval techniques: users want to retrieve documents on the basis of conceptual content, while individual keywords provide unreliable evidence about the conceptual meaning of a document. There are usually many ways to express a given concept, and users tend to have different preferences when picking terms. This is called synonymy, and the prevalence of synonyms tends to decrease recall performance. On the other hand, most words have
multiple meanings and are used in different contexts. Hence, the terms in a query may literally match the terms in documents that are of no interest to the user at all. In information retrieval this is referred to as polysemy, which may lead to poor precision performance [6]. LSI makes use of Singular Value Decomposition (SVD), a dimension reduction method that allows the arrangement of the space to reflect the major associative patterns in the data and to ignore the smaller, less important influences. As a result, terms that did not actually appear in a document may end up close to the document in the transformed space, if that is consistent with the major patterns of association in the data. Retrieval proceeds by using the terms in a query to identify a point in the semantic space, and documents in its neighborhood are returned as relevant results to the query. The latent semantic indexing technique has been successfully applied to textual information retrieval, where it has shown a distinctive power to find the latent correlation between terms and documents [2, 3, 6]. In a previous study [24, 25], we used LSI to reveal the underlying semantic nature of image contents, and thus to find the correlation between image features and the semantics of the image or its objects. In [26] we explored applying LSI to the domain of Web retrieval. Experimental results verified that LSI helps improve retrieval performance in both cases.
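The core SVD truncation behind LSI can be sketched in a few lines of NumPy; this is an illustrative sketch, not the paper's implementation, and the function name is ours. Given a feature-frame (or term-document) matrix A, the rank-k approximation A_k = U_k Σ_k V_kᵀ keeps only the k largest singular values.

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k SVD approximation A_k = U_k S_k V_k^T of a
    feature-frame matrix A (features x frames)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

In Section 5 this truncation is applied with k = 12; by the Eckart-Young theorem, A_k is the best rank-k approximation of A in the least-squares sense, which is what discards the "smaller, less important influences."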
5. Experimental Results

Our data set consisted of 8 video clips with a total length of 496 seconds, containing 255 abrupt shot transitions and 60 gradual shot transitions. Almost all of the common editing effects, such as cuts, fades, wipes, and dissolves, can be found in these clips. The topics of the clips fall into categories such as outdoor scenes, news stories, TV commercials, and movie trailers. Our shot detection evaluation platform is shown in Figure 2. This system supports video playback and frame-by-frame browsing. It allows us to compare the performance of various methods, including pair-wise comparison, global and local color histogram, color histogram with χ², color anglogram, latent semantic indexing, and their combinations. Several parameters, such as the number of blocks, the number of frames, and the various thresholds used in feature extraction and similarity comparison, can be adjusted. Statistical results for both abrupt and gradual transitions, together with their locations and strength, are presented in a friendly interface. Two thresholds, T1 and T2, similar to those in [23], are involved in the similarity comparison process. If the distance between two consecutive frames fi and fi+1 is above T1, a shot transition is identified between frames fi and fi+1. If the distance between fi and fi+1 is
below T2, the two frames are considered to be within the same shot. If the distance falls between these two thresholds, further examination is necessary to determine whether the distance results from a gradual transition. A certain number of frames, N, can be specified by the user, which allows the system to analyze the cumulative similarity of frames fi-N, fi-N+1, ..., and fi. This cumulative measure is then compared against frame fi+1 to determine whether a transition exists between fi and fi+1. In our experiment, we compared the performance of our color anglogram approach with that of the color histogram method, since color histogram provides one of the best performances among existing techniques. Each frame is converted into the HSV color space. For each pixel of the frame, hue and saturation are extracted and each is quantized into a 10-bin histogram. The two histograms h and s are then combined into one h × s histogram with 100 bins, which is the representative feature vector of the frame: a vector of 100 elements, F = [f1, f2, f3, ..., f100]^T. To apply the latent semantic indexing technique, a feature-frame matrix A = [F1, ..., Fn], where n is the total number of frames of a video clip, is constructed from the feature vectors. Each row corresponds to one of the feature elements and each column is the entire feature vector of the corresponding frame. Singular Value Decomposition is performed on the feature-frame matrix, yielding three matrices U, Σ, and V, where A = UΣV^T. The dimensions of U, Σ, and V are 100 × 100, 100 × n, and n × n, respectively. For our data set, the total number of frames of each clip, n, is greater than 100. To reduce the dimensionality of the transformed space, we use a rank-k approximation, Ak = UkΣkVk^T, of the matrix A, where k = 12. The dimension of Ak is the same as that of A, 100 × n.
The dimensions of Uk, Σk, and Vk are 100 × 12, 12 × 12, and n × 12, respectively. The following normalization process assigns equal emphasis to each component of the feature vector. For the feature-frame matrix A = [F1, F2, ..., Fn], let Ai,j be the ith component of vector Fj. Assuming a Gaussian distribution, we can obtain the mean, µi, and standard deviation, σi, of the ith component of the feature vector. We then normalize the feature-frame matrix into the range [-1, 1] using
Ai,j = (Ai,j − µi) / σi .
For a Gaussian distribution, the probability of an entry falling into the range [-1, 1] (i.e., within one standard deviation of the mean) is approximately 68%. In practice, we map all entries into the range [-1, 1] by forcing out-of-range values to either -1 or 1, whichever is nearer. We then shift the entries into the range [0, 1] using
Ai,j = (Ai,j + 1) / 2 .
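The two formulas above (per-feature standardization, clipping, and shifting) can be sketched as a single NumPy routine; the function name and the zero-variance guard are our additions.

```python
import numpy as np

def normalize_features(A):
    """Per-feature Gaussian normalization of a feature-frame matrix
    (features x frames): standardize each row, clip to [-1, 1],
    then shift into [0, 1]."""
    mu = A.mean(axis=1, keepdims=True)
    sigma = A.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0          # guard: constant features stay at 0.5
    Z = np.clip((A - mu) / sigma, -1.0, 1.0)
    return (Z + 1.0) / 2.0
```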
After this normalization process, each component of the feature-frame matrix is a value between 0 and 1, and thus no component is biased in the computation of similarity. Applying different weights to different components has been a common practice for improving retrieval performance [8]. Both global and local weights are considered in our experiment. A global weight indicates the overall importance of a component in the feature vector across all frames; the same global weight is therefore applied to an entire row of the matrix. A local weight is applied to each element, indicating the relative importance of the component within its vector. The value of any component Ai,j is thus L(i, j)G(i), where L(i, j) is the local weight for feature component i in frame j, and G(i) is the global weight for that component. Common local weighting techniques include term frequency, binary, and log of term frequency, whereas common global weighting methods include normal, gfidf, idf, and entropy. Previous research has found that log(1 + term frequency) helps to dampen the effects of large differences in frequency and thus has the best performance as a local weight, whereas entropy is the most appropriate method for global weighting [8]. The entropy method is defined as
G(i) = 1 + Σj [ pij log(pij) / log(number_of_documents) ]
where,
pij = tfij / gfi
is the probability of that component, tfij is the raw frequency of component Ai,j, and gfi is the global frequency, i.e., the total number of times that component i occurs across all frames. The global weights give less emphasis to components that occur frequently or in many frames. Theoretically, the entropy method is the most sophisticated weighting scheme, as it takes into account the distribution of feature components over the set of all frames.

The experimental results are presented in Table 1. The measures of recall and precision are used to evaluate shot detection performance. First, we applied the color histogram method and evaluated it with and without latent semantic indexing. Better performance is achieved by integrating color histogram with latent semantic indexing. This validates our belief that LSI can help discover the correlation between visual features and high-level concepts.
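For reference, the log-entropy weighting scheme defined above (log(1 + tf) local weight times entropy global weight) can be sketched as follows; this is our illustrative NumPy rendering, and it assumes more than one frame so that log(number_of_documents) is nonzero.

```python
import numpy as np

def log_entropy_weight(tf):
    """Apply log(1 + tf) local weighting and entropy global weighting
    to a raw frequency matrix tf (features x frames, at least 2 frames)."""
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)      # global frequency per feature
    gf[gf == 0] = 1.0                       # guard: features that never occur
    p = tf / gf
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    G = 1.0 + plogp.sum(axis=1) / np.log(n_docs)   # entropy global weight
    return np.log(1.0 + tf) * G[:, None]
```

Note the behavior this produces: a feature spread uniformly over all frames gets G = 0 (fully de-emphasized), while a feature concentrated in a single frame gets G = 1 (full emphasis), matching the stated intent of giving less weight to components that occur in many frames.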
For our experiments with color anglogram, we divide each frame into 64 blocks and compute the average hue and average saturation values of each block. The average hue values are quantized into 10 bins, as are the average saturation values. For each quantized hue (saturation) value, we apply Delaunay triangulation to the point feature map. We count the two largest angles of each triangle in the triangulation and categorize them into anglogram bins of 5° each. Our vector representation of a frame thus has 720 elements: 36 bins for each of the 10 hue values and 36 bins for each of the 10 saturation values. In this case, for each video clip the dimension of its feature-frame matrix is 720 × n, where n is the total number of frames. As discussed above, we reduce the dimensionality of the feature-frame matrix to k = 12. Based on the experimental results of our previous studies [24, 25], we noticed that normalization and weighting have a negative impact on the performance of color anglogram; therefore, we do not apply normalization and weighting to the elements of the feature-frame matrix. From the results in Table 1 we see that color anglogram achieves better performance than color histogram, consistent with our previous studies [22, 24, 25, 26]. We also note that the best performance, in both recall and precision, is obtained by integrating color anglogram with latent semantic indexing.
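The two-threshold decision rule used throughout these experiments can be sketched as follows. The threshold values, the L1 distance, and the label names are illustrative choices of ours; in the actual system, T1 and T2 are tuned per feature type, and "candidate" gaps are further examined with the cumulative similarity over N preceding frames.

```python
import numpy as np

# Illustrative thresholds; the system tunes T1 and T2 empirically.
T1, T2 = 0.6, 0.3

def frame_distances(features):
    """L1 distance between consecutive frame feature vectors, given as
    columns of a feature-frame matrix (e.g., after LSI reduction)."""
    return np.abs(np.diff(features, axis=1)).sum(axis=0)

def classify_transitions(dists, t1=T1, t2=T2):
    """Label each consecutive-frame gap: 'cut' above T1, 'none' below T2,
    and 'candidate' (needs cumulative-similarity analysis) in between."""
    labels = []
    for d in dists:
        if d > t1:
            labels.append("cut")
        elif d < t2:
            labels.append("none")
        else:
            labels.append("candidate")
    return labels
```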
6. Conclusions

In this paper, we presented our work that seeks to negotiate the gap between low-level features and high-level concepts in the domain of video shot detection. We introduced a novel technique for spatial color indexing, color anglogram, which is invariant to rotation, scaling, and translation. We also experimented with a dimension reduction technique, latent semantic indexing (LSI), to uncover the semantic correlation between video frames. Experimental results show that LSI is able to correlate visual features, either color or spatial color, to construct higher-level semantic clusters. The results also show that color anglogram is more accurate than color histogram in capturing and emphasizing meaningful features in video contents. Being invariant to rotation, scaling, and translation, it is more robust with regard to object and camera movements, and thus helps improve performance in more complex situations, especially gradual transitions. Comparing all the experimental results, we validated that both color anglogram and LSI help narrow the semantic gap; combining the two provides a new shot detection technique that is both reliable and effective. To improve the performance of our approach, a more in-depth study of threshold selection is necessary. It would also be interesting to explore how to capture the similarity
among multiple frames to tackle the problems with complex gradual transitions. We propose to extend our approach into a video analysis and retrieval framework in which multiple features and techniques can be integrated. As the first step, various clustering techniques, along with shape anglogram, will be added into this framework.
References

[1] P. Aigrain and P. Joly, The Automatic Real-Time Analysis of Film Editing and Transition Effects and Its Applications, Computers and Graphics, Volume 18, Number 1, 1994, pp. 93-103.
[2] M. Berry, Z. Drmac, and E. Jessup, Matrices, Vector Spaces, and Information Retrieval, SIAM Review, Volume 41, Number 2, 1999, pp. 335-362.
[3] M. Berry, S. T. Dumais, and G. W. O'Brien, Using Linear Algebra for Intelligent Information Retrieval, SIAM Review, 1995, pp. 573-595.
[4] J. Boreczky and L. Rowe, Comparison of Video Shot Boundary Detection Techniques, Proceedings of the SPIE Conference on Storage and Retrieval for Video Databases IV, San Jose, CA, February 1995.
[5] A. Dailianas, R. B. Allen, and P. England, Comparison of Automatic Video Segmentation Algorithms, Proceedings of SPIE Photonics West, Philadelphia, October 1995.
[6] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Volume 41, Number 6, 1990, pp. 391-407.
[7] A. Del Bimbo, Visual Information Retrieval, Morgan Kaufmann, San Francisco, CA, 1999.
[8] S. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments, and Computers, Volume 23, Number 2, 1991, pp. 229-236.
[9] R. A. Dwyer, A Faster Divide-and-Conquer Algorithm for Constructing Delaunay Triangulations, Algorithmica, Volume 2, Number 2, 1987, pp. 127-151.
[10] S. Fortune, A Sweepline Algorithm for Voronoi Diagrams, Algorithmica, Volume 2, Number 2, 1987, pp. 153-174.
[11] U. Gargi, R. Kasturi, and S. H. Strayer, Performance Characterization of Video-Shot-Change Detection Methods, IEEE Transactions on Circuits and Systems for Video Technology, Volume 10, Number 1, 2000, pp. 1-13.
[12] U. Gargi and R. Kasturi, An Evaluation of Color Histogram Based Methods in Video Indexing, International Workshop on Image Databases and Multimedia Search, Amsterdam, August 1996, pp. 75-82.
[13] U. Gargi, S. Oswald, D. Kosiba, S. Devadiga, and R. Kasturi, Evaluation of Video Sequence Indexing and Hierarchical Video Indexing, Proceedings of the SPIE Conference on Storage and Retrieval in Image and Video Databases, 1995, pp. 1522-1530.
[14] V. N. Gudivada and V. V. Raghavan, Content-Based Image Retrieval Systems, IEEE Computer, Volume 28, September 1995, pp. 18-22.
[15] A. Hanjalic, Shot-Boundary Detection: Unraveled and Resolved?, IEEE Transactions on Circuits and Systems for Video Technology, Volume 12, Number 2, 2002, pp. 90-105.
[16] R. Kasturi and R. Jain, Dynamic Vision, Computer Vision: Principles, R. Kasturi and R. Jain (Eds.), IEEE Computer Society Press, Washington, DC, 1991, pp. 469-480.
[17] I. Koprinska and S. Carrato, Temporal Video Segmentation: A Survey, Signal Processing: Image Communication, Volume 16, 2001, pp. 477-500.
[18] A. Nagasaka and Y. Tanaka, Automatic Video Indexing and Full Video Search for Object Appearances, IFIP Transactions on Visual Database Systems II, E. Knuth and L. M. Wegner (Eds.), Elsevier, 1992, pp. 113-127.
[19] M. R. Naphade and T. S. Huang, Extracting Semantics From Audiovisual Content: The Final Frontier in Multimedia Retrieval, IEEE Transactions on Neural Networks, Volume 13, Number 4, 2002, pp. 793-810.
[20] J. O'Rourke, Computational Geometry in C, Cambridge University Press, Cambridge, England, 1994.
[21] M. J. Swain and D. H. Ballard, Color Indexing, International Journal of Computer Vision, Volume 7, Number 1, 1991, pp. 11-32.
[22] Y. Tao and W. I. Grosky, Spatial Color Indexing Using Rotation, Translation, and Scale Invariant Anglograms, Multimedia Tools and Applications, Volume 15, 2001, pp. 247-268.
[23] H. Zhang, A. Kankanhalli, and S. Smoliar, Automatic Partitioning of Video, Multimedia Systems, Volume 1, Number 1, 1993, pp. 10-28.
[24] R. Zhao and W. I. Grosky, Bridging the Semantic Gap in Image Retrieval, Distributed Multimedia Databases: Techniques and Applications, T. K. Shih (Ed.), Idea Group Publishing, Hershey, Pennsylvania, 2001, pp. 14-36.
[25] R. Zhao and W. I. Grosky, Negotiating the Semantic Gap: From Feature Maps to Semantic Landscapes, Pattern Recognition, Volume 35, Number 3, 2002, pp. 593-600.
[26] R. Zhao and W. I. Grosky, Narrowing the Semantic Gap – Improved Text-Based Web Document Retrieval Using Visual Features, IEEE Transactions on Multimedia, Volume 4, Number 2, 2002, pp. 189-200.
Figure 1. Color anglogram: (a) a sample image; (b) hue values; (c) saturation values; (d) Delaunay triangulation; (e) anglogram (feature point histogram of the number of angles per bin).

Figure 2. Video shot detection evaluation system.
Table 1. Evaluation of experimental results

                            Abrupt Shot Transition    Gradual Shot Transition
                            Precision    Recall       Precision    Recall
Color Histogram             70.6%        82.7%        62.5%        75.0%
Color Histogram with LSI    74.9%        83.1%        65.8%        80.0%
Color Anglogram             76.5%        88.2%        69.0%        81.7%
Color Anglogram with LSI    82.9%        91.4%        72.6%        88.3%