Fast Video Segment Identification from Large Video Collection

Junsong Yuan 1,2, Lingyu Duan 1, Qi Tian 1
1 Inst. for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
2 Dept. of ECE, National Univ. of Singapore
{jyuan, tian, Lingyu}@i2r.a-star.edu.sg
Abstract. In this paper we design a new global visual feature and use it as a signature for fast, approximate ("fast and dirty") video segment identification in collections containing a large number of sequences. Unlike previous key-frame-based shot representations, the proposed method combines temporal-spatial and color range information in an appropriate way and can effectively and robustly characterize video segments of varied length using a fixed-size 144-d signature. By using a global visual feature, the ambiguity of key frame selection and the gradual shot transition problem are avoided. Moreover, since the proposed feature can be extracted directly from the MPEG compressed domain with a low-cost process, it is inexpensive to extract and acquire. In our experiments, applying the active search algorithm on the new global feature supports fast search: a 15-sec clip is located in a 10.5-hour MPEG video database in merely 2 seconds. The feature is also robust to color shifting and other video variations caused by digital devices.
1 Introduction

Video segment identification has many applications, such as video content monitoring [8], copyright enforcement [4][5], and video structure analysis [13]. For video segment identification in a large database of video sequences, the query clip is usually first represented as a feature vector or a set of feature vectors, which is expected to be a unique signature in a high-dimensional space [4][10]. The obtained signature is then sequentially compared with those of a series of sliding matching windows in the target video stream [2], or used for fast searching over a previously built index structure [10]. Instance detection is finally performed on the resulting similarity values. In general, video segment identification involves two challenging problems: representation and searching, namely how to select features that describe the video content uniquely and robustly, and how to accelerate the search process based on the extracted features.

Among the many video representation techniques, key-frame-based shot representation is the most popular and has extensive applications in video browsing, indexing, and retrieval [1]. However, when applied to video segment identification, the traditional key-frame-based representation has several drawbacks. First of all, its performance strongly depends on the accuracy of the shot segmentation algorithm and on the selection of key frames to characterize the video content. For example, when the query clip has gradual shot boundaries or contains only a very limited number of shots, key frames cannot represent the whole clip informatively. Besides, it is somewhat ambiguous which frames should be chosen as key frames: even for the same video segment, the key-frame-based representation may vary significantly under different key frame selection criteria. Furthermore, key frame representation fails to reflect temporal duration.

In consideration of the above issues, Ferman et al. [6] present various histogram-based color descriptors to reliably capture the color properties of video segments. Although such descriptors are reported to be robust to outlier frames within a shot, only color range information is considered, while both the spatial information within each individual frame and the temporal information are ignored. Our experiments show, however, that spatial and color range information are both important for the identification task.

Search speed is another important issue of video segment retrieval and identification in practical applications, especially when the video collection is large. Different from an image database, a video collection consists of a large number of sequences; therefore subsequence matching is usually the dominant method for locating a given clip in such a collection. Kashino et al. [2] improved the conventional signal detection technique for similarity-based search by introducing a temporal pruning algorithm called "active search". Nevertheless, their feature extraction also considers only color features without spatial information, and a training process such as vector quantization is needed to obtain the optimal feature parameters.
Regarding the challenges mentioned above, in this paper we design a new visual feature that contains both color range and spatial information and apply it to robust video segment identification. As the feature is compatible with the active search algorithm, fast search can also be achieved by combining the two. Different from [2], our method does not perform vector quantization, and thus saves the learning procedure.
2 Overview

With the purpose of compactly and robustly representing a video segment, we propose the Ordinal Pattern Distribution (OPD) histogram as the ordinal feature and the cumulative color histogram as the color feature, and combine them in a suitable way. These features are described in detail in Section 3. In Section 4, we employ the active search algorithm [2] to accelerate the search process based on the proposed feature. Figure 1 illustrates the whole process of our method. Our method has the following characteristics:

- Unlike key frame representation, the proposed global visual feature describes the video segment as a whole, so troublesome shot boundary detection and key frame selection are avoided.
- The spatial feature is introduced to complement the color feature, so that the video content is represented robustly and compactly.
- Fast search speed is achieved by using the active search algorithm, while the training process is saved.
- The compressed-domain feature is inexpensive to extract and acquire.
Fig. 1. System Chart. (Compressed-domain features, i.e. the DC image sequences of I frames, are extracted from both the MPEG video collection and the query video segment; ordinal and color feature similarities are measured over a sliding temporal window, combined, and used for instance detection by active search.)
3 Feature Extraction and Group-of-Frames Representation

As one of the most common visual features, the color histogram is extensively used in video retrieval and identification [3][4]. [3] applies compressed-domain color features to form a compact signature for fast video search. In [4], each individual frame is represented by four 178-bin color histograms in the HSV color space, and spatial information is incorporated by partitioning the image into four quadrants. Despite certain levels of success in [3] and [4], the drawbacks are also obvious: the color histogram is fragile to color distortion, and it is inefficient to describe each individual key frame with its own color histogram as in [4]. Another type of feature, robust to color distortion, is the ordinal feature [10]. Hampapur et al. [5] compared the performance of the ordinal feature, motion feature, and color feature for video sequence matching, and concluded that the ordinal signature performs best. Based on this conclusion, we believe better performance can be achieved by appropriately combining the ordinal feature and the color range feature, with the former providing spatial information and the latter providing range information. Experiments in Section 5 confirm this. As a matter of fact, many works such as [1] and [12] also adopt combined features in order to improve retrieval and identification performance.
In general, the selection of the ordinal feature and the color feature as the identification signature is motivated by the following reasons: (1) compared with more costly features that also contain spatial information, such as edge, texture, or refined color histograms like the color coherence vector (CCV) [12], they are inexpensive to acquire; (2) such features are compact signatures [10], yet they retain perceptual meaning; (3) the ordinal feature is immune to global changes in the quality of the video and contains spatial information, and is therefore a good complement to the color feature.
3.1 Ordinal Feature Description

In our approach, we simply use the I frames of the MPEG videos as sub-sampled frames to represent the video. Selecting I frames as representative frames brings two main advantages. Firstly, the uniformly sub-sampled I frames of an MPEG video have coarse temporal granularity, typically hundreds of milliseconds (depending on the MPEG GOP parameter); consequently, I frames represent the video compactly while remaining informative enough for the identification task. Secondly, the Y, Cb, and Cr values of each I frame can be estimated from its MPEG DC coefficients, which greatly reduces the computational cost.
Fig. 2. Ordinal Feature Description. (Each I frame of the video segment is reduced to a 2 × 2 image; the ordinal measure ranks the four sub-image averages, e.g. the averages 198, 105, 147, 77 yield the rank pattern 1, 3, 2, 4, i.e. pattern code P(1324) = 3, with P(1234) = 1, P(1243) = 2, ..., and P(4321) = 24; the pattern codes over all frames form the Ordinal Pattern Distribution histogram.)
As described in Figure 2, each I frame is represented by a reduced image of size 2 × 2. For each of the Y, Cb, and Cr channels, we calculate the average value of each of the 4 sub-images from the DC coefficients extracted in the compressed domain. After this raw feature extraction, 12 coefficients in total (#Y/#Cb/#Cr = 4/4/4) represent an individual I frame. Raw feature extraction is then followed by the ordinal measure process [5]. Considering that the number of possible ordinal measure combinations is limited (4! = 24 possible patterns for the 4 sub-images), each possible combination can be treated as an individual pattern. Therefore, for the set of frames in a video segment, we can form the Ordinal Pattern Distribution (OPD) histogram as a global descriptor. After the above operations, for each channel c = Y, Cb, Cr, the video clip is represented as:

H_c^opd = (h_1, h_2, ..., h_l, ..., h_N),   0 ≤ h_i ≤ 1   and   Σ_i h_i = 1        (1)
Here N = 4! = 24 is the dimension of the histogram, namely the number of possible patterns mentioned above. The total dimension of the ordinal feature is 72. The advantages of using the Ordinal Pattern Distribution (OPD) histogram as a visual feature are twofold. First, it is robust to frame size change and color shifting, as mentioned above. Second, the contour of the pattern distribution histogram describes the whole clip globally; it is therefore insensitive to frame rate changes and other local frame changes, compared with key frame representation.

3.2 Color Feature

For the color feature, rather than representing each individual I frame as a color histogram,
we characterize the color information of a GoF by using the cumulative color information of all the sub-sampled frames in it. For computational simplicity, the cumulative color distribution is also estimated using the DC coefficients of the I frames. The normalized cumulative histogram is defined as:
H^ccd(j) = (1/M) Σ_{i=b_k}^{b_k+M−1} H_i(j),   j = 1, ..., B        (2)
where H_i (i = b_k, b_k+1, ..., b_k+M−1) denotes the color histogram of an individual I frame in the segment, M is the total number of I frames, and B is the number of color bins. In this paper, B is set to 24 with uniform quantization. As a result, the total dimension of the color feature is also 72.
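To make the construction concrete, the two per-channel histograms can be sketched as follows. This is an illustrative single-channel sketch, not the paper's implementation: all helper names are our own, `frames` stands for the four DC-based sub-image averages of each sub-sampled I frame, and we assume rank 1 corresponds to the largest average, matching the example in Fig. 2.

```python
from itertools import permutations

# Map each of the 4! = 24 rank permutations of the four sub-images
# to a pattern code 1..24 in lexicographic order, so that
# P(1234) = 1, P(1243) = 2, P(1324) = 3, ..., P(4321) = 24 (as in Fig. 2).
PATTERN_CODE = {p: i + 1 for i, p in enumerate(permutations((1, 2, 3, 4)))}

def ordinal_pattern(block):
    """Rank the four sub-image averages (rank 1 = largest, per Fig. 2)."""
    order = sorted(range(4), key=lambda k: -block[k])
    ranks = [0] * 4
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return PATTERN_CODE[tuple(ranks)]

def opd_histogram(frames):
    """24-bin Ordinal Pattern Distribution histogram over a group of frames."""
    h = [0.0] * 24
    for block in frames:
        h[ordinal_pattern(block) - 1] += 1
    m = len(frames)
    return [v / m for v in h]  # normalized so the bins sum to 1 (eq. 1)

def cumulative_color_histogram(frame_histograms):
    """Average the per-frame B-bin color histograms over the segment (eq. 2)."""
    m = len(frame_histograms)
    b = len(frame_histograms[0])
    return [sum(h[j] for h in frame_histograms) / m for j in range(b)]
```

Running each of the three channels through both functions yields the fixed-size 3 × 24 + 3 × 24 = 144-d signature, regardless of segment length.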
4 Similarity Search and Instance Detection

For visual feature histogram matching, we use the Euclidean distance as the dissimilarity measure between the given clip Q and the sliding matching window SW. For each channel Y, Cb, and Cr, the ordinal feature distance is defined as:

D_c^opd(H_Q^opd, H_SW^opd) = Σ_{i=1}^{N} (h_Q^opd(i) − h_SW^opd(i))²,   c = Y, Cb, Cr        (3)
Similarly, for the cumulative color histogram, the distance between the given clip Q and the sliding matching window SW is defined as:

D_c^ccd(H_Q^ccd, H_SW^ccd) = Σ_{i=1}^{N} (h_Q^ccd(i) − h_SW^ccd(i))²,   c = Y, Cb, Cr        (4)
The integrated similarity over the whole matching window is defined as the reciprocal of a linear combination of the average ordinal pattern distribution distance and the minimum cumulative color distribution distance over the Y, Cb, and Cr channels:

D_I^opd(H_Q^opd, H_SW^opd) = (1/3) Σ_{c=Y,Cb,Cr} D_c^opd(H_Q^opd, H_SW^opd)        (5)

D_I^ccd(H_Q^ccd, H_SW^ccd) = min_{c=Y,Cb,Cr} {D_c^ccd(H_Q^ccd, H_SW^ccd)}        (6)

S_I(H_Q, H_SW) = 1 / (w × D_I^opd + (1 − w) × D_I^ccd)        (7)
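A minimal sketch of this similarity combination (eqs. 3 to 7); the function and variable names are ours, not the paper's, and w = 0.5 as used later in Section 4:

```python
def squared_dist(hq, hsw):
    """Sum of squared bin differences between two histograms (eqs. 3 and 4)."""
    return sum((a - b) ** 2 for a, b in zip(hq, hsw))

def integrated_similarity(opd_q, opd_sw, ccd_q, ccd_sw, w=0.5):
    """Combine ordinal and color distances over Y, Cb, Cr into S_I (eqs. 5-7).

    opd_q / opd_sw: dicts mapping channel name -> 24-bin OPD histogram;
    ccd_q / ccd_sw: dicts mapping channel name -> 24-bin cumulative color
    histogram. Assumes Q and SW are not identical (non-zero total distance),
    otherwise the reciprocal in eq. (7) is undefined.
    """
    channels = ("Y", "Cb", "Cr")
    d_opd = sum(squared_dist(opd_q[c], opd_sw[c]) for c in channels) / 3  # eq. 5
    d_ccd = min(squared_dist(ccd_q[c], ccd_sw[c]) for c in channels)     # eq. 6
    return 1.0 / (w * d_opd + (1 - w) * d_ccd)                           # eq. 7
```

Averaging the ordinal distances while taking the minimum of the color distances reflects the asymmetry in the design: the ordinal part is trusted in all channels, while the color part only needs one channel to match well.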
Let the similarity array be {S_i; 1 ≤ i ≤ m + n − 1}, corresponding to the m + n − 1 sliding windows, where n and m are the numbers of I frames in the given clip and the target stream, respectively. Based on [2] and [12], the search process can be accelerated by skipping w_i windows at each step.
w_i = floor((1/(2D)) × (1/S_i − θ)) + 1   if 1/S_i > θ;   otherwise w_i = 1        (8)

and

S_i > max{T, m + kσ}        (9)
where T is the pre-defined preliminary threshold, m is the mean and σ the standard deviation of the similarity curve, and k is an empirically determined constant. Only when the similarity value exceeds the maximum of T and m + kσ is it treated as a detected instance. In our experiments, w in eq. (7) is set to 0.5, θ in eq. (8) is set to 0.1, and T in eq. (9) is 6.
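Under our reconstruction of eqs. (8) and (9), the pruning and detection steps can be sketched as follows; `d_max` stands for the bound D in eq. (8), which we assume limits the per-step change of the combined distance, and the function names are ours:

```python
import math

def skip_width(s_i, d_max, theta=0.1):
    """Skip width w_i for active search (our reconstruction of eq. 8).

    The farther the current window is from the query (the larger the
    combined distance 1/S_i), the more neighboring windows can be skipped
    without missing any window whose distance could fall below theta.
    """
    if 1.0 / s_i > theta:
        return math.floor((1.0 / s_i - theta) / (2.0 * d_max)) + 1
    return 1

def is_instance(s_i, t, mean, sigma, k):
    """Detection rule (eq. 9): the similarity must exceed both the
    preliminary threshold T and the adaptive threshold mean + k * sigma."""
    return s_i > max(t, mean + k * sigma)
```

With the reciprocal similarity of eq. (7), 1/S_i is exactly the combined distance, so a low-similarity window justifies a proportionally large jump along the stream.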
5 Experimental Results

All the simulations were performed on a standard P4 2.53 GHz PC with 512 MB memory. The algorithm was implemented in C++. The query clip collection consists of 83 individual commercials varying in length from 5 to 60 seconds, plus one 10-second news program lead-out clip (Fig. 3). All 84 query clips were taken from ABC TV news programs. The experiment seeks to identify and locate these clips inside the target video collection, which contains 22 half-hour ABC TV news broadcasts. The 83 commercials appear in 209 instances in these half-hour news programs, and the lead-out clip appears in 11 instances in total. All the video data were encoded in MPEG-1 at 1.5 Mb/sec with an image size of 352 × 240 or 352 × 264 and a frame rate of 29.97 fps, using the frame pattern IBBPBBPBBPBB, i.e. an I frame spacing of around 400 ms.
Fig. 3. ABC News program lead-out clip (left) and its ordinal feature and color feature representation (right): the Ordinal Pattern Distribution histogram and the cumulative color histogram over the Y, Cb, and Cr channels, plotted as bin percentages over the 72-d vector dimensionality.
Table 1 gives the approximate computational cost of the algorithm. The feature extraction process includes DC coefficient extraction from the compressed domain and the formation of the color histogram (3 × 24-d) of each I frame. Feature processing during active search forms the color and ordinal features for the specific matching windows, and its cost varies with the length of the window. If the query length is known or fixed beforehand, the feature processing step can also be finished off-line; in that case, searching through a video database of tens of hours may cost only tens of milliseconds.

Table 1. Approximate Computational Cost (CPU time). Task: search for a 10-sec query clip.
Feature Extraction (10.5 h MPEG-1 video):              1178.034 sec
Active Search, Feature Processing (Ordinal Feature):      0.969 sec
Active Search, Feature Processing (Color Feature):        0.688 sec
Active Search, Histogram Matching:                        0.011 sec
The performance of searching for the instances of the 84 given clips in the 10.5-hour video collection is presented in Figure 4. From the experimental results we found that a large part of the false alarms and miss detections are caused by the I frame shifted matching problem, i.e. when the sub-sampled I frames of the given clip and those of the matching window are not well aligned on the temporal axis. Although the proposed feature alone cannot achieve 100% accuracy, it obtains performance comparable to that of [13], where only an ordinal feature with N = 720 is considered. However, compared with the 3 × 720 = 2160-dimensional feature of [13], our proposed feature is a 6 × 24 = 144-dimensional vector, 15 times smaller. Figure 4 also shows that the combined feature clearly outperforms the color feature or the ordinal feature used alone.
Fig. 4. Performance comparison using different features: the proposed feature (N=24, B=24) vs. the 720-d ordinal feature (left); the proposed feature vs. the 24-d cumulative color feature and the 24-d ordinal feature, respectively (right). The precision-recall curves are generated by varying the parameter k in eq. (9). (Precision = detects / (detects + false alarms); Recall = detects / (detects + miss detects).)
6 Conclusion and Future Work

Rather than selecting representative key frames to describe the video, the proposed descriptor treats the video segment as a whole and forms a global feature signature to uniquely represent the video content. Such a representation scheme can handle video clips of variable length, such as a shot, a sub-shot, or a group of shots, and it does not require exact shot boundary detection. In the experiments, the proposed Ordinal Pattern Distribution histogram proved to be an effective complement to the color histogram descriptor, and the ordinal feature also reflects the global distribution of the frames within a video segment. However, since the proposed representation does not reflect the order of frames within the video, temporal information is not used sufficiently. Although this characteristic is useful for certain applications, such as detecting commercials with different shot orders [8], the lack of frame ordering information may make the extracted signature less distinguishable. Our future work will address how to incorporate temporal information, how to represent the video content more robustly, and how to further speed up the search process.
References
1. A. K. Jain et al., "Query by video clip," Multimedia Systems, Vol. 7, pp. 369-384, 1999.
2. K. Kashino et al., "A Quick Search Method for Audio and Video Signals Based on Histogram Pruning," IEEE Trans. on Multimedia, Vol. 5, No. 3, pp. 348-357, 2003.
3. M. R. Naphade et al., "A Novel Scheme for Fast and Efficient Video Sequence Matching Using Compact Signatures," in Proc. SPIE, Storage and Retrieval for Media Databases 2000, Vol. 3972, pp. 564-572, 2000.
4. S. S. Cheung and A. Zakhor, "Efficient video similarity measurement with video signature," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 13, Issue 1, pp. 59-74, Jan. 2003.
5. A. Hampapur, K. Hyun, and R. Bolle, "Comparison of Sequence Matching Techniques for Video Copy Detection," in Proc. SPIE, Storage and Retrieval for Media Databases 2002, Vol. 4676, pp. 194-201, San Jose, CA, USA, Jan. 2002.
6. A. M. Ferman et al., "Robust color histogram descriptors for video segment retrieval and identification," IEEE Trans. on Image Processing, Vol. 1, Issue 5, May 2002.
7. L. Chen and T. S. Chua, "A match and tiling approach to content-based video retrieval," in Proc. of ICME'01, pp. 301-304, 2001.
8. V. Kulesh et al., "Video clip recognition using joint audio-visual processing model," in Proc. of ICPR'02, Vol. 1, pp. 500-503, 2002.
9. D. N. Bhat and S. K. Nayar, "Ordinal measures for image correspondence," IEEE Trans. on PAMI, Vol. 20, No. 4, pp. 415-423, 1998.
10. J. Oostveen et al., "Feature extraction and a database strategy for video fingerprinting," in Visual 2002, LNCS 2314, pp. 117-128, 2002.
11. A. Kimura et al., "A Quick Search Method for Multimedia Signals Using Feature Compression Based on Piecewise Linear Maps," in Proc. of ICASSP'02, Vol. 4, pp. 3656-3659, May 2002.
12. G. Pass et al., "Comparing images using color coherence vectors," in Proc. of ACM Multimedia'96, pp. 65-73, 1996.
13. J. Yuan, Q. Tian, and S. Ranganath, "Fast and Robust Search Method for Short Video Clips from Large Video Collection," in Proc. of ICPR'04, Aug. 2004 (to appear).