An Efficient Near-Duplicate Video Shot Detection Method Using Shot-Based Interest Points

Xiangmin Zhou, Xiaofang Zhou, Senior Member, IEEE, Lei Chen, Member, IEEE, Athman Bouguettaya, Senior Member, IEEE, Nong Xiao, and John A. Taylor
Abstract—We propose a shot-based interest point selection approach for effective and efficient near-duplicate search over a large collection of video shots. The basic idea is to eliminate the local descriptors with lower frequencies among the selected video frames of a shot, so that the shot representation is compact and discriminative. Specifically, we propose an adaptive frame selection strategy called furthest point Voronoi (FPV) to produce the shot frame set according to the shot content and frame distribution. We describe a novel strategy named reference extraction (RE) to extract the shot interest descriptors from a keyframe with the support of the selected frame set. We demonstrate the effectiveness and efficiency of the proposed approaches with extensive experiments.

Index Terms—Frame set, near-duplicate shots, shot consistency, shot interest point.
I. INTRODUCTION

The advent of the web has made video data ubiquitous. The amount of web video has increased dramatically since the creation of video sharing web sites such as YouTube and Google Video. Users view video clips, download them from the web, edit them, and upload them again through these specialized video sharing communities. As a consequence, a large number of near-duplicate videos are spreading over the Internet. This has built pressure to address the challenge of detecting near-duplicate video clips in many emerging applications such as video copyright detection and video clustering. The fast and accurate detection of near-duplicate shots, the basic near-duplicate video units, has become a vital research problem. We derive our notion of near-duplicate video shots from the definition of near-duplicate keyframe (NDK) [36].
Manuscript received September 09, 2008; revised March 15, 2009. First published May 02, 2009; current version published July 17, 2009. This work was supported in part by ARC Grants (DP0663272 and LE0668542), in part by Hong Kong RGC Grant under Project 611608, and in part by NSFC Key Project Grant 60736013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Marcel Worring.

Xiangmin Zhou, A. Bouguettaya, and J. A. Taylor are with the CSIRO ICT Center, Canberra, Australia (e-mail: [email protected]; [email protected]; [email protected]).

Xiaofang Zhou is with the School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, Australia (e-mail: [email protected]).

L. Chen is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China (e-mail: [email protected]).

N. Xiao is with the National University of Defense Technology, Changsha, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TMM.2009.2021794
Shots are defined as near-duplicates if they have the same content but vary in acquisition time, lighting conditions, and editing operations; such near-duplicates abound in real applications. Formally, given a shot collection S and a query shot q, near-duplicate shot identification is to find all the shots in S that are near-duplicates of q. Here, near-duplicate shots are identified by comparing their keyframes, each of which is represented as a bag of local descriptors [32]. In this paper, we focus on the problem of effective and efficient near-duplicate identification in large video shot collections. We aim to achieve high efficiency of near-duplicate shot identification while preserving high accuracy.

There has been considerable recent work on effective near-duplicate image (keyframe) detection [31], [32], [35], [36]. Most of it focuses on finding more effective and efficient descriptions of an image. In [35], an image is represented by an attributed relational graph (ARG) that models the compositional parts of image scenes as well as their relations. The similarity model is derived from the likelihood ratio of the stochastic transformation process that transforms one ARG into the other. The ARG model was the first to address the problem of near-duplicate image identification, and its similarity measure can be learned from training data in both supervised and unsupervised fashions. However, the accuracy of the ARG model is not satisfactory. Local descriptors extracted from local interest points of images have shown high performance in near-duplicate image identification; examples include SIFT [21] and its variants, such as PCA-SIFT [14] and GLOH [22]. Local descriptors capture distinctive invariant local features from images and are thus robust for matching between different views of an object or scene. Although many existing approaches support near-duplicate keyframe (image) identification [14], [18], [21], [31], [32], [35], [36], they only allow the matching of static frames. Moreover, they assume that all the keyframes extracted from video streams through segmentation are consistent. Faced with the problem of shot matching, existing methods suffer from an overloaded computation cost and fail to ensure the quality of matching, due to the inherent characteristics of videos and the limitations of the techniques concerned.

• Videos are streaming data. In practice, a local interest point can undergo substantial changes along consecutive frames due to viewpoint and illumination changes, and partial occlusion during fast object motion. This degrades the quality of the media descriptions. However, existing work on near-duplicate keyframe identification focuses on static images and does not consider the variations of local
descriptors along videos. Thus, it cannot be directly applied to detect near-duplicate shots.

• Video shots are usually inconsistent. Existing shot detection algorithms fail to find 100% accurate shot boundaries [10], [20]. Once transformations happen among videos, these algorithms may produce inconsistent shots, so inconsistent keyframes exist among the near-duplicate videos. Consequently, current near-duplicate detection methods suffer from an overloaded computation cost and fail to ensure the quality of video matching.

• Stable local interest points cannot be fully guaranteed when the complexities of keyframes differ. Existing proposals reduce the number of local interest points to improve the efficiency of near-duplicate image detection by two strategies: 1) increasing the inclusion threshold value [14], [21]; and 2) selecting the top-K most significant local interest points ranked by contrast values, which reflect the differences between the local interest points and their surroundings [8]. However, while the first approach produces an unstable number of local interest points for images with a lower average intensity level, the second neglects the difference in complexity between various images, since it extracts a fixed number of local interest points for every image. Both approaches may result in low matching accuracy over a video shot collection.

Many existing works have addressed the scalability of video identification [6], [12], [17], [34]. However, they only focus on the identification of near-identical shots [6] or video copies [12], [17], [34], which is not applicable to near-duplicate shots with large variations. Several recent works detect near-duplicate video sequences efficiently [4], [26]–[28]. However, they focus on capturing the global information of videos, and the local similarity of each video is neglected. Sivic et al. [29] use a sequence of adjacent frames in a single shot to reject unstable objects. In this approach, the regions in each video frame are tracked with a simple constant-velocity dynamical model and correlation, and the regions that survive for more than three frames are considered valid descriptions. This work proposed a basic idea for reducing noise regions in video frames. However, it rests on an important assumption, namely that all videos involve motions of the same speed. In practice, the motions involved in different video shots differ to a large extent. When this uniform region rejection method is applied, a large number of noise regions are kept in frames with slow motion, while some important regions of fast-moving objects may be lost. In [7], Chum et al. proposed to exploit enhanced min-Hash techniques for scalable near-duplicate image and video identification, but their evaluation is only based on static images.

This paper proposes shot-based interest points for effective and efficient near-duplicate video shot retrieval. We assume that the variation of each interest point across the frames of a shot is diverse, and we focus on capturing the local descriptors with higher occurrence frequencies from the keyframes. We first propose an adaptive frame selection approach, called FPV, to produce a representative frame set in terms of the shot content. The shot interest points are extracted
based on the selected representative set. Then, we design a new approach, reference extraction (RE), to find the shot interest points of each keyframe. RE uses the matching relationships among the selected reference frames to find the local descriptors with higher occurrence frequencies. To further improve the matching between different local interest points, we incorporate an existing index scheme called LIP-IS to organize the local descriptors and perform efficient candidate pruning [36]. We conduct extensive experiments on a large-scale shot database. The experimental results demonstrate the superiority of the proposed approach over state-of-the-art approaches in both effectiveness and efficiency.

The rest of the paper is organized as follows. Section II provides an overview of related work. Section III describes the framework of the proposed near-duplicate shot retrieval system. Section IV presents our new approaches to selecting shot interest points. We present the evaluation methodology in Section V. Experimental results are reported in Section VI. Section VII concludes the paper.

II. RELATED WORK

In this section, we briefly review existing work related to near-duplicate shot identification in four aspects: video segmentation, invariant local descriptors, the matching of these descriptors, and query processing in near-duplicate image identification.

A. Video Segmentation

Breaking a video sequence into a series of shots, its basic syntactic entities, has been addressed by numerous researchers; a complete overview of existing approaches is provided in [15]. Major approaches to shot detection are based on features extracted from videos, such as pixel difference, edge difference, histogram comparison, motion vectors, variance curves, and linear regression. Color histogram difference is the most common feature used for shot determination. In [5], both RGB and HSV color spaces are employed to produce the color histograms. The color differences between frames are calculated with three distance functions: Euclidean distance, color moments, and the earth mover's distance (EMD). The shot boundaries are determined with an adaptive threshold that takes into account the mean and standard deviation of the EMD values in the neighborhood of peak points, where peak points are those with maximal difference values compared with their neighbors. This approach is good for the detection of cuts and short transitions, but not effective for slow transitions. Considerable effort has also addressed shot boundary determination based on machine learning. In [19], various types of shot transitions are detected by combining a preset threshold with an SVM-based method. A low threshold over histogram difference and mutual information is used to locate candidate abrupt transitions, and an SVM is deployed on those candidate areas to identify the different types of shot transitions. This method performs fast detection in real time, but the accuracy of the shot determination needs improvement. In [20], a divide-and-conquer strategy has been
proposed to reduce the false positive cuts in shot boundary determination. Six independent detectors are built to target the six dominant shot boundary types in this task. Each detector is attached to a finite state machine (FSM) that contains several states, each indicating the current task of that detector. New shot boundaries are detected by checking the states of all FSMs and then verifying the potential transition candidates. The results of all detectors are merged in temporal order to form the final shot determination results. We choose this technique as the best practice for video segmentation, considering both the accuracy and the efficiency of shot determination.
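To make the histogram-based detection concrete, the following is a minimal sketch of adaptive-threshold cut detection in the spirit of the approach described above; it is not taken from any of the cited systems, and the HSV bin counts, the neighborhood size, and the alpha weighting are illustrative assumptions. For simplicity it uses Euclidean distance between histograms rather than EMD.

```python
import numpy as np

def hsv_histogram(frame_hsv, bins=(8, 4, 4)):
    """Quantized HSV color histogram, L1-normalized.
    frame_hsv: H x W x 3 array with all channels scaled to [0, 1]."""
    hist, _ = np.histogramdd(frame_hsv.reshape(-1, 3),
                             bins=bins, range=((0, 1),) * 3)
    return hist.ravel() / hist.sum()

def detect_cuts(frames_hsv, alpha=3.0, window=10):
    """Flag a cut where the inter-frame histogram distance is a local
    peak exceeding mean + alpha * std of its neighborhood."""
    hists = [hsv_histogram(f) for f in frames_hsv]
    d = np.array([np.linalg.norm(hists[i + 1] - hists[i])
                  for i in range(len(hists) - 1)])
    cuts = []
    for i in range(len(d)):
        lo, hi = max(0, i - window), min(len(d), i + window + 1)
        nbhd = d[lo:hi]
        if d[i] == nbhd.max() and d[i] > nbhd.mean() + alpha * nbhd.std():
            cuts.append(i + 1)  # shot boundary before frame i + 1
    return cuts
```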
B. Invariant Local Descriptor

Local descriptors extracted from interest image regions have proved very successful in near-duplicate image identification and object recognition [2], [14]. Given an image, its local descriptors are computed by first localizing the local interest points with significant local variations of intensity, and then characterizing these interest points based on a patch of pixels in their local neighborhoods. An image is thus characterized by a bag of local descriptors. Different types of local descriptors have been developed to describe local image regions [3], [14], [16], [18], [21], [22]. In [21], Lowe proposed the scale invariant feature transform (SIFT), which is computed as follows. First, with a difference-of-Gaussian function, potential interest points that are invariant to scale and orientation are identified. Then, based on the stability of each potential interest point, the keypoints are localized. For each keypoint location, one or more orientations are assigned based on local image gradient directions. The keypoint descriptors, which are 128-dimensional feature vectors, are obtained by measuring the local image gradients at the selected scale in the region around each keypoint. SIFT is highly distinctive and invariant to image scale and rotation. However, the construction of the SIFT feature vector is very complicated and thus not suitable for interest point matching in large image collections. Variants of SIFT, including PCA-SIFT [14], GLOH [22], and SURF [3], have been proposed to improve the performance of SIFT. PCA-SIFT [13] is one of the most popular descriptors among them [23] and has much lower dimensionality. It was initially proposed to reduce the computational complexity of SIFT. Instead of the smoothed weighted histograms of SIFT, PCA-SIFT applies principal component analysis to the normalized gradient patches. As in the first three stages of SIFT, PCA-SIFT accepts the sub-pixel location, scale, and dominant orientations of each keypoint, and extracts a 41 × 41 patch at the given scale, centered over the keypoint and rotated to align its dominant orientation to a canonical direction. PCA-SIFT is then computed by first precomputing an eigenspace to express the gradient images of local patches, then computing the local image gradient for a given patch and projecting the gradient image vector with the eigenspace into a 36-dimensional condensed feature.

C. Interest Point Set Matching

Matching interest point sets measures the similarity between two images under certain mapping constraints. To align the interest points of two images, several approaches have been proposed [21], [25], [36]. In [21], Lowe proposed to find the best match of each local interest point by identifying its nearest neighbor among the local interest points of the dataset, with the similarity between two local interest points compared using Euclidean distance. To reduce mismatches during matching, the query descriptors that do not have any good match in the database are discarded by comparing the distance of the closest neighbor to that of the second-closest one. In this approach, multiple distinct points in a query image can be mapped to a single point of an image in the database. Zhao et al. [36] proposed one-to-one symmetric matching (OOS) with a cosine-distance-based partial similarity matching. With this approach, point pairs with low similarity are excluded from the matching subset, and each distinct point in the database can be matched only once. Two local interest point sets are optimally matched with the matching algorithm proposed in [1] to maximize the overall score of the bipartite graph under the OOS matching constraint. The main purpose of OOS is to optimize the matches. The OOS constraint performs much better than multiple matching [21], since it removes a large number of matches caused by noise and other local visual ambiguities. Meanwhile, the match of each descriptor is found by searching a reduced point set under the OOS constraint, so the matching cost is also reduced. In this work, we use OOS matching, which is noise tolerant and shows good performance in image detection [36]. The similarity between two keyframes is determined by the number of matched interest point pairs under cosine-similarity-based partial matching.
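The sketch below illustrates the flavor of OOS partial matching under an assumed cosine-similarity threshold lam. For brevity, the one-to-one constraint is enforced greedily from the most similar pair downward, rather than with the optimal bipartite matching algorithm of [1] used in the paper; the keyframe similarity is then the number of matched pairs.

```python
import numpy as np

def oos_match(desc_q, desc_t, lam=0.95):
    """Greedy one-to-one matching between two local descriptor sets.
    desc_q: (n, d) query descriptors; desc_t: (m, d) target descriptors.
    Returns a list of (i, j, sim) pairs; every point is used at most once,
    and pairs below the partial-matching threshold lam are dropped."""
    q = desc_q / np.linalg.norm(desc_q, axis=1, keepdims=True)
    t = desc_t / np.linalg.norm(desc_t, axis=1, keepdims=True)
    sim = q @ t.T  # pairwise cosine similarities
    pairs, used_q, used_t = [], set(), set()
    for i, j in sorted(np.ndindex(*sim.shape), key=lambda ij: -sim[ij]):
        if sim[i, j] < lam:
            break  # all remaining pairs are weaker: stop
        if i not in used_q and j not in used_t:
            pairs.append((i, j, float(sim[i, j])))
            used_q.add(i)
            used_t.add(j)
    return pairs  # keyframe similarity: len(pairs)
```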
D. Query Processing

Many approaches have been proposed to organize local descriptors; locality-sensitive hashing (LSH) [11], [24] and the visual vocabulary [29], [33] are two typical ones. LSH was first proposed in [11] and applied to near-duplicate image detection in [24]. The basic idea is to hash the key points in the database so that the probability of collision is much higher for objects close to each other than for those far apart; high-dimensional descriptors can then be filtered out by their hash keys. The visual vocabulary quantizes the descriptors into clusters, which can be managed with techniques from text retrieval. Instead of directly matching each high-dimensional descriptor pair, this approach finds matched pairs by using visual keywords accessed from the visual vocabulary. The idea was first proposed in [29] and later extended with context information. In [33], the semantic context of a keyframe refers to the story text transcripts extracted through speech recognition on the audio track; the extracted text transcripts are utilized jointly with the visual keywords and provide a meaningful cue for NDK retrieval. In [37], we proposed to combine the sequence relationship between neighboring frame samples with keyframe symbolization to simplify the video similarity measure while maintaining high matching quality.

Some recent techniques exploit the natural characteristics of near-duplicates in image or video databases. For
example, in [9], the image comparison is based on bags of local invariant features instead of clustered descriptors. In [36], the authors quantize each dimension of the PCA-SIFT descriptor and map each PCA-SIFT to 36 hash bins. The hash keys of the descriptors are used to filter out unqualified descriptors and to decide the potential candidates for a point match. In this work, we use the query processing approach proposed in [36] for near-duplicate shot search. We mainly focus on how to adaptively select the local shot descriptors from a video shot. Based on the matching relationship between the keyframe and its neighboring frames, the local descriptors with higher frequencies are extracted from the keyframe, and those with lower frequencies are discarded. As such, the accuracy of near-duplicate video shot detection is increased while the computation cost of near-duplicate shot matching is reduced significantly.
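As a rough illustration of this kind of dimension-wise quantization filtering (a sketch of the idea only, not the exact LIP-IS structure of [36]), each descriptor dimension can be mapped to a small bin index, and database descriptors whose bin patterns disagree with the query on too many dimensions are pruned before the expensive similarity computation. The bin count, value range, and mismatch tolerance below are assumptions.

```python
import numpy as np

def quantize(desc, bins=4, lo=-1.0, hi=1.0):
    """Map each dimension of a descriptor to a small bin index."""
    idx = np.floor((np.asarray(desc) - lo) / (hi - lo) * bins).astype(int)
    return np.clip(idx, 0, bins - 1)

def candidate_filter(query_desc, db_descs, max_mismatch=4):
    """Keep only the database descriptors whose per-dimension bins
    disagree with the query's on at most max_mismatch dimensions."""
    q = quantize(query_desc)
    keys = np.array([quantize(d) for d in db_descs])
    mismatches = (keys != q).sum(axis=1)
    return np.where(mismatches <= max_mismatch)[0]  # candidate indices
```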
TABLE I NOTATION
III. FRAMEWORK OF OUR SOLUTION

This section provides a general framework for near-duplicate video shot identification and briefly describes our proposed approach. In near-duplicate video shot identification, each shot is represented by its keyframe, which in turn is a bag of high-dimensional local descriptors. Given a collection of video shots S, a query shot q, an inter-descriptor similarity function Sim, and an interest point similarity threshold λ, video shot matching consists of two key problems: 1) for each local interest point of the query keyframe, finding its matched distinct points from the distinct point collection of S; and 2) calculating the similarity between keyframes by taking each matched local interest point pair into consideration. However, fast near-duplicate video shot identification is very challenging due to the large number of local interest points, represented by high-dimensional descriptors, in each keyframe. Meanwhile, inconsistent shot boundary identification produces inconsistent keyframes for near-duplicate video shots, and the consecutive frames in a single shot cannot perfectly match each other by their local descriptors, due to viewpoint and illumination changes and partial object occlusions during object motion. All of these factors negatively affect the similarity matching between keyframes.

We extract the local interest points with higher frequencies in a shot from its keyframe to improve the performance of near-duplicate shot identification; the keyframe matching is performed based on these high-frequency local points. In our approach, each video shot and the query are represented by their keyframes, which are further described as sets of PCA-SIFT descriptors. Then, we extract the shot interest descriptors from the descriptor set of each keyframe by considering the descriptor matching relationships among the near-duplicate frames in a single shot. As such, the representation of each keyframe is highly compact. We address the problems of shot interest points in the following section. For easy reference, the notation used in this paper is summarized in Table I.

IV. SHOT INTEREST POINT EXTRACTION

We first motivate our work by presenting observations that reveal several problems in the existing interest point matching proposals. Then, we discuss our solutions to address these shortcomings.
A. Motivation

Using a keyframe to represent a shot is a natural method in the area of video processing [32]. In near-duplicate video matching, the comparison between shots is usually performed by matching the local interest points of their keyframes [32]. However, consistent keyframes cannot be ensured among near-duplicate videos due to the limitations of existing shot boundary detection techniques [10]. Meanwhile, a large quantity of local interest points cannot survive with invariant local descriptions in video streams under viewpoint and illumination changes as well as partial sub-image occlusion during fast object motion. Accordingly, effective keyframe matching cannot be performed over near-duplicate videos with large variations, which makes effective and efficient near-duplicate shot matching challenging. In the following, we describe our motivation for interest point selection in a shot.

Intuitively, a large number of near-duplicate frames usually exist in a single video shot. For the keyframe of a given shot, its nearby frames are usually very similar to it, being its near-duplicates. Thus, it seems obvious that a large portion of matched local interest points should exist between the keyframe and its neighboring frames. In practice, however, this is not true. Consider a set of shot examples¹ from the TRECVID 2007 sound and vision dataset, used to explain this problem. Given a collection of 1000 randomly selected shots, each shot is described by a series of near-duplicate frames that are temporally extracted neighboring frames within it. We find the matched interest points between the first frame and each of the following ones using the SIFT demo. The average ratios of matches between the keyframe and its following frames are shown in Table II. As we can see from Table II, on average only about 50% of the interest points are matched between the keyframe and its next frame, and no more than 20% are matched between the keyframe and the later frames. To illustrate the problem more clearly, Table III lists the statistics for ten individual shots, including the number of local interest points, denoted as P-N, and the number of matches, denoted as M-N. The percentages of matches between the first frame and the other frames are shown in Fig. 1.

¹http://www.itee.uq.edu.au/~emily/Research.htm
Fig. 1. Percentage of interest point matches.

TABLE II
RATIO OF INTEREST POINT MATCHES OVER WHOLE DATASET
TABLE III STATISTICS OF INTEREST POINT MATCHING FOR TEN INDIVIDUAL SHOTS
Clearly, a large number of changeable interest points exist among the neighboring near-duplicate frames because of viewpoint and illumination changes, motion blur, and occlusions. Meanwhile, mismatches can be found among the non-near-duplicate frames. Motivated by this observation, we propose to select shot interest points based on a set of frames similar to the keyframe. Next, we introduce how to produce the frame set for a shot and how to select shot interest points from a keyframe.
B. Producing Frame Set

Recall that the matchable local descriptors are found using the neighboring frames of a keyframe, so that the most important descriptors can be determined. But how should the appropriate neighboring frames of the keyframe be selected as references? Simply picking a fixed number of neighboring frames after the keyframe is not appropriate, since the content change trends of different shots are very diverse: with a fixed number of frames, non-near-duplicate frames may be selected as references for some keyframes, while the set of local descriptors may not be effectively narrowed down for others. To address this problem, we propose a new adaptive scheme that builds a frame set for each individual shot based on the content of the shot itself. Given a shot s_i and a threshold value ε, the frame set of s_i is defined as below.

Definition 1: Given a keyframe k_i of shot s_i and the neighboring frames F = {f_1, ..., f_n} of k_i, let E be the Euclidean distance between two color histogram feature vectors. The frame set of s_i is

FS(s_i) = {f_j | f_j ∈ F and E(k_i, f_j) < ε}.

We use color histogram feature vectors and Euclidean distance in our work, which have been proved effective in [32]. In [32], it was justified that a global signature derived from color histograms can be used to detect near-duplicate videos with high confidence: two keyframes whose Euclidean distance is below a given threshold can be identified as near-duplicates directly. We select a smaller threshold value, ε = 0.1, to safely guarantee that the reference frames obtained from a shot using global features are near-duplicates; our experiments over 1000 shots also verified the validity of this choice. More importantly, color histogram matching is much more efficient than local interest point matching, which greatly reduces the computation cost of video query preprocessing. Because of these advantages, the color histogram is an appropriate choice for constructing the reference frame set of a video shot.
Fig. 2. Producing reference frames for a shot.
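The listing of Fig. 2 does not survive in this text. The following is a minimal sketch consistent with the line-by-line description that follows, assuming the HSV color histograms of the keyframe and its temporally ordered following frames have been precomputed.

```python
import numpy as np

def produce_frame_set(keyframe_hist, neighbor_hists, eps=0.1):
    """Collect reference frames per Definition 1: scan the frames
    following the keyframe and keep each frame whose color-histogram
    Euclidean distance to the keyframe is below eps; stop at the first
    frame that fails the test."""
    refs = []
    for j, h in enumerate(neighbor_hists):
        if np.linalg.norm(h - keyframe_hist) < eps:
            refs.append(j)  # f_j is a near-duplicate reference
        else:
            break  # later frames have drifted from the keyframe
    return refs
```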
We adopt two strategies jointly for effective keyframe selection. First, we try to maximize the consistency of shots by using the best-practice shot boundary detection approach proposed in [20], which proved superior to the other approaches in the TRECVID 2006 competition. Second, we adaptively select each keyframe from the middle frames of its shot, since extracting a representative keyframe from the middle of a shot is relatively reliable for obtaining basically similar keyframes from different near-duplicates [32]. Given a video shot, we start checking the shot content from its 10th frame and look for near-duplicates among its following frames as its reference candidates. A frame with too few near-duplicates is considered unreliable. This process continues until a reliable frame is found and used as the keyframe.

Given a keyframe k_i and its neighboring frames F, we produce the frame set according to the Euclidean distance between each frame in F and the keyframe, based on their color histogram feature vectors in HSV space. Fig. 2 describes the detailed reference frame selection algorithm. For each frame f_j in F, we compute the Euclidean distance between f_j and k_i over their color histogram vectors (lines 2–3). If the distance between them is smaller than the given threshold value ε, f_j is collected as a reference frame (lines 4–5); otherwise, the process stops (line 6). This approach ensures that all the reference frames in the frame set are near-duplicates of the keyframe. In many cases, however, a keyframe has a large number of near-duplicates, and choosing all of them as references would increase the preprocessing cost dramatically. Next, we introduce an approximate approach that samples the reference frames to obtain a compact frame set.

Sampling Reference Frames: A straightforward way to compact a frame set is random sampling. A good sampling scheme should use a fixed number of frames for the frame set of each shot and keep the sampled reference frames apart from each other, so that the frame set contains more discriminative information: the subordinate local interest points can then be filtered by diverse references, enhancing the ability to retain the most important local descriptors. Selecting diverse references from a candidate set consisting of similar frames may appear contradictory, but in fact there is no contradiction, because 1) the candidate set contains the frames similar to the keyframe, whereas the diverse distribution concerns the correlation between the selected references, so the "similar frames" and the "diverse references" address different objectives; and 2) the purposes of the two steps are consistent, i.e., maintaining more discriminative local interest points. Consider the example of selecting the very similar frames of a keyframe: if we chose dissimilar frames of the keyframe as reference candidates, these frames would not contain local interest points matchable with the keyframe, and no reliable key points could be identified from the keyframe. Likewise, consider the extreme condition for diverse references: if we chose several exact copies of a frame as references, the information obtained by matching the keyframe against each reference would be exactly the same; some important local descriptors might be missed because of this degenerate selection, while some unimportant ones might survive. It is therefore important to filter the unimportant local descriptors in a diverse manner. With this purpose, we propose to sample the reference frames with a diverse distribution, defined as follows.

Definition 2: Given a frame set FS(s_i) and its sampled frame sets, a sampled frame set is of the most diverse distribution if the minimal distance between two of its reference frames, min_{f_a, f_b} E(f_a, f_b), is maximized. A bigger value of this distance corresponds to a more diverse distribution.

Given a continuously differentiable real-valued function, sequential quadratic programming (SQP) can be used to find its minimum or maximum values by seeking a zero point of the function's first derivative. However, we cannot use the SQP algorithm in this work due to the specific characteristics of the objective function and the investigated data. First, the objective function is very complex: it is a composite function of high-dimensional variables and has a different expression in each associated coordinate area. To find a zero point, we would need to determine the function expression in each area and compute partial derivatives, which incurs a heavy computation cost. Moreover, given a set of discrete frame data, the objective function is discontinuous; even for a zero point of the derivative function, the corresponding optimal sample may not exist in the data set.

To quickly select a frame set with a diverse distribution, a simple approach is to pick an initial point and then keep adding to the candidate set the point furthest from any point already in the set. We call this heuristic approach GRF (see the sketch below). However, this approach cannot optimize the global distribution of the frame samples. We propose a new sampling approach named FPV, which utilizes the furthest point Voronoi diagram, with a corresponding weight attached to each frame, to sample the representative frames for each frame set. The Voronoi diagram has been successfully used in [4] for producing compact video signatures (ViSig) and performing efficient approximate similarity measures. Unlike ViSig [4], our purpose is to obtain a frame set with a sparse frame distribution, so that more diverse information can be obtained from the frame set. Thus, in our work, we choose the furthest point Voronoi diagram, which is the opposite of the Voronoi diagram.
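For concreteness, here is a minimal sketch of the GRF heuristic just described, assuming each candidate frame is represented by its color histogram feature vector; FPV differs in that it additionally balances the global distribution of the samples through the furthest point Voronoi diagram.

```python
import numpy as np

def grf_sample(feats, T=4):
    """GRF heuristic: start from one frame, then repeatedly add the
    candidate whose minimal distance to the already selected references
    is largest (furthest-point-first).
    feats: (n, d) array of frame feature vectors; returns T indices."""
    selected = [0]  # initial point: the first candidate
    while len(selected) < min(T, len(feats)):
        # distance of every candidate to its nearest selected reference
        d = np.min(np.linalg.norm(
            feats[:, None, :] - feats[selected][None, :, :], axis=2), axis=1)
        d[selected] = -1.0  # never re-pick an already selected frame
        selected.append(int(np.argmax(d)))
    return selected
```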
Fig. 3. Example of the furthest point Voronoi diagram.
Fig. 3 shows an example of the furthest point Voronoi diagram of four frames, where FV(f_i) is the furthest Voronoi area of f_i. Given a frame set FS(s_i), FPV performs frame sampling in three main steps: seed generation, furthest point Voronoi diagram construction, and reference frame selection. When a furthest point Voronoi diagram is constructed, the number of Voronoi areas may be less than the number of seed vectors due to the bias of seed selection. To overcome this problem, a postprocessing procedure is used, and the set of frames is obtained by recursively selecting the reference frames.

Fig. 4 shows the detailed algorithm for producing the reference frame samples. First, we generate a set of frames uniformly distributed in temporal order, which we call seed vectors, and construct a furthest point Voronoi diagram based on them (lines 3–4). Seed selection may be critical for general high-dimensional data. Fortunately, we select seeds from each single shot, and the change of the frames in a single shot follows a consistent trend along the video stream, as shown in Table III: the difference between two frames usually increases with the temporal difference between them. This special characteristic of video shots makes seed selection more robust. Second, for each vector in FS(s_i), we find the furthest Voronoi area containing it and map it to a seed vector, with the distance to the corresponding seed attached as a weight (lines 5–6). Third, each point that has the biggest weight in its area is selected as a reference frame (lines 7–8). The sampling procedure is then postprocessed to obtain the final reference frames (lines 9–13). Since the reference frame candidates lie in a small cluster, we expect the reference frames to be uniformly distributed. Thus, in the postprocessing, once a reference frame is selected, we remove its nearest neighbors from the reference frame candidates, and only the remaining candidates are considered for further selection of the reference samples (line 11). The frame that forms a diverse distribution with the other reference frames is selected as the final reference (line 13).

Fig. 4. FPV sampling.

Computing Cardinality Parameter T: In this part, we analyze the appropriate cardinality T of the sampled frame set from statistics. Given a set of near-duplicate shot frames, their intersection decreases exponentially with the cardinality of the frame set. This attribute can be modeled by an exponential function that decides the common part of the reference frames, as follows:

f(T) = n · r^T    (1)

where r is the ratio of the common part between two neighboring frames and n is the number of local descriptors in the keyframe. As mentioned in Section IV-A, the common part of two neighboring frames can be as small as 28%, and is about 50% on average. When we set T to 4, the common part of the frame set is dramatically reduced to only about 6% of the original on average (0.5^4 ≈ 6%), and to about 0.6% for some shots (0.28^4 ≈ 0.6%). Thus, we believe a bigger T value will not improve the final matching result much, and we set T = 4 as the cardinality threshold of the sampled frame set. Having decided the best T value for a reference frame set, we describe in the next section how to extract shot interest points from a keyframe with its frame set.

C. Extracting Shot Interest Points

Near-duplicate video shots are determined by whether the keyframes of different shots are near-duplicates or not. Recall that, since the keyframes contain local interest points changed by the transformations or damage that occur during video streaming, a large quantity of unnecessary comparisons are involved in near-duplicate video shot identification. To efficiently identify near-duplicate shots in a large video shot collection, we propose reference extraction to select the shot interest points and discard the damaged ones, reducing the computation cost of local interest point matching. We first introduce an important concept, frequency, which underlies our shot interest point selection strategy, and then turn to the specific approach for shot interest point extraction.

Definition 3: Given the keyframe k_i of shot s_i and the frame set FS(s_i) of s_i, the frequency of a local interest point indicates its occurrences among the frames of the reference frame set. For a given local interest point p, its frequency freq(p) is equal to the number of its matches to the local interest points of the other frames in FS(s_i).

The frequency of a local interest point is used in the stage of shot interest point extraction and indicates how important the point is: points with higher frequencies are more important in the interest point matching.

Reference Extraction (RE): Reference extraction computes the frequency of each interest point by matching the keyframe with the reference near-duplicate frames in the selected frame set. Fig. 5 describes the processing of this approach. Given a local descriptor p in the keyframe k_i, we find its matched local interest points from the reference frames in FS(s_i) one by one.
Fig. 5. Extracting shot interest points.
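The listing of Fig. 5 does not survive in this text. A minimal sketch of the reference extraction step is given below; for brevity, the per-reference matching is simplified to a thresholded nearest neighbor in cosine similarity rather than full OOS matching, and the frequency threshold min_freq and similarity threshold lam are assumed parameters.

```python
import numpy as np

def extract_shot_interest_points(key_descs, ref_desc_sets, lam=0.95, min_freq=2):
    """Reference extraction (RE): count, for each keyframe descriptor,
    how many reference frames contain a match, and keep the descriptors
    whose frequency reaches min_freq.
    key_descs: (n, d) keyframe descriptors; ref_desc_sets: list of
    (m_i, d) arrays, one per reference frame in the frame set."""
    normalize = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    k = normalize(key_descs)
    freq = np.zeros(len(key_descs), dtype=int)
    for refs in ref_desc_sets:
        sim = k @ normalize(refs).T       # cosine similarities
        freq += sim.max(axis=1) >= lam    # at most one match per reference
    return key_descs[freq >= min_freq], freq
```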
If a matched interest point is found, the number of matches is increased by 1; accordingly, a higher frequency is achieved by p.

Since we select the local interest points with higher frequencies as shot interest points, one may suspect that the collected shot interest points come only from the background. However, this is not the case. For one thing, the frequency of a local interest point is determined not by its motion, but by the combined effect of the motion and other factors such as illumination changes. Fig. 6 gives an example of local interest point matching between two frames in a shot with the SIFT demo, where the lines connecting the two images show the matched interest point pairs between them. Clearly, a large quantity of the matched interest points come from the foreground of the pictures, not only the background. For another, we place a flexible constraint on the frequency of a shot interest point: once the frequency of an interest descriptor reaches a certain threshold, it is selected as a shot interest descriptor.

With the RE approach, the frequency of each local interest point in the keyframe is accurately computed, so the shot interest points are well selected. However, this approach requires additional preprocessing cost for extracting shot interest points. Next, we analyze the cost for extracting shot interest points and that for detecting near-duplicates in video shot collections.

Cost Analysis: We estimate the shot interest point extraction cost and the near-duplicate shot detection cost for the existing work OOS and for our proposed approach RE. First, we compare the shot interest point selection cost. Denote the cardinality of the query frame set as T, the number of local interest point candidates contained in the keyframe as n_k, and the number contained in each of the other frames as n_f. Suppose that the cost of local feature extraction for each frame is c. Then the computation cost for extracting shot interest points is

Cost_extract = (T + 1) · c + T · n_k · n_f    (2)
Fig. 6. Example of interest point matching.
Clearly, the cost of shot interest point selection is determined by the cardinality of the frame set and the number of local interest point candidates in its frames. For a fixed frame set cardinality T, the cost of shot interest point selection is fixed; it does not increase with the size of the shot collection. OOS, which uses the local interest descriptors extracted from the keyframes directly, only requires the cost c for extracting the local descriptors from the keyframe.

Next, we compare the query processing cost of OOS with that of the RE approach. Denote the number of selected local descriptors in the query keyframe as n_q, the number in the database as N, and the filtering power of the indexing as ρ. Then the computation cost for query processing is

Cost_query = ρ · n_q · N    (3)

Apparently, the query processing cost increases linearly as the database size and the number of query descriptors grow. At a very small preprocessing cost, the RE approach dramatically reduces the number of local descriptors in the database. For a large video database, considering the overall query processing cost, the preprocessing cost is negligible. Thus, we can conclude that, from the view of efficiency, the dramatic performance enhancement in identifying near-duplicate shots in large video shot collections is worth the cost of shot interest point extraction.

V. EVALUATION METHODOLOGY

This section presents our strategy for comparing our methods with others, including the selection of the video dataset, the existing work to compare against, and the evaluation criteria.

A. Video Dataset
We conduct experiments on an 81-h video collection from two sources: 1) the 50-h sound and vision data collection used for the development of search feature detection at TRECVID 2007 [30], which includes 18 142 video shots; and 2) 31 h of transformed videos produced from 3.8 h of videos in the TRECVID 2007 search test collection. Following the evaluation method in [14], we generate test data with the transformations below:
• Contrast: increase and decrease contrast by 25%;
Fig. 7. Example of video data. (a) Original. (b) Variation 1. (c) Variation 2. (d) Variation 3. (e) Brightness −25%. (f) Brightness +25%. (g) Contrast −25%. (h) Contrast +25%. (i) Saturation −50%. (j) Saturation +50%. (k) Crop 50%. (l) Half size.
• Brightness: increase and decrease brightness by 25%;
• Crop: crop 50% of the frame, preserving the center region;
• Saturation: alter saturation by ±50%;
• Resize: increase and decrease scale by a factor of two.

The investigated video shots include the data with the above transformations and also data with different viewpoints. Fig. 7 shows an example of the investigated video data, where subfigures (b)–(d) are three variants of subfigure (a), and subfigures (e)–(l) are obtained by different global transformations of the shot shown in subfigure (a). We use the latest algorithm proposed by AT&T at TRECVID 2006 to segment the transformed videos [20], obtaining 11 992 shots. Each video frame is compressed using PLCVideo Mjpegs. Except for the data transformed by resizing, all shots have a resolution of 352 × 288 pixels. Following the experimental parameter setting in [14], a video shot is represented as a bag of 20-dimensional PCA-SIFT descriptors to obtain higher efficiency.

B. Comparison Technique

To evaluate the performance of our approach, we choose the latest one-to-one matching technique proposed in [36] as a reference; it neglects the variation of the local interest points along video streams. This technique matches near-duplicate keyframes based on the local descriptors of the interest points. With one-to-one matching and the cosine dissimilarity, the similarity between two keyframes and that between two local descriptors are measured. In OOS matching, a partial matching scheme is adopted to require the similarity of matchable local interest points to be no less than a threshold value λ; only a subset of points is matched, excluding point pairs with low similarity. As shown in [36], compared with LSH, this method achieves much higher accuracy while adding only a small computation cost with the LIP-IS index scheme. We implemented this technique as the reference rather than the LSH approach because LSH leads to matching results with poorer accuracy, which contradicts our purpose of preserving the high accuracy of the results.
C. Evaluation Criteria

To evaluate our approach, we adopt the standard evaluation metric in TRECVID [30]. The precision-recall curves of the queries are utilized to measure the effectiveness of the near-duplicate shot matching. The precision and recall are computed as follows:

precision = (relevant shots retrieved) / (shots retrieved),
recall = (relevant shots retrieved) / (relevant shots).
We selected ten representative shots covering news, sports, etc., as queries.² These ten query shots involve fast and slow motion, respectively, and are obtained by transforming the original MPEG video files to AVI format using the VirtualDub software. We manually built the ground truth for the 31-h dataset, which is used to measure effectiveness. First, for each query, five nonexpert assessors were asked to watch the original 3.8-h videos and select the shots with formatting differences, including encoding format, frame rate, bit rate, and frame resolution, and those with content differences, including photometric variations, editing, content modification, viewpoint variations, and different versions. Then, the copies of each selected shot are selected from the transformed videos to form the ground truth. We also evaluate the system according to the average precision, a single-valued measure reflecting the performance over all relevant shots. It is the average of the precision values obtained after each relevant shot is retrieved:

AP = (1/R) Σ_{r=1}^{n} rel(r) · P(r)    (4)

where r is the rank, n the number of shots retrieved, rel(r) a binary function on the relevance of rank r, P(r) the precision of the evaluated system at cut-off rank r, and R the number of relevant shots. For a query, the results are ranked by their similarity values, and the precision is calculated after each relevant clip is retrieved. A precision-recall curve is then produced by measuring precisions at 11 recall points (0, 0.1, ..., 1.0).

²The same shots used in Section IV-A.
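A minimal sketch of the average precision computation in (4), taking the ranked 0/1 relevance flags of a query's result list and the total number of relevant shots R:

```python
def average_precision(ranked_relevance, num_relevant):
    """Average of the precision values taken at the rank of each
    relevant retrieved shot, per (4).
    ranked_relevance: list of 0/1 flags in ranked order."""
    hits, ap = 0, 0.0
    for r, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / r  # precision at this cut-off rank
    return ap / num_relevant if num_relevant else 0.0

# Example: relevant results at ranks 1, 3, and 4, with R = 4:
# average_precision([1, 0, 1, 1, 0], 4) == (1/1 + 2/3 + 3/4) / 4 ≈ 0.604
```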
Fig. 8. Effect of different parameters on the effectiveness: (a) AVP versus λ. (b) Precision-recall for different λ.
Fig. 9. Comparing the average precisions for different shot interest point selection approaches and OOS.
To evaluate the efficiency of our approach, we measure the computation cost drop ratio, which is defined as
DropRatio = Cost(OOS) / Cost(SIP)    (5)

where Cost(SIP) is the computation cost of the proposed approach (shot interest points), including the distance computations for shot interest descriptor selection and those for identifying shots from the video database, and Cost(OOS) is that of the OOS approach. Each cost is obtained by counting the number of distance calculations for each method, since this measure is objective and also dominates the time of video search.

VI. EXPERIMENTAL RESULTS

In this section, we evaluate the effectiveness and efficiency of the proposed approaches through a series of experiments and report the results. Our study has two parts: 1) studying the effect of the local descriptor similarity threshold λ, and 2) studying the effectiveness and efficiency of the proposed approach in comparison with OOS.
Fig. 10. Comparing the precision-recall curves for different shot interest point selection approaches and OOS.
A. Effect of λ

In OOS, the similarity measure between keyframes is based on partial matching with a threshold λ. In this part, we evaluate the OOS scheme [36] by tuning the λ value from 0.8 to 0.98, in order to find
Fig. 11. Comparing the computation drop ratios of SRF, NRF, MRF, RRF, and GRF to OOS over 31-h dataset.
the optimal λ. The overall average precision is obtained at each λ value. Fig. 8(a) and (b) shows the changing trend of the average precisions and the precision-recall curves for different λ values. Clearly, as λ increases, the average precision of near-duplicate shot identification first increases and reaches its optimum at 0.95; after that, it degrades. This is mainly caused by two effects: when λ is set to very small values, many mismatched local interest point pairs are produced; after λ passes the optimal value, more real matches of local interest point pairs are lost due to the overly strict similarity constraint. Meanwhile, from Fig. 8(b), we notice that, at each recall level, OOS obtains the highest precision when λ = 0.95.

B. Comparing With the Existing Approach

In this test, we compare the effectiveness and efficiency of shot interest point selection under different frame set selection strategies against OOS, which does not perform further selection of local descriptors [36]. We compare five frame set selection strategies: SRF, NRF, RRF, GRF, and MRF. Here, SRF chooses the next frame after the keyframe as the reference; NRF chooses the next four consecutive frames as references with a frequency threshold of 1; RRF selects four reference frames by random sampling with a frequency threshold of 1; GRF selects four references by first picking an initial point and then repeatedly adding the point furthest from any point already in the set, with a frequency threshold of 1; and MRF chooses a frame set by FPV sampling. We first test the average precision of each query and the overall average by setting λ to its default value 0.95 and varying the frequency threshold from 1 to 4 for the MRF approach, with SRF, NRF, RRF, GRF, and OOS tested as references. Fig. 9 reports the average precision of each query and the overall average precision for SRF, NRF, RRF, GRF, and OOS, and those of MRF at different frequency thresholds. We notice that, when the frequency threshold is set to 1, MRF improves the average precision of each individual query, and hence the overall average, to different extents, since it extracts the most important parts as the shot interest points, thus overcoming
the negative effect of "dirty" interest points during shot matching. For the queries answered with 100% average precision by OOS [36], MRF keeps the accuracy as high as OOS. For the queries whose near-duplicates in the investigated dataset exhibit viewpoint and other changes, MRF achieves better accuracy than OOS. Meanwhile, as the frequency threshold continues to increase, the average precisions of most queries decrease. This is mainly because, for very large threshold values, the information loss increases due to the overly strict constraint on shot interest point selection. For the SRF approach, the average precision is reduced compared with OOS due to the overly strict constraint on interest point selection. For the NRF approach, since the consecutive frames are densely distributed, the reference frames cannot compensate information in the shot interest point selection, so the accuracy decreases compared with OOS. Random sampling shows much worse accuracy, which is mainly caused by inappropriate frame selection, directly leading to low-quality interest point selection. Since the GRF approach only takes the local distribution of the selected references into consideration when a new reference is inserted, it obtains lower average precision than MRF.

We then compare the precision and recall of our proposed approach with the existing approach by fixing the frequency threshold at 1 and 2. Fig. 10 shows the precision of the queries at each recall level. Obviously, MRF-1 achieves much better precision at each recall level. Thus, we can conclude that our proposed approach improves the accuracy of OOS steadily.

Finally, we compare the efficiency of our proposed approaches with the existing one by testing the computation cost drop ratio defined in (5). We use two datasets, the 31-h and the 81-h video collections, with the frequency threshold varying from 1 to 4 for MRF. Fig. 11 shows the cost drop ratios of SRF, NRF, MRF, RRF, and GRF, with the existing OOS approach as the base, for the 31-h dataset; Fig. 12 shows those for the 81-h dataset. From the experimental results in Figs. 11 and 12, we can see that, for each dataset, all the approaches, i.e., SRF, NRF, MRF, RRF, and GRF, improve the efficiency of the original OOS by more than six times on average. Meanwhile, with an increasing frequency threshold, the matching speed of MRF
Fig. 12. Comparing the computation drop ratios of SRF, NRF, MRF, RRF, and GRF to OOS over 81-h dataset.
increases exponentially. Finally, each approach speeds up OOS to a very similar extent on the two datasets. This is because most of the newly added video shots are filtered out: with our proposed methods, the number of local interest points is dramatically reduced, so the computation cost grows much more slowly as the data size increases. Considering effectiveness and efficiency together, the MRF approach with a frequency threshold of 2 achieves the best performance tradeoff: competitive accuracy with a dramatic speed improvement.

VII. CONCLUSION

In this paper, we studied effective and efficient methods for near-duplicate shot identification. We proposed a shot-based interest point representation, which extends the idea of local interest points from individual images to videos. We proposed an adaptive approach to reference frame selection, by which the obtained reference frames can effectively capture the variances of local descriptors along video streams. We also proposed a novel approach that adaptively selects the shot interest points of a video shot using the frequency of each local descriptor of the keyframe among the reference frame set. Extensive experiments show that our proposed shot-based interest point approach is both effective and efficient.

ACKNOWLEDGMENT

The authors would like to thank Y. Ke for providing the source code for PCA-SIFT extraction. They would also like to thank X. Lin and X. Wu for useful discussions.

REFERENCES

[1] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. New York: Springer, 2003, vol. A, pp. 267–290.
[2] L. Ballan, M. Bertini, A. D. Bimbo, and W. Nunziati, "Soccer players identification based on visual local features," in Proc. CIVR, 2007, pp. 258–265.
[3] H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," in Proc. ECCV, 2006, pp. 404–417.
[4] S. Cheung and A. Zakhor, "Fast similarity search and clustering of video sequences on the world-wide-web," IEEE Trans. Multimedia, vol. 7, no. 3, pp. 524–537, Jun. 2005.
[5] S. C. H. Hoi, L. L. S. Wong, and A. Lyu, "Chinese University of Hong Kong at TRECVID 2006: Shot boundary detection and video search," in Proc. TRECVID, 2006.
[6] O. Chum, J. Philbin, M. Isard, and A. Zisserman, “Scalable near identical image and shot detection,” in Proc. CIVR, 2007, pp. 549–556.
[7] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: min-Hash and tf-idf weighting,” in Proc. BMVC, 2008.
[8] J. J. Foo and R. Sinha, “Pruning SIFT for scalable near-duplicate image matching,” in Proc. ADC, 2007, pp. 63–71.
[9] K. Grauman and T. Darrell, “Efficient image matching with distributions of local invariant features,” in Proc. CVPR, 2005, pp. 627–634.
[10] X.-S. Hua, X. Chen, and H. Zhang, “Robust video signature based on ordinal measure,” in Proc. ICIP, 2004, pp. 685–688.
[11] P. Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” in Proc. STOC, 1998, pp. 604–613.
[12] A. Joly, O. Buisson, and C. Frelicot, “Content-based copy retrieval using distortion-based probabilistic similarity search,” IEEE Trans. Multimedia, vol. 9, no. 2, pp. 293–306, Feb. 2007.
[13] Y. Ke and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. CVPR, 2004, vol. 2, pp. 506–513.
[14] Y. Ke, R. Sukthankar, and L. Huston, “An efficient parts-based near-duplicate and sub-image retrieval system,” in Proc. MM, 2004, pp. 869–876.
[15] I. Koprinska and S. Carrato, “Temporal video segmentation: A survey,” Signal Process.: Image Commun., vol. 16, 2001.
[16] I. Laptev and T. Lindeberg, “Space-time interest points,” in Proc. ICCV, 2003.
[17] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa, “Robust voting algorithm based on labels of behavior for video copy detection,” in Proc. MM, 2006, pp. 835–844.
[18] H. Lejsek, F. H. Ásmundsson, B. T. Jónsson, and L. Amsaleg, “Scalability of local image descriptors: A comparative study,” in Proc. MM, 2006, pp. 589–598.
[19] C. Liu, H. Liu, S. Jiang, Q. Huang, Y. Zheng, and W. Zhang, “JDL at TRECVID 2006 shot boundary detection,” in Proc. TRECVID, 2006.
[20] Z. Liu, D. Gibbon, E. Zavesky, B. Shahraray, and P. Haffner, “AT&T research at TRECVID 2006,” in Proc. TRECVID, 2006.
[21] D. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[22] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005.
[23] P. Moreels and P. Perona, “Evaluation of features detectors and descriptors based on 3D objects,” in Proc. ICCV, 2005, pp. 800–807.
[24] A. Qamra, Y. Meng, and E. Y. Chang, “Enhanced perceptual distance functions and indexing for image replica recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 379–391, Mar. 2005.
[25] Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s distance as a metric for image retrieval,” Int. J. Comput. Vis., vol. 40, no. 2, pp. 99–121, 2000.
[26] H. T. Shen, B. C. Ooi, and X. Zhou, “Towards effective indexing for very large video sequence database,” in Proc. SIGMOD, 2005, pp. 730–741.
[27] H. T. Shen, X. Zhou, Z. Huang, and J. Shao, “Statistical summarization of content features for fast near-duplicate video detection,” in Proc. MM, 2007, pp. 164–165.
[28] H. T. Shen, X. Zhou, Z. Huang, J. Shao, and X. Zhou, “UQLIPS: A real-time near-duplicate video clip detection system,” in Proc. VLDB, 2007, pp. 1374–1377.
[29] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. ICCV, 2003.
[30] A. F. Smeaton, P. Over, and W. Kraaij, “Evaluation campaigns and TRECVID,” in Proc. MIR, 2006, pp. 321–330.
[31] X. Wu, A. G. Hauptmann, and C.-W. Ngo, “Novelty detection for cross-lingual news stories with visual duplicates and speech transcripts,” in Proc. MM, 2007, pp. 168–177.
[32] X. Wu, A. G. Hauptmann, and C.-W. Ngo, “Practical elimination of near-duplicates from web video search,” in Proc. MM, 2007, pp. 218–227.
[33] X. Wu, W.-L. Zhao, and C.-W. Ngo, “Near-duplicate keyframe retrieval with visual keywords and semantic context,” in Proc. CIVR, 2007, pp. 162–169.
[34] Y. Yan, B. C. Ooi, and A. Zhou, “Continuous content-based copy detection over streaming videos,” in Proc. ICDE, 2008, pp. 853–862.
[35] D. Zhang and S.-F. Chang, “Detecting image near-duplicate by stochastic attributed relational graph matching with learning,” in Proc. MM, 2004, pp. 877–884.
[36] W.-L. Zhao, C.-W. Ngo, H.-K. Tan, and X. Wu, “Near-duplicate keyframe identification with interest point matching and pattern learning,” IEEE Trans. Multimedia, vol. 9, no. 5, pp. 1037–1048, Aug. 2007.
[37] X. Zhou, X. Zhou, and H. T. Shen, “A new similarity measure for near duplicate video clip detection,” in Proc. APWeb/WAIM, 2007, pp. 176–187.
Xiangmin Zhou received the Ph.D. degree in computer science from The University of Queensland, Brisbane, Australia, in 2008. She is currently a Postdoctoral Research Fellow in the ICT Center, Commonwealth Scientific and Industrial Research Organization (CSIRO), Canberra, Australia. Her research interests include multimedia information retrieval, multimedia database management, indexing, and query processing.

Xiaofang Zhou (SM’06) received the B.Sc. and M.Sc. degrees in computer science from Nanjing University, Nanjing, China, in 1984 and 1987, respectively, and the Ph.D. degree in computer science from The University of Queensland, Brisbane, Australia, in 1994. He is a Professor of computer science with The University of Queensland, where he is the Head of the Data and Knowledge Engineering Research Division, School of Information Technology and Electrical Engineering. He is also the Director of the ARC Research Network in Enterprise Information Infrastructure (EII) and a Chief Investigator of the ARC Centre of Excellence in Bioinformatics. From 1994 to 1999, he was a Senior Research Scientist and Project Leader in CSIRO. His research focuses on finding effective and efficient solutions to managing, integrating, and analyzing very large amounts of complex data for business and scientific applications. His research interests include spatial and multimedia databases, high-performance query processing, web information systems, data mining, bioinformatics, and e-research.

Lei Chen (M’05) received the B.S. degree in computer science and engineering from Tianjin University, Tianjin, China, in 1994, the M.A. degree from the Asian Institute of Technology, Bangkok, Thailand, in 1997, and the Ph.D. degree in computer science from the University of Waterloo, Waterloo, ON, Canada, in 2005. He is now an Assistant Professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology. His research interests include multimedia and time series databases, sensor and peer-to-peer databases, and stream and probabilistic databases.
Athman Bouguettaya (SM’89) received the Ph.D. degree in computer science from the University of Colorado, Boulder, in 1992. He is a Science Leader in the ICT Center, Commonwealth Scientific and Industrial Research Organization (CSIRO), Canberra, Australia. He was previously a tenured faculty member in the Department of Computer Science, Virginia Polytechnic Institute and State University (commonly known as Virginia Tech), Blacksburg. His current research interests are in service-oriented computing. Dr. Bouguettaya is on the editorial boards of several journals, including the IEEE TRANSACTIONS ON SERVICES COMPUTING, the International Journal on Web Services Research, the VLDB Journal, the Distributed and Parallel Databases Journal, and the International Journal of Cooperative Information Systems. He was invited to be a guest editor of a special issue of IEEE Internet Computing on database technology on the Web, and he guest edited a special issue of the ACM Transactions on Internet Technology on Semantic Web services. He served as a program chair of the 2008 International Conference on Service Oriented Computing (ICSOC) and of the IEEE RIDE Workshop on Web Services for E-Commerce and E-Government (RIDE-WSECEG 04), and he has served on numerous program committees of database and service-oriented computing conferences. He is a senior member of the ACM.
Nong Xiao received the B.S. and Ph.D. degrees in computer science from the National University of Defense Technology, Changsha, China, in 1990 and 1996, respectively. He is now a Professor in the College of Computer Science at the National University of Defense Technology. His research interests include grid computing, network storage, massive data management, and architecture.
John A. Taylor received the Ph.D. degree in air quality modeling from the Australian National University, Acton, Australia. He is currently the Leader of CSIRO’s Computational and Simulation Sciences research group, formerly called Terabyte Science. His work spans broad areas of science, including large sensor networks such as radio telescopes, large experimental facilities such as the Synchrotron, and high-content, high-throughput DNA analysis systems, all of which generate large and complex datasets.