Effective and Scalable Video Copy Detection

Zhu Liu 1, Tao Liu 2, David Gibbon 1, Behzad Shahraray 1
1 AT&T Labs – Research, Middletown, NJ 07748
2 Polytechnic Institute of NYU, Brooklyn, NY 11201
1 {zliu, dcg, behzad}@research.att.com, 2 [email protected]
ABSTRACT
Video copy detection techniques are essential for a number of applications, including discovering copyright infringement of multimedia content, monitoring commercial air time, and querying videos by example. Over the last decade, video copy detection has received rapidly growing attention from the multimedia research community. To encourage innovative technology and benchmark state-of-the-art approaches in this field, the TRECVID conference series, sponsored by NIST, initiated an evaluation task on content-based copy detection in 2008. In this paper, we describe the content-based video copy detection framework developed at AT&T Labs – Research. We employ local visual features to match video content, and adopt hashing techniques to maintain the scalability and robustness of our approach. Experimental results on TRECVID 2008 data show that our approach is both effective and efficient.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Information filtering, search process. I.4.9 [Computing Methodologies]: Image Processing and Computer Vision – Applications.

General Terms
Algorithms, Performance.

Keywords
Video copy detection, multimedia content analysis, SIFT, LSH, RANSAC, query by example.
1. INTRODUCTION
The goal of video copy detection is to locate segments within a query video that are copied or modified from an archive of reference videos. Usually the copied segments are subject to various transformations such as cam cording, picture in picture, strong re-encoding, frame dropping, cropping, stretching, contrast changing, etc. Some of these transformations are intrinsic to the regular video creation process, e.g., block encoding artifacts, bit rate and resolution changes, etc. Others are introduced
intentionally for various purposes. These include flipping, cropping, insertion of text and patterns (graphic overlays), etc. All these transformations make the detection task more challenging.

Video copy detection is essential for many applications, for example, discovering copyright infringement of multimedia content, monitoring commercial air time, and querying video by example. Generally there are two complementary approaches for video copy detection: digital video watermarking and content-based copy detection (CBCD). The first approach refers to the process of embedding irreversible information in the original video stream, where the watermark can be either visible or invisible. The second approach extracts content-based features directly from the video stream, and uses these features to determine whether one video is a copy of another. In this paper, we focus on the second approach.

Over the last decade, video copy detection has received rapidly growing attention from the multimedia research community. Depending on the adopted visual features, CBCD approaches can be broadly classified into two categories: those relying on global features and those relying on local features. Law-To et al. [1] presented a comparative study of different state-of-the-art video copy detection techniques, and concluded that local features are more robust, yet more computationally expensive. Kas and Nicolas [2] studied the scalability and robustness of features that are easily available in the compressed domain for video copy detection. These features include bitrate, macro-block size histogram, motion activity, global motion, number of objects, and object trajectories. Among these features, motion activity achieves the most robust results. In [3], Chum et al. used the Scale Invariant Feature Transform (SIFT) and min-Hash for near duplicate image detection, which is closely related to the CBCD task. The method they proposed requires a small indexing space for images, and achieved promising performance. Douze et al. reported their video copy detection system in [4]. Their approach is based on the bag-of-features image search system introduced by Sivic and Zisserman [5], and the adopted features are local invariant descriptors. The improvements include 1) Hamming embedding, which provides binary signatures that refine the visual word based matching, and 2) weak geometric consistency constraints, which filter matching descriptors that are not consistent in terms of angle and scale.

TRECVID (TREC Video Retrieval Evaluation) is sponsored by the National Institute of Standards and Technology (NIST) with the goal of encouraging research in automatic video segmentation, indexing, and content-based retrieval [6]. Since 2001, TRECVID has organized many laboratory-style evaluation tasks, including shot boundary detection, video segmentation, surveillance event detection, high-level feature extraction, search, content-based
copy detection, etc. In this paper, we describe a video copy detection system built for the TRECVID 2009 CBCD evaluation task.

In the TRECVID CBCD task, each query video is created in four steps: 1) selecting a segment from the reference videos (this segment is the video copy to be detected); 2) selecting a segment from the non-reference videos (not a video copy); 3) compiling the reference and non-reference video segments in one of three modes (described below); 4) applying transformations on the query video. There are three compiling modes: 1) only keep the reference video segment; 2) only keep the non-reference video segment; 3) insert the reference video segment into the non-reference video segment at a random offset. There are ten categories of visual transformations and seven categories of audio transformations. Since our system is designed for the visual-only CBCD task, we will skip the audio transformations in this paper. The ten visual transformations are:

T1. Cam cording
T2. Picture in Picture (PiP)
T3. Insertions of pattern
T4. Strong re-encoding
T5. Change of gamma
T6. Decrease in quality: a mixture of 3 transformations among blur, gamma, frame dropping, contrast, compression, ratio, white noise
T7. Same as T6, but with a mixture of 5 transformations
T8. Post production: a mixture of 3 transformations among crop, shift, contrast, text insertion, vertical mirroring, insertion of pattern, picture in picture
T9. Same as T8, but with a mixture of 5 transformations
T10. Combinations of 5 transformations chosen from T1-T9
Figure 1 shows two query videos used in the TRECVID 2008 CBCD task. In query example 1, the reference video is in the background, with a non-reference video embedded in the PiP region and a pattern inserted at the top right corner. In query example 2, the reference video is flipped and embedded in the PiP region, and the whole query video is also strongly re-encoded. These two samples give some idea of the difficulty of this task.

Figure 1. Samples of TRECVID CBCD queries: (a) query example 1; (b) query example 2.

In this work, we employed local visual features (Scale Invariant Feature Transform, SIFT) to match the content of query and reference videos. Locality Sensitive Hashing (LSH) and RANdom SAmple Consensus (RANSAC) techniques are exploited to maintain the scalability and increase the robustness of our approach. We benchmarked our system on the TRECVID 2008 CBCD dataset and achieved good performance.

The rest of this paper is organized as follows. Section 2 gives a high level overview of the proposed algorithm. Section 3 describes the details of each component. Experimental results are presented and discussed in Section 4. Section 5 describes a few applications of the developed CBCD algorithm. Finally, conclusions and future work are given in Section 6.

2. OVERVIEW
Figure 2 illustrates the high level block diagram of our approach. The processing consists of two parts, as indicated by different colors: the top portion illustrates the processing components for the query video, and the bottom portion shows the processing stages for the reference videos.

The processing of reference videos is as follows. We first apply a content based sampling algorithm to segment the video into shots with homogeneous content [7] and select one keyframe for each shot. The aim of content based sampling is data reduction, since indexing every reference frame is computationally intractable. To proactively cope with the transformations in the query video, we apply two transformations to each original reference keyframe: half resolution and strong re-encoding. These two additional versions of the reference keyframe help detect query keyframes with PiP and strong re-encoding transformations. All three sets of keyframes (one original and two transformed) are processed independently. Scale invariant feature transform (SIFT) features are extracted to represent the image content, and locality sensitive hashing (LSH) is adopted for efficient indexing and query. The LSH indexing results are saved in the LSH database.

Figure 2. Overview of the proposed CBCD algorithm
For each query video, we first apply content based sampling to extract keyframes. We then detect a few pre-defined video transformations, including video stretch, PiP, scale, shift, etc. For each detected transform type, we normalize the original keyframe to reverse the transformation effect. For example, a stretched video is rescaled to the original resolution, and the embedded picture-in-picture region is rescaled to half of the original resolution. Similar to the reference video processing, the original keyframe and the normalized keyframes go through the following steps independently: SIFT feature extraction, LSH computation, keyframe level query based on LSH values, and keyframe level query refinement. The refinement module adjusts the relevance scores of all retrieved reference keyframes for each query keyframe using a more accurate but slower SIFT matching method based on RANdom SAmple Consensus (RANSAC). Then, for each query keyframe, the query results from the different sets (the original and the transformed versions) are merged based on their relevance scores. The keyframe level query results are then passed into the video level result fusion module, where the temporal relationships among the keyframe query results as well as their relevance scores are considered. Finally, based on the different criteria of the CBCD task, we normalize the decision scores of all detected reference videos, and then generate the final copy detection runs in TRECVID format.
3. CBCD COMPONENTS
3.1 Content Based Sampling
Indexing every reference video frame and searching all query frames is too computationally expensive. Content based sampling is an important step for data reduction. The goal is to segment the original video into shots, where each shot contains homogeneous content. The first frame within each shot is selected as a keyframe, and all subsequent processing is applied only to the keyframes.

The content based sampling module is based on our shot boundary detection (SBD) algorithm developed for the TRECVID 2007 evaluation task [10]. The existing SBD algorithm adopts a "divide and conquer" strategy. Six independent detectors, targeting the six most common types of shot boundaries, are designed: cut, fade in, fade out, fast dissolve (less than 5 frames), dissolve, and motion. Essentially, each detector is a finite state machine (FSM), which may have different numbers of states to detect the target transition pattern and locate the transition boundaries. A support vector machine (SVM) based transition verification method is applied in the cut, fast dissolve, and dissolve detectors. Finally, the results of all detectors are fused together based on the priorities of these detectors. The FSMs of all detectors depend on two types of visual features: intra-frame and inter-frame features. The intra-frame features are extracted from a single, specific frame, and they include color histogram, edge, and related statistical features. The inter-frame features rely on the current frame and one previous frame, and they capture the motion compensated intensity matching errors and histogram changes.

By the TRECVID SBD criteria, a shot may contain totally different scenes as long as they belong to one camera shot, for example, a long panning shot. The existing SBD algorithm only reports one keyframe for such a heterogeneous yet smooth shot. To make content based sampling more effective, we extended the existing SBD algorithm with an additional detector: a sub-shot detector. This detector measures the color histogram dissimilarity between the current frame and the most recently detected keyframe. If their difference is large enough, a sub-shot boundary is declared and a new keyframe is extracted. Another modification of the existing content based sampling module deals with frame dropping. The dropped frames in the query videos provided by NIST are mostly replaced by blank frames, which created a large number of false cut boundaries. We updated the finite state machine model of the cut detector to tolerate up to 4 adjacent blank frames.
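As an illustration of the sub-shot detector described above, the following sketch selects an additional keyframe whenever the color histogram of the current frame differs sufficiently from that of the most recent keyframe. The histogram size and threshold are illustrative values, not the settings used in our system.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """frame: HxWx3 uint8 array; returns a normalized joint RGB histogram."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def select_subshot_keyframes(frames, threshold=0.4):
    """Keep frame 0, then any frame whose L1 histogram distance to the most
    recently selected keyframe exceeds the (illustrative) threshold."""
    keyframes = [0]
    ref_hist = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        hist = color_histogram(frame)
        if np.abs(hist - ref_hist).sum() > threshold:
            keyframes.append(i)
            ref_hist = hist
    return keyframes
```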
3.2 Transformation Detection and Normalization for Query Keyframes
While generating the video queries, the TRECVID CBCD evaluation task introduced the following transformations: cam cording, picture in picture, insertion of patterns, strong re-encoding, change of gamma, decrease in quality (e.g., blur, frame dropping, contrast change, ratio, noise, crop, shifting, flip, etc.), and combinations of individual transforms. It is not realistic to detect every kind of transformation and recover its effect, since some of the transformations are irreversible. In this work, we mainly focus on letterbox detection and picture-in-picture detection. A letterbox may be introduced by shifting, cropping, or stretching. We do not differentiate between them, but simply remove the letterbox by rescaling the embedded video to the original resolution.
3.2.1 Letterbox detection
We rely on both edge information and the temporal intensity variance of each pixel to detect the letterbox. For each frame, the Canny edge detection algorithm is applied to locate all edge pixels. Then, based on the edge direction, a horizontal edge image and a vertical edge image are extracted. To cope with noise in the original video, the edge images are smoothed: short edge segments are removed, and adjacent edge segments that are less than 5 pixels apart are merged. After all frames of the query video are processed, the mean horizontal (and vertical) edge image is computed for the entire video, and it is projected horizontally (and vertically) to obtain the horizontal (and vertical) edge profile. For stretched, cropped, or shifted videos, there are significant peaks in the horizontal and/or vertical edge profiles. We search for the maximum horizontal (and vertical) peak near the boundary, such that it is within one eighth of the height (and the width). Usually, these detected profile peaks indicate the letterbox boundaries.

When the video is too noisy, profile peaks may not be detected reliably. In this case, we rely on the intensity variation of each pixel across the entire video. In the embedded video region, the intensity variance of each pixel is relatively large due to the dynamic video content, while in the letterbox region the variance is much smaller. Comparing the pixel intensity variance to a preset threshold, we classify each pixel as either a video pixel or a non-video pixel. We then search for the letterbox boundaries such that the percentage of video pixels in the letterbox region is less than 1%.
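The variance-based fallback can be sketched as follows. This is a minimal illustration of the idea above: the variance threshold is an assumed value, the 1% video-pixel tolerance follows the description, and the row/column scan is a simplification of our actual boundary search.

```python
import numpy as np

def detect_letterbox_by_variance(gray_frames, var_threshold=20.0, max_nonvideo=0.01):
    """gray_frames: T x H x W array of grayscale frames from one query video.
    A pixel is a 'video' pixel if its temporal intensity variance exceeds
    var_threshold (illustrative). The crop box is grown inward until the rows
    and columns outside it contain at most max_nonvideo video pixels."""
    variance = gray_frames.astype(np.float32).var(axis=0)   # per-pixel temporal variance
    video_mask = variance > var_threshold                   # True for dynamic (video) pixels
    rows = video_mask.mean(axis=1)                          # fraction of video pixels per row
    cols = video_mask.mean(axis=0)                          # fraction of video pixels per column
    top = next((i for i, r in enumerate(rows) if r > max_nonvideo), 0)
    bottom = len(rows) - next((i for i, r in enumerate(rows[::-1]) if r > max_nonvideo), 0)
    left = next((i for i, c in enumerate(cols) if c > max_nonvideo), 0)
    right = len(cols) - next((i for i, c in enumerate(cols[::-1]) if c > max_nonvideo), 0)
    return top, bottom, left, right   # bounding box of the embedded video region
```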
3.2.2 Picture-in-picture detection
TRECVID specifies that picture-in-picture may appear at five locations: the center and the four corners. The size of the picture-in-picture region is less than half of the original resolution and bigger than 0.3 of the original size. These constraints are very useful in reducing the search complexity of picture-in-picture detection. Figure 3 (a) shows an example of a reference video embedded as PiP, and Figure 3 (b) shows an enlarged version of the PiP region. Orhan et al. described a PiP extraction algorithm in [11], where persistent strong horizontal lines in the x-axis image derivatives are found using Hough lines, and the PiP region is extracted by connecting horizontal parallel lines. The algorithm achieves an accuracy of 74%, but it still misses many positive PiPs. In this work, we devise a more effective approach. As in the letterbox detection algorithm, the devised method is based on both edge information and pixel intensity variance. Our method contains three basic steps.

First, all peaks of the horizontal (and vertical) edge profile are located. Based on these peak locations, we find the original horizontal (and vertical) edge segments. These edge segments are the candidate PiP region boundaries. Due to the size constraints on the PiP region, we remove edge segments that are either too short or too long. Figure 3 (c) - (f) illustrate the horizontal and vertical edge images and their profiles.

Second, we determine the PiP region candidates by grouping a pair of horizontal edge segments and a pair of vertical edge segments together. The pair of horizontal edge segments should be neither too close nor too far from each other vertically, and should overlap in terms of their horizontal positions. The vertical edge segments need to follow a similar constraint. The sample shown in Figure 3 is simple; it contains only one pair of horizontal edge segments and one pair of vertical edge segments. Note that the broken horizontal edge segments in Figure 3 (c) are merged into one segment during the smoothing process.

Third, each candidate PiP region needs to be verified, and a likelihood score is assigned to it. The verification criteria include the aspect ratio of the region and the percentage of video pixels. The summation of the boundary edge intensity is used as the likelihood value. Finally, the region with the maximum likelihood is picked as the PiP region.
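A simplified sketch of the candidate generation and scoring described above is given below. It pairs edge-profile peaks into rectangles that satisfy the TRECVID size constraints and keeps the rectangle with the strongest boundary edges; the peak threshold is an assumed parameter, and the edge-segment overlap checks and video-pixel verification are omitted for brevity.

```python
def pip_candidates(h_edge_profile, v_edge_profile, height, width,
                   min_frac=0.3, max_frac=0.5, peak_frac=0.25):
    """h_edge_profile: per-row strength of the mean horizontal edge image;
    v_edge_profile: per-column strength of the mean vertical edge image.
    Returns the (top, bottom, left, right) rectangle with the highest boundary
    edge strength among those whose sides span 0.3 to 0.5 of the frame size."""
    h_peaks = [i for i, v in enumerate(h_edge_profile) if v > peak_frac * max(h_edge_profile)]
    v_peaks = [i for i, v in enumerate(v_edge_profile) if v > peak_frac * max(v_edge_profile)]
    candidates = []
    for top in h_peaks:
        for bottom in h_peaks:
            if not (min_frac * height <= bottom - top <= max_frac * height):
                continue
            for left in v_peaks:
                for right in v_peaks:
                    if not (min_frac * width <= right - left <= max_frac * width):
                        continue
                    # likelihood: total edge strength along the rectangle boundary
                    score = (h_edge_profile[top] + h_edge_profile[bottom]
                             + v_edge_profile[left] + v_edge_profile[right])
                    candidates.append((score, (top, bottom, left, right)))
    return max(candidates)[1] if candidates else None
```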
3.2.3 Query Keyframe Normalization
In addition to letterbox removal and PiP rescaling, we equalize and blur the query keyframe to overcome the effects of the change of gamma and white noise transformations. Because SIFT features are not invariant to mirror transformations, we also create a flipped version of each of these normalized query keyframes. In summary, we have 10 types of query keyframes: original, letterbox removed, PiP scaled, equalized, blurred, and the flipped versions of these five types.
3.3 Reference Keyframe Transformation
Complementary to normalizing the query keyframes, the reference keyframes can be pre-processed to eliminate the transformation effects as much as possible. Because of the large volume of the reference video set, we cannot afford too many variations of every reference keyframe. In our work, we simply apply two transformations to the reference keyframes: half resolution rescaling and strong re-encoding. The first transformation is for comparison with the detected PiP regions in the query keyframes, and the second is useful in dealing with strongly re-encoded query keyframes. In total, we therefore have 3 different types of reference keyframes.
Figure 3. An example of PiP detection: (a) a query keyframe; (b) the PiP region; (c) horizontal edge image; (d) vertical edge image; (e) horizontal edge profile; (f) vertical edge profile.

Given that the PiP region is smaller than half of the original resolution, we rescale the detected PiP region to half size as a normalization step. Generally speaking, SIFT features are robust to scale change, but we did find that the transformation detection helps to further improve the robustness of SIFT feature extraction. Transformation detection also benefits copy detection methods based on global visual features, such as Gabor texture, edge direction histogram, and grid color moments. Although these features can be easily computed, they are very sensitive to geometric changes, and transformation detection helps to improve the robustness of these approaches.

3.4 SIFT Extraction
The scale-invariant feature transform (SIFT) features [12] have proven effective in various video search and near duplicate image detection tasks. In this work, we adopted SIFT as the main feature for locating video copies. SIFT extraction is composed of two steps: 1) locating the keypoints that have local maximum Difference of Gaussian values both in scale and in space; these keypoints are specified by location, scale, and orientation; 2) computing a descriptor for each keypoint; the descriptor is a gradient orientation histogram, which is a 128 dimensional feature vector. We rely on Vedaldi's implementation [13] of SIFT because of its ease of use and the accompanying Matlab code for visualization and debugging purposes. More efficient alternatives, including the SURF (Speeded Up Robust Features) features [14], were also considered in this work, but we have not yet evaluated their effectiveness.

The default parameters of the SIFT feature computation program generate a few thousand SIFT features for a single image, which increases the computational complexity of the SIFT matching step. For video copy detection, a much smaller number of SIFT features is sufficient to verify that a frame is from the reference video. To reduce the number of SIFT features, we set the edge threshold to 5 and the peak threshold to 7. With this set of parameters, the number of SIFT features is reduced to about two hundred per frame, roughly one tenth of the default setting.
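For readers who want to reproduce a comparable setup, the sketch below extracts a reduced set of SIFT descriptors using OpenCV as a stand-in for the VLFeat implementation used in our system. OpenCV's contrast and edge thresholds are not numerically equivalent to VLFeat's peak and edge thresholds, so the values shown here are only illustrative.

```python
import cv2

def extract_sift(gray_image, max_features=300):
    """Extract a reduced set of SIFT keypoints and 128-D descriptors.
    gray_image: single-channel uint8 image. The thresholds below are
    illustrative stand-ins for the VLFeat settings cited in the paper."""
    sift = cv2.SIFT_create(nfeatures=max_features,
                           contrastThreshold=0.08,   # higher => fewer, stronger keypoints
                           edgeThreshold=5)          # lower => fewer edge-like keypoints
    keypoints, descriptors = sift.detectAndCompute(gray_image, None)
    return keypoints, descriptors
```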
3.5 Locality Sensitive Hashing
Directly comparing the Euclidean distance between two SIFT features in the 128 dimensional feature space does not scale. In this work, we utilize locality sensitive hashing (LSH) [15] for efficient SIFT feature matching. The idea of locality sensitive hashing is to approximate nearest neighbor search in a high dimensional space. The following is a brief introduction to LSH. Each hash function h_{a,b}(v) maps a 128 dimensional vector v onto the set of integers (bins),

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor,$$

where w is a preset bucket size, b is chosen uniformly in the range [0, w], and a is a random vector whose components follow a Gaussian distribution with zero mean and unit variance. This hash function possesses the desirable property that when v1 and v2 are close in the original vector space (i.e., ||v1 - v2|| is small), their hash values h_{a,b}(v1) and h_{a,b}(v2) are more likely to be the same, and when v1 and v2 are far apart (i.e., ||v1 - v2|| is large), their hash values are less likely to be the same. Using a hash function, the expensive high dimensional vector distance comparison is converted into a single integer comparison, which is extremely efficient.

Two additional parameters tune the hashing performance: 1) combining k parallel hash values reduces the probability that vectors far apart fall into the same bin; 2) forming L independent k-wise hash values reduces the probability that similar vectors are "unluckily" projected into different bins. For the SIFT features we computed, we found that the following parameters give satisfactory results: w = 700, k = 24, and L = 32. This creates 32 independent hash values for each SIFT feature. Two SIFT features are reported as matching if one or more of their L hash values are the same.
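A minimal sketch of this hash family is shown below, assuming the parameters quoted above (w = 700, k = 24, L = 32). In the deployed system each k-wise code would be packed into a compact key for the index files rather than kept as a Python tuple.

```python
import numpy as np

class SiftLSH:
    """p-stable LSH for 128-D SIFT descriptors: L independent hash codes,
    each the concatenation of k scalar hashes h(v) = floor((a.v + b) / w)."""
    def __init__(self, dim=128, w=700.0, k=24, L=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.a = rng.standard_normal((L, k, dim))   # Gaussian random projections
        self.b = rng.uniform(0.0, w, size=(L, k))   # uniform offsets in [0, w)

    def hash(self, v):
        """Return L hash codes (tuples of k integers) for one descriptor v."""
        bins = np.floor((self.a @ v + self.b) / self.w).astype(np.int64)
        return [tuple(row) for row in bins]

def lsh_match(codes1, codes2):
    """Two SIFT features match if at least one of their L codes is identical."""
    return any(c1 == c2 for c1, c2 in zip(codes1, codes2))
```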
3.6 Indexing and Search by LSH
Indexing the L LSH values for the reference videos is straightforward. We simply sort them independently and save the sorted hash values with their SIFT identifications (a string composed of the reference video ID, the keyframe ID, and the SIFT ID) in a separate index file.
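An in-memory analogue of one such index file, together with the binary-search lookup used at query time, might look like the following sketch; the on-disk layout and ID encoding details are omitted.

```python
import bisect

def build_index(entries):
    """entries: list of (hash_code, sift_id) pairs for one of the L hash sets,
    where sift_id encodes the reference video ID, keyframe ID, and SIFT ID.
    Returns parallel lists sorted by hash code, mimicking one index file."""
    entries = sorted(entries)
    codes = [code for code, _ in entries]
    ids = [sid for _, sid in entries]
    return codes, ids

def lookup(codes, ids, query_code):
    """Binary-search all reference SIFT IDs sharing the query hash code: O(log N)."""
    lo = bisect.bisect_left(codes, query_code)
    hi = bisect.bisect_right(codes, query_code)
    return ids[lo:hi]
```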
Querying the index file for an LSH value by binary search is very efficient, since the complexity is on the order of log(N), where N is the cardinality of the index file. For a query keyframe, the task is to find all reference keyframes with matching SIFT features, using the number of matching SIFT pairs as the relevance score for ranking. To reduce the computational complexity of the subsequent processing, we only keep the top 256 reference keyframes on the matched list.

3.7 Keyframe Level Query Refinement
Keyframe matching based purely on SIFT and LSH is not reliable enough, for two reasons: 1) the original SIFT matching by Euclidean distance is not reliable, especially when the number of SIFT features is large and various transformations introduce noise into SIFT extraction; 2) two SIFT features that are far apart may be mapped to the same LSH value. We therefore rely on an additional mechanism to validate the list of retrieved reference keyframes produced in the last section. In this work, RANdom SAmple Consensus (RANSAC) [8][9] is utilized for this purpose.

RANSAC is an iterative method for estimating model parameters from observed data containing outliers. Here, RANSAC is used to estimate the affine transform that maps the keypoints in the original reference keyframe to those in the query keyframe. The affine transform is able to model the geometric changes introduced by transforms such as PiP, shift, and ratio change. Specifically, the keypoint at pixel (x, y) in the reference keyframe is mapped to pixel (x', y') in the query keyframe by

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.$$

To determine the affine model parameters (a, b, c, d, e, f), three pairs of keypoints are required, where the three keypoints in the reference keyframe are not collinear. For a pair of keyframes, the detailed RANSAC procedure is as follows:
1. Randomly select 3 pairs of matching keypoints (keypoints having the same LSH values).
2. Determine the affine model.
3. Transform all keypoints in the reference keyframe into the query keyframe.
4. Count the number of keypoints in the reference keyframe whose transformed coordinates are close to the coordinates of their matching keypoints in the query keyframe. These keypoints are called inliers.
5. Repeat steps 1 to 4 a fixed number of times, and output the maximum number of inliers.

Figure 4 presents an example of RANSAC verification. Figure 4 (a) displays a query keyframe (on the left) and a reference keyframe side by side, and links all original matching keypoints by lines. It is clear that some of the matched keypoints are incorrect. Figure 4 (b) shows the matching keypoints after RANSAC verification; all of the incorrect matches are removed.

Figure 4. SIFT verification by RANSAC: (a) original matching keypoints; (b) matching keypoints after RANSAC verification.
After RANSAC verification, we use the maximum number of inliers as the new relevance score between a query keyframe and a reference keyframe.
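A compact version of this verification step is sketched below; the number of iterations and the inlier distance tolerance are illustrative settings rather than the values used in our system.

```python
import numpy as np

def ransac_affine_inliers(ref_pts, qry_pts, iters=500, tol=5.0, seed=0):
    """ref_pts, qry_pts: N x 2 arrays of LSH-matched keypoint coordinates
    (row i of ref_pts matches row i of qry_pts). Returns the maximum number
    of inliers over the sampled affine models, used as the refined score.
    iters and tol are illustrative settings."""
    rng = np.random.default_rng(seed)
    n = len(ref_pts)
    if n < 3:
        return 0
    ref_h = np.hstack([ref_pts, np.ones((n, 1))])      # homogeneous coordinates [x, y, 1]
    best = 0
    for _ in range(iters):
        idx = rng.choice(n, 3, replace=False)
        A = np.hstack([ref_pts[idx], np.ones((3, 1))])  # 3 x 3 system for the affine model
        if abs(np.linalg.det(A)) < 1e-6:                # the 3 reference points are collinear
            continue
        # Solve A @ M = qry for the 3 x 2 model M, i.e. [x y 1] @ M = [x' y']
        M = np.linalg.solve(A, qry_pts[idx])
        proj = ref_h @ M                                # map all reference keypoints
        err = np.linalg.norm(proj - qry_pts, axis=1)
        best = max(best, int((err < tol).sum()))
    return best
```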
3.8 Keyframe Level Result Merge
As mentioned in Section 2, to cope with the visual effects introduced by various transformations, we transform the reference keyframes to mimic the query transformations, and we normalize the query keyframes to recover the original content. Table 1 shows all the combinations of transformed reference keyframes and normalized query keyframes that we considered. The choice of combinations in this work is driven by the requirements of the TRECVID CBCD evaluation task. For other applications, a reduced or expanded list may be more appropriate. For example, in the application of commercial detection, the pair of original query keyframe vs. original reference keyframe alone is sufficient. Considering more pairs introduces more computational complexity, so finding the optimal pair list is a tradeoff between detection accuracy and speed in a real system.

As shown in Table 1, we considered 12 combinations in total. Hence, for each query keyframe, there are 12 lists of ranked reference keyframes. The merging process simply combines these 12 lists into one list, as sketched below. If a reference keyframe appears more than once in the 12 lists, its new relevance score is set to the maximum of its original relevance scores; otherwise, its relevance score is unchanged. The new list is ranked based on the new scores, and only the top 256 reference keyframes are kept for further processing.
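A minimal sketch of this merge, assuming each of the 12 per-pair query results is already a ranked list of (reference keyframe ID, score) tuples:

```python
from collections import defaultdict

def merge_keyframe_results(result_lists, top_n=256):
    """result_lists: one ranked list per (query variant, reference variant)
    pair in Table 1, each a list of (reference_keyframe_id, score) tuples.
    A reference keyframe appearing in several lists keeps its maximum score;
    the merged list is re-ranked and truncated to the top top_n entries."""
    merged = defaultdict(float)
    for results in result_lists:
        for ref_kf, score in results:
            merged[ref_kf] = max(merged[ref_kf], score)
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```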
Table 1. Normalized query keyframes vs. transformed reference keyframes

 Pair | Query keyframe              | Reference keyframe
 -----+-----------------------------+-------------------
  1   | Original                    | Original
  2   | Flipped                     | Original
  3   | Letterbox removed           | Original
  4   | Letterbox removed & flipped | Original
  5   | Equalized                   | Original
  6   | Equalized & flipped         | Original
  7   | Blurred                     | Original
  8   | Blurred & flipped           | Original
  9   | Original                    | Strong re-encoded
  10  | Flipped                     | Strong re-encoded
  11  | Picture in Picture (PiP)    | Half resolution
  12  | PiP & flipped               | Half resolution

3.9 Video Level Result Fusion
Based on the query results of all keyframes of a query video, we can determine the list of best matching reference videos for it. Consider one query video and one reference video, where at least one keyframe of the reference video is on the matched list of at least one keyframe of the query video. The query video has Q keyframes and the reference video has R keyframes, and the relevance score between query keyframe i and reference keyframe j is denoted by S(i, j). The timestamp of query keyframe i is QT(i), and that of reference keyframe j is RT(j). Figure 5 plots the relevance score matrix, where all non-zero entries are marked by dark squares.

For each pair of matched keyframes, we compute an extended relevance score, and the relevance score between the two videos is the maximum of all extended relevance scores. The extended relevance score is computed as follows. For a matching keyframe pair (i, j), highlighted in Figure 5, we first compute the timestamp difference ∆ = RT(j) - QT(i). We then find all matching keyframe pairs whose timestamp difference is in the range [∆ - δ, ∆ + δ], where δ is set to 5 seconds in this work. The extended relevance score for pair (i, j) is the sum of the relevance scores of these filtered pairs. To reduce the impact of the number of query keyframes, N, we normalize the extended relevance score by 1/log(N). In Figure 5, the time difference range is marked by two dashed lines. The video level matching determined by pair (i, j) is query keyframes [i1, i2] => reference keyframes [j1, j2], and N = i2 - i1 + 1.

Once we have the extended relevance scores for all pairs, the one with the maximum score is picked as the video level match between the corresponding reference video and query video. Based on the video level relevance scores, we sort all matching videos for each query video. To simplify the computation, we only keep the top 8 matching videos per query video.

Figure 5. Video Level Result Fusion
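A simplified sketch of this temporal fusion follows. It works directly on (query timestamp, reference timestamp, score) triples; the normalization uses the number of distinct matched query keyframes as a stand-in for the keyframe index span N described above, which is a simplification.

```python
import math

def video_relevance(matches, delta=5.0):
    """matches: list of (QT_i, RT_j, score) for all matched keyframe pairs
    between one query video and one reference video, timestamps in seconds.
    For each pair, sum the scores of all pairs whose offset RT - QT falls
    within +/- delta of its own offset, normalize by 1/log(N), and return
    the maximum extended score."""
    best = 0.0
    for qt_i, rt_j, _ in matches:
        offset = rt_j - qt_i
        group = [(qt, rt, s) for qt, rt, s in matches
                 if abs((rt - qt) - offset) <= delta]
        n = len({qt for qt, _, _ in group})          # simplified stand-in for N
        extended = sum(s for _, _, s in group) / max(math.log(n), 1.0)
        best = max(best, extended)
    return best
```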
3.10 Video Relevance Score Normalization
The TRECVID CBCD evaluation is conducted on the copy detection results of all query videos together, not on each query video individually. Different decision thresholds are used to truncate the submitted results in order to measure the probability of miss and the false alarm rate. Therefore, it is important to normalize the video relevance scores across the entire collection of query videos to boost the system performance.

First, we normalize the relevance scores into the range [0, 1] by the following sigmoid function,

$$y = \frac{2}{1 + \exp(-x/50)} - 1,$$

where x is the original relevance score and y is the normalized one. Then, for each query video, we normalize the score of the best matching reference video. Assume that there are N matching reference videos for a query video, with relevance scores {x_i}, i = 1, ..., N, satisfying x_i > 0 and x_i >= x_j if i < j. x_1 is normalized as x_1 = x_1 * (x_1/x_2). The motivation for this normalization is that if the best matching video has a much higher relevance score than the second best matching video, it is more likely to be correct. Reference videos whose relevance scores are less than half of the best matching one are removed, since they are likely to be wrong.

Finally, the scores are normalized across all query videos. As mentioned in the last section, each query video may have up to 8 matching reference videos. The idea is to map the relevance scores of the best matching reference videos of all query videos into the range [7, 8] by a sigmoid function, then map the relevance scores of the second best matching reference videos of all query videos into the range [6, 7], and so on. In this way, higher ranked matching videos have higher priority to be included in the final detection results.
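The per-query part of this normalization can be sketched as follows; the final step that maps the ranked scores of all query videos into the bands [7, 8], [6, 7], and so on is omitted here for brevity.

```python
import math

def sigmoid_normalize(x, scale=50.0):
    """Map a raw relevance score into [0, 1): y = 2 / (1 + exp(-x/scale)) - 1."""
    return 2.0 / (1.0 + math.exp(-x / scale)) - 1.0

def normalize_query_scores(scores):
    """scores: per-query list of matching reference video scores, sorted in
    descending order. Boost the top score by its ratio to the runner-up and
    drop candidates scoring less than half of the best, as described above."""
    scores = [sigmoid_normalize(x) for x in scores]
    if len(scores) >= 2 and scores[1] > 0:
        scores[0] *= scores[0] / scores[1]
    best = scores[0] if scores else 0.0
    return [s for s in scores if s >= 0.5 * best]
```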
3.11 CBCD Result Generation
TRECVID requires that the copy detection results for all query videos of one system be compiled in one run file. Within the run file, we need to provide a list of result items, where each item specifies the query video ID, the detected reference video ID, the boundary information of the copied reference video segment, the starting frame of the copied segment in the query video, and a decision score (the higher the score, the stronger the evidence). In addition to the copy detection results, participants also need to provide additional information, including the processing time for each query video, the operating system, the amount of memory, etc. The timing information is used to measure the efficiency of the submitted approach. While generating the CBCD run file, we use the relevance scores directly as the decision scores. To keep a balance between the false alarm rate and the miss probability, we only include the top 2 highest ranked matching videos for each query video. Depending on the application and evaluation criteria, more or fewer matched videos can be reported.

4. Experimental Results
We participated in the TRECVID 2009 CBCD evaluation task and submitted 3 runs with various system parameters. The TRECVID 2009 evaluation results were not yet available at the time this manuscript was prepared. In this section, we report the evaluation results of our system on the query dataset of the TRECVID 2008 CBCD task. The reference video dataset is from the TRECVID 2009 CBCD evaluation task, which includes the videos used in the TRECVID 2008 CBCD evaluation task; the volume of the TRECVID 2009 dataset is about twice that of the 2008 dataset. While creating the CBCD runs, we simply remove the detected reference videos that do not belong to the TRECVID 2008 dataset. The TRECVID 2008 CBCD dataset contains 2010 short query videos, and the TRECVID 2009 CBCD dataset contains 838 reference videos (from the Netherlands Institute for Sound and Vision) and 208 non-reference videos (from the British Broadcasting Corporation). In total, we extracted ~268 thousand keyframes and ~57 million SIFT features for the entire reference video set, and ~25 thousand keyframes and ~3.7 million SIFT features for the query video set.

4.1 Transformation Detection Performance
Letterbox effects may be due to different transformations, including stretching, cropping, and shifting, and sometimes the original reference video already contains a letterbox. Since ground truth for letterbox is not available, we did not measure the performance of letterbox detection; our preliminary observations indicate that it is very reliable. To measure the performance of the picture-in-picture detection method, we considered all video queries with only the PiP transformation (the T2 category). In total there are 201 such queries, and the proposed PiP detection algorithm successfully detected 192 of them, a miss probability of less than 5%. We did not measure the false alarm rate, since a false detection is not a major concern in this work: the only negative effect of a falsely detected PiP is additional computational load, and it normally does not affect the SIFT matching performance.

4.2 Samples of Keyframe Level Query Results
This section shows a few examples of the CBCD results. Figures 6 to 8 show keyframe level query results for three examples. Figure 6 (a) shows a keyframe in the query video, to which a combination of transformations, including offset, insertion of pattern and text, and PiP, has been applied. It is clear that the majority of the original frame has been significantly modified in the query. Figure 6 (b) shows the best matching keyframe in the reference videos.
Figure 6. Keyframe level query result of example 1: (a) query keyframe; (b) reference keyframe.

Figure 7 shows the second example. The query video has been heavily re-encoded, and it is difficult even for a human observer to recognize the content and find the reference video manually.
Figure 7. Keyframe level query result of example 2: (a) query keyframe; (b) reference keyframe.

In Figure 8, the query video went through two transformations: blur and noise. Again, in this case, the video copy detection algorithm correctly located the best matching reference keyframe.
Figure 8. Keyframe level query result of example 3: (a) query keyframe; (b) reference keyframe.
4.3 CBCD Evaluation
4.3.1 CBCD Evaluation Criteria
There are three performance measures specified for the CBCD task: minimal normalized detection cost rate (min NDCR), copy location accuracy, and processing time.

First, we introduce the detection cost rate (DCR), which is a combination of two error rates: the probability of a miss error (PMiss) and the false alarm rate (RFA). By the TRECVID evaluation plan, if at least one detected copy segment overlaps with the true reference video segment, the copy is successfully found. When multiple detected copy segments are reported for one query, the one with the maximum overlap (if there is overlap) with the true reference video segment is counted as the correct hit, and the rest are treated as false alarms. Specifically, DCR is defined as

$$DCR = C_{Miss} \cdot P_{Miss} \cdot R_{target} + C_{FA} \cdot R_{FA},$$

where C_Miss and C_FA are the costs of a miss and a false alarm, and R_target is the a priori target rate. For the TRECVID 2008 CBCD task, R_target = 0.5/hour, C_Miss = 10, and C_FA = 1. In order to compare detection cost rate values across a range of parameter values, DCR is normalized as follows,

$$NDCR = \frac{DCR}{C_{Miss} \cdot R_{target}} = P_{Miss} + \beta \cdot R_{FA}, \quad \text{where } \beta = \frac{C_{FA}}{C_{Miss} \cdot R_{target}}.$$

Results of individual transformations within each run are evaluated separately. Different decision thresholds θ are applied to generate a list of pairs of increasing PMiss and decreasing RFA, and the minimal NDCR is found for each transformation.

Copy location accuracy assesses the accuracy of finding the exact extent of the copy in the reference video; it is only measured for correctly detected copies. The mean F1 (harmonic mean) score based on the precision and recall of the detected copy location relative to the true video segment is adopted.

Copy detection processing time is the mean time to process a query. It includes all processing, from reading in the query video to outputting the results.

4.3.2 CBCD Evaluation Results
Table 2 lists the evaluation results of our CBCD system. Figure 9 and Figure 10 compare the performance of our system with that of the participants of TRECVID 2008. In total, 22 groups submitted 48 runs to TRECVID 2008 [18].

Table 2. Evaluation results on TRECVID 2008 query data

 Transformation Category | Min. NDCR | Mean F1 | Mean Processing Time (seconds)
 ------------------------+-----------+---------+-------------------------------
  T1                     |   0.441   |  0.647  |  128.0
  T2                     |   0.45    |  0.585  |  128.4
  T3                     |   0.142   |  0.854  |  128.4
  T4                     |   0.202   |  0.793  |  128.7
  T5                     |   0.094   |  0.899  |  128.4
  T6                     |   0.221   |  0.829  |  128.6
  T7                     |   0.292   |  0.715  |  128.7
  T8                     |   0.158   |  0.822  |  128.4
  T9                     |   0.219   |  0.807  |  128.4
  T10                    |   0.552   |  0.721  |  128.5
  Average                |   0.277   |  0.767  |  128.4

Figure 9 plots the minimum NDCR of the top 10 runs [18]. The smaller the minimum NDCR, the better the performance. As indicated before, each transformation type is measured separately. The top 10 runs are marked with short color bars, the median performance is marked with squares linked by solid lines, and the results of our system are marked with circles linked by dotted lines. From this figure, it is obvious that the difficulty of copy detection varies greatly among transformations. T5, change of gamma, is the easiest, where all top 10 runs achieved a minimum NDCR of less than 0.1. The more challenging transformations are T2, T9, and T10. The performance of our system follows this pattern as well. An interesting observation is that our system performs relatively better on T2, T8, and T9, which may contain the picture-in-picture transformation. This is likely because our system detects the PiP transformation and applies specific processing to it. Overall, our system achieves reasonably good performance, within the top 10 in all 10 categories, and it is much better than the median runs.

Figure 9. Performance of NDCR
Figure 10 plots the F1 measure of the top 10 runs [18]. The larger the F1 measure, the better the performance. Again, it is obvious that T5 is the easiest transformation for copy detection, while T2, T9, and T10 are harder. As in Figure 9, the results of our system are marked with circles. Our system achieves good performance on the F1 measure, within the top 5 runs for all transformation types, which means that the boundaries of the copied segments are accurately detected. One possible reason is that the content based sampling module produces highly consistent shot boundaries in all videos.

The mean processing time results are plotted in Figure 11 [18]. The processing time of our system is roughly the same for all transformations, as anticipated, since all query videos go through the same processing steps. The mean processing times of our system are comparable to those of the median run, which indicates that our system is scalable compared to the current state-of-the-art approaches. A few directions for further reducing the computational complexity are: 1) decreasing the number of SIFT features and/or the number of LSH hash sets; 2) improving the implementation, for example, adopting a more efficient way to read in the large index files (e.g., larger than 2 GB) and parallelizing the query processing; 3) further investigating the choice of pairs of transformed reference keyframes and normalized query keyframes (see Table 1).
Figure 10. Performance of F1 measure
5. Content Services Enabled by CBCD
The ability to rapidly and reliably detect duplicate, degraded, or transformed video segments enables several applications beyond copyright infringement detection. We have developed architectures and systems for large scale content processing that support ingestion and archiving of a large number of digital broadcast television feeds from terrestrial, satellite, cable television, and IPTV sources [16]. Efforts to apply the developed content based video copy detection system in a range of service prototypes are under way. In this section, we briefly discuss some of these directions.
5.1 Commercial Verification
Nationally broadcast television services support the ability to dynamically insert advertising based on the geographic region to maximize the value and relevance of the advertising content to the viewer. The required infrastructure is complex, and the Society of Cable Television Engineers (SCTE) has developed widely adopted standards such as SCTE-35 to allow interoperability among vendors' equipment, such as ad campaign management servers, ad insertion servers, and ad splicers, for both linear and on-demand applications. Systems are being developed with greater geographic precision, and ultimately will allow ads to be personalized and targeted for individual viewers. Given the scale and engineering complexity involved, errors are unavoidable, and considering the central role that advertising plays as a revenue source for broadcasters, it is not surprising that verification systems are employed to ensure that the ads are being delivered as intended. These systems may employ preprocessing for watermarking coupled with identifiers using the advertising identification (Ad-ID) standard [17] to associate content with metadata about the ad. In contrast, we are focusing on using CBCD, which requires no such preprocessing, but simply requires a sample of the ad to use as a query. Further, the proven invariance of our approach to commonly encountered transformations, such as letterbox, high definition (HD) to standard definition (SD) conversion, and coding artifacts, ensures its suitability for this application.

The service requirements involve generating summary reports of ad delivery on a daily or longer periodic basis, which suggests an offline processing approach that maps well to our scalable architecture. Data reduction for processing efficiency is achieved by first identifying commercial segments using multimodal processing with features derived from the video, audio, and closed caption streams if available. Once the ad segments are identified in a set of broadcast streams corresponding to the desired time range, such as one day, a set of queries is executed based on the set of ads to be verified.
Figure 11. Performance of processing time

5.2 Recovering Content Structure
In television news production, there is a high cost associated with creating video from remote locations. For planned events, only a few seconds of content may be of interest to viewers - such as the president waving as he steps aboard a helicopter. Given this paucity of content, news programs are often structured so that a preview containing the video is shown, followed later by the same content being repeated in a slightly longer form with different anchor or reporter commentary. In some cases the footage is pooled or syndicated so that the same content appears in many
different news programs from different broadcasters with differing graphic overlays and differing voice-overs. For breaking news, 24-hour news channels striving to fill air time will repeat field video over and over, adding additional clips as they become available. Video CBCD plays a key role for archiving and retrieval applications in the aforementioned cases. Recovering the structure of news programming aids content navigation for historical analysis and can enable business applications such as stock footage licensing. Analysis of the frequency and sequencing of the broadcast of particular news clips can be used to infer a measure of the value of the clip as estimated by the producer of the news program. Similarly, video CBCD methods apply to the problem of summarizing rushes, or raw footage, which has been the subject of TRECVID evaluations. In this application, which involves preproduction source material, many 'takes' of a particular scene are shot so that editors can later select the best segments for the final production. Many hours of source material are used to produce a single hour of programming, and the task of editing is greatly simplified if the content structure can be recovered so that editors can easily select from the multiple 'copies' of the desired shot. Of course, slight variations in shot duration and camera angle must be tolerated to achieve the desired result.
6. Conclusions and Future Work
In this paper, we presented a content based copy detection system. The system is based on SIFT features and relies on LSH for scalability. RANSAC is used for reliable SIFT matching. The evaluation results on the TRECVID 2008 CBCD query dataset are indicative of its good performance for both detection cost rate and detection accuracy. The algorithm is also scalable. Our future work will be targeted at:

• Further improving the scalability of the system. We will investigate other local visual features, including SURF, which take less time to compute. Global visual features, including color histogram, edge, and texture information, may also be used as a pre-processing step to speed up the local visual feature search.

• Incorporating the audio information. Videos normally contain both audio and video streams. The audio information can be used to further improve the detection accuracy and reduce the computational complexity. We are planning to use a set of audio features, including Mel-frequency cepstral coefficients (MFCC), and evaluate the performance of pattern recognition schemes such as dynamic programming and hidden Markov models. It will be interesting to evaluate the fusion of audio and visual detection results as well.

• Applying the developed copy detection algorithm in real world applications. Possible applications and services include large scale content based image and video search, video copy detection for P2P streaming networks, etc.
7. ACKNOWLEDGMENTS
This work benefits from the contributions of Andrea Basso, Patrick Haffner, and Bernard Renger at AT&T Labs - Research.
8. REFERENCES
[1] J. Law-To, et al., "Video Copy Detection: A Comparative Study," CIVR 2007, Amsterdam, The Netherlands.
[2] C. Kas and H. Nicolas, "Compressed Domain Copy Detection of Scalable SVC Videos," 7th International Workshop on Content-Based Multimedia Indexing, 2009.
[3] O. Chum, J. Philbin, and A. Zisserman, "Near Duplicate Image Detection: min-Hash and tf-idf Weighting," Proceedings of the British Machine Vision Conference, 2008.
[4] M. Douze, A. Gaidon, H. Jegou, M. Marszalek, and C. Schmid, "INRIA-LEAR's Video Copy Detection System," TRECVID Workshop, Nov. 17-18, 2008, Gaithersburg, MD.
[5] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos," ICCV 2003.
[6] P. Over, et al., "TRECVID 2008 – Goals, Tasks, Data, Evaluation Mechanisms and Metrics," TRECVID Workshop, Nov. 17-18, 2008, Gaithersburg, MD.
[7] B. Shahraray, "Scene Change Detection and Content-Based Sampling of Video Sequences," in Digital Video Compression: Algorithms and Technologies 1995, Proc. SPIE 2419, February 1995.
[8] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, 24: 381-395, June 1981.
[9] W. Zhang and J. Kosecka, "Generalized RANSAC Framework for Relaxed Correspondence Problems," 3DPVT 2006, Chapel Hill, NC.
[10] Z. Liu, D. Gibbon, E. Zavesky, B. Shahraray, and P. Haffner, "A Fast, Comprehensive Shot Boundary Determination System," ICME 2007, Beijing, China, July 2-5, 2007.
[11] O. Orhan, et al., "University of Central Florida at TRECVID 2008: Content Based Copy Detection and Surveillance Event Detection," TRECVID Workshop, Nov. 17-18, 2008, Gaithersburg, MD.
[12] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," IJCV, 60(2), pp. 91-110, 2004.
[13] A. Vedaldi and B. Fulkerson, "VLFeat: An Open and Portable Library of Computer Vision Algorithms," 2008, http://www.vlfeat.org/.
[14] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, 2008.
[15] A. Andoni and P. Indyk, "Near-Optimal Hashing Algorithms for Near Neighbor Problem in High Dimension," Communications of the ACM, Vol. 51, No. 1, 2008.
[16] D. Gibbon and Z. Liu, "Large Scale Content Analysis Engine," The 1st Workshop on Large-Scale Multimedia Retrieval and Mining, in conjunction with ACM Multimedia 2009, Beijing, China, Oct. 23, 2009.
[17] Ad-ID: Advertising Identification and Management, http://www.ad-id.org/.
[18] W. Kraaij, G. Awad, and P. Over, "TRECVID 2008 Content-Based Copy Detection Task Overview," TRECVID Workshop, Nov. 17-18, 2008, Gaithersburg, MD.