Video Copy Detection Based on Source Device Characteristics: A Complementary Approach to Content-Based Methods

Sevinc Bayram
Polytechnic University, Department of Electrical and Computer Engineering, Brooklyn, NY
[email protected]

Husrev Taha Sencar
Polytechnic University, Department of Computer and Information Science, Brooklyn, NY
[email protected]

Nasir Memon
Polytechnic University, Department of Computer and Information Science, Brooklyn, NY
[email protected]
ABSTRACT

We introduce a new video copy detection scheme to complement existing content-based techniques. Our scheme is based on the fact that visual media possess unique characteristics that can be used to link media to their source. The proposed scheme attempts to detect duplicate and modified copies of a video based primarily on the peculiarities of imaging sensors rather than on content characteristics alone. We demonstrate the viability of our scheme both by analyzing its robustness against common video processing operations and by evaluating its performance on real-world data. The results show that the proposed scheme is very effective and well suited to the video copy detection application.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.4.7 [Image Processing and Computer Vision]: Feature Measurement

General Terms
Design, Experimentation

Keywords
Video copy detection, imaging sensor, PRNU noise, video signature
1. INTRODUCTION
Recent research in digital image and video forensics [19, 25] has shown that media data have certain characteristics that relate to the physical mechanisms and algorithms used in their generation. These imperceptible characteristics are embedded within the visual content, and they emerge largely due to variations in component design, component tolerances and defects, choice of processing methods, and processing artifacts. Multimedia forensics techniques have successfully utilized such low-level characteristics to identify individual properties of the source device of an image or video (like the noise characteristics of the imaging sensor and traces of sensor dust [7, 9, 18, 23]) as well as class properties of a source device (like the type of color filter array, the specifics of the demosaicing technique, the type of lens, and the compression parameters [2, 5, 27]). While multimedia forensics techniques are essential in determining the origin, veracity, and nature of media data, they have much wider application potential. Motivated by this, in this work we show how source device characteristics extracted from multimedia can be deployed to achieve the goals of conventional content-based video copy detection techniques.

Video copy detection techniques are automated analysis procedures for identifying duplicate and modified copies of a video among a large number of videos, so that their use can be managed by content owners and distributors. These techniques are required to accomplish various tasks involved in identifying, searching, and retrieving videos from a database. Furthermore, as video databases grow in scale, the ability to perform these tasks accurately and rapidly becomes increasingly crucial. For example, it is reported that the number of videos in the database of the video sharing site YouTube (www.youtube.com) had reached 73.8 million by March 2008, and that every day more than 150 thousand videos are uploaded to its servers (see the Kansas State University Digital Ethnography Group's YouTube statistics report at http://mediatedcultures.net/ksudigg/?p=163). In such systems, copy detection techniques are needed for efficient indexing, copyright management, and accurate retrieval of videos, as well as for the detection and removal of repeated videos to reduce storage costs. The development of monitoring systems that can track commercials and media content (e.g., songs, movies) over various broadcast channels is another application where copy detection techniques are needed most. In any case, realizing the above tasks requires techniques that can provide distinguishing characteristics of videos which are also robust to various types of modification.

The most prominent approach in video copy detection has been to extract unique features from the audiovisual content
[8, 20]. Therefore, many content-based features have been proposed. These include color features like layouts [15], histograms [17, 21, 29], and coherence [22]; spatial features like edge maps [1] and texture properties [24, 28]; temporal features like shot length [13]; and spatio-temporal features like 3D-DCT coefficient properties [6] and differential luminance characteristics [14]. A video signature is generated either by organizing the computed features into suitable representations or by cryptographically hashing them to obtain more succinct representations. The resulting signatures are expected to be unique and robust under common processing operations, and they are stored in a database for later verifying the match of a given video.

To improve detection accuracy, various approaches have been taken, including the use of key-frames, groups-of-frames, and all-frames of a video in generating a signature. The three approaches have advantages and disadvantages relative to each other. For example, the use of key-frames and groups-of-frames yields faster matching of two videos than the use of all-frames, but due to the loss of some temporal information, it may not always be possible to detect at what exact position the two videos match. On the other hand, using all-frames during matching is potentially vulnerable to temporal desynchronizations due to frame droppings and frame rate changes.

The biggest challenge in video copy detection is to retrieve duplicate or modified versions of a video while being able to discriminate it from other similar videos. Since a video can be modified in many different ways, including common video processing operations, overlaying graphical objects onto video frames, and insertion/deletion of video content, obtaining video signatures that are robust to all these types of modifications is a challenging task. Figure 1-a displays frames from a video and its contrast-enhanced version, which are expected to yield the same signature. Similarly, Figure 1-b displays the copy of a video with an overlaid advertisement. While robustness of the extracted video signatures is crucial for the success of video copy detection techniques, such a requirement, at the same time, makes it very difficult to differentiate between videos that are very similar in content. Figures 1-c and 1-d show frames from videos that are visually very similar but essentially different. Therefore, in the presence of many content-wise similar videos, detecting modified copies of a given video becomes a very challenging task, and the rapidly increasing size of video databases significantly exacerbates the problem.

In the context of these difficulties, the main insight of this work is that the use of source device characteristics provides a new level of information that can help alleviate the above problems. The fact that source characteristics are not primarily content dependent makes them potentially very effective against problems arising from similarity of content. Moreover, since source device characteristics are not subject to the constraints of the audiovisual content in the same way, they are not equally prone to the effects of common video processing operations, which makes them robust against certain modifications. Hence, incorporating source device characteristics alongside content-based features will improve the overall accuracy of video copy detection techniques.
In this paper, we propose a new video copy detection scheme that utilizes the unique characteristics of the imaging sensors used in cameras and camcorders. Our scheme is inspired by the results of [4] and [23], which showed that noise-like
variations in images and videos, caused by the different sensitivity of the pixels of an imaging sensor to light, can be practically measured and used as a fingerprint of the imaging sensor. The underlying idea of the proposed scheme is that a video signature can be defined as a weighted combination of the fingerprints of the camcorders (i.e., imaging sensors) involved in the generation of a video. The resulting signature essentially depends on various factors, including the duration of the video, the number of involved camcorders, the contribution of each camcorder, and, in part, the content of the video. We demonstrate the viability of the idea on videos taken by several different camcorders and on several copies of duplicate and near-duplicate videos downloaded from YouTube. Our results show that signatures extracted from a set of videos downloaded from YouTube do not yield a false positive in detecting near-duplicate videos and that the signatures are robust to both temporal changes and various common processing operations.

Figure 1: (a) A video and its contrast-enhanced duplicate. (b) A video and its advertisement-overlaid version. (c) Two videos taken at slightly different angles. (d) Similar but not duplicate videos.
2. SENSOR FINGERPRINTS

2.1 Fingerprinting an Imaging Device

One of the most important components of any imaging device (e.g., digital cameras, scanners, and camcorders) is the imaging sensor, which measures the intensity of incident light over the whole spectrum and obtains an electrical representation of what is being captured. An imaging sensor is essentially a two-dimensional array of tightly packed, light-sensitive elements called pixels, and an image or video is generated by processing the raw image data obtained from the pixels. However, due to manufacturing imperfections, tolerances, and minor defects, each imaging sensor exhibits unique and non-varying characteristics. As a result, data acquired by imaging sensors (e.g., camera images, scanned images, and videos) inherit traces of the sensor's peculiar characteristics. These characteristics appear in every image or frame captured by the sensor, and therefore they can be used as a fingerprint of the imaging device.

Various approaches have been proposed to identify and extract such systematic errors. The first work in the field was undertaken by Kurosawa et al. [18], who aimed at detecting the fixed-pattern (FP) noise associated with an imaging sensor. FP noise reveals itself in the form of fixed offset values in the pixel readings and can be easily extracted by capturing a video (or an image) when the sensor is not exposed
to any light. The authors demonstrated the success of the FP noise pattern in identifying the source of a video. However, since FP noise is additive and can be easily extracted, manufacturers later added mechanisms to eliminate it by first capturing a dark frame and subtracting it from every subsequently captured video frame or image.

Later, Geradts et al. [9] proposed utilizing pixel defects, in the form of hot/cold pixels and pixel traps, which arise primarily due to high leakage currents, circuit defects, dust particles, and scratches on the imaging sensor, to identify the source of an image. Although their results show that these imperfections are unique to imaging sensors and quite robust to JPEG compression, most digital cameras today deploy mechanisms to detect and compensate for pixel imperfections through post-processing. Therefore, FP noise and pixel defects do not constitute reliable ways of fingerprinting an imaging sensor.

More recently, similar to [18], Lukáš et al. [23] and Chen et al. [3, 4] proposed a more reliable source digital camera and camcorder identification method. Their method is based on the extraction of the so-called photo-response non-uniformity (PRNU) noise pattern, which is caused mainly by impurities in silicon wafers. These imperfections affect the light sensitivity of each individual pixel and cause a fixed noise pattern. Unlike FP noise, PRNU noise is multiplicative, and correcting the offsets in the pixel readings due to PRNU noise requires the ability to create a perfectly lit scene within the device. Since this cannot be trivially achieved, the PRNU noise pattern can be reliably used for fingerprinting an imaging device. Khanna et al. [16], Gou et al. [12], and Gloe et al. [10] have extended this approach to source scanner identification, wherein the imaging sensor is typically a one-dimensional linear array.
2.2 Extraction of the PRNU Noise Pattern

To identify the source camera of an image, Lukáš et al. [23] proposed a method to extract the PRNU noise pattern from digital camera images. This is realized through a denoising procedure in which an input image is subjected to a wavelet-based denoising operation, and the resulting noise residue is deemed to be an estimate of the PRNU noise. However, due to inaccurate modeling, extracted noise residues also contain contributions from the image itself. To suppress the content-dependent part, noise residues extracted from multiple images (captured by the same camera) are averaged together to generate a fingerprint of the camera. The source camera of a given image is then decided through a correlative procedure between the PRNU noise estimate extracted from the image in question and the (PRNU noise based) fingerprints of all potential source cameras. In [23], results obtained using images taken by nine cameras show that 100% accuracy can be achieved in source identification. Sutcu et al. [26] provided a more rigorous performance study by testing the method on larger datasets. Later, in [3], the authors introduced a pre-processing stage to improve the accuracy of the PRNU noise estimate and improved the correlation-based detection procedure.

Chen et al. [4] extended this approach to videos to identify the source camcorder. Although digital cameras and camcorders are very similar in their operation, obtaining an estimate of the PRNU noise pattern from a video is a more challenging task. As a comparison, for internet-quality videos of size 264x352 at a 150 kbps bit-rate, the needed duration of the video to obtain a reliable fingerprint is around 10 minutes [4], whereas a few hundred images are typically sufficient to obtain the fingerprint of a digital camera. There are several reasons for this: (i) the frame sizes of typical videos are smaller, which decreases the information available for reliable detection; (ii) successive frames are very much alike, hence averaging successive instances of PRNU noise patterns does not effectively eliminate content dependency; and (iii) because of motion compensation, PRNU noise might be lost in some parts of the frames. Essentially, the accuracy of the PRNU noise estimate depends on the quality (compression and resolution) and the duration of the video (i.e., the number of frames).

In Figure 2, we show the impact of the quality and length of the video on PRNU noise estimates obtained from videos taken by 5 different camcorders. Each video is encoded at 1 Mbps and 2 Mbps bit-rates and divided into segments of 1000 frames and 1500 frames, and a PRNU noise pattern is extracted from each segment to obtain a fingerprint of the camcorder. By designating one of the camcorders as reference, inter- and intra-correlations of the obtained fingerprints are computed with respect to the reference camcorder. It can be seen that for increasing quality and longer segments, the PRNU estimates yield better differentiation of videos taken by the reference camcorder from videos taken by other camcorders.
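To make the residue-averaging step concrete, the following is a minimal sketch of fingerprint extraction, assuming frames are available as grayscale arrays; a Gaussian low-pass filter is used here as a simple stand-in for the wavelet-based denoiser of [23], and all function names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residue(frame, sigma=1.0):
    # The residue (frame minus its denoised version) carries the
    # PRNU noise plus leftover scene content.
    frame = frame.astype(np.float64)
    return frame - gaussian_filter(frame, sigma)

def fingerprint(frames, sigma=1.0):
    # Averaging residues over many frames suppresses the (roughly
    # zero-mean) content leftovers and retains the fixed PRNU pattern.
    acc = np.zeros(frames[0].shape, dtype=np.float64)
    for f in frames:
        acc += noise_residue(f, sigma)
    return acc / len(frames)
```

In the actual method, a wavelet-domain denoiser and a pre-processing stage [3] are used; the choice of denoiser mainly affects how much scene content leaks into the residue.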
Figure 2: The distribution of inter- and intra-correlation values. Distributions in blue indicate the correlation values among fingerprints associated with videos shot by the reference camcorder. Distributions in red indicate the correlation values of fingerprints between the reference camcorder and the other camcorders. The bit-rate of the videos and the number of frames in each segment are (a) 1 Mbps and 1000 frames, (b) 1 Mbps and 1500 frames, (c) 2 Mbps and 1000 frames, and (d) 2 Mbps and 1500 frames.
3. OBTAINING SOURCE-CHARACTERISTICS-BASED VIDEO SIGNATURES

Since a video can be generated by a single camcorder or by combining multiple video segments captured by several camcorders, we define a video signature to be the weighted combination of the fingerprints of the involved camcorders. We therefore utilize a procedure similar to the one described in [4] for extracting the PRNU noise pattern from a video.
We denoise each video frame with a wavelet-based denoising filter and extract the noise residues, which are then averaged together. The resulting pattern is the combination of the camcorder fingerprints, and it is treated as the signature of the video. If a video is shot by, for example, two camcorders, the extracted signature will be the weighted average of the fingerprints of these two camcorders, where the weighting depends on the length of the video shot by each camcorder. To detect whether two videos are copies of each other, we assess the correlation between their signatures.

Since the PRNU noise pattern is intrinsic to an imaging sensor, one issue that needs to be addressed is how to distinguish videos taken by the same camcorder (or a fixed set of camcorders), as they are expected to yield the same signature. Essentially, due to the inability to extract an accurate estimate of the underlying PRNU noise, the noise pattern extracted from a video also has contributions from the content itself. That is, the extracted video signature will depend not only on the imaging sensor fingerprints but will also exhibit some degree of content dependency. In Figure 2, it can be seen that the fingerprints extracted from videos captured by the reference camcorder correlate more strongly; however, even in the best case the correlation value is only around 0.25. For unmodified or slightly modified videos, the correlation takes values close to one. On the other hand, for near-duplicate videos, no matter how similar they are, as long as the source camcorders are different, the correlation values will not be high. These points are further explored in the following sections.

Another challenge in video copy detection is the robustness of the extracted video signature when the video is subjected to common processing. The proposed video signature extraction scheme is expected to be robust to linear operations, as they will not degrade the PRNU noise. The scheme is also robust to temporal changes, like random frame droppings and time desynchronizations, as long as the number of frames in a video is not reduced dramatically. Since modifications like blurring, noise addition, and compression are expected to degrade the PRNU noise estimates, the proposed scheme is robust to these types of modifications only up to a certain extent. One critical type of modification that will impact the performance negatively is frame cropping or scaling. This would require establishing synchronization between the sensor fingerprints from the original video and its scaled/cropped version prior to comparison of the video signatures. Although Goljan et al. [11] showed that PRNU noise can be detected under image cropping or re-scaling through a search of the relevant (cropping and scaling) parameters, doing so would increase the computational complexity.

To evaluate our video copy detection scheme, we performed two sets of experiments. In the first set, we provide results demonstrating the robustness of the video signatures against various common processing operations. In the second set, we apply the proposed scheme to videos downloaded from YouTube and show how the scheme performs on real-life test data, where no information is available on the source camcorders. In the following sections, we provide an evaluation of the proposed scheme.
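A minimal sketch of the matching step, reusing the hypothetical fingerprint() helper above: a video signature is the averaged residue of all frames, and two videos are declared copies when the normalized correlation of their signatures exceeds a decision threshold (0.2 in the experiments reported below).

```python
def ncc(a, b):
    # Normalized cross-correlation of two equal-size signatures.
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def is_copy(sig_a, sig_b, threshold=0.2):
    # Copies correlate close to 1; different videos stay below ~0.2.
    return ncc(sig_a, sig_b) > threshold
```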
4. ROBUSTNESS PROPERTIES OF VIDEO SIGNATURES
Figure 3: Distribution of correlation values obtained by correlating the signatures of 50 video clips with each other.
To test the robustness of the extracted signatures, we used videos captured by five different camcorders in Mini-DV format with a frame resolution of 0.68 megapixels. The videos are initially encoded at an average bit-rate of 2 Mbps and at 30 frames per second. The videos depict various sceneries, including indoor/outdoor scenes, fast-moving objects, and still scenes, and were shot at varying optical zoom levels and with camcorder panning. The videos captured with each camera are divided into 10 clips of 1000 frames, and the signatures of the resulting 50 video clips are extracted. Figure 3 shows that the correlation values computed between different video clips range from -0.06 to 0.2. The results demonstrate that each video clip yields a different signature even when the clips are shot by the same camcorder.

Next, we assess the robustness properties of the extracted video signatures by subjecting the video clips to various types of modifications at varying strengths. We extracted signatures from video clips that have undergone manipulation and correlated these signatures with the signatures of the original (unmodified) video clips. For each manipulation, we provide distributions of how the original signatures correlate with (a) signatures extracted from their modified versions (blue distribution), (b) signatures extracted from other videos taken by the same camera (red distribution), and (c) signatures extracted from videos taken by other camcorders (green distribution). The distributions of these correlation values are shown in Figure 4. As can be seen in the figure, when the content is different, the correlation of signatures is less than 0.2 for all types of modifications. Therefore, if the correlation between the signatures of the original and modified versions of a video is above 0.2, video copies can be reliably detected. In our experiments, we set the threshold for identification to 0.2 so that none of the different videos, whether taken by the same camcorder or not, would be identified as copies. As a performance measure, we consider the true positive rate (TPR), which is the rate of correctly detected copies of a video. For each manipulation, we also provide a figure showing the change in the mean of signature correlations between each video and its modified version as a function of manipulation strength.
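The robustness experiments below all follow the same pattern; a sketch of the evaluation loop, built on the hypothetical fingerprint() and ncc() helpers from Sections 2.2 and 3, is given here. The manipulate argument is any per-frame operation (contrast, brightness, blurring, and so on).

```python
def true_positive_rate(clips, manipulate, threshold=0.2):
    # Fraction of clips whose manipulated version still matches the
    # original signature above the decision threshold.
    hits = 0
    for frames in clips:
        sig_orig = fingerprint(frames)
        sig_mod = fingerprint([manipulate(f) for f in frames])
        hits += ncc(sig_orig, sig_mod) > threshold
    return hits / len(clips)
```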
4.1 Contrast Adjustment
Figure 4: The distribution of correlation values for each type of manipulation. The blue distribution is obtained by correlating the unmodified video clips with their modified versions, the red by correlating video clips with others from the same camcorder, and the green by cross-correlating videos from different camcorders. (a) Decreased contrast. (b) Increased contrast. (c) Decreased brightness. (d) Increased brightness. (e) Blurring. (f) AWGN addition. (g) Compression. (h) Random frame dropping.
Contrast adjustment modifies the range of pixel values without changing their mutual dynamic relationship. Contrast enhancement (increase) maps the luminance values in the interval [vl, vh] to the interval [0, 255]; luminance values below vl and above vh are saturated to 0 and 255, respectively. In the same manner, when contrast is decreased, the luminance values ranging over [0, 255] are mapped to the range [vl, vh]. Since the PRNU noise is largely preserved under contrast adjustment, the resulting video signatures are not modified significantly.

In the experiments, we tried various [vl, vh] values, ranging from [25, 230] to [115, 140]. As can be seen in Figure 5-a, the video signatures are robust up to a 90% contrast increase, which corresponds to the [102, 153] range. For the enhancement values [114, 140], the mean correlation value was around 0.18, and in fact all correlation values were lower than 0.2. However, it must be noted that this is a very extreme case in which most of the luminance values of the frames are saturated to 0 and 255. On the other hand, even in the most extreme case of contrast decrease, where the luminance values in the range [0, 255] are mapped to the [114, 140] range, we were able to detect all the copies of the video clips (Figure 5-b). The distributions of correlation values between the signatures of the original video clips and their contrast-increased versions can be seen in Figure 4-a, and their contrast-decreased versions in Figure 4-b. These results show that the extracted signatures are very robust to contrast manipulations.
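A sketch of the contrast mapping as described above, assuming 8-bit luminance frames; the function name and the increase flag are illustrative.

```python
def adjust_contrast(frame, vl, vh, increase=True):
    f = frame.astype(np.float64)
    if increase:
        # Map [vl, vh] onto [0, 255]; values outside saturate.
        return np.clip((f - vl) * 255.0 / (vh - vl), 0, 255)
    # Contrast decrease: map [0, 255] into [vl, vh].
    return vl + f * (vh - vl) / 255.0
```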
Figure 5: The change in mean of correlation values as a function of the strength of (a) contrast increase and (b) contrast decrease.
4.2 Brightness Adjustment

Brightness adjustment is performed by either adding or subtracting p percent of the frame's mean luminance value to or from each pixel in the frame, where p is a user-defined parameter. Since this operation only offsets the pixel values, the PRNU noise is almost fully preserved, and the video signature does not change much. During the experiments, we varied the value of p between 10% and 190%, where 10%-99% indicates a brightness increase and 101%-190% indicates a brightness decrease. The correlations of signatures after adjusting brightness are given in Figures 4-c and 4-d. Also, the average change in the correlation values with respect to changes in brightness level is given in Figures 6-a and 6-b. As can be seen in these figures, detection fails only when the brightness increase is at an extreme level. In all other instances, the video signatures are observed to be robust; therefore, we were able to detect all the copies of videos without any false positives.
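Correspondingly, a sketch of the brightness operation under the same assumptions:

```python
def adjust_brightness(frame, p, increase=True):
    # Offset every pixel by p percent of the frame's mean luminance.
    f = frame.astype(np.float64)
    offset = (p / 100.0) * f.mean()
    return np.clip(f + offset if increase else f - offset, 0, 255)
```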
Figure 6: The change in mean of correlation values as a function of the strength of (a) brightness decrease and (b) brightness increase.
4.3 Blurring

Blurring is performed by filtering each frame with a standard Gaussian filter with parameter σ (i.e., the standard deviation). Since blurring removes much of the medium- to high-frequency content, the PRNU noise may be largely removed (depending on the choice of σ), making the extracted signatures unreliable. In the experiments, we considered σ = 2, 3, 5, 7. Figure 7-a shows the mean of the resulting correlation values as a function of the filter width. The results indicate that the signatures are robust to blurring only if the Gaussian filter width σ is less than 3. The distribution of correlation values can be seen in Figure 4-e.
Figure 7: The change in mean of correlation values as a function of the strength of (a) blurring and (b) AWGN addition.

4.4 AWGN Addition

Noise addition degrades the accuracy of the PRNU noise estimates, and with increasing noise power, reliable detection of the PRNU noise becomes more and more difficult. When the noise is additive and frame-wise independent, its impact can be reduced by averaging over a large number of frames; however, this is effective only for very long videos. In the experiments, we added additive white Gaussian noise (AWGN) with varying standard deviation σ to each video frame. The considered noise levels are σ = 2, 3, 5, 10, 20, 30. The results in Figure 7-b show that performance is not satisfactory when σ > 5. For σ = 20 and 30, our scheme did not work at all; for σ = 5, we achieved an 80% true positive rate (TPR), and for σ = 10 the TPR was 30%; in both cases there were no false positives. Figure 4-f provides the distribution of correlation values after AWGN addition.
4.5 Compression

To show the impact of compression, we re-encoded all videos at bit-rates ranging from 0.8 Mbps to 2 Mbps while preserving the frame resolution. (Since compression below 0.8 Mbps caused a decrease in frame resolution, we did not consider lower bit-rates.) We observed that accuracy does not vary with the bit-rate, as can be seen in Figure 8-a. Therefore, we can conclude that the signatures are very robust to bit-rate changes. The distributions of correlation values, given in Figure 4-g, show a similar trend.
0.1
3
1.6
5. PERFORMANCE EVALUATION
0.05
2.5
1.2 1.4 Bit rate Mbps
4.6 Random Frame Dropping

To illustrate the impact of a lossy channel, we randomly removed frames from each video clip before extracting the signature. The drop rate varied between 50% and 90%. As Figures 8-b and 4-h indicate, the extracted signatures were reliable at all frame drop rates. The distribution of the correlation of signatures after random frame dropping can be found in Figure 4-h.
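For completeness, sketches of the remaining manipulations of Sections 4.3-4.6 under the same assumptions (blurring via scipy's Gaussian filter, AWGN addition, and random frame dropping); compression is omitted since it is performed by the video encoder rather than per frame, and drop_frames acts on a whole clip rather than on a single frame.

```python
rng = np.random.default_rng(0)

def blur(frame, sigma):
    # Gaussian blurring with standard deviation sigma (Section 4.3).
    return gaussian_filter(frame.astype(np.float64), sigma)

def add_awgn(frame, sigma):
    # Additive white Gaussian noise of standard deviation sigma (4.4).
    return np.clip(frame + rng.normal(0.0, sigma, frame.shape), 0, 255)

def drop_frames(frames, rate):
    # Randomly discard a 'rate' fraction of the frames (Section 4.6).
    keep = rng.random(len(frames)) >= rate
    return [f for f, k in zip(frames, keep) if k]
```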
Figure 8: The change in mean of correlation values as a function of the strength of (a) compression and (b) frame dropping.
5. PERFORMANCE EVALUATION

To test the performance of the proposed video copy detection scheme, we used videos from the video sharing site YouTube. For this purpose, we downloaded more than 400 videos found under 44 distinct names, without imposing any other constraint (e.g., resolution, compression level, synchronization in time). Each distinct video had between 2 and 39 copies. These videos include TV commercials, movie trailers, and music clips; the duration of each video varies from 20 seconds to 10 minutes at a resolution of 240x320 pixels. The signatures extracted from the 400 videos were then cross-correlated. The distributions of the resulting correlation values are given in Figure 9, where the blue curve indicates the correlation of signatures associated with the same videos and the red curve the correlation of signatures associated with different videos. From these distributions, it can be immediately seen that for the same videos the correlation values are in general greater than 0.5 and mostly close to 1, while for different videos the correlation values are centered around 0 with a maximum less than 0.5.

To evaluate the performance of the scheme in detecting video copies at a given decision threshold, we counted the number of decision errors in which the copy of a video is deemed to be a different video (false rejection) and in which a different video is detected as a copy (false acceptance). (This is realized by comparing the correlation values associated with all pairs of videos against a preset threshold.) Figure 10 displays the receiver operating characteristic (ROC) curve, which shows the trade-off between the false-acceptance rate and the false-rejection rate as the decision threshold is varied across all values. The ROC curve shows that the misidentification rate is very low; in the best case, the accuracy is 99.30%.

To see why some of the videos did not correlate with their copies, we examined the videos more closely and found several reasons for not getting similar signatures from the available copies. The most common reason for a misidentification is (slight) scaling of the videos: since the extracted signatures do not align after scaling, those videos yielded very low correlation values. Figure 11-a shows an example of a video and its scaled copy. As expected, another reason for low correlation values is compression. Figure 11-b provides an example where the copied version of the video is compressed by a factor of 0.75, which yields a correlation value just below the threshold. Another factor contributing to misidentifications is extra content (like advertisements) inserted into the videos. We observed that if the added content is around 10% of the length of the original video, the signatures yield correlation values of more than 0.5; when the added content is more than 30% of the original duration, the resulting signatures become substantially dissimilar. A final reason for low correlation values is video summarization. Even when a video is shortened by as much as 30% of its original length, the signatures yield satisfactory correlation; however, when videos are shortened by more than 40%, our scheme was not able to correctly detect the copies.
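The FAR/FRR counting described above reduces to a threshold sweep over the two sets of pair-wise correlation values; a sketch follows, where same_corrs and diff_corrs are hypothetical lists of correlations for copy pairs and non-copy pairs.

```python
def far_frr(same_corrs, diff_corrs, threshold):
    # False rejection: a true copy falls below the threshold.
    frr = np.mean([c <= threshold for c in same_corrs])
    # False acceptance: a different video exceeds the threshold.
    far = np.mean([c > threshold for c in diff_corrs])
    return far, frr

# Sweeping the threshold traces out the ROC curve of Figure 10:
# roc = [far_frr(same_corrs, diff_corrs, t)
#        for t in np.linspace(-0.2, 1.0, 200)]
```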
Figure 9: Cross-correlation of extracted video signatures.

Figure 10: ROC curve for detection results on the videos downloaded from YouTube.
Figure 11: Video copies for which the extracted signatures are dissimilar. (a) Scaled versions. (b) Highly compressed version.

We note that our signatures are quite robust in the presence of on-screen graphic objects, like subtitles and small advertisements, that overlay the video content. Figure 12-a gives one such example where detection is successfully achieved. In addition, small shifts in time did not affect the signature much. Figure 12-b shows the 300th frames of a video and its copy; in this example, the second video started with a blank screen of around two seconds, which yielded a shift in time. We also examined the videos that are falsely detected as copies with high correlation values. In our observation, the most dominant factor in those cases is the continuous presence of a logo or advertisement in different videos, as exemplified in Figure 13.

Figure 12: Video copies with similar signatures. (a) Added subtitles. (b) Shifted in time.

Figure 13: Different videos with similar signatures; corr(a,b) = 0.45.

To determine the impact of content and imaging sensor fingerprints on the resulting video signatures, we performed another experiment on YouTube videos. For this purpose, we downloaded 36 distinct videos of a commercial series (the now famous PC vs. Mac commercials), hypothesizing that they were captured using the same set of source device(s). The videos were short, typically 27 seconds, with an average of 700 frames per video at a resolution of 240x320 pixels per frame, and content-wise they are quite similar. Figure 15 shows representative frames extracted from four of the videos. We extracted signatures from each of the videos and computed pair-wise correlations among all signatures. In Figure 14, the red distribution shows the values obtained by these pair-wise correlations. As can be seen, most of the resulting values are very close to 0, implying no relation between the videos. These results indicate that if the contents of the videos are not the same, but merely very similar, our scheme does not detect them as copies, even if they might have been captured by the same set of source devices. On the other hand, these results do not allow us to conclude whether or not the commercials were captured using the same source device(s), as videos captured by the same device are expected to yield higher correlation values.

Since extracting a reliable fingerprint of the sensors from internet-quality videos requires longer videos, we performed another experiment to see if the source devices of the videos match. For this purpose, we first randomly chose 10 videos and combined them to generate a composite video. We then generated another composite video by choosing 10 different videos from the remaining ones and correlated the signatures of the two composite videos. (Note that the two composite videos have no overlapping content.) We repeated the same experiment 250 times by drawing different combinations of the 36 videos each time. The distribution of the resulting correlation values is shown in blue in Figure 14. These results strongly imply that at least some of the videos were taken by the same set of cameras/camcorders. Overall, the experiments on this commercial series showed that when the same camcorders are used in the capture of two videos whose contents are not the same, although they may be similar, the resulting signatures will be significantly different.
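A sketch of one composite-video trial, reusing the hypothetical fingerprint() and ncc() helpers from above; videos is assumed to be a list of frame sequences for the 36 commercials, all at the same resolution.

```python
def composite_trial(videos, rng):
    # Draw two disjoint sets of 10 videos, concatenate each set into
    # a composite, and correlate the two composite signatures.
    order = rng.permutation(len(videos))
    set_a, set_b = order[:10], order[10:20]
    sig_a = fingerprint([f for i in set_a for f in videos[i]])
    sig_b = fingerprint([f for i in set_b for f in videos[i]])
    return ncc(sig_a, sig_b)

# corrs = [composite_trial(videos, np.random.default_rng(k))
#          for k in range(250)]   # the 250 repetitions of Section 5
```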
Figure 14: The distribution of correlation values obtained from the commercial series. The red distribution is obtained by pair-wise correlation of the individual videos, and the blue distribution by correlation of composite videos.
Figure 15: The frames of four example videos from a commercial series.
6. CONCLUSIONS
In this paper, we demonstrated how conventional content-based video processing methods can benefit from the findings of multimedia forensics research. For this purpose, we utilized source device characteristics extracted from a video to construct a new video copy detection technique. In the scheme, rather than extracting a content-based signature from a video, a combination of the fingerprints of the cameras/camcorders involved in the generation of the video is used as the video signature. The fact that extracted signatures have contributions from both the characteristics of the imaging sensor and the content makes the resulting video signatures particularly useful for video copy detection. To show the viability of our scheme, we performed two sets of experiments. In the first set, we used a controlled data set containing videos from known camcorders and analyzed the robustness of the signatures to various common video processing operations. Our results indicate that the scheme is quite robust to contrast/brightness adjustment, temporal modifications, and compression, while it is only partially robust to AWGN addition and blurring-type modifications. In the second set of experiments, we tested the scheme on real data by downloading many duplicate videos from YouTube. In this case, we achieved a copy detection rate of 99.30%. The proposed method is complementary to existing content-based copy detection methods, and combining the two approaches promises superior overall performance.
7. REFERENCES
[1] A. Vailaya, M. Figueiredo, A. Jain, and H.-J. Zhang. Image classification for content-based indexing. IEEE Transactions on Image Processing, 10(1):117–129, January 2001.
[2] S. Bayram, H. T. Sencar, and N. Memon. Classification of digital camera models based on demosaicing artifacts. Journal of Digital Investigation, 2008 (to appear).
[3] M. Chen, J. Fridrich, and M. Goljan. Digital imaging sensor identification (further study). In Security, Steganography, and Watermarking of Multimedia Contents IX, Proceedings of the SPIE, 6505:65050P, February 2007.
[4] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš. Source digital camcorder identification using sensor photo response non-uniformity. In SPIE Electronic Imaging, Security, Steganography, and Watermarking of Multimedia Contents IX, 6505:1G–1H, January 28–February 2, 2007.
[5] K. S. Choi, E. Y. Lam, and K. K. Y. Wong. Source camera identification using footprints from lens aberration. In Digital Photography II, Proceedings of the SPIE, 6069:172–179, February 2006.
[6] B. Coskun, B. Sankur, and N. Memon. Spatio-temporal transform-based video hashing. IEEE Transactions on Multimedia, 8(6):1190–1208, 2006.
[7] A. E. Dirik, H. T. Sencar, and N. Memon. Digital single lens reflex camera identification from traces of sensor dust. IEEE Transactions on Information Forensics and Security, 2008.
[8] X. Fang, Q. Sun, and Q. Tian. Content-based video identification: a survey. In International Conference on Information Technology: Research and Education, 2003.
[9] Z. J. Geradts, J. Bijhold, M. Kieft, K. Kurosawa, K. Kuroki, and N. Saitoh. Methods for identification of images acquired with digital cameras. In SPIE, Enabling Technologies for Law Enforcement and Security, 4232:505–512, February 2001.
[10] T. Gloe, E. Franz, and A. Winkler. Forensics for flatbed scanners. In Security, Steganography, and Watermarking of Multimedia Contents IX, Proceedings of the SPIE, 6505:65051I, February 2007.
[11] M. Goljan and J. Fridrich. Camera identification from scaled and cropped images. In Proc. SPIE, Electronic Imaging, Forensics, Security, Steganography, and Watermarking of Multimedia Contents X, January 26–31, 2008.
[12] H. Gou, A. Swaminathan, and M. Wu. Robust scanner identification based on noise features. In Security, Steganography, and Watermarking of Multimedia Contents IX, Proceedings of the SPIE, 6505:65050S, February 2007.
[13] P. Indyk, G. Iyengar, and N. Shivakumar. Finding pirated video sequences on the internet. Technical report, Stanford University, 1999.
[14] J. Oostveen, T. Kalker, and J. Haitsma. Feature extraction and a database strategy for video fingerprinting. In VISUAL '02: Proceedings of the 5th International Conference on Recent Advances in Visual Information Systems, pages 117–128, London, UK, 2002.
[15] E. Kasutani and A. Yamada. The MPEG-7 color layout descriptor: a compact image feature description for high-speed image/video segment retrieval. In IEEE International Conference on Image Processing (ICIP), 1:674–677, October 2001.
[16] N. Khanna, A. K. Mikkilineni, G. T. C. Chiu, J. P. Allebach, and E. J. Delp. Scanner identification using sensor pattern noise. In Security, Steganography, and Watermarking of Multimedia Contents IX, Proceedings of the SPIE, 6505:65051K, February 2007.
[17] T. Kurozumi, K. Kashino, and H. Murase. A method for robust and quick video searching using probabilistic dither-voting. In International Conference on Image Processing, 2:653–656, October 2001.
[18] K. Kurosawa, K. Kuroki, and N. Saitoh. CCD fingerprint method: identification of a video camera from videotaped images. In ICIP99, pages 537–540, Kobe, Japan, 1999.
[19] T. V. Lanh, K.-S. Chong, S. Emmanuel, and M. S. Kankanhalli. A survey on digital camera image forensic methods. In 2007 IEEE International Conference on Multimedia and Expo, 2007.
[20] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford. Video copy detection: a comparative study. In ACM International Conference on Image and Video Retrieval (CIVR '07), pages 371–378, Amsterdam, The Netherlands, October 2007.
[21] Y. Li, L. Jin, and X. Zhou. Video matching using binary signature. In International Symposium on Intelligent Signal Processing and Communication Systems, pages 317–320, 2005.
[22] R. Lienhart, C. Kuhmünch, and W. Effelsberg. On the detection and recognition of television commercials. In International Conference on Multimedia Computing and Systems, pages 509–516, June 1997.
[23] J. Lukáš, J. Fridrich, and M. Goljan. Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security, 1(2):205–214, 2006.
[24] Y. Meng, E. Y. Chang, and B. Li. Enhancing DPF for near-replica image recognition. In International Conference on Pattern Recognition, pages 416–423, 2003.
[25] H. T. Sencar and N. Memon. Overview of State-of-the-Art in Digital Image Forensics. World Scientific Press, 2008.
[26] Y. Sutcu, S. Bayram, H. T. Sencar, and N. Memon. Improvements on sensor noise based source camera identification. In Proceedings of IEEE ICME, 2007.
[27] A. Swaminathan, M. Wu, and K. J. R. Liu. Non-intrusive forensic analysis of visual sensors using output images. IEEE Transactions on Information Forensics and Security, 2(1):91–106, March 2007.
[28] Y. Lu, L. Wenyin, H.-J. Zhang, and C. Hu. Joint semantics and feature based image retrieval using relevance feedback. IEEE Transactions on Multimedia, 5(3):339–346, September 2003.
[29] J. Yuan, L. Duan, Q. Tian, and C. Xu. Fast and robust short video clip search using an index structure. In 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 61–68, New York, 2004.