A VIDEO FINGERPRINT BASED ON VISUAL DIGEST AND LOCAL FINGERPRINTS

A. Massoudi, F. Lefebvre, C.-H. Demarty, L. Oisel, B. Chupeau
Thomson R&D, Rennes (FRANCE)

ABSTRACT

A fingerprinting design extracts discriminating features, called fingerprints. The extracted features are unique and specific to each image/video. The visual hash is usually a global fingerprinting technique with crypto-system constraints. In this paper, we propose an innovative video content identification process which combines a visual hash function and local fingerprinting. Thanks to a visual hash function, we observe the video content variation and detect key frames. A local image fingerprint technique characterizes the detected key frames. The set of local fingerprints for the whole video summarizes the video or fragments of the video. The video fingerprinting algorithm identifies an unknown video or a fragment of video within a video fingerprint database. It compares the local fingerprints of the candidate video with all the local fingerprints of the database, even if strong distortions have been applied to the original content.

KEYWORDS: fingerprinting, visual hash, video copy detection.

1. INTRODUCTION
Fingerprinting is widely used to search for content in large multimedia databases. A fingerprinting design extracts discriminating features, called fingerprints, that uniquely identify an image, a video or a fragment of video. A fingerprinting method is usually designed to cope with both natural distortions (compression, frame rate changes, analog coding...) and malicious attacks (logo addition, cropping, geometric distortion, letterbox...). A fingerprint should remain almost the same before and after attacks, as long as these attacks do not alter the visual content. An image fingerprint can be a global description of the image or a local description of key points extracted from the image [2][3]. A video fingerprint can be a global description of the video, a set of image fingerprints for all video frames, or a set of image fingerprints for all video key frames [4]. Fingerprinting is a key technique for the detection of illegal copies and for streaming monitoring.
The visual hash function is a fingerprinting technique with crypto-system constraints; it verifies the definitions and properties given in [5][6]. For image/video applications, the visual hash output is called the visual digest. A small change of the content leads to a small change of the visual digest. A video visual digest can be a global digest of the video, a set of image visual digests of all video frames, or a set of image visual digests of all video key frames [6]. Visual hash functions can be used for multimedia content authentication.

None of the techniques applied to video combines efficient key frame detection (shot boundaries, stable frames) with efficient key frame description in a video copy piracy context (camcorder capture, Stirmark attacks). We propose a video fingerprint process which combines a visual hash and local fingerprints and which is resistant to most voluntary and natural attacks. Due to its efficient image characterization and its fixed bit length, the visual digest is chosen instead of the histogram to observe the video content variation. The evolution of the successive visual digests detects key frames: shot boundaries and stable frames. One stable frame is obtained for each shot. In an image authentication or identification process, the main attack against this visual digest is image cropping. To resist this attack, local fingerprinting is chosen to characterize the stable frame and defines a shot fingerprint. The video fingerprint is the set of all shot fingerprints. Figure 1 summarizes the global architecture.
[Figure 1: global architecture. Video frames feed a visual hash function producing visual digests 1..n; key frame detection selects stable frames over time; a local fingerprint is computed on each stable frame, giving a shot fingerprint; the video fingerprint is the set of shot fingerprints.]

2. VISUAL HASH FUNCTION
According to the review and the results detailed in [6], the visual hash called RASH, based on radial projections of the image pixels, is chosen. Each RASH component is computed as the second-order moment of the pixels belonging to a line passing through the center of the image. This algorithm is extended in order to take into account the importance of a point in the image. The importance of a point (x,y) in an image is given by the relative position of this point (x,y) to the center (x',y'). For an orientation θ, only the pixels enclosed by both the ellipse and the strip are selected (figure 2); the other ones are rejected.

[Figure 2: selected points. An ellipse {(x',y'), height, width} centered on the image middle point (x',y') and a strip of orientation θ define the selected pixels; pixels outside are rejected.]

If a discretization of 1° is selected, the visual digest is composed of 180 elements. Each element of the visual digest is computed by:

Elt(\theta) = \frac{1}{N_\theta} \sum_{p=1,\,(p,\theta) \in Ellipse}^{N_\theta} \bigl( I(p,\theta) - Mean(\theta) \bigr)^2

where:
• I(p,θ) is the luminance of the pixel (p,θ)
• Mean(θ) is the classical mean value of I(p,θ) over the strip with orientation θ
• N_θ is card({p,θ})
• for (x,y) and θ, p = cos(θ)x + sin(θ)y

The image visual digest of a frame i is:

VD(i) = \{ Elt(\theta) \}_{\theta = 0..179}
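As an illustration, the following Python sketch (assuming NumPy as the only dependency) computes a RASH-like digest by sampling pixels along the diameter at each orientation and taking their second-order moment. The ellipse-and-strip pixel selection described above is approximated by simply clipping samples to the image bounds, so this is a sketch of the idea rather than the authors' exact implementation.

```python
import numpy as np

def visual_digest(gray, n_angles=180):
    """Return a digest of n_angles second-order moments, one per orientation."""
    h, w = gray.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cx, cy)                          # stay inside the image
    ts = np.linspace(-radius, radius, int(2 * radius) + 1)
    digest = np.empty(n_angles)
    for a in range(n_angles):
        theta = np.deg2rad(a)
        xs = np.clip(np.round(cx + ts * np.cos(theta)).astype(int), 0, w - 1)
        ys = np.clip(np.round(cy + ts * np.sin(theta)).astype(int), 0, h - 1)
        line = gray[ys, xs].astype(np.float64)
        digest[a] = np.mean((line - line.mean()) ** 2)   # Elt(theta)
    return digest
```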
3. KEY FRAMES DETECTION

The evolution of the distance between visual digests over a group of frames is used to detect key frames. A shot boundary is a brutal variation of the visual digest inside a group of frames. The set of frames comprised between two shot boundaries is called a shot. The stable frame presents the smallest distance variation of the visual digest inside a shot, this distance variation being described in the following subsections. The goal of this step is to extract stable frames.

3.1. Shot boundary detection
As described in [6], an automatic threshold process determines brutal transitions along the video and detects shot boundaries. The thresholding process is based on two thresholds:
• a pseudo-global threshold, denoted T_global;
• an adaptive threshold, denoted T_local.
Two frames with the same content have the same (or a very close) visual hash. Instead of the histogram, the visual hash function described in section 2 is chosen to study the video content variation. The visual digest distance (L2 norm) thus measures the similarity between frames. A shot boundary corresponds to a frame where a peak of the visual digest distance is detected. The shot boundaries, denoted SB, are detected as follows:

SB = \{\, i \mid dist(VD(i), VD(i+1)) > \max\bigl( T_{global}(i, L_1),\; T_{local}(i, L_2) \bigr) \,\}

The computation of T_global and T_local (with the L_1 and L_2 values) is detailed in [6]. The set of shot boundaries along a video is denoted ShotBound. The cardinality of ShotBound depends on the video activity.
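A minimal Python sketch of this thresholding follows. The exact computations of T_global and T_local are given in [6] and are not reproduced here; the mean-plus-k-standard-deviations stand-ins over windows of half-size L1 and L2 are assumptions made for illustration.

```python
import numpy as np

def shot_boundaries(digests, k_global=3.0, k_local=2.0, L1=60, L2=5):
    """digests: per-frame visual digests (N x 180). Returns boundary indices."""
    d = np.array([np.linalg.norm(digests[i] - digests[i + 1])
                  for i in range(len(digests) - 1)])   # successive L2 distances
    bounds = []
    for i in range(len(d)):
        g = d[max(0, i - L1): i + L1 + 1]              # pseudo-global window
        l = d[max(0, i - L2): i + L2 + 1]              # adaptive local window
        t_global = g.mean() + k_global * g.std()       # stand-in for T_global
        t_local = l.mean() + k_local * l.std()         # stand-in for T_local
        if d[i] > max(t_global, t_local):              # SB condition
            bounds.append(i)
    return bounds
```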
3.2. Stable frame detection
A stable frame is the frame with the smallest content variation along a shot. For such a frame, the content distance between this frame and its neighboring frames is very low. In [6], the stable frame is the frame where the distance between two successive histograms is the smallest of the shot. In our case, we choose the frame which provides the smallest visual digest distance to its neighboring frames. This stable frame must also have well-distributed content information, to avoid detecting dark or light frames: for such frames, the pixel luminance is nearly constant and the frame brings no shot information. We propose the following computation:
StableFrame = l \;\mid\; Dist(l) = \min(\{Dist\}) \;\text{and}\; Entropy(VD(l)) \geq Threshold

Inside a shot, for each group of 2L_3+1 frames, the average content image distance Dist(j) at position j is given by:

Dist(j) = \frac{1}{2L_3+1} \sum_{i=j-L_3,\; i \in Shot,\; i \neq j}^{j+L_3} dist(VD(j), VD(i))
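A sketch of this selection follows, under the assumptions that the digest entropy is estimated from a 16-bin histogram of the digest values and that the entropy threshold is a free parameter; neither choice is specified above, so both are illustrative.

```python
import numpy as np

def stable_frame(digests, shot, L3=5, entropy_thresh=3.0):
    """Return the frame of `shot` (a list of frame indices) minimising
    Dist(j), subject to an entropy constraint on its visual digest."""
    shot_set = set(shot)

    def entropy(vd):
        # 16-bin histogram entropy of the digest values (an assumption).
        counts, _ = np.histogram(vd, bins=16)
        p = counts[counts > 0] / counts.sum()
        return -np.sum(p * np.log2(p))

    best, best_dist = None, np.inf
    for j in shot:
        neigh = [i for i in range(j - L3, j + L3 + 1)
                 if i in shot_set and i != j]
        if not neigh:
            continue
        dist_j = np.mean([np.linalg.norm(digests[j] - digests[i])
                          for i in neigh])
        if dist_j < best_dist and entropy(digests[j]) >= entropy_thresh:
            best, best_dist = j, dist_j
    return best
```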
We obtain one stable frame per shot. Each stable frame is now characterized by using local fingerprinting.
4. LOCAL FINGERPRINTING
The local fingerprint process is divided into two steps: the detection and the description of points of interest.
4.1. Detection of points of interest

The detection of points of interest is based on a Difference of Gaussians [1]. It consists in detecting repeatable key points: a key point is repeatable if its pixel location remains detectable after attacks. It must be resistant to scale change, rotation, filtering... We use a cascade of filtered and subsampled images (multi-resolution) that we call "octaves". The Gaussian kernel is the scale-space kernel candidate. The theoretical interest of such an approach is that the difference of two Gaussians of respective variances kσ and σ is a very good approximation of the normalized Laplacian of Gaussian:

G(x,y,k\sigma) - G(x,y,\sigma) \approx (k-1)\,\sigma^2 \nabla^2 G \qquad (1)

The convolution of (1) with the image leads to the Difference of Gaussians function (DOG image):

D(x,y,\sigma) = \bigl( G(x,y,k\sigma) - G(x,y,\sigma) \bigr) * I(x,y)

In each octave we use an initial scale factor σ of 1.6 and a multiplicative factor k of 1.15. The extrema of the DOG represent potential locations of points of interest. Not all of these locations contain relevant information: only the points with good contrast and precise spatial localization are kept [1]. The locations of the detected key points in scale space are then stored.
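A compact sketch of single-octave DoG extremum detection follows, assuming SciPy is available. The paper's σ = 1.6 and k = 1.15 are used, while the contrast threshold value and the 3x3x3 extremum test are simplifications of the contrast and localization filtering of [1].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(gray, sigma=1.6, k=1.15, n_scales=5, contrast=0.03):
    """Return (y, x, scale_index) candidates from DoG extrema in one octave."""
    img = gray.astype(np.float64) / 255.0
    blurred = [gaussian_filter(img, sigma * k ** s) for s in range(n_scales)]
    dogs = np.stack([blurred[s + 1] - blurred[s] for s in range(n_scales - 1)])
    is_max = dogs == maximum_filter(dogs, size=3)      # 3x3x3 local maxima
    is_min = dogs == minimum_filter(dogs, size=3)      # 3x3x3 local minima
    keep = (is_max | is_min) & (np.abs(dogs) > contrast)
    s, y, x = np.nonzero(keep)
    return list(zip(y, x, s))
```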
4.2. Local description of the key points

In the previous step, we detected key points and stored their localization in scale space (x,y,σ). In this step, we locally characterize each key point by computing a local descriptor. The descriptor must be both discriminant and invariant to a certain number of transformations: a discriminant descriptor provides a representative and distinct value for each different content. To compute our descriptor, we consider a circular neighborhood of radius β (e.g. 20 pixels) in order to be invariant to rotation. We compute the key point orientation, KeyOrient(x,y), by summing the gradient vectors in a small disc β(x,y) around the key point (x,y):

KeyOrient(x,y) = \sum_{(x'',y'') \in \beta(x,y)} orientation(x'',y'')

The gradient of a pixel (x,y) in a disc of radius 1 is given by:

magnitude(x,y) = \sqrt{ (\partial_x L(x,y))^2 + (\partial_y L(x,y))^2 }

orientation(x,y) = \arctan\bigl( \partial_y L(x,y) \,/\, \partial_x L(x,y) \bigr)

where L(x,y) is the Gaussian image:

L(x,y,\sigma) = G(x,y,\sigma) * I(x,y)

The orientation of the resulting vector gives the key point orientation. According to this orientation, the neighboring disc is then divided into nine regions (figure 3). Each pixel within the disc belongs to one unique region depending on its polar position.
[Figure 3: nine-region decomposition of the disc around a key point, with:
• α the key point orientation
• θ the polar position of the pixel (x,y) within the disc
• r the distance from the key point to the pixel (x,y)]

For each of the nine regions, we compute a local histogram of sixteen bins:

Histo(i,\theta,k) = \#\bigl\{ (x'',y'') \mid orientation(x'',y'') = \theta,\; (x'',y'') \in R(i,k) \bigr\}

where:
• k is the selected key point (x,y)
• θ is the pixel orientation (θ = 0°..360°, with a step of 22.5°)
• i is the index of the selected region (i = 1..9)
• R(i,k) is region i of the disc centered on the selected key point k

The final descriptor KD of a key point k is the concatenation of the nine histograms:

KD(k) = [Histo(i,\theta,k)]_{i=1..9}
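The sketch below computes such a descriptor, assuming a concentric-rings-by-sectors split into nine regions; the exact nine-region geometry of figure 3 is not fully specified here, so that split, the Gaussian scale, and the bin assignment relative to the key point orientation are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def keypoint_descriptor(gray, y0, x0, beta=20, n_regions=9, n_bins=16):
    """9 x 16-bin orientation histogram descriptor around key point (y0, x0)."""
    L = gaussian_filter(gray.astype(np.float64), 1.6)   # Gaussian image L
    gy, gx = np.gradient(L)
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx)                            # pixel orientations

    # Key point orientation: sum of gradient vectors over the disc.
    ys, xs = np.mgrid[-beta:beta + 1, -beta:beta + 1]
    disc = ys ** 2 + xs ** 2 <= beta ** 2
    yy = np.clip(y0 + ys, 0, gray.shape[0] - 1)
    xx = np.clip(x0 + xs, 0, gray.shape[1] - 1)
    vx = np.sum((mag[yy, xx] * np.cos(ori[yy, xx]))[disc])
    vy = np.sum((mag[yy, xx] * np.sin(ori[yy, xx]))[disc])
    alpha = np.arctan2(vy, vx)                          # KeyOrient(x, y)

    # Region index from polar position relative to alpha (three rings x
    # three sectors is one plausible nine-region layout, an assumption).
    r = np.sqrt(ys ** 2 + xs ** 2)
    theta = (np.arctan2(ys, xs) - alpha) % (2 * np.pi)
    ring = np.minimum((3 * r / (beta + 1)).astype(int), 2)   # 0..2
    sector = (theta // (2 * np.pi / 3)).astype(int) % 3      # 0..2
    region = ring * 3 + sector                               # 0..8

    # One 16-bin orientation histogram per region, rotated by alpha.
    rel = ((ori[yy, xx] - alpha) % (2 * np.pi) / (2 * np.pi) * n_bins).astype(int)
    desc = np.zeros((n_regions, n_bins))
    for i in range(n_regions):
        sel = disc & (region == i)
        desc[i] = np.bincount(rel[sel], minlength=n_bins)[:n_bins]
    return desc.ravel()                                 # KD(k)
```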
The local fingerprint of a stable frame j, also called the shot fingerprint (SF), is the set of all key point descriptors KD. The video fingerprint is the set of shot fingerprints SF.

5. EXPERIMENTS
This section presents our experiments. Our algorithm, which combines visual hash and local fingerprints (method 2), is compared to the system described in [6] (method 1), which combines histogram and visual hash. First, we evaluate the repeatability of the shot boundary detector (figure 4). Then, two stable frame detection methods are compared (figure 5). Finally, we validate the whole system by matching the candidate fingerprints with the corresponding original fingerprints. The experiments were done on 5 sequences of 1 minute each. For each original sequence, several realistic image processing attacks were computed:
• Temporal cropping: 25% of the frames are removed.
• Asymmetric spatial cropping: for each frame, 30% of the rows and columns are removed.
• Motion blur: a temporal Gaussian filtering.
• Spatial blur: a spatial Gaussian filtering (radius of 3).
• Scaling: each frame is resized by a scale factor of 1.5 using bicubic interpolation.
• Divx compression: the bitrate is divided by 4 with respect to the original sequence.
• Camcorder: camcorder capture.
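For concreteness, two of these attacks can be sketched as follows in Python, assuming Pillow image frames; the side from which the asymmetric crop is taken is an illustrative choice, as it is not specified above.

```python
from PIL import Image

def spatial_crop(frame, ratio=0.30):
    """Remove `ratio` of the rows and columns (crop taken from the
    top-left corner here, which keeps the crop asymmetric)."""
    w, h = frame.size
    return frame.crop((int(w * ratio), int(h * ratio), w, h))

def rescale(frame, factor=1.5):
    """Resize by a scale factor using bicubic interpolation."""
    w, h = frame.size
    return frame.resize((int(w * factor), int(h * factor)), Image.BICUBIC)
```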
For each attack, the repeatability of both the shot boundary and the stable frame detection is evaluated (figures 4 and 5). The detection result on the original sequence is compared with the detection result on the candidate sequence. A detection is 'correct' if the detected frame is the same in the original and in the candidate sequence. The correct detection ratio is given by:

correct = \frac{\#\, correct\_frame\_detected\_candidate}{\#\, correct\_frame\_detected\_original}

The bad detection ratio is given by:

bad = \frac{\#\, bad\_frame\_detected\_candidate}{\#\, total\_frame\_detected\_candidate}
[Figure 4: repeatability of the shot boundary detector (percentage). For each attack (temporal cropping, spatial cropping, motion blur, spatial blur, scale x1.5, Divx compression, camcorder + cropping), bars compare the histogram and visual hash methods, for correct and bad detections.]
To evaluate the global system, we compute a video fingerprint database of 10 movies. We consider one of these 10 movies and apply all the previously described attacks to obtain 7 candidates. For each candidate, we compare its fingerprint to the fingerprint database. When more than 40% (empirical threshold) of the local fingerprints of a candidate shot match the local fingerprints of an original shot, the candidate shot is identified as that original shot.

Using either the visual hash or the histogram method (figure 4), the shot boundaries are correctly detected. In a typical sequence, about 1% of the frames are detected as key frames. Figure 5 shows that our stable frame method performs better than the stable frame method described in [6]; motion blur and temporal cropping highlight this result. The fingerprint matching results (figure 6) validate our global approach. Most shot fingerprints are correctly matched with the shot fingerprints of the original sequence. The method we propose is highly resistant to spatial cropping (figure 6). The poor results observed under spatial blur attacks were expected: due to the blur effect, the pixel luminance is not well distributed, so only few stable frames are detected, and the detection and description of points of interest are affected by the gradient smoothing.
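The shot-level decision rule can be sketched as follows. The 40% vote threshold is the empirical threshold stated above, while the per-descriptor nearest-neighbour test and its distance threshold (dist_thresh) are assumptions, since the exact local matching criterion is not detailed.

```python
import numpy as np

def match_shot(candidate_kds, original_kds, dist_thresh=0.25, vote_thresh=0.40):
    """Declare a candidate shot a copy of an original shot when the fraction
    of matched local fingerprints exceeds vote_thresh (the 40% threshold)."""
    if len(candidate_kds) == 0:
        return False
    originals = np.asarray(original_kds, dtype=np.float64)
    hits = 0
    for kd in np.asarray(candidate_kds, dtype=np.float64):
        # Nearest-neighbour test in descriptor space (L2 distance).
        if np.linalg.norm(originals - kd, axis=1).min() < dist_thresh:
            hits += 1
    return hits / len(candidate_kds) > vote_thresh
```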
[Figure 5: repeatability of the stable frame detector (percentage). For each attack (temporal cropping, spatial cropping, motion blur, spatial blur, scale x1.5, Divx compression, camcorder + cropping), bars compare method 1 and method 2, for correct and bad detections.]
[Figure 6: fingerprint matching (percentage). For each attack (temporal cropping, spatial cropping, motion blur, spatial blur, scale x1.5, Divx compression, camcorder + cropping), bars compare method 1 and method 2, for correct and bad matching.]

6. CONCLUSION

This video fingerprinting algorithm proposes an innovative solution to identify a video or a fragment of video. The algorithm combines visual hash techniques and local fingerprinting. Thanks to the description of a multitude of independent points of interest, the whole image is not necessary to correctly match candidate fingerprints with original fingerprints. Such an approach allows the usage of this algorithm in a copy protection context. New perspectives focus on the repeatability of stable frames.
7. REFERENCES

[1] I. Laptev and T. Lindeberg, "Space-time Interest Points", in Proc. ICCV, France, pp. 432-439, 2003.
[2] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", in Proc. CVPR, pp. 257-263, 2003.
[3] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", IJCV, pp. 91-110, 2004.
[4] A. Joly, C. Frélicot and O. Buisson, "Feature statistical retrieval applied to content-based copy identification", in Proc. ICIP, 2004.
[5] R. Venkatesan, S.M. Koon, M.H. Jakubowski and P. Moulin, "Robust image hashing", in Proc. ICIP, 2000.
[6] C. De Roover, C. De Vleeschouwer, F. Lefèbvre and B. Macq, "Robust video hashing based on radial projections of key frames", IEEE Transactions on Signal Processing, pp. 4020-4038, 2005.