Compressed domain copy detection of scalable SVC videos

Christian Käs, Henri Nicolas
University of Bordeaux 1
Laboratoire Bordelais de Recherche en Informatique (LaBRI)
351, cours de la libération, 33405 Talence Cedex, France
{kaes,nicolas}@labri.fr

Abstract We propose a novel approach for compressed domain copy detection of scalable videos stored in a database. We analyze compressed H.264/SVC streams and form different scalable low-level and mid-level feature vectors that are robust to multiple transformations. The features are based on easily available information like the encoding bit rate over time and the motion vectors found in the stream. The focus of this paper lies on the scalability and robustness of the features. A combination of different descriptors is used to perform copy detection on a database containing scalable, SVC-coded High-Definition (HD) video clips.

1 Introduction Content-based copy detection (CBCD) in video databases is an important and interesting research field. The two major target applications are video retrieval in databases and the protection of digital rights, where it represents an alternative to digital watermarking. In contrast to watermarking approaches, video copy detection regards the media itself as the watermark, where the task is to form unique video descriptions that are robust to multiple types of transformations. Besides robustness, another crucial aspect of the usability of video copy detection systems is the computing time and the storage size of the descriptors. Hence, fast processing speed and lightweight video fingerprints are necessary. To achieve fast processing, we follow a common approach and extract the feature vectors from compressed video streams, since videos are most often stored in encoded form. Compressed domain approaches only necessitate minimal stream decoding and enable more efficient processing, at the cost of lower precision. Given the continuously increasing variety of video distribution networks and devices, ranging from broadband High-Definition (HD) television to handheld devices, scalable video coding (SVC) will play an important role in the future media landscape. In this

article, we focus on videos encoded with H.264/SVC [20], the scalable extension to the well-known H.264/AVC, also known as MPEG-4/Part 10. The descriptors presented in this article are either encoding- or motion-based, since in both cases no or only slight stream decoding is needed. In general, descriptors are supposed to be fast to obtain, easy to compare, small to store and robust to transformations. Our aim is to efficiently extract lightweight video descriptors that are temporally and spatially scalable and robust to multiple types of transformations (see Sec. 2). The inputs for database queries are video clips, and a search is performed by combining multiple descriptors to form the final result set (see Sec. 3). In order to test our method, we used the database from the French national research project ICOS-HD [10]. It contains short HD videos together with a number of scaled and transformed versions. A description of the data set and the obtained results are provided in Sec. 4. The main contributions of this paper are: i) the comparison of different compressed-domain descriptors in terms of retrieval performance on an SVC database, and ii) the proposal of a way to combine the different descriptors to enhance the retrieval results.

1.1 Related Work

Concerning video copy detection and sequence matching, a number of previous efforts that exploit motion information have been published. They can be divided into two groups, pixel domain and compressed domain approaches. The extracted descriptors can be coarsely classified into local and global features. In the pixel domain, color histograms [19, 6], feature points like SIFT [4] or Harris feature points [12, 14, 15, 9] and object trajectories [7] are commonly used features for CBCD. Mohan [17] adapted the ordinal measure for video retrieval applications. It is formed by comparing and sorting the mean brightness of defined regions in an image. It represents an image-based feature vector that was later extended to the temporal case by Chen [5].

Regarding CBCD in the compressed domain, encoding data, motion information and transform coding coefficients build the basis for most frameworks, where motion delivers the most distinctive and important information. Hampapur et al. [8, 1] determine the dominant motion direction per frame through motion vector histograms and use the correlation coefficient as similarity measure. Ardizzone et al. [2] base the search on the size and the average motion of dominant regions, which are obtained by a sequential labeling method and clustering of the motion vectors (MVs). Kobla et al. [13] perform searches on a global motion estimation (GME), which is determined by the largest bin in a directional motion vector histogram. The focus of contraction and expansion is used to determine the zoom factor. The closest approach to the one presented in this paper was proposed by Babu in [3], describing an MPEG-based retrieval system built on global motion and local object features. The motion activity is measured by the standard deviation of the motion vector magnitudes of each frame, and object segmentation is performed by a combination of K-means clustering and the EM algorithm. Object trajectories are represented by two second-order polynomials. Their system is designed to retrieve video sequences with similar local object trajectories and not for exact copy detection. Furthermore, the system does not capture complex camera motion and is not robust against transformations like rotation or flipping.

2 Feature Extraction This section presents the features we studied and shows how they are obtained. The order in which the features are presented goes from low-level to high-level. All descriptors/signatures are calculated frame by frame and are stacked into vectors along time, so the longer they are, the more unique and discriminative they become. In case a video contains multiple spatial layers, we calculate the respective feature for each layer (except for number of objects and trajectories). The similarity measures used for the individual features are also provided in this section. When comparing two sequences of unequal length, we shift the shorter video over the longer one, calculate the similarity at each position and keep the highest similarity. Retrieval results of the different features are given in Sec. 4.
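As an illustration of the sliding comparison of two sequences of unequal length described above, the following Python sketch shifts the shorter per-frame feature sequence over the longer one and keeps the maximum similarity. The function and variable names are illustrative and not part of the original system; the similarity function is passed in as a parameter.

```python
import numpy as np

def sliding_max_similarity(feat_a, feat_b, sim):
    """Shift the shorter feature sequence over the longer one, compute the
    similarity at each offset and keep the highest score (illustrative sketch)."""
    short, long_ = np.asarray(feat_a, dtype=float), np.asarray(feat_b, dtype=float)
    if len(short) > len(long_):
        short, long_ = long_, short
    n = len(short)
    scores = [sim(short, long_[off:off + n])
              for off in range(len(long_) - n + 1)]
    return max(scores)

# example usage with the correlation coefficient as similarity measure:
# corr = lambda a, b: np.corrcoef(a, b)[0, 1]
# score = sliding_max_similarity(query_feature, db_feature, corr)
```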

2.1 Encoding based

Bitrate. The first video signature we analyze is the temporal evolution of the bit rate, i.e., the number of bits per frame (BPF) used by the encoder. This feature can be extracted very efficiently, because no stream decoding is necessary. For streams with spatial scalability, we extract the information for all layers. An example for a random test video is provided in Fig. 1. The Group-Of-Picture (GOP) size, i.e., the interval between two successive I-frames, becomes clearly visible when looking at the distance between two peaks, which correspond to I-frames that require more coded bits than inter-predicted B- or P-frames. This periodic fingerprint caused by the GOP structure has to be compensated, because the absolute positions of I- and B-frames of corresponding images are probably different for two versions of the same video. We therefore average the BPF for each GOP.

Figure 1. BPF (in KBits per frame) per spatial layer of a test video with GOP size = 8; Layer 0 (480x272), Layer 1 (960x544), Layer 2 (1920x1088).

The similarity S_rate between the length-n feature vectors A and B of two sequences is represented by the correlation coefficient r_AB, given by

S_{rate} = r_{AB} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2} \, \sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}}.   (1)

⊕ Easy and very fast to extract. Lightweight - 1 integer per frame.
⊖ Codec-dependent.

MB size histograms. Beginning with H.264/AVC, macro-blocks (MBs) span 16x16 pixels and may be partitioned into smaller, independent sub-MB partitions in order to increase the coding efficiency. In H.264/AVC and SVC, 7 different MB partition sizes are possible: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4. Usually, MB partitions get smaller in well-textured and high-contrast areas that are in motion, like moving trees or object borders. We construct and store frame-wise MB size histograms with 7 bins, corresponding to the possible partition sizes listed above. The similarity S_hist between two videos is determined via the frame-wise sum of histogram intersections:

S_{hist} = \sum_{i=1}^{n} \sum_{j=1}^{bins} \min\left( h_A(i, j), h_B(i, j) \right),   (2)

where h_A and h_B are the normalized histograms of the two sequences, n is the number of frames and bins is the number of bins.

⊕ Easy and fast to extract. Lightweight - 7 integers per frame.
⊖ Codec-dependent (H.264/AVC and SVC only).
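As a hedged illustration of the two encoding-based similarity measures, the sketch below computes the correlation coefficient of Eq. 1 on GOP-averaged bits-per-frame vectors and the histogram intersection of Eq. 2 on per-frame MB size histograms. The variable names and the fixed GOP size are assumptions made for the example, not part of the original implementation.

```python
import numpy as np

def gop_average(bpf, gop_size=8):
    """Average bits-per-frame values over each GOP to compensate the
    periodic I-frame peaks before applying Eq. 1."""
    n = (len(bpf) // gop_size) * gop_size
    return np.asarray(bpf[:n], dtype=float).reshape(-1, gop_size).mean(axis=1)

def bitrate_similarity(bpf_a, bpf_b, gop_size=8):
    """Correlation coefficient between two GOP-averaged BPF vectors (Eq. 1)."""
    a, b = gop_average(bpf_a, gop_size), gop_average(bpf_b, gop_size)
    n = min(len(a), len(b))
    return float(np.corrcoef(a[:n], b[:n])[0, 1])

def mb_histogram_similarity(hist_a, hist_b):
    """Frame-wise sum of histogram intersections (Eq. 2).
    hist_a, hist_b: arrays of shape (n_frames, 7), each row normalized to 1."""
    a, b = np.asarray(hist_a, dtype=float), np.asarray(hist_b, dtype=float)
    n = min(len(a), len(b))
    return float(np.minimum(a[:n], b[:n]).sum())
```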

2.2 Motion based

Since no color or pixel data is available in the compressed domain, motion is the most important information found in the stream. For block-based codecs from the MPEG family, motion is represented as motion vectors (MVs) that are associated with macro-blocks. Except for skipped or intra-coded MBs, each B-frame MB is assigned one or more MVs pointing to its reference frames in the past and in the future (organized in LIST 0 and LIST 1, respectively). For B-frames, we extract LIST 1 MVs. I-frames are intra-coded, hence no MVs are available in the stream. As an approximation for I-frames, we use the mirrored LIST 0 MVs of the succeeding B-frame in coding order. We scale each MV by the distance to its reference frame in order to get vectors that are independent of the GOP structure.

Motion activity. A very simple yet powerful feature is the frame-wise, overall intensity of motion, sometimes referred to as pace of action. The feature we use is very similar to the MPEG-7 descriptor [11] Intensity of motion, given by the frame-wise, average MV magnitude:

I = \frac{1}{N} \sum_{i=1}^{N} \sqrt{dx_i^2 + dy_i^2},   (3)

where N is the number of MVs per frame. Different from MPEG-7, we weight the magnitudes of the MVs of a frame by a 2-D Gaussian to assign more importance to the center region, because motion vectors on the image borders are less reliable due to parts that are entering and leaving the image and because the Region-of-Interest (ROI) is usually located around the center. Furthermore, we correct the GOP structure by dividing the MV magnitudes by the distance to the reference frame. The weighting function is depicted in Fig. 2 and is given by

w(x, y) = e^{-\left( \frac{(x - x_0)^2}{2\sigma_x^2} + \frac{(y - y_0)^2}{2\sigma_y^2} \right)},   (4)

where x_0 and y_0 denote the image center, \sigma_x = w_I/2 and \sigma_y = h_I/2, with w_I and h_I being the image width and height, respectively. The similarity between two videos is determined by the correlation coefficient between two motion activity vectors (see Eq. 1).

Figure 2. 2-D Gaussian weighting function for motion vector magnitudes.

⊕ Easy to extract. Can also be calculated for all other block-based video codecs.
⊖ Retrieves only sequences with similar average motion.
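A minimal sketch of the Gaussian-weighted motion activity of Eqs. 3 and 4 is given below. It assumes the motion vectors of one frame are available as arrays of block-center positions and (dx, dy) components already scaled by the reference-frame distance; the array names are illustrative assumptions.

```python
import numpy as np

def motion_activity(positions, mvs, width, height):
    """Frame-wise motion activity: average MV magnitude weighted by a 2-D
    Gaussian centered on the image (Eqs. 3 and 4).
    positions: (N, 2) array of block centers (x, y)
    mvs:       (N, 2) array of motion vectors (dx, dy)"""
    x0, y0 = width / 2.0, height / 2.0
    sx, sy = width / 2.0, height / 2.0          # sigma_x = w_I / 2, sigma_y = h_I / 2
    x, y = positions[:, 0], positions[:, 1]
    w = np.exp(-(((x - x0) ** 2) / (2 * sx ** 2) + ((y - y0) ** 2) / (2 * sy ** 2)))
    mag = np.hypot(mvs[:, 0], mvs[:, 1])        # MV magnitudes
    return float(np.mean(w * mag))
```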

Global motion. Camera operation usually causes a global and dominant motion, which is an important feature in video indexing. In order to estimate the global scene motion, we adopted a robust algorithm similar to the one presented in [18]. It incorporates a multi-resolution scheme and an iterative re-weighted least-squares estimation with outlier rejection of the 6-parameter affine motion model, given by

dx = a_1 + a_2 (x - x_0) + a_3 (y - y_0)
dy = a_4 + a_5 (x - x_0) + a_6 (y - y_0).   (5)

The multi-resolution approach directly exploits the spatial scalability of the stream. To save computing time, we only re-estimate the global motion until a certain threshold resolution is reached, which was set to 480x272 pixels. At spatial resolutions greater than that, the estimation results do not change significantly anymore. The result of the estimation is the set of 6 parameters a_1 ... a_6. From these parameters, we construct the two values m_trans and m_ratio, corresponding to the magnitude of the total translational motion (m_trans) and the amount of zoom and rotation (m_ratio):

m_{trans} = \frac{|a_1| + |a_4|}{2}; \quad m_{ratio} = \frac{|a_2 + a_6| + |a_5 - a_3|}{2}.   (6)

These two values are robust to transformations like flipping and rotation and are stored for each frame of the video. Similarity is also calculated with the correlation coefficient between the vectors m_trans and m_ratio of two sequences. The global motion estimation itself becomes less reliable when large, low-textured areas are present or when dominant motion is caused by a large object. Nevertheless, even if the estimation result does not reflect the real motion, it still forms a robust video descriptor. The estimation time depends heavily on the size of the estimation support, thus on the video dimensions. We achieve real-time or faster up to a resolution of approximately SDTV on an Intel Core2Duo with 2.16 GHz and 1 GB of RAM.

⊕ Robust to various transformations and distortions.

⊖ Query video has to be sufficiently long and contain motion for GME to be discriminative.
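The following sketch derives the per-frame global motion descriptors of Eq. 6 from the six affine parameters and compares two sequences with the correlation coefficient. The affine estimation itself (multi-resolution, re-weighted least squares [18]) is not reproduced here, and the way the two correlation values are combined (simple averaging) is an assumption, since the paper does not specify it.

```python
import numpy as np

def gm_descriptor(affine_params):
    """Map affine parameters (a1..a6) of one frame to (m_trans, m_ratio), Eq. 6."""
    a1, a2, a3, a4, a5, a6 = affine_params
    m_trans = (abs(a1) + abs(a4)) / 2.0
    m_ratio = (abs(a2 + a6) + abs(a5 - a3)) / 2.0
    return m_trans, m_ratio

def gm_similarity(params_a, params_b):
    """Correlation coefficient between the m_trans and m_ratio vectors of two
    sequences; averaging the two coefficients is an assumption of this sketch."""
    da = np.array([gm_descriptor(p) for p in params_a])
    db = np.array([gm_descriptor(p) for p in params_b])
    n = min(len(da), len(db))
    r_trans = np.corrcoef(da[:n, 0], db[:n, 0])[0, 1]
    r_ratio = np.corrcoef(da[:n, 1], db[:n, 1])[0, 1]
    return float((r_trans + r_ratio) / 2.0)
```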

2.3 Local motion / objects

Especially for sequences that contain no global motion, moving objects provide useful information for retrieval purposes. As a basis for object detection, we process the outlier masks from the global motion estimation described in the previous section. After a spatio-temporal filtering of the outlier masks, we regard, in a first pass, all connected regions as separate objects and calculate certain properties of each object, namely its i) size, ii) orientation, iii) width and height along its principal axes and iv) local motion. In a second pass, we solve object correspondences by matching the most similar objects in adjacent frames.

Number of objects. After the second pass, we discard any objects that appear for fewer than 5 frames and store the remaining number of moving objects per frame as the first object descriptor of the video sequence.

Trajectories. We represent object trajectories, given by the centroids of the silhouette images, in a differential manner for retrieval tasks. Beginning with the second occurrence of an object, we store its speed, i.e., the distance the centroid travelled since the last frame, and the angle difference of the moving direction. Per video clip in the database, we store as many differential trajectories as there are detected objects.

For comparing two sequences based on objects, we first sum up the frame-wise difference in the number of detected objects between two clips and keep all sequences where the difference lies below a threshold. In the same manner, we calculate the trajectory similarity by summing up the differences in speed and angle over the temporal duration of the trajectory and rank the results accordingly.

⊕ High-level and thus robust to multiple transformations.
⊖ For videos where the GME delivers erroneous results or for videos with many occluding objects, the number of objects and the resulting trajectories may be false.

The unavoidable disadvantage of motion-based descriptors is that still scenes without significant moving objects result in all-zero feature vectors. In this case, we can only discard all sequences containing motion when performing a query.
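As a sketch of the differential trajectory representation described above, the code below converts a sequence of object centroids into per-frame speed and moving-direction change, and compares two such trajectories by summed differences. The function names are illustrative and the comparison is simplified (no thresholding or ranking logic).

```python
import numpy as np

def differential_trajectory(centroids):
    """Convert centroid positions (x, y) per frame into (speed, angle change)
    pairs, starting with the second occurrence of the object."""
    c = np.asarray(centroids, dtype=float)
    deltas = np.diff(c, axis=0)                      # displacement per frame
    speed = np.hypot(deltas[:, 0], deltas[:, 1])
    angle = np.arctan2(deltas[:, 1], deltas[:, 0])
    dangle = np.diff(angle, prepend=angle[:1])       # direction change (no wrap-around handling, for brevity)
    return np.column_stack([speed, dangle])

def trajectory_distance(traj_a, traj_b):
    """Sum of speed and angle differences over the common temporal duration."""
    n = min(len(traj_a), len(traj_b))
    return float(np.abs(traj_a[:n] - traj_b[:n]).sum())
```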

2.4 Feature scalability

All presented feature vectors are temporally scalable by a simple re-sampling to the appropriate frame rate. Temporal scalability is enabled by the hierarchical prediction structure of SVC: lower temporal layers are obtained by simply discarding the last B-frames in coding order, cutting the frame rate in half at each temporal level.

Spatial scalability of the descriptors is easily achieved by normalizing with the spatial scale factor between the query video and the video to be compared. The number of objects and the angle differences of the trajectories are inherently invariant to spatial scaling.
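A minimal sketch of both adaptations, assuming a one-dimensional per-frame descriptor such as motion activity: temporal alignment by linear re-sampling to a common frame rate, spatial alignment by dividing out the resolution scale factor. Function names and the use of the image width as scale reference are assumptions of this example.

```python
import numpy as np

def resample_temporal(feature, src_fps, dst_fps):
    """Re-sample a per-frame feature vector to another frame rate by
    linear interpolation (temporal scalability)."""
    f = np.asarray(feature, dtype=float)
    n_dst = max(1, int(round(len(f) * dst_fps / src_fps)))
    src_t = np.linspace(0.0, 1.0, num=len(f))
    dst_t = np.linspace(0.0, 1.0, num=n_dst)
    return np.interp(dst_t, src_t, f)

def normalize_spatial(feature, query_width, ref_width):
    """Scale a magnitude-based feature (e.g. motion activity) by the spatial
    scale factor between query and reference resolution."""
    return np.asarray(feature, dtype=float) * (ref_width / query_width)
```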

3 Copy Detection For the detection of copies, we propose a dynamic and progressive search using a combination of different features, depending on the input video. The building blocks of the system are illustrated in Fig. 3. As a first step, we calculate all features mentioned in Sec. 2 for the input video. Based on these results, we dynamically choose a search scheme.

High motion scheme: If the mean value of the feature motion activity (MA) is higher than a fixed threshold ρ_MA, this scheme applies. We obtain a first result subset by rejecting all clips in the database with a low correlation coefficient (≤ 0.5) regarding the input MA. High MA may result either from global motion (GM), from strong local motion (LM), or both. We successively refine the results first by GM and finally by LM.

Low motion scheme: For videos with very low MA values, we first reject all clips with a mean MA higher than ρ_MA. Then, the results are successively refined with encoding-based features and, as a last refinement, with local motion.

Figure 3. Copy detection schema (input video analyzer, query module, database).
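The sketch below outlines the dynamic, progressive query described above: it selects the high- or low-motion scheme from the mean motion activity of the query and then successively narrows the candidate set. The threshold value, the feature keys and the helper `refine` are hypothetical; the paper only fixes the 0.5 correlation cut-off and the existence of the threshold ρ_MA.

```python
import numpy as np

RHO_MA = 2.0        # assumed value of the motion-activity threshold rho_MA
MIN_CORR = 0.5      # correlation cut-off used for the first rejection step

def refine(candidates, query, feature, sim, min_sim):
    """Keep only the candidates whose similarity to the query exceeds min_sim."""
    return [c for c in candidates if sim(query[feature], c[feature]) >= min_sim]

def copy_detection(query, database, corr, min_sim=0.5):
    """Progressive search combining descriptors depending on the query video.
    query and each database entry are dicts of pre-computed feature vectors."""
    if np.mean(query["motion_activity"]) > RHO_MA:
        # High motion scheme: motion activity -> global motion -> local motion
        results = refine(database, query, "motion_activity", corr, MIN_CORR)
        results = refine(results, query, "global_motion", corr, min_sim)
        results = refine(results, query, "local_motion", corr, min_sim)
    else:
        # Low motion scheme: reject high-motion clips, then refine with
        # encoding-based features and finally with local motion
        results = [c for c in database if np.mean(c["motion_activity"]) <= RHO_MA]
        results = refine(results, query, "bitrate", corr, min_sim)
        results = refine(results, query, "mb_histogram", corr, min_sim)
        results = refine(results, query, "local_motion", corr, min_sim)
    return results
```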

4 Results In this section, we present the test data base used during our experiments and the retrieval results that we obtained using the presented features and methods.

4.1 Test Data Base

Our test database consists of a set of SVC-compressed, scalable high-definition videos. The corpus was created from 47 original video clips in Full-HD resolution (1920x1080) at 25 fps. For each of these 47 clips, 12 different versions are stored in the database: the original, resized to half resolution, resized to quarter resolution, flipped horizontally, flipped vertically, cropped to half resolution, and six rotated versions (10°, 20°, 30°, 40°, 45°, 190°). Figure 4 shows some screenshots of multiple versions taken from the base sequence street. The clips in the corpus have an average duration of 3 seconds and contain four temporal and up to three spatial layers, depending on the resolution.

Figure 4. Exemplary screenshots from the test database: original 1920x1080, resize to 960x540, resize to 480x270, horizontal flip, vertical flip, cropped to 960x540, rotated by 10°, rotated by 190°. Sequence street © Warner Bros. Adv. Media Ser. Inc.

The content of the sequences greatly varies and includes indoor and outdoor shots with moving persons, objects and all types of camera motion. More information on the corpus and screenshots of all sequences can be found under [10] (menu: SPs → Sous-Projet 4 → Corpus HD; website in French). For encoding, we used the SVC reference implementation JSVM in version 9.8, available at [21].

4.2 Precision-Recall Curves

Figure 5 shows the retrieval results with each of the descriptors presented in Sec. 2 and the combined approach presented in Sec. 3. Each precision/recall value pair has been obtained by averaging the retrieval results for all 47 original clips as queries at a fixed similarity threshold. Recall equals 1 if all 12 versions of the clip are among the results, and precision equals 1 when all of the retrieved videos are correct matches. The curves represent the evolution of precision and recall at different threshold values of the similarity measure. The values have been averaged over all queries at a given threshold. For single feature queries, motion activity clearly performs best for the task of video copy detection, followed by the global motion descriptor, bit rate and MB size histograms. The local motion based feature alone has shown to be not very discriminant, because the number of moving objects is very similar for most videos in the database and the trajectories are also very similar due to the short duration of the clips. The plot also shows some operating points of the combined approach described in Sec. 3, obtained at different similarity threshold combinations. The combined approach works dynamically with different features and performs better in all cases, because it adapts the search to the input video.

Figure 5. Detection results of single features (GME, MB size, bitrate, objects, motion activity) and of the combined approach.

Figure 6 shows the performance of the proposed approach in comparison to key-frame based retrieval with SIFT [16] and color histograms. For SIFT matching, we calculated the SIFT points for one key-frame per clip. As similarity measure between a query image I_Q and another image I_n, we used the ratio N_matches(I_Q ∩ I_n) / N_SIFTpoints(I_Q). Concerning color histograms, we constructed global 3-D RGB color histograms with 5x5x5 = 125 bins. The similarity was determined by calculating the histogram intersection.

Figure 6. Detection results of the proposed approach compared to key-frame methods (SIFT, ColHist, proposed).

The proposed approach works better than key-frame based approaches on the test database. This is due to the nature of the database and the properties of feature point / color based methods. Some of the clips are different, but have been shot in the same environment as others, so we obtain a significant amount of matches with sequences that are not a copy of the original query sequence. Furthermore, the database also contains flipped versions, and SIFT fails in this case. The global color histograms basically failed on rotated versions, since the resulting gaps on the borders of the image after the rotation are filled with white, black or the border pixel values, which becomes worse with increasing rotation angle.
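For reference, a hedged sketch of the key-frame color-histogram baseline: a global 3-D RGB histogram with 5x5x5 = 125 bins per key-frame, compared by histogram intersection. The exact implementation used in the experiments is not specified in the paper, so this is only an approximation with illustrative names.

```python
import numpy as np

def rgb_histogram(frame, bins=5):
    """Global 3-D RGB histogram (bins^3 = 125 bins), normalized to sum 1.
    frame: (H, W, 3) uint8 array."""
    pixels = frame.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / hist.sum()

def histogram_intersection(h_a, h_b):
    """Similarity of two normalized histograms (1.0 means identical)."""
    return float(np.minimum(h_a, h_b).sum())
```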

5 Conclusions We analyzed different scalable compressed domain features for the task of video copy detection. Since neither pixel nor color information is available in the compressed domain, motion turns out to be the crucial factor. Encoding based features like bit rate over time or histograms of MB partition sizes are codec dependent and do not deliver very reliable results alone. Among the analyzed descriptors, motion activity provides robust results for a variety of videos. However, a search that incorporates multiple features and that adapts to the properties of the query video gives better results on the used data set. For future work, we want to validate the results on larger video collections of scalable high-definition content.

6 Acknowledgments This work has been carried out in the context of the French national project ICOS-HD (ANR-06-MDCA010-03) funded by the ANR (Agence Nationale de la Recherche).

References
[1] A. Hampapur, K. Hyun, and R. Bolle. Comparison of sequence matching techniques for video copy detection. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 4676, pages 194-201, Dec 2001.
[2] E. Ardizzone, M. L. Cascia, A. Avanzato, and A. Bruna. Video indexing using MPEG motion compensation vectors. In ICMCS '99: Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Volume 2, page 725, Washington, DC, USA, 1999. IEEE Computer Society.
[3] R. Babu and K. Ramakrishnan. Compressed domain video retrieval using object and global motion descriptors. Multimedia Tools and Applications, 32:93-113, January 2007.
[4] A. Basharat, Y. Zhai, and M. Shah. Content based video matching using spatiotemporal volumes. Computer Vision and Image Understanding, 110(3):360-377, 2008.
[5] L. Chen and F. W. M. Stentiford. Video sequence matching based on temporal ordinal measurement. Pattern Recognition Letters, 29(13):1824-1831, 2008.
[6] N. Diakopoulos and S. Volmer. Temporally tolerant video matching. In Proc. of the ACM SIGIR Workshop on Multimedia Information Retrieval, Toronto, Canada, August 2003.
[7] S.-F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong. A fully automated content-based video search engine supporting spatiotemporal queries. IEEE Transactions on Circuits and Systems for Video Technology, 8:602-615, 1998.
[8] A. Hampapur and R. M. Bolle. Comparison of distance measures for video copy detection. In IEEE International Conference on Multimedia and Expo (ICME'01), page 188, 2001.
[9] X.-S. Hua, X. Chen, and H.-J. Zhang. Robust video signature based on ordinal measure. In International Conference on Image Processing (ICIP04), volume 1, pages 685-688, Singapore, October 2004.
[10] ICOS-HD. French national research project ANR-06-MDCA010-03. http://icos-hd.irisa.fr/.
[11] S. Jeannin and A. Divakaran. MPEG-7 visual motion descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):720-724, 2001.
[12] A. Joly, O. Buisson, and C. Frelicot. Content-based copy retrieval using distortion-based probabilistic similarity search. IEEE Transactions on Multimedia, 9(2):293-306, February 2007.
[13] V. Kobla, D. Doermann, and K.-I. (David) Lin. Archiving, indexing, and retrieval of video in the compressed domain. In Proc. of the SPIE Conference on Multimedia Storage and Archiving Systems, pages 78-89, 1996.
[14] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In MULTIMEDIA '06: Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 835-844, New York, NY, USA, 2006. ACM.
[15] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford. Video copy detection: a comparative study. In CIVR '07: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pages 371-378, New York, NY, USA, 2007. ACM.
[16] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[17] R. Mohan. Video sequence matching. In Int. Conf. on Acoustics, Speech and Signal Processing, volume 6, pages 3697-3700, Seattle, WA, USA, 1998.
[18] J. M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348-365, 1995.
[19] J. M. A. Sanchez, X. Binefa, J. Vitria, and P. Radeva. Local color analysis for scene break detection applied to TV commercials recognition. In Proceedings of Visual '99, pages 237-244, 1999.
[20] H. Schwarz, D. Marpe, and T. Wiegand. Overview of the scalable H.264/MPEG4-AVC extension. In IEEE International Conference on Image Processing (ICIP'06), Atlanta, USA, pages 161-164, October 2006.
[21] JSVM Reference Software. Reference software for H.264/SVC. http://ftp3.itu.ch/av-arch/jvt-site/.
