
Robust Video Watermarking via Optimization Algorithm for Quantization of Pseudo-Random Semi-Global Statistics

Mehmet Kucukgoz (1), Öztan Harmancı (2), M. Kıvanç Mıhçak (3) and Ramarathnam Venkatesan (3)

(1) Microsoft Corporation, Redmond, WA, 98052
(2) Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627
(3) Microsoft Research, Cryptography and Anti-Piracy Group, Redmond, WA, 98052

Contact Author: Mehmet Kucukgoz, tel: (425) 705-7789, Email: [email protected]

ABSTRACT

In this paper, we propose a novel semi-blind video watermarking scheme, where we use pseudo-random robust semi-global features of video in the three-dimensional wavelet transform domain. We design the watermark sequence via solving an optimization problem, such that the features of the mark-embedded video are the quantized versions of the features of the original video. The exact realizations of the algorithmic parameters are chosen pseudo-randomly via a secure pseudo-random number generator, whose seed is the secret key, that is known (resp. unknown) by the embedder and the receiver (resp. by the public). We experimentally show the robustness of our algorithm against several attacks, such as conventional signal processing modifications and adversarial estimation attacks.

1. INTRODUCTION

The goal of robust watermarking is to embed a message in a host signal, such that the mark-embedded signal is perceptually approximately the same as the original and the system is robust, i.e., the embedded mark should be detected (or decoded) with high probability under reasonable attacks. The extraction or detection of a possibly embedded message can be used for identification or copyright-protection purposes in anti-piracy applications if the system is secure. The marked signal might go through possible changes by users and attackers: unintentional modifications and malicious attacks that particularly aim to disable the recovery of the watermark (WM). Ideally, a watermarking system should resist all attacks that preserve the perceptual quality, while avoiding false positives.

We distinguish between data embedding (for communications purposes) and watermarking (for security purposes) problems, both of which fall into the category of information hiding. Data embedding is non-adversarial information hiding, which can be modelled as the problem of communication with side information at the transmitter. In contrast, we use the term watermarking for information hiding for anti-piracy applications in the presence of a malicious attacker. In this paper, we consider the latter, where we assume that the adversary has full knowledge of the algorithm except for the secret, akin to traditional cryptography [1].

Prior work on video watermarking includes various schemes. Typical examples include [2, 3], where the authors extended image watermarking schemes to video via frame-by-frame mark embedding. The temporal correlations in a video create new problems and attacks, and we present some simple cases. A compression algorithm such as H.264 can become a good attack to decrease the WM strength, since a region in a frame where the WM is introduced may be replaced by another from a neighboring unwatermarked frame, unless proper care is taken.
Also, note that if one uses independent watermarks on selected frames, it can cause noticeable defects (e.g., flickers) and thus one has to correlate the WM of a given frame with those of its neighbors (e.g., slowly increase and decrease WM strength). Then, the attacker may be able to estimate the inserted watermarks, or if a different WM is used for each frame in a consecutive group of pictures within the same scene, a malicious attacker can possibly remove at least a portion of the WM by comparing parts of the frames that change little and averaging them. Using redundancy across frames (possibly for robustness against temporal attacks) may allow an attack that exploits the redundancy in WM generation. There are many other problems, and our goal is a system that would admit some transparent ways of assessing its security properties empirically and analytically.

We propose to overcome such difficulties by exploiting the 3-dimensional (3D) correlations in video. A related work is [4], where the authors propose to embed a spread-spectrum WM into 3D blocks of video after applying 3D discrete Fourier transform (DFT). We propose to embed the WM after applying 3D wavelet transform, which is known to de-correlate images and videos better than DFT. In particular, we use pseudo-random (PR) linear statistics of PR (possibly overlapping) connected regions in the DC subband of the 3D wavelet domain for embedding and decoding. We experimentally show that such statistics compactly capture the essential distinctive information in video, while being robust under attacks; thus, we effectively work on a new PR signal representation domain to achieve both robustness and security. For mark embedding, we quantize these statistics and subsequently compute an additive WM via solving a PR optimization problem, such that the statistics of the marked signal are equal to the quantized statistics of the (unmarked) host signal. We assume the presence of unmarked host statistics at the receiver, and accordingly use the correlation detector to verify the presence of the WM (if any).

Some similarities may be observed between our approach and “spread-transform” watermarking discussed in [5, 6]. However, there are major fundamental differences: Our statistics are produced by non-disjoint sets of host coefficients (i.e., overlapping semi-global connected regions) that have geometric meanings (as opposed to the usage of disjoint sets in [5, 6] that are not necessarily meaningful geometrically).
We mention three advantages: it allows one to embed WM with minimal perceptual distortion, improves robustness, and increases the dimension in which WM is computed and inserted so as to avert the estimation attacks in an arguable fashion. Our approach requires the computation of the WM via solving a non-trivial optimization problem (which yields a combinatorially-complex PR structure that is security-wise desired), whereas the WM is obtained by a trivial inverse spread transform in [5, 6], which lacks security properties. Note that our major goal in this paper is to have a system that is both practically sound and secure. The basics of our approach were first introduced in [7] in the context of image watermarking; here, we extend this approach to the video case in a suitable model.

The paper is organized as follows. In Section 2, we introduce the notation, explain the problem setup and give a general description of the WM verification system. In Section 3, we explain our general approach of using PR statistics of semi-global regions and justify it. In Section 4, we discuss the design of WM embedding and detection algorithms in detail. In Section 5, we present experimental results for various videos, in comparison with coefficient-based quantization [8]. We conclude with some remarks and future work.

2. NOTATION AND PROBLEM DEFINITION

Lowercase (resp. uppercase) boldface letters denote real vectors (resp. matrices) and subscripts denote individual elements (e.g., $a_i$ (resp. $A_{ij}$) is the $i$-th element (resp. the $(i, j)$-th element) of the vector $\mathbf{a}$ (resp. the matrix $\mathbf{A}$)). For vectors, we use $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ to denote the Euclidean norm and the corresponding inner product. For matrices, we use $(\cdot)^T$ and $\|\cdot\|_F$ to denote the transpose operator and the Frobenius norm.

Let $\mathbf{s} \in \mathbb{R}^n$ and $\theta \in \Theta$ be the length-$n$ host data and the secret key respectively, where $\Theta$ represents the corresponding key space. In this work, we use $\mathbf{s}$ as the vector representation of the DC subband (i.e., the lowest-frequency subband) in the 3D discrete wavelet transform (DWT) domain of the host video. In all the PR steps of our approach, $\theta$ is used as the seed of a secure PR number generator; $\theta$ is known to the embedder and the receiver, but unknown to the public.

In this paper, we consider the verification problem, where the receiver acts as a detector and decides on the presence of the WM in the received signal (unlike the decoding problem, where the receiver assumes that a WM has been embedded in the received signal and aims to reliably extract the embedded bits). The watermarked data (denoted by $\mathbf{x} \in \mathbb{R}^n$) should be perceptually approximately the same as $\mathbf{s}$. The goal of the malicious attacker is to cause errors in detection, via feeding the attacked data $\mathbf{y}$ (which should have perceptually acceptable quality) to the receiver; note that $\mathbf{y}$ may have been produced from some unmarked data $\mathbf{s}$ or its marked version $\mathbf{x}$. If $\mathbf{y}$ has been generated from $\mathbf{s}$ (resp. $\mathbf{x}$), the attacker aims to increase the probability of the receiver's declaring the presence of the WM, denoted by $P_F$ (resp. the probability of the receiver's declaring the absence of the WM, denoted by $P_M$)*. Naturally, the receiver's goal is to achieve otherwise.

* $P_F$ and $P_M$ are conventionally used for probabilities of false alarm and miss in detection theory [9].

Most watermarking techniques consider either a “blind” or “non-blind” setup, where the former (resp. the latter) assumes that the receiver does not have (resp. has) access to the unmarked host $\mathbf{s}$. Here, we consider an alternative “semi-blind” setup, where the receiver does not have access to $\mathbf{s}$, but rather has access to some side information that is correlated with $\mathbf{s}$. Thus, by construction, our approach is well-suited for fingerprinting applications [10, 11], where the receiver operates in a secure server (which is typically allowed to have information about the unmarked original) and the attacker does not have access to it. In contrast with fingerprinting, in screening applications (e.g., automatic monitoring, copyright control in players or recorders, etc.), the receiver operates in a non-secure server, possibly within the attacker’s reach, and thus, the usual assumption is a “blind” setup. However, we believe that our algorithm may also be suitable for screening applications, because in our method the side information at the receiver is correlated with $\mathbf{s}$ in a PR carefully-designed fashion via $\theta$, and we conjecture that even though the attacker knows this side information, it would be hard for him to extract knowledge about $\mathbf{s}$ or $\theta$.

3. GENERAL APPROACH

A serious problem in watermarking is resistance against geometric attacks (e.g., rotation, cropping, shearing, etc.). Common techniques to achieve robustness against these attacks include repetition of the WM [12] to gain approximate invariance under these attacks and embedding a “pilot pattern” (that often has a periodic structure) to estimate the parameters of the attack [13]. However, we believe that such techniques may pose serious security threats. Due to the redundancy in the embedded marks or pilot patterns, they obviously leak information and may lead to the attacker’s estimating them; see [14] for an example of such an attack. Therefore, we need to do mark embedding such that we gain robustness against acceptable geometric attacks in a secure way.

To achieve this task, we aim to find PR signal representations that are both secure (in a complexity theoretic sense akin to traditional cryptography) and approximately invariant under reasonable geometric attacks. The transforms that yield these representations would obviously be radically different from the ones that are used for conventional signal processing applications (e.g., DWT, DCT, etc.). Our final goal is to design such transforms with provable security and use them in media anti-piracy applications; this paper can be viewed as a first step towards it.

We propose to extract PR semi-global features that bear approximate invariance properties and use them for mark embedding. We choose these features as linear statistics of PR sufficiently-large regions; the statistic of each region is computed as a weighted linear combination of the coefficients in that region, where the weights are chosen from a smoothly-varying jointly-Gaussian distribution. We experimentally show that this representation is robust against common signal processing attacks and mild geometric attacks. The high-level rationale is that any valid attack (i.e., the one that preserves perceptual quality) would not change the global characteristics of a video, whereas it may have drastic effects locally. Our method can be viewed as an extension of the image watermarking work proposed in [7] to 3D signals with some fundamental differences: In [7], we considered a “blind decoding” problem for still image watermarking, whereas in this paper our problem is “semi-blind verification” for video.

Robustness of Statistics: Now, we experimentally illustrate the robustness of PR linear semi-global statistics for video. We use 256 frames from the movie “Gone With the Wind” (GWW) as our original video; each frame is 480 × 480. We apply 3 different attacks on it: frame-by-frame rotation (2 degrees) and cropping (10% by area), MPEG compression at 500 kbps, and time-compression by 6.7%. Next, we compare the statistics of the original with those of the attacked. In computing them, we use rectangular prisms of various sizes with PR locations in the spatial domain as our regions, where the weights are chosen from a jointly-Gaussian distribution (bandlimited to 0.3π radians in all three dimensions) and independent for each prism. In Fig. 1, we show the average normalized mean squared error (MSE) between the statistics of the original and the attacked videos; averaging is done over 5000 independent realizations. We observe that, under all attacks, the MSE monotonically decreases quite fast as the region size increases. In our watermarking scheme, we use sufficiently-large regions to work in a regime corresponding to small MSE.

[Figure 1 plot omitted. Horizontal axis: prism size (0 to 120); vertical axis: average absolute change per pixel (logarithmic scale, 10^-2 to 10^5).]

Figure 1. Average normalized MSE between the statistics of the original and attacked videos vs. prism size. Solid: MPEG compression at 500 kbps; dashed: time-compression by 6.7%; dotted: rotation (2 degrees) and cropping (10% by the area).

4. VIDEO WATERMARKING METHOD

4.1. 3D Wavelet Transformation

Mark embedding needs to be done on an approximately de-correlated version of the signal for robustness (especially against estimation-type attacks that may exploit correlations). To achieve this purpose, we apply 3D-DWT. Note that DWT is well-known to capture the space-time energy compactly in case of images. In particular, we use $N_1$, $N_2$ and $N_3$ level decompositions along the horizontal, vertical and temporal axes respectively. We compute the statistics and embed the WM in the DC subband (i.e., the lowest-frequency subband), since that subband captures most of the signal energy according to our experiments. Our decomposition is similar to the one in [15].
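As a rough illustration of how a DC subband can be extracted with separate numbers of levels per axis, the following sketch repeatedly applies a low-pass analysis step along each axis. It is only a sketch under stated assumptions: it substitutes the orthonormal Haar filter for the Daubechies filters used in the experiments, it represents the video as nested Python lists indexed `[t][y][x]`, and the helper names (`haar_lowpass`, `lowpass_along_axis`, `dc_subband`) are hypothetical.

```python
import math

def haar_lowpass(seq):
    """One Haar low-pass analysis level: pairwise sums scaled by 1/sqrt(2)."""
    return [(seq[2 * k] + seq[2 * k + 1]) / math.sqrt(2.0)
            for k in range(len(seq) // 2)]

def lowpass_along_axis(video, axis):
    """Apply one low-pass level along axis 0 (time), 1 (vertical) or 2 (horizontal)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if axis == 2:
        return [[haar_lowpass(row) for row in frame] for frame in video]
    if axis == 1:
        out = []
        for frame in video:
            # Filter every column, then reassemble the shortened rows.
            cols = [haar_lowpass([frame[y][x] for y in range(H)]) for x in range(W)]
            out.append([[cols[x][y] for x in range(W)] for y in range(H // 2)])
        return out
    out = [[[0.0] * W for _ in range(H)] for _ in range(T // 2)]
    for y in range(H):
        for x in range(W):
            line = haar_lowpass([video[t][y][x] for t in range(T)])
            for t, value in enumerate(line):
                out[t][y][x] = value
    return out

def dc_subband(video, n1, n2, n3):
    """Keep only the low-pass branch: n1 horizontal, n2 vertical, n3 temporal levels."""
    for _ in range(n1):
        video = lowpass_along_axis(video, 2)
    for _ in range(n2):
        video = lowpass_along_axis(video, 1)
    for _ in range(n3):
        video = lowpass_along_axis(video, 0)
    return video

# Toy example: a constant 4-frame, 8 x 8 video (not the paper's test data).
toy_video = [[[1.0] * 8 for _ in range(8)] for _ in range(4)]
toy_dc = dc_subband(toy_video, n1=2, n2=2, n3=1)
```

Because only the low-pass branch is kept and the filters are separable, the order in which the axes are processed does not affect the resulting DC subband; each level halves the length along its axis.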

4.2. Quantization of Statistics via Optimization For Mark Embedding

We embed the WM by effectively changing the statistics of (possibly overlapping) regions with PR locations. Currently we confine ourselves to rectangular prisms; however our approach allows the usage of regions with other shapes. Let $\mathbf{s} \in \mathbb{R}^n$ denote the original signal coefficients of size $n$ in the DC subband. We first generate sizes and locations of $m$ rectangular prisms; the $i$-th prism is represented by the set $R_i \subset \{1, 2, \ldots, n\}$, which defines the indices of the coefficients that belong to it, $1 \le i \le m$. The $i$-th statistic $\mu_i$ of $\mathbf{s}$ that corresponds to $R_i$ is a linear weighted combination of PR Gaussian weights (that are bandlimited to $f_t$ radians by construction) and $\{s_j\}$ that fall into $R_i$. We represent the weights for $R_i$ with $\mathbf{t}^i \in \mathbb{R}^n$, where $t^i_j = 0$ if $j \notin R_i$. Hence, we write

$$\mu_i = \sum_{j=1}^{n} s_j t^i_j = \langle \mathbf{s}, \mathbf{t}^i \rangle,$$

which leads to $\boldsymbol{\mu} = \mathbf{T}\mathbf{s}$, where $\boldsymbol{\mu} \in \mathbb{R}^m$ is the vector of statistics, and $\mathbf{T} \in \mathbb{R}^{m \times n}$ is formed such that its $i$-th row is $\mathbf{t}^i$. We design the additive WM sequence $\mathbf{w} \in \mathbb{R}^n$ such that the watermarked signal $\mathbf{x} = \mathbf{s} + \mathbf{w}$ has statistics $\hat{\boldsymbol{\mu}} \in \mathbb{R}^m$ that are the quantized version of $\boldsymbol{\mu}$. Currently, we use a scalar uniform quantizer with step size $\delta$; however, in general any high-dimensional quantizer that maps $\boldsymbol{\mu}$ to $\hat{\boldsymbol{\mu}}$ can be used. We compute $\mathbf{w}$ such that $\mathbf{T}(\mathbf{s} + \mathbf{w}) = \hat{\boldsymbol{\mu}}$ and $\|\mathbf{w}\| = \|\mathbf{x} - \mathbf{s}\|$ is minimized; i.e., we solve

$$\min_{\mathbf{w}} \|\mathbf{w}\| \quad \text{s.t.} \quad \mathbf{T}\mathbf{x} = \hat{\boldsymbol{\mu}} \iff \mathbf{T}\mathbf{w} = \hat{\boldsymbol{\mu}} - \boldsymbol{\mu}. \tag{1}$$
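The statistics µ = Ts and their scalar uniform quantization can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it uses a 1D signal with contiguous index ranges in place of 3D rectangular prisms, draws independent Gaussian weights without the band-limiting step, and uses Python's `random.Random` as a stand-in for the secure PR number generator seeded by the secret key. All function names are hypothetical.

```python
import random

def make_weight_matrix(n, m, region_len, key):
    """Build T: row i holds Gaussian weights on a PR region and zeros elsewhere."""
    rng = random.Random(key)  # assumption: NOT cryptographically secure
    T = []
    for _ in range(m):
        start = rng.randrange(0, n - region_len + 1)  # PR location; overlaps allowed
        row = [0.0] * n
        for j in range(start, start + region_len):
            row[j] = rng.gauss(0.0, 1.0)  # assumption: no band-limiting applied
        T.append(row)
    return T

def statistics(T, s):
    """mu_i = <s, t^i> for every row t^i of T."""
    return [sum(t_ij * s_j for t_ij, s_j in zip(row, s)) for row in T]

def quantize(mu, delta):
    """Scalar uniform quantizer with step size delta."""
    return [delta * round(v / delta) for v in mu]

# Toy example with made-up host data.
s = [float((7 * j) % 13) for j in range(64)]
T = make_weight_matrix(n=64, m=4, region_len=16, key=1234)
mu = statistics(T, s)
mu_hat = quantize(mu, delta=5.0)
```

Each quantized statistic lies within δ/2 of the original one, so the right-hand side of the constraint in (1), µ̂ − µ, has entries bounded by δ/2 in magnitude.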

Assuming that $\mathbf{T}$ is full-rank (which is almost always satisfied with our parameter selection), the solution to (1) is given by the well-known minimum-norm solution: $\mathbf{w}_{MN} = \mathbf{T}^T \left( \mathbf{T}\mathbf{T}^T \right)^{-1} (\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})$. Note that this is optimal in the sense of the $L_2$ norm; however, it does not necessarily correspond to a good perceptual quality of $\mathbf{x}$. To increase the quality, we impose some smoothness constraints on $\mathbf{w}$ and use the following algorithm for WM generation:

1. Set $\mathbf{w}_0 = \mathbf{w}_{MN} = \mathbf{T}^T (\mathbf{T}\mathbf{T}^T)^{-1} (\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})$.
3. For $K$ steps:
3.1. $\mathbf{w}_1 = \mathrm{smooth}(\mathbf{w}_0) - \mathbf{w}_{MN}$.
3.2. Project $\mathbf{w}_1$ onto $\mathcal{N}(\mathbf{T})$, the nullspace of $\mathbf{T}$: $\mathbf{w}_2 = \mathbf{w}_1 - \mathbf{T}^T (\mathbf{T}\mathbf{T}^T)^{-1} \mathbf{T}\mathbf{w}_1$.
3.3. Set $\mathbf{w}_3 = \mathbf{w}_2 + \mathbf{w}_{MN}$.
3.4. If $\|\mathbf{w}_3 - \mathbf{w}_0\| < \varepsilon$ stop, else set $\mathbf{w}_0 = \mathbf{w}_3$ and go to 3.1.
4. The watermarked signal is given by $\mathbf{x} = \mathbf{s} + \mathbf{w}_3$.

Note that step 3.2 guarantees $\mathbf{T}(\mathbf{s} + \mathbf{w}_2 + \mathbf{w}_{MN}) = \hat{\boldsymbol{\mu}}$ at each step of the algorithm. In general, the function smooth can be any smoothing operator; we currently use ideal low-pass filtering with cutoff frequency $f_w$.

Remark 1: The parameters of the algorithm should be chosen such that $\mathbf{T}$ is full-rank (preferably with low condition number). If this is not satisfied, the minimum-norm least-squares solution can be used instead of the minimum-norm solution; however, in that case it would be impossible to satisfy $\mathbf{T}\mathbf{x} = \hat{\boldsymbol{\mu}}$ exactly.

Remark 2: If the function smooth is linear, it is possible to incorporate it as another constraint in (1). In that case, a closed-form solution can be found for the WM instead of using the iterative algorithm given above; however, our experiments show that the resulting computational complexity is infeasible for video.

Remark 3: It is possible to use a similar approach to design a multiplicative WM, $x_i = w_i s_i$. We do not discuss it here due to space requirements; for further details, we refer the reader to [7].
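The iterative WM generation above can be sketched as follows, to make the alternation between smoothing and nullspace projection concrete. This is a minimal sketch under stated assumptions: a 3-tap moving average stands in for the ideal low-pass filter, the signal is 1D, the linear system with $\mathbf{T}\mathbf{T}^T$ is solved by plain Gaussian elimination, and all names (`pinv_apply`, `smooth`, `embed`) and the toy values of `T_toy` and `s_toy` are hypothetical.

```python
def matvec(A, v):
    """Compute A v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def solve(G, b):
    """Solve G z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(G[i]) + [b[i]] for i in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):
        z[r] = (M[r][m] - sum(M[r][k] * z[k] for k in range(r + 1, m))) / M[r][r]
    return z

def pinv_apply(T, d):
    """Return T^T (T T^T)^{-1} d, assuming T has full row rank."""
    G = [[sum(a * b for a, b in zip(ri, rj)) for rj in T] for ri in T]
    z = solve(G, d)
    return [sum(T[i][j] * z[i] for i in range(len(T))) for j in range(len(T[0]))]

def smooth(w):
    """Assumption: 3-tap moving average (edge-replicated), not an ideal low-pass filter."""
    n = len(w)
    return [(w[max(j - 1, 0)] + w[j] + w[min(j + 1, n - 1)]) / 3.0 for j in range(n)]

def embed(T, s, mu_hat, K=20, eps=1e-9):
    """Follow steps 1, 3.1-3.4 and 4; return the watermarked signal x = s + w3."""
    d = [a - b for a, b in zip(mu_hat, matvec(T, s))]
    w_mn = pinv_apply(T, d)                                     # step 1
    w0 = w3 = w_mn
    for _ in range(K):
        w1 = [a - b for a, b in zip(smooth(w0), w_mn)]          # step 3.1
        corr = pinv_apply(T, matvec(T, w1))
        w2 = [a - b for a, b in zip(w1, corr)]                  # step 3.2
        w3 = [a + b for a, b in zip(w2, w_mn)]                  # step 3.3
        if sum((a - b) ** 2 for a, b in zip(w3, w0)) ** 0.5 < eps:
            break                                               # step 3.4
        w0 = w3
    return [a + b for a, b in zip(s, w3)]                       # step 4

# Toy example: two overlapping regions on a length-6 signal, step size 2.
T_toy = [[1.0, 0.5, -0.3, 0.0, 0.0, 0.2],
         [0.0, 0.0, 0.4, 1.0, -0.7, 0.1]]
s_toy = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]
mu_hat_toy = [2.0 * round(v / 2.0) for v in matvec(T_toy, s_toy)]
x_toy = embed(T_toy, s_toy, mu_hat_toy)
```

Since the projected component satisfies $\mathbf{T}\mathbf{w}_2 = \mathbf{0}$, every iterate meets the constraint of (1) regardless of how many iterations are run; the loop only trades norm for smoothness within the feasible set.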

4.3. Detection

Watermarking via quantization was first proposed in [8] for the decoding problem and extended to be used for the verification problem in [16], both in a blind setup. We consider the verification problem in a semi-blind setup; in particular we assume that the detector knows $\boldsymbol{\mu}$, the statistics of the unmarked host $\mathbf{s}$. Accordingly, we use a correlation detector. Let $\mathbf{y} \in \mathbb{R}^n$ denote the received signal to the detector. Then, the normalized correlation value is given by

$$\gamma = \frac{\langle \tilde{\boldsymbol{\mu}} - \boldsymbol{\mu}, \hat{\boldsymbol{\mu}} - \boldsymbol{\mu} \rangle}{\|\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}\|^2},$$

where $\tilde{\boldsymbol{\mu}}$ are the PR linear statistics of $\mathbf{y}$, i.e., $\tilde{\boldsymbol{\mu}} = \mathbf{T}\mathbf{y}$. In the absence of any attacks, $\gamma = 1$ (resp. $\gamma = 0$) if the WM is present (resp. absent) in $\mathbf{y}$. Thus, having chosen a threshold $\tau$ ($0 < \tau < 1$), the detector declares the presence (resp. the absence) of the WM if $\gamma > \tau$ (resp. $\gamma < \tau$).
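A correlation detector of this kind can be sketched as below. The sketch assumes the normalized correlation has the form γ = ⟨µ̃ − µ, µ̂ − µ⟩ / ‖µ̂ − µ‖², which is consistent with γ = 1 for an unattacked marked signal and γ = 0 for an unattacked unmarked one; the function names and the default threshold of 0.5 are hypothetical choices for illustration.

```python
def correlation(mu_tilde, mu, mu_hat):
    """Normalized correlation between (mu_tilde - mu) and (mu_hat - mu)."""
    d = [a - b for a, b in zip(mu_hat, mu)]
    den = sum(v * v for v in d)
    if den == 0.0:
        # Quantization left the statistics unchanged, so nothing was embedded.
        raise ValueError("mu_hat equals mu; correlation is undefined")
    num = sum((yt - m0) * di for yt, m0, di in zip(mu_tilde, mu, d))
    return num / den

def detect(mu_tilde, mu, mu_hat, tau=0.5):
    """Declare the WM present when gamma exceeds the threshold tau."""
    return correlation(mu_tilde, mu, mu_hat) > tau

# Toy statistics (made-up numbers, not experimental data).
mu_demo = [2.7, -0.7, 5.2]
mu_hat_demo = [2.0, 0.0, 6.0]
gamma_marked = correlation(mu_hat_demo, mu_demo, mu_hat_demo)    # received = marked
gamma_unmarked = correlation(mu_demo, mu_demo, mu_hat_demo)      # received = unmarked
```

Under an attack the received statistics drift away from both µ and µ̂, so γ spreads around these two ideal values, and the threshold τ trades the false-alarm probability against the miss probability.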

5. SIMULATION RESULTS AND DISCUSSION

We now experimentally quantify the performance of our algorithm. We used several samples, including the standard test videos and commercially-available movies, and observed similar results in all. Here, we show the results based on several sample clips of length 256 frames (size 480 × 480) taken from the movie “Gone With the Wind”. We used Daubechies length-10 (resp. Daubechies length-4) wavelets along the horizontal and vertical (resp. time) axes. We choose the parameters of our algorithm as follows: $N_1 = 3$, $N_2 = 3$, $N_3 = 2$, $m = 100$, $\delta = 400$, $f_t = 0.3\pi$ radians in all three dimensions, and $f_w$ is $0.25\pi$ radians (resp. $0.15\pi$ radians) along horizontal and vertical (resp. temporal) directions. The rectangular prisms are of size 32 × 32 × 24 with PR locations in the DC subband (which is of size 60 × 60 × 128 due to the structure of the DWT).

We compare our algorithm with the Quantization Index Modulation (QIM) method [8]. We apply the QIM scheme to all of the coefficients in the DC subband, and choose the quantization step size such that the MSE induced by it matches the watermarking distortion by our scheme. Although the MSE introduced by both schemes is the same, we observed that QIM introduces perceptual artifacts, whereas our scheme does not; this is probably because of the smoothness conditions we impose on the WM generation.

[Figure 2 plot omitted. Horizontal axis: γ (0 to 1.6); vertical-axis ticks range from 0 to 0.35.]

Figure 2. Comparison of our scheme and QIM via normalized distribution of the correlation detector output γ under attacks. Dashed (resp. solid): our scheme (resp. QIM) against rotation by 2 degrees and 10% cropping by area; Dotted (resp. dash-dotted): our scheme (resp. QIM) against MPEG-2 compression at 500 kbps; Squares (resp. circles): our scheme (resp. QIM) against time compression by 6.7%.

We applied several attacks to both schemes and observed that our algorithm outperforms QIM. Here, we show results for 3 attacks, namely rotation (by 2 degrees) and cropping (by 10% in the area), MPEG-2 compression at 500 kbps, and time compression (by 6.7%). The normalized distribution of the detector output γ for 100 independent realizations is given in Fig. 2 under these 3 attacks for both QIM and our scheme. The operational probability of error for our algorithm is much less than that of QIM under all attacks. In the final version of the paper, we shall provide more experimental results. Furthermore, we also carried out cryptanalysis, which shows the infeasibility of WM-estimation-type attacks based on locally-Gaussian stochastic models on video, where we assume that the attacker has some information about the secret key. The analysis is analogous to the one in [7] and omitted here due to space requirements.

Discussion: We propose a novel video mark embedding scheme that is based on a new PR signal representation: PR linear statistics of PR 3D regions. In our algorithmic design, we emphasize the significance of security and we believe that our scheme is a first step towards “hard-to-break” (i.e., requiring large computational resources) watermarking systems. Experiments reveal that our algorithm is robust against reasonable geometric attacks, secure against adversarial estimation attacks, and outperforms QIM. Our future work includes combining our mark embedding system with a robust hash algorithm to choose the areas to embed the watermark (thus gaining robustness against temporal de-synchronization attacks) and using it to develop a movie fingerprinting scheme.

REFERENCES

1. A. J. Menezes, P. C. van Oorschot and S. A. Vanstone, Handbook of applied cryptography, CRC Press, Boca Raton, FL, 1997.
2. G. Depovere, T. Kalker, J. Haitsma, M. Maes, L. de Strycker, P. Termont, J. Vandewege, A. Langell, C. Alm, P. Norman, G. O’Reilly, B. Howes, H. Vaanholt, R. Hintzen, P. Donnelly, A. Hudson, “The VIVA project: digital watermarking for broadcast monitoring,” Proc. ICIP’99, vol. 2, pp. 202–205, Oct. 1999.
3. A. M. Alattar, E. T. Lin, and M. U. Celik, “Digital watermarking of low bit-rate advanced simple profile MPEG-4 compressed video,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 8, pp. 787–800, Aug. 2003.
4. F. Deguillaume, G. Csurka, J. O’Ruanaidh and T. Pun, “Robust 3D DFT video watermarking,” Proc. SPIE’99: Security and Watermarking of Multimedia Contents, vol. 3657, San Jose, CA, Jan. 1999.
5. J. Eggers, “Information Embedding and Digital Watermarking as Communication with Side Information,” Ph.D. thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, 2001.
6. B. Chen and G. W. Wornell, “Achievable performance of digital watermarking systems,” Proc. ICMCS’99, vol. 1, pp. 13–18, Florence, Italy, June 1999.
7. M. K. Mıhçak, R. Venkatesan and M. Kesal, “Watermarking via Optimization Algorithms for Quantizing Randomized Statistics of Image Regions,” Proceedings of the Fortieth Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, Oct. 2002.
8. B. Chen and G. W. Wornell, “Quantization index modulation: a class of provably good methods for digital watermarking and information embedding,” IEEE Trans. Inform. Theory, vol. 47, no. 4, pp. 1423–1443, May 2001.
9. H. V. Poor, An introduction to signal detection and estimation, 2nd Ed., New York, NY: Springer-Verlag, 1994.
10. D. Boneh and J. Shaw, “Collusion secure fingerprinting for digital data,” IEEE Trans. Information Theory, vol. 44, pp. 1897–1905, 1998.
11. Y. Yacobi, “Improved Boneh-Shaw Content Fingerprinting,” The Cryptographer’s Track at RSA Conference 2001, San Francisco, CA, April 2001.
12. D. Kirovski and H. S. Malvar, “Spread-spectrum watermarking of audio signals,” IEEE Trans. Signal Processing, vol. 51, no. 4, pp. 1020–1033, April 2003.
13. F. Deguillaume, S. Voloshynovskiy and T. Pun, “Method for the Estimation and Recovering from General Affine Transforms,” Proc. SPIE 2002, San Jose, CA, 2002.
14. M. K. Mıhçak, R. Venkatesan, and M. Kesal, “Cryptanalysis of Discrete-Sequence Spread Spectrum Watermarks,” Proceedings of 5th International Information Hiding Workshop (IH 2002), invited paper, Noordwijkerhout, The Netherlands, Oct. 2002.
15. C. I. Podilchuk, N. S. Jayant and N. Farvardin, “Three-dimensional subband coding of video,” IEEE Trans. Image Processing, vol. 4, no. 2, pp. 125–139, Feb. 1995.
16. T. Liu and P. Moulin, “Error exponents for one-bit watermarking,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Hong Kong, Apr. 2003.