2011 18th IEEE International Conference on Image Processing

VIRTUAL VIEW INVARIANT DOMAIN FOR 3D VIDEO BLIND WATERMARKING

Javier Franco-Contreras, Séverine Baudry, and Gwenaël Doërr
Technicolor R&D France, Security & Content Protection Labs

ABSTRACT

3D video content has been receiving increasing interest over the last few months and has created challenges regarding how to protect such high-valued items. For instance, depth-image-based rendering techniques allowing for the creation of virtual views may impair underlying watermarks embedded within individual views. In contrast with previous works, this article focuses on defining a view-independent watermark embedding domain rather than a watermarking system coherent with the disparity across several views, thus permitting blind detection. More specifically, based on the fact that pixel displacements across views most often reduce to horizontal shifts, the average luminance value along the rows of the video frames has been selected for watermark embedding. Experimental results clearly demonstrate the robustness of the watermark against virtual view generation.

Index Terms: Digital watermarking, depth-image-based rendering, invariant domain.

1. INTRODUCTION

After the huge success of 3D movies in theaters in 2010, 3D video content is gradually taking our homes by storm. The new generation of TV sets is 3D ready and offers a variety of novel functionalities, one of which being the ability to generate virtual synthesized views. For instance, when several views of a scene are available, e.g. several cameras scattered around a soccer stadium, some televisions allow the viewer to interactively define a (possibly virtual) viewpoint so as to provide a more immersive experience [1]. Similarly, even for conventional stereoscopic content (left and right views), it can be useful to generate a virtual view as a means to adjust the perceived depth according to the viewing conditions, e.g. size of the screen, distance to the screen, or interocular distance of the individual in front of the screen [2]. In a nutshell, generating a virtual view comes down to an interpolation process that involves re-aligning real views with the virtual viewpoint and 'fusing' them. This re-alignment process depends on (i) the intrinsic parameters of the real and virtual cameras (focal length, optical center, etc.), (ii) the relative positions of the cameras, and (iii) the depth of the objects appearing in the scene.

Digital watermarking consists in inserting imperceptible changes within multimedia content so as to convey additional information (a copyright notice, copy control bits, etc.) in a robust manner. Should real views be watermarked independently with any off-the-shelf watermarking algorithm for video, the re-alignment involved in virtual view generation will cause a non-uniform desynchronization that is not supported by state-of-the-art watermarking systems. Moreover, the fusion process is also likely to cause interference between the different underlying watermarks.


To address such issues, early works focused on extending the baseline spread-spectrum algorithm [3] so as to make the individual watermarks embedded in each of the real views coherent with the re-alignment process [4, 5]. In other words, the idea is to attach watermark samples to the physical 3D points of a scene and to export them wherever they project in the view, in a fashion somewhat similar to motion-coherent watermarking for video [6] or texture watermarking for 3D objects [7]. The detection procedure of these early works relies on estimating the projection matrix associated with the view received at detection. Using this matrix, it is possible to obtain the watermark signal that is expected in the received view by projecting some reference watermark signal, and to check for its presence using some correlation score. This straightforward strategy is however hampered by a number of shortcomings. For instance, notable side information is required in order to estimate the projection matrix of the received view, e.g. the original real views and the associated intrinsic and extrinsic camera parameters. Moreover, detection performances are usually heavily dependent on the accuracy of this projection matrix.

To alleviate these constraints, this paper investigates a different watermarking strategy. More precisely, we define an embedding domain that is quasi-invariant to virtual view synthesis, which allows us to reuse readily available watermarking systems with blind detection. Section 2 first reviews virtual view synthesis using depth-image-based rendering (DIBR) techniques and highlights practical simplifying assumptions which allow defining a quasi-invariant domain. Section 3 then details a generic watermarking framework in this domain that yields watermark signals nearly untouched by virtual view synthesis. Experimental results reported in Section 4 clearly demonstrate the desired robustness property of the proposed system. Eventually, Section 5 provides concluding remarks and insight for future research.

2. A VIEW QUASI-INVARIANT DOMAIN

DIBR techniques are routinely used to synthesize virtual views [8]. They rely on the availability of (i) depth information for each and every pixel of the real views and (ii) intrinsic and extrinsic camera parameters. This side information indeed permits establishing an affine relationship between the coordinates M of a point in the 3D scene and its corresponding coordinates m in the captured view:

$M = Z A^{-1} \tilde{m}$,  (1)

where Z is the depth value of the considered pixel (in metric units), m̃ the 2D coordinates of the pixel in the view in homogeneous notation, and A a 3 × 3 matrix specifying the intrinsic parameters of the camera. (For the sake of simplicity of the mathematical expressions, the world coordinate system is assumed to coincide with the camera coordinate system.)
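As an illustration, here is a minimal numpy sketch of the back-projection of Eq. (1); the intrinsic matrix values are hypothetical, chosen only for the example.

```python
import numpy as np

def back_project(x, y, Z, A):
    """Back-project a pixel to 3D camera coordinates via Eq. (1): M = Z * A^-1 * m~."""
    m_tilde = np.array([x, y, 1.0])        # pixel coordinates in homogeneous notation
    return Z * np.linalg.inv(A) @ m_tilde  # 3D point M, in metric units

# Hypothetical intrinsics: focal length f (in pixels) and optical center (cx, cy).
f, cx, cy = 1000.0, 512.0, 384.0
A = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])
M = back_project(600, 400, Z=2.5, A=A)  # pixel (600, 400) at depth 2.5 m
```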

Combining such equations for two cameras, it is possible to derive the so-called disparity equation, which defines the depth-dependent relationship between corresponding points in two views of the same 3D scene:

$Z' \tilde{m}' = Z A' R A^{-1} \tilde{m} + A' t$,  (2)

where the primed notation differentiates the two cameras, and R and t indicate the rotation and translation between the coordinate systems of the two cameras. This 3D image warping formalism can be exploited to synthesize any virtual view from a single reference view or from a collection of them. Occluded regions cannot be synthesized this way, and a filling process needs to be put in place, e.g. based on interpolation or inpainting techniques. In practice, these occluded regions occupy a marginal portion of the view.

Equation (2) is applicable in any case. Nevertheless, in many practical cases it can be further simplified. In 3D video, it is common practice to assume that (i) all cameras share the same intrinsic parameters (A = A'), (ii) all cameras share the same orientation (i.e. R is the identity), and (iii) the translation vector t between the cameras reduces to a single horizontal shift (to accommodate the interocular distance). In summary, all cameras are identical, aligned, and share the same optical axis direction. As a matter of fact, should the scene be shot differently, post-processing would be applied to the video to come back to this setup. In this case, the disparity equation reduces to:

$x' = x - \frac{f \cdot t_x}{Z} - h$,  (3)

where x is the horizontal pixel position in the reference view, x' its corresponding position in the other view, f the focal length of the camera (in pixels), t_x the inter-camera distance (in metric units), and h a correcting parameter used to adjust the convergence plane. In other words, from one view to another, pixels simply move along the rows of the frame, with an amplitude depending on the depth of the corresponding point in the 3D scene. As a result, neglecting the marginal effect of occlusions, the average luminance along rows should be stable across views, making it a well-suited embedding domain for a watermark that is expected to survive virtual view synthesis. This intuitive insight is confirmed in Figure 1, where the average value along the rows is reported for three different views of the same scene: one can clearly see the near-perfect match between the three depicted curves.

Fig. 1. Average luminance value along the rows of a single frame for several views.
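The following sketch, valid under the simplifying assumptions above, evaluates Eq. (3) for a whole depth map; the parameter values in the example are hypothetical.

```python
import numpy as np

def disparity_shift(Z, f, t_x, h):
    """Per-pixel horizontal displacement x' - x between two rectified views, Eq. (3).

    Z   : depth map in metric units, shape (H, W)
    f   : focal length in pixels
    t_x : horizontal inter-camera distance in metric units
    h   : correcting shift adjusting the convergence plane, in pixels
    """
    return -(f * t_x) / Z - h  # from x' = x - f*t_x/Z - h

# Hypothetical setup: 5 cm baseline, 1000 px focal length, convergence shift of -20 px.
Z = np.full((768, 1024), 3.0)  # flat scene 3 m away from the camera
shift = disparity_shift(Z, f=1000.0, t_x=0.05, h=-20.0)
# Pixels only move along rows, so (occlusions aside) the row averages are preserved:
# this is exactly the invariance exploited in the remainder of the paper.
```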

3. GENERIC WATERMARKING FRAMEWORK FOR MULTI-VIEW VIDEOS

Using this view quasi-invariant domain, it is straightforward to tailor a generic watermarking framework that can re-use any state-of-the-art watermarking technique, as depicted in Figure 2. The basic idea is to project any input view onto the invariant domain, to perform the embedding process in that domain, and then to export the resulting watermark vector back into the spatial domain.

As mentioned earlier, in practical cases, pixels only move horizontally from one view to another. The first step of the watermarking process is consequently to bring the input view I into the invariant domain by computing the average luminance along each row, hence producing the column vector i defined as follows:

$i[y] = \frac{1}{W} \sum_{x=1}^{W} I[x, y], \quad 1 \le y \le H$,  (4)

where W and H stand respectively for the width and height of the input view.
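A one-line numpy sketch of the projection of Eq. (4), assuming the view is given as an (H, W) luminance array:

```python
import numpy as np

def project(view):
    """Project a view onto the quasi-invariant domain, Eq. (4): i[y] = (1/W) * sum_x I[x, y]."""
    return view.mean(axis=1)  # average along each row -> column vector of length H
```

Note that numpy stores the frame as (row, column), i.e. view[y, x], whereas the paper writes I[x, y].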

This view-invariant column vector is then watermarked with any state-of-the-art blind watermarking algorithm, e.g. Spread Spectrum (SS) watermarking [3] or Quantization Index Modulation (QIM) [9]. This step may involve a number of operations besides the watermark embedding process itself. For instance, one could apply a spatio-frequency transform such as the wavelet transform as a means to gain better control over the robustness of the watermark. Our experimental observations indeed showed that the view invariance of the vector i is much higher in the low frequencies than in the high ones. We consequently decided to apply the watermark embedding algorithm to the middle frequency band of a 2-level Discrete Wavelet Transform (DWT) of the column vector i. In the end, one can retrieve the watermark signal w by computing the difference between the original vector i and its watermarked version i_w:

$w = i_w - i = \mathrm{Embed}(i, K, b) - i$,  (5)

where K is the secret key used to seed the pseudo-random generator of the watermarking algorithm and b is the embedded payload bit.
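Below is a sketch of one possible Embed(., K, b) under these choices: additive spread spectrum in the middle band of a 2-level DWT. It assumes the pywt package; the Haar wavelet and the strength alpha are hypothetical choices, as the paper does not specify them.

```python
import numpy as np
import pywt

def embed(i, K, b, alpha=2.0, wavelet='haar'):
    """Additive SS embedding in the middle band of a 2-level DWT of i (a sketch)."""
    cA2, cD2, cD1 = pywt.wavedec(i, wavelet, level=2)   # low / middle / high bands
    rng = np.random.default_rng(K)                      # secret key K seeds the PRNG
    carrier = rng.choice([-1.0, 1.0], size=cD2.shape)   # pseudo-random carrier
    cD2 = cD2 + alpha * (1.0 if b else -1.0) * carrier  # modulate payload bit b
    i_w = pywt.waverec([cA2, cD2, cD1], wavelet)
    return i_w[:len(i)]                                 # trim reconstruction padding, if any

# Watermark signal in the invariant domain, Eq. (5):
# i = project(view); w = embed(i, K=42, b=1) - i
```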

At this point, watermarking the view I comes down to finding a way of modifying the view so that it projects onto the vector i_w in the view-invariant domain. A straightforward strategy consists in mapping the watermark vector w back into the view domain, e.g. simply by duplicating it, W = [w|w|...|w], and fusing the resulting watermark signal W with the view itself. Naively, one could simply add the two signals together (I_w = I + W). In practice, however, some kind of perceptual model must be incorporated:

$I_w[x, y] = I[x, y] + \alpha_I(y)\,\beta_I(x, y)\,w[y]$,  (6)

where $\beta_I(x, y) \ge 0$ are perceptual slacks used to locally adjust the watermark [10, 11] and $\alpha_I(y) = W / \sum_{x} \beta_I(x, y)$ is a global row-dependent scaling factor. Due to the nature of the exploited invariant domain, perceptual shaping plays a critical role.
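A sketch of the fusion step of Eq. (6), assuming a precomputed perceptual slack map beta (its derivation, e.g. along the lines of [10, 11], is outside this snippet):

```python
import numpy as np

def fuse(view, w, beta):
    """Fuse the mapped watermark with the view using Eq. (6).

    view : (H, W) luminance array, indexed view[y, x]
    w    : length-H watermark vector in the invariant domain
    beta : (H, W) array of non-negative perceptual slacks beta_I(x, y)
    """
    H, W = view.shape
    alpha = W / beta.sum(axis=1)                      # alpha_I(y) = W / sum_x beta_I(x, y)
    return view + alpha[:, None] * beta * w[:, None]  # Eq. (6)
```

With this row-wise normalization, the average modification of row y is exactly w[y], so the watermarked view projects onto i_w in the invariant domain.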


Fig. 2. Generic watermarking framework for view-invariant watermarking. Watermark embedding is done in the view-invariant domain and the resulting difference watermark vector is then exported back into the frame domain for fusion. There is a priori no limitation with respect to the watermarking algorithm that can be used.

The watermark signal W is strongly dominated by horizontal frequencies, which makes it more noticeable than other conventional watermarks. This issue is further emphasized by the concentration of the watermark energy in the middle frequency band. Locally adjusting the watermark power is therefore necessary in order to combat this bias, naturally inherited from the construction of the view-invariant domain. Other factors may be incorporated, e.g. discarding pixels located in occluded areas of the view. It is also necessary to accurately manage the rounding and clipping operations that are implicit in the pixel domain, as they could significantly damage the watermark signal on each row. In order to accommodate this phenomenon, the difference induced by the round-and-clip operation R(.) is transferred to the next sample in the same row as follows:

$W[x, y] \leftarrow W[x, y] + \left( I_w[x-1, y] - R(I_w[x-1, y]) \right)$.  (7)

The embedding process is repeatedly applied to all the available views, using the same secret key K and the same payload bit b. It should be noted that the generation of the watermark signal W could be factorized across the different views as a means to save computational resources.
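As a sketch, the compensation of Eq. (7) can be implemented as an error-diffusion loop along each row, assuming 8-bit luminance; this is a direct transcription, not an optimized implementation:

```python
import numpy as np

def fuse_with_round_and_clip(view, W_sig):
    """Add the watermark signal while diffusing the round-and-clip error, Eq. (7).

    view  : (H, W) float luminance array
    W_sig : (H, W) watermark signal already shaped by the perceptual model
    """
    H, Wd = view.shape
    out = np.empty_like(view)
    for y in range(H):
        carry = 0.0                                    # error inherited from the previous pixel
        for x in range(Wd):
            target = view[y, x] + W_sig[y, x] + carry  # ideal watermarked value
            pixel = np.clip(np.round(target), 0, 255)  # R(.) : round and clip to 8 bits
            carry = target - pixel                     # transfer the residue to the next sample
            out[y, x] = pixel
    return out
```

Diffusing the residue keeps the row sums, and hence the projection onto the invariant domain, essentially intact despite quantization.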

On the receiver side, the watermark detector is fed with a view I' that can be real or synthesized, and that may be watermarked or not. The detector is in charge of assessing whether a watermark is present or not, and of extracting the payload information if needed. The advantage of using a view-invariant domain to embed and detect watermarks is that the detector no longer has to compensate for the virtual view synthesis process and can therefore operate blindly. In other words, in contrast with previous works [4, 5], no side information is needed to perform detection, i.e. the watermark can be retrieved without having access to the original views and/or the camera parameters, which is much more practical. The watermark detection procedure reduces to projecting the input view onto the invariant domain to obtain the column vector i' and applying the watermark detector paired with the embedder used previously, e.g. some correlation score computation for spread spectrum or a nearest codeword search for binning schemes.

4. EXPERIMENTAL RESULTS

For experimental validation, several multi-view videos provided by Nagoya University were used [12]. Each video is made of several views together with their associated depth maps. We limited our study to using two views and considering a synthesized virtual view in the middle of these real views, i.e. the worst possible case. The watermarking framework described in Section 3 is applied to all views using conventional additive spread spectrum paired with correlation-based detection [3]. As mentioned earlier, the embedding process is restricted to the middle frequency band of a 2-level DWT of the view-invariant vector i. Moreover, the watermark signal W is adjusted at fusion time so as to strengthen the watermark (i) in bright areas and (ii) in textured areas, making it possible to raise the watermark power up to a distortion within 40-50 dB while remaining visually imperceptible. Once both original views are watermarked, a virtual view in the middle of them is synthesized, submitted to various signal processing primitives, and then fed to the correlation-based watermark detector. This process is repeated for all the frames in the video sequences.

Experimental results are depicted in Figure 3 and limited to the 'Kendo' sequence (1024 × 768) due to space limitations. The reported correlation score is the average over time of the correlation coefficient ρ_t between the middle frequency band i'_LH of the input invariant vector and the pseudo-random secret vector w_LH at a given instant t:

$\rho_t = \frac{(i'_{LH} - \mu_{i'_{LH}}) \cdot (w_{LH} - \mu_{w_{LH}})}{|i'_{LH} - \mu_{i'_{LH}}| \, |w_{LH} - \mu_{w_{LH}}|}$,  (8)

where the operator · denotes the inner product and the notation µ_v the average of the components of vector v.
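A sketch of the blind detector matching the embedder sketched in Section 3, computing the correlation score of Eq. (8) in the middle DWT band (same hypothetical wavelet and PRNG conventions as before):

```python
import numpy as np
import pywt

def detect(view, K, wavelet='haar'):
    """Project the received view and correlate its middle band with the carrier, Eq. (8)."""
    i_prime = view.mean(axis=1)                          # Project(.) : invariant vector i'
    _, mid, _ = pywt.wavedec(i_prime, wavelet, level=2)  # i'_LH : middle frequency band
    rng = np.random.default_rng(K)
    carrier = rng.choice([-1.0, 1.0], size=mid.shape)    # w_LH : key-seeded reference
    a, c = mid - mid.mean(), carrier - carrier.mean()
    return (a @ c) / (np.linalg.norm(a) * np.linalg.norm(c))  # rho_t, Eq. (8)
```

A watermark is declared present when this score, averaged over the frames of the sequence, exceeds a detection threshold.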

Fig. 3. Watermark detection results for real and virtual views after various signal processing primitives (‘Kendo’ sequence).



The first observation is that detection performances are nearly identical across the considered views for all setups. This clearly highlights the invariance of the selected embedding domain, which results in the watermark being barely affected by the virtual view synthesis process. This is a significant improvement compared to the performances reported in previous works [4, 5]. One may notice a consistent decrease of the correlation score from the reference camera to the real camera. This is due to the fact that pixels appearing in the real view but not in the reference one were considered as occluded and not watermarked, hence mathematically reducing the correlation score.

The second observation is that correlation scores for watermarked content remain greater than the ones for unwatermarked content by an order of magnitude, even after aggressive degradation of the considered view. The proposed watermarking system indeed naturally inherits the robustness properties of the underlying spread spectrum technique against genuine signal processing primitives such as low-pass filtering with a 3 × 3 averaging filter, additive white Gaussian noise (AWGN), and lossy MPEG-4 compression at high (3 Mbps) and low (500 kbps) bit rates.

Finally, we also investigated the impact of an attack designed by a hostile adversary who focuses the energy of her AWGN attack on the middle frequency band of the invariant domain used for embedding. Even in this pessimistic scenario, the algorithm exhibits good robustness: to reach the detection level of crude lossy compression, the attacker needs to degrade the quality of the content down to a PSNR of 30 dB.
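For concreteness, a sketch of this informed attack under the same assumptions as the earlier snippets; the attacker's exact procedure is not detailed in the paper, so this is one plausible rendition:

```python
import numpy as np
import pywt

def midband_awgn_attack(view, sigma, wavelet='haar'):
    """Concentrate AWGN on the middle DWT band of the row-average domain (a sketch)."""
    i = view.mean(axis=1)
    cA2, cD2, cD1 = pywt.wavedec(i, wavelet, level=2)
    cD2 = cD2 + np.random.default_rng().normal(0.0, sigma, cD2.shape)  # noise in mid band only
    i_attacked = pywt.waverec([cA2, cD2, cD1], wavelet)[:len(i)]
    return view + (i_attacked - i)[:, None]  # spread the perturbation uniformly along rows
```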


5. CONCLUSION

The rapid development of multi-view video introduces new challenges for watermarking, as it requires the underlying watermark to survive non-uniform desynchronization as well as complex fusion processes. To address this issue, previous works considered the use of disparity-coherent watermarks, but the proposed solutions were computationally intensive and required significant side information at detection [4, 5]. In this paper, under practical simplifying assumptions, we introduced a domain invariant to virtual view synthesis, namely the average luminance values along the rows. Exploiting this domain, it is straightforward to devise a generic watermarking platform that can re-use any existing state-of-the-art watermarking system with blind detection. Reported experimental results clearly show that this alternate strategy provides near immunity against virtual view synthesis while inheriting the robustness properties of the underlying baseline watermarking algorithm. Since detection is performed blindly, this alternate strategy is a definite step forward compared to earlier proposals.

Digital watermarking for multi-view video has been under-investigated so far and several challenges remain. In particular, future research is needed to investigate the impact of the watermarking process on the quality of 3D rendering, both perceptually (degradation of 3D perception) and physiologically (onset of visual fatigue, headaches, dizziness, etc.) [13].

6. REFERENCES

[1] C. Fehn, P. Kauff, O. Schreer, and R. Schäfer, "Interactive virtual view video for immersive TV applications," in Proceedings of the International Broadcast Conference, September 2001, vol. 2, pp. 53-62.
[2] Z. Arican, S. Yea, A. Sullivan, and A. Vetro, "Intermediate view generation for perceived depth adjustment of stereo video," in Applications of Digital Image Processing XXXII, August 2009, vol. 7443 of Proceedings of SPIE, p. 74430U.
[3] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon, "Secure spread spectrum watermarking for multimedia," IEEE Transactions on Image Processing, vol. 6, pp. 1673-1687, December 1997.
[4] E. Haliçi and A. Alatan, "Watermarking for depth-image-based rendering," in Proceedings of the IEEE International Conference on Image Processing, November 2009, pp. 4217-4220.
[5] A. Koz, C. Cigla, and A. Alatan, "Watermarking of free-view video," IEEE Transactions on Image Processing, vol. 19, pp. 1785-1797, July 2010.
[6] G. Doërr and J.-L. Dugelay, "Secure background watermarking based on video mosaicing," in Security, Steganography, and Watermarking of Multimedia Contents VI, January 2004, vol. 5306 of Proceedings of SPIE, pp. 304-314.
[7] E. Garcia and J.-L. Dugelay, "Texture-based watermarking of 3D video objects," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 853-866, August 2003.
[8] C. Fehn, "A 3D-TV approach using depth-image-based rendering (DIBR)," in Proceedings of the IASTED Conference on Visualization, Imaging, and Image Processing, September 2003, pp. 482-487.
[9] B. Chen and G. W. Wornell, "Quantization index modulation: A class of provably good methods for digital watermarking and information embedding," IEEE Transactions on Information Theory, vol. 47, no. 4, pp. 1423-1443, May 2001.
[10] S. Voloshynovskiy, A. Herrigel, N. Baumgärtner, and T. Pun, "A stochastic approach to content adaptive digital image watermarking," in Information Hiding, October 1999, vol. 1768 of LNCS, pp. 211-236.
[11] J. Oostveen, T. Kalker, and M. Staring, "Adaptive quantization watermarking," in Security, Steganography, and Watermarking of Multimedia Contents VI, January 2004, vol. 5306 of Proceedings of SPIE, pp. 296-303.
[12] N. Fukushima, "MPEG-FTV test sequence download page," http://www.tanimoto.nuee.nagoya-u.ac.jp/~fukushima/mpegftv.
[13] J. Choi, D. Kim, B. Ham, S. Choi, and K. Sohn, "Visual fatigue evaluation and enhancement for 2D-plus-depth video," in Proceedings of the IEEE International Conference on Image Processing, September 2010, pp. 2981-2984.

