
3D model-based frame interpolation for Distributed Video Coding

Matthieu Maitre, Student Member, Christine Guillemot, Senior Member, and Luce Morin

Abstract—This paper addresses the problem of side information extraction in Distributed Video Coding (DVC), taking into account the geometrical constraints available when a moving camera captures a video of a static scene. A 3D model-based DVC approach is first described. The decoder recovers a 3D model from appropriately chosen key frames using motion models from the structure-from-motion paradigm. The intermediate frames are interpolated by projecting the 3D model onto 2D image planes and by applying image-based rendering techniques. Issues with the accuracy of camera parameters lead to the introduction of a quasi-DVC method relying on limited point tracking at the encoder. The approach greatly improves the side information PSNR, while only slightly increasing the encoder complexity. It also allows the encoder to adaptively select the key frames based on the video motion content, hence reducing the key frame frequency with respect to 2D block-based motion-compensated interpolation techniques.

Index Terms—Distributed Video Coding (DVC), Structure-from-Motion (SfM), Image-Based Rendering (IBR), point tracking, quasi-DVC, motion-adaptive key frames.

I. INTRODUCTION

Distributed Source Coding (DSC) has gained increased interest for a range of applications such as sensor networks, video compression and loss-resilient video transmission. DSC finds its foundation in the seminal Slepian-Wolf [1] and Wyner-Ziv [2] theorems. Most Slepian-Wolf and Wyner-Ziv coding systems are based on channel coding principles [3]–[9]. The statistical dependence between two correlated sources X and Y is modeled as a virtual correlation channel analogous to binary symmetric channels or additive white Gaussian noise (AWGN) channels. The source Y (called the side information) is thus regarded as a noisy version of X (called the main signal). Using error correcting codes, the compression of X is achieved by transmitting only parity bits. The decoder concatenates the parity bits with the side information Y and performs error correction decoding, i.e. MAP or MMSE estimation of X given the received parity bits and the side information Y.

Compression of video streams can be cast as an instance of side information coding, as shown by Aaron et al. [10]–[12] and Puri and Ramchandran [13], [14]. These schemes are also referred to as Distributed Video Coding (DVC) systems. Key frames are coded using conventional Intra mode coding. From them, the decoder generates the side information for the intermediate frames, also known as Wyner-Ziv (WZ) frames. In contrast with classical predictive coding, the encoder does not resort to motion estimation and temporal prediction. This leads to a new load-balancing paradigm with low-complexity encoders well suited for power-limited and complexity-limited devices like mobile wireless cameras. DVC also lends itself

naturally to combating inter-frame loss propagation, hence to loss-resilient transmission [13], [15], [16]. A comprehensive survey on distributed video compression can be found in [17].

One key aspect in the performance of the system is the mutual information between the side information and the information being Wyner-Ziv encoded. In current approaches, the side information is generated via motion-compensated frame interpolation, often using Block-Based Motion Compensation (BBMC) [17]. Motion fields are first computed between key frames, which may be distant from one another. An interpolated version of these motion fields is then used to generate the side information for each WZ frame. The frame interpolation based on these interpolated motion fields is not likely to lead to the highest possible PSNR, hence the highest mutual information, between the side information and the Wyner-Ziv encoded frame. To cope with these limitations, BBMC is embedded in a multiple motion hypothesis framework in [17], [18]. The actual motion vectors are chosen by testing the decoded frames against hash codes or CRCs.

Here we propose to improve side information generation by using more complex motion models. Unlike predictive coding, DVC has the advantage of not requiring the transmission of motion model parameters. Therefore, increasing the complexity of motion models, and thus their ability to accurately represent complex motions, offers potential gains in mutual information without additional bitrate overheads. There exists a large variety of motion models [19]. In this paper, we explore those belonging to the Structure-from-Motion (SfM) paradigm [20], [21], where the video stream is assumed to come from a single camera moving in a static 3D environment with Lambertian surfaces. These motion models exhibit strong geometrical properties, which allow their parameters to be robustly estimated. They are of interest to specialized applications such as augmented reality, remote-controlled robots operating in hazardous environments, and remote exploration by drones or planetary probes.

The proposed 3D model-based frame interpolation presents several improvements and adaptations of the SfM approach to the DVC context. When used in computer vision applications, SfM aims at recovering 3D models as close to the ground truth as possible or at generating realistic virtual views [22]. On the other hand, when used in DVC, the objective is to generate side information with the highest mutual information possible. This requires a reliable estimation of the camera parameters associated with the WZ frames, which are usually ignored in SfM. It also requires sub-pixel precision of the reprojected 3D model, especially in edge regions where even small misalignments can have a strong impact on PSNR.


Fig. 1. Outline of the 3D-DVC codec.

Moreover, constraints on latency in applications such as video streaming, as well as memory constraints, prevent the reconstruction of the 3D scene from all the key frames at once, as is usually done in SfM. Instead, a sequence of independent 3D models is reconstructed from pairs of consecutive key frames. Finally, we study the impact of key frame quantization on frame interpolation, which is usually not considered in SfM.

However, the experiments detailed in Section IV show that, even if this purely DVC approach gives motion fields between key frames which are much closer to the ground truth, its impact on the side information PSNR is limited. Therefore, we introduce an approach called quasi-DVC (qDVC) in which the encoder is allowed to share some limited information between frames in the form of point tracks, which are transmitted to the decoder. The tracks have no impact on the motion fields between key frames, but they allow the decoder to precisely estimate the camera parameters associated with the WZ frames. This greatly reduces misalignments between interpolated frames and WZ frames, significantly increasing the side information PSNR. Moreover, statistics on the tracks give the encoder a rough estimation of the video motion content, which is sufficient to decide when to send key frames. The problem of key frame selection has already been studied in the context of SfM [23] and predictive coding [24]. However, these methods relied on epipolar geometry estimation at the encoder, which DVC cannot afford. Our experiments show that the qDVC approach leads to major PSNR improvements while only introducing limited bitrate and complexity overheads.

The remainder of the article is organized as follows: we describe the proposed 3D model-based frame interpolation in Section II and its extension with point tracking in Section III. We detail our experimental results in Section IV. Preliminary results were presented in [25].

II. 3D MODEL-BASED FRAME INTERPOLATION

A. Codec overview

The proposed codec, called 3D-DVC, derives from the DVC codec described in [17], [26], as outlined in Figure 1. The input video is split into Groups of Pictures (GOPs) of fixed size. Each GOP begins with a key frame, which is encoded using a standard intra coder (H.264-intra in our case) and then transmitted. The remaining frames (WZ frames) are quantized and turbo-encoded. The resulting parity bits are punctured and transmitted. For the sake of simplicity, we only consider quantization in the pixel domain.

foreach pair of key frames {⁰I, ¹I} do
    Detect correspondences {⁰x, ¹x}
    Perform a robust weak-calibration
        Estimate the fundamental matrix F
    Calibrate the system and perform triangulation
        Estimate the projection matrices ⁰P and ¹P
        Estimate the depths {⁰λ}
    Apply a bundle adjustment
        Refine ⁰P, ¹P, {⁰λ} and F
    Propagate the correspondences along edges using F
    Interpolate the intermediate projection matrices {ᵗP}
    Interpolate the WZ frames {ᵗI}
end

Algorithm 1: Outline of side information generation at the 3D-DVC decoder.

At the decoder, the key frames are decompressed and the side information is generated by interpolating intermediate frames from pairs of consecutive key frames. The turbo-decoder then corrects this side information using the parity bits. The remainder of this section describes the recovery of the 3D model and its use in frame interpolation, as outlined in Algorithm 1.

B. Notations

We first introduce some notations. We shall use the typesettings a, a, A to denote respectively scalars, column vectors and matrices. In the following, ᵗa⁽ⁱ⁾ⱼ denotes the j-th scalar entry of the i-th vector of a set at time t. Likewise for matrices, Aᵢⱼ denotes the scalar entry at the i-th row and j-th column, while Aᵢ: represents its i-th row vector. Moreover, Aᵀ denotes the transpose of matrix A, Aˢ the column vector obtained by stacking the Aᵀᵢ: together, and [.]× the cross-product operator. The identity matrix shall be denoted by I, and the 1-norm and 2-norm by ‖.‖₁ and ‖.‖₂ respectively. We shall use homogeneous vectors, where x ≜ (x, y, 1)ᵀ and X ≜ (x, y, z, 1)ᵀ represent respectively a 2D and a 3D point. These entities are defined up to scale, i.e. (x, y, 1)ᵀ is equivalent to (λx, λy, λ)ᵀ for any non-null scalar λ.

C. Correspondence detection

The first step of the 3D estimation consists in establishing point correspondences between two key frames, first by isolating feature points {x} on each key frame independently, and then by matching them across key frames. Without loss of generality, the two key frames are assumed to have been taken at times t = 0 and t = 1 and are respectively denoted by ⁰I and ¹I.

We use the Harris-Stephens corner detector [27] to find feature points. Its sensitivity is adapted locally to spread feature points over the whole frame [20], which improves the weak-calibration detailed in the next subsection. All pairs of feature points {⁰x, ¹x} across key frames are considered as candidate correspondences. A first series of two tests eliminates blatantly erroneous correspondences:
1) Correspondences with large motions are discarded. The threshold is set to 30 pixels for CIF videos in the experiments.


2) Correspondences with dissimilar intensity distributions in the neighborhoods around feature points are discarded. Distributions are approximated using Parzen windows [28, §4.3] and sampled uniformly to obtain histograms. The similarity of histograms is tested using the χ²-test [29, §14.3].

The locations of the remaining correspondences are then refined locally by searching for the least Mean Square Error (MSE) between neighborhoods around feature points. The minimization is solved using the Levenberg-Marquardt algorithm [29, §15.5]. This refinement compensates for errors from the Harris-Stephens detector, leads to MSE values independent of image sampling [30], and computes feature point locations with sub-pixel accuracy. A second series of two tests further eliminates erroneous correspondences:
1) Correspondences with large MSE are discarded.
2) Each feature point is only allowed to belong to at most one correspondence. This unicity constraint is enforced using the Hungarian algorithm [31, §I.5], which keeps only the correspondences with the least MSE when the unicity constraint is violated.
At this point, we have a set of corresponding points between key frames ⁰I and ¹I with sub-pixel accuracy.

D. Robust weak-calibration

The assumption of a static scene is now introduced to estimate the fundamental matrix F from the set of correspondences and to remove those correspondences which do not abide by the epipolar constraint. This constraint is given by the equation

$$ {}^{1}\mathbf{x}^{\top}\,\mathbf{F}\,{}^{0}\mathbf{x} = 0. \qquad (1) $$
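As an illustration of how this constraint can be used to classify tentative correspondences, the following minimal numpy sketch computes the Sampson distance, a standard first-order approximation of the geometric error to the epipolar constraint; the fundamental matrix F and the correspondence arrays are assumed to be given, and this is only an illustration of Equation (1), not the codec implementation.

```python
import numpy as np

def sampson_distance(F, x0, x1):
    """First-order geometric (Sampson) distance to the epipolar constraint.

    F  : (3, 3) fundamental matrix.
    x0 : (N, 2) feature points in key frame 0, in pixels.
    x1 : (N, 2) feature points in key frame 1, in pixels.
    Returns an (N,) array; small values mean the pair nearly satisfies
    x1^T F x0 = 0 (Equation 1).
    """
    X0 = np.hstack([x0, np.ones((len(x0), 1))])   # homogeneous coordinates
    X1 = np.hstack([x1, np.ones((len(x1), 1))])
    Fx0 = X0 @ F.T                                # row i: F @ x0_i
    Ftx1 = X1 @ F                                 # row i: F^T @ x1_i
    algebraic = np.sum(X1 * Fx0, axis=1)          # x1^T F x0
    denom = Fx0[:, 0]**2 + Fx0[:, 1]**2 + Ftx1[:, 0]**2 + Ftx1[:, 1]**2
    return algebraic**2 / denom

# Hypothetical usage inside a robust estimation loop: keep the pairs whose
# Sampson distance falls below a threshold, e.g.
# inliers = sampson_distance(F, x0, x1) < 1.0
```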

The proposed weak-calibration method consists of three steps:
1) an initial estimation of F and of the set of inliers using MAPSAC [32],
2) a first refinement of F and of the set of inliers using LO-RANSAC [33],
3) a second refinement of F over the final set of inliers by a non-linear minimization of the Sampson distance [20].

E. Quasi-euclidean self-calibration and triangulation

The next step is to recover the 3D geometry, that is, the projection matrices {⁰P, ¹P} and the depths {λ} of the 3D points. We can choose the World Coordinate System (WCS) of the 3D scene to be the camera coordinate system at time t = 0, leading to ⁰P = [I 0]. This leaves four degrees of freedom in the WCS. They appear in the relation between F and the projection matrix of the second key frame ¹P ≜ [¹R ¹t], given by

$$ {}^{1}\mathbf{t} \in \ker\!\left(\mathbf{F}^{\top}\right) \quad\text{and}\quad {}^{1}\mathbf{R} = \left[{}^{1}\mathbf{t}\right]_{\times}\mathbf{F} - {}^{1}\mathbf{t}\,\mathbf{a}^{\top}, \qquad (2) $$

where a is an arbitrary 3-vector and ¹t has an arbitrary norm. For the time being, these degrees of freedom are fixed by choosing ¹t with unit norm and setting a = ⁰t, where the epipoles ⁰t and ¹t are recovered from the Singular Value

Decomposition (SVD) of the matrices F and Fᵀ respectively [20]. Since projection matrices are defined up to scale, in the remainder of the paper they are normalized so that their Frobenius norm is √3.

The geometry of the 3D scene can then be recovered. Let a 3D point X project onto a 2D point x on the camera image plane. These two points are related by λx = PX, where λ is the projective depth. Therefore, the correspondences allow the recovery of a cloud of 3D points by triangulation, solving for each correspondence the system of equations

$$ {}^{1}\lambda\,{}^{1}\mathbf{x} = {}^{0}\lambda\,{}^{1}\mathbf{R}\,{}^{0}\mathbf{x} + {}^{1}\mathbf{t}. \qquad (3) $$
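For one correspondence, Equation (3) is a small linear system in the two unknown depths; a minimal numpy sketch is given below, assuming ¹R and ¹t have already been recovered from F. The function name is hypothetical and the sketch does not include the outlier checks described next.

```python
import numpy as np

def triangulate_depths(R1, t1, x0, x1):
    """Recover the projective depths (lambda0, lambda1) of one correspondence
    by solving Equation (3) in the least-squares sense.

    R1, t1 : 3x3 matrix and 3-vector of the second camera, the first camera
             being fixed to P0 = [I 0].
    x0, x1 : homogeneous 2D points (x, y, 1) in key frames 0 and 1.
    """
    # Equation (3): lambda1 * x1 - lambda0 * (R1 @ x0) = t1,
    # i.e. a 3x2 linear system in (lambda0, lambda1).
    A = np.column_stack([-(R1 @ x0), x1])            # shape (3, 2)
    depths, *_ = np.linalg.lstsq(A, t1, rcond=None)
    lambda0, lambda1 = depths
    return lambda0, lambda1

# The 3D point in the frame-0 camera coordinates is then lambda0 * x0,
# up to the projective ambiguity fixed later by self-calibration and
# bundle adjustment.
```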

Remaining erroneous correspondences are removed by imposing the following constraints:
1) the projections of the 3D points must be close to the actual 2D points,
2) all products of projective depths ⁰λ¹λ must have the same sign [34, Th. 17],
3) correspondences must be far from the epipoles, since triangulation is ill-conditioned around them.

The initial choice of WCS is refined by quasi-euclidean self-calibration [35], so that the WCS is as euclidean as possible. It is important to recover a euclidean WCS because the interpolation of intermediate camera matrices, detailed later on, relies on the notion of distance, which does not exist in an arbitrary projective space. Assuming that the camera parameters cannot undergo large variations between key frames, we look for a matrix ¹R as close as possible to the identity matrix and compatible with the fundamental matrix F. We also constrain the depths in the new WCS to be bounded by 1 and M, to reduce numerical issues during the bundle adjustment detailed in the next subsection. The optimal vector a is then found by minimizing ‖¹R − ¹t aᵀ − I‖₁ under the linear constraints

$$ \max\!\left({}^{0}\lambda, {}^{1}\lambda\right)/M \;\le\; {}^{0}\lambda\,{}^{0}\mathbf{x}^{\top}\mathbf{a} - 1 \;\le\; \min\!\left({}^{0}\lambda, {}^{1}\lambda\right). \qquad (4) $$

A lower bound on the value of M is given by max(⁰λ, ¹λ)/min(⁰λ, ¹λ). The self-calibration starts with this value and increases it until the linear programming problem admits a solution. Finally, points with aberrant depths are removed by testing the difference between their depths and the weighted median of the depths of their neighbors on the Delaunay triangulation [36, Chap. 9]. Neighbors are assigned weights inversely proportional to their distances.

F. Bundle adjustment

The 3D geometry obtained so far suffers from the bias inherent to linear estimation [37]. It needs to be refined by minimizing its euclidean reprojection error on the key frames. First, the basis of the projective space has to be fixed to prevent it from drifting. This is done by fixing two 3D points and performing the optimization over a reduced parameter space.


As shown in Appendix I, the 12-dimensional projection matrix ¹P can be expressed as a linear combination of an 8-dimensional unit vector r, i.e. ¹Pˢ = √3 W r, where W is an orthonormal matrix. The two fixed 3D points are chosen randomly under the constraints that they have small reprojection errors and that they are far from the epipoles and from each other. The minimization of the euclidean reprojection error is defined as

$$ \min_{\{{}^{0}\lambda^{(i)}\},\,{}^{1}\mathbf{P}^{s}} \; \sum_{i}\left( {}^{1}x_{1}^{(i)} - \frac{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{1}\mathbf{P}^{s}_{1:4}}{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{1}\mathbf{P}^{s}_{9:12}} \right)^{2} + \sum_{i}\left( {}^{1}x_{2}^{(i)} - \frac{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{1}\mathbf{P}^{s}_{5:8}}{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{1}\mathbf{P}^{s}_{9:12}} \right)^{2} \qquad (5) $$

$$ \text{such that}\quad {}^{1}\mathbf{P}^{s} = \sqrt{3}\,\mathbf{W}\mathbf{r}, \quad \|\mathbf{r}\|_{2} = 1. $$

It is solved using an alternated reweighted linear least-squares approach [38], as detailed in Appendix II. This bundle adjustment step provides us with refined estimates of ⁰P, ¹P, {⁰λ} and F.

G. Correspondence propagation

The previous sections led to an accurate but sparse set of correspondences. This set is densified by propagating correspondences along edges, under the epipolar constraint. The goal of this procedure is to get accurate motion information in edge regions, where even a slight misalignment can lead to a large MSE, degrading the side information PSNR. As shown in Figure 2, the intersections of edges with epipolar lines define points, except where epipolar lines and edge tangents are parallel. Therefore, correspondences can be obtained not only at feature points, but also along edges.

Edges are found in the first key frame using the Canny edge detector [39]. Correspondences are propagated along edges, starting from the correspondences between feature points. At each iteration, edge-points around previously known correspondences are selected and their motions are initialized to those of their nearest neighbors. Their motions are then improved by a full search over small windows along the associated epipolar lines, minimizing the MSE between intensity neighborhoods. Their motions are finally refined by Golden search [29, §10.1] to obtain sub-pixel accuracy. The robustness of this procedure is increased by removing edge-points too close to the epipoles, as well as those whose edge tangents are close to epipolar lines or which have large MSE.

H. Projection-matrix interpolation

Frame interpolation relies on the projection of the 3D scene onto the camera image planes. This requires the knowledge of the projection matrices associated with the WZ frames. Since the decoder only has access to the key frames, it needs to interpolate these matrices from those of the key frames. Let ᵗP be a projection matrix associated with a WZ frame. It can be decomposed as ᵗP = ᵗK ᵗR [I −ᵗC], where ᵗK is an upper triangular matrix containing the intrinsic parameters (focal length, pixel aspect ratio, etc.), ᵗR is a rotation matrix and ᵗC is the optical center. Both the matrix ᵗK and the vector ᵗC are interpolated using linear interpolation.

Fig. 2. Correspondence propagation. An edge-point ⁰x⁽¹⁾ in key frame 0 defines an epipolar line ¹l⁽¹⁾ in key frame 1. The previously matched correspondence {⁰x⁽⁰⁾, ¹x⁽⁰⁾} gives an estimation of the edge-point motion, whose value is refined by a local MSE minimization along the epipolar line.

The matrix ᵗR is interpolated using Spherical Linear intERPolation (SLERP) [40] in the space of unit quaternions, so that it remains a rotation matrix. Due to the special form of ⁰P, SLERP has a particularly simple expression. Let ¹θ and ¹u be the angle and axis of the quaternion ¹q, such that ¹q = cos ¹θ + ¹u sin ¹θ. The interpolated projection matrix ᵗP is given by

$$ \begin{cases} {}^{t}\mathbf{K} = t\,{}^{1}\mathbf{K} + (1 - t)\,\mathbf{I},\\[2pt] {}^{t}\mathbf{C} = t\,{}^{1}\mathbf{C},\\[2pt] {}^{t}\mathbf{q} = \cos\!\left(t\,{}^{1}\theta\right) + {}^{1}\mathbf{u}\,\sin\!\left(t\,{}^{1}\theta\right). \end{cases} \qquad (6) $$
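A minimal sketch of this interpolation is given below, assuming numpy and scipy are available and that the key-frame decomposition (¹K, ¹R, ¹C) has already been computed. Since ⁰P = [I 0], SLERP from the identity reduces to scaling the rotation vector by t, which is what the sketch exploits; it is an illustration of Equation (6), not the codec implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def interpolate_projection(K1, R1, C1, t):
    """Interpolate the WZ-frame camera at time t in (0, 1), following
    Equation (6), with the first camera fixed to K0 = I, R0 = I, C0 = 0.

    K1, R1, C1 : intrinsics, rotation and optical center of key frame 1.
    Returns the 3x4 projection matrix tP = tK tR [I | -tC].
    """
    K_t = t * K1 + (1.0 - t) * np.eye(3)      # linear interpolation of the intrinsics
    C_t = t * C1                              # linear interpolation of the optical center
    # SLERP from the identity: scale the rotation vector (angle * axis) by t.
    rotvec = Rotation.from_matrix(R1).as_rotvec()
    R_t = Rotation.from_rotvec(t * rotvec).as_matrix()
    return K_t @ R_t @ np.hstack([np.eye(3), -C_t.reshape(3, 1)])
```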

I. Frame interpolation

In order to proceed with frame interpolation, we need to estimate a dense motion field between the frame being interpolated and each of the two key frames. We consider two motion models to obtain dense motion fields from the cloud of 3D points and the projection matrices: one block-based and one mesh-based.

1) Epipolar block interpolation: Each WZ frame is divided into blocks whose unknown texture is to be estimated. Pairs of corresponding blocks are searched for over the key frames, using the epipolar constraint and trifocal transfer [41] to limit the search space. As shown in Figure 3, given a block located at ᵗx in the WZ frame, its corresponding blocks in the key frames lie along the epipolar lines ⁰l and ¹l. For a given candidate location in the reference key frame, say ⁰x in ⁰I, the location of the corresponding block ¹x in the other key frame ¹I is uniquely defined via trifocal transfer: triangulation of the points ⁰x and ᵗx gives the 3D point X, which is then projected onto ¹I to give ¹x (a sketch of this transfer is given below). The key frame whose optical center is the furthest away from the optical center of the WZ frame is chosen as the reference key frame, so that the equations of trifocal transfer are best conditioned.

The previously computed set of 3D points is also used to reduce the search space. Each 3D point is projected onto the three frames ⁰I, ᵗI and ¹I, providing point correspondences, i.e. points with known motions between the WZ frame and the key frames. Given a block in the WZ frame, a full search is performed inside a window along its associated epipolar line, centered using the motion of the nearest point correspondence. The quality of a pair of blocks is assessed using the MSE. The block location is refined using Golden search [29, §10.1] to obtain sub-pixel accuracy.
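The transfer step can be sketched as follows; this is a hypothetical numpy illustration that uses a standard linear (DLT) triangulation rather than the exact triangulation of the codec, and assumes the three projection matrices are available.

```python
import numpy as np

def transfer_point(P0, Pt, P1, x0, xt):
    """Trifocal-style transfer used in epipolar block interpolation:
    triangulate the candidate pair (x0 in key frame 0, xt in the WZ frame),
    then project the resulting 3D point into key frame 1.

    P0, Pt, P1 : 3x4 projection matrices of key frame 0, WZ frame t and key frame 1.
    x0, xt     : pixel coordinates (x, y) in key frame 0 and in the WZ frame.
    Returns the transferred pixel location x1 in key frame 1.
    """
    # Standard linear (DLT) triangulation of the two rays.
    A = np.vstack([
        x0[0] * P0[2] - P0[0],
        x0[1] * P0[2] - P0[1],
        xt[0] * Pt[2] - Pt[0],
        xt[1] * Pt[2] - Pt[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                    # homogeneous 3D point (4-vector)
    x1 = P1 @ X
    return x1[:2] / x1[2]         # back to pixel coordinates
```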



Fig. 3. Trifocal transfer for epipolar block interpolation.

Since trifocal transfer is singular around the epipoles, blocks too close to these epipoles are assigned the motions of their nearest correspondences. Once a pair of corresponding blocks is selected, their textures are linearly blended based on time to obtain the texture of the unknown block in the interpolated frame.

2) 3D-mesh interpolation: Each WZ frame is divided into blocks whose unknown texture is to be estimated. Blocks are subdivided into pairs of triangles, creating a triangular mesh. Mesh vertices are associated with depth values, giving an elevation grid. It represents a piecewise-planar mesh which is fitted onto the cloud of 3D points. The mesh is defined in the coordinate system of the WZ frame, which requires us to change the coordinate system of the 3D quantities computed earlier, that is

$$ \begin{cases} \mathbf{X} \leftarrow {}^{t}\mathbf{R}\,\mathbf{X} + {}^{t}\mathbf{t},\\[2pt] {}^{0}\mathbf{P} \leftarrow \left[\,{}^{t}\mathbf{R}^{-1} \;\; -{}^{t}\mathbf{R}^{-1}\,{}^{t}\mathbf{t}\,\right],\\[2pt] {}^{1}\mathbf{P} \leftarrow \left[\,{}^{1}\mathbf{R}\,{}^{t}\mathbf{R}^{-1} \;\; {}^{1}\mathbf{t} - {}^{1}\mathbf{R}\,{}^{t}\mathbf{R}^{-1}\,{}^{t}\mathbf{t}\,\right],\\[2pt] {}^{t}\mathbf{P} \leftarrow \left[\,\mathbf{I} \;\; \mathbf{0}\,\right]. \end{cases} \qquad (7) $$
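As a small illustration, the change of coordinate system of Equation (7) could be applied as follows; this numpy sketch assumes the WZ-frame and key-frame poses have been decomposed into rotation and translation parts, and is not the codec's implementation.

```python
import numpy as np

def to_wz_coordinates(X, R_t, t_t, R_1, t_1):
    """Apply Equation (7): re-express the 3D points and the projection
    matrices in the WZ-frame coordinate system.

    X        : (N, 3) 3D points in the frame-0 coordinate system.
    R_t, t_t : rotation and translation of the WZ frame.
    R_1, t_1 : rotation and translation of key frame 1.
    """
    R_t_inv = np.linalg.inv(R_t)
    X_new = X @ R_t.T + t_t                                   # X <- tR X + tt (row-wise)
    P0 = np.hstack([R_t_inv, (-R_t_inv @ t_t).reshape(3, 1)])
    P1 = np.hstack([R_1 @ R_t_inv, (t_1 - R_1 @ R_t_inv @ t_t).reshape(3, 1)])
    Pt = np.hstack([np.eye(3), np.zeros((3, 1))])
    return X_new, P0, P1, Pt
```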

Fitting the mesh onto the cloud of 3D points is cast as a minimization problem, using a Tikhonov regularization approach [42]. Let µ be a scalar controlling the smoothness of the mesh. Let λ and λ̃ be two vectors representing the depths of respectively the 3D points {X} and the mesh vertices. The mesh provides a linear approximation of λ, which can be written as Mλ̃. The mesh also has an internal smoothness, which favors small differences between the depth of a vertex and the average depth of its four neighbors. Since the averaging is a linear operation, it can be written as Nλ̃. The minimization problem is then

$$ \min_{\tilde{\boldsymbol{\lambda}}} \; \left\| \boldsymbol{\lambda} - \mathbf{M}\tilde{\boldsymbol{\lambda}} \right\|_{2}^{2} + \mu^{2}\left\| \left(\mathbf{I} - \mathbf{N}\right)\tilde{\boldsymbol{\lambda}} \right\|_{2}^{2}. \qquad (8) $$

This is a Linear Least-Square (LLS) problem, which can be readily solved. The smoothness term guarantees that its solution is always well conditioned. Since LLS estimation is not robust to erroneous depths, it is embedded into an iterative process which detects and removes them. At each iteration, the 3D mesh is fitted onto the cloud of 3D points. Valid triangles must abide by these three criteria:
1) they must have the same orientation in all frames,
2) the depth error between each 3D point and its closest triangle must be small,
3) depth variations between triangle vertices must be limited.

Points inside triangles failing these tests are removed, and the process is reiterated until all triangles pass. Once the mesh has been fitted, it is projected onto the key frames to obtain the two dense motion fields. The texture of the frame being interpolated is then obtained by warping the key frames using 2D texture mapping [43] and linearly blending them based on time.

3) Motion model comparison: Epipolar block interpolation handles depth discontinuities well but only provides a block-wise fronto-parallel approximation of surfaces. It allows us to analyze the improvements brought by the 3D geometry over classical 2D BBMC in Section IV. The 3D-mesh interpolation enforces a stronger smoothness over the motion field and is able to further remove erroneous correspondences. However, it has the drawbacks of over-smoothing depth discontinuities and not modeling occlusions.
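To make the mesh-fitting step concrete, the regularized least-squares problem of Equation (8) can be solved in a few lines. The sketch below assumes the interpolation matrix M and the neighbor-averaging matrix N have already been assembled from the mesh connectivity; it illustrates one LLS solve only, not the codec's iterative outlier-removal loop.

```python
import numpy as np

def fit_mesh_depths(lambdas, M, N, mu):
    """Solve Equation (8): min_d ||lambdas - M d||^2 + mu^2 ||(I - N) d||^2.

    lambdas : (P,) depths of the 3D points.
    M       : (P, V) matrix interpolating the vertex depths at the point locations.
    N       : (V, V) matrix averaging the four neighbors of each vertex.
    mu      : scalar smoothness weight.
    Returns the (V,) vector of mesh-vertex depths.
    """
    V = M.shape[1]
    # Stack the data term and the smoothness term into a single LLS system.
    A = np.vstack([M, mu * (np.eye(V) - N)])
    b = np.concatenate([lambdas, np.zeros(V)])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d
```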

III. 3D MODEL-BASED FRAME INTERPOLATION WITH POINT TRACKING

A. Rationale

The experiments detailed in Section IV show that the proposed approach gives motion fields between key frames much closer to the ground truth than those of classical 2D block matching. However, this improvement barely increases the PSNR of the side information. The bottleneck lies in the interpolation of the projection matrices, which yields inaccurate estimates. Since the motion fields are obtained by projecting 3D points or a 3D mesh onto image planes, inaccurate projection matrices lead to misalignments between the interpolated frames and the actual WZ frames. These misalignments then create large errors in regions with textures or edges, which penalize the PSNR.

Instead of interpolating the projection matrices, it would be better to estimate them from the frames. We propose to achieve this goal by switching from a pure DVC approach to a quasi-DVC one, called 3D-qDVC, where the encoder shares some limited information between consecutive frames in the form of point tracks. Point tracks are a generalization of correspondences to a larger number of frames. Computing and transmitting these point tracks introduces overheads on the encoder complexity and the bandwidth. However, these overheads are minor because only a small number of tracks is required to estimate the eleven parameters of each intermediate projection matrix. The estimation of the 3D scene and the motion fields remains identical, and the workload remains borne by the decoder, in particular correspondence detection, robust epipolar estimation, bundle adjustment, correspondence propagation, frame interpolation and Wyner-Ziv decoding. Moreover, statistics on the tracks allow the encoder to select key frames based on the video motion content, thus increasing bandwidth savings.


B. Decoder

The only modification brought to the 3D-DVC decoder lies in the estimation of the projection matrices, instead of their interpolation (Section II-H). This estimation follows from a generalization of the bundle-adjustment equation (Equation 5) to three or more frames:

$$ \min_{\{{}^{0}\lambda^{(i)}\},\,\{{}^{t}\mathbf{P}^{s}\}} \; \sum_{t,i}\left( {}^{t}x_{1}^{(i)} - \frac{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{t}\mathbf{P}^{s}_{1:4}}{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{t}\mathbf{P}^{s}_{9:12}} \right)^{2} + \sum_{t,i}\left( {}^{t}x_{2}^{(i)} - \frac{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{t}\mathbf{P}^{s}_{5:8}}{\left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{t}\mathbf{P}^{s}_{9:12}} \right)^{2} \qquad (9) $$

$$ \text{such that}\quad {}^{1}\mathbf{P}^{s} = \sqrt{3}\,\mathbf{W}\mathbf{r}, \quad \|\mathbf{r}\|_{2} = 1, \quad \left\|{}^{t}\mathbf{P}^{s}\right\|_{2}^{2} = 3 \;\;\text{for}\;\; 0 < t < 1. $$

Note that since the pair of projection matrices ⁰P and ¹P defines a unique basis for the projective space, no further constraints are required on the matrices {ᵗP}. As in 3D-DVC, the projection matrix ¹P is estimated from the correspondences between key frames, ignoring tracks at intermediate time instants. The projective depths of the point tracks are then recovered by triangulating their 2D locations at times t = 0 and t = 1. Since the projection matrices {ᵗP} are independent of one another in Equation 9, they are solutions of simple reweighted linear least-squares problems. These problems are solved using the method described in Section II-F and Appendix II.
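To give a flavor of this per-frame estimation, the sketch below recovers a projection matrix from the 2D track locations in a WZ frame and their triangulated 3D points with a standard linear DLT resection, which requires at least six tracks. This is a simplified stand-in for illustration; the codec instead uses the reweighted linear least-squares formulation of Appendix II.

```python
import numpy as np

def estimate_projection_dlt(X, x):
    """Estimate a 3x4 projection matrix from 2D-3D correspondences.

    X : (N, 4) homogeneous 3D points (N >= 6).
    x : (N, 2) pixel locations of the tracks in the WZ frame.
    """
    rows = []
    for Xi, xi in zip(X, x):
        zeros = np.zeros(4)
        rows.append(np.concatenate([Xi, zeros, -xi[0] * Xi]))
        rows.append(np.concatenate([zeros, Xi, -xi[1] * Xi]))
    A = np.asarray(rows)                        # (2N, 12) design matrix
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)                    # null vector, rows stacked as in P^s
    return np.sqrt(3) * P / np.linalg.norm(P)   # normalize to Frobenius norm sqrt(3)
```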

C. Encoder

The 3D-qDVC encoder includes the components of its 3D-DVC counterpart, with an additional Harris-Stephens feature-point detector [27], a point tracker and a point-track encoder. The encoder detects feature points on the current key frame and tracks them in the following frames until one of the following two stopping criteria is met: 1) the length of the longest track becomes large enough, or 2) the number of lost tracks becomes too large. The former criterion enforces that key frames sufficiently differ from one another, while the latter ensures that the estimation of the intermediate projection matrices is always a well-posed problem. Once a stopping criterion is met, a new key frame is transmitted and the process is reiterated.

Tracking relies on the minimization of the Sum of Absolute Differences (SAD) between small blocks around point tracks. The minimization is biased toward small motions to avoid the uncertainty due to large search regions, and it only considers integer pixel locations. It begins with a spiral search around the location with null motion. Once a small SAD is detected, it continues by following the path of least SAD, until a local minimum is found. Tracks for which no small SAD can be found are discarded. Point tracks are encoded using Differential Pulse Code Modulation (DPCM) and fixed-length codes.
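A minimal sketch of such an integer-pixel SAD tracker is given below; the block size, search radius and SAD threshold are illustrative values and not those of the paper, and the spiral search is approximated ring by ring.

```python
import numpy as np

def track_point(prev, curr, p, block=8, radius=7, sad_thresh=800):
    """Greedy integer-pixel SAD tracker: search around the null-motion
    location, then follow the path of least SAD to a local minimum.

    prev, curr : 2D numpy arrays (grayscale frames).
    p          : (row, col) integer location of the point in `prev`.
    Returns the tracked (row, col) in `curr`, or None if the track is lost.
    """
    h = block // 2
    a = prev[p[0]-h:p[0]+h, p[1]-h:p[1]+h].astype(np.int32)
    if a.shape != (block, block):
        return None                               # too close to the border to track

    def sad(q):
        r, c = q
        b = curr[r-h:r+h, c-h:c+h]
        if b.shape != (block, block):
            return np.inf                         # candidate falls outside the frame
        return int(np.abs(a - b.astype(np.int32)).sum())

    # Ring-by-ring search around the null-motion location.
    start = None
    for d in range(radius + 1):
        ring = [(p[0]+dr, p[1]+dc) for dr in range(-d, d+1) for dc in range(-d, d+1)
                if max(abs(dr), abs(dc)) == d]
        best = min(ring, key=sad)
        if sad(best) < sad_thresh:
            start = best
            break
    if start is None:
        return None                               # no small SAD found: track is lost

    # Follow the path of least SAD until a local minimum is reached.
    q = start
    while True:
        neighbors = [(q[0]+dr, q[1]+dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
        best = min(neighbors, key=sad)
        if sad(best) >= sad(q):
            return q
        q = best
```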

IV. EXPERIMENTAL RESULTS

The DVC architecture has been derived from the Discover codec [26] in the pixel domain, a 2D-DVC codec which relies on Block-Based Motion Compensated (BBMC) interpolation to generate the side information. The results on 3D-DVC and 3D-qDVC presented hereafter have been obtained by replacing this 2D interpolation with our 3D interpolations.

We present experimental results on two sequences: street and stairway. Both are 50 frames long at CIF resolution and 30 fps. These sequences contain drastically different camera motions and scene contents. In the former, the camera has a smooth motion, mostly forward. In the latter, the camera has a rougher lateral motion, creating pixel motions of up to 7 pixels between consecutive frames.

A. Frame interpolation without tracking (3D-DVC)

Figures 4(a) and 4(b) show correspondences between the first two key frames of each sequence, after robust weak-calibration and after correspondence propagation respectively. In both cases, the epipolar geometry was correctly recovered and the correspondences are virtually outlier-free. Moreover, propagation greatly increases the correspondence density, from 397 correspondences to 7732 for the street sequence and from 325 correspondences to 6421 for the stairway sequence. The two figures also underline some intrinsic limitations of the SfM approach. First, the street sequence has epipoles inside the images, as can be seen from the converging epipolar lines. Since triangulation is singular around the epipoles, there are no correspondences in their neighborhoods. Second, the stairway sequence contains strong horizontal edges whose tangents are nearly parallel to the epipolar lines. This explains why so few correspondences were found in this region, while the wall is covered by correspondences.

Since DVC estimates motion fields between quantized key frames, it is important that it be robust to quantization noise. Figure 5 confirms that the 3D estimation behaves well even with coarsely quantized key frames. Comparing it with Figure 4, where lossless key frames are used, we see that both detection and propagation degrade gracefully, the major difference lying in the density of correspondences.

Compared to classical 2D block matching, the proposed motion estimation obtains motion fields much closer to the ground truth. Figure 6 displays the norm of the motion fields between the first two key frames of each sequence. The epipolar motion fields, albeit not perfect, are far superior to their 2D counterparts. In particular, they do not exhibit such a large number of outliers. Since epipolar block matching and classical block matching share the same motion model, any difference between them is solely due to the static scene assumption. However, Figure 7 shows that this barely impacts the PSNR of the side information. The cause appears clearly in Figure 8: the motion fields between WZ frames and key frames create misalignments between the side information and the WZ frames.


Fig. 4. Correspondences between the two first lossless key frames of each sequence: after epipolar estimation (a), and after propagation (b). Legend: feature points in the first key frame (red dots), epipolar lines (green lines) and motion vectors (magenta lines).

Fig. 5. Correspondences between the two first key frames of each sequence quantized at QP 42: after epipolar estimation (a), and after propagation (b). Legend: see Figure 4.

B. Frame interpolation with tracking (3D-qDVC)

Figure 8 shows that misalignments between the side information and WZ frames are greatly reduced by estimating the projection matrices, instead of interpolating them. Figure 7 indicates that 3D-qDVC consistently outperforms both 3D-DVC and 2D-DVC, bringing at times improvements of more than 10 dB. Unlike the other methods, 3D-qDVC is able to maintain nearly constant PSNR values inside each GOP. The mesh-based approach provides PSNR gains over the epipolar block-based method in both sequences: 1.0 dB in the street sequence and 0.18 dB in the stairway sequence.

Figure 9 displays the tracks obtained at the encoder between the first two key frames of each sequence. Tracking has two drawbacks: it introduces a bit-rate overhead and increases the encoder complexity. The bit-rate overhead represents around 0.01 b/pixel (1178 b/frame for the street sequence and 1390 b/frame for the stairway sequence). Compared to 3D video coders like [44], the complexity overhead at the encoder is negligible since all the 3D estimation is still performed at the decoder. Compared to classical 2D BBMC coders, the overhead is also very limited due to the small number of tracks. Assuming 8 by 8 blocks for 2D-BBMC, a CIF frame has (352/8) × (288/8) = 1584 blocks. On the other hand, the average number of tracks for both the street and stairway sequences is 135. Therefore, in these experiments the complexity of 3D-qDVC is only 135/1584 ≈ 8.5% of the 2D-BBMC one.

Figure 10 shows the robustness of the frame interpolation to quantization noise. The PSNR of the interpolated frames actually decreases more slowly than that of the key frames. There is, however, one exception around frame 37 of the stairway sequence, where camera jitter leads to large variations of PSNR.

Finally, Figure 11 compares the rate-distortion performances of mesh-based 3D-qDVC with three other codecs: H.264 intra, H.264 inter IPPP and 2D-DVC Discover [26] I-WZ-I. The key

frame rate was chosen to obtain the best performance for each codec. The 3D-qDVC codec outperforms both H.264 intra and 2D-DVC, and even gets close to H.264 inter at low bit-rates. Note that both DVC codecs operate in the pixel domain. Improved rate-distortion performances would be expected in transform domains.

V. CONCLUSION

In this paper we have shown that Distributed Video Coding (DVC) benefits from Structure-from-Motion techniques. We have developed a robust feature-point matching algorithm leading to semi-dense correspondences between pairs of consecutive key frames. We have proposed two interpolation schemes to generate the side information, based either on block matching along epipolar lines or on 3D-mesh fitting. Experiments have shown that the proposed motion estimation obtained motion fields between key frames closer to the ground truth than classical 2D block matching, but had a limited impact on the side information PSNR. This limitation has been overcome by estimating the projection matrices from point tracks. It has led to major PSNR improvements with only limited overheads, both in terms of bit-rate and encoder complexity. As an additional feature, the encoder is able to roughly estimate the video motion content from the point tracks and to adaptively select the key frames.

Several issues remain open. For instance, it is still unclear how to obtain an optimal bit-rate allocation between key frames and WZ frames. Also, the spatial and temporal dependencies in the side information errors are yet to be understood and modeled, potentially leading to additional rate savings. Finally, further studies would be needed to extend the proposed frame interpolation technique to videos with more generic motion fields.


Fig. 6. Norm of the motion fields between the first two lossless key frames (left: street sequence, right: stairway sequence) for two block matchings: classical (a) and epipolar (b).

ACKNOWLEDGMENTS

This work has been partly funded by the European Commission in the context of the network of excellence SIMILAR and of the IST-Discover project. The authors are thankful to the IST development team for its original work on the IST-WZ codec [26] and to the Discover software team for the improvements they brought.

APPENDIX I
FIXING THE PROJECTIVE BASIS

During the non-linear optimization of the projection matrix ¹P, the projective basis is fixed by setting ⁰P = [I 0] and choosing two points {X⁽¹⁾, X⁽²⁾} and their projections. We would like to obtain a minimal parameterization of ¹P. The two points induce six constraints on ¹P, four of which are independent. Each point is associated with an equation of the form ¹λ ¹x = ¹P X. Using the third component to solve for ¹λ, we obtain ¹x₁ ¹P₃: X = ¹P₁: X and ¹x₂ ¹P₃: X = ¹P₂: X, where ¹Pᵢ: is the i-th row of ¹P. These equations can be rewritten as A ¹Pˢ = 0, where A is defined as

$$ \mathbf{A} \triangleq \begin{bmatrix} \mathbf{X}^{(1)\top} & \mathbf{0}^{\top} & -{}^{1}x_{1}^{(1)}\,\mathbf{X}^{(1)\top}\\ \mathbf{0}^{\top} & \mathbf{X}^{(1)\top} & -{}^{1}x_{2}^{(1)}\,\mathbf{X}^{(1)\top}\\ \mathbf{X}^{(2)\top} & \mathbf{0}^{\top} & -{}^{1}x_{1}^{(2)}\,\mathbf{X}^{(2)\top}\\ \mathbf{0}^{\top} & \mathbf{X}^{(2)\top} & -{}^{1}x_{2}^{(2)}\,\mathbf{X}^{(2)\top} \end{bmatrix}. \qquad (10) $$

Taking the SVD of A gives

$$ \mathbf{A} = \mathbf{U} \begin{bmatrix} \mathbf{S} & \mathbf{0}\\ \mathbf{0} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{V}^{\top}\\ \mathbf{W}^{\top} \end{bmatrix}, \qquad (11) $$

where S, V and W are three matrices.

Fig. 7. PSNR of interpolated WZ frames for different interpolation schemes, using lossless key frames (top: street sequence, bottom: stairway sequence). Missing points correspond to key frames (infinite PSNR).

Fig. 8. Correlation noise for GOP 1, frame 5 (center of the GOP) of the stairway sequence, using lossless key frames: 2D-DVC with classical block matching (a), 3D-DVC with mesh model (b) and 3D-qDVC with mesh model (c). The correlation noise is the difference between the interpolated frame and the actual WZ frame.


Therefore, the projection matrix ¹P can be parameterized by a vector r such that ¹Pˢ = √3 W r, where the factor √3 was introduced so that a unit-norm vector r corresponds to ‖¹P‖₂² = 3.
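A minimal numpy sketch of this parameterization is given below; the two fixed points and their projections are assumed to be given, the function names are hypothetical, and the two points are assumed to be in general position so that A has rank 4. The last eight right singular vectors of A span its null space and play the role of W.

```python
import numpy as np

def projective_basis_parameterization(X1, X2, x1, x2):
    """Build the matrix A of Equation (10) from the two fixed points and
    return a function mapping an 8-vector r to 1P = sqrt(3) W r (reshaped
    row-wise), following Appendix I.

    X1, X2 : homogeneous 3D points (4-vectors).
    x1, x2 : their projections in key frame 1, as homogeneous points (x, y, 1).
    """
    def dlt_rows(X, x):
        z = np.zeros(4)
        return [np.concatenate([X, z, -x[0] * X]),
                np.concatenate([z, X, -x[1] * X])]

    A = np.asarray(dlt_rows(X1, x1) + dlt_rows(X2, x2))   # 4 x 12, Equation (10)
    _, _, Vt = np.linalg.svd(A)
    W = Vt[4:].T                                          # 12 x 8 basis of the null space

    def P_from_r(r):
        r = r / np.linalg.norm(r)                         # enforce ||r|| = 1
        return (np.sqrt(3) * (W @ r)).reshape(3, 4)       # rows = P^s_{1:4}, _{5:8}, _{9:12}

    return P_from_r
```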

Fig. 9. Feature-point tracking between the first two key frames of each sequence. Legend: feature points in first key frame (red dots), tracks in other frames (multicolor curves).


Fig. 10. PSNR of key frames and interpolated frames for different key frame quantization QP, using 3D-qDVC with mesh model (top: street sequence, bottom: stairway sequence). Peaks correspond to key frames.

APPENDIX II
BUNDLE ADJUSTMENT

The bundle adjustment problem given by Equation 5 is solved using an alternated reweighted linear least-squares approach [38]. First, the denominators are factored out and treated as constant weights, which are only updated at the end of each iteration. These weights, denoted ¹κ⁽ⁱ⁾, are defined as

$$ {}^{1}\kappa^{(i)} \triangleq \left[ \left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right){}^{1}\mathbf{P}^{s}_{9:12} \right]^{-1} \qquad (12) $$

and initialized to 1. The problem then becomes biquadratic in its parameters:

$$ \min_{\{{}^{0}\lambda^{(i)}\},\,{}^{1}\mathbf{P}^{s}} \; \sum_{i} {}^{1}\kappa^{(i)2} \left\{ \left[ \left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right) \left({}^{1}x_{1}^{(i)}\,{}^{1}\mathbf{P}^{s}_{9:12} - {}^{1}\mathbf{P}^{s}_{1:4}\right) \right]^{2} + \left[ \left({}^{0}\lambda^{(i)}\,{}^{0}\mathbf{x}^{(i)\top}\;\;1\right) \left({}^{1}x_{2}^{(i)}\,{}^{1}\mathbf{P}^{s}_{9:12} - {}^{1}\mathbf{P}^{s}_{5:8}\right) \right]^{2} \right\} \qquad (13) $$

$$ \text{such that}\quad {}^{1}\mathbf{P}^{s} = \sqrt{3}\,\mathbf{W}\mathbf{r}, \quad \|\mathbf{r}\|_{2}^{2} = 1, $$

which is solved by alternately fixing either the projective depths {⁰λ⁽ⁱ⁾} or the camera parameters ¹Pˢ and minimizing over the free parameters.

When the projective depths {⁰λ⁽ⁱ⁾} are fixed, the problem is equivalent to finding the unit-norm vector r which minimizes the squared norm of Ar, where the matrix A is obtained by stacking together sub-matrices of the form

$$ \sqrt{3}\,{}^{1}\kappa \begin{bmatrix} -{}^{0}\lambda\,{}^{0}\mathbf{x}^{\top} & -1 & \mathbf{0}^{\top} & 0 & {}^{1}x_{1}\,{}^{0}\lambda\,{}^{0}\mathbf{x}^{\top} & {}^{1}x_{1}\\ \mathbf{0}^{\top} & 0 & -{}^{0}\lambda\,{}^{0}\mathbf{x}^{\top} & -1 & {}^{1}x_{2}\,{}^{0}\lambda\,{}^{0}\mathbf{x}^{\top} & {}^{1}x_{2} \end{bmatrix} \mathbf{W}. \qquad (14) $$

The solution is obtained by taking the SVD of the matrix A and choosing the vector associated with the smallest singular value.

When the camera parameters ¹Pˢ are fixed, the problem is unconstrained and its Hessian is diagonal. Taking the derivative with respect to a particular ⁰λ and setting it to 0 leads to the solution

$$ {}^{0}\lambda = -\frac{\mathbf{a}^{\top}\mathbf{b}}{\mathbf{a}^{\top}\mathbf{a}} \quad\text{where}\quad \mathbf{a} \triangleq \begin{bmatrix} {}^{1}x_{1}\,{}^{1}\mathbf{P}^{s\top}_{9:11} - {}^{1}\mathbf{P}^{s\top}_{1:3}\\ {}^{1}x_{2}\,{}^{1}\mathbf{P}^{s\top}_{9:11} - {}^{1}\mathbf{P}^{s\top}_{5:7} \end{bmatrix} {}^{0}\mathbf{x}, \qquad \mathbf{b} \triangleq \begin{bmatrix} {}^{1}x_{1}\,{}^{1}\mathbf{P}^{s}_{12} - {}^{1}\mathbf{P}^{s}_{4}\\ {}^{1}x_{2}\,{}^{1}\mathbf{P}^{s}_{12} - {}^{1}\mathbf{P}^{s}_{8} \end{bmatrix}. \qquad (15) $$
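For concreteness, one alternation of this scheme could look as follows. This is a schematic numpy sketch under the assumptions that the 12x8 basis W of Appendix I, the homogeneous correspondences and the current depths and weights are available; it omits the degenerate-case handling a real implementation would need.

```python
import numpy as np

def bundle_adjust_step(x0, x1, depths, W, kappa):
    """One alternation of the reweighted LLS bundle adjustment (Appendix II).

    x0, x1 : (N, 3) homogeneous points in key frames 0 and 1.
    depths : (N,) current projective depths 0lambda.
    W      : (12, 8) orthonormal basis from Appendix I.
    kappa  : (N,) current weights 1kappa (Equation 12).
    Returns the updated 1P (3x4), depths and weights.
    """
    N = len(depths)

    # Fix the depths and solve for r: stack the 2N x 8 system of Equation (14).
    rows = []
    for i in range(N):
        p = np.concatenate([depths[i] * x0[i], [1.0]])          # (0lambda 0x^T, 1)
        rows.append(np.sqrt(3) * kappa[i] *
                    np.concatenate([-p, np.zeros(4), x1[i][0] * p]) @ W)
        rows.append(np.sqrt(3) * kappa[i] *
                    np.concatenate([np.zeros(4), -p, x1[i][1] * p]) @ W)
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    r = Vt[-1]                                                   # smallest singular value
    Ps = np.sqrt(3) * (W @ r)                                    # stacked 12-vector of 1P
    P = Ps.reshape(3, 4)

    # Fix 1P and update each depth in closed form (Equation 15).
    new_depths = np.empty(N)
    for i in range(N):
        a = np.array([x1[i][0] * P[2, :3] - P[0, :3],
                      x1[i][1] * P[2, :3] - P[1, :3]]) @ x0[i]
        b = np.array([x1[i][0] * P[2, 3] - P[0, 3],
                      x1[i][1] * P[2, 3] - P[1, 3]])
        new_depths[i] = -(a @ b) / (a @ a)

    # Update the weights for the next iteration (Equation 12).
    hom = np.concatenate([new_depths[:, None] * x0, np.ones((N, 1))], axis=1)
    new_kappa = 1.0 / (hom @ Ps[8:])
    return P, new_depths, new_kappa
```

Iterating this step until the reprojection error stabilizes yields the refined estimates of ¹P and of the depths.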

Fig. 11. Rate-distortion curves for H.264 intra, H.264 inter IPPP, 2D-DVC I-WZ-I and 3D-qDVC (left: street sequence, right: stairway sequence).

REFERENCES

[1] J. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE Trans. on Info. Theory, vol. 19, no. 4, pp. 471–480, 1973.
[2] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,” IEEE Trans. on Info. Theory, vol. 22, no. 1, pp. 1–10, January 1976.
[3] S. Pradhan and K. Ramchandran, “Distributed source coding using syndromes (DISCUS): design and construction,” IEEE Trans. on Info. Theory, vol. 49, no. 3, pp. 626–643, March 2003.
[4] A. Aaron and B. Girod, “Compression with side information using turbo codes,” in Proc. IEEE Int. Data Compr. Conf., 2002.
[5] J. Garcia-Frias and Y. Zhao, “Compression of correlated binary sources using turbo codes,” IEEE Comm. Letters, vol. 5, pp. 417–419, 2001.
[6] ——, “Data compression of unknown single and correlated binary sources using punctured turbo codes,” in Proc. Allerton Conf., 2001.
[7] J. Bajcsy and P. Mitran, “Coding for the Slepian-Wolf problem with turbo codes,” in Proc. IEEE Int. Global Com. Conf., 2001, pp. 1400–1404.
[8] A. Liveris, Z. Xiong, and C. Georghiades, “Compression of binary sources with side information at the decoder using LDPC codes,” IEEE Comm. Letters, vol. 6, pp. 440–442, 2002.
[9] T. Tian, J. Garcia-Frias, and W. Zhong, “Compression of correlated sources using LDPC codes,” in Proc. IEEE Int. Data Compr. Conf., 2003.
[10] A. Aaron, R. Zhang, and B. Girod, “Wyner-Ziv coding of motion video,” in Proc. Asilomar Conf. on Sig., Sys. and Computers, 2002.
[11] A. Aaron, S. Rane, R. Zhang, and B. Girod, “Wyner-Ziv coding for video: Applications to compression and error resilience,” in Proc. IEEE Int. Data Compr. Conf., 2003, pp. 93–102.
[12] A. Aaron, S. Rane, E. Setton, and B. Girod, “Transform-domain Wyner-Ziv codec for video,” in Proc. SPIE Conf. on Visual Com. and Im. Proc., 2004.


[13] R. Puri and K. Ramchandran, “PRISM: A new robust video coding architecture based on distributed compression principles,” in Proc. Allerton Conf., 2002.
[14] ——, “PRISM: A new reversed multimedia coding paradigm,” in Proc. ICIP, 2003.
[15] A. Sehgal and N. Ahuja, “Robust predictive coding and the Wyner-Ziv problem,” in Proc. IEEE Int. Data Compr. Conf., 2003.
[16] A. Aaron, S. Rane, D. Rebollo-Monedero, and B. Girod, “Systematic lossy forward error protection for video waveforms,” in Proc. ICIP, 2003.
[17] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, “Distributed video coding,” Proc. of the IEEE, vol. 93, no. 1, pp. 71–83, January 2005.
[18] P. Ishwar, V. Prabhakaran, and K. Ramchandran, “Towards a theory for video coding using distributed compression principles,” in Proc. ICIP, 2003.
[19] D. Zhang and G. Lu, “Segmentation of moving objects in image sequence: a review,” Circuits, Systems, and Signal Proc., vol. 20, no. 2, pp. 143–183, 2001.
[20] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3D Vision. Springer-Verlag, 2004.
[21] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch, “Visual modeling with a hand-held camera,” IJCV, vol. 59, no. 3, pp. 207–232, September 2004.
[22] C. Zitnick, S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High-quality video view interpolation using a layered representation,” in Proc. of ACM SIGGRAPH, 2004.
[23] J. Repko and M. Pollefeys, “3D models from extended uncalibrated video sequences: Addressing key-frame selection and projective drift,” in Proc. 3-D Digit. Imag. and Model., 2005.
[24] F. Galpin and L. Morin, “Sliding adjustment for 3D video representation,” EURASIP J. on Applied Signal Proc., vol. 2002, no. 10, pp. 1088–1101, 2002.
[25] M. Maitre, C. Guillemot, and L. Morin, “3D scene modeling for Distributed Video Coding,” in Proc. ICIP, 2006.
[26] J. Ascenso, C. Brites, and F. Pereira, “Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding,” in EURASIP Conf. on SIPMCS, 2005.
[27] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. Alvey Vision Conf., 1988, pp. 147–151.
[28] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New York: John Wiley and Sons Ltd, 2001.
[29] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1993.
[30] S. Birchfield and C. Tomasi, “Depth discontinuities by pixel-to-pixel stereo,” IJCV, 1999.
[31] D. Luenberger, Linear and Nonlinear Programming, 2nd ed. Kluwer Academic Publishers, August 2003.
[32] P. Torr and A. Zisserman, “Robust computation and parametrization of multiple view relations,” in Proc. ICCV, 1998.
[33] O. Chum, J. Matas, and J. Kittler, “Locally optimized RANSAC,” in Proc. of the 25th DAGM Symp., 2003.
[34] R. Hartley, “Chirality,” IJCV, vol. 26, no. 1, pp. 41–61, 1998.
[35] P. Beardsley, A. Zisserman, and D. Murray, “Sequential updating of projective and affine structure from motion,” IJCV, vol. 23, pp. 235–259, 1997.
[36] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, 2nd ed. Springer-Verlag, 2000.
[37] B. Matei and P. Meer, “A general method for errors-in-variables problems in computer vision,” in Proc. CVPR, 2000.
[38] A. Bartoli, “A unified framework for quasi-linear bundle adjustment,” in Proc. ICPR, 2002, pp. 560–563.
[39] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Prentice Hall, August 2002.
[40] K. Shoemake, “Animating rotation with quaternion curves,” Computer Graphics, vol. 19, no. 3, pp. 245–254, 1985.
[41] R. Hartley, “Lines and points in three views and the trifocal tensor,” IJCV, vol. 22, no. 2, pp. 125–140, 1997.
[42] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004. [Online]. Available: http://www.stanford.edu/~boyd/cvxbook.html
[43] M. Woo, J. Neider, T. Davis, and D. Shreiner, OpenGL Programming Guide, 3rd ed. Addison Wesley, 1999.
[44] R. Balter, P. Gioia, and L. Morin, “Time evolving 3D model representation for scalable video coding,” in Proc. ICIP, 2005.