EDGE-ADAPTIVE TRANSFORMS FOR EFFICIENT DEPTH MAP CODING

G. Shen, W.-S. Kim, S.K. Narang, A. Ortega
Signal and Image Processing Institute, University of Southern California

Jaejoon Lee, HoCheon Wey
Multimedia Lab, Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.

This work was supported in part by Samsung Electronics Co., Ltd. Email: {godwinsh,wooshikk,kumarsun,aortega}@usc.edu
ABSTRACT

In this work a new set of edge-adaptive transforms (EATs) is presented as an alternative to the standard DCTs used in image and video coding applications. These transforms avoid filtering across edges in each image block and thus avoid creating large high-frequency coefficients. They are combined with the DCT in H.264/AVC, and a transform mode selection algorithm is used to choose between DCT and EAT in a rate-distortion (RD) optimized manner. When applied to coding the depth maps used for view synthesis in a multi-view video coding system, these transforms provide up to a 29% bit rate reduction for a fixed quality in the synthesized views.

Index Terms— Multiview plus depth (MVD), depth coding, rate-distortion optimization
1. INTRODUCTION

The Discrete Cosine Transforms (DCTs) used in image and video coding standards can efficiently represent images that have only horizontal or only vertical edges. However, when more complex edges exist, e.g., edges with diagonal orientation, these transforms produce many large-magnitude coefficients, which tend to require a significant number of bits to code. Quantization errors in these large coefficients also lead to ringing artifacts in the reconstructed images. Moreover, when depth maps are used to synthesize virtual views in a multi-view video coding system [1], ringing artifacts around edges in the reconstructed depth maps lead to erosion artifacts in the synthesized views [2]. In this work we seek to design transforms that can efficiently represent depth maps while also preserving the edge information. Efficient representation of diagonal edges can be achieved with directional transforms, e.g., directional DCTs [3], directionlets [4], or bandelets [5]. By filtering parallel to the edges in a block, these transforms can significantly reduce the number of large-magnitude coefficients. However, they still do not provide an efficient representation for blocks with more complex edges, such as an “L” shaped or a “V” shaped edge.

In order to efficiently represent these sorts of blocks, a more general transform representation is needed. In [6], platelets [7] were used to approximate depth map images as piece-wise planar signals. Since depth maps are not exactly piece-wise planar, this representation has a fixed approximation error. Our proposed transforms, in contrast, introduce no such approximation error, i.e., they can exactly represent signals that are not piece-wise planar. Edge-adaptive wavelet transforms have also been proposed for depth map coding [8, 9], but those transforms are applied to the entire image and are not easily amenable to block-based processing, whereas our proposed transforms are easily applied to blocks of any size. We recently proposed an edge-adaptive intra prediction scheme [10] that avoids predicting across edges, though that technique still uses the DCT to process the residuals.
We make two main contributions in this work. The first is a set of edge-adaptive transforms (EATs) that take (arbitrary) edge structure directly into account. We use these transforms to improve on the performance of existing coding schemes with particular emphasis on depth map coding using H.264/AVC. More specifically, for each block we (i) perform edge detection to identify edge locations, (ii) map the pixels in the block onto a graph G in which each pixel is connected to its immediate neighbors only if they are not separated by an edge and then (iii) construct an EAT on this graph. The basic idea here is that whenever two pixels are separated by an edge, they will not be connected in the graph G. Thus, by constructing transforms that only filter together values from pixels that are connected in the graph, we automatically avoid filtering across edges. Any such transform can then be applied to the pixel values, and if designed appropriately, very few large magnitude transform coefficients should result. We use the eigenvectors of the Laplacian [11] of the graph since it provides a spectral decomposition of the signal on the graph. The second contribution is a mode selection scheme that allows the encoder to choose between DCT and EAT in a rate-distortion (RD) optimal fashion. Our implementation of EAT and mode selection is based on H.264/AVC.
This paper is organized as follows. The construction of the EATs is described in Sec. 2. The RD-optimized mode selection algorithm is discussed in Sec. 3.
The EATs and mode selection scheme are then evaluated in Sec. 4, where we observe up to a 29% reduction in bit rate for a fixed interpolation quality in the synthesized views. Some concluding remarks are made in Sec. 5.

2. EDGE-ADAPTIVE TRANSFORM DESIGN

We now describe how to construct our proposed EATs. The process consists of three steps: (i) edge detection is applied to the residual block to find edge locations, (ii) a graph is constructed based on the resulting edge map, and (iii) an EAT is constructed from the graph and the EAT coefficients are computed. The EAT coefficients are then quantized with a uniform scalar quantizer, and the same run-length coding used for DCT coefficients is applied. The 2 × 2 sample block in Fig. 1 is used to illustrate the main ideas. We describe the encoder operation when applied to blocks of prediction residuals, though the same ideas can easily be applied to original pixel values.

2.1. EAT Construction

First, edge detection is applied to the residual block and a binary edge map is generated that indicates the locations of edges; we use the edge detection scheme described in [8, 9]. If no edges are found in a residual block, a DCT is used and no further EAT processing is performed. Otherwise, the encoder computes an EAT as follows. A graph is generated from the edge map, in which each pixel in the residual block is connected to each of its immediate neighbors (e.g., its 4-connected neighbors) only if there is no edge between them. This yields an adjacency matrix A, where A(i, j) = A(j, i) = 1 if pixels i and j are immediate neighbors not separated by an edge, and A(i, j) = A(j, i) = 0 otherwise. Alternatively, the value of 1 for connected pixels could be replaced by a distance between them, so that pixels that are adjacent horizontally or vertically are closer than those adjacent diagonally. The adjacency matrix is then used to compute the degree matrix D, where D(i, i) equals the number of non-zero entries in the i-th row of A and D(i, j) = 0 for all i ≠ j. The Laplacian of the graph is computed using the standard definition [11], i.e., L = D − A. Fig. 1 shows an example of these three matrices.

As discussed in [11], projecting a signal defined on a graph G onto the eigenvectors of the Laplacian L yields a spectral decomposition of the signal, i.e., it provides a “frequency domain” interpretation of the signal on the graph. Thus, in this work we construct each EAT from the eigenvectors of the Laplacian of the graph: the EAT matrix is simply E^t, where E is the eigenvector matrix of L (an example is shown in Fig. 1). Since the Laplacian L is symmetric, E can be efficiently computed using the well-known cyclic Jacobi method [12]. One could also store a pre-computed set of transforms corresponding to the most popular edge configurations in order to save on computational complexity. Simpler transforms (e.g., the lifting transforms in [13]) could also be used as an alternative.
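To make steps (ii) and (iii) concrete, the sketch below builds the adjacency, degree, and Laplacian matrices of a block from the list of neighbor pairs that the edge map separates, and takes the Laplacian's eigenvectors as the transform. This is a minimal NumPy sketch we provide for illustration, not the JM-based implementation used in the experiments; the function name and the cut-pair input format are our own.

```python
import numpy as np

def eat_from_edges(n, cut_pairs):
    """Construct an EAT for an n x n residual block (illustrative).

    Pixels are indexed column-major: pixel (i, j) -> k = n*j + i.
    cut_pairs holds (k, l) pairs of 4-connected neighbors that the
    edge map declares to be separated by an edge.
    Returns the transform matrix E^t, whose rows are eigenvectors
    of the graph Laplacian sorted by increasing eigenvalue.
    """
    n2 = n * n
    A = np.zeros((n2, n2))
    for j in range(n):
        for i in range(n):
            k = n * j + i
            # Link each pixel to its bottom and right neighbors
            # unless the edge map separates them.
            for ii, jj in ((i + 1, j), (i, j + 1)):
                if ii < n and jj < n:
                    l = n * jj + ii
                    if (k, l) not in cut_pairs and (l, k) not in cut_pairs:
                        A[k, l] = A[l, k] = 1
    D = np.diag(A.sum(axis=1))   # degree matrix
    L = D - A                    # graph Laplacian, L = D - A
    # L is symmetric, so eigh returns real eigenvalues (ascending)
    # and orthonormal eigenvectors as the columns of E.
    _, E = np.linalg.eigh(L)
    return E.T

# Fig. 1 example (0-indexed): a vertical edge cuts the links
# between pixels 0-2 and 1-3.
Et = eat_from_edges(2, {(0, 2), (1, 3)})
```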
Fig. 1. Example of a 2 × 2 block in which pixels 1 and 2 are separated from pixels 3 and 4 by a single vertical edge. The corresponding adjacency matrix A, degree matrix D, Laplacian L = D − A, and resulting EAT matrix E^t are:

A = | 0 1 0 0 |   D = | 1 0 0 0 |   L = |  1 −1  0  0 |
    | 1 0 0 0 |       | 0 1 0 0 |       | −1  1  0  0 |
    | 0 0 0 1 |       | 0 0 1 0 |       |  0  0  1 −1 |
    | 0 0 1 0 |       | 0 0 0 1 |       |  0  0 −1  1 |

E^t = (1/√2) |  1  1  0  0 |
             |  0  0  1  1 |
             | −1  1  0  0 |
             |  0  0 −1  1 |
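As a quick sanity check of the matrices in Fig. 1 (a snippet we added for illustration; the values a0 = 5 and a1 = 9 are arbitrary), the following NumPy code verifies that the rows of E^t are orthonormal eigenvectors of L and that a two-region piece-wise constant block produces only two non-zero coefficients:

```python
import numpy as np

s = 1 / np.sqrt(2)
Et = np.array([[ s,  s,  0,  0],   # "DC" of component {1, 2}
               [ 0,  0,  s,  s],   # "DC" of component {3, 4}
               [-s,  s,  0,  0],   # high frequency within {1, 2}
               [ 0,  0, -s,  s]])  # high frequency within {3, 4}

L = np.array([[ 1, -1,  0,  0],
              [-1,  1,  0,  0],
              [ 0,  0,  1, -1],
              [ 0,  0, -1,  1]], dtype=float)

# Rows of Et form an orthonormal eigenbasis of L
# (eigenvalues 0, 0, 2, 2) ...
assert np.allclose(Et @ Et.T, np.eye(4))
assert np.allclose(Et @ L @ Et.T, np.diag([0.0, 0.0, 2.0, 2.0]))

# ... and a piece-wise constant block has two non-zero coefficients.
a0, a1 = 5.0, 9.0
x = np.array([a0, a0, a1, a1])
print(Et @ x)   # [sqrt(2)*a0, sqrt(2)*a1, 0, 0]
```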
EAT coefficients are computed as follows. For an N × N block of residual pixels, form a one-dimensional input vector x by concatenating the columns of the block into a single N² × 1 vector, i.e., x(Nj + i) = X(i, j) for all i, j = 0, 1, ..., N−1. The EAT transform coefficients are then given by y = E^t · x, where y is also an N² × 1 vector, and the coefficients are quantized with a uniform scalar quantizer. The vector of quantized coefficients is then reformed into an N × N block Y by placing the elements of y in the standard zig-zag fashion used for the DCT, i.e., Y(0, 0) = y(0), Y(1, 0) = y(1), Y(0, 1) = y(2), Y(0, 2) = y(3), and so on. This allows the coefficients to be scanned and entropy coded in the same manner as for the DCT.

2.2. EAT Properties

We now evaluate the efficiency of the EATs proposed in Sec. 2.1. Suppose we have an N × N piece-wise constant (PWC) image block with M constant regions, where every pixel in the i-th region has value a_i for i = 0, 1, ..., M−1. Assume that we have an edge map that divides the block into these M regions, and let x be defined as in Sec. 2.1. In this case, any graph G constructed as described in Sec. 2.1 will have exactly M connected components. As will be shown shortly, our proposed EATs produce at most M non-zero coefficients for any such PWC block. This implies that, for run-length coding schemes such as those used in JPEG and H.264/AVC, at most M non-zero values need to be encoded, and an end-of-block code can be used for the remaining N² − M zeros. Thus, EATs can provide much higher coding efficiency than a DCT.
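Before proving this claim, the following snippet (reusing the hypothetical eat_from_edges sketch from Sec. 2.1) illustrates it numerically: it builds a 4 × 4 PWC block with M = 3 constant regions, cuts every 4-connected link whose endpoints differ, and counts the non-zero transform coefficients.

```python
import numpy as np

n = 4
# A piece-wise constant 4x4 block with M = 3 constant regions.
X = np.array([[3, 3, 7, 7],
              [3, 3, 7, 7],
              [3, 3, 7, 7],
              [5, 5, 7, 7]], dtype=float)

# An ideal edge map cuts every 4-connected link whose endpoints
# have different values (column-major index k = n*j + i).
cut = set()
for j in range(n):
    for i in range(n):
        for ii, jj in ((i + 1, j), (i, j + 1)):
            if ii < n and jj < n and X[i, j] != X[ii, jj]:
                cut.add((n * j + i, n * jj + ii))

Et = eat_from_edges(n, cut)      # sketch from Sec. 2.1
x = X.flatten(order='F')         # column-major: x(n*j + i) = X(i, j)
y = Et @ x
print(np.sum(np.abs(y) > 1e-9))  # at most M = 3 non-zero coefficients
```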
We now prove that our EATs produce at most M non-zero coefficients for a PWC image with M constant regions. Since the graph G has M connected components, the Laplacian has M zero eigenvalues [11]. Let the eigenvalues be sorted in increasing order, i.e., λ_0 = λ_1 = ... = λ_{M−1} = 0 and λ_M ≤ ... ≤ λ_{N²−1}. Also, for each i = 0, 1, ..., M−1, let C_i be the set containing the indices of the pixels in the i-th component and let |C_i| denote the number of elements in that set. It was shown in [11] that, for every λ_i = 0, the corresponding eigenvector e_i has value e_i(j) = 1/√|C_i| for all j ∈ C_i and e_i(k) = 0 for all k ∉ C_i. Thus, for each λ_i = 0, the eigenvector e_i acts like a normalized “DC” basis function for the i-th connected component, i.e., (e_i)^t · x = Σ_{j ∈ C_i} x(j)/√|C_i|. Now let x_i be such that x_i(j) = a_i for all j ∈ C_i and x_i(k) = 0 for k ∉ C_i, so that x = x_0 + x_1 + ... + x_{M−1}. Moreover,

(e_i)^t · x = √|C_i| · a_i,  ∀ i = 0, 1, ..., M−1.    (1)

By the definitions of e_i and x_i, it also follows that x_i = √|C_i| · a_i · e_i for all i = 0, 1, ..., M−1. Therefore, x = √|C_0| a_0 e_0 + √|C_1| a_1 e_1 + ... + √|C_{M−1}| a_{M−1} e_{M−1}. Since the eigenvectors are mutually orthogonal, we also have that (e_j)^t · e_i = 0 for all i = 0, 1, ..., M−1 and j ≥ M. Thus,

(e_j)^t · x = 0,  ∀ j = M, M+1, ..., N²−1.    (2)
Equations (1) and (2) together imply that, for a PWC block with M connected components, there are at most M non-zero projections onto the eigenvectors of the Laplacian. Thus, at most M non-zero EAT coefficients must be encoded, and the remaining N² − M zeros can be represented by an end-of-block code. This can lead to higher overall coding efficiency. To illustrate this result more concretely, refer again to Fig. 1. If x(0) = x(1) = a_0 and x(2) = x(3) = a_1, the EAT coefficients for this block are simply y = [√2·a_0, √2·a_1, 0, 0]^t; there are at most two non-zero coefficients for the two connected components.

These EATs minimize the number of non-zero coefficients that must be encoded for a PWC image and thus provide a highly efficient representation for depth maps, since depth maps are nearly PWC. Since residual depth maps (resulting from intra or inter prediction) are also nearly PWC, EATs provide an efficient representation for residual depth maps as well. However, since the edge information must be encoded and sent to the decoder, these EATs are not necessarily RD-optimal: if the edge map bit rate is too high, the RD cost for an EAT may actually be greater than the RD cost for the DCT. It is therefore better to choose between EAT and DCT in an RD-optimal fashion; an optimized transform mode selection algorithm is described in Sec. 3.

3. RD-OPTIMIZATION

As pointed out in Sec. 2, the RD cost when using the DCT for coding may actually be lower than the RD cost for an EAT because of the edge map bit rate. More specifically, let R_dct and D_dct (resp. R_eat and D_eat) be the rate and distortion associated with using the DCT (resp. an EAT) for coding a block, and let R_edges denote the number of bits needed to code the binary edge map for that block. It is certainly possible that the RD cost when using the DCT is less than the RD cost when using an EAT, i.e., we may have D_dct + λR_dct < D_eat + λ(R_eat + R_edges), in which case it is better to use the DCT. In other words, an EAT should only be used in place of the DCT when it yields a lower RD cost. This leads to the RD-optimized transform mode selection algorithm shown in Fig. 2. The edge maps and the transform mode information (i.e., EAT or DCT) are encoded using context-adaptive binary arithmetic coding (CABAC) [14].
Fig. 2. Block diagram of the RD-optimized mode decision process: the residual of each block is coded both with the DCT and, using the edge map from edge detection, with an EAT; the candidate with the smaller cost D + λR (including R_edges on the EAT path) is selected. X_b denotes the original pixel values in block b, X̂_p the reconstructed pixel values in block p, D_b the residual pixel values, and E_b the binary edge map used to construct each EAT. Y_dct (resp. Y_eat) denotes the block of DCT (resp. EAT) coefficients.
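The comparison in Fig. 2 is a plain Lagrangian cost test. Below is a minimal sketch of the decision step, assuming the rate and distortion of each candidate have already been measured; the function and argument names are illustrative, not taken from the JM reference software.

```python
def select_transform_mode(d_dct, r_dct, d_eat, r_eat, r_edges, lam):
    """RD-optimized choice between DCT and EAT for one block.

    The EAT pays for its coefficients AND for the binary edge map,
    so it is selected only if
        D_eat + lam * (R_eat + R_edges) < D_dct + lam * R_dct.
    """
    cost_dct = d_dct + lam * r_dct
    cost_eat = d_eat + lam * (r_eat + r_edges)
    return "EAT" if cost_eat < cost_dct else "DCT"

# Example: an EAT that greatly reduces distortion can win even
# though it must also transmit the edge map.
print(select_transform_mode(d_dct=410.0, r_dct=96,
                            d_eat=250.0, r_eat=60, r_edges=40,
                            lam=2.0))   # -> EAT
```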
4. EXPERIMENTS

We now compare our proposed EATs with RD-optimized transform mode selection against H.264/AVC. Our implementation of the transform and mode selection algorithm is based on the H.264/AVC JM13.2 reference software. Since the primary transform used in H.264/AVC is the 4 × 4 DCT, we have implemented only 4 × 4 EATs. For each block that uses an EAT, the corresponding edge map is encoded using CABAC; the choice of transform mode is also encoded using CABAC. For testing we use the ‘Ballet’ and ‘Breakdancers’ sequences provided by Microsoft Research [15]. We encode 15 video and depth map frames using an IPPP coding structure with QP = 24, 28, 32 and 36, with the proposed algorithms applied only to depth map coding. The default λ value in JM13.2 is used in the RD-optimized transform mode selection. The View Synthesis Reference Software (VSRS) 3.0 [16] is used to generate the rendered view between the two coded views.
The results for the ‘Ballet’ and ‘Breakdancers’ sequences are shown in Fig. 3. For a fixed PSNR of the rendered view, our proposed algorithms provide a Bjontegaard delta bit rate (BDBR) reduction of 29% for the coded ‘Ballet’ depth maps and of 19% for the ‘Breakdancers’ depth maps.
Fig. 3. Rate-distortion curves of the proposed method and of H.264/AVC for the ‘Ballet’ and ‘Breakdancers’ sequences. x-axis: total bit rate (in kbps) to code the two depth maps; y-axis: PSNR of the luminance component of the rendered view against the ground truth.
We observe that the ‘Breakdancers’ depth map sequence has weaker edges than the ‘Ballet’ sequence; thus, the average bit rate reduction (i.e., the BDBR) that the EATs provide is not as large for that sequence. The edge map and transform mode bit rates for both sequences are shown in Table 1. Note that the edge map and mode bit rates decrease with increasing QP, mainly because the edge map becomes relatively more expensive at higher QP, so the EAT mode is selected less often.

QP              | 24  | 28  | 32  | 36
Edges (Ballet)  | 147 | 138 | 128 | 112
Mode (Ballet)   | 63  | 58  | 51  | 46
Edges (Break)   | 111 | 95  | 46  | 5
Mode (Break)    | 54  | 47  | 30  | 13

Table 1. Edge map and transform mode bit rates (in kbps) for the ‘Ballet’ and ‘Breakdancers’ sequences at various QP values.
5. CONCLUSIONS

We have proposed a novel set of edge-adaptive transforms as an alternative to the standard DCTs used in image and video coding applications. A transform mode selection scheme was also proposed to choose between EAT and DCT in an RD-optimal manner. When used for coding depth maps in a multi-view video coding system, these EATs provide up to a 29% reduction in bit rate for a fixed quality in the synthesized views. As future work, the computational complexity could be reduced by pre-computing EATs for a fixed set of edge configurations, and coding efficiency could be increased by more efficient edge map coding.

6. REFERENCES

[1] K. Yamamoto, M. Kitahara, H. Kimata, T. Yendo, T. Fujii, M. Tanimoto, S. Shimizu, K. Kamikura, and Y. Yashima, “Multiview video coding using view interpolation and color correction,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 11, pp. 1436–1449, Nov. 2007.

[2] P. Lai, A. Ortega, C. Dorea, P. Yin, and C. Gomila, “Improving view rendering quality and coding efficiency by suppressing compression artifacts in depth-image coding,” in Proc. of VCIP’09, 2009.

[3] B. Zeng and J. Fu, “Directional discrete cosine transforms for image coding,” in Proc. of ICME’06, 2006.

[4] V. Velisavljevic, B. Beferull-Lozano, M. Vetterli, and P.L. Dragotti, “Directionlets: Anisotropic multidirectional representation with separable filtering,” IEEE Transactions on Image Processing, vol. 15, no. 7, pp. 1916–1933, July 2006.

[5] E. Le Pennec and S. Mallat, “Sparse geometric image representations with bandelets,” IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 423–438, April 2005.

[6] Y. Morvan, P.H.N. de With, and D. Farin, “Platelet-based coding of depth maps for the transmission of multiview images,” in Proc. of SPIE, vol. 6055, 2006.

[7] R. Willett and R. Nowak, “Platelets: A multiscale approach for recovering edges and surfaces in photon-limited medical imaging,” IEEE Transactions on Medical Imaging, vol. 22, no. 3, pp. 332–350, March 2003.

[8] M. Maitre and M.N. Do, “Joint encoding of the depth image based representation using shape-adaptive wavelets,” in Proc. of ICIP’08, 2008.

[9] A. Sanchez, G. Shen, and A. Ortega, “Edge-preserving depth-map coding using graph-based wavelets,” in Proc. of Asilomar’09, 2009.

[10] G. Shen, W.-S. Kim, A. Ortega, J. Lee, and H.C. Wey, “Edge-aware intra prediction for depth-map coding,” to appear in Proc. of ICIP’10.

[11] D.K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Tech. Rep. arXiv:0912.3848, Dec. 2009.

[12] H. Rutishauser, “The Jacobi method for real symmetric matrices,” Numerische Mathematik, vol. 9, no. 1, Nov. 1966.

[13] S.K. Narang and A. Ortega, “Lifting based wavelet transforms on graphs,” in Proc. of APSIPA’09, 2009.

[14] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620–637, July 2003.

[15] L. Zitnick, S.B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High-quality video view interpolation using a layered representation,” ACM Transactions on Graphics, vol. 23, no. 3, Aug. 2004.

[16] M. Tanimoto, T. Fujii, and K. Suzuki, “View synthesis algorithm in view synthesis reference software 2.0 (VSRS2.0),” ISO/IEC JTC1/SC29/WG11, Feb. 2009.