DCT MODE CONVERSIONS FOR FIELD/FRAME CODED MPEG VIDEO
Bhaskaran Vasudev
Epson Palo Alto Laboratory, 3145 Porter Drive, Suite 104, Palo Alto, California, USA 94304
Tel: 650-843-8345, Fax: 650-843-9106, [email protected]
Appears in 1998 IEEE Multimedia Signal Processing Workshop Proceedings
Abstract - In recent years, several compressed-domain processing methods, such as video downscaling, filtering, and inverse motion compensation, have been developed for MPEG compressed digital video. These methods assume that the underlying MPEG video was coded in only one mode, namely frame mode or field mode. To support both frame-mode and field-mode coded video, this paper presents a fast algorithm for converting a sequence of DCT blocks in the field mode to a sequence of blocks whose DCTs represent frame-mode coding. A multiplier-free implementation of this algorithm is also developed. In a typical MPEG-2 coding setup, simulations indicate that the image quality degradation resulting from the multiplier-free implementation is around 0.5 - 1.0 dB; however, a factor-of-two speedup is obtained compared with a traditional spatial-domain approach.
INTRODUCTION
In MPEG-2, each macroblock [1] can be coded in frame mode or field mode. Field-mode coding implies that the even-numbered lines within a macroblock are grouped together and DCT coded; a similar DCT coding procedure is employed on the grouping of odd-numbered lines. In frame-mode coding, the first eight contiguous lines in the macroblock are DCT coded; a similar DCT coding procedure is used on the next eight contiguous lines within the macroblock.

Now consider the problem of DCT-domain downscaling, inverse motion compensation [2], or DCT-domain filtering [3]. In each of these operations, we form a weighted sum of several DCT blocks. If these DCT blocks are not all of the same type, i.e., field-mode or frame-mode, they must be converted to the same mode before the compressed-domain processing method is applied. To enable DCT-domain processing of both frame-mode and field-mode MPEG coded video, this paper solves the following problem: given the DCT-domain representation of two $8 \times 8$ blocks that correspond to field-mode coding of a $16 \times 8$ region, develop the DCT-domain representation of the two blocks in the same $16 \times 8$ region that represent frame-mode coding of this region.

This algorithm, which we refer to as the fast DCT-domain approach, yields a four-fold reduction in computational complexity over the direct DCT-domain approach developed in [4]. The fast DCT-domain approach is not as efficient as the spatial-domain approach, in which a one-dimensional inverse DCT is performed along the columns, followed by a rearrangement of the data among the rows and then a one-dimensional forward DCT along the columns. We have developed a multiplier-free implementation of the fast DCT-domain method which yields a factor-of-two reduction in computational complexity over the spatial-domain approach if data sparseness within the DCT blocks is exploited. For typical MQUANT settings and (I,P) coded MPEG video, the image quality loss due to the multiplier-free implementation is around 0.5 - 1.0 dB.
FIELD DCT TO FRAME DCT CONVERSION
[Figure 1: Frame and Field organization of a $16 \times 8$ region. Frame mode: blocks $y_1$ and $y_2$; field mode: blocks $x_1$ and $x_2$.]

Referring to Fig. 1, we are given the $8 \times 8$ DCTs $X_1$ and $X_2$ of the spatial-domain $8 \times 8$ blocks $x_1$ and $x_2$. From the field-mode representations $x_1$ and $x_2$, the spatial-domain $8 \times 8$ blocks for the frame-mode representation, namely $y_1$ and $y_2$, can be computed using the sampling matrices $s_{11}$, $s_{12}$, $s_{21}$, and $s_{22}$ as

$$y_1 = s_{11} x_1 + s_{12} x_2, \qquad y_2 = s_{21} x_1 + s_{22} x_2, \tag{1}$$

where $s_{11}$, $s_{12}$, $s_{21}$, and $s_{22}$ are given by
$$
s_{11} = \begin{pmatrix}
1&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0\\
0&0&0&0&0&0&0&0
\end{pmatrix},
\qquad
s_{12} = \begin{pmatrix}
0&0&0&0&0&0&0&0\\
1&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&1&0&0&0&0
\end{pmatrix},
\tag{2}
$$

$$
s_{21} = \begin{pmatrix}
0&0&0&0&1&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&1\\
0&0&0&0&0&0&0&0
\end{pmatrix},
\qquad
s_{22} = \begin{pmatrix}
0&0&0&0&0&0&0&0\\
0&0&0&0&1&0&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&1
\end{pmatrix}.
\tag{3}
$$

Our objective is to compute the $8 \times 8$ DCT-domain representations $Y_1$ and $Y_2$ of $y_1$ and $y_2$ given only the $8 \times 8$ DCT-domain representations $X_1$ and $X_2$ of $x_1$ and $x_2$.
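The row interleaving expressed by Eq. (1) can be illustrated with a minimal Python sketch. It assumes that $x_1$ holds the even-numbered lines and $x_2$ the odd-numbered lines of the $16 \times 8$ region, and that $y_1$ and $y_2$ are its upper and lower halves; the helper names (`sampling_matrices`, `matmul`, `field_to_frame`) are illustrative and not part of the paper.

```python
def sampling_matrices():
    """Return s11, s12, s21, s22 as 8x8 lists of 0/1 entries."""
    s11, s12, s21, s22 = ([[0] * 8 for _ in range(8)] for _ in range(4))
    for k in range(4):
        s11[2 * k][k] = 1          # even rows of y1 <- rows 0..3 of x1
        s12[2 * k + 1][k] = 1      # odd rows of y1  <- rows 0..3 of x2
        s21[2 * k][k + 4] = 1      # even rows of y2 <- rows 4..7 of x1
        s22[2 * k + 1][k + 4] = 1  # odd rows of y2  <- rows 4..7 of x2
    return s11, s12, s21, s22


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def field_to_frame(x1, x2):
    """Apply Eq. (1): y1 = s11 x1 + s12 x2, y2 = s21 x1 + s22 x2."""
    s11, s12, s21, s22 = sampling_matrices()
    y1 = matadd(matmul(s11, x1), matmul(s12, x2))
    y2 = matadd(matmul(s21, x1), matmul(s22, x2))
    return y1, y2


# A 16x8 test region in which every sample encodes its own position.
region = [[10 * r + c for c in range(8)] for r in range(16)]
x1 = region[0::2]   # field block: even-numbered lines
x2 = region[1::2]   # field block: odd-numbered lines
y1, y2 = field_to_frame(x1, x2)
```

With this test region, `y1` and `y2` come out as the first and last eight lines of `region`, which is the frame organization of Fig. 1.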
SPATIAL-DOMAIN APPROACH
The spatial-domain approach for field-mode DCT to frame-mode DCT conversion involves the following steps:
(a) Perform an 8-point inverse DCT on each column of $X_1$ and $X_2$. A Winograd DCT scheme [1] requires 29 adds and 5 multiplies for each column. Repeating the process for all the columns in $X_1$ and $X_2$ requires 464 adds and 80 multiplies.
(b) Using the column-transformed $X_1$ and $X_2$ from the previous step, construct the new $8 \times 8$ matrices $Z_1$ and $Z_2$. The rows of $Z_1$ are taken alternately from the rows of $X_1$ and $X_2$. The process is repeated for $Z_2$.
(c) Perform an 8-point DCT on each column of $Z_1$ and $Z_2$. The computational complexity is the same as in step (a); this yields the desired frame-mode DCT matrices $Y_1$ and $Y_2$.
The operations count for this spatial-domain approach is 928 adds and 160 multiplies. In the next section we develop the basic DCT-domain approach for DCT mode conversion.
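Steps (a) to (c) can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses a plain orthonormal DCT matrix product in place of the Winograd scheme, so it shows the data flow but not the operation counts quoted above, and all function names are hypothetical.

```python
import math


def dct_matrix(n=8):
    """Orthonormal n-point DCT-II basis matrix S (rows are basis vectors)."""
    rows = []
    for k in range(n):
        a = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([a * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


S = dct_matrix()
St = transpose(S)


def dct2(block):
    """2-D DCT of an 8x8 block: S * block * S^t."""
    return matmul(matmul(S, block), St)


def field_to_frame_spatial(X1, X2):
    # (a) inverse DCT along the columns only (rows stay transformed)
    c1, c2 = matmul(St, X1), matmul(St, X2)
    # (b) interleave the rows of the two field blocks into 16 lines
    lines = [row for pair in zip(c1, c2) for row in pair]
    z1, z2 = lines[:8], lines[8:]
    # (c) forward DCT along the columns
    return matmul(S, z1), matmul(S, z2)


# Demo on a deterministic 16x8 region.
region = [[(3 * r + 5 * c) % 17 for c in range(8)] for r in range(16)]
X1, X2 = dct2(region[0::2]), dct2(region[1::2])   # field-mode DCTs
Y1, Y2 = field_to_frame_spatial(X1, X2)           # frame-mode DCTs
```

In this sketch `Y1` and `Y2` should agree, up to floating-point rounding, with `dct2(region[:8])` and `dct2(region[8:])`, i.e., with the DCTs obtained by coding the region in frame mode directly.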
DIRECT DCT-DOMAIN APPROACH
The DCT-domain equivalent of Eq. (1) can be written as

$$Y_1 = S_{11} X_1 + S_{12} X_2, \qquad Y_2 = S_{21} X_1 + S_{22} X_2,$$

where $Y_1$, $Y_2$, $S_{11}$, $S_{12}$, $S_{21}$ and $S_{22}$ are the $8 \times 8$ DCTs of $y_1$, $y_2$, $s_{11}$, $s_{12}$, $s_{21}$ and $s_{22}$. This is the basic algorithm for converting a field-mode DCT-domain representation to a frame-mode DCT-domain representation. We refer to this as the direct DCT-domain approach. Computation of each of the matrix products for one column of $Y_1$ and $Y_2$ requires 45 adds and 56 multiplies. Since four matrix products are involved in computing $Y_1$ and $Y_2$, the cost per column of $Y_1$ and $Y_2$ is 180 adds and 224 multiplies. This process is repeated for the eight columns of $Y_1$ and $Y_2$; the total operations count using the direct DCT-domain approach is 1440 adds and 1792 multiplies. An approach similar to the one developed here is suggested in [4]; however, the algorithm details and the associated complexity are not discussed in [4]. In the next section, we describe a fast algorithm to reduce the computational complexity of the direct DCT-domain approach.
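A minimal Python sketch of the direct DCT-domain approach is given below. It assumes an orthonormal DCT matrix $S$, so that $S_{ij} = S s_{ij} S^t$, and it uses dense matrix products, so it does not reflect the operation counts quoted above. The function names are illustrative.

```python
import math


def dct_matrix(n=8):
    """Orthonormal n-point DCT-II basis matrix S."""
    rows = []
    for k in range(n):
        a = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([a * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def transpose(a):
    return [list(col) for col in zip(*a)]


S = dct_matrix()
St = transpose(S)


def dct2(block):
    return matmul(matmul(S, block), St)


def dct_of_sampling(row_off, col_off):
    """DCT-domain sampling matrix S_ij = S s_ij S^t."""
    s = [[0.0] * 8 for _ in range(8)]
    for k in range(4):
        s[2 * k + row_off][k + col_off] = 1.0
    return dct2(s)


# Precomputed once; they do not depend on the video data.
S11, S12 = dct_of_sampling(0, 0), dct_of_sampling(1, 0)
S21, S22 = dct_of_sampling(0, 4), dct_of_sampling(1, 4)


def field_to_frame_direct(X1, X2):
    """Y1 = S11 X1 + S12 X2, Y2 = S21 X1 + S22 X2."""
    Y1 = matadd(matmul(S11, X1), matmul(S12, X2))
    Y2 = matadd(matmul(S21, X1), matmul(S22, X2))
    return Y1, Y2


region = [[(7 * r + 3 * c) % 23 for c in range(8)] for r in range(16)]
X1, X2 = dct2(region[0::2]), dct2(region[1::2])
Y1, Y2 = field_to_frame_direct(X1, X2)
```

The identity holds because $S^t S = I$: $S y_1 S^t = (S s_{11} S^t)(S x_1 S^t) + (S s_{12} S^t)(S x_2 S^t)$, and likewise for $y_2$.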
FAST DCT-DOMAIN APPROACH
As in prior work on DCT-domain processing [2], we use (a) butterfly operations on the input and output, and (b) sparse-matrix factors of the DCT basis matrix $S$ [5], to develop a fast algorithm for the direct DCT-domain approach described in the previous section. Using butterfly operations on the inputs and outputs $x_i$, $y_i$, $i = 1, 2$, leads to the following equations:

$$
\begin{aligned}
x^+ &= x_1 + x_2, & x^- &= x_1 - x_2,\\
y^+ &= \tfrac{1}{2}\,[u_{11} x^+ + u_{12} x^-], & y^- &= \tfrac{1}{2}\,[u_{21} x^+ + u_{22} x^-],\\
y_1 &= \tfrac{1}{2}\,[y^+ + y^-], & y_2 &= \tfrac{1}{2}\,[y^+ - y^-],
\end{aligned}
\tag{4}
$$

where $u_{11} = s_{11} + s_{12} + s_{21} + s_{22}$, $u_{12} = s_{11} - s_{12} + s_{21} - s_{22}$, $u_{21} = s_{11} + s_{12} - s_{21} - s_{22}$, and $u_{22} = s_{11} - s_{12} - s_{21} + s_{22}$. The reason for using butterflies is that the $\{u_{ij}\}$ can be factorized into products of matrices that are sparser than the factors of the $\{s_{ij}\}$. We can develop the DCT-domain equivalents of Eq. (4) as

$$
\begin{aligned}
X^+ &= X_1 + X_2, & X^- &= X_1 - X_2,\\
Y^+ &= \tfrac{1}{2}\,[S u_{11} S^t X^+ + S u_{12} S^t X^-], & Y^- &= \tfrac{1}{2}\,[S u_{21} S^t X^+ + S u_{22} S^t X^-],\\
Y_1 &= \tfrac{1}{2}\,[Y^+ + Y^-], & Y_2 &= \tfrac{1}{2}\,[Y^+ - Y^-].
\end{aligned}
\tag{5}
$$

Referring to Eq. (5), we embed the matrices $u_{11}$, $u_{12}$, $u_{21}$ and $u_{22}$ within the sparse-matrix factorization of $S$ (note that $S^t$ is the transpose of $S$). Using the factorization $S = D P B_1 B_2 M A_1 A_2 A_3$ that corresponds to the 8-point Winograd DCT due to Arai, Agui, and Nakajima [5], the $S u_{11} S^t$ term in Eq. (5) can be rewritten as

$$S u_{11} S^t = D P B_1 B_2 M A_1 A_2 A_3 \, u_{11} \, A_3^t A_2^t A_1^t M^t B_2^t B_1^t P^t D^t. \tag{6}$$

We combine $M A_1 A_2 A_3 \, u_{11} \, A_3^t A_2^t A_1^t M^t B_2^t B_1^t P^t$ into a single precomputed matrix, $U_{11}$; we also absorb the factor $\tfrac{1}{4}$ of Eq. (5) into this matrix:
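$$
U_{11} = \begin{pmatrix}
4.000 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 2.828 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 2.000 & 0 & 0 & 0\\
0 & -0.089 & 0 & 0.324 & 0 & -1.089 & 0 & -0.675\\
0 & 0.736 & 0 & 1.277 & 0 & -0.570 & 0 & -0.029\\
0 & 1.631 & 0 & 2.631 & 0 & -0.783 & 0 & 0.216\\
0 & 0.770 & 0 & 1.153 & 0 & -0.153 & 0 & 0.229
\end{pmatrix}.
\tag{7}
$$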
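The entries of $U_{11}$ can be cross-checked numerically. The sketch below is not the paper's code: it assumes that $M A_1 A_2 A_3$ corresponds to the stages of the Arai-Agui-Nakajima flow graph that precede its final output butterflies (in the ordering commonly used in floating-point implementations), and that $D = \mathrm{diag}(d_k)$ with $d_0 = 1/(2\sqrt{2})$ and $d_k = 1/(4\cos(k\pi/16))$ otherwise. Under these assumptions, column $k$ of $U_{11}$ is $\tfrac{1}{4} M A_1 A_2 A_3 \, u_{11}$ applied to row $k$ of $D^{-1} S$. All function names are hypothetical.

```python
import math

C4 = math.cos(math.pi / 4)
C8 = math.cos(math.pi / 8)
S8 = math.sin(math.pi / 8)


def aan_front(d):
    """Assumed M*A1*A2*A3: AAN stages before the final output butterflies."""
    t0, t1, t2, t3 = d[0] + d[7], d[1] + d[6], d[2] + d[5], d[3] + d[4]
    t7, t6, t5, t4 = d[0] - d[7], d[1] - d[6], d[2] - d[5], d[3] - d[4]
    e0, e1, e2, e3 = t0 + t3, t1 + t2, t1 - t2, t0 - t3   # even part
    a, b, c = t4 + t5, t5 + t6, t6 + t7                   # odd part
    return [e0 + e1, e0 - e1, C4 * (e2 + e3), e3,
            C8 * a - S8 * c, C4 * b, S8 * a + C8 * c, t7]


def scaled_basis(k):
    """Row k of D^{-1} S under the assumed scale factors d_k."""
    if k == 0:
        return [1.0] * 8
    g = 2.0 * math.cos(k * math.pi / 16)
    return [g * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8)]


def apply_u11(v):
    """u11 = s11 + s12 + s21 + s22: add the two halves, repeat each sample."""
    folded = [v[n] + v[n + 4] for n in range(4)]
    return [folded[m // 2] for m in range(8)]


def compute_U11():
    cols = [[x / 4.0 for x in aan_front(apply_u11(scaled_basis(k)))]
            for k in range(8)]
    return [[cols[k][r] for k in range(8)] for r in range(8)]  # U[row][col]


U11 = compute_U11()
for row in U11:
    print(" ".join(f"{v:7.3f}" for v in row))
```

Under the stated assumptions, the magnitudes printed by this sketch agree with those in Eq. (7) to about three decimal places (the printed values appear to be truncated rather than rounded), and columns 2 and 6 vanish because the corresponding basis rows cancel when the two halves are added.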
In a similar manner, $u_{12}$, $u_{21}$ and $u_{22}$ can be embedded to yield the matrices $U_{12}$, $U_{21}$ and $U_{22}$. Note that each of the $U$ matrices is sparse. In Eq. (6), the premultiplication by $D$ and the postmultiplication by $D^t$ can be absorbed in the quantization and dequantization steps needed in DCT-domain processing. Hence we ignore the computation cost of these steps.
The operations count for one column of $Y_1$ and $Y_2$ using the proposed exact DCT-domain approach of Eq. (5) is 90 adds, 43 multiplies and 3 shifts. Computing all entries of $Y_1$ and $Y_2$ repeats this process eight times, for a total operations count of 720 adds, 344 multiplies and 24 shifts. Relative to the direct DCT-domain approach, this halves the number of adds and reduces the number of multiplies roughly five-fold (about a four-fold reduction in the cycle count of Table 1); however, it is not as computationally efficient as the spatial-domain approach. In the next section we develop a multiplier-free implementation that yields complexity lower than the spatial-domain approach.
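The butterfly form of Eq. (5) can be checked against frame-mode DCTs computed directly from the pixels. The Python sketch below is an illustration only: it evaluates $S u_{ij} S^t$ with dense products and an orthonormal $S$, so it verifies the algebra (including the overall factor $\tfrac{1}{4}$) but not the sparse factorization or the operation counts. All names are hypothetical.

```python
import math


def dct_matrix(n=8):
    rows = []
    for k in range(n):
        a = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([a * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lin(ca, a, cb, b):
    """Element-wise ca*a + cb*b."""
    return [[ca * p + cb * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


S = dct_matrix()
St = [list(col) for col in zip(*S)]


def dct2(block):
    return matmul(matmul(S, block), St)


def sampling(row_off, col_off):
    s = [[0.0] * 8 for _ in range(8)]
    for k in range(4):
        s[2 * k + row_off][k + col_off] = 1.0
    return s


s11, s12, s21, s22 = sampling(0, 0), sampling(1, 0), sampling(0, 4), sampling(1, 4)


def combine(c12, c21, c22):
    """DCT of u = s11 + c12*s12 + c21*s21 + c22*s22."""
    u = lin(1, lin(1, s11, c12, s12), 1, lin(c21, s21, c22, s22))
    return dct2(u)


T11, T12 = combine(1, 1, 1), combine(-1, 1, -1)     # S u11 S^t, S u12 S^t
T21, T22 = combine(1, -1, -1), combine(-1, -1, 1)   # S u21 S^t, S u22 S^t


def field_to_frame_butterfly(X1, X2):
    Xp, Xm = lin(1, X1, 1, X2), lin(1, X1, -1, X2)        # input butterfly
    Yp = lin(0.5, matmul(T11, Xp), 0.5, matmul(T12, Xm))
    Ym = lin(0.5, matmul(T21, Xp), 0.5, matmul(T22, Xm))
    return lin(0.5, Yp, 0.5, Ym), lin(0.5, Yp, -0.5, Ym)  # output butterfly


region = [[(11 * r + 2 * c) % 19 for c in range(8)] for r in range(16)]
X1, X2 = dct2(region[0::2]), dct2(region[1::2])
Y1, Y2 = field_to_frame_butterfly(X1, X2)
```

Here `Y1` and `Y2` should match `dct2(region[:8])` and `dct2(region[8:])` up to rounding, since $y^+ = y_1 + y_2$ and $y^- = y_1 - y_2$ follow from the definitions of the $u_{ij}$.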
MULTIPLIER-FREE IMPLEMENTATION
The basic idea in developing a multiplier-free DCT-domain processing algorithm from its exact counterpart is as follows. Suppose that the exact output DCT vector $Y$ corresponding to the field-to-frame DCT conversion linear operation can be represented by a weighted sum $Y = \sum_i U_i X_i$, where the $\{X_i\}$ are DCT input vectors and the $\{U_i\}$ are certain fixed matrices. The approach proposed herein is to approximate $Y$ by $\tilde{Y} = D_2 \sum_i U_i^Q D_1 X_i$, where the $\{U_i^Q\}$ contain only 0, 1, $\tfrac{1}{2}$, $\tfrac{1}{4}$ and $\tfrac{1}{8}$ as elements, and $D_1$ and $D_2$ are optimally designed diagonal matrices, chosen so that $\{D_2 U_i^Q D_1\}$ best approximates $\{U_i\}$ in some reasonable sense. In this form, multiplication by $D_1$ and $D_2$ can be absorbed in the de-quantization and re-quantization steps which are usually needed in any DCT-domain processing (by appropriately modifying the de-quantization and quantization tables). Thus a virtually multiplication-free implementation of Eq. (5) is obtained, since multiplication by each $U_i^Q$ can be performed with shifts and adds.

In our implementation, we replace the matrices $U_{11}$, $U_{12}$, $U_{21}$ and $U_{22}$ of Eq. (6) by their multiplier-free counterparts $U_{11}^Q$, $U_{12}^Q$, $U_{21}^Q$ and $U_{22}^Q$, i.e., we replace floating-point values by the nearest powers-of-two values 0, 1, $\tfrac{1}{2}$, $\tfrac{1}{4}$ and $\tfrac{1}{8}$. In Eq. (6), we replace the diagonal matrix $D$ by optimized premultiplication and postmultiplication diagonal matrices $D_1$ and $D_2$, where $D_1$ = diag(1, 0.9252, 1.0251, 1.0821, 0.8362, 1.1592, 1.0074, 0.9999) and $D_2$ = diag(0.9995, 0.8105, 0.9841, 0.8438, 0.8574, 1.1427, 0.9650, 0.8846). Using these matrices in Eq. (5) and Eq. (6), the operations count for one column of $Y_1$ and $Y_2$ is 64 adds and 43 shifts. The total computation cost for all columns of $Y_1$ and $Y_2$ is 512 adds and 344 shifts. Since the DCT-domain representation is sparse for typical MPEG coded bitstreams, assuming that only the first 3 rows and columns of $X_1$ and $X_2$ are nonzero, the cost of computing $Y_1$ and $Y_2$ for sparse field-DCT data is 176 adds and 152 shifts.

Note that, because the matrices $U_{11}$, $U_{12}$, $U_{21}$ and $U_{22}$ are quantized to $U_{11}^Q$, $U_{12}^Q$, $U_{21}^Q$ and $U_{22}^Q$, the frame-mode DCT synthesized from the field-mode DCTs will not be identical to that obtained with the spatial-domain approach or the exact fast DCT-domain approach.
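The idea of replacing multiplications by shifts can be illustrated with a short Python sketch. It covers only the rounding of individual matrix entries and the shift-and-add evaluation; it does not reproduce the optimization of $D_1$ and $D_2$. Two assumptions are made that the text does not spell out: each entry keeps its sign, and "nearest" means nearest in absolute difference. The example row and all names are illustrative.

```python
LEVELS = (0.0, 0.125, 0.25, 0.5, 1.0)           # allowed magnitudes
SHIFT = {1.0: 0, 0.5: 1, 0.25: 2, 0.125: 3}     # magnitude -> right shift


def quantize_pow2(v):
    """Round v to the nearest allowed magnitude, keeping its sign (assumed)."""
    mag = min(LEVELS, key=lambda q: abs(abs(v) - q))
    return -mag if v < 0 else mag


def shift_multiply(x, q):
    """Multiply integer x by a quantized coefficient using a shift only."""
    if q == 0:
        return 0
    y = x >> SHIFT[abs(q)]       # arithmetic right shift
    return -y if q < 0 else y


def shift_add_dot(row_q, xs):
    """Inner product of a quantized row with integer data: shifts and adds."""
    return sum(shift_multiply(x, q) for q, x in zip(row_q, xs))


# Illustrative coefficients (not an actual row of any U matrix).
row = [0.736, -0.570, 0.029, 1.089]
row_q = [quantize_pow2(v) for v in row]          # [0.5, -0.5, 0.0, 1.0]
result = shift_add_dot(row_q, [8, 16, 100, 3])   # 4 - 8 + 0 + 3 = -1
```

Entries that round to zero cost nothing, which is one reason the operation count drops further when the input DCT blocks are sparse.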
CONCLUSIONS
We have developed a compressed-domain algorithm for converting a sequence of field-mode DCTs to a sequence of frame-mode DCTs. This conversion is needed in many compressed-domain processing tasks such as downscaling, filtering, and inverse motion compensation. To further improve the performance of the algorithm, a multiplier-free implementation was also developed. For typical field-DCT data sparseness, the multiplier-free implementation is at least two times faster than the spatial-domain approach. The computational complexity of all the methods described in this paper is summarized in Table 1. The image quality loss due to the multiplier-free implementation compared to the spatial-domain approach is 0.5 - 1.0 dB for typical MQUANT settings and (I,P) coded MPEG video.
Algorithm                            +      x      Shifts   Cycles
Spatial-domain approach              464    80     -        704
Direct DCT-domain                    1440   1792   -        6816
Fast DCT-domain                      720    344    24       1776
Multiplier-free                      512    -      344      856
Multiplier-free (sparse DCT data)    176    -      152      328

Table 1: Operations count for field-mode to frame-mode DCT conversions. For the cycle count, we assume that an add or a shift costs one cycle and a multiply costs 3 cycles.
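The cycle column of Table 1 follows from the stated cost model, as the short Python sketch below shows. The operation counts are copied from the table; the dictionary and function names are illustrative.

```python
# (adds, multiplies, shifts) per algorithm, as listed in Table 1.
OPS = {
    "Spatial-domain approach": (464, 80, 0),
    "Direct DCT-domain": (1440, 1792, 0),
    "Fast DCT-domain": (720, 344, 24),
    "Multiplier-free": (512, 0, 344),
    "Multiplier-free (sparse DCT data)": (176, 0, 152),
}


def cycles(adds, multiplies, shifts):
    """Cost model of Table 1: add/shift = 1 cycle, multiply = 3 cycles."""
    return adds + 3 * multiplies + shifts


table = {name: cycles(*ops) for name, ops in OPS.items()}

# Ratio behind the "at least two times faster" statement for sparse data.
speedup_sparse = (table["Spatial-domain approach"]
                  / table["Multiplier-free (sparse DCT data)"])

for name, c in table.items():
    print(f"{name:36s} {c:5d}")
```

With these counts the ratio between the spatial-domain row and the sparse multiplier-free row is about 2.1.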
References
[1] V. Bhaskaran and K. Konstantinides, "Image and Video Compression Standards: Algorithms and Architectures," Kluwer Academic Publishers, Second Edition, June 1997.
[2] N. Merhav and V. Bhaskaran, "Fast Algorithms for DCT-Domain Image Down-Sampling and for Inverse Motion Compensation," IEEE Trans. on Circuits and Systems for Video Technology, June 1997.
[3] N. Merhav and R. Kresch, "Approximate convolution using DCT coefficient multipliers," IEEE Trans. on Circuits and Systems for Video Technology, Aug. 1998.
[4] H. Sun, A. Vetro, J. Bao and T. Poon, "A new approach for memory efficient ATV decoding," IEEE Transactions on Consumer Electronics, Aug. 1997.
[5] Y. Arai, T. Agui, and M. Nakajima, "A Fast DCT-SQ Scheme for Images," Trans. of the IEICE, E 71(11):1095, November 1988.