15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP
AN ESTIMATION OF MISMATCH ERROR IN IDCT DECODERS IMPLEMENTED WITH HYBRID-LNS ARITHMETIC
Peter Lee
Electronics Department, University of Kent
Canterbury, Kent CT2 7NT UK
phone: +44 1227 823251, fax: +44 1227 824066, email: [email protected]
web: www.ee.kent.ac.uk
efficiency is further increased by exploiting the properties of the 1-D transform to reduce the total number of multiplications necessary [1]. Other researchers [4] have implemented the 2-D transform directly. This approach can reduce the computational complexity further, albeit at the cost of a less regular architecture. Efficient calculation of the DCT and IDCT is becoming increasingly important because of their use in digital media applications where, for hand-held devices, power is also a significant issue. An important aspect of 2-D image compression is that in most cases the original image data has low precision, normally between 8 and 10 bits.

Recently a number of researchers [5, 6, 7] have proposed using the Logarithmic Number System (LNS) or Hybrid-LNS arithmetic to perform the DCT/IDCT on images. The published results have shown that this is an effective alternative to normal linear binary arithmetic, requiring less hardware and benefiting from the lower power properties of such number systems [8]. The results presented in these papers have primarily been generated using standard images or by quoting PSNR performance. In the past, another important metric for measuring the performance of the IDCT/DCT transform pair has been the IEEE 1180-1990 standard [9]. This standard sought to quantify the acceptable level of mismatch between different transform algorithms used at the encoder and the decoder. This is important because the IDCT is used in the reconstruction loop in both the encoder and decoder, as shown in Figure 1.
This paper investigates the magnitude of the mismatch error produced when an IDCT unit, built using the Hybrid-Logarithmic Number System (Hybrid-LNS), is used as part of an MPEG or videophone decoder. Mismatch errors occur when different IDCT algorithms are used in the encoding and decoding units of an image compression system. As the IDCT forms part of the feedback loop in such a system, the errors can accumulate and affect the overall picture quality. To remove the errors it is necessary to periodically transmit intra-frames. However, if the intra-frames are transmitted too frequently the efficiency of the transmission is reduced (i.e. the required bandwidth increases). This paper uses a well-known standard, IEEE 1180-1990, to assess the performance of a range of decoders built using Hybrid-LNS and compares them with a typical fixed-point equivalent decoder. The results show that to achieve the performance levels defined by 1180-1990 the major advantages of Hybrid-LNS, low power and low complexity, are somewhat compromised.

1. INTRODUCTION

The 2-Dimensional Discrete Cosine Transform (DCT) and its inverse, the IDCT, are used in the compression of images in many standards such as JPEG and MPEG [1] and video telephony [2]. Because the transforms are computationally intensive they are usually performed on blocks of 8x8 pixels using (1) for the DCT.
\[
Z(m,n) = \frac{1}{4} C_m C_n \sum_{i=0}^{7} \sum_{j=0}^{7} x(i,j) \cos\frac{\pi(2i+1)m}{16} \cos\frac{\pi(2j+1)n}{16} \tag{1}
\]

where

\[
C_m, C_n = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } m, n = 0 \\ 1 & \text{otherwise.} \end{cases}
\]

The IDCT transform is similar but with different values for the coefficients Cm and Cn. Many architectures and algorithms have been proposed for performing these transforms [3]. Because both the DCT and IDCT are separable transforms they are often implemented as two 1-D transforms. The computational

©2007 EURASIP
ABSTRACT
Figure 1. Simplified encoder/decoder block diagram for DCT/IDCT

When different IDCT algorithms are used at the encoder and decoder, the difference between the two IDCT outputs, referred to here as the mismatch error, will accumulate. The
Hybrid-LNS is considered. In Section 2 the basic properties of the Logarithmic Number System (LNS) and the Hybrid-LNS will be described. Section 3 outlines the main features of the algorithm proposed for calculating both the DCT and IDCT using Hybrid-LNS. Section 4 presents the results of the algorithm implemented with Matlab™. Finally, Section 5 concludes with a summary and a description of further work.
mismatch error will appear as additional noise in the reconstructed pictures and can cause degradation in picture quality over time. Due to the nature of error accumulation, even a slight IDCT mismatch may cause significant degradation. In applications such as MPEG and video telephony the effect of these errors is reduced with a periodic reset of the IDCT loop using intra-frame coding. However, because of the low compression rates achieved with intra-frame coding there is a penalty to be paid in the overall system bandwidth. Hence intra-frame coding is used sparingly. For MPEG this may be once every 12 frames, but for video telephony the interval can be significantly longer (over 100 frames in some cases). If periodic intra-frame coding is used then the mismatch error will be considered acceptable as long as the corresponding picture degradation is not noticeable before the picture is refreshed. The IEEE 1180-1990 standard is used to define the acceptable limits of mismatch error for such systems. The specification defines the error e_k(i, j) for a pixel located at position (i, j) in the
2. LOGARITHMIC NUMBER SYSTEM
The LNS has been proposed as an alternative to both fixed point and floating point number representations for over 30 years [10]. In the LNS, a number x is represented as a fixed-point value
\[
i = \log_2 x, \tag{7}
\]

with extra bits to represent the sign of x and the special case when x = 0. A major advantage of the LNS is that multiplication and division in the linear domain are simply replaced by addition or subtraction in the log domain. However, the operations of addition and subtraction are more complex. They are usually defined using the equations
image as the difference between the value calculated by the algorithm being investigated and that achieved by an "ideal" reference calculated using double-precision floating point arithmetic. It then uses five conditions: ppe, the pixel peak error, which is the maximum error allowed at a pixel site (= 1); pmse(i,j), the pixel mean square error; omse, the overall mean square error; pme(i,j), the pixel mean error; and ome, the overall mean error. The latter four, defined in (3)-(6), are used to determine the performance of the algorithm being tested.
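As an illustration, the five conditions can be evaluated for a batch of reference and test blocks with the minimal Python sketch below. The names `ieee1180_metrics`, `complies` and `LIMITS` are hypothetical and do not come from the paper. Reporting the worst-case pixel for pmse and pme, and comparing the magnitude of the mean errors against the limits, are assumptions about how the limits are applied rather than a quotation from the standard.

```python
# Limits quoted in the text: ppe = 1 and the bounds attached to pmse, omse, pme, ome.
LIMITS = {"ppe": 1, "pmse": 0.06, "omse": 0.02, "pme": 0.015, "ome": 0.0015}


def ieee1180_metrics(ref_blocks, test_blocks):
    """Return the five error measures for two equal-length lists of 8x8 blocks."""
    n = len(ref_blocks)
    sum_e = [[0.0] * 8 for _ in range(8)]    # per-pixel sum of errors
    sum_e2 = [[0.0] * 8 for _ in range(8)]   # per-pixel sum of squared errors
    ppe = 0.0
    for ref, test in zip(ref_blocks, test_blocks):
        for i in range(8):
            for j in range(8):
                e = ref[i][j] - test[i][j]   # reference minus value under test
                ppe = max(ppe, abs(e))
                sum_e[i][j] += e
                sum_e2[i][j] += e * e
    # Assumption: report the worst pixel and use magnitudes for the mean errors.
    pmse = max(sum_e2[i][j] / n for i in range(8) for j in range(8))
    pme = max(abs(sum_e[i][j]) / n for i in range(8) for j in range(8))
    omse = sum(map(sum, sum_e2)) / (64 * n)
    ome = abs(sum(map(sum, sum_e))) / (64 * n)
    return {"ppe": ppe, "pmse": pmse, "omse": omse, "pme": pme, "ome": ome}


def complies(metrics):
    """True when every measure is within its limit."""
    return all(metrics[name] <= LIMITS[name] for name in LIMITS)
```

In the procedure described in the paper, 10000 blocks would be passed to such a function. A single unit error at one pixel of one block passes the peak-error test, and only becomes significant if it recurs often enough to raise the mean measures.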
\[
e_k(i,j) = x_{k,\mathrm{REF}} - x_k \tag{2}
\]

where $i, j = 0, \ldots, 7$ and $k = 1, \ldots, 10000$.

\[
\mathrm{pmse}(i,j) = \frac{\sum_{k=1}^{10000} e_k^2(i,j)}{10000} \le 0.06. \tag{3}
\]

\[
\mathrm{omse} = \frac{\sum_{i=0}^{7} \sum_{j=0}^{7} \sum_{k=1}^{10000} e_k^2(i,j)}{64 \times 10000} \le 0.02. \tag{4}
\]

\[
\mathrm{pme}(i,j) = \frac{\sum_{k=1}^{10000} e_k(i,j)}{10000} \le 0.015. \tag{5}
\]

\[
\mathrm{ome} = \frac{\sum_{i=0}^{7} \sum_{j=0}^{7} \sum_{k=1}^{10000} e_k(i,j)}{64 \times 10000} \le 0.0015. \tag{6}
\]

In this paper the performance of Hybrid-LNS arithmetic when used for the IDCT is evaluated with different levels of fractional precision for the logarithmic representation of the arithmetic. The implications of these results for efficient hardware implementations of IDCT decoders built using

\[
\log_2(x + y) = i + \log_2\left(1 + 2^{\,j-i}\right) \tag{8}
\]

\[
\log_2(x - y) = i + \log_2\left(1 - 2^{\,j-i}\right) \tag{9}
\]

where $i = \log_2 x$ and $j = \log_2 y$. Both these equations require the evaluation of a non-linear function

\[
P = \log_2\left(1 \pm 2^{\,j-i}\right). \tag{10}
\]

It is common practice to use a Look-Up Table (LUT) to approximate P, where the accuracy of the approximation is a function of the LUT address space. A number of algorithms have been developed to minimise the cost of this LUT, which becomes prohibitively large when more than 16 bits of accuracy are required [10]. A further problem, often overlooked in the description of LNS processors, is the cost of converting numbers to and from the log domain. Although several algorithms have been proposed, they represent another limitation to the achievable accuracy and performance of LNS systems, often requiring large LUTs [11] or time-consuming iterative algorithms [12]. As a result of these issues, two distinct types of LNS architecture are commonly used when implementing LNS processors designed to perform numerical calculations with floating point accuracy. The first performs all mathematical operations in the log domain and uses a LUT to perform (8) and (9) [13], while the second type, called the Hybrid-LNS processor in this paper [14], performs the operations of multiplication and division in the log domain and addition and subtraction in the linear domain. Although the Hybrid-LNS processor does not need a LUT for addition and subtraction, it does need to convert frequently between the
linear and log domains. Both architectures are shown in Figure 2, where they are being used to calculate an inner product.
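The difference between the two architectures can be sketched in a few lines of Python. This is an illustration only, not the author's Matlab model. It is restricted to positive operands, so the sign and zero handling is omitted. The function names are hypothetical, and `quantise` simply rounds a logarithm to `frac_bits` fractional bits in place of a real LUT. The pure-LNS path accumulates with the addition formula log2(x + y) = i + log2(1 + 2^(j-i)), while the hybrid path converts each product back to the linear domain before adding.

```python
import math


def quantise(value, frac_bits):
    """Round a value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def hybrid_lns_inner_product(xs, ys, frac_bits):
    """Hybrid-LNS: multiply in the log domain, accumulate in the linear domain."""
    z = 0.0
    for x, y in zip(xs, ys):
        w = quantise(math.log2(x), frac_bits) + quantise(math.log2(y), frac_bits)
        z += 2.0 ** w                      # log-to-linear conversion, then linear add
    return z


def lns_add(i, j, frac_bits):
    """Return log2(x + y) given i = log2(x) and j = log2(y), for x, y > 0."""
    if j > i:
        i, j = j, i                        # keep the exponent j - i non-positive
    p = quantise(math.log2(1.0 + 2.0 ** (j - i)), frac_bits)  # stands in for the LUT
    return i + p


def lns_inner_product(xs, ys, frac_bits):
    """Pure LNS: multiply and accumulate in the log domain, convert once at the end."""
    v = None
    for x, y in zip(xs, ys):
        w = quantise(math.log2(x), frac_bits) + quantise(math.log2(y), frac_bits)
        v = w if v is None else lns_add(v, w, frac_bits)
    return 2.0 ** v
```

With operands that are exact powers of two, the hybrid path reproduces the linear result exactly, whereas the pure-LNS path carries the rounding error of the quantised addition function at every accumulation step.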
barrel shifter is the fractional part of the log2 converted using a small LUT.
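A minimal Python sketch of this kind of conversion follows, assuming the fraction addresses a LUT holding 2^F as a fixed-point mantissa and the integer part selects the shift. The names `build_log2lin_lut` and `log2lin`, the `out_bits` mantissa precision and the round-to-nearest step are all assumptions made for illustration; the paper does not specify them.

```python
def build_log2lin_lut(frac_bits, out_bits):
    """LUT[f] approximates 2**(f / 2**frac_bits) with out_bits fractional bits."""
    size = 1 << frac_bits
    return [round(2.0 ** (f / size) * (1 << out_bits)) for f in range(size)]


def log2lin(sign, zero, integer, fraction, lut, out_bits):
    """Convert a (sign, zero, integer, fraction) log value to a linear integer."""
    if zero:
        return 0
    mantissa = lut[fraction]               # 2**fraction, scaled by 2**out_bits
    shift = integer - out_bits             # barrel shift controlled by the integer part
    if shift >= 0:
        magnitude = mantissa << shift
    else:
        # Right shift with round-to-nearest (an assumption, not from the paper).
        magnitude = (mantissa + (1 << (-shift - 1))) >> -shift
    return -magnitude if sign else magnitude
```

For example, with an 8-bit fraction address and a 12-bit mantissa, a log value of 5.5 (integer 5, fraction 128) converts to 45, the integer nearest to 2^5.5.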
[Figure 2 signal labels. (a) Hybrid-LNS: w_n = log2 x_n ± log2 y_n (log multiply), v_n = 2^(w_n) (log-to-linear conversion), z = Σ v_n (linear addition). (b) LNS: w_n = log2 x_n ± log2 y_n (log multiply), v = Σ w_n (log addition), z = 2^v (log-to-linear conversion).]

Figure 4. Log2Lin Converter Architecture [block labels: sign, zero, integer, fraction, Log2Lin LUT, barrel shifter, sign/magnitude converter]

3. THE IDCT ARCHITECTURE
The standard architecture shown in Figure 5 was modelled in Matlab™ to test the performance of the Hybrid-LNS 2-D IDCT with a range of fractional precision bits for the logarithmic components. Scaling is used at the output of the first 1-D IDCT to minimise the size of the dual port memory needed to store the temporary results. In most implementations of this architecture (see for example [15]) the throughput can be increased by using multiple 1-D IDCT units in parallel. It is not uncommon for 4 or 8 such units to be working in parallel. In the linear domain this means that the most costly element is the coefficient multiplier, which has to be at least 16x16 bits to achieve 1180-1990 performance [16], whereas for the Hybrid-LNS it is the Log2Lin converter.
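The row-column structure of this architecture can be sketched in floating point as follows. This is a double-precision reference of the kind used as the "ideal" transform, not the Hybrid-LNS model itself. The intermediate scaling and rounding stage is omitted, and the function names are hypothetical. The 1-D forward transform uses the same 1/2 and C_m factors that, applied along both dimensions, give the 1/4 C_m C_n scaling of (1).

```python
import math


def c(k):
    """Normalisation coefficient: 1/sqrt(2) for index 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct_1d(v):
    """8-point forward DCT of one row or column."""
    return [0.5 * c(m) * sum(v[i] * math.cos(math.pi * (2 * i + 1) * m / 16)
                             for i in range(8))
            for m in range(8)]


def idct_1d(z):
    """8-point inverse DCT of one row or column."""
    return [0.5 * sum(c(m) * z[m] * math.cos(math.pi * (2 * i + 1) * m / 16)
                      for m in range(8))
            for i in range(8)]


def transform_2d(block, transform_1d):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform_1d(row) for row in block]
    cols = [transform_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]   # transpose back to [row][column]
```

Calling `transform_2d(block, dct_1d)` followed by `transform_2d(coeffs, idct_1d)` returns the original block to within floating-point rounding, which is the baseline against which a reduced-precision implementation would be compared.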
Figure 2. (a) Hybrid-LNS and (b) LNS processors

The Hybrid-LNS is generally smaller than the LNS architecture for small bit-widths. As this is the case with image data, it was chosen as the architecture for performing the experiments on 2-D DCT/IDCT transforms with 256-level (8-bit) grey-scale images. This algorithm performs a direct-form implementation of the 2-D transform using two 1-D transforms. The algorithm has been implemented using Matlab™. The generic architecture of the DCT and IDCT blocks is shown in Figure 5. The same processing elements have been used to perform both the 1-D transforms in series.

2.1 The Lin2Log Converter
In the Hybrid-LNS, the log2 of a number is described using the 4-tuple (S, Z, I, F), where S is the sign bit, Z is the zero bit, I is the integer part and F is the fractional part. Figure 3 outlines the main elements used in the converter algorithm.
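One way to picture the conversion to this 4-tuple is the Python sketch below. It assumes the integer part comes from the position of the leading one, and the bits below the leading one are shifted to form the address of a fraction LUT. The names `build_lin2log_lut` and `lin2log`, the truncation of surplus low-order bits and the clamping of the last LUT entry are assumptions made for illustration, not details taken from the paper.

```python
import math


def build_lin2log_lut(addr_bits, frac_bits):
    """LUT[a] approximates log2(1 + a / 2**addr_bits) with frac_bits fractional bits."""
    size = 1 << addr_bits
    top = (1 << frac_bits) - 1             # clamp so the fraction field cannot overflow
    return [min(round(math.log2(1.0 + a / size) * (1 << frac_bits)), top)
            for a in range(size)]


def lin2log(x, lut, addr_bits):
    """Convert an integer x to the tuple (S, Z, I, F)."""
    if x == 0:
        return (0, 1, 0, 0)                # zero flag set, other fields unused
    sign = 1 if x < 0 else 0
    magnitude = abs(x)
    integer = magnitude.bit_length() - 1   # position of the leading one
    below = magnitude - (1 << integer)     # bits below the leading one
    if integer >= addr_bits:
        address = below >> (integer - addr_bits)   # truncate surplus low bits
    else:
        address = below << (addr_bits - integer)   # left-align short mantissas
    return (sign, 0, integer, lut[address])
```

For 8-bit image data with an 8-bit address, every input maps to a distinct LUT address, so the accuracy of the conversion is set by the number of fractional bits stored in the table.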
Figure 3. Lin2Log Converter Architecture [block labels: leading zero detector, barrel shifter, Lin2Log fraction LUT; outputs: sign, zero, integer, fraction]

Figure 5. IDCT Architecture [block labels: 8x8 image block, Lin-to-Log converter, IDCT/DCT coefficient LUT (log), Log-to-Lin converter, Σ, scale and round, dual port memory; two 1-D IDCT stages]

4. RESULTS
A qualitative indication of the performance of the IDCT algorithm used here is shown in Figures 6 and 7 where the performance with 6 and 8 bits of fraction precision are compared to the ideal IDCT generated in Matlab™ with double-precision FP arithmetic. Figure 7 uses a histogram to show the changes to the spread of data values occurring in the image due to the logarithmic coding.
2.2 The Log2Lin Converter
The Log2Lin converter shown in Figure 4 has a similar architecture to the Lin2Log converter where the integer part of the log is used to control a barrel shifter. The input to the
The DCT coefficients are also converted into log2 format. As they are constant they are calculated off-line and stored separately. However, because the range of the cosine terms is between -1.0 and +1.0 they have been scaled by multiplying by 2^12 prior to conversion to ensure there is no underflow. As in the linear case, the IDCT is particularly sensitive to the accuracy with which the coefficients are defined. This scaling factor is removed after the summation in the linear domain as shown in Fig. 2.

To quantify the performance with respect to the 1180-1990 standard, the model was tested with a set of 10000 8x8 DCT blocks generated as defined in the standard [9]. The mismatch error calculated against an ideal converter was used to determine the pmse, omse, pme, and ome as described in (3)-(6). These results are tabulated in Tables 1-3, where scaling factors of ±5, ±256 and ±300 are used.
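The effect of the coefficient scaling and of the fractional precision can be illustrated with a short Python sketch. It takes a cosine coefficient, scales it by 2^12, rounds its log2 to a chosen number of fractional bits and measures how far the reconstructed coefficient is from the exact value. The function names and the round-to-nearest quantisation are assumptions, since the paper does not describe how its off-line table was generated.

```python
import math


def log_coefficient(k, frac_bits, scale_bits=12):
    """Return (sign, quantised log2) of cos(k*pi/16) scaled by 2**scale_bits.

    k must not make the cosine zero (k = 8, 24, ...), since log2(0) is undefined.
    """
    value = math.cos(k * math.pi / 16) * (1 << scale_bits)
    sign = 1 if value < 0 else 0
    grid = 1 << frac_bits
    log_q = round(math.log2(abs(value)) * grid) / grid   # assumption: round to nearest
    return sign, log_q


def coefficient_error(k, frac_bits, scale_bits=12):
    """Absolute error of the coefficient recovered from its quantised logarithm."""
    sign, log_q = log_coefficient(k, frac_bits, scale_bits)
    approx = (2.0 ** log_q) / (1 << scale_bits)           # remove the scaling again
    if sign:
        approx = -approx
    return abs(approx - math.cos(k * math.pi / 16))
```

After scaling, the smallest coefficient magnitude, cos(7π/16), is roughly 799, so every stored logarithm has a positive integer part. The error from `coefficient_error` shrinks as `frac_bits` grows, which gives a feel for why the coefficient precision matters in the tables that follow.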
Table 4 summarises the results obtained with a fixed-point equivalent architecture as reported in [16].

Table 1. Data weighted by ±5, DCT coefficients = 16 bit

LUT address space (bits)   PPE   PMSE   OMSE     PME     OPME
1180-1990 limit            1     0.06   0.02     0.015   0.0015
6                          1     0.08   0.029    0.02    0.002
8                          1     0.04   0.0058   0.02    0.0001
10                         1     0.02   0.002    0.02    0.0009
12                         1     0.01   0.0005   0.01    0.0002
14                         0     0      0        0       0
16                         0     0      0        0       0
Figure 6. Reconstructed image after IDCT: (a) ideal, (b) 6-bit fractional precision, (c) 8-bit fractional precision.

Figure 7. Normalised histogram of image: (a) ideal, (b) with 6-bit fractional precision, (c) with 8-bit fractional precision. [Axes in each panel: horizontal 0 to 1, vertical 0 to 3500.]
Table 2. Data weighted by ±256, DCT coefficients = 16 bit

LUT address space (bits)   PPE   PMSE   OMSE     PME     OPME
1180-1990 limit            1     0.06   0.02     0.015   0.0015
6                          10    1.5    1.1      1.0     0.087
8                          2     0.59   0.40     0.19    0.0045
10                         1     0.12   0.069    0.06    0.0015
12                         1     0.06   0.02     0.05    0.0001
14                         1     0.02   0.0047   0.02    0.0009
16                         1     0.01   0.0018   0.01    0.0003

Table 3. Data weighted by ±300, DCT coefficients = 16 bit

LUT address space (bits)   PPE   PMSE   OMSE     PME     OPME
1180-1990 limit            1     0.06   0.02     0.015   0.0015
6                          11    1.4    1.04     0.93    0.11
8                          2     0.47   0.38     0.16    0.0014
10                         1     0.13   0.06     0.07    0.0003
12                         1     0.06   0.017    0.05    0.0002
14                         1     0.03   0.0047   0.02    0.0003
16                         1     0.01   0.0014   0.01    0.0002
Table 4. 1180-1990 compliant fixed-point architecture [16], 16-bit internal solution

Input range and sign   PMSE    OMSE     PME      OPME
1180-1990 limit        0.06    0.02     0.015    0.0015
L=256, H=255, +        0.023   0.0193   0.0023   0.0001
L=256, H=255, -        0.022   0.0192   0.0028   0.0029
L=5, H=5, +            0.021   0.0183   0.0041   0.0000
L=5, H=5, -            0.022   0.0184   0.0032   0.0001
L=300, H=300, +        0.02    0.0166   0.0048   0.0003
L=300, H=300, -        0.02    0.0166   0.0032   0.0001

5. CONCLUSIONS AND FURTHER WORK

The results above show that, although it is possible to fulfil the IEEE 1180-1990 specification using Hybrid-LNS, the levels of fractional precision necessary make it unlikely that this type of decoder could be used effectively in video telephony applications where intra-frames are only transmitted infrequently (once every 100 frames or less often). However, if a higher intra-frame rate is possible, as in the case of MPEG (for example every 12 frames), or if there is no feedback path using the IDCT, as in JPEG, then the Hybrid-LNS remains a potential low-power, low-complexity solution when implemented in custom silicon or even an FPGA, albeit with a minimum of 10 bits of fractional precision. As it is necessary for 1180-1990 compliance to use logarithms with levels of precision that are similar to those required in fixed-point solutions for low intra-frame rates, many of the advantages of the Hybrid-LNS approach are negated.

As shown in Figure 5, the major component of the 1-D transform in the Hybrid-LNS system is the Log2Lin conversion between domains. At 16 bits the simple LUT approach used in the low-resolution converter solutions is not viable, as the LUT grows exponentially with the address space. There are many conversion algorithms in the literature that attempt to address this problem. The most successful ones are based on Taylor or polynomial approximations to the Log2Lin curve. However, both methods require significant resources in terms of LUTs and multipliers. Alternative methods for conversion need to be found that make the logarithm approach competitive with the cost of the major component in the linear system, the 16x16 multiplier. In [17] an algorithm based on the algorithm by Kotsopoulos [12] is presented which uses a reduced set of LUTs together with two small multiplier units. However, it is yet to be determined experimentally in hardware whether this has a significant advantage in terms of logic resources and power dissipation when compared to a fixed-point solution for the IDCT. Solutions based on the algorithms by Pickett [18] and Knittel [19] offer the possibility of generating an efficient "multiplierless" converter with sufficient accuracy for this application, but further work is necessary to determine the overall hardware cost of these algorithms at the required precision.

REFERENCES

[1] K. R. Rao, J. J. Hwang, Techniques and Standards for Image, Video and Audio Coding. Prentice Hall, 1996.
[2] V. Bhaskaran, K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, 2nd Ed. Kluwer, 1997.
[3] P. Pirsch, N. Demassieux, W. Gehrke, "VLSI Architectures for Video Compression - A Survey." Proc. IEEE, Vol. 83, No. 2, pp 220-246, Feb. 1995.
[4] Wen-Hsiung Chen, C. Harrison Smith, S. C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform." IEEE Transactions on Communications, Vol. Com-25, No. 9, pp 1004-1009, Sep. 1977.
[5] M. G. Arnold, "LNS for low-power MPEG decoding." Advanced Signal Processing Algorithms, Architectures and Implementations XII, SPIE, Seattle, Washington, July 2002.
[6] Sheng-Chieh Huang, Liang-Gee Chen, "A Log-Exp Image Compression Chip Design." IEEE Transactions on Consumer Electronics, Vol. 45, No. 3, pp 812-818, Aug. 1999.
[7] P. Lee, "An evaluation of a Hybrid-Logarithmic Number System DCT/IDCT Algorithm." IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 2005.
[8] V. Paliouras, T. Stouraitis, "Low-Power Properties of the Logarithmic Number System." 15th IEEE Symp. on Computer Arithmetic, pp 229-236, 2001.
[9] IEEE 1180-1990 Standard. IEEE Press, 1991.
[10] E. E. Swartzlander, A. G. Alexopoulos, "The Sign/Logarithm Number System." IEEE Transactions on Computers, pp 1238-1242, 1975.
[11] J. N. Coleman, "Simplification of Table Structure in Logarithmic Arithmetic." IEE Electronics Letters, Vol. 31, No. 22, pp 1905-1906, 1995.
[12] D. K. Kotsopoulos, "An Algorithm for the Computation of Binary Logarithms." IEEE Transactions on Computers, Vol. 40, No. 11, pp 1267-1270, 1991.
[13] J. N. Coleman, E. I. Chester, et al., "Arithmetic on the European Logarithmic Processor." IEEE Transactions on Computers, Vol. 49, No. 7, pp 702-715, 2000.
[14] F. J. Taylor, "An Extended Precision Logarithmic Number System." IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-31, No. 1, pp 232-234, 1983.
[15] Xilinx Application Note XAPP611 (V1.1), available at www.xilinx.com, June 2002.
[16] Xilinx Logicore 2-D Discrete Cosine Transform, Product Specification, available at www.xilinx.com/ipcenter, March 14, 2002.
[17] P. Lee, "A Linear-to-log Conversion Algorithm with Reduced Memory Requirements." To be submitted to IEEE Transactions on Computers.
[18] L. C. Pickett, "High Speed Logarithmic Function Generating Apparatus." US Patent No. 5,359,551, Oct. 1994.
[19] G. Knittel, "A Fast Logarithm Converter." 7th IEEE International ASIC Conference, pp 450-453, 1994.