15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

AN ESTIMATION OF MISMATCH ERROR IN IDCT DECODERS IMPLEMENTED WITH HYBRID-LNS ARITHMETIC

Peter Lee
Electronics Department, University of Kent, Canterbury, Kent CT2 7NT, UK
phone: +44 1227 823251, fax: +44 1227 824066, email: [email protected]
web: www.ee.kent.ac.uk

ABSTRACT

This paper investigates the magnitude of the mismatch error produced when an IDCT unit, built using the Hybrid-Logarithmic Number System (Hybrid-LNS), is used as part of an MPEG or videophone decoder. Mismatch errors occur when different IDCT algorithms are used in the encoding and decoding units of an image compression system. As the IDCT forms part of the feedback loop in such a system, the errors can accumulate and affect the overall picture quality. To remove the errors it is necessary to transmit intra-frames periodically. However, if the intra-frames are transmitted too frequently the efficiency of the transmission is reduced (i.e. the required bandwidth increases). This paper uses a well-known standard, IEEE 1180-1990, to assess the performance of a range of decoders built using Hybrid-LNS and compares them with a typical fixed-point equivalent decoder. The results show that, to achieve the performance levels defined by 1180-1990, the major advantages of Hybrid-LNS, low power and low complexity, are somewhat compromised.

1. INTRODUCTION

The 2-Dimensional Discrete Cosine Transform (DCT) and its inverse, the IDCT, are used in the compression of images in many standards such as JPEG and MPEG [1] and video telephony [2]. Because the transforms are computationally intensive they are usually performed on blocks of 8x8 pixels using (1) for the DCT.

$$Z(m,n) = \frac{1}{4} C_m C_n \sum_{i=0}^{7} \sum_{j=0}^{7} x(i,j) \cos\frac{\pi(2i+1)m}{16} \cos\frac{\pi(2j+1)n}{16} \qquad (1)$$

where $C_m = C_n = 1/\sqrt{2}$ for $m = n = 0$, and $C_m = C_n = 1$ otherwise.

The IDCT transform is similar but with different values for the coefficients Cm and Cn. Many architectures and algorithms have been proposed for performing these transforms [3]. Because both the DCT and IDCT are separable transforms they are often implemented as two 1-D transforms. The computational efficiency is further increased by exploiting the properties of the 1-D transform to reduce the total number of multiplications necessary [1]. Other researchers [4] have implemented the 2-D transform directly. This approach can reduce the computational complexity further, albeit at the cost of a less regular architecture.

Efficient calculations of the DCT and IDCT are becoming increasingly important because of their use in digital media applications where, for hand-held devices, power is also a significant issue. An important aspect of 2-D image compression is that in most cases the original image data has low precision, normally between 8 and 10 bits. Recently a number of researchers [5, 6, 7] have proposed using the Logarithmic Number System (LNS) or Hybrid-LNS arithmetic to perform the DCT/IDCT on images. The published results have shown that this is an effective alternative to normal linear binary arithmetic, requiring less hardware and benefiting from the lower power properties of such number systems [8]. The results presented in these papers have primarily been generated using standard images or by quoting PSNR performance.

In the past another important metric for measuring the performance of the IDCT/DCT transform pair has been the IEEE 1180-1990 standard [9]. This standard sought to quantify the acceptable level of mismatch between different transform algorithms used at the encoder and the decoder. This is important because the IDCT is used in the reconstruction loop in both the encoder and decoder as shown in Figure 1.

[Figure 1: block diagram. Encoder blocks: Video Input, FDCT, Q, Q^-1, IDCT, SW, Memory. Decoder blocks: Q^-1, IDCT, SW, Memory, Video Output.]

Figure 1. Simplified encoder/decoder block diagram for DCT/IDCT

When different IDCT algorithms are used at the encoder and decoder the difference between the two IDCT outputs, referred to here as the mismatch error, will accumulate. The mismatch error will appear as additional noise in the reconstructed pictures and can cause degradation in picture quality over time. Due to the nature of error accumulation, even a slight IDCT mismatch may cause significant degradation. In applications such as MPEG and video telephony the effect of these errors is reduced with a periodic reset of the IDCT loop using intra-frame coding. However, because of the low compression rates achieved with intra-frame coding there is a penalty to be paid in the overall system bandwidth. Hence intra-frame coding is used sparingly. For MPEG this may be once every 12 frames but for video telephony the interval can be significantly longer (over 100 frames in some cases). If periodic intra-frame coding is used then the mismatch error will be considered acceptable as long as the corresponding picture degradation is not noticeable before the picture is refreshed.

The IEEE 1180-1990 standard is used to define the acceptable limits of mismatch error for such systems. The specification defines the error $e_k(i,j)$ for a pixel located at position $(i,j)$ in the image as the difference between the value calculated by the algorithm being investigated and that achieved by an "ideal" reference calculated using double-precision floating point arithmetic. It then uses five conditions: ppe, the pixel peak error, which is the maximum error allowed at a pixel site (= 1); pmse(i,j), the pixel mean square error; omse, the overall mean square error; pme(i,j), the pixel mean error; and ome, the overall mean error. The latter four, defined in (3)-(6), are used to determine the performance of the algorithm being tested.

$$e_k(i,j) = x_{k,\mathrm{REF}} - x_k \qquad (2)$$

where $i, j = 0, \ldots, 7$ and $k = 1, \ldots, 10000$.

$$pmse(i,j) = \frac{\sum_{k=1}^{10000} e_k^2(i,j)}{10000} \le 0.06 \qquad (3)$$

$$omse = \frac{\sum_{i=0}^{7} \sum_{j=0}^{7} \sum_{k=1}^{10000} e_k^2(i,j)}{64 \times 10000} \le 0.02 \qquad (4)$$

$$pme(i,j) = \frac{\sum_{k=1}^{10000} e_k(i,j)}{10000} \le 0.015 \qquad (5)$$

$$ome = \frac{\sum_{i=0}^{7} \sum_{j=0}^{7} \sum_{k=1}^{10000} e_k(i,j)}{64 \times 10000} \le 0.0015 \qquad (6)$$

In this paper the performance of Hybrid-LNS arithmetic when used for the IDCT is evaluated with different levels of fractional precision for the logarithmic representation of the arithmetic. The implications of these results for efficient hardware implementations of IDCT decoders built using Hybrid-LNS are considered. In Section 2 the basic properties of the Logarithmic Number System (LNS) and the Hybrid-LNS are described. Section 3 outlines the main features of the algorithm proposed for calculating both the DCT and IDCT using Hybrid-LNS. Section 4 presents the results of the algorithm implemented with Matlab™. Finally, Section 5 concludes with a summary and a description of further work.

2. LOGARITHMIC NUMBER SYSTEM

The LNS has been proposed as an alternative to both fixed point and floating point number representations for over 30 years [10]. In the LNS, a number x is represented as a fixed-point value

$$i = \log_2 x \qquad (7)$$

with extra bits to represent the sign of x and the special case when x = 0. A major advantage of the LNS is that multiplication and division in the linear domain are simply replaced by addition or subtraction in the log domain. However the operations of addition and subtraction are more complex. They are usually defined using the equations

$$\log_2(x + y) = i + \log_2\left(1 + 2^{j-i}\right) \qquad (8)$$

$$\log_2(x - y) = i + \log_2\left(1 - 2^{j-i}\right) \qquad (9)$$

where $i = \log_2 x$ and $j = \log_2 y$. Both these equations require the evaluation of a non-linear function

$$P = \log_2\left(1 \pm 2^{j-i}\right). \qquad (10)$$

It is common practice to use a Look-Up Table (LUT) to approximate P, where the accuracy of the approximation is a function of the LUT address space. A number of algorithms have been developed to minimise the cost of this LUT, which becomes prohibitively large when more than 16 bits of accuracy are required [10]. A further problem, often overlooked in the description of LNS processors, is the cost of converting numbers to and from the log domain. Although several algorithms have been proposed, they represent another limitation to the achievable accuracy and performance of LNS systems, often requiring large LUTs [11] or time-consuming iterative algorithms [12].

As a result of these issues two distinct types of LNS architecture are commonly used when implementing LNS processors designed to perform numerical calculations with floating point accuracy. The first performs all mathematical operations in the log domain and uses a LUT to perform (8) and (9) [13], while the second type, called the Hybrid-LNS processor in this paper [14], performs the operations of multiplication and division in the log domain and addition and subtraction in the linear domain. Although the Hybrid-LNS processor does not need an LUT for addition and subtraction it does need to convert frequently between the linear and log domains. Both architectures are shown in Figure 2, where they are being used to calculate an inner product.
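As a rough illustration of the two ideas above, the sketch below evaluates the log-domain addition of equation (8) and then forms an inner product in the Hybrid-LNS style: products are formed by adding logarithms and the accumulation is done in the linear domain. It is a behavioural model only, not the hardware described in this paper. The function names and the `frac_bits` parameter are hypothetical, and the conversions are assumed to be ideal logarithms rounded to `frac_bits` fractional bits rather than LUT-based converters.

```python
import math


def lns_add(i, j):
    """Return log2(x + y) given i = log2(x) and j = log2(y), as in equation (8)."""
    return i + math.log2(1.0 + 2.0 ** (j - i))


def quantise(value, frac_bits):
    """Round value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def lin2log(x, frac_bits):
    """Model a linear-to-log conversion as a (sign, zero, log2|x|) triple.

    Assumption: an ideal logarithm rounded to frac_bits fractional bits.
    """
    if x == 0:
        return (0, True, 0.0)
    return (1 if x < 0 else 0, False, quantise(math.log2(abs(x)), frac_bits))


def log2lin(sign, zero, log_value):
    """Model a log-to-linear conversion of a (sign, zero, log) triple."""
    if zero:
        return 0.0
    magnitude = 2.0 ** log_value
    return -magnitude if sign else magnitude


def hybrid_lns_inner_product(xs, ys, frac_bits):
    """Multiply in the log domain, accumulate in the linear domain."""
    total = 0.0
    for x, y in zip(xs, ys):
        sx, zx, lx = lin2log(x, frac_bits)
        sy, zy, ly = lin2log(y, frac_bits)
        # Product: add the logs, XOR the signs, OR the zero flags.
        total += log2lin(sx ^ sy, zx or zy, lx + ly)
    return total


if __name__ == "__main__":
    xs, ys = [3, 5, 7], [2, -4, 6]
    exact = sum(x * y for x, y in zip(xs, ys))
    for bits in (4, 8, 16):
        # Print the error of the modelled inner product for each precision.
        print(bits, hybrid_lns_inner_product(xs, ys, bits) - exact)
```

In this model the only error source is the rounding of the logarithms, so the printed error shrinks as `frac_bits` grows; a real converter would add its own approximation error on top.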

[Figure 2: block diagrams. (a) Inputs xn and yn pass through Lin/Log converters, a log multiply forms wn = log2 xn ± log2 yn, a Log/Lin converter gives vn = 2^wn, and a linear addition forms z = Σ vn. (b) Inputs xn and yn pass through Lin/Log converters, a log multiply forms wn = log2 xn ± log2 yn, a log addition forms v = Σ wn, and a Log/Lin converter gives z = 2^v.]
Figure 2. (a) Hybrid-LNS and (b) LNS processors

The Hybrid-LNS is generally smaller than the LNS architecture for small bit-widths. As this is the case with image data it was chosen as the architecture for performing the experiments on 2-D DCT/IDCT transforms with 256-level (8-bit) grey-scale images. This algorithm performs a direct-form implementation of the 2-D transform using two 1-D transforms. The algorithm has been implemented using Matlab™. The generic architecture of the DCT and IDCT blocks is shown in Figure 5. The same processing elements have been used to perform both the 1-D transforms in series.

2.1 The Lin2Log Converter

In the Hybrid-LNS, the log2 of a number is described using the 4-tuple (S, Z, I, F), where S is the sign bit, Z is the zero bit, I is the integer part and F is the fractional part. Figure 3 outlines the main elements used in the converter algorithm.

[Figure 3: block diagram. Labelled elements: input x, Leading Zero Detector, Barrel Shifter, Lin2Log Fraction LUT; outputs Sign, Zero, Log Integer and the log fraction, together forming Log2x.]
Figure 3. Lin2Log Converter Architecture

2.2 The Log2Lin Converter

The Log2Lin converter shown in Figure 4 has a similar architecture to the Lin2Log converter, where the integer part of the log is used to control a barrel shifter. The input to the barrel shifter is the fractional part of the log2 converted using a small LUT.

[Figure 4: block diagram. Labelled elements: input Log2x split into Sign, Zero, Integer and Fraction, Log2Lin LUT, Barrel Shifter, Sign/Magnitude Converter, output x.]
Figure 4. Log2Lin Converter Architecture

3. THE IDCT ARCHITECTURE

The standard architecture shown in Figure 5 was modelled in Matlab™ to test the performance of the Hybrid-LNS 2-D IDCT with a range of fractional precision bits for the logarithmic components. Scaling is used at the output of the first 1-D IDCT to minimise the size of the dual port memory needed to store the temporary results. In most implementations of this architecture (see for example [15]) the throughput can be increased by using multiple 1-D IDCT units in parallel. It is not uncommon for 4 or 8 such units to be working in parallel. In the linear domain this means that the most costly element is the coefficient multiplier, which has to be at least 16x16 bits to achieve 1180-1990 performance [16], whereas for the Hybrid-LNS it is the Log2Lin converter.

[Figure 5: block diagram. An 8x8 IDCT block feeds a first 1-D IDCT stage, a Dual Port Memory and a second 1-D IDCT stage that outputs the 8x8 Image Block. Each stage is labelled with a Lin/Log converter, a coefficient LUT (Log) (labelled "IDCT Coef LUT" in one stage and "DCT Coef LUT" in the other), a Log/Lin converter, a summation Σ and a Scale & Round block.]
Figure 5. IDCT Architecture

4. RESULTS

A qualitative indication of the performance of the IDCT algorithm used here is shown in Figures 6 and 7, where the performance with 6 and 8 bits of fractional precision is compared to the ideal IDCT generated in Matlab™ with double-precision FP arithmetic. Figure 7 uses a histogram to show the changes to the spread of data values occurring in the image due to the logarithmic coding.

The DCT coefficients are also converted into log2 format. As they are constant they are calculated off-line and stored separately. However, because the range of the cosine terms is between -1.0 and +1.0 they have been scaled by multiplying by $2^{12}$ prior to conversion to ensure there is no underflow. As in the linear case the IDCT is particularly sensitive to the accuracy with which the coefficients are defined. This scaling factor is removed after the summation in the linear domain as shown in Fig. 2.

To quantify the performance with respect to the 1180-1990 standard, the model was tested with a set of 10000 8x8 DCT blocks generated as defined in the standard [9]. The mismatch error calculated against an ideal converter was used to determine the pmse, omse, pme and ome as described in (3)-(6). These results are tabulated in Tables 1-3, where scaling factors of ±5, ±256 and ±300 are used.
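The test procedure just described can be sketched as follows. This is a hedged illustration, not the Matlab model used here. The reference is the standard double-precision 8x8 inverse DCT written as two 1-D passes. The decoder under test is a hypothetical stand-in (the same IDCT with its cosine table rounded to `frac_bits` fractional bits, not a Hybrid-LNS model). Random coefficient blocks from Python's `random` module stand in for the blocks generated as defined in the standard, and both outputs are assumed to be rounded to integers before comparison. The limits are those quoted in this paper; comparing the magnitudes of pme and ome against them is a further assumption of the sketch, and all function names are illustrative.

```python
import math
import random

N = 8
LIMITS = {"ppe": 1, "pmse": 0.06, "omse": 0.02, "pme": 0.015, "ome": 0.0015}

# BASIS[u][x] = C_u * cos((2x + 1) * u * pi / 16), with C_0 = 1/sqrt(2), else 1.
BASIS = [[(1 / math.sqrt(2) if u == 0 else 1.0)
          * math.cos((2 * x + 1) * u * math.pi / 16) for x in range(N)]
         for u in range(N)]


def quantised_basis(frac_bits):
    """Stand-in for a finite-precision decoder: round the cosine table."""
    scale = 1 << frac_bits
    return [[round(v * scale) / scale for v in row] for row in BASIS]


def idct_2d(coeffs, basis=BASIS):
    """8x8 inverse DCT as two 1-D passes, with the output rounded to integers."""
    tmp = [[sum(coeffs[m][n] * basis[n][j] for n in range(N)) / 2
            for j in range(N)] for m in range(N)]
    return [[round(sum(basis[m][i] * tmp[m][j] for m in range(N)) / 2)
             for j in range(N)] for i in range(N)]


def mismatch_metrics(errors):
    """Return ppe, worst pmse, omse, worst |pme| and |ome| for K error blocks."""
    k = len(errors)
    cells = [(i, j) for i in range(N) for j in range(N)]
    pmse = [sum(e[i][j] ** 2 for e in errors) / k for i, j in cells]
    pme = [sum(e[i][j] for e in errors) / k for i, j in cells]
    return {
        "ppe": max(abs(e[i][j]) for e in errors for i, j in cells),
        "pmse": max(pmse),
        "omse": sum(pmse) / len(cells),
        "pme": max(abs(v) for v in pme),
        "ome": abs(sum(pme) / len(cells)),
    }


def run_test(frac_bits, blocks=200, amplitude=256, seed=1):
    """Compare the stand-in decoder with the reference over random blocks."""
    rng = random.Random(seed)
    basis = quantised_basis(frac_bits)
    errors = []
    for _ in range(blocks):
        coeffs = [[rng.randint(-amplitude, amplitude) for _ in range(N)]
                  for _ in range(N)]
        ref, dut = idct_2d(coeffs), idct_2d(coeffs, basis)
        errors.append([[ref[i][j] - dut[i][j] for j in range(N)]
                       for i in range(N)])
    return mismatch_metrics(errors)


def complies(metrics):
    """True when every metric is within its quoted limit."""
    return all(metrics[name] <= limit for name, limit in LIMITS.items())
```

Calling `run_test(frac_bits)` for a range of precisions and passing the result to `complies` mirrors the way the tables below are read, although the numbers it produces belong to the stand-in decoder and are not comparable with the tabulated Hybrid-LNS results.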

Table 4 summarises the results obtained with a fixed-point equivalent architecture as reported in [16].

LUT address space (bits)   PPE   PMSE   OMSE     PME     OPME
Limit                      1     0.06   0.02     0.015   0.0015
6                          1     0.08   0.029    0.02    0.002
8                          1     0.04   0.0058   0.02    0.0001
10                         1     0.02   0.002    0.02    0.0009
12                         1     0.01   0.0005   0.01    0.0002
14                         0     0      0        0       0
16                         0     0      0        0       0

Table 1. Data weighted by ±5, DCT coefficients = 16 bit

Figure 6. Reconstructed image after IDCT: (a) ideal, (b) 6-bit fractional precision, (c) 8-bit fractional precision.

[Figure 7: three histogram panels (a), (b) and (c); horizontal axis from 0 to 1, vertical axis from 0 to 3500.]
Figure 7. Normalised histogram of image: (a) ideal, (b) with 6-bit fractional precision, (c) with 8-bit fractional precision.

LUT address space (bits)   PPE   PMSE   OMSE     PME     OPME
Limit                      1     0.06   0.02     0.015   0.0015
6                          10    1.5    1.1      1.0     0.087
8                          2     0.59   0.40     0.19    0.0045
10                         1     0.12   0.069    0.06    0.0015
12                         1     0.06   0.02     0.05    0.0001
14                         1     0.02   0.0047   0.02    0.0009
16                         1     0.01   0.0018   0.01    0.0003

Table 2. Data weighted by ±256, DCT coefficients = 16 bit

LUT address space (bits)   PPE   PMSE   OMSE     PME     OPME
Limit                      1     0.06   0.02     0.015   0.0015
6                          11    1.4    1.04     0.93    0.11
8                          2     0.47   0.38     0.16    0.0014
10                         1     0.13   0.06     0.07    0.0003
12                         1     0.06   0.017    0.05    0.0002
14                         1     0.03   0.0047   0.02    0.0003
16                         1     0.01   0.0014   0.01    0.0002

Table 3. Data weighted by ±300, DCT coefficients = 16 bit
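To make the comparison against the limits explicit, the following sketch checks each row of Table 2 (data weighted by ±256) against the 1180-1990 limits listed beneath the column headings. The dictionary and function names are illustrative and the values are simply transcribed from the table. A value equal to its limit is treated as a pass, which is an assumption given that the tabulated figures are rounded.

```python
# Limits and Table 2 rows, in the column order PPE, PMSE, OMSE, PME, OPME.
LIMITS = {"PPE": 1, "PMSE": 0.06, "OMSE": 0.02, "PME": 0.015, "OPME": 0.0015}
TABLE_2 = {
    6: (10, 1.5, 1.1, 1.0, 0.087),
    8: (2, 0.59, 0.40, 0.19, 0.0045),
    10: (1, 0.12, 0.069, 0.06, 0.0015),
    12: (1, 0.06, 0.02, 0.05, 0.0001),
    14: (1, 0.02, 0.0047, 0.02, 0.0009),
    16: (1, 0.01, 0.0018, 0.01, 0.0003),
}


def failed_limits(row):
    """Return the names of the metrics in one table row that exceed their limit."""
    return [name for name, value in zip(LIMITS, row) if value > LIMITS[name]]


def compliant_widths(table):
    """Return the LUT address widths whose rows meet every limit."""
    return [bits for bits in sorted(table) if not failed_limits(table[bits])]


if __name__ == "__main__":
    for bits in sorted(TABLE_2):
        print(bits, failed_limits(TABLE_2[bits]) or "meets all limits")
```

With the values as printed, the 12-bit and 14-bit rows exceed only the PME limit and the 16-bit row meets all five.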


Data range and sign    PMSE    OMSE     PME      OPME
Limit                  0.06    0.02     0.015    0.0015
L=256, H=255, +        0.023   0.0193   0.0023   0.0001
L=256, H=255, -        0.022   0.0192   0.0028   0.0029
L=5, H=5, +            0.021   0.0183   0.0041   0.0000
L=5, H=5, -            0.022   0.0184   0.0032   0.0001
L=300, H=300, +        0.02    0.0166   0.0048   0.0003
L=300, H=300, -        0.02    0.0166   0.0032   0.0001

Table 4. 1180-1990 compliant fixed-point architecture [16], 16-bit internal solution.

5. CONCLUSIONS AND FURTHER WORK

The results above show that although it is possible to fulfil the IEEE 1180-1990 specification using Hybrid-LNS, the levels of fractional precision necessary make it unlikely that this type of decoder could be used effectively in video telephony applications where intra-frames are only transmitted infrequently (once every 100 frames or less often). However, where a higher intra-frame rate is possible, as in MPEG (for example every 12 frames), or where there is no feedback path using the IDCT, as in JPEG, the Hybrid-LNS remains a potential low-power, low-complexity solution when implemented in custom silicon or even an FPGA, albeit with a minimum of 10 bits of fractional precision. As it is necessary for 1180-1990 compliance to use logarithms with levels of precision that are similar to that required in fixed-point solutions for low intra-frame rates, many of the advantages of the Hybrid-LNS approach are negated.

As shown in Figure 5 the major component of the 1-D transform in the Hybrid-LNS system is the Log2Lin conversion between domains. At 16 bits the simple LUT approach used in the low-resolution converter solutions is not viable as the LUT becomes exponentially larger with the address space. There are many conversion algorithms in the literature that attempt to address this problem. The most successful ones are based on Taylor or polynomial approximations to the Log2Lin curve. However, both methods require significant resources in terms of LUTs and multipliers. Alternative methods for conversion need to be found that make the logarithm approach competitive with the cost of the major component in the linear system, the 16x16 multiplier.

In [17] an algorithm based on that of Kotsopoulos [12] is presented which uses a reduced set of LUTs together with two small multiplier units. However, it is yet to be determined experimentally in hardware whether this has a significant advantage in terms of logic resources and power dissipation when compared to a fixed-point solution for the IDCT. Solutions based on the algorithms by Pickett [18] and Knittel [19] offer the possibility of generating an efficient "multiplierless" converter with sufficient accuracy for this application, but further work is necessary to determine the overall hardware cost of these algorithms at the required precision.

REFERENCES

[1] K. R. Rao, J. J. Hwang, Techniques and Standards for Image, Video and Audio Coding. Prentice Hall, 1996.
[2] V. Bhaskaran, K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, 2nd Ed. Kluwer, 1997.
[3] P. Pirsch, N. Demassieux, W. Gehrke, "VLSI Architectures for Video Compression - A Survey." Proc. IEEE, Vol. 83, No. 2, pp 220-246, Feb. 1995.
[4] Wen-Hsiung Chen, C. Harrison Smith, S. C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform." IEEE Transactions on Communications, Vol. COM-25, No. 9, pp 1004-1009, Sep. 1977.
[5] M. G. Arnold, "LNS for low-power MPEG decoding." Advanced Signal Processing Algorithms, Architectures and Implementations XII, SPIE, Seattle, Washington, July 2002.
[6] Sheng-Chieh Huang, Liang-Gee Chen, "A Log-Exp Image Compression Chip Design." IEEE Transactions on Consumer Electronics, Vol. 45, No. 3, pp 812-818, Aug. 1999.
[7] P. Lee, "An evaluation of a Hybrid-Logarithmic Number System DCT/IDCT Algorithm." IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 2005.
[8] V. Paliouras, T. Stouraitis, "Low-Power Properties of the Logarithmic Number System." 15th IEEE Symp. on Computer Arithmetic, pp 229-236, 2001.
[9] IEEE 1180-1990 Standard. IEEE Press, 1991.
[10] E. E. Swartzlander, A. G. Alexopoulos, "The Sign/Logarithm Number System." IEEE Transactions on Computers, pp 1238-1242, 1975.
[11] J. N. Coleman, "Simplification of Table Structure in Logarithmic Arithmetic." IEE Electronics Letters, Vol. 31, No. 22, pp 1905-1906, 1995.
[12] D. K. Kotsopoulos, "An Algorithm for the Computation of Binary Logarithms." IEEE Transactions on Computers, Vol. 40, No. 11, pp 1267-1270, 1991.
[13] J. N. Coleman, E. I. Chester, et al., "Arithmetic on the European Logarithmic Processor." IEEE Transactions on Computers, Vol. 49, No. 7, pp 702-715, 2000.
[14] F. J. Taylor, "An Extended Precision Logarithmic Number System." IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-31, No. 1, pp 232-234, 1983.
[15] Xilinx Application Note XAPP611 (V1.1), available at www.xilinx.com, June 2002.
[16] Xilinx Logicore 2-D Discrete Cosine Transform, Product Specification, available at www.xilinx.com/ipcenter, March 14, 2002.
[17] P. Lee, "A Linear-to-log Conversion Algorithm with Reduced Memory Requirements." To be submitted to IEEE Transactions on Computers.
[18] L. C. Pickett, "High Speed Logarithmic Function Generating Apparatus." US Patent No. 5,359,551, Oct. 1994.
[19] G. Knittel, "A Fast Logarithm Converter." 7th IEEE International ASIC Conference, pp 450-453, 1994.