IEICE TRANS. FUNDAMENTALS, VOL.E88–A, NO.12 DECEMBER 2005


PAPER

Special Section on VLSI Design and CAD Algorithms

Quality and Power Efficient Architecture for the Discrete Cosine Transform

Chi-Chia SUNG†a), Student Member, Shanq-Jang RUAN†b), Member, Bo-Yao LIN†, Nonmember, and Mon-Chau SHIE†, Member

SUMMARY In recent years, the demand for battery-operated multimedia mobile devices has created a need for low-power implementations of video compression. Many compression standards require the discrete cosine transform (DCT) to perform image/video compression, so low-power DCT design has become increasingly important in image/video processing. This paper presents a new power-efficient Hybrid DCT architecture that combines the Loeffler DCT and the binDCT by exploiting the different sensitivity of human vision to luminance and chrominance. We use Synopsys PrimePower to estimate the power consumption in a TSMC 0.25-µm technology. We also adopt a quality assessment method based on structural distortion measurement to measure quality instead of the peak signal-to-noise ratio (PSNR) and mean squared error (MSE). We conclude that our Hybrid DCT offers quality similar to that of the Loeffler DCT while saving 25% in power consumption and 27% in chip area.
key words: Loeffler DCT, binDCT, DAA, SSIM, low power

1. Introduction

In recent years, many digital image processing and video compression techniques have been proposed in the literature [1]. New-generation image and video compression standards and techniques, such as JPEG, digital watermarking, MPEG and H.263, have been proposed for higher quality and better performance. All of them rely on the discrete cosine transform (DCT) [2], which translates image data from the spatial domain to the frequency domain. At the same time, modern mobile devices executing multimedia applications consume a significant amount of power. To extend battery life, low-power DCT design has become increasingly important for mobile image/video processing.

Over the past few years, much research has been devoted to low-power DCT design [3]–[10]. In general, the distributed arithmetic architecture (DAA) and the flow-graph algorithm (FGA) are the most popular approaches for implementing the fast DCT (FDCT) in VLSI [11]–[16]. Although the area advantages of the DAA are well known, the DAA actually consumes more energy than the FGA [17]. The FGA therefore has better power efficiency and is more suitable for low-power design [18]. Loeffler et al. [15] proposed a low-complexity FDCT/IDCT algorithm based on the FGA that requires only 11 multiplications and 29 additions. However, multiplication accounts for about 40% of the power consumption in DCT computation and occupies almost 45% of the total chip area. Observing that multiplication dissipates a large amount of power, Tran [19] proposed the binDCT, which approximates the multiplications with adders and shifters. It consumes about 38% less power than the Loeffler DCT.

To evaluate different image compression approaches, the peak signal-to-noise ratio (PSNR) and mean squared error (MSE) are simple and well-known measures. However, a large numerical error does not always correspond to a large perceived distortion [20]–[22]. We therefore adopt an assessment method based on structural distortion measurement to evaluate the quality of the reconstructed results [23]. This method is more consistent with the behavior of the human visual system than PSNR.

The human visual system is known to be less sensitive to chrominance resolution than to luminance resolution. In this paper, we compare the area, power consumption, and distortion of the Loeffler DCT, the DAA and the binDCT. Based on this comparison, we propose a Hybrid DCT architecture that combines the Loeffler DCT and the binDCT by exploiting the difference between luminance and chrominance. To reduce power consumption, we use the binDCT for the chrominance stream; to maintain quality, we use the Loeffler DCT for the luminance stream. The hybrid design reduces both power consumption and chip area compared with an approach that uses only the Loeffler DCT, and it achieves better quality than the binDCT alone. The power efficiency and high video quality of the Hybrid DCT make it especially suitable for power-aware designs such as mobile handsets, digital cameras, cell phones and wireless set-top boxes.

The rest of this paper is organized as follows. Section 2 briefly introduces and compares the Loeffler DCT, DAA and binDCT algorithms. Section 3 presents the Hybrid DCT architecture and its functional blocks. Section 4 presents the power and quality analysis, and Sect. 5 concludes the paper.

Manuscript received March 14, 2005. Manuscript revised June 13, 2005. Final manuscript received July 28, 2005.
† The authors are with the Department of Electronic Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan.
a) E-mail: [email protected]
b) E-mail: [email protected]
DOI: 10.1093/ietfec/e88–a.12.3500

Copyright © 2005 The Institute of Electronics, Information and Communication Engineers

SUNG et al.: QUALITY AND POWER EFFICIENT ARCHITECTURE FOR THE DISCRETE COSINE TRANSFORM


2. DCT Algorithms

2.1 The DCT Background
The two-dimensional (2-D) DCT in Eq. (1) transforms an 8 × 8 block of video frame samples from the spatial domain $f(x, y)$ into the frequency domain $F(k, l)$:

$$F(k, l) = \frac{1}{4} C(k) C(l) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x, y) \cos\left[\frac{(2x+1)k\pi}{16}\right] \cos\left[\frac{(2y+1)l\pi}{16}\right], \tag{1}$$

$$C(k) = \begin{cases} 1/\sqrt{2}, & \text{if } k = 0 \\ 1, & \text{otherwise,} \end{cases} \qquad C(l) = \begin{cases} 1/\sqrt{2}, & \text{if } l = 0 \\ 1, & \text{otherwise.} \end{cases}$$

Since computing the above 2-D DCT by direct matrix multiplication requires $8^4$ multiplications, a commonly used approach in hardware designs to reduce the computational complexity is row-column decomposition. The decomposition performs a row-wise one-dimensional (1-D) transform followed by a column-wise 1-D transform with an intermediate transposition. An 8-point 1-D DCT can be expressed as follows:

$$F(k) = \frac{1}{2} C(k) \sum_{x=0}^{7} f(x) \cos\left[\frac{(2x+1)k\pi}{16}\right], \qquad C(k) = \begin{cases} 1/\sqrt{2}, & \text{if } k = 0 \\ 1, & \text{otherwise.} \end{cases} \tag{2}$$

This decomposition approach has two advantages. First, the number of operations is reduced. Second, the 1-D DCT can readily be replaced by a different algorithm in the future.

2.2 Distributed Arithmetic Algorithm
The distributed arithmetic algorithm (DAA) is widely used in VLSI implementations of digital signal processing (DSP) architectures. Its main advantage is that it speeds up the multiply operation by pre-computing all possible dot-product values and storing them in a ROM. In other words, it yields a more regular structure than a large parallel multiplier. Our DA architecture is slightly modified from [1], [24]. To implement the DAA, we rewrite Eq. (2) as a sum of products:

$$F(k) = \sum_{x=0}^{7} C_{x,k} f_x, \tag{3}$$

where

$$C_{x,k} = \frac{C(k)}{2} \cos\left[\frac{(2x+1)k\pi}{16}\right].$$

Consequently, Eq. (3) can be decomposed into two smaller summations:

$$F(k) = \sum_{x=0}^{3} C_{x,k} u_x \quad (k = 0, 2, 4, 6), \tag{4}$$

$$F(k) = \sum_{x=0}^{3} C_{x,k} v_x \quad (k = 1, 3, 5, 7), \tag{5}$$

where $u_x = f_x + f_{7-x}$ and $v_x = f_x - f_{7-x}$ $(x = 0, 1, 2, 3)$. Equations (4) and (5) are suitable for implementation using a technique known as distributed arithmetic [24]. Figure 1 shows a detailed example of a distributed arithmetic ROM-and-accumulator (RAC) architecture. Each multiplication is carried out one bit at a time by using a ROM and an accumulator rather than a parallel multiplier. The four input elements $u_x$ (with bits $u_{xj}$) are 8-bit 2's complement integers. Every clock cycle, the Result register doubles its previous value and adds the current ROM contents. The 16 possible ROM outputs, one for each combination of the four address bits, are pre-calculated and stored in the ROM. During each cycle, the four registers that hold the four elements $u_x$ are shifted to the right. The sign control pulse $T_{sc}$ is activated when the ROM is addressed by bit 3. In this case, the add operation of the ALU is changed to a subtract operation. After eight cycles, the dot product is available in the Result register. Figure 2 shows the block diagram of a 1-D 8-point distributed arithmetic architecture. Note that the ROM data length is 16 bits in our DAA design. The DAA saves chip area and also reduces power consumption. However, it still consumes more energy than the FGA because it needs more computation cycles, as discussed in the following sections.
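The bit-serial accumulation described above can be sketched in software. The following Python fragment is only a minimal illustration and not the authors' hardware: it assumes MSB-first processing (so that doubling the accumulator matches the weight of each bit), uses floating-point ROM contents instead of the 16-bit fixed-point words of the actual design, and the names `dct_coeff`, `build_rom` and `rac_dot` are hypothetical.

```python
import math

def dct_coeff(x, k):
    """C_{x,k} of Eq. (3): (C(k) / 2) * cos((2x + 1) k pi / 16)."""
    ck = 1 / math.sqrt(2) if k == 0 else 1.0
    return 0.5 * ck * math.cos((2 * x + 1) * k * math.pi / 16)

def build_rom(coeffs):
    """Pre-compute the 16 partial sums selected by a 4-bit ROM address."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(16)]

def rac_dot(rom, inputs, bits=8):
    """Bit-serial dot product of four signed `bits`-bit integers (assumed MSB first)."""
    acc = 0.0
    for j in range(bits - 1, -1, -1):
        # Bit j of every input forms the ROM address for this cycle.
        addr = 0
        for i, value in enumerate(inputs):
            addr |= ((value >> j) & 1) << i
        term = rom[addr]
        if j == bits - 1:
            term = -term          # sign bit of 2's complement: subtract
        acc = 2 * acc + term      # double the previous result, add ROM word
    return acc

f = [10, -3, 7, 0, 5, -8, 2, 1]                    # arbitrary sample row
u = [f[x] + f[7 - x] for x in range(4)]            # inputs of Eq. (4)
v = [f[x] - f[7 - x] for x in range(4)]            # inputs of Eq. (5)

f2_serial = rac_dot(build_rom([dct_coeff(x, 2) for x in range(4)]), u)
f1_serial = rac_dot(build_rom([dct_coeff(x, 1) for x in range(4)]), v)
f2_direct = sum(dct_coeff(x, 2) * f[x] for x in range(8))   # Eq. (3)
f1_direct = sum(dct_coeff(x, 1) * f[x] for x in range(8))   # Eq. (3)
```

With these assumptions, eight passes through the loop reproduce the result of Eq. (3) without any multiplication of data by coefficients, at the cost of one cycle per input bit.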

Fig. 1 Distributed arithmetic ROM and accumulator structure [24].
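As a numerical check of the row-column decomposition of Sect. 2.1, the sketch below evaluates Eq. (1) directly and compares it with a row-wise pass of Eq. (2), a transposition, and a column-wise pass. It is a plain-Python illustration under the assumption that `block[x][y]` holds $f(x, y)$; the function names are hypothetical and no fast algorithm is used.

```python
import math

def c(k):
    """Normalization factor C(k) of Eqs. (1) and (2)."""
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct_1d(f):
    """8-point 1-D DCT, evaluated literally from Eq. (2)."""
    return [0.5 * c(k) * sum(f[x] * math.cos((2 * x + 1) * k * math.pi / 16)
                             for x in range(8))
            for k in range(8)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d_row_column(block):
    """Row-wise 1-D DCT, transposition, then column-wise 1-D DCT."""
    rows = [dct_1d(row) for row in block]            # transform over y
    cols = [dct_1d(col) for col in transpose(rows)]  # transform over x
    return transpose(cols)                           # result indexed [k][l]

def dct_2d_direct(block):
    """2-D DCT evaluated literally from Eq. (1)."""
    return [[0.25 * c(k) * c(l) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * k * math.pi / 16)
                * math.cos((2 * y + 1) * l * math.pi / 16)
                for x in range(8) for y in range(8))
             for l in range(8)]
            for k in range(8)]

block = [[(8 * x + y) % 17 - 8 for y in range(8)] for x in range(8)]
direct = dct_2d_direct(block)
separable = dct_2d_row_column(block)
max_diff = max(abs(direct[k][l] - separable[k][l])
               for k in range(8) for l in range(8))
```

Counting only the cosine products, the direct form needs 64 × 64 = 4096 multiplications per block, whereas sixteen 8-point transforms evaluated from Eq. (2) need 16 × 64 = 1024, before any fast algorithm is applied.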

Fig. 2 1-D distributed arithmetic architecture.

2.3 Loeffler DCT Algorithm
Many 1-D flow-graph algorithms have been reported in the literature. They aim to reduce the number of multiplications because the general-purpose multiplier is the major power consumer in DCT computation. Choosing an algorithm that minimizes the number of multiplications is therefore very important for low power. Table 1 summarizes the complexity of several flow-graph algorithms [25]. The theoretical lower bound on the number of multiplications required for the 1-D 8-point DCT has been proven to be 11 [26], and the Loeffler DCT attains this bound. We therefore select the Loeffler DCT algorithm to handle the luminance stream in our Hybrid DCT. As Table 1 shows, Loeffler's 1-D 8-point DCT algorithm uses only 11 multiplications and 29 additions. Figure 3 shows the flow graph of the Loeffler DCT and Fig. 4 describes the transfer functions of its building blocks. Note that the second building block ($kc_n$), the rotation, can be calculated with only 3 multiplications and 3 additions instead of 4 multiplications and 2 additions by using Eq. (6):

$$\begin{aligned} O_0 &= aI_0 + bI_1 = (b - a)I_1 + a(I_0 + I_1) \\ O_1 &= -bI_0 + aI_1 = -(a + b)I_0 + a(I_0 + I_1) \end{aligned} \qquad a = k\cos\frac{n\pi}{16}, \quad b = k\sin\frac{n\pi}{16}. \tag{6}$$

Because $a$, $b - a$ and $a + b$ are constants that can be pre-computed, only the three products in Eq. (6) remain at run time.

Table 1 Complexities of different FDCT algorithms [25].

Algorithm        Multipliers   Adders
Chen [14]        16            26
Wang [16]        13            29
Lee [13]         12            29
Vetterli [27]    12            29
Suehiro [28]     12            29
Hou [29]         12            29
Loeffler [15]    11            29
Jeong [30]       12            28

Fig. 3 Loeffler's FDCT algorithm [15].

Fig. 4 Transfer function of Loeffler's FDCT building blocks [15].

2.4 The binDCT Algorithm
Although the DCT is very useful for image/video compression, it cannot map integers to integers without loss because the DCT is a floating-point transform. More importantly, hardware implementations of floating-point algorithms need more area and power. Integer-friendly approximations of the DCT are therefore necessary [19], [31]. Tran [19] presented a fast biorthogonal block transform called the binDCT that can be implemented by using only shift and add operations (i.e., without multiplication). The binDCT is an approximation of the FGA DCT and maps integers to integers with exact reconstruction. This property is pivotal in transform-based lossless coding and allows a unifying lossless coding framework. Table 2 compares three versions of the binDCT [32]. Among them, the binDCT-c has the lowest number of operations, so we select this version to handle the chrominance stream in our Hybrid DCT. Figure 5 shows the flow graph of the binDCT-c architecture. Note that floating-point multiplications are replaced by hardware-friendly dyadic rationals (i.e., of the form $k/2^m$, where $k$ and $m$ are integers), which can be implemented efficiently with shifters and adders [33].

Table 2 Comparison of binDCT transform complexity.

Transform         No. of Adders   No. of Shifters
8 x 8 binDCT-a    36              19
8 x 8 binDCT-b    31              14
8 x 8 binDCT-c    30              13

Fig. 5 binDCT-c algorithm [32].

2.5 Comparison of Different DCT Algorithms
In Table 3, we compare the area, power/energy consumption and distortion of these algorithms. Table 3 clearly indicates that the DAA performs well in both area and distortion. Unfortunately, the DAA actually consumes more energy than the traditional FGA DCT [18]. This shortcoming arises because the RAC used in the DAA needs eight cycles to produce one dot-product result. For low power, we have to avoid this drawback in our hybrid design. The Loeffler DCT, on the other hand, realizes the DCT architecture with multipliers and adders. Although multipliers introduce more power dissipation and area, they offer lower distortion. Moreover, as noted previously, the binDCT uses shifters and adders to approximate the multiply operations by means of dyadic lifting steps [33]. Hence the binDCT causes more image distortion than the Loeffler DCT [32], even though it has lower power consumption and area. In the next section, we take advantage of both the Loeffler DCT and the binDCT and separate the luminance and chrominance streams to build our Hybrid DCT architecture.

Table 3 Comparison of the Loeffler DCT, the DAA and the binDCT.

               Area   Power    Energy      Distortion
Loeffler DCT   poor   normal   good        good
DAA            good   good     very poor   good
binDCT         good   good     good        normal

Table 3 compares the area, power/energy consumption, and rate distortion of the Loeffler DCT, DAA and binDCT architectures.
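The two ways of handling multiplications that this section contrasts can be illustrated with a short Python sketch. The first pair of functions checks the three-multiplication rotation of Eq. (6) against the straightforward four-multiplication form. The second pair shows a generic integer lifting step whose multipliers are dyadic rationals realized with shifts and additions only; the constants 3/8 and 1/2 are hypothetical and are not claimed to be those of the binDCT-c, and all function names are illustrative.

```python
import math

def rotate_4mul(i0, i1, n, k=1.0):
    """Rotation block evaluated directly: 4 multiplications, 2 additions."""
    a = k * math.cos(n * math.pi / 16)
    b = k * math.sin(n * math.pi / 16)
    return a * i0 + b * i1, -b * i0 + a * i1

def rotate_3mul(i0, i1, n, k=1.0):
    """Rotation block via Eq. (6): 3 multiplications, 3 additions."""
    a = k * math.cos(n * math.pi / 16)
    b = k * math.sin(n * math.pi / 16)
    b_minus_a = b - a               # constant, pre-computed in hardware
    a_plus_b = a + b                # constant, pre-computed in hardware
    shared = a * (i0 + i1)          # 1 multiplication, 1 addition
    o0 = b_minus_a * i1 + shared    # 1 multiplication, 1 addition
    o1 = -a_plus_b * i0 + shared    # 1 multiplication, 1 addition
    return o0, o1

def times_3_over_8(x):
    """Multiply an integer by the dyadic rational 3/8 with shifts and one add."""
    return ((x << 1) + x) >> 3      # (2x + x) / 8, rounded toward minus infinity

def lift_forward(x0, x1):
    """Two integer lifting steps with hypothetical multipliers 3/8 and 1/2."""
    x1 = x1 - times_3_over_8(x0)
    x0 = x0 + (x1 >> 1)
    return x0, x1

def lift_inverse(y0, y1):
    """Undo the lifting steps in reverse order; exact for all integers."""
    x0 = y0 - (y1 >> 1)
    x1 = y1 + times_3_over_8(x0)
    return x0, x1
```

The lifting pair is invertible regardless of the rounding inside each step, which is the mechanism that lets a shift-and-add transform map integers to integers with exact reconstruction, while the rounding itself is the source of the extra distortion relative to a multiplier-based rotation.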

3. Hybrid DCT Architectures

As shown in Table 3, the Loeffler DCT and the binDCT each have their merits in power/energy consumption and distortion, and the difference in distortion is reflected in quality. Moreover, as mentioned in Sect. 1, the human visual system is less sensitive to chrominance resolution than to luminance resolution. Considering these different properties, we propose a hybrid method that uses the Loeffler DCT for luminance and the binDCT for chrominance. We also attempt to find the best trade-off between quality and power efficiency under different criteria.

Figure 6 shows the framework of our Hybrid DCT architecture. We separate the frame into luminance (Y) and chrominance (CbCr) streams for high quality and low power. Note that the luminance and chrominance streams are already separated in traditional designs. To obtain better quality, we use the Loeffler DCT, which has less distortion, to handle the luminance stream. To reduce power consumption, we use the binDCT, which has larger distortion, to handle the chrominance stream. Although the binDCT decreases color quality, color defects are less noticeable in video sequences or compressed images. In short, we use the Loeffler DCT to keep quality and the binDCT to reduce power consumption and chip area. The power dissipation of the Hybrid DCT lies between those of the Loeffler DCT and the binDCT, while we expect its quality to be close to that of the Loeffler DCT. We thus combine the two algorithms to achieve low power with high throughput while maintaining reasonable quality.

In addition, we also exchange the luminance and chrominance input streams without changing the Hybrid DCT architecture; the result is named the inv-Hybrid DCT and is shown in Fig. 7. To show that our hybrid method is the better choice, we evaluate our Hybrid DCT architecture in the next section using different measurement criteria.

Fig. 6 Hybrid DCT architecture.

Fig. 7 The inv-Hybrid DCT architecture.

Table 4 Gate-level circuit simulation results (TSMC 0.25-µm at 2.5 V and 33 MHz).

                    Loeffler   Hybrid   DAA     binDCT
Power (mW)          128.5      96.05    48.7    63.42
Energy (µJ)         12.32      9.2      64.99   6.08
Area (gate count)   152K       106K     65K     60K
Quality             Best       Good     Good    Worst
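The relative figures that follow from Table 4 can be reproduced with a few lines of arithmetic. The sketch below copies the power, energy and gate-count values from Table 4; the helper names are hypothetical, and the run time is only an implied quantity obtained under the assumption that energy equals average power multiplied by processing time.

```python
TABLE_4 = {
    "Loeffler": {"power_mw": 128.5, "energy_uj": 12.32, "gates_k": 152},
    "Hybrid":   {"power_mw": 96.05, "energy_uj": 9.2,   "gates_k": 106},
    "DAA":      {"power_mw": 48.7,  "energy_uj": 64.99, "gates_k": 65},
    "binDCT":   {"power_mw": 63.42, "energy_uj": 6.08,  "gates_k": 60},
}

def percent_saving(reference, value):
    """How much smaller `value` is than `reference`, in percent."""
    return 100.0 * (reference - value) / reference

def percent_more(value, reference):
    """How much larger `value` is than `reference`, in percent."""
    return 100.0 * (value - reference) / reference

def implied_time_us(entry):
    """Assumed relation: time = energy / power (µJ / mW = ms, scaled to µs)."""
    return 1000.0 * entry["energy_uj"] / entry["power_mw"]

loeffler, hybrid = TABLE_4["Loeffler"], TABLE_4["Hybrid"]

power_saving = percent_saving(loeffler["power_mw"], hybrid["power_mw"])
extra_power = percent_more(loeffler["power_mw"], hybrid["power_mw"])
extra_area = percent_more(loeffler["gates_k"], hybrid["gates_k"])
times_us = {name: implied_time_us(entry) for name, entry in TABLE_4.items()}
```

With the Table 4 values, `power_saving` is about 25%, and the Loeffler DCT needs about 34% more power and 43% more gates than the Hybrid DCT. Under the stated assumption, the Loeffler, Hybrid and binDCT designs all imply a run time of roughly 96 µs, whereas the DAA implies roughly 1.33 ms, which is why its energy is the highest despite its low power.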

4. Results

4.1 Experiment Environment
To analyze the performance of the hybrid architecture, we modeled four architectures: Loeffler DCT, binDCT, Hybrid DCT, and inv-Hybrid DCT. After synthesizing them with the TSMC 0.25-µm technology library, we used Synopsys PrimePower to estimate power consumption with QCIF patterns (4:2:2 sampling), dumping value change dump (VCD) files to make the power analysis more accurate. In addition, we applied these DCT architectures to a JPEG system to compare the quality of the compression results [34].

4.2 Quality Analysis
PSNR and MSE are widely used because they are simple to calculate, have clear physical meanings, and are mathematically convenient for optimization. However, they have also been widely criticized for not matching perceived visual quality well [20]–[22]. Z. Wang et al. [23], [35] proposed a quality assessment system based on structural distortion, which considers more features of human perception than traditional error-sensitivity-based methods. From an image formation point of view, the "structural information" of an image is the set of attributes that reflect the structure of the objects in the scene, independent of the average luminance and contrast of the image. This leads to an image quality assessment approach that separates the measurement of luminance, contrast and structural distortions. The approach is called the Structural SIMilarity (SSIM) index algorithm [36]. The overall SSIM index between signals $x$ and $y$ is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \tag{7}$$

where $\mu_x$ and $\mu_y$ denote the mean values, $\sigma_x^2$ and $\sigma_y^2$ the variances, and $\sigma_{xy}$ the covariance of $x$ and $y$, and $C_1$ and $C_2$ are constants assigned to the SSIM index. Equation (7) satisfies the following conditions:
1. $\mathrm{SSIM}(x, y) \le 1$;
2. $\mathrm{SSIM}(x, y) = 1$ if and only if the image has no distortion.
Moreover, the SSIM index is applied to the Y, Cb and Cr components independently, and the results are combined into a single quality measure using the weighted summation

$$\mathrm{SSIM}_{ij} = 0.8\,\mathrm{SSIM}_{ij}^{Y} + 0.1\,\mathrm{SSIM}_{ij}^{Cb} + 0.1\,\mathrm{SSIM}_{ij}^{Cr}. \tag{8}$$
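Equations (7) and (8) translate almost directly into code. The following Python sketch computes the index for one window of samples and then combines three component values with the weights of Eq. (8). It is an illustration under stated assumptions rather than the implementation of [35]: the statistics are taken over the whole window without any weighting, and the default constants $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$ are values commonly used for 8-bit images that are not specified in this paper.

```python
def ssim_index(x, y, c1=6.5025, c2=58.5225):
    """SSIM of Eq. (7) for two equally long lists of samples (one window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def weighted_ssim(ssim_y, ssim_cb, ssim_cr):
    """Combine the per-component indices with the weights of Eq. (8)."""
    return 0.8 * ssim_y + 0.1 * ssim_cb + 0.1 * ssim_cr

original = [52, 55, 61, 59, 79, 61, 76, 61]        # arbitrary sample window
distorted = [55, 53, 65, 58, 79, 63, 73, 62]       # same window with errors

same = ssim_index(original, original)              # no distortion
worse = ssim_index(original, distorted)            # some distortion
combined = weighted_ssim(worse, same, same)        # luminance-only distortion
```

Because luminance carries a weight of 0.8, a given loss in the Y index lowers the combined score eight times as much as the same loss in Cb or Cr alone.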

The SSIM index method not only measures contrast and structural distortions but also assigns different weights to luminance and chrominance. It has been found to be consistent with many behaviors of the human visual system. More details about the implementation of the SSIM index algorithm are given in [35].

4.3 Experiment Results
Table 4 presents the overall comparison of these architectures in gate-level simulation. Several important points can be seen from Table 4. First, the DAA consumes more energy than the others because of its long computation cycles. Although the DAA performs better in power consumption and area, its serious energy penalty is unacceptable for low-power applications. Second, the proposed Hybrid DCT enjoys the better luminance quality of the Loeffler DCT, although its area and power consumption are inferior to those of the binDCT. Although the Loeffler DCT architecture obtains the best quality, it consumes 43% more area and 33.7% more power than the Hybrid DCT architecture. Finally, the binDCT has the smallest chip area but suffers the worst quality of all.

Next, we applied these DCT architectures to a JPEG [34] system and compared image quality. Table 5 shows the PSNR and SSIM results. Both PSNR and SSIM indicate that the Hybrid DCT performs very close to the Loeffler DCT in image quality. The Hybrid DCT is also about 3 dB higher than the binDCT in PSNR. This shows that the Hybrid DCT preserves image quality well by applying the Loeffler DCT to the luminance stream.

To demonstrate the features of the Hybrid DCT, we built an evaluation metric of power and quality from low to high compression ratio. Figure 8 combines the PSNR and SSIM evaluation metrics with power consumption. Figure 8 shows that the Hybrid DCT's power consumption lies between those of the Loeffler DCT and the binDCT, while its image quality is very close to that of the best one, the Loeffler DCT. The power and quality results show that the Hybrid DCT is a better trade-off between the Loeffler DCT and the binDCT. The inv-Hybrid architecture, which exchanges the luminance and chrominance streams of the Hybrid DCT, has worse image quality and higher power dissipation than the Hybrid DCT.

Table 5 Quality analysis results: PSNR and SSIM of four DCT architectures implemented in JPEG.

                            PSNR (dB)                  SSIM
Architecture     Image      Q=95    Q=80    Q=60       Q=95    Q=80    Q=60
Loeffler DCT     lena       35.46   33.27   32.14      0.979   0.940   0.933
Loeffler DCT     baboon     35.20   29.39   26.90      0.991   0.943   0.903
Loeffler DCT     peppers    34.44   31.09   31.97      0.977   0.929   0.908
Hybrid DCT       lena       34.34   32.66   31.62      0.978   0.939   0.929
Hybrid DCT       baboon     33.86   28.76   26.55      0.989   0.943   0.903
Hybrid DCT       peppers    32.38   30.78   30.10      0.976   0.928   0.907
inv-Hybrid DCT   lena       30.11   29.39   28.92      0.960   0.932   0.905
inv-Hybrid DCT   baboon     31.01   27.63   25.86      0.972   0.930   0.889
inv-Hybrid DCT   peppers    29.18   28.33   27.94      0.929   0.885   0.866
binDCT           lena       29.77   29.10   28.66      0.947   0.913   0.894
binDCT           baboon     27.07   25.43   24.41      0.970   0.927   0.887
binDCT           peppers    28.04   27.36   26.92      0.928   0.884   0.864

Fig. 8 The power consumption and average quality results from low to high compression ratio. (a) Relation between power consumption and average PSNR. (b) Relation between power consumption and average SSIM.

In summary, the Hybrid DCT applies different DCTs to the luminance stream (Y) and the chrominance stream (CbCr) to obtain good visual quality while reducing power dissipation and chip area. Although the binDCT is suitable for low-power design, it is not recommended for high-end systems because of its poor visual quality. Finally, our experiments show that the Hybrid DCT can be used in low-power, high-resolution image/video applications, especially in battery-based system design.
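The routing idea summarized above, an accurate transform for Y and a cheaper approximate one for Cb and Cr, can be sketched as follows. This is only a software illustration with hypothetical names. In particular, the approximate path is not the binDCT: as a stand-in it uses the DCT matrix of Eq. (2) with every coefficient rounded to a dyadic rational $k/2^3$, which is enough to show how swapping the two paths (the inv-Hybrid arrangement) moves the approximation error from chrominance to luminance.

```python
import math

def dct_matrix(shift=None):
    """8x8 matrix of Eq. (2); if `shift` is set, round entries to k / 2**shift."""
    matrix = []
    for k in range(8):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        row = [0.5 * ck * math.cos((2 * x + 1) * k * math.pi / 16)
               for x in range(8)]
        if shift is not None:
            scale = 1 << shift
            row = [round(value * scale) / scale for value in row]
        matrix.append(row)
    return matrix

def dct_2d(block, m):
    """Row-column 2-D transform of an 8x8 block with the 1-D matrix `m`."""
    tmp = [[sum(m[k][x] * block[x][y] for x in range(8)) for y in range(8)]
           for k in range(8)]
    return [[sum(tmp[k][y] * m[l][y] for y in range(8)) for l in range(8)]
            for k in range(8)]

ACCURATE = dct_matrix()         # stands in for the multiplier-based path
COARSE = dct_matrix(shift=3)    # hypothetical stand-in for a shift-add path

def hybrid_dct(y_block, cb_block, cr_block, luma=ACCURATE, chroma=COARSE):
    """Accurate transform for Y, approximate transform for Cb and Cr."""
    return (dct_2d(y_block, luma),
            dct_2d(cb_block, chroma),
            dct_2d(cr_block, chroma))

def inv_hybrid_dct(y_block, cb_block, cr_block):
    """Same datapaths with the luminance and chrominance inputs exchanged."""
    return hybrid_dct(y_block, cb_block, cr_block, luma=COARSE, chroma=ACCURATE)

flat = [[8] * 8 for _ in range(8)]                 # constant test block
y_coef, cb_coef, cr_coef = hybrid_dct(flat, flat, flat)
y_inv, cb_inv, cr_inv = inv_hybrid_dct(flat, flat, flat)
```

For the constant block, the exact DC coefficient is 64. In the hybrid arrangement the luminance output reproduces it while the chrominance outputs give 72 with this coarse stand-in; the inv-Hybrid arrangement places that error on the luminance component instead, where the eye is more sensitive.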

5. Conclusion

In this paper, we compared the area, power/energy consumption, and distortion of the Loeffler DCT, the DAA and the binDCT. Based on these comparisons, we proposed a power-efficient Hybrid DCT architecture that separates the luminance and chrominance channels. We integrated the Loeffler DCT and binDCT algorithms to reduce power consumption and chip area while keeping reasonable quality. In addition, we adopted the SSIM assessment method to evaluate quality instead of the traditional PSNR. Our experimental results clearly indicate that the Hybrid DCT offers quality similar to that of the Loeffler DCT. Moreover, the Hybrid DCT leads to 25% power and 27% area savings in comparison with the Loeffler DCT. In summary, the proposed Hybrid DCT provides another choice that saves power, reduces chip area, and keeps visual quality. Our design method can be used in low-power, high-resolution image/video applications.

Acknowledgement

The authors would like to thank the referees for carefully reading our paper and for giving us useful comments.

References

[1] I.E.G. Richardson, Video Codec Design, John Wiley & Sons, Atrium, England, 2002.
[2] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 2001.
[3] J. Li and S.L. Lu, "Low power design of two-dimensional DCT," IEEE Conf. on ASIC and Exhibit, pp.309–312, Sept. 1996.
[4] N.J. August and D.S. Ha, "Low power design of DCT and IDCT for low bit rate video codecs," IEEE Trans. Multimed., vol.6, no.3, pp.414–422, June 2004.
[5] J. Park, S. Kwon, and K. Roy, "Low power reconfigurable DCT design based on sharing multiplication," IEEE International Conf. on Acoustics, Speech, and Signal Processing, pp.III3116–III3119, May 2002.
[6] T. Xanthopoulos and A.P. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization," IEEE J. Solid-State Circuits, vol.35, no.5, pp.740–750, May 2000.
[7] K. Kim and P.A. Beerel, "A high-performance low-power asynchronous matrix-vector multiplier for discrete cosine transform," IEEE Asia Pacific Conf. on ASICs, pp.135–138, Aug. 1999.
[8] H. Jeong, J. Kim, and W.K. Cho, "Low-power multiplierless DCT architecture using image correlation," IEEE Trans. Consum. Electron., vol.50, no.1, pp.262–267, Feb. 2004.
[9] A. Shams, W. Pan, A. Chidanandan, and M.A. Bayoumi, "A low power high performance distributed DCT architecture," IEEE Computer Society Annual Symposium on VLSI, pp.21–27, April 2002.
[10] L. Fanucci and S. Saponara, "Data driven VLSI computation for low power DCT-based video coding," International Conf. on Electronics, Circuits and Systems, pp.541–544, Sept. 2002.
[11] S.A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE Acoust. Speech Signal Process. Mag., vol.6, no.3, pp.4–19, July 1989.
[12] D.W. Kim, T.W. Kwon, J.M. Seo, J.K. Yu, S.K. Lee, J.H. Suk, and J.R. Choi, "A compatible DCT/IDCT architecture using hardwired distributed arithmetic," IEEE International Symposium on Circuits and Systems, pp.457–460, May 2001.
[13] B. Lee, "A new algorithm to compute the discrete cosine transform," IEEE Trans. Acoust. Speech Signal Process., vol.32, no.6, pp.1243–1245, Dec. 1984.
[14] W.H. Chen, C. Smith, and S. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol.25, no.9, pp.1004–1009, Sept. 1977.
[15] C. Loeffler, A. Ligtenberg, and G.S. Moschytz, "Practical fast 1-D DCT algorithms with 11 multiplications," Proc. ICASSP, pp.988–991, Glasgow, UK, May 1989.
[16] Z. Wang, "Fast algorithms for the discrete W transform and for the discrete Fourier transform," IEEE Trans. Acoust. Speech Signal Process., vol.32, no.4, pp.803–816, Aug. 1984.
[17] M. Kuhlmann and K.K. Parhi, "Power comparison of flow-graph and distributed arithmetic based DCT architectures," Signals, Systems & Computers, pp.1214–1219, Pacific Grove, CA, Nov. 1998.
[18] E.N. Farag and M.I. Elmasry, "Low-power implementation of discrete cosine transform," Proc. VLSI, pp.174–177, Ames, Canada, March 1996.
[19] T.D. Tran, "A fast multiplierless block transform for image and video compression," International Conf. on Image Processing, pp.822–826, Oct. 1999.
[20] P.C. Teo and D.J. Heeger, "Perceptual image distortion," IEEE Conf. on Image Processing, pp.982–986, Nov. 1994.
[21] A.M. Eskicioglu and P.S. Fisher, "Image quality measures and their performance," IEEE Trans. Commun., vol.43, no.12, pp.2959–2965, Dec. 1995.
[22] Z. Wang, A.C. Bovik, and L. Lu, "Why is image quality assessment so difficult," IEEE Conf. on Acoustics, Speech, and Signal Processing, pp.IV-3313–IV-3316, May 2002.
[23] Z. Wang, L. Lu, and A.C. Bovik, "Video quality assessment using structural distortion measurement," IEEE Conf. on Image Processing, pp.III-65–III-68, June 2002.
[24] A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE Trans. Acoust. Speech Signal Process., vol.22, no.6, pp.456–462, Dec. 1974.
[25] C.Y. Pai, W.E. Lynch, and A.J. Al-Khalili, "Low-power data-dependent 8×8 DCT/IDCT for video compression," IEE Proc. Vision, Image and Signal Processing, vol.150, pp.245–255, Aug. 2003.
[26] P. Duhamel and H. H'Mida, "New 2^n DCT algorithms suitable for VLSI implementation," IEEE International Conference on ICASSP, pp.1805–1808, April 1987.
[27] M. Vetterli and H. Nussbaumer, "Simple FFT and DCT algorithms with reduced number of operations," Signal Process., vol.6, pp.264–278, 1984.
[28] N. Suehiro and M. Hatori, "Fast algorithms for the DFT and other sinusoidal transforms," IEEE Trans. Acoust. Speech Signal Process., vol.34, no.3, pp.642–644, 1986.
[29] H. Hou, "A fast recursive algorithm for computing the discrete cosine transform," IEEE Trans. Acoust. Speech Signal Process., vol.35, no., pp.1455–1461, 1987.
[30] Y. Jeong, I. Lee, H.S. Kim, and K.T. Park, "Fast DCT algorithm with fewer multiplication stages," Electron. Lett., vol.34, pp.723–724, 1998.
[31] G.A. Hoffmann, E.E. Fabris, D. Zandonai, and S. Bampi, "The binDCT processor," Microelectronics Group.
[32] T.D. Tran, "The binDCT: Fast multiplierless approximation of the DCT," IEEE Signal Process. Lett., vol.7, no.6, pp.141–144, June 2000.
[33] J. Liang and T.D. Tran, "Fast multiplierless approximations of the DCT with the lifting scheme," IEEE Trans. Signal Process., vol.49, no.12, pp.3032–3044, Dec. 2001.
[34] The JPEG-6b Website, http://www.ijg.org/, 1998.
[35] Z. Wang, L. Lu, and A.C. Bovik, "Video quality assessment based on structural distortion measurement," Signal Process. Image Commun., vol.19, pp.121–132, Feb. 2004.
[36] Z. Wang and A.C. Bovik, "A universal image quality index," IEEE Signal Process. Lett., vol.9, no.3, pp.81–84, March 2002.

Chi-Chia Sung received the B.S. degree in computer science and information engineering from National Taiwan Ocean University, Taiwan (2001–2004). From April 2003 to Sept. 2004, he served as an HDTV IC design engineer at Trumpion Electron Corporation. Since September 2004, he has been a graduate student in the Department of Electronic Engineering, National Taiwan University of Science and Technology.

Shanq-Jang Ruan received the B.S. degree in computer science and information engineering from Tamkang University, Taiwan (1992–1995), and the M.S. degree in computer science and information engineering (1995–1997) and the Ph.D. degree in electrical engineering (1999–2002) from National Taiwan University. He is an Assistant Professor in the Department of Electronic Engineering, National Taiwan University of Science and Technology.

Bo-Yao Lin received the B.S. degree in electronic engineering from National Taiwan University of Science and Technology (2000–2003). Since September 2003, he has been a graduate student in the Department of Electronic Engineering, National Taiwan University of Science and Technology.


Mon-Chau Shie was born in NaTou County, Taiwan. He received his B.S., M.S., and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1983, 1985, and 2000, respectively. In fall 1988, he joined the faculty of National Taiwan University of Science and Technology, Taipei, Taiwan, where he is currently an Assistant Professor in the Department of Electronic Engineering. His research interests include computer architecture, embedded systems, very large scale integration, and image compression.
