Process Variation Tolerant Low Power DCT Architecture

Nilanjan Banerjee, Georgios Karakonstantis and Kaushik Roy
Purdue University, West Lafayette, IN-47907, USA.
Email: {nbanerje, gkarakon, kaushik}@purdue.edu

Abstract: The 2-D Discrete Cosine Transform (DCT) is widely used as the core of digital image and video compression. In this paper, we present a novel DCT architecture that allows aggressive voltage scaling by exploiting the fact that not all intermediate computations are equally important in a DCT system to obtain “good” image quality with Peak Signal to Noise Ratio (PSNR) > 30 dB. This observation has led us to propose a DCT architecture where the signal paths that are less contributive to PSNR improvement are designed to be longer than the paths that are more contributive to PSNR improvement. It should also be noted that robustness with respect to parameter variations and low power operation typically impose contradictory requirements in terms of architecture design. However, the proposed architecture lends itself to aggressive voltage scaling for low-power dissipation even under process parameter variations. Under a scaled supply voltage and/or variations in process parameters, any possible delay errors would only appear from the long paths that are less contributive towards PSNR improvement, providing large improvement in power dissipation with small PSNR degradation. Results show that even under large process variation and supply voltage scaling (0.8 V), there is a gradual degradation of image quality with considerable power savings (62.8%) for the proposed architecture when compared to existing implementations in 70 nm process technology.

1. INTRODUCTION

Energy-aware designs are necessary to prolong the battery lifetime of portable devices, to prevent excessive heat generation which might result in device reliability problems, and to reduce the cost associated with expensive cooling techniques. With increasing demand for video messaging in multimedia/wireless communications, the development of low-energy image/video transmission schemes is necessary [1, 2]. Conventional image compression schemes are designed to minimize distortion of the reconstructed image for a given bit-rate. However, applications like portable multimedia may not always require the best image quality [3]. This aspect can be effectively exploited to obtain architectures that provide the “right” trade-off between image quality and energy consumption.

The Discrete Cosine Transform (DCT) is important in the field of video/image compression due to its inherent capability of achieving high compression rates at low design complexity. A lot of research has been devoted to reducing the number and complexity of computations in DCT architectures [4, 5]. Low-power requirements for image compression have resulted in several other DCT architectures that use partial computation [6] or dual-threshold voltages [7]. However, low power is not the only requirement in today’s designs. With technology scaling, process parameter variations pose a major design concern. Studies have shown [8] that parameter variations create a delay spread of almost 30% for 70 nm process technology, leading to delay failures in some chips. Conventional wisdom dictates a conservative design approach (e.g., scaling up the Vdd or upsizing logic gates) to prevent delay failures and to achieve high parametric yield. However, such techniques come at a cost of increased power and/or die area. Therefore, process tolerance and low power represent contradictory design requirements.

In this paper, we simultaneously target low power and process tolerance by proposing an architecture amenable to voltage scaling even under parameter variations. Our contributions are as follows:
• Identify computational paths that are vital in maintaining high image quality
• Develop an algorithm/architecture in which the more important computations (in terms of image quality) have shorter paths than the less important ones
• Utilize this architecture to make any path-delay errors predictable under a single scaled supply voltage and process parameter variations, and to tolerate delay failures in such paths with minimal PSNR degradation of the image
• Reconfigure the architecture to provide trade-offs between image quality and power consumption

The paper is organized as follows. The proposed DCT architecture operated under a single scaled Vdd is presented in Section 2. Implementation details and the results of the scaled-Vdd scheme are elaborated in Section 3. The process tolerance capabilities of this architecture are presented in Section 4. Section 5 concludes the paper.

2. DCT ARCHITECTURE: PRINCIPLES AND DESIGN

In this section, we briefly describe the underlying principle of conventional DCT systems and then propose a new DCT architecture for low power. Though our DCT implementation considers 8-bit coefficients, the technique can be easily extended to 12-bit or 16-bit DCT coefficients.

2.1 Conventional DCT

The conventional 2-D DCT [13] can be represented by the block diagram in Fig. 1, which shows that the 2-D DCT can be separated into two 1-D DCTs.

X → 1D DCT → W → Transpose Memory → Y → 1D DCT → Z

Fig. 1. 2-D DCT architecture expressed as two 1-D DCT transforms

The intermediate computation is a 1-D DCT unit that transforms an 8 × 8 image block from the spatial domain to the frequency domain. The 1-D DCT transform is expressed as:

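$$w_k = \frac{c(k)}{2} \sum_{i=0}^{7} x_i \cos\frac{(2i+1)k\pi}{16}, \quad k = 0, 1, 2, \ldots, 7 \qquad (1)$$

$$c(k) = \begin{cases} \dfrac{1}{\sqrt{2}}, & k = 0 \\ 1, & \text{otherwise} \end{cases}$$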
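As a minimal sketch of the data flow in Fig. 1, the following Python code evaluates eqn. (1) directly and chains two 1-D passes through a transpose. The function names (`dct_1d`, `dct_1d_columns`, `dct_2d`) are illustrative and do not come from the paper, and transforming the columns in each pass is an assumption about the data layout.

```python
import math

def c(k):
    # Scaling term c(k) from eqn. (1)
    return 1 / math.sqrt(2) if k == 0 else 1.0

def dct_1d(x):
    """Direct evaluation of eqn. (1) for eight input samples."""
    return [
        c(k) / 2 * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / 16)
                       for i in range(8))
        for k in range(8)
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def dct_1d_columns(m):
    """Apply the 1-D DCT to every column of an 8x8 matrix (assumed layout)."""
    cols = [dct_1d(list(col)) for col in zip(*m)]  # cols[j] = transformed column j
    return transpose(cols)                         # back to row-major storage

def dct_2d(x):
    """X -> 1-D DCT -> W -> transpose -> Y -> 1-D DCT -> Z, as in Fig. 1."""
    w = dct_1d_columns(x)
    y = transpose(w)       # role of the transpose memory
    z = dct_1d_columns(y)
    return z

if __name__ == "__main__":
    block = [[10.0] * 8 for _ in range(8)]
    z = dct_2d(block)
    print(round(z[0][0], 6))  # DC term of a constant block
```

For a constant block only the DC term survives: an 8 × 8 block of 10s gives Z[0][0] = 80 and zeros elsewhere, up to rounding error.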

In vector-matrix form, the same equation can be written as:

$$w = T \cdot x^{t}, \qquad (2)$$

where T is an 8 × 8 matrix with cosine functions as its elements, and x and w are row and column vectors, respectively [9]. Each row of the 8 × 8 coefficient matrix T is symmetric or antisymmetric about its center, and this property is used for the even/odd 1-D DCT calculation in the following manner:

$$\begin{bmatrix} w_0 \\ w_2 \\ w_4 \\ w_6 \end{bmatrix} = \begin{bmatrix} d & d & d & d \\ b & f & -f & -b \\ d & -d & -d & d \\ f & -b & b & -f \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} \qquad (3)$$

$$\begin{bmatrix} w_1 \\ w_3 \\ w_5 \\ w_7 \end{bmatrix} = \begin{bmatrix} a & c & e & g \\ c & -g & -a & -e \\ e & -a & g & c \\ g & -e & c & -a \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} \qquad (4)$$

where $c_k = \cos(k\pi/16)$, $a = c_1$, $b = c_2$, $c = c_3$, $d = c_4$, $e = c_5$, $f = c_6$, and $g = c_7$. We can rearrange eqn. (3) in the following manner:

$$\begin{bmatrix} w_0 \\ w_2 \\ w_4 \\ w_6 \end{bmatrix} = (x_0 + x_7) \begin{bmatrix} d \\ b \\ d \\ f \end{bmatrix} + (x_1 + x_6) \begin{bmatrix} d \\ f \\ -d \\ -b \end{bmatrix} + (x_2 + x_5) \begin{bmatrix} d \\ -f \\ -d \\ b \end{bmatrix} + (x_3 + x_4) \begin{bmatrix} d \\ -b \\ d \\ -f \end{bmatrix} \qquad (5)$$

Eqn. (4) can be expressed in a similar format. As shown in eqn. (5), each output of a 1-D DCT computation is simply a sum of vector-scaling operations. It should be noted that several optimization techniques have been proposed [1] to reduce the number of operations in DCT computation.
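The even/odd split of eqns. (3) and (4) can be checked numerically against eqn. (1). The Python sketch below is illustrative only: the names `EVEN`, `ODD`, `dct_even_odd` and `dct_direct` are hypothetical, and it assumes that the coefficients a to g absorb the factor 1/2 of eqn. (1), which is what the values listed in Table 1 suggest (for example a ≈ 0.49).

```python
import math

# Assumption: a..g = (1/2) * cos(k*pi/16) for k = 1..7, so that the
# even/odd form reproduces eqn. (1) including its c(k)/2 scaling.
a, b, c, d, e, f, g = (0.5 * math.cos(k * math.pi / 16) for k in range(1, 8))

# Coefficient matrices of eqn. (3) (rows give w0, w2, w4, w6)
EVEN = [[d, d, d, d],
        [b, f, -f, -b],
        [d, -d, -d, d],
        [f, -b, b, -f]]

# Coefficient matrices of eqn. (4) (rows give w1, w3, w5, w7)
ODD = [[a, c, e, g],
       [c, -g, -a, -e],
       [e, -a, g, c],
       [g, -e, c, -a]]

def dct_even_odd(x):
    """1-D DCT via the sums and differences used in eqns. (3) and (4)."""
    s = [x[i] + x[7 - i] for i in range(4)]  # x0+x7, x1+x6, x2+x5, x3+x4
    t = [x[i] - x[7 - i] for i in range(4)]  # x0-x7, x1-x6, x2-x5, x3-x4
    w = [0.0] * 8
    for r in range(4):
        w[2 * r] = sum(EVEN[r][j] * s[j] for j in range(4))
        w[2 * r + 1] = sum(ODD[r][j] * t[j] for j in range(4))
    return w

def dct_direct(x):
    """Reference: eqn. (1) evaluated term by term."""
    def scale(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    return [
        scale(k) / 2 * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / 16)
                           for i in range(8))
        for k in range(8)
    ]

if __name__ == "__main__":
    x = [52, 55, 61, 66, 70, 61, 64, 73]
    for p, q in zip(dct_even_odd(x), dct_direct(x)):
        print(f"{p:10.4f} {q:10.4f}")
```

Each 4 × 4 product needs 16 multiplications, so the split form uses 32 multiplications per 1-D transform instead of the 64 of a full 8 × 8 matrix-vector product.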

 1  2  6  7 15 16 28 29
 3  5  8 14 17 27 30 43
 4  9 13 18 26 31 42 44
10 12 19 25 32 41 45 54
11 20 24 33 40 46 53 55
21 23 34 39 47 52 56 61
22 35 38 48 51 57 60 62
36 37 49 50 58 59 63 64

Fig. 2. Energy distribution for 2-D DCT matrix

Note that not all coefficients of the 2-D DCT matrix affect the image quality in the same manner. Analysis conducted on various images such as Lena and Peppers shows that most of the input image energy (around 85% or more) is contained in the first 20 coefficients (Fig. 2) of the DCT matrix after the 2-D DCT operation. The coefficients beyond that (21-64) contribute significantly less to the improvement of image quality and hence to the PSNR. Fig. 2 shows the energy distribution, ordered according to the zig-zag scan (from 1 to 64), for the final DCT matrix.

2.2 Proposed DCT architecture

In this subsection, we introduce the underlying concept utilized in designing our DCT architecture. From Fig. 2, we can infer that the energy content is distributed non-uniformly across the final DCT matrix. With this information in mind, we propose an architecture that computes the high-energy components of the final DCT matrix faster than the low-energy components. The design methodology developed for this architecture is shown in Fig. 3. While computing the 1-D DCT, we calculate the first 5×8 sub-matrix (marked as “Faster Computation” in Fig. 3(b)) earlier than the remaining 3×8 values (marked as “Slower Computation” in Fig. 3(b)).

To explain this further, let us first consider how a conventional pipelined DCT system works. In a pipelined DCT system, usually eight pixel values (x0-x7) corresponding to one column are input at a time and the 1-D DCT is performed on them to obtain the values w0-w7 (Fig. 3(b)) in the manner shown in eqn. (5). In the next clock cycle, the next set of intermediate values (w8-w15) is computed corresponding to inputs x8-x15, and so on. Our architecture is designed so that in each clock cycle the first five values (e.g., w0-w4 for inputs x0-x7, since each w computation takes the entire column of values x0-x7) are evaluated faster than the remaining values (w5-w7), which take longer to compute.
Therefore, the 5×8 sub-matrix is “faster” than the remaining 3×8 sub-matrix. The transposition of the intermediate 1-D matrix (Fig. 3(b)) results in the matrix shown in Fig. 3(c). Since the 5×8 sub-matrix is computed faster in Fig. 3(b), the corresponding transpose results in earlier computation of the 8×5 sub-matrix as shown in Fig. 3(c). The second 1-D DCT operation results in faster evaluation of the first 5 values (e.g., z0-z4) for each of the input columns (x0-x7 etc.). This enables fast computation of the first 5×5 sub-matrix of the final DCT matrix. This 5×5 sub-matrix includes all the high energy components (1-20) of the DCT matrix. The rest of the matrix is computed “slowly”. Therefore, the computational paths for high energy components are shorter than their low energy counterparts.

[Figure 3: four 8 × 8 matrices, panels (a) to (d), with rows x0-x7, w0-w7, y0-y7 and z0-z7; in panel (b) rows w0-w4 are marked “Faster Computation” and rows w5-w7 “Slower Computation”; panels (c) and (d) are marked “Faster” and “Slower”]

Fig. 3. (a) Input Pixel Matrix (b) Matrix after first 1-D DCT (c) Matrix after being transposed (d) Final 2-D DCT matrix

Designing the architecture in such a manner provides the following distinct advantages: a) it helps isolate the computational paths based on high energy and low energy contributing components, and b) it allows supply voltage scaling (single supply) to trade off power dissipation and image quality even under process parameter variations.

2.3 Modification of path-lengths

To scale the supply voltage and to obtain power savings, it is necessary to skew the different path-lengths in the DCT computation. In this sub-section, we describe the step-by-step procedure that guarantees that the path-lengths for computing the first five elements (w0-w4) of the DCT computation are shorter than those of the remaining three elements (w5-w7). We achieve this by creating a relationship between the DCT coefficients. Later we will observe that such path-skewing leads to supply voltage scaling for low-power operations even under process parameter variations. Let us consider the original 8-bit DCT coefficients shown in Table 1. We slightly alter the coefficient values as shown in Table 2. The reason for such modification will be clear shortly. This modification should be performed carefully so that it has minimal effect on the image quality.
We keep the value of the coefficient “d” unchanged.

Table 1. Original 8-bit DCT coefficients

DCT Coef   Value   Binary Number
a          0.49    0011 1111
b          0.46    0011 1011
c          0.42    0011 0101
d          0.35    0010 1101
e          0.28    0010 0100
f          0.19    0001 1000
g          0.10    0000 1100
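The binary patterns in Table 1 are consistent with rounding (1/2)·cos(kπ/16) to the nearest multiple of 1/128, that is, 8 bits with 7 fractional bits. This number format is an inference from the table values and is not stated in the text; the helper `quantize_q7` below is a hypothetical name used only for this sketch.

```python
import math

def quantize_q7(value):
    """Round to the nearest multiple of 1/128 (assumed 7 fractional bits).

    Returns the integer code and its 8-bit pattern grouped as 'bbbb bbbb'.
    """
    code = round(value * 128)
    bits = format(code, "08b")
    return code, bits[:4] + " " + bits[4:]

# Coefficients a..g, assuming the 1/2 factor of eqn. (1) is folded in
table = {
    name: quantize_q7(0.5 * math.cos(k * math.pi / 16))
    for k, name in enumerate("abcdefg", start=1)
}

if __name__ == "__main__":
    for name, (code, bits) in table.items():
        print(f"{name}  {code / 128:.2f}  {bits}")
```

Under these assumptions the sketch reproduces all seven binary patterns listed in Table 1.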

[Figure panel (a): w0 formed by summing the products (x0 + x7)·d, (x1 + x6)·d, (x2 + x5)·d and (x3 + x4)·d]
