Parallel-Pipelined Architecture for 2-D ICT VLSI Implementation

PARALLEL-PIPELINED ARCHITECTURE FOR 2-D ICT VLSI IMPLEMENTATION J. A. Michell, G. A. Ruiz, A. M. Burón Department of Electronic and Computers Facultad de Ciencias University of Cantabria Avda.Los Castros s/n. 39005-Santander. SPAIN [email protected] ABSTRACT The Integer Cosine Transform (ICT) has been shown to be an alternative to the DCT for image processing. This paper presents a parallel-pipelined architecture of an 8x8 ICT(10, 9, 6, 2, 3, 1) processor for image compression. The main characteristics of this architecture are: high throughput, low latency, reduced internal storage and 100% efficiency in all computational elements. The processor has been designed in 0.35-µm CMOS technology with an operational frequency of 300MHz. Index Terms– integer cosine transform, multiplication free DCT, discrete cosine transform, image compression, parallel pipelined architectures, VLSI. 1. INTRODUCTION Since the introduction of the Discrete Cosine Transform (DCT) several contributions have been made regarding fast algorithms and architectures for different applications. Of particular relevance are those describing 2D-DCT processors for image compression [1-5] and approximations to the DCT that are multiplication-free in order to reduce implementation complexity [6-7]. Although the DCT is the most widely used transform for image processing, other transforms have appeared over the last two decades which have both a smaller compression capacity and lower implementation costs [812]. Among these, the Integer Cosine Transform ( ICT) [9-12] has been shown to be a promising alternative to the DCT [9] due to its implementation simplicity [13-14], similar performance and compatibility with the DCT. The ICT(10,9,6,2,3,1) has the advantage of not needing multipliers, as its kernel is a matrix on integers [13]. This paper describes a parallel-pipelined architecture of an 8x8 ICT(10, 9, 6, 2, 3,1) processor aimed at image compression. Based on a numerical strength reduction ICT algorithm, the architecture has been tailored to attain high throughput, having a low latency data flow that

0-7803-7750-8/03/$17.00 ©2003 IEEE.

minimizes internal storage needs and keeps all computational elements at 100% efficiency. From this architecture, an 8x8 ICT processor, which meets the numerical characteristic requirements of the IEEE std. 1180-1990, has been designed using a 0.35-µm CMOS standard cell library. The circuit, with an area of 9.3 mm2, has an operational frequency of 300MHz. 2. THE INTEGER COSINE TRANSFORM The Integer Cosine Transform (ICT) was derived from the DCT through the concept of dyadic symmetry. The order8 ICT kernel is (1) T=KJ where K is the normalization diagonal matrix and J an orthogonal matrix defined as

    J =    

g a e b g c f d

g g b c f −f −d −a −g −g −a d −e e −c b

g d −e −c g b −f −a

g −d −e c g −b −f a

g g g −c −b −a  −f f e a d −b  −g −g g  −d a −c e −e f  −b c −d

(2)

whose elements are all integers and satisfy a b= a c+ bd+cd a ≥ b ≥ c ≥ d and e ≥ f

(3) (4)

There are many possible J matrices, and the corresponding ICTs are denoted as ICT(a, b, c, d, e, f); g is always 1.

3. DECOMPOSITION OF THE 1-D ICT The 1-D ICT for a real input sequence x(n) is defined as

X=T x = KJ x = KY

(5)

where X and x are dimension-8 column matrices. Reordering the input sequence and the transform coefficients according to the rules

ICIP 2003

x0

a0

Y0

a7

x1

a1

Y4

a6

x2

a2

Y2

a5

x3

a3

x7

a7

x6

J 4e butterfly

a5

x4

a4

a4

Y1

a7

J4o butterfly

a

a+b

b

a-b

Y5

a5

Y7

a4

X′ (m) = X(Br8[m])  X′ (m + 4) = X(2m + 1), m ∈ [0,3]

(7)

d2

(8)

The reordered integer ICT kernel is

g g g g g −g −g g e f −f −e f −e e −f 

 =  

a b c d b −d −a −c c −a d b d −c b −a 

J 4o

Applying the decomposition rules equations (6) and (7) to the J4e matrix gives J 4e

J =  2e  0

0   I2 J 2o   I2

I2  R -I2  4

Y1

e3

2

Y3

2

Y5

-1

d1

-8

f0

-1

g0

f1 d2

-1 8

g1

f2 d1

-1 8

g2

f3 d0

-1 8

g3

d3

8

8 8

Y7

9 6 2  2 2 −2 2  10 9 −2 −10 −6  2 −2 −2 2 J 4o =  = + 2 9 −2 −2 2 2  6 −10 9 −10  2 2 2 −2  2 −6  +  

0 8 8 0 8 0 0 −8

8 0 −8 0 −8  1 0 8 −  0 8 0  0

1 0 8 0

0 8 0 1

0 0 1 8

(15)

4. DECOMPOSITION OF THE 2-D ICT (10)

The 2-D ICT for a real input sequence x is defined as t

t

X=T x T = KJ x J K= KY K (11)

defined

in

t

(12)

(13) (14)

(16)

where X and x are 8x8 matrices. Reordering the input data and the transform coefficients applying the rules defined in (6) and (7) to both dimensions, the 2-D ICT can be expressed as X′ = TR x′ TR = K R Y′ KR

where R4 is the reordering matrix of length 4, I2 is the dimension-2 identity matrix, and g g J 2e = g -g   e f  J 2o =  f -e

2

in order to use only adding and shifting operations. The resulting signal flow graph for the J4o transformation is shown in Figure 2.

(9)

I4 being the dimension-4 identity matrix, and

  J 4e =  

e2 -1

Figure 1 shows the signal flow graph obtained by applying the decomposition process to J(10,9,6,2,3,1). In this case, the J4o matrix can be readily decomposed as

where Br8[m] represents a bit-reverse operation of length 8, then the 1-D ICT can be expressed as

I4  -I4 

2

Figure 2. J4o signal flow graph.

(6)

0 I4 J 4o   I 4

e1 -1

8

x′ (n) = x( n)  x′ (7 − n) = x(n + 4), n ∈ [0, 3]

J J R =  4e  0

d3

-1

Figure 1. J transform signal flow graph.

X′ = TR x′ = K R J R x′ = K R Y′

e0 -1

a6

Y3

a6

x5

Y6

d0

(17)

where J Y′ =  4e  0

0 I4 J 4o   I 4

I 4   I4 x′ -I4   I4

I 4  J t4e -I 4   0

0 (18) J 4o

Figure 3 shows the signal flow graph obtained by applying the decomposition process to 2-D ICT(10, 9, 6, 2, 3, 1).

5. PARALLEL PIPELINED ARCHITECTURE WITH A THROUGHPUT RATE OF 1 x(0,0)

Y(0,0)

x(0,1)

Y(0,4)

x(0,2)

Y(0,2)

x(0,3)

Y(0,6)

x(0,7)

Y(0,1)

x(0,6)

Y(0,3)

x(0,5)

Y(0,5)

x(0,4)

Y(0,7)

x(1,0)

Y(4,0)

x(1,1)

Y(4,4)

x(1,2)

Y(4,2)

x(1,3)

Y(4,6)

x(1,7)

Y(4,1)

x(1,6)

Y(4,3)

x(1,5)

Y(4,5)

x(1,4)

Y(4,7)

x(2,0)

Y(2,0)

x(2,1)

Y(2,4)

x(2,2)

Y(2,2)

x(2,3)

Y(2,6)

x(2,7)

Y(2,1)

x(2,6)

Y(2,3)

x(2,5)

Y(2,5)

x(2,4)

Y(2,7)

x(3,0)

Y(6,0)

x(3,1)

Y(6,4)

x(3,2)

Y(6,2)

x(3,3)

Y(6,6)

x(3,7)

Y(6,1)

x(3,6)

Y(6,3)

x(3,5)

Y(6,5)

x(3,4)

Y(6,7)

x(7,0)

Y(1,0)

x(7,1)

Y(1,4)

x(7,2)

Y(1,2)

x(7,3)

Y(1,6)

x(7,7)

Y(1,1)

x(7,6)

Y(1,3)

x(7,5)

Y(1,5)

x(7,4)

Y(1,7)

x(6,0)

Y(3,0)

x(6,1)

Y(3,4)

x(6,2)

Y(3,2)

x(6,3)

Y(3,6)

x(6,7)

Y(3,1)

x(6,6)

Y(3,3)

x(6,5)

Y(3,5)

x(6,4)

Y(3,7)

x(5,0)

Y(5,0)

x(5,1)

Y(5,4)

x(5,2)

Y(5,2)

x(5,3)

Y(5,6)

x(5,7)

Y(5,1)

x(5,6)

Y(5,3)

x(5,5)

Y(5,5)

x(5,4)

Y(5,7)

x(4,0)

Y(7,0)

x(4,1)

Y(7,4)

x(4,2)

Y(7,2)

x(4,3)

Y(7,6)

x(4,7)

Y(7,1)

x(4,6)

Y(7,3)

x(4,5)

Y(7,5)

x(4,4)

Y(7,7)

a

a+b

b

a-b

J4e butterfly

J4o butterfly

Figure 3. 8x8 ICT signal flow graph.

The architecture shown in Figure 4 is based on a computing scheme that first obtains the 2-D J transform and then performs the normalization. This 2-D J transform is implemented using two 1-D J(10, 9, 6, 2, 3, 1) processors, sharing a register file for the storing of intermediate data. The input processor operates on each row of the input data block and stores the output coefficients on the register file. The second processor has as input the data stored on the register file -read in column (row) if they were written in row (column)- and produces the transform coefficients without normalization. A highly pipelined output multiplier performs the final normalization. The 1-D J processor architecture has been conceived to implement efficiently the computation diagram of Figure 1 and to reduce the operating frequency to fs/2, fs which is the frequency of the input data. It has an input processor which performs the addition/subtraction of the input data pairs and two processors in parallel that carry out the transformations labeled J4e (eq. 10) and J4o (eq. 11). The output mixer circuit produces the un-normalized output coefficient sequence at a rate equal to the sampling frequency. The input processor has a shift register of length 11 to store the input data, two multiplexers and an arithmetic unit having an adder and a subtractor in parallel. The arithmetic unit produces the input data for the J4e and J4o processors. Figure 5 shows the J4e processor, which has been derived from eq. (12). This processor contains an input data register, three intermediate data registers, two multiplexers, an arithmetic unit having an adder and a subtractor in parallel, and an output data multiplexer to order the coefficients. The multiplication by 3, implemented by shifting/adding, is pipelined with the subtraction. The J4o processor architecture has been conceived from the decomposition defined in eq. 15, avoiding the need for multipliers. As can be seen in Figure 6, this processor has an input data register, four intermediate data registers, eight multiplexers, four adders and one subtractor operating in parallel. The computing efficiency of all processors is 100%. Based on this architecture, an 8x8 ICT processor, meeting the numerical characteristic requirements of the IEEE std. 1180-1990, has been designed in 0.35-µm CMOS AMS technology using standard cell methodology. This circuit occupies an area of 9.3 mm2 and has its operational frequency at 300MHz. The output multiplier has been implemented with fine grain pipeline architecture to make possible this high rate.

J4e processor

J4e processor

x

2

2

J4o processor

J processor

2 Register file

J processor

2 J4o processor

Y

X

X

normalization factors

Figure 4. Architecture of 8x8 2-D ICT processor.

REG

a3a2a 1a0

REG

3

6. REFERENCES

REG M U X

+

M U X

_

M U X

Y6Y4Y2Y0

[2] W. Ku, “A cost-effective architecture for 8x8 two-dimensional DCT/IDCT using direct method,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, no. 3, pp. 459-67, June 1997.

REG

Figure 5. Architecture of J4e processor. a4a5a6a7

REG

REG

M U X

+

M U X

_

M U X M U X

REG

8 8 8

8 8 REG

M U X M U X M U X M U X

[5] S.F. Hsiao, J.M. Tseng, “Parallel, pipelined and folded architectures for computation of 1-D and 2-D DCT in image and video codec,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 28, no. 3, pp. 205-220, July 2001.

d1d3e1e2 2

+

[3] H.C. Chang, J.Y. Jiu, L.L. Chen, L.G. Chen, “A low power 8x8 direct 2-D DCT chip design,” Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology. vol. 26, no. 3, pp. 319-332, Nov. 2000. [4] C.H. Chang, C.L. Wang, Y.T. Chang, “Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse,” IEEE Trans. on Signal Processing. vol. 48, no. 11, pp. 3206-3216, Nov. 2000.

d0d2e0e3

REG

[1] C.L. Wang, C.Y. Chen, “High-throughput VLSI architectures for the 1-D and 2-D discrete cosine transforms,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 5, no. 1, pp. 31-40, February 1995.

Y7Y5Y3Y1

f0f2g0g2

+

[6] N. Merhav, B. Vasudev, “A multiplication-free approximate algorithm for the inverse discrete cosine transform,” Proc. 1999 Int. Conference on Image Processing, IEEE, vol. 2, pp .759-63, 1999. [7] J. Liang, T.D. Tran, “Fast multiplierless approximations of the DCT with the lifting scheme,” IEEE Trans. on Signal Processing, vol. 49, no. 12, pp. 3032-44, Dec. 2001. [8] S.C. Pei and J.J. Ding, “The integer transform analogous to discrete trigonometric transforms,” IEEE Transactions on Signal Processing, vol. 48, no. 12, pp. 3345-3364,, December 2000. [9] W.K. Cham, “Development of integer cosine transforms by the principle of dyadic symmetry,” IEE Proc. Part I, vol. 136, no.4, pp.276282, August 1989. [10] W.K. Cham, Y.T. Chan, “An order-16 integer cosine transform,” IEEE Trans. on Signal Proc., vol. 39, no. 5, pp. 1205-1208, May 1991.

+ f1f3g1g3

Figure 6. Architecture of J4o processor.

Acknowledgment. We wish to express our gratitude to the Spanish Ministry of Science and Technology for the partial funding (TIC2000-1289) of this work.

[11] Y. Zeng, L. Cheng, G. Bi and A.C. Kot, “Integer DCTs and fast algorithms,” IEEE Transactions on Signal Processing, vol. 49, no. 11, pp. 2774-2784, Nov. 2001. [12] A. Marcek, J. Kotuliakova and G. Rozinaj, “New approach of fast ICT and MICT algorithms development,” on Proc. 3rd IEEE Int. Conf. on Electronics, Circuits and Systems, 1996, pp. 744-747. [13] W.K. Cham, C. S. Choy, W.K. Lam, “A 2-D integer cosine transform chip set and its applications.” IEEE Trans. on Consumer Electronics, vol. 38, no. 2, pp. 43-47, May 1992. [14] T.C.J. Pang, C.S.O. Choy, C.F. Chan and W.K. Cham, “A selftimed ICT chip for image coding,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 9, no. 6, pp. 856-860, Sept. 1999.

Parallel-Pipelined Architecture for 2-D ICT VLSI Implementation

Parallel-Pipelined Architecture for 2-D ICT VLSI Implementation

Suggest Documents

architecture and VLSI implementation - Semantic Scholar

A Modular VLSI Implementation Architecture for ... - Semantic Scholar

A Modular VLSI Implementation Architecture for ... - Semantic Scholar

VLSI Implementation of High Performance Optimized Architecture for ...

Energy-efficient Hardware Architecture and VLSI Implementation of a ...

VLSI Architecture Design and Implementation of a ...

vlsi architecture and fpga implementation of ice encryption ... - CiteSeerX

ICT solution architecture for agriculture

FPGA Implementation Technology for Memory Efficient VLSI

New Architecture Paradigms for Analog VLSI ... - People.csail.mit.edu

AN EFFICIENT VLSI ARCHITECTURE FOR MC ... - CiteSeerX

Display Architecture for VLSI -based Graphics Workstations

An Efficient VLSI Architecture for CORDIC Algorithm

VLSI Architecture for Spatial Domain Spread ...

SYSTEM ANALYSIS OF VLSI ARCHITECTURE FOR MOTION ...

VLSI reed solomon decoder architecture for ...

An Efficient VLSI Architecture for Fingerprint

(ICT) IMPLEMENTATION CONSTRAINTS - CiteSeerX

the implementation of ict accessibility policy for

Towards a distributed intelligent ICT architecture for

fpga implementation of efficient vlsi architecture for fixed point 1-d dwt

Implementation of Area & Power Optimized VLSI ...

Design and VLSI Implementation of an Adaptive

VLSI Implementation of RSA Encryption System