TWO DIMENSIONAL DISCRETE COSINE TRANSFORM USING BIT LEVEL SYSTOLIC STRUCTURES

Project Report Submitted by

Balagopal G 0221156

Jagannath S 0221162

Nishanth U 0221374

Under the guidance of

Dr. Sumam David, Professor, Dept. of ECE, NITK, Surathkal.

In Partial Fulfillment of the Requirements for the Award of the Degree of BACHELOR OF ELECTRONICS AND COMMUNICATION ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL SRINIVASNAGAR – 575 025 KARNATAKA, INDIA May 2006


1. INTRODUCTION

Discrete orthogonal transforms (DOTs), which are used frequently in many applications including image and speech processing, have evolved quite rapidly over the last three decades. Typically, these applications require enormous computing power. However, a close examination of the algorithms used in such real-world applications (e.g. the discrete cosine transform (DCT), the discrete Fourier transform (DFT), and singular value decomposition (SVD)) reveals that many of the fundamental operations involve matrix or vector multiplications. Unfortunately, matrix-matrix multiplication algorithms have O(N^3) complexity and matrix-vector multiplication algorithms have O(N^2) complexity, making them computationally intensive for large-scale problems. Consequently, techniques that reduce this complexity are required.

In real applications, data to be processed is often available as a stream of input values. In such cases, the throughput rate (determined by the time between two consecutive input data values) is more important than the latency (the time from the input of a set of data to its computed result). Therefore, some level of pipelining has to be incorporated into the overall system.

The use of systolic arrays in the design and implementation of high-performance digital signal processing equipment is now well established. Most of the research to date has concentrated on word-level systems where the typical processor occupies at least a single chip. The systolic array concept can also be exploited at the bit level in the design of individual VLSI chips. A bit-level methodology exhibits a number of attractive features:
• The basic processing element is small (typically a gated full adder), so an entire array of these elements may be integrated on a single chip.
• The computation time for a single cell is small (typically 3–4 gate delays), so the overall throughput rate achievable in a given technology is very high.
• The highly regular structure of the circuits renders them comparatively easy to design and test.
• The interconnections between the PEs are regular and nearest-neighbour, giving a high level of pipelinability, small chip area, and low power dissipation.

Conventional ROM-based multipliers may also be used, but their size grows exponentially with word length and transform size, which becomes the bottleneck. In this project, DCT-oriented matrix-matrix and matrix-vector multiplication based on the Baugh-Wooley algorithm was implemented using a structural style of VHDL coding. A full-custom, transistor-level layout of the same design was drawn manually using the Magic layout tool and simulated using ngspice.


2. BAUGH-WOOLEY ALGORITHM

2.1 OVERVIEW

The algorithm specifies that all possible AND terms (partial products) are created first and then sent through an array of half-adders and full-adders, with the carry-outs chained to the next most significant bit at each level of addition. By exploiting the properties of the two's complement number system, the Baugh-Wooley algorithm performs signed multiplication in almost the same way as unsigned multiplication.

2.2 MATHEMATICAL FORMULATION

Let X = [x_0, x_1, \ldots, x_{N-1}]^t be the N-point input vector and C be the N x N kernel matrix of an orthogonal transform. The transformed vector Y is given by

Y = CX = [y_0, y_1, \ldots, y_{N-1}]^t    [1]

such that

y_m = \sum_{k=0}^{N-1} C_{mk}\, x_k    [2]

Let the elements of the kernel matrix C and of the data vector X be represented in 2's complement code as

C_{mk} = -C_{mk}^{n-1}\, 2^{n-1} + \sum_{i=0}^{n-2} C_{mk}^{i}\, 2^{i}    [3]

x_k = -x_k^{n-1}\, 2^{n-1} + \sum_{j=0}^{n-2} x_k^{j}\, 2^{j}    [4]

where x_k^{j} and C_{mk}^{i} are the j-th bit of x_k and the i-th bit of C_{mk}, respectively (each either zero or one), x_k^{n-1} and C_{mk}^{n-1} are the sign bits, and n is the word length.
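For instance (an illustrative case with word length n = 4), the 2's complement word 1011 decomposes according to [4] as

x_k = -1 \cdot 2^{3} + 0 \cdot 2^{2} + 1 \cdot 2^{1} + 1 \cdot 2^{0} = -5.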

Substituting [3] and [4] into [2], we have

y_m = \sum_{k=0}^{N-1} \Big[ -C_{mk}^{n-1}\, 2^{n-1} + \sum_{i=0}^{n-2} C_{mk}^{i}\, 2^{i} \Big] \Big[ -x_k^{n-1}\, 2^{n-1} + \sum_{j=0}^{n-2} x_k^{j}\, 2^{j} \Big]    [5]

Using the Baugh–Wooley algorithm, [5] may be expressed as

y_m = \sum_{k=0}^{N-1} \Big[ \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} 2^{i+j}\, C_{mk}^{i} x_k^{j} + 2^{2n-2}\, C_{mk}^{n-1} x_k^{n-1} + \Big( \sum_{j=0}^{n-2} -2^{j}\, C_{mk}^{n-1} x_k^{j} + \sum_{i=0}^{n-2} -2^{i}\, x_k^{n-1} C_{mk}^{i} \Big) 2^{n-1} \Big]    [6]

Rewriting the negative partial products, [6] becomes

y_m = \sum_{k=0}^{N-1} \Big[ \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} 2^{i+j}\, C_{mk}^{i} x_k^{j} + 2^{2n-2}\, C_{mk}^{n-1} x_k^{n-1} + \sum_{j=0}^{n-2} (1 - C_{mk}^{n-1} x_k^{j})\, 2^{n+j-1} + \sum_{i=0}^{n-2} (1 - x_k^{n-1} C_{mk}^{i})\, 2^{n+i-1} - 2 \sum_{i=n-1}^{2n-3} 2^{i} \Big]    [7]

Since the bits are binary, 1 - C_{mk}^{n-1} x_k^{j} = (C_{mk}^{n-1} x_k^{j})', and [7] reduces to

y_m = \sum_{k=0}^{N-1} \Big[ \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} 2^{i+j}\, C_{mk}^{i} x_k^{j} + 2^{2n-2}\, C_{mk}^{n-1} x_k^{n-1} + \Big( \sum_{j=0}^{n-2} 2^{j}\, (C_{mk}^{n-1} x_k^{j})' + \sum_{i=0}^{n-2} 2^{i}\, (x_k^{n-1} C_{mk}^{i})' \Big) 2^{n-1} + 2^{n} - 2^{2n-1} \Big]    [8]
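The last two terms of [8] arise because the subtracted geometric series in [7] collapses to a pair of isolated bits:

-2 \sum_{i=n-1}^{2n-3} 2^{i} = -2\,(2^{2n-2} - 2^{n-1}) = 2^{n} - 2^{2n-1}.

Since the product of two n-bit numbers is held in 2n bits, -2^{2n-1} is congruent to +2^{2n-1} modulo 2^{2n}, so the correction simply adds a "1" at bit positions n and 2n-1; for n = 4 these are the fifth and eighth columns referred to in the next section.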

The N-point discrete orthogonal transform output given by [8] may be computed by the systolic architecture described in the following section.

2.3 SYSTOLIC IMPLEMENTATION

From [8] it can be seen that the multiplication of C_{mk} and x_k, expressed in 2's complement representation, can be written in a form that involves only positive bit products. The multiplier design is based on the multiplication scheme shown in Fig. 1 for word length n = 4. The partial-product terms are formed by ANDing each multiplicand bit with each multiplier bit. The partial products x_k^j C_{mk}^3 and x_k^3 C_{mk}^i, for i, j = 0, 1 and 2, carry the sign information of the operands and, according to the Baugh–Wooley algorithm, are complemented before being used as partial products. The final product is computed by adding a "1" to the fifth and eighth columns along with all the partial-product terms (the complete column arrangement for n = 4 is written out after Fig. 1).

The proposed 2's complement serial–parallel multiplier consists of a logic unit and an adder unit, shown in Fig. 2 (for n = 4). The logic unit consists of four AND gates, one NAND gate, four XOR gates, and an OR gate. The duration of a clock cycle is max{T_A, T_G}, where T_A is the full-adder delay and T_G is the total gate delay of an AND operation and an XOR operation followed by an AND/OR operation. The output of the multiplier is obtained from the adder unit by the carry-save, add-shift technique. Q and Q' are two control signals: in the first three and the last four clock cycles Q = 0, while in the fourth clock cycle Q = 1; Q' = 0 in the sixth, seventh and eighth clock cycles and 1 in the remaining cycles. The extra "1" (required by the Baugh–Wooley algorithm) for the fifth column is provided to the right-most adder through an OR gate with the help of the delayed Q signal. The extra "1" for the eighth column is provided to the left-most adder with the help of the Q' signal through an AND gate. Each bit of the multiplicand x_k is provided to the first row of AND and NAND gates of the multiplier simultaneously, while the bits of C_{mk} are stored and fed to the individual gates (Fig. 2). During the first clock cycle all flip-flops are reset, and the first bit of x_k and the control signal Q are fed to the multiplier. The control signal Q' is fed to an AND gate following the left-most XOR gate. Four zeros are appended to the left of the MSB of each x_k for input/output synchronisation.


Fig. 1. 4 X 4 bit 2’s complement multiplication using the Baugh–Wooley algorithm.
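The scheme of Fig. 1 corresponds to the following column arrangement, written out from [8] for n = 4 (shorthand: c_i for C_{mk}^{i}, x_j for x_k^{j}, and (\cdot)' for a complemented bit product):

\begin{array}{r|cccccccc}
\text{weight} & 2^7 & 2^6 & 2^5 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \hline
x_0 \text{ row} & & & & & (c_3 x_0)' & c_2 x_0 & c_1 x_0 & c_0 x_0 \\
x_1 \text{ row} & & & & (c_3 x_1)' & c_2 x_1 & c_1 x_1 & c_0 x_1 & \\
x_2 \text{ row} & & & (c_3 x_2)' & c_2 x_2 & c_1 x_2 & c_0 x_2 & & \\
x_3 \text{ row} & & c_3 x_3 & (c_2 x_3)' & (c_1 x_3)' & (c_0 x_3)' & & & \\
\text{correction} & 1 & & & 1 & & & &
\end{array}

Summing the columns with carry propagation yields the 8-bit 2's complement product; for example, c = 1111 (-1) and x = 0110 (6) give 1111 1010 (-6), in agreement with the test case of Fig. 8.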

Fig. 2. 4 x 4 bit 2's complement serial-parallel Baugh–Wooley multiplier (BWM).


3. SYSTOLIC ARCHITECTURE

Contemporary parallel architectures may be grouped into three classes based on structure: vector processors, multiprocessor systems, and array processors. Vector processors and multiprocessor systems belong to the domain of general-purpose computers, while most array processors are designed for special-purpose applications. Array processors, as a computing paradigm, are capable of meeting the real-time processing requirements of a variety of application domains. Locally interconnected computing networks such as systolic and wavefront processor arrays, owing to their massive parallelism and regular data flow, allow efficient implementation of a large number of algorithms of practical significance, especially in the areas of image processing, signal processing, and robotics. A systolic system is defined as "a network of processors which rhythmically compute and pass data through the system". Systolic arrays, as a class of pipelined array architectures, display regular and modular structures that are locally interconnected to allow a high degree of pipelining and synchronized multiprocessing. The primary reasons for the use of systolic arrays in special-purpose processing are their simple and regular design, their concurrency and communication, and their balanced computation and I/O.

As an example, consider a linear array. It has the following properties:
• It is a fixed-connection network.
• The underlying graph is fixed.
• There are only local connections.
• The I/O locations are fixed.

At each step of a globally synchronous clock, each processor:
1. Receives inputs from its neighbours (or from I/O).
2. Inspects its local memory.
3. Performs a local computation.
4. Updates its local memory.
5. Generates outputs for its neighbours.

Example: sorting on a linear array. At each step every cell:
• Accepts the input from its left.
• Compares the input with the stored value.
• Stores the smaller value.
• Outputs the bigger value to its right.

(A VHDL sketch of one such sorting cell is given after Fig. 3.)


Fig. 3. Sorting the input sequence 3, 5, 1, 2, 3 on a linear systolic array (snapshots of the values flowing through the cells).
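The behaviour of one sorting cell can be sketched in VHDL as follows. This is an illustrative sketch only and not part of the reported design; the entity name, port names and the use of integer ports are assumptions made for the example.

entity sort_cell is
    Port ( din  : in integer;      -- value arriving from the left neighbour
           clk  : in bit;
           rst  : in bit;
           dout : out integer);    -- value passed on to the right neighbour
end sort_cell;

architecture Behavioral of sort_cell is
    signal stored : integer := integer'high;   -- an "empty" cell holds a large sentinel value
begin
    process(clk)
    begin
        if clk='1' and clk'event then
            if rst='1' then
                stored <= integer'high;
            elsif din < stored then
                dout   <= stored;              -- push the bigger value to the right
                stored <= din;                 -- keep the smaller value
            else
                dout   <= din;                 -- stored value is already the smaller one
            end if;
        end if;
    end process;
end Behavioral;

A chain of N such cells sorts N values: after all inputs have been clocked in, the smallest value sits in the left-most cell, as the snapshots of Fig. 3 suggest.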

3.1 PROCESSING ELEMENT

Each PE comprises a serial-parallel Baugh–Wooley multiplier (BWM) as shown in Fig. 4, a flip-flop (FF) for saving the carry bit, and a full adder that adds the partial-product result to the result generated by the previous PE.

Fig. 4. Processing Element
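A minimal structural sketch of such a PE is shown below. It is an illustration only, not the report's exact netlist: it reuses the bmw multiplier of Section 4.1 and assumes full-adder (fax) and flip-flop (dffx) components with the port orders (a, b, cin, sum, carry) and (d, clk, q) suggested by their use in Section 4.1.

entity pe is
    Port ( c_word  : in bit_vector(3 downto 0);   -- kernel element, held in parallel
           x_bit   : in bit;                      -- data bit arriving serially, LSB first
           s1,s2   : in bit;                      -- control signals (Q and Q')
           sum_in  : in bit;                      -- serial result from the previous PE
           clk     : in bit;
           sum_out : out bit);                    -- serial result passed to the next PE
end pe;

architecture Structural of pe is
    signal prod, carry, carry_d : bit;
    component bmw
        Port ( a : in bit_vector(3 downto 0); b,s1,s2 : in bit; clk : in bit; prod : out bit);
    end component;
    component fax
        Port ( a,b,cin : in bit; sum,cout : out bit);
    end component;
    component dffx
        Port ( d,clk : in bit; q : out bit);
    end component;
begin
    m0: bmw  port map (c_word, x_bit, s1, s2, clk, prod);       -- serial-parallel Baugh-Wooley multiplier
    a0: fax  port map (prod, sum_in, carry_d, sum_out, carry);  -- add the partial product to the previous PE's result
    f0: dffx port map (carry, clk, carry_d);                    -- save the carry bit for the next cycle
end Structural;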

3.2 MATRIX VECTOR MULTIPLICATION

Equation [8] can be mapped onto the proposed architecture as shown in Fig. 5 for the case N = 4. Using the same PE structure as for matrix multiplication, the matrix elements a_ij are fed from the north in a parallel/serial fashion, bit by bit, LSB first (LSBF), while the vector elements b_i are fed in parallel and remain fixed in their corresponding PE cells during the entire computation. Each bit of the final


product of a PE is fed to the full adder of the preceding PE, so that the corresponding output bits of the PEs are added to produce the desired output bit in an LSBF fashion. During the first eight cycles the first inner product [C1] is computed in an LSBF fashion. Then, during the second (third) set of eight cycles, the inner product [C2] ([C3]) is computed. Finally, the fourth inner product [C4] is available at the end of the fourth set of eight cycles. The entire computation is thus carried out in 2nN clock cycles with a structure of only N PEs.
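As a quick check of the figures quoted above, for the illustrative case n = 4 and N = 4:

2nN = 2 \times 4 \times 4 = 32 \text{ clock cycles in total}, \qquad \text{one inner product every } 2n = 8 \text{ cycles}, \qquad N = 4 \text{ PEs}.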

Fig. 5. Matrix Vector multiplier

3.3 MATRIX MATRIX MULTIPLICATION

Equation [6] can be mapped onto the proposed architecture. Fig. 6 shows the architecture obtained for N = 4. It consists of sixteen identical processing elements (PEs). Each PE comprises a serial-parallel Baugh–Wooley multiplier (BWM) as shown in Fig. 2, a flip-flop (FF) for saving the carry bit and a full adder that adds the result of the partial product to the result generated from the previous PE. The matrix elements b_ij are fed from the north in a parallel/serial fashion, bit by bit, LSB first (LSBF), while the matrix elements a_ij are fed in parallel and remain fixed in their corresponding PE cells during the entire computation. Each bit of the final product of a PE is fed to the full adder of the preceding PE, so that the corresponding output bits of the PEs are added to produce the desired output bit in an LSBF fashion. During the first eight cycles the four inner products [Ci1] (i = 1, 2, 3, 4) are computed in an LSBF fashion. Then, during the second (third) set of eight cycles, the four inner products [Ci2] ([Ci3]) are computed. Finally, the four inner products [Ci4] are computed and are available at the output buffer at the end of the fourth set of eight cycles.


It is worth mentioning that the array produces four coefficients of the matrix C every eight clock cycles based on the multiply-accumulate technique, and therefore the entire computation can be carried out in 2nN clock cycles with a structure requiring N^2 PEs.
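For the same illustrative case (n = 4, N = 4) this means

N^2 = 16 \text{ PEs}, \qquad N = 4 \text{ coefficients every } 2n = 8 \text{ cycles}, \qquad \text{all } 16 \text{ coefficients in } 2nN = 32 \text{ cycles}.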

Fig. 6. Matrix Matrix multiplier


4. IMPLEMENTATION IN VHDL

The proposed architectures for matrix-vector and matrix-matrix multiplication were designed and simulated in VHDL. A structural approach was used, in which simple gates, D flip-flops and adders were used to build the RAM, the Baugh-Wooley multiplier and the processing element. Finally, the local memories and PEs were interconnected to obtain the desired structure. Individual components were simulated for different input patterns and the waveforms were noted. After interconnection, the architecture was simulated for various inputs. The maximum clock frequency and the number of slices were also noted.

4.1 BAUGH-WOOLEY MULTIPLIER

A serial-parallel Baugh-Wooley multiplier was designed using basic gates, adders and D flip-flops. It performs 4-bit by 4-bit signed multiplication on numbers in 2's complement form. The necessary control signals were applied and the design was simulated.

VHDL code:

entity bmw is
    Port ( a : in bit_vector(3 downto 0);
           b,s1,s2 : in bit;
           clk : in bit;
           prod : out bit);
end bmw;

architecture Behavioral of bmw is
    signal m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11 : bit;
    signal w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11 : bit;
begin
    p0:  nandx port map (b,a(3),w1);
    p1:  andx  port map (b,a(2),w2);
    p2:  andx  port map (b,a(1),w3);
    p3:  andx  port map (b,a(0),w4);
    p4:  dffx  port map (s1,clk,w5);
    p5:  xorx  port map (s1,w1,w6);
    p6:  xorx  port map (s1,w2,w7);
    p7:  xorx  port map (s1,w3,w8);
    p8:  xorx  port map (s1,w4,w9);
    p9:  andx  port map (s2,w6,w10);
    p10: orx   port map (w9,w5,w11);
    p11: dffx  port map (w10,clk,m1);
    p12: fax   port map (m1,w7,m4,m2,m3);
    p13: dffx  port map (m3,clk,m4);


    p14: dffx port map (m2,clk,m5);
    p15: fax  port map (m5,w8,m8,m6,m7);
    p16: dffx port map (m6,clk,m9);
    p17: dffx port map (m7,clk,m8);
    p18: fax  port map (m9,w11,m11,prod,m10);
    p19: dffx port map (m10,clk,m11);
end Behavioral;

Synthesis Report:
Maximum Frequency: 281.770 MHz
Maximum combinational path delay: 10.277 ns
Number of Slices: 6 out of 192
Number of Slice Flip Flops: 7 out of 384
Number of 4 input LUTs: 11 out of 384

RTL Schematic:

Fig. 7. Baugh-Wooley multiplier (RTL Schematic)

Fig. 8. Test bench waveform for the BWM, showing 2 x 3 = 6 (0010 x 0011 = 0000 0110) and -1 x 6 = -6 (1111 x 0110 = 1111 1010).
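The structural code above instantiates simple leaf components (andx, nandx, orx, xorx, dffx, fax) whose own descriptions are not reproduced in this report. The following sketch shows plausible definitions for some of them; the port orders are inferred from the instantiations above and are therefore an assumption rather than the original source.

entity fax is
    Port ( a,b,cin : in bit; sum,cout : out bit);
end fax;

architecture Behavioral of fax is
begin
    sum  <= a xor b xor cin;                          -- full-adder sum
    cout <= (a and b) or (b and cin) or (a and cin);  -- full-adder carry
end Behavioral;

entity dffx is
    Port ( d,clk : in bit; q : out bit);
end dffx;

architecture Behavioral of dffx is
begin
    process(clk)
    begin
        if clk='1' and clk'event then
            q <= d;                                   -- rising-edge D flip-flop
        end if;
    end process;
end Behavioral;

-- andx, nandx, orx and xorx are assumed to be plain two-input gates, for example:
entity andx is
    Port ( a,b : in bit; y : out bit);
end andx;

architecture Behavioral of andx is
begin
    y <= a and b;
end Behavioral;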


4.2 RAM AND CONTROL UNIT

Four 4 x 4 bit RAMs are used to store the kernel matrix. Each RAM is designed using shift registers with load. The control signals needed by the Baugh-Wooley multiplier are generated using shift registers. A mod-9 counter is used to synchronise all operations.

VHDL code:

entity ram is
    Port ( b1,b2,b3,b4 : in bit_vector(3 downto 0);
           clk,loadram,load : in bit;
           data : out bit_vector(3 downto 0));
end ram;

architecture Behavioral of ram is
    signal q1,q2,q3,q4 : bit_vector(3 downto 0);
begin
    process(clk)
    begin
        if clk='1' and clk'event then
            if loadram='1' then
                q1
