2010 International Conference on Reconfigurable Computing
A Two Level Architecture for high throughput DCT-Processor and Implementing on FPGA
Dr.Mahmood Fathy Computer faculty Iran University of Science and Technology Tehran, Iran
[email protected]
Azad Fakhari Computer faculty Iran University of Science and Technology Tehran, Iran
[email&#160;protected]
ABSTRACT- Frequency analysis using the discrete cosine transform is used in a wide variety of algorithms, such as image processing algorithms. This paper proposes a new high-throughput architecture for the DCT processor. The system has a 2-level architecture that uses parallelism and pipelining, and it has been synthesized on a Xilinx Virtex5 FPGA. Synthesis results show that the system works at 150MHz. Applying the DCT to each 8x8 matrix of the image takes 67 clock pulses; in other words, applying the DCT to each pixel takes approximately one clock pulse.
978-0-7695-4314-7/10 $26.00 © 2010 IEEE DOI 10.1109/ReConFig.2010.67
I. INTRODUCTION
Nowadays, with the high demand for video and image data transmission and storage, compression is a must, both to achieve lower transmission times for high-quality images and to use less storage space. The DCT plays a key role in many compression standards, such as JPEG for still image compression, ITU H.261 and ITU H.263 for teleconferencing, and ISO MPEG1 and MPEG2 for moving pictures and home video. Image compression using the 2D-DCT consists of several stages: pre-processing, applying the discrete cosine transform, quantization, and coding. The 2D-DCT is the most computationally intensive phase of the encoding process, and accelerating it would dramatically reduce the overall processing time. Several algorithms and implementation methods have been proposed for the DCT, ranging from software implementations on DSPs to hardware implementations in ASICs. The best implementation option depends on the final objective and the application. Generally, when speed is the priority, a hardware implementation is the best option. Many implementation methods for 2D-DCT algorithms have been proposed to reduce the computational complexity and thus increase the operational speed and throughput. The 2D-DCT can be broken down into two groups of N 1D-DCTs, which is equivalent to processing a data block by rows followed by column processing, or vice versa; this is called row/column decomposition. Algorithms and architectures for the 2D-DCT can be divided into two categories:
• Row/column decomposition methods
• Non row/column decomposition methods
The DCT has been implemented in several architectural forms, of which two are the most common: systolic architecture (SA) and distributed arithmetic (DA) [1], [2], [3], [4]. FPGAs facilitate accessibility and configurability when implementing hardware on a chip, and the FPGA chip can be placed in an actual circuit to be evaluated under real conditions. Several DCT architectures have been presented so far. Some are systolic, such as [5], a reconfigurable 2D-DCT architecture; [6], an array-based architecture with high scalability; and [7]. The Residue Number System (RNS) is used in a row/column decomposition technique in [8], [9]. A polynomial transformation algorithm in two dimensions using the 1D-DCT is implemented on FPGA in [3]. Several architectures use distributed arithmetic (DA): for example, [1] presents a fully parallel architecture based on the method of Chen et al., and in [10] group distributed arithmetic (GDA) combines the good features of cyclic convolution and DA computation using shared ROM modules, barrel shifters, and accumulators. In [11], a compressed distributed arithmetic architecture for the 2D 8x8 DCT is presented, based on a distributed arithmetic 1D-DCT architecture. In [12], look-up tables and row-column overlapped operations are combined, using look-up tables and shift registers to avoid the transpose operation; because this design uses neither multipliers nor transpose RAMs, it can achieve high speed and low latency for real-time applications. [13] uses a recursive algorithm, and [14] and [15] are based on a frame-recursive approach using two 1-D DCT arrays. [16] is based on partial sums, and [4] is based on the Lee algorithm with a combined pipeline. There are many other architectures, such as [2] and [17].
This paper proposes a new architecture which is efficient in speed because of its parallelism and pipelining. Section 2 covers the discrete cosine transform and its aspects. The proposed architecture is presented in full in Section 3: the 2-level architecture is defined, each level is described individually, and the system function is then shown as a state machine. Section 4 shows how this architecture is implemented in hardware; as in the architecture section, the implementation is described for each level individually. Section 5 covers simulation of the system, which uses sample data to verify the results. Section 6 covers synthesizing the design on FPGA and the resulting reports. Finally, Section 7 analyzes the system and how fast it is.
Therefore, calculating all 64 values of F(a,b) simultaneously requires 64 similar calculation units, each with its own coefficients, which depend on the values of a and b for that unit. When one operational cycle finishes, the final values of F(a,b) are ready in all 64 calculation units. To apply the DCT to the whole image matrix, the image must be divided into non-overlapping 8x8 matrices, and the DCT operations must then be applied to each 8x8 matrix individually. The system that implements these operations must be in charge of all of the following tasks:
• Reading the 8x8 matrix values from the related memory addresses
• Controlling the 64 calculation units that apply the DCT to the 8x8 matrix data
• Writing the calculated data F(a,b) to the related memory addresses
• Shifting the 8x8 window to the next part of the image when an operational cycle is finished
• Declaring the end of the whole operation (applying the DCT to the whole image)
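The window-shifting and addressing tasks listed above can be illustrated with a minimal Python sketch. It assumes a row-major image memory, image dimensions that are multiples of 8, and a left-to-right, top-to-bottom window order; none of these details are stated in the paper, and the function names are hypothetical.

```python
def block_origins(width, height, n=8):
    """Yield the (x0, y0) coordinate of the first element of each n x n window.

    Assumption: windows are visited left to right, then top to bottom.
    """
    for y0 in range(0, height, n):
        for x0 in range(0, width, n):
            yield x0, y0


def block_addresses(width, x0, y0, n=8):
    """Return the n*n memory addresses of one window.

    Assumption: the image is stored row-major, one value per address.
    """
    return [(y0 + r) * width + (x0 + c) for r in range(n) for c in range(n)]


if __name__ == "__main__":
    # A 16x8 image holds two non-overlapping 8x8 windows.
    for x0, y0 in block_origins(16, 8):
        print((x0, y0), block_addresses(16, x0, y0)[:4])
```

The same address list can serve for both the source and the destination memory if the transformed block is written back to the positions it was read from, which is also an assumption of this sketch.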
II. DISCRETE COSINE TRANSFORM
The DCT is one of the major transforms used in many compression algorithms. It is a frequency transform which is related to the real part of the discrete Fourier transform (DFT). The 2D-DCT is very common in image processing and especially in image compression. It operates on a 2-dimensional matrix of natural values. The discrete cosine transform receives an image matrix, which is divided into smaller image blocks (4x4, 8x8, 16x16 ...), and each block is transformed from the spatial domain to the frequency domain. The DCT decomposes the signal into spatial frequency components. The lower frequency parts appear toward the first row/first column of the DCT matrix, and the higher frequency parts are in the last row/last column. The transform multiplies the values of a small image block by specific cosine coefficients, and the sums of these products give the new value for each position [18], [19], [4]. The 2D-DCT lends itself well to parallelization. It is common to use 8x8 matrices to simplify the calculations. The DCT is applied to each 8x8 matrix of the image according to:

$$F(u,v) = \frac{C(u)\,C(v)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} \cos\left(\frac{(2i+1)u\pi}{16}\right) \cos\left(\frac{(2j+1)v\pi}{16}\right) f(i,j), \qquad 0 \le u \le 7,\; 0 \le v \le 7 \tag{1}$$

$$C(\beta) = \begin{cases} \dfrac{\sqrt{2}}{2}, & \beta = 0 \\ 1, & \text{otherwise} \end{cases}$$

where
F(u,v): values in the transform domain
f(i,j): values in the pixel domain
i, j: spatial coordinates in the pixel domain
u, v: coordinates in the transform domain

A DCT processor, a source memory that stores the original image values, and a destination memory in which the calculated data are stored are needed for this system.
A. 2LEVEL ARCHITECTURE
The DCT processor is designed in 2 levels. The high level unit is in charge of reading data from the source memory, transferring the data to the low level unit, reading the calculated data from the low level unit, and writing them to the destination memory. The low level unit, on the other hand, is only in charge of applying the DCT calculations to the data received from the high level unit.
B. HIGH LEVEL UNIT ARCHITECTURE
The high level unit includes several sub-units:
1) Counting unit: a sequence counter used for defining the state of the system and driving signals based on that state.
2) Reading register file: 64 registers for data read from the source memory.
3) Calculation register file: 64 registers for the data used by the low level unit.
4) Writing register file: 64 registers for the data which must be written to the destination memory.
5) Image size registration unit: 2 registers which store the height and width of the image.
6) Current coordinate of 8x8 matrix registration unit: 2 counters which hold the coordinates of the first element of the current 8x8 matrix in the image.
7) Address generator unit: uses the values of the counting unit, the image size registration unit, and the current coordinate of 8x8 matrix registration unit to generate these addresses:
III. SYSTEM ARCHITECTURE
Unlike many other architectures, in this paper we propose a non row/column decomposition method which implements the 2D-DCT directly. Each 8x8 matrix has 64 elements, and for each element (a,b) (0 ≤ a ≤ 7, 0 ≤ b ≤ 7) the procedure that produces F(a,b) must be applied once. Because the procedure for each element is independent of the other 63 elements, and because the 64 cosine coefficients, C(a), and C(b) for each element (a,b) are constants independent of the pixel values f(i,j), the calculation of F(a,b) can be carried out for each element of the 8x8 matrix simultaneously with the other 63. In other words, there can be 64 parallel procedures for the 64 elements of the 8x8 matrix.
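As an illustration of this independence, the following Python sketch evaluates Equation (1) directly, treating each output as a separate sum of 64 products with constant coefficients. The function names are hypothetical, the indices are written (u,v) as in Equation (1), and floating-point arithmetic is used for clarity, whereas the hardware described in this paper uses scaled integer coefficients.

```python
import math

N = 8


def c(beta):
    """Normalization factor C(beta) from Equation (1)."""
    return math.sqrt(2) / 2 if beta == 0 else 1.0


def unit_coefficients(u, v):
    """The 64 constant coefficients of the calculation unit at position (u, v)."""
    return [
        c(u) * c(v) / 4
        * math.cos((2 * i + 1) * u * math.pi / 16)
        * math.cos((2 * j + 1) * v * math.pi / 16)
        for i in range(N)
        for j in range(N)
    ]


def dct_8x8_direct(block):
    """Direct (non row/column) 2D-DCT of an 8x8 block given as a list of rows."""
    pixels = [block[i][j] for i in range(N) for j in range(N)]
    # Each (u, v) entry depends only on its own coefficients and the shared
    # pixel list, so all 64 sums could be computed in parallel.
    return [
        [sum(k * p for k, p in zip(unit_coefficients(u, v), pixels))
         for v in range(N)]
        for u in range(N)
    ]


if __name__ == "__main__":
    flat = [[100] * N for _ in range(N)]
    result = dct_8x8_direct(flat)
    print(round(result[0][0], 6), round(result[3][5], 6))
```

For a constant block only the (0,0) output is non-zero, which gives a quick sanity check of the coefficient definition.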
Fig2: low level unit block diagram
Fig1: high level unit block diagram
• Source memory address for reading data
• Destination memory address for writing data
• Current addresses of the 3 register files
• Current coefficient address in the low level unit
8) Control unit: generates control signals for the memories, register files, high level sub-units, and low level sub-units. Control signals include enable and reset signals.
Fig3: system state machine
C. LOW LEVEL UNIT ARCHITECTURE
The low level unit includes 64 similar sub-units which differ only in their coefficients. Each sub-unit includes these parts:
1) Coefficient memory: 64 rows, each holding the coefficient value for one combination of i and j in the DCT formula. To allow integer arithmetic, every coefficient has been multiplied by 2^16 = 65536.
2) Multiplication unit: a fast combinational (clock-free) multiplier which multiplies the input data from the high level unit by the related coefficient from the coefficient memory.
3) Accumulation unit: starts at 0 and, on each clock pulse, adds the incoming product to the accumulated value. Therefore, after 64 clock pulses it contains the final value of F(a,b). The 16 least significant bits of the result must be discarded in order to divide it by 2^16 = 65536.
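The behavior of one sub-unit can be sketched in Python as a fixed-point multiply-accumulate loop. This is a minimal model under stated assumptions: the coefficient address is taken to be 8i + j, coefficients are rounded to the nearest integer after scaling, and all names are hypothetical. The paper specifies only the 2^16 scaling and the removal of 16 bits from the result.

```python
import math

SCALE_BITS = 16  # coefficients are stored multiplied by 2^16 = 65536


def c(beta):
    """Normalization factor C(beta) from the DCT formula."""
    return math.sqrt(2) / 2 if beta == 0 else 1.0


def coefficient_rom(u, v):
    """64 signed integer coefficients for the sub-unit at position (u, v).

    Assumption: row i*8 + j holds the coefficient for pixel (i, j).
    """
    rom = []
    for i in range(8):
        for j in range(8):
            k = (c(u) * c(v) / 4
                 * math.cos((2 * i + 1) * u * math.pi / 16)
                 * math.cos((2 * j + 1) * v * math.pi / 16))
            rom.append(round(k * (1 << SCALE_BITS)))
    return rom


def mac_unit(rom, pixels):
    """Accumulate 64 products, one per clock pulse, then drop the low 16 bits."""
    acc = 0  # the accumulator starts at 0
    for count in range(64):  # the counter value addresses the coefficient memory
        acc += rom[count] * pixels[count]
    return acc >> SCALE_BITS  # equivalent to dividing by 65536 with truncation


if __name__ == "__main__":
    print(mac_unit(coefficient_rom(0, 0), [100] * 64))
```

In Python, `>>` on a negative integer rounds toward negative infinity, which matches dropping the low bits of a two's complement value; whether the hardware treats negative results the same way is not stated in the paper.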
At state 1, the counting unit counts from 0 to 63 and stops when it reaches 64. During this counting period, the following actions take place:
• 64 new data values are read from the source memory and stored in the reading register file.
• The 64 data values read during the previous cycle are sent to the 64 low level sub-units for the DCT calculations. The coefficient memory in each sub-unit is addressed by this count from 0 to 63.
• The 64 data values calculated during the previous cycle are written to the destination memory.
When the counting unit stops, the system goes from state 1 to state 2. At state 2, on the first rising clock edge, the data in the reading register file are moved simultaneously to the corresponding registers of the calculation register file. The final value in each calculation sub-unit is also placed on that sub-unit's output. The system goes from state 2 to state 3 immediately after this clock edge. At state 3, on the first rising clock edge, the 64 final values calculated in the sub-units are moved simultaneously to the corresponding registers of the writing register file. The system goes from state 3 to state 4 immediately after this clock edge. At state 4, one of two actions can happen.
D. SYSTEM STATE MACHINE
At the start state, the system receives the width and height of the image via two dedicated input pins and stores them in the image size registration unit. The system also resets all other registers in this state. An input signal selects whether the system is in the configuration state (the start state) or the process state (which begins with state 1); changing this signal moves the system to state 1.
• If the last part of the calculated data in the writing register file was written to the destination memory during state 1 of the current cycle, the system goes to the final state on the first rising clock edge.
• Otherwise, the counting unit is reset and the system starts a new cycle by going to state 1.
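The cycle structure of states 1 to 4 can be checked with a small behavioral model. The Python sketch below is a simplification under stated assumptions: it tracks only which block index sits in each register file, charges 64 clock pulses for state 1 and one clock pulse for each of states 2, 3, and 4, and ignores the configuration state. The function and variable names are hypothetical.

```python
def simulate_pipeline(num_blocks):
    """Return (clock_pulses, write_order) for num_blocks 8x8 blocks."""
    if num_blocks < 1:
        raise ValueError("need at least one 8x8 block")
    # Block index held at each pipeline stage (None means no valid data yet).
    reading = calc = accum = output = writing = None
    next_block = 0
    clocks = 0
    written = []
    while True:
        # State 1: the counter runs for 64 clock pulses while three
        # activities overlap: write, calculate, and read.
        clocks += 64
        wrote_last = False
        if writing is not None:
            written.append(writing)  # results of an earlier block are stored
            wrote_last = writing == num_blocks - 1
        accum = calc  # sub-units accumulate the block in the calculation file
        reading = next_block if next_block < num_blocks else None
        next_block += 1
        # State 2: reading file -> calculation file, sums -> sub-unit outputs.
        clocks += 1
        calc, output = reading, accum
        # State 3: sub-unit outputs -> writing register file.
        clocks += 1
        writing = output
        # State 4: finish if the last block was written, else start a new cycle.
        clocks += 1
        if wrote_last:
            return clocks, written


if __name__ == "__main__":
    print(simulate_pipeline(4))
```

Under these assumptions every cycle costs 64 + 3 = 67 clock pulses, and an image of B blocks needs B + 2 cycles because two extra cycles are required to calculate and write the last block.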
B. LOW LEVEL UNIT IMPLEMENTATION
This unit includes 64 independent calculation sub-units which work concurrently on 16-bit input data. All sub-units have the same structure; the only difference between them is the coefficient values stored in their coefficient memories. As the counting unit counts from 0 to 63, each sub-unit multiplies its current coefficient by the corresponding 16-bit data value from the calculation register file. On each iteration, the current product is added to the sum of all previous products. When the counting unit reaches 64, the sum of the products of the 64 data values and the 64 coefficients is available in a register in every sub-unit. Since the 64 sub-units calculate concurrently, these 64 stored values are the result of applying the DCT to the 64 data values in the calculation register file, that is, the final values of the 8x8 matrix. The registers holding the final data in the sub-units are connected to the corresponding 64 registers of the writing register file. On the next rising clock edge, the data are copied into those registers. The coefficient memory in each sub-unit holds 64 signed values, which were calculated by a MATLAB function. The values for each sub-unit depend on the sub-unit's coordinates in the 8x8 matrix of sub-units, according to the DCT formula.
At the final state, the counting unit stops and an output signal changes to declare the end of the process.
IV. SYSTEM IMPLEMENTATION
Designing this system in two levels facilitates implementation, simulation, analysis, and testing in two separate parts which are completely independent of each other. The high level unit acts as a commander: it controls the system sequence, addresses the different parts, decides whether to continue with or change the current state of the system, and controls every element in both the high level unit and the low level unit. The low level unit acts as a calculation core which receives commands, data, and the required addresses from the high level unit and performs the DCT calculations, that is, the summation of products that yields the result of applying the DCT to the data.
A. HIGH LEVEL UNIT IMPLEMENTATION
At the start state, the 7-bit counter and the 3 flip-flops in the counting unit are reset. After the transition to state 1, the 7-bit counter starts counting up. When it reaches 64 (the 7th bit becomes '1'), it stops counting and, at the same time, the enable signal of the first flip-flop changes to '1'. The first flip-flop then becomes '1' on the next rising clock edge. This change returns the enable signal of the first flip-flop to '0' and simultaneously sets the enable signal of the second flip-flop to '1'. The second flip-flop then becomes '1' on the next rising clock edge, which returns its enable signal to '0' and sets the enable signal of the third flip-flop to '1'. After that, the third flip-flop becomes '1' on the next rising clock edge, which resets the 7-bit counter. The '0' value of the counter resets the first flip-flop, the '0' value of the first flip-flop resets the second, and the '0' value of the second resets the third. The '0' value of the third flip-flop ends the reset state of the 7-bit counter. In other words, the 7-bit counter can start counting again right after the third flip-flop changes from '1' to '0'. Only if the current cycle is the last cycle does the whole counting unit stop instead of counting again. In this case the system goes to the final state and an output signal declares the end of the process.
V. SIMULATION AND VERIFICATION
The system has been implemented as a VHDL model with two separate modules for the high level unit and the low level unit, which are integrated in a larger module called DCT processor containing the high level unit and the 64 sub-units of the low level unit. The whole model has then been simulated in ModelSim SE 6.0 for verification. In addition, a DCT function has been written in MATLAB 7.8 to apply the mathematical DCT to the same data and confirm the results. The results of the simulated system match the MATLAB results for the same data, apart from a small, negligible error. The design has been verified by comparing the results of the DCT processor and the MATLAB function.
VI. SYSTEM SYNTHESIS
After verification, the whole system has been synthesized on FPGA using Xilinx ISE. The device properties are presented in the following table.
TABLE I. Device properties
Brand     Family     Device        Package
Xilinx    Virtex5    xc5vlx155t    2ff1738
pulses to read an 8x8 matrix of data from memory, 64×64=4096 clock pulses to calculate the DCT of the 8x8 matrix, and 64 clock pulses to write the 8x8 matrix of final data to memory. This means it would take 4224 clock pulses for every 8x8 matrix of the image, or 66n clock pulses for an image with n pixels. In contrast, for the system described in this paper, each cycle takes 67 clock pulses. Applying DCT on an image
The most important timing information is shown in Table II.
TABLE II. Timing Information
Minimum Period    Maximum Frequency    Setup Time    Hold Time
6.630ns           150.824MHz           1.863ns       13.591ns
After synthesis, all the components can be extracted from the synthesis report. Tables III and IV show the components of high level unit and low level unit individually.
containing n pixels takes
processes one 8x8 matrix, and 2 more cycles are needed for calculating the last 8x8 matrix and writing it to the destination memory. Each cycle takes 67 clock pulses and
TABLE III. Components in High Level unit
Comp    Type        #      Reference
Cntr    7b up       1      Counter
Cntr    2b up       2      Start Counter, End Counter
Reg     16b         131    Rd Reg File, Calc Reg File, W, H, W Pos
Reg     18b         64     Wr Reg File
Reg     1b          4      FF1, FF2, FF3, Done
Acc     16b up      1      H Pos
Mult    17x16b      1      Src Mem Addr
Mult    18x18b      1      Src Mem Addr
Add     16b         1      W Pos
Add     16b cout    1      Src Mem Addr
Add     17b cout    1      Dest Mem Addr
Add     31b         1      Dest Mem Addr
Add     32b         5      Dest Mem Addr, Src Mem Addr
Sub     16b         2      H Pos, W Pos
Sub     17b         1      Dest Mem Addr
Sub     31b         1      Dest Mem Addr
Sub     32b         2      Dest Mem Addr
Comp    16b         2      Height Pos, Width Pos
Mux     16b 64-1    1      Data
Mux     18b 64-1    1      Dest mem data out
it means that 67× (
VIII.
TABLE IV. Components in Low Level unit
Comp    Type       #     Reference
ROM     64x2048    1     Rom
Acc     18b up     64    Sum Reg
Reg     18b        64    Data Reg
Mult    17x17      64    Data Reg Input
In addition, the utilization of the logic resources is 2%, of the memory 3%, and of the DSP slices 49%.
VIII. CONCLUSION
In this paper we have proposed a new high-throughput design for the discrete cosine transform which can be used in real-time systems. Using parallelism in the calculations for 64 different parts, together with a pipeline for reading, calculating, and writing data, provides a dramatic speed-up in applying the DCT to an image. In addition, calculating the coefficients for each calculation part in advance and storing them in a ROM saves a lot of time and hardware complexity. Implementing this design as an ASIC would decrease delays and increase the system frequency, so that it could work on larger images and higher-quality video. The system could, for instance, function as a co-processor in a digital camera, taking charge of compression and reducing the burden on the main processor. It can also be used as an IDCT processor simply by adding another ROM for the new coefficients, because the algorithm is exactly the same as for the DCT. Furthermore, by adding a quantizer, encoder, decoder, and inverse quantizer to the DCT/IDCT processor, a complete compressor/decompressor system can be designed. Such a compressor/decompressor system can work as a co-processor in an image processing system.
TABLE IV. Comp
n/64 + 2) clock pulses for the whole image. In other words, it takes approximately 1.05n + 134 clock pulses for an image which has n pixels. For large images the system processes approximately 1 pixel per clock pulse. Thus, this DCT processor, which has been synthesized on a Virtex5 FPGA and runs at 150MHz, can work in real time on NTSC (30fps) video with a frame size of 4.7 megapixels.
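The arithmetic behind these figures can be reproduced with a short Python sketch. It assumes that n is a multiple of 64 and that the clock runs at exactly 150MHz; the function names are hypothetical.

```python
CLOCKS_PER_CYCLE = 67   # 64 counting pulses plus 3 transition pulses
BLOCK_PIXELS = 64       # one 8x8 matrix


def pipelined_clocks(n_pixels):
    """Clock pulses for the proposed processor: 67 * (n/64 + 2)."""
    return CLOCKS_PER_CYCLE * (n_pixels // BLOCK_PIXELS + 2)


def sequential_clocks(n_pixels):
    """Clock pulses with one calculation unit and no pipelining."""
    per_block = 64 + 64 * 64 + 64  # read + calculate + write = 4224
    return per_block * (n_pixels // BLOCK_PIXELS)


def max_realtime_pixels(clock_hz, fps):
    """Largest frame size (in pixels) that fits in one frame period."""
    budget = clock_hz // fps                   # clock pulses per frame
    blocks = budget // CLOCKS_PER_CYCLE - 2    # two cycles drain the pipeline
    return blocks * BLOCK_PIXELS


if __name__ == "__main__":
    n = 640 * 480
    print(pipelined_clocks(n), sequential_clocks(n))
    print(max_realtime_pixels(150_000_000, 30))
```

With these assumptions the frame-size limit at 30fps comes out slightly above 4.7 megapixels, and the sequential design needs roughly 63 times as many clock pulses as the pipelined one for large images.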
Components in High Level unit
ROM Acc Reg Mult
n/64 + 2 cycles because each cycle
VII. SYSTEM ANALYSIS
This architecture and implementation of a DCT processor achieves very high time efficiency because of its parallelism and its pipelining of memory reads, calculation, and memory writes. Assuming the same DCT processor but with a single calculation unit (no parallelism) and without pipelining, it would take 64 clock
REFERENCES
[1] Reza Ebrahimi Atani, Mehdi Baboli, Sattar Mirzakuchaki, Shahabaddin Ebrahimi Atani, Babak Zamanlooy, "Design and Implementation of a 118 MHz 2D DCT Processor", Industrial Electronics, 2008. ISIE 2008. IEEE International Symposium on, 2008
[2] Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto, "A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs", IEEE Computer Society Annual Symposium on VLSI (ISVLSI'07), 2007
[3] Arturo Méndez Patiño, Marcos Martínez Peiró, F. Ballester and G. Payá, "2D-DCT on FPGA by polynomial transformation in two dimensions", ISCAS, 2004
[4] A. Kassem, M. Hamad, E. Haidamous, "Image Compression on FPGA using DCT", ACTEA, 2009
[5] F. Cariccia, P. Cariccia, M. Martina, A. Molino, F. Vacca, "Multimedia SoC a Systolic Core for Embedded DCT Evaluation", Thirty-Sixth Asilomar Conference on Signals, Systems & Computers, 2002
[6] Jian Huang, Jooheung Lee, Yimin Ge, "An Array-based Scalable Architecture for DCT Computations in Video Coding", IEEE Int. Conference Neural Networks & Signal Processing, Zhenjiang, China, June 8~10, 2008
[7] Yu-Tai Chang and Chin-Liang Wang, "A New Fast DCT Algorithm and Its Systolic VLSI Implementation", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 44, No. 11, November 1997
[8] P. G. Fernández, J. Ramírez, A. García, L. Parrilla, A. Lloris, "A New RNS Architecture for the Computation of the Scaled 2D-DCT on Field Programmable Logic", Conference Record of the Thirty-Fourth Asilomar Conference, 2000
[9] P. G. Fernández, A. García, J. Ramírez, A. Lloris, "Fast RNS-based 2D-DCT computation on field-programmable devices", SiPS, 2000
[10] Jiun-In Guo, Jia-Wei Chen, and Han-Chen Chen, "A new 2-D 8x8 DCT/IDCT core design using group distributed arithmetic", ISCAS '03. Proceedings of the 2003 International Symposium, 2003
[11] Peng Chungan, Cao Xixin, Yu Dunshan and Zhang Xing, "A 250MHz Optimized Distributed Architecture of 2D 8x8 DCT", ASIC, 2007. ASICON '07. 7th International Conference, 2007
[12] Jen-Shiun Chiang, Yi-Fang Chiu, and Teng-Hung Chang, "A high throughput 2-dimensional DCT-IDCT architecture for real-time image and video system", ICECS, 2001
[13] S. An, C. Wang, "Recursive algorithm, architectures and FPGA implementation of the two-dimensional discrete cosine transform", IET Image Processing, 21st March 2008
[14] Ching-Te Chiu and K. J. Ray Liu, "Real-Time Parallel and Fully Pipelined Two-Dimensional DCT Lattice Structures with Application to HDTV Systems", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 1, March 1992
[15] C. T. Chiu and K. J. Ray Liu, "Real-Time recursive two-dimensional DCT for HDTV systems", IEEE Transactions on Circuits and Systems for Video Technology, 1992
[16] Mao Tian, Guang-Jun Li, Qi-Zong Peng, "A New Fast Algorithm for 8 x 8 2-D DCT And Its VLSI Implementation", IEEE Int. Workshop VLSI Design & Video Tech., Suzhou, China, 2005
[17] Roger Endrigo Carvalho Porto, Luciano Volcan Agostini, "Project Space Exploration on the 2-D DCT Architecture of a JPEG Compressor Directed to FPGA Implementation", Proceedings of the Design, Automation and Test in Europe Conference and Exhibition Designers' Forum (DATE'04), 2004
[18] Rafael C. Gonzalez, Richard E. Woods, Digital Image Processing (2nd Edition), Prentice-Hall, 2002
[19] Ze-Nian Li, Mark S. Drew, Fundamentals of Multimedia, Prentice-Hall, 2004
[20] Volnei A. Pedroni, Circuit Design with VHDL, MIT Press, 2004
[21] The MathWorks, MATLAB 7.8.0 (R2009a) help, 2009