15th International Conference on Advanced Computing and Communications

A New Generalized Reconfigurable Architecture for Digital Signal Processor

Joyanta Basu, Md. Sahidullah, Amitabha Sinha
Department of Computer Science and Engineering, West Bengal University of Technology
[email protected], [email protected], [email protected]

Abstract

In this paper, a novel reconfigurable Digital Signal Processing (DSP) architecture and algorithm are proposed, in which the basic building blocks are high-performance adders, subtractors, multipliers etc. The architecture has been conceived with high performance, low dynamic configuration latency, flexibility and low power consumption in view. Issues involving the interconnection of the basic building blocks are dealt with in detail, and a new scheme is proposed.

1. Introduction

High-end digital signal processing applications demand high performance, flexibility and low power consumption. The fastest available Digital Signal Processors are unable to meet the speed requirements of many high-speed applications such as Software Defined Radio (SDR), 3G mobile communication and video applications, while Application Specific Integrated Circuits (ASICs) [1] are unsuitable because of their lack of flexibility. Though Field Programmable Gate Arrays (FPGAs) [1] have emerged as an alternative to ASICs, their performance and flexibility are achieved at a very high cost. Moreover, FPGAs are based on Look-up Tables (LUTs) and are not optimized for any specific application.

The major DSP functions are: 1) Filtering - Finite Impulse Response (FIR) [2] and Infinite Impulse Response (IIR) [2] filters, and 2) Transforms - Fast Fourier Transform (FFT) [2], Discrete Cosine Transform (DCT) [2], Walsh-Hadamard Transform (WHT) [3] etc. For all these algorithms, the basic objective is to design faster Multiplication and Accumulation (MAC) units [4][5]. Apart from MAC units, square root computation engines are also required. The following schemes have been adopted for efficient implementation of the proposed reconfigurable architecture [6][7][8]:

1. Faster multiplier and adder units are used for performance enhancement.
2. The filter coefficients, twiddle factors, trigonometric coefficients etc. are stored in a look-up table.
3. Fast FIR/IIR computation is done by parallel computation and suitable data partitioning.
4. The controller unit generates the control signals and selects the proper data by controlling the address generation units.

The control memory [9][10][11] is configurable, so that the proposed architecture can be configured for different DSP functions. Since adders, subtractors, multipliers etc. are the basic building blocks of these algorithms, they can be used as static building blocks, and different DSP functions can be implemented by interconnecting these blocks in the proper fashion. Hence, a particular DSP function can be configured by changing the control signals that interconnect these basic building blocks. The aim of this paper is to introduce such a reconfigurable DSP processor, which works as a configurable DSP accelerator under the control of a host control unit. The sets of control signals, referred to as "control bit streams", differ from one DSP function to another. Different DSP functions are realized by downloading these bit streams from an external memory onto the proposed processor under the control of the host control unit. Since the number of functional blocks is fixed, the configuration latency for implementing a particular DSP function is also reduced substantially.

0-7695-3059-1/07 $25.00 © 2007 IEEE DOI 10.1109/ADCOM.2007.108

2. A common structure for different DSP algorithms

To find a common architecture for different DSP functions, it is necessary to explore the basic DSP operations and their similarities. First, consider an FIR filter with N taps. If x(n) is the input sequence and h(0), h(1), h(2), … are the coefficients of the filter, then the output y(n) is given by the equation

$$y(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k) \tag{1}$$

IIR filters [12] consist of FIR filters connected in proper fashion. The structure of a four-tap Fast FIR filter is shown in Figure 1. It consists of four multipliers and three adders. Another important DSP function is the DFT. The DFT of an N-point sequence x(n) is given by

$$X(k) = \sum_{n=0}^{N-1} x(n)\,W_N^{nk} \tag{2}$$

where $W_N$ is the twiddle factor, defined as

$$W_N = e^{-2\pi j/N}$$

Figure 2. Butterfly unit (complex inputs a + jb and c + jd, multiplier coefficient m + jn, complex outputs p + jq and s + jt)

Hence, for the butterfly outputs,

$$p = a + (cm - dn), \qquad s = a - (cm - dn)$$
$$q = b + (dm + cn), \qquad t = b - (dm + cn)$$

Equations (1) and (2) clearly indicate the computational similarity of these two DSP functions: both are sums of products. However, unlike the FIR filter, the DFT involves complex multiplications.
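To make the similarity concrete, the following minimal Python sketch evaluates equations (1) and (2) directly as sums of products. It is illustrative only: the function names `fir_output` and `dft` do not come from the paper, and input samples before x(0) are assumed to be zero.

```python
import cmath


def fir_output(h, x, n):
    """Equation (1): y(n) = sum over k of h(k) * x(n - k).

    Assumption: samples outside the range of x are treated as zero.
    """
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))


def dft(x):
    """Equation (2): X(k) = sum over n of x(n) * W_N^(n*k), W_N = exp(-2*pi*j/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)  # twiddle factor W_N
    return [sum(x[n] * W ** (n * k) for n in range(N)) for k in range(N)]


if __name__ == "__main__":
    h = [1, 2, 3, 4]          # four filter coefficients (4-tap FIR)
    x = [1, 1, 1, 1]          # input sequence
    print(fir_output(h, x, 3))
    print(dft(x))
```

Both functions have the same multiply-accumulate structure; the only difference is that the DFT coefficients are complex.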

Figure 1. 4-tap FIR computation

The Fast Fourier Transform (FFT) algorithm is essentially a faster version of the DFT. Its fundamental computational building block is the "butterfly", which consists of one complex multiplier and two complex adder/subtractors. Figure 2 depicts such a butterfly unit. It has two complex inputs, a + jb and c + jd, and two complex outputs, p + jq and s + jt. The complex outputs of the butterfly of Figure 2 are given by

$$p + jq = (a + jb) + (c + jd)(m + jn) = [a + (cm - dn)] + j[b + (dm + cn)] \tag{3}$$
$$s + jt = (a + jb) - (c + jd)(m + jn) = [a - (cm - dn)] + j[b - (dm + cn)] \tag{4}$$

Apart from the FFT, there are other transformations such as the DCT and the Hadamard transform. All these transformations are basically matrix multiplications of an input sequence vector and a constant coefficient vector. Considering this, the architecture introduces a fundamental building block that can be reconfigured for different DSP functions by changing its connectivity. The hardware complexity of a "butterfly unit" is similar to that of a 4-tap FIR filter or a (4X1) by (1X4) matrix multiplication unit; realizing these functions requires seven adder-subtractors and four multipliers.

3. Details of the architecture

3.1. Overview of the architecture

The basic building block of this architecture is the Processing Element (PE), which is shown in Figure 3. The PE consists of functional blocks, register files, buffers and interconnection networks (ICNs). The functional blocks include multipliers (M0-M3) and adder-subtractors (AS0-AS6). RF0 and RF1 are the two register files, i.e. arrays of registers. The connections between the functional blocks of the PE are made by four interconnection networks (ICN0-ICN3). The architecture allows parallel loading of data into the register files RF0 and RF1; the Input Buffer Array (IBA) and Output Buffer Array (OBA) are employed for this purpose. Another buffer array, the "Constant Data Buffer" (CDBA), holds the constants used in the operation. Depending on the mode signal of the ICNs, the functional blocks can work in different modes (FFT, FIR etc.).
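The butterfly of equations (3) and (4) can be evaluated with real arithmetic only. The Python sketch below (the function name `butterfly` is hypothetical) does so with four real multiplications and six real additions/subtractions, which is within the four multipliers and seven adder-subtractors quoted for the building block. How these operations are assigned to M0-M3 and AS0-AS6 is not specified in the text and is not modeled here.

```python
def butterfly(a, b, c, d, m, n):
    """Return ((p, q), (s, t)) for inputs a+jb, c+jd and coefficient m+jn.

    Implements equations (3) and (4) using real operations only.
    """
    # Four real multiplications for the complex product (c + jd)(m + jn).
    cm, dn, dm, cn = c * m, d * n, d * m, c * n
    # Two add/subtract operations complete the complex product.
    re = cm - dn
    im = dm + cn
    # Four add/subtract operations produce the two complex outputs.
    p, q = a + re, b + im
    s, t = a - re, b - im
    return (p, q), (s, t)


if __name__ == "__main__":
    print(butterfly(1, 2, 3, 4, 5, 6))
```

The result can be cross-checked against Python's built-in complex type, e.g. `(1 + 2j) + (3 + 4j) * (5 + 6j)` for the first output.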


Figure 3. Block diagram of the proposed processing element

The functional units are internally and externally pipelined and synchronized. They provide an efficient implementation of the FFT, Fast DCT, Fast FIR, the Goertzel algorithm [13], the Fast Discrete Wavelet Transform etc. The contents of the output buffer can be loaded directly into the Data Memory (DM0), either to be fed back to the input or to be stored in memory. If an output is required, it is available from the multiplexer (MX2). If it is required after a COordinate Rotation DIgital Computer (CORDIC) operation, the Reconfigurable Bit-stream Module is configured first and then used. This arrangement reduces the overhead of the CORDIC operations.

The interconnection networks used in this architecture consist of multiplexers (MUXes), whose select lines control the flow of data. The basic structure of an ICN is shown in Figure 4. It can operate in up to seven modes; hence, a 3-bit mode control signal is required. The operation of the ICN in the different modes is shown in Table I. Within an ICN, data can flow through different paths.

Table I. Different modes of ICN

Mode[2:0]    Operation
000          Addition
001          Multiplication
010          Subtraction
011          Butterfly
100          2-tap FIR
101          4-tap FIR
110          4-tap FIR and addition in parallel
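As an illustration of how the 3-bit mode signal of Table I might be decoded in a software model, the following Python sketch maps each mode code to its operation. The names `IcnMode` and `decode_mode` are hypothetical; the paper does not state what the unused code 111 does, so this sketch simply rejects it.

```python
from enum import IntEnum


class IcnMode(IntEnum):
    """ICN operating modes, with codes taken from Table I."""
    ADDITION = 0b000
    MULTIPLICATION = 0b001
    SUBTRACTION = 0b010
    BUTTERFLY = 0b011
    FIR_2_TAP = 0b100
    FIR_4_TAP = 0b101
    FIR_4_TAP_AND_ADDITION = 0b110


def decode_mode(bits: str) -> IcnMode:
    """Decode a 3-character bit string such as '011' into an IcnMode.

    Raises ValueError for malformed input or for the code 111,
    which Table I leaves unassigned.
    """
    if len(bits) != 3 or any(ch not in "01" for ch in bits):
        raise ValueError(f"expected 3 binary digits, got {bits!r}")
    return IcnMode(int(bits, 2))


if __name__ == "__main__":
    print(decode_mode("011").name)
```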

Besides data, the ICN also routes control signals. Control signals are required to enable input and output for the functional blocks (M0-M3, AS0-AS6). A special connection is needed from RF0 to ICN2, essentially for the addition or subtraction of two data items. This link is used for the execution of Fast FIR and other operations in which multiple multiplications and additions must run concurrently. Depending upon the control signals, the appropriate constant data is downloaded to the CDBA. At the same time, the data from the IBA is loaded into RF0. In the next clock cycle, the variable data (from RF0) and the constants (from the CDBA) are loaded into ICN0. Depending upon the mode signals, ICN0 sends the data to the multipliers (M0-M3). The output is available from RF0.

Figure 4. Block diagram of an InterConnection Network (ICN) (multiplexers MUX01-MUX06; data inputs in1-in3 with enables En_in_1 to En_in_3; data outputs out1-out3 with enables En_out_1 to En_out_3; Clock, Reset and Mode[2:0] inputs)


The status signal line from RF1 indicates the availability of data in the OBA. The PE uses a common clock and reset signal for all its internal blocks, and the necessary control lines are connected to the different blocks. Concurrent data input and data output buses maintain true pipelining to increase throughput. Separating the constant data from the input-output variables increases the overall data bus bandwidth, thereby increasing processor performance.

3.2. PE control unit

The PE control unit (Figure 5) consists of a Control Unit (CU), two address generator units, namely 1) the Data Address Generator (DAG) and 2) the Program Address Generator (PAG), a Configurable Control Memory (CFCM) and two decoders (DCDR0-1). Decoder DCDR0 loads instructions from outside the processor and stores them in a queue. The CU fetches instructions one by one from DCDR0: when the CU is ready to accept a new instruction, it simply enables the output of DCDR0. Subsequently, it generates the appropriate control signals so that the PAG produces the program address in the CFCM. This address is used to access the data in the CFCM. The scheme has two advantages: it provides flexibility, and it reduces the size of the control memory needed at any one time. The CU also instructs the DAG, which generates the location of the data in the Data Memory. The Data Memory has two segments: DM0, which stores intermediate results, and DM1, which stores the input data from outside the processor. The contents of both DM0 and DM1 can reach the IBA through a multiplexer (MX0). Three MUXes (MX0-MX2) are used in this architecture, and their select lines are controlled by the CU itself. The control word (C) is generated by the decoder DCDR1. This decoder decodes the Control Field (CF), which is part of the control memory content; using it reduces the width of the control memory data. Another main function of the CU is to distribute the data properly over time (clock cycles). To do so, the CU uses a barrel shifter, a counter etc. The instruction decoder unit of the CU translates each instruction to extract the required information. For example, for a filtering operation it obtains the information about the data elements to be operated on, and the CU allocates the filter coefficients to the CDBA. The modified Harvard architecture used here allows the data to be processed in pipeline mode.

Figure 5. Processing element controller

The CU also takes the status of the PE from the Status Line Decoder (SLD), which indicates the state of the current operation, i.e. whether the last operation has been performed or the PE is still part way through an operation. If the CU has completed its execution, the CNS signal is 1; otherwise it is busy with its previous instruction. The DAG produces two address lines, one for DM0 and another for DM1. Similarly, the PAG generates two addresses, one for reading control data and another for configuring the control memory (CFCM). To exploit the inherent parallelism within DSP functions, the architecture has been conceived so that a number of concurrent micro-operations can be merged into a single control word, thereby achieving fine-grain parallelism. The micro-operations include the mode signals of the ICNs and the control signals of the different blocks such as buffers, memory etc. The control memory can be further optimized using an optimization algorithm.
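The idea of merging concurrent micro-operations into one control word can be sketched as simple bit-field packing. The Python example below is a minimal illustration under stated assumptions: the paper does not give the control word format, so the field layout (four 3-bit ICN mode fields followed by one enable bit each for the IBA, OBA and CDBA) and the names `pack_control_word` and `unpack_control_word` are hypothetical.

```python
ICN_COUNT = 4                          # ICN0-ICN3, as in the architecture overview
MODE_BITS = 3                          # 3-bit mode signal per ICN
ENABLE_NAMES = ("IBA", "OBA", "CDBA")  # hypothetical buffer-enable flags


def pack_control_word(icn_modes, enables):
    """Merge four ICN mode codes and buffer enables into one integer word.

    Hypothetical layout: bits 0-11 hold the modes of ICN0..ICN3,
    bits 12-14 hold the IBA, OBA and CDBA enables.
    """
    if len(icn_modes) != ICN_COUNT:
        raise ValueError("expected one mode per ICN")
    word = 0
    for i, mode in enumerate(icn_modes):
        if not 0 <= mode < (1 << MODE_BITS):
            raise ValueError(f"mode {mode} does not fit in {MODE_BITS} bits")
        word |= mode << (i * MODE_BITS)
    base = ICN_COUNT * MODE_BITS
    for j, name in enumerate(ENABLE_NAMES):
        if enables.get(name, False):
            word |= 1 << (base + j)
    return word


def unpack_control_word(word):
    """Split a control word back into (icn_modes, enables)."""
    mask = (1 << MODE_BITS) - 1
    modes = [(word >> (i * MODE_BITS)) & mask for i in range(ICN_COUNT)]
    base = ICN_COUNT * MODE_BITS
    enables = {name: bool((word >> (base + j)) & 1)
               for j, name in enumerate(ENABLE_NAMES)}
    return modes, enables


if __name__ == "__main__":
    w = pack_control_word([0b101, 0b011, 0b000, 0b110],
                          {"IBA": True, "CDBA": True})
    print(f"{w:015b}", unpack_control_word(w))
```

A single word fetched from the control memory then carries all the micro-operations that are to take effect in the same clock cycle.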

3.3. Reconfigurable memory module

The Reconfigurable Bit-stream Module (RBM) is a reconfigurable segment. The required bit-stream can be downloaded from outside the processor. In some DSP operations, data is passed through this module and collected by controlling the multiplexer MX2. This arrangement reduces the complexity of the algorithm, as it is abstracted from the main module. This SRAM-based memory unit resides within the processing element and stores the reconfiguration bit-stream for a particular DSP function. Whenever a particular function is to be configured, the processing element controller allows the bit-streams to be downloaded from an external memory into the reconfigurable memory.

3.4. Configuration latency

The architecture provides an important feature: low configuration latency. It has two configurable modules, 1) the RBM and 2) the CFCM, both discussed earlier. In the proposed architecture, the basic functional blocks (MAC, ALU etc.) are fixed and the ICNs are configured dynamically. The ICNs are configured by the data stored in the CFCM, and the RBM is configured for the necessary trigonometric functions. So, by changing the contents of these two memories, different DSP functions can be realized. Unlike in FPGAs, the basic functional blocks are not implemented with LUTs; instead they are fixed functional units. Therefore, only a few bit-streams are required, and only to configure the ICNs. The configuration bit-streams are thus reduced substantially, which reduces the configuration latency. The control memory in this architecture is dual-ported, so run-time configuration is possible, i.e. while one portion is being read, another location can be modified. This dynamic configurability reduces the latency. In certain DSP operations, constants and coefficients are reused from the CDBA, so they need not be loaded again.

Figure 6. (a) FIR tap vs. clock cycle (plot title: "Number of tap vs clock cycle"; x-axis: number of taps, 0 to 1200; y-axis: number of clock cycles, 0 to 140)

4. Results and analysis

The performance of the architecture for two widely used DSP functions (i.e. FIR and FFT) was simulated in ModelSim 6.0. It was observed that implementing a 4-tap FIR filter, an FFT butterfly or a (4X1) by (1X4) matrix multiplication required 27 clock cycles. By comparison, 11 clock cycles are required for the addition or subtraction of two data items. All the functional units (multipliers, adder-subtractors) are pipelined in 7 stages, and each stage takes a single clock cycle. Figure 6(a) shows the FIR performance characteristic, number of taps vs. clock cycles. As the architecture is highly pipelined, the number of clock cycles to the first output of an N-tap filter is lower than on other processors. Figure 6(b) shows the characteristic of the number of FFT points vs. clock cycles. The basic unit is a butterfly, so the same architecture is reused for different data. The time increases linearly with the number of points.

We used Xilinx ISE 8.1i to design the different parts of the architecture, with mixed-language HDL (Verilog and VHDL) for simulation. To develop our prototype we used a Virtex-4 [14] (Device ID: xc4vsx25-12-ff668) as the target FPGA. The different modules were simulated in ModelSim PE 6.0d and ModelSim XE 6.0e. MATLAB 7.1 was used to study the characteristics of the results. The validity of some algorithms, such as the FIR and DFT algorithms, was also checked by simulation in Simulink (MATLAB). The device utilization summary and timing summary are given below.

Figure 6. (b) FFT point vs. clock cycle (plot title: "Number of point vs clock cycle"; x-axis: number of points, 0 to 1200; y-axis: number of clock cycles, 0 to 14 × 10^4)

4.1. Device utilization summary

The synthesis report for the proposed processing element is summarized below.

Number of Slices: 6259 out of 10240 (61%)
Number of Slice Flip Flops: 6540 out of 20480 (31%)
Number of 4-input LUTs: 9944 out of 20480 (48%)
Number of IOs: 260
Number of bonded IOBs: 260 out of 320 (81%)
Number of Global Clocks: 2 out of 32 (6%)
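As a quick arithmetic check, the utilization percentages can be recomputed from the reported counts. The Python sketch below does this; the helper name `percent_used` is hypothetical. The reported figures match when the percentage is truncated rather than rounded, which is an observation from the numbers and not something the report states.

```python
# (used, available, reported percentage) taken from the synthesis summary
UTILIZATION = {
    "Slices": (6259, 10240, 61),
    "Slice Flip Flops": (6540, 20480, 31),
    "4-input LUTs": (9944, 20480, 48),
    "Bonded IOBs": (260, 320, 81),
    "Global Clocks": (2, 32, 6),
}


def percent_used(used, available):
    """Integer utilization percentage, truncated toward zero."""
    return 100 * used // available


if __name__ == "__main__":
    for name, (used, available, reported) in UTILIZATION.items():
        print(f"{name}: {percent_used(used, available)}% (reported {reported}%)")
```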

4.2. Timing Summary

The speed grade of the device is -10. The minimum response time of the architecture is 7.636 ns (3.826 ns logic, 3.809 ns route, i.e. 50.1% logic, 49.9% route) and the maximum frequency is 130.961 MHz. The minimum input arrival time before clock is 5.980 ns and the maximum output required time after clock is 4.851 ns. This report was generated through the Xilinx Synthesis Report tool. The report also shows the usage of the different FPGA blocks, i.e. LUTs and IOBs.

5. Conclusion

In this paper, we proposed a new generalized reconfigurable architecture for digital signal processing applications. The architecture balances performance between ASIC and FPGA: it brings speed and silicon utilization close to those of an ASIC while retaining the flexibility of an FPGA. The experimental results indicate the efficiency of the architecture in implementing higher-order DSP operations. The processor can also be used as a basic processing block of a multiprocessor system such as a SIMD array [15] or a systolic array [16]. Since the basic computational blocks such as the adder and multiplier are fixed, and any DSP function can be implemented simply by changing the configuration of the ICNs, the configuration latency is reduced substantially compared to commercially available FPGAs. The architecture was validated on a Virtex-4 FPGA. The architecture can be further upgraded by introducing multi-radix number systems for efficient implementation of high-performance functional units.

6. References

[1] Zuchowski, P. S.; Reynolds, C. B.; Grupp, R. J.; Davis, S. G.; Cremen, B.; Troxel, B., “A hybrid ASIC and FPGA architecture”, Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design (ICCAD '02), pp. 187-194.

[2] Parhi, K. K., “VLSI Digital Signal Processing Systems”, Wiley-Interscience, 1999.

[3] Park, N.; Prasanna, N. K., “Cache conscious Walsh-Hadamard transform”, ICASSP 2001, Volume 2, pp. 1250-1208.

[4] Abdelgawad, A.; Bayoumi, M., “High Speed and Area-Efficient Multiply Accumulate (MAC) Unit for Digital Signal Processing Applications”, ISCAS 2007, May 2007, pp. 3199-3202.

[5] Parandeh-Afshar, H.; Ahmadvand, M.; Safari, S., “A Novel Merged Multiplier-Accumulator Embedded in DSP Coprocessor”, ICECS 06, December 2006, pp. 119-122.

[6] Sinha, P.; Sinha, A.; Basu, D., “A Novel Architecture of a Re-configurable Parallel DSP Processor”, IEEE-NEWCAS Conference, June 2005, pp. 71-74.

[7] Sinha, A.; Karmakar, A.; Maiti, K.; Halder, P., “A reconfigurable architecture for a class of digital signal/image processing applications”, Communications, Computers and Signal Processing, 2001. PACRIM. 2001 IEEE Pacific Rim Conference, Stuttgart, Germany, October 2001, Volume 1, pp. 71-74.

[8] Myjak, M. J.; Delgado-Frias, J. G., “Medium-Grain Cells for Reconfigurable DSP Hardware”, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, June 2007, Volume 54, Issue 6, pp. 1255-1265.

[9] Das, S. R.; Banerji, D. K.; Chattopadhyay, A., “On Control Memory Minimization in Micro-Programmed Digital Computers”, IEEE-TC, September 1973, Volume C22, pp. 845-848.

[10] Khouri, K. S.; Lakshminarayana, G.; Jha, N. K., “Memory binding for performance optimization of control-flow intensive behavioral descriptions”, Very Large Scale Integration (VLSI) Systems, IEEE Transactions, May 2005, Volume 13, Issue 5, pp. 513-524.

[11] Barkalov, A.; Kolopienczyk, M.; Titarenko, L., “Optimization Of Control Memory Size Of Control Unit With Codes Sharing”, Mixed Design of Integrated Circuits and System, June 2006, pp. 354-358.

[12] Wang, A.; Smith, J. O., “On fast FIR filters implemented as tail-canceling IIR filters”, IEEE Transactions on Signal Processing, June 1997, Volume 45, Issue 6, pp. 1415-1427.

[13] Beraldin, J. A.; Steenaart, W., “Overflow analysis of a fixed-point implementation of the Goertzel algorithm”, IEEE Transactions on Circuits and Systems, February 1989, Volume 36, Issue 2, pp. 322-324.

[14] Virtex-4 FPGA Reference Manual from Xilinx Inc., http://www.xilinx.com/.

[15] Herbordt, M. C.; Cravy, J.; Sam, R.; Kidwai, O.; Lin, C., “A system for evaluating performance and cost of SIMD array designs”, Frontiers of Massively Parallel Computation, February 1999, pp. 16-24.

[16] Arnould, E.; Kung, H.; Menzilcioglu, O.; Sarocky, K., “A systolic array computer”, Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '85, April 1985, Volume 10, pp. 232-235.
