A LOW POWER ARCHITECTURE FOR IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ALGORITHMS Henrik Ohlsson, Weidong Li, Oscar Gustafsson, and Lars Wanhammar Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden E-mail: {henriko, weidongl, oscarg, larsw}@isy.liu.se

ABSTRACT In this paper we discuss an architecture for the implementation of digital signal processing algorithms. The architecture is based on algorithm-specific processing elements in order to obtain a medium-high throughput and a low power consumption. No global clock is used, which makes the architecture suitable for low power applications. The proposed architecture is also well suited for use in globally-asynchronous, locally-synchronous systems. Design considerations for implementing digital filters using the proposed architecture are discussed as well.

1. INTRODUCTION The architecture discussed in this paper is almost clock-free and based on algorithm-specific processing elements, and it is well suited for the implementation of digital signal processing algorithms with medium-high throughput requirements. The architecture is flexible and has a high degree of reusability. It can easily be parameterized with respect to the number of processing elements, the number of RAMs, and the number of inputs and outputs of each processing element. The architecture, which we originally developed with the aim of obtaining a well Structured Interfacing of Computational elements – the SIC architecture [14], [3], [15], [16], [6] – has been successfully applied in a number of high-performance cases [12], [9], [17]. Later it was discovered that the basic concept can be traced back to Lipovski [8], who used a similar concept in a simple controller architecture. The SIC architecture has since been rediscovered several times and is referred to as the single instruction computer (SIC) architecture [1], [8], the MOVE architecture, or the transport-triggered architecture [4], [5], [10], [18]. A major feature of the architecture is that the processing elements are algorithm-specific and that the control signals only activate data transfers (MOVE operations) between memory cells. Most digital signal processing algorithms use only a few types of operations, e.g., multiplications, sums-of-products, butterflies, and adaptors. Hence, it is advantageous to implement such regular (digital signal processing) algorithms using only a few types of arithmetic processing elements. Such processing elements can be implemented efficiently for different throughput and power consumption requirements using many different types of arithmetic, e.g., bit-serial, digit-serial, or parallel arithmetic, and number representations [17].
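As an illustration of the MOVE principle described above, the behaviour of a transport-triggered machine can be sketched in software. This is only an analogy, not the authors' implementation; the memory map and the adder PE below are hypothetical:

```python
# Software sketch of a transport-triggered "move machine": the only
# instruction is a data transfer, and moving data into a PE's trigger
# register is what starts that PE's operation. All addresses are invented.

class MoveMachine:
    def __init__(self):
        self.mem = {}  # memory cells, keyed by address
        # Hypothetical memory map for one adder PE: writing to ADD_TRIG
        # computes mem[ADD_A] + mem[ADD_B] and stores it in ADD_OUT.
        self.ADD_A, self.ADD_B, self.ADD_TRIG, self.ADD_OUT = 100, 101, 102, 103

    def move(self, src, dst):
        """The single instruction: copy a value; side effects trigger PEs."""
        self.mem[dst] = self.mem.get(src, 0)
        if dst == self.ADD_TRIG:  # a move into the trigger register starts the PE
            self.mem[self.ADD_OUT] = (self.mem.get(self.ADD_A, 0)
                                      + self.mem.get(self.ADD_B, 0))

m = MoveMachine()
m.mem[0], m.mem[1] = 3, 4
# A "program" is nothing but a sequence of moves.
program = [(0, m.ADD_A), (1, m.ADD_B), (1, m.ADD_TRIG), (m.ADD_OUT, 2)]
for src, dst in program:
    m.move(src, dst)
# mem[2] now holds the sum 3 + 4
```

Note that the control path only ever issues transfers; the arithmetic is entirely a property of the (algorithm-specific) destination.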

The SIC architecture discussed here is based on resource sharing between operations of the same type, or of a few types. This means that the processing elements need to be somewhat flexible and accommodate slightly different operations, e.g., different coefficients in a multiplication. It turns out that the SIC architecture is also suitable for low power applications, since a global clock is not required. This means that we do not need a global clock distribution network, and clock skew problems are avoided. Furthermore, the control unit, which often limits the overall clock frequency of a hardware structure, can easily be distributed. These features have a large impact on the power consumption, since the buffers and capacitances in the clock and control signal nets account for a large part of the power consumption. Another factor that reduces the power consumption is that the system can be set into an idle state when no input data is available. We use static or semi-static memories so that data can be kept, without any extra power consumption, while the system is in the idle state.

2. SIC ARCHITECTURE The SIC architecture is composed of algorithm-specific processing elements, RAM, registers, and a central or distributed control unit. A main feature of the architecture is the elimination or reduction of long communication channels. Hence, the processing elements have been incorporated into the memory address space. An example of the proposed architecture with two processing elements is illustrated in Fig. 1. The processing elements have common or separate input and output registers. The number of registers can differ between the processing elements. These registers, which are used to read and write data to the RAMs, are mapped into the memory address space. Such transactions are controlled by the control unit. The control unit also initiates the execution of the processing elements.

Fig. 1. A SIC architecture with two processing elements.

2.1 The Processing Elements The processing elements (PEs) suitable for this architecture can, as mentioned, be implemented using different types of arithmetic and number representations, depending on the requirements. We can freely mix processing elements that employ bit-serial or digit-serial arithmetic with conventional bit-parallel arithmetic. The execution of a PE operation starts as soon as a data value is loaded into an input register of the PE. When data is available at an input it starts to ripple through the logic operations in the PE, and the result appears at the output when the computations have finished. The result of an operation is stored in the output registers, which may be the same physical registers as the input registers. The resulting values may subsequently be moved to another RAM location or to another input register. It is also possible to write data from an output register to an input register of the same PE without using the RAM. Note that no operation is required if the registers are the same. The detailed configuration of the communication scheme is determined by the throughput requirements and the precedence constraints of the algorithm to be executed. The PEs only execute algorithm-specific operations. This is to obtain a high throughput and an energy-efficient implementation, compared with using a more general and flexible PE. However, this may lead to many different types of PEs. To reduce the number of PEs required for a specific algorithm we may use somewhat flexible PEs that perform a limited set of fixed operations. For example, a butterfly PE may execute operations using only a limited set of twiddle factors. There is, of course, a trade-off between the degree of flexibility and the number of PEs required. There is, in principle, no limit on the arithmetic complexity of a PE, and depending on the application the PE complexity can differ significantly between two implementations. A simple PE could, for example, be a fixed-coefficient multiplier, while a more complex PE could implement a butterfly operation for an FFT. In fact, a PE can be replaced with another SIC architecture that performs a very complex task. It is also easy to interconnect SIC architectures in cascade or in parallel [17]. We can freely mix SIC architectures with conventional computer structures.
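The event-driven, limited-flexibility PE described above can be modelled in a few lines. This is a behavioural sketch only; the coefficient values and the configuration mechanism are illustrative assumptions:

```python
# Behavioural sketch of a "somewhat flexible" PE: a multiplier that supports
# only a small set of hard-wired coefficients, selected by a configuration
# signal. Loading the input register models the self-timed start of execution.

class FixedCoefficientPE:
    COEFFICIENTS = {0: 5, 1: -3}  # hypothetical fixed multipliers

    def __init__(self):
        self.config = 0           # selects which fixed coefficient is used
        self.output_register = None

    def load_input(self, value):
        # Execution begins as soon as data arrives at the input register;
        # the result then appears in the output register.
        self.output_register = value * self.COEFFICIENTS[self.config]

pe = FixedCoefficientPE()
pe.config = 1
pe.load_input(7)
# pe.output_register now holds 7 * (-3)
```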

2.2 Control Unit We can employ either a central or a set of distributed control units. A centralized control unit consists, in principle, of a ROM, where the control signals for the PEs and the RAM addresses are stored, and a counter that generates the addresses for the ROM. The control signals are used for loading data into the input and output registers of the PEs, for configuring the PEs, and for reading and writing the RAM. The control unit is also used to shut down the system and enter an idle state, which has a large impact on the power consumption. As mentioned above, it is often more favourable to use a set of distributed control units, since long and fast control signal wires are eliminated. The distributed control units also become much simpler and faster, and consume less power.
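The counter-plus-ROM control unit can be sketched as follows. The control-word fields (a source and a destination address per word) are illustrative assumptions, not the actual control-word format:

```python
# Sketch of a centralized control unit: a counter steps through a ROM of
# control words, and each control word here simply describes one transfer.

class ControlUnit:
    def __init__(self, rom):
        self.rom = rom       # list of (src_addr, dst_addr) move descriptors
        self.counter = 0     # the counter addresses the ROM

    def step(self, mem):
        src, dst = self.rom[self.counter]
        mem[dst] = mem[src]  # issue one data transfer
        # Wrap around, modelling a periodic (sample-rate) schedule.
        self.counter = (self.counter + 1) % len(self.rom)

mem = {0: 11, 1: 0, 2: 0}
cu = ControlUnit(rom=[(0, 1), (1, 2)])
cu.step(mem)
cu.step(mem)
# after two steps the value has been moved from address 0 to address 2
```

Distributing the control then amounts to giving each PE (or small group of PEs) its own, much shorter, ROM and counter.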

2.3 Registers and Memories It is in general necessary to employ several physical or logical memories in order to balance the computational throughput and the communication bandwidth. It is often favourable to use one physical memory port for each input or output register. To allow the system to enter an idle state without any global clock, we must use static or semi-static memories and registers. These elements can then keep their values during an idle period, without any memory refresh being required.

3. EXAMPLE APPLICATIONS The proposed architecture is suitable for several different applications, e.g., digital filters, filter banks, wavelets, fast Fourier and cosine transforms, Viterbi decoders, and many other similar algorithms. Here we discuss only three cases: wave digital filters, FIR filters, and FFTs.

3.1 Lattice Wave Digital Filters Lattice wave digital filters are regular structures that can be implemented using only two-port or three-port adaptors. A symmetric two-port adaptor is shown in Fig. 2 and a fifth-order lattice wave digital filter, using symmetric two-port adaptors, is shown in Fig. 3. Such filters can be implemented using general adaptor PEs with general multipliers. This is, however, not a very efficient solution with respect to speed, power consumption, or chip area.
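For reference, the symmetric two-port adaptor computes, in one common formulation, B1 = A2 + α(A2 − A1) and B2 = A1 + α(A2 − A1), so a single multiplication by the adaptor coefficient α is shared by both outputs. A minimal sketch (the numeric values are only a check of the arithmetic):

```python
# Symmetric two-port adaptor: both outputs reuse the single product
# alpha * (a2 - a1), which is why one multiplier per adaptor suffices.

def symmetric_adaptor(a1, a2, alpha):
    """Return (b1, b2) for a symmetric two-port adaptor with coefficient alpha."""
    shared = alpha * (a2 - a1)   # the one shared multiplication
    return a2 + shared, a1 + shared

b1, b2 = symmetric_adaptor(1.0, 3.0, 0.5)
# shared = 0.5 * 2.0 = 1.0, so b1 = 4.0 and b2 = 2.0
```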

Fig. 2. Structure of a two-port adaptor.

Fig. 3. A fifth-order lattice wave digital filter.

Instead, we consider a configurable, application-specific PE for the adaptor operations. An example of such a PE is shown in Fig. 4. In this example we replace the general multiplier with two fixed multipliers. By selecting either of these two multipliers, two different adaptor operations can be executed using the same PE. There is, of course, a large number of other possible PE configurations. For example, by designing the filter with coefficients that have common partial products, the multiplexing of arithmetic components can be done inside the multiplier. The result is that a larger part of the hardware can be shared between the two configurations, and the required area is reduced. For example, if we have the two two's-complement coefficients α1 = 0.10000101 and α2 = 0.01010101, we can use one fixed multiplier in which the common nonzero bits are shared and the nonzero bits that differ are multiplexed when the partial products are generated. Lattice wave digital filters and many other low-sensitivity algorithms can be designed with few nonzero bits in their coefficients and, hence, allow for simple implementations using only a few simple multipliers.
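Reading the two example coefficients as fractional bit patterns, 0.10000101 has nonzero bits at positions 1, 6 and 8 (weights 2^-1, 2^-6, 2^-8) and 0.01010101 at positions 2, 4, 6 and 8, so positions 6 and 8 are shared. A software sketch of the shared and multiplexed partial products (the hardware split itself is our illustration):

```python
# Configurable shift-and-add multiplier for the two example coefficients:
# partial products of the shared bit positions are always generated, while
# the differing bit positions are selected by a configuration signal.

def partial_products(x, positions):
    """Sum of x shifted by the given fractional bit positions (weights 2**-k)."""
    return sum(x * 2.0 ** -k for k in positions)

COMMON = {6, 8}    # bits shared by 0.10000101 and 0.01010101
ONLY_A1 = {1}      # bit unique to alpha1 = 0.10000101
ONLY_A2 = {2, 4}   # bits unique to alpha2 = 0.01010101

def configurable_multiplier(x, select_a1):
    shared = partial_products(x, COMMON)   # generated once, used by both modes
    extra = partial_products(x, ONLY_A1 if select_a1 else ONLY_A2)
    return shared + extra

x = 256.0
y1 = configurable_multiplier(x, True)    # 256 * 0.51953125 = 133.0
y2 = configurable_multiplier(x, False)   # 256 * 0.33203125 = 85.0
```

The fewer nonzero bits the coefficients have, and the more of them they share, the larger the fraction of the adder network that is common to both configurations.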

Fig. 4. A configurable two-port adaptor PE.

Fig. 5. Illustration of an implementation of a lattice wave digital filter using a SIC architecture. Fig. 5 shows an implementation of the filter in Fig. 3 on the proposed architecture. Each PE performs two operations: either two adaptor operations, or one adaptor operation and one addition. The PEs can be configured in a similar fashion to the example in Fig. 4.

3.2 FIR Filters The proposed architecture is suitable for the implementation of FIR filters using the direct-form, transposed-form, or linear-phase structures. A direct-form FIR filter can be mapped to the architecture using PEs that compute one or several multiplications. Such a PE could, for example, be implemented using an adder tree of as low a height as possible to obtain a high throughput; an example of such a tree is the Wallace tree. To obtain efficient PEs, the coefficients can be matched together so that all inputs to the adder tree at a certain height are used, which results in a time/area-efficient solution. It is also possible to find coefficients with common parts in order to share components in the adder tree. Another alternative is to use distributed arithmetic. We also need PEs for adding the multiplication results at the filter output. Such a PE will, however, be similar to a multiplier block [2]; the only difference is that we do not need to shift the inputs in the latter case. Thus, for the direct-form FIR filter, homogeneous PEs are easily identified. For a transposed-form FIR filter we can use multiple-constant-multiplication blocks as PEs [2], or distributed arithmetic. In this case, the filter design can not only find coefficients suitable for multiple-constant blocks, but also find sets of coefficients suitable for configurable multiplier blocks.
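The direct-form computation that such PEs realize is the familiar convolution sum y[n] = Σ_k h[k]·x[n−k]. A reference sketch (the coefficients are arbitrary examples, and the loop stands in for the sum-of-products PE):

```python
# Direct-form FIR filter: each output sample is a sum of products of the
# most recent inputs with the fixed coefficient set h, which is exactly
# the sum-of-products operation a direct-form PE would implement.

def fir_direct_form(x, h):
    """y[n] = sum_k h[k] * x[n-k], assuming zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]   # one partial product per tap
        y.append(acc)
    return y

y = fir_direct_form([1.0, 0.0, 0.0], [0.5, 0.25])
# the impulse response reproduces the coefficients: [0.5, 0.25, 0.0]
```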

3.3 Fast Fourier Transform In a fast Fourier transform (FFT), the coefficients are fixed. For small transform lengths, the number of coefficients is low; for example, a 16-point FFT has only 5 coefficients. The PEs can therefore be configured as small butterfly elements with fixed coefficients. Since the implementation reduces the number of expensive complex multiplications, the FFT can be performed efficiently. The main drawback of this implementation scheme is the increased routing cost. This is, however, not a problem for a small FFT, and a large transform length FFT can be divided into smaller FFT operations [17], [7]. Hence, this architecture is suitable for efficient FFT implementation. Several successful implementations have been done [9], [12], [13], [7].
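The decomposition referred to above is the standard radix-2 one: an N-point transform is built from two N/2-point transforms plus multiplications by fixed twiddle factors, which is what allows fixed-coefficient butterfly PEs to be reused. A reference sketch (textbook decimation-in-time, not the cited processors):

```python
import cmath

# Radix-2 decimation-in-time FFT: the butterfly in the inner loop uses only
# the fixed twiddle factors W_N^k, so a small set of fixed-coefficient
# butterfly PEs can cover an entire stage.

def fft(x):
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # fixed twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]          # butterfly, upper output
        out[k + n // 2] = even[k] - w * odd[k] # butterfly, lower output
    return out

X = fft([1, 0, 0, 0, 0, 0, 0, 0])
# the DFT of a unit impulse is all ones
```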

3.4 GALS – Globally-Asynchronous, Locally-Synchronous Architectures The SIC architecture is well suited for use as a component in a globally-asynchronous, locally-synchronous (GALS) architecture [11], [19]. This kind of system has been proposed as an efficient design and implementation method for large systems-on-chip. In a GALS architecture, communication between large components is done using asynchronous communication channels. This approach is particularly suitable for communication systems, which typically are partitioned into subsystems that operate according to the data-flow principle. This is also the case for most digital signal processing systems. When input data is received by a subsystem it is processed, and the system then returns to an idle state, waiting for the next input.
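As a loose software analogy for this data-flow style (not a model of the handshake circuits in [19]), two locally synchronous blocks can be connected by a blocking channel; the consumer stays "idle" until data arrives, processes it, and idles again. The end-of-stream marker is our own convention:

```python
from queue import Queue
from threading import Thread

# Two "subsystems" communicating over an asynchronous channel; the blocking
# queue stands in for a handshake-based channel, purely as an analogy.

def producer(channel):
    for sample in [1, 2, 3]:
        channel.put(sample)   # request/acknowledge hidden inside the queue
    channel.put(None)         # end-of-stream marker (an assumption here)

def consumer(channel, results):
    while True:
        sample = channel.get()      # block ("idle") until data arrives
        if sample is None:
            break
        results.append(sample * 2)  # process, then return to idle

channel, results = Queue(maxsize=1), []
t = Thread(target=producer, args=(channel,))
t.start()
consumer(channel, results)
t.join()
# results == [2, 4, 6]
```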

4. CONCLUSIONS In this paper we have discussed the SIC architecture, which uses algorithm-specific processing elements, and concluded that it is also suitable for low power implementation of many digital signal processing algorithms and for use as a component in larger GALS systems. The SIC architecture is easily parameterizable, and thereby reusable, and may therefore reduce the design effort significantly.

5. REFERENCES
[1] Azaria H. and Tabak D., “Design Consideration of a Single Instruction Microcomputer – A Case Study,” Microprocessors and Microprogramming, vol. 1, pp. 187–194, 1983.
[2] Dempster A. G. and Macleod M. D., “Use of Minimum-Adder Multiplier Blocks in FIR Digital Filters,” IEEE Trans. Circuits Syst. II, vol. 42, no. 9, pp. 569–577, Sept. 1995.
[3] Dinha F., Sikström B., Sjöström U., and Wanhammar L., “LSI Implementation of Digital Filters – A Multi-Processor Approach,” Proc. Intern. Conf. on Computers, Systems and Signal Processing, Bangalore, India, Dec. 10–12, 1984.
[4] Jones D. W., “The Ultimate RISC,” Computer Architecture News, vol. 16, no. 3, pp. 48–55, June 1988.
[5] Laplante P. A., “A Single Instruction Computer Architecture and its Application in Image Processing,” Proc. SPIE – Intern. Society Optical Eng., vol. 1608, pp. 226–235, 1992.
[6] Lawson H. W., Svensson B., and Wanhammar L., Parallel Processing in Industrial Real-Time Applications, Prentice Hall, 1992.
[7] Li W. and Wanhammar L., “Complex Multiplication Reduction in FFT Processors,” Proc. Swedish System-on-Chip Conf., Falkenberg, Sweden, March 17–18, 2002.
[8] Lipovski G. J., “Architecture of a Simple, Effective Control Processor,” Second Symp. on Micro Architecture, EUROMICRO, pp. 187–194, 1976.
[9] Melander J., Widhe T., Sandberg P., Palmkvist K., Vesterbacka M., and Wanhammar L., “Implementation of a Bit-Serial FFT Processor with a Hierarchical Control Structure,” Proc. European Conf. on Circuit Theory and Design, ECCTD '95, Istanbul, Turkey, Aug. 1995.
[10] Moore S. W. and Morgan G., “The Recursive MOVE Machine: R-Move,” IEE Colloquium on RISC Architectures and Applications (Digest No. 163), London, UK, pp. 3/1–3/5, 1991.
[11] Njølstad T. et al., “A Socket Interface for GALS Using Locally Dynamic Voltage Scaling for Rate-Adaptive Energy Saving,” Proc. 14th Annual IEEE Intern. ASIC/SOC Conf., Rochester, pp. 110–116, 2001.
[12] Nordhamn E., Sikström B., and Wanhammar L., “Design of an FFT Processor,” Fourth Swedish Workshop on Computer Architecture, DSA-92, Linköping, Sweden, Jan. 1992.
[13] Nordhamn E., “Design of an Application-Specific FFT Processor,” Linköping Studies in Science and Technology, Thesis No. 324, Linköping University, Sweden, 1992.
[14] Wanhammar L., “On Algorithms and Architecture Suitable for Digital Signal Processing,” Proc. European Signal Processing Conf., EUSIPCO-86, The Hague, The Netherlands, Sept. 1986.
[15] Wanhammar L., Sikström B., Afghahi M., and Pencz J., “A Systematic Bit-Serial Approach to Implement Digital Signal Processing Algorithms,” Proc. 2nd Nordic Symp. on VLSI in Computers and Communications, Linköping, Sweden, June 2–4, 1986.
[16] Wanhammar L., Afghahi M., and Sikström B., “On Mapping of Algorithms onto Hardware,” IEEE Intern. Symp. on Circuits and Systems, Espoo, Finland, pp. 1967–1970, June 1988.
[17] Wanhammar L., DSP Integrated Circuits, Academic Press, 1999.
[18] Zivkovic V. A., Tangelder R. J. W. T., and Kerkhoff H. H., “Design and Test Space Exploration of Transport-Triggered Architectures,” Proc. Design, Automation and Test in Europe Conf., Paris, France, pp. 146–151, March 27–30, 2000.
[19] Zhuang S., Li W., Carlsson J., Palmkvist K., and Wanhammar L., “An Asynchronous Wrapper with Novel Handshake Circuits for GALS Systems,” accepted for publication in Proc. IEEE Intern. Symp. on Circuits and Systems, Scottsdale, USA, June 2002.