

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 11, NO. 3, JUNE 2003

Power Efficient Data Path Synthesis of Sum-of-Products Computations

Konstantinos Masselos, Member, IEEE, Panagiotis Merakos, Spyros Theoharis, Thanos Stouraitis, Senior Member, IEEE, and Costas E. Goutis, Member, IEEE

Abstract—Techniques for the power efficient data path synthesis of sum-of-products computations between data and coefficients are presented. The proposed techniques exploit specific features of this type of computation. Efficient heuristics for the scheduling and assignment tasks, based on the concept of the Traveling Salesman’s Problem, are described. Different cost functions are proposed to drive the synthesis tasks. The proposed cost functions target the power consumption either in the interconnect buses or in the functional units. Experimental results from different relevant digital signal processing algorithmic kernels show that the proposed synthesis techniques lead to significant power savings.

Index Terms—Data path synthesis, digital signal processing, power consumption, sum-of-products.

Manuscript received September 15, 2000; revised September 3, 2001.
K. Masselos is with INTRACOM S.A., Athens, Greece, and also with the Department of Electrical and Computer Engineering, University of Patras, Patras 26-500, Greece.
P. Merakos is with Athena SEMI S.A., Athens, Greece, and also with the Department of Electrical and Computer Engineering, University of Patras, Patras 11-141, Greece.
S. Theoharis is with ALMA Technologies S.A., Athens 11-745, Greece.
T. Stouraitis and C. E. Goutis are with the Department of Electrical and Computer Engineering, University of Patras, Patras 26-500, Greece.
Digital Object Identifier 10.1109/TVLSI.2003.812368
1063-8210/03$17.00 © 2003 IEEE

I. INTRODUCTION

The power consumed in sum-of-products computations forms an important part of the total power budget of a digital signal processing (DSP) system [5]. Two main categories of DSP algorithmic kernels requiring sum-of-products computation are: 1) convolutional algorithms, where the basic computation is the multiplication of data and coefficient vectors (e.g., FIR filtering), and 2) transformational algorithms, where the basic computation is the matrix-vector multiplication between a coefficient matrix and a data vector (e.g., DCT, DFT/FFT).

In this paper, techniques for low-power data-path synthesis of sum-of-products computations are proposed. The proposed techniques exploit two basic characteristics of sum-of-products computations between data and coefficients:

1) The independence of the partial product computations, which allows a simple but systematic formulation of the synthesis tasks using the concept of the Traveling Salesman’s Problem (TSP). Existing synthesis techniques target a large number of algorithms and are based either on complex formulations of the synthesis tasks [1] or on simpler but less systematic approaches [2].

2) The static nature of the coefficients, which allows the use of (partly) static power cost functions and leads to a significant speedup of the synthesis procedure. Existing synthesis techniques use long simulations to evaluate their cost functions, which increases the run time of the synthesis procedure.

Furthermore, cost functions capturing other sources of power dissipation that are ignored by existing approaches, such as the address bus power consumption, are proposed in this paper.

The rest of the paper is organized as follows. In Section II, the target architecture model is described. The proposed synthesis techniques are described in Section III. In Section IV, different cost functions are proposed. Experimental results are presented in Section V, together with a comparison of the proposed synthesis techniques with existing relevant approaches. Finally, conclusions are offered in Section VI.

II. TARGET ARCHITECTURE

The proposed techniques primarily target the implementation of sum-of-products computations on (hardwired) custom hardware architectures. They can also be applied to instruction set processors (at the assembly language level). The algorithms’ input data and coefficients are assumed to be stored in memories (background or registers). Sum-of-products computations are performed on a number of functional units [multiply accumulators (MACs) or multipliers]. Bit-parallel implementations of the functional units are assumed.

III. PROPOSED SYNTHESIS TECHNIQUES

A. Problem Formulation for Convolutional Algorithms

A convolutional computation can be described by (1), in which each output sample is a sum of partial products. Each partial product is the product of one element of the coefficient set and one element of the data set. The power cost caused in a given part of the implementation platform by the sequential computation of two partial products is denoted by a cost term defined for that pair.

The multiplications implied by (1) are commutative and have the same ASAP (as soon as possible) and ALAP (as late as possible) times, so the relative order of their execution is of no importance for the calculation of an output sample. This is the idea that is exploited for the derivation of the proposed techniques.

The optimization problem can be stated as: “Given the computations of (1), define the function (2) such that the computations of (3) minimize the cost function shown in (4), taking into account the costs defined in (3).”

The function in (2) is defined as a permutation of the sequence of partial product computations, returning an index that modifies the regular order of the sequence elements. Its argument indicates the current time step of the partial product computation. In the conventional computation of the convolution algorithms, given by (1), the order of computation of the partial products is given by (5). After the optimization procedure, the permutation identifies which partial product is computed at each time step, and with it the reordered coefficient and data elements. In this way, a reordering of the sequence of computations implied by (1) is achieved. The permutation that minimizes the cost expressed by (4) is the solution to the problem.

Equation (4) assumes that coefficients and data are stored sequentially in memory, in the order defined by the algorithm in its original form, starting from their respective base addresses. The sum in (4) accounts for the cost between consecutive partial products, while the last term accounts for the cost incurred between the first and last partial products of the reordered sequence of computation, because DSP algorithms can be considered as infinite loops.

The problem stated above can be formulated as a Traveling Salesman’s Problem. The graph of the problem consists of the set of vertices, which correspond to the partial products of (1), and the set of edges from each vertex to every other vertex. The edges of the graph are weighted by the power cost incurred due to the transition (traveling) between the partial products that the edge connects.
That is, the weighting cost for the edge of the graph connecting two partial products (nodes) is defined as in (6). The resulting graph is complete, because the two end nodes of an edge can be any pair of partial products. The solution to the problem is to find a Hamiltonian tour of minimum cost in this graph, i.e., a tour in which the sum of all edge weights is minimum. This is the well-known TSP, which is an NP-complete problem in its decision form [6]. An exact solution can be found only for relatively small problem sizes; otherwise, heuristic algorithms can be used (e.g., based on minimum spanning trees (MST) or genetic algorithms).

Assuming a single functional unit implementation, the resulting tour is assigned for computation to this single unit. If more than one functional unit is required, the tour is divided into as many parts as there are functional units, and each part is assigned to a corresponding functional unit.

B. Problem Formulation for Transformational Algorithms

In this case, both inner and partial products must be scheduled. The scheduling problem is formulated as a special form of the TSP that operates on a multigraph, in order to take into account its hierarchical nature. The multigraph is constructed from the graphs of the inner products. Each inner product graph consists of a set of vertices, which represent the partial products of that inner product. Inner product scheduling is performed first, after determining the partial products of each inner product that will be executed first and last. Then the partial products of each inner product are scheduled, taking into consideration the restrictions imposed by the inner product scheduling.

IV. POWER MODELS

A. Interconnect Oriented Power Cost Function

Assuming two partial products, the interconnect power cost function is given by (7). In (7), an address function returns the storage address of its operand, and four weights relate to the capacitances of the coefficient, data, coefficient address, and data address buses, respectively. Three of the terms represent the Hamming distances between the coefficients, coefficient addresses, and data addresses of the two partial products; they can be directly evaluated since they are all known before realization. The fourth term is the average Hamming distance between the data terms of the partial products and is determined through simulation. The interconnect oriented power cost function can be simplified to include only bus switching activity terms. It may also become fully static by including only the coefficient and address bus switching activity terms.

B. Functional Unit Oriented Power Cost Function

Assuming two partial products, the functional unit power cost function is given by (8). In (8), the two quantities associated with each node of the functional unit are the switching activity in that node and the node capacitance.
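The tour construction of Section III-A and the fully static variant of the interconnect cost of Section IV-A can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes each partial product is described by a hypothetical tuple (coefficient, coefficient address, data address), uses unit bus weights by default, and swaps in a simple nearest-neighbour heuristic where the text mentions MST-based and genetic heuristics. All function names and the example values are illustrative.

```python
def hamming(a, b, width=8):
    """Number of differing bits between two words on a `width`-bit bus."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def static_cost(p, q, w_coef=1.0, w_caddr=1.0, w_daddr=1.0, width=8):
    """Static edge weight between two partial products.

    p and q are (coefficient, coefficient_address, data_address) tuples.
    The data-bus term is omitted because it requires simulation.
    """
    return (w_coef * hamming(p[0], q[0], width)
            + w_caddr * hamming(p[1], q[1], width)
            + w_daddr * hamming(p[2], q[2], width))


def tour_cost(order, products, cost=static_cost):
    """Cyclic schedule cost: consecutive pairs plus the last-to-first wrap."""
    n = len(order)
    return sum(cost(products[order[k]], products[order[(k + 1) % n]])
               for k in range(n))


def nearest_neighbour_tour(products, cost=static_cost, start=0):
    """Greedy heuristic: always move to the cheapest unvisited product."""
    unvisited = set(range(len(products))) - {start}
    order = [start]
    while unvisited:
        last = products[order[-1]]
        # Ties are broken by the lower index to keep the result deterministic.
        nxt = min(unvisited, key=lambda j: (cost(last, products[j]), j))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def split_tour(order, units):
    """Cut a tour into `units` contiguous parts, one per functional unit."""
    base, extra = divmod(len(order), units)
    parts, pos = [], 0
    for u in range(units):
        size = base + (1 if u < extra else 0)
        parts.append(order[pos:pos + size])
        pos += size
    return parts


# Hypothetical 4-tap example: (coefficient, coefficient address, data address).
PRODUCTS = [(0x0F, 0, 0), (0xF0, 1, 1), (0x0E, 2, 2), (0xF1, 3, 3)]
ORIGINAL = [0, 1, 2, 3]
REORDERED = nearest_neighbour_tour(PRODUCTS)

print("original order cost :", tour_cost(ORIGINAL, PRODUCTS))
print("reordered tour      :", REORDERED)
print("reordered cost      :", tour_cost(REORDERED, PRODUCTS))
print("two-unit assignment :", split_tour(REORDERED, 2))
```

For these example values the static cost drops from 42 in the original order to 26 in the reordered tour `[0, 2, 3, 1]`. The greedy heuristic gives no optimality guarantee in general; it only shows how the edge weights drive the schedule.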


Fig. 1. Experimental Results for implementation in two MACs.

TABLE I EFFECT OF THE PROPOSED TECHNIQUES ON THE FUNCTIONAL UNITS POWER CONSUMPTION

In order to calculate the power weights related to the functional units, the incremental static method presented in [7] is used. The mean error of the estimated power consumption is in the range of 5%–10% compared with the power consumption calculated using the QuickSim II gate-level simulator of Mentor Graphics.
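As a rough illustration of the functional unit cost of Section IV-B, the sketch below assumes the cost takes the usual switched-capacitance form, i.e., a sum over the nodes of the functional unit of switching activity times node capacitance. That form, the function name, and the example numbers are assumptions made for illustration; the per-node activities would in practice come from an estimator such as the method of [7].

```python
def functional_unit_cost(activity, capacitance):
    """Capacitance-weighted sum of per-node switching activity.

    activity[k]    : switching activity of node k for one transition
                     between two partial products (hypothetical values)
    capacitance[k] : capacitance of node k, in arbitrary units
    """
    if len(activity) != len(capacitance):
        raise ValueError("one capacitance value per node is required")
    return sum(a * c for a, c in zip(activity, capacitance))


# Hypothetical three-node unit.
node_activity = [0.5, 0.25, 1.0]
node_capacitance = [2.0, 4.0, 1.0]
print(functional_unit_cost(node_activity, node_capacitance))
```

A value computed this way for every ordered pair of partial products could serve as the edge weight of the scheduling graph, in place of the interconnect cost.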

V. EXPERIMENTAL RESULTS

A. Optimization of Interconnect Buses Power

Lines of the same capacitance are assumed for all buses, reducing the interconnect oriented power cost function to the sum of the switching activities in the buses, as shown by (7). Bit-true simulations of ANSI C codes of the DSP kernels have been used to monitor the switching activities at the buses. The bus switching activities obtained when the proposed synthesis techniques are applied are compared to those obtained when the algorithms’ computations are executed in the order determined by the definitions of the algorithms. The simulations were executed with the data and coefficients represented in 2’s complement format. For every simulation, actual image data have been used. Data and addresses were assumed to be 8 bits wide, while the coefficients were fixed-point numbers with a 2-bit integer part and an F-bit fractional part. Experimental results for realizations on two MACs are presented in Fig. 1. Similar results have been obtained for realizations on one and four MACs, and for sign-magnitude representation as well.
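The kind of measurement described in this subsection can be sketched in a few lines of Python: encode each value placed on a bus in the chosen number representation and count the bit toggles between consecutive words. This is a minimal sketch, not the ANSI C simulation used in the experiments; the function names and the sample sequence are assumptions.

```python
def to_twos_complement(value, width):
    """Encode a signed integer as an unsigned `width`-bit 2's complement word."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    return value & ((1 << width) - 1)


def to_sign_magnitude(value, width):
    """Encode a signed integer as a `width`-bit sign-magnitude word."""
    if abs(value) >= 1 << (width - 1):
        raise ValueError("value does not fit in the given width")
    sign_bit = (1 << (width - 1)) if value < 0 else 0
    return sign_bit | abs(value)


def bus_switching(values, width=8, encode=to_twos_complement):
    """Total number of bit toggles on a bus carrying `values` in sequence."""
    words = [encode(v, width) for v in values]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Hypothetical sequence of small values with alternating sign.
samples = [1, -1, 2, -2]
print("2's complement toggles :", bus_switching(samples))
print("sign-magnitude toggles :",
      bus_switching(samples, encode=to_sign_magnitude))
```

For this sequence the 8-bit 2's complement bus toggles 20 times and the sign-magnitude bus 5 times, which shows why the representation matters when switching activity is the cost being minimized.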

B. Optimization of Functional Units Power

Bit-parallel multiply-accumulators based on two different types of multipliers, namely a carry save array (CSA) multiplier and a Booth-encoded Wallace multiplier, have been developed. The VHDL descriptions for the multipliers have been obtained using the DesignWare software of Synopsys running on HP-10 workstations. Internal power has been estimated through switch-level simulations, combining the activity results produced with the capacitance files corresponding to the target technology (0.7 μm). The estimation procedure has been carried out using the Mentor Graphics framework running on HP-10 workstations. The effect of the proposed synthesis techniques on the power consumption of the functional units is presented in Table I. Power figures are in microwatts and have been obtained assuming a frequency of operation of 20 MHz and a supply voltage of 5 V. Two’s complement representation is assumed for both data and coefficients.

C. Comparison With Existing Approaches

In [3], schedules are proposed for the power-efficient evaluation of the partial products of linear computations, satisfying the ascending or descending ordering of the coefficients. Algorithms with two-dimensional (2-D) coefficient structures are not targeted, and input data information and address bus activity are not taken into consideration in [3]. The techniques proposed in this paper are compared to those presented in [3] in terms of interconnect and functional unit power consumption. As far as interconnect power is concerned, only the switching activity in the data and coefficient buses is taken into consideration, which is the best case for the techniques presented in [3], since they do not address addressing-related issues. For the transformational algorithms, the inner products are assumed to be executed in the order determined by the original algorithm definition. The results of the comparisons are presented in Tables II and III. Single functional unit realizations and 2’s complement representation are assumed in all cases. The proposed techniques outperform the techniques presented in [3] in all cases.

TABLE II COMPARISON IN TERMS OF INPUT BUSES SWITCHING ACTIVITY OF THE PROPOSED SYNTHESIS TECHNIQUES AND THE TECHNIQUES PRESENTED IN [3]

TABLE III COMPARISON IN TERMS OF FUNCTIONAL UNIT POWER OF THE PROPOSED SYNTHESIS TECHNIQUES AND THE TECHNIQUES PRESENTED IN [3]

In [4], a technique based on the TSP is presented that reorders the finite impulse response filter coefficients to decrease the switching activity in the coefficient bus of the implementation. Algorithms with 2-D coefficient structures are not targeted, and input data information, address bus activity, and functional unit power are not taken into consideration in [4]. The techniques proposed in [4] are compared to those presented in this paper in terms of their primary cost function, i.e., the switching activity in the data and coefficient buses. This is the best case for the techniques presented in [4], since they do not address addressing-related issues. When the techniques proposed in [4] are applied to transformational algorithms, the inner products are assumed to be executed in the order determined by the original algorithm definition. The results of the comparison are presented in Table IV. Single functional unit realizations and 2’s complement representation are assumed in all cases. The proposed techniques outperform the techniques presented in [4] in all cases, even within the optimization scope of [4]. Comparisons in terms of the total bus switching activity (including address buses) show that the proposed techniques lead to 12% less bus switching on average than the techniques presented in [4], for both 2’s complement and sign-magnitude representation.

TABLE IV COMPARISON IN TERMS OF INPUT BUSES SWITCHING ACTIVITY OF THE PROPOSED SYNTHESIS TECHNIQUES AND THE TECHNIQUES PRESENTED IN [4]

VI. CONCLUSION

Power efficient scheduling and assignment heuristics based on the TSP have been proposed for sum-of-products computations. Experimental results have shown that the proposed techniques lead, on average, to 25% savings in interconnect power and to 15% savings in functional unit power in comparison to reference implementations.

REFERENCES

[1] A. Raghunathan and N. Jha, “Behavioral synthesis for low power,” in Proc. Int. Conf. Comput. Design, Oct. 1994, pp. 318–322.
[2] E. Musoll and J. Cortadella, “High-level synthesis techniques for reducing the activity of functional units,” in Proc. ACM/IEEE Int. Symp. Low Power Design, Apr. 1995, pp. 99–104.
[3] A. Chatterjee and R. Roy, “Synthesis of low power linear DSP circuits using activity metrics,” in Proc. Int. Conf. VLSI Design, Jan. 1994, pp. 265–270.
[4] M. Mehendale, S. D. Sherlecar, and G. Venkatesh, “Low power realization of FIR filters on programmable DSP’s,” IEEE Trans. VLSI Syst., vol. 6, pp. 546–553, Dec. 1998.
[5] M. T.-C. Lee, V. Tiwari, S. Malik, and M. Fujita, “Power analysis and minimization techniques for embedded DSP software,” IEEE Trans. VLSI Syst., pp. 123–135, June 1997.
[6] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[7] S. Theoharis, G. Theodoridis, N. Zervas, and C. Goutis, “Accurate and fast power estimation of large combinational circuits,” in Proc. IEEE Workshop Power Timing Modeling, Optimization and Simulation, Oct. 1999, pp. 199–208.


Konstantinos Masselos (M’95) was born in Athens, Greece, in 1971. He received the B.Sc. degree in electrical engineering from the University of Patras, Patras, Greece, in 1994, the M.Sc. degree in VLSI systems engineering from the University of Manchester Institute of Science and Technology (UMIST), Manchester, U.K., in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Patras. His Ph.D. research involved high-level low-power design methodologies for multimedia applications realized on different architectural platforms. From 1997 to 1999, he was a Visiting Researcher with the Design Technology for Integrated Information and Communication Systems (DESICS) Division of the Inter-University Micro Electronics Centre (IMEC), Leuven, Belgium, where he was involved in research related to the ACROPOLIS multimedia precompiler. Currently, he is with INTRACOM S.A., Greece, where he is involved with the realization of wireless baseband processing systems on reconfigurable hardware. His main research interests include (hardware and software) optimizing compilers, reconfigurable architectures and related design technologies, power optimization, and design of reusable cores. Dr. Masselos is a Member of the Technical Chamber of Greece.

Panagiotis Merakos was born in Korydallos of Attiki, Greece, in 1968. He received the Diploma in electrical engineering from the University of Patras, Patras, Greece in 1992. He is currently working toward the Ph.D. degree at the same university. In April 1993, he was a Researcher on various DSP projects working with Very Large Scale Integration (VLSI) Design Laboratory, University of Patras. In early 1997, he was with the Atmel Hellas S.A., Patras, where he was involved with the development of telecommunications integrated circuits, implementing standards, for example, USB and Bluetooth. Currently, he is an IC Design Engineer with Athena SEMI S.A., Athens, Greece, where he is involved with the hardware development of WLAN standards. His research interests include algorithms and techniques for low-power DSP systems design, low-power high-level synthesis and mixed-signal design. Mr. Merakos is a Member of the Technical Chamber of Greece.

Spyros Theoharis received the B.Sc. degree in computer engineering and informatics, and the Ph.D. degree in electrical and computer engineering from the University of Patras, Patras, Greece, in 1994 and 2000, respectively. He is currently working as a System and Design Engineer with ALMA Technologies S.A., Athens, Greece. His research interests include low-power design, power estimation, very large scale integration (VLSI) signal processing, DSP architectures, and cryptography. He has published more than 20 papers in international journals.

Thanos Stouraitis (SM’97) received the B.S. degree in physics and the M.S. degree in electronic automation from the University of Athens, Athens, Greece, in 1979 and 1981, respectively. He received the M.Sc. degree in electrical computer engineering from the University of Cincinnati, Cincinnati, OH, and the Ph.D. degree in electrical engineering from the University of Florida, Gainesville, in 1983 and 1986, respectively. Currently, he is a Professor of electrical and computer engineering at the University of Patras, Patras, Greece. He has also served on the faculty of The Ohio State University, Columbus, and the University of Florida, Gainesville. From 1986 to 1987, he was a Technical Manager with The Athena Group Inc. and from 1982 to 1983, he was a Researcher for the National Electrical Manufacturers Association (NEMA) Laboratories, Cincinnati, OH. His current research interests include signal and image processing, application-specific processor technology and design, design and architecture of optimal digital systems and computer arithmetic. He has authored or coauthored over 70 papers. He holds one patent on DSP processor design. He has authored several book chapters, including Digital Signal Processing (Patras, Greece: Univ. of Patras Press) and coauthored Digital Filter Design Software for the IBM PC (New York: Marcel Dekker). He is an Editor-at-Large for Marcel Dekker, Inc., and is a Consultant for various industries. Dr. Stouraitis regularly reviews for the IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, IEEE TRANSACTIONS ON COMPUTERS, and IEEE TRANSACTIONS ON EDUCATION, for the Proceedings of the Institute of Electronics Engineers E, F, and G and for conferences like IEEE ICASSP, ISCAS, VLSI9, Computer Arithmetic, Euro DAC, etc. 
He is President Elect of the VLSI Systems and Applications (VSA) Technical Committee of the IEEE Circuits and Systems Society, and a Member of the DSP Technical Committee of the IEEE Signal Processing Society. In 1996, he was the General Chair of the IEEE International Conference on Electronics, Circuits and Systems (ICECS). He received the Outstanding Ph.D. Dissertation Award from the University of Florida in 1986.

Costas E. Goutis (M’83) was a Lecturer at the School of Physics and Mathematics, University of Athens, Athens, Greece, from 1970 to 1972. In 1973, he was the Technical Manager with the Greek P.T.T., responsible for the installation and maintenance of the telephone exchanges in a large provincial region. From 1976 to 1979, he was a Research Assistant and Research Fellow with the Department of Electrical and Electronic Engineering, University of Strathclyde, Glasgow, U.K., and from 1979 to 1985, a Lecturer with the Department of Electrical and Electronic Engineering, University of Newcastle upon Tyne, Newcastle, U.K. Since 1985, he has been with the Department of Electrical and Computer Engineering, University of Patras, Patras, Greece where he is currently a Full Professor. His research interests include very large scale integration (VLSI) circuit design, low-power VLSI design, memory management, systems design, analysis and design of systems for signal processing, and telecommunications. He has published more than 180 papers in international journals and conferences. He has been awarded a large number of Research Contracts from IST, ESPRIT, RACE, and National Programs.
