IEEE Int. Symposium on Circuits and Systems (ISCAS ’98), Monterey, CA, USA, July 1998. © 1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
TRANSFORMATIONAL-BASED SYNTHESIS OF VLSI BASED DSP SYSTEMS FOR LOW POWER USING A GENETIC ALGORITHM

M. S. Bright and T. Arslan
Circuits and Systems Research Group, Cardiff University Of Wales, Newport Road, Cardiff, UK, CF2 3TF
[email protected]

ABSTRACT

This paper describes a technique for the synthesis of CMOS based DSP systems under multiple design constraints. The primary target of the technique is to reduce operating power by applying high level transformations to designs. During the search for a low power solution the technique considers issues at circuit and layout levels, using appropriate capacitive models, together with tracking speed and area design constraints. In exploring the complex search space of the synthesis problem the technique uses a Genetic Algorithm which utilises a library of high level transformation based techniques within its operators. The paper describes the technique, the capacitive models used for power estimation and presents results for DSP systems of varying complexity. The results demonstrate the significant power savings achieved with the technique.

1. INTRODUCTION

The power consumption of VLSI systems has become an important design parameter alongside the traditional constraints of area and speed [1]. This is largely due to the increase in demand for portable computing systems with longer operating times. The operating time of these portable systems is constrained by both the battery capacity and system power requirements. Battery technology has peaked in recent years so attention has focused on the reduction of system power requirements, including the power consumption of the VLSI devices [1]. In addition to portable operation considerations, the increased power consumption of VLSI devices can lead to overheating, which reduces device speed and time to failure. This heat dissipation problem requires heat management systems that significantly add to device cost [2]. The issues described above have led to the development of a number of low power design techniques and methodologies that tackle various levels of the VLSI design process.
The consideration of power as a high-level design parameter, to be optimised alongside speed and area, will have the greatest impact on power and will not require expensive modifications to the VLSI fabrication process [1]. A number of techniques have been developed for power reduction of digital CMOS based VLSI devices [3]. The authors in [4] proposed the use of high-level algorithmic transformations, traditionally used to optimise for speed and area, for power reduction. The transformations operate on a high-level description of the design, modifying elements within the design to produce lower power VLSI implementations. However, the use of high-level transformations compounds the already complex nature of the low power design problem. Even with a restricted set of transformations no time efficient algorithm can be developed to determine the optimum low power solution [5]. An efficient algorithm is required to search the low power solution space while obeying constraints on design speed and area.

A Genetic Algorithm (GA) [6] is a heuristic search algorithm that has been applied to various VLSI design problems such as test pattern generation and bus size minimisation. Previous research has demonstrated the effectiveness of GAs when applied to the problem of structural synthesis [7], where there are multiple design constraints to satisfy. The authors have previously demonstrated a limited application of GAs to low power synthesis [8].

This paper presents a GA framework for the application of high-level algorithmic transformations to Digital Signal Processing (DSP) designs. The transformations modify the design characteristics to produce designs with lower power implementations. The GA is modified to suit the specific nature of the synthesis problem; unique genetic operators are developed to apply the high-level transformations. The developed GA is capable of reducing the power consumption of a wide range of signal processing circuits while obeying speed and area constraints.

2. PROCEDURE

The most significant factor affecting power consumption in a CMOS VLSI device is the switching power, which is expressed by the product [9]:

$$P_{\text{switching}} = (\text{supply voltage})^{2} \times \text{switched capacitance} \qquad (1)$$

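As a small numerical illustration of the switching-power product described above, the following sketch evaluates it at two supply voltages. It is not the authors' implementation: the function name `switching_power`, the voltages, and the capacitance value (in arbitrary units) are assumptions chosen only for illustration.

```python
def switching_power(supply_voltage: float, switched_capacitance: float) -> float:
    """Switching power as the product (supply voltage)^2 x switched capacitance."""
    return supply_voltage ** 2 * switched_capacitance


# Hypothetical example values, not taken from the paper.
capacitance = 2.0  # switched capacitance in arbitrary units
p_high = switching_power(5.0, capacitance)
p_low = switching_power(2.5, capacitance)

# With the switched capacitance held constant, halving the supply voltage
# divides the switching power by four.
reduction_factor = p_high / p_low
print(reduction_factor)
```

In practice a transformation that permits a lower supply voltage may also change the switched capacitance, so both terms of the product need to be re-evaluated for each candidate design.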
This equation identifies that reduction of
supply voltage will yield a quadratic decrease in power. However, reduction of supply voltage decreases the speed of a CMOS device, as illustrated in figure 1 [4]. The application of high-level transformations is used to compensate for the speed decrease, producing a design with no loss in speed but a lower supply voltage. High-level transformations have been well documented in the VLSI design literature [10-13]. They operate on a high-level description of the design represented as a Data Flow Graph (DFG). In order to be processed by a GA the DFGs are encoded into the chromosome representation shown in figure 2. A GA contains a number of chromosomes, collectively known as a population; each chromosome represents a possible solution to the design problem.
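The encoding of a DFG into a chromosome can be sketched as follows. The exact gene layout used by the authors is not recoverable from the text, so the layout here (node labels followed by edge triples), the class names `Edge` and `DFG`, the functions `encode` and `decode`, and the example topology are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed connection between two DFG elements."""
    name: str
    source: str
    target: str


@dataclass
class DFG:
    """A data flow graph: functional elements plus the edges joining them."""
    nodes: list[str]
    edges: list[Edge]


def encode(dfg: DFG) -> list:
    """Flatten a DFG into a chromosome: node genes first, then edge genes."""
    genes: list = list(dfg.nodes)
    genes.extend((e.name, e.source, e.target) for e in dfg.edges)
    return genes


def decode(chromosome: list) -> DFG:
    """Rebuild the DFG from a chromosome produced by encode()."""
    nodes = [g for g in chromosome if isinstance(g, str)]
    edges = [Edge(*g) for g in chromosome if isinstance(g, tuple)]
    return DFG(nodes, edges)


# Hypothetical first-order recursive structure, not the circuit of Figure 2.
example = DFG(
    nodes=["IN", "ADD", "DEL", "MUL", "OUT"],
    edges=[
        Edge("e1", "IN", "ADD"),
        Edge("e2", "ADD", "OUT"),
        Edge("e3", "ADD", "DEL"),
        Edge("e4", "DEL", "MUL"),
        Edge("e5", "MUL", "ADD"),
    ],
)
chromosome = encode(example)
restored = decode(chromosome)
```

A reversible encoding of this kind lets genetic operators work on the chromosome while the fitness evaluation works on the reconstructed graph.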
system. Therefore retiming is applied at a greater rate than pipelining.

Genetic evolution proceeds with the creation of an initial population of candidate designs. The GA synthesis system creates a population of designs from an initial DFG specification. Random application of transformations is used to create a population of DFGs with different characteristics but the same function as the specified design. The power consumption of each design is assessed to calculate the fitness of each chromosome; lower power consumption corresponds to a higher fitness. Power estimation at the high level is a complex process. A number of different high-level techniques have been reported in the literature [14-17].
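The creation of an initial population by random application of transformations, with retiming favoured over pipelining, might be sketched as below. This is a sketch under stated assumptions rather than the authors' code: the numeric rates in `APPLICATION_RATES` are invented (the paper gives only the ordering between retiming and pipelining), each individual is recorded simply as the list of transformation names applied to the initial DFG, and the reciprocal mapping in `fitness` is one possible way of making lower power correspond to higher fitness.

```python
import random

# Hypothetical relative application rates. The source states only that
# retiming is applied at a greater rate than pipelining.
APPLICATION_RATES = {
    "retiming": 0.4,
    "pipelining": 0.2,
    "automatic_pipelining": 0.2,
    "loop_unfolding": 0.2,
}


def choose_transformation(rng: random.Random) -> str:
    """Pick one transformation name, weighted by its application rate."""
    names = list(APPLICATION_RATES)
    weights = [APPLICATION_RATES[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def initial_population(size: int, max_transformations: int,
                       rng: random.Random) -> list:
    """Create `size` individuals, each a random sequence of transformations."""
    population = []
    for _ in range(size):
        count = rng.randint(0, max_transformations)
        population.append([choose_transformation(rng) for _ in range(count)])
    return population


def fitness(power: float) -> float:
    """Assumed mapping: lower estimated power gives higher fitness."""
    if power <= 0:
        raise ValueError("power must be positive")
    return 1.0 / power


rng = random.Random(0)  # fixed seed so the example is repeatable
population = initial_population(size=10, max_transformations=4, rng=rng)
```

Each transformation preserves the function of the DFG, so every individual generated this way remains a valid implementation of the specified design.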
[Figure 2. Data Flow Graph (DFG) Representation. Labels in the figure: a DFG with IN, OUT, D and edges e1 to e4; a CHROMOSOME with OUT, IN, ADD, DEL, MUL, e1, e2, e3, e4.]
[Figure 1. Relationship Of Supply Voltage And Delay [4]. Axes: normalised delay against VDD (volts).]

The high-level transformations operate on the elements of the DFG (adders, delays, etc.) to modify its characteristics. The high-level transformations used within the GA synthesis system are Retiming, Pipelining, Automatic Pipelining and Loop Unfolding. Retiming [10] is the process of moving delays around the DFG to reduce the length of the critical path. The critical path is the longest computational path within the DFG, so it places a bound on the maximum operating speed; a shorter critical path results in a faster design. Pipelining [11] attempts to minimise the critical path by inserting delay elements within the DFG. Automatic Pipelining [12] is a specialised form of retiming; delay elements are inserted on the inputs of the DFG and retimed through in an attempt to reduce the critical path length. Loop Unfolding [13] creates a parallel implementation of the DFG, which increases throughput at the cost of an increase in area.

The GA is used to apply the transformations to a population of candidate design chromosomes. The application rates were determined through a combination of design heuristics and experimental analyses. For example, both pipelining and retiming are powerful transformations for the reduction of critical path (increase in speed) but pipelining has the added overhead of increasing the number of delay elements within the
The GA uses a capacitive model that combines data from practical VLSI designs with statistically derived relationships. A “good” model should be able to convey critical design information to the GA. In VLSI synthesis this usually requires circuit layout information for the optimum design synthesis. To provide this information to the GA a number of capacitive models were considered. The simplest model assigns unit size and delay to functional elements. This option requires very simple fitness calculations but is very inaccurate as it ignores the substantial size difference between multipliers, registers, adders, etc. The second model characterises each functional element with a gate count, providing imprecise comparative areas. This model ignores the effect of intraconnect capacitance within an element. The third model uses functional elements constructed in VLSI design CAD tools. The CAD tools are used to extrapolate accurate area and capacitance information for each element. In addition to functional element capacitance, information on interconnect capacitance between elements is provided by a statistical model. This capacitive data is used to compute the capacitance contribution to power consumption. This third model was chosen as offering the best trade-off between accuracy and speed of calculation.

The graph of figure 1 is used to estimate a supply voltage that will enable the design to run at the same speed as the initial design. For example, a design with a critical path half that of the original design will run at twice the speed. The graph is used to determine that a supply voltage of
3. RESULTS

A number of DSP designs, comprising a set of benchmarks, are used to illustrate the performance of the GA synthesis tool. The designs were selected to cover a range of recursive and non-recursive signal processing operations of varying complexity. FIR3 and FIR8 are non-recursive Finite Impulse Response filters, 3rd order and 8th order respectively. LAT is a 2nd order recursive filter. AV6D is a direct form representation of a 6th order Avenhaus filter. AV8P is an 8th order parallel implementation of the Avenhaus filter [5]. The Avenhaus filters contain complex recursive structures. ELLIP is the 5th Order Elliptic Wavelet filter presented in [18].

The results for each of the benchmark circuits are presented in graphical form in figure 3. As an example, the GA has produced an FIR3 design with an eightfold increase in area, but the overall power consumption has been reduced by a factor of 19. The supply voltage for each optimised design, to produce the reported power reduction, is also presented. The FIR filters have a significant estimated power reduction but with an associated increase in area. This increase is due to the amount of unfolding applied to these filters, producing parallel designs that are capable of operating at very low voltages with the same speed as non-parallel designs. The Avenhaus filters have a very small increase in area as the unfolding transformation could not improve on the effect of the other transformations. The large power reductions obtained are possible because of the large critical paths of these filters, which offer plenty of scope for minimisation through retiming and pipelining.
[Figure 3 labels. Optimised supply voltage for each design: FIR3 1.2V, FIR8 1.5V, AV6D 2.3V, AV8P 2.3V, LAT 3.9V, ELLIP 4.3V. Plotted series: Power Reduction and Size Increase, on a scale of 0 to 20.]
2.9V will double the delays in the device, negating the speed increase, producing a design of the same speed but lower supply voltage than the original. Equation (1) is then used to compute the power consumption of the design. The GA selects members of the current population for modification using the standard Roulette Wheel selection method [6]. Designs with a greater fitness (i.e. those best satisfying the design constraints) have a greater probability of being selected for reproduction; the GA attempts to find the optimum design by using the higher fitness members of the current population to create the next population. Standard GAs use mutation and crossover operators to modify the characteristics of chromosomes [6]. In the case of DSP synthesis these standard genetic operators would corrupt the functionality of the DFG, therefore they are modified to suit the nature of the synthesis problem. The mutation process is modified to apply the high-level transformations, accessed from a transformation library. Crossover is a complex genetic operator that combines the characteristics of two parent chromosomes to produce child chromosomes. The GA uses a modified form of crossover that identifies which transformations have been applied to each parent, then produces child chromosomes with both sets of transformations applied. The repeated application of the genetic operators to the current population breeds a new population of designs, which then becomes the current population. This process of genetic evolution is repeated until the design with the required specifications is produced. The design with the highest fitness over all generations is selected as the best low power design for the initial specified DFG.
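The roulette wheel selection step described above can be sketched as follows. This is the textbook form of the method [6], not code from the synthesis tool; the function name `roulette_wheel_select` and the example population and fitness values are assumptions for illustration.

```python
import random


def roulette_wheel_select(population: list, fitnesses: list,
                          rng: random.Random):
    """Select one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("total fitness must be positive")
    pick = rng.random() * total  # position of the pointer on the wheel
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point rounding


# Hypothetical population: design "b" has three times the fitness of "a".
rng = random.Random(1)  # fixed seed so the example is repeatable
designs = ["a", "b"]
design_fitness = [1.0, 3.0]
counts = {"a": 0, "b": 0}
for _ in range(2000):
    counts[roulette_wheel_select(designs, design_fitness, rng)] += 1
```

Over many draws the fitter design is chosen more often, while the weaker design still has a non-zero chance of contributing to the next population, which helps preserve diversity.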
[Figure 3. Power Reduction And Area Increase For Benchmark Designs. Horizontal axis: DSP design and optimised supply voltage.]

The relatively smaller reduction of power obtained with the elliptic filter is due to the fact that pipelining and retiming have limited effect on reducing the length of its feedback paths. The GA produces a number of designs with the same power specifications, giving greater flexibility to the designer. Typically a population will converge on an optimum design within 500 generations. Subsequent generations will increase the number of designs that meet the design criteria; 3-8% of the final population will consist of unique best low power designs.

The population size of the GA was set to 500 for all designs to enable a comparison between the speed of each synthesis operation. The synthesis tool was run on a Pentium Pro Windows NT workstation with 64 Megabytes of RAM; the synthesis tool runs in 16 Megabytes of RAM. The execution time for each DSP algorithm to be synthesised is presented in table 1. The relatively longer running time of the FIR8 filter is due to its complexity (22 elements) and the application of unfolding using the postponing principle [13], which postpones the application of the unfolding transformation
until the other transformations are incapable of increasing the quality of the design any further. If unfolding is successful the unfolded design is fed back into the synthesis cycle for further optimisation, resulting in longer execution times for synthesising unfolded designs.

4. CONCLUSION

A technique has been developed for the synthesis of low power DSP systems. At the core of the system is a Genetic Algorithm which accesses a library of high level transformations to modify designs, and a capacitive model which feeds device level information to the GA for power estimation. The technique shows flexibility with a wide range of DSP systems of varying complexity. Results have been provided for a number of benchmark DSP algorithms which show significant power reduction obtained in all cases.
DSP Design                  FIR3     FIR8     AV6D     AV8P     LAT      ELLIP
Execution Time (Seconds)    83.32    260.06   40.51    129.58   28.24    3.24

Table 1. Synthesis Tool Execution Times

5. REFERENCES

[1] D. Singh, J. M. Rabaey, M. Pedram, F. Catthoor, S. Rajgopal, N. Sehgal, and T. J. Mozdzen, “Power conscious CAD tools and methodologies: A perspective”, Proc. IEEE, vol. 83, pp. 570-594, Apr. 1995.
[2] A. Raghunathan and N. K. Jha, “An ILP formulation for low power based on minimizing switched capacitance during data path allocation”, Proc. IEEE Int. Symp. Circuits And Systems ’95, 1995, vol. 2, pp. 1069-1073.
[3] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consumption in digital CMOS circuits”, Proc. IEEE, vol. 83, pp. 498-523, Apr. 1995.
[4] A. P. Chandrakasan, M. Potkonjak, R. Mehra, and R. W. Brodersen, “Optimizing power using transformations”, IEEE Trans. CAD of Integrated Circuits and Systems, vol. 14, pp. 12-31, Jan. 1995.
[5] A. P. Chandrakasan, M. Potkonjak, J. Rabaey, and R. W. Brodersen, “HYPER-LP: A system for power minimization using architectural transformations”, Proc. IEEE/ACM Int. Conf. Computer Aided Design ’92, 1992.
[6] D. E. Goldberg, Genetic Algorithms In Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1988.
[7] T. Arslan, D. H. Horrocks, and E. Ozdemir, “Structural cell-based VLSI circuit design using a genetic algorithm”, Proc. IEEE Int. Symp. Circuits And Systems ’96, 1996, vol. 4, pp. 308-311.
[8] M. S. Bright and T. Arslan, “A genetic framework for the high-level optimisation of low power VLSI DSP systems”, IEE Electronics Letters, vol. 32, pp. 1150-1151, June 1996.
[9] T. Arslan, D. H. Horrocks, and E. T. Erdogan, “Overview and design directions for low-power circuits and architectures for digital signal processing”, IEE Colloquium (Digest), No. 122, 1995, pp. 6/1-6/5.
[10] K. K. Parhi, “High-level algorithm and architecture transformations for DSP synthesis”, IEEE Journal Of VLSI Signal Processing, vol. 9, pp. 121-143, Jan. 1995.
[11] M. Potkonjak and J. Rabaey, “Retiming for scheduling”, VLSI Signal Processing IV, H. S. Moscovitz, K. Yao, and R. Jain (Eds.), IEEE Press, New Jersey, 1991, pp. 23-32.
[12] K. K. Parhi, “Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding”, IEEE Trans. Computers, vol. 40, pp. 178-195, Feb. 1991.
[13] S. Huang and J. M. Rabaey, “Maximizing the throughput of high performance DSP applications using behavioural transformations”, Proc. European Design and Test Conference ’94, 1994, pp. 25-30.
[14] A. Raghunathan and N. K. Jha, “Behavioural synthesis for low power”, Proc. IEEE/ACM Int. Conf. Computer Aided Design ’94, 1994, pp. 318-322.
[15] R. Mehra and J. M. Rabaey, “Behavioural level power estimation and exploration”, 1994 International Workshop On Low Power Design, California, USA, 1994.
[16] P. E. Landman and J. M. Rabaey, “Power estimation for high level synthesis”, Proc. EDAC-EUROASIC ’93, Paris, France, 1993, pp. 361-366.
[17] P. M. Chau and S. R. Powell, “Power dissipation of VLSI array processing systems”, Journal of VLSI Signal Processing, vol. 4, pp. 199-212, 1992.
[18] S. Y. Kung, H. J. Whitehouse and T. Kailath, VLSI And Modern Signal Processing, Prentice-Hall, New Jersey, 1985.