TRANSISTOR SIZING IN MOS DIGITAL CIRCUITS WITH LINEAR PROGRAMMING

Category: Logic Synthesis

M.R.C.M. Berkelaar, J.A.G. Jess
Eindhoven University of Technology
Department of Electrical Engineering
Design Automation Section
P.O. Box 513
NL-5600 MB Eindhoven
The Netherlands
Phone: 040 - 473345
Telefax: 040 - 448375
Telex: 51163 tuehv nl
email: [email protected]
ABSTRACT

In this paper a solution is presented for tuning the delay of a circuit composed of cells to a prescribed value while minimizing power consumption. The tuning is performed by adapting the load drive capabilities of the cells. This optimization problem is mapped onto a linear program, which is then solved by the Simplex algorithm. This approach is guaranteed to find the global optimum, and has proven feasible for circuits of up to several thousand cells. The method can be used with any convex delay model. Results show that circuits can be sped up by a factor of 2 at a cost of only 10 to 30% of extra power.

1. INTRODUCTION
In this paper we discuss tuning the delay of a standard-cell implementation of a combinational logic circuit. The logic functions of the cells in the circuit are fixed, but the individual cells are available in a range of load drive capabilities. Our problem is to choose an appropriate load drive capability for every cell.

Load drive requirements differ considerably between cells. Some cells have high fanout, others low fanout; some are on the critical path, others only on relatively short paths. Cells that are on the critical path and also have high fanout should probably have a high load drive capability. Cells with high fanout that are only on relatively short paths do not need this property, and for the sake of low power consumption should not have it. In fact, we would like to attain a given maximum delay with minimal power requirement. So we have the following optimization problem: given a circuit built from cells and a maximum delay limit, find a load drive capability for every cell such that the circuit satisfies the delay limit and the total power requirement is minimal.

We assume that all cells are available in a continuous but limited range of load drive capabilities. This will usually only be the case when ’smart’ cell generators are used.

In the past, this problem (and the similar one of sizing individual transistors) has been considered in several papers, among which the following should be mentioned:
• In [RUE-76], the problem is approached by means of Newton’s method. This yields some convergence problems. The fact that input capacitances of cells grow when they are made faster is not included in the delay model.

• In [HED-84], an algorithm is presented which optimizes single paths with a quasi-Newton method. The designer interactively invokes this algorithm to optimize single paths in the circuit. The influence of the adaptations made on the remainder of the circuit is not taken into account, therefore multiple iterations may be necessary. The delay model used is very simple and does not include a zero load delay for the cells. However, input capacitances growing with cell size are accounted for.

• In [GLA-84] delay models are studied extensively, and optimization algorithms for single paths are presented.

• In [KAO-84] and [KAO-85] methods are presented which heuristically choose one cell in the circuit and adapt its size. This process is repeated until the timing requirements are met. The procedure does not guarantee optimal solutions in terms of power and/or area, however.

• In [FIS-85] ("TILOS") an iterative approach is presented, in which first that path is selected which fails to meet speed requirements the most, and secondly that transistor on the selected path is adapted in size which has greatest sensitivity with respect to delay. Iterating this process then leads to the design. Although this approach has proved feasible for very large circuits (up to 26000 transistors), global optimality of the solution is not guaranteed, as the authors themselves have realized in [SHY-88]. So they decided to use the results from TILOS as a starting point for a mathematical programming method (Method of Feasible Directions) with proved convergence for the type of problem transistor sizing poses. Due to the time complexity of this approach, they chose to divide the complete problem into a set of smaller subproblems, which could be solved much faster. Doing this they had to give up the certainty of reaching the global optimum.

• In [MAT-85] a method is presented to solve the nonlinear transistor sizing problem by means of a nonlinear programming technique called ’duality’. They combine it with relaxation techniques to keep the run times limited. In this way they can always reach the global optimum. The two examples they present show however that optimizing a four bit full adder module (72 transistors) already takes 519 seconds of a DEC 20/60 CPU, which makes it doubtful if the approach is still feasible for large blocks of logic (>> 1000 transistors).

• In [MAR-89] a system is described to combine transistor sizing and layout compaction. So, this is an area/timing optimizer, and not a power/timing optimizer. They solve the nonlinear part of their problem with augmented Lagrangian techniques. They limit the size of the nonlinear problem by only looking at the active constraints at each moment, and switching to other constraints when the optimization proceeds. For the combined problem of sizing and compaction they reach run times in the same order of magnitude as [MAT-85], so the same doubts for large circuits apply here.
It is clear that the approaches of [MAT-85] and [MAR-89] are the best of those mentioned above. They find a global optimum for the transistor sizing problem, and their algorithmic ideas would probably work for gate sizing as well. However, the problem sizes they deal with are rather limited. Therefore we look for a different approach, which still leads to the global optimum but is computationally more efficient.

The solution we propose here maps the delay/power optimization problem onto a linear program, which can then be solved by well-known techniques. The most significant advantage of this approach is flexibility. The objective function can be composed as desired to optimize different costs. Any constraint can be added to the linear program and will then be kept (as long as a solution exists). Any delay model which is convex (as most realistic models will be) can be used. In [FIS-85] the widely used distributed RC model is proved convex. Furthermore, if a solution exists, the (globally) optimal one will be found.

Our experiments with standard-cell layouts indicate that sizing transistors changes the area of the layout only marginally, because the routing area is not affected and only a small percentage of the cells become larger than standard. Only when the speed of the circuit was pushed to the maximum attainable did an increase of up to 10% in area show up in our examples.

2. A SIMPLE DELAY MODEL
In this chapter a simple delay model is introduced. It is used later to show how the linear problem is composed. This could however be done with more accurate delay models as well, as long as they are convex. Firstly, we introduce some notational definitions:
parameters for single cell

symbol           meaning
$\tau_{cell}$    total delay of cell
$\tau_{int}$     delay due to cell-internal capacitances
$C_l$            total load capacitance
$C_{wire}$       interconnect capacitance
$C_{tr}$         input capacitance of connected cells
$S$              speed constant

TABLE 1. cell parameters

parameters for wire

symbol   meaning
$F$      fanout
$K$      transistor count of total circuit in thousands

TABLE 2. wire parameters
In order to know which paths are critical and which paths are not very long at all, the delay of individual cells will have to be estimated accurately. This is not a simple task, as the effect of future placement and routing on wire capacitances will have to be estimated as well. As we do not know which signals are rising and which are falling during critical path analysis, the delay model needs to be independent of signals. Trials have shown us (unexpectedly) that worst-case modeling gives more accurate results than mediating between rise and fall times. We start with the simple delay formula:

$$\tau_{cell} = \tau_{int} + c\,C_l$$
This is how most standard-cell libraries define delay, and simulations we performed have shown that it models actual behavior very accurately. However, we need to extend this simple formula with parameters to describe the behavior of a whole family of functionally identical cells with different load drive capabilities. Cells are made faster by increasing the size of the transistors in them. This will increase their load drive capability linearly with this size. Their internal capacitances however will also increase almost linearly. These two effects together will keep the internal delay $\tau_{int}$ almost constant over a range of load drive capabilities, but will decrease the delay due to capacitive loading in inverse proportion to the size. Introducing the parameter $S_{cell}$, the speed constant of a cell, leads to the following formulas:

$$\tau_{cell} = \tau_{int} + \frac{c\,C_l}{S_{cell}}$$

$$\tau_{int} = f(\text{cell\_structure})$$

$$C_l = C_{wire} + C_{tr}$$

We will estimate wire capacitance $C_{wire}$ as a function of the fanout of the wire and of the total circuit size:

$$C_{wire} = f(F, K) = c_1 F K + c_2 F + c_3 K + c_4$$

The constants will have to be found from statistical data of actual layout. The capacitances of the input transistors of the fanout set of a single cell depend on the speed constants $S_i$ of the cells which they are part of, thus:

$$C_{tr} = \sum_i S_i\,C_{in,i}, \qquad i \in \{\mathrm{Fanout}_{cell}\}$$
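As an illustration, the formulas above can be evaluated directly. The following is a minimal sketch only: the names `WireModel` and `cell_delay` are hypothetical, and every numeric value (the wire coefficients, the constant c, the internal delay and the input capacitances) is a made-up placeholder, not a fitted value for the process used in this paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class WireModel:
    """C_wire = c1*F*K + c2*F + c3*K + c4; the default coefficients are placeholders."""
    c1: float = 0.01
    c2: float = 0.05
    c3: float = 0.02
    c4: float = 0.03

    def capacitance(self, fanout: int, kilo_transistors: float) -> float:
        f, k = fanout, kilo_transistors
        return self.c1 * f * k + self.c2 * f + self.c3 * k + self.c4


def cell_delay(tau_int: float, c: float, s_cell: float, c_wire: float,
               fanout_cells: Iterable[Tuple[float, float]]) -> float:
    """tau_cell = tau_int + c * C_l / S_cell, with C_l = C_wire + C_tr.

    fanout_cells yields one (S_i, C_in_i) pair per driven cell.
    """
    c_tr = sum(s_i * c_in_i for s_i, c_in_i in fanout_cells)
    return tau_int + c * (c_wire + c_tr) / s_cell


# Hypothetical cell: fanout 3 in a circuit of 2000 transistors (K = 2.0).
c_wire = WireModel().capacitance(fanout=3, kilo_transistors=2.0)
loads = [(1.0, 0.1), (1.0, 0.1), (1.0, 0.1)]  # (S_i, C_in_i), placeholder values
for s_cell in (1.0, 2.0, 3.0):
    print(s_cell, round(cell_delay(2.0, 10.0, s_cell, c_wire, loads), 3))
```

In this sketch, increasing `s_cell` shrinks only the load-dependent term, while the internal delay stays fixed, which is the behavior the model assumes.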
The quality of delay estimations obtained with these formulas can be judged from table 3. For each example three different implementations are considered: the fastest one attainable with $1 \le S_i \le 3$, the slowest one with $S_i = 1$, and one somewhere in between. The constants for the above formulas were calculated for a 6 micron NMOS process, hence the relatively high values for the delay. Simulation was done with data extracted from actual layout. The simulator was a switch-level simulator, known to be accurate to ±10% relative to Spice simulations. We used the switch-level simulator because Spice runs into convergence problems too often with circuits of this size.

example   estimated (ns)     simulated (ns)     %diff
dc2        99  150  197      122  152  156      -19   -1  +26
alu3       77  100  141       82   81   99       -6  +23  +42
dk27       70  100  131       75  106  122       -7   -6   +7
rd73      116  160  214      117  141  197       -1  +13   +9
sao2      139  222  312      151  190  242       -8  +17  +29
sqn       102  160  207       99  138  198       +3  +16   +5
5xp1      107  155  219      106  175  224       +1  -11   -2

TABLE 3. delay estimation results (three implementations per example)

These results look promising, although there are a few considerable errors. The modeling remains a subject of our research.
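The %diff column can be recomputed from the other two. The sketch below assumes that %diff denotes (estimated - simulated) / simulated, rounded to whole percent; this definition is not spelled out in the text but agrees with the tabulated values. The function name `percent_diff` is hypothetical.

```python
def percent_diff(estimated_ns: float, simulated_ns: float) -> float:
    """Signed error of the estimate relative to the simulated delay, in percent."""
    return 100.0 * (estimated_ns - simulated_ns) / simulated_ns


# (estimated, simulated) pairs for dc2, copied from table 3.
dc2 = [(99, 122), (150, 152), (197, 156)]
for est, sim in dc2:
    print(f"{est:4d} {sim:4d} {percent_diff(est, sim):+.0f}")
# prints -19, -1 and +26 in the last column, as in the dc2 row
```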
3. FROM BOOLEAN NETWORK TO LINEAR PROBLEM
We now have a (nonlinear) delay model, by which we can calculate all delays in a practical circuit given the speed constants $S_i$ for each individual cell. Our task is, however, to determine the speed constants such that a given delay limit is just kept and, at the same time, the power consumption of the total circuit is minimized. As a measure of power consumption, we will use the sum of all speed constants $\sum_i S_i$, thus weighing all cells equally. It is trivial to include weight factors based on the structure or the expected switching frequency of the cells. At first we have to worry about the nonlinear form of the delay model:

$$\tau_{cell} = \tau_{int} + c \cdot \frac{C_{wire} + \sum_i S_i\,C_{in,i}}{S_{cell}}$$
For $S$ limited between 1 and 3, and all constants set to reasonable values, the graph of the delay as a function of the speed constant $S_{cell}$ is drawn in figure 1 for the case of a 3-input nand with a fanout of 3.

[Plot: delay (ns) versus $S$ for $1 \le S \le 3$, with one curve for $\sum_i S_i = 9$ and one for $\sum_i S_i = 3$]

Figure 1. Delay versus speed constant

This has to be linearized to:

$$\tau_{cell} = c_1 + c_2 S_{cell} - c_3 \sum_i S_i\,C_{in,i}$$
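To get a feeling for the error that a linear form introduces, the following sketch compares the nonlinear delay with straight-line chords over one or more equal subranges of $S_{cell}$. It is a simplified, one-variable illustration: the fanout load is held fixed, chords are only one possible way to choose the lines, and all constants and function names are hypothetical placeholders rather than the fitting procedure used for our results.

```python
def delay(s_cell, tau_int=2.0, c=1.0, c_load=12.0):
    """Nonlinear model tau_int + c*C_l/S_cell for a fixed load (placeholder constants)."""
    return tau_int + c * c_load / s_cell


def chord_pieces(f, lo, hi, n):
    """Return (intercept, slope) of the chord of f over each of n equal subranges."""
    points = [lo + (hi - lo) * k / n for k in range(n + 1)]
    pieces = []
    for a, b in zip(points, points[1:]):
        slope = (f(b) - f(a)) / (b - a)
        pieces.append((f(a) - slope * a, slope))
    return pieces


def pwl(pieces, s):
    """Largest of the linear pieces at s; for convex f this equals the chord interpolant."""
    return max(intercept + slope * s for intercept, slope in pieces)


def max_error(f, pieces, lo, hi, samples=201):
    """Largest overestimate of f by the linear pieces on an evenly spaced grid."""
    grid = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    return max(pwl(pieces, s) - f(s) for s in grid)


for n in (1, 3):
    pieces = chord_pieces(delay, 1.0, 3.0, n)
    print(n, round(max_error(delay, pieces, 1.0, 3.0), 3))
```

Because the delay curve is convex in $S_{cell}$, taking the maximum over the lines reproduces the chord interpolant, so each line can be stated as a separate lower bound on the delay.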
In order to limit the error we make by linearizing, we do a piecewise-linear approximation for a number of subranges of $S_{cell}$. The results reported later were obtained by taking 3 subranges for $S_{cell}$. Looking at the smoothness of the graph this seems to be sufficient.

The linear program is now composed as follows. Firstly, we define $T_i$ to be the schedule time, $\tau_i$ the delay and $S_i$ the speed constant of cell $i$. The total delay of the circuit is $T_{max} = \max_i T_i$. Now, for every cell in the circuit the following (in)equalities are defined:

1. The 3 linearized delay models, with $i \in \{\mathrm{Fanout}_{cell}\}$:

   $$\tau_{cell} \ge c_1 + c_2 S_{cell} - c_3 \sum_i S_i\,C_{in,i}$$

   $$\tau_{cell} \ge c_4 + c_5 S_{cell} - c_6 \sum_i S_i\,C_{in,i}$$

   $$\tau_{cell} \ge c_7 + c_8 S_{cell} - c_9 \sum_i S_i\,C_{in,i}$$

2. S is limited:

   $$S_{cell} \ge 1, \qquad S_{cell} \le \text{limit}$$

3. Definitions of schedule times:

   • IF cell is only dependent on primary inputs: $T_{cell} = \tau_{cell}$

   • ELSE for all $j \in \{\mathrm{Fanin}_{cell}\}$: $T_{cell} \ge T_j + \tau_{cell}$

4. Definitions for maximum schedule time of circuit:

   IF cell is primary output: $T_{max} \ge T_{cell}$
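Items 3 and 4 can be illustrated for the case where the cell delays are already known. For fixed delays, the smallest schedule times that satisfy these constraints are those of a longest-path computation, which the sketch below performs on a small made-up network. The function name `schedule_times`, the cell names and the delay values are all hypothetical; in the linear program itself the $T_i$ and $\tau_i$ are variables determined by the solver.

```python
def schedule_times(fanin, tau):
    """Smallest T satisfying item 3 for fixed delays.

    fanin maps each cell to the list of cells driving it (empty if the cell
    depends on primary inputs only); tau maps each cell to its delay.
    """
    T = {}

    def visit(cell):
        if cell not in T:
            # Latest schedule time among the driving cells, 0 if there are none.
            arrival = max((visit(j) for j in fanin[cell]), default=0.0)
            T[cell] = arrival + tau[cell]
        return T[cell]

    for cell in fanin:
        visit(cell)
    return T


# Hypothetical four-cell network; g4 is the only primary output.
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g2", "g3"]}
tau = {"g1": 5.0, "g2": 8.0, "g3": 4.0, "g4": 6.0}
outputs = ["g4"]

T = schedule_times(fanin, tau)
t_max = max(T[cell] for cell in outputs)  # item 4: T_max >= T_cell for outputs
print(T, t_max)
```

Here the critical path runs through g2, g3 and g4, so only the delays of those cells determine `t_max`; g1 has slack and could be left at a low speed constant.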
The objective function is composed of the sum of all speed constants and a constant $c_T$ multiplied by $T_{max}$. To obtain the fastest circuit possible, $c_T$ must be large. To keep a delay limit, take a small $c_T$ and add the inequality $T_{max} \le \text{limit}$. The linear program solver will minimize the objective function under all given constraints.

4. RESULTS
The results in the following figures were obtained after mapping onto sets of (3,3)AOI standard cells. The delays are estimations made with the model for a 6 micron NMOS process. Power values are $\sum_i S_i$, $i \in \{\text{all\_cells}\}$. Solid lines were obtained with $1 \le S_i \le 3$, and dashed lines with $1 \le S_i \le 5$, so load drive capabilities were limited to 3 or 5 times normal respectively. The linear program was solved with the Simplex algorithm [ORC-68]. Run times (on an Alliant FX8, approximately 5 Mflops) per solution can be found in table 4. A range of times is given, as it takes the linear program solver more time to find solutions with low delay than with high delay.
example    # gates   run time (s)
misex2     101       6.7 to 9.7
rd84       139       8 to 13
duke2      211       60 to 90
misex3c    551       360 to 480

TABLE 4. run times

Our experience shows that it is possible to solve the linear program for several thousand cells within about an hour, but for larger examples run times become excessively long. Looking at the figures, the most striking result is the following: moving from the slowest solution towards faster ones, the curve is very flat; hence, considerable speed gain can be obtained without great power costs. With these four examples, the speed could be doubled at a cost of between 10 and 30% extra power.
[Plot: Rd84; power versus delay (nanoseconds)]

Figure 2. Delay versus power consumption for rd84
[Plot: Duke2; power versus delay (nanoseconds)]

Figure 3. Delay versus power consumption for duke2
[Plot: Misex2; power versus delay (nanoseconds), with one curve for $1 \le S \le 3$ and one for $1 \le S \le 5$]

Figure 4. Delay versus power consumption for misex2
[Plot: Misex3c; power versus delay (nanoseconds), with one curve for $1 \le S \le 3$ and one for $1 \le S \le 5$]

Figure 5. Delay versus power consumption for misex3c

When we compare a maximum speed constant of 3 versus 5, it shows that the differences are not great for circuits exhibiting up to about twice the speed of the unsized version. For even faster circuits, there is a gain in power consumption for upper limits of 5, and the maximum attainable speed is higher as well, but at relatively great power costs. If such speeds are required, it would probably be better to redesign the circuit for speed, for example by using a different decomposition procedure during the logic synthesis phase.

5. ACKNOWLEDGEMENTS
I wish to thank Lukas van Ginneken for many good ideas and fruitful discussions on this subject.
REFERENCES

[FIS-85] Fishburn, J.P. and Dunlop, A.E., "TILOS: A Posynomial Programming Approach to Transistor Sizing", Proceedings of the IEEE International Conference on Computer Aided Design 1985, pp 326-328.

[GLA-84] Glasser, L.A. and Hoyte, L.P.J., "Delay and Power Optimization in VLSI Circuits", Proceedings of the IEEE Design Automation Conference 1984, pp 529-535.

[HED-84] Hedlund, K.S., "Models and Algorithms for Transistor Sizing in MOS Circuits", Proceedings of the IEEE International Conference on Computer Aided Design 1984, pp 12-14.

[KAO-84] Kao, W.H., "ARIES, a Workstation Based, Schematic Driven System for Circuit Design", Proceedings of the IEEE Design Automation Conference 1984, pp 301-307.

[KAO-85] Kao, W.H., "Algorithms for Automatic Transistor Sizing in CMOS Digital Circuits", Proceedings of the IEEE Design Automation Conference 1985, pp 781-784.

[MAR-89] Marple, D., "Transistor Size Optimization in the Tailor Layout System", Proceedings of the IEEE Design Automation Conference 1989, pp 43-48.

[MAT-85] Matson, M.D., "Optimization of Digital MOS VLSI Circuits", Proceedings of the Chapel Hill Conference on VLSI, 1985, pp 109-126.

[ORC-68] Orchard-Hays, W., "Advanced Linear Programming Computing Techniques", McGraw-Hill 1968.

[RUE-76] Ruehli, A.U., Wolff, P.K. and Goertzel, G., "Power and Timing Optimization of Large Digital Systems", Proceedings of the IEEE International Symposium on Circuits And Systems 1976, pp 402-405.

[SHY-88] Shyu, J., Sangiovanni-Vincentelli, A., Fishburn, J.P. and Dunlop, A.E., "Optimization-Based Transistor Sizing", IEEE Journal of Solid-State Circuits, Vol. 23, No. 2, April 1988, pp 400-409.

[SIN-88] Singh, K.J., Wang, A.R., Brayton, R.K. and Sangiovanni-Vincentelli, A., "Timing Optimization of Combinational Logic", Proceedings of the IEEE International Conference on Computer-Aided Design 1988, pp 282-285.