In-Place Power Optimization for LUT-Based FPGAs
Luca Benini (Università di Bologna, Bologna, Italy 40122), Enrico Macii (Politecnico di Torino, Torino, Italy 10129), Fabio Somenzi and Balakrishna Kumthekar (University of Colorado, Boulder, CO 80309)
Abstract
This paper presents a new technique to perform power-oriented re-configuration of a system implemented using LUT FPGAs. The main features of our approach are: accurate exploitation of degrees of freedom, concurrent optimization of multiple LUTs based on Boolean relations, and in-place re-programming without re-routing. Our tool optimizes the combinational component of the CLBs after layout, and does not require any re-wiring. Hence, delay and CLB usage are left unchanged, while power is minimized. As the algorithm operates locally on the various LUT clusters, it performs best on large examples, as demonstrated by our experimental results: an average power reduction of 20.6% has been obtained on standard benchmarks.
1 Introduction
Synthesis and optimization of FPGA circuits is a challenging problem. The freedom provided by programmable architectures reduces the effectiveness of most of the design automation tools adopted for standard cell devices. Therefore, ad-hoc techniques need to be adopted to handle area, delay, and power optimization of FPGA-based designs. In this paper, we focus our attention on power minimization. The method we propose, based on the concept of neighborhood of a cluster of functions, targets power optimization through in-place re-programming of one or more configurable logic blocks (CLBs) used in the circuit implementation. This distinctive feature of our approach makes it particularly suitable for post-layout descriptions, i.e., for designs for which the internal routing has already been fixed. In fact, even though for some architectures (e.g., the Xilinx LUT-based FPGAs) the internal routing can be modified by re-configuring the switch and connection blocks, the first-time choice of the interconnections is usually made so that tight timing and resource (i.e., number of used CLBs) constraints are met. Successive power-oriented modifications that affect the routing may thus negatively impact the timing behavior or the feasibility of the resulting solution. A further advantage of the post-layout re-configuration procedure we propose is that, since full information on wiring capacitance and load is available even for designs that span multiple FPGAs, accurate power estimation is possible; thus, the iterative optimization procedure can exploit such information to re-program the most power-critical CLBs.
Concerning previous work, several approaches to FPGA power minimization have been proposed in the literature [1, 2, 3, 4]. Our technique is not mutually exclusive with any of these methods, but it is applied at a later stage in the design flow.
Relevant to us is also the method of [6], where a re-programming technique for solving the engineering change problem, when the FPGA implementation of the design is already in place, is presented. Differently from the engineering change problem, our purpose is to obtain a circuit whose input-output behavior is equivalent to the original one, but whose power dissipation is reduced. Similar to the work of [6], however, we enforce the constraint that the connectivity of the design remains unchanged, and only the CLB functionality can be altered. Finally, another work which is related to ours is the generalized matching approach to gate re-mapping proposed in [7]. The idea there is to replace groups of logic gates with functionally equivalent (but less power consuming) groups; as in our case, this objective is achieved by exploiting the concurrent replacement of more than one gate to achieve a more aggressive global power optimization. However, our target technology and the core optimization algorithm differ from those in [7].
2 Background
In this section, we review some basic concepts and terminology used throughout the discussion. We assume that the reader is familiar with Boolean algebras and BDD-based Boolean function manipulation. We denote vectors and matrices in bold, i.e., $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$. We use the symbols
$$\forall_x f = f|_x \cdot f|_{x'} \quad \text{and} \quad \exists_x f = f|_x + f|_{x'}$$
to designate the consensus (or universal quantification) and the smoothing (or existential quantification) of Boolean function $f$ with respect to variable $x$, respectively. We consider Boolean functions that model a portion (or cluster) of a combinational circuit, and we refer to them as cluster functions. We denote by $\mathbf{f} = [f_1, f_2, \ldots, f_r]^T$ a generic multi-output cluster function. The concept of Boolean relation is at the basis of the optimization algorithms detailed in Section 3. For the sake of brevity, we provide here only some basic definitions.
Definition 1 A Boolean relation $F$ is a subset of the Cartesian product of two Boolean spaces $B^n$ and $B^r$, where $B = \{0, 1\}$: $F \subseteq B^n \times B^r$.
Boolean relations can be used to represent sets of Boolean functions. In this case, the Boolean relation is said to be well defined. The formal definition of this property is the following:
Definition 2 A Boolean relation $F$ is well defined if, $\forall \mathbf{x} \in B^n$, $\exists \mathbf{y} \in B^r$ such that $(\mathbf{x}, \mathbf{y}) \in F$.
Furthermore, we define the property of a function contained in a Boolean relation as follows:
Definition 3 For a given Boolean relation $F$, a Boolean function $\mathbf{f} : B^n \mapsto B^r$ is compatible with $F$ if, for every minterm $\mathbf{x} \in B^n$, $(\mathbf{x}, \mathbf{f}(\mathbf{x})) \in F$. Otherwise, $\mathbf{f}$ is incompatible with $F$.
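As an illustration of the notation above, the following minimal sketch evaluates consensus, smoothing, and Definitions 2 and 3 by explicit enumeration. It assumes Boolean functions are plain Python callables over 0/1 values and relations are explicit sets of pairs; the names `cofactor`, `consensus`, `smoothing`, `is_well_defined`, `is_compatible` and the toy data are hypothetical and are used here only in place of the BDD-based representation assumed in the paper.

```python
from itertools import product

B = (0, 1)

def cofactor(f, idx, value):
    """Cofactor of f: input number idx is fixed to value."""
    return lambda *xs: f(*xs[:idx], value, *xs[idx + 1:])

def consensus(f, idx):
    """Universal quantification: f|x AND f|x'."""
    return lambda *xs: cofactor(f, idx, 1)(*xs) & cofactor(f, idx, 0)(*xs)

def smoothing(f, idx):
    """Existential quantification: f|x OR f|x'."""
    return lambda *xs: cofactor(f, idx, 1)(*xs) | cofactor(f, idx, 0)(*xs)

def is_well_defined(rel, n):
    """Definition 2: every x in B^n is related to at least one y."""
    return all(any(rx == x for rx, _ in rel) for x in product(B, repeat=n))

def is_compatible(f, rel, n):
    """Definition 3: (x, f(x)) belongs to the relation for every minterm x."""
    return all((x, f(*x)) in rel for x in product(B, repeat=n))

# Toy function of two inputs: f(a, b) = a OR b.
f_or = lambda a, b: a | b

# Toy relation over B^1 x B^2: input 0 allows only 00, input 1 allows 01 or 10.
rel = {((0,), (0, 0)), ((1,), (0, 1)), ((1,), (1, 0))}
g_ok = lambda x: (x, 0)    # maps 0 -> 00 and 1 -> 10, both allowed
g_bad = lambda x: (x, x)   # maps 1 -> 11, which the relation forbids
```

The toy relation allows the output pairs 01 and 10 for input 1 but neither 00 nor 11, a set of choices that cannot be written with per-output don't cares; this is the extra expressiveness that Boolean relations provide.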
35th Design Automation Conference (DAC 98), June 1998, San Francisco, CA, USA. Copyright ©1998 ACM 1-58113-049-x-98/0006/$3.50
3 Power Optimization Algorithm
The starting point for our optimization algorithm is a netlist of $K$-input LUTs. The netlist is completely specified, and it has been back-annotated with post-layout information concerning loads, parasitics, and wiring capacitances. Given a generic LUT, we denote with $\mathbf{i}$ the vector of its input variables, with $o$ the output variable, and with $f(\mathbf{i})$ the function it implements. In addition, we denote with $C$ the output load, whose value is known with high accuracy, since it comes from post-layout analysis.
The block diagram of the main loop of the optimization algorithm is shown in Figure 2. The first step consists of simulating the FPGA network to estimate the power dissipation. The user can provide typical long input pattern streams, possibly coming from behavioral/RT-level simulation. Alternatively, the input probability distributions can be supplied. Within the loop, the network is re-simulated every $N_{int}$ iterations in order to update the switching statistics and to ascertain that there has been a decrease in the power consumption after every $N_{int}$ optimization steps. In order to speed up the procedure, only a few patterns ($m_{int}$% of the total) are used for the in-loop simulations. Both $N_{int}$ and $m_{int}$ are user-definable parameters.
3.1 Main Loop
Figure 1: A Multi-Output Cluster and Its Neighborhood. [Diagram labels: p(x), cluster inputs i, cluster LUTs f(i) with outputs o, q(o, x), neighborhood outputs z, h(x).]
The optimization algorithm iteratively selects clusters of two or more LUTs, and finds low-power alternative personalizations (i.e., functions that can be implemented by the LUTs) exploiting the degrees of freedom produced by the environment around the target cluster, as shown in Figure 1. For the sake of clarity, in the following we consider a simple two-output cluster (that is, two LUTs). However, the algorithms and equations we present generalize to clusters with any number of outputs. The degrees of freedom on the functions in the cluster are represented by a Boolean relation whose characteristic function can be computed with the following symbolic equation [8]:
$$F(\mathbf{i}, \mathbf{o}) = \forall_{\mathbf{x},\mathbf{z}} \, [(P(\mathbf{x}, \mathbf{i}) \cdot Q(\mathbf{o}, \mathbf{x}, \mathbf{z})) \Rightarrow H(\mathbf{z}, \mathbf{x})] \qquad (1)$$
where $P$, $Q$ and $H$ are the characteristic functions corresponding to the Boolean functions $p$, $q$ and $h$ describing the neighborhood of the cluster (see Figure 1). Boolean relation $F$ represents a set of compatible functions that is richer than the set represented by don't cares. The LUTs are optimized for low power by finding the minimum-power pair of functions $f_1^{opt}$ and $f_2^{opt}$ that are compatible with $F$ and that have the same support $\mathbf{i}_1$, $\mathbf{i}_2$ as the original functions $f_1(\mathbf{i}_1)$ and $f_2(\mathbf{i}_2)$, respectively. Notice that the constraint on the support of $f_1^{opt}$ and $f_2^{opt}$ is enforced because we want to perform in-place optimization of CLBs without changing the connectivity of the FPGA implementation we start with.
Several points must be addressed to achieve an effective implementation of in-place power optimization. First, we need to specify how the candidate clusters of the LUTs and their neighborhoods are selected. Second, we need to formulate a strategy for estimating the power improvements and to decide when to stop the optimization loop. Third, we need to be able to find $f_1^{opt}$ and $f_2^{opt}$ given $F$ and to enforce the constant support constraint. The first two issues are discussed in Section 3.1, where the main optimization loop is described. The third problem is described in Section 3.2.
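To make Equation 1 concrete, the sketch below evaluates it by brute-force enumeration on a toy neighborhood. It assumes small explicit functions over tuples of bits instead of the symbolic (BDD-based) computation used in the paper; the helper `compute_relation` and the toy functions `p`, `f`, `q`, `h` are hypothetical and chosen only for illustration.

```python
from itertools import product

B = (0, 1)

def compute_relation(p, q, h, n_x, n_i, n_o, n_z):
    """Brute-force Equation 1: keep (i, o) if, for all x and z,
    (i == p(x) and z == q(o, x)) implies z == h(x)."""
    F = set()
    for i in product(B, repeat=n_i):
        for o in product(B, repeat=n_o):
            if all(not (p(x) == i and q(o, x) == z) or h(x) == z
                   for x in product(B, repeat=n_x)
                   for z in product(B, repeat=n_z)):
                F.add((i, o))
    return F

# Hypothetical toy neighborhood (not taken from the paper).
p = lambda x: (x[0] & x[1], x[0] | x[1])   # cluster inputs i = p(x)
f = lambda i: (i[0], i[0] ^ i[1])          # original two-LUT cluster, o = f(i)
q = lambda o, x: (o[0] | o[1],)            # neighborhood outputs z = q(o, x)
h = lambda x: q(f(p(x)), x)                # behavior that must be preserved

F = compute_relation(p, q, h, n_x=2, n_i=2, n_o=2, n_z=1)
```

In this toy case the input combination i = (1, 0) is never produced by p, so every output pair is allowed there, and whenever z must be 1 any of the three output pairs 01, 10, 11 is acceptable. The original cluster function f is, as expected, one of the functions compatible with F.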
Figure 2: Main Power Optimization Loop. [Flowchart blocks include: Original LUTs; Power Estimate (SW); Sorted LUT List; Build Cluster Around LUT; Build Neighborhood; Compute Boolean Relation; Find Min Power Replacement; Mark and Record Gain; Next LUT; All Locked; Lock Node with Max Gain; Optimized LUTs.]
The nodes in the network (each LUT is a network node, and to each node we associate the function $f$ stored in the LUT) are first sorted according to the product of their switching activity times output load (SW × C). Let these sorted nodes constitute the set $S$. The LUTs are then picked according to this order. For each selected LUT, the companion cluster members and the neighborhoods are computed. Only the members of $S$ which have not been locked due to re-programming in earlier iterations of the algorithm are chosen. (The procedures for cluster selection and neighborhood construction are not described here for space reasons; however, the interested reader can find details on such procedures in [9].) With the selection of the cluster and its neighborhood, functions $P$, $Q$ and $H$ (Equation 1) can be computed. The stage is then ready for the construction of Boolean relation $F$ and the computation of the minimum-power compatible functions $f^{opt}$ to be implemented by the LUTs in the cluster. These are the key steps of the optimization algorithm, and will be described in the following subsection.
Cluster and neighborhood selection, and minimum-power compatible function computation are repeated for each of the nodes in the set $S$. After all the nodes in $S$ have been processed, the LUT with the highest gain is selected. The LUT is re-programmed with the minimum-power compatible function and then it is locked. This means that the LUT cannot be re-programmed in future iterations. The algorithm iterates (Figure 2) until the improvement is below a minimum user-defined threshold $d$, or all nodes have been locked. Finally, full-delay simulation with the complete set of input patterns is run again on the optimized network to confirm the reduction in power.
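The node ordering used by the main loop can be sketched as follows. The sketch assumes that the simulator returns, for each LUT output, the sequence of logic values observed over the applied patterns, and that switching activity is estimated as the fraction of consecutive pattern pairs on which the output toggles; the function names, node names and load values are hypothetical.

```python
def switching_activity(trace):
    """Fraction of consecutive pattern pairs on which the output toggles.
    Assumes the trace holds at least two simulated values."""
    toggles = sum(a != b for a, b in zip(trace, trace[1:]))
    return toggles / (len(trace) - 1)

def sort_unlocked_nodes(traces, loads, locked):
    """Return the unlocked nodes ordered by decreasing SW x C."""
    cost = {node: switching_activity(trace) * loads[node]
            for node, trace in traces.items() if node not in locked}
    return sorted(cost, key=cost.get, reverse=True)

# Hypothetical simulated output values and post-layout loads.
traces = {
    "lut_a": [0, 1, 0, 1, 0],   # toggles on every pattern
    "lut_b": [0, 0, 1, 1, 1],   # toggles once
    "lut_c": [1, 0, 1, 0, 1],   # toggles on every pattern
}
loads = {"lut_a": 1.0, "lut_b": 8.0, "lut_c": 3.0}

order_all = sort_unlocked_nodes(traces, loads, locked=set())
order_after_lock = sort_unlocked_nodes(traces, loads, locked={"lut_c"})
```

Note that `lut_b` ranks above `lut_a` despite toggling less often, because its output load is larger; this is why back-annotated capacitances matter for selecting the most power-critical LUTs.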
3.2 Optimization Core
Once the neighborhood and the cluster members are identified, the Boolean relation $F$ is computed through Equation 1. The key optimization performed by our procedure consists of finding minimum-power compatible functions for re-programming the LUTs in the cluster. In the following, let $f_j(\mathbf{i}_j)$, $f_j^{opt}(\mathbf{i}_j)$, $j = 1, \ldots, maxCluster$ represent the functions implemented by the LUTs in the cluster before and after re-programming, respectively. Notice that the support variables $\mathbf{i}$ of the multi-output cluster function $\mathbf{f}(\mathbf{i})$ are the union of the $\mathbf{i}_j$.
To enforce the constraint that the network connectivity be left unchanged, the Boolean relation $F$ must be restricted according to the constant support constraint to yield $R \subseteq F$. Relation $R$ is a restriction of Boolean relation $F$ with the property that if functions $f_j^{opt}$, $j = 1, \ldots, maxCluster$, compatible with $F$ and with the same support as the original $f_j$ exist, then they are compatible with $R$ as well. The usefulness of $R$ is that it eliminates many compatible functions of $F$ that do not satisfy support constraints, without excluding any valid solution. We have followed the algorithm proposed by Kukimoto and Fujita [6] for the computation of $R$. Unfortunately, although $R$ excludes many functions which are compatible with $F$ but do not meet the support constraint, it is not guaranteed to contain only valid compatible functions. Hence, the correctness of the solutions extracted from Boolean relation $R$ must be checked against the support constraints.
After $R$ has been computed, the LUTs in the cluster are ordered by decreasing SW × C. Starting from the LUT with highest switched capacitance, the min-power compatible functions are computed. Assume that LUT $j$ has been selected. We determine the lower bound $l_j(\mathbf{i}_j)$ and the upper bound $u_j(\mathbf{i}_j)$ for it. Function $l_j$ has the same support as $f_j$ and it is the function with minimum ON-set compatible with $R$, while $u_j$ is the function with maximum ON-set (and same support as $f_j$) compatible with $R$. In symbols: $l_j(\mathbf{i}_j) \le h(\mathbf{i}_j) \le u_j(\mathbf{i}_j)$, $\forall h(\mathbf{i}_j)$ compatible with $R$. To compute $l_j$ and $u_j$ we first extract from $R$ the Boolean relation $R_j \subseteq R$ which contains all and only the functions with support $\mathbf{i}_j$:
$$R_j(\mathbf{i}_j, o_j) = \forall_{i \in (\mathbf{i} - \mathbf{i}_j)} \, \exists_{o \in (\mathbf{o} - o_j)} \, R(\mathbf{i}, \mathbf{o}) \qquad (2)$$
where $o_j$ is the output variable corresponding to LUT $j$. If $R_j$ is well-defined, the two bounds are easily computed as follows:
$$l_j(\mathbf{i}_j) = R_j|_{o_j}(\mathbf{i}_j, o_j) \cdot \overline{R_j|_{o_j'}(\mathbf{i}_j, o_j)} \qquad (3)$$
$$u_j(\mathbf{i}_j) = R_j|_{o_j}(\mathbf{i}_j, o_j) \qquad (4)$$
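Equations 2 to 4 can be illustrated by explicit enumeration on a small relation. The sketch below assumes the relation is given as a set of (input tuple, output tuple) pairs and takes R equal to a toy F, i.e., the support-restriction step of [6] is not modelled; `project`, `bounds` and the toy data are hypothetical names and values.

```python
from itertools import product

B = (0, 1)

def project(R, n_i, n_o, support, j):
    """Equation 2: (i_j, o_j) is kept if, for every value of the inputs
    outside the support, some choice of the other outputs is in R."""
    others = [k for k in range(n_i) if k not in support]
    Rj = set()
    for ij, oj in product(product(B, repeat=len(support)), B):
        def full(rest):
            i = [0] * n_i
            for k, v in zip(support, ij): i[k] = v
            for k, v in zip(others, rest): i[k] = v
            return tuple(i)
        if all(any((full(rest), o) in R
                   for o in product(B, repeat=n_o) if o[j] == oj)
               for rest in product(B, repeat=len(others))):
            Rj.add((ij, oj))
    return Rj

def bounds(Rj, n_support):
    """Equations 3 and 4 as truth tables indexed by the support minterm."""
    lower, upper = {}, {}
    for ij in product(B, repeat=n_support):
        can_be_1, can_be_0 = (ij, 1) in Rj, (ij, 0) in Rj
        lower[ij] = int(can_be_1 and not can_be_0)   # forced to 1
        upper[ij] = int(can_be_1)                    # allowed to be 1
    return lower, upper

# Toy two-input, two-output relation (hypothetical).
allowed = {
    (0, 0): [(0, 0)],
    (0, 1): [(0, 1), (1, 0), (1, 1)],
    (1, 1): [(0, 1), (1, 0), (1, 1)],
    (1, 0): [(0, 0), (0, 1), (1, 0), (1, 1)],
}
R = {(i, o) for i, outs in allowed.items() for o in outs}

R0 = project(R, n_i=2, n_o=2, support=[0], j=0)   # LUT 0 reads input 0 only
lower, upper = bounds(R0, 1)
```

For this toy relation the lower bound of LUT 0 is the constant 0 while the upper bound is the identity on its single input, so re-programming LUT 0 to the lower bound would remove all switching at its output.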
The minimum-power function $f_j^{opt}$ for re-programming LUT $j$ is then selected to be either $u_j$ or $l_j$. The rationale for this choice is the following. Assuming no temporal correlation, the switching activity of output $j$ can be computed as $2 p_j (1 - p_j)$, where $p_j$ is the probability of the output $j$ to be one. Notice that this is a single-maximum function. Since $l_j$ is smaller than any function with support $\mathbf{i}_j$ compatible with $R$, its probability is guaranteed to be minimum. Along the same lines, since $u_j$ is larger than any function with support $\mathbf{i}_j$ compatible with $R$, its probability is guaranteed to be maximum. Consequently, the minimum switching activity is attained either in $u_j$ or in $l_j$ or both. This conclusion is valid in the absence of temporal correlation, and may not be accurate in some cases. However, it provides an efficient way of getting function $f_j^{opt}$, as shown by the results of Section 4.
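The selection between the two bounds can be sketched as follows. The sketch assumes, in addition to the absence of temporal correlation stated above, that the LUT inputs are mutually independent with known one-probabilities, which is a simplification made here only for illustration; the function names and the example tables are hypothetical.

```python
def one_probability(table, input_probs):
    """Probability that the function is 1, given a truth table indexed by
    input minterms and independent one-probabilities for the inputs."""
    total = 0.0
    for minterm, value in table.items():
        if value:
            weight = 1.0
            for bit, p in zip(minterm, input_probs):
                weight *= p if bit else 1.0 - p
            total += weight
    return total

def activity(p):
    """Switching activity without temporal correlation: 2 p (1 - p)."""
    return 2.0 * p * (1.0 - p)

def pick_min_power(lower, upper, input_probs):
    """Return the name and activity of the bound that switches less."""
    a_low = activity(one_probability(lower, input_probs))
    a_up = activity(one_probability(upper, input_probs))
    return ("lower", a_low) if a_low <= a_up else ("upper", a_up)

# Hypothetical bounds for a two-input LUT: lower = AND, upper = OR.
lower = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
upper = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

choice, act = pick_min_power(lower, upper, input_probs=(0.9, 0.9))
```

With inputs that are 1 most of the time, the upper bound is almost always 1 and therefore switches less than the lower bound. Since $2p(1-p)$ is concave with its single maximum at $p = 0.5$, and every compatible function has a one-probability between those of the two bounds, no intermediate function can do better than the better of the two endpoints under this model.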
Once the choice between $l_j$ and $u_j$ is made, node $j$ is marked as potential candidate for re-programming and its gain is recorded. It is also marked as processed, so that it is not considered again in the same iteration. The relation $R$ is restricted by taking into account the choice of $f_j^{opt}$ made. Then, another cluster output is considered. It should be noted that our choice of $f_j^{opt}$ is greedy. We try to get the best possible re-programming for LUT $j$ disregarding the ones that come after $j$. As a consequence, the choice for $j$ may result in the Boolean relation being ill-defined when making a choice for, say, $j + 1$. If this is the case, the re-programming algorithm backtracks and chooses another function, different from the one previously chosen, such that the new choice also reduces the switching activity. If this also fails, the node's function is not changed and the algorithm continues with a choice for the next cluster member. Node $j$ is moved to the bottom of the list of nodes yet to be processed. It will be processed again, but it will not be moved again and it will not be allowed to cause backtracking.
Having described the main optimization loop and the core re-programming algorithm, we can now provide the pseudo-code of the in-place optimization procedure (Figure 3).
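The greedy choice with backtracking can be sketched on a toy relation as follows. The sketch assumes explicit set-based relations, tries the candidate functions for LUT j in order, and accepts a candidate only if the relation projected onto the support of the next LUT stays well defined; all names are hypothetical, and the support-restriction step of [6] is again not modelled.

```python
from itertools import product

B = (0, 1)

def project(R, n_i, n_o, support, j):
    """Relation seen by LUT j alone (Equation 2), by enumeration."""
    others = [k for k in range(n_i) if k not in support]
    Rj = set()
    for ij, oj in product(product(B, repeat=len(support)), B):
        def full(rest):
            i = [0] * n_i
            for k, v in zip(support, ij): i[k] = v
            for k, v in zip(others, rest): i[k] = v
            return tuple(i)
        if all(any((full(rest), o) in R
                   for o in product(B, repeat=n_o) if o[j] == oj)
               for rest in product(B, repeat=len(others))):
            Rj.add((ij, oj))
    return Rj

def well_defined(Rj, n_support):
    """Every support minterm must admit at least one output value."""
    return all(any((ij, v) in Rj for v in B)
               for ij in product(B, repeat=n_support))

def restrict(R, support, j, table):
    """Keep only the pairs of R that agree with the function chosen for LUT j."""
    return {(i, o) for (i, o) in R
            if o[j] == table[tuple(i[k] for k in support)]}

def choose_with_backtrack(R, n_i, n_o, j, support_j, candidates, nxt, support_nxt):
    """Try candidates in order; fall back to 'no change' (None) if all fail."""
    for table in candidates:
        Rn = restrict(R, support_j, j, table)
        if well_defined(project(Rn, n_i, n_o, support_nxt, nxt), len(support_nxt)):
            return table, Rn
    return None, R

# Toy relation: the two outputs must satisfy o0 XOR o1 == a XOR b.
R = {((a, b), (o0, o1)) for a, b, o0, o1 in product(B, repeat=4)
     if o0 ^ o1 == a ^ b}
const0 = {(0,): 0, (1,): 0}   # lower bound of LUT 0 in this toy case
const1 = {(0,): 1, (1,): 1}   # upper bound of LUT 0 in this toy case

# LUT 1 reads only input b: both candidates leave it without a valid function.
failed = choose_with_backtrack(R, 2, 2, 0, [0], [const0, const1], 1, [1])
# LUT 1 reads both inputs: the first (greedy) candidate is accepted.
kept = choose_with_backtrack(R, 2, 2, 0, [0], [const0, const1], 1, [0, 1])
```

In the first call both bounds make the relation of the next LUT ill-defined, so the function of LUT 0 is left unchanged; in the second call the greedy choice succeeds and the relation is restricted accordingly.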
stop = iter = 0;
frequency = N_int;
do {
    if (iter % frequency == 0) {
        if (iter == 0) PerformCompleteSimulation();
        else PerformPartialSimulation();
        if (Power is increased) {
            Undo the last N_int changes;
            Leave nodes locked;
        }
    }
    sortedNodes = SortUnlockedNodes(network);
    foreach (node ∈ sortedNodes) {
        if (node is not processed) {
            cluster = SelectCluster(node, maxCluster);
            neighbor = SelectNeighborhood(cluster, maxZNodes);
            F(i, o) = ComputeRelation(cluster, neighbor);
            R(i, o) = RestrictAccordingToSupportConstraints(F(i, o));
            ChooseCompatibleFunctions(R(i, o), potNodeTable);
        }
    }
    if (size of potNodeTable > 0) ReProgramAndLockBestNode(potNodeTable);
    else stop = 1;
    iter++;
} while (stop == 0);
PerformCompleteSimulation();
Figure 3: In-Place LUT Re-Programming Algorithm.
The first part of the loop dispatches simulations to assess the quality of the optimization. Simulation of the complete input stream is performed at the first iteration, and simulations of shorter streams are performed every $N_{int}$ iterations. After simulation, the nodes are sorted based on their output power consumption. The sorted list sortedNodes is then processed, one node at a time. For each node, if the node has not been locked in previous iterations, cluster and neighborhood are constructed. Boolean relations $F$ and $R$ are then computed, and the min-power compatible functions are obtained. The variable potNodeTable returns a set of nodes for potential re-programming. The node in the cluster for which power savings are maximum is selected, re-programmed and locked. The iteration stops when no more savings or unlocked nodes can be found. The procedure is guaranteed to terminate and has a number of iterations which is linear in the number of LUTs.
4 Implementation and Results
The flow we have used for benchmarking the capabilities of our in-place LUT re-programming algorithm is illustrated in Figure 4. The original .blif description is fed to SIS for logic optimization and mapping onto 5-input LUTs. The script we used for this task is the one suggested in the SIS manual. The mapped circuit (.fpga) is supplied to a tool (PREX) that generates a file (.cap) containing the post-layout capacitances for the nodes in the circuit. The .fpga description and the corresponding capacitance file are input to VIS, the framework in which the optimization algorithm has been developed. VIS is interfaced with a power simulator (PSIM), whose task is that of providing power estimates at all stages of execution of the re-programming algorithm using a set of user-supplied patterns (file .pat), or user-specified primary input statistics (file .i_stat). The output of the flow is the optimized LUT netlist (file .opt).
Figure 4: Experimental Flow. [Flow diagram: .blif → SIS → .fpga → PREX → .cap; .fpga and .cap → VIS, interfaced with PSIM (.pat or .i_stat) → .opt.]
Program PREX mimics the job of the tools for placement, routing, and capacitance extraction usually distributed by the FPGA vendors together with their parts; in fact, it produces a capacitance file to be used for accurate timing and power simulation. For our experiments, PREX generates capacitances based on the fanout of the LUTs and some additive random noise (that models the uncertainties in the routing process). This is for the purpose of running experiments which are not biased by the results of different layout tools. Notice, however, that any capacitance distribution would be acceptable, as long as it is the same before and after the optimization, because the interconnections of the various LUTs never change.
Several parameters can be specified to control the optimizer: the depth of the neighborhood, the number of nodes in the cluster, and the number of nodes in the neighborhood. Our goal was to be able to show a practical realization of the in-place power optimization algorithm. Therefore, we implemented the algorithm targeting robustness and conservatively set the parameters to avoid failure even in corner cases and produce results in a reasonable amount of time. Needless to say, with this choice we gave up the opportunity to explore the entire space spanned by the various degrees of freedom. A case in point is the limit of 2 set on the number of nodes in the cluster and the limit of 1 set on the depth of the neighborhood.
We have assumed the propagation delay through the CLBs to be constant. The pin-to-pin block delay of the CLB is then assumed to be constant. The total delay of CLB $j$, including the delay due to the output capacitance, can thus be approximated as:
$$D(j) = BlockDelay + C_j \cdot fanoutCount(j) \qquad (5)$$
where $C_j$ is the capacitance of node $j$ taken from the .cap file.
Our algorithm disregards the internal power of CLBs. A recent work [5] has shown that, for the same function, the internal power dissipation of a CLB can be reduced by reordering the inputs to the CLB. As we do not allow any change in the interconnect, the above procedure cannot be used, and the constant power model seems reasonable. In addition, it should be noted that the output switching power of the CLB usually dominates the internal power.
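The delay model of Equation 5 can be written down directly, which makes explicit why in-place re-programming leaves delay unchanged: none of the quantities involved depends on the LUT contents. In the sketch below the node record layout and all numeric values are hypothetical, and no particular units are implied.

```python
def clb_delay(node, block_delay):
    """Equation 5: constant block delay plus a load-dependent term.
    The truth table stored in the LUT does not enter the model."""
    return block_delay + node["cap"] * node["fanout"]

# Hypothetical CLB before and after re-programming: same wiring, new contents.
before = {"truth_table": 0b0110, "cap": 0.5, "fanout": 3}
after = dict(before, truth_table=0b0010)

delay_before = clb_delay(before, block_delay=2.0)
delay_after = clb_delay(after, block_delay=2.0)
```

Because capacitance and fanout come from the fixed post-layout wiring, the two delays are identical under this model.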
Table 1 reports the results we have obtained on a sample of the large MCNC'91 combinational multi-level circuits [10]. Columns CLB, PI and PO report the number of CLBs, primary inputs and primary outputs of the circuit. Columns Init. and Fin. give the power dissipation of the circuit before and after optimization. Column R gives the number of CLBs that have been re-programmed, column Sav. the percentage of power saving, and column Time the CPU time, measured in seconds on a 200MHz Pentium Pro Linux machine with 128MB of RAM, required by the algorithm to complete.
Circ.    CLB   PI   PO  Init.  Fin.    R  Sav. (%)  Time (s)
c499      70   41   32    155   118   22      23.8       343
c1355     94   41   32    199   101   77      48.9       765
alu2     113   10    6    156   117   52      25.3       221
c880     115   60   26    167   151   26       9.4       881
c1908    150   33   25    254   200   52      21.4      1570
alu4     214   14    8    327   246   91      24.8      1696
c2670    307  233  140    330   313   59       5.1      1110
pair     440  173  137    572   562   58       1.7      1582
c3540    470   50   22    652   482  206      26.2     17222
c5315    515  178  123    907   723  207      20.2     12443
t481     628   16    1    652   627   39       3.8      1558
i8       774  133   81   1170   787  102      32.7      1475
c7552    801  207  108   1393  1188  319      14.7     24920
i10      830  257  224    919   782  163      14.9      6948
des     1406  256  245   1887  1341  461      28.9     28675
Table 1: Power Optimization Results.
On all the benchmarks, power was estimated through real-delay simulation of user-specified input patterns. 20,000 to 50,000 vectors were used for the initial and final simulations, while only an average of 50% of such vectors was simulated during the optimization phase. The internal simulations were performed every 5 to 10 iterations of the algorithm. The peak power reduction we have obtained is around 48.9%, and the average power reduction is around 20.6%.
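As a cross-check of the summary figures, the short script below recomputes the savings from the Init. and Fin. columns of Table 1. The data are copied from the table; the variable names are hypothetical, and the interpretation of the quoted average as an aggregate over all circuits is an observation made here, not a statement from the paper.

```python
# (circuit, initial power, final power) copied from Table 1.
rows = [
    ("c499", 155, 118), ("c1355", 199, 101), ("alu2", 156, 117),
    ("c880", 167, 151), ("c1908", 254, 200), ("alu4", 327, 246),
    ("c2670", 330, 313), ("pair", 572, 562), ("c3540", 652, 482),
    ("c5315", 907, 723), ("t481", 652, 627), ("i8", 1170, 787),
    ("c7552", 1393, 1188), ("i10", 919, 782), ("des", 1887, 1341),
]

total_init = sum(init for _, init, _ in rows)
total_fin = sum(fin for _, _, fin in rows)

# Saving over the summed power of all circuits, in percent.
aggregate_saving = 100.0 * (total_init - total_fin) / total_init

# Per-circuit savings recomputed from the (rounded) power values.
per_circuit = {name: 100.0 * (init - fin) / init for name, init, fin in rows}
best_circuit = max(per_circuit, key=per_circuit.get)
```

The aggregate saving over the summed power of all fifteen circuits rounds to 20.6%, which is consistent with the quoted average; the plain arithmetic mean of the Sav. column is about 20.1%. Per-circuit values recomputed this way can differ slightly from the Sav. column because the tabulated power figures are rounded.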
5 Conclusions
We have presented a technique to perform power-oriented reconfiguration of a system implemented using LUT-based FPGAs. Our approach has the distinctive property of being applicable to designs for which the layout has already been generated. Our method operates locally on the various LUT clusters of the original network and performs best on large examples, as demonstrated by the experimental results we have reported.
References
[1] A. Farrahi, M. Sarrafzadeh, "FPGA Technology Mapping for Power Minimization," IWFPLA, pp. 66-77, Sep. 1994.
[2] K-R. Pan, M. Pedram, "FPGA Synthesis for Minimum Area, Delay and Power Consumption," EDTC-96, page 603, Mar. 1996.
[3] C-C. Wang, C-P. Kwan, "Low Power Technology Mapping by Hiding High-Transition Paths in Invisible Edges for LUT-Based FPGAs," ISCAS-97, pp. 1536-1540, May 1997.
[4] C-S. Chen, T. Hwang, C. L. Liu, "Low Power FPGA Design - A Re-Engineering Approach," DAC-34, pp. 656-661, Jun. 1997.
[5] M. Alexander, "Power Optimization for FPGA Look-Up Tables," ISPD-97, pp. 156-162, Apr. 1997.
[6] Y. Kukimoto, M. Fujita, "Rectification Method for Lookup-Table Type FPGAs," ICCAD-92, pp. 54-61, Nov. 1992.
[7] P. Vuillod, L. Benini, G. De Micheli, "Re-Mapping for Low Power Under Tight Timing Constraints," ISLPED-97, pp. 287-292, Aug. 1997.
[8] Y. Watanabe, L. Guerra, R. K. Brayton, "Permissible Functions for Multi-Output Components in Combinational Logic Optimization," IEEE TCAD, Vol. 15, No. 7, pp. 734-744, Jul. 1996.
[9] B. Kumthekar, L. Benini, E. Macii, F. Somenzi, In-Place Power Optimization for LUT-Based FPGAs, Tech. Rep., Dept. of ECE, Univ. of Colorado, Oct. 1997.
[10] S. Yang, Logic Synthesis and Optimization Benchmarks User Guide Version 3.0, Tech. Rep., MCNC, Jan. 1991.