Telescopic Units: Increasing the Average Throughput of Pipelined Designs by Adaptive Latency Control

Luca Benini (Stanford University, Computer Systems Laboratory, Stanford, CA 94305)
Enrico Macii, Massimo Poncino (Politecnico di Torino, Dip. di Automatica e Informatica, Torino, Italy 10129)
Abstract

This paper presents a technique, alternative to performance-driven synthesis, that allows the average throughput of combinational logic blocks to be drastically increased by transforming fixed-latency units into variable-latency ones that run with a faster clock cycle. The transformation is fully automatic and can be used in conjunction with traditional design techniques, such as pipelining, to improve the overall performance of speed-critical systems. Results, obtained on a large set of benchmark circuits, are very promising.

1 Introduction

The ever-increasing clock frequency of high-performance systems pushes IC designers and synthesis tools toward substantial efforts in optimizing the delay of the combinational logic blocks that constrain the cycle time. Timing optimization is often an expensive operation with a significant area and power overhead. In this work, we propose an innovative way of increasing the average throughput of a design with a small reduction in average latency. Our approach has two key features: First, a slow, fixed-latency unit is transformed into a fast variable-latency one which delivers a higher average throughput with low average latency; second, the transformation of the unit is performed in a fully automatic way, and estimates of the improvement in performance are available to the designer.

We call the final product of our automatic transformation a telescopic unit. The name stems from the fact that the unit requires a variable number of cycles to terminate its computation. Seen as a black box, a telescopic unit produces two outputs: the original functional output and a handshaking hold signal, which is activated when the functional unit cannot terminate its computation within the required cycle time. The overhead of realizing a telescopic unit consists of the circuitry needed for the generation of the hold signal. Additional circuitry may also be required in the external control logic, which needs to observe the hold signal and behave accordingly. Although telescopic units are, in principle, similar to self-timed units, they operate in a fully synchronous environment; hence, they take an integer number of clock cycles to complete their executions. The fully synchronous operation allows us to ignore the hazard-related issues that make the design of large-scale self-timed circuits complex and expensive.

We outline algorithms and heuristics for automatically synthesizing telescopic units, which rely on symbolic techniques for exact timing analysis [1]. Experimental results are very promising, and they clearly indicate the applicability of the technique to pure throughput optimization, as well as to area optimization under throughput constraints. Recently, Hassoun and Ebeling have presented an approach similar to ours, called architectural retiming [2]. Their idea is to increase the number of registers on latency-constrained paths, thus decreasing the cycle time at no latency cost. The results, achieved on a few selected benchmarks, are interesting. Unfortunately, the approach is fully manual, and no indication is given on how the search for the proper retiming could be automated.

2 Background

2.1 Circuits and Delays

A pipelined circuit consists of one or more combinational logic blocks, called stages of the pipeline, connected through memory elements, as shown in Figure 1.
Figure 1: A Three-Stage Pipelined Circuit.

Each combinational logic block is a DAG composed of gates and connections between gates. If the output of a gate, g_i, is connected to an input of a gate, g_j, then g_i is a fanin of g_j, and g_j is a fanout of g_i. A controlling value at a gate input is the value that determines the value at the output of the gate independent of the other inputs, while a non-controlling value at a gate input is a value whose presence is not sufficient to determine the value at the output of the gate. Each connection, c, has two delays associated with it: a rise delay, d_r(c), and a fall delay, d_f(c). The delay function of c from gate h to gate g, d(c,x), equals d_r(c) if g carries a 1 when input vector x is applied to the primary inputs of the block; otherwise, d(c,x) = d_f(c). If all fanin connections of g have the same values of d_r(c) and d_f(c), we define the delay function of g as d(g,x) = d(c,x), where c is any fanin connection of g. If f(g,x) is the global function of g (i.e., its function in terms of the primary inputs) and c connects gate h to gate g, then:

$d(c,x) = f(g,x) \, d_r(c) + f'(g,x) \, d_f(c)$

Given a gate g, the arrival time, AT(g,x), is the time at which the output of g settles to its final value when input vector x is applied at time 0.

A path in a combinational logic block is a sequence of gates and connections, (g_0, c_0, ..., c_{n-1}, g_n), where connection c_i, 0 <= i < n, connects the output of gate g_i to the input of gate g_{i+1}. The length of a path P = (g_0, c_0, ..., c_{n-1}, g_n) is defined as $d(P,x) = \sum_{i=0}^{n-1} d(c_i,x)$. The topological delay of a combinational logic block is the length of its longest path. An event is a transition 0 -> 1 or 1 -> 0 at a gate. Given a sequence of events, (e_0, e_1, ..., e_n), occurring at gates (g_0, g_1, ..., g_n) along a path, such that e_i occurs as a result of event e_{i-1}, the event e_0 is said to propagate along the path. Under a specified delay model, a path P = (g_0, c_0, ..., c_{n-1}, g_n) is said to be sensitizable if an event e_0 occurring at gate g_0 can propagate along P. The critical path of a block is the longest sensitizable path under a specified delay model; if a path is not sensitizable, then it is a false path. In the presence of false paths, the topological delay of a block may exceed the true delay.
2.2 Pseudo-Boolean Functions and ADDs
An n-input pseudo-Boolean function, $f : B^n \to S$, is a mapping from an n-dimensional Boolean space to a finite set S. The data structure we have adopted for efficiently storing and manipulating this kind of function is the algebraic decision diagram (ADD) [3]. An ADD is an extension of the BDD which allows values from an arbitrary finite domain to be associated with the terminal nodes (i.e., the leaves) of the diagram. Among the existing operators for efficient ADD manipulation, the one called THRESHOLD is of particular importance for our purposes. THRESHOLD takes two arguments: f, a generic ADD, and val, a threshold value; it sets to 0 all the leaves of f whose value is smaller than val and to 1 all the leaves of f whose value is greater than or equal to val. The resulting ADD, f_val, is thus restricted to have only 0 or 1 as terminal values; therefore, it is a BDD.
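As a concrete illustration of the THRESHOLD operator, the following sketch applies it to a toy recursive encoding of an ADD; the tuple representation and the function name are ours and do not correspond to the API of any particular decision-diagram package.

```python
# Minimal sketch of the THRESHOLD operator on a toy ADD encoding.
# An ADD is either a numeric leaf or a tuple (var, else_child, then_child).
# The encoding is illustrative only, not the CUDD/ADD package interface.

def threshold(f, val):
    """Map every leaf >= val to 1 and every leaf < val to 0 (yields a BDD)."""
    if not isinstance(f, tuple):          # terminal node
        return 1 if f >= val else 0
    var, lo, hi = f
    lo_t, hi_t = threshold(lo, val), threshold(hi, val)
    if lo_t == hi_t:                      # keep the result reduced
        return lo_t
    return (var, lo_t, hi_t)

# Example: arrival-time ADD of a two-input block, leaves are delays.
at = ('a', ('b', 3, 7), 9)                # a=1 -> 9, a=0,b=1 -> 7, a=0,b=0 -> 3
print(threshold(at, 7))                   # ('a', ('b', 0, 1), 1)
```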
2.3 ADD-Based Timing Analysis
The problem of calculating the timing response of a combinational logic block can be formulated as follows: Given a combinational block, find the set of input vectors for which the length of the critical path, under a specified mode of operation and a gate delay model, is maximum; the length of the critical path gives the overall block delay. Given a gate g of the network and a primary input vector x ∈ X, where X is the set of all the care input vectors of the block, the arrival time at its output line, AT(g,x), is evaluated in terms of the arrival times of its inputs and the delays of its fanin connections, d(c_j,x), where c_j is the connection to pin j of g. If all fanins of g have non-controlling values,

$AT(g,x) = \max_j \{ AT(c_j,x) + d(c_j,x) \}$

If at least one fanin c_j of g has a controlling value for input x ∈ X,

$AT(g,x) = \min_j \{ AT(c_j,x) + d(c_j,x) \mid c_j \text{ controlling} \}$

Finally, if x ∉ X,

$AT(g,x) = -\infty$

Unlike traditional delay analyzers, the ADD-based timing analysis tool makes it possible to compute and store the length of the critical path for each input vector. The availability of this complete timing information about the combinational logic block is essential for the realization of the throughput optimization algorithm described in Section 3.
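For intuition, the per-vector timing information that the arrival time ADD captures symbolically can be reproduced by exhaustive enumeration on a very small example, using a simplified single-delay-per-connection model. The two-gate network, delay values, and helper names below are hypothetical; the actual tool operates on ADDs rather than on explicit vectors.

```python
from itertools import product

# Toy two-gate network: n1 = AND(a, b), out = OR(n1, c).
# Each fanin connection has a fixed delay; controlling values: 0 for AND, 1 for OR.
GATES = {                      # gate -> (function, controlling value, {fanin: delay})
    'n1':  (lambda v: v['a'] & v['b'], 0, {'a': 2, 'b': 3}),
    'out': (lambda v: v['n1'] | v['c'], 1, {'n1': 4, 'c': 1}),
}

def arrival_times(vec):
    """Arrival time of every gate for one input vector (inputs arrive at time 0)."""
    val, at = dict(vec), {pi: 0 for pi in vec}
    for g, (fn, ctrl, fanins) in GATES.items():          # topological order
        val[g] = fn(val)
        ctrl_terms = [at[f] + d for f, d in fanins.items() if val[f] == ctrl]
        all_terms  = [at[f] + d for f, d in fanins.items()]
        at[g] = min(ctrl_terms) if ctrl_terms else max(all_terms)
    return at

for a, b, c in product([0, 1], repeat=3):
    print((a, b, c), arrival_times({'a': a, 'b': b, 'c': c})['out'])
```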
3 Telescopic Units

3.1 The Idea

To increase the average throughput of a given stage, S_j (shown on the left of Figure 2), of a pipelined design, we can obviously shorten its cycle time from the original value, T, to T' < T. One possible way of achieving this goal is to add to the combinational block of S_j an output signal, f_h (called the hold output), which takes the value 1 whenever an input vector requires more than T' time units to propagate to the outputs of the block (see the right-hand side of Figure 2).
Figure 2: Transforming a Pipeline Stage into a Telescopic Unit.

If the application of a generic vector x to the inputs of S_j is such that f_h = 0, then the outputs of stage S_j are available at time T', at the latest. On the other hand, if f_h = 1, more than T' time units are required for stage S_j to complete the intended computation; an additional clock cycle is then needed before the correct data are available. We call the pipeline stage modified as shown in Figure 2 a telescopic unit. The name comes from the fact that the new stage has a variable latency, T' or 2T', depending on the specific pattern appearing at the inputs of the stage. Obviously, for throughput improvements, it is mandatory that the probability of signal f_h taking on the value 1 be very low. In fact, the average throughput, P', of the telescopic unit is given by the following formula:

$P' = \frac{Prob(f_h)}{2T'} + \frac{1 - Prob(f_h)}{T'}$   (1)

where Prob(f_h) is the probability of the hold signal being one. Since the average throughput of the original stage is:

$P = \frac{1}{T}$   (2)

the use of the telescopic unit is advantageous only for some specific values of T' and Prob(f_h), i.e., when P' > P. Substituting Equations 1 and 2 into the inequality, we obtain the following condition for throughput improvement:

$Prob(f_h) < \frac{2(T - T')}{T}$   (3)

Notice that, since calculating the hold value requires some additional circuitry, the realization of a telescopic unit usually comes at the price of a certain area overhead, which may or may not be affordable, depending on the specific application. Observe also that Equation 3 is valid only for T' ≥ T/2. Even though, in principle, the expression for P' can be modified to account for values of T' < T/2, in that case the circuitry needed to support the telescopic unit would become more complex, since the combinational logic may need, for some input patterns, more than two cycles to complete its computation. In this work we do not consider telescopic units with latency longer than two cycles.
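The trade-off expressed by Equations 1-3 is easy to check numerically. The sketch below evaluates the average throughput of a telescopic unit and the improvement condition; the numbers used in the example are purely illustrative.

```python
def telescopic_throughput(T, T_new, p_hold):
    """Average throughput of a telescopic unit clocked at T_new (Equation 1)."""
    return p_hold / (2 * T_new) + (1 - p_hold) / T_new

def is_improvement(T, T_new, p_hold):
    """Equation 3: the transformation pays off iff Prob(fh) < 2*(T - T')/T."""
    assert T / 2 <= T_new <= T, "single extra cycle assumed (T/2 <= T' <= T)"
    return p_hold < 2 * (T - T_new) / T

# Hypothetical numbers: original cycle time 12, telescopic cycle time 7,
# hold probability 0.19 -> throughput rises from 1/12 ~ 0.083 to ~0.13.
T, T_new, p = 12.0, 7.0, 0.19
print(telescopic_throughput(T, T_new, p), 1 / T, is_improvement(T, T_new, p))
```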
3.2 Synthesis of the Hold Logic
The computation of the arrival time ADD for a combinational block allows us to determine all input vectors that activate the critical path. More generally, for a given cycle time, T', it is possible to find all input conditions for which the propagation through the logic block is slower than T'. We can exploit this information to synthesize the logic which generates the proper values of the hold output f_h. Given the arrival time ADD of output O_i, AT(g_{O_i}, x), the BDD of the function f_h^{O_i}, which assumes the value 1 for all the input vectors for which the arrival time of O_i is greater than the desired cycle time T', is given by:

$f_h^{O_i}(x) = THRESHOLD(AT(g_{O_i}, x), T')$   (4)
Since we are interested in the set of input conditions for which at least one block output O_i has an arrival time greater than T', f_h can easily be determined as:

$f_h(x) = \sum_{i=1}^{m} THRESHOLD(AT(g_{O_i}, x), T')$   (5)
where m is the total number of block outputs. Clearly, the key issue in making telescopic units usable in practice is the way the BDD of f_h is synthesized. Well-known algorithms, e.g., [4], can be used for this purpose. However, there are three main constraints, listed below in decreasing order of importance, that the final implementation of f_h must satisfy, and which therefore require particular consideration during synthesis:

- The arrival time of output f_h must be strictly smaller than T' for any possible input pattern. Otherwise, the telescopic unit cannot be guaranteed to work correctly.
- The probability of f_h assuming the value 1 must be small enough to guarantee a substantial throughput improvement, that is, P' ≫ P.
- The area of the logic implementation of f_h must be kept under control.

Satisfying the timing constraint for f_h is a much easier task than improving the performance of the original logic by the same amount. This is because, compared to the logic in the original circuit, function f_h has a much less constrained behavior, the only requirement on its implementation being that its ON-set includes all input conditions which violate the timing constraint T'. Enlarging the ON-set of f_h may also be beneficial from the point of view of the area of the final implementation of the hold logic. On the other hand, including additional input conditions in f_h (which implies increasing its probability of assuming the value 1) may introduce some degradation in the average throughput of the telescopic unit, because there will then exist conditions (i.e., input patterns) for which f_h = 1 even though no output of the original circuit has an arrival time greater than T'. However, the correctness of the operations performed by the unit is still guaranteed. Finding the cheapest logic implementation of f_h clearly depends on the timing behavior of the original circuit and on the effectiveness of the synthesis algorithms. For balanced circuits, where the arrival times are not strongly dependent on the input values, Prob(f_h) is large, and little improvement in throughput is to be expected. On the other hand, for very unbalanced circuits, for which the critical path delay is much longer than the average delay, the detection logic may be very complex. It is the responsibility of the synthesis algorithm to generate a simple (i.e., fast and small) hold logic with Prob(f_h) as small as possible.
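For small blocks, the exact ON-set of f_h defined by Equations 4 and 5 can be computed by brute force from per-vector arrival times, which also yields Prob(f_h) under uniformly distributed inputs. The arrival-time table and helper names below are invented for illustration; the actual tool performs this computation symbolically on ADDs.

```python
from itertools import product

# Brute-force construction of the exact hold function of Equations 4-5:
# fh(x) = 1 iff some output's arrival time reaches the target cycle time T'
# (the >= comparison matches the semantics of THRESHOLD).
# The arrival-time table below is a made-up example for a 2-input block.
AT_TABLE = {
    'o1': {(0, 0): 3, (0, 1): 6, (1, 0): 6, (1, 1): 9},
    'o2': {(0, 0): 2, (0, 1): 4, (1, 0): 8, (1, 1): 5},
}

def hold_minterms(at_table, t_prime, n_inputs=2):
    """Return the ON-set of fh: all vectors slower than t_prime on some output."""
    return {x for x in product([0, 1], repeat=n_inputs)
            if any(at_table[o][x] >= t_prime for o in at_table)}

on_set = hold_minterms(AT_TABLE, t_prime=7)
print(on_set)                             # {(1, 0), (1, 1)}
print(len(on_set) / 4)                    # Prob(fh) = 0.5 under uniform inputs
```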
3.2.1 BDD-Based Heuristics
It has been pointed out earlier that the most important constraint on the implementation of the hold logic is on its critical path. For the telescopic unit to operate correctly, the output value of the hold function must always be computed in a time shorter than the specified T'. Once this requirement is satisfied, the implementation can be optimized for secondary cost measures, such as area or power dissipation.

The function f_h, computed as in Equation 5, is guaranteed to have the minimum number of minterms in its ON-set. In fact, it includes exactly the input vectors that must drive the hold signal to 1 because they would cause timing violations in the original logic. More formally, $f_h(x_0) = 1 \iff \exists\, O_i \mid AT(g_{O_i}, x_0) \ge T'$. The BDD of f_h is the starting point of the optimization procedure, which finds a new hold function f_h^e ≥ f_h whose implementation satisfies the timing constraint:

$THRESHOLD(AT(f_h^e, x), T') = 0$   (6)

The procedure starts from the conservative assumption that the hold logic will be generated by simply mapping the BDD of f_h^e onto a network of multiplexors. This straightforward implementation can be obtained from the BDD of f_h^e in O(N_{f_h^e}) time [4], where N_{f_h^e} is the number of nodes in the BDD after variable reordering has been applied to reduce its size. Each BDD node is mapped to a multiplexor, with the select input connected to the variable corresponding to the node and the two data inputs connected to the then and else nodes. The network obtained by direct BDD mapping is obviously highly unoptimized; therefore, its performance can be considerably improved by standard logic optimization algorithms.

Under the assumption of a multiplexor-based implementation of the hold logic, the longest path in the BDD gives us an estimate of the critical path of the network. Clearly, this is only a first-order estimate, since it neglects two factors: first, the output load on a multiplexor; second, the load on the input variables that drive the select inputs of the multiplexors. If the BDD is very "wide" in the lower levels (i.e., there are many nodes labeled with variables which are at the bottom of the global order), the speed of the mux-based network could be limited by the excessive load on the select inputs of the multiplexors. Similarly, if a node in the BDD is shared by many subtrees, the fan-out of the corresponding multiplexor is large, and its speed decreases. However, buffering can mitigate the problem and reduce the delay penalty in both situations. Our approach is to focus first on the number of levels of logic in the multiplexor network.

In the following, we describe in detail the algorithm for the constrained generation of f_h^e, which consists of two steps. First, the BDD of f_h is traversed and levelized: Each node is marked with its level, defined as the length of the longest path between the node and the root of the BDD. Second, the constraint on the maximum number of levels is enforced. Let d_mux(D_avg, C_avg) be the delay of a multiplexor with a fan-out load of C_avg and an input drive D_avg, where C_avg and D_avg are two constants representing the expected average load on a multiplexor and the expected driving strength on its inputs. The maximum number of levels of logic allowed in the multiplexor network is given by:

$L_{max} = \lfloor K_t \, T' / d_{mux}(D_{avg}, C_{avg}) \rfloor$   (7)

where K_t is a scaling constant that factors in the expected effect of logic synthesis and optimization on the multiplexor network (K_t < 1 produces conservative results).
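A possible sketch of the levelization step and of Equation 7, on the same toy tuple encoding used in the earlier sketches, is given below; the values chosen for K_t and for the multiplexor delay are arbitrary placeholders.

```python
from math import floor

# Levelization sketch for Equation 7: LEVEL(node) is the length of the longest
# path from the root; nodes deeper than Lmax are candidates for supersetting.
# BDD nodes are toy tuples (var, else_child, then_child); 0/1 are terminals.

def levelize(root):
    """Longest-path-from-root level of every internal node (toy encoding)."""
    level = {}
    def visit(node, depth):
        if not isinstance(node, tuple):
            return
        level[node] = max(level.get(node, 0), depth)
        visit(node[1], depth + 1)
        visit(node[2], depth + 1)
    visit(root, 0)
    return level

def l_max(t_prime, d_mux, k_t=0.8):
    """Lmax = floor(Kt * T' / dmux(Davg, Cavg)); k_t and d_mux are placeholders."""
    return floor(k_t * t_prime / d_mux)

fh = ('a', ('b', 0, ('c', 0, 1)), ('b', 0, 1))
print(levelize(fh))                  # root at level 0, deepest node at level 2
print(l_max(t_prime=7, d_mux=2.0))   # 2 levels of multiplexors allowed
```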
Starting from the nodes marked with the highest levels, the BDD is traversed, and all nodes for which LEVEL > L_max are eliminated. The elimination of a node consists of its replacement with the constant 1. The meaning of this operation is straightforward: f_h is transformed into f_h^e ≥ f_h, that is, the ON-set of f_h^e is enlarged. Notice that node elimination reduces the number of paths in the BDD with length larger than L_max; in particular, the elimination of a single node may cause a length reduction for an exponential number of paths. We call supersetting the operator which eliminates a given node from a BDD, since it is the dual of the subsetting transformation proposed by Ravi and Somenzi in [5] in the context of reachability analysis of large finite state machines. Although conceptually simple, the implementation of the supersetting operator requires particular care, since canonicity and reduction of the BDD must be preserved when a node is eliminated. In fact, the elimination of a node by simple replacement of all the pointers to it with a pointer to the constant 1 cannot be performed directly, since the change in the structure of the BDD may lead to a loss of canonicity or generate an unreduced binary decision diagram. We have implemented the supersetting operator using a procedure which is reminiscent of the basic BDD operators. The simplified pseudo-code of the algorithm is shown in Figure 3.
    procedure Supersetting(f, prunef) {
        if (f == zero or f == one) return(f);
        if (f == prunef) return(one);
        r = CacheLookup(f, prunef);
        if (r != NULL) return(r);
        rthen = Supersetting(f.THEN, prunef);
        relse = Supersetting(f.ELSE, prunef);
        r = GetFromUniqueTable(rthen, relse);
        CacheInsert(f, prunef, r);
        return(r);
    }

Figure 3: The Supersetting Operator.

The procedure receives as inputs the BDD to be reduced, f, and the BDD node to be eliminated, prunef. First, the terminal cases are examined: If f is the constant 0 or 1, f is returned as is; if f is equal to prunef, the return value is the constant 1. Next, the cache of previously computed results is examined: If the result of the operation was previously computed, it is fetched from the cache and returned; otherwise, the operator is called recursively on the THEN and ELSE children of node f. Before returning, the result node is allocated in the unique table and its THEN and ELSE pointers are set to the return values of the recursive calls. Finally, the result node is cached. The only important difference between the supersetting operator and a standard BDD operator is its unary behavior: The second parameter of the function call, namely prunef, is passed unchanged through all the recursive calls, and no recursive splitting is performed on it. It is important to notice that supersetting eliminates at least one node in the BDD. In general, the reduction in the number of nodes can be much larger, because the disappearance of a node may increase sharing; in addition, if the node is not at the bottom of the BDD, its elimination may trigger the removal of many other nodes in the subtree of which it is the root. Notice that the procedure of Figure 3 only eliminates one BDD node at a time. In practice, we use a generalized version of the routine that eliminates all nodes in a list.

For the sake of simplicity, we do not discuss the generalized procedure in detail. However, it is important to stress that the complexity of the generalized procedure is the same as that of the simplified single-node version we have described, namely O(N_f) (with the customary assumption of perfect caching). If f_h and the list of all the nodes with LEVEL > L_max are passed to the generalized supersetting operator, the result is the BDD of f_h^e, a new hold function with two important properties: f_h^e ≥ f_h and $\max_{Node \in f_h^e} LEVEL(Node) \le L_{max}$. Although the elimination of nodes labeled with high LEVEL guarantees the satisfaction of the constraint on L_max, the delay of the multiplexor network may still violate the timing constraint because of nodes with excessive fan-out or excessively loaded input signals. Supersetting is exploited again to eliminate such violations. Simple heuristics for marking nodes that would generate heavily loaded multiplexors, or for reducing the load on the inputs, have been devised; for space reasons, we do not describe them in detail. In addition, extensive experimentation has shown that after the BDD of f_h has been modified so as to meet the constraint on L_max, the optimized logic network implementing the hold function almost always satisfies the timing constraint T'. Hence, the heuristics for load control have a very marginal impact on the quality of the results.
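The following is a runnable transcription of the operator of Figure 3 on the toy tuple encoding used in the earlier sketches; the memo dictionary plays the role of the computed-table cache, and rebuilding reduced tuples plays the role of the unique-table lookup. It is our sketch, not the authors' CUDD-based implementation; a generalized version would simply test membership of f in a set of nodes to be pruned.

```python
# Supersetting on toy tuples (var, else_child, then_child); 0 and 1 are terminals.

def supersetting(f, prunef, memo=None):
    memo = {} if memo is None else memo
    if f == 0 or f == 1:                  # terminal cases
        return f
    if f == prunef:                       # eliminated node becomes the constant 1
        return 1
    key = id(f)
    if key in memo:                       # cache of previously computed results
        return memo[key]
    rthen = supersetting(f[2], prunef, memo)
    relse = supersetting(f[1], prunef, memo)
    r = rthen if rthen == relse else (f[0], relse, rthen)   # keep the BDD reduced
    memo[key] = r
    return r

fh = ('a', ('b', 0, ('c', 0, 1)), ('c', 0, 1))
deep = ('c', 0, 1)                        # node deeper than the allowed Lmax
print(supersetting(fh, deep))             # ('a', ('b', 0, 1), 1)
```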
3.2.2 Synthesis Algorithm
Figure 4 outlines the pseudo-code of the BDD-based algorithm for the generation and synthesis of f_h^e.

    procedure Bdd2Logic(fh, T', A') {
        fhe = VariableReordering(fh);
        while (1) {
            Cfhe = Bdd2Network(fhe);
            Cfhe = LogicOptimization(Cfhe);
            Cfhe = TechMapping(Cfhe);
            AT_Cfhe = AddTimingAnalysis(Cfhe);
            if (MaxVal(AT_Cfhe) < T' and Area(Cfhe) <= A') return(Cfhe);
            Lmax = FindLMax(fhe);
            LEVEL[] = Levelize(fhe);
            NodeListLev = MarkNodesLev(fhe, LEVEL[], Lmax);
            fhe = Supersetting(fhe, NodeListLev);
            NodeListLoad = MarkNodesLoad(fhe, MuxLoad);
            fhe = Supersetting(fhe, NodeListLoad);
        }
    }
Figure 4: The Bdd2Logic Algorithm.

Procedure Bdd2Logic takes as inputs the original function f_h, the desired cycle time, T', and the area bound, A', for the implementation of the hold logic, and it returns the single-output logic circuit C_{f_h^e} implementing an f_h^e which satisfies the required constraints. The size (i.e., the number of nodes) of the BDD for f_h is first reduced through variable reordering, and the resulting BDD is synthesized as a mux-based logic network and subsequently optimized and mapped. Finally, the arrival time ADD, AT_{C_{f_h^e}}, of the hold circuitry is computed. If both the timing and the area constraints are met by the implementation C_{f_h^e} of f_h^e, this implementation is returned. Otherwise, a modification of f_h^e is required, following the supersetting paradigm discussed in Section 3.2.1. Therefore, the maximum number of allowed levels in the BDD, L_max, is computed, the BDD for f_h^e is levelized, the nodes with LEVEL > L_max are marked and stored in a list, NodeListLev, and procedure Supersetting is run on f_h^e. Similarly, nodes which may be responsible for timing violations due to excessive load are stored in a list, NodeListLoad, and eliminated through a new call to procedure Supersetting. At this point, the whole sequence of operations starts over. Notice that procedure Bdd2Logic is guaranteed to terminate because supersetting eliminates at least one node in the BDD at each iteration. In all the cases we examined, one iteration was sufficient to find an implementation of f_h^e that satisfied the timing constraint T'.
4 Experimental Results

We have implemented procedure Bdd2Logic and the surrounding software as an extension of SIS [6], using CUDD [7] as the underlying BDD/ADD package. Experiments have been run on a DECstation 5000/240 with 64 MB of memory. We present two sets of data: The first concerns the use of telescopic units as a pure throughput optimization technique; the second shows the applicability of telescopic units to area optimization under throughput constraints.

4.1 Throughput Optimization

We have considered all the large (more than 100 gates) benchmarks in the Mcnc'91 combinational multi-level suite [8], that is, a total of 53 examples. The circuits have first been optimized for speed using a simplified version of the script.delay SIS script, and then mapped for speed with load constraints using the map -n1 -AFG command onto a cell library containing inverters, buffers, and two-input NAND and NOR gates. The unit gate delay model has been adopted for the ADD-based timing analysis. We have run our tool on the delay-optimized circuits, trying to obtain maximum-throughput telescopic units. To accomplish this task, we have specified several decreasing values for T', and we have synthesized the hold logic until we found a value for which a further cycle time reduction caused a decrease in throughput (due to the high probability of the hold function). For 39 examples the use of telescopic units has produced a substantial throughput improvement. On the other hand, in 4 cases (circuits i3, i4, i6, and i7) the throughput did not increase. The reason for the failure lies in the delay distribution of these circuits; for example, all outputs of i6 have the same delay (6 time units), so if we specify T' = 5 and extract f_h, we obtain Prob(f_h) = 1. Finally, in 10 cases the ADD-based timing analysis did not complete, due to the size of either the circuit functional BDDs or the arrival time ADDs to be constructed; thus, our tool could not proceed to the generation of f_h.

Table 1 reports the data for the 39 examples on which throughput optimization has succeeded. Benchmarks are sorted by increasing size. Columns Circuit, In, Out, Gt, T and P give the name, the number of inputs, outputs, and gates, the true delay, and the throughput of the original circuit. Column Prob(f_h^e) shows the probability of f_h^e, column Gt' gives the total number of gates of the telescopic unit, column T' reports the cycle time at which the telescopic unit is clocked to achieve the increased throughput of column P', and column T(f_h^e) gives the arrival time of the hold signal. Columns ΔP and ΔGt give the percentage throughput improvement and the area overhead (in terms of gates) of the telescopic unit. Finally, column Time reports the CPU time, in seconds, required to perform the ADD-based timing analysis, as well as the synthesis and the optimization of f_h^e for a given T'.

In all cases we have obtained a noticeable throughput increase (26.9% on average) with a limited area overhead (7.9% on average). It is important to observe that the speed optimization of the initial circuits has been pushed all the way to the limit; therefore, the throughput increase achieved on each example can be entirely attributed to the use of telescopic units. Notice also that in some cases the optimization could have been even more aggressive; however, a lower limit of T' = T/2 has been imposed because, as discussed in Section 3, having T' < T/2 implies that the hold signal must stay at the value 1 for more than one clock cycle in order for the logic to compute the correct result.

4.2 Area Optimization

For this set of experiments, the initial circuits have been optimized for area by iteratively applying the script.rugged SIS script followed by the rr (redundancy removal) command, and mapped for area using the map -m0 command onto the usual cell library. Then, a 20% throughput improvement has been targeted using two different approaches: first, by transforming the circuits into telescopic units; second, by optimizing the circuits for delay using SIS. Table 2 reports the experimental data (examples are sorted as in Table 1). In particular, columns Gt (orig.), T (orig.), and P (orig.) report the number of gates, the cycle time, and the average throughput of the original (minimum-area) circuits. Columns Gt (telesc.) and ΔGt (telesc.) give the number of gates and the percentage gate overhead required by the telescopic units to produce a 20% throughput increase. Finally, columns Gt (delay opt.) and ΔGt (delay opt.) show similar data for the delay-optimized circuits. Due to the difficulty of exactly controlling the throughput increase, a 2.5% slack has been allowed. A "--" entry indicates that the desired throughput improvement could not be obtained. This situation occurred in one case only for the telescopic units (circuit C432) and in 17 cases for the delay-optimized circuits. It should be observed that the telescopic units have outperformed the delay-optimized circuits, in terms of gate count, in the majority of the cases (32 examples out of 39). Only on benchmark C432 did both optimizations fail. On average, the area overhead due to the use of telescopic units has been around 16.7%, while for the delay-optimized circuits it has been around 30.1%. These averages are obviously computed only over the examples on which both optimizations succeeded.
5 Conclusions and Future Work
We have presented a technique for the automatic generation of variable-latency, high-performance units that allows us to push the performance limit beyond the levels achievable with traditional synthesis approaches. Thanks to symbolic exact delay computation, we identify the input conditions for which the propagation through the original logic takes longer than the cycle time. We then generate a combinational logic block which communicates to the environment when the correct result is available at the unit register boundaries. Experimental results have demonstrated that the technique is valuable both as a pure performance-enhancing tool and as a throughput-constrained area optimization strategy. We are currently investigating two main directions of improvement. First, we are planning to extend the applicability of our technique to very large circuits for which neither the functional BDDs nor the arrival time ADDs can be constructed. Second, we are analyzing the impact of telescopic units on control generation algorithms and their potential usefulness in high-level synthesis.
| Circuit | In | Out | Gt | T | P | Prob(f_h^e) | Gt' | T' | P' | T(f_h^e) | ΔP | ΔGt | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pcler8 | 27 | 17 | 105 | 12 | 0.08 | 0.187 | 109 | 7 | 0.12 | 3 | 55.4% | 3.8% | 0.9 |
| mux | 21 | 1 | 106 | 14 | 0.07 | 0.050 | 145 | 12 | 0.08 | 9 | 13.7% | 36.8% | 1.8 |
| cordic | 23 | 2 | 126 | 15 | 0.06 | 0.052 | 153 | 11 | 0.08 | 10 | 32.8% | 21.4% | 1.3 |
| frg1 | 28 | 3 | 143 | 15 | 0.06 | 0.375 | 145 | 8 | 0.10 | 2 | 52.3% | 0.1% | 10.0 |
| sct | 19 | 15 | 143 | 8 | 0.12 | 0.046 | 152 | 6 | 0.16 | 4 | 30.2% | 6.3% | 0.9 |
| unreg | 36 | 16 | 147 | 6 | 0.16 | 0.250 | 149 | 3 | 0.29 | 2 | 75.0% | 1.4% | 0.7 |
| b9 | 41 | 21 | 150 | 11 | 0.09 | 0.460 | 161 | 7 | 0.11 | 6 | 21.0% | 7.3% | 1.4 |
| f51m | 8 | 8 | 152 | 11 | 0.09 | 0.109 | 165 | 10 | 0.09 | 7 | 4.0% | 8.5% | 1.1 |
| comp | 32 | 3 | 174 | 21 | 0.04 | 0.003 | 220 | 19 | 0.05 | 9 | 10.3% | 26.4% | 3.8 |
| lal | 26 | 19 | 179 | 10 | 0.10 | 0.001 | 191 | 9 | 0.11 | 5 | 11.0% | 6.7% | 0.9 |
| count | 35 | 16 | 205 | 12 | 0.08 | 0.250 | 207 | 6 | 0.14 | 2 | 75.0% | 0.9% | 1.2 |
| cht | 47 | 36 | 209 | 6 | 0.16 | 0.250 | 211 | 5 | 0.17 | 2 | 5.0% | 0.9% | 0.8 |
| c8 | 28 | 18 | 211 | 9 | 0.11 | 0.109 | 218 | 7 | 0.13 | 3 | 21.6% | 3.3% | 0.9 |
| my_adder | 33 | 17 | 225 | 34 | 0.02 | 0.136 | 286 | 18 | 0.05 | 9 | 76.0% | 27.1% | 10.1 |
| i2 | 201 | 1 | 242 | 12 | 0.08 | 0.195 | 256 | 8 | 0.11 | 7 | 35.4% | 5.8% | 10.5 |
| term1 | 34 | 10 | 242 | 17 | 0.05 | 0.102 | 272 | 11 | 0.08 | 10 | 46.5% | 12.3% | 5.3 |
| 9symml | 9 | 1 | 252 | 14 | 0.07 | 0.037 | 303 | 13 | 0.07 | 9 | 5.7% | 20.2% | 5.7 |
| apex7 | 49 | 37 | 302 | 16 | 0.06 | 0.371 | 323 | 10 | 0.08 | 9 | 30.3% | 6.9% | 3.8 |
| ttt2 | 24 | 21 | 306 | 10 | 0.10 | 0.121 | 322 | 8 | 0.11 | 7 | 17.4% | 5.2% | 5.5 |
| example2 | 85 | 66 | 358 | 13 | 0.07 | 0.081 | 382 | 10 | 0.09 | 9 | 24.7% | 6.7% | 5.1 |
| C432 | 37 | 6 | 404 | 27 | 0.03 | 0.000 | 435 | 26 | 0.03 | 10 | 3.8% | 7.7% | 69.3 |
| i5 | 133 | 66 | 445 | 12 | 0.08 | 0.033 | 476 | 10 | 0.09 | 9 | 18.0% | 6.9% | 10.6 |
| x1 | 51 | 35 | 452 | 15 | 0.06 | 0.084 | 491 | 10 | 0.09 | 8 | 43.7% | 8.6% | 4.8 |
| x4 | 94 | 71 | 498 | 12 | 0.08 | 0.166 | 539 | 10 | 0.09 | 9 | 9.9% | 8.2% | 8.8 |
| too_large | 38 | 3 | 417 | 21 | 0.04 | 0.019 | 462 | 17 | 0.05 | 7 | 22.3% | 10.7% | 36.7 |
| alu2 | 10 | 6 | 783 | 35 | 0.02 | 0.394 | 810 | 19 | 0.04 | 10 | 47.8% | 3.4% | 15.1 |
| i9 | 88 | 63 | 813 | 16 | 0.06 | 0.027 | 828 | 14 | 0.07 | 6 | 12.7% | 1.8% | 27.6 |
| rot | 135 | 107 | 840 | 23 | 0.04 | 0.209 | 937 | 19 | 0.04 | 18 | 8.5% | 11.5% | 582.1 |
| x3 | 135 | 99 | 872 | 14 | 0.07 | 0.021 | 928 | 11 | 0.08 | 10 | 25.9% | 6.4% | 7.6 |
| apex6 | 135 | 99 | 889 | 17 | 0.05 | 0.000 | 905 | 16 | 0.06 | 10 | 6.2% | 1.8% | 6.8 |
| t481 | 16 | 1 | 1043 | 22 | 0.04 | 0.164 | 1151 | 18 | 0.05 | 17 | 12.1% | 10.3% | 45.9 |
| frg2 | 143 | 139 | 1048 | 16 | 0.06 | 0.250 | 1054 | 9 | 0.09 | 2 | 55.5% | 0.5% | 34.0 |
| dalu | 75 | 16 | 1316 | 23 | 0.04 | 0.373 | 1394 | 16 | 0.05 | 6 | 16.9% | 5.9% | 59.6 |
| vda | 17 | 39 | 1416 | 12 | 0.08 | 0.026 | 1447 | 11 | 0.08 | 10 | 7.6% | 2.2% | 10.7 |
| alu4 | 14 | 8 | 1457 | 39 | 0.02 | 0.359 | 1520 | 20 | 0.04 | 17 | 59.9% | 4.3% | 53.3 |
| i8 | 133 | 81 | 1485 | 16 | 0.06 | 0.068 | 1519 | 13 | 0.07 | 11 | 18.8% | 2.2% | 32.1 |
| pair | 173 | 137 | 1956 | 28 | 0.03 | 0.008 | 2027 | 24 | 0.04 | 18 | 16.1% | 3.6% | 68.7 |
| k2 | 45 | 45 | 2393 | 18 | 0.05 | 0.081 | 2571 | 16 | 0.05 | 15 | 7.9% | 7.4% | 60.1 |
| des | 256 | 245 | 5084 | 27 | 0.03 | 0.015 | 5119 | 24 | 0.04 | 7 | 11.6% | 0.6% | 73.2 |
Table 1: Throughput Optimization.
| Circuit | Gt (orig.) | T (orig.) | P (orig.) | Gt (telesc.) | ΔGt (telesc.) | Gt (delay opt.) | ΔGt (delay opt.) |
|---|---|---|---|---|---|---|---|
| pcler8 | 101 | 14 | 0.07 | 109 | 7.9% | -- | -- |
| mux | 69 | 15 | 0.06 | 72 | 4.3% | -- | -- |
| cordic | 77 | 19 | 0.05 | 98 | 27.2% | 126 | 63.6% |
| frg1 | 124 | 21 | 0.04 | 134 | 8.0% | 143 | 15.3% |
| sct | 76 | 18 | 0.06 | 89 | 17.1% | 102 | 34.2% |
| unreg | 118 | 8 | 0.12 | 149 | 26.2% | 147 | 24.5% |
| b9 | 128 | 13 | 0.07 | 142 | 10.9% | -- | -- |
| f51m | 68 | 23 | 0.03 | 117 | 72.0% | 141 | 107.3% |
| comp | 123 | 23 | 0.04 | 146 | 21.1% | -- | -- |
| lal | 105 | 16 | 0.06 | 116 | 10.4% | 120 | 14.2% |
| count | 138 | 21 | 0.04 | 140 | 1.4% | 157 | 13.7% |
| cht | 179 | 8 | 0.12 | 180 | 0.5% | 209 | 16.7% |
| c8 | 141 | 13 | 0.07 | 143 | 1.4% | 156 | 10.6% |
| my_adder | 190 | 37 | 0.02 | 263 | 38.4% | -- | -- |
| i2 | 220 | 14 | 0.07 | 294 | 33.6% | -- | -- |
| term1 | 245 | 21 | 0.04 | 317 | 29.3% | -- | -- |
| 9symml | 192 | 21 | 0.04 | 271 | 41.1% | 240 | 25.0% |
| apex7 | 238 | 19 | 0.05 | 265 | 11.3% | -- | -- |
| ttt2 | 161 | 19 | 0.05 | 186 | 15.5% | 181 | 12.4% |
| example2 | 306 | 16 | 0.06 | 315 | 2.9% | -- | -- |
| C432 | 179 | 30 | 0.03 | -- | -- | -- | -- |
| i5 | 198 | 18 | 0.06 | 246 | 24.2% | 388 | 95.9% |
| x1 | 324 | 16 | 0.06 | 416 | 28.3% | -- | -- |
| x4 | 408 | 17 | 0.05 | 430 | 5.3% | 471 | 29.3% |
| too_large | 364 | 23 | 0.04 | 411 | 12.9% | -- | -- |
| alu2 | 460 | 42 | 0.02 | 498 | 8.2% | -- | -- |
| i9 | 663 | 20 | 0.05 | 672 | 1.3% | 813 | 22.6% |
| rot | 731 | 29 | 0.03 | 854 | 16.8% | 840 | 14.9% |
| x3 | 723 | 19 | 0.05 | 788 | 8.9% | 790 | 9.2% |
| apex6 | 769 | 21 | 0.04 | 905 | 17.6% | 889 | 15.6% |
| t481 | 729 | 27 | 0.03 | 1047 | 43.6% | -- | -- |
| frg2 | 695 | 29 | 0.03 | 720 | 3.5% | 757 | 8.9% |
| dalu | 891 | 40 | 0.02 | 962 | 7.9% | 1003 | 12.6% |
| vda | 570 | 19 | 0.04 | 668 | 17.2% | 1022 | 79.2% |
| alu4 | 699 | 44 | 0.02 | 854 | 22.1% | -- | -- |
| i8 | 1074 | 19 | 0.05 | 1141 | 6.2% | -- | -- |
| pair | 1630 | 41 | 0.02 | 1855 | 13.8% | 1891 | 16.0% |
| k2 | 1087 | 27 | 0.03 | 1431 | 31.6% | 1301 | 19.6% |
| des | 3668 | 31 | 0.03 | 4216 | 14.9% | -- | -- |
Table 2: Area Optimization Under Throughput Constraints.
Acknowledgments
We wish to thank Iris Bahar for helping us with the ADD-based timing analysis code, Fabio Somenzi for useful suggestions on the use of the CUDD package, and Nanni De Micheli for reviewing this manuscript.
References
[1] R. I. Bahar, H. Cho, G. D. Hachtel, E. Macii, F. Somenzi, "Timing Analysis of Combinational Circuits using ADDs," EDTC-94, pp. 625-629, Paris, France, February 1994.
[2] S. Hassoun, C. Ebeling, "Architectural Retiming: Pipelining Latency-Constrained Circuits," DAC-33, pp. 708-713, Las Vegas, NV, June 1996.
[3] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, F. Somenzi, "Algebraic Decision Diagrams and their Applications," ICCAD-93, pp. 188-191, Santa Clara, CA, November 1993.
[4] L. Burgun, N. Dictus, A. Greiner, E. Prado Lopes, C. Sarwary, "Multi-Level Optimization of Very High Complexity Circuits," EuroDAC-94, Grenoble, France, September 1994.
[5] K. Ravi, F. Somenzi, "High-Density Reachability Analysis," ICCAD-95, pp. 154-158, San Jose, CA, November 1995.
[6] E. M. Sentovich, K. J. Singh, C. W. Moon, H. Savoj, R. K. Brayton, A. Sangiovanni-Vincentelli, "Sequential Circuit Design Using Synthesis and Optimization," ICCD-92, pp. 328-333, Cambridge, MA, October 1992.
[7] F. Somenzi, CUDD: University of Colorado Decision Diagram Package, Release 2.1.0, Technical Report, Dept. of ECE, University of Colorado, Boulder, CO, January 1997.
[8] S. Yang, Logic Synthesis and Optimization Benchmarks User Guide, Version 3.0, Technical Report, MCNC: Microelectronics Center of North Carolina, Research Triangle Park, NC, January 1991.