Delay Balancing using Latches - CiteSeerX

Delay Balancing using Latches Chuan-Hua Chang, Edward S. Davidson EECS Department, University of Michigan Ann Arbor, MI 48109-2122 chuanhua,[email protected]

Abstract Delay elements are often added to improve performance of a wave-pipelined circuit by reducing the delay difference of the longest and the shortest paths. Unfortunately, precise delay elements that realize the exact delay needed are difficult to obtain. Instead we use latches for delay balancing, thereby providing more feasible and accurate circuit path delay control under the Min/Max delay model. A heuristic is developed to insert a sufficient number of latches into a combinational circuit to achieve a specified clock cycle time. Experiments on the ISCAS C85 benchmark illustrate this approach and its advantage.

1.0 Introduction

The maximum performance of a wave-pipelined circuit is limited by the delay difference between the longest and the shortest paths in the circuit, the setup and hold times of the storage elements between the pipeline stages, the clock skew, and process variations [1]. One obvious way to improve cycle time is to attack the delay balancing problem, i.e. to reduce the delay difference between the longest and shortest paths. Better balance permits faster clocks and more simultaneously active waves in the circuit.

Several proposed techniques reduce delay variations in a wave-pipelined circuit. Ekroot [2] developed linear programs to insert combinational delay elements where needed to reduce the delay difference. Later, Wong et al. [3][4] used one technique, called rough tuning, which inserts active delay elements in the short paths, and another, called fine tuning, which adjusts parameters that determine gate drive capability to equalize the path delays in a wave-pipelined circuit. Joy and Ciesielski [5][6] presented a balancing technique that intentionally skews the clock phase of each input latch, and implemented it in a placement and routing algorithm. Kim et al. [7] used logic resynthesis approaches to balance the circuit delays at a technology-independent level.

All delay balancing techniques proposed in the literature use delay elements either as padding in the data path of the circuit or to insert skew in some clock distribution lines of the circuit. Unfortunately, it is not easy to design a delay element that will realize the exact delay needed [8]. Furthermore, for a circuit with large delay differences, a series of delay elements may have to be inserted to balance the paths. For these delay chains, it is even harder to achieve the exact delay value desired.

We propose a delay balancing technique that is both easier to implement and more accurate than adding chains of delay elements. Rather than delay chains, we insert latches, using the synchronizing feature of the latches to achieve the desired delay effect. Conceptually, this technique is similar to the problem of optimally partitioning a synchronous circuit and inserting latches to obtain maximum performance. However, the technique proposed here partitions only short paths (Fig. 1). In contrast to Wong’s technique, which balances all path delays perfectly, the proposed approach only balances path delays to the degree needed in order to achieve a target clock frequency.

Using latches to control delay is more accurate than using long chains of delay elements because the clock lines which control latches, in contrast to data signals, have smaller and more easily controllable skews. Even if it is necessary to skew the clock of a latch or to add small paddings to get a feasible balancing solution, this extra delay is reduced when at least some portion of the delay value is implemented by the synchronizing property of the latch. The output delay of the latch also adds to the delay of the short path. Furthermore, the extra area cost of inserting latches may be smaller than the cost of using active delay elements.

[Figure 1 diagram: storage elements (input FFs and output FFs) clocked by clk, connected by a long path and a short path, with latches (L) inserted.]

Fig. 1. Using latches (L) to delay the short paths of a wave-pipelined circuit

2.0 Problem Definition

We consider delay balancing for a combinational circuit of unidirectional, single output gates. Its inputs are driven by flip-flops (FFs) and its outputs drive another set of FFs. A common clock controls these FFs and all inserted latches.

The set of output FFs and each inserted latch has an associated clock phase shift relative to the clock of the input FFs. These phase shifts may be constrained or chosen freely by the insertion algorithm. The common clock is defined by $(T_L, T_H, T_c)$, where $T_L$ is the logic 0 portion, $T_H$ the logic 1 portion, and $T_c = T_L + T_H$ is the target cycle time. We define $mT_c$ to be the time allowed for a signal to pass through the circuit; if $m$ is greater than 1, the circuit is wave-pipelined. The primary input signals are assumed to depart the input FFs at the negative edge of the clock. FF setup and hold times are $S_r$ and $H_r$, respectively.

The given circuit is converted into a directed graph $G(V, E)$ where each gate corresponds to a node in $V$. The input FFs are represented by a single source node, $s$, and the output FFs by a sink node, $k$. The interconnections from gate output to gate input, gate output to output FF input, or input FF output to gate input correspond to directed edges in the graph. In the following discussion, we use the terms circuit and graph interchangeably. The fanin set, $FI(i)$, of a node $i$ is the set of nodes which have directed edges to node $i$. The fanout set, $FO(i)$, of a node $i$ is the set of nodes to which there is a directed edge from node $i$. Note that $FI(s)$ and $FO(k)$ are empty.

Each node $i$ is associated with an earliest and latest departure time $(d_i, D_i)$, which represents the signal departure range at its output. Each edge $(i, j)$, for all $i \in FI(j)$, has an earliest and latest arrival time $(a_{ij}, A_{ij})$, which represents the signal arrival range at the node $j$ input. The earliest and latest departure times for node $s$ are defined to be $(d_s, D_s) = (\delta_r, \Delta_r)$, the minimum and maximum delay of the input flip-flops. For each edge $(i, j)$, the minimum and maximum delay is defined to be $(\delta_{ij}, \Delta_{ij})$. The minimum and maximum delay of an inserted latch is $(\delta_l, \Delta_l)$. The longest and shortest delays from $s$ to $k$ for the given combinational circuit are $\Delta_{\max}$ and $\delta_{\min}$, respectively. These delays can be determined by

$$\Delta_{\max} = \max_{i \in FI(k)} \{ A_{ik} \} \tag{1}$$

$$\delta_{\min} = \min_{i \in FI(k)} \{ a_{ik} \} \tag{2}$$

For all $(i, j) \in E$, define a boolean variable $Z_{ij}$ which represents the presence of an inserted latch on edge $(i, j)$ if $Z_{ij} = 1$, or the absence of an inserted latch if $Z_{ij} = 0$. We do not allow more than one inserted latch on each edge. This restriction is reasonable given the flow-through problem associated with cascaded latches with single-phase or overlapping multi-phase clocks. Edges may be broken to allow more than one latch when nonoverlapping multiphase clocks are used.

The goal of this delay balancing problem is to insert a minimum number of latches to support the specified target cycle time subject to the clock phase and circuit constraints, without data corruption between successive data waves (i.e. no wave collision) in the wave-pipelined circuit. The inserted latches may partition the combinational circuit, especially for small $T_c$. As $T_c$ increases, fewer latches are inserted, primarily on short paths, and more paths become wave-pipelined.

2.1 Objective and Constraints

Formally, this problem can be described as an optimization problem with the following objective function,

$$\text{minimize} \sum_{(i, j) \in E} Z_{ij} \tag{3}$$

The constraints are derived below.

Latch insertion constraints

They are associated with each edge $(i, j) \in E$. They describe whether a latch can be inserted on each edge, and the properties of the inserted latch.

• Clock phase constraints: For all $(i, j) \in E$, define $\zeta_{ij}$. If no latch is inserted in edge $(i, j)$, $\zeta_{ij} = 0$. Otherwise, let $\zeta_{ij}$ represent the non-negative real clock phase shift of the inserted latch clock relative to the negative edge of the clock. Thus,

$$(Z_{ij} - 1) \cdot \zeta_{ij} = 0 \tag{4}$$

If we use only a single-phase clock on all the inserted latches,

$$\zeta_{ij} = 0 \tag{5}$$

If we use two-phase clocks with phase shift 0 and $p$ and the same $T_c$, $T_L$ and $T_H$, then

$$\zeta_{ij} \cdot (\zeta_{ij} - p) = 0 \tag{6}$$

The use of multi-phase clocks with the same $T_c$, $T_L$ and $T_H$ can be characterized in a similar manner. For arbitrary phase clocks,

$$0 \le \zeta_{ij} \le T_c \tag{7}$$

• Hold and Setup constraints: An inserted latch in edge $(i, j)$ with clock phase shift $\zeta_{ij}$ is valid only when both the latest and the earliest departures of the node $i$ output arrive during the same clock period of the latch and satisfy the setup ($S$) and hold ($H$) time requirements. Let $n_{ij}$ be an integer which represents the particular clock period in which the earliest and latest signals arrive. Then

$$n_{ij} \in \mathbb{Z} \tag{8}$$

$$Z_{ij} \cdot ( d_i - \zeta_{ij} - n_{ij} \cdot T_c - H ) \ge 0 \tag{9}$$

$$Z_{ij} \cdot ( ( n_{ij} + 1 ) T_c + \zeta_{ij} - S - D_i ) \ge 0 \tag{10}$$

• Min/Max delays added by inserted latches: We define $\Lambda_{ij}$ and $\lambda_{ij}$ to be the added delay of the latest and earliest signal, respectively, due to latch insertion on edge $(i, j)$. $\Lambda_{ij}$ and $\lambda_{ij}$ will be 0 if there is no latch inserted on edge $(i, j)$. Recall that the latch delay is $(\delta_l, \Delta_l)$. If the latest signal arrival at the inserted latch comes after the enabling edge of the clock, then $\Lambda_{ij} = \Delta_l$. Otherwise $\Lambda_{ij}$ is equal to $\Delta_l$ plus the time difference between the clock enabling (positive) edge and the latest signal arrival. Thus,

$$\Lambda_{ij} = Z_{ij} \cdot \max ( \Delta_l,\; n_{ij} T_c + T_L + \zeta_{ij} - D_i + \Delta_l ) \tag{11}$$

Similarly, the added delay for the earliest signal is

$$\lambda_{ij} = Z_{ij} \cdot \max ( \delta_l,\; n_{ij} T_c + T_L + \zeta_{ij} - d_i + \delta_l ) \tag{12}$$

Signal propagation equations

They are associated with each edge and each gate node and describe the signal propagation and the topology of the circuit. $V_g$ is the subset of the vertices corresponding to all the gate nodes.

$$d_j = \min_{i \in FI(j)} ( a_{ij} + \delta_{ij} ) \quad \forall j \in V_g \tag{13}$$

$$D_j = \max_{i \in FI(j)} ( A_{ij} + \Delta_{ij} ) \quad \forall j \in V_g \tag{14}$$

$$a_{ij} = d_i + \lambda_{ij} \quad \forall (i, j) \in E \tag{15}$$

$$A_{ij} = D_i + \Lambda_{ij} \quad \forall (i, j) \in E \tag{16}$$

$$( d_s, D_s ) = ( \delta_r, \Delta_r ) \tag{17}$$

Internal node (collision free) constraints

They are associated with each gate node. They guarantee no data corruption (wave collision) due to racing signal waves within the circuit when the circuit is wave-pipelined. These constraints may vary according to the specific requirements of different technologies; we list two versions here.

(a) Suppose that for each node $j$, there is a minimum signal stable time, $g_j$, to allow node $j$ to produce the correct output. The signal stable time is the time difference between the latest signal arrival among all node $j$ inputs and the earliest signal arrival of the next wave. Then

$$T_c + \min_{i \in FI(j)} ( a_{ij} ) - \max_{i \in FI(j)} ( A_{ij} ) \ge g_j \quad \forall j \in V_g \tag{18.a}$$

(b) Suppose that for each node $j$ the earliest input signal of the next wave must arrive after node $j$ produces its stable output. Then

$$T_c + \min_{i \in FI(j)} ( a_{ij} ) \ge D_j \quad \forall j \in V_g \tag{18.b}$$

Note that (18.a) and (18.b) are equivalent if

$$\Delta_{ij} = g_j \quad \forall i \in FI(j),\ \forall j \in V_g \tag{19}$$

Output setup and hold constraints

They must be satisfied for the output flip-flops.

$$m T_c - \Delta_{\max} \ge S_r \tag{20}$$

$$\delta_{\min} \ge ( m - 1 ) T_c + H_r \tag{21}$$

Note that the fractional part of $mT_c$ must equal the phase shift of the output FFs.

2.2 The Min/Max Delay Model

The Min/Max delay model associates a range of delay with each circuit component. Each data signal passing through the component may experience a propagation delay anywhere in this range. The Min/Max delay model is an extension of the Unit delay model, which associates only a single delay value with every circuit component. The Min/Max delay model is thus a conservative model that can incorporate all process and environmental variations by appropriately setting these ranges.

Using the Unit delay model, the shortest path and the longest path delay of a circuit will never occur on the same path except when the path delays are all equal. However, using the Min/Max delay model, the shortest and the longest path delays can differ and still both occur on the same path. For balancing the minimum and maximum delays of such a circuit path, latches are preferred over delay elements, since delay elements with Min/Max delays can never reduce the delay difference of such a path. If the Min/Max delay ratio of a delay element is less than 1, the delay difference of such a path always increases with inserted delay elements.
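The contrast drawn above between delay elements and latches can be checked on a single path. The following Python sketch is only an illustration under stated assumptions: it applies the hold and setup conditions (9) and (10) and the added-delay expressions (11) and (12) with $Z_{ij} = 1$, and compares the result with padding by a delay element whose min/max ratio is r. The function names, the sample departure range (2, 4), and the pad size are hypothetical and not taken from the original formulation.

```python
def latch_is_valid(d, D, n, T_c, zeta=0.0, S=0.0, H=0.0):
    """Hold (9) and setup (10) checks for a latch on an edge (Z_ij = 1)."""
    hold_ok = d - zeta - n * T_c - H >= 0
    setup_ok = (n + 1) * T_c + zeta - S - D >= 0
    return hold_ok and setup_ok


def latch_added_delay(departure, latch_delay, n, T_L, T_c, zeta=0.0):
    """Added delay per (11)/(12): wait for the enabling edge, then pass through."""
    enabling_edge = n * T_c + T_L + zeta
    return max(latch_delay, enabling_edge - departure + latch_delay)


def latched_range(d, D, n, T_L, T_c, zeta=0.0, delta_l=0.0, Delta_l=0.0):
    """Arrival range (a, A) after the latch, per (15) and (16)."""
    a = d + latch_added_delay(d, delta_l, n, T_L, T_c, zeta)
    A = D + latch_added_delay(D, Delta_l, n, T_L, T_c, zeta)
    return a, A


def padded_range(d, D, pad_max, r):
    """Range after a delay element with max delay pad_max and min/max ratio r."""
    return d + r * pad_max, D + pad_max


# Hypothetical path with departure range (2, 4); clock (T_L, T_H, T_c) = (4, 2, 6).
d, D = 2.0, 4.0
assert latch_is_valid(d, D, n=0, T_c=6.0)
print(latched_range(d, D, n=0, T_L=4.0, T_c=6.0))  # (4.0, 4.0): spread shrinks to 0
print(padded_range(d, D, pad_max=2.0, r=1.0))      # (4.0, 6.0): spread stays at 2
print(padded_range(d, D, pad_max=2.0, r=0.5))      # (3.0, 6.0): spread grows to 3
```

In this sketch the latch holds the early signal until the enabling edge while passing the late signal unchanged, which is the effect the delay element cannot reproduce on a single path.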

To illustrate, consider Fig. 2. The single-phase clock $(T_L, T_H, T_c) = (4, 2, 6)$ is used for all latches and flip-flops. For simplicity, let all latches and flip-flops have 0 delay and 0 setup and hold times. Let the min/max ratio of the delay element be 1 (the base case for delay elements). Internal node constraint (18.b) and output constraints (20) (21) are used to find the nodes in the circuit that have wave collisions. There is no solution to satisfy all those constraints when delay elements are used. However, a solution can be found with latches since latches can delay the early signals without delaying the late signals.

[Figure 2 diagram: (a) circuit with wave collision, no way to pad the circuit with delay elements; (b) circuit using latches to balance delays. Legend: wave collision node, using (18.b) as internal node constraint; latch; FF and latch clock waveform with time marks 0, 4 and 6. Latch delay = setup = hold = 0. Gate delay: (2,4).]

Fig. 2: Better performance is achieved using latches

3.0 Balancing Heuristics

The problem of optimal delay balancing using latches can be formulated as an integer mathematical programming problem as described in section 2.1. However, the computational complexity is very high, so this formal approach is not suitable for circuits of practical size. Instead, efficient heuristics need to be devised to find sufficiently good solutions for practical problems. The following section presents such a heuristic and discusses some problems encountered.

• Iterative Balance

The main idea of the heuristic, called Iterative Balance, is that several passes are carried out to insert latches into the circuit. The number of latches on different paths in the circuit may vary. For a signal path that requires more than one latch to be inserted, the positions of latches later in the circuit are dependent on the positions of latches placed at earlier levels in the circuit, since signal arrival at a later point is affected by the latches inserted earlier. Thus after a set of latches has been inserted in the circuit, the signal propagation times have to be recalculated before finding the positions for the next set of latches. The work of finding a set of positions before which latches must be inserted, according to the newly recalculated signal propagation times, is termed a Basic Balance. This section discusses a heuristic approach that uses Iterative Balance and solves a Basic Balance problem in each pass.

Note that between the Basic Balance passes it is only necessary to recompute the propagation times of the components located after the inserted latches. This technique is called Incremental Path Calculation. As the levels of the inserted latches increase in successive iterations, the number of recomputed circuit components decreases.

• Basic Balance & the Fanout Problem

Basic Balance is performed by finding the set of wave collision nodes that have no wave collision nodes in the transitive closure of their fanin node set, and then delay balancing each node in the set in sequence. A wave collision node is a node with an internal node constraint (18) or output constraint (20) (21) violated by its fanin signals. Such nodes can be found easily by examining those constraints for every node in the circuit using the computed signal arrivals.

For each wave collision node, the heuristic tries to balance the delays by inserting latches into the transitive closure fanin edges of that node. The current strategy of placing latches is to insert latches no later than, but as close to, the wave collision node as possible. Whenever a latch cannot be inserted into a desired edge, earlier edges will be examined. This strategy is not guaranteed to obtain the minimum solution, but we feel that it has a good probability of producing a good or near-optimal solution. This choice is based on the observation that, due to the increasing number of fanin edges, more latches are generally required to rectify a wave collision when latches are inserted further from the collision node.

However, the number of fanin edges may not increase with distance if there are particular node fanout topologies before the collision node. If there is fanout, we may be able to insert one latch to cover more than one wave collision in the circuit, thus reducing the number of inserted latches. For instance, to fix collisions at g3, g5 and g6 in Fig. 3 using our insertion strategy, 3 latches may be inserted (in positions A). However, it is possible that only one latch (in position B) may suffice to correct all three collisions. On the other hand, although using one latch seems superior to using 3 latches for this Basic Balance pass, it may eventually require inserting more latches in later Basic Balance passes. Our current heuristic does not consider this effect when choosing edges for insertion.
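The selection step of Basic Balance described above (collision nodes with no other collision node in their transitive fanin) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the graph is assumed to be given as a fanin dictionary, the collision set is assumed to have been computed already from constraints (18), (20) and (21), and the names `transitive_fanin` and `basic_balance_targets` and the small example circuit are hypothetical.

```python
def transitive_fanin(node, fanin):
    """All nodes that can reach `node`, found by walking fanin edges backwards."""
    seen = set()
    stack = list(fanin.get(node, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(fanin.get(u, ()))
    return seen


def basic_balance_targets(collision_nodes, fanin):
    """Collision nodes whose transitive fanin contains no other collision node."""
    collision_nodes = set(collision_nodes)
    return {
        v for v in collision_nodes
        if not (transitive_fanin(v, fanin) & collision_nodes)
    }


# Hypothetical circuit: s -> g1 -> g2 -> g3 and g1 -> g4.
fanin = {"g1": ["s"], "g2": ["g1"], "g3": ["g2"], "g4": ["g1"]}
targets = basic_balance_targets({"g2", "g3", "g4"}, fanin)
print(sorted(targets))  # ['g2', 'g4']
```

Here g3 is deferred to a later pass because g2 lies in its transitive fanin; balancing g2 first changes the arrival times seen at g3.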

[Figure 3 diagram: gates g1 through g6 around a fanout point, with candidate latch positions A and B.]

Fig. 3. A fanout structure.

Fanouts also cause other problems for the iterative heuristic. Extra steps must be performed whenever a latch is inserted to ensure a correctly balanced circuit. For example, in Fig. 4, g1 has g2 and g3 in its transitive fanout. Balancing them sequentially may cause conflicts between inserted latches. Suppose, for instance, that g2 is balanced first and a latch A is inserted in the path between g1 and g2. Next, g3 is balanced and no latch can be inserted in the path between g1 and g3 to fix the problem. The heuristic then successfully places latch B. However, inserting latch B alters the arrival time at latch A. To resolve this problem, whenever a latch is inserted, we remove previously inserted latches on its transitive fanout edges. Removed latches may have been inserted in the same Basic Balance pass or in a previous pass.

In general, whenever we balance a collision node, we first check if a latch has already been inserted into the transitive fanin of that node during the same Basic Balance pass. If we find such a latch, we do not consider that collision node in this pass. Also, whenever we insert a latch in an edge, we invalidate all latches residing in the transitive fanout edges of the current insertion edge. Whenever a removed latch was inserted by a previous Basic Balance pass, the current pass is terminated immediately after the removal, since the signal arrival times may have to be recalculated for some of the collision nodes still on the list for this pass. Terminating this pass allows the recalculation to be done before they are balanced.

[Figure 4 diagram: panels (a) and (b) showing g1, g2 and g3 with latches A and B, a wave collision marker, and the labels "Invalid" and "Abort balance".]

Fig. 4. (a) g2 is balanced before g3 and latch A is inserted before B, so latch A has to be invalidated. (b) g3 is balanced before g2 and latch B is inserted first; g2 will then not be balanced this time.

• Constraints for Latch Insertion

Several constraints must be satisfied by each inserted latch. Constraint (8) ensures that the early arrival ($d_i$) and the late arrival ($D_i$) of the signal at the latch are within the same clock cycle. Constraints (9) and (10) ensure that the hold and setup requirements are satisfied. These constraints ensure that the data signal can be sampled into the latch correctly. The fourth constraint is that the delay added by the inserted latch must be large enough to eliminate the wave collision at the current collision node. The use of this constraint helps to put latches in more effective positions, thus generally reducing the number of inserted latches. However, this constraint is strategic, not essential, and may limit the solutions found by the heuristic. We thus implemented the heuristic in two versions: one with the fourth constraint (basic) and a relaxed version without the fourth constraint (basic_r). Basic_r provides a larger feasible solution space since it allows multiple latches to satisfy one collision node, but it may place and remove many latches, requiring more Basic Balance passes.

Currently the heuristic can handle single-phase and random-phase clocks; two-phase is being implemented. The random-phase clock has one waveform, but allows an arbitrary phase shift $\zeta$ for each latch.

• Termination of the Algorithm

During the Basic Balance process, a latch is pushed backwards until the latch insertion constraints are satisfied. A latch may be pushed back to a primary input or to another latch which was inserted in a previous Basic Balance pass. If a previously inserted latch is encountered, it is removed from that edge and pushed further backward along the transitive fanins of that edge. If a primary input is encountered, the heuristic will stop and claim that no solution can be found.

• Post-Processing: Merge Latches at Fanout

In our circuit graph model, each fanout of a gate output is represented by its own edge. After obtaining the latch insertion solution, the heuristic merges all latches on each gate’s fanout edges into one latch.
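The merging step can be pictured as grouping latched edges by their driving gate. The Python sketch below is an assumption-laden illustration: it takes the solution as a list of (driver, sink) edges carrying a latch and assumes that all latches on one driver's fanout edges may share a single latch, which the text states without giving conditions (for example, on clock phase). The function name `merge_fanout_latches` and the example edges are hypothetical.

```python
from collections import defaultdict


def merge_fanout_latches(latched_edges):
    """Group latched edges (driver, sink) so each driver keeps one merged latch."""
    merged = defaultdict(list)
    for driver, sink in latched_edges:
        merged[driver].append(sink)
    return dict(merged)


# Hypothetical solution: g1 drives two latched edges, g4 drives one.
edges = [("g1", "g2"), ("g1", "g3"), ("g4", "g5")]
merged = merge_fanout_latches(edges)

nm = len(edges)   # latch count before merging
m = len(merged)   # latch count after merging
print(nm, m)      # 3 2
```

Under these assumptions, the two counts correspond to the before-merging and after-merging latch numbers reported in the experiments.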

4.0 Experimental Results

To evaluate the basic and basic_r algorithms, we performed some delay balancing experiments on the ISCAS C85 benchmark using Min/Max delays for the gates (Table 1). The circuit statistics are shown in Table 2. The ratios of the shortest path delay to the longest path delay of the circuits range from 0.5% to 33.3%. All setup and hold times for latches and flip-flops are set to 0 at this time. All latch and flip-flop delays are also set to 0. Direct connections from primary inputs to primary outputs in the benchmark circuits, if any, are all deleted, because the heuristic cannot handle them when only one latch is allowed on any edge. The minimum pulse width of a clock is set to 1.

Table 1. Assigned Min/Max gate delays.

Gate     inv, buf    nand, nor    and, or, xor
Delay    (1,3)       (2,4)        (3,5)

Table 2. Circuit parameters. (os, ol: shortest & longest path delays of the original circuit.)

ckt      (os, ol)    os/ol    original Tc    # of edges    # of nodes
c17      (4, 12)     33.3%    12             14            8
c432     (4, 68)     5.8%     68             343           162
c499     (3, 53)     5.7%     53             440           204
c880     (4, 94)     4.2%     94             755           385
c1355    (5, 98)     5.1%     98             1096          548
c1908    (5, 147)    3.4%     147            1523          882
c2670    (1, 135)    0.7%     135            2140          1195
c3540    (4, 183)    2.2%     183            2961          1671
c5315    (1, 192)    0.5%     192            4509          2309

First, we search for the smallest cycle time for which a solution can be found by basic and basic_r using single-phase latches and single-phase flip-flops. In Table 3, the number of inserted latches is listed both before merging fanout latches (nm) and after merging (m). Also shown are the new shortest and longest paths (ns, nl), the number of Basic Balance passes for the final $T_c$ solution (ps), and the number of Basic Balance passes using the basic_r heuristic given the minimum $T_c$ found by the basic heuristic (b_r). The results show that the basic_r heuristic generates faster circuits than the basic heuristic, but takes more passes to finish. The number of waves, and hence the $T_c$ speedup, that can be achieved is about 2 or 3.

Second, in Table 4, we consider random-phase latches and single-phase flip-flops using the same $T_c$ as in Table 3. In Table 4, the ph column lists the total number of clock phases used in the solution. The numbers of latches (nm) in Table 4 before merging fanout latches are all smaller than or equal to nm in Table 3 (except for c1908), as are the numbers (m) after merging (except for c432 and c1355). This reduction is due to the general effectiveness of the heuristic algorithm in exploiting the relaxed insertion constraints of random-phase latches. The current heuristic makes no attempt to control the number of phases used.

Third, we examine the smallest cycle time found by using random-phase latches and single-phase flip-flops (Table 5). The results show that the cycle times ($T_c$) in Table 5 are smaller than or equal to $T_c$ in Table 3. This speedup also results from the relaxed constraints. Faster clocks generally, but not always, require more latches.

To verify that the minimum cycle time achieved by using latches is indeed smaller than that achieved by using delay elements under the Min/Max delay model described in Section 2.2, we follow Ekroot’s delay insertion method [2] and use a linear programming (LP) solver, splex, to solve the problem. In Ekroot’s model, the boundary flip-flops can have any clock phases, so we compare the delay insertion results to our latch insertion heuristic with random-phase flip-flops. There is no limitation on the maximum delay values that a delay element can take. However, the ratio (r) of the minimum to maximum delay is fixed.

We used two types of delay elements, with min/max delay ratios of 1 and 0.5, and compare the results to those using single-phase (singleL) and random-phase (randomL) latches. The results in Table 6 show that the minimum cycle time achieved with delay elements is larger than with latches, except for c499 with min/max delay ratio 1. Note that we do not show the results of c1908-c5315 in Table 6, since the LP solver could not finish the job within a reasonable time frame or ran out of memory for those circuits.
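The speedup figure quoted for Table 3 can be checked with simple arithmetic. The Python sketch below divides each circuit's original cycle time (Table 2) by the smallest $T_c$ reported for basic_r with single-phase latches (Table 3). The numbers are transcribed from those tables on the assumption that "same" in the basic_r columns means the basic result is reused; the dictionary names are hypothetical.

```python
# Original Tc per circuit, transcribed from Table 2.
original_tc = {
    "c17": 12, "c432": 68, "c499": 53, "c880": 94, "c1355": 98,
    "c1908": 147, "c2670": 135, "c3540": 183, "c5315": 192,
}

# Smallest Tc found by basic_r in Table 3 ("same" rows reuse the basic Tc).
basic_r_tc = {
    "c17": 5, "c432": 35, "c499": 28, "c880": 39, "c1355": 47,
    "c1908": 62, "c2670": 68, "c3540": 87, "c5315": 97,
}

speedups = {ckt: original_tc[ckt] / basic_r_tc[ckt] for ckt in original_tc}

for ckt, s in speedups.items():
    print(f"{ckt:6s} {original_tc[ckt]:4d} -> {basic_r_tc[ckt]:3d}  speedup {s:.2f}")
```

With these transcribed values the ratios fall between roughly 1.9 and 2.4.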

Table 3. The smallest Tc found, single-phase flip-flops, single-phase latches. (ns, nl: shortest & longest path delays of the balanced circuit.)

         basic                                                 basic_r
ckt      nm     m      (ns,nl)      (TL,Tc)    ps    b_r       nm     m      (ns,nl)      (TL,Tc)    ps
c17      6      5      (8,12)       (4,6)      4     5         10     8      (10,12)      (3,5)      5
c432     74     58     (35,70)      (34,35)    7     8         same                                  8
c499     128    80     (28,53)      (25,28)    6     6         same                                  6
c880     123    106    (48,94)      (45,48)    9     12        183    146    (78,94)      (38,39)    19
c1355    160    152    (95,99)      (45,47)    8     8         same                                  8
c1908    204    151    (146,149)    (71,73)    17    22        246    180    (124,147)    (60,62)    34
c2670    123    98     (68,135)     (67,68)    8     9         same                                  9
c3540    115    96     (92,183)     (89,92)    15    17        185    158    (174,205)    (86,87)    35
c5315    343    305    (97,192)     (96,97)    20    20        same                                  20

Table 4. The same Tc as Table 3, single-phase flip-flops, random-phase latches

         basic                                          basic_r
ckt      nm     m      (ns,nl)      (TL,Tc)    ph       nm     m      (ns,nl)      (TL,Tc)    ph
c17      6      5      (8,12)       (4,6)      2        10     8      (10,12)      (3,5)      1
c432     65     65     (35,68)      (34,35)    4        same
c499     128    64     (28,53)      (25,28)    2        same
c880     84     77     (48,94)      (45,48)    5        163    141    (78,94)      (38,39)    8
c1355    160    160    (95,99)      (45,47)    4        same
c1908    226    140    (146,147)    (72,73)    10       229    166    (124,147)    (60,62)    11
c2670    79     79     (68,135)     (67,68)    8        same
c3540    113    90     (92,183)     (89,92)    16       144    121    (174,183)    (86,87)    17
c5315    222    205    (97,192)     (96,97)    9        same

Table 5. The smallest Tc found, single-phase flip-flops, random-phase latches

         basic                                                basic_r
ckt      nm     m      (ns,nl)      (TL,Tc)    ph    ps       nm     m      (ns,nl)      (TL,Tc)    ph    ps
c17      6      5      (8,12)       (4,6)      2     3        10     8      (10,12)      (3,5)      1     5
c432     78     73     (34,68)      (28,34)    7     7        105    86     (58,68)      (28,29)    5     11
c499     104    58     (29,53)      (26,27)    3     4        same                                        4
c880     95     88     (47,94)      (43,47)    4     10       163    141    (78,94)      (38,39)    8     17
c1355    160    152    (95,98)      (45,46)    4     6        same                                        6
c1908    226    140    (146,147)    (72,73)    10    11       238    171    (122,147)    (60,61)    14    20
c2670    79     79     (68,135)     (67,68)    8     6        same                                        6
c3540    113    90     (92,183)     (88,92)    17    10       178    160    (152,183)    (74,76)    21    17
c5315    227    210    (96,192)     (95,96)    10    14       same                                        14

5.0 Conclusion and Future Plan

We have formulated the delay balancing problem as a non-linear integer program. To solve the problem quickly, we have devised a heuristic to insert latches into the circuit. The number of inserted latches is controlled reasonably well, although not optimally, by the heuristic insertion strategy.

Experimental results for the benchmark circuits show that a 2-3 times performance speedup can be achieved with latches. However, for some of the solutions to work, the duty cycle of the specified clock has to be small, since a smaller duty cycle increases the added short path delay. Examples and experimental results show that better performance can be obtained by using latches rather than delay elements for delay balancing.

Table 6. Comparisons between padding with delay elements and padding with latches

         delay element               latch
ckt      min Tc       min Tc         min Tc       min Tc
         r=0.5        r=1            singleL      randomL
c17      8            8              (3,5)        (3,5)
c432     38           36             (28,29)      (28,29)
c499     29           25             (25,28)      (26,27)
c880     51           49             (38,39)      (38,39)
c1355    NA           49             (45,47)      (45,46)

References

[1] L. W. Cotten, “Maximum-rate pipeline systems,” in Proc. AFIPS Spring Joint Computer Conf., pp. 581-586, 1969.
[2] B. C. Ekroot, “Optimization of pipelined processors by insertion of combinational logic delay,” Ph.D. Dissertation, EE. Dept., Stanford U., Sep. 1987.
[3] D. C. Wong, G. D. Micheli, M. J. Flynn, “Designing High-Performance Digital Circuits Using Wave Pipelining: Algorithms and Practical Experiences,” TCAD, pp. 25-46, January 1993.
[4] D. Wong, G. D. Micheli, M. Flynn, and R. Huston, “A bipolar population counter using wave pipelining to achieve 2.5x normal clock frequency,” IEEE J. Solid-State Circuits, vol. 27, pp. 745-753, May 1992.
[5] D. A. Joy, M. J. Ciesielski, “Clock period minimization with wave pipelining,” IEEE Trans. TCAD, vol. 12, no. 4, pp. 461-472, April 1993.
[6] D. A. Joy, M. J. Ciesielski, “Placement for clock period minimization with multiple wave propagation,” in Proc. 28th Design Automation Conf., San Francisco, CA, pp. 640-643, June 1991.
[7] T. S. Kim, W. Burleson, and M. Ciesielski, “Logic restructuring for wave-pipelined circuits,” in Workshop Notes, International Workshop on Logic Synthesis, May 23-26, 1993.
[8] S. H. Unger, C. J. Tan, “Clocking schemes for high-speed digital systems,” IEEE Trans. on Computers, vol. C-35, no. 10, pp. 880-895, Oct. 1986.
[9] C. T. Gray, W. Liu, R. K. Cavin, “Timing Constraints for Wave Pipelined Systems,” Technical Report NCSU-VLSI-92-06, December 1992.
[10] L. W. Cotten, “Circuit implementation of high-speed pipeline systems,” in AFIPS Fall Joint Computer Conf., pp. 489-504, 1965.
[11] T. G. Hallin, M. J. Flynn, “Pipelining of arithmetic functions,” IEEE Transactions on Computers, pp. 880-886, Aug. 1972.
[12] B. K. Fawcett, Maximal Clocking Rates for Pipelined Digital Systems, Master thesis, EE. Dept., Univ. of Illinois, Urbana-Champaign, 1975.
[13] S. R. Kunkel, J. E. Smith, “Optimal pipelining in supercomputers,” in Proc. 13th Annual Symposium on Computer Architecture, pp. 404-411, 1986.
[14] F. Klass and J. M. Mulder, “CMOS implementation of wave pipelining,” Delft Univ. of Technology, Delft, The Netherlands, Tech. Rep. 1-68340-44(1990)02, Dec. 1990.
[15] C. T. Gray, T. Hughes, D. Fan, G. Moyer, W. Liu, R. Cavin, “A High Speed CMOS FIFO Using Wave Pipelining,” Technical Report NCSU-VLSI-91-01, January 1991.
[16] W. Lien, W. Burleson, “Wave-domino logic: Timing analysis and applications,” in Proc. TAU 1992: ACM/SIGDA Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, March 1992.
[17] N. V. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Minimum padding to satisfy short path constraints,” in Workshop Notes, International Workshop on Logic Synthesis, May 23-26, 1993.