
2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip

Early Planning for RT-Level Delay Insertion during Clock Skew-Aware Register Binding

Keisuke Inoue, Mineo Kaneko
School of Information Science, Japan Advanced Institute of Science and Technology (JAIST)
1-1 Asahidai, Nomi, Ishikawa 923-1292, JAPAN
Email: [email protected]

Abstract-In today's complex VLSI systems, clock skew scheduling is one of the key approaches to improving circuit performance and reliability. Delay insertion has been discussed at the logic level as a method to reduce the clock period. This paper extends the idea to high-level synthesis (HLS) and introduces a new HLS task, namely the minimum-path delay assignment. Since register binding strongly influences the effect of the minimum-path delay assignment, this paper formulates the problem of simultaneously optimizing register binding and the minimum-path delay assignment. An MILP-based approach is presented and evaluated experimentally; the results show that the approach reduces the clock period by 14.1% on average compared to conventional clock skew-aware design.

I. INTRODUCTION

The clock signal is distributed to each memory element via a clock routing structure such as a tree or a grid [1]. The clock skew is defined as the difference in clock arrival times between memory elements. In the past, clock skew was considered to have a negative impact on circuit performance, and minimizing it was one of the essential goals of clock design [2]. Today, applying intentional clock skew (clock skew scheduling) is becoming a common technique for clock period reduction [3], reliability enhancement [4], and peak current reduction [5]. More recently, Obata et al. [6] first introduced the concept of clock skew scheduling into high-level synthesis (HLS). This led to several related HLS works on clock period minimization [7], [8] and power minimization [9].

In a design with intentional clock skew, the minimum clock period is determined not only by the maximum-path delay between adjacent memory elements (max-delay for short), but also by the minimum-path delay between adjacent memory elements (min-delay for short), because of a timing constraint called the hold constraint. As one approach to relaxing a timing bottleneck caused by the hold constraint, delay insertion (also referred to as delay padding) has been used at the logic level [10], [11]; it fixes hold violations by inserting extra delays on min-delay paths without increasing max-delay.

Huang et al. [7] defined a novel optimization problem, namely the minimum-period register binding problem, in which, given a scheduled data-flow graph (DFG), the goal is to find a register binding solution and a clock skew schedule that minimize the clock period. The motivation behind this problem is the fact that the clock arrival time for data assigned

978-1-4577-0170-2/11/$26.00 ©2011 IEEE


to the same register is forced to be the same, so different register binding solutions can lead to different minimum clock periods. However, they did not consider delay insertion. In other words, the optimal clock period in their problem may be sub-optimal if delay insertion is allowed.

In this paper, we show that early planning of min-delay increases in HLS of nonzero clock skew circuits is more effective in reducing the minimum clock period than logic-level approaches. In register-transfer-level circuits (a part of the outputs of HLS), the max/min-delays are mainly determined by the intermediate functional unit (FU) between adjacent registers (i.e., arrays of flip-flops). For the purpose of formal discussion, we introduce a new HLS task, namely the minimum-path delay assignment (the min-delay assignment for short), which is a mapping from FUs to real-valued numbers (min-delays).

Fig. 1 illustrates an implementation of the min-delay assignment. In HLS, an input or the output of an FU consists of multiple bits. Suppose that FU f in (a) consists of three bits ($f_1$, $f_2$, $f_3$), and each bit has delay information (given as a probability density function (PDF) in the case of statistical design). One possible way to determine max-delay D(f) and min-delay d(f) is to use the latest and earliest delays, as shown in (b). The min-delay assignment tells designers to use (or design) an FU in such a way that the designer-specified d'(f) is achieved without increasing D(f), as shown in (c).

The min-delay assignment incurs costs for the inserted delay buffers. However, the accurate costs depend strongly on the technology. In this paper, we assume that the min-delay assignment is implemented by a commercial logic synthesis tool, Design Compiler™ [12]. Table I shows the area and

Fig. 1. Illustration of an implementation of the min-delay assignment. (a) 3-bit FU f with three outputs $f_1$, $f_2$, and $f_3$. (b) Initial PDFs. (c) PDFs after applying the min-delay assignment.

power costs for the min-delay assignment computed by Design Compiler, where the first column lists the FUs f, the second column D(f) gives the max-delay of the FU, the third column d(f) gives the original min-delay (without any specification) with its area and power costs, and the fourth column d'(f) gives the designer-specified min-delay with its area and power costs. In this paper, we formulate the clock skew-aware register binding problem under a user-given upper bound on the costs for the min-delay assignment.

The contributions of the paper are as follows: (1) It presents a novel HLS of nonzero clock skew circuits with the min-delay assignment (Section III). Given an operation scheduling result and design constraints, the objective is to find optimal results of register binding, the min-delay assignment, and clock skew scheduling that minimize the clock period; (2) It proposes a mixed integer linear programming (MILP) formulation of the problem (Section IV); (3) It tests the proposed MILP using some benchmark circuits (Section V).

The rest of the paper is organized as follows. Preliminaries are given in Section II. Section VI concludes the paper and presents some future work.

II. PRELIMINARIES

We assume that an input algorithm is in the form of a DFG, where $o_i$ denotes an operation and $a_i$ denotes the output data of $o_i$.

A. Register Binding

Given the result of operation scheduling, the lifetime of a data is the time interval between the birth and death clock cycles (CCs) of the data, where the former is the CC at which the data is generated and the latter is the latest CC at which the data is referenced as an input to another operation. Register binding is the process of assigning data to registers so that two data are not assigned to the same register if they have overlapping lifetimes [13]. Fig. 2(a) shows an example scheduled DFG with two add operations $o_1$ and $o_2$ scheduled at CC1. For the sake of unified treatment, we assume that the primary input data $a_3$ is the output data of a dummy operation $o_3$ scheduled at CC0. Suppose that we are given two registers $R_1$ and $R_2$. Data $a_1$ and $a_3$ can be assigned to $R_1$ (described as $R_1 = \{a_1, a_3\}$), and $R_2 = \{a_2\}$, as shown in Fig. 2(b).

Fig. 2. (a) An example scheduled DFG with two operations and one dummy operation representing a primary input data. (b) A register binding solution where a rectangle represents the lifetime of a data.

TABLE I
AREA AND POWER COSTS FOR MIN-DELAY ASSIGNMENT

FU f         D(f) [ns]   d(f) [ns] / area* / power [μW]   d'(f) [ns] / area* / power [μW]
8-bit ADD    17.8        0.85 / 76 / 12.6                 2.0 / 87 / 16.9
                                                          3.0 / 107 / 25.4
                                                          7.0 / 203 / 60.3
                                                          13.0 / 402 / 118.4
16-bit ADD   38.3        0.85 / 163 / 29.5                4.0 / 238 / 65.9
                                                          8.0 / 395 / 130.6
                                                          16.0 / 817 / 280.2
                                                          30.0 / 1796 / 679.7
8-bit MUL    26.4        0.95 / 618 / 183.7               3.0 / 631 / 189.5
                                                          5.0 / 666 / 203.3
                                                          10.0 / 828 / 290.4
                                                          22.0 / 2427 / 754.6
16-bit MUL   60.53       1.20 / 2679 / 1459.4             4.0 / 2708 / 1480.7
                                                          8.0 / 2859 / 1640.4
                                                          16.0 / 3334 / 2136.9
                                                          50.5 / 12523 / 6164.8
* Base area (area 1): 2-input NAND

B. Huang's Datapath Timing Model [7]

A datapath from register $R_{j_1}$ to register $R_{j_2}$ is the combinational logic from $R_{j_1}$ to $R_{j_2}$, including multiplexers, an FU, and the wires connecting them. Given the results of operation scheduling, FU binding, and register binding, there exists a datapath for each arc in the DFG. We use a commonly used timing model proposed by Huang et al. [7], in which there are two timing constraints between adjacent registers, namely the setup constraint and the hold constraint:

$$ T(R_{j_1}) - T(R_{j_2}) \le P - \max(R_{j_1}, R_{j_2}) - T_{setup}, \qquad (1) $$
$$ T(R_{j_2}) - T(R_{j_1}) \le \min(R_{j_1}, R_{j_2}) - T_{hold}, \qquad (2) $$

respectively, where $T(\cdot)$ is the clock arrival time of a register, $P$ is the clock period, $\max/\min(R_{j_1}, R_{j_2})$ is the max/min-delay of the datapath from $R_{j_1}$ to $R_{j_2}$, and $T_{setup}$ ($T_{hold}$) is a real-valued constant including the setup (hold) time of a register, the clock-to-Q delay, and a timing margin. For the sake of simplicity, we assume that $\max/\min(R_{j_1}, R_{j_2})$ can be estimated by the max/min-delay of the intermediate FU, respectively. Considering the timing closure problem, the possible range of the clock arrival time should be restricted for each register. Therefore, we introduce a constraint on the possible range of the clock arrival time for each $R_j$:

$$ T_{min} \le T(R_j) \le T_{max}, \qquad (3) $$

where $T_{min}$ ($T_{max}$) is the minimum (maximum) clock arrival time of a register, given as a user-specified value.

C. Clock Skew Scheduling

Clock skew scheduling is the process of assigning a real-valued number to each $T(\cdot)$ while meeting (1)-(3). A clock skew schedule is a set of the clock arrival times of registers. A graph-theoretic approach is often used to solve clock

skew scheduling. The set of constraints (1)-(3) is a system of difference constraints (i.e., a set of inequalities of the form $T(R_{j_1}) - T(R_{j_2}) \le \alpha_{j_1 j_2}$). We can model such a system as a directed graph, namely the constraint graph $G_{cg}(V, E)$, where $V$ includes the vertices corresponding to registers and one special vertex $u_h$ referred to as the host, and $E$ includes the arcs corresponding to the timing constraints, each associated with a weight. In this paper, we use the constraint graph model proposed by Ni et al. [8] because it makes register binding easy to handle. In this model, a vertex is a data, and an arc represents a data dependence or register sharing. In the following, we do not distinguish between a data and a vertex in the constraint graph. Formally, $E$ is defined as follows:

• The setup constraint (1) and the hold constraint (2): For each arc $(o_{i_1}, o_{i_2})$ in the DFG, we add arcs $(a_{i_2}, a_{i_1})$ and $(a_{i_1}, a_{i_2})$ to $G_{cg}$, with weights equal to the right-hand sides of (1) and (2), respectively. The former arc is referred to as an S-arc and the latter as an H-arc.
• The possible range of clock arrival time (3): For each vertex $a_i$, we add two arcs $(u_h, a_i)$ and $(a_i, u_h)$ to $G_{cg}$, with weights $-T_{min}$ and $T_{max}$, respectively. This type of arc is referred to as a C-arc.
• Register sharing: In terms of timing, the register sharing of data $a_{i_1}$ and $a_{i_2}$ means that the registers to which these data are assigned have the same clock arrival time. This can be represented by adding two arcs $(a_{i_1}, a_{i_2})$ and $(a_{i_2}, a_{i_1})$ with weight 0. This type of arc is referred to as an R-arc.

From the well-known result for a system of difference constraints and the result by Ni et al. [8], we can easily derive the following lemma:

Lemma 1: There exists a feasible clock skew schedule meeting (1)-(3) if and only if there exists no negative-weighted cycle in the constraint graph $G_{cg}$.

If there is no negative cycle in $G_{cg}$, the shortest path distance of each vertex $a_i$ from the host $u_h$ represents a valid clock skew schedule of the register to which data $a_i$ is assigned.

Example 1: Fig. 3 shows the constraint graph $G_{cg}(V, E)$ for the example shown in Fig. 2. Suppose that $D(f_1) = 0.6$, $d(f_1) = 0.1$, $D(f_2) = 1.0$, $d(f_2) = 0.2$, $T_{setup} = 0.1$, $T_{hold} = 0.1$, $T_{min} = -1.0$, and $T_{max} = 1.0$. Since there are three data $a_1$, $a_2$, and $a_3$, $V$ includes $a_1$, $a_2$, $a_3$, and the host $u_h$. For arc $(o_3, o_1)$ in the DFG, $E$ includes S-arc $(a_1, a_3)$ and H-arc $(a_3, a_1)$ with weights $P - 0.7$ and $0.0$, respectively. For arc $(o_3, o_2)$ in the DFG, $E$ includes S-arc $(a_2, a_3)$ and H-arc $(a_3, a_2)$ with weights $P - 1.1$ and $0.1$, respectively. For every vertex $a_i \in V \setminus \{u_h\}$, $E$ includes C-arcs $(a_i, u_h)$ and $(u_h, a_i)$ with weight 1.0. In the register binding solution in Fig. 2(b), $a_1$ and $a_3$ are assigned to the same register, so $E$ includes R-arcs $(a_1, a_3)$ and $(a_3, a_1)$ with weight 0. From the weight of the cycle consisting of S-arc $(a_2, a_3)$ and H-arc $(a_3, a_2)$, which is $(P - 1.1) + 0.1 = P - 1.0$, and Lemma 1, $P \ge 1.0$ is required for the existence of a feasible clock skew schedule. If $P = 1.0$, there is no negative-weighted cycle in $G_{cg}$. Therefore, $P = 1.0$ is the minimum, and by solving the shortest path distance problem, a feasible clock skew schedule is $T(R_1) = 0.9$ and $T(R_2) = 1.0$. Fig. 4(a) shows the timing chart for the case in which $T(R_1)$ is set to 0 (note that clock skew is a relative amount).

Fig. 3. The constraint graph $G_{cg}$ for the register binding solution in Fig. 2(b) with the minimum P of 1.0. A dashed line represents an H-arc.
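As a minimal sketch of Lemma 1 applied to Example 1, the following Python snippet builds the constraint graph of Fig. 3 and runs a Bellman-Ford pass from the host. The function names `example1_arcs` and `skew_schedule` are illustrative and do not come from the paper. The snippet assumes that an arc (u, v) with weight w encodes T(v) - T(u) <= w, and it uses exact fractions so that the zero-weighted cycle at P = 1.0 is not misclassified by floating-point rounding.

```python
from fractions import Fraction as Fr


def example1_arcs(P):
    """Arcs (u, v, w) of the constraint graph in Fig. 3 for clock period P.

    Assumed convention: an arc (u, v, w) encodes T(v) - T(u) <= w.
    """
    arcs = [
        ("a1", "a3", P - Fr("0.7")),  # S-arc for DFG arc (o3, o1)
        ("a3", "a1", Fr("0.0")),      # H-arc for DFG arc (o3, o1)
        ("a2", "a3", P - Fr("1.1")),  # S-arc for DFG arc (o3, o2)
        ("a3", "a2", Fr("0.1")),      # H-arc for DFG arc (o3, o2)
        ("a1", "a3", Fr(0)),          # R-arcs: a1 and a3 share R1
        ("a3", "a1", Fr(0)),
    ]
    for a in ("a1", "a2", "a3"):      # C-arcs, with -Tmin = Tmax = 1.0
        arcs += [("uh", a, Fr(1)), (a, "uh", Fr(1))]
    return arcs


def skew_schedule(arcs, host="uh"):
    """Return shortest-path distances from the host, or None on a negative cycle."""
    nodes = {n for u, v, _ in arcs for n in (u, v)}
    dist = {host: Fr(0)}
    for _ in range(len(nodes) - 1):
        for u, v, w in arcs:
            if u in dist and (v not in dist or dist[u] + w < dist[v]):
                dist[v] = dist[u] + w
    for u, v, w in arcs:
        if u in dist and (v not in dist or dist[u] + w < dist[v]):
            return None  # an arc can still be relaxed: negative cycle
    return dist


schedule = skew_schedule(example1_arcs(Fr("1.0")))
print({k: float(t) for k, t in sorted(schedule.items())})
print(skew_schedule(example1_arcs(Fr("0.9"))))
```

With P = 1.0 the distances of a1 and a3 (both bound to R1) are 0.9 and the distance of a2 (bound to R2) is 1.0, which agrees with the schedule stated in Example 1; with P = 0.9 the cycle formed by S-arc (a2, a3) and H-arc (a3, a2) has weight -0.1, so the function returns `None`.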

III. HLS OF NONZERO CLOCK SKEW CIRCUITS

In this section, we introduce a novel HLS task, namely the min-delay assignment, to reduce the clock period, and formulate our optimization problem.

A. Motivation of Increasing Min-Delay

In the constraint graph, the weight of every cycle C is of the form

$$ \alpha P - \sum_{f \in F} \beta_f D(f) + \sum_{f \in F} \beta_f d(f) + A, \qquad (4) $$

where $\alpha$ is the number of S-arcs included in C, $F$ is the set of FUs included in C, $\beta_f$ is the number of occurrences of FU f in C, and $A$ is a constant comprising the $T_{max/min}$ terms, the $T_{setup/hold}$ terms, and a constant related to the C-arcs. From Lemma 1, it can be easily seen that if the minimum clock period is achieved, there is a cycle that includes an S-arc and has weight 0 (referred to as a zero-weighted cycle). Therefore, if we can increase the weight of zero-weighted cycles, we can expect further improvement of P. From (4), if a zero-weighted cycle includes an H-arc, it can be effective to increase the min-delay of the FU related to the H-arc.

Let us consider again the constraint graph shown in Fig. 3. As described in Example 1, the minimum P is 1.0 for this case, and there is a zero-weighted cycle C consisting of S-arc $(a_2, a_3)$ and H-arc $(a_3, a_2)$. Since C includes an H-arc, and $f_2$ is the FU related to the H-arc, if we can increase $d(f_2)$, for example, from 0.2 to 0.5, the weight of C is increased by 0.3. Fig. 5(a) shows the constraint graph after increasing $d(f_2)$. As a result, we can reduce P further until some cycle again becomes zero-weighted. In this case, P can be reduced to 0.7. Fig. 4(b) shows the corresponding timing chart.

Fig. 4. Clock period reduction by relaxing the hold constraint using the min-delay assignment. (a) The setup constraint and the hold constraint for arc $(o_3, o_2)$ in Fig. 2(a) with the minimum P of 1.0. (b) Clock period reduction from 1.0 to 0.7 by applying the min-delay assignment.

B. The Min-Delay Assignment

In logic-level design, methods of increasing min-delay have been discussed in terms of how to increase the delay, where to insert delay elements, and so on [10], [11]. In contrast, we treat such a method in HLS as a task of assigning a min-delay to each FU, referred to as the minimum-path delay assignment (the min-delay assignment for short). The result of the min-delay assignment is information to be implemented in logic-level and/or physical-level design.

The min-delay assignment incurs area and power costs, as shown in Table I. If there is an upper bound on the costs for the min-delay assignment, we cannot fully increase the min-delay of all FUs. In this case, the choice of which FUs receive an increased min-delay becomes important. For example, let us consider the case in which, due to the cost limitation, we can increase the min-delay of exactly one of $f_1$ and $f_2$ by 0.3. If we choose $f_2$, the minimum P is 0.7, as described in the previous section. On the other hand, if we choose $f_1$, the minimum P remains 1.0 because of the timing bottleneck determined by the weight of the cycle consisting of S-arc $(a_2, a_3)$ and H-arc $(a_3, a_2)$.

Fig. 5. The constraint graphs of different register binding solutions for Fig. 2(a) after applying the min-delay assignment (i.e., increasing $d(f_2)$ from 0.2 to 0.5). (a) For the register binding solution ($R_1 = \{a_1, a_3\}$, $R_2 = \{a_2\}$). (b) For another register binding solution ($R_1 = \{a_2, a_3\}$, $R_2 = \{a_1\}$), where the min-delay assignment is not effective in reducing P.

C. Impact of Register Binding

We point out that register binding has a crucial impact on the effect of the min-delay assignment, since the min-delay assignment can change only the weights of H-arcs. In other words, it does not help to reduce P if a bottleneck cycle does not include any H-arc. Let us consider another register binding solution, $R_1 = \{a_2, a_3\}$ and $R_2 = \{a_1\}$. Fig. 5(b) shows the corresponding $G_{cg}$, which includes a cycle C' consisting of S-arc $(a_2, a_3)$ and R-arc $(a_3, a_2)$. From the weight of C' and Lemma 1, $P \ge 1.1$. In this case, the min-delay assignment cannot break the lower bound on P imposed by C', since there is no H-arc in C'. This observation shows that minimizing P requires optimizing the min-delay assignment and register binding simultaneously.
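As a worked check of the three cases discussed in Sections III-B and III-C, the bottleneck cycle weights can be written out using only the values of Example 1 (S-arc weight $P - 1.1$ for $(a_2, a_3)$ and $T_{hold} = 0.1$). This is a sketch rather than a full analysis: cycles through the host are omitted because, for these values, their C-arc weights keep them positive.

```latex
\begin{align*}
&R_1=\{a_1,a_3\},\ d(f_2)=0.5: && (P-1.1) + (0.5-0.1) = P-0.7 \ge 0 && \Rightarrow\ P \ge 0.7,\\
&R_1=\{a_1,a_3\},\ d(f_1)=0.4: && (P-1.1) + (0.2-0.1) = P-1.0 \ge 0 && \Rightarrow\ P \ge 1.0,\\
&R_1=\{a_2,a_3\},\ \text{any } d(f): && (P-1.1) + 0 = P-1.1 \ge 0 && \Rightarrow\ P \ge 1.1.
\end{align*}
```

In the first two lines the cycle is S-arc $(a_2, a_3)$ plus H-arc $(a_3, a_2)$; in the third line it is S-arc $(a_2, a_3)$ plus R-arc $(a_3, a_2)$, whose weight does not depend on any min-delay.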

D. Problem Formulation

From the perspective of practical design, we cannot choose arbitrary values for the min-delay of an FU. Therefore, we assume that for each FU type, a set of min-delays with their area and/or power costs is given, and the min-delay assignment is the process of choosing one of them. As a first attempt at HLS of nonzero clock skew circuits with the min-delay assignment, we consider the problem of simultaneously optimizing register binding, the min-delay assignment, and clock skew scheduling under design constraints. For the sake of simplicity, we treat the cost for the min-delay assignment as a real-valued constant, which represents area cost, power cost, or a weighted sum of area and power. Formally, we tackle the following problem:

The Min-Delay Assignment Problem:

Input: A DFG $G = (O, A)$, a set of FUs $F$, a set of available min-delays for each FU type with their costs, a set of registers $R$, an operation scheduling result, an FU binding result, the lower/upper bounds $T_{min}$/$T_{max}$ on clock arrival times, and the upper bound $C_{max}$ on the cost for the min-delay assignment.
Output: A register binding result, a min-delay assignment result, a clock skew schedule, and the clock period P.
Constraint: The setup constraint (1) and the hold constraint (2) are met for each arc in G. The possible range of clock arrival time (3) is met for each register. The cost for the min-delay assignment is no larger than $C_{max}$.
Minimize: The clock period P.

IV. MILP APPROACH

We formulate our optimization problem as an MILP. The objective function of our MILP is the clock period P. One advantage of using an MILP is that we can find an optimal solution using efficient solvers.

A. Terms Definition

The variables used in our MILP are as follows:

• $x_{i,j}$: A binary variable for each combination of data $a_i$ and register $R_j$. If $a_i$ is assigned to $R_j$, then $x_{i,j} = 1$; otherwise, $x_{i,j} = 0$.
• $y_{k,f,m}$: A binary variable for each combination of the kth FU of type f and the mth min-delay of the FU. If the mth min-delay is chosen for the kth FU of type f, then $y_{k,f,m} = 1$; otherwise, $y_{k,f,m} = 0$.
• $T_i^{(D)}$: A real-valued variable representing the clock arrival time of the register to which the output data of $o_i$ (i.e., $a_i$) is assigned.
• $T_j^{(R)}$: A real-valued variable representing the clock arrival time of register $R_j$.

The constants used in our MILP are as follows:

• $T_{max}$ and $T_{min}$: Real-valued constants representing the maximum and the minimum value of $T_j^{(R)}$, respectively.
• $D_f$: A real-valued constant representing the max-delay of an FU of type f.
• $d_{f,m}$: A real-valued constant representing the mth min-delay of an FU of type f.
• $f_{max}$: An integer-valued constant representing the maximum number of available FU types.
• $m_{max}$: An integer-valued constant representing the maximum number of available min-delays.
• $C_{f,m}$: A real-valued constant representing the cost for the mth min-delay of an FU of type f.
• $C_{max}$: A real-valued constant representing the upper bound of the total area cost.
• $T_{setup}$ and $T_{hold}$: Real-valued constants representing the setup and hold times of a register.

Since the total cost of the min-delay assignment must be no larger than $C_{max}$, we have the following constraint:

$$ \sum_{1 \le f \le f_{max}} \ \sum_{k \in F_f} \ \sum_{1 \le m \le m_{max}} C_{f,m} \, y_{k,f,m} \le C_{max}, \qquad (13) $$

where $F_f$ is the set of available FUs of type f.

C. Complexity Analysis

The following summarizes the complexity of our MILP:

• The numbers of variables $x_{i,j}$, $y_{k,f,m}$, $T_i^{(D)}$, and $T_j^{(R)}$ are $|V| \cdot |R|$, $|F| \cdot m_{max}$, $|V|$, and $|R|$, respectively, where $V$ is the set of data.
• The numbers of formulas in (5), (6), (7), (8), (9), (10), (11), (12), and (13) are $|V|$, $O(|V|^2 \cdot |R|)$, $|F|$, $|A|$, $|A|$, $|V| \cdot |R|$, $|V| \cdot |R|$, $|R|$, and 1, respectively.
• Since $|V|$ is $O(|O|)$ and $|A|$ is $O(|O|^2)$, the total number of constraints can be estimated as $\max\{O(|O|^2 \cdot |R|), |F|\}$.
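As a small illustration of the cost constraint (13), the sketch below sums the cost of the chosen min-delay option over all FUs and compares it with the bound. It assumes, purely for illustration, that the cost of an option is the area value listed in Table I; the paper allows area, power, or a weighted sum. The names `AREA_COST`, `total_cost`, and `meets_cost_bound`, and the three-FU design, are hypothetical.

```python
# Assumption: the cost C[f][m] is the area value from Table I for each
# candidate min-delay (in ns); other cost models are equally possible.
AREA_COST = {
    "8-bit ADD": {0.85: 76, 2.0: 87, 3.0: 107, 7.0: 203, 13.0: 402},
    "8-bit MUL": {0.95: 618, 3.0: 631, 5.0: 666, 10.0: 828, 22.0: 2427},
}


def total_cost(assignment, cost=AREA_COST):
    """Sum C[f][m] over all FUs; assignment lists (FU type, chosen min-delay)."""
    return sum(cost[fu_type][min_delay] for fu_type, min_delay in assignment)


def meets_cost_bound(assignment, c_max, cost=AREA_COST):
    """Check the cost constraint: the total cost must not exceed c_max."""
    return total_cost(assignment, cost) <= c_max


# Hypothetical design with two adders and one multiplier.
design = [("8-bit ADD", 2.0), ("8-bit ADD", 3.0), ("8-bit MUL", 5.0)]
print(total_cost(design), meets_cost_bound(design, 900))
```

In the MILP itself the same sum is expressed through the binary variables: exactly one option per FU is selected, so only the cost of the chosen option contributes.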

V.

B. Formulas

Since each data must be assigned to one register, we have the following constraint for each data $a_i$:

$$ \sum_{R_j \in R} x_{i,j} = 1. \qquad (5) $$

If two data $a_{i_1}$ and $a_{i_2}$ have overlapping lifetimes, they cannot share the same register. Therefore, we have the following constraint for each register $R_j$:

$$ x_{i_1,j} + x_{i_2,j} \le 1. \qquad (6) $$

Since each FU has exactly one min-delay, we have the following constraint for the kth FU of each type f:

$$ \sum_{1 \le m \le m_{max}} y_{k,f,m} = 1. \qquad (7) $$
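Constraints (5)-(7) can be checked directly on a candidate solution. The following is a minimal sketch, not the paper's implementation: `overlaps` and `satisfies_5_to_7` are hypothetical helper names, lifetimes are assumed to be half-open intervals of clock cycles, and the death cycles of a1 and a2 are assumed values loosely modeled on Fig. 2.

```python
from itertools import combinations


def overlaps(l1, l2):
    """Lifetimes are (birth CC, death CC), assumed half-open: (0, 1) and (1, 2) are disjoint."""
    return l1[0] < l2[1] and l2[0] < l1[1]


def satisfies_5_to_7(x, y, lifetimes, registers):
    """Check (5)-(7) for 0/1 dicts x[(data, register)] and y[(FU, option index)]."""
    # (5): every data is assigned to exactly one register.
    for a in lifetimes:
        if sum(x.get((a, r), 0) for r in registers) != 1:
            return False
    # (6): data with overlapping lifetimes never share a register.
    for a, b in combinations(lifetimes, 2):
        if overlaps(lifetimes[a], lifetimes[b]):
            if any(x.get((a, r), 0) + x.get((b, r), 0) > 1 for r in registers):
                return False
    # (7): every FU appearing in y has exactly one min-delay option selected.
    fus = {fu for fu, _ in y}
    return all(sum(v for (fu, _), v in y.items() if fu == k) == 1 for k in fus)


# Assumed lifetimes: a3 lives in CC0, a1 and a2 live in CC1.
lifetimes = {"a1": (1, 2), "a2": (1, 2), "a3": (0, 1)}
y = {("f1", 0): 1, ("f1", 1): 0, ("f2", 0): 0, ("f2", 1): 1}
good = {("a1", "R1"): 1, ("a3", "R1"): 1, ("a2", "R2"): 1}
bad = {("a1", "R1"): 1, ("a2", "R1"): 1, ("a3", "R2"): 1}
print(satisfies_5_to_7(good, y, lifetimes, ["R1", "R2"]))
print(satisfies_5_to_7(bad, y, lifetimes, ["R1", "R2"]))
```

The binding `good` corresponds to $R_1 = \{a_1, a_3\}$, $R_2 = \{a_2\}$ and passes; `bad` places a1 and a2, whose assumed lifetimes overlap, in the same register and violates (6).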

For each arc $(o_{i_1}, o_{i_2})$ in the DFG, the setup and hold constraints can be written as follows:

$T_{i_1}^{(D)} - T_{i_2}^{(D)}$
$T_{i_2}^{(D)} - T_{i_1}^{(D)}$