
A Performance-Driven Macro-Block Placer For Architectural Evaluation of ASIC Designs

Yutaka Mori, Vasily G. Moshnyaga, Hidetoshi Onodera and Keikichi Tamaru
Department of Electronics, Kyoto University
Sakyo-ku, Yoshida-Honmachi, Kyoto 606-01 JAPAN

Abstract

This paper presents a tool for generating a performance-driven placement from a netlist of Register-Transfer Level (RTL) blocks. Based on a modified force-directed algorithm, the tool chooses the locations and orientations of the blocks so as to produce a compact placement with minimum wiring delay along the critical path. Experiments show that our tool (1) provides solutions close to those generated manually, (2) is fast enough to be used in the inner loop of a program that synthesizes RTL structures from behavioral specifications and (3) ensures the strong links between RTL synthesis and timing-driven layout that are necessary for the design of sub-micron ASICs.


1 Introduction

1.1 Motivation of the work

Because geometry has a significant impact on the cost and performance of an Application Specific Integrated Circuit (ASIC) design, any program that explores design tradeoffs, whether interactively or automatically, needs some way of estimating layout. This is true even at the Register-Transfer Level (RTL), where the basic architecture of the design is chosen. It has already been recognized that evaluating an RTL design by counting the number of functional units (ALUs, adders, multipliers, etc.) and the number of clock cycles is no longer adequate. A more accurate layout evaluation is necessary.

Recently, several tools capable of estimating the area cost of a given RTL datapath have been developed [1], [2], [3]. Most of these tools support the standard-cell design style and are not applicable to macro-block designs. Besides, the main objective of the algorithms employed in these tools is layout area minimization. As ASIC fabrication technology enters the deep submicron region (below 0.5 µm), however, such an objective contradicts the real layout generation process, whose primary goal is the minimization of wiring delays because of their increasing impact on chip performance. We claim that RTL evaluation of a layout to be produced in a deep submicron technology is not reliable if performance-driven placement is not considered. For intelligent RTL synthesis of deep-submicron ASICs, evaluators of a performance-driven layout are needed.

During the last decade many performance-driven placement tools have been proposed [4]-[12]. These tools are able to produce good quality placements that satisfy the given timing requirements of a design. In practice, however, they are very time consuming to run and hence cannot be used in the inner loop of an RTL synthesizer that searches through many design alternatives. To date there are no fast estimators of performance-driven layouts.

1.2 Contribution

This paper presents a program that constructs an approximate performance-driven macro-block placement for a given set of modules and their interconnections. The program runs fast enough to be used in RTL synthesis [13], where it is called many times during a design run to evaluate candidate designs. The placement produced is quite compact and comparable to that generated by manual physical design, so it can be relied on in choosing between design alternatives. This is important because it allows the RTL synthesizer to properly consider both module and interconnect delays when it makes design decisions concerning deep sub-micron ASICs. This is one of the first attempts to develop performance-driven layout evaluators of RTL structures, and experiments show that it makes a big difference in the estimation of design tradeoffs.

2 The Program

Our Performance-Driven Placer (PDP) takes as input:

(i) an RTL network $G_N(M, N)$ whose node set $M = \{m\}$ represents the RTL blocks to be placed (e.g. ALUs, multipliers, registers, etc.) and whose net set $N = \{n\}$ represents the connections between the blocks.

(ii) a scheduled data-flow graph $G(O, V, E)$ whose two vertex sets $O = \{o\}$ and $V = \{v\}$ represent the operations executed and the values stored in the network, respectively; the directed edge set $E = \{e\}$ represents the dependencies. We suppose that scheduling and allocation of functional units, registers and multiplexors are done and that the following mappings are known: $O \to M_{fu}$, $V \to M_{rg}$, $E \to M_{mx}$ and $O \to T$, where $M_{fu}$, $M_{rg}$ and $M_{mx}$ are the sets of functional units, registers and multiplexors, respectively, in the network, and $T$ is the set of control steps in the schedule.

(iii) characteristics of each block $m \in M$: width ($W_m$), height ($H_m$), delay ($D_m$).

(iv) a placement region with given positions of external I/O pads.

(v) a clock-cycle time, $T_c$.
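The five inputs listed above can be pictured as one record. The following is a minimal sketch only: the class names, field names, the `kind` tags and the `problems` check are all hypothetical and do not come from the PDP itself. It shows how the network, the bindings, the region and the clock-cycle time might be held together and sanity-checked before placement.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Block:
    """One RTL block m in M with the characteristics of input (iii)."""
    name: str
    kind: str      # "fu", "rg" or "mx" (hypothetical tags for M_fu, M_rg, M_mx)
    width: float   # W_m
    height: float  # H_m
    delay: float   # D_m


@dataclass
class PlacerInput:
    blocks: Dict[str, Block]                 # node set M
    nets: List[Tuple[str, str]]              # net set N (two-terminal nets)
    op_to_fu: Dict[str, str]                 # mapping O -> M_fu
    value_to_rg: Dict[str, str]              # mapping V -> M_rg
    edge_to_mx: Dict[Tuple[str, str], str]   # mapping E -> M_mx
    op_to_step: Dict[str, int]               # mapping O -> T
    region: Tuple[float, float]              # placement region (width, height)
    pads: Dict[str, Tuple[float, float]]     # external I/O pad positions
    clock_cycle: float                       # T_c

    def problems(self) -> List[str]:
        """Return a list of consistency problems (empty if none were found)."""
        found = []
        endpoints = set(self.blocks) | set(self.pads)
        for a, b in self.nets:
            for end in (a, b):
                if end not in endpoints:
                    found.append(f"net ({a}, {b}) references unknown terminal {end}")
        bindings = ((self.op_to_fu, "fu"), (self.value_to_rg, "rg"), (self.edge_to_mx, "mx"))
        for mapping, kind in bindings:
            for key, target in mapping.items():
                block = self.blocks.get(target)
                if block is None or block.kind != kind:
                    found.append(f"{key!r} is not bound to a block of kind {kind}")
        for op in self.op_to_fu:
            if op not in self.op_to_step:
                found.append(f"operation {op} has no control step")
        return found


# Illustrative data only: one adder fed by two registers.
example = PlacerInput(
    blocks={
        "r1": Block("r1", "rg", 40.0, 20.0, 1.0),
        "r2": Block("r2", "rg", 40.0, 20.0, 1.0),
        "add": Block("add", "fu", 80.0, 60.0, 5.0),
    },
    nets=[("r1", "add"), ("r2", "add"), ("add", "r1")],
    op_to_fu={"o1": "add"},
    value_to_rg={"a": "r1", "b": "r2", "c": "r1"},
    edge_to_mx={},
    op_to_step={"o1": 1},
    region=(500.0, 500.0),
    pads={},
    clock_cycle=20.0,
)
print(example.problems())  # an empty list means the bindings are consistent
```

Keeping the four mappings as separate dictionaries mirrors the way they are listed in input (ii); a real tool could equally store the binding on each data-flow node.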

The PDP outputs a two-dimensional placement of the network in the region, without overlap between any pair of blocks, such that the wiring delay along the critical paths of each control step is minimized. To reduce the complexity of the problem, we make the following assumptions:

(1) The placement region is a rectangle with I/O pads located on its boundary.

(2) The blocks $m \in M$ have a rectangular shape with a given aspect ratio. The terminal locations of a block are represented by the center of the block.

(3) All connectivities $n \in N$ in the network are two-terminal nets.

(4) The interconnection topology is multiplexer-based point-to-point interconnection.

The PDP works in three phases: preprocessing, positioning and compaction. In the preprocessing phase, it enlarges the blocks to incorporate space for wiring and determines the data-transfer paths for each control step. The positioning phase allocates the blocks without overlap in the placement region, giving priority to modules on critical paths. A modified force-directed algorithm performs this operation. The final phase examines the rotation alternatives for the blocks in order to improve the placement. Below we describe each of these phases in detail.

2.1 Preprocessing

To ensure space for wiring, the width and the height of each block in the network are enlarged by $\Delta x_m$, estimated as:

$$\Delta x_m = \frac{in_m + out_m + cpin_m + ppin_m}{4} \times W_t \times U_t,$$

where $in_m$ and $out_m$ are the input and output bit widths of the block $m$; $cpin_m$ and $ppin_m$ are the numbers of clock and power terminals of the block; $W_t$ is the average width of a routing track in the layout system under the current technology; $U_t$ is a technology-dependent track utilization factor that reflects the sharing of routing channels in the layout system. Thus the effective area of a block $m$, including the wiring area, is approximated by:

$$A_m \approx (W_m + 2 \cdot \Delta x_m) \times (H_m + 2 \cdot \Delta x_m),$$

where $W_m$ and $H_m$ are the width and height of the block before considering the wiring area.
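As a worked instance of the two preprocessing formulas, the sketch below computes the enlargement margin and the effective area of one block. The function names, the parameter names and the numeric values are illustrative assumptions (they are not taken from the paper); the arithmetic follows the formulas as stated in the text.

```python
def wiring_margin(in_bits, out_bits, clock_pins, power_pins, track_width, track_util):
    """Enlargement margin: (in + out + cpin + ppin) / 4 * W_t * U_t."""
    return (in_bits + out_bits + clock_pins + power_pins) / 4 * track_width * track_util


def effective_area(width, height, margin):
    """Effective block area: (W + 2*margin) * (H + 2*margin)."""
    return (width + 2 * margin) * (height + 2 * margin)


# Hypothetical 16-bit block with 2 clock and 2 power terminals,
# a track width of 2.0 length units and a utilization factor of 0.5.
margin = wiring_margin(16, 16, 2, 2, track_width=2.0, track_util=0.5)
print(margin)                               # 9.0
print(effective_area(100.0, 50.0, margin))  # 8024.0
```

Here 36 terminals divided by 4 give 9 tracks per side on average, so the 100 x 50 block grows to 118 x 68.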

Note that a block with a different aspect ratio may result in a different wiring area in the routing process, as is the case in real-world designs.

After the module enlargement, we transform the graph $G_N$ into a set of directed subgraphs $\{G^p_j\}$ $(j = 1, \ldots, |T|)$, $G^p_j \subseteq G_N(M, N)$, each of which represents the physical paths of data transfers for a control step. Figure 1 exemplifies the procedure by showing a scheduled data-flow graph (Fig.1(a)), its data-path network (Fig.1(b)), and the subgraphs $G^p_1$, $G^p_2$, $G^p_3$ corresponding to three control steps. In this figure, labels r denote registers, labels x denote multiplexors, add marks the adder and mlt marks the multiplier. Based on the subgraphs, the positioning phase is performed.
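The text does not spell out how the per-step subgraphs are derived, so the following is only one plausible sketch under stated assumptions: each operation reads its input values from registers and writes its result to a register within its own control step, and a multiplexor, when one is bound to a data-flow edge, sits between the source and destination blocks. The function name, argument names and the tiny example are hypothetical; the block labels follow the r / x / add / mlt convention of Figure 1.

```python
from collections import defaultdict


def step_subgraphs(op_inputs, op_output, op_to_fu, value_to_rg, op_to_step, edge_to_mx):
    """Return {control step: set of directed (source block, target block) edges}."""
    subgraphs = defaultdict(set)

    def add_path(step, src_block, dst_block, dfg_edge):
        # Route through the multiplexor bound to this data-flow edge, if any.
        mux = edge_to_mx.get(dfg_edge)
        hops = [src_block] + ([mux] if mux else []) + [dst_block]
        for a, b in zip(hops, hops[1:]):
            subgraphs[step].add((a, b))

    for op, step in op_to_step.items():
        fu = op_to_fu[op]
        for value in op_inputs[op]:
            add_path(step, value_to_rg[value], fu, (value, op))
        result = op_output[op]
        add_path(step, fu, value_to_rg[result], (op, result))
    return dict(subgraphs)


# Illustrative schedule: c = a + b in step 1, e = c * d in step 2.
paths = step_subgraphs(
    op_inputs={"o1": ["a", "b"], "o2": ["c", "d"]},
    op_output={"o1": "c", "o2": "e"},
    op_to_fu={"o1": "add", "o2": "mlt"},
    value_to_rg={"a": "r1", "b": "r2", "c": "r1", "d": "r3", "e": "r2"},
    op_to_step={"o1": 1, "o2": 2},
    edge_to_mx={("o1", "c"): "x1"},  # the adder result reaches r1 through mux x1
)
for step in sorted(paths):
    print(step, sorted(paths[step]))
```

Each resulting edge set is a subset of the network's connections, which is what lets the positioning phase reason about one control step at a time.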

2.2 Positioning

The aim of this phase is to determine the locations of the blocks, considering their geometrical characteristics and connectivity simultaneously under given aspect-ratio constraints. The placement is based on a force-directed technique [14] which was modified for critical net weighting. A force acting on a block $i$ is computed as the difference between an attractive force proportional to the length of the nets connected to the block and a repulsive force proportional to the overlapping area of the block:

$$\vec{F}_i = \sum_{j \in |M|} K_{ij} (\vec{r}_j - \vec{r}_i) - \alpha \left\{ \sum_{j \in |M|} A_{ij}^2 \frac{\vec{r}_j - \vec{r}_i}{|\vec{r}_j - \vec{r}_i|} + \sum_{j = N,E,W,S} B_{ij}^2 \vec{n}_j \right\} \qquad (1)$$

Here, $K_{ij}$ is a weight assigned to the interconnections between the blocks $i$ and $j$; $\vec{r}_i$, $\vec{r}_j$ are the locations of the blocks $i$ and $j$, respectively; $A_{ij}$ is the overlapping area between the blocks; $B_{ij}$ is the portion of the block $i$ overhanging the placement region; $\vec{n}_j$ is a unit normal vector directed outside the placement region; $N$, $E$, $W$ and $S$ denote the Northern, Eastern, Western and Southern directions, respectively; $\alpha$ is the penalty coefficient for block overlap (its symbol is illegible in the source, so $\alpha$ is used here as a stand-in). The larger $\alpha$ is, the smaller the overlap.
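To make the force expression (1) concrete, the sketch below evaluates it for one block. Several details are assumptions rather than statements from the paper: blocks are axis-aligned rectangles given by center and size, the placement region has its lower-left corner at the origin, the overhang on each side is measured as an area, the penalty coefficient is passed as `alpha`, and coincident centers contribute no overlap force. All function and parameter names are hypothetical.

```python
import math

# Outward unit normals of the four region boundaries.
NORMALS = {"N": (0.0, 1.0), "E": (1.0, 0.0), "W": (-1.0, 0.0), "S": (0.0, -1.0)}


def overlap_area(ci, si, cj, sj):
    """Intersection area of two axis-aligned rectangles (center, (width, height))."""
    dx = min(ci[0] + si[0] / 2, cj[0] + sj[0] / 2) - max(ci[0] - si[0] / 2, cj[0] - sj[0] / 2)
    dy = min(ci[1] + si[1] / 2, cj[1] + sj[1] / 2) - max(ci[1] - si[1] / 2, cj[1] - sj[1] / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def overhang_areas(center, size, region):
    """Area of the block lying beyond each boundary of a (width, height) region."""
    (x, y), (w, h), (rw, rh) = center, size, region
    return {
        "N": min(max(y + h / 2 - rh, 0.0), h) * w,
        "E": min(max(x + w / 2 - rw, 0.0), w) * h,
        "W": min(max(w / 2 - x, 0.0), w) * h,
        "S": min(max(h / 2 - y, 0.0), h) * w,
    }


def force_on_block(i, centers, sizes, weights, region, alpha):
    """Evaluate expression (1) for block i; returns the force as (fx, fy)."""
    xi, yi = centers[i]
    fx = fy = 0.0
    for j, (xj, yj) in centers.items():
        if j == i:
            continue
        dx, dy = xj - xi, yj - yi
        k = weights.get((i, j), weights.get((j, i), 0.0))
        fx += k * dx                      # attractive term K_ij (r_j - r_i)
        fy += k * dy
        dist = math.hypot(dx, dy)
        a = overlap_area(centers[i], sizes[i], centers[j], sizes[j])
        if a > 0.0 and dist > 0.0:        # repulsive term, pushes i away from j
            fx -= alpha * a * a * dx / dist
            fy -= alpha * a * a * dy / dist
    for side, b in overhang_areas(centers[i], sizes[i], region).items():
        nx, ny = NORMALS[side]            # boundary term, pushes i back inside
        fx -= alpha * b * b * nx
        fy -= alpha * b * b * ny
    return fx, fy


centers = {"a": (10.0, 10.0), "b": (20.0, 10.0)}
sizes = {"a": (4.0, 4.0), "b": (4.0, 4.0)}
print(force_on_block("a", centers, sizes, {("a", "b"): 0.5}, (100.0, 100.0), alpha=1.0))
```

In this example the two blocks neither overlap nor leave the region, so only the attractive term acts and block a is pulled toward block b along the x axis.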

The first term in the foregoing expression represents the attractive force due to connectivity; the other terms represent repulsive forces due to block overlap and to blocks jutting out of the placement region.

The key to the PDP lies in judicious weighting of the block interconnections, because the weight assigned to a net directly affects the length of the net. The weight $K_{ij}$ has a non-zero value if there is a net $n_{i,j}$ between the blocks $i$ and $j$. In our program, the value of $K_{ij}$ depends on the "criticality" of a path containing the net and is computed as:

$$K_{ij} = \begin{cases} 1/\ldots, & \text{if } SL(n_{i,j}) \ldots \\ 1/SL(n_{i,j}), & \ldots \end{cases}$$
