Accurate layout area and delay modeling for system level design

0 downloads 0 Views 645KB Size Report
costly to evaluate. In the extreme case, one can get exact measures of design quality by going through the full physical design steps but this would be too costly,.
Accurate Layout Area and Delay Modeling for System Level Design C. Rarnachandrant, E J. Kurdahit, D. D. Gajskit, A. C.-H. Wus, and V. ChaiyakulS tECE Department, ‘ICs Department University of California, Imine, CA 927 17

-

Abstract

Structwd Representohon

We discuss the problem of estimating design quality measures to accurately reflect design tradeofs and efficiently explore the design space. Specifreally,we are interested in predicting the layout area and delay of a given structural RT level design. Clearly, current RT level cost measures are highly simplified and do not reflect the real physical design. In order to establish a more realistic assessment of layout effects, we proposed a new layout model which accurately and efficiently accounts for the effects of wiring and Poorplanning on the area and performance layout of RT level designs. Benchmarking has shown that our model is quite accurate.

Accurate Areaand

*.

**.

Models

Physical Representation

Figure 1: Design Abstraction Levels

1 Introduction

actually get optimized are usually simple linear models of hardware which are based solely on the contribution of active logic (i.e. gates or functional units) to both area and delay. Such abstract metrics do not accurately reflect the quality of the real physical design because they do not consider important layout effects such as wiring and floorplanning. For large designs, these physical design considerations tend to dominate layout area and delay. The lack of such realistic modeling of layout could result in “misleading” the synthesis tasks into generating inferior designs. This was indeed the conclusion in McFarland’s work [ 11. As technology provides smaller and smaller feature sizes and higher density, it is expected that physical effects will play an even more important role in the overall design area and performance. Therefore, such effects can no longer be ignored.

The design of a VLSI chip starts with an abstract behavioral or functional specification and ends with a detailed specification of the layout mask level geometries (Figure 1). In between these two stages of the design is a hierarchy of design steps (automatic or manual) involving functional partitioning,architectural and logic synthesis, lloorplanning, placement and routing. The global decisions made during the initial phases of design have a pronounced impact on the final chip’s performance and consequences of these decisions will not be apparent until very late in the design process. Consequently, during the early design stages, there is a need for accurate design quality evaluation metrics which can take into consideration the effects of the subsequent design steps and provide guidance for the global design decisions. The various design synthesis tasks which take place prior to physical design usually comprise archit&tural (or high level) synthesis, Register-Transfer level synthesis, and logic synthesis. These tasks could be performed either manually or automatically. Each one of these synthesis tasks usually attempts to optimize a cost function representing the layout a r a , delay, or a combination of both. Since there is no physical layout information readily available at this point, typical area cost functions that

0-8186-3010-8/92 $03.00 Q 1992 IEEE

S y n t h e s i s 1 Behaviaal Representation

l.l previous work Most of the previous work in developing predictive models of layout was done at the gate or transistor levels [2] [3] [4]. Gate level predictive models are of great value in early chip planning and layout tasks (such as floorplanning). However, in order to address the specific problems in high level synthesis described

355

WS

lerpt

Simple Model

Proposed Model

Actual design

SMll

Figure 2: Error Vs Complexity earlier, predictive models must operate at a higher level of design abstraction. The output of a typical high level synthesis phase consists of RT level components, control, and memories. Clearly, such a design would not be laid out as a large piece of random logic. Thus, any high level predictive model must be able to handle several variations in design styles and must reflect the effects of high level design tradeoffs on silicon accordingly. In [5] and [6], abstracted layout area and timing models for high level synthesis were presented. These models were experimentally shown to accurately and efficiently reflect the effects of the data path design tradeoffs on the final layout. However, these models concentrated on modeling the datapath and controller separately and did not consider the impact of floorplanningwhich could generally be a significant factor in area and delay.

Figure 3: the Datapath Layout Model

0

0

cle time. Therefore, our system can achieve chip level modeling whereas 151 and [6] concentrated on blocks level modeling and estimation, our layout and wiring models of datapath and random logic are flexible, efficient, and accurate, and we have validated our model with fully constructed and simulated layouts of standard high level synthesis benchmarks, and real industry standard designs, using industrial layout tools.

2 Area models Ideally, one would like a predictive layout model to read in design specifications in which only high level constructs are described and partitions the specifications into four structurally distinct blocks: (1) datapath, (2) controller, (3) macros, and (4) memories. Clearly, this is a very difficult problem which has only recently been addressed by researchers [8]. As a first step in abstracting the level of predictive models, we assume for now that the input to our model is a an FSMD [7] description of the system as a partitioned RT level design in which components have been assigned to structurally distinct blocks.

1.2 Contributions of our work In this paper, we propose an overall layout model which can be used to obtain accurate and efficient area and timing information of a given RT level design. Clearly, accuracy and runtime efficiency are mutually competing goals as shown in Figure 2. Simple hardware models (usually analytical) are quite fast to evaluate but are generally not accurate. More complex models (usually constructive) are generally more accurate but also more costly to evaluate. In the extreme case, one can get exact measures of design quality by going through the full physical design steps but this would be too costly, especially if a large number of candidate RT level designs are to be evaluated when going through iterative design improvement steps. Our model offers a reasonable compromise between accuracy and efficiency and furthermore, allows the user to tradeoff these two criteria. Our additional contributions over previous work are the following: 0 We model chip level designs (using the Finite State Machine with Datapath, or FSMD paradigm [7]), as opposed to gate level designs, thus achieving a higher level of abstraction compared to [3], [21, and [41 without sacrificingaccuracy or efficiency, 0 we incorporate floorplanning information into the prediction and have provably good wiring models of global interconnect. Thus, our prediction model accurately reflects the effects of layout and floorplanning on area, timing, and system clock cy-

2.1 Datapath area model The data path is laid out using the bit slice methodology which is being successfully used in design systems such as [9]. A bit slice is constructed by laying out each bit slice unit (composed of 1-bit RT level component slices such as 1-bit Adder, 1-bit Mux, 1-bit Register, etc..) using standard cells vertically and connecting the interunit data nets using vertical routing channels. Control lines run horizontally using METAL2 lines. Power and ground lines run vertically using METALl. Vertical inter-bitslicerouting is done using METALl and POLY lines. We assume that information about the standard cell library used is readily available. This methodology is illustrated in Figure 3. The width of the datapath bit slice, W, is estimated as the sum of: (1) VW,,, which is the width of the widest unit, and (2) width of the routing needed to connect the inter-bit data nets. In order to estimate the area of each bit slice, we use LAST and SCALE [4]. LAST [4] uses a

356

RT Level design

constructive/analytical model which was found to predict the layout area of standard cell designs with an average error of 5%, and SCALE is an analytical/probabilistic model for standard cell area estimation'. We use SCALE to estimate the area of each unit and next we use LAST to estimate the approximate topology and floor plan of the overall bit slice. This approximate floorplan is used to estimate the dimensions of the bit slice. The width of the overall data path can then be estimated as shown in Figure 3. Our modeling technique is flexible enough to allow the designer to experiment with different ways of partitioning the datapath, ranging from a single datapath block, all the way to a multi-block strategy in which each block represents the layout of an individual RT level component.

Constructive Prediction

A

Dalgath

Analytical Predictors

Mano Block

&

Macro Bbdc Controth-

& & &

Figure 4: Area Estimation Technique

2.2 Controller area Model We assume that the controller is designed using random logic and implemented using standard cells. The current version of the layout model also assumes that the control logic has been optimized and a controller netlist is generated using existing sequential and combinational logic synthesis tools such as [ l l ] and [12]. Given this netlist, LAST is used to estimate the shape function of the controller block. This enables the chip level layout model to estimate the aspect ratio which results in the most optimal packing of the overall floorplan. Clearly, this would not be an efficient approach for larger chips, and future work must address modeling the effects of logic optimization on controller size and delay instead of actually invoking these tools. Such a modeling is necessary in order to obtain a complete, self-contained predictive model of layout.

The chip level slicing tree technique involves slicing down to the leaf blocks, which consist of either macros or standard cell clusters. This constructive approach does not consume excessive CPU runtime since the number of leaf blocks is usually limited to a relatively small number. The layout of hard macros (such as macrocells and memories) is fixed and hence the pin locations is known. In the case of soft macros (e.g controller), the pin locations have to be approximately determined to the extent of the side location. The pin side location can be determined by evaluating the shape function of the design prior to wiring area calculation. This process determines the approximate location of the blocks in the design. From the connectivity information the blocks connected to any pin can be determined. By evaluating the mean location of these blocks, the side location of each pin can be determined. The area of the design is built up by looking at a window of the slicing tree This window consists of slice level i and slice level i + l . The area of the slices can be obtained by estimating the wiring channels in the slices. The area of the channels are estimated using empirical formulas described in [ 141.

2.3 Macrocells and Memories Some system components can be laid out as Macrocells. Macrocells can be either pre-designed, or parameterized. In the first case, the dimensions of the block and also the pin locations, are known in advance. Parameterized components are usually designed by replicating a bit slice of a bit cell in one or two dimensions by using generators such as GDT [131. In this case the overall block dimensions can be easily calculatedby using simple equations derived from the generating function.

3 Timing models A typical timing model for digital systems is shown in Figure 5. The data path consists of combinational

logic blocks (composed of functional units and muxes) bounded by data registers. Data registers are used to store data inputs, outputs,and intermediate values in the data path. The controller can be implemented either as a Mealy or a Moore model. Our modeling style is flexible enough to allow for either design style, depending on the designer's preference and the particular sequential logic synthesis algorithms used.

2.4 Chip level area estimation model Our chip level model uses a slicing tree technique (described in [4] and derived from [3]) as shown in Fig4 The standard cells are grouped into clusters called standard cell blocks and are treated as a single entity while building the slicing tree and computing the shape function of the design. SCALE or LAST can be used to evaluate the shape function of the standard cell block using a partial slicing technique described in [4].

3.1 Combinational logic timing model In our model, the three factors that contribute to delay t e b ( R i , R j ) in the combinational logic bounded by

'Details of the SCALE model can be found in [IO]

357

Controller controll status

-

-

Control signals

DataPath

I-----

;, Output Logic

data

clock

-......*

state

uw I

Status signals

.

Figure 5: Typical timing model for a digital system with a Moore-style controller registers Ri and Rj can be expressed as :

t,b(R;,Rj) =

delayk VcellsE Path

-t

c

+

_r

~~

Figure 6: Constructive/analyticalwire length estimation fanout,

VneW Path

Where, Path is the worst case path between Ri and Rj. The first two components are the contributions of the intrinsic cell delay and the fanout delay in Path and can be estimated from the cell library and the netlist. In order to estimate the third component (i. e. the wiring delay), we assume constant width wires and a lumped RC model [151, where the capacitance and the resistance of a wire n are each directly proportional to the wire length, L ( n ) , in the Path. In order to estimate L ( n ) , we use the approximate layout topology generated by the area model to estimate the wire length. 3.2 Wirelength Estimation The area model essentially generates an approximate floorplan of the major circuit blocks, or leaf clusters as described in Section 2. The length of each wire is estimated as the sum of two components: the intercluster component, which is given by : Lintercruster(n)= M S T , ( c l u s t e q , . . . , clusterN) and the intracfustercomponent, which is an analytically derived estimate of the average wire length inside the leaf clusters as shown in Figure 6. This approach results in accurate and efficient estimates of wire lengths. More details on the wire length and delay estimation can be found in [14]. Next, the critical path(s)s in the block is computed. This provides a topological estimate of the worst case delay of a combinational block. In a combinational block described at the gate level, the topological delay estimation scheme may be overly pessimistic because of the possible false paths [16] in the circuit. In this case, the user may as an option, invoke a functionality-based delay estimation heuristic model which attempts to find the true worst case path at the expense of increased runtime. More details on the functionality based technique, can be found in [17].

3.3 Estimating the clock cycle time The combinatorial logic delay can be used to compute the worst case register-to-registerdelay, t d ( R i , R j ) where, t d ( R i ,Rj) = thold(&)

+ tcb(Rzr R j ) +

t~etup(Rj)

The minimum clock cycle is determined by the maximum register-to-registerdelay as shown below :

The effects of clock skew can also be accounted for: since our area model can provide approximate locations of the major blocks (and therefore estimate the length of the clock tree branches), the clock arrival times at these blocks can be estimated and used to adjust the clock cycle estimates accordingly. Modeling of multiphase clocking schemes is accomplished assuming the assignment of phases to registers is determined. In this case, the timing estimation model estimates the delays of all the combinational logic blocks and can feed this information into more sophisticated clocking models (such as [ 181, for example) to determine a lower bound on the clock period and a clocking scheme. This flexibility allows the user to use the model to explore the effect of different clocking schemes on layout performance.

4 Experimental results 4.1 Experimental procedure In order to benchmark the accuracy of our layout model, we used 8 layouts described in Table I. These layouts were generated from three standard high level synthesis benchmarks: (1) the Elliptic Filter [19] (EFl-EF6), (2) the AMD 2901 CPU [20], and (3) the AMD 2910 microsequencer [20]. All the EF designs use muxed functional units and registers and were derived manually except for EF5 which is implemented using register files and described in [21]. The Mentor Graphics GDT tools were used to generate and simulate the final layouts which were based on the layout methodology explained in Section 2.

the data path. This large number of wires between the datapath and the controller caused a relatively large error in the interblock wiring estimation which in turn affected the overall area estimation error for that design. Figure 8 graphically compares the ordering of the actual and the estimated layout areas for the designs. We note that the ordering of the areas is the same except for EF5 and EF6, whose ranking was reversed by the estimator. However, given a 10% error margin as shown by the error bar, one can say that the two designs are almost equal in area, which can be concluded when the actual areas are compared. Another interesting comparison is between designs EF4 and EF3 which have the same number of control steps and the same number of Functional units and similar register count, but different register and mux allocations. In this case, EF4 which has 23 equivalent 2-1 mux equivalents occupied less area than EF3 which only had 20. This was also reflected by our estimation model. Figure 8 also compares the contributionsof the different design components to the overall layout area. The different curves show the estimates of area using our model, and other partial models including: functional unit (FU) area, FU+register area, FU+register+mux (or datapath) area, FU+register+mux+controller area. The difference between the last model and the actual area clearly shows that the effect of global wiring and floorplanning is significant in all the examples. It is important to note that the FU, mux, register and controller area curves shown in the figure were obtained using our block level models described in Sections 2.1, 2.2, 2.3 which account for both cell area and wiring area inside each of the major blocks. The clock cycle estimation results show a maximum error of 10%. In this case, we observe the same relative ordering between the estimated and measured clock cycle times. One interesting design is EFl in which the long bitslice datapath wires clearly affected the clock cycle time adversely. The same effect is reflected in the model which shows a proportionately large estimate of the cycle time. Another interesting tradeoff is between EFS, which

Figure 7: Chip layout of Design EF5 (screendump from GDT) For consistency, the same library of standard cells and Macrocells (SCMOS 3p) was used across all the designs. Each layout was functionally verified using Lsim in switch mode, and the extracted netlist was simulated in the ADEPT mode to obtain accurate timing waveforms. The worst case cycle delay time was measured from the Lsim output waveforms. Figure 7 shows a typical layout example. 4.2

Results

The estimation results are shown in Table 2. First, we note that our area estimate are within 10% with the exception of the AMD2901 where our estimation error was 12.7%. Upon examination of this design, we noted that the register file in the original specifications was embedded within the bit sliced datapath. This resulted in a large number of control signals because all the register file controls were sent along with their complements to

359

Benchmark II Measured I Estimated I % II Measuredclock 1 Estimatedclock I % cycle(ns) error design Area($) Area($) error cycle(ns) 394 I 354 I 10.1 EF1 II 11327x4928 I 11236x4944 I 0.4 I1

11

AMD2901 AMD2910

11 I]

I

5340x1874 3375x3815

I I

I

5289x1652 3414x3980

11

I I

12.7 5.1

uses register files and EF4 which uses muxed registers inside the data path bitslice. Both designs have one adder and one multiplier each. Note that the cycle time of EF5 is significantly larger than EF4 mainly because of the combined effects of the ripple carry adder delay and the delay in accessing registers inside the register files, which are located outside the bitslice datapath. This effect was also accurately reflected in the cycle time estimated by our model. Many high level synthesis systems compute the clock cycle time as the cycle delay of the slowest operator among those used in the RT level design [22]. To assess the accuracy of this meuic, we plotted in Figure 10 the actual and estimated clock cycle times for all 9 designs along with the delay of the slowest operator in each design. It can be clearly seen that the latter metric is inadequate because it does not take the register, mux, and wiring delays into consideration2. By contrast, our model shows an excellent tracking ability of the actual cycle delay values. Figure 9 is a plot of the area versus overall input-tooutput delay (i.e. cycle-time x #-clock-cycles) for each EF design (We did not include the delays of the AMD designs because they have a variable number of control steps). Clearly, the slowest Elliptic Filter design is EF6 which uses an add-shift multiplier whereas the fastest design is EFl which uses two multiplier and two adders and is scheduled in 19 time steps. The tradeoffs between the other designs are estimated accurately by OUT model. For example, EF2 is correctly predicted as inferior by our model. For all the benchmarks, the runtime of the model to estimate both the area and cycle times was under 10 seconds of SUN4 CPU time per design. This is expected, since the heuristics used in the model are either constant lime, linear time, or pseudo-linear time. Of course, we need to run larger examples with many more RT level components and benchmark their runtime before making a general statement about the model's efficiency. ~~

'The only exception is EF6. where the add-shift multiplier cycle delay was dominant. However, the add-shift multiplier itself was implemented as a bitslice unit whose cycle delay was computed using our model.

360

I

11 11

I

332 23 1

303 246

I I

1

I I

9.5 6.0

0 Measured

1

rn Estimeted Elliptic Filter Designs 1-6

c

10.0

5.0 20.0

30.0

40.0 50.0 Layout Area (x 10' p*)

60.0

70.0

Figure 9: Area versus delay for the EllipticFilter layouts

5

Conclusions

We presented a new layout predictive model for high level applications. This model accurately and efficiently accounts for a variety of RT level design styles on achip as is normally required for digital systems layout. We tested our model on a variety of high level design benchmarks. The results show that this model can reflect the effects of layout on design area and performance much more accurately than typical high level synthesis cost models. Clearly, more benchmarking on a wider range of design styles is needed. Another issue of importance is to model the effects of control logic optimization on controllerarea and delay. These issues and others are open problems and subjects of current research.

I

1

q

a

,

_-/ /

\ \

/

,?

1‘,

EF1

EF2

>‘\

EF3

\

\

P 1.o

[9] W. K. Luk and A. A. Dean, “Multi-stack optimization €or data-path chip (microprocessor) layout,” in Proc. 26th DAC, pp. 110-1 15,1989. [lo] E J. Kurdahi and A. C. Parker, ‘Techniques for area estimation of VLSI layouts,” IEEE Trans. CAD, vol. 8, no. 1,pp. 81-92,1989. [ l l ] R. Brayton et al., “Mis: A multiple level logic optimization system,”IEEE Trans. CAD, vol. CAD6, pp. 1062-1081, NOV.1987. [12] S . Devadas et. al, “MUSTANG: State assignment for finite state machines for multi-level logic implementations,” in Proc. ICCAD-87, pp. 16-19, 1987. [13] M. Buric and T. Matheson, “Silicon compilation environments,” in Proc. IEEE CICC, 1985. [14] C. Ramachandran et. al., “RT level layout models for high level synthesis,” tech. rep., ECE Department, UC Irvine, 1992. [15] P. Penfield Jr and J. Rubenstein, “Signal delay in RC tree networks,” in Proc. 18th DAC, 1981. [ 161 P. C. McGeer, On the Interaction of Functional and Timing Behavior of Combinational Logic Circuits. PhD thesis, Dept. of EECS, Univ. of California, Berkeley, 1989. [17] C. Ramachandran and F. J. Kurdahi, “A combined topological and functionality based delay estimation using a layout-driven approach for high level applications,” in Proc. Euro-DAC 92, Sept. 1992. (to appear). [18] K. A. Sakallah, T. N. Mudge, and 0. A. Olukotun, “Analysis and design of latch-controlled synchronous digital circuits,” in Proc. 27th Design Automation Conf.,pp. 111-1 17,June 1990. [19] S.-Y.Kung, H. J. Whitehouse, and T. Kailath, VLSI and Modern Signal Processing. Prentice Hall, 1985. [20] “Amd 2900 series databook.” Advanced Micro Devices. [21] B. Rouzeyre and G. Sagnes, “Memory area minimization by hierarchical clustering in high-level synthesis,” in Fifth International Workshop on High-Level Synthesis, Mar. 1991. [22] R. Jain et. al., “Experience with the ADAM synthesis system,” in Proc. 26th DAC, pp. 56-61,1989.

1

I

MMeasured t rn - Model’s w Estimate-

\

P,

>‘\

EF4

EF5

EF6

2901

>‘\

2910

Figure 10: Comparison of measured, predicted clock cycle times, and slowest operator delay

6 Acknowledgements The authors would like to acknowledge the students in the VLSI design classes at UCI who provided the layouts used for benchmarking our model. This work was supported by NSF grants #MIP-8909677 and #MIP8922851,and CAL-MICRO grants #90-046 and #91-080. We are grateful for their support.

References [l] M. McFarland, “Using bottom-up design techniques in the synthesis of digital hardware from abstract behavioral specifications,” in Proc. 23rd Design Automation Conf., pp. 474480, IEEE/ACM, 1986. [2] M. Pedram and B. Preas, “Interconnection length estimation for optimized standard cell layouts,” in Proc. ICCAD-89, pp. 390-393, IEEE/ACM, 1989. [3] G. Zimmermann, “A new area and shape function estimation technique for VLSI layouts,” in Proc. 25th Design Automation Conf., pp. 60-65, IEEE/ACM, 1988. [4] F. J. Kurdahi and C. Ramachandran, “LAST A layout area and shape function estimator for high level applications,” in Proc. Second European Conf. on Design Automation, Feb. 1991. 151 A. C.-H. Wu, V. Chaiyakul, and D. D. Gajski, “Layout-area models for high-level synthesis,” in Proc. ICCAD-91, pp. 34-37, Sept. 1991. [6] V. Chaiyakul, A. Wu, and D. Gajski, “Timing models for high-level synthesis,” in Proc. EuroDAC-92, 1992. [7] D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992. [8] S . Narayan, F. Vahid, and D. Gajski, “System specification and synthesis with the specchart language,” in Proc. ICCAD-91, pp. 266-269,1991.

36 1

Suggest Documents