static profile driven optimization of digital circuits - CMU ECE

STATIC PROFILE DRIVEN OPTIMIZATION OF DIGITAL CIRCUITS

Srihari Cadambi
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, Pennsylvania

Contents

1 Introduction
    1.1 Main Contribution of this Work
    1.2 Related Work
    1.3 Roadmap

2 Hot-spots in Netlists
    2.1 General Definitions
    2.2 Hot-spot Extraction
        2.2.1 Algorithm for Extracting Hot-spots
    2.3 Characterizing Hot-spots
        2.3.1 Area Hot-spots
        2.3.2 Area-Delay Hot-spots
        2.3.3 Other Factors in Template Characterization
        2.3.4 Handling Overlap
    2.4 An Example
    2.5 Time Complexity of Template Extraction
        2.5.1 Reducing running time: Iterative Profiling
        2.5.2 Iterative Extraction with Gradually Decreasing Sizes (GDS)
        2.5.3 Pruning Bad SSDAGs
    2.6 Results and Analysis
    2.7 Summary

3 Static Profiling for A Reconfigurable Compiler
    3.1 Background: Reconfigurable Architectures and Compilers
        3.1.1 Pipeline Reconfigurable Architectures
    3.2 The Tool Flow: DIL Compiler
        3.2.1 DIL Compiler Definitions
        3.2.2 The DIL Compiler Place and Route
    3.3 Profiling Method
        3.3.1 Macro Specification
    3.4 Results and Analysis
    3.5 Summary

4 Static Profiling for ASIC Technology Mapping
    4.1 Background: Technology Mapping
        4.1.1 Area and Delay Minimization During Technology Mapping
    4.2 The Tool Flow: Synopsys Design Compiler
    4.3 Profiling Methodology
    4.4 Results and Analysis
        4.4.1 IDEA Encryption
        4.4.2 CORDIC
    4.5 Summary

5 Static Profiling for FPGA Architecture Design
    5.1 Background: FPGA Architectures
        5.1.1 Island-style FPGAs
        5.1.2 Parameterizable Architectural Model
        5.1.3 Typical CAD Flow for FPGAs
    5.2 The Tool Flow: Experimental Setup
        5.2.1 Experimental Goals
        5.2.2 Architectural Model Used for the Experiment
        5.2.3 Experimental Tool Flow
        5.2.4 VPack and VPR
    5.3 Profiling Methodology: Integrating the Profiler in the Tool Flow
        5.3.1 Architectural Exploration Using Hot-spots
    5.4 Results and Analysis
    5.5 Summary

6 Static Profiling for Heterogeneous FPGA Compilation
    6.1 Background
    6.2 The Tool Flow: Compiling for Heterogeneous Clusters
    6.3 Profiling Methodology
    6.4 Results and Analysis
    6.5 Summary

7 Summary and Conclusions
    7.1 Summary
    7.2 Conclusions

A DIL Place and Route Algorithm

List of Figures

2.1 Illustrations of the basic definitions used. The netlist shown on the left may be represented by the DAG shown on the right. n is a net (or wire) with the mux as its only source (driver). n has 3 destinations: the ALU, inverter and AND gate. It has 4 pins labeled p0, p1, p2 and p3. p0 connects n to its source while p1, p2, p3 connect n to its 3 destinations.

2.2 An example single sinked DAG, with an internal output. ABCDE is the single sinked DAG. However, ABCDEF is not, since it has two sinks A and F.

2.3 Example equivalence relations between SSDAGs.

2.4 Pattern construction algorithm. newgraph(n,l,r) creates a new graph with edges l → n and r → n. T represents the global pattern hash table.

2.5 The hot-spot extraction pattern search algorithm and data structures. While searching for SSDAGs of size 4 ending at node N, SSDAGs of sizes 0, 1, 2 and 3 ending at each of N's sources are looked for. These would have been computed earlier, and stored at their root nodes. Each node stores a two-dimensional array of a list of SSDAGs. Each entry in the array stands for an SSDAG of a particular size. The SSDAG itself is stored as a list of node pointers.

2.6 The Overall Static Hardware Profiler.

2.7 Example illustrating intra template overlap. Isomorphic SSDAGs S0, S1, S2 and S3, all representing structural and functional isomorphic instances of "ABA", are extracted from the graph shown on the left. S0 and S2 overlap with each other. A greedy heuristic eliminates S2 and the final extracted template contains S0, S1 and S3, as shown on the right side of the figure.

2.8 Netlist with 5 distinct operators A, B, C, D and E. SSDAGs "ABA" and "CDEB" have functionally isomorphic instances. Two templates are extracted and their properties are marked in the table on the right. The smaller template has a greater frequency of occurrence but does not cover as many critical pins as the larger template.

2.9 Extraction of an area hot-spot. Of the two templates available in this netlist, the one with the largest regularity and coverage is extracted.

2.10 Extraction of an area-delay hot-spot. The template with the largest frequency-size-criticality product is extracted. The larger template contains more critical pins although it has lower regularity.

2.11 Two iterations of the static profiler with L = 3. The original graph G is modified to G1 and then G2 after extraction of the shaded templates. The table is updated with the frequency, size, criticality and architecture-specific information of the templates. Note that a template of size 4 was obtained during the second iteration although L was set to 3. This is because the later template included the earlier template.

2.12 Missing a valid SSDAG during iterative profiling. If the netlist shown on the left is profiled up to a size of 3, the template containing ABC and DEF is extracted. Instances of this are converted into the macros shown on the right. Owing to this, another valid SSDAG BCZB is never examined.

2.13 Intra-template overlap indices for templates determined by structural and functional isomorphism.

2.14 Intra-template overlap indices for templates determined by size and I/O isomorphism only.

2.15 Comparison of running times of iterative profiling with and without the GDS heuristic on a 360 MHz Sun 6500 Enterprise Server. Gradually decreasing the size accounts for a big improvement in the running time.

2.16 Effect of a pruning heuristic on the running time. Ignoring SSDAGs of more than 5 inputs makes little difference to the running time in most cases. The machine used was a 360 MHz Sun 6500 Enterprise Server.

3.1 The process of hardware virtualization in pipeline reconfigurable architectures. A five-stage pipeline is being virtualized on a 3-stage fabric.

3.2 PipeRench architecture showing the N PEs in each stripe and their register files. The interconnect switches B-bit values. Unless overwritten by their PE, the register files constitute a pipelined bus.

3.3 PEs in PipeRench showing the interconnect lines. A b-bit PE is composed of b 1-bit PEs. The main interconnect is used to feed the A and B inputs of the PE, which are b bits wide. The third PE input is 1 bit wide, with the same bit going to all the b 1-bit PEs. This bit is fed by the X-line of the control interconnect. The control interconnect also consists of the carry propagation line (C) and the zero detect line (Z).

3.4 The main stages of the DIL compiler.

3.5 Illustration of RPL. The right source bitwire of the operator has its nearest non-routing source node in time-step 0, resulting in an RPL of 3. The left source bitwire of the operator originates in time-step 1, resulting in an RPL of 2. Each arc represents one bit of a wire.

3.6 The DIL place and route algorithm. The algorithm is based on list scheduling. The graph is first pre-processed and placement directives are inserted. Subsequently, the RPL-based priority function is used to select nodes during list scheduling. Placement is then attempted for the selected nodes.

3.7 Illustration of placement restriction 1. If B avoids the column in which A is placed, C can get its inputs on the interconnect since the inputs are in distinct register files, each with a single read port.

3.8 Illustration of placement restriction 2. If C is placed directly below A, D is forced to the next stripe since C cannot use the interconnect wire (A has already used it).

3.9 Example of a decomposition of a 16-bit add into pipelined 8-bit adds with carry.

3.10 Mapping the decomposed addition onto a more abstract PipeRench architecture.

3.11 Compacted hand-layout of the decomposed addition on PipeRench.

3.12 Identifying the 1-bit wire hot-spot in PipeRench and converting it into a macro.

3.13 Using the static profiler to identify hot-spots in netlists for the DIL compiler. The equivalence relation is structural and functional isomorphism and the templates are characterized by their frequency-size product. The netlist on the right shows a typical template.

3.14 Integrating the static profiler with the DIL compiler.

3.15 Specification of two add-with-carry macros for PipeRench. (a) the two isomorphic graphs of the template with the numbers beside the wires indicating bit-widths; (b) the single DIL specification; (c) the single assembly specification for the layout; (d) the two created macro layouts.

3.16 Template extracted from the Optical Flow application. The values are all 8-bit wide.

3.17 Layout of the Optical Flow template using the DIL compiler. The target architecture has 8-bit PEs.

3.18 Compact hand-created layout for the Optical Flow template on 8-bit PEs on PipeRench.

3.19 Template extracted from the IDEA encryption application. The values are all 8-bit wide.

3.20 Layout of the IDEA pattern using the DIL compiler. The target architecture has 8-bit PEs.

3.21 Compact hand-created layout for the IDEA pattern on 8-bit PEs on PipeRench.

4.1 Repartitioning a mapped circuit at fanout points that do not meet delay constraints. Mapped cell N has a fanout of 3. Fanout branch A is on the critical path. This fanout is broken by duplicating N so that the critical branch is on its own, which decreases the effective output capacitance viewed from N.

4.2 Tool flow for ASIC Technology Mapping.

4.3 One round of IDEA. The X's represent the 16-bit input data sub-blocks, while the Z's represent the 16-bit key sub-blocks.

4.4 Improving the area and delay of IDEA with the static profiler. The profiler is used to extract the best hot-spot from the RTL description of IDEA. The selection is based on the isomorphism function R and template utility function F_T. The selected template is optimized into a cell which is added to the technology library. Successive runs of Synopsys Design Compiler are forced to use the optimized cell at every instance of the template in IDEA.

4.5 (a) The hot-spot with the highest regularity extracted from IDEA. Isomorphic instances of this pattern cover 33% of the IDEA netlist. This is a piece of an array multiplier, computing the i'th partial product. (b) Equivalent circuit for the array multiplier stage. This transformed pattern can be implemented as a smaller cell in the technology library.

4.6 The second most optimizable hot-spot from IDEA. This pattern consists of two MUXes and two subtractions from a constant 1, and is part of the multiplication modulo 2^16 operation. The isomorphic instances of this pattern cover 3% of the IDEA netlist.

4.7 Area-delay curves for IDEA using Design Compiler. The topmost line shows the area-delay trade-off without specially optimizing the hot-spots extracted by the static profiler. The two curves below show the area-delay trade-off when the hot-spots are mapped into specially created technology library cells with and without a compaction factor.

4.8 (a) The most regular hot-spot extracted from CORDIC. This structure occurs enough times to cover 18% of the netlist. It consists of a MUX, an adder and a subtractor with common inputs, and may be mapped to an ADDER-SUBTRACTOR unit commonly found in many technology libraries.

5.1 A high-level view of a typical island-style FPGA, showing the rows of configurable logic blocks (CLBs) and the interconnect network. The interconnect network consists of routing channels between the rows of CLBs containing routing tracks, switchboxes (denoted as 'S') and connection boxes, shown connecting the CLB pins to the routing tracks.

5.2 The basic logic element consisting of a K-input look-up table (LUT) with a register. The output may be registered or combinational.

5.3 The typical combinational logic block known as a cluster. It consists of N basic logic elements and I external inputs. A complete interconnect maps the I external inputs and the N feedback inputs to any input of any logic element. Each logic element has an external output.

5.4 Typical FPGA CAD flow.

5.5 Experimental Tool Flow. The benchmarks are first subjected to decomposition, after which they are mapped to K-LUTs. The LUT netlist is statically profiled, and the hot-spots extracted are reported to the user.

5.6 Variation of the cluster size prediction error with the IO factor a in Equation 5.5. The error is minimized when a is in the range 80-120.

5.7 Variation of the cluster size prediction error with the Size factor b in Equation 5.5. The error is minimized when b is in the range 30-40.

5.8 Variation of the cluster size prediction error with the Critical Nets factor c in Equation 5.5. The error is minimized when c is in the range 3-3.5.

5.9 Variation of the cluster size prediction error with the Intra Cluster Delay factor d in Equation 5.5. The error is minimized when d is in the range 100-900.

5.10 Area-delay predictions seen by analyzing hot-spots. The hot-spots predict a cluster size, and the area-delay for the predicted cluster size is obtained from the exhaustive runs. This is the predicted area-delay value shown above. It is compared with the correct minimal area-delay obtained exhaustively. The overall idea is to predict a cluster size that has an eventual area-delay close to the minimum.

6.1 Tool Flow for Heterocluster Mapping. The benchmarks are first subjected to decomposition, after which they are LUT-mapped. The static profiler extracts hot-spots and provides them to the mapper, which preferentially maps the hot-spots first.

6.2 Given a LUT netlist, and a heterogeneous cluster C defined by the cluster type LT and number of inputs I, the profiler extracts patterns that match the cluster and contain a large number of critical nets. In the above case, the given cluster has 3 LUTs of the type 322 (that is, 2 2-LUTs and 1 3-LUT), and 5 external inputs. Hot-spots containing critical nets that match this structure are extracted. The netlist is annotated with these hot-spots, and then passed on to the rest of the tool flow.

6.3 Area-delay improvements seen when hot-spots identified by the static profiler are preferentially optimized. The area is the sum of the logic and routing area after place and route measured in millions of minimum width transistors. The delay is the final critical path delay measured in nanoseconds.

A.1 Placing and routing a single node in a stripe.

A.2 The Place and Route Algorithm in the DIL Compiler.

A.3 Computing placement restrictions on graph G prior to place and route.

List of Tables

2.1 Comparison of iterative and non-iterative profiling. The maximum size profiled by iterative profiling is shown against the maximum size profiled by non-iterative profiling in the same running time. It is seen that iterative profiling extracts templates with very large sizes.

3.1 Properties and profiler rankings of macros found and used in different applications. Each of these applications had a single macro that was used.

3.2 The two macros used in IDEA encryption.

3.3 Improvement seen in the number of processing elements (PEs) used for PipeRench when profiler-suggested macros were used. Improvements in the number of PEs indicate functional resources saved only.

3.4 Improvement seen in the number of stripes used for PipeRench when profiler-suggested macros were used. An improvement in the number of stripes is indicative of the impact of functional and routing resources saved.

4.1 Area and Delay Improvements seen when profiler-suggested cells were added to the Synopsys Technology library for IDEA encryption.

4.2 Area and Delay Improvements seen when profiler-suggested hot-spots were optimized in CORDIC. ADDSUB is the slow and small adder-subtractor cell from the 0.18 µm technology library, while ADDSUBP is the larger one with more drive strength. Design Compiler was forced to use these cells for every occurrence of the pattern in Figure 4.8. The last row shows the results when hints were provided to use the DesignWare adder-subtractors at every instance instead of those present in the library.

Abstract

Optimization of netlists subject to area, delay or resource constraints is a common problem that arises in almost every stage of CAD tools. Most forms of this problem have been proven to be NP-complete. Instead of taking a global approach to optimize a netlist, this thesis suggests using static hardware profiling to identify important sub-circuits, which may then be given priority by the tools or human designers. Such profiling helps focus optimization efforts on the portions of the netlist that have the greatest impact on the final result. A static hardware profiling tool that can rapidly extract hot-spots from hardware descriptions has been built. Results from using this tool to improve the performance of a reconfigurable hardware compiler, an ASIC technology mapper and an FPGA compiler are presented.


Acknowledgement

I would first like to thank my advisor, Prof. Seth Copen Goldstein, whose support and guidance over the past few years have been invaluable for this work. His constructive suggestions and practical ideas went a long way in making this thesis a reality, and in teaching me how to develop and articulate ideas well. I am also grateful to him for the many (odd) hours he has put in to help me work on my presentations and meet other deadlines. I would also like to thank my committee members Prof. Herman Schmit, Prof. Don Thomas and Randy Harr for their helpful suggestions during several consultations that I have had with them. Financial support for this work was provided by DARPA and Altera Corporation, for which I am very grateful.

I have also had the privilege of working and collaborating with many smart people at CMU, especially my office-mates Matt Moe, Ben Levine and Reed Taylor. I will remember the many stimulating research-oriented discussions we had, not to mention the political ones. I must also thank Mihai Budiu, without whom the DIL compiler would not have been possible. We were part of many bug-fixing sessions for the compiler.

Finally, and most of all, I would like to thank my wife Neela. Her love, support and patience during the past few years were more than I could ever have asked for. They are what brought this endeavor to a successful completion.


Chapter 1

Introduction

Netlist optimization is a very general and common problem in the field of computer-aided design (CAD) for digital circuits. Netlists may be optimized for many constraints, including delay, area and a limited set of resources. This problem is ubiquitous in CAD tools: circuit transformations that take place at different stages of CAD are all formulated and solved as netlist optimization problems.

However, the netlist optimization problem is notoriously difficult, and most forms of it have been proven to be NP-complete. Algorithms producing good quality solutions are usually very time-consuming, while fast heuristics produce sub-optimal solutions. CAD tools often seek a middle ground and attempt to balance speed and efficiency. Even this is becoming a formidable task for several reasons. First, netlists are getting larger and more complicated: emerging designs have many datapath and control elements with millions of gates and complex interconnections. A commercial CAD tool may take hours, even days, to optimize such designs given stringent area or delay constraints. Second, the advent of new technologies (such as reconfigurable architectures and Systems-on-a-Chip) has resulted in several CAD concepts being applied in hardware compilers. These tools compile netlists to complicated underlying architectural models, which present more challenging optimization constraints. Finally, changing silicon technology and decreasing feature sizes result in denser and faster chips which require new, more complex algorithms for netlist optimization.

Although manual designs can be closer to optimal than those produced automatically by CAD tools, human intervention is slow and expensive. In addition, the human attention span wanes as the problem size increases; therefore such a solution is practical only if the problem size is small. Thus, not only is it difficult to build CAD tools with fast and optimal algorithms for the netlist optimization problem, it is also impractical to rely on manual designers.

The current approach to optimizing netlists is a global approach, one that is generally time-consuming. Given the intractability of the netlist optimization problem, it is important for a CAD tool or a human designer to know which parts of the netlist to spend time on and where to focus optimization efforts. To that end, this thesis suggests using static profiling to identify the sub-circuits in a netlist which are the most important and which have a big impact on the final result. Once identified, these sub-circuits may be prioritized by the CAD tools, or optimized manually. Such important sub-circuits are called hot-spots. Hot-spots represent small parts of a larger problem that have a big impact on the overall result. Each instance of a hot-spot may be optimized quickly, and the optimization effort used on one instance may be reused on other instances in the netlist.

Hot-spots in hardware designs may be compared to those in software programs, which are identified by a (dynamic) software profiler. This thesis extends the software profiling idea to hardware netlists. It seeks to discover methods to identify important sub-circuits in netlists using static profiling. This information may be used to help tools prioritize optimization efforts and human designers focus their attention.

1.1 Main Contribution of this Work

This thesis suggests using static profiling to identify important portions of difficult and time-consuming optimization problems. Applied specifically to the case of netlist optimization with constraints, static profiling is used to identify sub-circuits in the netlist that are important and have a big impact on the final result.

An important contribution of the thesis is an efficient sub-circuit extraction algorithm and methods to characterize and distinguish those sub-circuits in a netlist. The characterization is done using several properties like regularity and presence of critical nets. Further, regularity itself can be characterized in different ways depending on specific scenarios. For instance, certain sub-circuits may be considered to be dissimilar when mapping a netlist to a given technology library, while the same sub-circuits may be similar when compiling to a certain FPGA architecture.

Such characterization methods also articulate the difference between hardware profiling and traditional software profiling. In software profiling, the common characterization metric is execution time per line of code. For instance, a profiler will discover a small for-loop in which a large fraction of a program's execution time is being spent, resulting in that for-loop being a hot-spot. However, in hardware profiling, several different characterization metrics result in the detection of different kinds of hot-spots.

Static profile-driven optimization can be used as a pre-processing step to make CAD tools and hardware compilers faster, more efficient and more portable. It may also be used to identify important portions of the netlist that need to be hand-optimized, and can thus aid human designers. This thesis demonstrates these properties of static profiling based on experiments in four domains: (i) compilation for reconfigurable architectures, (ii) technology mapping in ASIC design, (iii) mapping for island-style FPGAs and (iv) compilation for FPGAs with heterogeneous resources.

The experiments are mainly in the domains of ASIC design and FPGA compilation. Two big factors that determine the cost and performance of ASICs and FPGAs are the speed and quality of design and compilation. In ASIC design, the speed of the CAD tools determines the time-to-market, while the quality of the result determines the size and clock speed. In the FPGA and reconfigurable computing world, where a netlist is mapped to resources on a target architecture, the quality of the mapping determines the speed and size of the final mapped design, which translates to application performance. Since it is time-consuming to get a high-quality solution for the intractable netlist optimization problem, the static profiler is used in each of the above cases to point out portions of the netlists either for human optimization or for prioritization by the tools.
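The idea that different characterization metrics select different hot-spots can be illustrated with a small sketch. It assumes, following the figure captions for Chapter 2, that an area hot-spot is ranked by a frequency-size product and an area-delay hot-spot by a frequency-size-criticality product. The `Template` class, its field names and the numbers below are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical summary of one extracted template."""
    name: str
    frequency: int      # number of isomorphic instances in the netlist
    size: int           # number of nodes in one instance
    critical_pins: int  # critical pins covered by one instance (assumed metric)


def area_utility(t: Template) -> int:
    # Assumed area metric: how much of the netlist the template covers.
    return t.frequency * t.size


def area_delay_utility(t: Template) -> int:
    # Assumed area-delay metric: coverage weighted by criticality.
    return t.frequency * t.size * t.critical_pins


def best_hot_spot(templates, utility):
    """Return the template that maximizes the given utility function."""
    return max(templates, key=utility)


# Illustrative numbers only: a small, frequent template and a larger,
# rarer template that covers more critical pins.
templates = [
    Template("ABA", frequency=3, size=3, critical_pins=1),
    Template("CDEB", frequency=2, size=4, critical_pins=3),
]

print(best_hot_spot(templates, area_utility).name)        # ABA  (9 vs 8)
print(best_hot_spot(templates, area_delay_utility).name)  # CDEB (9 vs 24)
```

With these made-up values the two metrics disagree, which is the point of the contrast with software profiling: the choice of metric decides which sub-circuit counts as the hot-spot.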

1.2 Related Work

While specific background information regarding the four major experiments in this thesis is provided in the appropriate chapters, this section presents a broad overview of related research in this area. First, earlier work in regularity extraction is discussed. Subsequently, current advances in technology mapping, FPGA mapping and reconfigurable compilation, on which the experiments used to validate this thesis are based, are briefly mentioned. Finally, the novelty and contributions of this thesis are summarized.

Regularity Extraction in Graphs

Arikati et al. [2] demonstrated a signature-based regularity extraction method for graphs. Their main intention was to identify cells strongly connected to the datapath. During logic optimization of datapath components, buffers and other gates may be synthesized in order to meet timing constraints. These gates would not be considered as part of the datapath by the floorplanner, and would not be placed accordingly. By extracting regularity amongst all components strongly connected to the main datapath, the authors were able to identify a more complete datapath for the floorplanner. Their regularity extraction approach was based on defining a signature for each component connected to the datapath. The main datapath is either known beforehand by prior datapath synthesis or may be constructed by taking hints from bus names and widths. The signature of a cell is obtained by hashing the function of the cell and its connectivity to the datapath. Several cells with the same signature were grouped together into a new datapath function, which was provided to the floorplanner.

Following the signature-based regularity extraction, Chowdhary et al. [14] presented a method to extract functional regularity from netlists. Their main contribution was an algorithm to extract sub-circuits with a single output in which all fanout from intermediate nodes reconverged at or before the single output. Given a netlist, they could thus create a set of templates each of which corresponded to a set of regular instances in the circuit. However, their template extraction algorithm was restricted by two major assumptions. First, the set of extracted templates, S, was such that no template was a sub-circuit of any other template in S. Second, every extracted template had at least two instances in the netlist. Using these two assumptions, they showed that the total number of templates in a netlist with V nodes is restricted to O(V^5). Their algorithm would consider every pair of nodes in a netlist and extract the largest isomorphic subgraph rooted at those two nodes. This extraction could be done by looking at the largest templates extracted at the children of the nodes in question.

The important shortcoming of the above algorithm stems from the first assumption. The assumption ensures that only the largest subgraphs with at least two isomorphic instances in the netlist are extracted. As a result, numerous small templates would be missed simply because they were part of larger templates. Additionally, their algorithm could not extract templates that had internal outputs. The algorithm in this thesis overcomes both of these shortcomings.

Datapath Compilers

Computational systems such as ASICs typically consist of storage, control logic, and the computation engine. The computation engine, which consists of arithmetic units interconnected by wide busses, constitutes the datapath. Certain tools, such as Synopsys Module Compiler, optimize datapaths by using specialized arithmetic components like Wallace-tree multipliers and carry-save adders. Thus, they work well on datapath-dominated designs. However, they are unable to optimize control logic, or to automatically extract datapaths well from designs that have both datapath and large amounts of control.

Technology Mapping

Delay optimization during technology mapping is a rather complicated problem. Mailhot et al. [23] presented an iterative method for improving delay during technology mapping. This is reviewed in more detail in Section 4.1. Other approaches to technology mapping for area and delay minimization include simulated annealing, integer linear programming (ILP), and mapping the problem to satisfiability. However, all of them still take a long time because the basic problem is NP-complete.

FPGA and Reconfigurable Compilers

FPGA compilation involves mapping the netlist to the resources on the FPGA and subsequently placing and routing the mapped design. The place and route process is rather slow in typical FPGA compilers owing to simulated annealing back-ends [6]. There have also been efforts to perform simultaneous place and route [26] in FPGAs by only placing nodes if they are estimated to be routable. Other efforts have mapped FPGA routability to Boolean satisfiability [33] to allow for fast computation. Again, the basic problem is inherently difficult, and hence approaches that seek optimal solutions are slow. Two experiments in this thesis are based on the T-VPack and VPR FPGA mapping and place and route tools. The algorithms in those tools are described in detail in Chapter 5.

Reconfigurable compilers have a greater need to be fast owing to their integration with mainstream computing devices (on SOCs, for instance). In some cases, such compilers do not consider all architectural details in order to be fast and portable [11]. This feature makes them fast but inefficient. Compilation speed is compromised for efficiency, or efficiency for compilation speed. This is explained in more detail in Chapter 3.

What's new in this thesis?

The contribution of this thesis is the idea of static profile-driven optimization, a technique that is intended to quickly improve the area and delay results of what are otherwise difficult and time-consuming optimization problems. Static profile-driven optimization involves the identification and optimization of important sub-circuits in a netlist. The important sub-circuits are characterized based on the problem at hand, optimization constraints, etc. This thesis also presents several novel aspects that differentiate it from the earlier efforts described above:

• The sub-circuit extraction process in this thesis is not confined to regularity extraction, but also involves characterization, during which the most important structures under certain constraints are identified. Further, the extraction process itself is not restricted to standard functional isomorphism, as in [14], or to datapath connectivity, as in [2]. Different variants of isomorphism have been identified to find similar patterns in domains such as ASIC technology mapping and island-style FPGA mapping. Moreover, the most regular sub-circuit may not be a hot-spot; for example, if the delay of a netlist has to be minimized, focusing on the most regular sub-circuits will not help if they do not affect the critical path in any way.

• Static profiling is not restricted to datapath-dominated netlists, but can work equally well on control-dominated random logic structures too. In the case of netlists with large amounts of control and datapath, it can be used to extract and separate the datapath from the control logic.

• The regularity extraction in this thesis is not restricted to large templates, but can extract templates of any size, small and large, that have multiple instances in the circuit. This is important because extracting only a large template with a few instances, as in [14], overlooks a smaller template that may have more instances in the circuit. The smaller and more frequently occurring template could be a better candidate for optimization.

• The regularity extraction algorithm in this thesis is also not restricted to single-output sub-circuits, as in [14]. It instead handles DAGs with a single primary output, which may have internal fanout extending outside.

• Instead of proposing an algorithmic approach to solve an inherently difficult problem, this thesis proposes a pre-processing step during which the important areas are identified and heavily optimized (either by the tools or designers), unlike the remaining parts of the netlist. The profiler thus quickly points out portions of the netlist that are important and need attention.

1.3 Roadmap

Chapter 2 presents a general algorithm to extract and manipulate hot-spots from netlists. Most significantly, it presents methods to characterize the extracted structure and select the hot-spots in order to improve area or delay. It also describes several heuristics to enhance the speed of the algorithm, and analyzes the performance of the algorithm under different conditions.

The next four chapters describe the major experiments used to demonstrate the utility of static profiling. In Chapter 3, the result of the integration of the thesis ideas into the DIL compiler for the PipeRench reconfigurable architecture is presented. This illustrates the use of static profiling to focus human attention on sub-circuits in the netlist that the compiler underperforms on, but which have a big impact on the final result. In addition, this chapter also describes a general specification methodology for macros in a library. A single macro with a single specification could be used to generate multiple modules. Chapter 4 shows how using static profiling and hot-spots can benefit technology mapping during ASIC design. Chapter 5 and Chapter 6 demonstrate the use of static profiling in the FPGA world. These experiments illustrate how the profiler could automatically aid CAD tools, and present the integration of the profiler into the VPR general-purpose island-style FPGA mapping and place and route tool.

Chapter 7 provides a summary, conclusion and a discussion about future work.


Chapter 2

Hot-spots in Netlists

Many netlists, especially those with datapaths, have simple patterns that occur repeatedly. When a single one of these patterns is optimized, it will induce significant improvements throughout the netlist. Such a pattern is called a hot-spot. This is similar to the 90:10 "rule" in software, which states that 90% of a program's execution time is spent in 10% of the code. By optimizing only a small portion of a netlist (for example, an 8-bit adder followed by a multiplexer), significant improvements are seen throughout the netlist, because the same pattern is found in many places.

Unlike the case of software optimization, where a hot-spot has a simple metric (execution time per line of code), it is possible to define different metrics, resulting in the detection of different kinds of hot-spots. In the obvious case, finding sub-circuits with similar functionality and connectivity yields hot-spots that, when optimized, reduce the area and the delay of the circuit. Slightly less obvious is finding all portions of a graph, irrespective of their functionality, that are on the critical path. Optimizing these portions of the netlist first leads to faster circuits.

Hot-spots identify the important areas of a netlist to optimize. They may also be used to quickly predict the outcome of a netlist optimization process. For example, if a quick analysis has to be performed to determine the final area and delay of a netlist, hot-spots are the likely areas to focus on. This information would otherwise be available only after exhaustive analysis.

In order to find and use hot-spots, two important steps are required: (i) the efficient extraction of patterns from netlists and (ii) the accurate characterization of the extracted patterns according to the final objective (minimizing area, delay, etc.). Characterization is what distinguishes hot-spots from other patterns in the netlist. This chapter deals with these two processes. First, several definitions relevant to this thesis are described in Section 2.1. Then in Section 2.2, the extraction algorithm is presented. This is followed by the characterization principles in Section 2.3.

[Figure 2.1 graphic: a mux (?:) drives net n through pin p0; pins p1, p2 and p3 connect n to an ALU (+), an inverter (~) and an AND gate (&).]

Figure 2.1: Illustrations of the basic definitions used. The netlist shown on the left may be represented by the DAG shown on the right. n is a net (or wire) with the mux as its only source (driver). n has 3 destinations: the ALU, inverter and AND gate. n has 4 pins labeled p0, p1, p2 and p3. p0 connects n to its source while p1, p2, p3 connect n to its 3 destinations.

2.1 General Definitions

In this section, definitions of terms that are used throughout the thesis are presented. A netlist is represented by a directed graph. The nodes of the graph correspond to the hardware functional blocks in the netlist. Nodes are interconnected by wires, also referred to as nets. Each wire or net has a single source, which is the driver, and multiple destinations, which comprise the fanout of the wire or net. Each of the terminals of a wire or net is referred to as a pin. A net thus has a single source pin and one or more destination pins. The source pin connects the net to its source node while the destination pins connect the net to its destinations. These definitions are illustrated in Figure 2.1. Each net may also be annotated with a width, which is the number of bits required to represent the value on the net.

A subgraph or sub-circuit of a netlist is any connected subset of nodes of the netlist. The number of inputs of a subgraph G is the number of distinct nets whose sources are outside G. Thus, if a net with an external source has two destinations inside G, it constitutes a single external input. The number of outputs of a subgraph G is the number of distinct nets at least one of whose destination nodes is outside G. Thus, a net that has multiple destinations outside G contributes a single output.

For pin p of net n in netlist P, the signal slack is defined as the difference between the actual signal arrival time and the latest possible arrival time at which p would become critical. (The signal times could be the result of a static timing analysis.) The criticality, $cr_{pn}$, of pin p of net n is defined by

$$ cr_{pn} = 1 - \frac{\mathit{slack}_{pn}}{\mathit{max\_slack}_{P}} \tag{2.1} $$

where $\mathit{slack}_{pn}$ represents the slack of pin p of net n, and $\mathit{max\_slack}_{P}$ represents the maximum slack among all the pins in netlist P. Clearly, as the slack for p decreases, it becomes more critical. A pin with no slack has a criticality of one. It is also possible for a pin to have negative slack, if the signal arrival time estimated after static timing analysis is later than the required arrival time for a given clock speed or delay constraint. In this case, the criticality is greater than unity. The longest path from the source of a netlist to its sink in terms of signal propagation time is referred to as its critical path.

All pins in a netlist are divided into two distinct groups based on static timing analysis: critical pins and non-critical pins. This division is done based on the average criticality of a pin in the circuit. Pins whose criticalities exceed the average criticality belong to the set of critical pins, while all other pins are considered non-critical pins. This division based on the average criticality identifies a set of pins as potentially critical, and indicates that these pins will likely not meet timing constraints during subsequent optimization processes. It is hard to estimate the final critical path before a netlist is placed and routed, since logic units that are relatively close together in the unplaced netlist may end up being placed far apart. This uncertainty causes pins with lower than the maximum criticality to become critical after placement. The above assumption merely states that pins with above-average criticality are likely to become critical if their nets are placed unsuitably.

The criticality of a subgraph G of the netlist, $cr_G$, is defined as the number of critical pins inside G.
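As an illustration of Equation 2.1 and of the split into critical and non-critical pins, the following Python sketch computes pin criticalities from a table of slacks. The function names, the dictionary layout and the slack values are assumptions made for this example and are not taken from the thesis; the sketch also assumes that the maximum slack in the netlist is positive.

```python
def pin_criticalities(slacks):
    """Map each pin to 1 - slack / max_slack (Equation 2.1).

    `slacks` is a hypothetical dict {pin_name: slack}; the largest slack
    in the netlist is assumed to be positive.
    """
    max_slack = max(slacks.values())
    return {pin: 1.0 - s / max_slack for pin, s in slacks.items()}


def critical_pins(slacks):
    """Return the pins whose criticality exceeds the netlist average."""
    crit = pin_criticalities(slacks)
    average = sum(crit.values()) / len(crit)
    return {pin for pin, c in crit.items() if c > average}


def subgraph_criticality(slacks, pins_in_subgraph):
    """cr_G: the number of critical pins that lie inside subgraph G."""
    return len(critical_pins(slacks) & set(pins_in_subgraph))


# Made-up slacks in arbitrary time units; p3 has negative slack.
example_slacks = {"p0": 0.0, "p1": 2.0, "p2": 4.0, "p3": -1.0}
print(pin_criticalities(example_slacks))
print(sorted(critical_pins(example_slacks)))
print(subgraph_criticality(example_slacks, ["p0", "p1"]))
```

With these made-up slacks, the zero-slack pin p0 gets a criticality of one, the negative-slack pin p3 exceeds one, and both fall in the critical set because they lie above the average criticality.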

[Figure 2.2 graphic: nodes A, B, C, D, E and F, with the labels "DAG with single sink" and "internal output".]

Figure 2.2: An example single-sinked DAG, with an internal output. ABCDE is the single-sinked DAG. However, ABCDEF is not, since it has two sinks, A and F.

A single-sinked DAG (SSDAG) is a subgraph of the netlist that has a single primary output but can have multiple internal outputs. It may be defined as a subgraph in which there is a path from every node to a single sink node. Figure 2.2 shows an SSDAG. SSDAGs are used to represent a large number of electrical circuits. An SSDAG is characterized by several important properties pertaining to its subgraph. Six important properties of an SSDAG are used in this thesis:

1. the number of inputs
2. the number of outputs (the single primary output, plus all internal outputs)
3. the size, which is defined as the number of nodes within
4. the sole root node, which may be determined by a topological sort
5. the widths of the nets within
6. the criticalities of the edges or pins within

Given a pair of SSDAGs, an equivalence relation may be defined between them. The equivalence relation determines if the SSDAGs are to be considered similar or not. For instance, the equivalence relation between two SSDAGs $p_i$ and $p_j$, represented as $R(p_i, p_j)$, may imply that $p_i$ and $p_j$ are structurally isomorphic, or that the number of inputs and outputs of $p_i$ and $p_j$ are equal. This is shown pictorially in Figure 2.3.
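The input and output counts from Section 2.1 and the single-sink condition can be sketched in Python as follows. This is a minimal illustration under assumptions: the representation of a netlist as a dictionary from net name to (source node, destination nodes), the function names, and the small example netlist are all hypothetical, and the node labels are not those of Figure 2.2.

```python
def count_inputs(nets, sub):
    """Distinct nets driven from outside `sub` with a destination inside."""
    return sum(1 for src, dests in nets.values()
               if src not in sub and any(d in sub for d in dests))


def count_outputs(nets, sub):
    """Distinct nets driven from inside `sub` with a destination outside."""
    return sum(1 for src, dests in nets.values()
               if src in sub and any(d not in sub for d in dests))


def sinks(nets, sub):
    """Nodes of `sub` that drive no other node of `sub`."""
    drives_inside = {src for src, dests in nets.values()
                     if src in sub and any(d in sub for d in dests)}
    return set(sub) - drives_inside


def is_single_sinked(nets, sub):
    """In an acyclic netlist, exactly one sink means every node reaches it."""
    return len(sinks(nets, sub)) == 1


# Hypothetical netlist: net name -> (source node, destination nodes).
example_nets = {
    "n_in1": ("in1", ["b"]),
    "n_in2": ("in2", ["c"]),
    "n_b": ("b", ["d"]),
    "n_c": ("c", ["d", "f"]),   # internal output: also feeds f outside
    "n_d": ("d", ["e"]),
    "n_e": ("e", ["out1"]),     # primary output of the subgraph
    "n_f": ("f", ["out2"]),
}
sub = {"b", "c", "d", "e"}
print(count_inputs(example_nets, sub), count_outputs(example_nets, sub))
print(is_single_sinked(example_nets, sub))
print(is_single_sinked(example_nets, sub | {"f"}))
```

In this example the subgraph {b, c, d, e} has two inputs and two outputs (one primary, one internal) and a single sink e; adding f introduces a second sink, so the enlarged subgraph is no longer an SSDAG.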

[Figure 2.3 graphic: two panels, labeled "aRb: a, b have same I and O" and "aRb: a, b structurally isomorphic".]

Figure 2.3: Example equivalence relations between SSDAGs.

Given an equivalence relation R, a template is an equivalence class of SSDAGs. That is, it is a set of SSDAGs such that any two SSDAGs in the set are equivalent under R:

$$ T = \{\, p_i \mid \forall (i, j),\; R(p_i, p_j) \,\} \tag{2.2} $$

The cardinality of a template, or the number of SSDAGs in it, is referred to as its frequency. A template inherits the properties of its member SSDAGs, such as the number of inputs, outputs, size, etc., depending on the equivalence relation. An equivalence relation R is symmetric and transitive if and only if, for any SSDAGs $p_i$, $p_j$ and $p_k$, (i) $R(p_i, p_j)$ implies $R(p_j, p_i)$ and (ii) $R(p_i, p_j)$ and $R(p_j, p_k)$ together imply $R(p_i, p_k)$.
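A simple way to picture templates is to express the equivalence relation as equality of a key computed from each SSDAG, and to group SSDAGs by that key. The Python sketch below does this; the function name, the tuple layout of the SSDAG summaries and the example data are assumptions for illustration, not the data structures used in the thesis.

```python
from collections import defaultdict


def classify_into_templates(ssdags, key):
    """Group SSDAGs into templates; two SSDAGs are related iff keys match."""
    templates = defaultdict(list)
    for p in ssdags:
        templates[key(p)].append(p)
    return dict(templates)


# Hypothetical SSDAG summaries: (name, number of inputs, number of outputs).
example_ssdags = [("p1", 2, 1), ("p2", 2, 1), ("p3", 3, 2), ("p4", 2, 1)]

# Equivalence relation: "same number of inputs and outputs".
same_io = classify_into_templates(example_ssdags, key=lambda p: (p[1], p[2]))

for io, members in same_io.items():
    # The frequency of a template is the number of SSDAGs it contains.
    print(io, "frequency =", len(members), [name for name, _, _ in members])
```

Because the relation is expressed as equality of keys, it is symmetric and transitive by construction, so each SSDAG lands in exactly one group. A structural-isomorphism relation would need a canonical form of the graph as its key, which this sketch does not attempt.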

2.2 Hot-spot Extraction

In order to use hot-spots effectively, they have to be efficiently extracted from a netlist. In this section, the extraction problem is defined and a solution is presented. All sub-circuits that may be extracted are represented by SSDAGs. The hot-spots, which are basically important sets of sub-circuits, are represented by templates under a certain equivalence relation. The original netlist itself is represented as a directed acyclic graph (DAG), in which every node has a single output (that may fan out). In other words, for sequential circuits, all feedback edges are specially marked and not considered in the extraction process. Now, the extraction problem may be stated as follows:

Given a DAG G, a symmetric and transitive equivalence relation R and a positive integer L, extract all SSDAGs in G whose sizes are less than or equal to L, and classify them into templates under R.

This section presents a novel algorithm based on dynamic programming to solve the above problem. The algorithm is also described in [12]. Important heuristics that reduce the running time and memory usage of the algorithm are also presented. Before presenting the algorithm, the following important property for classifying a set of SSDAGs into templates will be proved.

Property 1: If a set of SSDAGs is classified into templates using a symmetric and transitive relation, every SSDAG can belong to one and only one template.

Proof (by contradiction): If an SSDAG p were a member of two distinct templates $T_1$ and $T_2$, then for any SSDAG $p_1$ in $T_1$ and any SSDAG $p_2$ in $T_2$, $R(p, p_1)$ and $R(p, p_2)$ are true. Since R is symmetric, $R(p, p_1)$ implies $R(p_1, p)$. Since R is transitive, $R(p_1, p)$ and $R(p, p_2)$ together imply $R(p_1, p_2)$, which means that $p_1$ is in $T_2$ and $p_2$ is in $T_1$. This is true for any SSDAGs $p_1$ and $p_2$. Therefore $T_1$ and $T_2$ are identical, which violates the initial assumption that they are distinct. Hence, p is a member of a unique template.

2.2.1 Algorithm for Extracting Hot-spots

The algorithm for extracting hot-spots is based on static graph profiling. The input is a graph G representing the netlist, and a size limit L representing the maximum size of the SSDAGs required to be extracted. The size of an SSDAG is the number of nodes it contains. The algorithm extracts all possible SSDAGs up to size L from G. It proceeds constructively using dynamic programming and builds, at each graph node N, SSDAGs of size l using previously constructed SSDAGs of sizes up to $l - 1$. The algorithm is presented in Figure 2.4. Figure 2.5 pictorially depicts the algorithm and some data structures used.

The approach is based on the following observation: each SSDAG of size l ending at node N is composed of N along with various SSDAGs ending at the source nodes of N, such that the total number of nodes including N is l. The only caveat to this is when there is reconvergent fan-in, which can be detected by taking the intersection of the nodes in the source SSDAGs of node N. In this case, the SSDAG is ignored, since it is already accounted for as a pattern of size $l - 1$ ending at node N.

The SSDAGs extracted are categorized into templates. The given equivalence relation defines the type of templates that the extracted SSDAGs are categorized into. In order to create the templates, a table containing the SSDAGs is maintained during the extraction process. Each entry in the table corresponds to a template. When an SSDAG is

Inputs: Graph G, Size l
Outputs: Table T which contains patterns of all sizes and annotations on the graph

genAllPatterns(Graph G, int size) {
    for (int l = 1; l
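The following Python sketch is not the thesis's genAllPatterns routine. It is a simpler set-growing enumeration for the extraction problem stated in Section 2.2: for every node taken as the sink, it lists all SSDAGs of size at most L ending there. The function name, the `fanin` dictionary (node to list of source nodes) and the example graphs are assumptions for illustration. Representing each SSDAG as a set of nodes means that reconvergent fan-in never counts a shared node twice, which replaces the intersection test described above.

```python
def extract_ssdags(fanin, max_size):
    """Enumerate, per sink node, all SSDAGs with at most `max_size` nodes.

    `fanin` maps each node of an acyclic netlist to its source nodes.
    Each SSDAG is returned as a frozenset of node names.
    """
    result = {}
    for sink in fanin:
        start = frozenset([sink])
        found = {start}
        frontier = [start]
        while frontier:
            current = frontier.pop()
            if len(current) >= max_size:
                continue  # size limit reached; do not grow further
            for node in current:
                for src in fanin.get(node, ()):
                    if src in current:
                        continue
                    # src drives a member, so it still reaches the sink.
                    grown = current | {src}
                    if grown not in found:
                        found.add(grown)
                        frontier.append(grown)
        result[sink] = found
    return result


# Hypothetical graphs: a small tree and a diamond with reconvergent fan-in.
tree = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
diamond = {"x": [], "y": ["x"], "z": ["x"], "w": ["y", "z"]}

for nodes in sorted(extract_ssdags(tree, 3)["d"], key=sorted):
    print(sorted(nodes))
print(len(extract_ssdags(diamond, 4)["w"]))
```

This brute-force variant stores every node set explicitly, so it is only practical for small size limits. The extracted sets could then be classified into templates under a chosen equivalence relation.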
