IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 57, NO. 1, JANUARY 2010
A Power-Efficient Multipin ILP-Based Routing Technique Ahmed Youssef, Student Member, IEEE, Zhen Yang, Student Member, IEEE, Mohab Anis, Member, IEEE, Shawki Areibi, Member, IEEE, Anthony Vannelli, Member, IEEE, and Mohamed Elmasry, Fellow, IEEE
Abstract—With the ever increasing die sizes and the accompanied increase in the average global interconnect length, delay-optimal-routing and buffer-insertion techniques are significantly straining the power budget of modern ICs. To mitigate the impact of the power consumed by the interconnects and buffers, a power-efficient multipin routing technique is proposed in this paper. The problem is based on a graph representation of the routing possibilities, with the objective of identifying the minimum power path between the interconnect source and set of sinks. The technique is tested by applying it to the International Symposium on Physical Design and IBM benchmarks to verify the accuracy, complexity, and solution quality. Results obtained indicate that an average power saving as high as 32% for the 130-nm technology is achieved with no impact on the maximum chip frequency. Index Terms—Global routing, interconnect optimization, power management.
I. INTRODUCTION
INTERCONNECT wires are gradually dominating the performance of deep-submicrometer integrated circuits (ICs). In fact, the number of interconnects and buffers for achieving timing closure is one of the primary challenges facing designers for sub-90-nm ICs. This is attributed to the continual increase in the number of logic blocks, due to the continuous shrinking of device dimensions. To mitigate the impact of the interconnects, designers have shifted their focus to interconnect-centric designs, where the wire is the center of the chip design flow [1]. The interconnect problem is a multiobjective optimization problem, where delay, power, and routing are the core objectives. Traditionally, delay and routing have been the focus of most optimization efforts [1]. A considerable amount of work, which takes wirelength and congestion into account, has been done on global routing (sequential and concurrent) such as [2]–[4]. FastRoute [3] is a high-quality and efficient global router that sequentially routes nets. BoxRouter [2], on the other hand, is a concurrent integer-linear-programming (ILP)-based router that uses progressive box expansion to formulate a
Manuscript received June 18, 2008; revised November 15, 2008. First published February 18, 2009; current version published January 27, 2010. This paper was recommended by Associate Editor Y. Massoud. A. Youssef is with Intel, Santa Clara, CA 95054 USA (e-mail: anour@vlsi.uwaterloo.ca). Z. Yang is with Orora Design Technologies, Inc., Redmond, WA 98052 USA. M. Anis, A. Vannelli, and M. Elmasry are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada. S. Areibi is with the School of Engineering “Engineering Systems and Computing,” University of Guelph, Guelph, ON N1G 2W1, Canada. Digital Object Identifier 10.1109/TCSI.2009.2015602
sequence of ILPs that incrementally route the circuit. None of these routers consider power minimization in their objective function. However, the power consumption of the interconnects is becoming a crucial factor in determining the overall chip performance [5]. To address the interconnect bottleneck, researchers have developed several subproblems that deal with the various aspects of the interconnects. These problems begin with the simple problem of buffer insertion, determining the number and the positions of buffers to minimize delay [6], and increase in complexity up to power-optimal maze routing, where a combined routing and power optimization under relaxed delay constraints is solved. Research in [7]–[14] focused on the low-power-interconnect problem through optimum buffer insertion. The work of Peng and Liu in [7] focuses on single net optimization while including the slew rate as a constraint to the optimum buffer positions. The research in [8] uses a dual-Vdd tree for buffer insertion. In the meantime, the research works in [9]–[12] focus on the optimum buffer insertion assuming a prerouted net. The works in [13] and [14] simultaneously route the interconnects while inserting the buffers to optimize for the power consumption. However, the effort in [13] focuses only on two-pin nets while the work in [14] is characterized with an excessively long run time. The research works in [15]–[17] focused on an analytical solution to the power-optimum repeater-insertion problem. The main limitation of these techniques is their inability to accommodate the buffer blockage arising from the preplaced blocks. However, these techniques can benefit from the non-power-optimal but delay-optimal methodology in [18] which proposes an analytical solution that can accommodate the buffer blockage. In order to address the limitation of the previous techniques, a formulation for a power-efficient interconnect-optimization technique is proposed in this paper. 
The goal is to find a solution through the formulation of a Power-efficient multipin ILP-based global Routing Technique (PIRT). The main technical contributions of this paper can be summarized as follows. 1) Unlike previous approaches, the newly developed approach is capable of routing multipin nets with timing optimization, buffer insertion, and power reduction efficiently. 2) The PIRT is capable of simultaneously routing all the nets on the chip, avoiding any net-ordering drawbacks. 3) The optimization of power consumption and simultaneous accounting for the buffer blockage, which has not been considered in previous analytical formulations of the power-optimization problem, is formulated. This is achieved without affecting the chip maximum frequency.
1549-8328/$26.00 © 2010 IEEE
4) The problem is formulated so that it is independent of the delay and the power models used, allowing for more flexibility in applying the new technique to scaled technologies.
The remainder of this paper is organized as follows. In Section II, background information about the interconnect-optimization efforts is presented. Section III presents the interconnect delay and power models used in this paper. Section IV introduces the ILP-based routing formulation (PIRT). Results and discussion of applying the new formulation to industrial benchmarks are presented in Section V.
II. BACKGROUND
In order to fully describe the multiobjective optimization flow presented in this paper, various efforts pertaining to interconnect optimization are revisited.
Fig. 1. Buffer insertion and sizing. (a) Buffer-sizing problem finds values of buffer sizes that satisfy the delay constraint. (b) Buffer-insertion problem finds optimum buffer positions to satisfy the delay constraint.
A. Global Routing: Unified Timing and Congestion Minimization One of the fundamental goals of global routing is to route all the nets within the circuit without overflow (i.e., congestion minimization). Several works have been proposed in this area, such as sequential approaches, rip-up-and-reroute techniques [19], multicommodity-flow-based techniques [20], and hierarchical methods [21]. As technology scales in terms of the device dimensions, the interconnect delay becomes a performance bottleneck. Therefore, minimizing congestion alone is not sufficient. To deal with this trend, several research efforts have been performed on timing-driven global routing [22], [23] (i.e., interconnect delays are considered explicitly during global routing). In [22], an approach has been suggested to simultaneously optimize congestion and delay according to a network flow formulation so that the timing slack consumptions were adaptive for the congestion distributions. In [23], the authors have formulated global routing as a multicommodity-flow problem and adopted a shadow price mechanism to incorporate the timing performance and routability into a unified objective function. Although most of these researchers have proposed solutions to offer two important and competing objectives, congestion and delay, none has considered the power minimization of realistic interconnect trees under given timing budgets. B. Buffer-Insertion-Based Methods Buffer-insertion-based techniques are effective for reducing interconnect delay. Several works have examined delay-driven buffer insertion for two-pin nets [24], [25]. Broadly characterizing the interconnect-optimization efforts has indicated two principal techniques: analytical and dynamic programming. Analytical optimization techniques have been applied to find a closed-form expression that can minimize one of the major objectives, i.e., delay, power, and routing topology, with the other objectives kept within bounds. 
Simple optimization techniques have been employed to minimize the delay of the interconnect by buffer insertion, disregarding all other aspects of the interconnect [6]. However, an important limitation to these formulations of the problem is that it has not taken into account the case when the optimum buffer positions lie on top of the preplaced functional blocks. This limitation has introduced another set of problems to find the independent buffer
Fig. 2. Wire sizing.
sizes that minimize the delay, as shown in Fig. 1(a), and the exact sizes of the wire segments that independently minimize this delay, as seen in Fig. 1(b). It is also interesting that these efforts have been extended to include the more elaborate problem of wire sizing, where the wire lines are divided into a discrete set of segments and each segment is sized independently, as shown in Fig. 2. In fact, proper wire sizing has been proven to minimize the signal delay in an interconnect line [26]. More advanced techniques have involved the combined buffer-insertion/wire-sizing problem while minimizing the power consumption of the buffers [27]. These techniques have employed a delay-relaxed formula in which the target delay is expressed relative to the minimum achievable delay on a global net through a relaxation factor. This relaxation factor was therefore used as a control knob trading off the delay of the interconnect for its power consumption. The work in [27] has demonstrated that a 15% relaxation in delay can save up to 33% of the power consumed by the interconnect buffers. These efforts have been adapted to plot the delay-power tradeoff curve in order to determine the optimal number of buffers and their respective positions [5], [28]. These power-optimization efforts lack the capacity to accommodate buffer blockage due to the presence of macroblocks that prevent buffer insertion. Typically, dynamic programming techniques depend on finding a solution within the framework of a graph theoretic representation of the problem. Usually, such techniques are derived from the extension of van Ginneken’s algorithm [29] to find the optimal delay within the framework of the optimal power. In [30], fixed buffer locations have been used, and in [31], an algorithm has been employed to find the optimal setup. In addition, Sapatnekar and Shah [32] have formulated this problem with the framework of fixed buffer locations.
Although, in most of these efforts, the blockage has not been explicitly considered, it can be easily incorporated into these methodologies.
YOUSSEF et al.: POWER-EFFICIENT MULTIPIN ILP-BASED ROUTING TECHNIQUE
To combine the efforts to find a power-optimal solution, a power-optimal maze routing methodology has been devised in [13]. This paper extends the efforts of delay-optimal maze routing and buffer insertion in [33] by considering the shortest path algorithm to find the power-optimal path. The drawback in [33] is the inability to find the routing tree for the multipin nets. These nets represent 30%–40% of the interconnects on a modern chip [34]. This issue has been addressed in [14] at the expense of a significant increase in the total run time.
III. PRELIMINARIES
In this section, some of the fundamental concepts related to the new formulation are discussed.
A. Global Routing Problem
Global routing is performed by overlaying a grid on the placement solution. Then, a graph abstraction is typically used, in which each grid cell becomes a vertex and each boundary between grid cells becomes an edge. The number of routing tracks that a grid cell encompasses determines the capacity of an edge, as shown in Fig. 3. In global routing, the connection pattern for each net that satisfies the different objectives must be decided. The input to the global routing problem consists of a netlist that indicates the interconnections between the terminals and the placement information, including the terminal positions and the location of the routing channels between them. Typically, the global routing problem is presented as a graph problem, where the routing regions and the module connections are modeled by using a grid graph. Initially, a given circuit is partitioned into a set of rectangular regions, called global bins.1 The finer the bin structure becomes, the more accurate the cell locations and net topologies. After the cells are placed in these bins, each cell is assumed to be placed at the center2 of the global bin, as shown in Fig. 3(a). From Fig. 3(b), it is clear that the global bins and edges can be transformed into a grid graph. The vertices of the graph represent the possible positions of the interconnect terminals, and the horizontal and vertical edges (called the grid edge) that lie between two adjacent vertices represent channels that can be used for wire routing. A net is an unordered set of points on the grid graph. A route (or tree) of a net is a set of grid edges for connecting all the terminals of the net. Since the routing resources are finite, each grid edge has a capacity. With such a graph representation, the graph version of the global routing problem, instead of the original problem, can be solved.
Fig. 3. Grid graph for standard-cell-based designs. (a) Global bin graph. (b) Corresponding grid graph.
1The number of global bins for the IBM and International Symposium on Physical Design (ISPD) benchmarks are given and set a priori by the designers.
2We can use the bin centers to roughly specify cell locations.
B. Global Routing Techniques
Since all versions of the global routing problem are NP-hard [19], a variety of heuristic algorithms have been developed for it. They are classified as sequential global routing and concurrent global routing algorithms. The most common approach to global routing is sequential routing.3 In such an approach, the nets are first ordered according to their importance, and then, based on the ordering, the nets are routed sequentially. The quality of a sequential global router largely depends on the ordering of nets. Due to the sequential nature of these techniques, they fail to give adequate results. Moreover, the sequential heuristic techniques cannot provide a key answer as to whether a feasible solution exists. In other words, if they fail to yield a feasible solution, it is not clear whether this is attributable to the nonexistence of a feasible solution or due to the shortcomings of the heuristic. Moreover, when a heuristic does find a feasible solution, it is not known whether this solution is optimal or how far it is from optimality. To avoid the net-ordering problem and to make the solution more predictable, concurrent-based global routing algorithms, in the form of integer programming models, have been developed to route all the nets simultaneously. In the mathematical-programming-based approach, global routing is formulated as a 0/1 integer programming problem. Given a set of Steiner trees4 for each net and a routing graph, the objective of the integer programming technique is to select a tree for each net from its set of Steiner trees without violating the channel capacities while minimizing the total wirelength. This approach tends to result in a more global solution, and no initial ordering of the nets is required. A general formulation of the ILP-based global routing is as follows:
\[
\begin{aligned}
\min\ & \sum_{j=1}^{t} c_j x_j \\
\text{s.t.}\ & \sum_{j \in N_k} x_j = 1, \quad k = 1, \ldots, n \\
& \sum_{j=1}^{t} a_{ij} x_j \le u_i, \quad i = 1, \ldots, m \\
& x_j \in \{0, 1\}, \quad j = 1, \ldots, t
\end{aligned}
\tag{1}
\]
where $n$ is the total number of nets, $N_k$ is the set of trees generated for net $k$, $m$ is the total number of edges on the grid graph, and $t$ is the total number of trees built for all the nets. In this model, the global routing problem is formulated as a 0/1 ILP problem by associating a variable $x_j$ with each tree which connects a net. The variable $x_j$ equals “1” if that particular tree is selected and “0” otherwise. The constant $c_j$ is the cost of connecting a net using the $j$th tree. $u_i$ is the edge capacity representing the routing supply of each edge. $a_{ij}$ is a binary variable which is equal to 1 if tree $j$ passes through edge $i$. There are two types of constraints in this formulation. The selection of one and only one routing for each net is forced by the first set of constraints. The edge capacity requirements of each edge are represented by the second set of constraints. The time to solve the integer programming problem increases exponentially with the number of Steiner trees generated in the formulation of the program. Therefore, the techniques to solve the ILP-based global routing problem efficiently become a significant issue.
3Several industrial tools utilize Maze Routers as a solver.
4Flute [35] is used for the Steiner tree construction.
C. Interconnect Modeling
In this paper, the Elmore delay model is chosen for its simplicity [36] for modeling the delay of the interconnects. The model states that the delay $t_i$ from the source node to node $i$ (as shown in Fig. 4) is calculated as follows:
\[
t_i = \sum_{k=1}^{N} R_{ki} C_k \tag{2}
\]
where $R_{ki}$ represents the resistance shared among the paths from the source node to nodes $i$ and $k$, $C_k$ is the capacitance of each wire segment between the nodes (Fig. 4), and $N$ is the number of nodes.
Fig. 4. RC tree [6].
To estimate the interconnect wire capacitance, the model introduced in [37] is used. It accounts for the fringing, coupling, and plate capacitances for an interconnect according to the structure in Fig. 5. The unit length capacitance of the wire5 is expressed in (3) as a combination of the total wire fringing and plate capacitance and the coupling capacitance between the metal line and the adjacent metal lines. The fringing and plate capacitance is computed as in [37] using (4), which depends on the interlayer-dielectric permittivity, the wire width, the wire thickness, the spacing between the wires, and the spacing between the wire and the ground planes. The coupling capacitance is given by (5).
Fig. 5. Wire structure for capacitance extraction.
5The equations were used and evaluated by the Berkeley Predictive Technology Model which is provided by the Device Group at the University of California, Berkeley [38].
The International Technology Roadmap for Semiconductors [39] predictions are employed to estimate the interconnect wire resistance. Note that, although an analytical model is chosen for the capacitance and resistance calculations, without loss of generality, more exact extraction techniques can be used with no modification to the problem formulation in the next section. In practice, the number of sinks connected to the same driver without a buffer between the sinks is small [40]. In fact, connecting too many sinks to a single source results in excessive delays that cannot be recovered by buffer insertion. Accordingly, if the degree of this net configuration is limited, the wirelengths of the wire segments in Fig. 4 can be computed by generating a limited set of Steiner trees between the driver and the set of sinks connected to it. On the other hand, to estimate the total power consumed by the driver and sink gates and the driven interconnect, the power consumption model employed in [5] is used. The switching power is hence calculated using (6) and (7) from the supply voltage, the clock frequency, the widths of the driver and of all the sink gates connected to it, the switching factor, which is assumed to be 0.15 [6], the wire capacitance per unit length, the output capacitance of the driving gate, the loading capacitance of each sink gate, and the total Steiner tree wirelength. The following section presents the performance- and power-driven ILP-based global routing technique followed by experimental results.
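As a rough illustration of the Elmore model in (2), the following Python sketch computes the delay from the source to a chosen sink of a small RC tree. The tree encoding (the `parent`, `resistance`, and `capacitance` dictionaries), the node names, and all element values are hypothetical and are not taken from the paper; units are arbitrary.

```python
def path_nodes(parent, node):
    """Nodes whose incoming wire segment lies on the source-to-node path."""
    nodes = set()
    while parent[node] is not None:
        nodes.add(node)
        node = parent[node]
    return nodes


def elmore_delay(parent, resistance, capacitance, sink):
    """Elmore delay to `sink`: sum over nodes k of (shared resistance) * C_k.

    resistance[v] is the resistance of the segment entering node v,
    capacitance[v] is the capacitance lumped at node v.
    """
    sink_path = path_nodes(parent, sink)
    delay = 0.0
    for k, c_k in capacitance.items():
        # Resistance shared by the source-to-sink and source-to-k paths.
        shared = sink_path & path_nodes(parent, k)
        delay += sum(resistance[v] for v in shared) * c_k
    return delay


# Hypothetical tree: source "s" drives node "a", which branches to "b" and "c".
parent = {"s": None, "a": "s", "b": "a", "c": "a"}
resistance = {"a": 2.0, "b": 3.0, "c": 4.0}
capacitance = {"a": 1.0, "b": 2.0, "c": 3.0}

print(elmore_delay(parent, resistance, capacitance, "b"))
print(elmore_delay(parent, resistance, capacitance, "c"))
```

The sketch follows the shared-resistance definition directly rather than the faster two-pass traversal, which keeps it close to the form of (2) at the cost of efficiency.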
Fig. 6. PIRT flowchart.
IV. POWER-EFFICIENT MULTIPIN ILP-BASED GLOBAL ROUTING
In order to optimize the routing and interconnect delay, as well as to reduce the power consumed by the interconnect buffers, a PIRT is proposed in this section. This performance- and power-aware routing algorithm is based on the ILP-based global routing model in [41]. The goal of the proposed PIRT is to find an efficiently powered buffer path for each net without violating the delay and routability constraints. The routing area is modeled by using a routing grid that is similar to the one in Fig. 3(b). In this grid, each vertex represents a possible buffer location, and each edge represents a unit-length wire segment. In this case, the problem can be formally described as follows.
PIRT: Given a predefined routing grid, a buffer library, a buffer function that indicates whether a buffer is allowed at each vertex, and a set of nets, find the minimum power path for each net, subject to a delay constraint.
A. PIRT Phases
The PIRT flow can be illustrated by dividing it into two distinct phases, as shown in Fig. 6. First, an initialization phase (phase I) is invoked, where the initial minimum Steiner trees are constructed for each net. To reduce the global routing congestion, additional detoured trees are also built for each net. In order to create enough routing alternatives considering the delay requirements, several buffered trees (i.e., trees with buffers inserted) are built for each net, if possible. Next, a power-optimization phase (phase II) is invoked. This phase attempts to find a low-power route (i.e., tree) for each net so that the total power of all the nets is minimized while satisfying the delay requirement of the chip.
B. Phase I (Initialization)
In order to perform the interconnect power optimization in phase II, several routing alternatives are needed for each net.
Creating routing alternatives allows the ILP model in phase II to pick a globally optimum solution that covers all the nets on the chip simultaneously. The routing alternatives are either unbuffered or buffered routes that connect the interconnect source to its sinks.
1) Unbuffered Tree Construction: The first step in this phase is to produce a set of admissible routes for each net. In a practical circuit, the terminals in a net are connected by horizontal and vertical wires. Therefore, only the rectilinear spanning trees or rectilinear Steiner trees are considered in the tree construction process. These trees become the unknown variables of the ILP problem. Obviously, the number of trees for each net should not be too large, since the time complexity of the ILP problem is a function of the number of trees. However, a number of trees should be built for each net to guarantee the feasibility of the problem. To remedy this problem, an additional tree-generation step is proposed in [41] to reduce the number of trees created for each net, while ensuring that the constructed trees will likely result in a promising (feasible and optimal) solution. Initially, the potentially congested areas in the routing graph are predicted by a heuristic technique [41] in the congestion estimation stage. This a priori congestion information is then used in the additional tree-generation stage6 to eliminate the congested areas by adding trees to the nets passing through these areas iteratively. The congestion of the circuit is re-estimated after each additional tree-generation step.
6The number of trees generated for each net is variable and depends on congestion estimation.
2) Complexity Analysis of Tree Generation: The complexity of the initial tree construction for each net is a function of the degree of the routed net [35], and the complexity of the tree construction for the whole chip further depends on the total number of routed nets. The complexity of congestion estimation depends on the total number of edges in the graph. The complexity of the additional tree construction depends on the total number of nets that pass through the congested edges [41]. Bounding these quantities gives the overall complexity of the initial tree construction, congestion estimation, and additional tree construction.
3) Buffered Tree Construction: Since unbuffered nets are not guaranteed to achieve the timing requirements of the chip, a set of buffered routes has to be added to the unbuffered trees generated earlier to extend the alternatives presented for the optimization step at phase II. However, due to the preplaced modules, there is usually a limited number of available buffer locations. In this formulation, each location is expected to allow for a limited number of buffers to be inserted. Accordingly, in our current implementation, for each tree of a two-terminal net, at most two buffered trees are produced: one with one buffer inserted and the other with two buffers inserted, as shown in Fig. 7. The buffer-insertion process of three-terminal nets begins by identifying the Steiner point or middle point of the net, as shown in Fig. 8. Following this, each net is divided into two or three segments, based on the presence of either a Steiner point or a middle point. If one segment is much longer than the other segments, the buffer insertion is considered for only the longest segment. Otherwise, all the segments are sorted in descending order, based on their length, and one buffer is inserted in each segment. For the generated buffered two- and three-terminal nets, only the trees where the addition of a buffer results in the reduction of the delay are added to the routing alternatives for phase II. Limiting the alternatives to two buffered trees per unbuffered tree assists in limiting the run time of the algorithm. However, this limit can be removed at the expense of longer run time. Also, limiting the number of buffers to at most two is a run-time tradeoff that can be tuned. In addition, due to the limited availability of buffer locations, all the nets are first sorted in descending order according to their half-perimeter wirelength such that the long nets have a higher priority for buffered-route generation. However, the priority of short critical or highly loaded nets can be easily elevated to allow for buffered routes to be created for these nets. It is important to note that the buffered-route-generation priority does not affect the nonordering nature of the formulation, particularly for the congestion consideration in phase II.
Fig. 7. Buffer insertion for two-terminal nets. (a) Buffered tree with one buffer. (b) Buffered tree with two buffers.
Fig. 8. Buffer insertion for three-terminal nets. (a) Three-cell net with Steiner point. (b) Three-cell net with middle point.
4) Complexity Analysis of Buffered Tree Generation: Fig. 9 shows the pseudocode of the buffer generation algorithm. The time complexity of step 1 (i.e., sorting all the nets) is $O(n \log n)$, where $n$ is the total number of routed nets. For step 2, the complexity is $O(kn)$, where $k$ is the number of trees built for each net. Since $k$ is a constant, the complexity of step 2 is $O(n)$. Accordingly, the complexity of the whole algorithm is $O(n \log n)$.
C. Phase II (Power Minimization)
To achieve the PIRT’s objective of minimizing the power consumption of the interconnect while satisfying the chip’s delay constraints, a constant $T_{\max}$ is needed. $T_{\max}$ is the maximum acceptable delay for all the nets, which is derived from the clock frequency set by the product specifications. Since any net having a shorter delay than $T_{\max}$ is unnecessarily fast, PIRT strives to trade this delay slack for power consumption. Accordingly, the PIRT power minimization under the $T_{\max}$ constraint is presented as follows:
\[
\min \sum_{j=1}^{t} w_j x_j \tag{8}
\]
subject to
\[
\sum_{j \in N_k} x_j = 1, \quad k = 1, \ldots, n \tag{9}
\]
\[
d_j x_j \le T_{\max}, \quad j = 1, \ldots, t \tag{10}
\]
\[
\sum_{j=1}^{t} a_{ij} x_j \le u_i + z_i, \quad i = 1, \ldots, m \tag{11}
\]
which is similar to (1) introduced in Section III-B. The global routing problem is formulated as a 0/1 ILP problem by associating a variable $x_j$ with each tree which connects a net. The variable $x_j$ equals 1 if that particular tree is selected and 0 otherwise. The first constraint ensures that only a single tree among all the possible trees generated for net $k$ is selected. $w_j$ is the weight associated with the power of tree $j$ and is calculated according to (12). The power of tree $j$ 7 is modeled and calculated by (6). For the second constraint, $d_j$ is the delay of tree $j$ calculated using (2). This constraint ensures that, for all the selected nets, the delay of each net does not exceed $T_{\max}$. For the third constraint, all the possible tree combinations created for each net in the two routing layers are represented by a (0,1) matrix $A = [a_{ij}]$, with the $i$th row corresponding to the $i$th edge in the grid graph and each column corresponding to the possible tree combinations for each net. The element $a_{ij}$ is 1 if tree $j$ passes through edge $i$ and 0 otherwise.
7Low-activity nets can be pruned quickly from the search space to eliminate unnecessary run-time overhead.
TABLE I ISPD98, IBM, AND ISPD2007 BENCHMARK STATISTICS
Fig. 9. Buffer-insertion algorithm.
Since the routing resources are finite, each grid edge has a capacity. The capacity of each edge is represented by constraint (11), which involves the number of edges on the grid graph, the total number of trees produced for all the nets, and the capacity of each edge. In order to achieve a global routing solution, a slight allowance on the overflow is set by a variable associated with the routing overflow of each edge. This variable has an upper bound, which is a positive value obtained from experimental results. The overflow value represents the minimum number of extra tracks needed by the detailed router to achieve a fully routed chip. All the aforementioned constraints introduced in this model guarantee a feasible solution within the ILP model. Solving the minimization problem set by (8) ensures that the final net selection simultaneously achieves the delay constraints and minimizes the power consumption.
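The selection problem in (8) to (11) can be illustrated on a toy instance. The Python sketch below enumerates every combination of one candidate tree per net and keeps the lowest-power combination that meets the delay bound and the edge capacities. Exhaustive enumeration stands in here for an ILP solver and is only practical for tiny instances; the function name `select_trees`, the dictionary layout, the uniform `max_overflow` allowance, and all numerical values are assumptions made for illustration and do not come from the paper.

```python
from itertools import product


def select_trees(candidates, capacity, t_max, max_overflow=0):
    """Pick one tree per net with minimum total power.

    candidates: {net: [{"power": float, "delay": float, "edges": set}, ...]}
    capacity:   {edge: number of available tracks}
    Returns (choice, power); choice maps each net to a tree index,
    or (None, inf) if no feasible combination exists.
    """
    nets = list(candidates)
    best_choice, best_power = None, float("inf")
    for combo in product(*(range(len(candidates[n])) for n in nets)):
        trees = [candidates[n][j] for n, j in zip(nets, combo)]
        # Delay constraint: every selected tree must meet the delay bound.
        if any(t["delay"] > t_max for t in trees):
            continue
        # Capacity constraint with a bounded overflow allowance per edge.
        usage = {}
        for t in trees:
            for e in t["edges"]:
                usage[e] = usage.get(e, 0) + 1
        if any(u > capacity.get(e, 0) + max_overflow for e, u in usage.items()):
            continue
        power = sum(t["power"] for t in trees)
        if power < best_power:
            best_choice, best_power = dict(zip(nets, combo)), power
    return best_choice, best_power


# Hypothetical instance: two nets, two candidate trees each, two grid edges.
candidates = {
    "A": [{"power": 5.0, "delay": 1.0, "edges": {"e1"}},
          {"power": 3.0, "delay": 2.0, "edges": {"e2"}}],
    "B": [{"power": 3.5, "delay": 1.5, "edges": {"e1"}},
          {"power": 2.0, "delay": 1.8, "edges": {"e2"}}],
}
capacity = {"e1": 1, "e2": 1}

print(select_trees(candidates, capacity, t_max=2.0))
print(select_trees(candidates, capacity, t_max=1.9))
```

In this instance, tightening the delay bound from 2.0 to 1.9 excludes the low-power tree of net A, so the selection shifts to a higher-power combination, which mirrors the delay-for-power trade described above.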
TABLE II TECHNOLOGY AND EQUIVALENT CIRCUIT MODEL PARAMETERS FOR GLOBAL INTERCONNECTS
V. EXPERIMENTAL RESULTS
The ILP-based router is implemented in C++ on a 900-MHz Sun Blade 2000 workstation with 1 GB of memory. Flute [35] is used for the Steiner tree construction, and the ILOG CPLEX 10.0 package is used as the ILP solver. Table I shows the statistics of the ISPD98 IBM benchmarks [42] and ISPD2007 benchmarks [43] used in this paper. It is important to note that two- and three-terminal nets constitute the majority of the nets in all the test benchmarks. Accordingly, the following experiments perform global routing on all the nets while constraining the power optimization to the two- and three-terminal nets. The column “Total Nets” indicates the total number of nets in each benchmark. The column “H/V Cap” lists the horizontal and vertical edge capacity. Additionally, Table I lists the grid and chip sizes to demonstrate the size of the routing problem. Based on the work in [13], the chosen 130-nm technology requires a buffer library that spans the range between 5 and 15 times the minimum-sized buffer to efficiently buffer the nets. Accordingly, the buffer library for this paper represents three types of buffers: weak, medium, and strong. They are chosen to correspond to three different buffer sizes: 5, 10, and 15 times the minimum buffer size of the 130-nm technology. This allows the verification of the technique for different buffer-driving capabilities. Table II lists the parameters of the 130-nm technology used for the power and delay calculation. To fully quantify the performance of the proposed technique, the power savings, routing quality, and run time need to be discussed. Consequently, the following sections introduce the results for these parameters when the PIRT is applied to the multipin netlists in different benchmarks.
A. Power Savings To calculate the power savings achieved by applying PIRT, the value of the maximum delay target needs to be identified. In addition, a baseline that does not perform power optimization has to be established for comparison purposes. In the absence of clock frequency constraints in the IBM and ISPD benchmarks, this value has to be identified using the available routing information. Since the maximum delay target is the longest net delay when the chip is properly routed, i.e., all nets are optimized for their minimum delay, a delay-minimization model that uses the data from PIRT’s phase I and solves for the minimum delay instead of minimum power will properly route the chip, and the longest net delay can be easily extracted. In addition, the power consumed by the final routing solution of the delay-minimization model is an appropriate baseline for identifying the power savings achieved by PIRT. The details of the delay-minimization model, which is similar to the work in [41], are described in the Appendix.
1) Power Savings Comparison: Fig. 10 shows the power savings achieved when the PIRT is applied to the various benchmarks in Table I, relative to the power consumption baseline established using the delay-minimization model, together with the average power savings for the IBM and ISPD benchmarks. This figure also compares the power savings when considering only the two- and three-terminal nets optimized in phase II with the total power savings when all the nets are included.
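The bookkeeping behind this comparison can be sketched on a toy instance. The code below is an illustration under strong assumptions, not the PIRT or the delay-minimization ILP: it ignores edge capacities and congestion, picks trees per net independently, and uses hypothetical names (`Tree`, `delay_optimal`, `power_optimal`, `savings`) and made-up delay and power numbers in arbitrary units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    """One candidate buffered routing tree for a net (illustrative fields)."""
    delay: float  # net delay if this tree is chosen
    power: float  # power consumed if this tree is chosen


def delay_optimal(candidates):
    """Baseline: choose the minimum-delay tree for every net."""
    return {net: min(trees, key=lambda t: t.delay) for net, trees in candidates.items()}


def power_optimal(candidates, delay_target):
    """Choose, per net, the lowest-power tree that still meets the delay target."""
    chosen = {}
    for net, trees in candidates.items():
        feasible = [t for t in trees if t.delay <= delay_target]
        chosen[net] = min(feasible, key=lambda t: t.power)
    return chosen


def total_power(selection, nets=None):
    """Sum the power of the selected trees, optionally over a subset of nets."""
    nets = selection.keys() if nets is None else nets
    return sum(selection[n].power for n in nets)


def savings(baseline, optimized, nets=None):
    """Fractional power saving of `optimized` relative to `baseline`."""
    p_base = total_power(baseline, nets)
    return (p_base - total_power(optimized, nets)) / p_base


candidates = {
    "n1": [Tree(10.0, 8.0), Tree(12.0, 5.0)],  # critical net: no slack to trade
    "n2": [Tree(4.0, 6.0), Tree(9.0, 3.0)],    # noncritical net: slack available
    "n3": [Tree(5.0, 10.0)],                   # net left unoptimized
}
baseline = delay_optimal(candidates)
delay_target = max(t.delay for t in baseline.values())  # longest net delay
optimized = power_optimal(candidates, delay_target)

saving_optimized_nets = savings(baseline, optimized, ["n1", "n2"])
saving_all_nets = savings(baseline, optimized)
```

Here the saving over the two optimized nets is 3/14 (about 21%), while the saving over all three nets is 3/24 (12.5%), which mirrors why the reported totals are lower once unoptimized nets are included. The worst net delay stays at the target in both selections.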
Fig. 10. Power savings by PIRT and the average for IBM and ISPD benchmarks (strong buffer size).
Fig. 12. Second worst net slack change between a delay-optimized design (no PIRT) and a power-optimized design (with PIRT).
TABLE III COMPARISON OF DELAY- AND POWER-MINIMIZATION MODELS
Fig. 11. Average power reduction over different buffer sizes (5, 10, and 15 times the minimum-sized buffer).
For the two- and three-terminal nets, the average savings for the IBM and ISPD2007 benchmarks are 16% and 32%, respectively. These values drop to 6.5% and 19%, respectively, when the remaining unoptimized nets are factored in. A comparison of the results for the IBM and ISPD benchmarks indicates a significant increase in savings for the ISPD benchmarks. This can be attributed to the fact that the larger ISPD benchmarks have a higher number of long buffered nets, which are the primary target of the PIRT. Since the trend for future chips is an increase in size (although the area might not change, the number of nets and buffers is growing exponentially), this enhanced performance of the PIRT for the ISPD benchmarks emphasizes the importance of the PIRT as a power management technique. Finally, a comparison of the PIRT average power savings for the different buffer sizes in Fig. 11 shows that PIRT has more potential for libraries with larger buffers. This is because removing a large, power-hungry buffer from the strong library yields higher savings than removing one of the weaker, less power-hungry buffers. Although it is tempting to route the nets using only the weakest buffer from the beginning, doing so significantly degrades the signal integrity of the routed nets, whereas a mixed buffer library ensures that the nets perform well over a wide range of chip sizes.
B. Delay Performance
Since global routing is an important step in the design of any chip, the impact of the global router on the delay of the chip is of major importance. Due to the constraint in (10), the PIRT never picks a net that violates the maximum delay target; accordingly, the worst net delay always achieves the target frequency set forth by the design and represented by the inverse of this delay target. However, it is also interesting to see how the PIRT affects the delay of the second worst net on the chip. Fig. 12 shows the slack of the second worst net for a delay-optimized design (no PIRT) and a power-optimized design (with PIRT). It is evident that the effect of the PIRT is to reduce the slack of the second worst net and translate it into power savings. Consequently, the average reduction of the slack from 29% to 7% explains the savings in Fig. 10.
C. Routing Quality Table III provides the total wirelength, number of bends, and overflow of the delay- and power-minimization models for a buffer size of 15 times the minimum-sized buffer in the 130-nm technology. As shown, the power-minimization model reduces the total overflow significantly without increasing the total wirelength and total number of bends for both the IBM and ISPD benchmarks. The overflow is reduced since PIRT allows for a more relaxed constraint on the noncritical nets, allowing for more detours. Although the wirelength of some nets might slightly increase, the freed-up tracks usually allow for an overall reduction of power consumption by properly routing a heavily loaded net. It is interesting to note that the worst case overflow in ibm10 represents only 1.9% of the total nets.
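To make the routing-quality metrics concrete, the following sketch computes wirelength and bend count for a two-pin route on a grid. It is an illustration only: the path representation (a list of corner points), the function names, and the example coordinates are assumptions, and every segment is assumed to be axis-aligned.

```python
def wirelength(path):
    """Total Manhattan length of a route given as a list of (x, y) corner points."""
    return sum(
        abs(x2 - x1) + abs(y2 - y1)
        for (x1, y1), (x2, y2) in zip(path, path[1:])
    )


def bends(path):
    """Number of direction changes between consecutive axis-aligned segments."""
    directions = []
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if (x1, y1) == (x2, y2):
            continue  # skip zero-length segments
        directions.append("H" if y1 == y2 else "V")
    return sum(1 for a, b in zip(directions, directions[1:]) if a != b)


# Two hypothetical routes between the same pins (0, 0) and (3, 2).
l_shape = [(0, 0), (3, 0), (3, 2)]
detour = [(0, 0), (0, 1), (3, 1), (3, 2)]
```

Both routes have a wirelength of 5 grid units, but the second uses two bends instead of one. A detour of this kind can move a noncritical net off a congested edge without lengthening it.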
TABLE IV COMPUTATION TIME COMPARISON OF PIRT FOR BUFFER SIZE 15× (130 nm)
TABLE V COMPUTATION TIME COMPARISON WITH POWER-DRIVEN ROUTERS
Fig. 13. Buffer-location generation.
D. Computation Time Table IV compares the computation time of the different phases of the PIRT. Columns “T-Time,” “B-Time,” “S-Time,” and “Tot-Time” report the computation times of the tree-construction phase, the buffer-insertion phase, the power- or delay-model solving phase, and the total, respectively. It is clear that most of the computation time is consumed by the tree-construction phase, which is common to most routing algorithms. The buffer-insertion and power-minimization phases consume only a very small part of the total computation time. Due to this small overhead, the PIRT achieves its goal of power reduction with only a small increase in the total run time. This enables many existing routing techniques to benefit from the inclusion of the PIRT for power minimization. Finally, Table V compares the total run time of PIRT with that of state-of-the-art power-minimization techniques. Since all existing techniques target the power minimization of single nets and expect a sequential router to finalize the chip routing, they do not account for congestion. On the other hand, PIRT is capable of minimizing power while managing congestion with a considerably lower average run time.
VI. CONCLUSION AND FUTURE WORK Timing optimization and low power are important goals in global routing, particularly in deep-submicrometer designs. Previous efforts that focused on power optimization for global routing are hindered by excessively long run times or by routing only a subset of the nets. Accordingly, this paper has presented a power-efficient multipin global routing technique (PIRT). This ILP-based technique strives to find a power-efficient global routing solution. The results indicate that an average power saving as high as 32% for the 130-nm technology can be achieved with no impact on the maximum chip frequency.
APPENDIX Delay Minimization: In order to define the maximum delay target and the power consumption of a delay-optimal routed chip, the following delay-minimization model is solved:
(13)
where the decision variable represents a tree built for a net, each edge has a routing supply, and a variable is associated with the routing overflow of each edge. Each tree has a weight that is associated with its delay and is calculated by the following:
(14)
The delay of a tree is modeled and calculated by (2). In order to solve the delay-minimization model, a number of allowable buffer locations (about 10% of the total vertices) are generated, as shown in Fig. 13, following the buffer-location generation in phase I. After the delay-minimization problem is solved, one tree is selected for each net such that the total delay of all the nets is minimized under the routing overflow constraint. The delay of a net equals the delay of its selected tree. The maximum delay of a circuit is defined as the largest net delay after the delay minimization. In addition, the power consumed by the final solution represents the baseline power consumption for the calculation of the power savings.
REFERENCES
[1] J. Cong, L. He, C. Koh, and P. H. Madden, “Performance optimization of VLSI interconnect layout,” Integration, VLSI J., vol. 21, no. 1/2, pp. 1–94, Nov. 1996.
[2] M. Cho and D. Z. Pan, “BoxRouter: A new global router based on box expansion and progressive ILP,” in Proc. ACM/IEEE Des. Autom. Conf., San Francisco, CA, 2006, pp. 373–378.
[3] M. Pan and C. Chu, “FastRoute 2.0: A high-quality and efficient global router,” in Proc. ASP Des. Autom. Conf., 2007, pp. 250–255.
[4] J. Roy and I. Markov, “High-performance routing at the nanometer scale,” in Proc. ICCAD, San Jose, CA, 2007, pp. 496–502.
[5] K. Banerjee and A. Mehrotra, “A power-optimal repeater insertion methodology for global interconnects in nanometer designs,” IEEE Trans. Electron Devices, vol. 49, no. 11, pp. 2001–2007, Nov. 2002.
[6] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits. Englewood Cliffs, NJ: Prentice-Hall, 2003.
[7] Y. Peng and X. Liu, “Low-power repeater insertion with both delay and slew rate constraints,” in Proc. ACM/IEEE Des. Autom. Conf., 2006, pp. 302–307.
[8] K. H. Tam and L. He, “Power optimal dual-Vdd buffered tree considering buffer stations and blockages,” in Proc. ACM/IEEE Des. Autom. Conf., 2005, pp. 497–502.
[9] Y. P. X. Liu and M. Papaefthymiou, “Practical repeater insertion for low power: What repeater library do we need?,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 5, pp. 917–924, May 2006.
[10] Y. Peng and X. Liu, “Freeze: Engineering a fast repeater insertion solver for power minimization using the ellipsoid method,” in Proc. ACM/IEEE Des. Autom. Conf., 2005, pp. 813–818.
[11] V. Wason and K. Banerjee, “A probabilistic framework for power-optimal repeater insertion in global interconnects under parameter variations,” in Proc. Int. Symp. Low Power Electron. Des., 2005, pp. 131–136.
[12] W. T. Cheung and N. Wong, “Power optimization in a repeater-inserted interconnect via geometric programming,” in Proc. Int. Symp. Low Power Electron. Des., 2006, pp. 226–231.
[13] A. Youssef, M. Anis, and M. Elmasry, “POMR: A power-aware interconnect optimization methodology,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 3, pp. 297–307, Mar. 2005.
[14] A. Youssef, T. Myklebust, M. Anis, and M. Elmasry, “A low-power multi-pin maze routing methodology,” in Proc. IEEE Int. Symp. Quality Electron. Des., Mar. 2007, pp. 153–158.
[15] P. Kapur, G. Chandra, and K. C. Saraswat, “Power estimation in global interconnects and its reduction using a novel repeater optimization methodology,” in Proc. Des. Autom. Conf., 2002, pp. 461–466.
[16] H. Fatemi, B. Amelifard, and M. Pedram, “Power optimal MTCMOS repeater insertion for global buses,” in Proc. Int. Symp. Low Power Electron. Des., 2007, pp. 98–103.
[17] K. Nose and T. Sakurai, “Power-conscious interconnect buffer optimization with improved modeling of driver MOSFET and its implications to bulk and SOI CMOS technology,” in Proc. Int. Symp. Low Power Electron. Des., 2002, pp. 24–29.
[18] C. J. Alpert, J. Hu, S. S. Sapatnekar, and C. C. N. Sze, “Accurate estimation of global buffer delay within a floorplan,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 6, pp. 1140–1145, Jun. 2006.
[19] N. Sherwani, Algorithms for VLSI Physical Design Automation. Boston, MA: Kluwer, 1999.
[20] C. Albrecht, “Global routing by new approximation algorithms for multicommodity flow,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 20, no. 5, pp. 622–632, May 2001.
[21] J. Cong, J. Fang, and Y. Zhang, “MARS–A multilevel full-chip gridless routing system,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 3, pp. 382–394, Mar. 2005.
[22] J. Hu and S. Sapatnekar, “A timing-constrained algorithm for simultaneous global routing of multiple nets,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2000, pp. 99–103.
[23] T. Jing, X. L. Hong, J. Y. Xu, C. K. Cheng, and J. Gu, “UTACO: A unified timing and congestion optimization algorithm for standard cell global routing,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 3, pp. 358–365, Mar. 2004.
[24] I.-M. Liu, A. Aziz, D. F. Wong, and H. Zhou, “An efficient buffer insertion algorithm for large networks based on Lagrangian relaxation,” in Proc. Int. Conf. Comput. Des., 1999, pp. 210–215.
[25] R. Chen and H. Zhou, “Efficient algorithms for buffer insertion in general circuits based on network flow,” in Proc. ICCAD, 2005, pp. 509–514.
[26] C. Chu and D. F. Wong, “Closed form solution to simultaneous buffer insertion/sizing and wire sizing,” ACM Trans. Design Autom. Electron. Syst., vol. 6, no. 3, pp. 343–371, Jul. 2001.
[27] R. Otten and G. S. Garcea, “Simultaneous analytic area and power optimization for repeater insertion,” in Proc. Int. Conf. Comput.-Aided Des., 2003, pp. 568–573.
[28] J. C. Eble, V. K. De, D. S. Wills, and J. D. Meindl, “Minimum repeater count, size, and energy dissipation for gigascale integration (GSI) interconnects,” in Proc. Int. Interconnect Technol. Conf., Jun. 1998, pp. 56–58.
[29] L. P. P. P. van Ginneken, “Buffer placement in distributed RC-tree networks for minimal Elmore delay,” in Proc. Int. Symp. Circuits Syst., 1990, pp. 865–868.
[30] J. Cong, C. Koh, and K. Leung, “Simultaneous buffer and wire sizing for performance and power optimization,” in Proc. Int. Symp. Low Power Electron. Des., 1996, pp. 271–276.
[31] J. Lillis, C. Cheng, and T. Lin, “Simultaneous routing and buffer insertion for high performance interconnect,” in Proc. Great Lakes Symp. VLSI, 1996, pp. 148–153.
[32] J. Shah and S. Sapatnekar, “Wiresizing with buffer placement and sizing for power-delay tradeoffs,” in Proc. VLSI Des., 1996, pp. 346–351.
[33] M. Lai and D. Wong, “Maze routing with buffer insertion and wiresizing,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 10, pp. 1205–1209, Oct. 2002.
[34] J. Cong and Z. Pan, “Interconnect performance estimation models for design planning,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 20, no. 6, pp. 739–752, Jun. 2001.
[35] C. Chu, “FLUTE: Fast lookup table based wirelength estimation technique,” in Proc. IEEE Comput. Aided Des., 2004, pp. 696–701.
[36] W. C. Elmore, “The transient response of damped linear networks,” J. Appl. Phys., vol. 19, no. 1, pp. 55–63, Jan. 1948.
[37] S. C. Wong, G.-Y. Lee, and D.-J. Ma, “Modeling of interconnect capacitance, delay, and crosstalk in VLSI,” IEEE Trans. Semicond. Manuf., vol. 13, no. 1, pp. 108–111, Feb. 2000.
[38] BPTM Provided by the Device Group at UC Berkeley [Online]. Available: http://www-device.eecs.berkeley.edu/ptm/introduction.html
[39] International Technology Roadmap for Semiconductors 2002 [Online]. Available: http://public.itrs.net/Files/2002Update/2002Update.htm
[40] X. Tang, R. Tian, H. Xiang, and D. F. Wong, “A new algorithm for routing tree construction with buffer insertion and wire sizing under obstacle constraints,” in Proc. Int. Conf. Comput.-Aided Des., 2001, pp. 49–56.
[41] Z. Yang, S. Areibi, and A. Vannelli, “An ILP based hierarchical global routing approach for VLSI ASIC design,” Optim. Lett., vol. 1, no. 3, pp. 281–297, Jun. 2007.
[42] ISPD98/IBM 1998 [Online]. Available: www.ece.ucsb.edu/kastner/labyrinth/benchmarks/
[43] ISPD2007 2007 [Online]. Available: http://www.ispd.cc/ispd07_contest.html
Ahmed Youssef (S’99) was born in Giza, Egypt, on November 21, 1978. He received the B.Sc. degree (with honors) in electronics and communication engineering from Ain Shams University, Cairo, Egypt, in 2001 and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 2004 and 2008, respectively. He is currently with Intel, Santa Clara, CA. His research interests include low-power DSP integrated-circuit design, leakage control, microprocessor design, and design automation for VLSI systems.
Zhen Yang (S’09) received the B.Eng. degree in electrical engineering from Wuhan University of Science and Technology, Wuhan, China, in 1995, the M.A.Sc. degree from the School of Engineering, University of Guelph, Guelph, ON, Canada, in 2003, and the Ph.D. degree from the University of Waterloo, Waterloo, ON, in 2007. She is currently a Research and Design Engineer with Orora Design Technologies, Inc., Redmond, WA. Her research interests include physical design automation for both digital and analog circuit designs.
Mohab Anis (S’98–M’03) was born in Montreal, QC, Canada, on February 19, 1974. He received the B.Sc. degree (with honors) in electronics and communication engineering from Cairo University, Cairo, Egypt, in 1997 and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1999 and 2003, respectively. He is currently an Assistant Professor and the Codirector of the VLSI Research Group, University of Waterloo, where he is also with the Department of Electrical and Computer Engineering. He is a Cofounder of Spry Design Automation. He has authored/coauthored over 60 papers in international journals and conferences and is the author of the book Multi-Threshold CMOS Digital Circuits-Managing Leakage Power (Kluwer, 2003). His research interests include integrated-circuit design and design automation for very large scale integration systems in the deep-submicrometer regime. He is an Associate Editor of the Journal of Circuits, Systems, and Computers, ASP Journal of Low Power Electronics, and VLSI Design. Dr. Anis is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS. He is a member of the program committees for several IEEE conferences. He was the recipient of the 2004 Douglas R. Colton Medal for Research Excellence in recognition of his excellence in research leading to new understanding and novel developments in microsystems in Canada and was the recipient of the 2002 International Low-Power Design Contest Award.
Shawki Areibi (S’88–M’89) was born in Tripoli, Libya, on November 11, 1960. He received the B.Sc. degree in computer engineering from Elfateh University, Tripoli, in 1984 and the M.A.Sc. and Ph.D. degrees in electrical/computer engineering from the University of Waterloo, Waterloo, ON, Canada, in 1991 and 1995, respectively. From 1985 to 1987, he was with NCR Research Laboratories, Libya. From 1995 to 1997, he was a Research Mathematician with Shell International Oil Products, The Netherlands. From 1997 to 1999, he was a Faculty Member with the Department of Electrical and Computer Engineering, Ryerson Polytechnic University, Toronto, ON. He is currently an Associate Professor with the School of Engineering “Engineering Systems and Computing,” University of Guelph, Guelph, ON, Canada. He has authored/coauthored over 60 papers in international journals and conferences. His research interests include VLSI physical design automation, combinatorial optimization, reconfigurable computing systems, embedded systems, and parallel processing. He is the Associate Editor of the International Journal of Computers and Their Applications. Dr. Areibi served on the technical program committees for several international conferences on computer engineering and embedded systems. He was also a member of the program committees for the Genetic and Evolutionary Computation Conference, HPC, and several other IEEE conferences.
Anthony Vannelli (M’06) received the Ph.D. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1983. From 1983 to 1984, he was a Research Fellow with the Mathematical Sciences Department, Thomas J. Watson Research Center, Yorktown Heights, NY, where he worked on layout problems with a VLSI CAD group led by Robert Brayton. He was the recipient of the Natural Sciences and Engineering Research Council of Canada University Research Fellowship which he held from 1984 to 1993. He joined the Department of Industrial Engineering, University of Toronto, Toronto, ON, in 1984 and returned to the Department of Electrical and Computer Engineering in 1987. In 1993–1994, he was a Visiting Research Scientist with Shell Research, Amsterdam. Since 1998, he has been the Chair of the Department of Electrical and Computer Engineering, University of Waterloo. His main research focuses on the development of efficient linear, nonlinear, and combinatorial optimization techniques to solve VLSI circuit layout and design problems.
Mohamed Elmasry (S’69–M’73–SM’79–F’88) was born in Cairo, Egypt, on December 24, 1943. He received the B.Sc. degree in electrical engineering from Cairo University, Cairo, in 1965 and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Ottawa, Ottawa, ON, Canada, in 1970 and 1974, respectively. He has worked in the area of digital integrated circuits and system design for the last 35 years. From 1965 to 1968, he was with Cairo University, and from 1972 to 1974, he was with Bell–Northern Research (BNR), Ottawa. Since 1974, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, where from 1986 to 1991, he held the Natural Sciences and Engineering Research Council/BNR Chair in VLSI design, and where he is currently a Professor and the founding Director of the VLSI Research Group. He was a Consultant to research laboratories in Canada, Japan, and the U.S. He has authored or coauthored over 400 papers and 14 books on integrated circuit design and design automation. He is the holder of several patents. He is the founding President of Pico Electronics, Inc., Waterloo. Dr. Elmasry has served in many professional organizations in different positions and was the recipient of many Canadian and international awards. He is a founding member of the Canadian Conference on VLSI, the Canadian Microelectronics Corporation, the International Conference on Microelectronics, MICRONET, and Canadian Institute for Teaching Overseas. He is a fellow of the Royal Society of Canada and the Canadian Academy of Engineers.