Post-Placement Voltage Island Generation under Performance Requirement
Huaizhi Wu, Cadence Design Systems, Inc. Email: [email protected]
I-Min Liu, Atoptech, Inc. Email: [email protected]
Martin D. F. Wong, University of Illinois at Urbana-Champaign. Email: [email protected]
Yusu Wang, Ohio State University. Email: [email protected]
Abstract

High power consumption not only leads to short battery life for handheld devices, but also causes on-chip thermal and reliability problems in general. As power consumption is proportional to the square of the supply voltage, reducing the supply voltage can significantly reduce power consumption. Multi-supply voltage (MSV) design has previously been introduced to provide a finer-grained power/performance trade-off. In this work we propose a methodology, built on a set of algorithms, that exploits non-trivial voltage island boundaries for an optimal power versus design-cost trade-off under performance requirements. Our algorithms are efficient, robust and error-bounded, and can be flexibly tuned to optimize for various design objectives (e.g., minimal power within a given number of voltage islands, or minimally fragmented voltage islands within a given power bound) depending on the design requirement. Our experiments on real industry designs show a ten-fold improvement of our method over the current logical-boundary based industry approach.
1 Introduction

With broadening market interest in sophisticated mobile applications, meeting aggressive power targets on top of performance requirements in high-speed portable designs is becoming a challenging task. As the design parameters for optimal power and optimal performance often contradict each other (for example, a lower supply voltage reduces power consumption but slows device speed), designers are in a constant fight to balance power and performance throughout the chip design cycle. High power consumption not only leads to short battery life for handheld devices, but also causes on-chip thermal and reliability problems in general. At the 90nm process node, the vast amount of functionality integrated within SoC designs, compounded with much larger leakage current, is already leading to designs with power dissipation in the hundreds of Watts. As process technology is trending against power dissipation, this problem is expected to only get worse at future process nodes.
This work was partially supported by the National Science Foundation under grant CCR-0306244.
Power consumption generally breaks down into two sources: dynamic and static power [5]. While static power comes from leakage current, dynamic power is the result of the device's switching activities. It can be represented by

P = α · C · V² · f    (1)

where α is the switching rate, C is the load capacitance, V is the supply voltage, and f is the clock frequency. Dynamic power dominates the total power consumption in today's logic designs. Techniques to lower switching power combine reductions in switching activity, load capacitance and supply voltage. For example, the effective clock frequency can be set to zero by gating the clock to inactive logic blocks. Load capacitance can be reduced by minimizing total wire length and by downsizing gates. As dynamic power is proportional to the square of the supply voltage, reducing the supply voltage can significantly reduce active power consumption. Multi-supply voltage (MSV) design is introduced to provide a finer-grained power/performance trade-off. There are two types of MSV. In the "row-based" type, high and low supply-voltage standard-cell placement rows interleave. In the "region-based" type, circuits are partitioned into "voltage islands" (or "power domains"), where each voltage island occupies a contiguous physical space and operates at a supply voltage that meets its performance requirement [9, 7, 2]. Region-based design in the current state of the art is largely done manually and is primarily based on the design's logic hierarchy. That is, designers partition circuits into a few groups based on their performance requirements and the connectivity between modules. Each group is then assigned a supply voltage. Logic boundaries are largely used in this grouping process mainly because they are the boundaries that designers are most familiar with. However, these "natural" boundaries in a design are almost always non-optimal boundaries for supply voltages. Although in theory only timing-critical devices need a high supply voltage, pursuing maximum power reduction this naively is not practical: when voltage islands become fragmented, there is an overhead in voltage-shifting devices.
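As a concrete illustration of Eq. (1), the snippet below evaluates dynamic power for hypothetical (made-up) parameter values and shows the quadratic effect of voltage scaling; the function name and the numbers are ours, not from the paper.

```python
def dynamic_power(alpha, c_load, v_dd, f_clk):
    """Dynamic power per Eq. (1): P = alpha * C * V^2 * f."""
    return alpha * c_load * v_dd ** 2 * f_clk

# Hypothetical values: 20% switching rate, 1 nF total load, 1 GHz clock.
p_high = dynamic_power(0.2, 1e-9, 1.2, 1e9)  # Vdd = 1.2 V -> 0.288 W
p_low = dynamic_power(0.2, 1e-9, 0.9, 1e9)   # Vdd = 0.9 V -> 0.162 W
print(p_low / p_high)                         # (0.9 / 1.2)^2 = 0.5625
```

Lowering the supply from 1.2 V to 0.9 V (a 25% reduction) cuts dynamic power almost in half, which is why MSV assigns the lower voltage wherever timing allows.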
Moreover, fragmented power networks are costly to implement: building such a complex power network is not only tedious, but also takes a great deal of precious routing resources away from the design, which is particularly undesirable when per-metal-layer manufacturing cost is soaring as process technology migrates. In this work, we aim to reduce the cost of the power network in an MSV design.
Figure 1: (a) Design with timing-critical cells (small darker cells). (b) Power consumption too high. (c) Timing requirement not met for the small timing-critical cells. (d) A solution with a non-logical boundary.

Figure 1 illustrates why sticking to logical boundaries limits the solution space in producing an optimal MSV design. In the example, there are three modules, each containing only leaf cells, and both modules A and B contain some timing-critical cells that require a high voltage (Figure 1(a)). Figures 1(b) and (c) are designs based on logical boundaries. While Figure 1(b) guarantees the performance using high power, Figure 1(c) reduces the power consumption without meeting the timing requirement. Neither is an optimal MSV design. By using placement (instead of logical) information, the optimal MSV design meets power and timing requirements at the same time while keeping the number of power domains small (Figure 1(d)).

In this work we propose a methodology, built on a set of algorithms, that exploits the "non-natural" (non-logical) boundaries in a design for optimal supply-voltage partitioning, capturing the power versus design-cost trade-off under performance (timing) requirements. Depending on each designer's specific needs, the optimal trade-off can be explored through either of two dual optimization problems: maximally reduce power consumption within a given bound on the number of voltage islands; or create minimally fragmented voltage islands within a given bound on power consumption. Our approach can handle both problems, with the latter incurring an extra O(log k) factor in running time (k being the number of voltage islands), coming from a binary search for k. We will focus on the latter problem in the following discussion; however, all our results can easily be adapted to the former one.

Our contributions can be summarized as follows. To the best of our knowledge, we are the first to consider the power versus design-cost trade-off under timing requirements for the voltage island generation problem. In particular, we exploit non-trivial voltage island boundaries to balance power consumption and power-network fragmentation. We formulate this problem as a voltage-partitioning problem (Section 2). This voltage-partitioning problem is NP-hard, and we thus study approximation algorithms (i.e., algorithms with an optimality guarantee) and present one that runs in polynomial time. This algorithm, unfortunately, is not efficient enough for practical designs. We therefore design an efficient two-step heuristic algorithm which combines dynamic programming with variable-sized p×q gridding (Section 3). We show (Section 4) that our method is efficient and practical, and produces near-optimal voltage islands for a wide selection of industry data. Compared to the approach using logical boundaries within a design, which is the one most commonly used in industry, our method generates about one tenth as many voltage islands for the same amount of power reduction. The running time is small even for very large industry designs.
Figure 2: (a) An 8×8 array A, with a subarray R (shaded region). (b) A partitioning of A with seven rectangles.
2 Voltage-partitioning Problem

2.1 Problem definition

Let A be an m×n array, and A[i][j] the value of the element at position (i, j), 1 ≤ i ≤ m, 1 ≤ j ≤ n. A subarray A[i1..i2][j1..j2] is the rectangular region in A with (i1, j1) (resp. (i2, j2)) as the bottom-left (resp. upper-right) corner. We may also refer to an array (or a subarray) as a rectangle, or occasionally a region. Let M(R) = max_{(i,j)∈R} A[i][j] be the maximum value over all elements in a rectangle R. The weight of a rectangle R is defined as

w(R) = Σ_{(i,j)∈R} ( M(R) − A[i][j] ).

See Figure 2(a) for an illustration. A partitioning P of A is a set of disjoint rectangles (subarrays) {R₁, ..., R_k} that cover A; k is called the size of this partitioning. See Figure 2(b) for an illustration. The weight of a partitioning P is defined as

w(P) = Σ_{1≤i≤k} w(R_i).

In the voltage island generation problem, we wish to subdivide the placement region into a small number of voltage islands (where every cell in the same voltage island will eventually receive the same voltage), while keeping the total power consumption low and the timing requirement met. The latter means that the voltage assigned to each cell should be no lower than its required value, so the voltage value of each voltage island should be the maximum required voltage over all cells in that island. Raising the voltage of cells with lower requirements to this maximum value results in an increase in the power consumption of this voltage island. To keep the overall power consumption low, it is desirable to keep the total power increase over all voltage islands below some threshold. We can consider each standard-cell placement region as a two-dimensional array A, induced by the underlying placement grid. As dynamic power is proportional to the square of the supply voltage (Eqn. 1), we let A[i][j] be the square of the required voltage of the standard cell covering this grid position. Then it is easy to see that a partitioning of A corresponds to a voltage island partitioning of the placement region; M(R) is the voltage value of voltage island R; and the weight of the partitioning represents the total power increase. We can thus formally define the voltage-partitioning problem as follows.

Definition 1 (Voltage-partitioning Problem, or VPP) Given an m×n array A and an error threshold ε, among all partitionings whose weight is at most ε, find one with the smallest size. Let OPT(A, ε) be the size of this optimal partitioning.

The dual version of this problem (DVPP) is defined as the problem of minimizing the weight of the partitioning under a bound on its size. We will focus on VPP in this paper; however, our algorithm can easily handle DVPP as well.
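To make the definitions concrete, here is a small sketch (our own, not from the paper) that computes w(R) and the weight of a partitioning for a toy array of squared required voltages; the 0-based inclusive index bounds (i1, i2, j1, j2) are an assumption of this sketch.

```python
def rect_weight(A, i1, i2, j1, j2):
    """w(R) = sum over (i,j) in R of (M(R) - A[i][j]), R = A[i1..i2][j1..j2]."""
    cells = [A[i][j] for i in range(i1, i2 + 1) for j in range(j1, j2 + 1)]
    return max(cells) * len(cells) - sum(cells)

def partition_weight(A, rects):
    """Weight of a partitioning: the sum of its rectangles' weights."""
    return sum(rect_weight(A, *r) for r in rects)

# Toy 2x4 array of squared required voltages (hypothetical values).
A = [[1, 1, 4, 4],
     [1, 2, 4, 4]]
# One island: every cell is raised to the maximum value 4.
print(partition_weight(A, [(0, 1, 0, 3)]))                # 4*8 - 21 = 11
# Two islands split between columns 1 and 2: far less power increase.
print(partition_weight(A, [(0, 1, 0, 1), (0, 1, 2, 3)]))  # 3
```

The two-island partitioning pays only for raising the three 1-valued cells of the left block to 2, which is exactly the power-increase trade-off VPP optimizes.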
2.2 Algorithm with guarantees

While no previous work addresses the VPP problem itself, some variants of it are well studied. In particular, if we define the weight of a rectangle R as the sum of all A[i][j] with (i, j) ∈ R, and the weight of a partitioning P as the maximum weight over the rectangles in P, we obtain a variant of the VPP problem previously referred to as the RTILE problem [8]. Intuitively, this max-sum metric is much simpler than the measure we consider in the VPP problem, and the RTILE problem has been shown to be NP-hard [8]. Given this indication of the hardness of our problem, we shift our focus to approximation algorithms, which provide a guarantee on the output size. Below we first describe our 2-approximation algorithm for the VPP problem, which finds a partitioning P of a given array A such that w(P) ≤ ε and |P| ≤ 2·OPT(A, ε). This approximation algorithm is based on a nicely structured class of partitionings, which we introduce next.

Figure 3: (a) A slicing partitioning of size 8. (b) A grid-partitioning of size 4×4.

Slicing model. Given an input region (array) A, we can slice it with either a vertical or a horizontal cut. Each cut divides its parent region into two, and we then slice the resulting two child regions recursively. An example is shown in Figure 3(a). A partitioning obtained this way is called a slicing partitioning of region A [12, 14]. In fact, it is the same as a special type of binary space partitioning (BSP) induced by orthogonal cuts, called an orthogonal BSP. It has been shown in [3] that, given a set of h disjoint axis-aligned rectangles that cover a rectangular region A, one can construct an orthogonal BSP so that, in the induced slicing partitioning of A, each rectangle lies completely inside one of the h original rectangles, and the size of this slicing partitioning is at most 2h. Let OPT_S(A, ε) denote the size of the optimal slicing partitioning of A with weight at most ε, and OPT(A, ε) that of the optimal arbitrary partitioning. We can infer the following lemma.

Lemma 1 OPT(A, ε) ≤ OPT_S(A, ε) ≤ 2·OPT(A, ε).

PROOF. The left inequality is obvious. For the right one, the optimal solution P* of the VPP problem gives OPT(A, ε) disjoint rectangles that cover region A. By [3], there exists a slicing partitioning P of A with at most 2·OPT(A, ε) rectangles. Since every rectangle in P lies completely inside a rectangle of the optimal solution P*, we have w(P) ≤ w(P*) ≤ ε. This proves the right inequality.

The above lemma states that an optimal slicing partitioning for A is a 2-approximation for the VPP problem. Therefore we now only need to solve the VPP problem under this slicing model. One standard method to handle slicing partitionings is dynamic programming [13, 3]. Below, we describe DP-Alg, a dynamic-programming based 2-approximation algorithm for VPP.

Dynamic programming approach. First, we show that the dual DVPP problem under the slicing model can be solved optimally by dynamic programming. It follows the same framework as the one in [10], which was developed for the RTILE problem and several variants of it¹. Let w_k(i1..i2, j1..j2) denote the minimum weight of any slicing partitioning, into at most k sub-rectangles, of the rectangle (subarray) R = A[i1..i2][j1..j2]. The ultimate goal is to compute w_k(1..m, 1..n). At the base level, if k = 1, or if w(R) = 0, we set w_k(i1..i2, j1..j2) = w(R). Otherwise, we have the recursion:

w_k(i1..i2, j1..j2) = min_{1≤k'<k} min { min_{i1≤x<i2} [ w_{k'}(i1..x, j1..j2) + w_{k−k'}(x+1..i2, j1..j2) ],  min_{j1≤y<j2} [ w_{k'}(i1..i2, j1..y) + w_{k−k'}(i1..i2, y+1..j2) ] }

The inner minimizations enumerate all horizontal and vertical cuts, and the outer term (over 1 ≤ k' < k) enumerates all possible ways to assign the numbers of rectangles (k' and k − k') to the two parts obtained by the cut. For the base cases, the weight w(R) of any rectangle R can be computed in O(1) time after O(mn) time/space preprocessing, by extending the prefix-sum algorithm from [8] to our weight function. It then follows from the above recursion that we can build the remaining dynamic-programming table in O((m+n)·k²·m²·n²) time, where we spend O((m+n)·k) time to compute each of the O(k·m²·n²) entries of the table. We thus have the following result.

Lemma 2 Given an m×n array A and an integer k ≥ 1, we can compute in O((m+n)·k²·m²·n²) time the minimum-weight slicing partitioning of A of size at most k. The space complexity is O(k·m²·n²).

This provides a solution for the dual DVPP problem. For the primal VPP problem, we can now guess the optimal number of rectangles OPT_S(A, ε) with O(log OPT_S(A, ε)) probes of a binary search, starting with k = 1. Together with Lemmas 1 and 2, we conclude with the following result.

Theorem 1 Given an m×n array A and an error bound ε ≥ 0, a 2-approximation of OPT(A, ε) can be computed in O((m+n)·k²·m²·n²·log k) time and O(k·m²·n²) space.

¹ We remark that this dynamic-programming paradigm is general and applies to different weight functions that are superadditive [10], which informally means that splitting a rectangle in any partitioning does not increase the weight. One can show, non-trivially, that our weight w in the VPP problem is superadditive.
3 Fast Heuristic Algorithm

In this section, we describe TS-Alg, an efficient two-step heuristic algorithm for the VPP problem.

3.1 Algorithm overview

Ideally, we would like some guarantee on the size of the partitioning output by TS-Alg, while keeping the complexity of the algorithm low. DP-Alg, as introduced in the last section, produces near-optimal solutions. Unfortunately, it is too slow and memory-inefficient to be practical². In fact, the large space requirement severely limits the size of the input array, while the arrays encountered in practice can be far larger (see the array sizes in Table 1). We therefore want to first reduce the size of the input before feeding it to DP-Alg. In particular, we can impose a second grid of lower resolution on the original array A, and obtain a 'compressed' array B, where each element B[s][t] corresponds to a rectangle (subarray) R_st of the original array A. To guarantee meeting the timing requirement for the power islands, we let B[s][t] = M(R_st) = max_{(i,j)∈R_st} A[i][j].

Note that B actually gives rise to a special type of partitioning of A: if B is of size p×q, we call the underlying partitioning P_G a grid-partitioning of size p×q for A (see Figure 3(b)), and each rectangle R_st in P_G is referred to as a grid-rectangle. To avoid excessive power increase in such a partitioning, we would like a bound on w(P_G) = Σ_{1≤s≤p, 1≤t≤q} w(R_st). This motivates us to design the following two-step approach for the VPP problem, referred to as TS-Alg.

Step 1. Size reduction: produce a p×q array B with w(P_G) ≤ ε₁, for some ε₁ ≤ ε.

Step 2. Approximate voltage-partitioning: apply DP-Alg on B to compute a partitioning with weight at most ε − ε₁.

Note that both the quality and the quantity of the first step directly affect the performance of the second step. On the one hand, we hope that the quantity (i.e., the values of p and q) is small, so that DP-Alg is fast and practical. On the other hand, as DP-Alg will not further subdivide any grid-rectangle in P_G (i.e., a rectangle in the final output will be a combination of some grid-rectangles from P_G), the grid-rectangles from P_G should be 'good'. We will see in what follows that although TS-Alg does not guarantee that the output size approximates OPT(A, ε) within some constant factor, there is a control on the quality in each of the two steps. The experimental results in the next section further demonstrate the performance of TS-Alg, both in efficiency and in output quality.

² We remark that there are improvements over the above DP algorithm that could be extended to our case as well; however, they greatly increase the approximation factor, as well as making the algorithm quite complex [10]. Our goal is an efficient algorithm that performs well in practice.

3.2 Size reduction

One straightforward approach for Step 1 of TS-Alg is to simply subdivide A evenly into p×q rectangles (so all pq grid-rectangles in P_G are congruent). This method is completely oblivious to the data distribution in A, and thus leads to suboptimal results. It is desirable to subdivide A more intelligently so as to preserve the data distribution in A, using more flexible, variable-sized grids. In particular, we ask the following question for Step 1.

Definition 2 (Grid-partitioning Problem, or GPP) Given an m×n array A and an error threshold ε₁, among all grid-partitionings whose weight is at most ε₁, find one with the smallest size (i.e., p×q).

Variants of this problem have been studied in the field of computational geometry [4, 11], and we modify the algorithm from [11] to obtain the following result³.
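The data-oblivious even-subdivision baseline just described can be sketched as follows (our own code; the rounding of block boundaries is an assumption of the sketch):

```python
def even_grid_reduce(A, p, q):
    """Naive Step 1: subdivide the m x n array A into a p x q grid of
    (nearly) congruent grid-rectangles. Returns the compressed array B with
    B[s][t] = M(R_st), plus the grid weight w(P_G)."""
    m, n = len(A), len(A[0])
    rows = [round(s * m / p) for s in range(p + 1)]   # horizontal boundaries
    cols = [round(t * n / q) for t in range(q + 1)]   # vertical boundaries
    B, grid_weight = [], 0
    for s in range(p):
        row = []
        for t in range(q):
            cells = [A[i][j] for i in range(rows[s], rows[s + 1])
                             for j in range(cols[t], cols[t + 1])]
            mx = max(cells)
            row.append(mx)                               # keep max: timing stays safe
            grid_weight += mx * len(cells) - sum(cells)  # w(R_st)
        B.append(row)
    return B, grid_weight

A = [[1, 1, 2, 2],
     [1, 1, 2, 2]]
print(even_grid_reduce(A, 1, 2))  # ([[1, 2]], 0): the grid happens to match the data
print(even_grid_reduce(A, 1, 1))  # ([[2]], 4): one block forces the 1s up to 2
```

As the second call shows, a grid boundary in the wrong place inflates w(P_G), which is exactly why the variable-sized grids of GPP are preferable.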
Lemma 3 Let p*×q* be the size of the optimal grid-partitioning with weight at most ε₁/2. One can compute, in O(mn + (m+n+p̂q̂)·p̂·log(mn)) time and O(mn) space, a grid-partitioning of weight at most ε₁ and of size p̂×q̂, where p̂×q̂ is within a small constant factor of p*×q*.

³ O(mn) time/space is spent on preprocessing, so that the weight of any grid-rectangle can be computed in O(1) time.
At a high level, the grid-partitioning problem can be reduced to a special type of set-cover problem, which can be efficiently approximated using techniques from randomized algorithms (in particular, ε-nets [4], and an elegant analysis given by Clarkson in [6]). Roughly speaking, it uses the iterative doubling technique originally used for the linear programming problem [6]. It assigns a load to every possible separator for the grid-partitioning (i.e., all vertical and horizontal grid lines). (The load was referred to as weight in previous studies; we rename it here to avoid confusion with our weight function w introduced earlier.) Starting with unit loads, at each iteration the algorithm chooses a subset of these separators according to their current loads, such that the resulting grid-partitioning P_G serves as an ε-net. If P_G already satisfies the error requirement (i.e., w(P_G) ≤ ε₁), the algorithm returns P_G. Otherwise, it chooses a grid-rectangle at random with probability proportional to its weight, and doubles the loads of all separators that intersect this grid-rectangle. The intuition is that if a grid-rectangle has high weight, then the grid lines cutting it become more likely to be chosen for the grid-partitioning. A careful analysis shows that the expected number of iterations is bounded. The algorithm is easy to implement; however, a detailed description is somewhat involved, so we omit it from the current version of the paper, and interested readers can refer to [11] for more details.
3.3 Putting everything together

The size-reduction step gives a 'compressed' p×q array B, with each element B[s][t] representing a subarray R_st of the original array A, for 1 ≤ s ≤ p, 1 ≤ t ≤ q, and B[s][t] = M(R_st) = max_{(i,j)∈R_st} A[i][j]. Let T be a subarray of B, and M(T) = max_{(s,t)∈T} B[s][t]. We define the weight of T as

w_B(T) = Σ_{(s,t)∈T} ( M(T) − B[s][t] ) · |R_st|,

where |R_st| is the number of elements contained in rectangle R_st of the original array A. Note that this weight has a form similar to the weight w(R) defined in Section 2.1, except for the additional multiplier |R_st|: unlike A[i][j], which corresponds to a single cell, B[s][t] now corresponds to a group of cells.

Let P_B = {T₁, ..., T_k} be a partitioning of B. The weight of P_B, w_B(P_B) = Σ_{1≤i≤k} w_B(T_i), represents the additional power increase from the grid-partitioning P_G of the original array A to the partitioning P_B of the 'compressed' array B. (Note that P_B directly implies a partitioning P of A with the same cuts, and w(P) = w_B(P_B) + w(P_G).) To satisfy the overall power-increase bound ε, we require that w_B(P_B) + w(P_G) ≤ ε, i.e., w_B(P_B) ≤ ε − ε₁, where ε₁ = w(P_G) is computed in Step 1.

We can now carry over the algorithm from Section 2.2, running DP-Alg on the 'compressed' p×q array B. The only difference is how the base cases are computed, as the weight function is slightly different⁴. However, as the weight w_B(T) of any subarray T of B can also be computed in O(1) time after a similar O(pq) time/space preprocessing on B, we can carry over the complexity result from Theorem 1 and implement the second step in O(pq + (p+q)·k²·p²·q²·log k) time and O(pq + k·p²·q²) space. Together with Lemma 3, we have:

Theorem 2 TS-Alg runs in Õ(mn + (p+q)·k²·p²·q²) time and O(mn + k·p²·q²) space, where Õ hides some logarithmic terms. The values of p and q depend on the parameter ε₁ ≤ ε chosen in the first step. In practice, k is small and p ≪ m, q ≪ n, in which case our TS-Alg runs in roughly O(mn) time.

⁴ We remark that the new weight function w_B is also superadditive.
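The compressed weight w_B, with its per-grid-rectangle multiplier |R_st|, can be sketched as below (our own illustration; the sizes matrix holding |R_st| is an assumed input):

```python
def compressed_weight(B, sizes, s1, s2, t1, t2):
    """w_B(T) for the subarray T = B[s1..s2][t1..t2] of the compressed array:
    sum over (s,t) in T of (M(T) - B[s][t]) * |R_st|, where sizes[s][t] = |R_st|
    is the number of original cells covered by grid-rectangle R_st."""
    entries = [(B[s][t], sizes[s][t])
               for s in range(s1, s2 + 1) for t in range(t1, t2 + 1)]
    mx = max(v for v, _ in entries)
    return sum((mx - v) * cnt for v, cnt in entries)

# Hypothetical 2x2 compressed array with unequally sized grid-rectangles.
B = [[1, 4],
     [2, 4]]
sizes = [[6, 2],
         [3, 2]]
print(compressed_weight(B, sizes, 0, 1, 0, 1))  # (4-1)*6 + (4-2)*3 = 24
print(compressed_weight(B, sizes, 0, 1, 1, 1))  # 0: right column is uniform
```

Running DP-Alg on B with this weight function in place of w(R) gives the second step of TS-Alg.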
4 Experimental Results

4.1 Experiment setup and snapshots of our results

We perform our experiments on a set of industry designs on 64-bit Linux machines (CPU: 1.95 GHz, memory: 11.7 GB). For each design, the experiment is carried out in the following steps. We use Cadence's commercial tool SoC Encounter [1] to perform timing-driven placement, timing optimization and timing analysis. We then assign a voltage to each cell according to its worst slack. (We use four different voltage levels, each corresponding to a different range of worst slacks; the smaller the slack, the higher the voltage.) We transform the standard-cell placement and the associated voltage requirements into the input array, as described in Section 2.1. We calculate the maximum power increase, which is the total power increase when all cells on the entire chip are raised to the highest required voltage. We then set some reasonable bounds on the total power increase, each corresponding to a certain percentage of the maximum power increase, and apply our TS-Alg to generate the minimum number of voltage islands within each power-increase bound.
Snapshots. First, we give some visual results of our TS-Alg to demonstrate its effectiveness; quantitative results follow in the next section. Figure 7 shows the voltage islands generated for two industry designs. For each design (one column), the top picture shows the placement with timing-critical cells in dark colors (the darker a cell, the higher the voltage it needs). The middle picture shows the p×q grid-partitioning generated by the size-reduction step. Note that the voltage-distribution information is well preserved by the variable-sized grids. The bottom picture shows the generated voltage islands.
4.2 Comparison with other approaches

To demonstrate the efficiency of our TS-Alg, we compare it with two alternative approaches. (One of them is commonly used in industry; the other is developed by us as a straightforward improvement over the industry approach.)

Outline of alternative approaches. The first one is rather straightforward: it is the logical-boundary based approach, where each grouped module or cell in the logical hierarchy tree forms an individual voltage island (Figure 4(a)). Currently this is the approach commonly used in practice (as mentioned in the example in Section 1). This approach is often very inefficient due to the high fanout of modules in the logical hierarchy tree. In the example in Figure 4(b), module M₁ has been marked high-voltage due to its internal timing-critical cells, while the rest of the cells in the subtree rooted at M are non-critical. If we do not group module M, each child of M becomes an individual voltage island, causing too many fragments. If we group module M, the entire module is raised to the high voltage required by M₁, causing too much power consumption. Such cases are very common in real designs.
Figure 4: Logical hierarchy tree. (a) Shaded modules correspond to one possible partitioning. (b) A node M with high fanout.

A natural way to improve the logical-boundary based approach is to substitute the high-fanout hierarchical tree with a non-logical-boundary based, standard quad-tree. The leaves of the quad-tree are the grid cells in the input array, and each node in the tree corresponds to a possible region (a subarray). Given an upper bound ε for the power increase, the goal is to find a set of appropriate nodes from the tree that form a partitioning of the input array. Such a partitioning is obtained by a greedy bottom-up merging approach. In particular, we mark a node white if it has not been merged, and black otherwise. A node is called a candidate for merging if it is white while all its four children are black. Furthermore, given a node R in the quad-tree with its four children R₁, ..., R₄, define the cost of merging R to be

c(R) = w(R) − ( w(R₁) + w(R₂) + w(R₃) + w(R₄) ).

This corresponds to the power increase resulting from combining the four subregions into R. Now, in order to compute a partitioning, we start with a tree where all non-leaf nodes are white. At any time, we choose the candidate with the smallest cost (using a priority queue). The process terminates when the total weight of the resulting partitioning would exceed the given upper bound ε. We refer to this algorithm as QT-Alg. The overall time complexity is O(mn·log(mn)) and the space complexity is O(mn). Figure 5 illustrates QT-Alg.

Figure 5: Quad-tree. (a) The input array (darker color indicates a higher voltage value). (b) After some combinations.

Comparison on the same design with different power-increase bounds. Figure 6 compares the output size of our TS-Alg with QT-Alg and the logical-tree based algorithm on one industry design under different power-increase bounds. Our TS-Alg clearly outperforms the other two significantly and consistently (by factors of about 10 and 2, respectively). Between the two alternatives, QT-Alg is significantly better; therefore we only compare against QT-Alg in the rest of our experiments.

Figure 6: Comparing the different algorithms (TS-Alg, QT-Alg, Logical-Alg) on the same design; x-axis: percentage of power increase (0.65 down to 0.3), y-axis: number of voltage islands (up to 180).
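The greedy bottom-up merging of QT-Alg can be sketched as follows. This is our own simplified illustration: it assumes a square 2^d × 2^d array and recomputes rectangle weights naively, whereas the stated O(mn·log(mn)) bound requires the O(1) weight look-up.

```python
import heapq

def qt_alg(A, eps):
    """Greedy quad-tree merging: returns a partitioning (list of rectangles
    (i1, i2, j1, j2)) whose total weight stays within the power budget eps."""
    n = len(A)

    def weight(i1, i2, j1, j2):
        cells = [A[i][j] for i in range(i1, i2 + 1) for j in range(j1, j2 + 1)]
        return max(cells) * len(cells) - sum(cells)

    def children(i1, i2, j1, j2):
        im, jm = (i1 + i2) // 2, (j1 + j2) // 2
        return [(i1, im, j1, jm), (i1, im, jm + 1, j2),
                (im + 1, i2, j1, jm), (im + 1, i2, jm + 1, j2)]

    # start from the all-leaves partitioning (every cell its own region, weight 0)
    partition = {(i, i, j, j) for i in range(n) for j in range(n)}
    total, heap = 0, []

    def push_if_candidate(node):
        kids = children(*node)
        if all(k in partition for k in kids):   # white node, four black children
            cost = weight(*node) - sum(weight(*k) for k in kids)
            heapq.heappush(heap, (cost, node))

    for i1 in range(0, n, 2):                   # seed: parents of the leaves
        for j1 in range(0, n, 2):
            push_if_candidate((i1, i1 + 1, j1, j1 + 1))

    while heap:
        cost, node = heapq.heappop(heap)
        kids = children(*node)
        if not all(k in partition for k in kids):
            continue                            # stale queue entry
        if total + cost > eps:
            break                               # merging would exceed the budget
        for k in kids:
            partition.remove(k)
        partition.add(node)
        total += cost
        i1, i2, j1, j2 = node                   # the parent may now be a candidate
        side = i2 - i1 + 1
        if 2 * side <= n:
            pi1 = (i1 // (2 * side)) * 2 * side
            pj1 = (j1 // (2 * side)) * 2 * side
            push_if_candidate((pi1, pi1 + 2 * side - 1, pj1, pj1 + 2 * side - 1))
    return sorted(partition)

A = [[1, 1, 2, 2],
     [1, 1, 2, 2],
     [3, 3, 4, 4],
     [3, 3, 4, 4]]
print(len(qt_alg(A, 0)))    # 4: each uniform quadrant merges for free
print(len(qt_alg(A, 100)))  # 1: a large budget lets the root merge too
```

Note the structural limitation this exposes: QT-Alg can only produce regions aligned to quad-tree boundaries, whereas TS-Alg's variable-sized grid plus DP can place cuts where the data demands.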
Comparison on different designs. We extend the experiment to a wide selection of industry designs of different sizes. For each design, we compare the two algorithms against different power-increase bounds, and obtain results in output size similar to those for the design in Figure 6. Due to limited space, we omit the complete results here. Instead, for each design, we pick a particular power-increase bound such that the number of voltage islands generated is within a desired range⁵. We let the range be around 20, which is roughly an upper bound in current practical designs. Table 1 shows the comparison between

⁵ As stated earlier, our algorithm works for both of the dual optimization problems. Our discussion and implementation focus on minimizing the number of voltage islands under a power-increase bound; however, it can be adapted to directly solve the other problem with a faster runtime.
our TS-Alg and QT-Alg. The designs are listed in order of increasing array size (the input to both algorithms), which is roughly proportional to the number of standard cells in the design. Clearly, our TS-Alg significantly outperforms QT-Alg on all designs in terms of the number of voltage islands obtained. The time complexity of our algorithm has two terms: O(mn) and Õ((p+q)·k²·p²·q²). The first term comes from preprocessing the input array into a table for later look-up of the weight of any subarray of A. Since this preprocessing step is needed by both QT-Alg and TS-Alg, we omit it from the running times presented⁶. Other than this preprocessing time, the running time of TS-Alg is only output-sensitive, depending on p, q and k, while QT-Alg still has an O(mn·log(mn)) running time. This explains why TS-Alg has larger running times on the first three (small) designs in Table 1: the p×q grid after Step 1 is larger in their cases than in the later ones. On the other hand, for all practical data we tested and for all practical ε, p and q are small regardless of the size of the input design (observe, for example, that p×q does not increase with input size in Table 1). This means that our algorithm scales well with increasing input size. Furthermore, as ε decreases, the running time of our algorithm decreases as well, while that of QT-Alg remains roughly the same (as it is not output-sensitive). This is demonstrated in Table 2, where we choose k around 10, which is probably a more practical number in current designs. Note that TS-Alg again beats QT-Alg significantly and consistently in terms of the quality of the result.

Design     Power  Grid     # of VI     Runtime (s)
Name       Bound  Size     TS    QT    TS     QT
industryB  70%    13 x 17  11    26    2      1
industryC  55%    21 x 21  12    24    7      2
industryG  65%    21 x 21  11    22    6      3
industryH  50%    9 x 13   11    29    1      20
industryA  60%    13 x 17  12    37    3      25
industryD  50%    9 x 13   10    32    2      33
industryI  45%    9 x 13   9     26    2      35
industryJ  45%    9 x 13   9     18    2      63
industryE  40%    17 x 21  9     22    4      131
industryF  50%    9 x 13   11    25    2      211

Table 2: Comparison of TS-Alg and QT-Alg with output # of voltage islands around 10.

⁶ Since the preprocessing time is O(mn), it is much smaller than the O(mn·log(mn)) runtime of QT-Alg, especially for large designs, where the log factor can be quite large. Omitting it therefore does not change our comparison, while helping to keep the presentation clear.

References

[1] Cadence software manual: SoC Encounter GPS. http://www.cadence.com/products/digital ic/soc encounter/index.aspx.
[2] Power islands: The evolving topology of SoC power management. http://www.us.design-reuse.com/articles/article9150.html.
[3] P. Berman, B. DasGupta, and S. Muthukrishnan. Exact size of binary space partitionings and improved rectangle tiling algorithms. SIAM J. Discrete Math., 15(2):252–267, 2002.
[4] H. Brönnimann and M. T. Goodrich. Almost optimal set covers in finite VC-dimension. Discrete Comput. Geom., 14:463–479, 1995.
[5] J. Buurma and L. Cooke. Low-power design using multiple ASIC libraries. http://www.sinavigator.com/Low Power Design.pdf.
[6] K. L. Clarkson. A Las Vegas algorithm for linear programming when the dimension is small. J. ACM, 42(2):488–499, 1995.
[7] J. Hu, Y. Shin, N. Dhanwada, and R. Marculescu. Architecting voltage islands in core-based system-on-a-chip designs. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pages 180–185, 2004.
[8] S. Khanna, S. Muthukrishnan, and M. Paterson. On approximating rectangle tiling and packing. In SODA '98: Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 384–393, 1998.
[9] D. E. Lackey, P. S. Zuchowski, T. R. Bednar, D. W. Stout, S. W. Gould, and J. M. Cohn. Managing power and performance for system-on-chip designs using voltage islands. In Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design, pages 195–202, 2002.
[10] S. Muthukrishnan, V. Poosala, and T. Suel. On rectangular partitionings in two dimensions: Algorithms, complexity, and applications. In ICDT '99: Proceedings of the 7th International Conference on Database Theory, pages 236–256, 1999.
[11] S. Muthukrishnan and T. Suel. Approximation algorithms for array partitioning problems. Journal of Algorithms, 54:85–104, 2005.
[12] R. H. Otten. Automatic floorplan design. In Proceedings of the 19th ACM/IEEE Conference on Design Automation, pages 261–267, 1982.
[13] L. P. P. P. van Ginneken and R. H. J. M. Otten. Optimal slicing of plane point placements. In Proceedings of the Conference on European Design Automation, pages 322–326, 1990.
[14] D. F. Wong and C. L. Liu. A new algorithm for floorplan design. In Proceedings of the 23rd ACM/IEEE Conference on Design Automation, pages 101–107, 1986.
industryA: placement with timing-critical cells
industryE: placement with timing-critical cells
industryA: p×q grid-partitioning
industryE: p×q grid-partitioning
industryA: voltage islands (a)
industryE: voltage islands (b)

Figure 7: (a) design industryA; (b) design industryE.
Design     # of     Array Size     Power  Grid     # of VI     Runtime (s)
Name       Cells    (m x n)        Bound  Size     TS    QT    TS     QT
industryB  5926     79 x 790       60%    26 x 26  18    48    58     1
industryC  43677    161 x 2860     50%    21 x 34  19    61    125    2
industryG  76406    230 x 3504     60%    26 x 26  19    64    105    3
industryH  243188   694 x 7852     40%    13 x 17  19    48    9      20
industryA  317752   732 x 8793     55%    17 x 21  17    68    16     25
industryD  397940   1300 x 8270    35%    21 x 26  17    74    49     33
industryI  342113   1199 x 8931    35%    17 x 21  15    39    11     36
industryJ  737555   1372 x 12350   35%    21 x 21  17    46    32     64
industryE  352060   2144 x 17159   35%    26 x 34  13    38    47     134
industryF  306326   2652 x 20978   40%    13 x 17  15    45    5      222

Table 1: Comparison of TS-Alg and QT-Alg with output # of voltage islands around 20.