
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 12, DECEMBER 2011

Post-Placement Power Optimization with Multi-Bit Flip-Flops

Mark Po-Hung Lin, Member, IEEE, Chih-Cheng Hsu, Student Member, IEEE, and Yao-Tsung Chang

Abstract—Optimization for power is always one of the most important design objectives in modern nanometer integrated circuit design. Recent studies have shown the effectiveness of applying multi-bit flip-flops to save the power consumption of the clock network. This paper presents: 1) a novel design methodology of applying multi-bit flip-flops at the post-placement stage, which can be seamlessly integrated in modern design flow; 2) a new problem formulation for post-placement optimization with multi-bit flip-flops; 3) flip-flop clustering and placement algorithms to simultaneously minimize flip-flop power consumption and interconnecting wirelength; and 4) a progressive window-based optimization technique to reduce placement deviation and improve runtime efficiency of our algorithms. Experimental results show that our algorithms are very effective in reducing not only flip-flop power consumption but also clock tree and signal net wirelength. Consequently, the power consumption of the clock network is minimized.

Index Terms—Multi-bit flip-flop, physical design, post-layout resynthesis, power optimization, synthesis for low power.

I. Introduction

With limited power/thermal budgets for modern systems on chip (SoCs), which integrate an increasing number of transistors, power minimization has become one of the most important objectives in designing SoCs for various applications. High power dissipation of an SoC not only increases its system cost but also affects product lifetime and reliability. To optimize power consumption, many low-power design techniques have been introduced [2], such as clock gating [3], [4], replacing non-timing-critical cells with their high-Vt counterparts [5], [6], power gating [7], [8], creating multi-supply-voltage designs [5], dynamic voltage/frequency scaling [9], [10], and minimizing the clock network. Among these techniques, minimizing the clock network is very important in reducing the power consumption of an SoC because the clock network accounts for up to 50% of the dynamic power of the chip [11], and dynamic power is the dominant power source, which accounts for 75% of the total power consumption of an SoC [12].

Manuscript received February 15, 2011; revised May 15, 2011 and July 20, 2011; accepted July 29, 2011. Date of current version November 18, 2011. This work was supported in part by Faraday Technology Corporation, Himax Technologies, Inc., and the NSC of Taiwan, under Grant NSC 098-2218-E194-008-MY3. A preliminary version of this paper was presented at the 2010 IEEE/ACM International Conference on Computer-Aided Design [1]. This paper was recommended by Associate Editor I. L. Markov. The authors are with the Department of Electrical Engineering, National Chung Cheng University, Chiayi 621, Taiwan (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCAD.2011.2165716

Recent studies have proposed various approaches to minimize the clock network, including buffer sizing [13], placement optimization of registers [14]–[16], and applying multi-bit flip-flops (MBFFs) [17]–[21], multi-bit registers [22], or register banks [23]. According to [17] and [18], applying MBFFs may have the following advantages:
1) smaller design area due to shared clock drivers and clock gating cells;
2) less delay and power of the clock network due to fewer clock sinks and smaller capacitive load on the clock net;
3) controllable clock skew because of common clock and enable signals for a group of flip-flops and reduced depth of a clock tree;
4) improved routing resource utilization, especially when considering design for testability. The required routing resource for a scan chain is greatly reduced because of fewer cells in a scan chain.

Fig. 1 shows an example of merging two 1-bit flip-flops into one 2-bit flip-flop. Each flip-flop contains two inverters which generate opposite-phase clock signals. As the process technology advances to 65 nm and beyond, even a minimum-sized inverter/buffer can still drive multiple flip-flops [18]. Replacing several 1-bit flip-flops with one MBFF will significantly reduce the number of inverters. Consequently, the total power and area of all flip-flops in a design are reduced. Table I further compares the normalized power consumption and area of different flip-flop cells in a cell library which was designed by [24] based on the UMC 55 nm technology [25].

The idea of applying MBFFs was first proposed in [17], as seen in Fig. 2(a), which groups single-bit flip-flops into MBFFs before floorplanning and placement based on a breadth-first-search algorithm. Kretchmer [22] and Chen et al. [18] introduced a design methodology for logic optimization with MBFFs. Such a methodology creates models of multi-bit registers in a cell library that can be inferred by existing logic synthesis tools. Based on the multi-bit register inference, it is possible to map a register-transfer level design directly to a gate-level design with multi-bit register cells during logic synthesis. More recently, Hou et al. [23] proposed a power-aware placement flow, as shown in Fig. 2(b), which generates register banks based on predetermined heuristic criteria after initial placement and removes the generated register banks to satisfy timing and routability constraints during incremental placement optimization. Most recently, it


LIN et al.: POST-PLACEMENT POWER OPTIMIZATION WITH MULTI-BIT FLIP-FLOPS

Fig. 1. Example of merging two 1-bit flip-flops into one 2-bit flip-flop.

TABLE I
Comparisons of the Normalized Power Consumption and Area of Different Flip-Flop Cells in a Cell Library [24] Based on the UMC 55 nm Technology [25]

Bit Number    Normalized Power Consumption per Bit    Normalized Area per Bit
1             1.00                                    1.00
2             0.86                                    0.96
4             0.78                                    0.71

TABLE II
Design Methodologies for MBFFs or Register Banks

Design Methodology for MBFFs                      Related Works
Logic optimization with MBFFs                     [18], [22]
Pre-placement optimization with MBFFs             [17]
In-placement optimization with register banks     [23]
Post-placement optimization with MBFFs            This paper, [19]–[21]

is suggested to optimize a design with MBFFs at the post-placement stage [19]–[21], as illustrated in Fig. 2(c), for more accurate timing/delay budgets. The clock tree of the design is synthesized after the MBFFs have been generated and placed. Table II summarizes all the proposed logic/physical design methodologies for MBFF or register bank generation.

For post-placement optimization with MBFFs, the previous works [20], [21] separated “MBFF Gen. & Placement” in Fig. 2(c) into two steps: 1) flip-flop merging, and 2) MBFF placement, based on different design objectives. During flip-flop merging, both [19] and [20] tried to minimize total flip-flop power consumption, while [21] proposed to minimize the number of clock sinks (i.e., the total flip-flop number) and net switching power (i.e., the total weighted wirelength). During MBFF placement, [20] proposed to minimize total wirelength, and [21] considered the minimization of net switching power.

Table III summarizes and compares the design objectives of flip-flop merging and MBFF placement in the previous works.

In this paper, we address the problem of power optimization with MBFFs at the post-placement stage. We present a new problem formulation for the application of multi-bit flip-flops, which simultaneously minimizes total flip-flop power consumption and interconnecting wirelength such that both placement density and timing slack constraints are satisfied. Based on the problem formulation, we propose a novel post-placement power optimization flow together with the flip-flop grouping and MBFF placement algorithms to solve the addressed problem. We formulate the flip-flop grouping problem as the m-clique finding and maximum-independent-set subproblems. Finally, we introduce the progressive window-based optimization technique to reduce placement deviation and improve runtime efficiency of our algorithms. Experimental results show that our approach is very effective in reducing not only flip-flop power consumption but also clock tree and signal net wirelength when applying multi-bit flip-flops to a design at the post-placement stage.

The remainder of this paper is organized as follows. Section II describes the problem formulation of the post-placement power optimization with MBFFs. Section III details the proposed algorithms to solve the problem. Section IV reports the experimental results, and finally Section V concludes this paper.

II. Problem Formulation

Given the following inputs:
1) a set of placed flip-flop cells, $F$, where each flip-flop, $f_i \in F$, can be either 1-bit or multi-bit;
2) a cell library containing a set of MBFF cells, $F_L$, with the specification of area, $A_{f_L^m}$, and power consumption, $P_{f_L^m}$, for each m-bit flip-flop, $f_L^m \in F_L$;
3) the slack distance, $d_{slack}(p_j, f_{p_j})$, between a pin, $p_j$, and its connected flip-flop, $f_{p_j}$;

4) the width, $w_c$, and height, $h_c$, of the chip;
5) a set of bins, $B$, covering the whole chip area with equal widths and heights;
6) the width, $w_b$, and height, $h_b$, of a bin, $b_i \in B$;
7) the maximum placement density, $D_{max}$;
8) placement grids;
9) a set of placed combinational logic cells, $C$;
10) the corresponding design netlist.

The post-placement power optimization problem is to simultaneously minimize total flip-flop power consumption and interconnecting wirelength with MBFFs such that both placement density and timing slack constraints are satisfied.

Fig. 2. Comparison of different physical design flows for MBFF or register bank generation. (a) Pre-placement optimization with MBFFs [17]. (b) In-placement optimization with register banks [23]. (c) Post-placement optimization with MBFFs [20].

TABLE III
Comparisons of the Design Objectives for Flip-Flop Merging and MBFF Placement in the Previous Works [19]–[21], and Ours

Approach      Objectives for Flip-Flop Merging                                                                                                             Objectives for MBFF Placement
This paper    Min. total flip-flop power consumption; min. total wirelength                                                                                Min. total wirelength
[19]          Min. total flip-flop power consumption                                                                                                       N/A
[20]          Min. total flip-flop power consumption                                                                                                       Min. total wirelength
[21]          Min. total # of clock sinks (total # of flip-flops); min. total net switching power consumption (total weighted signal net wirelength)      Min. total net switching power consumption (total weighted signal net wirelength)

The total flip-flop power consumption of a design, $P_F$, can be calculated by summing up the power consumption of each flip-flop, $P_{f_i}$, in the design, i.e., $P_F = \sum_{f_i \in F} P_{f_i}$.

A. Placement Density Constraint

In order to avoid routing congestion when merging two or more flip-flops into one MBFF, the placement density constraint should be considered during MBFF placement because an MBFF occupies a larger area than any single 1-bit flip-flop it replaces. To consider the placement density constraint, a chip is equally divided into a number of bins covering the whole chip area. If the placement density of a bin, $b_i$, is larger

Fig. 3. Longer wirelength (a) before and (b) after merging two 1-bit flip-flops, $f_1$ and $f_2$, into one 2-bit flip-flop, $f_3$, in a design.

than $D_{max}$, it may result in routing difficulty. We define the placement density constraint as shown in (1), where $A_{F,b_i}$ and $A_{C,b_i}$ denote the total area of all flip-flops and combinational logic cells in $b_i$, respectively:

$$\frac{A_{F,b_i} + A_{C,b_i}}{w_b h_b} \le D_{max}, \quad \forall b_i \in B. \tag{1}$$

When a newly generated MBFF is placed in a bin, the placement density constraint must be satisfied.

B. Timing Slack Constraint

In addition to the placement density constraint, a poor location of a newly generated MBFF may also induce longer wirelength between the flip-flop and its connected pins. For example, in Fig. 3(a), there are two 1-bit flip-flops, $f_1$ and $f_2$, where $f_1$ is connected to $p_1$ and $p_2$, and $f_2$ is connected to $p_3$ and $p_4$. After replacing $f_1$ and $f_2$ with the 2-bit flip-flop, $f_3$, as shown in Fig. 3(b), the wirelength from $f_3$ to $p_4$ becomes much longer. The longer wirelength introduces a much larger delay, which may lead to a timing violation. Therefore, the timing slack constraints should be considered to avoid timing violations when replacing a set of flip-flops with an MBFF.

In our problem formulation, the timing slack constraints can be obtained through the following steps: 1) perform static timing analysis; 2) apply the slack distribution technique [28]; 3) convert the timing slack to slack distance as our problem input, which is elaborated in the following paragraphs.

During post-placement optimization with MBFFs, only the locations of flip-flop cells can be changed, so the delay of a datapath is only affected by the input transition time and output capacitance of the flip-flops and their fan-in/fan-out cells on the datapath based on the Synopsys Liberty Timing Model [26]. According to the loading dominance property [27], the effect of the output capacitance (loading), including wire loading and input/output pin capacitance, on the gate delay is much larger than that of the input transition time (slope). Since the input/output pin capacitance does not change during flip-flop merging and replacement, the delay is affected only by the wire loading, which is proportional to the interconnecting wirelength. Consequently, we can define the timing slack constraint in terms of the maximum allowable distance from a flip-flop to each of its connected fan-in/fan-out pins, as shown in (2), where $d(p_j, f'_{p_j})$ is the distance from a pin, $p_j$, to its newly connected MBFF, $f'_{p_j}$, and $d_{max}(p_j, f'_{p_j})$ is the maximum allowable distance from $p_j$ to $f'_{p_j}$:

$$d(p_j, f'_{p_j}) \le d_{max}(p_j, f'_{p_j}), \quad \forall p_j \tag{2}$$

$$d_{max}(p_j, f'_{p_j}) = d(p_j, f_{p_j}) + d_{slack}(p_j, f_{p_j}). \tag{3}$$

The maximum allowable distance, $d_{max}(p_j, f'_{p_j})$, can be further derived from (3), where $d(p_j, f_{p_j})$ is the distance from $p_j$ to its previously connected (1-bit) flip-flop, $f_{p_j}$, and $d_{slack}(p_j, f_{p_j})$ is the slack distance between $p_j$ and $f_{p_j}$. The value of the slack distance can be positive, zero, or negative. If $d_{slack}(p_j, f_{p_j})$ is positive, the newly generated MBFF cell, $f'_{p_j}$, can be placed farther from $p_j$ than the distance $d(p_j, f_{p_j})$. If $d_{slack}(p_j, f_{p_j})$ is zero, $f'_{p_j}$ can only be placed within the distance $d(p_j, f_{p_j})$. A negative slack distance suggests either placing $f'_{p_j}$ closer to $p_j$ than the distance $d(p_j, f_{p_j})$ or leaving $f_{p_j}$ unchanged.
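As an illustration of the two constraints in this section, the following minimal sketch checks (1) for a single bin and (2)–(3) for a candidate MBFF location. It assumes Manhattan distance (consistent with the Manhattan rings used in Section III); the function names, the tuple layout of `pins`, and the sample coordinates are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    """Manhattan distance between two points (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def density_ok(ff_area: float, comb_area: float,
               bin_w: float, bin_h: float, d_max: float) -> bool:
    """Eq. (1) for a single bin: (A_F + A_C) / (w_b * h_b) <= D_max."""
    return (ff_area + comb_area) / (bin_w * bin_h) <= d_max


def max_allowable_distance(pin: Point, old_ff: Point, slack: float) -> float:
    """Eq. (3): d_max = d(p_j, f_pj) + d_slack(p_j, f_pj)."""
    return manhattan(pin, old_ff) + slack


def timing_ok(new_ff: Point, pins: List[Tuple[Point, Point, float]]) -> bool:
    """Eq. (2): every pin must stay within its maximum allowable distance.

    Each entry of `pins` is (pin position, old flip-flop position, slack distance).
    """
    return all(
        manhattan(pin, new_ff) <= max_allowable_distance(pin, old_ff, slack)
        for pin, old_ff, slack in pins
    )


# Hypothetical data: two 1-bit flip-flops at (2, 0) and (4, 0), one pin each,
# both with a slack distance of 1.
pins = [((0.0, 0.0), (2.0, 0.0), 1.0),
        ((6.0, 0.0), (4.0, 0.0), 1.0)]
print(timing_ok((3.0, 0.0), pins))   # candidate MBFF location between the two
print(timing_ok((5.0, 0.0), pins))   # candidate location far from the first pin
print(density_ok(30.0, 40.0, 10.0, 10.0, 0.8))
```

In this toy case both pins have a maximum allowable distance of 3, so a location midway between the two original flip-flops is acceptable, while a location shifted toward one pin violates the budget of the other.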

III. Proposed Algorithms

Based on the problem formulation described in Section II, we propose our algorithms to simultaneously reduce total flip-flop power consumption and interconnecting wirelength at the post-placement stage. The flow of our algorithms is illustrated in Fig. 4. First of all, the MBFF cells in the cell library are sorted in ascending order with respect to the flip-flop power consumption per bit, which is the power consumption of an MBFF cell divided by its bit number, $P_{f_L^m}/m$. Once the MBFF cells in the cell library are sorted, the most power-efficient MBFF cell is then iteratively extracted. Our algorithms always replace a group of flip-flops with the most power-efficient MBFF cell during the optimization. After an m-bit flip-flop cell is extracted, two major steps, including flip-flop grouping (see Section III-A) and MBFF creation and placement (see Section III-B), in the flow will

Fig. 4. Flowchart of the proposed algorithms.

Fig. 5. (a) Timing-slack-free region of the flip-flop, $f_2$. (b) Timing-slack-free regions of the flip-flops, $f_1, f_2, \ldots,$ and $f_6$.

be performed based on the technique of progressive window-based optimization (see Section III-C). The first step finds a set of m-bit flip-flop groups in the design, while the second step determines the position of each m-bit flip-flop group and verifies the legality of the position. A legal position of an m-bit flip-flop group means that placing an m-bit flip-flop cell at the position does not violate any of the aforementioned design constraints. If the position for an m-bit flip-flop group is legal, an m-bit flip-flop cell is then created to merge all the flip-flops in the m-bit flip-flop group. Otherwise, the flip-flops in the m-bit flip-flop group cannot be merged. The overall time complexity of the proposed algorithms is further analyzed in Section III-D.

A. Grouping of Flip-Flops

When grouping a set of flip-flops, the timing slack constraints between any flip-flop and all its connected pins should be considered first. According to the timing slack constraints, we explore all possible combinations of flip-flop groups for flip-flop merging. Finally, we try to select maximal non-conflicting flip-flop groups from all the combinations.

1) Consideration of Timing Slack Constraints: Based on (2), every flip-flop should be placed in the timing-slack-free region, which is defined as follows.

Definition 1: A timing-slack-free region (TSFR) of a flip-flop is a region where the flip-flop is placed within the maximum allowable distance from its connected pins such that the timing slack constraints are satisfied.

Fig. 5(a) illustrates the TSFR of $f_2$, which is a tilted rectangular region [29] intersected by the Manhattan rings [16], [29] of $p_1$ and $p_2$. Every point on the Manhattan ring of $p_1$ ($p_2$) has the same Manhattan distance from $p_1$ ($p_2$), which is equal to $d_{max}(p_1, f_2)$ ($d_{max}(p_2, f_2)$). Fig. 5(b) further shows all the TSFRs of $f_1, f_2, \ldots,$ and $f_6$ in the same design. According to the definition of the TSFR, a set of flip-flops can be grouped and replaced by an MBFF if the TSFRs of all the flip-flops share a common intersection region. In Fig. 5(b), $f_2$ and $f_5$ cannot be grouped and merged into an MBFF since the TSFRs of $f_2$ and $f_5$ do not intersect. In contrast, $f_1$ and $f_2$ can be grouped and merged into an MBFF because the merged MBFF can be placed in the intersection of the TSFRs of $f_1$ and $f_2$ such that the timing slack constraint of the merged MBFF is met. Such a flip-flop group of $f_1$ and $f_2$ is called a timing-slack-free group, which is defined in the following.

Definition 2: A timing-slack-free group (TSFG) is a flip-flop group in which all the flip-flops can be merged into an MBFF such that the timing slack constraints between the MBFF and all its connected pins are satisfied.

2) Exploration of m-Bit Flip-Flop Groups: Before exploring the m-bit TSFGs in a design, we construct the TSFR intersection graph, which is defined in Definition 3. Fig. 6(a) shows the TSFR intersection graph representing the relationship of the TSFRs in Fig. 5(b). The vertices, $v_1, v_2, \ldots, v_6$, represent the six flip-flops, $f_1, f_2, \ldots, f_6$, respectively. If there is an intersection between the TSFRs of two flip-flops, there is an edge between the corresponding vertices.

Definition 3: A TSFR intersection graph is a graph, $G(V, E)$, where each vertex, $v_i$, corresponds to a flip-flop, $f_i$, in the design, and an edge, $e_{ij}$, between $v_i$ and $v_j$ exists if there is an intersection between the TSFRs of $f_i$ and $f_j$.

Once the TSFR intersection graph of a design is constructed, we can explore all the m-bit TSFGs in the design by finding all the m-cliques in the TSFR intersection graph. Each m-clique in the graph corresponds to an m-bit TSFG. The problem of finding all m-cliques in the graph can be solved by applying the branch-and-bound and backtracking algorithms [30] using a search tree as shown in Fig. 6(b). From the example, we can find all 4-cliques, including $\{n_1, n_2, n_3, n_4\}$ and $\{n_1, n_3, n_4, n_6\}$, in the graph. Consequently, the set of 4-bit TSFGs, $G^4$, of the design in Fig. 5(b) contains two TSFGs, $\{g_1^4, g_2^4\}$, where $g_1^4 = \{f_1, f_2, f_3, f_4\}$ and $g_2^4 = \{f_1, f_3, f_4, f_6\}$.

3) Selection of Flip-Flop Groups: After exploring the set of m-bit TSFGs of a design, denoted by $G^m = \{g_1^m, g_2^m, \ldots, g_k^m\}$, the next problem is how to select the maximum number of non-conflicting m-bit TSFGs for more

Fig. 6. (a) TSFR intersection graph representing the relationship among the TSFRs in Fig. 5(b). (b) Branch-and-bound and backtracking algorithms [30] which find all 4-vertex cliques in (a).

Fig. 7. (a) Original coordinate system. (b) Transformed coordinate system.

Algorithm 1 Generation of an IS of TSFGs

Input: $G^m$ {a set of m-bit TSFGs}
Output: $G^m_{IS}$ {an independent set of m-bit TSFGs}
1: Sort $g_i^m \in G^m$ in descending order with respect to $\alpha A_{g_i^m} + W_{g_i^m}$; // $\alpha$ is a constant.
2: $F' \leftarrow \emptyset$;
3: $G^m_{IS} \leftarrow \emptyset$;
4: for each $g_i^m \in G^m$ do
5:     if there exists no $f_j \in g_i^m$ in $F'$ then
6:         $G^m_{IS} \leftarrow G^m_{IS} + g_i^m$;
7:         for each $f_j \in g_i^m$ do
8:             $F' \leftarrow F' + \{f_j\}$;
9:         end for
10:     end if
11: end for
12: return $G^m_{IS}$

power saving and wirelength reduction. The selection of non-conflicting TSFGs can be formulated as finding the maximum independent set (MIS) in $G^m$. In the previous example, the MIS in $G^4$ is either $\{g_1^4\}$ or $\{g_2^4\}$ since $f_1$, $f_3$, and $f_4$ belong to both $g_1^4$ and $g_2^4$. The independent set of TSFGs is defined as follows.

Definition 4: An independent set (IS) of TSFGs is a set of TSFGs in which every flip-flop belongs to only one TSFG.

Since finding the MIS is known to be an NP-hard problem [31], a novel and efficient algorithm, shown in Algorithm 1, is proposed to generate the IS of TSFGs in $G^m$. In addition to power optimization, the proposed algorithm considers the placement area, $A_{g_i^m}$, of the MBFF corresponding to a TSFG, $g_i^m$, and the wirelength reduction, $W_{g_i^m}$, of $g_i^m$. $A_{g_i^m}$ can be calculated as the intersection area of the corresponding TSFRs. $W_{g_i^m}$ is the difference between the half-perimeter wirelengths (HPWL) before and after replacing $g_i^m$ with an MBFF. In Algorithm 1, the $g_i^m$ in $G^m$ with the largest value of $\alpha A_{g_i^m} + W_{g_i^m}$ is repeatedly selected and added into another set, $G^m_{IS}$, until the IS of TSFGs is obtained.

B. Placement of Flip-Flop Groups

Once the IS of TSFGs is obtained, a proper location for the MBFF corresponding to a TSFG should be searched. In

Fig. 8. Example of converting the design in Fig. 5(b) from the original coordinate system into the transformed coordinate system. The intersection denotes the valid placement region of the TSFG, $g_2^4 = \{f_1, f_3, f_4, f_6\}$.

this subsection, the transformation of the coordinate system is first introduced to improve the computational efficiency when calculating the intersection of several TSFRs. The placement bins and grids are then searched for each MBFF corresponding to a TSFG according to the intersection of TSFRs. When finding a placement bin or a placement grid for each MBFF, we try to minimize the interconnecting wirelength while satisfying the placement density constraint.

1) Transformation of Coordinate System: According to Definitions 1 and 2, the MBFF corresponding to a TSFG should be placed within the intersection of the TSFRs of all the flip-flops in the TSFG. Since all the TSFRs are tilted by 45° with respect to the placement coordinate system, the intersection of the TSFRs is also tilted by 45°. To efficiently calculate the coordinates of the intersection from the coordinates of the TSFRs, we transform the coordinate system based on the transfer functions defined in (4). The difference between the original and the transformed coordinate systems is demonstrated in Fig. 7. Coordinates can be transformed back and forth between the two systems based on the transfer and inverse transfer functions in (4) and (5):

$$\begin{cases} X_{trans} = X_{orig} - Y_{orig} \\ Y_{trans} = Y_{orig} + X_{orig} \end{cases} \tag{4}$$

$$\begin{cases} X_{orig} = (X_{trans} + Y_{trans})/2 \\ Y_{orig} = (Y_{trans} - X_{trans})/2. \end{cases} \tag{5}$$

Fig. 8 further shows an example of converting the design in Fig. 5(b) into the transformed coordinate system. After the transformation, all the tilted rectangular regions of the TSFRs become non-tilted. Consequently, it becomes much


Fig. 9. Example of finding placement bins intersected by the boundaries of the tilted rectangular placement region. (a) Set of placement bins intersected by the bottom-left boundary of the tilted rectangular placement region. (b) Set of placement bins intersected by all four boundaries of the tilted rectangular placement region.

Fig. 10. Placement area of an MBFF with the consideration of interconnecting wirelength. (a) Placement area bounded by the median coordinates of the eight pins. (b) Enlarged placement area when placing an MBFF in the area in (a) is not feasible.

easier to calculate the coordinates of the intersection of the TSFRs of all flip-flops in the TSFG, $g_2^4 = \{f_1, f_3, f_4, f_6\}$. Once the coordinates of the intersection region in Fig. 8 are calculated in the transformed coordinate system, they should be transformed back to the original coordinate system such that the coordinates of the tilted rectangular placement region in the original coordinate system can be obtained.

2) Consideration of Placement Density: To find a legal placement for an MBFF corresponding to a TSFG within the tilted rectangular placement region, or the TSFR of the MBFF, only the placement bins covered by the tilted rectangular placement region should be considered. To collect all the placement bins, the bins intersected by each boundary of the tilted rectangular placement region should be first identified as shown in Fig. 9(a) and (b). The bins surrounded by these intersected bins can be found and collected accordingly. To better consider the placement density during MBFF placement, the bin with the lowest placement density is chosen to accommodate the MBFF corresponding to a TSFG. If there

is no valid placement grid within the bin, the bin with the second lowest placement density is then chosen. The grid-searching process is repeated until a valid placement grid for the MBFF is found.

3) Consideration of Interconnecting Wirelength: In addition to considering the placement densities of the bins within the TSFR of an MBFF corresponding to a TSFG, it is also required to minimize the interconnecting wirelength when placing the MBFF. To find a position for the MBFF with shorter wirelength, the area bounded by the median coordinates of all pins connected to the MBFF is first considered as shown in Fig. 10(a). In this example, the median coordinates of the eight pins are $x_{p_4}$, $x_{p_5}$, $y_{p_4}$, and $y_{p_8}$ in both axes. Once the area bounded by the median coordinates of all pins is obtained, a grid-searching process is performed to find a valid placement grid. During the grid-searching process, the bin with the lowest placement density, which contains grids inside the TSFR and the bounded area of the median coordinates of all pins, is first chosen. For example, in Fig. 11(a), the bin, $b_{22}$, is


Fig. 11. Example of finding valid placement grids during grid-searching process. (a) Bins containing placement grids inside both the TSFR of the MBFF and the area bounded by the median coordinates of all pins connected to the MBFF. (b) All possible placement grids in the bin, b22 .

the one with the lowest placement density. A valid placement grid is then searched among all possible placement grids in $b_{22}$ as shown in Fig. 11(b). If there is no valid placement grid in the bin intersected by both the area bounded by the coordinates of the pins and the TSFR, or the tilted rectangular placement region, the area bounded by the coordinates of the pins is enlarged to the next pitch, which is the one closest to the current pitches. In Fig. 10(b), $y_{p_1}$ is the closest pitch to $y_{p_8}$ compared with all the other neighboring pitches. The enlarged area for wirelength minimization is then bounded by $x_{p_4}$, $x_{p_5}$, $y_{p_4}$, and $y_{p_1}$. The process is continued until a valid placement grid for the MBFF is found.

C. Progressive Window-Based Optimization

In addition to the flip-flop grouping and placement algorithms, we further propose the progressive window-based optimization technique, as seen in Fig. 4, for the following reasons.
1) At the post-placement stage, it is more relevant to minimize the deviation from an already placed design. The proposed progressive window-based optimization technique focuses on local flip-flop merging and replacement to avoid global changes.
2) As modern SoCs usually contain hundreds of thousands of flip-flops, it is inefficient to handle such a large flattened design during post-placement power optimization with MBFFs, especially at the stage of finding feasible flip-flop groups. The technique can effectively reduce the solution space within a design window.

The progressive window-based optimization contains three major steps: 1) window initialization; 2) window sliding; and 3) window size expansion.

1) Window Initialization: Before the algorithms of flip-flop grouping and MBFF placement are performed, a window is initialized at the bottom-left corner of the chip area with the size of $n_b \times n_b$ bins, where $n_b$ is a positive integer. The purpose of the window is to restrict the problem size to a sub-design containing fewer flip-flops within a portion of the chip area. Fig. 12(a) shows an example of an initialized window with the size of 2 × 2 bins. Once the window is initialized, the flip-flop grouping and MBFF placement algorithms are then performed within the window.

2) Window Sliding: After the sub-design in the bottom-left window is optimized with MBFFs, the window is progressively slid from left to right and from bottom to top with the distance of half the window size, or $n_b/2$ bins, until all bins in the chip area are covered. Fig. 12(b) illustrates the procedure of progressive window sliding throughout the chip area. Every time the window is slid, the flip-flop grouping and MBFF placement algorithms are performed within the slid window.

3) Window Size Expansion: Once the window is slid throughout the chip area, we expand the window size based on (6), where $n'_b$ is also a positive integer, such that a larger design area is covered:

$$n_b = n_b + n'_b. \tag{6}$$

The enlarged window is again initialized at the bottom-left corner of the chip area, and progressively slid throughout the whole chip area with the distance of half the enlarged window size each time. The flip-flop grouping and MBFF placement algorithms are performed within the enlarged window every time the window is initialized and slid. It should be noted that the flip-flops which have previously been grouped and merged within a smaller window are not considered again within a larger window. Therefore, the problem size of a sub-design within a larger window will not grow exponentially. The window size is progressively expanded by $n'_b$ until the whole chip area is covered by the window. Fig. 12(c) shows an enlarged window with the size of 4 × 4 bins. In this case, the window has already covered the whole chip area, so it will not be slid or expanded anymore.

Fig. 12. Progressive window-based optimization technique. (a) Initialized window with the size of 2 × 2 bins. (b) Progressive window sliding throughout the chip area. (c) Enlarged window with the size of 4 × 4 bins.

D. Complexity Analysis

We first analyze the time complexities of the flip-flop grouping and placement algorithms, respectively. The integrated time complexity is then demonstrated together with the progressive window-based optimization technique.

1) Time Complexity of Flip-Flop Grouping Algorithm: It takes $O(n^2)$ to construct a TSFR intersection graph, where $n$ is the number of flip-flops, because a TSFR intersection graph contains $n$ nodes and $\binom{n}{2}$ edges in the worst case. After the graph is constructed, it further takes $O(n^m)$ to obtain all possible m-bit TSFGs by exploring m-cliques in the graph based on the branch-and-bound and backtracking algorithms [30]. When generating an independent set of all the explored TSFGs, as seen in Algorithm 1, it takes $O(m n^m)$ to compute the sorting function, $\alpha A_{g_i^m} + W_{g_i^m}$, for all TSFGs, $O(m n^m \lg n)$ to sort all TSFGs (Line 1 in Algorithm 1), and $O(m n^m)$ to generate an independent set of all TSFGs (Lines 4–11 in Algorithm 1). Consequently, the overall time complexity of the flip-flop grouping algorithm is $O(m n^m \lg n)$.

2) Time Complexity of MBFF Placement Algorithm: After grouping the flip-flops in a design, we have at most $n/m$ TSFGs. For each TSFG, it takes $O(m)$ to transform the coordinate systems and to find the intersection region of the TSFRs of the $m$ 1-bit flip-flops in the TSFG. It then takes constant time to find a valid placement bin and grid because the numbers of bins and grids in the intersection region are constant. Finally, it takes $O(m)$ to create an MBFF for the TSFG and update the netlist. Consequently, the overall time complexity of the MBFF placement algorithm is $O(n)$.

3) Integrated Time Complexity with Progressive Window-Based Optimization: We assume that there are k windows

TABLE IV Industry Benchmark Circuits Circuit c1 c2 c3 c4 c5 c6

# of FFs 1-bit 2-bit 76 22 366 57 1464 228 4378 751 9150 1425 146 400 22 800

Normalized Total Power 113.84 464.04 1856.16 5669.72 11601.00 185616.00

Total WL (nm) 89 425 348 920 1 395 680 4 290 655 8 723 000 139 568 000

during the progressive window-based optimization. The number of flip-flops in each window is nk . The time complexity of the flip-flop grouping algorithm becomes O(km( nk )m lg nk ), and the time complexity of the MBFF placement algorithm is still O(n). Consequently, the overall time complexity of our algorithms is O(km( nk )m lg nk ). IV. Experimental Results We implemented our algorithms in the C++ programming language, and performed three sets of experiments on a 2.66 GHz Intel Core i7 PC under the Linux operating system. First of all, we compared the flip-flop power consumption, wirelength, and runtime for our approach and the most recent works [19]–[21]. Second, we showed the efficiency of the proposed progressive window-based optimization. Finally, we demonstrated the impact on clock tree wirelength after clock tree synthesis and compared with [21].
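The C++ implementation is not reproduced in this paper. As a rough illustration of the grouping steps whose cost is analyzed in Section III-D (build the intersection graph, enumerate $m$-cliques by backtracking, sort the candidates, and greedily keep a disjoint subset), the following Python sketch may help. It rests on several assumptions: feasible regions are modeled as axis-aligned rectangles in an already transformed coordinate system, the `cost` argument is a hypothetical stand-in for the sorting function of Algorithm 1, a smaller cost is taken to be better, and all function names are invented for the example.

```python
from typing import Callable, List, Sequence, Set, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed form
Group = Tuple[int, ...]                   # indices of flip-flops merged together


def overlaps(a: Rect, b: Rect) -> bool:
    """True if two closed axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def intersection_graph(regions: Sequence[Rect]) -> List[Set[int]]:
    """Adjacency sets of the region intersection graph, built in O(n^2) pair tests."""
    adj: List[Set[int]] = [set() for _ in regions]
    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            if overlaps(regions[i], regions[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def m_cliques(adj: List[Set[int]], m: int) -> List[Group]:
    """Enumerate all cliques of exactly m nodes by backtracking with a size bound."""
    found: List[Group] = []

    def extend(clique: List[int], candidates: List[int]) -> None:
        if len(clique) == m:
            found.append(tuple(clique))
            return
        if len(clique) + len(candidates) < m:   # bound: cannot reach m nodes
            return
        for pos, v in enumerate(candidates):
            # keep only later candidates adjacent to v, so each clique appears once
            extend(clique + [v], [u for u in candidates[pos + 1:] if u in adj[v]])

    extend([], list(range(len(adj))))
    return found


def greedy_disjoint_groups(groups: List[Group], cost: Callable[[Group], float]) -> List[Group]:
    """Sort candidate groups by cost and keep those sharing no flip-flop."""
    chosen: List[Group] = []
    used: Set[int] = set()
    for group in sorted(groups, key=cost):
        if used.isdisjoint(group):
            chosen.append(group)
            used.update(group)
    return chosen


if __name__ == "__main__":
    regions = [(0, 0, 2, 2), (1, 1, 3, 3), (1.5, 0, 2.5, 2.5), (10, 10, 11, 11)]
    graph = intersection_graph(regions)
    pairs = m_cliques(graph, 2)
    print(greedy_disjoint_groups(pairs, cost=lambda g: float(sum(g))))
```

For axis-aligned rectangles, pairwise overlap implies a common overlap region, which is why a clique in this simplified model corresponds to a set of flip-flops with a shared feasible region.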


TABLE V. Comparisons of Power Reduction, Wirelength Reduction, and Runtime for Five Different Approaches: 1) Yan and Chen’s Approach [19] Implemented and Reported in [20] (on Intel Xeon 3.8 GHz); 2) Jiang et al.’s Approach [20] (on Intel Xeon 3.8 GHz); 3) Wang et al.’s Approach [21] (on Intel Xeon 2.13 GHz); 4) Our Approach [1] (on Intel Core i7 2.66 GHz); and 5) Our Approach in This Paper (on Intel Core i7 2.66 GHz)

| Circuit | [19] (Reported in [20]) Power Red. (%) | [19] WL Red. (%) | [19] Time (s) | [20] Power Red. (%) | [20] WL Red. (%) | [20] Time (s) | [21] Power Red. (%) | [21] WL Red. (%) | [21] Time (s) | Ours [1] Power Red. (%) | Ours [1] WL Red. (%) | Ours [1] Time (s) | Ours (This Paper) Power Red. (%) | Ours (This Paper) WL Red. (%) | Ours (This Paper) Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c1 | 17.20 | −23.00 | 0.03 | 17.20 | 3.60 | 0.01 | 15.64 | 8.20 | 0.01 | 14.80 | 8.30 | 0.01 | 15.60 | 11.84 | 0.01 |
| c2 | 18.80 | −24.80 | 0.11 | 19.10 | −2.00 | 0.01 | 17.53 | 11.10 | 0.05 | 16.90 | 5.30 | 0.04 | 17.21 | 13.49 | 0.03 |
| c3 | 18.70 | −25.20 | 0.53 | 19.20 | −3.60 | 0.01 | 17.41 | 11.50 | 0.22 | 17.10 | 5.20 | 0.10 | 17.16 | 13.22 | 0.08 |
| c4 | 18.50 | −24.70 | 2.55 | 19.00 | −4.10 | 0.02 | 17.08 | 11.50 | 0.72 | 16.80 | 5.50 | 0.28 | 16.87 | 12.99 | 0.25 |
| c5 | 18.70 | −24.20 | 8.01 | 19.30 | −4.80 | 0.05 | 17.29 | 13.40 | 1.89 | 17.10 | 5.10 | 0.60 | 17.11 | 13.17 | 0.61 |
| c6 | 18.70 | −24.40 | 1994.61 | 19.30 | −5.30 | 1.11 | 17.52 | 11.80 | 36.12 | 17.20 | 5.10 | 78.92 | 17.09 | 13.18 | 84.41 |
| Avg. | 18.43 | −24.38 | | 18.85 | −2.70 | | 17.08 | 11.25 | | 16.65 | 5.75 | | 16.84 | 12.98 | |

A. Power and Wirelength Reduction

In the first set of experiments, we empirically tested our approach on six industrial circuits [24], which contain all the required input data, including the placement density and timing slack constraints described in Section II. Table IV lists the names of the benchmark circuits (“Circuit”), the numbers of 1-bit and 2-bit flip-flops (“# of FFs”), the normalized total flip-flop power consumption (“Normalized Total Power”), and the total wirelength (“Total WL”). Across the circuits, the number of 1-bit flip-flops ranges from 76 to 146 400 and the number of 2-bit flip-flops ranges from 22 to 22 800. No circuit contains 4-bit flip-flops. The placement of all flip-flops in each circuit had been optimized by commercial tools, and the chip area utilization of each circuit is about 50%. The normalized total flip-flop power consumption was calculated by summing up the normalized power consumption of each flip-flop cell, as seen in Table I, and the total wirelength was calculated by summing up the wirelength of each net connected to a flip-flop based on the HPWL model.

We compared the flip-flop power consumption, wirelength, and runtime of our approaches with those of the most recent works [19]–[21]. Table V lists the names of the benchmark circuits (“Circuit”), the flip-flop power reduction compared with the flip-flop power consumption of the input circuits (“Power Red.”), the wirelength reduction compared with the total wirelength of the input circuits (“WL Red.”), and the runtime (“Time”) for five different approaches: 1) Yan and Chen’s approach [19] (on Intel Xeon 3.8 GHz); 2) Jiang et al.’s approach [20] (on Intel Xeon 3.8 GHz); 3) Wang et al.’s approach [21] (on Intel Xeon 2.13 GHz); 4) our approach [1] (on Intel Core i7 2.66 GHz); and 5) our approach in this paper (on Intel Core i7 2.66 GHz). The results for [19] and [20] are taken from Jiang et al.’s implementation of Yan and Chen’s approach [19] and the experimental comparisons reported in [20].
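The two baseline metrics described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are invented, and the per-cell power values (1.00 for a 1-bit and 1.72 for a 2-bit flip-flop) are not copied from Table I but inferred from Table IV, where they reproduce the “Normalized Total Power” column for all six circuits.

```python
from typing import Dict, Iterable, Sequence, Tuple

Pin = Tuple[float, float]  # (x, y) pin location, e.g., in nm

# Assumption: per-cell normalized power inferred from Table IV, not quoted from Table I.
NORMALIZED_POWER: Dict[int, float] = {1: 1.00, 2: 1.72}


def hpwl(pins: Sequence[Pin]) -> float:
    """Half-perimeter wirelength: width plus height of the pins' bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets: Iterable[Sequence[Pin]]) -> float:
    """Sum of HPWL over all nets connected to a flip-flop."""
    return sum(hpwl(pins) for pins in nets)


def total_ff_power(ff_counts: Dict[int, int]) -> float:
    """Sum of normalized power over all flip-flop cells, keyed by bit width."""
    return sum(NORMALIZED_POWER[bits] * count for bits, count in ff_counts.items())


if __name__ == "__main__":
    # Circuit c1 in Table IV: 76 one-bit and 22 two-bit flip-flops.
    print(round(total_ff_power({1: 76, 2: 22}), 2))
    print(total_wirelength([[(0, 0), (3, 4), (1, 1)], [(2, 2), (5, 2)]]))
```

Merging flip-flops changes both metrics at once: the cell counts per bit width change, and every net attached to a merged flip-flop gets a new pin location and hence a new bounding box.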
Compared with the previous works, our approach in this paper achieves comparable power reduction, which is on average only about 2 percentage points lower than the best result [20] (16.84% versus 18.85%). It should be noted that our approach achieves the largest wirelength reduction: on average, it is 37.36 percentage points better than [19], 15.68 better than [20], 1.73 better than [21], and 7.23 better than [1]. Consequently, the net switching power is also reduced due to shorter interconnecting wirelength (i.e., smaller wire loading). It should be noted that we have modified the sorting function in Algorithm 1 for more wirelength reduction compared with [1]. Fig. 13 further compares the distribution of slack distance and bin density of circuit “c6” before and after performing the proposed algorithms in this paper. Fig. 13(a) shows the number of nets (“# of Nets”) over different slack distances ranging from 0 to 1400 nm, which can be calculated by (3). Fig. 13(b) shows the number of bins (“# of Bins”) over different bin densities ranging from 0 to 100%, which can be calculated by (1). The results show that neither timing slack nor bin density is affected much after applying MBFFs at the post-placement stage based on our formulation and algorithms.

Fig. 13. Distribution of (a) slack distance and (b) bin density of the circuit “c6” before and after performing the post-placement power optimization with MBFFs.

B. Impact of Progressive Window-Based Optimization

In the second set of experiments, we compared the results of our approach with different parameter settings for the proposed progressive window-based optimization. Table VI lists the names of the benchmark circuits (“Circuit”), the flip-flop power reduction compared with the flip-flop power consumption of the input circuits (“Power Red.”), the wirelength reduction compared with the total wirelength of the input circuits (“WL Red.”), the average flip-flop displacement before and after the flip-flops are merged (“Avg. FF Disp.”), and the runtime (“Time”) for three different parameter settings of the window size, $n_b$, and window expansion, $\Delta n_b$, in (6), which are: 1) $n_b = 2$, $\Delta n_b = 2$; 2) $n_b = 4$, $\Delta n_b = 4$; and 3) $n_b = 6$, $\Delta n_b = 6$.

For power reduction, the results based on different parameter settings are similar. Because finding the MIS of TSFGs is an NP-hard problem, as discussed in Section III-A, we cannot guarantee that our flip-flop grouping algorithm will find a globally optimal solution with a larger window size. Although a larger window size results in more wirelength reduction, it also leads to more flip-flop displacement. At the post-placement stage, minimizing the placement deviation is the more relevant objective. For runtime, the results are consistent with our complexity analysis in Section III-D: a larger window contains more flip-flops, leading to longer runtime.

TABLE VI. Comparisons of Power Reduction, Wirelength Reduction, and Runtime for Different Parameter Settings of the Progressive Window-Based Optimization. Column groups are labeled by the setting ($n_b$, $\Delta n_b$).

| Circuit | (2, 2) Power Red. (%) | (2, 2) WL Red. (%) | (2, 2) Avg. FF Disp. (bins) | (2, 2) Avg. FF Disp. (nm) | (2, 2) Time (s) | (4, 4) Power Red. (%) | (4, 4) WL Red. (%) | (4, 4) Avg. FF Disp. (bins) | (4, 4) Avg. FF Disp. (nm) | (4, 4) Time (s) | (6, 6) Power Red. (%) | (6, 6) WL Red. (%) | (6, 6) Avg. FF Disp. (bins) | (6, 6) Avg. FF Disp. (nm) | (6, 6) Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c1 | 15.60 | 11.84 | 0.52 | 295.61 | 0.01 | 14.55 | 16.00 | 0.61 | 315.56 | 0.03 | 14.30 | 13.02 | 0.61 | 305.46 | 0.14 |
| c2 | 17.21 | 13.49 | 0.51 | 317.50 | 0.03 | 16.63 | 15.66 | 0.56 | 327.15 | 0.07 | 16.43 | 15.20 | 0.56 | 326.30 | 0.34 |
| c3 | 17.16 | 13.22 | 0.50 | 317.23 | 0.08 | 16.60 | 15.45 | 0.56 | 327.03 | 0.23 | 16.61 | 14.27 | 0.57 | 327.63 | 0.94 |
| c4 | 16.87 | 12.99 | 0.50 | 314.07 | 0.25 | 16.32 | 15.27 | 0.57 | 325.89 | 0.63 | 16.39 | 13.64 | 0.58 | 326.03 | 2.30 |
| c5 | 17.11 | 13.17 | 0.50 | 316.85 | 0.61 | 16.58 | 15.33 | 0.56 | 326.96 | 1.40 | 16.70 | 13.78 | 0.58 | 328.55 | 4.58 |
| c6 | 17.09 | 13.18 | 0.50 | 316.60 | 84.41 | 16.57 | 15.26 | 0.56 | 326.93 | 100.33 | 16.74 | 13.56 | 0.58 | 329.05 | 147.00 |
| Avg. | 16.84 | 12.98 | 0.50 | 312.98 | | 16.21 | 15.05 | 0.57 | 324.92 | | 16.20 | 13.91 | 0.58 | 323.84 | |

C. Reduction of Clock Sinks and Clock Tree Wirelength

In the third set of experiments, we showed the reduction of clock sinks and clock tree wirelength after applying MBFFs at the post-placement stage and compared it with Wang et al.’s approach [21]. We performed zero-skew clock tree synthesis using bounded-skew clock tree routing (Version 1.0) [32] based on the nine benchmark circuits in [21] with and without applying MBFFs. Each circuit also contains all the required inputs described in Section II. We adopted the experimental setup in [21] for comparison with their results. Table VII lists the names of the benchmark circuits (“Circuit”), the number of clock sinks (“FF #”), the clock tree wirelength (“CT-WL”), the total signal net wirelength (“SN-WL”), and the total weighted signal net wirelength (“WSN-WL”), where the wirelength is measured by the HPWL model, and the clock tree wirelength of each circuit was obtained by performing bounded-skew clock tree routing (Version 1.0) [32] for each circuit without applying MBFFs.

TABLE VII. Benchmark Circuits in [21]

| Circuit | FF # | CT-WL (nm) | SN-WL (nm) | WSN-WL (nm) |
|---|---|---|---|---|
| r1 | 267 | 1 325 183 | 1 743 703 | 179 802 |
| r2 | 598 | 2 621 623 | 3 930 879 | 400 928 |
| r3 | 862 | 3 357 327 | 5 672 241 | 574 642 |
| r4 | 1903 | 6 839 628 | 12 616 681 | 1 266 302 |
| r5 | 3101 | 10 145 960 | 20 528 314 | 2 061 324 |
| t0 | 120 | 39 637 | 83 285 | 8755 |
| t1 | 60 000 | 3 981 765 | 53 624 875 | 5 356 145 |
| t2 | 5524 | 985 348 | 3 562 985 | 357 151 |
| t3 | 953 | 201 755 | 576 710 | 58 576 |

We compared the number of clock sinks, clock tree wirelength, and signal net wirelength after applying MBFFs to each circuit based on Wang et al.’s approach [21] and ours. The runtime of each approach was also compared. Table VIII lists the names of the benchmark circuits (“Circuit”), the clock sink reduction (“FF # Red.”), the clock tree wirelength reduction (“CT-WL Red.”), the total signal net wirelength reduction (“SN-WL Red.”), the total weighted signal net wirelength reduction (“WSN-WL Red.”), and the runtime (“Time”) for both approaches. All the reductions are relative to the corresponding data of the input circuit. Although our average clock sink reduction is 4.15 percentage points lower than that of [21] (i.e., more clock sinks remain), the clock tree wirelength reduction of our approach is very close to that in [21] (40.46% versus 40.56% on average). It should be noted that the reductions in total signal net wirelength and weighted signal net wirelength based on our approach are 5.90 and 3.48 percentage points better than those in [21], respectively. Consequently, our approach can save more switching power arising from signal nets.
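The averaged comparisons quoted in Sections IV-A and IV-C are plain differences between “Avg.” entries of Tables V and VIII, that is, percentage points rather than relative percentages. A short check, with the averages copied from those tables and a hypothetical helper name `point_gap`, could look like this:

```python
def point_gap(ours: float, other: float) -> float:
    """Difference between two reductions given in percent, in percentage points."""
    return round(ours - other, 2)


# "Avg." wirelength reductions (%) from Table V.
TABLE_V_AVG_WL_RED = {"[19]": -24.38, "[20]": -2.70, "[21]": 11.25, "[1]": 5.75}
OURS_AVG_WL_RED = 12.98

# "Avg." reductions (%) from Table VIII as (approach [21], ours).
TABLE_VIII_AVG = {
    "FF # Red.": (68.91, 64.76),
    "CT-WL Red.": (40.56, 40.46),
    "SN-WL Red.": (15.21, 21.11),
    "WSN-WL Red.": (17.74, 21.22),
}

if __name__ == "__main__":
    for work, value in TABLE_V_AVG_WL_RED.items():
        print("WL Red. vs", work, point_gap(OURS_AVG_WL_RED, value))
    for metric, (wang, ours) in TABLE_VIII_AVG.items():
        print(metric, point_gap(ours, wang))
```

A negative gap for “FF # Red.” means that fewer clock sinks were removed than in [21], which matches the statement that more clock sinks remain.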


TABLE VIII. Comparisons of the Reductions of Flip-Flop Number, Clock Tree Wirelength, Signal Net Wirelength, and Weighted Signal Net Wirelength, and Runtime for Wang et al.’s Approach [21] (on Intel Xeon 2.13 GHz) and Ours (on Intel Core i7 2.66 GHz)

| Circuit | [21] FF # Red. (%) | [21] CT-WL Red. (%) | [21] SN-WL Red. (%) | [21] WSN-WL Red. (%) | [21] Time (s) | Ours (This Paper) FF # Red. (%) | Ours (This Paper) CT-WL Red. (%) | Ours (This Paper) SN-WL Red. (%) | Ours (This Paper) WSN-WL Red. (%) | Ours (This Paper) Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| r1 | 62.17 | 29.43 | −0.76 | 1.75 | 1.53 | 52.06 | 35.96 | 9.46 | 9.69 | 0.22 |
| r2 | 62.71 | 33.50 | 5.04 | 7.59 | 5.91 | 58.36 | 33.07 | 10.32 | 10.93 | 0.27 |
| r3 | 65.43 | 37.48 | 8.47 | 11.00 | 6.98 | 60.44 | 37.57 | 11.80 | 11.96 | 0.38 |
| r4 | 68.89 | 39.78 | 4.36 | 6.19 | 28.20 | 63.58 | 37.46 | 11.86 | 11.64 | 0.70 |
| r5 | 70.30 | 42.08 | 7.38 | 9.94 | 51.26 | 65.56 | 39.71 | 11.66 | 12.01 | 1.18 |
| t0 | 69.17 | 37.35 | 10.71 | 14.54 | 0.03 | 65.83 | 40.13 | 29.03 | 29.01 | 0.02 |
| t1 | 74.93 | 52.02 | 38.32 | 41.04 | 1053.12 | 74.74 | 49.48 | 41.06 | 41.04 | 20.10 |
| t2 | 72.39 | 44.84 | 41.07 | 42.63 | 1.98 | 71.22 | 43.88 | 35.10 | 34.94 | 0.87 |
| t3 | 74.19 | 49.58 | 22.30 | 25.00 | 1.14 | 71.04 | 46.93 | 29.66 | 29.78 | 0.14 |
| Avg. | 68.91 | 40.56 | 15.21 | 17.74 | | 64.76 | 40.46 | 21.11 | 21.22 | |

V. Conclusion

In this paper, we presented a new problem formulation of post-placement optimization with MBFFs to optimize the power consumption of the clock network. Based on the problem formulation, we proposed flip-flop grouping and MBFF placement algorithms to simultaneously minimize flip-flop power consumption and interconnecting wirelength such that both placement density and timing slack constraints are satisfied. To reduce placement deviation and improve runtime efficiency of our algorithms, we also introduced the progressive window-based optimization technique. Experimental results have shown that our approach is very effective in reducing not only flip-flop power consumption but also clock tree and signal net wirelength when applying MBFFs to a design at the post-placement stage. Future work lies in considering pin density and modeling routing congestion [33] to enable chip area reduction with MBFFs.

Acknowledgment

The authors would like to thank Dr. Y.-W. Tsai and S.-F. Chen from Faraday Technology Corporation, Hsinchu, Taiwan, for providing the measured data including area and power consumption of the multi-bit flip-flops, and the benchmark circuits. They would also like to thank Prof. W.-K. Mak’s team of National Tsing Hua University, Hsinchu, for providing the benchmark circuits in [21].

References

[1] Y.-T. Chang, C.-C. Hsu, M. P.-H. Lin, Y.-W. Tsai, and S.-F. Chen, “Post-placement power optimization with multibit flip-flops,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., Nov. 2010, pp. 218–223.
[2] R. Goering, “Low-power IC design techniques may perturb the entire flow,” EE Times, May 7, 2007.
[3] Q. Wu, M. Pedram, and X. Wu, “Clock-gating and its application to low power design of sequential circuits,” IEEE Trans. Circuits Syst. I, vol. 47, no. 3, pp. 415–420, Mar. 2000.
[4] H. Mahmoodi, V. Tirumalashetty, M. Cooke, and K. Roy, “Ultralow-power clocking scheme using energy recovery and clock gating,” IEEE Trans. Very Large Scale Integr. Syst., vol. 17, no. 1, pp. 33–44, Jan. 2009.

[5] A. Khan, P. Watson, G. Kuo, D. Le, T. Nguyen, S. Yang, P. Bennett, P. Huang, J. Gill, C. Hawkins, J. Goodenough, D. Wang, I. Ahmed, P. Tran, H. Mak, O. Kim, F. Martin, Y. Fan, D. Ge, J. Kung, and V. Shek, “A 90 nm power optimization methodology with application to the ARM 1136JF-S microprocessor,” IEEE J. Solid-State Circuits, vol. 41, no. 8, pp. 1707–1717, Aug. 2006.
[6] T. Luo, D. Newmark, and D. Z. Pan, “Total power optimization combining placement, sizing and multi-Vt through slack distribution management,” in Proc. IEEE/ACM Asia South Pacific Des. Autom. Conf., Mar. 2008, pp. 352–357.
[7] D.-S. Chiou, S.-H. Chen, S.-C. Chang, and C. Yeh, “Timing driven power gating,” in Proc. ACM/IEEE Des. Autom. Conf., Sep. 2006, pp. 121–124.
[8] H. Xu, R. Vemuri, and W. Jone, “Dynamic characteristics of power gating during mode transition,” IEEE Trans. Very Large Scale Integr. Syst., vol. 19, no. 2, pp. 237–249, Feb. 2011.
[9] G. Magklis, M. L. Scott, G. Semeraro, D. H. Albonesi, and S. Dropsho, “Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor,” in Proc. ACM Int. Symp. Comput. Architecture, 2003, pp. 14–27.
[10] L. Yan, J. Luo, and N. Jha, “Combined dynamic voltage scaling and adaptive body biasing for heterogeneous distributed real-time embedded systems,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., Nov. 2003, pp. 30–37.
[11] D. Liu and C. Svensson, “Power consumption estimation in CMOS VLSI chips,” IEEE J. Solid-State Circuits, vol. 29, no. 6, pp. 663–670, Jun. 1994.
[12] A. Krishnamoorthy, “Minimize IC power without sacrificing performance,” EE Times, Jul. 15, 2004.
[13] K. Wang and M. Marek-Sadowska, “Buffer sizing for clock power minimization subject to general skew constraints,” in Proc. ACM/IEEE Des. Autom. Conf., Jul. 2004, pp. 159–164.
[14] L. Huang, Y. Cai, Q. Zhou, X. Hong, J. Hu, and Y. Lu, “Clock network minimization methodology based on incremental placement,” in Proc. IEEE/ACM Asia South Pacific Des. Autom. Conf., Jan. 2005, pp. 99–102.
[15] Y. Cheon, P.-H. Ho, A. B. Kahng, S. Reda, and Q. Wang, “Power-aware placement,” in Proc. ACM/IEEE Des. Autom. Conf., Jun. 2005, pp. 795–800.
[16] Y. Lu, C. N. Sze, X. Hong, Q. Zhou, Y. Cai, L. Huang, and J. Hu, “Navigating registers in placement for clock network minimization,” in Proc. ACM/IEEE Des. Autom. Conf., Jun. 2005, pp. 176–181.
[17] R. Pokala, R. Feretich, and R. McGuffin, “Physical synthesis for performance optimization,” in Proc. IEEE Int. ASIC Conf. Exhibit., Sep. 1992, pp. 34–37.
[18] L. Chen, A. Hung, H.-M. Chen, E. Tsai, S.-H. Chen, M.-H. Ku, and C.-C. Chen. (2010). “Using multi-bit flip-flop for clock power saving by DesignCompiler,” in Proc. Synopsys Users Group [Online]. Available: http://www.synopsys.com.cn/information/snug/2010/usingmulti-bit-flip-flop-for-clock-power-saving-by-designcompiler
[19] J.-T. Yan and Z.-W. Chen, “Construction of constrained multibit flip-flops for clock power reduction,” in Proc. IEEE Int. Conf. Green Circuits Syst., Jun. 2010, pp. 675–678.
[20] I. H.-R. Jiang, C.-L. Chang, Y.-M. Yang, E. Y.-W. Tsai, and L. S.-F. Chen, “INTEGRA: Fast multi-bit flip-flop clustering for clock power saving based on interval graphs,” in Proc. ACM Int. Symp. Phys. Design, 2011, pp. 115–121.
[21] S.-H. Wang, Y.-Y. Liang, T.-Y. Kuo, and W.-K. Mak, “Power-driven flip-flop merging and relocation,” in Proc. ACM Int. Symp. Phys. Design, 2011, pp. 107–114.
[22] Y. Kretchmer, “Using multibit register inference to save area and power,” EE Times Asia, May 24, 2001.
[23] W. Hou, D. Liu, and P.-H. Ho, “Automatic register banking for low-power clock trees,” in Proc. IEEE/ACM Int. Symp. Quality Electron. Des., Mar. 2009, pp. 647–652.
[24] Faraday Technology Corporation [Online]. Available: http://www.faraday-tech.com
[25] United Microelectronics Corporation [Online]. Available: http://www.umc.com
[26] Library Compiler User Guide: Modeling Timing and Power Technology Libraries, Synopsys, Inc., Mountain View, CA, 2003.
[27] K.-H. Ho, Y.-P. Chen, J.-W. Fang, and Y.-W. Chang, “ECO timing optimization using spare cells and technology remapping,” IEEE Trans. Comput.-Aided Des., vol. 29, no. 5, pp. 697–710, May 2010.
[28] J. Vygen, “Slack in static timing analysis,” IEEE Trans. Comput.-Aided Des., vol. 25, no. 9, pp. 1876–1885, Sep. 2006.
[29] T.-H. Chao, Y.-C. Hsu, J.-M. Ho, and A. Kahng, “Zero skew clock routing with minimum wirelength,” IEEE Trans. Circuits Syst. II Analog Digital Signal Process., vol. 39, no. 11, pp. 799–814, Nov. 1992.
[30] C. Bron and J. Kerbosch, “Algorithm 457: Finding all cliques of an undirected graph,” Commun. ACM, vol. 16, no. 9, pp. 575–577, Sep. 1973.
[31] R. M. Karp, “Reducibility among combinatorial problems,” in Complexity of Computer Computations, R. E. Miller and J. W. Thatcher, Eds. New York: Plenum, 1972, pp. 85–103.
[32] A. B. Kahng and C.-W. Tsao. (2000, May). Bounded-Skew Clock Tree Routing—Version 1.0 [Online]. Available: http://vlsicad.ucsd.edu/GSRC/bookshelf/Slots/BST
[33] J. A. Roy, N. Viswanathan, G.-J. Nam, C. J. Alpert, and I. L. Markov, “CRISP: Congestion reduction by iterated spreading during placement,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., Nov. 2009, pp. 357–362.

Mark Po-Hung Lin (S’07–M’09) received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1998 and 2000, respectively, and the Ph.D. degree from the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, in 2009. He was with Springsoft, Inc., Hsinchu, from 2000 to 2007. In 2008, he was a Visiting Scholar with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana. He has been an Assistant Professor with the Department of Electrical Engineering, National Chung Cheng University, Chiayi, Taiwan, since 2009. His current research interests include analog design automation and very large scale integration physical synthesis.

Chih-Cheng Hsu (S’11) received the B.S. degree in electronics engineering from Fu Jen Catholic University, Taipei, Taiwan, in 2009. Since 2009, he has been working toward the Ph.D. degree with the EDA Laboratory, Department of Electrical Engineering, National Chung Cheng University, Chiayi, Taiwan. His current research interests include design automation and low power design.

Yao-Tsung Chang received the B.S. degree in electronics engineering from Fu Jen Catholic University, Taipei, Taiwan, in 2006, and the M.S. degree in electronics engineering from National Chung Cheng University, Chiayi, Taiwan, in 2011. His current research interests include electronic design automation with an emphasis on low power design.
