Thermal-aware Steiner Routing for 3D Stacked ICs

Mohit Pathak and Sung Kyu Lim
School of Electrical and Computer Engineering, Georgia Institute of Technology
{mohitp, limsk}@ece.gatech.edu

Abstract: In this paper, we present the first work on Steiner routing for 3D stacked ICs. In the 3D Steiner routing problem, the pins are located in multiple device layers, which makes it more general than its 2D counterpart. Our algorithm consists of two steps: tree construction and tree refinement. Our tree construction algorithm builds a delay-oriented Steiner tree under a given thermal profile. We show that thermal-aware 3D tree construction involves the minimization of a two-variable Elmore delay function. In our tree refinement algorithm, we reposition the through-vias while preserving the original routing topology for further thermal optimization under a performance constraint. We employ a novel scheme to relax the initial NLP formulation to an ILP and consider all through-vias from all nets simultaneously. Our related experiments show the effectiveness of our proposed solutions.

I. INTRODUCTION

The emerging 3D die stacking technology enables the integration of multiple planar integrated circuits in the vertical direction with high-density through-silicon-via interconnect. The pitch of these vertical die-to-die vias can be as small as a few microns [1], which substantially reduces communication latency, in particular on the global interconnect. The shorter interconnect can also reduce power consumption, as shown by recent research [2]. One of the major concerns of 3D ICs is thermal dissipation. Stacking of different device layers combined with the low thermal conductivity of the bonding material may result in excessively high on-chip temperature.

The location of through vias in a Steiner tree has a high impact on the delay at the sink nodes of the tree. In addition, effective placement of through vias can play a significant role in lowering the temperature of the chip, since these vias establish thermal paths to the heat sink. Thus, an approach that optimizes the location of through vias so that both performance and thermal issues are addressed is important. The contributions of this paper are as follows:

• We formulate and solve the new 3D IC Steiner Routing (3DSR) problem for multi-pin net routing in 3D stacked ICs. Our algorithm considers all dies simultaneously and constructs performance-oriented Steiner trees under a given thermal profile while determining the locations of the related through-vias. We show that thermal-aware 3D tree construction involves the minimization of a two-variable Elmore delay function.

• We formulate and solve the new Through-via Relocation problem for thermal optimization in 3D stacked ICs. We further optimize the thermal objective by relocating through vias in each tree under a given timing constraint while preserving the original performance-oriented routing topology. We employ a novel scheme to relax the initial NLP formulation to an ILP and consider all through-vias from all nets simultaneously.

The goal of our 3D Steiner routing is to address the following two important design objectives: performance and thermal. We tackle this problem by first constructing a performance-oriented routing tree for each net under a given non-uniform thermal profile. We then optimize the thermal objective by relocating through vias in each tree under a given timing constraint. We emphasize that through via relocation enables a new 3D-specific design optimization, since it preserves the original performance-oriented tree topology while optimizing the thermal objective. Lastly, we address the routing congestion issue by satisfying the wire/via capacity constraints specified in the given 3D global routing grid.

A previous work [3] discusses Steiner trees in different routing layers, but it does not consider pins on different device layers. Thus, it cannot be used for 3D ICs without significant modification. Another recent work on thermal via insertion [4] targets thermal optimization during routing for 3D ICs. However, the thermal vias are additional vias that do not carry any signal and may cause congestion problems. In our algorithm, we relocate the existing through-vias for thermal optimization. Lastly, some analytical results on the location of through vias were recently presented in [5]. However, that work focused on two-pin nets and did not perform Steiner tree construction.

The remainder of the paper is organized as follows. Section II presents the problem formulation. Section III presents our 3D Steiner tree construction algorithm. Section IV presents our through-via relocation algorithm. Experimental results are presented in Section V, and we conclude in Section VI.

II. PRELIMINARIES

A. Problem Formulation

We assume the following is given: (i) a set of $m$ nets $\{n_0, n_1, \cdots, n_{m-1}\}$, where each net is represented by a list of pins $n_i = \{p_0, p_1, \cdots, p_{k-1}\}$ with $p_0$ as the driver, (ii) a 3D routing grid $G$ that represents the routing resource in a given 3D stacked IC, where each grid node represents a routing region and each edge denotes the adjacency among the regions, (iii) a capacity for each grid edge, where each x/y edge is associated with a horizontal/vertical wire capacity and each z edge with a via capacity, (iv) the location of each pin $p(x, y, z)$, and (v) a 3D thermal grid $Z$ with the temperature at all grid nodes.

A 3D Steiner tree is defined to be a set of 2D (= planar) Steiner trees connected by through vias. The goal of the Thermal-aware 3D Steiner Routing problem is to generate a 3D Steiner tree for each net while satisfying the capacity constraints specified in the underlying $G$.¹ The objective is to minimize (i) the maximum temperature among all nodes in the thermal grid, and (ii) the maximum Elmore delay among all pins in each tree, where the delay is computed based on the given thermal distribution.

This paper uses the recent temperature-dependent interconnect delay model proposed in [6], [7]. The line resistance per unit length is calculated as

$$r(x) = r_0 \, (1 + \beta \cdot T(x)),$$

where $r_0$ is the resistance at 0°C, $\beta$ is the temperature coefficient of resistance, and $T(x)$ denotes the temperature at location $x$.

B. Overview of the Approach

The goal of our 3D Steiner routing is to address the following two important design objectives: performance and thermal. We tackle this problem in two steps:

1) Construction step: We first perform thermal analysis on the given 3D placement. We then construct a performance-oriented routing tree for each net under the non-uniform thermal profile. We recompute the temperature values to reflect the routing.

2) Refinement step: We optimize the thermal objective by relocating through vias in each tree under given timing constraints. Note that through via relocation preserves the original tree topology while optimizing the thermal objective. We recompute the timing slacks and temperature values to reflect the relocation. We repeat this step until no more thermal improvement is possible.

III. 3D STEINER TREE CONSTRUCTION

A. Overview of the Algorithm

The basic approach of our 3D Steiner tree construction algorithm is similar to SERT [8], where an existing tree is incrementally grown by connecting a new sink pin to it.
SERT starts with the driver pin and selects the sink pin that minimizes Elmore delay when connected to the driver. This process continues until all sink pins are connected to the tree being grown. The goal is to minimize the maximum Elmore delay among all sink pins of the tree. Here the biggest challenge is to compute the point on the tree to which the new pin connects. There are three major differences between SERT and our work: (i) all the pins in SERT are located in the same die, whereas our 3D algorithm handles pins located in different dies. The 3D case requires the use of through vias, and the location of these vias has a large impact on the topology of the tree as well as the sink pin delay; (ii) the delay optimization in SERT is based on a single variable, whereas our algorithm deals with two-variable function optimization; (iii) our interconnect delay is computed based on the given thermal profile.

¹ Note that we do not explicitly minimize routing/via congestion but implicitly address it by satisfying the capacity constraints. Each edge capacity can be set independently to reflect the availability of the routing resource.
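Difference (iii) rests on the temperature-dependent resistance model $r(x) = r_0(1 + \beta \cdot T(x))$ from Section II-A. The following is a minimal sketch, not the authors' implementation, of how that model changes the Elmore delay of a single wire. It assumes the wire is cut into equal-length segments with one temperature sample each; the function name, the `load_cap` parameter, and the numeric values are illustrative assumptions.

```python
def elmore_delay_wire(r0, beta, c, temps, step, load_cap=0.0):
    """Elmore delay of one wire under a sampled temperature profile.

    r0, c    : unit-length resistance at 0 degrees C and unit-length capacitance
    beta     : temperature coefficient of resistance
    temps    : one temperature sample per segment, ordered from driver to sink
    step     : length of each segment
    load_cap : lumped capacitance at the far end (assumed parameter)
    """
    seg_cap = c * step
    downstream = load_cap + len(temps) * seg_cap
    delay = 0.0
    for t in temps:
        seg_res = r0 * (1.0 + beta * t) * step   # r(x) = r0 * (1 + beta * T(x))
        downstream -= seg_cap                    # capacitance strictly after this segment
        delay += seg_res * (seg_cap / 2.0 + downstream)
    return delay


cold = elmore_delay_wire(1.0, 0.0, 1.0, [0.0] * 4, 0.5)
hot = elmore_delay_wire(1.0, 0.004, 1.0, [100.0] * 4, 0.5)
```

With a uniform temperature the sum reduces to the familiar $r c L^2 / 2$, scaled by $(1 + \beta T)$, so a hotter route of the same length is slower.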

Thermal-aware 3D Steiner Routing Algorithm
input: netlist NL, routing graph R, thermal profile Z
output: 3D Steiner tree for each net
 1. for (each net n ∈ NL)
 2.     T_n = p_0(n);
 3.     Q_n = set of pins of n except p_0;
 4.     while (Q_n ≠ ∅)
 5.         for (each pin a ∈ Q_n)
 6.             for (each edge e ∈ T_n)
 7.                 x = connection point for a → e;
 8.                 y = through via location on e(x, a);
 9.                 update d(p) for all p ∈ T_n;
10.                 X(a, e) = max{d(p) | p ∈ T_n ∪ a};
11.         (a_min, e_min) = pin+edge pair with min X;
12.         T_n = T_n ∪ e_min;
13.         remove a_min from Q_n;
14. for (each non-timing critical T_n violating capacity)
15.     rip-up-and-reroute T_n under Z;

Fig. 1. Pseudocode of the thermal-aware 3D Steiner routing algorithm. In case e and a are located in different planes, e(x, a) will contain a through via.
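The construction loop of Figure 1 (lines 2-13) can be sketched as below. This is a simplified illustration under several assumptions that are not in the paper: a new pin attaches only to existing tree nodes rather than to an optimized point on an edge, vias are lumped into the connection, sink load capacitances are ignored, the thermal profile is omitted, and the parasitic constants are hypothetical.

```python
from collections import defaultdict

R_WIRE, C_WIRE = 1.0, 1.0   # per-unit-length wire parasitics (hypothetical values)
R_VIA, C_VIA = 0.5, 0.5     # per-via parasitics (hypothetical values)


def edge_rc(u, v):
    """Lumped resistance and capacitance of a connection between pins u and v."""
    length = abs(u[0] - v[0]) + abs(u[1] - v[1])   # Manhattan wirelength
    vias = abs(u[2] - v[2])                        # one via per die crossed
    return R_WIRE * length + R_VIA * vias, C_WIRE * length + C_VIA * vias


def max_elmore_delay(parent, root):
    """Maximum Elmore delay over all nodes of a tree stored as child -> parent."""
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    def subtree_cap(node):
        return sum(edge_rc(node, ch)[1] + subtree_cap(ch) for ch in children[node])

    worst, stack = 0.0, [(root, 0.0)]
    while stack:
        node, delay = stack.pop()
        worst = max(worst, delay)
        for ch in children[node]:
            r, c = edge_rc(node, ch)
            stack.append((ch, delay + r * (c / 2.0 + subtree_cap(ch))))
    return worst


def build_tree(pins):
    """Greedily grow a tree from the driver pins[0], as in lines 2-13 of Figure 1."""
    root, remaining, parent = pins[0], set(pins[1:]), {}
    while remaining:
        best = None
        for a in sorted(remaining):
            for node in [root, *parent]:
                trial = dict(parent)
                trial[a] = node
                cost = max_elmore_delay(trial, root)
                if best is None or cost < best[0]:
                    best = (cost, a, node)
        _, a, node = best
        parent[a] = node
        remaining.remove(a)
    return parent


pins = [(0, 0, 0), (4, 0, 0), (4, 3, 1), (0, 5, 1)]
tree = build_tree(pins)
```

Each pass tries every remaining pin against every attachment candidate and keeps the pair with the smallest maximum delay, which mirrors the min-max selection on line 11.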

The pseudocode of our algorithm is shown in Figure 1. Our routing algorithm consists of two phases: construction (line 1-13) and refinement (line 14-15). We construct 3D Steiner trees during the construction phase while ignoring congestion, and then alleviate congestion by rip-up-and-reroute during the refinement phase. Given a net $n$, our 3D Steiner tree $T_n$ initially contains the driver pin (line 2). We store the remaining pins of $n$ in $Q_n$ (line 3). We then examine all pin-edge pairs (line 5-6) and compute the impact of connecting the pin to the edge on Elmore delay under the given thermal profile $Z$, where the pin is from $Q_n$ and the edge is from $T_n$. Specifically, the delay impact is calculated based on the increase in temperature-dependent Elmore delay among all pins currently in $T_n$ (line 9-10). This requires the computation of the connection point $x$ and the through via location $y$ (line 7-8), to be discussed in Section III-B. Next, we select the pin-edge pair that results in the minimum max-delay increase (line 11) and add the pin to $T_n$ (line 12-13).² Our rip-up-and-reroute is done only on the less timing-critical nets, i.e., the nets with smaller max-delay values (line 14-15).

The complexity of our algorithm is $O(nm^4)$, where $n$ is the total number of nets and $m$ is the maximum number of pins in any net. The $O(m^4)$ term comes from the while-loop, the two for-loops, and the Elmore delay computation for all sinks (line 9). Since $m$ is typically bounded by a constant in VLSI circuits, our algorithm runs in time linear in the number of nets in practice.

² In case there exist ties among several pin-edge pairs in terms of the performance objective, we use the via usage to break the tie. If more aggressive via minimization or another objective such as congestion is additionally desired, we can use a weighted cost function that combines performance, via, and congestion objectives to select the best pin-edge pair.

B. Connection Point and Via Location

In this section we discuss how we can efficiently construct 3D Steiner trees. Our discussion is based on the two-die case for simplicity, but our algorithm is applicable to multiple dies without any modification. Let $r_1$ and $c_1$ denote the unit-length resistance and capacitance values for die 1; $r_2$ and $c_2$ are similarly defined for die 2. The capacitance and resistance of a through via connecting the two dies are denoted $C_{via}$ and $R_{via}$. Given a pin $p$ and an edge $e \in T$, the connection point is defined as the point on $e$ to which $p$ is connected. The connection point computation for the 2D case has been presented in [8], where the Elmore delay change on an entire tree caused by adding a new pin to the tree is a function of a single variable $x$, the location of the connection point. We extend this work by introducing a second variable $y$ that represents the location of the through via. We then optimize the two-variable delay function and determine the location of the connection point (= $x$) and the through via (= $y$) for the 3D case.

Fig. 2. Illustration of how pin $a$ connects to $e(p, c) \in T$. $e(y, a)$ is routed in the bottom die (die 2), whereas all other edges are routed in the top die (die 1). $x$ is the location of the connection point on $e(p, c)$. $y$ is the location of the through via inserted on $e(x, a)$. $e(q, b)$ is another branch in $T$. $g$ is another sink that is not a part of the subtree rooted at $p$. $d$ is the point on $e(p, c)$ at the shortest distance from $a$, and $\delta z$ is the distance between the through via and $a$. The Elmore delay of $T \cup a$ is a function of both $x$ and $y$. In the figure, $a$, $c$, $b$, and $g$ are labeled case 1, case 2, case 3, and case 4, respectively.

Referring to Figure 2, $e(p, c)$ and $e(q, b)$ are edges in $T$. $p$ is the parent node of $e(p, c)$, and $q$ is the parent node of $e(q, b)$. $a$ is the new pin that needs to connect to $e(p, c)$. Edge $e(p, c)$ lies on die 1 with interconnect parasitics $r_1$ and $c_1$, whereas $a$ lies on die 2 with interconnect parasitics $r_2$ and $c_2$. $d$ is the point on $e(p, c)$ that is at the shortest distance from $a$. $x$ is the connection point, and $y$ is the location of the through via. Our first goal is to derive Elmore delay equations that are functions of $\delta x$ and $\delta y$. In what follows, we let $\delta x$ denote the distance between node $p$ and node $x$; $\delta q$, $\delta a$, $\delta b$, $\delta c$, and $\delta d$ are defined similarly. $\delta y$ is the distance between $x$ and $y$, and $\delta z$ is the distance between $y$ and $a$. Let $T_b$ denote the subtree rooted at node $b$.
In order to compute the Elmore delay change on all sink pins in $T$ caused by adding $a$ to $T$, we consider the following four cases: (i) delay at the node to be added (= node $a$), (ii) delay at the subtree located after the connection point (= node $c$), (iii) delay at the subtree that could be located either before or after the connection point (= node $b$), (iv) delay of the nodes not in $T_p$. Figure 2 illustrates these four cases.

Case 1: We handle the delay at node $a$. In this case, $d(a)$ is a sum of four functions. $f_1$ is the delay from $p_0$ to $p$. $f_2$ is the delay from $p$ to $x$ without considering $e(q, b)$ and $T_b$. $f_3$ is the delay from $p$ to $x$ when considering $e(q, b)$ and $T_b$. $f_4$ is the delay from $x$ to $a$. Thus we have $d(a) = f_1 + f_2 + f_3 + f_4$, where

$$f_1 = K_0 + K_1 \left\{ c_1 \delta y + C_{via} + c_2 \delta z + c_1 \delta c + C_c + C_b + c_1 (\delta b - \delta q) \right\}$$

$$f_2 = r_1 \delta x \left( \frac{c_1 \delta x}{2} + c_1 \delta y + C_{via} + c_2 \delta z + c_1 (\delta c - \delta x) + C_c \right)$$

$$f_3 = \begin{cases} r_1 \delta x \left( c_1 (\delta b - \delta q) + C_b \right), & \text{if } \delta x \le \delta q \\ r_1 \delta q \left( c_1 (\delta b - \delta q) + C_b \right), & \text{if } \delta x \ge \delta q \end{cases}$$

$$f_4 = r_1 \delta y \left( \frac{c_1 \delta y}{2} + C_{via} + c_2 \delta z \right) + R_{via} \left( \frac{C_{via}}{2} + c_2 \delta z \right) + \frac{r_2 c_2 \delta z^2}{2}$$

where $\delta z = \delta a - (\delta x + \delta y)$, $K_0$ is the sum of resistance and capacitance products along the $p_0 \to p$ path, $K_1$ is the sum of resistance along the $p_0 \to p$ path, and $C_i$ is the capacitance of the subtree rooted at node $i$.

Case 2: The new delay at node $c$ is given by $d(c) = f_1 + f_2 + f_3' + f_4'$, where

$$f_3' = r_1 \delta q \left( c_1 (\delta b - \delta q) + C_b \right)$$

$$f_4' = r_1 (\delta c - \delta x) \left\{ \frac{c_1 (\delta c - \delta x)}{2} + C_c \right\}$$

Case 3: The new delay at node $b$ is given by $d(b) = f_1 + f_2'' + f_3''$, where

$$f_2'' = \begin{cases} r_1 \delta x \left( c_1 \delta y + C_{via} + c_2 \delta z \right), & \text{if } \delta x \le \delta q \\ r_1 \delta q \left( c_1 \delta y + C_{via} + c_2 \delta z \right), & \text{if } \delta x \ge \delta q \end{cases}$$

$$f_3'' = r_1 \delta q \left\{ \frac{c_1 \delta q}{2} + c_1 (\delta b - \delta q) + C_b + C_c \right\}$$

Case 4: For all other nodes not in $T_p$, the added delay is a function of the added capacitance, which is linear in terms of $x$ and $y$ and given by

$$\Delta C = c_1 (\delta x + \delta y) + C_{via} + c_2 \delta z$$

When connecting two pins separated by one or more intermediate layers, we use stacked vias so that no routing is done in these intermediate layers.

C. Optimization of Delay Equations

We discuss how the delay equations derived in the previous section can be optimized to give a small set of possible optimum location points. We first consider the conditions needed to determine the minimum of a general quadratic function of two variables. We later show how the delay equations derived in the previous section can be optimized using these conditions.

In general, for a quadratic function of two variables $f(\delta x, \delta y)$, the maximum or the minimum of the function depends upon the value of $\frac{\partial^2 F}{\partial \delta x^2}$ and the determinant of the Hessian matrix, $H_1$:

$$H_1 = \begin{vmatrix} \dfrac{\partial^2 F}{\partial \delta x^2} & \dfrac{\partial^2 F}{\partial \delta x \, \partial \delta y} \\ \dfrac{\partial^2 F}{\partial \delta x \, \partial \delta y} & \dfrac{\partial^2 F}{\partial \delta y^2} \end{vmatrix}$$

where $F$ is the delay function under consideration. These values are always constant for a quadratic function of two variables.
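Because the second derivatives of a quadratic are constants, the minimum over a rectangular range of $(\delta x, \delta y)$ can only occur at a small, fixed set of candidate points. The sketch below is a generic illustration of that idea, not the authors' procedure: it enumerates the corners, the stationary point along each boundary edge, and the interior critical point, then returns the best candidate. The coefficient layout and function name are assumptions.

```python
def minimize_quadratic_on_box(coef, x_max, y_max):
    """Minimize f(x, y) = a*x^2 + b*y^2 + c*x*y + d*x + e*y + g
    over 0 <= x <= x_max, 0 <= y <= y_max by checking a fixed candidate set."""
    a, b, c, d, e, g = coef

    def f(x, y):
        return a * x * x + b * y * y + c * x * y + d * x + e * y + g

    # Corners of the box.
    cands = [(0.0, 0.0), (x_max, 0.0), (0.0, y_max), (x_max, y_max)]

    # Stationary point of the 1D quadratic along each horizontal edge.
    for y in (0.0, y_max):
        if a != 0:
            x = -(c * y + d) / (2 * a)
            if 0 <= x <= x_max:
                cands.append((x, y))

    # Stationary point of the 1D quadratic along each vertical edge.
    for x in (0.0, x_max):
        if b != 0:
            y = -(c * x + e) / (2 * b)
            if 0 <= y <= y_max:
                cands.append((x, y))

    # Interior critical point; the Hessian is [[2a, c], [c, 2b]].
    det = 4 * a * b - c * c
    if det != 0:
        x = (c * e - 2 * b * d) / det
        y = (c * d - 2 * a * e) / det
        if 0 <= x <= x_max and 0 <= y <= y_max:
            cands.append((x, y))

    best = min(cands, key=lambda p: f(*p))
    return best, f(*best)


bowl = minimize_quadratic_on_box((1, 1, 0, -2, -4, 5), 3, 3)
saddle = minimize_quadratic_on_box((0, 0, 1, -1, 0, 0), 2, 2)
```

In the first call the Hessian is positive definite and the interior point wins; in the second the Hessian determinant is negative, the critical point is a saddle, and the minimum falls on a corner.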

For our case we have $0 \le \delta x \le \delta d$ and $0 \le \delta y \le \delta a$, and thus consider the following cases:³

• Case A: If $\frac{\partial^2 F}{\partial \delta x^2} \le 0$ and $H_1 \ge 0$, the minimum can be found at the boundary points, i.e., $\delta x = 0$ or $\delta x = \delta d$ and $\delta y = 0$ or $\delta y = \delta a$. Thus we have four points at which to look for the minimum.

• Case B: If $\frac{\partial^2 F}{\partial \delta x^2} \le 0$, $\frac{\partial^2 F}{\partial \delta y^2} \le 0$, and $H_1 = 0$, we have a concave function, and the minimum lies at the boundary points.

• Case C: If $\frac{\partial^2 F}{\partial \delta x^2} = 0$, $\frac{\partial^2 F}{\partial \delta y^2} = 0$, and $H_1 = 0$, then $f(\delta x, \delta y)$ is a linear function of $\delta x$ and $\delta y$, and the minimum lies at the boundary points.

• Case D: If $H_1 < 0$, the critical point found is a saddle point, and the minimum lies at the boundary, although a different set of boundary points may need to be chosen. The set of boundary points may be found by setting $\delta x = 0$ or $\delta x = \delta d$ and minimizing $f(\delta x, \delta y)$ as a function of $\delta y$, or by setting $\delta y = 0$ or $\delta y = \delta a$ and minimizing $f(\delta x, \delta y)$ as a function of $\delta x$.

We show that the Elmore delay at each sink node in $T$ can be optimized by considering one of the four cases shown above. Thus, there is a fixed number of points $(x, y)$ at which the Elmore delay can be minimized. Details are included in Appendix I.

³ These cases are not related to the four cases in the previous section.

IV. THERMAL-AWARE THROUGH VIA RELOCATION

A. Overview of the Algorithm

The motivation behind our through via relocation is to move as many through vias into thermal hotspots as possible while preserving the original tree topology we obtain during our construction step. The objective is to minimize the maximum temperature on the chip while not violating the timing and routing resource capacity constraints. Note that the problem of optimizing the number of through vias and temperature is non-linear in nature, since we need to solve the equation $T = PR$, where $T$ is the temperature matrix, $P$ is the power vector, and $R$ is the thermal resistance matrix such that $R \propto 1/a$, where $a$ is the number of through vias. General solutions available for solving non-linear problems cannot be applied directly for large problem sizes.

In this section we propose a problem formulation that helps us effectively overcome the non-linear nature of this problem. We propose a relaxed ILP-based formulation in which the number of integer variables is kept to a minimum. Our ILP-based method optimizes through vias on all nets simultaneously, which is more rigorous than a sequential approach that optimizes the nets one by one. In addition, we target all hotspots simultaneously instead of iteratively targeting them one by one. Our experimental results in Section V demonstrate the advantage of this approach.

B. Movable Range

We start with the set of 3D Steiner trees we obtain from the construction step. All pins in each tree $T_i$ are associated with a timing constraint that denotes the required arrival time in terms of Elmore delay. Each through via $v \in T_i$ is associated with a movable range, denoted $R_{via}$, which is the range of new locations along its route to the connection point for which the timing constraints are not violated. An illustration is shown in Figure 3. In case the movable range of a through via $v$ is a single point, $v$ is non-movable; otherwise it is movable. Our goal is then to find a new location for each movable via in each Steiner tree so that the maximum temperature among all nodes in the thermal grid is minimized while the timing and routing resource capacity constraints are not violated. Note that we preserve the original topology of the Steiner trees. All that changes is the location of through vias for thermal optimization, where the movable through vias are moved into thermal hotspots under the timing constraint to reduce the thermal resistivity.⁴

Fig. 3. Illustration of movable range. The figure is a top view of a multilayer grid. The driver is represented by a triangle, the sinks are represented by dots, and the vias are represented by square boxes. The driver is located in layer 1 (= L1). The movable range of each via is represented by the dotted lines.

C. Fast Thermal Analysis

To optimize the location of through vias, we use a fast thermal analysis model described in [4]. An illustration is shown in Figure 4. A tile structure is imposed on the surface, where each tile is approximated as a resistive chain as shown in Figure 4. In 3D ICs, the heat sinks are attached to the bottom or top side of the 3D IC stack, with other boundaries being adiabatic. Thus, the dominant heat flow is in the vertical direction. For the purpose of optimization we view each tile stack in Figure 4 as an independent thermal resistive chain. In our model we do not consider the effects of lateral thermal dissipation. This can be justified since the thermal conductivity of the epoxy material used to join the dies is much lower than that of silicon itself, which essentially means that it is more difficult for heat to dissipate in the vertical direction than in the horizontal direction. Our fast thermal model, which optimizes vertical heat flow, helps in reducing the overall chip temperature.
To obtain effective temperature reduction, the full resistive thermal model [4] (considering lateral resistances) is run twice: once before and once after calling our via relocation routine. The temperature reductions reported in our experiments are based on the full resistive model.

⁴ The authors in [4] suggest adding thermal vias for further thermal optimization. Our method, which simply relocates existing through-vias, can be used in conjunction with these approaches for more rigorous thermal optimization.

TABLE I
VARIABLES AND CONSTANTS USED IN OUR ILP FORMULATION

$T_{i,j,k}$ : temperature at tile $(i, j, k)$.
$\alpha_{i,j,k}$ : temperature-related weight for tile $(i, j, k)$, computed as $T^{org}_{i,j,k} / T^{org}_{max}$, where $T^{org}_{i,j,k}$ denotes the original temperature for $(i, j, k)$ before the optimization and $T^{org}_{max}$ is the maximum among all $T^{org}_{i,j,k}$ values. (constant)
$vmax$ : maximum number of through vias each tile can accommodate. (constant)
$\beta^{n}_{i,j,k}$ : becomes 1 when a through via is moved to tile $(i, j, k)$ so that the total number of movable through vias at $(i, j, k)$ changes from $n - 1$ to $n$.
$V^{org}_{i,j,k}$ : original number of vias (= movable + non-movable) in tile $(i, j, k)$ before the optimization. (constant)
$V^{m}_{i,j,k}$ : number of through vias in tile $(i, j, k)$, which is just $m$.
$V^{opt}_{i,j,k}$ : number of vias (= movable + non-movable) in tile $(i, j, k)$ after the optimization.
$M^{x,y,k}_{i,j,k}(n)$ : becomes 1 if a through via in net $n$ is moved from tile $(i, j, k)$ to $(x, y, k)$; 0 otherwise.
$G^{cur}_{i,j,k}$ : current usage of wires and through vias for a grid $(i, j, k)$ in the 3D routing grid $G$.
$G^{max}_{i,j,k}$ : capacity constraint of wires and through vias for a grid $(i, j, k)$ in the 3D routing grid $G$. (constant)
$R^{m}_{i,j,k}$ : thermal resistance of tile $(i, j, k)$ having $m$ vias.
$R^{novia}_{i,j,k}$ : thermal resistance of tile $(i, j, k)$ with no through vias. (constant)
$\alpha$ : thermal resistance of one through via. (constant)
$\gamma^{n}_{i,j,k}$ : becomes 1 if the number of through vias in tile $(i, j, k)$ is $n$.
$\delta^{m}_{i,j,k}$ : temperature difference between tile $(i, j, k)$ and tile $(i, j, k-1)$ if the number of through vias in tile $(i, j, k)$ is $m$: $\delta^{m}_{i,j,k} = (P_{i,j,n} + \cdots + P_{i,j,k}) \cdot R^{m}_{i,j,k}$. (constant)

Fig. 4. Thermal model used (panels: tile stack array, single tile stack, tile stack analysis), where $R_b$ denotes the thermal resistance to the heat sink. The convention used in our ILP formulation is that adding through vias at point (= tile) $i$ reduces the value of $R_i$.

D. Non-linear Programming

In the following sections we first show how the thermal relocation problem can be formulated as an NLP (non-linear programming) problem. We then show how the NLP may be converted to an ILP formulation. The ILP formulation adds a large number of integer variables to the problem, thus making it difficult to solve. We finally propose our reduced ILP formulation (which is a relaxed ILP formulation), which reduces the number of integer variables significantly. The NLP-based formulation is defined as follows (Table I explains the notation we use in the formulation):

Minimize

$$\sum \alpha_{i,j,k} \cdot T_{i,j,k} \qquad (1)$$

Subject to:

$$T_{i,j,k} = T_{i,j,k-1} + (P_{i,j,n} + \cdots + P_{i,j,k}) \times R^{opt}_{i,j,k} \qquad (2)$$

$$R^{opt}_{i,j,k} = \frac{\alpha}{\dfrac{\alpha}{R^{novia}_{i,j,k}} + V^{opt}_{i,j,k}} \qquad (3)$$

$$V^{opt}_{i,j,k} = V^{org}_{i,j,k} + \Delta V_{i,j,k} \qquad (4)$$

$$\Delta V_{i,j,k} = \sum M^{i,j,k}_{x,y,k}(n) - \sum M^{v,w,k}_{i,j,k}(n), \quad \forall n \qquad (5)$$

$$G^{cur}_{i,j,k} \le G^{max}_{i,j,k}, \quad \forall (i, j, k) \qquad (6)$$

$$\sum M^{x,y,k}_{i,j,k}(n) = 1, \quad \forall n \qquad (7)$$

$$M^{x,y,k}_{i,j,k}(n) \in \{0, 1\} \qquad (8)$$

Equation (1) is our objective function, where we minimize the weighted sum of temperature values at all thermal tiles (thermal tiles and thermal mesh nodes are used interchangeably in this paper). The weights $\alpha_{i,j,k}$ are computed based on the initial temperature measured before the through via relocation. In this case, the higher $\alpha_{i,j,k}$ is, the lower the $T_{i,j,k}$ we desire. Equation (2) gives the temperature at each tile based on our fast thermal model illustrated in Figure 4. Equation (3) shows the variation of thermal resistance based on the number of through vias in a tile. This is obtained by solving the following parallel resistance relation: $1/R^{opt}_{i,j,k} = 1/R^{novia}_{i,j,k} + V^{opt}_{i,j,k}/\alpha$. Equation (4) follows from the definition of $V^{opt}_{i,j,k}$. Equation (5) states that the total change in the number of through vias for tile $(i, j, k)$ is the total number of through vias moved into $(i, j, k)$ minus the total number of through vias moved out of $(i, j, k)$. Equation (6) ensures that the routing resource (= wires and through vias) capacity constraints are satisfied. Equation (8) states that the $M^{x,y,k}_{i,j,k}(n)$ are binary integer variables. Equation (7) ensures that only one via per net is moved. Note that this restriction is unavoidable, since the movable range of a through via is computed independently of other through vias. Once a through via is moved, it affects the timing constraint, movability, and range of all other through vias in the same net. The ultimate way to perform via relocation is to consider all vias from all nets simultaneously, which is computationally expensive. Our method, which considers one via from every net simultaneously, is better than a sequential approach that considers all vias from a single net.
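Equations (2) and (3) can be evaluated directly for a single tile stack. The sketch below is an illustration under stated assumptions rather than the authors' code: index 0 is the tile next to the heat sink, the resistance $R_b$ to the heat sink is omitted (the sink is held at `t_sink`), and the function names and numeric values are hypothetical.

```python
def via_resistance(r_novia, alpha, n_vias):
    """Equation (3): tile resistance in parallel with n_vias vias of resistance alpha."""
    return alpha / (alpha / r_novia + n_vias)


def stack_temperatures(powers, r_novia, vias, alpha, t_sink=0.0):
    """Equation (2) applied from the heat sink upward.

    powers[k]  : power dissipated in tile k (index 0 is next to the heat sink)
    r_novia[k] : thermal resistance of tile k with no through vias
    vias[k]    : number of through vias in tile k
    """
    temps = []
    t_prev = t_sink
    for k in range(len(powers)):
        heat_through = sum(powers[k:])   # all heat generated at or above tile k
        t_prev += heat_through * via_resistance(r_novia[k], alpha, vias[k])
        temps.append(t_prev)
    return temps


no_vias = stack_temperatures([1.0, 1.0], [2.0, 2.0], [0, 0], alpha=4.0)
two_vias = stack_temperatures([1.0, 1.0], [2.0, 2.0], [2, 0], alpha=4.0)
```

Adding two vias to the bottom tile halves its resistance in this example, and because all heat passes through that tile, every tile above it cools by the same amount.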

We see that the original via relocation problem is non-linear in nature due to the inverse relation between thermal resistance and the number of through vias in a tile (= $R^{opt}_{i,j,k}$ vs. $V^{opt}_{i,j,k}$). In the next section we propose a simplified integer linear programming formulation that overcomes this non-linearity.

E. Integer Linear Programming

From the NLP formulation we see that the number of through vias in each tile is an integer variable. Our ILP-based formulation differs from the NLP in the following way: we replace Equations (2) and (3) with the following:

$$T_{i,j,k} = T_{i,j,k-1} + \gamma^{0}_{i,j,k} \times \delta^{0}_{i,j,k} + \cdots + \gamma^{vmax}_{i,j,k} \times \delta^{vmax}_{i,j,k} \qquad (9)$$

$$1 \times \gamma^{1}_{i,j,k} + 2 \gamma^{2}_{i,j,k} + \cdots + vmax \times \gamma^{vmax}_{i,j,k} = V^{opt}_{i,j,k} \qquad (10)$$

$$\sum_{m} \gamma^{m}_{i,j,k} = 1, \quad \forall (i, j, k) \qquad (11)$$

$$\gamma^{n}_{i,j,k} \in \{0, 1\} \qquad (12)$$

Equation (9) gives the new equation for calculating the temperature. In this equation, the $\gamma^{n}_{i,j,k}$ are the new integer variables that are added, whereas the $\delta^{n}_{i,j,k}$ (refer to Table I) are constants that are calculated for each possible number of vias in a tile. Equation (10) relates the $\gamma^{n}_{i,j,k}$ variables to the optimum number of vias in a tile. Equation (11) ensures that for each tile only one $\gamma^{n}_{i,j,k}$ takes the value 1. Finally, Equation (12) ensures that $\gamma^{n}_{i,j,k}$ is either 0 or 1. All other equations are the same as in our previous NLP formulation.

We see that this new ILP formulation adds a large number of new integer variables $\gamma^{n}_{i,j,k}$. The number of these new integer variables is proportional to the number of tiles in our thermal grid and the number of vias possible in each tile. Adding such a large number of integer variables makes the problem harder to solve. In the next section we propose our reduced ILP formulation, which removes the need for the integer $\gamma^{n}_{i,j,k}$ variables, thus reducing the number of integer variables required.

F. Reduced Integer Linear Programming

Our ILP-based via relocation is formulated as follows (Table I explains the notation we use in the formulation):

Minimize

$$\sum \alpha_{i,j,k} \cdot T_{i,j,k} \qquad (13)$$

Subject to:

$$T_{i,j,k} = T_{i,j,k-1} + \delta^{0}_{i,j,k} - \beta^{1}_{i,j,k} \cdot (\delta^{0}_{i,j,k} - \delta^{1}_{i,j,k}) - \cdots - \beta^{vmax}_{i,j,k} \cdot (\delta^{vmax-1}_{i,j,k} - \delta^{vmax}_{i,j,k}) \qquad (14)$$

$$\beta^{1}_{i,j,k} + \beta^{2}_{i,j,k} + \cdots + \beta^{vmax}_{i,j,k} = V^{opt}_{i,j,k} \qquad (15)$$

$$V^{opt}_{i,j,k} = V^{org}_{i,j,k} + \Delta V_{i,j,k} \qquad (16)$$

$$\Delta V_{i,j,k} = \sum M^{i,j,k}_{x,y,k}(n) - \sum M^{v,w,k}_{i,j,k}(n), \quad \forall n \qquad (17)$$

$$G^{cur}_{i,j,k} \le G^{max}_{i,j,k}, \quad \forall (i, j, k) \qquad (18)$$

$$0 \le \beta^{m}_{i,j,k} \le 1 \qquad (19)$$

$$M^{x,y,k}_{i,j,k}(n) \in \{0, 1\} \qquad (20)$$

$$\sum M^{x,y,k}_{i,j,k}(n) = 1, \quad \forall n \qquad (21)$$
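Equation (14) can be read as a telescoping sum: each $\beta^{m}_{i,j,k}$ set to 1 subtracts the temperature saving of the $m$-th via from the no-via rise $\delta^{0}_{i,j,k}$. The sketch below checks that reading numerically for one tile; it is an illustration only, with hypothetical function names and values, and it computes the $\delta^{m}$ constants from the parallel resistance relation $1/R = 1/R^{novia} + m/\alpha$.

```python
def delta_table(heat, r_novia, alpha, v_max):
    """delta^m for m = 0..v_max: temperature rise across a tile holding m vias."""
    return [heat * alpha / (alpha / r_novia + m) for m in range(v_max + 1)]


def temperature_rise(delta, beta):
    """Right-hand side of Equation (14) without the T_{i,j,k-1} term.

    beta[m-1] plays the role of beta^m, for m = 1..v_max.
    """
    rise = delta[0]
    for m, b in enumerate(beta, start=1):
        rise -= b * (delta[m - 1] - delta[m])
    return rise


delta = delta_table(heat=1.0, r_novia=2.0, alpha=4.0, v_max=3)
rise_two_vias = temperature_rise(delta, [1, 1, 0])
savings = [delta[m - 1] - delta[m] for m in range(1, len(delta))]
```

With the first two $\beta$ values at 1 the expression collapses to $\delta^{2}$. The per-via savings shrink as $m$ grows, so a temperature-minimizing solver has an incentive to fill $\beta^{1}$ before $\beta^{2}$ and so on, which would explain why continuous $\beta$ variables suffice here; this reading is an inference from the equations, not a statement from the paper.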

TABLE III
TEMPERATURE COMPARISON BETWEEN GREEDY AND OUR SIMULTANEOUS THROUGH VIA RELOCATION APPROACH. Torg AND Topt RESPECTIVELY DENOTE THE TEMPERATURE BEFORE AND AFTER THERMAL OPTIMIZATION.

                     greedy             ILP                multiple ILP
ckt       Torg       Topt     cpu       Topt     cpu       Topt     cpu       iter
s9234     87.6       86.5     1.66      70.5     71.85     70.1     123.5     2
b14_opt   137.2      136      1.56      122.3    72.0      119.8    133.4     2
s13207    122.0      121.4    1.88      107      81.17     106.7    248.9     3
s15850    118.2      117.8    2.19      107.3    133.4     106.2    223.4     2
b20_opt   116.3      115.2    2.60      104.2    167.6     104.1    465.6     3
b21_opt   114.5      114.1    3.02      102.7    156.8     102.3    412.3     3
b22_opt   114.9      114.2    3.76      105.1    228.83    104.5    387.6     2
ibm09     98.5       97.8     10.23     87.2     634.0     86.4     1033      2
ibm10     108.9      108.1    12.3      101.2    451.9     100.7    1109.8    3
ibm11     128.8      127.9    19.8      119.1    990.8     118.4    1500.7    2
ibm13     91.2       90.5     17.6      80.6     1594.6    80.3     3455.8    3
ibm17     115.7      115.3    26.8      103.4    2789.2    102.9    4435.5    2
RATIO     1.00       0.99     1.0       0.89     71.3      0.89     111.9     -

Equation (14) is a new way to compute the temperature at the routing tile (i, j, k) compared with Equation (9). A detailed explanation of this equation is included in Appendix II. Equation (15) states that the total number of through vias i in a tile (i, j, k), which is specified by the βi,j,k values (1 ≤ opt i ≤ vmax), should equal to Vi,j,k . Equation (19) restricts the n range of values βi,j,k can take. All other equations are same as discussed in NLP formulation. To overcome the restriction of moving just one via per net (Equation 21) we iterate the entire relaxed ILP multiple times so that multiple through vias from a single net is given a chance to relocate in an iterative x,y,k fashion. The number of integer variables (= Mi,j,k (n)) can be huge if the number of nets is larger or the thermal grid is finer. This makes our ILP formulation less desirable for a large problem instance. We overcome this limitation by relaxing these integer variables and solving the resulting LP problem. We round the continuous variables based on a threshold value λ. We additionally perform a gain-based gradient search to obtain the λ value that generates the best quality results. V. E XPERIMENTAL R ESULTS We implemented our algorithms in C++/STL and ran our experiments on Linux PC running at 2GHz. We tested our algorithms with three sets of benchmark: ISCAS89, ITC99, and ISPD98. Our experiment is based on 4-die stacked IC, where the top two and bottom two dies are bonded in face-toface and the middle two in back-to-back. We assume all 4 dies have different unit-length resistance and capacitance values [5]: r1 = 86Ω/mm and c1 = 396f F/mm, r2 = 175Ω/mm and c2 = 100f F/mm, r3 = 74Ω/mm and c3 = 279f F/mm, and r4 = 154Ω/mm and c4 = 120f F/mm for unit-length interconnect for the four dies. Rvia = 106Ω/mm and Cvia = 140f F/mm values are used for the face-to-face vias, and Rvia = 53Ω/mm and Cvia = 280f F/mm values are used for the back-to-back vias. All runtime reported are in seconds. 
We use [9] to analyze the temperature profile from a given 3D placement [10]. To compare our results, we implemented a wirelength-driven 3D maze router. During tree construction in our 3D maze router, a new pin is added to an already existing tree such that the added interconnect is the shortest possible. We also implemented a 3D version of Steiner arborescence (3D A-tree). We first converted the 3D problem to a 2D problem by mapping the corresponding pin locations to the 2D plane, and then applied the 2D Steiner arborescence algorithm used in [11]. All wire routing was done on the same plane as the driver pin; finally, all pins not located on the driver pin's plane were connected using stacked vias. We report the total wirelength, the number of vias, the maximum path delay among all sinks, and the processor time in seconds for each circuit.

TABLE II
3D ROUTING FOR ALL TYPES OF NETS. WE COMPARE 3D MAZE ROUTER AND OUR 3D STEINER ROUTER. THE RUNTIME IS IN SECONDS. THE VIA BOUND COLUMN SHOWS A LOWER BOUND ON THE VIA COUNT.

| ckt | size | v-bound | Maze wire | Maze delay | Maze via | Maze cpu | A-tree wire | A-tree delay | A-tree via | A-tree cpu | Steiner wire | Steiner delay | Steiner via | Steiner cpu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| s9234 | 5844 | 5563 | 157.11 | 11.48 | 6856 | 13.45 | 164.93 | 8.35 | 7299 | 6.26 | 174.8 | 7.44 | 6725 | 9.05 |
| b14 opt | 5646 | 5546 | 189.0 | 19.85 | 9091 | 26.45 | 202.3 | 10.13 | 10461 | 8.5 | 229.6 | 8.59 | 8639 | 40.4 |
| s13207 | 8727 | 8298 | 235.1 | 14.1 | 10308 | 34.96 | 239.6 | 9.23 | 11080 | 9.6 | 252.7 | 7.26 | 10164 | 24.81 |
| s15850 | 10397 | 9915 | 280.1 | 20.1 | 12091 | 41.52 | 287.2 | 14.76 | 13029 | 13.4 | 301.6 | 11.12 | 12099 | 27.7 |
| b20 opt | 12501 | 12174 | 434.1 | 22.4 | 19456 | 71.5 | 458.7 | 13.17 | 22529 | 20.8 | 524 | 9.33 | 18888 | 60.1 |
| b21 opt | 12678 | 12288 | 433.1 | 22.0 | 19613 | 63.0 | 465.1 | 13.51 | 22397 | 19.3 | 532.1 | 8.94 | 18958 | 56.2 |
| b22 opt | 18086 | 17911 | 623.1 | 23.2 | 28712 | 61.4 | 668.2 | 13.75 | 33440 | 26.5 | 744.0 | 10.95 | 27666 | 76.9 |
| ibm09 | 52989 | 53483 | 2134.4 | 3476.7 | 106081 | 305.3 | 2370.3 | 1673.1 | 123749 | 85.4 | 2787.4 | 1226.2 | 103942 | 342.0 |
| ibm10 | 68004 | 67651 | 2984.6 | 2867.2 | 117713 | 314.2 | 3374 | 3102.8 | 157215 | 200.7 | 3841.2 | 1819.4 | 142592 | 411.2 |
| ibm11 | 70028 | 70524 | 2928.8 | 3864.2 | 131113 | 567.8 | 3115.6 | 2070.7 | 151938 | 206.5 | 3599.7 | 1825.4 | 130900 | 545.5 |
| ibm13 | 84191 | 82989 | 3645.6 | 3135.1 | 140069 | 556.8 | 3747.7 | 2251.9 | 193909 | 260.6 | 4232.8 | 2463.3 | 166472 | 625.6 |
| ibm17 | 184227 | 180001 | 9166.6 | 8091.3 | 351023 | 1164.4 | 9940.3 | 6169.7 | 451452 | 653.6 | 11924.4 | 3518.4 | 375621 | 1171.9 |
| RATIO | | | 1.00 | 1.00 | 1.00 | 1.00 | 1.07 | 0.71 | 1.26 | 0.47 | 1.21 | 0.51 | 1.07 | 1.05 |

A. 3D Steiner Routing Results

Table II shows a comparison among the 3D maze router, the 3D A-tree, and our 3D Steiner router. We observe that our 3D Steiner router achieves a 49% average performance improvement over 3D maze routing and about 29% over the 3D A-tree. However, we see a 21% increase in wirelength and a 7% increase in vias over the 3D maze router. Our 3D Steiner router also runs slower than the 3D A-tree, which can be attributed to the large number of high-fanout nets present in the designs. Our 3D Steiner tree generation takes $O(p^4)$ time, where $p$ is the number of pins, and does not scale well with increasing fanout.

The v-bound column in Table II shows the lower bound on through via usage for each circuit. For two-die, two-pin nets, the lower bound is 1. For multi-die, multi-pin nets, we use the minimum possible number of vias needed to connect all pins in the dies. We see that the number of vias needed by the 3D maze or 3D Steiner router is about twice the minimum required. The number of vias used by the 3D A-tree algorithm is the highest. However, this should not be used as a comparison metric, since we constructed the 3D A-tree primarily to compare performance results.

For circuit ibm11, we observe that the 3D A-tree performs better than our 3D Steiner router in terms of performance. This was due to congestion, which caused a larger number of nets to be ripped up and re-routed in our 3D Steiner algorithm. In some cases, we observed that the 3D A-tree caused lower congestion and thus required less re-routing.
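One plausible way to compute a per-net via lower bound of the kind reported in the v-bound column is sketched below. It assumes that each through via joins two adjacent dies, so a net needs at least as many vias as the number of die boundaries between its lowest and highest pin. The function names and the die-index encoding are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterable, List


def net_via_lower_bound(pin_dies: Iterable[int]) -> int:
    """Lower bound on through vias for one net, given the die index of each pin.

    Assumption: a via connects two adjacent dies, so spanning dies
    d_min..d_max requires at least d_max - d_min vias.
    """
    dies = list(pin_dies)
    if not dies:
        return 0
    return max(dies) - min(dies)


def circuit_via_lower_bound(nets: List[List[int]]) -> int:
    """Sum of the per-net bounds over all nets of a circuit."""
    return sum(net_via_lower_bound(net) for net in nets)


# Hypothetical nets: pins on adjacent dies, pins all on one die,
# and a multi-pin net spanning dies 1 to 4.
example_nets = [[1, 2], [2, 2, 2], [1, 3, 4, 4]]
print(circuit_via_lower_bound(example_nets))
```

Under this assumption a two-pin net on adjacent dies yields a bound of 1, matching the two-die, two-pin case in the text, and a net confined to a single die contributes nothing.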

B. Through Via Relocation Results

To compare the results of our through via relocation, we implemented a fast greedy algorithm, which tries to move through vias into thermal hotspots in an iterative fashion. We choose a single hotspot and relocate the movable vias into it. We then repeat this process for the next hotspot until no more temperature improvement can be obtained. Table III shows the reduction in maximum temperature achieved by our approach and by the greedy algorithm (the accurate thermal model was used to calculate the final temperature). We observe that our simultaneous approach achieves consistent improvement over the greedy approach at the expense of additional runtime. The runtime of our ILP-based method ranges from 70 to 2800 seconds. The runtime for our biggest circuit, ibm17, which contains 184K nets, is around 2800 seconds. This shows that our method scales well with the complexity of the circuit while maintaining high-quality solutions. In Table III we also show the impact of multiple iterations on our relaxed ILP. Our entire through via relocation algorithm can be repeated multiple times since the temperature values change after each iteration. However, we observe that the majority of the improvement is achieved in the first iteration and that the algorithm converges quickly without any major improvement during the subsequent iterations.⁵

⁵ We attempted to solve our NLP formulation directly using the conjugate-gradient method [4]. However, the runtime was prohibitive even for small circuits.

VI. CONCLUSIONS

This paper studied two new problems that are important for 3D stacked IC technology: 3D Steiner tree construction and through via relocation. Our routing algorithm is based on a constructive method, where a 3D Steiner tree is grown by connecting a new pin to the existing tree. We derived two-variable delay equations and optimized them to compute the location of the through vias under a given thermal profile. For through via relocation, we devised an innovative technique which helps avoid the non-linear optimization required for temperature optimization. Our formulation can handle a large number of vias simultaneously for effective temperature optimization.

REFERENCES

[1] S. Das, A. Chandrakasan, and R. Reif, “Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools,” in Proc. IEEE Annual Symposium on VLSI, 2003.
[2] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCauley, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, “Die Stacking (3D) Microarchitecture,” in Proc. 39th International Symposium on Microarchitecture, 2006.
[3] M. Yildiz and P. H. Madden, “Preferred Direction Steiner Trees,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2002.
[4] J. Cong and Y. Zhang, “Thermal Via Planning for 3-D ICs,” in Proc. IEEE Int. Conf. on Computer-Aided Design, 2005.
[5] V. Pavlidis and E. Friedman, “Interconnect Delay Minimization through Interlayer Via Placement in 3-D ICs,” in Proc. Great Lakes Symposium on VLSI, 2005.
[6] A. Ajami, K. Banerjee, and M. Pedram, “Effects of non-uniform substrate temperature on the clock signal integrity in high performance designs,” in Proc. IEEE Custom Integrated Circuits Conference, May 2001, pp. 233–236.
[7] K. Banerjee, A. H. Ajami, and M. Pedram, “Analysis and optimization of thermal issues in high-performance VLSI,” in Proc. Int. Symp. on Physical Design, April 2001, pp. 230–237.
[8] K. Boese, A. Kahng, B. McCoy, and G. Robins, “Near-Optimal Critical Sink Routing Tree Constructions,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 1995.
[9] T.-Y. Wang and C. C.-P. Chen, “3-D Thermal-ADI: A linear-time chip level transient thermal simulator,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, pp. 1434–1445, 2002.
[10] B. Goplen and S. Sapatnekar, “Efficient Thermal Placement of Standard Cells in 3D ICs using a Force Directed Approach,” in Proc. IEEE Int. Conf. on Computer-Aided Design, 2003.
[11] J. Cong, K.-S. Leung, and D. Zhou, “Performance Driven Interconnect Design Based on Distributed RC Delay Model,” in Proc. ACM Design Automation Conf., 1993.

APPENDIX I
OPTIMIZATION OF TWO-VARIABLE DELAY EQUATIONS

Assuming $a_0 = r_2/r_1$ and $b_0 = c_1/c_2$, the optimization of the two-variable delay functions shown in Section III-C allows the computation of $x$ (= connection point) and $y$ (= through via location) as follows:

• For d(a) we have $\frac{\partial^2 F}{\partial \delta x^2} = r_1 c_2 (a_0 - b_0 - 2)$, $\frac{\partial^2 F}{\partial \delta y^2} = r_1 c_2 (a_0 + b_0 - 2)$, and $H_1 = -(r_1 c_2)^2 \{2 b_0 (a_0 + b_0 - 2)\}$. Thus, we see that when $H_1 = 0$, we have $\frac{\partial^2 F}{\partial \delta x^2} \le 0$ and $\frac{\partial^2 F}{\partial \delta y^2} = 0$, so the optimal delay is found at points according to Case B. If $H_1 < 0$, the optimal delay is found at points according to Case D. If $H_1 > 0$, we have $\frac{\partial^2 F}{\partial \delta x^2} \le 0$, so the optimal delay is found at points according to Case A.
• For d(b) we need to evaluate two cases: (i) when $x \ge b$, we have $\frac{\partial^2 F}{\partial \delta x^2} = 0$, $\frac{\partial^2 F}{\partial \delta y^2} = 0$, and $H_1 = 0$. Thus, the optimal delay is found at points according to Case C. (ii) when $x \le b$, we have $\frac{\partial^2 F}{\partial \delta x^2} = -2 r_1 c_2$, $\frac{\partial^2 F}{\partial \delta y^2} = 0$, and $H_1 = -(r_1 c_2)^2 (b_0 - 1)^2$. Thus, if $H_1 = 0$, the optimal delay is found at points according to Case B. Otherwise, it is found at points according to Case D.
• For d(c) we have $\frac{\partial^2 F}{\partial \delta x^2} = -2 r_1 c_2$, $\frac{\partial^2 F}{\partial \delta y^2} = 0$, and $H_1 = -(r_1 c_2)^2 (b_0 - 1)^2$. If $H_1 = 0$, the optimal delay is found at points according to Case B. Otherwise, it is found at points according to Case D.
• For all other nodes not in $T_p$, we have $\frac{\partial^2 F}{\partial \delta x^2} = 0$, $\frac{\partial^2 F}{\partial \delta y^2} = 0$, and $H_1 = 0$, since the delay is a linear function of $\delta x$ and $\delta y$. Thus, the optimal delay is found at points according to Case C.

Since the $a_0$ and $b_0$ values depend on the interconnect parameters of each die, we can see that the number of points $(x, y)$ at which a pin can connect to an edge in the 3D case is a fixed constant. At each stage of the 3D Steiner tree generation, we identify the optimum location point for each node using the above conditions, and then choose the node which gives the minimum max-delay. This process is repeated for a given tree until no unconnected sink nodes are left.
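The case selection for d(a) can be sketched as a small second-derivative test. The code below is a hedged illustration: it takes $H_1 = -(r_1 c_2)^2 \cdot 2 b_0 (a_0 + b_0 - 2)$ as our reading of the expression above, the function name `classify_da` and the tolerance are assumptions, and the meaning of Cases A, B, and D is defined in Section III-C, not here.

```python
from typing import Tuple


def classify_da(r1: float, c1: float, r2: float, c2: float,
                tol: float = 1e-12) -> Tuple[str, float, float, float]:
    """Return (case, F_xx, F_yy, H1) for the d(a) delay function.

    r1, c1 and r2, c2 are the unit-length resistance and capacitance of
    the two dies involved; tol is a relative tolerance for H1 == 0.
    """
    a0 = r2 / r1
    b0 = c1 / c2
    k = r1 * c2
    f_xx = k * (a0 - b0 - 2.0)                      # d2F / d(delta x)2
    f_yy = k * (a0 + b0 - 2.0)                      # d2F / d(delta y)2
    h1 = -(k ** 2) * 2.0 * b0 * (a0 + b0 - 2.0)     # assumed form of H1
    if abs(h1) <= tol * k ** 2:
        case = "B"
    elif h1 < 0:
        case = "D"
    else:
        case = "A"
    return case, f_xx, f_yy, h1


# Dies 1 and 2 from the experimental setup: a0 + b0 > 2, so H1 < 0.
print(classify_da(86.0, 396.0, 175.0, 100.0)[0])
```

Because $b_0 > 0$, the sign of $H_1$ under this reading is decided entirely by whether $a_0 + b_0$ is above, equal to, or below 2, which is why only a constant number of candidate points needs to be examined.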

APPENDIX II
EXPLANATION OF EQUATION (14)

From Figure 4 we note that the temperature at node $(i, j, k)$ having $n$ vias is computed as follows:

$$T_{i,j,k} = T_{i,j,k-1} + (P_{i,j,n} + \cdots + P_{i,j,k}) \times R^{n}_{i,j,k}$$

We can write $\delta^{n}_{i,j,k}$ (refer to Table I for its definition) as follows:

$$\delta^{n}_{i,j,k} = (P_{i,j,n} + \cdots + P_{i,j,k}) \times R^{n}_{i,j,k}$$

We see that $\delta^{n}_{i,j,k} \propto R^{n}_{i,j,k} \propto \frac{1}{V^{n}_{i,j,k}}$, thus $\delta^{n}_{i,j,k}$ is strictly decreasing for increasing values of $n$ and $n > 0$. It can be seen easily that the temperature of a given tile having $n$ vias can be rewritten as

$$T_{i,j,k} = T_{i,j,k-1} + \delta^{0}_{i,j,k} - (\delta^{0}_{i,j,k} - \delta^{1}_{i,j,k}) - \cdots - (\delta^{n-1}_{i,j,k} - \delta^{n}_{i,j,k})$$

We define $\Delta T^{n}_{i,j,k}$ as follows:

$$\Delta T^{n}_{i,j,k} = \delta^{n-1}_{i,j,k} - \delta^{n}_{i,j,k}$$

which is equal to the coefficient of the variable $\beta^{n}_{i,j,k}$. Note that $\Delta T^{n}_{i,j,k}$ is strictly decreasing when $n$ is increasing. This enables us to use non-integer values for the variable $\beta^{n}_{i,j,k}$ (refer to Table I for its definition). The reason is that for any value of $V^{n}_{i,j,k}$, $\beta^{n}_{i,j,k}$ will always reach its maximum allowed value of 1 before $\beta^{n+1}_{i,j,k}$ starts having a non-zero value. This is due to the fact that $\Delta T^{n}_{i,j,k} > \Delta T^{n+1}_{i,j,k}$, which corresponds to a greater decrease in the objective function per unit change of $V^{n}_{i,j,k}$. In other words, if $\beta^{n}_{i,j,k} < 1$ and $\beta^{n+1}_{i,j,k} > 0$, then we can always find a solution with a lower cost by adding $\gamma$ to $\beta^{n}_{i,j,k}$ so that $\beta^{n}_{i,j,k} = 1$ and adjusting $\beta^{n+1}_{i,j,k}$ to $\beta^{n+1}_{i,j,k} - \gamma$. Thus, we see that in our new formulation the extra variables $\beta_{i,j,k}$ are not constrained to be integers and that the only integer variables we have are the $M^{x,y,k}_{i,j,k}(n)$ variables.
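The telescoping argument can be checked numerically with a toy model. In the sketch below the temperature rise is taken as `heat_flow * r0 / (1 + n)`, a hypothetical stand-in for $\delta^{n}_{i,j,k}$ chosen so that the zero-via case stays finite; the function names and numeric values are assumptions. The helper `fill_beta` sets the $\beta$ variables in order, which is what a relaxed LP would choose when the coefficients $\Delta T^{n}$ are strictly decreasing.

```python
from typing import List


def delta(n: int, heat_flow: float = 10.0, r0: float = 2.0) -> float:
    """Hypothetical temperature rise of a tile with n vias (decreasing in n)."""
    return heat_flow * r0 / (1.0 + n)


def delta_t(n: int) -> float:
    """Gain of the n-th via: delta^(n-1) - delta^n, the coefficient of beta^n."""
    return delta(n - 1) - delta(n)


def fill_beta(v: float, v_max: int) -> List[float]:
    """Fill beta^1..beta^v_max in order so that they sum to v (0 <= v <= v_max)."""
    return [min(1.0, max(0.0, v - (n - 1))) for n in range(1, v_max + 1)]


def tile_temperature(t_below: float, beta: List[float]) -> float:
    """Linearized tile temperature: T_below + delta^0 - sum_n beta^n * dT^n."""
    gain = sum(b * delta_t(n) for n, b in enumerate(beta, start=1))
    return t_below + delta(0) - gain


print(fill_beta(2.5, 4))
print(tile_temperature(50.0, fill_beta(3, 4)), 50.0 + delta(3))
```

For an integer via count the linearized expression collapses to the direct formula, and for a fractional count at most one $\beta$ variable is fractional, which is the property the relaxation relies on.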