Vol. 37 No. 12
Transactions of Information Processing Society of Japan
Dec. 1996
Regular Paper
A Two-Phase, Cooperative Detailed/Global Parallel Wire Routing Algorithm
Hesham Keshk,†,☆ Shin-ichiro Mori,† Hiroshi Nakashima† and Shinji Tomita†
This paper introduces a parallel global and detailed wire router. The router is divided into two phases to decrease the communication cost of routing short nets. The first phase aims to route all short nets using only a few messages by rotating the area assigned to each processor. The second phase uses a new method of grid assignment which divides the whole grid into layers, divides each layer into slices, divides each slice into partitions, and assigns one or more slices to each processor. To obtain higher wireability, a new method for calculating the actual capacity of each partition (to be used by global routing) is introduced. A new detailed routing algorithm is also introduced, which finds a path for each net under the condition that this path does not decrease the combined capacity of a series of partitions by more than one.
1. Introduction
The basic problems of automated wire routing are the long computation time and the large memory size required. Some studies improve the speed by implementing maze running or line search algorithms in hardware 16),17). Recently, several studies have tried to overcome these problems by using parallel processing 4),5),21). Parallelism can be obtained in both nets and space. In the former, nets are assigned to processing elements (PEs), so that each net is routed by the PE responsible for it. In the latter, the whole grid space is divided among the PEs so that several PEs cooperate in routing one net 13),18),20). To exploit parallelism more extensively, the two methods may be merged 8),21). For message-passing distributed-memory computers, dividing the grid space is necessary to decrease the memory required by each PE. The efficiency of parallelism depends on how the grid space is divided into partitions and how these partitions are assigned to the PEs. The communication cost between PEs must be decreased by reducing the number of messages and by increasing the overlap between communication and computation. Different methods for grid space division have been introduced, but there is a trade-off between processor utilization and inter-processor communication 20).
1.1 Global & Detailed Routing
With the increasing complexity of routing problems, which now deal with many thousands of nets, large problems should be decomposed into two processes: global and detailed routing 3). In global routing, the whole grid space is divided into partitions, and it is required to determine in which partitions each net will be routed. The exact path of each net within the partitions is found in detailed routing. Detailed paths must not intersect each other on a layer. As many routers completely separate global and detailed routing 13),14), the global router has to estimate the number of available tracks (capacity) inside each partition. The accuracy of this estimation has a great effect on the wireability, which is defined as the ratio of successfully connected nets. In this paper, we introduce a new cooperative relation between the global and detailed routers to improve the wireability. Instead of estimating the capacities, the detailed router calculates the exact capacities of all partitions and feeds them back to the global router. We obtained wireability of 100%, 99.9%, and 100% for 3 MCM benchmarks. Section 2 introduces the global router, the detailed router, and how to find the exact capacity.
1.2 Two Phases Algorithm
As short nets take only a short time to route, the ratio of communication to computation for short-net routing causes a severe performance problem even if only one message is used per net. To overcome this problem, we introduce a new router that consists of two phases with two different grid space division strategies. Each phase consists of cooperative global and detailed routers. The first phase aims to route all short nets using a few messages; the rest of the nets are then routed in the second phase. In the first phase, we use the rotating area algorithm 8). We modified this algorithm to improve the load balance, to decrease the execution time, and to increase the wireability. The modified rotating area algorithm is introduced in Section 3.1. In the second phase, a slice based grid space division 9) is used to route long nets. We divide the whole grid space into layers, divide each layer into slices, and assign one or more slices to each PE. The number of messages per net in this phase is proportional to the number of vias. Section 3.2 explains the second phase.
† Department of Information Science, Kyoto University
☆ Presently with Faculty of Engineering and Technology, Helwan University
2. Cooperative Global and Detailed Routing
In this section, we classify several features of global and detailed routers which are implemented in both phases of the parallel router described in Section 3.
2.1 Global Routing
In global routing, the whole grid space is divided into partitions, and it is required to determine in which partitions each net will be routed. We use the maze running algorithm 1),15) with different costs to find the least cost global path. The cost of each partition depends mainly on the number of available tracks (capacity) inside the partition. The method to find the partition capacity is explained in Section 2.2. In our router, global routing has restricted X and Y directions (only the X direction is allowed on odd layers and only the Y direction on even layers), while detailed routing allows both X and Y directions on all layers. The global router has to determine in which layers, and in which partitions within these layers, each net will be routed. We assume that the number of layers is given before routing, and that routing rules, such as the clearance between two wires, apply to all layers alike. Figure 1(a) shows an example of a two-layer grid in which each layer is divided into 64 partitions. The shaded area indicates the output of the global router for the two-terminal net (S and D). Two layers are assigned to the partitions which contain net terminals (A10, B10, A39, B39) or path bends (A13, B13, A37, B37). Only one layer is assigned to the other partitions (A11, A12, B21, B29, A38) to allow other nets to be routed simultaneously if they cross those partitions by overpasses or underpasses. In detailed routing, nets can be routed simultaneously if their global paths have no common partitions. Figure 1(b) shows an example of 7 nets whose detailed routing can be done simultaneously.
For a net with three or more terminals, the global router finds a sub-optimal path using the conventional extension of maze running, in which the branch between a terminal pair is routed first and then paths to the other terminals are repeatedly routed from previously routed branches 2). The branches are given to the detailed router one by one, so that it uses the exact path in the branch-point partition on a previously routed branch as a terminal. In the rest of this paper, we concentrate on two-terminal nets for simplicity, but the algorithms to be explained are easily adapted to nets with three or more terminals.
Fig. 1  Global routing: (a) global path of the net S-D on a two-layer grid with 64 partitions per layer; (b) seven nets (A to G) whose detailed routing can be done simultaneously.
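The layered global search described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a cost-ordered wave expansion (Dijkstra-style) as one common way to realize maze running with costs, and the cost function `1 + 1/capacity`, the `via_cost` value, and the names `global_route` and `cap` are assumptions made for illustration. Layer index 0 stands for an odd layer (X moves only) and index 1 for an even layer (Y moves only).

```python
import heapq

INF = float("inf")

def global_route(cap, src, dst, via_cost=2.0):
    """Least-cost global path over partitions on a two-layer grid.

    cap[layer][row][col] is the remaining capacity of a partition.
    Layer 0 allows only X moves (column changes), layer 1 only Y moves.
    The cost model is hypothetical: entering a partition costs
    1 + 1/capacity, and a partition with no capacity cannot be entered.
    """
    n_rows, n_cols = len(cap[0]), len(cap[0][0])

    def enter_cost(layer, r, c):
        k = cap[layer][r][c]
        return INF if k <= 0 else 1.0 + 1.0 / k

    dist, prev, heap = {}, {}, []
    for layer in (0, 1):  # a terminal partition may be used on either layer
        if cap[layer][src[0]][src[1]] > 0:
            start = (layer, src[0], src[1])
            dist[start] = 0.0
            heapq.heappush(heap, (0.0, start))

    while heap:
        d, state = heapq.heappop(heap)
        if d > dist[state]:
            continue  # stale heap entry
        layer, r, c = state
        if (r, c) == tuple(dst):
            path = [state]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        if layer == 0:
            moves = [(0, r, c - 1), (0, r, c + 1)]
        else:
            moves = [(1, r - 1, c), (1, r + 1, c)]
        steps = [(m, enter_cost(*m)) for m in moves
                 if 0 <= m[1] < n_rows and 0 <= m[2] < n_cols]
        via = (1 - layer, r, c)  # change layer inside the same partition
        steps.append((via, via_cost if cap[1 - layer][r][c] > 0 else INF))
        for nxt, w in steps:
            if d + w < dist.get(nxt, INF):
                dist[nxt] = d + w
                prev[nxt] = state
                heapq.heappush(heap, (d + w, nxt))
    return None  # no global path within the available capacities
```

The returned list of `(layer, row, col)` states changes layer exactly where the path bends, which matches the two-layer assignment at bend partitions described in the text.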
Fig. 2  Partition capacity: six example partitions, (a) to (f), with capacities 0 to 5.
Fig. 3  The order to check adjacent points depends on the previous propagation direction, panels (a) to (d).
2.2 How to Find the Partition Capacity
As many routers 13) completely separate global and detailed routing, they initialize the boundary capacity to the length of the boundary between two adjacent partitions, and decrease this capacity by one when the global path of a net crosses the boundary. This method is simple, but in many cases it cannot give an accurate capacity. In Fig. 2, for example, the partitions shown in (a) to (d) have boundaries that 6 paths can cross, but the number of paths that can go through the partitions ranges from 0 to 3. Thus if the global router generates 6 global paths through partition (a), for example, because the capacities of both boundaries are 6, the detailed router cannot find exact paths for any of them. Therefore, instead of boundary capacity, we use partition capacity, which is defined as the maximum number of paths that can be drawn between the two boundaries of the partition in the routing direction 10). The partition capacity cannot be obtained from the number of grid points occupied by obstacles inside the partition alone. Rather, we must take into account the locations of the obstacles. For example, the capacities of the partitions shown in Fig. 2 are different, although each of these partitions has 6 grid points occupied by obstacles.
We use the algorithm introduced in Amon 10) to find the actual capacity of a partition. In this algorithm, the propagation starts from the lowest free point on the left edge, and then repeatedly moves to the first adjacent free point until arriving at the right edge. The order in which the adjacent points are checked depends on the previous direction, i.e. the direction from which the currently propagated point was visited, as shown in Fig. 3. For example, if a point is visited from the Left (Fig. 3(a)), the propagation moves to the first free adjacent point in the order (1) Down → (2) Right → (3) Up. If there are no free adjacent points, the propagation moves back to the previous point. Figure 4 shows the movement of the propagation in a partition. When we arrive at the right edge (Fig. 4(a)), we find that one path can be drawn, and then start another propagation (Fig. 4(b)). This is repeated until arriving at the upper edge (Fig. 4(c)).
Fig. 4  Finding the actual capacity of a partition.
The obstacles in one partition may affect the adjacent partitions. In some cases, although all the capacities of adjacent partitions are more than zero, no detailed path can be drawn between these partitions. In Fig. 5(a), there is no path between T1 and T2 although each of the partitions has a capacity of 3. This would lead the global router to choose a path which cannot be verified by the detailed router. We overcome this problem by overlapping the partitions as shown in Fig. 5(b). Although this technique is not perfect, it drastically reduces the probability of overestimating the combined capacity of a series of partitions. In fact, we have not encountered any failure in detailed routing due to overestimation in the experiments described in Section 4.
Updating the partition capacity without regard to the detailed routing result causes overestimation and underestimation. For example, the capacity of the partition shown in Fig. 6(a) will be decreased by 1 if the path A → B through the partition is drawn as shown in Fig. 6(b), while
it will be decreased by 2 if the path is drawn as shown in Fig. 6(c). Although our detailed router minimizes the capacity consumption for a series of partitions as described in Section 2.3, it may consume the capacity of a single partition by two or more. Thus, if the global router simply decreases the capacity by 1, the capacity may be overestimated. Underestimation may be caused by a via that occupies $n \times n$ grid points. Such a via may decrease the capacity by $n$, but the consumption may be less than $n$ depending on the location of the via. Thus, it is necessary to feed back the actual capacity consumption from the detailed router to the global router.
In order to reduce the communication overhead and to make the global and detailed routers work in parallel, we divide the nets into groups and update the partition capacities after the detailed routing of each group rather than of each net. As described in Section 3.2, the global router works on the $(n+1)$-th group using the actual capacity obtained from the detailed routing of the $(n-1)$-th group, while the detailed router works on the $n$-th group. During the global routing of one group, after routing each net, we decrease the capacities of the partitions located on the path of this net by one. All the capacities are corrected and updated with accurate (actual) data after routing each group. This cooperative relation obtains higher wireability at the expense of a longer, but not significantly longer, execution time, as described in Section 4.
Fig. 5  Overlapping between partitions: (a) without overlapping, (b) with overlapping.
Fig. 6  Updating capacity depends on detailed paths: (a) capacity = 3, (b) capacity = 2, (c) capacity = 1.
2.3 Detailed Routing (Max-cap Algorithm)
The maze running algorithm has a serious disadvantage: previously routed nets may prevent other nets from being routed later. This disadvantage becomes more severe in our detailed router because we assign only one layer to some partitions. In Fig. 7(a), for example, only one path can be drawn by maze running, while the combined capacity of the four partitions is 3. Thus we devised a new algorithm named max-cap 10), which routes each net under the condition that the path of this net does not decrease the combined capacity of a series of partitions on the global path by more than one.
To route a net using this algorithm, the global path of the net is divided into strips. Each strip is a series of partitions located on the global path within the same layer. For example, the global path of the net shown in Fig. 1 consists of three strips. The first strip contains the partitions A10, A11, A12 and A13, the second strip contains B13, B21, B29 and B37, and the third contains A37, A38 and A39. For each strip, we find the maximum number of paths (combined capacity) which can be routed between the two edges of the strip by the algorithm explained in Section 2.2. For example, we find that the capacity of the strip shown in Fig. 7 is 3 by drawing 3 paths. These paths, however, are long and have many unnecessary bends. To decrease the number of bends and shorten the paths, we rip up the uppermost path (Fig. 7(b)), and then find the shortest path which connects the left edge to the right edge and draw it (Fig. 7(c)). Then, we rip up and redraw the other paths one by one as shown in Fig. 7(d). Finally, we choose an appropriate path according to the terminal location or the previously drawn path in the other strip (Fig. 7(e)).
The max-cap algorithm achieves higher wireability than the maze running algorithm 1). Note, however, that its ripping up and redrawing may consume the capacity of a partition in a strip by two or more, while the capacity consumption for the strip is one. Therefore, we have to recalculate the partition capacities and feed them back to the global router, as explained in Section 2.2.
Fig. 7  Detailed routing: (a) using the naive maze running algorithm; (b) find the capacity and rip up the uppermost path; (c) redraw the uppermost path; (d) rip up and redraw the other paths; (e) choose an appropriate path.
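The division of a global path into strips can be sketched in a few lines. This is only an illustration: it assumes that partitions are named as in Fig. 1, with the first character giving the layer ("A" or "B"), and the function name `split_into_strips` is hypothetical.

```python
from itertools import groupby

def split_into_strips(global_path):
    """Group consecutive partitions of a global path by layer.

    Assumes partition names such as "A10" or "B13", where the first
    character identifies the layer (naming taken from Fig. 1).
    """
    return [list(group)
            for _, group in groupby(global_path, key=lambda name: name[0])]

# Global path of the net S-D discussed in the text.
path = ["A10", "A11", "A12", "A13",
        "B13", "B21", "B29", "B37",
        "A37", "A38", "A39"]
strips = split_into_strips(path)
```

For the path above, `strips` holds the three strips listed in the text, one per maximal run of partitions on the same layer.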
3. Two Phases Algorithm
Our algorithm consists of two phases with two different grid space division strategies to improve the parallelism and to decrease the communication cost. The first phase aims to route all short nets using few messages; the rest of the nets are then routed in the second phase. Both phases consist of the cooperative global and detailed routers explained in Section 2. Table 1 shows the main features of the two phases.
3.1 First Phase (for Short Nets)
As short nets take only a short time to route, the communication overhead should be minimized. In this phase, we use a modified version of the rotating area algorithm 8), which was proposed to route all short nets using only a few messages. The modifications improve the load balance (by using a new area assignment), the execution time (by using a global/detailed router instead of only a detailed router as in the original algorithm), and the wireability (by using the max-cap algorithm for detailed routing).
We define $X_{\text{diff}}$ for a net as the maximum distance between its terminals in the X direction ($Y_{\text{diff}}$ in the Y direction). Let the number of PEs be $N^2$, and the grid dimensions be $L^2$. This phase guarantees an attempt to find a path for every net which has both $X_{\text{diff}}$ and $Y_{\text{diff}}$ less than $d = L/(2N)$. This phase may also deal with nets which have $X_{\text{diff}}$ and/or $Y_{\text{diff}}$ greater than $d$ but less than $2d$, depending on the locations of the net terminals.
In this phase, the host computer distributes the locations of all net terminals and initial obstacles to the PEs. The grid space is divided among all PEs as shown in Fig. 8(a), in which we have 16 PEs, P1 to P16, and thus $N = 4$. Every PE is responsible for an area of $(2d)^2 = (L/N)^2 = (L/4)^2$. At the beginning, all the nets are labeled 0, indicating that they have not been routed yet. Every PE chooses the nets that lie in its assigned area, sorts them in ascending order of length, and tries to route them (both global and detailed routing) within its assigned area. In this example, P6 routes the net 'A', while 'B', 'C' and 'D' cannot be routed because their paths have to cross area boundaries. A net is labeled 1 if the PE succeeds in finding a path for it; the others remain labeled 0. After finishing step 1, all the PEs virtually move to the left rotationally by $d = L/8$ on the grid space (Fig. 8(b)). Thus the right half of the previously assigned area of a PE is now assigned to its right neighbor.
For this area shift, each PE sends the routing result and obstacle locations in the right half of its previously assigned area to its right neighbor, and receives the corresponding data from its left neighbor. In step 2, the PEs P1, P5, P9 and P13 are idle because routing has already been attempted in step 1 for the nets lying in their areas, which are shaded in Fig. 8(b). The other, active PEs choose the nets labeled 0 which lie in their assigned areas, and try to route them. In this example, the net 'B' is routed by P6. Then all the PEs move upward and the 9 active PEs perform step 3, in which the net 'C' is routed by P6 (Fig. 8(c)). Finally, all the PEs move to the right and the net 'D' is routed in the final step 4 (Fig. 8(d)). At the end of this phase, all the PEs send the paths to the host computer.
Table 1  Two phases algorithm.
 | First phase | Second phase
Objective nets | Short nets. | Long nets.
PEs processing | For each partition, both global and detailed routing executed on the same PE. | PEs are divided into global PEs and detailed PEs.
Global algorithm | Maze running algorithm with different costs. | Maze running algorithm with different costs.
Detailed algorithm | Max-cap algorithm. | Max-cap algorithm.
Cooperation | Cooperative global and detailed routing within the same PE. | Cooperative global and detailed routing between global PEs and detailed PEs.
Parallel algorithm | Rotating area algorithm. | Global: Competing processors algorithm. Detailed: Slice based grid space division.
Parallelism | Net parallelism. | Global: Net parallelism. Detailed: Net and space parallelism.
Fig. 8  Routing nets in the first phase: (a) step 1, (b) step 2, (c) step 3, (d) step 4.
By this complete rotation, an attempt is made to route every short net whose $X_{\text{diff}}$ and $Y_{\text{diff}}$ are less than $d$ regardless of its terminal locations, because its terminals must lie in the area of some step. A net such that $d \le X_{\text{diff}}, Y_{\text{diff}} < 2d$ may also be tried if its terminals lie in the area of some step. As can be observed from Fig. 8, the numbers of active PEs in these four steps are $N^2$, $N^2 - N$, $N^2 - 2N + 1$ and $N^2 - N$ respectively, where $N^2$ is the number of available PEs. The ratio of active PEs improves as $N$ increases, and thus we will obtain a high ratio with massively parallel computers. The communication cost of this phase is very low because a PE sends/receives only 8 messages, two from/to the host and six
from/to its neighbors. Moreover, in the case of a torus or mesh network parallel computer, which we use, the inter-processor communication is performed between physical neighbors and thus network traffic is localized.
On the other hand, this phase may suffer from load imbalance if short nets are unequally distributed in the grid space. To overcome this problem, the host could stop this phase and start the second phase if the number of active PEs falls below a certain ratio of the total number of PEs (50%, for example). An unequal grid space partitioning strategy 6) might also be used, but the memory size required by each PE would then differ. These improvements have not yet been implemented and are left for future work.
As described at the beginning of this section, the algorithm explained above is a modified version of the rotating area algorithm that we proposed in 8). In the original algorithm, the grid space was divided and distributed to the PEs as shown in Fig. 9. In this figure, the shaded area indicates a part of the grid space which is not assigned to any PE during each step, while the shaded area in Fig. 8 is assigned to idle PEs. Although all the PEs are active in all four steps in this assignment, its load balance is worse than that of the new assignment shown in Fig. 8 if the short nets are equally distributed or concentrated near the edges of the grid space. For example, if the short nets are equally distributed, the PEs in the rightmost column in Fig. 9(b) (P4, P8, P12 and P16) have a much heavier load than the other PEs, since the right halves of their assigned areas were not visited in the previous step 1. So the other PEs have to wait until these 4 PEs finish their work. This occurs again in both steps 3 and 4. If the concentration of the short nets near the edges is lower than in the middle area, the original assignment will give better performance than the new assignment. In our implementation, we check the concentration of the nets at the beginning of this phase, and choose one of the two assignments shown in Figs. 8 and 9 accordingly.
Fig. 9  Assignment of the routing area to PEs in the original rotating area algorithm (d = L/9): (a) step 1, (b) step 2, (c) step 3, (d) step 4.
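The geometry of the four steps of the new assignment (Fig. 8) can be checked with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: grid points have integer coordinates 0 to L-1, L is divisible by 2N, the area offsets of steps 1 to 4 are taken as (0, 0), (d, 0), (d, d) and (0, d), and an area that wraps around the grid edge is treated as belonging to an idle PE. The names `fits_in_tile`, `first_step` and `active_pes` are hypothetical.

```python
def fits_in_tile(lo, hi, offset, tile, size):
    """True if the range [lo, hi] lies inside one non-wrapping area.

    Areas start at offset, offset + tile, ... along one axis of a grid
    of the given size. An area that crosses the grid edge wraps around
    and is treated here as assigned to an idle PE.
    """
    if lo < offset:
        return False  # lo falls in the wrapped area of this step
    start = offset + ((lo - offset) // tile) * tile
    end = start + tile  # exclusive upper bound of the area
    return hi < end and end <= size

def first_step(terminals, L, N):
    """Step (1 to 4) in which a net first lies inside one active area.

    terminals: list of (x, y) grid points of the net.
    Returns None if no step covers the net, i.e. it is left for the
    second phase.
    """
    d = L // (2 * N)
    tile = 2 * d
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    offsets = [(0, 0), (d, 0), (d, d), (0, d)]  # assumed shifts, steps 1 to 4
    for step, (ox, oy) in enumerate(offsets, start=1):
        if (fits_in_tile(min(xs), max(xs), ox, tile, L)
                and fits_in_tile(min(ys), max(ys), oy, tile, L)):
            return step
    return None

def active_pes(N):
    """Number of active PEs in steps 1 to 4, as given in the text."""
    return [N * N, N * N - N, N * N - 2 * N + 1, N * N - N]
```

With L = 64 and N = 4 (so d = 8), a net whose X and Y extents are both below d always receives a step number, which mirrors the guarantee stated above; nets with larger extents may or may not be covered, depending on where their terminals lie.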
3.2 Second Phase (for Long Nets)
In this phase, the PEs are divided into global PEs for global routing and detailed PEs for detailed routing. In order to reduce the number of messages between global and detailed PEs, we divide the nets into groups. In the basic, naive communication scheme shown in Fig. 10(a), all the global paths of a group are sent to the detailed PEs after the global PEs finish routing the group. The detailed PEs find the detailed paths for this group, and then calculate the actual capacities of all partitions, which are sent to the global PEs to update their capacity information with actual data. The global PEs then start to route another group with the actual capacities. While the global PEs work on a group, they incrementally update the capacities by subtracting 1 from the capacity of each partition on a global path.
Fig. 10  Alternating between global and detailed routing in the second phase, panels (a) to (c). Legend: up-cap = update capacity, calc-cap = calculate capacity, sub-cap = subtract capacity.
As can be observed from Fig. 10(a), this scheme causes much idle time in both the global and detailed PEs. To decrease the idle time, we divide each group further into several sub-groups. The global PEs send the global paths of each sub-group to the detailed PEs so that detailed routing starts earlier, as shown in Fig. 10(b). Although this improvement increases the number of messages from the global PEs to the detailed PEs, the reduction of the idle time outweighs the communication overhead.
Finally, in order to decrease the idle time further, we let the global PEs start routing the $n$-th group after receiving the actual capacities obtained from the detailed routing of the $(n-2)$-th group, before receiving those of the $(n-1)$-th group, as shown in Fig. 10(c). The global PEs maintain the estimated capacity consumption of each partition while routing the $(n-1)$-th group, which is subtracted from the actual capacity reported by the detailed PEs before routing the $n$-th group. By this earlier starting, we completely eliminate the idle time of either the global or the detailed PEs. In the figure, we assume that the load of the detailed PEs is heavier than that of the global PEs, and thus the detailed PEs work without idling. The idle time of the global PEs could be eliminated if we advanced this earlier starting further. This further-earlier starting, however, would have a bad effect on the accuracy of the capacities used for global routing.
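The capacity bookkeeping of the scheme in Fig. 10(c) amounts to a simple subtraction, sketched below. The sketch assumes that each global path of the $(n-1)$-th group consumes exactly one unit of capacity in every partition it crosses, as stated for the incremental update during global routing; the names `estimate_capacity`, `actual_capacity` and `previous_group_paths` are hypothetical.

```python
def estimate_capacity(actual_capacity, previous_group_paths):
    """Capacities used to start global routing of group n.

    actual_capacity: partition -> actual capacity reported by the
        detailed PEs after detailed routing of group n-2.
    previous_group_paths: global paths (lists of partitions) of
        group n-1, whose detailed routing is not finished yet.
    Each global path is assumed to consume one unit of capacity in
    every partition it crosses.
    """
    estimate = dict(actual_capacity)  # leave the reported map untouched
    for path in previous_group_paths:
        for partition in path:
            estimate[partition] -= 1
    return estimate
```

Because the estimate is replaced by actual data once the detailed PEs report on the $(n-1)$-th group, any error made here lasts for at most one group.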
3.2.1 Global Routing
As the memory required for global routing is not very large, because it is proportional to the number of partitions rather than to the number of grid points as in detailed routing, we exploit only net parallelism for global routing. Each global PE has a copy of the capacity map of all the partitions in its own memory. The global routing is performed in a master-slaves manner using the competing processors algorithm 19). One global PE is the master while all the others are slaves. The master distributes its capacity map, as reported by the detailed PEs, to the slaves, and clears its differential capacity map. Then it assigns a net labeled 0, which indicates that the net was not routed in the first phase, to a slave. Each slave tries to find the least cost global path for the assigned net and sends the path back to the master, which verifies whether the path exceeds the capacity of any partition. If it does, the master rejects the path and requests the sender slave to reroute the net, giving the slave the latest accepted paths so that the slave can update its capacity map. Otherwise, the master labels this net 1, adds its path to the list of accepted paths, updates the differential capacity map, and sends another net with the latest accepted paths to the sender slave. When the master receives the actual capacities from the detailed PEs, it updates its own capacity map using the actual capacities and the differential capacity map, which is then cleared. The master also distributes the new capacity map to all the slaves again.
In the competing processors algorithm, many nets would be rerouted if it were used for detailed routing 19), but this does not happen frequently in global routing. As every partition has a capacity, two or more PEs can use the same partition for different nets at the same time, provided that these nets do not exceed the total capacity of the partition.
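The verification performed by the master can be sketched as follows. This is a minimal illustration, assuming that the differential capacity map counts the units consumed by accepted paths since the last actual report and that a path visits each partition at most once; the function name `master_verify` is hypothetical.

```python
def master_verify(capacity, differential, path):
    """Accept or reject a global path proposed by a slave.

    capacity: partition -> capacity from the last actual report.
    differential: partition -> units consumed by accepted paths since
        that report (updated in place when the path is accepted).
    path: list of partitions, each assumed to appear at most once.
    """
    for partition in path:
        if capacity[partition] - differential.get(partition, 0) < 1:
            return False  # the path would exceed this partition's capacity
    for partition in path:
        differential[partition] = differential.get(partition, 0) + 1
    return True
```

A rejected path leaves the differential map unchanged, so the slave can reroute the net against the latest accepted paths.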
Fig. 11  Detailed grid assignment (layers 1 and 2, slices labeled P1 to P8, with virtual and actual boundaries).
3.2.2 Detailed Routing
3.2.2.1 Slice Partitioning
Here we explain how to divide the grid into partitions and how to assign these partitions to the detailed PEs. First, we divide the grid space into layers and assign a group of PEs to each layer. For example, if the grid space has 2 layers of dimensions 1000 × 1000 and 8 PEs are available for detailed routing, 4 PEs are assigned to each layer. Each layer is divided into slices instead of square areas (see Fig. 11). The length of each slice is equal to the grid dimension (1000 in this example), while the slice width is equal to the grid dimension divided by (K × C), where K is the number of detailed PEs assigned to a layer (4 in this example) and C is the number of slices assigned to one detailed PE (2 in this example). Thus we have 16 slices of 1000 × 125. Odd (even) layers are divided into slices in the X (Y) direction. Each slice is divided by virtual boundaries to form a series of partitions that the global PEs work on. The virtual boundaries are used to divide the memory of each PE into different parts. Virtual boundaries for odd layers are in the same positions as actual boundaries for even layers and vice versa. By using virtual boundaries, we can route two or more nets in a slice simultaneously if their global paths do not overlap in any partition.
As the nets are not uniformly distributed, some parts of the grid are more densely populated by nets than others. Thus better load balance is obtained by assigning more slices (i.e., a larger C) to each PE. A larger C also makes the slices narrower and thus makes detailed routing faster, because the search space becomes smaller. On the other hand, this makes the global routing more difficult because the number of partitions becomes larger, and the wireability may decrease if the slices become too narrow. Since the optimal value of C depends on the characteristics of the problem, we tuned the value by hand so that the slice width is in the range from 20 to 30 in our experiments. A method to find a good value of C is left as future work.
Fig. 12 Detailed routing for a net. (Panels (a)-(d) over partitions A1-A64 and B1-B64 assigned to P1-P8; legend: global path, settled detailed path, temporary detailed path.)
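The slice sizing rule described above can be checked with a short calculation. The following is only a sketch: the function names are hypothetical, a square grid that divides evenly is assumed, and the default 20 to 30 width range simply mirrors the hand-tuned range reported for the experiments.

```python
def slice_layout(grid_dim, layers, detailed_pes, c):
    """Return (K, slice_width, total_slices) for a square grid.

    K is the number of detailed PEs per layer and C the number of slices
    per detailed PE, as in the text. Even division is assumed.
    """
    k = detailed_pes // layers
    slice_width = grid_dim // (k * c)
    total_slices = layers * k * c
    return k, slice_width, total_slices


def slice_direction(layer):
    """Odd layers are cut into slices along X, even layers along Y (1-based)."""
    return "X" if layer % 2 == 1 else "Y"


def candidate_c(grid_dim, k, min_width=20, max_width=30):
    """List every C whose slice width grid_dim / (K * C) lies in the range."""
    return [c for c in range(1, grid_dim // k + 1)
            if min_width <= grid_dim / (k * c) <= max_width]


if __name__ == "__main__":
    print(slice_layout(1000, 2, 8, 2))  # the worked example in the text
    print(candidate_c(1000, 4))
```

For the example in the text (2 layers of 1000 × 1000, 8 detailed PEs, C = 2) this gives K = 4, a slice width of 125 and 16 slices; for the same grid, the values of C that bring the width into the 20 to 30 range are 9 to 12.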
3.2.2.2 Algorithm of Detailed Routing
Here we explain how the detailed routing is done in parallel and how the detailed PEs communicate with each other to route a net. The detailed routing is performed in a master-slave manner. We exploit parallelism in both nets and space. The master detailed PE receives a group of global paths for nets from the master global PE, and picks a net from the group. The master locks the partitions on the global path of the net, and requests the slave PE that is responsible for the partition containing one of the terminals of the net to route the net. This slave PE works cooperatively with the other slave PEs that are responsible for the partitions on the global path of the net. Each PE finds the part of the detailed path that lies in the strip which is a part of the global path and is included in the slice of the PE. While the slaves work on a net, the master picks another net whose global path consists of free partitions, and routes it simultaneously with the first net. Since we lock partitions instead of slices, the master can find a sufficient number of nets to be routed in parallel.
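The master's bookkeeping described above can be sketched as an all-or-nothing lock over the partitions of a global path. This is an illustrative model only, not the authors' implementation: the class and function names are hypothetical, message passing to the slave PEs is omitted, and the net identifiers and the two short paths in the demo are made up (the long path uses the partition names of the Fig. 12 example).

```python
class PartitionLocks:
    """Tracks which partitions are currently locked by the master."""

    def __init__(self):
        self._locked = set()

    def try_lock(self, path):
        """Lock every partition on the path, or none if any is busy."""
        wanted = set(path)
        if wanted & self._locked:
            return False
        self._locked |= wanted
        return True

    def free(self, partitions):
        """Release partitions whose detailed path has been settled."""
        self._locked -= set(partitions)


def dispatch(pending, locks):
    """Start every pending net whose global path is entirely free.

    `pending` is a list of (net_id, path) pairs; nets that cannot start
    stay in the list. Returns the ids of the nets that were started.
    """
    started, waiting = [], []
    for net_id, path in pending:
        if locks.try_lock(path):
            started.append(net_id)
        else:
            waiting.append((net_id, path))
    pending[:] = waiting
    return started


if __name__ == "__main__":
    fig12_path = ["B10", "A10", "A11", "A12", "A13", "B13", "B21",
                  "B29", "B37", "A37", "A38", "A39", "B39"]
    locks = PartitionLocks()
    pending = [("n1", fig12_path), ("n2", ["A12", "A20"]), ("n3", ["A1", "A2"])]
    print(dispatch(pending, locks))          # n1 and n3 start, n2 waits on A12
    locks.free(["B10", "A10", "A11", "A12"])  # first part of n1 is settled
    print(dispatch(pending, locks))          # n2 can start now
```

Because a net is started only when its whole path can be locked at once, no net ever holds some partitions while waiting for others, which is the property the text relies on to rule out deadlock.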
Figure 12 shows an example of the detailed routing for a net. If the global path of this net is as shown in Fig. 12(a), the detailed routing is done in the following steps.
( 1 ) If all the partitions on the global path of the net are free, the master locks these partitions and requests P6 to route the net.
( 2 ) P6 sends the data of partition B10 to P2. This data informs P2 of the occupied grid points in B10. P6 cannot use B10 yet, but it can use any other partition within its slice to route other nets.
( 3 ) P2 finds a detailed path, using the max-cap algorithm, between the left and the right ends of the strip that consists of A10, A11, A12 and A13. P2 also uses B10 to connect the source to the path in the strip. This detailed path consists of two parts: settled (B10, A10, A11, A12) and temporary (A13) (Fig. 12(b)).
( 4 ) P2 sends the data of B10 back to P6, so P6 can now use this partition again for routing other nets. P2 sends a message to the master PE to free partitions (B10, A10, A11, A12), and sends the data of partition A13 to P5. P2 cannot use partition A13 yet, but its other partitions are free for routing other nets.
( 5 ) P5 finds the detailed path between the received temporary detailed path in A13 and the bottom end of the strip that consists of B13, B21, B29 and B37 (Fig. 12(c)).
( 6 ) P5 sends the data of A13 back to P2, sends a message to the master to free partitions (A13, B13, B21, B29), sends the data of partition B37 to P1, and sends a request to P7 to send the data of B39 to P1.
( 7 ) P1 finds a detailed path between the destination and the received temporary detailed path (Fig. 12(d)), sends the data of B37 back to P5 and of B39 back to P7, and sends a message to the master to free partitions (B37, A37, A38, A39, B39).
As the master confirms that all the partitions on the global path of a net are free before requesting the detailed PEs to route the net, deadlock cannot occur. This also guarantees that only one net is routed in any partition at a time. A partition that contains a temporary path must not be used to route other nets until the temporary path is settled. Although using the partition for other nets would be safe, the part of the temporary path that is finally pruned might become an unnecessary obstacle for other nets if we freed the partition. Thus we lock the partition until the path is settled, slightly wasting potential parallelism that could be exploited by freeing the partition earlier.

Table 2 Different programs to determine the effectiveness of the algorithms
         Detailed routing   Capacity   Number of phases
prog1    Maze               Without    One
prog2    max-cap            Without    One
prog3    max-cap            Without    Two
prog4    max-cap            With       One
prog5    max-cap            With       Two
Maze: using the maze running algorithm for detailed routing.
max-cap: using the max-cap algorithm for detailed routing.
Without (feedback): the detailed router does not send the exact capacity to the global router.
With (feedback): the detailed router sends the exact capacity to the global router.
One phase: using only the second phase for both short and long nets.
Two phases: using the first phase to route short nets, then the second phase for long nets.

Table 3 MCM benchmark and test data
           No. of chips   No. of nets   No. of pins   Grid size      Layers   Lower bound of wire length
mcc1       6              802           2495          599 × 599      4        343,767
mcc2-75    37             7118          14659         2032 × 2032    6        5,362,181
mcc2-45    37             7118          14659         3386 × 3386    4        8,935,372
data1      -              32000         64000         3024 × 3024    4        9,600,000
data2      -              variable      variable      1000 × 1000    4        2,000,000
4. Results
Five programs, shown in Table 2, have been implemented in C on the AP1000 7), a MIMD distributed-memory parallel computer with a torus network, to examine the effectiveness of the algorithms described in this paper. These programs have been evaluated using three industrial MCM (multi-chip module) benchmarks☆ and two sets of random data, shown in Table 3. The program for generating random nets outputs the X and Y coordinates of the net terminals. The Manhattan distances of the nets obey the Rayleigh probability distribution. Data2 consists of many sets of nets; each set has a different average net distance, while the total net length (average net distance × number of nets) is kept constant at 2,000,000 for all sets. Table 4 shows the configuration of global and detailed PEs, the execution time and the number of unconnected nets for each program. The division ratios between global and detailed PEs were tuned by hand; automatic and optimal division according to the problem is left as future work.
☆ These data are available via anonymous ftp from mcnc.org
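Since the total net length of data2 is fixed at 2,000,000, the number of nets in each set follows from its average net distance. The sketch below illustrates this and one possible way to draw Rayleigh-distributed distances. It is not the authors' generator: the function names, the integer truncation (chosen because it reproduces the net counts 20000, 10000, 6666 and 5000 shown on the axis of Fig. 15) and the inverse-transform sampler are all assumptions.

```python
import math
import random

TOTAL_NET_LENGTH = 2_000_000  # constant total length of every data2 set


def nets_in_set(avg_distance):
    """Number of nets so that avg_distance * nets stays close to the total."""
    return TOTAL_NET_LENGTH // avg_distance


def sample_distances(avg_distance, count, seed=0):
    """Draw `count` Rayleigh-distributed distances with the given mean."""
    # The mean of a Rayleigh distribution is sigma * sqrt(pi / 2).
    sigma = avg_distance / math.sqrt(math.pi / 2)
    rng = random.Random(seed)
    # Inverse-transform sampling of the CDF 1 - exp(-d^2 / (2 sigma^2)).
    return [sigma * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
            for _ in range(count)]


if __name__ == "__main__":
    for avg in (100, 200, 300, 400):
        print(avg, nets_in_set(avg))
    distances = sample_distances(100, nets_in_set(100))
    print(sum(distances) / len(distances))  # sample mean, close to 100
```

How each distance is split into X and Y offsets and where the terminals are placed on the grid is not described in the text, so the sketch stops at the distances.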
Table 4 Time and number of unconnected nets
           Number of PEs        prog1          prog2          prog3          prog4          prog5
           G      D      T      time   unc.    time   unc.    time   unc.    time   unc.    time   unc.
mcc1       14+1   48+1   64     9      20      10     3       10.5   3       13.3   0       13.8   0
mcc2-75    14+1   48+1   64     105    142     127    35      131    34      173    9       176    8
mcc2-45    10+1   52+1   64     179    70      246    7       251    7       316    0       322    0
data1      10+1   52+1   64     193    52      228    8       212    8       321    0       250    0
G: Global, D: Detailed, T: Total; time in seconds; unc.: number of unconnected nets.

4.1 Wireability
4.1.1 Maze versus max-cap
Using the max-cap algorithm achieves higher wireability than the conventional maze running algorithm. The numbers of unconnected nets for mcc1, mcc2-75, mcc2-45, and data1 are reduced to 3, 35, 7, and 8 respectively with max-cap, from 20, 142, 70, and 52 with maze. This improvement comes at a small expense of execution time, as shown in Table 4 (compare prog1 and prog2).
4.1.2 Cooperation between global and detailed routing
Further improvement in the wireability is achieved by using the actual capacity in the global routing. The numbers of unconnected nets when using the actual capacity (prog4, prog5) become 0, 9 (8 for prog5), 0 and 0 for mcc1, mcc2-75, mcc2-45 and data1 respectively. On the other hand, the computation time is increased by the time required for calculating the actual capacities. Moreover, dividing nets into groups decreases the parallelism among nets, as only a few PEs are working simultaneously at the end of each group, as shown in Fig. 13(c). An improvement in the parallelism could be obtained by considering the locations of the nets while dividing nets into groups.
4.2 Using one phase versus two phases
4.2.1 One phase
In prog2 and prog4, we use only the second phase to route both short and long nets. Short nets are routed first, as the nets are sorted in ascending order of their lengths. Figures 13(a) and (c) show that the number of active PEs working on routing short nets is less than that for long nets. That is, only a small number of PEs are active at the beginning of both prog2 and prog4. The frequency of messages for routing short nets is higher than that for routing long nets, as can be observed from Fig. 14(a) and (c).
Fig. 13 Number of PEs simultaneously working in routing data1. (Axes: PEs, 0 to 64, against time. Panels: (a) prog2, 228 sec; (b) prog3, first and second phase, 212 sec; (c) prog4, one group marked, 321 sec; (d) prog5, first and second phase, 250 sec.)
Fig. 14 Number of messages in routing data1. (Axes: messages, 0 to 2000, against time. Panels: (a) prog2, 228 sec; (b) prog3, first and second phase, 212 sec; (c) prog4, one group marked, 321 sec; (d) prog5, first and second phase, 250 sec.)
4.2.2 Two phases
Using two phases (prog3 and prog5) increases the number of active PEs working on short nets (Fig. 13(b) and (d)), and decreases the message frequency drastically, as only a few messages are transferred in the first phase to route short nets (Fig. 14(b) and (d)). Dividing nets into groups does not affect the parallelism in the first phase. Using two phases decreases the execution time without affecting the wireability. The execution time required to route data1 using only one phase (prog4) is 321 seconds, which decreases to 250 seconds when using two phases (prog5). As there are only a few short nets in the MCM benchmarks, using the first phase to route short nets does not improve the speed for them (Table 4). Another experiment has been done using data2 to measure the effect of two phases on different net lengths. Figure 15 shows the execution time against the average net length for both prog4 and prog5 using 64 PEs. Using two phases gives better performance for shorter average net lengths.
Fig. 15 Effectiveness of using two phases for different net lengths. (Axes: time (sec) against average net length, 50 to 400, with the corresponding number of nets; curves: prog4 (one phase) and prog5 (two phases).)
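The effect of the first phase on each data set can be read off Table 4 with a few lines of arithmetic. The following sketch copies the prog4 and prog5 execution times from Table 4; the function and variable names are hypothetical.

```python
# (one-phase prog4, two-phase prog5) execution times in seconds, from Table 4
TIMES = {
    "mcc1": (13.3, 13.8),
    "mcc2-75": (173.0, 176.0),
    "mcc2-45": (316.0, 322.0),
    "data1": (321.0, 250.0),
}


def two_phase_gain(one_phase, two_phase):
    """Relative time saved by two phases; negative means two phases are slower."""
    return (one_phase - two_phase) / one_phase


gains = {name: two_phase_gain(*pair) for name, pair in TIMES.items()}

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: {gain * 100:+.1f}%")
```

For data1 the saving is about 22% (321 to 250 seconds), while for the three MCM benchmarks the two-phase times are slightly higher, in line with the remark that these benchmarks contain only a few short nets.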
4.3 Scalability
Figure 16 shows the execution time against the number of processors for routing mcc2-45 using prog5, on a logarithmic scale. The program has a high degree of parallelism, since increasing the number of PEs from 16 to 256 speeds up the program by a factor of 10. The scalability is also good, since the speedup has not yet saturated with up to 256 PEs.
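The scalability figure quoted above can be put next to the ideal case with a one-line calculation. This is a small sketch with a hypothetical function name; "relative efficiency" here means the measured speedup divided by the increase in the number of PEs, which is one common way to express such results and is not a metric used in the paper itself.

```python
def relative_efficiency(pes_before, pes_after, speedup):
    """Measured speedup as a fraction of the ideal (linear) speedup."""
    ideal = pes_after / pes_before
    return speedup / ideal


if __name__ == "__main__":
    # 16 -> 256 PEs gives a 10-fold speedup according to the text.
    print(relative_efficiency(16, 256, 10))
```

Going from 16 to 256 PEs is a 16-fold increase, so a 10-fold speedup corresponds to a relative efficiency of 0.625.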
4.4 Comparison with other systems
Table 5 shows the results obtained for the same data by some other sequential algorithms 11),12) run on a Sun SparcStation II. The wireability is reported as 100% for these sequential algorithms. Both 3D maze and Slice use one more layer than our router. Prog5 routes all the nets of mcc1 in 13.8 seconds using 64 PEs, which means that prog5 runs 257, 52, and 13 times faster than 3D maze, Slice, and V4R respectively. The speed of the sequential version of our router is comparable to that of the fastest sequential algorithm, while we use fewer vias and a shorter total wire length (better routing quality).
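The speed ratios quoted for mcc1 follow directly from the execution times in Table 5. The sketch below reproduces them; the names are hypothetical, and the times are the mcc1 values reported in Table 5 in seconds.

```python
PROG5_MCC1 = 13.8  # seconds on 64 PEs of the AP1000 (Table 5)

# Sequential execution times for mcc1 in seconds (Table 5).
SEQUENTIAL_MCC1 = {"3D maze": 3540, "Slice": 720, "V4R": 180}


def speed_ratios(sequential_times, parallel_time):
    """Ratio of each sequential time to the parallel time, rounded."""
    return {name: round(t / parallel_time)
            for name, t in sequential_times.items()}


if __name__ == "__main__":
    print(speed_ratios(SEQUENTIAL_MCC1, PROG5_MCC1))
```

The ratios come out as 257, 52 and 13. They compare wall-clock times on different machines, so they indicate overall routing speed rather than parallel speedup of a single program.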
Fig. 16 Execution time against number of PEs for mcc2-45. (Logarithmic axes: execution time against number of PEs, 10 to 300; curves: prog5 and ideal slope.)

Table 5 Performance comparison to some previous sequential algorithms
           3D Maze 11)         seq. SLICE 11)      V4R 12)             prog5 (64 PE)
           Layers   Time       Layers   Time       Layers   Time       Layers   Time
mcc1       5        3540       5        720        4        180        4        13.8
mcc2-75    -        -          7        29700      6        3960       6        176
mcc2-45    -        -          -        -          4        5820       4        322

           3D Maze 11)             seq. SLICE 11)          V4R 12)                 prog5 (64 PE)
           Wire length   Vias      Wire length   Vias      Wire length   Vias      Wire length   Vias
mcc1       397           8794      402           6386      394           6993      382           2825
mcc2-75    -             -         5903          47864     5559          36438     5498          22823
mcc2-45    -             -         -             -         9131          36473     9052          19554
Time in seconds; wire length in units of 10^3; -: not reported.

5. Conclusion
A new parallel wire routing algorithm has been introduced. This router consists of two phases to decrease the communication cost. The first phase rotates the assignment area of every PE to route short nets using only a few messages. The second phase divides the grid into slices to increase the parallelism among nets and to decrease the number of messages. A cooperative relation between global and detailed routing is introduced to improve the wireability. The nets are divided into groups. After finding the detailed paths for one group, the actual partition capacities are calculated by the detailed router to be used for the global routing of the succeeding groups. The max-cap algorithm is used for detailed routing. In this algorithm, the maximum capacities for the series of partitions are obtained, and each net is routed by a path that decreases this capacity by only one. The results show that our router has a high degree of parallelism and low communication cost. Using max-cap and actual capacity (cooperative global and detailed routing) improves the wireability at the expense of execution time, as dividing nets into groups decreases the parallelism among nets. Using two phases decreases the communication cost and improves the routing speed without affecting the wireability.
Acknowledgments We would like to express our thanks to Fujitsu Laboratories Ltd. for offering us the parallel computer AP1000 to implement our parallel algorithms. We also thank all our laboratory members for their great help.

References
1) Akers, S. B.: A Modification of Lee's Path Connection Algorithms, IEEE Trans. on Electronic Computers, Vol. EC-16, pp. 97-98 (1967).
2) Akers, S. B.: Routing, Design Automation of Digital Systems: Theory and Techniques (Breuer, M. A. (ed.)), Vol. 1, Prentice Hall, pp. 283-333 (1972).
3) Banerjee, P.: Parallel Algorithms for VLSI Computer Aided Design, PTR Prentice Hall (1994).
4) Date, H., Matsumoto, Y., Kimura, K., Taki, K., Kato, H. and Hoshi, M.: LSI-CAD Programs on Parallel Interface Machine, Proc. Intl. Conf. on Fifth Generation Computer System, pp. 237-247 (1992).
5) Date, H. and Taki, K.: A Parallel Lookahead Line Search Router with Automatic Ripup and Reroute, Proc. the European Conf. on Design Automation, pp. 117-121 (1993).
6) Goda, M., Toyama, N. and Watanabe, T.: A Parallel Detailed Router TRED, Proc. Transputer/Occam Japan (1994).
7) Ishihata, H. et al.: Third Generation Message Passing Computer AP1000, International Symposium on Supercomputing, pp. 46-55 (1991).
8) Keshk, H., Mori, S., Nakashima, H. and Tomita, S.: A New Technique to Improve Parallel Automated Single Layer Wire Routing, Proc. Performance Evaluation of Parallel Systems, pp. 134-141 (1993).
9) Keshk, H., Mori, S., Nakashima, H. and Tomita, S.: A Parallel Slice Maze Router, Proc. Intl. Symp. on Fifth Generation Computer Systems, Vol. W6, pp. 67-73 (1994).
10) Keshk, H., Mori, S., Nakashima, H. and Tomita, S.: Amon: A Parallel Slice Algorithm for Wire Routing, Proc. Intl. Conf. on Supercomputing, pp. 200-208 (1995).
11) Khoo, K. Y. and Cong, J.: A Fast Multilayer General Area Router for MCM Designs, IEEE Trans. on Circuits and Systems, Vol. 39, No. 11, pp. 841-851 (1992).
12) Khoo, K. Y. and Cong, J.: An Efficient Multilayer MCM Router Based on Four Via Routing, Proc. Design Automation Conf., pp. 590-595 (1993).
13) Olukotun, O. A. and Mudge, T. N.: A Preliminary Investigation into Parallel Routing on a Hypercube Computer, Proc. ACM/IEEE Design Automation Conf., pp. 814-820 (1987).
14) Shimamoto, Hane, Shirakawa, Tsukiyama, Shinoda, Yui and Nishiguchi: A Distributed Routing System for Multilayer SOG, Proc. the European Conf. on Design Automation, pp. 298-303 (1992).
15) Soukup, J.: Fast Maze Router, Proc. Design Automation Conf., pp. 100-102 (1978).
16) Suzuki, K., Matsunaga, Y., Tachibana, M. and Ohtsuki, T.: A Hardware Maze Router with Application to Interactive Rip-Up and Reroute, IEEE Trans. on CAD, Vol. CAD-5, pp. 466-476 (1986).
17) Suzuki, K., Ohtsuki, T. and Sato, M.: A Gridless Router: Software and Hardware Implementation, VLSI 87, pp. 153-163 (1987).
18) Takahashi, Y.: Parallel Maze Running and Line Search Algorithms for LSI CAD on Binary Tree Multiprocessors, Proc. World Conf. on Information and Communication, pp. 128-136 (1989).
19) Takahashi, Y.: Parallel Automated Wire Routing with a Number of Competing Processors, Proc. Intl. Conf. on Supercomputing, pp. 310-317 (1990).
20) Won, Y. and Sahni, S.: Maze Routing on a Hypercube Multiprocessor Computer, Proc. Intl. Conf. on Parallel Processing, pp. 630-637 (1987).
21) Yamauchi, T., Ishizuka, A., Nakata, T., Nishiguchi, N. and Koike, N.: PROTON: A Parallel Router on an MIMD Parallel Machine, Proc. Intl. Conf. on Computer Aided Design, pp. 340-343 (1991).
(Received ??? ??, ????) (Accepted ??? ??, ????)
Hesham Keshk was born in 1960. He received his M.E. degree from Helwan Univ. in 1989 and his Ph.D. degree from Kyoto Univ. in 1996. He was an assistant lecturer from 1984 and has been a lecturer since 1996 at Helwan Univ. His current research interest is CAD systems for VLSI and FPGA.
Shin-ichiro Mori was born in 1963. He received his M.E. and Ph.D. degrees from Kyushu Univ. in 1989 and 1995 respectively. He was a research associate from 1989 and has been an associate professor since 1995 at Kyoto Univ. He is pursuing research on parallel and distributed processing and computer architecture. He is a member of IPSJ, IEEE-CS and ACM.
Hiroshi Nakashima was born in 1956. He received his M.E. and Ph.D. degrees from Kyoto Univ. in 1981 and 1991 respectively. He worked at Mitsubishi Electric Corporation from 1981, where he was engaged in research on inference systems. Since 1992 he has been at Kyoto Univ. as an associate professor. His current research interests are the architecture of parallel processing systems and the implementation of programming languages. He received the Motooka award in 1988 and the Sakai award in 1993. His major literary work is "Computer Hardware." He is a member of IPSJ and the Chair of SIG-ARC of IPSJ.
Shinji Tomita was born in 1945. He received his Ph.D. degree from Kyoto Univ. in 1973. He was a research associate from 1973 to 1978 and an associate professor from 1978 to 1986 at Kyoto Univ. He was a professor at Kyushu Univ. from 1986 to 1991, and has been a professor at Kyoto Univ. since 1991. His current research interests are computer architecture and parallel computer systems. His recent literary works are "Parallel Computer Engineering," "Parallel Machines" and "Computer Architecture I & II." He is a member of IPSJ, IEICE, IEEE and ACM.