AMATEROUS: A PARALLEL WIRE ROUTER ... - Hiroshi NAKASHIMA

This paper is author's private version of the paper published as follows. Proc. Intl. Conf. Algorithms and Architectures for Parallel Processing, pp. 334{351, World Scienti c, December 2000. AMATEROUS: A PARALLEL WIRE ROUTER WITHOUT DETAILED/GLOBAL FEEDBACK 3

KOTARO TANAKA, KAZUHIKO OHNO AND HIROSHI NAKASHIMA

Dept. Information and Computer Sciences, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku Toyohashi, 441-8580 JAPAN email:

{kotaro,ohno,nakasima}@para.tutics.tut.ac.jp

This paper proposes a new parallel wire routing algorithm named Amaterous. As many successful parallel wire routers, Amaterous has global and detailed routers, but the feedback from the detailed to the global is removed so that both routers are parallelized eciently and independently. For this feedback removal, Amaterous has an additional routing phase named capacity path (c-path) router that draws all the possible paths in each rectangular region of each routing layer prior to the global routing. Since the global router picks c-paths to form the global path of a net, it is assured that the global path is always converted successfully to its detailed counterpart which the detailed router generates from the c-paths. Therefore the global router does not need to know the result of the detailed routing and thus the feedback from detailed to global is removed. We implemented Amaterous on a 16-node PC cluster and measured its performance with three MCM benchmarks. The result shows up to 12 fold speedup proving the eciency of our algorithm. It is also shown that Amaterous is 1.2 to 3.6 times as fast as a state-of-art parallel router Amon2. 1

Introduction

Parallel processing plays an important role in the performance improvement of CAD applications, which is strongly required due to the rapid growth of the amount and complexity of digital circuits. The parallelization, however, is not always easy because of the sequentiality hidden in problems and well-tuned algorithms for them. Therefore straightforward approaches, such as do-all loop parallelization often used in scienti c computation, are hardly applicable but sophisticated parallel algorithms are necessary for ecient parallel computation. For example, in the wire routing problem that is the target of our research, multiple nets may be routed simultaneously exploiting inter-net parallelism. However the resulting paths for those nets can interfere each other revealing the hidden dependence, the source of sequentiality, among the routing of the nets. Thus most of successful parallel routers have two cooperative routers; global router that roughly determines a region for a net in which the path for 3

Currently with Hitachi Microsoftware Systems, Inc.

1

the net should lie; and detailed router to draw the exact path of the net in the region. Since the regions, or global paths, for multiple nets may be overlapped, the global router is eciently parallelized. The detailed routing of multiple nets is also performed in parallel if their global paths have no intersection. This parallelization, however, still has a bottleneck due to the dependence between the global and detailed routing. When the global router draws the path for a net, it estimates the diculty of the detailed routing, or wireability, of the path. The preciseness of the estimation depends on the detailed paths for other nets routed previously, and it greatly aects the real wireability. Thus the global router has to receive some feedback from the detailed router, which brings the dependence among the nets, in order to keep high wireability. We observed this dependence is due to lack of con dence in global routing. That is, if it is assured that the detailed router always nds the path for a net lying in its global path, no feedback from detailed to global is necessary. In this paper, we propose a new parallel wire router named Amaterous based on this observation. Amaterous has an additional routing phase named capacity path (c-path) router to draw all the possible paths in each rectangular region of each routing layer prior to the global routing. Its global router picks c-paths to form the global path of a net ensuring that the detailed router successfully generates the exact path from the c-paths. Thus the feedback from the detailed to the global is removed so that both routers are parallelized eciently and independently. The rest of the paper describes Amaterous and its performance as follows. Section 2 overviews related works to make it clear why the feedback removal is essential. Section 3 gives the outline of Amaterous while its parallel implementation is discussed in Section 4 in detail. An experiment result with MCM benchmarks is shown in Section 5. Finally we conclude the paper summarizing future works in Section 6. 2 2.1

Related Works Parallelism in Wire Routing

Wire routing problem has two types of parallelism; intra-net parallelism and inter-net parallelism. Parallel algorithms exploiting the former try to route one net or one subnet (branch) in a net by multiple processors. For example, parallelized maze routing1;2;3 and line search4 are regarded as a kind of parallel search to nd optimal solution (path) in a large search space of possible solutions (paths). Another example is to divide a subnet into segments by bends and to assign a processor to a segment for parallel routing5 . 2

The latter is exploited in algorithms in which multiple nets or subnets are routed simultaneously. If a set of nets or subnets has no geometrical interference in its members, it is quite easy to route the (sub)nets in parallel6;7;8;5 . Even if the (sub)nets potentially interfere each other, they may be routed in parallel expecting no or few real interference9;10;4;5. Note that an algorithm may exploit both types of parallelism. For example, multiple nets are routed in parallel and each (sub)net is routed also in parallel9;7;4. Another example is to switch the type of parallelism according to routing phases8;5. 2.2

Global and Detailed Routing

is a well-known technique to bound a huge routing space so that routing of exact paths, or detailed routing, is performed in a reasonable time. The global routing for a net is to determine a region, or a global path, in which the exact path for the net should lie. Since the region is much narrower than the whole routing space, the detail routing to nd the exact path works quickly. Among various global routing techniques, such as weighted maze5, hierarchical decomposition11 and iterative improvement9 , it is common that the global router evaluates some cost function representing the diculty of detailed routing, or wireability. One of most commonly used cost measure is boundary capacity12 that represents how many paths can cross an edge of a coarse rectangular region. The partition capacity5 is a similar concept but it indicates how many paths pass through a rectangular region. The combination of global and detailed routing is also bene cial to parallel routing exploiting inter-net parallelism. For the global routing, it is good news that multiple global paths may have intersections because a boundary or a partition allows multiple detailed paths to pass through it. Thus multiple (sub)nets may be globally routed simultaneously without much care of the interference among them. For the detailed routing, it is easy to judge that a set of (sub)nets has no interference among its members if their global paths have no intersections. Therefore, many successful parallel wire routers have global and detailed routers3;7;5. Global routing

2.3

Feedback from Detailed to Global

The capacity referred in the global routing is an estimation of the real wireability of the detailed routing. If the estimation is too optimistic, the detailed router will frequently fail to nd the exact path for a global path. On the other 3

hand, too pessimistic estimation will brings frequent failure of the global routing. Thus the preciseness of the estimation greatly aects the total wireability of the router. The preciseness greatly depends on the results of the detailed routing as reported by Keshk5 . For example, it was shown that 2.5 % of nets of MCC175 in MCM benchmark set13 cannot be routed if the global router ignores the results of the detailed. Thus he claimed the necessity of the feedback from the detailed to the global and showed it makes MCC1-75 completely routed. However, the feedback brings a bottleneck in parallel processing. If the global routing for a net has to know all the results of detailed routing of previously routed nets, we cannot exploit inter-net parallelism in the global routing. This sequentiality may be relaxed by grouping certain number of nets in order to reduce the frequency of the feedback as proposed by Keshk. This relaxation, however, not only degrades the estimation preciseness but also is insucient to remove the bottleneck. In fact, his evaluation showed his Amon2 incurs about 50 % overhead caused by the feedback. As stated in the introductory section, we observed this feedback may be removed if it is assured that the detailed router always nds the path for a net lying in its global path. This feedback removal also makes it possible to separate global and detailed routing into completely distinct phases to minimize the frequency of communication between them. 3 3.1

Overview of Amaterous Terminology

Here we de ne a few terms to explain our new parallel wire router Amaterous (Fig. 1). The routing space is a 3-dimensional non-negative integer nite coordinate space, or grid space, [0; X ] 2 [0; Y ] 2 [0; Z ] in which terminals are located. A net is a set of terminals to be connected. A subnet of a net is a pair of its terminals, or a pair of a terminal and a subnet recursively. Thus the routing a net of terminals ft1 ; : : : ; tn g is to nd paths for subnets ht1 ; t2 i, hft1 ; t2 g; t3i, . . . , hft1 ; : : : ; tn01 g; tni. The (detailed) path for a subnet hS; ti is an ordered set of adjacent grid points f(x1 ; y1; z1 ); : : : ; (xm; ym ; zm )g where (x1 ; y1 ; z1) is included in the path for the subnet S or the location of S if it is a terminal, (xm ; ym; zm) is the location of t, and one of the following holds for all 1 i < m. ( xi+1 = xi 6 1 ^ yi+1 = yi ^ zi+1 = zi xi+1 = xi ^ yi+1 = yi 6 1 ^ zi+1 = zi xi+1 = xi ^ yi+1 = yi ^ zi+1 = zi 6 1 4

terminal

global path (segment)

detailed path

partition

slice

x -layer

y -layer

Figure 1. Routing Space

The union of the paths for a net N , denoted as d (N ), has no intersection with that of any other net d (N 0), i.e. 8N 8N 0(N 6= N 0 ! d (N ) \ d(N 0 ) = ;). A xy-space of the routing space, or a layer, is subdivided into rectangular partitions by lines fx = X1 ; : : : ; x = Xk g and fy = Y1 ; : : : ; y = Yl g. A layer with even (odd) z-coordinate is referred to as x-layer (y-layer) because paths in the layer run in the direction of x-axis (y-axis) in a global view. A partition segment on an x-layer (y -layer) is a series of partitions in x (y ) direction. The partition segment whose left and right (top and bottom) edges are on the routing space border is called slice. The partition where a terminal lies is referred to as the home partition of the terminal. The global path for a subnet hS; ti is de ned as an ordered set of adjacent partitions in a similar manner to the detailed path using coarse grid space f0; X1; : : : ; Xk g 2 f0; Y1; : : : ; Yl g 2 [0; Z ]. However, adjacent partitions (Xi; Yi ; zi ) and (Xi+1 ; Yi+1 ; zi+1 ) has to satisfy Yi+1 = Yi (Xi+1 = Xi ) if they both lie in a x-layer (y-layer). That is, a layer has its own direction in which global paths have to run. Thus a global path consists of global segments each of which is a partition segment. The union of global paths for a net N , denoted as g (N ), may have intersection with that of other net g (N 0 ). Its detailed counterpart d (N ) has to included in it. The capacity of a partition on an x-layer (y-layer), or partition capacity, is the maximum number of paths simultaneously connecting the left and right (top and bottom) edges. 3.2

Capacity Path

For a given set of nets, Amaterous routes it by three distinct phases of routing; c-path routing, global routing and detailed routing. 5

pc=6

pc=3

pc=4

pc=5

pc: partiction capacity

Figure 2. Capacity Path (c-path)

The rst phase, c-path routing, is unique for Amaterous. For each slice in an x-layer, it nds out all the capacity paths (c-paths) that ideally satisfy the following conditions (Fig. 2). 1. A c-path has no intersection with initial obstacles including terminals nor any other c-paths. 2. The left end of a c-path is on the left edge of a partition and its right end is on the right edge of the same or another partition. 3. For each partition, the number of c-paths passing through it is equal to its partition capacity. That is, the number of c-paths passing through it is maximized. 4. For each partition segment, the number of c-paths passing through it is maximized on condition that those for partitions and partition segments included in it are maximized recursively. The conditions for c-paths on y-layers are similarly given. 3.3

Global and Detailed Routing

After c-paths are routed as shown in Fig. 3(a), each terminal is connected to its initial c-path that passes through two or more partitions including its home partition as shown in Fig. 3(b). Thus a terminal has a partition adjacent to its home in which its initial c-path passes. This partition is called initial partition of the terminal. The global routing of a two-terminal (sub)net is the process to nd the cost optimal global path connecting initial partitions of both terminals. The cost of a global segment is the sum of the cost of each partitions in it if there is a c-path passing through it. The cost of a partition increases as its capacity decreases to avoid too much congestion. Moreover, the cost for a global segment through which no c-paths pass is in nite. Thus it is assured 6

terminal home partition

(a) c-path routing

initial c-path initital partition

(b) connecting terminals

(c) global routing

(d) detailed routing

Figure 3. Routing Phases

that a global segment is always converted to its detailed counterpart using a c-path passing through it as the bottom-line. After a global path is settled, the capacity of each partition on the path is decreased and a c-path for each global segment is exclusively assigned to the path as shown in Fig. 3(c). The routing of a terminal and a subnet is a similar process connecting the initial partition of the terminal and the global path for the subnet. When the global routing completes, each global segment has its own cpath. Thus at least we can form a detailed path by connecting c-paths for global segments by vias. The c-paths, however, have too many bends and are too lengthy as shown in Fig. 2. Thus the detailed router rst tries to straighten each c-path keeping its connection of the both ends of a global segment. Finally, straightened paths are connected by vias and fragments are discarded as shown in Fig. 3(d). 4 4.1

Parallel Algorithm of Amaterous Overview of Parallelization

Three routing phases of Amaterous, c-path, global and detailed routing, are parallelized following master-worker paradigm. For the c-path and detailed routing, slices in x-layers are cyclicly assigned to a half of worker processors while the other half are responsible for y-layers as shown in Fig. 4(a). One of processors, say P0 , also acts as the master. In global routing, processors other than P0 are workers to nd global paths for subnets assigned to them as shown in Fig. 4(b). The processor P0 has the master job only in this phase to distribute subnets to workers and to check the consistency of the results from them. 7

Master P0

Master P0

P0

P0

P3

P2

P2

P1

P3

subnets

P1

P2

P3

P1 (a) c-path and detailed routing

(b) global routing

Figure 4. Parallelization of Routing Phases 4.2

C-path Routing

At rst, the master distributes the locations of initial obstacles and terminals to workers. Then each worker for x-layers routes c-paths in the slices assigned to it independently by the following algorithm based on that for the partition capacity measurement proposed by Keshk14. for p = leftmost partition to rightmost partition begin e = left edge of p; for g = bottom of e to top of e begin Draw a path from g touching right hand to the bottom edge of the slice, obstacles, terminals, or previously drawn paths, until reaching to one of the followings. 1. the right edge of the slice. 2. the top edge of the slice. 3. the left edge of the partition currently visiting. In cases 2 and 3, erase the fraction of the path from the left edge of the partition currently visiting. end end

The c-paths in slices on y-layers are similarly routed. The c-paths shown in Fig. 2 are routed by this algorithm and satisfy the conditions discussed in Section 3.2. In general, this algorithm is suboptimal becaues the fourth condition is not always satis ed. Although the resultant c-paths are practically acceptable and an optimal algorithm should require a impractically large computation time, we will further investigate the improvement of the algorithm to increase the number of c-paths passing through partition segments. 8

A B C

A B C

A B C

A B C

D E

D E

D E

D E

(a)

(b)

(c)

(d)

Figure 5. Connecting Terminals to C-paths

After the c-paths are settled, terminals are connected to them to determine their initial c-paths. Terminals in a partition are distributed to each layer as follows. If the height of the minimum rectangle including all the terminals is larger than its width, the terminals are equally distributed to x-layers. Otherwise, they are assigned to y -layers. Fig. 5 illustrates the method to connect terminals to c-paths passing through a partition in an x-layer. (a) From three terminals A, B and C that can be connected to the topmost c-path, A is chosen for the connection because it is leftmost and the cpath passes the left edge only. Thus the initial partition of A is the left neighbor. After the connection, the fraction of the c-path is erased. (b) From remaining terminals B, C, D and E that can be connected to the second c-path, C is chosen for the connection because it is rightmost and above the c-path passing the right edge only. In order to reserve enough space for further connection, the path from C to the c-path is drawn touching left hand to obstacles and the barrier virtually constructed at its left. (c) The terminal B is connected to the third c-path similarly. (d) Both the terminals D and E are connected to the bottommost c-path that passes both left and right edges. This process may fail to connect all the terminals to c-paths. If so, unconnected terminals are exchanged between workers for x- and y-layers. This is the only communication between workers in the c-path routing. After all the terminals are connected to c-paths, abstracted information of c-paths and terminal/c-path connections are gathered by the master to construct c-path map. Note that exact geometrical data of c-paths are not included in the c-path map but kept in workers for the detailed routing. 9

m1 ( j ) worker w1

m1 ( j -1) m1 ( j -2) g 1 (1)

g 1 (2)

g 1 (3)

g 1 (4)

g 1 (5)

M 1( j )

master

M 2( j ) g 2 (1)

g 2 (2)

g 2 (3)

g 2 (4)

g 2 (5) m2 ( j )

worker w2

m2 ( j -1) m2 ( j -2)

Figure 6. Master/Worker Communication in Global Routing 4.3

Global Routing

At the rst step of the global routing, the master decomposes each net N of n terminals into subnet S1 , . . . , Sn01 as follows. d(t; t0 ) Manhattan distance between terminals t and t0 d(t; S ) min d(t; t0 ) t0 2S S1 = ht1 ; t2 i s.t. d(t1 ; t2 ) = min d(t; t0 ) t;t0 2N Si = hSi01 ; ti+1 i s.t. ti+1 2 = Si01 ^ d(ti+1; Si01 ) = min d(t; Si01 ) t2N 0S 01 i

After that, all the subnets of all the nets are sorted in ascending order of d de ned above. Then, for k workers, the master picks 2 2 k 2 l shortest two-terminal subnets to assign two groups of l subnets to each worker. The master also distributes the c-path map M to all the workers. A worker wi at rst tries to nd paths for the rst (shorter) group of subnets gi (1) updating its own c-path map mi(0) = M to mi (1) keeping their dierence 1mi (1) conceptually denoted as mi(1) 0 mi (0). The routing result of gi (1) is sent to the master that checks the consistency of the result against those reported from other workers and updates its master map to make Mi (1). Then the master picks the third group gi (3) for wi that consists of shortest l remaining subnets of the pair of terminals or that of a terminal and a subnet whose global path has already been settled. The group gi (3) is sent to wi 10

together with the dierence of the master map 1Mi (1) = Mi(1) 0 Mi (0) where Mi (0) = M for all i. During the master is in the work above, the worker wi routes the subnets in the second group gi(2) updating its map to make mi (2) = mi (1) + 1mi (2) and send the result again. Then it receives gi (3) and 1Mi (1) from which it reconstructs nearly up-to-date map by mi (2) mi (0) + 1Mi(1) + 1mi (2) for the routing of gi (3). It also reconstructs one period older but globally correct map by mi (1) mi (0) + 1Mi (1) = Mi (1) discarding mi (0). In general, when wi works on gi (j ), it maintains three maps; globally correct mi (j 0 2) = Mi (j 0 2); nearly correct mi (j 0 1); and working version mi (j ). By this multiple version maintenance, the latency of the communication between the master and workers will be hidden as shown in Fig. 6. The eectiveness of the latency hiding depends on the number of subnets in a group, l. In general, a large l eectively hides the latency, but degrades the correctness of c-path maps referred by workers and thus consistency of routing results. Based on experiments, l is set to 10 in our implementation. For the global routing of a subnet Si that is the i-th longest in all the K subnets, we de ned the cost of a global segment of the series of n partitions p1 , . . . , pn as follows based on experiments. Ck partition capacity of pk 01:6++ cost (pk ) = Ck 8 < 0:0 if i K=3 = 0:1 if K=3 < i 2K=3 : 0:2 if 2K=3 < i ( 0:0 if not ripped = 0:1 if ripped once :2 if ripped twice or more 8 0X < n cost (p ) if there is a c-path passing through k cost () = : k=1 1 otherwise The parameter is to make short nets severely sensitive to the capacity so that enough c-paths are preserved for long nets. The parameter is to relax the capacity sensitivity of subnets victimized for the rip-and-reroute discussed later. After a global path is settled, the worker temporarily chooses c-paths for global segments of the path and marks them as consumed. The nal choice of the c-path for each global segment is done by the master. If there are two or more c-paths passing thorough the global segment, the master chooses the 11

best- t one to minimize the length (number of partitions) of the fragments caused by the consumption. The fragments, if any, are not discarded but will be used for global segments of another paths. Since the cost function is evaluated by looking up the c-path map maintained by each worker independently, a global path routed by a worker may be inconsistent with paths routed by other workers. That is, the path may have a global segment through which no c-paths pass because of the consumption by other workers. When the master nds such a path, it rejects the path and includes the correspondent subnet in the next group. When a worker fails to nd the global path for a subnet, the master tries to reroute it by ripping up paths that have already been settled. To make the algorithm terminating, the rerouted path cannot be ripped. The subnets for ripped paths are rerouted by workers relaxing the capacity sensitivity by the parameter shown above. 4.4

Detailed Routing

When the global routing completes, the master decomposes its c-path map into submaps for slices and distributes each of them to the worker of detailed routing responsible for the slice. Each worker at rst combines the submaps and geometrical data of c-paths in its slices generated in c-path routing to have the complete information of c-paths used for global paths. Then each worker straightens c-paths in use in a slice of an x-layer by a modi ed version of max-cap algorithm proposed by KeshK14, as follows (Fig. 7). (a) Erase all the c-paths and fragments of c-paths that are not used for global paths. (b) Give a total order to c-paths in use by the following relation 0 of the c-paths and 0; if and 0 share a partition, 0 i0 is above 0 in 0 the partition; otherwise, i lies to the left of . Thus we have an ordered set of c-paths f 1; : : : ; n g such that i j for all i < j. (c) Redraw c-paths from n to 1 by the algorithm shown in Section 4.2. (d) Repeat the following for 1 to n. If both ends of i are on partition edges, draw a path from the left edge to the right by the maze routing algorithm assuming other c-paths and straightened paths are obstacles, giving higher priority to upper candidate paths. Otherwise, that is an end of 1 is a terminal, draw a path from the terminal to the edge which the other end is on. 12

(2) (3) (4) (1)

(5)

(6)

(a)

(b)

(c)

(d)

(e)

Figure 7. Detailed Routing

C-paths in y-layers are similarly straightened. Then each worker of x-layers gives geometrical data of straightened paths in each partition to the worker of y-layers responsible for the slice intersecting at the partition. The worker of y-layer determines the location of each via at the crossing point of straightened x and y paths for adjacent global segments. Since the end of x and y paths to be connected by a via pass through partitions adjacent in z direction, the crossing point of the paths is always found. Finally, the worker erases fragments beyond vias to obtain the nal result (Fig. 7(e)). 5 5.1

Experiments Implementation

We implemented Amaterous on a PC cluster of 16 Pentium II (450 MHz) based PCs with 128 MB memory for each connected by 100Base-TX Ethernet. The parallelization package of PVM version 3.4. 715 is used. For the compilation, we used gcc version 2.7.2.3. 5.2

Workloads

The workloads for the performance evaluation of Amaterous are three MCM benchmarks13 shown in Table 1. The number of layers used for each benchmark shown in the table is same as that for Amon25 and also for the stateof-art sequential router V4R16. The width of slice is also same as that for Amon2 for fair comparison of Amaterous and Amon2. 13

Table 1. MCM Benchmarks benchmark c n t s l w MCC1-75 6 802 2495 599 2 599 4 30 MCC2-75 37 7118 14695 2032 2 2032 6 80 MCC2-45 37 7118 14695 3386 2 3386 4 120 c: # of chips; n: # of nets; t: # of terminals; s: size of routing space; l: # of layers; w : width of slice 5.3

Comparison of Amaterous and Amon2

Table 2 shows the performance data of the sequential version of Amaterous (S) and parallel one on the 16-node PC cluster (P) with respect to execution speed and routing quality, and also that of Amon2. Since Amon2 was implemented on the parallel computer AP1000 developed about a decade ago, it is hard to compare its 64-processor performance shown in rows (a) in the table5 with that of Amaterous. Thus we ported Amon2 to our PC cluster to obtain the performance data shown in rows (b). This ported version, however, cannot route all the nets as shown in the table, probably because Amon2 is very sensitive to the communication performance relative to computation in which AP1000 outperforms our PC cluster by the order of magnitude. Since the execution time in rows (b) is obviously underestimated due to the unrouted nets, we estimated its real performance assuming that the execution time is proportional to the total wire length as shown in rows (c). This normalization should not cause overestimation because rip-and-reroute, which Amon2 does not have, is hardly parallelized. The table shows that Amaterous is 3.4 and 1.6 times as fast as the raw performance of Amon2 on PC cluster for MCC1-75 and MCC2-45 respectively. As for the normalized performance, the speedup is 3.6 and 2.0. For MCC2-75, which both original and ported version of Amon2 cannot route completely, the execution time of Amaterous is almost equal to the normalized one of Amon2. However, if we may suppose that another one million or so grid points are added to the total wire length of Amon2 to route completely, Amaterous is 1.2 times as fast as Amon2. Thus we may conclude that Amaterous is faster, much faster in many cases, than Amon2. On the other hand, our result of total wire length is 10 to 35 % worse than Amon2. The major reason of the degradation is that Amaterous requires every terminal to have its initial partition adjacent to its home. That is, even if both terminals of two-terminal (sub)net are in a partition, Amaterous draws paths from the terminals to the adjacent partitions and then connect them even in the best case. In the worst case in which their initial partitions are at 14

Table 2. Amaterous v.s. Amon2 benchmark program t u l MCC1-75 Amaterous (S) 12.61 0 477 (P) 2.41 0 500 Amon2 (a) (13) 0 382 (b) 8.25 1 366 (c) 8.61 | | MCC2-75 Amaterous (S) 459.99 0 7232 (P) 61.65 0 7439 Amon2 (a) (173) 9 5498 (b) 50.31 17 4486 (c) 61.66 | | MCC2-45 Amaterous (S) 638.61 0 10105 (P) 52.19 0 10385 Amon2 (a) (316) 0 9052 (b) 83.67 10 7310 (c) 103.61 | | t: execution time (sec); u: # of unrouted nets; l: total wire length (2103 grid point); v : # of vias;

v

4990 4824 2825 2519 | 39141 40089 22823 14885 | 31688 33482 19554 13275 |

both sides of their home, the path should be much longer than the natural expectation. Although it is hard to devise a special method to route such short nets and, especially, short subnets keeping consistency with the concept of c-path, we will investigate this issue further. As for the number of vias, the dierence from Amon2 is wider. Although our result is comparable to or better than V4R16, the reduction of the number is also an important subject of our future study. 5.4

Parallel Speedup

Fig. 8 shows the parallel speedup relative to the sequential version of Amaterous. It shows that 5.0, 7.5 and 12.2 fold speedups are obtained with 16 PC nodes for MCC1-75, MCC2-75 and MCC2-45 respectively. The result of MCC2-45 shows that Amaterous is suciently scalable with a problem of large routing space. The saturation observed in the case of MCC1-75 is mainly due to its small problem scale both in the number of nets and the size of routing space. In fact, the speedup of the global routing, which consumes about 60 % of the total execution time, is only 3.5. Since this benchmark is relatively easy to solve, we could enlarge the number of subnets in a group given to global routing worker, xed to 10 so far, to hide master/worker communication latency more eectively. The speedup of MCC2-75 is not very excellent too. It is also remarkable 15

speedup

16

MCC2-45

12

8

MCC2-75

MCC1-75

4

4

8

12

16

# of processors

Figure 8. Parallel Speedup

that MCC2-75 is harder than MCC2-45 for the 16-node system contrary to the result of the sequential case as shown in Table 2. This is mainly due to that this problem is hard to route completely and thus the master has to perform rip-and-reroute repeatedly. Thus, when the master is in the work, workers become idle soon to degrade the parallel performance. An idea to make workers busy is to delegate the rip-and-reroute from the master to a worker so that other workers route other subnets speculatively assuming global paths for them are consistent to the rerouted path. The implementation and evaluation of this idea is left for our future work. 6

Conclusions

In this paper, we proposed a new parallel wire router named Amaterous. Its unique feature is the c-path router that draws candidate paths in each slice of the routing layer prior to the global and detailed routing. Since the global router picks c-paths to form the global path of a net, it is assured that the global path is always converted successfully to its detailed counterpart which the detailed router generates by straightening c-paths. By the introduction of the c-path routing, the feedback from the detailed to global router is removed to make them eciently parallelized. In fact, the 16

experimental result shows up to 12 fold speedup for MCM benchmarks using a 16-node PC cluster. It is also shown that Amaterous is 1.2 to 3.6 times as fast as the parallel router Amon2. Our future work is to improve the algorithm of Amaterous in the following points; increase the number of c-path passing through partition segments; reduce the wire length and vias for short subnets; delegate the rip-and-reroute from the global routing master to a worker; and tune the routing parameters such as the number of subnets in a group given to global routing worker. Acknowledgments

We would like to express our thanks to Hesham Keshk and Tomita Laboratory of Kyoto University for their courtesy to give us the source code of Amon2. References

1. T. Watanabe, H. Kitazawa, and Y. Sugiyama. A parallel adaptable routing algorithm and its implementation. IEEE Trans. ComputerAided Design, CAD-6(2):241{250, March 1987. 2. Y. Won and S. Sahni. Maze routing on a hypercube multiprocessors. In Proc. Intl. Conf. Parallel Processing, pages 630{637, August 1987. 3. O. A. Olukotun and T. N. Mudge. A preliminary investigation into parallel routing on a hypercube computer. In Proc. 24th Design Automation Conf., pages 814{820, June 1987. 4. H. Date and K. Taki. A parallel lookahead line search router with automatic ripup and reroute. In Proc. European Conf. on Design Automation, pages 117{121, 1993. 5. H. Keshk, S. Mori, H. Nakashima, and S. Tomita. A two phases, cooperative detailed/global parallel wire routing algorithm. Trans. Information Processing Society of Japan, 37(12):2376{2389, December 1996. 6. Y. Takahashi. Paralellel maze running and line search algorithms for LSI CAD on binary tree multiprocessors. In World Conf. Information and Communication, pages 128{136, 1989. 7. T. Yamauchi, T. Nakata, N. Koike, A. Ishizuka, and N. Nishiguchi. PROTON: A parallel detailed router on an MIMD parallel machine. In Proc. IEEE Intl. Conf. Computer-Aided Design, pages 340{343, November 1990. 8. H. Keshk, S. Mori, H. Nakashima, and S. Tomita. A new technique to improve parallel automated single layer wire routing. In Proc. Performance Evaluation of Parallel Systems, pages 134{141, November 1993. 17

9. J. Rose. LocusRoute: A parallel global router for standard cells. In Proc. 25th Design Automation Conf., pages 189{195, June 1988. 10. Y. Takahashi. Parallel automated wire routing with a number of competing processors. In Proc. Intl. Conf. Supercomputing, pages 310{317, 1990. 11. R. J. Brouwer and P. Banerjee. PHIGURE: A parallel hierarchical global router. In Proc. 27th Design Automation Conf., pages 360{364, 1990. 12. S. B. Akers. Routing. In M. A. Breuer, editor, Design Automation of Digital Systems: Theory and Techniques, volume 1, pages 283{333. Prentice Hall, 1972. 13. Collaborative Benchmarking Laboratory, Dept. of Computer Science, North Carolina State Univ. PDWorkshop'93 benchmark set. http:// www.cbl.ncsu.edu/CBL Docs/pdw93.html, March 1995. 14. H. Keshk, S. Mori, H. Nakashima, and S. Tomita. Amon: A parallel slice algorithm for wire routing. In Proc. Intl. Conf. Supercomputing, pages 200{209, July 1995. 15. Computer Science and Mathematics Division, Oak Ridge National Laboratory. PVM: Parallel virtual machine. http://www.csm.ornl.gov/ pvm/, January 2000. 16. K. Y. Khoo and J Cong. An ecient multilayer MCM router based on four via routing. In Proc. 30th Design Automation Conf., pages 590{595, 1993.

18