Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

A Parallel Local-Search Algorithm for the k-Partitioning Problem*

Ralf Diekmann, Reinhard Lüling, Burkhard Monien, and Carsten Spraner

Department of Mathematics and Computer Science, University of Paderborn, 33095 Paderborn, Germany
e-mail: {diek, rl, bm, casi}@uni-paderborn.de

Abstract

The k-partitioning problem can be regarded as the problem of embedding a graph into the K_k (the clique network of k nodes) in such a way that every node of the K_k receives the same number of nodes of the guest graph and the total number of edges mapped onto the edges of the host is minimized. This relaxation of the embedding problem reflects the current trend in parallel architecture design: due to the growing performance of the interconnection networks used for the realization of massively parallel computing systems, and especially with the establishment of independent routing networks, the edge dilation plays only a minor role [20].

In this paper we present a new algorithm for the k-partitioning problem which achieves an improved solution quality compared to known heuristics. We apply the principle of so-called "helpful sets", which has been shown to be very efficient for graph bisection, to the direct k-partitioning problem. The principle is extended in several ways. We introduce a new abstraction technique which shrinks the graph at runtime in a dynamic way, leading to shorter computation times and improved solution quality. The use of stochastic methods provides further improvements in terms of solution quality. Additionally, we present a parallel implementation of the new heuristic. The parallel algorithm delivers the same solution quality as the sequential one while providing reasonable parallel efficiency on moderately sized MIMD systems. All results are verified by experiments for various graphs and processor numbers.

1 Introduction

In this paper we study graph partitioning as one of the fundamental problems in the design and use of parallel computing systems. Partitioning and mapping of data and program structures onto a parallel system is only one of the aspects that show its relevance. The graph partitioning problem is a special case of the so-called graph embedding problem, whose task is to map a given guest graph onto a fixed host graph, minimizing edge dilation and congestion while balancing the nodes of the guest graph evenly over the host graph. Graph embedding is known to be NP-complete. A large number of theoretical results for special graphs have been achieved [21]. For the general problem, heuristic methods have been presented; methods of this kind are often very time consuming [5].

At present, k-partitioning problems are usually solved by recursive bisection (k = 2). A number of results have been shown for the bisection width of certain classes of graphs [20]. Popular approximation algorithms are the Kernighan-Lin heuristic (KL) [15] and its improvements by Fiduccia and Mattheyses [8], Spectral Bisection [10, 22], the Inertial method [24], Simulated Annealing (SA) [4, 14, 16] and the geometric approach by Miller et al. [3, 19], which generalizes a number of earlier results on the size of graph separators starting with Lipton and Tarjan [17]. Simon and Teng showed that direct partitioning can lead to asymptotically better solution qualities compared to recursive bisection on some artificial classes of graphs [25]. Although recursive bisection works very well in most practical applications, there are cases where direct partitioning can produce better results. We will focus on direct partitioning methods within this paper.

Parallel algorithms for the graph partitioning problem are described in [9, 18, 23]. All three methods are based on a local search principle. The direct k-partitioning method CPE presented by Hammond [9] uses a "hill-climbing method" which in each step exchanges nodes in such a way that the total number of crossing edges is decreased by a maximal amount. To choose the nodes that are exchanged, Hammond proposed a fixed schedule for pairing partitions. The strategy has proven to be very fast, but as it uses a very simple hill-climbing method, its solution quality is not comparable to that of algorithms which use more sophisticated neighborhood structures.
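To make the gain notion concrete, the following is a minimal sketch, not taken from the paper, of the swap gain that such a pairwise hill-climbing step maximizes. The adjacency-list representation and all names are our assumptions:

```python
# Hypothetical sketch: gain of KL/CPE-style node exchanges.
# graph: {node: [neighbors]}, pi: {node: cluster id}.

def move_gain(graph, pi, v, j):
    """Decrease in crossing edges if v leaves its cluster for cluster j."""
    to_j = sum(1 for w in graph[v] if pi[w] == j)         # edges becoming internal
    stay = sum(1 for w in graph[v] if pi[w] == pi[v])     # edges becoming external
    return to_j - stay

def swap_gain(graph, pi, v, w):
    """Gain of exchanging v and w between their two clusters."""
    g = move_gain(graph, pi, v, pi[w]) + move_gain(graph, pi, w, pi[v])
    return g - 2 * (w in graph[v])  # a direct edge {v,w} stays external after the swap
```

A CPE-like step would then evaluate swap_gain over candidate pairs of the two currently paired clusters and realize an exchange of maximal gain.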

*This work was partly supported by DFG-Forschergruppe "Effiziente Nutzung massiv paralleler Systeme", by Esprit Basic Research Action Nr. 7141 (ALCOM II) and the EC Human Capital and Mobility Project "Efficient Use of Parallel Computers: Architecture, Mapping and Communication".


The Mob heuristic presented by Savage and Wloka [23] for bipartitioning graphs uses a more powerful exchange operation. In each step it exchanges subsets of nodes, all having a gain larger than a lower bound. The size of the subsets is chosen according to a schedule which starts with very large sets (about 10 percent of the graph size) and decreases the size linearly to zero; the whole process is repeated several times. The algorithm has been implemented on a SIMD parallel computing system. Martin and Otto [18] combine the KL heuristic with simulated annealing and obtain a very efficient heuristic for graph bisection. They also describe a simple approach for a parallel version of their algorithm.

In this paper we introduce a parallel algorithm for the k-partitioning problem. We use a direct partitioning algorithm in contrast to recursive bisection. The method is based on the concept of helpful sets, which we introduced in [13] to construct upper bounds on the bisection width of regular graphs and used in [6] to design a hill-climbing algorithm for bipartitioning graphs. The method is applied to direct k-partitioning and extended in several ways:

- A new dynamic abstraction technique shrinks the graph during runtime, leading to shorter computation times and improved solution quality (Section 2.3.2).

- Stochastic methods adopted from simulated annealing provide further improvements in terms of solution quality (Section 2.3.1).

- A parallel implementation delivers the same solution quality as the sequential algorithm while achieving reasonable parallel efficiency (Section 2.4).

In the next section we describe the k-partitioning algorithm with its basic principle and the modifications and extensions mentioned above. In Section 3 we show solutions computed by our algorithm for a number of benchmark problems and compare them to results obtained by existing heuristics.

2 The Algorithm

2.1 Definitions

Definition 1 (k-partitioning problem) Let G = (V, E) be a graph and π : V → {1, ..., k} a mapping function that partitions G into k clusters V_1, ..., V_k, V_i := {v ∈ V : π(v) = i}. Define load(π, i) := |V_i| to be the load of cluster i caused by π. For a node v ∈ V let deg(v) be its degree, ext(v) := |{w ∈ V : {v, w} ∈ E, π(w) ≠ π(v)}| its number of external edges and int(v) := deg(v) − ext(v) its number of internal edges.

1. A k-partition of G is a mapping function π with |load(π, i) − load(π, j)| ≤ 1 for all i, j ∈ {1, ..., k}.

Figure 3: The hill-climbing algorithm (fragment):

    ...
    IF gain > 0 THEN
        realize node remappings;
        gh = 2 · gain;  gl = min{0, gl + maxdegree};
    ELSE
        gh = max{1, gh/2};  gl = gl − maxdegree;
    UNTIL M = ∅ or gain > 0
    UNTIL gain ≤ 0
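For concreteness, the quantities of Definition 1 can be computed as follows. This is our own minimal sketch, assuming an adjacency-list representation graph = {node: [neighbors]} and a cluster map pi = {node: cluster}; the cut-size function reflects the usual definition, i.e. the number of edges connecting different clusters:

```python
def load(pi, i):
    """load(pi, i) := |V_i|, the number of nodes mapped to cluster i."""
    return sum(1 for c in pi.values() if c == i)

def ext(graph, pi, v):
    """Number of external edges of v: neighbors in a different cluster."""
    return sum(1 for w in graph[v] if pi[w] != pi[v])

def internal(graph, pi, v):
    """int(v) := deg(v) - ext(v)."""
    return len(graph[v]) - ext(graph, pi, v)

def cut_size(graph, pi):
    """Edges connecting different clusters (each cut edge is seen twice)."""
    return sum(ext(graph, pi, v) for v in graph) // 2
```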

2.2.2 Balancing

After a helpful set has been moved, the balancing operation restores a balanced partition while trying to further improve the cut size. In each round the algorithm chooses an underloaded cluster V_u at random and fills it up by iteratively moving the node with highest gain to V_u. Note that V_u is filled up from nodes of all other clusters, not just from nodes of overweighted ones. This gives the algorithm larger flexibility and allows complex movements of nodes within one balancing step. Unfortunately, it also introduces the risk of circular movements which would cause the algorithm not to terminate. There are two possible strategies to cope with this problem. The first is to block clusters which have been filled up once (cluster-blocking) and to allow the algorithm to choose nodes only from non-blocked clusters. The second and computationally more expensive alternative is to allow single nodes to migrate at most m times during the whole balancing process; this method is called node-blocking. We show experimental results of both strategies in Section 2.2.4.

The change in cut size during the balancing process is added to the gain of S. Thus, at the end of the balancing, g expresses the total change in cut size if S is moved to V_i and balanced afterwards. Note that this value need not be greater than 0.

2.2.3 Hill-Climbing Algorithm

The overall hill-climbing algorithm is presented in Figure 3. It iteratively searches for an (i, g)-helpful set S with g > 0, moves it (logically) to V_i and balances the partition afterwards. If the total gain of both operations is greater than 0, the node remapping is realized physically. If no i ∈ {1, ..., k} is left such that an (i, g)-helpful set with gain > 0 (after balancing) exists, the algorithm terminates. The parameters gh and gl control the gain of the helpful sets to be searched for: if an iteration realizes a positive total gain, gh is increased to twice this gain and gl is relaxed; otherwise gh is halved and gl is decreased (cf. Fig. 3).
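The surviving fragment of Figure 3 suggests a control loop of the following shape. The sketch is speculative beyond that fragment: find_helpful_set and balance_gain are hypothetical placeholders, and the initial values of gh and gl as well as the meaning of M (clusters still to try) are assumptions.

```python
def hill_climb(graph, pi, k, maxdegree, find_helpful_set, balance_gain):
    """Control loop in the shape of Figure 3 (only the gh/gl schedule
    is taken from the surviving pseudocode fragment)."""
    gh, gl = 1, 0                      # initial gain bounds: an assumption
    while True:                        # REPEAT ... UNTIL gain <= 0
        gain = 0
        M = set(range(k))              # clusters not yet tried (assumed meaning)
        while M and gain <= 0:         # REPEAT ... UNTIL M = empty or gain > 0
            i = M.pop()
            S, g = find_helpful_set(graph, pi, i, gh, gl)  # tentative (i,g)-helpful set
            gain = g + balance_gain(graph, pi, S, i)       # total gain incl. rebalancing
            if gain > 0:               # realize node remappings
                gh = 2 * gain
                gl = min(0, gl + maxdegree)
            else:                      # widen the search window
                gh = max(1, gh // 2)
                gl = gl - maxdegree
        if gain <= 0:
            return pi
```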

2.2.4 Measurement Results

Table 1 shows partitioning results of the hill-climbing algorithm described above. It presents the cost of the best partition found if the algorithm is applied 5 times, the average cost, the number of iterations in the outer loop, and the computation times on a Sun SS10/40. The initial partitions for each run are chosen at random. In the node-blocking scheme, each node is allowed to move at most m = 1 times; our experiments showed that this is the best value concerning the tradeoff between solution quality and running time, as larger values of m do not improve the convergence properties of the method substantially. Tests were performed for 32-partitioning of the benchmark graphs described in the introduction; Table 1 also shows some of the graphs' properties. It can be observed that the node-blocking scheme gives better results on average, at the expense of larger computation times. An interesting quantity is the average number of iterations in the outer loop that both methods perform: node blocking needs far fewer iterations than cluster blocking. This is due to the larger improvement in cut size that node blocking is able to realize within each iteration and shows its larger flexibility. In the following we always use the node-blocking scheme for the balancing operation.

2.3 Additional Concepts

2.3.1 Stochastic Optimization

One of the main disadvantages of all local search methods is the high probability that they get stuck in locally optimal solutions. Even if the neighborhood relation is very sophisticated, like that of the KL heuristic or of our hill-climbing algorithm described in Section 2.2.3, there exist local optima from which the algorithm is not able to escape.


graph       |V|    |E|    deg(av) deg(max)  cluster-blocking               node-blocking
                                            best   avg      iter  time     best   avg      iter  time
AIRFOIL     4253   12289  5.78    9         1038   1101.0   61.4  27.08    951    978.8    27.8  79.10
BCSPWR09    1723   2394   2.78    14        308    336.8    47.6  6.31     322    330.0    16.6  11.99
BCSPWR10    5300   8271   3.12    13        932    990.6    74.4  22.58    870    993.6    21.4  55.31
BCSSTK13    2003   40940  40.88   94        18283  18412.2  53.0  32.35    18006  18238.0  17.4  71.25
MESH32x32   1024   1984   3.88    4         349    381.8    42.8  6.52     339    345.8    18.2  17.32
LSHP3466    3466   10215  5.89    6         1145   1178.0   80.4  39.03    1101   1106.6   23.6  89.97
NASA4704    4704   50026  21.27   41        10089  10510.2  65.4  39.94    9307   9710.2   22.8  89.67

Table 1: The benchmark suite and results of the hill-climbing algorithm, k = 32 (times in sec.).

A common method to prevent local search algorithms from getting stuck in local optima too early is to allow a certain amount of (controlled) deteriorations of the cost function value. The KL heuristic and also the search for helpful sets described in Section 2.2.1 make only limited use of this concept. In the case of simulated annealing, deteriorations together with randomized search serve as the basic principle [14, 16]. Simulated annealing generates neighboring solutions at random, calculates the change ΔC of the cost function value if moved to this configuration and performs the move with probability e^{−ΔC/t}. The parameter t, commonly called temperature, controls the amount of allowed deteriorations. The algorithm is started with a high value of t, such that nearly all moves are performed with high probability, and reduces the temperature throughout the search. At the end, only moves that improve the cost function value are accepted.

The combination of simulated annealing with our hill-climbing procedure described in Section 2.2.3 (cf. Fig. 3) is straightforward. The algorithm determines a helpful set w.r.t. a randomly chosen cluster and tries to rebalance the partitioning as described in Section 2.2.2. The condition for the node remapping is the main change of the algorithm: nodes are no longer remapped only if gain > 0, but also with probability e^{gain/t} if gain ≤ 0.
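The modified acceptance rule is essentially a one-line predicate; a minimal sketch (the function name is ours):

```python
import math
import random

def accept(gain, t):
    """Realize a remapping if it improves the cut (gain > 0), otherwise
    with probability e^(gain/t) at temperature t (Metropolis-style rule)."""
    return gain > 0 or random.random() < math.exp(gain / t)
```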

2.3.2 Dynamic Abstraction

One of the main disadvantages of all local search based optimization algorithms is their lack of a certain amount of "global information". Especially if they are started on randomly generated initial solutions, they are not able to identify global structures that have to be rearranged to find good solutions. Naive hill-climbing algorithms can overcome this problem if they are started on good initial configurations, as they are, for example, generated by spectral methods if graph bisection is considered (see also e.g. [6, 22]). Algorithms based on simulated annealing are not able to benefit from good initial solutions because they normally rearrange the starting configuration randomly. This results from one of their main characteristics: they are ergodic, i.e. their convergence behavior is independent of the initial solution they are started on.

With the introduction of Dynamic Abstraction we try to overcome this disadvantage for the k-partitioning algorithms described in Sections 2.2.3 and 2.3.1. For a given partition of a graph G we define a super-graph Ḡ by identifying connected components within partitions (cf. Fig. 4). The nodes of Ḡ are weighted by their size, i.e. the number of nodes from G they are built of. Its edges are weighted according to the number of edges from G that connect corresponding components.

Figure 4: Building Ḡ from a given partition.

Dynamic Abstraction is related to edge contraction schemes that are widely used to shrink the problem size and speed up partitioning algorithms (see e.g. [2, 11]). The difference to edge contraction is that the abstraction is not performed in a static way on the graph G before partitioning, but is applied dynamically at different steps throughout the algorithm and takes the already achieved partition into account. Thus it may serve as an additional concept that can speed up local search algorithms, especially in their starting phases.

The application of Dynamic Abstraction to the k-partitioning algorithms is straightforward. The algorithm first builds Ḡ by identifying the connected components within each cluster (cf. Fig. 4). The BFS-search for helpful sets is extended to weighted nodes and edges and is performed on Ḡ in nearly the same way as described in Section 2.2. For the balancing procedure, Ḡ is expanded to G first, i.e. the balancing is performed on the normal graph instead of Ḡ. This offers more flexibility and gives the opportunity to balance large helpful sets which are found on Ḡ successfully. We will call such a step (building Ḡ, computing a helpful set on Ḡ and balancing on G) a meta step. The algorithm performs meta steps only as pure hill climbing, i.e. no stochastic optimization is used together with them. If after a number of meta steps no further improvement is possible, the normal algorithm is performed for a while, and after a randomly chosen time meta steps are tried again. The benefits of Dynamic Abstraction are presented in the next section. In general it can be observed that the running time decreases dramatically while, in addition, the algorithm finds much better solutions.
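A sketch of the construction of Ḡ, assuming the same adjacency-list representation as in the earlier sketches (component identification via BFS restricted to each cluster; all names are ours):

```python
from collections import defaultdict, deque

def build_supergraph(graph, pi):
    """Contract the connected components inside each cluster of pi into
    weighted super-nodes; super-edges count the original edges between
    two components."""
    comp, next_id = {}, 0
    for s in graph:                     # BFS that never leaves s's cluster
        if s in comp:
            continue
        comp[s] = next_id
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if pi[w] == pi[v] and w not in comp:
                    comp[w] = next_id
                    queue.append(w)
        next_id += 1
    node_weight = defaultdict(int)      # super-node size = #original nodes
    edge_weight = defaultdict(int)      # super-edge weight = #original edges
    for v in graph:
        node_weight[comp[v]] += 1
        for w in graph[v]:
            if comp[v] < comp[w]:       # count each original edge once
                edge_weight[(comp[v], comp[w])] += 1
    return dict(node_weight), dict(edge_weight), comp
```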

2.3.3 Measurement Results

Table 2 shows results of the hill-climbing algorithm with dynamic abstraction and stochastic optimization. Comparing the results to those of the naive hill-climbing algorithm (Table 1), it can be observed that the additional concepts improve the solution quality by up to 70 percent, depending on the individual problem. This shows that especially the robustness of the heuristic (i.e. its independence of the characteristics of the problem graph and especially of the starting solution) is increased substantially. One has to pay for the improved solution quality and robustness with an up to 10 times larger running time. To preserve a proper amount of convergence, the number of steps the algorithm performs cannot be decreased arbitrarily.

graph (k = 32)  best    avg
AIRFOIL         886     906.4
BCSPWR09        221     236.6
BCSPWR10        594     691.0
BCSSTK13        18047   18112.0
MESH32x32       326     329.0
LSHP3466        1061    1074.8
NASA4704        9167    9297.4

Table 2: Hill-climbing with abstraction and stochastic optimization for k = 32 (times in sec. on SS10/40).

Figure 5 shows the course of the cost function value over the temperature for two runs of the algorithm partitioning the graph BCSPWR09 into 8 clusters. The solid line at a value of 176 indicates the solution found by pure hill-climbing with the described neighborhood (cf. Tab. 1). The curves demonstrate the large benefits of dynamic abstraction: not only is the algorithm able to find much better solutions if abstraction is used, but better solutions are also found much earlier (note the logarithmic scaling). It can also be observed that the used neighborhood leads to a rapid decrease of the cost function value. In contrast to the usual curves that characterize simulated annealing if simple neighborhood relations are used (see e.g. [4]), the cost function drops very quickly to very low values. This is also due to the fixed and hand-optimized cooling schedule which is used throughout this work.

Figure 5: Cost function vs. temperature for graph BCSPWR09, k = 8 (temperature scaled logarithmically; one run without and one run with dynamic abstraction).

2.4 Parallelization

In this section we present the parallelization of the stochastic local search algorithm for the k-partitioning problem. The aim of our work was to develop a parallel algorithm that provides solutions which are comparable to those found by the sequential algorithm, but computes them in significantly shorter time, thus achieving a considerable speedup.

2.4.1 Parallelizing Probabilistic Local Search

Previous work on parallelizing probabilistic local search algorithms has mainly focused on the parallelization of simulated annealing. Two basic principles are presented in the literature (see [4] for an overview). The first is based on data partitioning: the problem-describing data is split into small subsets and distributed among the processors. Each processor is responsible for a data subset and performs sequential simulated annealing on it. The efficiency of this approach is directly related to the degree of dependence between different data subsets. High dependencies result in intensive communication and low efficiency if these dependencies are taken into account, or in bad solution quality if they are not considered properly. Another disadvantage is the limit on the maximal number of processors, which is determined by the size of the problem instance.

An approach which is based on the parallelization of the SA algorithm itself is described in [1] and [4]. Both papers describe a general principle to parallelize SA


which is independent of the specific application. The idea is based on the observation that for typical applications of SA nearly 99% of all generated moves are not accepted [4]. Therefore these moves can be performed independently on different processors. Each processor receives the whole problem instance and executes the sequential steps of the algorithm in parallel. Thus, all processors work simultaneously on the evaluation of one Markov chain, preserving the same convergence properties as the sequential algorithm.

2.4.2 Application to k-Partitioning

As our aim is to preserve the convergence behavior of the sequential algorithm, we use the second parallelization strategy. The realization of this principle is based on a processor farm. One processor (the farmer) controls the overall algorithm and distributes work to the worker processors. It also holds the current solution of the optimization problem, the so-called global solution. If a worker becomes idle, it requests the current global solution π_g and a random number i ∈ {1, ..., k} from the farmer, computes an (i, g)-helpful set, performs the balancing operation and sends its local solution π_n as the result of this computation back to the farmer. The farmer performs a transition of the global solution π_g depending on the costs of π_n. As the calculation of helpful sets and the balancing operation are randomized, the parallelism of this approach is not limited to k processors: the farmer is able to give the same value of i to several workers, which will, with high probability, all produce different local solutions.

The computation times for the (i, g)-helpful set and the associated balancing operation can vary extremely for different i ∈ {1, ..., k}. Therefore the results of this computation can arrive in any order and with unpredictable delay at the farmer. The farmer has to choose one of the solutions and update the global solution. Suppose the farmer has chosen π_n for an update of π_g; we will describe later on how this selection is done. Two cases are possible:

1. π_n was computed as a neighbor of π_g, and π_g is still the valid global solution. In this case the farmer only has to decide about the acceptance of π_n according to the accept function of SA.

2. π_n was computed as a neighbor of π_g, and π_g has already been replaced by π_g' as the global solution. In this case it is tested whether the node remappings induced by π_n conflict with those that were performed by the change from π_g to π_g'. If they conflict, π_n is not accepted. If the node remappings do not conflict, it is possible to realize both of them. In this case the farmer decides about the acceptance of π_n according to the accept function of SA, but now depending on the value of π_g' and the solution gained if the remappings induced by π_n are applied to π_g'.

The question remaining is how to select a local solution out of the set of local solutions which may arrive at the farmer within a period of time. The way this is done is most important for the parallel efficiency in terms of the overall computation time and solution quality. We compared several strategies [4]:

Selecting every solution: The simplest strategy is to select every local solution as it arrives at the farmer and to judge its acceptance. The disadvantage of this strategy is that solutions which can be computed very fast are favored; this affects the solution quality negatively [4]. The problem mainly arises in the first phase of the algorithm, where the number of accepted moves is extremely high. In the ending phase, there is only a very small probability that a move is accepted anyhow; therefore it is possible to select every local solution in this phase without affecting the convergence properties negatively.

Simulating the sequential algorithm: To give all local solutions a fair chance of being accepted, the farmer waits until all workers have computed their local solution. It then chooses a local solution at random and accepts it according to the accept function of SA. This process is repeated until a local solution is accepted or all candidates have been evaluated. The disadvantage of this method is the large idle time which is likely to occur on worker processors that have found their solution quickly.

Dynamic strategy: We have seen that the first strategy simulates the sequential algorithm if the acceptance probability is small, whereas the second strategy simulates the behavior of the sequential algorithm in all cases. A strategy which combines these two principles decides about accepting a new solution on the basis of a set of local solutions computed by the workers. We determine the size of this set depending on the probability that a move is accepted. The number of solutions the farmer has to receive before it is allowed to judge the acceptance of a new solution is computed by

    #rec = min{k, 1 + acceptance_probability · k}.

Thus, at the beginning of the algorithm (large probability of acceptance) the strategy simulates the sequential algorithm, leading to a decreased parallel efficiency (because of the idle times) but to a solution quality comparable to that of the sequential algorithm. At the end of the algorithm every solution provided by the workers is taken to judge its acceptance; this also reflects the behavior of the sequential algorithm while providing a high parallel efficiency. Using this strategy we obtained the speedup presented in Figure 6.
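The dynamic strategy can be sketched as follows. Apart from the #rec formula above, the interface (a batch of arrived local solutions, a cost callback, and all names) is our assumption:

```python
import math
import random

def solutions_to_collect(p_accept, k):
    """#rec = min{k, 1 + p_accept * k} from the dynamic strategy."""
    return min(k, int(1 + p_accept * k))

def farmer_judge(batch, p_accept, k, t, cost, global_cost):
    """Once #rec local solutions have arrived, pick one at random and
    apply the SA accept rule; returns the accepted solution or None."""
    if len(batch) < solutions_to_collect(p_accept, k):
        return None                   # keep waiting for more workers
    pi = random.choice(batch)
    gain = global_cost - cost(pi)     # > 0 iff the cut size shrinks
    if gain > 0 or random.random() < math.exp(gain / t):
        return pi
    return None
```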


graph       best   avg      iter   acc     acc win  acc lost  time    speedup
AIRFOIL     894    914.7    505.7  47.1 %  30.6 %   16.4 %    1417.2  30.41
BCSPWR09    249    265.7    518.2  55.8 %  32.3 %   23.4 %    468.5   21.64
BCSPWR10    583    641.2    606.2  49.4 %  30.7 %   18.6 %    1607.7  21.61
BCSSTK13    18116  18195.7  482.5  45.0 %  27.7 %   17.3 %    1680.5  28.65
MESH32x32   323    324.2    495.2  41.3 %  24.7 %   16.5 %    346.0   25.87
LSHP3466    1045   1063.0   473.5  39.0 %  25.3 %   13.7 %    1364.0  33.28

Table 3: Parallel algorithm, k = 32, p = 64 (times in sec. on Transputer network).

2.4.3 Measurement Results

To verify the usefulness of the concepts presented above, we implemented the algorithm on a Transputer-based MIMD machine of the Paderborn Center for Parallel Computing. The processors are arranged in a grid structure; the system allows the use of any arbitrarily sized sub-mesh. Our tests were performed on 1, 9, 25, 49 and 64 processors. One has to observe that the Transputer processor is very slow compared to the SPARC SS10/40 which was used for the sequential algorithms. Experiments on both systems show that for our application the SPARC system is about 23 times faster than the Transputer. Thus, to be able to compare running times, we adjust the numbers of the Transputers to the SPARC using a factor of 23.

Table 3 presents detailed results for different graphs when partitioned into 32 clusters using 64 processors. The table presents the best solution found in 5 runs, the average solution and the number of iterations. "acc" denotes the percentage of iterations in which an accepting configuration has been found; "acc win" gives the percentage of transitions which have been accepted because the configuration led to a better partition. The overall runtime is given in seconds; notice that this refers to the time using Transputers.

Results show that about 30 to 50 percent of all local solutions are accepted by the farmer. This shows that the used neighborhood structure is very efficient, but it also makes it very hard to achieve large speedups, as the parallelization is based on the principle of evaluating independent trials in parallel. Due to the high acceptance rate, a large search overhead occurs at the beginning of the algorithm, where the temperature is high. We found that local solutions showing up at the farmer after it has already accepted a new configuration are usually not accepted, since they conflict in most cases with the performed node remapping. Only in the ending phase of the algorithm does a small number of non-conflicting local solutions occur which can be accepted by the farmer. Thus, the sequential algorithm is in some way inherently sequential, making it very hard to achieve good parallel efficiencies for larger numbers of processors. If k is relatively large compared to the processor number, the search overhead becomes relatively small, and thus larger speedups are possible. The idle time of the worker processors is very low in general (up to 5%), so the centralized approach is not a bottleneck. As more sophisticated neighborhood relations will in general need more time for their local computations on the worker processors, we can expect to achieve these low idle times also for improvements of our sequential algorithm.

Figure 6: Speedup, k = 32.

Figure 6 presents the speedup for all benchmark problems, varying processor numbers and k = 32. Speedup could be achieved for all problem instances, but it is highly dependent on the individual problem instance. Measurements with different values of k also showed a strong dependence of the parallel efficiency on k: for larger values of k, a better speedup even on larger processor numbers is possible. The parallel algorithm achieves solution qualities similar to the sequential one. For example, the solutions obtained by 64 processors differ only by 1.04% from the sequential solution quality on average if 32-partitioning is performed.

The speedup is mainly determined by the cooling schedule. If the algorithm performs a very large number of iterations at low temperatures, a huge number of local solutions can be computed independently and a large speedup can be achieved. We used the same optimized cooling schedule as for the sequential algorithm, performing only a relatively small number of iterations at low temperatures.


3 Experimental Comparison

Table 4: Comparison of solution qualities and running times of partitioning heuristics, k = 32 (numbers in brackets indicate running times in seconds on SS10/40); columns: graph, new-s, new-p, SA, CPE, ML(200), SP(ML-RQI), HS.

Table 4 presents running times (measured in seconds) and partitioning results of our new heuristic compared to a number of other methods. "new-s" gives the results of the sequential algorithm presented in Section 2.3.1; "new-p" gives the results of the parallel implementation. The running times for "new-p" are obtained by pessimistic projection (factor 23, see above) of Transputer times to the SPARC SS10/40.

Our implementation of simulated annealing (SA) [4] uses a simple swap neighborhood and a self-adapting cooling schedule. Its running times are in general far too large to be comparable to any of the other methods. The results of Hammond's CPE heuristic [9] (CPE) are obtained from a sequential simulation of the parallel algorithm; therefore running times cannot be presented. As already stated, the method uses a simple swap neighborhood together with hill-climbing, and its results therefore depend very much on the initial solution the algorithm is started on. ML denotes the multilevel method of Hendrickson and Leland [11] that is implemented in their Chaco library of partitioning heuristics [12]. It incorporates edge contraction schemes (the graphs are coarsened down to 200 nodes), a spectral bisection algorithm to partition the coarsened graph, and the KL heuristic as local clean-up. The algorithm uses recursive bisection to construct a k-partitioning. SP is the spectral bisection method of Pothen et al. [22] that is also included in the Chaco library. The spectral algorithm uses a multilevel RQI solver, coarsening the problem down to 200 nodes, and applies full orthogonalization. Additionally, KL is used as local clean-up. Like the ML method, it uses recursive bisection to split the entire graph into k clusters. Finally, HS stands for our bisection heuristic [6] that is based on helpful sets and is used recursively for the k-partitioning problem.

The measurement results show that our algorithm in general achieves better results than all other methods. This is true for both the sequential and the parallel version. Compared to the sequential heuristics, the running times of our sequential algorithm are much larger due to the stochastic acceptance of neighboring configurations. The parallel version achieves computation times comparable to those of the other sequential algorithms.

4 Conclusions

We presented an efficient heuristic algorithm for the k-partitioning problem. The algorithm combines a local search method based on the concept of helpful sets with a dynamic abstraction scheme and with stochastic methods adopted from simulated annealing. The increased running time caused by the annealing process is reduced by a parallel implementation on a moderately sized MIMD system. The parallel stochastic partitioning algorithm needs the same computation time as sequential pure hill-climbing but provides an up to 70% improved solution quality. The new heuristic was tested on a set of widely used benchmark graphs and compared to a number of other k-partitioning algorithms, including recursive bisection with a variety of bisection algorithms. Experiments show that our new algorithm provides better solutions in almost all cases than those obtained by other methods.


Acknowledgements

Many people helped to make this work possible. We would especially like to thank Bruce Hendrickson and Robert Leland from Sandia, who provided the code of their Chaco library and gave many helpful comments. Thanks also to Alex Pothen (Old Dominion Univ.) and Steve Hammond (RIACS), who made some of their sample input data available, to Horst Simon (RIACS), who provided useful hints, and to our colleague Robert Preis, who performed some of the experiments.

References

[1] F. Baiardi, S. Orlando: Strategies for Massively Parallel Implementation of Simulated Annealing. Proc. of Parallel Architectures and Languages (PARLE), 1989, pp. 335-338

[2] S.T. Barnard, H.D. Simon: Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems. Concurrency: Practice and Experience 6(2), 1994, pp. 101-117

[3] G.E. Blelloch, A. Feldmann, O. Ghattas, J.R. Gilbert, G.L. Miller, D.R. O'Hallaron, E.J. Schwabe, J.R. Shewchuk, S.-H. Teng: Automated Parallel Solution of Unstructured PDE Problems. CACM, to appear

[4] R. Diekmann, R. Lüling, J. Simon: Problem Independent Distributed Simulated Annealing and its Applications. In R.V.V. Vidal (ed.): Applied Simulated Annealing, Springer LNEMS 396, 1993, pp. 17-44

[5] R. Diekmann, R. Lüling, A. Reinefeld: Distributed Combinatorial Optimization. Proc. of SOFSEM'93, Czech Republic, 1993, pp. 33-60

[6] R. Diekmann, B. Monien, R. Preis: Using Helpful Sets to Improve Graph Bisections. Tech. Rep. tr-rf-94-008, Univ. of Paderborn, 1994

[7] I.S. Duff, R.G. Grimes, J.G. Lewis: Sparse Matrix Test Problems. ACM Trans. on Math. Software 15(1), 1989, pp. 1-14

[8] C.M. Fiduccia, R.M. Mattheyses: A Linear-Time Heuristic for Improving Network Partitions. 19th IEEE Design Automation Conf., 1982, pp. 175-181

[9] S.W. Hammond: Mapping Unstructured Grid Computations to Massively Parallel Computers. Tech. Rep. 92.14, RIACS, NASA Ames, 1992

[10] B. Hendrickson, R. Leland: Multidimensional Spectral Load Balancing. Tech. Rep. SAND93-0074, Sandia National Lab., Jan. 1993

[11] B. Hendrickson, R. Leland: A Multilevel Algorithm for Partitioning Graphs. Tech. Rep. SAND93-1301, Sandia National Lab., Oct. 1993

[12] B. Hendrickson, R. Leland: The Chaco User's Guide. Tech. Rep. SAND93-2339, Sandia National Lab., Nov. 1993

[13] J. Hromkovič, B. Monien: The Bisection Problem for Graphs of Degree 4 (Configuring Transputer Systems). 16th Math. Foundations of Comp. Sci. (MFCS '91), Springer LNCS 520, pp. 211-220

[14] D.S. Johnson, C.R. Aragon, L.A. McGeoch, C. Schevon: Optimization by Simulated Annealing: An Experimental Evaluation; Part I, Graph Partitioning. Operations Research 37(6), 1989, pp. 865-893

[15] B.W. Kernighan, S. Lin: An Effective Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, Feb. 1970, pp. 291-308

[16] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi: Optimization by Simulated Annealing. Science 220(4598), May 1983, pp. 671-680

[17] R.J. Lipton, R.E. Tarjan: A Separator Theorem for Planar Graphs. SIAM J. Applied Mathematics 36, 1979, pp. 177-189

[18] O.C. Martin, S.W. Otto: Combining Simulated Annealing with Local Search Heuristics. Tech. Rep. CSE-94-016, Oregon Graduate Institute

[19] G.L. Miller, S.H. Teng, S.A. Vavasis: A Unified Geometric Approach to Graph Separators. 32nd Symp. on Foundations of Comp. Sci., 1991, pp. 538-547

[20] B. Monien, R. Diekmann, R. Lüling: Communication Throughput of Interconnection Networks. 19th Math. Foundations of Comp. Sci. (MFCS '94), Springer LNCS 841, 1994, pp. 72-86

[21] B. Monien, I.H. Sudborough: Embedding One Interconnection Network in Another. Computing Suppl. 7, 1990, pp. 257-282

[22] A. Pothen, H.D. Simon, K.P. Liou: Partitioning Sparse Matrices with Eigenvectors of Graphs. SIAM J. Matrix Anal. & Appl. 11(3), 1990, pp. 430-452

[23] J.E. Savage, M.G. Wloka: Parallelism in Graph-Partitioning. J. of Parallel and Distributed Computing 13, 1991, pp. 257-272

[24] H.D. Simon: Partitioning of Unstructured Problems for Parallel Processing. Conf. on Parallel Methods on Large Scale Structural Analysis and Physics Applications, Pergamon Press, 1991

[25] H.D. Simon, S.H. Teng: How Good is Recursive Bisection? Tech. Rep., RIACS, NASA Ames, June 1993

[26] M. Yannakakis: The Analysis of Local Search Problems and their Heuristics. Int. Colloquium on Automata, Languages and Programming (ICALP '90), Springer LNCS, 1990, pp. 298-311
