Finding Hamiltonian Paths in Tournaments on Clusters

Chun-Hsi Huang    Sanguthevar Rajasekaran    Laurence Tianruo Yang    Xin He

Abstract

This paper presents a general methodology for the communication-efficient parallelization of graph algorithms using the divide-and-conquer approach and shows that this class of problems can be solved in cluster environments with good communication efficiency. Specifically, the first practical parallel algorithm, based on a general coarse-grained model, for finding Hamiltonian paths in tournaments is presented. On any such parallel machine, this algorithm uses only O(log p) communication rounds, where p is the number of processors. This bound is independent of the tournament size, and the algorithm can reuse the existing linear-time algorithm in the sequential setting. For theoretical completeness, the algorithm is revised for fine-grained models, where the ratio of computation and communication throughputs is low or the local memory size, n/p, of each individual processor is extremely limited (O(n^ε) for any constant ε > 0), solving the problem with O(log p) communication rounds, while the hidden constant grows with the scalability factor 1/ε. Experiments have been carried out on a Linux cluster of 32 Sun Ultra 5 computers and an SGI Origin 2000 with 32 R10000 processors. The algorithm performance on the Linux cluster reaches 75% of the performance on the SGI Origin 2000 when the tournament size is about one million.

Keywords: Cluster Computing, Tournaments, Hamiltonian Path, Parallel Computing, Graph Applications

1 Introduction

1.1 Motivation and Contribution

A large number of parallel computing problems in many fields are defined in terms of graphs. However, graph problems have been shown to have considerably less internal structure than many other problems studied. This results in highly data-dependent communication patterns and makes it difficult to achieve communication efficiency. Balancing the load assigned to different processors and minimizing the communication overhead are the core problems in achieving high performance on parallel or distributed systems. Typically, algorithms based on divide-and-conquer need to trade off between processor utilization and communication overhead. However, parallel divide-and-conquer specifies an important class of problems in many fields, such as computational geometry [1, 2], graph theory [9, 16], numerical analysis [11, 8] and optimization [4]. Therefore, designing

Acknowledgment: Computational resources and technical support are provided by the Center for Computational Research (CCR) at the State University of New York at Buffalo.

Author affiliations:
Chun-Hsi Huang: Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269. Email: [email protected]
Sanguthevar Rajasekaran: Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269. Email: [email protected]
Laurence Tianruo Yang: Department of Computer Science, St. Francis Xavier University, Antigonish, NS, B2G 2W5, Canada. Email: [email protected]
Xin He: Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260. Email: [email protected]


an approach that reduces the interprocessor communication overhead while balancing the workload becomes essential, especially when such algorithms are to be implemented in fully distributed-memory environments. This paper focuses on one such example: finding Hamiltonian paths in tournaments, a graph problem that has wide applications in fields such as task scheduling and computational geometry. A tournament is a directed graph in which, for any pair of vertices u and v, either the arc (u, v) or the arc (v, u) is present, but not both. This models a competition involving n players, where every player competes against every other one. A Hamiltonian path in a graph is a simple path that visits every vertex exactly once. Throughout this paper, we design algorithms based on a general-purpose coarse-grained model as detailed in the next subsection. In this model, the local memory size of each processor is limited to O(n/p), where n is the input size, namely the number of vertices (players) in the tournament graph, and p the number of processors. We also limit the message size in each communication round to O(n/p). We present load-balanced and communication-efficient partitioning strategies that generate sub-tournaments as evenly as possible for each processor. The computation of the Hamiltonian path of each sub-tournament is then carried out using the existing sequential algorithm. The major features of the main algorithm in this paper are as follows:








1. Inter-processor communication overhead in the “divide” stage (partitioning) is reduced by routing data only after the final destination processor has been determined, avoiding a large amount of data movement.

2. Code reuse from the existing sequential algorithm is maximized in the “conquer” stage.

3. No additional communication overhead is introduced in the “merge” stage.

For theoretical completeness, this algorithm has also been revised for fine-grained platforms where either each individual processor has extremely limited local memory or the ratio of computation and communication throughputs is low. To demonstrate the practical relevance, the algorithm has been implemented on a Linux cluster of 32 333 MHz Sun Ultra 5 processors. For performance comparison, experiments have also been carried out on an SGI Origin 2000 with 32 R10000 processors. Speedups on both platforms scale well from 2 to 32 processors. Most importantly, the experimental results show that the algorithm performance on the Sun cluster is above 75% of the performance on the SGI Origin 2000 when the tournament size is about one million.

1.2 Cost Model

The cost model used to describe and analyze the algorithm is a general-purpose coarse-grained parallel programming model. Parameters similar to those used in specific parallel models, such as the BSP, LogP, and PRAM(n) [6, 7, 12, 18], are used to capture the performance characteristics, including: (1) p, the number of processors; (2) g, the ratio of communication throughput to processor throughput; and (3) L, the time required to barrier-synchronize all or part of the processors. In addition, the model consists of three components: (1) a set of processors, each with a local memory; (2) a communication network; and (3) a mechanism for globally synchronizing the processors. The algorithm proceeds as a series of supersteps, in which a processor may operate only on values stored in local memory. Values sent through the communication network are not guaranteed to arrive until the end of the current superstep. We use the term h-relation to denote a routing problem where each processor has at most h words of data to send to other processors and is also due to receive at most h words of data from other processors. In each superstep, if at most w arithmetic operations are performed by each processor and the data communicated forms an h-relation, the cost of this superstep is w + g·h + L.
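To make the superstep accounting concrete, the cost formula above can be sketched as a small helper. The machine parameters used below are hypothetical, chosen only for illustration, not measured constants from the paper:

```python
def superstep_cost(w, h, g, L):
    """Cost of one superstep: local work w, an h-relation routed at
    communication/computation throughput ratio g, plus the
    barrier-synchronization cost L."""
    return w + g * h + L

# Hypothetical machine parameters (illustration only):
# g = 4 time units per word communicated, L = 100 units per barrier.
cost = superstep_cost(w=10_000, h=500, g=4, L=100)
print(cost)  # 10000 + 4*500 + 100 = 12100
```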

1.3 Organization of the Following Sections

The rest of this paper is organized as follows. Section 2 presents the coarse-grained parallel algorithm for finding Hamiltonian paths in tournaments and details the cost analysis for both local computation and interprocessor communication. Theoretical improvements for platforms in a more general setting are given in Section 3, where two operations, prefix sum and array packing, are employed in the partitioning stage so that the revised algorithm fits these environments. Section 4 details the experimental results on a Linux cluster of 32 Sun Ultra 5 computers and an SGI Origin 2000 with 32 R10000 processors. Section 5 concludes the paper.

2 The Algorithm and Cost Analysis

2.1 The Algorithm

Assume a tournament T = (V, E) is given, where V = {v_1, v_2, ..., v_n} and |V| = n. Throughout this paper, the size of a tournament refers to |V|. A trivial, but useful, fact is that any induced subgraph of a tournament is also a tournament. Related work on tournaments can be found in [5, 15, 17]. Given V' ⊆ V, define T(V') to be the tournament (induced subgraph) on V'. We say u dominates v if (u, v) ∈ E, and denote this property by u → v. Note that since the directions of the arcs are arbitrary, the domination relation is not necessarily transitive. The notion of domination is extended to sets of vertices: let A and B be subsets of V; A dominates B (A → B) if every vertex in A dominates every vertex in B. For a given vertex v in V, the rest of the vertices of V are categorized according to their relations with v: O_v is the set of vertices that are dominated by v, and I_v is the set of vertices that dominate v. We start by stating the theorem for the Hamiltonian path [17].










Theorem 2.1 Every tournament contains a Hamiltonian path.





 

Proof: By induction on the number of vertices n. The result is clear for n = 1. Assume the theorem holds for tournaments on n vertices, and consider a tournament T on n + 1 vertices. Let v be an arbitrary vertex of T. By the induction hypothesis, T(V − {v}) has a Hamiltonian path v_{i_1} → v_{i_2} → ... → v_{i_n}. If v → v_{i_1}, then v → v_{i_1} → ... → v_{i_n} is a Hamiltonian path of T. Otherwise, let j be the largest index such that v_{i_j} → v. If j = n, then v_{i_1} → ... → v_{i_n} → v is a Hamiltonian path. If not, v_{i_1} → ... → v_{i_j} → v → v_{i_{j+1}} → ... → v_{i_n} is the desired Hamiltonian path. ∎
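The inductive argument is constructive and translates directly into a sequential insertion algorithm: vertices are added to a growing path one at a time. The sketch below is one possible rendering, not the paper's implementation; `beats(u, v)` is an assumed predicate returning True iff the arc u → v is present:

```python
def hamiltonian_path(vertices, beats):
    """Build a Hamiltonian path by inserting one vertex at a time,
    following the inductive proof of Theorem 2.1.
    beats(u, v) is True iff u dominates v (arc u -> v)."""
    path = []
    for v in vertices:
        if not path or beats(v, path[0]):
            path.insert(0, v)          # v dominates the current head
            continue
        # Largest index j with path[j] -> v; it exists because
        # path[0] -> v holds in a tournament when v does not beat the head.
        j = max(i for i, u in enumerate(path) if beats(u, v))
        path.insert(j + 1, v)          # splice v right after path[j]
    return path
```

On a transitive tournament (u beats v iff u < v) this simply produces the sorted order, which is a quick sanity check of the splicing logic.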



         



For each vertex v in V, we define d_in(v) (respectively, d_out(v)) to be the in-degree (respectively, out-degree) of v in T.

Theorem 2.2 In a tournament T on n vertices, there exists a vertex v, referred to as a mediocre player, for which both O_v and I_v have at least roughly n/4 vertices (at least ⌈n/4⌉ − 1).

Proof: Let V' = {v ∈ V : d_in(v) ≥ d_out(v)}. Assume without loss of generality that |V'| ≥ ⌈n/2⌉. Since the sum of the in-degrees of the vertices in T(V') equals the sum of the out-degrees of the vertices in T(V'), there exists a vertex u ∈ V' whose out-degree in T(V') is no less than its in-degree in T(V'). Thus d_out(u) ≥ d_out^{T(V')}(u) ≥ ⌈(|V'| − 1)/2⌉ ≥ ⌈n/4⌉ − 1, and d_in(u) ≥ d_out(u) ≥ ⌈n/4⌉ − 1, as shown. ∎
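A straightforward (quadratic-work) way to locate a mediocre player is to scan all vertices and keep the one maximizing the smaller of its two degrees; the vertex guaranteed by Theorem 2.2 ensures this maximum is at least roughly n/4. This is a sketch under the same assumed `beats` predicate, not the paper's data layout:

```python
def mediocre_player(n, beats):
    """Return a vertex of {0, ..., n-1} whose in- and out-degrees are
    as balanced as possible; by Theorem 2.2 both sides then contain
    at least roughly n/4 vertices.  beats(u, v) is True iff u -> v."""
    best, best_min = None, -1
    for v in range(n):
        out_deg = sum(1 for u in range(n) if u != v and beats(v, u))
        in_deg = (n - 1) - out_deg        # tournament: every pair has one arc
        if min(in_deg, out_deg) > best_min:
            best, best_min = v, min(in_deg, out_deg)
    return best
```

Picking the vertex with the closest in- and out-degrees also matches the splitter-selection rule used in the partitioning stage described below.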

Throughout this paper, H_T will be used to denote a Hamiltonian path of a tournament T. Assuming a mediocre player of T is m, an observation reveals the fact that the concatenation H_{T(I_m)} → m → H_{T(O_m)} is an H_T: every vertex of I_m dominates m, in particular the last vertex of H_{T(I_m)}, and m dominates every vertex of O_m, in particular the first vertex of H_{T(O_m)}. This observation, along with Theorem 2.2, motivates the algorithm design. The ideas of the algorithm are sketched below.
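This concatenation property gives a divide-and-conquer recursion directly: split on a mediocre player, solve each side, and glue. A sequential sketch follows; `mediocre(vs, beats)` is an assumed helper returning some mediocre player of the sub-tournament on `vs` (any vertex choice preserves correctness; the mediocre choice only bounds the recursion depth):

```python
def ham_path_dc(vs, beats, mediocre):
    """Divide-and-conquer Hamiltonian path:
    H_T = H_T(I_m) . m . H_T(O_m), where m is a mediocre player,
    every vertex of I_m dominates m, and m dominates every vertex of O_m."""
    if len(vs) <= 1:
        return list(vs)
    m = mediocre(vs, beats)
    I_m = [v for v in vs if v != m and beats(v, m)]   # vertices dominating m
    O_m = [v for v in vs if v != m and beats(m, v)]   # vertices dominated by m
    return ham_path_dc(I_m, beats, mediocre) + [m] + ham_path_dc(O_m, beats, mediocre)
```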



 


Denote the initial tournament by T. In the first partitioning step, we identify the mediocre player m of T and split T into T_0 = T(I_m) and T_1 = T(O_m). Similarly, during the second partitioning step, we identify the mediocre players m_0 and m_1 of T_0 and T_1, respectively. We then split T_0 into T_00 = T_0(I_{m_0}), T_01 = T_0(O_{m_0}) and T_1 into T_10 = T_1(I_{m_1}), T_11 = T_1(O_{m_1}). Inductively, during the i-th partitioning step, we identify the mediocre players of all sub-tournaments produced by the (i − 1)-st step and split each sub-tournament into two accordingly. The partitioning stage proceeds no more than ⌈log_{4/3} p⌉ iterations and no fewer than ⌈log_2 p⌉ iterations. At this point, each sub-tournament has at most n/p vertices. Then each sub-tournament T_b is sent to a processor, and a Hamiltonian path H_{T_b} is locally found by that processor. The concatenation of these Hamiltonian paths H_{T_b} and the selected mediocre players is a Hamiltonian path H_T.

For ease of description, the number of iterations in the partitioning stage is denoted by r. Note that the mediocre player of a tournament is not unique. In order to split the tournaments as evenly as possible in the partitioning stage, in each round the mediocre player whose in-degree and out-degree are closest is selected and used to split the tournaments from the preceding partitioning round. An important idea here is that, during the “divide” stage, only the mediocre players are communicated among processors. Sub-tournaments are moved to destination processors only after the splitting process ends, when each sub-tournament can fit in the storage of a single processor. The sequential algorithm for finding a Hamiltonian path is then applied in parallel, and the results, along with all the mediocre players selected during the partitioning stage, form the final Hamiltonian path. The algorithm for finding Hamiltonian paths in tournaments uses the following major data structures:
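The bookkeeping of the partitioning stage can be simulated sequentially as below. This is a sketch, not the parallel implementation (in which only the mediocre players are exchanged per round); it assumes p is a power of two and reuses a hypothetical `mediocre(vs, beats)` helper:

```python
import math

def partition_rounds(n, p, beats, mediocre):
    """Simulate the partitioning stage: after r = ceil(log2 p) rounds,
    each group id names a sub-tournament (hence a target processor).
    The splitters (mediocre players) are kept out of the sub-tournaments;
    they reappear between the locally computed paths at merge time."""
    r = math.ceil(math.log2(p))
    groups = {0: list(range(n))}      # sub-tournament id -> vertex list
    splitters = []
    for _ in range(r):
        nxt = {}
        for gid, vs in groups.items():
            if len(vs) <= 1:          # nothing left to split
                nxt[2 * gid], nxt[2 * gid + 1] = vs, []
                continue
            m = mediocre(vs, beats)
            splitters.append(m)
            nxt[2 * gid] = [v for v in vs if v != m and beats(v, m)]      # I_m
            nxt[2 * gid + 1] = [v for v in vs if v != m and beats(m, v)]  # O_m
        groups = nxt
    return groups, splitters
```

Every vertex ends up either in exactly one final sub-tournament or in the splitter list, which is what allows the merge stage to produce the full path without extra communication.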




(1) Adjacency matrix A:

A[i][j] = 1 if (v_i, v_j) ∈ E; A[i][j] = −1 if (v_j, v_i) ∈ E; A[i][i] = 0.

Note that A is anti-symmetric. Also note that the total number of 1's (−1's, respectively) in row i is the out-degree (in-degree, respectively) of v_i.
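Under this convention the degrees fall out of simple row counts; a minimal sketch, assuming the +1/−1/0 entries described above:

```python
def degrees(A):
    """Given the anti-symmetric adjacency matrix (A[i][j] = 1 if
    v_i -> v_j, -1 if v_j -> v_i, 0 on the diagonal), return the
    (in_degree, out_degree) lists computed from row counts."""
    out_deg = [sum(1 for x in row if x == 1) for row in A]
    in_deg = [sum(1 for x in row if x == -1) for row in A]
    return in_deg, out_deg
```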

(2) Membership matrix M:

M[i][j] records the index of the sub-tournament that v_i belongs to in partitioning round j.

(3) Target-processor map: when the partitioning stage finishes, each v_i can determine, via column r of M, the target processor to which it is to be routed.