PARALLEL EXTERNAL MEMORY GRAPH ALGORITHMS

LARS ARGE 1, MICHAEL T. GOODRICH 2, AND NODARI SITCHINAVA 2

1 MADALGO, University of Aarhus, Aarhus, Denmark
E-mail address: [email protected]

2 University of California – Irvine, Irvine, CA 92697 USA
E-mail address: {goodrich,nodari}@ics.uci.edu

Abstract. In this paper, we study parallel I/O-efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking, which leads to efficient solutions to problems on trees, such as computing lowest common ancestors, tree contraction and expression tree evaluation. We also study the problems of computing the connected and biconnected components of a graph, the minimum spanning tree of a connected graph and the ear decomposition of a biconnected graph. All our solutions provide an optimal speedup of O(p) in parallel I/O complexity and parallel computation time compared to the single-processor external-memory versions of the algorithms.

1. Introduction

With the advent of multicore architectures, and the realization that processors are no longer increasing in speed the way they used to, parallelism through chip-level multiprocessors (CMPs) has become the primary remaining method for achieving orders-of-magnitude improvements in performance [17, 22, 26]. Recently, Arge et al. [3] introduced the parallel external memory (PEM) model to capture modern CMP architectures. In this model, each of the p processors contains a private cache of size M, and all the processors share a common "external" memory. The processors are not assumed to be organized in any particular network structure; hence, their only communication channel is the common shared memory. Memory is subdivided into blocks of B contiguous memory cells, and, in any step, each processor can read or write a block of memory cells. Access to the blocks is governed by the concurrent-read, exclusive-write (CREW) conflict resolution protocol, but the model can be adapted to any other reasonable protocol, such as concurrent-read, concurrent-write (CRCW) or exclusive-read, exclusive-write (EREW). The complexity measures of the model are the number of parallel I/Os, the parallel computation time and the total space required by an algorithm.

1998 ACM Subject Classification: F.1.2 Models of Computation; F.2.2 Non-numerical Algorithms and Problems; G.2.2 Graph Theory.
Key words and phrases: parallel algorithms, private-cache CMP, parallel external memory, PEM, PEM graph algorithms, PEM list ranking.
MADALGO – Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation.
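As a toy illustration of the PEM cost measures just defined, the following sketch (ours, not from the paper; the parameter values are purely illustrative) evaluates the parallel I/O cost of a scan, O(N/pB), and of PEM sorting, O((N/pB) log_{M/B}(N/B)) [3]:

```python
import math

def scan_io(N: int, p: int, B: int) -> int:
    # Parallel I/Os for scanning N contiguous items: O(N / (p B)).
    return math.ceil(N / (p * B))

def sort_io(N: int, p: int, B: int, M: int) -> float:
    # Parallel I/Os for PEM sorting: O((N / (p B)) * log_{M/B}(N / B)),
    # valid for p <= N / B^2 and M = B^{O(1)} [3].
    return (N / (p * B)) * max(1.0, math.log(N / B, M / B))

# Illustrative values only: 2^30 items, 64 processors, B = 1024, M = B^2.
print(scan_io(2**30, 64, 1024))            # 16384 parallel I/Os
print(sort_io(2**30, 64, 1024, 1024**2))   # ~32768 parallel I/Os
```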

In this paper, we are interested in the design of efficient graph algorithms in the PEM model. From an algorithmic standpoint, the main challenge in the design of efficient PEM algorithms is to exploit the spatial locality of data while maintaining maximum concurrency. Existing shared-memory parallel (e.g., PRAM) algorithms access data in patterns too random to exploit spatial locality, while existing external-memory algorithms are either inherently sequential or rely on message passing and routing networks for interprocessor communication. Thus, we need new approaches and techniques for the design of efficient PEM algorithms.

1.1. Prior Related Work

There is a wealth of previous efficient graph algorithms in the PRAM model (e.g., see [20, 21, 28]). The algorithms developed in the PRAM model are interesting from a theoretical point of view because they demonstrate the full extent to which a problem can be solved in parallel. However, they do not translate into efficient real-world parallel applications, since their fine-grain assumption, that each processor can access any memory address in unit time, is unrealistic. To counter this assumption, an interesting alternative approach was introduced, based on the design of bulk-synchronous parallel (BSP) and coarse-grained parallel algorithms (e.g., see [32, 18, 12, 15]). In these models, processors are loosely coupled and loosely synchronized, and communicate over a network that supports message routing.

Another closely related set of previous results are those for (sequential) external-memory (EM) algorithms (e.g., see [1, 33]). Such algorithms explicitly model the I/O complexity of transferring data in blocks of size B between an internal memory of size M and the external memory. Previous external-memory algorithms have considered parallelism mostly in terms of the number of disks [25]; the few that explored multiprocessor designs assumed a PRAM architecture for the processors themselves [24] and, hence, are not well suited for CMPs. Incidentally, many efficient sequential EM algorithms were designed by simulating either PRAM [19] or BSP [30] algorithms, but since the goal was a sequential algorithm, the simulations did not preserve the original parallelism.

To rectify the lack of parallelism in the EM model, Dehne et al. [13, 14] combined the blockwise memory access of the external-memory model with the parallelism of the BSP model. Each processor in the resulting EM-BSP model has a private memory hierarchy, and the processors communicate via message passing, i.e., it is still a coarse-grained parallel model. At a high level, the EM-BSP model is a more accurate version of the BSP model when it comes to working on large data sets that do not fit in the processors' caches, and, just like the BSP model, it is ideal for a distributed setting with a high-bandwidth network connecting the processors. However, the reliance on such a network does not fit modern CMP designs, in which communication is conducted via reads and writes of data in the shared memory.

One could simulate the message passing of the BSP and EM-BSP models via reads and writes to shared memory in CMPs as follows. Each processor p_i reserves a space S_ij for each of the processors p_j that will be sending it data during any of the BSP rounds. Then, p_j writes the data it needs to communicate to p_i into the space S_ij, from which processor p_i reads it later.
However, this simulation can be rather costly from a space-complexity point of view: while each processor in the (EM-)BSP model sends and receives at most O(N/p) items in each round, a processor could potentially use all of the allowed bandwidth to send data
to any single processor. Thus, if the simulation is to be generic for all BSP algorithms, the reserved space S_ij could be as large as O(N/p) for all 1 ≤ i, j ≤ p, i ≠ j, resulting in a total space requirement for the simulation as large as O(pN). For large values of p this can be prohibitively expensive.

More recently, there has been some work on designing more appropriate models for CMPs. Bender et al. [4] proposed a concurrent cache-oblivious model, i.e., one in which algorithms are oblivious to the cache parameters M and B. The authors presented and analyzed a search tree data structure, but focused on the correctness of the data structure in the presence of concurrency rather than on efficiency. Blelloch et al. [5] (building on the work of Chen et al. [6]) proposed the multicore-cache model, which consists of a two-level memory hierarchy: a per-processor private (L1) cache and a larger (L2) cache that is shared among all processors. The authors consider thread-scheduling algorithms for a wide range of problems in the new model; however, their analysis is limited to hierarchical divide-and-conquer problems and a moderate level of parallelism. Chowdhury and Ramachandran consider cache complexity in both private- and shared-cache models for matrix-based computations, including the Floyd-Warshall all-pairs shortest-paths algorithm [9]. They also consider parallel dynamic programming algorithms in private-, shared- and multicore-cache models [10].

In contrast to the cache-oblivious models, Arge et al. [3] proposed the cache-aware parallel external memory (PEM) model. While the cache-oblivious model is more attractive from the algorithm design point of view, just as in the single-processor case it is more manageable to prove tight bounds in the cache-aware variant. The authors provide several basic parallel techniques and solutions to combinatorial problems, such as sorting, which are optimal in all three complexity measures for up to a linear (in the input size N) number of processors. In fact, the sorting algorithm they present remains optimal for a wide range of the parameters p, M and B: it turns into Cole's PRAM merge-sort algorithm at one extreme (when M and B are constants) and into the single-processor external-memory merge sort of Aggarwal and Vitter at the other (when p = 1).

1.2. Our Results

In this paper we continue the exploration of the PEM model and provide a number of efficient graph algorithms. The main contribution of the paper is a solution to the well-known list ranking problem [2, 11, 27], which has traditionally been the linchpin in the design of parallel graph algorithms. The I/O complexity of our algorithm on a list of size N is Θ(sort_p(N)), the time it takes to sort N items of an array on a p-processor PEM machine. This is a speedup of O(p) over the optimal single-processor EM algorithm, and it holds for up to N/(B^2 log B) processors. At a high level our algorithm follows the approach of the optimal PRAM list ranking algorithm, in that the core of the solution is to find a large independent set via deterministic coin tossing. However, to achieve the optimal O(sort_p(N)) I/O complexity, our algorithm avoids sorting each item more than a constant number of times by delaying the processing of items for as long as possible.

The solution to the list ranking problem opens opportunities for the use of the Euler tour technique on trees, which leads to various efficient solutions to problems on trees and graphs.
In particular, we show how to adapt existing PRAM algorithms to solve tree contraction (and, consequently, expression tree evaluation) and batched lowest common ancestor queries
on trees. We also show how to adapt the PRAM algorithms to find connected and biconnected components, ear decomposition and minimum spanning tree on graphs. All resulting solutions exhibit optimal O(p) speedup in the I/O complexity and parallel computational complexity over their single-processor EM counterparts, while maintaining linear overall space complexity.

2. List Ranking

In this section we consider the list ranking problem. We are given a linked list of N items stored in the main memory. Each item is identified by a unique address (ID) and stores the address of its successor. The rank of an item is defined as the number of items in the list from the head of the list to this item. The goal of the list ranking problem is to find the ranks of all items in the list.

We adapt the algorithmic framework of Anderson and Miller [2] and Cole and Vishkin [11] for list ranking in the PRAM model; the same framework underlies the I/O-optimal single-processor list ranking algorithm of Chiang et al. [7]. Initially, we assign each list item a rank of 1. This requires a single scan_p(N) = O(N/pB) I/Os. Then an independent set S of size Θ(N) is computed, and each member of S is bridged out of the list as described in [2]; in the PEM model this step takes O(1) sorts and scans. We then recursively solve the problem on the remaining nodes. As the last step, using O(1) PEM sorts and scans, we reintegrate the nodes of S into the final solution.

If the independent set S can be found in O(sort_p(N)) I/Os, i.e., the I/O complexity of sorting N items on a p-processor PEM machine, then all individual steps of each recursive call can be accomplished in O(sort_p(N)) I/Os, and the I/O complexity of the list ranking algorithm satisfies the recurrence Q(N, p) = Q(N/c, p) + O(sort_p(N)). However, the PEM sorting algorithm of Arge et al. [3] is optimal only for up to N/B^2 processors. Thus, the difficulty arises with the base case of the recurrence, because the sorting rounds of recursive calls on subproblems of size less than pB^2 have suboptimal I/O complexity. Theorem 2.1 shows that by stopping the recursion early enough and reverting to the PRAM list ranking algorithm, list ranking can still be solved in the PEM model within the optimal I/O complexity of O(sort_p(N)), at the expense of increasing the critical path length of the parallel computation by a factor of only O(log B).

Theorem 2.1. If an independent set of size Θ(N) can be found in O(sort_p(N)) I/Os, then list ranking can be solved in the PEM model with the optimal I/O complexity Q(N, p) = O(sort_p(N)), as long as p ≤ N/(B^2 log B) and M = B^{O(1)}.

Proof. The PEM sorting algorithm of Arge et al. exhibits the optimal I/O complexity sort_p(N) = O((N/pB) log_{M/B}(N/B)) for up to N/B^2 processors, as long as M = B^{O(1)}.

For the case when the problem size satisfies pB ≤ N ≤ pB^2, we reduce the number of processors proportionally to the problem size, to maintain the I/O complexity of sorting at O(B log_{M/B}(N/B)) I/Os. Finally, we stop the recursive list ranking algorithm whenever the problem size becomes smaller than pB and revert to the PRAM list ranking algorithm. The PRAM list ranking algorithm with p processors on an input of size pB runs in O(B log pB) parallel time. Even if we charge a full block transfer for each PRAM memory access, the parallel I/O complexity of this step is Q' = O(B log pB). Since p ≤ N/(B^2 log B) and M = B^{O(1)}, the complexity of this step reduces to Q' = O(B log(N/B)) = O(B log B · log_{M/B}(N/B)). Thus, the total I/O complexity of the list ranking algorithm is defined by the following recursion:

    Q(N, p) = Q(N/c, p) + O((N/pB) log_{M/B}(N/B))    if N ≥ pB^2;
    Q(N, p) = Q(N/c, p/c) + O(B log_{M/B}(N/B))       if pB ≤ N < pB^2;
    Q(N, p) = O(B log B · log_{M/B}(N/B))             if N < pB.

The solution to this recursion is Q(N, p) = O((N/pB) log_{M/B}(N/B) + B log B · log(N/pB)). And as long as p ≤ N/(B^2 log B), we have B log B = O(N/pB). Therefore, Q(N, p) = O((N/pB) log_{M/B}(N/B)) = O(sort_p(N)).
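For illustration, here is a sequential Python sketch of the recursive framework just analyzed (assign initial values of 1, bridge out an independent set, recurse, reintegrate). It is our illustration, not the paper's implementation: dictionaries stand in for the O(1) PEM sorts and scans of each step, and a simple greedy routine stands in for the deterministic independent-set construction of the next subsection.

```python
SMALL = 4  # base-case size; the PEM algorithm switches to PRAM list ranking at Theta(pB)

def independent_set(succ):
    # Greedy stand-in for the deterministic 2-ruling-set construction below:
    # any independent set of Theta(N) non-head nodes will do.
    pred = {s: v for v, s in succ.items() if s is not None}
    S = set()
    for v in succ:
        if v in pred and pred[v] not in S and succ[v] not in S:
            S.add(v)  # v is not the head and has no selected neighbor
    return S

def list_rank(succ, head):
    # succ maps each node ID to its successor ID (None for the tail).
    # Returns rank(v) = number of items from the head to v, inclusive.
    return _rank(dict(succ), {v: 1 for v in succ}, head)

def _rank(succ, val, head):
    if len(succ) <= SMALL:          # base case: walk the short list directly
        out, v, r = {}, head, 0
        while v is not None:
            r += val[v]
            out[v] = r
            v = succ[v]
        return out
    S = independent_set(succ)
    pred = {s: v for v, s in succ.items() if s is not None}
    for v in S:                     # bridge out: fold v's value into its successor
        if succ[v] is not None:
            val[succ[v]] += val[v]
        succ[pred[v]] = succ[v]     # the predecessor now skips over v
    sub = {v: s for v, s in succ.items() if v not in S}
    ranks = _rank(sub, {v: val[v] for v in sub}, head)
    for v in S:                     # reintegrate the bridged-out nodes
        ranks[v] = ranks[pred[v]] + val[v]
    return ranks

# Example: the list 10 -> 20 -> 30 -> 40 -> 50.
print(list_rank({10: 20, 20: 30, 30: 40, 40: 50, 50: None}, 10))
# ranks: 10->1, 20->2, 30->3, 40->4, 50->5
```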

The only remaining task is to show that an independent set of size Θ(N) can be found in O(sort_p(N)) I/Os in the PEM model, which we do in the next subsection.

2.1. Independent Set Computation

It is easy to see that the single-processor external-memory randomized solution for finding an independent set via fair coin tossing [7] remains optimal in the PEM model: it finds an independent set of expected size (N − 1)/4 in O(sort_p(N)) PEM I/Os. The deterministic approach to finding a large independent set is via finding a 2-ruling set of the list.

Definition 2.2. An r-ruling set is a subset of the linked list with the following properties: (1) it is an independent set; (2) the number of consecutive unselected items is at most r.

The items of the ruling set are called rulers; each ruler rules up to r of the unselected items that follow it in the list. To find a 2-ruling set we make use of the deterministic coin tossing technique of Cole and Vishkin [11]. A single round of deterministic coin tossing produces an O(log N)-coloring of the list, the color of each item k being the ordered pair tag(k) = (t, bit t of k), where t is the position of the least significant bit in which the address of k differs from the address of its successor (see [11] for details). Given the O(log N)-coloring of the list, we can obtain an O(log N)-ruling set by picking the items whose tags are local minima, i.e., tag(k) ≤ tag(succ(k)) and tag(k) ≤ tag(pred(k)).

Notice that a single round of deterministic coin tossing, as well as finding the O(log N)-ruling set, can be accomplished with O(1) sorts and scans of the list, for a total of O(sort_p(N)) I/Os, by sorting the items by their IDs, their successor IDs and their predecessor IDs and scanning the sorted lists. The predecessor pointers can be calculated in O(sort_p(N)) I/Os by sorting the list twice (once by the nodes' IDs and once by the successor IDs) and scanning through both lists. Thus, for convenience, we assume that in addition to the successor pointer, each node also stores its predecessor pointer and the colors of the node, its successor and its predecessor, all of which can be calculated with O(1) sorts and scans in PEM.

Repeating deterministic coin tossing log* N times results in a 3-coloring of the list, which in turn can be converted into a 2-ruling set by picking the items of the most popular color to be in the ruling set. Unfortunately, the I/O complexity of this approach is O(sort_p(N) · log* N). Below we describe how to find a 2-ruling set with only a constant number of PEM sorting rounds.
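The following sketch (ours; dictionaries again stand in for the I/O-efficient sorts and scans) shows one round of deterministic coin tossing and the local-minima rule on a list whose node IDs serve as addresses:

```python
INF = (float('inf'), 0)  # sentinel tag for missing neighbors

def tag(k, s):
    # tag(k) = (t, bit t of k), where t is the position of the least
    # significant bit in which address k differs from successor address s.
    t = ((k ^ s) & -(k ^ s)).bit_length() - 1
    return (t, (k >> t) & 1)

def log_ruling_set(succ):
    # Nodes whose tag is a local minimum among their neighbors' tags
    # form an O(log N)-ruling set of the list.
    pred = {s: v for v, s in succ.items() if s is not None}
    tags = {v: (tag(v, s) if s is not None else INF) for v, s in succ.items()}
    rulers = set()
    for v in succ:
        left = tags[pred[v]] if v in pred else INF
        right = tags[succ[v]] if succ[v] is not None else INF
        if tags[v] <= left and tags[v] <= right:
            rulers.add(v)
    return rulers

# Example: addresses 12 -> 5 -> 6 -> 13 -> 1 (tail).
print(log_ruling_set({12: 5, 5: 6, 6: 13, 13: 1, 1: None}))  # rulers 12 and 6
```

Note that two adjacent nodes can never have equal tags: if tag(k) = tag(succ(k)) = (t, b), then k and its successor would agree in bit t, contradicting the choice of t. This is why the selected local minima form an independent set.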

[Figure 1: Finding the 2-ruling set. Each group G_i contains the list nodes of color i produced via the deterministic O(log log N)-coloring.]
2.1.1. Finding a 2-ruling set with O(sort_p(N)) I/O complexity. Let p* = N/B^2, the maximum number of processors for which the PEM sorting algorithm achieves optimal speedup. We show how to achieve optimal speedup for list ranking using up to p* processors. We make the simplifying assumption that B = Ω(log N), which is not very restrictive in practice. (It is possible to relax this assumption to B = Ω(log^(t) N), for some constant t, using a technique similar to that of [7].)

The algorithm proceeds as follows:

(1) Create an O(log log N)-ruling set of the list by running deterministic coin tossing twice.
(2) Group the items by their tag values by sorting them lexicographically by tag. Let the first group G_1 be the items of the O(log log N)-ruling set. The construction of the 2-ruling set proceeds in O(log log N) rounds, processing the items of a single group in each round.
(3) For each group G_i, 1 ≤ i ≤ log log N, do the following (see the sketch after this list):
    (a) Sort the items of G_i by their IDs and, for each item k that has duplicates, remove k and all its duplicates. The duplicates were added in step (3c) of previous rounds to indicate that k should not be in the 2-ruling set. (This step does nothing in the case of G_1.)
    (b) Mark the remaining items as members of the 2-ruling set and pack them into a contiguous array T.
    (c) Sort the array T lexicographically, with the successors' tag values as the primary sorting criterion and the successors' IDs as the secondary one. For each successor j, add a duplicate dummy node j to the group of j's tag value, as an indication that j should not be in the 2-ruling set.
    (d) Repeat step (3c) for the predecessors as well.
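A sequential sketch of these rounds (ours; it assumes the coloring of step (1) is given as `groups`, with `groups[0]` the O(log log N)-ruling set, and uses Python sets in place of the PEM sorts, scans and prefix sums):

```python
def two_ruling_set(succ, pred, groups):
    # groups[i] holds the nodes of color i; nodes of one color are pairwise
    # non-adjacent, so no two neighbors are ever selected in the same round.
    group_of = {v: i for i, g in enumerate(groups) for v in g}
    excluded = [set() for _ in groups]   # stands in for the duplicate dummies
    selected = set()
    for i, g in enumerate(groups):
        # Step (3a): drop nodes that received a duplicate in an earlier round.
        alive = [v for v in g if v not in excluded[i]]
        # Step (3b): the survivors join the 2-ruling set.
        selected.update(alive)
        # Steps (3c)-(3d): defer the exclusion of each survivor's neighbors
        # by planting a duplicate in the later group they belong to.
        for v in alive:
            for u in (succ.get(v), pred.get(v)):
                if u is not None and group_of[u] > i:
                    excluded[group_of[u]].add(u)
    return selected

# Example: list 1 -> 2 -> ... -> 6 with groups [[1, 4], [2, 5], [3, 6]].
succ = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: None}
pred = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5}
print(two_ruling_set(succ, pred, [[1, 4], [2, 5], [3, 6]]))  # nodes 1, 4 and 6
```

Every node is either selected or excluded because of a selected neighbor, which is exactly the content of the two lemmas below.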

The correctness of the algorithm follows from the following two lemmas, which establish the two parts of the definition of a 2-ruling set.

Lemma 2.3. No two neighboring elements are selected, i.e., the selected items form an independent set.

Proof. The O(log log N)-ruling set of step (1) partitions the linked list into sublists, with the head of each sublist chosen to be in the 2-ruling set. No two neighbors can be selected in the same round, because each group contains items with the same tag after the deterministic coin tossing of step (1); items with the same tag form an independent set and, therefore, are not neighbors. Across rounds, the algorithm explicitly excludes the successors and predecessors of the items that are already in the 2-ruling set.

Lemma 2.4. Every unselected element has a selected neighbor.

Proof. The only reason an item is excluded is that it has a selected neighbor.

Lemma 2.5. If p ≤ p* and M = B^{O(1)}, then the I/O complexity of the algorithm is O(sort_p(N)) = O((N/pB) log_{M/B}(N/B)).

Proof. Steps (1) and (2) take O(1) sorts and scans of the whole list, i.e., O(sort_p(N)) I/Os. Let us analyze the I/O complexity of each round of step (3). Each item has at most 2 neighbors; thus, there are at most 2 duplicates of each item k, and all of them are placed together after the sort. Therefore, step (3a) takes a sort and a scan of the N_i items of group G_i: O(sort_p(N_i)) I/Os. Step (3b) takes O(1) scans and a call to the PEM prefix sums routine to compute the size of the array T; its total I/O complexity is O(N_i/pB + log p).

Step (3c) is the most complicated step of the algorithm. Its goal is to defer updating the (non-)membership status of the successors in the 2-ruling set until the group to which those successors belong is processed. This is accomplished by appending duplicates of the successor nodes to the end of the group to which they belong. To determine the addresses of the locations where the duplicates are to be written, the processors run prefix sums on the sizes of the sets of duplicate items and the sizes of the groups. Also note that each processor writes at most the number of items for which it is responsible, but it can write to O(log log N) different groups. Thus, the I/O complexity of this step is O(sort_p(N_i) + scan_p(N_i) + log p + log log N) = O(sort_p(N_i) + log p + log log N). The I/O complexity of step (3d) is the same.

Thus, each round takes O(sort_p(N_i) + log p + log log N) I/Os, where N_i is the size of group G_i. There are at most two duplicates of each item, and each item participates in at most a constant number of sorts. Therefore, the total I/O complexity is

    Σ_{i=1}^{log log N} O(sort_p(N_i) + log p + log log N) = O(sort_p(N) + log p · log log N + (log log N)^2).

And since M = B^{O(1)} and p ≤ p* = N/B^2, sort_p(N) = Ω(B log_{M/B}(N/B)) = Ω(B log N / log B), which is the dominating term since we assumed that B = Ω(log N).

Theorem 2.6. A linked list of size N can be ranked in the PEM model with I/O complexity O(sort_p(N)) using up to p = N/(B^2 log B) processors.

Proof. Follows directly from Lemma 2.5 and Theorem 2.1.

In the weighted list ranking problem, each edge of the list has an arbitrary weight, and the rank of a node is the sum of the weights of the edges on the path from the head of the list to the node. The weighted list ranking problem can be solved similarly.
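In terms of our earlier sketch, only the initialization of the node values changes: each node starts with the weight of its incoming edge instead of 1 (a hypothetical map `w`, with the head carrying weight 0), while the bridging and reintegration steps are untouched.

```python
def weighted_list_rank(succ, head, w):
    # w[v] is the weight of the edge entering v; the head is treated as 0.
    # _rank is the helper from the sketch following Theorem 2.1.
    return _rank(dict(succ), {v: w.get(v, 0) for v in succ}, head)
```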

3. Applications

In this section we describe several applications of the list ranking algorithm of the previous section. At a high level, our approach combines the techniques of the corresponding PRAM algorithms with the I/O-optimal single-processor EM algorithms. The result is an O(p) speedup in the I/O complexity of the parallel algorithms compared to the corresponding single-processor algorithms.

3.1. Euler Tour and Basic Tree Problems

Most external-memory graph algorithms make use of the Euler tour technique invented by Tarjan and Vishkin [31] for the PRAM model, and the PEM graph algorithms are no exception. Given a tree T and a special vertex r of T, an Euler tour of T is defined as a traversal of T that starts and ends at r and visits every edge exactly twice, once in each direction.

Theorem 3.1. In the p-processor PEM model, an Euler tour of a tree can be built with O(sort_p(N)) I/O complexity.

Proof. The single-processor EM solution for building an Euler tour replaces each undirected edge with two directed edges, sorts them lexicographically and scans the edges, assigning an appropriate successor to each incoming edge. This can be accomplished in the PEM model by utilizing the PEM versions of sorting and scanning, each of which takes at most O(sort_p(N)) I/Os.

Theorem 3.2. In the p-processor PEM model the following problems can be solved with O(sort_p(N)) I/O complexity using up to p = N/(B^2 log B) processors: rooting a tree, determining the preorder and postorder numbering of the vertices, the depth of each vertex and the sizes of the subtrees rooted at each vertex.

Proof. The single-processor EM solution for each of these problems consists of building an Euler tour (sketched below), running weighted list ranking with appropriate edge weights, and a constant number of sorts and scans. All these problems can therefore be solved in the PEM model by utilizing the appropriate PEM versions of each subroutine, each of which takes at most O(sort_p(N)) I/Os using up to p = N/(B^2 log B) processors.
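A sequential sketch of the Euler tour construction (ours): a single Python sort stands in for the PEM sort, and the successor of each incoming edge (u, v) is the outgoing edge of v that follows the twin (v, u) in the sorted, cyclic order around v.

```python
def euler_tour(edges):
    # edges: undirected tree edges as (u, v) pairs.
    # Returns succ mapping each directed edge to the next edge of the tour.
    directed = sorted([(u, v) for u, v in edges] + [(v, u) for u, v in edges])
    out = {}                       # outgoing edges of each vertex; the groups are
    for u, v in directed:          # contiguous after the lexicographic sort, so a
        out.setdefault(u, []).append((u, v))  # scan suffices in the PEM model
    succ = {}
    for u, v in directed:
        ring = out[v]
        i = ring.index((v, u))                    # position of the twin edge
        succ[(u, v)] = ring[(i + 1) % len(ring)]  # next outgoing edge of v, cyclically
    return succ

# Example: the path 0 - 1 - 2 yields the tour (0,1)(1,2)(2,1)(1,0) rooted at 0.
print(euler_tour([(0, 1), (1, 2)]))
```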

3.2. Tree Contraction and Expression Tree Evaluation

Having a PEM solution to the Euler tour technique provides us with a direct PEM solution to tree contraction, whose main application is expression tree evaluation.

Theorem 3.3. Tree contraction and expression tree evaluation can be solved in the PEM model with O(sort_p(N)) I/O complexity using up to N/(B^2 log B) processors.

Proof. The EREW PRAM solution to tree contraction by Gazit et al. [16] decomposes the tree into m-bridges, subtrees of size at most m = O(N/p). Each m-bridge contains at most one unknown leaf vertex; therefore, each m-bridge can be processed independently of the others, collapsing it into a single vertex. Finally, the p processors contract the resulting tree of size O(p). The definition of an m-bridge allows it to be laid out contiguously in memory by storing the vertices of the tree in postorder. Thus, all m-bridges can be processed by each processor scanning O(N/p) contiguous memory locations, and the tree can be collapsed into a p-vertex tree in O(sort_p(N)) I/Os using up to p ≤ N/(B^2 log B) processors. Finally, we run the EREW PRAM tree contraction algorithm on the p-vertex tree using p processors, charging one I/O for each PRAM memory access, for a total of O(log p) I/Os. Since p ≤ N/(B^2 log B), the I/O complexity of the sorting steps dominates the O(log p) bound of the final step, and the total I/O complexity of tree contraction is O(sort_p(N)).

3.3. Lowest Common Ancestors

Since the single-processor I/O-efficient solution to the Lowest Common Ancestor (LCA) problem is an adaptation of the PRAM solution, it remains efficient in the PEM model as long as the individual parts of the solution are executed using the corresponding PEM algorithms. The solution reduces the LCA problem to the range-minima problem, constructs a complete (M/B)-ary search tree with O(log_{M/B}(N/B)) levels, and maintains at each internal node the prefix and suffix minima of the leaves of the subtrees rooted at that node, as well as a list of the M/B minima of the subtrees rooted at each child of the node. The K batched queries are performed by first sorting them, so that all of the queries can be answered by scanning the search tree a constant number of times. By implementing the sort and scans using p processors, the K batched LCA queries can be answered in the PEM model in O((1 + K/N) sort_p(N)) I/Os.

3.4. Connected and Biconnected Components, Ear Decomposition and Minimum Spanning Tree

In this section we show that several graph connectivity problems can be solved in the PEM model with an O(p) speedup in the I/O complexity over the corresponding single-processor EM algorithms, using up to p ≤ (|V| + |E|)/(B^2 log^2 B) processors.

Theorem 3.4. Given an undirected graph G = (V, E), the connected components of the graph can be computed in the PEM model in O(sort_p(|V|) + sort_p(|E|) · log(|V|/pB)) I/Os using up to p ≤ (|V| + |E|)/(B^2 log^2 B) processors.

Proof. The PEM connected components algorithm is an adaptation of the single-processor external-memory algorithm of Chiang et al. [7], which in turn is based on the PRAM algorithm of Chin et al. [8]. Let |V| = n and |E| = m.

Consider the single-processor external-memory algorithm for finding the connected components of a graph G = (V, E). For each vertex v ∈ V, consider the edge {v, w_v} ∈ E, where w_v is the neighbor of v with the smallest address/ID. The subgraph H of G induced by these edges is a forest, and the connected components of H can be found and compressed into individual (super-)vertices via tree contraction. The algorithm then recursively computes the connected components of the resulting graph. At each step the algorithm reduces the number of vertices by a factor of two, and the recursion terminates when the graph contains only one vertex. (A sequential sketch of this contraction scheme is given after the proof.)

The PEM version implements the above algorithm using the PEM versions of the corresponding subroutines. Unlike the single-processor version, the PEM algorithm terminates the recursion and reverts to the O(log n) PRAM connectivity algorithm of Shiloach and Vishkin [29] as soon as the graph has fewer than O(pB) vertices. The PEM list ranking and tree contraction algorithms can use at most N/(B^2 log B) processors while maintaining the optimal O(sort_p(N)) I/O complexity. Thus, once the size of the graph reaches O(pB^2 log B) vertices, the number of processors involved in the computation must be reduced proportionally to the size of the graph, resulting in an I/O complexity of O(B log B · log_{M/B}((n + m)/B)) for each application of the list ranking or tree contraction algorithms. Thus, the total I/O complexity of the algorithm is defined by the recurrence:

    Q(n, m, p) = Q(n/2, m, p) + O(((n + m)/pB) log_{M/B}((n + m)/B))    if n ≥ pB^2 log B;
    Q(n, m, p) = Q(n/2, m, p/2) + O(B log B · log_{M/B}((n + m)/B))     if pB ≤ n < pB^2 log B;
    Q(n, m, p) = O(B log B · log_{M/B}((n + m)/B))                      if n < pB.

It solves to Q(n, m, p) = O((n/pB) log_{M/B}((n + m)/B) + (m/pB) log_{M/B}((n + m)/B) · log(n/pB) + B log^2 B · log_{M/B}((n + m)/B)). And as long as p ≤ (n + m)/(B^2 log^2 B), we have B log^2 B = O((n + m)/pB). Therefore, Q(n, m, p) = O(sort_p(n) + sort_p(m) · log(n/pB)).
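The following sequential sketch (ours) illustrates the contraction scheme from the proof: each vertex hooks to its minimum-ID neighbor, each tree of the resulting forest H is collapsed to its smallest vertex, and the loop plays the role of the recursion. The pointer chasing in `find_root` stands in for the list-ranking and tree-contraction subroutines; it works because every hook cycle has length exactly two (mutual minimum neighbors).

```python
def find_root(v, hook):
    # Follow hook pointers until the length-2 cycle of the tree is reached;
    # the smaller endpoint of that cycle represents the component of H.
    seen = set()
    while v not in seen:
        seen.add(v)
        v = hook[v]
    return min(v, hook[v])

def connected_components(vertices, edges):
    # Returns a map from each vertex to its component representative.
    label = {v: v for v in vertices}
    active = set(vertices)
    while edges:
        hook = {}
        for u, v in edges:               # hook every endpoint to its min neighbor
            hook[u] = min(hook.get(u, v), v)
            hook[v] = min(hook.get(v, u), u)
        for v in active - set(hook):     # supervertices with no incident edge
            hook[v] = v
        rep = {v: find_root(v, hook) for v in active}
        label = {x: rep.get(label[x], label[x]) for x in vertices}
        # Contract: keep one copy of each edge between distinct supervertices.
        edges = list({tuple(sorted((rep[u], rep[v])))
                      for u, v in edges if rep[u] != rep[v]})
        active = set(rep.values())
    return label

# Example: two components {0, 1, 2} and {5, 6}.
print(connected_components([0, 1, 2, 5, 6], [(1, 2), (0, 2), (5, 6)]))
# {0: 0, 1: 0, 2: 0, 5: 5, 6: 5}
```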

Theorem 3.5. The following problems on an undirected connected graph G = (V, E) can be solved in O(sort_p(|V|) + sort_p(|E|) · log(|V|/pB)) I/Os in the PEM model using up to p ≤ (|V| + |E|)/(B^2 log^2 B) processors: finding a minimum spanning tree, the biconnected components, and an ear decomposition (if the graph is biconnected).

Proof. The solution for finding connected components is easily augmented to solve the minimum spanning tree problem. In particular, when building the subgraph H, instead of picking the neighbor w_v with the smallest address/ID, the algorithm picks the one incident on the edge with the smallest weight, breaking ties lexicographically using the addresses/IDs of the edge's endpoints as the secondary criterion. Then the MST consists of all the edges of H plus the smallest-weight edges between the connected components found during the recursive calls.

The biconnected components algorithm of Tarjan and Vishkin [31] requires generating an arbitrary spanning tree, evaluating an expression tree, and computing the connected components of a newly generated graph. The ear decomposition algorithm of Maon et al. [23] requires generating an arbitrary spanning tree, performing batched lowest common ancestor queries and evaluating an expression tree. Using the appropriate PEM subroutines provides the PEM solutions to each of these problems.

Note that, while the number of vertices is reduced by a factor of 2 between the iterations of the above algorithms, no good upper bound can be given on the rate of reduction of the edges for general graphs. However, if the graphs are sparse and closed under contraction, the rate of decrease of the edges between iterations matches that of the vertices. Thus, we obtain the following result for sparse graphs:

Theorem 3.6. For sparse graphs G = (V, E) closed under contraction, the connected components, the minimum spanning tree (if G is connected), the biconnected components and the ear decomposition (if G is biconnected) can be computed in O(sort_p(|V|)) I/Os using up to p ≤ (|V| + |E|)/(B^2 log^2 B) processors.

While the critical path length of our solutions is within only a polylogarithmic factor of the block size B of the critical path length of the sorting algorithm of Arge et al., finding algorithms with the same I/O complexity for up to p ≤ N/B^2 processors remains an open problem.

References

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31:1116–1127, 1988.
[2] R. Anderson and G. Miller. Deterministic parallel list ranking. In Proc. 3rd Aegean Workshop on Computing, volume 319 of Lecture Notes Comput. Sci., pages 81–90. Springer-Verlag, 1988.
[3] L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In Proc. 20th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 197–206, New York, NY, USA, 2008. ACM.
[4] M. A. Bender, J. T. Fineman, S. Gilbert, and B. C. Kuszmaul. Concurrent cache-oblivious B-trees. In Proc. 17th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 228–237, New York, NY, USA, 2005. ACM.
[5] G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In Proc. 19th ACM-SIAM Symposium on Discrete Algorithms, 2008.
[6] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In Proc. 19th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 105–115, New York, NY, USA, 2007. ACM.
[7] Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In Proc. 6th ACM-SIAM Symposium on Discrete Algorithms, pages 139–149, 1995.
[8] F. Y. Chin, J. Lam, and I.-N. Chen. Efficient parallel algorithms for some graph problems. Commun. ACM, 25(9):659–665, 1982.
[9] R. A. Chowdhury and V. Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In Proc. 19th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 71–80, New York, NY, USA, 2007. ACM.
[10] R. A. Chowdhury and V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In Proc. 20th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 207–216, New York, NY, USA, 2008. ACM.
[11] R. Cole and U. Vishkin. Deterministic coin tossing with applications to optimal list ranking. Inform. Control, 1:153–174, 1986.
[12] F. Dehne. Coarse grained parallel algorithms. Special Issue of Algorithmica, 24(3):173–176, 1999.
[13] F. Dehne, W. Dittrich, and D. Hutchinson. Efficient external memory algorithms by simulating coarse-grained parallel algorithms. Algorithmica, 36:97–122, 2003.
[14] F. Dehne, W. Dittrich, D. Hutchinson, and A. Maheshwari. Bulk-synchronous parallel algorithms for the external memory model. Theory of Computing Systems, 35:567–597, 2002.
[15] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel computational geometry for coarse grained multicomputers. International Journal of Computational Geometry and Applications, 6(3):379–400, 1996.
[16] H. Gazit, G. L. Miller, and S.-H. Teng. Optimal tree contraction in an EREW model. In Concurrent Computations: Algorithms, Architecture and Technology, pages 139–156, New York, 1988. Plenum Press.
[17] D. Geer. Chip makers turn to multicore processors. IEEE Computer, 38(5):11–13, 2005.
[18] A. Gerbessiotis and L. Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, 22(2):251–267, 1994.

[19] M. T. Goodrich. Communication-efficient parallel sorting. SIAM Journal on Computing, 29(2):416–432, 2000.
[20] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, Mass., 1992.
[21] R. M. Karp and V. Ramachandran. Parallel algorithms for shared memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, pages 869–941. Elsevier/The MIT Press, Amsterdam, 1990.
[22] R. Kumar, V. Zyuban, and D. Tullsen. Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling. In Proc. 32nd International Symposium on Computer Architecture (ISCA), pages 408–419, 2005.
[23] Y. Maon, B. Schieber, and U. Vishkin. Parallel ear decomposition search (EDS) and st-numbering in graphs. Theoret. Comput. Sci., 47:277–298, 1986.
[24] M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proc. 5th ACM Symposium on Parallel Algorithms and Architectures, pages 120–129, 1993.
[25] M. H. Nodine and J. S. Vitter. Greed sort: An optimal sorting algorithm for multiple disks. J. ACM, 42(4):919–933, July 1995.
[26] J. Rattner. Multi-core to the masses. In Proc. 14th International Conference on Parallel Architectures and Compilation Techniques (PACT), page 3, 2005.
[27] M. Reid-Miller, G. L. Miller, and F. Modugno. List ranking and parallel tree contraction. In J. H. Reif, editor, Synthesis of Parallel Algorithms, pages 115–194. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[28] J. H. Reif. Synthesis of Parallel Algorithms. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[29] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algorithms, 3(1):57–67, 1982.
[30] J. Sibeyn and M. Kaufmann. BSP-like external-memory computation. In Proc. 3rd Italian Conference on Algorithms and Complexity, pages 229–240, 1997.
[31] R. Tarjan and U. Vishkin. An efficient parallel biconnectivity algorithm. SIAM J. Comput., 14:862–874, 1985.
[32] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990.
[33] J. Vitter. External memory algorithms. In Proc. 6th Annual European Symposium on Algorithms, pages 1–25, 1998.
