Algorithmica (2000) 26: 237–254
DOI: 10.1007/s004539910011
© 2000 Springer-Verlag New York Inc.

Sorting-Based Selection Algorithms for Hypercubic Networks¹

P. Berthomé,²,³ A. Ferreira,² B. M. Maggs,⁴ S. Perennes,²,⁵ and C. G. Plaxton⁶

Abstract. This paper presents several deterministic algorithms for selecting the kth largest record from a set of n records on any n-node hypercubic network. All of the algorithms are based on the selection algorithm of Cole and Yap, as well as on various sorting algorithms for hypercubic networks. Our fastest algorithm runs in O(lg n lg∗ n) time, very nearly matching the trivial Ω(lg n) lower bound. Previously, the best upper bound known for selection was O(lg n lg lg n). A key subroutine in our O(lg n lg∗ n)-time selection algorithm is a sparse version of the Sharesort algorithm that sorts n records using p processors, p ≥ n, in O(lg n (lg lg p − lg lg(p/n))²) time.

Key Words. Selection, Hypercube, Parallel algorithms.

1. Introduction. This paper presents several algorithms for solving the selection problem on hypercubic networks. The input to the selection problem is a set S of n records and an integer k. The goal is to find the kth smallest record in S. This problem is also called the order statistics problem. The algorithms in this paper run on the hypercube or on any of its bounded-degree derivatives, including the butterfly, cube-connected cycles, and shuffle-exchange network. The fastest runs in O(lg n lg∗ n) time on an n-node network. (Throughout the paper, we use lg n to denote log₂ n, and we use lg∗ n to denote the "log-star" function defined by lg∗ n = min{i ≥ 0 : lg^(i) n ≤ 1}, where lg^(i) n denotes the ith iterated logarithm of n.) The fastest previously known algorithm ran in O(lg n lg lg n) time [9]. The algorithms use a technique called successive sampling, which was previously used by Cole and Yap [5] to solve the selection problem in an idealized model of computation called the parallel comparison model.
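As a point of reference, here is a minimal Python sketch (ours, not from the paper) of the log-star function just defined:

    import math

    def lg_star(n):
        # lg* n: how many times lg must be applied to n before the
        # result drops to 1 or below
        count = 0
        x = n
        while x > 1:
            x = math.log2(x)
            count += 1
        return count

    print([lg_star(n) for n in (2, 16, 65536, 2**20)])   # [1, 3, 4, 5]

The function grows extraordinarily slowly: lg∗ n ≤ 5 for every n up to 2^65536, which is why an O(lg n lg∗ n) bound is very nearly O(lg n).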

1 Work by the first author was supported by the French CNRS Coordinated Research Program on Parallelism C3 and the French PRC/GDR MATHINFO. Work by the fifth author was supported by NSF Research Initiation Award CCR-9111591, the Texas Advanced Research Program (TARP) under Grant Nos. 93-003658-461 and 91-003658-480, and the NEC Research Institute.
2 Laboratoire de l'Informatique du Parallélisme, CNRS, Ecole Normale Supérieure de Lyon, 46, Allée d'Italie, 69364 Lyon Cedex 07, France. [email protected], www.ens-lyon.fr/~ferreira.
3 Current address: LRI, Bât 490, Université Paris-Sud, 91405 Orsay Cedex, France. [email protected], www.lri.fr/people/berthome.html.
4 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA. [email protected], www.cs.cmu.edu/~bmm. This work was performed while the author was at NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA.
5 I3S, CNRS, rue A. Einstein, Sophia Antipolis, 06560 Valbonne, France. [email protected], www.i3s.unice.fr/~sp.
6 Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA. [email protected], www.cs.utexas.edu/users/plaxton.

Received March 23, 1994; revised October 30, 1997. Communicated by Chee-Keng Yap.


The algorithms also use as subroutines sorting algorithms for hypercubic networks due to Nassimi and Sahni [8] and Cypher and Plaxton [6].

1.1. Hypercubic Networks. A hypercube contains n = 2^d nodes, each of which has a distinct d-bit label (d must be a nonnegative integer). A node labeled b_0 ··· b_{d−1} is connected by edges to those nodes whose labels differ from b_0 ··· b_{d−1} in exactly one bit position. An edge connecting two nodes whose labels differ in bit i is called a dimension i edge. Each node has d neighbors, one for each dimension. A subcube of the hypercube is formed by fixing the bit values of the labels in some subset of the d dimensions of the hypercube, and allowing the bit values in the other dimensions to vary. In particular, for each subset j_0, ..., j_{k−1} of the set of dimensions {0, ..., d − 1}, and each set of bit values v_0, ..., v_{k−1}, there is a (d − k)-dimensional subcube of the hypercube consisting of the n/2^k nodes whose labels have value v_i in dimension j_i, 0 ≤ i < k, and the edges connecting those nodes.

The nodes in a hypercube represent processors, and the edges represent wires. Each processor has some local memory organized in O(d)-bit words. At each time step, a processor can send a word of data to one of its neighbors, receive a word of data from one of its neighbors, and perform a local operation on word-sized operands. In sorting and selection problems, the input consists of a number of O(1)-word records. Each record has an associated key that determines its rank in the entire set of records. We assume throughout that all keys are unique. This assumption entails no loss of generality, since ties can always be broken in a consistent manner by appending the initial address (processor and memory location) of each record to its key.

All of the algorithms described in this paper use the edges of the hypercube in a very restricted way. At each time step, only the edges associated with a single dimension are used, and consecutive dimensions are used on consecutive steps. Such algorithms are called normal [7, Section 3.1.4]. The bounded-degree variants of the hypercube, including the butterfly, cube-connected cycles, and shuffle-exchange network, can all simulate any normal hypercube algorithm with constant slowdown [7, Sections 3.2.3 and 3.3.3]. For simplicity, we describe all of the algorithms in terms of the hypercube.

1.2. Selection Refinement. Like most selection algorithms, the algorithms in this paper use a technique called selection refinement. Given a set S of n records and an integer k, 0 ≤ k < n, a selection refinement algorithm finds the key with rank k as follows. First, the algorithm computes lower and upper approximations to the desired record. A lower approximation to the record of rank k is a record with rank less than or equal to k; an upper approximation to the record of rank k is a record with rank greater than or equal to k. Second, the algorithm extracts the subset S′ of S consisting of all records between the lower and upper approximations. (The lower and upper approximations are considered to be "good" if |S′| is much smaller than |S|.) Third, the algorithm computes an integer k′ such that the record with rank k′ in S′ has rank k in S. Finally, the algorithm recursively finds the element of rank k′ in S′, as in the sketch below.
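To make the four steps concrete, here is a minimal sequential Python sketch of the selection-refinement loop (entirely ours: the crude sampling rule below carries no worst-case guarantee, whereas the paper derives provably good approximations via successive sampling):

    def rank(alpha, S):
        # number of records in S strictly smaller than alpha (Definition 2.1 in Section 2)
        return sum(1 for x in S if x < alpha)

    def select(S, k, base=32):
        # Return the record of rank k (0-indexed) in S, assuming unique keys.
        S = list(S)
        while len(S) > base:
            candidates = sorted(S[::max(1, len(S) // 64)])      # a crude sample of S
            # lower approximation: largest candidate with rank <= k
            lo = max((c for c in candidates if rank(c, S) <= k), default=min(S))
            # upper approximation: smallest candidate with rank >= k
            hi = min((c for c in candidates if rank(c, S) >= k), default=max(S))
            T = [x for x in S if lo <= x <= hi]   # records between the approximations
            if len(T) == len(S):                  # no progress: fall back to sorting
                break
            k -= rank(lo, S)                      # rank of the target within T
            S = T
        return sorted(S)[k]

Note that k is adjusted by exactly the number of discarded records smaller than the lower approximation, so the invariant "the answer has rank k in S" is preserved across iterations.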


1.3. Successive Sampling. Selection refinement algorithms differ in the method used to find lower and upper approximations. Our algorithm uses a technique called successive sampling, which also underlies the algorithm of Cole and Yap [5]. Given a set S of n records and an integer k, 0 ≤ k < n, a successive sampling algorithm computes lower and upper approximations as follows. First, the algorithm partitions the set S = S_0 into n/s groups of size s (in an arbitrary fashion) and sorts each group. Second, a new set S_1 ⊆ S of nt/s records is formed by taking t evenly spaced records from each group. This sampling process is repeatedly applied to obtain a subset S_2 of S_1, a subset S_3 of S_2, and so on, until the set of remaining elements S′ ⊆ S is sufficiently small that it can be sorted efficiently. (Different values of the parameters s and t may be used at each "level" of sampling.) Finally, the lower (resp., upper) approximation is chosen to be the largest (resp., smallest) record in S′ whose rank in S is guaranteed (by properties of the successive sampling process) to be less (resp., greater) than or equal to k.
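For illustration, the following sequential Python sketch (ours; the function name and the concrete parameters are not from the paper) performs one level of sampling:

    def one_level(S, s, t):
        # Partition S into groups of size s, sort each group, and keep
        # t evenly spaced records per group (every (s/t)-th record).
        assert len(S) % s == 0 and s % t == 0
        sample = []
        for i in range(0, len(S), s):
            group = sorted(S[i:i + s])   # the paper sorts all groups in parallel
            sample.extend(group[::s // t])
        return sample

Applied repeatedly with, say, s = 64 and t = 8, each level shrinks the record set by a factor of 8, so 2^18 records fall below 2^10 after three levels.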

1.4. Previous Work. The selection problem is closely related to the sorting problem. On the one hand, it is obvious that any sorting algorithm can be used for selection. On the other hand, the performance of selection refinement algorithms depends heavily on the cost of sorting "small" sets of records (i.e., sorting n records using p ≫ n processors). For the hypercube, the fastest n-record n-processor sorting algorithm known is the Sharesort algorithm of Cypher and Plaxton [6], which runs in O(lg n (lg lg n)²) time. (A nonuniform version of the Sharesort algorithm runs in O(lg n lg lg n) time [6].) In addition to Sharesort, we make use of Nassimi and Sahni's sparse enumeration sort [8], which sorts n records on p processors, p ≥ n, in O((lg n lg p)/(lg p − lg n)) time. (Note that sparse enumeration sort runs in optimal O(lg n) time if p ≥ n^(1+ε) for some positive constant ε; the short calculation at the end of this subsection makes this explicit.)

The fastest previously known algorithm for solving the selection problem on a hypercubic network is due to Plaxton and runs in O(lg n lg lg n) time on an n-node network [9]. Of course, the selection problem can also be solved in O(lg n (lg lg n)²) time using Sharesort. Plaxton also showed that any deterministic algorithm for solving the selection problem on a p-processor hypercubic network requires Ω((n/p) lg lg p + lg p) time in the worst case [9]. Since the selection problem can be solved in linear time sequentially [3], the lower bound implies that it is not possible to design a deterministic hypercubic selection algorithm with linear speedup. For n = p the lower bound is Ω(lg n), which is the diameter of the network.

In [10] Valiant proved an Ω(lg lg n) lower bound on the time to find the largest record in a set of n records using n processors in an idealized model called the parallel comparison model. The lower bound implies a lower bound on the time to select the kth smallest record as well. Valiant also showed how to find the largest record in O(lg lg n) time. Cole and Yap [5] then described an O((lg lg n)²) selection algorithm for this model. The running time was later improved to O(lg lg n) by Ajtai et al. [1]. The comparisons performed by the latter algorithm are specified by an expander graph, however, making it unlikely that this algorithm can be efficiently implemented on a hypercubic network.

A different set of upper and lower bounds holds in the PRAM models. Beame and Håstad [2] proved an Ω(lg n / lg lg n) lower bound on the time for selection in the CRCW comparison PRAM using a polynomial number of processors. Vishkin [11] discovered an O(lg n lg lg n)-time PRAM algorithm that uses O(n/(lg n lg lg n)) processors. The algorithm is work-efficient (i.e., exhibits optimal speedup) because the processor–time product is equal to the time, O(n), of the fastest sequential algorithm for this problem. Cole [4] later found an O(lg n lg∗ n)-time work-efficient PRAM algorithm.
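To spell out the parenthetical claim about sparse enumeration sort (a routine calculation, not part of the original argument): if p = n^(1+ε), then lg p = (1 + ε) lg n, so

    (lg n lg p)/(lg p − lg n) = ((1 + ε) lg² n)/(ε lg n) = ((1 + ε)/ε) lg n = O(lg n)

for any fixed ε > 0, with the hidden constant growing as ε → 0.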


1.5. Outline. A "basic" selection algorithm that runs in O(lg n lg lg n) time is presented in Section 2. Several faster selection algorithms are presented in the remainder of the paper. Pseudocode for these faster selection algorithms is provided in Section 3. Running times of O(lg n lg^(3) n) and O(lg n lg^(4) n) are established in Sections 4 and 5, respectively. An O(lg n lg∗ n) algorithm is presented in Section 6. This time bound is obtained at the expense of using a nonuniform variant of the Sharesort algorithm [6] that requires a certain amount of preprocessing. Finally, in Section 7 we show how to avoid the nonuniformity introduced in Section 6.

2. An O(lg n lg lg n) Selection Algorithm. In this section we develop an efficient subroutine for selection refinement based on the parallel comparison model algorithm of Cole and Yap [5]. There are two major differences. First, we use Nassimi and Sahni's sparse enumeration sort [8] instead of a constant-time sort (as is possible in the parallel comparison model). Second, we obtain a total running time that is proportional to the running time of the largest call to sparse enumeration sort, whereas in the Cole–Yap algorithm the running time is proportional to the number of sorts, O(lg lg n), each of which costs constant time.

As in the Cole–Yap algorithm, the selection refinement algorithm proceeds by successively sampling the given set of records. We define "sample 0" as the entire set of records. At the ith stage of the selection refinement algorithm, i ≥ 0, a "subsample" is extracted from sample i. This subsample represents sample i + 1, and is a proper subset of sample i. Hence the sequence of sample sizes is monotonically decreasing. The sampling process terminates at a value of i for which the ith sample is sufficiently small that it can be sorted in logarithmic time (using sparse enumeration sort). From this final sample, we extract lower and upper approximations to the desired record. Our goal is to obtain "good" upper and lower approximations in the sense that the ranks of our approximations are close to k.

The following approach is used to extract sample i + 1 from sample i. First, the records of sample i are partitioned into a number of equal-sized groups, and each group is assigned an equal fraction of the processors. Second, each group of records is sorted using sparse enumeration sort. The number of groups is determined in such a way that the running time of sparse enumeration sort is logarithmic in the group size. This is the case, for example, if sparse enumeration sort is used to sort m² records in a subcube with m³ processors. Letting s denote the group size, the third step is to extract approximately √s uniformly spaced records (i.e., every √s-th record) from each group. The union of these extracted sets of size √s forms sample i + 1. Note that the ratio of the size of sample i to that of sample i + 1 is √s.

Before proceeding, we introduce a couple of definitions.

DEFINITION 2.1. The rank of a record α in a set S, rank(α, S), is equal to the number of records in S that are strictly smaller than α. (Note that the record α may or may not belong to the set S.)
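A direct Python transcription of Definition 2.1 (ours; it repeats the helper used in the sketch of Section 1.2):

    def rank(alpha, S):
        # Definition 2.1: the number of records in S strictly smaller than alpha;
        # alpha itself may or may not belong to S
        return sum(1 for x in S if x < alpha)

    assert rank(5, [3, 9, 1, 7]) == 2   # 3 and 1 are smaller than 5
    assert rank(5, [3, 9, 5, 7]) == 1   # whether 5 is in S does not matter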


DEFINITION 2.2. An r-sample of a set of records S is the subset R of S consisting of those records with ranks in S congruent to 0 modulo r, i.e., R = {α ∈ S | rank(α, S) = ir, 0 ≤ i < |S|/r}.

2.1. Pseudocode for the Sampling Procedure. The input to procedure Sample_ℓ below is a set S of records concentrated in a subcube of p processors, p ≥ |S|. (A set of records X is said to be concentrated in a subcube C if each of the |X| lowest-numbered processors of C contains a unique element of X.) The output is a sample (i.e., subset) S′ of S. The sample S′ is chosen in such a manner that: (i) |S′| ≪ |S| and (ii) the record of rank k in the sample has rank approximately k|S|/|S′| in S. (Lemma 2.1 provides precise bounds on the rank properties of S′ with respect to S.) The subscript ℓ is drawn from the set {0, 1, 2, 3, 4}. (In effect, we are defining five slightly different sampling procedures, one corresponding to each subscript value.) For the purposes of Section 2 the reader may assume that ℓ = 0.

Procedure Sample_ℓ(S, p)
1. Partition S into g groups (the parameter g will be defined momentarily) of size s = |S|/g, assign p/g processors to each group, and sort each of the groups in parallel. If ℓ = 0 or 1, then use sparse enumeration sort to accomplish the sorting. If ℓ = 2, then use Sharesort. (Note that Sharesort assumes an input consisting of one record at each processor; if p > |S|, then we simply use |S| of the p processors.) If ℓ = 3, then use the nonuniform sparse Sharesort algorithm defined in Section 6. If ℓ = 4, then use the uniform sparse Sharesort algorithm defined in Section 7. The parameter g is chosen so that the running time of this step (which dominates the overall running time of the procedure) is Θ(lg s) if ℓ = 0, and Θ(lg p) otherwise.
2. Extract a √s-sample from each of the g groups.
3. Return the union of these √s-samples.

2.2. Analysis of the Sampling Procedure. Given the rank of a record in the sample returned by a call to Sample_ℓ(S, p), the following lemma provides upper and lower bounds on the rank of that record in S.

LEMMA 2.1. Let δ, δ′, and δ″ denote integers satisfying 0 ≤ δ″ ≤ δ′ ≤ δ. Let X denote a set of 2^δ records, and assume that X is partitioned into 2^(δ−δ′) sets X_k, 0 ≤ k < 2^(δ−δ′), of size 2^(δ′). Let X′ denote the union of the 2^(δ″)-samples of each of the X_k's. If record α has rank j in set X′, then the rank of α in set X lies in the interval

    ( j·2^(δ″) − 2^(δ−δ′+δ″),  j·2^(δ″) ].

PROOF. Let r_k denote the rank of α in the 2^(δ″)-sample extracted from set X_k, 0 ≤ k < 2^(δ−δ′). Then the rank of α in set X_k lies in the interval ((r_k − 1)·2^(δ″), r_k·2^(δ″)], and so the rank of α in X belongs to

    ( Σ_{0 ≤ k < 2^(δ−δ′)} (r_k − 1)·2^(δ″),  Σ_{0 ≤ k < 2^(δ−δ′)} r_k·2^(δ″) ].
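To make Lemma 2.1 concrete, here is a small brute-force Python check (ours; it stands in for the parallel group sorts with plain sorting, writes d1 for δ′ and d2 for δ″, and verifies the claimed rank interval directly):

    import random

    def r_sample(S, r):
        # Definition 2.2: records of S whose rank in S is congruent to 0 modulo r
        return sorted(S)[::r]

    def check_lemma(delta, d1, d2, trials=20):
        # X has 2**delta records, partitioned into groups of size 2**d1;
        # X' is the union of the (2**d2)-samples of the groups (0 <= d2 <= d1 <= delta)
        assert 0 <= d2 <= d1 <= delta
        for _ in range(trials):
            X = random.sample(range(10**6), 2**delta)            # unique keys
            groups = [X[i:i + 2**d1] for i in range(0, len(X), 2**d1)]
            Xp = sorted(x for g in groups for x in r_sample(g, 2**d2))
            for j, alpha in enumerate(Xp):                       # j = rank of alpha in X'
                r = sum(1 for x in X if x < alpha)               # rank of alpha in X
                assert j * 2**d2 - 2**(delta - d1 + d2) < r <= j * 2**d2

    check_lemma(delta=10, d1=5, d2=2)

Every record of X′ lands strictly inside the half-open interval of Lemma 2.1, as expected.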