Work-Time Optimal k-merge Algorithms on the PRAM*

Tatsuya Hayashi, Koji Nakano
Department of Electrical and Computer Engineering
Nagoya Institute of Technology
Showa-ku, Nagoya 466, JAPAN
{hayashi, nakano} [email protected]

Stephan Olariu
Department of Computer Science
Old Dominion University
Norfolk, Virginia 23529, USA
[email protected]

* Work supported, in part, by NSF grant CCR-9522093, by ONR grant N00014-95-1-0779, and by the Nitto Foundation.

Abstract

The k-merge problem, given a collection of k, (2 ≤ k ≤ n), sorted sequences of total length n, asks to merge them into a new sorted sequence. The main contribution of this work is to propose simple and intuitive work-time optimal algorithms for the k-merge problem on two PRAM models. Specifically, our k-merge algorithms perform O(n log k) work and run in O(log n) time on the EREW-PRAM and in O(log log n + log k) time on the CREW-PRAM, respectively.

1  Introduction

Consider a collection A of n items consisting of k, (2 ≤ k ≤ n), sorted sequences A_1, A_2, ..., A_k. The k-merge problem asks to merge A_1, A_2, ..., A_k into a new sorted sequence. For definiteness, we write for every i, (1 ≤ i ≤ k), A_i = <a_{i,1}, a_{i,2}, a_{i,3}, ..., a_{i,n_i}>. We note that, in general, each of the k sequences may have a different number of items. The k-merge problem has been investigated in the sequential setting, being a key ingredient in external sorting algorithms [10]. Just as the usual merging, the k-merge problem finds applications to databases, information retrieval, and query processing [12]. In addition, the k-merge problem has a lot of theoretical appeal, since it provides a natural bridge between merging, corresponding to the case k = 2, and sorting, where one merges k = n single-item sequences.

A parallel algorithm is termed work-optimal if the product of the computing time and the number of processors equals the running time of the fastest sequential algorithm for the problem. Further, a parallel algorithm is termed work-time optimal [9] if it is work-optimal and, in addition, its running time is best possible among the work-optimal algorithms in that model. Clearly, the naive approach of solving the k-merge problem by repeated pairwise merging may result, at best, in a work-optimal algorithm running in O(log n · log k) time, which is not work-time optimal [8].

Quite recently, Wen [12] proposed an interesting and sophisticated work-optimal k-merge algorithm. Specifically, his algorithm merges a collection of k, (2 ≤ k ≤ n), sorted sequences of total length n in O(n log k / p + log n) time, using p processors on the CREW-PRAM. In particular, for p = n log k / log n his algorithm achieves a running time of O(log n). In spite of being work-optimal, Wen's algorithm is not work-time optimal: it is well known that the problem of merging two sorted sequences of combined length n can be solved in O(log log n) time using O(n) work on the CREW-PRAM [3]. As the results in our work were being written, we became aware of a similar and independent effort by Chen et al. [5], which showed that k sorted sets of size n/k each can be merged in O(log n) time using O(n log k) work on the EREW-PRAM. To put our contribution in perspective, we note that our k-merge algorithms improve on Chen's algorithm in several respects: first, we solve the problem faster if concurrent reads are allowed (CREW-PRAM); second, our algorithm is simpler and more intuitive; and third, our algorithm does not require that the given sorted sets be of the same size.

Our main contribution is to propose simple and intuitive work-time optimal algorithms for the k-merge problem on two PRAM models. Specifically, we begin by devising a k-merge algorithm running in O(log n) time and performing O(n log k) work on the EREW-PRAM. This algorithm is work-time optimal [9], because the running time O(log n) is best possible among the work-optimal algorithms on the EREW-PRAM. Next, we design a work-time optimal CREW-PRAM algorithm running in O(log log n + log k) time and performing O(n log k) work. By Brent's scheduling principle [4, 9], these two algorithms run, respectively, in O(n log k / p + log n) time on the EREW-PRAM and in O(n log k / p + log log n + log k) time on the CREW-PRAM, whenever p, (p ≥ 1), processors are available.

At the heart of our algorithms lies a novel deterministic sampling scheme reminiscent of the one developed recently by Olariu and Schwing [11]. The main feature of our sampling scheme is that, when used for bucket sorting, the resulting buckets are well balanced, making costly rebalancing unnecessary.

2  Our sampling scheme

The main goal of this section is to present the sampling scheme that is key to designing our k-merge algorithms.

Let s be a positive integer. We begin by extracting a sample of size ⌊n_i/s⌋ from each A_i by retaining every s-th item. We let sample_s(A_i) denote the corresponding sample and write sample_s(A_i) = <a_{i,s}, a_{i,2s}, a_{i,3s}, ..., a_{i,s⌊n_i/s⌋}>. Writing e = Σ_{i=1}^{k} ⌊n_i/s⌋, we let sample_s(A) = <s_1, s_2, ..., s_e> be the sequence obtained by sorting the set sample_s(A_1) ∪ sample_s(A_2) ∪ ... ∪ sample_s(A_k) of samples extracted from all the A_i's.

For later reference, we state the following technical result.

Lemma 2.1 For every integer k ≠ 0, we have ⌊e/k⌋ = ⌊n/(sk)⌋ or ⌊e/k⌋ = ⌊n/(sk)⌋ − 1.

From now on we shall assume, without loss of generality, that

    ⌊e/k⌋ = ⌊n/(sk)⌋.    (1)
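To make these definitions concrete, here is a minimal sequential Python sketch (ours, not part of the paper; the names sample_s and sample_of_collection are ours), in which ordinary sorting stands in for the parallel merge of the samples.

```python
# A minimal, sequential sketch of the sampling step, assuming each A_i is a
# sorted Python list. Function names are ours, chosen only for illustration.

def sample_s(A_i, s):
    """Retain every s-th item of A_i: <a_{i,s}, a_{i,2s}, ..., a_{i,s*floor(n_i/s)}>."""
    return [A_i[t * s - 1] for t in range(1, len(A_i) // s + 1)]

def sample_of_collection(As, s):
    """sample_s(A): the sorted union of the samples of A_1, ..., A_k (e items)."""
    merged = []
    for A_i in As:
        merged.extend(sample_s(A_i, s))
    merged.sort()          # sequential stand-in for the parallel merge of the k samples
    return merged

# Example: three sorted sequences, s = 2.
As = [[1, 4, 9, 12], [2, 3, 5, 7, 8], [6, 10, 11]]
print(sample_of_collection(As, 2))   # [3, 4, 7, 10, 12]; here e = 2 + 2 + 1 = 5
```

In this example n = 12, s = 2, k = 3, so ⌊n/(sk)⌋ = 2 while ⌊e/k⌋ = 1, illustrating the second alternative of Lemma 2.1.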

Out of the sorted sequence sample_s(A) we extract a new sample, denoted sample_k(sample_s(A)), by retaining every k-th item in sample_s(A). By construction, and by (1), sample_k(sample_s(A)) contains ⌊e/k⌋ = ⌊n/(sk)⌋ items from A. Clearly, we have sample_k(sample_s(A)) = <s_k, s_{2k}, ..., s_{⌊n/(sk)⌋k}>. To avoid handling boundary conditions, we write s_0 = −∞ and s_{(⌊n/(sk)⌋+1)k} = +∞.

Next, having obtained sample_k(sample_s(A)), we proceed to partition the set A into ⌊n/(sk)⌋ + 1 buckets C_0, C_1, ..., C_{⌊n/(sk)⌋} such that for every j, (0 ≤ j ≤ ⌊n/(sk)⌋), C_j = {a ∈ A | s_{jk} < a ≤ s_{(j+1)k}}. It is easy to confirm that all the buckets obtained are disjoint and that every item of A belongs to exactly one such bucket. We refer the reader to Figure 1 for an illustration of our sampling scheme.

Our next result shows that the sampling scheme we just described results in buckets that are surprisingly well balanced.

Lemma 2.2 No bucket C_j, (0 ≤ j ≤ ⌊n/(sk)⌋), contains more than 2sk items of A.

Proof. If sample_s(A_i) ∩ (s_{jk}, s_{(j+1)k}] is empty, then the block A_i ∩ (s_{jk}, s_{(j+1)k}] contains at most s − 1 items: this is because no group of s consecutive items from A_i contains more than one item in sample_s(A_i). Similarly, if sample_s(A_i) ∩ (s_{jk}, s_{(j+1)k}] has exactly one item, then A_i ∩ (s_{jk}, s_{(j+1)k}] contains at most 2s − 1 items. More generally, if sample_s(A_i) ∩ (s_{jk}, s_{(j+1)k}] has m items, then A_i ∩ (s_{jk}, s_{(j+1)k}] contains at most (m + 1)s − 1 items. Since sample_s(A) ∩ (s_{jk}, s_{(j+1)k}] has exactly k sample items, it follows that bucket C_j = A ∩ (s_{jk}, s_{(j+1)k}] contains at most 2sk items from A, as claimed.
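The bound of Lemma 2.2 is easy to check empirically. The following sketch (ours, illustrative only; all names are ours) builds a random instance with distinct items, forms the pivots s_k, s_{2k}, ..., and verifies that no bucket receives more than 2sk items.

```python
import random
from bisect import bisect_left

# Empirical check of Lemma 2.2 on a random instance with distinct items.
random.seed(1)
k, s = 5, 4
pool = random.sample(range(10_000), 500)            # 500 distinct items, n = 500
cuts = sorted(random.sample(range(1, 500), k - 1))
As = [sorted(pool[lo:hi]) for lo, hi in zip([0] + cuts, cuts + [500])]  # A_1..A_k

sample_A = sorted(x for A_i in As for x in
                  (A_i[t * s - 1] for t in range(1, len(A_i) // s + 1)))
pivots = [sample_A[t * k - 1] for t in range(1, len(sample_A) // k + 1)]  # s_k, s_2k, ...

buckets = [0] * (len(pivots) + 1)
for x in pool:
    buckets[bisect_left(pivots, x)] += 1            # bucket C_j: s_{jk} < x <= s_{(j+1)k}
assert max(buckets) <= 2 * s * k, buckets
print(max(buckets), "<=", 2 * s * k)
```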

[Figure 1. Our sampling scheme — the sequences A_1, A_2, A_3, A_4, ..., A_k, the items sampled every s positions, and the resulting buckets; only the caption and panel labels of the figure are reproduced here.]

In turn, the sampling scheme that we just presented suggests, in quite a natural way, the following generic algorithm for solving the k-merge problem. This algorithm will be instantiated in various ways in subsequent sections to yield work-time optimal algorithms on the PRAM.

Algorithm Basic-k-merge(A; s);

Step 1. For every i, (1 ≤ i ≤ k), compute sample_s(A_i);

Step 2. Merge the k sorted sequences sample_s(A_1), sample_s(A_2), ..., sample_s(A_k) to obtain sample_s(A);

Step 3. Construct sample_k(sample_s(A)) and partition A into ⌊n/(sk)⌋ + 1 buckets, each containing no more than 2ks items from A;

Step 4. Sort each bucket and concatenate the resulting sorted sequences.
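The four steps translate directly into the following sequential Python sketch (ours; the PRAM steps are simulated by plain sorting and binary search, and the function name basic_k_merge is ours).

```python
from bisect import bisect_left

def basic_k_merge(As, s):
    """Sequential sketch of Algorithm Basic-k-merge(A; s); As is a list of k sorted lists."""
    # Step 1: compute sample_s(A_i) for every i.
    samples = [[A_i[t * s - 1] for t in range(1, len(A_i) // s + 1)] for A_i in As]
    # Step 2: merge the k samples into sample_s(A) (here: a plain sort).
    sample_A = sorted(x for smp in samples for x in smp)
    # Step 3: keep every k-th item as pivots and distribute A into buckets
    # C_j = { a : s_{jk} < a <= s_{(j+1)k} }; by Lemma 2.2, |C_j| <= 2*s*k.
    k = len(As)
    pivots = [sample_A[t * k - 1] for t in range(1, len(sample_A) // k + 1)]
    buckets = [[] for _ in range(len(pivots) + 1)]
    for A_i in As:
        for a in A_i:
            buckets[bisect_left(pivots, a)].append(a)
    # Step 4: sort each bucket and concatenate.
    out = []
    for C_j in buckets:
        out.extend(sorted(C_j))
    return out

# Example usage.
As = [[1, 4, 9, 12], [2, 3, 5, 7, 8], [6, 10, 11]]
assert basic_k_merge(As, 2) == sorted(sum(As, []))
```

Because the pivots separate disjoint value ranges, concatenating the sorted buckets in order yields the fully merged sequence; the PRAM algorithms below differ only in how each step is carried out in parallel.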

3  The EREW-PRAM k-merge algorithm

First, we review basic results that will be used in our k-merge algorithm on the EREW-PRAM.

Proposition 3.1 [7] The task of computing the prefix sums of an n-item sequence can be performed in O(log n) time and O(n) work on the EREW-PRAM.

Proposition 3.2 [8] The task of merging two sorted sequences of size n can be performed in O(log n) time and O(n) work on the EREW-PRAM.

Proposition 3.3 [2, 6] The task of sorting n items can be performed in O(log n) time and O(n log n) work on the EREW-PRAM.

Next, we present a sequential k-merge algorithm used by our parallel k-merge algorithm.

Lemma 3.4 The task of merging k, (2 ≤ k ≤ n), sorted sequences of total length n can be performed in O(n log k) sequential time.

Proof. Begin by constructing a bottom-heavy heap containing the items a_{1,1}, a_{2,1}, ..., a_{k,1}. Clearly, this can be done in O(k) time [1]. Let a_{j,1} be the minimum. Remove a_{j,1} and add a_{j,2} to the heap. In other words, remove the minimum from the heap, pick a new item from the sequence to which the removed item belongs, and add it to the heap. It is easy to see that all the items are removed by iterating this procedure O(n) times. Moreover, the order of removal corresponds to the sorted order. Since each iteration needs O(log k) time to maintain the heap, the k-merge can be performed in O(k + n log k) = O(n log k) time.
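In Python, the procedure from this proof is conveniently expressed with the standard heapq module. The sketch below is ours; it keeps one (value, sequence index, position) triple per sequence, so building the heap costs O(k) and each of the n extractions costs O(log k).

```python
import heapq

def sequential_k_merge(As):
    """Heap-based k-merge from the proof of Lemma 3.4: O(n log k) time."""
    heap = [(A_i[0], i, 0) for i, A_i in enumerate(As) if A_i]
    heapq.heapify(heap)                      # O(k) heap construction
    out = []
    while heap:
        value, i, pos = heapq.heappop(heap)  # remove the current minimum
        out.append(value)
        if pos + 1 < len(As[i]):             # refill from the same sequence
            heapq.heappush(heap, (As[i][pos + 1], i, pos + 1))
    return out

assert sequential_k_merge([[1, 4, 9], [2, 3, 8], [5, 6, 7]]) == list(range(1, 10))
```

This is the routine the second sorting strategy below relies on when it sorts each small bucket sequentially.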

Our k-merge algorithm on the EREW-PRAM amounts to the call Basic-k-merge(A; log n / log k). We present a detailed implementation of the four steps of the algorithm on the EREW-PRAM.

Step 1 can be performed in O(1) time and O(n_i log k / log n) work for each A_i. The total amount of work is clearly bounded by O(n log k / log n) ≤ O(n log k).

Since sample_{log n/log k}(A) contains n log k / log n items, it can be sorted by Proposition 3.3 in O(log(n log k / log n)) = O(log n) time and O(log n · n log k / log n) = O(n log k) work. At the end of Step 2, we obtain the sorted sequence sample_{log n/log k}(A) = <s_1, s_2, ..., s_{n log k / log n}>.

In turn, this sequence is used to generate sample_k(sample_{log n/log k}(A)) = <s_k, s_{2k}, ..., s_{(n log k/(k log n)) · k}>, containing n log k/(k log n) items from A. By (1), our sampling scheme yields the n log k/(k log n) + 1 buckets C_0, C_1, ..., C_{n log k/(k log n)}. By Lemma 2.2, no bucket contains more than 2k log n / log k items from A. It is easy to see that the task of constructing sample_k(sample_{log n/log k}(A)) can be performed in O(1) time and O(n log k/(k log n)) work.

To complete Step 3, for each item in A, we must identify the bucket to which it belongs. This task is handled as follows. Each A_i is partitioned into blocks A_i ∩ (−∞, s_k], A_i ∩ (s_k, s_{2k}], ..., A_i ∩ (s_{(n log k/(k log n)) · k}, +∞). Recall that for each j, (0 ≤ j ≤ n log k/(k log n)), the k blocks A_1 ∩ (s_{jk}, s_{(j+1)k}], A_2 ∩ (s_{jk}, s_{(j+1)k}], ..., A_k ∩ (s_{jk}, s_{(j+1)k}] belong to bucket C_j. To partition the A_i's into blocks, we need to compute the rank of every item in sample_k(sample_{log n/log k}(A)) with respect to each A_i. Notice, however, that in order to perform the merging in parallel, we must make k copies of sample_k(sample_{log n/log k}(A)) beforehand. This latter task can be performed in O(log k) time and O(n log k / log n) work. Once the k copies of sample_k(sample_{log n/log k}(A)) are available, each of them is merged with one of A_1, A_2, ..., A_k. By Proposition 3.2, the task of merging A_i and the corresponding copy of sample_k(sample_{log n/log k}(A)) can be performed in O(log n) time and at most O(n_i + n log k/(k log n)) work. Therefore, the task of determining the identity of the bucket to which each item of A belongs can be performed in O(log n) time and O(n log k) work.
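Sequentially, the effect of merging A_i with its private copy of sample_k(sample_{log n/log k}(A)) is simply to learn, for every item of A_i, the bucket it falls in. The sketch below (ours; bucket_ids_for_sequence is a hypothetical helper name) computes exactly that with a two-pointer walk over A_i and the pivot sequence.

```python
def bucket_ids_for_sequence(A_i, pivots):
    """For each position of A_i, the index j of the bucket C_j it falls in.

    Simulates Step 3: merge A_i with (a private copy of) the pivot sequence
    sample_k(sample_s(A)) and read the bucket identity off the merge.
    """
    ids, j = [], 0
    for a in A_i:                      # two-pointer walk = merge of A_i and pivots
        while j < len(pivots) and pivots[j] < a:
            j += 1                     # a exceeds pivot s_{(j+1)k}: move to the next bucket
        ids.append(j)                  # now s_{jk} < a <= s_{(j+1)k} (or a is past the last pivot)
    return ids

# Example: with pivots [3, 7, 10], the sequence [2, 3, 5, 7, 8] is assigned
# to buckets 0, 0, 1, 1, 2.
print(bucket_ids_for_sequence([2, 3, 5, 7, 8], [3, 7, 10]))
```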

The task of sorting the buckets in Step 4 is trickier to perform in O(log n) time. There are, essentially, two sorting strategies that come to mind:

1. sort each bucket from scratch using Cole's algorithm;

2. apply Algorithm Basic-k-merge a second time to sort each bucket.

As it will turn out, we will use both of these strategies, depending on the value of k.

To begin, let us investigate the first strategy. Recall that, by Lemma 2.2, C_i contains no more than 2k log n / log k items. Now Proposition 3.3 guarantees that bucket C_i can be sorted from scratch with O((2k log n / log k) · log(2k log n / log k)) = O(k log n (log k + log log n) / log k) work. Thus, the total work is (n log k/(k log n)) · O(k log n (log k + log log n) / log k) = O(n (log k + log log n)). Therefore, if k log k ≥ log n, the total amount of work involved in sorting the buckets is bounded by O(n log k). For values of k satisfying

    k log k < log n    (2)

the strategy above cannot be used. Instead, we shall rely on the second strategy outlined above.

Consider again a generic bucket C_i. The reader should have no difficulty confirming that, by virtue of our sampling scheme, bucket C_i consists, in turn, of k sorted sequences. It is natural, therefore, to apply Algorithm Basic-k-merge to bucket C_i. Along this line of thought, we assume that C_i consists of the k sorted sequences A_i^1, A_i^2, ..., A_i^k and that C_i contains a total of c_i (≤ 2k log n / log k) items. This new application of Basic-k-merge corresponds to the call Basic-k-merge(C_i; log n/(k log k)). Accordingly, we begin by constructing for every j, (1 ≤ j ≤ k), the sequences sample_{log n/(k log k)}(A_i^j).

Notice that sample_{log n/(k log k)}(C_i) = sample_{log n/(k log k)}(A_i^1) ∪ sample_{log n/(k log k)}(A_i^2) ∪ ... ∪ sample_{log n/(k log k)}(A_i^k) involves a total of at most c_i · (k log k / log n) ≤ (2k log n / log k) · (k log k / log n) = 2k^2 items. By Proposition 3.3, sample_{log n/(k log k)}(C_i) can be sorted in O(log k) time and O(k^2 log k) work. The total amount of work is (n log k/(k log n)) · O(k^2 log k) = O(nk log^2 k / log n); by (2), the total work of Step 2 is bounded by O(n log k). In Step 3, sample_k(sample_{log n/(k log k)}(C_i)) is constructed and C_i is partitioned into k + 1 buckets, each containing at most 4 log n / log k items of A. At this moment, each such bucket is sorted by performing the sequential k-merge. By Lemma 3.4, this takes O((log n / log k) · log k) = O(log n) time and optimal work.

Once sorting is done, we need to concatenate the (sorted) buckets. We shall only demonstrate how this is done in the case k log k ≥ log n, the complementary range being similar. In order to do this, we need to compute the rank of each item in sample_k(sample_{log n/log k}(A)) with respect to A. This can be done by computing the prefix sums of |C_0|, |C_1|, ..., |C_{n log k/(k log n)}| and by broadcasting the prefix sums to the corresponding buckets. By using Proposition 3.1, the prefix sums can be computed in O(log(n log k/(k log n))) ≤ O(log n) time and O(n log k/(k log n)) work, while the broadcasting can be done in O(log n) time and O(n) work. Since each item ...
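Sequentially, the concatenation amounts to computing the prefix sums of the bucket sizes and writing each sorted bucket at its offset; the sketch below (ours) shows this, with the offset playing the role of the value broadcast to each bucket.

```python
from itertools import accumulate

def concatenate_buckets(sorted_buckets):
    """Write each sorted bucket at the offset given by the prefix sums of |C_0|, |C_1|, ..."""
    sizes = [len(C_j) for C_j in sorted_buckets]
    offsets = [0] + list(accumulate(sizes))          # offsets[j] = |C_0| + ... + |C_{j-1}|
    out = [None] * offsets[-1]
    for j, C_j in enumerate(sorted_buckets):
        out[offsets[j]:offsets[j] + sizes[j]] = C_j  # "broadcast" offset j to bucket C_j
    return out

print(concatenate_buckets([[1, 2], [3], [], [4, 5, 6]]))   # [1, 2, 3, 4, 5, 6]
```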

4  The CREW-PRAM k-merge algorithm

The main goal of this section is to exhibit a work-time optimal parallel algorithm solving the k-merge problem in O(log log n + log k) time and O(n log k) work on the CREW-PRAM. This algorithm uses the following merging algorithm for two sorted sequences.

Proposition 4.1 [3] The task of merging two sorted sequences of size n can be performed in O(log log n) time and O(n) work on the CREW-PRAM.

At this point, we note that if k^2 log n > n log k then, applying logarithms throughout, we obtain O(log n) ≤ O(log log n + log k). Consequently, we can apply Cole's sorting algorithm directly. From now on, we shall assume that k^2 log n ≤ n log k.

The algorithm is very close in spirit to the EREW-PRAM algorithm detailed in the previous section, and amounts to the call Basic-k-merge(A; k log n / log k). The remainder of this section is devoted to a detailed implementation of the four steps of the algorithm on the CREW-PRAM.

To begin, Step 1 can be performed in O(1) time and O(n_i log k/(k log n)) work for each A_i. The total work is O(n log k/(k log n)) ≤ O(n log k).

While the EREW-PRAM implementation of Step 2 uses Cole's optimal sorting, we shall rely, instead, on the doubly-logarithmic merging algorithm of Proposition 4.1. In outline, the idea is to sort sample_{k log n/log k}(A_1) ∪ sample_{k log n/log k}(A_2) ∪ ... ∪ sample_{k log n/log k}(A_k) by computing the rank of each item. To implement this idea, we plan to merge, pairwise, sample_{k log n/log k}(A_i) and sample_{k log n/log k}(A_j) for 1 ≤ i < j ≤ k. This, of course, allows each item in a generic sample_{k log n/log k}(A_i) to compute its rank with respect to each sample_{k log n/log k}(A_j), i ≠ j. What remains to be done is to add the corresponding ranks. By Proposition 4.1, sample_{k log n/log k}(A_i) and sample_{k log n/log k}(A_j) can be merged in O(log log n) time and O(|sample_{k log n/log k}(A_i)| + |sample_{k log n/log k}(A_j)|) work. Now, a simple counting argument shows that Σ_{1 ≤ i < j ≤ k} (n_i + n_j) = (k − 1)n. Thus, the total work is Σ_{1 ≤ i < j ≤ k} O(|sample_{k log n/log k}(A_i)| + |sample_{k log n/log k}(A_j)|) = O((k − 1)n · log k/(k log n)) ≤ O(n). As a result, for each item in sample_{k log n/log k}(A_i), the number of items in sample_{k log n/log k}(A_j), i ≠ j, smaller than it can be ...
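The rank-summing idea behind Step 2 can be sketched sequentially as follows (our code, assuming for simplicity that all items are distinct; the function name is ours): the final position of an item of sample(A_i) is its rank within sample(A_i) plus the sum, over all j ≠ i, of its ranks in the other samples, and each of those pairwise ranks is exactly what the doubly-logarithmic merge of Proposition 4.1 delivers.

```python
from bisect import bisect_right

def sort_by_cross_ranking(samples):
    """Place every item by summing its ranks with respect to all other samples."""
    total = sum(len(S) for S in samples)
    out = [None] * total
    for i, S_i in enumerate(samples):
        for p, x in enumerate(S_i):
            rank = p                                   # rank within sample(A_i)
            for j, S_j in enumerate(samples):
                if j != i:
                    rank += bisect_right(S_j, x)       # items of sample(A_j) below x (distinct items)
            out[rank] = x
    return out

print(sort_by_cross_ranking([[4, 12], [3, 7], [10]]))  # [3, 4, 7, 10, 12]
```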
