Load Balanced Priority Queues on Distributed Memory Machines (Extended Abstract)
Ajay K. Gupta*
Western Michigan University Kalamazoo, MI 49008, USA
Andreas G. Photiou
Lake States Insurance Company Traverse City, MI 49685, USA
Abstract. We consider efficient algorithms for priority queues on distributed memory multiprocessors, such as the nCUBE, iPSC, MPP, and loosely coupled systems consisting of networked workstations. For a p-processor distributed memory multicomputer P and n data items in the priority queue, n > p, we investigate two priority queues: horizontally sliced and vertically sliced. Both of these achieve load balance, i.e., at most Θ(n/p) data items are stored at every processor of P. The horizontally sliced priority queue allows deletions and insertions of Θ(p) items in time O((p/bw)·τ_c + τ_p·p·log n) on hypercubic networks, where τ_c is the communication time between a pair of processors, τ_p is the unit processing time, and bw is the width of the communication channel between a pair of processors. The vertically sliced priority queue allows deletions and insertions of Θ(p) items in time O((τ_c + τ_p)·log p·log n) on hypercubic networks. Similar results hold for other types of networks.
* Research supported in part by a Fellowship from the Faculty Research and Creative Activities Support Funds, WMU-FRCASF 90-15 and WMU-FRACASF-94-040, and by the National Science Foundation under grant USE-90-52346.

1 Introduction

The heap is an important priority queue data structure. In its most common implementation it has the structure of a binary tree with the property that the item in each node has higher priority than the items in its children. Uni-processor implementations of the heap can be found in any textbook on data structures and algorithms (such as [2]). The two basic operations performed on a heap are insertion of a new item and deletion of the item with the highest priority, both while maintaining the heap property. On a single processor, both operations take O(log n) time for a heap of size n. With the popularity of parallel machines with distributed memory, it naturally becomes important to consider distributed implementations of a priority queue. A distributed priority queue can be used in many applications, such as scheduling tasks in distributed operating systems, fetching best nodes and inserting new nodes in branch and bound algorithms, solving differential equations or integrals in scientific computing, genetic and evolutionary approaches, etc. In this paper, we consider efficient distributed implementations of a priority queue using the heap as its basic data structure.

Our main focus is on load balanced implementations of a priority queue. That is, for a p-processor distributed memory machine and for a priority queue of n data items, n > p, every processor is allowed Θ(n/p) data items of the priority queue. Load balancing is very important on distributed memory machines (as opposed to shared memory machines) since every processor of a distributed memory machine has a fixed amount of memory. Furthermore, it is not atypical in applications such as scientific computing, simulated annealing and genetic approaches that the larger n is, the better the solution; load balancing allows one to increase n by an order of magnitude.

A distributed priority queue should allow deletion and insertion of more than one item very efficiently, since in the same time unit every processor of an MIMD parallel machine should be able to consume a distinct highest priority item and insert a new item (possibly distinct from those of the other processors). More specifically, since there are p processors, one would like concurrent deletions and insertions of Θ(p) items to be handled efficiently by a distributed priority queue. A number of researchers have thus focused on priority queue implementations, on various models of parallel computation, that allow efficient deletion and insertion of Θ(p) items. To the best of our knowledge, almost all of these result in a load imbalanced implementation of a priority queue. We only mention the recent results in this direction. Deo and Prasad describe a parallel heap in [6] that allows deletion of the p highest priority items (and insertion of p new items) in O(log n) time on PRAM shared memory models of parallel computation. Pinotti and Pucci discuss an n-bandwidth heap on a CREW PRAM model in [10], which is similar to the parallel heap of [6]. Das and Horng improve some of the drawbacks of [6] and again present a parallel heap that allows p deletions and insertions in O(log n) time on PRAM models [4, 5]. The authors of [4, 5] give a good summary of the results known for parallel heaps. The main drawback in adapting these results to distributed memory models is that they result in a very unbalanced distribution of the priority queue among the processors; for example, one processor may end up with roughly half of the n data items of the priority queue while another processor may end up with only O(1) data items. Another drawback, not as crucial for the PRAM models but crucial for distributed memory models, is that most of the O(log n) time spent in deleting or inserting p items goes into communication between the shared memory and the processors. In this paper, our attempt is to alleviate (at least partially) these drawbacks.

Let P be a p-processor distributed memory MIMD parallel machine (tightly coupled or loosely coupled). We present two distributed implementations of a priority queue on P, which we refer to as the horizontally sliced and the vertically sliced priority queue. Both use the heap as the primary data structure and both maintain load balance at every step of p deletions and p insertions. Let τ_c be the communication time between a pair of processors of P, let τ_p be the processing time of a basic operation on a processor of P, and let bw be the width of the communication channel between a pair of processors. If P has a hypercube interconnection network, then the horizontally sliced priority queue (HQ) allows deletion and insertion of Θ(p) items in O((p/bw)·τ_c + τ_p·p·log n) time in a pipelined fashion. The vertically sliced priority queue (VQ) allows p deletions and p insertions in time O((τ_c + τ_p)·log p·log n). Depending on the values of τ_c and τ_p, one of the implementations results in better performance. In practice, since τ_c is much larger than τ_p, HQ in general gives better results; theoretically, of course, VQ has a better time bound than HQ, since τ_c, τ_p and bw are constants. Further comparisons between HQ and VQ are given in Section 4. We do note that both HQ and VQ have inferior time bounds compared to the results on PRAM models of [4, 5, 6]. (We conjecture, however, that the VQ bound is optimal.) It should be noted that O((τ_c + τ_p)·log(n/p)) time for n >> p can be achieved, but at the expense of a load imbalance. Similar results hold when P has other types of interconnection networks, such as butterfly, de Bruijn, Benes, meshes, etc.

We also consider a sequence of m insert or delete operations on a distributed priority queue. Since it takes O(τ_p·log n) time to delete or insert an item in a sequential heap, a sequence of m deletes or inserts takes O(m·τ_p·log n) time on a uni-processor machine. The main problem in generalizing these ideas to parallel algorithms is that during an insertion the inserted key is attached at a leaf node and works its way up until it finds its proper position. This forces the algorithm to wait until the key reaches its proper position before performing the next operation, resulting in an inherent sequentialism. A solution to overcome this sequentiality is to start both insertions and deletions from the top (root), so that as soon as the root satisfies the heap property a new operation can start. The main contribution (and non-triviality) of our algorithms is in maintaining the load balance at every processor at every step along with this idea. Throughout this paper the minimum data item has the highest priority, and all logarithms are base 2.
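For reference, the sequential baseline quoted above (O(log n) per insert or delete-min on one processor) can be sketched as follows; this is an illustration of our own using Python's standard heapq module and is not part of the distributed algorithms developed below.

```python
import heapq

class SequentialPQ:
    """Uni-processor min-priority queue: the O(log n) baseline."""

    def __init__(self, items=()):
        self._heap = list(items)
        heapq.heapify(self._heap)          # O(n) build

    def insert(self, item):
        heapq.heappush(self._heap, item)   # O(log n)

    def delete_min(self):
        # smallest item = highest priority, O(log n)
        return heapq.heappop(self._heap)
```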
2 Horizontally Sliced Priority Queue

We develop an algorithm for a p-processor multicomputer with the distributed memory model. Let n be the number of items in the current parallel heap. For simplicity, assume that all n items are distinct. We keep the parallel heap in a complete binary tree of p nodes with height ⌈log(p+1)⌉ − 1. (Our algorithms generalize easily to balanced binary trees of height Θ(log p) with the same time complexity; for simplicity and clarity we use a complete binary tree of p nodes with p = 2^s − 1 for some positive integer s.) Let us denote the right child (respectively, left child and parent) of a node v as rc(v) (respectively, lc(v) and p(v)). In each node of the parallel heap we keep a minimum of ⌊n/p⌋ and a maximum of ⌈n/p⌉ items, and each node is assigned to a processor. (We discuss the implications of relaxing the strict load balance constraint in Section 4.) Since there is a one-to-one correspondence between processors and nodes, if a node v is assigned to processor i, we simply refer to processor i as v and vice versa. The min-heap property is kept strictly across all the processors; i.e., in every processor all items are smaller than those of its children. We call this property the global heap property. Let q_i be the number of items assigned to processor i, 1 ≤ i ≤ p; i.e., q_i ∈ {⌈n/p⌉, ⌊n/p⌋}. The q_i items at processor i are kept in a local min-heap of height ⌈log(q_i+1)⌉ − 1 using either an array or a linked list implementation; hence the q_i items at processor i satisfy the local heap property. Any operation, insertion or deletion of item(s), always starts from the root processor. In order to maintain the load balanced distribution of the parallel heap, during insertion or deletion we always make sure that, for a processor v, the total number of items in the subtree rooted at lc(v) differs from the total number of items in the subtree rooted at rc(v) by at most one. In addition, the maximum and minimum numbers of items allowed at every processor are ⌈n/p⌉ and ⌊n/p⌋, respectively. Note that insertion of an item (similarly, deletion of an item) may change the upper limit (respectively, lower limit) at every processor.
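The per-processor state of the horizontally sliced queue can be pictured with the following sketch (our own illustration; the class and method names are ours, and the message-passing machinery is omitted):

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HQNode:
    """State held by one processor/node of the horizontally sliced queue."""
    proc_id: int                                          # 1-based id in the complete binary tree
    local_heap: List[int] = field(default_factory=list)   # local min-heap (heapq order)

    def parent(self) -> Optional[int]:
        return self.proc_id // 2 if self.proc_id > 1 else None

    def children(self, p: int) -> List[int]:
        # lc(v) = 2v, rc(v) = 2v + 1, present only if <= p
        return [c for c in (2 * self.proc_id, 2 * self.proc_id + 1) if c <= p]

    def insert_local(self, item: int) -> None:
        heapq.heappush(self.local_heap, item)   # O(log(n/p)) local insertion

    def within_limits(self, n: int, p: int) -> bool:
        # load-balance constraint: floor(n/p) <= |local_heap| <= ceil(n/p)
        return n // p <= len(self.local_heap) <= -(-n // p)
```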
2.1 Insertion of m items in horizontally sliced priority queue
The insertion starts at the root processor. We assume that the m items, m ≤ ⌊n/p⌋, to be inserted are at the root processor (otherwise they are sent to the root processor; this does not change the cycle time complexity). If m > ⌊n/p⌋, then the m items are divided into groups of no more than ⌊n/p⌋ items each and inserted one group after the other. The new maximum and minimum numbers of items that a processor can hold are ⌈(n+m)/p⌉ and ⌊(n+m)/p⌋, respectively. We need to keep the smallest ⌊(n+m)/p⌋ or ⌈(n+m)/p⌉ items at the root, but some of these items might be at its two children. Let δ_root be the number of extra items that the root needs to insert into its local structure so that it does not violate the global heap property and stays within the permissible limits. The root requests δ_root items from each of lc(root) and rc(root), merges the received items with the new m items, keeps the smallest δ_root items, and sends the remaining items to the children lc(root) and rc(root). When dividing the remaining items, we must make sure that the resulting numbers of items in the two subtrees rooted at lc(root) and rc(root) differ by no more than one. The insertion process now continues at both processors lc(root) and rc(root), viewing the received items as the items to insert, and proceeds down level by level until the global and local heap properties are met and the number of items at each processor is within the new permissible limits.

In general, suppose the insertion process is at processor v, and let v_l and v_r be its left and right child, respectively. We know that processor v was requested and has sent v_sent = ⌈(n+m)/p⌉ − q_p(v) smallest items to its parent and has received v_recv items for insertion (for the root processor, root_recv = m and root_sent = 0). Let δ_v be the maximum number of extra items that processor v requires for load balancing purposes, where δ_v = ⌈(n+m)/p⌉ − q_v. We request the smallest δ_v items from each of v_l and v_r and merge the v_recv items (received from the parent) with the 2δ_v items (received from the children) at v to obtain a sorted list L of v_recv + 2δ_v items. Processor v keeps the smallest (say) v_rem items of L and sends (say) δ_vl and δ_vr remaining items of L to v_l and v_r, respectively. In order to enforce the load balancing condition, we must make sure that the total number of items at v (namely q_v − v_sent + v_rem) is within the new permissible limits and that the total numbers of items in the subtrees rooted at v_l and v_r differ by at most one. We thus need to satisfy the following four conditions for a load balanced distribution:
p_l · ⌈(n+m)/p⌉ ≥ lcount + δ_vl ≥ p_l · ⌊(n+m)/p⌋   (I)
p_r · ⌈(n+m)/p⌉ ≥ rcount + δ_vr ≥ p_r · ⌊(n+m)/p⌋   (II)
⌈(n+m)/p⌉ ≥ q_v − v_sent + v_rem ≥ ⌊(n+m)/p⌋   (III)
v_rem + δ_vl + δ_vr = v_recv + 2δ_v   (IV)

where p_l = p_r = 2^(⌈log(p+1)⌉−k−1) − 1 is the number of processors in the subtree rooted at v_l (and at v_r), with k the level of v in the tree of processors, and lcount (respectively, rcount) is the total number of items in the subtree rooted at v_l (respectively, v_r) after processor v_l (respectively, v_r) has sent its δ_v items to v.
Note that there are only three unknowns in the above conditions, namely δ_vl, δ_vr and v_rem. The variable v_rem can take two possible values, namely ⌈(n+m)/p⌉ − q_v + v_sent or ⌊(n+m)/p⌋ − q_v + v_sent. The variable δ_vl then has the value ⌈(v_recv + 2δ_v − v_rem)/2⌉ and δ_vr has the value v_recv + 2δ_v − v_rem − δ_vl, or vice versa. Hence there are only a constant number of cases to try in order to find the values of δ_vl, δ_vr and v_rem. Once we have determined the values of v_rem, δ_vl and δ_vr, we sequentially insert the smallest v_rem items from L into the local heap of node v (recall that the v_sent smallest items have already been deleted from v's local heap). Since we need to send the largest δ_vl + δ_vr of all the v_recv + q_v − v_sent + 2δ_v items currently at processor v to its children, we first insert those of the remaining v_recv + 2δ_v − v_rem items of L that are smaller than the maximum item of the local heap into the local heap at processor v, and replace them with an equal number of the largest items from the local heap of v. This ensures the global heap property.
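Since only a constant number of cases has to be tried for the three unknowns, the selection can be sketched as follows (an illustrative sketch of our own; the function name and argument list are ours, and the feasibility tests simply restate conditions (I)-(IV)):

```python
def split_for_insertion(n, m, p, q_v, v_sent, v_recv, delta_v,
                        lcount, rcount, p_l, p_r):
    """Try the constant number of candidate splits and return a feasible one.

    Names mirror the text (q_v, v_sent, v_recv, delta_v, lcount, rcount,
    p_l, p_r); returns (v_rem, delta_vl, delta_vr) or None.
    """
    hi = -(-(n + m) // p)                 # ceil((n+m)/p), new upper limit
    lo = (n + m) // p                     # floor((n+m)/p), new lower limit
    total = v_recv + 2 * delta_v          # items to distribute, condition (IV)
    for v_rem in (hi - q_v + v_sent, lo - q_v + v_sent):
        rest = total - v_rem
        for d_l in (-(-rest // 2), rest // 2):               # the two roundings
            d_r = rest - d_l
            if (lo <= q_v - v_sent + v_rem <= hi             # (III)
                    and p_l * lo <= lcount + d_l <= p_l * hi     # (I)
                    and p_r * lo <= rcount + d_r <= p_r * hi):   # (II)
                return v_rem, d_l, d_r
    return None   # no feasible case: inputs violate the invariants
```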
2.2 Deletion of m highest priority items

Deletion is very similar to insertion. The m smallest items are deleted from the local heap of the root and sent to the required processors. Note that if m > ⌊n/p⌋, then ⌈m/⌊n/p⌋⌉ deletion operations have to be performed. The new maximum and minimum permissible numbers of items at every processor are ⌈(n−m)/p⌉ and ⌊(n−m)/p⌋, respectively. In order for the root to be within the permissible limits, we request the smallest δ_root items from each of lc(root) and rc(root), where δ_root = q_root − ⌊(n−m)/p⌋, and lc(root) and rc(root) delete their smallest δ_root items. We next merge the 2δ_root received items to obtain a sorted list L at the root processor and sequentially insert the smallest root_rem items of L into the root's local heap. Out of the remaining 2δ_root − root_rem items of L we return δ_rootl and δ_rootr items to lc(root) and rc(root), respectively. We need to satisfy the following four conditions for a load balanced distribution:

((p−1)/2) · ⌈(n−m)/p⌉ ≥ lcount + δ_rootl ≥ ((p−1)/2) · ⌊(n−m)/p⌋,
((p−1)/2) · ⌈(n−m)/p⌉ ≥ rcount + δ_rootr ≥ ((p−1)/2) · ⌊(n−m)/p⌋,
⌈(n−m)/p⌉ ≥ q_root − m + root_rem ≥ ⌊(n−m)/p⌋, and
2δ_root = root_rem + δ_rootl + δ_rootr.

In this case the unknowns are δ_rootl, δ_rootr and root_rem. Processors lc(root) and rc(root) must insert the δ_rootl and δ_rootr items they received from the root into their local heaps, respectively; but if these items are simply inserted at lc(root) and rc(root), the global heap property might be violated, since some of these items might come from the other child of the root and might be too large for this processor. Hence, in order to maintain the global heap property, we start an insertion process at processor lc(root) (respectively, rc(root)) for the δ_rootl received items (respectively, the δ_rootr items), while keeping the permissible limits at every processor at ⌈(n−m)/p⌉ and ⌊(n−m)/p⌋. Note that the load balancing conditions (I)-(IV) of the insertion process now need to be satisfied with (n + m) replaced by (n − m).
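A sketch of the root's part of the deletion, under the same caveats as before (our own illustration; pull_from_child and push_to_child stand in for the actual messages, and the even split shown here elides the exact choice dictated by the load balancing conditions above):

```python
import heapq

def hq_delete_at_root(root_heap, n, m, p, delta_root,
                      pull_from_child, push_to_child):
    """Delete the m smallest items at the root of HQ (illustrative sketch)."""
    answer = [heapq.heappop(root_heap) for _ in range(m)]    # m smallest overall
    # gather delta_root smallest items from each child and merge them (list L)
    received = sorted(pull_from_child('left', delta_root) +
                      pull_from_child('right', delta_root))
    lo = (n - m) // p                                        # new lower limit
    root_rem = max(lo - len(root_heap), 0)                   # refill the root up to its limit
    for item in received[:root_rem]:
        heapq.heappush(root_heap, item)
    rest = received[root_rem:]
    # the returned items are re-inserted at lc(root)/rc(root) by restarting the
    # insertion process of Section 2.1 with the limits computed for n - m items
    push_to_child('left', rest[:len(rest) // 2])
    push_to_child('right', rest[len(rest) // 2:])
    return answer
```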
2.3 Complexity Analysis

Consider a parallel heap of size n on which we need to perform a sequence of m insert or delete operations. For a sequential heap the time complexity of performing the m operations is O(m·τ_p·log n), where τ_p is the unit processing time within a processor. For a distributed memory parallel machine, let the communication cost between any pair of processors be τ_c and the number of processors be p, 1 ≤ p ≤ n. We define one cycle of the computation to be the time between the beginning of one operation (insert or delete m items, m ≤ ⌊n/p⌋) and the beginning of the next operation. In the case of insertion, it takes O(τ_p·m·log m) time to sort the m items, O(τ_p·m·log(n/p)) time to insert them into the local heap, and O(m·τ_c/bw) time to send the necessary items among the processors. The total time complexity is therefore O(τ_p·m·log m + τ_p·m·log(n/p) + m·τ_c/bw). The worst case occurs when m = ⌊n/p⌋, where the whole local heap needs to be sent to the parent and reconstructed; this takes O(n/p) time if the size of the local heap is n/p [2], so the worst case time complexity of one cycle is O((n/p)·τ_c/bw + τ_p·(n/p)·log(n/p)). In the case of deletion, the time is O(m·τ_c/bw + τ_p·m·log(n/p)), since no sorting is needed. Again, in the worst case the time complexity of a cycle is O((n/p)·τ_c/bw + τ_p·(n/p)), since m = ⌊n/p⌋.

We conclude this section by putting the above result in the perspective of some existing distributed memory parallel machines, such as hypercubes, butterflies and tree machines. Let τ'_c be the communication time between an adjacent pair of processors of the distributed memory machine. For complete binary tree machines, our algorithms can be implemented directly and the communication time factor τ_c changes to τ'_c, resulting in a cycle time of O(τ_p·m·log m + τ_p·m·log(n/p) + m·τ'_c/bw). For distributed memory machines with hypercube and butterfly interconnection networks, we can use graph embedding techniques [1, 8, 9] to assign adjacent nodes of the Θ(p)-node parallel heap to processors of the distributed memory machine that are a constant distance apart. Hence we again obtain the same cycle time for the parallel heap with balanced load distribution as above. For a √p × √p mesh-network-based distributed memory machine, adjacent nodes of a complete binary tree can be assigned to processors that are at most O(√p / log p) apart [3], and hence the resulting cycle time is O(τ_p·m·log m + τ_p·m·log(n/p) + (τ'_c·m·√p)/(bw·log p)). We do note, however, that the recent trend in distributed memory machines is to have a separate routing network (in software and/or hardware) for communication between the processors, which usually results in τ_c = τ'_c.

In addition to considering a sequence of m operations on a distributed priority queue, for most applications (such as branch and bound, scientific computing, job pools and parallel genetic algorithms) it is worthwhile to specifically consider a group of Θ(p) insertions and/or deletions, since there are p processors in the parallel machine and every processor, in general, would first process one of the p highest priority items (go into a "think cycle", in the terminology of [4, 5, 6]) and then generate O(1) new items to be inserted into the global distributed priority queue. We next describe such an algorithm where Θ(p) items are inserted or deleted at a time.
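To make the cycle time expressions above concrete, the following small cost-model helper evaluates them for a given machine (our own illustration; the parameter values in the comment are placeholders, not measurements, and constant factors are ignored):

```python
import math

def hq_cycle_time(n, p, m, tau_c, tau_p, bw, network='hypercube'):
    """Estimated HQ insertion-cycle time under the bounds derived above."""
    compute = tau_p * m * math.log2(max(m, 2)) + tau_p * m * math.log2(n / p)
    if network in ('tree', 'hypercube', 'butterfly'):
        comm = m * tau_c / bw                       # constant-dilation embeddings
    elif network == 'mesh':
        comm = m * tau_c * math.sqrt(p) / (bw * math.log2(p))
    else:
        raise ValueError('unknown network: ' + network)
    return compute + comm

# e.g. hq_cycle_time(n=2**20, p=128, m=2**12, tau_c=100.0, tau_p=1.0, bw=1)
```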
3 Vertically Sliced Priority Queue

In this section we consider another load balanced implementation of the parallel heap in which Θ(p) items can be inserted or deleted very efficiently. Given a p-processor hypercube and n data items, we show that the Θ(p) highest priority items can be deleted, or Θ(p) new items inserted, in O((τ_c + τ_p)·log p·log n) time. Theoretically, this implementation is more efficient than the one of the previous section; however, it suffers from a large communication overhead, as we discuss in detail in Section 4. We note that, with some effort, one can adapt the parallel heap algorithms developed for shared memory PRAM models in [4, 5, 6] to a distributed memory model, but this results in a very imbalanced distribution of the n data items among the p processors. Our main contribution is to develop schemes for load balanced priority queues. The basic idea behind these schemes is to maintain a min-heap with more than one item in every node, and to use merging and sorting to handle insertions and deletions. (Our ideas can be generalized to further improve the timings, i.e., to a cycle time of O((τ_c + τ_p)·log(n/p)), but this results in a load imbalance.)

Let P be a p-processor distributed memory multicomputer and n be the number of data items in the current heap. Without loss of generality, assume for clarity that n mod p = 0 and that q = n/p for some positive integer q. Our results hold easily when n mod p ≠ 0, by letting ⌊n/p⌋ ≤ q ≤ ⌈n/p⌉, or q = 1 if n < p. As before, assume that every processor of P has at most q data items; i.e., we have a load balanced distribution of the heap (with n items) among the processors of P. Let Local_Heap_i[1..q] be an array maintaining the min-heap of height h = ⌈log(q+1)⌉ − 1 holding the q items of processor i, 1 ≤ i ≤ p. Let the levels of the local min-heap be numbered from 0 to h, where the root (i.e., the item at index 1) is at level 0. Thus Local_Heap_i[j] ≤ Local_Heap_i[2j] and Local_Heap_i[j] ≤ Local_Heap_i[2j+1] for 1 ≤ j ≤ ⌊q/2⌋ and any fixed value of i. In addition, in order to allow deletions (and insertions) of the smallest items out of the total n items in the overall min-heap, we enforce the global heap property as follows: for any fixed j, the items in the set {Local_Heap_1[j], Local_Heap_2[j], ..., Local_Heap_p[j]} are all smaller than the items in the set {Local_Heap_1[2j], Local_Heap_2[2j], ..., Local_Heap_p[2j]} and also smaller than the items in the set {Local_Heap_1[2j+1], Local_Heap_2[2j+1], ..., Local_Heap_p[2j+1]}. Although it is not required that Local_Heap_i[j] ≤ Local_Heap_(i+1)[j] for any fixed value of j, we will assume for the sake of simplifying our discussion that this property holds; i.e., the j-th local items across all the p processors are kept in sorted order, so that Local_Heap_1[j] ≤ Local_Heap_2[j] ≤ ... ≤ Local_Heap_p[j]. It is easy to see that the p smallest items are at index 1 of the local heap arrays and that the k-th processor has the k-th smallest item, 1 ≤ k ≤ p.

In order to delete the k-th highest priority item (the k-th smallest item), 1 ≤ k ≤ p, we could simply delete, as in a uniprocessor environment, the item at Local_Heap_k[1] from the local heap of processor k (by first replacing Local_Heap_k[1] with Local_Heap_k[q] and then heapifying top to bottom so that the substitute item "trickles down" to its appropriate place in the k-th local heap array; note that this local heap array now has one item fewer than before). In fact, we can simultaneously delete up to p smallest items from the parallel heap in time O(τ_p·log(n/p)), where τ_p is the computation time of a basic operation in a processor. This keeps the local heap property at every one of the p processors, but may destroy the global heap property, since the next p smallest items may not "trickle up" to index 1 of the local heap arrays (for example, when one of the local heaps holds more than one of the next p smallest items). We do know that the next p smallest items must be at index 2 or index 3 of the local heap arrays, since the global heap property is satisfied. We make use of these items and modify our deletion strategy for deleting the p smallest items from the current heap as follows.
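Before describing the deletion procedure, the local and global heap properties just described can be summarized in a small checker (an illustrative sketch of our own, assuming every processor holds exactly q items and that the optional cross-processor ordering is maintained):

```python
def vq_invariants_hold(local_heaps):
    """Check the vertically sliced queue's invariants (illustrative only).

    local_heaps[i][j] corresponds to Local_Heap_(i+1)[j+1] in the 1-based
    notation of the text; all processors are assumed to hold q items each.
    """
    p, q = len(local_heaps), len(local_heaps[0])
    for j in range(q):
        # optional simplifying assumption: row j is sorted across processors
        if any(local_heaps[i][j] > local_heaps[i + 1][j] for i in range(p - 1)):
            return False
        for child in (2 * j + 1, 2 * j + 2):       # 0-based children of index j
            if child < q:
                # global heap property: every item in row j is smaller than
                # every item in the child row, across all processors
                if max(h[j] for h in local_heaps) > min(h[child] for h in local_heaps):
                    return False
    return True
```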
3.1 Deletion of p highest priority items

We first delete the p smallest items Local_Heap_i[1], 1 ≤ i ≤ p, substitute Local_Heap_i[1] with Local_Heap_i[q], and decrement q by one. This task can be performed in parallel by all processors and hence takes O(τ_p) time. Let maxItem = max{Local_Heap_p[2], Local_Heap_p[3]}, and let maxIndex (respectively, minIndex) be index 2 or index 3, whichever gives (respectively, does not give) maxItem. We next merge the items Local_Heap_i[1..3], 1 ≤ i ≤ p, a total of 3p items, so that after merging the items Local_Heap_i[1], 1 ≤ i ≤ p, are the p smallest items, the items Local_Heap_i[maxIndex] are the next p smallest items, and the items Local_Heap_i[minIndex] are the largest p items. At this point we have the smallest p of the remaining n − p items at index 1 of the local heap arrays, and the local binary trees rooted at maxIndex in the Local_Heap arrays satisfy the min-heap property. However, the items in the binary trees rooted at minIndex may not satisfy the min-heap property, and hence we have to continue the top-to-bottom heapify of the local heaps rooted at minIndex, treating the newly arrived p items Local_Heap_i[minIndex] as the substitute items.

The heart of the top-to-bottom heapify process, after the substitute items have been fetched initially at index 1 of the local heap arrays, is the merging process, which may be repeated at most h = O(log(n/p)) times. To perform the merging process, observe that we have three sorted lists, each of length p: one list L1 consisting of the items at index r, another list L2 consisting of the items at index 2r, and finally the list L3 consisting of the items at index 2r+1. We can thus use any efficient merging algorithm that merges two sorted lists of length p each on a p-processor distributed memory multicomputer, so that the i-th and (i+p)-th smallest items of the combined list of 2p items are at the i-th processor after merging, 1 ≤ i ≤ p. Observe that we may have to use the merging algorithm twice: first to merge the lists L1 and L2, and second to merge the resulting list L1 ∪ L2 of 2p items with the list L3 of p items. Let τ_c again be the communication time of a data item between an adjacent pair of processors and let τ_p be the time to perform a basic operation in a processor. Assuming, in general, that the merging algorithm takes time O(τ_c·M_c(p) + τ_p·M_p(p)) on a p-processor distributed memory multicomputer, the total time to delete the p highest priority items is O((τ_c·M_c(p) + τ_p·M_p(p))·log(n/p)), since the merging process is repeated at most O(log(n/p)) times. Note that no pipelining of the computation is possible with this implementation, since the items to be merged at any time are distributed across all the processors.
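The repeated merging step at the heart of the heapify can be pictured with a sequential stand-in (our own illustration; on the machine itself each call is a distributed merge of cost O(τ_c·M_c(p) + τ_p·M_p(p)), not a local sort):

```python
def vq_merge_rows(row_r, row_2r, row_2r1):
    """Merge the 3p items at indices r, 2r, 2r+1 of all local heap arrays.

    Returns the new contents of the three rows: the p smallest items stay at
    index r, the next p go to one child index and the largest p to the other,
    exactly as in the deletion step described above (sequential stand-in).
    """
    p = len(row_r)
    merged = sorted(row_r + row_2r + row_2r1)     # 3p items in total
    return merged[:p], merged[p:2 * p], merged[2 * p:]
```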
3.2 Insertion of p items

Let A = {a_1, a_2, ..., a_p} be the p items that need to be inserted into the parallel heap, and assume that item a_i is at processor i, 1 ≤ i ≤ p. One could simply insert a_i into Local_Heap_i[1..q] starting from the root (index 1) of the local heap, as in [7, 11]; however, this may destroy the global heap property. Hence, in order to maintain the global heap property in addition to the local heap property, we insert the items of A as follows.

We first sort the items of the list A using the p processors of the multicomputer so that the i-th smallest item arrives at processor i, 1 ≤ i ≤ p. We can use any efficient algorithm for sorting p items on a p-processor parallel machine; let S_c(p) be the communication time and S_p(p) the computation time used by such a sorting algorithm. Then the time to sort the items of A is O(τ_c·S_c(p) + τ_p·S_p(p)). Let A' be the sorted list whose i-th element is the i-th smallest element of A.

We next identify the insertion path from the root (index 1) to the new leaf (index q+1) in the Local_Heap arrays, as in [7, 11]; namely, the path consisting of the nodes at indices (q+1), ⌊(q+1)/2⌋, ⌊(q+1)/4⌋, ..., 1 of the Local_Heap arrays. For simplicity, let j_1, j_2, ..., j_x denote the insertion path, where j_1 = 1 and j_x = q+1. The new items are inserted along this path, starting from index j_1 and ending at index j_x, while maintaining the heap property. In order to insert the items of A' along the insertion path, we merge the p items of the sorted list A' with the p items of the sorted list L = {Local_Heap_i[j_1] | 1 ≤ i ≤ p} such that, after merging, the i-th and (i+p)-th smallest items of A' ∪ L are at processor i, 1 ≤ i ≤ p. We assign the i-th smallest item to Local_Heap_i[j_1 = 1], and hence the list {Local_Heap_i[1] | 1 ≤ i ≤ p} contains the smallest p of the total n + p items in the parallel heap. The sorted list A'' containing the (i+p)-th smallest items, 1 ≤ i ≤ p (a total of p items again), now needs to be inserted at index j_2 of the Local_Heap arrays. We repeat the process until we reach the leaf nodes (at index q+1) of the Local_Heap arrays. Assuming O(τ_c·M_c(p) + τ_p·M_p(p)) time for merging two sorted lists of length Θ(p) each using p processors, the total time for inserting p items into the heap is O(τ_c·M_c(p)·log(n/p) + τ_p·M_p(p)·log(n/p) + τ_c·S_c(p) + τ_p·S_p(p)).

Depending on the architecture of the p-processor distributed memory multiprocessor, we have different values of M_c(p), M_p(p), S_c(p) and S_p(p). Note that sorting Θ(p) items on a p-processor hypercubic network takes O(log² p) communication and O(log² p) computation time (using practical sorting algorithms), whereas sorting Θ(p) items on a √p × √p mesh network takes O(√p) communication and computation time [9]. For p-processor hypercubic networks, such as the butterfly, shuffle-exchange, Benes, de Bruijn, cube-connected cycles and hypercube, we can therefore insert or delete Θ(p) data items (highest priority items) in a parallel heap of size n in O((τ_c + τ_p)(log² p + log p·log(n/p))) time, where τ_c is the communication time of a data item between an adjacent pair of processors and τ_p is the computation time of a basic operation. For a p-processor (√p × √p) mesh network, Θ(p) data (highest priority) items can be inserted (or deleted) in O((τ_c + τ_p)·√p·log(n/p)) time. Insertions and deletions of fewer or more than p items can be handled rather easily in the above scheme and are hence omitted from this extended abstract.
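The insertion along the root-to-leaf path can likewise be pictured with a sequential stand-in (our own illustration; the sort and each merge would be the distributed S(p) and M(p) routines on the real machine, and local_heaps[i] plays the role of Local_Heap_i):

```python
def vq_insert_p_items(local_heaps, new_items):
    """Insert p new items into the vertically sliced heap (sequential stand-in)."""
    p, q = len(local_heaps), len(local_heaps[0])
    assert len(new_items) == p
    # insertion path from the root (index 1) to the new leaf (index q+1), 1-based
    path, j = [], q + 1
    while j >= 1:
        path.append(j)
        j //= 2
    path.reverse()                               # j_1 = 1, ..., j_x = q + 1
    carry = sorted(new_items)                    # the sorted list A'
    for idx in path[:-1]:
        row = [h[idx - 1] for h in local_heaps]  # items currently at this index
        merged = sorted(carry + row)             # distributed merge in the text
        for i in range(p):
            local_heaps[i][idx - 1] = merged[i]  # the p smallest stay at this index
        carry = merged[p:]                       # the next p move down the path
    for i in range(p):
        local_heaps[i].append(carry[i])          # new leaf entries at index q + 1
```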
4 Practical Aspects and Conclusions

We have described two load balanced implementations of a priority queue of n data items on a p-processor distributed memory parallel machine P. Load balanced implementations are important because they allow very large values of n to be handled (in comparison to sequential implementations) and because every processor of P has a fixed amount of memory. Let us compare the two implementations on a p-processor hypercube-based parallel machine (similar results hold for other interconnection networks, tightly coupled as well as loosely coupled). Let bw be the width of the communication channel between two adjacent processors of the hypercube, τ_c the time to communicate a data item between two adjacent processors, and τ_p the unit processing time (of a basic operation) on a processor. The horizontally sliced priority queue (HQ) allows deletion and insertion of Θ(p) items in time T_H(p) = O((p/bw)·τ_c + τ_p·p·log n) (we only count the time between two operations, since the completion of an operation is carried out by other processors in a pipelined fashion). The vertically sliced priority queue (VQ) allows deletion and insertion of Θ(p) items in time T_V(p) = O((τ_c + τ_p)·log p·log n).

Theoretically, T_V(p) is certainly smaller than T_H(p), since τ_c, τ_p and bw are assumed to be constants. However, on existing machines τ_c is much larger than τ_p (for example, on the nCUBE-2, τ_c is of the order of milliseconds whereas τ_p is at most of the order of microseconds). Hence, in practice one would like to minimize communication overheads. Queue HQ obviously uses less communication than VQ, resulting in the cycle time T_H(p) being smaller than T_V(p) in practice, even if we assume bw = 1. Typically, for existing parallel machines with a small number of processors, such as the 128-processor nCUBE-2, the communication channel bandwidth bw can be assumed to be roughly equal to p (i.e., p items can be sent in one communication time unit); in these situations the factor (p/bw)·τ_c in T_H(p) changes to τ_c·log p, since the diameter of a p-processor hypercube is log p and since we need to communicate items from all the processors to one processor, or vice versa. Obviously, in these situations T_H(p) is smaller than T_V(p). The other drawback of VQ is that its communication overhead is a function of n and p, whereas the communication overhead of HQ is a function of p alone. The number of processors p is typically much smaller than n and is fixed for a given parallel (virtual) machine; hence HQ can be assumed to have a fixed communication cost no matter what size of distributed priority queue one considers. If one considers a sequence of m insert or delete operations, p > m ≥ 1, queue HQ performs the m operations in time O((m/bw)·τ_c + τ_p·m·log n), whereas the time for m operations on VQ remains the same (i.e., T_V(p)). Hence, for m < log p, queue HQ gives better results than VQ. Of course, given the values of p, τ_c, bw and τ_p, one can work out further tradeoffs between T_H(p) and T_V(p). For example, in loosely coupled systems consisting of networked workstations we typically have τ_c/τ_p = 1000, and in these situations T_V(p) would be smaller than T_H(p) for p > 10,000. In other words, only for very fine-grain parallelism and for parallel machines with a very large number of processors (e.g., the Connection Machine) would queue VQ give better results. We do conjecture, however, that in practice implementations of HQ give better performance than those of VQ on MIMD distributed memory parallel machines. Our experiments on a 128-processor nCUBE-2 and on a distributed system consisting of 40 workstations also agree with this conjecture.

We would also like to note that for values of n close to p, a sequential implementation of a heap of n data items on any one processor of P will usually give better performance than either of the implementations described in this paper (memory at every processor would not be a bottleneck anyway for small values of n); hence, distributed priority queue implementations should be considered when the difference between n and p is very large. It is easy to see that neither load balanced implementation is optimal when one considers only the cycle times. Ideally, one would like a load balanced implementation that allows deletion or insertion of Θ(p) items in O(log n) time using p processors; the question of an optimal load balanced implementation of a priority queue thus remains open. If one allows load imbalance, however, then our ideas for VQ can be generalized to an implementation that does achieve the optimal time of O(log n) per cycle; in fact, the best strategy we know leads to a very unbalanced distribution of the priority queue among the processors. Hence, for practical purposes, a combination of HQ and VQ would yield a better distributed priority queue that balances execution times and load. We finally note that horizontally and vertically sliced priority queues can be built in parallel on a hypercube in O(n/p + p log p) time (details are omitted from this extended abstract), and using this result the universal k-selection problem can be solved on a hypercube in time O(n/p + (k/p)·log p), an improvement over the best known results for k = Ω((n·log log p)/log p).
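The comparison between T_H(p) and T_V(p) discussed above can be explored numerically with a small helper (our own illustration; constant factors are ignored, so it only indicates which regime a given machine falls into, not exact running times):

```python
import math

def compare_cycle_times(n, p, tau_c, tau_p, bw=1):
    """Evaluate the T_H and T_V bounds for given machine parameters."""
    t_h = (p / bw) * tau_c + tau_p * p * math.log2(n)
    t_v = (tau_c + tau_p) * math.log2(p) * math.log2(n)
    return {'T_H': t_h, 'T_V': t_v, 'better': 'HQ' if t_h <= t_v else 'VQ'}

# e.g. a loosely coupled system with tau_c / tau_p = 1000:
# compare_cycle_times(n=2**20, p=128, tau_c=1000.0, tau_p=1.0)
```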
References

1. S. N. Bhatt and I. C. F. Ipsen. How to embed trees in hypercubes. Research Report 443, Dept. of Computer Science, Yale University, December 1985.
2. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw Hill Book Company, 1991.
3. P. Czerwinski and V. Ramachandran. Optimal VLSI graph embeddings in variable aspect ratio rectangles. Algorithmica, 1986.
4. S. Das and W. Horng. Managing a parallel heap efficiently. In Proceedings of Parallel Architectures and Languages Europe (PARLE), Lecture Notes in Computer Science, volume 505, Springer Verlag, 1991.
5. S. Das and W. Horng. An efficient algorithm for managing a parallel heap. Personal communication, 1992.
6. N. Deo and S. Prasad. Parallel heap: An optimal parallel priority queue. Journal of Supercomputing, pages 87-98, 1992.
7. G. H. Gonnet and J. I. Munro. Heaps on heaps. SIAM Journal on Computing, 15(4):964-971, November 1986.
8. A. K. Gupta and S. E. Hambrusch. Embedding complete binary trees into butterfly networks. IEEE Transactions on Computers, 40(7):853-863, July 1991.
9. F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, and Hypercubes. Morgan Kaufmann Publishers, Inc., 1992.
10. M. C. Pinotti and G. Pucci. Parallel priority queues. Information Processing Letters (also available as TR 91-016, ICSI, Berkeley), 1991.
11. N. S. V. Rao and V. Kumar. Concurrent access of priority queues. IEEE Transactions on Computers, 37(12):1657-1665, 1988.
12. N. S. V. Rao and W. Zhang. Building heaps in parallel. Information Processing Letters, 37:355-358, 1991.