Randomized Routing, Selection, and Sorting on the OTIS-Mesh

Sanguthevar Rajasekaran∗
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611

Sartaj Sahni†
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611
Abstract

The Optical Transpose Interconnection System (OTIS) is a recently proposed model of computing that exploits the special features of both electronic and optical technologies. In this paper we present efficient algorithms for packet routing, sorting, and selection on the OTIS-Mesh.

The diameter of an N²-processor OTIS-Mesh is 4√N − 3. We present an algorithm for routing any partial permutation in 4√N + o(√N) time. Our selection algorithm runs in time 6√N + o(√N), and our sorting algorithm runs in 8√N + o(√N) time. All these algorithms are randomized, and the stated time bounds hold with high probability. Also, the queue size needed for these algorithms is O(1) with high probability.
∗ Research of this author is supported in part by NSF Award CCR-95-03-007 and EPA Grant R-825-293-01-0.
† Research of this author is supported, in part, by the Army Research Office under Grant DAAH04-95-1-0111.
Key Words. Optical computing, mesh-connected computers, parallel algorithms, sorting, packet routing, selection, randomized algorithms.
1 Introduction
The OTIS model of computing has been proposed in [4, 7, 13]. In this model, processors are partitioned into groups, where each group is realized as a chip with electronic interprocessor connections. Connections among the groups are realized using free-space optical links. The advantage of this optoelectronic architecture lies in the fact that free-space optical links provide superior speed and power when the connection distance is more than a few millimeters. The processors in a group can be organized as a mesh, a hypercube, or any other network; accordingly, the OTIS-Mesh, the OTIS-Hypercube, etc., arise. Let j be any processor in group i. In the OTIS model, this processor is connected to processor i of group j. Figure 1 shows an OTIS-Mesh with four groups of four processors each.
Figure 1: A 4 × 4 OTIS-Mesh

An N × N OTIS-Mesh has N groups, where each group has N processors organized as a √N × √N mesh. The diameter of this mesh has been shown to be 4√N − 3 [11].
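To make the model concrete, the following small sketch (ours, not from the paper) builds the OTIS-Mesh as a graph, with electronic links inside each √N × √N group and one optical transpose link from processor p of group g to processor g of group p, and checks the 4√N − 3 diameter formula by breadth-first search on a small instance.

```python
from collections import deque
from itertools import product

def build_otis_mesh(n):
    """Adjacency lists of an OTIS-Mesh with N = n*n groups of N processors,
    each group an n x n mesh. Nodes are (group, processor) pairs."""
    N = n * n
    adj = {(g, p): set() for g, p in product(range(N), repeat=2)}
    for g, p in adj:
        r, c = divmod(p, n)
        # electronic mesh links within the group
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n:
                adj[(g, p)].add((g, nr * n + nc))
        # optical transpose link: processor p of group g <-> processor g of group p
        if g != p:
            adj[(g, p)].add((p, g))
    return adj

def eccentricity(adj, src):
    """Longest BFS distance from src to any other node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

n = 3                   # sqrt(N); the OTIS-Mesh then has N^2 = 81 processors
adj = build_otis_mesh(n)
diam = max(eccentricity(adj, s) for s in adj)
print(diam, 4 * n - 3)  # the BFS diameter should agree with 4*sqrt(N) - 3
```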
The Mesh architecture can be either SIMD or MIMD; accordingly, two variants of the OTIS-Mesh can be conceived of. In this paper we consider the basic problems of partial permutation routing, sorting, and selection. We employ the MIMD version of the model. In particular, we assume that any processor can communicate with all of its neighbors and perform a local computation in the same time step.

Sahni and Wang [11] have presented efficient deterministic algorithms for routing general BPC permutations. Their algorithm takes 12(√N − 1) electronic moves and log N + 2 optical moves for a general BPC permutation; for specific permutations, the algorithm might run faster. In this paper we do not count the optical and electronic moves separately. The number of optical moves is significantly smaller than the number of electronic moves in all of our algorithms, and all the stated time bounds include the electronic as well as the optical moves. Sahni and Wang [12] have also given deterministic algorithms for sorting; their algorithm has a time bound of 22√N on the SIMD model.

In this paper we give a routing algorithm that runs in time 4√N + o(√N). Since the diameter of an N × N OTIS-Mesh is 4√N − 3, this algorithm is optimal up to a lower-order term. We also present algorithms for sorting and selection. Our sorting and selection algorithms run in time 8√N + o(√N) and 6√N + o(√N), respectively. All three algorithms are randomized, and the stated bounds hold with high probability.
2 Some Preliminaries

2.1 Problem Definitions
If each node in a network has a packet of information that has to be sent to some other node, the problem of packet routing is to send all the packets to their correct destinations as quickly as possible, such that at most one packet passes through any interconnect link at any time. Packet routing is a fundamental problem of parallel computing, since an efficient algorithm for it results in faster interprocessor communication. Partial permutation routing is a special case of routing in which each node is the origin of at most one packet and each node is the destination of at most one packet. The run time of a packet routing algorithm is defined to be the time taken by the last packet to reach its destination. The queue length is defined as the maximum number of packets any node has to store during routing. Priority schemes are used to resolve contentions for edges; farthest destination first and farthest origin first are examples of priority schemes. We assume that a packet contains not only the message (from one processor to another) but also the origin and destination information of the packet. An algorithm for packet routing is specified by 1) the path to be taken by each packet, and 2) a priority scheme.

Given a sequence of n keys, the problem of sorting is to rearrange this sequence in either ascending or descending order. Given a sequence of n keys and an i, 1 ≤ i ≤ n, the problem of selection is to identify the ith smallest key in the sequence. Sorting and selection are important problems of computing.
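As an illustration of the priority-scheme component above (the names and types here are our own, not the paper's), the following sketch resolves contention for a link by the farthest-destination-first rule, using the Manhattan distance each packet still has to travel.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    origin: tuple    # (row, col) node where the packet started
    dest: tuple      # (row, col) node where it must end up
    payload: object

def remaining_distance(p: Packet, here: tuple) -> int:
    """Manhattan distance from the current node to the packet's destination."""
    return abs(p.dest[0] - here[0]) + abs(p.dest[1] - here[1])

def pick_winner(contenders, here):
    """Farthest destination first: the packet with the longest way still to go
    crosses the contended link; the others wait in the node's queue."""
    return max(contenders, key=lambda p: remaining_distance(p, here))
```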
2.2 Randomized Algorithms
We say a randomized algorithm uses O(g(n)) amount of any resource (such as time or space) if there exists a constant c such that the amount of resource used is no more than cαg(n) with probability ≥ 1 − n^−α on any input of length n and for any α (see, e.g., [5]). Similar definitions apply to o(g(n)) and other such asymptotic functions. By high probability we mean a probability ≥ 1 − n^−α for any fixed α ≥ 1 (n being the input size of the problem at hand). Let B(n, p) denote a binomial random variable with parameters n and p, and let 'w.h.p.' stand for 'with high probability'.

One of the most frequently used tools in analyzing randomized algorithms is the Chernoff bound. These bounds provide close approximations to the probabilities in the tail ends of a binomial distribution. Let X stand for the number of heads in n independent flips of a coin, the probability of a head in a single flip being p; X has the binomial distribution B(n, p). The following three facts (known as Chernoff bounds, discovered by Chernoff [2] and Angluin & Valiant [1]) will be used in the paper:

Prob[X ≥ m] ≤ (np/m)^m e^{m−np},
Prob[X ≥ (1 + ε)np] ≤ exp(−ε²np/3), and
Prob[X ≤ (1 − ε)np] ≤ exp(−ε²np/2),

for any 0 < ε < 1 and m > np.
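For intuition, the following sketch (ours) compares the second bound against the exact binomial upper tail; the bound is loose, but it decays at the right exponential rate.

```python
import math

def binom_upper_tail(n, p, k):
    """Exact Prob[X >= k] for X ~ B(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p, eps = 1000, 0.5, 0.2
threshold = math.ceil((1 + eps) * n * p)
exact = binom_upper_tail(n, p, threshold)
bound = math.exp(-eps**2 * n * p / 3)  # Prob[X >= (1+eps)np] <= exp(-eps^2 np / 3)
print(f"exact tail = {exact:.3e}, Chernoff bound = {bound:.3e}")
```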
2.3 Basic Data Movements
Consider an N × N OTIS-Mesh. Let the groups be G0, G1, ..., G_{N−1}, and let the processors in each group be numbered 0 through N − 1. Each processor can be labeled with a quadruple (i, j, k, l), where 0 ≤ i, j, k, l ≤ √N − 1. Here i√N + j is the group number and k√N + l is the processor number within that group. This labeling makes the presentation of our algorithms easier; it also corresponds to embedding a √N × √N × √N × √N mesh in the OTIS-Mesh. Each move of the 4D mesh can be simulated in ≤ 3 moves on the OTIS-Mesh [13]. Let (i, j, k, l) be any processor of the 4D mesh; its corresponding processor in the OTIS-Mesh has the same label. The 4D mesh moves (i, j, k ± 1, l) and (i, j, k, l ± 1) take one electronic move each, since these moves are local to groups. The 4D mesh moves (i ± 1, j, k, l) and (i, j ± 1, k, l) can be simulated in one electronic and two optical moves each. For example, the move (i + 1, j, k, l) can be done by the sequence (i, j, k, l) −o→ (k, l, i, j) −e→ (k, l, i + 1, j) −o→ (i + 1, j, k, l), where o denotes an optical move and e an electronic move.

We refer to the four dimensions of the 4D mesh and its embedding in the OTIS-Mesh as u, v, w, x. Each move along the w and x dimensions takes only one (electronic) step, whereas each move along the u and v dimensions takes three steps. We shall employ this observation repeatedly in all of our algorithms.
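A small sketch (in our notation) of the label mapping and of simulating one u-move by the optical, electronic, optical sequence described above:

```python
def to_quadruple(group, proc, n):
    """Label processor `proc` of group `group` as (i, j, k, l), where
    group = i*n + j and proc = k*n + l, with n = sqrt(N)."""
    return (*divmod(group, n), *divmod(proc, n))

def optical(i, j, k, l):
    """Transpose link: processor (k,l) of group (i,j) -> processor (i,j) of group (k,l)."""
    return (k, l, i, j)

def u_move(i, j, k, l):
    """Simulate the 4D-mesh move (i,j,k,l) -> (i+1,j,k,l) in three OTIS moves."""
    a = optical(i, j, k, l)            # o: (k, l, i, j)
    b = (a[0], a[1], a[2] + 1, a[3])   # e: (k, l, i+1, j), local to group (k, l)
    return optical(*b)                 # o: (i+1, j, k, l)

print(u_move(0, 1, 2, 3))  # -> (1, 1, 2, 3)
```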
For any group i, let its middle processor refer to processor N/2 + √N/2 of that group, and call group N/2 + √N/2 the middle group of the Mesh. The following lemma has been proven in [11].

Lemma 2.1 Sorting on an N²-processor OTIS-Mesh can be done in time O(√N).

Problem 1. Each of the N groups in an N × N OTIS-Mesh has a datum in an arbitrary processor. The problem is to collect these N data items in a specified group G_i (for any 0 ≤ i ≤ N − 1), one datum per processor.

Lemma 2.2 Problem 1 can be solved in ≤ 2√N − 1 steps.
Proof. Group j sends its datum to processor i in that group. This takes no more than 2√N − 2 steps. In another optical move the datum from processor i of group j is sent to processor j of group i. ✷
Problem 2. One group, say G_i, has N data items (located one per processor). These data items have to be replicated in each of the N groups.

Lemma 2.3 Problem 2 can be solved in 2√N steps.

Proof. Sahni and Wang [12] proved this lemma. Processor j (for 0 ≤ j ≤ N − 1) of group i sends its datum to processor i of group j. After this parallel step, each group has one datum. Now each group replicates its datum in all of its processors; this takes 2√N − 2 steps. Finally, each processor sends its datum along the transpose connection, i.e., processor k of group l sends its datum to processor l of group k. It is easy to verify that now each group has the required data items. ✷

Problem 3. Each group has N^ε keys, for some fixed ε, …

[…]

4 Selection

Lemma 4.1 Let S be a random sample of s keys drawn from a set X of n keys, let k1, k2, …, ks be the sample keys in sorted order, and let r_i be the rank of k_i in X. Then, for every α > 0,

Prob[|r_i − i(n/s)| > √(3α) (n/√s) √(log n)] < n^−α.

A proof of the above lemma can be found in [10].
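A quick empirical check of the lemma (our own sketch): sample s of n keys and compare the worst rank deviation of the sample keys against the √(3α)·(n/√s)·√(log n) bound.

```python
import math, random

n, s, alpha = 100_000, 1_000, 1.0
keys = list(range(n))                  # key value x has rank x + 1 in X
sample = sorted(random.sample(keys, s))

bound = math.sqrt(3 * alpha) * (n / math.sqrt(s)) * math.sqrt(math.log(n))
worst = max(abs((sample[i - 1] + 1) - i * n / s) for i in range(1, s + 1))
print(f"worst deviation = {worst:.0f}, bound = {bound:.0f}")  # worst < bound w.h.p.
```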
Algorithm RSelect below gives a detailed description of the selection algorithm; it has seven steps. To begin with, there is a key at each of the Mesh processors; at the end, the ith smallest key will be available at the middle processor of the middle group.

Algorithm RSelect

Step 1. Each key includes itself as a sample key with probability N^−0.53. As a result there will be O(N^0.47) sample keys in each group and a total of O(N^1.47) sample keys in the whole OTIS-Mesh.
Step 2. Collect the sample keys in the groups G_i, G_{i+1}, …, G_{i+N^0.47−1}, where i is the label of the middle group (i.e., i = N/2 + √N/2) (c.f. Problem 3). Rearrange these data into a 4D submesh of size N^0.4 × N^0.4 × N^0.4 × N^0.4 around the middle processor of G_i (c.f. Problem 4).

Step 3. Sort the sample keys (c.f. Lemma 2.1). Let s be the number of keys in the sample. Pick two keys, l1 and l2, from the sample whose ranks in the sample are i(s/N²) − δ and i(s/N²) + δ, respectively, for δ ≥ √(6αs log N) and any fixed α ≥ 1.

Step 4. Broadcast the keys l1 and l2 to the whole OTIS-Mesh. All the input keys that do not fall in the interval [l1, l2] are deleted. Count the number N1 of keys in the original input that are less than l1 and the number N2 of keys that are in the interval [l1, l2]. If i ≤ N1, or i > N1 + N2, or N2 is not O(N^1.27), start all over; otherwise let i = i − N1.
Step 5. Partition the OTIS-Mesh into 4D submeshes of size N^0.4 × N^0.4 × N^0.4 × N^0.4 each. Each surviving key chooses a random node within the 4D submesh it is in and goes there greedily, as in the first phase of the packet routing algorithm (c.f. Section 3.1).
Step 6. Let B be a 4D submesh of size N^0.4 × N^0.4 × N^0.4 × N^0.4 centered around the middle processor of the middle group. Collect the surviving keys in B as in Step 2.

Step 7. Sort B and route the ith smallest key to the middle processor of the middle group.

Theorem 4.1 Algorithm RSelect runs in time 6√N + o(√N), the queue size being O(1).

Proof. The number of sample keys in Step 1 has the distribution B(N², N^−0.53). Thus the expected number of keys in the sample S is N^1.47, and an application of Chernoff bounds implies that |S| = O(N^1.47). Step 1 takes O(1) time.

Step 2 takes time 2√N + o(√N) (c.f. Corollary 2.1 and Lemma 2.5).

Step 3 can be completed in o(√N) time, in accordance with Lemma 2.1.

Broadcasting in Step 4 can be done in 2√N + O(1) time as follows. Let (a, b, c, d) be the middle processor of the middle group G_i. Both l1 and l2 are in (a, b, c, d). Consider the broadcast of l1 only, since l2 can be broadcast in an additional O(1) time using the technique of pipelining. The key l1 has to be broadcast to all the processors that are at a distance of √N/2 or less in each of the four dimensions.

First broadcast along the w and x dimensions, so that at the end all the processors (a, b, [c − √N/2, c + √N/2], [d − √N/2, d + √N/2]) have a copy of l1. This takes √N time.
Each of these processors now sends a copy using its transpose OTIS connection, so that the processors ([c − √N/2, c + √N/2], [d − √N/2, d + √N/2], a, b) now have copies of l1. Within each of the groups ([c − √N/2, c + √N/2], [d − √N/2, d + √N/2]) a broadcasting is done so that all the processors ([c − √N/2, c + √N/2], [d − √N/2, d + √N/2], [a − √N/2, a + √N/2], [b − √N/2, b + √N/2]) have copies. This takes another √N time. Finally, these processors send copies along their transpose OTIS connections. Thus all the processors ([a − √N/2, a + √N/2], [b − √N/2, b + √N/2], [c − √N/2, c + √N/2], [d − √N/2, d + √N/2]) end up with copies of l1.

After the broadcasting of l1 and l2 is over, two counters c1 and c2, carrying partial information about N1 and N2, are routed toward the center of the OTIS-Mesh. The path taken by the counters is the reverse of the path taken by l1 and l2. Thus, within an additional 2√N + O(1) time, N1 and N2 will be available at (a, b, c, d). As a result, Step 4 takes 4√N + O(1) time.

An application of Lemma 4.1 implies that the number of keys surviving Step 4 is O((N²/√(N^1.47)) √(log N)) = O(N^1.27). But these keys can be arbitrarily distributed in the OTIS-Mesh.
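To see this broadcast pattern concretely, here is a toy simulation (ours, not from the paper). For brevity it broadcasts within whole groups rather than within √N/2-radius windows, in which case two local broadcasts and one transpose already reach every processor; the final transpose stage above is needed only for the windowed version.

```python
from itertools import product

n = 3; N = n * n                   # N groups and N processors per group
has_key = {(g, p): False for g, p in product(range(N), repeat=2)}
src_g, src_p = N // 2, N // 2      # roughly the middle processor of the middle group
has_key[(src_g, src_p)] = True

def local_broadcast(state):
    """Within every group holding the key, spread it to all processors of
    that group (the electronic (w, x)-dimension broadcast)."""
    lit = {g for (g, p), v in state.items() if v}
    for g, p in state:
        if g in lit:
            state[(g, p)] = True

def transpose(state):
    """Every holder forwards a copy along its optical OTIS link."""
    for g, p in [node for node, v in state.items() if v]:
        state[(p, g)] = True

local_broadcast(has_key)      # group src_g now holds the key everywhere
transpose(has_key)            # processor src_g of every group holds it
local_broadcast(has_key)      # every processor of every group holds it
print(all(has_key.values()))  # True
```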
Step 5 takes only o(√N) time.
In Step 6, the algorithms of Problems 3 and 4 are employed to collect the surviving keys. Thus it is necessary to estimate how many surviving keys there will be in each group after Step 5. Note that there are only O(N^1.27) keys in the whole OTIS-Mesh. Let g be any group. The group g intersects N^0.2 of the 4D submeshes (each of size N^0.4 × N^0.4 × N^0.4 × N^0.4); call these 4D submeshes b1, b2, …, b_{N^0.2}, and let the number of surviving keys in b_j be β_j, for 1 ≤ j ≤ N^0.2. Then the expected number of surviving keys in the intersection of b_j and g is β_j/N^0.8. Thus the expected number of keys in any group is Σ_{j=1}^{N^0.2} β_j/N^0.8 = O(N^0.47), and an application of the Chernoff bounds readily implies that the number of surviving keys in any group is O(N^0.47).
Now, applying Corollary 2.1 and Lemma 2.5, we see that Step 6 can be completed in 2√N + o(√N) time. Step 7 takes o(√N) time, by Lemma 2.1.
Note that we can overlap Steps 4 and 6. In particular, the collection of surviving keys can be started immediately after l1 and l2 have been broadcast, i.e., after 4√N + o(√N) time. Surviving keys and sample keys might cause edge contentions; in such cases, give higher priority to the sample keys. The number of sample keys at any time along any dimension is only o(√N), so the additional delay the surviving keys can suffer is o(√N).

As a result, the algorithm runs in 6√N + o(√N) time. ✷
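Stripped of the mesh geometry, the core of RSelect is sample-based filtering. A sequential sketch (ours; the sampling rate is generic rather than the paper's N^−0.53) of one round:

```python
import math, random

def rselect(keys, i, alpha=1.0):
    """Return the i-th smallest key (1-indexed): sample, pick bracketing
    splitters l1 <= l2, keep only the window [l1, l2], and read off the
    answer. Mirrors Steps 1-4 and 7 of RSelect without the mesh routing."""
    n = len(keys)
    while True:
        sample = sorted(k for k in keys if random.random() < n ** -0.5)
        s = len(sample)
        if s < 2:                                 # degenerate sample: sort directly
            return sorted(keys)[i - 1]
        delta = math.sqrt(6 * alpha * s * math.log(n))
        lo = min(s - 1, max(0, math.floor(i * s / n - delta)))
        hi = min(s - 1, max(0, math.ceil(i * s / n + delta)))
        l1, l2 = sample[lo], sample[hi]
        n1 = sum(k < l1 for k in keys)            # N1: keys below the window
        window = sorted(k for k in keys if l1 <= k <= l2)
        if n1 < i <= n1 + len(window):            # the i-th key fell in the window
            return window[i - n1 - 1]
        # otherwise the window missed (low probability): retry

data = random.sample(range(10**6), 10**4)
print(rselect(data, 5000) == sorted(data)[4999])  # True
```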
5 Sorting
In this section we present a randomized algorithm for sorting on an N²-processor OTIS-Mesh that runs in time 8√N + o(√N), the queue size being O(1).
Perhaps the first randomized algorithm for sorting was given by Frazer and McKellar in their wonderful paper [3]. Given n keys, the technique of [3] is to: 1) randomly sample n^ε (for some constant ε < 1) keys, 2) sort this sample (using any algorithm), 3) partition the input using the sorted sample as splitter keys, and 4) sort each part separately in parallel. Since then this idea has been employed on a variety of parallel models; see [8] for a survey.

Let X = k1, k2, …, kn be a given sequence of n keys and let S = {ℓ1, ℓ2, …, ℓs} be a random sample of s keys picked from X, listed in sorted order. X is partitioned into (s + 1) parts defined as follows: X1 = {ℓ ∈ X : ℓ ≤ ℓ1}, Xj = {ℓ ∈ X : ℓ_{j−1} < ℓ ≤ ℓj} for 2 ≤ j ≤ s, and X_{s+1} = {ℓ ∈ X : ℓ > ℓs}. The following lemma (see, e.g., [10]) probabilistically bounds the size of each of these subsets, and will prove helpful to our algorithm.

Lemma 5.1 The cardinality of each Xj (1 ≤ j ≤ s + 1) is O((n/s) log n).
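A compact rendering (ours) of the Frazer-McKellar scheme: sample, sort the sample, partition by the splitters, and sort the parts.

```python
import bisect, random

def samplesort(xs, eps=0.5):
    """Sort xs by the Frazer-McKellar scheme: sort a random sample of about
    len(xs)**eps splitters, bucket every key by binary search, sort buckets."""
    n = len(xs)
    if n < 2:
        return list(xs)
    splitters = sorted(random.sample(xs, max(1, int(n ** eps))))
    buckets = [[] for _ in range(len(splitters) + 1)]
    for x in xs:
        # bucket j holds X_{j+1} = { x : splitter_{j-1} < x <= splitter_j }
        buckets[bisect.bisect_left(splitters, x)].append(x)
    return [x for b in buckets for x in sorted(b)]

data = [random.randrange(10**6) for _ in range(10**5)]
print(samplesort(data) == sorted(data))  # True
```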
A Mesh implementation of the technique of [3] was given by Kaklamanis and Krizanc [6], who showed how to sort an n × n Mesh in 2n + o(n) time. Our implementation is similar to theirs.

Any sorting algorithm on a network has to specify an indexing scheme, which dictates how the sorted elements should be arranged in the network. For the 2D Mesh, some of the commonly used indexing schemes are row-major, column-major, snake-like row- or column-major, snake-like blockwise row-major, and so on. The indexing scheme we employ is the analog of the snake-like blockwise row-major scheme. In particular, we partition the OTIS-Mesh into blocks of 4D submeshes of size N^0.4 × N^0.4 × N^0.4 × N^0.4 each. Within a block, the keys can appear in any order; the blocks themselves are arranged snake-like in the lexicographic order of the dimensions (u, v, w, x). Details of our sorting algorithm are given as RSort.

Algorithm RSort

Step 1. Each key includes itself as a sample key with probability N^−0.66. As a result there will be O(N^0.34) sample keys in each group and a total of O(N^1.34) sample keys in the whole OTIS-Mesh. Partition the OTIS-Mesh
into 4D submeshes of size N^0.4 × N^0.4 × N^0.4 × N^0.4 each. The final sorted keys will be arranged in these blocks using the indexing scheme specified above. Call each such 4D submesh an a-block.

Step 2. Collect the sample keys in the groups G_i, G_{i+1}, …, G_{i+N^0.34−1}, where i is the label of the middle group (i.e., i = N/2 + √N/2) (c.f. Problem 3). Rearrange these data into a 4D submesh S of size N^0.34 × N^0.34 × N^0.34 × N^0.34 around the middle processor of G_i (c.f. Problem 4). Call this 4D submesh S the sample block.

Step 3. Sort the sample keys (c.f. Lemma 2.1). Partition the OTIS-Mesh into 4⁴ evenly sized 4D submeshes. This partitioning is similar to the one in Figure 4, except that the regions here are of size √N/4 × √N/4 × √N/4 × √N/4 each.

Step 4. Broadcast the sample block to the whole OTIS-Mesh. At the end, every 4D submesh of size N^0.34 × N^0.34 × N^0.34 × N^0.34 in the OTIS-Mesh will have a copy of the sample block.

Step 5. Partition the Mesh into blocks of size N^0.34 × N^0.34 × N^0.34 × N^0.34. Call each such 4D submesh a b-block. Each b-block has a copy of the sample block. Sort the input keys and sample keys in each of these blocks. Perform a prefix computation within the block to determine an approximate destination a-block for each key, and also compute the partial rank of each sample key in this block.

Step 6. Each key is routed to its approximate destination a-block using the algorithm of Theorem 3.2.

Step 7. Compute and broadcast the global ranks of the sample keys to the whole OTIS-Mesh.

Step 8. In each a-block, sort the input keys and sample keys, do a prefix computation, and figure out the global ranks of the input keys.
Step 9. Route the keys to their final a-blocks. With high probability, the final a-block of a key is very near its approximate a-block.

Theorem 5.1 RSort runs in time 8√N + o(√N), maintaining constant-size queues with high probability.

Proof. Step 1 takes O(1) time. Steps 3, 5, 8, and 9 take a total of o(√N) time. For Step 9, it can be shown that the actual destination a-block of any key is only O(1) blocks away from its approximate destination a-block.
Step 2 takes 2√N + o(√N) time.

Step 4 can be completed in 2√N + o(√N) time. The technique is similar to the one used for Step 4 of Algorithm RSelect. One way of broadcasting the block is to do appropriate window-broadcastings in the (w, x) planes first and in the (u, v) planes next. Window-broadcasting in the (w, x) planes takes √N time. In order to perform window-broadcasting in the (u, v) planes, we send data using the transpose OTIS connections, perform window-broadcasting in the (w, x) planes, and again transfer data using the transpose OTIS connections. Thus Step 4 can be completed in 2√N time.

Step 6 takes 4√N + o(√N) time (c.f. Theorem 3.2).

Step 7 can be completed in 4√N + o(√N) time along the same lines as Step 4. The sample block carrying the partial-ranks information is routed toward the center of the OTIS-Mesh. In 2√N + o(√N) time the global ranks will be available at the center of the OTIS-Mesh. This block is then broadcast to the whole OTIS-Mesh in another 2√N + o(√N) time.

Steps 6 and 7 can be overlapped. Therefore, the algorithm runs in a total of 8√N + o(√N) time. ✷
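The indexing scheme is easy to make concrete. The paper leaves the exact snake unspecified; one natural realization (our sketch) ranks blocks boustrophedon-style in lexicographic (u, v, w, x) order, so that consecutive ranks are adjacent blocks.

```python
def snake_rank(coords, B):
    """Position of a block along the 4D snake. `coords` = (bu, bv, bw, bx)
    in a B x B x B x B grid of blocks; dimensions are taken in lexicographic
    (u, v, w, x) order, and the sweep direction of each dimension alternates
    with the parity of the rank accumulated so far."""
    rank = 0
    for c in coords:
        if rank % 2:            # odd prefix: sweep this dimension backwards
            c = B - 1 - c
        rank = rank * B + c
    return rank

# the first few blocks of a tiny 2 x 2 x 2 x 2 block grid, in snake order
B = 2
order = sorted(((bu, bv, bw, bx) for bu in range(B) for bv in range(B)
                for bw in range(B) for bx in range(B)),
               key=lambda c: snake_rank(c, B))
print(order[:4])  # [(0,0,0,0), (0,0,0,1), (0,0,1,1), (0,0,1,0)]
```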
6 Conclusions
In this paper we have presented efficient randomized algorithms for packet routing, sorting, and selection on the OTIS-Mesh. For routing and sorting, the only known lower bound is the diameter; an open question is whether our algorithms are optimal. Our algorithms assume that the number of keys to be operated on is the same as the number of processors. We believe that our algorithms can be extended (preserving the work done) to the case where the number of keys exceeds the number of processors; it will be interesting to work out this case. Another important open problem is to match our time bounds deterministically.
References

[1] D. Angluin and L. G. Valiant, Fast Probabilistic Algorithms for Hamiltonian Paths and Matchings, Journal of Computer and System Sciences, 18, pp. 155-193, 1979.

[2] H. Chernoff, A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations, Annals of Mathematical Statistics, 23, pp. 493-507, 1952.

[3] W. D. Frazer and A. C. McKellar, Samplesort: A Sampling Approach to Minimal Storage Tree Sorting, Journal of the ACM, 17(3), pp. 496-507, 1970.

[4] W. Hendrick, O. Kibar, P. Marchand, C. Fan, D. V. Blerkom, F. McCormick, I. Cokgor, M. Hansen, and S. Esener, Modeling and Optimization of the Optical Transpose Interconnection System, Optoelectronic Technology Center, Program Review, Cornell University, September 1995.

[5] E. Horowitz, S. Sahni, and S. Rajasekaran, Computer Algorithms, W. H. Freeman Press, 1998.

[6] C. Kaklamanis and D. Krizanc, Optimal Sorting on Mesh-Connected Processor Arrays, Proc. Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, 1992, pp. 50-59.

[7] G. C. Marsden, P. J. Marchand, P. Harvey, and S. C. Esener, Optical Transpose Interconnection System Architectures, Optics Letters, 18(3), pp. 1083-1085, July 1993.

[8] S. Rajasekaran, Sorting and Selection on Interconnection Networks, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 21, 1995, pp. 275-296.
[9] S. Rajasekaran and Th. Tsantilas, Optimal Routing Algorithms for Mesh-Connected Processor Arrays, Algorithmica, 8, 1992, pp. 21-38.

[10] S. Rajasekaran and J. H. Reif, Derivation of Randomized Sorting and Selection Algorithms, in Parallel Algorithm Derivation and Program Transformation, edited by R. Paige, J. H. Reif, and R. Wachter, Kluwer Academic Publishers, 1993, pp. 187-205.

[11] S. Sahni and C. Wang, BPC Permutations on the OTIS-Mesh Optoelectronic Computer, Proc. Fourth International Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI '97), 1997, pp. 130-135.

[12] S. Sahni and C. Wang, Basic Algorithms on the OTIS-Mesh Optoelectronic Computer, Manuscript, 1997.

[13] F. Zane, P. Marchand, R. Paturi, and S. Esener, Scalable Network Architectures Using the Optical Transpose Interconnection System (OTIS), Proc. Third International Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI '96), 1996, pp. 114-121.