Parallel Prefix on Mesh of Trees and OTIS Mesh of Trees

Dheeresh K. Mallick1 and Prasanta K. Jana2$

1 Department of Computer Science and Engineering, Birla Institute of Technology, Mesra, Ranchi 835 215, India. E-mail: [email protected]
2 Department of Computer Science and Engineering, Indian School of Mines University, Dhanbad 826 004, India. E-mail: [email protected]
Abstract: In this paper, we first develop a parallel algorithm for prefix computation on an n × n mesh of trees (MOT). For n^2 data elements, the algorithm requires 4 log n + O(1) time using n^2 processors. Using the MOT prefix, we next propose a prefix algorithm on an n^2 × n^2 OTIS mesh of trees. For n^4 data elements, this algorithm is shown to run in 13 log n + O(1) electronic moves and 2 OTIS moves using n^4 processors.

Keywords: Prefix computation, optoelectronic, mesh of trees, OTIS mesh of trees, parallel algorithms

1 Introduction

Mesh of trees (MOT) is an efficient SIMD model that offers two advantages: a small diameter and a large bisection width [24]. This network supports logarithmic time algorithms for many problems such as packet routing, prefix computation, matrix multiplication, sorting, convolution, shortest paths, minimum spanning trees, convex hull, nearest neighbor and so on [24].

Optical Transpose Interconnection System (OTIS) [1], [2] is a hybrid architecture that benefits from both optical and electronic communication. In such systems, the processors are partitioned into groups. Processors within each group are connected by the usual electronic links, and the processors of two different groups are connected by optical links following the OTIS rule: the Pth processor of the Gth group is connected to the Gth processor of the Pth group. The choice of an interconnection pattern among the processors within each group yields a specific OTIS model, e.g., OTIS-Mesh, OTIS-hypercube, OTIS-MOT, etc. The general model of an OTIS is shown in Fig. 1 with 16 processors, in which the coordinates of a processor are shown in a small box. There has been growing interest in achieving high performance computing through OTIS. Several parallel algorithms have been reported on different OTIS models, including randomized algorithms [3], matrix multiplication [4], image processing [5], numerical analysis [6], sorting [7], BPC permutation [8], basic operations [9] and many more.
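The OTIS rule stated above can be sketched in a few lines of Python. This is only an illustration: the function names `otis_partner` and `optical_links` are hypothetical, groups and processors are numbered with a single index rather than the two-dimensional coordinates of Fig. 1, and the sketch assumes that a processor with G = P needs no optical link because it is its own partner.

```python
def otis_partner(group, proc):
    """OTIS rule: processor `proc` of group `group` is linked to
    processor `group` of group `proc` (the indices are transposed)."""
    return proc, group


def optical_links(num_groups):
    """Enumerate the optical links of an OTIS system that has
    num_groups groups of num_groups processors each (1-based indices)."""
    links = set()
    for g in range(1, num_groups + 1):
        for p in range(1, num_groups + 1):
            if g != p:  # assumption: a processor with G == P is its own partner
                links.add(frozenset({(g, p), otis_partner(g, p)}))
    return links


# The 16-processor system of Fig. 1 has 4 groups of 4 processors.
print(len(optical_links(4)))
```

Applying the rule twice returns the original processor, which is why a single OTIS move can carry data to another group and a second one can bring the result back.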
Fig. 1. 4×4 OTIS network

In this paper, we first propose a parallel algorithm for prefix computation on MOT. We then use this algorithm to develop a parallel prefix algorithm on OTIS-MOT. Our algorithm on an n^2-processor MOT requires 4 log n + O(1) time for n^2 data elements. The algorithm for n^4 data elements on an n^2 × n^2 OTIS-MOT requires 13 log n + O(1) electronic moves and 2 OTIS moves.
$ The corresponding author, Member, IEEE and IEEE Computer Society
Given N data values d_0, d_1, d_2, …, d_{N-1} and an associative binary operation θ, the problem of prefix computation is to compute µ_i = d_0 θ d_1 θ d_2 θ … θ d_i, 0 ≤ i ≤ N-1. Prefix computation is one of the basic operations and is used extensively in applications such as job scheduling, loop optimization, evaluation of polynomials, solving systems of linear equations, polynomial interpolation and the knapsack problem. Many researchers [10]-[19] have developed parallel algorithms and parallel circuits for prefix computation. An N-point parallel algorithm on a two-dimensional mesh with 3√N - 2 communication steps can be found in [10]. Egecioglu and Srinivasan [18] presented an optimal algorithm, also on the mesh architecture, with 2√N + 1 communication steps and log N + 1 arithmetic steps. A parallel algorithm reported in [11] on an extended Multi-Mesh requires 13 N^{1/4} (electronic) communication steps and log N + 4 arithmetic steps. Prefix algorithms on the OTIS platform have also been reported in [9], [20]. Wang and Sahni [9] developed a parallel algorithm for N-point prefix computation in (8 N^{1/4} - 1) electronic moves and 2 OTIS moves for both SIMD and MIMD models of an N-processor OTIS-Mesh. In the same paper, they also modified their algorithm to run in (7 N^{1/4} - 1) electronic moves and 2 OTIS moves. Jana and Sinha [20] presented an improved algorithm with (5.5 N^{1/4} + 3) electronic moves and 2 OTIS moves on the same model.
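As a point of reference for the definition above, a sequential prefix computation can be written directly with the Python standard library. The helper name `prefix` is hypothetical; the operation θ is passed in as `op` and only needs to be associative.

```python
from itertools import accumulate
import operator


def prefix(data, op=operator.add):
    """Return [d_0, d_0 θ d_1, ..., d_0 θ ... θ d_{N-1}] for an associative op."""
    return list(accumulate(data, op))


values = [3, 1, 4, 1, 5]
print(prefix(values))        # prefix sums (θ is addition)
print(prefix(values, max))   # prefix maxima (θ is max)
```

The parallel algorithms discussed in this paper produce the same N outputs; they differ only in how the work and the data movement are distributed over the processors.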
The paper is organized as follows. In section 2, we describe the prefix algorithm on MOT. Prefix computation on OTIS-MOT is presented in section 3, and section 4 concludes the paper.

2 Prefix Computation on Mesh of Trees

Without loss of generality and for the sake of simplicity, we treat prefix computation as the prefix sum for the rest of the paper. We assume that every processor P(i, j) has two local registers, namely A(i, j) and B(i, j). The proposed algorithm, called MOT-Prefix, is described below after the topology and the data initialization.
Topology of MOT: In an n × n MOT, n^2 processors are organized as an n × n lattice. Let P(i, j) denote the processor placed in the ith row and jth column, 1 ≤ i, j ≤ n. Then the interconnectivity among the processors is as follows:
1. The processors in the ith row are connected to form a binary tree rooted at P(i, 1), i.e., for j = 1 to ⌊n/2⌋, processor P(i, j) is directly connected to the processors P(i, 2j) and P(i, 2j+1), whenever they exist. We call such binary trees row-trees.
2. Similarly, the processors in the jth column are connected in the form of a binary tree rooted at P(1, j), i.e., for i = 1 to ⌊n/2⌋, processor P(i, j) is directly connected to the processors P(2i, j) and P(2i+1, j), whenever they exist. We call such binary trees column-trees.
3. All the links are bi-directional.
As an example, the graph topology of 5×5 MOT is shown in Fig. 2.
Fig. 2. 5×5 mesh of trees (not all processor addresses are shown)
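The connectivity rules can be made concrete with a small sketch that lists the direct neighbors of a processor. The function name `mot_neighbors` is hypothetical, and the sketch assumes that the column-tree children of P(i, j) are P(2i, j) and P(2i+1, j), mirroring the row-tree rule.

```python
def mot_neighbors(n, i, j):
    """Direct neighbors of P(i, j) in an n x n MOT (1-based coordinates)."""
    neighbors = set()
    # Row-tree rooted at P(i, 1): children of column j are columns 2j and 2j+1.
    for child in (2 * j, 2 * j + 1):
        if child <= n:
            neighbors.add((i, child))
    if j > 1:
        neighbors.add((i, j // 2))      # parent in the row-tree
    # Column-tree rooted at P(1, j): children of row i are rows 2i and 2i+1.
    for child in (2 * i, 2 * i + 1):
        if child <= n:
            neighbors.add((child, j))
    if i > 1:
        neighbors.add((i // 2, j))      # parent in the column-tree
    return neighbors


# Neighbors of the corner processor P(1, 1) in the 5 x 5 MOT of Fig. 2.
print(sorted(mot_neighbors(5, 1, 1)))
```

Because each tree has depth about log n, any value can travel from a root to every processor of its row or column in a logarithmic number of steps, which is what the algorithm below relies on.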
Data Initialization: n^2 data elements x_0, x_1, x_2, …, x_{n^2-1} are stored in the registers A(i, j) in row-major order, i.e., A(i, j) is initialized with the data element x_{(i-1)n + j-1}, 1 ≤ i, j ≤ n, as shown in Fig. 3 for n = 5.
x0    x1    x2    x3    x4
x5    x6    x7    x8    x9
x10   x11   x12   x13   x14
x15   x16   x17   x18   x19
x20   x21   x22   x23   x24
Fig. 3. Data initialization on MOT

Step 1. Perform parallel prefix on each row-tree using a procedure similar to the one given in [10] and store the result in the B(i, j) register. After this step, B(i, j) holds x[(i-1)n : (i-1)n + (j-1)], where x[p : q] denotes the sum x_p + x_{p+1} + x_{p+2} + … + x_q for p ≤ q. The result is shown in Fig. 4, in which the sum x_p + x_{p+1} + … + x_q is denoted by the symbol 'xp-q'.

x0    x0-1     x0-2     x0-3     x0-4
x5    x5-6     x5-7     x5-8     x5-9
x10   x10-11   x10-12   x10-13   x10-14
x15   x15-16   x15-17   x15-18   x15-19
x20   x20-21   x20-22   x20-23   x20-24

Fig. 4. Result after prefix sum on each row-tree

Step 2. Perform parallel summation in each row-tree on the contents of the A(i, j) registers and store the result in A(i, 1) of the root processor P(i, 1). After this step, A(i, 1) holds x[(i-1)n : in-1].

Step 3. Perform modified prefix on the contents of A(i, 1) of the 1st column-tree using a procedure similar to that of [10]. In this modified prefix, the processor P(i, 1) computes the sum x[0 : (i-1)n-1] for 2 ≤ i ≤ n, and A(1, 1) holds 0. The result is shown in Fig. 5, where '-' indicates a don't care value.

0       -   -   -   -
x0-4    -   -   -   -
x0-9    -   -   -   -
x0-14   -   -   -   -
x0-19   -   -   -   -

Fig. 5. Modified prefix computation on 1st column-tree

Step 4. Broadcast the content of A(i, 1) on each row as shown in Fig. 6.

0       0       0       0       0
x0-4    x0-4    x0-4    x0-4    x0-4
x0-9    x0-9    x0-9    x0-9    x0-9
x0-14   x0-14   x0-14   x0-14   x0-14
x0-19   x0-19   x0-19   x0-19   x0-19

Fig. 6. Result after row-wise broadcast

Step 5. Sum up the contents of A(i, j) and B(i, j) to produce the final prefix. The processor P(i, j) holds the final result x[0 : (i-1)n + (j-1)].

Steps 1 and 3 require 2 log n time each. Steps 2 and 4 are each computed in log n time, and step 5 requires constant time. Therefore, the above algorithm can be performed in 4 log n + O(1) time.

3 Prefix Computation on OTIS-MOT

Topology of OTIS-MOT: The n^4 processors of an n^2 × n^2 OTIS-MOT are divided into n^2 groups, as shown in Fig. 7 for n = 3. Each group G, consisting of n^2 processors, is an n × n two-dimensional MOT. The processor placed in the kth row and lth column within the group G(i, j) is denoted by P(i, j, k, l) for 1 ≤ i, j, k, l ≤ n. The processors within a group are connected by electronic links following the interconnections of MOT, while those of different groups are interconnected by optical links according to the OTIS rule, i.e., the processor P(i, j, k, l) is directly connected to the processor P(k, l, i, j). Here, we first develop an algorithm called Modified-MOT-Prefix on MOT, which will be used in our proposed parallel prefix on the OTIS-MOT.

Modified Prefix on MOT: Here also we assume that n^2 data elements x_0, x_1, x_2, …, x_{n^2-1} are initially stored in the registers A(i, j) in row-major order. We write the steps to compute the modified prefix as follows. All the steps can be illustrated in the same way as the prefix computation on MOT described in the previous section.
Step 1: Perform modified prefix on each row-tree using a procedure similar to the one given in [10] and store the result in the B(i, j) register.
Step 2: Perform parallel prefix on each row-tree using the original data stored in the A(i, j) register and store the result also in A(i, j).
Step 3: Perform modified prefix on the contents of A(i, n) of the last column processors.
Step 4: On each row, broadcast the content of A(i, n) to store in A(i, j), 1 ≤ i, j ≤ n.
Step 5: Add the content of the A register of step 4 and the B register of step 1 to yield the final modified prefix.
Step 6: Stop.

Each of the steps 1, 2 and 3 requires 2 log n time. Step 4 broadcasts in log n time and step 5 requires constant time. Therefore, the above algorithm requires 7 log n + O(1) time. The steps of the algorithm can be illustrated in the same way as those of the MOT-Prefix algorithm.

We now present the prefix algorithm on the OTIS-MOT. Our parallel algorithm is based on the prefix algorithm of [9] and the prefix algorithms on the MOT described above.
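The two MOT-level procedures, MOT-Prefix and Modified-MOT-Prefix, can be checked with a short sequential simulation. The sketch below is not a parallel implementation: it only reproduces, step by step, the register contents that the steps are described as producing, and it does not model the tree communication or its cost. The function names are hypothetical, and the modified prefix is taken to be the exclusive prefix (element t receives the sum of elements 0 to t-1, with 0 for the first element), as suggested by Fig. 5.

```python
from itertools import accumulate


def exclusive(values):
    """Modified prefix: position t receives the sum of positions 0..t-1."""
    return [0] + list(accumulate(values))[:-1]


def mot_prefix(x, n):
    """Register-level simulation of the five MOT-Prefix steps on n*n values."""
    A = [x[r * n:(r + 1) * n] for r in range(n)]      # row-major initialization
    B = [list(accumulate(row)) for row in A]          # Step 1: row prefix into B
    row_sums = [sum(row) for row in A]                # Step 2: row sums in A(i, 1)
    offsets = exclusive(row_sums)                     # Step 3: modified prefix on column 1
    A = [[offsets[r]] * n for r in range(n)]          # Step 4: row-wise broadcast
    return [A[r][c] + B[r][c] for r in range(n) for c in range(n)]   # Step 5


def modified_mot_prefix(x, n):
    """Register-level simulation of the Modified-MOT-Prefix steps."""
    rows = [x[r * n:(r + 1) * n] for r in range(n)]
    B = [exclusive(row) for row in rows]              # Step 1: modified row prefix into B
    A = [list(accumulate(row)) for row in rows]       # Step 2: row prefix kept in A
    offsets = exclusive([row[-1] for row in A])       # Step 3: modified prefix on A(i, n)
    A = [[offsets[r]] * n for r in range(n)]          # Step 4: row-wise broadcast
    return [A[r][c] + B[r][c] for r in range(n) for c in range(n)]   # Step 5


data = list(range(1, 26))                             # 25 values for a 5 x 5 MOT
print(mot_prefix(data, 5) == list(accumulate(data)))
print(modified_mot_prefix(data, 5) == exclusive(data))
```

Both checks compare the simulated result with a plain sequential prefix of the same data, which is the property the OTIS-MOT algorithm depends on.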
Fig. 7. 3 × 3 OTIS-MOT (partial optical connectivity shown)
Algorithm OTIS-MOT-Prefix:

Data Initialization: We store the data elements x_0, x_1, x_2, …, x_{n^4-1} in the registers A(i, j, k, l) in row-major order within each group and also in row-major order from group to group, i.e., A(i, j, k, l) is initialized with x_{(i-1)n^3 + (j-1)n^2 + (k-1)n + (l-1)}, as shown in Fig. 8 for n = 3.

G(1,1)          G(1,2)          G(1,3)
x0  x1  x2      x9  x10 x11     x18 x19 x20
x3  x4  x5      x12 x13 x14     x21 x22 x23
x6  x7  x8      x15 x16 x17     x24 x25 x26

G(2,1)          G(2,2)          G(2,3)
x27 x28 x29     x36 x37 x38     x45 x46 x47
x30 x31 x32     x39 x40 x41     x48 x49 x50
x33 x34 x35     x42 x43 x44     x51 x52 x53

G(3,1)          G(3,2)          G(3,3)
x54 x55 x56     x63 x64 x65     x72 x73 x74
x57 x58 x59     x66 x67 x68     x75 x76 x77
x60 x61 x62     x69 x70 x71     x78 x79 x80

Fig. 8. Data initialization on OTIS-MOT
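The initialization rule can be checked mechanically. The following sketch uses two hypothetical helpers, `initial_index` and `group_range`, with 1-based coordinates as in the text; it verifies that every data element is assigned to exactly one register and reports which elements a given group holds.

```python
def initial_index(i, j, k, l, n):
    """Index t of the element x_t initially stored in A(i, j, k, l)."""
    return (i - 1) * n**3 + (j - 1) * n**2 + (k - 1) * n + (l - 1)


def group_range(i, j, n):
    """First and last element index held by group G(i, j)."""
    first = initial_index(i, j, 1, 1, n)
    return first, first + n * n - 1


n = 3
coords = range(1, n + 1)
indices = sorted(initial_index(i, j, k, l, n)
                 for i in coords for j in coords
                 for k in coords for l in coords)
print(indices == list(range(n**4)))   # each x_t appears exactly once
print(group_range(2, 1, n))           # element range held by G(2, 1) for n = 3
```

Since each group holds a contiguous block of n^2 elements, a prefix over all n^4 elements splits into a local prefix inside each group plus one offset per group.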
Step 1: Compute the local prefix on the contents of the A-registers in each group in parallel by applying the algorithm MOT-Prefix described in section 2 and store the result in the B register. After this step, B(i, j, k, l) holds x[(i-1)n^3 + (j-1)n^2 : (i-1)n^3 + (j-1)n^2 + (k-1)n + (l-1)], which is illustrated in Fig. 9.

G(1,1)                    G(1,2)                    G(1,3)
x0      x0-1    x0-2      x9      x9-10   x9-11     x18     x18-19  x18-20
x0-3    x0-4    x0-5      x9-12   x9-13   x9-14     x18-21  x18-22  x18-23
x0-6    x0-7    x0-8      x9-15   x9-16   x9-17     x18-24  x18-25  x18-26

G(2,1)                    G(2,2)                    G(2,3)
x27     x27-28  x27-29    x36     x36-37  x36-38    x45     x45-46  x45-47
x27-30  x27-31  x27-32    x36-39  x36-40  x36-41    x45-48  x45-49  x45-50
x27-33  x27-34  x27-35    x36-42  x36-43  x36-44    x45-51  x45-52  x45-53

G(3,1)                    G(3,2)                    G(3,3)
x54     x54-55  x54-56    x63     x63-64  x63-65    x72     x72-73  x72-74
x54-57  x54-58  x54-59    x63-66  x63-67  x63-68    x72-75  x72-76  x72-77
x54-60  x54-61  x54-62    x63-69  x63-70  x63-71    x72-78  x72-79  x72-80

Fig. 9. Local prefix sum
Step 2: Perform an OTIS move to send the partial results from A(i, j, n, n), 1 ≤ i, j ≤ n, to the group G(n, n), as illustrated in Fig. 10. After this step, A(n, n, k, l) holds x[(k-1)n^3 + (l-1)n^2 : (k-1)n^3 + ln^2 - 1].

A-registers of G(3, 3) after the move:
x0-8     x9-17    x18-26
x27-35   x36-44   x45-53
x54-62   x63-71   x72-80

Fig. 10. OTIS move to G(3, 3) (not all moves are shown)

Step 3: Apply the algorithm Modified-MOT-Prefix on the contents of the A-registers of the group G(n, n). The modified prefix on this group after this step is shown in Fig. 11.

0       x0-8    x0-17
x0-26   x0-35   x0-44
x0-53   x0-62   x0-71

Fig. 11. Modified prefix on group G(3, 3)

Step 4: Perform an OTIS move to send back the modified prefix of step 3 from the group G(n, n) to be stored in the corresponding A(i, j, n, n) register, 1 ≤ i, j ≤ n.
Step 5: In each group, broadcast the modified prefix received in step 4 to be stored in the A registers.
Step 6: Each processor now calculates its target prefix by adding, in parallel, the contents of the B-registers of step 1 to those of the A-registers of step 5. A(i, j, k, l) holds the final prefix sum x[0 : (i-1)n^3 + (j-1)n^2 + (k-1)n + (l-1)] after this step.
Step 7: Stop.

Time complexity: Step 1 requires 4 log n + O(1) electronic moves for local prefix computation in each group. Step 3 computes the modified prefix in group G(n, n) in 7 log n + O(1) electronic moves. Step 5 broadcasts in 2 log n + O(1) electronic moves. Steps 2 and 4 each require a single OTIS move, and step 6 requires no data movement. Therefore, the above algorithm requires 13 log n + O(1) electronic moves and 2 OTIS moves, i.e., 3.25 log N + O(1) electronic moves and 2 OTIS moves for N = n^4 data elements.

Table 1: Comparison of the OTIS based prefix algorithms

Algorithm              Electronic moves    OTIS moves
Wang and Sahni [9]     7n - 1              2
Jana and Sinha [20]    5.5n + 3            2
Proposed Algorithm     13 log n + O(1)     2

4 Conclusion
In this paper, we have first presented a parallel algorithm for n^2-point prefix sum on an n × n MOT in 4 log n + O(1) time using n^2 processors. We have next developed an algorithm for n^4-point prefix sum on an n^2 × n^2 OTIS-MOT. This algorithm has been shown to run in 13 log n + O(1) electronic moves and 2 OTIS moves using n^4 processors. The comparison of the time complexity between the proposed algorithm and the other prefix algorithms on the OTIS platform is shown in Table 1 for the same number of data elements and processors.
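The overall structure of the OTIS-MOT algorithm summarized above can be checked with a sequential simulation. This is a sketch under stated assumptions, not the parallel algorithm itself: groups are numbered in row-major order with a single index g = (i-1)n + (j-1), the two OTIS moves are modeled simply as copying each group total to position g of the last group and back, the group total is taken from the last entry of the local prefix, and the function names are hypothetical. Communication cost is not modeled.

```python
from itertools import accumulate


def exclusive(values):
    """Modified prefix: position t receives the sum of positions 0..t-1."""
    return [0] + list(accumulate(values))[:-1]


def otis_mot_prefix(x, n):
    """Register-level simulation of the OTIS-MOT prefix steps on n**4 values."""
    m = n * n                                           # group size and group count
    groups = [x[g * m:(g + 1) * m] for g in range(m)]   # A registers, group by group
    B = [list(accumulate(group)) for group in groups]   # Step 1: local prefix in B
    totals = [b[-1] for b in B]                         # Step 2: OTIS move of group totals to G(n, n)
    offsets = exclusive(totals)                         # Step 3: modified prefix inside G(n, n)
    # Step 4: OTIS move back to P(i, j, n, n); Step 5: broadcast inside each group.
    A = [[offsets[g]] * m for g in range(m)]
    # Step 6: every processor adds its A and B registers.
    return [A[g][p] + B[g][p] for g in range(m) for p in range(m)]


data = list(range(81))                                  # n = 3, as in Figs. 8 to 11
print(otis_mot_prefix(data, 3) == list(accumulate(data)))
```

Only the n^2 group totals cross group boundaries, which is why two OTIS moves suffice regardless of n.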
5 References

[1] G. C. Marsden, P. J. Marchand, P. Harvey and S. C. Esener, “Optical transpose interconnection system architectures,” Optics Letters, Vol. 18, No. 13, pp. 1083-1085, July 1993.
[2] C. F. Wang and S. Sahni, “OTIS optoelectronic computers,” Parallel Computation Using Optical Interconnections, K. Li, Y. Pan and S. Q. Zhang, Eds. Kluwer Academic, 1998.
[3] S. Rajasekaran and S. Sahni, “Randomized routing, selection, and sorting on the OTIS-Mesh optoelectronic computer,” IEEE Trans. on Parallel and Distributed Systems, Vol. 9, No. 9, pp. 833-840, 1998.
[4] C. F. Wang and S. Sahni, “Matrix multiplication on the OTIS-Mesh optoelectronic computer,” IEEE Trans. on Computers, Vol. 50, No. 7, pp. 635-646, July 2001.
[5] C. F. Wang and S. Sahni, “Image processing on the OTIS-Mesh optoelectronic computer,” IEEE Trans. on Parallel and Distributed Systems, Vol. 11, No. 2, pp. 97-109, December 1998.
[6] P. K. Jana, “Polynomial interpolation and polynomial root finding on OTIS-Mesh,” Parallel Computing, Vol. 32, No. 4, pp. 301-312, 2006.
[7] A. Osterloh, “Sorting on the OTIS-Mesh,” Proc. 14th Int’l. Parallel and Distributed Processing Symposium (IPDPS 2000), pp. 269-274, 2000.
[8] S. Sahni and C. F. Wang, “BPC permutations on the OTIS-Mesh optoelectronic computer,” Proc. Fourth Int’l. Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI ’97), pp. 130-135, 1997.
[9] C. F. Wang and S. Sahni, “Basic operations on the OTIS-Mesh optoelectronic computer,” IEEE Trans. on Parallel and Distributed Systems, Vol. 9, No. 12, pp. 1226–, December 1998.
[10] S. G. Akl, The Design and Analysis of Parallel Algorithms, Englewood Cliffs, NJ: Prentice Hall, 1989.
[11] P. K. Jana, B. D. Naidu, S. Kumar, M. Arora and B. P. Sinha, “Parallel prefix computation on extended multi-mesh network,” Information Processing Letters, Vol. 84, No. 6, pp. 295-303, October 2002.
[12] Y. C. Lin and C. M. Lin, “Efficient parallel prefix algorithms on fully connected message passing computers,” Proc. of Third Int. Conference on High Performance Computing (HiPC), Trivandrum, India, Dec. 19-22, 1996.
[13] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” J. ACM, Vol. 27, pp. 831-839, Oct. 1980.
[14] D. A. Carlson and B. Sugla, “Limited width parallel prefix circuits,” J. Supercomput., Vol. 4, pp. 107-129, June 1990.
[15] F. E. Fich, “New bounds for parallel prefix circuits,” Proc. Fifteenth Symp. on the Theory of Computing, pp. 100-109, 1993.
[16] M. Snir, “Depth-size trade-offs for parallel prefix computation,” J. Algorithms, Vol. 7, pp. 185-201, 1986.
[17] C. P. Kruskal, L. Rudolph and M. Snir, “The power of parallel prefix,” IEEE Transactions on Computers, Vol. C-34, No. 10, pp. 965, Oct. 1985.
[18] O. Egecioglu and A. Srinivasan, “Optimal parallel prefix on mesh architecture,” Parallel Algorithms and Applications, Vol. 1, pp. 191-209, 1993.
[19] S. Lakshmivarahan and S. K. Dhall, Parallel Computing Using the Prefix Problem, Oxford, U.K.: Oxford University Press, 1994.
[20] P. K. Jana and B. P. Sinha, “An improved parallel prefix algorithm on OTIS-Mesh,” Parallel Processing Letters, Vol. 16, No. 4, pp. 429-440, 2006.
[21] Khaled Day, “Topological properties of OTIS networks,” IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 4, pp. 359-366, 2002.
[22] Khaled Day, “Optical transpose K-ary n-cube networks,” Journal of System Architecture, Vol. 50, pp. 697-705, 2004.
[23] Behrooz Parhami, “The Hamiltonicity of swapped (OTIS) networks built of Hamiltonian component networks,” Information Processing Letters, Vol. 95, pp. 441-445, 2005.
[24] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Mateo, CA, 1992.