ORNL/TM-12309

Engineering Physics and Mathematics Division
Mathematical Sciences Section

PARALLEL MATRIX TRANSPOSE ALGORITHMS
ON DISTRIBUTED MEMORY CONCURRENT COMPUTERS

Jaeyoung Choi (x)
Jack J. Dongarra (x)(y)
David W. Walker (x)

(x) Mathematical Sciences Section
Oak Ridge National Laboratory
P.O. Box 2008, Bldg. 6012
Oak Ridge, TN 37831-6367

(y) Department of Computer Science
University of Tennessee at Knoxville
107 Ayres Hall
Knoxville, TN 37996-1301

Date Published: October 1993

Research was supported by the Applied Mathematical Sciences Research Program of the Office of Energy Research, U.S. Department of Energy, by the Defense Advanced Research Projects Agency under contract DAAL03-91-C-0047, administered by the Army Research Office, and by the Center for Research on Parallel Computing.

Prepared by the Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, managed by Martin Marietta Energy Systems, Inc. for the U.S. DEPARTMENT OF ENERGY under Contract No. DE-AC05-84OR21400
Contents

1 Introduction
2 Design Issues
3 Matrix Transpose Algorithms
  3.1 P and Q relatively prime
  3.2 P and Q not relatively prime
4 Results
5 Conclusions and Remarks
6 References
PARALLEL MATRIX TRANSPOSE ALGORITHMS ON DISTRIBUTED MEMORY CONCURRENT COMPUTERS Jaeyoung Choi Jack J. Dongarra David W. Walker
Abstract
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations of the different groups are overlapped. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed in LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine C = A·B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T·B^T, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
1. Introduction

Matrix transposition is a fundamental matrix operation of linear algebra [8, 13] and arises in many scientific and engineering applications. On a uniprocessor, an algorithm involving a transposed matrix may not actually require the matrix data to be transposed in physical memory; it may be accessed simply by exchanging the row and column indices. However, in a distributed-memory multiprocessor environment, we cannot simply interchange the global row and column indices. Instead, the data must be physically moved from one processor to another.

Transposition of a matrix is a redistribution of its elements. Many researchers have considered the problem for different architectures. In 1972, Eklundh [7] considered the problem of directly accessing rows or columns of a matrix when its size is larger than the available high-speed storage. O'Leary [12] implemented an algorithm for transposing an N × N matrix on a one-dimensional systolic array. Azari, Bojanczyk and Lee [1] developed an algorithm for transposing an M × N matrix on an N × N mesh-connected array processor, and Johnsson and Ho [10] presented an algorithm for a Boolean n-cube, or hypercube.

Current advanced architecture computers possess hierarchical memories in which accesses to data in the upper levels of the memory hierarchy (registers, cache, and/or local memory) are faster than those in lower levels (shared or off-processor memory). To exploit the power of such machines, block-partitioned algorithms are preferred for dense linear algebra computations, in which operations are performed on submatrices rather than on individual matrix elements. In distributing matrix data over processors we therefore assume a block scattered decomposition [4, 6]. The block scattered decomposition can reproduce the most common data distributions used in dense linear algebra, as described briefly in the next section.

In this paper, parallel matrix transpose algorithms based on the block scattered decomposition are presented. The algorithms are implemented on the Intel Touchstone Delta computer. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of the number of rows and columns (P and Q) of the processor template. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. This is called all-to-all personalized communication, in which each of N_p = P · Q processors is required to send distinct subblocks to each of the remaining N_p − 1 processors, and to receive distinct subblocks from each of them. Bokhari and Berryman [2] have developed binary exchange and quadrant exchange algorithms on a circuit-switched mesh, where P and Q are powers of 2. The complete exchange communication in our transpose algorithms arises for any processor configuration, and is not limited to the case where P and Q are powers of 2. We implemented the complicated two-dimensional complete exchange communication problem by generalizing the one-dimensional complete exchange communication based on direct point-to-point communication. Details are discussed in Section 3.1.
We have presented the Parallel Universal Matrix Multiplication Algorithms (PUMMA) in [5] for performing C ← α op(A) op(B) + β C, where op(X) = X or X^T, based on the block scattered decomposition. One of the cases in the PUMMA package, C ← α A^T B^T + β C, is implemented in two steps (T ← B·A, then C ← α T^T + β C). The second step involves parallel matrix transposition. The performance of this algorithm for evaluating C = A^T·B^T is compared with that of the algorithm for evaluating C = A·B on the Intel Delta machine in Section 4.
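The identity underlying the two-step scheme is simply (B·A)^T = A^T·B^T. The following small serial check (our own NumPy sketch, not part of PUMMA) confirms it for arbitrary conforming matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 4))      # A is k x m, so A^T is m x k
    B = rng.standard_normal((5, 3))      # B is n x k, so B^T is k x n

    C_direct   = A.T @ B.T               # C = A^T B^T, an m x n result
    C_two_step = (B @ A).T               # step 1: T = B A;  step 2: C = T^T

    assert np.allclose(C_direct, C_two_step)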
2. Design Issues

The way in which an algorithm's data are distributed over the processors of a concurrent computer has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability. The block scattered decomposition provides a simple, yet general-purpose way of distributing a block-partitioned matrix on distributed memory concurrent computers. In the block scattered decomposition, described in detail in [4, 6], an M × N matrix is partitioned into blocks of size r × s, and blocks separated by a fixed stride in the column and row directions are assigned to the same processor. If the stride in the column and row directions is P and Q blocks respectively, then we require that P·Q equal the number of processors, N_p. Thus, it is useful to imagine the processors arranged as a P × Q mesh, or template. The processor at position (p, q) (0 ≤ p < P, 0 ≤ q < Q) in the template is assigned the blocks indexed by

    (p + i·P, q + j·Q),                                                (1)

where i = 0, ..., ⌊(M_b − p − 1)/P⌋, j = 0, ..., ⌊(N_b − q − 1)/Q⌋, and M_b × N_b is the size in blocks of the matrix (M_b = ⌈M/r⌉, N_b = ⌈N/s⌉).

Blocks are scattered in this way so that good load balance can be maintained in parallel algorithms, such as LU factorization [3, 6]. The nonscattered decomposition (or pure block distribution) is just a special case of the scattered decomposition in which the block size is given by r = ⌈M/P⌉ and s = ⌈N/Q⌉. A purely scattered decomposition (or two-dimensional wrapped distribution) is another special case in which the block size is given by r = s = 1.

If P and Q are relatively prime, the matrix transpose algorithm involves a two-dimensional complete exchange communication, where each of the N_p processors is required to send distinct subblocks to each of the remaining N_p − 1 processors, and to receive distinct subblocks from each of them. We implemented the complicated two-dimensional complete exchange algorithm by generalizing the one-dimensional complete exchange algorithm. Three one-dimensional complete exchange communication schemes are shown in Figure 1, where each processor needs one subblock from each other processor, and the number in parentheses denotes the number of subblocks to transmit.
[Figure 1 panels: (a) Binary Exchange; (b) Rotating; (c) Direct Communication.]
Figure 1: Three complete exchange communication schemes on 8 processors. The number in parentheses denotes the amount of data to transmit.
[Figure 2 plot: elapsed time in seconds versus block size in Kbytes for the Binary Exchange, Rotating, and Direct Communication schemes.]
Figure 2: Comparison of three exchange communication schemes on 16 processors.
[Figure 3 panels: (a) block distribution over template; (b) LCM block distribution.]
Figure 3: A matrix with 12 × 12 blocks is distributed over a 2 × 3 processor template. (a) Each shaded and unshaded area represents a different template. The numbered squares represent blocks of elements, and the number indicates at which location in the processor template the block is stored; all blocks labeled with the same number are stored in the same processor. The slanted numbers, on the left and on the top of the matrix, represent indices of row block and column block, respectively. (b) The matrix has 2 × 2 LCM blocks. Blocks belong to the same processor if the relative locations of the blocks are the same in each square LCM block. The definition of the LCM block is given in the text.

The binary exchange scheme completes in ⌈log2 P⌉ steps, and the amount of data transmitted in each step is fixed at 2^(⌈log2 P⌉ − 1) subblocks, where P is the number of processors. The rotating scheme can avoid network congestion, but requires P − 1 steps, and the amount of data transmitted in the initial steps is large. In the direct point-to-point communication scheme, the number of steps is the same as in the rotating scheme, but the amount of data transmitted in each step is only one subblock. The three schemes have been implemented on 16 nodes of the Delta, and their performance is compared in Figure 2. The binary exchange and the rotating schemes are implemented with an odd-even communication scheme, which is preferable to a simultaneous communication scheme on the Delta [5, 11]: odd-numbered processors send their blocks and even-numbered processors receive them in the first step, and even-numbered processors send and odd-numbered processors receive in the next step. On P = 2^d processors, as shown in Figure 2, the binary exchange scheme is the fastest. However, if P is not a power of 2, this scheme becomes very complicated and may be slower than the direct communication scheme. The direct communication scheme is about 20% slower than the binary exchange scheme in the worst case (P = 2^d), but it is simple to implement on an arbitrary number of processors. We adopted the simple direct communication scheme for the implementation of the matrix transpose algorithms, which are explained in detail in the next section.
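For reference, the direct scheme's schedule is easy to state and check in a few lines of Python (an illustrative sketch; the exact send/receive offsets below are one natural choice and are not prescribed by the report):

    def direct_exchange_schedule(P):
        """At step s (1 <= s < P), processor p sends one subblock to (p - s) mod P
        and receives one from (p + s) mod P, so the exchange takes P - 1 steps."""
        return [[((p - s) % P, (p + s) % P) for p in range(P)]
                for s in range(1, P)]

    P = 8
    sent_to = {p: set() for p in range(P)}
    for step in direct_exchange_schedule(P):
        for p, (dest, _recv) in enumerate(step):
            sent_to[p].add(dest)
    # After P - 1 steps every processor has sent exactly one subblock to every other.
    assert all(sent_to[p] == set(range(P)) - {p} for p in range(P))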
[Figure 4 panels: (a) matrix transpose from the matrix point-of-view; (b) matrix transpose from the processor point-of-view.]
Figure 4: An example of matrix transpose for a block scattered decomposition, when P = 2, Q = 3, and M_b = N_b = 6.
3. Matrix Transpose Algorithms

We assume that a matrix is distributed over a two-dimensional processor mesh, or template, so that in general each processor has several blocks of the matrix, as shown in Figure 3 (a), where a matrix with 12 × 12 blocks is distributed over a 2 × 3 template. Denoting the least common multiple of P and Q by LCM, we refer to a square of LCM × LCM blocks as an LCM block. Thus, the matrix may be viewed as a 2 × 2 array of LCM blocks, as shown in Figure 3 (b). The concept of the LCM block was introduced in [5], and is very useful for implementing algorithms that use a block scattered data distribution. Blocks belong to the same processor if their relative locations are the same in each square LCM block. An algorithm may therefore be developed for the first LCM block and then applied directly to the other LCM blocks, which all have the same structure and the same data distribution as the first LCM block. That is, when an operation is executed on a block of the first LCM block, the same operation can be performed simultaneously on the other blocks that occupy the same relative location in their LCM blocks.
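The defining property of the LCM block, namely that shifting a block index by LCM in either direction does not change its owner, follows directly from Eq. (1). The short check below (illustrative Python; owner encodes the assignment rule of Eq. (1)) exercises it for the 2 × 3 template of Figure 3:

    from math import gcd

    P, Q = 2, 3
    LCM = P * Q // gcd(P, Q)      # = 6

    def owner(I, J):
        return (I % P, J % Q)

    for I in range(LCM):
        for J in range(LCM):
            assert owner(I, J) == owner(I + LCM, J) == owner(I, J + LCM)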
[Figure 5 panels: (a) matrix transpose from the matrix point-of-view; (b) matrix transpose from the processor point-of-view.]
Figure 5: An example of matrix transpose for a block scattered decomposition, when P = 3, Q = 3, and M_b = N_b = 6.
We now describe the parallel matrix transpose algorithms. A matrix A, distributed over a P × Q processor template, has M_b × N_b blocks, and each block consists of r × s elements, where r and s are arbitrary. Figure 4 (a) shows an example of a matrix transpose on a 2 × 3 template. If A is transposed, the transposed matrix A^T is distributed over the same P × Q template; it has N_b × M_b blocks, and each block has s × r elements. The elements of each block remain in the same block, but may end up in a different processor, and each block is itself transposed. Figure 4 (b) shows the same example from the processor point-of-view. If P and Q are relatively prime, as shown in the figure, the blocks in the first processor P0 are scattered to all processors. As shown in Figure 5, which is the same example on a 3 × 3 square template, the blocks in each processor are then not dispersed; they are moved as one entity to a different processor. Parallel matrix transpose algorithms for the block scattered data distribution thus have several communication patterns, determined by the greatest common divisor (GCD) of P and Q.
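In index terms, block (I, J) of A becomes block (J, I) of A^T, so it moves from processor (I mod P, J mod Q) to processor (J mod P, I mod Q). The following illustrative Python snippet (our own, not from the report) checks the two behaviours seen in Figures 4 and 5:

    def src(I, J, P, Q):
        return (I % P, J % Q)          # owner of block (I, J) of A

    def dest(I, J, P, Q):
        return (J % P, I % Q)          # owner of block (J, I) of A^T

    Mb = 6

    # P = 2, Q = 3 (relatively prime): the blocks held by P0 = (0, 0) scatter everywhere.
    P, Q = 2, 3
    dests = {dest(I, J, P, Q) for I in range(Mb) for J in range(Mb)
             if src(I, J, P, Q) == (0, 0)}
    assert len(dests) == P * Q

    # P = Q = 3: each processor's blocks move together to a single processor.
    P = Q = 3
    for p in range(P):
        for q in range(Q):
            dests = {dest(I, J, P, Q) for I in range(Mb) for J in range(Mb)
                     if src(I, J, P, Q) == (p, q)}
            assert dests == {(q, p)}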
-7DO J = 0; Q , 1 DO I = 0; P , 1 [ Copy all blocks of A required by P hp + I; q , J i to T1 (in condensed and transposed form) ] [ Send T1 to P hp + I; q , J i ] [ Receive T2 from P hp , I; q + J i ] [ Copy T2 to C ] END DO END DO Figure 6: A parallel matrix transpose algorithm from the processor point-of-view, when P and Q are relatively prime. P hp; qi represents PMOD( ) MOD( ) . Processor P (0 p < P and 0 q < Q) communicates with P hp + I; q , J i to send, and P hp , I; q + J i to receive based on direct point-to-point communication. `p + I' and `q , J' can be replaced with a dierent combination of signs. p;P ;
q;Q
p;q
3.1. P and Q : relatively prime We start with the simple case where P and Q are relatively prime, i. e. GCD = 1. In this case blocks in P0 are scattered to all processors after being locally transposed as shown in Figure 4 (b). This case involves the two-dimensional complete exchange communication. That is, every processor needs to communicates with every other processor. The complete exchange problem is implemented by direct communication between sender and receiver. Figure 6 shows the pseudocode from the processor point-of-view, where P hp; qi represents PMOD( ) MOD( ) in the processor template. Processor P hp; qi (0 p < P and 0 q < Q) starts to transpose blocks whose transposed blocks belong to itself. Then it deals with blocks whose transposition are in processors in the same column of the template (P hp,i; qi, 0 i < P ). The processor sends blocks to its top neighbor, P hp , 1; qi, and receives blocks from its bottom neighbor, P hp + 1; qi. Before sending the blocks, it is necessary to copy the blocks to be sent into a contiguous message buer. Next it sends blocks to the next top processor, P hp , 2; qi, and receives blocks from the next bottom processor, P hp + 2; qi. After it completes its operations with the processors in the same column, it sends blocks to the processors to the left in the template (P hp , i; q , 1i, 0 i < P), and receives blocks from the processors to the right (P hp + i; q + 1i). All operations are completed in P Q = LCM steps. We interpret the algorithm from the matrix point-of-view. In the rst LCM block, the above algorithm performs the operation by transposing one (wrapped) diagonal blocks at one step. (Actually the algorithm transposes every LCM diagonal blocks of the matrix at each step.) The rst step of the algorithm in Figure 6 requires no explicit communication. It corresponds p;P ;
q;Q
-80
1
2
0
1
2
0
1
2
A0,0 3
0
4
5
3
4
5
3
4
5
3
A1,1 0
1
2
4
5
A0,3 1
A1,4 2
0
1
2
0
1
2
0
1
2
3
4
5
3
4
5
3
4
5
1
2
0
1
2
5
3
4
5
A2,2 3
4
5
A2,5 A3,3
0
1
2
0
A3,0 1
2
0
A4,4 3
4
5
3
4
A4,1 5
3
4
A5,5 (a) zeroth diagonal (A(i,j), MOD(j-i,LCM)=0) 0
1
2
0
AT2,0 3
4
5
1
1
2
3 0
4 1
4
5
3
4
1
AT0,4 3
2
0
1
5
3
A1,5
AT3,1
2
0
4
5
1
5 2
3
0
4
5
3
1
2
0
AT0,2 3
1 4 1
4 1
5
1 4
0
1
5
3
3
4
4
A5,1
5
3
4
5
2
0
1
2
4
5
A5,3 5
3
2
A0,4 0
1
2
0
5
3
4
5
2
4 1
A1,5 AT1,5
(d) first diagonal (A(i,j), MOD(j-i,LCM)=1) 0
1
2
0
1
2
5
3
4
5
0
1
2
4
5
AT1,0 5
3
AT5,1
A1,0
2
0
4
A0,5 AT2,1
1
2
AT3,2
A2,1 5
3
4
A3,5 2
2
A4,2
2
5
3
AT4,3
A3,2 0
1
2
AT2,4
A4,0 3
5
AT5,3
A2,4 AT1,3
0
4
A2,0
A1,3 0
3
AT4,0
A0,2 4
1
AT0,4 A5,3
2
0
AT2,0
AT4,2
(c) fourth diagonal (A(i,j), MOD(j-i,LCM)=4)
3
2
A3,1
A4,2 AT1,5
0
1
AT5,3
A3,1 0
0
AT4,2
A2,0 3
(b) third diagonal (A(i,j), MOD(j-i,LCM)=3)
A0,4 AT3,1
0
2
A5,2
0
1
2
AT5,4
A4,3 3
4
5
AT3,5
(e) second diagonal (A(i,j), MOD(j-i,LCM)=2)
3
4
5
3
4
AT0,5
5
A5,4
(f) fifth diagoanl (A(i,j), MOD(j-i,LCM)=5)
Figure 7: Snapshots of matrix transposition when P = 2, Q = 3 and M_b = N_b = 6. The small slanted number in the upper left corner of each block is the processor identification number. One wrapped block diagonal is transposed in each step. The darkly shaded areas represent blocks to be shifted, and the lightly shaded areas their transposes.
    DO J = 0, Q − 1
      K = J
      /* Determine the K-th diagonal block to transpose */
      WHILE (MOD(K, P) ≠ 0) DO
        K = MOD(K + Q, LCM)
      END DO
      DO I = 0, P − 1
        [ Copy every (K : N_b : LCM)-th wrapped diagonal block in P_{p,q} to T1 ]
        [ Move T1 from P_{p,q} to P⟨p + I, q − J⟩ ]
        [ Copy the received T1 to C ]
        K = MOD(K + Q, LCM)
      END DO
    END DO

Figure 8: A parallel matrix transpose algorithm from the matrix point-of-view, when P and Q are relatively prime. One diagonal block is transposed at each step. The WHILE statement is executed until MOD(K, P) becomes 0. (start : limit : intv) represents the values x = start + intv · y, y = 0, 1, ..., with x not exceeding limit.

It corresponds to an in-place transpose of the diagonal blocks of A (A(i, i)); see Figure 7 (a). Then the P wrapped diagonal blocks of A with MOD(j − i, Q) = 0 (see Figure 7 (b)) are transposed in the first pass of the outer loop (J = 0) of Figure 6. In the next pass of the outer loop (J = 1), the next P diagonal blocks (A(i, j), MOD(j − i, Q) = 1) are transposed. In Figures 7 (c) and (d), P0 (P⟨0, 0⟩) sends blocks to P2 (P⟨0, 2⟩) and receives from P1 (P⟨0, 1⟩), where P0, P1 and P2 are in the same row. Then P0 sends blocks to P5 (P⟨1, 2⟩) and receives from P4 (P⟨1, 1⟩), and so on. The pseudocode for the algorithm from the matrix point-of-view is shown in Figure 8. Processors need to determine the diagonal block of A (A(i, j), MOD(j − i, LCM) = K) at which to start in each pass of the outer J loop, so that they communicate with the other processors in the same row of the template; the two lines before the inner DO-loop compute this value of K.
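The WHILE loop in Figure 8 simply searches, within the J-th pass, for the first wrapped diagonal whose starting row matches. Re-implementing the loop in Python (an illustrative sketch for coprime P and Q) reproduces the diagonal order seen in Figure 7 for P = 2, Q = 3:

    from math import gcd

    def diagonal_order(P, Q):
        LCM = P * Q // gcd(P, Q)
        order = []
        for J in range(Q):
            K = J
            while K % P != 0:              # the WHILE statement of Figure 8
                K = (K + Q) % LCM
            for _ in range(P):             # the inner I loop
                order.append(K)
                K = (K + Q) % LCM
        return order

    assert diagonal_order(2, 3) == [0, 3, 4, 1, 2, 5]   # panels (a)-(f) of Figure 7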
3.2. P and Q not relatively prime

In the previous section we investigated the case where P and Q are relatively prime, which involves complete exchange communication. In this section we consider the case GCD > 1; the former algorithm is the special case GCD = 1 of this algorithm. Figure 9 shows an example of transposing a matrix with 12 × 12 blocks on a 4 × 6 template from the processor point-of-view. Each processor holds its own 3 × 2 (= LCM/P × LCM/Q) array of blocks. The processors can transpose the matrix in 6 (= LCM/P · LCM/Q = LCM/GCD) communication steps. As shown in Figure 10, a processor P⟨p, q⟩ starts by communicating with P⟨p~, q~⟩, where p~ and q~ are computed from p and q (the details are explained later in this section). Once P⟨p~, q~⟩ is determined, it communicates with the other processors whose vertical and horizontal intervals from P⟨p~, q~⟩ are multiples of GCD. The two loop bounds of the algorithm in Figure 6 are changed from Q and P to LCM/P and LCM/Q, respectively.
p, q
~ p, ~q
~ p, ~q+GCD
......
~ p, ~ q+(LCM/P) GCD
~ p+GCD, ~ q
~ p+GCD, ~ q+GCD
......
~ p+GCD, ~ q+(LCM/P) GCD
~ p+2 GCD, ~ q
~ p+2 GCD, ~ q+GCD
......
~ p+2 GCD, ~ q+(LCM/P) GCD
: : ~ p+(LCM/Q)GCD, ~q
: : ~ p + (LCM/Q)GCD, ~ q+GCD
: : ......
~ p+(LCM/Q)GCD, ~ q+(LCM/P) GCD
Figure 10: Processor map for communication. A processor P hp; qi starts to communicate with P hp~; q~i, then it communicates with other processors, whose vertical and horizontal intervals are GCD from P hp~; q~i.
- 11 -
PARDO K = 1; GCD g = MOD(q , p; GCD) p~ = MOD(p + g; P ); q~ = MOD(q , g; Q) DO J = 0; LCM=P , 1 DO I = 0; LCM=Q , 1 [ Copy to T1 (in condensed and transposed form) all blocks of A required by P hp~ + I GCD; q~ , J GCDi ] [ Send T1 to P hp~ + I GCD; q~ , J GCDi ] [ Receive T2 from P hp~ , I GCD; q~ + J GCDi] [ Copy T2 to C ] END DO END DO END PARDO Figure 11: A modi ed matrix transpose algorithm from the processor point-of-view. Operations of GCD groups of processors are overlapped. 0
1
2
3
4
5
0
1
2
3
4
5
0
1
A0,0 6
2
3
4
5
0
1
2
3
4
8
9
10
11
6
7
8
9
10
A0,1 7
8
9
10
11
6
7
8
9
10
11
6
AT0,1
A1,1 12
13
18
19
0
1
6
7
12
13
18
19
0
1
6
7
12
13
18
19
14 15 16 17 12 13 14 15 16 17 A2,2 20 21 22 23 18 19 20 21 22 23 A3,3 2 3 4 5 0 1 2 3 4 5 A4,4 8 9 10 11 6 7 8 9 10 11 A5,5 14 15 16 17 12 13 14 15 16 17 A6,6 20 21 22 23 18 19 20 21 22 23 A7,7 2 3 4 5 0 1 2 3 4 5 A8,8 8 9 10 11 6 7 8 9 10 11 A9,9 14 15 16 17 12 13 14 15 16 17 A10,10 20 21 22 23 18 19 20 21 22 23 A11,11
0
1
2
3
4
6
7
8
9 10 11
11
A1,2
12
13 14 15 16 17 12 13 14 15 16 17 AT1,2 A2,3 18 19 20 21 22 23 18 19 20 21 22 23 T A2,3 A3,4 0 1 2 3 4 5 0 1 2 3 4 5 AT3,4 A4,5 6 7 8 9 10 11 6 7 8 9 10 11 AT4,5 A5,6 12 13 14 15 16 17 12 13 14 15 16 17 AT5,6 A6,7 18 19 20 21 22 23 18 19 20 21 22 23 T A6,7 A7,8 0 1 2 3 4 5 0 1 2 3 4 5 AT7,8 A8,9 6 7 8 9 10 11 6 7 8 9 10 11 AT8,9 A9,10 12 13 14 15 16 17 12 13 14 15 16 17 T A9,10 A10,11 18 19 20 21 22 23 18 19 20 21 22 23 T A11,0 A10,11
5
12 13 14 15 16 17
7
5
AT11,0
0
1
2
3
4
5
6
7
8
9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
18 19 20 21 22 23
(a) transposing the zeroth wrapped block
(b) transposing the first wrapped block
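The group index g and the starting partner (p~, q~) in Figure 11 can be checked with a few lines of Python (an illustrative sketch, not PUMMA code): every processor of the 4 × 6 template reaches exactly LCM/GCD = 6 distinct partners, all belonging to the complementary group (GCD − g) mod GCD.

    from math import gcd

    def partners(p, q, P, Q):
        G = gcd(P, Q)
        LCM = P * Q // G
        g = (q - p) % G
        pt, qt = (p + g) % P, (q - g) % Q          # p~ and q~ of Figure 11
        return [((pt + I * G) % P, (qt - J * G) % Q)
                for J in range(LCM // P) for I in range(LCM // Q)]

    P, Q = 4, 6
    G = gcd(P, Q)
    LCM = P * Q // G
    for p in range(P):
        for q in range(Q):
            dests = partners(p, q, P, Q)
            assert len(set(dests)) == LCM // G                     # 6 partners = 6 steps
            g = (q - p) % G
            assert all((qq - pp) % G == (G - g) % G for pp, qq in dests)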
[Figure 12 panels: (a) transposing the zeroth wrapped block diagonal; (b) transposing the first wrapped block diagonal.]

Figure 12: Two snapshots of matrix transposition, for the zeroth and the first wrapped block diagonals, when P = 4, Q = 6 and M_b = N_b = 12. In this example, the transposition of the even-numbered wrapped block diagonals can be overlapped with that of the odd-numbered ones.

[Figure 13: the matrices A and A^T of the 3 × 3 example, showing each block A_{i,j} and its transpose.]
Figure 13: Matrix transposition when P = Q = GCD = 3. Processors transpose 3 (= GCD) diagonal blocks at one step, so that the transposition is done in one step.
    PARDO K = 1, GCD
      g = MOD(q − p, GCD)
      p~ = MOD(p + g, P);  q~ = MOD(q − g, Q)
      DO J = 0, LCM/P − 1
        K = J
        /* Determine the K-th diagonal block to transpose */
        WHILE (MOD(K − g, P) ≠ 0) DO
          K = MOD(K + Q, LCM)
        END DO
        DO I = 0, LCM/Q − 1
          [ Copy every (K : N_b : LCM)-th diagonal block in P⟨p, q⟩ to T1 ]
          [ Move T1 from P⟨p, q⟩ to P⟨p~ + I·GCD, q~ − J·GCD⟩ ]
          [ Copy the received T1 to C ]
          K = MOD(K + Q, LCM)
        END DO
      END DO
    END PARDO

Figure 14: A modified matrix transpose algorithm from the matrix point-of-view. GCD diagonal blocks are transposed simultaneously.
The pseudocode of the modified algorithm is shown in Figure 11. Figure 12 shows two snapshots of the same example, from the matrix point-of-view, for transposing the zeroth and the first diagonal blocks of A (A(i, j), MOD(j − i, LCM) = 0 and 1, respectively). The processors that hold the blocks to be sent out are shaded at the bottom of each snapshot. In this example, only P · Q / GCD processors are involved in block communication at each step. The processors are split into GCD groups, and a processor P⟨p, q⟩ belongs to group g if it has g = MOD(q − p, GCD). Processors in group g send their blocks to, and receive their blocks from, the processors in the group g' = MOD(GCD − g, GCD). The operations of the different groups can be overlapped.

The problem can also be interpreted from the matrix point-of-view. In general, for transposing the k-th diagonal block of A (A(i, j), MOD(j − i, LCM) = k), the group of processors g_k = MOD(k, GCD) sends the blocks to the group g'_k = MOD(GCD − g_k, GCD). Since the operations are overlapped over the different groups of processors, the processors transpose GCD diagonal blocks simultaneously, so the matrix can be transposed in LCM/GCD steps. In the extreme case P = Q = GCD = 3, shown in Figure 13, the processors transpose 3 (= GCD) diagonal blocks at once; that is, the transposition is done in one step, and a processor P⟨p, q⟩ simply exchanges data with processor P⟨q, p⟩. The pseudocode of the algorithm from the matrix point-of-view is shown in Figure 14. The code includes the case GCD = 1.
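A quick way to see why LCM/GCD steps suffice (an illustrative Python check, not from the report): for every wrapped diagonal k, the blocks on that diagonal are held by the single group k mod GCD and are destined for the single group (GCD − k) mod GCD, so the GCD groups can each work through their LCM/GCD diagonals concurrently.

    from math import gcd

    P, Q = 4, 6
    G = gcd(P, Q)
    LCM = P * Q // G

    for k in range(LCM):
        for i in range(LCM):
            j = (i + k) % LCM
            src_group = (j % Q - i % P) % G        # group of the owner of block (i, j) of A
            dst_group = (i % Q - j % P) % G        # group of the owner of block (j, i) of A^T
            assert src_group == k % G
            assert dst_group == (G - k % G) % G

    # Each group owns LCM/GCD of the LCM wrapped diagonals.
    assert all(sum(1 for k in range(LCM) if k % G == g) == LCM // G for g in range(G))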
96 processors P Q Time (second) 6 16 0:404 8 12 0:330 12 8 0:307 16 6 0:381
    96 processors:   6 × 16  0.404 s    8 × 12  0.330 s    12 × 8  0.307 s    16 × 6  0.381 s
    64 processors:   4 × 16  0.596 s    8 × 8   0.572 s    16 × 4  0.475 s
    48 processors:   4 × 12  0.652 s    6 × 8   0.546 s    8 × 6   0.527 s    12 × 4  0.547 s

Table 1: Dependence of performance on the template configuration P × Q for a fixed number of processors (M = N = 2400); times are in seconds.
T
T
T
T
Gflops
- 14 8
Α ×Β ΑΤ × ΒΤ
7 6 5 4 3 2 1 0 0
1200
2400
3600
4800
6000
7200
Matrix Size, M
Figure 15: Performance comparison of A B and A B on 15 16 template. (P = 15; Q = 16, and GCD = 1). C ( A B is implemented in two steps, T ( B A, and then C ( T . T
T
T
T
T
Matrix elements are generated uniformly on the interval [,1; 1] in double precision. Conversions between measured runtimes and performance in giga ops (G ops) are made assuming an operation count of 2MNL for the multiplication of a M L by a L N matrix. In our test examples, all processors have the same number of blocks so there is no load imbalance. The algorithms were implemented with force type communication [9]. First, we considered how, for a xed number of processors N = P Q, performance depends on the con guration of the processor template. Some typical results are presented in Table 1 for a xed number of processors. In the test, the block size is xed at 5 5 elements. It may be seen that the template con guration does have some eect on performance. The performance dierence is between 19 and 24 %. For rectangular templates with dierent aspect ratios, the algorithm prefers those with small Q to those with small P. On the Delta, communication speed along vertical links seems faster than along horizontal links. Figures 15 19 show the performance of the routines on 15 16 (GCD = 1, i.e., P and Q are relatively prime), 14 16 (GCD = 2), 12 16 (GCD = 4), 8 16 (GCD = 8), and 16 16 (P = Q = GCD = 16) templates, respectively. In all cases the block size is xed at 5 5 elements. The solid and the dashed lines show the performance of A B and A B, respectively. The dierence of the two lines shows the loss of performance due to matrix transposition. The transposed multiplication routine shows good performance relative to matrix multiplication. The loss of performance due to the matrix transpose routine is about 2 or 3 %. The p
T
T
Gflops
- 15 -
Α ×Β ΑΤ × ΒΤ
7 6 5 4 3 2 1 0 0
1120
2240
3360
4480
5600
6720
Matrix Size, M
Figure 16: Performance comparison of A B and A B on 14 16 template. (P = 14; Q = 16, and GCD = 2)
Gflops
T
T
Α ×Β ΑΤ × ΒΤ
6 5 4 3 2 1 0 0
960
1920
2880
3840
4800
5760
6720
Matrix Size, M
Figure 17: Performance comparison of A B and A B on 12 16 template. (P = 12; Q = 16, and GCD = 4) T
T
Gflops
- 16 -
4
Α ×Β ΑΤ × ΒΤ
3
2
1
0 0
800
1600
2400
3200
4000
4800
5600
Matrix Size, M
Figure 18: Performance comparison of A B and A B on 8 16 template. (P = 8; Q = 16, and GCD = 8)
Gflops
T
8
T
Α ×Β ΑΤ × ΒΤ
7 6 5 4 3 2 1 0 0
1600
3200
4800
6400
8000
Matrix Size, M
Figure 19: Performance comparison of A B and A B on 16 16 template. (P = Q = GCD = 16). T
T
- 17 P Q
Matrix Size
8 16
4800 4800
12 16
4800 4800
14 16
5600 5600
15 16
6000 6000
16 16
6400 6400
Block Size 1 1 5 5 300 300 1 1 5 5 100 100 1 1 5 5 50 50 1 1 5 5 25 25 1 1 5 5 400 400
Time (second) 1:857 1:612 1:564 1:280 0:893 0:882 1:484 1:193 1:161 1:740 1:437 1:426 1:967 1:967 1:967
Table 2: Dependence of performance on block size. transpose routine has excellent performance if P and Q are relatively prime. In other cases (GCD 2), network congestion may degrade the performance of the routine. P Q 11 8 16 12 16 14 16 15 16 16 16
Matrix Size (A; B) 500 500 5600 5600 6720 6720 6720 6720 7200 7200 8000 8000
A B (%)
36:70 (100:0) 32:05 (87:3) 32:09 (87:4) 32:52 (88:6) 32:78 (89:3) 31:22 (85:1)
A B (%) T
T
35:04 (100:0) 30:57 (87:3) 31:64 (90:3) 32:11 (91:6) 32:43 (92:6) 30:38 (86:7)
Table 3: Performance per node in M ops. Block size is xed to 5 5 elements. 1 1 template gives performance of assembly-coded matrix multiplication. Numbers in parentheses represent eciency compared with the performance on 1 processor. Table 2 shows how the block size aects the performance of the algorithms. It includes three cases of the block size, two extreme cases { the smallest and largest possible block sizes { and 5 5 block of elements. If P = Q, processors directly copy all blocks at once, so block size does not aect the performance. For the case of the smallest block size (1 1 element) when P 6= Q, processors make a copy element by element, so it takes a little more time to make a copy. The routines with the smallest block sizes are slower than those with the largest possible block sizes by between 15% and 31%. This dierence is negligible, compared with the total elapsed time of the matrix multiplication. Performance per node is shown in Table 3. The 1 1 template gives the performance of the assembly-coded level 3 BLAS matrix multiplication routine for the two cases A B and
- 18 -
A B . Processors have about 85% eciency for A B, and 87% for A B if P = Q = 16. The routines perform better on templates for which P = 6 Q. Processors achieve about 89%, T
T
T
T
and 93% of eciency for each case if P and Q are relatively prime.
5. Conclusions and Remarks We have presented parallel matrix transpose algorithms based on the block scattered decomposition. The algorithms have good performance for arbitrary processor con gurations on the Intel Delta computer. If P and Q are relatively prime, the transpose routine involves complete exchange communication on a two-dimensional template. We have approached this complicated problem with a direct point-to-point communication scheme (see Section 2). When P and Q are not relatively prime (GCD > 1), the processors' operations are overlapped over dierent groups, so that only LCM=GCD communications are required. In our Fortran implementation, we assume that the rst dimension of the matrix may be dierent from the number of rows of the matrix in a processor. Even when P = Q, the processor needs to copy blocks of A to a communication buer before sending, and copy the received buer to blocks of C after receiving. The parallel matrix transpose algorithms have been combined with matrix multiplication routines. The integrated routines comprise a general-purpose matrix multiplication package, called PUMMA [5], for MIMD message-passing computers. The package has good performance for a wide range of decomposition parameters, that is, its performance depends weakly on processor con guration and block size. The PUMMA package is currently available only for double precision real data, but will be implemented in the near future for other data types, i.e., single precision real and complex, and double precision complex. To obtain a copy of the software and a description of how to use it, send the message \send pumma from misc" to
[email protected].
Acknowledgments The authors would like to thank Eduardo D'Azevedo at ORNL for his helpful suggestions to improve the quality of the paper. This research was performed in part using the Intel Touchstone Delta System operated by the California Institute of Technology on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided through the Center for Research on Parallel Computing.
6. References [1] N. G. Azari, A. W. Bojanczyk, and S.-Y. Lee. Synchronous and asynchronous algorithms
- 19 for matrix transposition on MCAP. In SPIE Vol. 975, Advanced Algorithms and Architecture for Signal Processing III, pages 277{288, 1988. [2] S. H. Bokhari and H. Berryman. Complete exchange on a circuit switched mesh. In Proceedings of the 1992 Scalable High Performance Computing Conference, pages 300{ 306. IEEE Press, 1992. [3] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of Fourth Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia). IEEE Computer Society Press, Los Alamitos, California, October 19-21 1992. [4] J. Choi, J. J. Dongarra, and D. W. Walker. The design of scalable software libraries for distributed memory concurrent computers. In Proceedings of Environment and Tools for Parallel Scienti c Computing Workshop, (Saint Hilaire du Touvet, France). Elsevier Science Publishers, September 7-8 1992. [5] J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA : Parallel universal matrix multiplication algorithms on distributed memory concurrent computers. Technical Report TM-12252, Oak Ridge National Laboratory, Mathematical Sciences Section, August 1993. [6] J. J. Dongarra, R. van de Geijn, and D. Walker. A look at scalable linear algebra libraries. In Proceedings of the 1992 Scalable High Performance Computing Conference, pages 372{ 379. IEEE Press, 1992. [7] J. O. Eklundh. A fast computer method for matrix transposing. IEEE Transactions on Computers, 21:801{803, 1972. [8] G. H. Golub and C. V. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 1989. Second Edition. [9] Intel Corporation. Touchstone Delta Fortran Calls Reference Manual, April 1991. [10] S. L. Johnsson and C.-T. Ho. Algorithms for matrix transposition on boolean n-cube con gured ensemble architecture. SIAM J. Matrix Anal. Appl, 9:419{454, July 1988. [11] R. Little eld. Characterizing and tuning communications performance for real applications. In Proceedings, First Intel Delta Application Workshop, CCSF-14-92, Pasadena, California, pages 179{190, February 1992. presentation overheads. [12] D. P. O'Leary. Systolic arrays for matrix transpose and other reorderings. IEEE Transactions on Computers, 36:117{122, January 1987.
- 20 [13] G. Strang. Linear Algebra and Its Applications. Harcourt Brace Jovanovich, Inc., San Diego, CA, 1988. Third Edition.