Purdue University
Purdue e-Pubs: Computer Science Technical Reports
Department of Computer Science
1993

Some Experiments with a Basic Linear Algebra Routine on Distributed Memory Parallel Systems

H. Byun
Elias N. Houstis, Purdue University, [email protected]
E. A. Vavalis

Report Number: 93-031

Byun, H.; Houstis, Elias N.; and Vavalis, E. A., "Some Experiments with a Basic Linear Algebra Routine on Distributed Memory Parallel Systems" (1993). Computer Science Technical Reports. Paper 1049. http://docs.lib.purdue.edu/cstech/1049
SOME EXPERIMENTS WITH A BASIC LINEAR ALGEBRA ROUTINE ON DISTRIBUTED MEMORY PARALLEL SYSTEMS

H. Byun
E.N. Houstis
E.A. Vavalis

CSD-TR-93-031
May 1993
SOME EXPERIMENTS WITH A BASIC LINEAR ALGEBRA ROUTINE ON DISTRIBUTED MEMORY PARALLEL SYSTEMS*

H. BYUN, E.N. HOUSTIS AND E.A. VAVALIS†
Abstract. In this paper we describe the algorithm used, discuss the implementation and present the performance of the Basic Linear Algebra Subroutine (BLAS) sgemv for distributed-memory multiprocessors. The basic assumption is that the matrix and the vectors are row distributed among processors. Performance data from nCUBE II, iPSC/860 and iPSC DELTA machines are presented.
* This work was partially supported by NSF grants 9123502-CDQ and 9202536-CCR, AFOSR F49620-92-J-0069 and PRF 6902003. This research was performed in part using the Intel Touchstone Delta System operated by Caltech on behalf of the Concurrent Supercomputer Consortium. Access to this facility was provided by Purdue University.
† Purdue University, Computer Science Department, West Lafayette, IN 47907.

1. Introduction. In this study we present data that describe various aspects of the performance of the Basic Linear Algebra Subroutine (BLAS) sdgemv on three distributed memory multiprocessor systems, namely the nCUBE II, the iPSC/860 and the iPSC DELTA. sdgemv is a member of a set of parallel BLAS routines we have implemented on such machines [1]. The software methodology utilized to parallelize the BLAS routines assumes that each processor performs the appropriate local operations by calling the corresponding uniprocessor BLAS routines ([9], [5], [4]). The local results are "combined" by PICL [6] routines to generate the final answer. It is worth noticing that the global combine operations can use any multiprocessor connection topology that PICL supports.

The rest of the paper is organized as follows. Section 3 consists of a list of tables that present raw timing data measuring the total elapsed and total (communication and idle) overhead time required by the sdgemv routine to perform the matrix-vector operations on the three machines considered. Using the data given in Section 3 we present, in Section 4, the Gflops achieved on the three machines for different matrix sizes and connection topologies, and in Section 5 the data that show the differences observed when we used the optimized uniprocessor BLAS routines instead of FORTRAN BLAS on the iPSC/860. Finally, in Section 6 we give the utilization and concurrency profiles and in Section 7 the space-time execution diagrams. The data in the latter two sections were obtained using PARAGRAPH [7].

2. The algorithm and its implementation. In this section we discuss the parallel implementation of the matrix-vector operations Ax and A^T x. Throughout, we assume that the matrix A and the vector x are distributed among processors row-wise. This distribution is defined by a vector idist(i), i = 1, nprocs + 1, where nprocs is the number of processors and idist(i) denotes the global index of the first row of A
(first element of x) that belongs to processor i. Thus, we store rows idist(i) to idist(i+1) - 1 of the matrix A on the local memory of processor i together with the associated elements of x. idist is the only global information needed and all other variables are local to each processor.

For the implementation of the Ax operation, we follow the methodology described in the previous section. As an example we give in Figure 1 the actual code for the sdgemv routine for full matrices. We assume that the data are distributed on a wrap around linear array of processors. The level two BLAS routine sgemv gets as input the matrix A ∈ R^(m x n), the vectors x, y ∈ R^m and the scalars α and β. It computes the vector y = αAx + βy or y = αA^T x + βy. In the case of matrix A, each node calculates the part of the product that involves local data by utilizing the uniprocessor level two BLAS routine sgemv. Then it broadcasts its local vector x to all other processors by calling the PICL routine bcast0, in which the reception and forwarding of the message are decoupled. Thus, processors that participate in this broadcast call sgemv, which computes the part of the product associated with the incoming vector x, before forwarding it. Finally, we restore the original value of the vector x by reading it from a nearest neighbor. It is important to notice that the only memory overhead for the routine sdgemv is the integer array idist of length nprocs + 1.

In the case of the A^T x operation, each processor i calls sgemv to compute αA^T x and stores the result in a buffer. The entries of all these buffers are added componentwise using the PICL routine gsum0 and the result is stored in the buffer at processor 1. Finally, a call to the level one BLAS routine saxpy is used to accumulate the term βy to the buffer. The above described procedure is repeated nprocs - 1 more times as shown in Figure 1.

We have experimented with two different interconnection topologies that PICL supports, namely the ring connectivity and the full connectivity. It is worth noticing that in the case of banded or sparse matrices we were able to successfully follow the above approach, coupled with appropriate data structures, with only a few basic differences [2].
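As an illustration of how little global information the distribution requires, the short fragment below (a sketch of ours, not part of the report's software; the helper name row_owner is hypothetical) recovers the owner of any global row directly from idist (passed as idst, the name used in Figure 1):

      integer function row_owner(irow, idst, nprocs)
c     return the index (1..nprocs) of the processor that stores global
c     row irow, where idst(i) is the global index of the first row on
c     processor i and idst(nprocs+1) - 1 is the last row of the matrix
      integer irow, nprocs, idst(1)
      integer i
      do 10 i = 1, nprocs
         if (irow .lt. idst(i+1)) then
            row_owner = i
            return
         endif
   10 continue
      row_owner = nprocs
      return
      end

In the same spirit, processor me (in PICL's 0-based node numbering) stores rows idist(me+1) through idist(me+2) - 1, which is exactly how the local row count lnx is obtained in Figure 1.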
3. Raw time data. The uniprocessor BLAS routines available to us on the iPSCs [8] were used unless it is stated otherwise in the caption of the tables. All times are in milliseconds.

4. Gflops achieved. In Figures 2 and 3 we present the Gflops achieved on the three machines using ring topology for both the non-transposed and transposed cases respectively. It is worth noticing that the "theoretical peak" (determined by counting the number of floating point operations in full precision that can be completed during a cycle) for the nCUBE II, the iPSC 860 and the iPSC DELTA is .15, 2.6 and 20 Gflops respectively [3].

5. The effect of using optimized BLAS. In Figure 4 we present the speedup achieved on the iPSC/860 by using optimized uniprocessor BLAS routines [8] for the non-transposed case. As we see, the performance increases four to six times. It is also apparent that the performance drops as the number of processors increases. To further
FIG. 1. The FORTRAN code of the subroutine sdgemv.
      subroutine sdgemv(trans,m,n,alpha,a,lda,x,beta,y,tmp,idst)
c     distributed matrix-vector product: y := alpha*A*x + beta*y
c     (trans = 'N') or y := alpha*A'*x + beta*y (trans = 'T'),
c     with A, x and y distributed row-wise as described by idst
      integer idst(1)
      real a(lda,1),x(1),y(1),tmp(1)
      character trans
      integer nprocs, me, host, top, ord, dir
      common /open/ nprocs,me,host
      common /setarc/ top, ord, dir
      integer mtype, root, lnx, jidx, node_no
      integer bytes, i, istart
      logical lsame
      external lsame
      lnx = idst(me+2) - idst(me+1)
      bytes = 4 * lnx
      if( ( lsame(trans,'N') ) .or. ( lsame(trans,'n') ) ) then
c        non-transposed case: multiply by the local piece of x, then
c        pick up the pieces owned by the other nodes as they are
c        broadcast and accumulate their contributions into y
         jidx = idst(me+1)
         call sgemv(trans,m,n,alpha,a(1,jidx),lda,x,1,beta,y,1)
         mtype = 1000 + me
         call bcast0(x,bytes,mtype,me)
         do 100 i=1,nprocs-1
            root = mod(me-i+nprocs, nprocs)
            mtype = 1000 + root
            jidx = idst(root+1)
            call bcast0(x,bytes,mtype,root)
            call sgemv(trans,m,n,alpha,a(1,jidx),lda,x,1,1.,y,1)
  100    continue
c        restore the original local piece of x from a nearest neighbor
         root = mod(me+1+nprocs, nprocs)
         call send0(x,bytes,root+1000,root)
         call recv0(x,bytes,me+1000)
      endif
      if( ( lsame(trans,'T') ) .or. ( lsame(trans,'t') ) ) then
c        transposed case: each node forms its local contribution and
c        gsum0 accumulates the block of the result owned by node_no
         mtype = 4000
         do 200 node_no = 0, nprocs-1
            istart = idst(node_no+1)
            call sgemv(trans,m,n,alpha,a(1,istart),lda,x,1,0.,tmp,1)
            call gsum0(tmp,lnx,4,mtype,node_no)
            if (node_no .eq. me) then
               if (beta .ne. 0.) call saxpy(lnx,beta,y,1,tmp,1)
               call scopy(lnx,tmp,1,y,1)
            endif
  200    continue
      endif
      return
      end
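For completeness, the following driver is a minimal sketch of how a node program might set up the distribution and call the routine of Figure 1; it is ours and not part of the report. It assumes a uniform block-row distribution with n divisible by nprocs, so that every node stores an lnx-by-n block of A (leading dimension lda) and the local blocks handled by the sgemv calls of Figure 1 are lnx-by-lnx (i.e., we pass m = n = lnx), and it assumes that the PICL common block /open/ has already been filled by the enclosing node program.

      subroutine drive_sdgemv(n, a, lda, x, y, tmp, idst)
c     hypothetical driver: set up a uniform block-row distribution
c     and compute y := A*x with the sdgemv routine of Figure 1
      integer n, lda, idst(1)
      real a(lda,1), x(1), y(1), tmp(1)
      integer nprocs, me, host
      common /open/ nprocs, me, host
      integer i, lnx
      lnx = n/nprocs
c     idst(i) = global index of the first row stored on processor i
      do 10 i = 1, nprocs+1
         idst(i) = (i-1)*lnx + 1
   10 continue
c     ... fill the local lnx rows of a and the local part of x here ...
c     local blocks are treated as lnx-by-lnx (our reading of the
c     interface, assuming a uniform distribution)
      call sdgemv('N', lnx, lnx, 1.0, a, lda, x, 0.0, y, tmp, idst)
      return
      end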
TABLE 1
The total elapsed and overhead timings on the nCUBE II for matrices A of size n x n.

   n\p       1            2            4            8           16           32           64
           Te   To      Te   To      Te   To      Te   To      Te   To      Te   To      Te   To
   800   2834         1424    3     717    2     366    3
  1200                3194    3    1605    3     812    3     418    6
  1600                             2846    3    1435    4     732    6     386   12
  2400                                          3210    5    1624    7     836   12     454   22
  3200                                                       2869    8    1463   12     771   24
  4800                                                                    3248   14    1672   23
  6400                                                                                 2926   25
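For orientation, and assuming the standard 2n^2 operation count for a matrix-vector product (an assumption of ours, not a figure taken from the report), the single-processor entry above for n = 800 (Te = 2834 ms) corresponds to roughly 2 x 800^2 / 2.834 s ≈ 4.5 x 10^5 floating point operations per second, i.e. about 0.45 Mflops.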
TABLE 2
The total elapsed and overhead time on the nCUBE II for matrices A^T of size n x n.

   n\p       1            2            4            8           16           32           64
           Te   To      Te   To      Te   To      Te   To      Te   To      Te   To      Te   To
   800   2615         1316    3     666    3     343    5
  1200                2950    4    1487    4     756    6     395    6
  1600                             2632    6    1333    7     686   10     370   12
  2400                                          2974    9    1514   15     790   14     442   27
  3200                                                       2666   17    1373   19     749   30
  4800                                                                    3028   24    1581   36
  6400                                                                                 2747   40
TABLE 3
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on full connectivity.

   n\p       1            2            4            8           16           32           64
           Te   To      Te   To      Te   To      Te   To      Te   To      Te   To      Te   To
  1600    201          106    4      60    4      38    6
  2400                 234    6     125    5      74   11      53   11
  3200                              214    8     122   12      80   16      63   24
  4800                                           254   13     151   22     110   27      92   43
  6400                                                        250   27     162   36     129   52
  9600                                                                     302   48     220   63
 12800                                                                                  310   85
TABLE 4
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on ring topology.

   n\p       1            2            4            8           16           32           64
           Te   To      Te   To      Te   To      Te   To      Te   To      Te   To      Te   To
  1600    201          106    3      59    6      38    6
  2400                 233    6     124    5      72    6      52   11
  3200                              213    8     118    8      76    9      62   22
  4800                                           249   10     145   13     105   16      92   34
  6400                                                        238   16     152   24     125   42
  9600                                                                     281   19     207   41
 12800                                                                                  286   43
TABLE 5
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on full connectivity and the uniprocessor level two BLAS routines used were written in FORTRAN.

   n\p       1            2            4            8           16           32           64
           Te   To      Te   To      Te   To      Te   To      Te   To      Te   To      Te   To
  3200                             1873    6     924   10     468   10     251   22
  4800                                          2087   16    1042   12     537   17     303   34
  6400                                                       1849   10     937   17     502   33
  9600                                                                    2072   21    1069   34
 12800                                                                                 1852   42
TABLE 6
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on ring connectivity and the uniprocessor level two BLAS routines used were written in FORTRAN.

   n\p       1            2            4            8           16           32           64
           Te   To      Te   To      Te   To      Te   To      Te   To      Te   To      Te   To
  3200                             1741    8     870   16     455   34     260   23
  4800                                          1957   34     994   35     535   41     330   74
  6400                                                       1744   35     912   51     523   81
  9600                                                                    1987   60    1071   85
 12800                                                                                 1825   90
TABLE 7
The total elapsed and overhead timings on the iPSC/860 for matrices A^T of size n x