Purdue University

Purdue e-Pubs Computer Science Technical Reports

Department of Computer Science

1993

Some Experiments with a Basic Linear Algebra Routine on Distributed Memory Parallel Systems

H. Byun
Elias N. Houstis, Purdue University, [email protected]
E. A. Vavalis

Report Number: 93-031

Byun, H.; Houstis, Elias N.; and Vavalis, E. A., "Some Experiments with a Basic Linear Algebra Routine on Distributed Memory Parallel Systems" (1993). Computer Science Technical Reports. Paper 1049. http://docs.lib.purdue.edu/cstech/1049

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

SOME EXPERIMENTS WITH A BASIC LINEAR ALGEBRA ROUTINE ON DISTRIBUTED MEMORY PARALLEL SYSTEMS

Byun H.
E.N. Houstis
E. A. Vavalis

CSD-TR-93-031
May 1993

SOME EXPERIMENTS WITH A BASIC LINEAR ALGEBRA ROUTINE ON DISTRIBUTED MEMORY PARALLEL SYSTEMS*

H. BYUN, E.N. HOUSTIS AND E.A. VAVALIS†

* This work was partially supported by NSF grants 9123502-CDQ and 9202536-CCR, AFOSR F49620-92-J-0069 and PRF 6902003. This research was performed in part using the Intel Touchstone Delta System operated by Caltech on behalf of the Concurrent Supercomputer Consortium. Access to this facility was provided by Purdue University.
† Purdue University, Computer Science Department, West Lafayette, IN 47907.

Abstract. In this paper we describe the algorithm used, discuss the implementation, and present the performance of the Basic Linear Algebra Subroutine (BLAS) sgemv for distributed-memory multiprocessors. The basic assumption is that the matrix and the vectors are distributed row-wise among the processors. Performance data from the nCUBE II, iPSC/860 and iPSC DELTA machines are presented.

1. Introduction. In this study we present data that describe various aspects of the performance of the Basic Linear Algebra Subroutine (BLAS) sdgemv on three distributed memory multiprocessor systems, namely the nCUBE II, the iPSC/860 and the iPSC DELTA. sdgemv is a member of a set of parallel BLAS routines we have implemented on such machines [1]. The software methodology used to parallelize the BLAS routines assumes that each processor performs the appropriate local operations by calling the corresponding uniprocessor BLAS routines ([9], [5], [4]). The local results are "combined" by PICL [6] to generate the final answer. It is worth noticing that the global combine operations can use any multiprocessor connection topology that PICL supports.

The rest of the paper is organized as follows. Section 3 consists of a list of tables that present raw timing data measuring the total elapsed time and the total (communication and idle) overhead time required by the sdgemv routine to perform the matrix-vector operations on the three machines considered. Using the data given in Section 3, we present in Section 4 the Gflops achieved on the three machines for different matrix sizes and connection topologies, and in Section 5 the data that show the differences observed when we used the optimized uniprocessor BLAS routines instead of FORTRAN BLAS on the iPSC/860. Finally, in Section 6 we give the utilization and concurrency profiles and in Section 7 the space-time execution diagrams. The data in the latter two sections were obtained using PARAGRAPH [7].
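To illustrate the methodology just described, the following is a minimal sketch (not taken from the report) of how a level one BLAS routine such as sdot could be parallelized in the same style: each node reduces its locally stored block with the uniprocessor routine, and PICL combines the partial results. The name psdot, its argument list, and the use of datatype code 4 for single precision reals are our assumptions; the gsum0 call follows the usage shown in Figure 1.

      real function psdot(lnx, x, y, mtype, root)
c     hypothetical parallel dot product sketch; lnx is the number of
c     locally stored elements, mtype a message type, root the node
c     that should receive the global result
      integer lnx, mtype, root
      real x(1), y(1), buf(1)
      real sdot
      external sdot
c     local partial dot product with the uniprocessor BLAS routine
      buf(1) = sdot(lnx, x, 1, y, 1)
c     componentwise global sum via PICL (datatype code 4 assumed to
c     mean 4-byte reals, as in Figure 1); result lands on node root
      call gsum0(buf, 1, 4, mtype, root)
      psdot = buf(1)
      return
      end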

2. The algorithm and its implementation. In this section we discuss the parallel implementation of the matrix-vector operations Ax or A^T x. Throughout, we assume that the matrix A and the vector x are distributed among the processors row-wise. This distribution is defined by a vector idist(i), i = 1, ..., nprocs + 1, where nprocs is the number of processors and idist(i) denotes the global index of the first row of A (and first element of x) that belongs to processor i. Thus, we store rows idist(i) to idist(i+1) - 1 of the matrix A in the local memory of processor i, together with the associated elements of x. idist is the only global information needed; all other variables are local to each processor.

For the implementation of the Ax operation, we follow the methodology described in the previous section. As an example we give in Figure 1 the actual code of the sdgemv routine for full matrices. We assume that the data are distributed on a wrap-around linear array of processors. The level two BLAS routine sgemv takes as input the matrix A ∈ R^(m x n), the vectors x, y ∈ R^m, and the scalars α and β. It computes the vector y = αAx + βy or y = αA^T x + βy. In the case of matrix A, each node calculates the part of the product that involves local data by utilizing the uniprocessor level two BLAS routine sgemv. Then it broadcasts its local vector x to all other processors by calling the PICL routine bcast0, in which the reception and forwarding of the message are decoupled. Thus, processors that participate in this broadcast call sgemv, which computes the part of the product associated with the incoming vector x, before forwarding it. Finally, we restore the original value of the vector x by reading it back from a nearest neighbor. It is important to notice that the only memory overhead of the routine sdgemv is the integer array idist of length nprocs + 1.

In the case of the A^T x operation, each processor i calls sgemv to compute αA^T x and stores this result in a buffer. The entries of all these buffers are added componentwise using the PICL routine gsum0 and the result is stored in the buffer at processor 1. Finally, a call to the level one BLAS routine saxpy is used to accumulate the term βy into the buffer. The procedure described above is repeated nprocs - 1 more times, as shown in Figure 1.

We have experimented with two different interconnection topologies that PICL supports, namely the ring connectivity and the full connectivity. It is worth noticing that in the case of banded or sparse matrices we were able to successfully follow the above approach, coupled with appropriate data structures, with only a few basic differences [2].
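The report does not list the code that builds idist; the following is a minimal sketch, under the assumption of a near-even block-row distribution, of one way such an array could be constructed. The routine name mkdist is hypothetical.

      subroutine mkdist(m, nprocs, idist)
c     hypothetical helper: block-row distribution of an m-row matrix
c     over nprocs processors.  idist(i) is the global index of the
c     first row owned by processor i and idist(nprocs+1) = m + 1, so
c     processor i owns rows idist(i), ..., idist(i+1) - 1.
      integer m, nprocs, idist(1)
      integer i, rows, extra
      rows  = m / nprocs
      extra = mod(m, nprocs)
      idist(1) = 1
      do 10 i = 1, nprocs
         idist(i+1) = idist(i) + rows
c        spread the remainder rows over the first processors
         if (i .le. extra) idist(i+1) = idist(i+1) + 1
   10 continue
      return
      end

For example, with m = 10 rows and nprocs = 4 this gives idist = (1, 4, 7, 9, 11), so processor 1 holds rows 1-3 and processor 4 holds rows 9-10.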

3. Raw time data. The uniprocessor BLAS routines available to us on the iPSCs [8] were used unless stated otherwise in the caption of the tables. All times are in milliseconds.

4. Gflops achieved. In Figures 2 and 3 we present the Gflops achieved on the three machines using the ring topology for the non-transposed and transposed cases, respectively. It is worth noticing that the "theoretical peak" (determined by counting the number of floating point operations in full precision that can be completed during a cycle) for the nCUBE II, the iPSC/860 and the iPSC DELTA is .15, 2.6 and 20 Gflops, respectively [3].

5. The effect of using optimized BLAS. In Figure 4 we present the speedup achieved on the iPSC/860 by using optimized uniprocessor BLAS routines [8] for the non-transposed case. As we see, the performance increases four to six times. It is also apparent that the performance drops as the number of processors increases. To further
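As an aid in relating the raw timings of Section 3 to the rates reported in Section 4, the following is a minimal sketch of the conversion from an elapsed time Te in milliseconds to a Gflops rate. The flop count 2n^2 + 3n for y = αAx + βy on an n x n matrix is an assumption on our part; the report does not state the count it used.

      real function gflops(n, te)
c     hypothetical helper: achieved Gflops for an n x n matrix-vector
c     product, given the elapsed time te in milliseconds and assuming
c     2*n**2 + 3*n floating point operations for y = alpha*A*x + beta*y
      integer n
      real te
c     te is in milliseconds, hence the combined scale factor 1.0e6
      gflops = (2.0*real(n)*real(n) + 3.0*real(n)) / (te * 1.0e6)
      return
      end

For instance, under this count a hypothetical elapsed time of 3000 ms for n = 6400 corresponds to roughly 0.027 Gflops.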

FIG. 1. The Fortran code of subroutine sdgemv.

      subroutine sdgemv(trans,m,n,alpha,a,lda,x,beta,y,tmp,idst)
      integer idst(1)
      real a(lda,1),x(1),y(1),tmp(1)
      character trans
      logical lsame
      external lsame
      integer nprocs, me, host, top, ord, dir
      common /open/ nprocs,me,host
      common /setarc/ top, ord, dir
      integer mtype, root, lnx, jidx, node_no
      integer bytes
c     number of rows of A (and elements of x) stored on this node
      lnx = idst(me+2) - idst(me+1)
      bytes = 4 * lnx
c     y := alpha*A*x + beta*y
      if( ( lsame(trans,'N') ) .or. ( lsame(trans,'n') ) ) then
         jidx = idst(me+1)
         call sgemv(trans,m,n,alpha,a(1,jidx),lda,x,1,beta,y,1)
         mtype = 1000 + me
         call bcast0(x,bytes,mtype,me)
         do 100 i=1,nprocs-1
            root = mod(me-i+nprocs, nprocs)
            mtype = 1000 + root
            jidx = idst(root+1)
            call bcast0(x,bytes,mtype,root)
            call sgemv(trans,m,n,alpha,a(1,jidx),lda,x,1,1.,y,1)
  100    continue
c        restore the original local piece of x from a nearest neighbor
         root = mod(me+1+nprocs, nprocs)
         call send0(x,bytes,root+1000,root)
         call recv0(x,bytes,me+1000)
      endif
c     y := alpha*A'*x + beta*y
      if( ( lsame(trans,'T') ) .or. ( lsame(trans,'t') ) ) then
         mtype = 4000
         do 200 node_no = 0, nprocs-1
            istart = idst(node_no+1)
            call sgemv(trans,m,n,alpha,a(1,istart),lda,x,1,0.,tmp,1)
            call gsum0(tmp,lnx,4,mtype,node_no)
            if (node_no .eq. me) then
               if (beta .ne. 0.) call saxpy(lnx,beta,y,1,tmp,1)
               call scopy(lnx,tmp,1,y,1)
            endif
  200    continue
      endif

      return
      end

TABLE 1
The total elapsed and overhead timings on the nCUBE II for matrices A of size n x n.
[Tabulated values (Te = total elapsed time, To = total overhead time, in milliseconds) on 1, 2, 4, 8, 16, 32 and 64 processors for n = 800, 1200, 1600, 2400, 3200, 4800, 6400 are garbled in this copy.]

TABLE 2
The total elapsed and overhead time on the nCUBE II for matrices A^T of size n x n.
[Tabulated values (Te/To, in milliseconds) on 1, 2, 4, 8, 16, 32 and 64 processors for n = 800, 1200, 1600, 2400, 3200, 4800, 6400 are garbled in this copy.]

TABLE 3
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on full connectivity.
[Tabulated values (Te/To, in milliseconds) on 1, 2, 4, 8, 16, 32 and 64 processors for n = 1600, 2400, 3200, 4800, 6400, 9600, 12800 are garbled in this copy.]

TABLE 4
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on ring topology.
[Tabulated values (Te/To, in milliseconds) on 1, 2, 4, 8, 16, 32 and 64 processors for n = 1600, 2400, 3200, 4800, 6400, 9600, 12800 are garbled in this copy.]

TABLE 5
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on full connectivity and the uniprocessor level two BLAS routines used were written in FORTRAN.
[Tabulated values (Te/To, in milliseconds) on 1, 2, 4, 8, 16, 32 and 64 processors for n = 3200, 4800, 6400, 9600, 12800 are garbled in this copy.]

TABLE 6
The total elapsed and overhead timings on the iPSC/860 for matrices A of size n x n. This implementation is based on ring connectivity and the uniprocessor level two BLAS routines used were written in FORTRAN.
[Tabulated values (Te/To, in milliseconds) on 1, 2, 4, 8, 16, 32 and 64 processors for n = 3200, 4800, 6400, 9600, 12800 are garbled in this copy.]

TABLE 7
The total elapsed and overhead timings on the iPSC/860 for matrices A^T of size n x n.
[The document is truncated here in this copy.]