Data Structures and Algorithms for Distributed Sparse Matrix Operations

Victor Eijkhout and Roldan Pozo
University of Tennessee, Department of Computer Science
107 Ayres Hall, Knoxville, TN 37996-1301
[email protected], [email protected]

September 2, 1994

This work was supported by DARPA under contract number DAAL03-91-C-0047.

Abstract

We propose extensions of the classical row compressed storage format for sparse matrices. The extensions are designed to accommodate distributed storage of the matrix. We outline an implementation of the matrix-vector product using this distributed storage format, and give algorithms for building and using the communication structure between processors.

1 Introduction

Many practical applications involve matrix manipulations with matrices that are sparse and of an irregular sparsity structure. Several data structures for such matrices have become standard over the years [1]. In the context of parallel computing, however, such data structures are limited to use on shared memory architectures. We will describe here extensions of classical data structures to distributed memory applications. Partly this involves slight extensions to the data structure, reflecting the embedding of local processor data in the global problem; partly it involves true additions, reflecting the communication patterns between processors.

2 Sparse matrix storage formats

There are several widely accepted ways of storing sparse matrices, such as coordinate storage, compressed row/column, compressed diagonal, and Itpack storage.

Here we focus on stencil storage for regular grids, compressed row storage, and compressed diagonal storage. We will use these as examples when discussing distributed memory sparse formats.

2.1 Stencil storage on orthogonal grids

Finite element and finite difference problems on regular, orthogonal grids give rise to stencil matrices that can be stored (in the 2-dimensional case) as

real matrix(global_ipts,global_jpts,nstencil)
integer offsets(2,nstencil)

where the physical problem domain has ipts × jpts points. For the interpretation of the data structures in terms of the matrix,

    matrix(i,j,k) = A_{i',j'},

we need the translation of grid points (i,j) to the number i_v of the corresponding problem variable:

    i_v(i,j) = i + (j-1) · ipts.    (1)

With this,

    i' = i_v(i,j),    j' = i_v(i + offsets(1,k), j + offsets(2,k)).

One sees that the matrix is banded with only nstencil nonzero diagonals. By limiting the stencil to such common cases as 3, 5, or 7 diagonals (corresponding to central differences or linear elements in 1, 2, or 3 space dimensions), the matrix can be split up into explicit arrays for the nonzero diagonals. For example, in the 2-dimensional case the matrix would be stored as

real main(ipts,jpts),
>    diag_plus_one_i(ipts,jpts), diag_min_one_i(ipts,jpts),
>    diag_plus_one_j(ipts,jpts), diag_min_one_j(ipts,jpts)

and no integer information is needed.
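As an illustration of this explicit-diagonal storage, the following is a minimal sketch of the product y = Ax for the 2-dimensional 5-point case. The subroutine name and the abbreviated argument names (dpi, dmi, dpj, dmj for the four off-diagonals) are ours; off-grid neighbours are simply skipped, which corresponds to zero matrix entries there.

subroutine stencil5_matvec(ipts, jpts, main, dpi, dmi, dpj, dmj, x, y)
  ! Sketch: y = A*x for the 5-point stencil matrix stored by explicit
  ! diagonals as above.  Off-grid neighbours are treated as zero.
  implicit none
  integer, intent(in) :: ipts, jpts
  real, intent(in)  :: main(ipts,jpts), dpi(ipts,jpts), dmi(ipts,jpts), &
                       dpj(ipts,jpts), dmj(ipts,jpts), x(ipts,jpts)
  real, intent(out) :: y(ipts,jpts)
  integer :: i, j

  do j = 1, jpts
     do i = 1, ipts
        y(i,j) = main(i,j)*x(i,j)
        if (i < ipts) y(i,j) = y(i,j) + dpi(i,j)*x(i+1,j)
        if (i > 1)    y(i,j) = y(i,j) + dmi(i,j)*x(i-1,j)
        if (j < jpts) y(i,j) = y(i,j) + dpj(i,j)*x(i,j+1)
        if (j > 1)    y(i,j) = y(i,j) + dmj(i,j)*x(i,j-1)
     end do
  end do
end subroutine stencil5_matvec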

2.2 Compressed diagonal format

As was remarked above, a stencil matrix has only a few nonzero diagonals. One could store the diagonals of such a matrix directly as

real matrix(global_size,nstencil)
integer offsets(nstencil)

under the interpretation that

    matrix(i,k) = A_{i,j}  where  j = i + offsets(k).

Note that a negative value of offsets(k) implies that matrix(1,k) does not correspond to an existing matrix element. Hence the format is sometimes interpreted as

    matrix(i',k) = A_{i,j}  where  i = i', j = i' + offsets(k)   if offsets(k) ≥ 0,
                                   i = i' + offsets(k), j = i'   if offsets(k) < 0.

However, this makes coding more complicated.
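A minimal sketch of the product y = Ax under the first, simpler interpretation; entries whose column index falls outside the matrix are skipped. The subroutine name and the use of n for the matrix size are ours.

subroutine diag_matvec(n, ndiag, offsets, matrix, x, y)
  ! Sketch: y = A*x for compressed diagonal storage under the simple
  ! interpretation matrix(i,k) = A(i, i+offsets(k)); entries whose
  ! column index falls outside 1..n are skipped.
  implicit none
  integer, intent(in) :: n, ndiag, offsets(ndiag)
  real, intent(in)    :: matrix(n,ndiag), x(n)
  real, intent(out)   :: y(n)
  integer :: i, k, j

  y = 0.0
  do k = 1, ndiag
     do i = 1, n
        j = i + offsets(k)
        if (j >= 1 .and. j <= n) y(i) = y(i) + matrix(i,k)*x(j)
     end do
  end do
end subroutine diag_matvec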

2.3 Compressed row storage format

The compressed row format comprises integers n, the size of the matrix, and nz, the number of nonzeros in the matrix; a real array matrix containing the nonzero matrix elements; an integer array index containing the column indexes of the nonzeros; and an integer array pointer containing the pointers to the first elements of the rows in both matrix and index.

real matrix(global_nz)
integer index(global_nz), pointer(n+1)
real x(n), y(n)

The interpretation attached to this is that

for r = 1, ..., n
  for c = pointer(r), ..., pointer(r+1) - 1
    matrix(c) = A_{r,s}  where  s = index(c)

This is the normal matrix-vector product algorithm for computing y = Ax:

for r = 1, ..., n
  y(r) = 0
  for c = pointer(r), ..., pointer(r+1) - 1
    y(r) ← y(r) + matrix(c) · x(index(c))
  end for
end for
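The same algorithm written out as a self-contained routine; the subroutine name is ours, and the array names follow the declarations above.

subroutine crs_matvec(n, nz, pointer, index, matrix, x, y)
  ! Sketch of the compressed-row matrix-vector product above.
  implicit none
  integer, intent(in) :: n, nz
  integer, intent(in) :: pointer(n+1), index(nz)
  real, intent(in)    :: matrix(nz), x(n)
  real, intent(out)   :: y(n)
  integer :: r, c

  do r = 1, n
     y(r) = 0.0
     do c = pointer(r), pointer(r+1) - 1
        y(r) = y(r) + matrix(c)*x(index(c))
     end do
  end do
end subroutine crs_matvec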

3 Distributed matrix and vector storage

In this section we will discuss data structures for the local storage of part of a matrix on a processor. The two requirements are that local operations should be efficiently executable, and that the embedding information of the local problem in the global context should be preserved.

Specifically, we will propose that the index information of the matrix be renumbered to a local numbering, and that the translation be stored explicitly. The form this takes will depend on the storage format. By way of example, we will show in detail the distributed generalizations of the storage formats discussed above. We will also give the matrix-vector product algorithm for each distributed format. It should be noted that this is only the local part of the product; we will ignore communication issues until the next section.

3.1 Embedding a local matrix in the global matrix

Suppose that a processor has a submatrix of the global matrix. In general this can be a true subblock, a number of rows or columns (not necessarily adjacent), or a completely arbitrary set of matrix elements. However, not every partitioning of the matrix makes sense for every storage format.

Formally, every processor is assumed to have a set E of the indexes for which it has the matrix elements. (This set need not be explicitly given; it can be constructed from the matrix information.) Mathematically, we can say that every processor has a matrix A_E defined by

    (A_E)_{ij} = A_{ij} if (i,j) ∈ E, and 0 otherwise.    (2)

The basic idea is now that we don't want to store more elements of the vectors x and y than are involved in the multiplication with the submatrix. For this we define two sets, based on the local index set E:

    I = { i : there is a j with (i,j) ∈ E },    (3)

and

    J = { j : there is an i with (i,j) ∈ E }.    (4)

Induced by these index sets, two vector partitionings can be defined:

    (x_J)_i = x_i if i ∈ J, 0 otherwise;    (y_I)_i = y_i if i ∈ I, 0 otherwise.    (5)

In the context of computing the matrix-vector product y = Ax, these partitionings have the following interpretation. A processor p owns a submatrix A_{E_p}, and having x_{J_p} is sufficient to compute y^{(p)} = A_{E_p} x_{J_p}. The result vector is y = Σ_p y^{(p)}. In general it is not true that y^{(p)} = y_{I_p}, but we will discuss in detail (section 4) the case of so-called `one-dimensional decompositions' for which this is the case.

With two variables isize = |I| and jsize = |J|, the necessary local storage for x and y on a processor is

real x(jsize),y(isize)

Corresponding to these local sizes, we need to change the matrix to refer to local vector indexes instead of global ones. The implementation of this renumbering depends strongly on the storage scheme; hence we will defer discussion to the appropriate section below.

With vector storage allocated as indicated above, and the definitions of I and J, fully arbitrary matrix distributions can be accommodated. (For precise definitions and discussion, see [2].) However, many of these distributions may not be appropriate for certain matrix storage schemes, or will be impractical altogether. In this paper we will limit ourselves, implicitly or explicitly, to the case where I ⊆ J and, correspondingly, isize ≤ jsize.

Also, the above allocation of x and y may be optimal as far as space is concerned, but having two different storage conventions makes performing, for instance, dot products hard. Hence we propose that all vectors be allocated with size jsize. The embedding of I in J is often needed in algorithms, hence we will discuss its implementation. This, like the mapping from local to global numbering, depends strongly on the storage scheme.

In the following subsections we will discuss the distributed data structures and their interpretations. We will give an implementation of the matrix-vector product algorithm, and the mapping from local to global numbering, for each format in the appendices.
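By way of illustration, the sets I and J of equations (3) and (4) can be built directly from a list of locally owned entries. The following sketch marks set membership in two logical arrays; the subroutine and argument names are ours, and the entry list (rows, cols) is assumed to enumerate the index pairs in E.

subroutine build_index_sets(nloc, rows, cols, maxvar, in_i, in_j)
  ! Sketch: mark which global indexes occur as row (I) or column (J)
  ! indexes of the locally owned entries (rows(e), cols(e)), e = 1..nloc.
  ! 'maxvar' is the global number of variables; in_i/in_j are membership
  ! flags for the sets I and J of equations (3) and (4).
  implicit none
  integer, intent(in)  :: nloc, maxvar
  integer, intent(in)  :: rows(nloc), cols(nloc)
  logical, intent(out) :: in_i(maxvar), in_j(maxvar)
  integer :: e

  in_i = .false.
  in_j = .false.
  do e = 1, nloc
     in_i(rows(e)) = .true.
     in_j(cols(e)) = .true.
  end do
end subroutine build_index_sets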

3.2 Distributed 2D stencil storage

Because of the regularity of stencil storage it only makes sense to partition the problem variables in regular patches over the processors. That is, in the two-dimensional case we apportion to every processor a subset {i1,...,i2} × {j1,...,j2} of the problem domain, and correspondingly a submatrix

real matrix(i1:i2,j1:j2,nstencil)

of the global matrix. In order to localize the matrix, we introduce the dimensions of the locally owned part of the grid, ipts = i2 - i1 + 1 and jpts = j2 - j1 + 1. The embedding of the local grid in the global one is described by proc_shift(1) = i1, proc_shift(2) = j1. Locally on each processor, the matrix data will then appear as in figure 1.

Matrix storage:

integer ipts,jpts,nstencil,offsets(2,nstencil)
real matrix(ipts,jpts,nstencil)

Embedding information:

integer proc_shift(2)

Vector storage:

integer ipts,jpts,b
real x(1-b:ipts+b,1-b:jpts+b)

Figure 1: Storage declarations for distributed stencil storage

The interpretation of the local data structures now becomes

    matrix(i,j,k) = A_{i',j'},

where we first define the number of the problem variable i_v^{(p)} (analogous to, and using, equation (1), but shifted for the parallel embedding):

    i_v^{(p)}(i,j) = i_v(i + proc_shift(1), j + proc_shift(2)).    (6)

We obtain the location in the matrix as

    i' = i_v^{(p)}(i,j),    j' = i_v^{(p)}(i + offsets(1,k), j + offsets(2,k)).

The vector storage in figure 1 uses a single integer b describing the size of the border:

    b = max_k { |offsets(1,k)|, |offsets(2,k)| }.

It would be slightly more efficient in storage to define separate borders b_{ij} for i,j = 1,2, but some simplicity of coding would be lost. Formally, for the I set we have

    I = { i + (j-1) · global_ipts : i1 ≤ i ≤ i2, j1 ≤ j ≤ j2 },

and

    J = { i + (j-1) · global_ipts : i1-b ≤ i ≤ i2+b, j1-b ≤ j ≤ j2+b }.

However, we never store these sets, since they can be completely reconstructed from proc_shift and the local values of ipts and jpts. An algorithm for the distributed matrix-vector product using stencil storage is given in appendix A.1.

3.3 Distributed compressed diagonal storage

Matrix storage:

integer size,ndiag,offsets(ndiag)
real matrix(size,ndiag)

Matrix embedding information:

integer first_var

Vector storage:

integer vec_len, attach(ndiag), first_own
real x(vec_len)

Vector embedding information (version 1):

integer nsegments, segments(2,nsegments)

Vector embedding information (version 2):

integer loc_to_glob(vec_len)

Figure 2: Storage declarations for distributed diagonal storage

If the global matrix is of compressed diagonal type,

real matrix(global_size,ndiag)

for efficiency reasons we only consider partitioning it as

real matrix(i1:i2,ndiag)

The local storage then uses integers size = i2 - i1 + 1 and first_var = i1 to describe the size of the local portion of the matrix and its location in the global matrix. The proposed storage is defined in figure 2.

Unlike in the case of stencil storage, we will now construct an explicit representation of J. First of all,

    I = { first_var, ..., first_var + size - 1 }.

Defining the minimum and maximum diagonal offset as

    f_min = min_k offsets(k),    f_max = max_k offsets(k),

it is clear that the variables first_var + f_min and first_var + size - 1 + f_max are in J, but not all intermediate variables need to be. Hence, concluding jsize = size + f_max - f_min would lead to unnecessary allocation of storage for the input vector x. During the matrix-vector multiplication y = Ax, the local part of the k-th diagonal, matrix(first:first+size-1,k), needs the elements x(first+offsets(k) : first+size-1+offsets(k)) as input. Hence

    J = ∪_k { first + offsets(k), ..., first + size - 1 + offsets(k) },

where the union is most likely not disjoint. Hence we augment the local storage with an array of segments (see figure 2) that form a disjoint partitioning of J. We then allocate for x an array of exactly jsize = |J|. The segments describe the embedding of the local vector x into the global vector x. Let

    S = Σ_k (segments(2,k) - segments(1,k) + 1);

since the segments partition J, this sum equals jsize, the length of the local vector x.
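The segments themselves can be constructed by merging the per-diagonal index intervals. The following is a minimal sketch under that interpretation; the subroutine name and the maxseg output bound are ours (at most ndiag segments can result).

subroutine build_segments(first, size, ndiag, offsets, maxseg, nsegments, segments)
  ! Sketch: construct a disjoint set of segments covering
  !   J = union_k { first+offsets(k), ..., first+size-1+offsets(k) }
  ! by sorting the per-diagonal intervals and merging overlaps.
  implicit none
  integer, intent(in)  :: first, size, ndiag, offsets(ndiag), maxseg
  integer, intent(out) :: nsegments, segments(2,maxseg)
  integer :: lo(ndiag), hi(ndiag), k, m, t

  do k = 1, ndiag
     lo(k) = first + offsets(k)
     hi(k) = first + size - 1 + offsets(k)
  end do
  ! sort the intervals by lower endpoint (simple insertion sort)
  do k = 2, ndiag
     do m = k, 2, -1
        if (lo(m) < lo(m-1)) then
           t = lo(m); lo(m) = lo(m-1); lo(m-1) = t
           t = hi(m); hi(m) = hi(m-1); hi(m-1) = t
        end if
     end do
  end do
  ! merge overlapping or adjacent intervals into disjoint segments
  nsegments = 1
  segments(1,1) = lo(1); segments(2,1) = hi(1)
  do k = 2, ndiag
     if (lo(k) <= segments(2,nsegments) + 1) then
        segments(2,nsegments) = max(segments(2,nsegments), hi(k))
     else
        nsegments = nsegments + 1
        segments(1,nsegments) = lo(k)
        segments(2,nsegments) = hi(k)
     end if
  end do
end subroutine build_segments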

Matrix storage:

integer nrows,nz
real matrix(nz)
integer index(nz), pointer(nrows+1)

Embedding information:

integer iset(isize),jset(jsize)

Vector storage:

real x(jsize),y(isize)

Figure 3: Storage declarations for distributed compressed row storage; fully general

3.4 Distributed compressed row storage

In the case of compressed row storage, the generality of the format forces us to store the embedding information, the I and J sets, completely explicitly. The local arrays x and y are then related to the global vectors by

    x(j) = x_{jset(j)},    y(i) = y_{iset(i)}.

A local submatrix stored in compressed row format needs pointer and index arrays like the global matrix, with the following interpretation:

for i' = 1, ..., isize
  for j' = pointer(i'), ..., pointer(i'+1) - 1
    matrix(j') = A_{i,j}  where  i = iset(i')  and  j = jset(index(j'))

Although in some cases such generality may be warranted, we will here present a more limited data structure for the case where I ⊆ J. Thus, all vectors will have length veclen = jsize. This also raises (as in the compressed diagonal case) the question of how to find the variables of I among those in J. A simple solution would be to order the variables of I first in the array jset. This does mean that the jset array will in general not be a monotone mapping: we can have i < j and jset(i) > jset(j) (see footnote 3). Another solution is to have an additional mapping i_to_j that indicates where the I set is among the J set (note that above we have altered the definition of J in such a way that it indeed contains I). This way monotonicity can be maintained.

Footnote 3: This is no problem for the matrix-vector product. However, if distributed storage is used for the triangular factors of a factorization, it is important to be able to recognize the global ordering from the local ordering. Specifically, if local row i' has global number i, there has to be a way to recognize which elements have a global index less than i, and which ones a global index greater than i.

Matrix storage:

integer nrows,nz
real matrix(nz)
integer index(nz), pointer(nrows+1)

Embedding information:

integer jset(veclen),isize

Vector storage:

real x(veclen)

Figure 4: Storage declarations for distributed compressed row storage; case I ⊆ J

The disadvantage of this approach is that vector operations such as inner products now become indirectly addressed. Because of this, we adopt the former approach. An algorithm for the distributed matrix-vector product using compressed row storage is given in appendix A.3.
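To illustrate the former approach, the following is a minimal sketch of the renumbering step: given a block of whole rows with column indexes still in global numbering, it builds jset with the owned variables (the row numbers, i.e. the I set) first, followed by the border variables, and rewrites the index array to local numbering. The subroutine name, the glob_rows argument, and the use of a length-nglobal scratch array are our own choices, not part of the paper's proposal.

subroutine localize_crs(nrows, nz, glob_rows, index, nglobal, veclen, isize, jset)
  ! Sketch: renumber the column indexes of a locally owned block of rows
  ! from global to local numbering, ordering the owned variables first
  ! in jset, followed by the remaining border variables.
  implicit none
  integer, intent(in)    :: nrows, nz, nglobal
  integer, intent(in)    :: glob_rows(nrows)   ! global numbers of the owned rows
  integer, intent(inout) :: index(nz)          ! on entry global, on exit local
  integer, intent(out)   :: veclen, isize, jset(*)
  integer :: loc_of(nglobal)                   ! scratch: global -> local number
  integer :: i, c, g

  loc_of = 0
  ! owned variables come first in the local numbering
  isize = nrows
  do i = 1, nrows
     jset(i) = glob_rows(i)
     loc_of(glob_rows(i)) = i
  end do
  veclen = nrows
  ! any column index not yet seen becomes a new border variable
  do c = 1, nz
     g = index(c)
     if (loc_of(g) == 0) then
        veclen = veclen + 1
        jset(veclen) = g
        loc_of(g) = veclen
     end if
     index(c) = loc_of(g)
  end do
end subroutine localize_crs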

4 Communication for a one-dimensional decomposition of the matrix

Above, we have not gone deeply into the issue of the distribution of matrix elements to processors. The data structures proposed were based on a distribution where each processor owns whole matrix rows. Although more general distributions are possible (see [2]), we are not convinced of their practical usefulness. Hence we will only consider in detail the communication issues involved for such a distribution.

4.1 Concepts

Formally, we define a one-dimensional matrix decomposition as follows. If a processor owns matrix position (i,j), then it owns all positions with that value of i, and, because of exclusivity, no other processor owns a position with that value of i. Another way of putting it is that the I sets of the processors form a disjoint partitioning of the problem variables.

In relatively abstract terms, here is the algorithm for matrix-vector multiplication. In the following subsections we will investigate the communication issues in detail.

Algorithm: distributed matrix-vector product y = Ax
Input: each processor has x_i for i ∈ I
Output: each processor will compute y_i for i ∈ I
Step 1: collect x_i for i ∈ J - I from other processors
Step 2: compute the local matrix-vector product as described above

Since this algorithm immediately leaves the results y_i for i ∈ I in the computing processor, we will call I the `owned' problem variables. We will call J the `local' variables, since for the local computation we need x_i for i ∈ J. We see in the first step of the algorithm that we have the problem of supplying to each processor the values x_i in the set J - I. We call these the `bordering' variables; in the context of the physical problem they are likely to border on the region occupied by the owned variables. In the context of a sparse matrix these variables are likely to be owned by a very small number of processors, even if the total number of processors is large. Hence, it is advantageous to build up a data structure describing which processors supply which data elements.

4.2 Examples of one-dimensional decompositions

There are many ways to partition problem variables over processors. However, depending on the matrix storage scheme chosen, they may not be equally suited to a particular problem.

For a matrix in diagonal storage the only decomposition that makes sense is one in consecutive stretches of variables. This enables storage of the local matrix by diagonals. Conceivably, one could assign to a processor a small number of such local matrices stored by diagonals.

For a matrix that, in global terms, can be stored as a stencil on a regular grid, there is similarly only one decomposition that makes sense, namely by subdomains. Partitioning the problem domain in regular subdomains corresponds to assigning to each processor a series of consecutive variables, spaced at regular intervals, but this is in general not a very insightful way of regarding the matter. Taking the point of view of looking at subdomains is often clearer, and may even give analytical information about the linear algebra operations performed.

Matrices in compressed row storage allow any decomposition of the problem variables. However, the context often suggests decompositions, for instance based on physical proximity of the variables.

4.3 Data structures for communication

Prior to a matrix-vector product, then, each processor needs to receive the components of x with indexes in its border variables. As a mirror image of this, each processor also needs to send certain of its owned variables. We will call these the `edge' variables (rigorous definitions of the concepts of border and edge variables can be found in [2]). We then extend the data structure with a few more variables and arrays. For instance, for the border processors, that is, processors that will send border variables, we need the numbers of those processors (array bord_procs), the numbers of the variables they will send (array bord_vars), and pointers to where these numbers are located in the bord_vars array (the bord_pointer array).

integer n_bord_proc, n_edge_proc
integer bord_procs(n_bord_proc), edge_procs(n_edge_proc),
>       bord_pointer(n_bord_proc+1), edge_pointer(n_edge_proc+1)
integer bord_vars(*), edge_vars(*)

If the storage scheme exhibits any regularity, for instance consecutive variables in the case of compressed diagonal storage, the bord_vars and edge_vars arrays can be stored considerably more efficiently. However, this does not affect the gist of the following discussion.

The exact interpretation of the above entities is as follows. Suppose that each processor has the current value of x in the owned variables. Then

for p' = 1, ..., n_edge_proc
  let p = edge_procs(p')
  for i = edge_pointer(p'), ..., edge_pointer(p'+1) - 1
    send x(edge_vars(i)) to processor p
  end for
end for

and correspondingly,

for p' = 1, ..., n_bord_proc
  let p = bord_procs(p')
  for i = bord_pointer(p'), ..., bord_pointer(p'+1) - 1
    receive x(bord_vars(i)) from processor p
  end for
end for

As a practical matter, one is unlikely to perform individual sends and receives; rather, the values sent to and received from one processor will be handled as a package. Also, on some architectures the above sequence of a series of sends followed by a series of receives may not be feasible (if only because of network congestion or communication buffer overflow). Hence some form of serialization is necessary. The following algorithm serializes the sends and receives. (Note that the data structures are not affected.) On every processor, let me be the number of the processor; then

for p = 1, ..., me - 1
  if there is a q such that bord_procs(q) = p then, if necessary, receive from p
  if there is a q such that edge_procs(q) = p then, if necessary, send to p
end for
for p = me + 1, ..., nprocs
  if there is a q such that edge_procs(q) = p then, if necessary, send to p
  if there is a q such that bord_procs(q) = p then, if necessary, receive from p
end for

Thus, for processors with a lower number than your own, you receive from them before sending to them; for processors with a higher number, you send to them before receiving from them. This algorithm can easily be shown to be deadlock-free (see footnote 7).

Footnote 7: Let W(i,j) mean that processor i is waiting for processor j. Suppose there is deadlock, that is, there are processors p_0, ..., p_{n-1} such that W(p_k, p_{k+1}) (with all indices implicitly taken modulo n). Now let p_k be the minimum number in this cycle. From the nature of the algorithm it is impossible (1) that p_k < p_{k-1} < p_{k+1}, so it must be that p_k < p_{k+1} < p_{k-1}. Since it is impossible (2) that three consecutive processors in the cycle are ordered p_i < p_{i+1} < p_{i+2}, it must be that p_{k+2} < p_{k+1}. Continuing the above observations (1) and (2), it follows that all remaining processors in the cycle are < p_{k+1}. This contradicts that p_{k-1} > p_{k+1}.
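As an illustration, here is a minimal sketch of this exchange with the values for each processor packaged into one message. The paper does not prescribe a message-passing library; MPI is our choice here, the subroutine name and argument list are ours, and processor numbers 1..nprocs are mapped to MPI ranks 0..nprocs-1.

subroutine exchange_border(me, nprocs, comm, &
     n_bord_proc, bord_procs, bord_pointer, bord_vars, &
     n_edge_proc, edge_procs, edge_pointer, edge_vars, x)
  ! Sketch of the serialized exchange above: lower-numbered processors
  ! are received from before being sent to, higher-numbered ones are
  ! sent to before being received from.
  use mpi
  implicit none
  integer, intent(in) :: me, nprocs, comm
  integer, intent(in) :: n_bord_proc, bord_procs(*), bord_pointer(*), bord_vars(*)
  integer, intent(in) :: n_edge_proc, edge_procs(*), edge_pointer(*), edge_vars(*)
  real, intent(inout) :: x(*)
  integer :: p, q, i, n, ierr, status(MPI_STATUS_SIZE)
  real, allocatable :: buf(:)

  do p = 1, nprocs
     if (p == me) cycle
     if (p < me) then
        call recv_from(p)
        call send_to(p)
     else
        call send_to(p)
        call recv_from(p)
     end if
  end do

contains

  subroutine send_to(p)
    integer, intent(in) :: p
    do q = 1, n_edge_proc
       if (edge_procs(q) == p) then
          n = edge_pointer(q+1) - edge_pointer(q)
          allocate(buf(n))
          do i = 1, n
             buf(i) = x(edge_vars(edge_pointer(q) + i - 1))
          end do
          call MPI_SEND(buf, n, MPI_REAL, p-1, 0, comm, ierr)
          deallocate(buf)
       end if
    end do
  end subroutine send_to

  subroutine recv_from(p)
    integer, intent(in) :: p
    do q = 1, n_bord_proc
       if (bord_procs(q) == p) then
          n = bord_pointer(q+1) - bord_pointer(q)
          allocate(buf(n))
          call MPI_RECV(buf, n, MPI_REAL, p-1, 0, comm, status, ierr)
          do i = 1, n
             x(bord_vars(bord_pointer(q) + i - 1)) = buf(i)
          end do
          deallocate(buf)
       end if
    end do
  end subroutine recv_from

end subroutine exchange_border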

4.4 Construction of the communication structure

We will now give algorithms for constructing the above communication structure. We will distinguish several cases, depending on how much global information a processor has; depending on the amount of global information possessed, various amounts of communication will be necessary. In all cases we assume that the processor knows the global numbers of its local variables. This can be realized, for instance, by letting each processor initially have its part of the matrix with all indexes expressed in the global numbering.

In order to abstract away from a specific implementation, we will define the relevant sets somewhat abstractly. First of all, we define sets J_p ⊆ J as those indexes in J owned by processor p, and therefore to be received from that processor. Conversely, I_p ⊆ I is the index set of variables that have to be sent to processor p.

In the simplest case, where the variable-to-processor map is known to each processor (or at least the map's values are known on the processor's local variables), it is possible to compute the border sets J_p without any communication: if matrix element (i,j) is owned and j ∉ I, then j ∈ J_p where p is the owning processor of variable j.

If the map is not known, a processor can still construct J - I, but the partitioning into J_p sets now takes global communication:

1. each processor has to proclaim what variables are in its border set J - I;
2. it then accepts messages from all processors, either with the information that that processor owns no element of the border set, or with the list of variables that that processor does own.

This algorithm involves O(P^2) messages, even if most of these will be empty or contain a single number zero. Furthermore, the algorithm may need to be serialized, since P simultaneous broadcasts followed by P^2 subsequent messages may deadlock the network. Here is a simple algorithm for serialized computation of the J_p sets, assuming that processors are numbered 0, ..., np - 1.

for p = 0, ..., np - 1
  if I am not processor p, then
    send J - I to processor p
    receive from processor p, call this set S
    intersect S with I, call the result T, send T to processor p
    receive from processor p, this is J_p.
  end if
end for

It is easy to see that this algorithm will do the job, and its serial time (measured in the inner set computations) is twice the number of processors. The following modification, where me is the number of the executing processor, reduces this by a factor of two to the optimal serial time:

for q = 1, ..., np - 1
  let p_a = (me + q) mod np and p_b = (me - q) mod np
  send J - I to processor p_a
  receive from processor p_b, call this set S
  intersect S with I, call the result T, send T to processor p_b
  receive from processor p_a, this is J_{p_a}.
end for

We also have to decide which of a processor's owned variables constitute a set I_p for some p. There are two cases here.
1. If the matrix has a symmetric sparsity pattern (note that the matrix need not be symmetric numerically, only structurally), then A_{ij} ≠ 0 with j ∈ J_p implies i ∈ I_p.
2. If the matrix is structurally unsymmetric, every processor p reasons that its J_q set is the I_p set of processor q; so it sends J_q to processor q, and receives from processor q the set I_q.

Both of these algorithms presume that the J_p sets have already been computed.
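For the simplest case above, where the variable-to-processor map is known, the sizes |J_p| can be found by a purely local scan over the border variables. The following sketch assumes the compressed-row conventions of figure 4 (owned variables first in jset), an owner array giving the owning processor of each global variable, and processors numbered 1..nprocs; all names are ours.

subroutine split_border(veclen, isize, jset, owner, nprocs, nbord, bord_count)
  ! Sketch: count, for each processor p, how many of our border
  ! variables (local indexes isize+1..veclen, global numbers jset(.))
  ! it owns, i.e. the sizes |J_p|, without any communication.
  implicit none
  integer, intent(in)  :: veclen, isize, nprocs
  integer, intent(in)  :: jset(veclen), owner(*)  ! owner(g) = owning processor of variable g
  integer, intent(out) :: nbord, bord_count(nprocs)
  integer :: j, p

  bord_count = 0
  do j = isize + 1, veclen        ! border variables: local but not owned
     p = owner(jset(j))
     bord_count(p) = bord_count(p) + 1
  end do
  nbord = count(bord_count > 0)   ! number of border processors
end subroutine split_border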

5 Unified vector data structures

In a single-processor environment a vector consists of a real array plus an integer describing its length. In a distributed environment the matter is much more complicated. For instance, the notion of the length is expanded to cover both owned and local variables. We propose that a distributed vector consists, on each processor, of one real and one integer array. The integer array needs to contain
- size information: the number of owned and local variables, as well as the global number of variables;
- embedding information: sufficient data to translate the local piece of the vector back into terms of global variables;
- structure information: the embedding of the owned in the local variables, and other structural data;
- sufficient hooks that applications can store extra information in this integer array.

The problem we now face is that of collaborating modules. They need to have an agreement on how information for common use is stored in the integer array, and they need to be able independently to store information in that array for private use. This can be realized by letting the following locations have prescribed meanings:

1        This location contains a type identification. A library module can use this location to recognize whether a vector is of a supported type, and act accordingly by accepting it, converting it to an accepted format, or aborting.
2, 3, 4  The numbers of owned, local, and global variables. Not all of these are necessary for every format and all operations, but it is convenient to have them around.
5, 6     The locations of the structure and embedding information, respectively, in the integer array. We need an indirection step here, because the size of the structure and embedding information varies drastically with the storage format and the particular problem.
7        A free-space pointer. This array location tells library modules at what point in the array they can store their own information. Modules are expected to update this location if they store data in the integer array.

Further details are given in appendix C.
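A minimal sketch of such a descriptor: only the positions 1-7 and their meanings come from the text above; the module, the parameter names, and the claim_space helper are our own illustration.

module vec_descriptor
  ! Sketch of the proposed integer descriptor for a distributed vector.
  implicit none
  integer, parameter :: DESC_TYPE    = 1  ! storage-type identification
  integer, parameter :: DESC_NOWNED  = 2  ! number of owned variables
  integer, parameter :: DESC_NLOCAL  = 3  ! number of local variables
  integer, parameter :: DESC_NGLOBAL = 4  ! global number of variables
  integer, parameter :: DESC_STRUCT  = 5  ! location of structure information
  integer, parameter :: DESC_EMBED   = 6  ! location of embedding information
  integer, parameter :: DESC_FREE    = 7  ! free-space pointer
contains
  subroutine claim_space(idesc, nwords, start)
    ! Reserve nwords integers of private space in the descriptor and
    ! advance the free-space pointer, as a module is expected to do.
    integer, intent(inout) :: idesc(*)
    integer, intent(in)    :: nwords
    integer, intent(out)   :: start
    start = idesc(DESC_FREE)
    idesc(DESC_FREE) = start + nwords
  end subroutine claim_space
end module vec_descriptor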

6 Sufficient interfaces for library routines

In this section we will consider the question of what the minimum amount of information is that a library routine needs to have passed from outside so that it can perform operations on a distributed matrix. From the above discussion it will be clear that there are three possibilities:
1. If the communication structure has not been built, we need to pass, in addition to the matrix, sufficient information to build the communication structure, for instance the list of participating processors.
2. If the communication structure has not been built, we can also pass, in addition to the matrix, a function describing the mapping of either variables or matrix elements to the processors. Values of this function are only needed in the local variables of the local problem.
3. If the communication structure has been built, each processor knows exactly with what other processors it will communicate, and no knowledge beyond this is needed about the other processors.

There are several minor variations:
- We can pass to each processor the full matrix with the full information about which parts belong to which processors.
- We can pass the full matrix, with only the information about which part belongs to this processor.
- We can pass exactly the locally owned part of the matrix. In this case there is another choice:
  - We can have the matrix still expressed in global indexes.
  - The matrix can already have been transformed to local indexes.

In deciding on a particular interface, there is a trade-off. Passing only the local matrix plus the list of processors makes the interface very generally applicable. On the other hand, any routine accepting this interface needs to perform a considerable amount of preprocessing of the matrix, and hence it needs to perform a substantial amount of work to amortize this overhead. Passing a full communication structure in addition to the matrix makes the interface rather specific, but it cuts the preprocessing overhead to zero. We feel that this question can only be settled through practical experience.

7 Conclusion

We have given proposals for distributed memory versions of three popular matrix storage formats. The extensions of the classical versions arose from the need to express the matrix in local numbering, and to store explicitly the embedding of this local numbering in the global problem.

Although we started the discussion by considering arbitrary mappings of matrix elements to processors, we only considered in detail so-called one-dimensional decompositions. In these, processors own disjoint sets of matrix rows, and therefore they can be said to own certain problem variables. Since such decompositions can be easily derived from physical proximity considerations, we feel that this is an important case, and not a severe limitation.

Next, we have described the communication necessary for common linear algebra operations, and we have proposed data structures for storing communication information, as well as algorithms for constructing them.

We have consistently derived all structures from a discussion of the matrix-vector product as the only linear algebra operation. This is not a limitation. The transpose product and the regular and transpose solution of a linear system with sparse factors do not lead to additional requirements. Basically, it is only the sparsity pattern of the matrix that matters, not the particular operation.

References

[1] I.S. Duff, A.M. Erisman, and J.K. Reid. Direct Methods for Sparse Matrices. Clarendon Press, Oxford, 1986.

[2] Victor Eijkhout and Roldan Pozo. Basic concepts for distributed sparse linear algebra operations. LAPACK Working Note 77, Technical Report CS-94-240, Computer Science Department, University of Tennessee, Knoxville, 1994.

A Distributed matrix vector product algorithms

In this appendix we will give the algorithms for the local matrix-vector product for the distributed storage schemes discussed above.

A.1 2D stencil storage case

For 2D stencil storage the local part of the distributed algorithm for matrix-vector multiplication is basically the same as the standard algorithm. In fact, it is easier, since the fact that x is a bordered vector implies that we do not need tests for matrix coefficients being out of bounds.

for j = 1, ..., jpts
  for i = 1, ..., ipts
    y(i,j) = 0
for k = 1, nstencil
  for j = 1, jpts
    for i = 1, ipts
      y(i,j) = y(i,j) + matrix(i,j,k) · x(i + offsets(1,k), j + offsets(2,k))

A.2 Compressed diagonal storage case

The matrix-vector multiplication algorithm for compressed diagonal storage needs integer information to locate where the input data is located in x (one value of attach per diagonal), and where the I set is located in the output vector (a shift value first_own).

for i = 1, ..., size
  y(first_own + i - 1) = 0
for k = 1, ndiag
  for i = 1, size
    y(first_own + i - 1) = y(first_own + i - 1) + matrix(i,k) · x(attach(k) + i - 1)

A.3 Compressed row storage case

The distributed compressed row algorithm for matrix-vector multiplication given here is based on a permutation of the matrix elements such that the elements of y that are computed are the first nrows elements, rather than an arbitrary set of nrows elements (see section 3.4 for discussion).

for i = 1, ..., nrows
  y(i) = 0
  for j = pointer(i), ..., pointer(i+1) - 1
    y(i) ← y(i) + matrix(j) · x(index(j))
  end for
end for

B Local to global variable mappings

In all of the distributed storage schemes proposed above, information was retained for the mapping from local storage back to global storage. Such information is usually not necessary during the normal linear algebra operations, but it is required, for instance, if data is to be transferred to central storage, for such purposes as visualization. The following sections are all implicitly based on one-dimensional decompositions of the matrices. In particular, for the compressed row storage we will take the limited data structures of figure 4 rather than the fully general structures of figure 3.

B.1 2D stencil storage case

Stencil storage numbers vector variables as problem variables, that is, they inherit the dimensionality of the physical problem. In the two-dimensional case we then need to translate

    (i_local, j_local) → (i_global, j_global).

This is done by

    i_global = i_local + proc_shift(1),    j_global = j_local + proc_shift(2).

B.2 Compressed diagonal storage case

In section 3.3 (figure 2) we proposed two ways of storing the local-to-global variable mapping. With the explicit storage option, the map is immediately given as

    i_global = loc_to_glob(i_local).

If only the segments comprising J are stored, the mapping is given only algorithmically:

s = 0
for k = 1, ..., nsegments
  l = segments(2,k) - segments(1,k) + 1
  if s + 1 ≤ i_local ≤ s + l
    i_global = segments(1,k) + i_local - s - 1
    exit
  otherwise
    s ← s + l

Usually the mapping will be needed for a number of variables in a row. In that case, algorithmic optimizations are possible that make it almost as efficient as the explicit storage of the map.
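The same lookup written as a self-contained function; the function name and the -1 convention for an out-of-range i_local are ours.

integer function seg_loc_to_glob(i_local, nsegments, segments)
  ! Sketch of the segment-based local-to-global mapping above.
  ! Returns -1 if i_local lies beyond the stored segments.
  implicit none
  integer, intent(in) :: i_local, nsegments, segments(2,nsegments)
  integer :: k, s, l

  s = 0
  seg_loc_to_glob = -1
  do k = 1, nsegments
     l = segments(2,k) - segments(1,k) + 1
     if (i_local >= s + 1 .and. i_local <= s + l) then
        seg_loc_to_glob = segments(1,k) + i_local - s - 1
        return
     end if
     s = s + l
  end do
end function seg_loc_to_glob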

B.3 Compressed row storage case

Since matrices stored by compressed rows have no regularity to be exploited, we need to have the local-to-global map explicitly stored. Mapping a local variable i_local to its global location i_global is then simple:

    i_global = jset(i_local).

C Descriptions of distributed vectors

C.1 Stencil storage

Matrices in stencil storage are applied to problems on regular domains. A distributed vector is then a regular subdomain of the global domain (see the discussion in section 3.2). Hence it needs as structural information:

1. the number of space dimensions,
2. the size of the border surrounding the subdomain,
3. the sizes of the subdomain along each of the space dimensions.

The embedding of the subdomain vector in the global vector is described by a single vector of offsets, of length equal to the number of space dimensions.

C.2 Diagonal storage

A vector corresponding to a diagonally stored matrix has the following structural information (see the discussion in section 3.3):
1. the location of the first owned variable in the vector,
2. the number of segments comprising the vector,
3. the start and end points of the segments.

The embedding of the subdomain vector is described by a single integer, giving the global number of the first owned variable.

C.3 Compressed row storage

With distributed compressed row storage as we have proposed it (see the discussion in section 3.4), a subvector needs no further structural information, since the owned variables are simply the initial piece of the vector. The embedding information, on the other hand, is more extensive than for the previous two formats. It comprises an integer array, with length equal to the number of owned variables, giving the global numbers of these variables. For many applications it is convenient also to have the global numbers of the non-owned local variables, so we extend the integer information to include these.
