Block Based Compression Storage Expected Performance

Stamatis Vassiliadis, Sorin Cotofana, and Pyrrhos Stathis
Delft University of Technology, Electrical Engineering Dept., P.O. Box 5031, 2600 GA Delft, The Netherlands

March 6, 2000

Abstract

In this paper we present some preliminary performance evaluations of the Block Based Compression Storage (BBCS) scheme, which consists of a sparse matrix representation format and an associated Vector Processor (VP) architectural extension, designed to alleviate the performance degradation experienced by VPs when operating on sparse matrices. In particular we address the execution of Sparse Matrix Vector Multiplication (SMVM) algorithms. First we introduce a specialized VP functional unit to support the part of the architectural extension that does the SMVM computation, the MIPA instruction. Subsequently, we consider a set of benchmark matrices and report some preliminary performance evaluations by comparing the BBCS scheme with the Jagged Diagonal (JD) scheme. The simulations of the SMVM algorithm core execution indicate that the BBCS scheme always performs better than the JD scheme for large VP section sizes and that its performance always lies within a range of at most 10% away from the theoretical ideal performance, independently of the type of the processed matrix, whereas the JD performance is heavily dependent on the matrix type. Furthermore, we calculated that the required memory bandwidth for the JD scheme is between 2.49 and 3.12 times higher than the one required by the BBCS scheme.


1 Introduction

In many scientific computing areas the manipulation of sparse matrices constitutes the kernel of the solvers. In such solvers, Sparse Matrix Vector Multiplication (SMVM) plays a major role and therefore several SMVM software techniques [13, 3] as well as sparse matrix solver machines, e.g., the (SM)2 [1], or parallel architectures dedicated to Finite Element Modeling (FEM) computations, e.g., the White Dwarf [7] and SPAR [15], have been proposed and developed. (Finite Element Modeling [9] is a powerful numerical technique for solving partial differential equations which has SMVM as its basic computational step.)

Generally speaking, due to their intrinsic support for data parallelism, vector architectures [11, 8] are potentially good candidates to efficiently execute SMVMs and other types of sparse matrix manipulations. In practice, however, they are not as efficient on sparse matrices as they are on dense ones (as an example, a CRAY Y-MP operates, as suggested in [15], at less than 33% of its peak floating-point unit throughput when executing FEM computations). This performance degradation mainly relates to the code and data irregularity induced by the fact that SMVM algorithms make use of sparse matrix representation formats [13] in order to avoid trivial operations on zero-valued elements and to reduce memory bandwidth requirements. Such sparse formats represent a sparse matrix in memory in such a way that the large number of zeros a sparse matrix contains is not stored. Consequently, only the non-zero elements, accompanied by some positional information, are stored in memory; this obviously implies poor data regularity and makes the use of otherwise powerful vector instructions, e.g., vector multiply, quite inefficient. One way to alleviate this performance degradation is to augment the vector processor with some architectural support for sparse matrix manipulation, as suggested for example in [10]. In this line of reasoning, the authors have recently proposed in [16, 17] a vector processor ISA architectural extension and an associated sparse matrix storage format. This paper constitutes a continuation of the research in [16, 17] and its main contributions can be summarized as follows:

- We introduce a specialized functional unit able to execute the newly defined Multiple Inner Product and Accumulate (MIPA) instruction and the BBCS based SMVM algorithm.

- We consider a set of benchmark matrices and compare the expected performance of the BBCS scheme and the JD scheme. We simulate the execution of the SMVM algorithm core on a functional unit with a level of parallelism of 4 and a 10 cycle latency. The results suggest that the BBCS scheme always performs better than the JD scheme for large section sizes. The BBCS performance always lies within a range of at most 10% away from the ideal performance and is independent of the processed matrix type. The JD scheme performance exhibits a much larger deviation, between 10% and 400%, from the ideal performance.

- We derive a BBCS scheme memory bandwidth requirement of (1.5 + m + 2q)·n, where n is the number of non-zero matrix elements, q characterizes the number of non-empty rows in the sparse matrix, and m is the length of the dense vector expressed as a fraction of n. For the JD scheme the memory bandwidth requirement is 5n regardless of the matrix structure. In particular, for the matrices we considered, the required bandwidth for the BBCS scheme is 2.49 to 3.12 times lower than the bandwidth required by the JD scheme.

The presentation is organized as follows: In Section 2 we introduce assumptions and preliminaries about sparse matrices, SMVM on vector processors, and the JD scheme. In Section 3 we introduce the BBCS based SMVM scheme and in Section 4 we describe a specialized pipelined functional unit dedicated to the execution of the Multiple Inner Product and Accumulate (MIPA) instruction. In Section 5 we present some preliminary simulation results and compare the BBCS based SMVM scheme with the JD scheme. Finally, in Section 6 we draw some conclusions.

2 Assumptions & Preliminaries

The definition of the multiplication of a matrix A = [a_{i,j}], i,j = 0, 1, ..., n-1, by a vector b = [b_i], i = 0, 1, ..., n-1, producing a vector c = [c_i], i = 0, 1, ..., n-1, is as follows:

    c = A·b,    c_i = Σ_{k=0}^{n-1} a_{i,k} · b_k,    i = 0, 1, ..., n-1        (1)

In Equation (1), to calculate c, a sequence of n inner products A_i · b, where A_i is the i-th row of the matrix A, i = 0, 1, ..., n-1, has to be evaluated. Vector architectures [11, 8] such as the one graphically depicted in Figure 4 are particularly suitable to efficiently perform such inner product evaluations as they can operate on sequences of data, i.e., vectors, via vector instructions. On most current vector architectures [8], the vectors are copied from the main memory into vector registers within the processor before they are operated upon. Vector registers are arrays of scalar registers that hold (parts of) the vectors to be processed. Because the vector register length cannot be arbitrarily large, large vectors have to be divided into smaller parts, a technique usually called strip mining; each part cannot be larger than the maximum number of elements a vector register can hold, i.e., the architecturally defined section size s of the VP. In a VP the operations are carried out by pipelined Functional Units (FUs) that are able to fetch one new element per cycle from each of the source vector register(s) involved, operate on it/them, and return the result(s) to the result (vector) register. It is also possible that the FU fetches and processes p >= 2 vector (register) elements per cycle, in which case the FU has a parallelism of p. Each vector instruction implies a number of startup penalty cycles, or latency. This is related to the depth of the pipeline executing the vector instruction and means that the first result will be available only after a number of cycles equal to the latency. Nowadays, typical latency values are, for example, 6 cycles for a vector add FU, 7 for a vector multiply FU, and 20 for a vector divide FU. Such vector architectures are very efficient when they compute matrix multiplications involving dense matrices, as each A_i · b can basically be evaluated with ⌈n/s⌉ vector multiplication instructions. In the case of sparse matrices, however, vector architectures are not as effective due to the lack of data regularity induced by sparse representation formats. This performance degradation can be easily illustrated if we assume, for example, that the matrix A involved in Equation (1) is sparse and represented in the Compressed Row Storage (CRS) format [6]. This format uses three sets of values as follows: the Value set (all the non-zero matrix elements stored row-wise), the Column Index set (the corresponding column positions of the non-zero elements), and the Row Pointer set (pointers to the first non-zero element of each row that contains at least one non-zero element within the Value and Column Index sets). To clarify, we present in Figure 1 a simple sparse matrix example both in dense and CRS format. Under the assumption that the matrix A is sparse and CRS-represented, the matrix-vector multiplication in Equation (1) can be computed as follows:

    for i = 0 to Dimension(Row_Pointer)-1
        c[i] = 0
        for k = Row_Pointer[i] to Row_Pointer[i+1]-1
            c[i] = c[i] + Value[k] * b[Column_Index[k]]
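As a point of reference, the same loop can be written in scalar C as follows. This is only a minimal sketch; the identifiers smvm_crs, row_ptr, col_idx, value, b and c are ours and simply mirror the pseudocode above, and, as in the pseudocode, i runs over the stored (non-empty) rows with row_ptr holding one entry per stored row plus a terminating entry.

    /* Scalar C rendering of the CRS-based SMVM loop above. */
    void smvm_crs(int nrows, const int *row_ptr, const int *col_idx,
                  const double *value, const double *b, double *c)
    {
        for (int i = 0; i < nrows; i++) {
            c[i] = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                c[i] += value[k] * b[col_idx[k]];   /* indexed access of b */
        }
    }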


Dense Format (5 x 5 example matrix):

    2 0 0 0 0
    0 4 0 4 6
    0 0 0 5 0
    0 8 0 0 7
    0 0 0 0 9

Compressed Row Storage Format:

    Value:        2 4 4 6 5 8 7 9
    Column_index: 0 1 3 4 3 1 4 4
    Row_pointer:  0 1 4 5 7

Figure 1: Compressed Row Storage Format

The execution of this code on a vector processor requires ⌈(Row_Pointer[i+1] - Row_Pointer[i])/s⌉ vector multiplication instructions for each matrix row i. Since in many real-life sparse matrices it is likely that the number of non-zero elements in a matrix row is smaller than the vector processor section size s, many of the vector multiplication instructions previously mentioned may handle short vectors. Consequently, the utilization of the processor resources may be inefficient and pipeline start-up times may dominate the overall execution time. Furthermore, another issue that may negatively influence the performance of a VP executing a CRS algorithm is the need for indexed access to the b vector. In many cases, depending on the main memory and vector load/store unit design, the load/store vector with index instructions cannot be completed as efficiently as the standard load/store vector instructions.

To overcome some of the shortcomings of the CRS format many other schemes have been proposed [13]. The scheme (by scheme we mean here, and from now on, both the storage format and the algorithm used to execute the SMVM) that is considered to provide the best performance, when no assumption is made about specific properties of the non-zero structure of the matrix, is the Jagged Diagonal (JD) scheme [2]. The JD scheme reorders the non-zero matrix elements in columns in an attempt to provide an adequate vector register filling and to reduce the startup penalty. The JD format consists of four parts:

1. The actual non-zero Values arranged in column Vectors (VV).
2. The corresponding Column Position Vectors (CPV).
3. A Permutation Vector (PV).
4. A vector that indicates the number of elements (Length) per column Vector for the first two parts of the format, i.e., VV and CPV (LV).

The JD format can be derived from a dense representation as follows (refer also to Figure 2 for an example): First, all the non-zero values are shifted to the left of the matrix so that the first column contains the first non-zero element of each row, the second column the second non-zero element of each row, and so on. The resulting rightmost columns that contain only zeros are then discarded. The remaining columns form the VV set. For each matrix element present in the VV, its initial column position is stored in the CPV. This results in a CPV set with the same number of columns as the VV. This first step produces an intermediate format such as the one depicted in Figure 2. If the number of non-zero elements per row in the original matrix is not the same for every row, the intermediate format contains a number of zeros. In this case the VV rows are permuted (the permutation vector is memorized in PV) in such a way that all the zero values in VV are placed at the bottom of the VV columns. Those zeros are then discarded, and the length of each of the new columns, for both the value columns (VV) and the corresponding positional information columns (CPV), is stored in the LV vector.

Intermediate Format (non-zeros shifted left, with their original column positions; unused slots are 0):

    Value Column Vectors    Column Position Vectors
    2 0 0                   0 0 0
    4 4 6                   1 3 4
    5 0 0                   3 0 0
    8 7 0                   1 4 0
    9 0 0                   4 0 0

Jagged Diagonal Format (rows permuted by decreasing length, zeros discarded):

    Value Column Vectors:      4 8 2 5 9 | 4 7 | 6
    Column Position Vectors:   1 1 0 3 4 | 3 4 | 4
    Permutation Vector:        1 3 0 2 4
    Vector Lengths:            5 2 1

Figure 2: JD Format Construction from a Dense Representation

The pseudo-assembler code to execute the JD based SMVM can be written as follows:

    for i = 1 to (number of columns) do
          LD   lv[i],R1         ;load starting vector length into R1
          ST   #0,R2            ;reset section counter
    lp:   LDV  v[i]+R2*s,VR1    ;load non-zero elements in VR1
          LDV  cp[i]+R2*s,VR4   ;load the index vector
          LDVI VR4,VR2          ;load b-vector elements with index
          MULV VR1,VR2,VR3      ;store element-wise products in VR3
          LDV  @c+R2*s,VR5      ;load result vector c
          ADDV VR3,VR5,VR5      ;add VR3 to c vector
          STV  VR5,@c+R2*s      ;store result vector back to memory
          SUB  R1,s,R1          ;decrease R1 = R1 - s
          INC  R2               ;increase section counter
          BLE  #0,R1,lp         ;branch to lp if not end of column

under the assumption that v[i] is a pointer to the beginning of the i-th VV column in memory, cp[i] is a pointer to the beginning of the i-th CPV column in memory, and lv[i] is their length. Furthermore, we also assumed that the first element of the result vector c is located at the memory address @c and that s is the VP section size. Code details for the loop are omitted and only the main body of the code is described, for simplicity. Note that if a VV column is longer than the section size s, it must be sectioned into smaller vectors with a maximum length of s. This means that if q is the length of a VV column, ⌊q/s⌋ vectors of size s and one vector of size q - s·⌊q/s⌋ will be operated upon. The registers R1 and R2 are used in this context to implement the strip mining process. (The pseudo-assembler notation is inspired by the IBM/370 [5] and DLXV [8] vector instruction sets.)
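For comparison with the vector code, a scalar C sketch of the same JD traversal is given below; vv[j], cpv[j] and lv[j] stand for the j-th VV column, the j-th CPV column and its length, and perm for the permutation vector PV. All identifiers are ours, and, unlike the vector code above, the results are scattered directly to the original row order via perm instead of being accumulated in permuted order and unpermuted afterwards.

    /* Scalar sketch of JD-based SMVM; c must be zero-initialised. */
    void smvm_jd(int ncols, const int *lv,
                 const double *const *vv, const int *const *cpv,
                 const int *perm, const double *b, double *c)
    {
        for (int j = 0; j < ncols; j++)      /* one pass per jagged column    */
            for (int r = 0; r < lv[j]; r++)  /* lv[j] elements in this column */
                c[perm[r]] += vv[j][r] * b[cpv[j][r]];
    }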

Figure 3: Block Based Compressed Storage Format. The figure shows an example A-matrix partitioned into Vertical Blocks of width s, together with the resulting Vertical Block Storage: a stream of non-zero values, each stored as an entry with a Value field, a Column Position field, and the EOR, EOB, EOM and ZR flags.

3 BBCS based Sparse Matrix Vector Multiplication

Before introducing the specialized MIPA VP Functional Unit and the performance results we will first briefly discuss the BBCS scheme, its two variations, and the associated vector architecture extension previously introduced in [16, 17].

3.1 Basic BBCS Format

To construct a BBCS sparse matrix representation of an n x n matrix A, the matrix is first partitioned into ⌈n/s⌉ Vertical Blocks (VBs) A_m, m = 0, 1, ..., ⌈n/s⌉ - 1, of at most s columns each, where s is the VP section size. At most one VB, the last block, can have fewer than s columns, in the case that n is not a multiple of s. An example of such a partitioning is graphically depicted in Figure 3. The reason for this partitioning is that when computing the product A·b the matrix elements a_{ij} are used only once in the computation, whereas the elements of b are used several times, depending on the number of non-zero entries in the corresponding column of the matrix A. Thus, when processing a certain VB, just one corresponding s-sized section of the vector b participates in the computation: it can be loaded into a vector register once when the processing of the VB is initiated, and no further information has to be loaded from the vector b until the processing of the current VB is completed. Furthermore, with this partitioning we only need to store, for each non-zero matrix element, its position within the VB and not within the entire matrix. This immediately implies a reduction of the memory bandwidth requirement, as the column position information within the entire matrix has to be able to cover an arbitrarily large range, while for a VB-partitioned matrix the column position information only needs to address an s-position wide area. Each VB is stored as a series of 6-field entries (BBCS entries) as depicted in Figure 3. The first field, Value, holds the value of the non-zero matrix element that the entry represents. The second, Column Position (CP), contains the column position of the non-zero value within the VB; thus, only ⌈log2 s⌉ + 1 bits are needed for this field. This results in a considerable reduction of the required memory bandwidth, as demonstrated later in Section 5. The next four fields are one-bit flags: the End Of Row (EOR) flag is 1 when the represented non-zero value is the last one in its row; the End Of Block (EOB) flag

signals the last non-zero element of the VB, and the End Of Matrix (EOM) flag signals the last element of the matrix. The last flag, the Zero Row (ZR) flag, specifies the way the Value field should be treated: when it is set, the entry no longer represents a non-zero element of the matrix but a sequence of matrix rows with no non-zero elements within the VB, and in this case Value contains the number of such empty rows.

Figure 4: BBCS augmented Vector Processor Organization. The organization comprises the vector unit (the vector register file, the load/store unit with LDS support, and the functional units, extended with the MIPA functional unit and the c-memory), the vector and scalar controllers, the scalar unit (scalar registers and scalar pipeline), the cache, and the main memory.
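Purely as an illustration of the six fields just described, a BBCS entry could be laid out in C as sketched below. The packing (a 32-bit value plus two extra bytes for position and flags, which matches the bandwidth accounting used in Section 5) is our assumption and is not prescribed by the format itself.

    /* One BBCS entry (sketch).  For 32 < s <= 1024 the column position needs
       at most 11 bits, so CP plus the four flags fit in two extra bytes. */
    typedef struct {
        float    value;     /* non-zero value, or the empty-row count if zr is set */
        unsigned cp  : 11;  /* Column Position within the vertical block           */
        unsigned eor : 1;   /* End Of Row flag                                      */
        unsigned eob : 1;   /* End Of Block flag                                    */
        unsigned eom : 1;   /* End Of Matrix flag                                   */
        unsigned zr  : 1;   /* Zero Row flag: entry encodes a run of empty rows     */
    } bbcs_entry;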

3.2 Vector Architecture Extension

In order to efficiently support the execution of the SMVM when A is represented in the BBCS format, we have proposed in [16] a vector processor architecture extension (we assume a register-type organization similar to the IBM/370 vector facility [5], but not necessarily identical and/or restricted to it) consisting of: two new instructions, LDS (LoaD Section) and MIPA (Multiple Inner Product and Accumulate); a special vector register CPR that holds the positional information produced by the LDS instruction and used by MIPA instructions; and a scalar register RPR that holds the current row position in the currently processed VB. Furthermore, two extra processor status flags that are updated by the LDS instruction, EOM and EOB, have been added. Figure 4 depicts the main extensions to the vector processor organization. The c-memory module will be described in Section 3.3.2.

The LDS instruction format is LDS @a,VR1,VR2. It loads a sequence of BBCS entries from the memory address @a into the destination vector registers VR1 and VR2. From the stream of BBCS data entries, the non-zero elements are loaded in VR1 and the index positions of the rows that contain at least one non-zero matrix element are loaded in VR2. VR2 can later serve as an index vector to load c sections when performing SMVMs. LDS will cease fetching BBCS entries on one of the following two conditions: either the vector register VR1 has already been loaded with s elements, which is the maximum it can contain, or after loading a data entry with at least one of the flags that signal the

end of a VB (EOB or EOM flag) set. In the latter case the corresponding EOB or EOM processor status flag is updated accordingly. The CPR vector register is affected by LDS implicitly: after an LDS completes, the CPR contains the CP fields corresponding to the non-zero matrix elements that were loaded in VR1.

The MIPA instruction format is MIPA VR1,VR2,VR3. The VR1 vector register is assumed to contain a sequence of non-zero matrix element values from a VB section previously loaded via an LDS instruction. The CPR vector register, which is used implicitly by MIPA, is also assumed to contain the column position information updated by the same LDS instruction. VR2 is assumed to contain the corresponding s-element wide section of the b vector. Before the execution, VR3 contains the values of the c vector elements located at the positions to be affected by the MIPA computation; these elements are selected from the vector c and loaded with the index vector created by the LDS instruction.

Figure 5: The functioning of the MIPA instruction. The figure shows an A-matrix partitioned into the VBs A0, A1 and A2 together with the vectors b and c (partial result); the last six non-zero elements of VB A1, namely a(12,14), a(14,12), a(14,14), a(15,15), a(17,12) and a(17,15), with their CP, EOR, ZR, EOB and EOM fields, are loaded in VR1 and CPR, the corresponding b section in VR2, and the multiply/add network accumulates the per-row sums into VR3.

Generally speaking, VR1 may contain one or more sequences of values that belong to the same matrix row. The end of such a sequence is marked by a set EOR flag in the corresponding CPR entry. For each of these sequences an inner product is computed with a vector formed by selecting values from VR2 using the corresponding CP index values in the CPR register. To clarify the operation of the MIPA instruction, Figure 5 presents a visual representation of the MIPA operation. In the figure it is assumed that the last 6 non-zero elements of the second VB, A1, are loaded in VR1, the corresponding b vector section in VR2, and the values of the vector c to be affected by the MIPA computation in VR3.
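As a purely behavioural reference for what one LDS/MIPA pair computes (not a model of the pipelined hardware of Section 4), the following C sketch processes one loaded VB section. The identifiers vr1_value, cpr_cp and cpr_eor stand for the contents of VR1 and CPR, vr2_b for the b section in VR2, and vr3_c for the gathered c elements in VR3; all names are ours.

    /* Behavioural sketch of MIPA on one VB section of len non-zeros.
       Each run of entries ending in a set EOR flag contributes one inner
       product, which is accumulated onto the corresponding (old) c value
       already present in vr3_c.  A trailing run without an EOR flag would
       leave a partial sum; that corner case is ignored here. */
    void mipa_reference(int len, const double *vr1_value, const int *cpr_cp,
                        const int *cpr_eor, const double *vr2_b, double *vr3_c)
    {
        int row = 0;            /* index of the current result element in VR3 */
        double acc = 0.0;
        for (int k = 0; k < len; k++) {
            acc += vr1_value[k] * vr2_b[cpr_cp[k]];  /* one inner-product term */
            if (cpr_eor[k]) {                        /* row sequence completed */
                vr3_c[row++] += acc;
                acc = 0.0;
            }
        }
    }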

Figure 6: The Matrix Processing Order - SBBCS (left) and S+BBCS (right, with k = 3)

3.3 BBCS based SMVM Algorithms

Assuming the architectural extension described in the previous subsection and the BBCS format, the pseudo-assembler code describing the BBCS based SMVM can be written as follows:

    for all VBs do:
           LDV  @b,VR2        ;load a vector b section in VR2
           ST   #0,RPR        ;reset RPR
    loopA: LDS  @A,VR1,VR4    ;load VB-section
           LDVI VR4,VR3       ;load vector c with index vector VR4
           MIPA VR1,VR2,VR3   ;multiply VB-section with b and add to c
           STVI VR3,@c,VR4    ;store vector c with index VR4
           compute new @A
           if not EOB or EOM jump to loopA

where @c is assumed to be the memory address of the first element of the vector c, and @A is assumed to be a pointer to the first data entry of the BBCS-stored matrix. As mentioned before, the indexed load/store vector instructions, since they may lose any advantage coming from the interleaved organization [12, 14] of the main memory, might be much less efficient than the standard load/store vector instructions. Within the previously described algorithm such indexed operations are used for the manipulation of the c vector. The need for indexed loads and stores arises from the fact that when the VB entries are processed they may originate from any row position in the matrix. Since the row position of an entry corresponds to the position in the result vector c it contributes to, the c vector needs to be loaded with an index vector. To avoid an indexed load/store based scheme we need to restrict the range of the matrix entry row positions that are loaded. In the following we propose two techniques that avoid indexed load/store vector instructions.
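The indexed c accesses discussed above (the LDVI/STVI pair around each MIPA) amount to a gather and a scatter driven by the row-index vector that LDS produces. In scalar C they could be sketched as follows; row_idx plays the role of VR4, and all names are ours.

    /* Gather the c elements touched by the current VB section (LDVI). */
    void gather_c(int nrows, const int *row_idx, const double *c, double *vr3)
    {
        for (int r = 0; r < nrows; r++)
            vr3[r] = c[row_idx[r]];
    }

    /* Scatter the updated elements back to c (STVI). */
    void scatter_c(int nrows, const int *row_idx, const double *vr3, double *c)
    {
        for (int r = 0; r < nrows; r++)
            c[row_idx[r]] = vr3[r];
    }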

3.3.1 Segmented BBCS (SBBCS)

Within this scheme the matrix A has to be partitioned into vertical blocks as before, but now each VB is in turn partitioned into s x s sub-blocks called segments. This extra partitioning is realized by adding a new field, the EOS (End Of Segment) flag, to the BBCS entry format. All the other fields of the SBBCS format follow the BBCS structure. The EOS flag signals the end of a segment and thus it is set for entries representing the last non-zero element of a segment. In this new scenario the load unit executing an LDS instruction ceases loading, additionally to the previously specified conditions, when an entry with a set EOS flag is encountered. This guarantees that after the subsequent execution of a MIPA instruction the result lies within a known contiguous corresponding section of c, which can be loaded and stored without the aid of an index vector.

3.3.2 Extended SBBCS (S+BBCS)

This scheme uses the SBBCS data format to represent the matrix. The difference is that an LDS instruction ceases loading after k segments rather than after one segment. This approach ensures that the result values subsequently produced by a MIPA instruction lie within a known contiguous k·s sized section of the c vector. To support such a functionality we introduce an additional memory module within the processor that we call the c-memory (see Figure 4). This memory is able to hold k·s elements and is directly accessible by the MIPA instruction, which can store to and load from it. It can be viewed as a working memory for the MIPA instruction where it can keep the partial results. This approach eliminates the need to load and store the partial results of the c vector to and from the main memory, because all the partial results now reside in the c-memory. Consequently, to perform SMVM using the S+BBCS scheme, a different access pattern should be used when accessing the matrix A. This is graphically depicted on the right in Figure 6. After the upper (k·s)·⌈n/s⌉ elements of the matrix A are processed, the c-memory contains the final results for the first k·s elements of vector c, which can then be stored into the main memory. Then the next (k·s)·⌈n/s⌉ elements are processed, and so on, until the whole matrix is processed. We note that for this scheme to be realized there are extra requirements, e.g., instructions to initialize, load, and store the c-memory, and of course the MIPA functional unit has to be able to implicitly access the c-memory.
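The resulting access pattern can be summarized by the C-style sketch of the outer loops below. Here process_band() is a hypothetical placeholder standing for the LDS/MIPA work that one VB contributes to the current band, and all names are ours; this is only an illustration of the processing order described above.

    void process_band(int band, int vb);     /* hypothetical: k segments of one VB */

    /* Sketch of the S+BBCS processing order: the matrix is swept in horizontal
       bands of k*s rows; within a band every VB is visited, partial results stay
       in the c-memory, and only final c values are written to main memory. */
    void splus_bbcs_order(int n, int s, int k)
    {
        int nvb    = (n + s - 1) / s;            /* number of vertical blocks */
        int nbands = (n + k * s - 1) / (k * s);  /* number of row bands       */
        for (int band = 0; band < nbands; band++) {
            /* (re)initialise the c-memory for the k*s c elements of this band */
            for (int vb = 0; vb < nvb; vb++)
                process_band(band, vb);
            /* store the now-final k*s c elements of this band to main memory  */
        }
    }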

4 The MIPA Functional Unit

As discussed in the previous section, in order to support the execution of the SMVM on a vector processor using the BBCS format we have introduced an architectural extension. Here we describe a possible implementation of a functional unit dedicated to the execution of the MIPA instruction. The hardware unit is meant to be one of the pipelined functional units of a vector processor such as the one presented in Figure 4. A 3-stage pipelined implementation of the MIPA functional unit is graphically depicted in Figure 7.

The operation scenario for the MIPA unit can be described as follows. First, a number of elements of the VR1 vector register, as well as their corresponding CPs from the CPR special vector register, are stored in the A-buffer. The number of VR1 elements that can be processed in each cycle is determined by the level of parallelism p of the implementation; in the implementation example in Figure 7, p is 4. Then the CP entry fields in the A-buffer are used to select the "right" elements from the VR2 vector register. These two sets of data, i.e., the Values in the A-buffer and the elements selected from VR2, are then pairwise multiplied by an array of p multipliers in the first pipeline stage (as they perform only a multiple-operand addition, these multipliers do not contain the final adder stage). As a result of the multiplications, p new product results are forwarded to the next stage.

In the second stage the accumulation of the previously computed products is performed. On the basis of the EOR flag of each non-zero element that is contained in the CPR register, the Control 2 unit selects the results of the previous stage to be accumulated to form the final result element(s) and to be stored back in the vector register holding the partial c vector. In this accumulation step the current values in VR3 are also included. This extra accumulation happens only when the last non-zero element of a matrix row is encountered, which means that an end result should be produced. The bandwidth needed from VR3 can therefore vary from 0 to p elements per cycle. Note that when the element that is computed in the rightmost column of the value accumulation array is not the last element of its matrix row, the accumulation cannot be completed in the current cycle and a partial accumulation result is forwarded to the next cycle at the same stage. Finally, in the third pipeline stage the final 2-to-1 additions are performed and the results are stored back in VR3.

Figure 7: MIPA Dedicated Functional Unit. Up to p entries per cycle are taken from VR1 and CPR into the A-buffer; their CPs select the matching b elements from VR2 through a multiplexer; Stage 1 is the array of p multipliers; Stage 2 is the value accumulation array controlled by the EOR flags, which also takes in the VR3 values and forwards partial sums to the next cycle; Stage 3 is the array of 2-1 adders that produces the results written back to VR3.
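To make the role of the EOR flags in the second stage concrete, the following C fragment sketches one stage-2 step for p incoming products. It is only a behavioural approximation: the accumulation tree and the stage-3 2-to-1 adders are collapsed into a sequential loop, and all identifiers are ours.

    /* One stage-2 step of the MIPA unit (behavioural sketch): p products
       arrive, they are accumulated left to right, and a result is emitted
       whenever an EOR flag closes a row (adding in the old c value from VR3).
       A partial sum not closed by an EOR flag is returned so that it can be
       carried into the next cycle at the same stage. */
    double mipa_stage2_step(int p, const double *prod, const int *eor,
                            const double *vr3_old, double carry_in,
                            double *result, int *nresults)
    {
        double acc = carry_in;
        int out = 0;
        for (int j = 0; j < p; j++) {
            acc += prod[j];
            if (eor[j]) {                       /* last element of a matrix row */
                result[out] = vr3_old[out] + acc;
                out++;
                acc = 0.0;
            }
        }
        *nresults = out;    /* between 0 and p results produced this cycle */
        return acc;         /* partial accumulation forwarded to next cycle */
    }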

5 Experimental Results and Comparisons

In this section we present some preliminary performance evaluations for the BBCS schemes. As the BBCS schemes are generally applicable, that is, they make no assumption about the matrix properties such as, for instance, a specific non-zero structure or a property that might arise from the type of problem that the matrix is used to solve, we do not consider in our comparisons methods that are specifically tuned for particular classes of matrices, e.g., block diagonal. Given that, to the best of our knowledge, of the generally applicable methods reported in the literature the JD scheme [2] is considered to deliver the best performance, we decided to compare our schemes with it.

To evaluate each scheme we use a set of benchmark matrices and some simple simulation models that cover the loading of the matrices to be processed into vector registers and the execution of the SMVM core instructions by a functional unit. For all schemes the same assumptions have been made in order to make a fair comparison. For instance, when we model the MIPA functional unit with a parallelism of p, we assume that the functional unit executing the core of the JD SMVM algorithm also has a parallelism of p.

Figure 8: Vector Register Load Scenario (BBCS-1: v = 7, BBCS-2: v = 10, BBCS-3 (k = 2): v = 8, Jagged Diagonal: v = 8)

The experiments for the BBCS, SBBCS, and S+BBCS schemes were obtained by simulating the execution of the SMVM algorithm on the extended vector architecture described in Section 3, and those for the JD scheme on a vector architecture as described in Section 2. We note here that for all the experiments where the S+BBCS scheme is involved we have chosen to present only the results for k = 8, because during our experiments we observed no significant improvement of the S+BBCS performance for higher values of k on the benchmark matrices that we considered. The Expected Performance of a scheme is measured in cycles and calculated as follows:

    Expected Performance = n/p + v · l        (2)

where p is the parallelism of the functional unit that executes the core of the SMVM algorithm and is assumed to be the same for all the schemes, and v is the number of vector instructions issued. The latency l models the latency of the functional unit itself plus the scalar instruction overhead for the initiation of a vector instruction. For all the considered schemes we have chosen this value to be 10 cycles: 7 cycles for the pipeline startup time and 3 cycles for the scalar code needed to prepare the vector instruction.

Additionally, we compare the schemes on the basis of the Memory Bandwidth required for the execution. Here we assume no cache or any other memory performance model that depends on the type of memory access. The memory bandwidth is measured by analyzing the SMVM code of the involved scheme; effectively, we count the number of words that enter and leave the processor. As a consequence, since the advantage of the SBBCS and S+BBCS schemes is related to the memory access policy (i.e., indexed load/stores), these schemes will not be considered for the memory bandwidth measurements.
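Equation (2) is simple enough to state directly in code; the following helper (names ours) returns the expected cycle count used throughout this section, e.g., with p = 4 and l = 10 as in the experiments.

    /* Expected performance model of Equation (2): n non-zeros processed at a
       rate of p per cycle, plus a latency penalty of l cycles for each of the
       v vector instructions issued. */
    double expected_cycles(double n, double v, double p, double l)
    {
        return n / p + v * l;
    }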

We selected a sparse matrix test suite from the Matrix Market on-line collection [4]. We carried out a thorough examination of the statistical properties of the available matrices, focusing our attention on the statistical properties of the number of non-zeros per row (NZPR), because these are the parameters that mainly influence the performance of the SMVM schemes. The parameters we considered are: the average NZPR, the NZPR variance, and the existence of any rows with a large NZPR. Consequently, we divided the matrices into three categories as follows:

- Regular: a sparse matrix that has a relatively (to other matrices of the same size) small NZPR variance, and the row with the largest NZPR has an NZPR which is not significantly larger than the average NZPR plus the NZPR variance.

- Irregular-1: a sparse matrix that has approximately the same NZPR average and variance as a regular matrix but contains rows with a large NZPR.

- Irregular-2: a sparse matrix that has a large average NZPR, a large NZPR variance, and rows with a large NZPR.

The selected matrices are presented in detail in Table 1. For each category we have chosen two matrices that we consider representative for that category: one small and one large. (Within this context a matrix is considered to be small when the number of non-zero elements it contains is in the order of the VP section size s, which is in the range 32 < s < 1024 in our experiments, and considered to be large otherwise.)

Figure 9 depicts the results of the experiment that measures the expected performance when the core of the SMVM algorithm is executed by a functional unit with a latency of 10 and a parallelism of 4. The results are represented in numbers of cycles (y-axis) as a function of the section size (x-axis). For each matrix we have also plotted the ideal performance, which is computed as n/p, assuming that a processor cannot process any matrix faster than the number of non-zero elements the matrix contains divided by the functional unit parallelism. The ideal performance should be interpreted, in this context, as a lower bound under which the number of cycles the execution may take cannot be reduced. By analyzing the curves in Figure 9 the following can be observed:

- The BBCS schemes always perform better than the JD scheme for large section sizes.

- For all matrices all the BBCS schemes converge to the same performance for increasing section size s. This performance always lies within a range of at most 10% away from the ideal performance and is independent of the type of the processed matrix. The JD scheme exhibits a much larger deviation from the ideal performance (between 10% and 400%).

- For small matrices (left in Figure 9) all the BBCS variants are up to three times faster than the JD scheme.

    Category           Matrix Name  Dimensions      Non-zeros  Average NZPR  NZPR Variance  Longest NZPR
    Regular Large      cavity16     4562 x 4562     138187     30            15             62
    Regular Small      gre_185      185 x 185       1005       5.3           0.85           6
    Irregular-1 Large  memplus      17758 x 17758   126150     5.6           13             353
    Irregular-1 Small  fs_183_3     183 x 183       1069       5.8           9.1            72
    Irregular-2 Large  mbeause      496 x 496       41063      83            130            489
    Irregular-2 Small  tols90       90 x 90         1746       19            35             90

Table 1: Sparse Matrix Test Suite

Figure 9: Expected Performance. For each test matrix (cavity16, gre_185, memplus, fs_183_3, mbeause, and tols90) the plots show the expected execution time in cycles as a function of the section size s (32 to 1024) for the BBCS, SBBCS, S+BBCS (k=8), and JD schemes, together with the ideal performance.

Figure 10: Expected Resource Utilization. For each test matrix the plots show the resource utilization R of the BBCS and JD schemes as a function of the latency (0 to 300 cycles), for s = 1024 and p = 4, together with the R = 0.5 level used to define l0.5.

    Matrix Name   l0.5 (JD)   l0.5 (BBCS)   l0.5 (BBCS) / l0.5 (JD)
    cavity16      215         255           1.19
    gre_185       35          250           7.14
    memplus       45          245           5.4
    fs_183_3      4           130           32.5
    mbeause       20          250           12.5
    tols90        5           220           44

Table 2: Resource Utilization

Furthermore, we can consider the fraction

    R = Ideal Performance / Expected Performance        (3)

as a measure of the processor resource utilization. Because the BBCS schemes have a better vector register filling than the JD scheme, their resource utilization is not drastically affected by latency increases and/or increased parallelism, especially when large section sizes are used, and thus they are more tolerant to changes in parallelism and latency. We illustrate this phenomenon by plotting R versus the latency l in Figure 10 for each matrix in the test suite. Within this experiment we assumed a section size of s = 1024 and a parallelism of p = 4. We depict the JD scheme and only the BBCS scheme, as under these assumptions the performance of BBCS, SBBCS, and S+BBCS does not differ significantly for s = 1024. A good indication of the latency tolerance regarding the resource utilization is the latency value at which the resource utilization drops to 0.5, which we call l0.5; the higher l0.5 is, the more tolerant to latency increases the scheme is. The l0.5 values are listed in Table 2 for both schemes and for all the considered matrices. The last column in the table represents the ratio between the l0.5 corresponding to the BBCS scheme and the one corresponding to the JD scheme, as a better means to compare them. It can be observed that in all the considered cases the BBCS scheme is much more tolerant to latency increases than the JD scheme. Equation (3) can be rewritten as R = n / (n + l·p·v), which means that an increased functional unit parallelism may have the same effect on R; as a consequence, using the same reasoning, one can prove that the BBCS schemes also have a better tolerance to increased parallelism regarding the resource utilization.

The final issue to be addressed here relates to the memory bandwidth requirements. In relation to this, we present in the remaining part of this section a comparison of the memory bandwidth requirements of the BBCS and JD schemes. To determine the memory bandwidth for the BBCS scheme we have to extract the following values from the test matrices:

1. q = (number of VB rows) / n estimates the number of rows within the VBs as a fraction of the total number of non-zero matrix elements n. Because every row within a VB produces an (intermediate) result to be stored in the c vector, the value of q is an indication of how many elements of the vector c need to be loaded by the BBCS based SMVM algorithm.

2. m specifies the length of the vector b, expressed as a fraction of the total number of non-zero elements n in the matrix A. This value is needed in order to put the bandwidth required for the b vector in relation to n.
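Under the definitions above, q and m can be extracted mechanically from a matrix. A C sketch over a CRS representation is given below; it assumes the column indices within each row are stored in ascending order, and all identifiers are ours.

    /* Extract the q and m fractions for a dim x dim matrix with nnz non-zeros,
       given vertical blocks of width s.  A row is counted once for every
       vertical block in which it has at least one non-zero ("# of VB rows"). */
    void bbcs_fractions(int dim, int nnz, int s,
                        const int *row_ptr, const int *col_idx,
                        double *q, double *m)
    {
        long vb_rows = 0;
        for (int i = 0; i < dim; i++) {
            int prev_vb = -1;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int vb = col_idx[k] / s;     /* vertical block of this entry  */
                if (vb != prev_vb) {         /* first non-zero of row i in it */
                    vb_rows++;
                    prev_vb = vb;
                }
            }
        }
        *q = (double)vb_rows / nnz;          /* q = (# of VB rows) / n        */
        *m = (double)dim / nnz;              /* m = (length of b) / n         */
    }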

Furthermore, the CP field of the BBCS format requires ⌈log2 s⌉ + 1 bits, and the flags another 4 bits, for each non-zero value. This means that, under the current assumption for the VP section size, i.e., 32 < s < 1024, 2 extra bytes are required per non-zero matrix value for the positional information and flags. If we assume that the matrix non-zero values are represented on 32 bits, the bandwidth for the positional information is n/2. Under the previously mentioned considerations the memory bandwidth for each of the two schemes can be calculated from the main core of the SMVM algorithms as follows:

Jagged Diagonal

    (1) LDV  v[i]+R2*s,VR1    the number of non-zeros                                   n
    (2) LDV  cp[i]+R2*s,VR4   the corresponding positional information                  n
    (3) LDVI VR4,VR2          for each non-zero element a value from b is loaded        n
    (4) MULV VR1,VR2,VR3      no memory action                                          0
    (5) LDV  @c+R2*s,VR5      each result is added separately                           n
    (6) ADDV VR3,VR5,VR5      no memory action                                          0
    (7) STV  VR5,@c+R2*s      same as (5)                                               n
        Total                                                                           5n

BBCS

    (1) LDV  @b,VR2           b is loaded only once                                     m·n
    (2) LDS  @A,VR1,VR4       the number of non-zeros plus the positional information   n + n/2
    (3) LDVI VR4,VR3          the number of values of c that need to be loaded          q·n
    (4) MIPA VR1,VR2,VR3      no memory action                                          0
    (5) STVI VR3,@c,VR4       same as (3)                                               q·n
        Total                                                                           (1.5 + m + 2q)·n

To summarize, the JD scheme requires a memory bandwidth of 5n regardless of the matrix structure, while the BBCS scheme memory bandwidth requirements depend on the number of non-empty matrix rows and can be expressed as (1.5 + m + 2q)·n. To get a better indication of the implications of the previously mentioned memory bandwidth complexities, we present in Table 3 the total bandwidth for each matrix in our test set, assuming that the SMVM algorithms are executed on a VP with a section size s = 256. The last column presents the ratio between the memory bandwidth required by the JD scheme and the one required by the BBCS scheme. It can be observed in the table that, for the test matrix suite we considered, the JD scheme requires between 2.49 and 3.26 times more bandwidth than the BBCS scheme.

    Matrix Name   q       m       BBCS       JD     Ratio
    cavity16      0.073   0.033   1.606·n    5·n    3.11
    gre_185       0.184   0.184   1.868·n    5·n    2.57
    memplus       0.371   0.141   2.012·n    5·n    2.49
    fs_183_3      0.171   0.171   1.842·n    5·n    2.71
    mbeause       0.020   0.012   1.532·n    5·n    3.26
    tols90        0.052   0.052   1.604·n    5·n    3.12

Table 3: Memory Bandwidth Evaluations
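The two totals can be wrapped into small helpers (a sketch; q and m are the per-matrix fractions defined earlier, and all names are ours):

    /* Memory bandwidth totals, in words moved per SMVM, as derived above. */
    double jd_bandwidth(double n)
    {
        return 5.0 * n;                       /* independent of the matrix */
    }

    double bbcs_bandwidth(double n, double q, double m)
    {
        return (1.5 + m + 2.0 * q) * n;       /* depends on q and m        */
    }

    double jd_over_bbcs_ratio(double n, double q, double m)
    {
        return jd_bandwidth(n) / bbcs_bandwidth(n, q, m);
    }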

6 Conclusions

This paper constitutes a continuation of the research in [16, 17]. First we introduced a specialized functional unit able to compute multiple inner products. Subsequently, we considered a set of benchmark matrices and simulated the SMVM algorithm core execution on a functional unit with a level of parallelism of 4 and a latency of 10. Our simulations suggested that the BBCS scheme always performs better than the JD scheme for large section sizes. The BBCS schemes' performance always lies within a range of at most 10% away from the ideal performance and is independent of the processed matrix structure, whereas the JD scheme exhibits a much larger deviation, between 10% and 400%, from the ideal performance. Finally, we derived the BBCS scheme memory bandwidth requirement as being (1.5 + m + 2q)·n, where n is the number of non-zero matrix elements, q characterizes the number of non-empty rows in the sparse matrix, and m is the length of the dense vector expressed as a fraction of n, while the JD scheme requires a memory bandwidth of 5n regardless of the matrix structure. In particular, for the considered test matrices, the required bandwidth for the BBCS scheme is 2.49 to 3.12 times lower than the bandwidth required by the JD scheme.

References

[1] H. Amano, T. Boku, T. Kudoh, and H. Aiso. (SM)2-II: A new version of the sparse matrix solving machine. In Proceedings of the 12th Annual International Symposium on Computer Architecture, pages 100-107, Boston, Massachusetts, June 17-19, 1985. IEEE Computer Society TCA and ACM SIGARCH.

[2] E. C. Anderson and Y. Saad. Solving sparse triangular systems on parallel computers. International Journal of High Speed Computing, 1:73-96, 1989.

[3] A. J. Bik. Compiler Support for Sparse Matrix Operations. PhD thesis, Rijksuniversiteit Leiden, 1996.

[4] R. F. Boisvert, R. Pozo, K. Remington, R. Barrett, and J. J. Dongarra. The Matrix Market: A web resource for test matrix collections. In R. F. Boisvert, editor, Quality of Numerical Software, Assessment and Enhancement, pages 125-137, London, 1997. Chapman & Hall.

[5] W. Buchholz. The IBM System/370 vector architecture. IBM Systems Journal, 25(1):51-62, 1986.

[6] V. Eijkhout. LAPACK working note 50: Distributed sparse data structures for linear algebra operations. Technical Report UT-CS-92-169, Department of Computer Science, University of Tennessee, Sept. 1992.

[7] A. W. et al. The White Dwarf: A high-performance application-specific processor. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 212-222, Honolulu, Hawaii, May-June 1988. IEEE Computer Society Press.

[8] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, California, 1990.

[9] T. J. R. Hughes. The Finite Element Method. Prentice-Hall, Englewood Cliffs, NJ, 1987.

[10] R. N. Ibbett, T. M. Hopkins, and K. I. M. McKinnon. Architectural mechanisms to support sparse vector processing. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 64-71, Jerusalem, Israel, June 1989. IEEE Computer Society Press.

[11] P. M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, New York, 1981.

[12] W. Oed and O. Lange. On the effective bandwidth of interleaved memories in vector processor systems. IEEE Transactions on Computers, C-34:949-957, 1985.

[13] Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA, 1990.

[14] G. S. Sohi. High-bandwidth interleaved memories for vector processors - a simulation study. IEEE Transactions on Computers, 38(4):484-492, Apr. 1989.

[15] V. E. Taylor, A. Ranade, and D. G. Messerschmitt. SPAR: A new architecture for large finite element computations. IEEE Transactions on Computers, 44(4):531-545, April 1995.

[16] S. Vassiliadis, S. Cotofana, and P. Stathis. Vector ISA extension for sparse matrix multiplication. In EuroPar'99 Parallel Processing, Lecture Notes in Computer Science, No. 1685, pages 708-715. Springer-Verlag, 1999.

[17] S. Vassiliadis, S. Cotofana, and P. Stathis. BBCS based sparse matrix-vector multiplication: Initial evaluation. To appear in 16th IMACS World Congress on Scientific Computation, Applied Mathematics and Simulation, Aug. 2000.
