MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations*

Juan J. Navarro, Toni Juan and Tomas Lang+
Computer Architecture Department, Universitat Politecnica de Catalunya
Gran Capita s/n, Modul D6, E-08034 Barcelona (Spain), e-mail: [email protected]
+ Department of Electrical and Computer Engineering, University of California at Irvine

Abstract

Multilevel block algorithms exploit the data locality of linear algebra operations when executed on machines with several levels in the memory hierarchy. It is shown that the family we call Multilevel Orthogonal Block (MOB) algorithms is optimal and easy to design, and that the multilevel approach produces significant performance improvements. The effects of interference in the cache, of TLB misses, and of page faults are also considered. The multilevel block algorithms are evaluated analytically for an ideal memory system with M cache levels and no interferences. Moreover, experimental results for the MOB forms on some current high-performance workstations are presented.

1 Introduction

In the last decade, block algorithms have been proposed for dense linear algebra operations with the objective of exploiting data locality in architectures with a memory hierarchy [GaPS90]. These proposals are for one or two levels of the hierarchy. We can mention, for example, the LAPACK library [Ande92], the numerical codes developed to exploit the vector registers and the cache in Cedar [GaJM88], [GaPS90], and the algorithm that uses the registers and the cache on the IBM RS6000 [DoMR91]. Moreover, compiler techniques to generate these algorithms automatically are being developed [Wolf87], [CaKe92]. Although this blocking approach has produced dramatic improvements for the machines considered, it does not provide a method to extend it to more levels. This extension is very important today because the high speed of recent superscalar and superpipelined processors makes two levels of caches necessary in some cases [Dutt92], and the increase in problem sizes requires exploiting the locality at all levels of the hierarchy, including the minimization of TLB misses and page faults. In this article we consider multilevel block algorithms.

* This work was supported by the Ministry of Education and Science of Spain (CICYT TIC-880/92) and by the EEC (ESPRIT Project APPARC 6634).

We optimize the form (choose the best order of the nested loops) and the size of the blocks to provide maximum reuse of the data at all levels simultaneously. As a result of this optimization, we identify two families of forms: nonorthogonal and orthogonal. The latter, which we call Multilevel Orthogonal Block (MOB) forms,¹ produce a somewhat higher performance, and their simplicity makes them suitable for automatic generation by compilers. The MOB forms can be applied to many dense matrix operations, such as matrix multiplication, LU decomposition or QR decomposition via Givens rotations, as well as to a variety of architectures (one or several processors with several levels in the memory hierarchy). To simplify the description, in this paper we present the case of matrix multiplication on a high-performance processor with a multilevel memory hierarchy.

We consider first the case in which there are no cache interferences, which is suitable for software-managed local memories or for the theoretical case of fully associative caches with optimal replacement (Section 2). This case is studied with analytical models to evaluate and optimize the multilevel block forms. We then consider (Section 3) the effect of cache interferences in a real memory system formed by registers, an on-chip cache, and an off-chip cache. As is usually done, to reduce the interference effect some data blocks are precopied into contiguous regions of memory [LaRW91], [TeGJ93]. We also consider virtual memory systems, taking into account the effect of TLB misses and page faults. We show the behavior of the MOB forms using experimental results on two current high-performance workstations: an HP-Apollo with a PA-Risc 7100 processor and a DEC 3000 Model 500 with an Alpha processor.

1.1 Data and Computation Diagram

In this paper we deal with the matrix multiplication operation C = C + A·B, where the size of C is imax × jmax, that of A is imax × kmax, and that of B is kmax × jmax. We assume that the matrices are large in all dimensions and stored by columns. Consider, for example, the ijk form [DoGK84]:

      DO 10 I = 1, Imax
        DO 10 J = 1, Jmax
          DO 10 K = 1, Kmax
10          C(I,J) = C(I,J) + A(I,K)*B(K,J)

¹ A preliminary presentation of the MOB forms has recently been published in [NaJV93].

Figure 1 shows what we call a Data and Computation Diagram (DCD) for this form. In this diagram, the rectangular parallelepiped represents the iteration space, with the operations in the inside and the data in the faces or in planes parallel to these faces. The arrows indicate the order in which the data is accessed and the operations performed. To clarify data positions in this DCD the elements a11 , b11 , and c11 are represented in dark. From the code and the DCD it is apparent that matrix A can be reused in direction j (all iterations of loop j use the same element of A), matrix B can be reused in direction i and matrix C in direction k. We propose DCDs as a very powerful visual tool to understand and to design multilevel block algorithms.

[Figure 1: DCD for the ijk form: the iteration space with axes i (up to imax), j (up to jmax) and k (up to kmax), and matrices A, B and C on its faces.]

Form   Restriction
ijk    k_m j_m + k_m L_m + j_m L_m ≤ C_m
jik    k_m i_m + k_m + L_m ≤ C_m
ikj    j_m k_m + j_m L_m + k_m L_m ≤ C_m
kij    j_m i_m + j_m L_m + L_m ≤ C_m
jki    i_m k_m + i_m + L_m ≤ C_m
kji    i_m j_m + i_m + j_m L_m ≤ C_m

Table 1: Block-m size restrictions to fit in cache-m

There are 3!·3! = 36 different forms for the resulting 1-level block algorithm. In Figure 2 we show the DCD of one of these forms; it is denoted jki(kji), following the order of the loops beginning with the most external. A loop-nest sketch of this form is given after the figure.

[Figure 2: DCD of the 1-level block form jki(kji): the imax × jmax × kmax iteration space partitioned into blocks of size i_m × j_m × k_m.]
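As an illustration of such a 1-level block form (a sketch under assumed names and block sizes, not code from the paper), the jki(kji) form can be written as a six-loop nest in which the three outer loops traverse the blocks in the order j, k, i and the three inner loops traverse each block in the order k, j, i:

      SUBROUTINE MMJKI(C, A, B, N, JB, KB, IB)
C     Sketch of the 1-level block form jki(kji): block loops J1, K1, I1
C     (outermost first), element loops K, J, I inside each block.
C     N is the matrix order; JB, KB, IB are the (assumed) block sizes.
      INTEGER N, JB, KB, IB
      REAL C(N,N), A(N,N), B(N,N)
      INTEGER J1, K1, I1, I, J, K
      DO 10 J1 = 1, N, JB
      DO 10 K1 = 1, N, KB
      DO 10 I1 = 1, N, IB
         DO 10 K = K1, MIN(K1+KB-1, N)
         DO 10 J = J1, MIN(J1+JB-1, N)
         DO 10 I = I1, MIN(I1+IB-1, N)
10       C(I,J) = C(I,J) + A(I,K)*B(K,J)
      RETURN
      END

In this form the KB × JB block of B does not depend on i, so it can be reused across the I1 block loop, in agreement with the reuse directions discussed for the DCD above.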


[Figure 4: DCDs of block-m for the different classes of forms, panels i), ii) and iii); block dimensions f_m, s_m, t_m.]

The optimal form of block-m is obtained by considering the optimal block sizes for all six possible forms and choosing the one that produces the smallest CPF. We partition the six forms into two classes, show that all forms in a class produce the same contribution to the CPF, and that the forms of one of the classes are optimal (the orthogonal forms).

Class a) f^m = f^{m-1}. This case is illustrated in Figure 4i. The approximation (11) becomes

    (t_m)_m [ (s_m)_m + (f^{m-1})_{m-1} ] ≤ C_m

From this expression we see that the characteristic of this class is that (f^{m-1})_{m-1} appears in the expression. The optimal sizes are

    (f^{m-1})_{m-1} ≈ √( (L_m CPM_{m-1}) / (L_{m-1} CPM_m) ) · √C_m

    (s_m)_m ≈ √C_m

    (t_m)_m ≈ √C_m / ( 1 + √( (L_m CPM_{m-1}) / (L_{m-1} CPM_m) ) )

Figure 5 shows the optimized form of this class.

[Figure 5: Optimized blocks of a class a) form, with block length (f^m)_m and section sizes s_m and t_m.]

We call the direction with the larger size (f^m) the block direction, and the size in this direction the block length. The other two directions define the block section, and the sizes in these directions are called the block-section sizes. Since for the optimal blocks f^m ≠ f^{m-1}, the direction of block-(m-1) is different from the direction of block-m; that is, they have orthogonal directions. Because of this, we call these optimal forms orthogonal forms.

2.4 MOB forms

As indicated, we call the optimal forms Multilevel Orthogonal Block forms. They are constructed so that the directions of the blocks of adjacent levels are different (f^m ≠ f^{m-1} for 2 ≤ m ≤ M, and f^{M+1} ≠ f^M) and are denoted by

f^{M+1} g^{M+1} (f^M ... (f^m g^m (f^{m-1} g^{m-1} (... (f^2 g^2 (f^1 s^1 t^1) ...)

The length of block-(m-1) is equal to one of the section sizes of block-m, and the approximate optimal section sizes (since C_m ≪ C_{m+1}) are

    (f^{m-1})_{m-1} ≈ √( (CPM_{m-1}/L_{m-1}) / (CPM_{m-1}/L_{m-1} + CPM_m/L_m) · C_m )

    (f^{m-1})_m = (f^{m-1})_{m-1}

    (g^m)_m ≈ √( (CPM_m/L_m) / (CPM_{m-1}/L_{m-1} + CPM_m/L_m) · C_m )

To apply these expressions also for m = 1 we take CPM_0/L_0 = 0. Moreover, the optimal length of block-M is equal to the size of the problem in the direction of block-M. Since in a MOB form, once f^{m-1} and f^m are fixed, g^m is also fixed, it is possible to use the following reduced notation, which specifies only the block directions (in capitals):

f^{M+1} g^{M+1} F^M ... F^m ... F^3 F^2 F^1 s^1 t^1

where F^m can be i, j or k, with F^m ≠ F^{m-1}, and where (F^1 s^1 t^1) and (f^{M+1} g^{M+1} F^M) are permutations of (i, j, k). Moreover, if we only indicate the directions of the M block levels, that is,

F^M ... F^2 F^1

we refer to a set of four forms. Figure 7 shows the DCD for the set of MOB forms KJK (for three levels of blocks).

2.5 System with two levels of caches

To show the advantage of using multilevel block algorithms, we consider a system with two levels of caches and no interferences, and compare the CPF of block algorithms with a single block level and with two block levels. Table 4 shows the increase in Mflops obtained when a second level of blocks is introduced, as a function of CPF(inner) and CPM_2/L_2, for C_1 = 1Kw, C_2 = 64Kw, CPF(overh) = 0.1 and CPM_1/L_1 = 2.5.

(axpy operation). On the other hand, if forms ijk or jik are used, the inner loop does not require any store, but there are dependencies among iterations (dot-product operation).

             On-chip cache            Off-chip cache           TLB
             Size  Line  Mapping      Size   Line  Mapping     Entries  Page size  Mapping
Alpha        1Kw   4w    Direct       64Kw   4w    Direct      32       1Kw        Fully assoc.
PA-Risc      -     -     -            32Kw   4w    Direct      120      512w       Fully assoc.

Table 5: Characteristics of the Alpha and PA-Risc memory systems

Table 6 shows the maximum Mflops that can be achieved on the PA-Risc for each of the six forms of block-1. In order not to include other effects such as cache and TLB misses, we have used small (81x81) matrices of the same size as block-1, and the data has been preloaded to avoid compulsory misses. As can be seen, the best performance obtained is 40 Mflops, out of the peak of 198 Mflops achievable on this processor. In this case, the dot-product forms give lower performance because of the dependencies. Thus, the best forms of block-1 are the axpy forms.
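The two kinds of inner loops can be sketched as follows (illustrative fragments; the scalar S and the loop bounds are introduced here for the sketch and are not from the paper):

C     Inner loop of an axpy form (e.g. jki): J and K are fixed by the
C     outer loops; each iteration stores one element of C, but the
C     updates of the different C(I,J) are independent.
      DO 10 I = 1, IMAX
10       C(I,J) = C(I,J) + A(I,K)*B(K,J)

C     Inner loop of a dot-product form (e.g. ijk): I and J are fixed by
C     the outer loops; there is no store inside the loop, but the
C     accumulation into the scalar S makes the iterations dependent.
      S = C(I,J)
      DO 20 K = 1, KMAX
20       S = S + A(I,K)*B(K,J)
      C(I,J) = S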

Form   Mflops
ikj      39
kij      39
jik      27
ijk      27
jki      39
kji      39

Table 6: Mflops on the PA-Risc for an 81x81 matrix

Table 7 shows the six forms of executing block-1 using block-0. In all cases, the block-0 section size is 3x3 to avoid register spills. The size of the matrix is 81x81 (as in the previous case) so that it is a multiple of 3.

Form   Mflops
ikJ      71
kiJ      71
jiK     142
ijK     142
jkI      71
kjI      71

Table 7: Mflops on the PA-Risc using block-0 for an 81x81 matrix

Block Section        Leading Dimension
Size (k1 = i1)    1024   1025   1026   1027   1028
 51                 31     63     91     94     91
102                 27     47     58     85     99
153                 23     34     36     41     48

Table 8: Mflops for JK forms on the PA-Risc without precopy

Block Section        Leading Dimension
Size (k1 = i1)    1024   1025   1026   1027   1028
 51                 87     91     90     94     92
102                 94     99    101    101    101
153                 94    101    101    102    101

Table 9: Mflops for JK forms on the PA-Risc with precopy of submatrices of A

Block Section        Leading Dimension
Size (k1 = i1)    1024   1025   1026   1027   1028
 51                 91     90     91     92     91
102                102    102    102    102    102
153                104    104    104    104    104

Table 10: Mflops for JK forms on the PA-Risc with precopy of submatrices of A and B

3.2 Data interference in the caches

Now we comment on the problems that arise due to data interferences in direct-mapped or set-associative caches. As an example we use the set of MOB forms JK with two levels of blocks (registers and cache-1).

To improve the performance it is possible to decompose block-1 into smaller blocks (register-level blocks, or block-0) to reduce the number of loads and stores [DoMR91]. The block is implemented by complete unrolling in the s^0 and t^0 directions, since the number of registers cannot be modified dynamically. A few iterations of the loop in the longitudinal direction f^0 can also be unrolled if this increases the optimization possibilities. It is also possible to utilize compilation techniques, such as software pipelining, together with an efficient register assignment and instruction scheduling, to achieve a high-speed kernel. All this permits overlapping the Load/Store instructions with the Mul/Add instructions and in this way approaching the minimum number of cycles per floating-point operation for the particular architecture. The best form and size of block-0 depend on the architecture and on the compiler. However, in many cases the best choice is a block with direction K for block-0, since this results in no stores in the loop body and the (s^0)_0 · (t^0)_0 mult-add operations inside the internal loop are independent (although there are dependencies among iterations).
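A sketch of such a block-0 kernel with a 3x3 section and block direction K is shown below (the subroutine name, argument list and variable names are assumptions; this is not the authors' code). The 3x3 section of C is kept in scalar variables (registers), the loop over K carries no stores, and the nine multiply-adds in the body are mutually independent.

      SUBROUTINE MADD33(C, A, B, LD, I0, J0, K1, K2)
C     Block-0 kernel with a 3x3 section in (i,j) and block direction K.
C     Updates C(I0:I0+2, J0:J0+2) using A(I0:I0+2, K1:K2) and
C     B(K1:K2, J0:J0+2).  LD is the leading dimension of the arrays.
      INTEGER LD, I0, J0, K1, K2, K
      REAL C(LD,*), A(LD,*), B(LD,*)
      REAL C00, C10, C20, C01, C11, C21, C02, C12, C22
      REAL A0, A1, A2, B0, B1, B2
C     Load the 3x3 block of C into registers.
      C00 = C(I0,J0)
      C10 = C(I0+1,J0)
      C20 = C(I0+2,J0)
      C01 = C(I0,J0+1)
      C11 = C(I0+1,J0+1)
      C21 = C(I0+2,J0+1)
      C02 = C(I0,J0+2)
      C12 = C(I0+1,J0+2)
      C22 = C(I0+2,J0+2)
C     Inner loop in the K direction: 6 loads, 9 independent mult-adds,
C     and no stores inside the loop.
      DO 10 K = K1, K2
         A0 = A(I0,K)
         A1 = A(I0+1,K)
         A2 = A(I0+2,K)
         B0 = B(K,J0)
         B1 = B(K,J0+1)
         B2 = B(K,J0+2)
         C00 = C00 + A0*B0
         C10 = C10 + A1*B0
         C20 = C20 + A2*B0
         C01 = C01 + A0*B1
         C11 = C11 + A1*B1
         C21 = C21 + A2*B1
         C02 = C02 + A0*B2
         C12 = C12 + A1*B2
         C22 = C22 + A2*B2
10    CONTINUE
C     Store the 3x3 block of C back to memory.
      C(I0,J0) = C00
      C(I0+1,J0) = C10
      C(I0+2,J0) = C20
      C(I0,J0+1) = C01
      C(I0+1,J0+1) = C11
      C(I0+2,J0+1) = C21
      C(I0,J0+2) = C02
      C(I0+1,J0+2) = C12
      C(I0+2,J0+2) = C22
      RETURN
      END

Each iteration of the K loop performs 6 loads and 9 independent multiply-adds, and the 9 stores of C occur only once, after the loop.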

Precopying to avoid interferences. Of particular importance is the self-interference of the data that can be reused in the block direction; for the set of forms JK these are the i1 × k1 data of the submatrix of A (see Figure 8). The effect of this self-interference depends strongly on the matrix leading dimension, on the size of the cache, and on the size of the block section, and it can produce a low cache utilization [LaRW91], [TeGJ93]. Table 8 shows the variation of performance for several values of the leading dimension in experimental results on the PA-Risc. Since the block section of block-0 is 3x3, the sizes of block-1 are multiples of three and the sizes of the problem are multiples of the block-1 section size. To avoid these self-interferences, the corresponding submatrix is precopied into consecutive locations in memory. This precopy increases CPF(overh) by a term that is inversely proportional to the length of block-m. For example, for a matrix multiplication of 306x306 with a block-1 section of 102x102 (and length 306), this term is inversely


proportional to 306. In general, this term is small compared to the advantage obtained from the precopy. Comparing Table 9 with Table 8 indicates the improvement that the precopy of A produces for the PA-Risc.
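The precopy itself can be sketched as follows (assumed names; not the authors' code): before the loops that reuse it, the i1 × k1 submatrix of A is copied into a contiguous work array, so that its elements occupy consecutive memory locations independently of the leading dimension of A.

      SUBROUTINE COPYA(A, LDA, I1, IB, K1, KB, WORK)
C     Copy the IB x KB submatrix A(I1:I1+IB-1, K1:K1+KB-1) into the
C     contiguous work array WORK (column major, leading dimension IB).
      INTEGER LDA, I1, IB, K1, KB, I, K
      REAL A(LDA,*), WORK(IB,KB)
      DO 10 K = 1, KB
         DO 10 I = 1, IB
10       WORK(I,K) = A(I1+I-1, K1+K-1)
      RETURN
      END

The inner kernels then operate on WORK instead of A; since the copy is done once per block and the block is reused many times, the added term in CPF(overh) decreases with the block length, as noted above.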

[Figure 8: Blocks for the JK forms: the i0 × j0 section of block-0 and the i1 × k1 submatrix of A reused in the block direction.]

the forms IK (Table 11) when no precopy is performed. The catastrophic effect is only produced for block section size 153, since this is larger than the number of TLB entries. It is eliminated by precopying A (see Tables 9 and 13 for forms JK and IK, respectively). Note that for form IK the precopy that is sufficient to avoid self-interferences does not eliminate the TLB misses (see Table 12).

Block Section        Leading Dimension
Size (k1 = i1)    1024   1025   1026   1027   1028
 51                 22     57     81     83     84
102                 20     45     55     75     88
153                 18     34     36     40     45

Table 11: Mflops for IK forms on the PA-Risc without precopy

Block Section        Leading Dimension
Size (k1 = i1)    1024   1025   1026   1027   1028
 51                 30     87     92     94     94
102                 27     90     94     95     94
153                 23     54     55     53     53

Table 12: Mflops for IK forms on the PA-Risc with precopy of B

Block Section        Leading Dimension
Size (k1 = i1)    1024   1025   1026   1027   1028
 51                 75     87     87     90     90
102                 90     91     91     92     91
153                 90     88     88     87     87

Table 13: Mflops for IK forms on the PA-Risc with precopy of A and B

Note the unusually low value obtained for the 1024 column in Tables 11 and 12. This is due to the high cross-interference between submatrices A and C; it is eliminated when A is copied, as seen in Table 13.

3.4 Effect of page faults

Up to now we have only considered blocks at the register and cache levels, and precopies to reduce cache interferences and TLB misses. However, blocking can also be used to reduce the page faults. This was considered important in the beginnings of paged memory systems [McCo69]. Although today's high-performance processors have large main memories, to achieve a good throughput in multiuser systems it is necessary to reduce the amount of main memory used by individual processes. If the memory allocated to the process is not sufficient for the whole problem, page faults occur. For a multilevel memory hierarchy, the page faults have an effect on the elapsed time as well as on the CPU time. The latter is produced because the processes that are executed during the service of the page fault will eliminate from the cache at least part of the data being reused. The page faults can be avoided by an additional block level. Consider, for example, the behavior of the forms JK (with blocks-0 and blocks-1). For large problems with jmax greater than the number of physical pages allocated to the process, a page fault is produced each time the i1 × k1 elements of A are reused (that is, each time a new column of B is accessed). If, because of the context switch, the contents of the cache are replaced, the effect is as if no block-1 were used. This can be seen in Figure 10, where we compare the Mflops obtained by measuring the DEC system with different block algorithms and limiting the working-set quota of the process to 1000 pages (using the VMS operating system). We compare the form jik without blocks, the form K with one level of blocks for the registers (section size 3x3), the form JK with two levels of blocks (registers and cache-1), and the form KJK with three levels of blocks (two levels for the registers and cache-1, and one level for cache-2 and main memory). As can be seen, the performance of form JK drops dramatically for problems larger than 600 and approaches the performance of the algorithm without cache-1 blocks. This effect is produced by page faults that prevent the use of the cache-1 blocks. This performance degradation is not produced for the KJK form because the page faults have now been reduced. In addition, the new block level improves the utilization of the second cache level, although the improvement is small (about 3 Mflops for matrix sizes greater than 300).

[Figure 10 legend: 3-level MOB (KJK form); 2-level MOB (JK form); 1-level block algorithm (K form); jik form. Axes: MFLOPS (0 to 80) versus matrix order (0 to 1000).]

Figure 10: Mflops for four different matrix multiplication algorithms on the Alpha.

4 Conclusions

We have studied multilevel block algorithms to exploit the data locality of linear algebra operations in machines with several levels in the memory hierarchy. We use a visual representation of data and computations that we call the DCD. This is a very powerful tool to understand and design multilevel block algorithms. In the paper we propose a sequential procedure to optimize the form and the size of the blocks of multilevel block algorithms. The optimization of a generic multilevel form has produced two families of forms, of which the one we call MOB provides the better performance. Moreover, the simplicity of these MOB forms makes them suitable for automatic implementation by compilers. The resulting throughput of the MOB forms is high because of the maximum reuse of the data at all levels of the hierarchy. This is in contrast with other approaches [Chen91]

in which the sizes of the blocks are a compromise between the sizes of the memories (registers, caches and TLB) at two levels of the memory hierarchy, so that increasing the reuse at one level can result in its reduction at the other. Nevertheless, the algorithm proposed in [Chen91] obtains very good efficiencies on the IBM RS6000, for which it was designed. There is another kind of block algorithm that reduces CPF(mem) by increasing the number of floating-point operations. In order to choose the block size, in [GaJM88] they solve the tradeoff between CPF(mem) and the number of

floating-point operations with a double-level blocking. Nevertheless, the block algorithms designed in our paper do not increase the number of floating-point operations. Other conclusions are:

- The use of multilevel blocking produces significant performance improvements, as compared to the use of one-level blocking.

- For today's processor characteristics the performance is not very sensitive to the sizes of the blocks, as long as the block fits in the cache and is not too small. However, this would change for higher-performance processors that can perform on the order of eight floating-point operations per cycle.

- Blocking at the register level is performed to reduce the number of loads/stores and to overlap them with the Mul/Add operations. The best form at this level depends on the architecture and the compiler.

- The interference in the cache can be effectively reduced by precopying the submatrices to be reused to consecutive locations in memory for each level of blocks.

- The TLB misses are significant and are produced by some forms when the TLB size is smaller than √C1. In these cases, the misses can also be reduced by precopying some submatrices.

- Page faults occur when the number of main memory pages available to the process is such that the problem does not fit in these pages. In this case, in addition to the increase in elapsed time, a significant degradation in throughput is observed because the processes that execute during the service of the page fault expel the blocks from the caches. These page faults can be eliminated by the introduction of an additional block level.

Acknowledgments

We would like to thank Javier Gallardo for his useful contribution to this work and Larry Carter for his interesting comments on this paper.

References

[Ande92] E. Anderson et al., LAPACK User's Guide, Philadelphia, PA: SIAM, 1992.

[Aspr93] T. Asprey et al., Performance Features of the PA7100 Microprocessor. IEEE Micro, June 1993, pp. 22-35.

[CaKe92] S. Carr and K. Kennedy, Compiler Blockability of Numerical Algorithms. Proc. of the Supercomputing'92 Conference, 1992, pp. 114-124.

[Chen91] D. Chen, Hierarchical Blocking and Data Flow Analysis for Numerical Linear Algebra, ACM Int. Conf. on Supercomputing, 1991, pp. 12-19.

[DoMR91] J. J. Dongarra, P. Mayes and G. Radicati, The IBM RISC System/6000 and Linear Algebra Operations. Supercomputer, July 1991, pp. 15-30.

[DoGK84] J. Dongarra, F. Gustavson and A. Karp, Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine. SIAM Rev., 26 (1984), pp. 91-112.

[Dutt92] T. A. Dutton et al., The Design of the DEC 3000 AXP Systems, Two High-Performance Workstations, Digital Technical Journal, Vol. 4, No. 4, 1992, pp. 66-81.

[GaJM88] K. Gallivan, W. Jalby, U. Meier, and A. Sameh, Impact of Hierarchical Memory Systems on Linear Algebra Algorithm Design. Intl. J. Supercomputer Appl., 2 (1988), pp. 12-48.

[GaPS90] K. A. Gallivan, R. J. Plemmons and A. H. Sameh, Parallel Algorithms for Dense Linear Algebra Computations, in Parallel Algorithms for Matrix Computations by K. A. Gallivan et al., SIAM, 1990, pp. 1-82.

[HePa91] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc., 1990.

[JaMe86] W. Jalby and U. Meier, Optimizing Matrix Operations on a Parallel Multiprocessor with a Hierarchical Memory System, in Proc. Intl. Conf. on Parallel Processing, IEEE Computer Society Press, New York, 1986, pp. 429-432.

[LaRW91] M. S. Lam, E. E. Rothberg and M. E. Wolf, The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS 1991, pp. 67-74.

[McCo69] A. C. McKellar and E. G. Coffman, Jr., Organizing Matrices and Matrix Operations for Paged Memory Systems, Communications of the ACM, 12.3, 1969, pp. 153-165.

[NaJV93] J. J. Navarro, A. Juan, M. Valero, J. M. Llaberia and T. Lang, Multilevel Orthogonal Blocking for Dense Linear Algebra Computations, IEEE Computer Society TC on Computer Architecture Newsletter, Fall 1993, pp. 10-14.

[OeGr90] R. R. Oehler and R. D. Groves, IBM RISC System/6000 Processor Architecture. IBM Journal of Research and Development, Vol. 34, No. 1, January 1990, pp. 23-36.

[TeGJ93] O. Temam, E. D. Granston, W. Jalby, To Copy or Not to Copy: A Compile-Time Technique for Assessing When Data Copying Should Be Used to Eliminate Cache Conflicts, in Supercomputing'93, pp. 410-419.

[Wolf87] M. Wolfe, Iteration Space Tiling for Memory Hierarchies, Proc. of the Third SIAM Conference on Parallel Processing for Scientific Computing, Dec. 1987, pp. 357-361.
