2010 12th IEEE International Conference on High Performance Computing and Communications

Effortless and Efficient Distributed Data-partitioning in Linear Algebra

Carlos de Blas Cartón, Arturo Gonzalez-Escribano, Diego R. Llanos
Dpto. Informática, University of Valladolid, Spain
Email: [email protected]

Abstract—This paper introduces a new technique to exploit compositions of different data-layout techniques with Hitmap, a library for hierarchical-tiling and automatic mapping of arrays. We show how Hitmap is used to implement block-cyclic layouts for a parallel LU decomposition algorithm. The paper compares the well-known ScaLAPACK implementation of LU, as well as other carefully optimized MPI versions, with a Hitmap implementation. The comparison is made in terms of both performance and code length. Our results show that the Hitmap version outperforms the ScaLAPACK implementation and is almost as efficient as our best manual MPI implementation. The insertion of this composition technique in the automatic data-layouts of Hitmap allows the programmer to develop parallel programs with both a significant reduction of the development effort and a negligible loss of efficiency.

Index Terms—Automatic data partition; layouts; distributed systems.

I. INTRODUCTION

Data distribution and layout is a key part of any parallel algorithm in distributed environments. To improve performance and scalability, many parallel algorithms need to compose different data partition and layout techniques at different levels. A typical example is the block-cyclic partition, useful in many linear algebra problems, such as the LU decomposition. Block-cyclic can be seen as a composition of two basic layouts: a blocking layout (divide the data into same-sized blocks) and a cyclic layout (distribute the blocks cyclically to processes). LU decomposition is a matrix factorization that transforms a matrix into the product of a lower triangular matrix and an upper triangular matrix. Applications of this decomposition include solving linear equations, matrix inverse calculation, and determinant computation.

While many languages offer a predefined set of one-level, data-parallel partition techniques (e.g. HPF [1], OpenMP [2], UPC [3]), it is not straightforward to use them to create a multiple-level distributed layout that adapts automatically to different grain levels. On the other hand, message-passing interfaces for distributed environments (such as MPI [4]) allow programs to be carefully tuned for a given platform, but at the cost of a high development effort [5], [6]. The reason is that these interfaces only provide basic, simple tools to calculate the data distribution and communication details.

In this paper we present a technique to introduce compositions of different layout techniques at different levels in Hitmap, a library for hierarchical-tiling and mapping of arrays. The Hitmap library is a basic tool in the back-end of the Trasgo compilation system [7]. It is designed as a supporting library for parallel compilation techniques, offering a collection of functionalities for management, mapping, and communication of tiles. Hitmap has an extensibility philosophy similar to the Fortress framework [10], and it can be used to build higher levels of abstraction. Other tiling-array libraries, offering abstract data types for distributed tiles (such as HTA [8]), could be reprogrammed and extended using Hitmap. The tile decomposition in Hitmap is based on index-domain partitions, in a similar way as proposed in the Chapel language [9].

In Hitmap, data-layout and load-balancing techniques are independent modules that belong to a plug-in system. The techniques are invoked from the code and applied at run-time when needed, using internal information about the target system topology to distribute the data. The programmer does not need to reason in terms of the number of physical processors. Instead, the programmer uses highly abstract communication patterns for the distributed tiles at any grain level. Thus, coding and debugging operations with entire data structures is easy. We show how the use of layout compositions at different levels, together with hierarchical Hitmap tiles, can also automatically benefit the sequential part of the code, improving data locality and exploiting the memory hierarchy for better performance. This composition technique allows Hitmap to be used to program highly efficient distributed computations with low development cost.

We use the LU decomposition as a case study, focusing on the ScaLAPACK implementation as reference. ScaLAPACK is a software library of high-performance linear algebra routines for distributed-memory, message-passing clusters, built on top of a message-passing interface (such as MPI or PVM). ScaLAPACK is a parallel version of the LAPACK project, which in turn is the successor of the LINPACK benchmark to measure supercomputer performance. Currently, there exists another parallel evolution of LINPACK named HPL (High Performance LINPACK). HPL is a benchmark to measure performance in modern supercomputers, and the fastest 500 machines are published on the top500.org website. Thus, the algorithms and implementations used in ScaLAPACK are a key reference to measure performance in supercomputing environments. Our experimental results show that the Hitmap implementation of LU offers better performance than the ScaLAPACK version, and almost the same performance as a carefully-tuned, manual MPI implementation, with a significantly lower development effort.


\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
=
\begin{pmatrix}
1      & 0      & \cdots & 0 \\
l_{21} & 1      & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
l_{n1} & l_{n2} & \cdots & 1
\end{pmatrix}
\times
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1n} \\
0      & u_{22} & \cdots & u_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0      & 0      & \cdots & u_{nn}
\end{pmatrix}
\]

Figure 1: LU decomposition.

Algorithm 1 LU right-looking decomposition.
Require: A, an n × n matrix
Ensure: L and U, n × n matrices overwriting A, without storing the zeros of either matrix or the ones in the diagonal of L
1:  for k = 0 to n − 1 do
2:    {elements of matrix L}
3:    for i = k + 1 to n − 1 do
4:      a_ik ← a_ik / a_kk
5:    end for
6:    {update rest of elements}
7:    for j = k + 1 to n − 1 do
8:      for i = k + 1 to n − 1 do
9:        a_ij ← a_ij − a_kj · a_ik
10:     end for
11:   end for
12: end for

II. A STUDY CASE: LU ALGORITHM

A. LU decomposition

LU decomposition [11], [12] writes a matrix A as a product of two matrices: a lower triangular matrix L with 1's in the diagonal and an upper triangular matrix U (see Fig. 1). LU decomposition is used to solve systems of linear equations or to calculate determinants. The advantage of solving systems with this method is that, once the matrix is factorized, it can be reused to compute solutions for different right-hand sides.

As shown in Fig. 2, there are three different forms of designing the LU matrix factorization: right-looking, left-looking, and Crout LU. There exist iterative and recursive versions of the associated algorithms. In this paper we focus on the iterative, right-looking version, since it provides better load balancing, as we will see in Sect. IV. Figure 2 shows the state associated with a particular iteration of the corresponding algorithms. Striped portions refer to elements that get their final value at the end of the iteration. Grey areas refer to elements read in left-looking and Crout, and to elements updated in right-looking. These elements will be modified in subsequent iterations.

Algorithm 1 shows the right-looking LU decomposition. As can be seen, the algorithm may be implemented with three loops. It is worth noting that the division in line 4 may cause precision errors if the divisor is close to 0 when using finite precision. Thus, some implementations perform row swapping to avoid small divisors: in each iteration the algorithm searches for the largest a_ik with i > k and swaps the i-th and k-th rows of the matrix and the right-hand side vector. After solving the system, it is necessary to undo the row swapping in the solution vector.
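For concreteness, the following is a minimal sequential C sketch of Algorithm 1 (right-looking, without row swapping). The function name, the row-major storage convention, and the small test matrix are our own illustrative choices, not code from any library discussed in this paper.

#include <stdio.h>

/* In-place right-looking LU (Algorithm 1), no pivoting.
 * a is an n x n matrix in row-major order; on exit the strictly lower part
 * holds L (unit diagonal implicit) and the upper part holds U. */
static void lu_right_looking(double *a, int n) {
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++)          /* elements of matrix L (line 4) */
            a[i * n + k] /= a[k * n + k];
        for (int j = k + 1; j < n; j++)          /* update rest of elements (line 9) */
            for (int i = k + 1; i < n; i++)
                a[i * n + j] -= a[k * n + j] * a[i * n + k];
    }
}

int main(void) {
    double a[9] = { 4, 3, 2,
                    8, 7, 9,
                    4, 6, 5 };
    lu_right_looking(a, 3);
    for (int i = 0; i < 3; i++)                  /* print packed L\U factors */
        printf("%6.2f %6.2f %6.2f\n", a[i * 3], a[i * 3 + 1], a[i * 3 + 2]);
    return 0;
}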

B. Solution of a system of linear equations

After the decomposition, it is possible to solve the system Ax = b ≡ LUx = b in two steps. Let Ux = y. First we solve Ly = b using forward substitution, and then we solve Ux = y using backward substitution.

Definition. A system Ax = b with A lower triangular is solved using forward substitution as follows:

\[
x_0 = \frac{b_0}{a_{00}}, \qquad
x_i = \frac{b_i - \sum_{j=0}^{i-1} a_{ij} x_j}{a_{ii}} \quad \text{for } i = 1, \ldots, n-1
\]

Definition. A system Ax = b with A upper triangular is solved using backward substitution as follows:

\[
x_{n-1} = \frac{b_{n-1}}{a_{n-1,n-1}}, \qquad
x_i = \frac{b_i - \sum_{j=i+1}^{n-1} a_{ij} x_j}{a_{ii}} \quad \text{for } i = n-2, \ldots, 0
\]
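Both substitutions translate directly into C. The following sketch uses the same row-major storage as the LU sketch above; the function names are ours. In the LU use case the lower triangular factor has a unit diagonal, so the division in the forward step is by 1.

/* Forward substitution: solve A*x = b with A lower triangular (row-major). */
static void forward_subst(const double *a, const double *b, double *x, int n) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= a[i * n + j] * x[j];
        x[i] = s / a[i * n + i];
    }
}

/* Backward substitution: solve A*x = b with A upper triangular (row-major). */
static void backward_subst(const double *a, const double *b, double *x, int n) {
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++)
            s -= a[i * n + j] * x[j];
        x[i] = s / a[i * n + i];
    }
}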

The cost of the decomposition is O(n^3) and the cost of the two substitutions is O(n^2). Summing up, the cost of the whole algorithm is O(n^3) and the cost of the solution is not significant.

Figure 2: Different versions of LU decomposition (right-looking, left-looking, and Crout). Activity at a middle iteration.


Figure 3: Examples of one-dimensional data distributions: (a) block distribution, (b) cyclic distribution, (c) block-cyclic distribution.

Thus, LU decomposition is especially useful when solving systems with several right-hand sides. The solving algorithm can be implemented traversing the matrix by rows or by columns, with the same asymptotic cost.
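As an illustration of this reuse, a factorization computed once can serve any number of right-hand sides. The sketch below combines the hypothetical helpers introduced earlier (lu_right_looking and backward_subst); it is not part of any library discussed in this paper.

/* Unit-diagonal forward substitution on the packed LU storage: the diagonal
 * of L is implicitly 1 (the stored diagonal belongs to U). */
static void forward_subst_unit(const double *lu, const double *b, double *y, int n) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++)
            s -= lu[i * n + j] * y[j];
        y[i] = s;                      /* no division: l_ii = 1 */
    }
}

/* Solve A*x = b for nrhs right-hand sides, factorizing A only once. */
static void lu_solve_many(double *a, int n, const double *b, double *x, int nrhs) {
    lu_right_looking(a, n);                      /* O(n^3), done once */
    double y[n];                                 /* C99 variable-length array */
    for (int r = 0; r < nrhs; r++) {             /* O(n^2) per right-hand side */
        forward_subst_unit(a, b + r * n, y, n);  /* L*y = b */
        backward_subst(a, y, x + r * n, n);      /* U*x = y */
    }
}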


Figure 4: Steps in a blocked LU parallel algorithm.

III. BLOCK-CYCLIC LAYOUT

Parallel implementations of algorithms need to distribute the data across different processes. The way the data is distributed is known as the layout. Each layout provides good balance for a different family of algorithms, so it is very important to choose the most suitable one. For our study case, the most appropriate is the block-cyclic layout. This layout can be seen as the composition of the following layouts. Let N be the number of elements and P the number of processes.

1) Block layout (Fig. 3a). Each process receives a block of ⌈N/P⌉ contiguous elements.
2) Cyclic layout (Fig. 3b). The data is distributed cyclically among the processes, so element i is associated with process i mod P.

The block-cyclic layout [13] is built in two stages: blocking and cyclic distribution (see Fig. 3c). First, a blocking is made, but dividing the data into portions of the same size S, not necessarily equal to ⌈N/P⌉. Then, the blocks are distributed cyclically to the processes. Thus, element i is associated with process ⌊i/S⌋ mod P, as sketched below. These three layouts can be easily extended to process arrangements in n dimensions.
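The owner computations of the three one-dimensional layouts can be written in a few lines of C. The function names below are our own illustration, not Hitmap routines.

/* Process that owns element i under each one-dimensional layout.
 * N: number of elements, P: number of processes, S: block size. */
static int owner_block(int i, int N, int P)        { return i / ((N + P - 1) / P); } /* blocks of ceil(N/P) */
static int owner_cyclic(int i, int P)              { return i % P; }
static int owner_block_cyclic(int i, int S, int P) { return (i / S) % P; }           /* floor(i/S) mod P */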

IV. RIGHT-LOOKING LU IN DISTRIBUTED ENVIRONMENTS

Among the three LU decomposition algorithms described above, the right-looking LU is the one that benefits the most from load-balancing techniques when implemented in distributed environments. For this algorithm, the block-cyclic layout is the most appropriate load-balancing technique. The cyclic distribution of the blocks ensures that all processes have roughly the same workload, because the trailing matrix to be updated is distributed among the processes. In addition, distributing blocks instead of single elements improves data locality [14], [15]. The left-looking and Crout versions are not so useful in distributed environments, since it is much harder to find a suitable data distribution.

In the left-looking LU algorithm, since only one column per iteration is updated, only the column of processes which owns that column performs computations, whereas the rest only communicate. Something similar happens with Crout. Besides, right-looking implies less communication than the other versions, because only a column and a row are needed per iteration to update the trailing matrix, whereas in the other versions almost the whole matrix is needed locally.

The right-looking LU algorithm is usually computed on two-dimensional process arrangements, also called grids of processes. Algorithm 2 shows an outline of a block-cyclic, right-looking, parallel LU decomposition algorithm. Functions minedim1(i) and minedim2(i) in that algorithm return TRUE if block index i is assigned to the local process in the first or second dimension of the grid, respectively. The algorithm iterates by blocks, and each iteration is divided into the four steps shown in Fig. 4, where dark grey portions refer to elements updated and light grey ones to elements read on each step:

1) Factorize the diagonal block (Fig. 4.1).
2) Update the blocks in the same column as the diagonal block in matrix L, using the diagonal block (Fig. 4.2).
3) Update the blocks in the same row as the diagonal block in matrix U, using the diagonal block (Fig. 4.3).
4) Update the trailing matrix using the blocks previously updated in steps 2 and 3 (Fig. 4.4).

A sequential sketch of these four block-level steps is shown right below.
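The following sequential C sketch spells out the four block-level steps over a matrix already stored as an NB × NB array of S × S blocks, without pivoting. The helper names and block sizes are our own simplifications for illustration; they are not the routines used by Hitmap or ScaLAPACK.

#define NB 8                 /* blocks per dimension (illustrative) */
#define S  64                /* block size (illustrative) */

typedef double Block[S][S];

/* Step 1: in-place LU of the diagonal block (L unit lower, U upper). */
static void factorize_block(Block d) {
    for (int k = 0; k < S; k++) {
        for (int i = k + 1; i < S; i++)
            d[i][k] /= d[k][k];
        for (int i = k + 1; i < S; i++)
            for (int j = k + 1; j < S; j++)
                d[i][j] -= d[i][k] * d[k][j];
    }
}

/* Step 2: L-column block, B <- B * U^{-1}, with U the upper part of d. */
static void update_l_block(Block d, Block b) {
    for (int j = 0; j < S; j++)
        for (int r = 0; r < S; r++) {
            double s = b[r][j];
            for (int c = 0; c < j; c++)
                s -= b[r][c] * d[c][j];
            b[r][j] = s / d[j][j];
        }
}

/* Step 3: U-row block, B <- L^{-1} * B, with L the unit lower part of d. */
static void update_u_block(Block d, Block b) {
    for (int r = 0; r < S; r++)
        for (int j = 0; j < S; j++) {
            double s = b[r][j];
            for (int c = 0; c < r; c++)
                s -= d[r][c] * b[c][j];
            b[r][j] = s;               /* unit diagonal: no division */
        }
}

/* Step 4: trailing block, C <- C - Lb * Ub. */
static void update_trailing(Block lb, Block ub, Block c) {
    for (int i = 0; i < S; i++)
        for (int k = 0; k < S; k++)
            for (int j = 0; j < S; j++)
                c[i][j] -= lb[i][k] * ub[k][j];
}

/* Blocked right-looking LU over the whole array of blocks (Fig. 4). */
static void blocked_lu(Block a[NB][NB]) {
    for (int k = 0; k < NB; k++) {
        factorize_block(a[k][k]);                               /* step 1 */
        for (int i = k + 1; i < NB; i++)
            update_l_block(a[k][k], a[i][k]);                   /* step 2 */
        for (int j = k + 1; j < NB; j++)
            update_u_block(a[k][k], a[k][j]);                   /* step 3 */
        for (int i = k + 1; i < NB; i++)
            for (int j = k + 1; j < NB; j++)
                update_trailing(a[i][k], a[k][j], a[i][j]);     /* step 4 */
    }
}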

Steps 2 to 4 may imply block communication among processes. Since all the updated blocks will be needed by other processes in the same row or column, most communications are multicasts. Figure 5 shows how the processes are arranged into an R × C grid. In a given iteration, each process in the grid performs exactly one of the following four tasks (see Alg. 2).


Algorithm 2 LU parallel algorithm.
Input: n × n matrix A and n × 1 vector b.
Output: n × n matrices L and U overwriting A and stored together, ruling out the zeros and the ones of the diagonal of L, and n × 1 vector x overwriting b. The matrices are partitioned in S × S blocks and the processes form an R × C grid.
1:  for k = 0 to ⌈N/S⌉ − 1 do {Iterating by blocks}
2:    if minedim1(k) ∧ minedim2(k) then
3:      {TASK TYPE 1: Process with the diagonal block}
4:      Factorize diagonal block A[k, k]
5:      Update blocks A[i, k] with i > k with block A[k, k]
6:      Update blocks A[k, j] with j > k with block A[k, k]
7:      Update blocks of submatrix
8:    else if minedim1(k) then
9:      {TASK TYPE 2: Process in the same column of the grid as the diagonal block process}
10:     Update blocks A[i, k] with i > k with block A[k, k]
11:     Update blocks of submatrix
12:   else if minedim2(k) then
13:     {TASK TYPE 3: Process in the same row of the grid as the diagonal block process}
14:     Update blocks A[k, j] with j > k with block A[k, k]
15:     Update blocks of submatrix
16:   else
17:     {TASK TYPE 4: Process neither in the same row nor in the same column of the grid as the diagonal block process}
18:     Update blocks of submatrix
19:   end if
20: end for

1) There is one process that owns the diagonal block. This process executes the four steps and communicates in steps 2, 3, and 4 (process P[1,2] in Fig. 5).
2) There are R − 1 processes that own blocks in the same column as the diagonal block. These processes execute steps 2 and 4. In step 4 they communicate the blocks updated in step 2, and receive the updated blocks that belong to the same row as the diagonal block (processes P[0,2], P[2,2], and P[3,2] in Fig. 5).
3) There are C − 1 processes that own blocks in the same row as the diagonal block. These processes execute steps 3 and 4. In step 4 they communicate the blocks updated in step 3, and receive the updated blocks that belong to the same column as the diagonal block (processes P[1,0], P[1,1], P[1,3], and P[1,4] in Fig. 5).
4) The remaining (R · C − R − C + 1) processes execute only step 4, receiving the blocks updated in steps 2 and 3 from one process in the same row and one process in the same column.

This classification of a process into one of the four tasks is sketched right below.
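The task performed by a process follows directly from comparing its grid coordinates with those of the process owning the current diagonal block. The sketch below uses hypothetical names, equivalent in spirit to the minedim1/minedim2 tests of Alg. 2, and is not Hitmap code.

enum TaskType { TASK_DIAGONAL, TASK_SAME_COLUMN, TASK_SAME_ROW, TASK_TRAILING_ONLY };

/* Classify the local process (myRow, myCol) for the iteration whose diagonal
 * block is owned by the process at (diagRow, diagCol) in the R x C grid. */
static enum TaskType task_type(int myRow, int myCol, int diagRow, int diagCol) {
    if (myRow == diagRow && myCol == diagCol) return TASK_DIAGONAL;      /* task 1 */
    if (myCol == diagCol)                     return TASK_SAME_COLUMN;   /* task 2 */
    if (myRow == diagRow)                     return TASK_SAME_ROW;      /* task 3 */
    return TASK_TRAILING_ONLY;                                           /* task 4 */
}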

Figure 5: Example of a 4 × 5 grid of processes where process P[1,2] owns the diagonal block.

A. An example

In Fig. 6a we can see an example of an n × n matrix, divided into 7 × 7 blocks and distributed to a 2 × 2 grid of processes. In this example, the computation is at the start of the second iteration (the blocks in column 0 and row 0 are already updated). Figure 6b shows the communications needed at this point. The diagonal block is B[1,1] and belongs to process P[1,1]. Process P[1,1] factorizes this block and sends it to processes P[1,0] and P[0,1].

Processes P[1,1] and P[1,0] update the blocks in the same row as the diagonal block in matrix U (blocks B[1,2], B[1,3], ..., B[1,6]), and processes P[1,1] and P[0,1] update the blocks in the same column as the diagonal block in matrix L (blocks B[2,1], B[3,1], ..., B[6,1]). Then, all processes update the blocks of the submatrix with block indexes B[2..6,2..6]. To do so, processes P[1,0] and P[0,1] need blocks from P[1,1], and process P[0,0] needs blocks from P[0,1] and P[1,0]. Recall the update expression a_ij = a_ij − a_kj · a_ik in line 9 of Alg. 1: a process only needs blocks from other processes in the same row or column of the grid. Thus, communications are limited to rows or columns. The right-hand side vector has only one dimension and is mapped to a row or column of processes. The solution of the system using the LU factorization is straightforward and needs no further explanation.

Figure 6: Example of an n × n matrix divided into 7 × 7 blocks and distributed in a two-dimensional block-cyclic layout over a 2 × 2 grid of processes: (a) distribution of blocks; (b) communications needed at iteration 2.

V. TWO IMPLEMENTATION APPROACHES

In this section we describe two different approaches to implement the LU decomposition in distributed systems. The first one uses a message-passing library directly; the well-known ScaLAPACK implementation is a good representative of this approach. Instead of managing the data partition and MPI communications directly, a different approach is to use a library with a higher level of abstraction. We present an LU implementation using Hitmap, a library to efficiently map and communicate hierarchical tile arrays. Hitmap is used in the back-ends of the Trasgo compilation system [7]. After reviewing both approaches, we compare them in terms of performance and development effort.

A. ScaLAPACK

ScaLAPACK [16], [17] is a parallel library of high-performance linear algebra routines for message-passing systems. These routines are written in Fortran using the BLACS communications library. BLACS is defined as an interface, and has different implementations depending on the communications library used, such as PVM, MPI, MPL, and NX. ScaLAPACK is an evolution of the previous sequential library, LAPACK, designed for shared-memory and vector supercomputers. Their goals are efficiency, scalability, reliability, portability, flexibility, and ease of use.

One of the routines from ScaLAPACK, called pdgesv, computes the solution of a real system of linear equations Ax = b. This routine first computes a decomposition of an n × n matrix A into a product of two triangular matrices L · U, where L is a lower triangular matrix with ones in the diagonal and U is an upper triangular one. This decomposition then makes easier the later resolution of the system Ax = b, where x is the n × 1 solution vector of the system. ScaLAPACK implements the non-recursive right-looking version of the LU algorithm, since it is the best option for block-cyclic layouts, as explained before. The matrices are distributed using a two-dimensional block-cyclic layout, and the local parts of the matrices after distribution are stored contiguously in memory.

B. LU with the Hitmap library

Hitmap is a library to support hierarchical tiling, mapping, and manipulation [7]. It is designed to be used with an SPMD parallel programming language, allowing the creation, manipulation, distribution, and efficient communication of tiles and tile hierarchies. The focus of Hitmap is to provide an abstract layer for tiling in parallel computations. Thus, it may be used directly, or to build higher-level constructions for tiling in parallel languages. Although Hitmap is designed with an object-oriented approach, it is implemented in the C language. It defines several object classes implemented with C structures and functions.

A Signature, represented by a HitSig object, is a tuple of three numbers representing a selection of array indexes in a one-dimensional domain: the starting index, the ending index, and a stride to specify a non-contiguous but regular subset of indexes. A Shape (HitShape object) is a set of n signatures representing a selection of indexes in a multidimensional domain. Finally, a Tile, represented by a HitTile object, is an array whose indexes are defined by a shape. Tiles may be declared directly, or as a selection of another tile, whose indexes are defined by a new shape. Thus, they may be created hierarchically or recursively. Tiles have two coordinate systems to access their elements: (1) tile coordinates, and (2) array coordinates. Tile coordinates always start at index 0 and have no stride. Array coordinates associate elements with their original domain indexes, as defined by the shape of the root tile. In Hitmap a tile may be defined with or without allocated memory. This allows the programmer to declare and partition arrays without memory, finally allocating only the parts mapped to a given processor.

Hitmap also provides a HitTopology object class to encapsulate simple functions to create virtual topologies from hidden physical topology information, directly obtained by the library from the lower implementation layer. HitLayout objects store the results of the partition or distribution of a shape domain over a topology of virtual processors. They are created by calling one of the layout plug-in modules included in the library. There are some predefined layout modules, but it is easy to create new ones. The resulting layout objects have information about the local part of the input domain, neighbor relationships, and methods to get information about the part of any other virtual process. The layout modules encapsulate mapping decisions for load balancing and neighborhood information for either data or task parallelism. The information in the layout objects may be exploited by several abstract communication functionalities which build HitComm objects. These objects may be composed in reusable patterns represented by HitPattern objects. The library is built on top of the MPI communication library, internally exploiting several MPI techniques for efficient buffering, marshalling, and communication.


1  HitTopology topo = hit_topology( plug_topArray2DComplete );
2  int numberBlocks = ceil_int( matrixSize / blockSize );
3  HitTile_HitTile matrixTile, matrixSelection;
4  hit_tileDomainHier( & matrixTile, sizeof( HitTile ), 2, numberBlocks, numberBlocks );
5  HitLayout matrixLayout = hit_layout( plug_layCyclic, topo, hit_tileShape( matrixTile ) );
6  hit_tileSelect( & matrixSelection, & matrixTile, hit_layShape( matrixLayout ) );
7  hit_tileAlloc( & matrixSelection );
8  HitTile_double block;
9  hit_tileDomainAlloc( & block, sizeof( double ), 2, blockSize, blockSize );
10 hit_tileFill( & matrixSelection, & block );
11 hit_tileFree( block );
12 hit_comAllowDims( matrixLayout );

Figure 7: Hitmap code for data partitioning.

Block-cyclic layout in the Hitmap library

In this section we show how to use the Hitmap tools to implement a block-cyclic layout. We use a two-level tile hierarchy and a cyclic Hitmap layout to distribute the upper-level blocks among processors. Hitmap uses HitTile objects to represent multidimensional arrays, or array subselections. HitTile is an abstract type which is specialized with the base type of the desired array elements. Thus, HitTile_double is the type for tiles of double elements. In this work, we include a new derived type in the library: HitTile_HitTile. The base elements of tiles of this type are also tiles of any derived type. Thus, heterogeneous hierarchies of tile types are easily created using HitTile_HitTile objects on the branches, and other specific derived types on the leaves of the structure.

In Fig. 9 we show the internal structure of a two-level hierarchy of tiles. The HitTile_HitTile variable is internally a C structure that records the information needed to access the elements of the array and to manipulate the whole tile. One of its fields is a pointer to the data; in this case, it is a pointer to an array of HitTile elements. We have added other information fields to the structure to record details about the hierarchical structure at a given level. Other Hitmap functionalities, related to the communication and input/output of tiles, have been reprogrammed to support marshalling/demarshalling of whole hierarchical structures. The rest of this section shows how the HitTile_HitTile type is used to define and lay out aggregations of subselection tiles (e.g. array blocks), allowing compositions of layouts to be easily applied at different levels.

Data distribution

In Fig. 7 we show how the Hitmap tile creation and layout functions can be used in a parallel environment to automatically generate the structure represented in Fig. 9, with a HitTile_double element for each block assigned to the local parallel processor. Line 1 creates a virtual 2D topology of processors using a Hitmap plug-in; topology plug-ins automatically obtain information from the physical topology. The number of blocks is calculated in line 2, using the matrix size and the block size parameters. Line 3 declares the HitTile variables needed. In line 4, we define the domain of a tile representing the global array of blocks. This declaration does not allocate memory for the blocks, or for the upper-level array of tiles. Instead, it creates the HitTile template with the array sizes. It is used in line 5 to call a cyclic layout plug-in that distributes the tile elements among processors. The returned HitLayout object contains information about the elements assigned to any processor, specifically the local one. This data is used in line 6 to select the appropriate local elements of the original tile. The memory for these selected local blocks is allocated in line 7. The following four lines define and allocate a new tile with the sizes of a block, using it as a pattern to fill up the local selection of the matrix of blocks; the function hit_tileFill makes clones of its second parameter for every data element of the first tile parameter. The last line activates the possibility of declaring 1D collective communications in the virtual topology associated with the layout.

Factorization

Figure 8 shows an excerpt of the matrix factorization code. Inside the main loop, we show the code for Task Type 2, which updates (a) the blocks of the L matrix in the same column as the diagonal block, and (b) the blocks of the trailing matrix. Before the task code, we can see the declaration of the Hitmap communication objects used in that task during the given iteration. Hitmap provides functions to declare HitCom objects containing information about the subtile (set of blocks) to communicate, and the sender/receiver processor in virtual topology coordinates. The communications needed in this algorithm are always broadcasts of blocks along a given axis of the 2D virtual topology associated with the layout object. In the task code, the communications associated with a HitCom object are invoked, any number of times, with the hit_comDo primitive.

HitCom commRowSend, commRowRecv, commColSend, commColRecv, commBlockSend, commBlockRecv;
for( i = 0; i < numberBlocks; i++ ) {

    // 1. COMMUNICATION OBJECTS DECLARATION
    commBlockSend = ...;
    commBlockRecv = hit_comBroadcastDimSelect( matrixLayout, 0, diagProc[0], & bufferDim0,
                        hit_shape( 1, hit_sigIndex( currentBlock[1] ) ),
                        HIT_COM_TILECOORDS, HIT_DOUBLE );
    commRowSend = ...;
    commRowRecv = hit_comBroadcastDimSelect( matrixLayout, 0, diagProc[0], & bufferDim1,
                        hit_shape( 1, hit_tileSigDimTail( matrixSelection, 1, currentBlock[1] ) ),
                        HIT_COM_TILECOORDS, HIT_DOUBLE );
    commColSend = hit_comBroadcastDimSelect( matrixLayout, 1, hit_topDimRank( topo, 1 ),
                        & matrixSelection,
                        hit_shape( 2,
                            hit_tileSigDimTail( matrixSelection, 0, currentBlock[0] ),
                            hit_sigIndex( currentBlock[1] ) ),
                        HIT_COM_TILECOORDS, HIT_DOUBLE );
    commColRecv = ...;

    // 2. COMPUTATION
    // 2.1. TASK TYPE 1: UPDATE DIAGONAL BLOCK, U ROW, L COLUMN, AND TRAILING MATRIX
    if ( hit_topAmIDim( topo, 0, diagProc[0] )
            && hit_topAmIDim( topo, 1, diagProc[1] ) ) {
        ...
    }

    // 2.2. TASK TYPE 2: UPDATE L COLUMN, AND TRAILING MATRIX
    else if ( hit_topAmIDim( topo, 1, diagProc[1] ) ) {
        hit_comDo( & commBlockRecv );
        for( j = currentBlock[0]; j < hit_tileDimCard( matrixSelection, 0 ); j++ ) {
            updateMatrixL( matrixBlockAt( bufferDim1, 1, currentBlock[1] ),
                           matrixBlockAt( matrixSelection, j, currentBlock[1] ) );
        }
        hit_comDo( & commRowRecv );
        hit_comDo( & commColSend );
        for( j = currentBlock[0]; j < hit_tileDimCard( matrixSelection, 0 ); j++ ) {
            for( k = currentBlock[1]; k < hit_tileDimCard( matrixSelection, 1 ); k++ ) {
                updateTrailingMatrix( matrixBlockAt( matrixSelection, j, currentBlock[1] ),
                                      matrixBlockAt( bufferDim1, 1, k ),
                                      matrixBlockAt( matrixSelection, j, k ) );
            }
        }
    }

    // 2.3. TASK TYPE 3: UPDATE U ROW, AND TRAILING MATRIX
    else if ( hit_topAmIDim( topo, 0, diagProc[0] ) ) {
        ...
    }
    // 2.4. TASK TYPE 4: UPDATE ONLY TRAILING MATRIX
    else {
        ...
    }
    hit_comFree( commRowSend ); hit_comFree( commRowRecv ); hit_comFree( commColSend );
    hit_comFree( commColRecv ); hit_comFree( commBlockSend ); hit_comFree( commBlockRecv );
}

Figure 8: Hitmap factorization code.

VI. EXPERIMENTAL RESULTS

In this section we evaluate the approaches described in the previous section. The evaluation is carried out in terms of performance and ease of programming.

ScaLAPACK's LU version is originally written in Fortran, while Hitmap is a C library. To ensure a fair comparison, we translated ScaLAPACK's LU version to C, using the MPI library for communications. Since the C language uses row-major storage instead of Fortran's column-major storage, we have inverted loops and indexes for the matrix accesses (see the sketch below), and also adapted the solving part of the algorithm. Moreover, the row-swapping operation is not fairly comparable in terms of performance with the C versions due to the different storage order. Therefore, we have eliminated the row-swapping stage from both the LU ScaLAPACK routine and our C programs.
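As a reminder of the index inversion involved in the translation, here is a small illustrative sketch (our own helper names, not part of either code base):

/* Element (i, j) of an n x m matrix: C lays out rows contiguously,
 * Fortran lays out columns contiguously, so loop nests and index
 * expressions are swapped when translating between the two. */
static double at_row_major(const double *a, int m, int i, int j) { return a[i * m + j]; }
static double at_col_major(const double *a, int n, int i, int j) { return a[j * n + i]; }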


This translation process also helped us to estimate the development effort of writing a hand-made, right-looking, message-passing LU. The hierarchical nature of a two-level Hitmap tile exploits memory access locality in a natural way. To isolate this effect, we developed two modified versions of the manual C code with different optimizations. These changes represent a good trade-off between development effort and performance. Both modified versions are described below.

Blocks browsing: The ScaLAPACK LU main loop iterates on a block-by-block basis, but the inner loops browse the remaining part of the matrix by columns. Thus, several block columns are updated before proceeding to the next column. More data locality can be achieved in the sequential parts of the code by browsing the matrices by blocks, updating a whole block before proceeding to the following one.

Blocks storing: ScaLAPACK stores the local parts of the global matrix contiguously in memory: an n × m matrix is declared as a two-dimensional n × m array in which all the column elements are contiguous. Storing each block contiguously and using a matrix of pointers to locate the blocks improves both locality and performance.

Experiments design and environment

The codes have been run on a shared-memory machine called Geopar. Geopar is an Intel S7000FC4URE server, equipped with four quad-core Intel Xeon MPE7310 processors at 1.6 GHz and 32 GB of RAM. Geopar runs OpenSolaris 2008.05, with the Sun Studio 12 compiler suite. We use up to 15 cores simultaneously, to avoid interference from the operating system running on the last core. We present results with input data matrices of 15 000 × 15 000 elements to achieve good scalability. The optimal block size is platform-dependent; in Geopar, we have found that the best block size is 64 × 64 elements.

Results

Figure 10 (left) shows the execution times in seconds for the different implementations. Figure 10 (right) shows the corresponding speed-ups. The plots show that our direct translation of the ScaLAPACK code to C presents a performance similar to the original Fortran code only when executed sequentially. When executing in parallel, our results show that the optimization effort applied to the communication stages in the ScaLAPACK routines cannot be matched by a naive translation to C. However, the data-locality optimizations introduced in the blocks-browsing and blocks-storing versions clearly outperform the original ScaLAPACK routines. The speed-up plots also show that they scale properly. The buffering of data needed during communications also benefits from the block-storing scheme. Regarding the Hitmap version, it achieves performance figures similar to the carefully-optimized C+MPI implementations with data-locality improvements. The reason is that, as we stated before, the Hitmap development techniques exploit locality in a natural way. The flexible and general Hitmap tile management only introduces a small overhead in the access to tile elements.

Finally, in Table I we show a comparison of the Hitmap version with the best optimized C version in terms of lines of code. The comparison takes into account not only the total number of lines of code, but also the lines devoted to parallelism, including data partition, communication initializations, and communication calls. As the table shows, the use of the Hitmap library leads to a 9.29% reduction in the total number of code lines. This reduction mainly affects the lines directly related to parallelism (22.52%). The reason is that Hitmap greatly simplifies the programmer effort for data distribution and communication, compared with the equivalent code needed when using MPI routines directly. Moreover, the use of Hitmap tile-management primitives eliminates some more lines in the sequential treatment of matrices and blocks.

Table I: Comparison of code lines due to parallelism and total code lines between Hitmap and the MPI reference implementation.

                     Par. Code Lines      %       Total Lines
  MPI Ref. Version        151          23.37%        646
  Hitmap Version          117          19.97%        586
  Reduction              22.52%                      9.29%

Figure 9: Two-level support for block-cyclic layouts in the Hitmap library: a HitTile_HitTile tile whose data field points to an array of HitTile_double tiles, each of which points to its own array of double elements.

VII. CONCLUSIONS

In this paper we present a technique that allows Hitmap, a library for hierarchical array tiling and mapping, to support layout compositions at different levels. The technique takes advantage of a new derived type that allows heterogeneous tile hierarchies to be represented.

Figure 10: Execution times (left) and speed-ups (right) for a 15 000 × 15 000 matrix with 64 × 64 blocks, comparing the C naive translation, ScaLAPACK, Hitmap, C blocks-browsing, and C blocks-storing versions.

This technique improves the applicability of Hitmap to new problem classes and programming paradigms, such as multilevel adaptive PDE solvers or recursive domain partitions, and to other scenarios, such as heterogeneous platforms. The improved features of Hitmap allow the programmer to work at a higher abstraction level than using message-passing programming directly, thus significantly reducing the coding and debugging effort.

We show how this new type allows data-layout compositions to be created. We use the block-cyclic layout in the well-known parallel LU decomposition as a study case. We compare the reference implementation of LU in ScaLAPACK, as well as different carefully-optimized MPI versions, with a Hitmap implementation. Our experimental results show that the highly-efficient techniques used in Hitmap for tile management, memory access, data locality, buffering, marshalling, and communication make it possible to achieve better performance than the ScaLAPACK implementation. The performance obtained is almost the same as that of a manual, carefully-optimized MPI version, at a fraction of its associated development cost.

ACKNOWLEDGEMENTS

This research is partly supported by the Ministerio de Educación y Ciencia, Spain (TIN2007-62302), Ministerio de Industria, Spain (FIT-350101-2007-27, FIT-350101-2006-46, TSI-020302-2008-89, CENIT MARTA, CENIT OASIS, CENIT OCEAN LEADER), Junta de Castilla y León, Spain (VA094A08), and also by the Dutch government STW/PROGRESS project DES.6397. Part of this work was carried out under the HPC-EUROPA project (RII3-CT-2003-506079), with the support of the European Community Research Infrastructure Action under the FP6 "Structuring the European Research Area" Programme. The authors wish to thank the members of the Trasgo Group for their support.

REFERENCES

[1] S. Hiranandani, K. Kennedy, and C. Tseng, "Compiling Fortran D for MIMD distributed-memory machines," Commun. ACM, vol. 35, no. 8, pp. 66–80, 1992.
[2] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon, Parallel Programming in OpenMP. Morgan Kaufmann Publishers, 1st ed., 2001. ISBN 1-55860-671-8.
[3] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and language specification," Tech. Rep. CCS-TR-99-157, IDA Center for Computing Sciences, 1999.
[4] W. Gropp, E. Lusk, and A. Skjellum, Using MPI - 2nd Edition: Portable Parallel Programming with the Message Passing Interface. The MIT Press, 2nd ed., Nov. 1999.
[5] S. Gorlatch, "Send-Recv considered harmful? Myths and truths about parallel programming," in PaCT'2001 (V. Malyshkin, ed.), vol. 2127 of LNCS, pp. 243–257, Springer-Verlag, 2001.
[6] K. Gatlin, "Trials and tribulations of debugging concurrency," ACM Queue, vol. 2, pp. 67–73, Oct. 2004.
[7] A. Gonzalez-Escribano and D. Llanos, "Trasgo: A nested parallel programming system," Journal of Supercomputing, doi:10.1007/s11227-009-0367-5, 2009.
[8] G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. Fraguela, M. Garzarán, D. Padua, and C. von Praun, "Programming for parallelism and locality with hierarchical tiled arrays," in PPoPP'06, pp. 48–57, ACM Press, March 2006.
[9] B. Chamberlain, D. Callahan, and H. Zima, "Parallel programmability and the Chapel language," International Journal of High Performance Computing Applications, vol. 21, pp. 291–312, Aug. 2007.
[10] G. Steele Jr., "Parallel programming and code selection in Fortress," in PPoPP'06, pp. 1–1, ACM Press, March 2006.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, Oct. 1996.
[12] M. T. Heath, Scientific Computing. McGraw-Hill Science/Engineering/Math, 2nd ed., 2001.
[13] J. Choi and J. J. Dongarra, "Scalable linear algebra software libraries for distributed memory concurrent computers," in Proceedings of the 5th IEEE Workshop on Future Trends of Distributed Computing Systems, p. 170, IEEE Computer Society, 1995.
[14] J. Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley, "Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines," Sci. Program., vol. 5, no. 3, pp. 173–184, 1996.
[15] R. Reddy, A. Lastovetsky, and P. Alonso, "Parallel solvers for dense linear systems for heterogeneous computational clusters," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–8, IEEE Computer Society, 2009.
[16] J. Dongarra and A. Petitet, "ScaLAPACK tutorial," in Proceedings of the Second International Workshop on Applied Parallel Computing, Computations in Physics, Chemistry and Engineering Science, pp. 166–176, Springer-Verlag, 1996.
[17] L. S. Blackford, A. Cleary, J. Choi, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide. SIAM, 1997.

