Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems

Fengguang Song*, Hatem Ltaief*, Bilel Hadri†, and Jack Dongarra‡
* EECS, University of Tennessee, Knoxville, TN, USA; {song, ltaief}@eecs.utk.edu
† NICS, Oak Ridge National Laboratory, Oak Ridge, TN, USA; [email protected]
‡ University of Tennessee, Knoxville, TN, USA; Oak Ridge National Laboratory, Oak Ridge, TN, USA; University of Manchester, Manchester, UK; [email protected]

This material is based upon work supported by NSF grant CCF-0811642 and by Microsoft Research. This work also used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.

Abstract—As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices, initially done on shared-memory multicore systems. The fine granularity of tile algorithms combined with communication-avoiding techniques for the QR factorization presents a high degree of parallelism: multiple tasks can be executed concurrently, computation and communication can be largely overlapped, and computation steps can be fully pipelined. A decentralized dynamic scheduler has been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization outperforms the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices and exhibits good scalability on up to 3,072 cores.
I. INTRODUCTION

The method of least squares is used in many scientific fields such as mathematics, physics, statistics, and economics, where applications of data fitting, regression analysis, and production-function modeling occur frequently. The problem is to find the solution of an overdetermined system of linear equations Ax = b with more equations than unknowns; the matrix A is therefore tall and skinny. The modern classical method to solve such a system is based upon the QR factorization: first compute A = QR, then solve the upper-triangular system Rx = Q*b for x.

Various numerical libraries supply QR factorization subroutines. LAPACK [1] provides a collection of linear algebra software for shared-memory systems. ScaLAPACK [2], [3] includes a subset of LAPACK subroutines redesigned for distributed-memory message-passing systems. In addition, a number of vendors provide libraries optimized for their own hardware, such as Intel MKL, AMD ACML, IBM ESSL (PESSL), and Cray XT LibSci. All the vendor libraries include the subroutines of LAPACK and ScaLAPACK. However, as the number of cores per chip increases, these existing libraries have started to show degrading performance on multicore (or manycore) architectures. One important reason is that the libraries use the fork-join approach to parallelize their routines. The join operation works as a barrier and increases the length of the task graph's critical path substantially. For a fixed number of tasks, increasing the length of the critical path can seriously affect program performance. For instance, the subroutine for QR factorization in LAPACK uses a block algorithm. Given an m × n matrix A that is partitioned as

$$A=\begin{pmatrix}A_{1:b,\,1:b} & A_{1:b,\,b+1:n}\\ A_{b+1:m,\,1:b} & A_{b+1:m,\,b+1:n}\end{pmatrix},$$

where b is the block size, the block algorithm 1) first factorizes the left column panel A_{1:m,1:b}; 2) applies the panel factorization result to the top row panel A_{1:b,b+1:n}; and 3) then applies it to the trailing submatrix A_{b+1:m,b+1:n}. All three steps are executed in a fork-join manner, which increases the length of the critical path. The same set of steps is applied recursively to the submatrix A_{b+1:m,b+1:n} until the submatrix consists of merely a single column panel. The ScaLAPACK QR factorization subroutine uses the same block algorithm as LAPACK. In this paper, we use the term "block QR factorization" to refer to this algorithm.

During the last several years we have been working on designing new parallel linear algebra software for multicore architectures. We believe that such software should have the following characteristics: fine-grain tasks for a higher degree of parallelism, asynchronous execution to eliminate synchronization points, and good locality to improve data reuse. The tile algorithms designed in our Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project [4] exhibit these three desirable characteristics.
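As a concrete illustration of the least-squares-via-QR setting that motivates this paper, the short C sketch below uses the standard LAPACKE interface (LAPACKE_dgels, which solves the problem through a QR factorization). It is an independent illustration of the problem, not code from the paper or from PLASMA.

```c
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* Overdetermined (tall and skinny) system: 4 equations, 2 unknowns,
     * stored column-major. */
    double A[4 * 2] = { 1, 1, 1, 1,     /* column 1 */
                        1, 2, 3, 4 };   /* column 2 */
    double b[4]     = { 6, 5, 7, 10 };

    /* dgels computes min_x ||Ax - b||_2 via a QR factorization of A. */
    lapack_int info = LAPACKE_dgels(LAPACK_COL_MAJOR, 'N', 4, 2, 1,
                                    A, 4, b, 4);
    if (info != 0) return (int)info;

    printf("x = [%f, %f]\n", b[0], b[1]);  /* solution overwrites b(1:n) */
    return 0;
}
```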
Fig. 1. Tile QR factorization on a square matrix with 5 × 5 tiles. Each tile is of size b × b and corresponds to a fine grain task. The arcs show the data dependencies between the tasks.
The subroutine for QR factorization in PLASMA adopts an updating-based algorithm that operates on matrices stored in a tile data layout [5], [6]. A tile is a b × b square submatrix that is stored contiguously in memory. In this paper, we use the term "tile QR factorization" to refer to this updating-based QR factorization. Unlike the block QR factorization, which operates on panels, the tile QR factorization operates on much smaller tiles and is hence more fine-grained. Given a matrix A consisting of m_b × n_b tiles, A can be expressed as

$$A=\begin{pmatrix}A_{1,1} & A_{1,2} & \cdots & A_{1,n_b}\\ A_{2,1} & A_{2,2} & \cdots & A_{2,n_b}\\ \vdots & \vdots & \ddots & \vdots\\ A_{m_b,1} & A_{m_b,2} & \cdots & A_{m_b,n_b}\end{pmatrix},$$
where A_{i,j} is a square tile of size b × b. In the first iteration, the tile QR factorization computes the QR factorization of tile A_{1,1}. The factorization output of A_{1,1} is then used to update the set of tiles to its right, {A_{1,2}, ..., A_{1,n_b}}, in an embarrassingly parallel fashion. As soon as the update on any tile A_{1,j} is finished, the update on tile A_{2,j} can read the modified A_{1,j} and proceed. In other words, whenever a tile update on the i-th row completes, the tile below it on the (i+1)-th row can start, provided A_{i+1,1} has also completed. After updating the tiles on the last (m_b-th) row, tile QR applies the same steps recursively to the trailing submatrix A_{2:m_b,2:n_b}. Figure 1 illustrates the data dependencies between tasks during the first iteration for a 5 × 5 tiled matrix. Each tile located at [i, j] corresponds to a task that reads its inputs and modifies A[i, j]. For instance, tile A[2, 4] corresponds to a task that reads the output of the two tasks located at [1, 4] and [2, 1] and then modifies A[2, 4]. In the tile QR factorization, the tasks within each row can be executed in an embarrassingly parallel fashion. However, the sequential dependency between tasks along a column clearly makes the algorithm inefficient, especially for tall and skinny matrices. Hadri et al. presented a strategy to compute the QR factorization of tall and skinny matrices on shared-memory multicore machines [7]. Their approach considerably increases the number of parallel tasks located in the same column.
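To make the dependency pattern of Fig. 1 explicit, the following minimal C sketch (an illustration only, not code from PLASMA or the paper) enumerates the tasks of the first iteration and the tiles they wait for, using 0-based indices:

```c
#include <stdio.h>

/* Task [i][j] updates tile A[i][j]. In the first iteration, a trailing-update
 * task [i][j] (i,j >= 1) must wait for the panel task [i][0] in its row and
 * for the task [i-1][j] directly above it; panel task [i][0] waits for
 * [i-1][0]. This mirrors the arcs shown in Fig. 1. */
static void print_first_iteration_deps(int nb) {
    printf("[0][0]: panel factorization, no dependencies\n");
    for (int j = 1; j < nb; j++)
        printf("[0][%d]: depends on [0][0]\n", j);           /* top-row updates */
    for (int i = 1; i < nb; i++)
        printf("[%d][0]: depends on [%d][0]\n", i, i - 1);   /* panel column   */
    for (int i = 1; i < nb; i++)
        for (int j = 1; j < nb; j++)
            printf("[%d][%d]: depends on [%d][0] and [%d][%d]\n",
                   i, j, i, i - 1, j);                       /* trailing update */
}

int main(void) { print_first_iteration_deps(5); return 0; }
```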
Their work was inspired by the tile QR factorization (available in PLASMA) and a communication-avoiding technique (known as CAQR) introduced by Demmel [8]. Their original idea was to increase the degree of parallelism in a shared-memory context. However, this algorithm had not been explored on distributed-memory systems. We investigate and extend the algorithm to modern large-scale distributed-memory machines and demonstrate its high efficiency and scalability. We call the distributed version of the algorithm the "distributed tile communication-avoiding QR factorization."

In this paper, we also analyze the tile CA-QR factorization in terms of operation count, number of messages, and communication volume. We then compare the algorithm to previous work such as LAPACK, ScaLAPACK, tile QR (PLASMA), TSQR, and CAQR [8]. The tile CA-QR factorization has an operation count comparable to that of LAPACK and ScaLAPACK when appropriate parameters are chosen. Its communication volume is optimal up to a factor of two and is lower than that of ScaLAPACK, CAQR, and tile QR. It also requires far fewer messages than tile QR.

The distributed tile CA-QR factorization partitions a matrix's rows into D blocks of rows (i.e., D domains). On a distributed-memory system with P compute nodes, it then partitions the D domains into P subsets (one subset per node) using a 1D block distribution, where D ≥ P. Each node runs a single MPI process and is responsible for computing D/P domains. For each column panel (one tile wide), the factorization performs independent QR factorizations in each domain, carried out by different processes in parallel. Then each domain updates its trailing submatrix concurrently. In the third and final step, the local R factors from the domains are reduced by different processes to the final R factor, and the corresponding block-rows are updated again. The reduction among the domains uses a binary tree to obtain the final R factor.

Because the complex binary-tree reduction resides on the critical path of the computation's task graph, we extended our dynamic scheduling runtime system [9] to support distributed tile CA-QR more efficiently. We added new features such as a lookahead depth and three levels of task priority to the runtime system. A collection of trace analyses shows that the extended scheduling runtime system is significantly more efficient.

This paper evaluates the efficiency of distributed tile CA-QR by comparing it to vendor-optimized ScaLAPACK libraries. We conducted both strong-scalability and weak-scalability experiments on two clusters and a Cray XT5 system consisting of more than 18,000 12-core nodes. The experimental results show that our program outperforms ScaLAPACK by up to 4 times and exhibits good scalability from 1 to 3,072 cores (3,072 cores is the largest experiment we have attempted).

This paper includes the following new and original work: (1) a major extension and improvement from shared-memory systems to distributed-memory systems; (2) the first analysis of the algorithm with respect to operation count, number of messages, and communication volume;
(3) an extended runtime system that enables an efficient implementation of the distributed algorithm; and (4) the first demonstration of good scalability of the algorithm on modern large-scale distributed-memory systems using up to 3,072 cores.

The rest of the paper is organized as follows. Section II introduces the related work. Sections III and IV describe the tile CA-QR factorization algorithm and its analysis, respectively. Section V provides an overview of the dynamic scheduling runtime system and explains our extensions. Section VI presents the performance evaluation on three distributed-memory systems. Section VII summarizes our work.

II. RELATED WORK

In the mid-1970s, Morven Gentleman introduced, for sparse matrices, the approach of splitting a matrix into submatrices so that the reduction can be performed independently and recursively on the submatrices [10]. Pothen and Raghavan [11] then developed the idea of parallelizing the factorization of a panel by implementing distributed orthogonal factorizations using Householder and Givens algorithms. Their approach divides the columns into P subcolumns (where P is the number of processors), performs factorizations locally, and then merges the resulting triangular factors. Based on Pothen and Raghavan's work, Demmel et al. [8] proposed a class of QR factorizations with parallel panel factorization, called Communication-Avoiding QR (CAQR). The approach performs the panel factorization on several columns at once thanks to a new algorithm called TSQR (Tall Skinny QR): the panels are divided into block-rows, which are factorized independently and then merged using either a binary or a general reduction tree. The authors also provided a performance estimate for CAQR. Observing that the QR factorization of a tall and skinny matrix can be represented as a reduction, Langou [12] implemented a methodology that performs the reduction with a user-defined MPI reduction operation and MPI_Reduce. Moreover, in the context of grid computing, after identifying bottlenecks in ScaLAPACK, Agullo et al. [13] developed an approach to computing the QR factorization that combines the CAQR factorization with a topology-aware middleware in order to confine intensive communication. Contrary to all the previous work on QR, they built reduction trees adapted to a particular grid environment.

III. TILE CA-QR FACTORIZATION

Essentially, the tile CA-QR factorization is an integration (or mixed version) of the CAQR factorization and the tile QR factorization. The basic idea is to store the matrix in a tile data layout and divide it into a number of domains (i.e., blocks of rows). Each domain performs a local QR factorization independently. After finishing its local QR factorization, each domain participates in a global reduction to compute the final R factor.
Suppose an m × n matrix A consists of m_b × n_b tiles (m > n), where b is the tile size, m_b = m/b, and n_b = n/b. Tile CA-QR partitions the matrix's m rows into D blocks: A = [A_1; A_2; ...; A_D], where A_i is of dimension (m/D) × n and is called "Domain i." Note that matrix A is stored in b × b tiles. The tiled matrix A divided into D horizontal domains can be expressed as follows:

$$A=\begin{pmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n_b}\\
\vdots & \vdots & & \vdots\\
A_{\frac{m_b}{D},1} & A_{\frac{m_b}{D},2} & \cdots & A_{\frac{m_b}{D},n_b}\\
A_{\frac{m_b}{D}+1,1} & A_{\frac{m_b}{D}+1,2} & \cdots & A_{\frac{m_b}{D}+1,n_b}\\
\vdots & \vdots & & \vdots\\
A_{\frac{2m_b}{D},1} & A_{\frac{2m_b}{D},2} & \cdots & A_{\frac{2m_b}{D},n_b}\\
\vdots & \vdots & \ddots & \vdots\\
A_{m_b,1} & A_{m_b,2} & \cdots & A_{m_b,n_b}
\end{pmatrix},$$
where A_{i,j} is a tile of size b × b. In the first step, all the domains execute the tile QR factorization on the first panel and the associated updates concurrently, as shown in Fig. 1. There is no data dependency or communication between different domains; that is, each domain is independent of the others. After the QR factorization of the first panel within each domain is finished, each domain i obtains a b × b upper triangular factor R̂_i located at A_{(i−1)·m_b/D+1, 1}. For instance, R̂_1 is located at A_{1,1} and R̂_2 is located at A_{m_b/D+1, 1}. Note that all the R̂_i's belong to the first block-column in the first iteration. Next, the tile CA-QR factorization performs a reduction among all the R̂_i's, where i ∈ {1, ..., D}. The output of the reduction is the final factor R_{1,1}, assuming A = QR and R is stored in tiles. Then the final R_{1,1} is applied to the top block-row {A_{1,2}, ..., A_{1,n_b}} to compute {R_{1,2}, ..., R_{1,n_b}}. The next iteration of the factorization can be initiated on A_{2:m_b,2:n_b} while the previous iteration is still in progress, as long as the dependencies are satisfied. The factorization iterations are therefore pipelined, which can potentially hide the lightweight synchronization points required during the reduction step.

Before describing the distributed tile CA-QR factorization, we briefly overview the six kernel subroutines used by the factorization. For more details on these kernels, please refer to Section 3 of the Hadri et al. paper [7]. The first four kernel subroutines are invoked locally within a domain.

• dgeqrt: R[k,k], V[k,k], T[k,k] ← dgeqrt(A[k,k]). dgeqrt computes the QR factorization of tile A[k,k] and generates three outputs: an upper triangular tile R[k,k], a unit lower triangular tile V[k,k] containing the Householder reflectors, and an upper triangular tile T[k,k] storing the accumulated transformations.
• dtsqrt: R[k,k], V[i,k], T[i,k] ← dtsqrt(R[k,k], A[i,k]). After dgeqrt is called, dtsqrt stacks tile R[k,k] on top of tile A[i,k] and computes an updated QR factorization.
The subroutine updates the tile R[k,k] and generates a tile V[i,k] and an upper triangular tile T[i,k]. V[i,k] and T[i,k] store the Householder reflectors and the accumulated transformations, respectively.
• dormqr: R[k,j] ← dormqr(V[k,k], T[k,k], A[k,j]). dormqr applies dgeqrt's output (i.e., V[k,k] and T[k,k]) to tile A[k,j], located to the right of A[k,k], and computes the R factor R[k,j].
• dtsssmqr: R[k,j], A[i,j] ← dtsssmqr(V[i,k], T[i,k], R[k,j], A[i,j]). dtsssmqr applies dtsqrt's output (i.e., V[i,k] and T[i,k]) to the stacked R[k,j] and A[i,j], and updates R[k,j] and A[i,j], respectively.

The Domain_Tile_QR algorithm applies the above four kernel subroutines to factorize a domain of size nrows × ncols tiles starting from position A[I,J]. For instance, Fig. 1 can be viewed as a single domain to which this algorithm is applied. Note that I and J are indexed from 0.

Algorithm 1 Domain_Tile_QR
Domain_Tile_QR(A, I, J, nrows, ncols)
  R[I,J], V[I,J], T[I,J] ← dgeqrt(A[I,J])
  for j ← J+1 to J+ncols−1 do            /* along the I-th row */
    R[I,j] ← dormqr(V[I,J], T[I,J], A[I,j])
  end for
  for i ← I+1 to I+nrows−1 do            /* along the J-th column */
    R[I,J], V[i,J], T[i,J] ← dtsqrt(R[I,J], A[i,J])
  end for
  for i ← I+1 to I+nrows−1 do            /* trailing submatrix update */
    for j ← J+1 to J+ncols−1 do
      R[I,j], A[i,j] ← dtsssmqr(V[i,J], T[i,J], R[I,j], A[i,j])
    end for
  end for
The remaining two kernel subroutines are used in the reduction step, which merges a collection of domains.

• dttqrt: R[i1,k], V[i2,k], T[i2,k] ← dttqrt(R[i1,k], R[i2,k]). This is the "merge" operation. dttqrt stacks one domain's factor R[i1,k] on top of another domain's R[i2,k] and computes an updated R[i1,k]. It also generates an upper triangular tile V[i2,k] and an upper triangular tile T[i2,k].
• dttssmqr: A[i1,j], A[i2,j] ← dttssmqr(V[i2,k], T[i2,k], A[i1,j], A[i2,j]). After dttqrt is called, dttssmqr applies the output of dttqrt to update A[i1,j] and A[i2,j] (j ∈ [k+1, n_b]), which are located on the right hand side of R[i1,k] and R[i2,k], respectively.

The Merge_Domains algorithm merges two R factors from a pair of domains and updates the two trailing rows on their right hand sides.

Algorithm 2 Merge_Domains
Merge_Domains(R, A, i1, i2, k, ncols)
  /* merge two R factors from two domains */
  R[i1,k], V[i2,k], T[i2,k] ← dttqrt(R[i1,k], R[i2,k])
  /* update the coupled i1-th and i2-th rows */
  for j ← k+1 to k+ncols−1 do
    A[i1,j], A[i2,j] ← dttssmqr(V[i2,k], T[i2,k], A[i1,j], A[i2,j])
  end for
Distributed Tile CA-QR Factorization: Given P processes on a distributed-memory system, we distribute the matrix's D domains across the processes with a 1-D block distribution. Each process P_i owns D/P domains, from domain (D/P)·i to domain (D/P)·(i+1) − 1. Although D is a parameter used at the algorithm level, we assume D ≥ P so that every process owns at least one domain. A process may consist of one or more threads running on multiple cores. The distributed tile CA-QR factorization is shown in Algorithm 3.

Algorithm 3 Distributed_Tile_CAQR
Distributed_Tile_CAQR(A, m_b, n_b, D, P)
  nr ← m_b/P        /* number of tile rows per process */
  nd ← D/P          /* number of domains per process */
  ds ← m_b/D        /* domain size in tile rows */
  for each tile column k ← 0 to n_b − 1 do
    root ← ⌊k/ds⌋   /* index of the current root domain */
    /* process P_myrank factorizes its own nd domains in parallel */
    for each domain i ← 0 to nd − 1 do
      if (d ← myrank × nd + i) ≥ root then
        I ← myrank × nr + i × ds   /* [I,k] is the top-left corner of domain d */
        Domain_Tile_QR(A, I, k, (myrank + 1) × nr − I, n_b − k)
      end if
    end for
    /* binary-tree merge */
    LB ← myrank × nd,  UB ← LB + nd − 1
    for m ← 1 to ⌈log2(D − root)⌉ do
      d1 ← root,  d2 ← d1 + 2^(m−1)
      while d2 < D do
        if d1 ∈ [LB, UB] or d2 ∈ [LB, UB] then
          P1 ← ⌊d1/nd⌋,  P2 ← ⌊d2/nd⌋
          i1 ← d1 × ds,  i2 ← d2 × ds
          Processes P1 and P2 exchange A[i1, k..n_b−1] and A[i2, k..n_b−1]
          Merge_Domains(R, A, i1, i2, k, n_b − k)
        end if
        d1 ← d1 + 2^m,  d2 ← d2 + 2^m
      end while
    end for
  end for
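The pairing used by the binary-tree merge in Algorithm 3 can be illustrated with the following short C sketch (an illustration, not the paper's implementation). It prints, for each merge stage m, which pairs of domains are combined, following the d2 = d1 + 2^(m−1) rule with a step of 2^m:

```c
#include <stdio.h>

/* Enumerate the binary-tree merge schedule of Algorithm 3. At stage m,
 * domain d1 absorbs the R factor of domain d2 = d1 + 2^(m-1), and the pairs
 * advance by 2^m. "root" is the index of the current root domain. */
static void print_merge_schedule(int D, int root) {
    for (int m = 1; (1 << (m - 1)) < D - root; m++) {
        printf("stage %d:", m);
        for (int d1 = root; d1 + (1 << (m - 1)) < D; d1 += (1 << m)) {
            int d2 = d1 + (1 << (m - 1));
            printf("  merge(R[%d], R[%d])", d1, d2);
        }
        printf("\n");
    }
}

int main(void) {
    print_merge_schedule(8, 0);  /* e.g., 8 domains at the first panel */
    return 0;
}
```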
Figure 2 illustrates the operations of Distributed_Tile_CAQR. It shows a matrix of 12 × 3 tiles that is distributed across four domains. Each domain is stored and computed by one process and holds a submatrix of 3 × 3 tiles. The figure shows the corresponding operations of the first iteration: each domain invokes Domain_Tile_QR in parallel, followed by a binary-tree merge between the first panels of the domains. The second iteration is the same as the first except that it works on a trailing submatrix of 11 × 2 tiles.

Fig. 2. The operations of distributed tile CA-QR. (a) Matrix A is divided into four domains horizontally. (b) Domain_Tile_QR is applied to each domain in parallel. (c) Each domain has computed an R factor located in its first column; we merge R0 with R1, and R2 with R3. (d) We merge R0 and R2 and obtain the final factor R0. (e) At the beginning of the second iteration, domain D0 has 2 × 2 tiles and the other three domains have 3 × 2 tiles. We then apply steps (b), (c), and (d) to the four new domains in the trailing submatrix.

IV. ALGORITHM ANALYSIS

In this section, we present the total number of operations (for both the sequential and parallel versions), the number of messages, and the communication volume of the tile CA-QR factorization. We also compare these metrics to those of related QR factorizations.

A. Operation Count

We use aggregate analysis to calculate the number of operations contributed by each kernel. Note that each kernel takes tiles of size b × b as input.
dgeqrt performs 2b^3 floating point operations. At iteration k, dgeqrt is invoked D − (k−1)/(m_b/D) times (once per remaining domain). There are n_b iterations, thus

$$T_{\mathrm{dgeqrt}}=\sum_{k=1}^{n_b}\left(D-\frac{k-1}{m_b/D}\right)2b^3\;\approx\;2b^3 n_b D-\frac{b^3\,n_b(n_b-1)}{m_b/D}.$$

dormqr performs 3b^3 floating point operations. At iteration k, there exist D − (k−1)/(m_b/D) domains, each with n_b − k + 1 tile columns. Every domain applies dormqr to all the tiles on its top row except the first one:

$$T_{\mathrm{dormqr}}=\sum_{k=1}^{n_b}\left(D-\frac{k-1}{m_b/D}\right)(n_b-k)\times 3b^3\;\simeq\;\frac{3}{2}b^3 D n_b^2.$$

dtsqrt performs (10/3)b^3 floating point operations. At iteration k, there exist D − (k−1)/(m_b/D) domains, one of which is the root domain. The root domain has m_b/D − (k mod m_b/D) + 1 tile rows, while the other domains have m_b/D tile rows:

$$T_{\mathrm{dtsqrt}}=\sum_{k=1}^{n_b}\left(\frac{m_b}{D}-k\bmod\frac{m_b}{D}+\Big(D-\frac{k-1}{m_b/D}-1\Big)\frac{m_b}{D}\right)\frac{10b^3}{3}\;\simeq\;\left(2m_b n_b-n_b\Big(n_b+\min\big(n_b,\tfrac{m_b}{D}\big)\Big)\right)\frac{5}{3}b^3.$$

dtsssmqr performs 4b^3 + sb^2 floating point operations, where s is a parameter used to implement dtsssmqr: s is the inner tile size, which divides the tile size b. At iteration k, there are D − (k−1)/(m_b/D) domains. The root domain consists of m_b/D − (k mod m_b/D) + 1 tile rows and n_b − k + 1 tile columns; the remaining domains are of size m_b/D × (n_b − k + 1) tiles:

$$T_{\mathrm{dtsssmqr}}=\sum_{k=1}^{n_b}\left[\Big(\frac{m_b}{D}-k\bmod\frac{m_b}{D}\Big)(n_b-k)+\Big(D-\frac{k-1}{m_b/D}-1\Big)\frac{m_b}{D}(n_b-k)\right](4b^3+sb^2)\;\approx\;2b^3 n_b^2\Big(1+\frac{s}{4b}\Big)\Big(m_b-\frac{n_b}{3}\Big).$$

dttqrt is the merge operation and performs (5/3)b^3 floating point operations. At iteration k, the binary tree has D − k + 1 leaf nodes and D − k internal nodes:

$$T_{\mathrm{dttqrt}}=\sum_{k=1}^{n_b}(D-k)\,\frac{5}{3}b^3=\frac{5}{6}b^3 n_b(2D-n_b).$$

dttssmqr performs (1/2)(4b^3 + sb^2) floating point operations. Every merge operation dttqrt is followed by n_b − k dttssmqr operations:

$$T_{\mathrm{dttssmqr}}=\sum_{k=1}^{n_b}(D-k)(n_b-k)\,\frac{1}{2}(4b^3+sb^2)\;\approx\;2b^3 n_b^2\Big(1+\frac{s}{4b}\Big)\Big(\frac{D}{2}-\frac{n_b}{6}\Big).$$

The total number of operations of the sequential tile CA-QR is the sum of the above six terms:

$$T_{\text{tile-caqr}}=T_{\mathrm{dgeqrt}}+T_{\mathrm{dormqr}}+T_{\mathrm{dtsqrt}}+T_{\mathrm{dtsssmqr}}+T_{\mathrm{dttqrt}}+T_{\mathrm{dttssmqr}}\;\approx\;2n^2\Big(1+\frac{s}{4b}\Big)\Big(m-\frac{n}{2}+\frac{Db}{2}\Big).$$

Compared to the operation count of the standard LAPACK QR factorization, T_LAPACK = 2n^2(m − n/3),

$$\frac{T_{\text{tile-caqr}}}{T_{\text{LAPACK}}}=\frac{2n^2\big(1+\frac{s}{4b}\big)\big(m-\frac{n}{2}+\frac{Db}{2}\big)}{2n^2\big(m-\frac{n}{3}\big)}=\Big(1+\frac{3Db-n}{6m-2n}\Big)\Big(1+\frac{s}{4b}\Big).$$

Based upon this equation, we can make the following observations:
• If m ≫ n and b ≫ s, then T_tile-caqr/T_LAPACK ≈ 1 + 1/(2·m_b/D). Note that m_b/D is the domain size (in tile rows) and is typically a large number.
• If D = m_b, then T_tile-caqr/T_LAPACK = (3/2)(1 + s/(4b)).
• If we choose the domain size m_b/D ≥ 10 and b/s ≥ 5, the operation count of tile CA-QR is comparable to that of LAPACK.

We use the same method to compute the number of operations of the parallel tile CA-QR. Instead of counting all D domains, we count only the D/P domains located within one process to obtain T_dgeqrt, T_dormqr, T_dtsqrt, and T_dtsssmqr. Moreover, a process can participate in at most log(D) dttqrt and log(D)(n_b − k) dttssmqr operations at each iteration k. Therefore,

$$T_{\text{par.\ tile-caqr}}\;\approx\;\frac{2n^2\big(m-\frac{n}{3}\big)\big(1+\frac{s}{4b}\big)}{P}+n^2 b\Big(1+\frac{s}{4b}\Big)\log D.$$
TABLE I. COMPARISON OF TILE CA-QR WITH RELATED ALGORITHMS

| Algorithm  | Operation Count (Sequential)       | Operation Count (Parallel)                              | # Messages       | Communication Volume |
| LAPACK     | 2n^2(m − n/3)                      | –                                                       | –                | –                    |
| ScaLAPACK  | 2n^2(m − n/3)                      | 2n^2(m − n/3)/P                                         | 3n log P         | (n^2 + bn) log P     |
| CAQR       | 2n^2(m − n/3)                      | 2n^2(m − n/3)/P                                         | (3n/b) log P     | (n^2 + bn/2) log P   |
| TSQR       | 2n^2(m − n/3)                      | 2n^2 m/P − 2n^3/(3P) + (2/3)n^3 log P                   | log P            | (n^2/2) log P        |
| Tile QR    | 2n^2(m − n/3)(1 + s/(4b))          | 2n^2(m − n/3)(1 + s/(4b))/P                             | (n/b)^2 Pr/Pc    | (Pr/Pc) m n^2/b      |
| Tile CA-QR | 2n^2(m − n/2 + Db/2)(1 + s/(4b))   | 2n^2(m − n/3)(1 + s/(4b))/P + n^2 b(1 + s/(4b)) log D   | (n/b)^2 log P    | n^2 log P            |
B. Number of Messages

We count messages for the process that has the maximum number of messages. Communication only occurs during the binary-tree merge, where the dttqrt and dttssmqr operations are involved. Each dttqrt is followed by (n_b − k) dttssmqr operations, and both dttqrt and dttssmqr require two message exchanges to stack two tiles. Given P processes, for each iteration k, a process is involved in at most log2(P) merge stages, thus

$$\mathrm{Message}_{\text{par.\ tile-caqr}}=\sum_{k=1}^{n_b}\log_2(P)\,(n_b-k+1)\times 2\;\simeq\;\log_2(P)\,n_b^2=\log_2(P)\,\frac{n^2}{b^2}.$$

C. Communication Volume

Similar to the message count, we compute the volume for the process that has the maximum number of words communicated with other processes. Since each message contains a tile of b^2 words,

$$\mathrm{Word}_{\text{par.\ tile-caqr}}=\log_2(P)\,n^2.$$
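For illustration, take P = 256 processes, n = 6,400, and b = 200, so that n_b = 32; these values are chosen only as an example and are not from the paper's experiments. The bounds above give at most log2(256) · 32^2 = 8,192 messages per process and a communication volume of log2(256) · 6,400^2 ≈ 3.3 × 10^8 words, i.e., roughly 2.6 GB assuming 8-byte double-precision words.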
D. Comparison with Related Algorithms

We compare tile CA-QR with the LAPACK, ScaLAPACK, CAQR, TSQR, and tile QR factorizations for tall and skinny matrices. The numbers for CAQR and TSQR are provided by Demmel's paper [8]. For ScaLAPACK and CAQR, we let Pr ≫ Pc, assuming a very tall and skinny matrix input.

We have implemented the tile QR factorization on distributed-memory systems in our previous work [9] and briefly introduce it here. The distributed tile QR factorization maps tiles to a Pr × Pc process grid using the 2-D block cyclic data distribution, where P = Pr × Pc is the total number of processes. A tile indexed by [i, j] is allocated to process P[i mod Pr, j mod Pc], so that each process stores a set of tiles and computes the tasks that modify those tiles. We skip the calculation of the number of messages and words for tile QR and give the result in Table I.

As shown in Table I, the first two columns present the number of floating point operations for the sequential and parallel algorithms, respectively. For the sequential algorithms, tile QR and tile CA-QR have more operations than the other algorithms by a factor of 1 + s/(4b). Note that s is an inner tile size which divides the tile size b and is typically a small number. For the operation count of the parallel algorithms, the order from least to most is ScaLAPACK, CAQR, tile QR, tile CA-QR, and TSQR. LAPACK is a library for shared-memory systems and thus involves no communication. Although TSQR has the minimum number of messages, it uses a much larger tile size, namely b = n for an m × n matrix. CAQR also has a smaller number of messages than tile CA-QR, but the algorithm typically uses the fork-join approach and is not suited for dynamic scheduling (e.g., the whole panel factorization step must be completed before the trailing matrix update step can start). In contrast, tile CA-QR provides more fine-grain tasks operating on tiles and can be executed in a fully asynchronous manner where computation and communication are largely overlapped. At first glance it might appear that the binary-tree merge in tile CA-QR is also a barrier, but it does not necessarily lead to idle CPU time, since the fine-grain tasks on each process (i.e., the tasks existing before the merge and the newly generated dttssmqr tasks) can keep the process's cores busy and hide the merge-related communication. On the other hand, in the case of a 1-D block row layout, a communication volume of (n^2/2) log P has been proven to be optimal for QR factorization, considering that the length of the critical path is at least log P [8].
V. THE DISTRIBUTED FRAMEWORK

We build upon our previous work on the Task-based Basic Linear Algebra Subroutines (TBLAS) dynamic runtime system [9], [14] to realize tile CA-QR on distributed-memory systems. This section first gives an overview of the TBLAS runtime system and then describes how we extend TBLAS to support tile CA-QR efficiently. Given a matrix A of m_b × n_b tiles and a multicore cluster consisting of N nodes, each with T cores, we launch a process P_i on each node N_i. The rows of matrix A are preallocated to the N nodes by a 1D block distribution: P_i (on node N_i) stores the tile-rows of A from (m_b/N)·i to (m_b/N)·(i+1) − 1. Note that by default TBLAS uses a general 2D block-cyclic data distribution, but the 1D distribution, which is a special case of the 2D distribution, is more suitable for tall and skinny matrices.
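A minimal C sketch of this 1D block mapping follows (the helper name is ours, not a TBLAS routine, and it assumes N divides m_b):

```c
#include <stdio.h>

/* 1D block distribution: node i stores tile-rows (mb/N)*i .. (mb/N)*(i+1)-1. */
static int owner_of_tile_row(int row, int mb, int N) {
    return row / (mb / N);
}

int main(void) {
    int mb = 512, N = 16;   /* e.g., 512 tile-rows distributed over 16 nodes */
    printf("tile-row 0   -> node %d\n", owner_of_tile_row(0,   mb, N));
    printf("tile-row 100 -> node %d\n", owner_of_tile_row(100, mb, N));
    printf("tile-row 511 -> node %d\n", owner_of_tile_row(511, mb, N));
    return 0;
}
```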
A. TBLAS Runtime System

Every process runs an instance of the TBLAS runtime system; the processes are started by mpirun. As shown in Fig. 3, the TBLAS runtime system includes three types of threads: a task-generation thread, computing threads, and a communication thread. Given a node with T cores, we launch T computing threads on T different cores, as well as a task-generation thread and a communication thread on two arbitrary cores. The task-generation thread executes a tile CA-QR program and generates tasks to fill its node's local task queues. Whenever it becomes idle, a computing thread picks up a task from the ready-task queue and computes it. After finishing a task, the computing thread scans the task queues to resolve data dependencies, finds the finished task's children, and starts them. The communication thread is responsible for sending and receiving data between a parent task and its children to satisfy their data dependencies. An advantage of the tile CA-QR factorization is that we do not need a dedicated core to perform MPI communication, because of the algorithm's high degree of parallelism and minimized communication cost.

B. Extensions

Our first implementation of tile CA-QR on top of the original TBLAS runtime system did not yield good performance out of the box. By profiling the execution with the Intel trace analyzer and collector [15], we found that each core's computing time was only half of the wall-clock execution time, which implies nearly 50% idle time on each core. Figure 4 a) shows an example trace of this first version of tile CA-QR running on 16 dual-core nodes. The colored regions represent computation time, and the gaps represent idle time during the execution. By analyzing the trace, we identified a few reasons for the poor performance. 1) In the program's task graph, tasks from two
iterations (i.e., from the i-th and (i+1)-th panels) are connected by the tasks that compute the global binary-tree reduction across domains. These merge tasks must be executed early in order to pull tasks from the next iteration into execution. 2) Within a domain, the panel factorization tasks should also be executed as early as possible, because many trailing-matrix update tasks wait on a single panel-factorization task. 3) Looking ahead to the next d iterations can help pull in tasks not only from the next iteration but from the next d iterations. Essentially, we want to make sure the TBLAS runtime system executes the tasks on the critical path as early as possible. We modified the runtime system in the following ways:

• We added a lookahead feature to the runtime system. The lookahead depth d is a parameter of the runtime system and has been tuned to provide the best performance. A depth of d means that, before the current iteration's submatrix update is completed, the next d panels belonging to the next d iterations can be factorized immediately after those panels are updated.
• We assign priorities to different tasks. The binary-tree merge tasks have the highest priority. At iteration i, the tasks located between the i-th column and the (i+d)-th column have the second-highest priority, given a lookahead depth of d. The remaining tasks have regular priority (see the sketch at the end of this subsection).
• We also added message priorities to the communication subsystem of the runtime system. The output of a high-priority task is assigned a high priority accordingly and is sent out by the communication thread earlier than other messages. Similarly, the receiver processes high-priority messages earlier as well.
• The task window size has been tuned to optimize performance. With a small window size, the runtime system cannot see tasks in other domains or in the following iterations, so there is a lower degree of parallelism. A large window size, however, increases the runtime system overhead due to longer queues and longer access times to search for and resolve data dependencies in the queues.

Figure 4 displays example traces for three different versions of the runtime system. Figure 4 a) shows the trace of the original version, which has significant idle time. After setting appropriate task priorities, the performance improves by 27%, as shown in b). Figure 4 c) shows the trace of our final optimized version after applying all of the above modifications and tunings. The final version is faster than the original one by 35%; the significantly reduced empty gaps (i.e., idle time) are easy to see in the figure.
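As an illustration of the three priority levels just described, the following minimal C sketch (not taken from the TBLAS source; the enum and function names are hypothetical) assigns a priority to a task based on its kernel type and on its panel column relative to the current iteration and the lookahead depth d:

```c
#include <stdio.h>

/* Three-level priority: binary-tree merge tasks first, then tasks whose
 * panel column lies in the lookahead window [k, k+d], then everything else. */
enum task_priority { PRIO_REGULAR = 0, PRIO_LOOKAHEAD = 1, PRIO_MERGE = 2 };
enum task_kind { TASK_DGEQRT, TASK_DTSQRT, TASK_DORMQR, TASK_DTSSSMQR,
                 TASK_DTTQRT, TASK_DTTSSMQR };

static enum task_priority task_priority(enum task_kind kind, int task_col,
                                        int current_col, int lookahead_d) {
    if (kind == TASK_DTTQRT || kind == TASK_DTTSSMQR)
        return PRIO_MERGE;                        /* highest: merge tasks */
    if (task_col >= current_col && task_col <= current_col + lookahead_d)
        return PRIO_LOOKAHEAD;                    /* inside lookahead window */
    return PRIO_REGULAR;
}

int main(void) {
    printf("%d %d %d\n",
           task_priority(TASK_DTTQRT,   3, 2, 2),   /* merge     -> 2 */
           task_priority(TASK_DTSSSMQR, 4, 2, 2),   /* lookahead -> 1 */
           task_priority(TASK_DTSSSMQR, 7, 2, 2));  /* regular   -> 0 */
    return 0;
}
```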
Fig. 3. TBLAS runtime system.
VI. PERFORMANCE EVALUATION

In this section, we provide strong-scalability and weak-scalability results on three different distributed-memory machines. We also present the crossover point of distributed tile CA-QR for matrices that are not tall and skinny.
Fig. 4. Traces for tuning the TBLAS runtime system: (a) the original version; (b) an improved version; (c) the final version. The dark regions denote computation time and the empty gaps denote idle time. After applying a number of modifications, the final version of the TBLAS runtime system has much less idle time and is faster than the original version by 35%.
We conducted experiments on two clusters (Grig and Newton at the University of Tennessee) and a Cray XT5 system (Jaguar at Oak Ridge National Laboratory) to compare tile CA-QR with the ScaLAPACK library. Whenever possible, we use a vendor-optimized ScaLAPACK library. Table II lists the hardware and software resources used in our experiments. The Grig cluster has two cores per node, the Newton cluster has eight cores per node, and the Cray XT5 system has 12 cores per node. On Newton and the Cray XT5 system, we use the Intel MKL and Cray XT LibSci libraries, respectively, to conduct the ScaLAPACK experiments.

A. Strong Scalability

For the strong-scalability experiments, we fix the matrix size and increase the number of cores used to solve it. We then compare the overall number of GFLOPS (i.e., total flops divided by time) between tile CA-QR and ScaLAPACK. The matrix input on the Newton cluster and the Cray XT5 system has 512 × 32 tiles with a tuned tile size of b = 200. The matrix input on the Grig cluster is a bit smaller (due to its smaller memory), that is, 512 × 16 tiles with a tile size of 200. Since the configuration of the process grid Pr × Pc can affect the performance of ScaLAPACK significantly, we tried all possible grid configurations and used the best process grid for ScaLAPACK. We also found that running one MPI process per core provides better performance than running one MPI process per node with multithreaded computational kernels for tall and skinny matrices (i.e., up to 27% faster on the clusters and 150% faster on the Cray XT5 system). Thus we use one MPI process per core in our ScaLAPACK experiments.

Figure 5 displays the overall performance of tile CA-QR and ScaLAPACK on the three systems. On the Grig cluster, as we increase the number of cores from 1 to 64, the performance of tile CA-QR increases from 4.3 GFLOPS to 206 GFLOPS; by contrast, ScaLAPACK increases from 2.4 GFLOPS to 112 GFLOPS. On Newton, between 1 and 128 cores, the performance of tile CA-QR increases from 7.3 to 620 GFLOPS, at which point the parallel efficiency is 66.4%. Then, from 128 to 256 cores, tile CA-QR's parallel efficiency decreases slightly and its overall performance rises to 810 GFLOPS. The performance of ScaLAPACK is lower than that of tile CA-QR, rising from 7.1 to 423 GFLOPS.
On the Cray XT5 system (Fig. 5 c), as the number of cores increases from 1 to 384, tile CA-QR improves from 7.5 to 1,700 GFLOPS, while ScaLAPACK improves from 5.8 to 947 GFLOPS with a slowdown at the end.

B. Weak Scalability

For the weak-scalability experiments, we fix the amount of computation per core: when we double the number of cores, we also double the total amount of computation. Weak scalability demonstrates a program's ability to solve larger problems with more resources. In our experiments, each matrix input has a fixed number of eight tile-columns but a varying number of tile-rows. When we double the number of cores, we double the number of tile-rows in the input. For instance, the input to the single-core experiment has 64 × 8 tiles, and the two-core experiment has a matrix input of 128 × 8 tiles.

Figure 6 shows the performance of the weak-scalability experiments on the three systems. Besides tile CA-QR and ScaLAPACK, we also display the theoretical peak performance and the serial DGEMM performance multiplied by the number of cores for each system; the DGEMM performance serves as an upper bound for all of our experiments. Again, for ScaLAPACK we always choose the best process grid and use the vendor-optimized ScaLAPACK library whenever possible. There are two subfigures for each system: the top subfigure shows the overall number of GFLOPS, and the bottom one shows the number of GFLOPS per core (i.e., the overall GFLOPS divided by the total number of cores). Ideally, the number of GFLOPS per core is constant from 1 to n cores, so that the per-core performance curve is flat.

In Fig. 6, a) and b) display the overall and per-core performance of tile CA-QR and ScaLAPACK on the Grig cluster, respectively. We set the tile size b = 200. As shown in b), the per-core performance of tile CA-QR stays at a rate of 4 GFLOPS, which outperforms ScaLAPACK by nearly four times. On the Newton cluster, the ScaLAPACK experiment invokes the QR factorization subroutine provided by Intel MKL 10.1, and the tile size is set to b = 200. Figure 6 d) shows that the per-core performance of tile CA-QR decreases slowly from 6.9 to 6.0 GFLOPS from 1 to 256 cores.
TABLE II. EXPERIMENT RESOURCES

|                      | Grig cluster          | Newton cluster           | Cray XT5                 |
| Processor            | Intel Xeon 3.2 GHz    | Intel Xeon E5530 2.4 GHz | AMD Opteron 2.6 GHz      |
| Cores per processor  | 1                     | 4                        | 6                        |
| Processors per node  | 2                     | 2                        | 2                        |
| Nodes                | 60                    | 170                      | 18,688                   |
| Memory per node      | 4 GB                  | 16 GB                    | 16 GB                    |
| Peak perf. per core  | 6.4 GFLOPS            | 9.6 GFLOPS               | 10.4 GFLOPS              |
| Serial DGEMM perf.   | 5.6 GFLOPS            | 8.96 GFLOPS              | 9.7 GFLOPS               |
| Network              | Myrinet               | Infiniband               | Cray SeaStar2+           |
| OS                   | Linux 2.6             | Scientific Linux 5.3     | Compute Node Linux 2.2   |
| Compilers            | gcc 64bit 3.4.4       | Intel compilers 11.0     | PGI 9.0.4                |
| MPI lib              | mpich-mx 1.1          | OpenMPI 1.2.8            | Cray XT MPT 3.5.1        |
| BLAS/LAPACK lib      | Goto 1.26             | Intel MKL 10.1           | Cray XT LibSci 10.4.4    |
| ScaLAPACK lib        | Netlib ScaLAPACK 1.8  | Intel MKL 10.1           | Cray XT LibSci 10.4.4    |

Fig. 5. Strong scalability: the input size is fixed while the number of cores increases. (a) Grig cluster; (b) Newton cluster; (c) Cray XT5.

Fig. 6. Weak scalability: the input size increases with the number of cores. (a) Overall performance on Grig; (b) per-core performance on Grig; (c) overall performance on Newton; (d) per-core performance on Newton; (e) overall performance on Cray XT5; (f) per-core performance on Cray XT5.

Fig. 7. Crossover point. The input has a fixed number of 512 tile rows.

Fig. 8. An example of idle processes given a matrix of 8 × 4 tiles distributed across eight processes in a 1D block row layout. Black tiles denote the final result, and red tiles (gray in black and white) denote the first panel of the remaining computation. From the third iteration (i.e., the third tile-column) onward, P0 and P1 have finished all their allocated tasks and become idle.
But ScaLAPACK does not perform as well as tile CA-QR; for instance, the performance of ScaLAPACK on 256 cores is only 1/6 of that of tile CA-QR. On the Cray XT5 system, we use the ScaLAPACK subroutine provided by Cray XT LibSci 10.4.4 and set the tile size b = 300. In Fig. 6 e), as the number of cores increases from 1 to 3,072, tile CA-QR increases from 7.4 GFLOPS to 17.5 TFLOPS, while ScaLAPACK increases from 4.4 GFLOPS to 4.4 TFLOPS. In f), the per-core performance of tile CA-QR decreases gradually from 7.4 GFLOPS to 6.3 GFLOPS between 1 and 12 cores. This performance drop is related to the NUMA node architecture and requires an optimized memory-affinity setup. Afterward, tile CA-QR scales well from 12 to 3,072 cores. By contrast, ScaLAPACK drops from 4.4 to 1.4 GFLOPS per core as we increase the number of cores, which is about 1/4 of that of tile CA-QR.

C. Crossover Point

This section discusses how distributed tile CA-QR behaves when the matrix is not tall and skinny. In this experiment, the matrix has a fixed number of 512 tile-rows and an increasing number of tile-columns. The tile size is set to b = 200. Since we want to treat the number of columns as the only variable, we use a fixed number of 192 cores. We conducted the experiment on the Cray XT5 system; note that 192 cores correspond to 16 nodes.

Figure 7 shows the crossover point as the matrix becomes wider and wider until it is eventually square. The performance of tile CA-QR becomes worse than that of ScaLAPACK once the number of columns exceeds 1/4 of the number of rows. This is because the matrix's 512 tile-rows are distributed to 16 processes by the 1D block distribution: every process is allocated 32 tile-rows and is only responsible for the computation on its own 32 tile-rows. As the algorithm visits and computes the matrix from the top left to the bottom right, more and more processes at the top become idle, which results in load imbalance and poor performance. Figure 8 shows an example of the tile CA-QR factorization that explains the cause of the idle processes. The matrix input has 8 × 4 tiles and is partitioned across eight processes. When the algorithm is working on the third tile-column, processes P0 and P1 become idle until the end of the factorization. A two-dimensional block cyclic distribution, similar to ScaLAPACK's, would be necessary to handle general matrix sizes efficiently and overcome this bottleneck; it would also require a corresponding revision of the algorithm. This is out of the scope of this paper, which focuses on factorizing tall and skinny matrices in a more efficient way.
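To make the load-imbalance effect concrete, the following minimal C sketch (an illustration, not code from the paper) counts how many processes still hold unfinished tile-rows at a given panel under the 1D block row distribution, using the 8-process example of Fig. 8:

```c
#include <stdio.h>

/* Under the 1D block row distribution, process p owns tile-rows
 * [p*rows_per_proc, (p+1)*rows_per_proc). Once the current panel k has moved
 * past all of a process's rows, that process sits idle for the rest of the
 * factorization. */
static int active_processes(int k, int P, int rows_per_proc) {
    int idle = k / rows_per_proc;        /* processes fully above panel k */
    return idle >= P ? 0 : P - idle;
}

int main(void) {
    /* Fig. 8: 8 tile-rows, 4 tile-columns, 8 processes (one tile-row each). */
    for (int k = 0; k < 4; k++)
        printf("panel %d: %d of 8 processes still active\n",
               k, active_processes(k, 8, 1));
    return 0;
}
```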
VII. CONCLUSION AND FUTURE WORK
The QR factorization of tall and skinny matrices is used in many scientific fields that require solving least squares problems. This paper extends an existing algorithm for shared-memory architectures and enables it to work efficiently on modern large-scale distributed-memory systems. We have implemented the algorithm on top of an augmented TBLAS runtime system. The distributed tile CA-QR factorization has a high degree of parallelism and allows for a fully dynamic execution that can largely overlap computation and communication. We have presented the algorithm, its analysis, the extension of the runtime system, and the performance evaluation. Our experiments on two multicore clusters and a Cray XT5 system demonstrate that the tile CA-QR factorization is scalable on up to 3,072 cores and can outperform the ScaLAPACK library by up to 4 times for tall and skinny matrices.

In summary, we make the following contributions: (1) an extension from shared-memory systems to distributed-memory systems; (2) a detailed analysis of the algorithm with respect to operation count, number of messages, and communication volume; (3) an extended TBLAS runtime system that supports an efficient distributed implementation; and (4) the first demonstration of the scalability of the algorithm on large-scale distributed-memory systems.

The design principles and implementation techniques of our tile CA-QR approach can also be applied to other parallel software on modern manycore cluster systems. To achieve high performance, it is important for such software to create fine-grain tasks, execute asynchronously, preserve good data locality, and minimize communication cost. Our future work includes looking for new methods to partition matrices across processes to improve load balance for general matrix sizes, and applying the approach to other linear algebra problems on distributed-memory multicore systems.
REFERENCES

[1] E. Anderson, Z. Bai, C. Bischof, L. Blackford, J. Demmel, J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide. SIAM, 1992.
[2] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley, ScaLAPACK Users' Guide. SIAM, 1997.
[3] L. S. Blackford, J. Choi, A. Cleary, A. Petitet, R. C. Whaley, J. Demmel, I. Dhillon, K. Stanley, J. Dongarra, S. Hammarling, G. Henry, and D. Walker, "ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance," in Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 1996, p. 5.
[4] E. Agullo, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, J. Langou, H. Ltaief, P. Luszczek, and A. YarKhan, "PLASMA Users' Guide," ICL, UTK, Tech. Rep., 2010.
[5] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, "A class of parallel tiled linear algebra algorithms for multicore architectures," Parallel Comput., vol. 35, no. 1, pp. 38-53, 2009.
[6] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, "Parallel tiled QR factorization for multicore architectures," Concurr. Comput.: Pract. Exper., vol. 20, no. 13, pp. 1573-1590, 2008.
[7] B. Hadri, H. Ltaief, E. Agullo, and J. Dongarra, "Tile QR factorization with parallel panel processing for multicore architectures," in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), 2010.
[8] J. W. Demmel, L. Grigori, M. F. Hoemmen, and J. Langou, "Communication-optimal parallel and sequential QR and LU factorizations," UTK, LAPACK Working Note 204, August 2008.
[9] F. Song, A. YarKhan, and J. Dongarra, "Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems," in SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1-11.
[10] W. M. Gentleman, "Row elimination for solving sparse linear systems and least squares problems," Numer. Anal., Lect. Notes Math., vol. 506, pp. 122-133, 1976.
[11] A. Pothen and P. Raghavan, "Distributed orthogonal factorization: Givens and Householder algorithms," SIAM Journal on Scientific and Statistical Computing, vol. 10, pp. 1113-1134, 1989.
[12] J. Langou, "Computing the R of the QR factorization of tall and skinny matrices using MPI_Reduce," University of Colorado, Denver, Tech. Rep., 2010.
[13] E. Agullo, C. Coti, J. Dongarra, T. Herault, and J. Langou, "QR factorization of tall and skinny matrices in a grid computing environment," in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), 2010.
[14] F. Song, "Static and dynamic scheduling for effective use of multicore systems," PhD dissertation, University of Tennessee, 2009. http://trace.tennessee.edu/utk_graddiss/634
[15] Intel, "Intel Trace Analyzer Reference Guide (Revision 8.0)," http://software.intel.com/en-us/intel-trace-analyzer, 2010.