2011 IEEE International Parallel & Distributed Processing Symposium

Multifrontal Factorization of Sparse SPD Matrices on GPUs

Thomas George∗, Vaibhav Saxena∗, Anshul Gupta†, Amik Singh‡ and Anamitra R. Choudhury∗
∗High Performance Computing Group, IBM Research India, New Delhi, India 110070. Email: thomasgeorge, vaibhavsaxena, [email protected]
†Department of Mathematical Sciences, IBM T.J. Watson Research Center, Yorktown Heights, USA. Email: [email protected]
‡Department of Electronics and Computer Engineering, IIT Roorkee, India. Email: [email protected]

Abstract—Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers, resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently due to the complexity involved in an efficient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.

Keywords—Multifrontal method; Sparse factorization; Auto-tuning; CUDA

I. INTRODUCTION

The solution of large sparse systems of linear equations of the form Ax = b is an integral part of a variety of scientific computing applications. Here A is the sparse coefficient matrix, b is the right-hand side, and x is the vector of unknowns that needs to be computed. Direct solvers are used in many of these applications due to their high performance on moderate-size problems, robustness, accuracy, and the potential for reusing the factorization when solving multiple systems with the same coefficient matrix. In a large class of applications, the coefficient matrix A is symmetric positive definite (SPD). For SPD linear systems, the solution involves a Cholesky factorization of the coefficient matrix A into $LL^T$, followed by a forward triangular solve Ly = b and a backward triangular solve $L^T x = y$. There are multiple variants of sparse Cholesky decomposition. Multifrontal factorization [1] is one such approach that has resulted in some of the most efficient and scalable implementations to date [2], [3], [4] on dedicated supercomputers.

In recent years, an alternative paradigm based on GPUs has gained widespread popularity, primarily due to their efficient power usage, high Flop-to-price ratio, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. GPUs provide cheap access to enormous computing power and have spurred a lot of research activity in porting compute-intensive applications to GPUs. In the domain of linear system solvers, most of the past work involving GPUs has targeted dense direct solvers [5], [6], [7] and sparse iterative solvers [8], [9]. To our knowledge, there have been only a few published efforts that perform a preliminary analysis of accelerating sparse direct solvers on GPUs [10], [11], [12]. The approach followed in all these papers involves off-loading the computationally expensive dense kernels to the GPU in order to improve the overall time. The speedups reported in these works are similar to what one would expect from a multi-threaded (4-8 threads) run of a sparse direct solver, which raises a natural question: have we hit a wall with respect to extracting more performance for applications involving sparse matrix operations? Another interesting question worth exploring is whether there is any benefit at all in implementing a sparse direct solver entirely on the GPU. These questions are difficult to answer without a detailed analysis of the performance bottlenecks and potential fixes, which is lacking in the existing work on porting sparse direct solvers to GPUs. When we consider this space in detail, another important observation is that there are multiple choices of workload division between the host CPU and GPU depending on the characteristics of the input problem. Whether there is an intelligent way to design a hybrid approach that can combine the best of both worlds is another key question.
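For illustration only, the dense analogue of the solve sequence just described (factor A = LL^T, then the forward and backward triangular solves) can be sketched with standard LAPACK calls; the multifrontal solver performs the same three phases on a sparse factor L. The function name and interface below are assumptions of this sketch, not part of any solver discussed in the paper.

```c
/* Dense analogue of the SPD solve described above (A = L L^T, then
   L y = b and L^T x = y), sketched with LAPACKE calls purely for
   illustration; the multifrontal method applies the same phases to a
   sparse L. */
#include <lapacke.h>

/* A is n x n SPD (column-major, lower triangle used); b holds the
   right-hand side on input and the solution x on output.
   Returns the LAPACK info code (0 on success). */
int spd_solve_dense(int n, double *A, double *b)
{
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, n);      /* A = L L^T        */
    if (info != 0) return info;
    return LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', n, 1, A, n, b, n); /* forward/back solve */
}
```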


The current paper attempts to address the above questions and makes the following contributions.

• We present a detailed analysis of various aspects of the performance of the factor-update operation, which is a critical component of sparse multifrontal factorization, on a single CPU-GPU combination. We consider a basic GPU implementation that involves offloading the dense matrix operations to the GPU for this purpose. Using empirical data from several matrices from real applications, we analyze the various computational and data transfer costs, the relative performance on the CPU vs. the GPU, and the utilization relative to peak performance. We use this analysis to identify the key performance bottlenecks and set reasonable expectations for the performance improvement obtainable by just offloading the most time-consuming dense kernels to the GPU.

• We identify and implement three alternate policies to schedule and distribute the computational workload between the CPU and GPU. These three policies, along with the host CPU implementation, are most effective for different ranges of sizes of the dense blocks. We use this observation to identify a new baseline hybrid policy that selects one of the four policies for each factor-update computation based on its number of floating-point operations.

• In order to best exploit the relative strengths of the different policies, we propose a novel statistical model-based hybrid approach that determines the best policy in terms of the characteristics of different parts of the input matrix while being sensitive to the computational costs. Specifically, our approach relies on modeling the best policy predictor as a parametric classifier and estimating the classifier parameters such that the expected computation time is minimized. The model-based hybrid approach boosts the speedup by 5–10% over the baseline hybrid scheme and brings the performance to within 2% of an ideal hybrid of the four policies. The model-based approach can also be readily adapted for an extended set of policies as well as for future GPU architectures.

• We also present performance results using the model-based hybrid approach for multiple CPU-GPU combinations as well as a comparison with a multithreaded run on the CPU.

The rest of the paper is organized as follows. Section II provides background on sparse multifrontal factorization, GPU-based computing, and the GPU BLAS libraries. Section III describes the experimental setup for our performance analysis and evaluation. In Section IV, we analyze various aspects of the performance of the factor-update operation on a basic CPU/GPU implementation and identify potential areas of improvement, and then suggest alternative implementation policies in Section V. Section VI describes our model-based hybrid approach and provides an empirical evaluation. Section VII discusses related work. Section VIII contains concluding remarks.

II. BACKGROUND

A. Sparse Direct Factorization

In this section, we briefly describe the main components of multifrontal sparse Cholesky factorization [1]. We use a supernodal variant of the multifrontal method, which permits factoring blocks of columns at a time. More details on this approach are given in [3], [13].

Consider the Cholesky factorization of a sparse symmetric positive definite matrix A. The algorithm performs a postorder traversal of the supernodal elimination tree associated with A and computes the frontal matrix $F^n$ and the update matrix $U^n$ associated with every supernode n of the elimination tree. For a supernode n, $F^n$ is an (s + 1) × (s + 1) symmetric matrix, where s is the number of non-zeros in the lower triangular part of the first column of supernode n. For the leaf supernodes, the initial frontal matrices include the contributing entries from A and the remaining entries are zero. For non-leaf nodes, $F^n$ is constructed by merging the contributing entries from A with the update matrices of all the children of supernode n via an extend-add operation. The update matrix $U^n$ is generated from the fully populated $F^n$ through a block step of the standard dense Cholesky factorization with nodes n . . . n + k as pivots, where k is the size (number of consecutive rows and columns with the same nonzero pattern in L) of the supernode n. This block dense Cholesky step is what we will refer to as the factor-update or F-U operation in this paper. The m × m update matrix $U^n$ resulting from performing the F-U operation on $F^n$ is passed on to the parent of node n in the elimination tree as an operand for the extend-add operation at the parent node. The process terminates when the Cholesky factorization is computed at the root supernode.

We now briefly describe the main components of F-U operations. The first step is a single step of the dense Cholesky factorization on the frontal block of size s × k. Let s = m + k. This step can be divided into a dense Cholesky operation (potrf) on the k × k matrix $L_1^n$ and a triangular solve (trsm) on the m × k matrix $L_2^n$. The last step of the F-U operation is the update of the m × m matrix $U^n$. This is essentially a symmetric rank-k update of the form $U^n = U^n - L_2^n L_2^{nT}$ (syrk). These steps are detailed in Figure 1.

It is well known that the computation of the frontal matrix $F^n$ and the update matrix $U^n$ is the most computationally intensive component of the multifrontal algorithm. Our experiments indicate that these operations consume nearly 90% of the total running time for large matrices. Thus, it is imperative to target this operation to accelerate multifrontal factorization.
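To make the decomposition concrete, the following sketch expresses one F-U step in terms of the standard host BLAS/LAPACK routines named above. It assumes a column-major frontal matrix of order s = m + k laid out as in Figure 1; the function and variable names are illustrative and are not taken from WSMP.

```c
/* Sketch of one factor-update (F-U) step on the host, following the
   potrf/trsm/syrk decomposition described above.  F is a column-major
   frontal matrix of order s = m + k with leading dimension s; only the
   lower triangle is referenced. */
#include <cblas.h>
#include <lapacke.h>

/* Returns the LAPACK info code from dpotrf (0 on success). */
int factor_update(double *F, int k, int m)
{
    int s = m + k;                        /* order of the frontal matrix */
    double *L1 = F;                       /* k x k pivot block           */
    double *L2 = F + k;                   /* m x k sub-diagonal block    */
    double *U  = F + (size_t)k * s + k;   /* m x m update block          */

    /* potrf: L1 <- Cholesky factor of the k x k pivot block */
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', k, L1, s);
    if (info != 0) return info;

    /* trsm: L2 <- L2 * L1^{-T} (triangular solve from the right) */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, m, k, 1.0, L1, s, L2, s);

    /* syrk: U <- U - L2 * L2^T (symmetric rank-k update) */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, m, k,
                -1.0, L2, s, 1.0, U, s);
    return 0;
}
```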



Figure 1. The three dense matrix operations in a single F-U call: potrf on the k × k block $L_1^n$, trsm on the m × k block $L_2^n$, and syrk producing the m × m update matrix $U^n$.


B. GPU-based Computing

GPUs are multi-threaded, many-core co-processors to the CPU and were traditionally used for processing graphics-related tasks. The large computation power coupled with the very high memory bandwidth of GPUs makes them very attractive for general-purpose, computation-intensive, data-parallel tasks as well. CUDA [14] is a parallel computing architecture designed to enable general-purpose computing on NVIDIA GPUs. In the CUDA architecture, a GPU is modeled as a set of SIMD multiprocessors (SMs), each consisting of a set of scalar processor cores (SPs). The SPs of an SM execute the same instruction simultaneously, but on different data points. The GPU has a large device global memory with high bandwidth but relatively high latency. In addition, each SM also contains a very fast, low-latency on-chip shared memory.

In the CUDA programming model, a host program runs on the CPU and launches a kernel program to be executed on the GPU device in parallel. Prior to launching the kernel, the host program transfers the required data from host (CPU) memory to device (GPU) memory. Once the kernel has finished its execution on the GPU, the host program transfers the computed results from the device memory back to the host memory. The kernel executes as a grid of one or more thread blocks. Each thread block is dynamically scheduled to be executed on a single SM. The threads of a thread block cooperate with each other by synchronizing their execution and efficiently sharing resources on the SM such as shared memory and registers. Threads within a thread block get executed on an SM in scheduling units of 32 threads called a warp. Global memory is used most efficiently when multiple threads simultaneously access words from a contiguous aligned segment of memory, enabling the GPU hardware to coalesce these memory accesses into a single memory transaction. The Nvidia Tesla T10 GPU used in the present work contains 4 GB of off-chip device memory and 16 KB of on-chip shared memory per SM.

C. GPU BLAS Library

CUBLAS [15] is a CUDA-based implementation of BLAS (Basic Linear Algebra Subprograms) optimized for a single GPU. The CUBLAS library abstracts out the implementation details of the BLAS routines, such as the number of thread blocks, the number of threads per thread block, and the matrix data tiles used, from the user. Existing host applications using BLAS can take advantage of CUBLAS routines by transferring the input vectors/matrices to the GPU, calling the CUBLAS routines, and then transferring the output vectors/matrices back to the host from the GPU. The three CUBLAS routines we are mainly interested in are trsm, gemm, and syrk, as we offload these operations to the GPU.
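The transfer-call-transfer pattern just described can be sketched as follows for a symmetric rank-k update in single precision. The sketch uses the modern handle-based cuBLAS API for clarity, whereas the implementation in this paper used CUBLAS 2.3; error checks are omitted and the function name is illustrative.

```c
/* Minimal sketch of the offload pattern described above: copy operands to
   the GPU, call a CUBLAS routine, and copy the result back. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* C (n x n, lower triangle) <- C - A * A^T, with A an n x k matrix,
   both stored column-major on the host. */
void syrk_on_gpu(const float *A, float *C, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *dA, *dC;
    cudaMalloc((void **)&dA, sizeof(float) * (size_t)n * k);
    cudaMalloc((void **)&dC, sizeof(float) * (size_t)n * n);

    /* Host -> device transfers over PCIe */
    cudaMemcpy(dA, A, sizeof(float) * (size_t)n * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(float) * (size_t)n * n, cudaMemcpyHostToDevice);

    const float alpha = -1.0f, beta = 1.0f;
    cublasSsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, n, k,
                &alpha, dA, n, &beta, dC, n);

    /* Device -> host transfer of the updated block */
    cudaMemcpy(C, dC, sizeof(float) * (size_t)n * n, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dC);
    cublasDestroy(handle);
}
```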

III. EXPERIMENTAL SETUP

In this section, we describe the details of the hardware configuration, base code, and test matrices used for our analysis.

A. Hardware Configuration

We used an HS21 blade server connected to an Nvidia Tesla T10 GPU for performing our experiments. The HS21 has two dual-core Intel Xeon 5160 processors running at 3.0 GHz with 4 MB of shared L2 cache and 32 GB of main memory. This host system is connected to two Tesla T10 GPUs of a Tesla S1070 GPU system through a PCIe x8 interface. Table I provides more details. The Tesla T10 GPU has a peak floating-point performance of 624 GFlops/s for single precision and 78 GFlops/s for double precision. A single core of the host Xeon processor has a peak floating-point performance of 24 GFlops/s for single precision and 12 GFlops/s for double precision. The host code was compiled using the GCC 4.4.0 compiler with the -O3 optimization flag. Note that for most of our analysis in Sections IV and V, we use only one CPU and one GPU. However, our implementation can simultaneously use multiple CPUs and GPUs, and Section VI presents performance results using 2 CPU threads and 2 GPUs.

Table I. GPU specification (Tesla T10).
  Architecture Type:   multithreaded SIMD (SIMT)
  Clock (GHz):         1.3
  Scalar Cores:        240 (30 x 8)
  Memory b/w (GB/s):   102 (device), 2 (PCIe x8)
  Memory size:         4 GB
  Local Store (KB):    16 per SM
  SDK:                 CUDA 2.3
  Compiler:            nvcc (-O3)

B. Software and Libraries

We used the Watson Sparse Matrix Package (WSMP) [16] as the base code for our experiments. WSMP has a highly efficient serial and multithreaded implementation of the multifrontal method [2]. For our experiments, we linked WSMP with the ATLAS [17] library to provide BLAS functionality on the host machine. The various GPU implementations involve offloading parts of the F-U operations to the GPU using Level-3 BLAS calls from CUBLAS (version 2.3). An important point to note is that WSMP is in double precision only, while CUBLAS can handle both single and double precision. Since the peak double-precision performance of the T10 GPU is significantly (8×) lower than its single-precision performance (the latest Fermi offering from Nvidia is expected to improve double-precision performance significantly), we used single precision in the computations performed in CUBLAS. This did reduce the number of accurate digits in the solution of the sparse system, but the lost accuracy could be readily regained by one or two steps of iterative refinement using double-precision sparse matrix-vector multiplication.
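A minimal sketch of this mixed-precision recovery is given below: the factorization and correction solves run in single precision, and a few refinement steps with a double-precision sparse matrix-vector product restore the lost digits. The two hooks spmv_double() and solve_single() stand in for the solver's own routines and are assumptions of this sketch, not WSMP interfaces.

```c
/* Iterative refinement sketch under the assumptions stated above. */
#include <stdlib.h>

void spmv_double(const void *A, const double *x, double *y);  /* y = A*x in double      */
void solve_single(const void *factors, double *rhs);          /* rhs <- A^{-1} rhs using
                                                                  single-precision factors */

void refine(int n, const void *A, const void *factors,
            const double *b, double *x, int steps)
{
    double *r = (double *)malloc(n * sizeof(double));
    for (int it = 0; it < steps; ++it) {
        spmv_double(A, x, r);                              /* r = A*x            */
        for (int i = 0; i < n; ++i) r[i] = b[i] - r[i];    /* r = b - A*x        */
        solve_single(factors, r);                          /* correction d = A^{-1} r */
        for (int i = 0; i < n; ++i) x[i] += r[i];          /* x = x + d          */
    }
    free(r);
}
```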

C. Test Matrices

Table II lists the SPD matrices used in our experiments, along with their dimension and number of non-zeros. All the matrices are obtained from 3-D structural analysis problems in automotive modeling, metal forming, etc.

Table II. SPD test matrices with their order (N) and number of non-zeros (NNZ).
  Matrix      N          NNZ
  audikw 1    943695     77651847
  kyushu      990692     26268136
  lmco        665017     107514163
  nastran-b   1508088    111614436
  sgi 1M      1522431    125755875


IV. DENSE KERNEL PERFORMANCE ON GPU

In this section, we analyze various aspects of the performance of factor-update (F-U) operations on a CPU-GPU combination for a basic GPU-based implementation. The basic implementation involves offloading the dense matrix operations syrk and trsm to the GPU (shown in Figure 1) while the dense Cholesky potrf is performed on the host. We ran both the serial host CPU implementation and the basic GPU implementation over the set of five matrices in Table II. We measured the timings of the various computation and data transfer components for each F-U call along with the input matrix dimensions m and k, following the notation in Section II-A. Based on this empirical timing data, we make some key observations on the relative distribution of the computational load with respect to the matrix dimensions, the bandwidth, the flop rate, and the relative performance of the three dense kernels.

A. Computation Load Distribution vs. Matrix Dimensions

On analyzing the distribution of F-U calls across an m × k grid using bins of size 500 × 500, we observe that most of the calls involve small matrices. In fact, approximately 97% of the calls are concentrated in the range k ≤ 500 and m ≤ 1000. However, this accounts for a small fraction of the total computation time (5-6%). Figures 2(a)-(c) show the fractional time spent on the calls in each bin for the host CPU implementation, the basic GPU implementation, and the basic GPU implementation excluding the data transfer time. We observe that a large fraction of the computation time comes from function calls dealing with moderate and large sized matrices, which makes it especially important to optimize these large matrix calls. In the case of the host CPU implementation, the disparity is quite evident, while it is less so in the case of the GPU implementation with and without the copy operations. A comparison between Figures 2(b) and 2(c) reveals that if one includes the time for copy operations, a much higher fraction of the time is spent on smaller matrices than on the larger ones.

Figure 2. Distribution of computation time across an m × k grid using bins of size 500 × 500: (a) fraction of computation time spent on the CPU; (b) fraction spent for the basic GPU implementation including copy time; (c) fraction spent for the basic GPU implementation excluding copy time. Note that the colorbars have different scales.


B. Flop Rate and Bandwidth Utilization

As discussed earlier in Section II-A, each F-U call in the host CPU implementation can be broken down into the three dense kernels potrf, trsm, and syrk. In the case of the basic GPU implementation, there are additional data transfer operations associated with copying the matrices $L_1^n$, $L_2^n$, and $U^n$ to/from GPU memory. Specifically, the trsm computation requires copying the matrices $L_1^n$, $L_2^n$ from the host CPU to the GPU and copying $L_2^n$ back to the host. The syrk computation only requires copying $L_2^n L_2^{nT}$ from the GPU to the host, since $U^n$ can be updated locally on the host after $L_2^n L_2^{nT}$ has been computed on the GPU.

Let $\alpha_P^{CPU}$, $\alpha_T^{CPU}$, $\alpha_S^{CPU}$ denote the average performance on the CPU for the kernels potrf, trsm, and syrk, respectively, while $\alpha_P^{GPU}$, $\alpha_T^{GPU}$, $\alpha_S^{GPU}$ denote the same in the case of the GPU. Table III shows these performance numbers in terms of the flop rate (Flops/s) computed from the data as well as the utilization relative to the theoretical maximum for a single core of the host CPU (12 GFlops/s for double precision) and the GPU (624 GFlops/s for single precision). We estimated the performance for the individual routines separately so as to account for the various optimizations within the routines as well as their range of operation. An important point to note here is that the utilization rate, in general, steadily increases with the number of operations and stabilizes only for large operation counts. Figure 4 shows the achieved performance for a few of the large trsm and syrk calls for the CPU and GPU implementations. The values presented in Table III correspond to the stabilized values and can be used to estimate asymptotic performance (i.e., for large matrices). In the case of the GPU implementation, we also need to consider the average bandwidth β achieved for copying the matrices $L_1^n$, $L_2^n$, $U^n$, which we observed to be approximately 1.4 GB/s. This can be attributed to the slower PCIe x8 interface to the GPU in our system. Using the above values, one can estimate the time taken by the host CPU implementation as



$$T_{CPU} = N_P/\alpha_P^{CPU} + N_T/\alpha_T^{CPU} + N_S/\alpha_S^{CPU}, \qquad (1)$$


and that of the basic GPU implementation, including copy costs, as

$$T_{GPU} = N_P/\alpha_P^{CPU} + N_T/\alpha_T^{GPU} + N_S/\alpha_S^{GPU} + N_D(L_1^n, L_2^n)/\beta + N_D(L_2^n L_2^{nT})/\beta, \qquad (2)$$

where $N_P = k^3/3$, $N_T = mk^2$, and $N_S = m^2 k$ are the (asymptotic) numbers of operations for the kernels, and $N_D(L_1^n, L_2^n) = k^2 + 2mk$ and $N_D(L_2^n L_2^{nT}) = m^2$ are the sizes of the data transfers. Figure 3 shows a plot of the theoretical speedup (shown in red) that can be achieved using the basic GPU implementation along with the observed values. For large matrix calls, the copying times are negligible relative to the computation costs, so the achievable speedup depends primarily on the α's. In practice, however, this speedup cannot be readily achieved due to the memory limitations of the GPU, which require deployment and coordination among multiple CPUs and GPUs to handle large matrices. Further, as one can see from Figure 3, the actual empirical speedups vary with respect to the theoretical ones because the performance of the dense kernels for small and moderate matrices is far from the idealized model in Equations 1 and 2.
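The cost model of Equations (1) and (2) can be evaluated directly; the sketch below uses the stabilized rates from Table III and the observed PCIe bandwidth of about 1.4 GB/s. The data-transfer terms follow N_D in the text and are converted to bytes assuming 4-byte single-precision elements; all constants and names are illustrative.

```c
/* Sketch of the cost model in Equations (1) and (2). */
static const double A_CPU_P = 8.84e9,  A_CPU_T = 9.24e9,  A_CPU_S = 10.02e9;
static const double A_GPU_T = 153.7e9, A_GPU_S = 159.69e9;
static const double BETA_BYTES = 1.4e9;                 /* PCIe x8 bandwidth (bytes/s) */

void estimate_fu_times(double m, double k, double *t_cpu, double *t_gpu)
{
    double n_p = k * k * k / 3.0;                        /* potrf flops */
    double n_t = m * k * k;                              /* trsm flops  */
    double n_s = m * m * k;                              /* syrk flops  */
    double n_d = 4.0 * ((k * k + 2.0 * m * k) + m * m);  /* bytes moved (assumed 4 B/elem) */

    *t_cpu = n_p / A_CPU_P + n_t / A_CPU_T + n_s / A_CPU_S;                      /* Eq. (1) */
    *t_gpu = n_p / A_CPU_P + n_t / A_GPU_T + n_s / A_GPU_S + n_d / BETA_BYTES;   /* Eq. (2) */
}
```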

Figure 3. Theoretical speedup that can be achieved for the basic GPU implementation based on the observed flop rate along with the actual observed speedup.

Table III. Average stabilized flop rates.
  Kernel:      potrf (CPU)   trsm (CPU)   syrk (CPU)   trsm (GPU)   syrk (GPU)
  GFlops/s:    8.84          9.24         10.02        153.7        159.69
  %Peak:       73.7          76.99        83.49        24.63        25.59

Figure 4. Observed flop rate for syrk and trsm on the CPU and the GPU (Flops/s vs. number of operations).

C. Distribution of Computation Time

We now consider how the distribution of computation time for the different components, both for the host CPU and the GPU implementations, varies with the size of the input matrix. Figure 5 shows the observed timings on a log scale on the y-axis with the total number of operations on the x-axis. To enable comparison across the components, Figure 6 shows the fractional timings in a similar fashion, where the normalization is performed across the components. From the figures, we note that the time-consuming components

as well as the relative performance of the CPU vs. the GPU vary depending on the overall computational load. In particular, trsm and syrk on the GPU are more expensive than on the CPU for small matrix calls (#Ops < 10^5), while we observe the opposite behavior for larger ones (#Ops > 10^8).


Table IV. Total time in seconds taken for all the potrf calls for the various matrices, along with the percentage of total time.
  Matrix      potrf (sec)   %Host   %GPU w/o copy   %GPU w/ copy
  audikw 1    28.75         5.43    43.28           29.54
  kyushu      96.43         7.48    55.50           46.17
  lmco        20.86         7.10    48.32           30.83
  nastran-b   17.53         5.95    39.66           24.46
  sgi 1M      41.87         5.15    41.48           27.85

Figure 5. Observed timings for various components of the host CPU and basic GPU implementations.

D. POTRF

Since the dense Cholesky operations are performed only on the host CPU, the main question to consider is what the relative time consumption of potrf is and how critical it is to optimize this component. Table IV shows the total time taken by the potrf calls for the various matrices along with the percentage of the total time for the entire set of F-U calls for the three variants. In the case of the host CPU timing measurements, potrf takes very little time (less than 8% of the total time), which justifies our choice of targeting syrk and trsm as a first step for our basic GPU implementation. However, in the case of the GPU implementation, the potrf calls performed on the CPU become a bigger fraction (24-46% for the case with copy times included) of the total time. Furthermore, a major chunk of these potrf calls corresponds to special cases where m = 0, or k is comparable to m, which occur close to the root of the elimination tree. For example, in the case of the matrix kyushu, the total time for potrf is 97 seconds, out of which 86% is spent on just three calls, and the top ten calls account for almost 96% of the total potrf time. Exploring a GPU-based implementation for handling these special cases is required to improve the overall performance.

E. TRSM

Table III shows the observed flop rate obtained for the trsm routine on the host and the CUBLAS trsm routine on the GPU. However, to analyze the relative benefits of performing trsm on the host CPU vs. the GPU, we also need to consider the costs of copying the matrices $L_1^n$, $L_2^n$ from the CPU to the GPU and back. In general, since it is possible to overlap the copying with the kernel execution on the T10 GPU, we consider two variants for the GPU implementation, one including the synchronous copy costs entirely and the other without, viewing these as upper and lower bounds on the costs associated with a more sophisticated implementation. Figure 7 shows the flop rate of the three variants (CPU, GPU with copy, and GPU without copy) with increasing number of operations (mk^2). From the plot, it is fairly clear that the flop rate varies with the number of operations and that there is a tipping point above which it is beneficial to perform trsm on the GPU. This tipping point depends on whether the copy times are included (∼3 × 10^6 Ops.) or not (∼4 × 10^5 Ops.). These transition points were obtained based on different choices of m and k (by considering the difference in the flop rates, fitting a polynomial curve, and estimating its zeros) and are fairly robust, but are of course specific to the chosen host CPU implementation and the GPU configuration. Schenk et al. [10] also refer to the existence of such transition points; however, they do not consider block triangular solves.

Figure 6. Fractional timings for various components of host CPU and basic GPU implementation.


Figure 7. Flop rate (in Flops/s) of different variants of trsm (host CPU, GPU with copy, GPU without copy).



F. SYRK

In the case of syrk, we consider three variants (CPU, GPU with $L_2^n L_2^{nT}$ copy costs, and GPU without any copy costs), as in the case of trsm. Inclusion of the copy costs for $L_1^n$, $L_2^n$ does not make a substantial difference to the effective flop rate. Figure 8 shows the flop rate of these three variants with increasing number of operations (m^2 k). The jagged behavior of the GPU variants is due to the fact that m^2 k is only an approximate indicator of the exact number of operations, which depends on the data tile sizes used in the GPU operations. From the plot, we clearly observe that it is beneficial to offload syrk to the GPU for large matrices. To obtain the transition point, we examine the plot and find that there is a substantial difference in the behavior of the GPU variants with and without copy. In particular, when copy costs are not included, the transition point between CPU and GPU occurs at 1.5 × 10^5 Ops., while in the case of the GPU with copy costs, there is a fairly large range (10^6 - 10^7 Ops.) where there is no clear winner. The transition point for the variant including copy costs also occurs at a much later point, indicating that optimizing the copy costs is critical for reducing the overall computation time, especially for moderate sized matrices.

Figure 8. Flop rate of different variants of syrk (host CPU, GPU with copy, GPU without copy).

V. ALTERNATIVE GPU IMPLEMENTATION POLICIES

In this section, we present alternative policies for scheduling the workload and copy operations that address some of the issues discussed in the previous section. We also compare the relative performance of these alternative policies and arrive at a baseline hybrid approach that invokes these policies for different ranges of the total number of operations of the F-U call.

A. Workflow Optimizations

1) POTRF on GPU: As observed in Table IV, the cost of the potrf calls on the CPU can become a major fraction of the total GPU time close to the root of the elimination tree. In order to address these special cases, we perform an overlapping of the three operations potrf, trsm, and syrk. We work with blocks of columns of a specified width w at a time for both $L_1^n$ and $L_2^n$. More specifically, a light-weight GPU kernel is written for performing potrf on a w × w matrix. Since we have $L_1^n$ and $L_2^n$ in a single data structure, we perform trsm on a (m + k - w) × w matrix which spans both $L_1^n$ and $L_2^n$. The bottom right lower triangular part of $L_1^n$ is updated using a syrk operation while the remaining part of $L_2^n$ is updated using a gemm operation. Finally, we perform a syrk operation to partially update $U^n$. This process is repeated until we have exhausted all the columns in $L_1^n$. For this approach, all the operations are performed on the GPU and no attempt is made to overlap computations with the CPU. Figure 9 shows the details of the algorithm. Table V shows the improved flop rate and speedup for the potrf operation using this algorithm. The maximum speedup achieved is 13.

Figure 9. Overlapped potrf, trsm, and syrk: $L_1^n$ and $L_2^n$ are processed in column panels of width w, with syrk/gemm updates applied to the trailing blocks and to $U^n$.

Table V. potrf flop rate and speedup obtained using the GPU implementation (m = 0).
  Matrix      k (m=0)   CPU (GFlops/s)   potrf (GPU) (GFlops/s)   Speedup
  audikw 1    5418      8.98             69.60                    7.75
  kyushu      10592     9.44             123.95                   13.13
  lmco        5353      8.75             67.73                    7.74
  nastran-b   5682      9.02             71.71                    7.95
  sgi 1M      7014      9.18             80.42                    8.76
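A minimal sketch of this panel-wise GPU factorization is given below for policy P4 (all of potrf, trsm, and syrk on the GPU). It assumes the s × s frontal matrix (s = m + k) is already resident in device memory in column-major layout; potrf_block_on_gpu() stands in for the light-weight w × w Cholesky kernel mentioned in the text. As a simplification, the trailing update is done with a single syrk over the whole trailing lower triangle, whereas the text splits it into syrk, gemm, and a second syrk.

```c
/* Panel-wise GPU factor-update sketch under the assumptions stated above. */
#include <cublas_v2.h>

void potrf_block_on_gpu(float *dBlock, int nb, int ld);   /* assumed custom kernel */

void factor_update_on_gpu(cublasHandle_t handle, float *dF, int m, int k, int w)
{
    const int s = m + k;
    const float one = 1.0f, minus_one = -1.0f;

    for (int j = 0; j < k; j += w) {
        int jb = (k - j < w) ? (k - j) : w;          /* current panel width      */
        float *diag  = dF + (size_t)j * s + j;       /* jb x jb diagonal block   */
        float *panel = diag + jb;                    /* rows below the block     */
        float *trail = dF + (size_t)(j + jb) * s + (j + jb);

        potrf_block_on_gpu(diag, jb, s);             /* small Cholesky           */

        /* panel <- panel * diag^{-T}: covers the rest of L1 and all of L2 */
        cublasStrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                    s - j - jb, jb, &one, diag, s, panel, s);

        /* trailing update, including the partial update of U^n */
        cublasSsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    s - j - jb, jb, &minus_one, panel, s, &one, trail, s);
    }
}
```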

2) Overlapping Copy with Computation: Data transfers between the host and the GPU are fairly expensive relative to the GPU computation cost, especially for small matrices. A natural choice is to overlap the copying of the matrices with the computational tasks wherever possible. For example, while potrf is performed on the host, $L_2^n$ could be transferred simultaneously. Similarly, while updates to $U^n$ are being computed on the GPU, $L_2^n$ could be copied back to the host. The asynchronous copy operations require the use of pinned host memory, which allows faster transfer of data between the host and the GPU. However, each call to allocate a chunk of pinned memory is prohibitively expensive when the data to be copied is not large enough to provide substantial savings in transfer time. For a sparse matrix factorization, the supernodes are typically small and frequent allocation calls degrade the overall performance of the solver. In order to address this issue, any allocation/deallocation is triggered only when the maximum allocated size over all the previous calls is insufficient. We follow this policy even for memory allocated on the GPU.
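The sketch below illustrates this combination of a reused pinned staging buffer and an asynchronous host-to-device copy issued on a separate stream, so the transfer can proceed while the host performs potrf. All names are illustrative and the staging/synchronization structure is an assumption of the sketch, not the WSMP implementation.

```c
/* Copy/compute overlap sketch: pinned staging buffer grown only on demand,
   asynchronous transfer of L2 hidden behind host-side potrf. */
#include <cuda_runtime.h>
#include <string.h>

static float  *pinned_buf = NULL;   /* reused pinned staging buffer */
static size_t  pinned_cap = 0;

/* Grow the pinned buffer only when the previous allocation is too small. */
static float *get_pinned(size_t bytes)
{
    if (bytes > pinned_cap) {
        if (pinned_buf) cudaFreeHost(pinned_buf);
        cudaHostAlloc((void **)&pinned_buf, bytes, cudaHostAllocDefault);
        pinned_cap = bytes;
    }
    return pinned_buf;
}

void copy_L2_async(const float *L2, float *dL2, int m, int k,
                   cudaStream_t copy_stream)
{
    size_t bytes = sizeof(float) * (size_t)m * k;
    float *staging = get_pinned(bytes);
    memcpy(staging, L2, bytes);                  /* pack into pinned memory   */
    cudaMemcpyAsync(dL2, staging, bytes,
                    cudaMemcpyHostToDevice, copy_stream);
    /* ... host performs potrf here while the copy proceeds ...              */
    cudaStreamSynchronize(copy_stream);          /* wait before trsm on GPU   */
}
```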


3) Optimizing BLAS routine parameters: We also explored tuning the GPU BLAS routine parameters, such as the data tile sizes and the number of threads, based on the input matrix sizes, but found very little improvement in the performance. For instance, we experimented with 17 different configurations of the number of threads per thread block and the data tile sizes for syrk for the matrix kyushu and found that the range of variation was less than 0.5%.

Figure 10. Flop rate (in Flops/s) for various policies for varying number of total operations.

B. Policies

Based on the workflow optimizations discussed in the previous subsection, there can be four possible ways of dividing the workload between the CPU and the GPU. Table VI shows the four different policies. Specifically, policy P1 is the serial CPU implementation, P2 and P3 involve offloading either just syrk or both syrk and trsm to the GPU with the overlapped copy as described in Section V-A2, while policy P4 involves performing all the routines on the GPU itself, following the approach in Section V-A1. Note that we do not need to consider other options, for instance performing trsm alone on the GPU, because the large matrices for which trsm on the GPU is beneficial are likely to benefit from moving syrk to the GPU as well.

Table VI. Four policies for a factor-update operation.
  Policy   Description
  P1       potrf, trsm, syrk all on CPU
  P2       potrf, trsm on CPU; syrk on GPU
  P3       potrf on CPU; trsm, syrk on GPU
  P4       potrf, trsm, syrk all on GPU

Figure 11. Observed speedup with respect to host CPU implementation for various policies for varying number of total operations.

1) Relative Flop Rate and Baseline Hybrid: Figures 10 and 11 show how the flop rate and the speedup (with respect to the host CPU implementation) of the different policies in Table VI vary with the total number of operations. From the plots, one can construct a baseline hybrid approach P_BH based on the transition points in the number of operations at which there is a change in the best policy. In particular, for our data, we notice that policy P1 is predominantly the best choice below 2 × 10^6 Ops., while P2 is often better in the range 2 × 10^6 - 1.5 × 10^7 Ops., and P3 is better for matrices that require 1.5 × 10^7 - 9 × 10^10 Ops., after which P4 seems to dominate.
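The baseline hybrid P_BH therefore amounts to a simple threshold rule on the F-U operation count, as sketched below. The thresholds are the transition points quoted in the text and are specific to the Xeon 5160 / Tesla T10 configuration studied here; the flop-count expression is the sum of the three kernel counts used in Equation (1).

```c
/* Baseline hybrid policy P_BH: choose a policy purely from the F-U
   operation count, using the transition points quoted in the text. */
typedef enum { P1_CPU, P2_SYRK_GPU, P3_TRSM_SYRK_GPU, P4_ALL_GPU } policy_t;

policy_t baseline_hybrid(double m, double k)
{
    double ops = k * k * k / 3.0 + m * k * k + m * m * k;  /* total F-U flops */
    if (ops < 2.0e6)   return P1_CPU;
    if (ops < 1.5e7)   return P2_SYRK_GPU;
    if (ops < 9.0e10)  return P3_TRSM_SYRK_GPU;
    return P4_ALL_GPU;
}
```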


VI. MODEL-BASED HYBRID APPROACH

Choosing the optimal policy from the multiple choices listed in Table VI is an important part of optimizing the GPU performance. Figure 12(a) shows the best policy based on a retrospective analysis of the average observed timings for F-U calls given the matrix sizes m and k. Clearly, one can notice that the serial CPU implementation (P1) is better suited for matrices of smaller size, while the policies P3 and P4 are more appropriate for large matrices. On the other hand, the policy P2 is more effective for moderate values of m and k. This observation points to the need for a data-driven learning approach that can provide a decision model for picking the best policy based on the characteristics of the input matrix.

A. Desiderata for a Decision Model

Our choice of the decision model and the learning approach is motivated by a few key requirements. First, the model should be capable of choosing from multiple policies based on a combination of multiple matrix characteristics to account for the different optimizations in the various policies, which might not be captured via simple threshold(s) on the total number of operations (as in [10]). Secondly, it



should be possible to readily adapt the decision model to a different policy space, for instance, one corresponding to a double-precision implementation, a different BLAS library, or a different GPU processor configuration, given empirical computational time data. This entails adopting a statistical learning approach, since parameter tuning based on manual analysis of data requires significant human effort. Lastly, the decision model should not only predict the optimal policy, but do so while being sensitive to the computational times, i.e., make relatively harmless errors by suggesting a near optimal policy when it does make an error rather than one that is highly sub-optimal. In other words, all prediction errors are not equal, and the cost of an error depends on the actual best policy, the predicted one, and the input matrix. For example, predicting P1 (serial implementation) as the optimal policy instead of P3 (say, the actual best choice) for a large matrix is much more costly than predicting P2 (only syrk on GPU), or than a similar error over a moderate sized matrix. Existing machine learning techniques, including cost-sensitive classifiers [18], do not, in fact, address the third requirement and are limited only to scenarios where the error cost depends only on the actual and the predicted class, but not the example itself, as we discuss further in Section VII. Since accounting for the computational time is a critical requirement, we employ a new direct optimization approach over a parametric multi-class classifier, which we describe shortly.

B. Learning Decision Model

Formally, let $\mathcal{C} = \{C_1, \cdots, C_j, \cdots, C_r\}$ denote the set of policy classes and $\mathcal{A} = \{A_1, \cdots, A_i, \cdots, A_n\}$ denote the set of input matrices for which we have the observed data. Let $x_i = x(A_i) \in \mathcal{X}$, $1 \le i \le n$, be a suitable feature vector that captures the salient properties of matrix $A_i$, with $\mathcal{X}$ denoting the feature space. Further, let $T_{ij} = T(A_i, C_j)$ denote the observed computational time for the matrix $A_i$ using the policy $C_j$. Our objective is to learn a mapping $y : \mathcal{X} \mapsto \mathcal{C}$ such that the policy $y(x_i) = y(A_i)$ yields the best computation time for the matrix $A_i$, i.e., $y(x_i) = \operatorname{argmin}_{C_j \in \mathcal{C}} T_{ij}$. Due to the large input feature space and noise in the data, an exact mapping to the best policy cannot always be determined, and instead one typically learns a parameterized classifier. For the sake of robustness, we consider a probabilistic classifier $p_\theta(y|x)$ parameterized by $\theta$. Here, $p_\theta(y(x_i) = C_j \mid x_i)$ denotes the probability that the policy recommended for $x_i$ is $C_j$, and the associated computational time is given by $T_{ij}$. The expected computational time for the classifier-based recommendation for a matrix $A_i$ is given by $\sum_{j=1}^{r} p_\theta(y(x_i) = C_j \mid x_i) T_{ij}$. Since we seek to minimize the computational time based on all the empirical data, learning the decision model can thus be posed in terms of the following optimization problem:

$$\theta^* = \operatorname*{argmin}_{\theta} \sum_{i=1}^{n} \sum_{j=1}^{r} p_\theta(y(x_i) = C_j \mid x_i)\, T_{ij}, \qquad (3)$$

where $\theta^*$ denotes the optimal parameter weights. Given a new matrix A, the best policy $\hat{y}(A)$ can then be determined as

$$\hat{y}(A) = \operatorname*{argmax}_{1 \le j \le r} p_{\theta^*}(y = C_j \mid x(A)). \qquad (4)$$

To instantiate the above approach, we need to choose the form of the parametric model as well as a feature representation for the input matrices.

Feature Representation. Though there are only two main matrix characteristics, m and k, the computation time can be optimized more accurately if features such as the number of operations, matrix sizes, and control flow predicates that exhibit a simpler dependency relation to the computation are included. Hence, we consider features based on $[m, k, m/k, m^2, mk, k^2, k^3, mk^2]$.

Parametric Model. It is preferable to choose a parametric form for the classifier that is flexible enough to represent the mapping to the optimal policy and also allows an efficient solution to both the optimization and prediction problems (Eqs. 3 and 4). We choose a multinomial logistic classifier,

$$p_\theta(y = C_j \mid x) = \frac{\exp(x \cdot \theta_j)}{\sum_{l=1}^{r} \exp(x \cdot \theta_l)},$$

where the input representation consists of a d-dimensional real vector, i.e., $x \in \mathbb{R}^d$, and the parameter $\theta$ can be represented as a $d \times r$ matrix with the column $\theta_j$ being associated with the class $C_j$. Plugging the above parametric model into the learning problem in Eq. (3) results in a simple unconstrained convex optimization problem that can be readily solved using standard optimization techniques such as the Newton-Raphson method. Further, since the denominator is constant across the classes and the exponential function is monotonic, the policy prediction step in Eq. (4) reduces to the following linear computation,


$$\hat{y}(A) = \operatorname*{argmax}_{1 \le j \le r} x(A) \cdot \theta_j^*, \qquad (5)$$

which requires a fairly low overhead of O(dr) arithmetic operations where d is the feature size and r the number of policy classes.
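The runtime selection of Eq. (5) is therefore just an argmax over r linear scores, as sketched below. The feature vector follows the list given above (d = 8 features, r = 4 policies); the theta values themselves come from offline training and are not shown, and the function name is illustrative.

```c
/* Runtime policy prediction sketch for Eq. (5). */
enum { NFEAT = 8, NPOLICY = 4 };

int predict_policy(double m, double k, const double theta[NFEAT][NPOLICY])
{
    const double x[NFEAT] = { m, k, m / k, m * m, m * k,
                              k * k, k * k * k, m * k * k };
    int best = 0;
    double best_score = -1.0e300;
    for (int j = 0; j < NPOLICY; ++j) {          /* O(d*r) work per F-U call */
        double score = 0.0;
        for (int i = 0; i < NFEAT; ++i)
            score += x[i] * theta[i][j];
        if (score > best_score) { best_score = score; best = j; }
    }
    return best;                                 /* 0..3 => policies P1..P4  */
}
```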


C. Experimental Results

In this section, we present a comparative performance analysis of the proposed model-based hybrid approach, the different GPU implementation policies, as well as a multithreaded run on large sparse matrices from real world applications in Table II. For each of the test matrices, we performed sparse matrix direct factorization using the four policies in Table VI and measured the timings for single precision computation.



Using the observed timings, we constructed the following three hybrid policies, which choose from one of the four base policies depending on the matrix dimensions.

• Ideal Hybrid P_IH, which corresponds to the optimal policy based on the actual average observed timings.
• Model Hybrid P_MH, which corresponds to the policy learned using the parametric model described earlier using a subset of the observed timing data.
• Baseline Hybrid P_BH, which corresponds to a policy chosen based solely on thresholds on the number of operations, as described in Section V.

Hybrid Policies versus Matrix Dimensions: Figures 12(a) and 13(a) show the ideal hybrid policy P_IH (color coded according to the legend) based on the average observed timings for particular values of the matrix dimensions m and k for different ranges. Figures 12(b-c) and 13(b-c) show similar information for the model-based hybrid policy P_MH and the baseline hybrid policy P_BH, respectively. We observe that for regions with low values of both m and k, the best choice is almost always P1, i.e., the host CPU implementation. For moderate values of k, policy P2, i.e., performing both potrf and trsm on the CPU, often turns out to be the best policy. For large values of k, P4, which performs all three routines on the GPU, is the winner, while policy P3, i.e., performing both trsm and syrk on the GPU, is the best choice in the remaining region. The baseline hybrid policy captures some of these trends, but as one can visually perceive, the model-based hybrid is much more effective since it makes use of multiple features, of which the most prominent ones are (m < 122, k < 19, m/k < 2.6, m/k < 11).

Relative Speedup of the Best GPU Implementations: We now examine the speedup relative to the host CPU implementation for the three hybrid policies P_IH, P_MH, and P_BH. Figure 14 shows the variation of speedup with respect to the matrix dimensions. The speedup values are averaged across bins of size 250 × 250, and the bins with no observations are assigned an invalid value shown as -1 in the color coding. We observe that the speedups of the three hybrid policies steadily increase from 1× for small matrices, where P1, the host CPU implementation, is the optimal policy, to 12-13× for large matrices, where P3 or P4 is the best solution. Policies P3 and P4, on the other hand, are sub-optimal for smaller matrix calls.

Table VII shows the speedup for the different policies (including the hybrid approaches) and a 4-threaded run for each of the matrices. The results indicate a speedup in the range 5-10× with respect to a highly optimized single threaded run and approximately 2× with respect to a 4-threaded run. We have also included results for a 2-thread, 2-GPU run. While implementing the multiple thread, multiple GPU version, we observed that a few copy optimizations could be made for policy P4. With the copy optimized version, P4 was the better policy even for moderately sized frontal matrices. Therefore, a new model was learned with these results, and the last two columns in the table demonstrate the speedups obtained with this copy optimization. We were able to obtain speedups in the range 10 to 25× while using two CPU threads and 2 GPUs. Please note that this range of speedup can be realized only if there is a significant amount of work to be performed on the GPU, which is the case for our large 3D matrices. One might not observe such speedups for large 2D problems arising in many practical applications.

Table VII. Speedup of different policies with respect to a single thread CPU run. The P2, P3, P4, Ideal, Model, and Baseline columns are 1-GPU runs without copy optimizations; the last two columns are copy-optimized runs.
  Matrix      P2     P3     P4     Ideal   Model   Baseline   4-Thread   1-GPU Model   2-GPU Model
  audikw 1    2.5    5.27   4.67   6.82    6.73    6.48       2.96       7.52          14.14
  kyushu      2.64   6.09   7.26   9.62    9.46    8.68       4.33       9.87          25.64
  lmco        2.33   4.21   3.72   5.51    5.45    4.94       2.74       6.06          10.69
  nastran-b   2.31   3.94   3.2    5.38    5.32    4.98       2.68       5.89          10.68
  sgi 1M      2.54   5.26   4.53   6.62    6.55    6.26       3.57       7.34          14.06

VII. RELATED WORK

Our current work is related to two main areas of research.

Matrix Computations on GPU: Over the years, there have been numerous attempts to accelerate linear system solver performance on GPUs, either via a full-blown implementation of the algorithms on GPUs or by identifying and offloading only the compute intensive kernels to the GPU. In particular, for dense matrices, there has been work on LU factorization [5], [7] and Cholesky decomposition [6]. The improved performance of the matrix-matrix multiplication kernel in CUBLAS [7] has resulted in unprecedented performance for hybrid CPU-GPU based dense direct solvers. In the case of sparse matrices, however, most of the work has been performed in the iterative solver domain, for example, multigrid [8] and GMRES [9]. Recently, there has been some work on accelerating sparse direct factorization [10], and a simple strategy of replacing the host BLAS calls without any additional optimization has been shown to achieve ∼7× speedup over an in-house serial implementation (PARDISO). However, our experiments (Section IV) indicate a speedup of only 3-5×, most likely due to differences in the serial implementation. Vuduc et al. [12] also discuss some of the issues associated with implementing sparse multifrontal Cholesky factorization on GPUs and report a speedup of about 3× over their serial implementation for double precision computations. In [11], the authors use a higher number of threads (8) at the bottom levels of the elimination tree, and at levels close to the root, a single GPU is used to accelerate the computations. Our approach uses the same number of threads as the number of available GPUs, and the model-based hybrid approach offloads computations to the GPU as needed.

Auto-tuning in Scientific Computing: In recent years, there has been increasing interest in auto-tuning approaches for scientific computing applications via machine learning methods. When the possible policies form a small discrete set, one of the natural choices is to model the policy optimization in terms of learning a classifier that maps each input problem to the best policy class. Dongarra et al. [19] and Xu et al. [20] adopt this approach for optimizing linear system solves using Gaussian and Support Vector Machine-based classifiers, respectively. Such an approach, however, only focuses on learning the best observed class, penalizing all prediction errors equally and completely disregarding the differences in the observed performance metric values (i.e., computational time in our scenario). In the machine learning community, there has been considerable work on cost-sensitive classifiers [18], where the mis-classification errors are penalized differently. However, the mis-classification costs are modeled only as functions of the true class and the predicted class (i.e., a different mis-classification cost for each pair of classes) and do not incorporate the example-specific characteristics. Our auto-tuning approach addresses this drawback by directly minimizing the expected performance loss (example-specific costs) on the empirical data.


Figure 12. Ideal, Model-based, and Baseline Hybrid Policies for values of m and k in the range 0 ≤ m, k ≤ 1000: (a) Ideal Hybrid, (b) Model Hybrid, (c) Baseline Hybrid.

Figure 13. Ideal, Model-based, and Baseline Hybrid Policies for values of m and k in the range 0 ≤ m, k ≤ 10000: (a) Ideal Hybrid, (b) Model Hybrid, (c) Baseline Hybrid.

Figure 14. Speedup of various hybrid policies relative to the host CPU implementation for different values of m and k: (a) Ideal Hybrid, (b) Model Hybrid, (c) Baseline Hybrid. Note that the scales on the colorbars are not exactly the same.

VIII. CONCLUDING REMARKS AND FUTURE WORK

We performed a detailed performance analysis of the compute intensive operations in sparse multifrontal factorization for a coupled CPU-GPU implementation and recommended multiple implementation policies based on various workflow optimizations that result in a speedup of up to 7× relative to a state-of-the-art serial host implementation (WSMP). We also proposed a cost-sensitive model-based hybrid scheme that can combine the various policies in order to best overlap the host and GPU computations. The adaptive hybrid method presented in this paper has potential benefits not only for sparse multifrontal factorization, but also for other computational domains, especially since the CPU/GPU architectures and libraries are constantly evolving. Experimental results on a few large SPD matrices indicate that the model-based hybrid approach provides a performance boost of



20-60% (∼10× relative to serial) over that of the single best policy and almost approaches the ideal hybridization. Comparison with a multi-threaded run of WSMP indicates that, in the case of sparse matrix operations, the effective flop rate of GPU-accelerated serial code is approximately equivalent to that of a highly optimized multithreaded code running on a few (4-12) CPU cores. The exact point of equivalence depends on the GPU architecture and the precision of the computation. In spite of the relatively low efficiency compared to dense matrix computations, the low price-to-Flop ratio of GPUs makes hybrid computing platforms comprising multiple CPUs and multiple GPUs a highly attractive option for high performance computing on sparse matrices. Using the task-parallel formulation in WSMP, we were able to obtain speedups in the range 10 to 25× (relative to serial) while using 2 CPU threads and 2 GPUs. We are currently investigating the feasibility of using the distributed-memory parallel version of WSMP to develop a cluster version of the solver.

REFERENCES

[1] I. S. Duff and J. K. Reid, "The multifrontal solution of indefinite sparse symmetric linear equations," ACM Trans. Math. Softw., vol. 9, no. 3, pp. 302–325, 1983.

[2] N. I. M. Gould, J. A. Scott, and Y. Hu, "A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations," ACM Trans. Math. Softw., vol. 33, no. 2, pp. 1–32, 2007.

[3] A. Gupta, G. Karypis, and V. Kumar, "Highly scalable parallel algorithms for sparse matrix factorization," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 5, pp. 502–520, May 1997.

[4] A. Gupta, S. Koric, and T. George, "Sparse matrix factorization on massively parallel computers," in SC, 2009.

[5] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, "LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware," in SC, 2005.

[6] H. Ltaief, S. Tomov, R. Nath, P. Du, and J. Dongarra, "A scalable high performant Cholesky factorization for multicore with GPU accelerators," in VECPAR, Berkeley, CA, 2010.

[7] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in SC, 2008, pp. 1–11.

[8] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the GPU: conjugate gradients and multigrid," ACM Trans. Graph., vol. 22, no. 3, pp. 917–924, 2003.

[9] M. Wang, H. Klie, M. Parashar, and H. Sudan, "Solving sparse linear systems on NVIDIA Tesla GPUs," in ICCS '09, 2009, pp. 864–873.

[10] M. Christen, O. Schenk, and H. Burkhart, "General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform," in First Workshop on General Purpose Processing on Graphics Processing Units, 2007.

[11] R. Lucas, G. Wagenbreth, D. Davis, and R. Grimes, "Multifrontal computations on GPUs and their multi-core hosts," in VECPAR, 2010.

[12] R. Vuduc, A. Chandramowlishwaran, J. W. Choi, M. E. Guney, and A. Shringarpure, "On the limits of GPU acceleration," in HotPar, June 2010.

[13] J. W. H. Liu, "The multifrontal method for sparse matrix solution: theory and practice," SIAM Rev., vol. 34, no. 1, pp. 82–109, 1992.

[14] "NVIDIA CUDA programming guide 2.3." [Online]. Available: http://www.nvidia.com/object/cuda_get.html

[15] "NVIDIA CUBLAS library." [Online]. Available: http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/CUBLAS_Library_2.3.pdf

[16] A. Gupta, "WSMP: Watson sparse matrix package (Part-I: Direct solution of symmetric sparse systems)," IBM T. J. Watson Research Center, Yorktown Heights, NY, Tech. Rep. RC 21886, November 2000.

[17] "Automatically tuned linear algebra software (ATLAS)." [Online]. Available: http://math-atlas.sourceforge.net/

[18] C. Elkan, "The foundations of cost-sensitive learning," in IJCAI'01: Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2001, pp. 973–978.

[19] G. Bosilca, Z. Chen, J. Dongarra, V. Eijkhout, G. E. Fagg et al., "Self-adapting numerical software (SANS) effort," IBM Journal of Research and Development, vol. 50, no. 2/3, pp. 223–238, 2006.

[20] S. Xu and J. Zhang, "A data mining approach to matrix preconditioning problem," University of Kentucky, Lexington, Tech. Rep. 433-05, 2005.
