2012 SC Companion: High Performance Computing, Networking Storage and Analysis
Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs

Kazuya Matsumoto, Naohito Nakasato, and Stanislav G. Sedukhin
Graduate School of Computer Science and Engineering, The University of Aizu
Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
Email: {d8121101, nakasato, sedukhin}@u-aizu.ac.jp
Abstract—OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors, including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes the performance of OpenCL programs portable across different processors as well. We have developed an auto-tuning system with a code generator for fast matrix-multiply kernels in OpenCL. This paper presents the results of a performance evaluation of DGEMM (double-precision general matrix multiply) and SGEMM (single-precision GEMM) implementations produced with the auto-tuning system. The evaluations are conducted on two AMD GPUs (Tahiti and Cayman), two NVIDIA GPUs (Kepler and Fermi), and two CPUs (Intel Sandy Bridge and AMD Bulldozer). Our GEMM implementations on the AMD GPUs show higher performance than the highly tuned vendor library, while the implementations on the NVIDIA GPUs are comparable to it.

I. INTRODUCTION

Matrix-matrix multiplication is a fundamental routine in linear algebra, called GEMM (GEneral Matrix Multiply) in the BLAS (Basic Linear Algebra Subprograms) standard [1]. GEMM is used in many important numerical algorithms, and it is a building block of LAPACK (Linear Algebra PACKage) [2] and of other Level-3 BLAS routines [3]. GEMM algorithms have high computational intensity and regularity and are therefore good candidates for performance acceleration.

OpenCL (Open Computing Language) is a standard framework for parallel programming [4], [5]. Programs in OpenCL are functionally portable across multiple processors, including CPUs, GPUs, and other computing devices such as FPGAs. OpenCL offers an abstract hardware layer that allows programmers to develop applications without knowing the details of the underlying processor architectures. However, performance is not always portable across different processors in OpenCL.

Parallel processing on GPUs and multi-core CPUs is widely used. State-of-the-art GPUs contain more than a thousand processing elements running at around 1 GHz and are capable of achieving around 1 TFlop/s in double precision and more than 3 TFlop/s in single precision. Many numerical applications rely on the high computational power of GPUs. On the other hand, current commodity CPUs have up to a few tens of processing cores at a few GHz clock speed. The peak performance of CPUs is a few hundred GFlop/s, but CPUs are capable of running many more types of applications.

Automatic performance tuning (auto-tuning for short) is an important technique for resolving the problem of performance portability, and it is a well-accepted approach to fast GEMM implementations. PHiPAC (Portable High Performance ANSI C) [6] and ATLAS (Automatically Tuned Linear Algebra Software) [7] are well-known projects for auto-tuned BLAS routines on CPUs, and several other auto-tuning systems for GEMM have been developed [8]–[13]. In CUDA, an auto-tuning framework for NVIDIA GPUs, named ASTRA (Automatic Stencil TuneR for Accelerators), has been implemented [10]. In OpenCL, Du et al. [12] presented auto-tuned GEMM routines on an NVIDIA Fermi GPU and an AMD Cypress GPU. We have also previously implemented a GEMM code generator that produces fast GEMM kernels through an auto-tuning process on an AMD Tahiti GPU [13].

One contribution of this study is that we have applied our auto-tuning system to different GPUs and CPUs and evaluated the resulting performance. The processors are the following four GPUs and two CPUs:
1) AMD Tahiti GPU (Radeon HD 7970);
2) AMD Cayman GPU (Radeon HD 6970);
3) NVIDIA Kepler GPU (GeForce GTX 670, overclocked);
4) NVIDIA Fermi GPU (Tesla M2090);
5) Intel Sandy Bridge CPU (Core i7 3960X);
6) AMD Bulldozer CPU (FX-8150).

Another contribution is that we have improved our previous GEMM code generator so that it supports a greater number of parameters. The new parameters include a parameter designating a specific matrix multiply algorithm and additional blocking factors for more flexible usage of local memory (sharable memory). The new generator has succeeded in producing GEMM kernels that run faster on the Tahiti GPU: the maximum performance of the DGEMM (double-precision GEMM) kernel is increased from 848 GFlop/s to 863 GFlop/s (91% of the peak performance), and that of the SGEMM (single-precision GEMM) kernel is improved from 2646 GFlop/s to 3047 GFlop/s (80% of the peak).

The rest of this paper is organized as follows. Section II reviews OpenCL basics. Section III describes our improved GEMM code generator and explains the differences from the previous generator. Section IV presents the results of the performance evaluation on different processors. Finally, Section V concludes the paper.

II. OPENCL BASICS

OpenCL is an open standard framework for general-purpose parallel programming on heterogeneous platforms. The OpenCL framework includes a C99-based language for writing parallel functions called kernels, and runtime APIs (application programming interfaces) for controlling OpenCL platforms and devices.

An OpenCL platform is composed of one or more OpenCL devices connected to a host. An OpenCL device comprises multiple compute units (CUs), each of which has multiple processing elements (PEs). When an OpenCL kernel is submitted for execution on the device, an N-dimensional index space, called NDRange, is defined. In this study, we consider only a two-dimensional index space, which is suitable for matrix data. Each instance in an NDRange is called a work-item and has a unique ID. Several work-items organize a work-group. A work-item runs on one or more PEs, and the task of a work-group is processed by the PEs of one CU.

In an OpenCL kernel, four distinct memory regions are accessible to work-items (a minimal kernel illustrating the corresponding address-space qualifiers follows this list):
1) Global memory is a memory region in which data can be read and written by all work-items. Work-items cannot be synchronized through this memory during a kernel execution.
2) Constant memory is a read-only region of global memory. Data in this region do not change during execution.
3) Local memory is a memory region specific to a work-group. Work-items in a work-group can share data through local memory.
4) Private memory is a memory region specific to a work-item. Data in the private memory of a work-item are not visible to other work-items. On most OpenCL devices, private memory resides in the register file.
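To make these regions concrete, the following minimal OpenCL kernel is our own illustration (it is not taken from the paper's generator; the kernel and its argument names are hypothetical). It shows the address-space qualifiers that correspond to the four regions:

__kernel void scale_tile(__global const float *src,   /* global memory (read)              */
                         __global float *dst,         /* global memory (write)             */
                         __constant float *factor,    /* constant memory                   */
                         __local float *tile,         /* local memory, one per work-group  */
                         const int width)
{
    const int gx  = get_global_id(0);
    const int gy  = get_global_id(1);
    const int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);

    float v = src[gy * width + gx];   /* 'v' lives in private memory (registers)          */
    tile[lid] = v;                    /* stage the value in the work-group's local memory */
    barrier(CLK_LOCAL_MEM_FENCE);     /* synchronize the work-items of the group          */

    dst[gy * width + gx] = tile[lid] * factor[0];
}

The host would size the local buffer to one float per work-item, e.g., with clSetKernelArg(kernel, 3, local_items * sizeof(float), NULL), where local_items is the work-group size.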
III. GEMM CODE GENERATOR

The GEMM routine in single or double precision is defined as

    C ← α · op(A) op(B) + βC,

where α and β are scalar values, and op(A), op(B), and C are M × K, K × N, and M × N matrices, respectively. Since op(X) is either X (non-transposed matrix) or X^T (transposed matrix), there are four multiplication types:
(NN) C ← αAB + βC,
(NT) C ← αAB^T + βC,
(TN) C ← αA^T B + βC,
(TT) C ← αA^T B^T + βC.

An auto-tuning system uses two core components: a code generator and a heuristic search engine. Our code generator takes a set of parameters as input and produces the corresponding GEMM kernel code written in OpenCL as output. We can pass different input parameters to the generator. We have tuned the code generator so that it produces a fast A^T B + C kernel in which data are properly aligned in row-major order. For simplicity, multiplications with the scalars α and β are omitted in the following descriptions, although they are performed in the kernel. Our approach to fast GEMM implementations is to utilize the A^T B + C kernel: an implementation first copies the matrices, with matrix transposition and data-layout changes if needed, and then executes the A^T B + C kernel. The following explanations of each parameter therefore target the multiplication of a transposed matrix A^T and a non-transposed matrix B.

A. Blocking

Matrix blocking (or tiling) in a matrix multiply algorithm is a necessary technique for high-performance computation. Blocking increases the data-reuse ratio in the multi-level memory hierarchy of today's processors by exploiting the fact that matrix multiplication requires O(N^3) multiply-add floating-point operations over O(N^2) data. Two levels of blocking are used in our matrix multiply algorithms. In this paper, we refer to the first level of blocking, which alleviates access latencies between local memory and private memory for work-item processing, as work-item blocking. The other level of blocking serves to use the local memory (and data caches) of a compute unit efficiently for work-group processing; we call it work-group blocking.

Let Mwg, Nwg, Kwg be the blocking factors of the work-group blocking, where M, N, K are assumed to be divisible by Mwg, Nwg, Kwg, respectively. The blocking divides the three matrices A^T, B, C into blocks of size Kwg × Mwg, Kwg × Nwg, and Mwg × Nwg, respectively. Fig. 1 shows a matrix multiply-add partitioned by these blocking factors. Each Mwg × Nwg block of C is computed by one work-group. The work-group involves a K × Mwg block of A^T and a K × Nwg block of B for the multiplication, and an Mwg × Nwg block of C for the addition with the result of the multiplication. The blocked matrix multiply requires K/Kwg iterations in the outermost loop of our GEMM algorithms.
Figure 1. Blocked matrix multiply-add partitioned with factors Mwg, Nwg, Kwg in the work-group blocking
In every iteration, the work-group updates its Mwg × Nwg block by multiplying a Kwg × Mwg block of A^T with a Kwg × Nwg block of B and adding the product to the Mwg × Nwg block of C.

Fig. 2(a) depicts the further blocked matrix multiply-add. Each block is additionally divided with the blocking factors Mwi, Nwi of the work-item blocking. The two blocking factors Mwi, Nwi are not parameters of the code generator; instead, the size (MdimC, NdimC) of a work-group is parameterized, where Mwg and Nwg are multiples of MdimC and NdimC, respectively. Using the parameters MdimC, NdimC, the two blocking factors are calculated as Mwi = Mwg/MdimC and Nwi = Nwg/NdimC. A work-item of the work-group is in charge of multiplying a K × Mwi sub-block of A^T by a K × Nwi sub-block of B and accumulating the product into an Mwi × Nwi sub-block of C. In addition, the code generator supports another parameter, Kwi, which determines the degree of unrolling in the innermost loop of our GEMM algorithms; Kwg must be divisible by Kwi, and we categorize Kwi as one of the blocking factors. Loop unrolling [14] is an optimization technique in which the body of a loop is replaced with multiple copies of itself. As positive effects, the technique exposes parallelism in an OpenCL kernel explicitly to the compiler and reduces loop overheads such as the loop-counter increment; however, unrolling also has the side effect of increasing the number of required registers. The unrolling degree therefore needs to be parameterized.
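As a worked example using the Tahiti SGEMM parameters reported later in Table II: with Mwg = Nwg = 96 and a 16 × 16 work-group (MdimC = NdimC = 16),

    Mwi = Mwg/MdimC = 96/16 = 6  and  Nwi = Nwg/NdimC = 96/16 = 6,

so each work-item accumulates a 6 × 6 sub-block of C, and with Kwi = 2 the innermost loop is unrolled by a factor of two.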
Figure 2. Further blocked matrix multiply-add with factors Mwi, Nwi in the work-item blocking: (a) with a unit-stride (adjacent) memory access; (b) with a non-unit-stride access

B. Vector width and stride memory access

The width of vector variables is a parameter of the code generator. Vector variables in OpenCL resemble arrays containing multiple elements of the same variable type. The vector width vw affects the performance of the generated kernels, and the best width depends on the processor and the algorithm.

In the work-item blocking of Fig. 2(a), each work-item is in charge of the computation for adjacent (unit-stride) elements in an Mwi × Nwi block of C; the dark shading indicates the elements accessed by one work-item. In previous works [9], [10], [15], instead of the unit-stride memory access, a non-unit-stride memory access is utilized for performance optimization on Fermi GPUs. Fig. 2(b) depicts the computation with our non-unit-stride memory access, where the stride size in the M direction is MdimC and the one in the N direction is NdimC. If vector variables are used (vw ≥ 2), the stride sizes are multiplied by the vector width, i.e., they become vw · MdimC and vw · NdimC.

C. Usage of local memory

We parameterize the usage of local memory for sharing data among the work-items of a work-group. Local memory offers the advantage of reusing matrix data of A and B that are loaded only once from global memory. A disadvantage of using local memory is that it requires barrier synchronizations between the work-items, which take a certain amount of time. Consequently, using local memory does not always lead to high performance. (A minimal sketch of a cooperative tile load into local memory follows this subsection.)

When local memory is used, the assignment pattern of work-items in a work-group can be reshaped (this reshaping technique is also used in [10]). To represent the reshaping, let us introduce the values MdimA, KdimA, KdimB, NdimB. Reshaping the block is possible as long as the three shapes for A, B, C completely overlay the corresponding matrix blocks. We add the two parameters MdimA and NdimB; the other values, KdimA and KdimB, are calculated as (MdimC · NdimC)/MdimA and (MdimC · NdimC)/NdimB, respectively.
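The following OpenCL fragment is a minimal sketch of such a cooperative tile load (our own illustration, not the generator's output; the names and the assumption that K and M are multiples of the tile sizes are ours). A work-group stages a Kwg × Mwg tile of the row-major A^T into local memory and synchronizes before the tile is consumed:

__kernel void stage_tile_of_At(__global const float *At,  /* A^T: K x M, row-major        */
                               __local float *Alm,        /* Kwg x Mwg tile               */
                               const int M, const int Kwg, const int Mwg,
                               const int pwg)             /* K-offset of the current tile */
{
    const int lx = get_local_id(0);
    const int ly = get_local_id(1);
    const int nx = get_local_size(0);
    const int ny = get_local_size(1);
    const int m0 = get_group_id(0) * Mwg;      /* first A^T column handled by this group  */

    /* The MdimC x NdimC work-items cover the Kwg x Mwg tile cooperatively. */
    for (int k = ly; k < Kwg; k += ny)
        for (int m = lx; m < Mwg; m += nx)
            Alm[k * Mwg + m] = At[(pwg + k) * M + (m0 + m)];

    barrier(CLK_LOCAL_MEM_FENCE);              /* make all loads visible before use       */
    /* ... the multiply-add that consumes Alm is omitted in this sketch ... */
}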
Figure 3. Matrix data layouts of an M × K transposed matrix with blocking factors Mwg, Kwg: (a) row-major layout; (b) column-block-row-major layout (CBL); (c) row-block-row-major layout (RBL)
D. Matrix data layouts

The code generator supports A^T B + C kernels in which the matrices A, B are assumed to be aligned in block-major data layouts in addition to a row-major layout. Fig. 3 shows the supported layouts for an M × K transposed matrix A^T with blocking factors Mwg, Kwg. Fig. 3(a) is the row-major layout. Fig. 3(b) depicts the column-block-row-major layout (CBL), in which the data of each K × Mwg column-block are stored in row-major order. In CBL, the matrix data required for a multiplication of a K × Mwg column-block of A^T by a K × Nwg column-block of B lie in contiguous memory space. Fig. 3(c) shows the row-block-row-major layout (RBL), in which the data of each Kwg × Mwg sub-block of a Kwg × M row-block are aligned in row-major order. In RBL, the matrix data for a multiplication between a Kwg × Mwg sub-block and a Kwg × Nwg sub-block lie in sequential memory space. Both CBL and RBL have better spatial locality than the row-major layout, and GEMM kernels using either of them are expected to read matrix data more efficiently.

To make use of a fast A^T B + C kernel in GEMM routines, the matrix data have to be copied into separately allocated buffers in global memory before the kernel is executed. For example, to implement an AB + C routine where data are stored in row-major order, the matrix A is copied into a buffer with matrix transposition, and the matrix B is copied into another buffer without transposition. If the designated data layouts are not row-major, the matrix data are converted into the required layouts during the copying.
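One way to index the CBL layout described above is sketched below (this host-side C helper is our own formalization, not code from the paper): the K × M matrix A^T is split into K × Mwg column-blocks, and each block is stored row-major, one block after another.

#include <stddef.h>

/* Offset of element (k, i) of the K x M matrix A^T in a CBL-ordered buffer.
 * K and M are assumed to be multiples of the blocking factors. */
size_t cbl_offset(size_t k, size_t i, size_t K, size_t Mwg)
{
    size_t block = i / Mwg;      /* which K x Mwg column-block        */
    size_t col   = i % Mwg;      /* column inside that block          */
    return block * K * Mwg       /* skip the preceding column-blocks  */
         + k * Mwg               /* row-major inside the block        */
         + col;
}

A copy routine can then write element (k, i) of the row-major source to dst[cbl_offset(k, i, K, Mwg)].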
E. Algorithms

We have implemented three GEMM algorithms, and the code generator has a parameter that selects one of them. In the OpenCL language, kernels are written in an SPMD (Single-Program Multiple-Data) fashion, where a kernel describes the behavior of each work-item in the NDRange. Note that the following algorithms are examples that use local memory for both matrices A and B; if a generated kernel does not use local memory, the data elements are loaded directly from global memory into private memory. In the descriptions of the algorithms, "barrier" means a barrier synchronization between the work-items in a work-group, which ensures correct access to local memory.

Fig. 4 presents the first of the GEMM algorithms. We consider it the basic algorithm (BA); it is similar to the GEMM algorithm by Volkov and Demmel [16]. In the body of the outer loop (lines 2-12), a work-item first loads matrix elements of A and B from global memory into local memory. The inner loop body (lines 6-10) loads elements from local memory into the private memory Apm, Bpm, multiplies Apm by Bpm, and adds the product to Cpm, which is also held in private memory. The inner loop is fully unrolled. When the work-item exits the outer loop, it merges the computed results in Cpm with the corresponding elements of C (line 13). The scalars α and β are applied at the same time as the merging.

1: Cpm = 0
2: for pwg = 0 to K − Kwg step Kwg do
3:   load MwiA · KwiA elements of A into Alm
4:   load KwiB · NwiB elements of B into Blm
5:   barrier
6:   for pwi = 0 to Kwg − Kwi step Kwi do
7:     load Mwi · Kwi elements of Alm into Apm
8:     load Kwi · Nwi elements of Blm into Bpm
9:     Cpm += Apm × Bpm
10:  end for
11:  barrier
12: end for
13: merge Cpm with Mwi · Nwi elements of C

(pm - private memory; lm - local memory; MwiA = Mwg/MdimA, KwiA = Kwg/KdimA, KwiB = Kwg/KdimB, and NwiB = Nwg/NdimB. The same notations are used in the following algorithms.)

Figure 4. Basic GEMM algorithm (BA)
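To make the BA structure concrete, the following OpenCL kernel is a simplified sketch written for this description (it is not the generator's output). It fixes Mwi = Nwi = Kwi = 1, uses a single tile size TS for Mwg, Nwg, and Kwg, assumes row-major A^T (K × M), B (K × N), and C (M × N) with all dimensions divisible by TS, and omits the block-major layouts and vector variables:

#define TS 16                                    /* Mwg = Nwg = Kwg in this sketch */

__kernel void gemm_ba_sketch(const int M, const int N, const int K,
                             const float alpha, const float beta,
                             __global const float *At,   /* A^T: K x M, row-major */
                             __global const float *B,    /* B:   K x N, row-major */
                             __global float *C)          /* C:   M x N, row-major */
{
    const int lx = get_local_id(0);               /* position in the N direction */
    const int ly = get_local_id(1);               /* position in the M direction */
    const int j  = get_group_id(0) * TS + lx;     /* column of C computed here   */
    const int i  = get_group_id(1) * TS + ly;     /* row of C computed here      */

    __local float Alm[TS][TS];                    /* Kwg x Mwg tile of A^T */
    __local float Blm[TS][TS];                    /* Kwg x Nwg tile of B   */

    float Cpm = 0.0f;                             /* line 1 of Fig. 4 */

    for (int p = 0; p < K; p += TS) {             /* line 2: loop over the Kwg tiles */
        /* lines 3-4: each work-item loads one element of A^T and one of B */
        Alm[ly][lx] = At[(p + ly) * M + get_group_id(1) * TS + lx];
        Blm[ly][lx] = B[(p + ly) * N + j];
        barrier(CLK_LOCAL_MEM_FENCE);             /* line 5 */

        for (int kk = 0; kk < TS; ++kk)           /* lines 6-10 (fully unrolled by the generator) */
            Cpm += Alm[kk][ly] * Blm[kk][lx];

        barrier(CLK_LOCAL_MEM_FENCE);             /* line 11 */
    }
    C[i * N + j] = alpha * Cpm + beta * C[i * N + j];   /* line 13: merge with C */
}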
Fig. 5 shows the second GEMM algorithm. It uses a software pipelining (PL) technique and is based on the GEMM algorithm proposed in [9], [10]. The PL algorithm has a prologue and an epilogue in addition to the loop body used for the pipelining. The main feature of the algorithm is that the loop body loads elements of A and B from global memory while it loads elements from local memory and computes the multiply-add. This strategy is considered effective on OpenCL devices where large access latencies to global memory are a bottleneck.

1: Cpm = 0
2: load MwiA · KwiA elements of A into Alm
3: load KwiB · NwiB elements of B into Blm
4: barrier
5: for pwg = 0 to K − 2 · Kwg step Kwg do
6:   load MwiA · KwiA elements of A into Apm0
7:   load KwiB · NwiB elements of B into Bpm0
8:   barrier
9:   for pwi = 0 to Kwg − Kwi step Kwi do
10:    load Mwi · Kwi elements of Alm into Apm1
11:    load Kwi · Nwi elements of Blm into Bpm1
12:    Cpm += Apm1 × Bpm1
13:  end for
14:  barrier
15:  store MwiA · KwiA elements of Apm0 into Alm
16:  store KwiB · NwiB elements of Bpm0 into Blm
17:  barrier
18: end for
19: for pwi = 0 to Kwg − Kwi step Kwi do
20:   load Mwi · Kwi elements of Alm into Apm1
21:   load Kwi · Nwi elements of Blm into Bpm1
22:   Cpm += Apm1 × Bpm1
23: end for
24: merge Cpm with Mwi · Nwi elements of C

Figure 5. GEMM algorithm with software pipelining (PL)
The third GEMM algorithm is shown in Fig. 6. It is a variant of the algorithm with the double-buffering strategy (DB) by Tan et al. [15]. In contrast to the PL algorithm, the DB algorithm requires less private memory. Private memory is usually allocated in registers, and on some processors (especially GPUs) the number of used registers affects the kernel performance [5]: the register usage determines how many work-groups can be launched on a compute unit, and if the number of work-groups is not large enough, the processor cannot hide memory access latencies. A drawback of the DB algorithm is that it requires more local memory than the other two algorithms.

1: Cpm = 0
2: load MwiA · (KwiA/2) elements of A into Alm0
3: load (KwiB/2) · NwiB elements of B into Blm0
4: for pwg = 0 to K − 2 · Kwg step Kwg do
5:   barrier
6:   load MwiA · (KwiA/2) elements of A into Alm1
7:   load (KwiB/2) · NwiB elements of B into Blm1
8:   for pwi = 0 to Kwg/2 − Kwi step Kwi do
9:     load Mwi · Kwi elements of Alm0 into Apm
10:    load Kwi · Nwi elements of Blm0 into Bpm
11:    Cpm += Apm × Bpm
12:  end for
13:  barrier
14:  load MwiA · (KwiA/2) elements of A into Alm0
15:  load (KwiB/2) · NwiB elements of B into Blm0
16:  for pwi = Kwg/2 to Kwg − Kwi step Kwi do
17:    load Mwi · Kwi elements of Alm1 into Apm
18:    load Kwi · Nwi elements of Blm1 into Bpm
19:    Cpm += Apm × Bpm
20:  end for
21: end for
22: barrier
23: load MwiA · (KwiA/2) elements of A into Alm1
24: load (KwiB/2) · NwiB elements of B into Blm1
25: for pwi = 0 to Kwg/2 − Kwi step Kwi do
26:   load Mwi · Kwi elements of Alm0 into Apm
27:   load Kwi · Nwi elements of Blm0 into Bpm
28:   Cpm += Apm × Bpm
29: end for
30: barrier
31: for pwi = Kwg/2 to Kwg − Kwi step Kwi do
32:   load Mwi · Kwi elements of Alm1 into Apm
33:   load Kwi · Nwi elements of Blm1 into Bpm
34:   Cpm += Apm × Bpm
35: end for
36: merge Cpm with Mwi · Nwi elements of C

Figure 6. GEMM algorithm with double-buffering strategy (DB)
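The double-buffering idea can also be illustrated with a simplified OpenCL sketch (our own, written for this description; it differs from the Fig. 6 listing, which splits each Kwg tile into halves). Two local buffers are used alternately, so the loads of the next tile are issued before the current tile is consumed; as in the sketch after Fig. 4, it fixes Mwi = Nwi = 1 and assumes row-major A^T (K × M), B (K × N), and C (M × N) with dimensions divisible by TS:

#define TS 16

__kernel void gemm_db_sketch(const int M, const int N, const int K,
                             const float alpha, const float beta,
                             __global const float *At,
                             __global const float *B,
                             __global float *C)
{
    const int lx = get_local_id(0);
    const int ly = get_local_id(1);
    const int j  = get_group_id(0) * TS + lx;
    const int i  = get_group_id(1) * TS + ly;
    const int i0 = get_group_id(1) * TS;

    __local float Alm[2][TS][TS];                 /* two buffers: twice the local memory */
    __local float Blm[2][TS][TS];

    /* Prologue: stage the first tile into buffer 0. */
    Alm[0][ly][lx] = At[ly * M + i0 + lx];
    Blm[0][ly][lx] = B[ly * N + j];
    barrier(CLK_LOCAL_MEM_FENCE);

    float Cpm = 0.0f;
    int cur = 0;

    for (int p = 0; p < K; p += TS) {
        const int nxt = 1 - cur;
        if (p + TS < K) {                         /* issue the loads of the next tile */
            Alm[nxt][ly][lx] = At[(p + TS + ly) * M + i0 + lx];
            Blm[nxt][ly][lx] = B[(p + TS + ly) * N + j];
        }
        for (int kk = 0; kk < TS; ++kk)           /* consume the current tile */
            Cpm += Alm[cur][kk][ly] * Blm[cur][kk][lx];

        /* One barrier per iteration: it publishes the next tile and guarantees that
           no work-item is still reading the buffer that will be overwritten next. */
        barrier(CLK_LOCAL_MEM_FENCE);
        cur = nxt;
    }
    C[i * N + j] = alpha * Cpm + beta * C[i * N + j];
}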
F. Modifications of the GEMM code generator

Our GEMM code generator has been modified from [13] in several aspects:
• The number of parameters related to blocking factors is increased from six to eight.
• The size of each blocking factor was previously limited to a power of two; this limitation has been eliminated.
• A non-unit-stride memory access is implemented in addition to the unit-stride access.
• The current generator can produce GEMM kernels that use local memory for both matrices A and B, whereas the previous generator was incomplete in producing such kernels.
• GEMM kernels access data in buffer objects in global memory; image objects, another possible type of memory object in OpenCL, are currently not used.
• The three GEMM algorithms discussed above are now supported; only the BA algorithm was available in the previous version of the code generator.
Our strategy for searching for the fastest (best) kernel has not changed significantly. We searched tens of thousands of kernel variants per single GEMM type on each OpenCL device; kernels that fail in code generation, compilation, or testing are not counted in this number. The many variants were chosen heuristically. We implemented a heuristic search engine and selected the fastest kernel; to find the best set of parameters for each GEMM kernel, the search engine has to run for more than five hours. The procedure for selecting the best kernel is as follows (a worked example follows this list):
1) Measure the performance in GFlop/s of every generated GEMM kernel for a problem size N = ⌊4096/LCM⌋ · LCM on GPU devices and N = ⌊1536/LCM⌋ · LCM on CPU devices, where the matrices are square (M = N = K) and LCM is the least common multiple of the work-group blocking factors Mwg, Nwg, Kwg.
2) Further measure the performance of the 50 fastest kernels from step 1) for problem sizes N, where N is a multiple of LCM and N ≤ 8192.
3) Select the fastest kernel among the 50 kernels tested in step 2).
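For example, with the Tahiti DGEMM blocking factors that appear in Table II (Mwg, Nwg, Kwg = 96, 32, 48), the least common multiple is LCM(96, 32, 48) = 96, so step 1) measures every kernel at

    N = ⌊4096/96⌋ · 96 = 42 · 96 = 4032.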
IV. PERFORMANCE EVALUATION

In this study, we have made performance measurements for four different GPUs (AMD Tahiti, AMD Cayman, NVIDIA Kepler, and NVIDIA Fermi) and two different CPUs (Intel Sandy Bridge and AMD Bulldozer). The specifications of the processors are shown in Table I. The Kepler GPU has a boost function that dynamically increases the processor frequency; therefore, the measured performance may be higher than the listed peak performance. Note that the presented performance numbers do not take into account the data transfer time between the host and the OpenCL device.

Table I. Processor specifications

                                      Tahiti        Cayman        Kepler          Fermi          Sandy Bridge     Bulldozer
Product name                          HD 7970       HD 6970       GTX 670 OC      Tesla M2090    Core i7 3960X    FX-8150
Core clock speed [GHz]                0.925         0.88          1.085           1.3            3.3              3.6
Number of compute units               32            24            7               16             6                8
Max DP operations / clock             1024          768           96              512            48               32
Max SP operations / clock             4096          3072          2688            1024           96               64
Peak DP performance [GFlop/s]         947           676           122             665            158.4            115.2
Peak SP performance [GFlop/s]         3789          2703          2916            1331           316.8            230.4
Global memory size [GB]               3             1             2               6              -                -
Peak global memory bandwidth [GB/s]   264           176           192             177            -                -
L3 cache size [MB]                    -             -             -               -              15 (a)           8 (a)
L2 cache size [kB]                    768 (a)       512 (a)       512 (a)         768 (a)        256 (b)          2048 (c)
L1 cache size [kB]                    16 (b)        8 (b)         64 (b)          16 (b)         32 (b)           64 (c)
Local memory size [kB]                64 (b)        32 (b)        48 (b)          48 (b)         32               32
Local memory type                     Scratchpad    Scratchpad    Scratchpad      Scratchpad     Global           Global
OpenCL SDK                            AMD APP 2.6   AMD APP 2.6   CUDA 5.0 RC (f) CUDA 4.1.28    Intel 2013 beta  AMD APP 2.7
Display driver version                12.3 (d)      11.11 (d)     304.33 (e)      285.05 (e)     -                -

SP: single precision; DP: double precision; SDK: Software Development Kit.
(a) size per processor; (b) size per compute unit (core); (c) size per two cores; (d) Catalyst driver version; (e) CUDA driver version; (f) CUDA 5.0 Release Candidate.

A. Performance of GEMM kernels

Fig. 7 depicts the performance of the selected fastest DGEMM and SGEMM kernels as a function of the problem size. Table II shows the sets of parameters and the observed maximum performance of the kernels. The Tahiti GPU shows the highest performance: 863 GFlop/s (91% of the peak performance) in DGEMM and 3047 GFlop/s (80%) in SGEMM. These numbers are higher than our previous results [13]; in particular, the SGEMM performance is significantly increased from 2646 GFlop/s. The main reason for the improvement is that the new SGEMM kernel uses local memory for both matrices A and B. Besides the Tahiti case, local memory usage also improves performance on the Kepler and Fermi GPUs; for instance, if local memory is not used for both matrices on the Kepler, the maximum SGEMM performance decreases from 1440 GFlop/s to 1150 GFlop/s. No prominent performance difference depending on the local memory usage is seen on the CPUs. The Cayman runs slower when local memory is utilized, probably because the cost of the barrier synchronizations is too large.

The selection of the GEMM algorithm also affects the performance of the GEMM kernels. Fig. 8 depicts the relative performance of the three GEMM algorithms with respect to the maximum performance from Table II for each processor. Note that DGEMM kernels with the PL algorithm always fail to execute on the Bulldozer. The BA algorithm is clearly the best on the Tahiti GPU. For the other three GPUs, the best algorithm differs between the DGEMM and SGEMM kernels. Performance variations on the CPUs are relatively small.

GEMM kernels using block-major matrix layouts show the highest performance on all tested processors.
Figure 7. Performance of the fastest DGEMM and SGEMM C ← αA^T B + βC kernels produced by our code generator on different processors
Table II. Parameters for the fastest C ← αA^T B + βC kernel and the maximum performance

DGEMM                Tahiti      Cayman      Kepler      Fermi       Sandy Bridge  Bulldozer
Mwg, Nwg, Kwg        96,32,48    64,32,48    32,64,8     64,64,8     64,32,64      48,32,96
Mwi, Nwi, Kwi        6,2,2       4,4,24      2,4,4       4,4,2       4,8,4         2,8,16
MdimC, NdimC         16,16       16,8        16,16       16,16       16,4          24,4
MdimA, KdimA         16,16       16,8        32,8        64,4        16,4          24,4
KdimB, NdimB         16,16       16,8        8,32        4,64        16,4          48,2
Vector (a)           2           2           1           1           4             2
Stride (b)           -           N           N           N           -             M
Shared (c)           B           -           A, B        A, B        B             B
Layout (d)           CBL,CBL     CBL,CBL     CBL,CBL     CBL,RBL     RBL,RBL       CBL,RBL
Algorithm            BA          BA          BA          PL          DB            DB
Max perf. [GFlop/s]  863         580         128         370         64            37
Efficiency           91%         86%         105%        56%         40%           32%

SGEMM                Tahiti      Cayman      Kepler      Fermi       Sandy Bridge  Bulldozer
Mwg, Nwg, Kwg        96,96,16    128,64,96   64,64,8     64,64,16    64,64,64      32,48,192
Mwi, Nwi, Kwi        6,6,2       8,8,24      8,4,8       8,4,16      8,8,8         4,12,4
MdimC, NdimC         16,16       16,8        8,16        8,16        8,8           8,4
MdimA, KdimA         16,16       16,8        32,4        32,4        8,8           8,4
KdimB, NdimB         16,16       16,8        4,32        8,16        8,8           8,4
Vector (a)           1           4           2           2           8             4
Stride (b)           M           N           M           M, N        M             M
Shared (c)           A, B        -           A, B        A, B        B             -
Layout (d)           CBL,CBL     CBL,CBL     CBL,CBL     CBL,CBL     RBL,RBL       CBL,CBL
Algorithm            BA          PL          PL          BA          BA            BA
Max perf. [GFlop/s]  3047        2167        1440        896         140           87
Efficiency           80%         80%         49%         67%         44%           38%

(a) Width of vector variables. (b) Non-unit-stride access in each direction. (c) Matrix whose data are shared in local memory. (d) Data layout for matrices A and B, respectively.

The influence of the block-major layouts on performance is large on the two AMD GPUs, while it is relatively small on the other processors. The fastest DGEMM kernel that does not use a block-major data layout reaches a maximum performance of 837 GFlop/s on the Tahiti, and its performance for some problem sizes (such as multiples of 2048) deteriorates drastically because of memory bank conflicts.

Figure 8. Relative performance of the GEMM kernels using the three different algorithms, with respect to the maximum performance from Table II for each processor
B. Performance of GEMM implementations

Our GEMM implementations execute the C ← αA^T B + βC kernel after copying the matrix data. The matrix data are transposed and converted into a block-major order during the copying. When a matrix size is not a multiple of a blocking factor, we use a zero-padding technique. This section presents the performance results of our GEMM implementations where matrix data are stored in column-major order. Table III summarizes the measured performance and compares it with the vendor BLAS libraries.
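As an illustration of this copy step, the following C routine is our own sketch (the function name and calling convention are not from the paper). It transposes a row-major M × K matrix A into a Kp × Mp buffer whose dimensions are rounded up to multiples of the blocking factors, filling the padding with zeros; the additional conversion to a block-major layout is omitted for brevity.

/* Transpose A (M x K, row-major) into At (Kp x Mp, row-major), zero-padding
 * the region outside the original matrix so that the GEMM kernel can assume
 * sizes divisible by the blocking factors. */
void copy_transpose_pad(const double *A, double *At,
                        int M, int K, int Mp, int Kp)
{
    for (int k = 0; k < Kp; ++k)
        for (int i = 0; i < Mp; ++i)
            At[k * Mp + i] = (k < K && i < M) ? A[i * K + k] : 0.0;
}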
Table III. Maximum performance [GFlop/s] of our GEMM implementations and GEMM routines of vendor libraries, where matrix data are stored in column-major order

                             DGEMM                      SGEMM
Processor      Impl.       NN    NT    TN    TT       NN    NT    TN    TT
Tahiti         Ours        852   855   849   851      2989  3008  2970  2989
               Vendor (a)  647   731   549   650      2468  2489  1476  2281
Cayman         Ours        568   567   565   565      2060  2096  2037  2074
               Vendor (a)  329   336   302   329      1071  1011   662  1021
Kepler         Ours        127   128   127   128      1399  1417  1382  1399
               Vendor (b)  124   122   122   122      1371  1417  1227  1361
Fermi          Ours        366   368   363   365       882   888   876   882
               Vendor (c)  405   406   408   405       830   942   920   889
Sandy Bridge   Ours         60    60    60    60       132   133   132   133
               Vendor (d)  138   139   138   138       282   285   281   283
Bulldozer      Ours         36    37    36    36        74    78    70    74
               Vendor (e)   50    50    50    50       103   101   103   101

NN: C ← αAB + βC; NT: C ← αAB^T + βC; TN: C ← αA^T B + βC; TT: C ← αA^T B^T + βC.
(a) AMD Accelerated Parallel Processing Math Libraries (APPML) clBLAS 1.8.291; (b) NVIDIA CUBLAS in CUDA 5.0 RC; (c) NVIDIA CUBLAS in CUDA 4.1.28; (d) Intel Math Kernel Library (MKL) 2011.10.319; (e) AMD Core Math Library (ACML) 5.1.0.
Fig. 9 depicts the performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Tahiti GPU. In the figure, we also plot the performance of our previous implementation [13] and of AMD APPML (Accelerated Parallel Processing Math Libraries) clBLAS 1.8.291. Our current implementation shows the highest performance. Note that the current implementation is not fast for small sizes, because the ratio of the copying time to the total time is relatively large. When the matrix size is large, the overhead of the copying is amortized, since copying an N × N matrix needs O(N^2) memory operations while the matrix multiplication requires O(N^3) arithmetic operations.

Figure 9. Performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Tahiti GPU

The performance of the GEMM implementations on the Fermi and Kepler GPUs is shown in Fig. 10. The figure compares the performance of our current implementation with NVIDIA CUBLAS 4.1.28 and MAGMA (Matrix Algebra on GPU and Multicore Architectures) 1.2.1 on the Fermi, and with CUBLAS 5.0 RC on the Kepler. As can be seen, our implementation in OpenCL is comparable to these CUDA implementations. The performance of our OpenCL implementation does not depend strongly on the GEMM type (see Table III).

Figure 10. Performance of different DGEMM and SGEMM C ← αAB + βC implementations on the Fermi and Kepler GPUs

The OpenCL implementation on the CPUs does not compare as well with the vendor libraries. The performance in OpenCL is two or more times lower than Intel MKL (Math Kernel Library) 2011.10.319 on the Sandy Bridge. A possible reason for the low utilization efficiency is that current OpenCL compilers for CPUs are not as mature as those for GPUs. Another reason is that our auto-tuning system is not particularly optimized for CPUs. On the Sandy Bridge CPU, we have also tested another version of the Intel OpenCL SDK (version 2012) in addition to the latest version (2013 beta). Fig. 11 shows the performance of the different DGEMM implementations; using the newer SDK improves the performance by around 20%. In the figure, we also compare our DGEMM performance with that of the DGEMM routine auto-tuned by ATLAS [7]. ATLAS searches for the best BLAS kernels written in the C language. The performance achieved by ATLAS is higher, even though both C and OpenCL are high-level languages.

Figure 11. Performance of different DGEMM C ← αAB + βC implementations on the Sandy Bridge CPU
C. Comparison to other works
Matrix multiplication is compute intensive and important in high-performance computing, and there have been a number of works on fast matrix multiplication. Kurzak et al. [10] developed an auto-tuning system for all GEMM routines on a Fermi GPU. Their system develops fast GEMM kernels written in CUDA. In [17], they also reported performance results of the auto-tuning system on a Kepler GPU (GeForce GTX 680), which has a peak single-precision performance of 3090 GFlop/s; the SGEMM performance is around 1150 GFlop/s for M = N = K = 4096. Although the experimental environments, including the GPU model, are different, our current SGEMM implementation shows higher performance, 1340 GFlop/s, on a Kepler GPU.

Tan et al. [15] presented a fast DGEMM implementation on a Fermi GPU (Tesla C2050). The DB algorithm in Fig. 6 is based on their GEMM algorithm with the double-buffering strategy. They reported that their DGEMM kernel achieves 362 GFlop/s, a 70% utilization efficiency. Their tuned kernel is written in Fermi's native machine language, and they claim that such high processor utilization is impossible using CUDA C or the PTX language; as shown by our experiments, the same holds for OpenCL.
Nakasato [18] implemented GEMM kernels in an assembly-like intermediate language (IL). His GEMM kernels read matrix data through the texture cache (image objects). In our measurement, the performance of this DGEMM kernel is up to 498 GFlop/s (92% efficiency) on an AMD/ATI Cypress GPU (Radeon HD 5870). We applied our auto-tuning system to the same GPU, and the fastest generated DGEMM implementation in OpenCL achieves 495 GFlop/s.

Du et al. [12] presented auto-tuned SGEMM and DGEMM routines in OpenCL. The maximum performance of their DGEMM routine is 308 GFlop/s (57% efficiency) on the Cypress GPU. We consider that the large performance difference between our implementation and theirs comes from the following two main reasons:
1) The OpenCL SDK they used is older and less mature. They used ATI Stream SDK 2.1, while we use AMD APP SDK 2.5.
2) The set of parameters in their code generator is different. Their parameters include the vector variable width, blocking factors, texture cache usage, and local memory usage.

V. CONCLUSION

We have shown that our tuning system for fast matrix multiplication is applicable to a wide range of processors that support OpenCL. The performance demonstrated by the best GEMM kernels is superior to the vendor library (clBLAS) on AMD GPUs. On NVIDIA GPUs, the GEMM performance is almost equivalent to that of libraries in CUDA (CUBLAS and MAGMA). For CPUs, our current implementations do not perform as well as for GPUs. The high performance of our GEMM kernels relies on the usage of block-major layouts for storing matrix data; block-major layouts contribute to the performance improvement on all tested processors. We have also implemented three different GEMM algorithms and measured the performance differences among them.
Our implementations perform a copy of the matrix data in order to use a GEMM kernel with a block-major layout. For small sizes, the overhead of this copying is relatively large, and the implementation therefore does not run fast. One possible solution for such sizes is to use another GEMM kernel that works without the matrix copying; implementing such a kernel and combining it with the current implementation is future work.

ACKNOWLEDGMENT

A part of this work has been carried out under "the Interdisciplinary Computational Science Program" in the Center for Computational Sciences, University of Tsukuba.

REFERENCES

[1] "Basic Linear Algebra Subprograms Technical Forum Standard," Aug. 2001. [Online]. Available: http://www.netlib.org/blas/blast-forum/blas-report.pdf
[2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1999.
[3] B. Kågström, P. Ling, and C. Van Loan, "GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark," ACM Transactions on Mathematical Software, vol. 24, no. 3, pp. 268–302, 1998.
[4] Khronos Group. (Accessed Sep. 30, 2012) OpenCL - the open standard for parallel programming of heterogeneous systems. [Online]. Available: http://www.khronos.org/opencl
[5] AMD Inc., "AMD Accelerated Parallel Processing OpenCL Programming Guide, rev. 2.3," Jul. 2012.
[6] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel, "Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology," Computer Science Department, University of Tennessee, Tech. Rep., May 1996. [Online]. Available: http://www.netlib.org/lapack/lawnspdf/lawn111.pdf
[7] R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated empirical optimizations of software and the ATLAS project," Parallel Computing, vol. 27, no. 1-2, pp. 3–35, Jan. 2001.
[8] C. Jiang and M. Snir, "Automatic tuning matrix multiplication performance on graphics hardware," in Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), Sep. 2005, pp. 185–194.
[9] R. Nath, S. Tomov, and J. Dongarra, "An improved MAGMA GEMM for Fermi graphics processing units," International Journal of High Performance Computing Applications, vol. 24, no. 4, pp. 511–515, 2010.
[10] J. Kurzak, S. Tomov, and J. Dongarra, "Autotuning GEMM kernels for the Fermi GPU," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, pp. 2045–2057, Nov. 2012.
[11] C. Jang. (Accessed Sep. 30, 2012) GATLAS: GPU Automatically Tuned Linear Algebra Software. [Online]. Available: http://golem5.org/gatlas
[12] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming," Parallel Computing, vol. 38, no. 8, pp. 391–407, Oct. 2011.
[13] K. Matsumoto, N. Nakasato, and S. G. Sedukhin, "Implementing a code generator for fast matrix multiplication in OpenCL on the GPU," in Proceedings of the IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC-12). Aizu-Wakamatsu City, Japan: IEEE Computer Society, Sep. 2012, pp. 198–204.
[14] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan, "Optimal loop unrolling for GPGPU programs," in Proceedings of the 24th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010). IEEE, Apr. 2010, pp. 1–11.
[15] G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun, "Fast implementation of DGEMM on Fermi GPU," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11). Seattle, WA, USA: ACM, Nov. 2011, pp. 35:1–35:11.
[16] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC'08). Austin, TX, USA: IEEE Press, Nov. 2008, pp. 31:1–31:11.
[17] J. Kurzak, P. Luszczek, S. Tomov, and J. Dongarra, "Preliminary results of autotuning GEMM kernels for the NVIDIA Kepler architecture - GeForce GTX 680," LAPACK Working Note 267, 2012. [Online]. Available: http://www.netlib.org/lapack/lawnspdf/lawn267.pdf
[18] N. Nakasato, "A fast GEMM implementation on the Cypress GPU," ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 50–55, Mar. 2011.