TR-2015-01
Comparison of OpenCL performance on different platforms using VexCL and Blaze
Hammad Mazhar, Dan Negrut
January 28, 2015
Abstract
This technical report provides performance numbers for several benchmark problems running on several different hardware platforms. The goal of this report is twofold. First, it helps us better understand how the performance of OpenCL changes across platforms. Second, it provides an OpenCL-OpenMP comparison for a sparse matrix-vector multiplication operation. The VexCL library is used for the OpenCL portion of this comparison and the Blaze C++ library is used for the OpenMP portion.
Contents
1 Blaze
2 VexCL
3 Hardware Platforms
  3.1 AMD Kaveri
  3.2 Intel Haswell
  3.3 Intel Haswell-E
  3.4 Intel Xeon
  3.5 AMD Opteron
4 Benchmark Results
5 Chrono SpMV Benchmark
1 Blaze
Blaze [1] is an open source header-only library for performing linear algebra operations using dense and sparse data structures. Blaze was designed so that mathematical expressions can be written intuitively, with the library transparently handling type conversion and optimization. By default, Blaze uses OpenMP for parallelism, but it can be configured to use C++11 threads or Boost threads, or to execute serially. Additionally, Blaze provides bindings for generic BLAS libraries, such as ATLAS, which are used transparently for certain linear algebra operations such as matrix-matrix multiplication. In this report, Blaze was used to compare the performance of OpenCL to OpenMP on the platforms that supported it.
2 VexCL
VexCL [2] is an open source header-only expression template library for both OpenCL and CUDA. Similar to Thrust [3], its purpose is to reduce the boilerplate code required to develop applications for GPUs and other accelerators. VexCL provides many functions for reduction, linear algebra, sorting, and related operations. In this technical report, the synthetic benchmark results are produced by the VexCL benchmark example [4]. For the real-world examples, VexCL's SpMV function is used. For all tests, the latest version of VexCL from the GitHub repository [5] was used.
3 Hardware Platforms
Thirteen different hardware setups were used for benchmarking. This section describes each setup.
3.1 AMD Kaveri
This CPU is based on AMD's Kaveri architecture. The die features four x86-64 cores based on AMD's Steamroller architecture and a Radeon R7 class GPU with 8 compute units. Specifically, the GPU performance of the A10-7850K matches that of the Radeon HD 7750. The GPU does not have dedicated memory and instead relies on system memory. The upside is that the total memory available to the GPU is equal to the system RAM; the downside is that system memory is generally much slower than ordinary on-board GPU memory.

CPU
  Model: AMD A10-7850K
  Architecture: Steamroller
  Clock (Turbo): 3.7 GHz (4.0 GHz)
  Cores: 4
  Threads: 4
  L1 Cache: 2x96 KB 3-way instruction, 4x16 KB 4-way data
  L2 Cache: 2x2 MB 16-way
  L3 Cache: None
  Memory: 16GB
  Memory Interface: Dual Channel DDR3
  OS: Arch Linux
  Compiler: GCC 4.9.2
  Compiler Flags: -O3
  OpenCL: AMD APP

Accelerator
  Type: AMD Radeon R7 series
  Compute Units: 8
  Cores: 512
  Clock: 720 MHz

Reference: [6, 7]
3.2 Intel Haswell

CPU
  Model: i7-4770K
  Architecture: Haswell
  Clock (Turbo): 3.5 GHz (3.9 GHz)
  Cores: 4
  Threads: 8
  L1 Cache: 4x32 KB 8-way instruction, 4x32 KB 8-way data
  L2 Cache: 4x256 KB 8-way
  L3 Cache: 8 MB 16-way
  Memory: 32GB
  Memory Interface: Dual Channel DDR3
  OS: Arch Linux
  Compiler: GCC 4.9.2
  Compiler Flags: -O3
  OpenCL: Intel(R) OpenCL

Accelerator 1
  Model: NVIDIA GTX 680
  Architecture: GK104 - Kepler
  Compute Units: 8
  Cores: 1536
  Clock (Boost): 1006 MHz (1058 MHz)
  Memory: 2GB 256-bit GDDR5

Accelerator 2
  Model: NVIDIA K20c
  Architecture: GK110 - Kepler
  Compute Units: 13
  Cores: 2496
  Clock: 706 MHz
  Memory: 5GB 320-bit GDDR5

Reference: [8–11]
3.3 Intel Haswell-E

CPU
  Model: i7-5960X
  Architecture: Haswell-E
  CPU Clock (Turbo): 3.0 GHz (3.5 GHz)
  CPU Cores: 8
  CPU Threads: 16
  L1 Cache: 8x32 KB instruction, 8x32 KB data
  L2 Cache: 8x256 KB
  L3 Cache: 20 MB
  Memory: 32GB
  Memory Interface: Quad Channel DDR4
  OS: Arch Linux
  Compiler: GCC 4.9.2
  Compiler Flags: -O3
  OpenCL: Intel(R) OpenCL

Accelerator
  Model: NVIDIA GTX 770
  Architecture: GK104 - Kepler
  Compute Units: 8
  Cores: 1536
  Clock (Boost): 1046 MHz (1085 MHz)
  Memory: 2GB 256-bit GDDR5

Reference: [12, 13]
3.4 Intel Xeon

CPU
  Model: E5-2690 V2
  Architecture: Ivy Bridge-EP
  Sockets: 2
  CPU Clock (Turbo): 3.0 GHz (3.6 GHz)
  CPU Cores: 10
  CPU Threads: 20
  L1 Cache: 10x32 KB 8-way instruction, 10x32 KB 8-way data
  L2 Cache: 10x256 KB 8-way
  L3 Cache: 25 MB 20-way
  Memory: 64GB
  Memory Interface: Quad Channel DDR3
  OS: Arch Linux
  Compiler: GCC 4.9.2
  Compiler Flags: -O3
  OpenCL: Intel(R) OpenCL

Accelerator 1, 2, 3
  Model: NVIDIA K20x
  Architecture: GK110 - Kepler
  Compute Units: 15
  Cores: 2688
  Clock: 732 MHz
  Memory: 6GB 384-bit GDDR5

Accelerator 4
  Model: Intel Xeon Phi 5110P
  Architecture: Knights Corner
  Cores: 60
  Threads: 240
  L1 Cache: 60x32 KB 8-way instruction, 60x32 KB 8-way data
  L2 Cache: 60x512 KB 8-way
  Clock: 1053 MHz
  Memory: 8GB GDDR5

Reference: [14–16]
3.5 AMD Opteron

CPU
  Model: 6274
  Architecture: Bulldozer
  Sockets: 4
  Clock (Turbo): 2.2 GHz (3.1 GHz)
  Cores: 16
  Threads: 16
  L1 Cache: 8x64 KB 2-way instruction, 16x16 KB 4-way data
  L2 Cache: 8x2 MB 16-way
  L3 Cache: 2x8 MB up to 64-way
  Memory: 128GB
  Memory Interface: Quad Channel DDR3
  OS: CentOS 6
  Compiler: GCC 4.9.2
  Compiler Flags: -O3
  OpenCL: AMD APP

Reference: [17]
4 Benchmark Results
Using the VexCL library, a benchmark was performed on each platform. Several tests were used to gauge performance, including sort, reduce, and scan operations, vector-vector operations such as add (SAXPY), and matrix-vector operations such as SpMV.
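The results for these tests are reported as throughput: keys per second for the sort and scan tests and GFLOPs for the others. The following is a minimal host-side sketch of how such numbers can be derived from a timed run. It is not the VexCL benchmark code; the function names are hypothetical, and the convention of counting 2 floating-point operations per SAXPY element is an assumption that the report does not state.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// y = a * x + y (SAXPY), executed serially on the host.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Wall-clock time of a callable, in seconds.
template <class F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Assumption: one multiply and one add per element, i.e. 2 FLOPs.
double saxpy_gflops(std::size_t n, double seconds) {
    return 2.0 * static_cast<double>(n) / seconds / 1e9;
}

// Throughput of a sort or scan over n keys, in keys per second.
double keys_per_second(std::size_t n, double seconds) {
    return static_cast<double>(n) / seconds;
}
```

A typical use would time several repetitions of `saxpy` with `time_seconds`, then pass the vector length and the average time to `saxpy_gflops`.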
[Figure: Sort and Scan throughput in keys/sec (axis from 10^6 to 10^9) for K20X, K20C, GTX770, GTX680, 2xK20X, 3xK20X, i7-5960X, MIC, Opteron 6274, E5-2690 V2, Radeon R7, i7-4770K, and A10-7850K]
[Figure: Reduce, SAXPY, and SpMV throughput in GFLOPs for the same devices]
5 Chrono SpMV Benchmark
Along with the synthetic benchmarks performed using the VexCL library, actual matrices from a simulation were used to gauge real-world performance. The simulation setup consisted of a kinematically driven vehicle that fords a river composed of one million rigid, frictionless spheres. Specifically, 8 different sets of matrices are compared on each platform. The problem being solved is $D^T M^{-1} D x$, which is split into two matrix-vector multiplications: first $temp = M^{-1} D x$ and then $Result = D^T temp$. Note that $M$ is a diagonal matrix and $x$ is a vector. The figures below show the simulation output, the Jacobian matrix $D$, and the FLOP rate for the computation.

[Figures: SpMV FLOP rate in GFLOPs at T = 3.0 s, T = 4.0 s, and T = 5.0 s for A10-7850K, i7-4770K, Radeon R7, E5-2690 V2, Opteron 6274, MIC, i7-5960X, 3xK20X, 2xK20X, GTX680, GTX770, K20C, and K20X]

Fig. 8 and Fig. 9 provide the same data as above in a different format: data is grouped by device name, with each bar representing one of the 7 tests that were performed. The results demonstrate how different sparsity patterns affect the speed of the SpMV operation. Fig. 10 and Fig. 11 show the results using the Blaze C++ library along with the speedup of VexCL vs. Blaze. In most cases Blaze was slightly faster than VexCL.
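The two-step product described in this section, temp = M^{-1} D x followed by Result = D^T temp, can be sketched in plain C++ as follows. This is an illustration only and uses neither VexCL nor Blaze. The `CsrMatrix` type and all function names are hypothetical, and applying the diagonal M^{-1} as a separate scaling step is an assumption: the report does not say whether M^{-1} D is formed as an explicit sparse matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) storage for the Jacobian D (hypothetical type).
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> values;        // size nnz
};

// y = A * x
std::vector<double> multiply(const CsrMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.cols);
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            sum += A.values[k] * x[A.col_idx[k]];
        }
        y[i] = sum;
    }
    return y;
}

// y = A^T * x, computed from the CSR form of A by scattering each row's
// contribution into the output, so the transpose is never stored.
std::vector<double> multiply_transposed(const CsrMatrix& A,
                                        const std::vector<double>& x) {
    assert(x.size() == A.rows);
    std::vector<double> y(A.cols, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i) {
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            y[A.col_idx[k]] += A.values[k] * x[i];
        }
    }
    return y;
}

// Result = D^T * (M^{-1} * (D * x)), with the diagonal of M given in m_diag.
std::vector<double> apply_dt_minv_d(const CsrMatrix& D,
                                    const std::vector<double>& m_diag,
                                    const std::vector<double>& x) {
    assert(m_diag.size() == D.rows);
    std::vector<double> temp = multiply(D, x);  // D x
    for (std::size_t i = 0; i < temp.size(); ++i) {
        temp[i] /= m_diag[i];                   // temp = M^{-1} D x
    }
    return multiply_transposed(D, temp);        // Result = D^T temp
}
```

Because M is diagonal, applying its inverse costs one division per row, so the two sparse products dominate the run time.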
[Figures: SpMV FLOP rate in GFLOPs at T = 6.0 s, T = 7.0 s, T = 8.0 s, and T = 9.0 s for each device]
[Plot "SpMV CPU": GFLOPs for i7-5960X, Opteron 6274, E5-2690 V2, i7-4770K, and A10-7850K, one bar per test from T = 3.0 s to T = 9.0 s]
Figure 8: Combined plots for the CPUs using VexCL

[Plot "SpMV Accelerators": GFLOPs for K20X, K20C, GTX770, GTX680, 2xK20X, 3xK20X, MIC, and Radeon R7, one bar per test from T = 3.0 s to T = 9.0 s]
Figure 9: Combined plots for the accelerators using VexCL

[Plot "SpMV Blaze": GFLOPs for i7-5960X, Opteron 6274, E5-2690 V2, i7-4770K, and A10-7850K, one bar per test from T = 3.0 s to T = 9.0 s]
Figure 10: Combined plots for the CPUs using Blaze

[Plot "SpMV Speedup VexCL vs Blaze": i7-5960X, Opteron 6274, E5-2690 V2, i7-4770K, and A10-7850K, one bar per test from T = 3.0 s to T = 9.0 s]
Figure 11: Speedup for VexCL compared to Blaze for different matrices. A speedup of less than one means that VexCL was slower than Blaze.
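The speedup convention in the caption of Figure 11 can be written out as a small sketch: for each matrix, the VexCL rate is divided by the Blaze rate, so a value below one means VexCL was slower. The function name is hypothetical, and the sketch assumes both rates are given in the same unit, such as GFLOPs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Speedup of VexCL relative to Blaze for each matrix:
// speedup[i] = vexcl_rate[i] / blaze_rate[i].
std::vector<double> speedup_vs_blaze(const std::vector<double>& vexcl_rate,
                                     const std::vector<double>& blaze_rate) {
    assert(vexcl_rate.size() == blaze_rate.size());
    std::vector<double> speedup(vexcl_rate.size());
    for (std::size_t i = 0; i < vexcl_rate.size(); ++i) {
        assert(blaze_rate[i] > 0.0);  // a zero rate would make the ratio undefined
        speedup[i] = vexcl_rate[i] / blaze_rate[i];
    }
    return speedup;
}
```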
References
[1] K. Iglberger, G. Hager, J. Treibig, and U. Rüde. High performance smart expression template math libraries. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 367–373, July 2012.
[2] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. Programming CUDA and OpenCL: A case study using modern C++ libraries. CoRR, abs/1212.6326, 2012.
[3] J. Hoberock and N. Bell. Thrust: C++ template library for CUDA. http://thrust.github.com/.
[4] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. benchmark.cpp. https://github.com/ddemidov/vexcl/blob/master/examples/benchmark.cpp.
[5] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. VexCL. https://github.com/ddemidov/vexcl.
[6] AMD. AMD A-Series APU processors. http://www.amd.com/en-us/products/processors/desktop/a-series-apu.
[7] CPU-World. AMD A10-Series A10-7850K. http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series%20A10-7850K.html.
[8] CPU-World. Intel Core i7-4770K. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4770K.html.
[9] techpowerup. NVIDIA GeForce GTX 680. http://www.techpowerup.com/gpudb/342/geforce-gtx-680.html.
[10] Nvidia. Tesla K20 GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v07.pdf.
[11] techpowerup. NVIDIA Tesla K20c. http://www.techpowerup.com/gpudb/564/tesla-k20c.html.
[12] CPU-World. Intel Core i7-5960X. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-5960X%20Extreme%20Edition.html.
[13] Nvidia. GeForce GTX 700 series. http://www.nvidia.com/gtx-700-graphics-cards/gtx-770/.
[14] CPU-World. Intel Xeon E5-2690 v2. http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2690%20v2.html.
[15] Nvidia. Tesla K20X GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20X-BD-06397-001-v07.pdf.
[16] CPU-World. Intel Xeon Phi 5110P. http://www.cpu-world.com/CPUs/Xeon_Phi/Intel-Xeon%20Phi%205110P.html.
[17] CPU-World. AMD Opteron 6274. http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206274%20OS6274WKTGGGU.html.