TR-2015-01

Comparison of OpenCL performance on different platforms using VexCL and Blaze

Hammad Mazhar, Dan Negrut

January 28, 2015

Abstract

This technical report provides performance numbers for several benchmark problems running on several different hardware platforms. The goal of this report is twofold. First, it helps us better understand how the performance of OpenCL changes on different platforms. Second, it provides an OpenCL-OpenMP comparison for a sparse matrix-vector multiplication operation. The VexCL library is used for the OpenCL portion of this comparison and the Blaze C++ library is used for the OpenMP portion.

Contents

1 Blaze
2 VexCL
3 Hardware Platforms
  3.1 AMD Kaveri
  3.2 Intel Haswell
  3.3 Intel Haswell-E
  3.4 Intel Xeon
  3.5 AMD Opteron
4 Benchmark Results
5 Chrono SpMV Benchmark

1 Blaze

Blaze [1] is an open source, header-only library for performing linear algebra operations using dense and sparse data structures. Blaze was designed so that mathematical expressions can be written intuitively, with the library transparently handling type conversion and optimization. By default, Blaze uses OpenMP for parallelism, but it can be configured to use C++11 threads or Boost threads, or to execute serially. Additionally, Blaze provides bindings for generic BLAS libraries, such as ATLAS, which are used transparently for certain linear algebra operations such as matrix-matrix multiplication. In this report, Blaze was used to compare the performance of OpenCL to that of OpenMP on platforms that supported it.

2 VexCL

VexCL [2] is an open source, header-only expression template library for both OpenCL and CUDA. Similar to Thrust [3], its purpose is to reduce the boilerplate code required to develop applications for GPUs and other accelerators. VexCL provides many different functions that deal with reduction, linear algebra, sorting, etc. In this technical report, the synthetic benchmark results are provided by the VexCL benchmark example [4]. For the real-world examples, VexCL's SpMV function is used. For all tests, the latest version of VexCL from the GitHub repository [5] was used.

3 Hardware Platforms

Thirteen different hardware setups were used for benchmarking. This section describes each setup.

3.1 AMD Kaveri

This CPU is based on AMD's new Kaveri architecture. The die features 4 x86-64 cores based on AMD's Steamroller architecture and a Radeon R7 class GPU with 8 compute units. Specifically, the performance of the GPU in the 7850K matches that of the Radeon HD 7750. The GPU does not have dedicated memory and instead relies on the system memory. The upside of this is that the total memory available to the GPU is equal to the system RAM; the downside is that system memory is generally much slower than dedicated on-board GPU memory.

CPU
  Model              AMD A10-7850K
  Architecture       Steamroller
  Clock (Turbo)      3.7 GHz (4.0 GHz)
  Cores              4
  Threads            4
  L1 Cache           2x96 KB 3-way Instruction, 4x16 KB 4-way Data
  L2 Cache           2x2 MB 16-way
  L3 Cache           None
  Memory             16 GB
  Memory Interface   Dual Channel DDR3
  OS                 Arch Linux
  Compiler           GCC 4.9.2
  Compiler Flags     -O3
  OpenCL             AMD APP

Accelerator
  Type               AMD Radeon R7 series
  Compute Units      8
  Cores              512
  Clock              720 MHz

Reference: [6, 7]

3.2 Intel Haswell

CPU
  Model              i7-4770K
  Architecture       Haswell
  Clock (Turbo)      3.5 GHz (3.9 GHz)
  Cores              4
  Threads            8
  L1 Cache           4x32 KB 8-way Instruction, 4x32 KB 8-way Data
  L2 Cache           4x256 KB 8-way
  L3 Cache           8 MB 16-way
  Memory             32 GB
  Memory Interface   Dual Channel DDR3
  OS                 Arch Linux
  Compiler           GCC 4.9.2
  Compiler Flags     -O3
  OpenCL             Intel(R) OpenCL

Accelerator 1
  Model              NVIDIA GTX 680
  Architecture       GK104 - Kepler
  Compute Units      8
  Cores              1536
  Clock (Boost)      1006 MHz (1058 MHz)
  Memory             2 GB 256-bit GDDR5

Accelerator 2
  Model              NVIDIA K20c
  Architecture       GK110 - Kepler
  Compute Units      13
  Cores              2496
  Clock              706 MHz
  Memory             5 GB 320-bit GDDR5

Reference: [8–11]

3.3 Intel Haswell-E

CPU
  Model              i7-5960X
  Architecture       Haswell-E
  CPU Clock (Turbo)  3.0 GHz (3.5 GHz)
  CPU Cores          8
  CPU Threads        16
  L1 Cache           8x32 KB Instruction, 8x32 KB Data
  L2 Cache           8x256 KB
  L3 Cache           20 MB
  Memory             32 GB
  Memory Interface   Quad Channel DDR4
  OS                 Arch Linux
  Compiler           GCC 4.9.2
  Compiler Flags     -O3
  OpenCL             Intel(R) OpenCL

Accelerator
  Model              NVIDIA GTX 770
  Architecture       GK104 - Kepler
  Compute Units      8
  Cores              1536
  Clock (Boost)      1046 MHz (1085 MHz)
  Memory             2 GB 256-bit GDDR5

Reference: [12, 13]

3.4 Intel Xeon

CPU
  Model              E5-2690 V2
  Architecture       Ivy Bridge-EP
  Sockets            2
  CPU Clock (Turbo)  3.0 GHz (3.6 GHz)
  CPU Cores          10
  CPU Threads        20
  L1 Cache           10x32 KB 8-way Instruction, 10x32 KB 8-way Data
  L2 Cache           10x256 KB 8-way
  L3 Cache           25 MB 20-way
  Memory             64 GB
  Memory Interface   Quad Channel DDR3
  OS                 Arch Linux
  Compiler           GCC 4.9.2
  Compiler Flags     -O3
  OpenCL             Intel(R) OpenCL

Accelerators 1, 2, 3
  Model              NVIDIA K20x
  Architecture       GK110 - Kepler
  Compute Units      15
  Cores              2688
  Clock              732 MHz
  Memory             6 GB 384-bit GDDR5

Accelerator 4
  Model              Intel Xeon Phi 5110P
  Architecture       Knights Corner
  Cores              60
  Threads            240
  L1 Cache           60x32 KB 8-way Instruction, 60x32 KB 8-way Data
  L2 Cache           60x512 KB 8-way
  Clock              1053 MHz
  Memory             8 GB GDDR5

Reference: [14–16]

3.5 AMD Opteron

CPU
  Model              6274
  Architecture       Bulldozer
  Sockets            4
  Clock (Turbo)      2.2 GHz (3.1 GHz)
  Cores              16
  Threads            16
  L1 Cache           8x64 KB 2-way Instruction, 16x16 KB 4-way Data
  L2 Cache           8x2 MB 16-way
  L3 Cache           2x8 MB up to 64-way
  Memory             128 GB
  Memory Interface   Quad Channel DDR3
  OS                 CentOS 6
  Compiler           GCC 4.9.2
  Compiler Flags     -O3
  OpenCL             AMD APP

Reference: [17]

4 Benchmark Results

Using the VexCL library, a benchmark was performed on each platform. Several different tests were used to gauge performance, including sort, reduce, and scan operations, along with vector-vector operations such as add and matrix-vector operations such as SpMV.

[Figure: Sort and Scan throughput in Keys/sec (axis from 10^6 to 10^9) for the K20X, K20C, GTX770, GTX680, 2xK20X, 3xK20X, i7-5960X, MIC, Opteron 6274, E5-2690 V2, Radeon R7, i7-4770K, and A10-7850K.]

[Figure: Reduce, SAXPY, and SpMV performance in GFLOPs (axis from 0 to 40) for the same thirteen devices.]
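To make the GFLOPs unit used in these plots concrete, the following is a minimal sketch of how such a figure can be derived for SAXPY on a CPU. It is not the VexCL benchmark code [4]: the function names `saxpy`, `gflops`, and `measure_saxpy_gflops` are hypothetical, and the count of two floating point operations per element (one multiply, one add) is a common convention assumed here rather than something stated in this report.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// y <- a * x + y; one multiply and one add per element.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Convert an operation count and an elapsed time into GFLOPs.
// Returns 0 if the timer reported no elapsed time.
double gflops(double flop_count, double seconds) {
    if (seconds <= 0.0) return 0.0;
    return flop_count / seconds / 1e9;
}

// Time `repeats` SAXPY calls on vectors of length n and report the
// achieved rate, assuming 2 flops per element (an assumption of this sketch).
double measure_saxpy_gflops(std::size_t n, int repeats) {
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        saxpy(0.5f, x, y);
    }
    auto stop = std::chrono::steady_clock::now();
    volatile float sink = y.empty() ? 0.0f : y[0];  // keep the result observable
    (void)sink;
    double seconds = std::chrono::duration<double>(stop - start).count();
    return gflops(2.0 * static_cast<double>(n) * repeats, seconds);
}
```

Calling `measure_saxpy_gflops(1 << 24, 100)` would give a single-threaded host number; the values plotted in this section come from the VexCL benchmark running through OpenCL on each device, so the two are not directly comparable.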

5 Chrono SpMV Benchmark

Along with the synthetic benchmarks performed using the VexCL library, actual matrices from a simulation were used to gauge real-world performance. The simulation setup consisted of a kinematically driven vehicle that fords a river comprised of one million rigid, frictionless spheres. Specifically, 8 different sets of matrices are compared on each platform. The problem being solved is $D^T M^{-1} D x$, which is split into two matrix-vector multiplications: first $\mathrm{temp} = M^{-1} D x$ and then $\mathrm{Result} = D^T \mathrm{temp}$. Note that $M$ is a diagonal matrix and $x$ is a vector. The figures below show the simulation output, the Jacobian matrix $D$, and the FLOP rate results for the computation.

[Figure panels: SpMV performance in GFLOPs per device (A10-7850K, i7-4770K, Radeon R7, E5-2690 V2, Opteron 6274, MIC, i7-5960X, 3xK20X, 2xK20X, GTX680, GTX770, K20C, K20X) for T = 3.0 s, T = 4.0 s, and T = 5.0 s.]

Fig. 8 and Fig. 9 provide the same data as above in a different format: data is grouped by device name, with each bar representing one of the 7 tests that were performed. The results demonstrate how different sparsity patterns affect the speed at which the SpMV operation is performed. Fig. 10 and Fig. 11 show the results using the Blaze C++ library along with the speedup of VexCL vs. Blaze. In most cases Blaze was slightly faster than VexCL.
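The two-step product described in this section, first $M^{-1} D x$ and then $D^T$ applied to the result, can be sketched in plain C++ as follows. This is an illustrative sketch only, not the VexCL or Blaze code that was benchmarked. The `CsrMatrix` type and the functions `spmv`, `scale_rows`, and `apply_DtMinvD` are hypothetical names, and it is an assumption of the sketch that $M^{-1} D$ is formed once ahead of time by scaling the rows of $D$ with the inverse diagonal of $M$, and that $D^T$ is stored as a separate sparse matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal compressed sparse row (CSR) matrix; a hypothetical helper type.
struct CsrMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> values;        // size nnz
};

// y = A * x. Each row is independent of the others, which is what
// makes this loop a natural target for OpenMP or OpenCL parallelism.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.cols);
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            sum += A.values[k] * x[A.col_idx[k]];
        }
        y[i] = sum;
    }
    return y;
}

// Form M^{-1} D for a diagonal M by dividing row i of D by m_diag[i].
CsrMatrix scale_rows(const CsrMatrix& D, const std::vector<double>& m_diag) {
    assert(m_diag.size() == D.rows);
    CsrMatrix out = D;
    for (std::size_t i = 0; i < D.rows; ++i) {
        for (std::size_t k = D.row_ptr[i]; k < D.row_ptr[i + 1]; ++k) {
            out.values[k] = D.values[k] / m_diag[i];
        }
    }
    return out;
}

// Result = D^T * temp, with temp = (M^{-1} D) * x: two SpMV calls.
std::vector<double> apply_DtMinvD(const CsrMatrix& MinvD, const CsrMatrix& Dt,
                                  const std::vector<double>& x) {
    std::vector<double> temp = spmv(MinvD, x);
    return spmv(Dt, temp);
}
```

Splitting the product this way avoids ever forming $D^T M^{-1} D$ explicitly; only the two sparse factors are stored, and each application costs two sparse matrix-vector products.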

[Figure panels: SpMV performance in GFLOPs per device for T = 6.0 s, T = 7.0 s, T = 8.0 s, and T = 9.0 s.]

[Figure: SpMV on CPUs. Performance in GFLOPs (axis from 0 to 4) for the i7-5960X, Opteron 6274, E5-2690 V2, i7-4770K, and A10-7850K, with one bar per test from T = 3.0 s to T = 9.0 s.]

Figure 8: Combined plots for the CPUs using VexCL

[Figure: SpMV on accelerators. Performance in GFLOPs (axis from 0 to 18) for the K20X, K20C, GTX770, GTX680, 2xK20X, 3xK20X, MIC, and Radeon R7, with one bar per test from T = 3.0 s to T = 9.0 s.]

Figure 9: Combined plots for the accelerators using VexCL

[Figure: SpMV with Blaze. Performance in GFLOPs (axis from 0 to 4) for the i7-5960X, E5-2690 V2, i7-4770K, and A10-7850K.]

Figure 10: Combined plots for the CPUs using Blaze

[Figure: SpMV speedup, VexCL vs. Blaze (axis from 0 to 1.4), for the i7-5960X, E5-2690 V2, i7-4770K, and A10-7850K.]

Figure 11: Speedup for VexCL compared to Blaze for different matrices. A speedup of less than one means that VexCL was slower than Blaze.
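The reading of Figure 11 can be written out as a formula. The report does not state the exact definition, so the following is an assumption: the speedup $S$ is taken as the ratio of the FLOP rate $R$ achieved by each library on the same matrix and the same device.

```latex
\[
  S = \frac{R_{\mathrm{VexCL}}}{R_{\mathrm{Blaze}}},
  \qquad
  S < 1 \iff R_{\mathrm{VexCL}} < R_{\mathrm{Blaze}} .
\]
```

Under this definition, a purely hypothetical value of $S = 0.8$ would mean that VexCL reached 80% of the rate Blaze achieved for that matrix.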

References

[1] K. Iglberger, G. Hager, J. Treibig, and U. Rüde. High performance smart expression template math libraries. In High Performance Computing and Simulation (HPCS), 2012 International Conference on, pages 367–373, July 2012.
[2] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. Programming CUDA and OpenCL: A case study using modern C++ libraries. CoRR, abs/1212.6326, 2012.
[3] J. Hoberock and N. Bell. Thrust: C++ template library for CUDA. http://thrust.github.com/.
[4] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. benchmark.cpp. https://github.com/ddemidov/vexcl/blob/master/examples/benchmark.cpp.
[5] Denis Demidov, Karsten Ahnert, Karl Rupp, and Peter Gottschling. VexCL. https://github.com/ddemidov/vexcl.
[6] AMD. AMD A-Series APU processors. http://www.amd.com/en-us/products/processors/desktop/a-series-apu.
[7] CPU-World. AMD A10-Series A10-7850K. http://www.cpu-world.com/CPUs/Bulldozer/AMD-A10-Series%20A10-7850K.html.
[8] CPU-World. Intel Core i7-4770K. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4770K.html.
[9] techpowerup. NVIDIA GeForce GTX 680. http://www.techpowerup.com/gpudb/342/geforce-gtx-680.html.
[10] Nvidia. Tesla K20 GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v07.pdf.
[11] techpowerup. NVIDIA Tesla K20c. http://www.techpowerup.com/gpudb/564/tesla-k20c.html.
[12] CPU-World. Intel Core i7-5960X. http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-5960X%20Extreme%20Edition.html.
[13] Nvidia. GeForce GTX 700 series. http://www.nvidia.com/gtx-700-graphics-cards/gtx-770/.
[14] CPU-World. Intel Xeon E5-2690 v2. http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20E5-2690%20v2.html.
[15] Nvidia. Tesla K20X GPU accelerator. http://www.nvidia.com/content/PDF/kepler/Tesla-K20X-BD-06397-001-v07.pdf.
[16] CPU-World. Intel Xeon Phi 5110P. http://www.cpu-world.com/CPUs/Xeon_Phi/Intel-Xeon%20Phi%205110P.html.
[17] CPU-World. AMD Opteron 6274. http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206274%20OS6274WKTGGGU.html.
