Optimizing GPGPU Kernel Summation for Performance and Energy Efficiency
Jiajun Wang, Ahmed Khawaja, George Biros, Andreas Gerstlauer, Lizy K. John
The University of Texas at Austin
Introduction
• Kernel summation is a fundamental problem in computational physics, statistics, and machine learning.
• What is a kernel, and what is kernel summation?
Related Work
• Tree codes, fast multipole methods, Ewald sums
  ++ Scale to billions or trillions of points by reducing complexity from O(N²) to O(N log N)
  -- Work only for problems in low dimensions (two or three); not suitable for statistics and machine learning tasks
• High-dimensional tasks instead rely on General Matrix-Matrix Multiplication (GEMM)
Kernel Summation Steps
• Notation:
  M: number of points in the target set
  N: number of points in the source set
  K: dimension
• Inputs:
  Target matrix A: M-by-K
  Source matrix B: K-by-N
  Weight vector W: N-by-1
• Gaussian kernel: Ҝ(𝛼, 𝛽) = exp(−‖𝛼 − 𝛽‖₂² / 2)
• Output: vector V: M-by-1
Kernel Summation Steps
1. C ← A × B (GEMM)
2. R_ij ← ‖a_i‖₂² + ‖b_j‖₂² − 2C_ij (embarrassingly parallel)
3. U_ij ← exp(−R_ij / 2) (embarrassingly parallel)
4. V ← U × W (GEMV)
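The four steps above can be sketched end to end in NumPy (sizes here are small illustrative values, not the M and K used in the experiments), including a check that the GEMM-based decomposition matches the direct kernel definition:

```python
import numpy as np

# Illustrative sizes; the paper evaluates M up to 524288 and K up to 256.
M, N, K = 64, 48, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((M, K))   # target points, one per row
B = rng.standard_normal((K, N))   # source points, one per column
W = rng.standard_normal(N)        # weights

# Step 1: C <- A x B (GEMM)
C = A @ B

# Step 2: R_ij <- ||a_i||^2 + ||b_j||^2 - 2 C_ij  (squared distances)
sqA = np.sum(A * A, axis=1)[:, None]
sqB = np.sum(B * B, axis=0)[None, :]
R = sqA + sqB - 2.0 * C

# Step 3: U_ij <- exp(-R_ij / 2)  (Gaussian kernel evaluation)
U = np.exp(-R / 2.0)

# Step 4: V <- U x W (GEMV)
V = U @ W

# Cross-check against the direct definition K(a, b) = exp(-||a - b||^2 / 2).
direct = np.array([
    sum(np.exp(-np.sum((A[i] - B[:, j]) ** 2) / 2.0) * W[j] for j in range(N))
    for i in range(M)
])
assert np.allclose(V, direct)
```

The cross-check works because ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b, which is exactly what lets step 2 reuse the GEMM output C.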
Implementation Based on cuBLAS
1. C ← A × B: call cuBLAS
2. R_ij ← ‖a_i‖₂² + ‖b_j‖₂² − 2C_ij
3. U_ij ← exp(−R_ij / 2)
4. V ← U × W
++ Fast, easy to use
-- Sacrifices data locality
-- Wastes energy on DRAM accesses
High L2 MPKI indicates opportunity for fusion
[Figure: L2 MPKI for K ∈ {32, 64, 128, 256} and M ∈ {1024, 131072, 524288}]
We Propose Fused Kernel Summation
FOR each thread DO
  1. reg_μC = GEMM(mem_subA, mem_subB)
  2. reg_μC = Gaus(mem_subA2, mem_B2, reg_μC)
  3. mem_subV = Summation(reg_μC, mem_subW)
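The fused loop can be sketched in NumPy as follows. The tile sizes and the `fused_kernel_summation` helper name are illustrative assumptions; on the GPU the micro-tile `uC` lives in registers, so the full R and U matrices never reach DRAM:

```python
import numpy as np

def fused_kernel_summation(A, B, W, tile_m=16, tile_n=16):
    """Sketch of the fused loop: each micro-tile of C is produced by GEMM,
    converted to Gaussian kernel values in place, and immediately reduced
    against W, so the intermediate R and U matrices are never materialized."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    sqA = np.sum(A * A, axis=1)   # precomputed squared norms of targets
    sqB = np.sum(B * B, axis=0)   # precomputed squared norms of sources
    V = np.zeros(M)
    for i0 in range(0, M, tile_m):
        i1 = min(i0 + tile_m, M)
        for j0 in range(0, N, tile_n):
            j1 = min(j0 + tile_n, N)
            # 1. reg_uC = GEMM(mem_subA, mem_subB)
            uC = A[i0:i1] @ B[:, j0:j1]
            # 2. reg_uC = Gaus(mem_subA2, mem_B2, reg_uC)
            uC = np.exp(-(sqA[i0:i1, None] + sqB[None, j0:j1] - 2.0 * uC) / 2.0)
            # 3. mem_subV = Summation(reg_uC, mem_subW)
            V[i0:i1] += uC @ W[j0:j1]
    return V
```

Only the final per-tile partial sums are written back, which is the locality gain that fusion targets.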
GPU Background
• Thread: organized into thread blocks
• Thread block: executed by an SM (Streaming Multiprocessor)
• Warp: basic scheduling unit; 32 threads executing the same instruction in lock-step
Fused Kernel Summation
FOR each thread DO
  1. reg_μC = GEMM(mem_subA, mem_subB)
  2. reg_μC = Gaus(mem_subA2, mem_B2, reg_μC)
  3. mem_subV = Summation(reg_μC, mem_subW)
GEMM Algorithm Overview
• A thread block (bx, by) computes submatrixC_{bx,by} = submatrixA_{by} × submatrixB_{bx}
• Carefully select the submatrix size and thread block size
• Overlap memory read latency with computation
• Rearrange data layout for fast access
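The block decomposition can be sketched in NumPy, with the two loops standing in for the GPU's grid of thread blocks (the tile sizes BM and BN are illustrative, not the tuned values from the paper):

```python
import numpy as np

# Each "thread block" (bx, by) owns one BM x BN tile of C and computes
# subC = subA(by) @ subB(bx).
BM, BN = 8, 8
M, K, N = 32, 16, 24
rng = np.random.default_rng(1)
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))
C = np.empty((M, N))
for by in range(M // BM):          # grid dimension y
    for bx in range(N // BN):      # grid dimension x
        subA = A[by*BM:(by+1)*BM, :]       # submatrixA_by
        subB = B[:, bx*BN:(bx+1)*BN]       # submatrixB_bx
        C[by*BM:(by+1)*BM, bx*BN:(bx+1)*BN] = subA @ subB

# The tiles together reproduce the full product.
assert np.allclose(C, A @ B)
```

On the GPU, each tile's inner product over K is itself staged through shared memory in chunks so that global memory reads overlap with computation.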
Shared Memory (SMEM)
• Programmer-managed cache
• 32 banks that can be accessed simultaneously
• Accesses serialize when there is a bank conflict
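The bank-conflict rule can be modeled in a few lines of Python. This is a simplified sketch (word w maps to bank w mod 32; the hardware's same-word broadcast case is ignored), with the helper names being illustrative:

```python
from collections import Counter

# Shared memory is split into 32 banks of 4-byte words; word index w
# lives in bank w % 32. A warp's access completes in one pass when its
# 32 lanes touch 32 distinct banks; otherwise accesses serialize.
def bank(word_index, num_banks=32):
    return word_index % num_banks

def max_conflict_degree(word_indices, num_banks=32):
    """Number of serialized passes a warp needs for these word accesses
    (simplified model: same-word broadcast is not special-cased)."""
    counts = Counter(bank(w, num_banks) for w in word_indices)
    return max(counts.values())

# Lane i reads word i: one word per bank, a single pass.
assert max_conflict_degree(range(32)) == 1
# Lane i reads word 2*i: only even banks are used, two lanes per bank,
# so the access takes two serialized passes.
assert max_conflict_degree(range(0, 64, 2)) == 2
```

This is why the data mapping on the next slide matters: a stride that is a multiple of the bank count funnels many lanes into the same bank.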
Shared Memory Data Mapping
Fused Kernel Summation
FOR each thread DO
  1. reg_μC = GEMM(mem_subA, mem_subB)
  2. reg_μC = Gaus(mem_subA2, mem_B2, reg_μC)
  3. mem_subV = Summation(reg_μC, mem_subW)
Fused Kernel Summation
FOR each thread DO
  1. reg_μC = GEMM(mem_subA, mem_subB)
  2. reg_μC = Gaus(mem_subA2, mem_B2, reg_μC)
  3. mem_subV = Summation(reg_μC, mem_subW)
• Intra-thread level: register → shared memory
• Intra-thread-block level: shared memory → DRAM
• Inter-thread-block level: DRAM → DRAM
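The hierarchical summation can be sketched as a simulation in NumPy. The `three_level_sum` helper, the column split, and the tile widths are illustrative assumptions; the point is where each partial sum lands (register, shared memory, DRAM):

```python
import numpy as np

def three_level_sum(U, W, block_cols=8, thread_cols=2):
    """Simulated three-level reduction of U x W, mirroring the slide:
    intra-thread: each thread reduces its register tile into shared memory;
    intra-block:  the block reduces its shared-memory partials;
    inter-block:  blocks accumulate their results into V in DRAM."""
    M, N = U.shape
    V = np.zeros(M)                                   # lives in DRAM
    for b0 in range(0, N, block_cols):                # one block per column slab
        smem = np.zeros((M, block_cols // thread_cols))
        for t, t0 in enumerate(range(b0, min(b0 + block_cols, N), thread_cols)):
            t1 = min(t0 + thread_cols, N)
            # register -> shared memory: per-thread partial dot products
            smem[:, t] = U[:, t0:t1] @ W[t0:t1]
        # shared memory -> DRAM: block-wide reduction, one write per block
        V += smem.sum(axis=1)
    return V
```

Each level writes only its reduced result one step down the hierarchy, so full-width intermediates never travel to DRAM.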
Evaluation
• Infrastructure
  Evaluated on an NVIDIA GTX 970
  Profiling tool: nvprof
  cuBLAS library version 7.0
• Experiments
  Fused: fuse our own GEMM implementation with the kernel evaluation and the summation routine.
  CUDA-Unfused: call our own GEMM implementation, followed by the kernel evaluation and the summation routine.
  cuBLAS-Unfused: call the cuBLAS GEMM function, followed by the kernel evaluation and the summation routine.
Performance Comparison
[Figure: speedup (0–4×) of Fused vs. CUDA-Unfused and Fused vs. cuBLAS-Unfused, for K ∈ {32, 64, 128, 256} and M ∈ {1024, 131072, 524288}]
Fused beats cuBLAS-Unfused by up to a 1.8× speedup when dimension K < 128.
Influence on Memory
• The fused optimization reduces memory transactions
  Fused removes 50% of the L2 accesses in cuBLAS-Unfused
  Fused removes 90% of the DRAM accesses in cuBLAS-Unfused
[Fig. a: L2 accesses of Fused and CUDA-Unfused, normalized to cuBLAS-Unfused, for K ∈ {32, 64, 128, 256} and M ∈ {1024, 131072, 524288}]
[Fig. b: DRAM accesses of Fused and CUDA-Unfused, normalized to cuBLAS-Unfused, for the same K and M]
Energy Savings Comparison
• Energy savings of Fused compared to cuBLAS-Unfused
  For the same K, more energy is saved as M increases
  The amount of energy savings obtained from fusion is strongly affected by the value of K
Energy Breakdown
• 80% reduction in DRAM access energy (8% to 24% of total energy)
[Figure: energy (J) broken down into compute, SMEM, L2, and DRAM components for Fused, CUDA-Unfused, and cuBLAS-Unfused, for K ∈ {32, 64, 128, 256} and M ∈ {1024, 131072, 524288}]
Summary
• Presented a fused approach to implementing kernel summation on a state-of-the-art GPU.
  Fusion improves locality and reduces memory accesses.
  Fusion improves the overall performance of kernel summation by up to 1.8×.
  From the energy perspective, fused kernel summation shows up to 33% total energy savings across the experimented dimensions.
Thanks!
Lab of Computer Architecture
http://users.ece.utexas.edu/~ljohn/publications.html