Cache Optimization Techniques for General Purpose Graphic Processing Units
D.R.V.L.B. Thambawita
Supervised by Dr. Roshan G. Ragel and Dr. Dhammika Elkaduwe
Department of Computer Engineering, Faculty of Engineering, University of Peradeniya
What is a GPU? Why is it important?
Example application domains: fluid dynamics, AI.
Figure: CPU vs GPGPU
Why this research?
How do we optimize CPU cache access at the programming stage? Established CPU cache optimization techniques are available in many references.
How do we optimize GPGPU cache access at the programming stage? Do we have comparable resources?
Our Contribution
Main: finding suitable cache optimization techniques for GPGPUs, from the programmer's side.
Sub: giving GPGPU architecture designers an idea of the application-level cache behavior of GPGPUs.
GPU configurable cache architecture
Figure: Cache memory hierarchy of CPUs and GPGPUs (Fermi architecture)
Outline
1. Related works
2. Conceptual level design
3. Selected CPU cache optimization techniques
4. Experimental setup
5. Adaptation process + results and discussion
6. Findings and conclusions
7. Case study: introduction (the Aho-Corasick algorithm), results and discussion, and conclusions about the case study
8. Publications
9. Q&A
Related works
J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann/Elsevier, 2012. Used for identifying the main cache optimization techniques in computer architecture.
M. Kowarschik and C. Weiß, "An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms," LNCS, vol. 2625, pp. 213-232, 2003. Used for selecting the basic cache optimization techniques.
CUDA Toolkit Documentation. Used for finding available GPGPU optimization techniques and background knowledge for the adaptation process.
C.-H. Lin, C.-H. Liu, L.-S. Chien, and S.-C. Chang, "Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs," IEEE Trans. Comput., vol. 62, no. 10, pp. 1906-1916, Oct. 2013. Used for identifying a case study for our research.
Challenges
Lack of information about the GPGPU cache architecture
Complexity of the SIMD architecture
No direct prior research on GPGPU cache optimization techniques for end users
Conceptual level design
Figure: Conceptual-level design flow.
CPU cache optimizations? → selecting CPU cache optimization techniques → analyzing on the CPU
GPGPU cache optimizations? → adapting the techniques from CPU to GPU → developing GPU cache optimizations → analyzing on the GPU → identifying GPU cache optimizations → case study
Selected CPU Cache Optimization Techniques
Common end-user cache optimization techniques
Data access optimization: stride-one access, blocking, loop fusion
Data layout optimization: array padding, array merging, array transpose
GPU cache complexity in the adaptation process
SIMD architecture: warps, blocks, grids
Complex memory hierarchy:
Shared memory: 32 banks
L1 and L2 caches (configurable): 16KB or 48KB L1, or L1 disabled; 128-byte L1 cache lines, 32-byte L2 segments
Texture memory: 2D spatial locality
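The configurable L1/shared-memory split above is selected per kernel through the CUDA runtime. A minimal sketch, assuming a hypothetical kernel scaleKernel (the cudaFuncSetCacheConfig call and the -Xptxas -dlcm=cg flag are the standard Fermi-era mechanisms):

    #include <cuda_runtime.h>

    // Hypothetical kernel, used only to illustrate the cache-preference call.
    __global__ void scaleKernel(float *a, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    int main() {
        // Request the 48KB L1 / 16KB shared split for this kernel
        // (the Fermi default is 16KB L1 / 48KB shared).
        cudaFuncSetCacheConfig(scaleKernel, cudaFuncCachePreferL1);
        // ...allocate device memory and launch as usual...
        // L1 caching of global loads can instead be disabled at compile time:
        //   nvcc -Xptxas -dlcm=cg app.cu
        return 0;
    }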
Experimental Setup

Table: Intel Core(TM) i5-3230M CPU, 2.6 GHz, Ivy Bridge micro-architecture, 8GB RAM

                     L1 cache    L2 cache    L3 cache
    Cache size       32KB        256KB       3072KB
    Cache line size  64 bytes    64 bytes    64 bytes
    Associativity    8-way       8-way       12-way

Table: Tesla C2075 GPGPU cache architecture, 6GB global memory

                    Cache size   Cache line size      Associativity    Description
    L1 cache        48KB/16KB    128 bytes            Not mentioned    Can be disabled using the -Xptxas -dlcm=cg compile flag
    Shared memory   16KB/48KB    128 bytes            Not mentioned    Can be used manually
    L2 cache        768KB        128 bytes/32 bytes   Not mentioned    Unified cache
Adaptation Process + Results and Discussion
One by one:
Data access optimization: stride-one access, blocking, loop fusion
Data layout optimization: array padding, array merging, array transpose
Stride-one memory access
Figure: Non-stride access vs stride-one access on the GPGPU
Adaptation from CPU to GPGPU
On the CPU, the L1, L2, and L3 cache lines are all 64 bytes, and memory is accessed through loops, so the changing parameter is the loop index.
Figure: Adaptation process
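A minimal sketch of the CPU side, with a hypothetical array a and stride parameter (names are illustrative): the loop index is the changing parameter, and with stride one every element of each 64-byte cache line is used before the line is evicted.

    // CPU stride access: with stride = 1, all 16 floats of a 64-byte cache
    // line are touched consecutively; larger strides waste most of each line.
    void strideAccess(float *a, int n, int stride) {
        for (int i = 0; i < n; i += stride) {
            a[i] *= 2.0f;  // placeholder work on each touched element
        }
    }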
Adaptation from CPU to GPGPU
On the GPGPU, the loop becomes a kernel, and the changing parameter is the global thread index, blockDim * blockID + threadID. The L1 cache line size is 128 bytes; L2 uses 128-byte lines, or 32-byte segments when L1 is disabled.
Figure: Adaptation process
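The corresponding GPGPU sketch, with a hypothetical kernel and launch configuration: consecutive threads replace consecutive loop iterations, so at stride one the 32 accesses of a warp coalesce into a single 128-byte L1 line.

    // Each thread handles one element; tid replaces the CPU loop index.
    __global__ void strideKernel(float *a, int n, int stride) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;  // changing parameter
        int i = tid * stride;       // stride = 1 gives fully coalesced access
        if (i < n) a[i] *= 2.0f;
    }

    // Example launch: strideKernel<<<(n + 255) / 256, 256>>>(d_a, n, 1);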
Results: effect of stride-one access on the CPU
Figure: Effect of the stride amount on the CPU, two test runs; input size = 2867200 (710.9375MB); time [ms] vs stride amount.
Execution time increases continuously with the stride amount.
Results: effect of stride-one access on the GPGPU
Figure: Effect of the stride amount on the Fermi GPGPU, two test runs; input size = 2867200, L1 = 16KB (default setting); time [ms] vs stride amount.
Execution time increases with the stride amount, and the best performance occurs at stride one, as on the CPU. The effect of the stride amount is comparatively small once the stride exceeds a full cache line.
Results: effect of stride-one access on the GPGPU, varying the L1 configuration
Figure: Stride access on the Fermi GPGPU, input size = 2867200, with 16KB L1, 48KB L1, and L1 disabled.
Disabling the L1 cache performs better for large strides because the L2 cache then serves a larger number of (32-byte) cache lines. The large (48KB) L1 configuration performs better than the small one because the larger cache holds more cache lines.
One by one:
Data access optimization: stride-one access ✓, blocking, loop fusion
Data layout optimization: array padding, array merging, array transpose
Blocking technique
Figure: Two blocking techniques from two different sources. The first uses small blocks from the first matrix and large blocks from the second matrix; the second uses equal-size blocks from both matrices.
Adaptation
Figure: Adaptation process
Results: effect of the blocking technique on the CPU
Figure: Effect of tiling on the CPU for matrix sizes 512x512 to 2048x2048, comparing the default method (no tiling), the method from the Computer Architecture: A Quantitative Approach book, and the method equivalent to the GPGPU tiling method.
The method equivalent to the GPGPU tiling method shows the best performance on the CPU as well.
Results: effect of the blocking technique on the GPGPU
Figure: Non-blocked vs blocked matrix multiplication on the GPGPU for matrix sizes 512x512 to 3072x3072, under each cache configuration: L1 disabled, 16KB L1, 48KB L1, and shared-memory tiling.
The blocking technique outperforms the non-blocked version under every cache configuration, and blocking with shared memory shows the best performance among all GPGPU cache options.
One by one:
Data access optimization: stride-one access ✓, blocking ✓, loop fusion
Data layout optimization: array padding, array merging, array transpose
Loop fusion
The number of branching conditions must match between the fused and non-fused loops, and the loops share common variables. On the GPGPU, the loops are kernels, so kernel fusion is the GPGPU counterpart of CPU loop fusion. An example sketch follows.
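A minimal sketch completing the truncated example, with hypothetical arrays a, b, and c: both loops share the index i and the array a, so fusing them reuses a[i] and b[i] while they are still in cache.

    // Non-fused: a[] is traversed twice, so each of its cache lines is loaded twice.
    for (int i = 0; i < n; i++) b[i] = a[i] * 2.0f;
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];

    // Fused: one traversal; a[i] and b[i] are reused while still cached.
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2.0f;
        c[i] = a[i] + b[i];
    }

    // GPGPU counterpart (kernel fusion): the two kernels become one, saving a
    // kernel launch and a round trip of b[] through global memory.
    __global__ void fusedKernel(const float *a, float *b, float *c, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] + b[i];
        }
    }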