Cache Optimization Techniques for General Purpose Graphic Processing Units

D.R.V.L.B. Thambawita
Supervised by Dr. Roshan G. Ragel and Dr. Dhammika Elkaduwe
Department of Computer Engineering, Faculty of Engineering, University of Peradeniya

What is a GPU? Is it important?

Example application domains: fluid dynamics, AI

Figure: CPU vs GPGPU

2 / 61

Why this research?

How do we optimize CPU cache access at the programming stage? By using the available CPU cache optimization techniques (references are plentiful).
How do we optimize GPGPU cache access at the programming stage? Do we have the resources?

Our Contribution

Main: Finding suitable cache optimization techniques for GPGPUs (from the programmer's side)
Sub: Giving GPGPU architecture designers an idea of the application-level cache behavior of GPGPU caches

3 / 61

GPU configurable cache architecture

Figure: Cache memory hierarchy of CPUs and GPGPUs (Fermi architecture)

4 / 61
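Since the L1/shared-memory split on Fermi is configurable, it can be selected per kernel from host code. A minimal sketch using the standard CUDA runtime call (the kernel name and body are hypothetical, for illustration only):

#include <cuda_runtime.h>

__global__ void myKernel(float *data) {   // hypothetical kernel
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    // Fermi splits 64KB between L1 and shared memory, selectable per kernel:
    //   cudaFuncCachePreferL1     -> 48KB L1 / 16KB shared memory
    //   cudaFuncCachePreferShared -> 16KB L1 / 48KB shared memory
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    return 0;
}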

Outline

1 Related works
2 Conceptual level design
3 Selected CPU Cache Optimization Techniques
4 Experimental Setup
5 Adaptation Process + Results and Discussion
6 Findings and Conclusions
7 Case Study
    Introduction - Aho-Corasick algorithm
    Results and Discussion
    Conclusion about the case study
8 Publications
9 Q? and A

5 / 61

Related works

Related works

J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach". Morgan Kaufmann/Elsevier, 2012.
    Identifying the main cache optimization techniques in computer architecture.

M. Kowarschik and C. Weiß, "An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms," LNCS, vol. 2625, pp. 213-232, 2003.
    Selecting the basic cache optimization techniques.

CUDA Toolkit Documentation.
    Finding the available GPGPU optimization techniques and gathering knowledge for the adaptation process.

C.-H. Lin, C.-H. Liu, L.-S. Chien, and S.-C. Chang, "Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs," IEEE Trans. Comput., vol. 62, no. 10, pp. 1906-1916, Oct. 2013.
    Identifying a case study for our research.

6 / 61

Related works

Challenges!!!

Lack of information about the GPGPU cache architecture
Complexity of the SIMD architecture
No direct prior research on GPGPU cache optimization techniques for end users

7 / 61

Conceptual level design

Conceptual level design

[Diagram: workflow built around two questions, "CPU cache optimizations?" and "GPGPU cache optimizations?"]

Selecting CPU cache optimization techniques -> Analyzing the CPU cache optimizations -> Adapting them from CPU to GPU -> Developing GPU cache optimizations -> Analyzing them using the GPU -> Identifying GPU cache optimizations -> Case Study

8 / 61

Selected CPU Cache Optimization Techniques

Common end-user cache optimization techniques

Data access optimization
  Stride-one access
  Blocking
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

9 / 61

Selected CPU Cache Optimization Techniques

GPU cache complexity - in the adaptation process

The adaptation process must deal with two sources of GPGPU cache complexity:

SIMD architecture
  Warps, blocks, and grids

Complex memory
  Shared memory (32 banks)
  L1 and L2 caches (configurable): 16KB, 48KB, or L1 disabled; 128-byte L1 cache lines and 32-byte L2 cache segments
  Texture memory (2D spatial locality)

10 / 61
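To make the SIMD hierarchy and the shared-memory side of this picture concrete, here is a minimal illustrative CUDA sketch (the kernel, sizes, and names are our own, not from the study):

#include <cuda_runtime.h>

__global__ void hierarchyDemo(const float *in, float *out, int n) {
    // Global thread index derived from the grid -> block -> thread hierarchy.
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Shared memory is private to a block and served by 32 banks on Fermi.
    __shared__ float tile[256];
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();   // a block's threads execute as 32-thread warps

    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

// Launch example: a grid of 128 blocks, 256 threads (8 warps) per block.
// hierarchyDemo<<<128, 256>>>(d_in, d_out, 128 * 256);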

Experimental Setup

Experimental setups

Table: Intel Core(TM) i5-3230M CPU, 2.6GHz, Ivy Bridge micro-architecture, 8GB RAM

            Cache size   Cache line size   Associativity   Description
L1 cache    32KB         64 bytes          8-way
L2 cache    256KB        64 bytes          8-way
L3 cache    3072KB       64 bytes          12-way          Shared memory

Table: Tesla C2075 GPGPU's cache architecture, 6GB global memory

                Cache size   Cache line size      Associativity   Description
L1 cache        48KB/16KB    128 bytes            Not mentioned   Can be disabled using the -Xptxas -dlcm=cg compile flag
Shared memory   16KB/48KB    128 bytes            Not mentioned   Can be used manually
L2 cache        768KB        128 bytes/32 bytes   Not mentioned   Unified cache

11 / 61
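For reference, the compile flag mentioned in the table switches the global-load caching mode at build time. A hedged usage sketch (the source file name is hypothetical):

# Default on Fermi: cache global loads in both L1 and L2 (-dlcm=ca).
nvcc -arch=sm_20 -Xptxas -dlcm=ca stride.cu -o stride_l1

# Disable L1 for global loads: cache in L2 only, with 32-byte transactions.
nvcc -arch=sm_20 -Xptxas -dlcm=cg stride.cu -o stride_nol1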

Adaptation Process + Results and Discussion

One by one

Data access optimization
  Stride-one access
  Blocking
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

12 / 61

Adaptation Process + Results and Discussion

Stride-one access

Stride-one memory access

Figure: Non-stride access vs. stride-one access on the GPGPU

13 / 61

Adaptation Process + Results and Discussion

Stride-one access

Adaptation - From CPU to GPGPU

On the CPU the access pattern is driven by loops, and the L1, L2, and L3 cache line size is 64 bytes.

Changing parameter = loop index

Figure: Adaptation Process

14 / 61
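A minimal host-side sketch of the CPU experiment, assuming a float array; only the stride parameter changes between runs (illustrative code, not the exact benchmark):

// Traverse the array with a given stride; stride = 1 is stride-one access
// and touches each 64-byte cache line exactly once.
float strided_sum(const float *a, int n, int stride) {
    float sum = 0.0f;
    for (int s = 0; s < stride; ++s)           // start offsets cover all elements
        for (int i = s; i < n; i += stride)    // changing parameter: loop index
            sum += a[i];
    return sum;
}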

Adaptation Process + Results and Discussion

Stride-one access

Adaptation - From CPU to GPGPU

On the GPGPU the access pattern is driven by the kernel.

Changing parameter = blockDim * blockID + threadID

L1 cache line size: 128 bytes; L2 cache line size: 128 bytes (32-byte segments)

Figure: Adaptation Process

15 / 61
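A minimal CUDA sketch of the adapted kernel (our own illustrative code; stride is varied per run):

// With stride = 1, consecutive threads of a warp read consecutive
// addresses, so the reads coalesce into full 128-byte cache lines.
__global__ void stride_read(const float *in, float *out, int n, int stride) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;   // changing parameter
    int idx = tid * stride;                            // strided access pattern
    if (idx < n)
        out[tid] = in[idx];
}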

Adaptation Process + Results and Discussion

Stride-one access

Results: Effect of stride-one access on the CPU

[Plot: Time [ms] (0-100) vs. stride amount (0-70); two test runs at input size = 2867200]

Figure: Effect of stride amount on CPU, input size = 2867200 (10.9375MB)

The execution time increased continuously with the stride amount.

16 / 61

Adaptation Process + Results and Discussion

Stride-one access

Results: Effects of stride-one access on the GPGPU

[Plot: Time [ms] (0-10) vs. stride amount (0-70); two test runs at input size = 2867200]

Figure: Stride access effect on Fermi GPGPU, input = 2867200, L1 = 16KB (default settings)

The execution time increases with the stride amount. The best performance occurs at a stride amount of 1, as on the CPU. The effect of the stride amount is comparatively small once the stride spans a full cache line.

17 / 61

Adaptation Process + Results and Discussion

Stride-one access

Results: Effects of stride-one access on the GPGPU

[Plot: Time [ms] (0-10) vs. stride amount (0-70); input size = 2867200 with 48KB L1, 16KB L1, and disabled L1]

Figure: Stride access effect on Fermi GPGPU, input = 2867200, with 48KB, 16KB, and disabled L1 cache

With L1 disabled, performance is better for large stride amounts because more cache lines are available in L2 when L1 is disabled. The large (48KB) L1 cache performs better than the small one because it holds a larger number of cache lines.

18 / 61

Adaptation Process + Results and Discussion

Stride-one access

One by one

Data access optimization
  Stride-one access X
  Blocking
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

19 / 61

Adaptation Process + Results and Discussion

Blocking technique

Blocking technique

[Diagram: two blocking schemes applied to matrices A, B, and C]

Figure: Two different blocking techniques from two different sources. The first technique uses small blocks from the first matrix and large blocks from the second matrix. The second method uses equal-size blocks from both matrices.

20 / 61

Adaptation Process + Results and Discussion

Blocking technique

Adaptation

Figure: Adaptation process

22 / 61
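A minimal sketch of the GPGPU blocking (tiling) adaptation with equal-size blocks staged through shared memory. TILE, the names, and the assumption that N is a multiple of TILE are ours; this is illustrative, not the exact benchmark kernel:

#define TILE 16

// C = A * B using equal-size tiles staged through shared memory.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the A and B tiles (coalesced).
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)   // reuse the tiles from shared memory
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}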

Adaptation Process + Results and Discussion

Blocking technique

Results: Effects of blocking technique on the CPU

[Plot: Time [s] (0-150) vs. matrix size (512X512 to 2048X2048); default method without the tiling technique, the method from the Computer Architecture: A Quantitative Approach book, and the method equivalent to the GPGPU tiling method]

Figure: Effect of tiling on CPU

The method equivalent to the GPGPU tiling method shows better performance on the CPU as well.

23 / 61

Adaptation Process + Results and Discussion

Blocking technique

Results: Effects of blocking technique on the GPGPU

[Plot: Time [ms] vs. matrix size (512X512 to 3072X3072); default vs. blocked kernels with L1 disabled, 16KB L1, and 48KB L1, plus blocked with shared memory]

Figure: Non-blocking vs. blocking with various cache configurations on GPGPU

The blocking technique shows better performance than the non-blocking technique under every cache configuration. The blocking technique with shared memory shows the best performance among all the GPGPU cache options.

24 / 61

Adaptation Process + Results and Discussion

Blocking technique

One by one

Data access optimization
  Stride-one access X
  Blocking X
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

25 / 61

Adaptation Process + Results and Discussion

Loop fusion

Loop fusion

The number of branching conditions must match between the fused and non-fused versions of the loops. Common variables are shared by the for loops. On the GPGPU, the loops are kernels, so kernel fusion is the GPGPU technique corresponding to loop fusion on the CPU. Example: see the sketch below.
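A minimal sketch completing the truncated example (illustrative code; the arrays a, b, c and length n are our own assumptions):

// Non-fused: two loops make two passes over a[], so a[] is fetched twice.
for (int i = 0; i < n; i++) b[i] = a[i] * 2.0f;
for (int i = 0; i < n; i++) c[i] = a[i] + b[i];

// Fused: one loop, one pass; a[i] and b[i] are reused while still in cache.
for (int i = 0; i < n; i++) {
    b[i] = a[i] * 2.0f;
    c[i] = a[i] + b[i];
}

// GPGPU counterpart (kernel fusion): one kernel does both steps, so each
// thread reuses its loaded value from a register instead of re-reading
// global memory between two separate kernel launches.
__global__ void fusedKernel(const float *a, float *b, float *c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float v = a[i] * 2.0f;   // kept in a register
        b[i] = v;
        c[i] = a[i] + v;
    }
}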