Cache Optimization Techniques for General Purpose Graphic Processing Units

D.R.V.L.B. Thambawita
Supervised by Dr. Roshan G. Ragel and Dr. Dhammika Elkaduwe
Department of Computer Engineering, Faculty of Engineering, University of Peradeniya

What is a GPU? Is it important?

Example application domains: fluid dynamics, AI

Figure: CPU vs GPGPU

2 / 61

Why this research?

How do we optimize CPU cache access at the programming stage? By using the available CPU cache optimization techniques (references are plentiful).
How do we optimize GPGPU cache access at the programming stage? Do we have the resources?

Our Contribution

Main: Finding suitable cache optimization techniques for GPGPUs (from the programmer's side)
Sub: Giving GPGPU architecture designers an idea of the application-level cache behavior of GPGPU caches

3 / 61

GPU configurable cache architecture

Figure: Cache memory hierarchy of CPUs and GPGPUs (Fermi architecture)

4 / 61
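Since the L1/shared-memory split on Fermi is configurable, it can be selected per kernel from host code. A minimal sketch using the standard CUDA runtime call (the kernel name and body are hypothetical, for illustration only):

#include <cuda_runtime.h>

__global__ void myKernel(float *data) {   // hypothetical kernel
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    // Fermi splits 64KB between L1 and shared memory, selectable per kernel:
    //   cudaFuncCachePreferL1     -> 48KB L1 / 16KB shared memory
    //   cudaFuncCachePreferShared -> 16KB L1 / 48KB shared memory
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    return 0;
}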

Outline

1 Related works
2 Conceptual level design
3 Selected CPU Cache Optimization Techniques
4 Experimental Setup
5 Adaptation Process + Results and Discussion
6 Findings and Conclusions
7 Case Study
    Introduction - Aho-Corasick algorithm
    Results and Discussion
    Conclusion about the case study
8 Publications
9 Q? and A

5 / 61

Related works

Related works

J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach". Morgan Kaufmann/Elsevier, 2012.
    Identifying the main cache optimization techniques in computer architecture.

M. Kowarschik and C. Weiß, "An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms," LNCS, vol. 2625, pp. 213-232, 2003.
    Selecting the basic cache optimization techniques.

CUDA Toolkit Documentation.
    Finding the available GPGPU optimization techniques and gathering knowledge for the adaptation process.

C.-H. Lin, C.-H. Liu, L.-S. Chien, and S.-C. Chang, "Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs," IEEE Trans. Comput., vol. 62, no. 10, pp. 1906-1916, Oct. 2013.
    Identifying a case study for our research.

6 / 61

Related works

Challenges!!!

Lack of information about the GPGPU cache architecture
Complexity of the SIMD architecture
No direct prior research on GPGPU cache optimization techniques for end users

7 / 61

Conceptual level design

Conceptual level design

[Diagram: workflow built around two questions, "CPU cache optimizations?" and "GPGPU cache optimizations?"]

Selecting CPU cache optimization techniques -> Analyzing the CPU cache optimizations -> Adapting them from CPU to GPU -> Developing GPU cache optimizations -> Analyzing them using the GPU -> Identifying GPU cache optimizations -> Case Study

8 / 61

Selected CPU Cache Optimization Techniques

Common end-user cache optimization techniques

Data access optimization
  Stride-one access
  Blocking
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

9 / 61

Selected CPU Cache Optimization Techniques

GPU cache complexity - in the adaptation process

The adaptation process must deal with two sources of GPGPU cache complexity:

SIMD architecture
  Warps, blocks, and grids

Complex memory
  Shared memory (32 banks)
  L1 and L2 caches (configurable): 16KB, 48KB, or L1 disabled; 128-byte L1 cache lines and 32-byte L2 cache segments
  Texture memory (2D spatial locality)

10 / 61
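To make the SIMD hierarchy and the shared-memory side of this picture concrete, here is a minimal illustrative CUDA sketch (the kernel, sizes, and names are our own, not from the study):

#include <cuda_runtime.h>

__global__ void hierarchyDemo(const float *in, float *out, int n) {
    // Global thread index derived from the grid -> block -> thread hierarchy.
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Shared memory is private to a block and served by 32 banks on Fermi.
    __shared__ float tile[256];
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();   // a block's threads execute as 32-thread warps

    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

// Launch example: a grid of 128 blocks, 256 threads (8 warps) per block.
// hierarchyDemo<<<128, 256>>>(d_in, d_out, 128 * 256);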

Experimental Setup

Experimental setups

Table: Intel Core(TM) i5-3230M CPU, 2.6GHz, Ivy Bridge micro-architecture, 8GB RAM

            Cache size   Cache line size   Associativity   Description
L1 cache    32KB         64 bytes          8-way
L2 cache    256KB        64 bytes          8-way
L3 cache    3072KB       64 bytes          12-way          Shared memory

Table: Tesla C2075 GPGPU's cache architecture, 6GB global memory

                Cache size   Cache line size      Associativity   Description
L1 cache        48KB/16KB    128 bytes            Not mentioned   Can be disabled using the -Xptxas -dlcm=cg compile flag
Shared memory   16KB/48KB    128 bytes            Not mentioned   Can be used manually
L2 cache        768KB        128 bytes/32 bytes   Not mentioned   Unified cache

11 / 61
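For reference, the compile flag mentioned in the table switches the global-load caching mode at build time. A hedged usage sketch (the source file name is hypothetical):

# Default on Fermi: cache global loads in both L1 and L2 (-dlcm=ca).
nvcc -arch=sm_20 -Xptxas -dlcm=ca stride.cu -o stride_l1

# Disable L1 for global loads: cache in L2 only, with 32-byte transactions.
nvcc -arch=sm_20 -Xptxas -dlcm=cg stride.cu -o stride_nol1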

Adaptation Process + Results and Discussion

One by one

Data access optimization
  Stride-one access
  Blocking
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

12 / 61

Adaptation Process + Results and Discussion

Stride-one access

Stride-one memory access

Figure: Non-stride access vs. stride-one access on the GPGPU

13 / 61

Adaptation Process + Results and Discussion

Stride-one access

Adaptation - From CPU to GPGPU

On the CPU the access pattern is driven by loops, and the L1, L2, and L3 cache line size is 64 bytes.

Changing parameter = loop index

Figure: Adaptation Process

14 / 61
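A minimal host-side sketch of the CPU experiment, assuming a float array; only the stride parameter changes between runs (illustrative code, not the exact benchmark):

// Traverse the array with a given stride; stride = 1 is stride-one access
// and touches each 64-byte cache line exactly once.
float strided_sum(const float *a, int n, int stride) {
    float sum = 0.0f;
    for (int s = 0; s < stride; ++s)           // start offsets cover all elements
        for (int i = s; i < n; i += stride)    // changing parameter: loop index
            sum += a[i];
    return sum;
}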

Adaptation Process + Results and Discussion

Stride-one access

Adaptation - From CPU to GPGPU

On the GPGPU the access pattern is driven by the kernel.

Changing parameter = blockDim * blockID + threadID

L1 cache line size: 128 bytes; L2 cache line size: 128 bytes (32-byte segments)

Figure: Adaptation Process

15 / 61
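A minimal CUDA sketch of the adapted kernel (our own illustrative code; stride is varied per run):

// With stride = 1, consecutive threads of a warp read consecutive
// addresses, so the reads coalesce into full 128-byte cache lines.
__global__ void stride_read(const float *in, float *out, int n, int stride) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;   // changing parameter
    int idx = tid * stride;                            // strided access pattern
    if (idx < n)
        out[tid] = in[idx];
}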

Adaptation Process + Results and Discussion

Stride-one access

Results: Effect of stride-one access on the CPU

[Plot: Time [ms] (0-100) vs. stride amount (0-70); two test runs at input size = 2867200]

Figure: Effect of stride amount on CPU, input size = 2867200 (10.9375MB)

The execution time increased continuously with the stride amount.

16 / 61

Adaptation Process + Results and Discussion

Stride-one access

Results: Effects of stride-one access on the GPGPU

[Plot: Time [ms] (0-10) vs. stride amount (0-70); two test runs at input size = 2867200]

Figure: Stride access effect on Fermi GPGPU, input = 2867200, L1 = 16KB (default settings)

The execution time increases with the stride amount. The best performance occurs at a stride amount of 1, as on the CPU. The effect of the stride amount is comparatively small once the stride spans a full cache line.

17 / 61

Adaptation Process + Results and Discussion

Stride-one access

Results: Effects of stride-one access on the GPGPU

[Plot: Time [ms] (0-10) vs. stride amount (0-70); input size = 2867200 with 48KB L1, 16KB L1, and disabled L1]

Figure: Stride access effect on Fermi GPGPU, input = 2867200, with 48KB, 16KB, and disabled L1 cache

With L1 disabled, performance is better for large stride amounts because more cache lines are available in L2 when L1 is disabled. The large (48KB) L1 cache performs better than the small one because it holds a larger number of cache lines.

18 / 61

Adaptation Process + Results and Discussion

Stride-one access

One by one

Data access optimization
  Stride-one access X
  Blocking
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

19 / 61

Adaptation Process + Results and Discussion

Blocking technique

Blocking technique

[Diagram: two blocking schemes applied to matrices A, B, and C]

Figure: Two different blocking techniques from two different sources. The first technique uses small blocks from the first matrix and large blocks from the second matrix. The second method uses equal-size blocks from both matrices.

20 / 61

Adaptation Process + Results and Discussion

Blocking technique

Adaptation

Figure: Adaptation process

22 / 61
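A minimal sketch of the GPGPU blocking (tiling) adaptation with equal-size blocks staged through shared memory. TILE, the names, and the assumption that N is a multiple of TILE are ours; this is illustrative, not the exact benchmark kernel:

#define TILE 16

// C = A * B using equal-size tiles staged through shared memory.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the A and B tiles (coalesced).
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)   // reuse the tiles from shared memory
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}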

Adaptation Process + Results and Discussion

Blocking technique

Results: Effects of blocking technique on the CPU

[Plot: Time [s] (0-150) vs. matrix size (512X512 to 2048X2048); default method without the tiling technique, the method from the Computer Architecture: A Quantitative Approach book, and the method equivalent to the GPGPU tiling method]

Figure: Effect of tiling on CPU

The method equivalent to the GPGPU tiling method shows better performance on the CPU as well.

23 / 61

Adaptation Process + Results and Discussion

Blocking technique

Results: Effects of blocking technique on the GPGPU

[Plot: Time [ms] vs. matrix size (512X512 to 3072X3072); default vs. blocked kernels with L1 disabled, 16KB L1, and 48KB L1, plus blocked with shared memory]

Figure: Non-blocking vs. blocking with various cache configurations on GPGPU

The blocking technique shows better performance than the non-blocking technique under every cache configuration. The blocking technique with shared memory shows the best performance among all the GPGPU cache options.

24 / 61

Adaptation Process + Results and Discussion

Blocking technique

One by one

Data access optimization
  Stride-one access X
  Blocking X
  Loop fusion

Data layout optimization
  Array padding
  Array merging
  Array transpose

25 / 61

Adaptation Process + Results and Discussion

Loop fusion

Loop fusion

The number of branching conditions must match between the fused and non-fused versions of the loops. Common variables are shared by the for loops. On the GPGPU, the loops are kernels, so kernel fusion is the GPGPU technique corresponding to loop fusion on the CPU. Example: see the sketch below.
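A minimal sketch completing the truncated example (illustrative code; the arrays a, b, c and length n are our own assumptions):

// Non-fused: two loops make two passes over a[], so a[] is fetched twice.
for (int i = 0; i < n; i++) b[i] = a[i] * 2.0f;
for (int i = 0; i < n; i++) c[i] = a[i] + b[i];

// Fused: one loop, one pass; a[i] and b[i] are reused while still in cache.
for (int i = 0; i < n; i++) {
    b[i] = a[i] * 2.0f;
    c[i] = a[i] + b[i];
}

// GPGPU counterpart (kernel fusion): one kernel does both steps, so each
// thread reuses its loaded value from a register instead of re-reading
// global memory between two separate kernel launches.
__global__ void fusedKernel(const float *a, float *b, float *c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float v = a[i] * 2.0f;   // kept in a register
        b[i] = v;
        c[i] = a[i] + v;
    }
}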