Problem Description Duplication Algorithm Padding Algorithm ... - safari

A Study Towards Optimal Data Layout for GPU Computing Eddy Z. Zhang, Han Li and Xipeng Shen Presenter: Mingzhou Zhou The College of William and Mary

Problem Description

Duplication Algorithm

Irregular Memory References

Reorder only data by packing consecutive objects together

* A mem. transaction -- read/write a consecutive memory segment at once * A thread warp -- execute only when all data for all threads in the warp is ready * Random and complicated data access patterns * Previous solutions are all heuristics based, and a study for optimal layout is needed * Example: assume thread warp size is 4, and mem. segment size is 4

Threads Idx: Data Obj Idx: Data Obj Layout:

0 1 2 3 0 0 4 4 a c e g

4 5 6 7 1 1 5 5 b d

f

* Duplicate data objects when necessary * Add space overhead * Optimal number of memory transactions * Adaptive partial duplication [2]

Threads Idx:

0 1 2 3

4 5 6 7

8 9 10 11 12 13 14 15 16 17 18 19

Original Data Obj Idx: 0 0 4 4

1 1 5 5

Original Obj Layout:

a c e g

b d

New Data Obj Idx:

0 1 2 3

4 5 6 7

8 9 10 11 12 13 14 15 16 17 18 19

New Obj Layout:

a a b b

c c d d

e e a a

8 9 10 11 12 13 14 15 16 17 18 19 2 2 0 0 ............

4 4 1 5

2 6 3 7 f

2 2 0 0 ............

4 4 1 5

h

5 Mem. Transactions

h 9 Mem. Transactions

2 6 3 7

Data Layout Transformation * Complexity NP Completeness if not using any extra memory space or thread relocation [1] * Assumptions: (1). one memory transaction access one memory segment of fixed size (2). memory segment size == thread warp size

g h

Reorder Threads More Aggressively

Reorder Both Threads and Data * Step 1: Reorder data objects according to their access frequencies. However, this is not the final data object layout. * Step 2: Reorder threads according to their data object order from Step 1. * Step 3: Put data objects one by one into memory segments with the order obtained from step 1. Check every thread warp’s memory access and duplicate or pad a memory segment when necessary. If one thread warp’s memory accesses cannot fall into the current memory segment, then leave empty objects to fill the current memory segment, and put the object into next memory segment. If one thread warp’s memory access has to span two memory segments, then duplicate it into next memory segment.

Data Obj Freq.:

e f

Approximation with Confidence

Padding Algorithm

Original Obj Layout:

b b c d

a c e g

b d

f

h

* Step 1: Reorder data objects according to their access frequencies. Same as step 1 in padding algorithm, * Step 2: Group threads according to the data object they access. A group of threads access the same data object. Classify thread groups into three types: (a). groups with size >= warp size (b). groups with size < warp size, and > 1 (c). groups with size == 1 * Step 3: Put threads in type (a) group into warps until the number of remaining threads is less than warp size. The remaining threads from type (a) (c) and all type (b) threads form remainder threads, reorder these threads’ objects according to their frequencies, and make it the new layout of data objects. Then reorder threads according to the data objects they access.

4 3 3 1 4 3 1 1 Analytical Bound

Threads Idx:

0 1 4 5

2 3 12 13 4 5 14 6

7 15 8 9 16 17 18 19

New Data Obj Idx:

0 0 0 0

1 1 1 1 2 2 2 3 ............

4 4 5 5

New Obj Layout:

a b c d

d e x x

e f

g h

8 9 10 11

5 Mem. Transactions

* Optimal case: Total number of memory transactions = Total number of warps. * Upper bound: Optimal number + R (number of remainder threads) / W (warp size)

References [1] E. Zhang, Y. Jiang, Z. Guo, K.Tian, and X. Shen. On-the-fly elimination of dynamic irregularities for gpu computing. Technical Report WM-CS-2010-07, CS, College of William and Mary, 2010. [2] E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen. On-the-fly elimination of dynamic irregularities for gpu computing. In ASPLOS 2011. [3] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping and structure splitting using whole-program reference affinity. In PLDI 2004.

MSPC‘2012, Beijing, China

Problem Description Duplication Algorithm Padding Algorithm ... - safari

Problem Description Duplication Algorithm Padding Algorithm ... - safari

Suggest Documents

A Novel Algorithm for Duplication-Loss Problem ...

MPA Algorithm Description [PDF]

NP problem in quantum algorithm

ERICA Switch Algorithm: A Complete Description

Relaxation algorithm in description of superconducting structures

Population Migration Algorithm Description ... - Semantic Scholar

A Task Duplication Based Scheduling Algorithm for ... - CiteSeerX

An Efficient Duplication Record Detection Algorithm for Data Cleansing

ITD assembler: an algorithm for internal tandem duplication discovery ...

A Distributed Algorithm for 2D Shape Duplication ... - Research - MIT

Quantum Algorithm for Hilbert's Tenth Problem

Parameter Identification Problem Solving Using Genetic Algorithm

Multi-Objective Evolutionary Algorithm for Illiteracy Problem

Visual Place Categorization: Problem, Dataset, and Algorithm

Fuzzy Counter Ant Algorithm for Maze Problem

Genetic algorithm inspired by gene duplication - IEEE Xplore

Firefly Algorithm for Optimization Problem Nur ...

Learning Problem and BCJR Decoding Algorithm

Problem-orientable numerical algorithm for modelling multi ...

Solving Travelling Salesman Problem Using Genetic Algorithm

Evolutionary Algorithm for Congestion Problem in

Hybrid Algorithm for Constrained Portfolio Selection Problem

An Exact Algorithm for Travelling Salesman Problem

THE AUCTION ALGORITHM FOR THE TRANSPORTATION PROBLEM