Problem Description Duplication Algorithm Padding Algorithm ... - safari

3 downloads 0 Views 1MB Size Report
Eddy Z. Zhang, Han Li and Xipeng Shen. Presenter: Mingzhou Zhou. The College of William and Mary. MSPC'2012, Beijing, China. Problem Description.
A Study Towards Optimal Data Layout for GPU Computing Eddy Z. Zhang, Han Li and Xipeng Shen Presenter: Mingzhou Zhou The College of William and Mary

Problem Description

Duplication Algorithm

Irregular Memory References

Reorder only data by packing consecutive objects together

* A mem. transaction -- read/write a consecutive memory segment at once * A thread warp -- execute only when all data for all threads in the warp is ready * Random and complicated data access patterns * Previous solutions are all heuristics based, and a study for optimal layout is needed * Example: assume thread warp size is 4, and mem. segment size is 4

Threads Idx: Data Obj Idx: Data Obj Layout:

0 1 2 3 0 0 4 4 a c e g

4 5 6 7 1 1 5 5 b d

f

* Duplicate data objects when necessary * Add space overhead * Optimal number of memory transactions * Adaptive partial duplication [2]

Threads Idx:

0 1 2 3

4 5 6 7

8 9 10 11 12 13 14 15 16 17 18 19

Original Data Obj Idx: 0 0 4 4

1 1 5 5

Original Obj Layout:

a c e g

b d

New Data Obj Idx:

0 1 2 3

4 5 6 7

8 9 10 11 12 13 14 15 16 17 18 19

New Obj Layout:

a a b b

c c d d

e e a a

8 9 10 11 12 13 14 15 16 17 18 19 2 2 0 0 ............

4 4 1 5

2 6 3 7 f

2 2 0 0 ............

4 4 1 5

h

5 Mem. Transactions

h 9 Mem. Transactions

2 6 3 7

Data Layout Transformation * Complexity NP Completeness if not using any extra memory space or thread relocation [1] * Assumptions: (1). one memory transaction access one memory segment of fixed size (2). memory segment size == thread warp size

g h

Reorder Threads More Aggressively

Reorder Both Threads and Data * Step 1: Reorder data objects according to their access frequencies. However, this is not the final data object layout. * Step 2: Reorder threads according to their data object order from Step 1. * Step 3: Put data objects one by one into memory segments with the order obtained from step 1. Check every thread warp’s memory access and duplicate or pad a memory segment when necessary. If one thread warp’s memory accesses cannot fall into the current memory segment, then leave empty objects to fill the current memory segment, and put the object into next memory segment. If one thread warp’s memory access has to span two memory segments, then duplicate it into next memory segment.

Data Obj Freq.:

e f

Approximation with Confidence

Padding Algorithm

Original Obj Layout:

b b c d

a c e g

b d

f

h

* Step 1: Reorder data objects according to their access frequencies. Same as step 1 in padding algorithm, * Step 2: Group threads according to the data object they access. A group of threads access the same data object. Classify thread groups into three types: (a). groups with size >= warp size (b). groups with size < warp size, and > 1 (c). groups with size == 1 * Step 3: Put threads in type (a) group into warps until the number of remaining threads is less than warp size. The remaining threads from type (a) (c) and all type (b) threads form remainder threads, reorder these threads’ objects according to their frequencies, and make it the new layout of data objects. Then reorder threads according to the data objects they access.

4 3 3 1 4 3 1 1 Analytical Bound

Threads Idx:

0 1 4 5

2 3 12 13 4 5 14 6

7 15 8 9 16 17 18 19

New Data Obj Idx:

0 0 0 0

1 1 1 1 2 2 2 3 ............

4 4 5 5

New Obj Layout:

a b c d

d e x x

e f

g h

8 9 10 11

5 Mem. Transactions

* Optimal case: Total number of memory transactions = Total number of warps. * Upper bound: Optimal number + R (number of remainder threads) / W (warp size)

References [1] E. Zhang, Y. Jiang, Z. Guo, K.Tian, and X. Shen. On-the-fly elimination of dynamic irregularities for gpu computing. Technical Report WM-CS-2010-07, CS, College of William and Mary, 2010. [2] E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen. On-the-fly elimination of dynamic irregularities for gpu computing. In ASPLOS 2011. [3] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping and structure splitting using whole-program reference affinity. In PLDI 2004.

MSPC‘2012, Beijing, China

Suggest Documents