CUDA-lite: Reducing GPU Programming Complexity

Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi, and Wen-mei W. Hwu
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
{ueng, mlathara, bsadeghi, hwu}@crhc.uiuc.edu

Abstract. The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.

1 Introduction

In 2007, NVIDIA introduced the Compute Unified Device Architecture (CUDA) [9], an extended ANSI C programming model. Under CUDA, Graphics Processing Units (GPUs) consist of many processor cores, each of which can directly address into a global memory. This allows for a much more flexible programming model than previous GPGPU programming models [11], and allows developers to implement a wider variety of data-parallel kernels. As a result, CUDA has rapidly gained acceptance in application domains where GPUs are used to execute compute intensive, data-parallel application kernels. While GPUs have been designed with higher memory bandwidth than CPUs, the even higher compute throughput of GPUs can easily saturate their available memory bandwidth. For example, the NVIDIA GeForce 8800 GTX comes with 86.4 GB/s memory bandwidth, approximately ten times that of Intel CPUs on a Front Side Bus. However, since the GeForce 8800 has a peak performance of 384 GFLOPS and each floating point operation operates on up to 12 bytes of source data, the available memory bandwidth cannot sustain even a small fraction of the peak performance if all of the source data are accessed from global memory.
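To make this gap concrete (a back-of-the-envelope calculation we add here, using the figures quoted above): feeding every operand of the peak 384 GFLOPS from global memory would require on the order of

    384 × 10^9 ops/s × 12 bytes/op ≈ 4.6 TB/s,

more than fifty times the 86.4 GB/s actually available, so well under 2% of peak throughput could be sustained in that regime.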

Consequently, CUDA and its underlying GPUs offer multiple memory types with different bandwidths, latencies, and access restrictions that allow programmers to conserve memory bandwidth while increasing the overall performance of their applications. Currently, CUDA programmers are responsible for explicitly allocating space and managing data movement among the different memories to conserve memory bandwidth. Furthermore, additional hardware mechanisms at the memory interface can enhance main memory access efficiency if the access patterns follow memory coalescing rules; CUDA programmers also shoulder the responsibility of massaging their code to produce the desired access patterns. Experience shows that this responsibility places a major burden on the programmer, and CUDA-lite is designed to relieve it. Moreover, CUDA code that is explicitly optimized for one GPU's memory hierarchy design may not easily port to the next generation or to other types of data-parallel execution vehicles.

This paper presents CUDA-lite, an experimental enhancement to CUDA that allows programmers to deal only with global memory, the main memory of a GPU, and relies on automated transformations to leverage the rest of the memory hierarchy. For increased efficiency, programmers provide annotations describing certain properties of the data structures and code regions designated for GPU execution. The CUDA-lite tools analyze the code along with these annotations and determine whether memory bandwidth can be conserved and latency reduced by utilizing special memory types and/or by massaging memory access patterns. Upon detecting an opportunity, CUDA-lite performs the needed transformations and code insertions. CUDA-lite is designed as a source-to-source translator; its output is CUDA code with explicit memory-type declarations and data transfers for a particular GPU. We envision CUDA-lite eventually targeting multiple types and generations of data-parallel execution vehicles. If maximum performance is desired, the programmer can still choose to program certain kernels at the CUDA level.

In this paper we present CUDA-lite in detail. We cover the memories and techniques that the tool leverages to conserve memory bandwidth and reduce memory latency, and we describe how CUDA-lite identifies optimization opportunities and the hand transformations that it replaces. We have developed plug-ins for the Phoenix compiler [7] from Microsoft that perform all of the transformations as a source-to-source compiler, and we evaluated the results by passing the generated source code through NVIDIA's tool chain. We show that the performance of code generated by CUDA-lite matches or is comparable to hand-generated code.

2 CUDA Programming Model

The CUDA programming model is ANSI C extended with keywords and constructs. The GPU is treated as a coprocessor that executes data-parallel kernel functions. The user supplies a single source program encompassing both host (CPU) and kernel (GPU) code. These are separated and compiled by NVIDIA’s

Fig. 1. CUDA Programming Model and Memory Hierarchy

compiler, nvcc. The host starts the kernel code with a function call. The complete description of the programming model can be found in [8–10].

Figure 1 depicts the programming model and memory hierarchy of CUDA. Threads are organized into a three-level hierarchy and are executed on the streaming multiprocessors (SMs) of the GPU. At the highest level, each kernel creates a single grid, which consists of many thread blocks (TBs) arranged in two dimensions. The maximum number of threads per TB is 512, arranged in a three-dimensional manner. Each TB is assigned to a single SM for its execution, and each SM can handle up to eight TBs at a time. Threads in the same TB can share data through the on-chip shared memory and can perform barrier synchronization by invoking the syncthreads primitive. Synchronization across TBs can only be safely accomplished by terminating the kernel.

One of the major bottlenecks to achieving performance with CUDA is memory bandwidth and latency. The GPU provides several memories with different behaviors and performance characteristics that can be leveraged to improve memory performance, but the programmer must explicitly and correctly utilize these memories in the source code in order to gain the benefit. In the rest of this section we examine shared memory and the desirable access patterns to global memory that improve memory performance, and show the work required of programmers, work that CUDA-lite intends to automate. We focus on memory coalescing for global memory and on shared memory in this work, since these are the only writable memories in CUDA. We leave the read-only memories, constant and texture, for future work.
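For concreteness, the sketch below (ours, not from the paper; the kernel, its name, and its sizes are hypothetical) shows how these constructs appear in CUDA source: a kernel declared with the __global__ qualifier, a per-TB shared memory buffer, a barrier via syncthreads, and a host-side launch specifying the grid and TB dimensions.

    #define TPB 256                       /* threads per thread block (TB) */

    __global__ void scale(float *data, float alpha, int n)
    {
        __shared__ float tile[TPB];                     /* on-chip shared memory, private to one TB */
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index of this thread */

        if (i < n) tile[threadIdx.x] = data[i];         /* stage one element per thread */
        __syncthreads();                                /* barrier across the threads of this TB */

        if (i < n) data[i] = alpha * tile[threadIdx.x];
    }

    /* Host side: launch a grid of ceil(n/TPB) TBs, each with TPB threads.  */
    /* scale<<<(n + TPB - 1) / TPB, TPB>>>(d_data, 2.0f, n);                */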

2.1 Global Memory

CUDA exposes a general-purpose, random access, readable and writable off-chip global memory visible to all threads. It is the slowest of the available memory spaces, requiring hundreds of cycles, and is not cached. However, its resemblance to a CPU's memory in its generality and size is also what allows more

[Fig. 2: code listing of the example kernel and host code using only global memory (garbled in extraction; omitted)]
forward implementation of the kernel code that utilizes only global memory, and depend on tools to optimize the memory performance. We have developed tools to automate the transformations previously done by hand to maximize memory performance via memory coalescing. The programmer provides a version of the program that has been parallelized for CUDA using only global memory, and the tools output a version with the memory accesses optimized. In other words, the tools transform code like the kernel function in Figure 2 into the memory coalescing version in Figure 4. We rely upon information from the programmer, provided via annotations, to perform our transformations. We call the software tools and annotations together CUDA-lite.

Figure 5 shows the current form of the annotations in CUDA-lite. Part (a) indicates the functions of interest, i.e. kernel functions running on the GPU, and parallelization factors. While some of this information, such as threads per TB, can eventually be derived from CUDA code, the last argument gives programmers some control over how many resources a kernel generated by CUDA-lite should take. Part (b) indicates which arrays in global memory are of interest and their properties. This gives control over which memory accesses are targeted for optimization, which uses up resources; the speedup gained from performing memory coalescing needs to be balanced against excessive resource usage that reduces executing parallelism. We discuss this in detail in Section 4. Part (c) is for annotating exit checks, such as the conditional check on line 12 of Figure 2 mentioned in Section 2.1. While CUDA threads may terminate early, CUDA-lite may need those threads to satisfy memory coalescing and synchronization requirements; therefore CUDA-lite removes the early termination and places guards around the original computation, as mentioned in Section 2.3. Finally, part (d) conveys information about the control flow of loops in the program. We currently use this information to perform loop transformations.

We recognize that some of the information provided by the annotations is derivable by advanced compiler techniques. However, the point of the annotations is to quickly provide the additional information needed and enable the transformations so that the work on automating memory hierarchy optimization can proceed. The annotations are not necessarily in their final form.

Requirement 2 of the four requirements detailed in Section 2.3 is the most difficult to satisfy and check for. CUDA-lite derives the expression used in each global memory access by performing a backwards dataflow up to the parameters of the kernel function and the thread indices. The expression is first simplified by extracting all references to the thread index in the x direction. We leverage the SIMD execution model to eliminate the need for temporal locality checks, since the execution model guarantees that the expression is the same for all threads in the warp. The desired expression is one where every thread in a half-warp accesses essentially the same location, differing only by the thread's order within the half-warp. Consequently, any instance of ⌊thi.x/hwarp⌋ can be safely disregarded, where thi.x is the thread index in the x dimension and hwarp is the number of threads in a half-warp. Mathematically this can be seen as the function f in Equation 1.

As long as the expression fits the form of this function, the memory access is coalesced:

    f(thi.x) = thi.x + g(⌊thi.x / hwarp⌋) + C        (1)
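To illustrate the two cases with a simplified fragment of our own (not taken from the paper's figures; ASIZE and the kernel are placeholders): an index in which the thread index carries a multiplier cannot match Equation 1, whereas an index in which the thread index appears only as the lowest-order term can.

    #define ASIZE 3000                    /* illustrative array dimension */

    __global__ void access_patterns(float *a, float *out)
    {
        int tx  = threadIdx.x;            /* thi.x in the paper's notation */
        int row = blockIdx.x;

        /* Not coalesced: consecutive threads read addresses ASIZE elements */
        /* apart, because the thread index is multiplied by ASIZE.          */
        float v0 = a[tx * ASIZE + row];

        /* Coalesced: consecutive threads in a half-warp read consecutive   */
        /* addresses; the index fits f(thi.x) = thi.x + g(...) + C.         */
        float v1 = a[row * ASIZE + tx];

        out[row * blockDim.x + tx] = v0 + v1;
    }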

Fig. 6. Array Access and Expression (a) Non-Coalescing (b) Coalescing

Figure 6(a) shows the relevant pseudo-code and the expression generated by CUDA-lite for the memory access to array a in Figure 2. Due to the ASIZE multiplier on the first term, the expression does not fit function f and thus the load is not coalesced. Part (b) shows the memory access to array a in Figure 4. Unlike part (a), the expression does fit the form of function f and therefore the access is coalesced.

If a memory access is not already coalesced, CUDA-lite attempts to automatically generate a coalescing version. The labels of the additional boxes in Figure 4 outline the majority of the transformations: inserting shared memory variables, performing loop tiling, generating memory-coalesced loads and/or stores, and replacing the original global memory accesses with accesses to the corresponding data in shared memory. The shared memory size and tiling factor are fixed and known for each target GPU, due to the half-warp requirement for memory coalescing. The amount of shared memory allocated can thus be determined from the number of arrays of interest, the array dimensions, and the array element size.

The generation of coalescing loads or stores depends on the relationship between the array dimension and the threading dimension. If they match, CUDA-lite has each thread load from the appropriate place in global memory into the thread's corresponding position in shared memory. If the array is of higher dimension than the thread organization, two dimensions versus one dimension in the running example, then CUDA-lite generates loops that load/store the data. This can be seen in the Coalesced Loads and Stores boxes of Figure 4. These loops must not only

[Code listing with CUDA-lite annotations for the example kernel (garbled in extraction; omitted)]
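As an illustration of the transformation steps listed above (inserting shared memory variables, loop tiling, coalesced loads, and redirecting the original accesses to shared memory), the following sketch is ours rather than CUDA-lite output; TILE, the kernel name, and the row-sum computation it performs are hypothetical, and it assumes blockDim.x == TILE and that asize is a multiple of TILE.

    #define TILE 16                        /* half-warp size on the GeForce 8800 */

    __global__ void kernel_coalesced(const float *a, float *out, int asize)
    {
        __shared__ float a_s[TILE][TILE];  /* inserted shared memory tile */
        int tx  = threadIdx.x;
        int row = blockIdx.x * TILE;       /* first of the TILE rows handled by this TB */

        float sum = 0.0f;
        for (int t = 0; t < asize; t += TILE) {            /* loop tiling */
            /* Coalesced loads: at each r, consecutive threads read           */
            /* consecutive addresses, staging a TILE x TILE tile of 'a'.      */
            for (int r = 0; r < TILE; ++r)
                a_s[r][tx] = a[(row + r) * asize + (t + tx)];
            __syncthreads();

            /* The original per-thread row traversal, which was not coalesced */
            /* across threads, now reads from fast on-chip shared memory.     */
            for (int r = 0; r < TILE; ++r)
                sum += a_s[tx][r];
            __syncthreads();
        }
        out[row + tx] = sum;               /* one row sum per thread */
    }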