Parallel Computing 40 (2014) 628–645


Derivation of optimal input parameters for minimizing execution time of matrix-based computations on a GPU

Andrew White*, Soo-Young Lee

Department of Electrical and Computer Engineering, 200 Broun Hall, Auburn University, AL 36849, United States

Article info

Article history: Received 10 April 2013; Received in revised form 14 September 2014; Accepted 24 September 2014; Available online 2 October 2014

Keywords: GPU; Execution time; Matrix-based computations; Input parameters

Abstract

As GPUs are continually being utilized as coprocessors, the demand for optimally utilizing them for various computations continues to grow. The goal of this work is to derive input parameters which yield the minimum execution time for matrix-based computations executing on a GPU. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Since input parameters inadequately represent the executional behavior of the GPU, execution metrics are formulated as functions of the input parameters to represent the behavior. The execution metrics are architecture independent and are utilized to derive optimal input parameters, which are input parameters that yield the minimum execution time. Optimal input parameters are derived for the following matrix-based computations: matrix–vector multiplication (Mv), matrix–matrix multiplication (MM), and convolution. The derivation allows for selection of optimal input parameters without executing code. Results, for all matrix-based computations and sizes tested, show that utilizing the derived optimal input parameters often yields the minimum execution time, and, at worst, execution time within 13.6% of the minimum.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

As the market for massively multithreaded architectures continues to grow, so does the development of GPGPU applications [1–3]. Over the last decade, research has been done in regards to memory- and compute-bound applications on GPUs, including matrix-based applications [4–9]. Programming languages such as CUDA were introduced to ease the programming gap between users and GPUs [10,11]. With the advent of CUDA and ever-evolving architectures, there is a need to continually optimize GPU computations. Since the early CUDA-enabled GPUs, 8- and 200-series architecture, research has attempted to model or optimize matrix-based computations [12–26]. Similar work has focused on implementing and optimizing BLAS routines on GPUs [27–33]. Likewise, today's work with matrix-based computations continues with the newer Fermi and Kepler architectures [34–37]. However, research on GPUs is lacking in the ability to determine optimal input parameters for GPU execution.¹ Input parameters are the dimensions, or size, of the grid and blocks assigned to the GPU for execution. These dimensions determine the total number of threads assigned for execution. Since the dimensions are typically configurable for GPU code, determining the optimal input parameters is necessary to minimize execution time.

* Corresponding author. Tel.: +1 (334) 844 1800. E-mail addresses: [email protected] (A. White), [email protected] (S.-Y. Lee). URL: http://www.eng.auburn.edu/users/leesooy/ (S.-Y. Lee).
¹ GPU, in this work, refers to an NVIDIA CUDA-enabled GPU.

http://dx.doi.org/10.1016/j.parco.2014.09.002
0167-8191/© 2014 Elsevier B.V. All rights reserved.


Currently, there are several approaches to determining optimal input parameters, including exhaustive searches [38]. Others focus on minimizing the optimization space such that fewer searches are necessary [39,40]. Even with a reduced search space, execution of the GPU application is required. Auto-tuning has also been utilized to determine optimal input parameters [34,41–44]. Again, this work requires the code to be executed on the GPU, often hundreds of times, before optimal input parameters are determined. The goal of this work is to derive optimal input parameters off-line, thus eliminating the need for any code execution.

As shown in Section 3, input parameters alone are insufficient for modeling GPU executional behavior. Therefore, execution metrics are formulated as functions of input parameters to represent the behavior. Execution metrics are formulated in regards to GPU hardware but are independent of GPU architecture. For example, with GPUs, there is a maximum number of threads executing at a given time which varies with architecture. Therefore, the number of threads executing at a given time is formulated as one of the execution metrics. Given the input parameters, this number is calculable for any GPU architecture. From execution metrics and the implementations of the matrix-based computations, optimal input parameters are derived to yield the minimum execution time.

This paper is organized as follows. Section 2 is a brief discussion of GPU architecture and terms. Section 3 is the formulation of execution metrics utilized in this work. The implementations of three matrix-based computations, Mv, MM, and convolution, are described in Section 4, and the performance of each implementation is given. The implementations and execution metrics are utilized in Section 5 to derive optimal input parameters which yield the minimum execution time. The derivation is performed for each matrix-based computation. Results are presented in Section 6 as a comparison of execution time between optimal input parameters and all input parameters tested. Lastly, a summary including future work is given in Section 7.

2. GPU architecture

GPUs, as opposed to CPUs, are not well-suited for all algorithms due to architecture. CPUs are typically designed for optimal sequential performance of a wide variety of applications, and application performance generally increases with increasing clock frequencies. To further application performance, CPUs employ many hardware features not present in GPUs, such as relatively large instruction and data caches, branch prediction mechanisms, and deep pipelines. However, the lack of these hardware features in GPUs allows hardware designers to fit many more cores on a single die than in a CPU. Although GPU cores operate at a lower clock frequency than CPUs, the many-core design of GPUs can allow for greater performance of parallel applications. Therefore, portions of applications, such as matrix-based computations, that require high throughput with minimal branching, communication, and/or scatter-gather operations, excel on GPUs.

The CUDA GPU architecture is a collection of streaming multiprocessors (SMs). Each SM consists of a number of cores or streaming processors (SPs). The control logic and instruction cache for all SPs within an SM are shared. Code which is written for GPU execution is referred to as device code. Each function in device code is known as a kernel. Kernels are executed using a specified number of blocks and threads. Since the GPU utilizes a SIMT (single-instruction, multiple-thread) architecture, all threads execute the same code on different portions of data. Each thread has an identifier, the thread index, which is used for control logic and data access. A collection of threads is a block and blocks can be partitioned in 1-, 2-, or 3-dimensions. The collection of blocks is considered the grid and can be partitioned in 1- or 2-dimensions. Each kernel call can define different grid and block sizes and dimensions but these cannot be changed dynamically during kernel execution.

Once partitioning of the grid and blocks is determined, the scheduler organizes threads into groups of 32 known as warps. Warps waiting for long-latency operations, such as intensive arithmetic operations or memory access, are not selected for execution. Non-waiting warps are selected which provides a type of latency hiding. This type of zero-overhead thread scheduling ensures that the maximum instruction throughput is realized. Warps are divided into half-warps (HWs), a group of 16 threads in row-major order, for memory accesses.

The GPU consists of 3 separate types of memory: global, constant and shared. Global memory is the largest and slowest memory available and is divided into evenly sized partitions. Accesses to the same partition of global memory at a given time by different HWs causes partition camping to occur [22]. Although partition camping has been explained by NVIDIA, its effects on various applications have not been studied until recently [45–47]. Since only one HW can be serviced at a time, the accesses become serialized thus increasing the execution time if other partitions are unused. Similar to global memory partitioning, shared memory is divided into evenly sized banks. Accesses by threads within a HW to the same bank, but differing rows, create bank conflicts [48]. The Mv, MM, and convolution implementations utilized in this work minimize partition camping and eliminate bank conflicts. The number of HWs is determined by the partitioning of the grid and blocks which is specified by the programmer.

As mentioned, dimensions of the grid and blocks are referred to, in this work, as input parameters, and they are independent of GPU architecture. Four input parameters are denoted as dGrd.x, dGrd.y, dBlk.x and dBlk.y. dGrd.x and dGrd.y are the dimensions of the grid in the x- and y-dimensions, respectively. dBlk.x and dBlk.y are the dimensions of each block in the x- and y-dimensions, respectively. Lastly, the focus of this work is on matrix-based computations. All matrices are assumed to be square and n is used to denote the width or height of a square matrix.
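For readers unfamiliar with CUDA launch syntax, the following fragment (illustrative only; the kernel name, arguments, and chosen sizes are placeholders, not the kernels used in this work) shows how the four input parameters map onto the grid and block dimensions passed at kernel launch:

// Illustrative only: how the input parameters dGrd.x, dGrd.y, dBlk.x and dBlk.y
// map onto a CUDA kernel launch. someKernel and the chosen sizes are placeholders.
dim3 dBlk(32, 4);                                 // dBlk.x = 32, dBlk.y = 4 (128 threads per block)
dim3 dGrd(n / dBlk.x, 1);                         // dGrd.x = n/32, dGrd.y = 1
someKernel<<<dGrd, dBlk>>>(d_A, d_b, d_c, n);     // dimensions are fixed for the duration of this kernel call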

Fig. 1. Execution time (ms) of MM on the T10 GPU, n = 2048: (a) as a function of the input parameter dBlk.x; (b) as a function of the execution metric, the number of active threads.

3. Execution metrics

As shown in [40] and illustrated in Fig. 1(a), input parameters alone are insufficient for modeling GPU computational behavior and deriving optimal input parameters. In the figure, the execution time of MM for n = 2048 on the T10 GPU is illustrated. As shown, large variations exist between the minimum and maximum time for a given value of dBlk.x. Rather than deriving optimal input parameters based on input parameters, or arbitrary combinations of input parameters, execution metrics are utilized. Execution metrics are formulated as functions of input parameters to represent the executional behavior of a GPU. Therefore, execution metrics are developed, as functions of input parameters, through analysis of GPU hardware and measured data. They are formulated to be architecture independent. Fig. 1(b) depicts the execution time as a function of one execution metric, the number of active threads. The variation between the minimum and maximum time is significantly reduced compared to Fig. 1(a). Therefore, optimal input parameters are derived through optimization of execution metrics. The derivation through execution metrics allows for specification of optimal input parameters without the necessity of executing code.

3.1. Active threads

The number of threads executing at a given time on any GPU, or the number of active threads, is referred to as Thds_GPU^active. To formulate the number of active threads, it is necessary to formulate the number of active blocks per SM, Blks_SM^active. Blks_SM^active is dependent on the following: the maximum number of registers per SM (Registers_SM), the number of registers used per block (Registers_Blk), the maximum amount of shared memory per SM (Shared Memory_SM), the amount of shared memory used per block (Shared Memory_Blk), and the maximum number of blocks and threads per SM, Blks_SM^Max. and Thds_SM^Max., respectively. The maximums are specific to each GPU architecture. For any GPU, the number of active blocks in an SM is defined by

$$Blks^{active}_{SM} = \min\left(\left\lfloor \frac{Registers_{SM}}{Registers_{Blk}} \right\rfloor,\ \left\lfloor \frac{Shared\ Memory_{SM}}{Shared\ Memory_{Blk}} \right\rfloor,\ Blks^{Max.}_{SM},\ \left\lfloor \frac{Thds^{Max.}_{SM}}{dBlk.x \cdot dBlk.y} \right\rfloor\right). \tag{1}$$

The number of registers used is dependent on the compiler and determined after compilation. The amount of shared memory used per block is determined from the kernel after compilation.

The number of active blocks on a GPU, Blks_GPU^active, is dependent on the number of SMs per GPU (Num. of SMs), the number of active blocks per SM, and the dimensions of the grid. Therefore,

$$Blks^{active}_{GPU} = \min\left(Num.\ of\ SMs \cdot Blks^{active}_{SM},\ dGrd.x \cdot dGrd.y\right). \tag{2}$$

The number of active threads on a GPU is formulated as the product of Eq. (2) and the number of threads per block. Therefore,

$$Thds^{active}_{GPU} = Blks^{active}_{GPU} \cdot dBlk.x \cdot dBlk.y. \tag{3}$$
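As a minimal illustration (not part of the paper), Eqs. (1)–(3) can be evaluated on the host for the T10 GPU using the hardware characteristics later listed in Table 2; the per-block register and shared memory usage must be taken from the compiled kernel:

#include <algorithm>

// Sketch of Eqs. (1)-(3): number of active blocks and threads on the T10 GPU.
// registersBlk and sharedMemoryBlk (bytes) come from compiling the kernel in question.
int activeThreadsGPU(int dGrdX, int dGrdY, int dBlkX, int dBlkY,
                     int registersBlk, int sharedMemoryBlk)
{
    const int registersSM = 16384, sharedMemorySM = 16384;   // T10 values (Table 2)
    const int blksSMMax = 8, thdsSMMax = 1024, numSMs = 30;

    int blksSMActive = std::min({ registersBlk    > 0 ? registersSM    / registersBlk    : blksSMMax,
                                  sharedMemoryBlk > 0 ? sharedMemorySM / sharedMemoryBlk : blksSMMax,
                                  blksSMMax,
                                  thdsSMMax / (dBlkX * dBlkY) });              // Eq. (1)
    int blksGPUActive = std::min(numSMs * blksSMActive, dGrdX * dGrdY);        // Eq. (2)
    return blksGPUActive * dBlkX * dBlkY;                                      // Eq. (3)
}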

3.2. Fragmented threads

If the total number of threads assigned to a GPU is not evenly divisible by the number of active threads ($Thds^{total}_{GPU} \bmod Thds^{active}_{GPU} \ne 0$),² fragmentation occurs. Fragmented threads are threads which are not executed with the same amount of parallelization as active threads. From (2) and (3), two execution metrics are formulated to represent the numbers of fragmented blocks and threads: Blks_GPU^frag. and Thds_GPU^frag., respectively. Therefore, the number of fragmented blocks on a GPU is

$$Blks^{frag.}_{GPU} = (dGrd.x \cdot dGrd.y) \bmod Blks^{active}_{GPU} \tag{4}$$

and the number of fragmented threads on a GPU is

$$Thds^{frag.}_{GPU} = Blks^{frag.}_{GPU} \cdot dBlk.x \cdot dBlk.y. \tag{5}$$

² x mod y denotes the remainder of x/y.
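A corresponding host-side sketch (again, not from the paper) of Eqs. (4) and (5), given the number of active blocks obtained from Eq. (2):

// Sketch of Eqs. (4) and (5): fragmented blocks and threads on a GPU.
int fragmentedBlocksGPU(int dGrdX, int dGrdY, int blksGPUActive)
{
    return (dGrdX * dGrdY) % blksGPUActive;                                    // Eq. (4)
}

int fragmentedThreadsGPU(int dGrdX, int dGrdY, int dBlkX, int dBlkY, int blksGPUActive)
{
    return fragmentedBlocksGPU(dGrdX, dGrdY, blksGPUActive) * dBlkX * dBlkY;   // Eq. (5)
}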

3.3. Global memory partitions

In addition to formulating execution metrics to represent the number of active and fragmented threads, two execution metrics are formulated to represent the layout of global memory. Global memory on a GPU is divided into evenly sized partitions. The number of partitions (Num. of Partitions) and the width of each partition (Partition Width) are specific to each GPU architecture. The number of global memory partitions accessed per block is an execution metric, Partitions_Blk. Partitions_Blk is dependent on the width of the block and memory partitions. Therefore,

$$Partitions_{Blk} = \frac{dBlk.x}{Partition\ Width}. \tag{6}$$

Since HWs are formed in row-major order, the x-dimension of the block is used. From Partitions_Blk, the number of global memory partitions accessed by the active blocks on a GPU, Partitions_GPU, is formulated as

$$Partitions_{GPU} = \begin{cases} \min\left(Blks^{active}_{GPU} \cdot Partitions_{Blk},\ Num.\ of\ Partitions\right) & \text{if } Blks^{active}_{GPU} < dGrd.x \\ \min\left(dGrd.x \cdot Partitions_{Blk},\ Num.\ of\ Partitions\right) & \text{if } Blks^{active}_{GPU} \ge dGrd.x. \end{cases} \tag{7}$$

Partitions_GPU is dependent on the number of global memory partitions accessed per block, Blks_GPU^active, and dGrd.x. The equation defines a maximum number of global memory partitions being accessed as Num. of Partitions.
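The two partition metrics can be sketched the same way (not from the paper). The T10 values of Table 2 are assumed, and Partition Width is interpreted here in 4-byte elements (256 bytes / 4), an assumption consistent with the float data used in this work:

#include <algorithm>

// Sketch of Eqs. (6) and (7): global memory partitions accessed per block and per GPU.
double partitionsGPU(int dBlkX, int dGrdX, int blksGPUActive)
{
    const double numPartitions  = 8.0;          // T10 (Table 2)
    const double partitionWidth = 256.0 / 4.0;  // 256-byte partitions expressed in floats (assumption)
    double partitionsBlk = dBlkX / partitionWidth;                               // Eq. (6)
    double accessingBlks = (blksGPUActive < dGrdX) ? blksGPUActive : dGrdX;      // case split of Eq. (7)
    return std::min(accessingBlks * partitionsBlk, numPartitions);               // Eq. (7)
}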

4. Implementation of matrix-based computations

The implementation of computations on GPUs largely affects execution time. Three matrix-based computations are implemented for GPU execution: matrix–vector multiplication (Mv), matrix–matrix multiplication (MM), and convolution. The focus of this work is deriving optimal input parameters so the optimized implementation of each computation is only discussed briefly. NVIDIA provides an implementation for their CUDA version of the BLAS routines (CUBLAS), which includes Mv and MM [28]. Since the CUBLAS implementations are closed source and examining the kernels is not possible, we have created our own kernels. The performance of our kernels is compared against CUBLAS, where applicable, since CUBLAS is often used as a reference for custom kernels [29,36,37,44,49].

4.1. Mv

Matrix–vector multiplication (Mv) is a BLAS library routine used in many mathematic, graphic, and scientific applications. Multiplying a matrix by a vector yields a vector, c = Ab, where A is a matrix, b is a vector multiplied by A, and c is the resulting vector.

A naïve implementation of Mv on a GPU consists of thread_j computing c_j where c is the resulting vector. However, this limits the maximum of Thds_GPU^total, and therefore, the maximum of Thds_GPU^active, to n. Because of this limit, Mv is implemented such that each block computes dBlk.x elements of c, thus increasing the maximum of Thds_GPU^total.

To execute Mv on the GPU, A and c are stored in global memory. b is stored in constant memory to minimize the number of global memory accesses and partition camping. Shared memory is utilized to coalesce accesses to A. Each row of shared memory is padded by one to eliminate bank conflicts. Partition camping is minimized by ensuring each block begins accessing A from a varying partition of global memory.

Listing 1. The inner-most loop of accessing shared memory for the implementation of Mv on a GPU. Constant memory is utilized for b.
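The original listing appears in the published article as an image and is not available in this text. The following CUDA kernel is a hypothetical reconstruction based only on the description given below (a padded shared tile As, b in constant memory, dBlk.x/dBlk.y inner-loop iterations per thread, and a final reduction by threads with y-index zero); it is not the authors' code, and all identifiers are assumptions:

// Hypothetical sketch of the Mv kernel structure described in the text (not the authors' listing).
// Launch: grid (n/dBlk.x, 1), block (dBlk.x, dBlk.y) with dBlk.x >= dBlk.y,
// dynamic shared memory of dBlk.x*(dBlk.x+1)*sizeof(float) bytes, cf. Eq. (8).
#define MAX_N 16384
__constant__ float b_c[MAX_N];                 // vector b in constant memory (assumption)

__global__ void mv_sketch(const float* A, float* c, int n)
{
    extern __shared__ float As[];              // dBlk.x rows, padded to dBlk.x + 1 columns
    const int pitch = blockDim.x + 1;
    const int row = blockIdx.x * blockDim.x + threadIdx.x;   // element of c this thread column works on
    float partial = 0.0f;

    for (int j = 0; j < n / blockDim.x; ++j) {                // outer loop over tiles of width dBlk.x
        // coalesced load of a dBlk.x x dBlk.x tile of A into shared memory
        for (int r = threadIdx.y; r < blockDim.x; r += blockDim.y)
            As[r * pitch + threadIdx.x] =
                A[(blockIdx.x * blockDim.x + r) * n + j * blockDim.x + threadIdx.x];
        __syncthreads();

        // inner-most loop: dBlk.x/dBlk.y iterations per thread with data in shared memory
        for (int k = threadIdx.y; k < blockDim.x; k += blockDim.y)
            partial += As[threadIdx.x * pitch + k] * b_c[j * blockDim.x + k];
        __syncthreads();
    }

    // reuse the shared tile to sum partial results across threadIdx.y, then store c coalesced
    As[threadIdx.y * pitch + threadIdx.x] = partial;
    __syncthreads();
    if (threadIdx.y == 0) {
        float sum = 0.0f;
        for (int r = 0; r < blockDim.y; ++r)
            sum += As[r * pitch + threadIdx.x];
        c[row] = sum;
    }
}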


Since each block computes dBlk.x elements of c, the shared memory allocated to coalesce accesses to A is reused to sum partial results computed by each thread. Therefore, the amount of shared memory allocated is

$$Shared\ Memory_{Blk} = (dBlk.x^2 + dBlk.x) \cdot 4. \tag{8}$$

Threads with a y-index equal to zero sum the partial results in shared memory and store them in c in a coalesced manner. The inner-most loop of the implementation of Mv utilized is depicted in Listing 1. The listing is included since the derivation of optimal input parameters is dependent on shared memory utilization. j is an outer loop counter. As is the shared memory allocated for A and b is the constant memory utilized for b. threadIdx.x and threadIdx.y are local thread indices in the x- and y-dimensions, respectively. blockDim.x (dBlk.x) and blockDim.y (dBlk.y) are the dimensions of each block in the x- and y-dimensions, respectively. Therefore, each thread performs dBlk.x/dBlk.y iterations of computation with data in shared memory. In addition, this kernel requires that dBlk.x ≥ dBlk.y.

Fig. 2 illustrates the performance of the aforementioned Mv kernel compared to the CUBLAS implementation. As shown in the figure, the Mv kernel described outperforms the CUBLAS implementation for 4 of the 6 data sizes tested. The minimum (n = 8192) and maximum (n = 1024) speedups over CUBLAS are 0.96 and 2.17, respectively. Therefore, at worst, the Mv implementation utilized in this work is comparable to CUBLAS.

4.2. MM

Matrix–matrix multiplication (MM) is another BLAS library routine which is also used in many mathematic, graphic, and scientific applications. Multiplying a matrix by a matrix yields a matrix, C = AB, where A, B, and C are matrices. A naïve implementation of MM consists of thread_ij computing C_ij. However, as shown in [30] and utilized in [35,50,51], this limits the amount of shared memory utilization and overlap of memory accesses and computation. Therefore, the algorithm presented in [30] is implemented although modifications are performed to assume the matrices are stored in row-major order and to minimize partition camping. In this implementation, each thread computes n_elements C_ij's. Due to resource allocation of registers, n_elements is a maximum of sixteen.

To execute MM on the GPU, A, B, and C are stored in global memory. Shared memory is allocated only for A and is utilized to reduce the number of global memory accesses. Shared memory is not padded since bank conflicts do not occur. Partition camping is minimized by ensuring blocks begin accessing A and B from varying partitions of global memory. Since each thread computes n_elements C_ij's, the amount of shared memory allocated is

$$Shared\ Memory_{Blk} = (n\_elements \cdot dBlk.x) \cdot 4. \tag{9}$$

Each thread utilizes n_elements registers for storing partial results. The inner-most loop of the implementation of MM utilized is depicted in Listing 2. The derivation of optimal input parameters is dependent on utilization of shared memory and therefore, the listing is included. i and j are outer loop counters. As is the shared memory allocated for A and B is the pointer to B in global memory. Each thread performs n_elements iterations of computation with data in shared memory.

Fig. 3 depicts the performance of the aforementioned MM kernel compared to the CUBLAS implementation. The minimum (n = 4096) and maximum (n = 512) speedups are 0.95 and 1.39, respectively. The MM kernel described outperforms the CUBLAS implementation for 3 of the 6 data sizes tested. Therefore, the performance of the MM kernel utilized for testing the derivation of optimal input parameters is, at worst, comparable to CUBLAS.

4.3. Convolution

Like Mv and MM, convolution is a mathematical function commonly used in mathematic, graphic, and scientific applications. In this work, 2D convolution is utilized, C = A ∗ B, where A, B, and C are matrices.

Fig. 2. Performance of the Mv kernel on the T10 GPU (GFLOPS versus n, compared with CUBLAS).


Listing 2. The inner-most loop of accessing shared memory for the implementation of MM on a GPU.

Fig. 3. Performance of the MM kernel on the T10 GPU (GFLOPS versus n, compared with CUBLAS).

Some implementations of 2D convolution on a GPU assume utilization of separable filters to improve performance [52]. However, the implementation utilized in this work is for general convolution of any size filter. In image processing, A is an image, B is a filter applied to the image, and C is the resulting image. In this work, thread_ij computes C_ij. All three matrices are stored in global memory. Since the implementation assumes inputs and filters of any size, constant memory is not utilized for A or B. Shared memory is utilized for A and B to reduce the number of global memory accesses and is not padded as bank conflicts do not occur. Blocks begin accessing A from varying partitions of global memory and therefore, partition camping is ignored for accessing A. However, if the size of the filter, FS, is less than the width of a global memory partition, partition camping can be reduced by padding the filter with zeros. In addition, padding the filter aligns each row to a specified boundary and ensures the data is naturally aligned. This reduces the number of global memory transactions [48]. Similarly, padding A with zeros eliminates boundary checking instructions and therefore, reduces thread divergence. If threads in a warp diverge due to conditional branches, the warp is serially executed for each branch path [48]. Therefore, padding A reduces execution time.

Each HW in a block loads a portion of its respective row of A into shared memory every dBlk.x iterations of the inner loop. Similarly, a portion of a row of B is loaded into shared memory every dBlk.x iterations. Since the same portion of a row of B is utilized by all threads in a block, only threads with a y-index equal to zero load B into shared memory. Therefore, the amount of shared memory allocated is

$$Shared\ Memory_{Blk} = (2 \cdot dBlk.x \cdot dBlk.y + dBlk.x) \cdot 4. \tag{10}$$

The inner-most loop of the implementation of convolution described and utilized is depicted in Listing 3. The listing is included to illustrate utilization of data loaded into shared memory which is utilized for deriving optimal input parameters. A and B are loaded into the shared memory variables As and Bs, respectively. From the listing, the trip count for the inner loop where shared memory is accessed is dependent on FS. Therefore, each thread performs ⌈FS/dBlk.x⌉ iterations of computation with data in shared memory.

Fig. 4 illustrates the performance of the 2D convolution kernel described. The performance for three differing filter sizes is shown. Convolution is not part of the BLAS routines so there is no CUBLAS comparison available. As shown in the figure, the GFLOPS performance increases as the filter size increases. For small filter sizes, utilizing constant memory would increase the GFLOPS performance as less accesses to global memory would be necessary. The goal of this work is to derive optimal input parameters rather than develop an optimal 2D convolution kernel. Therefore, this implementation is general to allow an input and filter of any size which fits in the global memory.

Listing 3. The inner-most loop of accessing shared memory for the implementation of convolution on a GPU.


Fig. 4. Performance of the convolution kernel on the T10 GPU (GFLOPS versus n for filter sizes 3 × 3, 63 × 63, and 513 × 513).

Table 1
The percentage of peak theoretical global memory bandwidth for all kernels tested on the T10 GPU.

n        Mv (%)   Mv CUBLAS (%)   MM (%)   MM CUBLAS (%)   Conv. FS = 3 × 3 (%)   FS = 63 × 63 (%)   FS = 513 × 513 (%)
512      21       10              73       53              7                      68                 71
1024     50       23              83       72              7                      70                 74
2048     76       47              86       84              8                      70                 74
4096     78       80              87       92              8                      70                 75
8192     79       82              88       92              8                      70                 75
16,384   79       76              88       92              8                      70                 75

Since matrix-based GPU kernels are often limited by global memory bandwidth, the convolution kernel is compared against the Mv and MM kernels in terms of peak theoretical bandwidth (102 GB/s). Table 1 shows the performance of each kernel as a percentage of peak theoretical global memory bandwidth. If FS ≥ 63 × 63, the table shows the 2D convolution kernel yields global memory bandwidth comparable with the Mv and MM implementations. The relatively low utilization of bandwidth for the smallest filter size suggests the kernel is not optimal in this case. Regardless, it is used in validating the derivation of optimal input parameters since this is the goal of this work.

5. Optimal input parameters

As mentioned in Section 3, input parameters alone are inadequate for modeling GPU computational behavior and therefore, execution metrics are formulated. Through execution metrics and the implementation of each matrix-based computation, optimal input parameters are derived. To derive optimal input parameters for any matrix-based computation, a list of five steps is presented:

1. Saturate the memory bus.
2. Maximize shared memory utilization.
3. Minimize the amount of shared memory used per block.
4. Maximize the number of global memory partitions accessed.
5. Minimize the number of fragmented blocks.

Step 1 is listed to maximize the overlap of computation and memory accesses by ensuring the memory bus is fully utilized. Step 2 ensures data that is loaded into shared memory is reused as much as possible. Step 3 is listed such that a minimum amount of shared memory is assigned per block which increases the number of blocks assigned to each SM. In addition, this reduces the time each HW in a block waits for synchronization since there are less HWs per block. Step 4 maximizes the number of global memory partitions accessed such that partition camping is minimized. Lastly, Step 5 ensures the contribution to execution time from fragmented blocks is minimized.

Optimal input parameters are derived in this section for the matrix-based implementations of Mv, MM, and convolution presented in Section 4. The execution metrics in Section 3 are general for any GPU. In this work, the T10 GPU is used to verify the derivation of optimal input parameters. Table 2 contains hardware characteristics specific to the T10 GPU since it is the target architecture of the following derivation. However, the derivation of optimal input parameters can be applied to any GPU architecture through substitution of the GPU hardware characteristics.

Before deriving optimal input parameters, several constraints are presented due to the GPU architecture. From Table 2, there is a maximum of 512 threads per block. Therefore,


Table 2
Hardware characteristics of the T10 GPU.

Name                  Value     Description
Registers_SM          16,384    Number of registers per SM
Shared Memory_SM      16,384    Amount of shared memory (bytes) per SM
Blks_SM^Max.          8         Maximum number of blocks per SM
Thds_SM^Max.          1024      Maximum number of threads per SM
Thds_Blk^Max.         512       Maximum number of threads per block
Num. of SMs           30        Number of SMs per GPU
Num. of Partitions    8         Number of global memory partitions
Partition Width       256       Width of each global memory partition (bytes)

$$dBlk.x \cdot dBlk.y \le 512. \tag{11}$$

Threads are assigned to HWs in row-major order and the minimum global memory transaction size is 32 B [48]. Since values of type float are utilized, Eq. (12) ensures that one 32 B transaction can service the threads in a HW.

$$dBlk.x \ge 8. \tag{12}$$

The T10 GPU has 4 GB of global memory. Using powers of two for the height or width of a square matrix, n, the maximum value of n is 16,384. The minimum value of n, in this work, is determined such that one row of a matrix spans at least one global memory partition. Therefore,

$$512 \le n \le 16{,}384. \tag{13}$$

Step 1 of deriving optimal input parameters specifies that the memory bus is saturated, which requires analysis of Thds_GPU^active, which is dependent on Blks_GPU^active. From [48], each global memory access takes approximately 400–800 clock cycles. Therefore, there needs to exist a minimum of 400/8 warps per global memory partition, or 12,800 Thds_GPU^active, to fully saturate the memory bus. The maximum Thds_GPU^active is determined by the product of the number of SMs per GPU, Num. of SMs, and the maximum number of threads per SM, Thds_SM^Max.. Using values from Table 2 yields

$$12{,}800 \le Thds^{active}_{GPU} \le 30{,}720. \tag{14}$$

Therefore, optimal input parameters are derived such that Thds_GPU^active is greater than the minimum amount necessary to fully saturate the memory bus. Optimal input parameters are derived through execution metrics which are dependent on shared memory and register allocation. Each matrix-based computation discussed utilizes a different kernel for execution and therefore, the derivation of optimal input parameters is dependent on the computation.

5.1. Derivation for Mv

For Mv, the amount of shared memory allocated per block is defined by Eq. (8). Since each SM has a maximum of 16 KB of shared memory, the maximum value of dBlk.x is 32. Combining with the minimum value of dBlk.x from Eq. (12) yields

$$8 \le dBlk.x \le 32. \tag{15}$$

From the implementation of Mv and Eq. (11),

$$dBlk.y \le \min\left(\frac{512}{dBlk.x},\ dBlk.x\right) \tag{16a}$$

$$dGrd.x = \frac{n}{dBlk.x} \tag{16b}$$

$$dGrd.y = 1.$$

The number of registers used per thread for the kernel varies from 10 to 15 dependent on dBlk.x and dBlk.y. Assuming the maximum register usage, Blks_SM^active from Eq. (1) is not limited by registers. Substituting values from Table 2 into Eq. (1) and excluding registers yields

$$Blks^{active}_{SM} = \min\left(\left\lfloor \frac{16{,}384}{Shared\ Memory_{Blk}} \right\rfloor,\ 8,\ \left\lfloor \frac{1024}{dBlk.x \cdot dBlk.y} \right\rfloor\right). \tag{17}$$

Since dGrd.y = 1, Eq. (2) simplifies to

$$Blks^{active}_{GPU} = \min\left(30 \cdot Blks^{active}_{SM},\ dGrd.x\right). \tag{18}$$

Since

$$Thds^{active}_{GPU} = Blks^{active}_{GPU} \cdot dBlk.x \cdot dBlk.y,$$

then, if dGrd.x ≤ Blks_GPU^active,

$$Thds^{active}_{GPU} = dGrd.x \cdot dBlk.x \cdot dBlk.y.$$

Substituting dGrd.x = n/dBlk.x from Eq. (16b) and the lower limit of n defined by Eq. (13) yields

$$Thds^{active}_{GPU} \ge 512 \cdot dBlk.y.$$

Since dBlk.x ≤ 32, the maximum value of dBlk.y from Eq. (16a) yields

$$Thds^{active}_{GPU} \ge 8192.$$

Substituting into Eq. (18) and substituting for Blks_GPU^active yields

$$8192 \le \min\left(30 \cdot Blks^{active}_{SM},\ dGrd.x\right) \cdot dBlk.x \cdot dBlk.y. \tag{19}$$

If dGrd.x ≤ 30 · Blks_SM^active, rearranging Eq. (19) and solving for dBlk.y yields the lower limit as

$$\frac{8192}{n} \le dBlk.y. \tag{20}$$

The upper limit of dBlk.y is not modified and therefore is defined by Eq. (16a). However, if dGrd.x > 30 · Blks_SM^active, then substituting for Blks_SM^active from Eq. (17) into Eq. (19) yields

$$8192 \le 30 \cdot \min\left(\left\lfloor \frac{16{,}384}{Shared\ Memory_{Blk}} \right\rfloor,\ 8,\ \left\lfloor \frac{1024}{dBlk.x \cdot dBlk.y} \right\rfloor\right) \cdot dBlk.x \cdot dBlk.y. \tag{21}$$

From Eq. (8), Shared Memory_Blk = (dBlk.x² + dBlk.x) · 4. After substituting Shared Memory_Blk into Eq. (21), solving for dBlk.y given the possible values of dBlk.x from Eq. (15) yields

$$\max\left(4,\ \frac{64}{dBlk.x}\right) \le dBlk.y. \tag{22}$$

Combining the lower limits of dBlk.y from Eqs. (20) and (22) with the upper limit from Eq. (16a) yields

$$\max\left(\frac{8192}{n},\ 4,\ \frac{64}{dBlk.x}\right) \le dBlk.y \le \min\left(\frac{512}{dBlk.x},\ dBlk.x\right). \tag{23}$$

Step 2 specifies shared memory utilization is maximized to ensure data is reused as much as possible. From Line 1 of Listing 1, the iteration count for the inner loop where shared memory is accessed is dependent on dBlk.x/dBlk.y. Therefore, the maximum of dBlk.x from Eq. (15) and the minimum of dBlk.y from Eq. (23) yield

$$dBlk.x = 32,\quad dBlk.y = \max\left(\frac{8192}{n},\ 4\right),\quad dGrd.x = \frac{n}{32},\quad dGrd.y = 1. \tag{24}$$

Given n, all input parameters are constant and no further steps are necessary to derive optimal input parameters. Therefore, Eq. (24) is the derivation of optimal input parameters for Mv.
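A small host-side helper (not from the paper) that evaluates Eq. (24) for the T10 GPU might look as follows; n is assumed to be a power of two within the limits of Eq. (13):

#include <algorithm>

// Sketch of Eq. (24): derived optimal input parameters for Mv as a function of n.
struct MvParams { int dBlkX, dBlkY, dGrdX, dGrdY; };

MvParams optimalMvParams(int n)            // 512 <= n <= 16384
{
    MvParams p;
    p.dBlkX = 32;
    p.dBlkY = std::max(8192 / n, 4);       // dBlk.y = max(8192/n, 4)
    p.dGrdX = n / 32;
    p.dGrdY = 1;
    return p;
}

For example, n = 1024 yields dBlk = (32, 8) and dGrd = (32, 1), i.e., 256 threads per block.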

5.2. Derivation for MM

For MM, the implementation and Eqs. (11) and (12) define

$$8 \le dBlk.x \le 512 \tag{25a}$$
$$1 \le dBlk.y \le \frac{512}{dBlk.x} \tag{25b}$$
$$dGrd.x = \frac{n}{dBlk.x \cdot dBlk.y} \tag{25c}$$
$$dGrd.y = \frac{n}{n\_elements} \tag{25d}$$

where 1 ≤ n_elements ≤ 16 due to resource constraints. From Eq. (9), Shared Memory_Blk = (n_elements · dBlk.x) · 4. From compilation of the kernel, Registers per Thread varies from 14 to 38. From the minimum of Shared Memory_Blk and Registers_Blk, dGrd.x · dGrd.y > 30 · Blks_SM^active. Therefore, Eq. (2) simplifies to

$$Blks^{active}_{GPU} = 30 \cdot \min\left(\left\lfloor \frac{16{,}384}{Registers_{Blk}} \right\rfloor,\ \left\lfloor \frac{16{,}384}{Shared\ Memory_{Blk}} \right\rfloor,\ 8,\ \left\lfloor \frac{1024}{dBlk.x \cdot dBlk.y} \right\rfloor\right). \tag{26}$$

For Step 1 of deriving optimal input parameters, substituting Blks_GPU^active from Eq. (26) and the minimum of Thds_GPU^active from Eq. (14) into Eq. (3) yields

$$12{,}800 \le 30 \cdot \min\left(\left\lfloor \frac{16{,}384}{Registers_{Blk}} \right\rfloor,\ \left\lfloor \frac{16{,}384}{Shared\ Memory_{Blk}} \right\rfloor,\ 8,\ \left\lfloor \frac{1024}{dBlk.x \cdot dBlk.y} \right\rfloor\right) \cdot dBlk.x \cdot dBlk.y. \tag{27}$$

Solving for dBlk.y modifies the lower limit of dBlk.y from Eq. (25b) to yield

$$\max\left(\frac{12{,}800}{30 \cdot dBlk.x \cdot \left\lfloor \frac{16{,}384}{Registers_{Blk}} \right\rfloor},\ \frac{12{,}800}{30 \cdot dBlk.x \cdot \left\lfloor \frac{16{,}384}{(n\_elements \cdot dBlk.x) \cdot 4} \right\rfloor},\ \frac{64}{dBlk.x}\right) \le dBlk.y \le \frac{512}{dBlk.x}. \tag{28}$$

Step 2 specifies shared memory utilization is maximized to ensure data is reused as much as possible. From Line 1 of Listing 2, the trip count for the inner loop where shared memory is accessed is dependent on n_elements. Since 1 ≤ n_elements ≤ 16, n_elements = 16. Substituting n_elements = 16 into Eq. (25d) yields

$$dGrd.y = \frac{n}{16}. \tag{29}$$

Since Shared Memory_Blk = (n_elements · dBlk.x) · 4, dBlk.x ≤ 64. However, if dBlk.x = 64 and dBlk.y is the maximum from Eq. (28), then Blks_GPU^active = 30 due to register usage. Therefore, Thds_GPU^active = 7680 and is less than the minimum from Eq. (14). Therefore, dBlk.x ≤ 32. However, if dBlk.x ≥ 16, due to register usage, only two values exist for dBlk.y which satisfy Eq. (28). Therefore, Eq. (28) simplifies to

$$\frac{64}{dBlk.x} \le dBlk.y \le \frac{512}{dBlk.x} \quad \text{if } dBlk.x = 8 \tag{30}$$

and

$$dBlk.y = 32 \ \text{if } dBlk.x = 16, \qquad dBlk.y = 2 \ \text{if } dBlk.x = 32.$$

Step 3 specifies the minimum amount of shared memory is allocated per block to increase the number of blocks assigned to each SM. In addition, minimizing the amount of shared memory allocated reduces the time each HW in a block waits for synchronization since there are less HWs per block. Since n_elements = 16, Shared Memory_Blk = 64 · dBlk.x. Therefore, the minimum value of dBlk.x from Eq. (25a) is utilized. Since dBlk.x = 8, Eqs. (30) and (25c) are evaluated to yield

$$8 \le dBlk.y \le 64 \tag{31a}$$
$$dGrd.x = \frac{n}{8 \cdot dBlk.y} \tag{31b}$$

Step 4 specifies the maximum number of global memory partitions, Partitions_GPU, is accessed to reduce partition camping. Partitions_GPU is dependent on Partitions_Blk, dGrd.x, and Blks_GPU^active. Since dBlk.x and n_elements are constant, Registers per Thread is constant. From compilation, 32 Registers per Thread are allocated so Registers_Blk = 256 · dBlk.y. Since Registers_Blk is the only limiting factor of Eq. (26), the equation simplifies to

$$Blks^{active}_{GPU} = \frac{1920}{dBlk.y}. \tag{32}$$

If dGrd.x > Blks_GPU^active, substituting dBlk.x = 8 and Blks_GPU^active from Eq. (32) into Eq. (7) and solving for dBlk.y yields³

$$dBlk.y \le 16. \tag{33}$$

³ The solution is dBlk.y ≤ 30. However, dBlk.y is a power of two so it is rounded down to the nearest power of two.

If dGrd.x ≤ Blks_GPU^active, substituting dBlk.x = 8 and dGrd.x from Eq. (31b) into Eq. (7) and simplifying yields

$$Partitions_{GPU} = \min\left(\frac{n}{64 \cdot dBlk.y},\ 8\right). \tag{34}$$

Substituting the minimum value of dBlk.y from Eq. (31a) into Eq. (34) yields

$$dGrd.x \cdot \frac{dBlk.x}{64} \ge \min\left(\frac{n}{512},\ 8\right). \tag{35}$$

Substituting dBlk.x = 8 and dGrd.x from Eq. (31b) into Eq. (35) and solving for dBlk.y yields

$$dBlk.y \le \frac{n}{64 \cdot \min\left(\frac{n}{512},\ 8\right)}. \tag{36}$$

Combining the upper limits of dBlk.y from Eqs. (33) and (36) with the lower limit of dBlk.y from Eq. (31a) yields

$$8 \le dBlk.y \le \min\left(16,\ \frac{n}{64 \cdot \min\left(\frac{n}{512},\ 8\right)}\right). \tag{37}$$

Step 5 specifies the number of fragmented blocks, Blks_GPU^frag. from Eq. (4), is minimized. Substituting dGrd.x from Eq. (31b), dGrd.y from Eq. (29), and Blks_GPU^active from Eq. (32) into Eq. (4) yields

$$\frac{n^2}{128 \cdot dBlk.y} \bmod \frac{1920}{dBlk.y}. \tag{38}$$

In general, x/y can be expressed as x = Qy + R where Q is the quotient, Q = ⌊x/y⌋, and R is the remainder. Since Blks_GPU^frag. = R, Blks_GPU^frag. = x − Qy. Substituting for x, y, and Q from Eq. (38) and simplifying yields

$$Blks^{frag.}_{GPU} = \frac{n^2}{128 \cdot dBlk.y} - \left\lfloor \frac{n^2}{245{,}760} \right\rfloor \cdot \frac{1920}{dBlk.y}. \tag{39}$$

Therefore, to minimize Blks_GPU^frag., the maximum value of dBlk.y from Eq. (37) is utilized and

$$dBlk.y = \min\left(16,\ \frac{n}{64 \cdot \min\left(\frac{n}{512},\ 8\right)}\right). \tag{40}$$

Since dBlk.x = 8 and from Eqs. (40), (25c), and (29), optimal input parameters for the implementation of MM on a GPU are

$$dBlk.x = 8,\quad dBlk.y = \min\left(16,\ \frac{n}{64 \cdot \min\left(\frac{n}{512},\ 8\right)}\right),\quad dGrd.x = \frac{n}{8 \cdot dBlk.y},\quad dGrd.y = \frac{n}{16}. \tag{41}$$
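Similarly, a host-side sketch (not from the paper) evaluating Eq. (41) for MM on the T10 GPU:

#include <algorithm>

// Sketch of Eq. (41): derived optimal input parameters for MM as a function of n (n_elements = 16).
struct MMParams { int dBlkX, dBlkY, dGrdX, dGrdY; };

MMParams optimalMMParams(int n)            // 512 <= n <= 16384
{
    MMParams p;
    p.dBlkX = 8;
    p.dBlkY = std::min(16, n / (64 * std::min(n / 512, 8)));   // Eq. (40)
    p.dGrdX = n / (8 * p.dBlkY);
    p.dGrdY = n / 16;
    return p;
}

For the sizes tested, this gives dBlk.y = 8 (64 threads per block) for n ≤ 4096 and dBlk.y = 16 (128 threads per block) for n ≥ 8192.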

5.3. Derivation for convolution

For convolution, the implementation and Eqs. (11) and (12) yield

$$8 \le dBlk.x \le 512 \tag{42a}$$
$$1 \le dBlk.y \le \frac{512}{dBlk.x} \tag{42b}$$
$$dGrd.x = \frac{n}{dBlk.x} \tag{42c}$$
$$dGrd.y = \frac{n}{dBlk.y} \tag{42d}$$

From Eq. (10), Shared Memory_Blk = (2 · dBlk.x · dBlk.y + dBlk.x) · 4. From compilation of the kernel, Registers per Thread = 10 and is constant regardless of n, FS, and input parameters. Therefore, Eq. (1) simplifies to

$$Blks^{active}_{SM} = \min\left(8,\ \left\lfloor \frac{1024}{dBlk.x \cdot dBlk.y} \right\rfloor\right). \tag{43}$$

Substituting Blks_SM^active from Eq. (43), dGrd.x from Eq. (42c), and dGrd.y from Eq. (42d) into Eq. (2) yields

$$Blks^{active}_{GPU} = \min\left(30 \cdot \min\left(8,\ \left\lfloor \frac{1024}{dBlk.x \cdot dBlk.y} \right\rfloor\right),\ \frac{n^2}{dBlk.x \cdot dBlk.y}\right). \tag{44}$$

Since n ≥ 512 from Eq. (13) and dBlk.x · dBlk.y ≤ 512 from Eq. (11), Eq. (44) simplifies to

$$Blks^{active}_{GPU} = 30 \cdot \min\left(8,\ \left\lfloor \frac{1024}{dBlk.x \cdot dBlk.y} \right\rfloor\right). \tag{45}$$

Substituting Blks_GPU^active from Eq. (45) and the minimum of Thds_GPU^active from Eq. (14) into Eq. (3) yields

$$12{,}800 \le 30 \cdot \min\left(8,\ \frac{1024}{dBlk.x \cdot dBlk.y}\right) \cdot dBlk.x \cdot dBlk.y. \tag{46}$$

If 1024/(dBlk.x · dBlk.y) ≥ 8, then Eq. (46) is true. Solving for dBlk.y in Eq. (46), if 1024/(dBlk.x · dBlk.y) < 8, yields

$$dBlk.y \ge \frac{12{,}800}{240 \cdot dBlk.x}.$$

Since dBlk.y is a power of two, this simplifies to dBlk.y ≥ 64/dBlk.x. Combining this with the limits of dBlk.y from Eq. (42b) yields

$$\max\left(\frac{64}{dBlk.x},\ 1\right) \le dBlk.y \le \frac{512}{dBlk.x}. \tag{47}$$

Step 2 specifies shared memory utilization is maximized to ensure data is reused as much as possible. From Line 1 of Listing 3, the iteration count for the inner loop where shared memory is accessed is dependent on the filter size, FS. From the number of accesses to global memory for the implementation, dBlk.x ≥ 16. Since accesses are performed ⌈FS/dBlk.x⌉ times in the inner loop, dBlk.x > FS. Therefore, dBlk.x ≥ max(16, FS). However, if dBlk.x > FS, each HW accesses dBlk.x · dBlk.y values for A and dBlk.x values for B but only FS values are necessary. Therefore, dBlk.x > FS but less than the next power of two. So, dBlk.x = max(16, 2^⌈log₂(FS)⌉). If FS is greater than the upper limit of dBlk.x from Eq. (42a), then dBlk.x = 512. Therefore,

$$dBlk.x = \min\left(\max\left(16,\ 2^{\lceil \log_2(FS) \rceil}\right),\ 512\right). \tag{48}$$

Step 3 specifies the minimum amount of shared memory is allocated per block to increase the number of blocks assigned to each SM. In addition, minimizing the amount of shared memory allocated reduces the time each HW in a block waits for synchronization since there are less HWs per block. Since Shared Memory_Blk = (2 · dBlk.x · dBlk.y + dBlk.x) · 4 and dBlk.x is constant from Eq. (48), the minimum value of dBlk.y from Eq. (47) is utilized. Since dBlk.y = max(64/dBlk.x, 1) and from Eqs. (48), (42c), and (42d),

$$dBlk.x = \min\left(\max\left(16,\ 2^{\lceil \log_2(FS) \rceil}\right),\ 512\right),\quad dBlk.y = \max\left(\frac{64}{dBlk.x},\ 1\right),\quad dGrd.x = \frac{n}{dBlk.x},\quad dGrd.y = \frac{n}{dBlk.y}. \tag{49}$$

Given n and FS, all input parameters are constant and no further steps are necessary to derive optimal input parameters. Therefore, Eq. (49) is the derivation of optimal input parameters for convolution.
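Finally, a host-side sketch (not from the paper) evaluating Eq. (49) for convolution, given n and the filter size FS:

#include <algorithm>
#include <cmath>

// Sketch of Eq. (49): derived optimal input parameters for convolution as a function of n and FS.
struct ConvParams { int dBlkX, dBlkY, dGrdX, dGrdY; };

ConvParams optimalConvParams(int n, int FS)
{
    ConvParams p;
    int nextPow2 = 1 << (int)std::ceil(std::log2((double)FS));   // 2^ceil(log2(FS))
    p.dBlkX = std::min(std::max(16, nextPow2), 512);
    p.dBlkY = std::max(64 / p.dBlkX, 1);
    p.dGrdX = n / p.dBlkX;
    p.dGrdY = n / p.dBlkY;
    return p;
}

For the filter sizes tested, this gives dBlk = (16, 4) for FS = 3, (64, 1) for FS = 63, and (512, 1) for FS = 513.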

6. Experimental results and discussion

Results in this section are of the measured execution time for Mv, MM, and convolution on the T10 GPU. As shown in the results, the execution time varies significantly based on the input parameters. Rather than auto-tuning, or exhaustive or pruned searches for optimal input parameters, results are presented to prove the validity of deriving optimal input parameters from the execution metrics presented. Deriving input parameters from the execution metrics eliminates the search space and the need for code execution.

In the figures, time for each set of input parameters is denoted with a blue dot, and time, utilizing optimal input parameters, is denoted as a red cross. The minimum measured time is illustrated as a red dotted line. The x-axis in all figures, the number of threads per block, is dBlk.x · dBlk.y. As previously mentioned, due to memory coalescing and the structure of HWs, dBlk.x < 8 is known to yield poor performance (Eq. (12)). While samples were taken for dBlk.x < 8, they are intentionally omitted in the results to more clearly show the benefits of the derived optimal input parameters. Since the figures depict time versus the number of threads per block, there are multiple samples for each threads per block value. However, due to the GPU, the maximum number of threads per block is 512 (Eq. (11)). The minimum number of threads per block is 32 due to the warp size of the GPU. It was measured that utilizing less than 32 threads per block does not yield the minimum, or near-minimum, execution time and, therefore, those samples are omitted in the results. Again, the intentional omission of samples from the results is to clearly show the benefits of the derived optimal input parameters against reasonable configurations executed on the GPU. Lastly, in all results, the limits of n tested are defined by Eq. (13).


Fig. 5. Execution time (ms) versus threads per block measured for Mv on the T10 GPU for n = 512, 1024, 2048, 4096, 8192, and 16,384. Time utilizing optimal input parameters is denoted as a red cross. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

6.1. Results for Mv

Fig. 5 depicts the measured execution time for the GPU implementation of Mv described in Section 4.1. Due to the implementation, dBlk.x ≥ dBlk.y, and therefore, only one sample is obtainable when threads per block is 512 (32 × 16). From the figure, it is shown there is not one value for threads per block which yields the minimum execution time for all data sizes. The derivation of optimal input parameters from the execution metrics (Eq. (24)) allows the optimal input parameters to vary with n. Therefore, as illustrated, optimal input parameters yield the minimum, or near-minimum, execution time for all data sizes.

Statistics of the samples presented in the figure are given in Table 3. Due to the implementation of Mv tested (Section 4.1) and the constraints on the number of threads per block as explained in the beginning of this section, there are only 11 samples for each data size. However, from column 4, it is shown that there is a large standard deviation of the measured time for all data sizes, which illustrates the necessity of optimal input parameters. From the table, the optimal input parameters (column 6) yield an execution time consistently less than the mean (column 3). From a percentile rank perspective (column 7), the worst-case utilizing optimal input parameters occurs when n = 8192. Although it is the 82nd percentile rank, the measured time is within 1.2% of the minimum (column 5). The 82nd percentile rank is due to 2 of the 11 samples yielding the minimum, or near-minimum, execution time as shown in Fig. 5. In terms of percentage difference to the minimum time, the worst-case occurs when n = 512, although the time is within 4.2% of the minimum. Since optimal input parameters yield the minimum execution time for 4 of the 6 data sizes, utilizing them yields, on average, time within 0.9% of the minimum measured.

6.2. Results for MM

Fig. 6 depicts the measured execution time for the GPU implementation of MM described in Section 4.2. As illustrated, there is a large variation of the measured execution time for MM.

Table 3
Statistics of time (ms) measured for Mv on the T10 GPU. Time and percentile rank, utilizing optimal input parameters, are shown in the last two columns.

n        Samples   μ        σ        Min.     O.I.P.   Rank (%)
512      11        0.065    0.018    0.047    0.049    91
1024     11        0.118    0.033    0.084    0.084    100
2048     11        0.333    0.127    0.220    0.220    100
4096     11        1.170    0.251    0.856    0.856    100
8192     11        4.748    1.470    3.325    3.364    82
16,384   11        26.193   14.165   13.551   13.551   100


Fig. 6. Execution time (ms) versus threads per block measured for MM on the T10 GPU for n = 512, 1024, 2048, 4096, 8192, and 16,384. Time utilizing optimal input parameters is denoted as a red cross. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 4
Statistics of time (ms) measured for MM on the T10 GPU. Time and percentile rank, utilizing optimal input parameters, are shown in the last two columns.

n        Samples   μ           σ           Min.        O.I.P.      Rank (%)
512      118       2.85        2.02        0.90        0.91        99
1024     118       21.62       16.32       6.42        6.42        100
2048     118       170.16      130.63      49.24       49.58       99
4096     118       1378.15     1064.70     387.04      390.83      98
8192     118       11249.61    8698.01     3085.49     3085.49     100
16,384   118       97526.53    82848.23    24678.54    24678.54    100

Utilizing the optimal input parameters consistently yields the minimum, or near-minimum, time. Similar to Mv, there is not one value of threads per block which consistently yields the minimum, or near-minimum, time. Also similar to Mv, the number of threads per block utilizing optimal input parameters varies with n (Eq. (41)). This allows the optimal input parameters to consistently yield the minimum, or near-minimum, execution time for all data sizes.

Table 4 contains the statistics of the samples in the previous figure. As shown, utilizing optimal input parameters yields a worst-case percentile rank of 98th. Unlike Mv, the implementation of MM presented in Section 4.2 allows for many samples. Therefore, optimal input parameters provide a consistently high percentile rank. In addition, optimal input parameters yield execution time approximately one-third of the mean. From the table, optimal input parameters yield the minimum execution time for 3 of the 6 data sizes. Worst-case time (n = 512), in terms of percentage difference, is within 1.2% of the minimum. However, the time is in the 99th percentile rank. On average, time utilizing optimal input parameters is within 0.5% of the minimum.

6.3. Results for convolution

The measured execution time for the GPU implementation of convolution, described in Section 4.3, is depicted in Figs. 7–9. Each figure depicts the time for a specified filter size. Fig. 7 depicts the time for a filter size of 3 × 3. Convolution with this filter size yields the worst performance of the optimal input parameters. This may be attributed to the implementation of the kernel. As previously mentioned in Section 4.3, this kernel yields, on average, 8% of the peak theoretical global memory bandwidth for the smallest filter size (3 × 3). Improvement in bandwidth can be achieved utilizing constant memory for the filter, although this limits the maximum filter size.

As illustrated in Fig. 7, the execution time varies depending on the input configuration. However, from Table 5, the standard deviation is much lower compared to the Mv and MM kernels. This may be attributed to a poorer performing kernel.


Fig. 7. Execution time (ms) versus threads per block measured for convolution with a filter size of 3 × 3 on the T10 GPU for n = 512, 1024, 2048, 4096, 8192, and 16,384. Time utilizing optimal input parameters is denoted as a red cross. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Execution time (ms) versus threads per block measured for convolution with a filter size of 63 × 63 on the T10 GPU for n = 512, 1024, 2048, 4096, 8192, and 16,384. Time utilizing optimal input parameters is denoted as a red cross. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

As kernels become less efficient, the variation in execution time from the input parameters lessens. Regardless, utilizing optimal input parameters still yields a worst-case percentile rank of 72nd.

From Table 5, utilizing optimal input parameters yields a time consistently less than the mean for all data sizes. The optimal input parameters yield a time, on average, within 9.6% of the minimum with a worst-case (n = 8192) within 13.6%. Best-case utilizing optimal input parameters (n = 512) yields a time within 5.3% of the minimum. Although these results are less convincing than for Mv and MM, they still exhibit the validity of the derivation of optimal input parameters. In addition, the results illustrate the effectiveness of utilizing optimal input parameters for sub-optimal GPU kernels.

As mentioned in Section 4.3, increasing the filter size increases the efficiency of the convolution kernel. Fig. 8 depicts the measured execution time for a filter size of 63 × 63.


Fig. 9. Execution time (ms) versus threads per block measured for convolution with a filter size of 513 × 513 on the T10 GPU for n = 512, 1024, 2048, 4096, 8192, and 16,384. Time utilizing optimal input parameters is denoted as a red cross. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 5
Statistics of time (ms) measured for convolution with a filter size of 3 × 3 on the T10 GPU. Time and percentile rank, utilizing optimal input parameters, are shown in the last two columns.

n        Samples   μ        σ        Min.     O.I.P.   Rank (%)
512      25        0.22     0.03     0.18     0.19     84
1024     25        0.78     0.12     0.62     0.68     72
2048     25        3.01     0.48     2.32     2.63     74
4096     25        11.98    1.97     9.32     10.40    76
8192     25        49.07    8.90     37.33    42.40    72
16,384   25        195.60   52.10    143.83   157.25   80

Table 6
Statistics of time (ms) measured for convolution with a filter size of 63 × 63 on the T10 GPU. Time and percentile rank, utilizing optimal input parameters, are shown in the last two columns.

n        Samples   μ          σ         Min.      O.I.P.    Rank (%)
512      25        13.5       8.1       8.2       8.2       92
1024     25        52.8       32.3      32.1      32.1      100
2048     25        210.5      129.0     127.8     127.8     100
4096     25        849.1      529.3     510.5     510.8     96
8192     25        3463.9     2224.7    2041.5    2042.7    96
16,384   25        14258.4    9854.0    8163.7    8170.2    96

Table 7
Statistics of time (ms) measured for convolution with a filter size of 513 × 513 on the T10 GPU. Time and percentile rank, utilizing optimal input parameters, are shown in the last two columns.

n        Samples   μ            σ            Min.       O.I.P.     Rank (%)
512      25        729          312          509        519        88
1024     25        2859         1245         1990       1990       100
2048     25        11,657       5458         7898       7898       100
4096     25        48,098       24,023       31,506     31,506     100
8192     25        217,281      141,016      125,960    125,960    100
16,384   25        1,080,129    1,002,214    503,782    503,782    100


The improvement in the performance of the kernel as the filter size increases is accompanied by an improvement in utilizing optimal input parameters. Again, the execution time varies greatly between the different input configurations but the optimal input parameters yield the minimum, or near-minimum, time for all n. The statistics for the samples in the figure are given in Table 6. Utilizing optimal input parameters yields the minimum execution time for 2 of the 6 data sizes with a worst-case percentile rank of 92nd. Again, the optimal input parameters yield a time significantly less than the mean. The standard deviation increased in comparison to the smaller filter size of 3 × 3. The increased variation is similar to the variation observed in the Mv and MM kernels which suggests that this implementation is more optimal than the smaller filter size. On average, utilizing optimal input parameters yields an execution time within 0.2% of the minimum time. Worst-case (n = 512), in terms of percentage difference to the minimum time, optimal input parameters yield a time within 1.0%.

Lastly, Fig. 9 depicts the time for convolution with a large filter size (513 × 513). The optimal input parameters (Eq. (49)) are dependent on the size of the filter. Therefore, the optimal number of threads per block differs compared to the other filter sizes. The statistics of the samples in the figure are shown in Table 7. Similar to convolution with a filter size of 63 × 63, the variation in time amongst the various input parameters is more similar to the Mv and MM kernels than to the convolution kernel with a filter size of 3 × 3. Likewise, this suggests this implementation is more optimal than the small filter size, which is also evident by the increase in GFLOPS and effective bandwidth shown in Section 4.3. The large standard deviation, shown in the table, illustrates the necessity of optimal input parameters. Utilizing the optimal input parameters yields the minimum execution time for 5 of the 6 data sizes, with a worst-case time (n = 512) within 1.9% of the minimum measured.

7. Summary

In this work, it is shown that input parameters, or the dimensions of the grid and blocks, are insufficient for deriving optimal input parameters. Therefore, execution metrics, which represent the executional behavior of a GPU, are formulated as functions of input parameters. The execution metrics are independent of GPU architecture. From execution metrics, optimal input parameters are derived to yield the minimum execution time of matrix-based computations on a GPU. Optimal input parameters are derived and tested for implementations of Mv, MM, and convolution.

Results of the Mv implementation show that the derivation of optimal input parameters yields an execution time, on average, within 0.9% of the minimum measured time. Worst-case utilizing optimal input parameters yields a time within 4.1% of the minimum. Similarly, for the MM implementation, the average execution time utilizing optimal input parameters is within 0.5% of the minimum and the worst-case is within 1.2%. Three filter sizes are utilized for testing the derivation of optimal input parameters for convolution. For the smallest filter size (3 × 3), results show optimal input parameters yield an execution time, on average, within 9.6% of the minimum. For filters of size 63 × 63 and 513 × 513, optimal input parameters yield an execution time, on average, within 0.2% and 0.3%, respectively. Therefore, this work validates the derivation of optimal input parameters via execution metrics.
Unlike auto-tuning and exhaustive or pruned searches, the determination of optimal input parameters in this work is performed off-line, which eliminates the need for any code execution; the search space for the minimum execution time is thereby eliminated. Future work includes utilizing execution metrics to derive optimal input parameters for other GPU computations and architectures. Lastly, this work can also serve as a framework for developing a GPU simulator or for improving the accuracy of current simulators.
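As a hypothetical illustration of this distinction, the sketch below contrasts a timing sweep over candidate configurations, as an auto-tuner would perform, with direct off-line selection. The function selectOptimalBlock and its rule are placeholders standing in for the closed-form expressions derived in this work (e.g., Eq. (49)); they are not the actual formulas.

```cuda
// Hypothetical contrast (not taken from this work) between an auto-tuning
// sweep and off-line selection of the block dimensions.
#include <cuda_runtime.h>
#include <cfloat>

// Auto-tuning: every candidate configuration is launched and timed on the
// GPU, i.e., the search space is explored by executing code.
dim3 autotuneBlock(const dim3 *candidates, int numCandidates,
                   void (*launchKernel)(dim3))
{
    dim3 best = candidates[0];
    float bestMs = FLT_MAX;
    for (int i = 0; i < numCandidates; ++i) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        launchKernel(candidates[i]);     // one full kernel execution per candidate
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; best = candidates[i]; }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    return best;
}

// Off-line selection: the block dimensions are computed directly from the
// problem description, so no kernel is executed during the search.
// The rule below is a placeholder, not Eq. (49).
dim3 selectOptimalBlock(int n, int filterWidth)
{
    (void)n;  // a real rule would also account for the data size
    return (filterWidth > 16) ? dim3(32, 8) : dim3(16, 16);
}
```

Even with a pruned candidate list, the cost of the sweep grows with the number of configurations and data sizes, whereas the cost of the off-line selection is independent of both.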

