A practical performance model for compute and memory bound GPU kernels

Elias Konstantinidis and Yiannis Cotronis
Department of Informatics and Telecommunications, University of Athens, Athens, Greece
e-mail: [email protected], [email protected]

Abstract—Performance prediction of GPU kernels is generally a tedious procedure with unpredictable results. In this paper, we provide a practical model for estimating the performance of CUDA kernels on GPU hardware in an automated manner. First, we propose the quadrant-split model, an alternative to the roofline visual performance model, which provides insight into the performance limiting factors of multiple devices with different compute-memory bandwidth ratios with respect to a particular kernel, and we elaborate on the compute/memory bound characterization of kernels. In addition, we developed a micro-benchmark program which exposes the peak compute and memory transfer performance under a variable operation intensity; experimental results of its execution on different GPUs are presented. In the proposed performance prediction procedure, a set of kernel features is extracted through an automated profiling execution which records a set of significant kernel metrics. Additionally, a small set of device features for the target GPU is generated using micro-benchmarking and architecture specifications. By combining kernel and device features we determine the performance limiting factor and generate an estimate of the kernel's execution time. We performed experiments on DAXPY, DGEMM, FFT and stencil computation kernels using 4 GPUs and measured an average absolute prediction error of 10.1%, with 25.8% in the worst case. Keywords-GPU kernels; performance prediction; performance model; micro-benchmarks;

I. INTRODUCTION

As GPUs became programmable for general purpose computation, they were rapidly adopted by the scientific community, mostly due to their high compute performance and energy efficiency. Compute environments like CUDA [17] and OpenCL [8] enabled their use as compute accelerators. The primary goal in using GPUs for computing is the significant performance improvement they can achieve on problems which fit their peculiarities. However, it is not always easy to predict the benefits of migrating to a GPU accelerator or of moving from one type of GPU to another. On GPUs, memory throughput plays a major role in "feeding" data to the compute resources. The roofline visual model [22] can provide essential insight into the limiting factor of performance and determine a performance bound for an application. We propose an alternative visualization, which we call the quadrant-split performance model; it allows a better focus on a particular problem while inspecting the behaviour of multiple processors.


In addition, we conduct a quantitative prediction of performance on GPUs. We collect a set of features of our kernel through profiling on a CUDA enabled system. These kernel features constitute a minimal set of characteristics used to estimate the performance limiting factor and the effective compute performance bound for the particular kernel. These features, combined with the GPU specification data, are used to predict the expected performance of the kernel on another GPU without actually executing it on the real hardware. In this way we are able to achieve more accurate results than with the purely visual model. Furthermore, we developed a mixed micro-benchmark program which exposes the attained compute (GFLOPS) and memory (GB/sec) performance under a configurable operation intensity. We executed this micro-benchmark on a wide range of operation intensity values, which allows us to investigate the behaviour of various GPUs across different operation intensities. The usefulness of our model is validated by applying it to a number of kernels for popular problems: the DAXPY vector operation, matrix multiplication (DGEMM), FFT and a set of stencil computation kernels (LMSOR [3]). In general our model provided adequate predictions in the applied experiments, with an average absolute error of 10.14%, though in the worst case the absolute error reached 25.8%.

The rest of this paper is structured as follows. In the next section the roofline model is described, along with the quadrant-split model. In section 3 the applied method for performance prediction is presented, along with the mixed micro-benchmark results. In section 4 the experimental results are presented and justified. Related work is reviewed in section 5, and the conclusions and future work follow in section 6.

II. THE QUADRANT-SPLIT VISUAL PERFORMANCE MODEL

In [22] the authors present a visual model for identifying the critical performance factor of a program with regard to its operational intensity. The operational intensity is the compute to memory transfer ratio, typically measured in flops/byte. By exposing the limiting performance factor, this model can lead to valid optimization decisions. In general, a GPU kernel can be memory bound, compute bound or latency bound.
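As a concrete illustration of operational intensity (our own worked example, consistent with the DAXPY balance of 0.083 flops/byte reported later in table III): the DAXPY update y ← αx + y performs 2 flops per element while reading two doubles and writing one, hence

    I_{DAXPY} = \frac{2\ \text{flops}}{3 \times 8\ \text{bytes}} \approx 0.083\ \text{flops/byte}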

Fig. 1. The roofline visual model as applied to the NVidia GTX-480. The vertical dotted lines correspond to 4 different problems.

A kernel is memory bound when the memory bus is congested, limiting the rate of execution of compute instructions. It is compute bound primarily when the ALUs (arithmetic logic units) of the processor are fully utilized and unable to provide additional throughput. When pipeline or memory latencies are the primary reason that limits performance, the kernel is considered latency bound. The operational intensity is a characteristic that identifies the compute requirements of an application with regard to its DRAM traffic. Some known applications are illustrated with their operational intensities in the roofline model [22] (fig. 1).

In the quadrant-split model we propose presenting the same information using a slightly different data mapping. Instead of placing the operational intensity on the horizontal axis, we use the DRAM bandwidth of the device; the vertical axis keeps the compute performance unit (GFLOPS). With this change, every GPU (or CPU) is described by a single point, as specified by the peak compute and memory throughput of the device. The application is represented as a half-line whose slope is determined by the compute to memory traffic ratio of the application. The same applications described in figure 1 as vertical lines are presented in figure 2 as half-lines, all passing through the origin with various slopes. Typical memory bound problems have a small slope, whereas compute bound problems have a high slope. The half-line splits the quadrant into two half-quadrants: device points residing in the upper one have more compute resources than memory traffic potential with respect to the application's requirements, whereas device points residing in the lower one have lower compute potential. Simply put, the half-line is the visual boundary separating the devices into two groups: the kernel is expected to behave as memory bound on the devices in the upper half-quadrant and as compute bound on the others. In this regard it is clear that a problem can be memory bound for some devices and compute bound for others. In other words, the limiting factor is a relative term which depends on both the problem and the device specifications.

Fig. 2. The quadrant-split model as applied to the 3D-FFT problem for 4 different CPUs/GPUs. The dashed lines lead to the anticipated performance points for the particular kernel-device combination.

For instance, regarding the 3D-FFT problem as depicted in figure 2, the Intel Xeon and both NVidia Tesla GPUs reside in the upper half-quadrant, which entails that the problem is memory bound with respect to these devices. In contrast, the problem is compute bound on the GTX-480 GPU, as it resides in the lower half-quadrant. To determine the performance of a memory bound problem we draw a vertical line from the device point straight down to the problem half-line (shown as a dashed line). Similarly, to determine the performance of a compute bound problem we draw a horizontal line from the device point to the left until the application half-line is crossed. The intersection marks the anticipated performance on that device.

In the quadrant-split model, multiple devices can naturally be plotted on a single graph. The roofline model is device-centric, as it is convenient for plotting multiple problems on a single device, whilst the quadrant-split model is application-centric, better depicting one application across many devices with different characteristics.

In this work we do not account for either latency bound or CPU-GPU transfer bound cases. We presume that the kernels under consideration are either compute or memory bound. In latency bound cases the kernel programmer should focus on eliminating the latency bottleneck. Additionally, all data are assumed to reside in GPU memory, so CPU-GPU transfer considerations are not applicable.

III. PERFORMANCE PREDICTION

Apart from designing a visual model, in this work we proceed to estimate the performance of GPU kernels following a quantitative approach. However, in order to get realistic results, three adjustments had to be applied to the theoretical peak device performance, each reducing it by a factor. The first adjusts both the compute and memory transfer peaks, whereas the other two adjust the compute peak under the particular kernel.
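To make the quadrant-split procedure concrete before detailing the adjustments, the following sketch performs the same classification and bound derivation as the geometric construction above. It is our own illustration, not the authors' tool; the function name and signature are hypothetical, and peak_gbps must use the same GB convention as the byte count.

    #include <cstdio>

    // Quadrant-split bound estimation (illustrative sketch).
    // Returns the estimated kernel time in seconds.
    double predict_seconds(double total_flops, double total_bytes,
                           double peak_gflops, double peak_gbps) {
        double intensity = total_flops / total_bytes;    // flops/byte (half-line slope)
        double mem_ceiling = peak_gbps * intensity;      // bandwidth-limited GFLOPS
        bool memory_bound = mem_ceiling < peak_gflops;
        double bound = memory_bound ? mem_ceiling : peak_gflops;
        std::printf("%s bound, attainable %.1f GFLOPS\n",
                    memory_bound ? "memory" : "compute", bound);
        return total_flops / (bound * 1e9);              // estimated execution time (s)
    }

For a memory bound kernel the estimate reduces to total bytes divided by peak bandwidth; for a compute bound kernel, to total flops divided by peak compute throughput.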

A. Measured compute and transfer peaks

The theoretical specifications set a good baseline for the performance that can be achieved on a device. However, this is not always adequately realistic; in some cases the measured performance is a fraction of the theoretical values. In order to estimate the practical peaks in both compute and memory transfer performance, we developed micro-benchmark kernels through which the real performance of the devices under investigation is evaluated.

The family of GPUs used for the experiments of this work comprises four NVidia GPUs (GTX-660, GTX-480, Tesla S2050 and Tesla K20c). The Teslas are professional GPUs targeted at the GPU computing sector [13], and the GTX (GeForce) GPUs are consumer oriented cards, primarily for gaming. The measured peak performance of these GPUs is depicted in table I. It is evident that the compute performance exposed by the micro-benchmark is very close to the theoretical one provided by the manufacturer. The GTX-660 even exceeds the theoretical performance in 64bit floating point operations: further investigation with profiling metrics showed that this card boosts the GPU frequency to about 1123MHz, instead of the 1033MHz base clock, when running purely double precision intensive code, which justifies our measurement. In contrast, the measured memory bandwidth does not come as close to the theoretical peak. In particular, the Tesla GPUs perform below 70% of their theoretical peaks due to the enabled ECC protection [14]. The measured values were later used for correcting the GPU performance points in the model. It should also be noted that the theoretical bandwidth figures given by vendors assume that 1GB = 10^9 bytes, which is inaccurate. We chose the 1GB = 1024^3 bytes convention, which is more consistent with the binary multiples of the byte; all measured values are reported using the latter convention.

B. Floating point operation mix efficiency

Another factor that has to be taken into account is the floating point operation mix. GPU vendors tend to quote the peak performance achieved using multiply-add operations. These operations fuse a multiplication and an addition into a single instruction (a × b + c), typically optimized to execute in just one shader cycle (for single precision). The theoretical peaks provided by vendors therefore assume a perfectly balanced stream of floating point multiplications and additions. If the stream of executed instructions is not perfectly balanced, the performance drops. For instance, a pure stream of addition instructions would reduce the floating point performance to a half, as addition instructions execute as fast as multiply-add instructions but perform just one operation instead of two. In order to evaluate the operation mix, we perform a profiling execution of the kernel with the NVidia profiler [18] and record the metrics flops_dp and flops_dp_fma: flops_dp is the total number of double precision floating point operations executed and flops_dp_fma is the count of multiply-add instructions executed. The efficiency of the flop mix is evaluated as follows:

E_{fp\,mix} = \frac{flops\_dp}{2 \times (flops\_dp - flops\_dp\_fma)}   (1)

where E_{fp\,mix} is the efficiency of the floating point operation mix, which ranges from 50% to 100% depending on the usage of multiply-add operations. The peak performance which should be taken into account is adjusted as follows:

P_{est(1)} = P_{measured} \times E_{fp\,mix}   (2)

where P_{measured} is the peak compute performance of the device as measured and P_{est(1)} is the adjusted peak performance under consideration of the floating point operation mix. At this stage we also evaluate the operation intensity I_{kernel} of the kernel, as described below:

I_{kernel} = \frac{flops\_dp}{32 \times dram\_trans}   (3)

where dram_trans is the sum of the dram_read_transactions and dram_write_transactions metrics; since each DRAM transaction moves 32 bytes, it expresses the kernel's DRAM traffic requirements. It is worth noting that the total amount of memory transactions does not strictly reflect the actual memory access needs of a kernel: in kernels with poor, irregular memory accesses, each access of a thread warp causes multiple memory transactions, which can lead to an order of magnitude higher DRAM traffic. In our experiments, instead of using flops_dp in (1) and (3) as the metric for executed 64bit flops, we applied the expression inst_fp_64 + flops_dp_fma, which captures more of the actual floating point operations than the regular ones (addition, multiplication, multiply-add) alone; it assumes that every floating point instruction performs one operation except the multiply-add, which performs two.

C. Beneficial instruction mix efficiency

Another factor that further lowers the peak floating point performance of a kernel is the density of floating point instructions in the executed instruction stream. The beneficial instructions in scientific problems are the floating point instructions, which perform the actual computations required by the algorithm. The rest of the instructions can be control flow, pointer arithmetic, etc. These operations consume valuable GPU resources and depress the peak floating point performance to significantly lower levels. In order to take these operations into account, we measure the inst_fp_32, inst_fp_64, inst_integer, inst_bit_convert, inst_control, inst_compute_ld_st, inst_misc and inst_inter_thread_communication metrics, which accumulate the instructions executed of each type. Two derived features are used: the density of 64bit floating point instructions (D_{fp64}) and the density of load/store instructions (D_{ldst}) in the whole executed instruction stream. These are derived as follows:

D_{fp64} = \frac{inst\_fp\_64}{inst\_total}, \quad D_{ldst} = \frac{inst\_ld\_st}{inst\_total}   (4)

where inst_total is the sum of all of the instruction count metrics listed above.
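As an illustration of this first stage (a sketch under our own naming; the struct and function names are hypothetical, while the counter names are the profiler metrics listed above), the kernel features can be computed from the recorded counters as follows:

    // Stage-one kernel features from profiler counters (illustrative sketch).
    struct ProfiledMetrics {                      // hypothetical container
        double flops_dp, flops_dp_fma;
        double dram_read_trans, dram_write_trans;
        double inst_fp_32, inst_fp_64, inst_integer, inst_bit_convert,
               inst_control, inst_compute_ld_st, inst_misc,
               inst_inter_thread_communication;
    };

    struct KernelFeatures {
        double e_fp_mix;    // eq. (1)
        double intensity;   // eq. (3), flops/byte
        double d_fp64;      // eq. (4)
        double d_ldst;      // eq. (4)
    };

    KernelFeatures extract(const ProfiledMetrics& m) {
        // The paper substitutes inst_fp_64 + flops_dp_fma for flops_dp in (1) and (3).
        double flops = m.inst_fp_64 + m.flops_dp_fma;
        double dram_trans = m.dram_read_trans + m.dram_write_trans;
        double inst_total = m.inst_fp_32 + m.inst_fp_64 + m.inst_integer +
                            m.inst_bit_convert + m.inst_control +
                            m.inst_compute_ld_st + m.inst_misc +
                            m.inst_inter_thread_communication;
        KernelFeatures k;
        k.e_fp_mix  = flops / (2.0 * (flops - m.flops_dp_fma));  // eq. (1)
        k.intensity = flops / (32.0 * dram_trans);               // eq. (3)
        k.d_fp64    = m.inst_fp_64 / inst_total;                 // eq. (4)
        k.d_ldst    = m.inst_compute_ld_st / inst_total;         // eq. (4)
        return k;
    }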

After these values are captured, we assign to each instruction category a weight corresponding to the reciprocal throughput of the target GPU for that instruction type, relative to the fastest one, which is typically the single precision multiply-add instruction. The weights for the GPUs used in this work are depicted in table II; the instruction costs used in the estimations are derived from NVidia's published information [17], [14], [15]. For example, Fermi multiprocessors contain 16 load/store units, which gives a 32/16 = 2 weight factor (SPs per load/store unit), and Kepler multiprocessors contain 32 such units per multiprocessor, which gives a 192/32 = 6 factor.

TABLE II. THE GPU WEIGHTED COSTS (RECIPROCAL THROUGHPUT FACTORS) AS USED IN THE MODEL

GPU            fp64   load/store
GTX-660        24     6
GTX-480        8      2
Tesla S2050    2      2
Tesla K20c     3      6

After the device properties have been determined, the floating point and load/store relative instruction costs (C_{fp64} and C_{ldst}) are given by formulae (5) and (6). The relative instruction cost of a particular type corresponds to the cost contribution of this type to the whole stream of executed instructions. The load/store instruction cost should not be confused with the load/store DRAM throughput: it is the throughput of the load/store units, which mostly depends on their number per multiprocessor. The cost of all other instructions is assumed to be 1 per instruction (formula (7)); although this assumption is inaccurate, it does not affect the estimation severely, as in optimized kernels the other instructions consist mostly of fast integer instructions. We account for the best case, since not all instructions in a category have the same throughput and the NVidia profiler does not provide more detailed information.

C_{fp64} = D_{fp64} \times W_{fp64}   (5)

C_{ldst} = D_{ldst} \times W_{ldst}   (6)

C_{other} = (1 - D_{fp64} - D_{ldst}) \times 1   (7)

where W_{fp64} and W_{ldst} are the weighted costs of 64bit floating point and load/store instructions, respectively (table II). The total computational cost is estimated by formula (8), the relative efficiency of floating point instructions in the whole instruction mix is given by formula (9), and the new adjusted peak performance is estimated with formula (10):

C_{total} = C_{other} + C_{fp64} + C_{ldst}   (8)

E_{inst\,mix} = \frac{C_{fp64}}{C_{total}}   (9)

P_{est(2)} = P_{est(1)} \times E_{inst\,mix}   (10)

TABLE I. PEAK COMPUTE AND MEMORY TRANSFER PERFORMANCE AS MEASURED WITH MICRO-BENCHMARKS

              Double precision (GFLOPS)     Single precision (GFLOPS)     Memory bandwidth (GB/sec)
GPU           Measured   Spec.    Pct.      Measured   Spec.    Pct.      Measured   Spec.   Pct.
GTX-660       89.52      82.6     108.3%    1950.01    1983.4   98.3%     110.66     144     76.8%
GTX-480       167.94     168.1    99.9%     1304.99    1345.0   97.0%     150.70     177     85.1%
Tesla S2050   509.13     514.0    99.1%     1008.11    1028.0   98.1%     102.66*    148     69.4%
Tesla K20c    1164.49    1174.0   99.2%     3114.93    3521.9   88.4%     141.99*    208     68.3%
* with ECC enabled
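Continuing the sketch from above (again with hypothetical names; extract() was defined earlier, the weights come from table II and the measured peak from table I), the second-stage adjustment could look as follows:

    struct DeviceWeights { double w_fp64, w_ldst; };  // table II, e.g. {24, 6} for GTX-660

    // Adjust the measured compute peak for a kernel's instruction mix, eqs. (5)-(10).
    double adjusted_peak_gflops(double p_measured_gflops,  // measured peak (table I)
                                const KernelFeatures& k,   // from extract() above
                                const DeviceWeights& w) {
        double p_est1    = p_measured_gflops * k.e_fp_mix;     // eq. (2)
        double c_fp64    = k.d_fp64 * w.w_fp64;                // eq. (5)
        double c_ldst    = k.d_ldst * w.w_ldst;                // eq. (6)
        double c_other   = (1.0 - k.d_fp64 - k.d_ldst) * 1.0;  // eq. (7)
        double c_total   = c_other + c_fp64 + c_ldst;          // eq. (8)
        double e_inst_mix = c_fp64 / c_total;                  // eq. (9)
        return p_est1 * e_inst_mix;                            // eq. (10): P_est(2)
    }

The returned P_{est(2)}, together with the measured bandwidth of table I, replaces the raw specification peaks in the predict_seconds() sketch of section II.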

D. Mixed micro-benchmarks

If a kernel is purely compute bound, its compute performance reaches close to the theoretical peak of the device, as the previous micro-benchmarks showed. In real applications, however, the instruction mix contains additional instructions such as load/stores, integer arithmetic for address computations, control instructions, etc. All of these reduce the attainable peak in productive instructions (i.e. flops). In order to experiment with the performance of a mixed kernel which features both computation and memory traffic, we developed a kernel containing artificial mixed operations with a configurable balance. This lets us investigate the behaviour of various GPUs on mixed instruction streams and the degree to which the compute and memory traffic peaks deteriorate. We took care to keep all other instructions to a minimum so that the extra overhead stays as low as possible: template variables were used where possible, including access strides and block size, and the loops were unrolled [16]. A significant workload was assigned to each thread in order to eliminate the impact of the initialization overhead. A simplified sketch of such a kernel is given below.

In figure 3 the compute throughput and the memory bandwidth are shown as the operational intensity of the kernel is increased, moving from a purely memory bound to a purely compute bound kernel; the kernel was executed on the GTX-480 GPU. In fact, the GFLOPS line of the chart is an experimental graphical representation of the roofline model as generated by the micro-benchmark, and the similarities with figure 1 are apparent.
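The sketch below is our own minimal reconstruction of such a mixed micro-benchmark, not the authors' code; it captures only the configurable flop/byte balance via a compile-time template parameter.

    // Minimal mixed micro-benchmark sketch (illustrative reconstruction).
    // FLOPS_PER_ELEM sets the balance between computation and memory traffic:
    // intensity is roughly FLOPS_PER_ELEM flops per 16 bytes of DRAM traffic.
    template<int FLOPS_PER_ELEM>
    __global__ void mixed_bench(const double* __restrict__ in,
                                double* __restrict__ out, int n) {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            double v = in[i];                        // one global read (8 bytes)
            double acc = v;
            #pragma unroll
            for (int f = 0; f < FLOPS_PER_ELEM / 2; ++f)
                acc = acc * v + 0.5;                 // typically contracted to one FMA = 2 flops
            out[i] = acc;                            // one global write (8 bytes)
        }
    }

    // Example launch: ~64 flops per 16 bytes, i.e. ~4 flops/byte.
    // mixed_bench<64><<<512, 256>>>(d_in, d_out, n);

A production version would also use several independent accumulators to expose instruction level parallelism, since the serial dependency chain above would otherwise limit the attainable compute throughput.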

Fig. 3. Attained compute performance (64bit floating point) and memory traffic bandwidth for the mixed micro-benchmark on the GTX-480.

Fig. 5. Attained compute performance in relation with the attained memory traffic on the Tesla S2050.

Fig. 4. Attained compute performance in relation with the attained memory traffic on the GTX-480. The X mark corresponds to the theoretical peak.

Fig. 6. Attained compute performance in relation with the attained memory traffic on the Tesla K20c.

The same results are illustrated in figure 4, which relates the compute throughput to the memory bandwidth. The generated graph is almost ideal, as the lines formed are straight and persistently bounded by fixed limits. Essentially, it is the quadrant-split visual model chart, in which the corner formed by the two edges in the middle represents the effective device performance point. On the Tesla GPUs the same graph appears with more disturbances (figures 5 and 6), as it is not as flat as the previous one. This is partially justified by the different throughput ratio of double precision operations relative to other instructions: as the Teslas' throughput of 64bit floating point instructions is close to their integer throughput, the latter contributes significant additional overhead. On the consumer GPUs, where integer instructions are much faster than the 64bit floating point ones, their impact is less important and the additional overhead is smaller. The fluctuations evident in the memory bound samples of the graph (more apparent in the K20c case) are caused by the fact that, as the operational intensity of this kernel is increased by one step, the number of memory accesses in its main loop changes by either one read or one write operation, and write operations seem to allow higher bandwidth in this case.

IV. EXPERIMENTAL RESULTS

In order to demonstrate the usefulness of the described adjustments, we performed experiments with 6 kernels of different types: the vector operation DAXPY, DGEMM (matrix multiplication), FFT, and 3 variations of the LMSOR stencil computation we had developed in a previous work [4], [3]. All experiments were conducted with double precision arithmetic. For the DGEMM and FFT problems the implementations of the NVidia libraries were used (CUBLAS and CUFFT); for the rest, custom kernels were used.

The DAXPY kernel is the addition of a scalar multiple of a vector to another vector. It is clearly a memory bound kernel on all recent architectures. Each vector consisted of 48M double precision elements (384MB of global memory per vector). We applied a 2048x2048 matrix multiplication in CUBLAS and a 32 × 1024^2 element vector in CUFFT. The matrix multiplication is clearly a compute bound problem, even for GPUs. The CUBLAS and CUFFT libraries provide kernels optimized for each architecture, so a different kernel is chosen per GPU architecture; we therefore used the appropriate kernel data for each GPU. The LMSOR kernel is a red/black stencil computation with the memory reordering by color optimization applied [10], [9].

In addition, 3 variations of the kernel had been developed in a previous work, in which the recomputation strategy was applied [4], [3]. The motivation was the significantly increased computation capability of GPUs compared to their memory access potential, a gap which is expected to widen. Recomputation aims to reduce memory accesses at the expense of extra computations, thus affecting the flops per byte ratio. The variations are the following:
1) Kernel #1 - No recomputations: 7 read accesses per computed element.
2) Kernel #2 - Minor recomputations: 5 read accesses per computed element (4 read accesses were replaced by 2 read accesses plus 4 recomputations).
3) Kernel #3 - Aggressive recomputations: 4 read accesses per computed element (an additional read access was replaced by one intensive recomputation).

We performed the stencil computation experiment on a 3842 × 3842 mesh for a total of 26 iterations. All experiments were conducted in a 64bit Linux environment with CUDA versions ranging from 5.0 to 6.5, as multiple systems with different installations were used.

First we profiled all applications in order to extract the metric values required for the adjustments. These values were used to adjust the compute peak derived from the micro-benchmarks to the peak we expect under the investigated problem. All results are depicted in table III.

In figure 7 we illustrate the adjustment performed on the peak values of the Tesla S2050 under the LMSOR kernel #3 problem. First, the theoretical point is moved to the left, to the peak effective bandwidth measured with the micro-benchmarks (102.66 GB/sec). The vertical line corresponds to the compute throughput reduction due to the operation mix efficiency and instruction mix efficiency adjustments (72.22% and 72.91% respectively, 52.66% overall). The adjusted peak compute throughput thus drops to 268.1 GFLOPS. Finally, from the last point we cross the application line by moving horizontally towards the vertical axis; a worked retracing of these steps is given below. Notably, the first adjustment moved the point into the memory bound region by crossing the application line, and the last adjustment moved it back into the compute bound region; therefore the expected behaviour of the problem is compute bound. The "X" point represents the actual performance as measured on this GPU.

In figure 8 the performance estimates of the 4 GPUs are depicted for the FFT kernel after the adjustments were applied. The two Teslas appear memory bound, whereas the GeForce GPUs are compute bound.
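As a cross-check of the figure 7 walk-through, the quoted numbers can be retraced from tables I and III (our arithmetic; the 72.91% instruction mix efficiency is the value quoted above, and the time estimate follows the model's compute bound formula):

    P_{est(2)} = 509.13 \times 0.7222 \times 0.7291 \approx 268.1\ \text{GFLOPS}

    B \times I_{kernel} = 102.66\ \text{GB/sec} \times 1024^3\ \text{bytes/GB} \times 4.193\ \text{flops/byte} \approx 462\ \text{GFLOPS}

    t_{est} = \frac{33{,}892{,}699{,}776\ \text{flops}}{268.1 \times 10^9\ \text{flops/sec}} \approx 0.126\ \text{sec}

Since B × I_{kernel} exceeds P_{est(2)}, the kernel is designated compute bound on the S2050, in agreement with table IV.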

Fig. 7. Applied adjustments of peak performance values for the Tesla S2050 on the LMSOR problem kernel #3 and the performance estimation in relation to the actual performance point (”X” mark).

Fig. 8. Visual performance estimation on the FFT problem.

After performing the series of experimental executions, we compared the measured with the estimated execution times on all kernels. The results are depicted in figure 9. The memory limited DAXPY exhibits a very good prediction, with an average absolute difference of 3.76%. The CUDA library kernels DGEMM and FFT also exhibit good predictions, with absolute differences of 10.37% and 9.50%, respectively. LMSOR kernels #1 and #3 exhibit absolute differences of 8.88% and 15.32%, respectively. After disassembling the last kernel, we identified a high number of binary shift instructions, which are not as fast as integer addition instructions; the properly adjusted compute throughput is therefore actually lower, which is crucial for a compute limited kernel.

In some cases the predicted times are larger than the measured times. This could be caused by the fact that instruction executions of different types can actually overlap. For instance, Kepler based SMs contain 4 schedulers, each able to issue 2 instructions per cycle [17]; it is thus possible to issue 6 instructions on the 6 ALU units and 2 on the 2 load/store units in one cycle. The same could be true for the Fermi GeForce GPUs when executing double precision floating point instructions, though this is not documented and needs to be investigated further. The near-ideal shape of figures 3 and 4 also supports this argument.

The DAXPY kernel was identified as memory bound and the DGEMM as compute bound on all GPUs, as expected. The FFT kernel was identified as memory bound on the Tesla GPUs and as compute bound on the GeForce GPUs. All designations are depicted in table IV.

TABLE III. KERNEL FEATURES AS FORMED FOR THE INVESTIGATED KERNELS

                                              Balance        FP operation mix    Instruction percentages
Kernel                      Total flops       (flops/byte)   efficiency (50-100%)  fp64     load/store  other
DAXPY                       100,663,296       0.083          100.00%               10.53%   15.79%      73.68%
DGEMM (Fermi)               17,188,519,936    9.969          100.00%               65.58%   20.53%      13.89%
DGEMM (Kepler, GTX660)      17,188,519,936    4.822          100.00%               52.85%   18.18%      28.97%
DGEMM (Kepler, K20c)        17,184,129,024    22.703         99.98%                77.46%   12.38%      10.15%
FFT (Fermi)                 2,283,798,528     1.038          65.37%                41.32%   19.20%      39.48%
FFT (Kepler, GTX660)        2,279,604,224     1.044          65.40%                38.56%   15.78%      45.66%
FFT (Kepler, K20c)          2,283,798,528     0.866          65.37%                45.25%   17.82%      36.94%
LMSOR (kernel #1)           2,976,335,232     0.238          73.75%                16.07%   18.48%      65.45%
LMSOR (kernel #2)           4,509,877,632     0.482          63.49%                27.45%   14.97%      57.58%
LMSOR (kernel #3)           33,892,699,776    4.193          72.22%                58.76%   2.41%       38.83%

Fig. 9. Comparison of the estimated execution times with the actual measured execution times and the respective difference percentage (brackets imply a negative error percentage).

Overall, the experiments exhibited an average difference of ≈10.14% between the estimated and measured performance, which can be considered an adequate approximation. In the worst case, however, the performance of the K20c was predicted with an absolute error of ≈25.8%. In order to make more accurate predictions, we would have to take into account other factors that complicate the analysis: serialization factors such as low occupancy or limited ILP (instruction level parallelism, which is more important on Kepler architectures), shared memory bank conflicts, latency bound issues, saturation of atomic operations, etc. For the sake of simplicity, and due to the lack of more detailed profiling information, we left these performance factors out of the scope of this work.

V. RELATED WORK

As GPUs gained the interest of the scientific community, many researchers began to focus on building performance models for them [21], [6].

TABLE IV. THE DESIGNATION OF THE LIMITING FACTOR OF EACH KERNEL PER GPU AS DERIVED FROM THE MODEL

Kernel              GTX660    GTX480    Tesla S2050   Tesla K20c
DAXPY               Memory    Memory    Memory        Memory
DGEMM               Compute   Compute   Compute       Compute
FFT                 Compute   Compute   Memory        Memory
LMSOR (ker. #1)     Memory    Memory    Memory        Memory
LMSOR (ker. #2)     Compute   Compute   Memory        Memory
LMSOR (ker. #3)     Compute   Compute   Compute       Memory

In [23] the authors create a performance model based on low level GPU components, such as the pipeline, shared memory and global memory, using the native instruction set of the now rather old GeForce 200 architecture. In [7] Karami et al. present a regression model with which they predict the execution time of OpenCL kernels. In [5] Goswami et al. use statistical methods, namely PCA (Principal Component Analysis) and cluster analysis, to characterize various CUDA kernels. Baghsorkhi et al. used an analytical approach to performance prediction of GPU kernels and validated their model with matrix multiplication and FFT kernels [1]. Sim et al. proposed a full framework for performing performance predictions and guiding the programmer towards beneficial optimizations in order to improve performance [19].

Another performance prediction model was proposed by Kothapalli et al. [11]; they took into account various special GPU characteristics and experimented with matrix multiplication, list ranking and histogram generation. In [2] the authors built the Grophecy++ framework, which they use to predict speedups of GPU kernels with a particular focus on the CPU-GPU data transfer cost. A theoretical model for describing the performance of GPUs was proposed by Ma et al. [12]; this model, called Threaded Many-core Memory (TMM), is regarded as an improvement of the PRAM model. Volkov and Demmel developed a variety of micro-benchmarks for GPUs in their work on dense linear algebra implementations such as GEMM, SYRK and matrix factorizations [20]; these include benchmarks of kernel launch overheads, CPU-GPU data transfers, the GPU memory subsystem and pipeline latencies, memory bandwidth and compute throughput.

In contrast, this work focuses on the performance prediction of developed GPU kernels on GPU architectures as an automated procedure. Knowledge of the internal design of the kernel is not a requirement, as all essential parameter values are acquired by the profiling procedure.

VI. CONCLUSIONS AND FUTURE WORK

In this work we describe a practical performance prediction method for GPU kernels, in conjunction with the quadrant-split model as a visual representation. We employ data supplied by executing the subject kernel under profiling in order to capture a set of metrics, from which a set of kernel features is constructed. These features are later used to predict the kernel's performance on other GPUs. Our model provides a practical method to predict the performance of a kernel on other GPUs without requiring any knowledge of the kernel design itself. The procedure can be easily automated and can thus provide useful insight without significant effort.

In our experiments we obtained adequate results. The exact performance, however, depends on details of the instruction mix, pipeline latencies, the available parallelism and other serialization factors which are difficult to extract. Nevertheless, the exhibited estimates are significantly closer to the actual measurements due to the peak adjustments.

As future work we consider the inclusion of shared memory bank conflicts and other serialization factors, e.g. reduced occupancy or atomic operations. Cache effects could also be considered, as they potentially differentiate the total amount of DRAM accesses between various types of GPUs. A further goal could be the separation of the GPU performance characteristics from the applied kernel, as in this work the FLOPS adjustment is dependent on the kernel's features.

ACKNOWLEDGMENTS

We would like to acknowledge the kind permission of the Innovative Computing Laboratory at the University of Tennessee and of Dr. Dimitris Gizopoulos at the University of Athens to use their NVidia Tesla S2050 and NVidia Tesla K20c installations, respectively, for the purpose of this work. This research was partially funded by the University of Athens Special Account of Research Grants no 10812.

REFERENCES

[1] S.S. Baghsorkhi, M. Delahaye, S.J. Patel, W.D. Gropp and W.W. Hwu, "An Adaptive Performance Modeling Tool for GPU Architectures," SIGPLAN Not., Vol. 45, No. 5, pp. 105-114, ACM, May 2010.
[2] M. Boyer, J. Meng and K. Kumaran, "Improving GPU Performance Prediction with Data Transfer Modeling," 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pp. 1097-1106, 2013.
[3] Y. Cotronis, E. Konstantinidis, M.A. Louka and N.M. Missirlis, "A comparison of CPU and GPU implementations for solving the Convection Diffusion equation using the local Modified SOR method," Parallel Computing, Vol. 40, No. 7, pp. 173-185, 2014.
[4] Y. Cotronis, E. Konstantinidis and N.M. Missirlis, "A GPU Implementation for Solving the Convection Diffusion Equation Using the Local Modified SOR Method," Numerical Computations with GPUs, Springer, pp. 207-221, 2014.
[5] N. Goswami, R. Shankar, M. Joshi and T. Li, "Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications," 2010 IEEE International Symposium on Workload Characterization (IISWC), pp. 1-10, Dec. 2010.
[6] S. Hong and H. Kim, "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness," SIGARCH Comput. Archit. News, Vol. 37, No. 3, pp. 152-163, 2009.
[7] A. Karami, S.A. Mirsoleimani and F. Khunjush, "A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs," 2013 17th CSI International Symposium on Computer Architecture and Digital Systems (CADS), pp. 15-22, Oct. 2013.
[8] Khronos Group, "The OpenCL Specification," 2009.
[9] E. Konstantinidis and Y. Cotronis, "Accelerating the red/black SOR method using GPUs with CUDA," 9th International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science, Part I, Vol. 7203, Torun, pp. 589-598, 2012.
[10] E. Konstantinidis and Y. Cotronis, "Graphics processing unit acceleration of the red/black SOR method," Concurrency and Computation: Practice and Experience, Vol. 25, No. 8, pp. 1107-1120, 2013.
[11] K. Kothapalli, R. Mukherjee, M.S. Rehman, S. Patidar, P.J. Narayanan and K. Srinathan, "A performance prediction model for the CUDA GPGPU platform," 2009 International Conference on High Performance Computing (HiPC), pp. 463-472, Dec. 2009.
[12] L. Ma, K. Agrawal and R.D. Chamberlain, "A memory access model for highly-threaded many-core architectures," Future Generation Computer Systems, Vol. 30, pp. 202-215, Jan. 2014.
[13] NVidia, "Tesla S2050 GPU Computing System," 2010.
[14] NVidia, "Tuning CUDA Applications for Fermi," 2011.
[15] NVidia, "NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110," 2012.
[16] NVidia, "NVidia CUDA C Best Practices Guide Version 6.0," DG-05603-001 v6.0, 2014.
[17] NVidia, "NVidia CUDA C Programming Guide v6.0, Design Guide," PG-02829-001 v6.0, 2014.
[18] NVidia, "Profiler User's Guide," DU-05982-001 v6.0, 2014.
[19] J. Sim, A. Dasgupta, H. Kim and R. Vuduc, "A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications," Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '12), pp. 11-22, ACM, New York, 2012.
[20] V. Volkov and J.W. Demmel, "Benchmarking GPUs to tune dense linear algebra," Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08), Article 31, pp. 1-11, IEEE Press, Piscataway, NJ, 2008.
[21] L. Weiguo, W. Muller-Wittig and B. Schmidt, "Performance Predictions for General-Purpose Computation on GPUs," 2007 International Conference on Parallel Processing (ICPP), p. 50, Sept. 2007.
[22] S. Williams, A. Waterman and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, Vol. 52, No. 4, pp. 65-76, 2009.
[23] Y. Zhang and J.D. Owens, "A quantitative performance analysis model for GPU architectures," 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), pp. 382-393, Feb. 2011.
