SPRAT: Runtime Processor Selection for Energy-aware Computing

Hiroyuki Takizawa #1, Katuto Sato #2, and Hiroaki Kobayashi *3

# Graduate School of Information Sciences, Tohoku University
6-3 Aramaki-aza-aoba, Aoba, Sendai 980-8578, Japan
1 [email protected]
2 [email protected]

* Cyberscience Center, Tohoku University
6-3 Aramaki-aza-aoba, Aoba, Sendai 980-8578, Japan
3 [email protected]

Abstract—A commodity personal computer (PC) can be seen as a hybrid computing system equipped with two different kinds of processors: a CPU and a graphics processing unit (GPU). Since the superiority of GPUs in performance and power efficiency strongly depends on the system configuration and on data sizes determined at runtime, a programmer cannot always know which processor should be used to execute a given kernel. Therefore, this paper presents a runtime environment that dynamically selects an appropriate processor so as to improve energy efficiency. The evaluation results clearly indicate that runtime processor selection, performed when executing each kernel with its given data streams, is promising for energy-aware computing on a hybrid computing system.

Index Terms—GPU computing, automatic performance tuning, energy-aware computing

I. INTRODUCTION

For several reasons, high-performance computing (HPC) systems have become heterogeneous at various levels. For example, Grid computing[1] innately assumes that heterogeneous computing resources are aggregated to organize a virtual computing system; the latest supercomputers such as TSUBAME[2] and RoadRunner[3] are equipped with additional co-processors for data-parallel computations; and heterogeneous multicore processors such as the Cell Broadband Engine[4] are also becoming commonplace. In addition to those examples, the research community of general-purpose computation on graphics processing units (GPGPU), or GPU computing, has demonstrated that even a current PC can be seen as a hybrid computing system equipped with two different kinds of processors, i.e. a CPU and a graphics processing unit (GPU)[5]. As such a PC is the most widely-available hybrid computing system, this paper focuses on the effective use of CPUs and GPUs for scientific computations.

In spite of these trends in HPC system design, software development for hybrid computing systems is difficult, because a programmer must properly manage different kinds of processors, each of which has its own strengths and weaknesses. Due to its application-specific architecture, the GPU in particular is highly selective about computation tasks; it works well only for tasks with massive data parallelism and without complicated control flows.

Even if a task is obviously preferable for GPUs, a programmer still cannot be sure that a certain GPU in one PC can execute the task faster than its CPU, because the difference in sustained performance between a CPU and a GPU depends on the individual task and is often determined only at runtime. As a result, the difference in energy efficiency between them also changes drastically at runtime. Therefore, appropriate processor selection at runtime is needed to achieve energy-aware computing on a hybrid computing system of a CPU and a GPU.

So far, many approaches have been proposed to abstract the hardware configuration and thereby encapsulate the complexities of GPGPU programming [6], [7], [8], [9]. However, they do not consider runtime processor selection for executing a given task in terms of energy efficiency.

The goal of this work is to establish a programming framework that can automatically select the best available processor for executing a given task among the various processors of a hybrid computing system. To this end, we have recently proposed a programming framework named stream programming with runtime auto-tuning (SPRAT) that can dynamically switch the processor to minimize the execution time[10]. The main contribution of this paper is to investigate the potential of SPRAT to maximize energy efficiency. The energy saving achieved by SPRAT's runtime processor selection is discussed through experimental results.

The rest of this paper is organized as follows. Section 2 describes related work: we briefly review some popular programming languages and tools for developing GPGPU applications, and then discuss empirical studies that model the performance of GPGPU applications. In Section 3, we propose a programming framework that can dynamically select an appropriate processor for executing a program so as to maximize energy efficiency. Section 4 shows the evaluation results and discusses the trade-off between energy saving and computational performance. Finally, Section 5 gives some concluding remarks and our future work.

II. RELATED WORK

A. GPGPU Programming Tools

GPGPU applications generally prefer massive SIMD data parallelism and are often mapped to stream processing, which is modeled as a computation-intensive kernel processing a long data stream. Stream processing can hide the memory access latency by making memory accesses highly predictable and by overlapping data fetches with computations. The importance of such overlapping keeps growing due to the so-called memory wall problem[11]. Stream processing is promising not only for GPUs but also for many other processors, such as general-purpose processors[12] and heterogeneous multicore processors[13]; it will be a key technology for achieving high performance on current and future computing systems. The GPGPU community has demonstrated that GPUs can be seen as general-purpose stream processors[5]. So far, several high-level programming languages have been proposed to mitigate the programming effort required to use the GPU's computing power for stream processing applications[6], [7], [8], [9].

BrookGPU, the first abstraction of GPUs for GPGPU programmers, is a popular programming tool that extends the standard C programming language to explicitly describe stream processing applications. BrookGPU provides a high-level programming language and runtime backends to facilitate the development of GPGPU applications. Each runtime backend corresponds to a runtime environment supported by BrookGPU: CPU, OpenGL, and DirectX9. A stream processing application is executed on one processor, either the CPU or the GPU, which is specified by the runtime backend, referred to as a computing engine. Using the Brook language, a programmer can explicitly write a kernel code. The Brook compiler is a source-to-source compiler that translates a Brook code into standard C++ code. During the translation, a kernel code is translated into multiple codes, each corresponding to one runtime backend. When the executable file is launched, it first checks the environment variable BRT_RUNTIME to decide the runtime backend. Consequently, all the kernels in an application program are executed using a single computing engine.

CUDA[9] is the latest programming language presented by NVIDIA, and incorporates some additional keywords into the standard C programming language for GPU computing. As CUDA allows a programmer to access GPUs via low-level APIs, it can handle the GPU hardware without the tricky programming techniques that are required when writing GPGPU applications with graphics APIs such as OpenGL and DirectX. In addition, it significantly reduces the runtime overheads of GPGPU applications that had been considered unavoidable due to the use of graphics APIs. As a result, CUDA has become the most popular programming language for GPU computing.

As higher-level programming tools, RapidMind[7] and PeakStream[8] are both commercial software products that offer C++ class libraries with high-performance runtime backends to describe and execute multi-platform programs in a stream processing manner.
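To make the static selection described above concrete, the following is a minimal Brook-style saxpy kernel; the kernel syntax follows the Brook language, while the exact identifier accepted by BRT_RUNTIME is an assumption here rather than something stated in this paper.

    // Minimal Brook-style kernel: streams are declared with the <> syntax.
    kernel void saxpy(float a, float x<>, float y<>, out float result<>) {
        result = a * x + y;
    }

The computing engine is then fixed for the whole run before launch, for example:

    export BRT_RUNTIME=cpu    # assumed value; the paper names CPU, OpenGL, and DirectX9 backends
    ./saxpy_app

Every kernel in the program is executed by whichever single backend this variable names, which is exactly the static, whole-program selection that SPRAT later replaces with a per-kernel runtime decision.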

These tools can significantly mitigate the programming effort required to use the GPU as a data-parallel co-processor. However, they all assume that a programmer or an application user statically determines the computing engine of every kernel in advance of execution, even though appropriate processor selection clearly depends on runtime behaviors and system configurations.

B. Performance Modeling

There are numerous earlier studies on performance evaluation and modeling of GPGPU applications. Since GPU architectures are not fully disclosed, GPU performance for various applications has been examined experimentally[14]. Trancoso et al. have reported a comprehensive experimental study comparing the performance of a CPU and a GPU[15]. They investigated the execution times of BrookGPU kernels while changing kernel parameters such as the computation intensity, the data size, and the data format. Their results clearly indicate that the GPU's performance superiority depends on several parameters determined at runtime. Govindaraju et al. investigated the details of the GPU memory hierarchy. Their memory model shows that the GPU's memory bandwidth depends on memory access patterns, and that effective use of the two-dimensional cache memory can maximize GPU performance[16]. Harrison et al. have assessed the execution time required for data transfers between the main memory and the video memory[17]. As the time for data transfers often becomes dominant, especially for kernels of low arithmetic intensity, performance evaluation of data transfers using different APIs is also important for predicting the total execution time of a GPGPU application. These experimental studies show that performance prediction of GPGPU applications must consider a vast variety of implementation techniques.

Buck et al. have presented BrookGPU with a simple performance model to analyze the performance of a CPU and a GPU[6]:

    T_G = n(T_R + K_G),    (1)
    T_C = nK_C,            (2)

where T_G and T_C are the execution times on the GPU and the CPU respectively, T_R is the time for downloading and reading back a single stream element, K_G and K_C are the times required to execute the kernel on a single element, and n is the number of elements in a data stream. It is obvious that the GPU will outperform the CPU only when

    T_R < K_C − K_G.    (3)

This means that the GPU can outperform the CPU only if the performance gain from using the GPU exceeds the per-element data transfer overhead. For example, if K_C = 10 us, K_G = 1 us, and T_R = 4 us per element, inequality (3) holds and the GPU processes each element in 5 us instead of 10 us; if the transfer cost rises above 9 us, the CPU becomes the better choice.

Ito et al. have proposed a model to estimate the execution time of a GPGPU application[18]. They assume that GPU performance is always limited by the memory bandwidth, and hence that the execution time of a kernel is proportional to the size of the data transferred between the GPU cores and the video memory. However, the actual memory bandwidth obviously depends on the memory access patterns. Therefore, He et al. have modeled sequential access performance and random access performance separately[19].

Recently, energy-aware computing has become more and more important, not only in mobile systems but also in HPC systems. Göddeke et al. have reported that using even low-end, out-of-date GPUs improves both performance-related and power-related metrics of a GPU-accelerated cluster system for FEM calculations[20]. However, although the use of GPUs generally increases the power consumption, it does not always increase the performance. As a result, it may increase the energy consumption of applications that cannot be executed efficiently by GPUs. Accordingly, runtime performance prediction is mandatory to achieve appropriate processor selection in terms of energy efficiency; the GPU should be used only if the performance gain is large enough to compensate for the additional power consumption.
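The decision rule this implies can be sketched in a few lines of C. This is an illustrative sketch only, not the selection logic of SPRAT itself: the power values p_c and p_g and the per-element model parameters are assumptions here, and in practice they would have to be measured or calibrated at runtime.

    #include <stdbool.h>
    #include <stddef.h>

    /* Per-element model parameters, following Eqs. (1)-(2). */
    typedef struct {
        double t_r;  /* transfer time per element (s)                         */
        double k_g;  /* GPU kernel time per element (s)                       */
        double k_c;  /* CPU kernel time per element (s)                       */
        double p_g;  /* assumed average system power when using the GPU (W)   */
        double p_c;  /* assumed average system power when using the CPU (W)   */
    } model_t;

    /* Choose the GPU only if its predicted energy consumption is lower. */
    static bool use_gpu(const model_t *m, size_t n)
    {
        double t_gpu = n * (m->t_r + m->k_g);    /* Eq. (1) */
        double t_cpu = n * m->k_c;               /* Eq. (2) */
        return m->p_g * t_gpu < m->p_c * t_cpu;  /* energy = power * time */
    }

Note that with p_g = p_c this degenerates to the pure performance criterion of inequality (3); the energy-aware criterion differs precisely when the GPU is faster but draws disproportionately more power.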

kernel map saxpy(float a, in stream<float> x,
                 in stream<float> y, out stream<float> z) {
  z = a * x + y;
}

int main(int argc, char** argv) {
  stream<float> sX(N,M), sY(N,M), sZ(N,M);
  float x[N*M], y[N*M], z[N*M], pi = 3.14f;
  init_array(x, y);
  streamRead(sX, x);
  streamRead(sY, y);
  saxpy(pi, sX, sY, sZ);
  streamWrite(sZ, z);
  print_array(z);
  return 0;
}

Fig. 1. Sample code of the SPRAT language.

III. A PROGRAMMING FRAMEWORK WITH RUNTIME AUTO-TUNING

In this paper, a programming framework named stream programming with runtime auto-tuning (SPRAT)[10] is extended to consider the energy consumption when dynamically selecting an appropriate computing engine. SPRAT combines a high-level programming language with runtime performance prediction. As with BrookGPU[6], the SPRAT compiler translates a SPRAT code into multiple codes, each of which corresponds to one computing engine. Based on runtime behaviors that are not available to the programmer or the compiler, SPRAT can dynamically switch the computing engine for executing each kernel so as to minimize the energy consumption; the GPU is used as the computing engine only if its acceleration is effective enough to justify its power consumption.

A. SPRAT Language

The SPRAT language is an extension of the standard C language incorporating some special keywords for describing stream processing tasks. In a stream processing task, a stream, which is a collection of data, is processed by a kernel. A stream is declared with the stream keyword and angle-bracket syntax. A kernel function, which operates on individual stream elements, is specified by the kernel qualifier. The syntax of the SPRAT language itself is similar to that of conventional stream programming languages such as BrookGPU[6].

Figure 1 shows a sample code of the SPRAT language. In the code, a kernel function saxpy is called with a scalar value pi, two input streams sX and sY, and an output stream sZ. Here, each of sX, sY, and sZ contains the same number of stream elements. Stream elements at the same position in different streams correspond to each other. Every element of sX is multiplied by pi and then added to its corresponding element of sY. The calculation result is written to the corresponding element of sZ. Since a stream element is accessible only within a kernel function, elements of arrays x and y are copied into sX and sY using the function streamRead. Similarly, elements of sZ are copied into z using streamWrite.

Fig. 2. SPRAT language translation.

Figure 2 illustrates how a SPRAT code is converted into an executable file. A SPRAT code written by a programmer is first translated into multiple codes: a standard C++ code for the CPU and a CUDA code for the GPU. Then, a programmer can manually optimize the automatically generated codes if needed. Finally, those codes are compiled and linked with the SPRAT runtime library to generate an executable file. The details of streams and kernel functions are described as follows.

1) Streams: In the SPRAT language, a variable declared with the stream keyword is used as a container of stream data. The number and the type of the stream elements are specified when the stream variable is declared.

Since stream elements are directly accessible only within a kernel function, built-in functions are provided to manage stream data. For example, streamRead and streamWrite are used for data transfers between a standard C array and a stream. The former copies each array element to its corresponding stream element; the latter copies each stream element to its corresponding array element. Moreover, the SPRAT language provides built-in functions for frequently-used operations, such as streamMax for finding the maximum element in a stream.

There are four kinds of qualifiers to specify the access attributes of a stream: in, out, inout, and gather. A stream specified by the in keyword permits sequential read-only accesses. A stream with the out keyword permits sequential write-only accesses. A stream specified by inout is both readable and writable. An element in a gather stream can be read using the array index operator mentioned later. In addition, a stream reference to a part of the stream elements can be declared by

    stream& ref = strm[i][j](w,h);

Here, a stream reference ref represents the domain of a 2-dimensional stream strm whose upper-left corner position and size are specified by [i][j] and (w,h), respectively. Note that ref and strm share the same memory area of w×h stream elements.

2) Kernels: A kernel, which operates on each stream element, is described by a special function specified by the kernel keyword. The processor executing a kernel function, i.e. the computing engine, is dynamically selected by the SPRAT runtime environment. A programmer cannot rely on any particular processing order of the stream elements; they may be processed independently in parallel.

The attribute of a kernel function is specified by either map or reduce. The former denotes a kernel function that processes input streams and generates the output stream elements. The latter reduces an input stream to a smaller output stream or a scalar value. A kernel function with the map qualifier takes one or more output streams specified by the out or inout keyword. The kernel execution is a data-parallel task that logically computes all the output stream elements. For example, the saxpy function in Figure 1 is implicitly translated by the SPRAT compiler into multiple codes, each equivalent to the following loop nest:

    // CPU-code equivalent of the saxpy call (loop body completed from context)
    for(int i = 0; i < N; i++)
      for(int j = 0; j < M; j++)
        sZ[i][j] = pi * sX[i][j] + sY[i][j];
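The excerpt above gives no example of a reduce kernel, so the following sum kernel is a plausible sketch only: it assumes the reduce syntax mirrors the map syntax of Figure 1, with the accumulation style borrowed from BrookGPU-like reductions, and the inout scalar result is an assumption rather than documented SPRAT syntax.

    // Hypothetical reduce kernel: folds an input stream into one scalar.
    kernel reduce sum(in stream<float> x, inout float result) {
      result = result + x;
    }

As with map kernels, the runtime would be free to evaluate such a reduction on either computing engine, since the element-wise accumulation imposes no particular processing order.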
