Enabling Task-level Scheduling on Heterogeneous Platforms∗

Enqiang Sun 1, Dana Schaa 1, Richard Bagley 2, Norman Rubin 2, and David Kaeli 1

1 Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
{esun, dschaa, kaeli}@ece.neu.edu

2 Graphics Product Group, Advanced Micro Devices, Boxborough, MA, USA
{Richard.Bagley, Norman.Rubin}@amd.com

ABSTRACT

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. While OpenCL has been widely adopted, there still remains a lack of support for automatic task scheduling and data consistency when multiple devices appear in the system. To address this need, we have designed a task queueing extension for OpenCL that provides a high-level, unified execution model tightly coupled with a resource management facility. The main motivation for developing this extension is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and to incorporate flexible scheduling schemes. To demonstrate the value and utility of this extension, we have utilized an advanced OpenCL-based imaging toolkit called clSURF. Using our task queueing extension, we demonstrate the potential performance opportunities and limitations given current vendor implementations of OpenCL. Using a state-of-the-art implementation on a single GPU device as the baseline, our task queueing extension achieves a speedup of up to 72.4%. Our extension also achieves scalable performance gains on multiple heterogeneous GPU devices. The performance trade-offs of using the host CPU as an accelerator are also evaluated.

Keywords

Heterogeneous computing, Task scheduling, OpenCL, Parallel programming

1. INTRODUCTION

Heterogeneous platforms such as AMD’s Fusion [10] and Intel’s Sandy Bridge [5] augment the superscalar power of mainstream CPUs with on-chip, high-throughput GPUs. Although GPUs were historically used almost exclusively for graphics, they have now also emerged as the platform of choice for accelerating data-parallel applications. Unlike the ubiquity of the x86 architecture and the long lifecycle of CPU designs, GPUs often have much shorter release cycles and ever-changing ISAs and hardware features. As such, the need has arisen for a programming interface that allows a single, general-purpose (i.e., non-graphics) program to be portable to a wide range of accelerator hardware.

1.1 Heterogeneous Computing with OpenCL

The emerging software framework for programming heterogeneous devices is the Open Computing Language (OpenCL) [8]. OpenCL is an open industry standard managed by the non-profit technology consortium Khronos Group, and support for OpenCL has been growing among major companies such as Apple, AMD, NVIDIA, Intel, Imagination, and S3.

Categories and Subject Descriptors

C.1.3 [Processor Architectures]: Other Architecture Styles – Heterogeneous (hybrid) systems

∗The work in this paper began while Enqiang Sun and Dana Schaa were interns at Advanced Micro Devices.


The aim of OpenCL is to become a universal language for programming heterogeneous platforms such as GPUs, CPUs, DSPs, and FPGAs. In order to support such a wide variety of heterogeneous devices, some elements of the OpenCL API are necessarily low-level. As with the CUDA/C language [7], OpenCL does not provide support for automatic task scheduling, nor does it guarantee global data consistency: it is up to the programmer to explicitly define tasks and enqueue them on devices, and to move data between devices when necessary. Furthermore, when OpenCL implementations from different vendors are used, OpenCL objects created in the context of one vendor's implementation are not valid in another's. Given these limitations, barriers still remain to achieving straightforward heterogeneous computing.

1.2 An Improved Task Queueing API

In this work we propose a task-queueing API extension for OpenCL that helps ameliorate many of the burdens faced when performing heterogeneous programming. Our task-queueing interface is based on the concepts of work pools and work units, and provides a convenient mechanism to execute applications on heterogeneous hardware platforms. Using the API, programmers can easily develop and tune flexible scheduling schemes according to the hardware configuration.


In the task-queueing API, OpenCL kernels within an application are wrapped with metadata into work units. These work units are then enqueued into a work pool and assigned to computing devices by a scheduler. A resource management system is seamlessly integrated into this API to provide for migration of kernels between devices and platforms. We demonstrate the utility of this class of task queueing extension by implementing a large, non-trivial application, clSURF, an open-source OpenCL implementation of OpenSURF (Open source Speeded Up Robust Features) [17].


Our work pool-based, task-queueing API provides the following benefits for OpenCL programmers:

• simple extensions to OpenCL built on top of the current API,

• automated utilization of all devices present in a system,

• a mechanism for easily applying different scheduling schemes to investigate execution on cooperating devices, and

• a means to evaluate the potential benefits of heterogeneous versus discrete architectures by profiling multi-device performance.

The rest of the paper is organized as follows. In Section 2 we provide a review of heterogeneous computing interfaces, and of the SURF algorithm and its existing implementations. In Section 3, we describe our OpenCL task-queueing extension and how it facilitates the effective use of heterogeneous platforms. In Section 4, we demonstrate the capabilities of our API by evaluating the performance of the SURF algorithm on heterogeneous platforms. Finally, in Section 5, we conclude and discuss directions for future work.

2. BACKGROUND AND RELATED WORK

2.1 Heterogeneous Computing

Several projects have investigated how to relieve the programmer of the burden of managing hybrid or heterogeneous platforms.

StarPU [14] is a simple tasking API that provides numerical kernel designers with a convenient way to execute parallel tasks on heterogeneous platforms, and incorporates a number of different scheduling policies. StarPU is based on the integration of a resource management facility with a task execution engine. Several scientific kernels [22][13][12][11] have been deployed on StarPU to exploit the computing power of heterogeneous platforms. However, StarPU is implemented in C, and the basic schedulable units (codelets) have to be implemented multiple times if they target multiple devices. This limits the migration of codelets across platforms and increases the programmer's burden. To overcome this limitation, StarPU has initiated a recent effort to incorporate OpenCL [21] as a frontend.

Figure 1: The Program Flow of clSURF. Phase I: Build Integral Image, Calculate Hessian Determinant, Non-max Suppression; Phase II: Calculate Orientation, Calculate and Normalize Descriptors.

Maestro [27] is an open source library for data orchestration on OpenCL devices. It provides automatic data transfer, task decomposition across multiple devices, and autotuning of dynamic execution parameters for selected problems. However, Maestro focuses mainly on data management, and lacks the ability to handle applications with complex program flow and/or data dependencies.

Qilin [24] uses offline profiling to obtain information about each task on each computing device. This information is then used to partition tasks and create an appropriate performance model for the targeted heterogeneous platforms. However, the overhead of the initial profiling phase can be prohibitive, and the resulting model may be inaccurate if computation behavior is heavily input dependent.

Grewe et al. [20] propose a static partitioning model for OpenCL programs on heterogeneous CPU-GPU systems. The model focuses on how to predict and partition the different tasks according to their computational characteristics, and does not abstract to any common programming interface that would enable more rapid adoption.

IBM’s OpenCL Common Runtime [4] is an OpenCL abstraction layer designed to improve the OpenCL programming experience by managing multiple OpenCL platforms and duplicated resources. It minimizes application complexity by presenting the programming environment as a single OpenCL platform. Shared OpenCL resources, such as data buffers, events, and kernel programs, are transparently managed across the installed vendor implementations. The result is simpler programming in heterogeneous environments. However, even equipped with this commercially-developed Common Runtime, scheduling decisions, data synchronization, and other multi-device functionality must still be performed manually by the programmer.




Apart from StarPU [14], none of the above approaches focus on task partitioning or discuss how to exploit the task-level parallelism that is commonly present in large and diverse applications. We address this issue by enhancing OpenCL programming with a resource management facility.

2.2 SURF in OpenCL

The SURF application was first presented by Bay et al. [15]. The basic idea of SURF is to summarize images using only a relatively small number of interest points. The algorithm analyzes an image and produces feature vectors for every interest point. SURF features have been widely used in real-life applications such as object recognition [25], feature comparison, and face recognition [23]. Numerous projects have implemented elements of SURF in parallel using OpenMP [28], CUDA [29, 19], and OpenCL [26]. We reference the state-of-the-art OpenCL implementation, clSURF, and use it as the baseline for performance comparison in this paper.


Figure 1 shows the program flow of clSURF for processing an image or one frame of a video stream. The whole program flow of clSURF is implemented as several stages, and these stages can be further separated into two phases. In the first phase, the amount of computation is mainly influenced by the size of the image. In the second phase, the amount of computation depends more on the number of interest points, which reflects the complexity of the image.


Previous work has also evaluated SURF on hybrid or heterogeneous platforms [18]. However, that work concentrates on speeding up the algorithm and does not explore scheduling across the multiple devices of a platform.

3. A TASK QUEUEING EXTENSION FOR OPENCL

Next, we describe our OpenCL extension in detail, and consider some of the design decisions made before arriving at our task queueing extension.

3.1 Limitations of the OpenCL Command-Queue Approach


In OpenCL, command queues are the mechanisms for the host to interact with devices. Via a command queue, the host submits commands to be executed by a device. These commands include the execution of programs (called kernels), as well as data transfers. The OpenCL standard specifies that each command queue is only associated with a single device; therefore if N devices are used, then N command queues are required. There are a number of factors that can limit performance when kernel execution runs across multiple devices.
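For reference, the sketch below (standard OpenCL 1.x host code; the helper and variable names are ours) shows the per-device bookkeeping this model requires: one command queue per device, and an explicit choice of queue, and therefore of device, for every launch.

#include <CL/cl.h>
#include <vector>

// One command queue per device: OpenCL offers no queue that spans devices.
std::vector<cl_command_queue> create_queues(cl_context ctx,
                                            const std::vector<cl_device_id>& devices) {
    std::vector<cl_command_queue> queues;
    for (cl_device_id dev : devices) {
        cl_int status;
        queues.push_back(clCreateCommandQueue(ctx, dev, 0, &status));
    }
    return queues;
}

// The programmer must pick the queue (and hence the device) for every launch:
//   clEnqueueNDRangeKernel(queues[chosen_device], kernel, 1, NULL,
//                          &global_size, &local_size, 0, NULL, NULL);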

3.1.1 Working with Multiple Devices

When a CPU and GPU are present in a system, the CPU can potentially help with the processing of some workloads that would normally be offloaded entirely to the GPU. Using the CPU as a compute device requires creating a separate command queue and specifying which commands should be sent to that queue. To allow for this, we need to decide at compile time which workloads will target the CPU. At runtime, once a kernel is enqueued on a command queue, there is no mechanism for a command to be removed from the queue or assigned to another queue. Effective scheduling thus requires that we profile the kernel on both platforms, compute the relative performance, and divide the work accordingly. The disadvantages of this approach are: 1) the CPU may have some unknown amount of work to do between calls, 2) the performance of one or both devices may vary based on the exact input data used, and 3) the host CPU may be executing unrelated tasks which are difficult to identify and which add noise to the profile.

Working with multiple GPU devices presents a similar problem. If multiple devices are used to accelerate the execution of a single kernel, we would still need to statically pre-partition the data. This is especially tricky with heterogeneous GPUs, as it does not allow for the consideration of runtime factors such as delays from bus contention, computation time changes based on the input data sets used, relative computational power, etc.

If multiple devices are used for separate kernels, we would have to add code to either split the kernels between devices, or change the targeted command queue based on some other factor (e.g., the number of loop iterations). Creating an algorithm that divides multiple tasks between a variable number of devices is not a trivial undertaking. Perhaps the most persuasive argument for changing the current queueing model is the fact that it limits effective use of Fusion-like devices. If multiple tasks are competing to run concurrently on a Fusion processor, one of the tasks may elect to run on the GPU, but if the device is already busy, it may be acceptable to run that task or kernel on the CPU instead. Unless we introduce new functionality that allows swapping contexts on the GPU, we are limited to the current model, which forces programs that target the GPU to wait until all previous kernels have completed execution. Even if swapping were implemented, there may simply be too many tasks attempting to share the GPU, and executing elsewhere may be preferable to waiting a long time.

3.2 The Task Queueing API

Instead of using command queues to communicate with devices (one command queue per device), we introduce work pools as an improved task queueing extension to OpenCL. Work pools function similarly to command queues, except that a work pool can automatically target any device in the system. In order to manage execution between multiple devices, OpenCL kernels need to be combined with additional information and wrapped into structures called work units. Figure 2 shows the components in the work pool-based task queueing extension layer that runs on top of OpenCL. Once kernels are wrapped into work units, they can be enqueued into a work pool by the enqueue engine. On the back end, the scheduler can dequeue work units which are ready to execute (i.e., have all dependencies resolved) on one of the devices in the system. Each work pool has access to all of the devices on the platform. It is possible to define multiple work pools and enqueue/dequeue engines.


Figure 2: Scheduling work units from work pools to multiple devices. The task-queueing extension application program interface sits on top of the OpenCL interface and device driver, and comprises an enqueue engine, a work unit queue, a dequeue engine, and a resource management unit.

If there is task-level parallelism in the host code, it may make sense to create multiple work pools that correspond to different scheduling algorithms. From a practical standpoint, creating multiple work pools to assign kernels to the same device may increase device utilization.



The following subsections more clearly define the concepts used in our task queueing API. The API functions that implement these concepts are described in detail in Section 3.2.7.

3.2.1 Work Units

In the task-queueing API, work units are the basic schedulable units of execution. A work unit consists of an OpenCL kernel (cl_kernel) together with the dependencies that must be resolved prior to its execution. When a work unit is created, the programmer optionally supplies a list of work units that must complete execution before the current work unit can be scheduled. This functionality is similar to the current OpenCL standard, where each clEnqueue function takes as an argument a list of cl_events that must be complete before the command is executed. To enable a work unit to execute on any device, the OpenCL kernel is pre-compiled for all devices in the system. When the work unit is scheduled for execution, the kernel corresponding to the chosen device is selected.
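As an illustration of this idea (the field names below are our own; the actual work unit layout is not given in the text), a work unit can be thought of as a small structure that bundles one pre-compiled kernel per device with its outstanding dependencies:

#include <CL/cl.h>
#include <vector>

struct WorkUnitSketch {
    std::vector<cl_kernel> kernels;            // one pre-compiled kernel per device
    std::vector<const WorkUnitSketch*> deps;   // work units that must finish first
    bool args_ready = false;                   // arguments/NDRange may be supplied later
    bool done = false;                         // set once the unit has executed

    // A unit is schedulable only when its arguments are known and
    // every dependency has completed.
    bool ready() const {
        if (!args_ready) return false;
        for (const WorkUnitSketch* d : deps)
            if (!d->done) return false;
        return true;
    }
};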

3.2.2 Work Pools

A work pool is a structure that contains a collection of work units to be executed, and details related to resource management functionality (Section 3.2.5). A scheduler (Section 3.2.6) interacts with each work pool, dequeueing and executing work units according to their accompanying dependency information. Equipped with dependency information for each work unit, the work pool has available system-wide information so it can make informed scheduling decisions.

The work pool also has detailed knowledge of the system (such as the number and types of devices), and has the ability to work with multiple devices from different vendors (Section 3.2.4). The resource management unit tracks the status of all the resources, including memory objects and events, and does so on a device-by-device basis.

3.2.3 Enqueue and Dequeue Engine

Figure 3: CPU and GPU execution in (a) the baseline implementation and (b) the work pool implementation.

The enqueue and dequeue operations on the work pool are two independent CPU threads, and a mutex is used to protect the queue of work units. During the enqueue operation, if the kernel arguments and work-item configuration information (i.e., the NDRange and work group dimensions) are already available, they are packaged together with the work unit; if the information is dynamically generated by a dependent work unit, we rely on a callback mechanism that allows the user to inject initialization and finalization functions, which are executed before and after the execution of the work unit, respectively.


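A minimal sketch of this threading pattern is shown below (our illustration, reusing the WorkUnitSketch type from the earlier sketch, not the extension's actual code): the enqueue thread keeps feeding a mutex-protected queue while the dequeue thread drains ready work units and dispatches them.

#include <deque>
#include <mutex>
#include <vector>

std::deque<WorkUnitSketch*> pool_queue;   // the work pool's queue of work units
std::mutex pool_mutex;

// Producer: host code wraps kernels and enqueues them without waiting on the GPU.
void enqueue_thread(const std::vector<WorkUnitSketch*>& units) {
    for (WorkUnitSketch* wu : units) {
        std::lock_guard<std::mutex> lock(pool_mutex);
        pool_queue.push_back(wu);
    }
}

// Consumer: pops a ready work unit, runs its init callback, launches the kernel
// on the chosen device, then runs the finalize callback and marks it done.
void dequeue_step() {
    WorkUnitSketch* wu = nullptr;
    {
        std::lock_guard<std::mutex> lock(pool_mutex);
        if (!pool_queue.empty() && pool_queue.front()->ready()) {
            wu = const_cast<WorkUnitSketch*>(
                static_cast<const WorkUnitSketch*>(pool_queue.front()));
            pool_queue.pop_front();
        }
    }
    if (wu) {
        // ... dispatch to a device via clEnqueueNDRangeKernel, then:
        wu->done = true;
    }
}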


In the baseline implementation of the clSURF application, kernel execution on the GPU and host program execution on the CPU (e.g., program flow control, data structure initialization, and image display) are synchronized with each other, as illustrated in Figure 3(a). By using separate enqueue and dequeue threads together with the callback mechanism, we can execute some of the CPU host code asynchronously with the GPU kernel execution. This usually results in higher utilization of the GPU command queue and an overall performance gain for the application (Section 4.2).

3.2.4 Common Runtime Layer

In OpenCL, the scope of objects (such as kernels, buffers, and events) is limited to a context. When an object is created, a context must always be specified, and objects that are created in one context are not visible to another context. This becomes a problem when devices from different vendors are used. If a programmer wanted to use an AMD GPU and an NVIDIA GPU simultaneously, they would need to install both vendors' OpenCL implementations: AMD's implementation can interact with AMD GPUs and any x86 CPU, while NVIDIA's implementation can only interact with NVIDIA GPUs. With the current OpenCL specification, contexts cannot span different vendor implementations. This means that the programmer would have to create two contexts (one per implementation) and initialize both of them. The programmer would also have to explicitly manage synchronization between contexts, including transferring data and ensuring dependencies are satisfied.


Using our work pool-based approach, we remove the handicap of object scope being restricted to a single vendor's context, and implement a common runtime layer in the work pool back-end. Each individual work pool directly manages the task of object consistency and synchronization across multiple contexts. Providing this level of flexibility can avoid some of the overhead associated with dependence on a full-blown common runtime layer.
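To make the problem concrete, the sketch below (standard OpenCL calls only) shows what the work pool back-end must do behind the scenes when several vendors' implementations are installed: enumerate every platform and create a separate context for each, since objects created in one context are invisible to the others.

#include <CL/cl.h>
#include <vector>

std::vector<cl_context> create_context_per_platform() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), NULL);

    std::vector<cl_context> contexts;
    for (cl_platform_id p : platforms) {
        cl_context_properties props[] = {
            CL_CONTEXT_PLATFORM, (cl_context_properties)p, 0 };
        cl_int status;
        // One context per vendor implementation; buffers, events, and kernels
        // created here are not visible to the other contexts.
        contexts.push_back(clCreateContextFromType(props, CL_DEVICE_TYPE_ALL,
                                                   NULL, NULL, &status));
    }
    return contexts;
}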

3.2.5 Resource Manager

In the current OpenCL environment, when multiple devices are used, memory objects must be managed explicitly by the programmer. If kernels that use the same data run on different devices, the programmer must always be aware of which device updated the buffer last, and transfer data accordingly. When using our task queueing API, the programmer cannot predict a priori which device each kernel will execute on, so we have designed the API to act as an automated resource manager that maintains memory object consistency between devices.


When using our new OpenCL task queueing API, prior to execution the programmer must explicitly inform the API of the intent to use a block of memory as the input to a kernel by passing the associated data pointer. The resource manager then determines whether the data buffer is already present on that device. If this is the first time the data is used, or if the valid version of the data is on another device, a data transfer is initiated and the new valid data buffer is moved to the target device. If the data is already present on the correct device, no action is required. Since the data transfer overhead is not negligible, keeping every copy coherent would be costly; instead, data is transferred only as necessary prior to kernel dispatch. This data management scheme can be easily extended to avoid data transfers altogether if all devices share a unified memory (such as on the AMD Fusion). In the current implementation, the resource manager assumes that data sizes are smaller than the capacity of any single device.
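The sketch below illustrates the bookkeeping described above (the data structures and the function are our own simplification, not the actual implementation): a table maps each host pointer to per-device buffers and records which device currently holds the valid copy, so a transfer is issued only when the valid copy is elsewhere.

#include <CL/cl.h>
#include <map>

struct BufferRecord {
    std::map<int, cl_mem> per_device;  // device index -> buffer on that device
    int valid_on = -1;                 // device holding the latest copy (-1: host)
};

std::map<void*, BufferRecord> buffer_table;

// Called before a kernel reading 'host_ptr' is dispatched to 'device'.
// Simplification: data is staged through the host copy; the real manager
// would read back from the owning device first when necessary.
cl_mem acquire_buffer(void* host_ptr, size_t bytes, int device,
                      cl_context ctx, cl_command_queue dev_queue) {
    BufferRecord& rec = buffer_table[host_ptr];
    cl_int status;
    if (rec.per_device.find(device) == rec.per_device.end())
        rec.per_device[device] = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                                bytes, NULL, &status);
    if (rec.valid_on != device) {
        clEnqueueWriteBuffer(dev_queue, rec.per_device[device], CL_TRUE,
                             0, bytes, host_ptr, 0, NULL, NULL);
        rec.valid_on = device;
    }
    return rec.per_device[device];
}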

3.2.6 Scheduler

The scheduler continuously dequeues ready work units from the work pool according to the defined scheduling policy. It evaluates the dependency information associated with the work units enqueued in the work pool and uses the specified policy to determine which work unit to execute next and on which device.

When work units are created, the values of the kernel arguments and the work-item configuration information may or may not be available. If the information is not available, work units can still be enqueued into the work pool, but they cannot be executed until the information is provided. To update a work unit with this information, a callback mechanism is provided for the programmer.

For example, in the clSURF application, the work-item configuration information for the work units in the second phase is determined by the number of interest points, which is output data computed during the first phase of the application. We program the initialization and finalization in the callback functions of the related work units so that the number of interest points is updated after the work units are already enqueued.


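A sketch of such a callback pair is shown below (the structure and function names are our own illustration, not the extension's API): the initialization callback computes the NDRange from the interest-point count produced by the first phase, and the finalization callback can publish results needed by downstream work units.

// Illustrative callbacks; names and argument layout are assumptions.
struct OrientationArgs {
    int* num_interest_points;   // written by the first-phase work units
    size_t global_work_size;
    size_t local_work_size;
};

// Runs just before the work unit is dispatched: the NDRange depends on
// data that only exists once the earlier work units have executed.
void init_orientation(OrientationArgs* a) {
    a->local_work_size  = 64;
    a->global_work_size =
        ((static_cast<size_t>(*a->num_interest_points) + 63) / 64) * 64;
}

// Runs after the work unit completes, e.g. to read back or post-process
// outputs required by the descriptor work units.
void finalize_orientation(OrientationArgs* /*a*/) {
}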


The scheduler currently employs a static scheduling policy, where the programmer decides which device a kernel will be executed on. In future work we will improve the scheduler to make this mapping at runtime, allowing for decisions to be made more dynamically.
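As a concrete (and deliberately trivial) example of such a static policy, the mapping can be a fixed table consulted by the scheduler at dequeue time; the kernel names below are hypothetical.

#include <map>
#include <string>

// Hypothetical mapping fixed by the programmer: kernel name -> device index.
static const std::map<std::string, int> kStaticPolicy = {
    { "buildIntegralImage", 0 },   // first-phase kernels on GPU 0
    { "calcHessianDet",     0 },
    { "calcOrientation",    1 },   // second-phase kernels on GPU 1
    { "calcDescriptors",    1 },
};

int pick_device(const std::string& kernel_name, int default_device) {
    auto it = kStaticPolicy.find(kernel_name);
    return it != kStaticPolicy.end() ? it->second : default_device;
}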

3.2.7 Task-Queueing API

The task-queueing API extension is designed to facilitate the execution of OpenCL applications on multiple devices in an easy and efficient way. On top of the OpenCL programming API, the extension API includes two basic classes: work pool and work unit. Table 1 gives a brief description of these two classes and their methods.

Table 1: The Extension API Classes and Methods

class work_pool
  init             Initialize a work pool, define the capacity, and initialize a buffer table
  get_context      Get information on all available OpenCL devices in the system
  enqueue          Enqueue a new work unit into the work pool
  dequeue          Extract a ready work unit and distribute it to a device
  request_buffer   Request a new or existing buffer for the data represented by a pointer
  query            Query information about the next work unit

class work_unit
  init             Initialize a work unit
  compile_program  Compile the OpenCL kernel file for all possible devices
  create_kernel    Create the OpenCL kernel for all possible devices
  set_argument     Register the arguments in the metadata of the work unit
  set_worksize     Register the work-item configuration information in the metadata
  describe_dep     Incorporate dependency information together with the work unit


The work pool class initializes the whole task queue system, and has the knowledge of all OpenCL devices present in the system. It also allocates and coordinates resources across devices.


The work unit class concentrates more on the kernel itself. Besides packaging the compiled kernels for all possible devices, it also incorporates any dependency information. Listing 1 shows an example of how to declare a work unit and how to enqueue/dequeue it to and from the work pool.


work_unit integral_scan(work_pool_in, NULL, "scan.cl", "scan",
                        dep_frame[SCAN], kernel_list, SCAN, &status);

integral_scan.set_argument(index, data_size, ARG_TYPE_CL_MEM,
                           (void *)&data, CL_TRUE);

integral_scan.set_worksize(globalWorkSize, localWorkSize, dim);

work_pool->enqueue(&integral_scan, &status);

work_pool->dequeue(pool_context[device_idx], initFun, &args,
                   finalizeFun, &args, &status);

Listing 1: Example code for enqueueing and dequeueing a work unit.


4. EXPERIMENTS AND RESULTS

4.1 Experimental Environment

To evaluate our new work pool-based implementation of the clSURF algorithm, the application is benchmarked on three different heterogeneous platforms: two platforms containing multiple GPUs of different types, and one AMD Fusion APU (CPU and GPU on the same chip). Sections 4.2 and 4.3 evaluate our task-queueing extension and its performance on single and multiple discrete GPUs, while Section 4.4 evaluates the usefulness of using the host CPU as a compute device.

The first platform has one AMD FirePro V9800P GPU [2] and one AMD Radeon HD6970 GPU [3] installed. The AMD FirePro V9800P GPU has 20 SIMD engines (16-wide) where each core is a 5-way VLIW, for a total of 1600 streaming processors running at 850MHz. The AMD Radeon HD6970 GPU has 24 SIMD engines (16-wide) but is a 4-way VLIW, for a total of 1536 streaming processors running at 880MHz.

The second platform has one AMD FirePro V9800P GPU and one NVIDIA GTX 285 GPU [6]. The NVIDIA GTX 285 GPU has 15 SIMD engines (16-wide) with scalar cores, for a total of 240 streaming cores running at 1401MHz.

An AMD A8-3850 Fusion APU [1] is used as the host device on the above two platforms. We also evaluate this APU as the third platform. On the single chip of the A8-3850 we have four x86-64 CPU cores running at 2.9GHz integrated together with a Radeon HD6550D GPU, which has 5 SIMD engines (16-wide) with 5-way VLIW cores, for a total of 400 streaming processors.

All experiments were performed using the AMD Accelerated Parallel Processing (APP) SDK 2.5 on top of vendor-specific drivers (Catalyst 11.12 for AMD's GPUs and CUDA 4.0.1 for NVIDIA's GPU). The Open Source Computer Vision (OpenCV) library v2.2 [16] is used by clSURF to handle extracting frames from video files and displaying the processed video frames.

Table 2: Input sets emphasizing different phases of the SURF algorithm.

Input    Size         Interesting points   Description
Video1   240 x 320    312                  Small video size & small number of interesting points
Video2   704 x 790    3178                 Small video size & large number of interesting points
Video3   720 x 1280   569                  Large video size & small number of interesting points
Video4   720 x 1280   4123                 Large video size & large number of interesting points

When processing a video in clSURF, the characteristics of the video frames can dramatically impact the execution of specific kernels in the application. For each frame, the performance of the kernels in the first stage is a function of the size of the image, and the performance of the kernels in the second stage is dependent on the number of interest points found in the image. To provide coverage of the entire performance space, we selected four different videos that provide a range of these different attributes (Table 2).


Figure 4: The performance of our work pool implementation on a single device – One Work Pool. Speedup over the baseline for each video input on the GTX 285, FirePro V9800P, Radeon HD6970, and Radeon HD6550D, (a) with display and (b) without display.

Figure 5: The performance of our work pool implementation on a single device – Two Work Pools, (a) with display and (b) without display.

Our baseline is the reference implementation of clSURF. The clSURF application uses OpenCV to display the processed images on the screen. This function always runs on the host machine, and frames must be serialized to be displayed. When we have multiple devices processing independent frames, this function often becomes a major bottleneck. To provide a more accurate view of the performance capabilities of the task queueing implementation on multiple devices, we present results with and without the display functionality enabled for each evaluation.

For this work, we use a simple, static scheduling scheme and manually balance the workload on the different devices.

4.2 Work Pool Implementation on a Single GPU Device

We first investigate the performance opportunities afforded by our new task-queueing extension. Figure 4 shows the relative performance of the work pool implementation compared with the baseline implementation on the respective devices: GTX 285, FirePro V9800P, Radeon HD6970, and the fused GPU Radeon HD6550D. The y-axis is the speedup and the x-axis shows the different video inputs. We can see from this figure that an independent enqueue engine can continuously enqueue work units into the work pool, and achieves better utilization of the GPU command queue when combined with the callback mechanism. For the first video input run on the V9800P and HD6970 systems, the resource management overhead cannot be amortized and we do not achieve any speedup. But in most cases the work pool implementation achieves a speedup, and if we discount the overhead of displaying the image, the speedup is up to 72.4%.

Figure 5 shows the speedup when we create two work pools on a single GPU device. The two work pools process independent frames, and there is no dependency between frames. When we decompose the application across frames, we obtain an additional average speedup of 15%. Using two work pools produces better utilization of the GPU command queue.

4.3 Heterogeneous Platform with Multiple GPU Devices

To demonstrate the work pool implementation of the clSURF application as run on multiple GPU devices, we choose two combinations of different types of GPU devices. The first platform has an AMD FirePro V9800P and an AMD Radeon HD6970; the second platform has an AMD FirePro V9800P and an NVIDIA GTX 285. We manually load-balance the workload on the different GPU devices, and compare the performance against the single work pool implementation by measuring the average execution time per video frame.

Figure 6 shows the performance of our two work pool implementation on the V9800P/HD6970 pair and compares this against the one work pool implementation on a V9800P. Since the V9800P and the HD6970 have comparable computing power, we achieve the best performance when the workload is divided evenly between the two devices. The speedup achieved is up to 55.4%. We also include the baseline implementation for reference.


Figure 6: Load balancing on two devices – V9800P and HD6970. Results for each video input compare the baseline on the V9800P, the V9800P work pool, and HD6970:V9800P frame ratios of 1:1, 2:1, and 10:9, (a) with display and (b) without display.

Figure 7: Load balancing on dual devices – V9800P and GTX 285. Results for each video input compare the baseline on the GTX 285, the GTX 285 work pool, and V9800P:GTX 285 frame ratios of 1:1, 2:1, and 10:9, (a) with display and (b) without display.

The AMD FirePro V9800P and the NVIDIA GTX 285 differ significantly in computational power. In Figure 7, we compare the performance of our two work pool implementation on the V9800P/GTX 285 combination against a single work pool implementation on the GTX 285. When we schedule twice the number of frames on the V9800P as on the GTX 285, we achieve up to a 2.8x speedup, because the AMD device is a much more powerful processing unit.

4.4 Heterogeneous Platform with CPU and GPU (APU) Device

In OpenCL programming, the CPU can also be used as a computing device. With APUs such as AMD's Fusion chips, the communication overhead between GPU and CPU decreases, making heterogeneous programming feasible for a wider variety of applications. In this section, we demonstrate how our extension facilitates the use of the CPU to participate in kernel execution alongside the GPU accelerator.

When using the GPU and CPU as co-accelerators, the characteristics of each processing unit must be considered. For example, GPUs are much better suited for large, data-parallel algorithms than CPUs. However, as more processing cores appear inside a single die, the CPU has the potential to supply valuable computing power instead of being used only as a scheduler and coordinator. Furthermore, we always have to be careful when assigning computing tasks to CPU devices, as the performance of the whole system can be seriously affected if too many resources are used by kernel computation and other CPU threads are blocked from execution.

One solution to this problem is to utilize the device fission extension [9] supported by OpenCL. Using device fission, the programmer can divide an OpenCL device into sub-devices and use only part of it for execution. Currently, this device extension is only supported for multi-core CPU devices, and can be used to ensure that the CPU has enough resources to perform additional tasks, specifically task scheduling, while executing an OpenCL program.
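The sketch below shows the kind of configuration this enables (our reading of the cl_ext_device_fission extension; the exact property list, and whether the entry point must be fetched via clGetExtensionFunctionAddress, may vary by SDK): a sub-device is created that exposes only a subset of the CPU's cores to OpenCL, leaving the remaining cores free for host duties.

#include <CL/cl.h>
#include <CL/cl_ext.h>

// Carve a CPU sub-device with a limited number of compute units so that the
// remaining cores stay available for scheduling, display, and other host work.
cl_device_id make_cpu_subdevice(cl_device_id cpu_device, int cores_for_compute) {
    cl_device_partition_property_ext props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS_EXT,
        (cl_device_partition_property_ext)cores_for_compute,
        CL_PARTITION_BY_COUNTS_LIST_END_EXT,
        CL_PROPERTIES_LIST_END_EXT
    };
    cl_device_id sub_device = NULL;
    cl_uint num_returned = 0;
    clCreateSubDevicesEXT(cpu_device, props, 1, &sub_device, &num_returned);
    return sub_device;   // use this as the OpenCL compute device for CPU kernels
}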

Figure 8 shows the impact of the number of CPU cores used for OpenCL processing on the execution time per video frame for Video2. In this experiment, the GPU and CPU devices work together to process the video file. The x-axis indicates the ratio of frames distributed between the CPU and the fused GPU of the AMD APU A8-3850. From this figure we can see that aggressively using the CPU as the computing device is detrimental to performance, as kernel execution impacts the CPU's ability to function as an efficient host device.


Figure 8: Performance with different device fission configurations and load balancing schemes between the CPU and the fused HD6550D GPU. The plot relates execution time per frame (ms) and the number of CPU cores used for computing to the processed frame ratio between the CPU and the fused GPU.

Figure 9: Load balancing on dual devices – HD6550D and CPU. For each video input, the figure compares the baseline on the HD6550D, the HD6550D with the work pool and the last work unit on the CPU, and HD6550D:CPU frame ratios of 100:2 and 100:4.

Figure 9 shows the performance of the work pool implementation while using both a CPU and a GPU as compute devices, compared with the baseline execution on the fused HD6550D GPU. We configure the CPU using the device fission extension and use only part of the CPU as the computing device. From this figure we can see that our extension enables programmers to utilize all available computing resources on the GPU and the CPU. However, we only achieve a speedup in one scenario; in most cases, kernel execution on the CPU slows down the whole application. Since the CPU in the AMD APU A8-3850 is a quad-core CPU, using it as a computing device severely impacts its ability to act as a scheduling unit, even though we already use only some of its cores by configuring it with the fission extension. The enqueue operation on the GPU work pool is frequently blocked when part of the CPU is reserved for the computing task, which results in a slowdown of the overall execution.

5. CONCLUSION

This paper presents a work pool-based task queueing extension for OpenCL that allows programmers to easily harness the power of heterogeneous computing environments. Using the task queueing API and a large, diverse application, clSURF, we show that work pools can provide substantial benefit to OpenCL programmers. In multi-GPU environments, work pools allow the programmer to easily adapt the scheduling policy of OpenCL kernels to fit the environment. Even with a single GPU, work pools allow for higher utilization of the device, resulting in speedups.

In this work, we also evaluated the potential for the CPU to be used as a compute device. Our results show that the CPU can be an effective co-accelerator only when enough CPU resources are reserved for the system to perform its required operations (including running the OpenCL host code). The natural extension of this paper is to implement dynamic scheduling of tasks among different computing devices using the task queueing scheduler. With a proper runtime profiling framework, the scheduler should be able to make scheduling decisions dynamically and distribute work units to devices based on data location, computing power, load, etc. The current implementation of the task queueing API is built entirely on existing OpenCL structures, making a clean interface difficult. Future work will attempt to design a more intuitive API for creating and enqueueing work units, with less effort required from the programmer, so that our work pool concept can be easily adopted in the design of future applications.

6. ACKNOWLEDGEMENTS

The authors would like to thank Perhaad Mistry and Chris Gregg for the source code of their OpenCL implementation of the SURF algorithm.



7. REFERENCES

[1] AMD Accelerated Processors for Desktop PCs. http://www.amd.com/us/products/desktop/apu/mainstream/pages/mainstream.aspx. [2] AMD FirePro V9800P Professional Graphics. http://www.amd.com/us/products/workstation/graphics/ati-firepro-3d/v9800p/Pages/v9800p.aspx. [3] AMD Radeon HD 6970 Graphics. http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/hd-6970-/Pages/amd-radeonhd-6970-overview.aspx. [4] IBM OpenCL Common Runtime for Linux on x86 Architecture. http://www.alphaworks.ibm.com/tech/ocr. [5] Intel CORE Processors.



http://www.intel.com/SandyBridge. [6] NVIDIA GeForce GTX 285 Graphics. http://www.nvidia.com/object/product geforce gtx 285 us.html. [7] NVIDIA’s parallel computing architecture. http://www.nvidia.com/object/cuda home new.html. [8] OpenCL - The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/. [9] OpenCL device extension: Device fission. http://www.khronos.org/registry/cl/extensions/ext/ cl ext device fission.txt. [10] The AMD Fusion Family of APUs. http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx. [11] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov. LU factorization for accelerator-based systems. In 9th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 11), Sharm El-Sheikh, Egypt, 2011. [12] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and S. Tomov. QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators. In 25th IEEE International Parallel & Distributed Processing ´ Symposium, Anchorage, Etats-Unis, May 2011. [13] E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov. Faster, Cheaper, Better – a Hybridization Methodology to Develop Linear Algebra Software for GPUs. In GPU Computing Gems, volume 2. Morgan Kaufmann, Sept. 2010. [14] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. Starpu: A unified platform for task scheduling on heterogeneous multicore architectures. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par ’09, pages 863–874, Berlin, Heidelberg, 2009. Springer-Verlag. [15] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Comput. Vis. Image Underst., 110:346–359, June 2008. [16] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000. [17] C. Evans. Notes on the opensurf library. Technical Report CSTR-09-001, University of Bristol, January 2009. [18] Z. Fang, D. Yang, WeihuaZhang, H. Chen, and B. Zang. A comprehensive analysis and parallelization of an image retrieval algorithm. In Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on, pages 154 –164, april 2011. [19] P. Furgale, C. H. Tong, and G. Kenway. Ece 1724 project report: Speed-up speed-up robust features. 2009.


[20] D. Grewe and M. F. P. O’Boyle. A static task partitioning approach for heterogeneous systems using opencl. In Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software, CC’11/ETAPS’11, pages 286–305, Berlin, Heidelberg, 2011. Springer-Verlag. [21] S. Henry. Opencl as Starpu frontend. Technical report, National Institute for Research in Computer Science and Control, INRIA, March 2010. [22] E. Hermann, B. Raffin, F. Faure, T. Gautier, and J. Allard. Multi-gpu and multi-cpu parallelization for interactive physics simulations. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II, Euro-Par’10, pages 235–246, Berlin, Heidelberg, 2010. Springer-Verlag. [23] D. Kim and R. Dahyot. Face components detection using surf descriptors and svms. In Machine Vision and Image Processing Conference, 2008. IMVIP ’08. International, pages 51 –56, sept. 2008. [24] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 45 –55, dec. 2009. [25] J. Luo, Y. Ma, E. Takikawa, S. Lao, M. Kawade, and B.-L. Lu. Person-specific sift features for face recognition. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 2, pages II–593 –II–596, april 2007. [26] P. Mistry, C. Gregg, N. Rubin, D. Kaeli, and K. Hazelwood. Analyzing program flow within a many-kernel opencl application. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pages 10:1–10:8, New York, NY, USA, 2011. ACM. [27] K. Spafford, J. Meredith, and J. Vetter. Maestro: data orchestration and tuning for opencl devices. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II, Euro-Par’10, pages 275–286, Berlin, Heidelberg, 2010. Springer-Verlag. [28] S. Srinivasan, Z. Fang, R. Iyer, S. Zhang, M. Espig, D. Newell, D. Cermak, Y. Wu, I. Kozintsev, and H. Haussecker. Performance characterization and optimization of mobile augmented reality on handheld platforms. IEEE Workload Characterization Symposium, 0:128–137, 2009. [29] N. Zhang. Computing parallel speeded-up robust features (p-surf) via posix threads. In Proceedings of the 5th international conference on Emerging intelligent computing technology and applications, ICIC’09, pages 287–296, Berlin, Heidelberg, 2009. Springer-Verlag.
