OpenCL Programming Language And Aspects Of Heterogeneous Parallel Computing
Gaurav P. Dave
Institute of Diploma Studies, Department of Computer Engineering, Nirma University, Ahmedabad, India
[email protected]

Abstract— The advancements in GPU technology have attracted many researchers who need intensive computing toward GPU-based computing (GPGPU). GPGPU programming platforms are vendor- and hardware-specific, so accessing the overall compute power of heterogeneous processors is difficult. The OpenCL standard for general-purpose parallel programming across heterogeneous processors provides a promising platform for heterogeneous parallel computing. This paper introduces the OpenCL programming environment, describes the basic structure of an OpenCL program, and outlines the challenges in heterogeneous parallel computing.

Index Terms— OpenCL, Graphics Processing Unit, Heterogeneous computing.
I. INTRODUCTION
In recent years, multi-core and many-core processors have been superseding sequential ones. Increasing parallelism, rather than increasing clock rate, has become the primary engine of processor performance growth, and this trend is likely to continue [1]. In particular, today's GPUs (Graphics Processing Units), which greatly outperform CPUs in arithmetic throughput and memory bandwidth, can use hundreds of parallel processor cores to execute tens of thousands of parallel threads [2]. Researchers and developers are becoming increasingly interested in harnessing this power for general-purpose computing, an effort known collectively as GPGPU ("General-Purpose computing on the GPU") [3], to rapidly solve large problems with substantial inherent parallelism. The scope of this paper is to review the architectural and programming aspects of GPGPUs using OpenCL. The rest of the paper is organized as follows: Section II presents the motivation for using graphics processors in general-purpose applications. Section III gives the background of GPUs. Section IV outlines the architecture of a GPU. Section V provides insight into OpenCL. Section VI concludes the paper.
III. BACKGROUND
Initially, graphics processors were dedicated to the gaming and simulation industries. John Nickolls at NVIDIA recognized the huge potential of GPGPU, and in 2007 the Compute Unified Device Architecture (CUDA) [7] was released. With CUDA, the GPU became much easier to program for general-purpose applications. NVIDIA is not the only company with a programming environment aimed at making GPGPU more accessible: another major GPU manufacturer, ATI Technologies (now the AMD Graphics Products Group), introduced a general-purpose GPU computing technology called Stream SDK. Both NVIDIA and AMD, together with IBM, Apple, Intel, and others, have since moved to OpenCL, which is described in Section V.
Heterogeneous computing is the well-guided, effective use of a suite of diverse high-performance devices to provide fast processing for computationally demanding tasks that have diverse computing needs [8]. In this context, a programming model is required that spans different architectures and devices. OpenCL gives the programmer an umbrella view of the entire system, making it possible to utilize the power of all the compute devices in it.
IV.
ARCHITECTURE OF GPU
As shown in Fig. 1, GPUs contain independent processors capable of executing multiple threads, local memory per multiprocessor, and on-board video RAM (VRAM). According to Flynn's taxonomy, GPUs are SIMD (Single Instruction, Multiple Data) processors.
II. MOTIVATION
In comparison to traditional CPUs, GPUs provide a low-cost alternative as commodity hardware. Recent results [5] show that using GPUs as coprocessors in specific applications can deliver performance measured in GFLOPS (giga floating-point operations per second), with speedups from 1x to 5000x over CPU-only implementations. Thus personal supercomputing can be achieved at a cost far below that of application-specific supercomputers.
Fig. 1. GPU architecture model
V. OPENCL
OpenCL (Open Computing Language) [4] is the first open, royalty-free standard for general-purpose parallel programming across heterogeneous processors. It was initially proposed by Apple and subsequently developed by the Khronos Group, which released OpenCL 1.0 in 2008. As a standard, its most important characteristic compared with, for example, CUDA is that OpenCL supports not only vendor-independent GPUs but any platform that matches the OpenCL platform structure. Moreover, OpenCL device management allows the user to exploit all the available resources in the computer, not only the GPU.
In the OpenCL platform model, the host, usually a CPU, is responsible for the main control and communicates with the other OpenCL devices. Each OpenCL device, for example a multicore CPU or a GPU, consists of multiple compute units, and each compute unit contains one or more processing elements. For example, a four-core CPU is a device whose cores are compute units with one processing element each; GPUs instead have many processing elements per compute unit.
OpenCL framework:
OpenCL is the logical evolution of ideas pioneered by CUDA (NVIDIA's Compute Unified Device Architecture). It provides a framework for writing programs that execute across heterogeneous platforms (CPUs and GPUs). OpenCL includes a parallel programming language (based on C99) for writing kernels, functions that execute on OpenCL-compatible devices, together with APIs that are used to discover and control the different platforms.
Fig. 2. OpenCL Simplified Platform Model
Fig. 3. OpenCL framework
OPENCL PLATFORM MODEL
Fig. 3. OpenCL Platform Model
In the OpenCL platform model [9], a host coordinates the execution of one or more OpenCL devices.
Execution model: An OpenCL program consists of a host program and kernels. Kernels are the code that executes on OpenCL devices. The host manages a command queue for each device; kernels can then execute in order or out of order depending on the synchronization commands used by the programmer. When a kernel is submitted for execution, an index space is defined. Work-items are logical instances that execute the same kernel together; a work-item can correspond to one or more processing elements executing the same kernel code. Work-items are further grouped into work-groups, which enable synchronization among their members.
Synchronization: Synchronization across the work-items of a work-group is achieved with work-group barriers. Command-queue barriers can be used to ensure ordering within a queue, and events can provide synchronization across queues.
Memory model: Global, local, private, and constant memory regions are supported; a given device does not have to provide all of them. The host has read and write access to global and constant memory. The host can also allocate local memory but cannot access its data; all host allocations are dynamic. A kernel has read and write access to all memory regions, but can only allocate local, private, and constant memory, and such allocation is static. In terms of architectural implementation, processing elements have access to private memory, while compute units have access to local, global, and constant memory.
VI. CONCLUSION
OpenCL provides a path to heterogeneous, massively parallel computing. Graphics processors have become an attractive alternative for general-purpose high-performance computing, and OpenCL is a promising candidate to become the standard for heterogeneous computing.
Compilation model: OpenCL uses a dynamic compilation model similar to OpenGL's. The code is first compiled into an intermediate representation (IR); an application then loads the IR and compiles it into machine code on the fly. Fig. 3 illustrates the concept of writing code once and compiling it to multiple architectures depending on the platform of interest [9].
Fig. 4. Example of C function and OpenCL Kernel[9]
Fig. 5. Example of C function and OpenCL Kernel
As shown in Fig. 5, serial C code can be transformed into an OpenCL kernel.
REFERENCES
[1] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel Computing Experiences with CUDA," IEEE Micro, vol. 28, pp. 13–27, July 2008.
[2] J. Nickolls and W. J. Dally, "The GPU Computing Era," IEEE Micro, vol. 30, pp. 56–69, March 2010.
[3] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," Computer Graphics Forum, vol. 26, pp. 80–113, March 2007.
[4] OpenCL documentation, http://www.khronos.org/opencl/, 2012.
[5] B. He, N. K. Govindaraju, Q. Luo, and B. Smith, "Relational Query Processing on Graphics Processors," ACM Transactions on Database Systems, vol. 34, no. 4, article 21, December 2009.
[6] GPGPU: General-Purpose Computation Using Graphics Hardware, http://www.gpgpu.org/, 2012.
[7] P. L. Alvarez and S. Yamagiwa, "Invitation to OpenCL," in 2011 Second International Conference on Networking and Computing, IEEE, 2011, pp. 8–16.
[8] A. A. Khokhar, V. K. Prasanna, M. E. Shaaban, and C. Wang, "Heterogeneous Computing: Challenges and Opportunities," IEEE Computer, June 1993, pp. 18–27.
[9] R. O. Topaloglu, "GPU Programming for EDA with OpenCL," IEEE invited paper, 2011, pp. 63–66.
[10] J. Lee, J. Kim, J. Kim, S. Seo, and J. Lee, "An OpenCL Framework for Homogeneous Manycores with no Hardware Cache Coherence," in International Conference on Parallel Architectures and Compilation Techniques, IEEE, pp. 56–67.