TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 08/18 pp287-295 Volume 17, Number 3, June 2012
Parallelization and Performance Optimization on Face Detection Algorithm with OpenCL: A Case Study*
Weiyan Wang1,3, Yunquan Zhang1,2,**, Shengen Yan1,3, Ying Zhang1,3, Haipeng Jia1,4
1. Laboratory of Parallel Software and Computational Science, Institute of Software, the Chinese Academy of Science, Beijing 100190, China; 2. State Key Laboratory of Computer Science, Institute of Software, the Chinese Academy of Science, Beijing 100190, China; 3. Graduate University of Chinese Academy of Sciences, the Chinese Academy of Science, Beijing 100190, China; 4. Ocean University of China, Qingdao 26610, China
Abstract: Face detection applications have an inherent real-time requirement. Although the Viola-Jones algorithm handles this elegantly, today’s ever larger high-quality images and videos pose a new challenge to real-time performance. Parallelizing the Viola-Jones algorithm with OpenCL is a good way to achieve high performance on both AMD and NVidia GPU platforms without devising new algorithms. This paper presents the bottlenecks of this application and discusses how to optimize face detection step by step, starting from a very naïve implementation. Techniques such as hiding CPU execution time, subtle use of local memory as a high-speed scratchpad and manual cache, and variable granularity were used to improve performance. These techniques result in a 4-13 times speedup, varying with the image size. Furthermore, these ideas may shed some light on how to parallelize applications efficiently with OpenCL. Taking face detection as an example, this paper also summarizes some general advice on how to optimize OpenCL programs, in the hope of helping other applications perform better on GPUs.
Key words: Viola-Jones; OpenCL; time cost hidden; local memory usage; parallel granularity
Introduction
Face detection is widely used in cameras, surveillance, entertainment, and so on. More importantly, it is a necessary step for face recognition, which has to be fast. As the first object detection framework to provide competitive object detection rates in real time, the Viola-Jones object detection framework can be trained to
Received: 2012-05-02; revised: 2012-05-14
* Supported by the National Natural Science Foundation of China (No. 61133005) and the National High-Tech Research and Development (863) Program of China (No. 2012AA010902)
** To whom correspondence should be addressed. E-mail:
[email protected]; Tel: 86-10-62661636
detect a variety of objects, especially faces[1,2]. When the algorithm was proposed in 2001, the Viola-Jones object detection framework satisfied real-time requirements elegantly. However, as today’s images and videos have as many as 10 times more pixels and larger sizes, the original Viola-Jones algorithm can no longer handle them in real time. Confronted with such a real-time challenge, researchers may struggle to find a faster algorithm. However, the emergence of general-purpose GPUs and the OpenCL standard offers another way to approach the goal on most platforms without changing the algorithm. This paper discusses how to parallelize and optimize Viola-Jones face detection across platforms step by step.
It introduces some effective techniques, such as hiding CPU execution time, reducing kernel invocations, using local memory for coalesced reads and as a manually prefetched cache, and varying the parallel granularity to avoid wasting computing resources. These techniques make full use of the GPU’s numerous computing units and high bandwidth and overcome the bottlenecks to meet this harsh challenge. As a result of these efforts, a considerable speedup of up to more than 13 times compared with the CPU program is achieved on both AMD and NVidia platforms. Additionally, this paper tries not only to describe how to port face detection to GPU platforms but also to shed some light on a general way to map mature algorithms onto GPU devices properly, including how to analyze bottlenecks and how to optimize common applications on GPUs. In brief, the main contributions of this paper are hiding time cost, optimizing memory access to both the image and the cascade classifier, and reducing the computing resources wasted by branch divergence, so as to achieve high speedup across platforms.
1 Related Work
The real-time challenge is a common problem shared by researchers in related areas, and some attempts have been made to take advantage of GPUs. Merrill et al.[3] proposed a particle-filtering-based method for tracking faces in a group of meeting videos. However, the GPUs are only used for 3-D modeling and face tracking. Ghorayeb et al.[4] described a hybrid scheme that does face detection on CPU and GPU using Brooks API, which is built on top of OpenGL. A recent work by Sharma et al.[5] developed a GPU-accelerated, real-time, and robust face processing system that does face detection and tracking. Oro et al.[6] also presented a highly optimized Haar-based face detector. However, they seem to have ignored the computing resources wasted by divergence in the cascade classifier. Hefenbrock et al.[7] presented a GPU-accelerated face detector that alleviates the waste of computing resources caused by branch divergence by assigning two threads to each sub-window. However, this method is simple and naïve, and not very efficient. Most importantly, the works described above are all based on CUDA and they can
only run on NVidia GPUs. This paper presents a carefully optimized cross-platform face detection algorithm in OpenCL, using the techniques discussed in the following sections. In other words, our implementation achieves high performance on both NVidia and AMD GPUs while maintaining portability.
2 Background
2.1 Viola-Jones object detection framework
It is necessary to take a brief look at the Viola-Jones algorithm itself before discussing the OpenCL implementation. Like other machine learning and pattern recognition algorithms, the Viola-Jones detection algorithm consists of two parts: training and detection. As in most similar algorithms, the training part is done offline. In other words, although the AdaBoost training of the framework is a time-consuming iterative algorithm that may have to run for several weeks, its cost can be ignored. This paper therefore focuses on the detection part, described below. Viola-Jones relies on several ingenious ideas and techniques to achieve high accuracy and detection speed. The first is the Haar feature[8]. Instead of using pixels directly, Viola-Jones uses Haar features, a kind of indirect information extracted from the pixels, which speeds up detection. A Haar feature is defined as a simple rectangle feature, as shown in Fig. 1. Its value is calculated by subtracting the sum of the pixels in the white parts from the sum of the pixels in the black parts. To speed up this calculation, the integral image is introduced, in which the value at each position is the sum of all pixels above and to the left of it, inclusive, as shown in Eq. (1).
$$ii(x, y) = \sum_{x' \le x,\; y' \le y} \mathrm{value}(x', y') \qquad (1)$$
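As a rough illustration of the Haar feature and integral image described above, the following Python sketch builds an integral image and evaluates one rectangle feature from it. It is a minimal sequential sketch, not the paper's OpenCL implementation. The function names (`integral_image`, `rect_sum`, `two_rect_haar`) are hypothetical, and the feature layout (left half white, right half black) is an assumption chosen only for the example, since the feature shapes of Fig. 1 are not reproduced here.

```python
def integral_image(img):
    """Return ii where ii[y][x] is the sum of img[y'][x'] for all x' <= x, y' <= y."""
    h = len(img)
    w = len(img[0]) if h else 0
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y).

    Uses at most four lookups in the integral image, whatever the rectangle size.
    """
    x1, y1 = x + w - 1, y + h - 1  # bottom-right pixel of the rectangle
    total = ii[y1][x1]
    if x > 0:
        total -= ii[y1][x - 1]      # strip to the left of the rectangle
    if y > 0:
        total -= ii[y - 1][x1]      # strip above the rectangle
    if x > 0 and y > 0:
        total += ii[y - 1][x - 1]   # corner that was subtracted twice
    return total


def two_rect_haar(ii, x, y, w, h):
    """Two-rectangle feature (assumed layout: left half white, right half black).

    Returns sum(black) - sum(white), following the definition in the text.
    """
    half = w // 2
    white = rect_sum(ii, x, y, half, h)
    black = rect_sum(ii, x + half, y, half, h)
    return black - white


if __name__ == "__main__":
    image = [[1, 1, 5, 5],
             [1, 1, 5, 5]]
    ii = integral_image(image)
    print(two_rect_haar(ii, 0, 0, 4, 2))
```

The point of the sketch is that, once the integral image is available, the cost of a rectangle sum no longer depends on the rectangle's size, which is why each Haar feature can be evaluated with a handful of memory reads.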