Accelerating image recognition on mobile devices using GPGPU

Miguel Bordallo López(a), Henri Nykänen(b), Jari Hannuksela(a), Olli Silvén(a) and Markku Vehviläinen(c)

(a) Machine Vision Group, University of Oulu, Oulu, Finland; (b) Visidon Ltd, Oulu, Finland; (c) Nokia Research Center, Tampere, Finland

ABSTRACT
The future multi-modal user interfaces of battery-powered mobile devices are expected to require computationally costly image analysis techniques. Graphics Processing Units are well suited for parallel processing, and the addition of programmable stages and high-precision arithmetic provides opportunities to implement complete algorithms in an energy-efficient way. The first mobile graphics accelerators with programmable pipelines are now available, enabling GPGPU implementations of several image processing algorithms. In this context, we consider a face tracking approach that uses efficient gray-scale invariant texture features and boosting. The solution is based on Local Binary Pattern (LBP) features and makes use of the GPU in the pre-processing and feature extraction phases. We have implemented a series of image processing techniques in the shader language of OpenGL ES 2.0, compiled them for a mobile graphics processing unit and performed tests on a mobile application processor platform (OMAP3530). In our contribution, we describe the challenges of designing on a mobile platform, present the performance achieved, and provide measurement results for the actual power consumption in comparison to using the CPU (ARM) on the same platform.

Keywords: Computer vision, cell phone, image recognition, graphics processing unit, GPGPU
1. INTRODUCTION

The future multi-modal user interfaces of battery-powered mobile devices are expected to require energy-efficient image analysis for tasks such as face tracking and gesture recognition. These low-level image processing operations are computationally costly and often need to be performed in real time. Currently, mobile devices make use of SIMD units for operation parallelization. Even though the SIMD computing model offers high flexibility, it relies on the CPU for control code execution, while a Graphics Processing Unit can be treated as an independent entity. This provides attractive opportunities to implement complete algorithms on the GPU. In addition, the energy efficiency of the GPU could reduce the power needs of image analysis tasks in mobile devices.

Figure 1 depicts a typical organization of a mobile platform. The camera is interconnected with the main processor, which has access to both the main memory and the graphics memory. Since there is no direct interface linking the camera with the graphics memory, the camera data needs to be transferred from the main CPU to the GPU.

Using Graphics Processing Units (GPUs) to perform computationally intensive tasks has become popular in many industrial and scientific applications. As GPU computing is well suited for parallel processing, it is also a very interesting option for accelerating image recognition. Traditionally, the GPU is only used to accelerate certain parts of the graphics pipeline, such as warping operations. General-purpose computing on graphics processing units (GPGPU) is the technique of using a GPU to perform computations that are usually handled by the CPU. The addition of programmable stages and high-precision arithmetic allows developers to use stream processing on general data.

Further author information: Send correspondence to Miguel Bordallo López. E-mail:
[email protected], Telephone: +358 449170541
Figure 1. A typical mobile platform organization. The GPU can be treated as an independent entity.
In recent years, GPU implementations of computer vision algorithms have gained recognition as a viable option. Algorithms such as the scale-invariant feature transform (SIFT),1 speeded-up robust features (SURF),2 the Kanade-Lucas-Tomasi (KLT) tracker3 and the LBP4 have been implemented on desktop GPUs with performance far surpassing corresponding CPU implementations. In this context, we consider a face tracking approach that uses efficient gray-scale invariant texture features and boosting. The solution is based on Local Binary Pattern (LBP) features and makes use of the GPU in the pre-processing and feature extraction phases to reduce computation time and power requirements. The LBP operator is a texture analysis tool that measures local image contrast, where the selected pixel's value is defined by its eight surrounding neighbors. LBP techniques have been identified as a potential methodological basis for implementations due to their high accuracy.5 However, the algorithms are bit-oriented and, as such, demanding to implement on serial processors. In our contribution, we have implemented a series of image processing techniques in the shader language of OpenGL, including the first implementation of the LBP on a mobile GPU with OpenGL ES 2.0. We describe the challenges of designing for a mobile GPU in contrast to more flexible desktop computer GPUs. We also present the execution time and the approximate power consumption of the implementation, and compare the results with a CPU implementation.6
2. MOBILE IMAGE RECOGNITION

Robust recognition of arbitrary object classes in natural visual scenes is an aspiring goal with numerous practical applications, such as augmented reality systems and multi-modal context-aware mobile user interfaces. A crucial task in this process is the ability to link a group of pixels in an image array with a known category of objects. In a mobile context, faces are important feature sources for multi-modal user interfaces. A camera directed towards the user is usually intended for video calls, and therefore its field of view is optimized for the user's face region. This provides advantages for various human-computer interaction solutions. Knowledge of the presence and motion of a human face in the view of the camera can be a powerful application enabler.7

Faces are also important in imaging applications. Typically, users are interested in searching for people in images, and good-quality face regions are among the main concerns in consumer imaging. There are already commercial auto-focus and auto-white-balance solutions that utilize detected faces during image capture. An efficient face tracking implementation can also be used as the first step of several more complex applications, such as smile-based shutters or virtual 3D displays.

An example scheme of a face tracking algorithm is shown in Figure 2. The face tracking approach under consideration uses efficient gray-scale invariant texture features8 and boosting.9 Feature extraction is an important part of these systems and many algorithms have been proposed for it. The analyzed solution is based on the LBP texture methodology10 for facial representation. After the extraction of the face features, a learning system such as AdaBoost can be applied to find the most discriminative features for distinguishing face patterns from the background. This boosting method
Figure 2. Parallelized face tracker.
searches for faces in images or image sequences, and returns the coordinates of the detected objects. The resulting information can be directly used by face-based approaches, such as auto focusing or color enhancement. The main idea behind the proposed approach is to exploit the GPU for face tracking and utilize the free cycles on the CPU to perform an alternative task such as camera control. The face tracking algorithm consists of four phases. First, the incoming viewfinder image, typically CIF or QVGA, is pre-processed. The pre-processing consists of multi-scaling the source image to different sizes and preparing the data in the most suitable format. Extracting features at different scales allows the detection of objects of multiple sizes. After the pre-processing, the image features are extracted using the LBP operator. Then, classification is performed and finally the results are post-processed and presented. In this paper we describe how the scaling, pre-processing and LBP feature extraction can be implemented on a GPU. The classifier is implemented as a pure CPU solution, while future work also considers GPU acceleration of this computationally demanding phase.
2.1 Local Binary Pattern

The LBP operator is a texture analysis tool that measures local image contrast. It was first introduced by Ojala8 and has since received several improvements.6 This implementation, based on the first version of the operator, can be implemented more efficiently. The resulting operator is called LBP8, since the selected pixel's value is defined by its eight surrounding neighbors. Figure 3 depicts the computation of the LBP operator. The value is computed by setting each neighbor that is brighter than or equal to the selected pixel to 1 and each one that is darker to 0. In other words, the algorithm thresholds the neighborhood using the center pixel as the edge value. The LBP binary value is formed by placing the thresholded values in the correct order. In this implementation the least significant bit is in the upper right corner of the 3x3 neighborhood and the most significant bit is the pixel below it. To obtain the final number, the binary values are simply unrolled into a single byte. In the example shown in Figure 3 the LBP value is 105 in decimal. Once the LBP values are calculated, they can be used in various ways. For example, a histogram can be formed from the LBP image and then used to detect different kinds of textures by comparing it to other LBP histograms.
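The operator described above can be sketched in a few lines of reference code. This is an illustrative CPU-side sketch, not the paper's implementation; the counter-clockwise bit ordering below is our reading of the "least significant bit at the upper right, most significant bit below it" description, and should be treated as one possible convention.

```python
def lbp8(img, y, x):
    """LBP8 value at (y, x): set each neighbor that is brighter than
    or equal to the center pixel to 1 and each darker one to 0, then
    pack the eight bits into a single byte."""
    c = img[y][x]
    # Bit order: least significant bit at the upper-right neighbor,
    # most significant bit at the neighbor below it, proceeding
    # counter-clockwise in between (an assumed convention).
    offsets = [(-1, 1), (-1, 0), (-1, -1), (0, -1),
               (1, -1), (1, 0), (1, 1), (0, 1)]
    value = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= c:
            value |= 1 << bit
    return value
```

On a uniform 3x3 patch every neighbor satisfies the brighter-or-equal test, so the operator returns 255; a histogram of such values over the image then characterizes the texture.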
3. MOBILE GRAPHICS PROCESSOR AS A COMPUTING ENGINE

Graphics Processing Units have been used on desktop computers to accelerate computer vision algorithms. A mobile GPU is especially useful as a co-processor for executing certain functions, and employing its resources is most conveniently and portably done with a standard graphics API. On a mobile device platform the choice is essentially limited to OpenGL ES, while the emerging OpenCL Embedded Profile is likely to offer flexibility similar to vendor-specific solutions designed for desktop computers, such as Nvidia's CUDA. Some of the
Figure 3. The LBP operator.
recent mobile phones, such as the Nokia N900, include a graphics processor accessible via the OpenGL ES application programming interface (API). Although the current OpenGL ES API supports only a limited set of programmable function pipelines originally designed to render 3D graphics, it can nevertheless be used to implement image processing functions. Furthermore, future GPU APIs are likely to provide more flexibility, which will help in implementing and mapping algorithms to the GPU. For the time being, the most obvious operations to accelerate using OpenGL ES are pixel-wise operations and geometric transformations such as warps and interpolations.
3.1 OpenGL ES

OpenGL (Open Graphics Library) is a cross-language, cross-platform standard API for producing 2D and 3D scenes from simple graphic primitives such as points, lines and triangles. OpenGL ES (OpenGL for Embedded Systems) is in turn a subset of the OpenGL 3D graphics API designed for embedded devices such as mobile phones, PDAs, and video game consoles. Currently, there are several versions of the OpenGL ES specification. OpenGL ES 1.0 is drawn up against the OpenGL 1.3 specification, while OpenGL ES 1.1 and OpenGL ES 2.0 are defined relative to the OpenGL 1.5 and OpenGL 2.0 specifications, respectively. OpenGL ES 2.0 is not backwards compatible with OpenGL ES 1.1. OpenGL ES 1.1 has been implemented in many mobile phones, some of which include GPU hardware. Newer phones on the market support OpenGL ES 2.0, increasing the capabilities of the GPU as an image processing unit. Several OpenGL ES profiles exist for each specification, with support for fixed-point or floating-point data types. The fixed-point types are supported due to the lack of hardware floating-point instruction sets on many embedded processors. Many functionalities of the original OpenGL API have been removed in OpenGL ES 1.0, although some minor functionalities have also been added. In comparison to OpenGL ES 1.0, version 1.1 adds support for multi-texture with combiners and dot-product texture operations, automatic mipmap generation, vertex buffer objects, state queries, user clip planes, and greater control over point rendering. The rendering pipeline is of fixed-function type. In practice, these features of OpenGL ES make it possible to use the graphics accelerator as a co-processing engine. General-purpose image processing capabilities are available via texture rendering.
The image data can be copied to the graphics memory, allowing the application of several matrix transformations and bilinear interpolations to the rendered texture. However, the overhead of copying images as textures to graphics memory results in significant slowdowns. OpenGL ES 2.0 eliminates most of the fixed-function rendering pipeline API in favor of a programmable one, and a shading language allows programming most of the rendering features of the transform and lighting pipelines. However, the images must still be copied to the GPU memory in a matching format, and the lack of shared video memory causes multiple accesses to the GPU memory to retrieve the data for the processing engine. Although a programmable pipeline enables the implementation of many general processing functions, the OpenGL ES APIs have several limitations. The most important one is that the GPU is forced to work in single-buffer mode to allow read-back of the rendered textures. Other shortcomings include the need to use power-of-two textures and the restricted types of pixel data.
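The power-of-two texture restriction mentioned above is commonly worked around by padding the image before upload; the sketch below illustrates one way to do this (the padding strategy is our assumption, not a step the paper describes).

```python
def next_pow2(n):
    """Smallest power of two greater than or equal to n."""
    p = 1
    while p < n:
        p <<= 1
    return p

def pad_to_pow2(img, fill=0):
    """Pad a row-major grayscale image (list of lists) to
    power-of-two width and height, as required by texture units
    that lack non-power-of-two support."""
    h, w = len(img), len(img[0])
    H, W = next_pow2(h), next_pow2(w)
    return [row + [fill] * (W - w) for row in img] + \
           [[fill] * W for _ in range(H - h)]
```

For example, a CIF frame (352x288) would be padded to a 512x512 texture, at the cost of wasted texture memory.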
3.2 OpenCL Embedded Profile

OpenCL (Open Computing Language) is an open standard for parallel programming of heterogeneous systems. It consists of an API for coordinating parallel computation across different processors and a cross-platform programming language with a well-specified computation environment. It was conceived by Apple Inc.,
which holds the trademark rights, and established as a standard by the Khronos Group in cooperation with others; the language is based on C99. Its name deliberately recalls OpenGL and OpenAL, the open industry standards for 3D graphics and computer audio, and its purpose is to extend the power of the GPU beyond graphics, facilitating general-purpose computation on graphics processing units (GPGPU). In the OpenCL model, the high-performance resources are considered Compute Devices connected to a host. The standard supports both data-parallel and task-parallel programming models, utilizing a subset of ISO C99 with extensions for parallelism. OpenCL is defined to inter-operate efficiently with OpenGL and other graphics APIs. The currently supported hardware ranges from CPUs, GPUs and DSPs (Digital Signal Processors) to mobile CPUs such as ARM. Through OpenCL, multiple tasks can be configured to run in parallel on all processors in the host system, and the resulting code is portable across a number of devices. The specification is divided into a core that any OpenCL-compliant implementation must support, and an embedded profile that relaxes the OpenCL compliance requirements, such as data types and precision, for hand-held and mobile devices. OpenCL defines a set of functions and extensions that must be implemented by hardware vendors. Vendors should provide the compiler and other tools to allow the execution of OpenCL code on their specific hardware. OpenCL implemented on an embedded system will allow the distribution of highly parallel tasks across all the processing units present on a chipset (CPU, GPU, DSP, ...). Figure 4 compares three different computational models. The OpenCL model can make use of available shared memory to reduce the number of memory read-backs while keeping the data processing highly parallel.11
Figure 4. The figure shows three computational models. The use of shared memory reduces the number of read-backs.
The execution of image processing algorithms in a single OpenCL kernel offers smaller execution overheads, lower memory bandwidth requirements and better performance than CPU-only or pure OpenGL ES implementations. Full implementations of the OpenCL standard have recently been released for the PC market and desktop GPUs, while the existing embedded profile implementations are likely to be integrated into mobile device development environments in the near future.
4. GPU ACCELERATED IMAGE RECOGNITION

Traditional approaches to image recognition tend to follow a sequential path with multiple accesses to the memory of the processing unit. GPU approaches require an understanding of the nature of the data and careful tailoring of the algorithms. The graphics pipeline of the OpenGL ES 2.0 model is still not fully programmable, as it has only two programmable stages (shaders). Nevertheless, it provides enough flexibility to implement computer vision algorithms such as the LBP. Figure 5 depicts a typical graphics pipeline. Mobile phones currently on the market have not yet been designed with the use of the GPU as a general-purpose processor in mind. Image processing algorithms that use the camera as the main data source lack fast ways of transferring data between processing units and capture or storage devices. In this context, to map the algorithms properly onto the GPU, the data must be copied to the video memory in the specific form of OpenGL textures.
Figure 5. The graphics pipeline allows data streams to be processed with fixed or programmable functions, from the texture data to a rendered image in the frame buffer.
After the data transfer, the first programmable step of the graphics pipeline is the vertex shader. Vertex shaders operate on vertices, described as points in a three-dimensional space. The vertex shader is used in graphics processing to perform different kinds of matrix multiplications on the vertices, such as projection and viewpoint transformations. Properly designed, however, the vertex shader can be used to transform the coordinates of a quad through matrix multiplications. These operations can result in scalings, rotations or warpings, or simply pass the texture coordinates forward. The next step is the primitive assembly, where primitives such as triangles, lines and points are formed from the vertices. In our application the input picture is mapped onto a polygon as a texture, so in this case the primitive assembly forms a two-triangle quad that matches the desired rendering surface. Next, the primitives are drawn and a 2D picture is formed; this is called rasterization. The second programmable step is the fragment shader, which operates on fragments (pixels) and can be used to perform operations such as the LBP calculation. After this the data goes through various per-fragment operations before reaching the frame buffer. These operations include the pixel ownership, scissor, depth and stencil tests, which basically determine whether the pixel should be visible or not. In addition, blending and dithering operations are applied. While the quad is textured, bilinear interpolations for each pixel are calculated in parallel on the GPU. The rendering surface is then copied back to the main memory as a native bitmap.
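The coordinate transformation performed by the vertex shader amounts to a matrix multiplication per quad corner. The sketch below illustrates the arithmetic on the CPU side in plain Python; it is not shader code, and the full-screen quad coordinates are just the usual normalized-device-coordinate convention.

```python
def scale_quad(vertices, sx, sy):
    """Apply a 2x2 scale matrix to each (x, y) vertex of a quad,
    mirroring the matrix multiplication a vertex shader would
    perform on the rendering surface's corners."""
    m = ((sx, 0.0), (0.0, sy))
    return [(m[0][0] * x + m[0][1] * y,
             m[1][0] * x + m[1][1] * y) for x, y in vertices]

# A full-screen quad in normalized device coordinates.
quad = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
```

Replacing the scale matrix with a rotation or a general affine matrix yields the rotations and warpings mentioned above; the rasterizer then fills the transformed quad with bilinearly interpolated texture samples.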
4.1 LBP Fragment Shader Code

In our implementation, the fragment shader program accesses the input picture via texture lookups, calculates the LBP value and then sends it forward. The LBP algorithm can be implemented efficiently on a CPU with bit operations such as shifts and bit-wise ANDs,6 but since the OpenGL ES 2.0 shader pipeline uses only floating-point numbers, such methods are not suitable in this case. A straightforward solution is to apply brute force and form the LBP value by multiplying the binary number's bits with their corresponding weight factors and then summing all the products together. Fortunately, the OpenGL ES 2.0 Shading Language offers built-in functions for this purpose. The LBP algorithm has been implemented before on a desktop GPU by Zolynski et al.4 In that approach, the LBP value of each pixel is calculated four times using different sample radii, combining the values to compose a single RGBA LBP. That shader algorithm is nevertheless quite similar to ours, since it also uses floating-point numbers. Two different implementations of the LBP algorithm have been developed. The first version (version 1) takes in a basic 8-bits-per-pixel intensity picture, and the alternative version (version 2) takes in a 32-bits-per-pixel RGBA picture. The second one can be used in various ways. For example, a regular intensity picture can be divided into four sections that are assigned to the different color channels. Another option is to simply put four different gray-scale pictures into these channels. Since the texture lookup function always returns the values of all four channels, even if the input picture has only one channel, the second version offers better performance. However, it requires some preparation of the input picture. The first approach, on the other hand, can be used as such, with no preprocessing of the input picture. Versions 1 and 2 are presented in Figure 6 as pseudo-code.
Figure 6. Pseudo-code for versions 1 (left) and 2 (right) of the fragment shader.
Neither of the algorithms uses any dynamic branching, which means that the fragment shader performs the same calculations on every pixel, allowing maximum parallelization. Both algorithms are also vectorized and can be implemented with built-in functions, which enhances portability and ensures the best performance on any OpenGL ES 2.0 platform. In version 1, the first operation fetches the selected pixel's value and the second its neighbors' values. Next, the built-in OpenGL ES 2.0 function step returns a vector of ones and zeros corresponding to the relations of the pixels' values. The LBP value can then be computed by calculating a dot product between the binary vector and the weight factor vector. Version 2 works in a similar way, but with scalars changed into vectors and vectors into matrices. The neighborhood matrix includes all the channels of all the neighbors. The step function compares all the channels independently and returns a matrix consisting of all the thresholded values of all the channels of all the neighbors. The LBP values of all the channels can then be calculated by multiplying the binary matrix by the weight factor vector.
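The step-then-dot-product trick of version 1 can be illustrated outside the shader. The following is a plain-Python emulation, assuming normalized intensities in [0, 1] and the power-of-two weight vector implied by the text; the assignment of each neighbor to a particular weight is a convention.

```python
# Weight factors: one power of two per neighbor.
WEIGHTS = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]

def step(edge, vec):
    """Emulates the GLSL ES step() built-in: 1.0 for each component
    that is greater than or equal to edge, 0.0 otherwise."""
    return [1.0 if v >= edge else 0.0 for v in vec]

def dot(a, b):
    """Emulates the GLSL ES dot() built-in for these 8-vectors."""
    return sum(x * y for x, y in zip(a, b))

def lbp_shader_v1(center, neighbors):
    """Version-1 logic: threshold the eight neighbor intensities
    against the center pixel, then sum the weighted binary vector."""
    return dot(step(center, neighbors), WEIGHTS)
```

Because both step and dot operate component-wise with no branches, every fragment follows an identical instruction stream, which is exactly what allows the GPU to keep all shader units busy.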
4.2 Image Pre-Processing and Post-Processing

The implementation of LBP feature extraction requires several pre-processing tasks on the input images. A typical mobile camera interface produces RGB images that need to be converted to a suitable gray-scale format. This costly process can be parallelized in a straightforward manner, in several steps, with any fixed or programmable graphics pipeline. In OpenGL ES 2.0, to map the warping algorithm properly onto the GPU, the camera frame data must be copied to the video memory in the specific form of OpenGL textures. A two-triangle quad matching the desired rendering surface, which will be copied to the destination image, must be created. The vertices can then be transformed by applying the desired changes of scale. While the quad is textured, bilinear interpolations for each pixel can be calculated in parallel on the GPU. The rendering surface should be copied back to the main memory as a native bitmap. Figure 7 shows the preprocessing algorithm for the second version of the LBP shader.
Figure 7. Preprocessing algorithm scheme.
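The gray-scale conversion step of the preprocessing chain is a per-pixel dot product. The sketch below uses the BT.601 luma weights, which are a common convention; the paper does not state which coefficients its shader uses, so treat them as an assumption.

```python
def rgb_to_gray(rgb):
    """Convert a row-major RGB image (tuples of 0-255 values) to
    grayscale. On the GPU this is one dot product per fragment;
    the BT.601 luma weights below are an assumed convention."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]
```

Because each output pixel depends only on its own input pixel, the conversion maps trivially onto the fragment shader and runs fully in parallel.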
In the case of the second version of the LBP implementation, the required preparation of the input images can also be included in the same programmable stage as the gray-scale conversion and scaling. The process consists of creating a quad of only a fourth of the desired image size. The uploaded texture can be divided into four sections, each used to texture one of the RGBA channels of the rendered quad. The result is suitable as the input of the LBP computation. The inverse procedure can be implemented as a post-processing stage, where a full-size image is obtained by rendering each of the RGBA channels of the LBP image into a single quad in gray-scale format.
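The channel-packing preparation can be sketched as follows. The quadrant layout chosen here is an assumption for illustration; the paper does not specify how the four sections are arranged across the channels.

```python
def pack_quadrants(gray):
    """Pack the four quadrants of an even-sized grayscale image into
    the four channels of a half-width, half-height RGBA image, as a
    preparation step for the four-channel LBP shader. The quadrant
    layout is an assumed convention."""
    h, w = len(gray), len(gray[0])
    hh, hw = h // 2, w // 2
    return [[(gray[y][x],            # R: top-left section
              gray[y][x + hw],       # G: top-right section
              gray[y + hh][x],       # B: bottom-left section
              gray[y + hh][x + hw])  # A: bottom-right section
             for x in range(hw)]
            for y in range(hh)]
```

The post-processing stage performs the inverse mapping, spreading each channel of the quarter-size LBP image back to its quadrant of the full-size result.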
5. EXPERIMENTAL RESULTS

Our experiments were made on three different platforms. The main development platforms are all single-board computers built around a device from the Texas Instruments OMAP3 family. The platforms, a BeagleBoard revision C, a Zoom AM3517 EVM board and a Nokia N900 mobile phone, include an ARM Cortex-A8 CPU and a PowerVR SGX535 GPU integrated in an OMAP3530 platform. Execution time and power consumption were measured with small (64x64), medium (512x512) and large (1024x1024) input pictures on the CPU and GPU implementations separately, and in parallel with both versions of the shader code. The purpose of the multiple image sizes was to identify dependencies on cache efficiency and the level of parallelization. The speed was measured using software functions provided by the operating system. The energy consumption was measured on the BeagleBoard and the EVM board with a digital multimeter tapped between the board and its power supply. To obtain the correct values of the energy demands, the measured stand-by consumption, with only the operating system running, was subtracted from the measured values. On the N900 mobile phone, the energy consumption was calculated from the battery level information provided by the Maemo operating system. As expected, the speed and power consumption values for all the tested platforms are in a similar range. All the GPU implementations were developed in OpenGL ES 2.0, while the reference CPU implementations are written in highly optimized C code.6
5.1 Speed

The test results of the CPU and both GPU implementations are presented in Table 1. The CPU implementation is faster at every size but, as expected, the GPU performance improves as the picture size grows. This is probably due to improved parallelization. With an even bigger picture size, the GPU would probably become faster at some point. However, the low overall speed makes this approach unusable in real-time applications. In version 2 of the LBP shader, the number of texture lookups is reduced to one quarter. This is a significant improvement compared with version 1, although the CPU is still faster. The preprocessing and scaling of the input picture were not taken into account, but in long image analysis pipelines, where preprocessing has to be done only once, this approach can be a more suitable method.

Table 1. Average execution times per frame of both versions of the shader algorithm.
Size             1024x1024   512x512   64x64
GPU v1.          232 ms      76 ms     2 ms
GPU v2.          180 ms      46 ms     1.5 ms
CPU              100 ms      25 ms     0.4 ms
CPU and GPU v1.  116 ms      37 ms     1 ms
CPU and GPU v2.  90 ms       23 ms     0.2 ms
For both implementations, when the CPU and the GPU are used to compute two frames in parallel, the execution time is bound by the slower device, in this case the GPU. This means that the GPU and the CPU have separate resources and their actions do not affect each other. Although the GPU is slower than the CPU at the platform level, improved performance can be achieved when both are utilized concurrently. If the data transfer times are taken into account, a carefully tailored scheduling policy that distributes frames between the two computation devices in the most suitable manner could dramatically improve the performance of the simultaneous configuration. The test results of the preprocessing and scaling algorithm are presented in Table 2. We can observe that pixel-wise algorithms that require floating-point operations run around three times faster on the GPU than on the CPU. Again, the computation of the pre-processing stage on both the CPU and GPU is bound by the slower one. In this case, the time spent by the CPU on data transfers is added to the final computation time. If we consider the combination of preprocessing, scaling and LBP computation for one frame, the GPU implementation can match the performance of the CPU alone. Table 3 shows the results of the combined algorithms (preprocessing, scaling and version 2 of the LBP) for several CPU and GPU task distributions.
Table 2. Average execution times per frame of the preprocessing and scaling algorithm.
Size         1024x1024   512x512   64x64
GPU          35 ms       10 ms     0.2 ms
CPU          100 ms      25 ms     0.4 ms
CPU and GPU  107 ms      30 ms     0.8 ms
Table 3. Average execution times per frame of combined preprocessing and LBP.
Size                      1024x1024   512x512   64x64
GPU alone                 215 ms      56 ms     1.8 ms
CPU alone                 205 ms      50 ms     1 ms
GPU preproc. and CPU LBP  142 ms      40 ms     0.8 ms
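The "bound by the slower device" behavior can be checked against Table 1 with a one-line model: when each device computes one frame concurrently, both frames finish when the slower device does. This is a simplification that ignores data transfer overheads.

```python
def parallel_ms_per_frame(t_cpu_ms, t_gpu_ms):
    """Average time per frame when the CPU and the GPU each compute
    one frame concurrently: two frames complete when the slower
    device finishes, so the per-frame time is half its time."""
    return max(t_cpu_ms, t_gpu_ms) / 2.0

# Version 1 at 1024x1024 (Table 1): CPU 100 ms, GPU 232 ms
# -> 116 ms per frame, matching the measured "CPU and GPU v1." row.
```

The same model reproduces the version 2 rows (e.g. max(25, 46) / 2 = 23 ms at 512x512), supporting the conclusion that the two devices do not contend for resources.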
5.2 Power consumption

The measured power consumption of the CPU implementation and of both versions of the shader algorithm is shown in Table 4. The BeagleBoard consumes barely over 1450 mW when it is not running any applications but only the Angstrom Linux operating system. We see that the increase in power consumption is about 100-150 mW for the GPU implementations and about 190 mW for the CPU implementation.

Table 4. Average power consumption of the LBP calculations.
Size 1024x1024 512x512 64x64
GPU v1.          150 mW   130 mW   110 mW
GPU v2.          130 mW   130 mW   110 mW
CPU              190 mW   190 mW   190 mW
CPU and GPU v1.  340 mW   320 mW   300 mW
CPU and GPU v2.  320 mW   320 mW   300 mW
The CPU implementation demands the same power at every picture size, but the GPU implementation's consumption decreases slightly for smaller pictures, due to improved cache utilization. When the CPU and the GPU are used in parallel, the increase in power consumption is about 260-300 mW, which is directly the sum of the consumptions of the GPU and CPU implementations. Table 5 shows the energy consumption per frame, calculated from the power consumption data and the average computation time. We can observe that the energy efficiency of the GPU depends strongly on the algorithm type and mapping. Mobile GPUs are usually designed to offer high energy efficiency per instruction.12 However, this advantage is most relevant in algorithms that only require floating-point per-pixel operations.
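The conversion from the power and timing tables to per-frame energy is simply power times time; the helper below only illustrates the unit bookkeeping, with round illustrative numbers rather than the measured Table 5 values.

```python
def energy_mj(power_mw, time_ms):
    """Energy per frame from average extra power draw and per-frame
    computation time: mW x ms gives microjoules, so divide by 1000
    to obtain millijoules."""
    return power_mw * time_ms / 1000.0

# Illustration: a 130 mW extra draw sustained for 100 ms of
# computation costs 13 mJ per frame.
```
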
6. SUMMARY

We have presented the first mobile-GPU implementation of Local Binary Pattern feature extraction. An earlier study has shown that on a desktop platform, a GPU implementation can be over ten times faster than a corresponding CPU implementation.4 However, modern desktop GPUs can include over two hundred shading processors, allowing massive parallelization. Although the exact number of shading pipelines in the SGX535 is not known, our study shows that it does not yet offer enough parallelism to match the CPU's performance in this application. Furthermore, the CPU implementation used as a reference6 is highly optimized and its word-length utilization has been carefully designed, while the OpenGL ES 2.0 implementation relies on built-in functions and its word-length utilization is low, since it uses 32-bit floating-point numbers for binary arithmetic. On the other hand, preprocessing tasks based on pixel-wise operations and matrix multiplications offer high performance that can clearly surpass a CPU implementation and prove more suitable for the GPU computation model.
Table 5. Average energy consumption per 1024x1024 frame.
Size   Preprocessing   LBP      Combined algorithm
GPU    27 mJ           5.3 mJ   32.3 mJ
CPU    19 mJ           19 mJ    28 mJ
In addition, the use of a mobile GPU decreases the workload of the application processor and proves very useful when other tasks need to be done during the LBP computation, especially with long image pipelines that reduce the data exchange between computing entities. Moreover, GPU implementations are highly scalable, and future mobile GPUs will likely surpass CPU implementations. Based on our experiments, using the GPU simultaneously with the application processor clearly improves the performance of image recognition applications while increasing their energy efficiency. However, the GPUs on mobile device platforms are a rather recent addition, primarily intended for displaying graphics. In the future, emerging compute standards such as OpenCL will offer more flexibility for image processing.
ACKNOWLEDGMENTS We would like to thank Nokia Research Center Tampere for their funding and support. We would also like to thank the Texas Instruments University Program for the donation of the research equipment.
REFERENCES
[1] Heymann, S., Müller, K., Smolic, A., Fröhlich, B., and Wiegand, T., "SIFT implementation and optimization for general-purpose GPU," International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (2007).
[2] Cornelis, N. and Van Gool, L., "Fast scale invariant feature detection and matching on programmable graphics hardware," IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '08), 1-8 (2008).
[3] Sinha, S. N., Frahm, J. M., Pollefeys, M., and Genc, Y., "GPU-based video feature tracking and matching," Workshop on Edge Computing Using New Commodity Architectures (2006).
[4] Zolynski, G., Braun, T., and Berns, K., "Local binary pattern based texture analysis in real time using a graphics processing unit," VDI Wissensforum GmbH - Proceedings of Robotik (2008).
[5] Hadid, A., Pietikäinen, M., and Ahonen, T., "A discriminative feature space for detecting and recognizing faces," Proc. IEEE Conference on Computer Vision and Pattern Recognition, Washington, D.C. (2004).
[6] Mäenpää, T., Turtinen, M., and Pietikäinen, M., "Real-time surface inspection by texture," Real-Time Imaging 9(5), 289-296 (2003).
[7] Hannuksela, J., Silvén, O., Ronkainen, S., Alenius, S., and Vehviläinen, M., "Camera assisted multimodal user interaction," Proc. SPIE Multimedia on Mobile Devices 7542, 754203 (2010).
[8] Ojala, T., Pietikäinen, M., and Harwood, D., "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition 29(1), 51-59 (1996).
[9] Freund, Y. and Schapire, R. E., "A decision-theoretic generalization of on-line learning and an application to boosting," Proceedings of the Second European Conference on Computational Learning Theory, 23-37 (1995).
[10] Ahonen, T., Hadid, A., and Pietikäinen, M., "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (2006).
[11] Bordallo, M., Hannuksela, J., Silvén, O., and Vehviläinen, M., "Graphics hardware accelerated panorama builder for mobile phones," Proc. SPIE Multimedia on Mobile Devices 7256, 72560D (2009).
[12] Akenine-Möller, T. and Ström, J., "Graphics processing units for handhelds," Proceedings of the IEEE 96(5), 779-789 (2008).