Multimedia Tools and Applications manuscript No. (will be inserted by the editor)
Interactive multi-frame reconstruction for mobile devices

Miguel Bordallo López · Jari Hannuksela · Olli Silvén · Markku Vehviläinen
Received: date / Accepted: date
Abstract The small size of handheld devices, their video capabilities and their multiple cameras are under-exploited assets. Properly combined, these features can be used to create novel applications that are ideal for pocket-sized devices but may be of little use on laptop computers, such as interactively capturing and analyzing images on the fly. In this paper we consider building mosaic images of printed documents and natural scenes from low-resolution video frames. High interactivity is provided by giving real-time feedback on the video quality while simultaneously guiding the user's actions. In our contribution, we analyze and compare means to reach interactivity and performance with sensor signal processing and GPU assistance. The viability of the concept is demonstrated on a mobile phone. The achieved usability benefits suggest that combining interactive imaging and energy-efficient high-performance computing could enable new mobile applications and user interactions.

Keywords multi-frame reconstruction · mobile device · mobile interactivity
M. Bordallo López · J. Hannuksela · O. Silvén
Center for Machine Vision Research, University of Oulu, 90570 Oulu, Finland
Tel.: +358-458-679699
E-mail: [email protected]

M. Vehviläinen
Nokia Research Center, Tampere, Finland

1 Introduction

Mobile communication devices are becoming attractive platforms for multimedia applications as their display and imaging capabilities improve together with their computational resources. Many of the devices are increasingly equipped with built-in cameras that allow users to capture high-resolution still images as well as lower-resolution video frames. The capabilities of mobile phones in portable imaging applications are on par with or exceed those of laptop computers, despite the order-of-magnitude disparity between their computing power budgets. Table 1 points out the versatility of the hardware in handhelds in comparison to laptops. The size and semi-dedicated interfaces of handheld devices are significant benefits over platforms based on general-purpose personal computer technology,
despite their apparent versatility. On the other hand, even the most recent mobile communication devices have not used their multimedia and computing resources in a novel manner, but merely replicate the functionalities already provided by other portable devices, such as digital still and video cameras. The popularity of laptop PCs and portable video players as a means to access multimedia content via WiFi or 3G networks has also clearly influenced handheld application designs. Consequently, most handhelds rely on keypad and pointer user interfaces, while their applications use content provided via the Internet to supplement locally stored music, movies and maps. The users can also create images and video content and stream it to the network for redistribution. As more and more applications are crammed into handheld devices, their limited keypads and small displays are becoming overloaded, potentially confusing the user, who needs to learn how to use each individual application. Based on the personal experience of most people, increasing the number of buttons, as with remote control units, is not the best solution from the usability point of view. As a result, dedicated applications have become popular, replacing even web browsers in accessing specific services. The full keyboard, touchpad or mouse, and higher-resolution displays of laptop PCs appear to give them clear benefits as platforms for multiple simultaneous applications. However, the small size of handheld devices and their multiple cameras are under-exploited assets. Properly combined, these characteristics can be used for novel user interfaces and applications that are ideal for handhelds, but may be considered less suitable for laptop computers. Fig. 1 shows two mobile devices with their cameras and sensors.
Fig. 1 Current mobile devices pack several cameras, sensors, keys and displays into a very small size.
Touch-sensitive screen interaction is often viewed as the solution for interacting with mobile devices. However, it usually requires both hands and can cause additional attention overhead [7]. On the other hand, camera-based interfaces can provide single-handed operation in which the user's actions are recognized without having to interact with the screen or keypad.
Table 1 Characteristics of typical laptop computers and handheld mobile devices.

Characteristic                  Laptop computer       Handheld device    Typical ratio
Still image resolution          up to 2 Mpixel        up to 12 Mpixel    0.20x
Number of displays              1                     1–2                0.5x
Number of cameras               0–1                   1–3                0.5x
Video resolution (display)      1920x1080/30Hz        1920x1080/30Hz     1x
Display size (inches)           12–15                 2–4                5x (area 20x)
Processor clock (GHz)           1–3.5                 0.3–1.2            3–10x
Display resolution (pixels)     1024x768–2408x1536    176x208–960x640    12x
Processor DRAM (MB)             1024–8192             64–1024            16x
In our contribution, we show how image sequences captured by the cameras of mobile phones can be used for new, intuitive applications and user interface concepts. We also analyze the observed platform-dependent limitations and the features of future interfaces that could help in implementing vision-based solutions. The key ideas rest on the utilization of the handheld nature of the equipment and the analysis of video frames captured by the device's camera. In this context, we highlight the application development challenges and trade-offs that need to be dealt with on battery-powered devices, showing how the graphics processing units of those devices can be utilized to accelerate computer vision algorithms and improve their energy efficiency. A multi-frame reconstructor is described as an example application which can benefit from the enriched user experience. We analyze building mosaic images of printed documents and natural scenes using the mobile phone camera in a highly interactive manner. We describe an intuitive user interaction framework which utilizes quality assessment and feedback in addition to motion estimation. The paper is organized as follows. In Section 2 we introduce related work on vision-based user interfaces on mobile devices and mobile implementations of multi-frame applications. Section 3 discusses the challenges of camera-based interactivity and its implications for the design of algorithms and applications. As a case study, an interactive real-time multi-frame reconstructor is described in Section 4. Section 5 highlights the application development challenges and trade-offs that need to be dealt with on battery-powered devices and discusses desirable future platform developments for interactive applications. The use of a GPU is considered to reduce the computational load of camera-based applications. A performance evaluation of the system on a mobile device is described in Section 6. Finally, Section 7 summarizes the paper and discusses possible future directions.
2 Related work

Cellular phone cameras have traditionally been designed to replicate the functionalities of digital compact cameras. They have been included as almost stand-alone subsystems rather than as an integrated part of the device interfaces. However, some work has been carried out to incorporate the camera systems as a crucial part of vision-based mobile interactive applications. In 2003, Siemens introduced an augmented reality game called Mozzies, developed for their SX1 cell phone. This was probably the first mobile phone application utilizing the camera as a sensor. The goal of the game was to shoot down synthetic flying mosquitoes projected onto a real-time background image by moving the phone around and clicking at the right moment. While the user executed an action, the motion of the phone was
recorded using a simple optical flow technique. Figure 2 depicts a Nokia N95 phone running a Mozzies-type application.
Fig. 2 A camera-based mosquito-killing game, similar to the Mozzies game included in the Siemens SX1 device.
Since Mozzies, the multimedia capabilities of mobile phones have advanced significantly. Mobile phones with high-resolution digital cameras are now inexpensive, widely available, and very popular. The rapid evolution of image sensors and computing hardware on mobile phones has made it attractive to apply computer vision techniques to create new user interaction methods, and a number of solutions have been proposed [7].

2.1 Vision-based mobile interactivity

Much of the previous work on vision-based user interfaces with mobile phones has utilized measured motion information directly for controlling purposes. Figure 3 depicts three previously implemented camera-based interaction methods. For instance, Möhring et al. [21] presented a tracking system for augmented reality on a mobile phone that estimates the 3-D camera pose using special color-coded markers. Other marker-based methods used a hand-held target [13] or a set of squares [37] to facilitate the tracking task. A solution presented by Pears et al. [23] uses a camera on the mobile device to track markers on a computer display. This technique can compute which part of the display is viewed and determine the 6-DOF position of the camera with respect to the display. An alternative to markers is to estimate motion between successive image frames with methods similar to those commonly used in video coding. For example, Rohs [28] divided incoming frames into a fixed number of blocks and then determined the relative x, y, and rotational motion using a simple block-matching technique. Another possibility is to extract distinctive image features, such as edges and corners, which exist naturally in the scene. Haro et al. [17] have proposed a feature-based method to estimate movement direction and magnitude. Instead of using local features, some approaches extract global features such as integral projections of the image [1].
Fig. 3 Mobile implementations of three examples of camera-based interaction: a) color-coded markers, b) handheld markers, c) ego-motion image browsing.
A recent and generally interesting direction for mobile interaction is to combine information from several different sensors. In their feasibility study, Hwang et al. [18] combined forward and backward movement and rotation around the Y axis, obtained from camera-based motion tracking, with tilts about the X and Z axes from a 3-axis accelerometer. In addition, a technique to couple wide-area, absolute, low-resolution global data from a GPS receiver with local tracking using feature-based motion estimation was presented by DiVerdi and Höllerer [9].
2.2 Multi-frame reconstruction

The approaches described above have utilized camera motion estimation to improve user interaction on mobile devices. Mobile phones equipped with a camera can also be used for interactive computational photography or multi-frame reconstruction. Adams et al. [1] presented an online system for building 2D panoramas. They used viewfinder images for triggering the camera whenever it is pointed at a previously uncaptured part of the scene. Ha et al. [12] also introduced an auto-shot interface to guide mosaic creation using device motion estimation. Other panorama creation applications include the work of Xiong and Pulli [38] and Wagner et al. [35]. Kim and Su [20] used a recursive method for constructing super-resolution images that is applicable to mobile devices. Another super-resolution technique, based on soft learning priors, can be seen in the work of Tian et al. [33]. On mobile devices, Bilcu et al. [3] proposed a technique for creating high-resolution, high-dynamic-range images, while Gelfand et al. [11] proposed the fusion of multi-exposure images to increase the quality of the resulting images. A good survey of work on mobile multi-frame techniques can be found in the work of Pulli et al. [25].
2.3 GPU-based computer vision

Using mobile GPUs for multimedia applications and computer vision is an attractive option. The work of Kalva et al. [19] presents a good tutorial on the advantages and shortcomings of GPU platforms when developing multimedia applications, while Fung et al. [10] explain how to use the GPU to perform computer vision tasks. Pulli et al. [24] analyze the use of GPU-based computer vision for real-time applications by studying the performance under an OpenCV environment. The use of GPUs as general-purpose-capable processors has not yet been extensively considered on mobile phones. However, some work can be found in the literature.
The work of Seo et al. [29] describes a 3D tracking application that uses the mobile GPU to accelerate certain parts of the algorithm, such as Canny edge detection. Singhal et al. [31] analyze the performance of several computer vision algorithms on a handheld GPU. The recent work of Wang et al. [36] uses a mobile GPU-CPU platform to construct an energy-efficient face recognition system. Our previous work evaluates the use of a handheld GPU to assist image recognition applications [5] and document stitching [4].
3 Camera-based interactivity

The usability of camera-based applications critically rests on latency. This becomes apparent with computer games, in which action-to-display delays exceeding about 100-150 ms are considered disturbing [8]. This applies even to key-press-to-sound or key-press-to-display delays. If we employ a camera as an integral real-time application component, its integration time adds to the latency, as does the image analysis computing. If we sample the scene at a 30 frames/second rate, our base latency is 33 ms. Assuming that the integration time is 33 ms, the information in the pixels read from the camera is on average 17 ms old for a typical rolling shutter scheme. As the computing and display/audio latencies need to be added, staying within the 100-150 ms range is challenging. Vision-based interactivity requires always-on cameras, which may compromise the battery life. Consequently, it is advisable to turn the cameras on only when needed. This interactivity issue can be alleviated by predicting the user's intentions. In the most typical case, this involves recognizing the raising of the device in front of the face to a horizontal pose to capture an image.
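For illustration, the following sketch tallies the latency budget discussed above. The figures are the example values from the text (30 frames/second capture, a 100-150 ms perceptual limit); the processing and display terms are placeholders to be filled in for a concrete application.

    # Rough latency budget for a camera-based interface, using the example
    # figures discussed above. All values are illustrative assumptions.
    FRAME_PERIOD_MS = 1000.0 / 30.0          # sampling at 30 fps -> ~33 ms base latency
    AVG_PIXEL_AGE_MS = FRAME_PERIOD_MS / 2   # rolling shutter: pixels ~17 ms old on average
    PERCEPTUAL_LIMIT_MS = 150.0              # upper end of the 100-150 ms threshold [8]

    def remaining_budget_ms(processing_ms, display_ms):
        # What is left of the perceptual budget after camera, processing and display.
        camera_ms = FRAME_PERIOD_MS + AVG_PIXEL_AGE_MS
        return PERCEPTUAL_LIMIT_MS - (camera_ms + processing_ms + display_ms)

    print(remaining_budget_ms(processing_ms=60.0, display_ms=16.7))  # ~23 ms of slack left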
3.1 Automatic application launching

The key ideas for the automatic launching of camera applications rest on the utilization of the hand-held nature of the equipment and the user being in the field of view of a camera [16]. We use the camera to detect whether the user is watching the device, which is often a good indication of interaction needs. Figure 4 illustrates the user handling the device to launch a camera application. Clearly, the recognition of this context benefits from the coupled use of motion and face sensing using the frontal camera, provided that it is on all the time. The key lock is released, the back light is turned on, and the back camera is activated automatically. From the user's point of view it would be most convenient if the device automatically recognized the type of target that the user is expecting to capture, without demanding manual activation of any application. Several targets could be differentiated by, for example, showing a dialog box on the capture screen with the suggested options.
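As a rough illustration of this kind of trigger logic, the sketch below combines a tilt check with face detection on a front-camera frame. The OpenCV Haar-cascade detector, the axis convention and the thresholds are stand-ins chosen for the example, not the method used in [16].

    import cv2

    # Sketch of the launch trigger: the device is considered "raised for capture"
    # when the accelerometer reports a roughly horizontal pose and the front
    # camera sees a face. Detector and thresholds are illustrative assumptions.
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def should_activate_back_camera(front_frame_bgr, accel_xyz, tolerance=2.0):
        ax, ay, az = accel_xyz                           # m/s^2, assumed device axes
        roughly_horizontal = abs(az) > 9.81 - tolerance  # gravity mostly along Z
        gray = cv2.cvtColor(front_frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
        return roughly_horizontal and len(faces) > 0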
3.2 Interactive multi-frame reconstruction

To demonstrate camera-based interactivity, we have built examples of multiframe reconstruction applications. Multiframe reconstruction techniques can be included in several applications, such as scene panorama imaging, handheld document scanning, context recognition, high-dynamic-range composition, super-resolution imaging or digital zooming. Multiframe reconstruction is a process that merges the information obtained from several input frames into a single result.
Fig. 4 Automatic launching of a camera application. When the device is raised in front of the user and a face is in the field of view, the main camera starts the capture.
The result can be an image that presents an increased field of view or enhanced quality, but also a feature cloud with combined information obtained from the inputs that can be used, for example, in object recognition. While not a replacement for wide-angle lenses or flatbed scanners, a multiframe image reconstruction application running on a cellular phone platform is essentially an interactive camera-based scanner that can be used in less constrained situations. The handheld usage concept offers a good alternative, as users cannot realistically be expected to capture and analyze single-shot high-quality images of certain types of targets, such as broad scenes, three-dimensional objects, large documents, whiteboard drawings or posters. Instead, our approach relies on real-time user interaction and capturing of HD-720p resolution images that are registered and stitched together. In addition to interactivity benefits, the use of low-resolution video imaging can be defended on purely technical grounds. In low-resolution mode the sensitivity of the camera can be better, as the effective size of the pixels is larger, reducing the illumination requirements and improving the tolerance against motion blur. On the other hand, a single-shot high-resolution image, if captured properly, could be analyzed and used directly after acquisition, while the low-resolution video capture approach requires significant post-processing effort. Figure 5 shows the four steps of a multiframe reconstruction application. In the proposed interactive solution, the capture interface sends images to the frame evaluation subsystem and gives feedback to the user. The best frames are selected. The images are corrected, unwarped and interpolated. The final stage constructs the resulting image. The next section describes the implementation details of the interactive multi-frame reconstructor applications.
Fig. 5 The four steps of a multiframe reconstruction application. Image registration aligns the features of each frame. An image selection subsystem, based on quality assessment, identifies the most suitable input images. A correction stage unwarps and enhances the selected frames. A blending algorithm composes the result image by reconstructing the final pixels.
4 Implementation

4.1 Interactive capture and pre-registration

To reconstruct a single image from multiple frames, the user must capture several good-quality partial images. The input frames are often far from perfect, if not outright unsuitable for reconstruction. The most relevant problem when capturing several frames that are to be merged is maintaining the camera orientation and perpendicularity to the target across the set of captured frames. The user might involuntarily tilt or shake the camera, causing the frames to lack focus or to exhibit motion blur, which results in a low-quality reconstructed image. Because a handheld camera is used, it is difficult for the user to maintain a constant viewing angle and distance, so the user interaction scheme simply aims at capturing the target along a free scanning path. The key usability challenge of a handheld camera-based multiframe reconstructor is enabling and exploiting interactivity. For this purpose, our solution is to let the device interactively guide the user to move it during the capture [15]. The user starts the scanning by taking an initial image of some part of the target, for example a newspaper page or a whiteboard drawing. Then, the application instructs the user to move the device to the
next location. The scanning direction is not restricted in any manner, and a zig-zag style path can be used. Rotating the camera may be necessary to avoid and eliminate shadows or reflections from the target, and it is a practically useful degree of freedom. The allowed free scanning path is a very useful feature from the document imaging point of view; however, it sets significant computational and memory demands for the implementation and prevents building the final mosaics in real time. We have also developed a scene panorama application that limits the scanning path to a unidirectional one. With this approach, a mobile phone can be used to stitch images on the fly, with the resulting image growing in real time as frames are acquired [6]. The memory requirements are smaller, as not all selected frames need to be stored until the end of the panorama blending process. Figure 6 shows the typical problems present during the capture stage and the proposed solutions based on interactivity and quality assessment.
Fig. 6 The problems appearing during image acquisition and the proposed solutions. Involuntary tilting and shadows can be overcome with the help of the user if proper guidance is offered. Quality assessment can select the best frames and avoid possible moving objects present in the scene.
Each image is individually processed to estimate motion. The motion estimation is based on modified Harris corners and a best linear unbiased estimator. A detailed description of the subsystem can be found in the paper by Hannuksela et al. [14]. The blurriness of each picture is measured and any moving objects are detected. In multi-frame reconstructions, the regions with moving objects, typically shadows, are simply discarded. Based on the shutter time and the illumination-dependent motion blur, the user can be informed to slow down; when a suitable overlap between images has been achieved [15], a new image for stitching is selected from among the image frames based on quality assessment. The user can also be asked to back up, or can return to lower-quality regions later in the scanning process. As a result, good partial images of the target can be captured for the final stitching stage. The practical result of the interactive capture stage is a set of high-quality images that are pre-registered and aligned. The coarse frame registration information, based on the motion estimates computed during interactive scanning, is employed as the starting point in constructing the mosaic image. The strategy in scanning is to keep sufficient overlap between stored images to provision for frame re-registration using a highly accurate feature-based method during the final processing step.
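As an illustration of this kind of frame-to-frame motion estimation, the sketch below tracks Harris-type corners with OpenCV and takes the median displacement as the dominant translation. It is a simplified stand-in for, not a reproduction of, the modified-Harris and best-linear-unbiased-estimator method of [14]; all parameter values are illustrative.

    import cv2
    import numpy as np

    def estimate_translation(prev_gray, curr_gray):
        # Harris-type corners in the previous frame, tracked with pyramidal Lucas-Kanade.
        corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01,
                                          minDistance=8, useHarrisDetector=True)
        if corners is None:
            return np.zeros(2)
        tracked, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, corners, None)
        good = status.ravel() == 1
        if not good.any():
            return np.zeros(2)
        displacement = (tracked[good] - corners[good]).reshape(-1, 2)
        return np.median(displacement, axis=0)   # robust estimate of the dominant motion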
4.2 Re-registration and blending

After online image capture, the registration errors between the regions to be stitched can be on the order of pixels, which would be seen as unacceptable artifacts. In principle, it would be possible to perform accurate registration during image capture, but building the final document images will in any case require post-processing to adjust the alignments and scales. The fine registration employed for automatic mosaicking of document images is based on a RANSAC estimator with a SIFT feature point detector. In addition, graph-based global alignment and bundle adjustment steps are performed in order to minimize the registration errors and to further improve quality. Finally, the warped images are blended into the mosaic using simple Gaussian weighting. A more detailed description of the implementation can be found in the work of Hannuksela et al. [15]. Memory needs are a usual implementation bottleneck of fine registration with current mobile devices, limiting the size of the final mosaics and the number of input frames. It should be noted that with lower-resolution frames the registration and blending errors are easy to see and reveal any shortcomings of the methodology.
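The following sketch shows a SIFT-plus-RANSAC pairwise registration of the kind described above, written with OpenCV. The global alignment and bundle adjustment steps are omitted, and the matching parameters are illustrative rather than the values used in [15].

    import cv2
    import numpy as np

    def fine_register(img_a_gray, img_b_gray):
        # SIFT keypoints and descriptors in both images.
        sift = cv2.SIFT_create()
        kp_a, desc_a = sift.detectAndCompute(img_a_gray, None)
        kp_b, desc_b = sift.detectAndCompute(img_b_gray, None)
        # Nearest-neighbour matching with Lowe's ratio test.
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        matches = matcher.knnMatch(desc_a, desc_b, k=2)
        good = [m for m, n in matches if m.distance < 0.7 * n.distance]
        src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        # Homography mapping image A onto image B, robust to outliers via RANSAC.
        H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        return H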
4.3 Quality determination and frame selection

Taking pictures of documents with a handheld camera is often hampered by the self-shadow of the device, appearing as a moving region in the sequence of frames. In practice, regions with moving objects, whether they are shadows or something else, are not desirable when stitching the final image. Instead of developing advanced methods for coping with these phenomena, we mostly rely on user interaction to keep the problems from harming the reconstruction result. The treatment of shadows, reflections or moving objects depends on the type of scene that is processed. For natural scenes, if a moving object is present in a selected frame and fits within the sub-image, the image is blended by drawing a seam that is outside the boundaries
of the object. If only a partial object is present, the part of the frame without the object is the one that is blended. The individual frames are selected based on moving object detection and blur measures [6]. A blur detection algorithm estimates the image's sharpness by summing the derivatives of each row and each column. Motion detection is done in a very simple fashion to make the process fast. First, the difference between the current frame and the previous frame is computed. The result is a two-dimensional matrix that covers the overlapping area of the two frames. Then, this matrix is low-pass filtered to remove noise and thresholded against a fixed value to produce a binary motion map. If the binary image contains a sufficient number of pixels classified as motion, the dimensions of the assumed moving object are determined statistically. These operations are computed in real time to enable feedback to the user. However, as differences in the image content may distort the results, the accuracy of the motion estimates used for preliminary registration needs to be reasonable. In practice, this is a trade-off between computational cost, interactivity and quality. Increased accuracy implies that less overlap is needed between frames, which decreases the computing requirements.
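The sketch below implements the two measures in the form just described: the derivative-based sharpness score and the frame-difference motion map. The filter size, threshold and minimum pixel count are illustrative values, not the ones used in the application.

    import cv2
    import numpy as np

    def sharpness(gray):
        # Blur measure: sum of absolute row and column derivatives (larger = sharper).
        g = gray.astype(np.float32)
        return np.abs(np.diff(g, axis=0)).sum() + np.abs(np.diff(g, axis=1)).sum()

    def motion_map(curr_overlap, prev_overlap, diff_threshold=25, min_motion_pixels=500):
        # Difference of the overlapping areas of two consecutive frames.
        diff = np.abs(curr_overlap.astype(np.float32) - prev_overlap.astype(np.float32))
        smoothed = cv2.blur(diff, (5, 5))        # low-pass filtering removes noise
        motion = smoothed > diff_threshold       # fixed-value threshold -> binary map
        if motion.sum() < min_motion_pixels:
            return None                          # no significant moving object
        ys, xs = np.nonzero(motion)              # rough extent of the moving object
        return xs.min(), ys.min(), xs.max(), ys.max()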
5 Interactivity and energy efficiency

5.1 Architectural maximization of stand-by and active-state battery life

The stand-by and active-state battery lives of a mobile device are interconnected. High stand-by power consumption means that active use regularly starts with a partially charged battery. As this is a recognized usability issue, designers optimize for low stand-by currents, primarily by turning off sub-systems such as motion sensors and cameras whenever possible. However, this exposes another usability issue, as the responsiveness of the device to interaction can be compromised. For instance, the device may be unable to detect its handling in the stand-by state. The roots of the problem are in the involvement of the application processor of the platform, and we see this as an argument for dedicated camera and sensor processors. The availability of a fast application processor enables the straightforward implementation of novel camera-based applications and even vision-based user interfaces. On the other hand, the versatility and easy programmability of the single-processor solution have led to design decisions that compromise the battery life if high interactivity is needed. In the active state a fast application processor may consume more than 900 mW with memories, while the whole device can go up to 3 W, a limit above which it becomes too hot to handle. This can push the battery life below one hour. The background is in the typical top-level hardware organization of a current mobile communications device with multimedia capability, such as the example that we propose in Figure 7(a). Most of the application processing functionality, including the camera and display interfaces, has been integrated into a single system chip. The baseband and mixed-signal processing, such as the power supply and analog sensor control, have their own subsystems. For instance, with the design of Figure 7(a) the accelerometer measurements cannot be kept on all the time due to the power-hungry processor. In comparison, sports watches that include accelerometers operate at sub-mW power levels, thanks to their very small footprint processors.
In practice, the bulk of the sensor processing needs to be moved to dedicated low-power subsystems that can be on all the time. This reduces the number of tasks the application processor needs to execute, improving its reactiveness, e.g., for highly interactive vision-based user interfaces. In Figure 7(b) we propose a possible future design with low-power sensor processors. The inclusion of small-footprint processors that operate very close to the sensor units improves the energy efficiency of the subsystems, which can then remain always on, enabling new interaction methods.
(a) Current device
(b) Future device
Fig. 7 Possible organization of a current and a future multimedia device. In current devices, the processing of the motion sensors and the frontal and back cameras is mainly done at the application level using a power-hungry main processor. We propose that future multimedia devices include several dedicated small-footprint processors that minimize data transfers, improving the energy efficiency and allowing the subsystems to remain always active.
The practical challenge of an interactive camera-based application scenario is the assumption of having an active front camera. If it is operated at a lower frame rate, the latency savings may not materialize, while a higher frame rate reduces power efficiency. If the image processing is coupled with the employment of other sensors, we can formulate an approach that is both reliable and energy efficient. Much of the application start latency and delay can be hidden by predicting the user's intention to capture an image [16]. In the most typical case, this involves the use of the motion sensors to recognize the handling and raising of the device to a horizontal pose. When this happens, the camera can be switched on or its frame rate can be increased to improve interactivity, hiding the latencies perceived by the user. The needs of interactivity are pushing the manufacturers to add more sensors to their devices, as well as to adopt architectural solutions that provide a long battery life. The designers need to take into account the power needs of the required signal processing and of the sensors themselves. For instance, a triaxial accelerometer dissipates power in the sub-mW range, a QVGA camera requires about 1 mW/frame/second, while capacitive touch screens demand around 3 mW. The analysis of 150 samples/second signals from triaxial accelerometers and magnetometers requires fewer than 30 000 instructions per second. Face tracking from QVGA (320-by-240) video requires around 10-15 million instructions per frame using Local Binary Pattern technology [2]. If implemented on an ARM7, the energy per instruction (EPI) is around 100 pJ, while with optimized sensor processing architectures the EPI can be pushed well below 5 pJ [22]. Consequently, if implemented at a low frame rate of 1 frame/second, the corresponding power needs range from microwatts up to 1 mW. Employing the application processor with its interfaces for the same purposes would demand tens of mW, significantly reducing the battery life in active-state use. Figure 8 shows our measurements of the battery discharge times of a Nokia N9 phone under constant load. Since the battery life is a nonlinear function of the load current, small improvements in the energy efficiency of the applications can yield large improvements in the operation times. Similar observations of the battery life and its knee region have been made by Silvén and Rintaluoma [30] and can be found in the earlier work of Rakhmatov and Vrudhula [26]. Table 2 compares the stand-by and active-state power needs of the designs in Figures 7(a) and 7(b) with "conventional" and "advanced" user interfaces. With the latter, a camera and accelerometers are used for detecting user interaction needs.
Table 2 Comparison of stand-by and active-state power needs of user interfaces.

                                               Stand-by power [mW]   Active-state power [mW]
Advanced UI on current platform
  VGA camera (15 fps) + sensors                55                    55
  Image + sensor processing (app. processor)   70                    700
  Others (memory, screen, subsystems)          5                     500
  Total                                        130                   1255
Advanced UI on future platform
  QVGA camera (1 fps) + sensors                3                     20
  Image + sensor processing (dedicated proc.)  2                     200
  Others (memory, screen, subsystems)          5                     500
  Total                                        10                    720
Conventional UI (phone total)                  5                     650
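To make the arithmetic behind these figures explicit, the short sketch below converts an instruction count and an energy-per-instruction value into average power. The instruction counts and EPI values are the illustrative numbers quoted in the text, not measurements.

    def processing_power_mw(instructions_per_frame, fps, epi_pj):
        # Average power in mW: (instructions per second) x (energy per instruction).
        return instructions_per_frame * fps * epi_pj * 1e-12 * 1e3

    # LBP face tracking on QVGA frames, ~10 million instructions per frame, 1 fps:
    print(processing_power_mw(10e6, fps=1, epi_pj=100))  # ARM7-class EPI -> ~1 mW
    print(processing_power_mw(10e6, fps=1, epi_pj=5))    # dedicated sensor processor -> ~0.05 mW
    # Accelerometer/magnetometer analysis, < 30 000 instructions per second:
    print(processing_power_mw(30e3, fps=1, epi_pj=100))  # ~3 microwatts (0.003 mW)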
Fig. 8 Discharge time of a 1450 mAh Li-Ion battery on an N9 phone. The shape of the discharge curve implies that small improvements in the applications' energy efficiency can yield large improvements in the operation times.
Current mobile platforms do not include dedicated energy-efficient sensor processors that could be employed to develop advanced user interfaces. However, an increasing number of devices already include a Graphics Processing Unit that can be accessed with standard APIs such as OpenGL ES or OpenCL. A good solution that can be integrated into current platforms is the employment of the GPU for general-purpose computing, which improves the active-state battery life due to its lower energy consumption. Using mobile GPUs for camera-based applications is an attractive option. We consider speeding up interactive applications by using the graphics processing resources. Also, their smaller EPI makes them a suitable candidate for reducing the power consumption of computationally intensive tasks.
5.2 GPU implementation of multiframe techniques

Several algorithms that can be used in a multiframe reconstruction application have been implemented using a PowerVR SGX530 mobile GPU. The Nokia N9 graphics processor is accessible via the OpenGL ES application programming interface (API). However, the use of GPUs as general-purpose-capable processors has not yet been extensively considered on mobile phones. As a consequence, developers of image processing algorithms that use the camera as the main data source lack fast ways of transferring data between the processing units and the capture or storage devices. On current platforms, this must be done by copying the images obtained by the camera from the CPU memory to the GPU memory in a matching format [5]. Furthermore,
the overheads of copying images as textures to graphics memory result in significant slowdowns. The lack of shared video memory causes multiple accesses to the GPU memory to retrieve the data for the processing engine. Although the OpenGL ES programmable pipeline enables the implementation of many general processing functions, the APIs still have several limitations. The most important one is that the GPU is forced to work in single-buffer mode to allow the read-back of the rendered textures. Other shortcomings include the need to use power-of-two textures and the restricted types of pixel data. While traditional mosaic building algorithms tend to follow a sequential path with multiple memory accesses from the processing unit, in our example application each step of the mosaicking algorithm has to be evaluated separately in order to find the best ways of organizing the data and to reduce the overheads. For the time being, the most obvious operations to be accelerated using OpenGL ES are pixel-wise operations and geometrical transformations such as warps and interpolations [4]. Part of the computations required in a feature-matching-based registration process can be moved to the GPU through the use of programmable shaders. Previous work shows that desktop-GPU SIFT feature extraction used along with a parallel RANSAC estimator yields a 50% CPU load reduction [32] and that feature extraction times on VGA frames can be reduced about ten times [27]. A mobile-GPU Harris corner detector can be used to accelerate the registration process [31]. As a case study for feature extraction, LBP extraction has been implemented on an OMAP 3630 platform using the OpenGL ES 2.0 shading language [5]. Our experiments show that the equivalent CPU algorithm outperforms the GPU algorithm on all three image sizes, although the GPU times are comparable for bigger image sizes, where the parallelization is easier. However, the GPU processing does not occupy the CPU, which can be utilized to compute part of the frames concurrently or to perform some other tasks while the GPU performs the extraction. Area-based image registration methods are also highly parallelizable. For example, the method by Vandewalle et al. [34] uses Tukey window filtering and FFT-based phase correlation computations to register two images. Experiments run on an OMAP 3630 platform show that the window filtering and complex division routines execute up to three times faster when performed on the built-in GPU. The stitching process requires the correction of each selected frame with a warping function that interpolates the pixel data to the coordinates of the new frame. This costly process can be done in a straightforward manner in several steps using any fixed or programmable graphics pipeline. The programmable pipeline of OpenGL ES 2.0 enables shader programming for implementing blur detection in a similar way to the feature extraction method. The first stage of the blur detection is a simple derivation algorithm, which can be implemented efficiently with an OpenGL ES 2.0 shader. Our tests show that on an OMAP 3630 platform the derivation algorithm on HD-720p images can be computed about three times faster on the GPU, while reducing the CPU load by 80%. The pixel blending operation can be done in a straightforward manner with the hardware-implemented blending function. When blending is enabled, overlapping textures are blended together.
The transparency can be determined by choosing a blending factor for every channel of both images and then a blending function. The channel values are multiplied by their respective factors and the blending function is then applied to each channel pair. Since OpenGL ES 2.0 has a programmable pipeline, blending can also be done with a shader algorithm; in this way, all the needed calculations can be combined into a single rendering stage.
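The sketch below reproduces this per-channel factor-and-combine blending on the CPU with NumPy, using a Gaussian-shaped weight across the seam as an example blending factor. The weighting profile and its parameter are illustrative, not the exact scheme used in the application.

    import numpy as np

    def blend_overlap(mosaic_rgb, frame_rgb, sigma_frac=0.25):
        # Per-channel blend: result = frame * alpha + mosaic * (1 - alpha),
        # with a Gaussian-shaped alpha across the overlap (feathered seam).
        h, w, _ = frame_rgb.shape
        x = np.linspace(-1.0, 1.0, w)
        weight = np.exp(-(x ** 2) / (2.0 * sigma_frac ** 2))   # blending factor profile
        alpha = np.tile(weight, (h, 1))[:, :, np.newaxis]      # per-pixel factor
        out = (frame_rgb.astype(np.float32) * alpha
               + mosaic_rgb.astype(np.float32) * (1.0 - alpha))
        return np.clip(out, 0, 255).astype(np.uint8)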
Table 3 shows the computation times and energy consumption of several algorithms utilized in multiframe reconstruction.
Table 3 Computational and energy costs per HD-720p frame of several operations implemented on a mobile platform (OMAP 3630).

Operation                 CPU time [ms]   CPU energy [mJ]   GPU time [ms]   GPU energy [mJ]
Grayscale conversion      18              3.6               8               1.0
Scaling                   24              5.3               12              1.5
LBP extraction            48              9.1               90              13.1
Harris corner detector    60              13.5              170             22.5
Blur detection            80              28.2              60              8.0
Tukey windowing           35              5.1               15              2.1
Image warping             140             27.3              40              5.4
Image blending            270             50.7              120             15.8
6 Application performance

The computing requirements of a multiframe reconstructor are quite significant for a battery-powered mobile device, although the application can be broken down into the interactive, real-time frame capture part and the non-interactive final mosaic stitching post-processor. The application has been developed on a Nokia N9 device. This device is based on an OMAP 3630 System on Chip composed of a 1 GHz ARM Cortex-A8 and a PowerVR SGX530 GPU supporting OpenGL ES 2.0. The Nokia N9 has a 3.9-inch capacitive touchscreen with a maximum resolution of 854x480 pixels and an 8 Mpixel camera with a maximum video resolution of 1280x720 pixels. The device includes a 1450 mAh battery. Table 4 shows the computational and energy costs of the most expensive parts of the HD-720p frame-based document scanning application when implemented entirely on the application processor.
Table 4 Algorithm’s computational and energy costs per frame on a N9 (ARM Cortex-A8). Online loop Camera motion estimation Quality assessment Offline computations Image registration Image correction Image blending
Computation time [ms]
Energy consumption [mJ]
100 50
200 10
5000 200 100-300
800 40 40
The application has been implemented using only fixed-point arithmetic to achieve good performance on most devices. The implementation of the interactive capture stage allows processing of about 7.5 frames/second on the Nokia N9 (ARM Cortex-A8 processor) in HD-720p resolution mode. The off-line stage operates at about 0.2 frames/second and depends on both the available memory resources and the processor speed.
High-resolution video frames require long processing times that, although still suitable for interactivity purposes, might not result in the best user experience. Reducing the resolution of the input frames proportionally decreases the needed processing times, allowing a better user experience. However, the fast evolution of mobile processors and the recent inclusion of multi-core chipsets suggest that future mobile platforms will be able to handle a multiframe reconstruction application in real time using input images of even higher resolution. Table 5 shows a comparison of the processing times at different resolutions on a Nokia N9. The experiments show that the processing time increases almost linearly with the number of pixels and operations.
Table 5 Application processing times [ms] at different resolutions on a N9 (ARM Cortex-A8).

                       320x240   640x480   1280x720   1920x1080
Online loop            15        52        150        330
Offline computations   500       1900      5400       11900
7 Summary

Applications that rely on interactive camera-based user interfaces are among the most demanding uses of mobile devices. All the available processing resources are needed to ensure smooth operation, but that may compromise battery life and usability. Nevertheless, based on our experience, camera-based interactivity is an extremely attractive scheme. Camera sub-systems on mobile device platforms are a rather recent add-on, designed just for capturing still and video frames. At the same time, the energy efficiency features of the platform architectures, computing resources, and displays have been optimized for video playback. From the point of view of interactive camera-based applications, compatible data formats for the camera and graphics systems would be a major improvement. We have presented a system for building document mosaic images from selected video frames on mobile phones. High interactivity is achieved by providing real-time feedback on motion and quality, while simultaneously guiding the user. The captured images are automatically stitched together with good quality and high resolution. The graphics processing unit of the mobile device is used to speed up the computations. Based on our experiments, the use of the GPU improves performance, though only moderately, and increases battery life, facilitating interactivity. We believe that the cameras in future mobile devices may, for most of the time, be used for sensory purposes rather than for capturing images for human viewing.
References

1. A. Adams, N. Gelfand, and K. Pulli. Viewfinder alignment. In Eurographics 2008, pages 597–606, 2008.
2. T. Ahonen, A. Hadid, and M. Pietikäinen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
3. R. Bilcu, A. Burian, A. Knuutila, and M. Vehvilainen. High dynamic range imaging on mobile devices. In Electronics, Circuits and Systems, 2008. ICECS 2008. 15th IEEE International Conference on, pages 1312–1315, 2008.
4. M. Bordallo, J. Hannuksela, O. Silvén, and M. Vehviläinen. Graphics hardware accelerated panorama builder for mobile phones. In Proceedings of SPIE Electronic Imaging 2009, 7256, 2009.
5. M. Bordallo, H. Nykänen, J. Hannuksela, O. Silvén, and M. Vehviläinen. Accelerating image recognition on mobile devices using GPGPU. In Proceedings of SPIE Electronic Imaging 2011, 7872, 2011.
6. J. Boutellier, M. Bordallo, O. Silvén, M. Tico, and M. Vehviläinen. Creating panoramas on mobile phones. In Proceedings of SPIE Electronic Imaging 2007, 6498, 2007.
7. T. Capin, K. Pulli, and T. Akenine-Möller. The state of the art in mobile graphics research. IEEE Computer Graphics and Applications, 1:74–84, 2008.
8. J. Dabrowski and E. Munson. Is 100 milliseconds too fast? In Conference on Human Factors in Computing Systems, pages 317–318, 2001.
9. S. DiVerdi and T. Höllerer. GroundCam: A tracking modality for mobile mixed reality. In IEEE Virtual Reality, pages 75–82, 2007.
10. J. Fung and S. Mann. Using graphics devices in reverse: GPU-based image processing and computer vision. In Multimedia and Expo, 2008 IEEE International Conference on, pages 9–12, 2008.
11. N. Gelfand, A. Adams, S. H. Park, and K. Pulli. Multi-exposure imaging on mobile devices. In Proceedings of the International Conference on Multimedia, MM '10, pages 823–826, New York, NY, USA, 2010. ACM.
12. S. J. Ha, S. H. Lee, N. I. Cho, S. K. Kim, and B. Son. Embedded panoramic mosaic system using auto-shot interface. IEEE Transactions on Consumer Electronics, 54(1):16–24, 2008.
13. M. Hachet, J. Pouderoux, and P. Guitton. A camera-based interface for interaction with mobile handheld computers. In I3D'05 - ACM SIGGRAPH 2005 Symposium on Interactive 3D Graphics and Games, pages 65–71. ACM Press, 2005.
14. J. Hannuksela, P. Sangi, and J. Heikkilä. Vision-based motion estimation for interaction with mobile devices. Computer Vision and Image Understanding: Special Issue on Vision for Human-Computer Interaction, 108(1–2):188–195, 2007.
15. J. Hannuksela, P. Sangi, J. Heikkilä, X. Liu, and D. Doermann. Document image mosaicing with mobile phones. In 14th International Conference on Image Analysis and Processing, pages 575–580, 2007.
16. J. Hannuksela, O. Silvén, S. Ronkäinen, S. Alenius, and M. Vehviläinen. Camera assisted multimodal user interaction. In Proceedings of SPIE Electronic Imaging 2010, 754203, 2010.
17. A. Haro, K. Mori, T. Capin, and S. Wilkinson. Mobile camera-based user interaction. In IEEE International Conference on Computer Vision, Workshop on Human-Computer Interaction, pages 79–89, Beijing, China, 2005.
18. J. Hwang, J. Jung, and G. J. Kim. Hand-held virtual reality: A feasibility study. In ACM Virtual Reality Software and Technology, pages 356–363, 2006.
19. H. Kalva, A. Colic, A. Garcia, and B. Furht. Parallel programming for multimedia applications. Multimedia Tools and Applications, 51(2):801–818, Jan. 2011.
20. S. Kim and W.-Y. Su. Recursive high-resolution reconstruction of blurred multiframe images. IEEE Transactions on Image Processing, 2(4):534–539, Oct. 1993.
21. M. Möhring, C. Lessig, and O. Bimber. Optical tracking and video see-through AR on consumer cell phones. In Workshop on Virtual and Augmented Reality of the GI-Fachgruppe AR/VR, pages 193–204, 2004.
22. L. Nazhandali, B. Zhai, J. Olson, A. Reeves, M. Minuth, R. Helfand, S. Pant, T. Austin, and D. Blaauw. Energy optimization of subthreshold-voltage sensor network processors. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, ISCA '05, pages 197–207, Washington, DC, USA, 2005. IEEE Computer Society.
23. N. Pears, P. Olivier, and D. Jackson. Display registration for device interaction. In 3rd International Conference on Computer Vision Theory and Applications, pages 446–451, 2008.
24. K. Pulli, A. Baksheev, K. Kornyakov, and V. Eruhimov. Real-time computer vision with OpenCV. Communications of the ACM, 55(6):61–69, June 2012.
25. K. Pulli, W.-C. Chen, N. Gelfand, R. Grzeszczuk, M. Tico, R. Vedantham, X. Wang, and Y. Xiong. Mobile visual computing. In Ubiquitous Virtual Reality, 2009. ISUVR '09. International Symposium on, pages 3–6, July 2009.
26. D. Rakhmatov and S. Vrudhula. Energy management for battery-powered embedded systems. ACM Transactions on Embedded Computing Systems, 2(3):277–324, Aug. 2003.
27. J. M. Ready and C. N. Taylor. GPU acceleration of real-time feature based algorithms. In Proceedings of the IEEE Workshop on Motion and Video Computing, page 8, Washington, DC, USA, 2007. IEEE Computer Society.
28. M. Rohs. Real-world interaction with camera-phones. In 2nd International Symposium on Ubiquitous Computing Systems, pages 39–48, 2004.
29. B.-K. Seo, J. Park, and J.-I. Park. 3-D visual tracking for mobile augmented reality applications. In Multimedia and Expo (ICME), 2011 IEEE International Conference on, pages 1–4, July 2011.
30. O. Silvén and T. Rintaluoma. Energy efficiency of video decoder implementations. In F. Fitzek and F. Reichert (eds.), Mobile Phone Programming and its Applications to Wireless Networking, pages 421–439. Springer, 2007.
31. N. Singhal, I. K. Park, and S. Cho. Implementation and optimization of image processing algorithms on handheld GPU. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 4481–4484, Sept. 2010.
32. S. N. Sinha, J. M. Frahm, M. Pollefeys, and Y. Genc. GPU-based video feature tracking and matching. In Workshop on Edge Computing Using New Commodity Architectures, 2006.
33. Y. Tian, K.-H. Yap, and Y. He. Vehicle license plate super-resolution using soft learning prior. Multimedia Tools and Applications, pages 1–17, 2011. doi:10.1007/s11042-011-0821-2.
34. P. Vandewalle, S. Susstrunk, and M. Vetterli. A frequency domain approach to registration of aliased images with application to super-resolution. EURASIP Journal on Applied Signal Processing (special issue on Super-resolution), 24:1–14, 2006.
35. D. Wagner, A. Mulloni, T. Langlotz, and D. Schmalstieg. Real-time panoramic mapping and tracking on mobile phones. In Virtual Reality Conference (VR), IEEE, 2010.
36. Y. Wang, B. Donyanavard, and K. Cheng. Energy-aware real-time face recognition system on mobile CPU-GPU platform. In International Workshop on Computer Vision on GPU, 2010.
37. S. Winkler, K. Rangaswamy, and Z. Zhou. Intuitive map navigation on mobile devices. In C. Stephanidis, editor, 4th International Conference on Universal Access in Human-Computer Interaction, Part II, HCI International 2007, LNCS 4555, pages 605–614. Springer, Beijing, China, 2007.
38. Y. Xiong and K. Pulli. Fast panorama stitching for high-quality panoramic images on mobile phones. IEEE Transactions on Consumer Electronics, 56(2):298–306, 2010.