Camera assisted multimodal user interaction

Jari Hannuksela^a, Olli Silvén^a, Sami Ronkainen^b, Sakari Alenius^c, and Markku Vehviläinen^c

^a Machine Vision Group, University of Oulu, Finland
^b Nokia Devices, Finland
^c Nokia Research Center, Finland

ABSTRACT

More processing power and new sensing and display technologies are already available in mobile devices, and there has been increased interest in building systems that communicate via different modalities such as speech, gesture, expression, and touch. In context identification based user interfaces, these independent modalities are combined to create new ways for users to interact with hand-held devices. While these are unlikely to completely replace traditional interfaces, they will considerably enrich and improve the user experience and task performance. We demonstrate a set of novel user interface concepts that rely on the multiple built-in sensors of modern mobile devices for recognizing the context and sequences of actions. In particular, we use the camera to detect whether the user is watching the device, for instance, to make the decision to turn on the display backlight. In our approach the motion sensors are first employed for detecting the handling of the device. Then, based on ambient illumination information provided by a light sensor, the cameras are turned on. The frontal camera is used for face detection, while the back camera provides supplemental contextual information. The applications subsequently triggered by the context can be, for example, image capturing or bar code reading.

Keywords: Computer vision, face detection, user interface, mobile device

1. INTRODUCTION

The increasing availability of multimodal sensing capabilities in modern mobile communication devices enables users to interact with devices via different modalities such as speech, gesture, expression, and touch. In context identification based user interfaces, these independent modalities are combined to create new ways for users to communicate with hand-held devices. While new techniques are unlikely to completely replace traditional interfaces, they can considerably enrich and improve the user experience and task performance. With this in mind, we have kept our focus on actual application utility rather than initially impressive gimmicks.

In particular, it has been shown that different sensors provide viable alternatives to conventional interaction in portable devices. For example, tilting interfaces can be implemented with gyroscopes1 and accelerometers.2 Using both tilt and buttons, the device itself is used as an input for navigating menus and maps, and only one hand is required for manipulation. The Nintendo Wii is an example of fitting together application interactivity and sensor functionality: the limitations of the three-axis accelerometer are cleverly hidden from the user by the characteristics of each game. Apple's products make use of multimodal user interaction technology in different ways. On the iPhone, users can zoom in and out by performing multi-finger gestures on the touch screen. In addition, a proximity sensor shuts off the display in certain situations to save battery power, and an accelerometer senses the orientation of the phone and changes the screen accordingly. In a category of their own are devices that employ a detachable stylus, with which interaction is done by tapping the touch screen to activate buttons or menu choices.

On the other hand, many current devices also have two built-in cameras, one for capturing high resolution photographs and the other for lower resolution video telephony, as shown in Figure 1. Even the most recent devices have not yet utilised these unique input capabilities for purposes other than image capture for viewing by humans. With appropriate computer vision methods, the information provided by images allows us to create new, intuitive user interface concepts.

Further author information: Send correspondence to Jari Hannuksela. E-mail: [email protected], Telephone: +358 407272368

Traditionally, vision has been utilised in perceptual interfaces to build systems that look at people and automatically sense and perceive the human users, including their location, identity, focus of attention, facial expression, posture, gestures and movements.3 A key advantage of using vision as an input modality is that the sensing is entirely passive and non-intrusive, as it does not require contact with the user or any special-purpose accessories.4 Furthermore, vision is one of several possible sources of information that can be coupled with other sensors, such as accelerometers, enabling a more effective and efficient interaction.

In this paper, we demonstrate novel user interface concepts that rely on the multiple built-in sensors of modern mobile devices to recognize action sequences and to identify contexts. In particular, we use the camera to detect whether the user is watching the device, for instance, to make the decision to turn on the display backlight. Our idea is to employ motion sensors to detect the handling of the device, such as bringing it to a typical picture capturing position and orientation. The key motivation is to hide the start-up latencies of the functionalities from the user. In particular, we are interested in creating an illusion of applications being always on by concealing their often 1-3 second turn-on delays. In our cases the motion sensors trigger the action recognition sequence, which continues with the illumination sensor determining the level of ambient light; if it is sufficient, the cameras are turned on. The frontal camera is used for face detection, while the back camera provides supplemental contextual information. This can be used to trigger an application such as image capturing or bar code reading. In some devices, the high resolution back camera is protected by a lid that also acts as a trigger for the camera application, while in many other designs a light push of the camera button powers on the camera. These events are usually followed by a camera turn-on latency.

In the following, we first introduce the multimodal sensing capabilities of current devices, and then proceed to the camera based user interface and application solutions. In the experiments, the described solution has been implemented on a Nokia mobile phone that demonstrates three application scenarios: face detection based backlight and key lock control, face detection based automatic image capturing mode detection, and automatic activation of a "point and find" application. Finally, the user benefits and desirable future developments are discussed.
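As an illustration, a minimal sketch of this staged wake-up sequence is given below. The sensor-access functions, the callback, and the thresholds are hypothetical placeholders standing in for platform-specific APIs; they are not the implementation used in our experiments.

```python
# Illustrative staged wake-up: cheap sensors gate the more expensive ones.
# All sensor functions and thresholds below are hypothetical placeholders.
import time

MOTION_THRESHOLD = 1.5   # acceleration change (g) indicating handling
LUX_THRESHOLD = 10       # minimum ambient light for the camera to be useful

def staged_wakeup(read_motion_magnitude, read_ambient_lux,
                  front_camera_frame, detect_face, on_user_detected):
    """Run the staged sensing loop; callers supply the sensor callbacks."""
    while True:
        # Stage 1: the accelerometer runs continuously at negligible power.
        if read_motion_magnitude() < MOTION_THRESHOLD:
            time.sleep(0.1)
            continue
        # Stage 2: the light sensor reveals whether the device left its storage.
        if read_ambient_lux() < LUX_THRESHOLD:
            continue
        # Stage 3: only now power up the front camera and look for a face.
        if detect_face(front_camera_frame()):
            on_user_detected()   # e.g. turn on backlight, release keylock
```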

2. MULTIMODAL USER INTERFACES

The number of applications crammed into hand-held devices has increased rapidly and is likely to keep growing, causing usability complications given the constraints of the typical user interface. The available computing resources have increased to enable new functionalities such as mobile gaming and navigation; however, the computing power has not been harnessed for user interfacing. Our understanding is that the very nature of hand-helds lends itself to novel types of user interfaces better than the desktop world, which offers much higher computing performance. Currently, both domains appear to be hopelessly constrained to the use of mouse/pointer, keyboard, and display.

Figure 1 shows an example of a modern mobile device: the Nokia N97 cellular phone. The phone includes a 3.5-inch, 640 by 360 pixel resistive touch sensitive screen with tactile feedback and a keyboard. It also has a GPS sensor, a 5 Megapixel autofocus camera under a lid with a LED flash, and a supplemental video call camera next to the display. Furthermore, it has multiple integrated sensors, including an accelerometer, a light sensor and a proximity sensor. The proximity sensor is clearly visible on the left of the earpiece, while the video call camera and the light sensor that controls the screen brightness are on the right. The built-in accelerometer is primarily employed to turn the screen orientation. The N97 is optimized for browsing, contains many multimedia functionalities, comes with social media widgets, and supports immediate sharing over HSDPA and WLAN. This offers yet another interactivity dimension for applications that benefit from, for example, the retrieval of product information from the Internet, a functionality that is expected to operate without noticeable delays.

Mobile devices such as the N97 are increasingly used in truly mobile usage contexts in which operational latencies are likely to irritate and distract the user. In practice, the user expects the device to be ready with the expected functionality at any time. An earlier example is predictive text entry, which could now be followed by "predictive functionality launches".5

Figure 1. Modern mobile communication device with multiple sensors (Nokia N97). Labeled components: ambient light sensor, 5 Mpix camera with dual LED flash, QCIF (176 x 144 pixels) video call camera, proximity sensor, 3.5" TFT resistive touch screen (640 x 360 pixels, 16M colors), accelerometer, GPS receiver, and digital compass.

Multimodal user interfaces have been suggested for such mobile devices. For example, Oviatt et al.6 claim: "They (multimodal user interfaces) have the potential to expand computing to more challenging applications, be used by a broader spectrum of everyday people, and accommodate more adverse usage conditions than in the past." Utilizing the user's input and output modalities more effectively than today certainly makes sense for mobile user interfaces. However, in current mobile devices multimodality is not exploited to its full potential.

Holding a mobile device and operating it at the same time is not a trivial task. For example, walking through traffic with a device held in hand while carrying a bag poses several restrictions on what one can do. One's eyes and much of one's concentration are needed for navigating in traffic, and the carried items may prevent using both hands for operating a device. The weather conditions may require the user to wear gloves, which makes it difficult to press small buttons or to use a touch screen.7 Specific gestures for controlling mobile devices have been frequently suggested, but the social acceptability of this interaction modality is often dubious. If any gestures are used, they should not differ from the normal handling of the device or the behaviour of the user.

2.1 Multimodal user interfaces in mobile usage

Multimodality has been seen as a way to enhance the naturalness of interaction,8 but certain kinds of multimodality are better suited for mobile usage contexts.9 The most interesting types are those that allow using any modalities available. Approaches that aim at increased user interface efficiency by allowing the user to point and speak simultaneously are probably better suited to office environments; in mobile use, neither pointing nor speaking is often suitable. A key issue is that the output and input means should be selected automatically from among the ones usable in the particular context.

Multimodal output is more widely used in today's mobile devices than multimodal input, and they are often seen as somewhat separate problems. The interaction challenges are seldom in the outputs, in which multiple modalities are used either simultaneously, as when a visual note is accompanied by a sound, or alternatively, as when an incoming call is signalled by a sound or by a vibration pattern depending on the mode of the device. Few good models exist for multimodal inputs in mobile contexts. If a user interface requires that at least two of the user's available modalities are used simultaneously for device operations, the probability of these being available for use is reduced compared to a unimodal system. This is contrary to what the user would need, and to what has been claimed as a benefit of a multimodal user interface. For example, pointing typically requires the use of both hands (one to hold the device and another for the pointing action), while typical background noise in the environment can decrease the accuracy of speech recognition. Furthermore, the user's resources are limited by the environment, as he may have to observe the behaviour of other people, vehicles, animals, and so on. The background noise level may restrict the hearing of the user, while the sensitivity of the microphone of a mobile device carried in a pocket is reduced. The hands may be occupied with other objects, and the clothing required by the weather conditions may restrict the movement of the limbs and fingers.

2.2 Existing concepts for mobile multimodality

It has been shown that under increased cognitive load people tend to start utilizing multiple modalities in a user interface, if they are given a free choice.10 For mobile use this is an important finding, since the mobile context is often cognitively more demanding than a situation where the user is sitting down with a laptop, fully concentrated on the task. It has also been shown that multimodal input can improve the performance of speech and gesture recognition by allowing the system to mutually disambiguate the possible results of different recognition engines. It is often envisioned that the greatest potential of multimodality in mobile use lies in new ways of interacting in situations that are difficult with today's visual-manual user interfaces.

The typical scenarios for multimodality count on user co-operation in the particular user interaction mode. An example of using hand gestures is the Shoogle concept,11 where the user shakes the whole device to obtain information about its state, presented through auditory and tactile channels. The system requires some level of manual input, but finger precision is not needed, nor is any visual contact required to use the device. Other similar examples include the Sony Ericsson W910i phone, which the user can shake in order to randomize the list of currently playing music, and the Samsung SCH-310 phone, where the user can draw digits in the air in order to dial a phone number. In the same manner, the user of the Nokia 8800 Arte phone can silence an incoming call by turning the device face down. A double tap gesture on the phone's surface is used to switch on the display backlight and display a clock. On the Nokia 5500 Sport phone the double tap can be used to make the phone read out a received text message or information about the current sports activity, using speech synthesis. The same functionalities can also be activated by pressing the keyboard. This feature has also been implemented in other Nokia phone models. Hand movement can also be recognized by using a camera embedded in the device; for example, the Sony Ericsson W380 can be silenced by waving a hand. Like the multimodal concepts above, this too is an add-on functionality rather than an integral part of the user interface.

Ronkainen et al.12 discuss the concept of utilizing hand gestures that the user performs by moving the mobile device, either by holding it in hand or by tapping it with fingers when the device is in a storage location such as a pocket. Typically these gestures can be recognized by utilizing an accelerometer or a camera embedded in the mobile device, detecting its movement. A hand gesture allows device manipulation without requiring the delicate accuracy of finger operation. Instead, the user can hold the device and wave it, shake it, or move it in a circular motion. One hand obviously must be available for device use (at least for the duration of the gesture), but the user need not be able to locate the buttons on the device surface or to press them.

In this paper, we have systematically explored the use of cameras in multimodal user interfaces and identified their value in assisting user operations, in particular, in hiding latencies without the user noticing anything except improved usability. The key component in our experiments has been face detection, as the presence of the user's face is an almost certain sign of interaction.
Relying on accelerometer information alone is unreliable except for rather simple context recognition tasks.13 In our approach the camera is seamlessly integrated with the other sensors of the device, as discussed in the following.

3. CAMERA ASSISTED USER INTERACTION

Cameras have traditionally been utilised in user interface research to build systems that look at people and automatically sense and perceive the human users, including their identity, focus of attention, facial expression, gestures and movements.3 The most notable practical implementations, such as Sony's EyeToy and Microsoft's Project Natal, are related to gaming. For example, the EyeToy, a peripheral camera for Sony's PlayStation 2 game console, has shown that computer vision technology is becoming feasible for consumer-grade applications. This device allows players to interact with games using simple motion estimation, colour detection and also sound, through its built-in microphone. Correspondingly, the Project Natal device reportedly features a depth sensing camera and a microphone array, providing full-body 3-D motion capture, facial recognition, and voice recognition capabilities. Despite the significant progress made, vision based interfaces often require customised hardware and they work only in more or less restricted environments,14 as in these examples in which the sensors are stationary. However, to some extent the interaction needs in hand-held communication devices in mobile usage are similar. In particular, knowledge of the presence and motion of human faces in the view of the camera can be a powerful application enabler.

In the mobile domain, the video call camera is usually directed towards the user, as in the models shown in Figure 2. The field of view of the camera is optimized for imaging the user's face region due to the intended video call usage, but this provides advantages for various human-computer interaction solutions too. Faces are also important in imaging applications: typically, users are interested in searching for people in images, and good quality face regions are among the main concerns in imaging. As shown in Figure 2, there are already auto focus and auto white balance solutions that benefit from face tracking during image capture. Other examples are smile triggered shutters, open-eye and blur detectors. As a multimodal convenience feature, audio feedback in the form of a special shutter sound can be used to notify the user if the image is blurred or any eyes are shut. Alternatively, vibratory feedback could be used as a request to pay attention to the quality of the image. These outputs free the user from checking the pictures for the most common problems. Brewster and Johnston,15 for example, sonify the luminance histogram to provide a cue about image quality, and this approach could be generalized to cover other quality features as well.

Figure 2. Typical cell phone replicating digital camera functionalities (Nokia N95). Labeled features: 5 MP camera with LED flash, face tracking based autofocus, smile shutter and blink detection, shutter sound, and vibratory feedback.
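As an illustration of the face detection underlying such features, the following minimal sketch checks for a face in a front-camera frame using OpenCV's Haar cascade detector. OpenCV, the bundled cascade file, and the camera index are assumptions made for this example only; they are not the detector or platform used in our implementation.

```python
# Face-presence check on a front-camera frame (illustrative sketch only).
# OpenCV and its bundled Haar cascade stand in for the platform's detector.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_present(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.2,
                                      minNeighbors=4, minSize=(40, 40))
    return len(faces) > 0

cap = cv2.VideoCapture(0)            # front camera as a generic video device
ok, frame = cap.read()
if ok:
    print("user watching" if face_present(frame) else "no face detected")
cap.release()
```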

So far the cameras of mobile devices have been optimized for capturing viewable images rather than for acting as sensors for interaction purposes. Consequently, the application processors of the devices have been responsible for the image processing, which requires the platform to be in the energy hungry active state. As a result, interaction modalities that employ cameras are costly for battery life, easily consuming over 1000 mW.16 There are two options for solving this situation: first, dedicated camera processors, and second, dedicated cameras for sensory purposes that do not need to capture viewable images. VGA resolution cameras consume approximately 1 mW/frame/s, while the respective image processing, say face detection, demands around 20 million cycles per frame from a dedicated camera processor, equalling about 1 mW/frame/s. The total power need is roughly equivalent to Bluetooth, if the standby image capture rate is about 1-2 frames/s. The resolution and power demands could be scaled down even further if the camera is only needed for face and motion detection, as has been demonstrated by Ojansivu et al.17 in their work on blur invariant recognition. In the extreme, this boils down to implementing special camera-like sensors optimized for user interfacing.
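The figures quoted above imply a simple standby power budget; the sketch below works through the arithmetic under those stated assumptions.

```python
# Standby power budget for an "always on" sensor camera, using the figures
# quoted in the text as assumptions (not measurements of a specific device).
CAMERA_MW_PER_FPS = 1.0      # VGA capture: ~1 mW per frame/s
PROCESSING_MW_PER_FPS = 1.0  # ~20 Mcycles/frame on a dedicated camera processor

def standby_power_mw(frame_rate_fps):
    return frame_rate_fps * (CAMERA_MW_PER_FPS + PROCESSING_MW_PER_FPS)

for fps in (1, 2):
    print(f"{fps} frame/s -> about {standby_power_mw(fps):.0f} mW")
# Yields roughly 2-4 mW at 1-2 frames/s, i.e. in the same class as Bluetooth,
# as noted in the text.
```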

If cameras are considered as sensors, they could equally well be treated as such in the multimodal context sensitive user interfaces of mobile devices. In practice, this implies an "always on" type of camera operation or similar functionality, in other words, continuous video capture and analysis of sequential frames. The capabilities provided by imaging should seamlessly integrate into the normal course of user interactions and improve the usability, for instance, by making it possible to hide various latencies.

4. CELLULAR PHONE IMPLEMENTATIONS

The added value obtained from utilizing the camera as a simple sensor rather than as a high quality image capture device was investigated in use contexts considered frequent, and three actual multimodal interaction applications were implemented on cellular phones. The first one involves automatic backlight activation, keylock control, and device orientation detection. The second solution predicts the image capturing mode, while the third one automatically launches a "point and find"18 style application. The key idea in all cases is that the recognized action sequences activate the device functionalities and applications, anticipating the user's intentions. In the following, we describe the context identification and operational concepts in more detail and evaluate the gained utility.

4.1 Application case 1: Camera assisted keylock

Almost all cases in which mobile devices are used start with retrieving the device from a pocket, belt holder, or handbag, and continue by manually releasing the keylock and launching an application such as the phone directory. If the user's intentions were reliably recognized, the keylock could be released and closed automatically, eliminating the most common and irritating operational feature. Motion sensors alone do not provide sufficiently trustworthy information for this purpose, although they are often useful in context recognition.19 The combined utilization of motion, illumination and image sensing results in substantially improved context recognition accuracy and enables automating the keylock feature, except when the device is used in total darkness.

Figure 3 shows the user handling the device. First, the motion sensor is employed for detecting the handling of the device (on the left), then the ambient illumination sensor is checked to detect the moment when the device comes out of its storage. The front camera is turned on immediately, if considered usable (in the middle), and is used to check for the presence of a face (on the right). In the case of a positive detection result, the keylock is released and the display backlight is simultaneously turned on. If the device is held vertically, the phone directory is shown, while in horizontal orientation an Internet connection and a browser window are opened.

Figure 3. Application case 1: Automatic cell phone activation.

To improve interactivity it would be desirable to turn on the camera already in the pocket, but due to frequent false alarms from motion sensor based context recognition, that would waste battery capacity. The extra 0.5 seconds that could be won is occasionally noticeable with our current implementation as a keylock release latency; however, this is a minor nuisance in comparison to the reduced routine handling needs.
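A minimal sketch of the decision made once a face has been found is given below: the keylock is released, the backlight is turned on, and the launched application is chosen from the device orientation estimated from the accelerometer's gravity vector. The axis convention, the thresholds and the device interface are illustrative assumptions, not our actual platform code.

```python
# Keylock release with orientation-dependent application launch (sketch).
# The axis convention assumes gravity (in g units) along +/-y when the device
# is held upright and along +/-x when it is held in a landscape orientation.

def classify_orientation(ax, ay, az):
    """Classify the pose from a single accelerometer sample."""
    if abs(ay) > 0.7:
        return "vertical"     # portrait, screen towards the user
    if abs(ax) > 0.7:
        return "horizontal"   # landscape
    return "flat"

def on_face_detected(accel_sample, device):
    """'device' is a hypothetical platform interface used for illustration."""
    device.release_keylock()
    device.backlight_on()
    pose = classify_orientation(*accel_sample)
    if pose == "vertical":
        device.launch("phone_directory")
    elif pose == "horizontal":
        device.launch("browser")   # open connection and browser window
```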

4.2 Application case 2: Image capture mode activation

The high resolution cameras of mobile devices are power hungry, so they should be kept on only when needed. However, their turn-on latencies are around 1-2 seconds, which clearly reduces usability when the user is taking a sequence of occasional snapshots. Much of this delay can be hidden by predicting the user's intention to capture an image. In the most typical case, this involves recognizing that the device is raised in front of the face into a horizontal pose. Clearly, the recognition of this context benefits from the coupled use of motion and face sensing with the frontal camera, provided that it is on all the time. Figure 4 illustrates the user handling the device in this case. The keylock is released, the backlight is turned on, and the back camera is activated automatically. In practice, around 0.5 seconds is shaved from the latency compared to manual activation, leaving only a small but noticeable activation delay that gives the user an impression of minor sluggishness. We conclude that the feeling of interactivity would improve if the activation latency of the high resolution camera could be pushed below 1 second to match the human movements.

Figure 4. Application case 2: Automatic image capturing mode activation.

The practical challenge of this scenario is the assumption of having an active front camera. If it is operated at a lower frame rate, the latency saving may not materialize, while a higher frame rate reduces power efficiency. The roots of the problem are in the involvement of the application processor of the platform, and we see this as an argument for dedicated sensor-cameras.
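The prediction in this case can be sketched as a conjunction of two cheap cues: the accelerometer indicating that the device has been raised into a horizontal, landscape picture-taking pose, and the front camera seeing a face. The helper names and thresholds below are illustrative guesses, not the values used in our experiments.

```python
# Predictive activation of the image capture mode (illustrative sketch).
# back_camera.start() is assumed to begin the 1-2 s warm-up in the background.

def looks_like_capture_pose(ax, ay, az):
    # In a typical landscape picture-taking pose gravity lies mostly along
    # the device's long axis (x) and hardly at all along the lens axis (z).
    return abs(ax) > 0.7 and abs(az) < 0.3

def maybe_prewarm_back_camera(accel_sample, face_visible, back_camera):
    """Start the back camera early when the capture intention is predicted."""
    if face_visible and looks_like_capture_pose(*accel_sample):
        back_camera.start()      # hides part of the turn-on latency
        return True
    return False
```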

4.3 Application case 3: Point and Find

Point and Find18 allows people using cell phones to access information and services simply by pointing a camera at real-world objects. For example, users can instantly find information by pointing the camera at the poster of a new movie, at a building, or at a bar code. From the user's point of view it would be most convenient if the device automatically recognized the information retrieval expectations without demanding manual activation of any application. This situation is subtly different from the above transition to picture capture mode. These uses could be differentiated based on the push of the camera button, but this takes place only at the end of picture capture. Point and find can be activated whenever the back camera is active, but it should not interfere with picture capture. Any information retrieved should be presented only on request, or if there is no doubt about the intentions of the user. In our solution we have counted on user co-operation and characteristic motion patterns. The idea is demonstrated in Figure 5: a back and forth panning motion is a request for the point and find service to identify the scene and present the retrieved information. If the target is a bar code, its presence is a direct request to fetch product data based on the EAN/UPC code.

Figure 5. Application case 3: Automatic Point and Find application activation. The illustration shows the camera movement during the panning gesture, with the frame of interest and the support frames.
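The back-and-forth panning request of Figure 5 can be detected from a short history of frame-to-frame horizontal displacement estimates, obtained for instance from the camera motion estimation: the dominant horizontal motion has to reverse its sign a couple of times within a short window with sufficient amplitude. The sketch below illustrates this idea; the window length and thresholds are illustrative guesses rather than values from our implementation.

```python
# Detect a back-and-forth panning gesture from horizontal displacement
# estimates between consecutive frames; thresholds are illustrative guesses.
from collections import deque

class PanningDetector:
    def __init__(self, window=30, min_amplitude=8.0, min_reversals=2):
        self.history = deque(maxlen=window)  # recent x-displacements (pixels)
        self.min_amplitude = min_amplitude
        self.min_reversals = min_reversals

    def update(self, dx):
        """Feed one frame-to-frame horizontal displacement; return True when
        the panning request is recognized."""
        self.history.append(dx)
        strong = [d for d in self.history if abs(d) > self.min_amplitude]
        reversals = sum(1 for a, b in zip(strong, strong[1:]) if a * b < 0)
        return reversals >= self.min_reversals
```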

In this application, information from three sensors can be employed. Both the frontal and back cameras, as well as the accelerometers of the device, provide information on motion, improving the reliability of context recognition. On the other hand, point and find with all its communication needs has significant latencies that reduce the user's impression of interactivity. In our cocktail of multimodal application demonstrations it benefits least from predictive functionality.

5. DISCUSSION

The objective of our walk through the presented application scenarios has been to experimentally evaluate the value of relatively minor improvements in user interaction. Our target has been to reduce frequently repeating routine operations by replacing them with automation. The tool has been camera assisted multimodal context recognition, while the user has not been burdened with any learning needs.

Based on our experiences, usability improves with reductions in the user perceived latency. This is in line with earlier findings.20 Against this background it is not surprising that the "almost delayless" camera assisted keylock turned out to be the most convenient feature, and its absence was later felt to be irritating. The prediction of the picture capture mode, on the other hand, has a noticeable delay that humans appear to consider more bearable if it takes place after pushing a button. Obviously, latencies of automatic operations have high acceptance thresholds. With point and find the base latencies are the largest due to communications. In this application it is more convenient to make conscious operations with the device rather than to trust automatic context identification.

In principle there should be no obstacles to the social acceptance of our demonstration cases. The users are not expected to do anything extraordinary, such as specific gestures, but the automation exploits their normal behaviour patterns. Furthermore, the mental load of the user is very low, with the device making the expected low level operational selections with high reliability. We believe the current demonstrations represent a more general class of context sensitive applications that employ multimodal sensory information, and they may provide ingredients for novel device designs.

6. SUMMARY

We have demonstrated a collection of latency hiding application scenarios that rely on the multiple built-in sensors of modern cell phones for recognizing sequences of user actions and predicting usage intentions. The key ideas rest on the utilization of the hand-held nature of the equipment and on the user being in the field of view of a camera. We use the camera to detect whether the user is watching the device, which is often a good indication of interaction needs. Coupled with the employment of other sensors, we have formulated approaches that are both reliable and reasonably energy efficient even with today's platform technology. The demonstrations presented are by no means the only ways to apply computer vision or multiple sensors to mobile user interaction, and one may find new interesting possibilities in further research.

REFERENCES

[1] J. Rekimoto, "Tilting operations for small screen interfaces," in 9th Annual ACM Symposium on User Interface Software and Technology, pp. 167-168, ACM Press, 1996.
[2] K. Hinckley, J. S. Pierce, M. Sinclair, and E. Horvitz, "Sensing techniques for mobile interaction," in 13th Annual ACM Symposium on User Interface Software and Technology, pp. 91-100, 2000.
[3] A. Pentland, "Looking at people: Sensing for ubiquitous and wearable computing," IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), pp. 107-119, 2000.
[4] M. Turk, "Computer vision in the interface," Communications of the ACM 47, pp. 60-67, January 2004.
[5] A. Schmidt, M. Beigl, and H.-W. Gellersen, "There is more to context than location," Computers and Graphics 23, pp. 893-901, 1999.
[6] S. Oviatt, P. Cohen, L. Wu, J. Vergo, L. Duncan, B. Suhm, J. Bers, T. Holzman, T. Winograd, J. Landay, J. Larson, and D. Ferro, "Designing the user interface for multimodal speech and pen-based gesture applications: State-of-the-art systems and future research directions," Human Computer Interaction 15(4), pp. 263-322, 2000.
[7] J. O. Wobbrock, "The future of mobile device research in HCI," in CHI 2006 Workshop Proceedings: What is the Next Generation of Human-Computer Interaction?, pp. 131-134, 2006.
[8] M. Turk and G. Robertson, "Perceptual user interfaces," Communications of the ACM 43, pp. 33-34, 2000.
[9] A. Sears, M. Lin, J. Jacko, and Y. Xiao, When Computers Fade... Pervasive Computing and Situationally-Induced Impairments and Disabilities, pp. 1298-1302, Lawrence Erlbaum Associates, Hillsdale, 2003.
[10] S. Oviatt, R. Coulston, and R. Lunsford, "When do we interact multimodally?: Cognitive load and multimodal communication patterns," in ICMI '04: Proceedings of the 6th International Conference on Multimodal Interfaces, pp. 129-136, ACM, 2004.
[11] J. Williamson, R. Murray-Smith, and S. Hughes, "Shoogle: excitatory multimodal interaction on mobile devices," in CHI '07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 121-124, ACM, 2007.
[12] S. Ronkainen, J. Häkkilä, S. Kaleva, A. Colley, and J. Linjama, "Tap input as an embedded interaction method for mobile devices," in TEI '07: Proceedings of the 1st International Conference on Tangible and Embedded Interaction, pp. 263-270, ACM, 2007.
[13] C. Randell and H. Muller, "Context awareness by analyzing accelerometer data," in ISWC '00: Proceedings of the 4th IEEE International Symposium on Wearable Computers, p. 175, IEEE Computer Society, 2000.
[14] W. Freeman, P. Beardsley, H. Kage, K. Tanaka, K. Kyuma, and C. Weissman, "Computer vision for computer interaction," ACM SIGGRAPH Computer Graphics 33(4), pp. 65-68, 2000.
[15] S. A. Brewster and J. Johnston, "Multimodal interfaces for camera phones," in MobileHCI '08: Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services, pp. 387-390, ACM, 2008.
[16] O. J. Silvén, J. Hannuksela, M. Bordallo López, M. Turtinen, M. Niskanen, J. Boutellier, M. Vehviläinen, and M. Tico, "New video applications on mobile communication devices," in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 6821, 2008.
[17] V. Ojansivu and J. Heikkilä, "Blur insensitive texture classification using local phase quantization," in Proceedings of the International Conference on Image and Signal Processing, pp. 236-243, 2008.
[18] Nokia, "Nokia Point & Find BETA." Website, 2009. http://pointandfind.nokia.com/.
[19] A. Schmidt, K. A. Aidoo, A. Takaluoma, U. Tuomela, K. Van Laerhoven, and W. Van de Velde, "Advanced interaction in context," Lecture Notes in Computer Science 1707, pp. 89-101, 1999.
[20] W. Kuhmann, W. Boucsein, F. Schaefer, and J. Alexander, "Experimental investigation of psychophysiological stress-reactions induced by different response times in human-computer interaction," Ergonomics 30, pp. 933-943, 1987.