A Vision-Based Approach for Controlling User Interfaces of Mobile Devices

Jari Hannuksela, Pekka Sangi and Janne Heikkilä
Machine Vision Group, Infotech Oulu and Department of Electrical and Information Engineering
P.O. Box 4500, FIN-90014 University of Oulu, Finland
{jhannuks,psangi,jth}@ee.oulu.fi
Abstract

We introduce a novel user interface solution for mobile devices which enables the display to be controlled by the motion of the user's hand. A feature-based approach is proposed for dominant global motion estimation that exploits gradient measures for both feature selection and motion uncertainty analysis. We also present a voting-based scheme for outlier removal. A Kalman filter is utilized for smoothing motion trajectories. A fixed-point implementation of the method was made on a mobile device platform that sets computational restrictions for the algorithms used. Experiments with synthetic and real image sequences show the effectiveness of the method and demonstrate the practicality of the approach on a smartphone.
1. Introduction

A fundamental problem in the usability of mobile devices is their limited display size. We present a simple and intuitive technique for controlling the user interfaces of such devices. The basic idea of the solution relies on the motion estimated from images captured with the camera located on the backside of the mobile device, as depicted in Fig. 1 (a). User interface operations are performed simply by moving the device in the hand. In this way, the display works as a pointer, much like a mouse that is used to control a PC desktop. Fig. 1 (b) illustrates the basic control principle. For example, lateral motion upwards scrolls the focus towards the upper part of the display, and back and forth motion is interpreted as zooming in and out.

Vision-based human-computer interaction has been a popular research topic, and computer vision has been increasingly utilized for controlling such interfaces. Recently, techniques such as face and head tracking [12, 5], gaze detection [6], and hand tracking [4] have been successfully applied, to name a few. Even commercial nose tracking [2] applications for controlling a PC are available. Interaction with mobile devices such as the smartphone or personal digital assistant (PDA) has also become a promising application area, especially for the wireless industry, and interest in developing new techniques is increasing all the time. For example, mobile gaming is an important business field that could utilize advanced interaction techniques. Many computer vision methods are already available, but they generally need too much processing power to be applied in smartphones. This is also a rather new application area, and there are not many solutions available so far.

Figure 1: a) Mobile device, and b) controlling options.

In this work we utilize motion information obtained from the camera. An alternative solution for estimating the device motion could be based on dedicated motion sensors such as accelerometers. However, vision is a more natural choice, because the camera can provide motion data cheaply without the need to install external sensors. We propose an efficient algorithm for estimating global motion and demonstrate its feasibility in controlling the user interface of a mobile device. Global motion refers here to the apparent dominant 2-D motion between frames, which can be approximated by some parameterized flow field model [3]. Such models are typically estimated using gradient-based techniques [9]. In our method, we adopt a region-based matching approach, where a sparse set of features is used for motion analysis. The main computational steps of the method are shown in Fig. 2, and they are explained in Sections 2 and 3. The method has been implemented in software using only fixed-point arithmetic due to the lack of a floating-point unit. Details of the implementation and experiments are discussed in Sections 4 and 5.
Figure 2: Computational steps in global motion estimation.
2. Feature motion analysis
Our method has two main phases. In the first phase, the motion of the selected features and the related uncertainty are analyzed, and in the second phase the results are used for obtaining parametric global motion estimates. Feature motion estimation is based on block matching using a sum of squared differences (SSD) measure. For both feature selection and uncertainty analysis, spatial gradient information is exploited as explained in the following sections.
2.1. Feature selection
Once the image is obtained, it is subdivided into N subregions, and one M by M block is selected from each subregion as illustrated in Fig. 3. We denote the feature blocks by B_i and their midpoints by p_i. Margin areas are not considered, as the SSD measure has to be evaluated in all directions during motion estimation. Various criteria for selecting good features have been proposed, and they typically analyze the richness of texture within a window [11]. Motion estimation is typically most reliable for blocks corresponding to corners, which is reflected in the image gradients. In our approach, simple squaring of adjacent pixel differences in the x- and y-directions is computed for each candidate block within a subregion, and the sum of these values, G(B_i), is used as the selection criterion. The candidate giving the maximum value is included in the set of features. Such a straightforward scheme is well suited to our computational platform. It must be noted that a subregion may not contain any good feature, or it might contain only straight edges, so the estimation of feature motion may suffer from the aperture problem. In order to alleviate this problem, the uncertainty in the feature motion estimates is analyzed and used in global motion estimation; therefore, the feature selection method used can be considered sufficient.

Figure 3: Local motion analysis: 16 subregions (green), selected feature blocks (red) on the left and motion estimates (red) with error covariances (green) on the right.
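As an illustration of this selection criterion, the following C++ sketch computes the gradient measure G(B) as a sum of squared differences of adjacent pixels and keeps the strongest block within one subregion. The image layout, the block scanning step and the function names are our own assumptions, not the authors' implementation.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the feature selection criterion: G(B) is the sum of squared
// differences of adjacent pixels in x and y over an M-by-M block.
// Row-major 8-bit luminance layout and the scanning step of 2 pixels are
// assumptions made for this illustration.
struct Block { int x, y; long score; };

long gradientMeasure(const std::vector<uint8_t>& img, int width,
                     int bx, int by, int M) {
    long g = 0;
    for (int y = by; y < by + M; ++y)
        for (int x = bx; x < bx + M; ++x) {
            int dx = img[y * width + x + 1] - img[y * width + x];
            int dy = img[(y + 1) * width + x] - img[y * width + x];
            g += dx * dx + dy * dy;   // integer-only arithmetic
        }
    return g;
}

// Pick the block with the largest G(B) inside one subregion, leaving a
// margin so that the later SSD search stays inside the image.
Block selectFeature(const std::vector<uint8_t>& img, int width,
                    int rx, int ry, int rw, int rh, int M, int margin) {
    Block best{rx + margin, ry + margin, -1};
    for (int y = ry + margin; y + M + 1 <= ry + rh - margin; y += 2)
        for (int x = rx + margin; x + M + 1 <= rx + rw - margin; x += 2) {
            long g = gradientMeasure(img, width, x, y, M);
            if (g > best.score) best = {x, y, g};
        }
    return best;
}
```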
2.2. Feature motion estimation
In order to determine the displacement of a feature block, the SSD measure D(v; B_i) is evaluated over a suitable range of integer displacements in the x- and y-directions (e.g. -12...+12). The displacement v that minimizes the criterion is used as the feature motion estimate v_i. Due to noise and the aperture problem, there might be other good matches, and we therefore compute a 2 by 2 covariance matrix C_i (error ellipse), which encodes this uncertainty of the local motion. In statistical estimation theory, such matrices are powerful for error propagation [3], which we perform in global motion estimation. The computation of C_i is based on an analysis where we select motion candidates which possibly have the true motion in their vicinity. The idea was presented in our earlier work [10]. For the selection, we use the block gradient measures computed in the previous step for thresholding the SSD values of the motion candidates. The rule for determining the set of proper candidates, V_i, is of the form

    V_i = \{ v \mid D(v; B_i) \le k_1 G(B_i) + k_2 \},    (1)

where k_1 and k_2 are fixed parameters [10]. Once V_i has been determined, the confidence of the feature motion estimate is summarized as the matrix

    C_i = \frac{1}{\#V_i} \sum_{v \in V_i} (v - v_{c,i})(v - v_{c,i})^T + \frac{1}{12} I,    (2)

where v_{c,i} denotes the centroid of V_i and I is a 2×2 identity matrix. The resulting covariance matrices are illustrated as green ellipses in Fig. 3.
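The following sketch illustrates this step under the assumptions stated in the comments: a full SSD search over the given range, followed by the construction of the candidate set of Eq. (1) and the covariance of Eq. (2). Floating point is used for readability, whereas the paper's implementation works in fixed point; the data structures and helper names are our own.

```cpp
#include <cstdint>
#include <vector>

// Sketch of feature motion estimation (Sec. 2.2) for one feature block.
// The block is assumed to stay inside both images for every tested shift.
struct MotionEstimate { double vx, vy; double C[2][2]; };

long ssd(const std::vector<uint8_t>& prev, const std::vector<uint8_t>& cur,
         int width, int bx, int by, int dx, int dy, int M) {
    long s = 0;
    for (int y = 0; y < M; ++y)
        for (int x = 0; x < M; ++x) {
            int d = prev[(by + y) * width + (bx + x)]
                  - cur[(by + y + dy) * width + (bx + x + dx)];
            s += d * d;
        }
    return s;
}

MotionEstimate estimateFeatureMotion(const std::vector<uint8_t>& prev,
                                     const std::vector<uint8_t>& cur,
                                     int width, int bx, int by, int M,
                                     int range, long G, double k1, double k2) {
    struct Cand { int dx, dy; };
    std::vector<Cand> V;                       // candidate set V_i of Eq. (1)
    long bestSsd = -1; int bestDx = 0, bestDy = 0;
    for (int dy = -range; dy <= range; ++dy)
        for (int dx = -range; dx <= range; ++dx) {
            long s = ssd(prev, cur, width, bx, by, dx, dy, M);
            if (bestSsd < 0 || s < bestSsd) { bestSsd = s; bestDx = dx; bestDy = dy; }
            if (s <= k1 * G + k2) V.push_back({dx, dy});
        }
    if (V.empty()) V.push_back({bestDx, bestDy});
    // Centroid of V_i and scatter around it, plus (1/12) I, as in Eq. (2).
    double cx = 0, cy = 0;
    for (const Cand& c : V) { cx += c.dx; cy += c.dy; }
    cx /= V.size(); cy /= V.size();
    MotionEstimate m{double(bestDx), double(bestDy),
                     {{1.0 / 12.0, 0.0}, {0.0, 1.0 / 12.0}}};
    for (const Cand& c : V) {
        double ex = c.dx - cx, ey = c.dy - cy;
        m.C[0][0] += ex * ex / V.size();
        m.C[0][1] += ex * ey / V.size();
        m.C[1][1] += ey * ey / V.size();
    }
    m.C[1][0] = m.C[0][1];
    return m;
}
```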
3. Global motion analysis
As a result of the feature motion estimation phase, we have a set of features F_i = (p_i, v_i, C_i), i = 1, ..., N, and the goal now is to estimate the global motion using this information. For our purpose, a four-parameter affine model is considered sufficient for approximating the motion between frames, as it can represent 2-D motion consisting of translation, rotation, and scaling [13]. The displacement v of a feature located at p = [x, y]^T can be represented as

    v = v(\theta, p) = \begin{bmatrix} 1 & 0 & x & y \\ 0 & 1 & y & -x \end{bmatrix} \theta = H[p]\,\theta,    (3)

where θ = [θ_1, θ_2, θ_3, θ_4]^T is the vector of model parameters. Here, θ_1 and θ_2 are related to common translational motion, and θ_3 and θ_4 contain information about the 2-D rotation φ and scaling s:

    \theta_3 = s \cos\varphi - 1,    (4)
    \theta_4 = s \sin\varphi.    (5)
The objective of global motion estimation is to determine θ for the apparent dominant motion.
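As a small illustration of this model, the following helper (our own, hypothetical code) evaluates v = H[p]θ for a given point and recovers φ and s from θ_3 and θ_4 via (4) and (5).

```cpp
#include <cmath>

// Sketch of the four-parameter model of Eqs. (3)-(5): H[p] maps the parameter
// vector theta = (t_x, t_y, s*cos(phi) - 1, s*sin(phi)) to the displacement of
// a feature at p = (x, y).
struct Vec2 { double x, y; };

Vec2 modelDisplacement(const double theta[4], double x, double y) {
    // v = H[p] * theta with H[p] = [1 0  x  y; 0 1  y -x]
    return { theta[0] + x * theta[2] + y * theta[3],
             theta[1] + y * theta[2] - x * theta[3] };
}

void rotationAndScale(const double theta[4], double& phi, double& s) {
    phi = std::atan2(theta[3], theta[2] + 1.0);   // from Eqs. (4)-(5)
    s   = std::hypot(theta[2] + 1.0, theta[3]);
}
```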
3.1. Outlier analysis

The purpose of outlier analysis is to discard those feature motion measurements that might be harmful when global motion estimation is performed. There are several reasons for doing this analysis. For example, feature motion estimates might be erroneous due to image noise, or there might be several independent motions in the scene. We assume that the majority of features are associated with the global motion we want to estimate. In order to select those trusted features, a method similar to random sample consensus (RANSAC) [1] is used. In our approach, pairs of features F_i are chosen for instantiating similarity motion model hypotheses, which are then voted for by the other features. The hypothesis that gets the most support is considered to be close to the dominant global motion and is used for selecting the trusted features. More specifically, based on (3), a hypothesis for a pair (F_k, F_l) is generated using

    \theta_{k,l} = \begin{bmatrix} H[p_k] \\ H[p_l] \end{bmatrix}^{-1} \begin{bmatrix} v_k \\ v_l \end{bmatrix},    (6)

and it is evaluated by calculating the number of votes, ν(θ_{k,l}), using

    \nu(\theta_{k,l}) = \sum_{i=1}^{N} \nu_i(\theta_{k,l}),    (7)

where ν_i(θ_{k,l}) ≥ 0 is the number of votes given by the feature F_i as explained below.

The covariance matrix C_i associated with the feature F_i provides information about the feature motion uncertainty in different directions. Therefore, it is reasonable to base the calculation of ν_i(θ_{k,l}) on the Mahalanobis distance between the locally estimated motion v_i and the hypothesized motion v(θ_{k,l}, p_i) given by (3). In order to simplify calculations, we use the squared Mahalanobis distance

    d(F_i, \theta_{k,l}) = d_{i,k,l}^T C_i^{-1} d_{i,k,l},    (8)

where d_{i,k,l} = v_i − v(θ_{k,l}, p_i). Using this distance, the number of votes is computed as

    \nu_i(\theta_{k,l}) = \begin{cases} T - d(F_i, \theta_{k,l}) & \text{if } d(F_i, \theta_{k,l}) < T \\ 0 & \text{otherwise} \end{cases},    (9)

where T is a user-defined threshold (e.g. T = 4.0). As the number of features can be relatively small, we perform an exhaustive evaluation of all feature pairs using (6) and (7). The hypothesis θ_{k,l} that maximizes the measure (7) is used for selecting the trusted features F_i passed to the global motion estimation. The trusted features are those features F_i that give some support for the best hypothesis, i.e., for which ν_i(θ_{k,l}) is non-zero.
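A possible realization of this voting scheme is sketched below. For this particular four-parameter model, the pair hypothesis of Eq. (6) can be solved in closed form instead of with a general 4×4 inversion; the Feature struct, the helper names and the exhaustive loop structure are our own assumptions, not the authors' code.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the voting-based outlier analysis of Sec. 3.1 (Eqs. (6)-(9)).
struct Feature { double x, y;        // block midpoint p_i
                 double vx, vy;      // motion estimate v_i
                 double C[2][2]; };  // covariance C_i

struct Theta { double t[4]; };

// Hypothesis theta_{k,l} from two features (requires p_k != p_l).
Theta pairHypothesis(const Feature& a, const Feature& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    double dvx = a.vx - b.vx, dvy = a.vy - b.vy;
    double n = dx * dx + dy * dy;
    double th3 = (dx * dvx + dy * dvy) / n;
    double th4 = (dy * dvx - dx * dvy) / n;
    return { { a.vx - a.x * th3 - a.y * th4,
               a.vy - a.y * th3 + a.x * th4, th3, th4 } };
}

// Squared Mahalanobis distance of Eq. (8) between observed and predicted motion.
double distance(const Feature& f, const Theta& th) {
    double rx = f.vx - (th.t[0] + f.x * th.t[2] + f.y * th.t[3]);
    double ry = f.vy - (th.t[1] + f.y * th.t[2] - f.x * th.t[3]);
    double det = f.C[0][0] * f.C[1][1] - f.C[0][1] * f.C[1][0];
    return (rx * (f.C[1][1] * rx - f.C[0][1] * ry) +
            ry * (f.C[0][0] * ry - f.C[1][0] * rx)) / det;
}

// Exhaustive pairwise search; returns indices of the trusted features.
std::vector<size_t> selectTrusted(const std::vector<Feature>& feats, double T = 4.0) {
    Theta best{}; double bestVotes = -1.0;
    for (size_t k = 0; k < feats.size(); ++k)
        for (size_t l = k + 1; l < feats.size(); ++l) {
            Theta th = pairHypothesis(feats[k], feats[l]);
            double votes = 0.0;
            for (const Feature& f : feats) {
                double d = distance(f, th);
                if (d < T) votes += T - d;           // Eq. (9)
            }
            if (votes > bestVotes) { bestVotes = votes; best = th; }
        }
    std::vector<size_t> trusted;
    for (size_t i = 0; i < feats.size(); ++i)
        if (distance(feats[i], best) < T) trusted.push_back(i);
    return trusted;
}
```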
3.2. Global motion estimation
In order to compute interframe motion estimates, we assume that the motion of any trusted feature F_i = (p_i, v_i, C_i) is a realization of the random vector

    v_i = H[p_i]\theta + \eta_i,    (10)

where θ is the underlying true motion and η_i is an observation noise vector with E(η_i) = 0 and E(η_i η_i^T) = C_i. In addition, the observations of feature motions are assumed to be independent, and then the best linear unbiased estimator (BLUE) [7] for the motion parameters is

    \hat{\theta} = (H^T W H)^{-1} H^T W v,    (11)

where v is the vector of motion observations composed of the v_i, W is the inverse of the block-diagonal matrix composed of the C_i, and the observation matrix H is composed of the H[p_i]. The uncertainty of the estimate, propagated from the feature motion uncertainties, is represented with the covariance matrix

    C_{\hat{\theta}} = (H^T W H)^{-1}.    (12)

Using the matrix decomposition techniques discussed in Sec. 4, the evaluation of (11) and (12) can be implemented efficiently. The parameters obtained directly give information about the x-y translation (θ_1, θ_2) and the camera rotation around the z-axis (φ). The scaling s is related to the translation in the z-direction, ∆Z, as (s − 1) = ∆Z/Z, where Z is the scene depth.
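The sketch below illustrates how the normal equations behind Eq. (11) can be accumulated feature by feature: each trusted feature contributes H[p_i]^T C_i^{-1} H[p_i] to H^T W H and H[p_i]^T C_i^{-1} v_i to H^T W v. The struct and function names are hypothetical, and the resulting 4×4 system would then be solved as described in Sec. 4.2.

```cpp
#include <vector>

// Sketch of assembling the weighted least-squares (BLUE) normal equations
// A*theta = b with A = H^T W H and b = H^T W v (Eq. (11)).
// The Feature struct is the same hypothetical one used in the outlier sketch.
struct Feature { double x, y, vx, vy; double C[2][2]; };

void accumulateNormalEquations(const std::vector<Feature>& trusted,
                               double A[4][4], double b[4]) {
    for (int r = 0; r < 4; ++r) { b[r] = 0; for (int c = 0; c < 4; ++c) A[r][c] = 0; }
    for (const Feature& f : trusted) {
        // Rows of H[p_i]: h0 = (1, 0, x, y) for v_x, h1 = (0, 1, y, -x) for v_y.
        double h0[4] = {1, 0, f.x,  f.y};
        double h1[4] = {0, 1, f.y, -f.x};
        // Inverse of the 2x2 covariance C_i acts as the weight matrix W_i.
        double det = f.C[0][0] * f.C[1][1] - f.C[0][1] * f.C[1][0];
        double w00 =  f.C[1][1] / det, w01 = -f.C[0][1] / det, w11 = f.C[0][0] / det;
        for (int r = 0; r < 4; ++r) {
            for (int c = 0; c < 4; ++c)
                A[r][c] += h0[r] * w00 * h0[c] + h0[r] * w01 * h1[c]
                         + h1[r] * w01 * h0[c] + h1[r] * w11 * h1[c];
            b[r] += h0[r] * (w00 * f.vx + w01 * f.vy)
                  + h1[r] * (w01 * f.vx + w11 * f.vy);
        }
    }
}
```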
3.3. Motion filtering
In order to obtain smoother output as a result of motion estimation, filtering can be applied to the global motion estimates. Considering our application, we are also interested in integrating these estimates to obtain position information. In our solution, Kalman filtering [7] is applied separately to each motion parameter, which reduces the computational complexity. There are thus four filters: for translation along the x-, y-, and z-axes, and for rotation around the z-axis.

Let Θ̇_k denote the value of an interframe motion parameter for the frame pair (k − 1, k). We assume that the motion is constant but subject to acceleration errors modelled as zero-mean white Gaussian noise. In the dynamic model, the state vector is

    x_k = [\Theta_k, \dot{\Theta}_k]^T,    (13)

where Θ_k is the corresponding position. The state equation is

    x_k = \begin{bmatrix} 1 & T_k \\ 0 & 1 \end{bmatrix} x_{k-1} + \begin{bmatrix} T_k^2/2 \\ T_k \end{bmatrix} \varepsilon_k,    (14)

where ε_k is the acceleration error term and T_k denotes the time step between the frames (k − 1, k). The observation models for the filters are of the form

    z_k = [\,0 \;\; 1\,]\, x_k + \xi_k,    (15)

where z_k is the observed value of the interframe motion parameter and ξ_k denotes the observation noise. For the x- and y-translations, the variances of the observation noise are obtained directly from the diagonal of C_θ̂ given in (12). The variances of the z-translation and the rotation are derived from C_θ̂ using covariance mapping with the Jacobian matrix.
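A minimal sketch of one such scalar filter is given below, assuming the constant-velocity model of (13)-(15). The process noise variance q is a tuning parameter of our own, and the measurement variance r would come from the diagonal of C_θ̂; this is an illustration, not the paper's fixed-point implementation.

```cpp
// Sketch of one of the four scalar Kalman filters of Sec. 3.3: state
// (Theta, dTheta) with a constant-velocity model driven by white acceleration
// noise (Eq. 14), and an observation of the interframe motion dTheta (Eq. 15).
struct MotionFilter {
    double pos = 0, vel = 0;                 // state estimate [Theta, dTheta]
    double P[2][2] = {{1, 0}, {0, 1}};       // state covariance

    void predict(double T, double q) {
        // x <- F x with F = [1 T; 0 1];  P <- F P F^T + q G G^T, G = [T^2/2; T]
        pos += T * vel;
        double g0 = 0.5 * T * T, g1 = T;
        double P00 = P[0][0] + T * (P[1][0] + P[0][1]) + T * T * P[1][1];
        double P01 = P[0][1] + T * P[1][1];
        double P11 = P[1][1];
        P[0][0] = P00 + q * g0 * g0;
        P[0][1] = P[1][0] = P01 + q * g0 * g1;
        P[1][1] = P11 + q * g1 * g1;
    }

    void update(double z, double r) {
        // Observation z of the velocity component: H = [0 1].
        double S = P[1][1] + r;
        double k0 = P[0][1] / S, k1 = P[1][1] / S;   // Kalman gain
        double innov = z - vel;
        pos += k0 * innov;
        vel += k1 * innov;
        double P00 = P[0][0] - k0 * P[1][0];
        double P01 = P[0][1] - k0 * P[1][1];
        double P10 = P[1][0] - k1 * P[1][0];
        double P11 = P[1][1] - k1 * P[1][1];
        P[0][0] = P00; P[0][1] = P01; P[1][0] = P10; P[1][1] = P11;
    }
};
```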
4. Implementation

Mobile devices such as smartphones are typically small in size, which leads to limited hardware resources. Limitations such as low processing power and scarce system memory impose many constraints on software development that should be considered carefully. We start by describing the platform used, because our implementation is very hardware dependent.
4.1. Hardware description

We have implemented and evaluated our method on a Nokia 3650 smartphone. The main features of the platform are as follows:
• Series 60 platform, Symbian OS 6.1
• Memory: 3.4 MB shared
• Processor: 32-bit RISC CPU based on the ARM-9 series, 104 MHz
• Screen: 12-bit, 176 by 208 pixels
• Camera: 0.3 megapixels (VGA)
Due to the lack of a floating-point unit, all floating-point arithmetic is emulated, resulting in very slow performance. This is why fixed-point arithmetic should be used instead. In practice, this means that integers have to be used for all operations in order to get good performance. Furthermore, the processor does not have a divide instruction, and therefore division should be avoided. The camera is located on the backside of the phone, which is the most common configuration in current smartphones. The camera server provides a low-resolution RGB image for tracking purposes. The image size is 160 by 120 pixels with a color depth of 12 bits per pixel. The maximum frame rate for image acquisition is about 15 fps. It is also possible to use a higher resolution (640 by 480), but then the frame rate drops considerably. The quality of the images is low due to severe optical distortions.
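To illustrate the kind of integer-only arithmetic this implies, the following sketch shows a Q16.16 fixed-point multiply and a reciprocal-based replacement for division. The Q16.16 format and the helper names are our assumptions and are not taken from the paper.

```cpp
#include <cstdint>

// Illustration of fixed-point arithmetic on a CPU without an FPU or a divide
// instruction. Values are 32-bit integers with 16 fractional bits (Q16.16);
// multiplication uses a 64-bit intermediate, and a division by a value known
// in advance is replaced by a multiplication with its precomputed reciprocal.
typedef int32_t fix16;                       // Q16.16 fixed-point number

const fix16 FIX_ONE = 1 << 16;

inline fix16 fix_from_int(int32_t a) { return a << 16; }

inline fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> 16);  // keep precision in a 64-bit product
}

// Division is expensive without a divide instruction: precompute 1/d once
// and reuse it as a multiplication.
inline fix16 fix_reciprocal(fix16 d) { return (fix16)(((int64_t)FIX_ONE << 16) / d); }
inline fix16 fix_div(fix16 a, fix16 recip_d) { return fix_mul(a, recip_d); }
```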
4.2. Real-time implementation

A C/C++ software component was implemented to estimate camera motion in real time on the smartphone. The algorithms use fixed-point computations and the memory usage is minimized. The component is a callable library that other applications can utilize to get control information for their tasks. Input frames are converted to luminance frames using 4 bits per pixel because the color information is not needed in motion estimation. Currently, we use 16 feature blocks of 6 by 6 pixels for feature motion estimation. The estimation range is set to -12...+12, which means a 25 by 25 pixel search area. Based on the experiments and the dynamic range available, the threshold parameters in (1) were chosen to be (k_1, k_2) = (0.25, 4).

The global motion estimation described in Sec. 3.2 is probably the most critical part of the method when numerical accuracy is considered. In order to avoid the matrix inversion in (11), we determine θ̂ by solving the normal equations

    H^T W H \hat{\theta} = H^T W v,    (16)

with the Cholesky decomposition, which is very efficient when the square matrix to be decomposed is symmetric and positive definite, as is the case here. This decomposition can be two times faster than other similar decompositions. Another advantage is that the numerical accuracy is easier to maintain compared to some other techniques. The decomposition of H^T W H is

    L L^T = H^T W H,    (17)

where L is a lower triangular matrix. As only a 4 × 4 matrix has to be decomposed, the computational complexity is relatively low. Given this result, θ̂ is obtained by solving the linear system using back-substitution. The covariance matrix is also easily computed from (17). The algorithm for performing the Cholesky decomposition is described in [8]. We have optimized the algorithm for our case in order to minimize the number of operations needed; in particular, the number of divide operations and the memory usage are reduced considerably.
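A straightforward floating-point sketch of this solver is given below: Cholesky factorization of the 4×4 normal-equation matrix followed by forward and back substitution, in the spirit of [8]. It is meant only to illustrate Eqs. (16)-(17); the actual implementation works in fixed point and contains further optimizations.

```cpp
#include <cmath>

// Solve A*theta = b with A = H^T W H (4x4, symmetric positive definite) and
// b = H^T W v, using the Cholesky factorization A = L L^T of Eq. (17).
const int N = 4;

bool choleskySolve(const double A[N][N], const double b[N], double theta[N]) {
    double L[N][N] = {};
    // Factorization A = L L^T.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false;   // not positive definite
                L[i][j] = std::sqrt(sum);
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    // Forward substitution: L y = b.
    double y[N];
    for (int i = 0; i < N; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Back substitution: L^T theta = y.
    for (int i = N - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < N; ++k) sum -= L[k][i] * theta[k];
        theta[i] = sum / L[i][i];
    }
    return true;
}
```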
5. Experiments

In the experiments, the method was evaluated using synthetic and real image sequences. With synthetic sequences, ground truth data for the global motion was obtained. An image sequence was generated by sliding a window over a high-resolution base image as shown in Fig. 4. Each frame (160 by 120 pixels) was synthesized from the image region inside the window via interpolation, subsampling and the addition of Gaussian noise. Three kinds of camera motions were simulated: (1) 2-D translation along the x- and y-axes, (2) translation in the z-direction, and (3) translation with a slight rotation φ around the z-axis. Performance was evaluated quantitatively for each motion pattern and for the three base images shown in Fig. 5. The performance measure used was based on the mean-square error (MSE) of the motion vector field, that is, the average value of ||H[p](θ̂ − θ)||² over the image coordinates p, where θ̂ is the interframe motion estimate without filtering and θ is the ground truth motion. The median, mean and maximum root-mean-square error (RMSE) values computed over the sequences are given in Table 1. Considering the estimation of rotation and scaling, an RMSE equal to one pixel corresponds approximately to a pure rotational error of 1 deg and a pure scaling error of 1.8%. It can be seen that the performance is sufficient for user interface control purposes. It is slightly better for base image A than for B and C. In general, the performance depends on the texture content of the scene and is worst with scenes that are featureless or contain just some periodic texture.

Figure 4: Motion simulation. A base image with sliding window locations on the left and a frame (4 bpp) with ground truth motion (red arrows) on the right.

Figure 5: Example frames from sequences, one for each base image.

                                          RMSE [pixels]
Motion                          Base   Median   Mean    Max
x-y translation                  A     0.178    0.190   0.296
                                 B     0.285    0.302   0.467
                                 C     0.282    0.307   0.510
z translation                    A     0.141    0.162   0.381
                                 B     0.175    0.267   0.621
                                 C     0.190    0.239   0.429
x-y translation and rotation     A     0.183    0.210   0.349
                                 B     0.231    0.282   0.495
                                 C     0.234    0.255   0.442

Table 1: Results with synthesized sequences.

We also verified the correct operation of our approach, including motion filtering, with real image sequences using the platform described in Sec. 4. In the real sequences the user made movements similar to those in the synthetic sequences, and the same scenes were used as in the synthetic case. The length of the sequences was 30 frames. Fig. 6 shows the trajectory when the user makes a translational movement (letter z). Fig. 7 and Fig. 8 show trajectories for forward-backward motion and rotation, respectively. For the real sequences there was no ground truth data available, but the trajectories accurately follow the movements that the user made, indicating good motion estimation performance. With real image sequences the frame rate achieved was about 10 fps. The frame rate varies slightly due to the asynchronous camera server and other processes running in the OS.

Figure 6: Trajectory for translational motion (letter z).

Figure 7: Trajectory for forward-backward motion.

Figure 8: Trajectory for translational motion with slight rotation.
6. Summary and conclusions

We have proposed a novel solution to control the user interfaces of mobile devices. The main idea of our approach is to use the dominant motion estimated from image data. The method utilizes efficient techniques for feature selection, feature motion analysis and outlier removal. The method was implemented as C/C++ software using fixed-point arithmetic and demonstrated on a Nokia 3650 smartphone. Experiments with synthetic and real image sequences indicate that our motion estimation algorithm can determine the translation and rotation parameters accurately.

Of course, camera-based control has some limitations with respect to usage scenarios. For example, the method cannot be used if the user is walking, as controlling hand motions cannot be determined in such situations. Zooming also cannot be performed if the scene is distant, but this is not considered a severe problem as the camera points downwards in typical use cases. Despite such limitations, our solution provides a viable alternative for controlling the user interface of a mobile device. The motion parameters obtained can be used, for example, to browse and zoom large objects that do not fit into the display as such. One of the most important benefits of this solution is that there is no need for sensors other than a camera, making the system less expensive to produce. Also, the current platform can be used simply by downloading the software component that produces the coordinate data for third-party applications.
Acknowledgments

The financial support of the National Technology Agency of Finland and the Nokia Foundation is gratefully acknowledged.
References

[1] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, 24(6), pp. 381-395, 1981.

[2] D. Gorodnichy and G. Roth, "Nouse 'Use Your Nose as a Mouse' Perceptual Vision Technology for Hands-Free Games and Interfaces," Image and Vision Computing, 22, pp. 931-942, 2004.

[3] H. Haußecker and H. Spies, "Motion," in B. Jähne, H. Haußecker and P. Geißler, editors, Handbook of Computer Vision and Applications, Vol. 2, Ch. 13, Academic Press, San Diego, 1999.

[4] M. Kölsch and M. Turk, "Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration," in Proc. of the IEEE Workshop on Real-Time Vision for Human-Computer Interaction, Washington DC, USA, June 2004.

[5] M. La Cascia, S. Sclaroff and V. Athitsos, "Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture-Mapped 3D Models," IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(4), pp. 322-336, 2000.

[6] J. J. Magee, M. R. Scott, B. N. Waber and M. Betke, "EyeKeys: A Real-time Vision Interface Based on Gaze Detection from a Low-grade Video Camera," in Proc. of the IEEE Workshop on Real-Time Vision for Human-Computer Interaction, Washington DC, USA, June 2004.

[7] J. M. Mendel, Lessons in Estimation Theory for Signal Processing, Communications, and Control, Prentice-Hall, 1995.

[8] W. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge University Press, 2002.

[9] J.-M. Odobez and P. Bouthemy, "Robust Multiresolution Estimation of Parametric Motion Models," Journal of Visual Communication and Image Representation, 4, pp. 348-365.

[10] P. Sangi, J. Heikkilä and O. Silvén, "Motion Analysis Using Frame Differences with Spatial Gradient Measures," in Proc. ICPR, Vol. 4, pp. 733-736, 2004.

[11] C. Tomasi and T. Kanade, "Detection and Tracking of Point Features," Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, 1991.

[12] K. Toyama, "Look, Ma - No Hands! Hands-Free Cursor Control with Real-Time 3D Face Tracking," in Proc. PUI'98, pp. 49-54, 1998.

[13] Q. Zheng and R. Chellappa, "A Computational Vision Approach to Image Registration," IEEE Trans. on Image Processing, 2, pp. 311-326, 1993.