Visual Hand Tracking Algorithms
Fariborz Mahmoudi 1,2, Mehdi Parviz 2
1 Computer Engineering Dept., Islamic Azad University, Qazvin Branch
2 Multimedia Dept., IT Research Faculty, Iran Telecom Research Center
[email protected], [email protected]
Abstract
Tracking is an important component of a gesture recognition system. Numerous techniques for segmentation, tracking, modeling, and recognition have been proposed during the past several years. A few papers comparing different approaches have been published [31, 32]. However, a comprehensive survey on tracking is still missing. We try to fill this gap by reviewing the most widely used methods and techniques and collecting their numerical evaluation results.
Keywords: Tracking algorithms, gesture recognition, sign language recognition.
1. Introduction
The work described in this paper is part of a greater effort aimed at giving computers the ability to segment, track, and understand poses and gestures. Computer-vision hand and face tracking is an active and developing field, yet the hand and face trackers that have been developed are not sufficient for our needs. We want a tracker that will track a given hand or face in the presence of noise, other skin-colored regions, and hand movements. Moreover, it must run fast and efficiently, so that objects may be tracked in real time (30 frames per second) while consuming as few system resources as possible. Several tracking approaches address these matters. In the remainder of this paper we describe the algorithms and the feature extraction that tracking requires, then classify them and discuss their advantages and shortcomings. Feature extraction for hand tracking is considered in Section 2. In Section 3, we review models
Proceedings of the Geometric Modeling and Imaging - New Trends (GMAI'06), 0-7695-2604-7/06 $20.00 © 2006 IEEE
and algorithms for hand tracking. The main issues are presented in Section 4. Results are shown in Section 5, and Section 6 concludes the paper.
2. Features
Motion, color, contour and boundary, shape, and view are used as features for tracking. Skin color is a useful and robust cue for face and hand localization and tracking. When a system uses skin color as a feature for tracking, two questions are important: first, which color space to choose; second, how exactly the skin color distribution should be modeled. For example, in [1, 6] the HSV color space is used. HSV separates hue (color) from saturation (how concentrated the color is) and brightness, so the color model is independent of lighting effects. Other color spaces, such as YUV [14], I1I2I3 [17], and YCrCb [13], are also used. The models used for color distribution modeling are look-up tables, Bayes rule, and Gaussian mixture models (GMMs); for more information see [30]. Skin-color based segmentation has proven to be an effective method for segmenting the hand in fairly unrestricted environments [14]. In [6] the system extracts the face and hand regions using their skin colors, computing not only single-frame blobs but also motion-difference blobs. Huang et al. [4, 5] used edge and motion information of each frame to extract the feature images. Also, in order to compute the motion-difference blobs stably while the hand overlaps the face, a color-extraction method based on histogram backprojection is presented [25]. To obtain a simple view-based shape representation, the face and hands are represented by coarse 2D rigid models, e.g. their silhouette contours [17]. It is not easy to perform robust hand tracking using a single visual cue. Recently, there has been increasing interest in integrating several visual cues for improving tracking performance. Lu et al. [26] presented a model-based approach to the integration of edges, optical flow, and shading information for tracking. Compared with color, depth is more robust to varying illumination and heavy clutter [12].
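The look-up-table model for skin color mentioned above can be illustrated with a short sketch: quantize the color space into cells, count labeled skin and non-skin samples per cell, and classify a pixel by Bayes rule on its cell. All names and the toy training data below are hypothetical; a real system would train on large labeled image sets in a chosen color space.

```python
# Minimal sketch of Bayes-rule skin classification with a quantized
# look-up table (LUT). Toy data; names are illustrative only.

BINS = 8  # quantize each 0-255 channel into 8 bins

def quantize(pixel):
    """Map an (r, g, b) pixel to its LUT cell index."""
    return tuple(c * BINS // 256 for c in pixel)

def train_lut(skin_pixels, nonskin_pixels):
    """Estimate P(skin | color cell) for every observed cell."""
    skin, nonskin = {}, {}
    for p in skin_pixels:
        cell = quantize(p)
        skin[cell] = skin.get(cell, 0) + 1
    for p in nonskin_pixels:
        cell = quantize(p)
        nonskin[cell] = nonskin.get(cell, 0) + 1
    lut = {}
    for cell in set(skin) | set(nonskin):
        s, n = skin.get(cell, 0), nonskin.get(cell, 0)
        lut[cell] = s / (s + n)  # relative frequency as skin probability
    return lut

def is_skin(lut, pixel, threshold=0.5):
    """Classify a pixel; unseen cells default to probability 0."""
    return lut.get(quantize(pixel), 0.0) >= threshold

# Toy usage: reddish pixels labeled skin, bluish pixels labeled background.
skin_samples = [(200, 120, 100), (210, 130, 110), (190, 115, 95)]
background = [(40, 60, 200), (50, 70, 210)]
lut = train_lut(skin_samples, background)
```

In practice the histogram is built in a space such as HSV or YCrCb so that the brightness channel can be dropped, which is what makes the model tolerant of lighting changes.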
3. Models and Algorithms
In this section we focus on the models and algorithms proposed in the tracking literature. Based on popularity and citation, we review the approaches in three sections.

3.1. CAMShift
Fukunaga first introduced the mean shift algorithm [22]. This algorithm is a robust nonparametric technique for climbing density gradients to find the mode of a probability distribution. To cope with color distributions that change over time, the mean shift algorithm has to be modified to adapt dynamically to the probability distribution it is tracking; the resulting algorithm is called CAMShift. CAMShift is a simple, computationally efficient tracker for hands, faces, and colored objects. It was conceived as a simple part of a larger tracking system, and it now has many uses in game and 3D graphics control [1]. This adaptive version of mean shift was proposed by Bradski et al. [1]. An improved CAMShift algorithm is applied to hand tracking in a letter hand gesture recognition application [14]. In [13] CAMShift is used for tracking simple and staged paths.

3.2. CONDENSATION
The CONDENSATION algorithm is based on sequential Monte Carlo methods. One important advantage of the sequential Monte Carlo framework is that it allows information from different measurement sources to be fused in a principled manner. Isard et al. employed the CONDENSATION algorithm to combine skin color and hand contour for hand tracking [28]. The Bayesian mixed-state framework [29] is an extension of the basic CONDENSATION algorithm, proposed to recognize the current hand posture and to switch automatically between multiple templates in the tracking loop [17]. Isard et al. presented a framework, ICONDENSATION, to bridge the gap between low-level and high-level tracking approaches [27]. ICONDENSATION combines the statistical technique of importance sampling with the CONDENSATION algorithm. Liu et al. proposed robust hand tracking based on ICONDENSATION for a wearable visual interface [12]. Arulampalam et al. review particle-filter-based tracking algorithms similar to the work of Isard et al. [21]. Particle filters, which use a particle set to maintain and propagate multiple hypotheses simultaneously during tracking, are robust to background distracters, but their high computational cost is usually a bottleneck for real-time systems. To improve the efficiency of the particle filter, the Mean Shift Embedded Particle Filter was designed to reduce the number of particles [15]. In experiments, this algorithm needs only 20 particles to track hand gestures, while the conventional particle filter needs at least 150 particles for the same gesture sequences, saving considerable time.

3.3. Other Models
Some other methods have also been applied to hand tracking. Through entropy measurement, Lee et al. [18] obtained color information whose distribution is close to skin color in high-entropy regions and extracted the hand region from the input images. Yang et al. [10] proposed matching motion-segmented regions, which consists of two major steps. First, each image is partitioned into regions using a multiscale segmentation method. Regions between consecutive frames are then matched to obtain two-view correspondences. Huang et al. [4, 5] proposed a model-based tracking method, which extracts a 2D model from the current input frame and matches it to the successive frames in the image sequence. The model of the moving object is generated dynamically from the image sequence, rather than being provided a priori. The main constraint imposed by this method is that the 2D shape of the moving object must not change much between two consecutive frames. Cui and Weng introduced a prediction-and-verification segmentation approach [3]. This approach trains the system to learn the mapping from many partial views (attention images) of each hand shape to the contour of the hand. These partial views are generated manually during training. During the performance phase, the hand contour of a valid segmentation is predicted using partial views provided by motion information and the learned mapping. Chen et al. [2] combined edge, motion, and skin-color region information to locate the hand region. Sherrah et al. presented a Bayesian network method for this task [7]. A hybrid HMM/particle filter framework is presented for simultaneous tracking and recognition of non-rigid hand motion [16]. In this work a color-based particle filter provides a robust estimate of non-rigid object translation and localizes the most likely hand location for the HMM filter input. In turn, the shape output from the HMM filter provides the importance weighting for the particle set before the resampling stage and the particle-set update in the prediction stage. Zieren et al. [11] developed a tracking system that combines multiple visual cues, incorporates a mean shift tracker and a Kalman filter, and applies probabilistic reasoning for final inference. Some works have used 3D models. In [9] the proposed algorithm uses deformable models that are fitted to the image data from each of three views. Another work also takes a 3D model approach, but that algorithm failed to track after a few frames [19].

Table 1. Summary of comparative results

| Author      | Method                | Features                   | Result |
|-------------|-----------------------|----------------------------|--------|
| Bradski [1] | CAMShift              | Color                      | Tracking in the presence of distracters and noise |
| Imagawa [6] | Kalman filter         | Color, motion              | Error classification success rate 86% |
| Isard [28]  | CONDENSATION          | Contour, motion            | Tracking a multi-modal distribution, rapid motion through clutter, an articulated and a camouflaged object |
| Isard [27]  | ICONDENSATION         | Color, contour, motion     | Robust to rapid motion, heavy clutter, and hand-color distracters |
| Sherrah [7] | Bayesian network      | Color, orientation, motion | Robustness under occlusion |
| Fei [16]    | Joint Bayesian Filter | Motion, shape              | Good performance using weak cues |
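The CONDENSATION-style particle filters that recur throughout this section share a common predict/weight/resample loop. The sketch below shows a minimal 1D version under simplifying assumptions (Gaussian diffusion as the motion model, a likelihood that peaks at the hidden true position); all names are illustrative and do not come from any cited system.

```python
import math
import random

def particle_filter_step(particles, observe, motion_std=1.0):
    """One CONDENSATION-style iteration: predict, weight, resample."""
    # 1. Predict: propagate each hypothesis with random diffusion.
    moved = [p + random.gauss(0.0, motion_std) for p in particles]
    # 2. Weight: score each particle against the measurement model.
    weights = [observe(p) for p in moved]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # 3. Resample: draw a new particle set in proportion to the weights.
    return random.choices(moved, weights=weights, k=len(moved))

def track(true_pos, n_particles=200, steps=25):
    """Track a static 1D target starting from a diffuse particle set."""
    # Toy measurement model: likelihood peaks at the hidden true position.
    observe = lambda p: math.exp(-0.5 * (p - true_pos) ** 2)
    particles = [random.uniform(-20.0, 20.0) for _ in range(n_particles)]
    for _ in range(steps):
        particles = particle_filter_step(particles, observe)
    return sum(particles) / len(particles)  # posterior mean estimate
```

Because the particle set carries multiple hypotheses at once, the filter survives temporary distracters that would capture a single-hypothesis tracker; the cost is that every particle must be scored per frame, which is the bottleneck the Mean Shift Embedded Particle Filter [15] reduces.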
4. Main Issues
In the following, we consider the main issues that arise in tracking.
4.1. Occlusion
In sign language applications, overlap is actually frequent. Bradski [1] has shown that in face tracking, as long as occlusion is not 100%, CAMShift will still tend to follow what is left of the object's probability distribution. In [8] each of the two hands is assigned the features of the single merged blob whenever occlusion occurs. The occlusion event itself is implicitly modeled, and the combined position and moment information is retained. This method, combined with the temporal context provided by hidden Markov models, is sufficient to distinguish between many different signs where hand occlusion occurs. The most promising research directions for dealing with occlusion are probabilistic reasoning and multiple hypothesis testing [7, 11, 24]. In [7] a Bayesian network is compared with a CONDENSATION algorithm and shown to be more robust under occlusion.
4.2. Background
In [11] a uniform white background and uniform black clothing were used, but they are not strictly necessary. In [23] the number and properties of all moving connected regions (motion blobs) are considered. This approach is intrinsically sensitive to motion originating from other objects and is therefore only applicable with static backgrounds. To deal with clutter due to skin-colored objects in the background, Shan et al. further adopt motion information of skin color regions in the observation model [15]. In the wearable case, the head-mounted camera cannot remain stationary while the hand is moving to form gestures, which causes image jitter and dramatic changes in the unrestricted background and the lighting conditions; dealing with the background is therefore important [12]. Chen et al. [2] used background subtraction to find the foreground object and finally identify the hand region. Other moving objects are allowed in the background, but there must be only one moving hand in the foreground. Also in [13] background subtraction [20] is applied before any processing.
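Background subtraction in its simplest form, as a preprocessing step of the kind used in [2] and [13], can be sketched with a running-average background model and per-pixel thresholding. This is a generic illustration, not the specific statistical method of [20]; frames are plain 2D lists of grayscale values and all names are hypothetical.

```python
# Minimal running-average background subtraction sketch.

def update_background(background, frame, alpha=0.1):
    """Blend the new frame into the background model (running average)."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]

def foreground_mask(background, frame, threshold=30):
    """Mark pixels whose deviation from the model exceeds the threshold."""
    return [[abs(f - b) > threshold for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]

# Toy usage: a static dark scene, then a bright object enters one pixel.
bg = [[10, 10], [10, 10]]
frame = [[10, 10], [10, 200]]
mask = foreground_mask(bg, frame)  # only the bottom-right pixel is foreground
```

The blending weight alpha controls how quickly the model absorbs scene changes: too high and a slowly moving hand is swallowed into the background, too low and lighting changes are flagged as foreground, which is exactly why this simple scheme fails on the jittery wearable-camera footage discussed above.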
5. Results
Some papers that focus directly on tracking have presented experimental results. In this section, we compare them; Table 1 shows a summary. Bradski [1] has shown the ability of the CAMShift algorithm to track in the presence of distracters and noise. Imagawa et al. [6] classified errors into four categories. In addition, frames of the test image sequences are classified as either overlapping periods or not. Using a Kalman filter, the results showed a tracking success rate of 86% over all sequences, 83% for overlapping periods, and 90% when the hand did not cross the face. Although this system could distinguish between a hand and the static head, it could not completely classify blobs into three objects (head, left hand, and right hand). Isard et al. [28] applied the CONDENSATION algorithm to video streams and performed complete tracking experiments. This algorithm is used for tracking a multi-modal distribution, rapid motion through clutter, and articulated and camouflaged objects. It has been demonstrated to be more effective in clutter than comparable Kalman filters. In addition, they examined ICONDENSATION [27]. The resulting tracker was robust to rapid motion, heavy clutter, and hand-color distracters while running in real time. In [7] a Bayesian network is compared with a CONDENSATION algorithm and shown to be more robust under occlusion. In [16] a long video sequence is used for training and testing. Near real-time performance has been achieved for the overall tracking system. When using the weak cue of image moments alone, or skin color under occlusion, tracking non-rigid hand poses in the Joint Bayesian Filter (JBF) framework can achieve rather good performance.
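The Kalman-filter tracking evaluated in [6] rests on a standard predict/update cycle. The sketch below shows a 1D constant-velocity version in which position is measured and velocity is inferred; the noise parameters are illustrative values, not those of the cited system.

```python
# 1D constant-velocity Kalman filter sketch.
# State: [position, velocity]; only position is measured.

def kalman_track(measurements, dt=1.0, q=0.01, r=1.0):
    """Return filtered positions for a stream of noisy position measurements."""
    x, v = measurements[0], 0.0  # initial state estimate
    # 2x2 covariance stored as (p_xx, p_xv, p_vv) since it is symmetric.
    p_xx, p_xv, p_vv = 1.0, 0.0, 1.0
    out = []
    for z in measurements:
        # Predict: constant-velocity motion model with process noise q.
        x, v = x + dt * v, v
        p_xx = p_xx + 2 * dt * p_xv + dt * dt * p_vv + q
        p_xv = p_xv + dt * p_vv
        p_vv = p_vv + q
        # Update: blend the prediction with the measurement (noise r).
        s = p_xx + r                # innovation covariance
        k_x, k_v = p_xx / s, p_xv / s   # Kalman gains
        innov = z - x
        x, v = x + k_x * innov, v + k_v * innov
        p_xx, p_xv, p_vv = ((1 - k_x) * p_xx,
                            (1 - k_x) * p_xv,
                            p_vv - k_v * p_xv)
        out.append(x)
    return out
```

The appeal of this filter for blob tracking is that the prediction gives a search window for the next frame and carries the track through short occlusions; its weakness, noted above and in [28], is that the single Gaussian hypothesis is easily captured by clutter, which is what motivates the multi-hypothesis CONDENSATION family.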
6. Conclusion
Hand tracking has come a long way from its beginnings in mere position tracking. Current work can successfully deal with occlusion, skin-color distracters, and non-rigid object tracking. Most of these algorithms work in real time, but there are trade-offs between robustness and speed. Most existing methods have been developed for specific tasks, and little work compares different approaches. It seems that Bayesian network and CONDENSATION-based algorithms are going to play an important role in tracking. CAMShift, on the other hand, was conceived as a simple part of a larger tracking system and already has many uses in game and 3D graphics control, while matching motion-segmented regions can extract motion trajectories from image sequences with the fewest constraints. Combining these approaches is a new direction that may yield methods with the lowest dependency on particular features and applications and good tracking behavior.
References
[1] G. Bradski, "Computer Vision Face Tracking for Use in a Perceptual User Interface," Intel Technology J., second quarter 1998.
[2] F.-S. Chen, C.-M. Fu, and C.-L. Huang, "Hand Gesture Recognition Using a Real-Time Tracking Method and Hidden Markov Models," Image and Vision Computing, vol. 21, no. 8, pp. 745-758, 2003.
[3] Y. Cui and J. Weng, "Appearance-Based Hand Sign Recognition from Intensity Image Sequences," Computer Vision and Image Understanding, vol. 78, no. 2, pp. 157-176, 2000.
[4] C.-L. Huang and W.-Y. Huang, "Sign Language Recognition Using Model-Based Tracking and a 3D Hopfield Neural Network," Machine Vision and Applications, vol. 10, pp. 292-307, 1998.
[5] C.-L. Huang and S.-H. Jeng, "A Model-Based Hand Gesture Recognition System," Machine Vision and Applications, vol. 12, no. 5, pp. 243-258, 2001.
[6] K. Imagawa, S. Lu, and S. Igi, "Color-Based Hand Tracking System for Sign Language Recognition," Proc. Int'l Conf. Automatic Face and Gesture Recognition, pp. 462-467, 1998.
[7] J. Sherrah and S. Gong, "Resolving Visual Uncertainty and Occlusion through Probabilistic Reasoning," Proc. British Machine Vision Conf., pp. 252-261, 2000.
[8] T. Starner, J. Weaver, and A. Pentland, "Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371-1375, Dec. 1998.
[9] C. Vogler and D. Metaxas, "Adapting Hidden Markov Models for ASL Recognition by Using Three-Dimensional Computer Vision Methods," Proc. Int'l Conf. Systems, Man, and Cybernetics, vol. 1, pp. 156-161, 1997.
[10] M.-H. Yang, N. Ahuja, and M. Tabb, "Extraction of 2D Motion Trajectories and Its Application to Hand Gesture Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1061-1074, Aug. 2002.
[11] J. Zieren, N. Unger, and S. Akyol, "Hands Tracking from Frontal View for Vision-Based Gesture Recognition," Proc. 24th DAGM Symp., pp. 531-539, 2002.
[12] Y. Liu and Y. Jia, "A Robust Hand Tracking and Gesture Recognition Method for Wearable Visual Interfaces and Its Applications," Proc. Third Int'l Conf. Image and Graphics, IEEE, 2004.
[13] J. Dias, P. Nande, N. Barata, and A. Correia, "O.G.R.E. – Open Gestures Recognition Engine," Proc. XVII Brazilian Symp. Computer Graphics and Image Processing, IEEE, 2004.
[14] N. Liu, B. Lovell, and P. Kootsookos, "Evaluation of HMM Training Algorithms for Letter Hand Gesture Recognition," IEEE Int'l Symp. Signal Processing and Information Technology, Dec. 2003.
[15] C. Shan, Y. Wei, X. Qiu, and T. Tan, "Gesture Recognition Using Temporal Template Based Trajectories," Proc. 17th Int'l Conf. Pattern Recognition, IEEE, 2004.
[16] H. Fei, "A Hybrid HMM/Particle Filter Framework for Non-Rigid Hand Motion Recognition," Proc. ICASSP, 2004.
[17] L. Brèthes, P. Menezes, F. Lerasle, and J. Hayet, "Face Tracking and Hand Gesture Recognition for Human-Robot Interaction," Proc. 2004 IEEE Int'l Conf. Robotics and Automation, New Orleans, Apr. 2004.
[18] J. Lee, Y. Lee, E. Lee, and S. Hong, "Hand Region Extraction and Gesture Recognition from Video Stream with Complex Background through Entropy Analysis," Proc. 26th Ann. Int'l Conf. IEEE EMBS, San Francisco, CA, USA, Sept. 1-5, 2004.
[19] A. C. Downton and H. Drouet, "Model-Based Image Analysis for Unconstrained Human Upper-Body Motion," Proc. Int'l Conf. Image Processing and Its Applications, pp. 274-277, Apr. 1992.
[20] T. Horprasert, D. Harwood, and L. S. Davis, "A Statistical Approach for Real-Time Robust Background Subtraction and Shadow Detection," technical report, Computer Vision Laboratory, University of Maryland, 1999.
[21] G. R. Bradski, "Intel Open Source Computer Vision Library Overview," 2002.
[22] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Boston, 1990.
[23] R. Cutler and M. Turk, "View-Based Interpretation of Real-Time Optical Flow for Gesture Recognition," Proc. IEEE Conf. Face and Gesture Recognition, pp. 416-421, 1998.
[24] C. Rasmussen and G. D. Hager, "Joint Probabilistic Techniques for Tracking Objects Using Visual Cues," Proc. Int'l Conf. Intelligent Robotic Systems, 1998.
[25] M. J. Swain and D. H. Ballard, "Color Indexing," Int'l J. Computer Vision, vol. 7, no. 1, pp. 11-32, 1991.
[26] S. Lu, D. Metaxas, D. Samaras, and J. Oliensis, "Using Multiple Cues for Hand Tracking and Model Refinement," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 443-450, June 2003.
[27] M. Isard and A. Blake, "ICondensation: Unifying Low-Level and High-Level Tracking in a Stochastic Framework," Proc. ECCV, pp. 893-908, 1998.
[28] M. Isard and A. Blake, "CONDENSATION – Conditional Density Propagation for Visual Tracking," Int'l J. Computer Vision, vol. 29, no. 1, pp. 5-28, 1998.
[29] M. Isard and A. Blake, "A Mixed-State Condensation Tracker with Automatic Model-Switching," Proc. ICCV, pp. 107-112, Jan. 1998.
[30] V. Vezhnevets, V. Sazonov, and A. Andreeva, "A Survey on Pixel-Based Skin Color Detection Techniques," Proc. Graphicon-2003, pp. 85-92, Moscow, Russia, Sept. 2003.
[31] S. C. W. Ong and S. Ranganath, "Automatic Sign Language Analysis: A Survey and the Future beyond Lexical Meaning," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 873-891, June 2005.
[32] V. I. Pavlovic, R. Sharma, and T. S. Huang, "Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 677-695, July 1997.