HAVE 2005 - IEEE International Workshop on Haptic Audio Visual Environments and their Applications Ottawa, Ontario, Canada, 1–2 October, 2005
A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models

Qing Chen, Ayman El-Sawah, Chris Joslin and Nicolas D. Georganas
Distributed & Collaborative Virtual Environments Research Laboratory
School of Information Technology and Engineering
University of Ottawa, Ottawa, ON, Canada
e-mail: {qchen, aelsawah, joslin, georganas}@discover.uottawa.ca

Abstract – A dynamic gesture interface for virtual environments based on Hidden Markov Models (HMMs) is introduced in this paper. HMMs are employed to represent the continuous dynamic gestures, and their parameters are learned from training data collected from the CyberGlove. To avoid the gesture spotting problem, we employ the standard deviation of the angle variation of each finger joint to describe the dynamic characteristics of the gestures. A prototype that applies three different dynamic gestures to control the rotation direction of a 3D cube has been implemented to test the effectiveness of the proposed method.
Keywords – Dynamic gesture, hidden Markov model.

I. INTRODUCTION

Virtual Environments (VE) provide a sophisticated paradigm for complex graphical simulations, as well as for human communication, interaction, learning and training. To achieve immersive Human Computer Interaction (HCI) in VE applications, the human hand can be considered a natural input device, and this has been one of the motivations for the considerable research devoted to computer-based hand gesture recognition technology [1]. A set of hand gestures can be used to perform a series of command inputs such as pointing, navigating, moving, rotating, starting, stopping, etc. To use human hands as a natural human-computer interface, glove-based devices have been employed to capture human hand motions. With attached sensors, the joint angles and spatial positions of the hands can be measured directly from the glove. Meaningful hand gestures can be classified into static postures and continuous dynamic gestures. For static postures, one typical example is the American Sign Language (ASL). Many researchers have proposed various algorithms to implement ASL recognition systems [2] [3]. However, different from ASL, the gesture commands in VE applications can be more natural and simple when continuous dynamic gestures are used. In one scenario, a wrist rotation in front of a virtual door knob should actuate turning the knob and opening the door. Although the accuracy of isolated sign language recognition has reached 95%, the accuracy of continuous dynamic gesture recognition is still very low [4]. Recognition of continuous dynamic gestures is still far from meeting the naturalness criteria of human communication due to poor recognition rates.
Fig. 1. Hand skeleton structure [5].
Dynamic gestures consist of global hand motion and local finger motion [6]. Global hand motion comprises large-scale hand rotation and translation. Local finger motion can be parameterized with the set of joint angles. Dynamic gestures are highly articulated because the human hand consists of many connected parts and joints (see Fig. 1), leading to complex kinematics. At the same time, dynamic gestures are also highly constrained because of the limited number of degrees of freedom (DOF). To capture complex hand motion and recognize continuous dynamic hand gestures, the dynamics and semantics of hand motion should be modelled. A number of different approaches, such as the Kalman filter [7], dynamic time warping [8] [9], and finite state machines [10], have been applied to model the dynamic characteristics of hand motions. However, the effectiveness of these methods is limited by very strict assumptions, which makes them insufficient for modelling complex continuous dynamic hand gestures.
To recognize continuous dynamic hand gestures, appropriate hand motion features should also be extracted. Dynamic gestures are inherently stochastic [11]. If a person performs the same dynamic gesture repeatedly, and each repetition is recorded and measured, the measurements will certainly differ. However, since these measurements represent the same dynamic gesture, there must be statistical properties that describe the dynamic characteristics of this gesture. The HMM is a type of statistical model for signals that can be characterized as a parametric random process. HMMs have found applications in many areas of signal processing, and particularly in speech recognition [12] [13] [14]. Due to the statistical properties of dynamic gestures, using HMMs for dynamic gesture recognition is becoming more and more popular.

In this paper, a continuous dynamic gesture recognition system based on HMMs is presented. The HMMs are employed to represent different dynamic gestures, and their parameters are learned from a set of training data. We use the Expectation-Maximization (EM) algorithm, an efficient algorithm for solving the learning problem. Different from other HMM-based algorithms, we employ the standard deviation of the angle variation of each finger joint to describe the dynamic characteristics of the continuous gestures. With this method, we avoid the challenging gesture spotting problem, i.e. the task of segmenting meaningful gesture patterns from non-gesture parts in a continuous sequence of hand motions [15] [16]. At the same time, we effectively transform multidimensional data into one-dimensional discrete data, so that the massive computation cost of multidimensional HMMs is saved. Based on the maximum-likelihood criterion, gestures are recognized by evaluating the trained HMMs, i.e. by solving the evaluation problem. We have developed a system to implement the proposed method. The feasibility of this method is demonstrated by experiments in which continuous dynamic gestures control the rotation direction of a 3D cube. The proposed method has potential applications in a variety of dynamic pattern recognition problems.

II. HIDDEN MARKOV MODELS
Fig. 2. An HMM example.
A hidden Markov model is a doubly embedded stochastic process: an underlying stochastic process that can only be observed through another set of stochastic processes that produce the sequence of observations. An HMM is a description of a set of states connected by transitions. Each state is characterized by two different probabilities: a state transition probability and an observation output probability. An example HMM is illustrated in Fig. 2. The basic elements of an HMM include:

• A set of states, S = {S1, S2, ..., SN}, where N is the number of states in the model. Although the states are hidden, in practical applications they often correspond to physical properties of the input pattern.

• A set of observation symbols, V = {V1, V2, ..., VM}, where M is the number of distinct observation symbols.

• The state transition matrix A = {aij}, where aij = P(qt+1 = Sj | qt = Si), 1 ≤ i, j ≤ N.

• The observation symbol probability distribution in state j, B = {bj(k)}, where bj(k) = P(ot = Vk | qt = Sj), 1 ≤ j ≤ N, 1 ≤ k ≤ M.

• The initial state distribution π = {πi}, where πi = P(q1 = Si), 1 ≤ i ≤ N.

A complete specification of an HMM requires the two model parameters N and M, the specification of the observation symbols, and the specification of the three probability measures A, B and π. With these parameters, an HMM can be written in the compact form λ = (A, B, π). An HMM can be based on either discrete or continuous observation densities. In this paper, we use discrete HMMs to model dynamic gestures because discrete probability distributions are sufficient to characterize the stochastic properties of the dynamic gestures with a finite set of symbols. Before we apply discrete HMMs to dynamic gestures, the raw data collected from the gesture input device must be preprocessed and converted to a sequence of discrete symbols.

Given an HMM, there are three basic problems that must be solved for practical applications:

• The Evaluation Problem: given an observation sequence O = o1 o2 ... ot and an HMM λ = (A, B, π), how do we compute P(O | λ)?

• The Decoding Problem: given an observation sequence O = o1 o2 ... ot and an HMM λ = (A, B, π), how do we choose a corresponding optimal state sequence Q = q1 q2 ... qt?

• The Learning Problem: given an observation sequence O = o1 o2 ... ot, how do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

The solutions to these three problems are the Forward-Backward algorithm, the Viterbi algorithm, and the Expectation-Maximization (EM) algorithm (also called the Baum-Welch algorithm), respectively. For a more detailed description of these algorithms, readers are referred to [12].
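For illustration, the compact specification λ = (A, B, π) of a discrete HMM maps naturally onto a C structure such as the following sketch; the type name and array sizes are our own assumptions, not values from the paper.

/* A minimal discrete HMM lambda = (A, B, pi) in C.
   N_STATES and N_SYMBOLS are illustrative placeholders. */
#define N_STATES  20   /* N: number of hidden states */
#define N_SYMBOLS 8    /* M: number of distinct observation symbols */

typedef struct {
    double A[N_STATES][N_STATES];    /* a_ij = P(q_{t+1} = S_j | q_t = S_i) */
    double B[N_STATES][N_SYMBOLS];   /* b_j(k) = P(o_t = V_k | q_t = S_j)   */
    double pi[N_STATES];             /* pi_i = P(q_1 = S_i)                 */
} Hmm;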
III. IMPLEMENTATION

Our project is to build a system which allows us to control and manipulate objects in a 3D virtual environment with different dynamic gestures. The prototype we developed for this scenario uses three different continuous dynamic gestures, which we named "Great", "Quote" and "Trigger" (see Fig. 3), to control the rotation direction of a cube around the three axes X, Y and Z. The cube should change its rotation axis when the corresponding gesture is performed. The implementation involves three steps: raw data collection and preprocessing, HMM training, and gesture recognition with the trained HMMs.

Fig. 3. The three dynamic gestures to control the rotation of a cube.

A. Raw Data Collection and Preprocessing

Fig. 4. The 18 sensors attached on the CyberGlove.

The raw gesture data are the values of 20 joint-angle measurements (15 finger joint angles, 4 abduction angles between fingers and 1 palm arch angle, see Fig. 4) from the 18 sensors of the CyberGlove, sampled at about 10 Hz [17]. To model the dynamic gestures with discrete HMMs, we need to describe each gesture in terms of an HMM. An appropriate structure and topology of an HMM which can characterize the dynamic gesture need to be selected. Because a dynamic gesture is a continuous motion sequence, it is natural to use a multi-dimensional HMM, which contains more than one observation at each time t, to model a dynamic gesture [18]. Each state corresponds to a temporary posture in the motion sequence, and at each moment t there are 20 finger joint angles as observations. However, one challenging task for this type of HMM topology is the gesture spotting problem, which requires effectively segmenting the continuous dynamic gesture into a series of meaningful temporary postures corresponding to the states in the HMM topology. One method to solve this problem relies on mannered discontinuities, such as holding a posture for a certain time so that the gesture motion trajectory has an apparent pause which can later be detected for segmentation. This method works for connected ASL recognition. However, for continuous dynamic gestures, mannered discontinuities greatly increase the unnaturalness for the gesture performer, which is unfavorable.

Fig. 5. The measurement of a "Trigger" gesture.

To bypass the gesture spotting problem, we employed the standard deviation of the angle variation of each finger joint as the observation signal. Fig. 5 shows one example of the joint angle distribution for a "Trigger" gesture collected from the CyberGlove. We sampled the data at 10 Hz along the time axis "X" during the performance of the dynamic gesture; the "Y" axis shows the values of the joint angles. From the diagram, we see that only three finger joints' curves fluctuate more strongly (are more dynamic) than those of the other finger joints. The standard deviation can effectively describe the dynamic character of the angle variation of each finger joint because it is a statistical parameter that measures how tightly the samples are clustered around the mean value in a data set. When the samples are tightly bunched together, the standard deviation is small; when the samples are spread apart, the standard deviation is relatively large. Another advantage of the standard deviation is that it transforms the multidimensional observation signals into a single-dimensional discrete observation signal, which is much easier for the HMMs to process. This advantage saves the massive computation cost of multidimensional HMMs.

The CyberGlove is connected to the computer via an RS-232 serial connection. 20 channels of joint-angle raw data (floating-point numbers) are collected from the CyberGlove at 10 Hz. Fig. 6 is the block diagram of the raw data preprocessing. The raw data collection and preprocessing is implemented in C, and includes raw data collection, data normalization, deviation computation and vector quantization. After the preprocessing, the N × 20 floating-point matrix (where N is the number of samples collected from the glove) is transformed into a finite 1 × 20 integer vector serving as the observation signals for the HMMs.

Fig. 6. The raw data preprocessing.
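As a rough illustration of the deviation-computation and vector-quantization stages, the following C sketch (our own reconstruction, not the authors' code; the function names, bin boundaries and buffer size are assumptions) turns an N × 20 matrix of joint-angle samples into 20 discrete symbols, one per joint:

#include <math.h>

#define N_JOINTS 20

/* Standard deviation of one joint's angle samples over a gesture. */
static double joint_stddev(const float *samples, int n)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += samples[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (samples[i] - mean) * (samples[i] - mean);
    return sqrt(var / n);
}

/* Quantize a deviation into a small symbol alphabet; the bin
   boundaries here are illustrative, not the paper's values. */
static int quantize(double dev)
{
    if (dev < 2.0) return 0;    /* joint essentially static */
    if (dev < 8.0) return 1;    /* mildly dynamic           */
    return 2;                   /* strongly dynamic         */
}

/* angles[t][j]: sample t of joint j; out[j]: discrete symbol for joint j. */
void preprocess(const float angles[][N_JOINTS], int n_samples, int out[N_JOINTS])
{
    float column[1024];         /* assumes n_samples <= 1024 */
    for (int j = 0; j < N_JOINTS; j++) {
        for (int t = 0; t < n_samples; t++) column[t] = angles[t][j];
        out[j] = quantize(joint_stddev(column, n_samples));
    }
}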
Because we employ the standard deviation of each joint angle as the observation signal, the underlying state sequence associated with this HMM topology corresponds to the 20 finger joints of the hand. The states proceed from left to right. The transition matrix of this HMM topology is:

        | a1,1   a1,2   ...  a1,20  |
        | a2,1   a2,2   ...  a2,20  |
    A = |  ...    ...   ...   ...   |
        | a19,1  a19,2  ...  a19,20 |
        | a20,1  a20,2  ...  a20,20 |

where:

    ai,j = 1,  for j = i + 1,  i = 1, 2, ..., 20
    ai,j = 0,  for j ≠ i + 1,  i = 1, 2, ..., 20

The initial state distribution is defined by:

    πi = 1,  for i = 1
    πi = 0,  for i ≠ 1
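Concretely, this left-to-right topology can be filled in with a few lines of C (an illustrative sketch; the function name is ours):

#define N_STATES 20

/* Illustrative sketch, not the authors' code: build the left-to-right
   transition matrix and initial distribution defined above. Under the
   literal definition, the last state has no successor. */
void init_left_to_right(double A[N_STATES][N_STATES], double pi[N_STATES])
{
    for (int i = 0; i < N_STATES; i++) {
        pi[i] = (i == 0) ? 1.0 : 0.0;                /* pi_1 = 1, else 0    */
        for (int j = 0; j < N_STATES; j++)
            A[i][j] = (j == i + 1) ? 1.0 : 0.0;      /* a_{i,i+1} = 1, else 0 */
    }
}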
B. HMMs Training

HMMs training corresponds to the Learning Problem, where an appropriate HMM needs to be found to maximize the probability of a given observation sequence. The key to solving this problem is to find the observation probability distribution matrix B and the state transition matrix A. We employed the EM algorithm, which uses an iterative expectation/maximization procedure to locally maximize P(O|λ). In order to train the HMMs, a certain amount of training data for each gesture needs to be collected. In our implementation, we collected 10 data sets for each dynamic gesture we want to recognize. The HMMs are trained with the preprocessed data, and the optimal HMM parameters, including A and B, are computed to maximize the probability P(O|λ). The results are three HMMs which best represent the three dynamic gestures we want to recognize.

C. Gesture Recognition

Fig. 7 illustrates the gesture recognition process. After the HMM training step, all three dynamic gestures are represented by appropriate HMMs. The gesture recognition task becomes a search for the HMM with the highest probability of generating the observation sequence, i.e. max(P(O|λi)), i = 1, 2, 3. This task corresponds to the Evaluation Problem, which requires computing P(O|λ) given a gesture observation sequence O and an HMM λ.

Fig. 7. The gesture recognition process.
The Forward-Backward algorithm is commonly used to compute P(O|λ). It is based on a trellis structure, which effectively reduces the total number of calculations from 2T·N^T for direct computation to N²T, where N is the number of states and T is the number of observations [12].
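A minimal C sketch of this evaluation step, under the same illustrative assumptions as the earlier sketches (the Hmm structure is repeated here for self-containment, and the names are ours): forward() computes P(O|λ) by the forward recursion, and recognize() returns the index of the model maximizing P(O|λi). In practice the α values underflow for long sequences, so a scaled or log-domain variant would be used.

#define N_STATES  20
#define N_SYMBOLS 3
#define N_MODELS  3

typedef struct {
    double A[N_STATES][N_STATES];
    double B[N_STATES][N_SYMBOLS];
    double pi[N_STATES];
} Hmm;

/* Forward algorithm: P(O | lambda) for a discrete observation sequence. */
double forward(const Hmm *m, const int *obs, int T)
{
    double alpha[N_STATES], next[N_STATES];

    for (int i = 0; i < N_STATES; i++)            /* initialization */
        alpha[i] = m->pi[i] * m->B[i][obs[0]];

    for (int t = 1; t < T; t++) {                 /* induction */
        for (int j = 0; j < N_STATES; j++) {
            double sum = 0.0;
            for (int i = 0; i < N_STATES; i++)
                sum += alpha[i] * m->A[i][j];
            next[j] = sum * m->B[j][obs[t]];
        }
        for (int j = 0; j < N_STATES; j++) alpha[j] = next[j];
    }

    double p = 0.0;                               /* termination */
    for (int i = 0; i < N_STATES; i++) p += alpha[i];
    return p;
}

/* Recognition: the gesture is the model maximizing P(O | lambda_i). */
int recognize(const Hmm models[N_MODELS], const int *obs, int T)
{
    int best = 0;
    double best_p = forward(&models[0], obs, T);
    for (int i = 1; i < N_MODELS; i++) {
        double p = forward(&models[i], obs, T);
        if (p > best_p) { best_p = p; best = i; }
    }
    return best;
}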
IV. IMPLEMENTATION RESULTS

In order to demonstrate the proposed method, a prototype system has been implemented. The goal of this system is to control the rotation of a 3D cube with three different continuous dynamic gestures in real time. Fig. 8 shows the prototype system.
Fig. 8. The prototype system.
The dynamic gesture interface is shown in Fig. 9. Three components are included in this interface: the hand bone structure model, the cube, and the cube control window. As part of the system, we developed a three-dimensional articulated hand bone structure model which performs the dynamic gestures according to the data collected from the CyberGlove. The root of the hand model has 6 DOF for the hand's 3D position and orientation, and the fingers have 20 DOF which define the local motion of the dynamic gestures. With this model, we have more direct feedback on the dynamic gestures we perform; moreover, we can verify that the data is sampled at an appropriate frequency so that the model responds correctly in time. The 3D cube is developed with GLUT, an OpenGL-based window toolkit for writing OpenGL programs [19]. GLUT implements a simple window application programming interface (API) for OpenGL and is designed for constructing small to medium sized OpenGL programs. GLUT also provides a portable API, so a single OpenGL program can work across all PC and workstation OS platforms (a minimal sketch of such a program follows Fig. 9). The cube control window reflects the current rotation axis of the cube according to the dynamic gesture performed.
Fig. 9. The dynamic gesture interface.
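For reference, a stripped-down GLUT program in this spirit (our own sketch, not the authors' source code; the window title and rotation increment are arbitrary) spins a wire cube about a selectable axis:

#include <GL/glut.h>

static float angle = 0.0f;
static int axis = 0;            /* 0 = X, 1 = Y, 2 = Z; in the real system
                                   this would be set by the gesture recognizer */

static void display(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glLoadIdentity();
    glRotatef(angle, axis == 0, axis == 1, axis == 2);
    glutWireCube(0.5);
    glutSwapBuffers();
}

static void idle(void)
{
    angle += 0.5f;              /* arbitrary rotation speed */
    if (angle > 360.0f) angle -= 360.0f;
    glutPostRedisplay();
}

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB | GLUT_DEPTH);
    glutCreateWindow("cube");
    glutDisplayFunc(display);
    glutIdleFunc(idle);
    glutMainLoop();             /* never returns */
    return 0;
}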
A series of tests based on the proposed method has been performed. The test results showed that the 3D cube changes its rotation axis correctly according to the defined gestures we performed, which verified our dynamic gesture recognition algorithm.
V. CONCLUSIONS

In this paper, we proposed a method for modelling, training, and recognizing continuous dynamic gestures using hidden Markov models. Each dynamic gesture is represented by an appropriate HMM which captures the gesture's statistical properties. Different from other HMM-based algorithms, we employed the standard deviation of each finger joint's angle variation to describe the dynamic characteristics of the finger motions involved in the dynamic gestures. This method effectively avoids the gesture spotting problem. Based on the maximum probability criterion, a gesture is recognized by evaluating the observation sequence against the trained HMMs. We developed a prototype to demonstrate the feasibility and effectiveness of the proposed method, defining three dynamic gestures to control the rotation direction of a 3D cube with the CyberGlove as the gesture input device. The experimental results showed that the proposed method is successful and applicable to continuous dynamic gesture recognition, and that it can be applied in developing human-computer interfaces for virtual environment applications.

ACKNOWLEDGMENT

The research presented in this paper has been funded by the CITO (Communications and Information Technology Ontario) project VERGINA (Virtual Environment Research in haptovisual Gesture-recognition Interfaces). The authors would also like to thank Thierry Metais for his advice, and Francois Malric for providing essential technical support.

REFERENCES

[1] Toshiyuki Kirishima, Kosuke Sato and Kunihiro Chihara, "Real-time gesture recognition by learning and selective control of visual interest points," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 351-364, March 2005.
[2] T. Starner et al., "A wearable computer based American sign language recognizer," Proc. IEEE Int. Symp. on Wearable Computing, pp. 130-137, October 1997.
[3] C. Vogler and D. Metaxas, "ASL recognition based on a coupling between HMMs and 3D motion analysis," Proc. IEEE Int. Conf. on Computer Vision, Mumbai, India, pp. 363-369, January 1998.
[4] Sanshzar Kettebekov, Mohammed Yeasin and Rajeev Sharma, "Prosody based audiovisual coanalysis for coverbal gesture recognition," IEEE Trans. on Multimedia, vol. 7, no. 2, pp. 234-242, April 2005.
[5] John Napier, Hands, p. 29, Pantheon Books, New York, 1980.
[6] Ying Wu and Thomas S. Huang, "Hand modeling, analysis, and recognition for vision-based HCI," IEEE Signal Processing Magazine, pp. 51-58, May 2001.
[7] N. Shimada et al., "Hand gesture estimation and model refinement using monocular camera - ambiguity limitation by inequality constraints," Proc. 3rd Conf. on Face and Gesture Recognition, pp. 268-273, 1998.
[8] A. F. Bobick and A. D. Wilson, "A state-based technique for summarization and recognition of gesture," Proc. ICCV '95, pp. 382-388, 1995.
[9] M. Black and A. Jepson, "Recognizing temporal trajectories using the condensation algorithm," Proc. of Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, pp. 16-21, April 1998.
[10] K. Jo, Y. Kuno and Y. Shirai, "Manipulative hand gesture recognition using task knowledge for human computer interface," Proc. of Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, pp. 468-473, April 1998.
[11] Jie Yang, Yangsheng Xu and Chiou S. Chen, "Hidden Markov model approach to skill learning and its application to telerobotics," IEEE Trans. on Robotics and Automation, vol. 10, no. 5, pp. 621-631, October 1994.
[12] Lawrence R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, pp. 257-286, February 1989.
[13] K. F. Lee, H. W. Hon and R. Reddy, "An overview of the SPHINX speech recognition system," IEEE Trans. on ASSP, vol. 38, no. 1, pp. 35-45, 1990.
[14] X. D. Huang, "Phoneme classification using semicontinuous hidden Markov models," IEEE Trans. on ASSP, vol. 40, no. 5, pp. 1062-1067, 1992.
[15] H. K. Lee and J. H. Kim, "An HMM-based threshold model approach for gesture recognition," IEEE Trans. on PAMI, vol. 21, no. 10, pp. 961-973, 1999.
[16] J. W. Deng and H. T. Tsui, "An HMM-based approach for gesture segmentation and recognition," Proc. of 15th Int. Conf. on Pattern Recognition, vol. 3, pp. 679-682, 2000.
[17] Immersion Corporation, VirtualHand SDK User and Programmer Guides, 2001.
[18] Jie Yang, Yangsheng Xu and C. S. Chen, "Gesture interface: modeling and learning," Proc. of IEEE Int. Conf. on Robotics and Automation, vol. 2, pp. 1747-1752, 1994.
[19] GLUT - The OpenGL Utility Toolkit, http://www.opengl.org/resources/libraries/glut.html