Recognizing hand-raising gestures using HMM

Monowar Hossain
Computer Science and Engineering, York University
4700 Keele Street, Toronto, ON M3J 1P3 Canada
[email protected]

Michael Jenkin
Computer Science and Engineering, York University
4700 Keele Street, Toronto, ON M3J 1P3 Canada
[email protected]

Abstract

Automatic attention-seeking gesture recognition is an enabling element of synchronous distance learning. Recognizing attention seeking gestures is complicated by the temporal nature of the signal that must be recognized and by the similarity between attention seeking gestures and non-attention seeking gestures. Here we describe two approaches to the recognition problem that utilize HMMs to learn the class of attention seeking gestures: an explicit approach that encodes the temporal nature of the gestures within the HMM, and an implicit approach that augments the input token sequence with temporal markers. Experimental results demonstrate that the explicit approach is more accurate.

Keywords: Hidden Markov Models, attention seeking gesture recognition, spatio-temporal modeling.

1. Introduction

In a distance learning setting it is desirable to attend automatically to potential speakers in the classroom who wish to interact with the instructor. Attention seeking gestures in a classroom include raising or waving a hand to capture attention. Such gestures are defined by the trajectory of the hand. Other hand raising gestures may appear similar but are not attention seeking gestures (see Figure 1). In this paper we address the problem of recognizing attention seeking gestures through the use of Hidden Markov Models (HMMs). Representing temporal structure in an HMM requires some mechanism to encode time. Two possible solutions to the problem of representing spatio-temporal structure in HMMs are considered. One approach is to model the temporal information implicitly via structural constraints in the feature or symbol space. Another approach is to model the temporal information explicitly in the HMM through specific structural forms in the allowable state transitions within the HMM.
2. Previous Work

A number of different techniques exist to control temporal variability in an HMM. The temporal variability of the HMM can be controlled by using the time t as a parameter of the HMM's emission probability. In an Input Output HMM (IOHMM) [2] both the state transition probability and the emission probability are conditioned on the input sequence. A modified version of the IOHMM can be used to learn the temporal behavior of signals and to recognize them. Deng [9] proposed a model that explicitly conditions the emission probability on the time index, so that the stationarity of the observations emitted from a given state is no longer assumed. An improvement to this model is the PHMM [14]. However, the PHMM requires the assumption that the distribution of the temporal property is finite-uniform, implying that the value of the prior P(θ) for any particular θ is either constant or zero.

HMMs have also been parameterized based on temporal information. In [8], the probability of generating a sequence of observations X = x_t, x_{t+1}, ..., x_{t+T} in a given state q_j is given by

$$P(X_t^{t+T} \mid q_j) = \prod_{i=t}^{t+T} P(x_i \mid q_j,\ (i - t + 1)) \qquad (1)$$

Here the time index is defined as the number of frames since entering a state up to the current frame. All of the probabilities can be estimated using Multi-Layer Perceptrons (MLPs) except P(t_i | x), which requires either explicit or implicit boundary knowledge of the states. Although presegmented data is used to estimate the parameters of the HMM, reliable estimation remains a challenge when the boundaries between states are blurred.

Both types of approaches are considered here for the task of attention seeking gesture recognition using HMMs. In the ITHMM approach the temporal structure of gestures is encoded in the data stream. In the ETHMM approach the temporal structure of gestures is encoded in the structure of the HMM itself. These two techniques are described in detail in the following sections.
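As an illustration of the time-index model of equation (1), the per-state likelihood can be computed as a running product of time-conditioned emission probabilities. This is a minimal sketch, not the implementation of [8]; the emission table b and its integer-coded symbols are hypothetical.

```python
import numpy as np

def time_indexed_likelihood(b, obs, j):
    """Eq. (1): P(x_t, ..., x_{t+T} | q_j) as a product of emission
    probabilities conditioned on the number of frames spent in state j.

    b[j, d, k] -- hypothetical NumPy table: P(symbol k | state j, d-th frame in j)
    obs        -- integer-coded observation symbols emitted while in state j
    """
    p = 1.0
    for d, x in enumerate(obs):  # d = frames since entering the state
        p *= b[j, d, x]
    return p
```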
3. Implicit temporal information encoded HMM
In an Implicit Temporal Information Encoded HMM (ITHMM), the temporal information is modeled implicitly. The basic idea of the ITHMM is to parameterize the emission probability of the states on the temporal information in both the training and evaluation phases. Using the extended Baum-Welch algorithm, the temporal information and its variability are learned in the training phase. The output probability of a sequence O_1, O_2, O_3, ..., O_T of the ITHMM becomes

$$P(O_1, O_2, O_3, \dots, O_T \mid q_j = S_j) = \prod_{t=1}^{T} P(O_t \mid q_j = S_j, t) \qquad (2)$$

where the time index t is defined as the number of steps since entering the first state S_1. Using the extended forward-backward algorithm, the probability of the output sequence is evaluated in the recognition phase.

An ITHMM is a five-tuple (S, V, A, B, π) where:

S = {S_1, S_2, S_3, ..., S_N} is a set of N states. The state at time t is denoted by the random variable q_t.

V = {v_1, v_2, v_3, ..., v_M} is a set of M distinct observation symbols over a discrete alphabet. The observation at time t is denoted by the random variable O_t. The observation symbols correspond to the physical output of the system being modeled.

T is the maximum possible length of the sequence. M is the number of observation symbols.
Figure 1. (a, b) Attention seeking gestures. (c) Non-attention seeking gesture. The figures show the last frame of each gesture and the dots represent the position of the hand at each frame.
N is the number of states.

A = {a_ij} is an N × N matrix for the state transition probability distribution, where a_ij is the probability of making a transition from state S_i to S_j: a_ij = P(q_{t+1} = S_j | q_t = S_i).

B = {b_tj(k)} is an N × T × M matrix for the observation symbol probability distributions, where b_tj(k) is the probability of emitting symbol v_k at time index t in state S_j: b_tj(k) = P(O_t = v_k | q_t = S_j, t).

π = {π_i} is the initial state distribution, where π_i is the probability that state S_i is the initial state. More formally, π_i = P(q_1 = S_i).

Since A, B and π are probabilistic, they satisfy the following constraints:

$$\forall i: \sum_{j} a_{ij} = 1,\ a_{ij} \ge 0; \qquad \forall j\,\forall t: \sum_{k} b_{tj}(k) = 1,\ b_{tj}(k) \ge 0; \qquad \sum_{i} \pi_i = 1,\ \pi_i \ge 0 \qquad (3)$$
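To make the five-tuple concrete, the following sketch allocates randomly initialized ITHMM parameters with the shapes given above. The sizes N, M and T are hypothetical, and the arrays are normalized to satisfy the constraints in equation (3).

```python
import numpy as np

N, M, T = 9, 16, 40  # hypothetical sizes: states, symbols, max sequence length
rng = np.random.default_rng(0)

# A: N x N transition matrix; each row sums to 1 (Eq. 3).
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)

# B: time-conditioned emissions, B[j, t, k] = P(O_t = v_k | q_t = S_j, t);
# for every (state, time index) pair the distribution over symbols sums to 1.
B = rng.random((N, T, M))
B /= B.sum(axis=2, keepdims=True)

# pi: initial state distribution, sums to 1.
pi = rng.random(N)
pi /= pi.sum()
```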
3.1. Recognition

In a general HMM, given the observation sequence O = O_1, O_2, ..., O_T and a model λ = (A, B, π), we compute P(O|λ), the probability of the observation sequence given the model, using the forward-backward algorithm described in [10]. In an ITHMM, the forward-backward algorithm is modified to compute P(O|λ) given the implicit representation of time in the observation sequence. In the extended forward-backward algorithm, the forward variable α_t(i) is defined as

$$\alpha_t(i) = P(O_1, O_2, \dots, O_t, q_t = S_i \mid \lambda) \qquad (4)$$

Initialization (t = 1):

$$\alpha_1(i) = \pi_i\, b_{1i}(O_1), \quad 1 \le i \le N. \qquad (5)$$

Induction (t > 1):

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big]\, b_{(t+1)j}(O_{t+1}), \quad 1 \le t \le T-1,\ 1 \le j \le N. \qquad (6)$$

Termination:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \qquad (7)$$
The backward variable β_t(i) is defined as

$$\beta_t(i) = P(O_{t+1}, O_{t+2}, \dots, O_T \mid q_t = S_i, \lambda) \qquad (8)$$

Initialization:

$$\beta_T(i) = 1, \quad 1 \le i \le N. \qquad (9)$$

Induction:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_{(t+1)j}(O_{t+1})\, \beta_{t+1}(j), \quad t = T-1, T-2, \dots, 1,\ 1 \le i \le N. \qquad (10)$$
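The extended forward and backward passes of equations (4)-(10) differ from the standard algorithm [10] only in the emission lookup, which carries the extra time index. A sketch using the array layout of the earlier listing (B[j, t, k]), where obs is an integer-coded symbol sequence:

```python
import numpy as np

def ithmm_forward(pi, A, B, obs):
    """Extended forward pass (Eqs. 4-7). Returns P(O | lambda) and alpha."""
    L, N = len(obs), len(pi)
    alpha = np.zeros((L, N))
    alpha[0] = pi * B[:, 0, obs[0]]                    # initialization (Eq. 5)
    for t in range(L - 1):                             # induction (Eq. 6)
        alpha[t + 1] = (alpha[t] @ A) * B[:, t + 1, obs[t + 1]]
    return alpha[-1].sum(), alpha                      # termination (Eq. 7)

def ithmm_backward(A, B, obs):
    """Extended backward pass (Eqs. 8-10)."""
    L, N = len(obs), A.shape[0]
    beta = np.ones((L, N))                             # initialization (Eq. 9)
    for t in range(L - 2, -1, -1):                     # induction (Eq. 10)
        beta[t] = A @ (B[:, t + 1, obs[t + 1]] * beta[t + 1])
    return beta
```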
These definitions of α and β are used in the training procedure described in the following section.
3.2. Training

An extended version of the Baum-Welch algorithm is used to train the ITHMM. The modification of the re-estimation procedure is straightforward. We set π̄_i = γ_1(i), the expected frequency of state S_i at time t = 1, where γ_t(i) = Σ_{j=1}^N ξ_t(i, j) and

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_{(t+1)j}(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_{(t+1)j}(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_{(t+1)j}(O_{t+1})\, \beta_{t+1}(j)} \qquad (11)$$

The remaining parameters are re-estimated as

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad (12)$$

$$\bar{b}_j(k) = \frac{\sum_{t=1,\ \text{s.t.}\ O_t = v_k}^{T-1} \gamma_t(j)}{\sum_{t=1}^{T-1} \gamma_t(j)} \qquad (13)$$

Using the above procedure, an improved ITHMM λ̄ = (Ā, B̄, π̄) is computed from the current model λ = (A, B, π). In each iteration, λ̄ is used in place of λ to improve the probability of O being observed from the model. The iteration procedure is continued until some limiting point is reached. The above procedure is an extension of the original Baum-Welch algorithm, and the steps correspond to the Expectation and Maximization steps of the EM algorithm. Hence, the procedure is guaranteed to converge to a local maximum of the likelihood function. Details of the proof can be found in [7, 1].
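A sketch of one extended Baum-Welch iteration built on the passes above. Following standard practice, the expected counts of equations (11)-(13) are pooled over all training sequences before normalizing; a robust implementation would also need to smooth or guard zero-count cells.

```python
import numpy as np

def ithmm_expectations(pi, A, B, obs):
    """E-step on one sequence: gamma (state posteriors) and xi (Eq. 11)."""
    p_obs, alpha = ithmm_forward(pi, A, B, obs)
    beta = ithmm_backward(A, B, obs)
    gamma = alpha * beta / p_obs
    L, N = len(obs), len(pi)
    xi = np.empty((L - 1, N, N))
    for t in range(L - 1):
        emit = B[:, t + 1, obs[t + 1]] * beta[t + 1]
        xi[t] = alpha[t][:, None] * A * emit[None, :] / p_obs
    return gamma, xi

def ithmm_reestimate(pi, A, B, sequences):
    """M-step: accumulate Eqs. (12)-(13) over the training set."""
    new_pi = np.zeros_like(pi)
    num_A, den_A = np.zeros_like(A), np.zeros(len(pi))
    num_B = np.zeros_like(B)
    for obs in sequences:
        gamma, xi = ithmm_expectations(pi, A, B, obs)
        new_pi += gamma[0]                   # expected frequency at t = 1
        num_A += xi.sum(axis=0)              # Eq. (12) numerator
        den_A += gamma[:-1].sum(axis=0)      # Eq. (12) denominator
        for t, k in enumerate(obs):
            num_B[:, t, k] += gamma[t]       # Eq. (13), per time index
    new_A = num_A / den_A[:, None]
    new_B = num_B / num_B.sum(axis=2, keepdims=True)
    return new_pi / len(sequences), new_A, new_B
```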
3.3. ITHMM as a variation of an HMM

The feature vector is an important factor in defining the temporal variability of gestures. The feature vectors should have enough discriminative power to recognize the specific gesture and sufficient generalization power to tolerate the noise and variability in the data. It is possible to use higher-dimensional feature vectors to increase the discriminative power of the HMM. Rather than developing unique algorithms to manipulate the ITHMM, it is possible to augment a standard HMM to represent time in the input tokens. Another dimension, time, is added to the feature vectors. Instead of using the feature vectors U = {u_1, u_2, u_3, ..., u_T} as the input to the HMM, the temporal information is included with each feature vector as (U, T) = {(u_1, t_1), (u_2, t_2), ..., (u_T, t_T)} in both the training and recognition phases, so that the temporal structure of the gesture is learned implicitly. Here t_i is the time index of the sequence. This way, a general HMM can be used in place of a special purpose ITHMM. The forward-backward algorithm described in [10] can be used without modification for recognition purposes, and the general Baum-Welch algorithm described in [1] can be used to train such an HMM.
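The augmentation itself is trivial; a minimal sketch, assuming integer-coded feature tokens:

```python
def augment_with_time(tokens):
    """Turn a feature sequence u_1, ..., u_T into the time-tagged sequence
    (u_1, 1), ..., (u_T, T), so that a standard HMM over the augmented
    symbol alphabet learns the temporal structure implicitly."""
    return [(u, t) for t, u in enumerate(tokens, start=1)]

# Example: augment_with_time([3, 3, 7]) -> [(3, 1), (3, 2), (7, 3)]
```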
4. Explicit temporal information encoded HMM

In an Explicit Temporal Information Encoded HMM (ETHMM), the temporal information is modeled by limiting the allowable state transitions. A left-to-right multiple parallel path HMM model is developed to model the temporal information. Each time step t is represented by a collection of HMM state variables S_t^i. The observation at time step t depends on only one of the state variables of S_t. The number of state variables in S_t equals the number of discrete feature symbols, and each state variable in S_t is mapped to a discrete encoded feature symbol that describes the temporal structure of the gesture in question. Each state variable has its own output probabilities and transition dynamics. This structure allows the ETHMM to describe both the temporal information and the variation in the gesture movement. More formally, an ETHMM is a five-tuple (S, V, A, B, π) where:
S = {S_1, S_2, S_3, ..., S_T} is a set of T states, where each S_t = {S_t^1, S_t^2, S_t^3, ..., S_t^N} is a set of N state variables. The state at time t is denoted by the random variable q_t.

V = {v_1, v_2, v_3, ..., v_M} is a set of M distinct observation symbols over a discrete alphabet. The observation at time t is denoted by the random variable O_t. The observation symbols correspond to the physical output of the system being modeled.

T is the maximum possible length of the sequence. M is the number of observation symbols. N is the number of discrete state variables in S_t; it is also equal to the number of unique observation symbols M.

A = {a_{t,t+1}^{ij}} holds the state transition probability distributions, where a_{t,t+1}^{ij} is the probability of making a transition from state variable S_t^i to S_{t+1}^j. More formally, a_{t,t+1}^{ij} = P(q_{t+1} = S_{t+1}^j | q_t = S_t^i).

B = {b_t^i(k)} is a T × N × M array for the observation symbol probability distributions, where b_t^i(k) is the probability of emitting symbol v_k at time t in state variable S_t^i. More formally, b_t^i(k) = P(O_t = v_k | q_t = S_t^i).

π = {π_1^i} is the initial state distribution, where π_1^i is the probability that S_1^i is the initial state. More formally, π_1^i = P(q_1 = S_1^i).

Since A, B and π are probabilistic, they must satisfy the following constraints:

$$\forall t\,\forall i: \sum_{j} a_{t,t+1}^{ij} = 1, \quad a_{t,t+1}^{ij} \ge 0 \qquad (14)$$

$$\forall t\,\forall i: \sum_{k} b_t^{i}(k) = 1, \quad b_t^{i}(k) \ge 0 \qquad (15)$$

$$\forall t: \sum_{i} \pi_t^{i} = 1, \quad \pi_t^{i} \ge 0 \qquad (16)$$

An ETHMM is a model in which the state transitions are constrained from left to right. The state transition coefficients of this type of HMM must satisfy the following:

$$a_{t,l}^{ij} = 0, \quad l \le t \qquad (17)$$

$$\pi_t^{i} = 0, \quad t \ne 1 \qquad (18)$$

$$\sum_{i} \pi_1^{i} = 1 \qquad (19)$$

The interpretation of equation (17) is that no backward state transition and no self transition is allowed. Equations (18) and (19) state that only the state variables representing time 1 have initial state probability greater than zero, and that their probabilities sum to 1. All other state variables have zero initial probability.
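A sketch of parameter arrays honoring constraints (14)-(19). Since transitions are only permitted from time slice t to t + 1, the transition distribution can be packed as a (T − 1) × N × N array, which makes the backward and self transitions of equation (17) structurally impossible; the sizes and initialization are hypothetical.

```python
import numpy as np

def init_ethmm(T_max, N, seed=0):
    """Randomly initialized ETHMM parameters. A[t, i, j] is the probability
    of moving from state variable S_t^i to S_{t+1}^j; M = N as in the text."""
    rng = np.random.default_rng(seed)
    A = rng.random((T_max - 1, N, N))
    A /= A.sum(axis=2, keepdims=True)    # Eq. (14): each row is stochastic
    B = rng.random((T_max, N, N))        # Eq. (15): emissions per (t, state)
    B /= B.sum(axis=2, keepdims=True)
    pi = np.zeros((T_max, N))            # Eq. (18): zero except at t = 1
    pi[0] = rng.random(N)
    pi[0] /= pi[0].sum()                 # Eq. (19): time-1 entries sum to 1
    return A, B, pi
```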
4.1. Recognition

In an ETHMM, a modified version of the forward-backward algorithm is used to compute P(O|λ), the probability of the observation sequence. In the extended forward-backward algorithm, the forward variable α_t(i) is defined as

$$\alpha_t(i) = P(O_1, O_2, \dots, O_t, q_t = S_t^i \mid \lambda) \qquad (20)$$

Initialization:

$$\alpha_1(i) = \pi_1^{i}\, b_1^{i}(O_1), \quad 1 \le i \le N. \qquad (21)$$

Induction:

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{t,t+1}^{ij}\Big]\, b_{t+1}^{j}(O_{t+1}), \quad 1 \le t \le T-1,\ 1 \le j \le N. \qquad (22)$$

Termination:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \qquad (23)$$
The backward variable β_t(i) is defined as

$$\beta_t(i) = P(O_{t+1}, O_{t+2}, \dots, O_T \mid q_t = S_t^i, \lambda) \qquad (24)$$

Initialization:

$$\beta_T(i) = 1, \quad 1 \le i \le N. \qquad (25)$$

Induction:

$$\beta_t(i) = \sum_{j=1}^{N} a_{t,t+1}^{ij}\, b_{t+1}^{j}(O_{t+1})\, \beta_{t+1}(j), \quad t = T-1, T-2, \dots, 1,\ 1 \le i \le N. \qquad (26)$$
4.2. Training

The feature space is transformed onto the state space. Each state variable S_t^i is mapped to a unique feature symbol. Each meta-state S_t represents a time instance of the sequence. Therefore, the transition between S_t^i and S_{t+1}^j becomes the probability of a transition from symbol i at time t to symbol j at time t + 1. The length of the ETHMM is the maximum possible length of the observation. The number of states and their transitions must be computed using the training database. After the initial parameter estimation, an extended version of the Baum-Welch algorithm is used to train the ETHMM. We set π̄_1^j = γ_1(j), the expected frequency of state S_1^j at time t = 1, where γ_t(i) = Σ_{j=1}^N ξ_t(i, j) and

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{t,t+1}^{ij}\, b_{t+1}^{j}(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, a_{t,t+1}^{ij}\, b_{t+1}^{j}(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{t,t+1}^{ij}\, b_{t+1}^{j}(O_{t+1})\, \beta_{t+1}(j)} \qquad (27)$$

The remaining parameters are re-estimated as

$$\bar{a}_{t,t+1}^{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad (28)$$

$$\bar{b}_t^{j}(k) = \frac{\sum_{t=1,\ \text{s.t.}\ O_t = v_k}^{T-1} \gamma_t(j)}{\sum_{t=1}^{T-1} \gamma_t(j)} \qquad (29)$$

Using the above procedure, an improved ETHMM λ̄ = (Ā, B̄, π̄) is computed from the current model λ = (A, B, π). In each iteration, λ̄ is used in place of λ to improve the probability of O being observed from the model. The iteration procedure is continued until some limiting point is reached. The above procedure is an extension of the original Baum-Welch algorithm, and the steps correspond to the Expectation and Maximization steps of the EM algorithm. This procedure is guaranteed to converge to a local maximum of the likelihood function. Details of the proof can be found in [7, 1].
4.3. ETHMM as a variation of a standard HMM

Each ETHMM has an equivalent standard HMM. The two-dimensional N × T grid of state variables can be projected onto a single dimension of N × T states, and the state transition probability distribution over the grid becomes an ordinary two-dimensional transition matrix over the flattened states. One advantage of this transformation is that unnecessary states can be pruned from the state space, so less space is needed for computation. Additionally, the general version of the forward-backward algorithm can be used for recognition, and the general Baum-Welch algorithm can be used to train the ETHMM network.
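A sketch of this projection, using the packed arrays from the earlier ETHMM listing: the state variable S_t^i maps to the flat index t·N + i, and the resulting block-structured transition matrix (zero everywhere except the t to t + 1 blocks) can be handed to a standard HMM implementation or pruned.

```python
import numpy as np

def flatten_ethmm(A, B, pi):
    """Project the T x N grid of state variables onto N*T flat states."""
    T_max, N = B.shape[0], B.shape[1]
    flat_A = np.zeros((T_max * N, T_max * N))
    for t in range(T_max - 1):
        # Transitions out of time slice t land only in time slice t + 1.
        flat_A[t * N:(t + 1) * N, (t + 1) * N:(t + 2) * N] = A[t]
    flat_B = B.reshape(T_max * N, -1)    # one emission row per flat state
    flat_pi = np.zeros(T_max * N)
    flat_pi[:N] = pi[0]                  # only time-1 states can start
    return flat_A, flat_B, flat_pi
```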
5. Experimental evaluation

In order to evaluate the performance of the various HMMs, training and test datasets were constructed. Subjects were captured with a standard CCD camera while making attention seeking and non-attention seeking gestures (see Figure 1). The raw data stream was converted into token streams that captured the hand position in a head-centric coordinate system (for details of this process see [5]). A "naïve HMM", an ITHMM and an ETHMM were trained using the same set of training data and then evaluated against the training data, novel attention seeking gestures and non-attention seeking gestures.

A collection of attention seeking and non-attention seeking gesture video sequences was taken from six different subjects. The subjects were asked to raise their hands naturally in an attention seeking manner while facing the camera. After twenty true attention seeking gesture sequences were captured for each subject, subjects were asked to perform non-attention seeking gestures such as irregular movements of the hands and head, and scratching the nose or head. Five such sequences were obtained from each subject. A training database of true attention seeking gesture sequences was built by selecting 100 sequences randomly from the 120 true attention seeking gesture sequences. A database containing 20 novel attention seeking gestures was also constructed. A non-novel test database of 53 gesture sequences was constructed by selecting sequences randomly from the training database. A fourth database containing 30 sequences was built from the non-attention seeking gesture sequences. Table 1 summarizes the training and test datasets. The training database was used to train all the HMMs separately, and the training database and the other databases were used to test the performance of the HMMs.
Dataset type     true sequences   false sequences
Training set          100                0
Test set               53                0
Novel test set         20                0
False test set          0               30
Total                 120               30

Table 1. Training and test datasets. Non-novel test data sequences were selected randomly from the training database.
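For concreteness, the split described above can be reproduced along the following lines; the sequence identifiers are hypothetical placeholders for the captured gesture sequences.

```python
import random

random.seed(0)
true_seqs = [f"true_{i:03d}" for i in range(120)]    # 6 subjects x 20 gestures
false_seqs = [f"false_{i:02d}" for i in range(30)]   # 6 subjects x 5 gestures

training_set = random.sample(true_seqs, 100)          # 100 of 120 true sequences
novel_test = [s for s in true_seqs if s not in training_set]
non_novel_test = random.sample(training_set, 53)      # drawn from training data
false_test = list(false_seqs)                         # all 30 false sequences
```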
5.1. Offline test results

A series of tests was performed to find the "best" parameters of the naïve HMM using the test database and the false attention seeking database. The ideal HMM would recognize true attention seeking gestures 100% of the time and reject false attention seeking gestures 100% of the time. Obtaining such an HMM is unlikely.
Figure 2. Sensitivity versus specificity for different sized naïve HMMs with a recognition threshold in the range 1.0E-11 to 1.0E-5. The best naïve HMM has 8 states and τ = 1.79E-6, with sensitivity = 0.77 and specificity = 0.83.

Figure 3. Sensitivity versus specificity for different sized ITHMMs with recognition threshold values in the range 0.0 to 1.0E-40. The best ITHMM has 9 states and τ = 1.11E-25, at sensitivity = 0.87 and specificity = 0.87.
Figures 2, 3 and 4 plot the performance of different sized naïve HMMs, ITHMMs and the ETHMM respectively in terms of their sensitivity (vertical axis) and specificity (horizontal axis). Following [11],

$$\text{sensitivity} = \frac{\text{number of recognized true gestures}}{\text{number of true gestures}} \qquad (30)$$

$$\text{specificity} = \frac{\text{number of rejected false gestures}}{\text{number of false gestures}} \qquad (31)$$

The ideal HMM would be found at the top-right hand corner (1, 1) of the sensitivity-specificity graph (see Figure 2). Good HMMs are found near this point. For an N-state HMM, d_N defines the squared distance in (sensitivity, specificity) space to the ideal HMM:

$$d_N = (1 - \text{sensitivity})^2 + (1 - \text{specificity})^2 \qquad (32)$$

Figure 4. Sensitivity and specificity for the ETHMM with recognition threshold values in the range 0.0 to 1.0E-60. The optimum threshold value was chosen as τ = 1.11E-34, with sensitivity = 1.0 and specificity = 0.87.
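The selection criterion of equations (30)-(32) is straightforward to compute; a sketch, assuming each HMM yields a likelihood score per sequence and a sequence is accepted when its score meets the threshold τ:

```python
def sensitivity_specificity(true_scores, false_scores, tau):
    """Eqs. (30)-(31): fraction of true gestures accepted and of false
    gestures rejected at likelihood threshold tau."""
    sens = sum(s >= tau for s in true_scores) / len(true_scores)
    spec = sum(s < tau for s in false_scores) / len(false_scores)
    return sens, spec

def distance_to_ideal(sens, spec):
    """Eq. (32): squared distance to the ideal corner (1, 1)."""
    return (1.0 - sens) ** 2 + (1.0 - spec) ** 2
```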
The best naïve HMM has 8 states and τ = 1.79E-6, with sensitivity = 0.77 and specificity = 0.83. The best ITHMM has 9 states and τ = 1.11E-25, at sensitivity = 0.87 and specificity = 0.87. The best ETHMM has 122 states with optimum threshold value τ = 1.11E-34, sensitivity = 1.0 and specificity = 0.87. The test databases of true attention seeking gestures and non-attention seeking gestures were then evaluated with the "best" naïve HMM, ITHMM and ETHMM (see Figure 5). Table 2 summarizes the offline test results. The basic interpretation of Table 2 is that all the HMMs have a good ability to detect true gesture sequences, but that the ETHMM and ITHMM are more accurate at rejecting false gesture sequences.
Figure 5. Sensitivity and specificity of the best HMMs
Table 2. Offline gesture test: ETHMM, ITHMM and naïve HMM performance for offline sequences. Sensitivity and specificity values were computed over all of the sequences.
6. Discussion
In an offline process both the ETHMM and ITHMM perform better than the naïve HMM: they have a higher sensitivity and specificity.

The HMMs developed here model the spatio-temporal structure of the gestures in different ways. In an ITHMM, the temporal information is modeled implicitly by including structural constraints in the feature space. The temporal information is encoded within the feature space in both the training and recognition phases of the HMM. The initial parameters of an ITHMM are selected randomly. For the ITHMM, the training data were structured by introducing a third dimension, time, into the training examples. Due to the nature of the training examples, all of which corresponded to real input signals, the backward state transition probabilities decay with each iteration of the training phase and approach zero by the completion of training. Even when the initial structure is set to a fully connected topology, the ITHMM tends to organize itself into a left-to-right structural model.

In the ETHMM, on the other hand, the temporal information is modeled explicitly through specific structural forms in the allowable state transitions of the HMM. A left-to-right parallel path HMM model is developed to model the temporal information. Statistical information about the training database is computed, and the initial number of states, the transition probabilities, the initial state probabilities and the number of distinct observation symbols are generated automatically. The feature space is mapped onto a state space in an ETHMM. The training of the ETHMM is guided by providing the HMM with a set of near optimal parameters; this search space for the optimum parameters is much smaller than for the naïve HMM. The ETHMM generates the same optimum parameters each time in the training phase, and the parameters of the ETHMM represent the optimum parameters for the training examples.

The solutions obtained by both the ETHMM and ITHMM were shown to be stable in experiments. The naïve HMMs were forced to be left-to-right HMMs, while the ITHMMs and ETHMMs organized themselves into a left-to-right model automatically. The naïve HMM depends critically on its initial parameters, which are selected randomly. In the training phase, the naïve HMM tunes the parameters to the nearest local optimum, so different initial parameters lead to different local optima.

7. Future work
Various improvements can be made to the ETHMM and ITHMM developed here. An immediate extension to the ETHMM would be to allow self-loop transitions between the states and to enhance the gesture spotting algorithm. This would provide finer control in modeling the spatio-temporal variation in the input signal. At present there is no theoretical basis upon which to construct an optimum structure for an HMM that represents the data. The structure of the HMM (e.g., the number of states, the state transition constraints, the observation probability distributions) is determined in an ad hoc manner, via some heuristic approach (e.g., [12]), or by a systematic search. Another possible extension to this research would be to investigate how to find, in a systematic way, an "optimum structure" that represents the data in question.

Acknowledgements: The financial support of NSERC, CRESTech, and NCG IRIS is acknowledged.
References

[1] L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1-8, 1972.
[2] Y. Bengio and P. Frasconi. Input-Output HMMs for sequence processing. IEEE Trans. Neural Networks, 7:1231-1249, 1996.
[3] G. D. Forney. The Viterbi algorithm. Proc. IEEE, pages 268-278, 1973.
[4] G. D. Forney. The forward-backward algorithm. In Proc. of the 34th Allerton Conference on Communication, pages 432-446, 1996.
[5] M. Hossain. Explicit Versus Implicit Temporal Representations in HMM Gesture Recognition Systems. Master's thesis, Computer Science & Engineering, York University, Toronto, Canada, 2005.
[6] D. O. Tanguay Jr. Hidden Markov Models for Gesture Recognition. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA, August 1995.
[7] T. K. Moon. The expectation maximization algorithm. IEEE Signal Processing Magazine, pages 47-60, 1996.
[8] Y. Konig and N. Morgan. Modeling dynamics in connectionist speech recognition: the time index model. In Proc. Int. Conf. Speech and Language Processing, TR-94-012, Berkeley, CA, 1994.
[9] L. Deng. A generalized Hidden Markov Model with state-conditioned trend functions of time for the speech signal. Signal Processing, pages 65-78, 1992.
[10] L. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE, 77:257-286, 1989.
[11] C. Speicher and J. W. Smith Jr. Choosing Effective Laboratory Tests. WB Saunders, 1983.
[12] T. Takara, K. Higa, and I. Nagayama. Isolated word recognition using the HMM structure selected by the Genetic Algorithm. In Proc. ICASSP, volume II, pages 967-970, 1997.
[13] R. Watson. A Survey of Gesture Recognition Techniques. Technical report, Department of Computer Science, Trinity College, Dublin, Ireland, 1993.
[14] A. D. Wilson and A. F. Bobick. Hidden Markov Models for modeling and recognizing gesture under variation. In Hidden Markov Models: Applications in Computer Vision, pages 123-160. World Scientific, Singapore; River Edge, NJ, 2001.
[15] H. Yoon, J. Soh, Y. J. Bae, and H. S. Yang. Hand gesture recognition using combined features of location, angle and velocity. Pattern Recognition, 34(7):1491-1501, 2001.