Real Time Large Vocabulary Continuous Sign Language Recognition Based on OP/Viterbi Algorithm

Guilin Yao, Hongxun Yao, Xin Liu, Feng Jiang
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
{glyao, yhx}@vilab.hit.edu.cn, [email protected], [email protected]

Abstract

Up to now, continuous sign language recognition has mainly been based on statistical methods, especially Hidden Markov Models (HMM) and Viterbi-Beam searching. However, the recognition speed often becomes unacceptable as the vocabulary grows, causing a long delay that is not suitable for a real-time recognition system. To speed up the recognition process, we present a method that uses One-Pass (OP) pre-searching before Viterbi recognition. The experiments are carried out on a large vocabulary database. The results show that the average recognition speed of the OP/Viterbi approach is raised notably compared with single Viterbi-Beam searching, without reducing the recognition accuracy too much.

1. Introduction

Sign language [1] is a way for deaf people to communicate with each other and exchange information. The most valuable and challenging problem in sign language recognition (SLR) is continuous sign language recognition (CSLR). The main problem of a CSLR system is the recognition speed: the sign language sentences captured by the Dataglove sensors should be translated quickly to make the system practical for everyday communication. The problem of real-time recognition with a large vocabulary has rarely been addressed in previous studies. Our previous work [2], a multilayer architecture for sign language recognition based on DTW/HMM, deals with isolated words. At present, continuous sign language recognition is mainly based on the statistical approach, which uses Viterbi-Beam searching over HMMs. However, the process of finding the candidate words for the current frame needs to calculate the probability at the first node of each

candidate word and is therefore very time-consuming. Another approach, the template-matching approach, which is mainly based on OP searching in Wu's work [4], designs one or more templates for each word and then seeks the optimal alignment between the observation sequence and each of the word templates. However, the vocabulary that Wu used is not large, and later research also shows that the recognition speed slows down as the vocabulary gets larger. The objective of our work is to provide a new system that speeds up the continuous sign language recognition process. The system architecture has a two-stage hierarchy. The first stage, accomplished by OP pre-searching, results in a final word set containing the words that the sentence may contain. The second stage, based on the final word set obtained in the first stage and accomplished by Viterbi-Beam searching, finally recognizes the gesture sentence after a more precise search. In the searching process, less time is needed because the Viterbi searching space is limited to this word set instead of the whole vocabulary.

2. OP pre-searching

OP searching is a continuous recognition method: the optimal estimate of the unknown sentence is obtained by minimizing the total distance over all possible word sequences. An earlier application of OP is that of C. Godin [3] in connected word speech recognition. Like Dynamic Time Warping (DTW), OP is a frame-synchronous method. The details of OP searching are discussed in the following subsections.

2.1. OP searching principles

The problem of OP searching can be defined as the minimization of the global distance. In this process,


two types of transition rules should be obeyed: transition rules in the template interior, called within-template transition rules, and transition rules at the template boundaries, called between-template transition rules. The continuous recognition problem can be regarded as finding the path through the set of grid points (i, j, k) that provides the best match between the test pattern and the unknown sequence of templates, where i, j and k denote the time frame of the unknown pattern, the time frame within each template, and the template number, respectively. Define w(l) = (i(l), j(l), k(l)), where l is the path index for the set of path elements. For within-template transitions, the following relation holds: if w(l) = (i, j, k) with j > 1, then

w(l-1) ∈ {(i-1, j, k), (i-1, j-1, k), (i, j-1, k)}.   (1)

Define the minimum accumulated distance D(i, j, k) along any path to the grid point (i, j, k). Using the within-template transition rule, we get

D(i, j, k) = d(i, j, k) + min{D(i-1, j, k), D(i-1, j-1, k), D(i, j-1, k)}.   (2)

At the template boundaries, with j = 1, we also obtain the between-template transition relation and rule:

w(l-1) ∈ {(i-1, 1, k), (i-1, J(k*), k*)},   (3)

D(i, j, k) = d(i, j, k) + min{D(i-1, 1, k), D(i-1, J(k*), k*)},   (4)

where k* = 1, 2, ..., K, K is the number of templates, and J(k*) denotes the last frame of template k*.
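To make recursions (1)-(4) concrete, the following is a minimal sketch of the frame-synchronous OP search under simplifying assumptions: local_dist stands in for the frame-level distance over the glove features, templates are plain frame sequences, and only the (0, 0) grid point is initialised at the first test frame. It is an illustration, not the authors' implementation.

```python
import numpy as np

def one_pass_dp(test, templates, local_dist):
    """Frame-synchronous One-Pass DP over an unknown frame sequence `test`
    and a list of word `templates`, following recursions (1)-(4).
    `local_dist(x, y)` is an assumed frame-level distance function."""
    I, K = len(test), len(templates)
    J = [len(t) for t in templates]                       # J(k): last frame of template k
    D = [np.full((I, J[k]), np.inf) for k in range(K)]    # accumulated distances D(i, j, k)

    # simplified boundary: every template may start at the first test frame
    for k in range(K):
        D[k][0, 0] = local_dist(test[0], templates[k][0])

    for i in range(1, I):
        # best template end at the previous test frame (between-template source)
        best_end = min(D[k][i - 1, J[k] - 1] for k in range(K))
        for k in range(K):
            for j in range(J[k]):
                d = local_dist(test[i], templates[k][j])
                if j == 0:
                    # between-template rules (3)/(4): stay in the first frame of
                    # template k, or enter from the end of the best template k*
                    D[k][i, 0] = d + min(D[k][i - 1, 0], best_end)
                else:
                    # within-template rules (1)/(2)
                    D[k][i, j] = d + min(D[k][i - 1, j],
                                         D[k][i - 1, j - 1],
                                         D[k][i, j - 1])
    return D
```

A full recogniser would additionally record back-pointers at the between-template entries so that the best word sequence can be traced back from the last test frame.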

2.2. Temporal clustering algorithm

In the next subsection, a method called "center matching" is applied, which uses the cluster centers instead of the whole set of word templates to locate the observation sequence in some sub-clusters. Here we use a temporal clustering algorithm to form the clusters. The mathematical description of the temporal clustering problem is as follows. Let Π = {O1, O2, ..., OV} be the data set of V temporal sequences (the vocabulary) to be clustered. The temporal clustering algorithm dynamically determines c centers {Γj; j = 1, 2, ..., c} and produces c clusters such that

Π = ∪_{j=1}^{c} Γj.

We first make the initialization using the DTW algorithm to compute the distance d(Oi, Oj) between every two gesture sequences, which are allowed to have different lengths. Having obtained the distances between every two samples in the vocabulary space, we use the temporal clustering algorithm to cluster the vocabulary word space. The clustering procedure is agglomerative and hierarchical: it provides a set of rules for splitting and combining existing clusters to obtain a final clustering. Each cluster is represented by a center, the sample with the minimum sum of distances to the other samples in the same cluster. In our work the index of the center is defined as

m(Γ_l) = arg min_K Σ_{j=1}^{n_l} d(C_{Γ_l}^j, C_{Γ_l}^K),   (5)

where d is a DTW-based distance, n_l is the number of samples in cluster Γ_l, and C_{Γ_l}^j is the jth sample in cluster Γ_l. The temporal clustering algorithm has some further refinements through splitting and merging of clusters. Clusters are merged either when the number of members (gesture sequences) in a cluster is less than a certain threshold or when the centers of two clusters are closer than a certain threshold. A cluster is split into two clusters if its standard deviation exceeds a predefined value and its number of members is at least twice the threshold for the minimum number of members.
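As an illustration of the center definition in (5), the sketch below selects the medoid of one cluster from a precomputed DTW distance matrix; the toy matrix in the usage example is made up for demonstration only.

```python
def cluster_center(dist, members):
    """Select the center (medoid) of one cluster following Eq. (5): the
    member whose summed DTW distance to all other members is minimal.
    `dist[a][b]` is a precomputed DTW distance between vocabulary samples
    a and b; `members` lists the sample indices belonging to the cluster."""
    return min(members, key=lambda K: sum(dist[K][j] for j in members))

# toy usage with a hypothetical 3-sample distance matrix
dist = [[0.0, 1.0, 4.0],
        [1.0, 0.0, 3.0],
        [4.0, 3.0, 0.0]]
print(cluster_center(dist, [0, 1, 2]))   # -> 1 (smallest distance sum: 4.0)
```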

2.3. Center matching

In the process of center matching, we classify the observation sequence into sub-clusters. The detailed procedure is as follows. Since we have obtained the center of every cluster, we can use the center templates instead of the templates of the whole vocabulary and perform OP searching as a pretreatment. The results are the center indexes corresponding to the words that the input gesture sentence contains. As the result of OP pre-searching, a final word set S is formed as the union of all the clusters that were hit. Obviously, the final recognition speed and accuracy depend on the number of clusters produced by the temporal clustering algorithm. The mathematical description of this recognition problem is as follows. Let L be the number of words included in S; the searching space then becomes L × N, where N denotes the number of states of the HMM established for each word. Suppose the sentence W to be recognized consists of s isolated words, W = W1, W2, ..., Ws. The aim of the


OP pre-searching is to find the sub-clusters Γ1, Γ2, ..., Γs that include the corresponding words in W, where Γj denotes the jth sub-cluster. For the s words W1, W2, ..., Ws, we assume for convenience that Wi ∈ Γi, i = 1, ..., s. According to the principles of the temporal clustering algorithm, we get

d(Wi, m(Γi)) = min_j d(Wi, m(Γj)), j = 1, 2, ..., c,   (6)

where d is the DTW match distance between two word samples, m(Γj) is the center of sub-cluster Γj, and c is the number of sub-clusters.

Here we make the assumption that the boundaries between every two words in the input sentence are already known, so the classification problem for the word sequence W1, W2, ..., Ws reduces to finding Γ1, Γ2, ..., Γs.
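A minimal sketch of this assignment step follows, written under the simplifying assumption stated above that word boundaries are known; `dtw_dist` is an assumed DTW distance over glove feature sequences, and in the actual system the assignment is produced by OP searching over the center templates rather than by this per-segment loop.

```python
def match_centers(word_segments, centers, clusters, dtw_dist):
    """Center matching per Eq. (6): assign every (pre-segmented) word
    observation to the sub-cluster whose center template is nearest under
    the DTW distance, then pool those clusters into the final word set S.
    `centers[c]` is the center template of cluster c and `clusters[c]` is
    the list of vocabulary words contained in cluster c."""
    S = set()
    hit = []
    for seg in word_segments:
        c = min(range(len(centers)), key=lambda c: dtw_dist(seg, centers[c]))
        hit.append(c)
        S.update(clusters[c])      # all words of the hit cluster become candidates
    return hit, S
```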

2.4. Transition factors between clusters

Using the center matching method alone, some recognition errors can easily result, so additional restrictions on the cluster centers should be added during the OP searching process so that transitions between certain centers are forbidden. Various connections exist between two clusters because they contain different words. For a specified cluster, the probability of its preceding cluster varies among the different clusters, depending on the words they contain. Suppose cluster 1 contains a words U1, U2, ..., Ua and cluster 2 contains b words V1, V2, ..., Vb. For a certain word Ui (1 ≤ i ≤ a) in cluster 1, the words that satisfy the syntax constraint 2 → 1 are certain bi words V1, V2, ..., Vbi in cluster 2. Over all a words in cluster 1, the total count of words in cluster 2 that can precede cluster 1 is

N21 = Σ_{i=1}^{a} bi.

Define the transition factor between cluster 2 and cluster 1 as

P21 = N21 / b,   (7)

which is the proportion of words in cluster 2 that can precede cluster 1. According to this definition, equation (4) is changed as follows. If Pk*k ≠ 0, then

D(i, j, k) = d(i, j, k) + min{D(i-1, 1, k), D(i-1, J(k*), k*) / Pk*k}.   (8)

If Pk*k = 0, equation (4) becomes

D(i, j, k) = d(i, j, k) + min{D(i-1, 1, k), D(i-1, J(k*), k*) · ∞} = d(i, 1, k) + D(i-1, 1, k),   (9)

where k* = 1, 2, ..., K. These equations indicate that the accumulated distance D for a transition from cluster 2 to cluster 1 gets lower as the transition factor between them gets higher, so the transition from 2 to 1 becomes easier; cluster 2 cannot transition to cluster 1 at all if P21 = 0. With this restriction the result of OP searching becomes more accurate, and we obtain a cleaner result Γ1, Γ2, ..., Γs for the gesture sequence W1, W2, ..., Ws. The s clusters are gathered into the final word set S, which serves as the candidate word set for the following Viterbi-Beam searching. The total flow of the OP pre-searching is given in Figure 1.

Figure 1. The diagram of OP pre-searching
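To illustrate how the transition factors of Eq. (7) could be tabulated, here is a sketch that counts, for each ordered cluster pair, the word pairs allowed by the syntax; the predicate `can_follow` is a hypothetical stand-in for the syntax constraints (e.g., a thresholded bigram table), not an interface defined in the paper.

```python
def transition_factors(clusters, can_follow):
    """Compute the cluster-to-cluster transition factors of Eq. (7).
    `clusters[k]` is the list of words in cluster k, and `can_follow(u, v)`
    is an assumed predicate that is true when word v may directly precede
    word u under the syntax.  P[k_star][k] corresponds to P_{k*k}: the
    (possibly >1) ratio N_{k*k} / |cluster k*|."""
    c = len(clusters)
    P = [[0.0] * c for _ in range(c)]
    for k in range(c):
        for k_star in range(c):
            n = sum(1 for u in clusters[k]
                      for v in clusters[k_star]
                      if can_follow(u, v))           # N_{k*k} = sum of b_i
            P[k_star][k] = n / len(clusters[k_star])
    return P
```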

3. Viterbi-Beam recognition

The detailed Viterbi-Beam recognition process is described in [5]. Continuous Viterbi-Beam searching starts from the first frame of the input pattern and can start at the first node of any candidate word that meets the grammatical syntax. Finally the optimal searching path, the one with the highest probability for the input pattern, is obtained. In this process, the X coordinate represents the frames of the input sentence to be recognized, where N denotes the total number of frames, and the Y coordinate gives all V words in the vocabulary and their 3 hidden states. According to the OP pre-searching result, we have obtained


L candidate words in the final word set S. The searching space of Viterbi-Beam is now restricted to these L candidate words and the others are discarded, so that less time is needed for the recognition process.
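The restriction of the second stage to the final word set S can be pictured as below; `viterbi_beam` is a hypothetical continuous decoder and `hmms` a word-to-model map, neither of which is specified in the paper.

```python
def recognize(frames, hmms, candidate_words, viterbi_beam):
    """Second stage: restrict the Viterbi-Beam search to the final word
    set S produced by OP pre-searching.  The search space shrinks from
    V x N states (whole vocabulary) to L x N states (candidates only)."""
    restricted = {w: hmms[w] for w in candidate_words}   # L models instead of V
    return viterbi_beam(frames, restricted)
```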

4. Experiments and comparisons

In our experiments, two CyberGloves and three Polhemus 3SPACE position trackers are used as input devices. The experimental data consist of 4942 isolated signs, with 4 samples per sign collected from two signers (2 samples each). All four samples are used as the training set to train the parameters of the HMMs via the embedded segmentation procedure. The continuous sign language database consists of 543 different sentences, each with 4 samples also collected from the two signers (2 samples each). We choose three samples for temporal clustering and the embedded segmentation of Viterbi-Beam, and the remaining sample is used as the testing set. The sentences contain from 3 to 14 words, with an average of 6.10 words per sentence. The candidate templates used by OP pre-searching are the center samples obtained from the temporal clustering algorithm. We try 6 different cluster numbers during temporal clustering and obtain 6 different results. The test sentences contain 3312 word samples in total. Single Viterbi-Beam recognition is also listed for comparison with our OP/Viterbi algorithm. The experimental results are shown in Table 1.

Table 1. The comparison of different results

Method                      Accuracy (%)    Time (sec/word)
Viterbi-Beam                78.1            1.047
OP/Viterbi (7 clusters)     76.7            0.533
OP/Viterbi (12 clusters)    74.8            0.428
OP/Viterbi (21 clusters)    70.2            0.343

The experiments are performed with a bigram language model on a PIV 2400 PC with 512 MB of memory. Table 1 reports the test results of single Viterbi-Beam and of OP/Viterbi with different numbers of cluster centers. Consider the n clusters of OP. Suppose the per-candidate complexity of OP is the same as that of single Viterbi-Beam, and normalize the complexity of single Viterbi-Beam over the whole vocabulary to O(1). If the total vocabulary contains w words, the complexity of the OP stage comes to O(n/w) due to center matching. In the second stage, let L be the number of words in the searching space; the calculation complexity of Viterbi decoding then comes to O(L/w) of single Viterbi-Beam. So the final complexity of recognition comes to O(n/w) + O(L/w) < O(1). The results indicate that the OP/Viterbi method raises the recognition speed to about two times that of single Viterbi-Beam on average, with only a small decrease in recognition accuracy.
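As a quick check of the timing column, the observed speedup factors of OP/Viterbi over single Viterbi-Beam can be computed directly from Table 1:

```python
baseline = 1.047                      # sec/word, single Viterbi-Beam
for clusters, t in [(7, 0.533), (12, 0.428), (21, 0.343)]:
    print(f"{clusters} clusters: {baseline / t:.2f}x faster")
# 7 clusters: 1.96x, 12 clusters: 2.45x, 21 clusters: 3.05x
```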

5. Conclusions

In this paper, a real-time OP/Viterbi algorithm for continuous sign language recognition is designed and implemented. We apply the center matching method and the transition factors in the first, classification stage of OP searching. On a large vocabulary of 4942 signs and 543 sentences, the experimental results show that the recognition speed of continuous sign language is raised distinctly, which meets the requirement of a real-time recognition system.

6. Acknowledgment

This research is supported by NSFC under contract No. 60332010.

7. References

[1] Maocong Zhang, Yongling Wu, Shengdan Yu, and Shurong Wang, Chinese Sign Language and Language Basics, Shandong Publishing Company, 1998, pp. 1-6, 34-41, 58-59.
[2] Feng Jiang, Hongxun Yao, and Guilin Yao, "Multilayer Architecture in Sign Language Recognition", Proceedings of the 5th International Conference on Multimodal Interfaces, 2004, pp. 102-104.
[3] C. Godin and P. Lockwood, "DTW Schemes for Continuous Speech Recognition: a Unified View", Computer Speech and Language, 1989, pp. 169-198.
[4] Jiangqin Wu, Research and Implementation on Chinese Sign Language Recognition Algorithm, Dissertation for the Degree of D.Eng., 2000.
[5] Jiyong Ma, Wen Gao, Jiangqin Wu, and Chunli Wang, "A Continuous Chinese Sign Language Recognition System", IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 2000, pp. 428-433.
