NOVEL BOOSTING FRAMEWORK FOR SUBUNIT-BASED SIGN LANGUAGE RECOGNITION

George Awad1, Junwei Han2, and Alistair Sutherland3

1National Institute of Standards and Technology, USA; 2School of Computing, University of Dundee, UK; 3School of Computing, Dublin City University, Ireland
1 [email protected], [email protected], [email protected]

ABSTRACT

Recently, a promising research direction has emerged in sign language recognition (SLR) aimed at breaking up signs into manageable subunits. This paper presents a novel SL learning technique based on boosted subunits. Three main contributions distinguish the proposed work from traditional approaches: 1) a novel boosting framework is developed to recognize SL, in which learning is based on subunits instead of whole signs, making it more scalable for the recognition task; 2) feature selection is performed to learn a small set of discriminative combinations of subunits and SL features; 3) a joint learning strategy is adopted to share subunits across sign classes, which leads to better-performing classifiers. Our experiments show that the proposed technique outperforms Dynamic Time Warping (DTW) applied to whole signs.

Index Terms— Sign Language Recognition, Boosting Subunits, Joint Adaboost Learning.

1. INTRODUCTION

Every day millions of hearing-impaired people use SL for communication. As a result, SLR has attracted considerable attention in recent years from researchers in computer vision, gesture recognition, and human-computer interaction. Most existing SLR systems are based on training traditional machine learning techniques such as Neural Networks [1], Hidden Markov Models (HMMs) [2], or DTW [3] on hand features extracted from the whole sign segment. Recently, subunit-based SLR has attracted particular research interest because it permits building scalable systems. In these systems signs are decomposed into a set of phonemes that are used for recognition. As the number of subunits is much smaller than the sign vocabulary, subunit-based algorithms are claimed to be capable of recognizing SL with a large vocabulary. However, unlike in speech recognition, there is no general agreement yet on how to model the subunits of SL.

Some related work has been reported on the subunit-based approach. For example, in [4, 5, 6] the authors follow a 'movement-and-hold' model, where subunits are defined as the periods in which the hands are in motion or stationary. The extracted subunits are used to train HMM models [5], and the concatenation of such models is used to recognize whole signs. Finite State Machines are used in [6] to model the states of the movement and hold durations. The disadvantage of the movement-and-hold model is that it needs clear pauses of the hands, which do not occur often, especially with fluent SL users. In addition, relying on universal thresholds for motion-change detection is not robust to variation in users' motion behavior. In [7] the authors propose to employ k-means clustering to self-organize spatial features of subunits and use them to train an HMM for recognition. This approach ignores temporal features, which can be even more important in SL analysis. Another approach, adopted by [8], models subunits as the HMM states trained for each sign. One main disadvantage of this method is that it assumes a fixed number of subunits for all sign classes, which does not hold in many cases.

Features such as hand shape, motion, position, and orientation often play an important role in SLR. Most existing SLR systems treat all or a subset of these features equally. In real-world signs, however, not all features are informative for all signs, a fact ignored by most systems. To clarify this point, consider the movement-and-hold model [4, 5]: in the hold stage the hands are essentially stationary, so hand shape is apparently the more informative feature; in the movement stage, by contrast, hand motion may be more useful for characterizing signs. This observation motivates us to propose a boosting framework that automatically selects informative combinations of different features for every subunit and constructs a strong classifier from the selected combinations. In addition, different sign classes can in many cases share subunits. To this end, we investigate training classes jointly to take advantage of sharing subunits across classes.

The proposed work consists of three main steps. First, signs are automatically segmented into subunits using our previous work [9]. Second, clustering is performed on the extracted subunits to construct a codebook of representative subunits for each sign. Finally, the codebook entries are used to build a set of weak classifiers, where each weak classifier is a subunit modeled by a combination of features. Adaboost learning is employed to jointly train sign classes by sharing subunit/feature combinations, which enhances the overall performance and reduces the required training samples compared to HMM-based systems, which require at least 40 training samples [10]. To the best of our knowledge, very little published work has adopted Adaboost for direct SLR: in [10] the authors used Adaboost to detect the head and hands for SLR, while in [11] they boosted HMM models trained on SL subunits.

The rest of the paper is organized as follows. Section 2 reviews subunit segmentation. Section 3 discusses subunit-based features, and Section 4 describes subunit codebook generation. The Adaboost framework is presented in Section 5 and experimental results in Section 6. We conclude in Section 7.

2. SUBUNITS SEGMENTATION

We define a subunit as a motion pattern with correlated spatio-temporal features in a set of consecutive frames. Subunit segmentation is achieved using our previous work [9]. Given a signing video, a hand segmentation and tracking algorithm is first applied [12], which provides the hand trajectory. During a motion pattern (subunit), the hand trajectory forms an almost smooth curve and the motion speed is nearly uniform. Changes in speed and trajectory often happen when the hand shifts between different motion patterns. Subunit segmentation is therefore transformed into the detection of perceptual discontinuity points along the trajectory and of points where the speed changes. A post-processing step using further spatio-temporal features refines the segmentation results.
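As a concrete illustration of this idea, the sketch below flags candidate subunit boundaries from a 2D hand trajectory wherever the speed jumps or the motion direction turns sharply. It is a minimal stand-in for the refined, post-processed algorithm of [9]; the function name and the fixed thresholds are our assumptions, not the tuned procedure.

```python
import math

def segment_subunits(trajectory, speed_ratio=2.0, angle_thresh=math.radians(60)):
    """Split a hand trajectory (list of (x, y) centroids) into subunits at
    perceptual discontinuities: sharp speed changes or sharp turns. The
    thresholds here are illustrative placeholders."""
    # Frame-to-frame displacement vectors and speeds.
    vecs = [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:])]
    speeds = [math.hypot(dx, dy) for dx, dy in vecs]

    boundaries = [0]
    for i in range(1, len(vecs)):
        (dx1, dy1), (dx2, dy2) = vecs[i - 1], vecs[i]
        v1, v2 = speeds[i - 1], speeds[i]
        # Sharp speed change: one speed much larger than the other.
        speed_jump = max(v1, v2) > speed_ratio * max(min(v1, v2), 1e-6)
        # Sharp direction change: angle between consecutive motion vectors.
        turn = abs(math.atan2(dy2, dx2) - math.atan2(dy1, dx1))
        turn = min(turn, 2 * math.pi - turn)
        if speed_jump or turn > angle_thresh:
            boundaries.append(i)
    boundaries.append(len(trajectory) - 1)

    # Each consecutive boundary pair delimits one subunit (a frame range).
    return [(a, b) for a, b in zip(boundaries, boundaries[1:]) if b > a]
```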

3. SUBUNIT-BASED FEATURE EXTRACTION

Once subunits are obtained, a set of feature vectors $F_s = (fv_s^1, fv_s^2, \dots, fv_s^k)$ can be constructed for every subunit $s$ consisting of $k$ frames. Here $fv_s^i$ is the feature vector for the $i$-th frame in $s$. Specifically, $fv_s^i$ consists of both spatial and temporal information describing the hand motion, shape, and relative position. Based on the skin segmentation and tracking system, we are able to obtain the following information: (1) the hand motion trajectory $Tr_s = \{(x_s^1, y_s^1), (x_s^2, y_s^2), \dots, (x_s^k, y_s^k)\}$, where $(x_s^i, y_s^i)$ indicates the hand spatial centroid in the $i$-th frame of $s$; (2) the centroid $(x_s^c, y_s^c)$ of $Tr_s$, computed as $x_s^c = \frac{1}{k}\sum_{i=1}^{k} x_s^i$ and $y_s^c = \frac{1}{k}\sum_{i=1}^{k} y_s^i$; (3) the head position $(x_s^h, y_s^h)$.

The features are summarized as follows:

1) Hand motion speed: $HMS_s^i = \|(x_s^{i+1}, y_s^{i+1}) - (x_s^i, y_s^i)\|$.

2) Hand motion direction code: first, the hand motion direction is described by $\theta = \arctan\!\big((y_s^{i+1} - y_s^i)/(x_s^{i+1} - x_s^i)\big)$; then $\theta$ is quantized into 18 direction codes of 20 degrees each. The resulting direction code is represented by $HMD_s^i$.

3) Orientation angle of the vector from the hand to the trajectory centroid: $OHC_s^i = \arctan\!\big((y_s^c - y_s^i)/(x_s^c - x_s^i)\big)$.

4) Distance between the hand position and the trajectory centroid: $DHC_s^i = \|(x_s^c, y_s^c) - (x_s^i, y_s^i)\|$.

5) Distance between the hand and the head: $DHH_s^i = \|(x_s^h, y_s^h) - (x_s^i, y_s^i)\|$.

6) Orientation angle of the vector from the hand to the head: $OHH_s^i = \arctan\!\big((y_s^h - y_s^i)/(x_s^h - x_s^i)\big)$.

7) Hand shape: Fourier descriptors [13] (FDs) and Hu's seven moment invariants [14]. We concatenate both descriptors and represent them by $HS_s^i$.

The first six features are normalized to ease computation. FDs and moments are invariant to scale, rotation, and translation; the first six features are likewise translation and scale invariant. Finally, $fv_s^i$ is defined as $(HMS_s^i, HMD_s^i, OHC_s^i, DHC_s^i, DHH_s^i, OHH_s^i, HS_s^i)$.
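To make the definitions concrete, a minimal sketch computing the first six components of $fv_s^i$ for one frame follows. The hand-shape descriptor $HS_s^i$ is omitted since it requires the segmented hand silhouette, and the clamping of the motion vector at the last frame is our assumption.

```python
import math

def frame_features(traj, head, i):
    """First six entries of fv_s^i for frame i of a subunit. `traj` is the
    list of hand centroids (x, y) for the subunit; `head` is the head
    position (x, y). The HS shape descriptor is stubbed out."""
    k = len(traj)
    x, y = traj[i]
    xn, yn = traj[min(i + 1, k - 1)]          # next frame (clamped at the end)
    xc = sum(p[0] for p in traj) / k          # trajectory centroid x
    yc = sum(p[1] for p in traj) / k          # trajectory centroid y
    xh, yh = head

    hms = math.hypot(xn - x, yn - y)          # 1) hand motion speed
    theta = math.atan2(yn - y, xn - x)        # motion direction in [-pi, pi]
    hmd = int((theta + math.pi) / (2 * math.pi) * 18) % 18  # 2) 18 codes, 20 deg each
    ohc = math.atan2(yc - y, xc - x)          # 3) angle hand -> trajectory centroid
    dhc = math.hypot(xc - x, yc - y)          # 4) distance hand -> trajectory centroid
    dhh = math.hypot(xh - x, yh - y)          # 5) distance hand -> head
    ohh = math.atan2(yh - y, xh - x)          # 6) angle hand -> head
    return (hms, hmd, ohc, dhc, dhh, ohh)
```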

4. SUBUNIT CLUSTERING

DTW is the basic distance metric used in the clustering and learning processes of the proposed work. In brief, DTW [15] uses dynamic programming to find the warping path that yields the minimal warping cost between two subunits.
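Since DTW underlies both the clustering and the weak classifiers, a minimal self-contained version is sketched below. It treats a subunit as a list of equal-length numeric feature vectors and omits the band constraints and normalizations a production implementation would add.

```python
import math

def dtw(a, b):
    """Minimal dynamic-time-warping cost between two subunits, each a list of
    equal-length numeric feature vectors. Plain O(len(a)*len(b)) dynamic
    programming with a Euclidean local cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal warping cost aligning a[:i] with b[:j].
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local frame-to-frame cost
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```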

As the Adaboost weak classifiers are modeled by subunits, we need clustering to select representative subunits and to remove poorly segmented outlier subunits. Given a training set of $N$ sample videos for sign $x$, the subunit segmentation algorithm is applied to all sample videos, generating a set of subunits $S_x = \{s_1, s_2, \dots, s_u\}$. All subunits in $S_x$ are then clustered using hierarchical clustering with the DTW distance metric. Initially each subunit is considered to be one cluster; subunits are then paired into binary clusters recursively until a hierarchical tree is formed. At each step, the distances between all pairs of clusters are calculated and the two clusters with the minimum distance $R$ are linked:

$$R(\beta, \gamma) = \frac{1}{n_\beta n_\gamma} \sum_{i=1}^{n_\beta} \sum_{j=1}^{n_\gamma} \mathrm{dist}(s_i^\beta, s_j^\gamma) \qquad (1)$$

where $n_\beta$ and $n_\gamma$ are the numbers of subunits in clusters $\beta$ and $\gamma$, $s_i^\beta$ and $s_j^\gamma$ are the $i$-th and $j$-th subunits in clusters $\beta$ and $\gamma$ respectively, and $\mathrm{dist}(s_i^\beta, s_j^\gamma) = DTW(s_i^\beta, s_j^\gamma)$. Fig. 1 shows an example of a hierarchical tree of subunits with 3 main clusters.

Figure 1: An example of a dendrogram

We use the average number $\lambda$ of subunits segmented from all the samples of sign $x$ as a threshold to prune the hierarchical tree and define the subunit clusters for sign $x$. The final step in the clustering task is to construct a codebook from the different subunit clusters. For cluster $\gamma$ we find the medoid subunit:

$$s_*^\gamma = \arg\min_{j} \sum_{i=1}^{n_\gamma} DTW(s_i^\gamma, s_j^\gamma), \qquad j \in \{1, \dots, n_\gamma\} \qquad (2)$$

where $s_*^\gamma$ is the subunit that has the minimum total DTW distance to all the other subunits in the same cluster. The codebook of sign $x$ consists of the set of medoid subunits.
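The following sketch combines the average-linkage rule of Eq. (1) with the medoid selection of Eq. (2). It is illustrative only: the $\lambda$-based tree pruning described above is replaced by an explicit target cluster count, and it assumes the dtw() sketch from earlier.

```python
def cluster_and_codebook(subunits, n_clusters):
    """Average-linkage agglomeration of subunits under DTW (Eq. 1), followed
    by medoid extraction per cluster (Eq. 2). `subunits` is a list of
    subunits (each a list of feature vectors). Returns medoid indices."""
    n = len(subunits)
    d = [[dtw(subunits[i], subunits[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]          # start: one cluster per subunit

    while len(clusters) > n_clusters:
        # Link the two clusters with minimal average pairwise DTW distance (Eq. 1).
        a, b = min(((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
                   key=lambda pq: sum(d[i][j]
                                      for i in clusters[pq[0]]
                                      for j in clusters[pq[1]])
                   / (len(clusters[pq[0]]) * len(clusters[pq[1]])))
        clusters[a] += clusters[b]
        del clusters[b]

    # Medoid of each cluster: minimal total DTW distance to its members (Eq. 2).
    return [min(c, key=lambda i: sum(d[i][j] for j in c)) for c in clusters]
```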

5. SUBUNIT-BASED ADABOOST LEARNING

The Adaboost algorithm [16] boosts the classification of a simple learning algorithm by combining a collection of binary weak classifiers $h_t(x)$ into a stronger classifier:

$$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \qquad (3)$$

where $\alpha_t$ is the weight of the weak classifier $h_t$ at round $t$. In the training stage, a series of $T$ weak classifiers is called repeatedly, and for each call a distribution of weights is updated to indicate the importance of the examples in the data set for classification.

We use the available training set to automatically construct our weak classifiers. Given a codebook $cb_x = \{s_1^*, s_2^*, \dots, s_\lambda^*\}$ for sign $x$ consisting of $\lambda$ subunit entries, and the feature set $F = (HMS, HMD, OHC, DHC, DHH, OHH, HS)$, we define a set of weak classifiers using all the different combinations of these 7 features calculated for each $s_i^*$ in $cb_x$.
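The construction of this candidate pool can be sketched as follows. The function name and the index-based encoding of feature subsets are our own illustration, not the authors' implementation; each codebook entry yields $2^7 - 1 = 127$ feature combinations.

```python
from itertools import combinations

FEATURE_NAMES = ("HMS", "HMD", "OHC", "DHC", "DHH", "OHH", "HS")

def build_weak_classifier_pool(codebook):
    """Enumerate candidate weak classifiers for one sign: every codebook
    subunit (medoid) paired with every non-empty subset of the 7 features.
    Each candidate is (medoid, feature_index_subset); its threshold Theta
    is fitted later during boosting."""
    pool = []
    for medoid in codebook:
        for r in range(1, len(FEATURE_NAMES) + 1):
            for combo in combinations(range(len(FEATURE_NAMES)), r):
                pool.append((medoid, combo))
    return pool
```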

A classifier should fire ($h_t(x) = 1$) if the distance of $h_t$ to a sign video is below a certain threshold:

$$h_t(x) = \begin{cases} 1 & \text{if } D(h_t, x) < \Theta_{h_t} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

$$D(h_t, x) = \min_{j} DTW(h_t, s_j), \qquad j \in \{1, 2, \dots, M\} \qquad (5)$$

where the sign video sample $x$ consists of subunits $s_1$ to $s_M$, and the DTW measures the distance between the classifier (modeled by a subunit and one feature combination) and all the subunits in $x$. At every iteration we select the best weak classifier $h_t$ and corresponding $\Theta_{h_t}$ that minimize the overall error over the training samples under their current weights.

Joint Adaboost learning [17], as opposed to independent learning of object classes, has the advantage of allowing classes to share features, which permits classification systems to scale to a large number of classes. In this work we adopt it to share subunits (weak classifiers) across sign classes. This produces a faster classifier, since there are fewer shared weak classifiers to compute, and one that works better, since the weak classifiers are optimized to reduce the total error over all the classes at every iteration and so focus on more general features instead of class-specific ones. The basic idea of the algorithm is that at each boosting round we examine various class subsets $G_q \subseteq C$, where $C$ is the set of sign classes to learn and $G_q$ is one possible subset of these classes, and try to fit a weak classifier to discriminate that subset from the other classes $c \notin G_q$. The pseudocode of the proposed algorithm is shown in Fig. 2.
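A sketch of how Eqs. (3)-(5) could be evaluated is given below. It reuses the dtw() sketch from Section 4, treats each feature as one scalar slot of $fv$ for simplicity (in the real system HS is itself a descriptor vector), and, since the weak outputs are {0, 1}, replaces the raw sign in Eq. (3) with the usual discrete-AdaBoost comparison against half the total weight. All function names are ours.

```python
def classifier_distance(h_subunit, h_features, sample_subunits):
    """D(h_t, x) of Eq. (5): minimum DTW distance between the classifier's
    subunit and any subunit s_1..s_M of the sample x, computed only on the
    classifier's selected feature subset (a tuple of indices into fv)."""
    def project(subunit):
        return [[fv[k] for k in h_features] for fv in subunit]
    proto = project(h_subunit)
    return min(dtw(proto, project(s)) for s in sample_subunits)

def weak_classify(h_subunit, h_features, theta, sample_subunits):
    """h_t(x) of Eq. (4): fire (return 1) iff D(h_t, x) is below Theta."""
    return 1 if classifier_distance(h_subunit, h_features, sample_subunits) < theta else 0

def strong_classify(weak_classifiers, sample_subunits):
    """H(x) of Eq. (3). With {0, 1} weak outputs, the weighted vote is
    compared against half the total weight (standard discrete AdaBoost).
    Each entry is (subunit, feature_indices, theta, alpha)."""
    vote = sum(alpha * weak_classify(su, feats, theta, sample_subunits)
               for su, feats, theta, alpha in weak_classifiers)
    half = 0.5 * sum(alpha for *_, alpha in weak_classifiers)
    return 1 if vote >= half else -1
```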

Figure 2: SLR Joint Adaboost learning algorithm

6. EXPERIMENTS

We tested the proposed system on 20 different signs from a database donated by another group (http://www.ee.surrey.ac.uk/Personal/R.Bowden/sign) and compared it to whole-sign-based SLR. Every sign is performed 10 times (200 video samples in total), with variations in speed and manner of performance. In all experimental runs, we use a different number of randomly chosen training samples in the range of 2 to 8; the remaining samples are used for testing. We evaluated 6 runs for each number of training samples.

To test the whole-sign-based approach, we adopted the DTW technique, which is the same core distance metric used by the weak classifiers in our boosting framework, making the comparison more realistic. Here all the frames of the sign are taken into account and modeled by all features. For each run we cluster the samples belonging to a given sign class and select the medoid sign using Eq. (2); in classification, we select the class with the minimum distance between its medoid and the testing sample. In the proposed system, if only one classifier fires we select that class as the result; when more than one classifier fires, we select the class with the maximum distance from its threshold, which represents the confidence of that class. The classification accuracy is defined as:

$$\text{accuracy} = \frac{\text{Number of correctly classified samples}}{\text{Total number of testing samples}} \qquad (6)$$
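The multi-class decision rule just described might look as follows in code. It builds on the weak_classify() sketch from Section 5 and interprets the "distance from the threshold" as the margin by which a class's weighted vote exceeds its firing threshold; the names and data layout are assumptions.

```python
def classify_sample(class_models, sample_subunits):
    """Evaluate each per-class boosted classifier; among those that fire,
    return the class whose vote exceeds its threshold by the largest margin
    (used as the confidence). `class_models` maps class label -> list of
    (subunit, feature_indices, theta, alpha) weak classifiers."""
    margins = {}
    for label, weak_classifiers in class_models.items():
        vote = sum(alpha * weak_classify(su, feats, theta, sample_subunits)
                   for su, feats, theta, alpha in weak_classifiers)
        half = 0.5 * sum(alpha for *_, alpha in weak_classifiers)
        if vote >= half:                  # this class's strong classifier fires
            margins[label] = vote - half  # margin above threshold = confidence
    return max(margins, key=margins.get) if margins else None
```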

The average classification results over 6 runs are shown in Table 1. The subunit-based approach outperforms the whole-sign-based approach by about 2.5% on average. Joint learning reduces the number of required training samples while achieving high performance (above 90%) using a subset of discriminative features shared across classes, compared to the full feature set used in the whole-sign-based approach.

Table 1: Multiclass classification average results comparison

#training samples    Whole sign-based    Subunit-based
        2                 0.852              0.864
        3                 0.890              0.922
        4                 0.898              0.929
        5                 0.920              0.943
        6                 0.925              0.931
        7                 0.950              0.974
        8                 0.937              0.966

7. CONCLUSIONS

In this paper we presented a novel boosting framework for SLR. In contrast to previous subunit-based approaches, we apply feature selection to select not only the informative subunits but also the discriminative features within these subunits. SL feature selection is justified by observing that different signs can have the same motion pattern but different shape features, or the same shape features but different position features, and so on. Sharing subunits across sign classes increases the overall performance compared to the whole-sign-based approach, as the weak classifiers are selected to fit the whole dataset. Our approach has the further advantage of being more robust to inaccurate subunit segmentation, due to the way the Adaboost weak classifiers calculate their distances to subunits. Based on the current results, we believe the proposed approach can serve as a basic step towards building scalable SLR systems that require a small number of training samples, operate fast enough, and achieve high accuracy. Our future work will investigate testing with a larger number of sign classes and applying the subunit segmentation algorithm online for real-time classification.

8. REFERENCES

[1] M.-H. Yang, N. Ahuja, and M. Tabb, "Extraction of 2D motion trajectories and its application to hand gesture recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, 24, 2002, 1061-1074.
[2] T. Starner, J. Weaver, and A. Pentland, "Real-time American Sign Language recognition using desk and wearable computer based video," IEEE Trans. Pattern Analysis and Machine Intelligence, 20, 1998, 1371-1375.
[3] F. Jiang, H. Yao, and G. Yao, "Multilayer architecture in Sign Language recognition system," Proc. Int. Conf. on Multimodal Interfaces, 2004.
[4] S. Liddell and R. Johnson, "American Sign Language: The phonological base," Sign Language Studies, 64, 1989, 195-277.
[5] C. Vogler and D. Metaxas, "A framework for recognizing the simultaneous aspects of American Sign Language," Computer Vision and Image Understanding, 81, 2001, 358-384.
[6] M. Yeasin and S. Chaudhuri, "Visual understanding of dynamic hand gestures," Pattern Recognition, 33, 2000, 1805-1817.
[7] B. Bauer and K. Kraiss, "Towards an automatic Sign Language recognition system using subunits," Proc. Int. Gesture Workshop, 2001, 64-75.
[8] G. Fang, X. Gao, W. Gao, and Y. Chen, "A novel approach to automatically extracting basic units from Chinese Sign Language," Proc. Int. Conf. on Pattern Recognition, 2004, 454-457.
[9] J. Han, G. Awad, and A. Sutherland, "Modelling and segmenting subunits for Sign Language recognition based on hand motion analysis," Pattern Recognition Letters, 30, 2009, 623-633.
[10] T. Kadir, R. Bowden, E. J. Ong, and A. Zisserman, "Minimal training, large lexicon, unconstrained sign language recognition," Proc. BMVC, 2004.
[11] L. Zhang et al., "Recognition of sign language subwords based on boosted hidden Markov models," Int. Conf. on Multimodal Interfaces, 2005, 282-287.
[12] G. Awad, J. Han, and A. Sutherland, "A unified system for segmentation and tracking of face and hands in sign language recognition," Proc. Int. Conf. on Pattern Recognition, 2006, 239-242.
[13] C. T. Zahn and R. Z. Roskies, "Fourier descriptors for plane closed curves," IEEE Trans. on Computers, C-21(3), 1972, 269-281.
[14] M. K. Hu, "Visual pattern recognition by moment invariants," IRE Transactions on Information Theory, IT-8, 1962, 179-187.
[15] H. Sakoe and S. Chiba, "A dynamic programming approach to continuous speech recognition," Proc. Int. Congress on Acoustics, Budapest, Hungary, 1971, 20 C-13.
[16] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Computational Learning Theory (EuroCOLT), 1995.
[17] A. Torralba, K. P. Murphy, and W. T. Freeman, "Sharing features: efficient boosting procedures for multiclass object detection," Proc. CVPR, 2004.
