Toward automatic Sign Language recognition from Web3D based scenes
Kabil Jaballah and Mohamed Jemni
High School of Science and Techniques of Tunis, 5 avenue Taha Hussein, BP 56, Bab Mnara, Tunisia
UTIC Research Laboratory
[email protected],
[email protected]
Abstract. This paper describes the development of a 3D continuous sign language recognition system. Since many systems such as WebSign [1], Vsigns [2] and eSign [3] use Web3D standards to generate 3D signing avatars, 3D signed sentences are becoming common. Hidden Markov Models (HMM) are the most widely used method to recognize sign language from video-based scenes, but in our case, since we deal with well-formatted 3D scenes based on the H-Anim and X3D standards, an HMM is an unnecessarily costly double stochastic process. We present a novel approach to sign language recognition based on the Longest Common Subsequence (LCS) method. Our recognition experiments, based on a 500-sign lexicon, reached 99% accuracy.
Keywords: Sign Language, X3D/VRML, Gesture recognition, Web3D scenes, H-Anim, Virtual reality
1 Introduction
Sign language is the natural language of deaf people. It is a visual-gestural language, which makes it different from spoken language although it serves the same function. Sign language is the main way for deaf people to communicate; in this context, many projects have been developed to make information accessible to them. We can mention WebSign [1], which automatically generates a 3D signing avatar from a written text. WebSign is developed by the Research Laboratory of Technologies of Information and Communication (UTIC). Sign language enables deaf people to communicate through a visual reception channel and a gestural transmission channel, which allows this language to convey simultaneous information. Moreover, Cuxac [4] showed that Sign Language (SL) uses iconicity, which enables deaf signers to describe objects or actions while signing. SL is not a universal language; many different languages have evolved regionally, e.g., FSL (French Sign Language) in France or ASL (American Sign Language) in the United States. To sign a word, five parameters may be combined: hand shape, hand orientation, location, hand motion and facial expression.
The grammar of sign language is different from that of spoken language. The structure of a spoken sentence is linear, one word followed by another, whereas in sign language simultaneous structures regularly exist. At the syntactic level, sign language is not as strict as spoken language, and OSV1 is the most commonly used order. This paper presents a novel approach to recognizing Sign Language from Web3D based scenes. Our method is an adaptation of the Longest Common Subsequence (LCS) algorithm, which is commonly used for approximate string matching. Our experiments showed that our system reaches up to 99% accuracy using a dictionary of 500 signed words. The paper is organized as follows. Section 2 summarizes the main aspects of signing avatar systems. In Section 3 we review the main difficulties encountered during the recognition process and illustrate the architecture of our recognition system. In Section 4 we outline what has been done in the field of sign language recognition from video-based scenes. Section 5 presents the main contribution of this paper, which consists in adapting the LCS algorithm to recognize sign language from Web3D based scenes. Experimental results are presented in Section 6. Finally, Section 7 brings out the conclusion.
2 Signing avatar systems
Since the standardization of 3D signing avatar modeling by the Web3D consortium [5], many systems have implemented automatic Sign Language generation from a written text. Web3D signing avatars are modeled according to the H-Anim [6] working group specifications, which provide a precise method to describe the skeleton architecture and hierarchy. The animation of these humanoids is rendered according to the X3D specifications [7], which implement a high-level layer on top of VRML (Virtual Reality Modeling Language) code.
Fig. 1. Architecture of a signing avatar system

1 Object-Subject-Verb: a common order used in sign language
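To make this animation pipeline concrete, the following is a minimal sketch, not taken from the paper: a toy X3D fragment wiring a TimeSensor to an OrientationInterpolator that drives an H-Anim joint, read with Python's standard XML parser. The scene content (node names such as rot_r_shoulder and the key values) is invented for illustration.

  # Sketch: reading the ROUTE graph of a (toy) X3D sign animation.
  import xml.etree.ElementTree as ET

  SCENE = """
  <X3D><Scene>
    <HAnimJoint DEF='r_shoulder'/>
    <TimeSensor DEF='clock' cycleInterval='2.0' loop='false'/>
    <OrientationInterpolator DEF='rot_r_shoulder'
        key='0 0.5 1'
        keyValue='0 0 1 0  0 0 1 0.8  0 0 1 0'/>
    <ROUTE fromNode='clock' fromField='fraction_changed'
           toNode='rot_r_shoulder' toField='set_fraction'/>
    <ROUTE fromNode='rot_r_shoulder' fromField='value_changed'
           toNode='r_shoulder' toField='set_rotation'/>
  </Scene></X3D>
  """

  root = ET.fromstring(SCENE)
  for route in root.iter("ROUTE"):
      # e.g. "clock -> rot_r_shoulder", "rot_r_shoulder -> r_shoulder"
      print(route.get("fromNode"), "->", route.get("toNode"))

In a real scene, the interpolator's key/keyValue pairs hold the sign's keyframes; following the ROUTE chain from each interpolator to its target joint is what the motion detection step of Section 5 relies on.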
3 Related difficulties and proposed system architecture
Although our system is not video-based, we encounter almost the same difficulties. Some of them are related to sign language itself and others are related to the technology used to generate signs. The most relevant problems that we encountered during the recognition process are the following:
• 3D scenes to be parsed may contain formatting mistakes caused by failures to meet the H-Anim and X3D standards.
• In sign language, the same signed word can carry different rotation values for a given joint2 (see the similarity sketch below).
• In sign language there is no explicit indication of the beginning and the end of a gesture; indeed, gesture segmentation is the most difficult task that we have to consider.
Figure 2 shows the architecture of our system, which is based on 3 modules. Every module carries out a part of the recognition process, leading finally to a written text recognized from a 3D scene. The recognition is based on an X3D signed dictionary.
Fig. 2. Proposed system architecture
2 Joint or articulation of a part of the body, e.g. r_shoulder
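The second difficulty above suggests comparing joint rotations by angular distance rather than by raw field values. The paper does not spell out its similarity measure, so the following is only a plausible sketch using the quaternion conversion referenced in [12]: two X3D axis-angle rotations are mapped to unit quaternions and their angular distance is computed.

  import math

  def axis_angle_to_quat(x, y, z, angle):
      """Convert an X3D SFRotation (axis x y z, angle in radians) to a unit quaternion."""
      n = math.sqrt(x * x + y * y + z * z) or 1.0  # guard against a zero axis
      s = math.sin(angle / 2.0)
      return (math.cos(angle / 2.0), x / n * s, y / n * s, z / n * s)

  def rotation_distance(r1, r2):
      """Angular distance in radians between two axis-angle rotations.
      q and -q encode the same rotation, hence the absolute value."""
      q1, q2 = axis_angle_to_quat(*r1), axis_angle_to_quat(*r2)
      dot = abs(sum(a * b for a, b in zip(q1, q2)))
      return 2.0 * math.acos(min(1.0, dot))

  # Two superficially different rotation values of the same joint can be close:
  print(rotation_distance((0, 0, 1, 0.80), (0, 0, 1.01, 0.79)))  # small angle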
4 Related work in the field of sign language recognition
Sign language recognition has been attempted for several years [16]. In 1997, Vogler and Metaxas [8] described a system for continuous ASL recognition based on a 53-sign vocabulary. They performed two experiments, both with 97 test sentences: one without grammar and another with incorporated bigram probabilities. Recognition accuracy ranged from 92.1% up to 95.8% depending on the grammar used. In 1998, Starner [9] proposed an HMM-based recognition system that recognizes 40 words with 91.3% accuracy. These video-based systems used signal and image processing techniques to match and tokenize signs, such as HMMs (Hidden Markov Models), a double stochastic process based on a transition matrix and probabilities. The recognition process can be performed with the Baum-Welch algorithm or the Viterbi algorithm, as described by Rabiner [10]. In our case, as the WebSign dictionary is composed of more than 1200 signed words, probabilistic methods like HMM become costly. Moreover, X3D signs are well formatted and form a rich alphabet; thus, we are no longer dealing with simple states but with 3D rotation values. In this context, the originality of our contribution consists in using string matching and pattern recognition methods, especially an adapted version of the Longest Common Subsequence (LCS) algorithm.
5 Sign language recognition using the LCS method
5.1 Motion detection algorithm
The following algorithm extracts, for every animated joint in a Web3D scene, its rotation values over time. An example of the result is shown in Table 1.

Program Fetch_Rotations (X3D scene)
  For each OrientationInterpolator
    For each TimeSensor
      Find the corresponding ROUTE
      RotationsArray[joint] ← the rotation keyframes routed to that joint
Return RotationsArray

The recognized sequence w is then recovered from the LCS table T by the following back-trace procedure, in which two keyframes x_i and y_j are considered to match when their similarity reaches the accepted threshold:

If sim(x_i, y_j) ≥ threshold Then
  w_k ← x_i; i ← i-1; j ← j-1; k ← k-1
Else If T[i-1, j] > T[i, j-1] Then
  i ← i-1
Else
  j ← j-1
Return w

Table 2. Illustration of the Back-Trace algorithm
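As an illustration of the adapted LCS, here is a minimal Python sketch, not the authors' implementation: equality in the classic dynamic program is replaced by a keyframe similarity test against a threshold, and the matched subsequence is recovered by the back-trace above. The similarity function shown is an assumption (a rotation distance such as the one sketched in Section 3 could be plugged in).

  from typing import Callable, List, Sequence, Tuple

  # A keyframe is abstracted as a tuple of floats, e.g. one joint's
  # axis-angle rotation (x, y, z, angle).
  Keyframe = Tuple[float, ...]

  def default_sim(a: Keyframe, b: Keyframe) -> float:
      # Assumed similarity in (0, 1]: shrinks as Euclidean distance grows.
      d = sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
      return 1.0 / (1.0 + d)

  def lcs_recognize(x: Sequence[Keyframe], y: Sequence[Keyframe],
                    threshold: float,
                    sim: Callable[[Keyframe, Keyframe], float] = default_sim
                    ) -> List[Keyframe]:
      m, n = len(x), len(y)
      # T[i][j] = length of the LCS of x[:i] and y[:j] under thresholded matching
      T = [[0] * (n + 1) for _ in range(m + 1)]
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              if sim(x[i - 1], y[j - 1]) >= threshold:
                  T[i][j] = T[i - 1][j - 1] + 1
              else:
                  T[i][j] = max(T[i - 1][j], T[i][j - 1])
      # Back-trace, mirroring the pseudocode above.
      w: List[Keyframe] = []
      i, j = m, n
      while i > 0 and j > 0:
          if sim(x[i - 1], y[j - 1]) >= threshold:
              w.append(x[i - 1])
              i, j = i - 1, j - 1
          elif T[i - 1][j] > T[i][j - 1]:
              i -= 1
          else:
              j -= 1
      return w[::-1]

  # Toy usage: a scene sequence matched against one dictionary sign.
  scene = [(0, 0, 1, 0.0), (0, 0, 1, 0.4), (0, 0, 1, 0.8)]
  sign = [(0, 0, 1, 0.05), (0, 0, 1, 0.79)]
  print(len(lcs_recognize(scene, sign, threshold=0.9)))  # 2 matched keyframes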
6 Experimental results
Our system was tested using a dictionary of 500 signed words. Figure 4 shows the recognition performance. The first curve shows that the recognition process takes quite a long time, depending on the size of the dictionary and the length of the scene. The second curve shows that recognition accuracy varies with the threshold of accepted similarity between keyframes. As the second curve shows, we can reach 100% accuracy when the scenes we process are generated from our own dictionary. Tests were performed on a machine with 1 GB of RAM and a 1.73 GHz processor.
Fig. 4. Performance of the proposed system
7 Conclusion
In this paper, we brought out the fact that signing avatar systems are becoming increasingly common. We came up with a novel approach to recognizing sign language from scenes generated by this kind of system. We built a system able to detect motion and recognize signed words inside 3D scenes. Our method is an adaptation of the LCS algorithm, commonly used for approximate string matching problems. Experimental results were very encouraging, since we reached 99% accuracy, even if execution time is an aspect that we should work on in the future.
References
1. Jemni, M., Elghoul, O.: "Towards Web-Based automatic interpretation of written text to Sign Language". ICTA'07, Hammamet, Tunisia, April 12-14, 2007.
2. Papadogiorgaki, M., Grammalidis, N., Makris, L., Sarris, N., Strintzis, M.G.: "Vsign Project". Communication (http://vsign.nl/EN/vsignEN.htm), September 20, 2002.
3. Ehrhardt, U., Davies, B.: "A good introduction to the work of the eSIGN project". eSIGN Deliverable D7-2, August 2004.
4. Cuxac, C.: La LSF, les voies de l'iconicité. Ophrys editions, Paris, 2000.
5. Web3D consortium website: http://www.web3d.org
6. Humanoid Animation Working Group: "Specification for a Standard Humanoid: H-Anim 1.1", http://www.h-anim.org/Specifications/H-Anim1.1/
7. Brutzman, D., Daly, L.: X3D: Extensible 3D Graphics for Web Authors. Elsevier, 2007.
8. Vogler, C., Metaxas, D.: "Adapting hidden Markov models for ASL recognition by using three-dimensional computer vision methods". IEEE International Conference on Computational Cybernetics and Simulation, Orlando, FL, October 1997.
9. Starner, T., Weaver, J., Pentland, A.: "Real-time American Sign Language Recognition using desk and wearable computer based video". IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
10. Rabiner, L.: "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proc. of the IEEE, Vol. 77, pp. 257-289, 1989.
12. Euclidean Space by Martin John Baker: http://www.euclideanspace.com/maths/geometry/rotations/conversions/quaternionToEuler/
13. Bergroth, L., Hakonen, H., Raita, T.: "A Survey of Longest Common Subsequence Algorithms". SPIRE, IEEE Computer Society, pp. 39-48, 2000.
14. Torgerson, W.S.: "Multidimensional scaling of similarity". Psychometrika, Springer, pp. 379-393, 1965.
15. Deza, M., Deza, E.: Dictionary of Distances. Elsevier, 2006.
16. Jaballah, K., Jemni, M.: "Automatic Sign Language Recognition using X3D/VRML Animated Humanoids". CVHI 2009, April 2009, Wroclaw, Poland.