4-Camera Model for Sign Language Recognition Using Elliptical Fourier Descriptors and ANN

P.V.V. Kishore MIEEE, M.V.D. Prasad, Ch. Raghava Prasad, R. Rahul
Dept. of E.C.E, K.L. University, Green Fields, Vaddeswaram, Guntur DT, INDIA

Abstract— Sign language recognition (SLR) is a multidisciplinary research area engulfing image processing, pattern recognition and artificial intelligence. The major hurdle for an SLR system is the occlusion of one hand by the other. This results in poor segmentation, and hence the generated feature vectors produce erroneous classifications of signs, leading to a deprived recognition rate. To overcome this difficulty we propose in this paper a 4 camera model for recognizing gestures of Indian sign language. The system performs segmentation for hand extraction, shape feature extraction with elliptical Fourier descriptors, and pattern classification using artificial neural networks with the backpropagation training algorithm. The computed classification rate provides experimental evidence that the 4 camera model outperforms the single camera model.

Keywords— Sign language recognition; 4 camera model; elliptical Fourier descriptors (EFD); artificial neural networks.

I. INTRODUCTION

Sign language is the primary form of communication among hearing-impaired people, with special rules of grammar and context that support the expression of a sign language. There are two main approaches to sign language recognition: image-based and sensor-based. The main advantage of image-based systems is that signers do not need to use complex devices; however, substantial computation is required in the pre-processing stage. Instead of cameras, sensor-based systems use gloves instrumented with sensors; these systems have their own challenges, including the cumbersome gloves worn by the signer. To help people with normal hearing communicate effectively with the deaf and hard of hearing, many systems have been developed to translate various sign languages around the world. Several review articles have been published that discuss such systems [1]-[3].

Sign language, like spoken language, is not confined to a particular region or territory, and it is practiced differently around the world. In the USA it is known as American Sign Language (ASL) [4,5], in Europe [6,7] as British Sign Language (BSL), in Africa [8] as African Sign Language, and in Asia as Arabic Sign Language [9] and Chinese Sign Language [10]. Unlike American and European sign languages, India does not have a standard sign language with the necessary variations. But in the recent past the Ramakrishna Mission Vivekananda University, Coimbatore came up with an Indian sign language dictionary. There are around 2037 signs [11] currently available in Indian Sign Language (ISL). Accordingly, sign language recognition systems are classified into two broad categories: sensor glove based [12] and vision based systems [13,14].

Thad Starner proposed a real time American Sign Language recognition system using wearable computer based video [13], which uses Hidden Markov Models (HMMs) for recognizing continuous American Sign Language. Signs are modeled with four-state HMMs, which have good recognition accuracies. Their system works well but it is not signer independent. M.K. Bhuyan et al. [15] used hand shapes and hand trajectories to recognize static and dynamic hand signs from Indian sign language. They used the concept of object based video abstraction for segmenting the frames into video object planes, where the hand is considered as a video object. Their experimental results show that their system can classify and recognize static and dynamic gestures along with sentences with superior consistency. Rini Akmeliawati et al. [16] proposed a real time Malaysian sign language translation using a colour segmentation technique, with a recognition rate of 90%. Nariman Habili et al. [17] proposed a hand and face segmentation technique using colour and motion cues for the content based representation of sign language video sequences. Yu Zhou and Xilin Chen [18] proposed a signer adaptation method that combines maximum a posteriori estimation and iterative vector field smoothing to reduce the amount of data, and they achieved good recognition rates. Gesture recognition systems implemented with statistical approaches [19], example based approaches [20] and finite state transducers [21] have registered higher translation rates.

In this paper a simple system for converting Indian sign language into voice messages using hand gestures from a 4 camera feed is developed. Features are extracted by applying elliptical Fourier descriptors to the hand gesture shapes in the binary images. Thirty six different gesture signs are considered for this research. A simple neural network is developed for recognizing gestures from the features extracted from the images of the 4 cameras. The recognized signs give the location of the voice commands related to the input features of the image.
II. PREPROCESSING AND SEGMENTATION

RGB images of sign language gestures are captured using four 2-megapixel USB cell phone cameras. The captured images have a resolution of 1024 × 1024 pixels. Higher resolutions delay the execution of the process, hence the resolution of the images is cut down to 256 × 256 pixels. Image acquisition is subject to many environmental conditions, such as light sensitivity, position of the camera and background information. The main reason for using cell phone cameras is to make our system accessible in real-time environments. A normal illumination of around 13-17 lux is maintained for image capture; this is the illumination from an electric white light bulb of 12-15 watts. While capturing the gesture images, the 4 cameras are placed at around 0.8-1.2 meters from the ground and around 1 meter from the person, as shown in Figure 1. The system is designed to virtually recognize the gestures of ISL (Indian Sign Language). The proposed system is very simple and the person is not required to wear any wireless or coloured gloves. The only restrictions are that the person should wear a dark coloured, long sleeve shirt and the background should be dark. The images obtained from the 4 cameras for sign 'A' in Indian sign language are shown in Figure 2.
Fig. 1. Equipment used for capturing gestures of Indian Sign Language using 4 cameras.
Image acquisition noise is filtered using a spatial averaging filter with a 3 × 3 mask, which smooths the image along a single image plane. The 4 sets of grayscale hand images are processed using morphological operations, first dilation and then erosion, on the input RGB images from the 4 cameras as shown in Figure 2. The spatial difference of the dilation and erosion operations then yields the hand segments from all 4 cameras. The contrast of each segmented image is improved using adaptive histogram based contrast enhancement. The final output of these operations is a set of 4 binary images per sign containing the hand segments. The entire process is demonstrated in Figure 3. The boundary pixels describe the hand shapes of the signer in all 4 camera outputs. The elliptical Fourier descriptor is employed to extract the shape outline with a minimum number of pixels without losing shape information [22]. A sketch of this pipeline is given below.

Fig. 2. Images captured from 4 cameras at angles 20 degrees, -20 degrees, 10 degrees and -10 degrees from the centre.
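As an illustration only, the following is a minimal sketch of this preprocessing chain using OpenCV and NumPy. The structuring-element size, CLAHE settings and Otsu thresholding are our assumptions; the paper does not specify these parameters.

```python
import cv2
import numpy as np

def segment_hand(rgb_image):
    """Sketch of the paper's preprocessing chain (assumed parameters):
    resize -> 3x3 averaging filter -> dilation/erosion difference
    -> adaptive histogram equalization -> binary hand segment."""
    img = cv2.resize(rgb_image, (256, 256))        # 1024x1024 -> 256x256
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    smooth = cv2.blur(gray, (3, 3))                # 3x3 spatial averaging filter
    kernel = np.ones((5, 5), np.uint8)             # structuring element (assumed size)
    dilated = cv2.dilate(smooth, kernel)
    eroded = cv2.erode(smooth, kernel)
    gradient = cv2.subtract(dilated, eroded)       # spatial difference of dilation and erosion
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gradient)               # adaptive histogram based contrast enhancement
    _, binary = cv2.threshold(enhanced, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary                                  # one binary hand segment per camera

# Applied to each of the 4 camera frames, this yields the 4 binary images per sign.
```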
III. FEATURE EXTRACTION
Fig. 3. Final segments from the 4 images for a gesture.
The elliptical Fourier descriptors of a closed contour describe a curve in 2D space. Each pixel in the image plane is represented by a complex number, so any shape in an image is defined as

$$v(t) = x(t) + j\,y(t) \qquad (1)$$

where the parameter t is given by arc-length parameterization. To obtain the elliptical Fourier descriptors of a curve, a Fourier expansion of the shape in the above equation is needed. From a geometric point of view, a simple ellipse is modeled by the equations

$$x(t) = x_0 + A\cos t, \qquad y(t) = y_0 + B\sin t \qquad (2)$$

where A is the semi-major axis oriented in the horizontal axis direction and B is the semi-minor axis oriented in the vertical axis direction. This geometric interpretation of the ellipse gives better visualization. For the k-th ellipse (harmonic), Eq. (2) becomes

$$x_k(t) = a_k\cos kt + b_k\sin kt, \qquad y_k(t) = c_k\cos kt + d_k\sin kt \qquad (3)$$

which is equivalently a rotated ellipse with semi-axes $A_k$, $B_k$, rotation angle $\theta_k$ and phase shift $\varphi_k$:

$$\begin{bmatrix} x_k(t) \\ y_k(t) \end{bmatrix} = \begin{bmatrix} \cos\theta_k & -\sin\theta_k \\ \sin\theta_k & \cos\theta_k \end{bmatrix} \begin{bmatrix} A_k\cos(kt+\varphi_k) \\ B_k\sin(kt+\varphi_k) \end{bmatrix} \qquad (4)$$

From the above equation the Fourier coefficients of the k-th harmonic $(a_k, b_k, c_k, d_k)$ can be written as

$$\begin{aligned} a_k &= A_k\cos\theta_k\cos\varphi_k - B_k\sin\theta_k\sin\varphi_k \\ b_k &= -A_k\cos\theta_k\sin\varphi_k - B_k\sin\theta_k\cos\varphi_k \\ c_k &= A_k\sin\theta_k\cos\varphi_k + B_k\cos\theta_k\sin\varphi_k \\ d_k &= -A_k\sin\theta_k\sin\varphi_k + B_k\cos\theta_k\cos\varphi_k \end{aligned} \qquad (5)$$
where $A_k$, $B_k$, $\theta_k$ and $\varphi_k$ are more understandable parameters of the same ellipse, and the relationship between these two sets of parameters can be computed. The rotational angle $\theta_k$ and the phase shift $\varphi_k$ are overall angles. The plot of the different Fourier coefficients along with a view of the hand contours is shown in Figure 4. These 20 descriptors are so unique with respect to a particular video that they do not change their values even when the video is deformed by rotation, frame-size variations and object rotation variations. The features for the gesture 'A' extracted using EFD from the segmented 4 camera inputs are shown in Figure 4.

Fig. 4. Feature vector for sign 'A' with 20 descriptors for each camera input.
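To make Eqs. (1)-(5) concrete, here is a minimal numerical sketch that approximates the coefficients $(a_k, b_k, c_k, d_k)$ by discrete Fourier projections of the boundary pixels. The uniform arc-length assumption and the magnitude-based 20-value descriptor per camera are our assumptions, not details specified in the paper.

```python
import numpy as np

def elliptic_fourier_coeffs(contour, n_harmonics=20):
    """Approximate EFD coefficients (a_k, b_k, c_k, d_k) of a closed contour.

    contour: (N, 2) array of boundary (x, y) pixels, assumed roughly
    uniformly spaced along the arc (the parameter t of Eq. (1))."""
    x, y = contour[:, 0].astype(float), contour[:, 1].astype(float)
    n = len(x)
    t = 2.0 * np.pi * np.arange(n) / n           # arc-length parameterization
    coeffs = []
    for k in range(1, n_harmonics + 1):
        cos_kt, sin_kt = np.cos(k * t), np.sin(k * t)
        a_k = (2.0 / n) * np.sum(x * cos_kt)     # x-projections, cf. Eq. (3)
        b_k = (2.0 / n) * np.sum(x * sin_kt)
        c_k = (2.0 / n) * np.sum(y * cos_kt)     # y-projections, cf. Eq. (3)
        d_k = (2.0 / n) * np.sum(y * sin_kt)
        coeffs.append((a_k, b_k, c_k, d_k))
    return np.asarray(coeffs)                    # shape (n_harmonics, 4)

# One rotation-robust 20-value descriptor per camera (our assumption) is the
# per-harmonic magnitude: np.sqrt((coeffs ** 2).sum(axis=1)).
```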
IV. PATTERN CLASSIFICATION

The concept of an automated sign interpreter boils down to one final task: pattern classification, most commonly known as pattern recognition. The final stage of the system is the classification of the different signs and the generation of voice messages corresponding to each correctly classified sign. Generally, pattern classification is the task of separating a feature matrix into several sections and classifying the objects into classes defined on these sections. The problem of classifier design is to find an optimal mapping M from the feature matrix $f_{xvect}$ into the decision vector D, given by

$$M : f_{xvect} \rightarrow D.$$

An artificial neural network [1-3] is employed to accomplish the task of recognizing and classifying gesture signs. The neural network has 144 neurons in the input layer and 36 neurons in the output layer, along with 80 neurons in its hidden layer. This particular neural network object can take 144 input images and classify them into 36 signs. The size of our target matrix is 36 × 80. Each row in the target matrix represents a sign. The neural network object created is a feed-forward backpropagation network as shown in Figure 5. The weights and biases for each neuron are initialized randomly and the network is ready for training.
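The following is a minimal NumPy sketch of such a network, using the sizes and training constants from Table 2.1 (144-80-36 layers, sigmoid activations, learning rate 0.25, momentum 0.9). The initialization scale, squared-error loss and batch handling are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Layer sizes and training constants taken from Table 2.1.
W1, b1 = rng.normal(0.0, 0.1, (144, 80)), np.zeros(80)
W2, b2 = rng.normal(0.0, 0.1, (80, 36)), np.zeros(36)
lr, mom = 0.25, 0.9
vW1, vb1 = np.zeros_like(W1), np.zeros_like(b1)
vW2, vb2 = np.zeros_like(W2), np.zeros_like(b2)

def train_step(X, T):
    """One backpropagation step with momentum.
    X: (n, 144) feature vectors; T: (n, 36) one-hot sign targets."""
    global W1, b1, W2, b2, vW1, vb1, vW2, vb2
    H = sigmoid(X @ W1 + b1)                 # hidden layer (80 neurons)
    Y = sigmoid(H @ W2 + b2)                 # output layer (36 signs)
    dY = (Y - T) * Y * (1.0 - Y)             # output delta (squared-error loss)
    dH = (dY @ W2.T) * H * (1.0 - H)         # backpropagated hidden delta
    n = len(X)
    vW2 = mom * vW2 - lr * (H.T @ dY) / n; W2 += vW2
    vb2 = mom * vb2 - lr * dY.mean(axis=0); b2 += vb2
    vW1 = mom * vW1 - lr * (X.T @ dH) / n; W1 += vW1
    vb1 = mom * vb1 - lr * dH.mean(axis=0); b1 += vb1
    return 0.5 * np.mean((Y - T) ** 2)       # mean squared error

# Training loop: repeat train_step until the MSE falls below the 10e-4 tolerance.
```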
Fig. 5. ANN used for training the feature matrix.

V. RESULTS AND DISCUSSION

The evaluation of the sign language interpreter is carried out by calculating the recognition rate, formulated according to the following equation:

$$\Re = \frac{\text{Correctly Classified Signs}}{\text{Total Signs}} \times 100 \qquad (6)$$
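As a worked example using the totals reported in Table 2.1: with 274 of 288 samples correctly classified, ℜ = (274/288) × 100 ≈ 95.1%.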
The recognition percentage ℜ is a parameter used by almost all researchers to rate the performance of the sign language interpretation systems developed across the world. To compute ℜ, we used RGB images of Indian sign language [22] alphabets and numbers from 1 to 10, a total of 36 signs, from 4 cameras with 4 different orientations and two different signers. The total number of sign images is 288, of which 144 are used for training and the remaining 144 are used for testing, as discussed in the previous section. Table 2.1 shows the details of the neural network used as a gesture classifier with simple backgrounds. The recognition rate stands at 95.1% when tested with 288 samples.
TABLE 2.1: DETAILS OF GESTURE CLASSIFICATION USING ANN WITH IMAGES

Number of Input Neurons: 144
Number of Hidden Neurons: 80
Number of Output Neurons: 36
Activation Function: Sigmoid
Learning Rate: 0.25
Momentum Factor: 0.9
Error Tolerance: 10e-4
Number of Training Samples Used: 144
Number of Testing Samples Used: 144

Data      Total Number of Samples   Correctly Recognized Samples   Performance Rate of Neural Network (%)
Training  144                       144                            100
Testing   144                       130                            90.2
Total     288                       274                            95.1
TABLE 2.2: RECOGNITION RATES OF V2MI WITH IMAGES COMPARED WITH METHOD IN [23]
The mean square error versus epoch graph is shown in Figure 6 for a 4 camera model to recognize gestures of Indian sign in language. The table 2.2 shows the recognition rates percentage for individual alphabets and numbers. Interestingly, this work compared the results with a recognition system [23] for Indian Sign Language on sign images using Gabor Transform for feature extraction and Backpropagation based artificial neural network.
Fig.6. ANN Training Graph for 144 sample training
Table 2.2 gives the average total recognition rate for the proposed 4 camera system with images, close to 95.1%, which is superior to the result obtained by the authors in [23]. The only difference in the approach of [23] compared to the method discussed in this paper is the amount of data samples used for training. There is a tradeoff between processing power and recognition rate in the case of a sign language recognition system.
TABLE 2.2: RECOGNITION RATES OF V2MI WITH IMAGES COMPARED WITH METHOD IN [23]

Sign    Recognition Rate (%), Single Camera with Gabor Features + ANN [23]    Recognition Rate (%), 4 Camera Model Using EFD + ANN
A       100.00      100.00
B       100.00      93.33
C       100.00      100.00
D       100.00      100.00
E       100.00      93.33
F       93.33       100.00
G       93.33       100.00
H       86.67       100.00
I       93.33       100.00
J       100.00      100.00
K       93.33       100.00
L       93.33       100.00
M       86.67       93.33
N       100.00      100.00
O       100.00      100.00
P       66.67       100.00
Q       66.67       100.00
R       100.00      100.00
S       100.00      93.33
T       77.78       93.33
U       66.67       86.67
V       66.67       93.33
W       77.78       93.33
X       66.67       93.33
Y       66.67       93.33
Z       88.89       80.00
1       100.00      100.00
2       100.00      100.00
3       100.00      100.00
4       100.00      93.33
5       100.00      93.33
6       100.00      93.33
7       77.78       86.67
8       77.78       100.00
9       100.00      100.00
10      66.67       100.00
Total   90.12       95.1

VI. CONCLUSION

By applying gradient based methods, the hand and head shapes are segmented. This segmentation procedure handles camera blur and brightness variations during the image acquisition process well. The feature vector is extracted from the hand and head segments using elliptical Fourier descriptors. The classifier is trained on different input combinations of the feature vector with the ANN backpropagation algorithm. Testing is performed with different combinations of gestures for image signs. The average recognition rate for the 4 camera model is around 95.1%, which is far superior to that produced in [23] with a single camera using the Gabor transform. The 4 camera model for sign language has a promising future in terms of database creation.
REFERENCES

[1] Kishore, P.V.V., Sastry, A.S.C.S., Kartheek, A., "Visual-verbal machine interpreter for sign language recognition under versatile video backgrounds," 2014 First International Conference on Networks & Soft Computing (ICNSC), pp. 135-140, 19-20 Aug. 2014.
[2] Kishore, P.V.V., Rajesh Kumar, P., 2012. A Video Based Indian Sign Language Recognition System (INSLR) Using Wavelet Transform and Fuzzy Logic. International Journal of Engineering & Technology (0975-4024), 4(5).
[3] Kishore, P.V.V., Rajesh Kumar, P., "Segment, Track, Extract, Recognize and Convert Sign Language Videos to Voice/Text," International Journal of Advanced Computer Science & Applications 3(6), 2012, pp. 120-132.
[4] ASL corpus: .
[5] Christopoulos, C., Bonvillian, J., 1985. Sign language. Journal of Communication Disorders 18, 1-20.
[6] Atherton, M., 1999. Welsh today BSL tomorrow. In: Deaf Worlds 15(1), pp. 11-15.
[7] Engberg-Pedersen, E., 2003. From pointing to reference and predication: pointing signs, eye gaze, and head and body orientation in Danish Sign Language. In: Kita, Sotaro (Ed.), Pointing: Where Language, Culture, and Cognition Meet. Erlbaum, Mahwah, NJ, pp. 269-292.
[8] Nyst, V., 2004. Verbs of motion in Adamorobe Sign Language. Poster. In: TISLR 8, Barcelona, September 30-October 2. Programme and Abstracts (Internat. Conf. on Theoretical Issues in Sign Language Research; 8), pp. 127-129.
[9] Abdel-Fattah, M.A., 2005. Arabic sign language: a perspective. Journal of Deaf Studies and Deaf Education 10(2), 212-221.
[10] Masataka, N. et al., 2006. Neural correlates for numerical processing in the manual mode. Journal of Deaf Studies and Deaf Education 11(2), 144-152.
[11] Indian Sign Language, .
[12] Gaolin Fang and Wen Gao, "Large Vocabulary Continuous Sign Language Recognition Based on Transition-Movement Models," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 37, No. 1, January 2007, pp. 1-9.
[13] T. Starner and A. Pentland, "Real-Time American Sign Language Recognition from Video using Hidden Markov Models," Technical Report No. 375, MIT Media Laboratory Perceptual Computing Section, 1995.
[14] Ming-Hsuan Yang and Narendra Ahuja, "Extraction of 2D Motion Trajectories and its Application to Hand Gesture Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 8, August 2002, pp. 1061-1074.
[15] M.K. Bhuyan and P.K. Bora, "A Framework of Hand Gesture Recognition with Applications to Sign Language," Annual India Conference, IEEE, pp. 1-6.
[16] Rini Akmeliawati, Melanie Po-Leen Ooi and Ye Chow Kuang, "Real-Time Malaysian Sign Language Translation Using Colour Segmentation and Neural Network," IEEE Instrumentation and Measurement Technology Conference Proceedings, Warsaw, Poland, 2006, pp. 1-6.
[17] Nariman Habili, Cheng Chew Lim and Alireza Moini, "Segmentation of the Face and Hands in Sign Language Video Sequences Using Color and Motion Cues," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 8, 2004, pp. 1086-1097.
[18] Yu Zhou and Xilin Chen, "Adaptive sign language recognition with Exemplar extraction and MAP/IVFS," IEEE Signal Processing Letters, Vol. 17, No. 3, March 2010, pp. 297-300.
[19] Och, J., Ney, H., 2002. Discriminative training and maximum entropy models for statistical machine translation. In: Annual Meeting of the Assoc. for Computational Linguistics (ACL), Philadelphia, PA, pp. 295-302.
[20] Sumita, E., Akiba, Y., Doi, T., et al., 2003. A Corpus-Centered Approach to Spoken Language Translation. Conf. of the Europ. Chapter of the Assoc. for Computational Linguistics (EACL), Budapest, Hungary, pp. 171-174.
[21] Casacuberta, F., Vidal, E., 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics 30(2), 205-225.
[22] F.P. Kuhl, C.R. Giardina, 1982. Elliptic Fourier features of a closed contour. Computer Graphics and Image Processing 18: 236-258.
[23] Kishore, P.V.V., "Conglomeration of Hand Shapes and Texture Information for Recognizing Gestures of Indian Sign Language Using Feed Forward Neural Networks," International Journal of Engineering and Technology (IJET), ISSN: 0975-4024, Vol. 5, No. 5, Oct-Nov 2013, pp. 3742-3756.