GIST Descriptors for Sign Language Recognition: An Approach Based on Symbolic Representation
H.S. Nagendraswamy, B.M. Chethana Kumara(✉), and R. Lekha Chinmayi
Department of Studies in Computer Science, University of Mysore, Mysore 570006, India
[email protected], {chethanbm.research, lekha.2405}@gmail.com
Abstract. This paper presents an approach for recognizing signs made by hearing impaired people at the sentence level. The signs are captured in the form of video and each frame is processed to efficiently extract sign information in order to model the sign and recognize instances of new test signs. Low-dimensional global "gist" descriptors are used to capture sign information from every frame of a sign video. K-means clustering is used to choose a fixed number of frames that are discriminative enough to distinguish between signs. Selecting a fixed number of frames also helps to deal with the unequal number of frames among instances of the same sign made by different signers and reduces the complexity of subsequent processing. Further, we exploit the concept of symbolic data analysis to effectively represent a sign. A fuzzy trapezoidal membership function is used to establish the similarity between a test and a reference sign, and a nearest neighbour classification technique is used to recognize the given test sign. A considerably large database of signs (UoM-ISL) is created and extensive experimentation is conducted on this database to study the efficacy of the proposed methodology. The experimental results are found to be encouraging.

Keywords: Gist descriptor · Sign language · Symbolic representation · Video sequence
1 Introduction
Technology is becoming pervasive and is profoundly changing modern life. It increases convenience, reduces fraud, removes middlemen, and provides access to information, in addition to improving living conditions, comfort and health. However, this progress has not yet reached the level where a sophisticated set of tools has emerged to support disadvantaged people in improving their daily life. People with disabilities are really in need of tools that make them less dependent and allow them to lead an independent life with confidence and privacy. With this motivating factor of improving the life of people with disabilities, specifically focusing on hearing and speech impaired people, research work can be taken up to develop an assistive interactive tool. Such an interactive tool is expected to make them confident and capable of interacting with society. Hearing impaired people communicate using signs by moving their hands and gesturing with their faces, which are
not understood by most of us. Hence, there is a huge communication gap between such disadvantaged people and others. Hearing people would need the assistance of another person who understands sign language to help them communicate with hearing impaired people by translating the signs. A professional interpreter is needed especially in formal settings, such as official matters and discussions, to bridge the communication gap between a hearing impaired person and others. However, this kind of service not only makes the disabled person dependent on such professionals, it is also rather expensive and impractical for the day-to-day communicative requirements of the people. Therefore, replacing a trained professional with a software tool for automatic recognition and translation of a sign language into a vocal language would go a long way in helping hearing impaired people to integrate and communicate with society, and would to an extent help them to be less dependent on trained professionals. Such an automated tool would acquire gestures, analyze them, recognize them and then produce equivalent sentences in a vocal language. Similar to vocal languages, sign languages have evolved over time by the addition and deletion of signs for words. Signs may fall out of use because they become outdated or gain a negative perception, while new signs are introduced due to technological and societal changes. In view of this, over the last two decades several researchers have attempted to standardize sign languages and a number of models have been devised for recognizing signs. Some of the techniques proposed by the research community, which gained importance due to their performance, are Hidden Markov Models (HMM) and Electromyography (EMG) segmentation [1], localized contour sequences [17], the VU-Kinect block to track 3D objects [17], Leap-device-based tracking [16], sensor-based glove techniques [6, 12, 15, 17, 20, 24, 25, 27, 33], size functions [24], moment-based size functions [13], Fourier descriptors [28], Hu moments [14], principal axes of inertia [10], the Conditional Random Field (CRF) model [22], and Ichetrichef moments [6]. However, the research works reported for sign language recognition have addressed the task either at the finger-spelling level [2, 5, 32] or at the word level [18, 24, 31]. Signs made by hearing impaired people are very abstract and difficult to capture in terms of finger spelling or words alone. Hence, there is a need to analyze signs at the simple-sentence level and to devise a method that captures sign information at the sentence level and provides a robust representation for recognition. With this backdrop, we propose a sign language recognition technique at the sentence level in this work. We use GIST descriptors to capture the signer's hand movements, facial expressions and gestures while making a sign. From the literature survey, we also understand that patterns can be better described in terms of symbolic data rather than crisp data [4]. It has also been shown that pattern recognition methods based on symbolic data [7, 8, 10, 11, 21] outperform pattern recognition methods based on crisp data. Therefore, we explore the suitability of interval-valued symbolic data [9, 10] for effective representation of signs, incorporating intra-class variations due to different signers and different instances.
2 Proposed Method
The proposed sign language recognition technique exploits the concept of GIST features to extract texture information from every frame in the sign video. In order to address
the problem of an unequal number of frames for a sign due to different signers, it is proposed to find a fixed number of frames for each video through clustering. The K-means clustering algorithm is used to obtain a fixed number of clusters of frames from a given video. The frame in a cluster that possesses the highest similarity with all the other frames of that cluster is considered its representative. Thus, every sign is characterized by a fixed number of frames that are discriminative enough to distinguish different signs for the purpose of recognition. Feature vectors of different instances of the same sign are aggregated to form an interval-valued feature vector. Thus, every sign is represented in terms of an interval-valued symbolic feature vector to make the system more effective and robust. A fuzzy trapezoidal membership function [9, 29] is used to compute the similarity between the test sign feature vector and the reference sign feature vectors. A nearest neighbor classifier is used to recognize a given test sign as one among the reference signs in the knowledgebase, and the associated text is displayed. The following sections elaborate the proposed methodology in detail.

2.1 GIST Feature Extraction

GIST refers to the meaningful information that an observer can identify from a glimpse at a scene [3, 26]. The gist description usually includes the semantic label of the scene, a few objects and their surface characteristics [26], as well as the spatial layout and the semantic properties related to the function of the scene. Therefore, a model of scene gist includes a description of the semantic information that human observers comprehend and infer about the scene [3, 23]. GIST descriptors provide a low-dimensional representation of an image that contains enough information to identify the scene. The GIST descriptors introduced in [22] represent the dominant spatial structure of the scene through a set of perceptual dimensions, and are computed by analyzing spatial frequency and orientation. Intuitively, GIST summarizes the gradient information (scale and orientation) for different parts of an image, which provides a rough description (the gist) of the scene. The GIST descriptor for a given image (frame) is computed as follows:

1. The input image is convolved with 32 Gabor filters at 4 scales and 8 orientations to produce 32 feature maps of the same size as the input image.
2. Each feature map is divided into 16 regions (a 4 × 4 grid) and the feature values within each region are aggregated.
3. The 16 aggregated values of all 32 feature maps are concatenated to produce a 512-dimensional (16 × 32) GIST descriptor.

The low-level feature channels of the GIST descriptor include intensity (0–255), orientation (0°, 45°, 90° and 135°) and colour at several sub-channels. Each sub-channel is associated with nine pyramidal representations of filter outputs, and feature maps are produced by performing a center-surround operation between filter outputs at four different scales [30]. Figure 1 shows an instance of GIST feature extraction from a frame of a sign video. Figure 2 shows a few example frames with their corresponding feature maps for the sign I want Coffee, and Table 1 presents the GIST features extracted from these frames.
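A minimal Python sketch of the three steps above is given below, assuming a grayscale working copy of each frame and using skimage/scipy for the Gabor filter bank and convolution. The chosen Gabor frequencies, the 128 × 128 working size and mean-pooling as the aggregation are illustrative assumptions rather than the exact settings of our implementation.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.color import rgb2gray
from skimage.filters import gabor_kernel
from skimage.transform import resize

def gist_descriptor(frame, n_scales=4, n_orientations=8, grid=4):
    """Compute a 512-D GIST-like descriptor (4 scales x 8 orientations x 4x4 grid).

    `frame` is an RGB or grayscale image as a NumPy array. The Gabor
    frequencies below are illustrative; the original GIST filter bank
    uses a specific multi-scale parameterization.
    """
    img = rgb2gray(frame) if frame.ndim == 3 else frame.astype(float)
    img = resize(img, (128, 128), anti_aliasing=True)    # fixed working size (assumption)

    features = []
    for s in range(n_scales):
        frequency = 0.25 / (2 ** s)                      # coarser with each scale
        for o in range(n_orientations):
            theta = o * np.pi / n_orientations
            kernel = np.real(gabor_kernel(frequency, theta=theta))
            fmap = np.abs(fftconvolve(img, kernel, mode='same'))

            # Aggregate the filter response inside each cell of a 4x4 grid
            h, w = fmap.shape
            gh, gw = h // grid, w // grid
            for r in range(grid):
                for c in range(grid):
                    cell = fmap[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
                    features.append(cell.mean())

    return np.asarray(features)                          # 4 * 8 * 16 = 512 values
```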
Fig. 1. GIST feature extraction from the frames of a given sign video
Fig. 2. (a)–(d) show the GIST descriptors of four different frames of the sign I want Coffee.

Table 1. GIST feature values extracted from four different frames of the sign I want Coffee
| Frames of sign | GIST feature 1 | 2 | 3 | … | 510 | 511 | 512 |
| f1 | 0.00221 | 0.00223 | 0.00227 | … | 0.00134 | 0.00132 | 0.00132 |
| f2 | 0.00220 | 0.00222 | 0.00225 | … | 0.00137 | 0.00135 | 0.00134 |
| f3 | 0.00217 | 0.00221 | 0.00223 | … | 0.00136 | 0.00128 | 0.00135 |
| f4 | 0.00216 | 0.00218 | 0.00216 | … | 0.00134 | 0.00121 | 0.00130 |
2.2 Selection of Key Frames

The same sign made by the same signer at different instances of time, or by different signers, may contain an unequal number of frames due to the speed at which the signs are made and captured. Also, adjacent frames in a sign video may not differ significantly in their content. Therefore, it is appropriate to eliminate frames that do not differ significantly from previous frames and to select a fixed number of frames for a given sign video, sufficient to capture the sign information irrespective of the signer or instance. In order to address the problem of selecting key-frames from a sign video, the K-means clustering algorithm is used. The value of K is chosen empirically after conducting several experiments. Once the K clusters are obtained for each sign, each cluster of frames is further processed to select a cluster-representative frame: in each cluster, the frame that possesses the maximum similarity with all the frames of that cluster is considered the key-frame of the cluster. Thus, every sign is characterized by K key-frames, as sketched below.
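The following sketch illustrates this key-frame selection, assuming the GIST descriptors of all frames of one video are available as a matrix and using scikit-learn's KMeans. Picking the cluster representative (medoid) by cosine similarity is an assumption; the text above only requires the frame most similar to the rest of its cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(frame_features, k=50, random_state=0):
    """Select k key-frames from a sign video.

    `frame_features` is an (n_frames, 512) array of GIST descriptors.
    Frames are grouped into k clusters and, for each cluster, the medoid
    (the frame with the highest total similarity to the other frames in
    that cluster) is kept as the key-frame.
    """
    X = np.asarray(frame_features, dtype=float)
    labels = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit_predict(X)

    key_frame_indices = []
    for c in range(k):
        members = np.where(labels == c)[0]
        feats = X[members]
        # cosine similarity between every pair of frames in the cluster (assumed measure)
        normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
        sim = normed @ normed.T
        medoid = members[np.argmax(sim.sum(axis=1))]
        key_frame_indices.append(int(medoid))

    return sorted(key_frame_indices)
```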
2.3 Sign Representation

In addition to the problem of an unequal number of frames for the same sign due to different signers, or due to the same signer at different instances of time, the instances of a sign also vary in their content. It is therefore necessary to capture such intra-class variations effectively to produce a more robust representation of a sign. It is understood from the literature survey that an interval-valued symbolic representation can be explored to provide such a robust representation. Therefore, in order to capture intra-class variations among the various instances of the same sign, the feature vectors of the corresponding key-frames of all the instances of the sign are aggregated: the min() and max() operations are used to find the minimum and maximum feature values, which form an interval-valued feature vector. The derivation of the interval-valued symbolic feature vector representing a sign in the knowledgebase is described as follows.

Let $S = \{S_1, S_2, \ldots, S_n\}$ be the n signs considered by the system for recognition, and let $S_i^1, S_i^2, \ldots, S_i^t$ be the t instances of a sign $S_i$ made by the signers at different instances of time. Let $F_i^k = \{f_{i1}^k, f_{i2}^k, \ldots, f_{im}^k\}$ be the m frames chosen for the video of the kth instance of a sign $S_i$, where $f_{ij}^k = (f_{ij1}^k, f_{ij2}^k, \ldots, f_{ijl}^k)$ is the feature vector representing the jth frame of the kth instance of $S_i$, and l is the number of features ($l = 512$). From the t instances of a sign we randomly choose $t/2$, $3t/5$ and $2t/5$ instances respectively for the (50:50), (60:40) and (40:60) percentages of training and testing. In each experiment, the feature values of the samples chosen randomly for training are aggregated to form the interval-valued symbolic feature vector, which represents the sign in the knowledgebase more effectively.
Let q be the number of training samples randomly selected for a sign $S_i$, and let $f_{ij}^1, f_{ij}^2, \ldots, f_{ij}^q$ be the feature vectors representing the jth frame of these q training samples. The minimum and maximum values of the first feature are computed as

$f_{ij1}^{-} = \min\left(f_{ij1}^{1}, f_{ij1}^{2}, \ldots, f_{ij1}^{q}\right), \qquad f_{ij1}^{+} = \max\left(f_{ij1}^{1}, f_{ij1}^{2}, \ldots, f_{ij1}^{q}\right)$

Similarly, we compute the minimum and maximum values of all the features and form the intervals $\left[f_{ijd}^{-}, f_{ijd}^{+}\right]$, $d = 1, 2, \ldots, l$. Thus, the aggregated jth-frame reference feature vector representing a particular sign $S_i$ is given by

$RF_{ij} = \left\{\left[f_{ij1}^{-}, f_{ij1}^{+}\right], \left[f_{ij2}^{-}, f_{ij2}^{+}\right], \ldots, \left[f_{ijl}^{-}, f_{ijl}^{+}\right]\right\}$
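A minimal sketch of this min()/max() aggregation is shown below; it assumes the key-frame descriptors of each training instance are stacked as an (m × l) array, with m = 50 key-frames and l = 512 GIST features.

```python
import numpy as np

def build_interval_representation(instances):
    """Aggregate several training instances of the same sign into interval features.

    `instances` is a list of q arrays, each of shape (m, l): the GIST feature
    vectors of the m key-frames of one training instance. Returns an
    (m, l, 2) array where [..., 0] holds the per-feature minimum and [..., 1]
    the maximum, i.e. the interval [f-, f+] for every feature of every key-frame.
    """
    stack = np.stack([np.asarray(inst, dtype=float) for inst in instances])  # (q, m, l)
    lower = stack.min(axis=0)
    upper = stack.max(axis=0)
    return np.stack([lower, upper], axis=-1)                                 # (m, l, 2)
```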
Table 2 shows an example jth-frame feature vector of four instances of a particular sign I want Coffee and its interval-valued representation.

Table 2. Interval-valued type feature vector representing a particular frame of the sign I want Coffee.

| Instances of sign | GIST feature 1 | 2 | 3 | … | 510 | 511 | 512 |
| S1 | 0.00238 | 0.00234 | 0.00222 | … | 0.00122 | 0.00121 | 0.00123 |
| S2 | 0.00238 | 0.00235 | 0.00224 | … | 0.00104 | 0.00104 | 0.00122 |
| S3 | 0.00240 | 0.00236 | 0.00223 | … | 0.00132 | 0.00125 | 0.00122 |
| S4 | 0.00243 | 0.00237 | 0.00228 | … | 0.00132 | 0.00125 | 0.00127 |
| Interval | [0.00238, 0.00243] | [0.00234, 0.00237] | [0.00222, 0.00228] | … | [0.00104, 0.00132] | [0.00104, 0.00125] | [0.00122, 0.00127] |
From Table 2, we can observe that the feature values extracted from the frames of the four instances of the sign "I want Coffee" vary significantly. Hence, an interval-valued representation is found to be appropriate for capturing such variations and for providing a robust representation of a sign.

2.4 Matching and Recognition

In order to recognize a given test sign made by a signer, the video sequence of the test sign is processed to obtain frames, and features are extracted from each frame as discussed in the previous section. The extracted features are organized in a sequence to represent the test sign. Since the test sign involves only one instance, it is represented in the form of a crisp feature vector. Recognition is accomplished by comparing the test sign feature vector with all the reference sign feature vectors stored in the knowledgebase. A similarity value is computed through this process, the reference sign that possesses the maximum similarity with the test sign is chosen, and the text associated with this sign is displayed. The similarity measure of [19] is used for comparing a reference sign feature vector with the test sign feature vector as follows.

Let $TF_j = (tf_{j1}, tf_{j2}, \ldots, tf_{jl})$ be the crisp feature vector representing the jth frame of a test sign and let $RF_j = \{[f_{j1}^{-}, f_{j1}^{+}], [f_{j2}^{-}, f_{j2}^{+}], \ldots, [f_{jl}^{-}, f_{jl}^{+}]\}$ be the interval-valued symbolic feature vector representing the jth frame of a reference sign. The similarity between the test and the reference sign with respect to the dth feature of the jth frame is computed using the fuzzy trapezoidal membership function as

$SIM\left(tf_{jd}, \left[f_{jd}^{-}, f_{jd}^{+}\right]\right) = \begin{cases} 1 & \text{if } f_{jd}^{-} \le tf_{jd} \le f_{jd}^{+} \\ \mu(tf_{jd}) & \text{if } tf_{jd} \text{ lies within the support of the trapezoid} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$

If the value of $tf_{jd}$ lies within the interval $[f_{jd}^{-}, f_{jd}^{+}]$, the similarity value is 1; otherwise the similarity value $\mu(tf_{jd})$ depends on the extent to which $tf_{jd}$ is close to either the lower bound $f_{jd}^{-}$ or the upper bound $f_{jd}^{+}$, and if $tf_{jd}$ does not lie within the support of the trapezoidal membership function the similarity value is zero. The similarity between the reference and the test sign with respect to the jth frame is given by

$SIM_j(TF, RF) = \frac{1}{l} \sum_{d=1}^{l} SIM\left(tf_{jd}, \left[f_{jd}^{-}, f_{jd}^{+}\right]\right) \qquad (2)$
The total similarity between the reference and the test sign due to all the frames is computed as

$SIM(TF, RF) = \frac{1}{L} \sum_{j=1}^{L} SIM_j(TF, RF) \qquad (3)$

where L is the number of frames used to represent the reference and the test sign, which is 50. The nearest neighbor classification technique is then used to recognize the given test sign as one among the known signs stored in the sign knowledgebase.
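The following sketch mirrors Eqs. (1)–(3) and the nearest-neighbour decision. The width `delta` of the trapezoid's support beyond the interval is an assumed parameter (the exact membership parameters follow [9, 29] and are not fixed here), and `knowledge_base` is a hypothetical mapping from sign labels to their interval-valued representations.

```python
import numpy as np

def feature_similarity(value, low, high, delta=0.001):
    """Eq. (1)-style similarity of one crisp test feature to an interval.

    Similarity is 1 inside [low, high], falls off linearly over a margin
    `delta` on either side (the trapezoid's slopes), and is 0 beyond it.
    The margin value is an assumption.
    """
    if low <= value <= high:
        return 1.0
    if low - delta <= value < low:
        return (value - (low - delta)) / delta
    if high < value <= high + delta:
        return ((high + delta) - value) / delta
    return 0.0

def sign_similarity(test_sign, reference_sign, delta=0.001):
    """Average similarity over all features and frames (Eqs. (2) and (3)).

    `test_sign` is an (L, l) array of crisp key-frame features;
    `reference_sign` is an (L, l, 2) interval-valued representation.
    The normalization does not affect the nearest-neighbour ranking.
    """
    L, l = test_sign.shape
    total = 0.0
    for j in range(L):
        frame_sim = sum(
            feature_similarity(test_sign[j, d],
                               reference_sign[j, d, 0],
                               reference_sign[j, d, 1],
                               delta)
            for d in range(l))
        total += frame_sim / l
    return total / L

def recognize(test_sign, knowledge_base):
    """Nearest-neighbour recognition: return the label of the most similar reference sign."""
    scores = {label: sign_similarity(test_sign, ref) for label, ref in knowledge_base.items()}
    return max(scores, key=scores.get)
```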
3 Experimentation
We have conducted experiments on our own sign language dataset to validate the feasibility of the proposed methodology. The dataset covers sentences used by hearing impaired people in their day-to-day life. We have considered videos of signs made by communication impaired students of different schools of the Mysore zone. The dataset contains 680 sign videos (11.3 h) of 17 different signs expressed by four different students with ten instances each. Several experiments are conducted for different percentages of training and testing (60:40, 50:50 and 40:60), and the experiments are repeated for 50 trials with random sets of training and testing samples. In each trial, the performance of the system is measured in terms of F-measure, and the average F-measure over all trials is recorded. Table 3 gives the class-wise performance in terms of F-measure for the respective percentages of training and testing. Figure 3 shows the confusion matrices obtained for 60:40, 50:50 and 40:60 % of training and testing samples respectively for a random run, and Table 4 gives the average F-measure for the different percentages of training and testing samples. We can observe from Table 3 that the F-measure of class 12 is zero for all percentages of training and testing, as it is fully merged with class 1. The signs belonging to class 1 and class 12 are similar in terms of their signing action and hence difficult to differentiate; the resulting misclassification can be observed from Table 3 as well as from the confusion matrices in Fig. 3. The number of frames used to represent a sign plays an important role in the performance of the recognition system. In order to study the significance of the number of key-frames chosen for a sign through K-means clustering, we conducted experiments for various values of K (30, 40, …, 90). With 50 frames we achieved an average F-measure of 78.71%, which is significantly higher than with 30 or 40 frames, whereas increasing the number of frames to 60, 70, 80 and 90 did not yield much improvement in the recognition rate. Thus, in our experiments, we have considered 50 frames for every sign. The results obtained are tabulated in Table 5.
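A sketch of this evaluation protocol is given below, assuming a `classify` callable that wraps the whole pipeline of Sect. 2 (key-frame selection, interval aggregation of the training split, and nearest-neighbour matching); it returns the macro-averaged F-measure over repeated random per-class splits.

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate(samples, labels, classify, train_fraction=0.6, n_trials=50, seed=0):
    """Average macro F-measure over repeated random training/testing splits.

    The split is drawn per sign class, mirroring the protocol of taking a fixed
    fraction of the t instances of every sign for training.
    `classify(train_samples, train_labels, test_samples)` is assumed to return
    one predicted label per test sample.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            members = np.where(labels == c)[0]
            rng.shuffle(members)
            n_train = int(round(train_fraction * len(members)))
            train_idx.extend(members[:n_train])
            test_idx.extend(members[n_train:])
        preds = classify([samples[i] for i in train_idx], labels[train_idx],
                         [samples[i] for i in test_idx])
        scores.append(f1_score(labels[test_idx], preds, average='macro'))
    return float(np.mean(scores)), float(np.std(scores))
```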
Table 3. Classwise performance measures in terms of F-measure obtained by the proposed methodology for one of the random experiments.
| Class Index | Training:Testing | | | Class Index | Training:Testing | | |
| | 60:40 | 50:50 | 40:60 | | 60:40 | 50:50 | 40:60 |
| 1 | 0.64 | 0.64 | 0.64 | 10 | 0.97 | 1.00 | 0.93 |
| 2 | 1.00 | 1.00 | 1.00 | 11 | 1.00 | 0.97 | 0.96 |
| 3 | 1.00 | 0.97 | 0.96 | 12 | 0.00 | 0.00 | 0.00 |
| 4 | 0.97 | 1.00 | 0.96 | 13 | 0.97 | 0.97 | 0.96 |
| 5 | 1.00 | 0.97 | 0.96 | 14 | 0.97 | 1.00 | 0.96 |
| 6 | 0.92 | 1.00 | 0.90 | 15 | 1.00 | 0.97 | 0.94 |
| 7 | 0.92 | 1.00 | 1.00 | 16 | 0.97 | 1.00 | 1.00 |
| 8 | 1.00 | 0.97 | 1.00 | 17 | 0.90 | 0.90 | 0.89 |
| 9 | 0.97 | 0.95 | 0.96 | | | | |
Table 4. Overall recognition performance of the proposed method in terms of F-measure.

| Sign language data set | Ratio of training and testing | Recognition rate (average for overall classes) |
| UoM sign language data | 60:40 | 92.47 ± 0.44 |
| | 50:50 | 92.00 ± 0.44 |
| | 40:60 | 91.34 ± 0.52 |
Fig. 3. Confusion matrices for (60:40), (50:50) and (40:60) training and testing samples respectively.
In order to demonstrate the superiority of the proposed method (GIST + Symbolic representation + K-NN) over other classifiers, we also conducted experiments using
classifiers like SVM (GIST + Crisp representation + SVM) and K-NN (GIST + Crisp representation + K-NN). The results obtained are presented in Table 6, from which we can observe that the proposed method (GIST + Symbolic representation + K-NN) outperforms the other methods.

Table 5. Average F-measure of the proposed method for different numbers of frames of the sign videos.
| No. of frames | Average F-measure (%) |
| 30 | 72.72 |
| 40 | 73.91 |
| 50 | 78.71 |
| 60 | 78.80 |
| 70 | 78.87 |
| 80 | 78.83 |
| 90 | 78.98 |
Table 6. Results obtained by the different classifiers
| Different classifiers | Recognition rate |
| GIST + Crisp representation + SVM | 39.46 |
| GIST + Crisp representation + K-NN | 59.18 |
| GIST + Symbolic representation + K-NN (Proposed Method) | 92.47 |
4 Conclusion
In this paper, we have presented a method based on low-dimensional global "gist" descriptors for recognizing signs used by hearing impaired people at the sentence level. The K-means clustering algorithm was used to select key-frames and to characterize signs in terms of a fixed number of frames. Intra-class variations among the signs are effectively captured by the use of symbolic data analysis, and the number of reference signs needed to train the system is also reduced compared to conventional crisp representation techniques, without any degradation in the performance of the system. A considerably large database of Indian Sign Language signs (UoM-ISL) was created and extensive experiments were conducted on this dataset. The experimental results obtained by the proposed sign language recognition system are encouraging for the dataset considered. However, the scalability of the proposed system needs to be tested. Minor variations in signing actions that convey different messages may lead
to small between-class variations, which result in misclassification. Hence, consideration of various features from different domains needs to be addressed to resolve such issues.

Acknowledgement. We would like to thank the students and the teaching staff of Sai Ranga Residential Boys' School for the Hearing Impaired, Mysore, and N K Ganpaiah Rotary School for the Physically Challenged, Sakaleshpura, Hassan, Karnataka, INDIA, for their immense support in the creation of the UoM-ISL sign language dataset.
References

1. Al-Ahdal, M., Tahir, N.: Review in sign language recognition systems. In: IEEE Symposium on Computers and Informatics (ISCI), pp. 52–57 (2012)
2. Ghotkar, A.S., Kharate, G.K.: Study of vision based hand gesture recognition using Indian sign language. IJSS Intell. Syst. 7(1), 96–115 (2014)
3. Oliva, A., Torralba, A.: Building the gist of a scene: the role of global image features in recognition. In: Martinez-Conde, S., Macknik, S.L., Martinez, L.M., Alonso, J.-M., Tse, P.U. (eds.) Progress in Brain Research, vol. 155 (2006). ISSN 0079-6123
4. Bock, H.H., Diday, E.: Analysis of Symbolic Data. Springer, Berlin (2000)
5. Dahmani, D., Larabi, S.: User-independent system for sign language finger spelling recognition. J. Vis. Commun. Image Represent. 25(5), 1240–1250 (2014)
6. Gourley, C.: Neural network utilizing posture input for sign language recognition. Technical report, Computer Vision and Robotics Research Laboratory, University of Tennessee, Knoxville, November 1994
7. Gowda, K.C., Diday, E.: Symbolic clustering using a new dissimilarity measure. Pattern Recogn. 24(6), 567–578 (1991)
8. Gowda, K.C., Ravi, T.V.: Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recogn. 28(8), 1277–1282 (1995)
9. Guru, D.S., Prakash, H.N.: Online signature verification and recognition: an approach based on symbolic representation. IEEE Trans. Pattern Anal. Mach. Intell. 31(6), 1059–1073 (2009)
10. Guru, D.S., Nagendraswamy, H.S.: Clustering of interval-valued symbolic patterns based on mutual similarity value and the concept of k-mutual nearest neighborhood. In: Narayanan, P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3852, pp. 234–243. Springer, Heidelberg (2006)
11. Guru, D.S., Kiranagi, B.B., Nagabhushan, P.: Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recogn. Lett. 25(10), 1203–1213 (2004)
12. Handouyahia, M., Zion, D., Wang, S.: Sign language recognition using moment based size functions. In: Vision Interface 1999, Trois-Rivieres, Canada, 19–21 May 1999
13. Handouyahia, M.: Sign language recognition using moment based size functions. M.Sc. thesis, Université de Sherbrooke, Sherbrooke (1998)
14. Hu, M.: Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory 8, 179 (1962)
15. Kang, S., Nam, M., Rhee, P.: Colour based hand and finger detection technology for user interaction. In: International Conference on Convergence and Hybrid Information Technology, pp. 229–236 (2008)
16. Karthick, P., Prathiba, N., Rekha, V.B., Thanalaxmi, S.: Transforming Indian sign language into text using leap motion. Int. J. Innovative Res. Sci. Eng. Technol. 3(4), 10906–10910 (2014)
17. Kong, W.W., Ranganath, S.: Towards subject independent continuous sign language recognition: a segment and merge approach. Pattern Recogn. 47, 1294–1308 (2014)
18. Liwicki, S., Everingham, M.: Automatic recognition of finger spelled words in British sign language. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009. IEEE (2009)
19. MohanKumar, H.P., Nagendraswamy, H.S.: Change energy image for gait recognition: an approach based on symbolic representation. Int. J. Image Graphics Signal Proc. (IJIGSP) 6(4), 1–8 (2014)
20. Murakami, K., Taguchi, H.: Gesture recognition using recurrent neural network. In: Actes de CHI 1991 Workshop on User Interface by Hand Gesture, pp. 237–242. ACM (1991)
21. Nagendraswamy, H.S., Naresh, Y.G.: Representation and classification of medicinal plants: a symbolic approach based on fuzzy inference technique. In: Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), 28–30 December 2012
22. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)
23. Oliva, A., Schyns, P.: Coarse blobs or fine edges? Evidence that information diagnosticity changes the perception of complex visual stimuli. Cogn. Psychol. 34, 72–107 (1997)
24. Holden, E.-J., Lee, G., Owens, R.: Australian sign language recognition. Mach. Vis. Appl. 16(5), 312–320 (2005)
25. Harling, P.A.: Gesture input using neural networks. Technical report, University of York, UK (1993)
26. Potter, M.C.: Meaning in visual search. Science 187(4180), 965–966 (1975)
27. Rensink, R.A.: The dynamic representation of scenes. Vis. Cogn. 7, 17–42 (2000)
28. Haralick, R.M., Shapiro, L.G.: Local invariant feature detectors: a survey. In: Computer and Robot Vision, vol. 2. Addison-Wesley Publishing Company, Boston (1993)
29. Ross, T.J.: Fuzzy Logic with Engineering Applications. Wiley, New York (2009)
30. Siagian, C., Itti, L.: Comparison of gist models in rapid scene categorization tasks. In: Proceedings of the Vision Science Society Annual Meeting (VSS 2008), May 2008
31. Starner, T., Weaver, J., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1371–1375 (1998)
32. Suraj, M.G., Guru, D.S.: Secondary diagonal FLD for fingerspelling recognition. In: International Conference on Computing: Theory and Applications (ICCTA 2007), pp. 693–697 (2007). doi:10.1109/ICCTA.2007
33. Takahashi, T., Kishino, F.: Hand gesture coding based on experiments using a hand gesture interface device. SIGCHI Bull. 23, 67–74 (1991)