Implementation of Dynamic Time Warping for Gesture Recognition in Sign Language using High Performance Computing

Muaaz Salagar

Pranav Kulkarni

Department of Computer Science and Engineering Walchand College of Engineering Sangli, Maharashtra, India [email protected]

Department of Computer Science and Engineering Walchand College of Engineering Sangli, Maharashtra, India [email protected]

Saurabh Gondane
Department of Computer Science and Engineering Walchand College of Engineering Sangli, Maharashtra, India [email protected]

Abstract–ASL (American Sign Language) is the primary language of many who are deaf. ASL is a complex language that employs signs made by moving the hands, combined with facial expressions and postures of the body, to convey linguistic information. The designed sign language recognizer works for gestures in ASL. A Kinect is used as the image capture device and also fits the low-cost requirement. The skeleton joint data of a user captured by the Kinect is analyzed, and the video is processed for signs at runtime. If a gesture is predefined in the library, it is transcribed to a word or phrase, and the output is presented as voice and text. The implemented system works with good accuracy; after parallel implementation it reaches 95.6%. The recognizer can be used as a tutor for those who want to learn sign language, as well as a translator that lets deaf people communicate efficiently with everyone.

Index Terms–American Sign Language, gesture recognition, Dynamic Time Warping, High performance C#, High performance computing.

INTRODUCTION

Hearing impairment is one of the most common disabilities in the world. According to the Hearing Loss Association of UNO, about 75.32 million people in India are hearing impaired. Sixteen percent of adult Europeans have hearing problems, and it was estimated that there would be more than 700 million hearing-impaired people in the world by 2015 [1]. As reported by the WHO, there are about 250-300 million deaf people in the world; two thirds of them live in underdeveloped nations, and of these India has the largest share: one out of twelve persons in India has hearing loss [2]. For hearing-impaired people it is very difficult to express their thoughts, and teaching and parenting them is equally difficult. Sign language is the primary language of deaf people, and when they want to communicate with people who do not know sign language, interpreters are required. Interpreters are skilled in sign language recognition, so they act as mediators, help deaf people bond socially, and make them well understood by the people they interact with.

Sign language simultaneously combines hand shapes, orientation and movement of the hands, arms or body, and facial expressions to express thoughts, and it has its own grammar and rules. Not everyone can learn sign language just to understand a deaf person's thoughts, and not everyone can afford an interpreter. The designed system aims to bridge this communication gap and to reduce the social, economic and cultural difficulties that follow from it. The system works for the gestures defined in its library; if the user wants to add a new gesture, the UI provides an easy way to capture it and store the data set for comparison during runtime execution. Gesture recognition has always been computationally intensive, requiring more computing power, so the system uses parallel programming. A Kinect is used to obtain human skeleton points. When gestures from sign language are performed in front of the Kinect, the gesture file is loaded into the system, and when a defined gesture is identified the output is given as text and as voice using a text-to-speech library. The system works with an accuracy of 95.6% on an affordable computing device. It uses parallel computing for pattern matching so that the computation-intensive task of gesture recognition is done efficiently with the available resources. The system can work as an interpreter and can also be used as a tutor for those who want to learn sign language. Gesture files can be downloaded from official websites, each following its own set of gestures for words, e.g. ASL (American Sign Language), BSL (British Sign Language), etc. When users define their own gesture for a particular word, they can store it in the file system in .txt format. American Sign Language (ASL) is one of the sign languages that is well developed and internationally recognized, and it translates to English, which is internationally spoken and understood. ASL is widely used alongside indigenous sign languages all over the world and is the most dominant sign language in North America. It is estimated that as many as 500,000 people use ASL as their primary language for communication [3]. This makes ASL effectively an international sign language for deaf people around the globe. Some signs are shown in Figure 1.

Figure 1: ASL Signs for Sad, Thank You and Happy

The objective of this work is to create a system that converts sign language input, in the form of gesture data, into a spoken language such as English. Input data is taken from the Kinect device and passed to a personal computer for processing. The input data is processed for gesture detection against a reference library defined by a sign language expert. The gesture file may differ between sign languages, since different sign languages use different gestures for the same word. When gesture processing is done, the output is given as a word or phrase, and using a text-to-speech library the gestured data is converted directly to voice, so when a deaf person communicates in sign language the output is spoken language. A proof-of-concept system is implemented in C# for 10 different gestures in ASL. A client-server architecture can also be implemented: a device capable of capturing motion sits at the client side, the data is sent over the Internet to the server side where it is processed for gesture matching, and the output is sent back to the client.

The Microsoft Kinect camera is a useful depth camera that gives developers data for gesture or motion detection [4]. The data acquired by the Kinect (RGB + depth) has a structure, called an RGBD image, that enables new ways to process images; an example of these possibilities is real-time tracking of a human skeleton. Besides the infrared-based depth camera, an array of built-in microphones for voice commands is installed along the horizontal bar of the Kinect, so there is increasing interest in applying the Kinect camera to various real-life applications. The sensor captures data at 640x480 pixels at 30 Hz. With the depth data it is possible to obtain the skeleton of whoever is in front of the sensor, and with the skeleton it is possible to define user gestures. The main functionality used for Kinect application and library development is available through the following projects: Microsoft Kinect for Windows, OpenNI and OpenKinect. Some of the libraries that use these projects as their backends are OpenCV, Unity3D [5], PCL, RGBDemo, etc. On March 18, 2013, Kinect for Windows was updated with official support and new source code samples in the toolkit, including Kinect Bridge with MATLAB and OpenCV [6]. The system is implemented using Kinect for Windows, which consists of the Kinect hardware and the Kinect for Windows SDK; the SDK supports applications built with C++, C# or Visual Basic and offers capabilities such as seated skeleton recognition, skeleton tracking, facial tracking and speech recognition.
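As an illustration, the following C# sketch shows how the skeleton stream can be opened and the joints used later in this paper read from each frame. It is a minimal sketch assuming the Kinect for Windows SDK 1.x managed API (Microsoft.Kinect namespace); it is not the authors' code, and the class and variable names are illustrative.

using System;
using System.Collections.Generic;
using Microsoft.Kinect;

class SkeletonCapture
{
    // The six joints the paper reports as sufficient to describe a sign.
    static readonly JointType[] TrackedJoints =
    {
        JointType.HandLeft, JointType.HandRight,
        JointType.ElbowLeft, JointType.ElbowRight,
        JointType.WristLeft, JointType.WristRight
    };

    static void Main()
    {
        KinectSensor sensor = KinectSensor.KinectSensors[0];
        sensor.SkeletonStream.Enable();
        sensor.SkeletonFrameReady += OnSkeletonFrameReady;
        sensor.Start();
        Console.ReadLine();          // keep capturing until Enter is pressed
        sensor.Stop();
    }

    static void OnSkeletonFrameReady(object sender, SkeletonFrameReadyEventArgs e)
    {
        using (SkeletonFrame frame = e.OpenSkeletonFrame())
        {
            if (frame == null) return;
            var skeletons = new Skeleton[frame.SkeletonArrayLength];
            frame.CopySkeletonDataTo(skeletons);
            foreach (Skeleton s in skeletons)
            {
                if (s.TrackingState != SkeletonTrackingState.Tracked) continue;
                // Collect X, Y, Z coordinates of the six joints for this frame;
                // one such vector per frame forms the time series compared by DTW.
                var frameVector = new List<float>();
                foreach (JointType jt in TrackedJoints)
                {
                    SkeletonPoint p = s.Joints[jt].Position;
                    frameVector.Add(p.X);
                    frameVector.Add(p.Y);
                    frameVector.Add(p.Z);
                }
            }
        }
    }
}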

Figure 2: Microsoft Kinect Sensors

Label A is the depth/3D sensor. It consists of an infrared laser projector (A1) combined with a monochrome CMOS sensor (A2) and allows the Kinect to process 3D scenes in any ambient light condition. The projector shines a grid of infrared light, and a depth map is created from the rays the sensor receives back from the objects in the scene. The resolution is 320x240 with 16 bits of depth at a frame rate of 30 fps. Label B is the RGB camera, which has a 32-bit high-color resolution of 640x480 at 30 fps and provides a 2-dimensional color video of the scene. Label C is the motorized tilt. As Figure 2 shows, Label D is an array of four microphones located along the bottom of the horizontal bar; it enables speech recognition with acoustic source localization, ambient noise suppression, and echo cancellation.

PROTOTYPE IMPLEMENTATION AND EVALUATION

To demonstrate the feasibility of the system, we implemented a prototype of the Kinect-based sign language recognition system using the C# programming language on the Microsoft .NET platform. DTW makes gesture matching efficient; the prototype was tested with both the parallel and the serial implementation, and the results show a significant improvement in gesture recognition and fewer false positives for similar gestures.

Figure 3: System User Interface

Dynamic Time Warping (DTW)

A non-linear (elastic) alignment produces a more intuitive similarity measure, allowing similar shapes to match even if they are out of phase in the time axis. To find the best alignment between two sequences A and B, one needs to find the path through the grid P = p_1, ..., p_s, ..., p_k, with p_s = (i_s, j_s), which minimizes the total distance between them; P is called a warping function.
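For reference, the alignment problem can be stated compactly as follows; this is a standard DTW formulation added here for clarity, consistent with the constraints described below, with A = a_1, ..., a_n and B = b_1, ..., b_m:

\[
\mathrm{DTW}(A,B) \;=\; \min_{P} \sum_{s=1}^{k} d\!\left(a_{i_s}, b_{j_s}\right), \qquad p_s = (i_s, j_s),
\]

subject to the boundary conditions $p_1 = (1,1)$ and $p_k = (n,m)$, and the continuity and monotonicity constraints $0 \le i_{s+1} - i_s \le 1$, $0 \le j_{s+1} - j_s \le 1$ with $p_{s+1} \ne p_s$.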

Figure 4: Dynamic Time Warping Algorithm demonstration

DTW tries to "describe" the second signal with the first by finding the data in one signal that best matches the second signal, restricted by some rules. We warp the indices of the signals so that a data point at index i in one signal may describe the data in the other signal at indices j and k; thus one data point is allowed to describe the behavior at multiple points. If we denote the samples from the signal by S_i and the samples from the template by T_j, we construct the cost matrix C with entries

C(i, j) = d(S_i, T_j),

where d(x, y) is a distance function, most commonly the Euclidean distance. The goal is to find a path through this matrix with minimal cost. We also require that the path start at C(1, 1) and end at C(n, m) (boundary conditions). The path must also be continuous and monotonic: each step must go either to the right, down, or diagonally down to the right. This can be implemented with dynamic programming, starting at C(1, 1), moving one step at a time by taking the step with the least cost, updating the cost matrix, and continuing until C(n, m) is reached. If two identical signals S and T are matched using DTW, then for i = j we have d(S_i, T_j) = 0, so the main diagonal of the cost matrix consists only of zeros.

Therefore DTW can map each index i to itself, resulting in a total distance of 0, which was one of the requirements. DTW has been used in video, audio, and graphics applications; in fact, any data that can be turned into a linear representation can be analyzed with DTW (a well-known application is automatic speech recognition). In this system, DTW is used for gesture/sign recognition, coping in that way with varying sign execution speeds.

Algorithm 1: Dynamic Time Warping distance

function DTWDistance(s[1..n], t[1..m])
    declare DTW[0..n, 0..m]
    for j := 1 to m do DTW[0, j] := infinity end for
    for i := 1 to n do DTW[i, 0] := infinity end for
    DTW[0, 0] := 0
    for i := 1 to n do
        for j := 1 to m do
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j], DTW[i, j-1], DTW[i-1, j-1])
        end for
    end for
    return DTW[n, m]
end function

Algorithm 2: Nearest Group classifier using the Dynamic Time Warping distance

function NG-DTW(sample test, vector dictionary[1..n])
    declare dist, minDist := infinity, minGroup := 0
    declare vector<double> meanDTW; vector<int> meanDTWcount
    for i := 1 to n do
        dist := DTWDistance(test, dictionary[i])
        meanDTW(getGroupIndex(i)) := meanDTW(getGroupIndex(i)) + dist
        meanDTWcount(getGroupIndex(i)) := meanDTWcount(getGroupIndex(i)) + 1
    end for
    for j := 1 to m do   (m is the number of gesture groups)
        meanDTW(j) := meanDTW(j) / meanDTWcount(j)
        if meanDTW(j) <= minDist then
            minDist := meanDTW(j); minGroup := j
        end if
    end for
    return minGroup
end function
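The following C# sketch illustrates the DTW distance of Algorithm 1 computed over two recorded gesture sequences; it is illustrative only, not the authors' code, and the per-frame representation (a float[] of normalized joint coordinates, compared with the Euclidean distance) is an assumption.

using System;
using System.Collections.Generic;

static class Dtw
{
    // Euclidean distance between two frames (equal-length coordinate vectors).
    static double FrameDistance(float[] a, float[] b)
    {
        double sum = 0.0;
        for (int k = 0; k < a.Length; k++)
        {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    // DTW distance between a live sequence and a stored template (Algorithm 1).
    public static double Distance(IList<float[]> s, IList<float[]> t)
    {
        int n = s.Count, m = t.Count;
        var dtw = new double[n + 1, m + 1];

        for (int i = 0; i <= n; i++)
            for (int j = 0; j <= m; j++)
                dtw[i, j] = double.PositiveInfinity;
        dtw[0, 0] = 0.0;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                double cost = FrameDistance(s[i - 1], t[j - 1]);
                dtw[i, j] = cost + Math.Min(dtw[i - 1, j],
                                 Math.Min(dtw[i, j - 1], dtw[i - 1, j - 1]));
            }
        }
        return dtw[n, m];
    }
}

A recognized gesture is then the dictionary entry, or the group with the smallest mean distance as in Algorithm 2, whose distance to the live sequence is minimal.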

Skeleton Points

Kinect for Windows can track skeletons in the default full-skeleton mode with 20 joints. Each joint is identified by name, for example shoulder right, wrist right, shoulder left, etc. After studying the signs of the proposed default dictionary for the system, only six of the 20 joints were found useful for describing a sign: both hands, both elbows and both wrists. The other joints are not important, as they have static values across all frames of the gestures, so adding them would only increase unnecessary computation. X, Y, Z position coordinates for the joints are given in millimeters from the device, where the origin of the axes is at the device's depth sensor; the X-axis lies in the horizontal plane, the Y-axis in the vertical plane, and the Z-axis is normal to the depth sensor.

NORMALIZATION FOR SPACE AND HEIGHT

While performing the gestures the user can be at any position and of any size, i.e. there is variance in height and overall body build. To handle the user's position, the distances obtained for the joints in X, Y, Z coordinates are first shifted by subtracting the spine-point coordinates from the joint coordinates. With these coordinates, no matter whether the person performing the gesture stands at the far right or the far left of the frame, the values are brought into a common frame that can be used for further computation. Some implementations suggest a spherical coordinate system for normalization, but it increases computation and reduces the speed at which the system recognizes gestures. Refer to Figures 5 and 6. The height of the user may differ, so all coordinates are normalized using the distance between the spine point and the head point: a taller user is scaled by a larger spine-to-head distance than a shorter user. With this, normalization for the user's height is applied, and the coordinates are ready for further processing.
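The normalization can be sketched as follows; this is illustrative code assuming the SkeletonPoint type from the Kinect SDK, not the authors' exact implementation.

using Microsoft.Kinect;

static class Normalization
{
    // Normalize one joint position for user position and height, as described above:
    // shift by the spine point, then scale by the spine-to-head distance.
    public static float[] Normalize(SkeletonPoint joint, SkeletonPoint spine, SkeletonPoint head)
    {
        // Shift: make coordinates relative to the spine, removing the user's
        // position inside the camera frame.
        float x = joint.X - spine.X;
        float y = joint.Y - spine.Y;
        float z = joint.Z - spine.Z;

        // Scale: divide by the spine-to-head distance so that tall and short
        // users produce comparable trajectories.
        float dx = head.X - spine.X, dy = head.Y - spine.Y, dz = head.Z - spine.Z;
        float scale = (float)System.Math.Sqrt(dx * dx + dy * dy + dz * dz);

        return new float[] { x / scale, y / scale, z / scale };
    }
}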

While gestures are performed in front of the Kinect camera, video is captured at 15 fps, a rate chosen after implementing and analyzing system responsiveness. The system works in two modes, training and testing. In training mode the user has 2 seconds, i.e. 30 frames, to record a gesture sequence; otherwise the system simply asks the camera for the next frame. In testing mode the user can load a predefined gesture file from the Internet or from the computer's file system. The file is read and DTW is applied at runtime to the two sets to compare the similarity of the gestures. When a gesture is recognized it is displayed on screen with the proper sentence framework, and the recognized word is passed to speech synthesis to make the system more interactive, so the overall output is sound for the gesture performed. The system is implemented on the C# .NET framework, using the System.Speech.Synthesis namespace; the speech synthesis engine converts the text input to output speech.
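A minimal sketch of the text-to-speech step using the System.Speech.Synthesis namespace named above; the surrounding class and method names are illustrative, not the authors' code.

using System.Speech.Synthesis;   // requires a reference to the System.Speech assembly

static class GestureSpeaker
{
    static readonly SpeechSynthesizer Synth = new SpeechSynthesizer();

    // Speak a recognized gesture word, e.g. "book" or "thank you".
    public static void Speak(string recognizedWord)
    {
        Synth.SetOutputToDefaultAudioDevice();
        // SpeakAsync keeps the UI responsive while the word is spoken.
        Synth.SpeakAsync(recognizedWord);
    }
}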

Figure 7: Block Diagram

Sentence and Word Mode

The designed system works in two modes, word mode and sentence mode. The sentence framework used is Subject+Verb+Object (S+V+O). The purpose of this framework is to enable contextual and syntactical analysis of the predicted gestured words, so that the system can generate better speech with user-profile and contextual awareness. Another advantage is that false positives in gesture matching might otherwise produce sentences with grammatical errors; when a word is recognized, the category it belongs to (subject, object or other) can be checked and corrected, which also helps in understanding more about the deaf user. A small illustration of such a check is sketched below.
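The following is a hypothetical C# sketch of such an S+V+O check; the word-to-role table and all names are assumptions made for illustration, not part of the described system.

using System.Collections.Generic;

enum Role { Subject, Verb, Object, Other }

static class SentenceBuilder
{
    // Hypothetical role assignments for the 10-word vocabulary used in the experiments.
    static readonly Dictionary<string, Role> Roles = new Dictionary<string, Role>
    {
        { "I", Role.Subject }, { "we", Role.Subject },
        { "read", Role.Verb }, { "eat", Role.Verb },
        { "book", Role.Object }, { "land", Role.Object },
        { "happy", Role.Other }, { "sad", Role.Other },
        { "today", Role.Other }, { "thank", Role.Other }
    };

    // Arrange recognized words into S+V+O order; extra words of a role that is
    // already filled are ignored (a crude guard against false positives).
    public static string Compose(IEnumerable<string> recognizedWords)
    {
        string subject = null, verb = null, obj = null;
        var rest = new List<string>();
        foreach (string w in recognizedWords)
        {
            Role r = Roles.TryGetValue(w, out Role found) ? found : Role.Other;
            switch (r)
            {
                case Role.Subject: subject = subject ?? w; break;
                case Role.Verb:    verb    = verb    ?? w; break;
                case Role.Object:  obj     = obj     ?? w; break;
                default:           rest.Add(w);            break;
            }
        }
        var parts = new List<string>();
        if (subject != null) parts.Add(subject);
        if (verb != null) parts.Add(verb);
        if (obj != null) parts.Add(obj);
        parts.AddRange(rest);
        return string.Join(" ", parts);   // e.g. "I read book today"
    }
}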

Figure 5: Normalization of Position

Figure 6: Normalization in Size

EXPERIMENTAL RESULTS

The ASL gestures used are for the words I, we, read, happy, sad, today, eat, book, land, and thank. Sentences for which the system was tested include: I read book, we read book, I (am) happy, I (am) sad, I read book today, I (am) happy today, I (am) sad today, etc. The system was tested with five different users, each performing the 10 gestures 10 times, so testing covered 500 samples. Table 1 shows the results obtained: U1 to U5 are the different users, CT is the total number of correctly identified gestures, FP the false positives, and NR the gestures that were not recognized. The Total row shows the percentage accuracy for correct gestures.

Word       I    We   Read  Happy  Sad  Today  Eat  Book  Land  Thank  Average
U1         8    8    8     8      7    8      7    9     7     9      7.9
U2         7    9    7     8      8    9      9    8     10    8      8.3
U3         9    7    9     8      9    9      8    9     9     9      8.6
U4         8    9    9     9      9    8      8    8     9     8      8.5
U5         9    8    9     9      9    9      9    9     9     8      8.8
CT         41   41   42    42     42   43     41   43    44    42     42.1
FP         4    3    4     3      5    2      4    4     2     4      3.5
NR         5    6    4     5      3    5      5    3     4     4      4.4
Total (%)  82   82   84    84     84   86     82   86    88    84     84.2

Table 1: System Accuracy without Parallelism Implementation

PARALLEL IMPLEMENTATION

In order to improve gesture recognition accuracy, the system was implemented at 30 fps instead of the 15 fps mentioned previously. To implement this on the C# platform, multiple threads of execution are used to keep the system responsive to the user while maximizing the performance of the user's computer. The Task Parallel Library (TPL) [7] is used, and Parallel.ForEach handles parallel execution of loops whose data dependencies allow parallelization, as in the sketch below. Table 2 shows U1 to U5 as the different users; as in Table 1, CT is the number of correctly identified gestures, FP the false positives, NR the gestures not recognized, and Total the percentage accuracy for correct gestures.
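The sketch below, which is not the authors' code, shows how Parallel.ForEach from the TPL can spread the DTW comparisons against the gesture dictionary across cores; it reuses the Dtw.Distance sketch given earlier, and the dictionary layout is an assumption.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelRecognizer
{
    // Compare the live sequence against every stored gesture in parallel and
    // return the name of the nearest one.
    public static string Recognize(IList<float[]> live,
                                   IDictionary<string, IList<float[]>> dictionary)
    {
        var distances = new ConcurrentDictionary<string, double>();

        // Each dictionary entry is independent, so the comparisons can run in parallel.
        Parallel.ForEach(dictionary, entry =>
        {
            distances[entry.Key] = Dtw.Distance(live, entry.Value);
        });

        string best = null;
        double bestDist = double.PositiveInfinity;
        foreach (var kv in distances)
        {
            if (kv.Value < bestDist) { bestDist = kv.Value; best = kv.Key; }
        }
        return best;
    }
}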

Word       I    We   Read  Happy  Sad  Today  Eat  Book  Land  Thank  Average
U1         10   9    10    10     9    10     10   9     10    10     9.7
U2         9    10   10    10     10   9      9    10    9     9      9.5
U3         10   10   9     8      9    9      10   10    10    9      9.4
U4         9    9    10    9      10   10     9    10    9     10     9.5
U5         9    10   9     9      10   10     10   10    10    10     9.7
CT         47   48   48    46     48   48     48   49    48    48     47.8
FP         1    1    1     3      1    0      1    1     1     1      1.1
NR         2    1    1     1      1    2      1    0     1     1      1.1
Total (%)  94   96   96    92     96   96     96   98    96    96     95.6

Table 2: System Accuracy after Parallel Implementation

As Table 2 shows, the accuracy of sign language gesture recognition improves to 95.6% with the parallel implementation.

CONCLUSIONS AND FUTURE WORKS

The designed system can act as a mediator for people with hearing impairment, and the low cost of the hardware is appealing. The proof-of-concept ASL recognizer works with an accuracy of 84.2%; the parallel implementation with an increased frame rate improves this by 13.53% relative to the serial version, giving 95.6% accuracy. False positives decreased from 3.5% to 1.1% and non-recognized gestures from 4.4% to 1.1%. Kinect-like hardware that assists in skeleton tracking is still evolving, and tracking of fingers is important in ASL; a system that takes fingers into consideration will make a larger ASL vocabulary available. The parallel library for C# is limited in its number of threads. Current advancements are making the Kinect available with support from MATLAB and OpenCV, which can also be implemented on the CUDA platform, providing massive parallelism so that the accuracy of gesture recognition can improve significantly and make the ASL translator more robust, accurate and responsive.

REFERENCES


[1] Hear-it AISBL, "More and More Hearing Impaired People," available at www.hear-it.org/page.dsp?area=134, May 23, 2011.
[2] Highlights of Project Deaf India, http://projectdeafindia.org/
[3] R. E. Mitchell, T. A. Young, B. Bachleda, and M. A. Karchmer, "How Many People Use ASL in the United States?" Sign Language Studies, vol. 6, no. 3, 2006.
[4] Microsoft XBOX, available at www.xbox.com, May 23, 2011.
[5] OpenCV with Kinect - Windows SDK, http://wiki.etc.cmu.edu/unity3d/index.php/OpenCV_with_Kinect_-_Windows_SDK
[6] Kinect for Windows SDK 1.7 is the latest update, http://www.microsoft.com/en-us/kinectforwindows/develop/new.aspx
[7] Task Parallelism (Task Parallel Library), http://msdn.microsoft.com/en-us/library/dd537609.aspx
[8] J. Isaacs and J. Foo, "Hand pose estimation for American sign language recognition," in Proc. of the Southeastern Symposium on System Theory, pp. 132-136, 2004.
[9] S. Liwicki and M. Everingham, "Automatic recognition of fingerspelled words in British sign language," in Proc. of CVPR, pp. 50-57, 2009.
[10] S. Ong and S. Ranganath, "Automatic sign language analysis: A survey and the future beyond," IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):873-891, 2005.
[11] M. Van den Bergh and L. Van Gool, "Combining RGB and ToF cameras for real-time 3D hand gesture interaction," in Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV 2011), 2011.
[12] G.-F. He, S.-K. Kang, W.-C. Song, and S.-T. Jung, "Real-time gesture recognition using 3D depth camera," in Software Engineering and Service Science (ICSESS), 2011 IEEE 2nd International Conference on, 2011, pp. 187-190.
