Recognizing words in Thai Sign Language using flex sensors and gyroscopes

i-CREATe 2017, Kobe, Japan

Rujira Jitcharoenport, Patraporn Senachakr, Maitai Dahlan, Atiwong Suchato, Ekapol Chuangsuwanich, Proadpran Punyabukkana

Assistive Technology Group, Dept. of Computer Engineering, Chulalongkorn University, Bangkok, Thailand
(Maitai Dahlan: Dept. of Mechanical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand)

ABSTRACT

This paper presents a Thai sign language recognition framework using a glove-based device with flex sensors and gyroscopes. The measurements from the sensors are processed using the finite Legendre Transform (fLT) and Linear Discriminant Analysis (LDA), then classified using k-nearest neighbors. We evaluate different setups to measure both the generalizability and the accuracy of our framework. Using a vocabulary of 10 words, our best setup achieves an accuracy of 76%.

Categories and Subject Descriptors

[Human-centered computing]: Accessibility—Accessibility systems and tools

General Terms

Experimentation, Human Factors, Design

Keywords

Hand Sign Language, Gesture Recognition, Translation Device, Machine Learning

1. INTRODUCTION

Until recently, the adverse conditions facing the speech and language impaired in Thailand have often been overlooked. According to the National Statistical Office of Thailand, in 2012 there were more than 250,000 Thais older than 5 years with speech or language impairment [5]. To communicate with other people, the speech and language impaired usually use sign language. However, ordinary people may not understand sign language, so direct communication usually requires an interpreter. One possible solution is a device that translates sign language into ordinary language and outputs it either as text on a display or as a spoken utterance. We propose a pair of gloves augmented with various sensors that can track and interpret the hand movements of the user. We design the device especially for Thai sign language.

1.1 Thai sign language

Sign languages use gestures and movements made by the two hands for communication. Sign language usually varies by country or region. For example, American Sign Language (ASL) uses one hand to fingerspell its 26 letters and sometimes requires both hands for certain words. Chinese Sign Language has a spelling system similar to Pinyin and uses gestures that sometimes resemble written Chinese characters. Moreover, a particular sign language can vary depending on the region or subculture, just as a spoken language can have many dialects. Our target domain, Thai Sign Language (TSL), is adapted from American Sign Language. TSL uses many of the movements and gestures common in ASL. However, the spelling system is quite different due to differences in the written language. Written Thai consists of a large set of characters: 44 consonants, 21 vowels, 4 tone marks (Thai has 5 tones), and a handful of other special characters. Because of this large character set, the differences between certain gesture pairs can be very subtle. For example, one vowel uses the index finger to point to the tip of a finger on the other hand, while a different vowel points to the base of the finger instead of the tip. Because of the large character set and the long word lengths in written Thai, TSL users rarely fingerspell words. Words in TSL are composed using one or two hands. The features that can be used to distinguish between words are hand gestures, positions, and orientations.

1.2 Literature review

There have been several studies using specialized gloves to capture hand gestures for recognizing sign languages. Most works recognize just the 26 ASL letters [3, 6]. Some recognize Chinese fingerspelling and commonly used words [4]. Flex sensors, which measure the bending of each finger, are the most commonly used sensors. Depending on the needs of the task, other sensors are also employed, such as contact sensors for detecting fingers touching each other, accelerometers for measuring the acceleration of the hand in different directions, gyroscopes for measuring the hand orientation and angular movement, and magnetoresistive sensors for measuring the magnetic field to derive the hand orientation [1]. Solutions that do not require the user to wear any special equipment also exist; for example, images from a camera were used to detect hand gestures and translate them to the associated letters in [2]. Due to the subtlety of TSL, we believe that a glove-based system with various sensors is more robust than image processing. Our work differs from the work done in [7], which only recognizes fingerspelled Thai letters and used gloves with high-precision sensors costing 5,400 US dollars. We believe our solution is more cost-effective while still being able to capture the differentiating nuances between gestures. We also evaluate our solution entirely on TSL words, which better matches real-world scenarios.

2. METHODOLOGY

2.1 Hardware and sensor descriptions

We start this section with a brief explanation of the hardware used in our gloves. As shown in Fig. 1, there are two types of sensors: gyroscopes and flex sensors. Both are connected to the microcontroller, which communicates over Bluetooth with a PC that performs the gesture recognition. Our estimated cost of the glove device is 4,000 Thai Baht.

Figure 1: A diagram of our hardware. The prototype is shown in the upper right corner.

Five flex sensors per glove measure the bending and stretching of each finger through their varying resistance. Fig. 2(a) shows examples of the values measured from a flex sensor for a certain TSL word. As shown, the values are grouped together nicely, reaffirming the usefulness of such sensors for hand gesture recognition.

Figure 2: Measurements from a flex sensor for a particular gesture. (a) Raw values; (b) processed data.

One gyroscope is attached to the back of each glove to measure hand orientation. The gyroscope module also includes an accelerometer that can measure the gravity component along each of the three axes. The gyroscope can return values in three different types of measurement, namely quaternions, Euler angles, and yaw-pitch-roll (YPR), which differ as follows:

• Quaternions are the raw data returned from the sensor. This measurement yields a four-dimensional output.

• Euler angles are converted from the four quaternion values. They consist of three values, corresponding to the x, y, and z axes (a conversion sketch is given after this list).

• YPR also measures angles, but with respect to the direction of the ground. It has three elements like the Euler angles; however, it also requires the gravity values from the accelerometer for calibration.
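As a concrete illustration of the Euler-angle option above, the sketch below converts a unit quaternion to roll, pitch, and yaw angles. It assumes the common Tait-Bryan x-y-z convention; the convention actually used by our gyroscope firmware may differ, so this is an illustrative assumption rather than the exact on-device computation.

import numpy as np

def quaternion_to_euler(w, x, y, z):
    # Convert a unit quaternion (w, x, y, z) to Euler angles in radians,
    # assuming the Tait-Bryan x-y-z convention (an assumption, see above).
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clip the argument to [-1, 1] so rounding noise cannot produce NaN.
    pitch = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return np.array([roll, pitch, yaw])

# The identity quaternion corresponds to zero rotation about every axis.
print(quaternion_to_euler(1.0, 0.0, 0.0, 0.0))  # -> [0. 0. 0.]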

Using raw quaternion data is not preferable due to the extra feature dimension. We prefer Euler angles over YPR angles because calculating the YPR angles requires seven values (the four quaternion elements and the three gravity values). Fig. 3(c) shows examples of Euler angles captured from the same TSL word. As noted, the gyroscope module also includes an accelerometer; however, we find its data to be too noisy. Fig. 4 shows examples of measurements taken from two different gestures. As shown, the values are quite erratic, so we discarded the measurements from the accelerometer.

2.2 Pre-processing

Before classification, the raw data need to be segmented and normalized. The values from the flex sensors differ greatly from person to person. To alleviate this, we require a calibration phase in which the user clenches and releases the hands at least 3 times to determine the maximum and minimum values of each flex sensor. We quantize the data to 3 possible values (0, 1, 2): 0 refers to the released position, where the sensor value is highest, and 2 refers to the clenched position, where the sensor value is lowest. The readings from each flex sensor are also segmented into 6 parts of equal length, and each segment is replaced by its mean value. The first and last segments are ignored because they show no significant difference between gestures. The quantized and segmented data can be seen in Fig. 2(b).

Figure 4: Examples of data from two different gestures measured by the accelerometer. On the left is the word "doctor," while "hungry" is on the right.

The Euler values are calibrated to start at 0 radians in all directions. To handle the variation in length due to different movement speeds, the values are projected to the same number of dimensions using the finite Legendre Transform (fLT). The fLT extracts the movement patterns found in the sensor data by projecting the data onto the Legendre polynomials, resulting in coefficients L(k). We extract 7 coefficients for each Euler dimension. Fig. 3(d) shows examples of the projected values from one gesture.

Figure 3: Examples of Euler angle measurements from one kind of gesture. (c) Raw values; (d) Legendre series coefficients.
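A minimal sketch of the pre-processing steps described above follows. It assumes each recording arrives as a NumPy array of sensor readings; the helper names (quantize_flex, segment_means, legendre_coefficients) and the 50 Hz example rate are hypothetical, and the Legendre projection is approximated here by a least-squares fit onto the Legendre basis, which may differ in detail from our exact fLT implementation.

import numpy as np
from numpy.polynomial import legendre

def quantize_flex(raw, lo, hi):
    # Map calibrated flex readings to three levels: 0 = released, 2 = clenched.
    # lo and hi come from the clench-and-release calibration phase.
    bins = np.array([lo + (hi - lo) / 3.0, lo + 2.0 * (hi - lo) / 3.0])
    levels = np.digitize(raw, bins)   # 0 for low readings, 2 for high readings
    return 2 - levels                 # highest readings map to the released pose (0)

def segment_means(values, n_segments=6):
    # Split a reading sequence into equal-length segments, keep each segment's
    # mean, and drop the first and last segments as described in the text.
    segments = np.array_split(values, n_segments)
    return np.array([seg.mean() for seg in segments[1:-1]])   # 4 values

def legendre_coefficients(signal, n_coeffs=7):
    # Project one Euler-angle trajectory onto Legendre polynomials so that
    # gestures of different durations yield a fixed-length feature vector.
    t = np.linspace(-1.0, 1.0, len(signal))              # rescale time to [-1, 1]
    return legendre.legfit(t, signal, deg=n_coeffs - 1)  # coefficients L(0)..L(6)

# Example with a synthetic one-second recording sampled at 50 Hz (assumed rate).
raw_flex = np.random.uniform(200.0, 600.0, size=50)
euler_x = np.cumsum(np.random.randn(50)) * 0.01
flex_feat = segment_means(quantize_flex(raw_flex, lo=200.0, hi=600.0))
euler_feat = legendre_coefficients(euler_x)
print(flex_feat.shape, euler_feat.shape)   # (4,) and (7,)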

For each hand, we have 7 × 3 = 21 Euler feature dimensions and 4 × 5 = 20 flex feature dimensions, for a total of 41 dimensions. Since the classification algorithm we use is the k-nearest neighbors method, which does not perform well with high-dimensional data, we reduce the dimensionality with Linear Discriminant Analysis (LDA). LDA is a linear projection that projects the data along the directions of highest discrimination between classes. We chose to project to 3 LDA dimensions. Fig. 5 shows the features after the LDA projection for different gestures. The data points for each type of gesture are highly separable, which shows the effectiveness of our feature extraction process.

Figure 5: Features from different gestures after the LDA projection. For clarity, we only show the first two dimensions.

Finally, we use the projected features with the k-nearest neighbors method for classification, with k = 7.
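For reference, the dimensionality reduction and classification stages can be prototyped with scikit-learn roughly as follows. The random matrices below are placeholders for the 41-dimensional feature vectors, not our recorded data; the gesture and sample counts match the single-user experiments reported later.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Placeholder training data: 10 gestures x 10 samples, each a 41-dimensional
# feature vector (21 Euler fLT coefficients + 20 quantized flex values per hand).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 41))
y_train = np.repeat(np.arange(10), 10)

# LDA projection to 3 dimensions followed by a 7-nearest-neighbor classifier,
# mirroring the setup described in the text.
model = make_pipeline(
    LinearDiscriminantAnalysis(n_components=3),
    KNeighborsClassifier(n_neighbors=7),
)
model.fit(X_train, y_train)

# Classify one new gesture (again a placeholder feature vector).
print(model.predict(rng.normal(size=(1, 41))))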

3. EXPERIMENTS AND RESULTS

Three sets of experiments were conducted to determine the effectiveness of our proposed system.

3.1 Multi-user experiments

First, we would like to assess the generalizability of our algorithm. We trained our models on a single user and tested them on 8 different users (including the original user). This experiment includes 5 gestures for common words, using just one hand. 10 training samples were used per gesture.

For testing, each tester performed the calibration and then performed each gesture once.
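The per-gesture results reported in Table 1 can be tabulated as a confusion matrix computed as sketched below; the label lists are illustrative placeholders, not the actual tester responses.

from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical example: every tester's performed gesture (y_true) is paired
# with the label returned by the classifier (y_pred). The vocabulary shown is
# illustrative; the real experiment used 5 one-handed TSL words.
words = ["clever", "cold", "hi", "hungry"]
y_true = ["clever", "cold", "hi", "hungry", "clever"]
y_pred = ["cold",   "cold", "hi", "hungry", "clever"]

print(confusion_matrix(y_true, y_pred, labels=words))
print("accuracy:", accuracy_score(y_true, y_pred))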


Table 1: The confusion matrix from the multi-user experiments.

Table 1 shows the confusion matrix between the different gestures. The words "cold", "hi", and "hungry" are all classified correctly. On the other hand, "clever" and "cold" are very similar gestures, differing only in the raising of the thumb at the end of the gesture, so "clever" is misclassified as "cold" twice. The total classification accuracy is 92.5%, while the chance probability is 25%. This shows that the gloves can perform quite well even for users who are not in the training data.

3.2 Single-user experiments

Next, we evaluated the device when a single user is used for both training and testing. These experiments include 10 gestures with data from both hands. 10 training samples were used per gesture, and each gesture was then evaluated 10 times. We tested two configurations for this experiment: one with 3 LDA dimensions in total for both hands, and the other with 6 LDA dimensions.

Table 2: The confusion matrix from the single-user experiments with 3 LDA dimensions.

Table 3: The confusion matrix from the single-user experiments with 6 LDA dimensions.

Tables 2 and 3 show the confusion matrices for the 3 and 6 LDA dimension cases, respectively. The experiment with 3 LDA dimensions does not perform well, achieving only 38% accuracy. The words "good luck" and "miss" are misclassified in all trials. This shows that 3 LDA dimensions are not enough to discriminate between the different words. On the other hand, the setup with 6 LDA dimensions performs much better, achieving 76% accuracy. Almost every word was correctly recognized, except for "clever", "hot", and "sorry".

4. CONCLUSIONS AND FUTURE WORK

We presented a glove-based device with flex sensors and gyroscopes for TSL recognition. We used the fLT and LDA in conjunction to perform dimensionality reduction before applying the k-nearest neighbors classifier. From our experiments, our framework was able to recognize gestures in both the single-user case and the multi-user case. However, when the number of gestures increased, as in the single-user case, the accuracy dropped. We plan to explore other methods that can handle larger amounts of training data to improve the accuracy even further. Moreover, we plan to investigate other uses for our glove hardware, including medical diagnosis, virtual reality, and robot manipulation, as suggested in [1].

5. ACKNOWLEDGMENTS

We would like to thank Pasakorn Tongsan and Sutassanai Lahtum for their advice on the hardware configuration, and Chotirat Ann Ratanamahatana for her advice on data processing. The hardware cost was supported jointly by the National Software Contest, Thailand (NSC) and the Nitad 2017 committee (Faculty of Engineering, Chulalongkorn University).

6. REFERENCES

[1] L. Dipietro, A. M. Sabatini, and P. Dario. A survey of glove-based systems and their applications. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(4):461–482, 2008.

[2] P. Dreuw, T. Deselaers, D. Keysers, and H. Ney. Modeling image variability in appearance-based gesture recognition. In ECCV Workshop on Statistical Methods in Multi-Image and Video Processing, pages 7–18, 2006.

[3] H. El Hayek, J. Nacouzi, A. Kassem, M. Hamad, and S. El-Murr. Sign to letter translator system using a hand glove. In e-Technologies and Networks for Development (ICeND), 2014 Third International Conference on, pages 146–150. IEEE, 2014.

[4] L. Lei and Q. Dashun. Design of data-glove and Chinese sign language recognition system based on ARM9. In Electronic Measurement & Instruments (ICEMI), 2015 12th IEEE International Conference on, volume 3, pages 1130–1134. IEEE, 2015.

[5] National Statistical Office of Thailand. Number of persons with disabilities aged 5 years and over having difficulties or health problems by type of ... http://www.nso.go.th/, 2012.

[6] V. Pathak, S. Mongia, and G. Chitranshi. A framework for hand gesture recognition based on fusion of flex, contact and accelerometer sensor. In Image Information Processing (ICIIP), 2015 Third International Conference on, pages 312–319. IEEE, 2015.

[7] S. Saengsri, V. Niennattrakul, and C. A. Ratanamahatana. TFRS: Thai finger-spelling sign language recognition system. In Digital Information and Communication Technology and its Applications (DICTAP), 2012 Second International Conference on, pages 457–462. IEEE, 2012.
