2016 XVIII Symposium on Virtual and Augmented Reality

Evaluating Sign Language Recognition Using the Myo Armband

João Gabriel Abreu, Voxar Labs, CIn - UFPE, Recife, Brazil ([email protected])
João Marcelo Teixeira, Voxar Labs, DEINFO - UFRPE, Recife, Brazil ([email protected])
Lucas Silva Figueiredo, Voxar Labs, CIn - UFPE, Recife, Brazil ([email protected])
Veronica Teichrieb, Voxar Labs, CIn - UFPE, Recife, Brazil ([email protected])

Abstract—The successful recognition of sign language gestures by computer systems would greatly improve communication between the deaf and the hearing. This work evaluates the use of electromyogram (EMG) data provided by the Myo armband as features for the classification of 20 stationary letter gestures from the Brazilian Sign Language (LIBRAS) alphabet. The classification was performed by binary Support Vector Machines (SVMs), trained with a one-vs-all strategy. The results obtained show that it is possible to identify the gestures, but substantial limitations were found that would need to be tackled by further studies.

Keywords—gesture recognition; support vector machine; sign language; LIBRAS.

I. INTRODUCTION

Most interaction with computer systems is still done through well-known input devices such as mice and keyboards. In order to bring the use of technology closer to what happens in daily life, there is a trend driving research in gesture recognition, based on the principle that users should not need to learn how to interact with systems. Interaction should be as natural as possible and mimic preconceived experiences. For example, it may be considered intuitive to point at something in order to select it, or to use the hand action of grabbing an object to move it from one place to another.

Deaf people use gestures to communicate every day. Therefore, it would be useful to map such gestures to textual input in applications. This would later allow the mapping of such gestures to general input for interactive applications, as well as the creation of real-time translation systems between spoken and sign languages. Several sensing technologies can be applied to achieve this mapping, including data gloves, vision-based systems, and electromyogram (EMG) sensors.

In 2013, Thalmic Labs launched the Myo armband, a wireless device that integrates inertial and EMG sensors. The Myo armband is light and easy to wear, as well as affordable and shipped worldwide. It is therefore widely accessible, and meaningful results obtained with this device can potentially reach broad use and impact in a particular application domain. This work evaluates the use of the Myo armband for the recognition of gestures from LIBRAS (Brazilian Sign Language). To the best of our knowledge, this is the first work to tackle the recognition of LIBRAS using the Myo device.

The LIBRAS alphabet consists of twenty-six gestures for the letters, as shown in Figure 1. It is important to note that most of them are stationary, but six of them, namely the letters H, J, K, X, Y and Z, involve some sort of movement. The proposed evaluation focuses on the recognition of stationary gestures only, as an entry point for research using the Myo device to perform sign language recognition. The EMG signals obtained from the Myo are used to train and classify the input gestures with a Supervised Machine Learning approach based on Support Vector Machines (SVMs).

Figure 1. LIBRAS alphabet [8].

The remainder of this work is organized as follows. Section 2 reviews important related works that tackle the problem of gesture recognition, focusing specifically on sign language interaction. Section 3 provides an overview of how the SVM model can be applied to the problem of gesture recognition and of the approach we selected. Section 4 details the Myo armband, listing its features and the limitations of its available sensors. Section 5 describes the complete methodology used in the development of our sign language gesture classification approach. Section 6 analyzes the results obtained with the proposed system and explains the reasons behind the classification quality obtained for different letters. Finally, Section 7 concludes the work and provides directions for improving the proposed classification approach.

II. RELATED WORK

Existing works on Sign Language Recognition are mainly divided into three categories, based on how the data is acquired from the users. A large number of works rely on special gloves [11][13] containing sensors capable of measuring finger positions. Although these achieve some of the best recognition results, the required sensor gloves are cumbersome and often very expensive. Most of the remaining works use cameras and Computer Vision techniques [17][20] to acquire and process images of the performed gestures. The main problem with this approach is that the results depend greatly on environmental conditions, such as illumination and the colors of objects in the scene. Finally, there are works that rely on electromyogram (EMG) signals [12][10][9], which measure the electrical activity of muscles, to provide features for classification. This approach has only recently started being explored, and it has advantages over both alternatives: EMG sensors are not inconvenient to wear, unlike sensor gloves, and are not sensitive to environmental conditions, unlike cameras.

In particular, some works have chosen to use a recent EMG sensor, the Myo armband, to obtain features for gesture classification. In [15], the Myo was used to recognize gestures from American Sign Language (ASL); the work combines inertial sensors and EMG data to recognize 20 dynamic gestures, using Dynamic Time Warping (DTW) to align the temporal series. In [16], 27 ASL gestures are targeted for recognition using Bagged Trees or SVMs. In [1], 10 different Thai Sign Language gestures are used to test recognition with the Myo; for feature extraction the work uses moving variance and mean absolute value, while the classification is performed by an Artificial Neural Network (ANN).

III. MACHINE LEARNING FOUNDATIONS

Machine Learning is, as defined by T. Mitchell in [14], the task of improving a computer program's performance on a given set of tasks, according to some metric, through experience. It can be broadly divided into three categories:
• Supervised Learning: learning a mapping from a set of inputs to a set of outputs, given a training set of mapped examples;
• Unsupervised Learning: structuring data from an input space, given a training set of instances. A classic example is the task of clustering data;
• Reinforcement Learning: learning to perform a certain task in a dynamic environment, usually with feedback only at the end of a series of decisions. Learning to play a game is a classic example.

A. Supervised Learning and Classification

Classification is the task of hypothesizing which class, from among a set C of classes, best corresponds to a presented pattern [7]. Although this can be performed to some extent through Unsupervised Learning, by clustering and then labeling the clusters, it is mostly a Supervised Learning task. In that setting, the computer program learns to classify new input by training on a set of labeled instances, or examples. Each instance is described by a vector of features, which are the values of characteristics relevant to the classification. In Supervised Learning, each training instance also carries a class label, which is the desired output for the classifier. Thus, classification in this context is the task of learning a mapping from a set of input feature vectors to a set of C possible class labels. Alternatively, from a geometric standpoint, each instance can be seen as a point in the space of all possible feature values, called the feature space. Classification is then the task of partitioning the feature space into classification regions, within which all points are mapped to the same class. Indeed, many Machine Learning models learn by constructing and adjusting partitions of the feature space.

B. Binary and Multiclass Classification

Most Machine Learning models focus on the problem of binary classification, the task of mapping input to one of two classes. Some models can be extended to solve more complex and general problems, such as Multiclass Classification and Multi-label Classification, but a common approach is to combine binary classifiers. These classifiers are usually trained following either a one-vs-all or a one-vs-one strategy, analyzed below. A more detailed analysis of the two approaches, as well as other methods for combining binary classifiers, can be found in [7][2].

The one-vs-all strategy is to create one binary classifier per class and train it to distinguish the instances of its associated class from the instances of all the other classes. Each classifier is trained with the data of all classes, but only two labels are used on the data set, usually positive and negative: the positive label is assigned to the instances of the classifier's corresponding class, and the negative label to instances of any of the remaining classes. New instances are presented to all the binary classifiers and their outputs are compared. For this comparison, it is essential that the binary classifiers output not only a label but also a real value that can be compared as a confidence score, such as a probability estimate. The new instance is then labeled as a member of the class whose binary classifier output the highest confidence score. This method also allows new input to be labeled as belonging to none of the classes, in the case where all the binary classifiers label it as a member of their negative class.

The one-vs-one strategy consists of creating C(C − 1)/2 binary classifiers, where C is the number of possible classes. Each binary classifier is trained to distinguish between two of the C classes, using only the data of those two classes. As in the one-vs-all strategy, new instances are presented to all the binary classifiers, but here a voting scheme is applied, with each classifier voting for the class label it outputs. Ties between class labels after the voting can again be broken by a confidence score. In general, this method is used less than one-vs-all, because it requires many more binary classifiers whenever C is not very small.
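To make the one-vs-all combination concrete, the following is a minimal sketch in Python using scikit-learn's SVC, a wrapper around LIBSVM. The function names and toy interface are ours, not from the paper; the sketch only illustrates the decision rule just described, including the rejection case where no classifier claims the instance.

```python
# Minimal one-vs-all sketch (illustrative names, not the authors' code).
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, y, classes):
    """Train one binary SVM per class: instances of that class are
    labeled 1 (positive), all other instances 0 (negative)."""
    classifiers = {}
    for c in classes:
        binary_labels = (y == c).astype(int)
        clf = SVC(kernel='rbf', probability=True)  # probability=True enables confidence scores
        clf.fit(X, binary_labels)
        classifiers[c] = clf
    return classifiers

def classify(classifiers, x, threshold=0.5):
    """Label x with the class whose binary SVM reports the highest
    positive-class probability; return None if no classifier accepts it."""
    scores = {c: clf.predict_proba([x])[0][1] for c, clf in classifiers.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```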

C. Support Vector Machines

The Support Vector Machine (SVM) is a powerful Machine Learning model that evolved from the Perceptron. It is a binary classification model that computes the maximum-margin hyperplane capable of separating the data (Figure 2). Unlike Perceptrons, however, SVMs are not limited to linearly separable problems, because the hyperplane can be computed in an arbitrarily high-dimensional space to which the data is lifted. Moreover, the transformation applied to the data need not be linear, which gives the SVM model its flexibility. This transformation is done through a kernel, a mathematical function equivalent to the dot product of two vectors in an arbitrarily high-dimensional space. In other words, given two vectors x_i and x_j, the kernel function K(x_i, x_j) is equivalent to φ(x_i) · φ(x_j), where φ is a function that maps the vectors to another space. Interestingly, the specific mapping function φ need not be known, as long as K is proven to satisfy the necessary conditions for being a kernel.

Figure 2. Maximum-margin hyperplane that separates the points [22].

A brief mathematical definition of the SVM model is provided in the remainder of this section. The mathematical proofs and foundations for kernels and the SVM model are rather involved and fall outside the scope of this article; the interested reader is referred to [3][5][6].

Given l training vectors x_i ∈ R^n, with i = 1, ..., l, divided into two classes, and a label vector y ∈ R^l such that y_i ∈ {1, −1}, the SVM model solves the optimization problem in equation (1):

\min_{w,b,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i, \qquad (1)

subject to the constraints in equations (2) and (3):

y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \qquad (2)

\xi_i \geq 0, \quad i = 1, \ldots, l, \qquad (3)

where φ(x_i) maps x_i into a higher-dimensional space, C > 0 is the regularization parameter, and α_i and ξ_i are the weight and slack variables for sample i, respectively. After the optimization problem is solved, the optimal w satisfies equation (4):

w = \sum_{i=1}^{l} y_i \alpha_i \phi(x_i), \qquad (4)

and the decision function for a given feature vector x_q is given by equation (5):

f(x_q) = \operatorname{sign}\left( w^T \phi(x_q) + b \right) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x_q) + b \right). \qquad (5)

For this work, we have chosen to use the RBF kernel, as will be explained further in Section 5. It is given by equation (6):

K(x_i, x_j) = \exp\left( -\gamma \| x_i - x_j \|^2 \right), \quad \gamma > 0. \qquad (6)

Consequently, the SVM model we used had two parameters left for optimization: C and γ. The process used for optimizing them is detailed in Section 5.
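As an illustration of equations (5) and (6), the sketch below computes the RBF kernel and the resulting decision function directly. The support vectors, weights, and bias are placeholders for whatever an SVM solver would produce; this is not the paper's implementation.

```python
# Direct evaluation of equations (5) and (6); inputs are placeholders.
import numpy as np

def rbf_kernel(x_i, x_j, gamma):
    """Equation (6): K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def decision_function(x_q, support_vectors, alphas, labels, b, gamma):
    """Equation (5): sign of the kernel-weighted sum over training points.
    For non-support vectors alpha_i = 0, so only support vectors contribute."""
    total = sum(a * y * rbf_kernel(x_i, x_q, gamma)
                for x_i, a, y in zip(support_vectors, alphas, labels))
    return np.sign(total + b)
```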

IV. THE MYO DEVICE

Figure 3. The Myo armband [19].

The Myo armband (Figure 3) possesses eight sensors capable of capturing EMG data from the arm muscles. It also provides Inertial Measurement Unit (IMU) data through an accelerometer, a gyroscope, and a magnetometer. By default, the Myo is capable of recognizing five hand gestures, which developers can use in conjunction with IMU data to control their applications. Additionally, the Myo provides raw EMG data at a frequency of 200 Hz, each reading consisting of a timestamp and the values captured by the eight sensors, in the range [−128, 127]. In this work, we have used these EMG readings (Figures 4 and 5) as features for the classification of new hand gestures.

Figure 4. EMG readings used as a training set for letter A.

Figure 5. EMG readings used as a training set for letter I.

There are, however, some limitations to this approach, which are inherent to the device. The first problem in using this data to train new gestures is that the sensors' values cannot simply be used as features for classification. This is because the value of each sensor depends not only on the gesture being made, but also on the position of the Myo on the arm: for two distinct positions, the sensors in contact with each muscle may change. Therefore, either some preprocessing must be done to ensure robustness against changes in position, or the position of the device on the arm must be assumed to always be the same. The second problem is that EMG data varies wildly from person to person, depending on factors such as the amount of fatty tissue, hair, and sweat on the arm [18]. In order to correctly classify gestures made by different individuals, it would be necessary to collect an extremely large amount of training data, in an attempt to capture all relevant variations. In this work, we have simplified these problems by assuming that the device would always be used by the same individual and that the position of the Myo on the arm would always be the same.
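As a rough sketch of how an application might consume this stream, the snippet below buffers timestamped eight-channel readings into a NumPy array. The `on_emg` callback is hypothetical; the actual registration mechanism depends on the Myo SDK and is not described in this paper.

```python
# Buffering the raw EMG stream described above: 200 Hz readings of
# eight sensor values in [-128, 127]. The on_emg callback name is
# hypothetical, not a Myo SDK API.
import numpy as np

emg_buffer = []  # each entry: (timestamp, eight sensor values)

def on_emg(timestamp, values):
    """Hypothetical callback invoked once per 200 Hz EMG reading."""
    assert len(values) == 8
    emg_buffer.append((timestamp, list(values)))

def drain_buffer():
    """Return buffered readings as an (n, 8) array and clear the buffer."""
    global emg_buffer
    data = np.array([v for _, v in emg_buffer], dtype=np.float32)
    emg_buffer = []
    return data
```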

V. METHODOLOGY

From among the twenty-six letters of LIBRAS, we excluded those that involve some sort of movement, namely the letters H, J, K, X, Y and Z. We captured raw EMG data for the remaining letters using the Myo armband, and obtained the training sets after a preprocessing step. This data was later used to train several binary SVMs, which are used for classifying new gestures provided by the user. It is important to note that the gestures were performed in a laboratory context, with extra strength being applied in order to amplify the EMG readings. In more natural contexts, the readings would be less intense, as can be seen in Figures 6 and 7.

Figure 6. EMG readings for letter R performed naturally.

Figure 7. EMG readings for letter R with extra strength applied.

The remainder of this section details how the data was obtained, what computations were performed on it, and finally the structure of the SVM classifiers trained from it.

A. Data Capture and Processing

The training sets for each letter initially consisted of approximately 28,500 EMG reading samples from the Myo armband. For a given letter, these samples were obtained by maintaining the appropriate gesture for one or more lengths of time, while storing the raw EMG readings of the Myo's eight sensors. Moreover, in an attempt to avoid noise caused by moving the hand into or out of position, the readings equivalent to a reasonable number of seconds were discarded from both the beginning and the end of every captured block of data.

After obtaining the twenty initial training sets, we preprocessed them in order to improve the quality of the eight EMG sensor values as features for classification. The first preprocessing step was to perform a full-wave rectification on the training sets, which amounts to computing the absolute value of each sensor reading. EMG signals have both positive and negative components, but muscle activation is measured solely by the signal's amplitude. By inverting the signs of the negative components, we cause each gesture's readings to be spread across a smaller region of the feature space, which allows the SVM classifier to construct better decision boundaries. The second preprocessing step was to compute the mean of every 50 training samples. The first reason for this was to provide some robustness against noise. The second was to perform classification on time blocks of coarser granularity, since the Myo's sample rate of 200 Hz is too fine-grained for hand gestures. We chose 50 samples for the mean computation because it corresponds to an interval of 0.25 seconds, which is more adequate. After this step, each letter's data set was reduced to about 570 instances.
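The two preprocessing steps described above translate directly into array operations. The sketch below assumes `raw` is an (n, 8) array of EMG readings for one letter; the trim margin is illustrative, since the paper only states that a reasonable number of seconds was discarded at each end.

```python
# Sketch of the preprocessing pipeline: trim, rectify, average blocks.
import numpy as np

def preprocess(raw, trim_seconds=2, rate=200, window=50):
    # Discard readings near the start and end of the capture block,
    # where the hand was moving into or out of position (margin is
    # an assumed value).
    margin = trim_seconds * rate
    trimmed = raw[margin:len(raw) - margin]
    # Step 1: full-wave rectification -- only the amplitude of the
    # EMG signal measures muscle activation.
    rectified = np.abs(trimmed)
    # Step 2: mean over each block of 50 samples (0.25 s at 200 Hz),
    # for noise robustness and coarser time granularity.
    n_blocks = len(rectified) // window
    blocks = rectified[:n_blocks * window].reshape(n_blocks, window, -1)
    return blocks.mean(axis=1)  # (n_blocks, 8) feature vectors
```

With roughly 28,500 samples per letter, averaging in blocks of 50 yields the approximately 570 instances per training set reported above.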

B. Letter Classifiers

The Machine Learning model chosen for this work was the SVM, because it is a very powerful model with some advantages over the alternatives. SVMs have few parameters to optimize, which makes them preferable to neural network models. Additionally, SVMs classify faster than instance-based learning models such as k-NN (k-Nearest Neighbors). Classification time is especially important here because of our need for real-time classification with several SVMs, as will be elaborated below. The SVM implementation we chose was the one provided by LIBSVM [4], a robust and consolidated library.

Given a new gesture, it could be any one of the twenty letters, or an unknown gesture. As such, we faced a classification problem in which a new instance could belong to one of twenty classes, but could also belong to none. The best approach to this type of problem is to construct several binary classifiers, each capable of distinguishing between one class and the rest, in a one-vs-all manner. Following this idea, we constructed twenty SVM classifiers, each capable of distinguishing one letter from the rest. For each classifier's training set, we labeled the corresponding letter's data as class 1 and the data of all other letters as class 0. To classify a new gesture, we check the labels assigned to it by each classifier. When more than one classifier recognizes the gesture as its corresponding letter, we compare LIBSVM's probability estimates for each of the classifications and label the gesture as the letter with the highest value.

The last step in the training of each SVM was to optimize its parameters. Following the suggestions in [21], we chose the RBF kernel and optimized the remaining parameters, C and γ, through a grid search. Equations (7) and (8) describe the exponentially growing values used for the parameters:

C = 2^{-5}, 2^{-3}, \ldots, 2^{15} \qquad (7)

\gamma = 2^{-15}, 2^{-13}, \ldots, 2^{3} \qquad (8)

The performance of each SVM was measured through 10-fold cross-validation, computed using the LIBSVM library. The best results found for the twenty classifiers are shown in Table I.
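The grid search described above can be reproduced, under our assumptions, with scikit-learn's GridSearchCV, which wraps the same LIBSVM implementation (the authors called LIBSVM directly). Here X and y stand for one letter's preprocessed features and binary labels:

```python
# Sketch of the parameter search: an RBF-kernel SVM with C and gamma
# swept over the exponential grids of equations (7) and (8), scored by
# 10-fold cross-validation. X, y are assumed inputs for one letter.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [2.0 ** e for e in range(-5, 16, 2)],      # equation (7)
    'gamma': [2.0 ** e for e in range(-15, 4, 2)],  # equation (8)
}

def optimize_letter_classifier(X, y):
    search = GridSearchCV(SVC(kernel='rbf', probability=True),
                          param_grid, cv=10)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_, search.best_score_
```

Note that some values in Table I fall between the grid points of equations (7) and (8) (e.g., C = 45.2548 ≈ 2^5.5), suggesting a finer grid was also searched; the sketch keeps the grids as stated.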

Table I. Cross-validation accuracy of optimized C and γ parameters.

| Letter Gesture | C | γ | Training Set Accuracy |
|---|---|---|---|
| A | 4.0000 | 11.3137 | 99.87% |
| B | 45.2548 | 22.6274 | 98.85% |
| C | 16.0000 | 2.8284 | 98.87% |
| D | 1024.0 | 1.0000 | 99.13% |
| E | 90.5096 | 8.0000 | 98.48% |
| F | 5.6568 | 22.6274 | 98.39% |
| G | 45.2548 | 2.8284 | 99.10% |
| I | 8.0000 | 8.0000 | 98.72% |
| L | 11.3137 | 5.6568 | 98.98% |
| M | 32.0000 | 2.0000 | 98.50% |
| N | 45.2548 | 1.4142 | 98.31% |
| O | 11.3137 | 11.3137 | 99.11% |
| P | 2.8284 | 16.0000 | 98.10% |
| Q | 22.6274 | 2.8284 | 98.81% |
| R | 11.3137 | 2.8284 | 99.50% |
| S | 2.0000 | 8.0000 | 99.34% |
| T | 22.6274 | 11.3137 | 97.68% |
| U | 16.0000 | 5.6568 | 96.61% |
| V | 2048.0 | 1.4142 | 96.01% |
| W | 4.0000 | 5.6568 | 98.91% |

VI. ANALYSIS OF RESULTS

After the classifiers' training was completed, we created an application to perform classification of new gestures in real time. In order to measure the performance of the classifier, we maintained each gesture for around 30 seconds, generating approximately 110 samples. The performance of the classifier could then be estimated by the number of correctly classified samples among those (Table II). It is important to note that the capture of this data was done blindly, i.e., feedback on the classification's correctness was only provided after the capture was completed. However, slightly different hand and wrist positions were performed during the data capture, so as to provide some variability.

Table II. Real-time classification accuracy.

| Letter Gesture | Accuracy |
|---|---|
| A | 4% |
| B | 19% |
| C | 49% |
| D | 64% |
| E | 76% |
| F | 8% |
| G | 40% |
| I | 55% |
| L | 77% |
| M | 8% |
| N | 57% |
| O | 47% |
| P | 8% |
| Q | 48% |
| R | 91% |
| S | 46% |
| T | 4% |
| U | 5% |
| V | 22% |
| W | 95% |

An analysis of the accuracy for the different letters showed some interesting results. As expected, letter gestures that could not be made with substantial strength, namely F, M, P, T and U, due to the wrist or finger positions required, presented some of the worst results. Conversely, letter gestures that could be performed well with extra strength, such as L, R, S and W, presented some of the best results, with most of the remaining letters falling somewhere in the middle. Additionally, some letters presented very good results because their gestures are very distinct from the others; examples are E, L, R and W. It is therefore not surprising that R and W showed the best results, since their gestures can easily be performed with extra strength and are very distinct from the others.

Likewise, some letters' results were worsened partly because their gestures are similar to others. An example is the letter A, which was sometimes misclassified as S, the two gestures being very similar. Although most failed classifications occurred simply because A's binary SVM failed to recognize the gesture, there were cases in which both letters were recognized at the same time and S's binary classifier presented a higher confidence score. Similar misclassifications occurred between the pairs V/R, M/N, and G/Q. Although some of these pairs may not seem similar, such as G and Q, the gestures exert similar tension on the same fingers, which is what produces similar EMG readings.

Finally, an important observation is that the classification was found to be very sensitive to slight changes in position and strength applied. Most of the time, the gestures would only be correctly classified if performed exactly as they were during training. This is particularly true for pairs of gestures that are similar, as discussed before.

VII. CONCLUSION

In this work, we have evaluated the usefulness of the Myo armband's EMG readings as features for the classification of static letter gestures from LIBRAS. We captured training data sets for twenty letters of the LIBRAS alphabet, preprocessed them in order to facilitate classification and mitigate noise, and trained twenty binary SVMs to distinguish the different letter gestures. Based on the results obtained, we believe that it is very difficult to perceive fine finger gestures solely with EMG data, and that classifiers based exclusively on this type of data would perform poorly compared to ones using vision techniques or specialized gloves. However, the results were significant enough to make us believe that EMG data could be a very useful addition to systems that also rely on other types of data. One remarkable advantage of EMG sensors such as the Myo is their portability. For sign language recognition, in which portability and convenience are key, a hybrid system using both visual and EMG data could be very compact and considerably less cumbersome than specialized gloves.

Of course, the limitations mentioned here would also need to be tackled by future works, and we have some suggestions on how to address them. With a massive data set that accounts for all the relevant variations, the problem of considerable person-to-person changes in EMG readings could be mitigated; alternatively, extensive calibration for each user could greatly reduce the issue. Dynamic gestures, such as the six letters we excluded from this work, could be integrated with the use of the Myo's IMU sensors. Moreover, applications could be made robust to changes in the Myo's position on the arm, perhaps simply with larger amounts of training data, or through some computational mapping performed on the vectors of EMG readings. Finally, other Machine Learning models could be used, and may present better results.

REFERENCES

[1] V. Amatanon, S. Chanhang, P. Naiyanetr, and S. Thongpang. Sign language-Thai alphabet conversion based on electromyogram (EMG). In Biomedical Engineering International Conference (BMEiCON), 2014 7th, pages 1–4. IEEE, 2014.

[2] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 144–152, New York, NY, USA, 1992. ACM.

[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[6] P. Domingos. Machine learning. https://www.coursera.org/course/machlearning.

[7] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, 2012.

[8] L. Falcao. Surdez, cognição visual e libras: estabelecendo novos diálogos. Ed. do autor, 2010.

[9] V. E. Kosmidou and L. J. Hadjileontiadis. Sign language recognition using intrinsic-mode sample entropy on sEMG and accelerometer data. IEEE Transactions on Biomedical Engineering, 56(12):2879–2890, Dec 2009.

[10] V. E. Kosmidou, L. J. Hadjileontiadis, and S. M. Panas. Evaluation of surface EMG features for the recognition of American Sign Language gestures. In Engineering in Medicine and Biology Society, 2006. EMBS '06. 28th Annual International Conference of the IEEE, pages 6197–6200, Aug 2006.

[11] K. Li, Z. Zhou, and C.-H. Lee. Sign transition modeling and a scalable solution to continuous sign language recognition for real-world applications. ACM Trans. Access. Comput., 8(2):7:1–7:23, Jan. 2016.

[12] Y. Li, X. Chen, J. Tian, X. Zhang, K. Wang, and J. Yang. Automatic recognition of sign language subwords based on portable accelerometer and EMG sensors. In International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ICMI-MLMI '10, pages 17:1–17:7, New York, NY, USA, 2010. ACM.

[13] R.-H. Liang and M. Ouhyoung. A real-time continuous gesture recognition system for sign language. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, pages 558–567, Apr 1998.

[14] T. Mitchell. Machine Learning. McGraw-Hill International Editions. McGraw-Hill, 1997.

[15] P. Paudyal, A. Banerjee, and S. K. Gupta. SCEPTRE: a pervasive, non-invasive, and programmable gesture recognition technology. In Proceedings of the 21st International Conference on Intelligent User Interfaces, pages 282–293. ACM, 2016.

[16] C. Savur. American sign language recognition system by using surface EMG signal. 2015.

[17] T. Starner, J. Weaver, and A. Pentland. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375, Dec 1998.

[18] Thalmic Labs. Big data: Raw EMG free for developers in December. http://developerblog.myo.com/big-data/, 2014.

[19] Thalmic Labs. The Myo armband, black. Retrieved from: https://www.myo.com/techspecs, n.d.

[20] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In Computer Vision, IEEE International Conference on, page 363, 1998.

[21] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification, 2010.

[22] Wikipedia. Graphic showing the maximum separating hyperplane and the margin. Retrieved from: https://en.wikipedia.org/wiki/Support_vector_machine, 2008.
