UTGeR: A User-Independent Technique for Gesture Recognition¹

Cyrus Shahabi, Farid Parvini
Computer Science Department, University of Southern California
Los Angeles, California 90089-0781
[shahabi, fparvini]@usc.edu
Abstract

We propose a novel approach for recognizing hand gestures by analyzing the data streams generated by sensors attached to the human hand. In our approach, we abstract out a signature for each gesture based on the 'range of motion' of the sensors (joints) involved in making that gesture. Since the relative range of motion of each joint compared to the other joints is unique for a given gesture, it provides a unique signature for that gesture across different users. Based on this observation, we propose a hand gesture recognition approach that addresses the major challenge of user-dependency. We apply our approach to recognizing ASL signs and show that we can recognize static ASL signs with no training. Our preliminary experiments demonstrate more than 75% accuracy in recognition of static ASL signs.
1 Introduction
Automated gesture recognition has been an active area of research in human-computer interaction and sign language translation [1]. Recently, in order to facilitate natural interaction (beyond keyboard and mouse) with computer applications, users are tracked and monitored through various sensory devices such as tracking devices on their heads, hands and legs, video cameras, and haptic devices. The data received from these devices can then be analyzed to recognize the gesture a user has made. There are two major approaches for accomplishing this: machine-vision based approaches, which analyze video and image data of a hand in motion [2], and haptic based approaches, which analyze the haptic data received from a sensory device (e.g., a sensory glove). Since some gestures are very similar to others, it is very hard to distinguish them with vision-based approaches. For example, the letters 'A', 'M', 'N', 'S', and 'T' are all signed with a closed fist (Figure 1); the amount of finger occlusion is high and, at first glance, these five letters can appear to have the same posture. On the other hand, the major challenge in recognizing gestures captured by a haptic device arises from the diversity of the data generated by different users: making the same gesture, different users produce different data. Several machine learning techniques have been proposed to address this issue by using datasets from different users in their training phase [4][5][6]. The results of these techniques are directly affected by the data set chosen for the training phase and consequently they are user-dependent. In addition, they require a large amount of training data.

In this paper, we propose a User-independent Technique for Gesture Recognition (termed UTGeR) that recognizes single and consecutive static signs. UTGeR is a novel algorithmic approach for hand gesture recognition that is distinct in the following respect: while it is user-independent, it does not require any sort of training, which is a must in almost all other approaches to hand gesture recognition.

While the focus of this paper is on the problem of classification in the gesture recognition domain, our approach has broader applicability. With classification, each unknown data sample is given the label of its best match among known samples in a database. Hence, if there is no label for a group of samples, traditional classification approaches fail to recognize these input samples. For example, in human-computer interaction, it is quite likely that the application needs to recognize some specific set of gestures or postures that cannot be labeled (e.g., frustration). Our approach can find such sets of similar gestures without requiring them to be labeled.

¹ This research has been funded in part by NSF grants IIS-0238560 (PECASE) and IIS-0307908, and unrestricted cash gifts from Microsoft. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
2 Motivation
One application of gesture recognition is helping the disabled interact with computers, for instance by interpreting sign language. To motivate our research, we focus on recognizing American Sign Language (ASL) signs as an example of a well-defined set of hand gestures [3]. ASL is a complex visual-spatial language used by the Deaf community in the United States and Canada. Each sign in ASL roughly corresponds to a concept, such as a thing or an event, but there is not necessarily an exact equivalence with English words or with the words of any spoken language. ASL consists of two types of signs: static and dynamic. Static signs are those that, according to the ASL rules, require no hand movement. All letters of the ASL alphabet except 'J' and 'Z' are static signs. In contrast, dynamic signs involve movement. Examples of dynamic signs are the two ASL letters 'J' and 'Z', and some ASL words such as 'yellow' (which is made by moving the hand while the static sign 'Y' is held).
Figure 1 - ASL static signs 'T', 'I', 'M', 'E' and the finger-spelling representation of the word 'TIME'

In this paper, we focus on recognizing static signs or groups of consecutive static signs, known as finger-spelling. Examples of these signs are shown in Figure 1.
3 UTGeR: A User-Independent Technique for Gesture Recognition
We start from the simple observation that all forms of hand gestures involve finger-joint movements. To capture this movement, we use the concept of the range of motion at each joint to develop our approach. Range of Motion (ROM) quantifies joint movement as the number of degrees between the starting position of an axis and its position at the end of its full range of movement. For example, if the position of an axis to which a joint belongs changes from 20° to 50° with respect to a fixed axis, the range of motion for that joint is 30°. We can compute the range of motion per joint from the sensor values acquired by the sensory device. The main intuition behind our approach is that the range of motion of each joint of the hand participating in making a gesture is a user-independent characteristic of that gesture and provides a unique signature for that gesture across different users.

We now discuss our approach in more detail. As mentioned earlier, the process of haptic gesture recognition starts with collecting data from the sensors attached to the hand of a user. At each sensor clock tick, the sensory device driver captures one sample by acquiring data from all n sensors of the device. An example of the data collected from a haptic device with 22 sensors is shown in Figure 2.
Figure 2 - A sample of data collected from a haptic device with 22 sensors

Given a sensory device with n sensors (as shown in Figure 5), each ASL static sign can be represented as a set of n values. Let us call this set $S(\alpha) = (s_1, \ldots, s_n)$, where $\alpha$ is an ASL static sign and $S$ is the set of its sensor values. An example of $S$ would be one line of data in Figure 2. If an unknown gesture is represented by $S'$, one naive approach is to compare $S'$ with each pre-collected $S(\alpha)$; if they are identical, the unknown gesture is the same as the gesture with the identical $S$. In order to compare two $S$'s, we require a metric to measure their equality. Since each $S$ can be considered a point in an n-dimensional space, we define two $S$'s to be identical if their distance according to a metric criterion is less than a threshold. In other words:
$$S = S' \iff Dis(S, S') < \varepsilon$$
The most straightforward approach for measuring the distance between two samples is to use a Minkowski measure such as the Euclidean distance. Given two samples $S = (s_1, \ldots, s_n)$ and $S' = (s'_1, \ldots, s'_n)$:

$$Dis(S, S') = \left( \sum_{i=1}^{n} (s_i - s'_i)^2 \right)^{\frac{1}{2}}$$
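As an illustration, here is a minimal Python sketch of this naive sample-to-sample comparison; the function names and the threshold handling are our own illustrative choices and not part of the paper:

```python
import math

def euclidean_distance(s, s_prime):
    """Euclidean (L2) distance between two n-dimensional sensor samples."""
    if len(s) != len(s_prime):
        raise ValueError("samples must have the same number of sensor values")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, s_prime)))

def naive_match(s, s_prime, epsilon):
    """Naive equality test: two samples are considered identical if their
    distance falls below the threshold epsilon."""
    return euclidean_distance(s, s_prime) < epsilon
```

For a 22-sensor glove, `s` and `s_prime` would each be a list of 22 joint-angle readings, i.e., one line of the data shown in Figure 2.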
This naive approach fails in most cases due to the following restrictions:

• Different users making the same gesture generate different S, i.e., S is not unique across users.
• Inevitable hand motion generates a different S even when the same gesture is made by the same user at different times.
• Sensor inaccuracy while collecting samples results in noisy data, which in turn generates different S.

Our objective is to transform $S(\alpha)$ into a set $\varsigma(\alpha)$ that is unique across different users. We call $\varsigma(\alpha)$ the 'signature' of the sign $\alpha$ and show that it is unique for each sign across different users. Towards this end, we require each gesture to start from a starting posture. Though this posture is arbitrary, it needs to be consistent across all experiments. One possible starting posture is shown in Figure 3.
Figure 3- One possible starting posture (left) and ASL sign ‘L’ (right)
We represent the data associated with this starting posture by $S_0$. For each $S$ collected at each point in time, we compute $(S - S_0)$ as follows:
$$S = (s_1, \ldots, s_n),\; S_0 = (s_1^0, \ldots, s_n^0) \;\Rightarrow\; S - S_0 = (s_1 - s_1^0, \ldots, s_n - s_n^0)$$

This set represents the range of motion of all the sensors participating in making the gesture at that point in time. In order to compare sets more efficiently, we first form the set of absolute values of the ROM, i.e., $|S - S_0|$. We then normalize the values: we find the maximum (Max) and minimum (Min) values in $|S - S_0|$, subtract Min from each value in the set, and finally divide each value by (Max − Min). The resulting set has values between 0 and 1. Finally, we discretize this set by choosing a discretization parameter k (k > 1). For example, if k = 2, we replace each value in the set with 0 if it is less than 0.5, and with 1 otherwise. An example of this process is shown in Figure 4.
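As a concrete sketch of this signature construction, the following Python function applies the steps above; the function name, argument order, and the handling of the degenerate case where no joint moves are our own illustrative choices:

```python
def compute_signature(s, s0, k=2):
    """Build the discretized range-of-motion signature of a sample.

    s  : sensor values collected while making the gesture
    s0 : sensor values of the agreed-upon starting posture
    k  : discretization parameter (k > 1); k=2 maps each value to 0 or 1
    """
    # Absolute range of motion of each sensor relative to the starting posture.
    rom = [abs(si - s0i) for si, s0i in zip(s, s0)]

    # Min-max normalization into [0, 1].
    lo, hi = min(rom), max(rom)
    if hi == lo:
        normalized = [0.0] * len(rom)   # degenerate case: no joint moved
    else:
        normalized = [(v - lo) / (hi - lo) for v in rom]

    # Discretization into k levels; e.g., with k=2, values below 0.5 become 0, the rest 1.
    return [min(int(v * k), k - 1) for v in normalized]
```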
Figure 4 - An example of the process of finding the signature of a gesture S

Before recognizing a sign, we must perform what we call the registration process. In this process, we collect data from known gestures, construct their signatures using the above-mentioned technique, and add them to a database, i.e., register them. Gestures can be registered by different users at different times, which makes the process extensible. To recognize an unknown gesture, we then follow these steps. First, we collect the data for this gesture and construct its signature. Next, we perform a nearest-neighbor (NN) search on the database of registered signatures for the given unknown signature. Finally, if the distance between the unknown signature and its nearest neighbor is less than a pre-defined threshold $\varepsilon$, we consider the two gestures identical and give the label of the known gesture to the unknown one. We can apply this process to recognize a group of consecutive static gestures (i.e., finger-spelling): each gesture is recognized individually, and the final result is formed by combining the individual results. To recognize gestures that have no label assigned, we save the signature of the gesture of interest and compare the signatures of subsequent gestures against it. In the next section, we present the results of our experiments.
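Building on the helper functions sketched above, the registration and recognition steps might look as follows; the database representation (a simple list), the default threshold value, and the linear scan in place of an indexed nearest-neighbor search are illustrative assumptions, not details given in the paper:

```python
def register(database, label, s, s0, k=4):
    """Register a known gesture: store its signature under the given label."""
    database.append((label, compute_signature(s, s0, k)))

def recognize(database, s, s0, k=4, epsilon=2.0):
    """Recognize an unknown gesture by a nearest-neighbor search over the
    registered signatures. Returns the nearest label if it lies within
    epsilon, otherwise None (no registered gesture matched)."""
    unknown = compute_signature(s, s0, k)
    best_label, best_dist = None, float("inf")
    for label, signature in database:
        d = euclidean_distance(signature, unknown)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist < epsilon else None
```

Finger-spelling would then be handled by calling the recognition step once per consecutive static sign and combining the returned labels.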
Figure 5 - ASL sign 'L' and its representation with k=4. The red sensors marked with X have the maximum range of motion from the starting posture while making this gesture.
4 Performance Results
To evaluate our approach, we conducted several sets of experiments using the Immersion CyberGlove© [7], which provides up to 22 joint-angle measurements. This glove and the locations of its sensors are shown in Figure 5. It uses proprietary resistive bend-sensing technology to transform hand and finger motions into real-time digital joint-angle data streams.

In our experimental setup, all static ASL signs ('A' to 'Y', excluding 'J') were performed by one user and their signatures were stored. We then asked 9 other users to make all static signs and saved the data. Applying our technique, we could recognize 18 out of 24 signs correctly on average, resulting in 75% accuracy. For this set of experiments, we chose k=4 as the discretization parameter and the Euclidean distance to compare the sets. The results of this set of experiments are shown in Figure 6.

Signs registered by signer 3, K = 6

             A  B  C  D  E  F  G  H  I  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y   Recognized  Accuracy
Signer 1     A  B  E  D  E  O  L  H  I  K  L  N  N  O  P  Q  K  S  Q  U  U  W  X  Y       17        70.83%
Signer 2     A  B  E  D  E  F  G  R  I  K  L  M  M  A  P  T  R  S  T  U  U  W  X  Y       18        75.00%
Signer 3     A  B  C  D  E  F  G  H  I  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y       24       100.00%
Signer 4     A  B  C  D  E  F  X  K  I  K  L  M  A  E  P  Q  R  S  T  R  V  W  X  Y       19        79.17%
Signer 5     A  -  C  R  E  F  G  P  I  H  L  M  R  O  G  T  R  S  T  U  V  W  X  Y       17        70.83%
Signer 6     A  B  C  D  E  F  L  P  I  K  L  M  M  O  P  G  K  S  T  U  P  W  X  Y       18        75.00%
Signer 7     M  B  E  D  M  F  G  K  I  K  L  M  U  O  P  L  R  S  T  U  R  W  X  Y       17        70.83%
Signer 8     A  B  C  X  E  F  D  H  Y  K  L  M  N  D  P  T  R  M  T  U  V  W  S  Y       17        70.83%
Signer 9     A  B  C  D  E  F  X  H  I  K  L  U  L  C  P  T  R  E  Q  U  H  W  X  Y       16        66.67%
Signer 10    A  B  C  D  E  F  G  H  I  H  G  M  N  O  T  V  R  S  X  V  R  W  X  Y       17        70.83%
Correct/10   9  9  7  8  9  9  5  5  9  8  9  8  4  6  8  3  8  8  7  8  4 10  9 10       18 (avg)   75.00%

Figure 6 - Comprehensive results for the first set of experiments with k=4 and Euclidean distance. Each row lists the sign recognized for each letter a signer made; cells matching the column header are correct recognitions, and '-' denotes no match. The last row gives the number of correct recognitions per sign across all signers.
5 Conclusion and Future Work
We proposed an approach for recognizing hand gestures based on the range of motion of the different joints during the formation of the gesture. Our approach is also capable of detecting similar hand gestures without having them labeled, a problem that most traditional classification methods fail to address. Our experimental results confirm the effectiveness of our approach for recognizing static ASL signs. In our future research, we plan to focus on the following issues:

• Investigating different methods of normalization and discretization to improve accuracy.
• Approaching the recognition process in two steps: first recognizing the group of similar signs, then fine-tuning the result within the recognized group.
• Recognizing ASL dynamic signs.
References

[1] A. Ramamoorthy, N. Vaswani, S. Chaudhury, and S. Banerjee. "Recognition of Dynamic Hand Gestures". Pattern Recognition, 36:2069-2081, 2003.
[2] Y. Wu and T. S. Huang. "Vision-Based Gesture Recognition: A Review". International Gesture Workshop (GW'99), France, March 1999.
[3] E. Costello. "Random House Webster's Concise American Sign Language Dictionary". Random House, New York, 1999.
[4] M. V. Lamar and M. S. Bhuiyan. "Hand Alphabet Recognition Using Morphological PCA and Neural Networks". Proceedings of the International Joint Conference on Neural Networks, Vol. 4, pp. 2839-2844, Washington, USA, 1999.
[5] J. Weissman. "Gesture Recognition for Virtual Reality Applications Using Data Gloves and Neural Networks". Proceedings of the International Joint Conference on Neural Networks (IJCNN'99), Vol. 3, pp. 2043-2046, 1999.
[6] M. Zhao. "RIEVL: Recursive Induction Learning in Hand Gesture Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, November 1998.
[7] Immersion Corporation, www.immersion.com