Variations on a Fuzzy Logic Gesture Recognition Algorithm

Lesley Anderson, Dr. Jon Purdy, Warren Viant

SimVis Group, University of Hull, Hull HU6 7RX, United Kingdom

[email protected], [email protected], [email protected]

ABSTRACT
Web-cam-based gesture recognition systems for home use are becoming more viable. A modification to an algorithm developed by Bimber yields low failure rates for wand motions tested against three sets of gestures. Additionally, the speed at which a gesture is performed does not affect its recognition rate, though the gesture's orientation does.

Categories and Subject Descriptors
F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems – pattern matching. B.4.2 [Input/Output and Data Communications]: Input/Output Devices – abstract data types, polymorphism, control structures.

General Terms
Algorithms, Measurement, Performance, Design, Reliability, Experimentation, Human Factors

Keywords
gesture recognition, wand, fuzzy logic, games, control

1. INTRODUCTION
With the release of Sony's EyeToy video camera for the PlayStation 2 and of similar systems for Macintosh and PC computers, game designers have an additional resource for enabling gamers to interact with their games. Though at present the games that use these web-cam systems are very simple – most track the user's two-dimensional hand motions, allowing users to hit, move, break or spin things – there is potential for more complex interfaces. For instance, Cao and Balakrishnan [2] used two web-cams to track the three-dimensional movements of their VisionWand, a low-cost, electronics-free wand that they used to mimic the operations of a traditional mouse on a large-scale display.


It is not difficult to imagine a similar system, combined with a reliable method of gesture recognition, being used for computer game interaction. This work focuses on an algorithm developed by Bimber [1] that may be useful for a gesture-recognition-based game. Bimber's algorithm is based on fuzzy logic and is applicable to any object that can be tracked. The method stores each representation of a gesture as an analysis of 56 attributes; each new input gesture is analyzed for the same 56 attributes and compared to each stored representation to find the closest match.

Encouraged by the algorithm's simplicity and by Bimber's claims of an exceptionally low failure rate (less than 1%) and quick response times, we devised three variants of Bimber's original algorithm in an attempt to further improve its performance. The first variant altered the way in which each of the 56 attributes is weighted; the second altered how the object's orientation is represented; and the third combined both changes. Inspired by the popular enthusiasm for Harry Potter, we chose a wand as the object to be tracked and ran simple tests designed to reveal whether our modifications to Bimber's original algorithm were, in fact, improvements. We used two different tracking systems, a Polhemus Fastrak and a Vicon optical motion capture system, in hopes of learning how variables such as the tracking system, the number of stored gesture representations, and the speed at which a gesture is performed affect recognition.

We found that our orientation-modified algorithm produced lower failure rates in all of our tests. Additionally, the speed at which a gesture is performed seems to have no bearing on whether it will be recognized.

2. METHOD

2.1 Bimber's Algorithm
According to Bimber, his algorithm's "advantages are its usability (e.g. with 2D or 3D input devices or in combination with finger status information, etc.), the minimum of information needed to recognize a gesture, and, consequently, the high speed of its scanning and comparison process." [1]

Bimber analyzes every gesture for 56 attributes. The first 28 are based on position information; the second 28 are based on orientation information. Each attribute takes a value between 0 and 100 and captures an important property of the gesture. For example, if the height-to-width-ratio attribute has a value close to 100, the gesture is taller than it is wide; a value close to 0 indicates that the gesture is wider than it is tall, and a value near 50 indicates that the gesture is roughly square. Both position and orientation information are three-dimensional. The orientation information replaces the x, y and z position coordinates with three angles: the angle between the world x-axis and the x-axis of the tracked object, and likewise for the y and z axes.

The recognition system is trained by having a user perform a gesture and then adding its analysis to a database. More representations of a gesture can easily be added to increase the recognition rate. By comparing the input gesture's attributes to each representation in the database and using a set of weights to calculate a score for each comparison, the system can return the gesture with the closest match; the lower the score, the closer the match. The weights, which take float values between 0 and 5, control the contribution of an attribute to the final score. Certain weights are calculated at run-time to look for unusual gestures (e.g. a straight line); they can also be programmed in advance.
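To make the comparison step concrete, the following minimal Python sketch gives our reading of the description above; it is not Bimber's published implementation. The weighted-absolute-difference scoring and the flat (name, attributes) database layout are our assumptions, while the 56-attribute analysis, the 0-100 attribute values, the 0-5 weights, the lowest-score-wins rule, and the 10,000 rejection threshold (see Section 5) come from the text.

    NUM_ATTRIBUTES = 56  # 28 position-based attributes + 28 orientation-based

    def comparison_score(input_attrs, stored_attrs, weights):
        """Weighted distance between the input gesture and one stored
        representation; a lower score means a closer match."""
        return sum(w * abs(a - b)
                   for w, a, b in zip(weights, input_attrs, stored_attrs))

    def recognize(input_attrs, database, weights, threshold=10000):
        """Return the name of the best-matching stored representation, or
        None when every comparison scores above the rejection threshold."""
        best_name, best_score = None, threshold
        for name, stored_attrs in database:  # several entries per gesture type
            score = comparison_score(input_attrs, stored_attrs, weights)
            if score < best_score:
                best_name, best_score = name, score
        return best_name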

2.2 Modifications to the Algorithm
Our first variation on Bimber's algorithm altered the way in which the comparison weights are calculated. Instead of a set of weights being calculated for each input gesture, a set of weights is calculated for each type of gesture stored in the database, based upon the values of the attributes across all of its representations. These weights are calculated by finding the standard absolute error for each attribute, which tells us how closely the attributes of the representations match one another. We then use the largest of the 56 errors to scale each weight to between 0 and 5, so that the attributes with the largest standard absolute error, i.e. the largest variation in attribute values, receive the lowest weights. For each attribute:

    weight = 5 - std_error / (max_std_error * 0.2)

The second variation concerned the orientation data. As previously mentioned, we chose a wand as the object to track, and we designed our wand to lie along its x-axis. Because wands do not have an "up" side, it seemed unnecessary to store information about the y and z axes, so we replaced the three angles of axis rotation with a normalized 3D vector in the direction of the wand's positive x-axis. The third variation combined the previous two modifications.
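A sketch of both modifications follows, under the assumption that the standard absolute error of an attribute is its mean absolute deviation across all stored representations of one gesture type; the function names are ours, for illustration only.

    import math

    def gesture_weights(representations):
        """representations: list of 56-attribute lists for ONE gesture type.
        Returns 56 weights in [0, 5]; the attributes that vary most across
        the representations receive the lowest weights."""
        n = len(representations)
        errors = []
        for i in range(len(representations[0])):
            values = [rep[i] for rep in representations]
            mean = sum(values) / n
            errors.append(sum(abs(v - mean) for v in values) / n)
        max_err = max(errors) or 1.0  # guard against all-zero errors
        return [5.0 - err / (max_err * 0.2) for err in errors]

    def wand_direction(x_axis):
        """Orientation modification: a normalized 3D vector along the
        wand's positive x-axis instead of three axis angles."""
        length = math.sqrt(sum(c * c for c in x_axis))
        return [c / length for c in x_axis]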

3. TEST
The primary aim of our test was to examine the failure/success rates of each modified algorithm against Bimber's original. A secondary goal was to detect differences in recognition rates introduced by using two different tracking systems and by varying input speeds. We took two sets of input, one from our Vicon motion capture system and one from our Polhemus Fastrak, and compared each set against three sets of stored gesture representations using each of the four algorithms. The tests were designed to answer the following questions:

• Do any of the test algorithms give overwhelmingly better results across the range of six input/gesture-set combinations?

• We expected that input from either tracker, compared with a user-trained gesture set from the same tracker, would yield the best recognition results for each of the four algorithms. Was this the case?

• What effect does the speed at which an input gesture is performed have on recognition rates?

3.1 Gestures
We used a small gesture set for our tests. There were three gestures, all circles: forward, upward and leftward, the direction indicating which way the user should point the wand while making the gesture. We chose a circle as our gesture because it is an easy shape to make with a wand. Users were free to make the circles in either a clockwise or counterclockwise direction as long as they were consistent; this allowed us to check the precision of the algorithms.

3.2 Test Procedure
Each of our ten participants was asked to perform two sets of 15 gestures, which became their user-trained data sets. One set was performed with each of the tracking systems, and each set comprised five representations of each of the three types of circle. We alternated which tracker was used first, so that as the participants became more comfortable with the gestures and their movements became more precise, the data would not be skewed in favor of one tracker. Next, the participants performed 45 gestures per tracking system in a random order indicated to them on the display. They were also asked to intentionally perform some gestures more quickly or more slowly according to the on-screen instructions. Each of our four algorithms was used to compare each input gesture to each of three data sets. The first two data sets were those recorded by the user; the third was a reference set. The reference set was a combination of gesture sets recorded using the Polhemus by two users who did not otherwise participate in the test. It had twice as many representations of each gesture and was constant across all test participants.

4. RESULTS
The results were analyzed in two ways: overall failure rates and the range of comparison scores. Together these give an idea not only of how well each algorithm performed in terms of the percentage of correctly recognized gestures, but also of how precisely each algorithm matches a gesture.
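For concreteness, a confidence indicator of the kind reported in Figure 3 can be computed with the usual normal approximation. This sketch is an illustrative assumption on our part; the exact computation is not restated here.

    def ci_95(scores):
        """Mean and 95% confidence bounds for a list of comparison scores,
        using 1.96 standard errors (normal approximation)."""
        n = len(scores)
        mean = sum(scores) / n
        sd = (sum((s - mean) ** 2 for s in scores) / (n - 1)) ** 0.5
        half_width = 1.96 * sd / n ** 0.5
        return mean - half_width, mean + half_width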

4.1 Algorithm Failure Rates
Considering input from both tracking systems, our orientation-modified algorithm had the lowest failure rate for each data set (see Figure 1): 8.44% for the user-trained Polhemus data set, 6% for the Polhemus reference set, and 5.33% for the user-trained Vicon data set. For each data set, our weight-modified algorithm returned the highest failure rates.

Figure 1. Algorithm failure rates per data set (0-18%), comparing the weight-modified, original, orientation-modified, and weight-and-orientation-modified algorithms against the Polhemus trained, Polhemus reference, and Vicon trained data sets.

4.2 System Comparisons
Comparing the failure rates from our orientation-modified algorithm for each input/data-set combination unexpectedly shows that input from one tracking system was less likely to fail against data recorded with the other tracking system (see Figure 2).

Figure 2. Failure rates for hardware combinations (from the orientation-modified algorithm, 0-14%): Polhemus trained, Polhemus reference, and Vicon trained data sets against Polhemus input and Vicon input.

4.3 Time
Our data did not reveal any correlation between the time a user took to make a gesture and the comparison score it was likely to receive.

4.4 Orientation
For the orientation-modified algorithm, the leftward circles had the lowest failure rates for each of the data sets, disregarding input system. These failure rates were significantly lower than those for the upward or forward circles. Additionally, correctly recognized upward circles had the highest 95% confidence indicators (Figure 3).

Figure 3. 95% confidence intervals for the three circles according to the orientation-modified algorithm (comparison scores per gesture orientation and data set).

                   Polhemus Trained            Polhemus Reference          Vicon Trained
                   Forward  Upward  Leftward   Forward  Upward  Leftward   Forward  Upward  Leftward
    Mean + 95% CI  3435     3951    3137       3433     4506    3757       3206     3907    3052
    Max Score      8597     9684    7099       8607     9095    7930       9071     8719    6526
    Min Score      853      1289    819        1206     2075    1570       1033     1500    1111
    Mean - 95% CI  3117     3559    2844       3155     4229    3522       2937     3589    2798

5. DISCUSSION
For the primary testing phases of our experiment we used a threshold of 10,000: any comparison that returned a score over 10,000 matched none of our three gestures, and any score under 10,000 matched one of the three. In every case the concentration of scores for correctly matched gestures was well below 10,000, indicating that for practical use this threshold can be reduced. The threshold also forms the line between gestures that are not recognized at all (either correctly or incorrectly) and gestures that are recognized as something they are not; without a threshold, every gesture will be recognized as something.

General observations during the test revealed occasional tracking-system irregularities. The Polhemus Fastrak has an optimal tracking range of one meter, and it is difficult to let a user move naturally while constraining their movements to a hemisphere of radius 1 m. Also, because the Vicon relies on cameras and reflectors, occlusion of the reflectors can cause the system to lose track of the wand. Several times we observed that as the user moved, the reflectors became occluded, and our program, which is designed to stop recording a gesture when the user stops moving, stopped recording prematurely. This may have contributed to the failure rate. Additionally, in the course of performing 90 gestures, a user would occasionally become confused and perform the wrong gesture. This would be reported as an error, though the algorithms may in fact have recognized the gesture for what it was. For this reason, we removed these errors from our data before comparing scores within an algorithm's results, as in our comparison of circle orientation to mean scores in Figure 3.

We used failure rates to compare results among algorithms, and 95% confidence indicators to compare data sets and gesture orientations for specific algorithms, because the variation in weighting methods causes variations in the range of scores. When we examine an algorithm's range of scores, we are looking at the precision with which it matched each input gesture to a gesture in a data set. For our future work we will likely use the orientation-modified algorithm, because it returned the lowest overall failure rates. However, this algorithm is only appropriate for objects shaped similarly to a wand, with no true "up" or "down" side.
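The occlusion failure mode is easiest to see in the gesture-segmentation logic. The sketch below is hypothetical: the frame window and stillness threshold are illustrative values rather than the ones our recorder used, and sample_position stands in for whichever tracker API supplies wand positions.

    STILL_FRAMES = 15      # consecutive near-still frames that end a gesture
    STILL_DISTANCE = 10.0  # per-frame movement (mm) counted as "still"

    def record_gesture(sample_position):
        """Record wand positions until the wand stops moving. If occlusion
        freezes the reported position, recording ends prematurely, which is
        the failure mode we observed with the Vicon."""
        samples, still_count = [], 0
        prev = sample_position()
        while still_count < STILL_FRAMES:
            pos = sample_position()
            samples.append(pos)
            moved = sum((a - b) ** 2 for a, b in zip(pos, prev)) ** 0.5
            still_count = still_count + 1 if moved < STILL_DISTANCE else 0
            prev = pos
        return samples[:-STILL_FRAMES]  # drop the trailing still frames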

It is significant that for our orientation-modified algorithm the lowest failure rates among the input/data-set combinations came from opposite systems. This bears on whether a game, or any other system, that uses a wand interface must be trained per user on a particular tracking system. Further testing is needed to confirm these results, but it is possible that a standard reference set could be established using one tracking system that users would be able to conform to successfully using another.

Because there is no real correlation between time and recognition scores for the correctly recognized gestures, the amount of time taken to perform a gesture could be used as another input factor. For example, the speed at which a circle gesture is performed might determine the speed at which a wheel begins to rotate.

We were surprised to see that leftward circles had the lowest failure rate and the smallest 95% confidence indicators. Our observation of participants during the test indicated that the leftward circle was the most unnatural to perform, so the low failure rate and the higher consistency of the correctly recognized gestures may have been due to greater concentration on the part of the participants when performing that gesture.

One of the strengths of Bimber's algorithm is that recognition rates can be improved by adding more representations to the database. We did not take that approach here; instead we used a static number of representations. It might be worthwhile to attempt a longer training session, repeating gestures and adding them to the recognition data sets until a certain success rate was achieved. This could be very taxing for the user, however, depending on the number and complexity of the gestures to be trained.

Because our test base was small and it was difficult to isolate independent conditions within our test environment, it is hard to draw absolute conclusions from our data. However, there are indications that our orientation-modified algorithm is the most appropriate for our further wand research, that it is viable to use pre-recorded reference sets instead of requiring users to train the system themselves, and that it is viable to train the system with one tracking system and then supply input from another. It would be beneficial to repeat the experiment, or a shorter version of it, training users first against a reference database and then creating a new database of their own trained gestures, to see whether both precision and failure rate could be improved.

6. ACKNOWLEDGMENTS
Our thanks to James Ward and Jeremy Thornton for their assistance.

7. REFERENCES
[1] Bimber, O. Continuous 6D Gesture Recognition: A Fuzzy-Logic Approach. Proceedings of the 7th International Conference in Central Europe on Computer Graphics, Visualization and Interactive Digital Media (WSCG '99), 1 (1999), 24-30.
[2] Cao, X. and Balakrishnan, R. VisionWand: Interaction Techniques for Large Displays using a Passive Wand Tracked in 3D. Proceedings of the 16th Annual ACM Symposium on User Interface Software and Technology (UIST '03), (2003), 173-182.
