Gesture Recognition with Accelerometers for Game Controllers, Phones and Wearables

Anthony D. Whitehead, Member, IEEE

Received 30 Jan 2014; Accepted 10 Mar 2014
Abstract— Hidden Markov Models have been used effectively in time-series-based pattern recognition problems in the past. This work explores using Hidden Markov Models (HMMs) to perform 3D gesture recognition from accelerometer data. Our work differs from much of the previous work in that we examine the use of discrete HMMs rather than continuous HMMs. An interesting side effect is that the method is therefore theoretically transportable to other devices that have a 3D sensor output system; in essence, this gives us a mechanism to use the HMM model across a series of different sensor devices for gesture recognition. We achieve recognition results with accuracy rates approaching 90 percent for users who are not in the training samples. The speed of our system is also of interest, as we are able to classify gestures at a rate of several hundred times per second. As long as the sensor system is capable of outputting information about the 3 axes of motion, and the outputs can be discretized to volumetrically equivalent cubic sub-spaces, that information can be used in this generic model for accurate, high-speed gesture recognition.

Index Terms— Gesture recognition, Hidden Markov Models, Games, Human Computer Interaction, Wearable Computers.
I. INTRODUCTION
In the past few years we have seen the rise of many new and successful entertainment media. Nintendo has had great commercial success with its Wii gaming console, which makes use of a motion-sensing game controller [1]. This allowed players to do things like swing their arm to play tennis or draw back their arm to fire an arrow. This was a new way to play video games that people had not experienced before. Microsoft's Kinect took this concept of motion gaming and interaction to a full-body experience by using a camera-based design. Upon release, the Kinect also saw amazing success in the market [2]. The Kinect's full-body pose detection allowed dancing and fitness games to become extremely prevalent. It also allowed for gesture-based user interface navigation for menus. As well, the prevalence of mobile phones and the emergence of wearables [3] such as smart watches, instrumented footwear and others has proffered the opportunity for interaction with devices by means of gesture and other non-contact options.

The rapid spread of mobile devices has also provided new ways for consumers to experience their entertainment. The spread of wireless networks has allowed for the streaming of media and general Internet activity in nearly any location. Meanwhile, the rapid increase in the computing power of mobile devices has allowed complex games to be developed, as well as other experiences not possible a decade ago. People are continuously searching for new and novel forms of entertainment, and wearables could prove to be an exciting next step. Despite being long discussed, they are now much more realistic due to advances in sensor technologies, wireless communication, battery life and processing power. They could provide a much more personal experience by drawing on an individual's emotions and responses for more affective interactions. They also hold the potential to provide more accurate and less restrictive motion and gesture input, whether using the whole body or a single extremity. It is all of these opportunities that make the need for reliable and accurate gesture recognition outside of the laboratory environment so crucial at this time.

Although it is desirable to have a system that easily interprets human input, it is also expected that users can develop application-specific skills with practice. In this sense we sought to explore how best to set up a generic recognition system that works for the general population but does not overly burden the application designer with limitations of the technology.

Our work, described throughout, uses Hidden Markov Models (HMMs) as the basis for our statistical pattern recognition system to identify and verify the performed gestures. A hidden Markov model is a statistical model initially developed for speech recognition [4]. In a hidden Markov model, only the sequence of emitted symbols is observed; the path of states followed by the process is "hidden" from the observer. Given a hidden Markov model M (trained by examples) and an input sequence S (a gesture), the standard question is whether S has the properties of the model M. Less precisely: does the input gesture "look like" the training gestures?
To answer, we compute the probability of an observed sequence S (an input gesture) being generated by the (trained) model M. The log of the ratio of the probability of S given M to the probability of generating S by chance is usually used as an assessment function. A more formal description of the Hidden Markov Model is given in Section III.

Some current applications of HMMs to gesture recognition in various sensor systems include, among others, work by Kratz [5], Keskin [6], and Segen [7]. The concept of gesture has two broad, generally accepted categories: the pose and orientation of the hand [7,8,9,10,11], and the motion of a limb [5,6,12,13,14,15], typically the arm. In both cases the goal of gesture recognition is to decide the class of a preconceived pose and/or motion, and Hidden Markov Models have had some success when used for recognition. A popular approach to gesture recognition has been to use HMMs in conjunction with a camera system. For example, [7,11,16,17,18] use various models, such as active shapes and blob analysis, to track the user and use HMMs to identify gestures. While these methods are often accurate, because the problem reduces to 2D cursive recognition, they require a computer-vision-based system in a controlled lighting environment. In a practical sense, controlled lighting is not possible in many wearable computing settings. Moreover, vision systems still have significant computational issues, resulting in latencies that are not found in accelerometer-based systems. Furthermore, accelerometer-based gesture recognition is a practically interesting problem given the inclusion of accelerometers and other 3D sensor sources in modern video game controllers, cellular phones and wearables.

This work explores the use of discrete Hidden Markov Models with the intent of examining the practical requirements for recognition against a live input system suitable for games and interactive applications. Theoretical proof of HMM success with accelerometer data was presented in [12,13], using various noise models to perturb the data sets for evaluation and training. We take a different approach: we train and evaluate similarly to the standard theoretical model verifications, but we also perform a significant number of live test performance measures. A significant contribution of this work is the modified training and testing pipeline used to ensure better recognition rates for users who are not a part of the training data. This is an important factor for games and other interactive media that must release a product to the general population. By using accelerometer data directly, the sensor output can be recorded keeping track of the motion path, rather than tracking actual motion differentials.

With a goal similar to ours, [5] explores the use of accelerometers to classify motion paths in the video game context. However, their statistical model is in the continuous domain via a Bayesian approach, which differs from our discrete approach. They managed to achieve significant recognition results while test subjects were part of the training process.
However, in the more practical situation where users are not involved in the training process (as would be the case in professional game development, phone interactions or the wearable context), the recognition rates dropped significantly. A significant issue affecting their results was the size of the training data set. We confirm that more data is required to adequately model the general population and allow it higher success rates.

This paper is organized to walk the reader through the findings and experiments that lead to the final set of parameters and decisions used in the end system described here. We explain our sensor system and gesture design intentions, then continue with a description of HMMs and how we set up and train our statistical models, giving evidence for each decision as we proceed. We then describe the concept of gesture grouping and outline our final recognition results.

II. MATERIAL AND METHODS

In this section we present our sensor setup, our methodology, and our efforts to determine the appropriate training pipeline and test scenarios, in order to identify which parameters best suit the live performance tests and to ensure the system is generally useful. We found that while one identified pipeline and set of parameters maximizes the overall recognition results for classical theoretical classification trials, a different set of parameters and data management pipeline allows for higher recognition rates in live performance trials.

To record gesture data, a sensor pod was placed on a subject's wrist. Each pod contains two layers of plastic that sandwich a 2g tri-axis accelerometer, ensuring the sensor remains secure in position and minimizing unwanted movement of the sensor itself. Fig. 1 shows a sensor pod, consisting of a plastic shell and the accelerometer hardware. The sensor pods are attached to the body using adjustable Velcro strapping, which keeps wear and tear on the electronic components minimal.
Fig. 1. Sensor Pods: The plastic capsules provide a way to fasten the tri-axis accelerometer hardware to the person.
The gestures were created as part of a game design experiment involving a fantasy role-play scenario where players would be engaged in mystical battle with one another and would have to attack and defend themselves using gestures. These gestures could be input with a typical game controller, a smart phone or a wearable such as a smart watch. We wanted the gestures to be driven by the design, rather than by the limitations of any given technology. We decided on seven gestures in total, titled: 1. Block (defensive move), 2. Cast Projectile (attack), 3. Hurricane (spell), 4. Roll Boulder (spell), 5. Counter Spell (defensive), 6. Soul Steal (attack), and 7. Wand Wave (spell). Figure 2 shows the gestures and, roughly, how they were to be performed. We note here that the gestures created are not arbitrary motions designed to minimize overlap and reduce the likelihood of system error. Rather, the gestures were selected to be natural and immersive, as though the player were a magician in battle. Our challenge is thus to make a recognition system that works for the designer of the game with as few limitations on the gestures as possible, rather than make the designer work within the limitations of the gesture recognition system. We performed standard evaluations in order to define our HMM parameter setup and our training and testing pipelines, but the final confirmation of our system was a live trial where gestures were detected and recognized in real time under a controlled experimental setup.
Fig. 2. Seven gestures: Block, Cast Projectile, Hurricane, Roll Boulder, Counter Spell, Soul Steal, and Wand Wave.
III. HMM THEORY AND OUR HMM SETUP

In a hidden Markov model, a sequence is modeled as the output of a stochastic process progressing through discrete time steps. At each time step, the process outputs a symbol from a predefined alphabet and moves from one state to the next. Each state is described by its transition probabilities and its alphabet symbol output probabilities. Both the transition from state to state and the emission of an alphabet symbol follow probability distributions that define the model; thus we can estimate these probabilities using a training-by-example process. A complete specification of an HMM, conveniently parameterized as λ = {A, B, π}, can be described as:

- A set of N states, S = {S1, S2, …, SN}. These correlate to the physical properties of the input patterns, for example discretized accelerometer readings.
- A set of M observation symbols C = {C1, C2, …, CM}. This is the alphabet.
- An N×N matrix A = {aij} of state transition probabilities. When any state can be reached from any other state (an ergodic model), all aij > 0.
- An N×M matrix B = {bij} of observation probabilities, where state i emits symbol j with probability bij.
- The initial state distribution vector π = {π1, π2, …, πN}.
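To make the parameterization concrete, here is a minimal sketch of the standard forward-algorithm evaluation for a discrete HMM, following Rabiner's formulation [4]. This is our illustration rather than code from the original system; the function names and the NumPy representation of λ = {A, B, π} are our own.

```python
import numpy as np

def forward_log_likelihood(obs, A, B, pi):
    """Compute log P(obs | lambda) for a discrete HMM with the scaled
    forward algorithm.  obs is a sequence of symbol indices; A is the
    N x N transition matrix, B the N x M emission matrix, pi the
    initial state distribution."""
    alpha = pi * B[:, obs[0]]                 # forward variable at t = 0
    log_prob = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()                   # rescale to avoid underflow
        log_prob += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_prob + np.log(alpha.sum())

def log_odds(obs, A, B, pi, alphabet_size=27):
    """Log-ratio of P(obs | model) to the probability of producing obs
    by chance: the assessment function described in the Introduction."""
    chance = len(obs) * np.log(1.0 / alphabet_size)
    return forward_log_likelihood(obs, A, B, pi) - chance
```

A gesture is then attributed to the model with the highest positive log-odds score.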
An HMM can be used as a model for how a given observation sequence was generated, allowing multiple models to be examined and observations to be classified into the best-fit Hidden Markov Model. Moreover, the HMM probability distributions can be computed from observation. This allows a training-by-example process followed by a recognition process on samples gathered after the training phase. For a more detailed description of HMMs and the formalized solutions to the training and recognition tasks, the reader is referred to [4].

An HMM can be based either on discrete observation densities or on continuous observation densities. In our work, we use discrete HMMs to model the input because the discrete distributions of the observed accelerometer data are sufficient to model its properties using a finite set of symbols. Moreover, this allows the method to remain somewhat agnostic to the sensor device being used. We use tri-axis accelerometers, but there is no theoretical obstacle to using any sensor device that outputs a 3-dimensional vector of data in its place, and we should expect similar recognition rates when using a preprocessing transformation that elicits similar distributions of observation symbols. Before we apply HMM training to raw gesture data collected from the accelerometers, the data must be preprocessed into sequences of symbols from our alphabet. We discuss this decomposition next.
A. Decomposing and Discretization of the HMM

In order to remain device agnostic, the goal is to map raw sensor output to a common alphabet of HMM input symbols. This may seem quite difficult at first glance, given that different sensors record different types of information that are not altogether related. However, given the discrete nature of any sensor, it is relatively simple to output symbols based on a discrete segmentation of the sensor's range. Suffice it to say that the granularity of the decomposition depends primarily on the precision of the sensors being used. Higher precision does not invalidate the model; rather, it simply expands the alphabet, and the only issue that remains is the computational requirement of an increased alphabet. It is also important to note that this decomposition requires a 1:1 alphabet-to-state mapping, i.e., each state outputs a unique symbol. Since our gestures are in 3D space, and our sensors emit 3-dimensional data, we carve the space into discrete cubic sub-spaces. In our tests, we experimented with discrete segmentations of 3, 4 and 5 sub-spaces in each dimension, resulting in alphabet sizes of 27, 64 and 125. Our experiments show that an alphabet size of 27 allows real-time interactivity; the larger sizes are still too computationally expensive for effective real-time applications. We note here that at 27 states the system is ergodic, but this will not always be true of larger alphabet sizes.
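The following is a minimal sketch of this cubic sub-space quantization, assuming three bins per axis over the ±2g range of the sensor pods; the bin boundaries and function name are our assumptions, not values given in the paper.

```python
import numpy as np

def symbolize(samples, bins_per_axis=3, g_range=2.0):
    """Map raw tri-axis accelerometer samples (T x 3 array, in g) to
    discrete symbols.  Each axis is split into bins_per_axis equal
    intervals over [-g_range, +g_range], carving the space into cubic
    sub-spaces; 3 bins per axis gives the 27-symbol alphabet."""
    samples = np.clip(np.asarray(samples, dtype=float),
                      -g_range, np.nextafter(g_range, -np.inf))
    idx = ((samples + g_range) / (2 * g_range) * bins_per_axis).astype(int)
    x, y, z = idx[:, 0], idx[:, 1], idx[:, 2]
    return x * bins_per_axis ** 2 + y * bins_per_axis + z  # symbol in 0..26
```

With bins_per_axis set to 4 or 5, the same routine yields the 64- and 125-symbol alphabets tested above.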
IV. DETERMINING TRAINING PARAMETERS FOR HMM

Determining the best training scenario for the gesture recognition system is an important first step. We examine how the number of training samples, alphabet size, raw data filtering, and outlier culling affect the recognition capabilities of the end system. In this section we walk through the experiments that helped identify the appropriate parameters and methods that best produced usable recognition rates.
A. Number of Samples for Training and Testing

Previous experiments have shown that data from several subjects is required for easy replication by the general population. We conducted several experiments to determine the number of samples required to generalize the gesture data so that recognition rates could remain high, yet still be discriminatory. As the number of gesture samples increases, the curve of the graph in Figure 3 flattens, indicating that each additional sample has less impact on the overall results. Based on these results, we used a set of 240 samples for training and 60 samples for testing. This required 60 samples for each gesture from each of the 5 participants, for a total of 420 gesture samples per participant. Collecting 420 samples per participant is a lengthy process and required some rest between data acquisition trials; this had the benefit of preventing fatigue from becoming a factor.
Fig. 3. Effect of Training Sample Size on Overall Classification Rates (%) for alphabets of 27, 64 and 125.
B. Alphabet Size

A test was conducted to compare results across the number of alphabet symbols used for Hidden Markov Model gesture recognition. The best results were achieved using 27 symbols, with the exception of the false positive rate, which is noticeably worse at 27 symbols than at 64. This drawback is negligible, however, given the significantly higher correct classification rate attained with 27 symbols. This experiment also confirms that the highest number of samples gives the best results: a larger alphabet requires more samples to effectively train the HMM in order to reach significantly higher recognition rates. Moreover, alphabet sizes of 64 and 125 were not computationally fast enough to be effective input systems for a game or interactive application. Thus an alphabet size of 27 was deemed best, in order to limit the amount of training to fewer than 300 samples per gesture and maintain the real-time recognition performance required for games.
C. Filtering the Raw Input Signal

After conducting the core experiments to determine the amount of training data required, individual results were examined, revealing the need to filter the raw data. Our transformation to a discrete, and small, alphabet resulted in problematic border areas between sub-spaces. The raw sensor data collected is also technically a discrete signal for each axis of motion. In order to remove the effects of these two sampling functions, the raw data is first filtered to reduce the impact of individual outliers. Figure 4 shows the effect of filtering the raw data. To determine the optimal filtering parameters, two subsets of the test were performed: one where the test data was extracted from all participants who created training data, and one where the test data came from a single participant and the training data consisted of all the others (method 1: a little bit of everyone in each set; method 2: tester not in the training data set). Through experimentation it was found that a uniform kernel and a window size of 7 produced the best results for our alphabet size of 27. The results of the experiment are outlined in Table 1.
Fig. 4. Accelerometer Data Filtered with kernel size 7, Uniform Kernel, for one axis of motion.
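A uniform kernel of width seven is a simple moving average. A minimal per-axis sketch is shown below, assuming zero-padded edges (edge handling is not specified in the paper).

```python
import numpy as np

def smooth(axis_signal, window=7):
    """Filter one axis of raw accelerometer data with a uniform (box)
    kernel; window=7 gave the best tester-not-in-training CCR here."""
    kernel = np.ones(window) / window
    return np.convolve(axis_signal, kernel, mode="same")  # same length out
```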
TABLE I
EFFECT OF FILTER WINDOW SIZE ON RECOGNITION RATES, ALPHABET SIZE: 27. VALUES MARKED * ARE THE BEST SCORE FOR THE ROW.

Kernel size                         0      3      5      7      9      11
False Positive Rate (FPR)
  Sample part of training data   11.25   9.99   9.11* 11.21  16.53  24.40
  Sample NOT part of training    11.24   9.03*  9.32  11.40  17.03  24.30
True Positive Rate (TPR)
  Sample part of training data   85.87  92.26  93.19  94.44  95.40  96.39*
  Sample NOT part of training    73.24  78.43  79.62  84.10  85.29  87.10*
False Negative Rate (FNR)
  Sample part of training data   14.13   7.74   6.81   5.56   4.60   3.61*
  Sample NOT part of training    26.76  21.57  20.38  15.90  14.71  12.90*
True Negative Rate (TNR)
  Sample part of training data   88.75  90.01  90.89* 88.79  83.47  75.60
  Sample NOT part of training    88.76  90.97* 90.68  88.60  82.97  75.70
Overall Correct Classification Rate (CCR)
  Sample part of training data   87.31  91.14  92.04* 91.62  89.44  86.00
  Sample NOT part of training    81.00  84.70  85.15  86.35* 84.13  81.40
Although the true positive rates increased consistently with increasing smoothing kernel size, the overall correct classification rate (CCR) is the most important factor to consider. It is defined as CCR = ((TPR + TNR)/(FPR + TPR + FNR + TNR)) × 100, where (F/T)PR is the False/True Positive Rate and (F/T)NR is the False/True Negative Rate. A kernel size of 5 proved to have the highest correct classification rate for a regular test where all participants were part of both the training data and the test data. However, it is more important to have a high correct classification rate when the tester is not in the training set, since this more accurately portrays what happens during live detection (the player would not be in the training set) and better reflects the development process of a typical game developer. For this scenario, the highest correct classification rate occurs at a kernel size of seven. All subsequent experiments use this smoothing kernel size.

In the end it was determined that the best recognition rates come from culling both the training and test data, as well as culling the live detection test samples. While discarding sample data may seem like the wrong thing to do, the reality is that abnormally small gestures are typically those whose start points were misidentified and that are essentially missing a part of the gesture. This assessment is supported by the fact that we get better recognition rates by removing outliers from the data acquisition process.
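As a quick worked check of this definition, note that FPR + TNR = 100 and TPR + FNR = 100, so the denominator is always 200 and CCR reduces to (TPR + TNR)/2. The kernel-size-7, tester-not-in-training column of Table 1 confirms this:

```python
# Table 1, kernel size 7, sample NOT part of training data
TPR, FNR, FPR, TNR = 84.10, 15.90, 11.40, 88.60
CCR = (TPR + TNR) / (FPR + TPR + FNR + TNR) * 100
print(round(CCR, 2))                      # 86.35, as reported in Table 1
assert abs(CCR - (TPR + TNR) / 2) < 1e-9  # equivalent closed form
```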
V. GESTURE GROUPING

As our experiments in the live detection setting continued, we realized that the false positive rates were related to the length of the gesture. Some of the longer gestures used in our experiments contained subsets that were close in form to our shorter gestures. To address this issue, we divided the gestures into groups, which allowed us to search for and detect only gestures within their own group. Trained gestures were given a property that categorized them as either small or large and stored in the gesture bank; if live data was classified as small, it would only be compared to the other small gestures, and if it was classified as large, it would only be compared to the large gestures. In our tests, no large gesture performed was ever shorter than 28 symbols, and only one short gesture (out of 70) was longer than 27 symbols, so this effectively removed a very large portion of the false positives that occurred between small and large gestures. Each individual in the test differed in where they received false positive results. This indicates that many of the false positives are dependent on the tester and could ultimately be eliminated as the tester practices and learns to perform the gestures correctly. A sketch of this grouping step follows.
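A minimal sketch of the grouping step is given below; the gesture-bank representation and names are our assumptions, with the 27/28-symbol boundary taken from the trials described above.

```python
def candidate_gestures(sample_symbols, gesture_bank, small_max_len=27):
    """Restrict matching to the sample's length group.  Each trained
    gesture in the bank carries a 'size' tag of 'small' or 'large'."""
    group = "small" if len(sample_symbols) <= small_max_len else "large"
    return [g for g in gesture_bank if g["size"] == group]
```

The live sample is then scored only against the returned candidates.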
VI. FINAL RESULTS

We achieved very good recognition rates in classical trials, as outlined in Table 3, using the decomposition of the HMM model and the parameters stated above. Our live detection system follows a similar process of recording training data, culling, smoothing, and transforming to our fixed alphabet. We achieved an average 93% success rate with live detection, as outlined in Table 4. These results are slightly higher than our classical trials. We attribute this phenomenon to a learning effect: by the end of the live detection experiments, the subjects had learned how to perform the gestures in order to achieve a successful result. Although the goal is to have the subjects perform the gesture naturally, there seemed to be an innate desire to succeed at the test, which resulted in a learned response and adaptation. While this affects the results of the experimentation, it is actually a preferable outcome from a game developer's point of view because, much like in any current game, this adaptation allows the game to become more easily playable with practice.
TABLE 3
RECOGNITION RESULTS FOR ALL 7 GESTURES WHEN THE PERSON IS IN AND NOT IN THE TRAINING SET.

                    In        Not in
True +ve           94.4%      84.1%
False -ve           5.6%      15.9%
True -ve           88.8%      88.6%
False +ve          11.2%      11.4%
Correct            91.6%      86.4%

TABLE 4
RECOGNITION RESULTS FOR ALL 7 GESTURES WHEN THE PERSON IS NOT A PART OF THE TRAINING DATA.

Gesture    +ve Rec Rate (%)
1          100
2          95.0
3          92.5
4          80.7
5          97.5
6          97.5
7          95.0
Our final set of experiments resulted in the confusion matrix in Table 5. This experiment determines the likelihood of different gestures being misclassified. In this case, each input received from the live system was compared to all gestures in the gesture bank, i.e., we did not categorize the input samples as large or small. When the log likelihood scores were equal or relatively close, we deemed the result a multiple classification and thus a confusion; as a result, not all rows sum equally. Table 5 indicates significant confusion between the gestures labeled Roll Boulder and Block, as well as between Counter Spell and Wand Wave. In retrospect, these moves are very similar: Roll Boulder and Block are both pivot moves from the elbow joint, and Wand Wave is a tighter version of the Counter Spell gesture, which means that a quick Counter Spell
gesture does not loop enough to allow differentiation. There are ways to minimize this confusion, for example by looking at the general direction of the motion in the raw acceleration data. However, in the interest of clarity on the HMM success rates, we present only the HMM results here. A hypothetical sketch of such a direction check follows.
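One plausible realization of that direction check, entirely our sketch, is to compare the axis carrying the largest net acceleration before accepting a near-tied classification:

```python
import numpy as np

def dominant_axis(raw_samples):
    """Axis (0=x, 1=y, 2=z) with the largest absolute net acceleration
    over the sample; could break ties between close HMM scores."""
    net = np.abs(np.asarray(raw_samples, dtype=float).sum(axis=0))
    return int(np.argmax(net))
```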
TABLE 5
CONFUSION MATRIX FOR 4 USERS, PERFORMING 10 SAMPLES OF EACH GESTURE IN THE LIVE DETECTION SYSTEM. ROWS ARE THE PERFORMED GESTURES; COLUMNS ARE THE CLASSIFICATIONS.

Gesture     1     2     3     4     5     6     7
1          40     6     0     8     0     4     0
2           2    38     0     0     0     6     0
3           0     0    37     0     0     0     0
4          21     5     0    35     0     4     0
5           0     0     0     0    39     0    16
6           0     0     0     0     0    39     0
7           1     1     1     0     1     0    38
A. Left Hand Testing with Right Hand Training

In the interest of completeness, we conducted a final experiment to determine whether gestures trained with the right hand could be detected when the sensor was placed on the left hand for testing. Practically speaking, this is an issue that may affect the overall processing loop, because it may require us to double the amount of training and recognition time to work effectively for both hands. As we predicted, for the majority of gestures, the left hand tests were not recognized as valid gestures when compared against the right hand training data. This is because our left and right arms have a different "natural tilt," which results in different acceleration values when mirrored gestures are performed. However, some gestures can be detected well by the left hand with the right hand training set; these gestures involve no roll or tilt, so they can be performed fairly easily by both hands without running into the rotation limits of the arm, specifically at the elbow. Figure 5 shows the results. For optimal results in gesture recognition, it is recommended that data not be reused between limbs: training sets should not be transferred to mirrored gestures, and each gesture for each limb should be trained individually. As a rule of thumb, if the gesture requires significant motion at the elbow, then both a left hand and a right hand version should be trained.
Fig. 5. Trained Right-Handed Gestures Performed with the Right Hand (light bars) and with the Left Hand (dark bars).
VII. DISCUSSION

We trained gestures by recording acceleration data from multiple individuals, multiple times. To achieve our goal of the best recognition rates in a live performance test, we found that the following training pipeline was required (sketched in code below):

- Culling the data to remove gestures that were abnormally long or short.
- Filtering the raw sensor data using a uniform kernel with a window size of seven.
- Transforming the raw sensor data into strings comprised of the 27 discrete symbols from our alphabet.
- Training the HMM.
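A compact sketch of that pipeline, reusing the smooth and symbolize helpers sketched earlier, is given below. The percentile cut-offs for culling and the train_discrete_hmm trainer are illustrative stand-ins; the paper does not specify its culling thresholds, and any Baum-Welch implementation for discrete HMMs could fill the training step.

```python
import numpy as np

def cull(recordings, low_pct=5, high_pct=95):
    """Drop abnormally short or long samples (outlier culling); the
    percentile cut-offs here are stand-ins, not the paper's values."""
    lengths = np.array([len(r) for r in recordings])
    lo, hi = np.percentile(lengths, [low_pct, high_pct])
    return [r for r in recordings if lo <= len(r) <= hi]

def prepare(raw_xyz, window=7):
    """Smooth each axis with the width-7 uniform kernel, then map the
    result onto the 27-symbol alphabet."""
    filtered = np.column_stack([smooth(raw_xyz[:, a], window) for a in range(3)])
    return symbolize(filtered)

# One discrete HMM per gesture.  train_discrete_hmm is a hypothetical
# stand-in for any Baum-Welch trainer over a 27-symbol alphabet, and
# gesture_recordings maps gesture names to lists of raw T x 3 samples.
models = {name: train_discrete_hmm([prepare(r) for r in cull(recs)], n_symbols=27)
          for name, recs in gesture_recordings.items()}
```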
Our live detection system follows a similar process:

- Detecting a gesture start and recording the sensor data (the sample).
- Culling the data to remove gesture samples that are too short or too long.
- Filtering the sample using a uniform kernel with a window size of seven.
- Transforming the sample data into strings comprised of the 27 discrete symbols from our alphabet.
- Testing the sample against the trained HMM models.

Our training philosophy of culling, filtering, and transforming the data into a discrete finite alphabet has shown itself to be meritorious in improving the results of a practical scenario where gestures are detected in a live system, similar to what would be expected of a video game application. We were able to achieve a 93% success rate in live detection trials, and confusion rates were generally minimal. Confusion rates can be further reduced by using the raw sensor data to confirm the general direction of the gesture, as well as by examining gesture length. Another option would be to redesign the input moves to be more distinct, but this goes against our original desire to have the recognition system work for the game designer, rather than have the designer work around the recognition system. Furthermore, the method outlined above is theoretically valid for any 3D sensor system, as the transformation operation can be built around the sensor to ensure a uniform distribution of the alphabet symbols emitted by the classes. It should also be noted that specific sensors may not always provide a uniform distribution of data for all gestures, so it should be expected that certain gestures may be better suited to certain sensors. A live recognition loop in the spirit of this process is sketched below.
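The following minimal live-classification loop (our sketch, omitting the gesture-grouping filter for brevity) scores a detected sample against every trained model and keeps the best above-chance match:

```python
import numpy as np

def recognize_live(raw_xyz, models, window=7, alphabet_size=27):
    """Return the best-matching gesture name, or None when no model
    beats the chance baseline.  models maps names to (A, B, pi)."""
    symbols = symbolize(
        np.column_stack([smooth(raw_xyz[:, a], window) for a in range(3)]))
    chance = len(symbols) * np.log(1.0 / alphabet_size)
    best_name, best_odds = None, 0.0
    for name, (A, B, pi) in models.items():
        odds = forward_log_likelihood(symbols, A, B, pi) - chance
        if odds > best_odds:
            best_name, best_odds = name, odds
    return best_name
```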
VIII. CONCLUSIONS

In this work we presented an HMM-based gesture recognition system that comfortably meets the computational requirements of a typical interactive application such as a video game. We are able to perform on the order of 1800 individual recognition trials per second on a 2.6 GHz dual core AMD processor: more than suitable for real-time applications. Our live recognition rate experiments mimic a real-life environment and perform better than other methods, virtually doubling the recognition rates, especially compared to those in the continuous domain. The decomposition technique allows any 3D sensing device to use a discrete HMM as part of a gesture recognition system, while the speed allows multiple sensors to be used together. Although we attempted to achieve a base recognition system that a game designer could use without having to consider its capabilities, we fall a bit short in that regard, since gestures that may not appear that similar to the typical game designer
are similar nonetheless. However, we claim that the system outlined in this work does free the designer from needing a complete understanding of the recognition system and the sensor system in order to design relevant gestures into a game. Open problems include efficiency of computation: it is fully expected that a larger alphabet will result in better recognition rates overall, but it will also require more training samples, and the biggest detriment to a larger alphabet currently is the computational requirement of performing the recognition task. The system is fast enough to allow multiple-limb gestures to be tested, and accurate enough that compound errors are not significantly problematic.
REFERENCES

[1] A. Whitehead, N. Crampton, K. Fox, and H. Johnston, "Sensor networks as video game input devices," in Proceedings of the 2007 Conference on Future Play, 2007, pp. 38–45.
[2] B. Rigby, "Microsoft sells 10 million Kinect devices," Reuters, 2011. [Online]. Available: http://www.reuters.com/article/2011/03/09/us-microsoft-kinect-idUSTRE7286CL20110309
[3] S. Ullah, H. Higgins, B. Braem, B. Latre, C. Blondia, I. Moerman, S. Saleem, Z. Rahman, and K. S. Kwak, "A comprehensive survey of wireless body area networks," J. Med. Syst., vol. 36, no. 3, pp. 1065–1094, Jun. 2012.
[4] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Readings in Speech Recognition, Morgan Kaufmann, San Francisco, CA, 1990.
[5] L. Kratz, M. Smith, and F. J. Lee, "Wiizards: 3D gesture recognition for game play input," in Proceedings of the 2007 Conference on Future Play, Toronto, Ontario, Canada, 2007.
[6] C. Keskin, A. Erkan, and L. Akarun, "Real time hand tracking and 3D gesture recognition for interactive interfaces using HMM," in Proceedings of the Joint International Conference ICANN/ICONIP, 2003.
[7] J. Segen and S. Kumar, "Fast and accurate 3D gesture recognition interface," in Proc. 14th International Conference on Pattern Recognition (ICPR), Brisbane, Australia, 1998.
[8] Q. Chen, A. El-Sawah, C. Joslin, and N. Georganas, "A dynamic gesture interface for virtual environments based on hidden Markov models," in Proc. IEEE HAVE, Ottawa, 2005.
[9] J. Segen and S. Kumar, "Human-computer interaction using gesture recognition and 3D hand tracking," in Proceedings of ICIP, vol. 3, p. 188, 1998.
[10] J. M. Teixeira, T. Farias, G. Moura, J. Lima, S. Pessoa, and V. Teichrieb, "Gefighters: an experiment for gesture-based interaction analysis in a fighting game," in SBGames, Brazil, 2006.
[11] W. Freeman, D. Anderson, P. Beardsley, C. Dodge, M. Roth, C. Weissman, W. Yerazunis, H. Kage, K. Kyuma, Y. Miyake, and K. Tanaka, "Computer vision for interactive computer graphics," IEEE Computer Graphics and Applications, vol. 18, no. 3, pp. 42–53, 1998.
[12] J. Kela, P. Korpipaa, J. Mantyjarvi, S. Kallio, G. Savino, L. Jozzo, and D. Marca, "Accelerometer-based gesture control for a design environment," Personal Ubiquitous Computing, vol. 10, no. 5, pp. 285–299, 2006.
[13] J. Mantyjarvi, J. Kela, P. Korpipaa, and S. Kallio, "Enabling fast and effortless customisation in accelerometer based gesture interaction," in MUM '04: Proceedings of the 3rd International Conference on Mobile and Ubiquitous Multimedia, New York, NY, USA, 2004.
[14] H. Sawada, N. Onoe, and S. Hashimoto, "Acceleration sensor as an input device for musical environment," in Proc. International Computer Music Conference, Hong Kong, 1996.
[15] J. Payne, P. Keir, J. Elgoyhen, M. McLundie, M. Naef, M. Horner, and P. Anderson, "Gameplay issues in the design of spatial 3D gestures for video games," in CHI '06 Extended Abstracts on Human Factors in Computing Systems, New York, NY, USA, 2006.
[16] H.-K. Lee and J. H. Kim, "An HMM-based threshold model approach for gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 961–973, 1999.
[17] S. Eickeler, A. Kosmala, and G. Rigoll, "Hidden Markov model based continuous online gesture recognition," in Proceedings of the 14th International Conference on Pattern Recognition, vol. 2, p. 1206, 1998.
[18] S. Rajko, G. Qian, T. Ingalls, and J. James, "Real-time gesture recognition with minimal training requirements and on-line learning," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, June 18–23, 2007.
Anthony D. Whitehead is an Associate Professor in, and Director of, the School of Information Technology at Carleton University. He is also the current Chair of the Human Computer Interaction Program at Carleton University. His research interests can be described as practical applications of machine learning, pattern matching and computer vision as they relate to interactive applications (including games), computer graphics, vision systems and information management.
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.