2014 Eighth International Conference on Complex, Intelligent and Software Intensive Systems
Continuous hand openness detection using a Kinect-like device
Vito Gentile, Salvatore Sorce, Antonio Gentile
Dipartimento di Ingegneria Chimica, Gestionale, Informatica e Meccanica Università degli Studi di Palermo 90128 Palermo, Italy
[email protected],
[email protected],
[email protected]
Abstract-This paper presents a novel method to reproduce in real time the opening and closing gestures of a human hand by animating a three-dimensional model of it. In other works, this result is achieved by mapping a set of significant points of a real hand onto the corresponding points of the model to animate. We propose an alternative way to produce the same effect without mapping points, using instead a level-based estimation of the degree of opening of the hand. The experiments were carried out using Microsoft Kinect™, but the method would work on any other Kinect-like device (as defined herein). The results obtained are encouraging and demonstrate real-time performance of the system.
Keywords-gesture recognition; microsoft kinect; depth data; human-computer interaction; 3D animation
I. INTRODUCTION
In a wide variety of contexts it is useful to animate a 3D model of the hand so as to reproduce the actual gestures of a user. Several devices can track the movements of a hand in three dimensions [1] and send the data to a computer that interprets and recognizes the gestures. However, these devices typically require the user to wear them, or to stay close enough to one or more sensors that track his or her movements.
Approaches to gesture recognition that do not rely on wearable devices are often based on depth data extracted from the scene (in some cases combined with color information). Devices such as Microsoft Kinect™ [2], ASUS Xtion [3] or similar have been widely used because of their reduced invasiveness [4]. Using the data collected by these sensors, it is possible to devise algorithms capable of recognizing static hand gestures at distances of about 2 meters [5] [4].
In [5], the authors have already proposed a method to determine whether the hand is open or closed using a Microsoft Kinect™. In this paper, building on that work, we propose one possible use of that recognition system, aimed at assessing the approximate level of openness of the hand, and we show how to use it to animate a three-dimensional model.
In general, this solution is not suitable in all those cases in which the animation is mainly focused on the hand. In such a situation, the simulated movement has to be as similar as possible to the real one in order to ensure a high level of realism. When, instead, the animation scene involves the entire body and the gestures of the hand are not prominent (in other words, they are of lesser importance), it may be useful to adopt a mechanism that simply estimates the movements connected to the opening and closing of a hand, maintaining a good degree of realism while giving up some precision. The proposed algorithm does exactly that, increasing the overall quality of the animation without requiring high computational effort.
The rest of the paper is organized as follows: Section II illustrates the proposed method, while in Section III we introduce the definition of a Kinect-like device. Section IV summarizes the method used to detect whether the hand is open or closed. Section V details the evaluation of the openness degree of the hand, and Section VI shows how this information is used to drive the simulated motion of a 3D model of a hand. Conclusions and future works are given in Section VII.
II. BRIEF METHOD OVERVIEW
The purpose of this work is to provide a method to reproduce in real time the gestures of opening and closing a hand by animating accordingly a three-dimensional model of a hand (or of any object that resembles one). To this end, we assume that different states can be defined for this 3D model, each of which represents a given level of openness. We also assume that it is possible to make a transition between two different states, moving from one degree of openness to the next.
978-1-4799-4325-8/14 $31.00 © 2014 IEEE DOI 10.1109/CISIS.2014.80
Figure 1. Consecutive degrees of openness of a human hand.
The following method simulates the movement without making any direct mapping of the actual points of the hand onto the three-dimensional model, but only by carrying out a series of transitions between states, each of which results in an animation phase. We start from one of our previous algorithms, which determines whether the hand is open or closed using data obtained from a device such as Microsoft Kinect [5]. It will be appropriately adapted to obtain a continuous output, belonging to a well-defined range, which will serve as a decision parameter for evaluating the level of openness of the hand.
III. KINECT-LIKE DEVICES
First, let us define a Kinect-like device as a set of sensors and chips that has the following features:
• ability to obtain depth data of the scene in real time;
• ability to obtain color data of the scene in real time;
• low cost.
With this definition, devices like Microsoft Kinect or ASUS Xtion belong to this category. The recognition method described in this paper (summarized in the next section) was originally intended to be used with Microsoft Kinect [5], but it is easy to prove that it can work with any other Kinect-like device. For this reason, we do not need to restrict the following method to Microsoft Kinect: more generally, the whole process can be carried out by any Kinect-like device. In the following, we refer to Microsoft Kinect only because we used it for our experiments.
IV. OPENED/CLOSED HAND RECOGNITION
The recognition method adopted in this paper is based on what is explained in [5] and may be briefly summarized as follows. Let us assume that the user is standing in front of the Microsoft Kinect device, at a distance between 1.5 and 2 meters. This is the optimal distance for the best results, and it allows the method to be integrated with applications based on full-body gestures. Under these conditions, the sensor can obtain information about the skeleton of the identified user; more precisely, it is able to extract a set of 20 joints, two of which correspond to the hands. Each joint is associated with a position in three-dimensional space.
Besides the skeleton, we can also consider the so-called depth image of the scene, that is, a grayscale image in which each gray level represents the distance between the point of the scene corresponding to the rendered pixel and the depth sensor included in Microsoft Kinect. Since it is possible to map the joint directly onto the depth image, we can proceed with an image segmentation process based on depth data. This step aims at highlighting the region that is adjacent to the hand joint (mapped onto the depth image) and made up of pixels with similar gray levels. With this procedure, we get a binary mask to be used as input to a neural classifier (after an appropriate conversion to vector form).
The classifier is a suitably trained multilayer neural network with two outputs limited to the range [-1, 1]: the first one represents the level of openness of the hand, while the second one indicates how much it is closed. After the output of the neural network has been computed, a filter based on a time average is used to reduce the noise due both to the low quality of the sensors included in Kinect and to the unavoidable imprecision of the classification process. Figure 2 shows an example of the image processing for the opened/closed hand classification, whereas Figure 3 shows the whole logical classification process.
Figure 2. Identification of the hand (a), depth image of it (b) and extraction of the depth mask (c) from which to derive the input for the classification process.
At this step, the method described in [5] determines whether the hand is open or closed by checking which of the two outputs has the highest value. The algorithm upgrade we propose in this paper aims at detecting the level of openness of the hand. In the next section, we present how we carry out this estimation starting from the raw output of our neural network.
Figure 3. Schematic summary of the recognition method adopted.
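The depth-based segmentation step summarized in Section IV can be illustrated with a minimal region-growing sketch. This is not the implementation used in [5]: the function names, the 4-connected neighbourhood, the `tolerance` value and the choice of comparing every pixel with the gray level of the seed are all assumptions made for illustration, and the depth image is a toy list of lists rather than real Kinect data.

```python
from collections import deque


def segment_hand(depth, seed, tolerance=10):
    """Grow a region from `seed` over pixels whose gray level is close to the seed's.

    depth     -- 2D list of gray levels (toy stand-in for a depth image)
    seed      -- (row, col) of the hand joint mapped onto the depth image
    tolerance -- hypothetical maximum gray-level difference from the seed
    """
    rows, cols = len(depth), len(depth[0])
    seed_level = depth[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not mask[nr][nc] \
                    and abs(depth[nr][nc] - seed_level) <= tolerance:
                mask[nr][nc] = 1
                queue.append((nr, nc))
    return mask


def mask_to_vector(mask):
    """Flatten the binary mask row by row into a classifier input vector."""
    return [float(value) for row in mask for value in row]


# Toy depth image: a far background (200) and a nearer 3x3 "hand" patch.
depth = [[200] * 6 for _ in range(6)]
for r in range(1, 4):
    for c in range(2, 5):
        depth[r][c] = 90 + r

hand_mask = segment_hand(depth, seed=(2, 3))
input_vector = mask_to_vector(hand_mask)
```

In this toy case the mask covers only the 3x3 patch around the seed, and the flattened vector has one entry per pixel of the image.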
V. HAND OPENNESS EVALUATION
In order to get a unique and reliable openness level, we need a formulation that transforms the two outputs of the neural network into a single continuous value, which must lie within the range [0, 1]. To do so, we base our discussion on the following statements:
• the outputs of the neural network, which we call $O_{OPEN}$ and $O_{CLOSED}$, represent the levels of openness and closure of the hand; therefore, the higher $O_{OPEN}$ is, the lower $O_{CLOSED}$ will be, and vice versa (apart from classification errors, which must still be taken into account);
• we define a significant output $O_S$ of the classification process as the single value that summarizes the meaning of the two outputs of the neural network. It is computed using only one of the two outputs $O_{OPEN}$ and $O_{CLOSED}$, and in particular it is based on the higher value;
• the significant output $O_S$ of the neural network is to be considered reliable only if it is greater than 0.5; the reliability A is maximum for $O_S$ greater than or equal to 0.7, and it decreases with $O_S$, until it becomes null with it;
• the reliability A is null for values of $O_S$ that are less than 0;
• the reliability A further decreases if the outputs of the neural network, $O_{OPEN}$ and $O_{CLOSED}$, are similar.
From these statements, we can deduce the following formulas for the significant output $O_S$ and the reliability A, as functions of the two outputs of the neural network, $O_{OPEN}$ and $O_{CLOSED}$:
(1a)
(1b)
where d represents the difference between $O_{OPEN}$ and $O_{CLOSED}$, in the case they are too similar. It can be defined as follows:
(2)
Using the outputs of the neural network, the reliability and the significant output, we can calculate the level of openness of the hand L with the following formula:
(3)
Figure 4 shows the trend of the L curve vs. $O_S$ variations in the cases in which $O_{OPEN} \geq O_{CLOSED}$ and $O_{OPEN} < O_{CLOSED}$, with d 0.2. We can also notice (and it is simple to prove analytically) that L ∈ [0, 1].
Figure 4. Trend of the L curve vs. $O_S$ variations when (a) $O_{OPEN} \geq O_{CLOSED}$ and (b) $O_{OPEN} < O_{CLOSED}$
The computed openness L can be used as an input parameter to modulate any control that must act according to it, such as a 3D model of a hand. Equation (3) takes into account that, if the reliability level is too low (which can occur when the position of the hand cannot be simply classified as “open” or “closed”), the 3D model must in any case assume one of the predefined states. The lower the level of reliability, the closer the state to be used must be to the intermediate one.
A. A more general formulation for L
It is easy to see that, due to noise as well as to the impossibility of perfectly training the neural network, the value of $O_S$ will almost never be equal to 1. This implies that L will almost never take the values 0 and 1, which makes the cases in which the level of openness of the hand is minimum or maximum very rare. Thus, it can be useful to reformulate the level of openness of the hand by defining L as follows:
(4)
In this formulation, O and C allow us to modulate the speed at which L tends to 1 (or to 0, depending on the case). For high values of O, L tends to 1 faster, while as C increases, L reaches 0 more quickly. Figure 5 shows the trend of L for O = 3 and C = 1.8. In the tests we performed, these values proved to be the most appropriate.
Figure 5. L trend using equation (4)
Note also that, for O = C = 1, equation (4) reduces to equation (3).
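The choice of the significant output $O_S$ described at the beginning of this section (the higher of the two network outputs) can be sketched as follows. This is only an illustration: the function name, the returned label and the handling of ties in favour of "open" are assumptions, and the reliability A and the openness L are deliberately left out.

```python
def significant_output(o_open, o_closed):
    """Return (O_S, label) for two network outputs, each in [-1, 1].

    O_S is taken as the higher of the two outputs; the label records which
    output it came from. Resolving ties as "open" is an assumption.
    """
    if o_open >= o_closed:
        return o_open, "open"
    return o_closed, "closed"


o_s, label = significant_output(0.82, -0.64)
```

The label is kept alongside $O_S$ because the later steps need to know which of the two outputs was selected.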
B. Filtering of L
In order to reduce the effect of noise (which can propagate from the output of the neural network up to L), it can be useful to add, after computing L, a time average filter, which reduces the contribution of abrupt variations due to noise. In our tests, we adopted a temporal moving average filter that uses the last 6 time samples of L, appropriately weighted. Let $L_{filtered}(t)$ be the filtered output at time t, and L(t) the output L computed according to equation (4). We have that:
$$L_{filtered}(t) = \sum_{i=0}^{5} w_{t-i}\, L(t-i) \qquad (5)$$
where the weights $w_{t-i}$ must be such that:
and:
Obviously, other filters can be used instead, which may provide better results at the cost of a greater computational effort.
VI. ANIMATING A 3D MODEL OF A HAND
Once we have the degree of openness L, we can use it to modulate how open a 3D model of the hand appears. Suppose we have a three-dimensional object that is comparable to a human hand in its shape and in its ability to close and open. Objects of this type may be the ones shown in Figure 6.
Figure 6. Two 3D models that can be animated with the method proposed in this paper.¹
¹ Images are taken from: (a) http://www.3dcadbrowser.com/download.aspx?3dmodel=11036 (b) http://www.123dapp.com/123C-3D-Model/Boxing-Gloves/592887
For simplicity, assume that we use a 3D hand model like the one shown in Figure 6a, and suppose we fix N states, each of which corresponds to a level of openness, where level number 1 represents the closed hand and level number N represents the open hand. Assume also that a function called goToLevel(i) is available, which animates the hand to change its current degree of openness to level i, where i ∈ {1, 2, …, N}.
To run the animation, we can imagine an application that takes advantage of two parallel threads. The first one, which we call the recognizer thread, continuously performs the recognition and updates the value of the shared variable L. The second one, which we call the animator thread, reads the value of L, rounds it and uses it as the argument of the function goToLevel(). By executing these two threads simultaneously, we can implement an application that animates the three-dimensional model according to the continuously estimated levels of openness.
Obviously, the function goToLevel(i) requires a minimum amount of time to execute the animation. If this time depends on the difference between the initial and final levels of openness, the real hand may change its state, even significantly, during the animation. To minimize this undesired effect, it is possible:
• to act on the timing of the animations, making them not too long;
• to change the two-thread model, allowing the recognizer thread to send feedback about updates of the value of L directly to the animator thread, so that the latter can correct the animation on the fly, during its execution.
VII. CONCLUSIONS AND FUTURE WORKS
The strategy described in this paper for animating a 3D hand can also be used and adapted in more general applications that require 3D modelling and gesture-based multimodal interfaces. The level of openness of the hand L can in fact be used to edit some shape parameters of a three-dimensional (or two-dimensional) object, in a manner that depends arbitrarily on the shape and other properties of the model. For example, the hands could be used to manipulate objects in 3D modeling software, giving a more natural feeling of interaction.
The proposed method can also be integrated into software that maps the whole human body onto 3D models of it (or even onto anthropomorphic figures) [6] [7]. Assuming, for example, that we use information about the skeletal joints extracted by Microsoft Kinect to map the body gestures of a user onto a 3D model of the human body, the hands can be animated in a simulated manner. By simulation, we mean using the parameter L mentioned above as the estimated level of openness, and the function goToLevel() to execute the animation. This avoids having to extract some sort of “hand skeleton” (if it is possible at all, given the low resolution of Kinect) to be mapped onto the corresponding points of the 3D model used, and thus helps to meet the real-time requirement of the animation.
Additionally, the precision of the hand openness evaluation could be increased by using two Kinects with perpendicular viewing directions. With two views of the same hand, evaluating the level of openness can be less difficult.
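The filtering step of Section V.B and the two-thread scheme of Section VI can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights `[6, 5, 4, 3, 2, 1]` (newest sample first), the linear mapping from L to a level, the class and function names, and the synthetic sequence of raw L values are all hypothetical, and `go_to_level` is a stand-in for goToLevel() that only records the requested level.

```python
import threading
import time
from collections import deque


class WeightedMovingAverage:
    """Weighted average of the most recent samples, newest sample first."""

    def __init__(self, weights):
        total = float(sum(weights))
        self.weights = [w / total for w in weights]  # normalised to sum to 1
        self.samples = deque(maxlen=len(weights))

    def update(self, value):
        self.samples.appendleft(value)  # the oldest sample drops out when full
        used = self.weights[:len(self.samples)]
        # Renormalise while the window is still filling up.
        return sum(w * s for w, s in zip(used, self.samples)) / sum(used)


def to_level(openness, n_levels):
    """Map L in [0, 1] to a level in {1, ..., N} (assumed linear mapping)."""
    clamped = min(1.0, max(0.0, openness))
    return 1 + int(round(clamped * (n_levels - 1)))


class SharedOpenness:
    """Shared variable L, protected by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0.0

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def recognizer(shared, raw_values, done):
    """Recognizer thread: filter each new estimate of L and publish it."""
    smoother = WeightedMovingAverage([6, 5, 4, 3, 2, 1])  # hypothetical weights
    for raw in raw_values:
        shared.set(smoother.update(raw))
        time.sleep(0.001)
    done.set()


def animator(shared, done, go_to_level, n_levels):
    """Animator thread: read L, round it to a level, request the animation."""
    current = None
    while True:
        finished = done.is_set()  # checked first so the last value is not missed
        level = to_level(shared.get(), n_levels)
        if level != current:
            go_to_level(level)
            current = level
        if finished:
            break
        time.sleep(0.001)


visited = []  # levels requested by the animator, in order
shared, done = SharedOpenness(), threading.Event()
raw_values = [0.0] * 5 + [1.0] * 20  # synthetic L: hand closed, then open
threads = [
    threading.Thread(target=recognizer, args=(shared, raw_values, done)),
    threading.Thread(target=animator, args=(shared, done, visited.append, 5)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The animator only calls `go_to_level` when the rounded level changes, which keeps the number of animation requests small. The feedback channel from the recognizer to the animator mentioned in Section VI is not modelled here.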
REFERENCES
[1] Piyush Kumar, Jyoti Verma, and Shitala Prasad, "Hand Data Glove: A Wearable Real-Time Device for Human-Computer Interaction," International Journal of Advanced Science and Technology, vol. 43, pp. 15-25, 2012.
[2] Microsoft. (2014, February) Kinect for Windows. [Online]. http://www.microsoft.com/en-us/kinectforwindows/
[3] ASUS. (2014, February) Xtion PRO LIVE. [Online]. http://www.asus.com/Multimedia/Xtion_PRO_LIVE/
[4] Jesus Suarez and Robin R. Murphy, "Hand Gesture Recognition with Depth Images: A Review," in 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, Paris, France, 2012, pp. 411-417.
[5] Salvatore Sorce, Vito Gentile, and Antonio Gentile, "Real-Time Hand Pose Recognition Based on a Neural Network Using Microsoft Kinect," in Eighth International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA), Compiegne, France, 2013, pp. 344-350.
[6] Mauro Migliardi, Marco Gaudina, Andrea Brogni, "Enhancing personal efficiency with pervasive services", Proc. of the Sixth International Conference on Broadband and Wireless Computing, Communication and Applications, October 26-28, 2011, Technical University of Catalonia, Barcelona, Spain, DOI: http://dx.doi.org/10.1109/BWCCA.2011.27 .
[7] Mauro Migliardi, Marco Gaudina, "The 4W (What-Where-When-Who) Project Goes Social", Proc. of The Sixth International Conference on Complex, Intelligent, and Software Intensive Systems, CISIS 2012, Palermo (IT), July 4-6 2012