A NATURAL HAND GESTURE HUMAN COMPUTER INTERFACE USING CONTOUR SIGNATURES

Paulo Peixoto
ISR - Institute of Systems and Robotics
Department of Electrical and Computer Engineering
University of Coimbra
Polo II - Pinhal de Marrocos, 3030 Coimbra
email: [email protected]

Joao Carreira
ISR - Institute of Systems and Robotics
Department of Electrical and Computer Engineering
University of Coimbra
Polo II - Pinhal de Marrocos, 3030 Coimbra
email: [email protected]

ABSTRACT
Communication between humans and computers could benefit greatly from more natural, non-intrusive forms of communication. We, as humans, frequently use gestures in our daily routines, so the idea of employing natural hand gestures to control the way our computers work is very attractive. In this paper we propose a novel human computer interface based on hand gesture recognition using computer vision. The proposed method is not only concerned with replacing the mouse by using the hand as another pointing device, but also with adding a richer gesture vocabulary, allowing several frequent actions to be understood by the computer (hand gestures such as dragging, closing a window or simulating a simple mouse click). The proposed method is also able to deal with two of the major difficulties associated with any gesture recognition task: how to determine the beginning and end of a gesture during a continuous hand trajectory, and how to deal with the spatial-temporal variations users produce when performing the same gesture each time.

KEY WORDS
Hand Gesture Recognition, Human Computer Interfaces, Computer Vision.

1 Introduction

The WIMP (windows, icons, menus, pointers) paradigm, together with the mouse and the keyboard, has been decisive in the generalization of the use of computers. It provides users with a clear model of what actions and commands are possible and what their outcomes can be. It also allows users to have a sense of accomplishment and responsibility about their interactions with computer applications [8]. Under this paradigm, users express their intents to the computer using their hands to perform key presses, button clicks and mouse positioning. This is a rather unnatural, limiting way of interaction. As computers become more and more pervasive in our daily life, it is highly desirable that interaction with them doesn't fundamentally differ from interaction with

other persons and with the rest of the world. That is the ground of Perceptual User Interfaces (PUI), which are concerned with extending human computer interaction to use all modalities of human perception. One of the most promising approaches for the early development of PUI is the use of vision-based interfaces, which perform online hand gesture recognition. The advantages of hands are their high precision and speed. Their capabilities for HCI have been thoroughly certified by the success of tools like mice, keyboards and joysticks: humans easily learn how to use them to execute the most diverse and complex tasks. Also, vision-based interfaces are unobtrusive and inexpensive, making them a good fit. In traditional HCI, most attempts have used some device, such as instrumented gloves, to incorporate gestures into the interface. If the goal is natural interaction in everyday situations this might not be acceptable. However, a number of applications of hand gesture recognition for HCI exist. Most of them require restricted backgrounds and camera positions, and a small set of gestures performed with one hand. They can be classified as applications for pointing, presenting, digital desktops, virtual workbenches and VR. Gesture input can be categorized [4] into deictic gestures (pointing at an object or direction), mimetic gestures (accepting or refusing an action), and iconic gestures (defining an object or its features). Pavlovic et al. [7] noticed that, ideally, the naturalness of a human computer interface requires that any and every gesture performed by the user should be interpretable, but that the state of the art in vision-based gesture recognition is far from providing a satisfactory solution to this problem. A major reason is obviously the complexity associated with the analysis and recognition of gestures. A number of pragmatic solutions to gesture input in human computer interfaces have been proposed in the past, such as: the use of props or input devices (e.g., a pen or a data glove), restricting the object information (e.g., the silhouette of the hand), restricting the recognition situation (uniform background, restricted area) and restricting the set of gestures being understood. Liu and Lovell [5], for instance, proposed a system for tracking real-time hand gestures captured by a cheap web camera and a standard Intel Pentium based personal computer

without any specialized image processing hardware. In this paper we describe a human computer interface based on hand gesture recognition using computer vision. Our main goal is to accomplish a natural interface, based on the recognition of a reasonable set of hand gestures, without sacrificing computational efficiency. The recognition is based on the analysis of the temporal variation of the hand contour. This variation allows a parametric representation of each gesture, which reduces the amount of data to process during the recognition step. The proposed setup consists of a small firewire camera located in front of the computer. This camera is used to acquire images of the user's hand, which are then used to segment, track and recognize a set of predefined hand gestures. Several types of gestures were considered: pointing gestures, which provide a very natural and intuitive method of engaging an interface, and command gestures, which make the execution of complex tasks fast and easy. By combining these two methods we hope to achieve the intuitiveness of the pointing system while maintaining the efficiency of the command gesture system.

2 Representing Dynamic Hand Gestures Using Contour Signature Images

The foundation of the proposed hand gesture recognition method is the parametric representation of each dynamic gesture as a recording of the temporal evolution of the contour of the hand: the contour signature image (CSI). In order to compute the CSI of each gesture, three steps are required (a high-level sketch of this pipeline is given below):

Hand segmentation. In this first step we identify the contour of the hand. Special care must be taken here, since the quality of the segmentation will determine the performance of the proposed method.

Computing the contour signature of each hand image. The next step consists of the definition of the contour signature (CS) as the parametric representation of the contour. This contour signature is defined as the set of points on the contour expressed in polar coordinates, with the origin of the coordinate system at the centroid of the segmented hand blob. Each contour signature is normalized so that it can be grouped with other contour signatures along time.

Computing the contour signature image. To start the gesture recognition process we need to compute the contour signature image. This image is a collection of contour signatures along time. The recognition process then becomes a traditional pattern recognition task: we verify the computed CSI against a set of previously recorded CSIs corresponding to the set of recognizable gestures (obtained by training).
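As an orientation aid, the following Python fragment is a high-level sketch of how the three steps could be chained per frame. The helper routines (segment_hand, contour_signature, ContourSignatureImage, classify) and the model container are illustrative assumptions, sketched in the following sections; the trained eigenspaces and gesture models are assumed to have been computed offline.

```python
def process_frame(frame, skin_hist, track_window, csi_rho, csi_theta, model):
    # Step 1: hand segmentation -> binary hand blob mask (Section 2.1).
    mask, track_window = segment_hand(frame, skin_hist, track_window)
    # Step 2: contour signatures (rho and theta) of the current contour (Section 2.2).
    rho, theta = contour_signature(mask)
    # Step 3: update the rolling contour signature images (Section 2.3) and
    # match them against the trained gestures in the eigenspace (Section 3).
    gesture = classify(csi_rho.update(rho), csi_theta.update(theta),
                       model.space_rho, model.space_theta,
                       model.gestures, model.threshold)
    return gesture, track_window
```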

2.1 Hand Segmentation Process

The success of the recognition process depends a great deal on the quality of the segmentation. A good segmentation, one that allows the full recovery of the hand's contour, is essential to the success of the entire process. In this paper we do not intend to provide a solution to this complex problem; we simply assume that the segmentation process ensures the necessary quality for the identification of the contour of the hand. Another important requirement for the segmentation process is related to the proposed objective of real-time performance: it should be elaborate enough to obtain a good quality segmentation and, at the same time, simple enough to allow a real-time implementation. We use the Continuously Adaptive Mean-shift algorithm (Camshift) [2] to track the user's hand position, orientation and scale. In order to use it, a probability distribution image of the skin color is needed. Invariance to illumination changes and different skin tones is also desirable. To achieve this, we first create a model of the desired hue using a color histogram, as in [1]. We use the Hue Saturation Value (HSV) color system, which corresponds to projecting the standard Red, Green and Blue (RGB) color space along its principal diagonal, from white to black. The HSV space separates hue (color) from saturation (how concentrated the color is) and intensity. The color model is created by taking a 16-bin 1-D histogram of the H (hue) channel in HSV space. For hand tracking via a skin color model, skin regions from the users are sampled by prompting them to select an area of their hand's skin with the mouse. The hues derived from the skin pixels in the image are sampled from the H channel and binned into the 1-D histogram. When sampling is complete, the histogram is saved for future use. The low selectivity of this method means that it works for most kinds of skin tones and produces solid blobs. The method can misbehave when the scene contains objects with colors similar to the skin color (like orange and red). When the user wears short sleeves an additional problem occurs: the segmented blob also includes the forearm. To solve this problem another segmentation process, based on the statistics of the hand dimensions, is used to separate the hand from the forearm. An example of the result of the segmentation process is shown in Figure 1. The skin color histogram (obtained by learning several samples of skin images) is used as a lookup table to convert incoming video pixels into a skin probability distribution image. This image is used by the tracker to identify potential hand pixels. We threshold the output of the tracker and apply morphological operators (erosion and dilation) to compute the hand's blob. A sketch of this segmentation stage is given below.
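The following is a minimal sketch of this stage using OpenCV, assuming the user has already selected a skin patch and an initial track window; the 16 hue bins follow the text, while the threshold value and the kernel size are illustrative assumptions, not values reported in the paper.

```python
import cv2
import numpy as np

def learn_skin_histogram(bgr_roi):
    """Build the 16-bin 1-D hue histogram from a user-selected patch of skin."""
    hsv = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [16], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def segment_hand(bgr_frame, skin_hist, track_window):
    """Return the hand blob mask and the updated CamShift track window."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    # Skin-color probability image via histogram back-projection.
    prob = cv2.calcBackProject([hsv], [0], skin_hist, [0, 180], scale=1)
    # CamShift tracks the hand's position, orientation and scale.
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, track_window = cv2.CamShift(prob, track_window, criteria)
    # Threshold and clean up with erosion/dilation to obtain a solid blob.
    _, mask = cv2.threshold(prob, 60, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.dilate(cv2.erode(mask, kernel), kernel)
    return mask, track_window
```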

Figure 1. Overview of the segmentation process: (a) Hand tracking. (b) Result of the color segmentation process (skin color probability distribution image). (c) Final segmented blob after thresholding and application of morphological operators.

Figure 2. Example of the Drag gesture.

Figure 4. The contour signature is defined using the polar coordinates of each point belonging to the contour of the hand. The center of the coordinate system is the centroid of the hand blob.

2.2 Contour Signature Definition

Assuming that, after the segmentation process, we obtain a correct view of the hand blob contour, we can proceed to the next step: the determination of the contour signature of the hand. The contour of the segmented object (S) consists of a finite set of N_i points in the image (s_k) that defines the basic shape of the hand (Figure 4): S = {s_k = (x_k, y_k), k = 1, ..., N_i}. We assume that the contour S has the following properties:

• S is closed, i.e. s_1 is adjacent to s_{N_i}.
• S has a depth of one single point.
• S is defined by counting points in the clockwise direction.

The starting point for the definition of the contour signature is the polar coordinate representation of each point s_k belonging to the contour of the segmented blob. The polar coordinates are defined in such a way that the origin of the coordinate system is the centroid C = (c_x, c_y)^T of the segmented region R, defined as:

c_x = \frac{\sum_x \sum_y f(x,y)\, x}{\sum_x \sum_y f(x,y)}, \qquad c_y = \frac{\sum_x \sum_y f(x,y)\, y}{\sum_x \sum_y f(x,y)}, \qquad \text{with } f(x,y) = \begin{cases} 1 & \text{if } (x,y) \in R \\ 0 & \text{otherwise} \end{cases} \qquad (1)

Figure 3. Example of the Click gesture.


Given the contour S = (s_1, s_2, ..., s_{N_i})^T from the segmented hand in frame i, we can compute the coordinate ρ_k, which corresponds to the Euclidean distance of each point to the centroid of the segmented hand blob:

\rho_k = |s_k - C| = \sqrt{(x_k - c_x)^2 + (y_k - c_y)^2}, \quad k = 1..N_i

We can also compute the θ_k coordinate for each point on the contour:

\theta_k = \arctan\frac{y_k - c_y}{x_k - c_x}, \quad k = 1..N_i

These two coordinates can be expressed as two discrete functions, each of which defines a contour signature: CS_ρ(i) = {ρ_1, ρ_2, ..., ρ_{N_i}} and CS_θ(i) = {θ_1, θ_2, ..., θ_{N_i}}. Since we want to guarantee an unequivocal representation of each gesture, we need to consider both contour signatures: one corresponding to the ρ coordinate and another to the θ coordinate. The length of the contour signature is determined by the number of points that compose the contour itself. Since this number of points varies over time, we need to normalize the dimension of the contour signature. To accomplish this we sub-sample the contour in order to obtain a contour signature of fixed dimension. To improve the quality of the sub-sampling process we use linear interpolation. For each new image we sub-sample both discrete functions CS_ρ(i) and CS_θ(i) to obtain two new discrete functions, CSN_ρ(i) = {ρ_1, ρ_2, ..., ρ_N} and CSN_θ(i) = {θ_1, θ_2, ..., θ_N}, both with a constant length N. To allow for a more reliable correspondence between the several recognizable gestures, the contour signature

should be invariant to translation and scale changes in the image. Invariance to translation is accomplished naturally, since the contour is defined in relation to a coordinate system with its origin at the centroid of the hand blob. The same is not true for scale: different distances from the camera to the hand imply different contour amplitudes. To solve this problem we simply normalize the ρ_k coordinates so that they lie in the range 0 ≤ ρ_k ≤ 1. This is accomplished by dividing each ρ_k by ρ_max = max(ρ_k), with k = 1..N. The contour signature corresponding to the θ coordinate does not have this problem, since the interval of variation of θ is independent of scale. However, for processing compatibility reasons the angles θ_k are also scaled to 0 ≤ θ_k ≤ 1. Another important issue is the definition of which of the contour points should be considered the first point of the contour signature (i.e., which point should be considered point k = 1). This point is defined as the point with the maximum ρ coordinate. A sketch of the whole computation is given below.
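The fragment below sketches how the normalized contour signatures could be computed from a binary hand mask with OpenCV and NumPy (OpenCV ≥ 4 return convention for findContours). The signature length N = 75 follows the text; the particular scaling of θ into [0, 1] and the use of np.interp for the linear-interpolation resampling are implementation assumptions, and the contour orientation (clockwise) is not explicitly enforced here.

```python
import cv2
import numpy as np

def contour_signature(mask, N=75):
    """Return the length-N normalized (rho, theta) contour signatures of a hand mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float64)
    # Centroid of the blob (Eq. 1) from the image moments of the mask.
    m = cv2.moments(mask, binaryImage=True)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    dx, dy = contour[:, 0] - cx, contour[:, 1] - cy
    rho = np.sqrt(dx ** 2 + dy ** 2)
    theta = np.arctan2(dy, dx)
    # Start the signature at the contour point farthest from the centroid.
    start = int(np.argmax(rho))
    rho, theta = np.roll(rho, -start), np.roll(theta, -start)
    # Resample both signatures to the constant length N using linear interpolation.
    src = np.linspace(0.0, 1.0, len(rho))
    dst = np.linspace(0.0, 1.0, N)
    rho_n = np.interp(dst, src, rho)
    theta_n = np.interp(dst, src, np.unwrap(theta))
    # Scale into [0, 1]: rho by its maximum, theta by its range (one possible choice).
    rho_n /= rho_n.max()
    theta_n = (theta_n - theta_n.min()) / (theta_n.max() - theta_n.min() + 1e-9)
    return rho_n, theta_n
```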

2.3 Definition of the Contour Signature Images

Figure 5. Illustration of the process used for the creation of a contour signature image (CSI_ρ). On each new frame i, a new contour signature (CS_ρ,i) is added to the image while the oldest one (CS_ρ,i−M) is discarded. In this example M = 75, which corresponds to a gesture with a duration of 3 seconds.

After the acquisition of every new image, the corresponding contour signature is grouped with all previously computed contour signatures. This grouping results in an image that represents the dynamic variation of the contour of the gesture along time (Figure 5). The dimension of this image depends on the temporal duration of each gesture. If we consider that a gesture typically has a duration of M frames and that the assumed dimension of the contour signature is N, then the contour signature image has dimension M × N. After the acquisition of every new image i, we can define the contour signature images CSI_ρ^i and CSI_θ^i as follows:

CSI_\rho^i = \begin{bmatrix}
\rho_{N,i}(1) & \rho_{N,i}(2) & \cdots & \rho_{N,i}(N) \\
\rho_{N,i-1}(1) & \rho_{N,i-1}(2) & \cdots & \rho_{N,i-1}(N) \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{N,i-M+1}(1) & \rho_{N,i-M+1}(2) & \cdots & \rho_{N,i-M+1}(N)
\end{bmatrix}

and

CSI_\theta^i = \begin{bmatrix}
\theta_{N,i}(1) & \theta_{N,i}(2) & \cdots & \theta_{N,i}(N) \\
\theta_{N,i-1}(1) & \theta_{N,i-1}(2) & \cdots & \theta_{N,i-1}(N) \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{N,i-M+1}(1) & \theta_{N,i-M+1}(2) & \cdots & \theta_{N,i-M+1}(N)
\end{bmatrix}

Figure 6. Visual representation of three different contour signature images, corresponding to three different gestures: (a) Drag. (b) Click. (c) Close Window.

At each instant of time, the construction process for the contour signature images can be viewed as a scroll of each of the two images followed by the insertion of the newest contour signature in the first line of each image. This procedure allows us to obtain a parametric representation of each gesture. As can be seen, this process is far from

being computationally demanding, which makes it suitable for real-time applications. Figure 6 presents three examples of contour signature images corresponding to three different gestures. In this specific case M = 75, which means that the represented gestures have a duration of about 3 seconds (assuming a standard video rate of 25 frames/second). The length of each contour signature is also N = 75. An interesting property that results from the way we represent each gesture is that gestures that are symmetrical with respect to the horizontal axis of the image are also represented by symmetrical contour signature images. This means that the recognition process can be programmed to automatically deal with this kind of symmetry. A minimal sketch of the scrolling construction is given below.
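A minimal sketch of the scrolling construction described above: an M × N buffer where each new signature becomes the first row and the oldest row is discarded (M = N = 75 as in the text). The class name and zero-initialization are illustrative assumptions.

```python
import numpy as np

class ContourSignatureImage:
    def __init__(self, M=75, N=75):
        self.csi = np.zeros((M, N))

    def update(self, signature):
        # Scroll all rows down by one and insert the newest signature on top.
        self.csi = np.roll(self.csi, 1, axis=0)
        self.csi[0, :] = signature
        return self.csi
```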

3 Gesture Recognition From Contour Signature Images

If a database with the contour signature images of each recognizable gesture is available, then the recognition process becomes one of finding a correct match between the contour signature image corresponding to the gesture being performed and each gesture in that database.

The gesture recognition process thus becomes a typical pattern recognition problem. There are several methods in the literature that address this kind of problem. Our choice of method was based on two criteria: first, a method that promises good results in terms of recognition rates and, second, a method simple enough to be easily implemented in real-time. Although several methods that address this kind of problem are described in the literature, we chose to use a derivation of principal component analysis (PCA) called RPCA, or robust principal component analysis [3]. An additional advantage of these methods is that the memory costs are small, since instead of storing the entire set of training images we only store a small amount of information (the eigenvectors that define the eigenspace and the coordinates of each training sequence in that eigenspace). This particular method improves the quality of the estimation of the eigenspace by reducing the influence of outliers (caused, for instance, by errors in the segmentation process). For each of the gestures to be recognized by the system, we capture several sequences of images in which the gesture is fully expressed. These sequences of images are used to compute the pair of contour signature images CSI_ρ^n and CSI_θ^n. We use several instances of the same gesture in order to capture small variations of it (ideally performed by several different subjects). Since we are in fact using two different images, we construct two different eigenspaces, one for each contour signature image. The parametric representation of each gesture in the eigenspace allows us to classify the gesture being performed in a very efficient way. In the eigenspace, the correlation between two images corresponds to the distance between their projections [6]. This property is used to perform the recognition. After every new frame is captured, and after the segmentation process, the two contour signature images in ρ and in θ are determined. The average of the training images is subtracted from every new contour signature image, which is then projected onto the eigenspace defined by the training images. If we consider that Y_ρ is the contour signature image of the gesture being recognized and that C_ρ is the average of the images in the training set, then the projection of the current gesture onto the eigenspace is given by:

z_\rho = [e_{\rho 1}, e_{\rho 2}, \ldots, e_{\rho k}]^T (Y_\rho - C_\rho)

where k is the number of eigenvectors used to define the eigenspace. This process is not computationally demanding, since it only requires the computation of an inner product of the vector defining the current gesture with each of the k eigenvectors that define the eigenspace. The recognition process consists in determining which of the reference contour signature images, obtained by training, best corresponds to the image generated

by the gesture being recognized. However, due to several factors, such as errors in the segmentation process and the variability of the instances of each trained gesture, that correspondence cannot be exact. To overcome this, we find the gesture that minimizes the distance between the point z_ρ and the representation of that gesture in the eigenspace (g_{ρ,i}):

d_\rho = \min_i \| z_\rho - g_{\rho,i} \|, \quad i = 1..N_g

where N_g is the number of different gestures recognizable by the system. Since several training contour signature images are used for each gesture, one may ask: what is the distribution in the eigenspace of points belonging to different instances of the same gesture? Since the different images correspond to the same gesture, the differences should be small, which implies that their projections in the eigenspace will be concentrated in the same area. In a simplistic way, we have considered that all such points are contained in a hypersphere, whose equation is obtained by fitting the data obtained for each gesture. If we assume that the coordinates of the contour signature image in the eigenspace are given by z_ρ = (z_{ρ1}, z_{ρ2}, ..., z_{ρk}), then we can relate this point to the hypersphere fitted to the projections of the training images of gesture j using the following expression:

(z_{\rho 1} - zc_{j,\rho 1})^2 + (z_{\rho 2} - zc_{j,\rho 2})^2 + \ldots + (z_{\rho k} - zc_{j,\rho k})^2 = R_j

where zc_{j,ρ1}, zc_{j,ρ2}, ..., zc_{j,ρk} and R_j are obtained by fitting the data from the several training instances of the same gesture j, with j = 1..N_g. This computation is done for each of the contour signature images (both in ρ and in θ). For each gesture, both distances are combined using the following expression:

d_g = \sqrt{d_\rho^2 + d_\theta^2}

If the smallest of these distances is below a certain threshold L, we assume that the gesture being performed is of the same type as the gesture corresponding to the closest point. If, on the other hand, the minimal distance is above the threshold L, then we consider that the gesture being performed is unknown. A sketch of this classification step is given below.
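The fragment below is a simplified sketch of the projection and classification steps, assuming the eigenspaces (eigenvector matrices E and mean images C, e.g. obtained offline with robust PCA) are already available. For brevity each gesture is represented here by a single representative point per eigenspace (for instance the fitted hypersphere center); the full hypersphere fit described in the text is not reproduced. All names are illustrative assumptions.

```python
import numpy as np

def project(csi, E, C):
    """Project a flattened M*N contour signature image onto the eigenspace."""
    return E.T @ (csi.ravel() - C.ravel())

def classify(csi_rho, csi_theta, space_rho, space_theta, gestures, L):
    """space_* = (E, C); gestures maps a gesture name to its (g_rho, g_theta) points."""
    z_rho = project(csi_rho, *space_rho)
    z_theta = project(csi_theta, *space_theta)
    best, best_d = None, np.inf
    for name, (g_rho, g_theta) in gestures.items():
        # Combine the rho and theta distances: d_g = sqrt(d_rho^2 + d_theta^2).
        d = np.hypot(np.linalg.norm(z_rho - g_rho), np.linalg.norm(z_theta - g_theta))
        if d < best_d:
            best, best_d = name, d
    # Reject the gesture as unknown when the smallest distance exceeds the threshold L.
    return best if best_d < L else None
```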

4 Experimental Results

In order to evaluate the performance of the proposed method we carried out several experiments. We defined a set of ten hand gestures that enables the user to perform the following set of actions: pointing, dragging, clicking, open window, open start menu, page down, page up, open menu (equivalent to right click on a mouse), enter and delete.


Gesture          Recognition Rate (%)
Pointing         93.64
Dragging         94.21
Clicking         96.42
Open Window      90.91
Start Menu       94.21
Page Down        96.42
Page Up          96.62
Open Menu        96.86
Enter            97.15
Delete           97.41

Table 1. Experimental results of the proposed method in terms of recognition rate (percentage of gestures correctly recognized by the proposed method).

Both static and dynamic gestures were considered (five of each). For some gestures additional information was also determined: for the pointing and clicking gestures, for instance, we also compute the coordinates of the pointing finger in order to assign the pointing position to the mouse position. To determine the position of the pointing finger, a histogram of the ρ coordinate (the distance to the centroid) is used; the point of maximum variation corresponds to the pointing finger (a simple sketch of this step is given at the end of this section).

The gesture recognition system was implemented on a PC (Pentium 4, 3 GHz) running Windows XP. The video images were captured using a firewire camera. All computations are performed by the host computer and images are processed at the full frame rate. The interaction with the operating system was achieved by implementing a Windows XP service running in the background. Each of the gestures was trained using ten different instances of the same gesture, performed by two different persons. The results are summarized in Table 1. We define "recognition rate" as the percentage of gestures correctly recognized by the proposed method. As can be seen, the overall behavior of the method is very good. Practical experiments have shown that in most of the cases where errors occurred, the user was able to overcome the problem by simply repeating the erroneous command.
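For illustration only, the paper's "point of maximum variation" of the ρ histogram is approximated below by the contour point farthest from the centroid, which is a simplifying assumption rather than the authors' exact criterion.

```python
import numpy as np

def fingertip_from_contour(contour, centroid):
    """contour: (K, 2) array of (x, y) points; centroid: (cx, cy)."""
    rho = np.linalg.norm(contour - np.asarray(centroid, dtype=float), axis=1)
    # Take the point with the largest distance to the centroid as the fingertip.
    return tuple(contour[int(np.argmax(rho))])
```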

5 Discussion

In this paper we have presented a natural hand gesture human computer interface using contour signatures. The proposed method allows a natural interaction with the computer through the use of very intuitive and simple gestures to control the most common tasks on a computer. Although the proposed method only allows the recognition of hand gestures that are fully identified by the analysis of their contour, we believe that, with some imagination, one can come up with a full set of different recognizable gestures which can be associated with the most common control actions on the computer. We plan to extend this kind of interface to new interaction paradigms where more than one user can freely interact with the computer, allowing for a more natural interaction with it. We envision a collaborative scenario that will enable teams to view, share, annotate, manage and make decisions on visual digital assets more effectively than with traditional computer interfaces. Every participant can freely interact with the computer, without the need for special external devices; instead, they can use a set of natural gestures as a modality for basic computer interaction. This approach will provide an elegant solution for a wide range of users and will improve the nature of computer-based collaboration.

6 Acknowledgements

Research described in the paper was financially supported by FCT under grant No. POSC/EEA-SRI/61451/2004.

References

[1] Gary R. Bradski. Computer vision face tracking for use in a perceptual user interface. In IEEE Workshop on Applications of Computer Vision, pages 214–219, October 1998.
[2] Intel Corporation. Intel open source computer vision library reference manual. http://www.intel.com/research/mrl/research/opencv.
[3] F. De la Torre and M. Black. Robust principal component analysis for computer vision. In Proc. International Conference on Computer Vision, pages 362–369, Vancouver, Canada, 2001.
[4] Jin Liu, Siegmund Pastoor, Katharina Seifert, and Jörn Hurtienne. Three-dimensional PC: toward novel forms of human-computer interaction. In Three-Dimensional Video and Display: Devices and Systems, SPIE CR76, 2000.
[5] Nianjun Liu and Brian Lovell. MMX-accelerated real-time hand tracking system. In IVCNZ, pages 26–28, November 2001.
[6] H. Murase and S. Nayar. Visual learning and recognition of 3D objects from appearance. International Journal of Computer Vision, 14(1):5–24, 1995.
[7] V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 19(7):677–695, 1997.
[8] Mathew Turk and George Robertson. Perceptual user interfaces. Communications of the ACM, 43(3), March 2000.
