RATFG-RTS 2001 in conjunction with ICCV 2001, Vancouver, Canada, pp. 119-124 (2001)

The Hand Mouse: GMM Hand-color Classification and Mean Shift Tracking

Takeshi Kurata    Takashi Okuma    Masakatsu Kourogi    Katsuhiko Sakaue

Intelligent Systems Institute, National Institute of Advanced Industrial Science and Technology (AIST)
1-1-1 Umezono, Tsukuba, Ibaraki, 305-8568 JAPAN
[email protected]

Abstract

This paper describes an algorithm to detect and track a hand in each image taken by a wearable camera. We primarily use color information; however, instead of predefined skin-color models, we dynamically construct hand- and background-color models by using a Gaussian mixture model (GMM) to approximate the color histogram. We use a spatial probability distribution of hand pixels both to obtain the estimated mean of hand color needed by the restricted EM algorithm that estimates the GMM and to classify hand pixels based on Bayes decision theory. Because a static distribution is inadequate for the hand-tracking stage, we translate the distribution with the hand motion based on the mean shift algorithm. Using the proposed method, we implemented the Hand Mouse, which uses the wearer's hand as a pointing device, on our Wearable Vision System.

1. Introduction

As computers, displays, cameras, and other sensors become smaller, less expensive, and more efficient, systems consisting of these components, and the services they provide, are expected to spread and become important in our daily lives. One of the distinctive advantages of wearing computers and sensors is that such systems have the potential to assist the wearer more adaptively than usual desktop systems, by sharing experiences with the wearer at all times and by understanding the wearer's context[10]. Vision plays an important role in understanding contextual information, so we have been researching context-aware systems and services based on computer vision techniques, which we call Visual Wearables[1, 8, 9, 13]. Visual context awareness[7, 11, 12, 16] can be regarded as an autonomous input interface that is needed

to construct augmented environments adaptively. Explicit input interfaces are also essential for enabling interaction with the environment, for example by pointing and clicking on visual tags and web links, which are associated with real objects and overlaid on video frames taken by the wearer's camera, as shown in Figure 1. However, it is very difficult to apply the human interfaces of usual desktop environments to wearable augmented environments because of their lack of mobility and operability. Several attempts have been made to solve this problem by using computer vision techniques[11, 14, 15, 17, 19]. This paper describes an algorithm to detect and track a hand in each image taken by a wearable camera, developed for the Hand Mouse[9], which uses the wearer's hand as a pointing device, as described in [11, 15]. Hand segmentation is an essential process in detecting and tracking a hand. Compared to images taken by a camera fixed on a desktop PC or a wall, images taken by a wearable camera can have various lighting conditions and backgrounds resulting from the wearer's motions. This makes background subtraction and color segmentation with predefined color models rather difficult to perform. In this study, we primarily use color information to detect and track the hand. Instead of using predefined skin-color models, we dynamically construct hand- and background-color models based on the hand-color segmentation method proposed in [19]. The method uses a Gaussian mixture model (GMM) to approximate the color histogram of each input image. The GMM is estimated by the restricted Expectation-Maximization (EM) algorithm, in which the standard EM algorithm[5] is modified to make the first Gaussian distribution an approximation of the hand-color distribution. The method uses a static spatial probability distribution of hand pixels both to obtain the estimated mean of hand color needed by the restricted EM algorithm and to classify hand pixels based on Bayes decision theory.

However, the static distribution is inadequate for the Hand Mouse because the hand location is not fixed. In this paper we describe a method that translates the distribution to the appropriate position based on the mean shift algorithm[4, 6]. This method is computationally inexpensive and works effectively because it can track a hand while dynamically updating the hand- and background-color models. We preliminarily evaluated the performance of the Hand Mouse in complicated environments with different backgrounds and changing lighting conditions. We then implemented it on our Wearable Vision System, which allows us to execute various software modules cooperatively and in parallel.

Figure 1: The wearer is clicking on a visual tag (link) associated with the poster on the wall in front of him.

2. Hand-color Segmentation

This section briefly describes the Bayes decision theory framework for hand-color segmentation proposed in [19]. To minimize the influence of lighting, this method uses an HS color space in the HSV (or HSI) color system to calculate a 2-D color histogram. The color histogram P is approximated by a GMM, which is a weighted sum of K Gaussian distributions N_1, ..., N_K and can be defined as follows:

\[
P(c; \pi, \mu, \sigma) = \sum_{i=1}^{K} \pi_i N_i(c; \mu_i, \sigma_i^2), \qquad \sum_{i=1}^{K} \pi_i = 1,
\]

where c represents color, π_i is the weight of each Gaussian distribution N_i, and each N_i has mean μ_i and standard deviation σ_i. The GMM is estimated by using the restricted EM algorithm, in which the standard EM algorithm is modified to make the first Gaussian distribution N_1 an approximation of the hand-color distribution. To model the hand color in N_1, σ_1 is fixed and μ_1 is controlled as follows:

\[
\mu_1 = E(C_{\mathrm{hand}}), \qquad \sigma_1 = (\sigma_{\mathrm{low}} + \sigma_{\mathrm{high}})/2,
\]
where σ_low and σ_high are the lower and upper limits of σ_1, respectively, obtained from training data, and E(C_hand) is the estimated mean of the hand-color distribution.

To obtain E(C_hand), the method uses each input image together with a spatial probability distribution of hand pixels, P(hand|x, y), so that the color information of pixels with high probability is weighted according to P(hand|x, y). Figure 2 is an example of P(hand|x, y) generated from training data such as Figure 4 (right). The hand pixels and background pixels are then classified based on the Bayes decision criterion:

\[
P(c \mid \mathrm{hand})\, P(\mathrm{hand} \mid x, y) > P(c \mid \mathrm{background})\, (1 - P(\mathrm{hand} \mid x, y)), \tag{1}
\]
where P(c|hand) = π_1 N_1(c) and P(c|background) = Σ_{i=2}^{K} π_i N_i(c).
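As a reference, the sketch below illustrates these two steps with numpy/scipy: estimating E(C_hand) as the P(hand|x, y)-weighted mean of pixel colors, and applying decision rule (1) per pixel. The function names, the array layout, and the use of full covariance matrices (the paper only requires per-channel standard deviations) are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy.stats import multivariate_normal

def estimated_hand_color(hs_image, prior):
    """E(C_hand): the spatial-prior-weighted mean of the pixel HS colors.

    hs_image: (H, W, 2) array of hue/saturation values.
    prior:    (H, W) array holding P(hand | x, y)."""
    colors = hs_image.reshape(-1, 2).astype(np.float64)
    weights = prior.reshape(-1, 1)
    return (weights * colors).sum(axis=0) / weights.sum()

def classify_hand_pixels(hs_image, prior, pi, mu, cov):
    """Decision rule (1): a pixel is labeled 'hand' when
    P(c|hand) P(hand|x,y) > P(c|background) (1 - P(hand|x,y)).
    Component 0 of the GMM models hand color; components 1..K-1 the background."""
    h, w, _ = hs_image.shape
    c = hs_image.reshape(-1, 2)
    p_c_hand = pi[0] * multivariate_normal.pdf(c, mean=mu[0], cov=cov[0])
    p_c_bg = sum(pi[i] * multivariate_normal.pdf(c, mean=mu[i], cov=cov[i])
                 for i in range(1, len(pi)))
    p_hand_xy = prior.reshape(-1)
    hand = p_c_hand * p_hand_xy > p_c_bg * (1.0 - p_hand_xy)
    return hand.reshape(h, w)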

3. Hand Detection and Tracking

3.1. Hand Detection around Its Initial Location

In [19], the authors assumed that the wearer's hand is widely open and occupies the central area of each image. In that case, the static and wide-based spatial probability distribution shown in Figure 2 can be considered reasonable. However, such a wide-based distribution is inadequate for the Hand Mouse because the hand appears smaller than assumed in [19]. This inconsistency in appearance can cause incorrect estimation of E(C_hand) or make hand-pixel classification using (1) difficult to perform. To simplify this problem, we divide the whole process into two stages: the hand-detection stage and the hand-tracking stage. In the hand-detection stage, we assume that the wearer puts the forefinger's tip into a guide circle, as shown in Figure 5, when using the Hand Mouse interface. Figure 3 shows P(hand|x, y) for the initial location obtained from our training data, in which the diameter of the guide circle was set to 25% of the vertical angle of view. This assumption can be considered reasonable because it makes hand-pixel classification easy to perform. Furthermore, it is useful for distinguishing explicit input such as the Hand Mouse from autonomous visual context sensing based on the wearer's hand motion, as described in [16].
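For reference, a spatial prior such as those in Figures 2 and 3 can be estimated by averaging manually segmented binary hand masks over the training images. The sketch below only illustrates that idea, under the assumption that all masks share the same resolution; it is not the authors' exact procedure.

import numpy as np

def spatial_hand_prior(hand_masks):
    """Estimate P(hand | x, y) as the per-pixel frequency of hand pixels
    over a stack of binary training masks (each H x W, values 0 or 1)."""
    masks = np.asarray(hand_masks, dtype=np.float64)
    return masks.mean(axis=0)  # (H, W) array of probabilities in [0, 1]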

Figure 4: Training data: (left) input image; (right) hand portion segmented manually.

Figure 2: Spatial probability distribution of hand pixels obtained from 660 images such as Figure 4.

Figure 5: Hand Mouse Indicators.

Figure 6 shows the GMM (K = 5) of Frame 2 in Figure 7 estimated by the restricted EM algorithm. In this figure, (GMM HAND), (GMM BACKGROUND), (HAND), and (BACKGROUND) indicate the estimated hand color P(c|hand), the estimated background color P(c|background), the actual hand-color histogram, and the histogram of the input image, respectively. In Figure 7, the estimated hand pixels are overlaid with black pixels.

Figure 3: Spatial probability distribution of hand pixels from 42 images with the forefinger's tip located within the guide circle in Figure 5.

3.2. Hand Tracking Based on the Mean Shift Algorithm

As described above, P(hand|x, y) in Figure 3 is the spatial probability distribution of a hand at its assumed initial location, and the wearer's hand is detected only around that location with this distribution. Once detected, the hand should be tracked until it moves out of the frame. However, since there is no constraint on how the hand moves from the initial position, we cannot use P(hand|x, y) in Figure 3 as it is. To overcome this problem, we propose a method that translates P(hand|x, y) to the appropriate position derived by the mean shift algorithm[4, 6], which is a simple iterative procedure that climbs the gradient of a probability distribution to find the nearest dominant mode. We combine each iteration of the mean shift algorithm with hand-pixel classification using (1) so that P(hand|x, y) is gradually translated to the current position of the hand. In the next frame, E(C_hand) is obtained with P(hand|x, y) translated to the mean location of hand pixels computed in the previous frame, and the GMM is then estimated by the restricted EM algorithm. As a result, this method can track the hand while dynamically updating the hand- and background-color models.

Figure 7: Estimated hand pixels indicated by black pixels (Frames 1, 2, 6, 9, and 12).

Figure 6: GMM for Frame 2 in Figure 7.

In Figure 7, Frames 6, 9, and 12 show that this method could successively track the hand detected in Frame 2, and Figure 8 shows the trajectory of the pointer that follows the motion of the forefinger's tip.

Figure 8: Trajectory of the pointer.

The algorithm for each frame in the hand-tracking stage can be summarized as follows (a minimal code sketch is given after the list):

1. Translate the center of mass of P(hand|x, y) to the mean location of hand pixels computed in the previous frame.
2. Obtain E(C_hand) from the translated P(hand|x, y) and the present video frame, and construct hand- and background-color models with a GMM computed by the restricted EM algorithm.
3. Using the translated P(hand|x, y) and the hand- and background-color models, classify the hand pixels and background pixels based on the Bayes decision criterion (1).
4. Compute the mean location of the classified hand pixels.
5. Translate the center of mass of P(hand|x, y) to the mean location computed in Step 4.
6. Repeat Steps 3 to 5 until convergence is achieved (or until the mean location moves less than a preset threshold).
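The sketch below illustrates this per-frame loop. Here fit_gmm and classify are placeholders for the restricted EM estimation (Section 2) and decision rule (1), and the translation of P(hand|x, y) is done with a simple image shift; it is an illustration of the steps above, not the authors' exact implementation.

import numpy as np
from scipy import ndimage

def track_hand_frame(hs_frame, prior, fit_gmm, classify, max_iter=10, eps=1.0):
    """One frame of the hand-tracking stage (Steps 2-6). The prior is
    assumed to have already been translated to the hand location of the
    previous frame (Step 1)."""
    hand_mask = None
    for _ in range(max_iter):
        gmm = fit_gmm(hs_frame, prior)                 # Step 2: restricted EM
        hand_mask = classify(hs_frame, prior, gmm)     # Step 3: decision rule (1)
        if hand_mask is None or not hand_mask.any():
            break
        hand_rc = np.array(ndimage.center_of_mass(hand_mask))   # Step 4
        prior_rc = np.array(ndimage.center_of_mass(prior))
        shift = hand_rc - prior_rc
        prior = ndimage.shift(prior, shift, order=1, mode="constant")  # Step 5
        if np.linalg.norm(shift) < eps:                # Step 6: converged
            break
    return prior, hand_mask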

4. Experimental Results

4.1. Preliminary Evaluation

Using a single 933-MHz Pentium III, it took 40-50 msec to convert RGB (320 × 240) into HSI (160 × 120), generate an HS color histogram (64 × 64), estimate the GMM with the restricted EM algorithm (K = 5), and fit the simple hand-shape model to the classified hand pixels. We preliminarily evaluated the accuracy of pointing using 53 sample images taken in several different environments, including the images in Figures 7 and 9(a). In this experiment, the hand shape and state are determined by fitting a simple skeleton model of hand shapes to the classified hand pixels and by using a state transition diagram, as described in [9].
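As a reference for the preprocessing described above, the sketch below builds a 2-D hue-saturation histogram from a downsampled frame. The use of OpenCV and the 8-bit HSV value ranges are assumptions for illustration, not the paper's implementation.

import cv2
import numpy as np

def hs_histogram(bgr_frame, bins=64):
    """Downsample a frame (e.g. 320x240 -> 160x120), convert it to HSV, and
    accumulate the 2-D hue-saturation histogram used to fit the GMM."""
    small = cv2.resize(bgr_frame, (160, 120))
    hsv = cv2.cvtColor(small, cv2.COLOR_BGR2HSV)
    hue, sat = hsv[..., 0].ravel(), hsv[..., 1].ravel()
    hist, _, _ = np.histogram2d(hue, sat, bins=bins,
                                range=[[0, 180], [0, 256]])  # OpenCV 8-bit H, S ranges
    return hist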

The average distance along the x axis, the average distance along the y axis, and the average Euclidean distance between the estimated location of the forefinger's tip and the ground-truth location measured manually were 1.8%, 4.0%, and 4.7% of the vertical angle of view, respectively. The accuracy along the y axis was strongly affected by the overly simple method we used for determining the hand shape and state. Figure 9 shows example results of the hand-pixel classification using our method. Although Frames (a)-1 and (a)-11 belong to the same sequence and the hand was detected in Frame (a)-1, the highlighted part of the hand could not be classified in Frame (a)-11. As a result, the hand-shape model could not be fit to the classified hand pixels. This means that a single Gaussian model is not necessarily sufficient for approximating hand color. In the case of (c), we can see that the limited dynamic range of cameras makes the problem difficult.

Figure 9: Estimated hand pixels under different lighting conditions and in different environments (Frames (a)-1, (a)-11, (b), and (c)).

4.2. Online Experiments

We have implemented the Hand Mouse on our Wearable Vision System[1]. Figure 10 shows the wearable apparatus of the system; the wearable camera is located under the wearer's ear, like an earring. The system consists of a wearable client, a vision server (a PC cluster), and a wireless LAN, as shown in Figure 11. We have given live technical demonstrations of this system dozens of times[1, 3, 9].

Figure 10: Apparatus for our system's wearer.

A mobile PC in the wearable client captures each image taken by the wearable camera, compresses it using JPEG encoding, and transmits it to the vision server. The wearable client then receives the system output from the server. Although many vision algorithms are computationally heavy for existing stand-alone wearable computers, our wearable system allows us to run different vision tasks in (near) real time, cooperatively, and in parallel. In our online experiments, the system throughput, which included capturing, compressing, and transmitting an image, performing the Hand Mouse operation, and displaying the system output, was 100-120 msec, and the latency was 300-600 msec (wearable's CPU: 650-MHz Mobile Pentium III; server's CPU: dual 933-MHz Pentium III). Figures 12 and 1 show an example of a graphical user interface for a wearable display; the hand-tracking results are also shown in these figures.
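For illustration only, the following sketch shows a capture-compress-transmit loop of the kind described above, using OpenCV and a plain TCP socket with length-prefixed framing. The function names and the framing are assumptions and do not reflect the actual Wearable Vision System implementation.

import socket
import cv2

def wearable_client_loop(server_host, server_port, camera_index=0, quality=80):
    """Capture frames from the wearable camera, JPEG-compress them, and
    stream them to the vision server; each message is prefixed with its
    length so the server can delimit frames."""
    cap = cv2.VideoCapture(camera_index)
    sock = socket.create_connection((server_host, server_port))
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            ok, jpeg = cv2.imencode(".jpg", frame,
                                    [cv2.IMWRITE_JPEG_QUALITY, quality])
            if not ok:
                continue
            data = jpeg.tobytes()
            sock.sendall(len(data).to_bytes(4, "big") + data)
            # The system output (e.g. the pointer location) would be
            # received from the server and displayed here.
    finally:
        cap.release()
        sock.close()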

Figure 11: Wearable client and vision server.

5. Conclusion and Future Work

In this paper, we developed a new approach that effectively combines the dynamic generation of hand- and background-color models with GMMs and the mean shift algorithm into a hand detection and tracking method. The method was applied to the Hand Mouse, which was implemented on our Wearable Vision System. However, we have not yet rigorously evaluated the performance of our method in classifying hand pixels, nor have we thoroughly analyzed the convergence of the tracking procedure. Our future work will address these issues. If the background, lighting, or reflections have colors similar to the hand, color information alone is not sufficient to obtain hand regions. We will therefore also focus on improving our method so that it can take advantage of other image features such as texture and edge information[18].

Figure 12: Soft keyboard pointing.

Acknowledgments: This work is supported in part by the Real World Computing (RWC) Program[2] of METI and also by Special Coordination Funds for Promoting Science and Technology of MEXT of the Japanese Government.

References

[1] http://www.aist.go.jp/ETL/~7234/visualwearables/
[2] http://www.rwcp.or.jp/
[3] 2000 RWC Symposium, http://www.rwcp.or.jp/events/rwc2000/home-e.html
[4] G. R. Bradski. Computer vision face tracking for use in a perceptual user interface. Intel Technology Journal, Q2, 1998.
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., 39(B):1-38, 1977.
[6] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Boston, 1990.
[7] T. Jebara, B. Schiele, N. Oliver, and A. Pentland. DyPERS: Dynamic personal enhanced reality system. Technical Report 463, M.I.T. Media Lab. Perceptual Computing Section, 1998.
[8] M. Kourogi, T. Kurata, K. Sakaue, and Y. Muraoka. A panorama-based technique for annotation overlay and its real-time implementation. In Proc. Int'l Conf. on Multimedia and Expo (ICME2000), TA2.05, 2000.
[9] T. Kurata, T. Okuma, M. Kourogi, and K. Sakaue. The Hand-Mouse: A human interface suitable for augmented reality environments enabled by visual wearables. In Proc. 2nd Int'l Symp. on Mixed Reality (ISMR2001), pages 188-189, 2001.
[10] M. Lamming and M. Flynn. "Forget-me-not": Intimate computing in support of human memory. Technical Report EPC-1994-103, RXRC Cambridge Laboratory, 1994.
[11] S. Mann. Wearable computing: A first step toward personal imaging. Computer, 30(2):25-32, 1997.
[12] W. Mayol, B. Tordoff, and D. Murray. Wearable visual robots. In Proc. 4th Int'l Symp. on Wearable Computers (ISWC2000), pages 95-102, 2000.
[13] T. Okuma, T. Kurata, and K. Sakaue. 3-D annotation of images captured from a wearer's camera based on object recognition. In Proc. 2nd Int'l Symp. on Mixed Reality (ISMR2001), pages 184-185, 2001.
[14] T. Starner, J. Auxier, D. Ashbrook, and M. Gandy. The Gesture Pendant: A self-illuminating, wearable, infrared computer vision system for home automation control and medical monitoring. In Proc. 4th Int'l Symp. on Wearable Computers (ISWC2000), pages 87-94, 2000.
[15] T. Starner, S. Mann, B. Rhodes, J. Levine, J. Healey, D. Kirsch, R. W. Picard, and A. Pentland. Augmented reality through wearable computing. Technical Report 397, M.I.T. Media Lab. Perceptual Computing Section, 1997.
[16] T. Starner, B. Schiele, and A. Pentland. Visual contextual awareness in wearable computing. In Proc. 2nd Int'l Symp. on Wearable Computers (ISWC'98), pages 50-57, 1998.
[17] A. Vardy, J. Robinson, and L.-T. Cheng. The WristCam as input device. In Proc. 3rd Int'l Symp. on Wearable Computers (ISWC'99), pages 199-202, 1999.
[18] Y. Wu and T. S. Huang. View-independent recognition of hand postures. In Proc. IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition (CVPR2000), volume 2, pages 88-94, 2000.
[19] X. Zhu, J. Yang, and A. Waibel. Segmenting hands of arbitrary color. In Proc. 4th Int'l Conf. on Automatic Face and Gesture Recognition (FG2000), pages 446-453, 2000.