Appearance-based Visual Interaction Guangqi Ye, Gregory D. Hager
Abstract

In this project, we propose and implement an appearance-based visual interaction system that is capable of performing real-time foreground segmentation and finger action recognition. Input images are split into multiple sub-images of the same dimension. A hue histogram is built for each sub-image to model the background, because hue is relatively robust to illumination changes. We employ an intersection metric to detect substantial change between the background image and a foreground image; by applying a threshold to the resulting intersection value, we segment the finger from the background. We define a state space in which each point indicates the distance and direction of the finger with respect to the button, and build an HMM for each of four categories of button-trigger actions. Given enough training action sequences, we train the models by maximizing the overall likelihood that the system generates all the training sequences. The probability that each model produces a given state sequence is used as the criterion for the identity of the action. The implemented system runs at about 25 fps while achieving a correct ratio of over 96%.

Keywords: HCI, Color histogram, Segmentation, HMM
1 Introduction
One of the most important goals of computer vision is to enable the computer to understand automatically what is happening in an image scene. To achieve this objective, the computer should be able to identify what is the background and what are the primary objects of interest; this is the subject of image segmentation and pattern recognition. Furthermore, given the segmented image sequences, we also expect the computer to identify what kind of action the subject is performing. Human action recognition
is a widely studied area; gesture recognition, for example, has attracted a large number of researchers [12, 15]. Human-Computer Interaction (HCI) is a promising application field for human action recognition. Vision-based HCI extends traditional HCI to build a friendlier and more natural interface. For example, in a 3D see-through virtual environment, it is desirable that one or more cameras identify the motion of the user's head and eyes; according to the identified direction and range of the user's attention, the system would carry out corresponding actions, such as updating the scene on the display. Compared with traditional interaction methods, such as keyboard and mouse, data gloves, or mechanical sensors, vision-based HCI has many potential advantages:

1. Friendly and natural: This is probably the most significant advantage of vision-based interaction. Users need not wear unfamiliar equipment, nor interact with the computer through unnatural media such as mouse and keyboard. One or more cameras monitor the users and identify their intentions, so a user can communicate with the computer much as with another person, using signs and gestures, head movement, eye movement, and other kinds of body language at will. This kind of deep immersiveness is also a goal of Virtual Reality (VR).

2. Low hardware cost: As the prices of high-quality cameras and fast PCs drop quickly, the total cost of a computer vision system is quite low and acceptable.

In this project, we design and implement a real-time vision system that is able to identify a button-pushing action. This module can easily be incorporated into a larger system that allows the user to interact with the computer through gesture and finger movement. The system setup is quite simple: a static camera supervises a virtual button, represented by a graphic icon, and its surrounding area. The user is expected to move a finger toward the button and stay on the button for a short period of time to trigger it. The system detects the position and direction of the finger at frame rate and decides whether the user has successfully performed a button-pushing action. Thus, fast and robust foreground segmentation and action recognition are the two key elements of the system.
1.1 Background Modeling and Image Segmentation
In order to achieve robust and accurate foreground segmentation, which is necessary for most applications dealing with recognition and tracking, we must resolve the problems of background modeling and foreground representation. The straightforward method is image subtraction; however, this approach suffers from noise and other changes in imaging conditions. Horprasert et al. [1] present a relatively robust background modeling algorithm using color constancy. The color appearance of each pixel is divided into two parts, brightness and chromaticity, and a Gaussian model is built for the color appearance of each pixel from a set of static background images. Automatically selected thresholds, calculated according to the expected detection ratio, are applied to the resulting subtraction image to segment the foreground regions. The required thresholds are learned from a long sequence captured during the background learning period; this modeling technique therefore requires a long training period.

Since color constancy is one of the fundamental capabilities of human vision [2], many other researchers have also proposed color as a cue for segmentation and recognition [3, 4, 5, 6, 7, 8]. Alexander and Buxton [5] investigate several approaches to modeling the color appearance of a monochromatic object, including a trivariate Gaussian model, a fixed planar chromaticity model, an oriented planar model, and a Bingham model, and compare them in a segmentation experiment framework; however, these models only deal with the segmentation of monochromatic objects. Swain and Ballard [3] are among the earliest to use color histograms for recognition. Hadjidemetriou et al. [4] derive the complete classes of local image transformations that preserve the histograms of all images up to a scale factor of their magnitude. Jones and Rehg [6] train a color model for skin and non-skin recognition based on an enormous collection of Internet images; their model achieves a detection rate of 80% with 8.5% false positives. Gevers and Smeulders [7] propose color models invariant to viewing direction, object geometry, and shading, including c1c2c3, l1l2l3, m1m2m3, normalized rgb, and hue, and conclude that l1l2l3 and hue achieve the best accuracy. To attack the problem of sensor noise due to the instabilities of these color-invariant transforms, Gevers [8] recommends a novel histogram construction scheme using variable kernel density estimators. Models are offered for the propagation of sensor noise through the color invariants so that variable kernel density estimation can be applied in a principled way; the associated uncertainty at each color-invariant value is employed to
derive the parameterization of the variable kernel density estimator during histogram construction. Experimental results show the superiority of the method over a traditional histogram computation approach. Other researchers suggest combining multiple image cues for segmentation and recognition. Wren et al. [9] present a multi-class statistical model of color and shape to obtain a 2D representation of head and hands. Gagaudakis and Rosin [10] propose to combine the color histogram with other image features, including texture, distance, and orientation histograms, to achieve about a 10% improvement in recall over the standard color histogram; a weighted sum of all histogram matching values is the criterion for recognition.

In our approach, we use a hue histogram as the background model for two reasons: speed and relative color invariance. The scheme employs a very fast on-line learning process, which is an advantage for this specific application since the area surrounding the button may change from time to time. As mentioned, hue is a good color-invariant model. At the same time, the scheme is computationally efficient, which allows us to achieve frame rate on an average PC.
1.2 Human Activity Recognition
Human motion analysis and recognition is receiving growing attention from researchers in computer vision. This interest is mostly prompted by a wide range of promising applications, such as surveillance, human-computer interfaces, content-based image storage and retrieval, and video conferencing. Aggarwal and Cai [12] review the different approaches to human activity recognition. Template matching [13] compares features extracted from the given image sequence to pre-stored patterns during the recognition process; this method is computationally simple but relatively sensitive to variance in movement duration. Another more prevalent approach is the state-space approach. One representative model is the Hidden Markov Model (HMM), a probabilistic technique for the study of discrete time sequences. HMMs are quite popular in speech recognition, but only recently have they been adopted for recognizing human activity sequences. The features observed in an HMM state may vary from points and lines to 2D blobs. Yamato et al. [14] are among the earliest to employ 2D blobs for recognition: mesh features of binary moving blobs are extracted as low-level features for learning and recognition, the HMM of each class is trained to optimize the generation of its symbol patterns, and recognition is based on the probability of each class generating the given sequence. Wilson and Bobick [15] present a parameterized HMM
(PHMM) by incorporating a global parameter into the standard HMM output probabilities. A generalized expectation-maximization (GEM) algorithm, which relies on a numeric optimization during the maximization step, is employed for both training and recognition.

To learn and recognize the button-pushing action, we propose a computationally efficient and robust feature extraction scheme. The feature indicates the direction and distance of the finger from the center of the button. These features are used to train the HMMs, and recognition is performed by maximizing the likelihood that the HMM of each class generates the observed feature sequence.
1.3 Outline
Section 2 explains the foreground segmentation scheme based on color histograms. In section 2.1, the basic ideas of histogram construction and histogram matching are formulated. Section 2.2 describes our hue histogram approach to modeling the background, and the detailed scheme for segmenting the foreground is explained in section 2.3. In section 3, we describe the button-pushing learning and recognition scheme. First, the basics of HMMs are given in section 3.1. Section 3.2 explains how to extract the necessary features from the segmented image. In section 3.3 we define the structure of the HMM for each category of valid action, and the training and recognition methods are described in section 3.4. We present the experimental results and discussion in section 4: section 4.1 explains the experimental environment and shows the results of the system, and a discussion and evaluation of the system is given in section 4.2. Finally, the conclusion is given in section 5.
2 Modeling and Segmentation Based on Hue Histogram
A robust segmentation method should be able to deal with imaging noise as well as changes in illumination. We propose to use a color histogram as the background model, since the color histogram is an appearance property that is relatively invariant to translation and to rotation about the viewing axis, and changes only slowly with viewing angle, scale, and occlusion. Furthermore, the hue model has the additional advantage of invariance under changes in illumination.
2.1 Histogram Construction and Matching
Given a discrete color space defined by one or more color axes, the color histogram is obtained by discretizing the image colors and counting the number of times each discrete color appears in the image. We use the normalized histogram intersection given by Swain [3] to match the background image and a given image. This matching method is robust to occlusion and to variance in view angle and image resolution. The match value is defined as:

$$H(I, M) = \frac{\sum_{j=1}^{n} \min(I_j, M_j)}{\sum_{j=1}^{n} M_j}$$
The intersection ranges from 0 to 1: the higher the match value, the more similar the two images look. This match value thus serves as a good criterion for background modeling and foreground segmentation.
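To make the matching concrete, here is a minimal Python sketch of the histogram construction and normalized intersection (function and variable names are ours, not from the original implementation; we assume hue values in degrees):

```python
import numpy as np

def hue_histogram(hue_values, num_bins=8):
    """Histogram of hue values, assumed to lie in [0, 360)."""
    hist, _ = np.histogram(hue_values, bins=num_bins, range=(0.0, 360.0))
    return hist.astype(float)

def intersection(image_hist, model_hist):
    """Normalized histogram intersection of Swain and Ballard [3]:
    sum of bin-wise minima divided by the model's total count.
    Returns a value in [0, 1]; higher means more similar."""
    return np.minimum(image_hist, model_hist).sum() / model_hist.sum()
```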
2.2 Hue Histogram for Segmentation
Gevers and Smeulders [7] analyze dichromatic reflectance under "white" illumination. They give the following formula for the measured color value:

$$C_w = C_b + C_s = e\, m_b(\vec{n}, \vec{s})\, k_C + e\, m_s(\vec{n}, \vec{s}, \vec{v})\, c_s f$$

where $k_C = \int_\lambda f_C(\lambda)\, c_b(\lambda)\, d\lambda$ and $f = \int_\lambda f_R(\lambda)\, d\lambda = \int_\lambda f_G(\lambda)\, d\lambda = \int_\lambda f_B(\lambda)\, d\lambda$. Here $f_R(\lambda)$, $f_G(\lambda)$, and $f_B(\lambda)$ represent the spectral sensitivities of the R, G, and B sensors respectively, while $c_b(\lambda)$ and $c_s(\lambda)$ are the albedo and Fresnel reflectance. The geometric terms $m_b$ and $m_s$ denote the geometric dependencies of the body and surface reflection respectively.

According to this reflectance model, even a uniformly colored surface may display a wide disparity of RGB values due to changing circumstances induced by the sensing process. Fortunately, we can find an alternative color model that is insensitive to surface orientation, illumination direction, and intensity. Hue is such a model, dependent only on the sensor and the surface albedo; thus we choose hue as the model with which to construct the histogram:

$$\text{Hue} = \cos^{-1}\frac{R - 0.5G - 0.5B}{\sqrt{(R - G)^2 + (R - B)(G - B)}}$$
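As a concrete illustration, a small Python sketch of this hue computation (the clipping and the epsilon guard against achromatic pixels are our additions; the original system's exact implementation is not specified):

```python
import numpy as np

def hue(rgb):
    """Hue angle in radians from the formula above.
    rgb: float array of shape (..., 3) holding R, G, B."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    num = r - 0.5 * g - 0.5 * b
    # (R-G)^2 + (R-B)(G-B) is always non-negative, so the sqrt is safe
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b))
    # epsilon avoids division by zero on achromatic (gray) pixels
    return np.arccos(np.clip(num / (den + 1e-8), -1.0, 1.0))
```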
2.3 Foreground Segmentation
In our experimental environment, we assume that the background is static. We capture the background image once the camera has stabilized, then split the image into an array of equal-sized sub-images. The size of
Figure 1: An example of the background image, unsegmented image, and segmentation result. The leftmost image is the background, the middle image shows the scene after the hand has entered, and the final segmentation result is shown in the right image.
the window can range from 4 × 4 to 10 × 10. For each sub-image, we build a hue histogram; the number of bins is normally set between 8 and 32. A given foreground image is processed in the same way as the background image. Afterwards, we perform pairwise histogram matching between the background and foreground images: if the matching value is below a threshold τ1, that part of the image is classified as foreground. The threshold τ1 is determined empirically; in our experimental context, with a window size of 4 × 4 and 8 bins in the hue histogram, we choose τ1 = 0.005.

However, due to image noise and sparse areas where the background has an appearance similar to the foreground, the resulting binary image is noisy in some background regions and sometimes disjointed within the actual foreground region. To reduce false-positive foreground regions and to join falsely disconnected foreground regions, we apply a median filter to the binary image. According to the resulting identity image, we perform the segmentation on the original image: all pixels in a window whose identity has been determined to be foreground are considered to belong to new objects in front of the background. Figure 1 shows the segmentation result along with the background image and the unsegmented image.

To test this segmentation scheme, we capture pairs of background and foreground images. By comparing the segmentation result with a ground-truth classification image, generated by manually marking the foreground part of the scene, we can evaluate the algorithm. We capture 20 pairs of background/foreground images covering 4 different background scenes and carry out the experiment on these images. The average correct ratio is 98.16%, with an average false positive ratio of 1.55% and a false negative ratio of 0.29%.
Table 1: Segmentation results (%) for different combinations of window size and number of hue histogram bins

                    4 bins                      8 bins                      16 bins
Size       Corr.     FP       FN       Corr.     FP      FN       Corr.     FP      FN
3 × 3      84.397    15.345   0.258    98.494    1.348   0.158    98.494    1.348   0.158
4 × 4      93.438    6.242    0.320    98.219    1.570   0.211    98.219    1.570   0.211
6 × 6      96.995    2.349    0.656    98.029    1.246   0.725    98.029    1.246   0.725
8 × 8      97.304    1.759    0.837    97.776    1.281   0.943    97.776    1.281   0.943
10 × 10    97.688    0.828    1.484    97.688    0.828   1.484    97.688    0.828   1.484
12 × 12    96.919    0.043    3.308    96.919    0.043   3.308    96.919    0.043   3.308
14 × 14    97.435    0.836    1.726    97.412    0.083   2.505    97.412    0.083   2.505
16 × 16    97.539    0.233    2.228    97.425    0.790   1.785    97.425    0.790   1.785
We also compare the segmentation results for different sizes of the sub-windows and different numbers of bins in the hue histogram. Table 1 shows the correct segmentation ratio, together with the false positive and false negative ratios, based on the background and foreground images displayed in figure 1.
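For reference, a hedged sketch of the complete per-window segmentation pass, built on the `hue_histogram` and `intersection` helpers above (SciPy's median filter stands in for the cleanup step; the filter size of 3 is our assumption, not stated in the text):

```python
import numpy as np
from scipy.ndimage import median_filter

def segment(frame_hue, bg_hists, win=4, num_bins=8, tau1=0.005):
    """Label each win-by-win window foreground (True) when its hue
    histogram intersects the stored background histogram below tau1,
    then median-filter the binary map to suppress isolated errors."""
    rows, cols = frame_hue.shape[0] // win, frame_hue.shape[1] // win
    mask = np.zeros((rows, cols), dtype=bool)
    for i in range(rows):
        for j in range(cols):
            patch = frame_hue[i * win:(i + 1) * win, j * win:(j + 1) * win]
            h = hue_histogram(patch.ravel(), num_bins)
            mask[i, j] = intersection(h, bg_hists[i, j]) < tau1
    return median_filter(mask.astype(np.uint8), size=3).astype(bool)
```

Here `bg_hists[i, j]` holds the hue histogram built from the corresponding background window during the on-line learning phase.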
3 HMM-based Activity Learning and Recognition

3.1 Hidden Markov Model
The Hidden Markov Model (HMM) is very popular and successful in speech recognition. An HMM is capable of dealing with time-sequential signals and has the advantage of relative time-scale invariance in recognition. Furthermore, HMMs are favored for their learning ability: a model can be optimized automatically by providing it with time-sequential training data. Typically, an HMM consists of a set of states, where each state has a certain probability of transitioning to another state, and each transition has a probability of producing certain symbols. The state at any time depends only on the preceding state. The HMM states are not directly perceivable,
and can be observed only through a sequence of generated symbols [16]. The following notation defines a discrete HMM:

- O = O_1, O_2, ..., O_T: the observed symbol sequence, where T is the length of the observation (output) sequence.
- Q = {q_1, q_2, ..., q_N}: the set of states in the HMM, where N is the number of states in the model.
- V = {v_1, v_2, ..., v_M}: the set of all possible observation symbols, where M is the number of output symbols.
- A = {a_ij | a_ij = Pr(s_{t+1} = q_j | s_t = q_i)}: the state transition probabilities, where a_ij is the probability of transiting from state q_i to state q_j.
- B = {b_ij(k) | b_ij(k) = Pr(v_k | s_{t+1} = q_j, s_t = q_i)}: the symbol output probabilities, where b_ij(k) is the probability of outputting symbol v_k when transiting from state q_i to q_j.

During the learning process, we present a set of observed and classified output sequences and run the Baum-Welch algorithm to optimize the transition and output probability matrices [16]. The Baum-Welch method is an Expectation-Maximization (EM) algorithm; in essence, the learning process maximizes the likelihood that the HMMs generate all the given training sequences. To recognize a given output sequence, we calculate the probability that the HMM of each category generates the given series, i.e., Pr(Output | λ_i), and choose the class with the maximum probability as the identity of the sequence:

$$c^* = \arg\max_i \Pr(\text{Output} \mid \lambda_i)$$
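As an illustration of this decision rule, a minimal sketch using the standard forward algorithm (for brevity we attach emissions to states rather than to transitions as in the notation above; all names are illustrative):

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Forward algorithm: log Pr(obs | model) for a discrete HMM.
    pi: (N,) initial state probabilities, A: (N, N) transitions,
    B: (N, M) per-state emission probabilities, obs: symbol indices."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()        # rescale to avoid numeric underflow
        log_p += np.log(scale)
        alpha /= scale
    return log_p + np.log(alpha.sum())

def classify(obs, models, tau2):
    """Choose the action class with the highest likelihood; reject the
    sequence as invalid when even the best model falls below tau2."""
    scores = [log_likelihood(obs, *m) for m in models]
    best = int(np.argmax(scores))
    return best if scores[best] >= tau2 else None
```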
3.2 Feature Extraction for HMM
To make use of HMMs for training and recognition, a set of features must be defined as the output symbols produced by the models. In our project, we need to convert an image sequence into a series of symbols indicating the current state of the scene. Since we only care about the neighboring region of the button, we define a simple set of states representing the direction and distance of the finger from the button. We split the contiguous region around the button into a 5 × 5 grid, with the button in the center cell. According to the segmentation result, each cell is labeled foreground or background. The distance is defined as 0 if the button cell is touched, 1 if some cell in the inner layer is detected as foreground, and 2 if only cells in the outer layer are covered. Since we consider only four directions (up, down, left, right), by counting the number of cells touched by the hand in each direction we take the finger to be
coming from the direction with the most cells covered by the hand. The combination of direction and distance thus generates a set of 3 × 4 = 12 states. If no cell is segmented as foreground, or more than half of the cells are covered, we declare that an illegal state has been detected; this is used to identify the start and termination of an action sequence. Figure 2 illustrates this scheme.
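One possible reading of this rule in code, mapping a 5 × 5 foreground grid to a symbol (the exact direction-counting regions are our interpretation):

```python
import numpy as np

# Chebyshev distance of each cell from the center button cell:
# 0 = button, 1 = inner ring, 2 = outer ring
RING = np.maximum(np.abs(np.arange(5) - 2)[:, None],
                  np.abs(np.arange(5) - 2)[None, :])

def extract_symbol(fg):
    """Map a 5x5 boolean foreground grid to one of the 12 symbols
    (direction * 3 + distance), or None for an illegal state."""
    n = fg.sum()
    if n == 0 or n > 12:              # empty, or more than half covered
        return None
    if fg[2, 2]:
        dist = 0                      # button cell touched
    elif fg[RING == 1].any():
        dist = 1                      # inner ring touched
    else:
        dist = 2                      # only outer ring touched
    # covered cells above, below, left of, and right of the button
    counts = [fg[:2, :].sum(), fg[3:, :].sum(),
              fg[:, :2].sum(), fg[:, 3:].sum()]
    direction = int(np.argmax(counts))  # 0=up, 1=down, 2=left, 3=right
    return direction * 3 + dist
```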
3.3 Structure of Our HMMs
The basic building unit of our HMMs is very similar to the singleton acoustic model used in speech recognition [16]. For each of the 12 valid states, we define a basic HMM to represent it. For each of the four categories of actions (pushing from up, down, left, and right, respectively), we then find a representative sequence; based on this standard sequence, we build the HMM for the category by concatenating, with null transitions, all the basic HMMs corresponding to each symbol in the sequence. Figure 3 illustrates the structure of the basic HMMs and the class HMMs.

However, our problem differs from a standard recognition problem: we also need to tell whether a sequence is one of the four categories of actions or not a valid button-trigger action at all. For these invalid actions, it is impossible to find one or a limited number of primary sequences from which to build HMMs. Therefore, we use a threshold to separate invalid actions from legal ones: if max_i(Pr(Output | λ_i)) < τ2, the observed action is considered illegal.

This brings about another problem. In performing the triggering action, the duration of the action may vary significantly, whereas what we want to learn is the pattern of the action. Given a fixed HMM, the probability of generating a longer sequence is much smaller than that of producing a shorter one, even though both may conform to the same pattern; when the sequence is long enough, the probability may drop below the threshold used to discriminate valid from invalid actions. To solve this problem, we align sequences during training and recognition: we choose a fixed length, for example 20, as the standard length. Any sequence longer than this is re-sampled to a new sequence of the standard length; any sequence shorter than the fixed value is discarded, as shown in the sketch below.
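A minimal sketch of this alignment step (uniform re-sampling; the original re-sampling rule is not specified, so this is one reasonable choice):

```python
import numpy as np

def align(seq, standard_len=20):
    """Re-sample a symbol sequence to the standard length, or return
    None when the sequence is still too short to be considered."""
    if len(seq) < standard_len:
        return None
    idx = np.linspace(0, len(seq) - 1, standard_len).round().astype(int)
    return [seq[i] for i in idx]
```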
3.4 Training and Recognition
To train the system, we record sequences performed by different people and build a training set that consists only of valid actions. We then train the system using the Baum-Welch algorithm and save the parameters. In order
Figure 2: Illustration of feature state definition.
Figure 3: Structure of the basic HMMs and the composite HMM for each category. Here O_i (i = 1...L) correspond to the symbols in the primary sequence of each class, and L is the standard length.
to find the most characteristic sequence from which each category HMM is built, we also carry out experiments choosing different series as the primary sequence, and select the sequence that maximizes the overall system likelihood as the standard sequence for each category. After training is finished, we choose τ2 as the minimum, over all training sequences, of the probability that each HMM generates the sequences of its particular class.

At run time, the system records a sequence until it meets the length requirement. If the sequence is long enough, a recognition process is carried out; if the current sequence is identified as a valid action, the user is notified of the success of the action. Otherwise, the system continues to record data into the current sequence and performs recognition each time a new valid symbol is observed. This process stops when several consecutive illegal states are observed, at which point the current sequence is discarded and a new iteration begins.
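Putting the pieces together, a hedged sketch of this run-time loop built on the helpers sketched earlier (`MAX_ILLEGAL` is our name for the consecutive-illegal-state cutoff; the paper does not give its value):

```python
MAX_ILLEGAL = 5   # consecutive illegal states that end a sequence (our guess)

def run(grids, models, tau2, standard_len=20):
    """Consume 5x5 foreground grids frame by frame, yielding the class
    index whenever a valid button-trigger action is recognized."""
    seq, illegal = [], 0
    for fg_grid in grids:
        symbol = extract_symbol(fg_grid)
        if symbol is None:
            illegal += 1
            if illegal >= MAX_ILLEGAL:
                seq, illegal = [], 0      # discard, start a new iteration
            continue
        illegal = 0
        seq.append(symbol)
        aligned = align(seq, standard_len)
        if aligned is not None:           # long enough to attempt recognition
            action = classify(aligned, models, tau2)
            if action is not None:
                yield action              # notify the user of success
                seq = []
```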
4 Experimental Results and Discussion

4.1 Experiment Configuration and Results
In our experiment, we use a FireWire color camera as the imaging sensor. The main program runs on a PC with a 1 GHz Pentium III processor. We build the whole system on the XVision2 framework. The image size is 320 × 240, and the program achieves a frame rate of over 20 fps; with the display disabled, the system runs at about 25 fps.
Table 2: Experimental results with different lengths of the characteristic sequence

L     Average fps    Accuracy on Training Set    Accuracy on Test Set
10    25.4           100.0%                      86.8%
20    25.3           100.0%                      94.2%
30    25.1           100.0%                      96.8%
For training and testing, we record over 300 actions performed by 6 different people. The standard length of the primary sequences is set to 20. We use 76 sequences of valid button-pushing actions as the training set; after training, the system achieves a correct ratio of 100% on the training set. The threshold τ2, the smallest log2-likelihood with which the HMM of each category generates the sequences of its particular class, is selected to be −12.0. We test the system on a test set of 277 well-segmented sequences, including both valid and invalid button-triggering actions. The length of these sequences varies significantly, ranging from 30 to over 220, and the test set also includes some sequences under changing illumination. The overall correct ratio on this test set is 94.22%, with a false positive ratio of 2.25% and a false negative ratio of 2.53%. These results demonstrate the robustness and correctness of our system.
4.2 Discussion
The standard length of the category characteristic sequence influences system performance and speed. As the length of the primary sequence increases, the time needed to carry out recognition grows linearly; however, since a longer sequence, and thus a larger HMM, contains more information, overall system performance improves. Table 2 shows the experimental results for different lengths of the category primary sequence.
Although hue is relatively insensitive to illumination, it works better for dull, matte surfaces than for highly shiny objects. Furthermore, if the background contains highly shiny surfaces around the button area, the shadow of the hand and fingers changes the appearance of those parts of the background considerably, and such areas are falsely segmented as foreground. Our experiments show that for highly shiny surfaces, hue alone is not good enough as a model.
5 Conclusion
In this paper, we present the design and implementation of a vision-based HCI system that can successfully recognize a button-triggering action at frame rate. We make use of a hue histogram to model the background image; pairwise histogram intersection values between sub-images of the background and foreground are calculated as the criterion for segmenting the hand. We define a symbol space in which each point indicates the distance and direction of the finger with respect to the region of interest, and an HMM is built for each of four categories of button-trigger actions. Using recorded activity sequences, we train the models by maximizing the likelihood that the system generates all these sequences; the probability that each model produces a given state sequence is used as the criterion for the identity of the action. Based on the XVision2 development toolkit, we implement a system that runs at frame rate and achieves a satisfactory correct ratio of over 96%.

Our experiments indicate that color can be used as a reliable cue for segmentation and recognition. However, better color models are needed for dealing with highly shiny objects, significant changes in illumination, and heavy shading. Although HMMs have been adopted in vision to attack many learning and recognition problems, such as gesture recognition and simple activity learning, their success in vision is far from comparable to that in language processing. One reason is the complexity of the vision signal, which is multi-dimensional, while the speech signal is one-dimensional. To solve vision problems in more complex situations, such as continuous sign language understanding or activities involving multiple people, the feature space and the structure of the HMMs will inevitably become more intricate and deserve more research effort.
Acknowledgement

This project is supported by the National Science Foundation under Grant No. 0112882.
References

[1] Thanarat Horprasert, David Harwood, and Larry S. Davis, A Robust Background Subtraction and Shadow Detection, Proc. ACCV 2000, Taipei, Taiwan, January 2000.

[2] A. C. Hurlbert, The Computation of Color, PhD thesis, Massachusetts Institute of Technology, September 1989. Also MIT Artificial Intelligence Lab Technical Report No. 1154.

[3] M. J. Swain and D. H. Ballard, Color Indexing, International Journal of Computer Vision, 7(1):11-32, 1991.

[4] Efstathios Hadjidemetriou, Michael D. Grossberg, and Shree K. Nayar, Histogram Preserving Image Transformations, International Journal of Computer Vision, 45(1):5-23, 2001.

[5] Daniel C. Alexander and Bernard F. Buxton, Statistical Modeling of Color Data, International Journal of Computer Vision, 44(2):87-109, 2001.

[6] Michael J. Jones and James M. Rehg, Statistical Color Models with Applications to Skin Detection, International Journal of Computer Vision, 46(1):81-96, 2002.

[7] Theo Gevers and A. W. M. Smeulders, Color Based Object Recognition, Pattern Recognition, 32(3):453-464, 1999.

[8] Theo Gevers, Robust Histogram Construction from Color Invariants, Proc. International Conference on Computer Vision, 2001.

[9] Christopher Richard Wren, Ali Azarbayejani, Trevor Darrell, and Alex Paul Pentland, Pfinder: Real-Time Tracking of the Human Body, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-784, July 1997.

[10] George Gagaudakis and Paul L. Rosin, Incorporating Shape into Histograms for CBIR, Pattern Recognition, 35:81-91, 2002.

[11] Song Chun Zhu and Alan Yuille, Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multiband Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9), 1996.

[12] J. K. Aggarwal and Q. Cai, Human Motion Analysis: A Review, Computer Vision and Image Understanding, 73(3):428-440, March 1999.

[13] R. Polana and R. Nelson, Low Level Recognition of Human Motion, Proc. IEEE Computer Society Workshop on Motion of Non-Rigid and Articulated Objects, pp. 77-82, Austin, TX, 1994.

[14] J. Yamato, J. Ohya, and K. Ishii, Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model, Proc. IEEE Conf. CVPR, pp. 379-385, Champaign, IL, 1992.

[15] Andrew D. Wilson and Aaron F. Bobick, Parametric Hidden Markov Models for Gesture Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), September 1999.

[16] Frederick Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1999.