A decision-theoretic video conference system based on gesture recognition

Jose Antonio Montero
Instituto Tecnologico de Acapulco, Computing Laboratory
Av. Instituto Tecnologico S/N, Acapulco, Mexico

Luis Enrique Sucar
INAOE, Department of Computer Science
Luis Enrique Erro #1, Tonantzintla, Puebla, Mexico
Abstract
This paper presents a new approach that combines computer vision and decision theory for an automatic video conference system. The setting is a video conference room in which a speaker interacts with surrounding objects, such as a computer, notes and books. Among a set of cameras, the system selects the most appropriate one to show to the audience, according to the speaker's activity. We assume that the activity of the speaker can be recognized based on hand gestures and their interaction with the objects in the environment. The proposed approach combines context-based gesture recognition with a decision-theoretic model to select the best view. Gesture recognition is based on hidden Markov models, combining motion and contextual information, where the context refers to the relation of the position of the hand with other objects. The posterior probability of each gesture is used in a partially observable Markov decision process (POMDP) to select the best view according to a utility function. The POMDP is implemented as a dynamic Bayesian network with a finite lookahead. Preliminary experiments show good results in both gesture recognition and view selection. We also present the effect of different lookahead periods on the performance of the system.
1. Introduction
Automatic vision-based systems for monitoring the behavior of people have been the focus of recent research. Most of these systems perform recognition using motion information or simple activity detection [14, 12, 16]. However, the visual recognition of human activities has as its final goal a decision, made either by the computer system or by a human aided by the system. For instance, in video surveillance, the system must decide whether to send an alarm if an intruder is detected. This decision is based not only on the outcome of the vision system (i.e., the probability of intruder detection), but also on the expected cost or utility of the alternative decisions (i.e., the cost of a false alarm vs. the cost of not detecting an intruder). Thus, the final evaluation of a vision system is how well it helps to make the final decisions. An application in which gesture recognition and decisions are closely related is video conferencing. In this case, a speaker is talking to a remote audience, using several resources, such as a computer presentation, a notepad, books, a white board, etc. The system must therefore decide, based on the speaker's activity, which is the best view to show to the audience. We assume that the facility has a set of cameras, so the main problem is to select the most appropriate view to show to the audience, according to the speaker's activity. We consider that the activity of the speaker can be recognized based on hand gestures and their interaction with the objects in the environment. This paper presents a novel approach that combines computer vision and decision theory for an automatic video conference system. For this we make two main contributions:
1. A gesture recognition system based on hidden Markov models that combines motion and contextual information.
2. A decision-theoretic controller for selecting the best view.
The posterior probability of each gesture is used in a partially observable Markov decision process (POMDP) to select the best view according to a utility function. The POMDP is implemented as a dynamic Bayesian network with a finite lookahead. We have tested our approach in a video conference setting, as shown in figure 1, with 4 types of gestures and 4 possible views (actions). We present the results in terms of: (i) gesture recognition, (ii) evaluation of the system by the audience of a simulated video conference, and (iii) the effects of different lookahead depths on the utility of the decisions.
2. Related work
There is little previous work that integrates vision and decision theory, and none of it has been developed for a video conference environment. A vision-based, adaptive and decision-theoretic model of human facial displays in interaction is proposed by Hoey [6], who integrates vision and decision theory, using POMDPs as a stochastic method to make decisions. A related approach is presented by Darrell and Pentland [5]. They proposed a POMDP-based model applied to active gesture recognition, in which the goal is to model unobservable and non-foveated regions; this work models some of the basic mechanics underlying dialog, such as turn taking, channel control and signal detection. Reinforcement learning is applied to improve the visual attention mechanism in the work of Bandera et al. [2]. They simulate a fovea-based vision system, learning strategies to obtain visual information relevant to the task; their goal is to learn and generalize strategies in undefined settings. Levner et al. [7] use Markov decision processes for the recognition of buildings in aerial images. The goal is to learn a control policy that chooses the next action (image processing operator) at each step, so that the image quality is optimal for interpretation. Regarding the use of context for human activity recognition, Ayers and Shah [1] use context information to recognize human activities performed in a room. The recognition process is based on a priori information and is modeled using a deterministic automaton. Another approach to action recognition in an office environment [10] combines the analysis of sensory data with symbolic context information provided by a scene model; in this work, the goal is the recognition of unknown objects interacting with the hand. Our approach differs in 3 main aspects from previous work: (i) it integrates context information and motion in hidden Markov models for gesture recognition, (ii) it approximates a POMDP using a finite horizon to decide the best view (action) based on the gesture information, and (iii) it is applied to a video conference environment.

3. Gesture recognition
The visual system consists of 3 main parts: (a) person detection and hand tracking, (b) recognition of relevant objects, and (c) gesture recognition with context. The vision system is based on previous work by the authors [9], to which we have added the identification of objects in the environment and the use of context for recognition.

3.1 Person detection and tracking

Figure 1. Video conference setting. Top: speaker. Bottom: some interacting objects.

The recognition of a person and some body parts (face, hands) is performed using a color-based approach. Human skin color is usually more distinctive and less sensitive to illumination changes in the rgy normalized color space proposed by Martinez and Sucar [8], and color histograms are used to model the skin color space. To determine whether a blob in the image contains skin pixels, we apply the histogram intersection technique proposed by Swain and Ballard [15]. However, to overcome the limitation of a fixed threshold value, we use Otsu's algorithm [11]: based on it, our method incorporates adaptive thresholding, so it is able to tolerate changes in lighting conditions. Initial detection of the hand combines the color-based approach with motion information, to make it more robust with respect to occlusions and illumination changes in a video conference environment. Once we have detected skin regions in an image sequence, the next step consists of tracking the right hand using only color information. (Currently we assume that the speaker performs all the gestures with the right hand and that the left hand is basically static.) For hand tracking, we first have to decide whether the skin regions in the image are the face or a hand of the person.

Figure 2. Face and hand detection. Left: original image showing the face and hand regions (rectangles). Right: skin pixels detected (in white).

Figure 3. Trajectory described by the hand centroid.
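To make the adaptive thresholding step described above concrete, the following is a minimal sketch (not the authors' implementation) of Otsu's method [11] applied to a per-pixel skin-likelihood map, such as one produced by histogram intersection. The function names and the choice of 256 bins are illustrative assumptions.

```python
import numpy as np

def otsu_threshold(likelihood, bins=256):
    """Threshold that maximizes the between-class variance (Otsu [11])."""
    hist, edges = np.histogram(likelihood.ravel(), bins=bins)
    prob = hist.astype(float) / max(hist.sum(), 1)
    omega = np.cumsum(prob)                     # cumulative class probability
    mu = np.cumsum(prob * np.arange(bins))      # cumulative mean
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)            # 0/0 at the extremes -> 0
    k = int(np.argmax(sigma_b))
    return edges[k + 1]                         # upper edge of the best split bin

def skin_mask(likelihood):
    """Binarize a skin-likelihood map with an adaptive (Otsu) threshold."""
    return (likelihood >= otsu_threshold(likelihood)).astype(np.uint8)
```

In the paper the same idea is combined with motion cues for the initial hand detection; only the thresholding is shown here.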
Hand/face detection is based on two rules. The first rule considers that only the hands and the face of the person cause significant movement in the image sequence. The second rule establishes a minimum threshold (number of skin-labeled pixels) that a region must have to be considered a hand or the face of a person. Experimentally we found that the region with the biggest skin area corresponds to the face of the person. To decide which skin region is the right hand, we initially consider that the person is sitting in front of a desk with the hands over it. The right hand then causes a significant motion (when manipulating an object for the first time), so the system starts tracking it. During tracking, we obtain the center points of the right hand region. The center point of an object may be defined using the centroid (X_c, Y_c):

$$X_c = \frac{\sum_x \sum_y B(x,y)\, x}{A}, \qquad Y_c = \frac{\sum_x \sum_y B(x,y)\, y}{A} \qquad (1)$$

where A is the number of pixels in the object and B is the binarized input object, which takes two values: 1 for the right hand and 0 for the background. We then adjust a search window over the region defined by (X_c, Y_c). Hand tracking is realized by applying the hand detection process over the search window, based on motion heuristics (maximum motion between frames), in the image sequence. The sequence of centroid points is obtained by the hand localization algorithm, and the gesture trajectory, G = (x_1, y_1), ..., (x_n, y_n), is produced by connecting the centroid points. Figure 2 shows an example of hand and face detection, and figure 3 shows the tracking process. Our system detects the gestures of a person when his right hand interacts with relevant objects.
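A minimal sketch of Equation (1) and of the search-window update around the detected hand is given below; the window half-size is an assumption, not a value given in the paper.

```python
import numpy as np

def hand_centroid(B):
    """Centroid (Xc, Yc) of the binary mask B (1 = right hand, 0 = background), Eq. (1)."""
    ys, xs = np.nonzero(B)            # coordinates of the A hand pixels
    A = xs.size
    if A == 0:
        return None                   # no hand detected in this frame
    return xs.sum() / A, ys.sum() / A

def search_window(centroid, image_shape, half_size=40):
    """Axis-aligned search window centered on the centroid (half_size is assumed)."""
    xc, yc = centroid
    h, w = image_shape[:2]
    x0, x1 = max(int(xc) - half_size, 0), min(int(xc) + half_size, w)
    y0, y1 = max(int(yc) - half_size, 0), min(int(yc) + half_size, h)
    return x0, y0, x1, y1

# Connecting the per-frame centroids yields the gesture trajectory
# G = (x1, y1), ..., (xn, yn) used for recognition.
```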
Figure 4. Detection of the relevant objects in a video conference scene.

3.2. Object detection and localization
In a video conference environment, the speaker interacts with surrounding objects, such as a computer, a notepad and a book. These objects represent contextual information, because the type of activity of the speaker is related to the objects with which she interacts. The detection and localization of relevant objects is done using an adaptation of the work of Swain and Ballard [15] and Bradski [3]. Objects are modeled using hue-saturation color histograms in the HSV color space. Training images are used to generate a color histogram for each object using 30x32 bins. Initially, each object is searched for over the full image, and afterwards only within a search window. Objects are detected using histogram intersection [15], which yields a gray-scale image in which pixels with values close to 255 belong to the detected object. Once an object is detected, we use the tracking algorithm proposed by Bradski [3] over an appropriate search window for each object. The system maintains the position of each object in the scene. Figure 4 illustrates object detection: (1) notepad (blue rectangle), (2) book (red rectangle), (3) mouse (green rectangle), (4) screen (aquamarine rectangle). The objects interacting with the right hand of the person represent contextual information, and the gesture recognition system integrates this information with the motion attributes from the hand trajectory.
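As an illustration of the matching step, the sketch below scores a candidate window against an object model with hue-saturation histogram intersection. The 30x32 binning follows the paper; the OpenCV-style 8-bit HSV value ranges, the function names and the acceptance threshold are assumptions.

```python
import numpy as np

def hs_histogram(hsv_pixels, bins=(30, 32)):
    """Normalized hue-saturation histogram (30x32 bins, as in the paper).
    Assumes OpenCV-style 8-bit HSV: hue in [0, 180), saturation in [0, 256)."""
    h = hsv_pixels[..., 0].ravel()
    s = hsv_pixels[..., 1].ravel()
    hist, _, _ = np.histogram2d(h, s, bins=bins, range=[[0, 180], [0, 256]])
    return hist / max(hist.sum(), 1e-9)

def histogram_intersection(model_hist, window_hist):
    """Swain-Ballard intersection [15]; 1.0 means a perfect match with the model."""
    return float(np.minimum(model_hist, window_hist).sum())

def object_present(model_hist, hsv_image, window, threshold=0.6):
    """Score one candidate window; 'threshold' is an assumed acceptance value."""
    x0, y0, x1, y1 = window
    score = histogram_intersection(model_hist, hs_histogram(hsv_image[y0:y1, x0:x1]))
    return score >= threshold, score
```

In the paper the intersection is combined with back-projection and the CAMShift-style tracker of [3]; the window-scoring form above only illustrates the matching step.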
3.3 Context-based gesture recognition
Gesture recognition is based on hidden Markov models (HMMs), integrating motion and contextual features. HMMs have the ability to accurately characterize data exhibiting sequential structure in the presence of noise, such as human gestures, finding the most likely sequence of states that may have produced a given sequence of observations. The HMM topology used in this paper is the classical left-right structure, which is typical for motion-ordered paths, such as gestures.
Figure 5. Gesture recognition rates with and without context information vs. number of hidden states.
As usual, one model was trained for each gesture class, and for recognition we selected the model with the highest probability. Motion features are obtained from the trajectory described by the right hand centroid when interacting with the surrounding objects. The features used in the trajectory analysis are magnitude and orientation in polar coordinates; in a previous analysis, we found that these are the best motion features for describing this type of gesture [9]. Context features include the distance from the hand centroid to each of the relevant objects detected in the scene: notepad, book, mouse and screen. These features, together with the motion attributes, are encoded into 64 discrete observation symbols. The number of states was set to 8 experimentally. We tested the gesture recognition system on 4 different types of gestures, related to the manipulation of each object in the scene: writing, turning the leaves of a book, using the computer, and speaking(1). We compared the recognition rate without and with context information vs. the number of hidden states; the results are summarized in figure 5.

(1) Our gesture recognition system is context based; that is, it considers the trajectory described by the hand when it interacts with an object. Thus, for this application, we considered the speaking gesture as the default value (it applies when no other gesture is recognized).
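To illustrate the recognition step, the sketch below scores a sequence of the 64 discrete observation symbols against one HMM per gesture class with the scaled forward algorithm and picks the most likely class. It assumes the per-class parameters (pi, A, B) have already been trained (e.g., left-right models with 8 states fitted with Baum-Welch); the function names and the uniform class prior are assumptions.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under one HMM.
    pi: initial state probs [N]; A: transitions [N, N]; B: emissions [N, M].
    Uses the scaled forward algorithm to avoid numerical underflow."""
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha = alpha / scale
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale
    return loglik

def classify_gesture(obs, models):
    """models: {gesture_name: (pi, A, B)}. Returns the most likely gesture and the
    posterior over gestures (uniform prior), which feeds the POMDP as its observation."""
    logliks = {g: forward_loglik(obs, *p) for g, p in models.items()}
    m = max(logliks.values())
    unnorm = {g: np.exp(v - m) for g, v in logliks.items()}
    z = sum(unnorm.values())
    posterior = {g: u / z for g, u in unnorm.items()}
    return max(posterior, key=posterior.get), posterior
```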
4. Decision-theoretic controller
Based on the gesture recognition results, the system selects the best view to show to the audience using a decision-theoretic approach. Given that the state of the video conference is partially observable, this corresponds to a partially observable Markov decision process (POMDP). A POMDP is a probabilistic temporal model of an agent interacting with the environment [4]. A POMDP is a tuple ⟨S, A, T, R, O, B⟩, where S is a finite set of states, A is a finite set of actions, T : S × A × S → [0, 1] is a transition function, T(s, a, s') = P(s' | s, a), which describes the effects of the agent's actions upon the world states, R : S × A → ℝ is a reward function which gives the expected reward for taking action a in state s, O is a finite set of observations, and B : S × A × O → [0, 1] is an observation function which gives the probability of each observation for a state-action pair. Given a POMDP, our goal is to find a policy that maximizes the expected discounted sum of rewards. Since the system state is not known with certainty, a policy maps either belief states (i.e., distributions over S) or action-observation histories into choices of actions. Next we describe each of the components of the POMDP for the video conference environment.

States. The state space is characterized by one variable that captures the state of the environment (the activity of the speaker), with 4 possible values: s1 = Writing, s2 = Turning the leaves of a book, s3 = Speaking, and s4 = Using the computer.

Observations. These correspond to the information obtained by the visual gesture recognition system. The state of the system is estimated as the probability of each gesture obtained from the HMMs.

Actions. The system has 4 actions that correspond to the 4 possible views that can be shown to the audience: a1 = Show-Face, a2 = Show-piece of paper, a3 = Show-Book, and a4 = Show-Screen.

Rewards. Rewards are associated with the value of the different views given the activity of the speaker. The value depends on two factors: (1) the gesture performed by the speaker, and (2) the previous views. The system must strike a balance between a fast response to changes in the activity of the speaker and, at the same time, avoiding too many changes that may disturb the audience; that is, responsiveness vs. stability.

Given that the present value (current frame) depends basically on the actions in a short time interval, we can solve the POMDP by approximating it with a dynamic decision network (DDN) with a finite number of stages or lookahead, as shown in figure 6. In this model, the state variables (S) represent the activity of the speaker at each time step, which is not directly observable. The observation nodes (O) represent the information obtained from the vision system, that is, the HMMs used for gesture recognition. The action nodes (A) correspond to the different views that can be selected by the controller at each time step. Finally, the reward nodes (r) represent the immediate reward, which depends on the current state and action. Solving the POMDP consists of finding the optimal action at each time period, that is, the one which maximizes the sum of the future rewards. In the DDN representation of a POMDP, every chance node S_i is associated with a set of conditional probabilities P_i = P(S_t | S_{t-1}, A_{t-1}). With the reward node r is associated a set of utilities u_i = u(S_{t-1}, A_{t-1}), specifying for each action-state pair a number expressing the desirability of this combination to the decision maker.
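A minimal sketch of these components and of the standard belief update is given below. The transition and reward numbers are placeholders (the paper's values were set subjectively), using the HMM gesture posterior as the per-state observation likelihood is an assumption about how the vision output enters the model, and the reward only encodes the relevance term (the stability term, which also depends on the previous view, is omitted).

```python
import numpy as np

STATES  = ["Writing", "TurningBook", "Speaking", "UsingComputer"]    # s1..s4
ACTIONS = ["ShowFace", "ShowPaper", "ShowBook", "ShowScreen"]        # a1..a4

nS, nA = len(STATES), len(ACTIONS)

# Placeholder transition model T[s, a, s'] = P(s' | s, a): activities mostly persist.
T = np.full((nS, nA, nS), 0.05)
for s in range(nS):
    T[s, :, s] = 0.85

# Placeholder reward R[s, a]: highest when the shown view matches the activity.
BEST_VIEW = {"Writing": "ShowPaper", "TurningBook": "ShowBook",
             "Speaking": "ShowFace", "UsingComputer": "ShowScreen"}
R = np.array([[1.0 if BEST_VIEW[s] == a else -0.2 for a in ACTIONS] for s in STATES])

def belief_update(b, a, obs_lik):
    """b'(s') proportional to P(o | s') * sum_s P(s' | s, a) b(s); obs_lik is taken
    from the HMM gesture posterior produced by the vision system."""
    predicted = b @ T[:, a, :]        # predicted distribution after taking action a
    posterior = obs_lik * predicted
    return posterior / posterior.sum()
```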
Figure 6. An N-stage dynamic decision network that approximates the POMDP.
A DDN uniquely represents a decision problem. A solution to the problem is a decision or, in the case of multiple decision nodes, a sequence of decisions that maximizes the desirability of the consequences. To compute a solution, for each sequence of actions (policy) the utilities of the outcomes are weighted by the probabilities that these outcomes will occur; the expected utility of an action sequence a is thus computed as:

$$\hat{u}(a) = \sum_i u(\pi_i(r)) \, \Pr(\pi_i(r) \mid a) \qquad (2)$$

where π_i(r) is a combination of values for the parents of the reward node r, u(π_i(r)) is its utility, and Pr(π_i(r) | a) is the probability of π_i(r) given that the sequence of decisions a is taken. In this work, the preferred sequence of actions is the one with the maximum expected utility. We use the clustering technique proposed by Shachter and Peot [13] to obtain this policy. To build the DDN model, we estimated the transition probabilities for each state (activity), given the previous state and the action, subjectively. The reward function was also set subjectively, so that the expected view is shown according to the activity (state). The observation function is given by the HMMs. In this framework, we approximate an optimal solution of the POMDP by solving the DDN for N stages. We tried different numbers of time steps (lookahead) and obtained the maximum expected utility for the present state. In figure 7 we show the maximum expected utility (left axis) and the time required to solve the DDN (right axis) vs. the number of stages (or decision nodes). Although the maximum utility is obtained with a lookahead of 8, this is too slow for real-time operation. A good compromise is a lookahead of 5, which we use in the experiments.
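The sketch below illustrates the effect of the N-stage lookahead with a brute-force enumeration over action sequences. It reuses the placeholder T, R and belief from the previous sketch, ignores future observations (an open-loop simplification, whereas the paper performs exact inference in the DDN with the clustering method of [13]), and is exponential in the lookahead, which is why small horizons such as the 5 used in the paper matter.

```python
import itertools
import numpy as np

def sequence_utility(belief, actions, T, R):
    """Expected sum of rewards of a fixed action sequence, propagating only the
    predicted state distribution (future observations are ignored)."""
    total, b = 0.0, belief.copy()
    for a in actions:
        total += float(b @ R[:, a])   # expected immediate reward under belief b
        b = b @ T[:, a, :]            # predicted distribution over the next state
    return total

def best_view(belief, T, R, lookahead=5):
    """Enumerate all action sequences of the given length and return the first
    action of the best one, i.e., the view to show now."""
    n_actions = R.shape[1]
    best_seq, best_u = None, -np.inf
    for seq in itertools.product(range(n_actions), repeat=lookahead):
        u = sequence_utility(belief, seq, T, R)
        if u > best_u:
            best_seq, best_u = seq, u
    return best_seq[0], best_u
```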
Figure 7. Expected utility (left) and solution time in seconds (right) for different numbers of stages (decisions) in the DDN.
5. Experimental results
We have tested our approach in a video conference setting, in which a speaker uses different resources to make a presentation to an audience. Figure 8 shows different stages during a video conference; the top right corner of each image depicts what the system shows to the audience. To evaluate the system, we conducted a user study in which a person gave a presentation to an audience using the automatic conference system, and we prepared some questions related to the utility and efficiency of the system. The test was carried out by connecting 2 computers in two adjacent rooms. In one room, a person gave a 10-minute presentation, interacting with didactic resources; in the other, a group of 7 students watched and listened to the presentation. We evaluated 3 aspects: (a) relevance, the screen must show something related to the activity of the speaker; (b) stability, the scene must stay fixed for a time adequate for the observer; (c) time lag, the system must respond in a short time to changes in the speaker's activity. The evaluation of these aspects gives an idea of the utility of the proposed system. Table 1 shows the results obtained; the scale was 0-10, higher is better. In general, the results are satisfactory, although there is still room for improvement, in particular in relevance. We plan to perform additional tests in the future in a more realistic scenario.
Figure 8. Sequence of images showing different aspects of a video conference. Left: the speaker is talking. Center: the speaker is making a presentation. Right: the speaker is showing a book.

6. Conclusions and future work
We have presented a novel approach that combines computer vision and decision theory for an automatic video conference system, with the following contributions: (i) a gesture recognition system based on hidden Markov models that combines motion and contextual information, and (ii) a decision-theoretic controller for selecting the best view according to the recognized gestures, implemented as a dynamic Bayesian network. We have tested our approach in a video conference setting with good results. As future work, we will learn the model parameters from data and explore alternative inference techniques for DDNs. We also plan to extend this framework to other environments.

Table 1. Results of the evaluation of the system by a group of students (scale 0-10).
Characteristic observed   Student: 1   2   3   4   5   6   7   Av.
Relevance                          7   8   7   7   7   8   7   7.39
Stability                          8   9   7   7   8   8   8   7.85
Time lag                           8   9   8   7   8   9   7   8
References
[1] D. Ayers and M. Shah. Monitoring human behavior in an office environment. PAMI, 7:780–794, July 1997.
[2] C. Bandera, F. J. Vico, J. M. Bravo, and M. E. Harmon. Residual Q-learning applied to visual attention. Proceedings of the Thirteenth International Conference on Machine Learning, 1:3–6, 1996.
[3] G. R. Bradski. Computer vision face tracking as a component of a perceptual user interface. Workshop on Applications of Computer Vision, 1:214–219, 1998.
[4] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 2, 1994.
[5] T. Darrell and A. Pentland. Active gesture recognition using learned visual attention. Advances in Neural Information Processing Systems (NIPS), 1, 1996.
[6] J. Hoey. Decision theoretic learning of human facial displays and gestures. Ph.D. thesis, University of British Columbia, 2004.
[7] I. Levner, V. Bulitko, L. Li, G. Lee, and R. Greiner. Automated feature extraction for object recognition. University of Alberta, 2003.
[8] M. Martínez and L. E. Sucar. Learning an optimal naive Bayes classifier. 36vo Congreso de Investigación y Desarrollo del Tecnológico de Monterrey, 2006.
[9] J. A. Montero and L. E. Sucar. Feature selection for visual gesture recognition using hidden Markov models. Fifth Mexican International Conference on Computer Science, 1:196–203, 2004.
[10] D. J. Moore, I. A. Essa, and M. H. Hayes. Exploiting human actions and object context for recognition tasks. Proc. of the 7th IEEE International Conference on Computer Vision, 1, Sep 2000.
[11] N. Otsu. A threshold selection method from gray-level histograms. IEEE Trans. on Systems, Man, and Cybernetics, 1:62–66, Jan 1979.
[12] H. Ren and G. Xu. Human action recognition in smart classroom. Proc. of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 5:54–60, 2002.
[13] R. Shachter and M. Peot. Decision making using probabilistic inference methods. In Proc. of the 8th Conference on Uncertainty in Artificial Intelligence, pages 276–283, 1992.
[14] A. Shawn and J. R. Cooperstock. Presenter tracking in a classroom environment. AAAI Symposium on Intelligent Environment, 1:145–148, 1999.
[15] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, pages 11–32, 1991.
[16] M. Zobl, F. Wallhoff, and G. Rigoll. Action recognition in meeting scenarios using global motion features. ICASSP Proceedings, 1:115–119, 2003.