IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING
Detecting Human Behavior Models From Multimodal Observation in a Smart Home

Oliver Brdiczka, Matthieu Langet, Jérôme Maisonnasse, and James L. Crowley
Abstract—This paper addresses learning and recognition of human behavior models from multimodal observation in a smart home environment. The proposed approach is part of a framework for acquiring a high-level contextual model for human behavior in an augmented environment. A 3-D video tracking system creates and tracks entities (persons) in the scene. Further, a speech activity detector analyzes audio streams coming from head set microphones and determines, for each entity, whether the entity speaks or not. An ambient sound detector detects noises in the environment. An individual role detector derives basic activity like "walking" or "interacting with table" from the extracted entity properties of the 3-D tracker. From the derived multimodal observations, different situations like "aperitif" or "presentation" are learned and detected using statistical models (HMMs). The objective of the proposed general framework is twofold: the automatic offline analysis of human behavior recordings and the online detection of learned human behavior models. To evaluate the proposed approach, several multimodal recordings showing different situations have been conducted. The obtained results, in particular for offline analysis, are very good, showing that multimodality as well as multiperson observation generation are beneficial for situation recognition.

Note to Practitioners—This paper was motivated by the problem of automatically recognizing human behavior and interactions in a smart home environment. The smart home environment is equipped with cameras and microphones that permit the observation of human activity in the scene. The objective is first to visualize the perceived human activities (e.g., for videoconferencing or surveillance of elderly people), and then to provide appropriate services based on these activities. We adopt a layered approach for human activity recognition in the environment. The layered framework is motivated by the human perception of human behavior in the scene (white box). The system first recognizes basic activities of individuals, called roles, like "interacting with table" or "walking." Then, based on the recognized individual roles, group situations like "aperitif," "presentation," or "siesta" are recognized. In this paper, we describe an implementation that is based on a 3-D video tracking system, as well as speech activity detection using head set microphones. We evaluated the system for offline (a posteriori) situation classification and online (in scenario) situation recognition. A prototype system has been realized and installed at France Télécom R&D, visualizing current human behavior in the smart home to a distant user using a web interface. An open issue is still the detection of group dynamics and group formation, which is necessary for group situation recognition in (informal) real settings.

Index Terms—Context-awareness, human-centered computing, individual role detection, multimodal human behavior modeling and detection, situation modeling.

Manuscript received June 14, 2007; revised June 16, 2008. This paper was recommended by Associate Editor P. Remagnino and Editor M. Wang upon evaluation of the reviewers' comments. This work was supported in part by the France Télécom R&D Project HARP and by the European Commission Project CHIL under Grant IST-506909. This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors: one MPEG-2 movie file (40.5 MB) showing the visualization of online human behavior recognition in the scene via a web interface.
O. Brdiczka is with the Computer Science Laboratory, Palo Alto Research Center, Palo Alto, CA 94304-1314 USA (e-mail: [email protected]).
M. Langet, J. Maisonnasse, and J. L. Crowley are with Project PRIMA, Laboratoire LIG, INRIA Rhône-Alpes, France (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASE.2008.2004965
I. INTRODUCTION
Pervasive and ubiquitous computing [33] integrates computation into everyday environments. The technological progress of the last decade has enabled computerized spaces equipped with multiple sensor arrays, like microphones or cameras, and multiple human–computer interaction devices. An early example is the KidsRoom [4], a perceptually based, interactive playspace for children developed at MIT. Smart home environments [10] and even complete apartments equipped with multiple sensors [13] have been realized. The major goal of these augmented or "smart" environments is to enable devices to sense changes in the environment and to automatically adapt and act based on these changes. A main focus lies on sensing and responding to human activity: human actors need to be identified and their current activity needs to be recognized. Addressing the right user at the right moment, while perceiving his or her actual activity, is essential for correct human–computer interaction in augmented environments.

Smart environments have enabled the computer observation of human (inter)action within the environment. The analysis of the (inter)actions of two or more individuals is of particular interest here, as it provides information about social context and relations, and it further enables computer systems to follow and anticipate human (inter)action. The latter is a difficult task given that human activity is situation dependent [31] and does not necessarily follow plans. Computerized spaces and their devices hence need to use this situational information, i.e., context [17], to respond correctly to human activity. Context is the key for interaction without distraction [15]. In order to become context-aware, computer systems must thus construct and maintain a model describing the environment, its occupants, and their activities.

The notion of context is not new and has been explored in different areas like linguistics, natural language processing, and knowledge representation. Dey defines context as "any information that can be used to characterize the situation of an entity" [17]. An entity can be a person, place, or object considered relevant to the user and the application. The structure and representation of this information must be determined before being exploited by a specific application.
Fig. 1. Context of a “presentation with questions.” The concepts (roles and/or relations) characterizing the situations are not detailed.
Context and activity are separable. Context describes features of the environment within which the activity takes place. Dourish further claims that context defines a relational property between objects and activities and is dynamically redefined as unexpected events occur or when activity evolves [18]. Loke states that situation and activity are not interchangeable, and that activity can be considered as a type of contextual information which can be used to characterize a situation [21]. Dey defines a situation as a "description of the states of relevant entities" [17]. A situation is thus a temporal state within context, defined by specific relations and states of entities. Allen's temporal operators [1] can be used to describe relationships between situations. Crowley et al. then introduce the concepts of role and relation in order to characterize a situation [16]. Roles involve only one entity, describing its activity; an entity is observed to "play" a role. Relations are defined as predicate functions on several entities, describing the relationship or interaction between entities playing roles. Context is finally represented by a network of situations [16]. These situation networks have been used to implement context for different applications [6] (see Fig. 1 for a simple example).

These situation models have so far been handcrafted by experts. However, little work has been done on the automatic acquisition, i.e., learning, of these high-level models from data. Many approaches for learning and recognizing human (inter)actions and behavior models from sensor data have been proposed in recent years, with particular attention to applications in video surveillance [26], [28], [34], workplace tools [24], [30], and group entertainment [4]. Some projects focus on supplying appropriate system services to the users [4], [10], [13], [30], while others focus on the correct classification of activities [7], [24], [25], [28]. Most of the previous work is based on video [26], [28], [34], audio [7], or multimodal information [24], using statistical models for learning and recognition (in particular hidden Markov models). Most of the reported work has been concerned with the recognition of the activities of individuals who have been identified a priori. To our knowledge, little work has been done on real-time multimodal activity recognition [24]. Moreover, most work does not attempt to acquire a high-level contextual model of human behavior; the main focus is on the classification of basic human activities or scenarios without considering a richer contextual description or model.
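As an illustration only (our own sketch, not the representation used in [16]), a situation network of the kind shown in Fig. 1 can be captured by a few simple types: roles as predicates over single entities, relations as predicates over several entities, and situations linked by admissible transitions. All names and predicates below are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Entity = Dict[str, object]  # e.g. {"id": 1, "posture": "sitting", "speaking": False}

@dataclass
class Situation:
    name: str                                   # e.g. "presentation"
    roles: Dict[str, Callable[[Entity], bool]]  # role name -> predicate on one entity
    relations: List[Callable[..., bool]] = field(default_factory=list)  # predicates on several entities
    successors: List[str] = field(default_factory=list)  # admissible transitions (arcs of the network)

# A two-situation fragment in the spirit of Fig. 1 (predicates are illustrative).
presentation = Situation(
    name="presentation",
    roles={"presenter": lambda e: e.get("speaking", False),
           "audience": lambda e: e.get("posture") == "sitting"},
    successors=["questions"],
)
questions = Situation(
    name="questions",
    roles={"asker": lambda e: e.get("speaking", False)},
    successors=["presentation"],
)
```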
Fig. 2. Overview of the different parts and methods of our approach.
This paper investigates learning and recognition of human behavior models from multimodal observation in a smart home environment. The proposed approach is based on audio and video information. A 3-D video tracking system creates and tracks entities (persons) in the scene. A speech activity detector further analyzes the audio streams coming from head set microphones and determines, for each entity, whether the entity speaks or not. An ambient sound detector detects noises in the environment. An individual role detector derives basic activity like "walking" or "interacting with table" from the extracted entity properties of the 3-D tracker. Roles can here be interpreted as referring to the basic activity of individuals in the scene; this activity is detected framewise. From the derived role and sound detections, different situations like "aperitif" or "presentation" are learned and detected using statistical models (HMMs).

This paper proposes a general framework providing methods for learning and recognizing the different parts of a human behavior model. The learning and detection of different situations is of particular interest here because situations are temporal states involving audio and video detections of several individuals. The objective is twofold. First, we want to enable the automatic offline analysis of human behavior recordings in a smart home environment; the aim is to identify and recognize the situations, as well as the roles played, in multimodal recordings. Second, the online detection of learned human behavior models is to be enabled; the aim is to visualize human behavior (e.g., for videoconferencing) and to be capable of reacting correctly to human activity in the scene.

II. APPROACH

In the following, we present our approach for learning and recognizing a human behavior model in a smart home environment (Fig. 2). First, our smart home environment, the 3-D video tracking system, as well as the noise and speech detectors are briefly described. Then, our method for role recognition is presented. This method takes the entity properties provided by the 3-D tracking system as input and generates, based on posture, speed, and interaction distance, a role value for each entity. Based on role and sound detections, several situations are learned and detected using hidden Markov models.
Fig. 3. Map of our smart room (top left), map with wide-angle camera view (top right), wide-angle camera image of the smart room (bottom).
A. Smart Home Environment

In this paper, experiments take place in a smart home environment, a room equipped like a living room. The furniture consists of a small table placed at the center of three armchairs and a couch [Fig. 3 (top left)]. Microphone arrays and video cameras are mounted against the walls of the environment. The set of sensors used for our approach comprises three cameras in different corners of the environment [one camera is wide-angle, Fig. 3 (top right)], a microphone array for noise detection, as well as head set microphones for individual speech detection. The cameras record images of the environment [see Fig. 3 (bottom) for an example image] at a frame rate of approximately 25 images/s. A 3-D real-time robust tracking system detects and tracks targets in these video images.

B. 3-D Video Tracking System

A 3-D video tracking system [3] detects and tracks entities (people) in the scene in real time using multiple cameras (Fig. 4). The tracker itself is an instance of basic Bayesian reasoning [2]. The 3-D position of each target is calculated by combining tracking results from several 2-D trackers [11] running on the video images of each camera. Each camera–detector pair runs on a dedicated processor. All interprocess communication is managed with an object-oriented middleware for service connection [19].

The output of the 3-D tracker is the position of each detected target, as well as the corresponding covariance matrix (a 3 × 3 matrix describing the form of the bounding ellipsoid of the target). Additionally, a velocity vector can be calculated for each target. The 3-D video tracking system provides high tracking stability. The generated 3-D target positions correspond to real positions in the environment that can be compared to the positions of objects. The extracted target properties (covariance matrix, velocity) provided by the 3-D tracker are independent of the camera positions (after calibration). Further, tracking is robust against occlusions and against splitting and merging of targets.
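For illustration, the per-target tracker output described above can be carried in a small structure like the following (field names are ours, not the tracker's API); the scalar speed used later by the role detector is then simply the norm of the velocity vector.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackedTarget:
    target_id: int
    position: np.ndarray    # 3-D position (x, y, z) in the room frame (metres assumed)
    covariance: np.ndarray  # 3x3 matrix describing the bounding ellipsoid of the target
    velocity: np.ndarray    # 3-D velocity vector

    def speed(self) -> float:
        # scalar speed used by the individual role detector
        return float(np.linalg.norm(self.velocity))
```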
C. Ambient Sound and Speech Detection

A microphone array mounted against the wall of the smart environment is used for noise detection. Based on the energy of the audio streams, we determine whether or not there is noise in the environment (e.g., movement of objects on the table). The people taking part in our experiments wear head set microphones. A real-time speech activity detector [9], [32] analyzes the audio stream of each head set microphone and determines whether the corresponding person speaks or not. The speech activity detector is composed of several subsystems: an energy detector, a basic classifier, and a neural net trained to recognize voiced segments such as vowels. At each time, i.e., for each frame, each subsystem gives an answer indicating whether the input signal is speech or not. A handcrafted rule-based automaton then determines the final result: speech activity or not. The complete system is shown in Fig. 5. The association of the audio streams (microphone numbers) with the corresponding entities (targets) generated by the 3-D tracker is done at the beginning of each recording by a supervisor. Ambient sound, speech detection, and 3-D tracking are synchronized. As the audio events have a much higher frame rate (62.5 Hz) than video (up to 25 Hz), we add sound events (no sound, speech, noise) to each video frame (of each entity).

D. Individual Role Detection

Individual roles refer to a combination of posture, movement, and interaction with objects of an individual person in the environment. The detection is conducted for each observation frame. The inputs are the extracted properties of each target (position, covariance matrix, speed) provided by the 3-D tracking system. The output is one of the individual role labels (Fig. 8). The detection process consists of three parts (Fig. 6). The first part detects the posture of the target using support vector machines (SVMs). The second part determines a movement speed value for the target, and the third part determines an interaction value with objects in the environment (normally a table).

A pure SVM approach has already been successfully used to determine posture and activity values from target properties of a 2-D video tracking system [8]. This first approach used SVMs as a black-box learning method, without considering specific target properties. From the obtained results, we concluded that, in order to optimize role recognition, we need to reduce the number of classes, as well as the target properties used for classification. Additional classes are determined by using specific target properties (speed, interaction distance) and expert knowledge (parts 2 and 3 of our approach).

The first part of the process [Fig. 6 (left)] takes the covariance matrix values of each target as input. Support vector machines (SVMs) [5], [14] are used to detect, based on these covariance values, the basic individual postures "standing," "lying down," and "sitting" (Fig. 7). SVMs classify data through the determination of a set of support vectors, obtained by minimizing the average error. The support vectors are members of the set of training inputs that outline a hyperplane in feature space. This n-dimensional hyperplane, where n is the number of features of the input vectors, defines the boundary between the different classes.
Fig. 4. Three-dimensional video tracking system fusing information of three 2-D trackers to a 3-D representation.
Fig. 5. Diagram of the speech activity detection system (picture from [9]).
Fig. 6. Individual role detection process: SVMs (left), target speed (middle), and distance to interaction object (right).
The classification task is simply to determine on which side of the hyperplane the testing vectors reside. The training vectors can be mapped into a higher (possibly infinite) dimensional space by a function φ. The SVM finds a separating hyperplane with the maximal margin in this higher-dimensional space, and K(x_i, x_j) = φ(x_i)^T φ(x_j) is used as a kernel function. For multiclass classification, a "one-against-one" classification can be performed for each pair of classes (i.e., k(k − 1)/2 binary classifiers for k classes). The classification of the testing data is accomplished by a voting strategy, where the winner of each binary comparison increments a counter.
Fig. 7. Postures “standing,” “lying down,” and “sitting” detected by the SVMs.
The class with the highest counter value after all classes have been compared is selected.
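A minimal sketch of that voting strategy (our own illustration; `pairwise_svm_predict` is a hypothetical function returning the winner of one binary SVM):

```python
from collections import Counter
from itertools import combinations

def one_against_one_vote(binary_winner, classes, x):
    """binary_winner(x, class_a, class_b) returns the winning class of one
    pairwise SVM; the class collecting the most votes over all pairs is selected."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_winner(x, a, b)] += 1
    return votes.most_common(1)[0][0]

# e.g. one_against_one_vote(pairwise_svm_predict, ["standing", "sitting", "lying down"], features)
```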
Fig. 8. Schema describing the combination of posture, speed and distance values (blue arrows refer to “no interaction distance with table” and red arrows refer to “interaction distance with table”).
In our approach, we use a radial basis function kernel (with kernel parameter γ and penalty parameter C). The LIBSVM library [12] has been used for implementation and evaluation.

The second part of the process [Fig. 6 (middle)] uses the speed of each target. Based on empirical thresholds in our smart home environment, we then determine whether the speed value of the target is zero, low, medium, or high.

The third part of the process [Fig. 6 (right)] uses the position of each target to calculate the distance to an interaction object. In our smart environment, we are interested in the interaction with a table at a known position [white table in Fig. 3 (right)]. So, we calculate the distance between the target and the table in the environment. If this distance approaches zero (or falls below zero), the target is interacting with the table.

The results of the different parts of the detection process are combined following the schema in Fig. 8: posture, speed value, and interaction value are combined into a role value.
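The following is a simplified sketch of this three-part role detection, assuming scikit-learn's SVC (an RBF-kernel SVM, standing in here for LIBSVM) and illustrative thresholds and role names; the actual thresholds, features, and the full 12-role schema of Fig. 8 are not reproduced.

```python
import numpy as np
from sklearn.svm import SVC  # RBF-kernel SVM, stand-in for LIBSVM

# Part 1: posture classifier trained on flattened covariance-matrix features.
# X_train: (n_samples, 9) covariance values; y_train: "standing"/"sitting"/"lying down".
def train_posture_svm(X_train, y_train, gamma="scale", C=1.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=C)  # one-against-one voting is used internally
    return clf.fit(X_train, y_train)

# Part 2: map a scalar speed (m/s) to a coarse speed class; thresholds are illustrative.
def speed_class(speed, low=0.1, medium=0.5, high=1.0):
    if speed < low:
        return "zero"
    if speed < medium:
        return "low"
    if speed < high:
        return "medium"
    return "high"

# Part 3: interaction with the table if the distance (minus a margin accounting
# for the target's extent) approaches or falls below zero.
def interacts_with_table(position, table_position, margin=0.3):
    return np.linalg.norm(np.asarray(position) - np.asarray(table_position)) - margin <= 0.0

# Combination (a simplified stand-in for the schema in Fig. 8).
def role_value(posture, speed_cls, interacting):
    if posture == "lying down":
        return "lying down"
    if speed_cls in ("medium", "high"):
        return "walking"
    if interacting:
        return "interacting with table"
    return posture  # "standing" or "sitting" while not moving
```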
Fig. 9. One person situations “individual work” and “siesta” (left side) and multiperson situations “introduction,” “aperitif,” “presentation,” and “game.”
E. Situation Learning and Recognition

Based on ambient sound, speech, and individual role detection, we generate observation codes for the multimodal behavior in the scene (see Section III). These observations are the input for situation learning and recognition. The objective is to detect six different situations (Fig. 9): siesta (one person), individual work (one person), introduction/address of welcome (multiperson), aperitif (multiperson), presentation (multiperson), and game (multiperson). For each situation, several multimodal recordings were conducted.

Hidden Markov models are used to learn and detect the six different situations from the observation data. For each situation label, we learn a left-right hidden Markov model with eight states (Fig. 10) using the Baum–Welch algorithm. A hidden Markov model (HMM) [27] is a stochastic process whose evolution is governed by states. The series of states constitutes a Markov chain which is not directly observable; this chain is "hidden." Each state of the model generates an observation, and only the observations are visible. The objective is to derive the state sequence and its probability, given a particular sequence of observations. The Baum–Welch algorithm is used for learning an HMM from given observation sequences, while the Viterbi algorithm can be used for calculating the most probable state sequence (and its probability) for a given observation sequence and HMM.
Fig. 10. Left-right HMM with eight states.
Hidden Markov models have been used with success in speech recognition [20], sign language recognition [29], handwriting gesture recognition [23], and many other domains. We adopt an HMM approach due to the temporal dynamics of human multimodal behavior, as well as the noisy character of our observation data (3-D tracking, speech activity detection, individual role detection). First experiments indicated that a hidden Markov model with eight states provides a good compromise between generality and specificity for our observation data. In order to detect the situation for an observation sequence, we calculate the probability of this sequence for each learned situation HMM.
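The probability computation, and the threshold-based selection that completes it below, can be sketched with a standard forward algorithm over the discrete observation codes. This is our own minimal illustration (parameter layout and names are ours, not the authors' implementation); a left-right structure would simply constrain the transition matrix to forward transitions.

```python
import numpy as np

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood log P(obs | HMM) via the forward algorithm.
    obs: sequence of discrete observation codes; log_pi: (n_states,) initial
    log-probabilities; log_A: (n_states, n_states) log transition matrix;
    log_B: (n_states, n_symbols) log emission matrix."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # alpha_t(j) = logsum_i( alpha_{t-1}(i) + log_A[i, j] ) + log_B[j, o]
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def classify_situation(obs, situation_hmms, log_threshold=-np.inf):
    """situation_hmms maps a label (e.g. "aperitif") to (log_pi, log_A, log_B),
    one learned HMM per situation. Returns the most likely label, or None if
    no model exceeds the threshold."""
    scores = {label: log_forward(obs, *hmm) for label, hmm in situation_hmms.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > log_threshold else None
```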
Fig. 13. Fusion algorithm combining the multimodal entity observation codes (a, b) of two entities. For max = 52, the resulting codes are between 0 and 1430.
Fig. 11. Multimodal entity detection codes.
Fig. 14. Different multimodal data recordings. Recordings on the left (one person) and in the middle (two persons) contain only one specific situation, while the long recordings on the right contain several different situations. The multimodal data sets contain a total of 75 349 observation frames.
Fig. 15. Overall situation recognition rates for the one situation recordings.
Fig. 12. Enumeration of all possible combinations of two multimodal entity observation codes between 0 and max (left side). The number of combinations is summed up (right side), resulting in the formula to calculate the total number of combinations (right side bottom).
The HMM with the highest probability (above a threshold) determines the situation label for the sequence.

III. MULTIMODAL CODING AND DATA SETS

Based on individual role detection as well as ambient sound and speech detection, we derive multimodal observation codes for each entity created and tracked by the 3-D tracking system. The 12 individual role values (Fig. 8) are derived for each entity by the individual role detection process. Further, the ambient sound detector indicates whether there is noise in the environment or not, and the speech activity detector determines whether the concerned entity is speaking or not. This multimodal information is fused into 53 observation codes for each entity (Fig. 11). Codes 1–13 (13 codes) are based on individual role detection. We further add ambient sound detection (codes 27–39 and 40–52) and speech detection per entity (codes 14–26 and 40–52).
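The block structure of the code ranges given above (role codes, plus 13-code blocks for speech, noise, and both) suggests the following encoding; this is our reading of those ranges, offered as an illustrative sketch rather than the exact table of Fig. 11.

```python
def entity_observation_code(role_code, speaking, noise):
    """Fuse one entity's role code (1..13, or 0 if the entity does not exist),
    its speech activity, and ambient noise into a single code in 0..52."""
    if role_code == 0:
        return 0  # code 0: nonexistent entity
    block = (1 if speaking else 0) + (2 if noise else 0)  # 0: none, 1: speech, 2: noise, 3: both
    return role_code + 13 * block  # yields 1-13, 14-26, 27-39, or 40-52
```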
As we can have several persons involved in a situation, we need to fuse the multimodal codes of several entities. We could simply combine the multimodal entity detection codes (for two codes, i.e., two entities: 53² = 2809 combinations). However, the result is a high number of possible observation values and, as we are interested in situation recognition, many of these values are redundant. For example, "person A is lying down and person B is sitting" is a different observation code than "person A is sitting and person B is lying down," even though, from the perspective of activity recognition, the situation is identical. So, if the order of the entity observation codes is not important to us, we can express the fusion of both entity observation codes using only 1431 combinations (see Fig. 12 for details).

For two given entity observation codes a and b, we can then calculate the fusion code using the algorithm in Fig. 13. First, we make sure that a contains the smaller of the two code values a and b (step 1). In step 2, the value a is used to sum over the combination blocks [Fig. 12 (right side)], while b adds the value to the correct entry in the combination block. The resulting observation code fuses the observation codes of two (or more) entities. In order to fuse the observation codes of more than two entities, the fusion algorithm can be applied several times, fusing all entities successively.

We would like to mention that the fact that multimodal observation code 0 (Fig. 11) corresponds to the nonexistence of the entity implies that a generated multiperson code (Fig. 13) contains all generated lower codes. That is, if we generate, for example, a two-person code for only one entity, the resulting code and the observation code of that entity are identical.
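A minimal sketch of the pairwise fusion, written from the counting argument of Figs. 12 and 13 (our own formulation, assuming 53 entity codes so that the fused codes range from 0 to 1430):

```python
def fuse_codes(a, b, max_code=52):
    """Map an unordered pair of entity observation codes (each in 0..max_code)
    to a single code in 0..(max_code + 1) * (max_code + 2) / 2 - 1
    (i.e., 0..1430 for max_code = 52)."""
    if a > b:
        a, b = b, a  # step 1: a holds the smaller code
    # step 2: skip the combination blocks of all smaller values of a,
    # then index into block a with the offset (b - a)
    block_offset = sum(max_code + 1 - i for i in range(a))
    return block_offset + (b - a)
```

Fusing the codes of more than two entities would apply this mapping repeatedly, as described above.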
Fig. 16. Situation recognition rate (long recordings) for different window sizes (in frames) used for recognition.
In order to evaluate our approach, we conducted several multimodal recordings in our smart home environment. The recordings involved up to two persons. First, three different recordings were conducted for each situation label; these recordings contain only one specific situation. Further, three long recordings were made, each containing the situations introduction, aperitif, game, and presentation. An overview of the conducted multimodal data recordings is given in Fig. 14. The indicated frame numbers refer to multimodal observations.
Fig. 17. Confusion matrix and information retrieval statistics for each situation (observation window size = 250 frames). The overall situation recognition rate is 66.51%.
IV. EVALUATION AND RESULTS

A first series of evaluations concerned offline situation recognition for the one-situation recordings [Fig. 14 (left) and (middle)]. The objective was to show the recognition performance of our approach for offline observation sequence classification. Therefore, we performed a threefold cross-validation, taking two thirds of the sequences as input for learning and the remaining third of the sequences as the basis for recognition. We did an evaluation based on the multimodal observations of only one entity (codes 0–52, without the fusion algorithm). This entity-wise situation detection already obtains good performance. However, when applying the fusion algorithm (two-person codes 0–1430), we obtain the best results.

A second series of evaluations concerned the detection of the situations within the long recordings [Fig. 14 (right)]. The objective was to show the performance of our approach for online situation detection. Therefore, we used the one-situation recordings [Fig. 14 (left) and (middle)] as input for learning the HMMs corresponding to the different situations, applying the fusion algorithm to the learning sequences. In order to provide online situation recognition, the size of the observation sequence used for recognition needs to be limited. We slide an observation window of different sizes from the beginning to the end of the recordings, constantly recognizing the current situation.
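As a minimal sketch of this sliding-window evaluation (our own illustration, reusing the hypothetical classify_situation helper from the HMM sketch above; window sizes such as 250 or 1500 frames correspond to the 10 s and 60 s settings reported below):

```python
def sliding_window_recognition(observations, situation_hmms, window=250):
    """Label every frame with the situation recognized from the last `window`
    fused observation codes (None until a full window is available)."""
    labels = []
    for t in range(len(observations)):
        if t + 1 < window:
            labels.append(None)
        else:
            segment = observations[t + 1 - window: t + 1]
            labels.append(classify_situation(segment, situation_hmms))
    return labels
```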
Fig. 18. Confusion matrix and information retrieval statistics for each situation (observation window size = 1500 frames). The overall situation recognition rate is 88.79%.
Fig. 16 depicts the obtained average situation recognition rates for different window sizes. If we limit the observation time provided for recognition to 10 s (i.e., 250 frames at a frame rate of 25 frames/s), we get an overall recognition rate of 66.51% (Fig. 17). By increasing the observation time to 60 s, the overall recognition rate rises to 88.79% (Fig. 18). However, confusions between "aperitif" and "game" persist, resulting in a poor precision for "aperitif" and a poor recall for "game." This is partially due to the ambiguity of the two situations from the point of view of the available observations: interacting with the table, gesturing, and speaking/noise are characteristic of both situations.
Fig. 19. Web interface visualizing detected roles and situations (per entity) in real-time.
Online situation detection does not only involve the correct classification of observation sequences with regard to learned situations (as offline situation recognition does); the transitions between situations also need to be correctly detected. Further, the observation sequences used for recognition are of limited length (observation windows). Thus, online situation detection is a more difficult problem, and the obtained results are lower than those for offline situation recognition.

V. CONCLUSION AND PERSPECTIVES

This paper proposed and evaluated an approach for learning and recognizing human behavior models from multimodal observation in a smart home environment. The approach is based on audio and video information. A 3-D video tracking system was used to create and track entities (persons) in the scene. Based on the extracted entity properties of the 3-D tracker, an individual role detector derived basic individual activity like "walking" or "interacting with table." Speech activity detection for each entity and noise detection in the environment complete the multimodal observations. Different situations like "aperitif" or "presentation" are learned and detected using hidden Markov models.

The proposed approach addresses two objectives: the offline analysis of human behavior in multimodal data recordings with regard to learned situations, and the online detection of learned behavior models. The conducted evaluations showed good results, validating that the approach is applicable to both objectives. The online detection of learned behavior models is of particular interest because it opens the way to a number of new applications (videoconferencing, surveillance of elderly people, etc.). Our approach has served as the basis for creating a web service communicating the detected behavior of occupants of our smart home environment.
A distant user can visualize current multimodal behavior in the scene by using a web interface (Fig. 19). The role values of the detected entities, as well as the situations, are indicated.

Situation detection is at present conducted for the observations of each individual entity only (without multiperson fusion). The reason is that multiperson situation recognition requires (predetermined) group formation information in order to fuse the observations of the correct entities. People tend, however, to dynamically change group formation, in particular in (informal) real settings. This makes multiperson situation recognition a difficult task. A solution to alleviate this problem is to analyze group formation. One issue is the determination of the focus of attention of each individual [22]. The correlation between attentional focus and/or speech contributions can be used to derive (interaction) groups [7]. Applied to our approach to activity recognition, the (probabilistic) correlation or distance between the situations of participants, as well as speech correlation, could be good indicators of group formation. Once the interaction groups are determined, multiperson situation recognition can be performed for each group. We would like to mention that interaction group detection and situation recognition are interdependent. On the one hand, the determination of group configurations is necessary for multiperson situation recognition. On the other hand, the situations detected for one or several individuals are strong indicators of possible group membership.

ACKNOWLEDGMENT

The authors thank P. Reignier, D. Vaufreydaz, G. Privat, and O. Bernier for their remarks and support.
REFERENCES
[1] J. Allen, "Maintaining knowledge about temporal intervals," Commun. ACM, vol. 26, no. 11, pp. 832–843, 1983.
[2] P. Bessière and B. R. Group, "Survey: Probabilistic methodology and techniques for artifact conception and development," Tech. Rep., INRIA Rhône-Alpes, Feb. 2003.
[3] A. Biosca-Ferrer and A. Lux, "A visual service for distributed environments: A Bayesian 3D person tracker," Tech. Rep., INRIA, 2007.
[4] A. F. Bobick, S. S. Intille, J. W. Davis, F. Baird, C. S. Pinhanez, L. W. Campbell, Y. A. Ivanov, A. Schütte, and A. Wilson, "The KidsRoom: A perceptually-based interactive and immersive story environment," Presence, vol. 8, no. 4, pp. 369–393, 1999.
[5] B. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annu. Workshop on Computational Learning Theory, 1992.
[6] O. Brdiczka, P. Reignier, J. L. Crowley, D. Vaufreydaz, and J. Maisonnasse, "Deterministic and probabilistic implementation of context," in Proc. 4th IEEE Int. Conf. Pervasive Comput. Commun. Workshops, 2006, pp. 46–50.
[7] O. Brdiczka, J. Maisonnasse, and P. Reignier, "Automatic detection of interaction groups," in Proc. Int. Conf. Multimodal Interfaces, 2005, pp. 32–36.
[8] O. Brdiczka, J. Maisonnasse, P. Reignier, and J. L. Crowley, "Learning individual roles from video in a smart home," in Proc. 2nd IEE Int. Conf. Intell. Environ., Athens, Greece, Jul. 2006.
[9] O. Brdiczka, D. Vaufreydaz, J. Maisonnasse, and P. Reignier, "Unsupervised segmentation of meeting configurations and activities using speech activity detection," in Proc. 3rd IFIP Conf. Artif. Intell. Applicat. Innov. (AIAI), Athens, Greece, Jun. 2006, pp. 195–203.
[10] B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafer, "EasyLiving: Technologies for intelligent environments," in Proc. 2nd Int. Symp. Handheld and Ubiquitous Computing, 2000, vol. 1927, Lecture Notes in Computer Science, pp. 12–29.
[11] A. Caporossi, D. Hall, P. Reignier, and J. L. Crowley, "Robust visual tracking from dynamic control of processing," in Proc. Int. Workshop on Performance Evaluation of Tracking and Surveillance, 2004, pp. 23–32.
[12] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[13] D. J. Cook, M. Youngblood, E. O. Heierman, K. Gopalratnam, S. Rao, A. Litvin, and F. Khawaja, "MavHome: An agent-based smart home," in Proc. 1st IEEE Int. Conf. Pervasive Comput. Commun., 2003.
[14] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, pp. 273–297, 1995.
[15] J. Coutaz, J. L. Crowley, S. Dobson, and D. Garlan, "Context is key," Commun. ACM, vol. 48, no. 3, pp. 49–53, Mar. 2005.
[16] J. L. Crowley, J. Coutaz, G. Rey, and P. Reignier, "Perceptual components for context aware computing," in Proc. 4th Int. Conf. Ubiquitous Comput., 2002.
[17] A. K. Dey, "Understanding and using context," Pers. Ubiquitous Comput., vol. 5, pp. 4–7, 2001.
[18] P. Dourish, "What we talk about when we talk about context," Pers. Ubiquitous Comput., vol. 8, pp. 19–30, 2004.
[19] R. Emonet, D. Vaufreydaz, P. Reignier, and J. Letessier, "O3miscid: An object oriented opensource middleware for service connection, introspection and discovery," in Proc. 1st IEEE Int. Workshop on Services Integration in Pervasive Environments, Jun. 2006.
[20] X. D. Huang, Y. Ariki, and M. A. Jack, Hidden Markov Models for Speech Recognition. Edinburgh, U.K.: Edinburgh Univ. Press, 1990.
[21] S. W. Loke, "Representing and reasoning with situations for context-aware pervasive computing: A logic programming perspective," Knowledge Eng. Rev., vol. 19, no. 3, pp. 213–233, 2005.
[22] J. Maisonnasse, N. Gourier, O. Brdiczka, and P. Reignier, "Attentional model for perceiving social context in intelligent environments," in Proc. 3rd IFIP Conf. Artif. Intell. Applicat. Innov. (AIAI), Jun. 2006, pp. 171–178.
[23] J. Martin and J.-B. Durand, "Automatic handwriting gestures recognition using hidden Markov models," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition (FG 2000), 2000, pp. 403–409.
[24] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang, "Automatic analysis of multimodal group actions in meetings," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 305–317, Mar. 2005.
[25] M. Muehlenbrock, O. Brdiczka, D. Snowdon, and J.-L. Meunier, "Learning to detect user activity and availability from a variety of sensor data," in Proc. 2nd IEEE Int. Conf. Pervasive Comput. Commun., 2004, pp. 13–23.
[26] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 831–843, 2000.
[27] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[28] P. Ribeiro and J. Santos-Victor, "Human activity recognition from video: Modeling, feature selection and classification architecture," in Proc. Int. Workshop on Human Activity Recognition and Modelling, 2005.
[29] T. E. Starner, "Visual recognition of American Sign Language using hidden Markov models," Ph.D. dissertation, MIT Media Laboratory, Perceptual Computing Section, Cambridge, MA, 1995.
[30] R. Stiefelhagen, H. Steusloff, and A. Waibel, "CHIL—Computers in the human interaction loop," in Proc. Int. Workshop on Image Analysis for Multimedia Interactive Services, 2004.
[31] L. Suchman, Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge, U.K.: Cambridge Univ. Press, 1987.
[32] D. Vaufreydaz, "IST-2000-28323 FAME: Facilitating agent for multicultural exchange (WP4)," European Commission Project IST-2000-28323, Oct. 2001.
[33] M. Weiser, Ubiquitous Computing: Definition 1, 1996. [Online]. Available: http://www.ubiq.com/hypertext/weiser/UbiHome.html
[34] S. Zaidenberg, O. Brdiczka, P. Reignier, and J. L. Crowley, "Learning context models for the recognition of scenarios," in Proc. 3rd IFIP Conf. Artif. Intell. Applicat. Innov., 2006, pp. 86–97.
[35] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud, "Multimodal group action clustering in meetings," in Proc. Int. Workshop on Video Surveillance and Sensor Networks, 2004.

Oliver Brdiczka received the Diploma degree in computer science from the University of Karlsruhe, Karlsruhe, Germany, the Engineer's degree from the École Nationale Supérieure d'Informatique et de Mathématiques Appliquées de Grenoble (ENSIMAG), Grenoble, France, and the Ph.D. degree from the Institut National Polytechnique de Grenoble (INPG), Grenoble, France. His Ph.D. research was with the PRIMA Research Group, INRIA Rhône-Alpes Research Center. After that, he directed the Ambient Collaborative Learning Group, Technical University of Darmstadt, Darmstadt, Germany. He is currently a Scientific Researcher with the Palo Alto Research Center, Palo Alto, CA. His research interests include context modeling, activity recognition, machine learning, and e-learning.
Matthieu Langet received the Engineer's degree from the Formation d'Ingénieur en Informatique de la Faculté d'Orsay, Orsay, France. He was a Research Engineer with the CNRS Laboratoire de Recherche en Informatique for two years. He joined the PRIMA Research Group, INRIA Rhône-Alpes Research Center, France, in February 2006 to work as a Research Engineer on the HARP Project.
Jérôme Maisonnasse received the M.S. degree in cognitive sciences from the Institut National Polytechnique de Grenoble (INPG), France. He is currently working towards the Ph.D. degree in cognitive sciences at the Université Joseph Fourier, Grenoble. He joined the PRIMA Research Group, INRIA Rhône-Alpes Research Center, France, in January 2004. His main research interest is human activity recognition for human–computer interaction.
James L. Crowley leads the PRIMA Research Group, INRIA Rhône-Alpes Research Center, Montbonnot, France. He is a Professor with the Institut National Polytechnique de Grenoble (INPG), France, where he teaches courses in computer vision, signal processing, pattern recognition, and artificial intelligence at the École Nationale Supérieure d'Informatique et de Mathématiques Appliquées de Grenoble (ENSIMAG). He has edited two books and five special issues of journals, and has authored over 180 articles on computer vision and mobile robotics. He ranks number 1473 in CiteSeer's most cited authors in computer science (August 2006).