2011 IEEE International Conference on Privacy, Security, Risk, and Trust, and IEEE International Conference on Social Computing
Predicting Levels of Rapport in Dyadic Interactions Through Automatic Detection of Posture and Posture Congruence
Juan Lorenzo Hagad∗, Roberto Legaspi†, Masayuki Numao† and Merlin Suarez∗
∗College of Computer Studies, De La Salle University-Manila, Manila City 1004, Philippines
Email: [email protected]
†The Institute of Scientific and Industrial Research, Osaka University, Osaka 567-0047, Japan
Abstract—Research in psychology and SSP often describes posture as one of the most expressive nonverbal cues. Various studies in psychology particularly link posture mirroring behaviour to rapport. Currently, however, few studies deal with the automatic analysis of postures, and none focuses specifically on its connection with rapport. This study presents a method for automatically predicting rapport in dyadic interactions based on posture and congruence. We begin by constructing a dataset of dyadic interactions and self-reported rapport annotations. Then, we present a simple system for posture classification and use it to detect posture congruence in dyads. Sliding time windows are used to collect posture congruence statistics across video segments. Lastly, various machine learning techniques are tested and used to create rapport models. Among the machine learners tested, Support Vector Machines and Multilayer Perceptrons performed best, at around 71% average accuracy.
Index Terms—social signal processing; image processing; machine learning; chameleon effect

I. INTRODUCTION
Nonverbal communication facilitates information exchange in social interactions through observable behavioural cues such as facial and vocal expressions, postures, and gestures, among others. These cues have been found to be more indicative of affective states than spoken words [1]. They can convey important information such as a person's affective states, personal characteristics, and relationships [2], [3]. Therefore, they can heavily influence the flow of social interactions [2], [4]–[6], such that the outcomes of interactions can be predicted based on nonverbal signals alone [7]. For this purpose, we apply findings in psychology which link observable behavioural cues with social signals. Social signals are relational attitudes demonstrated by people when interacting with others [8]. Such signals allow us to understand social situations without knowing what is being talked about. Furthermore, it has been suggested that, due to the perception-behaviour link [2], our own actions may be unconsciously affected by what we observe from others: the mere perception of another's behaviour automatically increases in us the likelihood of displaying something similar [9]. This phenomenon, which is characterized by mimicry, synchrony, and congruence, is called the chameleon effect [2]. The chameleon effect is an unintentional physical and verbal mirroring between people who are getting along well [2]. It is thought that this mirroring allows individuals to achieve a sort of emotional and behavioural synchronicity [10], [11]. A number of studies in psychology [2], [4], [10], [12] have found that people report having better interactions with those who performed nonverbal behaviours similar to their own [2]. Studies [1], [4], [12] have shown mimicry and congruence to be reliable indicators of interpersonal rapport. Rapport is closely linked to feelings of liking and affiliation and is usually characterized by harmonious relationships. As such, it is an essential component of interpersonal interactions. Furthermore, it is a complex social behaviour that can be expressed through different social cues, the most notable of which are linked to posture mimicry, particularly of the extremities (i.e., arms, hands, legs, feet and head) [12]. These kinds of behavioural phenomena are heavily investigated in the field of Social Signal Processing (SSP) (refer to [6], [8], [13] for noteworthy surveys). Currently, most of the works in SSP [7], [14], [15] focus on nonverbal cues (see [6], [15] for surveys). These cues are typically easy to detect and are considered more honest and reliable than verbal cues, since they are usually produced unconsciously [6], [16], [17]. Most existing works in SSP are aimed at practical applications, which include improving human-computer interfaces [14], [18], [19], predicting the outcomes of interactions [7], and studying social groups [20], [21]. This study deals with posture analysis, specifically posture congruence and its role in defining the social meaning of posture [6].
A number of the studies cited above focus on cues such as facial expressions, visual activity and vocal features, gestures, and posture. But of those that study posture, none have
studied natural dyadic conversations. Most investigate human-computer interactions, and those that study affect are usually trained on acted databases [22]–[24]. To our knowledge, no study in SSP has been conducted on the automatic prediction of rapport in human interactions. Rapport prediction has always been a very difficult task, even for psychologists. Lastly, most psychology research on posture cues in social interactions is still carried out manually. This requires experimenters to conduct lengthy observational tasks that are often repetitive yet require some level of expertise. Automating this process will allow for greater rates of observation, e.g., up to fractions of a second, increase the number of parameters being observed, and reduce human error. Our study presents a method for automatically detecting rapport in dyadic interactions based on posture and congruence. We hope to highlight the role of the method as a tool for analyzing dyadic social phenomena, as well as to use the results gathered towards more in-depth analyses of social dynamics.
II. METHODOLOGY
In this study, we examine the interactions of dyads. Dyad is a term used in sociology to describe a group of two people, the smallest possible social group. Dyadic interactions are therefore interactions between a pair of individuals. These are ideal for studying social behaviours since not only are they easier to observe, but it is also easier for a social connection to develop between the participants. Examples of dyadic interactions which have been studied in SSP are salary negotiations, hiring interviews, and speed dating [8], [13], [25].

A. Eliciting Natural Behaviour
In the interest of studying natural and spontaneous social signals, all dyadic sessions are conducted in a private experiment room. This room contains two chairs placed opposite each other with a table in the middle. Participants are positioned face-to-face to ensure visibility and maximize mirroring behaviour. A video camera is positioned a few feet away, to be less obtrusive, and perpendicular to the set-up such that it can capture both participants in a single frame. A separate room is used for debriefing and eliciting rapport annotations.
We conducted six sessions with participants coming mostly from universities in Osaka, Japan. This group consisted of a combination of Japanese (three male, one female) and Filipino (one male, two female) students. The participants were of different ethnicities because we aim to study later on the differences (if any) in posture and rapport perception between cultures. This objective, however, is not within the scope of the current study, but is part of our long-term goal. The participants were aware that they were being recorded, since this information was part of a privacy release form that was provided to them. Each session consisted of a 10- to 20-minute dialogue. The goal was to elicit spontaneous social responses. So, aside from the constraints imposed by the experimental setup (table and chairs), the behaviours of the participants were in no way intentionally influenced during the course of the experiment. External stimuli were minimized to make way for natural interactions. The only prompt was when they had to introduce themselves; subsequent interactions were of their own choosing. Although they were allowed to select any topic for conversation, since most of the participants were meeting for the first time, their interactions consisted mainly of self-introductions and casual conversation.
At the end of each session, the participants were transferred to the annotation room, where they reviewed the video of their interaction and labelled the rapport levels they felt during the session. We presented a simple rapport definition to the participants using keywords like "liking" and "feeling comfortable with", which recur in most mirroring studies in psychology. They annotated various points of the entire footage of each of their sessions using three rapport labels, namely, "High", "Neutral" and "Low". They were instructed to mark the different points in time where they felt their rapport level changed. This contrasts with the traditional approach in psychology [2], [4], whereby posture observations were made only every few minutes (five minutes or longer), since those were done manually, and rapport labels were applied to entire sessions. To reduce any bias, e.g., feeling impolite about reporting low rapport, the annotations were done at opposite ends of the room and the participants were guaranteed their privacy. We decided on self-reports due to the lack of experts at the time of the study. Third-party annotators were also considered; but since, at that time, the only option was to use naive annotators, we thought that they would require much more guessing than self-reports and would therefore be less reliable. We are now coordinating with psychologists regarding the possibility of them labelling the interactions.
B. Modeling Posture
For this study, we constructed posture models for the head and arms of each participant. We implemented image processing techniques using the OpenCV library, specifically background subtraction and skin detection. Histogram of Oriented Gradients (HOG) features [26], [27] were extracted from these regions and used to classify postures using Support Vector Machines (SVM). The head and arm postures were each divided into a small number of distinguishable classes. We decided on the posture classes after reviewing the videos and identifying recurring and easily-observable postures. Head postures are classified as facing straight, facing down, or tilted to the side. Arm postures are classified as far from the body, close to the body, performing hand gestures, or touching the face or head. The resulting SVM classifiers were fairly accurate, providing an accuracy of over 85% for almost all participant-specific posture models.

C. Modeling Posture Congruence and Rapport
We base our rapport model and predictions on posture congruence. Postures are said to be congruent when interactants hold their bodies, especially their extremities, in the same position as each other [1]. Therefore, our congruence detection
is based on the similarity of the detected postures of the interacting participants. In this study, head and arm postures are allowed to have independent states of congruence. In addition, unlike posture, rapport cannot be determined using only a single instantaneous state. Therefore, our models are built on a summary of the congruence information within 40-second sliding windows taken at 18-second intervals. We allowed a slight overlap between windows to account for the lack of clear indicators of the boundaries of rapport situations; larger step sizes could potentially overlook rapport segments shorter than the stepping distance. From each window, we collected the following statistics:
• Time spent in congruent postures. This is the most common measure noted in psychology studies and indicates the congruency of the participants across time.
• Time spent in certain postures (not necessarily congruent). The system tracks the relative dominance of certain postures in order to capture posture pattern similarities that may not be captured by instantaneous mirroring. This could be a potential indicator of posture behaviour matching [28].
• Number of posture changes (individual). We postulate that posture changes in conjunction with movement mirroring are better indicators of rapport than simple static congruent postures, which could merely indicate comfortable positions or even boredom.
• Posture changes that led to congruence (dyadic, directional). This value models the synchrony or movement mirroring between participants. Here, the system tracks the number of posture changes made by participants that produced congruency within a certain time frame, i.e., 3 seconds. The system tracks the posture changes of the participant being modelled that led to congruency with the pair, as well as the posture changes of the pair that were congruent responses to the posture of that participant.
Here, the system tries to track directional congruence, which could also be used to identify the more dominant or more engaged participant. These statistics produce the feature vectors that, together with the self-annotated rapport labels, are used to construct the rapport models. We gather sample windows (as stated previously) across up to two 10- to 20-minute videos for each participant, yielding a total of 7 sample sets. This is because each participant took part in at most two sessions, and of the six sessions, only five were used for automatic analysis due to excessive environmental noise in one sample. Each set is composed of roughly 40-80 sample windows that are then used to train and test each rapport model. We used and compared different machine learners.
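Assuming per-frame posture labels are already available from the upstream posture classifiers, the windowed congruence statistics described above can be sketched roughly as follows. The function name, label strings, and the use of a single posture stream per participant (head and arm streams would be handled identically) are our own illustrative simplifications; the 40-second window, 18-second step, and 3-second synchrony horizon follow the text:

```python
from collections import Counter

def window_features(a_postures, b_postures, fps=1,
                    window_s=40, step_s=18, sync_s=3):
    """Collect congruence statistics over sliding windows.

    a_postures, b_postures: per-frame posture labels for the two
    participants of a dyad, sampled at `fps` frames per second.
    """
    n = min(len(a_postures), len(b_postures))
    win, step, sync = window_s * fps, step_s * fps, sync_s * fps
    features = []
    for start in range(0, max(1, n - win + 1), step):
        a = a_postures[start:start + win]
        b = b_postures[start:start + win]
        # 1. Time (frames) spent in congruent, i.e. identical, postures.
        congruent = sum(1 for x, y in zip(a, b) if x == y)
        # 2. Time spent in each posture, congruent or not.
        a_counts, b_counts = Counter(a), Counter(b)
        # 3. Number of posture changes per participant.
        a_changes = sum(1 for i in range(1, len(a)) if a[i] != a[i - 1])
        b_changes = sum(1 for i in range(1, len(b)) if b[i] != b[i - 1])
        # 4. Directional: A's posture changes that were met by a matching
        #    posture from B within `sync_s` seconds.
        a_led = sum(
            1 for i in range(1, len(a))
            if a[i] != a[i - 1]
            and any(b[j] == a[i] for j in range(i, min(i + sync, len(b))))
        )
        features.append({
            "congruent_time": congruent,
            "a_posture_time": dict(a_counts),
            "b_posture_time": dict(b_counts),
            "a_changes": a_changes,
            "b_changes": b_changes,
            "a_changes_to_congruence": a_led,
        })
    return features
```

One such feature dictionary per window, flattened into a vector, would then be paired with the participant's self-reported rapport label for that window when training the rapport models.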
III. RESULTS
In our experiments, we used all of the features we identified. Although it would have been interesting to conduct feature selection, our early trials showed that the selected features varied for every user-specific model. This is also one reason why we decided to create user-specific models.
Initial analysis of the experiment data also revealed that, for most of the dyadic sessions, there was very little change in subject trunk lean. A probable cause could be the way the environment was set up, although it would be difficult to identify the root cause. On the other hand, it could have been the interactions themselves that prompted the participants to maintain a stationary trunk posture. Whatever the case, we decided not to use trunk lean as a parameter for congruence because of the uncertainty it presented.
The self-labelled rapport annotations revealed that feelings of rapport may not always be mutual, and in some cases could even be conflicting, with one participant reporting a high level of rapport while the other reported low rapport. On average, pairs differed in rapport states 41% of the time, and in one session 40% of the time was spent in opposing rapport states (high vs. low). Hence, we decided to construct user-specific rapport models built from windows taken from all sessions of each participant, labelled with their own rapport annotations.
We also observed that our sample sets contained imbalanced class distributions, with very few "Low" samples. A possible reason for this could be the level of familiarity between the participants. Most of the participants were strangers to each other; at best, some were acquaintances. It is possible that they may not have had enough time or experience with each other to form any negative impressions or develop dislike. Additionally, the conversations were free-flowing self-introductions, and there was little opportunity for the conflicts that could have led to low rapport.
Using multi-class SVM with polynomial kernels and 10-fold cross-validation, we achieved an average classification accuracy of 71.33% across all rapport models. We also tested the data with MLPs, which achieved an average accuracy of 71.10% and produced more consistent accuracies across models. Other classifiers were also tested but did not perform as well as the aforementioned two: Naive Bayes classifiers achieved an average of 61.82% accuracy, and Bayesian Networks operated at 62.33% accuracy.
We also analyze the per-class classification performance of the rapport models (refer to Table I) using precision and recall measures.

TABLE I
RAPPORT MODELS IR EVALUATION

                                Low       Neutral    High
Original          Precision   55.17%     64.29%    74.79%
                  Recall      34.43%     66.60%    73.59%
                  F-score     42.40%     65.42%    74.18%
Minority Removed  Precision   65.50%     65.59%    74.24%
                  Recall      63.30%     67.44%    73.21%
                  F-score     64.38%     66.50%    73.73%

Based on the values in the table, "Neutral" and "High" classifications were initially found to have significantly higher precision and recall rates than "Low" predictions. However, this was attributed to the lack of "Low" samples. Some sets did not have any low samples at all, and of those that did, "Low" samples often constituted less than 5% of the set. Only one sample set contained an equal distribution of samples, and in this case precision and recall rates were more consistent. Results obtained after removing the minority samples are shown in the lower half of Table I. Although overall performance is not significantly improved, we see more consistent performance across the different classes.
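For reference, the per-class precision, recall, and F-score figures in Table I follow the standard information-retrieval definitions and can be computed from a confusion matrix; a minimal sketch (the function name and the matrix values below are our own, invented purely for illustration):

```python
def per_class_scores(confusion, labels):
    """Compute (precision, recall, F-score) for each class from a
    confusion matrix given as confusion[true_label][predicted_label]."""
    scores = {}
    for c in labels:
        tp = confusion[c][c]                                  # true positives
        fp = sum(confusion[t][c] for t in labels if t != c)   # false positives
        fn = sum(confusion[c][p] for p in labels if p != c)   # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        scores[c] = (precision, recall, f_score)
    return scores

# Hypothetical counts for the three rapport labels used in this study.
labels = ["Low", "Neutral", "High"]
confusion = {
    "Low":     {"Low": 2, "Neutral": 1, "High": 1},
    "Neutral": {"Low": 0, "Neutral": 3, "High": 1},
    "High":    {"Low": 0, "Neutral": 1, "High": 3},
}
```

Note how a rare class such as "Low" can score well on precision yet poorly on recall, which is exactly the pattern the original (imbalanced) half of Table I exhibits.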
IV. CONCLUSION AND FUTURE WORK
In this paper, we have discussed our method for automatically predicting rapport using posture congruence. We began by detecting postures from video using HOG features and SVMs. After detecting subject postures, we constructed observations of posture congruence. Sliding windows were used to collect congruence statistics across video recordings of dialogues. We have yet to determine optimal values for the size of the congruence windows and the stepping. Among the classifiers tested, SVMs and MLPs were the most accurate; the resulting rapport models performed at around 71% average accuracy. Unfortunately, our current samples exhibit an imbalance in class distribution. This led to a decrease in the prediction performance of some models with respect to the "Low" class. However, significant dips in per-class classification accuracy occurred only in extreme cases of imbalance, i.e., when "Low" samples constituted less than 5% of the sample set. Still, based on the results we have gathered, we can conclude that posture congruence can be used to automatically predict levels of rapport.
Much work remains based on the observations we have gathered. Alternative labels for the data could be the subject of further investigation; combining the two separate labels into mutuality labels, for instance, could be interesting. We also acknowledge that the current number of samples is far from enough to form an encompassing conclusion. So, despite the potential shown in the results, we believe that further investigation and more samples are needed in order to arrive at a more solid basis. In addition, much can also be improved regarding the methodology, tools, image processing techniques, and machine learning techniques used. Note that some off-the-shelf solutions that could greatly aid the posture extraction process have recently become more accessible.
In the future, we plan to refine our methodology to include a larger sample set, more varied contexts, more sensitive posture detection, better annotation tools, and expert annotations.

REFERENCES
[1] D. L. Trout and H. M. Rosenfeld, "The effect of postural lean and body congruence on the judgment of psychotherapeutic rapport," Journal of Nonverbal Behavior, vol. 4, no. 3, pp. 176–190, 1980.
[2] T. Chartrand and J. Bargh, "The chameleon effect: the perception-behavior link and social interaction," Journal of Personality and Social Psychology, vol. 76, no. 6, pp. 893–910, June 1999.
[3] J. Lakin, Automatic Cognitive Processes and Nonverbal Communication. Thousand Oaks, CA: Sage Publications, Inc., 2006, pp. 59–77.
[4] M. LaFrance, "Posture mirroring and rapport," in Interaction Rhythms: Periodicity in Communicative Behavior, M. Davis, Ed. New York: Human Sciences Press, March 1982.
[5] J. Lakin and T. Chartrand, "Using nonconscious behavioral mimicry to create affiliation and rapport," Psychological Science, vol. 14, no. 4, pp. 334–339, 2003.
[6] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland, "Social signal processing: state-of-the-art and future perspectives of an emerging domain," in Proc. 16th ACM International Conference on Multimedia. New York, NY, USA: ACM, 2008, pp. 1061–1070.
[7] A. Pentland, "Socially aware computation and communication," Computer, vol. 38, no. 3, pp. 33–40, March 2005.
[8] A. Vinciarelli, H. Salamin, and M. Pantic, "Social signal processing: understanding social interaction through nonverbal behavior analysis," in Proc. International Workshop on Computer Vision and Pattern Recognition for Human Behavior, 2009.
[9] W. James, The Principles of Psychology, Vol. 1. Dover Publications, June 1950.
[10] M. LaFrance, "Postural mirroring and intergroup relations," Personality and Social Psychology Bulletin, vol. 11, no. 2, pp. 207–217, June 1985.
[11] E. Hatfield, J. T. Cacioppo, and R. L. Rapson, Emotional Contagion. Press Syndicate of the University of Cambridge, 1994.
[12] A. E. Scheflen, "The significance of posture in communication systems," Psychiatry, vol. 27, pp. 316–331, November 1964.
[13] A. Vinciarelli, M. Pantic, and H. Bourlard, "Social signal processing: Survey of an emerging domain," Image and Vision Computing, vol. 27, pp. 1743–1759, 2009.
[14] S. Mota and R. W. Picard, "Automated posture analysis for detecting learner's interest level," in Workshop on Computer Vision and Pattern Recognition for Human-Computer Interaction, CVPR HCI, June 2003.
[15] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland, "Social signals, their function, and automatic analysis: a survey," in Proc. 10th International Conference on Multimodal Interfaces. New York, NY, USA: ACM, 2008, pp. 61–68.
[16] P. Ekman, "Darwin, deception, and facial expression," Annals of the New York Academy of Sciences, vol. 1000, pp. 205–221, 2003.
[17] A. S. Pentland, Honest Signals: How They Shape Our World. London: The MIT Press, June 2008.
[18] J. Gratch, A. Okhmatovskaia, F. Lamothe, S. Marsella, M. Morales, R. J. V. D. Werf, and L.-P. Morency, "Virtual rapport," in Proc. 6th International Conference on Intelligent Virtual Agents. Springer, 2006.
[19] K. Prepin and P. Gaussier, "How an agent can detect and use synchrony parameter of its own interaction with a human?" in Development of Multimodal Interfaces: Active Listening and Synchrony, ser. Lecture Notes in Computer Science, vol. 5967. Springer Berlin / Heidelberg, 2010, pp. 50–65.
[20] S. Basu, T. Choudhury, B. Clarkson, and A. Pentland, "Towards measuring human interactions in conversational settings," in Proc. IEEE CVPR Workshop on Cues in Communication, 2001.
[21] D. B. Jayagopi, H. Hung, C. Yeo, and D. Gatica-Perez, "Modeling dominance in group conversations using nonverbal activity cues," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 3, pp. 501–513, March 2009.
[22] M. Elmezain, A. Al-Hamadi, O. Rashid, and B. Michaelis, "Posture and gesture recognition for human-computer interaction," Advanced Technologies, pp. 415–439, October 2009.
[23] O. Brdiczka, M. Langet, J. Maisonnase, and J. L. Crowley, "Detecting human behavior models from multimodal observation in a smart home," IEEE Transactions on Automation Science and Engineering, vol. 6, no. 4, pp. 588–597, 2009.
[24] S. D'Mello and A. Graesser, "Mind and body: Dialogue and posture for affect detection in learning environments," in Proc. International Conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work, 2007, pp. 161–168.
[25] J. R. Curhan and A. Pentland, "Thin slices of negotiation: Predicting outcomes from conversational dynamics within the first 5 minutes," Journal of Applied Psychology, vol. 92, no. 3, pp. 802–811, 2007.
[26] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[27] A. Agarwal and B. Triggs, "3D human pose from silhouettes by relevance vector regression," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 882–888.
[28] J. L. Lakin, V. E. Jefferis, C. M. Cheng, and T. L. Chartrand, "The chameleon effect as social glue: Evidence for the evolutionary significance of nonconscious mimicry," Journal of Nonverbal Behavior, vol. 27, no. 3, pp. 145–162, 2003.