1 Social Signal Processing for Surveillance

Marco Cristani, Dong Seon Cheng
Automated surveillance of human activities has traditionally been a Computer Vision field interested in the recognition of motion patterns and in the production of high-level descriptions for actions and interactions among entities of interest (Cedras and Shah, 1995; Aggarwal and Cai, 1999; Gavrila, 1999; Moeslund et al., 2006; Buxton, 2003; Hu et al., 2004; Turaga et al., 2008; Dee and Velastin, 2008; Aggarwal and Ryoo, 2011; Borges et al., 2013). In the last five years, the study of human activities has been revitalized by addressing the so-called social signals (Pentland, 2007). These nonverbal cues, inspired by the social, affective, and psychological literature (Vinciarelli et al., 2009b), have allowed a more principled understanding of how humans act and react to other people and to their environment. Social Signal Processing (SSP) is the scientific field making a systematic, algorithmic and computational analysis of social signals, drawing significant concepts from anthropology and social psychology (Vinciarelli et al., 2009b). In particular, SSP does not stop at modeling human activities, but aims at coding and decoding human behavior. In other words, it focuses on unveiling the hidden underlying states that drive a person to act in a distinct way, with particular actions. This challenge is supported by decades of investigation in the human sciences (psychology, anthropology, sociology, etc.), which showed how humans use nonverbal behavioral cues like facial expressions, vocalizations (laughter, fillers, back-channels, etc.), gestures and postures to convey, often outside conscious awareness, their attitude towards other people and social environments, as well as their emotions (Richmond and McCroskey, 1995). Understanding these cues is thus paramount in order to grasp the social meaning of human activities. The formal marriage of automated video surveillance with Social
Signal Processing had its programmatic start during SISM 2010, the International Workshop on Socially Intelligent Surveillance and Monitoring1, associated with the IEEE Computer Vision and Pattern Recognition conference (Cristani et al., 2010). In that venue, the discussion focused on what kind of social signals can be captured in a generic surveillance scenario, and then detailed the specific scenarios where the modeling of social aspects could be most beneficial. After 2010, SSP hybridizations with surveillance applications have grown rapidly in number, and systematic essays on the topic have started to appear in the Computer Vision literature (Cristani et al., 2013). In this chapter, after giving a short overview of the surveillance approaches that adopt SSP methodologies, we examine a recent application where the connection between the two worlds promises intriguing results, namely the modeling of interactions on instant messaging platforms. Here, the environment to be monitored is no longer "real": instead, we move into another realm, that of the social web. On instant messaging platforms, one of the most important challenges is the identification of the people involved in conversations. It has become important in the wake of the penetration of social media into everyday life, together with the possibility of interacting with persons who hide their identity behind nicknames or potentially fake profiles. Under scenarios like these, classification approaches (typical of the classical surveillance literature) can be improved with social signals, by importing behavioral cues that come from conversation analysis. In practice, sets of features are designed to encode effectively how a person converses: since chats are crossbreeds of written text and face-to-face verbal communication, the features inherit equally from textual authorship attribution and from the conversational analysis of speech. Importantly, the cues completely ignore the semantics of the chat, relying solely on the nonverbal aspects typical of SSP, thus taking care of possible privacy and ethical issues. With this modeling, identity safekeeping can be addressed. Finally, in the conclusions, some considerations summarize what has been achieved so far in surveillance under a social and psychological perspective; future perspectives are then given, identifying how and where social signals and surveillance methods could combine most effectively.
1 http://profs.sci.univr.it/~cristanm/SISM2010/
1.1 State of the Art

At the heart of Social Signal Processing is the study of social signals, viewable as temporal co-occurrences of social or behavioral cues (Ambady and Rosenthal, 1992), i.e., sets of temporally sequenced changes in neuromuscular, neurocognitive, and neurophysiological activity. Vinciarelli et al. (2009b) have organized behavioral cues into five categories that capture heterogeneous, multimodal aspects of a social interplay: 1) physical appearance, 2) gesture and posture, 3) face and gaze behavior, 4) vocal behavior and 5) space and environment. In practice, the first category has little relevance in surveillance, while the other four are consistently utilized in surveillance approaches.
1.1.1 Gesture and Posture

Monitoring gestures in a social signaling-driven surveillance sense is hard: the goal is not only to capture intentional gestures chosen voluntarily to communicate something, but also to capture unintentional movements, such as subtle and/or rapid oscillations of the limbs, shoulder shrugging, casual touching of the nose/ear, hair twisting, and self-protection gestures like crossing the arms. This is a worthwhile effort, as the essential nature of gestures in interactive scenarios is confirmed by the extreme rarity of "gestural errors": gestures portray with high fidelity the speaker's communicative intention (Cassell, 1998). Analyzing fine gesturing activity for the extraction of social signals seems to be a growing trend, as witnessed by a recent workshop associated with the International Conference on Computer Vision (ICCV), namely the IEEE Workshop on Decoding Subtle Cues from Social Interactions2. Oikonomopoulos et al. (2011) introduce a framework that represents the "shape" of an activity through the localization of ensembles of spatiotemporal features. Originally tested on standard action benchmarks (KTH, Hollywood Human Actions), it was later employed to detect hand-raising in image sequences of political debates. An orthogonal perspective extracts gestures through structured models, such as pictorial structures (Andriluka et al., 2009) or flexible mixtures of parts (Yang and Ramanan, 2011). However, these models do not appear capable of capturing fine signals like the ones described above, and their usage in surveillance+SSP is limited. Nonetheless, this trend could drastically change by considering additional modalities other
than the visual information: for example, including depth by exploiting RGB-D sensors like the Kinect promises to be very effective in capturing the human shape (Popa et al., 2012). In Cristani et al. (2011a), gesturing is used to infer who is talking in a surveillance scenario, performing through statistical analysis a simple form of diarization (the detection of who speaks when, see Hung et al. (2008)).
Posture is an aspect of human behavior which is unconsciously regulated, and can thus be considered among the most reliable nonverbal social cues. In general, posture conveys social signals in three different ways (Vinciarelli et al., 2009b): inclusive vs. non-inclusive, face-to-face vs. parallel body orientation, and congruent vs. non-congruent. These cues may help to distinguish extrovert from introvert individuals, suggesting a way to detect threatening behaviors. Only a few, very recent surveillance approaches deal with posture information (Robertson and Reid, 2011; Cristani et al., 2011b; Hung and Krose, 2011); they will be described in more detail in the following, since they exploit cues mostly coming from the other behavioral categories.
Another application of surveillance is the monitoring of abnormal behaviors in patients within domotic scenarios. Rajagopalan et al. (2013) released a new public dataset of videos with children exhibiting self-stimulatory (stimming) behaviours3, commonly used for autism diagnosis. In the dataset, three kinds of "in the wild" behaviors are analyzed (that is, no lab environments are taken into account): arm flapping, head banging, and spinning. Classification tests over these three classes have been performed using standard video descriptors (e.g., the spatiotemporal interest points of Laptev (2005)). This highlights another shortcoming of social signal-based surveillance: few video datasets in which social signals are accurately labelled are currently available.

2 http://www.cbi.gatech.edu/CBSICCV2013/
3 Self-stimulatory behaviours refer to stereotyped, repetitive movements of body parts or objects.

1.1.2 Face and Gaze Behavior

In surveillance, capturing fine visual cues from faces is quite challenging because of two factors: most of the scenarios are non-collaborative in nature (people do not intentionally look towards the sensors), and the faces are usually captured at low resolution. As for gaze orientation, since objects are foveated for visual acuity,
gaze direction generally provides precise information regarding the spatial localization of one's attentional focus (Ba and Odobez, 2006), also called the Visual Focus of Attention (VFOA). However, given that the reasons above often make measuring the VFOA through eye gaze difficult or impossible in standard surveillance scenarios, the viewing direction can be reasonably approximated by detecting the head pose (Stiefelhagen et al., 1999, 2002). In such a scenario, there are two kinds of approaches: those that exploit temporal information (tracking approaches), and those that rely on a single frame for cue extraction (classification approaches). In the former, it is customary to exploit the natural temporal smoothness of the signal (subsequent frames cannot portray dramatically different head poses) and the influence of human motion on the head pose (while running forward we are not looking backward), usually arising from structural constraints between the body pose and the head pose. All these elements are elegantly joined in a single filtering framework by Chen and Odobez (2012). Among classification approaches, there are many works describing features to use; one promising approach considers covariance matrices of features as descriptors, performing head and body orientation classification and head direction regression (Tosato et al., 2013).
Regarding the use of these cues in surveillance, Smith et al. (2008) estimate the pan and tilt parameters of the head, and subsequently represent the VFOA as a vector normal to the person's face, with the goal of understanding whether a person is looking at an advertisement located on a vertical glass surface. A similar analysis was performed by Liu et al. (2007), where an Active Appearance Model captures the face and pose of a person in order to discover which portion of a mall shelf is being observed. According to biological evidence (Panero and Zelnik, 1979), the VFOA can be described as a 3D polyhedron delimiting the portion of the scene that a subject is looking at. This is very informative in a general, unrestricted scenario, where people can enter, leave, and move freely.
The idea of moving from objective (surveillance cameras) towards subjective, individual points of view offers a radical change of perspective for behavior analysis. Detecting where a person is directing his gaze allows us to build a set of high-level inferences, and this is the subject of studies in the field of egocentric vision (Li et al., 2013), where people wear ad-hoc sensors for recording daily activities. Unfortunately, this is not our case, as we are in a non-collaborative scenario. Closer to video surveillance, Benfold and Reid (2009) infer what part of the scene is seen more frequently by people, thus creating some sort of
interest maps, which they use to identify individuals that focus on particular portions of their environment for a long time: a threatening behavior can be inferred if the observed target is critical (e.g., an ATM). Moreover, the "subjective" perspective has been adopted in Bazzani et al. (2011), where group interactions are discovered by estimating the VFOA using a head orientation detector and by employing proxemic cues, under the assumption that close people whose VFOAs intersect are also interacting. Similarly, in Robertson and Reid (2011), a set of two- and one-person activities are formed as sequences of actions and then modeled by HMMs whose parameters are manually set.
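To make the proxemics-plus-VFOA assumption concrete, the following is a minimal sketch of how two people can be flagged as interacting when they are close and their (approximated) VFOAs intersect, in the spirit of Bazzani et al. (2011). The 2D view-cone approximation and the aperture and reach parameters are illustrative assumptions, not values from that paper.

```python
import numpy as np

def looking_at(pos_a, theta_a, pos_b, aperture=np.deg2rad(60), reach=2.0):
    """True if B falls inside A's approximated view cone.

    pos_a, pos_b: 2D ground-plane positions (meters); theta_a: A's head
    orientation (radians). The aperture and reach values are assumptions.
    """
    d = pos_b - pos_a
    dist = np.linalg.norm(d)
    if dist > reach:
        return False
    gaze = np.array([np.cos(theta_a), np.sin(theta_a)])
    angle = np.arccos(np.clip(gaze @ d / dist, -1.0, 1.0))
    return angle < aperture / 2

def interacting(pos_a, theta_a, pos_b, theta_b):
    # Two people are assumed to interact when their VFOAs intersect,
    # i.e., each falls within the view cone of the other.
    return looking_at(pos_a, theta_a, pos_b) and looking_at(pos_b, theta_b, pos_a)

# Example: two people 1.5 m apart, facing each other.
a, b = np.array([0.0, 0.0]), np.array([1.5, 0.0])
print(interacting(a, 0.0, b, np.pi))  # True
```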
1.1.3 Vocal Behavior

The vocal behavior class comprises all the spoken cues that define the verbal message and influence its actual meaning. The class includes five major components (Vinciarelli et al., 2009b): prosody, which can communicate competence; linguistic vocalizations, which can communicate hesitation; non-linguistic vocalizations, which can reveal strong emotional states or tight social bonds; silence, which can express hesitation; and turn-taking patterns, which are the most investigated in this category, since they appear the most reliable for recognizing people's personalities (Pianesi et al., 2008), predicting the outcome of negotiations (Curhan and Pentland, 2007), recognizing the roles interaction participants play (Salamin et al., 2009), or modeling the type of interaction (e.g., conflict).
In surveillance, the monitoring of vocal behavior cues is absent, both because it is difficult to capture audio in large areas and, most importantly, because it is usually forbidden for privacy reasons. Another issue is the fact that audio processing is usually associated with speech recognition, whereas in SSP the content of a conversation is ignored.
An interesting topic for surveillance is the modeling of conflicts, as they may degenerate into threatening events. Conflicts have been studied extensively in a wide spectrum of disciplines, including Sociology (see Oberschall (1978) for social conflicts) and Social Psychology (see Tajfel (1982) for intergroup conflicts). A viable approach in a surveillance context is that of Pesarin et al. (2012), who propose a semi-automatic generative model for the detection of conflicts in conversations. Their approach is based on the fact that, during conflictual conversations, overlapping speech becomes both
longer and more frequent (Schegloff, 2000), as a consequence of the competition for holding the floor and preventing others from speaking. In summary, vocal behavior appears to be a very expressive category of social cues that should be exploited in the surveillance realm, since it can be handled in a completely privacy-respectful fashion.
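As a concrete illustration of these two quantities, the sketch below counts how often and for how long two speakers talk simultaneously, given diarized speech intervals. It is a simplified cue inspired by the observation above, not the generative model of Pesarin et al. (2012); the interval representation is an assumption.

```python
def overlap_statistics(turns_a, turns_b):
    """Overlapping-speech statistics for two speakers.

    turns_a, turns_b: lists of (start, end) speech intervals in seconds,
    e.g., produced by a diarization front-end. Returns the number of
    overlaps and their total duration; during conflictual conversations
    both quantities are expected to grow.
    """
    count, total = 0, 0.0
    for start_a, end_a in turns_a:
        for start_b, end_b in turns_b:
            overlap = min(end_a, end_b) - max(start_a, start_b)
            if overlap > 0:
                count += 1
                total += overlap
    return count, total

count, total = overlap_statistics([(0.0, 3.2), (5.0, 9.5)], [(2.8, 6.1)])
print(count, round(total, 2))  # 2 overlaps totalling 1.5 s
```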
1.1.4 Space and Environment

The study of space and environment cues is tightly connected with the concept of proxemics, which can be defined as "[...] the study of man's transactions as he perceives and uses intimate, personal, social and public space in various settings [...]", quoting Hall (1966), the anthropologist who first introduced the term. In other words, proxemics investigates how people use and organize the space they share with others to communicate. This typically happens outside conscious awareness, and conveys socially relevant information such as personality traits (e.g., dominant people tend to use more space than others in shared environments (Lott and Sommer, 1967)) and attitudes (e.g., people who discuss tend to sit in front of each other, whereas people who collaborate tend to sit side-by-side (Russo, 1967)). From a social point of view, two aspects of proxemic behavior appear to be particularly important, namely interpersonal distances and the spatial arrangement of the interactants.
Interpersonal distances have been the subject of the earliest investigations on proxemics, and one of the main and seminal findings is that people tend to organize the space around them in terms of four concentric zones with decreasing degrees of intimacy: the Intimate Zone, the Casual-Personal Zone, the Socio-Consultive Zone and the Public Zone. The more intimate the relation, the less space there is between the interactants. One of the first attempts to explicitly model proxemics in a potential monitoring scenario was presented in Groh et al. (2010), where nine subjects were left free to move in a 3m × 3m area for 30 minutes. The subjects had to speak with each other about specific themes, and an analysis of mutual distances in terms of the above zones allowed the discrimination between people who did interact and people who did not.
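At its core, such a distance-based analysis reduces to mapping mutual distances onto Hall's concentric zones, as in the minimal sketch below. The zone radii are the values commonly quoted in the proxemics literature; they vary across cultures and should be read as assumptions.

```python
# Hall's four concentric zones, with commonly quoted outer radii in meters.
HALL_ZONES = [
    (0.45, "intimate"),
    (1.2, "casual-personal"),
    (3.6, "socio-consultive"),
    (float("inf"), "public"),
]

def hall_zone(distance):
    """Map an interpersonal distance (meters) to Hall's proxemic zone."""
    for outer_radius, label in HALL_ZONES:
        if distance <= outer_radius:
            return label

print(hall_zone(0.8))  # 'casual-personal'
```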
Figure 1.1 F-formations: a) in orange, a graphical depiction of the most important part of an F-formation, the o-space; b) a poster session at a conference, where different group formations are visible; c) circular F-formation; d) a typical surveillance setting, with the camera located 2-2.5 meters above the floor: detecting groups here is challenging; e) components of an F-formation (o-space, p-space, r-space), sketched for a face-to-face F-formation; f) L-shaped F-formation; g) side-by-side F-formation; h) circular F-formation.
Also crucial is the spatial arrangement during social interactions. It addresses two main issues: the first is to give all the people involved the possibility of interacting; the second is to separate the group of interactants from other individuals (if any). One formalization is that of the so-called F-formations, stable patterns that people tend to form during social interactions (in particular, standing conversations): "an F-formation arises whenever two or more people sustain a spatial and orientational relationship in which the space between them is one to which they have equal, direct, and exclusive access" (Kendon, 1990). See Fig. 1.1 a)-d) for some examples of F-formations. The most important part of an F-formation is the o-space (see Fig. 1.1), a convex empty space surrounded by the people involved in a social interaction, in which every participant looks inward, and to which no external people are admitted. The p-space is a narrow stripe surrounding the o-space that contains the bodies of the talking people, while the r-space is the area beyond the p-space.
The use of space appears to be the behavioral cue best suited to the surveillance field: people detection and tracking provide information about the layout of the people in the space, that is, how they use the space. Therefore, post-processing this information with social models that exploit proxemics appears to be very convenient. The recent literature confirms this claim: many surveillance approaches presented in top-tier computer vision conferences try to include the social facet in their workflow. In particular, two applications have emerged in recent years: the tracking of moving people or groups, and the detection of standing conversational groups.
In tracking, the keystone methodology for "socially" modeling moving people is the social force model (SFM) of Helbing and Molnár (1995), which applies a gas-kinetic analogy to the dynamics of pedestrians. It is a physical model for simulating interactions while pedestrians are moving, assuming that they react to energy potentials caused by other pedestrians and static obstacles through repulsive or attractive forces, while trying to keep a desired speed and motion direction (a simplified sketch of such an update is given below). This model can be thought of as explaining group formations and obstacle avoidance strategies, i.e., basic and generic forms of human interaction. Pellegrini et al. (2009) and Scovanner and Tappen (2009) have modified the SFM by embedding it within a tracking framework, substituting the actual position of the pedestrian in the SFM with a prediction of the location made by a constant velocity model, which is then revised considering the repulsive effects due to pedestrians or static obstacles. No attractive factors are mentioned in those papers.
Park and Trivedi (2007) present a versatile synergistic framework for the analysis of multi-person interactions and activities in heterogeneous situations. They design an adaptive context switching mechanism to mediate between two stages, one where the body of an individual can be segmented into parts, and the other where persons are treated as simple points. They also extend the notion of personal space (the region surrounding each person that is considered personal domain or territory) to that of spatio-temporal personal space, which takes into account the motion of each person, modifying the geometry of the personal space into a sort of cone. This cone is narrowed down proportionally to the motion of the subject, so that the faster the subject, the narrower the area. An interaction is then defined as an intersection of such volumes. Zen et al. (2010) use mutual distances to infer the personality traits of people left free to move in a room. The results show that it is possible to predict Extraversion and Neuroticism ratings based on the velocity and number of intimate/personal/social contacts (in the sense of Hall) between pairs of individuals looking at one another.
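The sketch announced above: one Euler step of a simplified social force update, with a driving term that relaxes the current velocity toward the desired one and an exponential repulsion from nearby pedestrians and obstacles. The parameter values are illustrative assumptions, not those calibrated by Helbing and Molnár (1995), and attractive forces are omitted.

```python
import numpy as np

def social_force_step(pos, vel, others, goal_dir, dt=0.1,
                      v0=1.3, tau=0.5, A=2.1, B=0.3):
    """One Euler step of a simplified social force model.

    pos, vel: 2D position/velocity of a pedestrian; others: positions of
    nearby pedestrians and static obstacles; goal_dir: unit vector of the
    desired motion direction. v0 (desired speed), tau (relaxation time),
    A, B (repulsion strength and range) are illustrative values.
    """
    # Driving force: relax the current velocity toward the desired one.
    force = (v0 * goal_dir - vel) / tau
    # Repulsive forces: exponentially decaying with distance to the others.
    for other in others:
        d = pos - other
        dist = np.linalg.norm(d)
        if dist > 1e-6:
            force += A * np.exp(-dist / B) * d / dist
    vel = vel + dt * force
    return pos + dt * vel, vel

pos, vel = np.array([0.0, 0.0]), np.array([1.0, 0.0])
pos, vel = social_force_step(pos, vel, [np.array([1.0, 0.2])],
                             goal_dir=np.array([1.0, 0.0]))
```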
Concerning the tracking of groups, the recent literature can be partitioned into three categories: 1) the class of group-based techniques, where groups are treated as atomic entities without the support of individual track statistics (Lin and Liu, 2007); 2) the class of individual-based methods, where group descriptions are built by associating individuals' tracklets that have been calculated beforehand, typically with a time lag of a few seconds (Pellegrini et al., 2010; Yamaguchi et al., 2011; Qin and Shelton, 2012); and 3) the class of joint individual-group approaches, where group tracking and individual tracking are performed simultaneously (Bazzani et al., 2012; Pang et al., 2007; Mauthner et al., 2008). Essentially, the first class does not include social theories in the modeling, while the latter two classes assume that people who are close enough and proceeding in the same direction represent groups with high probability. However, this assumption is crude and fails in many situations, for example in crowded scenes.
The second application regards standing conversational groups, that is, groups of people who spontaneously decide to be in each other's immediate presence to converse with each and every member of that group (e.g., at a party, during a coffee break at the office, or at a picnic). These events have to be situated (Goffman, 1966), i.e., occurring within fixed physical boundaries: this means that people should be localized and not wandering. In this scenario, we look for focused interactions (Goffman, 1966), which occur when persons gather close together and openly cooperate to sustain a single focus of social attention; this is precisely the case where F-formations can be employed. Cristani et al. (2011b) find F-formations by exploiting a Hough voting strategy. The main characteristics are that people have to be reasonably close to each other, have to be oriented toward the o-space, and that the o-space must be empty to allow the individuals to look at each other. Another approach to F-formations is that of Hung and Krose (2011), who propose to consider an F-formation as a maximal clique in an edge-weighted graph, where each node is a person and the edges measure the affinity between pairs. Such maximal cliques were characterized by Pavan and Pelillo (2007) as dominant sets, for which a game-theoretic approach was designed to solve the clustering problem under these constraints.
Given an F-formation, other aspects of the interaction can be considered: for example, Cristani et al. (2011c) analyze the kind of social relationships between people in an F-formation under a computational perspective. In their approach, they calculate pairwise distances between the people lying in the p-space, and perform a clustering over the obtained distances, obtaining different classes; the number of classes is chosen automatically by the algorithm, following an Information Theory principle. Their main finding is that each of the classes actually represents well-defined social bonds. In addition, the approach adapts to different environmental conditions, namely, the size of the space where people can move.
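To fix ideas on the Hough voting strategy of Cristani et al. (2011b), the sketch below lets each person cast a vote at a fixed distance along their orientation and accumulates the votes on a grid; heavily voted cells become candidate o-space centres. The voting distance, cell size and vote threshold are illustrative assumptions, and the check that the candidate o-space is actually empty is omitted.

```python
import numpy as np
from collections import Counter

def o_space_votes(positions, orientations, D=0.8, cell=0.4):
    """Hough-like voting for o-space centres.

    Each person votes for a point at distance D (meters) along their body
    orientation; cells receiving at least two votes are returned as
    candidate o-space centres (x, y, number of voters).
    """
    acc = Counter()
    for (x, y), theta in zip(positions, orientations):
        vx, vy = x + D * np.cos(theta), y + D * np.sin(theta)
        acc[(round(vx / cell), round(vy / cell))] += 1
    return [(cx * cell, cy * cell, n) for (cx, cy), n in acc.items() if n >= 2]

# Two people facing each other, 1.6 m apart: both vote near (0.8, 0).
positions = [(0.0, 0.0), (1.6, 0.0)]
orientations = [0.0, np.pi]
print(o_space_votes(positions, orientations))  # one candidate with 2 votes
```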
1.2 A New Challenge for Surveillance: From the Real World to the Social Web

So far, social signal processing and surveillance have cooperated in the analysis of data coming from observations of the real world. However, surveillance has started to be active in a virtual dimension as well, that of the social web (Fuchs, 2012). Many observers claim that the Internet has been transformed in the past years from a system primarily oriented to information provision into a system oriented to communication and community building. In this scenario, criminal phenomena that affect the "ordinary" community carry over into the social web: see for example bullying and cyber-bullying (Livingstone and Brake, 2010), or stalking and cyber-stalking (Ellison et al., 2007); these behaviors are similarly dangerous and damaging both in the real world and in the social web. Other crimes are peculiar to the social web, the most prominent being identity violation, which occurs when somebody enters the social web with the identity of someone else. Essentially, there are three ways that an identity can be violated. The first is identity theft (Newman, 2006), where an impostor gains access to the personal account, mostly through Trojan horse keystroke logging programs (such as Dorkbots (Deng et al., 2012)). The second is social engineering, i.e., tricking individuals into disclosing login details or changing user passwords (Anderson, 2001). The third consists in creating a fake identity, that is, an identity which describes an invented person, or emulates another person (Harman et al., 2005).
Both for crime typologies inherited from the real world and for crimes exclusive to the social web sphere, a crucial aspect must be noted: on the Internet, traces and evidence are more abundant than in the real world. Stalking in the real world may happen when a person continuously shows up in one's proximity; detecting this with video cameras is difficult and cumbersome. On the Internet, stalking manifests itself in emails and messages, which can be kept and analyzed. The same happens with cyber-bullying, where intimidations and threats are tangible. Therefore, our claim is that when surveillance is performed on the social web, approaching it with the methods of social signal processing
may be highly promising. The study we present here, published in Cristani et al. (2012), is a first demonstration of this claim.
1.2.1 Conversationally-inspired Stylometric Features for Authorship Attribution in Instant Messaging

Authorship Attribution (AA) is the research domain aimed at automatically recognizing the author of a given text sample, based on the analysis of stylometric cues that can be split into five major groups: lexical, syntactic, structural, content-specific and idiosyncratic (Abbasi and Chen, 2008). Nowadays, one of the most important AA challenges is the identification of people involved in chat (or chat-like) conversations. The task has become important now that social media have penetrated the everyday life of many people and offer the possibility of interacting with persons who hide their identity behind nicknames or potentially fake profiles. So far, standard stylometric features have been employed to categorize the content of a chat (Orebaugh and Allnutt, 2009) or the behavior of the participants (Zhou and Zhang, 2004), but attempts at identifying chat participants are still few and preliminary. Furthermore, the similarity between spoken conversations and chat interactions has been totally neglected, despite probably being the key difference between chat data and any other type of written information.
Hence, we investigated possible technologies aimed at revealing the identity of a person involved in instant messaging activities. In practice, we simply require that the user under analysis (from now on, the probe individual) engages in a conversation for a limited number of turns, with whatever interlocutor: after that, novel hybrid cues can be extracted, providing statistical measures which can be matched against a gallery of signatures, looking for possible correspondences. Subsequently, the matches can be employed to perform user recognition.
In our work, we propose cues that take into account the conversational nature of chat interactions. Some of them fit in the taxonomy quoted above, but others require the definition of a new group of conversational features. The reason is that they are based on turn-taking, probably the most salient aspect of spoken conversations, and one that applies to chat interactions as well. In conversations, turns are intervals of time during which only one person talks. In chat interactions, a turn is a block of text written by one participant during an interval of time in which none of the other participants writes anything. As in the case of the automatic analysis
of spoken conversations, the AA features are extracted from individual turns and not from the entire conversation.

Feature Extraction

In our study, we focused on a data set of N = 77 subjects, each involved in a dyadic chat conversation with an interlocutor. The conversations can be modeled as sequences of turns, where a "turn" means a stream of symbols and words (possibly including "return" characters) typed consecutively by one subject without being interrupted by the interlocutor. The feature extraction process is applied to T consecutive turns that a subject produces during the conversation. Privacy and ethical issues limit the features we can use: we rely only on those that do not involve the content of the conversation, namely the numbers of words, characters, punctuation marks and emoticons. In standard AA approaches, these features are counted over entire conversations, obtaining a single quantity. In our case, we consider the turn as the basic analysis unit, so we extract such features for each turn, obtaining T numbers. After that, we calculate statistical descriptors over them, which can be mean values or histograms; in the latter case, since turns are usually short, we obtain histograms that are collapsed toward small numeric values. Modeling them as uniformly binned histograms over the whole range of the assumed values would produce ineffective quantizations, so we opt for exponential histograms, where small-sized bins are located toward zero, with bin sizes increasing toward higher values (see the sketch after Table 1.1). This intuition has been validated experimentally, as discussed in the following.
The introduction of the turn as a basic analysis unit allows one to introduce features that explicitly take into account the conversational nature of the data and mirror behavioral measurements typically applied in the automatic understanding of social interactions (see Vinciarelli et al. (2009a) for an extensive survey):
• Turn duration: the time spent to complete a turn (in hundredths of seconds); this feature accounts for the rhythm of the conversation, with faster exchanges typically corresponding to higher engagement.
• Writing speed (two features): the number of typed characters (or words) per second (typing rate); these two features indicate whether the duration of a turn is simply due to the amount of information typed (higher typing rates) or to cognitive load (lower typing rate), i.e., to the need of thinking about what to write.
• Number of "return" characters: since these tend to provide interlocutors with an opportunity to start a new turn, high values of this feature are likely to measure the tendency to hold the floor and prevent others from "speaking" (an indirect measure of dominance).
• Mimicry: the ratio between the number of words in the current turn and the number of words in the previous turn; this feature models the tendency of a subject to follow the conversation style of the interlocutor (at least as far as the length of the turns is concerned), and accounts for the social attitude of the subjects.
We call these features conversational features. Table 1.1 provides basic facts about the features used in our approach.

No.  Feature                        Range
1    # words                        [0, 260]
2    # emoticons                    [0, 40]
3    # emoticons per word           [0, 1]
4    # emoticons per character      [0, 0.5]
5    # exclamation marks            [0, 12]
6    # question marks               [0, 406]
7    # characters                   [0, 1318]
8    average word length            [0, 20]
9    # three points                 [0, 34]
10   # uppercase letters            [0, 94]
11   # uppercase letters / # words  [0, 290]
12   turn duration                  [0, 1800] (sec.)
13   # return chars                 [1, 20]
14   # chars per second             [0, 20] (ch./sec.)
15   # words per second             [0, 260]
16   mimicry degree                 [0, 1115]

Table 1.1 Stylometric features used in the experiments. The symbol "#" stands for "number of". Features 12-16 are the conversational features.

Features 1-13 and 16 are the exponential histograms (32 bins) collected over the T turns; features 14 and 15 are the averages estimated over the T turns. This architectural choice maximizes the AA accuracy.
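The sketch announced above shows one possible construction of the exponential histograms, with geometrically growing bin edges so that the typically short turns fall into fine-grained bins near zero. The chapter specifies 32 bins with small bins toward zero; the exact edge-placement recipe below is our assumption.

```python
import numpy as np

def exponential_histogram(values, n_bins=32, v_max=260):
    """Histogram with exponentially growing bin widths over [0, v_max].

    v_max is the upper bound of the feature's range (e.g., 260 for the
    number of words per turn, as in Table 1.1).
    """
    # Edges grow geometrically from 1 to v_max, with a leading bin for 0.
    edges = np.concatenate(([0.0], np.geomspace(1.0, v_max, n_bins)))
    hist, _ = np.histogram(values, bins=edges)
    return hist / max(hist.sum(), 1)  # normalize over the T turns

words_per_turn = [3, 5, 2, 40, 7, 1, 12]  # toy values for T = 7 turns
print(exponential_histogram(words_per_turn)[:6])
```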
Experiments

The experiments have been performed over a corpus of dyadic chat conversations collected with Skype (in Italian). The conversations are spontaneous, i.e., they were held by the subjects in their real life and not for the purpose of data collection. This ensures that the behavior of the subjects is natural and that no attempt has been made to modify the style in any sense. The number of turns per subject ranges between 60 and 100; hence, the experiments are performed over 60 turns of each person, so that any bias due to differences in the amount of available material is avoided. When possible, we pick different turn selections (maintaining their chronological order) in order to generate different AA trials. The average number of words per subject is 615. The 60 turns of each subject are split into a probe and a gallery set, each including 30 samples.
The first part of the experiments aims at assessing each feature independently, as a simple ID signature. Later on, we will see how to create more informative ID signatures. A particular feature of a single subject is selected from the probe set, and matched against the corresponding gallery features of all subjects, employing a given metric (the Bhattacharyya distance for the histograms (Duda et al., 2001), the Euclidean distance for the mean values). This happens for all the probe subjects, resulting in an N × N distance matrix. Ranking the N distances of each probe element in ascending order allows one to compute the Cumulative Match Characteristic (CMC) curve, i.e., the expectation of finding the correct match in the top n positions of the ranking. The CMC is an effective performance measure for AA approaches (Bolle et al., 2003). In particular, the value of the CMC curve at position 1 is the probability that the probe ID signature of a subject is closer to the gallery ID signature of the same subject than to any other gallery ID signature; the value of the CMC curve at position n is the probability of finding the correct match in the first n ranked positions. Given the CMC curve of each feature (obtained by averaging over all the available trials), the normalized Area Under Curve (nAUC) is calculated as a measure of accuracy. Figure 1.2 shows that the individual performance of each feature is low (less than 10% at rank 1 of the CMC curve). In addition, the first conversational feature has the seventh highest nAUC, while the other ones are in positions 10, 14, 15 and 16, respectively.
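For concreteness, here is a minimal sketch of how the CMC curve and its nAUC can be computed from the N × N distance matrix; normalizing the area as the mean of the CMC over all ranks is one common convention, which the chapter does not spell out explicitly.

```python
import numpy as np

def cmc_curve(dist):
    """CMC curve from an N x N probe-vs-gallery distance matrix.

    Entry [i, j] is the distance between probe i and gallery j; the
    correct match for probe i is gallery i.
    """
    n = dist.shape[0]
    # Rank (0-indexed) of the correct match in the sorted gallery list.
    ranks = np.array([np.where(np.argsort(dist[i]) == i)[0][0]
                      for i in range(n)])
    return np.array([(ranks < k).mean() for k in range(1, n + 1)])

def nauc(cmc):
    """Normalized area under the CMC curve, as a percentage."""
    return 100.0 * cmc.mean()

rng = np.random.default_rng(0)
toy = rng.random((77, 77)) + 3.0 * (1 - np.eye(77))  # true matches closest
curve = cmc_curve(toy)
print(curve[0], nauc(curve))  # rank-1 accuracy and nAUC
```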
Figure 1.2 CMCs of the proposed features. The numbers on the right indicate the nAUC, ranging from 72.16 (# exclamation marks) down to 60.19 (# words per second); conversational features are in bold (best viewed in colors).
The experiments above serve as the basis for applying a Forward Feature Selection (FFS) strategy (Liu and Motoda, 2008) to select the best pool of features that can compose an ID signature: at the first iteration the FFS retains the feature with the highest nAUC; at the second, it selects the feature that, in combination with the previous one, gives the highest nAUC; and so on, until all features have been processed. Combining features means averaging their related distance matrices into a composite one. The pool of selected features is the one which gives the highest nAUC. Since FFS is a greedy strategy, different runs (50) of the feature selection are performed, selecting a partially different pool of 30 turns each time for building the probe set. In this way, 50 different ranked subsets of features are obtained. To distill a single subset, the Kuncheva stability index (Kuncheva, 2007) is adopted, which essentially keeps the most informative features (those with high ranking in the FFS) that occurred most often.
The FFS process results in 12 features, ranked according to their contribution to the overall CMC curve. The set includes features 5, 2, 9, 10, 12 (turn duration), 13 (# "return" characters), 8, 14 (chars per second), 6, 7, 16 (mimicry degree) and 15 (words per second). The conversational features (highlighted in parentheses) appear to rank higher than when used individually. This suggests that, even if their individual nAUC was relatively low, they encode information complementary to that of the traditional AA features.
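A minimal sketch of this greedy selection follows; combining features amounts to averaging their distance matrices, as stated above, and the score function can be, e.g., the nAUC of the CMC curve from the previous sketch. The function and variable names are ours.

```python
import numpy as np

def forward_feature_selection(dist_mats, score):
    """Greedy FFS over per-feature N x N distance matrices.

    dist_mats: one distance matrix per feature; score: a function
    evaluating a combined matrix (e.g., lambda m: nauc(cmc_curve(m))).
    Returns (feature, score) pairs in selection order; the retained pool
    is the prefix achieving the highest score.
    """
    def combined_score(pool):
        return score(np.mean([dist_mats[j] for j in pool], axis=0))

    remaining = list(range(len(dist_mats)))
    selected, history = [], []
    while remaining:
        best = max(remaining, key=lambda f: combined_score(selected + [f]))
        remaining.remove(best)
        selected.append(best)
        history.append((best, combined_score(selected)))
    return history
```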
Figure 1.3 Comparison among different pools of features: a) selected features (our approach, nAUC 89.52); b) all features (88.67); c) all features with linear histograms (87.94); d) without conversational features (87.57); e) only conversational features (72.30); f) classical feature calculation (64.74).
The final CMC curve, obtained using the pool of selected features, is reported in Figure 1.3, curve (a). In this case, the rank-1 accuracy is 29.2%. As a comparison, other CMC curves are reported, considering: (b) the whole pool of features (without feature selection); (c) the same as (b), but adopting linear histograms instead of exponential ones; (d) the selected features with exponential histograms, without the conversational ones; (e) the conversational features alone; and (f) the selected features, calculating the mean statistics over the whole 30 turns, as usually done in the literature with stylometric features. Several facts can be inferred: our approach has the highest nAUC; feature selection improves the performance; exponential histograms work better than linear ones; conversational features increase the matching probability by around 10% in the first 10 ranks; and conversational features alone give higher performance than standard stylometric features calculated over the whole set of turns rather than over each one of them.
The last experiment shows how the AA system behaves when diminishing the number of turns employed for creating the probe and gallery signatures. The results (averaged over 50 runs) are shown in Table 1.2. Increasing the number of turns increases the nAUC score, even if the increase appears to flatten out around 30 turns.
# Turns        5      10     15     20     25     30
nAUC           68.6   76.6   80.6   85.0   88.4   89.5
rank-1 acc.    7.1    14.0   15.1   21.9   30.6   29.2

Table 1.2 Relationship between performance and number of turns used to extract the ID signatures.
1.3 Conclusions

The technical quality of the classical modules that compose a surveillance system nowadays allows very complex scenarios to be faced. The goal of this review is to support the argument that a social perspective is fundamental for dealing with the highest-level module, i.e., the analysis of human activities, in a principled and fruitful way. We discussed how the use of social signals may be valuable toward a robust encoding of social events that otherwise could not be captured. In addition, we reported a study where social signal processing is applied to a recent kind of surveillance, that is, the surveillance of the Social Web. We claim that this declination of surveillance can reach data not available in the real world, for example conversations; as a consequence, instruments of social signal processing for conversational analysis, so far scarcely employed in surveillance, may be applied in this context. In this chapter we showed how it is possible to recognize the identity of a person by examining the way she chats. Future work in this novel direction may aim at recognizing conflicts during a chat or, more generally, at categorizing the type of conversation taking place, in a real-time fashion.
References

Abbasi, A., and Chen, H. 2008. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems, 26(2), 1–29.
Aggarwal, J.K., and Cai, Q. 1999. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3), 428–440.
Aggarwal, J.K., and Ryoo, M.S. 2011. Human activity analysis: A review. ACM Comput. Surv., 43, 1–43.
Ambady, N., and Rosenthal, R. 1992. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin, 111(2), 256–274.
Anderson, Ross J. 2001. Security Engineering: A Guide to Building Dependable Distributed Systems. John Wiley & Sons, Inc.
Andriluka, M., Roth, S., and Schiele, B. 2009. Pictorial structures revisited: People detection and articulated pose estimation. Pages 1014–1021 of: CVPR.
Ba, S.O., and Odobez, J.M. 2006. A Study on Visual Focus of Attention Recognition from Head Pose in a Meeting Room. Pages 75–87 of: MLMI.
Bazzani, L., Cristani, M., Tosato, D., Farenzena, M., Paggetti, G., Menegaz, G., and Murino, V. 2011. PRAI*HBA special issue: Social Interactions by Visual Focus of Attention in a Three-Dimensional Environment. Expert Systems, The Journal of Knowledge Engineering. In print.
Bazzani, L., Murino, V., and Cristani, M. 2012 (June). Decentralized Particle Filter for Joint Individual-Group Tracking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Benfold, B., and Reid, I. 2009 (September). Guiding Visual Surveillance by Tracking Human Attention. In: Proceedings of the 20th British Machine Vision Conference.
Bolle, R., Connell, J., Pankanti, S., Ratha, N., and Senior, A. 2003. Guide to Biometrics. Springer Verlag.
Borges, P.V.K., Conci, N., and Cavallaro, A. 2013. Video-Based Human Behavior Understanding: A Survey. Circuits and Systems for Video Technology, IEEE Transactions on, 23(11), 1993–2008.
Buxton, H. 2003. Learning and understanding dynamic scene activity: a review. Image Vision Comput., 21(1), 125–136.
Cassell, Justine. 1998. A framework for gesture generation and interpretation. Computer Vision in Human-Machine Interaction, 191–215.
Cedras, C., and Shah, M. 1995. Motion-based recognition: A survey. Image and Vision Computing, 13(2), 129–155.
Chen, Cheng, and Odobez, J. 2012. We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video. Pages 1544–1551 of: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.
Cristani, M., Murino, V., and Vinciarelli, A. 2010. Socially intelligent surveillance and monitoring: Analysing social dimensions of physical space. Pages 51–58 of: 1st Intl. Workshop on Socially Intelligent Surveillance and Monitoring (SISM 2010).
Cristani, M., Pesarin, A., Vinciarelli, A., Crocco, M., and Murino, V. 2011a. Look at Who's Talking: Voice Activity Detection by Automated Gesture Analysis. In: Proceedings of the Workshop on Interactive Human Behavior Analysis in Open or Public Spaces (InterHub 2011).
Cristani, M., Bazzani, L., Paggetti, G., Fossati, A., Del Bue, A., Tosato, D., Menegaz, G., and Murino, V. 2011b. Social interaction discovery by statistical analysis of F-formations. In: Proceedings of the British Machine Vision Conference.
Cristani, M., Paggetti, G., Vinciarelli, A., Bazzani, L., Menegaz, G., and Murino, V. 2011c. Towards Computational Proxemics: Inferring Social Relations from Interpersonal Distances. In: Proceedings of the Third IEEE International Conference on Social Computing (SocialCom 2011).
Cristani, Marco, Roffo, Giorgio, Segalin, Cristina, Bazzani, Loris, Vinciarelli, Alessandro, and Murino, Vittorio. 2012. Conversationally-inspired stylometric features for authorship attribution in instant messaging. Pages 1121–1124 of: Proceedings of the 20th ACM International Conference on Multimedia. MM '12. New York, NY, USA: ACM.
Cristani, Marco, Raghavendra, R., Del Bue, Alessio, and Murino, Vittorio. 2013. Human behavior analysis in video surveillance: A Social Signal Processing perspective. Neurocomputing, 100(Jan.), 86–97.
Curhan, J.R., and Pentland, A. 2007. Thin Slices of Negotiation: Predicting Outcomes from Conversational Dynamics Within the First Five Minutes. Journal of Applied Psychology, 92(3), 802–811.
Dee, H.M., and Velastin, S.A. 2008. How close are we to solving the problem of automated visual surveillance? Machine Vision and Applications, 19(2), 329–343.
Deng, Zhui, Xu, Dongyan, Zhang, Xiangyu, and Jiang, Xuxian. 2012. IntroLib: Efficient and Transparent Library Call Introspection for Malware Forensics. Pages 13–23 of: DFRWS.
Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern Classification. John Wiley and Sons.
Ellison, Nicole B., Steinfield, Charles, and Lampe, Cliff. 2007. The benefits of Facebook "friends": Social capital and college students' use of online social network sites. Journal of Computer-Mediated Communication, 12(4), 1143–1168.
Fuchs, Christian. 2012. Internet and Surveillance: The Challenges of Web 2.0 and Social Media. Routledge.
Gavrila, D.M. 1999. The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1), 82–98.
Goffman, Erving. 1966. Behavior in Public Places: Notes on the Social Organization of Gatherings. Free Press.
Groh, Georg, Lehmann, Alexander, Reimers, Jonas, Friess, Marc Rene, and Schwarz, Loren. 2010. Detecting Social Situations from Interaction Geometry. Pages 1–8 of: Proceedings of the 2010 IEEE Second International Conference on Social Computing. SOCIALCOM '10. IEEE Computer Society.
Hall, E.T. 1966. The Hidden Dimension. Doubleday.
Harman, Jeffrey P., Hansen, Catherine E., Cochran, Margaret E., and Lindsey, Cynthia R. 2005. Liar, Liar: Internet Faking but Not Frequency of Use Affects Social Skills, Self-Esteem, Social Anxiety, and Aggression. CyberPsychology & Behavior, 8(1), 1–6.
Helbing, D., and Molnár, P. 1995. Social force model for pedestrian dynamics. Physical Review E, 51(5), 4282–4287.
Hu, W., Tan, T., Wang, L., and Maybank, S. 2004. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man and Cybernetics, 34, 334–352.
Hung, H., and Krose, B. 2011. Detecting F-formations as Dominant Sets. In: Proceedings of the International Conference on Multimodal Interaction.
Hung, Hayley, Huang, Yan, Yeo, Chuohao, and Gatica-Perez, Daniel. 2008. Associating Audio-Visual Activity Cues in a Dominance Estimation Framework. In: First IEEE Workshop on CVPR for Human Communicative Behavior Analysis.
Kendon, A. 1990. Conducting Interaction: Patterns of Behavior in Focused Encounters. Cambridge University Press.
Kuncheva, L.I. 2007. A stability index for feature selection. Pages 390–395 of: IASTED International Multi-Conference Artificial Intelligence and Applications.
Laptev, Ivan. 2005. On space-time interest points. International Journal of Computer Vision, 64(2-3), 107–123.
Li, Yin, Fathi, Alireza, and Rehg, James M. 2013. Learning to Predict Gaze in Egocentric Video. In: IEEE 14th International Conference on Computer Vision (ICCV 2013).
Lin, Wen-Chieh, and Liu, Yanxi. 2007. A Lattice-Based MRF Model for Dynamic Near-Regular Texture Tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(5), 777–792.
Liu, Huan, and Motoda, Hiroshi. 2008. Computational Methods of Feature Selection. Chapman & Hall.
Liu, X., Krahnstoever, N., Ting, Y., and Tu, P. 2007. What are customers looking at? Pages 405–410 of: Advanced Video and Signal Based Surveillance.
Livingstone, Sonia, and Brake, David R. 2010. On the rapid rise of social networking sites: New findings and policy implications. Children & Society, 24(1), 75–83.
Lott, D.F., and Sommer, R. 1967. Seating arrangements and status. Journal of Personality and Social Psychology, 7(1), 90–95.
Mauthner, T., Donoser, M., and Bischof, H. 2008. Robust tracking of spatial related components. Pages 1–4 of: International Conference on Pattern Recognition (ICPR).
Moeslund, T.B., Hilton, A., and Krüger, V. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2), 90–126.
Newman, Robert C. 2006. Cybercrime, identity theft, and fraud: practicing safe internet - network security threats and vulnerabilities. Pages 68–78 of: InfoSecCD.
Oberschall, A. 1978. Theories of Social Conflict. Annual Review of Sociology, 4, 291–315.
Oikonomopoulos, Antonios, Patras, Ioannis, and Pantic, Maja. 2011. Spatiotemporal localization and categorization of human actions in unsegmented image sequences. Image Processing, IEEE Transactions on, 20(4), 1126–1140.
Orebaugh, Angela, and Allnutt, Jeremy. 2009. Classification of Instant Messaging Communications for Forensics Analysis. Social Networks, 22–28.
Panero, J., and Zelnik, M. 1979. Human Dimension and Interior Space: A Source Book of Design.
Pang, Sze Kim, Li, Jack, and Godsill, Simon. 2007. Models and Algorithms for Detection and Tracking of Coordinated Groups. In: Symposium on Image and Signal Processing and Analysis.
Park, S., and Trivedi, M.M. 2007. Multi-person interaction and activity analysis: a synergistic track- and body-level analysis framework. Mach. Vision Appl., 18, 151–166.
Pavan, M., and Pelillo, M. 2007. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Pellegrini, S., Ess, A., Schindler, K., and Van Gool, L. 2009. You'll never walk alone: modeling social behavior for multi-target tracking. Pages 261–268 of: Proc. 12th International Conference on Computer Vision, Kyoto, Japan.
Pellegrini, Stefano, Ess, Andreas, and Van Gool, Luc. 2010. Improving Data Association by Joint Modeling of Pedestrian Trajectories and Groupings. Pages 452–465 of: European Conference on Computer Vision (ECCV).
Pentland, Alex. 2007. Social Signal Processing. IEEE Signal Processing Magazine, 24(4), 108–111.
Pesarin, A., Cristani, M., Murino, V., and Vinciarelli, A. 2012. Conversation Analysis at Work: Detection of Conflict in Competitive Discussions through Semi-Automatic Turn-Organization Analysis. Cognitive Processing.
Pianesi, F., Mana, N., Cappelletti, A., Lepri, B., and Zancanaro, M. 2008. Multimodal recognition of personality traits in social interactions. Pages 53–60 of: Proceedings of the International Conference on Multimodal Interfaces.
Popa, Mirela, Kemal Koc, Alper, Rothkrantz, Leon J.M., Shan, Caifeng, and Wiggers, Pascal. 2012. Kinect Sensing of Shopping Related Actions. Pages 91–100 of: Wichert, Reiner, Laerhoven, Kristof, and Gelissen, Jean (eds), Constructing Ambient Intelligence. Communications in Computer and Information Science, vol. 277. Springer Berlin Heidelberg.
Qin, Zhen, and Shelton, Christian R. 2012. Improving Multi-target Tracking via Social Grouping. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Rajagopalan, Shyam Sundar, Dhall, Abhinav, and Goecke, Roland. 2013. Self-Stimulatory Behaviours in the Wild for Autism Diagnosis. In: IEEE Workshop on Decoding Subtle Cues from Social Interactions (associated with ICCV 2013).
Richmond, V., and McCroskey, J. 1995. Nonverbal Behaviors in Interpersonal Relations. Allyn and Bacon.
Robertson, N.M., and Reid, I.D. 2011. Automatic Reasoning about Causal Events in Surveillance Video. EURASIP Journal on Image and Video Processing, 2011.
Russo, N. 1967. Connotation of seating arrangements. The Cornell Journal of Social Relations, 2(1), 37–44.
Salamin, H., Favre, S., and Vinciarelli, A. 2009. Automatic Role Recognition in Multiparty Recordings: Using Social Affiliation Networks for Feature Extraction. IEEE Transactions on Multimedia, 11(7), 1373–1380.
Schegloff, E. 2000. Overlapping Talk and the Organisation of Turn-taking for Conversation. Language in Society, 29(1), 1–63.
Scovanner, P., and Tappen, M.F. 2009. Learning pedestrian dynamics from the real world. Pages 381–388 of: Proc. International Conference on Computer Vision.
Smith, K., Ba, S., Odobez, J., and Gatica-Perez, D. 2008. Tracking the visual focus of attention for a varying number of wandering people. IEEE PAMI, 30(7), 1–18.
Stiefelhagen, R., Finke, M., Yang, J., and Waibel, A. 1999. From gaze to focus of attention. Pages 761–768 of: VISUAL '99.
Stiefelhagen, R., Yang, J., and Waibel, A. 2002. Modeling focus of attention for meeting indexing based on multiple cues. IEEE Transactions on Neural Networks, 13, 928–938.
Tajfel, H. 1982. Social Psychology of Intergroup Relations. Annual Review of Psychology, 33, 1–39.
Tosato, Diego, Spera, Mauro, Cristani, Marco, and Murino, Vittorio. 2013. Characterizing Humans on Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2–15.
Turaga, P., Chellappa, R., Subrahmanian, V.S., and Udrea, O. 2008. Machine Recognition of Human Activities: A Survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11), 1473–1488.
Vinciarelli, A., Pantic, M., and Bourlard, H. 2009a. Social Signal Processing: Survey of an emerging domain. Image and Vision Computing Journal, 27(12), 1743–1759.
Vinciarelli, Alessandro, Pantic, Maja, and Bourlard, Hervé. 2009b. Social signal processing: Survey of an emerging domain. Image and Vision Computing, 27(12), 1743–1759.
Yamaguchi, K., Berg, A.C., Ortiz, L.E., and Berg, T.L. 2011. Who are you with and Where are you going? In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Yang, Yi, and Ramanan, Deva. 2011. Articulated pose estimation with flexible mixtures-of-parts. Pages 1385–1392 of: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.
Zen, G., Lepri, B., Ricci, E., and Lanz, O. 2010. Space speaks: towards socially and personality aware visual surveillance. Pages 37–42 of: Proceedings of the 1st ACM International Workshop on Multimodal Pervasive Video Analysis. MPVA '10. New York, NY, USA: ACM.
Zhou, L., and Zhang, D. 2004. Can Online Behavior Unveil Deceivers? An Exploratory Investigation of Deception in Instant Messaging. In: Proceedings of the Hawaii International Conference on System Sciences.